# Lesson 5: Merging DataFrames with pandas

[View this lesson on DataCamp](https://learn.datacamp.com/courses/merging-dataframes-with-pandas)

## Chapter 1: Preparing Data

In [1]:
import pandas as pd

To read a single file into a DataFrame, use the following format:

`variable = pd.read_csv('filename.csv')`

For example, this reads a file named `fav_colour.csv` and saves it as a pandas DataFrame called `colour_dat`:

In [5]:
colour_dat = pd.read_csv('fav_colour.csv')

We can see the resulting DataFrame by executing the name of the variable:

In [6]:
colour_dat

| |Participant num|Fav Colour|
|---|---|---|
|0|1|blue|
|1|2|red|
|2|3|green|
|3|4|purple|
|4|5|red|
|5|6|green|
|6|7|orange|
|7|8|yellow|
|8|9|yellow|
|9|10|pink|
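
To try `pd.read_csv()` without having `fav_colour.csv` on disk, here is a minimal sketch that reads the same kind of data from an in-memory string. The `csv_text` contents and the use of `io.StringIO` are assumptions for illustration only; with a real file you would pass the file name, as shown above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of fav_colour.csv (first three rows only)
csv_text = """Participant num,Fav Colour
1,blue
2,red
3,green
"""

# read_csv accepts a file path or any file-like object
colour_dat = pd.read_csv(io.StringIO(csv_text))

# Inspect the shape (rows, columns) and the column names
print(colour_dat.shape)
print(list(colour_dat.columns))
```

Checking `.shape` and `.columns` right after reading is a quick way to confirm that the header row was picked up as expected.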


### Use a `for` loop to read in multiple files:

In many data science cases, we have data in multiple files. For example, if we run an experiment with human participants, we might have a data file from each participant, but for group analysis we'll want to combine them all into one big DataFrame. This is a great use case for looping, since we're doing the same thing (reading a file into a DataFrame) multiple times. 

First, create a list of the files. Later we will loop through this list.

In [8]:
filenames = ['s1.csv', 's2.csv', 's3.csv']

Next, we create an empty list to append the files to. Note that what we're doing here doesn't actually create one big DataFrame (we'll see how to do that in the next chapter). For now, we're creating a list in which each list item is the DataFrame from one of the files we read. We name the list in a way that makes it obvious that it's a list of DataFrames.

In [9]:
df_list = []

Finally, use a `for` loop to read the files in. The loop cycles through the items in the `filenames` list; each time through, `filename` holds the current file name, and we use the list's `append()` method to add the DataFrame read from that file to `df_list`.

In [10]:
for filename in filenames:
    df_list.append(pd.read_csv(filename))

When we view the contents of the list, we see each dataset with its two columns (with headers saying what they are), and commas separating the list entries, as is normal for a Python list.

In [12]:
df_list

[   trial        RT
 0      1  0.508971
 1      2  0.389858
 2      3  0.404175
 3      4  0.269520
 4      5  0.437765
 5      6  0.368142
 6      7  0.400544
 7      8  0.335198
 8      9  0.341722
 9     10  0.439583,
    trial        RT
 0      1  0.433094
 1      2  0.392526
 2      3  0.396831
 3      4  0.417988
 4      5  0.371810
 5      6  0.659228
 6      7  0.411051
 7      8  0.409580
 8      9  0.486828
 9     10  0.468912,
    trial        RT
 0      1  0.322099
 1      2  0.396106
 2      3  0.384297
 3      4  0.364524
 4      5  0.454075
 5      6  0.494156
 6      7  0.492787
 7      8  0.506836
 8      9  0.340722
 9     10  0.704491]

The output above isn't nicely formatted as a pandas DataFrame, because Python is displaying the DataFrames as list entries. But if we ask to see a single list entry, things look prettier:

In [13]:
df_list[0]

| |trial|RT|
|---|---|---|
|0|1|0.508971|
|1|2|0.389858|
|2|3|0.404175|
|3|4|0.26952|
|4|5|0.437765|
|5|6|0.368142|
|6|7|0.400544|
|7|8|0.335198|
|8|9|0.341722|
|9|10|0.439583|
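
The whole read-multiple-files workflow can be run end to end without the original data. The sketch below first writes three small CSV files to a temporary folder, then reads them back with the same loop as above. The temporary folder, the `demo` DataFrame, and its values are made up for illustration; only the file names and the `trial` and `RT` column names come from the lesson.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a temporary folder holding three small CSV files (made-up values)
tmp_dir = Path(tempfile.mkdtemp())
filenames = ['s1.csv', 's2.csv', 's3.csv']
for i, filename in enumerate(filenames, start=1):
    demo = pd.DataFrame({'trial': [1, 2], 'RT': [0.1 * i, 0.2 * i]})
    demo.to_csv(tmp_dir / filename, index=False)

# Read each file and collect the resulting DataFrames in a list
df_list = []
for filename in filenames:
    df_list.append(pd.read_csv(tmp_dir / filename))

print(len(df_list))                  # number of DataFrames in the list
print(df_list[0].columns.tolist())   # column names of the first DataFrame
```

Each entry of `df_list` is an ordinary DataFrame, so `df_list[0]`, `df_list[1]`, and so on can be inspected or processed individually.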


---
## Summary 

DataFrame methods covered in the lesson:

|Method                    |Function               |
|--------------------------|-----------------------|
|`.sort_index()`           |sort index - default is by ascending order, specifying `ascending=False` reverses order|
|`.sort_values()`          |sort values (for example in a column); default is ascending (lowest to highest) but can be changed with the `ascending` argument|
|[`.reindex()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html)              |conforms a DataFrame to a new index, inserting NaN rows for labels that were not present. Useful to reorder rows manually, or to restore a full sequence of index labels, e.g., if you removed some rows|
|`.ffill()`                |forward fill; replace NaN with last preceding non-null value|
|[`.str.replace('', '')`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html?highlight=str%20replace#pandas.Series.str.replace)    |replace every occurrence of the first string listed, with the second, in a pandas Series|
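|`.multiply()`             |multiplies element-wise by a scalar, Series, or DataFrame; for a Series, the `axis` argument sets whether it is matched on the index or the columns, and the values are broadcast across the other axis|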
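
As a quick refresher, the sketch below applies each method from the table to one small DataFrame. The data, the column names `RT` and `cond`, and the result variable names are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only; note the unordered index and the NaN
df = pd.DataFrame(
    {'RT': [0.5, np.nan, 0.3], 'cond': ['a_1', 'b_1', 'a_2']},
    index=[2, 0, 1],
)

by_index = df.sort_index()                         # rows ordered by index: 0, 1, 2
by_value = df.sort_values('RT', ascending=False)   # highest RT first; NaN goes last
reindexed = df.reindex([0, 1, 2, 3])               # label 3 is new, so its row is NaN
filled = df.ffill()                                # NaN takes the value from the row above it
cleaned = df['cond'].str.replace('_', '-', regex=False)  # 'a_1' becomes 'a-1'
rt_ms = df[['RT']].multiply(1000)                  # convert seconds to milliseconds

print(by_index)
print(filled)
```

None of these calls modifies `df` in place; each returns a new object, which is why the results are assigned to new names.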

---
**Pop Quiz:** What is broadcasting? 

[Click here to check your answer.](https://www.datacamp.com/community/tutorials/python-numpy-tutorial#broadcasting)