# Concatenating Data

A central problem in data science is combining data from different sources. One of the simplest ways to combine data sets is to **concatenate** them.

In this class, "concatenate" means stacking data frames on top of one another (adding rows), and "merge" means joining data frames side by side (adding columns).
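
To make the distinction concrete, here is a minimal sketch using toy data frames. The frame names (`top`, `bottom`, `sexes`) and the `Count` column are illustrative choices, not part of the course data.

```python
import pandas as pd

# Two toy data frames with the same columns.
top = pd.DataFrame({"Name": ["Emily", "Jessica"], "Count": [25731, 21043]})
bottom = pd.DataFrame({"Name": ["Ashley"], "Count": [20895]})

# Concatenate: stack rows on top of one another (axis=0 is the default).
stacked = pd.concat([top, bottom], ignore_index=True)

# Merge: add columns side by side by matching values in a key column.
sexes = pd.DataFrame({"Name": ["Emily", "Jessica", "Ashley"],
                      "Sex": ["F", "F", "F"]})
side_by_side = stacked.merge(sexes, on="Name")

print(stacked.shape)       # (3, 2)
print(side_by_side.shape)  # (3, 3)
```

Concatenating changed the number of rows; merging changed the number of columns.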

In [1]:
import pandas as pd

## Baby Names Dataset

The Social Security Administration tracks the names of all babies born in the United States each year. The data is [publicly available](https://www.ssa.gov/OACT/babynames/limits.html), and a copy of the data has been made available at `/data/names/`.

In [2]:
!ls /data/names

NationalReadMe.pdf  yob1907.txt  yob1935.txt  yob1963.txt  yob1991.txt
yob1880.txt	    yob1908.txt  yob1936.txt  yob1964.txt  yob1992.txt
yob1881.txt	    yob1909.txt  yob1937.txt  yob1965.txt  yob1993.txt
yob1882.txt	    yob1910.txt  yob1938.txt  yob1966.txt  yob1994.txt
yob1883.txt	    yob1911.txt  yob1939.txt  yob1967.txt  yob1995.txt
yob1884.txt	    yob1912.txt  yob1940.txt  yob1968.txt  yob1996.txt
yob1885.txt	    yob1913.txt  yob1941.txt  yob1969.txt  yob1997.txt
yob1886.txt	    yob1914.txt  yob1942.txt  yob1970.txt  yob1998.txt
yob1887.txt	    yob1915.txt  yob1943.txt  yob1971.txt  yob1999.txt
yob1888.txt	    yob1916.txt  yob1944.txt  yob1972.txt  yob2000.txt
yob1889.txt	    yob1917.txt  yob1945.txt  yob1973.txt  yob2001.txt
yob1890.txt	    yob1918.txt  yob1946.txt  yob1974.txt  yob2002.txt
yob1891.txt	    yob1919.txt  yob1947.txt  yob1975.txt  yob2003.txt
yob1892.txt	    yob1920.txt  yob1948.txt  yob1976.txt  yob2004.txt
yob1893.txt	    yob1921.txt  yob1949.txt  yo

The data for each year is stored in a separate file. The files are named `yob####.txt`, where `####` is the year. Notice that the data goes all the way back to 1880!

Now let's look at how the data is stored.

In [3]:
!head /data/names/yob1997.txt

Emily,F,25731
Jessica,F,21043
Ashley,F,20895
Sarah,F,20694
Hannah,F,20588
Samantha,F,20169
Taylor,F,19503
Alexis,F,17171
Elizabeth,F,15415
Madison,F,15187


Each row specifies the number of babies born that year with that name and gender.
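
Because the files have no header row, the column names have to be supplied when reading them. The sketch below shows one way to do that, using the first three lines of `yob1997.txt` held in memory so it runs without access to `/data/names/`. The column names `Name`, `Sex`, and `Count` are our own choice and are not part of the file.

```python
import io
import pandas as pd

# First three lines of yob1997.txt, as printed above.
raw = io.StringIO("Emily,F,25731\nJessica,F,21043\nAshley,F,20895\n")

# names= supplies the header; without it, pandas would treat
# the first data row as the column names.
sample = pd.read_csv(raw, names=["Name", "Sex", "Count"])

# Look up the count for one name.
print(sample.loc[sample["Name"] == "Jessica", "Count"].iloc[0])  # 21043
```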

## Exercises

**Question 1.** Read in the data for the year you were born. Specify appropriate column names. How many people were born with your name?

In [54]:
# The file has no header row, so supply the column names.
data96 = pd.read_csv("/data/names/yob1996.txt",
                     names=["Name", "Sex", "Babies"])
pd.set_option("display.max_rows", 15)  # keep printed tables short
data96[(data96["Name"] == "Nicolas") & (data96["Sex"] == "M")]

Unnamed: 0,Name,Sex,Babies
16039,Nicolas,M,2294


**Question 2.** Track the popularity of your name in the years since you were born. To do this, you'll have to read in data from multiple years and concatenate the data sets into a single data frame using `pd.concat()`. Make sure you add a column "Year" so that you can keep track of which year each row came from.

Some code to generate the filenames has been provided for you.

In [66]:
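dfs = []
for year in range(1990, 2016):  # 1990 through 2015 inclusive
    filename = "/data/names/yob%d.txt" % year
    # The files have no header row, so supply the column names.
    df = pd.read_csv(filename, names=["Name", "Gender", "Amount"])
    df["Year"] = year  # record which file each row came from
    dfs.append(df)

data = pd.concat(dfs)
data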

Unnamed: 0,Name,Gender,Amount,Year
0,Jessica,F,46470,1990
1,Ashley,F,45553,1990
2,Brittany,F,36534,1990
3,Amanda,F,34405,1990
4,Samantha,F,25865,1990
5,Sarah,F,25810,1990
6,Stephanie,F,24859,1990
...,...,...,...,...
32945,Zolton,M,5,2015
32946,Zyah,M,5,2015


**Question 3.** What do you notice about the index? What problems might this cause? How do you fix it? (_Hint_: Look at the documentation for `pd.concat`.)

In [70]:
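# .loc is an indexer, so it takes square brackets, not parentheses.
# Each file's index restarts at 0, so the label 0 matches one row
# per year (26 rows) instead of a single row.
data.loc[0]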

In [51]:
data = pd.concat(dfs, ignore_index=True)
data

Unnamed: 0,Name,Gender,Amount,Year
0,Jessica,F,46470,1990
1,Ashley,F,45553,1990
2,Brittany,F,36534,1990
3,Amanda,F,34405,1990
4,Samantha,F,25865,1990
5,Sarah,F,25810,1990
6,Stephanie,F,24859,1990
...,...,...,...,...
789179,Zolton,M,5,2015
789180,Zyah,M,5,2015
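
The index problem from Question 3 can be reproduced on a small scale. This is a sketch with two hypothetical two-row data frames (`a` and `b`) standing in for two years of data.

```python
import pandas as pd

a = pd.DataFrame({"Name": ["Jessica", "Ashley"], "Year": [1990, 1990]})
b = pd.DataFrame({"Name": ["Emily", "Jessica"], "Year": [1997, 1997]})

# By default, pd.concat keeps each frame's original index labels.
dup = pd.concat([a, b])
print(dup.index.tolist())   # [0, 1, 0, 1]
print(dup.index.is_unique)  # False
print(len(dup.loc[0]))      # 2 rows share the label 0

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
clean = pd.concat([a, b], ignore_index=True)
print(clean.index.tolist())  # [0, 1, 2, 3]
```

With a duplicated index, a label lookup such as `.loc[0]` silently returns several rows. Calling `dup.reset_index(drop=True)` after the fact gives the same result as `ignore_index=True`.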


**Question 4.** What happens if you try to concatenate two data frames with different column names?

In [59]:
# Read 1997 with "Gender" instead of "Sex" so the column names differ from data96.
data97 = pd.read_csv("/data/names/yob1997.txt",
                     names=["Name", "Gender", "Babies"])

In [60]:
pd.concat([data96,data97])

Unnamed: 0,Babies,Gender,Name,Sex
0,25150,,Emily,F
1,24192,,Jessica,F
2,23676,,Ashley,F
3,21029,,Sarah,F
4,20545,,Samantha,F
5,19151,,Taylor,F
6,18594,,Hannah,F
...,...,...,...,...
26959,5,M,Zenas,
26960,5,M,Zhaire,
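
The same behavior can be seen with two one-row data frames. In this sketch, `left` and `right` are hypothetical stand-ins for `data96` and `data97`, using one row from each part of the output above. One possible fix is to rename the mismatched column before concatenating.

```python
import pandas as pd

left = pd.DataFrame({"Name": ["Emily"], "Sex": ["F"], "Babies": [25150]})
right = pd.DataFrame({"Name": ["Zenas"], "Gender": ["M"], "Babies": [5]})

# pd.concat keeps the union of the columns and fills the gaps with NaN.
mixed = pd.concat([left, right], ignore_index=True)
print(sorted(mixed.columns))           # ['Babies', 'Gender', 'Name', 'Sex']
print(int(mixed["Sex"].isna().sum()))  # 1

# Renaming "Gender" to "Sex" first makes the columns line up.
fixed = pd.concat([left, right.rename(columns={"Gender": "Sex"})],
                  ignore_index=True)
print(int(fixed.isna().sum().sum()))   # 0
```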


## A Word on `.append()`

Older pandas code often concatenates two data frames with:

`df1.append(df2)`.

Unlike `pd.concat()`, which is a top-level function that takes a list of data frames, `.append()` is a method of a data frame.

`DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use `pd.concat([df1, df2])` instead. Even where `.append()` is still available, it is much too slow for concatenating many data frames in a loop, because each call copies all of the data accumulated so far. Collect the data frames in a list and call `pd.concat()` once.
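
A minimal sketch of the `pd.concat()` pattern for many data frames: build each frame, collect them in a plain Python list, and concatenate once at the end. The one-row frames here are hypothetical stand-ins for the yearly files, using the counts for Emily from the 1996 and 1997 outputs above.

```python
import pandas as pd

counts = {1996: 25150, 1997: 25731}  # Emily (F), from the outputs above

frames = []
for year, n in counts.items():
    df = pd.DataFrame({"Name": ["Emily"], "Sex": ["F"], "Count": [n]})
    df["Year"] = year
    frames.append(df)  # list.append: no data is copied here

# One concatenation at the end copies the data once.
combined = pd.concat(frames, ignore_index=True)
print(combined["Year"].tolist())  # [1996, 1997]
```

Appending to a list is cheap, so the total work grows linearly with the number of frames rather than quadratically.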