# Concatenating Data

A central problem in data science is combining data from different sources. One of the simplest ways to combine data sets is to **concatenate** them.

In this class, we will use "concatenate" to mean stacking data frames on top of one another (adding rows) and "merge" to mean joining data frames side by side (adding columns).
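To make the distinction concrete, here is a minimal sketch using small made-up data frames (the names `top`, `bottom`, and `ages` and their contents are invented for illustration and are not part of the baby names data). It only checks the shapes of the results, so you can see which operation adds rows and which adds columns.

```python
import pandas as pd

# Two small, made-up data frames with the same columns.
top = pd.DataFrame({"Name": ["Ann", "Bob"], "Count": [10, 20]})
bottom = pd.DataFrame({"Name": ["Cat", "Dan"], "Count": [30, 40]})

# Concatenate: stack the rows of `bottom` underneath the rows of `top`.
stacked = pd.concat([top, bottom])
print(stacked.shape)  # (4, 2): more rows, same columns

# Merge: match rows on a shared key and place the columns side by side.
ages = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [3, 5]})
side_by_side = top.merge(ages, on="Name")
print(side_by_side.shape)  # (2, 3): same rows, more columns
```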

In [None]:
import pandas as pd

## Baby Names Dataset

The Social Security Administration tracks the names of all babies born in the United States each year. The data is [publicly available](https://www.ssa.gov/OACT/babynames/limits.html), and a copy of the data has been made available at `/data/names/`.

In [None]:
!ls /data/names

The data for each year is stored in a separate file. The files are named `yob####.txt`, where `####` is the year. Notice that the data goes all the way back to 1880!

Now let's look at how the data is stored.

In [None]:
!head /data/names/yob1997.txt

Each row specifies the number of babies born that year with that name and gender.

## Exercises

**Question 1.** Read in the data for the year you were born. Specify appropriate column names. How many people were born with your name?

**Question 2.** Track the popularity of your name in the years since you were born. To do this, you'll have to read in data from multiple years and concatenate the data sets into a single data frame using `pd.concat()`. Make sure you add a column "Year" so that you can keep track of which year each row came from.

Some code to generate the filenames has been provided for you.

In [None]:
for year in range(1990, 2016):
    # Build the path to the file for this year, e.g. /data/names/yob1990.txt
    filename = "/data/names/yob%d.txt" % year

**Question 3.** What do you notice about the index? What problems might this cause? How do you fix it? (_Hint_: Look at the documentation for `pd.concat`.)

**Question 4.** What happens if you try to concatenate two data frames with different column names?

## A Word on `.append`

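Older pandas code often concatenates two data frames with:

`df1.append(df2)`

Note that `pd.concat()` is a function that takes a list of the data frames you want to concatenate, whereas `.append()` is a method called on one data frame with the other data frame as its argument.

`.append()` worked fine for two data frames, but calling it in a loop to concatenate many data frames is much too slow, because each call copies all of the rows accumulated so far into a new data frame. In addition, `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so it raises an `AttributeError` in current versions. Use `pd.concat()` instead.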
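One common way to avoid growing a data frame inside a loop is to collect the pieces in an ordinary Python list and call `pd.concat()` once at the end. The sketch below uses small made-up data frames (the values and the year range are invented stand-ins, not the real baby names files) to show the pattern.

```python
import pandas as pd

# Collect the individual data frames in a plain Python list.
frames = []
for year in range(2000, 2005):
    # Made-up stand-in for the data frame you would read for this year.
    df = pd.DataFrame({"Name": ["Ann", "Bob"],
                       "Count": [year - 1990, year - 1980]})
    frames.append(df)  # list.append (cheap), not DataFrame.append

# A single call to pd.concat() copies the rows only once.
combined = pd.concat(frames)
print(len(combined))  # 10 rows: 2 rows for each of 5 years
```

Appending to a list only stores a reference to each data frame, so the expensive row copying happens a single time inside `pd.concat()`.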