# Importing & Merging Many Files (Baby Names Dataset)

### Importing One File & Understanding the Data Structure (Easy Case)

In [1]:
import pandas as pd 

In [2]:
pd.read_csv("yob1880.txt")

Unnamed: 0,Mary,F,7065
0,Anna,F,2604
1,Emma,F,2003
2,Elizabeth,F,1939
3,Minnie,F,1746
4,Margaret,F,1578
...,...,...,...
1994,Woodie,M,5
1995,Worthy,M,5
1996,Wright,M,5
1997,York,M,5


The most popular female name was Mary, with a total count of 7065. F stands for female, and at the bottom we can see some less frequent male names such as Woodie, Worthy, and Wright. Notice that the file has no header row, so pandas used the first record (Mary, F, 7065) as the column labels. Let's read the file again with header = None and supply proper column names.

In [3]:
df = pd.read_csv("yob1880.txt", header = None, names = ["Name", "Gender", "Count"])

In [4]:
df

Unnamed: 0,Name,Gender,Count
0,Mary,F,7065
1,Anna,F,2604
2,Emma,F,2003
3,Elizabeth,F,1939
4,Minnie,F,1746
...,...,...,...
1995,Woodie,M,5
1996,Worthy,M,5
1997,Wright,M,5
1998,York,M,5
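The difference between the two `read_csv` calls can be reproduced on a small in-memory sample. This is only a sketch: the three rows are the first records shown in the output, and `io.StringIO` stands in for the real file.

```python
import io
import pandas as pd

# First three records of the 1880 file, held in memory instead of on disk.
sample = "Mary,F,7065\nAnna,F,2604\nEmma,F,2003\n"

# Default behaviour: the first record is consumed as the header row.
default = pd.read_csv(io.StringIO(sample))

# header=None keeps every record as data; names= supplies the labels.
labeled = pd.read_csv(io.StringIO(sample), header=None,
                      names=["Name", "Gender", "Count"])

print(len(default), default.columns.tolist())
print(len(labeled), labeled.columns.tolist())
```

With the default settings one record is lost to the header, which is why the first output ends at index 1997 and the second at 1998.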


### Importing & Merging Many Files

In [5]:
import pandas as pd

Let's first try to merge two files. We can use pd.concat() to concatenate two dataframes.

In [6]:
df_1880 = pd.read_csv("yob1880.txt", header = None, names = ["Name", "Gender", "Count"])
df_1880

Unnamed: 0,Name,Gender,Count
0,Mary,F,7065
1,Anna,F,2604
2,Emma,F,2003
3,Elizabeth,F,1939
4,Minnie,F,1746
...,...,...,...
1995,Woodie,M,5
1996,Worthy,M,5
1997,Wright,M,5
1998,York,M,5


In [7]:
df_1881 = pd.read_csv("yob1881.txt", header = None, names = ["Name", "Gender", "Count"])
df_1881

Unnamed: 0,Name,Gender,Count
0,Mary,F,6919
1,Anna,F,2698
2,Emma,F,2034
3,Elizabeth,F,1852
4,Margaret,F,1658
...,...,...,...
1930,Wiliam,M,5
1931,Wilton,M,5
1932,Wing,M,5
1933,Wood,M,5


We merge the two dataframes vertically. The keys argument adds a Year index level that records which file each row came from. We then drop the inner, per-file row index with droplevel(-1) and reset the index so that Year becomes a regular column. This is a manual approach; we should generalize it to work with many files.

In [9]:
pd.concat(objs = [df_1880, df_1881], axis = 0, keys = [1880, 1881],
          names = ["Year"]).droplevel(-1).reset_index()

Unnamed: 0,Year,Name,Gender,Count
0,1880,Mary,F,7065
1,1880,Anna,F,2604
2,1880,Emma,F,2003
3,1880,Elizabeth,F,1939
4,1880,Minnie,F,1746
...,...,...,...,...
3930,1881,Wiliam,M,5
3931,1881,Wilton,M,5
3932,1881,Wing,M,5
3933,1881,Wood,M,5
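To see what each step of that chain does, here is a minimal sketch on two toy dataframes. The names `a`, `b`, `stacked`, and `flat`, and the counts, are illustrative assumptions and not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for two yearly files (values are made up for illustration).
a = pd.DataFrame({"Name": ["Mary", "Anna"], "Gender": ["F", "F"], "Count": [10, 7]})
b = pd.DataFrame({"Name": ["Mary", "John"], "Gender": ["F", "M"], "Count": [9, 8]})

# keys= adds an outer index level; names= labels that level "Year".
stacked = pd.concat([a, b], axis=0, keys=[1880, 1881], names=["Year"])
print(stacked.index.nlevels)  # two levels: Year and the per-file row number

# droplevel(-1) removes the inner row numbers; reset_index() then turns
# the remaining "Year" level into an ordinary column.
flat = stacked.droplevel(-1).reset_index()
print(flat)
```

Dropping the inner level first matters: calling reset_index() directly would also turn the per-file row numbers into an extra column.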


We can use string formatting and a for loop to create a list of dataframes, one for each yearly file. Then we can concatenate all of the dataframes in a single call.

In [10]:
years = list(range(1880, 2019))
dataframes = []
for year in years:
    data = pd.read_csv("yob{}.txt".format(year), header = None, 
                       names = ["Name", "Gender", "Count"])
    dataframes.append(data)

In [11]:
df = pd.concat(dataframes, axis = 0, keys = years, names = ["Year"]).droplevel(-1).reset_index()

In [12]:
df

Unnamed: 0,Year,Name,Gender,Count
0,1880,Mary,F,7065
1,1880,Anna,F,2604
2,1880,Emma,F,2003
3,1880,Elizabeth,F,1939
4,1880,Minnie,F,1746
...,...,...,...,...
1957041,2018,Zylas,M,5
1957042,2018,Zyran,M,5
1957043,2018,Zyrie,M,5
1957044,2018,Zyron,M,5
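The same result can be reached by tagging each dataframe with its year before concatenating, which avoids the keys/droplevel step. The sketch below is one possible variant, not the approach used above: `load_years` is a hypothetical helper, and the demo writes two tiny made-up files to a temporary folder so that it runs without the real dataset.

```python
import os
import tempfile
import pandas as pd

def load_years(folder, years):
    """Read yobYYYY.txt for each year and stack the rows (hypothetical helper)."""
    frames = []
    for year in years:
        path = os.path.join(folder, "yob{}.txt".format(year))
        frame = pd.read_csv(path, header=None, names=["Name", "Gender", "Count"])
        # Record the source year as a regular column on each frame.
        frames.append(frame.assign(Year=year))
    combined = pd.concat(frames, ignore_index=True)
    # Put Year first to match the column order used in this notebook.
    return combined[["Year", "Name", "Gender", "Count"]]

# Demo on made-up files; the counts are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    toy_files = {1880: "Mary,F,10\nJohn,M,8\n", 1881: "Mary,F,9\n"}
    for year, rows in toy_files.items():
        with open(os.path.join(tmp, "yob{}.txt".format(year)), "w") as fh:
            fh.write(rows)
    demo = load_years(tmp, [1880, 1881])

print(demo)
```

Passing ignore_index=True to pd.concat() produces a fresh 0..n-1 index directly, so no reset_index() call is needed.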


The result has almost two million rows, and each row stands for one combination of year, name, and gender. Now we can export the dataframe as a CSV file and store it on our local machine.

In [13]:
df.to_csv("us_baby_names.csv", index = False)
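A quick round trip shows why index = False is useful: the row index is not written, so reading the file back yields the same four columns with no extra "Unnamed: 0" column. This is a sketch on a toy dataframe; the file name `toy_names.csv` and the values are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Toy dataframe with the same columns as the combined dataset.
toy = pd.DataFrame({"Year": [1880, 1881], "Name": ["Mary", "Anna"],
                    "Gender": ["F", "F"], "Count": [10, 7]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_names.csv")
    # index=False leaves the 0..n-1 row labels out of the file.
    toy.to_csv(path, index=False)
    # The first line of the file is now a real header, so the default read works.
    restored = pd.read_csv(path)

print(restored.columns.tolist())
```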