# Baby Names Analysis

## basic logic

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://github.com/huilincai/baby-names-analysis/blob/master/BABY%20NAMES%20ANALYSIS.png?raw=true")

## 1.Combine all of the individual state files into one dataset. 

In [2]:
### Get data
link= "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
import urllib.request
urllib.request.urlretrieve(link,'namesbystate.zip')

('namesbystate.zip', <http.client.HTTPMessage at 0x1117c4ef0>)

In [3]:
import zipfile
with zipfile.ZipFile("namesbystate.zip", 'r') as zip_ref:
    zip_ref.extractall("data")

In [4]:
### combine all individual files into one
import os
file_list =[ file for file in os.listdir('data') if file.endswith("TXT")]

with open('data/names.txt', 'w') as outfile:
    for fname in file_list:
        with open("data/" + fname) as infile:
            outfile.write(infile.read())

In [5]:
import numpy as np  
import pandas as pd  

In [6]:
### read the data as a dataframe
data = pd.read_csv('data/names.txt', sep=",", header = None)
data.columns = ['state','gender','year','name','number of babies']

In [7]:
data.head(10)

Unnamed: 0,state,gender,year,name,number of babies
0,AK,F,1910,Mary,14
1,AK,F,1910,Annie,12
2,AK,F,1910,Anna,10
3,AK,F,1910,Margaret,8
4,AK,F,1910,Helen,7
5,AK,F,1910,Elsie,6
6,AK,F,1910,Lucy,6
7,AK,F,1910,Dorothy,5
8,AK,F,1911,Mary,12
9,AK,F,1911,Margaret,7


## 2.Find the most gender-neutral names

In [8]:
### find the number of names used by males and females 
df1 = data.groupby(['name','gender']).size().reset_index()

In [9]:
df1.columns = ['name','gender','count']
df1.head()

Unnamed: 0,name,gender,count
0,Aaban,M,2
1,Aadan,M,4
2,Aadarsh,M,1
3,Aaden,M,248
4,Aadhav,M,6


In [10]:
def helper(df):
    if  df.shape[0] == 2: ### have names used for both M and F
        count = df['count'].values
        return count[0]/count[1] ### calculate the ratio of M to F
    else:
        return 0

In [11]:
ratios = df1.groupby("name").apply(lambda x:helper(x))

In [12]:
MostGenderNeutralNames = ratios[ratios == 1] ### most gender-neutral
MostGenderNeutralNames

name
Addis            1.0
Afnan            1.0
Aiman            1.0
Albie            1.0
Altair           1.0
Amandeep         1.0
Amil             1.0
Arvie            1.0
Asuncion         1.0
Bentlie          1.0
Britten          1.0
Chia             1.0
Child            1.0
Christia         1.0
Clemence         1.0
Daris            1.0
Darriel          1.0
Dayan            1.0
Decklyn          1.0
Deshone          1.0
Devonne          1.0
Dezi             1.0
Diarra           1.0
Dorsie           1.0
Dossie           1.0
Edris            1.0
Eri              1.0
Evann            1.0
Francesc         1.0
Garnell          1.0
                ... 
Navjot           1.0
Newborn          1.0
Olamide          1.0
Olie             1.0
Oluwadamilola    1.0
Oluwasemilore    1.0
Osa              1.0
Parris           1.0
Peni             1.0
Rael             1.0
Rajdeep          1.0
Rameen           1.0
Rei              1.0
Rumi             1.0
Salam            1.0
Shaune           1.0
Shine   

### Hence, the most gender-neutral names are the names shown above.

## 3. Find the “trendsetters” and "followers"

In [13]:
### record the names count as 1 if it was used in a specific state in a specific year
df2 = data.groupby(['name','year','state']).size().reset_index()

df2.head(10)

Unnamed: 0,name,year,state,0
0,Aaban,2013,NY,1
1,Aaban,2014,NY,1
2,Aadan,2008,CA,1
3,Aadan,2008,TX,1
4,Aadan,2009,CA,1
5,Aadan,2014,CA,1
6,Aadarsh,2009,IL,1
7,Aaden,2005,OH,1
8,Aaden,2007,AL,1
9,Aaden,2007,AZ,1


### Assumptions:<br/>
1. The new names we are going to analyze all appeared after 1910, because we don't want to confuse the new names that came out in 1910 with the names that came out before 1910.<br/>
2. We don't consider the name only appeared once, beacuase we don't want the new name to be just an accident rather than a new name that can lead the trend.<br/>
3. Only one state can be the trendsetter for each new name.<br/>
4. The total number of new names after the first use should be more than 1,000, since we want the names to have certain popularity.


In [14]:
def helper2(df):
    if df.year.values[0] != "1910":
        if df.shape[0] > 1: # row > 1
            if df.state.unique().size > 1 and df.year.unique().size : # state count > 1
                if df[0].values[1:].sum()>1000:
                    return df.state.values[0] 

In [15]:
### find "trendsetter" state for the names
df2Count = df2.groupby("name").apply(helper2)


In [16]:
dftmp = pd.DataFrame(df2Count).reset_index()

In [17]:
dftmp.columns = ['name','state']

In [18]:
### calculate how many times has each state been a “trendsetter”
trendsetter = dftmp.groupby("state").size()

In [19]:
trendsetter

state
AK     16
AL    455
AR     46
AZ     22
CA    287
CO      2
CT     10
DC      1
FL      8
GA     52
HI      2
IA     36
IL     85
IN     12
KS      3
KY     24
LA     28
MA     39
ME      2
MI     21
MN     11
MO      7
MS     19
NC     29
ND      2
NE      3
NH      1
NJ      7
NM      8
NY    199
OH     21
OK      7
OR      1
PA     45
SC     11
TN      8
TX     89
UT     13
VA     12
WA      3
WI      3
WV      4
dtype: int64

#### Now let us explore are there any "followers"(states that tend to adopt names after they gain popularity in other states)?

In [20]:
def helper3(df):
    if df.year.values[0] != "1910":
        if df.shape[0] > 1: # row > 1
            if df.state.unique().size > 1 and df.year.unique().size : # state count > 1
                if df[0].values[1:].sum()>1000:
                    return df.state.values[1:] #find the followers
    

In [21]:
follower = df2.groupby("name").apply(helper3).dropna()
follower

name
Aaliyah      [NJ, OH, PA, NJ, NY, PA, NJ, NY, PA, NJ, NY, P...
Aaron        [AR, GA, IL, MO, MS, NC, NY, PA, TN, TX, VA, W...
Abbey        [NY, NY, NY, NY, NJ, NY, NY, PA, NY, NY, PA, C...
Abbie        [MA, MS, TX, GA, MS, OK, TX, AL, AR, MO, MS, O...
Abby         [NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, N...
Abel         [TX, LA, TX, TX, AZ, TX, CO, LA, MA, NM, TX, A...
Abigail      [NY, NY, NY, TX, CA, NY, TX, PA, CA, HI, HI, M...
Abraham      [GA, MA, NY, PA, SC, CT, FL, GA, IL, MA, MS, N...
Ada          [AR, CA, CO, CT, FL, GA, IA, IL, IN, KS, KY, L...
Adam         [AR, FL, GA, IL, LA, MI, MS, NC, NJ, NY, OH, P...
Addie        [AR, FL, GA, KY, LA, MO, MS, NC, OK, SC, TN, T...
Addison      [TX, NY, PA, TX, NC, NY, PA, GA, NY, PA, VA, K...
Adelaide     [CT, GA, IA, IL, MA, MI, MN, NC, NJ, NY, OH, P...
Adele        [CA, GA, IL, LA, MA, MD, MO, NC, NJ, NY, PA, T...
Adeline      [CT, GA, IA, IL, LA, MA, MI, MN, MO, MS, MT, N...
Adrian       [TX, NJ, NY, OH, PA, GA, IA, IL, KS, 

### conclusions: <br/>

1.Certain states are “trendsetters”.<br/>
2.There are states that tend to adopt names after they gain popularity in other states.<br/>

### limitations:<br/>
1.It is possible that we missed some trendsetters since there might be names which first appeared in 1910, but we chose to ignore it since we don't want to confuse them with the names that came out before 1910.<br/>
2.We ignored those names which were first used by more than one state, since we think there should be only one trendsetter for each name. And our data limit our ability to decide which state should be only trendsetter for each name under this circumstance.

## 4.Using external data

### Motivation:<br/>
How did the naming style change over time?<br/>

### External Data:<br/>
The external dataset I need basically has two variables: name, category. And this dataset can divide the names into several categories, for example, biblical, modern, traditional and other. After merging the data, the final dataset will have six variables: name, gender, state, year, number of babies, category.<br/>

### Approach:<br/>
1.To calculate the total number of babies for each category for each year. <br/>
2.To make a trend graph to see how these four categories change over time. There should be four lines in the graph which means different categories with four different colors. The x axis is the year and the y axis is the number of babies.
