# Matthew Naeher

## Introduction

In this report, we will be visualizing baby name data from the US over the past century. The data for this project has been provided by the United States Social Security Administration. With this data, Pandas dataframes will be used to organize name counts by state, year, and sex. Using the dataframe, the data will then be visualized using Plotly Express. The two types of plots used are line graphs and choropleth maps. The line graphs will illustrate a change in name frequency or diversity over time, while the choropleth maps will allow this data to be visualized on a state-by-state basis.

Using this dataset, we will explore three topics:
1. Popularity of the name "Matthew" over time
2. Name diversity over time
3. The top name in each state per year

## Preliminaries 

The code below imports necessary packages (e.g., Plotly Express, Pandas, and glob) so the dataset can be downloaded and unzipped.

In [1]:
import plotly.express as px

In [None]:
from glob import glob
from zipfile import ZipFile

import pandas as pd
import requests


# Unzipping namesbystate.zip the way the course website suggested produced a gpgz file that I could not open.
# As a workaround, the archive is unzipped with a shell command, which places the individual text files in the working directory.
# I apologize for the inconvenience; this was the only way I found to get the code working.
!unzip namesbystate.zip 

Archive:  namesbystate.zip
replace AK.TXT? [y]es, [n]o, [A]ll, [N]one, [r]ename: 
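
As an alternative to the shell command, the archive could probably be extracted with the `ZipFile` class that is already imported. The sketch below is a minimal illustration, not the method used in this report: the helper `extract_all`, the file `demo_names.zip`, and its two rows are made up so the example runs without the real download.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile


def extract_all(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the sorted member names."""
    with ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())


# Build a tiny stand-in archive (hypothetical contents) in a temporary directory.
work_dir = Path(tempfile.mkdtemp())
demo_zip = work_dir / "demo_names.zip"
with ZipFile(demo_zip, "w") as archive:
    archive.writestr("AK.TXT", "AK,F,1910,Mary,14\n")
    archive.writestr("AL.TXT", "AL,M,1910,John,120\n")

# Extract it and list the resulting text files, as glob('*.TXT') does in the notebook.
extracted = extract_all(demo_zip, work_dir / "data")
txt_files = sorted(p.name for p in (work_dir / "data").glob("*.TXT"))
```

Unlike the interactive `unzip` prompt shown above, `extractall` does not ask before writing over files that already exist, so it can be rerun in a notebook without blocking.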

In [None]:
#Find all files of type txt
glob('*.TXT')


Below, the data from each individual txt file is concatenated into a single pandas dataframe.

In [None]:
file_names = glob('*.TXT')

df = pd.concat(
    (pd.read_csv(f, names=['state', 'sex', 'year', 'name', 'count']) for f in file_names)
).reset_index(drop=True)

df.head()
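
To show what the concatenation step does, here is a small sketch on in-memory text instead of the real files. The two "files" and their rows are invented for illustration. Because the state files have no header row, the column names are supplied through `names=`.

```python
from io import StringIO

import pandas as pd

COLUMNS = ["state", "sex", "year", "name", "count"]

# Made-up rows standing in for two headerless state files.
fake_files = {
    "AK.TXT": "AK,F,1910,Mary,14\nAK,M,1910,John,8\n",
    "AL.TXT": "AL,F,1910,Mary,875\n",
}

# Read each "file" with explicit column names, then stack them into one dataframe.
frames = [pd.read_csv(StringIO(text), names=COLUMNS) for text in fake_files.values()]
demo = pd.concat(frames, ignore_index=True)
```

Passing `ignore_index=True` gives the same fresh 0..n-1 index as calling `.reset_index(drop=True)` after the concatenation.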

## Part 1: Popularity of the name "Matthew"

To start, we will look at the popularity of the common name "Matthew." Before we can visualize the data, we need to preprocess it. First, a new dataframe is made to hold the rows for the name Matthew in each year and state. Then, the number of Matthews is divided by the total number of babies and multiplied by 100 to give the percentage of babies named Matthew.

In [None]:
# Create a dataframe of just Matthews
matthew = df[df['name']=="Matthew"]
#Create a dataframe holding the total number of babies for each state and year
n_babies = df.groupby(by=["year","state"])["count"].sum()
n_babies_year = df.groupby(by=["year"])["count"].sum()
n_babies

In [None]:
#Total the Matthews for the entire country for each year and find the percentage
matthew_total = matthew.groupby(by=["year"])['count'].sum()
matthew_ratio = (matthew_total/n_babies_year)*100
matthew_ratio.reset_index()
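
The percentage calculation relies on pandas aligning two Series by their index (here, the year) before dividing. The sketch below reproduces the idea on a made-up five-row table; the names and counts are illustrative only.

```python
import pandas as pd

# Invented counts for two years.
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001],
    "name": ["Matthew", "Emma", "Jacob", "Matthew", "Emma"],
    "count": [20, 50, 30, 10, 90],
})

# Total babies per year, and Matthews per year, both indexed by year.
total_per_year = toy.groupby("year")["count"].sum()
matthew_per_year = toy[toy["name"] == "Matthew"].groupby("year")["count"].sum()

# Division matches rows by year, so each year's Matthews are divided by that year's total.
share = matthew_per_year / total_per_year * 100
```

In this toy table both years total 100 babies, so the shares come out to about 20 and 10 percent.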


In [None]:
px.line(matthew_ratio, x = matthew_ratio.index, y="count", title="Percentage of babies named 'Matthew' by year", labels={"count": "Percentage named Matthew"})


In the plot above, we see that the name Matthew saw a rise in popularity after 1950 and peaked in 1983 at 1.6%. It has been on a steady decline ever since.

Now, we will evaluate the popularity of the name on a state-by-state basis. To do so, we will now group the Matthew dataframe by year and state and then find the frequency of the name by state.

In [None]:
# Separate again by state.
matthew_total_state = matthew.groupby(by=["year","state"])['count'].sum()
# Find the percentage of Matthews for each state
matthew_ratio_state = (matthew_total_state/n_babies)*100
matthew_ratio_state

In [None]:
matthew_ratio_state = matthew_ratio_state.reset_index()

In [None]:
matthew_ratio_state
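
One detail of the state-level division is worth illustrating: when a (year, state) pair has babies but no recorded Matthews, the aligned division yields `NaN` rather than 0, and such states may show up uncolored on a map. The sketch below uses invented numbers; filling with 0 is an optional step that this report does not apply.

```python
import pandas as pd

# Invented counts: Texas has no Matthews recorded in 1950.
toy = pd.DataFrame({
    "state": ["NY", "NY", "TX", "TX", "NY", "NY", "TX", "TX"],
    "year": [1950, 1950, 1950, 1950, 1951, 1951, 1951, 1951],
    "name": ["Matthew", "Mary", "John", "Mary", "Matthew", "Mary", "Matthew", "Mary"],
    "count": [10, 90, 60, 40, 30, 70, 25, 75],
})

totals = toy.groupby(["year", "state"])["count"].sum()
matthews = toy[toy["name"] == "Matthew"].groupby(["year", "state"])["count"].sum()

# Index alignment keeps every (year, state) pair; pairs missing from `matthews` become NaN.
ratio = matthews / totals * 100

# Optional: treat "no recorded Matthews" as 0 percent, then flatten for plotting.
ratio_filled = ratio.fillna(0).reset_index()
```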

To illustrate the popularity of the name Matthew on a state-by-state basis, a choropleth map will be used. This map can be animated such that the data for each year between 1910 and 2019 can be visualized. The darker the green color, the more popular the name "Matthew" is in that state.

In [None]:
fig = px.choropleth(matthew_ratio_state, 
                    locationmode="USA-states",
                    scope="usa",
                    locations="state",
                    color="count",
                    title="Percentage of babies named Matthew",
                    color_continuous_scale = "greens",
                    range_color=(0, 2),
                    animation_frame="year",
                    hover_name="count",
                    labels={"count": "Percentage of babies named Matthew"},
                    hover_data = {"state":True}
                   )



fig.update_traces(marker_line_color="white")
fig.show()

Looking at the change over time in the map, the name's rise after 1950 appears to have begun in the North before spreading to the South. When its popularity started to decrease after 1983, there was no clear regional pattern.

## Part 2: Analyzing name diversity

Similar to the analysis of the name Matthew, we will now analyze how the number of different names has changed over time. First, we will see how many different names have been used across the whole country by year. Then, we will use a choropleth map to see which states used the highest number of different names per year.

In [None]:
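# Preview the data without the sex column. drop() returns a new dataframe and leaves df unchanged;
# the groupby below ignores sex anyway because nunique() counts each name once per year.
df.drop("sex", axis=1)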

In [None]:
unique_total = df.groupby(by=["year"])['name'].nunique()
unique_total.reset_index()
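
To make the behavior of `nunique()` concrete, the following sketch uses four invented rows in which one name ("Jordan") is given to both girls and boys. Grouping by year alone counts that name once, whereas grouping by year and sex counts it once per sex.

```python
import pandas as pd

# Invented rows; "Jordan" appears under both sexes.
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000],
    "sex": ["F", "M", "F", "M"],
    "state": ["NY", "NY", "TX", "TX"],
    "name": ["Jordan", "Jordan", "Emma", "Jacob"],
    "count": [5, 9, 12, 20],
})

# Distinct names per year, regardless of sex or state.
combined = toy.groupby("year")["name"].nunique()

# Distinct names per year and sex; a shared name is counted in both groups.
by_sex = toy.groupby(["year", "sex"])["name"].nunique()
```

Here the combined count is 3 while the per-sex counts are 2 and 2, so per-sex totals should not be expected to add up to the combined figure.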

In [None]:
#Plot unique names
px.line(unique_total, x = unique_total.index, y="name", labels={"name":"Number of different names used"},title="The number of different names given to babies in the US (1910-2019)")


In the plot above, the general trend shows that name diversity in the US has increased over time, although it has declined since peaking at 10,023 names in 2007. It should also be noted that the actual number of different names is higher than reflected in this graph (and in the subsequent name diversity plots in Part 2) because the dataset only includes a name if at least 5 babies received it.

Next, we will analyze the breakdown by state.

In [None]:
# Group data by year and state
unique_total_state = df.groupby(by=["year","state"])['name'].nunique()
unique_total_state = unique_total_state.reset_index()
unique_total_state

Here, a choropleth map will be used to indicate the number of different names used in each state for each year. The darker the purple color, the higher the number of different names in a given state.

In [None]:
fig = px.choropleth(unique_total_state, 
                    locationmode="USA-states",
                    scope="usa",
                    locations="state",
                    color="name",
                    title="Number of different names used by state",
                    color_continuous_scale = "purples",
                    range_color=(0, 7000),
                    animation_frame="year",
                    hover_name="state",
                    labels={"name": "Different baby names"},
                    hover_data = {"state":True}
                   )

 

fig.update_traces(marker_line_color="white")
fig.show()

Using the choropleth, we see the same trend of increased diversity over time. It is also apparent that New York, California, Texas, and Florida have had the most name diversity over the past few decades. This can likely be attributed to the fact that these are the most populous states and are among the most ethnically diverse.

Next, let's see if there's a significant difference in the number of different boy names and girl names.

In [None]:
#Group data by year and sex
unique_by_sex = df.groupby(by=["year","sex"])['name'].nunique()


In [None]:
unique_by_sex = unique_by_sex.reset_index()
unique_by_sex

In [None]:
#Plot male and female data separately
px.line(unique_by_sex, x="year", y="name", color='sex', title="Male vs Female name diversity in US (1910-2019)", labels={"name":"Number of different names"})


The plot above illustrates that there is a similar trend in the name diversity of males and females, while females have consistently had a larger number of different names.
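
A claim like "consistently larger" can also be checked numerically rather than read off the plot. The sketch below shows one way to do so on invented counts shaped like `unique_by_sex` (columns `year`, `sex`, `name`); the numbers are placeholders, not values from the dataset.

```python
import pandas as pd

# Invented per-year, per-sex counts of distinct names.
counts = pd.DataFrame({
    "year": [1910, 1910, 1911, 1911],
    "sex": ["F", "M", "F", "M"],
    "name": [120, 95, 130, 99],
})

# One row per year, one column per sex.
wide = counts.pivot(index="year", columns="sex", values="name")

# True only if the female count exceeds the male count in every year.
female_always_higher = bool((wide["F"] > wide["M"]).all())
```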

To conclude this analysis of name diversity, we will evaluate the breakdown by state and sex.

In [None]:
# Also group by sex now.
unique_by_sex_state = df.groupby(by=["year","state","sex"])['name'].nunique()
unique_by_sex_state.reset_index()

In [None]:
#Allow data to be grouped by sex
unique_by_state_grouped = unique_by_sex_state.groupby(by="sex")



In [None]:
#Make dataframe for girls
girls = unique_by_state_grouped.get_group('F')
girls = girls.reset_index()


The choropleth below shows the number of different female names per state for each year. The darker the pink color, the more different names were used in that state.

In [None]:
#Create choropleth for female data
fig = px.choropleth(girls, 
                    locationmode="USA-states",
                    scope="usa",
                    locations="state",
                    color="name",
                    title="Number of different female names used by state",
                    color_continuous_scale = "magenta",
                    range_color=(0, 4000),
                    animation_frame="year",
                    hover_name="state",
                    labels={"name": "Different female baby names"},
                    hover_data = {"state":True}
                   )


fig.update_traces(marker_line_color="white")
fig.show()

In [None]:
#Create dataframe for boys
boys = unique_by_state_grouped.get_group('M')
boys = boys.reset_index()
boys

The choropleth below shows the number of different male names per state for each year. The darker the blue color, the more different names were used in that state.

In [None]:
#Create choropleth for boys 
fig = px.choropleth(boys, 
                    locationmode="USA-states",
                    scope="usa",
                    locations="state",
                    color="name",
                    title="Number of different male names used by state",
                    color_continuous_scale = "blues",
                    range_color=(0, 4000),
                    animation_frame="year",
                    hover_name="state",
                    labels={"name": "Different male baby names"},
                    hover_data = {"state":True}
                   )



fig.update_traces(marker_line_color="white")
fig.show()

The choropleths, now divided by sex, mostly mirror the trends from the combined choropleth as the more populous states tend to have more names.

## Part 3: Tracking the most popular name by state and sex

In this final section, choropleth maps will be used to illustrate the most popular name for each year in every state. Data will be divided between males and females such that the most popular name from each sex can be seen. 

In [None]:
n_babies = df.groupby(by=["state","year","sex"])["count"].sum()


In [None]:
# Function that gets the most popular name
def top_name(grp):
    return grp.sort_values(by="count", ascending=False).head(1)



In [None]:
# Create dataframe for most popular male names
top_state_name = df.groupby(by="sex")

top_state_boys = top_state_name.get_group('M')
top_state_boys.reset_index()
top_state_boys

most_popular_boys = top_state_boys.groupby(by=["state","year", "sex"]).apply(top_name)
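
For comparison, the top row of each group can also be selected without `apply`, by asking each group for the index label of its largest count. This is an alternative sketch on invented data, not the approach used above. It returns a flat dataframe with the original columns, and on ties it keeps the first row encountered, which may differ from the sort-based `top_name`.

```python
import pandas as pd

# Invented counts for two states and two years.
toy = pd.DataFrame({
    "state": ["NY", "NY", "TX", "TX", "NY"],
    "year": [1990, 1990, 1990, 1990, 1991],
    "sex": ["M", "M", "M", "M", "M"],
    "name": ["Michael", "John", "Michael", "Jacob", "John"],
    "count": [300, 120, 80, 95, 210],
})

keys = ["state", "year", "sex"]

# idxmax() gives, for each group, the row label holding the largest count.
top_labels = toy.groupby(keys)["count"].idxmax()

# Use those labels to pull the full rows, then renumber the index.
top_rows = toy.loc[top_labels].reset_index(drop=True)
```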



In [None]:
#Make choropleth of most popular male names
fig = px.choropleth(most_popular_boys, 
                    locationmode="USA-states",
                    scope="usa",
                    locations="state",
                    color="name",
                    title="Most popular boy's name by state",
                    color_continuous_scale = "blues",
                    #range_color=(0, 0.1),
                    animation_frame="year",
                    hover_name="state",
                    labels={"name": "Top boy's name"},
                    hover_data = {"state":False, "year":False}
                   )



fig.update_traces(marker_line_color="white")
fig.show()

In the choropleth above, the five most popular names for each year are listed in the legend. If a state's most popular name is not amongst the nation's top five, the state will be colored gray. In the early 20th century, the names "John" and "Robert" were popular in most states. Toward the end of the century, names such as "Michael" or "Jacob" reached similar heights, but for shorter periods of time.

In [None]:
top_state_girls = top_state_name.get_group('F')
top_state_girls.reset_index()
top_state_girls

most_popular_girls = top_state_girls.groupby(by=["state","year", "sex"]).apply(top_name)

In [None]:
fig = px.choropleth(most_popular_girls, 
                    locationmode="USA-states",
                    scope="usa",
                    locations="state",
                    color="name",
                    title="Most popular girl's name by state",
                    color_continuous_scale = "blues",
                    range_color=(0, 0.1),
                    animation_frame="year",
                    hover_name="state",
                    labels={"name": "Top girl's name"},
                    hover_data = {"state":False}
                   )



fig.update_traces(marker_line_color="white")
fig.show()

In the choropleth above, the three most popular female names for each year are listed in the legend. If a state's most popular name is not among the nation's top three, the state will be colored gray. We can see that the name "Mary" was extremely popular across the country for much of the early 20th century. Later in the century, the names "Lisa" and "Jennifer" were the most popular in almost every state, albeit for a shorter period of time.

## Conclusion 

Upon analyzing the data provided by the Social Security Administration, it is clear that the popularity of certain names has changed over time. Americans have also become more creative with their naming, as the number of different baby names has generally increased over time. Thanks to the tools provided by pandas and Plotly Express, this data could be processed quickly and presented in a digestible way, without having to read through thousands of rows of data. Plots such as choropleth maps allow us to take visualization further by breaking down data geographically by state.