## USA Baby Names 1880-2016

The United States Social Security Administration maintains an interesting data set of (almost) all names given to babies born in the United States, by sex and year, going back to 1880. The "almost" is because names given to fewer than five babies of one sex in a year are left out for privacy. This data set is available at [https://www.ssa.gov/oact/babynames/limits.html](https://www.ssa.gov/oact/babynames/limits.html).

This data set is fun to explore, and we'll use it as the basis of a simple data analysis project. The end goal is a script that can be called to output a plot of a single name's popularity over time.

To start, we will assume that this dataset has already been downloaded and <font color="red"><b> unzipped </b></font> into a subfolder called `data/names` (`data\names` on Windows), which is the path the code below expects.

In [None]:
import platform # some of the subsequent code depends on operating system

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Set up some variables for use later
dataset_path_windows = 'data\\names'   # Windows
dataset_path_nix = './data/names'   # UNIX, Linux

begin_year = 1880
end_year = 2016

In [None]:
# The subsequent non-python commands depend on your operating system
platform.system()  # show name of operating system

In [None]:
if platform.system()=='Windows':
    dataset_path = dataset_path_windows
else:
    dataset_path = dataset_path_nix

Let's first examine the data files to see what we're working with. Note that the `type` command on Windows is equivalent to `cat` on macOS or Linux.

In [None]:
# List the files in the folder using the built-in commands of your operating system
# Jupyter notebooks can call the operating system shell by prefixing a command with an exclamation mark
if platform.system()=='Windows':
    !dir $dataset_path
else:
    !ls $dataset_path

In [None]:
# Read a single file into a python variable and print out the first five lines
if platform.system()=='Windows':
    sample = !type $dataset_path\\yob1880.txt
else:
    sample = !cat $dataset_path/yob1880.txt

sample[:5]

We will need a function to read in all of these files one by one and combine them into a single dataframe. Note that Pandas will correctly interpret the paths to the files, regardless of whether they are written in the Windows or the Unix style.

In [None]:
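def create_dataframe(begin_year, end_year):
    """Read every yobYYYY.txt file from begin_year to end_year (inclusive) into one DataFrame."""
    columns = ('name', 'sex', 'births')  # the files have no header row
    pieces = []
    for year in range(begin_year, end_year + 1):
        filename = '%s/yob%d.txt' % (dataset_path, year)
        piece = pd.read_csv(filename, names=columns)
        piece['year'] = year  # the year only appears in the filename, so add it as a column
        pieces.append(piece)

    # Stack the per-year pieces and renumber the rows from 0
    return pd.concat(pieces, ignore_index=True)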

In [None]:
# Now call our new function to get the dataset loaded into a DataFrame.
df = create_dataframe(begin_year, end_year)
df.head()

Now let's explore this data a little. First, how many records do we have?
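One way to answer this is sketched below. So that the cell runs on its own, it builds a tiny made-up stand-in for `df` (the names and numbers are invented for illustration); with the real data you would skip the stand-in and use the `df` loaded above.

```python
import pandas as pd

# Made-up stand-in for the real `df`: same columns, invented values
df = pd.DataFrame(
    [('Alex', 'M', 60, 2000), ('Alex', 'F', 20, 2000), ('Sam', 'M', 20, 2000),
     ('Alex', 'M', 90, 2001), ('Alex', 'F', 30, 2001), ('Sam', 'F', 80, 2001)],
    columns=['name', 'sex', 'births', 'year'],
)

# Each row is one (name, sex, year) combination, so the row count is the record count
n_records = len(df)
print(n_records)
```

`df.shape` gives the same number together with the column count.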

Now let's look at a specific name. Let's make a new dataframe that includes only your name and look at its first 5 rows.
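A minimal sketch of the filtering step, assuming a boolean mask on the `name` column is what we want. The stand-in `df` and the name `'Alex'` are invented for illustration; swap in your own name and the real `df`.

```python
import pandas as pd

# Made-up stand-in for the real `df`: same columns, invented values
df = pd.DataFrame(
    [('Alex', 'M', 60, 2000), ('Alex', 'F', 20, 2000), ('Sam', 'M', 20, 2000),
     ('Alex', 'M', 90, 2001), ('Alex', 'F', 30, 2001), ('Sam', 'F', 80, 2001)],
    columns=['name', 'sex', 'births', 'year'],
)

my_name = 'Alex'  # hypothetical; replace with your own name

# Keep only the rows whose name matches exactly (the comparison is case-sensitive)
name_df = df[df['name'] == my_name]
print(name_df.head())
```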

Let's now look at some stats for your name.
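One quick option is `describe()` on the `births` column, which reports the count, mean, spread, and quartiles in one go. The sketch below again uses an invented stand-in for `df` and the hypothetical name `'Alex'`.

```python
import pandas as pd

# Made-up stand-in for the real `df`: same columns, invented values
df = pd.DataFrame(
    [('Alex', 'M', 60, 2000), ('Alex', 'F', 20, 2000), ('Sam', 'M', 20, 2000),
     ('Alex', 'M', 90, 2001), ('Alex', 'F', 30, 2001), ('Sam', 'F', 80, 2001)],
    columns=['name', 'sex', 'births', 'year'],
)

name_df = df[df['name'] == 'Alex']  # 'Alex' is a hypothetical name

# Summary statistics over the per-row (name, sex, year) birth counts
stats = name_df['births'].describe()
print(stats)
```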

When was your name at peak popularity?
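A possible way to find the peak, assuming "peak popularity" means the year with the most raw births for the name across both sexes. The stand-in data and the name `'Alex'` are invented for illustration.

```python
import pandas as pd

# Made-up stand-in for the real `df`: same columns, invented values
df = pd.DataFrame(
    [('Alex', 'M', 60, 2000), ('Alex', 'F', 20, 2000), ('Sam', 'M', 20, 2000),
     ('Alex', 'M', 90, 2001), ('Alex', 'F', 30, 2001), ('Sam', 'F', 80, 2001)],
    columns=['name', 'sex', 'births', 'year'],
)

name_df = df[df['name'] == 'Alex']  # 'Alex' is a hypothetical name

# A name can appear once per sex in a year, so add those rows together first
births_by_year = name_df.groupby('year')['births'].sum()

# idxmax returns the index label (here: the year) of the largest value
peak_year = births_by_year.idxmax()
print(peak_year, births_by_year.loc[peak_year])
```

Raw counts favor years in which more babies were born overall, so the peak can shift once the numbers are normalized by total births per year.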

How can we convert the raw birth numbers into the percent of births that year? Let's make a new column for that.
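One way to do this is a grouped `transform`, which gives every row the total for its own year so the division lines up row by row. This is a sketch on invented stand-in data, and the column name `percent` is just a suggestion. Note that the denominator is the number of births recorded in the data set for that year, not every birth in the country.

```python
import pandas as pd

# Made-up stand-in for the real `df`: same columns, invented values
df = pd.DataFrame(
    [('Alex', 'M', 60, 2000), ('Alex', 'F', 20, 2000), ('Sam', 'M', 20, 2000),
     ('Alex', 'M', 90, 2001), ('Alex', 'F', 30, 2001), ('Sam', 'F', 80, 2001)],
    columns=['name', 'sex', 'births', 'year'],
)

# Total recorded births per year, repeated on every row of that year
total_by_year = df.groupby('year')['births'].transform('sum')

# Share of that year's recorded births, as a percentage
df['percent'] = 100 * df['births'] / total_by_year
print(df)
```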

Wow, some of these percentages are really small. Why don't we change it to the number of births of a given name per million births that year?
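The same idea with a different scale factor, sketched on invented stand-in data. The column name `per_million` is an assumption used for illustration.

```python
import pandas as pd

# Made-up stand-in for the real `df`: same columns, invented values
df = pd.DataFrame(
    [('Alex', 'M', 60, 2000), ('Alex', 'F', 20, 2000), ('Sam', 'M', 20, 2000),
     ('Alex', 'M', 90, 2001), ('Alex', 'F', 30, 2001), ('Sam', 'F', 80, 2001)],
    columns=['name', 'sex', 'births', 'year'],
)

# Total recorded births per year, repeated on every row of that year
total_by_year = df.groupby('year')['births'].transform('sum')

# Births of this name and sex per million recorded births in the same year
df['per_million'] = 1_000_000 * df['births'] / total_by_year
print(df)
```

The toy totals are tiny, so the values here are huge; on the real data most names land in a far more readable range than the raw percentages.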

Why don't we make a graph of how common your name is over the years?

If your name is like mine, the plot shows a shaded band around the line, indicating variance. Why would that be?


It's because this data is also split by gender, so a name can be listed twice in the same year, once for each gender. The gender split could be interesting though, so let's look at it graphically.

There is actually a really good breakdown of different name trends by Tim Urban at https://waitbutwhy.com/2013/12/how-to-name-baby.html, so let's look quickly at a couple of the interesting trends he found, using our code.

### Name Fads

A name fad is when a specific name gets really popular for a specific generation, so that a person's age can be reasonably guessed from their name alone.

Check out Jennifer, Ashley, or Shirley for some examples.
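A rough way to compare fads is to find the peak year of several names at once. The sketch below uses invented data and the placeholder names `'Alpha'` and `'Beta'`; with the real `df` you would pass `['Jennifer', 'Ashley', 'Shirley']` instead.

```python
import pandas as pd

# Made-up stand-in for the real `df`: placeholder names, invented values
df = pd.DataFrame(
    [('Alpha', 'F', 10, 1950), ('Beta', 'F', 80, 1950),
     ('Alpha', 'F', 90, 1975), ('Beta', 'F', 30, 1975),
     ('Alpha', 'F', 20, 2000), ('Beta', 'F', 5, 2000)],
    columns=['name', 'sex', 'births', 'year'],
)

names = ['Alpha', 'Beta']  # hypothetical placeholders

# Rows = years, columns = names, values = births summed over both sexes
totals = (df[df['name'].isin(names)]
          .groupby(['year', 'name'])['births'].sum()
          .unstack('name'))

# For each name (column), the year (index label) with the most births
peak_years = totals.idxmax()
print(peak_years)
```

Calling `totals.plot()` on the same table draws one line per name, which makes a sharp single-generation spike easy to spot.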

### Gender Takeovers

Sometimes a name that is uncommon but used solely for one gender becomes extremely popular for the other gender, to the point that the original gender stops using it.

Check out Lynn or Aubrey.
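One way to put a number on a takeover is the share of a name's births that are female in each year. The sketch below uses invented data and the placeholder name `'Gamma'`; with the real `df` you would try `'Lynn'` or `'Aubrey'`.

```python
import pandas as pd

# Made-up stand-in for the real `df`: placeholder name, invented values
df = pd.DataFrame(
    [('Gamma', 'M', 90, 1950), ('Gamma', 'F', 10, 1950),
     ('Gamma', 'M', 50, 1975), ('Gamma', 'F', 50, 1975),
     ('Gamma', 'M', 5, 2000), ('Gamma', 'F', 95, 2000)],
    columns=['name', 'sex', 'births', 'year'],
)

name_df = df[df['name'] == 'Gamma']  # 'Gamma' is a hypothetical placeholder

# Rows = years, columns = sex; a missing (year, sex) combination counts as 0 births
by_sex = name_df.pivot_table(index='year', columns='sex', values='births',
                             aggfunc='sum', fill_value=0)
# Make sure both columns exist even if the name was only ever given to one sex
by_sex = by_sex.reindex(columns=['F', 'M'], fill_value=0)

# Fraction of the name's births in each year that are female
female_share = by_sex['F'] / (by_sex['F'] + by_sex['M'])
print(female_share)
```

A share that moves from near 0 to near 1 (or the reverse) over the decades is the signature of a takeover.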