
## USA Baby Names 1880-2016

The United States Social Security Administration maintains an interesting data set of (almost) all names given to babies born in the United States, by sex and year, going back to 1880. This data set is available at [https://www.ssa.gov/oact/babynames/limits.html](https://www.ssa.gov/oact/babynames/limits.html).

This data set is interesting and fun to explore and we'll use it as the basis of a simple data analysis project with the end goal of creating a script that can be called to output a plot of a single name's popularity over time.

This notebook has been modified to download and unzip the data automatically by Mark Kittisopikul on July 16th, 2019.


In [0]:
import platform # some of the subsequent code depends on operating system

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [0]:
# Imports to download and unzip names.zip
# Added by Mark Kittisopikul, 2019/07/16
import urllib.request as req
import os
import zipfile

In [0]:
# Set up some variables for use later
dataset_path_windows = 'data\\names'   # Windows
dataset_path_nix = './data/names'   # UNIX, Linux

begin_year = 1880
end_year = 2016

In [4]:
# The subsequent non-python commands depend on your operating system
platform.system()  # show name of operating system

'Linux'

In [0]:
if platform.system()=='Windows':
    dataset_path = dataset_path_windows
else:
    dataset_path = dataset_path_nix

In [6]:
os.path.join(dataset_path,'yob1880.txt')

'./data/names/yob1880.txt'

In [7]:
# Download and unzip file
# Added by Mark Kittisopikul, 2019/07/16
# Check to see if data files exist
if not os.path.isfile(os.path.join(dataset_path,'yob1880.txt')):
  # If not, try to make the directories
  try:
    os.makedirs(dataset_path)
  except FileExistsError:
    # They may already exist
    pass
  # Zip file has the same name as the directory
  zipfile_path = dataset_path+'.zip'
  # Download the zip file if it does not exist already
  if not os.path.isfile(zipfile_path):
    req.urlretrieve("https://github.com/nuitrcs/pandas-workshop/blob/master/data/names.zip?raw=true", zipfile_path)
  # Unzip the file
  with zipfile.ZipFile(zipfile_path, 'r') as zip_ref:
    zip_ref.extractall(dataset_path)
else:
  print('Data is good to go!')

Data is good to go!


Let's first examine the data files to see what we're working with. Note that the `type` command on Windows is equivalent to `cat` on macOS or Linux.

In [8]:
# List the files in the data folder using the built-in commands of your operating system
# Jupyter notebooks can call the operating system by using the exclamation mark
if platform.system()=='Windows':
    !dir $dataset_path
else:
    !ls $dataset_path

NationalReadMe.pdf  yob1907.txt  yob1935.txt  yob1963.txt  yob1991.txt
yob1880.txt	    yob1908.txt  yob1936.txt  yob1964.txt  yob1992.txt
yob1881.txt	    yob1909.txt  yob1937.txt  yob1965.txt  yob1993.txt
yob1882.txt	    yob1910.txt  yob1938.txt  yob1966.txt  yob1994.txt
yob1883.txt	    yob1911.txt  yob1939.txt  yob1967.txt  yob1995.txt
yob1884.txt	    yob1912.txt  yob1940.txt  yob1968.txt  yob1996.txt
yob1885.txt	    yob1913.txt  yob1941.txt  yob1969.txt  yob1997.txt
yob1886.txt	    yob1914.txt  yob1942.txt  yob1970.txt  yob1998.txt
yob1887.txt	    yob1915.txt  yob1943.txt  yob1971.txt  yob1999.txt
yob1888.txt	    yob1916.txt  yob1944.txt  yob1972.txt  yob2000.txt
yob1889.txt	    yob1917.txt  yob1945.txt  yob1973.txt  yob2001.txt
yob1890.txt	    yob1918.txt  yob1946.txt  yob1974.txt  yob2002.txt
yob1891.txt	    yob1919.txt  yob1947.txt  yob1975.txt  yob2003.txt
yob1892.txt	    yob1920.txt  yob1948.txt  yob1976.txt  yob2004.txt
yob1893.txt	    yob1921.txt  yob1949.txt  yob1977.txt  yob

In [9]:
# Read a single file into a python variable and print out the first five lines
if platform.system()=='Windows':
    sample = !type $dataset_path\\yob1880.txt
else:
    sample = !cat $dataset_path/yob1880.txt

sample[:5]

['Mary,F,7065',
 'Anna,F,2604',
 'Emma,F,2003',
 'Elizabeth,F,1939',
 'Minnie,F,1746']

We will need a function to read in all of these files one by one and combine them into a single dataframe. Note that Pandas will correctly interpret the paths to the files, regardless of whether they are written in the Windows or the Unix style.

In [0]:
def create_dataframe(begin_year, end_year):
    columns = ('name', 'sex', 'births')
    pieces = []
    for year in range(begin_year, end_year + 1):
        filename = '%s/yob%d.txt' % (dataset_path, year)
        piece = pd.read_csv(filename, names=columns)
        piece['year'] = year
        pieces.append(piece)
        
    return pd.concat(pieces, ignore_index=True)

In [11]:
# Now call our new function to get the dataset loaded into a Dataframe.
df = create_dataframe(begin_year, end_year)
df.head()

        name sex  births  year
0       Mary   F    7065  1880
1       Anna   F    2604  1880
2       Emma   F    2003  1880
3  Elizabeth   F    1939  1880
4     Minnie   F    1746  1880


Now let's explore this data a little. First, how many records do we have?
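One way to count records is sketched here. The small `toy` dataframe is a stand-in built from the five 1880 rows shown earlier, so the numbers are not the answer for the full data set; on the real `df`, the same calls should apply.

```python
import pandas as pd

# Stand-in for df: the first five rows of yob1880.txt shown earlier
toy = pd.DataFrame({
    'name': ['Mary', 'Anna', 'Emma', 'Elizabeth', 'Minnie'],
    'sex': ['F'] * 5,
    'births': [7065, 2604, 2003, 1939, 1746],
    'year': [1880] * 5,
})

n_records = len(toy)         # number of rows
n_rows, n_cols = toy.shape   # (rows, columns)
print(n_records)
```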

Now let's look at a specific name. Make a new dataframe that includes only your name and look at its first 5 rows.
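A minimal sketch of the filtering step, using boolean indexing. The `toy` dataframe and the name `'Alex'` are made up for illustration (these are not SSA numbers); with the real data you would filter `df` on your own name.

```python
import pandas as pd

# Made-up rows with the same columns as df (not real SSA counts)
toy = pd.DataFrame({
    'name':   ['Alex', 'Alex', 'Sam', 'Alex', 'Alex', 'Sam'],
    'sex':    ['M', 'F', 'M', 'M', 'F', 'M'],
    'births': [300, 100, 600, 500, 250, 250],
    'year':   [2000, 2000, 2000, 2001, 2001, 2001],
})

my_name = 'Alex'  # hypothetical name; substitute your own
name_df = toy[toy['name'] == my_name]  # keep only the rows for that name
print(name_df.head())
```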

Let's now look at some summary statistics for your name.
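One possible approach is `describe()`, which reports the count, mean, spread, and quartiles of a numeric column. The rows below are made-up illustration values, not SSA data.

```python
import pandas as pd

# Made-up rows for a single hypothetical name (not real SSA counts)
name_df = pd.DataFrame({
    'name':   ['Alex'] * 4,
    'sex':    ['M', 'F', 'M', 'F'],
    'births': [300, 100, 500, 250],
    'year':   [2000, 2000, 2001, 2001],
})

stats = name_df['births'].describe()  # count, mean, std, min, quartiles, max
print(stats)
```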

When was your name at peak popularity?
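A sketch of one way to find the peak year: sum the births for the name within each year (a name can have more than one row per year), then ask for the index of the largest total with `idxmax()`. The rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for a single hypothetical name (not real SSA counts)
name_df = pd.DataFrame({
    'name':   ['Alex'] * 4,
    'sex':    ['M', 'F', 'M', 'F'],
    'births': [300, 100, 500, 250],
    'year':   [2000, 2000, 2001, 2001],
})

# Total births for the name in each year
births_by_year = name_df.groupby('year')['births'].sum()

peak_year = births_by_year.idxmax()   # year with the largest total
peak_births = births_by_year.max()    # the total in that year
print(peak_year, peak_births)
```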

How can we convert the raw birth numbers into the percentage of births that year? Let's make a new column for that.
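One way to do this, sketched on made-up rows, is to compute each year's total with `groupby(...).transform('sum')`, which returns a value for every row, and divide by it. The column name `pct` is an assumption made for this example.

```python
import pandas as pd

# Made-up rows with the same columns as df (not real SSA counts)
toy = pd.DataFrame({
    'name':   ['Alex', 'Alex', 'Sam', 'Alex', 'Alex', 'Sam'],
    'sex':    ['M', 'F', 'M', 'M', 'F', 'M'],
    'births': [300, 100, 600, 500, 250, 250],
    'year':   [2000, 2000, 2000, 2001, 2001, 2001],
})

# Total births recorded in each row's year, aligned row by row
year_total = toy.groupby('year')['births'].transform('sum')

# Percentage of that year's recorded births ('pct' is a name chosen here)
toy['pct'] = 100 * toy['births'] / year_total
print(toy)
```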

Wow, some of these percentages are really small. Why don't we change it to the number of births of a given name per million births that year?
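The per-million figure is the same ratio scaled by 1,000,000 instead of 100. A sketch on made-up rows follows; the column name `per_million` is an assumption made for this example.

```python
import pandas as pd

# Made-up rows with the same columns as df (not real SSA counts)
toy = pd.DataFrame({
    'name':   ['Alex', 'Alex', 'Sam', 'Alex', 'Alex', 'Sam'],
    'sex':    ['M', 'F', 'M', 'M', 'F', 'M'],
    'births': [300, 100, 600, 500, 250, 250],
    'year':   [2000, 2000, 2000, 2001, 2001, 2001],
})

year_total = toy.groupby('year')['births'].transform('sum')

# Births of this name and sex per million recorded births that year
toy['per_million'] = 1_000_000 * toy['births'] / year_total
print(toy)
```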

Why don't we make a graph of how common your name is over the years?

If your name is like mine, there is actually a bunch of shading indicating variance. Why would that be?


It's because this data is also split by gender (the `sex` column), so a name can be listed twice in the same year, once for each gender. The gender split could be interesting though, so let's look at it graphically.
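To see the duplication directly, one option is to count rows per year for a name and then pivot the births into one column per sex. This is a sketch on made-up rows, not SSA data.

```python
import pandas as pd

# Made-up rows for a single hypothetical name (not real SSA counts)
name_df = pd.DataFrame({
    'name':   ['Alex'] * 4,
    'sex':    ['M', 'F', 'M', 'F'],
    'births': [300, 100, 500, 250],
    'year':   [2000, 2000, 2001, 2001],
})

# Two rows per year: one for each value of the sex column
rows_per_year = name_df.groupby('year').size()

# One row per year, one column per sex
by_sex = name_df.pivot_table(index='year', columns='sex',
                             values='births', aggfunc='sum')
print(by_sex)
```

For a plot, passing `hue='sex'` to `sns.lineplot` should draw one line per sex instead of a single shaded line.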

There is actually a really good breakdown of different name trends by Tim Urban at https://waitbutwhy.com/2013/12/how-to-name-baby.html, so let's look quickly at a couple of the interesting trends he found with our code.

### Name Fads

A name fad is when a specific name gets really popular for a specific generation, so that a person's age can be reasonably guessed from their name alone.

Check out Jennifer, Ashley, or Shirley for some examples.
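One hypothetical way to put a number on a fad (this measure is an assumption of this sketch, not something from the article) is to find the peak year and ask what share of all births for the name fall within a window around it. The names and counts below are made up.

```python
import pandas as pd

# Made-up data: one name with a sharp spike, one with flat usage
toy = pd.DataFrame({
    'name':   ['Fadname'] * 5 + ['Steady'] * 5,
    'year':   [1960, 1970, 1980, 1990, 2000] * 2,
    'births': [10, 100, 800, 80, 10, 200, 200, 200, 200, 200],
})

def peak_share(frame, name, window=10):
    """Return (peak year, share of the name's births within +/- window years)."""
    by_year = frame[frame['name'] == name].groupby('year')['births'].sum()
    peak = by_year.idxmax()
    near = by_year[(by_year.index >= peak - window) &
                   (by_year.index <= peak + window)]
    return peak, near.sum() / by_year.sum()

fad_peak, fad_share = peak_share(toy, 'Fadname')
steady_peak, steady_share = peak_share(toy, 'Steady')
print(fad_peak, fad_share, steady_share)
```

A share close to 1 means nearly everyone with the name was born around the same time, which is what makes the age guess work.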

### Gender Takeovers

Sometimes a name that is uncommon but used solely for one gender becomes extremely popular for the other gender, to the point that the original gender stops using it.

Check out Lynn or Aubrey.
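A takeover can be made visible by tracking, for each year, the share of a name's births recorded as female. The sketch below uses a made-up name and made-up counts, and `fill_value=0` covers years in which one sex has no row at all.

```python
import pandas as pd

# Made-up rows for a hypothetical name that shifts from M to F
name_df = pd.DataFrame({
    'name':   ['Examplename'] * 6,
    'sex':    ['M', 'F', 'M', 'F', 'M', 'F'],
    'births': [900, 100, 500, 500, 50, 950],
    'year':   [1950, 1950, 1980, 1980, 2010, 2010],
})

# One row per year, one column per sex; missing combinations become 0
by_sex = name_df.pivot_table(index='year', columns='sex', values='births',
                             aggfunc='sum', fill_value=0)

# Fraction of the name's births in each year that are female
female_share = by_sex['F'] / (by_sex['F'] + by_sex['M'])
print(female_share)
```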