# Data Analysis with *pandas*

The United States Social Security Administration (SSA) has made available data on the
frequency of baby names from 1880 through the present. Download the data set [US Baby Names 1880-2023](https://www.ssa.gov/oact/babynames/names.zip) (names.zip, 7 MB). Create a folder named **data** and unpack the contents of the .zip file there.

The ZIP file contains several text files (`yob1880.txt`, `yob1881.txt`, ...). Each file covers one year and lists the number of births for each name. To safeguard privacy, the SSA restricted the list of names to those with at least 5 occurrences.

Each line of a file holds three comma-separated values: name, sex, and number of births. Here are the first five rows of `yob1880.txt`:

    Mary,F,7065
    Anna,F,2604
    Emma,F,2003
    Elizabeth,F,1939
    Minnie,F,1746

We will try to answer several questions such as:
- How many boys/girls were born each year?
- What was the most popular boy/girl name in each year?
- ... 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# show matplotlib graphics inline
%matplotlib inline
plt.rc('figure', figsize=(18, 3))
plt.rcParams['figure.facecolor'] = 'w'

In [None]:
# set the maximum number of rows to be displayed
pd.options.display.max_rows = 10  

In [None]:
pd.__version__

<h3>The DataFrame Data Structure</h3>

The main pandas data structure is the [DataFrame](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which you can think of as representing a table or spreadsheet of data.

We will first read the file "yob1880.txt" into a DataFrame. We will then obtain some basic information about the data.

In [None]:
# read one of the files and set header
names = pd.read_csv('data/yob1880.txt', names=['name', 'sex', 'births'])

- `Shift + Enter`: run the current cell and move to the next cell
- `Ctrl + Enter`: run the current cell without moving to the next cell

In [None]:
names.sample(n=6)
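
As a minimal sketch of the kind of basic information a DataFrame exposes, the five sample rows quoted earlier can be loaded from memory instead of from disk. The variable names `raw` and `sample` and the use of `io.StringIO` are illustrative assumptions, not part of the notebook.

```python
import io
import pandas as pd

# The five sample rows of yob1880.txt quoted earlier, held in memory
# so that the sketch runs without the downloaded files.
raw = "Mary,F,7065\nAnna,F,2604\nEmma,F,2003\nElizabeth,F,1939\nMinnie,F,1746\n"

sample = pd.read_csv(io.StringIO(raw), names=['name', 'sex', 'births'])

print(sample.shape)                 # (number of rows, number of columns)
print(sample.dtypes)                # data type of each column
print(sample.head(3))               # first three rows
print(sample['births'].describe())  # summary statistics of a numeric column
```

The same calls should work on `names` once the real file has been read.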

<h3>Indexing and selecting data</h3>

<p>There are three slicing methods for selecting row data: two explicit slicing methods, and a general case.</p>
<ol>
<li>Position-oriented (Python slicing style : exclusive of end)</li>
<li>Label-oriented (Non-Python slicing style : inclusive of end)</li>
<li>General (Either slicing style : depends on whether the slice contains labels or positions)</li>
</ol>
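
The three methods listed above can be sketched on a small frame built from the sample rows quoted earlier. The frame `toy` and the result names are hypothetical and used only for illustration.

```python
import pandas as pd

# Small illustrative frame with the default integer index 0..4.
toy = pd.DataFrame({
    'name': ['Mary', 'Anna', 'Emma', 'Elizabeth', 'Minnie'],
    'sex': ['F'] * 5,
    'births': [7065, 2604, 2003, 1939, 1746],
})

by_position = toy.iloc[0:2]  # positions 0 and 1; the end is excluded
by_label = toy.loc[0:2]      # labels 0, 1 and 2; the end is included
general = toy[0:2]           # plain [] with an integer slice acts on positions

# Passing a list of column names selects those columns in that order.
reordered = toy[['births', 'name']]

print(len(by_position), len(by_label), len(general))
```

Because the index labels here are themselves integers, `iloc[0:2]` and `loc[0:2]` look alike but return a different number of rows.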

In [None]:
# position oriented


In [None]:
# label oriented (here index labels are numbers!)


In [None]:
# choose a subset of columns and change their order


# Questions and answers

### Names in the year 1880

<mark><b>Q1</b> How many boys and how many girls were born in the year 1880?</mark>

<mark><b>Q2</b> How many different names of boys and girls occurred in the year 1880?</mark><br>

<mark><b>Q3</b> What was the most common boy name of the year 1880?</mark><br>

<h3>Merge data</h3>
<p>The data set is split into files by year, so let's merge this data in the following way:</p>
<ul>
   <li>assemble all of the data into a single DataFrame, and</li>
   <li>add a year field.</li>
</ul>
<br>
We will use the [concat](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) command for this purpose:

In [None]:
# 2023 is the last available year right now.
years = range(1880,2024)

pieces = []
columns = ['names', 'sex', 'births']

for year in years:
    path = 'data/yob%d.txt' % year
    current = pd.read_csv(path, names=columns)
    
    current['year'] = year
    pieces.append(current)

In [None]:
# Concatenate everything into a single DataFrame.
# We have to pass ignore_index=True because we’re not interested in preserving the original row numbers.
df = pd.concat(pieces, ignore_index=True)

In [None]:
# rename the "names" column
df = df.rename(columns = {'names':'name'})
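
To see what `ignore_index=True` changes, here is a small sketch with two hypothetical per-year pieces. The frames `a` and `b` and their values are made up for illustration.

```python
import pandas as pd

# Two hypothetical per-year pieces; each has its own 0-based row index.
a = pd.DataFrame({'name': ['A', 'B'], 'births': [10, 20]})
a['year'] = 1880
b = pd.DataFrame({'name': ['A', 'C'], 'births': [30, 40]})
b['year'] = 1881

kept = pd.concat([a, b])                      # row labels repeat: 0, 1, 0, 1
fresh = pd.concat([a, b], ignore_index=True)  # row labels renumbered: 0, 1, 2, 3

print(list(kept.index))
print(list(fresh.index))
```

A unique row index is convenient later on, for example when a row label returned by `idxmax` is used to look up a single row.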

<mark><b>Q4</b> How many boys and how many girls were born in the year 2023?</mark><br>

### Total births

We will use [pivot_table](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html) here to illustrate the number of births per year.

In [None]:
df.pivot_table(values='births', index=['year'], columns=['sex'], aggfunc='sum').plot(
    title='Total births by sex and year');
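
What `pivot_table` computes before plotting can be sketched on a tiny, hypothetical data set (the values in `toy` are made up). The `groupby` variant is shown as one possible equivalent, not as the notebook's method.

```python
import pandas as pd

# Hypothetical long-format data: one row per (name, sex, year).
toy = pd.DataFrame({
    'name':   ['A', 'B', 'C', 'A', 'C'],
    'sex':    ['F', 'F', 'M', 'F', 'M'],
    'births': [10, 20, 5, 30, 15],
    'year':   [1880, 1880, 1880, 1881, 1881],
})

# One row per year, one column per sex, each cell the sum of births.
totals = toy.pivot_table(values='births', index='year', columns='sex', aggfunc='sum')

# The same table built with groupby followed by unstack.
alt = toy.groupby(['year', 'sex'])['births'].sum().unstack('sex')

print(totals)
```

Calling `.plot()` on such a table draws one line per column, which is how the chart of total births by sex is produced.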

<mark><b>Q5</b> What was the number of births over the entire period?</mark>

<mark><b>Q6</b> What was the number of different boy and girl names over the entire period?</mark>

<mark><b>Q7</b> How many different names appeared in the years 1880 and 2023?</mark>

<h3>Boys and girls</h3>

<p>We can divide the data into two data frames: one for boys and one for girls. Moreover, we can extract a subset of the data to facilitate further analysis.</p>

In [None]:
boys = df.loc[df.sex == 'M'].reset_index(drop=True)
girls = df.loc[df.sex == 'F'].reset_index(drop=True)
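
The split relies on boolean masks. A minimal sketch on hypothetical data (the frame `toy` is made up) shows what the mask looks like and what `reset_index(drop=True)` does to the row labels.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..3.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'sex':  ['F', 'M', 'F', 'M'],
})

mask = toy['sex'] == 'M'                  # boolean Series, one entry per row
kept = toy.loc[mask]                      # original labels 1 and 3 survive
renumbered = kept.reset_index(drop=True)  # labels restart at 0

print(list(kept.index), list(renumbered.index))
```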

### Names in time

We will use [advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html) to set the DataFrame index (row labels) using one or more existing columns.

In [None]:
boys = boys.set_index(['name', 'year'])
girls = girls.set_index(['name', 'year'])
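
A small sketch of how selection works once two columns form the index. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data for two names over two years.
toy = pd.DataFrame({
    'name':   ['A', 'A', 'B'],
    'year':   [1880, 1881, 1880],
    'sex':    ['M', 'M', 'M'],
    'births': [10, 30, 20],
})

indexed = toy.set_index(['name', 'year'])

# Selecting on the first level returns that name's rows, indexed by year.
a_births = indexed.loc['A']['births']

# A tuple addresses both levels at once and returns a single value.
a_1881 = indexed.loc[('A', 1881), 'births']

print(a_births)
print(a_1881)
```

Since the result of `.loc['A']` is indexed by year, plotting it puts the years on the horizontal axis.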

Let's see how popular particular names were over time. Can you guess who is who? :)

In [None]:
fig, ax = plt.subplots()
boys.loc['Michael']['births'].plot(ax=ax)
boys.loc['Jason']['births'].plot(ax=ax)
boys.loc['John']['births'].plot(ax=ax);

In [None]:
fig, ax = plt.subplots()
girls.loc['Mary']['births'].plot(ax=ax)
girls.loc['Marilyn']['births'].plot(ax=ax)
girls.loc['Sophia']['births'].plot(ax=ax);

<mark><b>Q8</b> How many girls named Marilyn were born in the years 1960 and 2000?</mark>

<mark><b>Q9</b> In which year was the boy name Jason the most popular?</mark>

<mark><b>Q10</b> How many girls were named John? How many boys were named Mary?</mark>

### Top names

<mark><b>Q11</b> For how many names were there more than 50,000 births in one year? List those names.</mark>

<mark><b>Q12</b> How many of those top names occurred before the year 1950, and how many after?</mark>

### The names with the longest history

<mark><b>Q13</b> How many names occurred in every year so far?</mark>

In [None]:
# How many different years are there in our data?
count_years = df['year'].nunique()
count_years

### Just for fun...

Let's see what pandas is capable of doing in just one line of code...

<mark><b>Q14</b> What was the most popular name, based on the number of births, in the year 2023? </mark>

In [None]:
year = 2023

In [None]:
# for each year's group, pick the name in the row with the most births
dict(df.groupby('year').apply(lambda group: group['name'][group['births'].idxmax()]))[year]
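
The one-liner above packs several steps together. The sketch below unpacks the same idea on hypothetical toy data, using `groupby(...).idxmax()` followed by a label lookup. This is an alternative formulation that assumes a unique row index, not the notebook's exact expression.

```python
import pandas as pd

# Hypothetical data: two names in each of two years.
toy = pd.DataFrame({
    'name':   ['A', 'B', 'A', 'B'],
    'births': [10, 20, 40, 30],
    'year':   [1880, 1880, 1881, 1881],
})

# Step 1: for each year, the row label of the largest births value.
top_labels = toy.groupby('year')['births'].idxmax()

# Step 2: look those rows up and key the names by year.
top_names = toy.loc[top_labels].set_index('year')['name']

print(top_names.loc[1881])
```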

<mark><b>Q15</b> Which name had the highest frequency in any single year? Please provide the name, the number of occurrences, and the year. </mark>