# Lecture 5 – Fall 2024

A demonstration of advanced `pandas` syntax to accompany Lecture 5.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

## Dataset: California baby names

In today's lecture, we'll work with the `babynames` dataset, which contains information about the names of infants born in California.

The cell below downloads the state-level baby names archive from the Social Security Administration (SSA) website and loads the California file into a `DataFrame`. The code shown here is outside the scope of Data 100, but you're encouraged to dig into it if you are interested!

In [2]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # If the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head()

| | State | Sex | Year | Name | Count |
|---|---|---|---|---|---|
| 0 | CA | F | 1910 | Mary | 295 |
| 1 | CA | F | 1910 | Helen | 239 |
| 2 | CA | F | 1910 | Dorothy | 220 |
| 3 | CA | F | 1910 | Margaret | 163 |
| 4 | CA | F | 1910 | Frances | 134 |


## Case Study: Name "Popularity"

### Case Study Question

**Title**: Identifying the Most Consistently Popular Female Baby Name Over Time

**Objective:** In this exercise, we will analyze the dataset to find the female baby name that has shown the most consistent popularity over the years. This involves filtering the data, calculating the consistency of name counts, and determining the most stable name.

### Instructions

1. **Data Preparation:** Filter the dataset to only include entries where the sex is "F" (female).
2. **Calculate Consistency:** For each name, calculate the standard deviation of the counts over the years. A lower standard deviation indicates more consistent popularity.
3. **Identify Most Consistent Name:** Determine the name with the lowest standard deviation in counts, signifying the most consistent popularity.
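
Before running these steps on the full dataset, it can help to trace them on a tiny table. The sketch below uses a hypothetical `toy` DataFrame (invented values, not taken from `babynames`) with the same column names, and applies the filter, the per-name standard deviation, and the minimum lookup in turn.

```python
import pandas as pd

# Hypothetical toy data for illustration only; these rows are NOT from the real dataset.
toy = pd.DataFrame({
    "Sex":   ["F", "F", "F", "F", "F", "M"],
    "Year":  [2000, 2001, 2002, 2000, 2001, 2000],
    "Name":  ["Ava", "Ava", "Ava", "Mia", "Mia", "Leo"],
    "Count": [10, 12, 14, 5, 5, 7],
})

# Step 1: keep only the female rows
toy_f = toy[toy["Sex"] == "F"]

# Step 2: sample standard deviation of Count within each name
toy_std = toy_f.groupby("Name")["Count"].std()

# Step 3: label of the smallest standard deviation
toy_most_consistent = toy_std.idxmin()

print(toy_std)
print("Most consistent (toy data):", toy_most_consistent)
```

Here "Mia" has the same count in both of its years, so its standard deviation is 0 and it is selected, even though "Ava" is the more common name. The same effect can show up on the real data.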


In [9]:

# Filter the dataset to only include female baby names
babynames_f = babynames[babynames['Sex'] == 'F']

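# Sample standard deviation (ddof=1) of the yearly counts for each name;
# a name with only one row gets NaN
std_devs = babynames_f.groupby('Name')['Count'].std()

# Name with the lowest standard deviation
# (idxmin skips NaN; ties go to the first matching label)
most_consistent_name = std_devs.idxmin()
lowest_std_dev = std_devs.min()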

print("\nMost consistently popular female baby name:", most_consistent_name)
print("\nStandard deviation of counts:", round(lowest_std_dev, 2))

# sorted list of names by their consistency
sorted_std_devs = std_devs.sort_values()
print("\n Sorted by consistency (standard deviation): ")
print(sorted_std_devs)


Most consistently popular female baby name: Aaleah

Standard deviation of counts: 0.0

 Sorted by consistency (standard deviation): 
Name
Suriah     0.0
Lavanya    0.0
Alyzza     0.0
Lavena     0.0
Lavera     0.0
          ... 
Zula       NaN
Zuleidy    NaN
Zunaira    NaN
Zuni       NaN
Zuzu       NaN
Name: Count, Length: 13930, dtype: float64
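
The output above deserves a second look. A standard deviation of 0.0 means that every recorded count for that name is identical, and `NaN` appears when a name has only a single row, because the sample standard deviation (the `pandas` default, `ddof=1`) is undefined for one observation. Neither case says the name was popular. One possible refinement, sketched below under stated assumptions, is to require a minimum number of recorded years and to rank by the coefficient of variation (standard deviation divided by mean) instead. The helper name `consistency_table` and the `min_years` threshold are hypothetical choices for illustration, not part of the original analysis, and the demo data is invented.

```python
import pandas as pd

def consistency_table(df, min_years=50):
    """Rank names by coefficient of variation (std / mean) of their counts.

    Hypothetical helper: `min_years` is an arbitrary illustrative threshold,
    and each row of `df` is assumed to be one (name, year) record.
    """
    stats = df.groupby("Name")["Count"].agg(n_years="count", mean="mean", std="std")
    # Drop names with too few recorded years to judge consistency
    stats = stats[stats["n_years"] >= min_years].copy()
    # Scale the spread by the typical count so large and small names are comparable
    stats["cv"] = stats["std"] / stats["mean"]
    return stats.sort_values("cv")

# Invented demo data, not from the real dataset
toy = pd.DataFrame({
    "Name":  ["Ava", "Ava", "Ava", "Mia", "Mia", "Zoe"],
    "Year":  [2000, 2001, 2002, 2000, 2001, 2000],
    "Count": [100, 110, 90, 5, 5, 8],
})

raw_std = toy.groupby("Name")["Count"].std()   # Mia has identical counts, Zoe has one row
ranked = consistency_table(toy, min_years=3)   # only Ava has at least 3 rows

print(raw_std)
print(ranked)
```

On the real data, a call such as `consistency_table(babynames_f, min_years=50)` would apply the same idea; the threshold should be chosen to match what "popular over time" is meant to capture.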
