# More Pandas Utility Functions

A demonstration of advanced `pandas` syntax to accompany Lecture 4.

In [20]:
import numpy as np
import pandas as pd
import plotly.express as px

## Dataset: California baby names

In today's lecture, we'll work with the `babynames` dataset, which contains information about the names of infants born in California.

The cell below downloads the data from the Social Security Administration (SSA) website and then loads it into a usable form. The code shown here is outside the scope of Data 100, but you're encouraged to dig into it if you are interested!

In [21]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # If the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head()

Unnamed: 0,State,Sex,Year,Name,Count
0,CA,F,1910,Mary,295
1,CA,F,1910,Helen,239
2,CA,F,1910,Dorothy,220
3,CA,F,1910,Margaret,163
4,CA,F,1910,Frances,134
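
The `header=None, names=field_names` combination is needed because the file has no header row. A minimal sketch of the same `pd.read_csv` call on an in-memory string: the three rows are copied from the output above, and `sample_text` and `sample` are illustrative names, not part of the original notebook.

```python
import io
import pandas as pd

# Three rows in the same headerless layout as CA.TXT (copied from the output above)
sample_text = "CA,F,1910,Mary,295\nCA,F,1910,Helen,239\nCA,F,1910,Dorothy,220\n"

field_names = ["State", "Sex", "Year", "Name", "Count"]

# header=None tells pandas the first line is data, not column labels;
# names= supplies the column labels instead
sample = pd.read_csv(io.StringIO(sample_text), header=None, names=field_names)
print(sample)
```

Without `header=None`, the first data row would be consumed as column labels.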


### Exercises
We want to obtain the first three baby names with `Count > 250`.

1. Code this using `head()`

2. Code this using `loc`

3. Code this using `iloc`

4. Code this using `[]`


In [22]:
# Answer Here
sort_names=babynames[babynames['Count']>250]
sort_names.head(3)

Unnamed: 0,State,Sex,Year,Name,Count
0,CA,F,1910,Mary,295
233,CA,F,1911,Mary,390
484,CA,F,1912,Mary,534


In [23]:
# Answer Here
sort_names.loc[sort_names.index[0:3],:]

Unnamed: 0,State,Sex,Year,Name,Count
0,CA,F,1910,Mary,295
233,CA,F,1911,Mary,390
484,CA,F,1912,Mary,534


In [24]:
# Answer Here
sort_names.iloc[0:3,:]

Unnamed: 0,State,Sex,Year,Name,Count
0,CA,F,1910,Mary,295
233,CA,F,1911,Mary,390
484,CA,F,1912,Mary,534


In [25]:
# Answer Here
sort_names[0:3]

Unnamed: 0,State,Sex,Year,Name,Count
0,CA,F,1910,Mary,295
233,CA,F,1911,Mary,390
484,CA,F,1912,Mary,534
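
All four answers return the same rows, but they get there differently: `loc` slices by index label (end inclusive), while `iloc` and a plain `[]` slice work by position (end exclusive). A minimal sketch on a made-up toy DataFrame whose index labels, like those of the filtered result above, are not consecutive; the fourth row is invented for illustration.

```python
import pandas as pd

# Toy rows that mimic a filtered result: index labels are not 0, 1, 2, ...
toy = pd.DataFrame(
    {"Name": ["Mary", "Mary", "Mary", "Mary"], "Count": [295, 390, 534, 584]},
    index=[0, 233, 484, 732],
)

by_position = toy.iloc[0:3]   # positions 0, 1, 2 (end exclusive)
by_label = toy.loc[0:484]     # labels 0 through 484 (end inclusive)
by_brackets = toy[0:3]        # [] with a slice also selects rows by position

print(by_position.equals(by_label), by_position.equals(by_brackets))
```

This is why the `loc` answer above goes through `sort_names.index[0:3]`: it first converts positions into labels.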


### `.isin` for Selection based on a list, array, or `Series`

In [26]:
# Note: The parentheses surrounding the code make it possible to break the code into multiple lines for readability
babynames[(babynames["Name"] == "Bella") |
              (babynames["Name"] == "Alex") |
              (babynames["Name"] == "Narges") |
              (babynames["Name"] == "Lisa")]


Unnamed: 0,State,Sex,Year,Name,Count
6289,CA,F,1923,Bella,5
7512,CA,F,1925,Bella,8
12368,CA,F,1932,Lisa,5
14741,CA,F,1936,Lisa,8
17084,CA,F,1939,Lisa,5
...,...,...,...,...,...
399773,CA,M,2019,Alex,438
402648,CA,M,2020,Alex,379
405452,CA,M,2021,Alex,334
408335,CA,M,2022,Alex,345


In [27]:
# A more concise method to achieve the above: .isin
#Answer Here
name_list=["Bella","Alex","Narges","Lisa"]
babynames[babynames['Name'].isin(name_list)]

Unnamed: 0,State,Sex,Year,Name,Count
6289,CA,F,1923,Bella,5
7512,CA,F,1925,Bella,8
12368,CA,F,1932,Lisa,5
14741,CA,F,1936,Lisa,8
17084,CA,F,1939,Lisa,5
...,...,...,...,...,...
399773,CA,M,2019,Alex,438
402648,CA,M,2020,Alex,379
405452,CA,M,2021,Alex,334
408335,CA,M,2022,Alex,345
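
As a quick check that the two approaches agree, here is a minimal sketch on a made-up toy DataFrame. It also shows `~`, which negates a Boolean mask and so selects the rows whose name is *not* in the list.

```python
import pandas as pd

# Made-up toy data for illustration
toy = pd.DataFrame({"Name": ["Bella", "Alex", "Maria", "Lisa", "Omar"],
                    "Count": [5, 8, 12, 7, 9]})
name_list = ["Bella", "Alex", "Narges", "Lisa"]

# Chained comparisons joined with | (element-wise "or")
chained = toy[(toy["Name"] == "Bella") | (toy["Name"] == "Alex") |
              (toy["Name"] == "Narges") | (toy["Name"] == "Lisa")]

# The same selection with .isin
with_isin = toy[toy["Name"].isin(name_list)]

# ~ negates the mask: every row whose name is NOT in the list
not_in_list = toy[~toy["Name"].isin(name_list)]

print(chained.equals(with_isin))
print(not_in_list["Name"].tolist())
```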


### `.str` Functions for Defining a Condition

In [28]:
# What if we only want names that start with "J"?
#Answer Here
start_with_j=babynames[babynames['Name'].str.startswith('J')]
start_with_j

Unnamed: 0,State,Sex,Year,Name,Count
16,CA,F,1910,Josephine,66
44,CA,F,1910,Jean,35
46,CA,F,1910,Jessie,32
59,CA,F,1910,Julia,28
66,CA,F,1910,Juanita,25
...,...,...,...,...,...
413714,CA,M,2023,Jj,5
413715,CA,M,2023,Johnathon,5
413716,CA,M,2023,Jorden,5
413717,CA,M,2023,Jozef,5
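
`.str.startswith` is one of several `.str` methods that return a Boolean `Series` usable as a condition. A minimal sketch with two others, `.str.endswith` and `.str.contains`, on a small hand-picked list of names:

```python
import pandas as pd

# A small hand-picked Series of names for illustration
names = pd.Series(["Josephine", "Jean", "Mary", "Benjamin", "Julia"])

starts_with_j = names[names.str.startswith("J")]
ends_with_n = names[names.str.endswith("n")]
# regex=False treats "ea" as a literal substring rather than a regular expression
contains_ea = names[names.str.contains("ea", regex=False)]

print(starts_with_j.tolist())
print(ends_with_n.tolist())
print(contains_ea.tolist())
```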


## Custom Sort

In [29]:
# Sort a Series Containing Names
sort_series=babynames['Name'].sort_values(ascending=False)
sort_series

408216      Zyrus
217445      Zyrah
197542      Zyrah
220708      Zyrah
232190      Zyrah
           ...   
370812      Aaden
401876    Aadarsh
387660      Aadan
372774      Aadan
369654      Aadan
Name: Name, Length: 413894, dtype: object

In [30]:
# Sort a DataFrame: there are lots of Michaels in California
sort_by_michael=babynames[babynames['Name']=='Michael'].sort_values(by='Count',ascending=False)
sort_by_michael

Unnamed: 0,State,Sex,Year,Name,Count
271693,CA,M,1957,Michael,8263
270669,CA,M,1956,Michael,8257
321036,CA,M,1990,Michael,8247
285500,CA,M,1969,Michael,8244
286795,CA,M,1970,Michael,8197
...,...,...,...,...,...
175078,CA,F,2006,Michael,9
18484,CA,F,1941,Michael,8
179866,CA,F,2007,Michael,7
16200,CA,F,1938,Michael,7


### Approach 1: Create a temporary column

In [31]:
# Create a Series of the length of each name
name_lengths=babynames['Name'].str.len()
# Add the Series as a new column to the DataFrame
babynames['name_lengths']=name_lengths
# Sort the DataFrame by the new column
sort_length=babynames.sort_values(by='name_lengths')

In [32]:
# drop new column
babynames=babynames.drop(columns='name_lengths')

### Approach 2: Sorting using the `key` argument


In [33]:
# Answer Here
baby_sort=babynames.sort_values(by='Name',key=lambda x:x.str.len())
baby_sort

Unnamed: 0,State,Sex,Year,Name,Count
83016,CA,F,1979,Ji,5
331174,CA,M,1993,Vu,5
298821,CA,M,1978,Al,13
277555,CA,M,1962,Ty,55
404824,CA,M,2020,Jj,6
...,...,...,...,...,...
337819,CA,M,1996,Franciscojavier,8
325562,CA,M,1991,Franciscojavier,6
316193,CA,M,1987,Franciscojavier,5
317627,CA,M,1988,Franciscojavier,10
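
The function passed to `key` receives the whole `"Name"` column as a `Series` and must return a `Series` of the same length; the rows are then ordered by those returned values. A minimal sketch on made-up toy data, checking that Approach 1 and Approach 2 give the same result. `kind="stable"` is added here only so that ties (names of equal length) keep their original order in both approaches.

```python
import pandas as pd

# Made-up toy data for illustration
toy = pd.DataFrame({"Name": ["Margaret", "Al", "Dorothy", "Mary", "Ty"],
                    "Count": [163, 13, 220, 295, 55]})

# Approach 1: temporary column, sort, then drop the column again
with_column = (toy.assign(name_lengths=toy["Name"].str.len())
                  .sort_values("name_lengths", kind="stable")
                  .drop(columns="name_lengths"))

# Approach 2: key maps the "Name" Series to the values that are actually compared
with_key = toy.sort_values("Name", key=lambda s: s.str.len(), kind="stable")

print(with_key["Name"].tolist())
print(with_column.equals(with_key))
```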


### Approach 3: Sorting Using the `map` Function

We can also use the `Series.map` method if we want to apply an arbitrary function to each value. Suppose we want to sort by the number of occurrences of "dr" plus the number of occurrences of "ea".

In [34]:
# Define a function to count occurrences of 'dr' and 'ea'
def dr_ea_count(string):
    return string.count('dr') + string.count('ea')
# Apply the function to each name in the "Name" column and add as a new column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)
# Sort the DataFrame by the new column in descending order
babynames = babynames.sort_values(by = "dr_ea_count", ascending=False)
# Display the top rows
babynames.head()

Unnamed: 0,State,Sex,Year,Name,Count,dr_ea_count
115965,CA,F,1990,Deandrea,5,3
311780,CA,M,1985,Deandrea,6,3
108738,CA,F,1988,Deandrea,5,3
131037,CA,F,1994,Leandrea,5,3
101982,CA,F,1986,Deandrea,6,3


In [35]:
# Drop the `dr_ea_count` column
babynames=babynames.drop(columns='dr_ea_count')
babynames.head()

Unnamed: 0,State,Sex,Year,Name,Count
115965,CA,F,1990,Deandrea,5
311780,CA,M,1985,Deandrea,6
108738,CA,F,1988,Deandrea,5
131037,CA,F,1994,Leandrea,5
101982,CA,F,1986,Deandrea,6
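
Approaches 2 and 3 can also be combined: calling `.map` inside `key` sorts by an arbitrary function without adding and dropping a column. This is a sketch on made-up toy data, not the approach used in lecture.

```python
import pandas as pd

def dr_ea_count(string):
    # Number of "dr" occurrences plus number of "ea" occurrences
    return string.count("dr") + string.count("ea")

# Made-up toy data for illustration
toy = pd.DataFrame({"Name": ["Deandrea", "Mary", "Andrea", "Jean"],
                    "Count": [5, 295, 40, 35]})

# key applies dr_ea_count to every name; the rows are sorted by those values
sorted_toy = toy.sort_values("Name", key=lambda s: s.map(dr_ea_count),
                             ascending=False)

print(sorted_toy["Name"].tolist())
```

The columns of `sorted_toy` are still just `Name` and `Count`, so there is nothing to drop afterwards.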


## Grouping

Group rows that share a common feature, then aggregate data across the group.

In this example, we count the total number of babies born in each year (considering only a small subset of the data, for simplicity).

<img src="images/groupby.png" width="800"/>

In [36]:
# DataFrame with baby girl names only
babynames_girl=babynames[babynames['Sex']=='F']

# Answer Here
# Group by year and sum the counts within each group
# Passing the string "sum" instead of np.sum avoids a pandas FutureWarning
babynames_girl=babynames_girl.groupby('Year')['Count'].agg("sum")
# Answer Here
# Sort by Count in descending order
babynames_girl_sorted = babynames_girl.sort_values(ascending=False)
babynames_girl_sorted


Year
1990    262422
1991    261497
1992    256788
1993    249572
1989    243985
         ...  
1914     13815
1913     11860
1912      9804
1911      6602
1910      5950
Name: Count, Length: 114, dtype: int64

In [37]:
# print first 10 entries
babynames_girl_sorted.head(10)

Year
1990    262422
1991    261497
1992    256788
1993    249572
1989    243985
1994    242484
2007    236219
2006    234748
1995    234581
2005    230396
Name: Count, dtype: int64

In [38]:
# the total baby count in each year
# Answer Here
total_baby_count_per_year = babynames.groupby('Year')['Count'].sum()
total_baby_count_per_year

Year
1910      9163
1911      9983
1912     17946
1913     22094
1914     26926
         ...  
2019    387325
2020    363307
2021    363206
2022    361960
2023    342550
Name: Count, Length: 114, dtype: int64

There are many different aggregation functions we can use, such as `"sum"`, `"min"`, `"max"`, `"mean"`, `"count"`, `"first"`, and `"last"`; which one is appropriate depends on the question being asked.

In [39]:
# What is the earliest year in which each name appeared?
# Answer Here
earliest_year=babynames.groupby('Name')['Year'].agg("min")
earliest_year


Name
Aadan      2008
Aadarsh    2019
Aaden      2007
Aadhav     2014
Aadhini    2022
           ... 
Zymir      2020
Zyon       1999
Zyra       2012
Zyrah      2011
Zyrus      2021
Name: Year, Length: 20629, dtype: int64

In [40]:
# What is the most recent year in which each name appeared?
# Answer Here
latest_year=babynames.groupby('Name')['Year'].agg("max")
latest_year


Name
Aadan      2014
Aadarsh    2019
Aaden      2020
Aadhav     2019
Aadhini    2022
           ... 
Zymir      2020
Zyon       2023
Zyra       2023
Zyrah      2020
Zyrus      2021
Name: Year, Length: 20629, dtype: int64
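
To get the largest single-year *count* of each name, the aggregation has to run on the `Count` column rather than `Year`. A minimal sketch on made-up toy data; it also shows that `.agg` accepts a list of function names to compute several aggregates at once.

```python
import pandas as pd

# Made-up toy data for illustration
toy = pd.DataFrame({
    "Name":  ["Mary", "Mary", "Mary", "Zyra", "Zyra"],
    "Year":  [1910, 1911, 1912, 2012, 2023],
    "Count": [295, 390, 534, 5, 9],
})

# Largest single-year count per name: aggregate "Count", not "Year"
largest_count = toy.groupby("Name")["Count"].agg("max")

# Several aggregations of "Year" at once: first and latest appearance
year_range = toy.groupby("Name")["Year"].agg(["min", "max"])

print(largest_count)
print(year_range)
```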

In [None]:
# Can you find the most popular baby name in the state of California (CA) for each year? Use the idxmax function.
# Provide a list of years along with the corresponding most popular names.
# Answer Here
# idxmax returns, for each year, the index label of the row with the largest Count
result = babynames.groupby('Year')['Count'].idxmax()
popular_baby_names = babynames.loc[result, ['Year', 'Name']]
popular_baby_names
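
The pattern works because `idxmax` returns index *labels*, which is exactly what `.loc` expects. A minimal sketch on made-up toy data, with a deliberately non-default index to make the labels visible:

```python
import pandas as pd

# Made-up toy data for illustration; the index labels are arbitrary
toy = pd.DataFrame({
    "Year":  [1910, 1910, 1911, 1911],
    "Name":  ["Mary", "Helen", "Helen", "Mary"],
    "Count": [295, 239, 400, 390],
}, index=[10, 11, 12, 13])

# For each year, the index label of the row with the largest Count
top_labels = toy.groupby("Year")["Count"].idxmax()

# .loc looks those rows up by label
most_popular = toy.loc[top_labels, ["Year", "Name"]]

print(top_labels.tolist())
print(most_popular)
```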

## Case Study: Name "Popularity"

In this exercise, let's find the name with sex "F" that has dropped most in popularity since its peak usage. We'll start by filtering `babynames` to only include names corresponding to sex "F".

In [3]:
#Answer Here
f_babynames=babynames[babynames['Sex']=='F']
f_babynames

Unnamed: 0,State,Sex,Year,Name,Count
0,CA,F,1910,Mary,295
1,CA,F,1910,Helen,239
2,CA,F,1910,Dorothy,220
3,CA,F,1910,Margaret,163
4,CA,F,1910,Frances,134
...,...,...,...,...,...
243185,CA,F,2023,Zeppelin,5
243186,CA,F,2023,Zhamira,5
243187,CA,F,2023,Zina,5
243188,CA,F,2023,Zooey,5


In [4]:
# We sort the data by year and keep the result (sort_values returns a new DataFrame)
f_babynames = f_babynames.sort_values(['Year'])
f_babynames

Unnamed: 0,State,Sex,Year,Name,Count
0,CA,F,1910,Mary,295
148,CA,F,1910,Merle,9
149,CA,F,1910,Rosalie,9
150,CA,F,1910,Rosie,9
151,CA,F,1910,Teresa,9
...,...,...,...,...,...
240783,CA,F,2023,Zayna,22
240784,CA,F,2023,Aashvi,21
240785,CA,F,2023,Aida,21
240759,CA,F,2023,Eimy,22


To build our intuition on how to answer our research question, let's visualize the prevalence of the name "Jennifer" over time.

In [5]:
# We'll talk about how to generate plots in a later lecture
fig = px.line(f_babynames[f_babynames["Name"] == "Jennifer"],
              x = "Year", y = "Count")
fig.update_layout(font_size = 18,
                  autosize=False,
                 width=1000,
                  height=400)

We'll need a mathematical definition for the change in popularity of a name.

Define the metric "ratio to peak" (RTP). We'll calculate this as the count of the name in 2023 (the most recent year for which we have data) divided by the largest count of this name in *any* year.

A demo calculation for Jennifer:

In [6]:
# Find the highest Jennifer 'count'
max_count_j=f_babynames[f_babynames['Name']=='Jennifer']['Count'].max()
max_count_j

6065

In [7]:
# Remember that we sorted f_babynames by year.
# This means that grabbing the final entry gives us the most recent count of Jennifers.
# In 2023, the most recent year for which we have data, 88 Jennifers were born
current_count_j=f_babynames[f_babynames['Name']=='Jennifer']['Count'].iloc[-1]
current_count_j

88

In [8]:
# Compute the RTP
rtp=current_count_j/max_count_j
rtp

0.014509480626545754

We can also write a function that produces the `ratio_to_peak` for a given `Series`. This will allow us to use `.groupby` to speed up our computation for all names in the dataset.

In [9]:
# define the function for RTP
def ratio_to_peak(series):
    """
    Compute the RTP for a Series containing the counts per year for a single name.
    Assumes the Series is ordered by year, so the last entry is the most recent count.
    """
    return series.iloc[-1]/series.max()

In [10]:
# Construct a Series containing our Jennifer count data
jen_rtp_count=f_babynames[f_babynames['Name']=='Jennifer']['Count']
# Then, find the RTP using the function defined above
ratio_to_peak(jen_rtp_count)

0.014509480626545754

Now, let's use `.groupby` to compute the RTPs for *all* names in the dataset.

Depending on your `pandas` version, the cells below behave differently. As discussed in lecture, `pandas` can't apply our aggregation function to non-numeric data (it doesn't make sense to divide "CA" by a number). Older versions of `pandas` issued a warning and dropped any columns that could not be aggregated; recent versions raise a `TypeError` instead.

In [11]:
# Restricting to the numeric columns runs without error, but only the "Count" ratio is meaningful
rtp_table = f_babynames.groupby('Name')[['Year','Count']].agg(ratio_to_peak)
rtp_table

Name,Year,Count
Aadhini,1.0,1.000000
Aadhira,1.0,0.500000
Aadhya,1.0,0.760000
Aadya,1.0,0.758621
Aahana,1.0,0.269231
...,...,...
Zyanya,1.0,0.800000
Zyla,1.0,1.000000
Zylah,1.0,1.000000
Zyra,1.0,1.000000


In [12]:
# Find the RTP for all names at once using groupby, as described in the lecture slides
# This raises a TypeError because the "State" and "Sex" columns contain strings
f_babynames.groupby("Name").agg(ratio_to_peak)

TypeError: unsupported operand type(s) for /: 'str' and 'str'

To avoid the error above, we explicitly extract only the columns relevant to our analysis before using `.agg`.

In [13]:
# Recompute the RTPs, but only performing the calculation on the "Count" column
rtp_table = f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
rtp_table

Name,Count
Aadhini,1.000000
Aadhira,0.500000
Aadhya,0.760000
Aadya,0.758621
Aahana,0.269231
...,...
Zyanya,0.800000
Zyla,1.000000
Zylah,1.000000
Zyra,1.000000
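
To see what `.groupby("Name")[["Count"]].agg(ratio_to_peak)` does group by group, here is a minimal sketch on made-up toy data with two invented names. Each name's counts are passed to the function as one `Series`, and the function's return value becomes that name's row in the result.

```python
import pandas as pd

def ratio_to_peak(series):
    # Most recent value divided by the largest value (assumes rows are in year order)
    return series.iloc[-1] / series.max()

# Made-up toy data for illustration
toy = pd.DataFrame({
    "Name":  ["Ann", "Bea", "Ann", "Bea", "Ann", "Bea"],
    "Year":  [2021, 2021, 2022, 2022, 2023, 2023],
    "Count": [100, 10, 50, 20, 25, 40],
})

# Sort by year first so that iloc[-1] is the most recent count within each group
rtp_toy = toy.sort_values("Year").groupby("Name")[["Count"]].agg(ratio_to_peak)

print(rtp_toy)
```

Here "Ann" has fallen to a quarter of her peak (25 / 100), while "Bea" is currently at her peak (40 / 40).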


In [14]:
# Rename "Count" to "Count RTP" for clarity
rtp_table=rtp_table.rename(columns={'Count':'Count RTP'})

In [15]:
# What name has fallen the most in popularity?
rtp_table=rtp_table.sort_values("Count RTP")
max_name=rtp_table.head(1).index[0]
max_name

'Debra'
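
If only the single lowest value is needed, sorting the whole table is optional. As an alternative sketch on made-up toy data, `idxmin` returns the index label of the smallest value directly, and `nsmallest` returns the lowest few entries:

```python
import pandas as pd

# Made-up toy RTP table for illustration, indexed by name
rtp_toy = pd.DataFrame({"Count RTP": [0.25, 1.0, 0.6]},
                       index=pd.Index(["Ann", "Bea", "Cy"], name="Name"))

# Index label (the name) of the smallest RTP
lowest_name = rtp_toy["Count RTP"].idxmin()

# The two smallest RTPs, in ascending order
lowest_two = rtp_toy["Count RTP"].nsmallest(2)

print(lowest_name)
print(lowest_two)
```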

We can visualize the decrease in the popularity of the name "Debra":

In [16]:
def plot_name(*names):
    fig = px.line(f_babynames[f_babynames["Name"].isin(names)],
                  x = "Year", y = "Count", color="Name",
                  title=f"Popularity for: {names}")
    fig.update_layout(font_size = 18,
                  autosize=False,
                  width=1000,
                  height=400)
    return fig
# pass the name into plot_name
plot_name(max_name)

In [17]:
# Find the 10 names that have decreased the most in popularity
# Answer Here
# head(10) already keeps the 10 smallest RTPs, so use the whole index
top10 = rtp_table.sort_values("Count RTP").head(10).index

In [18]:
plot_name(*top10)

For fun, try plotting your name or your friends' names.

In [19]:
plot_name('Zyrah')