
## COMP 341: Practical Machine Learning
## Homework assignment 1: exploring baby names

### Due: Tuesday, September 9 at 11:59pm on Gradescope

The goal of this homework is to analyze the baby name data from the US Social Security Administration (SSA). We will explore various facets of the data and see if there is enough information to predict a person's age given only their first name. While we will not be using an ML model just yet, we will use data science skills to explore interesting patterns.

Fill in the missing code following `# TODO:` comments or `####### YOUR CODE HERE ########` blocks and be sure to answer the short answer questions marked with `[WRITE YOUR ANSWER HERE]` in the text.

All code in this notebook must run sequentially, so make sure things work in order! Be sure to also use good coding practices (e.g., logical variable names, comments as needed, etc.), and make plots that are clear and readable (e.g., with legible axes).

For this assignment, **15 points** are allocated for general coding and formatting:
* **5 points** for coding style
* **5 points** for code flow (accurate results when everything is run sequentially)
* **5 points** for additional style guidelines listed below

Additional style guidelines:
* Make sure to rename your .ipynb file to include your netid in the file name: `netid-hw1.ipynb`
* For any TODO cell, make sure to include that cell's output in the .ipynb file that you submit. Many text editors have an option to clear cell outputs which is useful for getting a blank slate and running everything beginning-to-end, but always be sure to run the notebook before submitting and ensure that every cell has an output.
* When displaying DataFrames, please do not include `.head()` or `.tail()` calls unless specifically asked to. Simply removing these calls lets us see both the beginning and end of your DataFrames, which helps us ensure the data is processed properly. Notebooks show only the beginning and end by default, so you don't have to worry about long outputs here.
* Please use the column names specified in the assignment, and please avoid any sorting not specified in the instructions.
* For plots, please ensure you have included axis labels, legends, and titles.
* To format your short answer responses nicely, we recommend either **bolding** or *italicizing* your answer, or formatting it ```as a code block```.
* Generally, please keep your notebook cells to one solution per cell, and preserve the order of the questions asked.
* Finally, this can be harder to check/control and depends on which plotting libraries you prefer, but it would be helpful to limit the size/resolution of plot images in the notebook. Our grading platform has an upper limit on submission sizes it can display, and high-res plots are the usual culprit when submissions are hidden or truncated.

**Important note:** For this assignment only, since we are learning to use pandas methods, from Part 1 onwards, you *should not* use for loops at any point in the assignment. If you have to use a for loop to solve one of the questions, it will result in a **-5 point** deduction.

### Setup

First, we need to import some libraries that are necessary to complete the assignment.

In [1]:
import pandas as pd
import numpy as np

Here, we have also suggested some other modules/libraries to import that can make your life easier:
*   [glob](https://docs.python.org/3/library/glob.html) is useful for pattern-based pathname expansion, which can come in handy when reading in the data

Feel free to also add your own here (rather than wherever you first use them below). For example, you will likely need to import a plotting library (e.g., plotnine, seaborn).

In [2]:
import glob


### Part 0: Read in Data [12 pts]

Run the following code, which uses common command line tools (Jupyter knows that it is command line and not Python based on the `!` at the beginning of the command) to download and unzip the national data linked on the [SSA website](https://www.ssa.gov/oact/babynames/limits.html) into your workspace.

In [3]:
# downloading the data
!wget https://www.ssa.gov/oact/babynames/names.zip
# unzipping the data quietly to a "names" directory
!unzip -q names.zip -d names

--2025-08-28 01:57:51--  https://www.ssa.gov/oact/babynames/names.zip
Resolving www.ssa.gov (www.ssa.gov)... 23.218.93.82, 23.218.93.66, 2600:1408:c400:11::17cd:6b52, ...
Connecting to www.ssa.gov (www.ssa.gov)|23.218.93.82|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7726516 (7.4M) [application/zip]
Saving to: ‘names.zip’


2025-08-28 01:57:52 (16.2 MB/s) - ‘names.zip’ saved [7726516/7726516]



Now, we first need to check how the SSA has structured its data by looking at what we have unzipped. (Remember you need the `!` at the beginning of the command to denote that it is a command line command and not a Python command.)

In [4]:
# TODO: use the command line to list the contents of the "names" directory [1 pt]
!ls names/

NationalReadMe.pdf  yob1909.txt  yob1939.txt  yob1969.txt  yob1999.txt
yob1880.txt	    yob1910.txt  yob1940.txt  yob1970.txt  yob2000.txt
yob1881.txt	    yob1911.txt  yob1941.txt  yob1971.txt  yob2001.txt
yob1882.txt	    yob1912.txt  yob1942.txt  yob1972.txt  yob2002.txt
yob1883.txt	    yob1913.txt  yob1943.txt  yob1973.txt  yob2003.txt
yob1884.txt	    yob1914.txt  yob1944.txt  yob1974.txt  yob2004.txt
yob1885.txt	    yob1915.txt  yob1945.txt  yob1975.txt  yob2005.txt
yob1886.txt	    yob1916.txt  yob1946.txt  yob1976.txt  yob2006.txt
yob1887.txt	    yob1917.txt  yob1947.txt  yob1977.txt  yob2007.txt
yob1888.txt	    yob1918.txt  yob1948.txt  yob1978.txt  yob2008.txt
yob1889.txt	    yob1919.txt  yob1949.txt  yob1979.txt  yob2009.txt
yob1890.txt	    yob1920.txt  yob1950.txt  yob1980.txt  yob2010.txt
yob1891.txt	    yob1921.txt  yob1951.txt  yob1981.txt  yob2011.txt
yob1892.txt	    yob1922.txt  yob1952.txt  yob1982.txt  yob2012.txt
yob1893.txt	    yob1923.txt  yob1953.txt  yob1983.txt  yob

The included `NationalReadMe.pdf` file describes how the data is organized. We have taken the liberty of summarizing the main points:
* data for each year is included as a separate file and clearly indicated in the filename
* the data spans all years between 1880-2024
* each file is comma-delimited
* each row provides a baby name, whether the name was given to a male (M) or female (F) baby, and corresponding number of births for that year
* only names with at least 5 births for the corresponding year are listed in its file (for privacy reasons)

We can quickly take a look at the top of the file for baby names in 1880.

In [5]:
# TODO: use the command line to:
# (1) check the top of the file of baby names from 1880 [1 pt]
!head names/yob1880.txt
# (2) check how many names were recorded in 1880 [1 pt]
!wc -l names/yob1880.txt

Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288
2000 names/yob1880.txt


And also peruse the bottom of the file for baby names in 2024.

In [None]:
# TODO: use the command line to:
# (1) check the bottom of the file of baby names from 2024 [1 pt]
!tail names/yob2024.txt
# (2) check how many names were recorded in 2024 [1 pt]
!wc -l names/yob2024.txt

Now that we have a reasonably good understanding of how the data is structured, it is time to read in the data from all of the different years and combine everything into a single pandas DataFrame for further analysis.

Remember to check that the DataFrame is formatted as you expect, to use intuitive column names and data types, and to add a column that keeps track of the year each row came from!
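
As a rough illustration of the read-and-combine pattern described above, here is a minimal sketch that runs on two tiny made-up files rather than the real download. The temporary directory, the toy counts, the helper `read_one_year`, and the column names `name`, `sex`, `births`, and `year` are all assumptions chosen for the example, not names required by the assignment.

```python
import glob
import os
import re
import tempfile

import pandas as pd

# Build a tiny stand-in for the "names" directory so the sketch runs anywhere.
# Both files and their counts are made up for illustration.
demo_dir = tempfile.mkdtemp()
toy_files = {
    "yob1880.txt": "Mary,F,10\nJohn,M,8\n",
    "yob1881.txt": "Mary,F,12\nJohn,M,9\n",
}
for filename, contents in toy_files.items():
    with open(os.path.join(demo_dir, filename), "w") as handle:
        handle.write(contents)


def read_one_year(path):
    """Read one yobYYYY.txt file and tag every row with its year (hypothetical helper)."""
    year = int(re.search(r"yob(\d{4})\.txt$", path).group(1))
    # The files have no header row, so column names are supplied explicitly.
    frame = pd.read_csv(path, header=None, names=["name", "sex", "births"])
    frame["year"] = year
    return frame


# sorted() puts the years in chronological order; glob itself makes no ordering promise.
paths = sorted(glob.glob(os.path.join(demo_dir, "yob*.txt")))
names_demo = pd.concat([read_one_year(path) for path in paths], ignore_index=True)
print(names_demo)
```

Taking the year from the filename keeps each row traceable to its source file, and `ignore_index=True` gives the combined table one continuous row index instead of restarting at 0 for every year.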

In [None]:
# TODO: read in data from all years into a single DataFrame [5 pts]


**Short Answer Question:** Is this dataset tidy? Why or why not? [2 pts]

`[WRITE YOUR ANSWER HERE]`

### Part 1: Sanity Checks [8 pts]

A good habit to get into is checking your DataFrames; evaluating if they match expectation early and often can help you spot errors before they pop up later. Here, we walk through some sanity checks. You can use them or others to compare with your command line output earlier. If there is anything unexpected, you can go back and update your code for reading in the data and rerun the commands (it is also okay if everything is as you expect!).
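
To make the idea of a sanity check concrete, the sketch below runs a few on a made-up five-row table. The column names and counts are assumptions for illustration only; the point is that each check has a command line counterpart you can compare against.

```python
import pandas as pd

# Made-up rows in an assumed name/sex/births/year layout.
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "births": [10, 6, 8, 12, 9],
    "year": [1880, 1880, 1880, 1881, 1881],
})

print(toy.shape)   # (rows, columns); rows should equal the summed `wc -l` counts
print(toy.dtypes)  # counts and years should be numeric, not strings

# Rows per year, directly comparable to `wc -l` on each yearly file.
rows_per_year = toy.groupby("year").size()
print(rows_per_year)

# Largest counts within one year: filter first, then take the top rows.
top_1880 = toy[toy["year"] == 1880].nlargest(2, "births")
print(top_1880)
```

If the per-year row counts disagree with `wc -l`, a common culprit is a first data row being read as a header.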

In [None]:
# TODO: use pandas to check the dimensions of the DataFrame [1 pt]


In [None]:
# TODO: use pandas to check the top of the DataFrame [1 pt]


In [None]:
# TODO: use pandas to check the bottom of the DataFrame [1 pt]


In [None]:
# TODO: use pandas to check the data types for each column in the DataFrame [1 pt]


In [None]:
# TODO: use pandas to look at the top 10 most popular baby names in the year 1880 [1 pt]


**Short Answer Question:** Were these sanity checks helpful to check whether your DataFrame was read in and processed as you expect? Did they lead to you making any design decisions differently for reading in your data? You can suggest / run simple variations that you find more helpful if you wish as part of your answer. [3 pts]

`[WRITE YOUR ANSWER HERE]`

### Part 2: Search for General Patterns [10 pts]

Let's check out this baby data in the most general way possible: the total number of births per year. In other words, we'll take a look at population growth trends across time.

We will do this by plotting the number of births for each year represented in our data (1880-2024), with different colors for the sex of the baby.

Make sure that your axes are labeled clearly, the figure is sized appropriately, and in general, that your figure makes it easy to examine the number of births for any particular year of interest (one key component of this is the density of your tick labels).
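
The aggregation behind such a plot can be sketched on toy data as follows. The column names and counts are assumptions for illustration; the pattern is a `groupby` on year and sex followed by a sum.

```python
import pandas as pd

# Made-up rows in an assumed name/sex/births/year layout.
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "births": [10, 6, 8, 12, 9],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# One row per (year, sex) pair, with births summed across all names.
births_by_year_sex = toy.groupby(["year", "sex"], as_index=False)["births"].sum()
print(births_by_year_sex)
```

Keeping the result in this long layout (one row per year and sex) tends to suit plotting libraries such as seaborn or plotnine, where a column like `sex` can be mapped to color.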

In [None]:
# TODO: taking the DataFrame from Part 0-1, build a new DataFrame that
# has the total number of births in that year (separated by sex, but regardless of name)
# and display the resultant DataFrame [3 pts]


In [None]:
# TODO: in a single figure, plot total births each year by sex (Male/Female),
# with sex represented by color [5 pts]


**Short Answer Question:** Do you notice any interesting patterns across years? Do they relate to historical events? [2 pts]

`[WRITE YOUR ANSWER HERE]`

### Part 3: Sex-associated biases in names [55 pts]

There are so many interesting questions we can explore using this baby names dataset. One of these is the trends of sex biases in baby names: are certain names more often male-associated, female-associated, or assigned to both male and female babies (more "neutral")? Do these trends change over time? Does historical context affect this?

To explore this question, we will do some more DataFrame manipulations and visualize some of the most popular "biased" and "neutral" names, but first, we motivate this question by exploring how some names that we associate with a specific sex might not have always been that way. Consider the name Ruth, which is now more often associated with women. In our data, are there any boys named Ruth? Let's investigate.

In [None]:
# TODO: make a plot that shows the number of boys named Ruth over time [3 pts]


In [None]:
# TODO: which year had the most boys born named Ruth, and how many were there in that year? [1 pt]


**Short Answer Question:** Do you think the famous baseball player, Babe Ruth, had any influence on boys named Ruth? Comment on this. [1 pt]



`[WRITE YOUR ANSWER HERE]`


In [None]:
# TODO: manipulate your original DataFrame of all data so that for each name,
# there is one column each for the # female and # male babies born per year [1 pt]


In [None]:
# TODO: what is the shape of your new DataFrame? [1 pt]


**Short Answer Question:** The number of rows in your new DataFrame should be different from what you found in your original one in Part 1.

What is the difference? Explain what this difference tells you about the naming patterns of male and female babies in the US. [2 pts]

`[WRITE YOUR ANSWER HERE]`

In [None]:
# TODO: if there is a year where no babies of a particular sex were born with a given name,
# make sure those cases record a baby count of 0 (instead of a missing value) [1 pt]


In [None]:
# TODO: check to see if the table looks correct by checking how many
# "Lily"s were born in 1900 [1 pt]


Since our dataset includes all names with at least 5 people born with that name per year, there are some very niche names. We want to filter out some of the most uncommon names from our downstream exploration.
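
One way to picture the reshaping and filtering steps in this part is the sketch below, which uses made-up counts and a toy threshold of 10 in place of 20,000. The variable names (`long_toy`, `wide_toy`, `filtered_toy`) and the column names are assumptions for illustration.

```python
import pandas as pd

# Made-up long-format rows: one row per (name, sex, year).
long_toy = pd.DataFrame({
    "name": ["Mary", "John", "Mary", "Mary", "John", "Zed"],
    "sex": ["F", "M", "M", "F", "M", "M"],
    "births": [10, 8, 1, 12, 9, 2],
    "year": [1880, 1880, 1880, 1881, 1881, 1881],
})

# One row per (name, year) and one column per sex; absent combinations become 0.
wide_toy = long_toy.pivot_table(
    index=["name", "year"], columns="sex", values="births",
    aggfunc="sum", fill_value=0,
).reset_index()
wide_toy.columns.name = None  # drop the leftover "sex" label on the column axis

# Totals per name across all years (three columns: name, F, M).
totals = wide_toy.groupby("name", as_index=False)[["F", "M"]].sum()

# Keep a name if either sex reaches the threshold (toy value, not 20,000).
THRESHOLD = 10
is_common = (totals["F"] >= THRESHOLD) | (totals["M"] >= THRESHOLD)
common_names = totals.loc[is_common, "name"]
filtered_toy = wide_toy[wide_toy["name"].isin(common_names)]

n_filtered_out = totals["name"].nunique() - common_names.nunique()
print(filtered_toy)
print(n_filtered_out)
```

Filtering with `isin` keeps every yearly row for a surviving name, including years in which that name was rare.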

In [None]:
# TODO: using the DataFrame that you have just constructed here in Part 3,
# build a new DataFrame that gives the total number of female and male babies per name
# and display this new DataFrame [2 pts]
# (hint: this DataFrame should have only 3 columns)


In [None]:
# TODO: using the 2 DataFrames you have just constructed in Part 3 of the assignment,
# make a new DataFrame called `names_filtered`:
# names_filtered should have all the individual year level data for female and male babies born,
# but only if there were at least 20,000 occurrences of that name within a particular sex across the whole dataset
# (i.e., there have been at least 20,000 male or 20,000 female babies that have been given that name) [2 pts]


How many names were filtered out?

In [None]:
# TODO: calculate how many total names were filtered out [1 pt]


**Short Answer Question:** Why might we choose to filter out uncommon names based on the whole dataset versus simply choosing a per year threshold for each name? [2 pts]

`[WRITE YOUR ANSWER HERE]`

In [None]:
# TODO: add 2 new columns in your filtered DataFrame:
# 'total': the total number of births per name per year
# 'sex_bias': the proportion of female babies per name per year
# (# female babies given the name / total # of babies given the name in that year)
# and look at the resultant DataFrame [2 pts]


We will now first explore some naming trends for "recently popular" names (as defined by the average popularity within the last 20 years).

In [None]:
# TODO: subset your 'names_filtered' DataFrame into a 'recent_names' DataFrame
# that includes only data after 2004 up through (and including) 2024 and
# display the resulting DataFrame [2 pts]


Now, we will collapse this dataset of the common names from only the last 20 years. Specifically, we will calculate the average number of babies (regardless of sex) assigned that name, as well as the average sex bias ratio.
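
The collapse described above can be sketched on toy data like this. The names, counts, and the `F` and `M` column labels are assumptions for illustration; `total` and `sex_bias` follow the definitions given earlier in this part.

```python
import pandas as pd

# Made-up wide-format rows: one row per (name, year), one column per sex.
wide_toy = pd.DataFrame({
    "name": ["Avery", "Avery", "Liam", "Liam"],
    "year": [2023, 2024, 2023, 2024],
    "F": [30, 50, 0, 0],
    "M": [10, 10, 40, 60],
})
wide_toy["total"] = wide_toy["F"] + wide_toy["M"]
wide_toy["sex_bias"] = wide_toy["F"] / wide_toy["total"]

# One command: average both columns within each name.
summary = wide_toy.groupby("name", as_index=False)[["total", "sex_bias"]].mean()
print(summary)
```

Note that this averages the yearly ratios, which is not the same as dividing the summed female births by the summed total; years with few births carry as much weight as years with many.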

In [None]:
# TODO: in one command, calculate the average of the total births and sex bias of births in `recent_names`
# and store the resultant DataFrame in a new DataFrame, `recent_names_summ` [2 pts]


In [None]:
# TODO: create a new column, `bias_categ` in your `recent_names_summ` DataFrame,
# where sex bias categories are defined as:
# average sex bias < 0.25 are assigned category 'male'
# average sex bias > 0.75 are assigned category 'female'
# everything else is assigned category 'neutral'
# and display your resultant DataFrame [3 pts]


Our next step is to create a DataFrame of the 3 most popular names within each of the "bias categories" (male, female, neutral) that we just calculated. The ranking we will use for most popular name is average number of total births within the last 20 years. The 'total' column you calculated earlier can come in handy for this.
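
A possible shape for the categorize-then-rank step, shown on made-up summary rows and taking the top 2 per category instead of 3 to keep the toy table small. The names, numbers, and the constant `TOP_N` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up per-name averages.
summ = pd.DataFrame({
    "name": ["Noa", "Liam", "Emma", "Owen", "Ava", "Eli", "Remy", "Mia"],
    "total": [50, 90, 80, 70, 60, 30, 20, 40],
    "sex_bias": [0.50, 0.00, 1.00, 0.01, 0.99, 0.02, 0.40, 0.98],
})

# Vectorized category assignment: conditions are checked in order,
# and anything matching neither condition falls through to the default.
conditions = [summ["sex_bias"] < 0.25, summ["sex_bias"] > 0.75]
summ["bias_categ"] = np.select(conditions, ["male", "female"], default="neutral")

# Sort so each category is contiguous with its largest totals first,
# then keep the first TOP_N rows of every category.
TOP_N = 2
top_names = (
    summ.sort_values(["bias_categ", "total"], ascending=[True, False])
    .groupby("bias_categ")
    .head(TOP_N)
)
print(top_names)
```

Because `groupby(...).head()` preserves the row order of the sorted frame, rows in the same category stay together and the most popular name in each category comes first.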

In [None]:
# TODO: make a single `top_recent_names` DataFrame that includes
# the top 3 baby names within each bias category (male, female, neutral), for 9 names total
# display the resultant DataFrame, where the rows in the same category are displayed together
# and the "most popular" name within each category is shown first [3 pts]


With these 9 names (top 3 within each bias category) in hand, we want to go back to our original `names_filtered` DataFrame including data from 1880.

In [None]:
# TODO: look at a subsetted version of the `names_filtered` DataFrame that only includes
# the 9 top_recent_names you just calculated in the previous step [1 pt]


In [None]:
# TODO: visualize how the 'sex_bias' of the 9 top recent names has changed from 1880-2024
# in a single plot; keep in mind the same criteria for clear visualizations mentioned in Part 2 [4 pts]


**Short Answer Question:** What are some of the conclusions you can draw about the ways in which biases in name assignment change across time? [3 pts]

`[WRITE YOUR ANSWER HERE]`

We will now do the same exercise for "old" names, aka names that were popular from 1900-1920 (inclusive).

In [None]:
# TODO: subset your `names_filtered` DataFrame into an `old_names` DataFrame
# that includes only data between the years of 1900 and 1920, including both 1900 and 1920 [2 pts]


In [None]:
# TODO: just as before, in one command, calculate the average of the total births and sex bias of births in `old_names`
# and store the resultant DataFrame in a new DataFrame, `old_names_summ` [2 pts]


# then create a new column, 'bias_categ' in your `old_names_summ` DataFrame,
# where bias categories are defined as:
# average sex bias < 0.25 are assigned category 'male'
# average sex bias > 0.75 are assigned category 'female'
# everything else is assigned category 'neutral'
# and display your resultant DataFrame [3 pts]


In [None]:
# TODO: make a single `top_old_names` DataFrame that includes the top 3 baby names within each bias category
# "top 3" is defined based on the names with highest average total births between 1900-1920
# display the resultant DataFrame, where the rows in the same category are displayed together
# and the "most popular" name within each category is shown first [3 pts]


In [None]:
# TODO: visualize how the 'sex_bias' of the 9 top old names has changed from 1880-2024
# keep in mind the same criteria for clear visualizations mentioned in Part 2 [4 pts]


**Short Answer Question:** How is this similar or different from what you saw in the trends with top recent names earlier? [3 pts]

`[WRITE YOUR ANSWER HERE]`

### Bonus: Which baby names stand the test of time? [Extra Credit: up to 10 points]

Another interesting thing to explore is which baby name can claim to be the "reigning champion" name for the longest stretch of time.

Each year, there is typically only one female and one male name that can claim to be the "most popular." Find which "most popular" male and female names had the longest run and show how many consecutive years they were able to claim the title of "most popular."

As with the rest of the assignment, it is best if you can solve this without using a loop, though note that if you have to use a loop and calculate the answer correctly, you can receive up to 50%, i.e., 5 points extra credit.
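
One loop-free way to measure consecutive runs is to give each run its own id with `shift` and `cumsum`, then group by that id. The sketch below uses made-up yearly winners for a single sex and assumes the rows are sorted by year with no missing years; the names and the `run_id` label are hypothetical.

```python
import pandas as pd

# Made-up "most popular name" per year for one sex, sorted by year.
champs = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003, 2004, 2005],
    "name": ["Ann", "Ann", "Bea", "Bea", "Bea", "Ann"],
})

# A new run starts whenever the name differs from the previous year's name;
# the cumulative sum of those change points labels each run with its own id.
run_id = (champs["name"] != champs["name"].shift()).cumsum().rename("run_id")

runs = champs.groupby(run_id).agg(
    name=("name", "first"),
    start=("year", "min"),
    length=("year", "count"),
)
longest = runs.loc[runs["length"].idxmax()]
print(runs)
print(longest)
```

Grouping by run id rather than by name matters here: "Ann" wins in three different years, but never more than two in a row, so her separate runs are counted separately.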

## To Submit
Download the notebook from Colab as a `.ipynb` notebook (`File > Download > Download .ipynb`), rename it to `netid-hw1.ipynb`, and upload it to the corresponding Gradescope assignment.