# Assignment 1: Data analysis with pandas

All questions in this assignment carry equal weight. You are encouraged to consult the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/).

What to submit: an HTML version of this notebook (use File -> Download as -> HTML from the menu) containing your solutions and answers to the questions. Please rename the file as follows: Assignment_1_*Name*_*Surname*.html.

In [1]:
import pandas as pd

In [2]:
# show matplotlib graphics inline
%matplotlib inline

## US Baby Names 1880-2023

In the exercises that follow, use the [US Baby Names 1880-2023](https://www.ssa.gov/oact/babynames/names.zip) data set (names.zip, 7MB) published by the United States Social Security Administration (SSA), the same data set that we used in the lecture.

![Popular Names by Birth Year](https://www.ssa.gov/oact/babynames/assets/images/myss.jpg)

<center>source: https://www.ssa.gov/oact/babynames/</center>

In [3]:
# 2023 is the last available year right now.
years = range(1880, 2024)

pieces = []
columns = ['names', 'sex', 'births']

for year in years:
    path = 'data/yob%d.txt' % year
    current = pd.read_csv(path, names=columns)
    
    current['year'] = year
    pieces.append(current)

# Concatenate everything into a single DataFrame.
# We have to pass ignore_index=True because we’re not interested in preserving the original row numbers.
df = pd.concat(pieces, ignore_index=True)

# rename the "names" column
df = df.rename(columns = {'names':'name'})
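
The effect of `ignore_index=True` in the loading cell can be illustrated without the real files. The following is a minimal sketch that assumes two tiny in-memory stand-ins for the yearly files (`toy_files` is a hypothetical name, and the 1881 rows are invented for illustration). It also passes `'name'` directly as a column label, which is one way to avoid the later rename.

```python
import io

import pandas as pd

# Toy stand-ins for two yearly files. The 1880 rows match the head() output
# shown in this notebook; the 1881 rows are made up for illustration only.
toy_files = {
    1880: "Mary,F,7065\nAnna,F,2604\n",
    1881: "Mary,F,10\nJohn,M,20\n",
}

pieces = []
for year, text in toy_files.items():
    # io.StringIO lets read_csv parse a string as if it were a file.
    current = pd.read_csv(io.StringIO(text), names=['name', 'sex', 'births'])
    current['year'] = year
    pieces.append(current)

# Without ignore_index, each piece keeps its own row numbers.
kept = pd.concat(pieces)
# With ignore_index=True, the result gets a fresh 0..n-1 index.
fresh = pd.concat(pieces, ignore_index=True)

print(kept.index.tolist())   # [0, 1, 0, 1]
print(fresh.index.tolist())  # [0, 1, 2, 3]
```

A repeated index is not an error, but label-based lookups such as `kept.loc[0]` would then return one row per file, which is rarely what is wanted here.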

### Exercise 1

How many boys named Michael were born between the years 1955 and 1957 (inclusive)? [271825]

In [6]:
print(df.head())

# Filter the DataFrame for boys named Michael born between 1955 and 1957
filtered_df = df[(df['name'] == 'Michael') & (df['sex'] == 'M') & (df['year'].between(1955, 1957))]

# Sum the births for the filtered DataFrame
total_births = filtered_df['births'].sum()

print(f"Total boys named Michael born between 1955 and 1957: {total_births}")



        name sex  births  year
0       Mary   F    7065  1880
1       Anna   F    2604  1880
2       Emma   F    2003  1880
3  Elizabeth   F    1939  1880
4     Minnie   F    1746  1880
Total boys named Michael born between 1955 and 1957: 271825


### Exercise 2

What was the number of births between 1941 and 1945 (inclusive)? Compare to the same period a decade later. [13333617, 19331055]

In [7]:
# Calculate total births between 1941 and 1945
births_1941_1945 = df[df['year'].between(1941, 1945)]['births'].sum()

# Calculate total births between 1951 and 1955
births_1951_1955 = df[df['year'].between(1951, 1955)]['births'].sum()

# Print the results
print(f"Total births between 1941 and 1945: {births_1941_1945}")
print(f"Total births between 1951 and 1955: {births_1951_1955}")


Total births between 1941 and 1945: 13333617
Total births between 1951 and 1955: 19331055
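
Because the two periods are computed with the same expression, the filter can be wrapped in a small helper. This is only a sketch: `births_between` is a hypothetical name, and the `toy` frame holds invented rows so the example runs without the SSA files. Note that `Series.between` includes both endpoints by default, which matches the "inclusive" wording of the exercise.

```python
import pandas as pd


def births_between(frame, first_year, last_year):
    """Sum births for all rows with first_year <= year <= last_year."""
    mask = frame['year'].between(first_year, last_year)
    return int(frame.loc[mask, 'births'].sum())


# Invented toy rows with the same columns as df in this notebook.
toy = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Ann', 'Bob'],
    'sex': ['F', 'M', 'F', 'M'],
    'births': [10, 20, 30, 40],
    'year': [1941, 1945, 1946, 1951],
})

first_period = births_between(toy, 1941, 1945)   # rows from 1941 and 1945
decade_later = births_between(toy, 1951, 1955)   # row from 1951 only

print(first_period, decade_later)
```

With the real data, the same helper could be called as `births_between(df, 1941, 1945)`.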


### Exercise 3

How many girls were named Emma and how many were named Sophia? [749903, 414202]

In [8]:
# Filter and sum births for girls named Emma
emma_births = df[(df['name'] == 'Emma') & (df['sex'] == 'F')]['births'].sum()

# Filter and sum births for girls named Sophia
sophia_births = df[(df['name'] == 'Sophia') & (df['sex'] == 'F')]['births'].sum()

# Print the results
print(f"Total girls named Emma: {emma_births}")
print(f"Total girls named Sophia: {sophia_births}")


Total girls named Emma: 749903
Total girls named Sophia: 414202


### Exercise 4

In which year did the boy name Tristan first appear? [1946]

In [9]:
# Filter the DataFrame for boys named Tristan
tristan_years = df[(df['name'] == 'Tristan') & (df['sex'] == 'M')]

# Find the first year Tristan appeared
first_year = tristan_years['year'].min()

print(f"The boy name Tristan first appeared in: {first_year}")


The boy name Tristan first appeared in: 1946


### Exercise 5

In which year did the name Woodie appear last? [1998]

In [10]:
# Filter the DataFrame for the name Woodie
woodie_years = df[df['name'] == 'Woodie']

# Find the last year Woodie appeared
last_year = woodie_years['year'].max()

print(f"The name Woodie last appeared in: {last_year}")


The name Woodie last appeared in: 1998
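
Exercises 4 and 5 both reduce to taking the minimum or maximum of `year` for a filtered subset. As a sketch of an alternative, a single `groupby` can produce the first and last year for every name at once. The `toy` frame and its names are invented for illustration.

```python
import pandas as pd

# Invented toy rows; "Robin" appears under both sexes on purpose.
toy = pd.DataFrame({
    'name': ['Robin', 'Lee', 'Robin', 'Lee', 'Robin'],
    'sex': ['M', 'M', 'F', 'M', 'M'],
    'births': [5, 6, 7, 8, 9],
    'year': [1900, 1946, 1950, 1990, 1998],
})

# First and last year per (name, sex) pair, as in Exercise 4.
span = toy.groupby(['name', 'sex'])['year'].agg(['min', 'max'])

# Last year per name regardless of sex, as in Exercise 5.
last_any_sex = toy.groupby('name')['year'].max()

print(span)
print(last_any_sex)
```

Whether to group by `sex` depends on the wording of the question: Exercise 4 asks about a boy name, while Exercise 5 asks about the name in general.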


### Exercise 6

How many children named Mary or John (one number!) were born after the year 2000? [350634]

In [11]:
# Filter for children named Mary or John born after the year 2000
mary_john_births = df[(df['name'].isin(['Mary', 'John'])) & (df['year'] > 2000)]['births'].sum()

print(f"Total children named Mary or John born after the year 2000: {mary_john_births}")


Total children named Mary or John born after the year 2000: 350634


### Exercise 7

How many unique boy names were given in the year 2022? Compare to the years 1972 and 1922. [14311, 5751, 4967]

In [12]:
# Get unique boy names for the year 2022
unique_boy_names_2022 = df[(df['sex'] == 'M') & (df['year'] == 2022)]['name'].nunique()

# Get unique boy names for the year 1972
unique_boy_names_1972 = df[(df['sex'] == 'M') & (df['year'] == 1972)]['name'].nunique()

# Get unique boy names for the year 1922
unique_boy_names_1922 = df[(df['sex'] == 'M') & (df['year'] == 1922)]['name'].nunique()

# Print the results
print(f"Unique boy names in 2022: {unique_boy_names_2022}")
print(f"Unique boy names in 1972: {unique_boy_names_1972}")
print(f"Unique boy names in 1922: {unique_boy_names_1922}")


Unique boy names in 2022: 14311
Unique boy names in 1972: 5751
Unique boy names in 1922: 4967
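
The three near-identical expressions above can also be collapsed into one grouped computation. The following sketch uses an invented `toy` frame; with the real data, `toy` would simply be replaced by `df`.

```python
import pandas as pd

# Invented toy rows with the same columns as df in this notebook.
toy = pd.DataFrame({
    'name': ['Al', 'Bo', 'Al', 'Cy', 'Di', 'Al'],
    'sex': ['M', 'M', 'M', 'M', 'F', 'M'],
    'births': [5, 6, 7, 8, 9, 10],
    'year': [1922, 1922, 1972, 2022, 2022, 2022],
})

# Keep only boys in the years of interest, then count distinct names per year.
boys = toy[(toy['sex'] == 'M') & toy['year'].isin([1922, 1972, 2022])]
unique_per_year = boys.groupby('year')['name'].nunique()

print(unique_per_year)
```

The result is a Series indexed by year, so the three counts can be compared or plotted directly.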


### Exercise 8

How many unique girl names were given between the years 2000 and 2009 (inclusive)? Compare to the periods 1950-1959 and 1900-1909. [35769, 11658, 3829]

In [13]:
# Unique girl names between 2000 and 2009
unique_girl_names_2000_2009 = df[(df['sex'] == 'F') & (df['year'].between(2000, 2009))]['name'].nunique()

# Unique girl names between 1950 and 1959
unique_girl_names_1950_1959 = df[(df['sex'] == 'F') & (df['year'].between(1950, 1959))]['name'].nunique()

# Unique girl names between 1900 and 1909
unique_girl_names_1900_1909 = df[(df['sex'] == 'F') & (df['year'].between(1900, 1909))]['name'].nunique()

# Print the results
print(f"Unique girl names between 2000 and 2009: {unique_girl_names_2000_2009}")
print(f"Unique girl names between 1950 and 1959: {unique_girl_names_1950_1959}")
print(f"Unique girl names between 1900 and 1909: {unique_girl_names_1900_1909}")


Unique girl names between 2000 and 2009: 35769
Unique girl names between 1950 and 1959: 11658
Unique girl names between 1900 and 1909: 3829
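
One pitfall with multi-year periods is that the number of unique names in a period is not the sum of the per-year unique counts, because most names recur from year to year. A small sketch with invented rows makes the difference visible.

```python
import pandas as pd

# Invented toy rows: "Ava" appears in both years.
toy = pd.DataFrame({
    'name': ['Ava', 'Mia', 'Ava', 'Zoe'],
    'sex': ['F', 'F', 'F', 'F'],
    'births': [3, 4, 5, 6],
    'year': [2000, 2000, 2001, 2001],
})

girls = toy[(toy['sex'] == 'F') & toy['year'].between(2000, 2001)]

# Correct for this exercise: distinct names over the whole period.
period_unique = girls['name'].nunique()

# Not the same thing: per-year counts added up, which counts "Ava" twice.
summed_per_year = girls.groupby('year')['name'].nunique().sum()

print(period_unique, summed_per_year)
```

This is why the solution above calls `nunique()` once on the filtered period rather than aggregating yearly counts.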
