In [1]:
# HIDDEN
Base.displaysize() = (5, 80)
using DataFrames
using CSV

In [2]:
# HIDDEN
students = CSV.read("roster.csv", DataFrame)

# Replace the Name column with a lowercased copy
students.Name = lowercase.(students.Name)

## What's in a Name?

So far, we have asked a broad question about our data: "Do the first names of students in Data 100 tell us anything about the class?"

We have cleaned our data by converting all our names to lowercase. During our exploratory data analysis, we discovered that our roster contains about 270 names of students in the class and on the waitlist. Most of our first names are between 4 and 8 characters long.

What else can we discover about our class based on their first names? We might consider a single name from our dataset:

In [3]:
students.Name[6]

"jerry"

From this name, we can infer that the student is likely male. We can also guess the student's age. For example, if we happen to know that Jerry was a very popular baby name in 1998, we might guess that this student is around twenty years old.

This thinking gives us two new questions to investigate:

1. "Do the first names of students in Data 100 tell us the distribution of sex in the class?"
1. "Do the first names of students in Data 100 tell us the distribution of ages in the class?"

To investigate these questions, we will need a dataset that associates names with sex and year. Conveniently, the US Social Security Administration hosts such a dataset online ([https://www.ssa.gov/oact/babynames/index.html](https://www.ssa.gov/oact/babynames/index.html)). Their dataset records the names given to babies at birth and is thus often referred to as the Baby Names dataset.
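
To make the reasoning about a single name concrete, here is a minimal sketch. It assumes a dataset with one record per name, sex, and year, together with a count of babies. The records below and the helper names `totals_by` and `most_common` are invented for illustration and do not come from any real data.

```julia
# Hypothetical records: the counts are made up for illustration only.
records = [
    (name = "jerry", sex = "M", year = 1997, count = 300),
    (name = "jerry", sex = "M", year = 1998, count = 450),
    (name = "jerry", sex = "F", year = 1998, count = 20),
    (name = "jerry", sex = "M", year = 1999, count = 280),
]

# Sum `count` over the records for `name`, grouped by the field `key`.
function totals_by(records, name, key)
    totals = Dict{Any,Int}()
    for r in records
        r.name == name || continue
        k = getproperty(r, key)
        totals[k] = get(totals, k, 0) + r.count
    end
    return totals
end

# Return the group with the largest total, or `nothing` if the name is absent.
function most_common(records, name, key)
    entries = collect(totals_by(records, name, key))
    isempty(entries) && return nothing
    return first(entries[argmax(last.(entries))])
end
```

With these made-up records, `most_common(records, "jerry", :sex)` returns `"M"` and `most_common(records, "jerry", :year)` returns `1998`, which mirrors the guesses we made by hand.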

We will start by downloading and then loading the dataset into Julia. Again, don't worry about understanding the code in this chapter; focus instead on understanding the overall process.

[zipfile]: https://en.wikipedia.org/wiki/Zip_(file_format)

In [4]:
using CSV
using DataFrames
using ZipFile

data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "babynames.zip"

# Download the archive only if it is not already in the working directory
if !isfile(local_filename)
    run(`curl --output $local_filename $data_url`)
end

# Read every yearly .txt file in the archive into a single DataFrame
function read_txt_files_into_dataframe(zip_path)
    reader = ZipFile.Reader(zip_path)
    column_names = [:Name, :Sex, :Count]
    yearly = DataFrame[]

    for file in reader.files
        if endswith(file.name, ".txt")
            # Recent CSV.jl versions take the sink type as the second argument
            df = CSV.read(file, DataFrame; header=column_names)
            # The four characters before ".txt" give the year of birth
            df[!, :Year] .= parse(Int, file.name[end-7:end-4])
            push!(yearly, df)
        end
    end

    close(reader)
    return reduce(vcat, yearly)
end

babynames = read_txt_files_into_dataframe(local_filename)
babynames

|   | Name | Sex | Count | Year |
|---|------|-----|-------|------|
|   | String | String | Int64 | String |
| 1 | Mary | F | 9217 | 1884 |
| 2 | Anna | F | 3860 | 1884 |
| 3 | Emma | F | 2587 | 1884 |
| 4 | Elizabeth | F | 2549 | 1884 |
| 5 | Minnie | F | 2243 | 1884 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
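
The loading code takes the year from each file name by character position. As a sketch of a slightly more defensive alternative, the hypothetical helper below extracts the year with a regular expression, assuming the files in the archive are named like `yob1884.txt`. The helper `year_from_filename` is not part of the original code.

```julia
# Extract the four-digit year from a file name such as "yob1884.txt".
# Returns `nothing` when the name does not end in four digits plus ".txt".
function year_from_filename(name::AbstractString)
    m = match(r"(\d{4})\.txt$", name)
    m === nothing && return nothing
    return parse(Int, m[1])
end
```

Returning `nothing` for names that do not match lets the caller skip unexpected files in the archive instead of failing on them.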


It looks like the dataset contains names, the sex recorded for the baby, the number of babies with that name, and the year of birth for those babies. To be sure, we check the dataset description from the Social Security Administration ([https://www.ssa.gov/oact/babynames/background.html](https://www.ssa.gov/oact/babynames/background.html)).

> All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.
> 
> All data are from a 100% sample of our records on Social Security card applications as of March 2017.
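
A description like this suggests simple sanity checks we could run on the loaded table. The sketch below uses only the five rows shown in the output above, stored as named tuples with `Year` as an integer (an assumption of this sketch). The helper names `years_after_1879` and `counts_positive` are hypothetical.

```julia
# The first five rows from the output above, as named tuples.
# Year is stored as an integer here, which is an assumption of this sketch.
rows = [
    (Name = "Mary",      Sex = "F", Count = 9217, Year = 1884),
    (Name = "Anna",      Sex = "F", Count = 3860, Year = 1884),
    (Name = "Emma",      Sex = "F", Count = 2587, Year = 1884),
    (Name = "Elizabeth", Sex = "F", Count = 2549, Year = 1884),
    (Name = "Minnie",    Sex = "F", Count = 2243, Year = 1884),
]

# The description says the births occurred after 1879.
years_after_1879(rows) = all(r.Year > 1879 for r in rows)

# Every count should be a positive number of babies.
counts_positive(rows) = all(r.Count > 0 for r in rows)
```

Both checks pass for these five rows. Running the same kind of check over the full table would tell us whether the data matches the published description.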

We begin by plotting the number of male and female babies born each year: