# Last-letter analysis of first names

The *Social Security Administration* publishes statistics on first names given to newborns. Their web page shows the top 1,000, but you can [download a more complete dataset](https://www.ssa.gov/oact/babynames/limits.html) which includes every name that was seen at least 5 times.

Note: The data was first recorded in 1937. Earlier data in the dataset (it goes back to 1880) was determined retroactively from birth data, so it only covers the population still alive in 1937. That's why I'm only using data from 1937 onwards.

In [13]:
import pandas as pd
from glob import glob
from tqdm import tqdm_notebook as tqdm
import holoviews as hv
hv.extension('matplotlib')
# NOTE: Change this to hv.extension('bokeh') for prettier output and zooming etc., and to get rid off

%output size=300
%opts Curve [height=160 width=300 tools=['hover']]
%opts Bars [width=300 height=120 tools=['hover']]

Unexpected plot option 'tools' for Curve in loaded backend 'matplotlib'.

Possible keywords in the currently active 'matplotlib' renderer are: ['apply_extents', 'apply_ranges', 'apply_ticks', 'aspect', 'autotick', 'bgcolor', 'fig_alpha', 'fig_bounds', 'fig_inches', 'fig_latex', 'fig_rcparams', 'fig_size', 'final_hooks', 'finalize_hooks', 'fontsize', 'initial_hooks', 'interpolation', 'invert_axes', 'invert_xaxis', 'invert_yaxis', 'invert_zaxis', 'labelled', 'logx', 'logy', 'logz', 'normalize', 'projection', 'relative_labels', 'show_frame', 'show_grid', 'show_legend', 'show_title', 'sublabel_format', 'sublabel_position', 'sublabel_size', 'title_format', 'xaxis', 'xrotation', 'xticks', 'yaxis', 'yrotation', 'yticks', 'zaxis', 'zrotation', 'zticks']

If you believe this keyword is correct, please make sure the backend has been imported or loaded with the hv.extension.

Unexpected plot option 'tools' for Bars in loaded backend 'matplotlib'.

Possible keywords in the currently active 'matplotl

## Let's read the data

I used the [national data set](https://www.ssa.gov/oact/babynames/names.zip) which consists of one file per year.

In [2]:
slices = []

print("Reading files...")
pbar = tqdm(glob('names/*.txt'))
for fname in pbar:
    year = int(fname[-8:-4])
    assert(1880 <= year <= 2017)
    if year<1937: continue #Not so good data till 1937.
    pbar.set_description_str(f"Processing '{fname}'...")
    temp = pd.read_csv(fname, header=None, names=['name','sex','count'])
    temp['year'] = year
    slices.append(temp)
    
print("Merging files...")

names = pd.concat(slices)

print(f"{names.shape[0]:,} rows read.")

Reading files...


Merging files...
1,585,258 rows read.


We want to look at the last letter of each name, so let's add that to the dataset.

In [3]:
names['last_letter'] = names.name.str.get(-1)
names.last_letter.value_counts()

a    417849
n    275187
e    260165
y    108208
l     82852
i     75694
h     66874
s     61093
o     52795
r     49520
d     33184
t     26060
m     17018
k     15854
z      6480
g      6385
c      5718
u      4924
b      4001
f      3200
x      3188
w      2381
p      2117
v      1970
j      1842
q       699
Name: last_letter, dtype: int64

Many rows end in `a`, and only a few end in `q`. Sounds about right. Note that these are counts of rows (one per name, sex, and year), not of births.
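As a side note, `value_counts()` counts rows of the dataset, while summing the `count` column counts births. The sketch below contrasts the two on a small made-up frame; the names and numbers in `toy` are hypothetical and not taken from the SSA files.

```python
import pandas as pd

# Hypothetical toy data, only meant to illustrate the difference.
toy = pd.DataFrame({
    'name': ['Mary', 'Anna', 'John', 'Emma'],
    'count': [100, 40, 90, 5],
})
toy['last_letter'] = toy['name'].str.get(-1)

# Number of rows per last letter (what value_counts() reports).
rows_per_letter = toy['last_letter'].value_counts()

# Number of births per last letter (weights each row by its count).
births_per_letter = toy.groupby('last_letter')['count'].sum()

print(rows_per_letter)
print(births_per_letter)
```

In this toy frame `a` leads by rows but `y` leads by births, which is why the frequencies further down are computed from summed counts.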

## Calculate some stats on the last letter of the name

How often does a certain last letter of a name occur? Let's find out.

In [4]:
freq = names.groupby(['year','sex','last_letter'])['count'].sum().unstack('last_letter').fillna(0)
freq = freq.divide(freq.sum('columns'), 'rows') #calculate the relative frequencies
freq.head()

| year | sex | a | b | c | d | e | f | g | h | i | j | ... | q | r | s | t | u | v | w | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1937 | F | 0.274469 | 1.8e-05 | 9e-06 | 0.006164 | 0.257431 | 0.0 | 5.5e-05 | 0.033116 | 0.002465 | 0.0 | ... | 0.0 | 0.007692 | 0.049165 | 0.028487 | 0.001032 | 0.0 | 3e-05 | 1.7e-05 | 0.187994 | 0.000853 |
| 1937 | M | 0.002852 | 0.002831 | 0.001664 | 0.176465 | 0.114887 | 0.000349 | 0.001167 | 0.03959 | 0.000438 | 0.0 | ... | 0.0 | 0.039828 | 0.131576 | 0.085614 | 6.5e-05 | 2.6e-05 | 0.003388 | 0.002587 | 0.106828 | 0.00022 |
| 1938 | F | 0.279453 | 1.2e-05 | 0.0 | 0.005725 | 0.254027 | 0.0 | 6.1e-05 | 0.036515 | 0.002994 | 0.0 | ... | 0.0 | 0.007136 | 0.045737 | 0.027947 | 0.000973 | 0.0 | 2.7e-05 | 3.4e-05 | 0.183789 | 0.000695 |
| 1938 | M | 0.002715 | 0.002549 | 0.001688 | 0.174488 | 0.113569 | 0.000372 | 0.001142 | 0.039872 | 0.00042 | 0.0 | ... | 0.0 | 0.039794 | 0.133519 | 0.082795 | 7.1e-05 | 2.3e-05 | 0.003332 | 0.002616 | 0.111217 | 0.000204 |
| 1939 | F | 0.283062 | 7e-06 | 5e-06 | 0.005572 | 0.248019 | 0.0 | 5.7e-05 | 0.041504 | 0.003117 | 0.0 | ... | 0.0 | 0.006911 | 0.042871 | 0.027936 | 0.000927 | 0.0 | 3.7e-05 | 2.4e-05 | 0.179588 | 0.000684 |
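To make the normalization step easier to follow, here is a minimal sketch of the same group, unstack, and divide pattern on a tiny made-up frame. The data in `toy` is hypothetical; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, not taken from the SSA files.
toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2000],
    'sex': ['F', 'F', 'M', 'M'],
    'last_letter': ['a', 'e', 'n', 'a'],
    'count': [30, 10, 15, 5],
})

# Sum births per (year, sex, last_letter), then spread the letters into columns.
toy_freq = (toy.groupby(['year', 'sex', 'last_letter'])['count']
               .sum()
               .unstack('last_letter')
               .fillna(0))

# Divide each row by its total so that every (year, sex) row sums to 1.
toy_freq = toy_freq.div(toy_freq.sum(axis='columns'), axis='index')
print(toy_freq)
```

Each row is then a probability distribution over last letters for one year and sex, which is what the entropy calculation later relies on.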


In [5]:
TOP = 3
temp = freq.unstack('sex').sum('rows').unstack()
common_f = set(temp.nlargest(TOP, 'F').index)
common_f = common_f.union(set(freq.unstack('sex').loc[1937].unstack().nlargest(TOP, 'F').index))
common_f = common_f.union(set(freq.unstack('sex').loc[2017].unstack().nlargest(TOP, 'F').index))
common_m = set(temp.nlargest(TOP, 'M').index)
common_m = common_m.union(set(freq.unstack('sex').loc[1937].unstack().nlargest(TOP, 'M').index))
common_m = common_m.union(set(freq.unstack('sex').loc[2017].unstack().nlargest(TOP, 'M').index))
common = common_m.union(common_f)
common

{'a', 'd', 'e', 'n', 'r', 's', 'y'}

Taking the union of the top 3 endings over the whole period, in 1937, and in 2017, for each sex, the most common name endings are `'a', 'd', 'e', 'n', 'r', 's', 'y'`.
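The selection above repeats the same steps for each sex and year. One possible way to express it more compactly is sketched below. The helper `top_letters` and the data in `toy_freq` are hypothetical and only assume a frame shaped like `freq` (index levels `year` and `sex`, one column per last letter).

```python
import pandas as pd

def top_letters(freq, sex, years, top=3):
    """Union of the `top` most frequent last letters for `sex`, overall and in each of `years`."""
    by_sex = freq.xs(sex, level='sex')               # rows: year, columns: last_letter
    letters = set(by_sex.sum().nlargest(top).index)  # relative frequencies summed over all years
    for year in years:
        letters |= set(by_sex.loc[year].nlargest(top).index)
    return letters

# Hypothetical toy frequencies with the same layout as `freq`.
idx = pd.MultiIndex.from_product([[1937, 2017], ['F', 'M']], names=['year', 'sex'])
toy_freq = pd.DataFrame(
    [[0.6, 0.30, 0.1, 0.00],   # 1937 F
     [0.1, 0.20, 0.3, 0.40],   # 1937 M
     [0.5, 0.05, 0.0, 0.45],   # 2017 F
     [0.0, 0.10, 0.6, 0.30]],  # 2017 M
    index=idx,
    columns=pd.Index(list('aden'), name='last_letter'),
)

toy_common_f = top_letters(toy_freq, 'F', [1937, 2017], top=2)
toy_common_m = top_letters(toy_freq, 'M', [1937, 2017], top=2)
print(sorted(toy_common_f), sorted(toy_common_m))
```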

In [6]:
ds = hv.Dataset(
    pd.DataFrame(freq.loc[:, sorted(common)].stack(), columns=['freq']).reset_index(),  # a list, since recent pandas versions reject a set as an indexer
    kdims=['sex','year','last_letter'],
    vdims='freq'
)

chart_f = ds.select(sex=['F'], last_letter=common_f).to(hv.Curve, 'year', 'freq', group='Frequency of last letter').overlay('last_letter')
chart_m = ds.select(sex=['M'], last_letter=common_m).to(hv.Curve, 'year', 'freq').overlay('last_letter')

(chart_m * chart_f).layout('sex').cols(1)

## The diversity gender gap

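Let's calculate a diversity index based on the [Shannon entropy](https://en.wikipedia.org/wiki/Diversity_index): the exponential of the entropy of the last-letter distribution, which can be read as the effective number of equally likely last letters. The highest possible value is 26, which would mean that all 26 letters are equally likely. The lowest is 1, which would mean that every name ends in the same letter.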
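The quantity computed in the next cell is the exponential of the Shannon entropy (with natural logarithms), which is why its maximum is 26 rather than ln 26. The sketch below checks the two extreme cases; the helper `effective_letters` is a hypothetical name used only for this illustration.

```python
from math import exp, log

def effective_letters(probs):
    """Exponential of the Shannon entropy (natural log) of a probability distribution."""
    entropy = -sum(p * log(p) for p in probs if p > 0)  # zero probabilities contribute nothing
    return exp(entropy)

uniform = [1 / 26] * 26          # all 26 letters equally likely
single = [1.0] + [0.0] * 25      # every name ends in the same letter

print(round(effective_letters(uniform), 6))  # 26.0
print(effective_letters(single))             # 1.0
```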

In [7]:
from math import log, exp
shannon = freq.applymap(lambda x:-x*log(x) if x>0 else 0).sum('columns').apply(exp)
shannon = pd.DataFrame(shannon, columns=['entropy'])
shannon.head()

| year | sex | entropy |
|---|---|---|
| 1937 | F | 6.48852 |
| 1937 | M | 10.906265 |
| 1938 | F | 6.489513 |
| 1938 | M | 10.874115 |
| 1939 | F | 6.528721 |


In [8]:
%%opts Curve (color=hv.Cycle(['hotpink', 'dodgerblue']) line_width=3)
%%opts NdOverlay [legend_position='bottom_right']

ds = hv.Dataset(shannon.reset_index(), kdims=['year','sex'], vdims='entropy')

ds.to(hv.Curve, 'year', 'entropy', group='Entropy over time').overlay('sex').redim.range(entropy=(0,13))\
  * hv.Arrow(2011, 6.5, ' min', 'v', arrowstyle='->') * hv.Arrow(2011, 10.6, ' 2011', '^', arrowstyle='->')

Unexpected style option 'line_width' for Curve in loaded backend 'matplotlib'.

Similar keywords in the currently active 'matplotlib' renderer are: ['linewidth']

If you believe this keyword is correct, please make sure the backend has been imported or loaded with the hv.extension.

How does the gender gap look over time?

In [12]:
%%opts Curve (color='darkslategray' line_width=3) [width=200 height=140]
%%opts Table [height=400]

gap = shannon.unstack('sex')
gap.columns = gap.columns.get_level_values(1)
gap['gap'] = gap.M-gap.F

ds = hv.Dataset(gap.reset_index(), kdims=['year'], vdims='gap')

hv.Curve(ds).redim.range(gap=(0,7)) * hv.Arrow(2011, 4.2, '2011', '^') + ds.iloc[-15:,:].table()
#add .options(text_align='center') to arrow for centering in bokeh

Unexpected style option 'line_width' for Curve in loaded backend 'matplotlib'.

Similar keywords in the currently active 'matplotlib' renderer are: ['linewidth']

If you believe this keyword is correct, please make sure the backend has been imported or loaded with the hv.extension.