# Babynames Revisited

**Todd M. Gureckis**  
New York University  
(todd.gureckis@nyu.edu)

---

This notebook and the associated github repository provide the code and data for the publication:  

Gureckis, T.M. and Goldstone, R.L. (2009) How You Named Your Child: Understanding The Relationship Between Individual Decision Making and Collective Outcomes. TopiCS in Cognitive Science, 1 (4), 651-674.

---

The original code for this paper was written in Mathematica.  As a learning exercise, I reimplemented the analyses in Python and Jupyter notebooks using a slightly updated dataset.  It also seemed useful to share the code publicly.  As noted in the README, I recreated all the plots from the paper but did not redo the modeling analyses with the MILEY model.  This is largely because this "model" basically re-explains the reported data analyses.  In addition, the model is a little hard to fit to the data because the likelihood function has many local minima for this dataset, and as a result I'm not sure the model is really that useful.  However, the analysis is useful for anyone following up on the runs analysis, and also for learning pandas, bokeh, etc.

## Read in the Data

In [7]:
import pandas as pd
import numpy as np
from bokeh.plotting import figure, output_notebook, show
from bokeh.layouts import gridplot
from bokeh.palettes import Spectral11, Paired12
import itertools

# for bokeh to render to the notebook
output_notebook()

In [9]:
applicants = pd.read_csv('./names/applicants.csv',
                         names=["null", "year", "sex", "n"], skiprows=1)
del applicants['null']
applicants.head()

   year sex       n
0  1880   F   97605
1  1880   M  118400
2  1881   F   98855
3  1881   M  108282
4  1882   F  115695


In [213]:
pieces = []
columns = ['names', 'sex', 'births']
years = range(1880, 2018)
for year in years:

    path = f'names/yob{year}.txt'
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year

    females = frame[frame.sex == 'F'].reset_index(drop=True)
    males = frame[frame.sex == 'M'].reset_index(drop=True)
    females = females[:1000]
    males = males[:1000]
    males['rank'] = males.index+1
    females['rank'] = females.index+1

    m_den = applicants[(applicants.year == year) & (
        applicants.sex == 'M')].reset_index().at[0, 'n']
    males['rfreq'] = males.births/m_den

    f_den = applicants[(applicants.year == year) & (
        applicants.sex == 'F')].reset_index().at[0, 'n']
    females['rfreq'] = females.births/f_den

    males['pct_rank'] = males['rfreq'].rank(
        method='max', ascending=False, pct=True)
    females['pct_rank'] = females['rfreq'].rank(
        method='max', ascending=False, pct=True)

    newyear = pd.concat([females, males], ignore_index=True)
    pieces.append(newyear)
names = pd.concat(pieces, ignore_index=True)
names.to_pickle('names.pkl')

In [201]:
year_grouped = dict(list(names.groupby(['year', 'sex'])))
for key in year_grouped:
    print(key)
    print(year_grouped[key].head())

(1880, 'F')
       names sex  births  year  rank     rfreq  pct_rank
0       Mary   F    7065  1880     1  0.072384  0.001062
1       Anna   F    2604  1880     2  0.026679  0.002123
2       Emma   F    2003  1880     3  0.020521  0.003185
3  Elizabeth   F    1939  1880     4  0.019866  0.004246
4     Minnie   F    1746  1880     5  0.017888  0.005308
(1880, 'M')
       names sex  births  year  rank     rfreq  pct_rank
942     John   M    9655  1880     1  0.081546     0.001
943  William   M    9532  1880     2  0.080507     0.002
944    James   M    5927  1880     3  0.050059     0.003
945  Charles   M    5348  1880     4  0.045169     0.004
946   George   M    5126  1880     5  0.043294     0.005
(1881, 'F')
          names sex  births  year  rank     rfreq  pct_rank
1942       Mary   F    6919  1881     1  0.069991  0.001066
1943       Anna   F    2698  1881     2  0.027292  0.002132
1944       Emma   F    2034  1881     3  0.020576  0.003198
1945  Elizabeth   F    1852  1881     4 

         names sex  births  year  rank     rfreq  pct_rank
50877     John   M    8060  1905     1  0.056270     0.001
50878  William   M    6495  1905     2  0.045344     0.002
50879    James   M    6042  1905     3  0.042182     0.003
50880   George   M    4256  1905     4  0.029713     0.004
50881  Charles   M    3608  1905     5  0.025189     0.005
(1906, 'F')
          names sex  births  year  rank     rfreq  pct_rank
51877      Mary   F   16370  1906     1  0.052227     0.001
51878     Helen   F    7176  1906     2  0.022894     0.002
51879  Margaret   F    6096  1906     3  0.019449     0.003
51880      Anna   F    5502  1906     4  0.017554     0.004
51881      Ruth   F    5140  1906     5  0.016399     0.005
(1906, 'M')
         names sex  births  year  rank     rfreq  pct_rank
52877     John   M    8265  1906     1  0.057368     0.001
52878  William   M    6567  1906     2  0.045582     0.002
52879    James   M    5908  1906     3  0.041008     0.003
52880   George   M    4201

         names sex  births  year  rank     rfreq  pct_rank
86877     John   M   57472  1923     1  0.050756     0.001
86878   Robert   M   56120  1923     2  0.049562     0.002
86879  William   M   52135  1923     3  0.046042     0.003
86880    James   M   50465  1923     4  0.044568     0.004
86881  Charles   M   28960  1923     5  0.025576     0.005
(1924, 'F')
          names sex  births  year  rank     rfreq  pct_rank
87877      Mary   F   73532  1924     1  0.056751     0.001
87878   Dorothy   F   39997  1924     2  0.030869     0.002
87879     Helen   F   31191  1924     3  0.024073     0.003
87880     Betty   F   30602  1924     4  0.023618     0.004
87881  Margaret   F   26550  1924     5  0.020491     0.005
(1924, 'M')
         names sex  births  year  rank     rfreq  pct_rank
88877   Robert   M   60798  1924     1  0.052006     0.001
88878     John   M   59052  1924     2  0.050512     0.002
88879  William   M   53515  1924     3  0.045776     0.003
88880    James   M   52944

           names sex  births  year  rank     rfreq  pct_rank
137877     Linda   F   91016  1949     1  0.051846     0.001
137878      Mary   F   66862  1949     2  0.038087     0.002
137879  Patricia   F   46330  1949     3  0.026391     0.003
137880   Barbara   F   42598  1949     4  0.024266     0.004
137881     Susan   F   37707  1949     5  0.021479     0.005
(1949, 'M')
          names sex  births  year  rank     rfreq  pct_rank
138877    James   M   86855  1949     1  0.048204     0.001
138878   Robert   M   83869  1949     2  0.046546     0.002
138879     John   M   81155  1949     3  0.045040     0.003
138880  William   M   61501  1949     4  0.034132     0.004
138881  Michael   M   60039  1949     5  0.033321     0.005
(1950, 'F')
           names sex  births  year  rank     rfreq  pct_rank
139877     Linda   F   80432  1950     1  0.045736     0.001
139878      Mary   F   65482  1950     2  0.037235     0.002
139879  Patricia   F   47945  1950     3  0.027263     0.003
139880

              names sex  births  year  rank     rfreq  pct_rank
188877      Michael   M   67583  1974     1  0.041443     0.001
188878        Jason   M   54776  1974     2  0.033590     0.002
188879  Christopher   M   48610  1974     3  0.029809     0.003
188880        David   M   41811  1974     4  0.025639     0.004
188881        James   M   41353  1974     5  0.025358     0.005
(1975, 'F')
           names sex  births  year  rank     rfreq  pct_rank
189877  Jennifer   F   58185  1975     1  0.037280     0.001
189878       Amy   F   32252  1975     2  0.020665     0.002
189879   Heather   F   24300  1975     3  0.015570     0.003
189880   Melissa   F   24167  1975     4  0.015484     0.004
189881    Angela   F   23359  1975     5  0.014967     0.005
(1975, 'M')
              names sex  births  year  rank     rfreq  pct_rank
190877      Michael   M   68454  1975     1  0.042177     0.001
190878        Jason   M   52183  1975     2  0.032152     0.002
190879  Christopher   M   46592  1

              names sex  births  year  rank     rfreq  pct_rank
234877      Michael   M   37548  1997     1  0.018799     0.001
234878        Jacob   M   34151  1997     2  0.017098     0.002
234879      Matthew   M   31513  1997     3  0.015778     0.003
234880  Christopher   M   29103  1997     4  0.014571     0.004
234881       Joshua   M   28283  1997     5  0.014160     0.005
(1998, 'F')
           names sex  births  year  rank     rfreq  pct_rank
235877     Emily   F   26181  1998     1  0.013509     0.001
235878    Hannah   F   21373  1998     2  0.011028     0.002
235879  Samantha   F   20193  1998     3  0.010419     0.003
235880     Sarah   F   19879  1998     4  0.010257     0.004
235881    Ashley   F   19874  1998     5  0.010255     0.005
(1998, 'M')
              names sex  births  year  rank     rfreq  pct_rank
236877      Michael   M   36614  1998     1  0.018062     0.001
236878        Jacob   M   36014  1998     2  0.017766     0.002
236879      Matthew   M   31142  1

          names sex  births  year  rank     rfreq  pct_rank
272877     Noah   M   19082  2016     1  0.009457     0.001
272878     Liam   M   18198  2016     2  0.009019     0.002
272879  William   M   15739  2016     3  0.007800     0.003
272880    Mason   M   15230  2016     4  0.007548     0.004
272881    James   M   14842  2016     5  0.007356     0.005
(2017, 'F')
           names sex  births  year  rank     rfreq  pct_rank
273877      Emma   F   19738  2017     1  0.010528     0.001
273878    Olivia   F   18632  2017     2  0.009938     0.002
273879       Ava   F   15902  2017     3  0.008482     0.003
273880  Isabella   F   15100  2017     4  0.008054     0.004
273881    Sophia   F   14831  2017     5  0.007910     0.005
(2017, 'M')
          names sex  births  year  rank     rfreq  pct_rank
274877     Liam   M   18728  2017     1  0.009539     0.001
274878     Noah   M   18326  2017     2  0.009334     0.002
274879  William   M   14904  2017     3  0.007591     0.003
274880    

## Figure 1: Cumulative distribution on log-log scale

These plots should look more or less identical to the corresponding figure in the paper.
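
As a side note on how the y-axis values are built: the `pct_rank` column comes from `rank(method='max', ascending=False, pct=True)`. The minimal sketch below uses a made-up four-value series (not data from the SSA files) to show that, with ties, this equals the fraction of names whose frequency is at least as large as the given one.

```python
import pandas as pd

# Toy relative frequencies (hypothetical values, with one tie).
rfreq = pd.Series([0.40, 0.25, 0.25, 0.10])

# Same call as in the notebook's data-loading cell.
pct_rank = rfreq.rank(method='max', ascending=False, pct=True)

# Direct definition: share of entries with frequency >= x.
direct = rfreq.apply(lambda x: (rfreq >= x).mean())

print(pd.DataFrame({'rfreq': rfreq, 'pct_rank': pct_rank, 'direct': direct}))
```

On this toy input the two columns agree, so the plotted quantity behaves like an empirical P(X ≥ x) over the top-1000 list.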

In [8]:
# create a bokeh plot
p = figure(
    width=700,
    height=400,
    tools="",
    x_axis_type="log", x_range=[0.00001, 0.2], x_axis_label="frequency",
    y_axis_type="log", y_range=[0.001, 1.5], y_axis_label="P(X>x)"
)

sex = 'F'
for year, color in zip([1880, 1900, 1920, 1940, 1960, 1980, 2000, 2007], itertools.cycle(Paired12)):
    plot_data = year_grouped[(year, sex)]
    p.line(plot_data.rfreq, plot_data.pct_rank,
           legend=f"{year} {sex}", line_color=color, line_width=0.8)

p.toolbar.logo = None
p.toolbar_location = None
p.legend.location = "bottom_left"
show(p)


# create a bokeh plot
p = figure(
    width=700,
    height=400,
    tools="",
    x_axis_type="log", x_range=[0.00001, 0.2], x_axis_label="frequency",
    y_axis_type="log", y_range=[0.001, 1.5], y_axis_label="P(X>x)"
)


sex = 'M'
for year, color in zip([1880, 1900, 1920, 1940, 1960, 1980, 2000, 2007], itertools.cycle(Paired12)):
    plot_data = year_grouped[(year, sex)]
    p.line(plot_data.rfreq, plot_data.pct_rank,
           legend=f"{year} {sex}", line_color=color, line_width=0.8)

p.toolbar.logo = None
p.toolbar_location = None
p.legend.location = "bottom_left"
show(p)

## Figure 2: Best fit line to cumulative distribution

The fitted values here are very close to the ones in the Gureckis & Goldstone paper.  One difference is that when constructing the CDF plot in the original paper, we identified each unique name-frequency value and then computed P(X>x) at each of those points.  Here we used all the names, so there are duplicate points in the fit.  If the duplicate points were removed, the exponents would be identical.

Note: there is a drop_duplicates() pandas function that might be helpful here.
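
A small illustration of what `drop_duplicates()` does to tied points, using a hypothetical four-row frame rather than the real data: names that share a frequency also share a `pct_rank`, so they collapse to a single (frequency, probability) point.

```python
import pandas as pd

# Hypothetical frame: two names are tied at rfreq = 0.25.
toy = pd.DataFrame({
    'rfreq':    [0.40, 0.25, 0.25, 0.10],
    'pct_rank': [0.25, 0.75, 0.75, 1.00],
})

# Keep one row per unique (rfreq, pct_rank) pair.
unique_points = toy.drop_duplicates()
print(unique_points)
```

After this step each unique frequency contributes one point to the regression instead of one point per name.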

In [206]:
year = 1880
fitdata_m = year_grouped[(year, 'M')].copy()
fitdata_f = year_grouped[(year, 'F')].copy()

# remove the non useful columns
fitdata_m.drop(['year', 'names', 'sex', 'births', 'rank'],
               axis=1, inplace=True)
fitdata_f.drop(['year', 'names', 'sex', 'births', 'rank'],
               axis=1, inplace=True)

fitdata_m = fitdata_m.drop_duplicates()
fitdata_f = fitdata_f.drop_duplicates()

print(fitdata_m.tail())
print(fitdata_f.tail())

         rfreq  pct_rank
1581  0.000076     0.681
1623  0.000068     0.744
1686  0.000059     0.808
1750  0.000051     0.908
1850  0.000042     1.000
        rfreq  pct_rank
579  0.000092  0.665605
627  0.000082  0.719745
678  0.000072  0.782378
737  0.000061  0.886412
835  0.000051  1.000000


In [207]:
import statsmodels.api as sm

X = np.log(fitdata_f["rfreq"])
y = np.log(fitdata_f["pct_rank"])
X = sm.add_constant(X)  # add an intercept to model

# for making predictions
Xp = pd.DataFrame({"rfreq": np.arange(0.00001, 1.0, 0.01)})
Xp_m = sm.add_constant(np.log(Xp))


model_f = sm.OLS(y, X).fit()
predictions_f = model_f.predict(Xp_m)


rsq_f, alpha_f = model_f.rsquared, model_f.params['rfreq']-1
model_f.summary()


X = np.log(fitdata_m["rfreq"])
y = np.log(fitdata_m["pct_rank"])
X = sm.add_constant(X)  # add an intercept to model

model_m = sm.OLS(y, X).fit()
predictions_m = model_m.predict(Xp_m)

rsq_m, alpha_m = model_m.rsquared, model_m.params['rfreq']-1
model_m.summary()



Dep. Variable:     pct_rank          R-squared:           0.986
Model:             OLS               Adj. R-squared:      0.986
Method:            Least Squares     F-statistic:         14090.0
Date:              Thu, 28 Mar 2019  Prob (F-statistic):  4.48e-188
Time:              13:42:29          Log-Likelihood:      114.14
No. Observations:  203               AIC:                 -224.3
Df Residuals:      201               BIC:                 -217.7
Df Model:          1
Covariance Type:   nonrobust

          coef  std err         t  P>|t|  [0.025  0.975]
const  -7.8497    0.048  -165.164  0.000  -7.943  -7.756
rfreq  -0.8155    0.007  -118.686  0.000  -0.829  -0.802

Omnibus:        175.997  Durbin-Watson:     0.192
Prob(Omnibus):  0.0      Jarque-Bera (JB):  3430.36
Skew:           -3.201   Prob(JB):          0.0
Kurtosis:       22.094   Cond. No.          34.5


In [180]:
Xp.head()

   const      rfreq
0    1.0   -9.21034
1    1.0   -4.59522
2    1.0  -3.907035
3    1.0   -3.50323
4    1.0  -3.216379


In [173]:
predictions.head(), fitdata.pct_rank.head()

(248877   -3.961191
 248878   -3.888718
 248879   -3.848451
 248880   -3.803571
 248881   -3.779930
 dtype: float64, 248877    0.001
 248878    0.002
 248879    0.003
 248880    0.004
 248881    0.005
 Name: pct_rank, dtype: float64)

In [208]:
colors = Paired12

# create a bokeh plot
p = figure(
    width=700,
    height=400,
    tools="",
    x_axis_type="log", x_range=[0.00001, 0.2], x_axis_label="frequency",
    y_axis_type="log", y_range=[0.001, 1.5], y_axis_label="P(X>x)"
)

# plot data line
p.line(fitdata_f.rfreq, fitdata_f.pct_rank,
       legend=f"{year} F", line_color='grey', line_width=1.0)
p.line(Xp.rfreq, np.exp(predictions_f),
       legend=f"model F (R^2={rsq_f:.2f}, alpha={alpha_f:.2f})", line_color='grey', line_dash="4 4")

p.line(fitdata_m.rfreq, fitdata_m.pct_rank,
       legend=f"{year} M", line_color='orange', line_width=1.0)
p.line(Xp.rfreq, np.exp(predictions_m),
       legend=f"model M (R^2={rsq_m:.2f}, alpha={alpha_m:.2f})", line_color='orange', line_dash="4 4")


#p.circle(fitdata.rfreq, np.exp(predictions), legend="model", fill_color="white", size=1)


p.toolbar.logo = None
p.toolbar_location = None
p.legend.location = "bottom_left"
show(p)

This figure fits a linear model to the cumulative distribution and plots the resulting exponent over time.  
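
To make the fitting step concrete, here is a minimal sketch of the same log-log regression on synthetic data. It uses `numpy.polyfit` in place of the statsmodels OLS call in the notebook, and the power-law curve is generated with a made-up slope of -0.8, so none of the numbers are results from the paper. The `slope - 1` conversion mirrors the `params['rfreq'] - 1` line in the cells above and below.

```python
import numpy as np

# Synthetic cumulative distribution: P(X>x) = C * x**slope (assumed values).
true_slope = -0.8
freq = np.logspace(-4, -1, 50)
ccdf = 0.001 * freq ** true_slope

# A straight-line fit in log-log space recovers the slope.
slope, intercept = np.polyfit(np.log(freq), np.log(ccdf), 1)

# Same conversion as the notebook: exponent = slope of the cumulative fit - 1.
alpha = slope - 1
print(f"slope={slope:.3f}, alpha={alpha:.3f}, |alpha|={abs(alpha):.3f}")
```

The plot further down shows the absolute value of this exponent for each year and sex.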

In [216]:
bygroup = names.groupby(['year', 'sex'])


def regress(fitdata, yvar, xvars):
    data = fitdata.copy()
    data.drop(['names', 'births', 'rank'], axis=1, inplace=True)
    data = data.drop_duplicates()

    Y = np.log(data[yvar])
    X = np.log(data[xvars])
    X['intercept'] = 1.
    results = sm.OLS(Y, X).fit()
    return pd.Series({"Rsq": results.rsquared, "Alpha": results.params['rfreq']-1})


fitted_results = bygroup.apply(regress, 'pct_rank', ['rfreq'])

In [218]:
male_exp = fitted_results.unstack()['Alpha']['M']
female_exp = fitted_results.unstack()['Alpha']['F']


# create a bokeh plot
p = figure(
    width=700,
    height=400,
    tools="",
    x_range=[1879, 2018], x_axis_label="year",
    y_range=[1.5, 2.2], y_axis_label="Alpha"
)

# plot data line
p.line(female_exp.index.values, np.abs(female_exp.values),
       legend="Females", line_color='grey', line_width=1.0)
p.line(male_exp.index.values, np.abs(male_exp.values),
       legend="Males", line_color='orange', line_width=1.0)

p.circle(male_exp.index.values, np.abs(male_exp.values),
         legend="Males", line_color='orange', fill_color="white", size=3)
p.circle(female_exp.index.values, np.abs(female_exp.values),
         legend="Females", line_color='grey', fill_color="white", size=3)


p.toolbar.logo = None
p.toolbar_location = None
p.legend.location = "bottom_left"
show(p)

## Figure 2b: New names introduced on each top 1000 decadal list

In [None]:
# decadal data
decade_data = pd.read_csv('./names/ssababydata-decade.dat', sep=r'\s+')
decade_data.head()

decade_data.columns

In [275]:
decade_groups = dict(list(decade_data.groupby(['year', 'sex'])))
names_list = {}
years = []
for key in decade_groups:
    if key[1] == 'f':
        years.append(key[0])
    # print(decade_groups[key]['name'].values)
    names_list[key] = decade_groups[key]['name'].values

In [276]:
def count_new(year_pair, gender, names_list):
    count = 0
    for name in names_list[(year_pair[1], gender)]:
        if name not in names_list[(year_pair[0], gender)]:
            count += 1
    return count


yearcol = []
countcol = []
for pair in zip(years[:-1], years[1:]):
    yearcol.append(pair[1])
    countcol.append(count_new(pair, 'f', names_list))

female_change = pd.DataFrame({'year': yearcol, 'counts': countcol})


yearcol = []
countcol = []
for pair in zip(years[:-1], years[1:]):
    yearcol.append(pair[1])
    countcol.append(count_new(pair, 'm', names_list))
male_change = pd.DataFrame({'year': yearcol, 'counts': countcol})

In [280]:
# create a bokeh plot
p = figure(
    width=700,
    height=400,
    tools="",
    x_range=[1880, 2010], x_axis_label="year",
    y_range=[0, 300], y_axis_label="Number of New Names"
)

# plot data line
p.line(female_change.year, female_change.counts,
       legend="Females", line_color='grey', line_width=1.0)
p.line(male_change.year, male_change.counts,
       legend="Males", line_color='orange', line_width=1.0)


p.toolbar.logo = None
p.toolbar_location = None
p.legend.location = "bottom_left"
show(p)

Interestingly, the scale of this plot seems a little different from the one in the paper.  I think the data pattern looks identical, so I'm wondering if a zero was chopped off the figure during production (the published figures were Illustrator-modified versions of the Mathematica output).
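
The counting behind this figure can also be expressed as a set difference. The sketch below is an alternative to the `count_new` loop, not the code used for the plot; it assumes each name appears at most once per list, and the helper name and the toy lists are made up for illustration.

```python
def count_new_names(previous, current):
    """Number of names in `current` that do not appear in `previous`."""
    # Assumes names are unique within each list (hypothetical helper).
    return len(set(current) - set(previous))


# Toy decade lists, not taken from the SSA data.
earlier = ['Mary', 'Anna', 'Emma']
later = ['Mary', 'Linda', 'Emma', 'Susan']

n_new = count_new_names(earlier, later)
print(n_new)
```

Set membership tests are constant time on average, whereas `name not in array` scans the whole array for every name, so this form is also faster on 1000-name lists.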

## Figure 3: Nobel Prize Winners names over time

In [288]:
names.head()

       names sex  births  year  rank     rfreq  pct_rank
0       Mary   F    7065  1880     1  0.072384  0.001062
1       Anna   F    2604  1880     2  0.026679  0.002123
2       Emma   F    2003  1880     3  0.020521  0.003185
3  Elizabeth   F    1939  1880     4  0.019866  0.004246
4     Minnie   F    1746  1880     5  0.017888  0.005308


In [360]:
def plot_name(name_str, sex, data):
    name_over_time = data[data.sex == sex].pivot_table(
        'rfreq', index='year', columns='names')
    name_t = name_over_time[name_str]
    # begin and return to zero for the plot
    name_t = name_t.reindex(range(1879, 2019, 1), fill_value=0)

    # create a bokeh plot
    p = figure(
        width=300,
        height=200,
        tools="",
        x_range=[1880, 2018], x_axis_label="year",
        y_axis_label="Name market share (%)"
    )

    # plot data line
    p.patch(name_t.index, name_t*100, legend=name_str,
            color="lightgrey", line_color='grey', line_width=1.0)

    p.toolbar.logo = None
    p.toolbar_location = None
    p.legend.location = "top_right"
    return p

In [361]:
from bokeh.layouts import gridplot
nobels = [('Albert', 'M'), ('Doris', 'F'), ('Eric', 'M'), ('Mario', 'M'),
          ('Martin', 'M'), ('Oliver', 'M'), ('Peter', 'M'), ('Roger', 'M')]
plots = list(map(lambda x: plot_name(x[0], x[1], names), nobels))

grid = gridplot([[plots[0], plots[1], plots[2]], [
                plots[3], plots[4], plots[5]], [plots[6], plots[7], None]])
show(grid)

## Figure 4: conditional probabilities of movement

I thought long and hard about how to recreate this analysis using only pandas dataframe operations (or at least mostly).  However, either the solution is too unwieldy or the tool is not quite right.  As I understand it, outside of simple sliding-window analyses, really complex reshaping is possibly best left to loops and other code outside the data-manipulation features of the dataframe.
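
For the previous/current/next window specifically, one possible pandas-only alternative is to shift the wide year-by-name table by one row in each direction. This is a sketch under stated assumptions, not the method used to produce the pickle files: it assumes the index holds consecutive years and is named `year`, it omits the `rank` column (which could be merged in afterwards), and the function name and toy values are hypothetical. I have not checked that it reproduces the stored output row for row.

```python
import numpy as np
import pandas as pd


def windows_from_wide(wide):
    """Build prev/now/next columns for every name from a year x name table."""
    def to_long(frame, value_name):
        # One row per (year, name) pair; row order is identical for all calls.
        return frame.reset_index().melt(
            id_vars='year', var_name='name', value_name=value_name)

    out = to_long(wide.shift(1), 'prev')           # value one year earlier
    out['now'] = to_long(wide, 'now')['now']
    out['next'] = to_long(wide.shift(-1), 'next')['next']  # one year later

    # Drop the first and last year, which lack a neighbour on one side.
    inner = (out['year'] > wide.index.min()) & (out['year'] < wide.index.max())
    return out[inner].reset_index(drop=True)


# Toy table with made-up frequencies; NaN means "not on the list that year".
toy = pd.DataFrame(
    {'Ann': [0.1, 0.2, np.nan, 0.3], 'Bob': [np.nan, 0.5, 0.4, 0.2]},
    index=pd.Index([2000, 2001, 2002, 2003], name='year'),
)
windows = windows_from_wide(toy)
print(windows)
```

Because every operation here works on whole columns, it avoids the per-name, per-year lookups in the loop-based version below.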

In [495]:
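def get_chunked_dataframe(babyname, sex, dataframe):
    # Rank of a name in a given year, or NaN if it is not on that year's list.
    # Note: this reads the global `names` dataframe (not the `dataframe` argument)
    # and rescans it on every call, which is likely a large part of the runtime below.
    def get_rank(babyname, sex, year):
        match = names[(names.names == babyname) & (names.sex == sex) & (names.year == year)]
        if match.empty:
            return np.nan
        else:
            return match['rank'].values[0]

    # Slide a window of `width` years over the series; each row is keyed by the
    # middle year and holds the previous, current, and next relative frequency.
    # (The loop bound and the three freqs assume width == 3.)
    def chunk(data, width, babyname, sex):
        for i in range(0, len(data)-2):
            d = data[i:i+width]
            name = d.columns[0]
            years = d.index.values
            freqs = d.values.ravel()
            rank = get_rank(babyname, sex, years[1])
            yield [name, sex, years[1], rank, freqs[0], freqs[1], freqs[2]]

    one_name = pd.DataFrame(dataframe[babyname])
    return pd.DataFrame(list(chunk(one_name, 3, babyname, sex)), columns=['name', 'sex', 'year', 'rank', 'prev', 'now', 'next'])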

In [500]:
name_over_time=names[names.sex=='M'].pivot_table('rfreq',index='year',columns='names')

In [501]:
name_over_time.head()

names  Aaden  Aarav     Aaron  Aarush       Ab      Abb  Abbie   Abbott  Abdiel  Abdul  ...  Zebulon  Zechariah  Zed     Zeke  Zenas     Zeno  Zigmund  Zion   Zollie  Zyaire
year
1880     NaN    NaN  0.000861     NaN  4.2e-05      NaN    NaN  4.2e-05     NaN    NaN  ...      NaN        NaN  NaN  5.1e-05    NaN      NaN      NaN   NaN      NaN     NaN
1881     NaN    NaN  0.000868     NaN      NaN      NaN    NaN      NaN     NaN    NaN  ...      NaN        NaN  NaN      NaN    NaN      NaN      NaN   NaN      NaN     NaN
1882     NaN    NaN  0.000697     NaN  4.1e-05      NaN    NaN      NaN     NaN    NaN  ...      NaN        NaN  NaN      NaN    NaN      NaN      NaN   NaN      NaN     NaN
1883     NaN    NaN  0.000934     NaN      NaN      NaN    NaN      NaN     NaN    NaN  ...      NaN        NaN  NaN  5.3e-05    NaN  5.3e-05      NaN   NaN      NaN     NaN
1884     NaN    NaN   0.00079     NaN      NaN  4.1e-05    NaN      NaN     NaN    NaN  ...      NaN        NaN  NaN      NaN    NaN      NaN      NaN   NaN  4.9e-05     NaN


### Caution: these cells involve very time-expensive operations to create the runs analysis!

It might make sense to skip these cells and jump ahead to reading in the provided pickle files, which contain the pre-computed values.

In [503]:
%%time
name_over_time = names[names.sex == 'F'].pivot_table(
    'rfreq', index='year', columns='names')
female_runs = [get_chunked_dataframe(
    name, 'F', name_over_time) for name in name_over_time.columns]

CPU times: user 7h 15min 6s, sys: 1min 29s, total: 7h 16min 35s
Wall time: 7h 20min 11s


In [509]:
female_runs_df = pd.concat(female_runs, ignore_index=True)
female_runs_df.to_pickle('femaleruns.pkl')

In [719]:
%%time
name_over_time = names[names.sex == 'M'].pivot_table(
    'rfreq', index='year', columns='names')
male_runs = [get_chunked_dataframe(name, 'M', name_over_time)
             for name in name_over_time.columns]

CPU times: user 7h 15min 48s, sys: 2min 21s, total: 7h 18min 10s
Wall time: 7h 24min 22s


In [720]:
male_runs_df = pd.concat(male_runs, ignore_index=True)
male_runs_df.to_pickle('maleruns.pkl')

In [721]:
%%time
runs = pd.concat(female_runs+male_runs, ignore_index=True)
runs.to_pickle('allruns.pkl')

CPU times: user 2.58 s, sys: 259 ms, total: 2.84 s
Wall time: 2.94 s


In [722]:
female_runs_df.head()

        name  sex  year   rank      prev       now      next  t0_change  t1_change      movement
135   Aadhya    F  2016  953.0       0.0  0.000147  0.000155   0.000147      7e-06      p(up|up)
249  Aaliyah    F  1994  202.0       0.0  0.000744  0.000653   0.000744   -9.1e-05    p(down|up)
250  Aaliyah    F  1995  225.0  0.000744  0.000653  0.000434   -9.1e-05   -0.00022  p(down|down)
251  Aaliyah    F  1996  335.0  0.000653  0.000434  0.000911   -0.00022   0.000478    p(up|down)
252  Aaliyah    F  1997  176.0  0.000434  0.000911  0.000722   0.000478  -0.000189    p(down|up)


In [723]:
male_runs_df.head()

    name sex  year  rank  prev  now  next
0  Aaden   M  1881   NaN   NaN  NaN   NaN
1  Aaden   M  1882   NaN   NaN  NaN   NaN
2  Aaden   M  1883   NaN   NaN  NaN   NaN
3  Aaden   M  1884   NaN   NaN  NaN   NaN
4  Aaden   M  1885   NaN   NaN  NaN   NaN


### Pick up here for the actual runs analysis

In [693]:
female_runs_df = pd.read_pickle('femaleruns.pkl')

female_runs_df = female_runs_df[

    ~(female_runs_df['prev'].isnull() &
      female_runs_df['now'].isnull() &
      female_runs_df['next'].isnull())
    &
    ~(female_runs_df['prev'].isnull() &
      female_runs_df['now'].isnull() &
      female_runs_df['next'].notnull())
    &
    ~(female_runs_df['prev'].notnull() &
      female_runs_df['now'].isnull() &
      female_runs_df['next'].isnull())
]


female_runs_df = female_runs_df.fillna(0)
female_runs_df['t0_change'] = female_runs_df.now-female_runs_df.prev
female_runs_df['t1_change'] = female_runs_df.next-female_runs_df.now

# this needs to count moving up from nan as growth or vice versa.


def classify(t0, t1):
    def code(x):
        if x > 0:
            return 'up'
        elif x == 0:
            return 'same'
        elif x < 0:
            return 'down'
    if np.isnan(t0) and np.isnan(t1):
        return np.nan
    else:
        return 'p('+code(t1) + '|' + code(t0)+')'


female_runs_df['movement'] = female_runs_df.apply(
    lambda x: classify(x['t0_change'], x['t1_change']), axis="columns")

In [724]:
male_runs_df = pd.read_pickle('maleruns.pkl')

male_runs_df = male_runs_df[

    ~(male_runs_df['prev'].isnull() &
      male_runs_df['now'].isnull() &
      male_runs_df['next'].isnull())
    &
    ~(male_runs_df['prev'].isnull() &
      male_runs_df['now'].isnull() &
      male_runs_df['next'].notnull())
    &
    ~(male_runs_df['prev'].notnull() &
      male_runs_df['now'].isnull() &
      male_runs_df['next'].isnull())
]


male_runs_df = male_runs_df.fillna(0)
male_runs_df['t0_change'] = male_runs_df.now-male_runs_df.prev
male_runs_df['t1_change'] = male_runs_df.next-male_runs_df.now

# this needs to count moving up from nan as growth or vice versa.


def classify(t0, t1):
    def code(x):
        if x > 0:
            return 'up'
        elif x == 0:
            return 'same'
        elif x < 0:
            return 'down'
    if np.isnan(t0) and np.isnan(t1):
        return np.nan
    else:
        return 'p('+code(t1) + '|' + code(t0)+')'


male_runs_df['movement'] = male_runs_df.apply(
    lambda x: classify(x['t0_change'], x['t1_change']), axis="columns")
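
The row-wise `classify` call can also be written with column operations. The following is a hedged sketch rather than a drop-in replacement: `label_movement` is a hypothetical helper, the three rows are made-up frequencies, and it assumes missing values have already been filled with 0 as in the cells above.

```python
import numpy as np
import pandas as pd


def label_movement(df):
    """Label each row as 'p(<next move>|<previous move>)' from prev/now/next."""
    codes = {1.0: 'up', 0.0: 'same', -1.0: 'down'}
    # Sign of the change into the current year, and out of it.
    t0 = np.sign(df['now'] - df['prev']).map(codes)
    t1 = np.sign(df['next'] - df['now']).map(codes)
    return 'p(' + t1 + '|' + t0 + ')'


# Toy rows with made-up relative frequencies.
toy = pd.DataFrame({
    'prev': [0.0, 0.2, 0.3],
    'now':  [0.1, 0.1, 0.3],
    'next': [0.2, 0.3, 0.1],
})
toy['movement'] = label_movement(toy)
print(toy)
```

The labels use the same `p(t1|t0)` format as `classify`, so the result can feed the same `value_counts()` steps used for the plots.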

In [744]:
def plot_movement(y1, y2, runs_df, sex):
    movement_ps = {}
    for rank in range(1, 1000):
        np1r1 = runs_df[(runs_df['rank'] == rank) & (
            runs_df['year'] >= y1) & (runs_df['year'] <= y2)].copy()
        r1c = np1r1['movement'].value_counts()
        r1c = r1c.reindex(['p(down|down)', 'p(up|down)',
                           'p(down|up)', 'p(up|up)'], fill_value=0)
        down_den = r1c.loc['p(down|down)']+r1c.loc['p(up|down)']
        up_den = r1c.loc['p(down|up)']+r1c.loc['p(up|up)']
        denom = pd.Series({'p(down|down)': down_den, 'p(up|down)': down_den,
                           'p(down|up)': up_den, 'p(up|up)': up_den})
        result = r1c*(1.0/denom)
        movement_ps[rank] = result
    #np1r2=female_runs_df[(female_runs_df['rank']==2)& (female_runs_df['year']<=1905)]
    # r2c=np1r2['movement'].value_counts()

    mv_df = pd.DataFrame(movement_ps)

    # create a bokeh plot
    p = figure(
        width=400,
        height=300,
        tools="",
        x_range=[1, 1000], x_axis_label="rank",
        y_range=[0, 1.0], y_axis_label="Probability (movement | rank)",
        title=f"{sex} {y1}-{y2}"
    )

    # plot data line
    p.line(range(1, 1000), mv_df.loc['p(down|down)'].rolling(
        50).mean(), legend="p(down|down)", line_color='orange', line_width=1.0)
    p.line(range(1, 1000), mv_df.loc['p(up|down)'].rolling(
        50).mean(), legend="p(up|down)", line_color='red', line_width=1.0)
    p.line(range(1, 1000), mv_df.loc['p(down|up)'].rolling(
        50).mean(), legend="p(down|up)", line_color='blue', line_width=1.0)
    p.line(range(1, 1000), mv_df.loc['p(up|up)'].rolling(
        50).mean(), legend="p(up|up)", line_color='purple', line_width=1.0)

    p.toolbar.logo = None
    p.toolbar_location = None
    p.legend.location = "top_left"
    p.legend.label_text_font_size = "8pt"
    p.legend.padding = 2
    p.legend.spacing = 0
    p.legend.label_height = 10
    p.legend.glyph_height = 10

    return p

In [745]:
from bokeh.layouts import gridplot
p1f = plot_movement(1880, 1904, female_runs_df, 'Female')
p2f = plot_movement(1930, 1954, female_runs_df, 'Female')
p3f = plot_movement(1983, 2007, female_runs_df, 'Female')

p1m = plot_movement(1880, 1904, male_runs_df, 'Male')
p2m = plot_movement(1930, 1954, male_runs_df, 'Male')
p3m = plot_movement(1983, 2007, male_runs_df, 'Male')

grid = gridplot([[p1f, p1m], [p2f, p2m], [p3f, p3m]])
show(grid)

## Figure 5: How predictive is movement one year to the next?

In [784]:
def compute_same_direction(y1, runs_df):
    np1r1 = runs_df[(runs_df['year'] == y1)].copy()
    r1c = np1r1['movement'].value_counts()
    p_same = (r1c.loc['p(down|down)']+r1c.loc['p(up|up)'])/r1c.sum()
    return p_same


female_movement_predict = [compute_same_direction(
    year, female_runs_df) for year in range(1881, 2017)]
male_movement_predict = [compute_same_direction(
    year, male_runs_df) for year in range(1881, 2017)]

# create a bokeh plot
p = figure(
    width=500,
    height=350,
    tools="",
    x_range=[1881, 2017], x_axis_label="year",
    y_range=[0.2, 0.7], y_axis_label="Accuracy",
    title="Probability of a correct prediction using last year to predict this year"
)

# plot data line
p.line(range(1881, 2017), female_movement_predict,
       legend="Female", line_color='orange', line_width=1.0)
p.line(range(1881, 2017), male_movement_predict,
       legend="Male", line_color='blue', line_width=1.0)

p.toolbar.logo = None
p.toolbar_location = None
p.legend.location = "bottom_right"

show(p)
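
The accuracy plotted here is the share of names whose change kept the same direction two years running. A minimal sketch of that calculation on a made-up label series follows; unlike `compute_same_direction`, it uses `.get(..., 0)` so that a missing label does not raise a `KeyError`, and the helper name is hypothetical.

```python
import pandas as pd


def same_direction_share(movement):
    """Fraction of rows labelled p(up|up) or p(down|down)."""
    counts = movement.value_counts()
    same = counts.get('p(up|up)', 0) + counts.get('p(down|down)', 0)
    # As in the notebook, the denominator includes every label present.
    return same / counts.sum()


# Toy labels for a single year (not real data).
labels = pd.Series(['p(up|up)', 'p(down|up)', 'p(down|down)', 'p(up|down)'])
share = same_direction_share(labels)
print(share)
```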