# Main Article Tables

This notebook creates the data used in Tables 1, 2 and 3 of the main text. The data created for Table 2 provide several candidate CVC and multisyllabic words that are 'easier' or 'harder' to acquire at 3, 4 and 6 years.

# 1. Imports and Loading

## Imports

In [15]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
current_dir = %pwd
if not current_dir == '/home/melissa/Dropbox/experiments/python/':
    %cd '/home/melissa/Dropbox/experiments/python/'

/home/melissa/Dropbox/experiments/python


In [36]:
from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.ticker as mtick
from scipy import stats

homedir = "/home/melissa/Dropbox/experiments/python/current_projects/jcl_multisyllabic_neighborhoods_2021/"
datadir = "data/combined/"
data = Path(homedir+datadir)
outdir = Path(homedir+"data/table_data/")

# Table 1

In [17]:
convert = {'PACT':'float64','Pct Child':'float64','Pct Adult':'float64',
           'len_syllables':'int32','ND':'int32', 'SOND':'int32'}
data3 = pd.read_pickle(data/'data3.pickle')
data3 = data3.astype(convert)
data4 = pd.read_pickle(data/'data4.pickle')
data4 = data4.astype(convert)
data6 = pd.read_pickle(data/'data6.pickle')
data6 = data6.astype(convert)

In [18]:
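def word_count_table(df):
    """Count all words, words with adult data, and the multisyllabic and CVC subsets of the latter."""
    has_adult = np.isfinite(df['Pct Adult'])
    # Combine masks with & instead of chaining [] lookups, which triggers a reindexing warning.
    return (df.shape[0], int(has_adult.sum()),
            int((has_adult & df['ismulti']).sum()),
            int((has_adult & df['iscvc']).sum()))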

In [19]:
word_count_table(data3),word_count_table(data4),word_count_table(data6)



((1557, 1397, 644, 386), (1687, 1495, 717, 393), (2260, 1939, 983, 483))

### Determining how many syllables are in the multisyllabic words

In [20]:
sylls3 = data3['len_syllables'][data3['ismulti']].values
sylls4 = data4['len_syllables'][data4['ismulti']].values
sylls6 = data6['len_syllables'][data6['ismulti']].values

In [21]:
np.max([np.max(sylls3),np.max(sylls4),np.max(sylls6)])

5

In [22]:
np.min([np.min(sylls3),np.min(sylls4),np.min(sylls6)])

2

# Table 2

In [23]:
data = Path.cwd()/'current_projects/jcl_multisyllabic_neighborhoods_2021/data'
old_data = data/'old_2018'

This table uses previously calculated 'old' data, which contain the same PACT values as in the paper along with many other calculations. The one of interest here is age of acquisition, used to rank the candidate words as 'easier' or 'harder' to acquire.

Up to 10 candidate words were chosen per age and word type. Over-sampling in this way allows for missing age-of-acquisition values (coded as age 0.0) and makes it possible to avoid duplicates across ages.
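As a rough illustration of why extra candidates are useful, the sketch below narrows candidate lists by skipping words with a missing age of acquisition (0.0) and words already used for a younger age. The helper `pick_examples`, its parameter `k`, and the toy data are assumptions made for illustration; this is not necessarily how the final examples in the paper were chosen.

```python
import pandas as pd

def pick_examples(candidates_by_age, k=2):
    """Keep up to k words per age, skipping missing AoA (0.0) and words already used."""
    used = set()
    picked = {}
    for age, df in candidates_by_age.items():
        valid = df[df['age_of_acquisition'] > 0.0]        # 0.0 codes a missing AoA value
        fresh = valid[~valid['orthographic'].isin(used)]  # avoid repeats across ages
        chosen = fresh.head(k)
        used.update(chosen['orthographic'])
        picked[age] = chosen['orthographic'].tolist()
    return picked

# Toy candidates, already sorted by age of acquisition (hypothetical subset).
toy = {
    3: pd.DataFrame({'orthographic': ['sucker', 'carrot', 'welcome'],
                     'age_of_acquisition': [0.0, 2.74, 4.06]}),
    4: pd.DataFrame({'orthographic': ['mother', 'welcome', 'stupid'],
                     'age_of_acquisition': [2.63, 4.06, 4.40]}),
}
examples = pick_examples(toy, k=2)
print(examples)
```

In this toy case `sucker` is dropped at age 3 because its age of acquisition is missing, and `welcome` is skipped at age 4 because it was already used at age 3.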

In [24]:
c3 = pd.read_pickle(old_data/'three_vars.pickle')
c4 = pd.read_pickle(old_data/'four_vars.pickle')
c6 = pd.read_pickle(old_data/'six_vars.pickle')

In [25]:
c3.columns

Index(['phonological', 'orthographic', 'percent_child', 'percent_adult',
       'token_child', 'token_adult', 'nucleus_density',
       'onset_nucleus_density', 'nucleus_coda_density', 'on_nu&nu_co_density',
       'onset_nucleus_coda_density', 'syllables', 'length_syllables',
       'length_phonemes', 'stressed_syll_position', 'onset_nucleus',
       'onset_nucleus_coda', 'SAD_density', 'SAD_frequency_pct_child',
       'SAD_frequency_tok_child', 'SAD_frequency_tok_raw',
       'SAD_frequency_pct_child_raw', 'SON_frequency', 'adult_SON_density',
       'SAD_frequency', 'SAD_frequency_pct', 'age_of_acquisition', 'fau_poly1',
       'fau_poly2', 'fau_token_poly1', 'fau_token_poly2', 'fau_pct_poly1',
       'fau_pct_poly2', 'adjusted_aoa', 'fau_bisyllabic',
       'fau_bisyllabic_trochaic', 'fau_cvc', 'fau_minus_1', 'fau_minus_10',
       'fau_minus_100', 'fau_random_25', 'fau_random_50', 'fau_fandom_75'],
      dtype='object')

In [26]:
iscvc3 = (c3['length_syllables']==0).values & np.isfinite(c3['percent_adult']).values
iscvc4 = (c4['length_syllables']==0).values & np.isfinite(c4['percent_adult']).values
iscvc6 = (c6['length_syllables']==0).values & np.isfinite(c6['percent_adult']).values

In [27]:
ismulti3 = (c3['length_syllables']>=2).values & np.isfinite(c3['percent_adult']).values
ismulti4 = (c4['length_syllables']>=2).values & np.isfinite(c4['percent_adult']).values
ismulti6 = (c6['length_syllables']>=2).values & np.isfinite(c6['percent_adult']).values

In [28]:
c3['iscvc'] = iscvc3
c3['ismulti'] = ismulti3
c4['iscvc'] = iscvc4
c4['ismulti'] = ismulti4
c6['iscvc'] = iscvc6
c6['ismulti'] = ismulti6

In [29]:
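# Copy the filtered frames so a zscore column can be added without a SettingWithCopyWarning.
p3 = c3[np.isfinite(c3['fau_pct_poly2'])].copy()
p4 = c4[np.isfinite(c4['fau_pct_poly2'])].copy()
p6 = c6[np.isfinite(c6['fau_pct_poly2'])].copy()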

In [30]:
p3['zscore'] = stats.zscore(p3['fau_pct_poly2'])
p4['zscore'] = stats.zscore(p4['fau_pct_poly2'])
p6['zscore'] = stats.zscore(p6['fau_pct_poly2'])



### Create the high PACT multisyllabic words

In [31]:
cols = ['orthographic','age_of_acquisition']
n=1.75
num=10
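# Combine both conditions in a single mask; chained [] lookups trigger a reindexing warning.
m3g = p3.loc[(p3['zscore'] > n) & p3['ismulti'], cols]
m4g = p4.loc[(p4['zscore'] > n) & p4['ismulti'], cols]
m6g = p6.loc[(p6['zscore'] > n) & p6['ismulti'], cols]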
word_max = np.amin([num,len(m3g),len(m4g),len(m6g)],axis=0)
s_m3g = m3g.sort_values(by=['age_of_acquisition'],ascending=True)[:word_max]
s_m4g = m4g.sort_values(by=['age_of_acquisition'],ascending=True)[:word_max]
s_m6g = m6g.sort_values(by=['age_of_acquisition'],ascending=True)[:word_max]

d = {'3 words': s_m3g['orthographic'].values, '3 aoa': s_m3g['age_of_acquisition'].values,
     '4 words': s_m4g['orthographic'].values, '4 aoa': s_m4g['age_of_acquisition'].values,
     '6 words': s_m6g['orthographic'].values, '6 aoa': s_m6g['age_of_acquisition'].values,}

df_multi_high = pd.DataFrame(data=d)
df_multi_high




Unnamed: 0,3 words,3 aoa,4 words,4 aoa,6 words,6 aoa
0,sucker,0.0,mother,2.63,hour,0.0
1,flour|flower,0.0,spider,3.43,carrot|carrot:carrots,2.74
2,carrot|carrot:carrots,2.74,airplane,3.94,puppy,3.28
3,welcome,4.06,welcome,4.06,toilet,3.54
4,mellon|melon,4.21,stupid,4.4,kitten,3.64
5,ladder|ladder:ladders,4.4,teacher,4.55,circle,3.67
6,monster|monster:monsters,4.58,doctor|doctor:doctor's|doctor:doctors,4.6,welcome,4.06
7,doctor|doctor:doctor's,4.6,garbage,4.89,stupid,4.4
8,napkin,4.79,girlfriend,6.26,grownup,4.43


### Create the low PACT multisyllabic words

In [32]:
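# Keep multisyllabic words with a z-score below n (1.75), using a single combined mask.
m3g = p3.loc[(p3['zscore'] < n) & p3['ismulti'], cols]
m4g = p4.loc[(p4['zscore'] < n) & p4['ismulti'], cols]
m6g = p6.loc[(p6['zscore'] < n) & p6['ismulti'], cols]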
word_max = np.amin([num,len(m3g),len(m4g),len(m6g)],axis=0)
s_m3g = m3g.sort_values(by=['age_of_acquisition'],ascending=False)[:word_max]
s_m4g = m4g.sort_values(by=['age_of_acquisition'],ascending=False)[:word_max]
s_m6g = m6g.sort_values(by=['age_of_acquisition'],ascending=False)[:word_max]

d = {'3 words': s_m3g['orthographic'].values, '3 aoa': s_m3g['age_of_acquisition'].values,
     '4 words': s_m4g['orthographic'].values, '4 aoa': s_m4g['age_of_acquisition'].values,
     '6 words': s_m6g['orthographic'].values, '6 aoa': s_m6g['age_of_acquisition'].values,}

df_multi_low = pd.DataFrame(data=d)
df_multi_low


Unnamed: 0,3 words,3 aoa,4 words,4 aoa,6 words,6 aoa
0,excavator,13.0,cappuccino,12.42,mignon,14.4
1,finesse,12.39,network,10.94,internet,12.33
2,tofu,11.89,incline,10.9,frisky,12.17
3,aerial,11.5,disposal,10.75,grimace,12.0
4,ebony,11.18,recycle:recycling,10.61,satellite,11.68
5,mackerel,10.72,beeper,10.61,pesticide,11.63
6,sterling,10.7,furnace,10.6,gusto,11.53
7,sexy,10.65,appendix,10.56,mangle:mangled,11.29
8,scuttle,10.58,micro,10.44,changer,11.2
9,poppy,10.37,poppy,10.37,administration,11.11


### Create the high PACT CVC words

In [33]:
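# Combine both conditions in a single mask; chained [] lookups trigger a reindexing warning.
m3g = p3.loc[(p3['zscore'] > n) & p3['iscvc'], cols]
m4g = p4.loc[(p4['zscore'] > n) & p4['iscvc'], cols]
m6g = p6.loc[(p6['zscore'] > n) & p6['iscvc'], cols]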
word_max = np.amin([num,len(m3g),len(m4g),len(m6g)],axis=0)
s_m3g = m3g.sort_values(by=['age_of_acquisition'],ascending=True)[:word_max]
s_m4g = m4g.sort_values(by=['age_of_acquisition'],ascending=True)[:word_max]
s_m6g = m6g.sort_values(by=['age_of_acquisition'],ascending=True)[:word_max]

d = {'3 words': s_m3g['orthographic'].values, '3 aoa': s_m3g['age_of_acquisition'].values,
     '4 words': s_m4g['orthographic'].values, '4 aoa': s_m4g['age_of_acquisition'].values,
     '6 words': s_m6g['orthographic'].values, '6 aoa': s_m6g['age_of_acquisition'].values,}

df_cvc_high = pd.DataFrame(data=d)
df_cvc_high


Unnamed: 0,3 words,3 aoa,4 words,4 aoa,6 words,6 aoa
0,duck,3.5,rain|rain:raining,0.0,butt,0.0
1,pig,3.84,wood|wood:woods,0.0,bell,0.0
2,poop,4.0,walk|walk:walking|walk:walks,3.45,mat,0.0
3,ham,4.1,laugh,3.79,shoot,0.0
4,peach,4.21,poop|poop:pooped,4.0,coat,3.58
5,seat,4.58,peach|peach:peaches,4.21,dig,4.19
6,fall|fall:falling,4.71,bowl,4.26,peach,4.21
7,hop,4.84,kid|kid:kids,4.28,bowl,4.26
8,cop|cop:cops,4.94,dumb,4.5,dumb,4.5
9,hate,5.53,hop,4.84,hop,4.84


### Create the low PACT CVC words

In [34]:
# Keep CVC words with a z-score below n (1.75), using a single combined mask.
m3g = p3.loc[(p3['zscore'] < n) & p3['iscvc'], cols]
m4g = p4.loc[(p4['zscore'] < n) & p4['iscvc'], cols]
m6g = p6.loc[(p6['zscore'] < n) & p6['iscvc'], cols]
word_max = np.amin([num,len(m3g),len(m4g),len(m6g)],axis=0)
s_m3g = m3g.sort_values(by=['age_of_acquisition'],ascending=False)[:word_max]
s_m4g = m4g.sort_values(by=['age_of_acquisition'],ascending=False)[:word_max]
s_m6g = m6g.sort_values(by=['age_of_acquisition'],ascending=False)[:word_max]

d = {'3 words': s_m3g['orthographic'].values, '3 aoa': s_m3g['age_of_acquisition'].values,
     '4 words': s_m4g['orthographic'].values, '4 aoa': s_m4g['age_of_acquisition'].values,
     '6 words': s_m6g['orthographic'].values, '6 aoa': s_m6g['age_of_acquisition'].values,}

df_cvc_low = pd.DataFrame(data=d)
df_cvc_low


Unnamed: 0,3 words,3 aoa,4 words,4 aoa,6 words,6 aoa
0,nerve:nerves,9.67,nerve|nerve:nerves,9.67,pike,12.06
1,din,9.0,nag,9.21,nerve:nerves,9.67
2,fawn,8.4,dine:dining,8.95,tab,9.33
3,peep,8.22,pad,8.83,dine:dining,8.95
4,bum,8.11,pep,8.75,doubt,8.79
5,hut,8.1,league,8.45,pep,8.75
6,fig,8.06,chef,8.3,jig,8.7
7,buck,7.68,bum,8.11,chef,8.3
8,wreck|wreck:wrecked,7.59,hut,8.1,dill,8.2
9,fool,7.56,rail:railing,7.89,hut,8.1
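
The four selection cells above differ only in the z-score condition, the subset column and the sort direction. A minimal sketch of how they might be folded into one helper is shown here. The function name `top_candidates` and the toy values are assumptions, and the sketch leaves out the `word_max` step that trims all three ages to the same length.

```python
import pandas as pd

def top_candidates(df, mask, ascending, num=10,
                   cols=('orthographic', 'age_of_acquisition')):
    """Return up to num rows matching mask, ranked by age of acquisition."""
    picked = df.loc[mask, list(cols)]
    return picked.sort_values(by='age_of_acquisition', ascending=ascending).head(num)

# Toy frame with made-up z-scores, for illustration only.
toy = pd.DataFrame({
    'orthographic': ['duck', 'pig', 'nerve', 'carrot', 'tofu'],
    'age_of_acquisition': [3.5, 3.84, 9.67, 2.74, 11.89],
    'zscore': [2.1, 1.9, -0.3, 2.4, -1.0],
    'iscvc': [True, True, True, False, False],
})
n = 1.75
# Same conditions as the notebook cells: "> n" for high PACT, "< n" for low PACT.
cvc_high = top_candidates(toy, (toy['zscore'] > n) & toy['iscvc'], ascending=True)
cvc_low = top_candidates(toy, (toy['zscore'] < n) & toy['iscvc'], ascending=False)
print(cvc_high)
print(cvc_low)
```

Passing the mask explicitly keeps the threshold visible at each call site, which makes it easier to check that each list uses the intended cutoff.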


### Combine all examples into one dataframe, export to csv

In [37]:
all_example = pd.concat([df_cvc_low, df_cvc_high, df_multi_low, df_multi_high])
all_example.to_csv(outdir/"PACT_examples.csv")
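
Because the four frames share the same column names and row numbers, the exported CSV does not record which block a row came from. One possible way to keep that information is the `keys` and `names` arguments of `pd.concat`, sketched here on toy frames. The label strings `'cvc_low'` and `'cvc_high'` and the level names `'subset'` and `'rank'` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-ins for two of the example frames.
cvc_low = pd.DataFrame({'3 words': ['din', 'fawn'], '3 aoa': [9.0, 8.4]})
cvc_high = pd.DataFrame({'3 words': ['duck', 'pig'], '3 aoa': [3.5, 3.84]})

# keys adds an outer index level naming the source frame of each row.
labelled = pd.concat([cvc_low, cvc_high],
                     keys=['cvc_low', 'cvc_high'],
                     names=['subset', 'rank'])

# Flatten the two index levels into ordinary columns before export.
flat = labelled.reset_index()
print(flat)
```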

# Table 3

Caption: Mean PACT values (and standard deviations) for different subsets of words.

In [38]:
# Collect the word index of each age group, then take the union across ages.
i3 = set(data3.index)
i4 = set(data4.index)
i6 = set(data6.index)
alli = i3 | i4 | i6

In [47]:
# ALL PACT
pact = pd.DataFrame(columns=['Three','Four','Six'],index=alli)
pact['Three'] = data3['PACT']
pact['Four'] = data4['PACT']
pact['Six'] = data6['PACT']

In [48]:
pact.mean().round(3),pact.std().round(2)

(Three   -0.0
 Four    -0.0
 Six     -0.0
 dtype: float64,
 Three    0.68
 Four     0.65
 Six      0.67
 dtype: float64)

In [49]:
# CVC PACT
pact_cvc = pd.DataFrame(columns=['Three','Four','Six'],index=alli)
pact_cvc['Three'] = data3['PACT'][data3.iscvc]
pact_cvc['Four'] = data4['PACT'][data4.iscvc]
pact_cvc['Six'] = data6['PACT'][data6.iscvc]

In [51]:
pact_cvc.mean().round(2), pact_cvc.std().round(2)

(Three    0.10
 Four     0.04
 Six      0.08
 dtype: float64,
 Three    0.68
 Four     0.70
 Six      0.68
 dtype: float64)

In [52]:
# MULTISYLLABIC PACT
pact_multi = pd.DataFrame(columns=['Three','Four','Six'],index=alli)
pact_multi['Three'] = data3['PACT'][data3.ismulti]
pact_multi['Four'] = data4['PACT'][data4.ismulti]
pact_multi['Six'] = data6['PACT'][data6.ismulti]

In [53]:
pact_multi.mean().round(2),pact_multi.std().round(2)

(Three   -0.06
 Four    -0.05
 Six     -0.07
 dtype: float64,
 Three    0.68
 Four     0.61
 Six      0.67
 dtype: float64)
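
The three Table 3 computations above follow the same pattern: take the PACT column for an age group, optionally restrict it to a subset, and report the mean and standard deviation. The sketch below expresses that pattern as a single function on toy data. The name `pact_summary` and the toy frames are assumptions, and it skips the alignment on the shared index `alli`, which does not affect column-wise means or standard deviations because missing values are ignored. Note that pandas' `std` is the sample standard deviation (`ddof=1`).

```python
import pandas as pd

def pact_summary(frames, mask_col=None):
    """Mean and SD of PACT per age group, optionally restricted to a boolean column."""
    rows = {}
    for label, df in frames.items():
        values = df['PACT'] if mask_col is None else df.loc[df[mask_col], 'PACT']
        rows[label] = values.agg(['mean', 'std'])
    return pd.DataFrame(rows).round(2)

# Toy frames with made-up PACT values, for illustration only.
toy3 = pd.DataFrame({'PACT': [1.0, -1.0, 0.5, -0.5],
                     'iscvc': [True, True, False, False]})
toy4 = pd.DataFrame({'PACT': [2.0, 0.0, -2.0],
                     'iscvc': [True, False, True]})

summary_all = pact_summary({'Three': toy3, 'Four': toy4})
summary_cvc = pact_summary({'Three': toy3, 'Four': toy4}, mask_col='iscvc')
print(summary_all)
print(summary_cvc)
```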