This project uses the Kaggle dataset ["When do children learn words?"](https://www.kaggle.com/rtatman/when-do-children-learn-words).

In [84]:
import pandas as pd #data manipulation & analysis  
import numpy as np #math

# Data Initialization 

In [85]:
#import the "main_data" CSV from Data folder
df = pd.read_csv("Data/main_data.csv") 

#import the "Norwegian_CDS_frequency" CSV from Data folder 
freq_df = pd.read_csv("Data/Norwegian_CDS_frequency.csv")

## main_data_df feature & metadata EDA 

In [86]:
df.head()

Unnamed: 0,ID_CDI_I,ID_CDI_II,Word_NW,Word_CDI,Translation,AoA,VSoA,Lex_cat,Broad_lex,Freq,CDS_freq
0,i_4_1,i_1_1,'au','au','ouch',16.0,40.0,sound effects,nominals,4366.0,7.0
1,i_4_2,i_1_2,'bææ','bææ','baa baa',15.0,40.0,sound effects,nominals,18.0,5.0
2,i_4_3,i_1_3,'brrr (bil-lyd)','brrr (bil-lyd)','vroom',13.0,20.0,sound effects,nominals,,20.0
3,i_4_4,i_1_4,'gakk gakk','gakk gakk','quack quack',17.0,40.0,sound effects,nominals,16.0,3.0
4,i_4_5,i_1_5,'grr','grr','grr',22.0,220.0,sound effects,nominals,78.0,1.0


In [87]:
df.dtypes

ID_CDI_I        object
ID_CDI_II       object
Word_NW         object
Word_CDI        object
Translation     object
AoA            float64
VSoA           float64
Lex_cat         object
Broad_lex       object
Freq           float64
CDS_freq       float64
dtype: object

In [88]:
df.shape

(731, 11)

In [89]:
df.columns

Index(['ID_CDI_I', 'ID_CDI_II', 'Word_NW', 'Word_CDI', 'Translation', 'AoA',
       'VSoA', 'Lex_cat', 'Broad_lex', 'Freq', 'CDS_freq'],
      dtype='object')

In [90]:
df.isna().sum()

ID_CDI_I       341
ID_CDI_II        0
Word_NW          0
Word_CDI         0
Translation      0
AoA             36
VSoA            27
Lex_cat         16
Broad_lex       16
Freq            10
CDS_freq         8
dtype: int64

This dataset contains 731 entries and 11 columns. Seven columns have missing entries that will need to be handled before analysis. Reference for columns:
* ID_CDI_I: Word ID from the Norwegian adaptation of the MacArthur-Bates Communicative Development Inventories, version 1
* ID_CDI_II: Word ID from the Norwegian adaptation of the MacArthur-Bates Communicative Development Inventories, version 2
* Word_NW: The word in Norwegian
* Word_CDI: The form of the word found in the Norwegian adaptation of the MacArthur-Bates Communicative Development Inventories
* Translation: the English translation of the Norwegian word
* AoA: how old a child generally is when they learn this word, in months (estimated from the MacArthur-Bates Communicative Development Inventories)
* VSoA: how many other words a child generally knows when they learn this word (rounded up to the nearest 10)
* Lex_cat: the specific part of speech of the word
* Broad_lex: the broad part of speech of the word
* Freq: a measure of how commonly this word occurs in Norwegian
* CDS_freq: a measure of how commonly this word occurs when a Norwegian adult is talking to a Norwegian child
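The raw NaN counts are easier to weigh against each other as a share of all rows. The following is a minimal sketch on a small stand-in frame; `toy` and its values are illustrative assumptions, not the real `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the main data; the values are illustrative only.
toy = pd.DataFrame({
    "ID_CDI_II": ["i_1_1", "i_1_2", "i_1_3", "i_1_4"],
    "AoA": [16.0, np.nan, 13.0, 17.0],
    "VSoA": [40.0, np.nan, 20.0, 40.0],
    "Freq": [4366.0, 18.0, np.nan, 16.0],
})

# Count and percentage of missing values per column, largest count first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```

Running the same two expressions on `df` would show at a glance which columns lose the most rows if their NaN values were dropped.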

### DF Data Cleaning 

In [91]:
df.loc[df['ID_CDI_I'].isna()]

Unnamed: 0,ID_CDI_I,ID_CDI_II,Word_NW,Word_CDI,Translation,AoA,VSoA,Lex_cat,Broad_lex,Freq,CDS_freq
9,,i_1_10,'oi (uttrykk for overraskelse)','oi (uttrykk for overraskelse)','oh',20.0,80.0,sound effects,nominals,5019.0,437.0
20,,i_2_9,'et esel','esel','donkey',32.0,560.0,common nouns,nominals,1292.0,9.0
27,,i_2_16,'en hane','hane','rooster',25.0,440.0,common nouns,nominals,1508.0,1.0
29,,i_2_18,'ei høne','høne','hen',24.0,280.0,common nouns,nominals,4401.0,5.0
34,,i_2_23,'en krokodille','krokodille','crocodile',24.0,280.0,common nouns,nominals,1190.0,15.0
...,...,...,...,...,...,...,...,...,...,...,...
726,,i_22_5,'hvis','hvis','if',35.0,660.0,closed-class items,closed-class,498529.0,86.0
727,,i_22_6,'men','men','but',33.0,600.0,closed-class items,closed-class,3015440.0,444.0
728,,i_22_7,'og','og','and',25.0,400.0,closed-class items,closed-class,16079937.0,1074.0
729,,i_22_8,'så','så','then',31.0,580.0,closed-class items,closed-class,2141716.0,900.0


341 rows have NaN in the ID_CDI_I column. The second ID column, ID_CDI_II, has no NaN values, so I am going to handle these NaN values by dropping the ID_CDI_I column.
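Before relying on ID_CDI_II alone, it may be worth confirming that it is both complete and unique. This is a hedged sketch on a toy frame (`toy` is an assumption for illustration); whether uniqueness holds for the real data is not verified here.

```python
import numpy as np
import pandas as pd

# Toy stand-in with one missing ID_CDI_I value; illustrative only.
toy = pd.DataFrame({
    "ID_CDI_I": ["i_4_1", np.nan, "i_4_3"],
    "ID_CDI_II": ["i_1_1", "i_1_10", "i_1_3"],
    "Word_CDI": ["au", "oi", "brrr"],
})

# ID_CDI_II can stand in as the row identifier only if it has
# no missing values and no duplicates.
id_is_complete = bool(toy["ID_CDI_II"].notna().all())
id_is_unique = bool(toy["ID_CDI_II"].is_unique)

if id_is_complete and id_is_unique:
    toy = toy.drop(columns=["ID_CDI_I"])

print(list(toy.columns))
```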

In [92]:
df = df.drop(columns =['ID_CDI_I'])

In [93]:
df.head()

Unnamed: 0,ID_CDI_II,Word_NW,Word_CDI,Translation,AoA,VSoA,Lex_cat,Broad_lex,Freq,CDS_freq
0,i_1_1,'au','au','ouch',16.0,40.0,sound effects,nominals,4366.0,7.0
1,i_1_2,'bææ','bææ','baa baa',15.0,40.0,sound effects,nominals,18.0,5.0
2,i_1_3,'brrr (bil-lyd)','brrr (bil-lyd)','vroom',13.0,20.0,sound effects,nominals,,20.0
3,i_1_4,'gakk gakk','gakk gakk','quack quack',17.0,40.0,sound effects,nominals,16.0,3.0
4,i_1_5,'grr','grr','grr',22.0,220.0,sound effects,nominals,78.0,1.0


In [100]:
df.loc[df['VSoA'] == 660.0]

Unnamed: 0,ID_CDI_II,Word_NW,Word_CDI,Translation,AoA,VSoA,Lex_cat,Broad_lex,Freq,CDS_freq
25,i_2_14,'ei gås','gås','goose',36.0,660.0,common nouns,nominals,1491.0,2.0
45,i_2_34,'et reinsdyr','reinsdyr','reindeer',36.0,660.0,common nouns,nominals,1317.0,3.0
54,i_2_43,'en valp','valp','puppy',34.0,660.0,common nouns,nominals,26852.0,2.0
80,i_4_11,'et kritt','kritt','chalk',35.0,660.0,common nouns,nominals,550.0,1.0
126,i_5_40,'nugatti','nugatti','Nugatti',36.0,660.0,common nouns,nominals,209.0,1.0
224,i_8_12,'en gåbil','gåbil','toy car',36.0,660.0,common nouns,nominals,3.0,1.0
273,i_9_11,'gyngestol','gyngestol','rocking chair',35.0,660.0,common nouns,nominals,188.0,1.0
291,i_9_29,'en tørketrommel','tørketrommel','dryer',35.0,660.0,common nouns,nominals,767.0,4.0
343,i_11_16,'en park','park','park',34.0,660.0,places to go,nominals,14131.0,19.0
376,i_12_27,'en oldefar','oldefar','great-grandfather',33.0,660.0,people,nominals,1234.0,1.0


There are 36 NaN values in the AoA column. AoA is correlated with VSoA: in the rows shown above, a VSoA of 660.0 corresponds to an AoA between 34.0 and 36.0. I am therefore going to fill missing AoA values that have a known VSoA of 660.0 with the mean of that AoA range, 35.0. The next cell applies the same fill value to rows with a VSoA of 680.0.
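The AoA range for a given VSoA can be checked with a groupby summary, and the same summary can drive the fill instead of a hard-coded constant. This is a sketch of that alternative on toy data, not the approach used in this notebook; `toy`, `summary` and `AoA_filled` are names assumed for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in; the last row has no VSoA to fall back on.
toy = pd.DataFrame({
    "VSoA": [660.0, 660.0, 660.0, 660.0, 40.0, 40.0, np.nan],
    "AoA": [34.0, 36.0, 35.0, np.nan, 15.0, 17.0, np.nan],
})

# Observed AoA range and mean for each VSoA value (NaN AoA is ignored).
summary = toy.groupby("VSoA")["AoA"].agg(["min", "max", "mean"])
print(summary)

# Fill each missing AoA with the mean AoA of rows sharing its VSoA.
# Rows whose VSoA is also missing stay NaN.
toy["AoA_filled"] = toy["AoA"].fillna(toy["VSoA"].map(summary["mean"]))
print(toy)
```

A group-mean fill like this only covers VSoA values that have at least one known AoA, so rows missing both columns still need separate handling.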

In [113]:
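#replace the NaN values in AoA based on VSoA
#rows with VSoA == 660.0 get the mean of the observed AoA range (34.0 - 36.0)
df['AoA'] = np.where(df['AoA'].isna() & (df['VSoA'] == 660.0), 35.0, df['AoA'])
#rows with VSoA == 680.0 get the same fill value
df['AoA'] = np.where(df['AoA'].isna() & (df['VSoA'] == 680.0), 35.0, df['AoA'])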

In [114]:
#sanity check
df.loc[df['AoA'].isna()]

Unnamed: 0,ID_CDI_II,Word_NW,Word_CDI,Translation,AoA,VSoA,Lex_cat,Broad_lex,Freq,CDS_freq
93,i_5_6,'en bønne','bønner','beans',,,common nouns,nominals,3285.0,1.0
104,i_5_18,'frokostblanding','frokostblanding','cereal',,,common nouns,nominals,686.0,1.0
137,i_5_51,'ristet brød','ristet brød','toast',,,common nouns,nominals,108.0,1.0
185,i_7_1,'en ankel','ankel','ankle',,,common nouns,nominals,3147.0,1.0
232,i_8_20,'kjøkkenrull','kjøkkenrull','roll of paper towels',,,common nouns,nominals,9.0,1.0
237,i_8_25,'ei krukke','krukke','jar',,,common nouns,nominals,1312.0,1.0
265,i_9_3,'en balkong','balkong','balcony',,,common nouns,nominals,3835.0,2.0
298,i_10_2,'bakgård','bakgård','backyard',,,places to go,nominals,2581.0,1.0
332,i_11_5,'en campingplass','campingplass','camping site',,,places to go,nominals,3425.0,1.0
341,i_11_14,'landet','landet','the countryside',,,places to go,nominals,,


The remaining NaN values in AoA have no VSoA value to reference, so I will drop these rows.
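`dropna` with `subset` removes rows and leaves the column set unchanged. A short sketch on toy data (the frame and its values are assumptions) that also records how many rows are lost:

```python
import numpy as np
import pandas as pd

# Toy stand-in: two rows are missing both AoA and VSoA.
toy = pd.DataFrame({
    "Word_CDI": ["bønner", "ankel", "park", "au"],
    "AoA": [np.nan, np.nan, 34.0, 16.0],
    "VSoA": [np.nan, np.nan, 660.0, 40.0],
})

n_before = len(toy)
# Drop only the rows where AoA is missing; all columns are kept.
cleaned = toy.dropna(subset=["AoA"])
n_dropped = n_before - len(cleaned)

print(f"dropped {n_dropped} of {n_before} rows")
print(cleaned.shape)
```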

In [115]:
df = df.dropna(subset=['AoA'])

In [117]:
df.loc[df['Lex_cat'].isna()]

Unnamed: 0,ID_CDI_II,Word_NW,Word_CDI,Translation,AoA,VSoA,Lex_cat,Broad_lex,Freq,CDS_freq
583,i_16_1,'en dag','dag','day',26.0,360.0,,,1074722.0,62.0
584,i_16_2,'etter','etter','after',31.0,580.0,,,1137455.0,23.0
586,i_16_4,'etterpå','etterpå','afterwords',25.0,360.0,,,36076.0,29.0
587,i_16_5,'før','før','before',30.0,580.0,,,598533.0,41.0
589,i_16_7,'i dag','i dag','today',27.0,440.0,,,394236.0,2.0
590,i_16_8,'i går','i går','yesterday',30.0,580.0,,,83408.0,7.0
591,i_16_9,'i kveld','i kveld','tonight',32.0,620.0,,,2709.0,2.0
592,i_16_10,'i morgen','i morgen','tomorrow',28.0,460.0,,,37720.0,7.0
593,i_16_11,'en kveld','kveld','evening',28.0,520.0,,,123627.0,2.0
594,i_16_12,'morgen','morgen','morning',28.0,520.0,,,46248.0,2.0


In [None]:
# Note: the rows with NaN in Lex_cat correspond to the rows with NaN in Broad_lex
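Whether the two columns are missing in exactly the same rows can be checked by comparing their NaN masks. A minimal sketch on toy data; `toy` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in: the first two rows lack both category labels.
toy = pd.DataFrame({
    "Word_CDI": ["dag", "etter", "park", "au"],
    "Lex_cat": [np.nan, np.nan, "places to go", "sound effects"],
    "Broad_lex": [np.nan, np.nan, "nominals", "nominals"],
})

lex_missing = toy["Lex_cat"].isna()
broad_missing = toy["Broad_lex"].isna()

# True only if every row is missing in both columns or in neither.
same_rows = bool((lex_missing == broad_missing).all())
print(same_rows)

# Cross-tabulation of the two masks; off-diagonal counts would be mismatches.
print(pd.crosstab(lex_missing, broad_missing))
```

If the masks match on the real data, both columns can be handled with a single decision, since any fix for one applies to the same rows of the other.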

## Norwegian_CDS_frequency_df feature & metadata EDA

In [40]:
freq_df.head()

Unnamed: 0,Word_CDI,Translation,Freq_NoWaC,Freq_CDS
0,'au','ouch',4366.0,7.0
1,'bææ','baa baa',18.0,5.0
2,'brrr (bil-lyd)','vroom',,20.0
3,'gakk gakk','quack quack',16.0,3.0
4,'grr','grr',78.0,1.0


In [41]:
freq_df.dtypes

Word_CDI        object
Translation     object
Freq_NoWaC     float64
Freq_CDS       float64
dtype: object

In [42]:
freq_df.shape

(731, 4)

In [43]:
freq_df.columns

Index(['Word_CDI', 'Translation', 'Freq_NoWaC', 'Freq_CDS'], dtype='object')

In [44]:
freq_df.isna().sum()

Word_CDI        0
Translation     0
Freq_NoWaC     10
Freq_CDS        8
dtype: int64

This dataset contains 731 entries and 4 columns. Two columns have null values that will have to be resolved before analysis. Reference for columns:
* Word_CDI: The word form, as found in the Norwegian adaptation of the MacArthur-Bates Communicative Development Inventories
* Translation: The English translation of the Norwegian word
* Freq_NoWaC: How often this word is used on the internet
* Freq_CDS: How often this word is used when talking to children (based on two Norwegian CHILDES corpora)
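The first rows of `freq_df` show the same frequency values as the `Freq` and `CDS_freq` columns of the main data, which suggests the two files may overlap. The sketch below shows one way to test that, using the first three rows from the `head()` outputs as toy input. Joining on `Word_CDI` with `validate="one_to_one"` is an assumption: it raises an error if the key has duplicates, which has not been checked for the full files.

```python
import numpy as np
import pandas as pd

# First three rows of each table, copied from the head() outputs.
main = pd.DataFrame({
    "Word_CDI": ["au", "bææ", "brrr (bil-lyd)"],
    "Freq": [4366.0, 18.0, np.nan],
    "CDS_freq": [7.0, 5.0, 20.0],
})
freq = pd.DataFrame({
    "Word_CDI": ["au", "bææ", "brrr (bil-lyd)"],
    "Freq_NoWaC": [4366.0, 18.0, np.nan],
    "Freq_CDS": [7.0, 5.0, 20.0],
})

# Align the two tables on the word form.
merged = main.merge(freq, on="Word_CDI", how="inner", validate="one_to_one")

# Compare the paired columns, treating NaN in both as equal.
same_freq = bool(np.isclose(merged["Freq"], merged["Freq_NoWaC"], equal_nan=True).all())
same_cds = bool(np.isclose(merged["CDS_freq"], merged["Freq_CDS"], equal_nan=True).all())

print(same_freq, same_cds)
```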