## Diminutive Suffix Productivity: further cleaning and descriptive analysis
Juan Berrios | jeb358@pitt.edu | Last updated: March 17, 2020

**Summary and overview:**

- This notebook is a continuation of the [corpus processing notebook](https://github.com/Data-Science-for-Linguists-2020/Diminutive-Suffix-Productivity/blob/master/code/corpus_processing.ipynb) in my repository. The purpose is to finish cleaning the data frames I built there (previously pickled so they can be loaded here) and to create a master, cross-dialectal data frame for exploring descriptive statistics and starting the linguistic analysis of the data.

**Contents:**
- [Section 1](#1.-Preparation) includes the necessary preparations and loading of the files.
- [Section 2](#2.-Further-cleaning) includes code for further cleaning of the data to remove the extraneous rows that are still left.
- [Section 3](#3.-Exploratory-analysis) is the start of the analysis, with a focus on descriptive statistics and data visualization.
- [Section 4](#4.-Storing-files) includes code for storing the results as pickled files.

### 1. Preparation

- Loading libraries and additional settings:

In [265]:
#Importing libraries
import glob, pickle, re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Turning pretty print off:
%pprint

#Releasing all output:
from IPython.core.interactiveshell import InteractiveShell #Shows the output of every expression in a cell rather than only the last one.
InteractiveShell.ast_node_interactivity = "all"

Pretty printing has been turned OFF


- Loading pickle files

In [2]:
ar_DF = pickle.load(open( 'ar_DF.pkl', 'rb'))
bo_DF = pickle.load(open( 'bo_DF.pkl', 'rb'))
cl_DF = pickle.load(open( 'cl_DF.pkl', 'rb'))
co_DF = pickle.load(open( 'co_DF.pkl', 'rb'))
cr_DF = pickle.load(open( 'cr_DF.pkl', 'rb'))
cu_DF = pickle.load(open( 'cu_DF.pkl', 'rb'))
do_DF = pickle.load(open( 'do_DF.pkl', 'rb'))
ec_DF = pickle.load(open( 'ec_DF.pkl', 'rb'))
es_DF = pickle.load(open( 'es_DF.pkl', 'rb'))
gt_DF = pickle.load(open( 'gt_DF.pkl', 'rb'))
hn_DF = pickle.load(open( 'hn_DF.pkl', 'rb'))
mx_DF = pickle.load(open( 'mx_DF.pkl', 'rb'))
ni_DF = pickle.load(open( 'ni_DF.pkl', 'rb'))
pa_DF = pickle.load(open( 'pa_DF.pkl', 'rb'))
pe_DF = pickle.load(open( 'pe_DF.pkl', 'rb'))
pr_DF = pickle.load(open( 'pr_DF.pkl', 'rb'))
py_DF = pickle.load(open( 'py_DF.pkl', 'rb'))
sv_DF = pickle.load(open( 'sv_DF.pkl', 'rb'))
us_DF = pickle.load(open( 'us_DF.pkl', 'rb'))
uy_DF = pickle.load(open( 'uy_DF.pkl', 'rb'))
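
The twenty loads above follow one pattern, so they could also be written as a loop. The following is only a sketch, not the notebook's own code: the helper name `load_frames` is hypothetical, and the demo writes two tiny stand-in frames to a temporary directory so that it runs without the real pickle files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_frames(codes, directory="."):
    """Load '<code>_DF.pkl' for every variety code into a dict keyed by code."""
    directory = Path(directory)
    return {code: pd.read_pickle(directory / f"{code}_DF.pkl") for code in codes}


# Demo with two tiny stand-in frames (hypothetical data, not the real corpus).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Lemma": ["gatito"], "Variety": ["AR"]}).to_pickle(Path(tmp) / "ar_DF.pkl")
    pd.DataFrame({"Lemma": ["casita", "librito"], "Variety": ["BO", "BO"]}).to_pickle(Path(tmp) / "bo_DF.pkl")
    frames = load_frames(["ar", "bo"], tmp)

# Row counts per variety and a combined frame, both derived from the same dict.
row_counts = {code: df.shape[0] for code, df in frames.items()}
combined = pd.concat(list(frames.values()), sort=True)
```

Keeping the frames in a dict means the row counts and the concatenation can be produced from one object instead of twenty separate variables.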

In [3]:
type(ar_DF)
type(bo_DF)
type(cl_DF)
type(co_DF)
type(cr_DF)
type(cu_DF)
type(do_DF)
type(ec_DF)
type(es_DF)
type(gt_DF)
type(hn_DF)
type(mx_DF)
type(ni_DF)
type(pa_DF)
type(pe_DF)
type(pr_DF)
type(py_DF)
type(sv_DF)
type(us_DF)
type(uy_DF)

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

In [4]:
print('Argentina rows:', ar_DF.shape[0])
print('Bolivia rows:', bo_DF.shape[0])
print('Chile rows:', cl_DF.shape[0])
print('Colombia rows:', co_DF.shape[0])
print('Costa Rica rows:', cr_DF.shape[0])
print('Cuba rows:', cu_DF.shape[0])
print('Dominican Republic rows:', do_DF.shape[0])
print('Ecuador rows:', ec_DF.shape[0])
print('Spain rows:', es_DF.shape[0])
print('Guatemala rows:', gt_DF.shape[0])
print('Honduras rows:', hn_DF.shape[0])
print('Mexico rows:', mx_DF.shape[0])
print('Nicaragua rows:', ni_DF.shape[0])
print('Panama rows:', pa_DF.shape[0])
print('Peru rows:', pe_DF.shape[0])
print('Puerto Rico rows:', pr_DF.shape[0])
print('Paraguay rows:', py_DF.shape[0])
print('El Salvador rows:', sv_DF.shape[0])
print('US rows:', us_DF.shape[0])
print('Uruguay rows:', uy_DF.shape[0])

Argentina rows: 616845
Bolivia rows: 143682
Chile rows: 247619
Colombia rows: 646821
Costa Rica rows: 125645
Cuba rows: 214837
Dominican Republic rows: 145577
Ecuador rows: 234485
Spain rows: 1719752
Guatemala rows: 224592
Honduras rows: 136451
Mexico rows: 939030
Nicaragua rows: 129622
Panama rows: 455817
Peru rows: 451522
Puerto Rico rows: 145101
Paraguay rows: 103839
El Salvador rows: 146933
US rows: 626729
Uruguay rows: 137585
Uruguay rows: 137585


In [5]:
master_DF = pd.concat([ar_DF, bo_DF, cl_DF, co_DF, cr_DF, cu_DF, do_DF, ec_DF, es_DF, gt_DF, hn_DF,
                mx_DF, ni_DF, pa_DF, pe_DF, pr_DF, py_DF, sv_DF, us_DF, uy_DF], sort=True)

In [6]:
print('Master rows:', master_DF.shape[0])

Master rows: 7592484


In [7]:
master_DF = master_DF.dropna()

In [8]:
print('Master rows:', master_DF.shape[0])

Master rows: 7590609


In [9]:
master_DF.keys()

Index(['Lemma', 'POS', 'SourceID', 'TokenID', 'Variety', 'Word'], dtype='object')

In [10]:
master_DF = master_DF[['SourceID', 'TokenID', 'Lemma', 'Word', 'POS', 'Variety']]

In [11]:
master_DF.keys()

Index(['SourceID', 'TokenID', 'Lemma', 'Word', 'POS', 'Variety'], dtype='object')

### 2. Further cleaning

In [12]:
master_DF['POS'].value_counts()

n           1370111
nms         1306937
o            848151
nfs          707053
nmp          635693
vip-3s       613522
jms          371646
j            360502
jfs          328370
vip-1s       235896
nfp          135066
vps-ms       129978
jmp          102855
vsp-1/3s     100798
jfp           85536
vip-2s        81871
r             49707
v             45588
vps-fs        36387
vps-mp        19406
vps-fp        13977
vsp-2s         6941
np             1562
x              1142
m$              991
i               420
vr              201
fn              159
vm-2s           108
xy               19
e                12
y                 3
vm-3s             1
Name: POS, dtype: int64

In [13]:
master_DF = master_DF[master_DF['POS'].str.contains('n|j')] 

In [14]:
master_DF.shape

(5405490, 6)

In [15]:
master_DF['POS'].value_counts()

n          1370111
nms        1306937
nfs         707053
nmp         635693
jms         371646
j           360502
jfs         328370
nfp         135066
jmp         102855
jfp          85536
np            1562
fn             159
Name: POS, dtype: int64

In [16]:
master_DF = master_DF[master_DF['POS'] != ('fn')] 
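
The `'n|j'` filter keeps any tag that contains either letter, which is why `fn` survives it and has to be dropped in a second step. As a sketch on a toy column (the tags are taken from the counts above, the Series itself is made up), an anchored match does both steps at once. Whether every tag that starts with `n` or `j` in the full tagset is really a noun or adjective is an assumption to check against the data.

```python
import pandas as pd

# Toy POS column using tags that appear in the value counts above.
pos = pd.Series(["nms", "jfs", "fn", "vip-3s", "n", "np", "o"])

# Unanchored: keeps any tag containing "n" or "j" anywhere, so 'fn' slips through.
unanchored = pos[pos.str.contains("n|j")].tolist()

# Anchored: str.match only matches at the start of the string,
# so only tags beginning with "n" or "j" are kept.
anchored = pos[pos.str.match("[nj]")].tolist()
```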

In [17]:
master_DF.sample(5)

Unnamed: 0,SourceID,TokenID,Lemma,Word,POS,Variety
13554020,1776720,1063600933,mito,mito,nms,ES
2701,431293,329875056,escrito,escritos,jmp,ES
6121701,1705802,225141664,n.egrito,n.egrito,n,ES
8513612,1196970,2305587877,crédito,crédito,nms,US
366826,435275,812731737,cepillo,cepillo,nms,ES


In [18]:
master_DF['POS'].value_counts()

n          1370111
nms        1306937
nfs         707053
nmp         635693
jms         371646
j           360502
jfs         328370
nfp         135066
jmp         102855
jfp          85536
np            1562
Name: POS, dtype: int64

In [19]:
master_DF['POS'].unique()

array(['nmp    ', 'nfs    ', 'n', 'jfs    ', 'nms    ', 'j', 'jms    ',
       'nfp    ', 'jmp    ', 'jfp    ', 'np'], dtype=object)

In [20]:
master_DF['POS'] = master_DF['POS'].str.strip()

In [21]:
master_DF['POS'].unique()

array(['nmp', 'nfs', 'n', 'jfs', 'nms', 'j', 'jms', 'nfp', 'jmp', 'jfp',
       'np'], dtype=object)

In [22]:
pos_dict = {'n': 'n', 'nms': 'n', 'nfs': 'n', 'nmp': 'n', 'nfp': 'n', 'np': 'n', 
             'j': 'j', 'jms': 'j', 'jfs': 'j', 'jmp': 'j', 'jfp': 'j'}

In [23]:
master_DF['POS_binary'] = master_DF['POS'].map(pos_dict)

In [24]:
master_DF.sample(10)

Unnamed: 0,SourceID,TokenID,Lemma,Word,POS,Variety,POS_binary
762394,932791,2172708147,requisito,requisitos,nmp,NI,n
3977266,2301544,639611494,cerbesito,cerbesita,j,US,j
6233184,252951,19674691,maravilla,maravilla,nfs,CO,n
5101572,1159731,2408635331,propósito,propósito,nms,US,n
2016590,1065146,1263937078,sencillo,sencillo,jms,PR,j
562359,307107,2411512129,ámbito,ámbito,n,CR,n
2928046,1974726,2309768250,ejército,ejército,nms,MX,n
1024047,373932,258021410,tripito,tripita,j,DO,j
2606888,1671633,421731064,explícito,explícita,jfs,ES,j
7966163,1385842,234940674,guerrilla,guerrilla,nfs,AR,n


In [25]:
number_dict = {'n': 'unknown', 'nms': 'singular', 'nfs': 'singular', 'nmp': 'plural', 'nfp': 'plural', 
                  'np': 'unknown', 'j': 'unknown', 'jms': 'singular', 'jfs': 'singular', 'jmp': 'plural', 
                  'jfp': 'plural'}

In [26]:
gender_dict = {'n': 'unknown', 'nms': 'masculine', 'nfs': 'feminine', 'nmp': 'masculine', 'nfp': 'feminine', 
                  'np': 'unknown', 'j': 'unknown', 'jms': 'masculine', 'jfs': 'feminine', 'jmp': 'masculine', 
                  'jfp': 'feminine'}

In [27]:
master_DF['Number'] = master_DF['POS'].map(number_dict)

In [28]:
master_DF.sample(5)

Unnamed: 0,SourceID,TokenID,Lemma,Word,POS,Variety,POS_binary,Number
13303486,561604,2333688520,insólito,insólito,j,ES,j,unknown
3906697,1685565,891404601,hipócrita,hipócrita,jms,ES,j,singular
2944371,988836,1206995132,éxito,éxito,n,PA,n,unknown
21705626,620454,2321486958,poquito,poquito,n,ES,n,unknown
571360,2470973,512415509,delito,delito,nms,BO,n,singular


In [29]:
master_DF['Gender'] = master_DF['POS'].map(gender_dict)

In [30]:
master_DF.sample(5)

Unnamed: 0,SourceID,TokenID,Lemma,Word,POS,Variety,POS_binary,Number,Gender
4315488,1004700,994093105,bonito,bonita,jfs,PE,j,singular,feminine
3424137,1452634,665365838,delito,delito,nms,CL,n,singular,masculine
7550596,1188529,1853326605,brillo,brillo,nms,US,n,singular,masculine
905083,1214986,1246503548,gestito,gestito,n,UY,n,unknown,unknown
4331090,799113,2527637652,visita,visita,nfs,MX,n,singular,feminine
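
The three dictionaries above all decode the same tag layout: first character for part of speech, then gender, then number. A minimal sketch of that decoding as a single function follows. The name `split_tag` is hypothetical, and it assumes that only three-character tags carry gender and number, which matches the eleven tags left in the data at this point.

```python
def split_tag(tag):
    """Return (pos, gender, number) for tags like 'nms', 'jfp', 'n', 'j', or 'np'."""
    tag = tag.strip()
    genders = {"m": "masculine", "f": "feminine"}
    numbers = {"s": "singular", "p": "plural"}
    pos = tag[0]
    # Only full three-character tags encode both gender and number.
    if len(tag) == 3 and tag[1] in genders and tag[2] in numbers:
        return pos, genders[tag[1]], numbers[tag[2]]
    return pos, "unknown", "unknown"
```

Unlike `.map()` with a dictionary, which returns `NaN` for any tag missing from the dictionary, this falls back to `'unknown'` for tags it cannot decode.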


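Frequency-based approach to remove the remaining extraneous rows: many words that match the diminutive pattern (e.g. *éxito*, *delito*) are lexicalized forms rather than productive diminutives. The lexicon loaded below appears to be ordered by frequency, so the lemmas of its first 200 pattern-matching noun and adjective entries are treated as lexicalized and removed from the master data frame.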

In [40]:
fname = '../../Diminutive-Suffix-Productivity/private/data/span_lexicon/span_dic.txt'
lexicon = pd.read_csv(fname,sep='\t',encoding='iso-8859-1',skiprows=[1],low_memory=False) 

In [41]:
lexicon

Unnamed: 0,wID,word,lemma,PoS
0,1,",","$,",y
1,2,de,de,e
2,3,.,$.,y
3,4,la,la,ld-fs
4,5,y,y,cc
...,...,...,...,...
8754578,11925785,@@999948,,
8754579,11925786,@@99996,,
8754580,11925787,@@999965,,
8754581,11925788,@@999974,,


In [44]:
lexicon = lexicon.rename(columns={"wID": "WordID", "word": "Word", "lemma": "Lemma", "PoS": "POS"})

In [45]:
lexicon

Unnamed: 0,WordID,Word,Lemma,POS
0,1,",","$,",y
1,2,de,de,e
2,3,.,$.,y
3,4,la,la,ld-fs
4,5,y,y,cc
...,...,...,...,...
8754578,11925785,@@999948,,
8754579,11925786,@@99996,,
8754580,11925787,@@999965,,
8754581,11925788,@@999974,,


In [46]:
lexicon = lexicon.dropna()

In [47]:
lexicon

Unnamed: 0,WordID,Word,Lemma,POS
0,1,",","$,",y
1,2,de,de,e
2,3,.,$.,y
3,4,la,la,ld-fs
4,5,y,y,cc
...,...,...,...,...
8474411,11645618,descabellados,descabellar,v
8474412,11645619,hacendados,hacendar,v
8474413,11645620,reposeídas,reposeer,v
8474414,11645621,remozadas,remozar,v


In [49]:
lexicon['POS'].unique()

array(['y', 'e', 'ld-fs', 'cc', 'ld-ms', 'cs', 'ld-mp', 'xx', 'li-ms',
       'po', 'r', 'vip-3s', 'ld-fp', 'li-fs', 'dp-', 'dd-', 'v', 'vip-3p',
       'cS_22', 'vr', 'ld', 'nmp    ', 'nfs    ', 'j', 'pd-3cs', 'mc',
       'ps', 'nms    ', 'nfp    ', 'vii-1/3s', 'dxmp-ind-', 'pi-0cn',
       'pi-0ms', 'vis-3s', 'cS_21', 'dxms-ind-', 'pi', 'dxcs-ind-',
       'vsp-1/3s', 'o', 'n', 'pr-3cs', 'dxfp-ind-', 'pq-3cn', 'dxfs-ind-',
       'm$', 'pr-3cn"', 'jms    ', 'li-mp', 'px-ms', 'vip-1s', 'cS_33',
       'dxfs-', 'vps-ms', 'vif-3s', 'dxcs-dem-', 'jfs    ', 'jmp    ',
       'pr-3cp', 'pi-3cs', 'pp-2cs', 'vip-1p', 'px', 'vc-1/3s', 'pp-1cs',
       'cC_22', 'x', 'vpp', 'li-fp', 'pq-3cn"', 'vip-2s', 'vii-3p',
       'cC_21', 'jfp    ', 'i', 'e_32', 'vsp-3p', 'vis-3p', 'pi-3ms',
       'e_21', 'dd', 'pp-2cp', 'pq-3cs', 'px-mp', 'cc-', 'b', 'cS_32',
       'vsi-1/3s', 'dxcp-dem-', 'pr-3ms', 'vc-3p', 'pr-3fs', 'dxfp-',
       'e_22', 'vif-3p', 'vis-1s', 'cS_31', 'vip-1p/vis-1p', 'pp-2p',
    

In [50]:
#Stripping whitespace with assign() returns a new data frame and avoids the SettingWithCopyWarning:
lexicon = lexicon.assign(POS=lexicon['POS'].str.strip())


In [51]:
lexicon['POS'].unique()

array(['y', 'e', 'ld-fs', 'cc', 'ld-ms', 'cs', 'ld-mp', 'xx', 'li-ms',
       'po', 'r', 'vip-3s', 'ld-fp', 'li-fs', 'dp-', 'dd-', 'v', 'vip-3p',
       'cS_22', 'vr', 'ld', 'nmp', 'nfs', 'j', 'pd-3cs', 'mc', 'ps',
       'nms', 'nfp', 'vii-1/3s', 'dxmp-ind-', 'pi-0cn', 'pi-0ms',
       'vis-3s', 'cS_21', 'dxms-ind-', 'pi', 'dxcs-ind-', 'vsp-1/3s', 'o',
       'n', 'pr-3cs', 'dxfp-ind-', 'pq-3cn', 'dxfs-ind-', 'm$', 'pr-3cn"',
       'jms', 'li-mp', 'px-ms', 'vip-1s', 'cS_33', 'dxfs-', 'vps-ms',
       'vif-3s', 'dxcs-dem-', 'jfs', 'jmp', 'pr-3cp', 'pi-3cs', 'pp-2cs',
       'vip-1p', 'px', 'vc-1/3s', 'pp-1cs', 'cC_22', 'x', 'vpp', 'li-fp',
       'pq-3cn"', 'vip-2s', 'vii-3p', 'cC_21', 'jfp', 'i', 'e_32',
       'vsp-3p', 'vis-3p', 'pi-3ms', 'e_21', 'dd', 'pp-2cp', 'pq-3cs',
       'px-mp', 'cc-', 'b', 'cS_32', 'vsi-1/3s', 'dxcp-dem-', 'pr-3ms',
       'vc-3p', 'pr-3fs', 'dxfp-', 'e_22', 'vif-3p', 'vis-1s', 'cS_31',
       'vip-1p/vis-1p', 'pp-2p', 'pq-3cp', 'pv', 'cS_44', 'mo-ms-',
 

In [53]:
dim_lexicon = lexicon[lexicon['Word'].str.contains(r'i(?:t|ll)[oa]s?\b')] #Non-capturing group avoids the pandas match-group warning.

In [54]:
dim_lexicon

Unnamed: 0,WordID,Word,Lemma,POS
713,714,éxito,éxito,n
798,799,necesita,necesitar,vip-3s
1599,1600,propósito,propósito,nms
1678,1679,ámbito,ámbito,n
1700,1701,visita,visita,nfs
...,...,...,...,...
8471444,11642632,Zurroncito,zurroncito,o
8471445,11642633,zurroncitos,zurroncito,n
8471514,11642703,zurullito,zurullito,j
8471536,11642725,zurunguillas,zurunguilla,n
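
The pattern is purely orthographic, so it also picks up words that merely end in the same letters, such as `éxito` and `necesita` in the output above. A small sketch with the standard `re` module shows this on a hand-picked word list (the list is illustrative, not drawn from the corpus). The pattern is equivalent to the one used in the filter, written with a non-capturing group.

```python
import re

# Matches -ito/-ita/-illo/-illa, optionally plural, at the end of a word.
pattern = re.compile(r"i(?:t|ll)[oa]s?\b")

words = ["perrito", "casitas", "cuchillo", "éxito", "necesita", "perro", "sitio"]
hits = [w for w in words if pattern.search(w)]
```

Here `hits` contains true diminutives alongside `cuchillo`, `éxito` and `necesita`, which is why the part-of-speech filter and the lexicalized-lemma filter are still needed afterwards.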


In [55]:
dim_lexicon = dim_lexicon[dim_lexicon['POS'].str.contains('n|j')] 

In [56]:
dim_lexicon['POS'].unique()

array(['n', 'nms', 'nfs', 'jfs', 'jms', 'nmp', 'nfp', 'j', 'jfp', 'jmp',
       'np', 'fn'], dtype=object)

In [57]:
dim_lexicon = dim_lexicon[dim_lexicon['POS'] != ('fn')] 

In [58]:
dim_lexicon['POS'].unique()

array(['n', 'nms', 'nfs', 'jfs', 'jms', 'nmp', 'nfp', 'j', 'jfp', 'jmp',
       'np'], dtype=object)

In [60]:
dim_lexicon.shape

(70351, 4)

In [155]:
lemmas = dim_lexicon['Lemma']

In [156]:
len(lemmas)
type(lemmas)

70351

<class 'pandas.core.series.Series'>

In [173]:
lexicalized = set(lemmas[:200])

In [174]:
len(lexicalized)

113
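
The set has 113 members rather than 200 because the slice takes the first 200 rows, and several rows can share one lemma (singular and plural forms, for instance). A toy sketch of the same step, assuming a lexicon sorted from most to least frequent word form, with made-up rows and a hypothetical cutoff of 5:

```python
import pandas as pd

# Toy lexicon, assumed to be sorted from most to least frequent word form.
toy_lexicon = pd.DataFrame({
    "Word":  ["éxito", "visita", "visitas", "delito", "delitos", "perrito", "librito"],
    "Lemma": ["éxito", "visita", "visita", "delito", "delito", "perrito", "librito"],
})

top_n = 5  # hypothetical cutoff; the notebook uses 200
top_lemmas = toy_lexicon["Lemma"][:top_n]   # first five rows, duplicates included
lexicalized_toy = set(top_lemmas)           # duplicates collapse in the set
```

Note that a pure frequency cutoff cannot tell lexicalized forms from very frequent true diminutives: the real set printed below includes `perrito`, `casita` and `abuelita`.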

In [175]:
type(lexicalized)

<class 'set'>

In [176]:
lexicalized

{'manuscrito', 'circuito', 'perrito', 'israelita', 'maravilla', 'propósito', 'brillo', 'inscrito', 'cosita', 'depósito', 'mantequilla', 'éxito', 'jesuita', 'milla', 'súbdito', 'taquilla', 'zapatilla', 'pandilla', 'cigarrillo', 'meteorito', 'anillo', 'bendito', 'rodilla', 'vómito', 'perito', 'tobillo', 'sevilla', 'tortilla', 'mito', 'castillo', 'visita', 'cepillo', 'ilícito', 'órbita', 'apetito', 'granito', 'hito', 'besito', 'ventanilla', 'pastilla', 'exquisito', 'comillas', 'bonito', 'redondilla', 'abuelita', 'frito', 'dígito', 'ámbito', 'poquito', 'ratito', 'incógnita', 'infinito', 'amarillo', 'pleito', 'vainilla', 'arcilla', 'señorita', 'mérito', 'requisito', 'suscrito', 'casita', 'caudillo', 'distrito', 'implícito', 'hábito', 'planilla', 'delito', 'cucharadita', 'escrito', 'tránsito', 'grito', 'platillo', 'plantilla', 'semilla', 'súbito', 'bolsillo', 'mejilla', 'cápita', 'banquillo', 'cuchillo', 'pasillo', 'mosquito', 'cajita', 'rito', 'carrito', 'gratuito', 'añito', 'débito', 'guer

Final step: drop every row of the master data frame whose lemma is in the `lexicalized` set.

In [177]:
master_DF = master_DF[~master_DF['Lemma'].isin(lexicalized)] #Pandas way to select rows based on values not found in a collection.

In [178]:
master_DF.shape

(1536896, 9)
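
To see how much a filter like this removes, the boolean mask can be kept and summed before it is negated. This is a sketch on toy rows; `stop_lemmas` is a hypothetical stand-in for `lexicalized`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Lemma": ["delito", "librito", "éxito", "casita"],
    "Word":  ["delitos", "librito", "éxito", "casitas"],
})
stop_lemmas = {"delito", "éxito"}  # hypothetical stand-in for `lexicalized`

mask = toy["Lemma"].isin(stop_lemmas)  # True for rows to be dropped
kept = toy[~mask]                      # rows whose lemma is not in the set
removed = int(mask.sum())              # number of rows dropped
```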

In [182]:
master_DF.sample(5)

Unnamed: 0,SourceID,TokenID,Lemma,Word,POS,Variety,POS_binary,Number,Gender
1326831,153328,2296813246,librito,librito,n,CL,n,unknown,unknown
1729787,944423,364102783,sombrerito,sombrerito,n,NI,n,unknown,unknown
841051,310893,490606940,acólito,acólitos,nmp,CR,n,plural,masculine
208591,2467274,566248569,intuito,intuito,nms,BO,n,singular,masculine
3962655,228572,160823993,cuadernillo,cuadernillo,nms,CO,n,singular,masculine


### 3. Exploratory analysis

In [228]:
def get_dim(lemma):
    #Lemmas ending in -illo/-illa are labeled '-illo'; every other lemma is labeled '-ito'.
    if lemma.endswith(('illo', 'illa')):
        return '-illo'
    return '-ito'

In [229]:
master_DF['Diminutive'] = master_DF['Lemma'].map(get_dim)

In [230]:
master_DF.sample(5)

Unnamed: 0,SourceID,TokenID,Lemma,Word,POS,Variety,POS_binary,Number,Gender,Diminutive
898287,970990,1777925866,revisito,revisito,n,PE,n,unknown,unknown,-ito
4604130,1987308,2251343209,hijito,Hijitos,n,MX,n,unknown,unknown,-ito
16729274,1814300,2122445182,dinamita,dinamita,nfs,ES,n,singular,feminine,-ito
45651,303368,1565284609,brinquito,brinquitos,n,CR,n,unknown,unknown,-ito
42837,717881,2002937,muertito,muertitos,n,HN,n,unknown,unknown,-ito


In [231]:
master_DF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1536896 entries, 1338 to 2061267
Data columns (total 10 columns):
SourceID      1536896 non-null object
TokenID       1536896 non-null object
Lemma         1536896 non-null object
Word          1536896 non-null object
POS           1536896 non-null object
Variety       1536896 non-null object
POS_binary    1536896 non-null object
Number        1536896 non-null object
Gender        1536896 non-null object
Diminutive    1536896 non-null object
dtypes: object(10)
memory usage: 129.0+ MB


In [238]:
master_DF.describe() 

Unnamed: 0,SourceID,TokenID,Lemma,Word,POS,Variety,POS_binary,Number,Gender,Diminutive
count,1536896,1536896,1536896,1536896,1536896,1536896,1536896,1536896,1536896,1536896
unique,490972,1443283,40307,50201,11,20,2,3,3,2
top,1704776,1269287915,hipócrita,hipócrita,n,ES,n,unknown,unknown,-ito
freq,2119,2,16233,8093,711896,380958,1176343,953315,953315,1291306


- We get plenty of information here. There are 1,536,896 tokens, representing 40,307 distinct lemmas and 50,201 distinct word forms. *hipócrita* is the most frequent lemma (16,233 tokens). Spain is the most widely represented country (380,958 tokens). Noun is the most frequent part of speech (1,176,343 tokens) and -ito is the most common diminutive (1,291,306 tokens). To get information about gender and number we'll have to take an extra step, because most tokens (953,315) were not tagged for either category. All tokens that were tagged for gender were also tagged for number, so we only need to create one subset. It turns out that most tagged tokens are masculine and singular.
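
The token and type figures can also be pulled out one by one instead of being read off the `describe()` table. A sketch on a toy frame (the rows are made up; the column names follow `master_DF`):

```python
import pandas as pd

toy = pd.DataFrame({
    "Lemma": ["librito", "librito", "casita", "hijito"],
    "Word":  ["librito", "libritos", "casita", "hijitos"],
})

n_tokens = len(toy)                        # one row per token
n_lemma_types = toy["Lemma"].nunique()     # distinct lemmas
n_word_types = toy["Word"].nunique()       # distinct word forms
lemma_counts = toy["Lemma"].value_counts()
top_lemma = lemma_counts.idxmax()          # most frequent lemma
top_lemma_freq = int(lemma_counts.max())   # its token count
```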

In [254]:
master_DF[(master_DF['Gender'] == 'masculine')|(master_DF['Gender'] == 'feminine')].describe()

Unnamed: 0,SourceID,TokenID,Lemma,Word,POS,Variety,POS_binary,Number,Gender,Diminutive
count,583581,583581,583581,583581,583581,583581,583581,583581,583581,583581
unique,272021,552963,1490,3201,8,20,2,2,2,2
top,1704776,1380796754,hipócrita,hipócrita,nms,ES,n,singular,masculine,-ito
freq,1085,2,14447,8093,190953,156790,462885,449886,337732,395203


In [264]:
master_DF['Variety'].value_counts()

ES    380958
MX    182396
AR    139839
US    133663
CO    107277
PA     94293
PE     93613
CL     56541
CU     40988
GT     40622
EC     38981
UY     31198
SV     28981
PR     28737
CR     26847
DO     24524
BO     24252
NI     22168
HN     21861
PY     19157
Name: Variety, dtype: int64

In [258]:
master_DF['POS_binary'].value_counts()

n    1176343
j     360553
Name: POS_binary, dtype: int64

In [257]:
master_DF['Diminutive'].value_counts()

-ito     1291306
-illo     245590
Name: Diminutive, dtype: int64

In [259]:
dim_by_variety = master_DF.groupby('Variety').Diminutive.value_counts()
dim_by_variety.unstack()

Variety,-illo,-ito
AR,15659,124180
BO,3616,20636
CL,7498,49043
CO,17826,89451
CR,5747,21100
CU,6759,34229
DO,3765,20759
EC,6723,32258
ES,79982,300976
GT,4974,35648


In [260]:
dim_by_pos = master_DF.groupby('POS_binary').Diminutive.value_counts()
dim_by_pos.unstack()

POS_binary,-illo,-ito
j,16001,344552
n,229589,946754


In [262]:
master_DF.groupby(['Variety','POS_binary']).Diminutive.describe() #This will return a data frame with a multi-level index.

Variety,POS_binary,count,unique,top,freq
AR,j,31612,2,-ito,31039
AR,n,108227,2,-ito,93141
BO,j,6016,2,-ito,5777
BO,n,18236,2,-ito,14859
CL,j,13996,2,-ito,13508
CL,n,42545,2,-ito,35535
CO,j,26475,2,-ito,25293
CO,n,80802,2,-ito,64158
CR,j,5648,2,-ito,5198
CR,n,21199,2,-ito,15902


### 4. Storing files