## Diminutive Suffix Productivity: further cleaning and descriptive analysis
Juan Berrios | jeb358@pitt.edu | Last updated: March 17, 2020

**Summary and overview:**

- This notebook continues the [corpus processing notebook](https://github.com/Data-Science-for-Linguists-2020/Diminutive-Suffix-Productivity/blob/master/code/corpus_processing.ipynb) in my repository. Its purpose is to finish cleaning the data frames built there (they were pickled so they could be loaded here) and to combine them into a master, cross-dialectal data frame for exploring descriptive statistics and starting the linguistic analysis of the data.

**Contents:**
- [Section 1](#1.-Preparation) includes the necessary preparations and the loading of the files.
- [Section 2](#2.-Further-cleaning) includes code for further cleaning of the data to remove the extraneous rows that remain.
- [Section 3](#3.-Exploratory-analysis) is the start of the analysis, with a focus on descriptive statistics and data visualization.
- [Section 4](#4.-Storing-files) includes code for storing the results as pickled files.

### 1. Preparation

- Loading libraries and additional settings:

In [1]:
#Importing libraries
import glob, pickle, re
import pandas as pd
import numpy as np

#Turning pretty print off:
%pprint

#Releasing all output:
from IPython.core.interactiveshell import InteractiveShell #Displays the value of every expression in a cell rather than only the last one.
InteractiveShell.ast_node_interactivity = "all"

Pretty printing has been turned OFF


- Loading pickle files

In [2]:
ar_DF = pickle.load(open( 'ar_DF.pkl', 'rb'))
bo_DF = pickle.load(open( 'bo_DF.pkl', 'rb'))
cl_DF = pickle.load(open( 'cl_DF.pkl', 'rb'))
co_DF = pickle.load(open( 'co_DF.pkl', 'rb'))
cr_DF = pickle.load(open( 'cr_DF.pkl', 'rb'))
cu_DF = pickle.load(open( 'cu_DF.pkl', 'rb'))
do_DF = pickle.load(open( 'do_DF.pkl', 'rb'))
ec_DF = pickle.load(open( 'ec_DF.pkl', 'rb'))
es_DF = pickle.load(open( 'es_DF.pkl', 'rb'))
gt_DF = pickle.load(open( 'gt_DF.pkl', 'rb'))
hn_DF = pickle.load(open( 'hn_DF.pkl', 'rb'))
mx_DF = pickle.load(open( 'mx_DF.pkl', 'rb'))
ni_DF = pickle.load(open( 'ni_DF.pkl', 'rb'))
pa_DF = pickle.load(open( 'pa_DF.pkl', 'rb'))
pe_DF = pickle.load(open( 'pe_DF.pkl', 'rb'))
pr_DF = pickle.load(open( 'pr_DF.pkl', 'rb'))
py_DF = pickle.load(open( 'py_DF.pkl', 'rb'))
sv_DF = pickle.load(open( 'sv_DF.pkl', 'rb'))
us_DF = pickle.load(open( 'us_DF.pkl', 'rb'))
uy_DF = pickle.load(open( 'uy_DF.pkl', 'rb'))
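
The twenty calls above differ only in the two-letter variety code, and none of them closes its file explicitly. A minimal sketch of a loop-based alternative follows. The helper name `load_frames` and its `directory` parameter are hypothetical (they are not part of the original notebook), and the sketch assumes every file is named `<code>_DF.pkl`, as in the cell above.

```python
import os
import pickle

# Two-letter variety codes, taken from the file names loaded above.
CODES = ['ar', 'bo', 'cl', 'co', 'cr', 'cu', 'do', 'ec', 'es', 'gt',
         'hn', 'mx', 'ni', 'pa', 'pe', 'pr', 'py', 'sv', 'us', 'uy']

def load_frames(codes, directory='.'):
    """Hypothetical helper: unpickle '<code>_DF.pkl' for each code into a dict."""
    frames = {}
    for code in codes:
        path = os.path.join(directory, f'{code}_DF.pkl')
        # The with-block closes each file as soon as it has been read.
        with open(path, 'rb') as handle:
            frames[code] = pickle.load(handle)
    return frames

# Usage (assumes the pickle files sit in the working directory):
# frames = load_frames(CODES)
# ar_DF = frames['ar']
```

Keeping the frames in a dictionary would also allow the later type checks, row counts, and concatenation to be written as loops over `frames.items()`.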

In [3]:
type(ar_DF)
type(bo_DF)
type(cl_DF)
type(co_DF)
type(cr_DF)
type(cu_DF)
type(do_DF)
type(ec_DF)
type(es_DF)
type(gt_DF)
type(hn_DF)
type(mx_DF)
type(ni_DF)
type(pa_DF)
type(pe_DF)
type(pr_DF)
type(py_DF)
type(sv_DF)
type(us_DF)
type(uy_DF)

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.frame.DataFrame'>

In [4]:
print('Argentina rows:', ar_DF.shape[0])
print('Bolivia rows:', bo_DF.shape[0])
print('Chile rows:', cl_DF.shape[0])
print('Colombia rows:', co_DF.shape[0])
print('Costa Rica rows:', cr_DF.shape[0])
print('Cuba rows:', cu_DF.shape[0])
print('Dominican Republic rows:', do_DF.shape[0])
print('Ecuador rows:', ec_DF.shape[0])
print('Spain rows:', es_DF.shape[0])
print('Guatemala rows:', gt_DF.shape[0])
print('Honduras rows:', hn_DF.shape[0])
print('Mexico rows:', mx_DF.shape[0])
print('Nicaragua rows:', ni_DF.shape[0])
print('Panama rows:', pa_DF.shape[0])
print('Peru rows:', pe_DF.shape[0])
print('Puerto Rico rows:', pr_DF.shape[0])
print('Paraguay rows:', py_DF.shape[0])
print('El Salvador rows:', sv_DF.shape[0])
print('US rows:', us_DF.shape[0])
print('Uruguay rows:', uy_DF.shape[0])

Argentina rows: 616845
Bolivia rows: 143682
Chile rows: 247619
Colombia rows: 646821
Costa Rica rows: 125645
Cuba rows: 214837
Dominican Republic rows: 145577
Ecuador rows: 234485
Spain rows: 1719752
Guatemala rows: 224592
Honduras rows: 136451
Mexico rows: 939030
Nicaragua rows: 129622
Panama rows: 455817
Peru rows: 451522
Puerto Rico rows: 145101
Paraguay rows: 103839
El Salvador rows: 146933
US rows: 626729
Uruguay rows: 137585
Uruguay rows: 137585


In [5]:
master_DF = pd.concat([ar_DF, bo_DF, cl_DF, co_DF, cr_DF, cu_DF, do_DF, ec_DF, es_DF, gt_DF, hn_DF,
                mx_DF, ni_DF, pa_DF, pe_DF, pr_DF, py_DF, sv_DF, us_DF, uy_DF], sort=True)

In [6]:
print('Master rows:', master_DF.shape[0])

Master rows: 7592484
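
The master count equals the sum of the twenty per-variety counts printed earlier (they add up to 7,592,484), so no rows were lost in the concatenation. The sketch below shows the same check on toy frames. The frame names and their contents are illustrative only. It also shows that `pd.concat` keeps each frame's original index labels unless `ignore_index=True` is passed. Whether the real per-variety frames have overlapping labels is not stated in this notebook.

```python
import pandas as pd

# Toy stand-ins for two per-variety frames (illustrative data only).
ar_toy = pd.DataFrame({'Word': ['gatito', 'casita'], 'Variety': ['AR', 'AR']})
bo_toy = pd.DataFrame({'Word': ['perrito'], 'Variety': ['BO']})
parts = [ar_toy, bo_toy]

combined = pd.concat(parts, sort=True)

# Sanity check: concatenation should neither add nor drop rows.
assert len(combined) == sum(len(part) for part in parts)

# Each toy frame starts its index at 0, so labels repeat after concatenation.
has_repeated_labels = combined.index.duplicated().any()

# ignore_index=True renumbers the rows 0..n-1 instead.
renumbered = pd.concat(parts, sort=True, ignore_index=True)
```

Repeated index labels are harmless for the filtering done in this notebook, but they matter for label-based lookups with `.loc`.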


In [7]:
master_DF = master_DF.dropna()

In [8]:
print('Master rows:', master_DF.shape[0])

Master rows: 7590609
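
Dropping missing values removed 1,875 rows (7,592,484 minus 7,590,609). A minimal sketch, on toy data, of how one could check which columns contain the missing values before dropping them. The frame `toy` and its contents are hypothetical stand-ins for `master_DF`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for master_DF; the values are illustrative only.
toy = pd.DataFrame({
    'Lemma': ['gato', np.nan, 'casa'],
    'Word': ['gatito', 'perrito', 'casita'],
    'POS': ['nms', 'nms', None],
})

# Missing values per column, to see where the dropped rows come from.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that dropna() would discard (at least one missing value).
rows_with_gaps = toy[toy.isna().any(axis=1)]
print(len(rows_with_gaps), 'rows would be dropped')

cleaned = toy.dropna()
```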


In [9]:
master_DF.keys()

Index(['Lemma', 'POS', 'SourceID', 'TokenID', 'Variety', 'Word'], dtype='object')

In [10]:
master_DF = master_DF[['SourceID', 'TokenID', 'Lemma', 'Word', 'POS', 'Variety']]

In [11]:
master_DF.keys()

Index(['SourceID', 'TokenID', 'Lemma', 'Word', 'POS', 'Variety'], dtype='object')

### 2. Further cleaning

In [12]:
master_DF['POS'].value_counts()

n           1370111
nms         1306937
o            848151
nfs          707053
nmp          635693
vip-3s       613522
jms          371646
j            360502
jfs          328370
vip-1s       235896
nfp          135066
vps-ms       129978
jmp          102855
vsp-1/3s     100798
jfp           85536
vip-2s        81871
r             49707
v             45588
vps-fs        36387
vps-mp        19406
vps-fp        13977
vsp-2s         6941
np             1562
x              1142
m$              991
i               420
vr              201
fn              159
vm-2s           108
xy               19
e                12
y                 3
vm-3s             1
Name: POS, dtype: int64

In [13]:
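#Keeping rows whose tag contains 'n' (nouns) or 'j' (adjectives). The substring match also lets 'fn' through.
master_DF = master_DF[master_DF['POS'].str.contains('n|j')]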

In [14]:
master_DF.shape

(5405490, 6)

In [15]:
master_DF['POS'].value_counts()

n          1370111
nms        1306937
nfs         707053
nmp         635693
jms         371646
j           360502
jfs         328370
nfp         135066
jmp         102855
jfp          85536
np            1562
fn             159
Name: POS, dtype: int64

In [16]:
#Dropping the 'fn' rows, which are neither nouns nor adjectives:
master_DF = master_DF[master_DF['POS'] != 'fn']
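
The `'n|j'` pattern matches those letters anywhere in the tag, which is why `fn` survived the first filter and needed a separate step. A minimal sketch of an anchored alternative on toy tags follows. It assumes that noun and adjective tags always start with `n` or `j` and carry no leading whitespace, which holds for the values shown in this notebook but is not guaranteed for other data.

```python
import pandas as pd

# Toy tags, including the trailing whitespace seen in the real column.
tags = pd.Series(['nms    ', 'jfs    ', 'n', 'j', 'np', 'fn', 'vip-3s', 'o'])

# Substring match: true if 'n' or 'j' occurs anywhere in the tag.
substring_mask = tags.str.contains('n|j')

# str.match anchors the pattern at the start of each string.
anchored_mask = tags.str.match('[nj]')

kept_substring = tags[substring_mask].str.strip().tolist()
kept_anchored = tags[anchored_mask].str.strip().tolist()

print(kept_substring)  # includes 'fn'
print(kept_anchored)   # excludes 'fn'
```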

In [17]:
master_DF.sample(5)

| | SourceID | TokenID | Lemma | Word | POS | Variety |
|---|---|---|---|---|---|---|
| 2192871 | 137090 | 1697579944 | éxito | éxito | n | BO |
| 4172444 | 1989240 | 2646829788 | pancita | pancita | nfs | MX |
| 6025626 | 1532122 | 285134405 | guerrilla | guerrilla | nfs | CO |
| 1052901 | 2257308 | 1548534005 | delito | delito | nms | SV |
| 310681 | 327740 | 1098232739 | hilito | hilito | n | CU |


In [18]:
master_DF['POS'].value_counts()

n          1370111
nms        1306937
nfs         707053
nmp         635693
jms         371646
j           360502
jfs         328370
nfp         135066
jmp         102855
jfp          85536
np            1562
Name: POS, dtype: int64

In [19]:
master_DF['POS'].unique()

array(['nmp    ', 'nfs    ', 'n', 'jfs    ', 'nms    ', 'j', 'jms    ',
       'nfp    ', 'jmp    ', 'jfp    ', 'np'], dtype=object)

In [None]:
#Removing the trailing whitespace visible in the tags above:
master_DF['POS'] = master_DF['POS'].str.strip()

In [None]:
master_DF['POS'].unique()

In [None]:
pos_dict = {'n': 'n', 'nms': 'n', 'nfs': 'n', 'nmp': 'n', 'nfp': 'n', 'np': 'n', 
             'j': 'j', 'jms': 'j', 'jfs': 'j', 'jmp': 'j', 'jfp': 'j'}

In [None]:
master_DF['POS_binary'] = master_DF['POS'].map(pos_dict)

In [None]:
master_DF.sample(10)

In [None]:
number_dict = {'n': 'unknown', 'nms': 'singular', 'nfs': 'singular', 'nmp': 'plural', 'nfp': 'plural', 
                  'np': 'unknown', 'j': 'unknown', 'jms': 'singular', 'jfs': 'singular', 'jmp': 'plural', 
                  'jfp': 'plural'}

In [None]:
gender_dict = {'n': 'unknown', 'nms': 'masculine', 'nfs': 'feminine', 'nmp': 'masculine', 'nfp': 'feminine', 
                  'np': 'unknown', 'j': 'unknown', 'jms': 'masculine', 'jfs': 'feminine', 'jmp': 'masculine', 
                  'jfp': 'feminine'}

In [None]:
master_DF['Number'] = master_DF['POS'].map(number_dict)

In [None]:
master_DF.sample(5)

In [None]:
master_DF['Gender'] = master_DF['POS'].map(gender_dict)

In [None]:
master_DF.sample(5)
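
The three dictionaries above encode one regularity: in the three-letter tags, the first letter is the part of speech, the second the gender, and the third the number. A minimal sketch of deriving all three values from the tag itself follows. The helper `split_tag` is hypothetical, and it treats any tag that is not three letters long (`n`, `j`, `np`) as having unknown gender and number, as the dictionaries do.

```python
GENDER = {'m': 'masculine', 'f': 'feminine'}
NUMBER = {'s': 'singular', 'p': 'plural'}

def split_tag(tag):
    """Hypothetical helper: split a tag such as 'nfp' into (POS_binary, Gender, Number)."""
    tag = tag.strip()
    pos = tag[0]
    if len(tag) == 3:
        # Position 1 encodes gender, position 2 encodes number.
        return pos, GENDER.get(tag[1], 'unknown'), NUMBER.get(tag[2], 'unknown')
    # Bare tags ('n', 'j') and 'np' carry no gender or number information.
    return pos, 'unknown', 'unknown'

for example in ['nfp', 'jms', 'n', 'np']:
    print(example, split_tag(example))
```

With pandas, the same helper could be applied column-wise, for example `master_DF['POS'].map(lambda t: split_tag(t)[1])` for gender. The explicit dictionaries remain the safer choice if the tag set might contain three-letter tags that do not follow this layout.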

### 3. Exploratory analysis

### 4. Storing files