## Diminutive Suffix Productivity
Juan Berrios | jeb358@pitt.edu | Last updated: March 17, 2020

**Summary and overview of the data:**

- The code in this notebook builds `DataFrame` objects from the `.txt` files in the corpus directories and does some preliminary data cleaning. The corpus I am using is the [*Corpus del español*](https://www.corpusdelespanol.org/), more specifically the [Web/Dialects](https://www.corpusdelespanol.org/web-dial/) corpus. While the corpus is searchable online, the full data set is also available for those wishing to do computational analyses such as this one. Doing so requires purchasing a license; I am authorized to use it through the license of the [Department of Linguistics](https://www.linguistics.pitt.edu/). Samples of the different formats can be downloaded from the [official website](https://www.corpusdata.org/formats.asp). I have also uploaded a copy of the free sample to the [data samples directory](https://github.com/Data-Science-for-Linguists-2020/Diminutive-Suffix-Productivity/tree/master/data_samples) of this repository. The data set is available in three formats: (i) database (Structured Query Language), (ii) word/lemma/PoS, and (iii) linear (raw) text. All are `.txt` files, and the first two are tab-delimited. I have chosen to work with the second format because the tags will come in handy and because it is quite compatible with Pandas.

**Contents:**
- [Section 1](###1.-Preparation)  includes the necessary preparations.
- [Section 2](###2.-Loading-files)  includes code for loading the files and turning them into data frames, using one of the `.txt` files as a sample.
- [Section 3](###3.-Processing-corpus-directories)  includes code for performing the operations on a corpus directory containing all the text files of one variety.
- [Section 4](###4.-Storing-files)  includes code for storing the resulting data frames as pickled files.

### 1. Preparation

- Loading libraries and additional settings:

In [2]:
#Importing libraries
import glob, pickle, re
import pandas as pd
import numpy as np

#Turning pretty print off:
%pprint

#Releasing all output:                                            
from IPython.core.interactiveshell import InteractiveShell #Displays the output of every expression in a cell, not just the last one.
InteractiveShell.ast_node_interactivity = "all"

Pretty printing has been turned OFF


### 2. Loading files

- The `txt` files are very large. For testing purposes, I'll use only one of them as a start. The files are also tab-delimited, which makes my job a little easier. The columns correspond to an ID for the source text, an ID for the token, the token (word), the lemma, and the POS. I will hence use those for column names. 

In [2]:
fname = '../../Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo/es-b-0.txt'

In [3]:
cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

In [4]:
#First row is ignored because it corresponds to an identifier for the .txt file.

df = pd.read_csv(fname,sep='\t',encoding ='iso-8859-1',skiprows=[0],header=None,names=cols) 
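Since the files are this large, one option (not what this notebook does) is to read each file in pieces and filter every piece before keeping it. Below is a minimal sketch, assuming pandas 1.2 or later for the context-manager form of `read_csv`. The function name `read_in_chunks`, the `keep` callback, and the tiny sample file are made up for illustration.

```python
import os
import tempfile
import pandas as pd

COLS = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

def read_in_chunks(fname, keep, chunksize=1_000_000):
    """Read a tab-delimited file piece by piece, keeping only rows that pass `keep`.

    `keep` is a hypothetical callback: it receives a DataFrame chunk and
    returns a boolean Series marking the rows to retain.
    """
    with pd.read_csv(fname, sep='\t', encoding='iso-8859-1', skiprows=[0],
                     header=None, names=COLS, chunksize=chunksize) as reader:
        kept = [chunk[keep(chunk)] for chunk in reader]
    return pd.concat(kept) if kept else pd.DataFrame(columns=COLS)

# Tiny stand-in file: one identifier line followed by four token rows.
sample = ("##431270 identifier line\n"
          "431270\t2206403096\tEste\teste\tdd-\n"
          "431270\t2206403097\tes\tser\tvip-3s\n"
          "431270\t2206403098\t.\t$.\ty\n"
          "431270\t2206403099\tblog\tblog\tnms\n")

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='iso-8859-1') as tmp:
    tmp.write(sample)
    path = tmp.name

# Drop symbol rows while reading, two rows at a time.
result = read_in_chunks(path, lambda chunk: chunk['POS'] != 'y', chunksize=2)
os.remove(path)
print(result)
```

The trade-off is that the full file never has to sit in memory at once, at the cost of deciding on the filter before loading.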

In [5]:
df.shape #It's a very large file (27345213 rows). It will get much smaller once I start cleaning up the data.

(27345213, 5)

In [6]:
df.head(5)   #The lemma column will be useful when I need to do aggregations that put lowercase and uppercase 
            #as well as plural and singular forms together.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
0,431270,2206403096,Este,este,dd-
1,431270,2206403097,es,ser,vip-3s
2,431270,2206403098,un,un,li-ms
3,431270,2206403099,blog,blog,nms
4,431270,2206403100,de,de,e


In [7]:
df.tail(5)

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
27345208,676060,2511862343,mas,mas,cc
27345209,676060,2511862344,informacion,información,n
27345210,676060,2511862345,visite,visitar,vsp-1/3s
27345211,676060,2511862346,www.DineroAbundancia.com,www.dineroabundancia.com,n
27345212,676060,2511862347,.,$.,y


In [8]:
df.sample(5) #Everything seems to be loaded correctly as of now. 

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
12560728,554710,1012050429,.,$.,y
25161571,658620,799549379,Pandilla,pandilla,nfs
25349054,659670,932940257,un,un,li-ms
15941911,583600,724041365,(,par-l,y
3361630,467990,1786426227,mismo,mismo,r


In [9]:
df['POS'].unique()  #These POS tags are not very transparent, but it's a good start. They will also come in 
                    #handy for data clean-up because diminutivization applies only to some classes. 

array(['dd-', 'vip-3s', 'li-ms', 'nms    ', 'e', 'nfp    ', 'y',
       'nfs    ', 'nmp    ', 'vsp-1/3s', 'n', 'cc', 'o', 'r', 'vps-ms',
       'ld-fs', 'm$', 'vc-1/3s', 'vr', 'po', 'li-fs', 'ld-mp', 'vip-3p',
       'ld-ms', 'jms    ', 'vpp', 'mc', 'vps-fs', 'j', 'vii-1/3s', 'e_21',
       'x', nan, 'cs', 'ld', 'v', 'cc-', 'ld-fp', 'pi-0cn', 'dd',
       'dxmp-ind-', 'vip-2s', 'dp-', 'vsp-3p', 'pi-0ms', 'jfp    ',
       'cS_21', 'cS_22', 'jmp    ', 'vsp-2s', 'vip-1s', 'vsp-1p', 'ps',
       'vip-1p/vis-1p', 'vip-1p', 'pd-3cs', 'vif-1p', 'vps-mp', 'vif-3s',
       'vif-3p', 'jfs    ', 'dxfs-ind-', 'vsj-1/3s', 'i', 'pi-3ms',
       'vis-3s', 'b', 'p', 'np', 'pr-3cn"', 'px', 'dxms-ind-', 'pi-3cs',
       'dxfs-', 'pr-3cs', 'px-ms', 'dxfp-ind-', 'vis-3p', 'li-mp', 'pi',
       'vsi-3p', 'px-mp', 'vsi-1/3s', 'pq-3cn"', 'vif-2s', 'vif-2p',
       'vsp-2p', 'vip-2p', 'vpp-00', 'vm-2p', 'vis-1p', 'dxcs-ind-',
       'pr-3cp', 'dxcs-dem-', 'vif-1s', 'vc-1p', 'cC_21', 'cC_22',
       'vps-fp',
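Several tags in the array above carry trailing spaces (for example `'nms    '`), so an exact comparison such as `df['POS'] == 'nms'` would miss them. A minimal sketch of stripping the padding, assuming the extra spaces carry no meaning:

```python
import pandas as pd

# A few tags copied from the output above, padding included.
pos = pd.Series(['dd-', 'nms    ', 'nfp    ', 'y', 'jms    '])

# Remove leading/trailing whitespace from every tag.
clean = pos.str.strip()

print((pos == 'nms').sum())    # exact match on the padded tags
print((clean == 'nms').sum())  # exact match after stripping
```

On the real data frame the same idea would be `df['POS'] = df['POS'].str.strip()`.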

In [11]:
df['Variety'] = 'ES' #Time to add a column for the variety of Spanish. In this case Spain.

In [12]:
df.head() #It works out well.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431270,2206403096,Este,este,dd-,ES
1,431270,2206403097,es,ser,vip-3s,ES
2,431270,2206403098,un,un,li-ms,ES
3,431270,2206403099,blog,blog,nms,ES
4,431270,2206403100,de,de,e,ES


In [13]:
df.tail()

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
27345208,676060,2511862343,mas,mas,cc,ES
27345209,676060,2511862344,informacion,información,n,ES
27345210,676060,2511862345,visite,visitar,vsp-1/3s,ES
27345211,676060,2511862346,www.DineroAbundancia.com,www.dineroabundancia.com,n,ES
27345212,676060,2511862347,.,$.,y,ES


In [14]:
df.sample(5)

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
23859409,644890,2343014538,a,a,e,ES
10021654,532990,1786645392,gratis,gratis,jms,ES
25918237,665520,507420331,panzers,panzer,n,ES
17728027,591680,2280544516,Como,como,r,ES
13070606,558910,1559326864,basaba,basar,vii-1/3s,ES


- A first step involved in cleaning up the data is to remove rows that are not necessary for this analysis. There are two main things to tackle first: symbols and '@' that are stand-ins for words that were removed from the corpus for copyright reasons when it was created. For the former, I can make use of the POS column. Symbols are tagged 'y'. 

In [15]:
df = df[df['POS'] != 'y'] 

In [16]:
df #It works. Looks like around 3,000,000 rows were removed.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431270,2206403096,Este,este,dd-,ES
1,431270,2206403097,es,ser,vip-3s,ES
2,431270,2206403098,un,un,li-ms,ES
3,431270,2206403099,blog,blog,nms,ES
4,431270,2206403100,de,de,e,ES
...,...,...,...,...,...,...
27345207,676060,2511862342,obtener,obtener,vr,ES
27345208,676060,2511862343,mas,mas,cc,ES
27345209,676060,2511862344,informacion,información,n,ES
27345210,676060,2511862345,visite,visitar,vsp-1/3s,ES


- Since '@' is meant to replace words and not symbols, it is tagged as a noun, so the same strategy doesn't work. An alternative is to use the Word column instead:

In [17]:
df = df[df['Word'] != '@'] 

In [18]:
df #Looks good, this removed about 1,500,000 more rows.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431270,2206403096,Este,este,dd-,ES
1,431270,2206403097,es,ser,vip-3s,ES
2,431270,2206403098,un,un,li-ms,ES
3,431270,2206403099,blog,blog,nms,ES
4,431270,2206403100,de,de,e,ES
...,...,...,...,...,...,...
27345207,676060,2511862342,obtener,obtener,vr,ES
27345208,676060,2511862343,mas,mas,cc,ES
27345209,676060,2511862344,informacion,información,n,ES
27345210,676060,2511862345,visite,visitar,vsp-1/3s,ES


- Lastly, I only want to keep diminutivized forms for analysis. 

In [23]:
df = df[(df['Word'].str[-3:] == 'ito') | (df['Word'].str[-3:] == 'ita')|   #Keeps only rows ending in the segments
  (df['Word'].str[-4:] == 'itos') | (df['Word'].str[-4:] == 'itas')|       #of interest. I will probably refine this
  (df['Word'].str[-4:] == 'illo') | (df['Word'].str[-4:] == 'illa')|       #later using regular expressions instead.
  (df['Word'].str[-5:] == 'illos') | (df['Word'].str[-5:] == 'illas')]     #The plural -illos/-illas needs a five-character slice.

In [24]:
df

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
98,431270,2206403194,Nikita,nikita,o,ES
110,431270,2206403206,escrito,escrito,jms,ES
321,431290,2074527333,calladita,calladito,j,ES
331,431290,2074527343,sólito,sólito,jms,ES
536,431310,2143630275,necesita,necesitar,vip-3s,ES
...,...,...,...,...,...,...
27344477,676040,250005751,ladrillo,ladrillo,nms,ES
27344546,676040,250005820,depósitos,depósito,nmp,ES
27344657,676040,250005931,precipita,precipitar,vip-3s,ES
27344674,676040,250005948,corralito,corralito,n,ES


- This gets the data to a first stage that's easier to work with. There are still many rows which do not belong in the data frame (e.g., verb forms other than gerunds that just happen to end in the same segment), but it will be more efficient to remove those when the full data frame is constructed. 
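One way the later clean-up could use the POS column is sketched below. It assumes that tags beginning with `n` mark nouns and tags beginning with `j` mark adjectives (inferred from sample rows such as `blog`/`nms` and `gratis`/`jms`, not from the corpus documentation), and that those are the classes worth keeping. The function name `keep_nominals` is hypothetical.

```python
import pandas as pd

def keep_nominals(df, prefixes=('n', 'j')):
    """Keep rows whose POS tag starts with one of `prefixes`.

    Assumption: 'n...' tags are nouns and 'j...' tags are adjectives.
    """
    tags = df['POS'].fillna('').str.strip()
    mask = tags.str[0].isin(prefixes)   # empty or missing tags evaluate to False
    return df[mask]

# Rows taken from the filtered output above.
toy = pd.DataFrame({
    'Word': ['escrito', 'necesita', 'ladrillo', 'precipita', 'corralito'],
    'POS':  ['jms', 'vip-3s', 'nms', 'vip-3s', 'n'],
})

result = keep_nominals(toy)
print(result['Word'].tolist())
```

Note that this only removes verb forms; lexicalized nouns such as `ladrillo` still pass the filter and would need separate treatment.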

### 3. Processing corpus directories

- As a first step, knowing now that the operations above are successful, I will define functions to make the processing pipeline for the full data frame object more efficient and streamlined:

In [4]:
def toDF(fname):
    """Turns tab-delimited file into a data frame"""
    cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']
    df = pd.read_csv(fname,sep='\t',encoding ='iso-8859-1',skiprows=[0],header=None,names=cols)
    return df

In [5]:
def add_variety(df,variety):
    """Adds a column specifying a given variety of Spanish to a data frame"""
    df['Variety'] = variety

In [6]:
def remove_syms(df):
    """Excludes symbols"""
    df = df[df['POS'] != 'y']
    return df

In [7]:
def remove_redacted(df):
    """Excludes @ symbols, which stand for words that have been redacted for copyright reasons"""
    df = df[df['Word'] != '@']
    return df

In [8]:
def remove_nondims(df):
    """Removes tokens that do not end in the segments of interest (-ito/-illo)"""
    df = df[(df['Word'].str[-3:] == 'ito') | (df['Word'].str[-3:] == 'ita')|
            (df['Word'].str[-4:] == 'itos') | (df['Word'].str[-4:] == 'itas')|
            (df['Word'].str[-4:] == 'illo') | (df['Word'].str[-4:] == 'illa')|
            (df['Word'].str[-5:] == 'illos') | (df['Word'].str[-5:] == 'illas')]  #Five-character slice for the plural forms.
    return df
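Because each cleaning function takes a data frame and returns one, they can be chained with `DataFrame.pipe`. The sketch below redefines compact stand-ins for the three functions so that it runs on its own; the `SUFFIXES` list and the `endswith` loop are an alternative phrasing of the slice comparisons, not the code used in this notebook.

```python
import pandas as pd

SUFFIXES = ['ito', 'ita', 'itos', 'itas', 'illo', 'illa', 'illos', 'illas']

def remove_syms(df):
    """Excludes symbols (POS tag 'y')."""
    return df[df['POS'] != 'y']

def remove_redacted(df):
    """Excludes the '@' placeholders."""
    return df[df['Word'] != '@']

def remove_nondims(df):
    """Keeps tokens ending in one of SUFFIXES."""
    mask = pd.Series(False, index=df.index)
    for suffix in SUFFIXES:
        mask |= df['Word'].str.endswith(suffix, na=False)
    return df[mask]

# Made-up toy rows covering each case.
toy = pd.DataFrame({
    'Word': ['corralito', '.', '@', 'blog', 'bocadillos'],
    'POS':  ['n', 'y', 'n', 'nms', 'nmp'],
})

cleaned = toy.pipe(remove_syms).pipe(remove_redacted).pipe(remove_nondims)
print(cleaned['Word'].tolist())
```

Chaining keeps the order of the steps visible in one line and avoids reassigning `df` after every call.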

- Now let's set the directory. I'll start with the Spain directory, since I used a Spain text file for the test run in Section 2 above. In addition, Spain is by far the largest directory (over 3 GB when compressed), so if the code works fine for it, it should work for the rest of the countries.

In [9]:
corpus_dir = '../../Diminutive-Suffix-Productivity/private/data/'
es_fname = glob.glob(corpus_dir + 'wlp_ES-sbo/*.txt')
es_fname[0]

'../../Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\\es-b-0.txt'

In [26]:
len(es_fname) #There are 20 files in total

20

In [27]:
es_DF = pd.DataFrame(columns=cols) #Builds a new, empty data frame object.
es_DF

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety


In [74]:
for fname in es_fname:                         #Note that I'm only using one of the cleaning functions
    df = toDF(fname)                           #because it gets me to the same end result. But the other ones   
    df = remove_nondims(df)                    #might be useful down the road, which is why I defined them.
    es_DF = pd.concat([es_DF, df], sort=True)
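Calling `pd.concat` inside the loop copies the accumulated frame on every iteration. A possible alternative is to collect the filtered pieces in a list and concatenate once at the end. The sketch below uses a hypothetical helper `build_variety_frame` and two made-up files in a temporary directory; unlike the loop above, it renumbers the rows (`ignore_index=True`) and adds the variety column in the same step.

```python
import os
import tempfile
import pandas as pd

COLS = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

def build_variety_frame(fnames, variety, row_filter):
    """Load every file, filter it, and concatenate all pieces in one call."""
    frames = []
    for fname in fnames:
        df = pd.read_csv(fname, sep='\t', encoding='iso-8859-1',
                         skiprows=[0], header=None, names=COLS)
        frames.append(row_filter(df))
    out = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=COLS)
    out['Variety'] = variety
    return out

# Two tiny stand-in files; the first line of each is the identifier row.
contents = {
    'demo-0.txt': "##id\n1\t10\tcorralito\tcorralito\tn\n1\t11\tde\tde\te\n",
    'demo-1.txt': "##id\n2\t20\tfavorito\tfavorito\tnms\n",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in contents.items():
        with open(os.path.join(tmp_dir, name), 'w', encoding='iso-8859-1') as f:
            f.write(text)
    fnames = sorted(os.path.join(tmp_dir, name) for name in contents)
    es_demo = build_variety_frame(
        fnames, 'ES',
        lambda df: df[df['Word'].str.endswith('ito', na=False)])

print(es_demo)
```

Since every piece has the same columns, no `sort` argument is needed and the column order stays as defined in `COLS`.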

In [82]:
es_DF #It takes a while (which is expected because of the file sizes), but it works. I won't rerun the notebook
      #because of the prior line but I've removed all extraneous cells for ease of reading.

Unnamed: 0,Lemma,POS,SourceID,TokenID,Variety,Word
98,nikita,o,431270,2206403194,,Nikita
110,escrito,jms,431270,2206403206,,escrito
321,calladito,j,431290,2074527333,,calladita
331,sólito,jms,431290,2074527343,,sólito
536,necesitar,vip-3s,431310,2143630275,,necesita
...,...,...,...,...,...,...
22146901,sencillo,jfs,1891249,1779969779,,sencilla
22147017,permitir,vsp-1/3s,1891249,1779969895,,permita
22147166,inscrito,j,1891249,1779970044,,inscrito
22147196,visita,nfp,1891249,1779970074,,visitas


In [83]:
add_variety(es_DF,'ES')

In [84]:
es_DF  #The columns are now in alphabetical order, most likely because of sort=True in pd.concat. I will fix this
       #once I have constructed the final data frame object.

Unnamed: 0,Lemma,POS,SourceID,TokenID,Variety,Word
98,nikita,o,431270,2206403194,ES,Nikita
110,escrito,jms,431270,2206403206,ES,escrito
321,calladito,j,431290,2074527333,ES,calladita
331,sólito,jms,431290,2074527343,ES,sólito
536,necesitar,vip-3s,431310,2143630275,ES,necesita
...,...,...,...,...,...,...
22146901,sencillo,jfs,1891249,1779969779,ES,sencilla
22147017,permitir,vsp-1/3s,1891249,1779969895,ES,permita
22147166,inscrito,j,1891249,1779970044,ES,inscrito
22147196,visita,nfp,1891249,1779970074,ES,visitas
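The alphabetical column order above is most likely a side effect of `sort=True`: when the frames passed to `pd.concat` do not share the same columns, that option sorts the resulting columns. A minimal sketch with made-up single-row frames, showing the effect and one way to restore the intended order by selecting the columns explicitly:

```python
import pandas as pd

cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

# One frame with a Variety column and one without, so the columns are not aligned.
a = pd.DataFrame([[431270, 2206403194, 'Nikita', 'nikita', 'o', 'ES']],
                 columns=cols + ['Variety'])
b = pd.DataFrame([[431270, 2206403206, 'escrito', 'escrito', 'jms']],
                 columns=cols)

shuffled = pd.concat([a, b], sort=True)
print(list(shuffled.columns))   # alphabetical order

# Selecting with an explicit list puts the columns back in the intended order.
restored = shuffled[cols + ['Variety']]
print(list(restored.columns))
```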


- Spain is ready; now I have to do the same for the remaining 20 countries. This will take me some time, but I don't expect the code above to run into issues as the files are all in the same format. Below are the corpus directories for the remaining countries:

In [10]:
ar_fname = glob.glob(corpus_dir + 'wlp_AR-tez/*.txt') #Argentina
bo_fname = glob.glob(corpus_dir + 'wlp_BO-teh/*.txt') #Bolivia
cl_fname = glob.glob(corpus_dir + 'wlp_CL-wts/*.txt') #Chile
co_fname = glob.glob(corpus_dir + 'wlp_CO-pem/*.txt') #Colombia
cr_fname = glob.glob(corpus_dir + 'wlp_CR-jfh/*.txt') #Costa Rica
cu_fname = glob.glob(corpus_dir + 'wlp_CU-rag/*.txt') #Cuba
do_fname = glob.glob(corpus_dir + 'wlp_DO-egn/*.txt') #Dominican Republic
ec_fname = glob.glob(corpus_dir + 'wlp_EC-jss/*.txt') #Ecuador
gt_fname = glob.glob(corpus_dir + 'wlp_GT-miv/*.txt') #Guatemala
hn_fname = glob.glob(corpus_dir + 'wlp_HN-paj/*.txt') #Honduras
mx_fname = glob.glob(corpus_dir + 'wlp_MX-vzo/*.txt') #Mexico
ni_fname = glob.glob(corpus_dir + 'wlp_NI-exu/*.txt') #Nicaragua
pa_fname = glob.glob(corpus_dir + 'wlp_PA-qlz/*.txt') #Panama
pe_fname = glob.glob(corpus_dir + 'wlp_PE-tae/*.txt') #Peru
pr_fname = glob.glob(corpus_dir + 'wlp_PR-epz/*.txt') #Puerto Rico
py_fname = glob.glob(corpus_dir + 'wlp_PY-ukd/*.txt') #Paraguay
sv_fname = glob.glob(corpus_dir + 'wlp_SV-xkl/*.txt') #El Salvador
us_fname = glob.glob(corpus_dir + 'wlp_US-ufh/*.txt') #The US
uy_fname = glob.glob(corpus_dir + 'wlp_UY-nde/*.txt') #Uruguay
#ve_fname = glob.glob(corpus_dir + 'wlp_VE-wsc/*.txt') #Venezuela. This file (ironically) doesn't work. I'll have to 
                                                       #go to the LMC and try to get a new copy. 

- Let's make sure all files are in the directories, as we did with Spain:

In [11]:
len(ar_fname)
len(bo_fname)
len(cl_fname)
len(co_fname)
len(cr_fname)
len(cu_fname)
len(do_fname)
len(ec_fname)
len(gt_fname)
len(hn_fname)
len(mx_fname)
len(ni_fname)
len(pa_fname)
len(pe_fname)
len(pr_fname)
len(py_fname)
len(sv_fname)
len(us_fname)
len(uy_fname)

20

20

20

20

20

20

20

20

20

20

20

20

20

20

20

20

20

20

20
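The per-country `glob` calls and `len` checks above could also be driven by a dictionary that maps each variety code to its directory name. The sketch below is one way to do that; the helper `list_corpus_files` is hypothetical, only two of the directory names listed above are included, and the files are empty placeholders created in a temporary directory.

```python
import glob
import os
import tempfile

def list_corpus_files(corpus_dir, subdirs):
    """Map each variety code to the sorted list of .txt files in its directory."""
    return {code: sorted(glob.glob(os.path.join(corpus_dir, subdir, '*.txt')))
            for code, subdir in subdirs.items()}

# Two of the directory names listed above; the rest would be added the same way.
subdirs = {'AR': 'wlp_AR-tez', 'BO': 'wlp_BO-teh'}

with tempfile.TemporaryDirectory() as corpus_dir:
    # Create placeholder files so the sketch runs without the licensed corpus.
    for subdir, n_files in [('wlp_AR-tez', 2), ('wlp_BO-teh', 1)]:
        os.makedirs(os.path.join(corpus_dir, subdir))
        for i in range(n_files):
            open(os.path.join(corpus_dir, subdir, f'demo-{i}.txt'), 'w').close()

    files = list_corpus_files(corpus_dir, subdirs)
    counts = {code: len(paths) for code, paths in files.items()}

print(counts)
```

With the real directories, `counts` would give the same check as the cell above in a single expression.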

- On to building data frames:

In [13]:
ar_DF = pd.DataFrame(columns=cols)
bo_DF = pd.DataFrame(columns=cols)
cl_DF = pd.DataFrame(columns=cols)
co_DF = pd.DataFrame(columns=cols)
cr_DF = pd.DataFrame(columns=cols)
cu_DF = pd.DataFrame(columns=cols)
do_DF = pd.DataFrame(columns=cols)
ec_DF = pd.DataFrame(columns=cols)
gt_DF = pd.DataFrame(columns=cols)
hn_DF = pd.DataFrame(columns=cols)
mx_DF = pd.DataFrame(columns=cols)
ni_DF = pd.DataFrame(columns=cols)
pa_DF = pd.DataFrame(columns=cols)
pe_DF = pd.DataFrame(columns=cols)
pr_DF = pd.DataFrame(columns=cols)
py_DF = pd.DataFrame(columns=cols)
sv_DF = pd.DataFrame(columns=cols)
us_DF = pd.DataFrame(columns=cols)
uy_DF = pd.DataFrame(columns=cols)

In [None]:
for fname in ar_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    ar_DF = pd.concat([ar_DF, df], sort=True)

In [55]:
for fname in bo_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    bo_DF = pd.concat([bo_DF, df], sort=True)

Chile:

In [61]:
for fname in cl_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    cl_DF = pd.concat([cl_DF, df], sort=True)

Colombia:

In [14]:
for fname in co_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    co_DF = pd.concat([co_DF, df], sort=True)

In [56]:
for fname in cr_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    cr_DF = pd.concat([cr_DF, df], sort=True)

In [17]:
for fname in cu_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    cu_DF = pd.concat([cu_DF, df], sort=True)

In [50]:
for fname in do_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    do_DF = pd.concat([do_DF, df], sort=True)

In [51]:
for fname in hn_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    hn_DF = pd.concat([hn_DF, df], sort=True)

In [46]:
for fname in ni_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    ni_DF = pd.concat([ni_DF, df], sort=True)

In [27]:
for fname in pa_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    pa_DF = pd.concat([pa_DF, df], sort=True)

In [25]:
for fname in pe_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    pe_DF = pd.concat([pe_DF, df], sort=True)

Puerto Rico:

In [22]:
for fname in pr_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    pr_DF = pd.concat([pr_DF, df], sort=True)

Paraguay:

In [20]:
for fname in py_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    py_DF = pd.concat([py_DF, df], sort=True)

Guatemala:

In [55]:
for fname in gt_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    gt_DF = pd.concat([gt_DF, df], sort=True)

El Salvador:

In [18]:
for fname in sv_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    sv_DF = pd.concat([sv_DF, df], sort=True)

US:

In [14]:
for fname in us_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    us_DF = pd.concat([us_DF, df], sort=True)

Uruguay:

In [16]:
for fname in uy_fname:                         
    df = toDF(fname)                           
    df = remove_nondims(df)                    
    uy_DF = pd.concat([uy_DF, df], sort=True)

Adding varieties:

In [18]:
add_variety(do_DF,'DO')
add_variety(co_DF,'CO')
add_variety(cl_DF,'CL')
add_variety(cu_DF,'CU')
add_variety(bo_DF,'BO')
add_variety(cr_DF,'CR')
add_variety(hn_DF,'HN')
add_variety(ni_DF,'NI')
add_variety(pa_DF,'PA')
add_variety(pe_DF,'PE')
add_variety(pr_DF,'PR')
add_variety(py_DF,'PY')
add_variety(sv_DF,'SV')
add_variety(uy_DF,'UY')
add_variety(gt_DF,'GT')
add_variety(us_DF,'US')

### 4. Storing files

- So that I don't have to rerun the whole notebook each time (or in case I want to use the resulting data frames in a different notebook), it makes sense to save the results in their current state. For this I'll make use of the pickle library. 
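As a sketch of the same idea in fewer cells: if the data frames were kept in a dictionary keyed by variety code (the `frames` dictionary below is hypothetical, with made-up single-row frames), they could be written in one loop with `DataFrame.to_pickle`, which opens and closes the file itself, and read back with `pd.read_pickle`.

```python
import os
import tempfile
import pandas as pd

# Hypothetical container holding one small frame per variety.
frames = {
    'es': pd.DataFrame({'Word': ['corralito'], 'Variety': ['ES']}),
    'bo': pd.DataFrame({'Word': ['favorito'], 'Variety': ['BO']}),
}

with tempfile.TemporaryDirectory() as out_dir:
    # Write every frame using the same naming pattern as the cells below.
    for code, frame in frames.items():
        frame.to_pickle(os.path.join(out_dir, f'{code}_DF.pkl'))

    # Read one back to confirm the round trip.
    reloaded = pd.read_pickle(os.path.join(out_dir, 'es_DF.pkl'))

print(reloaded)
```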

Spain:

In [87]:
f = open ('es_DF.pkl', 'wb')
pickle.dump(es_DF, f, -1)
f.close()

Bolivia:

In [58]:
f = open ('bo_DF.pkl', 'wb')
pickle.dump(bo_DF, f, -1)
f.close()

Colombia:

In [16]:
f = open ('co_DF.pkl', 'wb')
pickle.dump(co_DF, f, -1)
f.close()

In [64]:
f = open ('cl_DF.pkl', 'wb')
pickle.dump(cl_DF, f, -1)
f.close()

In [59]:
f = open ('cr_DF.pkl', 'wb')
pickle.dump(cr_DF, f, -1)
f.close()

Cuba:

In [19]:
f = open ('cu_DF.pkl', 'wb')
pickle.dump(cu_DF, f, -1)
f.close()

In [54]:
f = open ('do_DF.pkl', 'wb')
pickle.dump(do_DF, f, -1)
f.close()

In [53]:
f = open ('hn_DF.pkl', 'wb')
pickle.dump(hn_DF, f, -1)
f.close()

In [48]:
f = open ('ni_DF.pkl', 'wb')
pickle.dump(ni_DF, f, -1)
f.close()

In [39]:
f = open ('pa_DF.pkl', 'wb')
pickle.dump(pa_DF, f, -1)
f.close()

In [40]:
f = open ('pe_DF.pkl', 'wb')
pickle.dump(pe_DF, f, -1)
f.close()

Guatemala:

In [57]:
f = open ('gt_DF.pkl', 'wb')
pickle.dump(gt_DF, f, -1)
f.close()

Puerto Rico:

In [41]:
f = open ('pr_DF.pkl', 'wb')
pickle.dump(pr_DF, f, -1)
f.close()

Paraguay:

In [42]:
f = open ('py_DF.pkl', 'wb')
pickle.dump(py_DF, f, -1)
f.close()

El Salvador:

In [43]:
f = open ('sv_DF.pkl', 'wb')
pickle.dump(sv_DF, f, -1)
f.close()

US:

In [44]:
f = open ('us_DF.pkl', 'wb')
pickle.dump(us_DF, f, -1)
f.close()

Uruguay:

In [45]:
f = open ('uy_DF.pkl', 'wb')
pickle.dump(uy_DF, f, -1)
f.close()

## Alternative code

In [19]:
dftoy = df.iloc[5000:7000]

In [20]:
dftoy.shape

(2000, 6)

In [21]:
dftoy = dftoy[dftoy['Word'].str.contains(r'\w*i(?:t|ll)(?:o|a)s?\b')]   #The word boundary ties the match to the end of a word;
                                                                       #without it, str.contains would also match word-internal
                                                                       #sequences (e.g. 'editar'). It also lets the pattern work on
                                                                       #sentences. Non-capturing groups avoid the match-group warning.


In [22]:
dftoy

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
7163,431560,615218396,favorito,favorito,nms,ES
7658,431570,2427238236,sibaritas,sibarita,jmp,ES
7672,431570,2427238250,articulillo,articulillo,n,ES
7688,431570,2427238266,parrilla,parrilla,nfs,ES
7699,431570,2427238277,éxito,éxito,n,ES
7704,431570,2427238282,Pesadilla,pesadilla,nfs,ES
7905,431570,2427238483,éxito,éxito,n,ES
8131,431570,2427238709,tornillo,tornillo,nms,ES
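Because `str.contains` looks for the pattern anywhere in the string, the end anchor makes a real difference. The sketch below compares an unanchored pattern with one anchored by `$`, on a few made-up tokens. `na=False` treats missing words as non-matches, which keeps the boolean mask usable for indexing.

```python
import pandas as pd

words = pd.Series(['favorito', 'tornillo', 'sibaritas', 'editar', 'italiano', None])

# Unanchored: the suffix-like sequence may occur anywhere in the token.
unanchored = words.str.contains(r'i(?:t|ll)[oa]s?', na=False)

# Anchored: the sequence has to close the token.
anchored = words.str.contains(r'i(?:t|ll)[oa]s?$', na=False)

print(words[unanchored].tolist())
print(words[anchored].tolist())
```

Only the anchored version drops `editar` and `italiano`, where the sequence `ita` is word-internal.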
