# Diminutive Suffix Productivity: corpus processing and cleaning
Juan Berrios | jeb358@pitt.edu | Last updated: April 14, 2020

**Summary and overview of the data:**

- The code in this notebook builds `DataFrame` objects from the `.txt` files in the corpus directories and performs some preliminary data cleaning. The corpus I am using is the [*Corpus del español*](https://www.corpusdelespanol.org/), more specifically the [Web/Dialects](https://www.corpusdelespanol.org/web-dial/) corpus. While the corpus is searchable online, the full data set is also available to those who wish to do computational analyses such as this one. Doing so requires purchasing a license; I am authorized to use it through the license of the [Department of Linguistics](https://www.linguistics.pitt.edu/). Samples of the different formats can be downloaded from the [official website](https://www.corpusdata.org/formats.asp). I have also uploaded a copy of the free sample to the [data samples directory](https://github.com/Data-Science-for-Linguists-2020/Diminutive-Suffix-Productivity/tree/master/data_samples) of this repository. The data set is available in three formats: (i) database (Structured Query Language), (ii) word/lemma/PoS, and (iii) linear (raw) text. All are `.txt` files, and the first two are tab-delimited. I have chosen to work with the second format because the tags will come in handy and because it is readily compatible with Pandas.

**Contents:**
1. [Preparation](#1.-Preparation) includes the necessary preparations.
2. [Loading files](#2.-Loading-files) includes code for loading the files, cleaning them, and turning them into data frames, using one of the `.txt` files as a sample.
3. [Processing corpus directories](#3.-Processing-corpus-directories) includes code for performing the operations on a corpus directory containing all the text files of one variety. The resulting data frames are stored as `.pkl` files.

## 1. Preparation

- Loading libraries and additional settings:

In [1]:
#Importing libraries
import glob, pickle, re
import pandas as pd
import numpy as np

#Turning pretty print off:
%pprint

#Releasing all output:                                            
from IPython.core.interactiveshell import InteractiveShell #Displays the output of every expression in a cell rather than only the last one.
InteractiveShell.ast_node_interactivity = "all"

Pretty printing has been turned OFF


## 2. Loading files

- The `txt` files are very large. For testing purposes, I'll use only one of them as a start. The files are also tab-delimited, which makes my job a little easier. The columns correspond to an ID for the source text, an ID for the token, the token (word), the lemma, and the POS. I will hence use those for column names. 

In [2]:
fname = '../../Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo/es-b-0.txt'

In [3]:
cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

In [4]:
#First row is ignored because it corresponds to an identifier for the .txt file.

df = pd.read_csv(fname,sep='\t',encoding ='iso-8859-1',skiprows=[0],header=None,names=cols) 
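
- The same call can be tried on a small in-memory sample before touching the large files. This is a minimal sketch: the sample rows and the file identifier line are made up, and `quoting=csv.QUOTE_NONE` is an assumption that is not part of the call above. It tells the parser to treat a quotation-mark token as ordinary text rather than as the start of a quoted field.

```python
import csv
import io

import pandas as pd

# Hypothetical three-token sample in the word/lemma/PoS layout; the first line
# mimics the file identifier that skiprows=[0] discards.
sample = (
    "##hypothetical-file-id\n"
    "1\t10\tEste\teste\tdd-\n"
    '1\t11\t"\t$"\ty\n'
    "1\t12\tblog\tblog\tnms\n"
)
cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

# quoting=csv.QUOTE_NONE is an assumption (not in the original call): quotation
# marks are read as literal characters instead of field delimiters.
demo = pd.read_csv(io.StringIO(sample), sep='\t', skiprows=[0], header=None,
                   names=cols, quoting=csv.QUOTE_NONE)
print(demo)
```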

In [5]:
df = df.dropna() #Removing NaN values
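
- One possible source of `NaN` values is worth keeping in mind: by default `read_csv` converts strings such as `null` or `NA` to `NaN`, so tokens spelled that way would be dropped along with truly empty cells. Whether this affects the corpus files is an assumption I have not verified; the sketch below only shows the difference on two made-up rows, using the `keep_default_na` parameter.

```python
import io

import pandas as pd

# Two made-up rows; the first token happens to be spelled "null".
sample = "1\t10\tnull\tnull\tn\n1\t11\tblog\tblog\tnms\n"
cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

# Default behavior: "null" is interpreted as a missing value.
default_read = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=cols)

# keep_default_na=False: no string is converted to NaN, so the token survives.
literal_read = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=cols,
                           keep_default_na=False)

print(default_read['Word'].isna().sum())
print(literal_read['Word'].tolist())
```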

In [6]:
df.shape #It's a very large file (27345213 rows before dropping NaN values). It will get much smaller once I start cleaning up the data.

(27310829, 5)

In [7]:
df.head(5)   #The lemma column will be useful when I need to do aggregations that put lowercase and uppercase 
            #as well as plural and singular forms together.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
0,431270,2206403096,Este,este,dd-
1,431270,2206403097,es,ser,vip-3s
2,431270,2206403098,un,un,li-ms
3,431270,2206403099,blog,blog,nms
4,431270,2206403100,de,de,e


In [8]:
df.tail(5)

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
27345208,676060,2511862343,mas,mas,cc
27345209,676060,2511862344,informacion,información,n
27345210,676060,2511862345,visite,visitar,vsp-1/3s
27345211,676060,2511862346,www.DineroAbundancia.com,www.dineroabundancia.com,n
27345212,676060,2511862347,.,$.,y


In [9]:
df.sample(5) #Everything seems to be loaded correctly as of now. 

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
20684906,611100,1803563509,te,tu,po
3228921,466920,1589481691,pero,pero,cc
16800789,586970,1715751836,el,el,ld-ms
29473,431880,1473416153,esto,esto,pd-3cs
15419231,579140,1912395913,lo,lo,po


In [10]:
df['POS'].unique()  #These POS tags are not very transparent, but it's a good start. These will also come in 
                    #handy for data clean up because diminutivization applies only to some classes. 

array(['dd-', 'vip-3s', 'li-ms', 'nms    ', 'e', 'nfp    ', 'y',
       'nfs    ', 'nmp    ', 'vsp-1/3s', 'n', 'cc', 'o', 'r', 'vps-ms',
       'ld-fs', 'm$', 'vc-1/3s', 'vr', 'po', 'li-fs', 'ld-mp', 'vip-3p',
       'ld-ms', 'jms    ', 'vpp', 'mc', 'vps-fs', 'j', 'vii-1/3s', 'e_21',
       'x', 'cs', 'ld', 'v', 'cc-', 'ld-fp', 'pi-0cn', 'dd', 'dxmp-ind-',
       'vip-2s', 'dp-', 'vsp-3p', 'pi-0ms', 'jfp    ', 'cS_21', 'cS_22',
       'jmp    ', 'vsp-2s', 'vip-1s', 'vsp-1p', 'ps', 'vip-1p/vis-1p',
       'vip-1p', 'pd-3cs', 'vif-1p', 'vps-mp', 'vif-3s', 'vif-3p',
       'jfs    ', 'dxfs-ind-', 'vsj-1/3s', 'i', 'pi-3ms', 'vis-3s', 'b',
       'p', 'np', 'pr-3cn"', 'px', 'dxms-ind-', 'pi-3cs', 'dxfs-',
       'pr-3cs', 'px-ms', 'dxfp-ind-', 'vis-3p', 'li-mp', 'pi', 'vsi-3p',
       'px-mp', 'vsi-1/3s', 'pq-3cn"', 'vif-2s', 'vif-2p', 'vsp-2p',
       'vip-2p', 'vpp-00', 'vm-2p', 'vis-1p', 'dxcs-ind-', 'pr-3cp',
       'dxcs-dem-', 'vif-1s', 'vc-1p', 'cC_21', 'cC_22', 'vps-fp',
       'vii
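
- Several tags in the output above carry trailing spaces (e.g., `'nms    '`), so an exact comparison such as `== 'nms'` would miss them. A minimal sketch of normalizing the tags with `str.strip`, using a handful of values copied from the output:

```python
import pandas as pd

# A few tags as they appear in the unique() output, padding included.
tags = pd.Series(['nms    ', 'nfp    ', 'vip-3s', 'jms    ', 'y'])

# str.strip removes leading and trailing whitespace from every value.
clean = tags.str.strip()

print((tags == 'nms').sum(), (clean == 'nms').sum())
print(clean.tolist())
```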

In [11]:
df['Variety'] = 'ES' #Time to add a column for the variety of Spanish. In this case Spain.

In [12]:
df.head() #It works out well.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431270,2206403096,Este,este,dd-,ES
1,431270,2206403097,es,ser,vip-3s,ES
2,431270,2206403098,un,un,li-ms,ES
3,431270,2206403099,blog,blog,nms,ES
4,431270,2206403100,de,de,e,ES


In [13]:
df.tail()

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
27345208,676060,2511862343,mas,mas,cc,ES
27345209,676060,2511862344,informacion,información,n,ES
27345210,676060,2511862345,visite,visitar,vsp-1/3s,ES
27345211,676060,2511862346,www.DineroAbundancia.com,www.dineroabundancia.com,n,ES
27345212,676060,2511862347,.,$.,y,ES


In [14]:
df.sample(5)

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
13969830,566880,1491338568,las,la,ld-fp,ES
11132087,542350,67032467,¡,$¡,y,ES
7869575,512780,2505104061,",","$,",y,ES
5285951,488550,597595099,todos,todo,dxmp-ind-,ES
8847542,523950,1671410917,sabemos,saber,vip-1p,ES


- A first step in cleaning up the data is to remove rows that are not necessary for this analysis. There are two main things to tackle first: symbols, and the '@' signs that stand in for words that were removed from the corpus for copyright reasons when it was created. For the former, I can make use of the POS column. Symbols are tagged 'y'. 

In [15]:
df = df[df['POS'] != 'y'] 

In [16]:
df #It works. Looks like around 3,000,000 rows were removed.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431270,2206403096,Este,este,dd-,ES
1,431270,2206403097,es,ser,vip-3s,ES
2,431270,2206403098,un,un,li-ms,ES
3,431270,2206403099,blog,blog,nms,ES
4,431270,2206403100,de,de,e,ES
...,...,...,...,...,...,...
27345207,676060,2511862342,obtener,obtener,vr,ES
27345208,676060,2511862343,mas,mas,cc,ES
27345209,676060,2511862344,informacion,información,n,ES
27345210,676060,2511862345,visite,visitar,vsp-1/3s,ES


- Since '@' is meant to replace words and not symbols, it is tagged as a noun, so the same strategy doesn't work. An alternative is to use the Word column instead:

In [17]:
df = df[df['Word'] != '@'] 
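
- Both filters are boolean masks, so they can also be combined into a single step. A minimal sketch on a made-up frame (the rows are illustrative, not taken from the corpus):

```python
import pandas as pd

# Toy frame with one symbol row and two redacted rows.
toy = pd.DataFrame({
    'Word': ['Este', '@', ',', 'blog', '@'],
    'POS':  ['dd-', 'n', 'y', 'nms', 'n'],
})

# Keep a row only if it is neither a symbol nor a redacted token.
keep = (toy['POS'] != 'y') & (toy['Word'] != '@')
cleaned = toy[keep]

print(len(toy) - len(cleaned), 'rows removed')
print(cleaned)
```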

In [18]:
df #Looks good, this removed about 1,500,000 more rows.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431270,2206403096,Este,este,dd-,ES
1,431270,2206403097,es,ser,vip-3s,ES
2,431270,2206403098,un,un,li-ms,ES
3,431270,2206403099,blog,blog,nms,ES
4,431270,2206403100,de,de,e,ES
...,...,...,...,...,...,...
27345207,676060,2511862342,obtener,obtener,vr,ES
27345208,676060,2511862343,mas,mas,cc,ES
27345209,676060,2511862344,informacion,información,n,ES
27345210,676060,2511862345,visite,visitar,vsp-1/3s,ES


- Lastly, I only want to keep diminutivized forms for analysis. 

In [19]:
df = df[df['Word'].str.contains(r'\w*i(?:t|ll)(?:o|a)s?\b', regex=True)] #Keeps only rows ending in the segments of interest. Non-capturing groups avoid the match-group warning.
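
- The filter can be tried on a handful of forms. This sketch writes the alternations as non-capturing groups (`(?:...)`), which select the same forms as capturing groups but avoid the warning pandas raises when a pattern passed to `str.contains` has match groups. The leading `\w*` is left out because `str.contains` searches anywhere in the string. The word list is illustrative.

```python
import pandas as pd

words = pd.Series(['calladita', 'corralito', 'sencillos', 'necesita', 'escrito', 'casa'])

# i + (t | ll) + (o | a) + optional plural s, anchored at a word boundary.
pattern = r'i(?:t|ll)(?:o|a)s?\b'
mask = words.str.contains(pattern, regex=True)

# Note the false positives: 'necesita' and 'escrito' are not diminutives.
print(words[mask].tolist())
```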


In [20]:
df

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
98,431270,2206403194,Nikita,nikita,o,ES
110,431270,2206403206,escrito,escrito,jms,ES
321,431290,2074527333,calladita,calladito,j,ES
331,431290,2074527343,sólito,sólito,jms,ES
536,431310,2143630275,necesita,necesitar,vip-3s,ES
...,...,...,...,...,...,...
27344546,676040,250005820,depósitos,depósito,nmp,ES
27344657,676040,250005931,precipita,precipitar,vip-3s,ES
27344674,676040,250005948,corralito,corralito,n,ES
27344789,676060,2511861924,sencillos,sencillo,j,ES


- This gets the data to a first stage that's easier to work with. There are still many rows which do not belong in the data frame (e.g., verb forms other than gerunds that just happen to end in the same segment), but it will be more efficient to remove those when the full data frame is constructed. I'll delete unneeded objects before proceeding:

In [21]:
del df
del fname
del cols
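
- The later removal of verb forms mentioned above could rely on the POS column, since the verb tags in the output all begin with `v`. This is only a sketch under that assumption: the actual criteria used later may differ, and it drops every verb tag without distinguishing verb subclasses.

```python
import pandas as pd

# Rows modeled on the filtered output above.
toy = pd.DataFrame({
    'Word': ['escrito', 'calladita', 'necesita', 'precipita', 'corralito'],
    'POS':  ['jms', 'j', 'vip-3s', 'vip-3s', 'n'],
})

# Assumption: any tag starting with 'v' marks a verb form.
is_verb = toy['POS'].str.startswith('v')
non_verbs = toy[~is_verb]

print(non_verbs)
```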

## 3. Processing corpus directories

- As a first step, knowing now that the operations above are successful, I will define functions to make the processing pipeline for the full data frame object more efficient and streamlined:

In [22]:
def toDF(fname):
    """Turns tab-delimited file into a data frame"""
    cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']
    df = pd.read_csv(fname,sep='\t',encoding ='iso-8859-1',skiprows=[0],header=None,names=cols)
    df = df.dropna()
    return df

In [23]:
def add_variety(df,variety):
    """Adds a column specifying a given variety of Spanish to a data frame"""
    df['Variety'] = variety
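
- `add_variety` modifies the data frame it receives and returns nothing, unlike the other helpers. A hypothetical non-mutating variant (the name `with_variety` is mine, not part of the notebook) could use `DataFrame.assign`, which returns a new frame and leaves the input untouched:

```python
import pandas as pd

def with_variety(df, variety):
    """Hypothetical variant of add_variety: returns a copy with a Variety column."""
    return df.assign(Variety=variety)

toy = pd.DataFrame({'Word': ['calladita', 'corralito']})
tagged = with_variety(toy, 'ES')

print(tagged)
print('Variety' in toy.columns)  # the original frame is unchanged
```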

In [24]:
def remove_syms(df):
    """Excludes symbols"""
    df = df[df['POS'] != 'y']
    return df

In [25]:
def remove_redacted(df):
    """Excludes @ symbols, which stand for words that have been redacted for copyright reasons"""
    df = df[df['Word'] != '@']
    return df

In [26]:
def remove_nondims(df):
    """Removes tokens that do not end in the segments of interest (-ito/-illo)"""
    df = df[df['Word'].str.contains(r'\w*i(?:t|ll)(?:o|a)s?\b', regex=True)] #Non-capturing groups avoid the match-group warning
    return df

- Now let's set the directory. I'll start with the Spain directory, since I used a Spain text file for the test run in Section 2 above. In addition, Spain is by far the largest directory (over 3 GB when compressed), so if the code works fine for it, it should work for the rest of the countries.

In [27]:
corpus_dir = '../../Diminutive-Suffix-Productivity/private/data/'
es_dir = glob.glob(corpus_dir + 'wlp_ES-sbo/*.txt')
es_dir[0]

'../../Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\\es-b-0.txt'
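
- The path above mixes `/` and `\\` because the pattern was written with forward slashes while `glob` joined the file name using the Windows separator. A minimal sketch of building the pattern with `os.path.join` and sorting the result, since `glob` does not guarantee any particular order. The folder and file names are made up and live in a temporary directory:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as corpus_root:
    # Hypothetical variety folder with two text files and one unrelated file.
    variety_dir = os.path.join(corpus_root, 'wlp_XX-demo')
    os.makedirs(variety_dir)
    for name in ['xx-b-1.txt', 'xx-b-0.txt', 'notes.md']:
        open(os.path.join(variety_dir, name), 'w').close()

    # os.path.join uses the platform separator; sorted() fixes the order.
    files = sorted(glob.glob(os.path.join(variety_dir, '*.txt')))
    names = [os.path.basename(f) for f in files]

print(names)
```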

In [28]:
len(es_dir) #There are 20 files in total

20

In [29]:
es_DF = pd.DataFrame(columns=['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']) #Builds a new, empty data frame object.
es_DF

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS


In [64]:
for fname in es_dir:                            #Note that I'm only using one of the cleaning functions
    df = toDF(fname)                            #because it gets me to the same end result. But the other ones 
    df = remove_nondims(df)                     #might be useful down the road, which is why I defined them.
    es_DF = pd.concat([es_DF, df], sort=True)
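
- An alternative to growing the data frame inside the loop is to collect the per-file frames in a list and concatenate once at the end, which avoids copying the accumulated rows on every iteration. This is a sketch with made-up frames, not the code that was run here; `sort=False` is used so the column order is left as is.

```python
import pandas as pd

cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

# Stand-ins for the filtered frames produced from each file.
chunks = [
    pd.DataFrame([[1, 10, 'calladita', 'calladito', 'j']], columns=cols),
    pd.DataFrame([[2, 20, 'corralito', 'corralito', 'n']], columns=cols),
]

# A single concatenation over the whole list.
combined = pd.concat(chunks, sort=False)

print(combined)
```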

In [65]:
es_DF #It takes a while (which is expected because of the file sizes), but it works. I won't rerun the notebook
      #because of the prior line but I've removed all extraneous cells for ease of reading.

Unnamed: 0,Lemma,POS,SourceID,TokenID,Word
98,nikita,o,431270,2206403194,Nikita
110,escrito,jms,431270,2206403206,escrito
321,calladito,j,431290,2074527333,calladita
331,sólito,jms,431290,2074527343,sólito
536,necesitar,vip-3s,431310,2143630275,necesita
...,...,...,...,...,...
22146901,sencillo,jfs,1891249,1779969779,sencilla
22147017,permitir,vsp-1/3s,1891249,1779969895,permita
22147166,inscrito,j,1891249,1779970044,inscrito
22147196,visita,nfp,1891249,1779970074,visitas


In [69]:
add_variety(es_DF,'ES')

In [70]:
es_DF  #The columns were reordered (alphabetically, most likely because of sort=True in pd.concat). I will fix 
       #this once I have constructed the final data frame object.

Unnamed: 0,Lemma,POS,SourceID,TokenID,Word,Variety
98,nikita,o,431270,2206403194,Nikita,ES
110,escrito,jms,431270,2206403206,escrito,ES
321,calladito,j,431290,2074527333,calladita,ES
331,sólito,jms,431290,2074527343,sólito,ES
536,necesitar,vip-3s,431310,2143630275,necesita,ES
...,...,...,...,...,...,...
22146901,sencillo,jfs,1891249,1779969779,sencilla,ES
22147017,permitir,vsp-1/3s,1891249,1779969895,permita,ES
22147166,inscrito,j,1891249,1779970044,inscrito,ES
22147196,visita,nfp,1891249,1779970074,visitas,ES
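
- The original column order can be restored by selecting the columns in the desired sequence. A minimal sketch on a one-row frame modeled on the output above:

```python
import pandas as pd

# One row with the columns in the order shown in the output above.
shuffled = pd.DataFrame({
    'Lemma': ['calladito'], 'POS': ['j'], 'SourceID': [431290],
    'TokenID': [2074527333], 'Word': ['calladita'], 'Variety': ['ES'],
})

# Indexing with a list of column names returns them in that order.
order = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS', 'Variety']
restored = shuffled[order]

print(restored)
```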


- Spain is ready; now I have to do the same for the remaining 20 countries. This will take me some time, but I don't expect the code above to run into issues as the files are all in the same format. Below are the corpus directories for the remaining countries:

In [30]:
ar_dir = glob.glob(corpus_dir + 'wlp_AR-tez/*.txt') #Argentina
bo_dir = glob.glob(corpus_dir + 'wlp_BO-teh/*.txt') #Bolivia
cl_dir = glob.glob(corpus_dir + 'wlp_CL-wts/*.txt') #Chile
co_dir = glob.glob(corpus_dir + 'wlp_CO-pem/*.txt') #Colombia
cr_dir = glob.glob(corpus_dir + 'wlp_CR-jfh/*.txt') #Costa Rica
cu_dir = glob.glob(corpus_dir + 'wlp_CU-rag/*.txt') #Cuba
do_dir = glob.glob(corpus_dir + 'wlp_DO-egn/*.txt') #Dominican Republic
ec_dir = glob.glob(corpus_dir + 'wlp_EC-jss/*.txt') #Ecuador
gt_dir = glob.glob(corpus_dir + 'wlp_GT-miv/*.txt') #Guatemala
hn_dir = glob.glob(corpus_dir + 'wlp_HN-paj/*.txt') #Honduras
mx_dir = glob.glob(corpus_dir + 'wlp_MX-vzo/*.txt') #Mexico
ni_dir = glob.glob(corpus_dir + 'wlp_NI-exu/*.txt') #Nicaragua
pa_dir = glob.glob(corpus_dir + 'wlp_PA-qlz/*.txt') #Panama
pe_dir = glob.glob(corpus_dir + 'wlp_PE-tae/*.txt') #Peru
pr_dir = glob.glob(corpus_dir + 'wlp_PR-epz/*.txt') #Puerto Rico
py_dir = glob.glob(corpus_dir + 'wlp_PY-ukd/*.txt') #Paraguay
sv_dir = glob.glob(corpus_dir + 'wlp_SV-xkl/*.txt') #El Salvador
us_dir = glob.glob(corpus_dir + 'wlp_US-ufh/*.txt') #The US
uy_dir = glob.glob(corpus_dir + 'wlp_UY-nde/*.txt') #Uruguay
#ve_dir = glob.glob(corpus_dir + 'wlp_VE-wsc/*.txt') #Venezuela. This file (ironically) doesn't work. I'll have to 
                                                       #go to the LMC and try to get a new copy. 
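
- The same information could also be kept in a dictionary keyed by country code, which makes it possible to loop over varieties instead of naming one variable per country. A sketch with three of the folders listed above; the dictionary names are mine and only the patterns are built here, each of which would then be passed to `glob.glob`.

```python
corpus_dir = '../../Diminutive-Suffix-Productivity/private/data/'

# Country code -> folder name (three entries copied from the cell above).
folders = {'AR': 'wlp_AR-tez', 'BO': 'wlp_BO-teh', 'CL': 'wlp_CL-wts'}

# Country code -> glob pattern for that country's text files.
patterns = {code: corpus_dir + folder + '/*.txt' for code, folder in folders.items()}

for code, pattern in patterns.items():
    print(code, pattern)
```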

- On to building data frames. For this purpose, I have created a master function that goes through all of the steps above. I'll run each country in one cell so that memory and time limitations won't be exceeded (hence why data frames are deleted after being saved). Likewise, so that I don't have to rerun the whole notebook each time (or in case I want to use the resulting data frames in a different notebook), it makes sense to save the results in their current state. For this I'll make use of Pandas' own pickling function.

In [31]:
def corpus_process(fdir, country_df, variety):
    """Builds, cleans, and creates a data frame using all files from the corpus directory and
    keeping only rows of interest. fdir corresponds to the list of files of the country; country_df
    is a string naming the data frame (it is not used: an empty data frame is created below and
    populated instead); variety is also a string and corresponds to the variety being processed."""
    country_df = pd.DataFrame(columns=['SourceID', 'TokenID', 'Word', 'Lemma', 'POS'])
    for fname in fdir:                         
        df = toDF(fname)                           
        df = remove_nondims(df)
        country_df = pd.concat([country_df, df], sort=True)
    add_variety(country_df, variety)
    return country_df

Argentina:

In [32]:
ar_DF = corpus_process(ar_dir, 'ar_DF', 'AR')
ar_DF.to_pickle('ar_DF.pkl')
del ar_DF
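
- A pickled data frame can be read back with `pd.read_pickle`. A minimal round-trip sketch with a made-up frame written to a temporary directory (the file name is hypothetical):

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'Word': ['calladita', 'corralito'], 'Variety': ['AR', 'AR']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_DF.pkl')  # hypothetical file name
    toy.to_pickle(path)                     # save in the current state
    restored = pd.read_pickle(path)         # reload, e.g. in another notebook

print(restored.equals(toy))
```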

Bolivia:

In [36]:
bo_DF = corpus_process(bo_dir, 'bo_DF', 'BO')
bo_DF.to_pickle('bo_DF.pkl')
del bo_DF

Chile:

In [37]:
cl_DF = corpus_process(cl_dir, 'cl_DF', 'CL')
cl_DF.to_pickle('cl_DF.pkl')
del cl_DF

Colombia:

In [42]:
co_DF = corpus_process(co_dir, 'co_DF', 'CO')
co_DF.to_pickle('co_DF.pkl')
del co_DF

Costa Rica:

In [43]:
cr_DF = corpus_process(cr_dir, 'cr_DF', 'CR')
cr_DF.to_pickle('cr_DF.pkl')
del cr_DF

Cuba:

In [49]:
cu_DF = corpus_process(cu_dir, 'cu_DF', 'CU')
cu_DF.to_pickle('cu_DF.pkl')
del cu_DF

Dominican Republic:

In [51]:
do_DF = corpus_process(do_dir, 'do_DF', 'DO')
do_DF.to_pickle('do_DF.pkl')
del do_DF

Ecuador:

In [56]:
ec_DF = corpus_process(ec_dir, 'ec_DF', 'EC')
ec_DF.to_pickle('ec_DF.pkl')
del ec_DF

Guatemala:

In [73]:
gt_DF = corpus_process(gt_dir, 'gt_DF', 'GT')
gt_DF.to_pickle('gt_DF.pkl')
del gt_DF

Honduras:

In [74]:
hn_DF = corpus_process(hn_dir, 'hn_DF', 'HN')
hn_DF.to_pickle('hn_DF.pkl')
del hn_DF

Mexico:

In [78]:
mx_DF = corpus_process(mx_dir, 'mx_DF', 'MX')
mx_DF.to_pickle('mx_DF.pkl')
del mx_DF

Nicaragua:

In [81]:
ni_DF = corpus_process(ni_dir, 'ni_DF', 'NI')
ni_DF.to_pickle('ni_DF.pkl')
del ni_DF

Panama:

In [82]:
pa_DF = corpus_process(pa_dir, 'pa_DF', 'PA')
pa_DF.to_pickle('pa_DF.pkl')
del pa_DF

Peru:

In [83]:
pe_DF = corpus_process(pe_dir, 'pe_DF', 'PE')
pe_DF.to_pickle('pe_DF.pkl')
del pe_DF

Puerto Rico:

In [84]:
pr_DF = corpus_process(pr_dir, 'pr_DF', 'PR')
pr_DF.to_pickle('pr_DF.pkl')
del pr_DF

Paraguay:

In [85]:
py_DF = corpus_process(py_dir, 'py_DF', 'PY')
py_DF.to_pickle('py_DF.pkl')
del py_DF

El Salvador:

In [86]:
sv_DF = corpus_process(sv_dir, 'sv_DF', 'SV')
sv_DF.to_pickle('sv_DF.pkl')
del sv_DF

US:

In [87]:
us_DF = corpus_process(us_dir, 'us_DF', 'US')
us_DF.to_pickle('us_DF.pkl')
del us_DF

Uruguay:

In [167]:
uy_DF = corpus_process(uy_dir, 'uy_DF', 'UY')
uy_DF.to_pickle('uy_DF.pkl')
del uy_DF



Venezuela:

In [29]:
#ve_DF = corpus_process(ve_dir, 've_DF', 'VE')
#ve_DF.to_pickle('ve_DF.pkl')
#del ve_DF

- That's the end of this stage. I now have a preliminary data frame for each country that I can later put together into a larger one. I still want to keep the individual country data frames, though, for by-country analysis and processing or in case I notice an issue down the road. As a last step, I will need to know the number of hapax legomena in each subset of the corpus, so I'll extract the word forms as a set, again using a master function that incorporates some of the functions I have defined above:

In [31]:
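def extract_hapax(fdir, country_hapax):
    """Collects the distinct lowercased word forms of a corpus directory in a set. Note that a set
    records attestation, not frequency, so forms that occur more than once are included as well.
    fdir corresponds to the list of files of the country; country_hapax is a string naming the set
    (it is not used: an empty set is created below and populated instead)."""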
    country_hapax = set()
    for fname in fdir:                         
        df = toDF(fname)    
        df = remove_syms(df)
        df = remove_redacted(df)
        hapax = set([w.lower() for w in df['Word']])
        for word in hapax:
            country_hapax.add(word)
    return country_hapax
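
- A hapax legomenon is a form attested exactly once, so singling such forms out requires frequencies rather than set membership alone. A minimal sketch with `collections.Counter` on made-up tokens, assuming that counts are pooled over lowercased forms. It is not the procedure used in this notebook.

```python
from collections import Counter

# Made-up tokens; 'casita' and 'gatito' each occur twice once lowercased.
tokens = ['Casita', 'casita', 'perrito', 'gatito', 'Gatito', 'librito']

# Frequency of each lowercased form.
counts = Counter(word.lower() for word in tokens)

# Keep only the forms that occur exactly once.
hapax = {word for word, n in counts.items() if n == 1}

print(sorted(hapax))
```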

- Argentina:

In [32]:
ar_hapax = extract_hapax(ar_dir, 'ar_hapax')
with open('ar_hapax.pkl', 'wb') as f:
    pickle.dump(ar_hapax, f, -1)
del ar_hapax

KeyboardInterrupt: 
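
- In `pickle.dump`, a negative protocol number selects the highest protocol available, so `-1` is equivalent to `pickle.HIGHEST_PROTOCOL`. A minimal round-trip sketch with a made-up set and a hypothetical file name in a temporary directory:

```python
import os
import pickle
import tempfile

toy_set = {'perrito', 'librito'}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_hapax.pkl')  # hypothetical file name
    with open(path, 'wb') as f:
        pickle.dump(toy_set, f, pickle.HIGHEST_PROTOCOL)  # same effect as -1
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == toy_set)
```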

- Bolivia:

In [None]:
bo_hapax = extract_hapax(bo_dir, 'bo_hapax')
with open('bo_hapax.pkl', 'wb') as f:
    pickle.dump(bo_hapax, f, -1)
del bo_hapax

- Chile:

In [None]:
cl_hapax = extract_hapax(cl_dir, 'cl_hapax')
with open('cl_hapax.pkl', 'wb') as f:
    pickle.dump(cl_hapax, f, -1)
del cl_hapax

- Colombia:

In [None]:
co_hapax = extract_hapax(co_dir, 'co_hapax')
with open('co_hapax.pkl', 'wb') as f:
    pickle.dump(co_hapax, f, -1)
del co_hapax

- Costa Rica:

In [None]:
cr_hapax = extract_hapax(cr_dir, 'cr_hapax')
with open('cr_hapax.pkl', 'wb') as f:
    pickle.dump(cr_hapax, f, -1)
del cr_hapax

- Cuba:

In [None]:
cu_hapax = extract_hapax(cu_dir, 'cu_hapax')
with open('cu_hapax.pkl', 'wb') as f:
    pickle.dump(cu_hapax, f, -1)
del cu_hapax

- Dominican Republic:

In [None]:
do_hapax = extract_hapax(do_dir, 'do_hapax')
with open('do_hapax.pkl', 'wb') as f:
    pickle.dump(do_hapax, f, -1)
del do_hapax

- Ecuador:

In [None]:
ec_hapax = extract_hapax(ec_dir, 'ec_hapax')
with open('ec_hapax.pkl', 'wb') as f:
    pickle.dump(ec_hapax, f, -1)
del ec_hapax

- Spain:

In [None]:
es_hapax = extract_hapax(es_dir, 'es_hapax')
with open('es_hapax.pkl', 'wb') as f:
    pickle.dump(es_hapax, f, -1)
del es_hapax

- Guatemala:

In [None]:
gt_hapax = extract_hapax(gt_dir, 'gt_hapax')
with open('gt_hapax.pkl', 'wb') as f:
    pickle.dump(gt_hapax, f, -1)
del gt_hapax

- Honduras:

In [None]:
hn_hapax = extract_hapax(hn_dir, 'hn_hapax')
with open('hn_hapax.pkl', 'wb') as f:
    pickle.dump(hn_hapax, f, -1)
del hn_hapax

- Mexico:

In [None]:
mx_hapax = extract_hapax(mx_dir, 'mx_hapax')
with open('mx_hapax.pkl', 'wb') as f:
    pickle.dump(mx_hapax, f, -1)
del mx_hapax

- Nicaragua:

In [None]:
ni_hapax = extract_hapax(ni_dir, 'ni_hapax')
with open('ni_hapax.pkl', 'wb') as f:
    pickle.dump(ni_hapax, f, -1)
del ni_hapax

- Panama:

In [None]:
pa_hapax = extract_hapax(pa_dir, 'pa_hapax')
with open('pa_hapax.pkl', 'wb') as f:
    pickle.dump(pa_hapax, f, -1)
del pa_hapax

- Peru:

In [None]:
pe_hapax = extract_hapax(pe_dir, 'pe_hapax')
with open('pe_hapax.pkl', 'wb') as f:
    pickle.dump(pe_hapax, f, -1)
del pe_hapax

- Puerto Rico:

In [None]:
pr_hapax = extract_hapax(pr_dir, 'pr_hapax')
with open('pr_hapax.pkl', 'wb') as f:
    pickle.dump(pr_hapax, f, -1)
del pr_hapax

- Paraguay:

In [None]:
py_hapax = extract_hapax(py_dir, 'py_hapax')
with open('py_hapax.pkl', 'wb') as f:
    pickle.dump(py_hapax, f, -1)
del py_hapax

- El Salvador:

In [None]:
sv_hapax = extract_hapax(sv_dir, 'sv_hapax')
with open('sv_hapax.pkl', 'wb') as f:
    pickle.dump(sv_hapax, f, -1)
del sv_hapax

- The US:

In [None]:
us_hapax = extract_hapax(us_dir, 'us_hapax')
with open('us_hapax.pkl', 'wb') as f:
    pickle.dump(us_hapax, f, -1)
del us_hapax

- Uruguay:

In [None]:
uy_hapax = extract_hapax(uy_dir, 'uy_hapax')
with open('uy_hapax.pkl', 'wb') as f:
    pickle.dump(uy_hapax, f, -1)
del uy_hapax