# Diminutive Suffix Productivity: corpus processing and cleaning
Juan Berrios | jeb358@pitt.edu | Last updated: March 19, 2020

**Summary and overview of the data:**

- The purpose of the code included in this notebook is to build `DataFrame` objects from the `.txt` files in the corpus directories. I also do some preliminary data cleaning. The corpus I am using is the [*Corpus del español*](https://www.corpusdelespanol.org/); more specifically the [Web/Dialects](https://www.corpusdelespanol.org/web-dial/) corpus. While the corpus is searchable online, it is also possible to access the full data set for those wishing to do computational analyses, such as this. It is necessary to purchase a license to do so. I am authorized to use it through the license of the [Department of Linguistics](https://www.linguistics.pitt.edu/). Samples for the different formats can be downloaded from the [official website](https://www.corpusdata.org/formats.asp). I have also uploaded a copy of the free sample in the [data samples  directory](https://github.com/Data-Science-for-Linguists-2020/Diminutive-Suffix-Productivity/tree/master/data_samples) of this repository. The data set is available in three formats: (i) Database (Structured Query Language), (ii) Word/lemma/PoS, and (iii) linear (raw) text. All are `.txt` files and the former two are tab-delimited. I have chosen to work with the second format because the tags will come in handy and because it's quite compatible with Pandas.

**Contents:**
1. [Preparation](#1.-Preparation)  includes the necessary preparations.
2. [Loading files](#2.-Loading-files)  includes code for loading the files and turning them into data frames using one of the `.txt` file as a sample.
3. [Processing corpus directories](#3.-Processing-corpus-directories)  includes code for performing the operations on a corpus directory containing all the text files of one variety.
4. [Storing files](#4.-Storing-files)  includes code for storing the resulting data frames as pickled files.
5. [Alternative code](#5.-Alternative-code)  includes alternative code to accomplish some of the tasks in this notebook.

### 1. Preparation

- Loading libraries and additional settings:

In [1]:
#Importing libraries
import glob, pickle, re
import pandas as pd
import numpy as np

#Turning pretty print off:
%pprint

#Releasing all output:                                            
from IPython.core.interactiveshell import InteractiveShell #Prints all commands rather than the last one.
InteractiveShell.ast_node_interactivity = "all"

Pretty printing has been turned OFF


### 2. Loading files

- The `txt` files are very large. For testing purposes, I'll use only one of them as a start. The files are also tab-delimited, which makes my job a little easier. The columns correspond to an ID for the source text, an ID for the token, the token (word), the lemma, and the POS. I will hence use those for column names. 

In [88]:
fname = '../../Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo/es-b-0.txt'

In [89]:
cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

In [90]:
#First row is ignored because it corresponds to an identifier for the .txt file.

df = pd.read_csv(fname,sep='\t',encoding ='iso-8859-1',skiprows=[0],header=None,names=cols) 

In [91]:
df = df.dropna() #Removing NaN values

In [92]:
df.shape #It's a very large file (27345213 rows). It will get much smaller once I start cleaning up the data.

(27310829, 5)

In [93]:
df.head(5)   #The lemma column will be useful when I need to aggregations that put lowecase and uppercase 
            #as well as plural and singular forms together.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
0,431270,2206403096,Este,este,dd-
1,431270,2206403097,es,ser,vip-3s
2,431270,2206403098,un,un,li-ms
3,431270,2206403099,blog,blog,nms
4,431270,2206403100,de,de,e


In [94]:
df.tail(5)

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
27345208,676060,2511862343,mas,mas,cc
27345209,676060,2511862344,informacion,información,n
27345210,676060,2511862345,visite,visitar,vsp-1/3s
27345211,676060,2511862346,www.DineroAbundancia.com,www.dineroabundancia.com,n
27345212,676060,2511862347,.,$.,y


In [95]:
df.sample(5) #Everything seems to be loaded correctly as of now. 

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
8330253,517950,1671391719,no,no,r
17322104,589480,398554469,",","$,",y
6715455,496570,650290440,orígenes,origen,nmp
19091258,598950,1679652522,se,se,po
21918216,623720,2272201126,un,un,li-ms


In [96]:
df['POS'].unique()  #These POS tags are not very transparent, but it's a good start. These will also come 
                    #handy for data clean up because diminutivization applies only to some classes. 

array(['dd-', 'vip-3s', 'li-ms', 'nms    ', 'e', 'nfp    ', 'y',
       'nfs    ', 'nmp    ', 'vsp-1/3s', 'n', 'cc', 'o', 'r', 'vps-ms',
       'ld-fs', 'm$', 'vc-1/3s', 'vr', 'po', 'li-fs', 'ld-mp', 'vip-3p',
       'ld-ms', 'jms    ', 'vpp', 'mc', 'vps-fs', 'j', 'vii-1/3s', 'e_21',
       'x', 'cs', 'ld', 'v', 'cc-', 'ld-fp', 'pi-0cn', 'dd', 'dxmp-ind-',
       'vip-2s', 'dp-', 'vsp-3p', 'pi-0ms', 'jfp    ', 'cS_21', 'cS_22',
       'jmp    ', 'vsp-2s', 'vip-1s', 'vsp-1p', 'ps', 'vip-1p/vis-1p',
       'vip-1p', 'pd-3cs', 'vif-1p', 'vps-mp', 'vif-3s', 'vif-3p',
       'jfs    ', 'dxfs-ind-', 'vsj-1/3s', 'i', 'pi-3ms', 'vis-3s', 'b',
       'p', 'np', 'pr-3cn"', 'px', 'dxms-ind-', 'pi-3cs', 'dxfs-',
       'pr-3cs', 'px-ms', 'dxfp-ind-', 'vis-3p', 'li-mp', 'pi', 'vsi-3p',
       'px-mp', 'vsi-1/3s', 'pq-3cn"', 'vif-2s', 'vif-2p', 'vsp-2p',
       'vip-2p', 'vpp-00', 'vm-2p', 'vis-1p', 'dxcs-ind-', 'pr-3cp',
       'dxcs-dem-', 'vif-1s', 'vc-1p', 'cC_21', 'cC_22', 'vps-fp',
       'vii

In [97]:
df['Variety'] = 'ES' #Time to add a column for the variety of Spanish. In this case Spain.

In [98]:
df.head() #It works out well.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431270,2206403096,Este,este,dd-,ES
1,431270,2206403097,es,ser,vip-3s,ES
2,431270,2206403098,un,un,li-ms,ES
3,431270,2206403099,blog,blog,nms,ES
4,431270,2206403100,de,de,e,ES


In [99]:
df.tail()

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
27345208,676060,2511862343,mas,mas,cc,ES
27345209,676060,2511862344,informacion,información,n,ES
27345210,676060,2511862345,visite,visitar,vsp-1/3s,ES
27345211,676060,2511862346,www.DineroAbundancia.com,www.dineroabundancia.com,n,ES
27345212,676060,2511862347,.,$.,y,ES


In [100]:
df.sample(5)

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
24403218,649640,834060125,que,que,cs,ES
3669217,471560,619788564,segundo,segundo,mc,ES
6426100,495270,2027468388,angoleño,angoleño,jms,ES
17556588,590800,2468531832,diferentes,diferente,jmp,ES
21936826,624200,2649604048,haber,haber,v,ES


- A first step involved in cleaning up the data is to remove rows that are not necessary for this analysis. There are two main things to tackle first: symbols and '@' that are stand-ins for words that were removed from the corpus for copyright reasons when it was created. For the former, I can make use of the POS column. Symbols are tagged 'y'. 

In [101]:
df = df[df['POS'] != 'y'] 

In [102]:
df #It works. Looks like around 3,000,000 rows were removed.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431270,2206403096,Este,este,dd-,ES
1,431270,2206403097,es,ser,vip-3s,ES
2,431270,2206403098,un,un,li-ms,ES
3,431270,2206403099,blog,blog,nms,ES
4,431270,2206403100,de,de,e,ES
...,...,...,...,...,...,...
27345207,676060,2511862342,obtener,obtener,vr,ES
27345208,676060,2511862343,mas,mas,cc,ES
27345209,676060,2511862344,informacion,información,n,ES
27345210,676060,2511862345,visite,visitar,vsp-1/3s,ES


- Since '@' is meant to replace words and not symbols, it is tagged as a noun, so the same strategy doesn't work. An alternative is to use the Word column instead:

In [103]:
df = df[df['Word'] != '@'] 

In [104]:
df #Looks good, this removed about 1,500,000 more rows.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431270,2206403096,Este,este,dd-,ES
1,431270,2206403097,es,ser,vip-3s,ES
2,431270,2206403098,un,un,li-ms,ES
3,431270,2206403099,blog,blog,nms,ES
4,431270,2206403100,de,de,e,ES
...,...,...,...,...,...,...
27345207,676060,2511862342,obtener,obtener,vr,ES
27345208,676060,2511862343,mas,mas,cc,ES
27345209,676060,2511862344,informacion,información,n,ES
27345210,676060,2511862345,visite,visitar,vsp-1/3s,ES


- Lastly, I only want to keep diminutivized forms for analysis. 

In [114]:
df = df[df['Word'].str.contains(r'\w*i(t|ll)(o|a)s?\b', regex=True)]    #Keeps only rows ending in the segments of interest.


#df = df[(df['Word'].str[-3:] == 'ito') | (df['Word'].str[-3:] == 'ita')|   #Former code
#  (df['Word'].str[-4:] == 'itos') | (df['Word'].str[-4:] == 'itas')|   
#  (df['Word'].str[-4:] == 'illo') | (df['Word'].str[-4:] == 'illa')|       
#  (df['Word'].str[-4:] == 'illos') | (df['Word'].str[-4:] == 'illas')]

In [115]:
df

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
98,431270,2206403194,Nikita,nikita,o,ES
110,431270,2206403206,escrito,escrito,jms,ES
321,431290,2074527333,calladita,calladito,j,ES
331,431290,2074527343,sólito,sólito,jms,ES
536,431310,2143630275,necesita,necesitar,vip-3s,ES
...,...,...,...,...,...,...
27344546,676040,250005820,depósitos,depósito,nmp,ES
27344657,676040,250005931,precipita,precipitar,vip-3s,ES
27344674,676040,250005948,corralito,corralito,n,ES
27344789,676060,2511861924,sencillos,sencillo,j,ES


- This gets the data at to a first stage that's easier to work with. There are still many rows which do not belong in the data frame (e.g., verb forms other than gerunds that just happen to end in the same segment), but it will be more efficient to remove those when the full data frame is constructed. 

### 3. Processing corpus directories

- As a first step, knowing now that the operations above are sucessful, I will define functions to make the processing pipeline for the full data drame object more efficient and streamlined:

In [71]:
def toDF(fname):
    """Turns tab-delimited file into a data frame"""
    cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']
    df = pd.read_csv(fname,sep='\t',encoding ='iso-8859-1',skiprows=[0],header=None,names=cols)
    df = df.dropna()
    return df

In [72]:
def add_variety(df,variety):
    """Adds a column specifying a given variety of Spanish to a data frame"""
    df['Variety'] = variety

In [73]:
def remove_syms(df):
    """Excludes symbols"""
    df = df[df['POS'] != 'y']
    return df

In [74]:
def remove_redacted(df):
    """Excludes @ symbols, which stand for words that have been redacted for copyright reasons"""
    df = df[df['Word'] != '@']
    return df

In [116]:
def remove_nondims(df):
    """Removes tokens that do not end in the segments of interest (-ito/-illo)"""
    df = df[df['Word'].str.contains(r'\w*i(t|ll)(o|a)s?\b', regex=True)]
    return df

- Now let's set the directory. I'll start with the Spain directory, since I used a Spain text file for the test run in Section 2 above. In addition, Spain is by far the largest directory (over 3 GB when compressed), so if the code works fine for it, it should work for the rest of the countries.

In [176]:
corpus_dir = '../../Diminutive-Suffix-Productivity/private/data/'
es_dir = glob.glob(corpus_dir + 'wlp_ES-sbo/*.txt')
es_dir[0]

'../../Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\\es-b-0.txt'

In [177]:
len(es_fname) #There are 20 files in total

1

In [178]:
es_DF = pd.DataFrame(columns=cols) #Builds a new, empty data frame object.
es_DF

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS


In [179]:
for fname in es_fname:                         #Note that I'm only using one of the cleaning functions
    df = toDF(fname)                           #because it gets me to the same end result. But the other ones   
    df = remove_nondims(df)                    #might be useful down the road, which is why I defined them.
    es_DF = pd.concat([es_DF, df], sort=True)

In [180]:
es_DF #It takes a while (which is expected because of the file sizes), but it works. I won't rerun the notebook
      #because of the prior line but I've removed all extraneous cells for ease of reading.

Unnamed: 0,Lemma,POS,SourceID,TokenID,Word
98,nikita,o,431270,2206403194,Nikita
110,escrito,jms,431270,2206403206,escrito
321,calladito,j,431290,2074527333,calladita
331,sólito,jms,431290,2074527343,sólito
536,necesitar,vip-3s,431310,2143630275,necesita
...,...,...,...,...,...
27344546,depósito,nmp,676040,250005820,depósitos
27344657,precipitar,vip-3s,676040,250005931,precipita
27344674,corralito,n,676040,250005948,corralito
27344789,sencillo,j,676060,2511861924,sencillos


In [83]:
add_variety(es_DF,'ES')

In [84]:
es_DF  #It appears the order of the columns was shuffled for some reason. I will fix this once I have constructed 
       #the final data frame object.

Unnamed: 0,Lemma,POS,SourceID,TokenID,Variety,Word
98,nikita,o,431270,2206403194,ES,Nikita
110,escrito,jms,431270,2206403206,ES,escrito
321,calladito,j,431290,2074527333,ES,calladita
331,sólito,jms,431290,2074527343,ES,sólito
536,necesitar,vip-3s,431310,2143630275,ES,necesita
...,...,...,...,...,...,...
22146901,sencillo,jfs,1891249,1779969779,ES,sencilla
22147017,permitir,vsp-1/3s,1891249,1779969895,ES,permita
22147166,inscrito,j,1891249,1779970044,ES,inscrito
22147196,visita,nfp,1891249,1779970074,ES,visitas


- Spain is ready, now I have to do the same for the remaning 20 countries. This will take me some time, but I don't expect the code above to run into issues as the files are all in the same format. Below are the corpus directories for the remaining countries:

In [149]:
ar_dir = glob.glob(corpus_dir + 'wlp_AR-tez/*.txt') #Argentina
bo_dir = glob.glob(corpus_dir + 'wlp_BO-teh/*.txt') #Bolivia
cl_dir = glob.glob(corpus_dir + 'wlp_CL-wts/*.txt') #Chile
co_dir = glob.glob(corpus_dir + 'wlp_CO-pem/*.txt') #Colombia
cr_dir = glob.glob(corpus_dir + 'wlp_CR-jfh/*.txt') #Costa Rica
cu_dir = glob.glob(corpus_dir + 'wlp_CU-rag/*.txt') #Cuba
do_dir = glob.glob(corpus_dir + 'wlp_DO-egn/*.txt') #Dominican Republic
ec_dir = glob.glob(corpus_dir + 'wlp_EC-jss/*.txt') #Ecuador
gt_dir = glob.glob(corpus_dir + 'wlp_GT-miv/*.txt') #Guatemala
hn_dir = glob.glob(corpus_dir + 'wlp_HN-paj/*.txt') #Honduras
mx_dir = glob.glob(corpus_dir + 'wlp_MX-vzo/*.txt') #Mexico
ni_dir = glob.glob(corpus_dir + 'wlp_NI-exu/*.txt') #Nicaragua
pa_dir = glob.glob(corpus_dir + 'wlp_PA-qlz/*.txt') #Panama
pe_dir = glob.glob(corpus_dir + 'wlp_PE-tae/*.txt') #Peru
pr_dir = glob.glob(corpus_dir + 'wlp_PR-epz/*.txt') #Puerto Rico
py_dir = glob.glob(corpus_dir + 'wlp_PY-ukd/*.txt') #Paraguay
sv_dir = glob.glob(corpus_dir + 'wlp_SV-xkl/*.txt') #El Salvador
us_dir = glob.glob(corpus_dir + 'wlp_US-ufh/*.txt') #The US
uy_dir = glob.glob(corpus_dir + 'wlp_UY-nde/*.txt') #Uruguay
#ve_dir = glob.glob(corpus_dir + 'wlp_VE-wsc/*.txt') #Venezuela. This file (ironically) doesn't work. I'll have to 
                                                       #go to the LMC and try to get a new copy. 

- On to buiilding data frames. For this purpose, I have created a master function that goes through all of the steps above. I'll run each country in one cell so that memory and time limitations won't be exceeded.

In [186]:
def corpus_process(fdir, country_df, variety):
    """Builds, cleans, and creates a data frame using all files from the corpus directory and
    keeping only rows of interest. fdir corresponds to the directory of the country, country_df
    is a string and corresponds to an empty data frame to be populated, variety is also a string
    and correspond to the variety being processed."""
    country_df = pd.DataFrame(columns=['SourceID', 'TokenID', 'Word', 'Lemma', 'POS'])
    for fname in fdir:                         
        df = toDF(fname)                           
        df = remove_nondims(df)
        country_df = pd.concat([country_df, df], sort=True)
    add_variety(country_df, variety)
    return country_df

Argentina:

In [12]:
corpus_process(ar_dir, 'ar_DF', 'AR')

Bolivia:

In [55]:
corpus_process(bo_dir, 'bo_DF', 'BO')

Chile:

In [61]:
corpus_process(cl_dir, 'cl_DF', 'CL')

Colombia:

In [14]:
corpus_process(co_dir, 'co_DF', 'CO')

Costa Rica:

In [56]:
corpus_process(cr_dir, 'cr_DF', 'CR')

Cuba:

In [17]:
corpus_proces(cu_dir, 'cu_DF', 'CU')

Dominican Republic:

In [50]:
corpus_process(dr_dir, 'dr_DF', 'DR')

Ecuador:

In [16]:
corpus_process(ec_dir, 'ec_DF', 'EC')

Guatemala:

In [182]:
gt_DF = corpus_process(gt_dir, 'gt_DF', 'GT')

Honduras:

In [51]:
corpus_process(hn_dir, 'hn_DF', 'HN')

Mexico:

In [17]:
corpus_process(mx_dir, 'mx_DF', 'MX')

Nicaragua:

In [46]:
corpus_process(ni_dir, 'ni_DF', 'NI')

Panama:

In [27]:
corpus_process(pa_dir, 'pa_DF', 'PA')

Peru:

In [25]:
corpus_process(pe_dir, 'pe_DF', 'PE')

Puerto Rico:

In [22]:
corpus_process(pr_dir, 'pr_DF', 'PR')

Paraguay:

In [20]:
corpus_process(py_dir, 'py_DF', 'PY')

El Salvador:

In [18]:
corpus_process(sv_dir, 'sv_DF', 'SV')

US:

In [25]:
corpus_process(us_dir, 'us_DF', 'US')

Uruguay:

In [16]:
corpus_process(uy_dir, 'uy_DF', 'UY')

Venezuela:

In [29]:
# corpus_process(ve_dir, 've_DF', 'VE')

- That's the end of this stage. I now have a preliminary data frame for each country that I can later put together into a larger one. I still want to keep the individual country data frames, though, for by-country analysis and processing or in case I notice an issue down the road.

### 4. Storing files

- So that I don't have to rerun the whole notebook each time (or in case I want to use the resulting data frames in a different notebook), it makes sense to save the results in their current state. For this I'll make use of the pickle library. 

Argentina:

In [15]:
ar_DF.to_pickle('ar_DF.pkl')

Bolivia:

In [58]:
bo_DF.to_pickle('bo_DF.pkl')

Colombia:

In [16]:
co_DF.to_pickle('co_DF.pkl')

Chile:

In [64]:
cl_DF.to_pickle('cl_DF.pkl')

Costa Rica:

In [59]:
cl_DF.to_pickle('cl_DF.pkl')

Cuba:

In [19]:
cu_DF.to_pickle('cu_DF.pkl')

Dominican Republic:

In [54]:
do_DF.to_pickle('do_DF.pkl')

Ecuador:

In [19]:
ec_DF.to_pickle('ec_DF.pkl')

Guatemala:

In [170]:
gt_DF.to_pickle('gt_DF.pkl')

Honduras:

In [53]:
hn_DF.to_pickle('hn_DF.pkl')

Mexico:

In [19]:
mx_DF.to_pickle('mx_DF.pkl')

Nicaragua:

In [48]:
ni_DF.to_pickle('ni_DF.pkl')

Panama:

In [39]:
pa_DF.to_pickle('pa_DF.pkl')

Peru:

In [40]:
pe_DF.to_pickle('pe_DF.pkl')

Puerto Rico:

In [41]:
pr_DF.to_pickle('pr_DF.pkl')

Paraguay:

In [42]:
pa_DF.to_pickle('pa_DF.pkl')

Spain:

In [87]:
es_DF.to_pickle('es_DF.pkl')

El Salvador:

In [43]:
sv_DF.to_pickle('sv_DF.pkl')

US:

In [27]:
us_DF.to_pickle('us_DF.pkl')

Uruguay:

In [45]:
ur_DF.to_pickle('ur_DF.pkl')