## Diminutive Suffix Productivity
Juan Berrios | jeb358@pitt.edu | Last updated: February 25, 2020

**Summary and overview of the data:**

- The purpose of the code included in this notebook is to build `DataFrame` objects from the `.txt` files in the corpus directories. 

**Contents:**
- Section 1 includes the necessary preparations.
- Section 2 includes code for loading the files and turning them into data frames, using one of the `.txt` files as a sample.
- Section 3 includes code for performing the operations on a corpus directory containing all the text files of one variety.

**1. Preparation**

- Loading libraries and additional settings:

In [1]:
#Importing libraries
import glob
import pandas as pd
import numpy as np

#Turning pretty print off:
%pprint

#Releasing all output:                                            
from IPython.core.interactiveshell import InteractiveShell #Prints all commands rather than the last one.
InteractiveShell.ast_node_interactivity = "all"

Pretty printing has been turned OFF


**2. Loading a text file as a data frame**

- The text files are very large, so for testing purposes I'll start with only one of them. The `.txt` files are tab-delimited, which makes my job a little easier. The columns correspond to an ID for the source text, an ID for the token, the token (word), the lemma, and the POS tag, so I will use those as column names.

In [2]:
fname = '../../Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo/es-b-0.txt'

In [3]:
cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

In [4]:
#First row is ignored because it corresponds to an identifier for the .txt file.

df = pd.read_csv(fname,sep='\t',encoding ='iso-8859-1',skiprows=[0],header=None,names=cols) 

In [5]:
df.shape #It's a very large file (27345213 rows). It will get much smaller once I start cleaning up the data.

(27345213, 5)

In [6]:
df.head(5)   #The lemma column will be useful when I need aggregations that put lowercase and uppercase 
            #as well as plural and singular forms together.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
0,431270,2206403096,Este,este,dd-
1,431270,2206403097,es,ser,vip-3s
2,431270,2206403098,un,un,li-ms
3,431270,2206403099,blog,blog,nms
4,431270,2206403100,de,de,e


In [7]:
df.tail(5)

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
27345208,676060,2511862343,mas,mas,cc
27345209,676060,2511862344,informacion,información,n
27345210,676060,2511862345,visite,visitar,vsp-1/3s
27345211,676060,2511862346,www.DineroAbundancia.com,www.dineroabundancia.com,n
27345212,676060,2511862347,.,$.,y


In [8]:
df.sample(5) #Everything seems to be loaded correctly as of now. 

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
10583968,537030,154105123,de,de,e
6284118,494450,2493055018,uno,uno,pi-0ms
21793085,621950,1662734684,la,la,ld-fs
20674469,611090,2499298591,de,de,e
24422308,649860,1421465329,lo,lo,ld


In [9]:
df['POS'].unique()  #These POS tags are not very transparent, but it's a good start. These will also come 
                    #in handy for data clean up because diminutivization applies only to some classes. 

array(['dd-', 'vip-3s', 'li-ms', 'nms    ', 'e', 'nfp    ', 'y',
       'nfs    ', 'nmp    ', 'vsp-1/3s', 'n', 'cc', 'o', 'r', 'vps-ms',
       'ld-fs', 'm$', 'vc-1/3s', 'vr', 'po', 'li-fs', 'ld-mp', 'vip-3p',
       'ld-ms', 'jms    ', 'vpp', 'mc', 'vps-fs', 'j', 'vii-1/3s', 'e_21',
       'x', nan, 'cs', 'ld', 'v', 'cc-', 'ld-fp', 'pi-0cn', 'dd',
       'dxmp-ind-', 'vip-2s', 'dp-', 'vsp-3p', 'pi-0ms', 'jfp    ',
       'cS_21', 'cS_22', 'jmp    ', 'vsp-2s', 'vip-1s', 'vsp-1p', 'ps',
       'vip-1p/vis-1p', 'vip-1p', 'pd-3cs', 'vif-1p', 'vps-mp', 'vif-3s',
       'vif-3p', 'jfs    ', 'dxfs-ind-', 'vsj-1/3s', 'i', 'pi-3ms',
       'vis-3s', 'b', 'p', 'np', 'pr-3cn"', 'px', 'dxms-ind-', 'pi-3cs',
       'dxfs-', 'pr-3cs', 'px-ms', 'dxfp-ind-', 'vis-3p', 'li-mp', 'pi',
       'vsi-3p', 'px-mp', 'vsi-1/3s', 'pq-3cn"', 'vif-2s', 'vif-2p',
       'vsp-2p', 'vip-2p', 'vpp-00', 'vm-2p', 'vis-1p', 'dxcs-ind-',
       'pr-3cp', 'dxcs-dem-', 'vif-1s', 'vc-1p', 'cC_21', 'cC_22',
       'vps-fp',
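
Some tags in the output above carry trailing whitespace (e.g., `'nms    '`), so an exact comparison against the bare tag would miss them. A minimal sketch of normalizing the tags with `.str.strip()`, using a small hypothetical sample rather than the corpus file:

```python
import pandas as pd

# Hypothetical sample reproducing the padded tags seen in the output above.
pos = pd.Series(['nms    ', 'nfp    ', 'y', 'vip-3s', None])

# An exact comparison with the unpadded tag misses the padded value.
raw_matches = (pos == 'nms').sum()

# .str.strip() removes surrounding whitespace; missing values stay missing.
clean = pos.str.strip()
clean_matches = (clean == 'nms').sum()

print(raw_matches, clean_matches)
```

On the real data frame, the same idea would be `df['POS'] = df['POS'].str.strip()`, applied before any filtering that relies on noun or adjective tags.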

In [10]:
df['Variety'] = 'ES' #Time to add a column for the variety of Spanish. In this case Spain.

In [11]:
df.head() #It works out well.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431270,2206403096,Este,este,dd-,ES
1,431270,2206403097,es,ser,vip-3s,ES
2,431270,2206403098,un,un,li-ms,ES
3,431270,2206403099,blog,blog,nms,ES
4,431270,2206403100,de,de,e,ES


In [12]:
df.tail()

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
27345208,676060,2511862343,mas,mas,cc,ES
27345209,676060,2511862344,informacion,información,n,ES
27345210,676060,2511862345,visite,visitar,vsp-1/3s,ES
27345211,676060,2511862346,www.DineroAbundancia.com,www.dineroabundancia.com,n,ES
27345212,676060,2511862347,.,$.,y,ES


In [13]:
df.sample(5)

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
291949,434660,881998670,ambiguo,ambiguo,jms,ES
15794397,583110,2550976836,que,que,cs,ES
17396541,589870,1447054062,el,el,ld-ms,ES
6247489,494200,393454860,reducido,reducir,vps-ms,ES
16251953,584280,2048551884,la,la,ld-fs,ES


- A first step in cleaning up the data is to remove rows that are not necessary for this analysis. There are two main things to tackle first: symbols, and '@' tokens, which stand in for words that were removed for copyright reasons when the corpus was created. For the former, I can make use of the POS column, since symbols are tagged 'y'. 

In [14]:
df = df[df['POS'] != 'y'] 

In [15]:
df #It works. Looks like around 3,000,000 rows were removed.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431270,2206403096,Este,este,dd-,ES
1,431270,2206403097,es,ser,vip-3s,ES
2,431270,2206403098,un,un,li-ms,ES
3,431270,2206403099,blog,blog,nms,ES
4,431270,2206403100,de,de,e,ES
...,...,...,...,...,...,...
27345207,676060,2511862342,obtener,obtener,vr,ES
27345208,676060,2511862343,mas,mas,cc,ES
27345209,676060,2511862344,informacion,información,n,ES
27345210,676060,2511862345,visite,visitar,vsp-1/3s,ES


- Since '@' is meant to replace words and not symbols, it is tagged as a noun, so the same strategy doesn't work. An alternative is to use the Word column instead:

In [16]:
df = df[df['Word'] != '@'] 

In [17]:
df #Looks good, this removed about 1,500,000 more rows.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431270,2206403096,Este,este,dd-,ES
1,431270,2206403097,es,ser,vip-3s,ES
2,431270,2206403098,un,un,li-ms,ES
3,431270,2206403099,blog,blog,nms,ES
4,431270,2206403100,de,de,e,ES
...,...,...,...,...,...,...
27345207,676060,2511862342,obtener,obtener,vr,ES
27345208,676060,2511862343,mas,mas,cc,ES
27345209,676060,2511862344,informacion,información,n,ES
27345210,676060,2511862345,visite,visitar,vsp-1/3s,ES
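
Since the file is very large, the two filters above could also be applied while the file is being read, so that the unfiltered data never sits in memory all at once. The following is only a sketch of that idea, using the `chunksize` option of `pd.read_csv` on a hypothetical in-memory sample (the identifier line and the rows are made up to mimic the corpus layout):

```python
import io
import pandas as pd

cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

# Hypothetical miniature file: one identifier line, then tab-delimited rows.
sample = (
    "file-identifier\n"
    "431270\t2206403096\tEste\teste\tdd-\n"
    "431270\t2206403097\t@\t@\tn\n"
    "431270\t2206403098\tun\tun\tli-ms\n"
    "431270\t2206403099\t.\t$.\ty\n"
)

# With chunksize set, read_csv returns an iterator of smaller data frames.
reader = pd.read_csv(io.StringIO(sample), sep='\t', skiprows=[0],
                     header=None, names=cols, chunksize=2)

# Drop symbols and redacted tokens chunk by chunk, then stitch the rest together.
kept = [chunk[(chunk['POS'] != 'y') & (chunk['Word'] != '@')] for chunk in reader]
sample_df = pd.concat(kept, ignore_index=True)

print(sample_df)
```

For the real file, `io.StringIO(sample)` would be replaced by `fname` (with the `encoding` argument used earlier) and a much larger chunk size.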


- Lastly, I only want to keep diminutivized forms for analysis. The filter below keeps only rows ending in the segments of interest. I will probably refine this later using regular expressions instead. 

In [18]:
#The plural -illos/-illas endings are five characters long, so they need a five-character slice.
df = df[(df['Word'].str[-3:] == 'ito') | (df['Word'].str[-3:] == 'ita')|
  (df['Word'].str[-4:] == 'itos') | (df['Word'].str[-4:] == 'itas')|
  (df['Word'].str[-4:] == 'illo') | (df['Word'].str[-4:] == 'illa')|
  (df['Word'].str[-5:] == 'illos') | (df['Word'].str[-5:] == 'illas')]

In [19]:
df

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
98,431270,2206403194,Nikita,nikita,o,ES
110,431270,2206403206,escrito,escrito,jms,ES
321,431290,2074527333,calladita,calladito,j,ES
331,431290,2074527343,sólito,sólito,jms,ES
536,431310,2143630275,necesita,necesitar,vip-3s,ES
...,...,...,...,...,...,...
27344477,676040,250005751,ladrillo,ladrillo,nms,ES
27344546,676040,250005820,depósitos,depósito,nmp,ES
27344657,676040,250005931,precipita,precipitar,vip-3s,ES
27344674,676040,250005948,corralito,corralito,n,ES
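
As one possible version of the regular-expression refinement mentioned above, the eight endings can be collapsed into a single pattern. This is a sketch on a hypothetical list of tokens, not the filter used in this notebook:

```python
import pandas as pd

# Hypothetical sample of tokens; only some end in the segments of interest.
words = pd.Series(['calladita', 'ladrillo', 'depósitos', 'bolsillos', 'casa', 'blog'])

# (?:it|ill) matches the start of either suffix, [oa] the gender vowel,
# s? an optional plural marker, and $ anchors the match at the end of the word.
pattern = r'(?:it|ill)[oa]s?$'

# na=False treats missing tokens as non-matches instead of propagating NaN.
mask = words.str.contains(pattern, regex=True, na=False)
matches = words[mask].tolist()

print(matches)
```

Like the slice comparisons, this match is case-sensitive, and it still lets through non-diminutives such as `ladrillo` that happen to share the ending.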


- This gets the data to a first stage that's easier to work with. There are still many rows that do not belong in the data frame (e.g., verb forms other than gerunds that just happen to end in the same segments), but it will be more efficient to remove those once the full data frame is constructed. 
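
One way the later removal of verb forms might look is a filter on the POS column. The sketch below assumes that every tag beginning with `v` marks a verb form, which matches rows such as `necesita` and `precipita` (both tagged `vip-3s`) in the output above; the sample frame is hypothetical. The gerund tag is not identified in this notebook, so gerunds would still need to be exempted separately.

```python
import pandas as pd

# Hypothetical sample modeled on rows shown in the output above.
sample = pd.DataFrame({
    'Word': ['calladita', 'necesita', 'precipita', 'corralito'],
    'Lemma': ['calladito', 'necesitar', 'precipitar', 'corralito'],
    'POS': ['j', 'vip-3s', 'vip-3s', 'n'],
})

# Assumption: every tag that starts with 'v' marks a verb form.
# Stripping first guards against padded tags; na=False treats missing tags as non-verbs.
is_verb = sample['POS'].str.strip().str.startswith('v', na=False)
non_verbs = sample[~is_verb]

print(non_verbs)
```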

**3. Loading the corpus directory and processing all files**

- As a first step, knowing now that the operations above are successful, I will define functions to make the processing pipeline for the full data frame object more efficient and streamlined:

In [20]:
def toDF(fname):
    """Turns tab-delimited file into a data frame"""
    cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']
    df = pd.read_csv(fname,sep='\t',encoding ='iso-8859-1',skiprows=[0],header=None,names=cols)
    return df

In [21]:
def add_variety(df,variety):
    """Adds a column specifying a given variety of Spanish to a data frame"""
    df['Variety'] = variety

In [22]:
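def remove_syms(df):
    """Excludes symbols and returns the filtered data frame"""
    #Filtering creates a new object, so it has to be returned to reach the caller.
    return df[df['POS'] != 'y']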

In [23]:
def remove_redacted(df):
    """Excludes @ symbols, which stand for words that have been redacted for copyright reasons"""
    #Returns the filtered data frame rather than rebinding the local name only.
    return df[df['Word'] != '@']

In [24]:
def remove_nondims(df):
    """Removes tokens that do not end in the segments of interest (-ito/-illo)"""
    #The plural -illos/-illas endings need a five-character slice; the result is returned to the caller.
    return df[(df['Word'].str[-3:] == 'ito') | (df['Word'].str[-3:] == 'ita')|
              (df['Word'].str[-4:] == 'itos') | (df['Word'].str[-4:] == 'itas')|
              (df['Word'].str[-4:] == 'illo') | (df['Word'].str[-4:] == 'illa')|
              (df['Word'].str[-5:] == 'illos') | (df['Word'].str[-5:] == 'illas')]

- Now let's set the directory. I'll start with the Spain directory, since I used a Spain text file for the test run in Section 2 above.

In [25]:
corpus_dir = '../../Diminutive-Suffix-Productivity/private/data/'
es_fname = glob.glob(corpus_dir + 'wlp_ES-sbo/*.txt')
es_fname[0]

'../../Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\\es-b-0.txt'

In [26]:
len(es_fname) #There are 20 files in total

20

In [27]:
es_DF = pd.DataFrame(columns=cols) #Builds a new, empty data frame object.
add_variety(es_DF,'ES')
es_DF

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
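
A possible next step is to run the cleaning steps over every file in the directory and concatenate the results. The sketch below is self-contained and only illustrative: `load_and_filter` is a hypothetical helper that bundles the steps from Section 2, it uses a regular expression in place of the slice comparisons, and it runs on two tiny stand-in files written to a temporary directory rather than on the corpus.

```python
import glob
import os
import tempfile
import pandas as pd

COLS = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

def load_and_filter(fname, variety):
    """Hypothetical helper: load one file and apply the cleaning steps from Section 2."""
    df = pd.read_csv(fname, sep='\t', encoding='iso-8859-1',
                     skiprows=[0], header=None, names=COLS)
    # Drop symbols and redacted tokens.
    df = df[(df['POS'] != 'y') & (df['Word'] != '@')]
    # Keep tokens ending in -ito/-ita/-illo/-illa or their plurals.
    df = df[df['Word'].str.contains(r'(?:it|ill)[oa]s?$', na=False)]
    # assign returns a copy with the variety column added.
    return df.assign(Variety=variety)

# Two tiny stand-in files (made-up rows) written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    texts = ["id-line\n1\t10\tcorralito\tcorralito\tn\n1\t11\t.\t$.\ty\n",
             "id-line\n2\t20\t@\t@\tn\n2\t21\tcalladita\tcalladito\tj\n"]
    for i, text in enumerate(texts):
        path = os.path.join(tmp, f'es-b-{i}.txt')
        with open(path, 'w', encoding='iso-8859-1') as fh:
            fh.write(text)

    fnames = sorted(glob.glob(os.path.join(tmp, '*.txt')))
    # Filter each file first, then concatenate the much smaller results.
    demo_DF = pd.concat([load_and_filter(f, 'ES') for f in fnames],
                        ignore_index=True)

print(demo_DF)
```

Filtering each file before concatenating keeps the combined data frame small, which matters when each file holds tens of millions of rows.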
