## Diminutive Suffix Productivity
Juan Berrios | jeb358@pitt.edu | Last updated: February 23, 2020

**Summary:**

- The code in this notebook builds `DataFrame` objects from the `.txt` files in the corpus directories. Section 1 covers the necessary preparations. Section 2 loads the files. Section 3 builds the objects. Section 4 is a first attempt at exploring and visualizing the data.

**1. Preparation**

- Loading libraries and additional settings:

In [12]:
#Importing libraries
import nltk, glob, re, pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Toggling pretty printing (%pprint flips the current setting):
%pprint

#Releasing all output:
from IPython.core.interactiveshell import InteractiveShell #Displays the result of every expression in a cell rather than only the last one.
InteractiveShell.ast_node_interactivity = "all"

Pretty printing has been turned ON


- Defining functions: 

In [13]:
def toDF(fname):
    """Turns tab-delimited file into a data frame"""
    cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']
    df = pd.read_csv(fname,sep='\t',encoding ='iso-8859-1',skiprows=[0],header=None,names=cols)
    return df

In [14]:
def add_variety(df,variety):
    """Adds a column specifying a given variety of Spanish to a data frame"""
    df['Variety'] = variety

In [15]:
def remove_syms(df):
    """Excludes symbols by keeping only rows whose lemma is purely alphabetic"""
    is_alpha = df['Lemma'].map(lambda s: isinstance(s, str) and s.isalpha())
    return df[is_alpha]

In [16]:
def extract_dims(df):
    """Removes tokens that do not end in the segments of interest (-ito/-illo)"""
    #Matches -ito/-ita/-illo/-illa and their plurals at the end of the lemma
    ends_in_dim = df['Lemma'].str.contains(r'(?:it|ill)[oa]s?$', na=False)
    return df[ends_in_dim]

In [17]:
def remove_lexforms(df, lexforms):
    """Removes diminutivized forms that have lexicalized"""
    #Assumption: lexforms is a collection of lexicalized lemmas to exclude; the list itself is still to be compiled
    return df[~df['Lemma'].isin(lexforms)]
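
- The three filtering steps drafted above (dropping symbols, keeping lemmas that end in the segments of interest, and removing lexicalized forms) can be sketched on a toy data frame. This is only an illustration under stated assumptions: the rows are made up, the regular expression is one possible way to match -ito/-illo endings, and the `lexicalized` set is a hypothetical placeholder rather than a real list.

```python
import pandas as pd

# Toy rows in the corpus column layout; the values are made up for illustration.
toy = pd.DataFrame({
    'Word':  ['casita', 'perro', 'Pueblillos', '.', 'bonito', None],
    'Lemma': ['casita', 'perro', 'pueblillo', '$.', 'bonito', None],
})

# Step 1: keep purely alphabetic lemmas (drops punctuation such as '$.' and missing values).
is_alpha = toy['Lemma'].map(lambda s: isinstance(s, str) and s.isalpha())

# Step 2: keep lemmas ending in -ito/-ita/-illo/-illa, optionally pluralized.
ends_in_dim = toy['Lemma'].str.contains(r'(?:it|ill)[oa]s?$', na=False)

# Step 3: drop lexicalized forms. This set is a hypothetical placeholder.
lexicalized = {'bonito'}
not_lexicalized = ~toy['Lemma'].isin(lexicalized)

dims = toy[is_alpha & ends_in_dim & not_lexicalized]
print(dims)
```

Combining boolean masks keeps each criterion separate and easy to inspect, which may help when deciding which forms count as lexicalized.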

**2. Import files**

- The text files are very large. For testing purposes, I'll start with only one of them. The `.txt` files are tab-delimited, which makes my job a little easier. The columns correspond to an ID for the source text, an ID for the token, the token (word), the lemma, and the POS. I will use those as column names.

In [25]:
fname = 'C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo/es-b-0.txt'

In [6]:
cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

In [4]:
#First row ignored because it corresponds to an identifier for the .txt file.

df = pd.read_csv(fname,sep='\t',encoding ='iso-8859-1',skiprows=[0],header=None,names=cols) 

In [13]:
df.shape #It's a very large file. It will get much smaller once I start extracting the tokens of interest.

(27345213, 5)
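
- With this many rows, memory use is worth checking. The sketch below shows one possible way to shrink a repetitive string column such as `POS` by converting it to pandas' `category` dtype. The toy data and the variable names `before` and `after` are assumptions for illustration; actual savings on the corpus files are not measured here.

```python
import pandas as pd

# Toy column with a handful of tags repeated many times, as in the POS column.
toy = pd.DataFrame({'POS': ['nms', 'e', 'nms', 'e', 'nms', 'vip-3s'] * 1000})

# deep=True counts the memory held by the strings themselves.
before = toy['POS'].memory_usage(deep=True)

# A categorical stores each distinct tag once plus a small integer code per row.
toy['POS'] = toy['POS'].astype('category')
after = toy['POS'].memory_usage(deep=True)

print(before, after)
```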

In [28]:
df.head()   #The lemma column will be useful when I need aggregations that put lowercase and uppercase
            #as well as plural and singular forms together.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
0,431270,2206403096,Este,este,dd-
1,431270,2206403097,es,ser,vip-3s
2,431270,2206403098,un,un,li-ms
3,431270,2206403099,blog,blog,nms
4,431270,2206403100,de,de,e
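
- As noted in the comment above, the lemma column lets case and number variants be counted together. A minimal sketch of such an aggregation, assuming a toy data frame with made-up rows (the names `lemma_counts` and `forms_per_lemma` are illustrative only):

```python
import pandas as pd

# A few rows in the same column layout as the corpus file (values made up).
toy = pd.DataFrame({
    'Word':  ['Casa', 'casas', 'casa', 'Perros', 'perro'],
    'Lemma': ['casa', 'casa', 'casa', 'perro', 'perro'],
})

# Token frequency per lemma: case and number variants are counted together.
lemma_counts = toy.groupby('Lemma').size().sort_values(ascending=False)

# Number of distinct surface forms attested for each lemma.
forms_per_lemma = toy.groupby('Lemma')['Word'].nunique()

print(lemma_counts)
print(forms_per_lemma)
```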


In [29]:
df.tail()

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
27345208,676060,2511862343,mas,mas,cc
27345209,676060,2511862344,informacion,información,n
27345210,676060,2511862345,visite,visitar,vsp-1/3s
27345211,676060,2511862346,www.DineroAbundancia.com,www.dineroabundancia.com,n
27345212,676060,2511862347,.,$.,y


In [30]:
df.sample(5)

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
15815166,583170,1988799088,continuaciones,continuación,nfp
12487263,553890,1499417129,una,un,li-fs
18153970,594570,645278646,plazo,plazo,nms
18391857,595570,645299939,tiene,tener,vip-3s
4634351,481160,1980220106,anónimo,anónimo,jms


In [22]:
df['POS'].unique() #These POS tags are not very transparent, but it's a good start.

array(['dd-', 'vip-3s', 'li-ms', 'nms    ', 'e', 'nfp    ', 'y',
       'nfs    ', 'nmp    ', 'vsp-1/3s', 'n', 'cc', 'o', 'r', 'vps-ms',
       'ld-fs', 'm$', 'vc-1/3s', 'vr', 'po', 'li-fs', 'ld-mp', 'vip-3p',
       'ld-ms', 'jms    ', 'vpp', 'mc', 'vps-fs', 'j', 'vii-1/3s', 'e_21',
       'x', nan, 'cs', 'ld', 'v', 'cc-', 'ld-fp', 'pi-0cn', 'dd',
       'dxmp-ind-', 'vip-2s', 'dp-', 'vsp-3p', 'pi-0ms', 'jfp    ',
       'cS_21', 'cS_22', 'jmp    ', 'vsp-2s', 'vip-1s', 'vsp-1p', 'ps',
       'vip-1p/vis-1p', 'vip-1p', 'pd-3cs', 'vif-1p', 'vps-mp', 'vif-3s',
       'vif-3p', 'jfs    ', 'dxfs-ind-', 'vsj-1/3s', 'i', 'pi-3ms',
       'vis-3s', 'b', 'p', 'np', 'pr-3cn"', 'px', 'dxms-ind-', 'pi-3cs',
       'dxfs-', 'pr-3cs', 'px-ms', 'dxfp-ind-', 'vis-3p', 'li-mp', 'pi',
       'vsi-3p', 'px-mp', 'vsi-1/3s', 'pq-3cn"', 'vif-2s', 'vif-2p',
       'vsp-2p', 'vip-2p', 'vpp-00', 'vm-2p', 'vis-1p', 'dxcs-ind-',
       'pr-3cp', 'dxcs-dem-', 'vif-1s', 'vc-1p', 'cC_21', 'cC_22',
       'vps-fp',
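
- Several of the tags in the output above carry trailing spaces (for example `'nms    '`), so the same tag could be counted under two spellings. One possible cleanup, sketched on a small sample of tags copied from that output (`pos_clean` is an illustrative name):

```python
import pandas as pd

# Tags copied from the output above, including padded ones and a missing value.
pos = pd.Series(['dd-', 'nms    ', 'nfp    ', 'jms    ', None, 'e'])

# .str.strip() removes surrounding whitespace and leaves missing values untouched.
pos_clean = pos.str.strip()

print(pos_clean.unique())
```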

In [31]:
df['Variety'] = 'ES' #Time to add a column for the variety of Spanish

In [32]:
df.head()

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431270,2206403096,Este,este,dd-,ES
1,431270,2206403097,es,ser,vip-3s,ES
2,431270,2206403098,un,un,li-ms,ES
3,431270,2206403099,blog,blog,nms,ES
4,431270,2206403100,de,de,e,ES


In [36]:
df.tail()

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
27345208,676060,2511862343,mas,mas,cc,ES
27345209,676060,2511862344,informacion,información,n,ES
27345210,676060,2511862345,visite,visitar,vsp-1/3s,ES
27345211,676060,2511862346,www.DineroAbundancia.com,www.dineroabundancia.com,n,ES
27345212,676060,2511862347,.,$.,y,ES


In [43]:
df.sample(5)

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
12039497,549480,398527564,notable,notable,jms,ES
26455927,670370,122994012,dopaje,dopaje,nms,ES
27258438,674960,1688566260,este,este,dd-,ES
16321843,584610,2634468121,de,de,e,ES
26447521,670270,2039624095,que,que,cs,ES


**3. Build objects**

In [18]:
corpus_dir = "C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/"
es_fname = glob.glob(corpus_dir + 'wlp_ES-sbo/*.txt')
es_fname[0]

'C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\\es-b-0.txt'

In [30]:
es_DF = pd.DataFrame(columns=cols) #Builds a new, empty data frame object.
es_DF

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS


In [20]:
add_variety(es_DF,'ES')

In [21]:
es_DF

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety


In [22]:
for fname in es_fname:
    print(fname)

C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\es-b-0.txt
C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\es-b-1.txt
C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\es-b-2.txt
C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\es-b-3.txt
C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\es-b-4.txt
C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\es-b-5.txt
C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\es-b-6.txt
C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\es-b-7.txt
C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\es-b-8.txt
C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo\es-b-9.txt
C:/Users/J

In [31]:
#for fname in es_fname:
#    df = toDF(fname)
#    es_DF = pd.concat([es_DF, df], sort=True)

MemoryError: 
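
- The loop above ran out of memory because every file was concatenated in full. One possible workaround, sketched here under assumptions, is to read each file in chunks and keep only the rows of interest before concatenating. The helper `load_filtered`, the pattern, and the two tiny temporary files are hypothetical stand-ins; the `read_csv` arguments mirror those of `toDF`.

```python
import glob
import os
import tempfile

import pandas as pd

COLS = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

def load_filtered(fname, pattern, chunksize=1_000_000):
    """Reads one tab-delimited file in chunks, keeping rows whose lemma matches pattern."""
    kept = []
    reader = pd.read_csv(fname, sep='\t', encoding='iso-8859-1', skiprows=[0],
                         header=None, names=COLS, chunksize=chunksize,
                         dtype={'Word': str, 'Lemma': str, 'POS': str})
    for chunk in reader:
        mask = chunk['Lemma'].str.contains(pattern, na=False)
        kept.append(chunk[mask])
    if not kept:  # file had no data rows
        return pd.DataFrame(columns=COLS)
    return pd.concat(kept, ignore_index=True)

# Demo on two made-up files; the first line of each mimics the file identifier row.
with tempfile.TemporaryDirectory() as tmp:
    samples = [['1\t10\tcasita\tcasita\tnfs', '1\t11\tde\tde\te'],
               ['2\t20\tperrito\tperrito\tnms', '2\t21\t.\t$.\ty']]
    for i, rows in enumerate(samples):
        with open(os.path.join(tmp, f'es-b-{i}.txt'), 'w', encoding='iso-8859-1') as f:
            f.write('file-id\n' + '\n'.join(rows) + '\n')
    parts = [load_filtered(p, r'(?:it|ill)[oa]s?$', chunksize=1)
             for p in sorted(glob.glob(os.path.join(tmp, '*.txt')))]
    es_dims = pd.concat(parts, ignore_index=True)

print(es_dims)
```

Filtering per chunk means that only the matching rows are held in memory at once, at the cost of deciding on the filter before the full data frame exists.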

In [23]:
testdf = toDF('C:/Users/Juan/Documents/LING2340/Diminutive-Suffix-Productivity/private/data/wlp_ES-sbo/es-b-1.txt') #Forward slash avoids the invalid '\e' escape sequence

In [24]:
testdf.head()

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
0,431261,2028532228,A,a,e
1,431261,2028532229,la,la,ld-fs
2,431261,2028532230,porra,porra,nms
3,431261,2028532231,las,la,ld-fp
4,431261,2028532232,escuchas,escucha,nmp


In [25]:
add_variety(testdf,'ES')

In [26]:
testdf

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Variety
0,431261,2028532228,A,a,e,ES
1,431261,2028532229,la,la,ld-fs,ES
2,431261,2028532230,porra,porra,nms,ES
3,431261,2028532231,las,la,ld-fp,ES
4,431261,2028532232,escuchas,escucha,nmp,ES
...,...,...,...,...,...,...
28524066,676061,2126301162,una,un,li-fs,ES
28524067,676061,2126301163,especie,especie,nfs,ES
28524068,676061,2126301164,de,de,e,ES
28524069,676061,2126301165,santo,santo,nms,ES


In [27]:
es_DF = pd.concat([es_DF, testdf], sort=True)

In [28]:
es_DF.head()

Unnamed: 0,Lemma,POS,SourceID,TokenID,Variety,Word
0,a,e,431261,2028532228,ES,A
1,la,ld-fs,431261,2028532229,ES,la
2,porra,nms,431261,2028532230,ES,porra
3,la,ld-fp,431261,2028532231,ES,las
4,escucha,nmp,431261,2028532232,ES,escuchas


In [29]:
es_DF.tail()

Unnamed: 0,Lemma,POS,SourceID,TokenID,Variety,Word
28524066,un,li-fs,676061,2126301162,ES,una
28524067,especie,nfs,676061,2126301163,ES,especie
28524068,de,e,676061,2126301164,ES,de
28524069,santo,nms,676061,2126301165,ES,santo
28524070,$.,y,676061,2126301166,ES,.


- Save objects as CSV files for further processing later on. 
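
- A minimal sketch of that saving step, assuming a small stand-in data frame and a hypothetical output file name (`es_DF.csv` in a temporary directory). Passing `index=False` keeps the row index from being written as an extra unnamed column.

```python
import os
import tempfile

import pandas as pd

# Stand-in with the same columns as es_DF (values copied from the head() output above).
sample_df = pd.DataFrame({
    'Lemma': ['a', 'la'], 'POS': ['e', 'ld-fs'], 'SourceID': [431261, 431261],
    'TokenID': [2028532228, 2028532229], 'Variety': ['ES', 'ES'], 'Word': ['A', 'la'],
})

# Hypothetical output location; a temporary directory keeps the sketch self-contained.
with tempfile.TemporaryDirectory() as out_dir:
    out_path = os.path.join(out_dir, 'es_DF.csv')
    sample_df.to_csv(out_path, index=False, encoding='utf-8')
    # Reading the file back checks that the columns survived the round trip.
    reloaded = pd.read_csv(out_path, encoding='utf-8')

print(reloaded.columns.tolist())
```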