# Adjective position rates in a corpus of Argentinean Spanish
Juan Berrios | juanberrios@pitt.edu | Last updated: October 4, 2020

**Summary and overview of the data:**

- The purpose of the code included in this notebook is to build a `DataFrame` object from the `.txt` files in the Argentinean corpus directory. Adjective position rates for six adjective lexemes are extracted from the resulting data frame. The corpus I am using is the [*Corpus del español*](https://www.corpusdelespanol.org/); more specifically, the [Web/Dialects](https://www.corpusdelespanol.org/web-dial/) corpus. While the corpus is searchable online, the full data set can also be accessed for computational analyses such as this one, which requires purchasing a license. I am authorized to use it through the license of the [Department of Linguistics](https://www.linguistics.pitt.edu/). Samples of the different formats can be downloaded from the [official website](https://www.corpusdata.org/formats.asp).

**Contents:**
1. [Preparation](#1.-Preparation) includes the necessary preparations.
2. [Loading files](#2.-Loading-files) includes code for loading the files, turning them into a data frame, and cleaning them, using one of the `.txt` files as a sample.
3. [Processing corpus directories](#3.-Processing-corpus-directories) includes code for performing the operations on a corpus directory containing all the text files for Argentinean Spanish. The resulting data frame is stored as a `.pkl` file in case further processing is needed.

## 1. Preparation

- Loading libraries and additional settings:

In [1]:
#Importing libraries
import glob, pickle, re
import pandas as pd
import numpy as np
import os

#Turning pretty print off:
%pprint

#Releasing all output:                                            
from IPython.core.interactiveshell import InteractiveShell #Prints all commands rather than the last one.
InteractiveShell.ast_node_interactivity = "all"

Pretty printing has been turned OFF


## 2. Loading files

- The `.txt` files are very large, so for testing purposes I'll start with only one of them. The files are tab-delimited, and the columns correspond to an ID for the source text, an ID for the token, the token (word), the lemma, and the POS tag. I will hence use those as column names.

In [2]:
fname = '../../adjective_position/data/cde/wlp_AR-tez/ar-b-0.txt'

In [3]:
cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

In [4]:
#First row is ignored because it corresponds to an identifier for the .txt file.

df = pd.read_csv(fname,sep='\t',encoding ='iso-8859-1',skiprows=[0],header=None,names=cols) 
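
- To make the loading options concrete, here is a minimal sketch on an in-memory stand-in for one file. The identifier line and the three token rows are made up for illustration (the real identifier format may differ), and `sample_df` is a hypothetical name:

```python
import io
import pandas as pd

# Hypothetical sample: the first line stands in for the file identifier,
# followed by three tab-delimited token rows.
sample = (
    "sample-identifier\n"
    "10\t1\ttipos\ttipo\tnmp\n"
    "10\t2\tde\tde\te\n"
    "10\t3\tsexo\tsexo\tnms\n"
)

cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

# skiprows=[0] drops the identifier line; header=None plus names= supplies
# the column labels, since the file itself has no header row.
sample_df = pd.read_csv(io.StringIO(sample), sep='\t', skiprows=[0],
                        header=None, names=cols)
```

Because the sample is read from a string, no `encoding` argument is needed here; the real files are read with `iso-8859-1` as in the cell above.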

In [5]:
df.shape #Size of the raw data frame in rows and columns.

(10087127, 5)

In [6]:
df #First and last five rows. Dimensions on the bottom. Data were loaded correctly.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS
0,10,2232878917,10,10,m$
1,10,2232878918,tipos,tipo,nmp
2,10,2232878919,de,de,e
3,10,2232878920,sexo,sexo,nms
4,10,2232878921,.,$.,y
...,...,...,...,...,...
10087122,108200,393712654,política,político,jfs
10087123,108200,393712655,de,de,e
10087124,108200,393712656,la,la,ld-fs
10087125,108200,393712657,Nación,nación,nms


- Before removing unneeded rows, it is necessary to add information about the word preceding and following each token, as well as their corresponding POS tags, so that it is possible to determine rates of pre- and post-position later:

In [7]:
df['Previous_word'] = df['Word'].shift() #Extracting previous word and POS tag
df['Previous_POS'] = df['POS'].shift() 
df['Following_word'] = df['Word'].shift(-1) #Extracting following word and POS tag
df['Following_POS'] = df['POS'].shift(-1)

In [8]:
df #New columns have been added

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Previous_word,Previous_POS,Following_word,Following_POS
0,10,2232878917,10,10,m$,,,tipos,nmp
1,10,2232878918,tipos,tipo,nmp,10,m$,de,e
2,10,2232878919,de,de,e,tipos,nmp,sexo,nms
3,10,2232878920,sexo,sexo,nms,de,e,.,y
4,10,2232878921,.,$.,y,sexo,nms,.,y
...,...,...,...,...,...,...,...,...,...
10087122,108200,393712654,política,político,jfs,participación,nfs,de,e
10087123,108200,393712655,de,de,e,política,jfs,la,ld-fs
10087124,108200,393712656,la,la,ld-fs,de,e,Nación,nms
10087125,108200,393712657,Nación,nación,nms,la,ld-fs,.,y
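
- One caveat worth checking: `shift()` works on row order alone, so the first token of a text receives the last token of the preceding text as its previous word. A minimal sketch of a per-text alternative, assuming toy data and the hypothetical column name `Previous_word_same_text` (neither is part of the corpus), could look like this:

```python
import pandas as pd

# Toy data: two source texts of two tokens each (values are illustrative only).
toy = pd.DataFrame({
    'SourceID': [10, 10, 20, 20],
    'Word': ['casa', 'nueva', 'gran', 'idea'],
    'POS': ['nfs', 'jfs', 'j', 'nfs'],
})

# Plain shift: the first token of text 20 inherits the last token of text 10.
toy['Previous_word'] = toy['Word'].shift()

# Grouped shift: the context never crosses a SourceID boundary, so the first
# token of each text gets NaN instead.
toy['Previous_word_same_text'] = toy.groupby('SourceID')['Word'].shift()
```

Whether this matters for the rates depends on how often a text begins or ends with one of the adjectives under study, which I have not measured here.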


- Now that concordances have been added, we are keeping only the rows of interest: 

In [9]:
adj_lexemes = ['grande', 'bueno', 'nuevo', 'rudo', 'fresco', 'llano'] #Adjectives under study

In [10]:
df = df[df['Lemma'].isin(adj_lexemes)] #Keep only the rows of interest in the data frame

In [11]:
df #First and last five rows. Dimensions on the bottom

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Previous_word,Previous_POS,Following_word,Following_POS
2171,30,2245675103,nueva,nuevo,jfs,la,ld-fs,integrante,nms
2647,30,2245675579,nueva,nuevo,jfs,Una,li-fs,pareja,nfs
2748,30,2245675680,buenos,bueno,jmp,ni,cc,ni,cc
2775,30,2245675707,gran,grande,j,",",y,herramienta,nfs
2861,30,2245675793,buena,bueno,jfs,la,ld-fs,onda,nfs
...,...,...,...,...,...,...,...,...,...
10086814,108200,393712346,nuevo,nuevo,jms,el,ld-ms,hombre,nms
10086923,108200,393712455,nueva,nuevo,jfs,la,ld-fs,Nación,o
10086996,108200,393712528,nuevo,nuevo,jms,como,r,protagonista,nms
10087097,108200,393712629,Nuevo,nuevo,jms,?,y,concepto,nms


- Lastly, it is necessary to remove rows that are not needed for analysis (e.g., those containing lexemes that are not adjectives, or adjectives used after a copula rather than with a noun):

In [12]:
df = df.dropna() #Dropping NaN values which might cause errors later.

In [13]:
df['POS'].unique() #These are the POS currently included. Only those containing 'j' (adjectives) are to be kept

array(['jfs    ', 'jmp    ', 'j', 'jms    ', 'jfp    ', 'r', 'o',
       'nmp    ', 'nms    ', 'i'], dtype=object)

In [14]:
df = df[df['POS'].str.contains('j')] #Removes unneeded values

In [15]:
df['POS'].unique() #Values removed. There is also unnecessary white space which can be removed

array(['jfs    ', 'jmp    ', 'j', 'jms    ', 'jfp    '], dtype=object)

In [16]:
df = df.applymap(lambda x: x.rstrip() if isinstance(x, str) else x) #Strip trailing white space from every string value in the data frame

In [17]:
df['POS'].unique() #Effectively removed

array(['jfs', 'jmp', 'j', 'jms', 'jfp'], dtype=object)
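
- As an aside, the same cleanup can be done column by column with the `.str` accessor, which avoids visiting the numeric columns at all. This is only a sketch on toy rows; the `text_cols` list is an assumption and would need to name every text column in the real data frame:

```python
import pandas as pd

# Toy rows with the kind of padded POS tags seen above (illustrative values).
toy = pd.DataFrame({
    'TokenID': [1, 2],
    'Word': ['nueva', 'gran'],
    'POS': ['jfs    ', 'j'],
})

# Hypothetical list of the columns that hold text.
text_cols = ['Word', 'POS']

# Strip trailing white space in the text columns only.
for col in text_cols:
    toy[col] = toy[col].str.rstrip()
```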

- Another cleaning step is to remove tokens that do not correspond to adjective-noun combinations.

In [18]:
df = df[df['Previous_POS'].str.startswith('n')|df['Following_POS'].str.startswith('n')] #Removes unneeded rows. 

- The (grammatical) number of each token is relevant information to determine rates later. Let's make the column more transparent and create new columns in the process:

In [19]:
#Build dictionaries that map POS into desired number values:

number_dict = {'j': 'unknown', 'jms': 'singular', 'jfs': 'singular', 'jmp': 'plural', 
                  'jfp': 'plural'}

#Note that some tokens are not tagged for number. As a solution, I'll create an 'unknown' label. It is also possible
#to write code that will tag them based on certain patterns.

In [20]:
#Mapping values to new column

df['Number'] = df['POS'].map(number_dict) #Map new values

In [21]:
df.sample(10) #Ten sample rows.

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Previous_word,Previous_POS,Following_word,Following_POS,Number
6547047,65560,2419103415,buenos,bueno,jmp,2,m$,delanteros,nmp,plural
1768571,12920,1580985524,nuevas,nuevo,jfp,son,vip-3p,pruebas,nfp,plural
3576385,32990,1769309363,nuevas,nuevo,jfp,las,ld-fp,posibilidades,nfp,plural
9778649,104310,2301100275,buen,bueno,j,un,li-ms,repulgue,n,unknown
5584831,53570,641167647,grandes,grande,jmp,de,e,bodegas,nfp,plural
4371770,41690,968266488,buenos,bueno,jmp,dan,vip-3p,resultados,n,plural
3003391,26630,803596764,grandes,grande,jmp,poetas,nmp,como,e,plural
59807,540,2629794278,nuevo,nuevo,jms,un,li-ms,hombre,nms,singular
627714,5620,773097723,nuevas,nuevo,jfp,las,ld-fp,generaciones,nfp,plural
1313475,9800,1256300105,nuevos,nuevo,jmp,los,ld-mp,smartphones,n,plural


- Let's look at the unknown values to finish. The information needed to resolve them can be extracted from the POS columns:

In [22]:
df['Number'].value_counts() #There are 6627 'unknown' values

singular    7259
unknown     6627
plural      5253
Name: Number, dtype: int64
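
- One possible way to resolve the unknown values, sketched here on made-up rows, is to borrow the number of the adjacent noun whenever its tag carries one (e.g., `nfs`, `nmp`). This assumes that three-character noun tags end in `s` or `p`, as in the samples above; the helper names `number_from_tag` and `resolve_number` are hypothetical, and rows whose noun is tagged plain `n` stay unknown:

```python
import pandas as pd

# Illustrative rows only; the tag combinations are not taken from the corpus.
toy = pd.DataFrame({
    'Word': ['gran', 'buen', 'grandes', 'grande'],
    'POS': ['j', 'j', 'jmp', 'j'],
    'Previous_POS': ['y', 'li-ms', 'nmp', 'nfs'],
    'Following_POS': ['nfs', 'n', 'e', 'y'],
    'Number': ['unknown', 'unknown', 'plural', 'unknown'],
})

def number_from_tag(tag):
    """Return 'singular' or 'plural' for noun tags such as 'nfs' or 'nmp', else None."""
    if isinstance(tag, str) and tag.startswith('n') and len(tag) == 3:
        return {'s': 'singular', 'p': 'plural'}.get(tag[-1])
    return None

def resolve_number(row):
    """Fill an 'unknown' Number from the following noun, then from the preceding one."""
    if row['Number'] != 'unknown':
        return row['Number']
    return (number_from_tag(row['Following_POS'])
            or number_from_tag(row['Previous_POS'])
            or 'unknown')

toy['Number'] = toy.apply(resolve_number, axis=1)
```

Checking the following noun first is a design choice rather than a rule of the corpus; the order could be reversed if postnominal uses turn out to be more common among the untagged tokens.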

- Now that there's a defined process flow, the next step is to apply all operations to each `.txt` file in the directory in a streamlined fashion. I'll delete unneeded objects before proceeding:

In [23]:
del df
del fname
del cols

## 3. Processing corpus directories

- As a first step, knowing now that the operations above are successful, I will define functions to make the processing pipeline for the full data frame object more efficient and streamlined:

In [24]:
def toDF(fname):
    """Turns tab-delimited file into a data frame"""
    cols = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']
    df = pd.read_csv(fname,sep='\t',encoding ='iso-8859-1',skiprows=[0],header=None,names=cols)
    df = df.dropna() #Dropping NaN values which might cause errors later.
    return df

In [25]:
def surrounding_words(df):
    """Adds four columns containing the token's previous and following word and corresponding POS tags"""
    df['Previous_word'] = df['Word'].shift()
    df['Previous_POS'] = df['POS'].shift() 
    df['Following_word'] = df['Word'].shift(-1)
    df['Following_POS'] = df['POS'].shift(-1)
    return df

In [26]:
def adj_lexemes(df):
    """Keeps only rows containing adjective lexemes under study"""
    lexemes = ['grande', 'bueno', 'nuevo', 'rudo', 'fresco', 'llano'] #Adjectives under study (named so as not to shadow the function)
    df = df[df['Lemma'].isin(lexemes)] #Keep only the rows of interest in the data frame
    return df

In [27]:
def clean_POS(df):
    """Removes white spaces and rows containing extraneous POS"""
    df = df.applymap(lambda x: x.rstrip() if isinstance(x, str) else x) #Strips trailing white space from strings
    df = df[df['POS'].str.contains('j')] #Removes non-adjective tokens
    df = df[df['Previous_POS'].str.startswith('n', na=False)|df['Following_POS'].str.startswith('n', na=False)] #Removes non A-N rows; a missing neighbor (first/last row of a file) counts as no noun
    return df

In [28]:
def add_number(df):
    """Adds column specifying grammatical number for each token"""
    number_dict = {'j': 'unknown', 'jms': 'singular', 'jfs': 'singular', 'jmp': 'plural', 
                  'jfp': 'plural'}
    df['Number'] = df['POS'].map(number_dict)
    return df

- Now let's set the directory:

In [29]:
fname = '../../adjective_position/data/cde/wlp_AR-tez/ar-b-0.txt'

corpus_dir = '../../adjective_position/data/cde/'
ar_dir = glob.glob(corpus_dir + 'wlp_AR-tez/*.txt')
ar_dir[0] #First file found in directory

'../../adjective_position/data/cde/wlp_AR-tez\\ar-b-0.txt'

In [30]:
len(ar_dir) #There are 20 files in total

20

In [31]:
ar_df = pd.DataFrame(columns=['SourceID', 'TokenID', 'Word', 'Lemma', 'POS', 'Previous_word',
       'Previous_POS', 'Following_word', 'Following_POS', 'Number']) #Builds a new, empty data frame object.
ar_df

Unnamed: 0,SourceID,TokenID,Word,Lemma,POS,Previous_word,Previous_POS,Following_word,Following_POS,Number


In [32]:
for fname in ar_dir:                            #Create a master data frame 
    df = toDF(fname)                             
    df = surrounding_words(df)
    df = adj_lexemes(df)
    df = clean_POS(df)
    df = add_number(df)                   
    ar_df = pd.concat([ar_df, df], sort=True)
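
- As a possible refinement, the per-file frames can be collected in a list and concatenated once, so that the master data frame is not copied on every iteration. The sketch below uses two in-memory stand-ins for corpus files with made-up contents; `load_tokens`, `frames`, and `master` are hypothetical names, and the adjective-filtering steps are left out to keep it short:

```python
import io
import pandas as pd

COLS = ['SourceID', 'TokenID', 'Word', 'Lemma', 'POS']

def load_tokens(source):
    """Read one tab-delimited file (or file-like object), skipping the identifier row."""
    return pd.read_csv(source, sep='\t', skiprows=[0], header=None, names=COLS)

# Two in-memory stand-ins for corpus files (contents are illustrative only).
files = [
    io.StringIO("id-a\n10\t1\tgran\tgrande\tj\n10\t2\tidea\tidea\tnfs\n"),
    io.StringIO("id-b\n20\t1\tcasa\tcasa\tnfs\n20\t2\tnueva\tnuevo\tjfs\n"),
]

# Build every per-file frame first, then concatenate a single time.
frames = [load_tokens(f) for f in files]
master = pd.concat(frames, ignore_index=True)
```

Note that `ignore_index=True` renumbers the rows from zero, whereas the loop above keeps each file's original row index; drop that argument if the original indices are needed.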

In [33]:
ar_df #First and last five rows and dimensions on the bottom.

Unnamed: 0,Following_POS,Following_word,Lemma,Number,POS,Previous_POS,Previous_word,SourceID,TokenID,Word
2171,nms,integrante,nuevo,singular,jfs,ld-fs,la,30,2245675103,nueva
2647,nfs,pareja,nuevo,singular,jfs,li-fs,Una,30,2245675579,nueva
2775,nfs,herramienta,grande,unknown,j,y,",",30,2245675707,gran
2861,nfs,onda,bueno,singular,jfs,ld-fs,la,30,2245675793,buena
3415,nms,representante,nuevo,singular,jms,dp-,su,40,2407904224,nuevo
...,...,...,...,...,...,...,...,...,...,...
10441803,nms,esfuerzo,grande,unknown,j,li-ms,un,1405719,1907146676,gran
10441944,vsp-1/3s,ponga,grande,singular,jms,nms,equipo,1405719,1907146817,grande
10442385,nms,comienzo,bueno,unknown,j,li-ms,un,1405729,1909545195,buen
10442494,y,",",bueno,singular,jms,n,@,1405729,1909545304,bueno


In [34]:
ar_df.keys() #Order is shuffled

Index(['Following_POS', 'Following_word', 'Lemma', 'Number', 'POS',
       'Previous_POS', 'Previous_word', 'SourceID', 'TokenID', 'Word'],
      dtype='object')

In [35]:
ar_df = ar_df[['SourceID', 'TokenID', 'Word', 'Lemma', 'POS', 'Previous_word',
       'Previous_POS', 'Following_word', 'Following_POS', 'Number']]

In [36]:
ar_df.keys() #Order reestablished.

Index(['SourceID', 'TokenID', 'Word', 'Lemma', 'POS', 'Previous_word',
       'Previous_POS', 'Following_word', 'Following_POS', 'Number'],
      dtype='object')

- Storing the master data frame:

In [37]:
ar_df.to_pickle('pkl/ar_DF.pkl')
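
- With the adjective-noun rows stored, position rates can be derived from the two POS context columns. The following is a rough sketch on made-up rows, not the analysis actually run for this project; the `Position` column, its labels, and the `position` helper are assumptions introduced for illustration, and the rows are assumed to have a noun on at least one side, as after the cleaning step above:

```python
import pandas as pd

# Illustrative rows shaped like the cleaned data frame (values are made up).
toy = pd.DataFrame({
    'Lemma': ['nuevo', 'nuevo', 'grande', 'grande', 'bueno'],
    'Previous_POS': ['ld-fs', 'nfs', 'li-ms', 'nms', 'li-ms'],
    'Following_POS': ['nfs', 'y', 'nms', 'vsp-1/3s', 'nms'],
})

def position(row):
    """Label a token by where the adjacent noun sits; nouns on both sides are 'ambiguous'."""
    noun_before = row['Previous_POS'].startswith('n')
    noun_after = row['Following_POS'].startswith('n')
    if noun_before and noun_after:
        return 'ambiguous'
    return 'postnominal' if noun_before else 'prenominal'

toy['Position'] = toy.apply(position, axis=1)

# Share of each position per lemma; every row of the table sums to 1.
rates = pd.crosstab(toy['Lemma'], toy['Position'], normalize='index')
```

Tokens flanked by nouns on both sides are kept apart as `ambiguous` rather than forced into either category, since the POS context alone cannot tell which noun the adjective modifies.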