# **Computational Analysis of Linguistic Data**
### Javier Vera Zúñiga, javier.vera@pucv.cl
## **Class 4**
### File handling: **text + tables!**

## Part A. **(fundamentals of building dictionaries from) pandas!**
### [link Glottolog](https://glottolog.org/)
### [link pandas](https://pandas.pydata.org/)

In [1]:
## let's use pandas!
## recommendation: always try to use pandas for tabular data!!! (at least for reading it)

import pandas as pd

In [2]:
## https://glottolog.org/meta/downloads
## note that "langs" is stored as an object of type DataFrame

langs = pd.read_csv('languages_and_dialects_geo.csv',sep=',')
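
To see what `pd.read_csv` returns without downloading the Glottolog file, here is a minimal sketch that parses a small CSV held in memory. The names `csv_text` and `toy` are made up for illustration, and the two rows are copied from the table shown in this notebook.

```python
import io

import pandas as pd

# A tiny CSV with the same layout as the first columns of the Glottolog file.
csv_text = "glottocode,name,isocodes\naari1239,Aari,aiw\naala1237,Aalawa,\n"

# read_csv accepts a file path or any file-like object; the default separator is ','.
toy = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read as NaN (missing values).
missing_iso = pd.isna(toy.loc[1, "isocodes"])
```

The empty `isocodes` field of the second row is read as a missing value, which is why the real table contains so many NaN entries.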

In [3]:
langs

| | glottocode | name | isocodes | level | macroarea | latitude | longitude |
|---|---|---|---|---|---|---|---|
| 0 | 3adt1234 | 3Ad-Tekles | | dialect | Africa | | |
| 1 | aala1237 | Aalawa | | dialect | Papunesia | | |
| 2 | aant1238 | Aantantara | | dialect | Papunesia | | |
| 3 | aari1239 | Aari | aiw | language | Africa | 5.95034 | 36.5721 |
| 4 | aari1240 | Aariya | aay | language | Eurasia | | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 21324 | zuwa1238 | Zuwadza | | dialect | Papunesia | | |
| 21325 | zwal1238 | Zwall | | dialect | Africa | | |
| 21326 | zyph1238 | Zyphe | zyp | language | Eurasia | 22.52400 | 93.2640 |
| 21327 | zyud1238 | Zyuzdin | | dialect | Eurasia | | |


In [4]:
## let's look at the columns

langs.columns

Index(['glottocode', 'name', 'isocodes', 'level', 'macroarea', 'latitude',
       'longitude'],
      dtype='object')

In [5]:
## we access a column with langs['glottocode']

langs['glottocode']

0        3adt1234
1        aala1237
2        aant1238
3        aari1239
4        aari1240
           ...   
21324    zuwa1238
21325    zwal1238
21326    zyph1238
21327    zyud1238
21328    huaa1249
Name: glottocode, Length: 21329, dtype: object

In [6]:
## first rows!

langs[:3]

| | glottocode | name | isocodes | level | macroarea | latitude | longitude |
|---|---|---|---|---|---|---|---|
| 0 | 3adt1234 | 3Ad-Tekles | | dialect | Africa | | |
| 1 | aala1237 | Aalawa | | dialect | Papunesia | | |
| 2 | aant1238 | Aantantara | | dialect | Papunesia | | |


In [7]:
## indices!

langs.index

RangeIndex(start=0, stop=21329, step=1)

In [8]:
## iloc!

langs.iloc[0]

glottocode      3adt1234
name          3Ad-Tekles
isocodes             NaN
level            dialect
macroarea         Africa
latitude             NaN
longitude            NaN
Name: 0, dtype: object

In [9]:
## loc!

langs.loc[0,'glottocode']

'3adt1234'
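
The difference between `iloc` (position) and `loc` (label) is easy to miss here because the default index of `langs` makes positions and labels coincide. The following sketch uses a small made-up table, `toy`, with a non-default index so that the two diverge; the index values 3, 10 and 42 are arbitrary.

```python
import pandas as pd

# Toy table with a non-default index, so position and label differ.
toy = pd.DataFrame(
    {
        "glottocode": ["aari1239", "mapu1245", "zyph1238"],
        "name": ["Aari", "Mapudungun", "Zyphe"],
    },
    index=[3, 10, 42],
)

# iloc selects by integer position: the first row, whatever its label is.
first_row_by_position = toy.iloc[0]

# loc selects by label: the row labelled 10, column "name".
name_by_label = toy.loc[10, "name"]

# iloc also accepts a (row position, column position) pair.
code_by_position = toy.iloc[2, 0]
```

After filtering or `dropna()`, the index of a data frame is no longer 0, 1, 2, ..., so it matters which of the two accessors is used.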

In [10]:
## see https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html

In [11]:
## we keep a dataframe with two columns: iso vs glottocode
## we drop the rows that contain NaN

iso_glotto = langs[['isocodes','glottocode']]
iso_glotto = iso_glotto.dropna()
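
As a small illustration of what `dropna()` removes, the sketch below builds a three-row table (`toy`, a made-up name) from values shown earlier in this notebook. Called without arguments, `dropna()` discards every row with at least one missing value; the `subset` parameter restricts the check to the listed columns.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "isocodes": ["aiw", None, "aay"],
        "glottocode": ["aari1239", "aala1237", "aari1240"],
        "latitude": [5.95034, None, None],
    }
)

# Drops every row that has a missing value in any column.
all_complete = toy.dropna()

# Drops only the rows whose "isocodes" value is missing.
with_iso = toy.dropna(subset=["isocodes"])
```

This is why the notebook first selects the two columns of interest and only then calls `dropna()`: rows without coordinates are kept as long as they have an ISO code.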

In [12]:
iso_glotto

| | isocodes | glottocode |
|---|---|---|
| 3 | aiw | aari1239 |
| 4 | aay | aari1240 |
| 5 | aas | aasa1238 |
| 11 | kbt | abad1241 |
| 13 | abg | abag1245 |
| ... | ... | ... |
| 21311 | zuy | zuma1239 |
| 21313 | jmb | zumb1240 |
| 21318 | zun | zuni1245 |
| 21319 | zzj | zuoj1238 |


In [13]:
## How do we build a dictionary?

## dictionary iso:glottocode

## sophisticated way

iso_glotto1 = dict(zip(iso_glotto['isocodes'], iso_glotto['glottocode']))

## less sophisticated way (neither is better than the other). In general, shorter code is preferred.
## However, shorter code also becomes more cryptic. We need to find a balance :)

iso = list(iso_glotto['isocodes'])
glotto = list(iso_glotto['glottocode'])

iso_glotto_pairs = []

for i in range(len(iso)):
    iso_glotto_pairs += [[iso[i],glotto[i]]]

iso_glotto2 = {item[0]:item[1] for item in iso_glotto_pairs}

In [14]:
iso_glotto1['arn']

'mapu1245'

In [15]:
iso_glotto2['arn']

'mapu1245'
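
A third option, sketched here as an alternative rather than as the method used in this class, is to let pandas build the dictionary: set one column as the index and call `to_dict()` on the other. The table `toy` and the variable names are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "isocodes": ["aiw", "aay", "arn"],
        "glottocode": ["aari1239", "aari1240", "mapu1245"],
    }
)

# The index becomes the keys and the column values become the values.
iso_to_glotto = toy.set_index("isocodes")["glottocode"].to_dict()

# Same result as zipping the two columns.
same_result = dict(zip(toy["isocodes"], toy["glottocode"]))
```

In all three versions, if a key appeared twice, the last value would silently overwrite the earlier one.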

In [16]:
## How can we keep only the languages of the Americas?

## we filter by languages of the Americas

macroarea = langs[['isocodes','macroarea']]
macroarea = macroarea.dropna()

In [17]:
## the sophisticated way
## exercise: do it the unsophisticated way!!!

macroarea = dict(zip(macroarea['isocodes'], macroarea['macroarea']))

In [18]:
areas_list = list(macroarea.values())

In [19]:
## let's extract the values of macroarea

## sophisticated way

areas = set(macroarea.values())

In [20]:
areas

{'Africa',
 'Australia',
 'Eurasia',
 'North America',
 'Papunesia',
 'South America'}

In [21]:
## unsophisticated way

areas = []

for value in macroarea.values():
    if value not in areas:
        areas = areas + [value]

In [22]:
areas

['Africa',
 'Eurasia',
 'Papunesia',
 'South America',
 'North America',
 'Australia']
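
The two outputs above differ in one respect: a `set` has no guaranteed order, while the loop keeps the values in order of first appearance. A compact way to get the second behaviour, shown here as a sketch on a short made-up list, relies on the fact that dictionaries preserve insertion order (Python 3.7 and later).

```python
areas_list = ["Papunesia", "Africa", "Papunesia", "Eurasia", "Africa"]

# dict.fromkeys keeps one key per distinct value, in order of first appearance.
unique_in_order = list(dict.fromkeys(areas_list))

# sorted(set(...)) also removes duplicates, but returns alphabetical order.
unique_sorted = sorted(set(areas_list))
```

Which one to use depends on whether the original order carries information.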

In [23]:
## How can we filter iso_glotto2 down to languages whose macroarea is 'North America' or 'South America'?

## we keep the keys that are in the macroarea dictionary
iso_glotto2 = {iso:iso_glotto2[iso] for iso in iso_glotto2.keys() if iso in macroarea.keys()}

In [24]:
## we keep the keys with macroarea[key] in ['North America','South America']
iso_glotto2 = {iso:iso_glotto2[iso] for iso in iso_glotto2.keys() if macroarea[iso] in ['North America','South America']}

In [25]:
len(iso_glotto2)

1260
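
The same filter can also be applied to the data frame before building any dictionary. The sketch below is an alternative under the assumption that the table has `isocodes`, `glottocode` and `macroarea` columns, as `langs` does; the four-row table `toy` is made up from values shown in this notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "isocodes": ["aiw", "arn", "zun", None],
        "glottocode": ["aari1239", "mapu1245", "zuni1245", "aala1237"],
        "macroarea": ["Africa", "South America", "North America", "Papunesia"],
    }
)

# Boolean mask: True for rows whose macroarea is one of the two Americas.
in_americas = toy["macroarea"].isin(["North America", "South America"])

# Keep the matching rows, then drop the ones without an ISO code.
americas = toy[in_americas].dropna(subset=["isocodes"])

iso_glotto_americas = dict(zip(americas["isocodes"], americas["glottocode"]))
```

Filtering first keeps the logic in one place, at the cost of being a little less explicit than the two dictionary comprehensions.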

In [26]:
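## and what if we want to save iso_glotto2? we use pickle!!!

import pickle

## the with block closes the file once the dictionary has been written
with open('iso_glotto.p','wb') as file_output:
    pickle.dump(iso_glotto2, file_output)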

In [27]:
## How do we read pickle data?

with open('iso_glotto.p','rb') as file_input:
    iso_glotto = pickle.load(file_input)
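
A quick way to convince yourself that nothing is lost when saving and loading is to compare the object before and after the round trip. This sketch writes to a temporary directory so that it does not touch `iso_glotto.p`; the dictionary `iso_glotto_demo` and the file name are made up for illustration.

```python
import os
import pickle
import tempfile

iso_glotto_demo = {"arn": "mapu1245", "zun": "zuni1245"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iso_glotto_demo.p")

    # Write the dictionary in binary mode.
    with open(path, "wb") as handle:
        pickle.dump(iso_glotto_demo, handle)

    # Read it back, also in binary mode.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

round_trip_ok = restored == iso_glotto_demo
```

One caveat worth keeping in mind: loading a pickle can execute arbitrary code, so only open pickle files that you created yourself or that come from a source you trust.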

##### more information from Glottolog :)

In [28]:
## Goal: look at the languoids!
## https://glottolog.org/meta/downloads

languoids = pd.read_csv('languoid.csv',sep=',')

In [29]:
languoids

| | id | family_id | parent_id | name | bookkeeping | level | latitude | longitude | iso639P3code | description | markup_description | child_family_count | child_language_count | child_dialect_count | country_ids |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3adt1234 | afro1255 | nort3292 | 3Ad-Tekles | False | dialect | | | | | | 0 | 0 | 0 | |
| 1 | aala1237 | aust1307 | ramo1244 | Aalawa | False | dialect | | | | | | 0 | 0 | 0 | |
| 2 | aant1238 | nucl1709 | nort2920 | Aantantara | False | dialect | | | | | | 0 | 0 | 0 | |
| 3 | aari1238 | sout2845 | ahkk1235 | Aari-Gayil | False | family | | | aiz | | | 0 | 2 | 0 | |
| 4 | aari1239 | sout2845 | aari1238 | Aari | False | language | 5.95034 | 36.5721 | aiw | | | 0 | 0 | 0 | ET |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25434 | zuti1239 | tupi1275 | guaj1255 | Guajajára of Zutiua | False | dialect | | | | | | 0 | 0 | 0 | |
| 25435 | zuwa1238 | koia1260 | omie1241 | Zuwadza | False | dialect | | | | | | 0 | 0 | 0 | |
| 25436 | zwal1238 | atla1278 | shal1242 | Zwall | False | dialect | | | | | | 0 | 0 | 0 | |
| 25437 | zyph1238 | sino1245 | nucl1757 | Zyphe | False | language | 22.52400 | 93.2640 | zyp | | | 0 | 0 | 2 | IN MM |


In [30]:
## let's filter languoids by the keys of iso_glotto :)

languoids = languoids[languoids['iso639P3code'].isin(iso_glotto.keys())]

In [31]:
languoids

| | id | family_id | parent_id | name | bookkeeping | level | latitude | longitude | iso639P3code | description | markup_description | child_family_count | child_language_count | child_dialect_count | country_ids |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 49 | abip1241 | guai1249 | guai1250 | Abipon | False | language | -29.000000 | -61.000000 | axb | | | 0 | 0 | 0 | AR |
| 50 | abis1238 | | | Aewa | False | language | -1.284096 | -75.084405 | ash | | | 0 | 0 | 0 | PE |
| 84 | acat1239 | otom1299 | west2948 | Acatepec Me'phaa | False | language | 17.103400 | -99.060200 | tpx | | | 0 | 0 | 3 | MX |
| 88 | acha1250 | araw1281 | piap1247 | Achagua | False | language | 4.386490 | -72.200500 | aca | | | 0 | 0 | 0 | CO |
| 92 | ache1246 | tupi1275 | tupi1277 | Aché | False | language | -25.586500 | -56.469700 | guq | | | 0 | 0 | 0 | PY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25287 | zapa1253 | zapa1251 | zapa1252 | Záparo | False | language | -1.998710 | -76.364000 | zro | | | 0 | 0 | 0 | EC PE |
| 25331 | zenz1235 | otom1299 | core1263 | Zenzontepec Chatino | False | language | 16.527990 | -97.455460 | czn | | | 0 | 0 | 0 | MX |
| 25389 | zoee1240 | tupi1275 | zoee1241 | Zo'é | False | language | -1.772080 | -55.507460 | pto | | | 0 | 0 | 0 | BR |
| 25397 | zoog1238 | otom1299 | cajo1239 | Zoogocho Zapotec | False | language | 17.201600 | -96.343300 | zpq | | | 0 | 0 | 3 | MX US |


In [32]:
languoids = languoids[['id','family_id']]

In [33]:
languoids

| | id | family_id |
|---|---|---|
| 49 | abip1241 | guai1249 |
| 50 | abis1238 | |
| 84 | acat1239 | otom1299 |
| 88 | acha1250 | araw1281 |
| 92 | ache1246 | tupi1275 |
| ... | ... | ... |
| 25287 | zapa1253 | zapa1251 |
| 25331 | zenz1235 | otom1299 |
| 25389 | zoee1240 | tupi1275 |
| 25397 | zoog1238 | otom1299 |


In [34]:
languoids = languoids.dropna()

In [35]:
languoids

| | id | family_id |
|---|---|---|
| 49 | abip1241 | guai1249 |
| 84 | acat1239 | otom1299 |
| 88 | acha1250 | araw1281 |
| 92 | ache1246 | tupi1275 |
| 96 | achi1256 | maya1287 |
| ... | ... | ... |
| 25277 | zani1235 | otom1299 |
| 25287 | zapa1253 | zapa1251 |
| 25331 | zenz1235 | otom1299 |
| 25389 | zoee1240 | tupi1275 |


In [36]:
## now let's solve a question we have already solved. Let's define a dictionary glottocode:family

diccionario_glottocode_fam = {}
## first, we iterate over the indices

for i in languoids.index:
    diccionario_glottocode_fam[languoids.loc[i,'id']]=languoids.loc[i,'family_id']

In [37]:
len(diccionario_glottocode_fam)

1184
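
Note that the loop iterates over `languoids.index` and uses `loc`, which matters because the index has gaps after filtering (49, 84, 88, ...). The sketch below reproduces the idea on a three-row table with such an index and checks that the loop and the `dict(zip(...))` shortcut agree; `toy`, `by_loop` and `by_zip` are made-up names, and the rows are copied from the output above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "id": ["abip1241", "acha1250", "ache1246"],
        "family_id": ["guai1249", "araw1281", "tupi1275"],
    },
    index=[49, 88, 92],
)

# Loop version: iterate over the index labels and look up each cell with loc.
by_loop = {}
for i in toy.index:
    by_loop[toy.loc[i, "id"]] = toy.loc[i, "family_id"]

# Shortcut: pair the two columns directly.
by_zip = dict(zip(toy["id"], toy["family_id"]))
```

Iterating over `range(len(toy))` with `loc` would fail here, since the labels 0, 1 and 2 do not exist in this index.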

## Part B. **text!**

In [62]:
len(iso_glotto)

1260

In [68]:
## characters 10 to 12 of the path hold the language code
'udhr/udhr_kaz.txt'[10:13]

'kaz'
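
Slicing with fixed positions works only while every path starts with exactly `udhr/udhr_`. A sketch of a less position-dependent alternative uses `pathlib`; the helper `language_code` is hypothetical and assumes file names of the form `udhr_<code>.txt`.

```python
from pathlib import Path


def language_code(path):
    """Return the part of the file name (without extension) after the last underscore."""
    return Path(path).stem.split("_")[-1]


code = language_code("udhr/udhr_kaz.txt")
other_code = language_code("some/other/folder/udhr_arn.txt")
```

With this helper the folder name can change without breaking the extraction, as long as the file naming scheme stays the same.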

In [69]:
## suppose we want to read the files in the udhr folder
## What if we don't know the names? And if we only know that they are .txt files? :)

import glob

lista_files = glob.glob('udhr/*.txt')
corpus = {}

for file in lista_files:
    ## keep only the languages in our list (the keys of iso_glotto)
    if file[10:13] in iso_glotto.keys():
        with open(file, 'r', encoding='utf-8') as file_input:
            corpus[file[10:13]]=file_input.read()

In [84]:
## now, the same thing but with pandas!

import glob

lista_files = glob.glob('udhr/*.txt')

corpus = {}

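for file in lista_files:

    if file[10:13] in iso_glotto.keys():
        ## each line of the file becomes one row; note that recent pandas versions
        ## may reject sep='\n' with a ValueError
        dataframe = pd.read_csv(file, sep='\n', header=None)
        dataframe.columns = ['sentences']
        corpus[file[10:13]]=list(dataframe['sentences'])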
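
If the goal is simply a list of lines per file, plain Python can produce it without pandas. The following is a minimal sketch, not the approach used in the cell above: it writes a made-up three-line text to a temporary file and reads it back as a list, skipping blank lines (which `pd.read_csv` also skips by default).

```python
import os
import tempfile

text = "first line\nsecond line\n\nthird line\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "udhr_demo.txt")  # hypothetical file name

    with open(path, "w", encoding="utf-8") as file_output:
        file_output.write(text)

    with open(path, "r", encoding="utf-8") as file_input:
        # splitlines() removes the line terminators; blank lines are dropped.
        sentences = [line for line in file_input.read().splitlines() if line.strip()]
```

The result is a list of strings, the same shape as `list(dataframe['sentences'])`.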

In [86]:
## third version!!! suppose we know the names of the files

corpus = {}

for lengua in iso_glotto.keys():
    file = open('udhr/'+'udhr_'+lengua+'.txt','r')
    file = file.read()
    corpus[lengua]=file

FileNotFoundError: [Errno 2] No such file or directory: 'udhr/udhr_axb.txt'

In [87]:
## What do we do with this error? Somehow, we are trying to open files that do not exist :)

corpus = {}

for lengua in iso_glotto.keys():

    ## note the try-except sequence: we use the error to skip missing files

    try:
        with open('udhr/'+'udhr_'+lengua+'.txt','r') as file_input:
            corpus[lengua]=file_input.read()
    except FileNotFoundError:
        pass
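
A small variation on the try-except pattern is to record which languages had no file instead of silently passing, which makes it easier to check coverage afterwards. The sketch below runs in a temporary directory with one made-up file; `codes`, `corpus_demo` and `missing` are illustrative names.

```python
import os
import tempfile

codes = ["arn", "axb"]  # only "arn" gets a file below
corpus_demo = {}
missing = []

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a single text file so that one code succeeds and the other fails.
    with open(os.path.join(tmp_dir, "udhr_arn.txt"), "w", encoding="utf-8") as f:
        f.write("Kom Mapu Fijke Az Tañi Az Mogeleam")

    for code in codes:
        path = os.path.join(tmp_dir, "udhr_" + code + ".txt")
        try:
            with open(path, "r", encoding="utf-8") as f:
                corpus_demo[code] = f.read()
        except FileNotFoundError:
            # Keep track of the codes without a file instead of ignoring them.
            missing.append(code)
```

Catching only `FileNotFoundError`, as the notebook does, is a good habit: any other problem (for example, a decoding error) still surfaces instead of being hidden.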

In [88]:
corpus['arn']

'Universal Declaration of Human Rights - Mapudungun\n© 1996 – 2009 The Office of the High Commissioner for Human Rights\nThis plain text version prepared by the “UDHR in Unicode”\nproject, https://www.unicode.org/udhr.\n---\n\nKom Mapu Fijke Az Tañi Az Mogeleam\n    Tuwvlzugun\n    ("Preámbulo" pi ta wigka)\n    Kimnieel fij mapu mew tañi kimgen kvme felen kisugvnew felen xvr kvme mvlen. Tvfaci zugu ñi mvleken mvleyem yamvwvn ka xvr kvme nor felen kom pu reñmawke ce mew.\n    Gewenonmu yamuwvn, zuamgewenonmu kvme felen, goymagenmu nor felen mvley re jazkvnkawvn: Fey mew mvley xvrvmzugu kom pu ce tañi kvme mogeleam kisuke ñi feyentun mew, kisu ñi rakizuam mew ka ñi wimtun mew ñi mvleal egvn.\n    Tañi mvlenoam kaiñetuwvn zugu, awkan zugu tvfeyci wezake vbmen egvn mvley tañi xvrvmuwael tvfeyci wvnenkvleci pu ce zugu mew tañi yamgeam xvr kvme nor felen.\n    Ka kimnieel ñi rvf kvme zugugen, wenvykawvn kom pu Xokiñke Ce egvn, nuwkvlelu Kiñe Mapu mew ("Naciones Unidas" pi ta wigka): Tvfa eg