#Representing vocabulary items
The goal of this notebook is to show mechanisms for representing different elements of natural language, in this case the lexical units that make up the chosen vocabulary (usually words), and to derive natural language processing applications from them. We start with an initial representation based on the *co-occurrence matrix* and then review the fundamental ideas behind **word embeddings**.


In [1]:
import sklearn
import os, re, string, collections, random
import spacy
import nltk
import numpy, matplotlib
import pandas as pd

First, we prepare the text in a basic way.

In [2]:
with open("odisea.txt", encoding="UTF-8") as abre:
    base = abre.read()
## start and end of the book body
inicio = "EBOOK"
init = base.find(inicio)
finale = base.find("END OF")
base2 = base[init + len(inicio):finale]
## cleanup: replace line breaks, tabs, non-word characters, digits and underscores with a space
base2a = re.sub(r"\n|\t|\W|\d|_", " ", base2)
base2b = re.sub(r" +", " ", base2a)
## lowercase
base3 = base2b.lower()
base3[0:1000]

' la odisea produced by ramon pajares box carlos colon and the online distributed proofreading team at http www pgdp net nota de transcripción en el texto las cursivas se muestran entre subrayados y las versalitas se han convertido a mayúsculas los errores de imprenta han sido corregidos sin avisar se ha respetado la ortografía del original que difiere ligeramente de la actual normalizándola a la grafía de mayor frecuencia se han añadido tildes a las mayúsculas que las necesitan se han hecho los siguientes cambios canto v p vendabal vendaval canto xii p vendabal vendaval índice de nombres propios p voz neoptólemo hijo de ulises hijo de aquiles algunas ilustraciones se han desplazado ligeramente para evitar que interrumpieran un párrafo la odisea ilustración homero la odisea versión directa y literal del griego por luis segalá y estalella doctor en filosofía y letras y en derecho catedrático de lengua y literatura griegas de la universidad de barcelona académico electo de la real de bue

##Co-occurrence matrix
When preparing the text we will not remove stopwords: they are useful for computing metrics such as PMI. As we will see, it is useful to remove punctuation, since it should not count as a token when representing vocabulary items and, because we do not perform syntactic analysis, we will not need it.

In [3]:

nltk.download("punkt")
nltk.download("stopwords")
fichas2=nltk.tokenize.word_tokenize(base3,language="spanish")
libroquijote=nltk.Text(fichas2)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
fichas2[1000:1200]

['más',
 'adiciones',
 'que',
 'las',
 'necesarias',
 'para',
 'su',
 'cabal',
 'inteligencia',
 'vertiendo',
 'hasta',
 'las',
 'circunlocuciones',
 'cuando',
 'son',
 'inteligibles',
 'y',
 'constituyen',
 'un',
 'modo',
 'respetuoso',
 'de',
 'nombrar',
 'á',
 'determinados',
 'personajes',
 'ἱερὸν',
 'μένος',
 'ἀλκινόοιο',
 'la',
 'sacra',
 'potestad',
 'de',
 'alcínoo',
 'para',
 'designar',
 'al',
 'rey',
 'de',
 'los',
 'feacios',
 'etc',
 'en',
 'lo',
 'que',
 'se',
 'refiere',
 'á',
 'los',
 'epítetos',
 'hubiéramos',
 'querido',
 'seguir',
 'el',
 'consejo',
 'que',
 'nos',
 'dió',
 'la',
 'real',
 'academia',
 'española',
 'en',
 'su',
 'dictamen',
 'acerca',
 'de',
 'la',
 'versión',
 'de',
 'la',
 'ilíada',
 'de',
 'que',
 'se',
 'traduzcan',
 'los',
 'compuestos',
 'por',
 'otros',
 'análogos',
 'que',
 'se',
 'podrían',
 'formar',
 'en',
 'castellano',
 'como',
 'por',
 'ejemplo',
 'bracinívea',
 'ojilúcida',
 'y',
 'argentípeda',
 'que',
 'hemos',
 'usado',
 'en',
 'nue

##Co-occurrence matrix
We will obtain the (consecutive) bigrams, compute the PMI between the words of every bigram, and build a matrix holding the PMI for each pair of words. Each word is then represented by the vector of its PMIs with the words it co-occurs with.
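
To make the score concrete, the following is a minimal sketch of a count-based PMI estimate, log2(P(w1, w2) / (P(w1) P(w2))), on a made-up toy corpus. The corpus and the helper `pmi` are illustrative assumptions and not part of the notebook; the count-based form is, as far as I can tell, the same one NLTK's `BigramAssocMeasures.pmi` uses for consecutive bigrams.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, only for illustration).
tokens = "la diosa minerva habló y la diosa juno calló".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))  # consecutive pairs
n_tokens = len(tokens)


def pmi(w1, w2):
    """PMI in bits, estimated from raw counts of consecutive bigrams."""
    joint = bigram_counts[(w1, w2)]
    if joint == 0:
        return float("-inf")  # the pair was never observed
    # log2( count(w1, w2) * N ) - log2( count(w1) * count(w2) )
    return math.log2(joint * n_tokens) - math.log2(
        unigram_counts[w1] * unigram_counts[w2]
    )


pmi_la_diosa = pmi("la", "diosa")
pmi_minerva_hablo = pmi("minerva", "habló")
print(pmi_la_diosa, pmi_minerva_hablo)
```

In this toy corpus the pair of rarer words gets the higher score, which illustrates why PMI tends to favour infrequent words that appear together.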

In [19]:
from nltk.collocations import *
from nltk import bigrams

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

finder = BigramCollocationFinder.from_words(libroquijote)

# score every bigram with its PMI (score_ngrams returns them ordered from highest to lowest score)
scored=finder.score_ngrams(bigram_measures.pmi)
scored[0:5]
# list of bigrams sorted alphabetically (not by PMI)
listabig=sorted(bigram for bigram, score in scored)
listabig[0:100]
len(listabig)
## unpack them to build the representation
inicio=[]
fin=[]
puntaje=[]
for (pal1, pal2), score in scored:
    inicio.append(pal1)
    fin.append(pal2)
    puntaje.append(score)
          
init_pmi = pd.DataFrame(
    {'pal1': inicio,
     'pal2': fin,
     'puntaje': puntaje
    })

init_pmi.head(4)
coocur=init_pmi.pivot(index="pal1",columns="pal2")
coocur=coocur['puntaje'].reset_index()
coocur.head(5)

coocurt=coocur.fillna(0)
coocurt.head(5)
coocurt.shape


(14778, 14779)

In [20]:
display(coocurt)

pal2,pal1,a,abajo,abandona,abandonado,abandonados,abandonar,abandonaron,abandonas,abandonó,...,ὠκεανός,ὠκύαλος,ὠρίων,ὡς,ὥστε,ὦτος,ὦψ,ῥαδάμανθυς,ῥεῖθρον,ῥηξήνωρ
0,a,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,abajo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,abandona,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,abandonado,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,abandonados,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14773,ὦτος,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14774,ὦψ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14775,ῥαδάμανθυς,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14776,ῥεῖθρον,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0




Direct bigrams are too limited to describe a word. Let us refine the representation by using a window and by removing stray punctuation marks or isolated symbols, which caused trouble in the previous approach.
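
As a rough sketch of what a window does, the snippet below pairs each word with the words that follow it inside a window, instead of only with its immediate neighbour. The function `window_pairs` and the toy sentence are assumptions made for illustration; NLTK's own finder additionally rescales counts for windows larger than 2, which is not reproduced here.

```python
from collections import Counter


def window_pairs(tokens, window_size=5):
    """Yield (w1, w2) where w2 is one of the window_size - 1 tokens after w1."""
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + window_size]:
            yield (w1, w2)


# Toy sentence (hypothetical input, already cleaned and lowercased).
tokens = "háblame musa de aquel varón de multiforme ingenio".split()

consecutive = Counter(window_pairs(tokens, window_size=2))  # plain bigrams
pair_counts = Counter(window_pairs(tokens, window_size=3))  # wider context

print(len(consecutive), len(pair_counts))
```

With `window_size=2` only adjacent words are paired; widening the window adds pairs such as `("háblame", "de")`, so each word is described by more of its context.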

In [4]:
from nltk.collocations import *
from nltk import bigrams

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()


finderw = BigramCollocationFinder.from_words(libroquijote,window_size=5)
finderw.apply_word_filter(lambda w: len(w) < 2)

# score every windowed bigram with its PMI (returned ordered from highest to lowest score)
scoredw=finderw.score_ngrams(bigram_measures.pmi)
# list of bigrams sorted alphabetically (not by PMI)
listabigw=sorted(bigram for bigram, score in scoredw)
## unpack them to build the representation
inicio=[]
fin=[]
puntaje=[]
for (pal1, pal2), score in scoredw:
    inicio.append(pal1)
    fin.append(pal2)
    puntaje.append(score)
          
init_pmiw = pd.DataFrame(
    {'pal1': inicio,
     'pal2': fin,
     'puntaje': puntaje
    })

init_pmiw.head(4)
coocurw=init_pmiw.pivot(index="pal1",columns="pal2")
coocurw=coocurw['puntaje'].reset_index()

coocurtw=coocurw.fillna(0)
coocurtw.shape

(14748, 14749)

In [23]:
display(coocurtw)


pal2,pal1,abajo,abandona,abandonado,abandonados,abandonar,abandonaron,abandonas,abandonó,abastecida,...,ὠκεανός,ὠκύαλος,ὠρίων,ὡς,ὥστε,ὦτος,ὦψ,ῥαδάμανθυς,ῥεῖθρον,ῥηξήνωρ
0,abajo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,abandona,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,abandonado,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,abandonados,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,abandonar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14743,ὦτος,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14744,ὦψ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14745,ῥαδάμανθυς,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14746,ῥεῖθρον,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0




We now have word representations! These representations are semantic in nature, since they define the vocabulary items (words) through the words of the contexts in which they occur. And because the representation is numeric, we can already try to measure how semantically similar two words are:

In [5]:
a=coocurtw[coocurtw["pal1"]=="juno"].iloc[:,1:14749]
b=coocurtw["minerva"].to_frame().transpose()
display(a)

pal2,abajo,abandona,abandonado,abandonados,abandonar,abandonaron,abandonas,abandonó,abastecida,abastecidas,...,ὠκεανός,ὠκύαλος,ὠρίων,ὡς,ὥστε,ὦτος,ὦψ,ῥαδάμανθυς,ῥεῖθρον,ῥηξήνωρ
7587,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


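Here we compute the cosine similarity between the two vectors (a similarity, not a distance: values near 1 mean very similar). Note that `a` is the row for "juno" while `b` is the column for "minerva".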

In [6]:
from sklearn.metrics.pairwise import cosine_similarity
buscando=cosine_similarity(a,b)
buscando


array([[0.02340762]])
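
The cosine similarity used above can be written out by hand: it is the dot product of the two vectors divided by the product of their norms. The sketch below uses NumPy and two made-up four-dimensional vectors; the helper `cosine` and the values are assumptions for illustration, not the notebook's data.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity: (u . v) / (||u|| * ||v||); 0.0 if either vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        return 0.0
    return float(u @ v / denom)


# Made-up PMI vectors over four context words.
juno = [2.0, 0.0, 1.0, 0.0]
minerva = [1.0, 0.0, 2.0, 3.0]

sim = cosine(juno, minerva)
print(sim)
```

Because the PMI vectors in the notebook are very sparse (most entries are 0), two words only get a high similarity when they share many non-zero contexts, which helps explain the low value obtained above.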

Now the goal is to find the word most closely related to a given word. In this case we use the goddess Minerva.

In [18]:
coocurtw.columns.get_loc("minerva")

8646

Here we obtain the word (among the first 8000 columns) that is most *semantically similar* to Minerva or, put differently, that co-occurs with it more than expected.

In [19]:
import numpy as np
a1=coocurtw.iloc[:,1:8000].transpose()
b1=coocurtw["minerva"].to_frame().transpose()
todominerva=cosine_similarity(a1,b1)
coocurtw.columns[np.argmax(todominerva)+1]
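
The same search can be sketched on a small matrix: compute the cosine similarity between the query word's vector and every other word's vector, skip the query itself, and sort. The vocabulary, the values in `M` and the helper `most_similar` below are made up for illustration.

```python
import numpy as np

vocab = ["minerva", "palas", "nave", "mar"]
# Rows: context words; columns: the words in `vocab` (made-up PMI values).
M = np.array([
    [3.0, 2.5, 0.0, 0.0],
    [1.0, 1.2, 0.0, 0.1],
    [0.0, 0.0, 2.0, 1.8],
    [0.0, 0.2, 1.5, 2.2],
])


def most_similar(word, topn=2):
    """Rank the other words by cosine similarity to `word`."""
    q = vocab.index(word)
    vectors = M.T  # one row per word
    norms = np.linalg.norm(vectors, axis=1)
    sims = vectors @ vectors[q] / (norms * norms[q])
    sims[q] = -np.inf  # never return the query word itself
    order = np.argsort(sims)[::-1][:topn]
    return [(vocab[i], float(sims[i])) for i in order]


neighbours = most_similar("minerva")
print(neighbours)
```

Returning a ranked list rather than a single `argmax` makes it easier to judge whether the top match is a clear winner or one of several near ties.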

##Word embeddings
Extending the idea behind the co-occurrence matrix, we can train a neural network to predict a word from the words that surround it, in this case with a single hidden layer. The learned weights of that hidden (projection) layer serve as the vector representation of each vocabulary item (word). We will use gensim and, within it, the Word2vec approach.
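
To illustrate the prediction task, here is a minimal sketch of how (context, target) training examples could be built for a model that predicts a word from its neighbours. The function `cbow_examples` and the toy sentence are assumptions for illustration; gensim's real training loop differs in details and is not reproduced here.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target_word) pairs for each position in `tokens`."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window) : i]
        right = tokens[i + 1 : i + 1 + window]
        context = left + right
        if context:  # skip one-word sentences, which have no context
            examples.append((context, target))
    return examples


# Toy sentence (hypothetical, already tokenized and lowercased).
sentence = ["minerva", "la", "diosa", "de", "ojos", "de", "lechuza"]
examples = cbow_examples(sentence, window=2)

for context, target in examples[:3]:
    print(context, "->", target)
```

Each example asks the network to recover the target word from its context, so words that appear in similar contexts end up with similar vectors.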

In [None]:
import gensim
from gensim.models import Word2Vec


Word2vec expects a list of lists of tokens (one list per sentence). Let us prepare that.

In [33]:
oraciones=nltk.tokenize.sent_tokenize(base2,language="spanish")
oraciones[0:5]


[' LA ODISEA ***\n\n\n\n\nProduced by Ramon Pajares Box, Carlos Colon, and the Online\nDistributed Proofreading Team at http://www.pgdp.net\n\n\n\n\n\n\nNOTA DE TRANSCRIPCIÓN\n\n  * En el texto, las cursivas se muestran entre _subrayados_ y las\n    versalitas se han convertido a MAYÚSCULAS.',
 '* Los errores de imprenta han sido corregidos sin avisar.',
 '* Se ha respetado la ortografía del original —que difiere\n    ligeramente de la actual—, normalizándola a la grafía de mayor\n    frecuencia.',
 '* Se han añadido tildes a las mayúsculas que las necesitan.',
 '* Se han hecho los siguientes cambios:\n    · Canto V, 388, p. 78: «vendabal» → «vendaval».']

In [34]:
lista_tokenizada=[]
for oracion in oraciones:
  # same cleanup as before, applied sentence by sentence
  temp = re.sub(r"\n|\t|\W|\d|_"," ",oracion)
  tempb=re.sub(r" +"," ",temp)
  tempc=tempb.lower()
  tempd=nltk.tokenize.word_tokenize(tempc,language="spanish")
  lista_tokenizada.append(tempd)
lista_tokenizada[0:2]

[['la',
  'odisea',
  'produced',
  'by',
  'ramon',
  'pajares',
  'box',
  'carlos',
  'colon',
  'and',
  'the',
  'online',
  'distributed',
  'proofreading',
  'team',
  'at',
  'http',
  'www',
  'pgdp',
  'net',
  'nota',
  'de',
  'transcripción',
  'en',
  'el',
  'texto',
  'las',
  'cursivas',
  'se',
  'muestran',
  'entre',
  'subrayados',
  'y',
  'las',
  'versalitas',
  'se',
  'han',
  'convertido',
  'a',
  'mayúsculas'],
 ['los',
  'errores',
  'de',
  'imprenta',
  'han',
  'sido',
  'corregidos',
  'sin',
  'avisar']]

Here we create the model. By default it uses vectors of size 100.

In [35]:
model_odisea = Word2Vec(lista_tokenizada)
## customizable parameters: vector_size, window, min_count

Let us look at the vector of a word.

In [36]:
model_odisea.wv["odisea"]



array([-0.04894758, -0.01013862, -0.09144062,  0.07699146,  0.1599834 ,
        0.03706657, -0.05008316, -0.11987091, -0.09076518,  0.00778952,
       -0.00313868, -0.00950956,  0.04276849, -0.19023491, -0.12797093,
        0.15040684, -0.05981608, -0.01099305,  0.09634944,  0.0372693 ,
        0.0693007 ,  0.25790885,  0.13930148,  0.19706748,  0.45395288,
       -0.03786364,  0.06480478,  0.27923974,  0.08951759, -0.04963721,
       -0.11221723,  0.05332185, -0.16508862,  0.0126765 ,  0.40537673,
       -0.0323102 , -0.02519048,  0.14859585,  0.13059336,  0.01350446,
       -0.13903284, -0.07092285,  0.09909344,  0.00182275, -0.06267077,
       -0.10650466,  0.03921119,  0.04961519, -0.10371433, -0.00225482,
       -0.18614945,  0.00394902,  0.12954502,  0.07333945,  0.23417504,
        0.15763353,  0.07835452,  0.03140767, -0.27739453,  0.00537117,
       -0.14598468, -0.03014289,  0.13083015,  0.01032402,  0.06836412,
        0.00078021, -0.1434266 , -0.14279507,  0.25910005,  0.07

Let us compute the words most similar to a given word.

In [46]:
similitud=model_odisea.wv.similar_by_word("minerva", topn=10)
similitud
## suggestions to try: penélope, ulises, comida



[('deidad', 0.999035656452179),
 ('sala', 0.9988104104995728),
 ('después', 0.9987512826919556),
 ('ojos', 0.9987228512763977),
 ('xxiv', 0.9986372590065002),
 ('cuando', 0.9985435009002686),
 ('junto', 0.9985355138778687),
 ('matanza', 0.998443067073822),
 ('palas', 0.9983879327774048),
 ('cabeza', 0.9983853101730347)]

Given the size of the training corpus, the results are mixed. Even so, the model can be saved and trained further in the future with more data.

In [None]:
model_odisea.save("odisea.model")