<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/4/47/Acronimo_y_nombre_uc3m.png"/>

<img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" width=15%/>
</center>    

# Word Embeddings

In this notebook, we will learn how to load a word embedding model with the gensim library and explore the functionality that such a model offers.

We need to upgrade gensim, but first we must install a specific version of numpy (1.22.4):



In [1]:
!pip install numpy==1.22.4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting numpy==1.22.4
  Downloading numpy-1.22.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.9 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.6
    Uninstalling numpy-1.21.6:
      Successfully uninstalled numpy-1.21.6
Successfully installed numpy-1.22.4


Once the runtime has been restarted, we upgrade gensim:

In [1]:
!pip install gensim --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim
  Downloading gensim-4.3.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Collecting FuzzyTM>=0.4.0
  Downloading FuzzyTM-2.0.5-py3-none-any.whl (29 kB)
Collecting pyfume
  Downloading pyFUME-0.2.25-py3-none-any.whl (67 kB)
Collecting simpful
  Downloading simpful-2.9.0-py3-none-any.whl (30 kB)
Collecting fst-pso
  Downloading fst-pso-1.8.1.tar.gz (18 kB)
  Preparing metadata (setup.py) ... done
Collecting miniful
  Downloading miniful-0.0.6.tar.gz (2.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: fst-pso, miniful
  Building wheel for fst-pso (setup.p

We check the installed versions of gensim and numpy:

In [2]:
import gensim, numpy

print('gensim version:', gensim.__version__)  # >=4.3.0
print('numpy version:', numpy.__version__)    # ==1.22.4

gensim version: 4.3.0
numpy version: 1.22.4


We can now load the model. The gensim API lets us load pre-trained models directly. For example, the next cell loads the 'glove-wiki-gigaword-100' model, which was trained on text from Wikipedia (2014) and Gigaword 5. Its vocabulary contains 400,000 tokens, and the download is 128 MB.

You can browse the other available models at the following [link](https://github.com/RaRe-Technologies/gensim-data).

The operation may take a few minutes:



In [3]:
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-100")  # load a pre-trained model




It is also possible to load a model from local disk. For example, we will save the previous model and load it into a new variable, new_model:

In [4]:
from gensim.models import KeyedVectors
model.save('model.bin')
new_model = KeyedVectors.load('model.bin')

Let's look up the vector associated with a specific word, 'mother'. We can see that it is a 100-dimensional vector.

In [5]:
vector = model['mother']  # numpy vector of a word
print(vector.shape)
print(vector)

(100,)
[ 0.60587   0.027989  0.018495 -0.018674 -0.39562   1.0309   -0.35793
  0.20527   0.3293    0.035267 -0.38475   0.31452   0.32538   0.70024
  0.13935  -0.58923   0.36985  -0.080566 -0.59721   1.0215   -0.55154
  0.042073  0.34687   0.86511   0.63521   0.52616  -0.92199  -1.4634
  0.34517   0.58921   0.12295   0.7323    1.0468    0.065458 -0.27033
 -0.095179  0.20613   0.22589   0.90409  -0.11252  -0.58059   0.036599
  0.32003  -0.53638   0.19297   0.035694 -0.56487   0.1527    0.70196
 -0.24191   0.10476  -0.23424   1.212     1.1612   -0.033677 -1.9996
 -0.79448  -0.087088  0.51475   0.44601   0.638     0.89893   0.17408
 -0.32006   0.41652   0.23289   0.50642   0.26938  -0.1453    0.1207
 -0.26246   0.16991   0.16702  -0.042041  0.64841   0.9827   -0.092602
 -0.56797  -0.63854  -0.38415  -0.13816   0.43137   0.44748   0.24486
 -1.5669   -0.80245  -0.15123  -0.18795  -0.4888   -0.67834   0.27133
 -0.36768   1.1268    0.44722  -0.91335  -0.055973 -0.38328  -0.62756
 -0.24055  -0.
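
Conceptually, the model behaves like a lookup table: a dictionary maps each word to a row index, and that row of a matrix holds the embedding. The sketch below illustrates the idea with a made-up three-word vocabulary and 4-dimensional vectors. The values are invented, and the names `key_to_index`, `index_to_key` and `vectors` are chosen to mirror the attributes that gensim 4 exposes on `KeyedVectors`; this is not gensim's implementation.

```python
import numpy as np

# Toy vocabulary and embedding matrix (values are invented for illustration).
index_to_key = ["mother", "father", "truck"]
key_to_index = {word: i for i, word in enumerate(index_to_key)}
vectors = np.array([
    [0.6, 0.1, 0.3, -0.2],
    [0.5, 0.2, 0.4, -0.1],
    [-0.3, 0.9, -0.5, 0.7],
], dtype=np.float32)

def get_vector(word):
    """Return the matrix row that stores the embedding of `word`."""
    return vectors[key_to_index[word]]

vector = get_vector("mother")
print(vector.shape)  # (4,)
```

Indexing the real model with `model['mother']` plays the role of `get_vector` here, only with 400,000 rows and 100 columns.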

The word embedding model lets us compute the similarity between two words. If the result is close to 1, the two words have a similar meaning.

In [30]:
similarity = model.similarity('mother', 'father')
print(similarity)

0.86566603


In [31]:
similarity = model.similarity('mother', 'mothers')
print(similarity)

0.604178


In [6]:
similarity = model.similarity('mother', 'teeth')
print(similarity)

0.2704376


In [33]:
word1 = 'man'  # plain string; a trailing comma here would turn it into a tuple
for word2 in ['woman', 'guy', 'boy']:
    similarity = model.similarity(word1, word2)
    print("similarity of {} and {} = {}".format(word1, word2, similarity))

similarity of man and woman = 0.8323495
similarity of man and guy = 0.6679584
similarity of man and boy = 0.79148716
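
The scores above are cosine similarities between word vectors. As a minimal sketch of that computation, the example below uses hand-picked 2-dimensional vectors (invented, not taken from the GloVe model) and a hypothetical helper `cosine_similarity`.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])
d = np.array([-1.0, 0.0])

sim_ab = cosine_similarity(a, b)  # vectors 45 degrees apart
sim_ac = cosine_similarity(a, c)  # orthogonal vectors
sim_ad = cosine_similarity(a, d)  # opposite directions
print(round(sim_ab, 4), round(sim_ac, 4), round(sim_ad, 4))  # 0.7071 0.0 -1.0
```

The score depends only on the angle between the vectors, not on their length, and it ranges from -1 to 1.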


The *most_similar* method returns a list of words similar to a given word, sorted from most to least similar.

In [34]:
model.most_similar('truck')
#model.most_similar('aspirin')


[('car', 0.8597878217697144),
 ('trucks', 0.8078932166099548),
 ('vehicle', 0.7879196405410767),
 ('bus', 0.7633007764816284),
 ('pickup', 0.7436763644218445),
 ('tractor', 0.7433986067771912),
 ('cars', 0.741030752658844),
 ('driver', 0.7295383214950562),
 ('parked', 0.7291535139083862),
 ('lorry', 0.7239130139350891)]

Note that 'good' is the second word proposed as most similar to 'bad'. The two words are antonyms, not synonyms, but the method proposes 'good' because 'bad' and 'good' tend to occur in very similar contexts.

In [35]:
model.most_similar('bad')


[('worse', 0.7929712533950806),
 ('good', 0.7702797651290894),
 ('things', 0.7653602957725525),
 ('too', 0.7630148530006409),
 ('thing', 0.7609668374061584),
 ('lot', 0.7443646788597107),
 ('kind', 0.7408681511878967),
 ('because', 0.7398799061775208),
 ('really', 0.7376540899276733),
 ("n't", 0.7336540818214417)]
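
A ranking like the ones above can be pictured as a brute-force search: compute the cosine similarity between the query word and every other word in the vocabulary, then sort. The sketch below does this for a four-word toy vocabulary with invented 3-dimensional vectors; the local function `most_similar` is a hypothetical stand-in, not gensim's code.

```python
import numpy as np

# Toy vocabulary with invented vectors.
words = ["truck", "car", "bus", "banana"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.1, 0.3],
    [0.0, 0.9, 0.4],
])

def most_similar(query, topn=2):
    """Rank all other words by cosine similarity to `query`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]  # cosine with every word
    order = np.argsort(-sims)               # highest similarity first
    ranked = [(words[i], float(sims[i])) for i in order if words[i] != query]
    return ranked[:topn]

neighbours = most_similar("truck")
print([word for word, _ in neighbours])  # ['car', 'bus']
```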

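The *most_similar* method accepts a word as input, but it can also receive a list of vectors. When the query is a vector rather than a word, the word that the vector belongs to is not filtered out of the results, so 'bad' itself appears first with similarity 1.0: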

In [42]:
vector1=model['bad']  
model.most_similar([vector1])


[('bad', 1.0),
 ('worse', 0.7929712533950806),
 ('good', 0.7702798247337341),
 ('things', 0.7653602957725525),
 ('too', 0.7630148530006409),
 ('thing', 0.7609667778015137),
 ('lot', 0.7443647980690002),
 ('kind', 0.7408681511878967),
 ('because', 0.7398799061775208),
 ('really', 0.7376540899276733)]

In [43]:
vector1=model['bad']  
vector2=model['good']  

model.most_similar([vector1, vector2])



[('good', 0.9436833262443542),
 ('bad', 0.9378852248191833),
 ('better', 0.8442150354385376),
 ('thing', 0.8393608927726746),
 ('things', 0.8369812369346619),
 ('kind', 0.8352970480918884),
 ('really', 0.8341462016105652),
 ('lot', 0.8266578912734985),
 ("n't", 0.8208479881286621),
 ('sure', 0.8168836832046509)]

In [44]:
vector1=model['law']  # numpy vector of a word
vector2=model['judge']  # numpy vector of a word

model.most_similar([vector1, vector2])

[('law', 0.8960108160972595),
 ('judge', 0.8908922672271729),
 ('court', 0.8809602856636047),
 ('supreme', 0.788550615310669),
 ('justice', 0.770503580570221),
 ('attorney', 0.765242338180542),
 ('case', 0.7597336173057556),
 ('federal', 0.7577816843986511),
 ('legal', 0.7538726329803467),
 ('appeals', 0.7446171641349792)]
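
When several vectors are passed, the query behaves roughly like the average of the normalized input vectors, which is why neither input word reaches a similarity of 1.0 in the outputs above. The sketch below shows that idea on an invented 2-dimensional vocabulary; it is an approximation of the behaviour under that assumption, not gensim's exact code, and the helper `unit` is hypothetical.

```python
import numpy as np

# Invented 2-dimensional vectors for illustration only.
words = ["law", "judge", "court", "banana"]
vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
    [-1.0, 0.2],
])

def unit(v):
    """Scale a vector (or each row of a matrix) to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

unit_vectors = unit(vectors)
# Query = normalized mean of the unit vectors of 'law' and 'judge'.
query = unit(unit_vectors[[0, 1]].mean(axis=0))
sims = unit_vectors @ query
ranking = [(words[i], round(float(sims[i]), 4)) for i in np.argsort(-sims)]
print(ranking[0][0])  # court
```

In this toy space 'court' lies exactly between the two query words, so it ranks first, while each query word on its own scores about 0.71.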

The *similar_by_word* method is very similar to the previous method, *most_similar*. The main difference is that *most_similar* can take either a word or a vector as input, whereas *similar_by_word* only accepts a word:

In [45]:
result = model.similar_by_word("truck") #cat
for r in result:
    print(r)


('car', 0.8597878217697144)
('trucks', 0.8078932166099548)
('vehicle', 0.7879196405410767)
('bus', 0.7633007764816284)
('pickup', 0.7436763644218445)
('tractor', 0.7433986067771912)
('cars', 0.741030752658844)
('driver', 0.7295383214950562)
('parked', 0.7291535139083862)
('lorry', 0.7239130139350891)


The *distance* method returns the cosine distance between two words. The *similarity* method returns the degree of similarity, that is, 1 minus the cosine distance between the two words:

$\text{similarity}(w_1, w_2) = \cos(\vec{w}_1, \vec{w}_2) = 1 - \text{distance}(w_1, w_2)$

In [65]:
w1 = "woman"
print(w1)
distance = model.distance(w1, w1)
print("Distance:", f"{distance:.1f}")

similarity = model.similarity(w1, w1)
print("Similarity:", f"{similarity:.1f}")

woman
Distance: 0.0
Similarity: 1.0


In [66]:
w1 = 'woman'
w2 = 'man'
distance = model.distance(w1, w2)
similarity = model.similarity(w1, w2)
print(w1, w2)
print("Distance:", f"{distance:.1f}", "similarity:", f"{similarity:.1f}")

woman man
Distance: 0.2 similarity: 0.8


In [67]:
w1 = 'woman'
for w2 in ['cosine', 'girl', 'wife']:
    distance = model.distance(w1, w2)
    similarity = model.similarity(w1, w2)
    print(w1, w2, '-> distance:', f"{distance:.1f}", 'similarity:', f"{similarity:.1f}")

woman cosine -> distance: 1.2 similarity: -0.2
woman girl -> distance: 0.2 similarity: 0.8
woman wife -> distance: 0.2 similarity: 0.8


In [68]:
w1 = 'man'
for w2 in ['cosine', 'boy', 'husband']:
    distance = model.distance(w1, w2)
    similarity = model.similarity(w1, w2)
    print(w1, w2, '-> distance:', f"{distance:.1f}", 'similarity:', f"{similarity:.1f}")

man cosine -> distance: 1.1 similarity: -0.1
man boy -> distance: 0.2 similarity: 0.8
man husband -> distance: 0.3 similarity: 0.7
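
The relation between the two quantities also explains why a distance can exceed 1, as it does for the pair with 'cosine' above: whenever the cosine similarity is negative, 1 minus that value is greater than 1. A minimal sketch with invented 2-dimensional vectors and hypothetical helper functions:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

u = np.array([1.0, 2.0])
same = cosine_distance(u, u)  # a vector compared with itself: distance ~0

# Vectors more than 90 degrees apart have a negative cosine similarity,
# so their cosine distance is greater than 1.
far = cosine_distance(np.array([1.0, 0.0]), np.array([-1.0, 1.0]))
print(f"{same:.1f}", f"{far:.4f}")  # 0.0 1.7071
```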


The *doesnt_match* method identifies the word that does not fit in a set of words, that is, the word from the given list that doesn't go with the others:


In [69]:
print(model.doesnt_match(['breakfast', 'house', 'dinner', 'lunch']))

house


In [70]:
print(model.doesnt_match("car ship woman train".split()))

woman
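
One way to picture this method: normalize the vectors of all the words in the list, average them, and return the word whose vector is least similar to that average. The sketch below follows this recipe with invented 2-dimensional vectors; the local function `doesnt_match` is a simplified, hypothetical stand-in for the gensim method.

```python
import numpy as np

# Invented vectors: three meal words point one way, 'house' another.
words = ["breakfast", "lunch", "dinner", "house"]
vectors = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.85, 0.15],
    [0.10, 0.90],
])

def doesnt_match(words, vectors):
    """Return the word least similar to the mean of all the word vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    mean = unit.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    sims = unit @ mean  # cosine similarity of each word to the mean
    return words[int(np.argmin(sims))]

odd_one = doesnt_match(words, vectors)
print(odd_one)  # house
```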


The *n_similarity* method computes the similarity between two sets of words, measured as the cosine similarity between the mean vectors of the two sets:

In [73]:
similarity = model.n_similarity(['one', 'heart'], ['japanese', 'restaurant'])
print(f"{similarity:.4f}")

0.4986


In [74]:
similarity = model.n_similarity(['sushi', 'bar'], ['japanese', 'restaurant'])
print(f"{similarity:.4f}")

0.6657


In [75]:
similarity = model.n_similarity(['sushi', 'red'], ['blue', 'restaurant'])
print(f"{similarity:.4f}")

0.8065
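
Conceptually, the score for two sets of words is the cosine similarity between the average vector of each set. The sketch below reproduces that idea with an invented dictionary of 3-dimensional vectors (`toy`) and a local helper named `n_similarity`; both are hypothetical and only illustrate the computation.

```python
import numpy as np

# Invented vectors for illustration only.
toy = {
    "sushi": np.array([0.9, 0.1, 0.0]),
    "bar": np.array([0.5, 0.5, 0.1]),
    "japanese": np.array([0.8, 0.0, 0.2]),
    "restaurant": np.array([0.6, 0.4, 0.0]),
}

def n_similarity(words1, words2):
    """Cosine similarity between the mean vectors of two word sets."""
    mean1 = np.mean([toy[w] for w in words1], axis=0)
    mean2 = np.mean([toy[w] for w in words2], axis=0)
    return float(np.dot(mean1, mean2) / (np.linalg.norm(mean1) * np.linalg.norm(mean2)))

score = n_similarity(["sushi", "bar"], ["japanese", "restaurant"])
print(f"{score:.2f}")  # 0.99
```

Because each set is reduced to a single average vector, word order within a set has no effect on the result.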


The next method, *most_similar_cosmul*, returns a list of words that are similar to one set of words (positive) and dissimilar to another set of words (negative), which makes it useful for solving analogies:


In [76]:
result = model.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])

most_similar_key, similarity = result[0]  
print(f"{most_similar_key}: {similarity:.4f}")


queen: 0.8965


In [77]:
result = model.most_similar_cosmul(positive=['madrid', 'france'], negative=['spain'])
most_similar_key, similarity = result[0] 
print(f"{most_similar_key}: {similarity:.4f}")


paris: 0.9525


In [78]:
result = model.most_similar_cosmul(positive=['baghdad', 'england'], negative=['london'])
most_similar_key, similarity = result[0]  
print(f"{most_similar_key}: {similarity:.4f}")

iraq: 0.8781


In [79]:
result = model.most_similar_cosmul(positive=['spain', 'barcelona'], negative=['madrid'])
most_similar_key, similarity = result[0]  
print(f"{most_similar_key}: {similarity:.4f}")

portugal: 0.9031


In [80]:
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
most_similar_key, similarity = result[0]  
print(f"{most_similar_key}: {similarity:.4f}")


queen: 0.7699
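
The two methods combine the positive and negative words differently, which is why they report different scores for 'queen'. As a rough sketch: *most_similar* adds the positive vectors and subtracts the negative ones before measuring cosine similarity, whereas *most_similar_cosmul* multiplies the (shifted) similarities to the positive words and divides by those to the negative words. The vectors below are invented, and the functions `cos_add` and `cos_mul` are hypothetical simplifications, not gensim's implementation.

```python
import numpy as np

# Invented 3-dimensional vectors: [royalty, male, female].
toy = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "apple": np.array([1.0, 1.0, 1.0]),
}
unit = {w: v / np.linalg.norm(v) for w, v in toy.items()}

def cos_add(positive, negative):
    """Additive combination: cosine to (sum of positives - sum of negatives)."""
    query = sum(unit[w] for w in positive) - sum(unit[w] for w in negative)
    query = query / np.linalg.norm(query)
    scores = {w: float(unit[w] @ query) for w in unit if w not in positive + negative}
    return max(scores, key=scores.get), scores

def cos_mul(positive, negative, eps=1e-6):
    """Multiplicative combination of similarities shifted into [0, 1]."""
    scores = {}
    for w in unit:
        if w in positive + negative:
            continue
        pos = np.prod([(1 + unit[w] @ unit[p]) / 2 for p in positive])
        neg = np.prod([(1 + unit[w] @ unit[n]) / 2 for n in negative])
        scores[w] = float(pos / (neg + eps))
    return max(scores, key=scores.get), scores

best_add, add_scores = cos_add(["woman", "king"], ["man"])
best_mul, mul_scores = cos_mul(["woman", "king"], ["man"])
print(best_add, best_mul)  # queen queen
```

Both variants exclude the input words from the candidates, and the multiplicative score is not bounded by 1.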


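It is also possible to measure how similar two sentences (or documents) are. The cells below use the *wmdistance* method, which computes the Word Mover's Distance between two tokenized texts; lower values mean more similar texts. See the following tutorial:

https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html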

In [84]:
!pip install POT==0.4.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [87]:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()
sentence_president3 = 'The president greets the media in Illinois'.lower().split()

distance = model.wmdistance(sentence_obama, sentence_president)
print(f"{distance:.4f}")

distance = model.wmdistance(sentence_obama, sentence_president3)
print(f"{distance:.4f}")

distance = model.wmdistance(sentence_president, sentence_president3)
print(f"{distance:.4f}")


0.6182
0.3908
0.2274


In [88]:
text1 = 'The hotel was very expensive and not good'.lower().split()
text2 = 'The hotel was very good and not expensive'.lower().split()
text3 = 'The hotel was very bad and not cheap'.lower().split()
text4 = 'The best result was achieved by BERT'.lower().split()

distance = model.wmdistance(text1, text2)
print(f"{distance:.4f}")

distance = model.wmdistance(text1, text3)
print(f"{distance:.4f}")

distance = model.wmdistance(text1, text4)
print(f"{distance:.4f}")

0.0000
0.1686
0.6942
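
The distance of 0.0000 between text1 and text2 may look surprising, since the two reviews express opposite opinions. The Word Mover's Distance treats each text as a bag of words and ignores word order, and these two texts contain exactly the same words. The check below uses only the standard library to confirm this; it does not call gensim.

```python
from collections import Counter

text1 = 'The hotel was very expensive and not good'.lower().split()
text2 = 'The hotel was very good and not expensive'.lower().split()
text3 = 'The hotel was very bad and not cheap'.lower().split()

# Two texts with identical word counts are indistinguishable
# for any measure that ignores word order.
same_bag_12 = Counter(text1) == Counter(text2)
same_bag_13 = Counter(text1) == Counter(text3)
print(same_bag_12, same_bag_13)  # True False
```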
