<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/4/47/Acronimo_y_nombre_uc3m.png"/>

<img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" width=15%/>
</center>    

# Word Embeddings

In this notebook, we will learn how to load a word embedding model with the gensim library. We will also work with the model to explore the functionality it offers.

First of all, we install [gensim](https://radimrehurek.com/gensim/), a Python library that lets us both train word embedding models and load pre-trained models to obtain word vectors.

In [3]:
!pip install --upgrade gensim
#!pip gensim --version

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Loading a pre-trained word embedding model by using gensim

The gensim API lets us load pre-trained models directly. For example, the next cell loads 'glove-wiki-gigaword-100', a set of 100-dimensional GloVe vectors pre-trained on Wikipedia and Gigaword text. Other models can be loaded in the same way, such as 'glove-twitter-25': pre-trained GloVe vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased.

The operation may take a few minutes:



In [4]:
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-100")  # load pre-trained word-vectors from gensim-data



It is also possible to load a model from local storage. For example, we save the previous model to disk and load it into a new variable, new_model:

In [5]:
from gensim.models import KeyedVectors
model.save('model.bin')
new_model = KeyedVectors.load('model.bin')

Let's look up the vector associated with a specific word, 'mother'. We can see that it is a vector of dimension 100.

In [6]:
vector = model['mother']  # numpy vector of a word
print(vector.shape)
print(vector)

(100,)
[ 0.60587   0.027989  0.018495 -0.018674 -0.39562   1.0309   -0.35793
  0.20527   0.3293    0.035267 -0.38475   0.31452   0.32538   0.70024
  0.13935  -0.58923   0.36985  -0.080566 -0.59721   1.0215   -0.55154
  0.042073  0.34687   0.86511   0.63521   0.52616  -0.92199  -1.4634
  0.34517   0.58921   0.12295   0.7323    1.0468    0.065458 -0.27033
 -0.095179  0.20613   0.22589   0.90409  -0.11252  -0.58059   0.036599
  0.32003  -0.53638   0.19297   0.035694 -0.56487   0.1527    0.70196
 -0.24191   0.10476  -0.23424   1.212     1.1612   -0.033677 -1.9996
 -0.79448  -0.087088  0.51475   0.44601   0.638     0.89893   0.17408
 -0.32006   0.41652   0.23289   0.50642   0.26938  -0.1453    0.1207
 -0.26246   0.16991   0.16702  -0.042041  0.64841   0.9827   -0.092602
 -0.56797  -0.63854  -0.38415  -0.13816   0.43137   0.44748   0.24486
 -1.5669   -0.80245  -0.15123  -0.18795  -0.4888   -0.67834   0.27133
 -0.36768   1.1268    0.44722  -0.91335  -0.055973 -0.38328  -0.62756
 -0.24055  -0.

The word embedding model lets us compute the similarity between two words. A result close to 1 means that the two words have a similar meaning.

In [7]:
similarity = model.similarity('mother', 'teeth')
print(similarity)

0.2704376


In [8]:
similarity = model.similarity('brush', 'brushes')
print(similarity)

0.595461


As expected, 'brush' and 'brushes' are more similar to each other than 'mother' and 'teeth'.
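To make the similarity score less of a black box, here is a minimal sketch of cosine similarity written with the standard library only. The three-dimensional vectors are made-up toy values, not taken from the GloVe model, and the helper name `cosine_similarity` is our own.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (hypothetical values, for illustration only)
toy = {
    "brush":   [0.9, 0.1, 0.0],
    "brushes": [0.8, 0.3, 0.1],
    "mother":  [0.0, 0.2, 0.9],
}

sim_brush_brushes = cosine_similarity(toy["brush"], toy["brushes"])
sim_brush_mother = cosine_similarity(toy["brush"], toy["mother"])
print(f"brush vs brushes: {sim_brush_brushes:.3f}")
print(f"brush vs mother:  {sim_brush_mother:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while vectors that share almost no direction score close to 0.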

Now let's look at the similarity between 'man' and other words such as 'woman', 'guy' or 'boy'.

In [9]:
word1 = 'man'
for word2 in ['woman', 'guy', 'boy']:
    similarity = model.similarity(word1, word2)
    print("similarity of {} and {} = {}".format(word1, word2, similarity))


similarity of man and woman = 0.8323495
similarity of man and guy = 0.6679584
similarity of man and boy = 0.79148716


Perhaps surprisingly, 'woman' (0.83) is more similar to 'man' than 'boy' (0.79) or 'guy' (0.67). Words that occur in similar contexts receive similar vectors, even when their meanings differ.

The *most_similar* method returns a list of the words most similar to a given one, sorted from highest to lowest similarity.

In [10]:
model.most_similar('truck')
#model.most_similar('aspirin')


[('car', 0.8597878217697144),
 ('trucks', 0.8078932166099548),
 ('vehicle', 0.7879196405410767),
 ('bus', 0.7633007764816284),
 ('pickup', 0.7436763644218445),
 ('tractor', 0.7433986067771912),
 ('cars', 0.741030752658844),
 ('driver', 0.7295383214950562),
 ('parked', 0.7291535139083862),
 ('lorry', 0.7239130139350891)]

In the next cell, 'good' is the second most similar word proposed for 'bad'. The two words are antonyms rather than synonyms, but the method proposes 'good' because 'bad' and 'good' tend to occur in very similar contexts.

In [11]:
model.most_similar('bad')


[('worse', 0.7929712533950806),
 ('good', 0.7702797651290894),
 ('things', 0.7653602957725525),
 ('too', 0.7630148530006409),
 ('thing', 0.7609668374061584),
 ('lot', 0.7443646788597107),
 ('kind', 0.7408681511878967),
 ('because', 0.7398799061775208),
 ('really', 0.7376540899276733),
 ("n't", 0.7336540818214417)]

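The *most_similar* method accepts either words or vectors as input. When a vector is passed, the word it belongs to appears first in the results, with similarity 1.0.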

In [12]:
vector=model['bad']  # numpy vector of a word
print(model.most_similar('bad'))
print(model.most_similar([vector]))



[('worse', 0.7929712533950806), ('good', 0.7702797651290894), ('things', 0.7653602957725525), ('too', 0.7630148530006409), ('thing', 0.7609668374061584), ('lot', 0.7443646788597107), ('kind', 0.7408681511878967), ('because', 0.7398799061775208), ('really', 0.7376540899276733), ("n't", 0.7336540818214417)]
[('bad', 1.0), ('worse', 0.7929712533950806), ('good', 0.7702798247337341), ('things', 0.7653602957725525), ('too', 0.7630148530006409), ('thing', 0.7609667778015137), ('lot', 0.7443647980690002), ('kind', 0.7408681511878967), ('because', 0.7398799061775208), ('really', 0.7376540899276733)]
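The difference between the two outputs above can be reproduced with a small sketch. It assumes a made-up two-dimensional vocabulary and two hypothetical helpers, `rank_by_vector` and `rank_by_word`. It illustrates the idea of nearest-neighbour ranking by cosine similarity and is not gensim's actual implementation.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy vocabulary; the values are made up for illustration
toy_vectors = {
    "bad":   [1.0, 0.1],
    "worse": [0.9, 0.3],
    "good":  [0.8, 0.5],
    "truck": [0.0, 1.0],
}

def rank_by_vector(query_vector, vectors, topn=3):
    """Rank every word in the vocabulary by cosine similarity to query_vector."""
    scored = [(word, cosine_similarity(query_vector, vec)) for word, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

def rank_by_word(word, vectors, topn=3):
    """Same ranking, but the query word itself is removed from the results."""
    ranked = rank_by_vector(vectors[word], vectors, topn=len(vectors))
    return [(w, s) for w, s in ranked if w != word][:topn]

by_word = rank_by_word("bad", toy_vectors)
by_vector = rank_by_vector(toy_vectors["bad"], toy_vectors)
print(by_word)    # starts with 'worse'
print(by_vector)  # starts with 'bad' itself
```

A query by word can exclude the query word from the ranking, whereas a query by raw vector has no word to exclude, so the closest match is the word the vector came from.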


The *similar_by_word* method is very similar to the previous one, *most_similar*. The main difference is that *most_similar* accepts either a word or a vector as input, whereas *similar_by_word* only accepts a word:

In [13]:
result = model.similar_by_word("truck") #cat
for r in result:
    print(r)


('car', 0.8597878217697144)
('trucks', 0.8078932166099548)
('vehicle', 0.7879196405410767)
('bus', 0.7633007764816284)
('pickup', 0.7436763644218445)
('tractor', 0.7433986067771912)
('cars', 0.741030752658844)
('driver', 0.7295383214950562)
('parked', 0.7291535139083862)
('lorry', 0.7239130139350891)


The *distance* method returns the cosine distance between two words. The *similarity* method returns their cosine similarity, which is 1 minus the cosine distance:

$similarity = \cos(\theta) = 1 - distance$

where $\theta$ is the angle between the two word vectors.

In [27]:
w1="woman"

distance = model.distance(w1, w1)
print(f"{distance:.1f}")


similarity = model.similarity(w1, w1)
print(f"{similarity:.1f}")



0.0
1.0


In [28]:
distance = model.distance("woman", "man")
similarity = model.similarity('woman', 'man')
print(f"{distance:.1f}",f"{similarity:.1f}")


0.2 0.8


In [29]:
w1= 'woman'
for w2 in ['cosine', 'girl', 'wife']:
    distance = model.distance(w1,w2)
    similarity = model.similarity(w1,w2)
    print(w1, w2, '-> distancia:', f"{distance:.1f}", 'similitud:', f"{similarity:.1f}")


woman cosine -> distancia: 1.2 similitud: -0.2
woman girl -> distancia: 0.2 similitud: 0.8
woman wife -> distancia: 0.2 similitud: 0.8


In [30]:
w1= 'man'
for w2 in ['cosine', 'boy', 'husband']:
    distance = model.distance(w1,w2)
    similarity = model.similarity(w1,w2)
    print(w1, w2, '-> distancia:', f"{distance:.1f}", 'similitud:', f"{similarity:.1f}")


man cosine -> distancia: 1.1 similitud: -0.1
man boy -> distancia: 0.2 similitud: 0.8
man husband -> distancia: 0.3 similitud: 0.7
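The outputs above include distances greater than 1 (for example, between 'woman' and 'cosine'). This is expected: cosine similarity ranges from -1 to 1, so the cosine distance ranges from 0 to 2. The sketch below checks the three boundary cases with hand-picked toy vectors. The helper names are our own and no gensim call is involved.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_distance(u, v):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Hypothetical 2-dimensional vectors chosen to cover the three typical cases
same_direction = ([3.0, 4.0], [6.0, 8.0])    # parallel vectors
orthogonal = ([1.0, 0.0], [0.0, 3.0])        # nothing in common
opposite = ([3.0, 4.0], [-3.0, -4.0])        # pointing in opposite directions

cases = [("same direction", same_direction), ("orthogonal", orthogonal), ("opposite", opposite)]
for name, (u, v) in cases:
    print(f"{name}: similarity={cosine_similarity(u, v):.1f}, distance={cosine_distance(u, v):.1f}")
```

A negative similarity therefore means the two vectors point, at least partly, in opposite directions.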


The *doesnt_match* method identifies, within a list of words, the one that does not go with the others:


In [31]:
print(model.doesnt_match(['breakfast', 'house', 'dinner', 'lunch']))

house


In [32]:
print(model.doesnt_match("car ship woman train".split()))

woman
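One plausible way to find the odd word out is to average the unit-length vectors of all the words and pick the word least aligned with that average direction. The sketch below follows that idea with made-up two-dimensional vectors. The function `odd_one_out` is a hypothetical helper, and the approach is an assumption about the general technique rather than a copy of gensim's code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least aligned with the mean direction."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]

# Hypothetical toy vectors: the three meals point one way, 'house' another
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "lunch":     [0.8, 0.2],
    "dinner":    [0.85, 0.15],
    "house":     [0.1, 0.9],
}

outlier = odd_one_out(["breakfast", "house", "dinner", "lunch"], toy_vectors)
print(outlier)  # house
```

Because the majority of the words dominate the average, the outlier is the word that agrees least with the group as a whole.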


## n_similarity
The *n_similarity* method computes the cosine similarity between two sets of words, comparing the mean vector of each set.


In [20]:
similarity = model.n_similarity(['one', 'heart'], ['japanese', 'restaurant'])
print(f"{similarity:.4f}")

0.4986


In [21]:
similarity = model.n_similarity(['sushi', 'bar'], ['japanese', 'restaurant'])
print(f"{similarity:.4f}")

0.6657


In [40]:
similarity = model.n_similarity(['sushi', 'red'], ['blue', 'restaurant'])
print(f"{similarity:.4f}")

0.8065
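The high score for the last pair of sets may look odd, since 'sushi' and 'blue' have little in common. A sketch with toy vectors shows why averaging can produce this: when each set mixes a food word and a colour word, the two mean vectors end up almost identical. The vectors and the helper `set_similarity` are hypothetical, chosen only to illustrate the effect.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(words, vectors):
    """Component-wise average of the vectors of the given words."""
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]

def set_similarity(words_a, words_b, vectors):
    """Cosine similarity between the mean vectors of two sets of words."""
    return cosine_similarity(mean_vector(words_a, vectors), mean_vector(words_b, vectors))

# Hypothetical 2-d vectors: first axis ~ "food", second axis ~ "colour"
toy_vectors = {
    "sushi":      [1.0, 0.0],
    "restaurant": [0.9, 0.1],
    "red":        [0.0, 1.0],
    "blue":       [0.1, 0.9],
}

mixed = set_similarity(["sushi", "red"], ["blue", "restaurant"], toy_vectors)
single = cosine_similarity(toy_vectors["sushi"], toy_vectors["blue"])
print(f"sets: {mixed:.4f}")
print(f"sushi vs blue: {single:.4f}")
```

The set-level score is high even though the individual words across the sets are not all similar, so results from set comparisons should be read with this averaging in mind.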


The next method, *most_similar_cosmul*, returns a list of words that are similar to one set of words (positive) and dissimilar to another set (negative). It combines the individual similarities by multiplication instead of addition:


In [41]:
# Use a different similarity measure: "cosmul".
result = model.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])

most_similar_key, similarity = result[0]  # look at the first match
print(f"{most_similar_key}: {similarity:.4f}")


queen: 0.8965


In [42]:
result = model.most_similar_cosmul(positive=['madrid', 'france'], negative=['spain'])
most_similar_key, similarity = result[0]  # look at the first match
print(f"{most_similar_key}: {similarity:.4f}")


paris: 0.9525


In [43]:
result = model.most_similar_cosmul(positive=['baghdad', 'england'], negative=['london'])
most_similar_key, similarity = result[0]  # look at the first match
print(f"{most_similar_key}: {similarity:.4f}")

iraq: 0.8781


In [44]:
result = model.most_similar_cosmul(positive=['spain', 'barcelona'], negative=['madrid'])
most_similar_key, similarity = result[0]  # look at the first match
print(f"{most_similar_key}: {similarity:.4f}")

portugal: 0.9031


In [45]:
# Check the "most similar words", using the default "cosine similarity" measure.
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
most_similar_key, similarity = result[0]  # look at the first match
print(f"{most_similar_key}: {similarity:.4f}")


queen: 0.7699
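The analogy queries above can be sketched with toy vectors. The additive variant builds the target vector king - man + woman and looks for its nearest neighbour. The multiplicative variant multiplies the similarities to the positive words and divides by the similarities to the negative words, after shifting each cosine into the range 0 to 1. Both functions are simplified illustrations under our own assumptions (made-up vectors, hypothetical function names, no vector normalisation before the arithmetic), not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-d vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    "man":    [0.1, 1.0],
    "woman":  [0.1, -1.0],
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "prince": [0.8, 0.8],
    "apple":  [-0.5, 0.2],
}

def candidates(positive, negative, vectors):
    """All vocabulary words except the ones used in the query."""
    return [w for w in vectors if w not in positive and w not in negative]

def analogy_additive(positive, negative, vectors):
    """Nearest neighbour of sum(positive) - sum(negative) by cosine similarity."""
    dim = len(next(iter(vectors.values())))
    target = [
        sum(vectors[w][i] for w in positive) - sum(vectors[w][i] for w in negative)
        for i in range(dim)
    ]
    return max(candidates(positive, negative, vectors),
               key=lambda w: cosine_similarity(vectors[w], target))

def analogy_multiplicative(positive, negative, vectors, eps=1e-6):
    """Multiply shifted similarities to positives, divide by those to negatives."""
    def score(word):
        numerator = 1.0
        for p in positive:
            numerator *= (1.0 + cosine_similarity(vectors[word], vectors[p])) / 2.0
        denominator = 1.0
        for n in negative:
            denominator *= (1.0 + cosine_similarity(vectors[word], vectors[n])) / 2.0
        return numerator / (denominator + eps)  # eps avoids division by zero
    return max(candidates(positive, negative, vectors), key=score)

add_result = analogy_additive(["woman", "king"], ["man"], toy_vectors)
mul_result = analogy_multiplicative(["woman", "king"], ["man"], toy_vectors)
print(add_result, mul_result)  # queen queen
```

On real models the two variants often agree on the top answer, as with 'queen' above, but they produce scores on different scales, so the numbers are not directly comparable.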


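It is also possible to compute the distance between two sentences (or documents) with the Word Mover's Distance (WMD), available through the *wmdistance* method. See this tutorial for more details:

https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html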

In [49]:
!pip install POT==0.4.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting POT==0.4.0
  Downloading POT-0.4.0.tar.gz (315 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: POT
  Building wheel for POT (setup.py) ... done
  Created wheel for POT: filename=POT-0.4.0-cp38-cp38-linux_x86_64.whl size=296780 sha256=e3891321f8f4ae91fb31d3c32c8f89d65fe129826df34aa430a3ea80608be411
  Stored in directory: /root/.cache/pip/wheels/06/a2/14/41d41262c65ab560964e367fc6e0203dc1a6657d2e22d0d5e7
Successfully built POT
Installing collected packages: POT
Successfully installed POT-0.4.0


In [50]:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()

sentence_president = 'The president greets the press in Chicago'.lower().split()
sentence_president3 = 'The president greets the media in Illinois'.lower().split()

distance = model.wmdistance(sentence_obama, sentence_president)
print(f"{distance:.4f}")

distance = model.wmdistance(sentence_obama, sentence_president3)
print(f"{distance:.4f}")

distance = model.wmdistance(sentence_president, sentence_president3)
print(f"{distance:.4f}")


0.6182
0.3908
0.2274


In [52]:
text1 = 'The hotel was very expensive and not good'.lower().split()
text2 = 'The hotel was very good and not expensive'.lower().split()
text3 = 'The hotel was very bad and not cheap'.lower().split()

text4 = 'The best result was achieved by BERT'.lower().split()

distance = model.wmdistance(text1, text2)
print(f"{distance:.4f}")

distance = model.wmdistance(text1, text3)
print(f"{distance:.4f}")

distance = model.wmdistance(text1, text4)
print(f"{distance:.4f}")

0.0000
0.1686
0.6942
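The distance of 0.0000 between text1 and text2 deserves a comment. WMD treats each sentence as a bag of words and ignores word order, so two sentences built from exactly the same words are indistinguishable, even when they say opposite things as they do here. A quick check with the standard library (no gensim involved) confirms that the two bags are identical:

```python
from collections import Counter

text1 = 'The hotel was very expensive and not good'.lower().split()
text2 = 'The hotel was very good and not expensive'.lower().split()
text3 = 'The hotel was very bad and not cheap'.lower().split()

# Counter builds the bag of words: each word mapped to its number of occurrences
same_bag_12 = Counter(text1) == Counter(text2)
same_bag_13 = Counter(text1) == Counter(text3)

print(same_bag_12)  # True
print(same_bag_13)  # False
```

This is a known limitation of bag-of-words measures: negation and word order carry meaning that the distance cannot see.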


Here is a function that plots the word embedding model (it takes a long time to run):

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

def tsne_plot(word_vectors):
    """Project every word vector to 2D with t-SNE and plot the labelled points."""
    # Collect every word in the vocabulary together with its vector
    labels = list(word_vectors.index_to_key)
    tokens = np.array([word_vectors[word] for word in labels])

    # Note: newer scikit-learn versions name the n_iter parameter max_iter
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    # Split the 2D coordinates into x and y
    x = new_values[:, 0]
    y = new_values[:, 1]

    plt.figure(figsize=(18, 18))
    for i, label in enumerate(labels):
        plt.scatter(x[i], y[i])
        plt.annotate(label,
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

# Runs over the whole vocabulary, so this call is very slow
tsne_plot(model)

