<a href="https://colab.research.google.com/github/nlp-en-es/nlp-de-cero-a-cien/blob/main/1_word_embeddings/word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2vec with Gensim

In this Jupyter notebook you will use the [Gensim](https://radimrehurek.com/gensim/index.html) library to experiment with word2vec. The notebook focuses on the intuition behind the concepts rather than on implementation details, and it is inspired by this [guide](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html).

## 1. Installation and loading the model

In [None]:
!pip install --upgrade gensim

In [1]:
import gensim.downloader as api



In [2]:
model = api.load('word2vec-google-news-300') # takes a few minutes

## 2. Word similarity

In this section we will see how to get the similarity between two words using a pretrained word embedding.

In [3]:
model.similarity("king", "queen")

0.6510957

In [4]:
model.similarity("king", "man")

0.22942673

In [5]:
model.similarity("king", "potato")

0.09978465

In [6]:
model.similarity("king", "king")

1.0

In [30]:
model.similarity("nine", "eight")

0.935371
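
The scores above are cosine similarities between word vectors. The sketch below shows the idea in plain Python so it runs without downloading the model; the helper `cosine_similarity` and the 3-dimensional vectors are made up for illustration (the real model uses 300 dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from the model.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, different length
c = [-2.0, 1.0, 0.0]  # perpendicular to a

sim_ab = cosine_similarity(a, b)  # close to 1: same direction
sim_ac = cosine_similarity(a, c)  # close to 0: unrelated directions
sim_aa = cosine_similarity(a, a)  # a vector compared with itself
print(sim_ab, sim_ac, sim_aa)
```

Because only the direction of the vectors matters, a word compared with itself always scores 1, as in the `"king"` versus `"king"` cell.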

Now we will see how to find the words that are most similar to a given set of words.

In [31]:
model.most_similar(["king", "queen"], topn=5)

[('monarch', 0.7042067050933838),
 ('kings', 0.6780861616134644),
 ('princess', 0.6731551885604858),
 ('queens', 0.6679497957229614),
 ('prince', 0.6435247659683228)]

In [32]:
model.most_similar(["tomato", "carrot"], topn=5)

[('carrots', 0.7536594867706299),
 ('tomatoes', 0.7129638195037842),
 ('celery', 0.7025030851364136),
 ('broccoli', 0.6796350479125977),
 ('cherry_tomatoes', 0.662927508354187)]
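
One way to picture a query with several words is as a search around their average vector. The sketch below assumes that the query words are unit-normalized and averaged, and that the query words themselves are left out of the results; this matches my understanding of Gensim's behavior but should be read as an assumption. The function `toy_most_similar` and the vectors are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_most_similar(vectors, positive, topn=5):
    """Rank words by cosine similarity to the mean of the `positive` vectors."""
    unit = {word: normalize(vec) for word, vec in vectors.items()}
    dim = len(next(iter(unit.values())))
    # Average the unit vectors of the query words, then renormalize.
    mean = normalize(
        [sum(unit[w][i] for w in positive) / len(positive) for i in range(dim)]
    )
    scores = [
        (word, sum(m * x for m, x in zip(mean, vec)))
        for word, vec in unit.items()
        if word not in positive  # assumption: query words are excluded
    ]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)[:topn]

# Hypothetical toy vectors, not taken from the model.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "monarch": [0.85, 0.85, 0.2],
    "potato": [0.1, 0.0, 0.9],
}
ranking = toy_most_similar(toy_vectors, ["king", "queen"], topn=2)
print(ranking)
```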

You can also do more interesting things, such as finding which word does not belong in a list.

In [45]:
model.doesnt_match(["six", "seven", "three", "four", "five"])

'three'
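
A simple way to think about this kind of query is: average all the word vectors and return the word that is farthest from that average. The sketch below follows that idea under the assumption that vectors are unit-normalized first; `toy_doesnt_match` and the 2-dimensional vectors are hypothetical. Note that such a procedure always returns some word, even when every word in the list belongs to the same group, which is how an answer like `'three'` can appear above.

```python
import math

def normalize(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_doesnt_match(vectors, words):
    """Return the word whose vector is least similar to the mean of all words."""
    unit = {w: normalize(vectors[w]) for w in words}
    dim = len(unit[words[0]])
    mean = normalize(
        [sum(unit[w][i] for w in words) / len(words) for i in range(dim)]
    )
    # The lowest dot product with the mean marks the outlier.
    return min(words, key=lambda w: sum(m * x for m, x in zip(mean, unit[w])))

# Hypothetical toy vectors, not taken from the model.
toy_vectors = {
    "six": [0.9, 0.1],
    "seven": [0.85, 0.15],
    "five": [0.88, 0.12],
    "potato": [0.1, 0.9],
}
outlier = toy_doesnt_match(toy_vectors, ["six", "seven", "five", "potato"])
print(outlier)
```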

## Exercises

1. Use the word2vec model to rank the following 15 words by their similarity to the words "man" and "woman". For each pair, print its similarity.

In [26]:
words = [
"wife",
"husband",
"child",
"queen",
"king",
"man",
"woman",
"birth",
"doctor",
"nurse",
"teacher",
"professor",
"engineer",
"scientist",
"president"]

# Similarity of every word to "man" and to "woman".
dicc_man = {}
dicc_wom = {}
for word in words:
    dicc_man[word] = model.similarity("man", word)
    dicc_wom[word] = model.similarity("woman", word)

# Sort the (word, similarity) pairs from most to least similar.
tuples_list_man = sorted(dicc_man.items(), key=lambda kv: kv[1], reverse=True)
tuples_list_wom = sorted(dicc_wom.items(), key=lambda kv: kv[1], reverse=True)

print("From most to least similar to 'man':")
for tupla in tuples_list_man:
    print(tupla[0], "is similar to 'man' by", round(tupla[1]*100, 2), "%")

print()
print("From most to least similar to 'woman':")
for tupla in tuples_list_wom:
    print(tupla[0], "is similar to 'woman' by", round(tupla[1]*100, 2), "%")

From most to least similar to 'man':
man is similar to 'man' by 100.0 %
woman is similar to 'man' by 76.64 %
husband is similar to 'man' by 34.5 %
wife is similar to 'man' by 32.92 %
child is similar to 'man' by 31.63 %
doctor is similar to 'man' by 31.45 %
nurse is similar to 'man' by 25.47 %
teacher is similar to 'man' by 25.0 %
king is similar to 'man' by 22.94 %
queen is similar to 'man' by 16.66 %
scientist is similar to 'man' by 15.82 %
engineer is similar to 'man' by 15.13 %
birth is similar to 'man' by 11.08 %
professor is similar to 'man' by 9.42 %
president is similar to 'man' by 2.84 %

From most to least similar to 'woman':
woman is similar to 'woman' by 100.0 %
man is similar to 'woman' by 76.64 %
husband is similar to 'woman' by 49.28 %
child is similar to 'woman' by 47.5 %
wife is similar to 'woman' by 44.48 %
nurse is similar to 'woman' by 44.14 %
doctor is similar to 'woman' by 37.95 %
queen is similar to 'woman' by 31.62 %
teacher is similar to 'woman' by 31.36 %
birth is similar to 'woman' by 21.47 %
scientist is similar to

**2. Complete the following analogies on your own (without using the model)**

a. king is to throne as judge is to _bench_

b. giant is to dwarf as genius is to _fool_

c. French is to France as Spaniard is to _Spain_

d. bad is to good as sad is to _happy_

e. nurse is to hospital as teacher is to _school_

f. universe is to planet as house is to _room_

**3. Now complete the analogies using a word2vec model**

Here is an example of how to do it. You can solve analogies such as "A is to B as C is to _" by computing B - A + C, that is, by passing B and C as `positive` and A as `negative` to `most_similar`.
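
The vector arithmetic behind an analogy can be tried on toy data before using the model. In the sketch below, `solve_analogy` and the 3-dimensional vectors are hypothetical, and raw vectors are used for readability (Gensim, as far as I know, normalizes the vectors before combining them, so treat this as a simplification).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def solve_analogy(vectors, a, b, c):
    """'a is to b as c is to ?': word closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy vectors: axis 0 ~ royalty, axis 1 ~ gender, axis 2 ~ food.
toy_vectors = {
    "man": [0.2, 1.0, 0.0],
    "woman": [0.2, -1.0, 0.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "apple": [0.1, 0.0, 1.0],
}
answer = solve_analogy(toy_vectors, "man", "king", "woman")
print(answer)
```

With the real model, the equivalent call would pass `positive=["king", "woman"]` and `negative=["man"]`.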

In [39]:
def evaluar(tupla, pred):
    if tupla[0][0] == pred:
        print("🥳")
    else:
        print("💩")

In [40]:
# king is to throne as judge is to bench?
tupla = model.most_similar(positive=["judge", "throne"], negative=["king"], topn=1)
print(tupla)
evaluar(tupla, "bench")

[('appellate_court', 0.5845253467559814)]
💩


In [41]:
# giant is to dwarf as genius is to fool?
tupla = model.most_similar(positive=["genius", "dwarf"], negative=["giant"], topn=1)
print(tupla)
evaluar(tupla, "fool")

[('savant', 0.44152510166168213)]
💩


In [46]:
# French is to France as Spaniard is to Spain?
tupla = model.most_similar(positive=["Spaniard", "France"], negative=["French"], topn=1)
print(tupla)
evaluar(tupla, "Spain")

tupla = model.most_similar(positive=["Spanish", "France"], negative=["French"], topn=1)
print(tupla)
evaluar(tupla, "Spain")

[('rider_Dani_Pedrosa', 0.5646752119064331)]
💩
[('Spain', 0.8138449192047119)]
🥳


In [43]:
# bad is to good as sad is to happy?
tupla = model.most_similar(positive=["sad", "good"], negative=["bad"], topn=1)
print(tupla)
evaluar(tupla, "happy")

[('wonderful', 0.6414927840232849)]
💩


In [44]:
# nurse is to hospital as teacher is to school?
tupla = model.most_similar(positive=["teacher", "hospital"], negative=["nurse"], topn=1)
print(tupla)
evaluar(tupla, "school")

[('school', 0.60170978307724)]
🥳


In [45]:
# universe is to planet as house is to room?
tupla = model.most_similar(positive=["house", "planet"], negative=["universe"], topn=1)
print(tupla)
evaluar(tupla, "room")

[('bungalow', 0.5428240299224854)]
💩
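
The `evaluar` helper only looks at the single best candidate. A softer check is to request more candidates (for example `topn=10`) and see at which position, if any, the expected word appears. The helper `rank_of` below is a hypothetical sketch of that idea, and the candidate list is made up for illustration; it is not model output.

```python
def rank_of(expected, candidates):
    """Return the 1-based position of `expected` in a list of (word, score) pairs.

    Returns None when the word is not among the candidates.
    """
    for position, (word, _score) in enumerate(candidates, start=1):
        if word == expected:
            return position
    return None

# Hypothetical candidate list in the same (word, score) shape that
# most_similar returns; the words and scores are invented.
toy_candidates = [("alpha", 0.9), ("beta", 0.8), ("gamma", 0.7)]

rank_beta = rank_of("beta", toy_candidates)
rank_missing = rank_of("delta", toy_candidates)
print(rank_beta, rank_missing)
```

With the model loaded, this could be used as `rank_of("bench", model.most_similar(positive=["judge", "throne"], negative=["king"], topn=10))`.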


In [24]:
#JUST FOR FUN

def regla_de_tres(base1, base2, target1):
    """Print the model's answer to 'base1 is to base2 as target1 is to ?'."""
    # most_similar returns (word, cosine similarity) pairs; keep only the best one.
    target2, similarity = model.most_similar(positive=[target1, base2], negative=[base1], topn=1)[0]
    print(base1, "is to", base2, "as", target1, "is to", target2)

In [30]:
country_base = "Spain"
country_target = "Italy"
food_base = "paella"
regla_de_tres(country_base, food_base, country_target)

Spain is to paella as Italy is to risotto


In [36]:
country_base = "Italy"
country_target = "Spain"
food_base = "pasta"
regla_de_tres(country_base, food_base, country_target)

Italy is to pasta as Spain is to paella


In [37]:
country_base = "Italy"
country_target = "Spain"
food_base = "lasagna"
regla_de_tres(country_base, food_base, country_target)

Italy is to lasagna as Spain is to paella


In [38]:
country_base = "Spain"
country_target = "Italy"
food_base = "cocido"
regla_de_tres(country_base, food_base, country_target)

Spain is to cocido as Italy is to cucina


In [40]:
country_base = "Spain"
country_target = "Italy"
food_base = "jamón"
regla_de_tres(country_base, food_base, country_target)

Spain is to jamón as Italy is to pesce


In [42]:
country_base = "Italy"
country_target = "Spain"
food_base = "formaggio"
regla_de_tres(country_base, food_base, country_target)

Italy is to formaggio as Spain is to jamón


In [45]:
country_base = "Italy"
country_target = "Spain"
food_base = "prosciutto"
regla_de_tres(country_base, food_base, country_target)

Italy is to prosciutto as Spain is to Serrano_ham


In [46]:
country_base = "Italy"
country_target = "Spain"
food_base = "pesto"
regla_de_tres(country_base, food_base, country_target)

Italy is to pesto as Spain is to chorizo


In [47]:
country_base = "Spain"
country_target = "Italy"
food_base = "chorizo"
regla_de_tres(country_base, food_base, country_target)

Spain is to chorizo as Italy is to ricotta
