
# Word2vec with Gensim

In this Jupyter notebook you will use the [Gensim](https://radimrehurek.com/gensim/index.html) library to experiment with word2vec. The notebook focuses on the intuition behind the concepts rather than on implementation details, and it is inspired by this [guide](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html).

## 1. Installing Gensim and loading the model

In [None]:
!pip install --upgrade gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/44/52/f1417772965652d4ca6f901515debcd9d6c5430969e8c02ee7737e6de61c/gensim-4.0.1-cp37-cp37m-manylinux1_x86_64.whl (23.9MB)
Installing collected packages: gensim
  Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.0.1


In [None]:
import gensim.downloader as api



In [None]:
model = api.load('word2vec-google-news-300')



## 2. Similitud de palabras

In this section we will see how to obtain the similarity between two words using a pretrained word embedding.

In [None]:
model.similarity("king", "queen")

0.6510957

In [None]:
model.similarity("king", "man")

0.22942673

In [None]:
model.similarity("king", "potato")

0.09978465

In [None]:
model.similarity("king", "king")

1.0
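The scores returned by `similarity` are cosine similarities between the two word vectors, which is why a word compared with itself scores 1.0. The sketch below reproduces that calculation in plain Python. The 3-dimensional vectors are invented for illustration and are not taken from the model, and `cosine_similarity` and `toy_vectors` are hypothetical names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, made up for this example (the real model uses 300 dimensions).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "potato": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_potato = cosine_similarity(toy_vectors["king"], toy_vectors["potato"])
sim_king_king = cosine_similarity(toy_vectors["king"], toy_vectors["king"])

# Vectors pointing in similar directions score close to 1, unrelated ones close to 0.
print(sim_king_queen, sim_king_potato, sim_king_king)
```

Because only the direction of the vectors matters, the score does not depend on vector length.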

Next we will see how to find the words most similar to a given set of words.

In [None]:
model.most_similar(["king", "queen"], topn=5)

[('monarch', 0.7042067050933838),
 ('kings', 0.6780861616134644),
 ('princess', 0.6731551885604858),
 ('queens', 0.6679497957229614),
 ('prince', 0.6435247659683228)]

In [None]:
model.most_similar(["tomato", "carrot"], topn=5)

[('carrots', 0.7536594867706299),
 ('tomatoes', 0.7129638195037842),
 ('celery', 0.7025030851364136),
 ('broccoli', 0.6796350479125977),
 ('cherry_tomatoes', 0.662927508354187)]
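When several words are passed, `most_similar` works roughly as follows: it averages the unit-length query vectors and ranks the rest of the vocabulary by cosine similarity to that average, leaving out the query words themselves. The sketch below illustrates the idea under that assumption. The vectors are toy values and `most_similar_sketch` is a hypothetical helper, not Gensim's implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def most_similar_sketch(vectors, query_words, topn=2):
    """Rank non-query words by cosine similarity to the mean query vector."""
    query_units = [unit(vectors[w]) for w in query_words]
    mean = [sum(component) / len(query_units) for component in zip(*query_units)]
    candidates = [
        (word, cosine(vec, mean))
        for word, vec in vectors.items()
        if word not in query_words  # the query words are excluded from the results
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, invented for illustration only.
toy_vectors = {
    "tomato": [0.9, 0.2, 0.1],
    "carrot": [0.8, 0.4, 0.0],
    "celery": [0.7, 0.4, 0.1],
    "king": [0.1, 0.1, 0.9],
}

ranking = most_similar_sketch(toy_vectors, ["tomato", "carrot"])
print(ranking)
```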

You can also do more interesting things, such as finding which word does not belong in a list.

In [None]:
model.doesnt_match(["summer", "fall", "spring", "air"])

'air'
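One way to picture `doesnt_match` is as a search for the word farthest from the centre of the group: take the mean of the unit-length vectors and return the word with the lowest similarity to it. The sketch below follows that idea with toy vectors. `doesnt_match_sketch` is a hypothetical helper and the numbers are made up, so treat it as an approximation of what the library does.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match_sketch(vectors, words):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    mean = [sum(component) / len(words) for component in zip(*units.values())]

    def alignment(word):
        # Dot product with the mean; ordering matches cosine similarity
        # because every word vector has length 1.
        return sum(a * b for a, b in zip(units[word], mean))

    return min(words, key=alignment)

# Toy vectors, invented for illustration only.
toy_vectors = {
    "summer": [0.9, 0.1, 0.2],
    "fall": [0.8, 0.2, 0.1],
    "spring": [0.85, 0.15, 0.2],
    "air": [0.1, 0.9, 0.3],
}

odd_one_out = doesnt_match_sketch(toy_vectors, ["summer", "fall", "spring", "air"])
print(odd_one_out)
```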

## Exercises

1. Use the word2vec model to rank the following 15 words by their similarity to the words "man" and "woman". For each pair, print the similarity.

In [None]:
words = [
"wife",
"husband",
"child",
"queen",
"king",
"man",
"woman",
"birth",
"doctor",
"nurse",
"teacher",
"professor",
"engineer",
"scientist",
"president"]
# Collect (word, similarity to "man", similarity to "woman") for every word.
ref = ["man", "woman"]
lst = []
for word in words:
    lst.append((word, model.similarity(ref[0], word), model.similarity(ref[1], word)))
# Rank the same tuples by each reference word, most similar first.
lst_by_sim_man = sorted(lst, key=lambda x: x[1], reverse=True)
lst_by_sim_woman = sorted(lst, key=lambda x: x[2], reverse=True)

print("By similarity to man")
print(lst_by_sim_man)
print("By similarity to woman")
print(lst_by_sim_woman)



By similarity to man
[('man', 1.0, 0.76640123), ('woman', 0.76640123, 1.0), ('husband', 0.34499747, 0.49281383), ('wife', 0.32920915, 0.444824), ('child', 0.31633338, 0.47500372), ('doctor', 0.31448963, 0.37945858), ('nurse', 0.2547229, 0.44135594), ('teacher', 0.25000125, 0.31357846), ('king', 0.22942673, 0.12847973), ('queen', 0.16658202, 0.31618136), ('scientist', 0.15824963, 0.15486898), ('engineer', 0.15128928, 0.09435377), ('birth', 0.11078789, 0.21471293), ('professor', 0.09415862, 0.13077852), ('president', 0.028424604, 0.062676705)]
By similarity to woman
[('woman', 0.76640123, 1.0), ('man', 1.0, 0.76640123), ('husband', 0.34499747, 0.49281383), ('child', 0.31633338, 0.47500372), ('wife', 0.32920915, 0.444824), ('nurse', 0.2547229, 0.44135594), ('doctor', 0.31448963, 0.37945858), ('queen', 0.16658202, 0.31618136), ('teacher', 0.25000125, 0.31357846), ('birth', 0.11078789, 0.21471293), ('scientist', 0.15824963, 0.15486898), ('professor', 0.09415862, 0.13077852), ('king', 0.2294

**2. Complete the following analogies on your own (without using the model)**

a. king is to throne as judge is to court

b. giant is to dwarf as genius is to idiot

c. French is to France as Spaniard is to Spain

d. bad is to good as sad is to happy

e. nurse is to hospital as teacher is to school

f. universe is to planet as house is to room

**3. Now complete the analogies using a word2vec model**

Here is an example of how to do it. You can solve analogies of the form "A is to B as C is to _" by computing B + C - A.

In [None]:
# man is to woman as king is to ___?
model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)

[('queen', 0.7118193507194519)]

In [None]:
# USA is to burger as Mexico is to ___?
model.most_similar(positive=["Mexico", "burger"], negative=["USA"], topn=1)

[('taco', 0.6266060471534729)]

In [None]:
"""
a. king is to throne as judge is to court
b. giant is to dwarf as genius is to idiot
c. French is to France as Spaniard is to Spain
d. bad is to good as sad is to happy
e. nurse is to hospital as teacher is to school
f. universe is to planet as house is to room
"""
# Each row is [A, B, C, expected D] for "A is to B as C is to D".
analogies = [["king", "throne", "judge", "court"],
             ["giant", "dwarf", "genius", "idiot"],
             ["French", "France", "Spaniard", "Spain"],
             ["bad", "good", "sad", "happy"],
             ["nurse", "hospital", "teacher", "school"],
             ["universe", "planet", "house", "room"]]
for a, b, c, expected in analogies:
    # D is approximated by the word closest to B + C - A.
    pred = model.most_similar(positive=[c, b], negative=[a], topn=1)
    print(a, " is to ", b, " as ", c, " is to ", pred[0][0], ", (vs. ", expected, ")")

king  is to  throne  as  judge  is to  appellate_court , (vs.  court )
giant  is to  dwarf  as  genius  is to  savant , (vs.  idiot )
French  is to  France  as  Spaniard  is to  rider_Dani_Pedrosa , (vs.  Spain )
bad  is to  good  as  sad  is to  wonderful , (vs.  happy )
nurse  is to  hospital  as  teacher  is to  school , (vs.  school )
universe  is to  planet  as  house  is to  bungalow , (vs.  room )
