<a href="https://colab.research.google.com/github/l12maro/nlp-de-cero-a-cien/blob/embeddings/Word2vec_done.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2vec with Gensim

In this Jupyter notebook you will use the [Gensim](https://radimrehurek.com/gensim/index.html) library to experiment with word2vec. The notebook focuses on the intuition behind the concepts rather than on implementation details, and it is inspired by this [guide](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html).

## 1. Installing Gensim and loading the model

In [4]:
!pip install --upgrade gensim

Collecting gensim
  Downloading gensim-4.0.1-cp37-cp37m-manylinux1_x86_64.whl (23.9 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.0.1


In [5]:
import gensim.downloader as api



In [6]:
model = api.load('word2vec-google-news-300')



## 2. Word similarity

In this section we will see how to compute the similarity between two words using a pretrained word embedding.

In [None]:
model.similarity("king", "queen")

0.6510957

In [None]:
model.similarity("king", "man")

0.22942673

In [None]:
model.similarity("king", "potato")

0.09978465

In [None]:
model.similarity("king", "king")

1.0
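The scores returned by `similarity` are cosine similarities between the two word vectors, which is why a word compared with itself scores 1.0. The sketch below recomputes that measure by hand. The three-dimensional vectors in `toy` are invented for illustration and are not taken from the Google News model, and `cosine_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", invented for illustration only
toy = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "potato": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy["king"], toy["queen"])
sim_king_potato = cosine_similarity(toy["king"], toy["potato"])
sim_king_king = cosine_similarity(toy["king"], toy["king"])
print(sim_king_queen, sim_king_potato, sim_king_king)
```

Vectors that point in nearly the same direction score close to 1, while unrelated directions score close to 0, regardless of vector length.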

Next we will see how to find the words most similar to a given set of words.

In [None]:
model.most_similar(["king", "queen"], topn=5)

[('monarch', 0.7042067050933838),
 ('kings', 0.6780861616134644),
 ('princess', 0.6731551885604858),
 ('queens', 0.6679497957229614),
 ('prince', 0.6435247659683228)]

In [None]:
model.most_similar(["tomato", "carrot"], topn=5)

[('carrots', 0.7536594867706299),
 ('tomatoes', 0.7129638195037842),
 ('celery', 0.7025030851364136),
 ('broccoli', 0.6796350479125977),
 ('cherry_tomatoes', 0.662927508354187)]

You can also do other interesting things, such as finding which word does not belong in a list.

In [None]:
model.doesnt_match(["summer", "fall", "spring", "air"])

'air'
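One way to build intuition for `doesnt_match` is to average the normalized vectors of all the words and pick the word that is least aligned with that average. Gensim works along these lines as far as I know, but treat the sketch below as an illustration rather than the library's exact implementation. The two-dimensional vectors and the helper names `unit` and `odd_one_out` are assumptions made up for this example.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the key whose unit vector is least aligned with the mean unit vector."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    # Average the unit vectors, then normalize the average itself
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # Dot product with the mean equals cosine similarity, since both are unit length
    scores = {word: sum(a * b for a, b in zip(u, mean)) for word, u in units.items()}
    return min(scores, key=scores.get)

# Hypothetical 2-dimensional "embeddings", invented for illustration only
toy = {
    "summer": [0.9, 0.1],
    "fall": [0.8, 0.2],
    "spring": [0.85, 0.15],
    "air": [0.1, 0.9],
}

outlier = odd_one_out(toy)
print(outlier)
```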

## Exercises

1. Use the word2vec model to rank the following 15 words by their similarity to the words "man" and "woman". For each pair, print the similarity.

In [7]:
words = [
"wife",
"husband",
"child",
"queen",
"king",
"man",
"woman",
"birth",
"doctor",
"nurse",
"teacher",
"professor",
"engineer",
"scientist",
"president"]

In [11]:
# Similarity of every word in the list to "man" and to "woman"
ranking_man = {}
ranking_woman = {}
for t in words:
  ranking_man[t] = model.similarity("man", t)
  ranking_woman[t] = model.similarity("woman", t)
# Print the words from most to least similar to "man"
ranked_man = sorted(ranking_man, key=ranking_man.get, reverse=True)
for key in ranked_man:
  print(key, ranking_man[key])
print("-------------------")
# Print the words from most to least similar to "woman"
ranked_woman = sorted(ranking_woman, key=ranking_woman.get, reverse=True)
for key in ranked_woman:
  print(key, ranking_woman[key])

man 1.0
woman 0.76640123
husband 0.34499747
wife 0.32920915
child 0.31633338
doctor 0.31448963
nurse 0.2547229
teacher 0.25000125
king 0.22942673
queen 0.16658202
scientist 0.15824963
engineer 0.15128928
birth 0.11078789
professor 0.09415862
president 0.028424604
-------------------
woman 1.0
man 0.76640123
husband 0.49281383
child 0.47500372
wife 0.444824
nurse 0.44135594
doctor 0.37945858
queen 0.31618136
teacher 0.31357846
birth 0.21471293
scientist 0.15486898
professor 0.13077852
king 0.12847973
engineer 0.09435377
president 0.062676705


**2. Complete the following analogies on your own (without using the model)**

a. king is to throne as judge is to _

b. giant is to dwarf as genius is to _

c. French is to France as Spaniard is to _

d. bad is to good as sad is to _

e. nurse is to hospital as teacher is to _

f. universe is to planet as house is to _

Solution 2:

a. chair

b. stupid / dumb 

c. Spain

d. happy

e. school / high school

f. room


**3. Now complete the analogies using a word2vec model**

Here is an example of how to do it. You can solve analogies of the form "A is to B as C is to _" by computing B - A + C.

In [None]:
# man is to woman as king is to ___?
model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)

[('queen', 0.7118193507194519)]
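The call above adds the vectors for "king" and "woman", subtracts the vector for "man", and returns the nearest word that is not one of the inputs. The sketch below reproduces that arithmetic by hand under simplifying assumptions: the three-dimensional vectors are invented (roughly "male", "female", "royalty" axes), and `toy_analogy` is a hypothetical helper, not a Gensim function.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def toy_analogy(vectors, a, b, c):
    """Solve 'a is to b as c is to ___' by ranking words against b - a + c."""
    ua, ub, uc = unit(vectors[a]), unit(vectors[b]), unit(vectors[c])
    target = [y - x + z for x, y, z in zip(ua, ub, uc)]
    # Skip the query words themselves so they cannot be returned as the answer
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical 3-dimensional "embeddings", invented for illustration only
toy = {
    "man": [1.0, 0.0, 0.1],
    "woman": [0.0, 1.0, 0.1],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "potato": [0.2, 0.2, 0.0],
}

answer = toy_analogy(toy, "man", "woman", "king")
print(answer)
```

Excluding the input words matters: without that filter, the nearest vector to the target is often one of the query words.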

In [None]:
# USA is to burger as Mexico is to ___?
model.most_similar(positive=["Mexico", "burger"], negative=["USA"], topn=1)

[('taco', 0.6266060471534729)]

In [12]:
# king is to throne as judge is to ___?
model.most_similar(positive=["throne", "judge"], negative=["king"], topn=1)

[('appellate_court', 0.5845253467559814)]

In [None]:
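# giant is to dwarf as genius is to ___? (the query spells it "genious", so the output below is for that token)
model.most_similar(positive=["dwarf", "genious"], negative=["giant"], topn=1)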

[('overated', 0.4708128571510315)]

In [None]:
# French is to France as Spaniard is to ___?
model.most_similar(positive=["France", "Spaniard"], negative=["French"], topn=1)

[('rider_Dani_Pedrosa', 0.5646752119064331)]

In [None]:
# bad is to good as sad is to ___?
model.most_similar(positive=["good", "sad"], negative=["bad"], topn=1)

[('wonderful', 0.6414927840232849)]

In [None]:
# nurse is to hospital as teacher is to ___?
model.most_similar(positive=["hospital", "teacher"], negative=["nurse"], topn=1)

[('school', 0.60170978307724)]

In [None]:
# universe is to planet as house is to ___?
model.most_similar(positive=["planet", "house"], negative=["universe"], topn=1)

[('bungalow', 0.5428240299224854)]
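The six cells above repeat the same pattern, so it can be convenient to wrap it in a small function. The sketch below is one possible way to do that. `analogy`, `solve_all` and `exercise_analogies` are hypothetical names introduced here, not part of Gensim, and the only library call used is `most_similar` with the `positive`, `negative` and `topn` arguments already shown in this notebook.

```python
def analogy(model, a, b, c, topn=1):
    """Answer 'a is to b as c is to ___' with a Gensim-style model.

    Hypothetical helper: it only forwards to `most_similar`,
    adding b and c and subtracting a.
    """
    return model.most_similar(positive=[b, c], negative=[a], topn=topn)

# The six exercise analogies as (a, b, c) triples
exercise_analogies = [
    ("king", "throne", "judge"),
    ("giant", "dwarf", "genius"),
    ("French", "France", "Spaniard"),
    ("bad", "good", "sad"),
    ("nurse", "hospital", "teacher"),
    ("universe", "planet", "house"),
]

def solve_all(model, analogies):
    """Return a dict mapping each (a, b, c) triple to the model's top answer."""
    return {(a, b, c): analogy(model, a, b, c)[0][0] for a, b, c in analogies}
```

With the model loaded earlier, `solve_all(model, exercise_analogies)` would run all six queries in one go. This list spells "genius" correctly, so its answer for that analogy may differ from the output shown above for "genious".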