<a href="https://colab.research.google.com/github/manmorjim/ai_playground/blob/main/somos_nlp/nlp_0_100/1_word_embeddings/word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2vec with Gensim

In this Jupyter notebook you will use the [Gensim](https://radimrehurek.com/gensim/index.html) library to experiment with word2vec. The notebook focuses on the intuition behind the concepts rather than on implementation details, and it is inspired by this [guide](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html).

## 1. Installing Gensim and loading the model

In [1]:
!pip install --upgrade gensim


In [2]:
import gensim.downloader as api

In [3]:
model = api.load('word2vec-google-news-300')



## 2. Word similarity

In this section we will see how to compute the similarity between two words using a pretrained word embedding.

In [4]:
model.similarity("king", "queen")

0.6510957

In [5]:
model.similarity("king", "man")

0.22942673

In [6]:
model.similarity("king", "potato")

0.09978465

In [7]:
model.similarity("king", "king")

1.0
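Gensim documents `similarity` as the cosine similarity between the two word vectors. The sketch below computes the same measure by hand on toy data. The helper `cosine_similarity` and the three-dimensional `toy_vectors` are made up for illustration; they are not part of Gensim, and the numbers are not taken from the real model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors, invented for this example
# (the pretrained model uses 300 dimensions).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "potato": [0.1, 0.0, 0.9],
}

for word in ("queen", "potato", "king"):
    score = cosine_similarity(toy_vectors["king"], toy_vectors[word])
    print(f"king vs {word}: {score:.3f}")
```

With the model loaded, `cosine_similarity(model["king"], model["queen"])` should give approximately the value returned by `model.similarity("king", "queen")`.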

Next, we will see how to find the words most similar to a given set of words.

In [8]:
model.most_similar(["king", "queen"], topn=5)

[('monarch', 0.7042067050933838),
 ('kings', 0.6780861616134644),
 ('princess', 0.6731551885604858),
 ('queens', 0.6679497957229614),
 ('prince', 0.6435247659683228)]

In [9]:
model.most_similar(["tomato", "carrot"], topn=5)

[('carrots', 0.7536594867706299),
 ('tomatoes', 0.7129638195037842),
 ('celery', 0.7025030851364136),
 ('broccoli', 0.6796350479125977),
 ('cherry_tomatoes', 0.662927508354187)]
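One way to picture what happens when several words are passed in: the query vectors are averaged, and every other word is ranked by cosine similarity to that average. The sketch below follows this idea on toy data. It assumes, as we understand Gensim to do, that vectors are length-normalized before averaging and that the query words themselves are left out of the results. `most_similar_sketch`, `unit`, and `toy_vectors` are hypothetical names and values, not Gensim's implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar_sketch(vectors, query_words, topn=5):
    """Rank words by cosine similarity to the mean of the query vectors."""
    normalized = [unit(vectors[w]) for w in query_words]
    dims = len(normalized[0])
    mean = unit([sum(v[i] for v in normalized) / len(normalized) for i in range(dims)])
    scores = [
        (word, sum(a * b for a, b in zip(mean, unit(vec))))
        for word, vec in vectors.items()
        if word not in query_words  # assumption: query words are skipped
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical vectors, invented for this example.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "monarch": [0.85, 0.85, 0.1],
    "carrot": [0.1, 0.2, 0.9],
    "tomato": [0.2, 0.1, 0.9],
}

print(most_similar_sketch(toy_vectors, ["king", "queen"], topn=2))
```

In this toy vocabulary, "monarch" ranks first because its vector lies almost exactly between "king" and "queen".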

You can also do more interesting things, such as finding which word does not belong in a list.

In [10]:
model.doesnt_match(["summer", "fall", "spring", "air"])

'air'
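A simple way to think about this: the odd one out is the word whose vector is least similar to the average of the group. The sketch below implements that idea on toy data, under the assumption that this is close to how Gensim approaches it. `doesnt_match_sketch`, `unit`, and `toy_vectors` are hypothetical and only serve the illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match_sketch(vectors, words):
    """Return the word whose vector is least similar to the group's mean."""
    normalized = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(normalized.values())))
    mean = unit([sum(v[i] for v in normalized.values()) / len(words) for i in range(dims)])
    # Lowest cosine similarity to the mean = the outlier.
    return min(words, key=lambda w: sum(a * b for a, b in zip(normalized[w], mean)))


# Hypothetical vectors: the three seasons point in a similar direction, "air" does not.
toy_vectors = {
    "summer": [0.9, 0.1, 0.1],
    "fall": [0.8, 0.2, 0.1],
    "spring": [0.85, 0.15, 0.2],
    "air": [0.1, 0.9, 0.3],
}

print(doesnt_match_sketch(toy_vectors, ["summer", "fall", "spring", "air"]))
```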

## Exercises

1. Use the word2vec model to rank the following 15 words by their similarity to the words "man" and "woman". For each pair, print the similarity.

In [17]:
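# Note: most_similar searches the model's whole vocabulary, so these are the
# 15 nearest neighbours of each word, not a ranking of the exercise's word list.
common_words_man = model.most_similar(['man'], topn=15)
common_words_woman = model.most_similar(['woman'], topn=15)

# Print the two neighbour lists side by side.
for m, w in zip(common_words_man, common_words_woman):
    print(m, w)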

('woman', 0.7664012908935547) ('man', 0.7664012908935547)
('boy', 0.6824871301651001) ('girl', 0.7494640946388245)
('teenager', 0.6586930155754089) ('teenage_girl', 0.7336829304695129)
('teenage_girl', 0.6147903203964233) ('teenager', 0.6317085027694702)
('girl', 0.5921714305877686) ('lady', 0.6288785934448242)
('suspected_purse_snatcher', 0.571636438369751) ('teenaged_girl', 0.6141784191131592)
('robber', 0.5585119128227234) ('mother', 0.6076306104660034)
('Robbery_suspect', 0.5584409832954407) ('policewoman', 0.6069462299346924)
('teen_ager', 0.5549196600914001) ('boy', 0.5975907444953918)
('men', 0.5489763021469116) ('Woman', 0.5770983099937439)
('horribly_horribly_deranged', 0.5426712036132812) ('sexually_assualted', 0.5723768472671509)
('guy', 0.5420035123825073) ('she', 0.5641393661499023)
('person', 0.5342026352882385) ('Leah_Questin', 0.5481955409049988)
('gentleman', 0.5337990522384644) ('WOMAN', 0.5480420589447021)
('knife_wielding_thief', 0.5337865352630615) ('person', 0.547

In [None]:
words = [
"wife",
"husband",
"child",
"queen",
"king",
"man",
"woman",
"birth",
"doctor",
"nurse",
"teacher",
"professor",
"engineer",
"scientist",
"president"]

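To rank this specific list rather than the whole vocabulary, one possible approach is to score each word against the target and sort the scores. The sketch below is one way to do it, not the reference solution. `rank_by_similarity` and `print_rankings` are hypothetical helpers that accept any two-word scoring function, such as the `model.similarity` method used in section 2.

```python
def rank_by_similarity(similarity, target, words):
    """Sort `words` from most to least similar to `target`.

    `similarity` is any callable that takes two words and returns a score.
    """
    scored = [(word, float(similarity(target, word))) for word in words]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def print_rankings(similarity, targets, words):
    """Print one ranking of `words` per target word."""
    for target in targets:
        print(f"Similarity to {target!r}:")
        for word, score in rank_by_similarity(similarity, target, words):
            print(f"  {word:<10} {score:.4f}")
```

With the model loaded, `print_rankings(model.similarity, ["man", "woman"], words)` would print both rankings. Because "man" and "woman" are themselves in the list, each one appears in its own ranking with a similarity of about 1.0.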
**2. Complete the following analogies on your own (without using the model)**

a. king is to throne as judge is to _ *tribune*

b. giant is to dwarf as genius is to _ *stupid*

c. French is to France as Spaniard is to _ *Spain*

d. bad is to good as sad is to _ *happy*

e. nurse is to hospital as teacher is to _ *school*

f. universe is to planet as house is to _ *furniture*

**3. Now complete the analogies using a word2vec model**

Here is an example of how to do it. You can solve analogies of the form "A is to B as C is to _" by computing B + C - A.

In [19]:
# man is to woman as king is to ___?
model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)

[('queen', 0.7118193507194519)]
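To make the B + C - A arithmetic concrete, the sketch below applies it to hand-made two-dimensional vectors and then picks the nearest remaining word by cosine similarity. `solve_analogy`, `cosine_similarity`, and `toy_vectors` are hypothetical and exist only for this illustration. As far as we know, Gensim also normalizes the vectors before combining them, so this raw-vector version is a simplification.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def solve_analogy(vectors, a, b, c):
    """Answer "a is to b as c is to _" with the word closest to b + c - a."""
    target = [vb + vc - va for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


# Hypothetical 2-D vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "man": [0.1, -1.0],
    "woman": [0.1, 1.0],
    "king": [1.0, -1.0],
    "queen": [1.0, 1.0],
    "potato": [-1.0, 0.0],
}

# man is to woman as king is to ___?
print(solve_analogy(toy_vectors, "man", "woman", "king"))
```

Here woman + king - man lands exactly on the toy vector for "queen"; with real embeddings the match is only approximate, which is why the model reports a similarity score below 1.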

In [27]:
# USA is to burger as Italy is to ___?
model.most_similar(positive=["Italy", "burger"], negative=["USA"], topn=1)

[('panino', 0.5671379566192627)]

In [24]:
# king is to throne as judge is to _
model.most_similar(positive=["throne", "judge"], negative=["king"], topn=1)

[('appellate_court', 0.5845253467559814)]

In [25]:
# giant is to dwarf as genius is to _
model.most_similar(positive=["dwarf", "genius"], negative=["giant"], topn=1)

[('savant', 0.44152510166168213)]

In [26]:
# French is to France as Spaniard is to _
model.most_similar(positive=["France", "Spaniard"], negative=["French"], topn=1)

[('rider_Dani_Pedrosa', 0.5646752119064331)]

In [28]:
# bad is to good as sad is to _
model.most_similar(positive=["good", "sad"], negative=["bad"], topn=1)

[('wonderful', 0.6414927840232849)]

In [29]:
# nurse is to hospital as teacher is to _
model.most_similar(positive=["hospital", "teacher"], negative=["nurse"], topn=1)

[('school', 0.60170978307724)]

In [30]:
# universe is to planet as house is to _
model.most_similar(positive=["planet", "house"], negative=["universe"], topn=1)

[('bungalow', 0.5428240299224854)]