# Research

This notebook compares several word embedding libraries. The goal is to learn the features of each library so that we can choose the best approach for our project.

We are going to look at these libraries:
- fasttext
- gensim
- nltk
- spaCy

In [8]:
import warnings
warnings.filterwarnings('ignore')
import fasttext
import gensim
import nltk
import spacy

In [2]:
board = {
    'red': ['prensa', 'falda', 'cuchillo', 'misil', 'archivo', 'red', 'bosque', 'embajada'],
    'blue': ['plomo', 'punta', 'puerto', 'pendiente', 'gancho', 'bota', 'botella', 'Londres', 'zanahoria'],
    'neutral': ['pirata', 'olimpo', 'goma', 'paracaídas', 'cólera', 'ruleta', 'Argentina'],
    'murderer': ['estado']
}

## fasttext

First, we test `fasttext` with a pretrained Spanish word embedding.

In [21]:
import fasttext.util

FAST_TEXT_MODEL = "cc.es.300.bin" # Model name in fasttext

fasttext.util.download_model('es', if_exists='ignore')  # Spanish
ft = fasttext.load_model(FAST_TEXT_MODEL)



In [24]:
ft.get_nearest_neighbors('ropa')

[(0.7451867461204529, '-ropa'),
 (0.7219822406768799, 'ropita'),
 (0.7006562948226929, 'ropas'),
 (0.6829654574394226, 'lencería'),
 (0.6828590035438538, 'indumentaria'),
 (0.6634566187858582, 'ropa.'),
 (0.6600044369697571, 'prendas'),
 (0.6464620232582092, 'laropa'),
 (0.6463090181350708, 'ropa.Y'),
 (0.6461246609687805, 'ropa.La')]

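As you can see, the result is not very useful: most neighbours are surface variants of the query (`-ropa`, `ropas`, `ropa.`, `laropa`) rather than related words.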
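
One way to make this output more usable is to drop the neighbours that are only surface variants of the query. The sketch below is a minimal example of that idea, not part of `fasttext`: `filter_neighbors` is a hypothetical helper, and the two filtering rules (letters only, query not contained in the token) are assumptions that may be too strict or too loose for other words.

```python
def filter_neighbors(neighbors, query):
    """Drop neighbours that look like surface variants of `query`.

    `neighbors` is a list of (score, word) tuples, the shape returned by
    `ft.get_nearest_neighbors`. Hypothetical helper, not a fasttext API.
    """
    query = query.lower()
    kept = []
    for score, word in neighbors:
        lowered = word.lower()
        # Skip tokens with anything other than letters (e.g. '-ropa', 'ropa.Y').
        if not lowered.isalpha():
            continue
        # Skip tokens that contain the query itself (e.g. 'ropas', 'laropa').
        if query in lowered:
            continue
        kept.append((score, word))
    return kept

# Sample copied from the output above, so the sketch runs without the model.
sample = [
    (0.7451867461204529, '-ropa'),
    (0.7219822406768799, 'ropita'),
    (0.7006562948226929, 'ropas'),
    (0.6829654574394226, 'lencería'),
    (0.6828590035438538, 'indumentaria'),
    (0.6634566187858582, 'ropa.'),
    (0.6600044369697571, 'prendas'),
    (0.6464620232582092, 'laropa'),
    (0.6463090181350708, 'ropa.Y'),
    (0.6461246609687805, 'ropa.La'),
]
filtered = filter_neighbors(sample, 'ropa')
print([word for _, word in filtered])
```

Diminutives such as `ropita` still pass these rules, so a stricter filter (for example, one based on lemmas) would be needed to remove them.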

### Using gensim with FastText

We can load the FastText model with gensim to get the most similar words.

More info [here](https://radimrehurek.com/gensim/models/fasttext.html)

In [25]:
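from gensim.models import FastText
# Note: `load_fasttext_format` is deprecated in newer gensim releases and was
# removed in gensim 4; there, use `gensim.models.fasttext.load_facebook_model`.
model = FastText.load_fasttext_format(FAST_TEXT_MODEL)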

In [40]:
model.wv.most_similar(positive=board['red'])

[('trozo', 0.5222485065460205),
 ('moto-sierra', 0.5117003917694092),
 ('supercañón', 0.5076452493667603),
 ('avión', 0.5039247274398804),
 ('taladora', 0.5023553371429443),
 ('helicóptero', 0.5008259415626526),
 ('navajilla', 0.49977535009384155),
 ('arma', 0.49449434876441956),
 ('cortadero', 0.4942677915096283),
 ('pilotillo', 0.4931870698928833)]

In [41]:
similar_words = model.wv.most_similar(positive=board['red'], negative=board['murderer'])

In [56]:
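import numpy as np

# For each candidate clue, compute the Euclidean distance between its
# fastText vector and the vector of every red word on the board.
results = []
for candidate, _score in similar_words:
    candidate_vector = ft.get_word_vector(candidate)
    results.append({
        'word': candidate,
        'results': [
            {'word': w, 'distance': np.linalg.norm(ft.get_word_vector(w) - candidate_vector)}
            for w in board['red']
        ],
    })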

In [58]:
results

[{'word': 'moto-sierra',
  'results': [{'word': 'prensa', 'distance': 0.8837527},
   {'word': 'falda', 'distance': 1.1775761},
   {'word': 'cuchillo', 'distance': 0.6353251},
   {'word': 'misil', 'distance': 1.3958523},
   {'word': 'archivo', 'distance': 0.86589664},
   {'word': 'red', 'distance': 1.8994608},
   {'word': 'bosque', 'distance': 0.8908189},
   {'word': 'embajada', 'distance': 0.76522934}]},
 {'word': 'navajilla',
  'results': [{'word': 'prensa', 'distance': 0.9269706},
   {'word': 'falda', 'distance': 1.155475},
   {'word': 'cuchillo', 'distance': 0.6195383},
   {'word': 'misil', 'distance': 1.4194101},
   {'word': 'archivo', 'distance': 0.9224154},
   {'word': 'red', 'distance': 1.8965},
   {'word': 'bosque', 'distance': 0.99281836},
   {'word': 'embajada', 'distance': 0.81782246}]},
 {'word': 'macheta',
  'results': [{'word': 'prensa', 'distance': 0.98485684},
   {'word': 'falda', 'distance': 1.2174498},
   {'word': 'cuchillo', 'distance': 0.59120625},
   {'word': 'misi
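
The nested output above is hard to compare by eye. As a minimal sketch, the distances can be reduced to one row per candidate. `summarise` is a hypothetical helper, and ranking by mean distance is only one possible criterion, assumed here for illustration. The sample holds the two complete entries from the output above.

```python
def summarise(results):
    """Return one summary row per candidate, sorted by mean distance."""
    rows = []
    for entry in results:
        pairs = [(r['distance'], r['word']) for r in entry['results']]
        distances = [d for d, _ in pairs]
        rows.append({
            'word': entry['word'],
            'mean': sum(distances) / len(distances),
            'closest': min(pairs)[1],   # board word with the smallest distance
            'farthest': max(pairs)[1],  # board word with the largest distance
        })
    return sorted(rows, key=lambda row: row['mean'])

# The two complete entries copied from the output above.
results_sample = [
    {'word': 'moto-sierra', 'results': [
        {'word': 'prensa', 'distance': 0.8837527},
        {'word': 'falda', 'distance': 1.1775761},
        {'word': 'cuchillo', 'distance': 0.6353251},
        {'word': 'misil', 'distance': 1.3958523},
        {'word': 'archivo', 'distance': 0.86589664},
        {'word': 'red', 'distance': 1.8994608},
        {'word': 'bosque', 'distance': 0.8908189},
        {'word': 'embajada', 'distance': 0.76522934}]},
    {'word': 'navajilla', 'results': [
        {'word': 'prensa', 'distance': 0.9269706},
        {'word': 'falda', 'distance': 1.155475},
        {'word': 'cuchillo', 'distance': 0.6195383},
        {'word': 'misil', 'distance': 1.4194101},
        {'word': 'archivo', 'distance': 0.9224154},
        {'word': 'red', 'distance': 1.8965},
        {'word': 'bosque', 'distance': 0.99281836},
        {'word': 'embajada', 'distance': 0.81782246}]},
]

summary = summarise(results_sample)
for row in summary:
    print(row['word'], round(row['mean'], 3), row['closest'], row['farthest'])
```

For both candidates the closest board word is `cuchillo` and the farthest is `red`, which suggests these clues describe one word well rather than the whole red group.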

In [60]:
model.wv.most_similar(positive=['cuchillo', 'bosque'], negative=board['murderer'])

[('cuchillito', 0.5495517253875732),
 ('hacha', 0.537045955657959),
 ('cuhillo', 0.5332930684089661),
 ('cuchillos', 0.5165868997573853),
 ('machete', 0.5149796605110168),
 ('chuchillo', 0.5114734172821045),
 ('cuchilo', 0.5101864337921143),
 ('puñal', 0.5012166500091553),
 ('bosqueque', 0.4940257668495178),
 ('bosqueEl', 0.48760658502578735)]

In [88]:
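from itertools import combinations
candidate_words = board['red']
all_triples = combinations(candidate_words, 3)
# Note: each triple is scored only by the similarity of its first two words;
# the third word does not affect the score, so ties are broken by tuple order.
scored_triples = [(model.wv.similarity(t[0], t[1]), t)
                  for t in all_triples]
sorted_triples = sorted(scored_triples, reverse=True)
print(sorted_triples[0])  # first item is the highest-scoring triple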

(0.27643186, ('prensa', 'red', 'embajada'))
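
The cell above scores each triple using only its first two words. A possible alternative, sketched below, is to score a group by the mean similarity over all of its pairs. `mean_pairwise_similarity` and `best_group` are hypothetical helpers, and the toy vectors are invented so the sketch runs without the model; with the real model, `model.wv.similarity` could be passed as the `similarity` argument.

```python
from itertools import combinations
from math import sqrt

def mean_pairwise_similarity(words, similarity):
    """Average `similarity(a, b)` over every pair of words in the group."""
    pairs = list(combinations(words, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

def best_group(words, size, similarity):
    """Return the group of `size` words with the highest mean pairwise similarity."""
    return max(combinations(words, size),
               key=lambda group: mean_pairwise_similarity(group, similarity))

# Toy 2-d vectors (made up for illustration, not taken from any model).
toy_vectors = {
    'a': (1.0, 0.0),
    'b': (0.9, 0.1),
    'c': (0.8, 0.2),
    'd': (0.0, 1.0),
}

def toy_similarity(w1, w2):
    """Cosine similarity between two toy vectors."""
    v1, v2 = toy_vectors[w1], toy_vectors[w2]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = sqrt(sum(x * x for x in v1))
    norm2 = sqrt(sum(y * y for y in v2))
    return dot / (norm1 * norm2)

best = best_group(list(toy_vectors), 3, toy_similarity)
print(best)
```

With this scoring every word in the group contributes, so a triple cannot rank first just because two of its words are close.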


In [89]:
model.wv.most_similar(positive=['prensa', 'red', 'embajada'], negative=board['murderer'])

[('Embajada', 0.5810344219207764),
 ('comunciación', 0.5428721904754639),
 ('legación', 0.5371121168136597),
 ('Prensa', 0.5246720910072327),
 ('prensa.La', 0.5230076909065247),
 ('prensaLa', 0.5183517932891846),
 ('cablegráfica', 0.5047746896743774),
 ('cancillería', 0.49862468242645264),
 ('Cancillería', 0.4959728419780731),
 ('comuniación', 0.4955418109893799)]

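## nltk

Next, we use `nltk` with WordNet and the Open Multilingual WordNet (`omw`) to look up Spanish words.

In [94]: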
nltk.download('wordnet')
nltk.download('omw')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/gallardo/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw to /Users/gallardo/nltk_data...
[nltk_data]   Unzipping corpora/omw.zip.


True

In [98]:
from nltk.corpus import wordnet as wn
for syns in wn.synsets('prensa', lang='spa'):
    print(syns.name(), syns.lemma_names(lang='spa'))
    

press.n.09 ['prensa', 'presión']
press.n.07 ['prensa']
press.n.02 ['prensa']


### Get Synonyms and Antonyms

To improve the model, it is a good idea to also use synonyms and antonyms. We have created these two functions to help us.

In [234]:
def get_synonyms(word, lang='spa', output_lang='spa'):
    """Return the lemmas of every synset of `word`, most frequent first."""
    synonyms = [syns.lemma_names(lang=output_lang) for syns in wn.synsets(word, lang=lang)]
    flat = [item for sublist in synonyms for item in sublist]
    return sorted(set(flat), key=flat.count)[::-1]

def get_antonyms(word, lang='spa'):
    """Return antonyms of `word`, looked up through an English translation."""
    help_lang = 'eng'  # pivot language used to look up antonyms
    translated_word = get_synonyms(word, lang=lang, output_lang=help_lang)
    if not translated_word:  # word not found in WordNet: no antonyms
        return []
    antonyms = []
    # Only the first (one of the most frequent) English translations is used.
    for syns in wn.synsets(translated_word[0], lang=help_lang):
        for lemma in syns.lemmas(lang=help_lang):
            for antonym in lemma.antonyms():
                # Translate each English antonym back to the target language.
                antonyms_lemma = [s.lemma_names(lang=lang) for s in wn.synsets(antonym.name(), lang=help_lang)]
                antonyms.append([item for sublist in antonyms_lemma for item in sublist])
    flat = [item for sublist in antonyms for item in sublist]
    return sorted(set(flat), key=flat.count)[::-1]

In [235]:
print('Synonyms:', get_synonyms('guerra'))
print('Antonyms:', get_antonyms('guerra'))

Synonyms: ['guerra', 'estado_de_guerra']
Antonyms: ['paz', 'reposo', 'tranquilidad', 'tranquilidad_de_espíritu', 'serenidad', 'ataraxia', 'sosiego', 'tratado_de_paz', 'pacificación']
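
One possible use of these helpers, assumed here rather than taken from the notebook, is to build the `positive` and `negative` word lists for `most_similar`. The sketch below uses stand-in data copied from the output above so it runs without WordNet. `single_tokens` is a hypothetical helper, and skipping lemmas that contain `_` rests on the assumption that multi-word lemmas are not single tokens in the embedding vocabulary.

```python
def single_tokens(terms):
    """Keep unique single-word terms, preserving their original order."""
    kept = []
    for term in terms:
        # Multi-word lemmas such as 'estado_de_guerra' are skipped (assumption:
        # they are not single tokens in the embedding vocabulary).
        if '_' not in term and term not in kept:
            kept.append(term)
    return kept

# Stand-ins for get_synonyms('guerra') and get_antonyms('guerra'),
# copied from the output above.
synonyms = ['guerra', 'estado_de_guerra']
antonyms = ['paz', 'reposo', 'tranquilidad', 'tranquilidad_de_espíritu',
            'serenidad', 'ataraxia', 'sosiego', 'tratado_de_paz', 'pacificación']

positive = single_tokens(['guerra'] + synonyms)
negative = single_tokens(antonyms)
print(positive)
print(negative)
```

With the real model loaded, these lists could then be passed as `model.wv.most_similar(positive=positive, negative=negative)`. Whether antonyms work well as negative examples for this project would need to be checked on real boards.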
