# Research

This notebook researches different word embedding libraries. The goal is to learn the features of each library so that we can choose the best approach for our project.

We are going to investigate these libraries:
- fasttext
- gensim
- nltk
- spaCy

In [1]:
import warnings
warnings.filterwarnings('ignore')
import fasttext
import gensim
import nltk
import spacy

In [2]:
board = {
    'red': ['key', 'green', 'force', 'casino', 'hospital', 'robot', 'spell', 'red'],
    'blue': ['eagle', 'fair', 'lap', 'beach', 'back', 'sound', 'bottle', 'hole', 'alien'],
    'neutral': ['dog', 'mole', 'wind', 'apple', 'whale', 'berry', 'pool'],
    'murderer': ['staff']
}

## First step

The first step is to get a clue that identifies a word, and then to try to get the same clue for two words. We will test with 'red' and 'green'; the obvious clue is 'color'. Can we get it?

### fasttext

First, we are going to test `fasttext`. We are going to use a pretrained word embedding.

In [3]:
%%capture
import fasttext.util
FAST_TEXT_MODEL = "cc.en.300.bin" # Model name in fasttext

fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model(FAST_TEXT_MODEL)

In [56]:
ft.get_nearest_neighbors('red')

[(0.8113083243370056, 'blue'),
 (0.8055115342140198, 'yellow'),
 (0.7738474011421204, 'purple'),
 (0.7048869132995605, 'orange'),
 (0.6970385909080505, 'pink'),
 (0.6968856453895569, 'non-red'),
 (0.682639479637146, 'crimson'),
 (0.6794082522392273, 'green'),
 (0.6712507605552673, 'white'),
 (0.6558310389518738, 'light-red')]

In [70]:
ft.get_nearest_neighbors('green')

[(0.7388068437576294, 'greeen'),
 (0.7049704194068909, 'blue'),
 (0.6820159554481506, 'yellow'),
 (0.6794081926345825, 'red'),
 (0.6742905378341675, 'green-'),
 (0.6663212180137634, 'purple'),
 (0.6522365212440491, 'green-ish'),
 (0.6520159840583801, 'green.The'),
 (0.6385214328765869, '-green'),
 (0.6341298818588257, 'orange')]

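As you can see, this is not very useful: the neighbors are other colors or spelling variants of the query word, and 'color' does not appear.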
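
Several of the neighbors above are tokenization debris ('green-', 'green.The', '-green'). A minimal sketch of a post-filter follows, assuming the `(score, word)` tuples returned by `get_nearest_neighbors`. The helper name `clean_neighbors` and its two rules are illustrative assumptions, not part of fasttext.

```python
def clean_neighbors(query, neighbors):
    """Drop neighbors that are not purely alphabetic or that contain the query word."""
    query = query.lower()
    kept = []
    for score, word in neighbors:
        candidate = word.lower()
        if not candidate.isalpha():   # drops 'green-', 'green-ish', 'green.The', '-green'
            continue
        if query in candidate:        # would drop variants such as 'greenish'
            continue
        kept.append((score, word))
    return kept


# Neighbors of 'green' copied from the output above (scores rounded).
green_neighbors = [
    (0.7388, 'greeen'), (0.7050, 'blue'), (0.6820, 'yellow'), (0.6794, 'red'),
    (0.6743, 'green-'), (0.6663, 'purple'), (0.6522, 'green-ish'),
    (0.6520, 'green.The'), (0.6385, '-green'), (0.6341, 'orange'),
]
cleaned_words = [word for _, word in clean_neighbors('green', green_neighbors)]
print(cleaned_words)
```

A misspelling such as 'greeen' survives this filter, and the remaining words are still sibling colors rather than the clue 'color', so filtering alone does not solve the problem.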

### Using gensim with FastText

We can load the FastText model with gensim to get the words most similar to several words at once.

More info [here](https://radimrehurek.com/gensim/models/fasttext.html)

In [5]:
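from gensim.models import FastText
# Note: load_fasttext_format is deprecated in newer gensim releases;
# gensim.models.fasttext.load_facebook_model is the replacement.
model = FastText.load_fasttext_format(FAST_TEXT_MODEL)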

In [71]:
model.wv.most_similar(positive=['red', 'green'])

[('blue', 0.8273435831069946),
 ('yellow', 0.8116556406021118),
 ('purple', 0.7858149409294128),
 ('orange', 0.7306221723556519),
 ('greeen', 0.698380172252655),
 ('pink', 0.6954872608184814),
 ('green-colored', 0.6573896408081055),
 ('white', 0.6534684300422668),
 ('blue-green', 0.6492818593978882),
 ('green-ish', 0.6438031196594238)]
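
When several words are passed in `positive`, gensim roughly averages their unit-length vectors and ranks the vocabulary by cosine similarity to that mean. The sketch below reproduces the idea with NumPy on invented three-dimensional toy vectors. The function names and the vectors are assumptions for illustration and are not taken from the FastText model.

```python
import numpy as np


def unit(vector):
    """Scale a vector to unit length."""
    return vector / np.linalg.norm(vector)


def most_similar_to_mean(vectors, positive, topn=3):
    """Rank every other word by cosine similarity to the mean of the `positive` vectors."""
    mean = unit(np.mean([unit(vectors[word]) for word in positive], axis=0))
    scores = {
        word: float(np.dot(unit(vector), mean))
        for word, vector in vectors.items()
        if word not in positive          # the query words themselves are excluded
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]


# Toy vectors, invented for this example.
toy_vectors = {
    'red': np.array([1.0, 0.1, 0.0]),
    'green': np.array([0.9, 0.3, 0.0]),
    'blue': np.array([0.95, 0.2, 0.05]),
    'hospital': np.array([0.0, 0.2, 1.0]),
}
ranked = most_similar_to_mean(toy_vectors, ['red', 'green'])
print(ranked)
```

This explains why the result is dominated by other colors: the words closest to the mean of 'red' and 'green' are their siblings, not their common category.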

### Get Synonyms and Antonyms

To improve the model, it is a good idea to also use synonyms and antonyms. We have created these two functions to help us.

In [72]:
nltk.download('wordnet')
nltk.download('omw')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/gallardo/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw to /Users/gallardo/nltk_data...
[nltk_data]   Package omw is already up-to-date!


True

In [73]:
from nltk.corpus import wordnet as wn

def get_synonyms(word, lang='eng', output_lang='eng'):
    """Return the lemma names of every synset of `word`, most frequent first."""
    synonyms = [synset.lemma_names(lang=output_lang) for synset in wn.synsets(word, lang=lang)]
    flat = [item.lower() for sublist in synonyms for item in sublist]
    return sorted(set(flat), key=flat.count)[::-1]

def get_antonyms(word, lang='eng'):
    """Return the antonyms of `word` in `lang`, most frequent first."""
    help_lang = 'eng'  # pivot language used for the antonym lookup
    translated_word = get_synonyms(word, lang=lang, output_lang=help_lang)
    antonyms = []
    # Only the top-ranked English synonym is used for the lookup.
    for synset in wn.synsets(translated_word[0], lang=help_lang):
        for lemma in synset.lemmas(lang=help_lang):
            for antonym in lemma.antonyms():
                antonyms_lemma = [s.lemma_names(lang=lang) for s in wn.synsets(antonym.name(), lang=help_lang)]
                antonyms.append([item for sublist in antonyms_lemma for item in sublist])
    flat = [item for sublist in antonyms for item in sublist]
    return sorted(set(flat), key=flat.count)[::-1]
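
Both helpers rank names by how often they occur across synsets. A possible alternative for that ranking step, sketched here with `collections.Counter`, counts in one pass and keeps ties in first-seen order. The name `rank_by_frequency` and the toy input are assumptions for illustration.

```python
from collections import Counter


def rank_by_frequency(nested_names):
    """Flatten lists of lemma names, lowercase them, and rank by frequency."""
    flat = [name.lower() for group in nested_names for name in group]
    # most_common() sorts by count, keeping first-seen order among ties.
    return [name for name, _ in Counter(flat).most_common()]


# Toy lemma-name lists, shaped like the output of synset.lemma_names().
toy_lemma_names = [['red', 'redness'], ['Red', 'Red_River'], ['loss', 'red_ink', 'red']]
ranked_names = rank_by_frequency(toy_lemma_names)
print(ranked_names)  # ['red', 'redness', 'red_river', 'loss', 'red_ink']
```

Unlike sorting a `set`, this ordering is deterministic between runs, which matters because `get_antonyms` only uses the first synonym.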

In [74]:
print('Synonyms:', get_synonyms('red'))
print('Antonyms:', get_antonyms('red'))

Synonyms: ['red', 'crimson', 'red_river', 'ruby-red', 'violent', 'red_ink', 'cherry-red', 'ruddy', 'ruby', 'reddened', 'bolshy', 'cherry', 'blood-red', 'cerise', 'bolshevik', 'scarlet', 'redness', 'reddish', 'flushed', 'carmine', 'loss', 'bolshie', 'red-faced', 'marxist']
Antonyms: ['gain', 'profit', 'win', 'advance', 'make', 'gain_ground', 'hit', 'addition', 'gather', 'earn', 'pull_ahead', 'make_headway', 'acquire', 'realize', 'reach', 'get_ahead', 'bring_in', 'put_on', 'increase', 'attain', 'amplification', 'take_in', 'realise', 'benefit', 'clear', 'derive', 'pull_in', 'arrive_at']


In [75]:
print('Synonyms:', get_synonyms('green'))
print('Antonyms:', get_antonyms('green'))

Synonyms: ['green', 'commons', 'jet', 'immature', 'putting_green', 'putting_surface', 'k', 'super_acid', 'honey_oil', 'greens', 'dark-green', 'greenness', 'fleeceable', 'special_k', 'william_green', 'park', 'cat_valium', 'gullible', 'viridity', 'light-green', 'super_c', 'green_river', 'unripened', 'common', 'leafy_vegetable', 'unripe', 'greenish']
Antonyms: ['ripe', 'good', 'right', 'advanced', 'mature']


In the case of 'red' and 'green', 'color' is not a synonym of either word, so synonyms are not very useful here. They could still be useful in other cases.

### Similarity

We can calculate the similarity between two words, taking the maximum over all pairs of their synonyms.

In [76]:
def calculate_similarity(word1, word2):
    """Return the highest similarity between any synonym of word1 and any synonym of word2."""
    words1 = get_synonyms(word1)
    words2 = get_synonyms(word2)
    max_similarity = 0
    for w1 in words1:
        for w2 in words2:
            max_similarity = max(max_similarity, model.wv.similarity(w1, w2))
    return max_similarity

In [77]:
calculate_similarity('red', 'green')

0.8483072
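
The score alone does not say which pair of synonyms produced it. The sketch below is a variant that also returns the best pair and takes the similarity function as a parameter, so it can be tried without loading a model. The name `best_synonym_pair` and the toy scores are assumptions for illustration.

```python
def best_synonym_pair(words1, words2, similarity):
    """Return (score, (w1, w2)) for the most similar pair across the two lists."""
    best_score, best_pair = float('-inf'), None
    for w1 in words1:
        for w2 in words2:
            score = similarity(w1, w2)
            if score > best_score:
                best_score, best_pair = score, (w1, w2)
    return best_score, best_pair


# Toy scores, invented for this example; with the real model one would
# pass model.wv.similarity instead of toy_similarity.
toy_scores = {
    ('red', 'green'): 0.68, ('red', 'unripe'): 0.12,
    ('crimson', 'green'): 0.41, ('crimson', 'unripe'): 0.05,
}

def toy_similarity(w1, w2):
    return toy_scores[(w1, w2)]

best = best_synonym_pair(['red', 'crimson'], ['green', 'unripe'], toy_similarity)
print(best)  # (0.68, ('red', 'green'))
```

Knowing the winning pair helps to spot cases where the maximum comes from an unrelated sense of one of the words.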

### Combination of words

To play Codenames, it is a good idea to find a combination of similar words and then look for a clue. We can find the best combination among the words on the board.

**Note:** in the next steps we have to ensure that the combination does not conflict with other words on the board.
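
The conflict check mentioned in the note could look like the following sketch: a clue is accepted only if it is closer to every target word than to any other word on the board, by some margin. The function name `is_safe_clue`, the margin value, and the toy similarity table are all assumptions for illustration.

```python
def is_safe_clue(clue, targets, others, similarity, margin=0.1):
    """Accept a clue only if its weakest target beats its strongest non-target by `margin`."""
    weakest_target = min(similarity(clue, word) for word in targets)
    strongest_other = max((similarity(clue, word) for word in others),
                          default=float('-inf'))
    return weakest_target - strongest_other >= margin


# Toy similarities, invented for this example (not model output).
toy_table = {
    ('color', 'red'): 0.8, ('color', 'green'): 0.7,
    ('color', 'apple'): 0.3, ('color', 'staff'): 0.1,
    ('nature', 'red'): 0.2, ('nature', 'green'): 0.6,
    ('nature', 'apple'): 0.55, ('nature', 'staff'): 0.15,
}

def toy_similarity(clue, word):
    return toy_table[(clue, word)]

targets, others = ['red', 'green'], ['apple', 'staff']
print(is_safe_clue('color', targets, others, toy_similarity))   # True
print(is_safe_clue('nature', targets, others, toy_similarity))  # False
```

In a real game the `others` list would hold the opponent, neutral, and murderer words, and a stricter margin could be applied to the murderer word.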

In [79]:
from itertools import combinations

def get_similar_combinations(candidate_words, k):
    # Only the first two words of each combination are compared,
    # so the score is meaningful for k=2.
    all_pairs = combinations(candidate_words, k)
    scored_pairs = [(calculate_similarity(p[0], p[1]), p)
                    for p in all_pairs]
    return sorted(scored_pairs, reverse=True)

In [80]:
get_similar_combinations(board['red'], 2)[0]

(0.8483072, ('green', 'red'))
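
The function accepts `k`, but it only scores the first two words of each combination. One way to score groups of three or more words is the mean similarity over all pairs inside the group. The sketch below assumes `k >= 2`; the names `score_group` and `rank_groups` and the toy similarity values are assumptions for illustration.

```python
from itertools import combinations


def score_group(words, similarity):
    """Mean pairwise similarity of a group of at least two words."""
    pairs = list(combinations(words, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def rank_groups(candidate_words, k, similarity):
    """Score every k-word combination and sort from most to least cohesive."""
    groups = combinations(candidate_words, k)
    return sorted(((score_group(group, similarity), group) for group in groups),
                  reverse=True)


# Toy symmetric similarities, invented for this example.
toy_pairs = {
    frozenset(['red', 'green']): 0.9, frozenset(['key', 'robot']): 0.6,
    frozenset(['red', 'key']): 0.1, frozenset(['red', 'robot']): 0.2,
    frozenset(['green', 'key']): 0.1, frozenset(['green', 'robot']): 0.2,
}

def toy_similarity(a, b):
    return toy_pairs[frozenset([a, b])]

ranking = rank_groups(['red', 'green', 'key', 'robot'], 3, toy_similarity)
print(ranking[0][1])  # ('red', 'green', 'robot')
```

With the real model, `calculate_similarity` could be passed as the `similarity` argument.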

### spaCy

We can also use spaCy to compute the similarity between words.

In [87]:
nlp = spacy.load("en_core_web_lg")

In [88]:
nlp('red').similarity(nlp('green'))

0.7822252390048066

In [91]:
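def spacy_most_similar(word, topn=100):
    # Look up the topn nearest vectors in the vocabulary of the loaded model.
    vector = nlp(word).vector
    ms = nlp.vocab.vectors.most_similar(vector.reshape(1, vector.shape[0]), n=topn)
    # Lowercasing and deduplicating through a set does not preserve the ranking.
    return list(set([nlp.vocab.strings[w].lower() for w in ms[0][0]]))[:10]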

spacy_most_similar('red')

['maroon',
 'white',
 'burgundy',
 'purple',
 'lime',
 'beige',
 'dark',
 'reddish',
 'coloured',
 'pale']
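
Because the result passes through a `set`, the ten words returned are not necessarily the ten nearest ones. A minimal sketch of an order-preserving alternative follows; the helper name `dedupe_keep_order` and the toy input are assumptions for illustration.

```python
def dedupe_keep_order(words, limit=10):
    """Lowercase the words and drop duplicates while keeping the original ranking."""
    # dict keys keep insertion order, so the first occurrence of each word wins.
    return list(dict.fromkeys(word.lower() for word in words))[:limit]


# Toy ranked neighbors with case variants, as a vector lookup might return them.
toy_ranked = ['Red', 'red', 'RED', 'blue', 'Blue', 'yellow', 'purple']
top_words = dedupe_keep_order(toy_ranked, limit=3)
print(top_words)  # ['red', 'blue', 'yellow']
```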

### Similarity or synonyms are not relevant

For other words we can use synonyms, but for 'red' and 'green' we need to work with the meaning of the word.

In [167]:
def get_definition(word):
    return wn.synsets(word)[0].definition()

In [168]:
print(get_definition('red'))
print(get_definition("green"))

red color or pigment; the chromatic color resembling the hue of blood
green color or pigment; resembling the color of growing grass


For the first time, we find the word 'color'. We can try to find the key words in each word's definition.

In [183]:
def get_definition_similarity(word1, word2):
    return nlp(get_definition(word1)).similarity(nlp(get_definition(word2)))

In [184]:
get_definition_similarity('red', 'green')

0.9311594117249767

In [185]:
from collections import Counter

def get_words_from_definition(word):
    definition = nlp(get_definition(word))
    words = [token.text for token in definition if not token.is_stop and not token.is_punct]
    counter =  Counter(words)
    return sorted(counter, key=counter.get, reverse=True)

In [186]:
print(get_words_from_definition('red'))
print(get_words_from_definition('green'))

['color', 'red', 'pigment', 'chromatic', 'resembling', 'hue', 'blood']
['color', 'green', 'pigment', 'resembling', 'growing', 'grass']


In [187]:
def match_words_from_definition(words):
    word_definition = [get_words_from_definition(word) for word in words]
    return list(set(word_definition[0]).intersection(*word_definition[1:]))

Using the definitions, we can find the words that appear in all of them.

In [188]:
match_words_from_definition(["red", "green"])

['pigment', 'resembling', 'color']

In [189]:
def get_combinations_with_same_words_in_definition(candidate_words, k):
    all_pairs = combinations(candidate_words, k)
    match_pairs = [(match_words_from_definition(p), p)
                    for p in all_pairs]
    return list(filter(lambda pair: len(pair[0]), match_pairs))

In [190]:
get_combinations_with_same_words_in_definition(board['red'], 2)

[(['mechanism'], ('key', 'robot')),
 (['pigment', 'resembling', 'color'], ('green', 'red'))]

In [191]:
get_combinations_with_same_words_in_definition(board['blue'], 2)

[(['person'], ('lap', 'alien')), (['neck'], ('back', 'bottle'))]

In [192]:
get_combinations_with_same_words_in_definition([*board['blue'], *board['red']], 2)

[(['person'], ('lap', 'alien')),
 (['neck'], ('back', 'bottle')),
 (['effect'], ('sound', 'force')),
 (['mechanism'], ('key', 'robot')),
 (['pigment', 'resembling', 'color'], ('green', 'red'))]

In [195]:
print(get_definition('back'))
print(get_definition('bottle'))

the posterior part of a human (or animal) body from the neck to the end of the spine
a glass or plastic vessel used for storing drinks or other liquids; typically cylindrical without handles and with a narrow neck that can be plugged or capped


In [194]:
get_definition_similarity('key', 'robot')

0.9166554279210518
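
The shared word 'neck' links 'back' and 'bottle' only incidentally, while 'key' and 'robot' also have similar definitions as a whole. One possible guard, sketched below, keeps a matched pair only when its definition similarity reaches a threshold. The function name `filter_matches`, the threshold of 0.8, and the value used for ('back', 'bottle') are assumptions; that pair was not measured in this notebook.

```python
def filter_matches(matches, definition_similarity, threshold=0.8):
    """Keep (shared_words, pair) entries whose definitions are similar overall."""
    return [(shared, pair) for shared, pair in matches
            if definition_similarity(*pair) >= threshold]


toy_similarity = {
    ('key', 'robot'): 0.9167,   # rounded from the notebook output
    ('green', 'red'): 0.9312,   # rounded from the notebook output
    ('back', 'bottle'): 0.5,    # placeholder value, NOT measured
}

matches = [
    (['mechanism'], ('key', 'robot')),
    (['neck'], ('back', 'bottle')),
    (['pigment', 'resembling', 'color'], ('green', 'red')),
]
kept = filter_matches(matches, lambda a, b: toy_similarity[(a, b)])
print([pair for _, pair in kept])  # [('key', 'robot'), ('green', 'red')]
```

With the real functions, `get_definition_similarity` could be passed as `definition_similarity`, and the threshold would need to be tuned on actual boards.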