<a href="https://colab.research.google.com/github/mallibus/DSS-NLP-challenge/blob/master/Play_with_word_embeddings_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Play with Word Embeddings
<img src="http://www.datascienceseed.com/wp-content/uploads/2019/08/embeddings.jpg" ><br>


In [0]:
import os
import numpy as np
import pandas as pd

#### Parsing the GloVe word-embeddings file
You can download the GloVe word embeddings from http://nlp.stanford.edu/data/glove.6B.zip or get the already extracted .txt file from https://drive.google.com/drive/folders/1wvyeiRwYAdypLfrOfIaiwBMPPTzQwKp_ instead.<br>
Let’s parse the unzipped .txt file to build an index that maps each word (a string) to its vector representation (a NumPy array of floats).

In [4]:
# This is where the glove embedding file is in my drive
glove_dir = '/content/drive/My Drive/AIML/Trainings/2019-07-22 Deep Learning UNIGE/LABs/Lab-2/glove.6B'

embeddings_index = {}

# Pass the encoding explicitly so the file is decoded the same way on every OS.
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding="utf8") as f:
    # Each line holds a word followed by its space-separated coefficients.
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.
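
To see what the parsing loop does without downloading the full file, here is a minimal sketch that runs the same logic on a tiny in-memory sample. The three words and their coefficients are made up for illustration, and `sample` and `toy_index` are hypothetical names that do not appear elsewhere in this notebook.

```python
import io
import numpy as np

# Hypothetical three-line sample in the same layout the loop expects:
# a word followed by its space-separated coefficients (values are made up).
sample = io.StringIO(
    "cat 0.1 0.2 0.3\n"
    "dog 0.2 0.1 0.4\n"
    "car -0.5 0.9 0.0\n"
)

toy_index = {}
for line in sample:
    values = line.split()
    # First token is the word, the rest are the vector components.
    toy_index[values[0]] = np.asarray(values[1:], dtype='float32')

print(len(toy_index), toy_index['cat'].shape)
```

Each entry ends up as a one-dimensional `float32` array whose length equals the number of coefficients on the line.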


## Vector utility functions

In [0]:
def vector_len(v) -> float:
    """Returns the Euclidean length (L2 norm) of the vector."""
    return np.sqrt(np.dot(v, v))

def cosine_similarity(v1, v2) -> float:
    """
    Returns the cosine of the angle between the two vectors.
    Results range from -1 (opposite directions) to 1 (same direction);
    0 means the vectors are orthogonal.
    """
    return np.dot(v1, v2) / (vector_len(v1) * vector_len(v2))
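
As a quick sanity check of the range described in the docstring, the sketch below evaluates the function on hand-picked 2-d vectors. The two helpers are repeated so that the snippet runs on its own; the variable names are only illustrative.

```python
import numpy as np

def vector_len(v) -> float:
    return np.sqrt(np.dot(v, v))

def cosine_similarity(v1, v2) -> float:
    return np.dot(v1, v2) / (vector_len(v1) * vector_len(v2))

a = np.array([1.0, 0.0])

# Same direction, different length: cosine similarity ignores magnitude.
same_direction = cosine_similarity(a, np.array([3.0, 0.0]))
# Perpendicular vectors share no direction at all.
orthogonal = cosine_similarity(a, np.array([0.0, 2.0]))
# Vectors pointing opposite ways sit at the bottom of the range.
opposite = cosine_similarity(a, np.array([-2.0, 0.0]))

print(same_direction, orthogonal, opposite)
```

The three cases land at 1, 0 and -1 respectively, which is why the measure suits comparing embeddings whose lengths differ.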

In [6]:
cosine_similarity(embeddings_index['king']-embeddings_index['man']+embeddings_index['woman'],embeddings_index['queen'])

0.7834413

In [0]:
def embed_similarity(w1, w2, w3):
    """
    Ranks every word by cosine similarity to the vector w1 - w2 + w3,
    e.g. w1='king', w2='man', w3='woman'.
    """
    q = embeddings_index[w1] - embeddings_index[w2] + embeddings_index[w3]
    sim_list = [cosine_similarity(embeddings_index[w], q) for w in embeddings_index.keys()]
    # Drop the input words so they cannot appear among the results
    sims = pd.Series(data=sim_list, index=embeddings_index.keys())
    return sims.sort_values(ascending=False).drop([w1, w2, w3])
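
The function above calls `cosine_similarity` once per word in a Python loop. A possible alternative, sketched here under the assumption that all vectors fit in memory as one matrix, computes every similarity with a single matrix-vector product. `embed_similarity_fast` and the toy 2-d vectors are hypothetical and only serve the illustration; they are not real GloVe values.

```python
import numpy as np
import pandas as pd

def embed_similarity_fast(index, w1, w2, w3):
    """Vectorized sketch of the w1 - w2 + w3 query over a word -> vector dict."""
    words = list(index.keys())
    matrix = np.stack([index[w] for w in words])  # shape (n_words, dim)
    q = index[w1] - index[w2] + index[w3]
    # Cosine similarity of every row with q in one matrix-vector product.
    sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    ranked = pd.Series(sims, index=words).sort_values(ascending=False)
    return ranked.drop([w1, w2, w3])

# Made-up 2-d vectors: first axis ~ "royalty", second axis ~ "gender".
toy = {
    'king':   np.array([1.0, 1.0], dtype='float32'),
    'queen':  np.array([1.0, -1.0], dtype='float32'),
    'man':    np.array([0.0, 1.0], dtype='float32'),
    'woman':  np.array([0.0, -1.0], dtype='float32'),
    'prince': np.array([0.9, 0.8], dtype='float32'),
    'apple':  np.array([-1.0, 0.0], dtype='float32'),
}

result = embed_similarity_fast(toy, 'king', 'man', 'woman')
print(result)
```

On the real index it would make sense to build `words` and `matrix` once and reuse them across queries, since stacking all the vectors is the expensive step.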

In [8]:
embed_similarity('rome','italy','france').head()

paris         0.844468
prohertrib    0.658137
french        0.654002
strasbourg    0.630593
brussels      0.619239
dtype: float64

In [9]:
embed_similarity('king','man','woman').head()

queen       0.783441
monarch     0.693380
throne      0.683311
daughter    0.680908
prince      0.671314
dtype: float64

In [10]:
embed_similarity('doctor','masculine','feminine').head()

nurse       0.708128
medical     0.699084
patient     0.692603
hospital    0.677783
medicine    0.675922
dtype: float64

In [11]:
embed_similarity('bitch','feminine','masculine').head()

bastard    0.612845
dude       0.598435
whore      0.568276
cunt       0.552512
fucking    0.543644
dtype: float64

In [12]:
embed_similarity('bank','money','food').head()

supermarket    0.594245
market         0.556250
markets        0.556143
foods          0.545908
retail         0.540577
dtype: float64

In [14]:
embed_similarity('pizza','italy','usa').head()

kfc         0.560588
donuts      0.548909
doughnut    0.524129
7-eleven    0.515943
mcdonald    0.514797
dtype: float64

In [20]:
embed_similarity('trump','usa','italy').head()

berlusconi    0.637846
silvio        0.563385
dini          0.550138
scalfaro      0.537195
prodi         0.517768
dtype: float64