In [3]:
from gensim.test.utils import get_tmpfile, common_texts
from gensim.models import Word2Vec

In [19]:
print(common_texts)

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]


In [14]:
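path = get_tmpfile("word2vec.model")  # temp-file path for the saved model

# gensim 3.x API: `size` sets the vector dimensionality (renamed to `vector_size` in gensim 4.0)
model = Word2Vec(common_texts, size=300, window=5, min_count=1, workers=4)
model.save(path)

In [15]:
model = Word2Vec.load(path)
# Neither "hello" nor "world" is in the vocabulary, so 0 of the 2 raw words are trained on
model.train([["hello", "world"]], total_examples=1, epochs=1)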

(0, 2)

In [17]:
vector = model.wv['computer']  # numpy vector of a word
vector

array([ 2.2316521e-05,  4.7390778e-05,  4.3435142e-05, ...,
       -4.7317448e-05,  3.7334288e-05, -4.8908816e-05], dtype=float32)

In [70]:
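try:
    w11 = model.wv['king']
    w12 = model.wv['man']
    w21 = model.wv['queen']
    w22 = model.wv['woman']
except KeyError:
    # The toy corpus (common_texts) contains none of these words, so the lookup raises KeyError
    print("Some words were not found in the vocabulary...")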

Some words were not found in the vocabulary...


### https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html

In [96]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

In [97]:
print(len(wv.vocab))
vec_king = wv['king']
print(len(vec_king))

3000000
300


From the paper, page 5 (Figure 1, "New model architectures"):
* The CBOW architecture predicts the **current word based on the context**, and 
* The Skip-gram **predicts surrounding words given the current word**.
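To make the two directions concrete, here is a minimal sketch of how training examples could be laid out for each architecture. The helper name `training_pairs` and the window size are my own choices for illustration, not something taken from the paper or from gensim.

```python
def training_pairs(tokens, window=2):
    """Build toy training examples for CBOW and Skip-gram from one sentence."""
    cbow = []      # (list of context words, target word)
    skipgram = []  # (input word, one context word)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, target))                  # context -> current word
        skipgram.extend((target, c) for c in context)   # current word -> each neighbour
    return cbow, skipgram


sentence = ['survey', 'user', 'computer', 'system', 'response', 'time']
cbow, skipgram = training_pairs(sentence, window=2)

print(cbow[2])
print([pair for pair in skipgram if pair[0] == 'computer'])
```

CBOW produces one example per position (many words in, one word out), while Skip-gram produces one example per (word, neighbour) pair.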

My understanding:
* For CBOW, should the output be a single V-sized softmax?
* For Skip-gram (let's say we pick C = 4 surrounding words, ±2), should the output be C = 4 softmaxes, each of size V? That matches slide 10 from the class PPT but contradicts slide 8.


Clarifications needed on the class slides:
* Slide 8: Why does the output have to be 1 for each vocab word? Should it not just be a V-sized softmax for each of the C neighboring words?


Also
* Slide 13: How does "Dot Product/Similarity" help, and what are we comparing it to?
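A sketch that may help with the shape and dot-product questions, assuming the plain full-softmax formulation (the slides may instead describe negative sampling, which I have not checked). All sizes, weights, and word indices below are made up; `W_in` and `W_out` are hypothetical names for the input and output embedding matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 12, 8                       # toy vocabulary size and embedding dimension
W_in = rng.normal(size=(V, D))     # input (current-word) embeddings
W_out = rng.normal(size=(V, D))    # output (context-word) embeddings


def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()


center = 3                  # index of the current word
context_ids = [1, 2, 4, 5]  # C = 4 neighbours (+-2)

# Skip-gram: V dot products, one per vocabulary word, turned into one distribution.
scores = W_out @ W_in[center]
probs = softmax(scores)     # shape (V,)
# The same V-sized distribution is compared with each of the C observed neighbours.
loss_skipgram = -sum(np.log(probs[c]) for c in context_ids)

# CBOW: average the context embeddings, then a single V-sized distribution for the target.
hidden = W_in[context_ids].mean(axis=0)
probs_cbow = softmax(W_out @ hidden)
loss_cbow = -np.log(probs_cbow[center])

print(probs.shape, probs_cbow.shape)
```

Under this reading, the dot product is a similarity score between the current word's vector and every candidate output word's vector, and the softmax over those scores is what gets compared with the word that actually occurred. Skip-gram then has C comparisons against V-sized distributions that share the same weights.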



In [112]:
## Question: does case matter below?

four_grams = [
    [('king', 'queen'), ('man', 'woman')],
    [('king', 'man'), ('queen', 'woman')],
    [('King', 'man'), ('Queen', 'woman')],
    [('King', 'man'), ('queen', 'woman')],
    [('man', 'woman'), ('boy', 'girl')],
    [('Ottawa', 'Canada'), ('Nairobi', 'Kenya')],
    [('big', 'bigger'), ('tall', 'taller')],
    [('yen', 'japan'), ('ruble', 'russia')],
    [('man', 'doctor'), ('woman', 'nurse')],  ## Bias in language  
    [('Paris', 'France'), ('London', 'England')]
]
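On the case question: vocabulary keys are plain strings, so 'King' and 'king' are separate entries (the recorded output further down lists 'Man' as a neighbour distinct from 'man'). A minimal sketch of a pre-check that reports which words would fail a lookup. `missing_words`, `toy_vocab`, and `toy_grams` are hypothetical; with the real vectors one could presumably pass `wv.vocab` (gensim 3.x) as `vocab`.

```python
def missing_words(pairs_list, vocab):
    """Return analogy words that are absent from vocab (case-sensitive match)."""
    words = {w for quad in pairs_list for pair in quad for w in pair}
    return sorted(w for w in words if w not in vocab)


# Toy stand-ins, not the real Google News vocabulary
toy_vocab = {'king', 'queen', 'man', 'woman', 'King', 'Paris', 'France'}
toy_grams = [
    [('King', 'man'), ('Queen', 'woman')],
    [('yen', 'japan'), ('ruble', 'russia')],
]

print(missing_words(toy_grams, toy_vocab))
```

Running such a check before the loop avoids a KeyError partway through the analogy list.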

In [111]:
import numpy as np
from numpy.linalg import norm
from scipy import spatial
from gensim.matutils import softcossim

# Each entry is [(a, b), (c, d)], read as "a : b as c : d"
for (a, b), (c, d) in four_grams:
    # Computes a - b + c and compares it with d (the recorded outputs below use this form)
    lhs = wv[a] - wv[b] + wv[c]
    rhs = wv[d]
    print("-"*50)
    print("1.0 Using inbuilt functions to find most similar word ...")
    top_similar = wv.most_similar(positive=[a, c], negative=[b], topn=5)
    print(f"'{a}':'{b}' as '{c}': ?")
    print(f"Ans: '{top_similar[0][0]}' with a similarity of {top_similar[0][1]}")
    print(top_similar)

    print("\n2.0 Using manual vecotor addition/subtraction ...")

    # gensim_similarity = 1 - wv.distance(lhs, rhs)
    # print(f"Similarity using Gensim: {gensim_similarity}")
    # scipy returns a cosine *distance*, so similarity = 1 - distance
    cosine_similarity1 = 1 - spatial.distance.cosine(lhs, rhs)
    print(f"Cosine Similarity using scipy: {cosine_similarity1}")
    # Same quantity computed directly from the definition
    cosine_similarity2 = np.dot(lhs, rhs)/(norm(lhs)*norm(rhs))
    print(f"Cosine Similarity using numpy: {cosine_similarity2}")

    # print(softcossim(lhs,rhs, wv.similarity_matrix))

--------------------------------------------------
1.0 Using inbuilt functions to find most similar word ...
'king':'queen' as 'man': ?
Ans: 'boy' with a similarity of 0.5393532514572144
[('boy', 0.5393532514572144), ('guy', 0.47399765253067017), ('Alexios_Marakis', 0.4579210579395294), ('Man', 0.4575732350349426), ('teenager', 0.4346425235271454)]

2.0 Using manual vecotor addition/subtraction ...
Cosine Similarity using scipy: 0.33915412425994873
Cosine Similarity using numpy: 0.33915409445762634
--------------------------------------------------
1.0 Using inbuilt functions to find most similar word ...
'king':'man' as 'queen': ?
Ans: 'queens' with a similarity of 0.595018744468689
[('queens', 0.595018744468689), ('monarch', 0.5815044641494751), ('kings', 0.5612993240356445), ('royal', 0.5204525589942932), ('princess', 0.5191516876220703)]

2.0 Using manual vecotor addition/subtraction ...
Cosine Similarity using scipy: -0.0818394273519516
Cosine Similarity using numpy: -0.0818394273
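
A possible reason for the odd answers ('boy', 'queens'): for an analogy a : b as c : d, the usual vector-offset assumption is b - a ≈ d - c, so d ≈ b - a + c. The cell above computes a - b + c instead. A minimal sketch with hand-made 3-d vectors (not real embeddings) showing the difference; in gensim the offset form would correspond to `most_similar(positive=[b, c], negative=[a])`, which I have not re-run here.

```python
import numpy as np


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hand-made toy vectors: axis 0 ~ gender, axis 1 ~ royalty, axis 2 ~ shared component
toy = {
    'king':  np.array([1.0, 1.0, 0.5]),
    'queen': np.array([-1.0, 1.0, 0.5]),
    'man':   np.array([1.0, -1.0, 0.5]),
    'woman': np.array([-1.0, -1.0, 0.5]),
}
a, b, c, d = (toy[w] for w in ('king', 'queen', 'man', 'woman'))

as_in_cell = a - b + c   # form used in the cell above
offset_form = b - a + c  # d is approximately b - a + c

print("a - b + c vs d:", cosine(as_in_cell, d))
print("b - a + c vs d:", cosine(offset_form, d))
```

As far as I understand, `most_similar` also unit-normalizes the input vectors before combining them and excludes the input words from the results, so its scores would not be expected to match the manual cosine exactly even with the same arithmetic.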

In [82]:
?wv.similarity_matrix

Signature:
wv.similarity_matrix(
    dictionary,
    tfidf=None,
    threshold=0.0,
    exponent=2.0,
    nonzero_limit=100,
    dtype=<class 'numpy.float32'>,
)
Docstring:
Construct a term similarity matrix for computing Soft Cosine Measure.

This creates a sparse term similarity matrix in the :class:`scipy.sparse.csc_matrix` format for computing
Soft Cosine Measure between documents.

Parameters
----------
dictionary : :class:`~gensim.corpora.dictionary.Dictionary`
    A dictionary that specifies the considered terms.
tfidf : :class:`gensim.models.tfidfmodel.Tf

In [101]:
?wv.

Signature: wv.relative_cosine_similarity(wa, wb, topn=10)
Docstring:
Compute the relative cosine similarity between two words given top-n similar words,
by `Artuur Leeuwenberga, Mihaela Velab , Jon Dehdaribc, Josef van Genabithbc "A Minimally Supervised Approach
for Synonym Extraction with Word Embeddings" <https://ufal.mff.cuni.cz/pbml/105/art-leeuwenberg-et-al.pdf>`_.

To calculate relative cosine similarity between two words, equation (1) of the paper is used.
For WordNet synonyms, if rcs(topn=10) is greater than 0.10 then wa and wb are more similar than
any arbitrary word pairs.

Parameters
----------
wa: str
    Word for which we have to look top-n similar word.
wb: str
    Word for which we evaluating relative cosine similarity with wa.
topn: int, optional
    Number of top-n similar words to look with respect to wa.
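
A minimal sketch of the quantity the docstring describes, as I read it: the cosine similarity of wa and wb divided by the sum of cosine similarities between wa and its top-n most similar words. The function below is my own toy re-implementation over a dict of hand-made vectors, not gensim's code, and it uses topn=3 only because the toy vocabulary is tiny.

```python
import numpy as np


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def relative_cosine_similarity(vectors, wa, wb, topn=3):
    """cos(wa, wb) divided by the summed cosine of wa's topn nearest words."""
    sims = sorted(
        (cosine(vectors[wa], vec) for w, vec in vectors.items() if w != wa),
        reverse=True,
    )
    return cosine(vectors[wa], vectors[wb]) / sum(sims[:topn])


# Hand-made toy vectors, not real embeddings
toy = {
    'good':  np.array([1.0, 0.1, 0.0]),
    'great': np.array([0.9, 0.2, 0.0]),
    'fine':  np.array([0.8, 0.0, 0.3]),
    'bad':   np.array([-1.0, 0.1, 0.0]),
    'tree':  np.array([0.0, 0.0, 1.0]),
}

print(relative_cosine_similarity(toy, 'good', 'great'))
print(relative_cosine_similarity(toy, 'good', 'tree'))
```

Because of the normalization, the scores of wa's top-n neighbours sum to 1, which is why the docstring can quote a fixed threshold (0.10 for topn=10).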
