<a href="https://colab.research.google.com/github/nlauchande/corise-nlp-notebooks/blob/main/Natu_NLP_Week_1_Word2Vec_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> DUPLICATE THIS COLAB TO START WORKING ON IT, using File > Save a copy in Drive.


### Word2Vec

In this notebook we're going to learn about and play around with the word2vec-style word embeddings that are packaged with spaCy. We'll try to build intuition for how they work and what we can do with them.

Install all the required dependencies for the project.

In [1]:
%%capture
!pip install spacy==2.2.4 --quiet
!python -m spacy download en_core_web_md
!apt install libopenblas-base libomp-dev
!pip install faiss==1.5.3 --quiet

Import all the necessary libraries.

In [2]:
from collections import defaultdict
import en_core_web_md
import numpy as np
import spacy
import time
import faiss

Now let's load the spaCy model, which comes with pre-trained embeddings. Loading is expensive, so run this cell only once.



In [3]:
spacyModel = en_core_web_md.load()

First, let's play with some basic similarity functions.

In [4]:
banana = spacyModel("banana")
fruit = spacyModel("fruit")
table = spacyModel("table")
print(banana.similarity(fruit))
print(banana.similarity(table))

0.671483588786149
0.22562773991991913
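
For intuition: spaCy documents the default `similarity` estimate as cosine similarity over word vectors. The sketch below computes cosine similarity by hand. The three vectors are made-up 3-dimensional values chosen for illustration, not the model's real 300-dimensional embeddings, and `cosine_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only (hypothetical values).
banana_vec = [0.9, 0.8, 0.1]
fruit_vec = [0.8, 0.9, 0.2]
table_vec = [0.1, 0.2, 0.9]

# Vectors pointing in similar directions score close to 1.
print(cosine_similarity(banana_vec, fruit_vec))
print(cosine_similarity(banana_vec, table_vec))
```

The toy numbers only mirror the pattern; the real scores are the two values printed by the cell above.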


As expected, `banana` is a lot more similar to `fruit` than to `table`. Now let's iterate over the entire vocabulary and build a search index using **Faiss**. This makes finding similar words much faster than looping over the whole vocabulary for every query.

You don't need to understand **Faiss** in detail yet; we'll dive into it in Week 3. At a high level, it is an efficient library for finding similar vectors in a large collection.

Note: This next cell will take a fair bit of time to run.

In [5]:
def load_index():
  """Build the word/id lookup tables and the Faiss index.

  Expensive method - call only once!!
  """
  word_to_id = {}
  id_to_word = {}
  vectors = []
  vector_dimension = 300
  next_id = 0  # avoid shadowing the built-in id()

  # Iterate over the entire vocabulary
  for tok in spacyModel.vocab:
    vector = tok.vector
    l2_norm = np.linalg.norm(vector)

    # Ignore zero vectors and NaN values
    if (np.isclose(l2_norm, 0.0) or
        np.isnan(l2_norm) or
        np.any(np.isnan(vector))):
      continue

    # L2-normalize so that the inner product equals cosine similarity
    vectors.append(np.divide(vector, l2_norm))

    # Add to the output variables
    word_to_id[tok.text.lower()] = next_id
    id_to_word[next_id] = tok.text.lower()
    next_id += 1

  # Faiss expects a float32 matrix
  vectors = np.array(vectors, dtype=np.float32)
  index = faiss.IndexFlatIP(vector_dimension)
  index.add(vectors)
  return word_to_id, id_to_word, vectors, index

word_to_id, id_to_word, vectors, index = load_index()
vector_size = len(vectors)
print("We created a search index of %d vectors" % vector_size)

We created a search index of 684754 vectors
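
To make the index less of a black box: `IndexFlatIP` is Faiss's exact inner-product index, and because every vector was L2-normalized before being added, the inner product equals cosine similarity. The sketch below reproduces the same idea with plain NumPy on hypothetical toy data. `brute_force_top_k` is an illustrative helper, not part of Faiss, and it makes no attempt to match Faiss's speed.

```python
import numpy as np

def brute_force_top_k(unit_vectors, query, top_k=3):
    """Return (scores, indices) of the top_k rows by inner product with query."""
    scores = unit_vectors @ query          # one inner product per stored vector
    order = np.argsort(-scores)[:top_k]    # highest scores first
    return scores[order], order

# Hypothetical toy data: four 2-dimensional vectors, L2-normalized row by row.
raw = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]], dtype=np.float32)
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Query with the first stored vector; it should be its own best match.
scores, indices = brute_force_top_k(unit, unit[0], top_k=2)
print(indices, scores)
```

A query always matches itself with a score of about 1.0, which is why each search output in this notebook lists the query word at the top.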


Now we're going to add a helper function that finds the top_k most similar words in the index for a given input vector.

In [6]:
def search_vector(word_vector, top_k=100, print_time_taken=False):
  word_vector = np.array([word_vector])
  start_time = time.time()
  scores, indices = index.search(word_vector, top_k)
  if print_time_taken:
    print("Time taken to search amongst {} words is {:.3}s".format(
        vector_size, (time.time() - start_time))
    )
  results = []
  words = set()
  for i, query_index in enumerate(indices):
      # Matches for the i'th one 
      for inn_idx, word_index in enumerate(query_index):
          if word_index < 0:
              continue
          word = id_to_word[word_index]
          if word in words:
            continue
          words.add(word)
          results.append((word, float(scores[i][inn_idx])))
  return sorted(results, key=lambda tup: -tup[1])

Let's do an empirical test by searching for words similar to a few terms.

### Search

In [7]:
def search(word, top_k=100, print_time_taken=False):
  word = word.lower()
  if word not in word_to_id:
    print("Oops, the word {} is not in loaded dictionary".format(word))
    return
  word_id = word_to_id[word]
  word_vector = vectors[word_id]
  search_results = search_vector(word_vector, top_k, print_time_taken)
  print(f"The top similar words to {word} are - ")
  for similar_word, score in search_results:
    print(f"Word = {similar_word} and similarity = {score}")
  return search_results

In [None]:
output = search("happy")

The top similar words to happy are - 
Word = happy and similarity = 0.9999999403953552
Word = glad and similarity = 0.7701864242553711
Word = hope and similarity = 0.7318377494812012
Word = everyone and similarity = 0.7277782559394836
Word = thankful and similarity = 0.6912474632263184
Word = excited and similarity = 0.6901208162307739
Word = love and similarity = 0.6869317889213562
Word = wish and similarity = 0.6853763461112976
Word = gratefully and similarity = 0.6847965717315674
Word = greatful and similarity = 0.6847965717315674
Word = grateful and similarity = 0.6847965717315674
Word = appreciative and similarity = 0.6847965717315674
Word = always and similarity = 0.6795146465301514
Word = lucky and similarity = 0.6788508296012878
Word = feel and similarity = 0.6768770813941956
Word = freinds and similarity = 0.6759893894195557
Word = friends and similarity = 0.6759893894195557
Word = thank and similarity = 0.6728546619415283
Word = really and similarity = 0.6657894849777222
Word

In [8]:
output = search("baseball", 10)

The top similar words to baseball are - 
Word = fastpitch and similarity = 0.9999999403953552
Word = scorebook and similarity = 0.9999999403953552
Word = baseballs and similarity = 0.9999999403953552
Word = baseball and similarity = 0.9999999403953552
Word = softball and similarity = 0.9999999403953552
Word = sandlot and similarity = 0.9999999403953552


In [9]:
output = search("cheese", 25)

The top similar words to cheese are - 
Word = mozzarella and similarity = 1.0000001192092896
Word = fromage and similarity = 1.0000001192092896
Word = cheeses and similarity = 1.0000001192092896
Word = cheese and similarity = 1.0000001192092896
Word = velveta and similarity = 0.8228569626808167
Word = chevre and similarity = 0.8228569626808167
Word = cheddar and similarity = 0.8228569626808167
Word = chedder and similarity = 0.8228569626808167
Word = emmental and similarity = 0.8228569626808167
Word = mozza and similarity = 0.8228569626808167
Word = roquefort and similarity = 0.8228569626808167
Word = cheeseboard and similarity = 0.8228569626808167
Word = mozz and similarity = 0.8228569626808167
Word = velveeta and similarity = 0.8228569626808167


Now why don't you try a few different words that come to mind and see where the model performs well and where it struggles!

### Analogies

In [11]:
def analogy(word1, word2, word3):
  word1 = word1.lower()
  word2 = word2.lower()
  word3 = word3.lower()
  if word1 not in word_to_id or word2 not in word_to_id or word3 not in word_to_id:
    print("word not present in dictionary, try something else")
    return  # stop here; the lookups below would raise a KeyError
  vector1 = vectors[word_to_id[word1]]
  vector2 = vectors[word_to_id[word2]]
  vector3 = vectors[word_to_id[word3]]
  # Query vector for the analogy: word1 - word2 + word3
  analogy_results = search_vector(np.add(np.subtract(vector1, vector2), vector3), 10)
  print(f"The top similar item for ({word1} - {word2} + {word3}) = {analogy_results[0][0]}")
  print(f"The top similar words to ({word1} - {word2} + {word3}) are - ")
  for similar_word, score in analogy_results:
    print(f"Word = {similar_word} and similarity = {score}")
  return analogy_results

In [15]:
output = analogy("salad", "patty", "bun")

The top similar item for (salad - patty + bun) = topknot
The top similar words to (salad - patty + bun) are - 
Word = topknot and similarity = 0.9501674771308899
Word = buttie and similarity = 0.9501674771308899
Word = bun and similarity = 0.9501674771308899
Word = chignon and similarity = 0.9501674771308899
Word = plait and similarity = 0.9501674771308899
Word = nuong and similarity = 0.9501674771308899


In [16]:
output = analogy("smallest", "small", "short")

The top similar item for (smallest - small + short) = shortest
The top similar words to (smallest - small + short) are - 
Word = shortest and similarity = 0.8045328259468079
Word = straightest and similarity = 0.8045328259468079
Word = second-longest and similarity = 0.7076637744903564
Word = longest-ever and similarity = 0.7076637744903564
Word = fifth-longest and similarity = 0.7076637744903564
Word = third-longest and similarity = 0.7076637744903564
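
In the first analogy output above, `bun` itself appears among the results. A common refinement, which the `analogy` function in this notebook does not apply, is to drop the input words from the result list so that only new candidates remain. A minimal sketch, where `drop_query_words` is a hypothetical helper and the sample pairs are copied from the (salad - patty + bun) output:

```python
def drop_query_words(results, query_words):
    """Remove (word, score) pairs whose word is one of the analogy inputs."""
    excluded = {w.lower() for w in query_words}
    return [(word, score) for word, score in results if word not in excluded]

# Pairs copied from the (salad - patty + bun) output above.
sample = [
    ("topknot", 0.9501674771308899),
    ("buttie", 0.9501674771308899),
    ("bun", 0.9501674771308899),
    ("chignon", 0.9501674771308899),
]

filtered = drop_query_words(sample, ["salad", "patty", "bun"])
print([word for word, _ in filtered])
```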


Now why don't you try out a few different examples and see what comes out :)

In [18]:
output = analogy("fish", "wrap", "cheese")

The top similar item for (fish - wrap + cheese) = warmwater
The top similar words to (fish - wrap + cheese) are - 
Word = warmwater and similarity = 1.1421476602554321
Word = carp and similarity = 1.1421476602554321
Word = catfish and similarity = 1.1421476602554321
Word = bream and similarity = 1.1421476602554321
Word = fishes and similarity = 1.1421476602554321
Word = freshwater and similarity = 1.1421476602554321
Word = fish and similarity = 1.1421476602554321
Word = perch and similarity = 1.1421476602554321
