This is a Word2Vec embedding model with skip-gram and negative sampling method using pure tensorflow 2.

# Making Data Ready

In [2]:
!pip install tqdm # This is for showing a smart progress meter in any loop



In [3]:
from tqdm.auto import tqdm # This is for showing a smart progress meter in any loop
import gensim.downloader as api
corpus = api.load("text8") # This is tokenized corpus of text from wikipedia



In [4]:
# just to show what is inside the corpus
for i in corpus:
  print(i[:20])
  break

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english']


Now we want to build our skip-grams which means we should pair every word with its neighbours. For defining neighbour we choose a `window_size`.

For example suppose the sentense "*I want a new car*". its skip-grams with `window_size = 1` would be:

*   I , want
*   want , I
*   want , a  
*   a , want
*   a , new
*   new , a
*   new , car
*   car , new

And with `window_size = 2` words near `new` would be:

*   new , car
*   new , want
*   new , a  











In [5]:
window_size = 2

In [6]:
# Now we should build our skip-grams and find neoghbour words, e.g. {('modern', 'era'), ('means', 'to'), ('economics', 'there'), ('while', 'is'),...}

from collections import Counter
neighbour_counter = Counter()

for text in tqdm(corpus):
  dump_txt = ['_TEMP_'] * window_size + text + ['_TEMP_'] * window_size # e.g. sentence "I love code" with window_size = 2 would be "_TEMP_ _TEMP_ I love code _TEMP_ _TEMP_"
  for i in range(len(dump_txt) - (window_size + 1) * 2):
    middle_window = dump_txt[window_size + i] 
    for j in range(window_size): # we choose words near the center word as its nighbours (context words)
      location = window_size + i - 1 
      if not middle_window == dump_txt[location - j] and not dump_txt[location - j] == '_TEMP_': # This is just for making sure neighbours aren't the same, like (the, the) and we ignore it when it becomes naighbour to "_TEMP_"
        neighbour_counter[(middle_window, dump_txt[location - j])] += 1
      if not middle_window == dump_txt[location + j] and not dump_txt[location + j] == '_TEMP_': # This is just for making sure neighbours aren't the same, like (the, the) and we ignore it when it becomes naighbour to "_TEMP_"
        neighbour_counter[(middle_window, dump_txt[location + j])] += 1

print(len(neighbour_counter))
print(neighbour_counter.most_common(3))

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))


8551710
[(('nine', 'one'), 358337), (('the', 'of'), 333414), (('the', 'in'), 197295)]


In [7]:
neighbours = [i[0] for i in neighbour_counter.items() if i[1] > 5] # Removing neighbours that repeated less than 5 time in our large corpus. They are not useful.

len(neighbours)

897084

In [8]:
from collections import defaultdict

dict_of_neighbours = defaultdict(set) # This dict contains any naighbour that a word has in our data
for i, j in neighbours: 
    dict_of_neighbours[i].add(j) 

len(dict_of_neighbours['cat'])

48

In [9]:
words_set = list(set([j for i in neighbours for j in (i[0], i[1])])) # list of all tokens

print(len(words_set))

index2word = dict(enumerate(list(words_set))) # A dict that maps an index to a word, e.g. d[3] = 'cat'
word2index = {v: k for k, v in index2word.items()} # Reverse of the dict above. e.g. d['cat'] = 3

63258


Now we should make negative smapling. Which means we should produce fake nighbours. We then lable the true tuple with 1 and fake tuples with 0.

For producing fake neighbours we just randomly sample from our `words_lst`. Do not worry about it randomly choose a true nighbour and not a fake one. We do not need to be perfect for now.

In [10]:
negative_samples = 5

In [11]:
import random

def negative_sampling(true_neighbours): # We lable true_neighbours with 1, and we produce fake neighbours and lable them with 0
  center_word = true_neighbours[0]
  neighbours_labeled = [(true_neighbours[1], 1)]

  for random_word in random.sample(words_set, negative_samples):
    neighbours_labeled.append((random_word , 0))

  random.shuffle(neighbours_labeled) # Shuffle to make the order random
  x, y = zip(*neighbours_labeled) # Now Xs are neighbour and Ys are thir lables (0 or 1)

  return center_word, x, y

In [12]:
negative_sampling(("cool", "code"))

('cool',
 ('code', 'baroque', 'incumbent', 'res', 'gamers', 'acted'),
 (1, 0, 0, 0, 0, 0))

Now The Final step to have our data ready! (Yoo Hoo!).

We should make a function that return a batch of data everytime we call it. we rather replacethe words with thir indices here (by using `word2index`)

In [13]:
def give_us_data(batch_size):
  batched_center = []
  batched_x = []
  batched_y = []

  for random_neighbour in random.sample(neighbours, batch_size):
    center_word, x , y = negative_sampling(random_neighbour)
    
    batched_y.append(y)
    batched_center.append(word2index[center_word]) # Convert the center word to its index (e.g. "cat" to 3)
    batched_x.append([word2index[i] for i in x]) # Convert words to its index (e.g. "cat" to 3)

  return batched_center, batched_x, batched_y

In [14]:
give_us_data(2)

([1364, 55498],
 [[35557, 15774, 33734, 46291, 10516, 39514],
  [54854, 4541, 53446, 3848, 19714, 30488]],
 [(0, 0, 1, 0, 0, 0), (0, 0, 1, 0, 0, 0)])

# MODEL

Now is the time to code the model with pure TF 2

In [15]:
import tensorflow as tf
import numpy as np

embeding_size = 50
batch_size = 128

In [16]:
target_amb = tf.keras.layers.Embedding(len(words_set), embeding_size) # This will be our Word2Vec
context_amb = tf.keras.layers.Embedding(len(words_set), embeding_size) # This is just for the learning phase

optimizer = tf.keras.optimizers.Adam()

In [17]:
for _ in tqdm(range(10001)): # tqdm is just for a nice progress bar.

  center, x, y = give_us_data(batch_size)
  center = np.asarray(center)
  x = np.asarray(x)
  y = np.asarray(y)

  with tf.GradientTape() as t:
    center_embs = target_amb(center)
    neighbor_choices = context_amb(x)

    scores = tf.keras.backend.batch_dot(neighbor_choices, center_embs, axes=(2,1)) # dot prudoct vectors of naighbours together
    prediction = tf.nn.sigmoid(scores) # we want the model to give us a probability of that two, being real naighbours

    loss = tf.keras.losses.categorical_crossentropy(y, prediction)

    if not _ % 2000:
      print("batch:", _ , " - The mean loss is: " ,tf.reduce_mean(loss).numpy())
      print(y[1], prediction[1].numpy())
      print("------------------------------------------------")

    g_embed, g_context = t.gradient(loss, [target_amb.embeddings, context_amb.embeddings])
    optimizer.apply_gradients(zip([g_embed, g_context], [target_amb.embeddings, context_amb.embeddings]))

HBox(children=(FloatProgress(value=0.0, max=10001.0), HTML(value='')))

batch: 0  - The mean loss is:  1.7918224
[1 0 0 0 0 0] [0.49952218 0.49837837 0.502265   0.49912843 0.49756768 0.50099564]
------------------------------------------------
batch: 2000  - The mean loss is:  1.1438766
[0 0 1 0 0 0] [0.03187576 0.01108352 0.98943454 0.05731562 0.05656805 0.00725603]
------------------------------------------------
batch: 4000  - The mean loss is:  0.83899343
[0 0 0 0 1 0] [2.6515016e-01 7.0336723e-04 7.1048317e-04 2.0742374e-04 9.7770542e-01
 1.4255049e-03]
------------------------------------------------
batch: 6000  - The mean loss is:  0.66536206
[0 0 0 1 0 0] [5.5141826e-03 2.3234300e-03 9.1780163e-04 9.8258972e-01 2.5561482e-03
 3.7092983e-03]
------------------------------------------------
batch: 8000  - The mean loss is:  0.6653873
[0 0 0 0 1 0] [0.38478294 0.47002935 0.25358015 0.39610022 0.9862511  0.3963444 ]
------------------------------------------------
batch: 10000  - The mean loss is:  0.5969314
[0 0 0 1 0 0] [4.1194609e-01 1.9202128e-06 

Wooo Hooo! Congratulation!

Now we have our Word2Vec ready in `target_amb`. Lets analyse is a little bit.

# Model analysis

Here I want to write a function that finds closest words to a word 

In [20]:
def find_closest(embeds, word, n=1): # n is for "n closest words"
  n = n + 1 # This is becuse the most similar word is definatly that word itself. like the most similar word for "apple" is "apple". so we should look for top n+1 words
  main_vec = embeds(word2index[word])

  similarities = -tf.keras.losses.cosine_similarity(embeds.embeddings, main_vec)
  top_n = tf.math.top_k(similarities, n).indices
  words = [index2word[i] for i in top_n.numpy()]

  return words[1:] # I did [1:] to remove the word as I mentioned in `n = n + 1` comments



In [21]:
find_closest(target_amb, "two", 10)

['five', 'three', 'six', 'four', 'seven', 'zero', 'eight', 'nine', 'e', 'all']