# 1 Word2Vec

**Reading material**
* [1] Mikolov, Tomas, et al. "[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)." arXiv preprint arXiv:1301.3781 (2013).





### Word embeddings
Build word embeddings with a Keras implementation, using embedding vectors of length 50, 150 and 300. Use the text of Alice in Wonderland for training. Use a window size of 2 to train the embeddings (`window_size` in the Jupyter notebook).

1. Build word embeddings of length 50, 150 and 300 using the Skipgram model
2. Build word embeddings of length 50, 150 and 300 using CBOW model
3. Analyze the different word embeddings:
    - Implement your own function to perform the analogy task (see [1] for concrete examples). Use the same distance metric as in the paper. Do not use existing libraries such as Gensim for this task.
      Your function should be able to answer whether an analogy like "a king is to a queen as a man is to a woman" ($e_{king} - e_{queen} + e_{woman} \approx e_{man}$) is true, where $e_{x}$ denotes the embedding of word $x$. We want to find the word $p$ in the vocabulary whose embedding $e_p$ is closest to the predicted embedding (i.e. the result of the formula). Then we can check whether $p$ is the same word as the true word $t$.
    - Give at least 5 different examples of analogies.
    - Compare the performance on the analogy tasks between the word embeddings and briefly discuss your results.

4. Discuss:
  - Given the same number of sentences as input, CBOW and Skipgram arrange the data into different numbers of training samples. Which one has more and why?



### Import libraries

In [None]:
%tensorflow_version 2.x

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K  # use the tf.keras backend instead of standalone Keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Reshape, Lambda
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing import sequence

# other helpful libraries
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors as nn
from matplotlib import pylab
import pandas as pd


In [None]:
print(tf.__version__) #  check what version of TF is imported

2.2.0


### Import file

If you use Google Colab and your files are located in your Google Drive, you need to mount the drive in the notebook first. Running the cell below automatically opens a new tab with an authorization code; paste that code into the prompt of the cell.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Navigate to the folder in which `alice.txt` is located. Make sure to start the path with '/content/drive/My Drive/' if you want to load the file from your Google Drive.

In [None]:
# change the working directory with the IPython %cd magic
%cd '/content/drive/My Drive/Deep Learning'


In [None]:
file_name = 'alice.txt'
# read the file line by line and close it again afterwards
with open(file_name) as f:
    corpus = f.readlines()

### Data preprocessing

In [None]:
# Removes sentences with fewer than 3 words
corpus = [sentence for sentence in corpus if sentence.count(" ") >= 2]

# remove punctuation in text and fit tokenizer on entire corpus
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'+"'")
tokenizer.fit_on_texts(corpus)

# convert text to sequence of integer values
corpus = tokenizer.texts_to_sequences(corpus)
n_samples = sum(len(s) for s in corpus) # total number of word tokens in the corpus
V = len(tokenizer.word_index) + 1 # number of unique words + 1, because the tokenizer reserves index 0 (used for padding)

In [None]:
n_samples, V

(27165, 2557)

In [None]:
# example of what the word-to-integer mapping in the tokenizer looks like
print(list((tokenizer.word_index.items()))[:5])

[('the', 1), ('and', 2), ('to', 3), ('a', 4), ('it', 5)]


In [None]:
# parameters
window_size = 2
window_size_corpus = 4
dims = [50, 150, 300]

## 1.1 - Skipgram
Build word embeddings of length 50, 150 and 300 using the Skipgram model.

In [None]:
# prepare data for skipgram
def generate_data_skipgram(corpus, window_size, V):
    all_in = []
    all_out = []
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            # window boundaries around the center word
            p = index - window_size
            n = index + window_size + 1

            for i in range(p, n):
                if i != index and 0 <= i < L:
                    # input: the center word
                    all_in.append(word)
                    # label: one-hot encoding of the context word
                    all_out.append(to_categorical(words[i], V))

    return (np.array(all_in), np.array(all_out))

In [None]:
# create training data
x , y = generate_data_skipgram(corpus,window_size,V)

In [None]:
x.shape, y.shape

((94556,), (94556, 2557))

In [None]:
# create skipgram architecture
skipgramModels = []

for dim in dims:
  print("Skipgram model with dimensions:",dim)
  skipgram = Sequential()
  skipgram.add(Embedding(input_dim=V, output_dim=dim, input_length=1, embeddings_initializer='glorot_uniform'))
  skipgram.add(Reshape((dim, )))
  skipgram.add(Dense(V, activation='softmax', kernel_initializer='glorot_uniform'))

  skipgram.compile(optimizer='adadelta', loss='categorical_crossentropy')
  skipgram.summary()
  skipgramModels.append(skipgram)
  

Skipgram model with dimensions: 50
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 1, 50)             127850    
_________________________________________________________________
reshape (Reshape)            (None, 50)                0         
_________________________________________________________________
dense (Dense)                (None, 2557)              130407    
Total params: 258,257
Trainable params: 258,257
Non-trainable params: 0
_________________________________________________________________
Skipgram model with dimensions: 150
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1, 150)            383550    
_________________________________________________________________
reshape_1 (Reshape)    

<b>HINT</b>: To speed up the training of your model, you can use the free GPU available in Google Colab. Go to `Edit` --> `Notebook Settings` and select `GPU` under `hardware accelerator`.

In [None]:
# train skipgram model
for dim, skipgram in zip(dims, skipgramModels):
  print("Training Skipgram model with", dim, "dimensions")
  skipgram.fit(x, y, batch_size=128, epochs=10, verbose=1)

Training Skipgram model with 50 dimensions
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training Skipgram model with 150 dimensions
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training Skipgram model with 300 dimensions
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
# save embeddings for vectors of length 50, 150 and 300 using skipgram model
embedding_sgm = []

for dim, skipgram in zip(dims, skipgramModels):
  print("Save embedding of Skipgram model with dimensions:", dim)
  # the first weight matrix of the model is the embedding matrix
  embedding = skipgram.get_weights()[0]
  embedding_sgm.append(embedding)

Save embedding of Skipgram model with dimensions: 50
Save embedding of Skipgram model with dimensions: 150
Save embedding of Skipgram model with dimensions: 300


## 1.2 - CBOW

Build word embeddings of length 50, 150 and 300 using CBOW model.
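
As a rough illustration of what the CBOW network in this section computes (an embedding lookup for the context words, an average over the context axis, and a softmax layer over the vocabulary), here is a NumPy-only sketch. The toy sizes, the random untrained weights and the names `E`, `W`, `b` and `cbow_forward` are assumptions made for this example; it is not the Keras model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, context_len = 10, 5, 4      # hypothetical toy sizes

E = rng.normal(size=(vocab_size, dim))       # embedding matrix, one row per word id
W = rng.normal(size=(dim, vocab_size))       # weights of the softmax layer
b = np.zeros(vocab_size)                     # bias of the softmax layer

def cbow_forward(context_ids):
    """Return a probability distribution over the vocabulary for one context window."""
    context_vectors = E[context_ids]         # shape (context_len, dim)
    hidden = context_vectors.mean(axis=0)    # average of the context embeddings, shape (dim,)
    logits = hidden @ W + b                  # shape (vocab_size,)
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

# id 0 stands for a padded position, as produced by pad_sequences
probs = cbow_forward(np.array([0, 3, 7, 2]))
print(probs.shape, probs.sum())
```

Training then adjusts `E`, `W` and `b` so that the probability of the true center word increases; the rows of `E` are the word embeddings.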

In [None]:
# prepare data for CBOW
def generate_data_CBOW(corpus, window_size, V):
    maxlen = window_size*2
    all_in = []
    all_out = []
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            # window boundaries around the center word
            p = index - window_size
            n = index + window_size + 1

            in_words = []
            for i in range(p, n):
                if i != index and 0 <= i < L:
                    # collect the context words
                    in_words.append(words[i])
            # pad the context to a fixed length of 2*window_size (zeros are added in front)
            all_in.append(sequence.pad_sequences([in_words], maxlen))
            # label: one-hot encoding of the center word
            all_out.append(to_categorical(word, V))

    return (np.array(all_in).reshape(-1, maxlen), np.array(all_out))

In [None]:
# create training data
x, y = generate_data_CBOW(corpus, window_size, V)
x.shape,y.shape

((27165, 4), (27165, 2557))

In [None]:
# create CBOW architecture
cbowModels = []

for dim in dims:
  print("CBOW model with dimensions:",dim)
  cbow = Sequential()
  cbow.add(Embedding(input_dim=V, output_dim=dim, input_length=window_size*2))
  cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(dim,)))
  cbow.add(Dense(V, activation='softmax'))

  #compile the model
  cbow.compile(optimizer='adadelta', loss='categorical_crossentropy')
  cbow.summary()
  cbowModels.append(cbow)


CBOW model with dimensions: 50
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 4, 50)             127850    
_________________________________________________________________
lambda (Lambda)              (None, 50)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 2557)              130407    
Total params: 258,257
Trainable params: 258,257
Non-trainable params: 0
_________________________________________________________________
CBOW model with dimensions: 150
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 4, 150)            383550    
_________________________________________________________________
lambda_1 (Lambda)            

In [None]:
# train CBOW model
for dim, cbow in zip(dims, cbowModels):
  print("Training CBOW model with", dim, "dimensions")
  cbow.fit(x, y, batch_size=128, epochs=10, verbose=1)


Training CBOW model with 50 dimensions
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training CBOW model with 150 dimensions
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training CBOW model with 300 dimensions
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
# save embeddings for vectors of length 50, 150 and 300 using CBOW model
embedding_cbow = []

for dim, cbow in zip(dims, cbowModels):
  print("Save embedding of CBOW model with dimensions:", dim)
  # the first weight matrix of the model is the embedding matrix
  embedding = cbow.get_weights()[0]
  embedding_cbow.append(embedding)

Save embedding of CBOW model with dimensions: 50
Save embedding of CBOW model with dimensions: 150
Save embedding of CBOW model with dimensions: 300


## 1.3 - Analogy function

Implement your own function to perform the analogy task (see [1] for concrete examples). Use the same distance metric as in [1]. Do not use existing libraries for this task such as Gensim. Your function should be able to answer whether an analogy like: "a king is to a queen as a man is to a woman" ($e_{king} - e_{queen} + e_{woman} \approx e_{man}$) is true. 

Ideally, the result of $e_{king} - e_{queen} + e_{woman}$ would be exactly the embedding of the word "man". In practice, the result rarely coincides with any word embedding. The result of the formula is called the expected or predicted word embedding, and "man" is called the true or actual word $t$. We want to find the word $p$ in the vocabulary whose embedding $e_p$ is closest to the predicted embedding (i.e. the result of the formula). Then we can check whether $p$ is the same word as the true word $t$.

Evaluate each analogy with every embedding of both the CBOW and the Skipgram model. This means that each analogy has 6 outputs. For each output, show the true word (with the distance between the predicted embedding and the true word embedding, i.e. `sim1`), the predicted word (with the distance between the predicted embedding and the embedding of the word in the vocabulary that is closest to it, i.e. `sim2`) and a boolean indicating whether the predicted word **exactly** equals the true word.

<b>HINT</b>: To visualize the results of the analogy tasks, you can print them in a table. An example is given below.


| Analogy task | True word (sim1)  | Predicted word (sim2) | Embedding | Correct?|
|------|------|------|------|------|
| queen is to king as woman is to ? | man (sim1) | predicted_word (sim2) | SG_50 | True / False |

* Give at least 5 different examples of analogies.
* Compare the performance on the analogy tasks between the word embeddings and briefly discuss your results.
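
To make the procedure concrete before applying it to the trained embeddings, here is a minimal self-contained sketch on hand-made vectors. The 2-D `toy_vectors` and the helper names `cosine_distance` and `solve_analogy` are hypothetical and exist only for this illustration; the closest word is searched with cosine distance, the metric used in [1].

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two 1-D vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def solve_analogy(a, b, c, vectors):
    """Return the word closest to e_a - e_b + e_c and its cosine distance."""
    predicted = vectors[a] - vectors[b] + vectors[c]
    scores = {w: cosine_distance(predicted, e) for w, e in vectors.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]

# hypothetical 2-D embeddings, chosen by hand for illustration only
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.2, 1.0]),
    "woman": np.array([0.2, -1.0]),
}

word, dist = solve_analogy("king", "queen", "woman", toy_vectors)
print(word, dist)
```

With these vectors the predicted embedding equals the vector of "man", so the analogy is answered correctly. With trained embeddings the predicted vector usually lies between several words, which is why both `sim1` and `sim2` are reported.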

In [None]:
# Embed a word by taking the dot product of its one-hot encoding with the embedding matrix
# 'word' = string type
def embed(word, embedding=embedding, vocab_size=V, tokenizer=tokenizer):
    # get the index of the word from the tokenizer, i.e. convert the string to its corresponding integer in the vocabulary
    int_word = tokenizer.texts_to_sequences([word])[0]
    # get the one-hot encoding of the word, shape (1, vocab_size)
    bin_word = to_categorical(int_word, vocab_size)
    # the product selects the row of the embedding matrix for this word, shape (1, dim)
    return np.dot(bin_word, embedding)
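
The one-hot product used in `embed` selects a single row of the embedding matrix. The small NumPy check below illustrates this equivalence; the matrix `toy_embedding`, the toy sizes and the id `word_id` are made up for the example.

```python
import numpy as np

vocab_size, dim = 6, 3                       # hypothetical toy sizes
rng = np.random.default_rng(0)
toy_embedding = rng.normal(size=(vocab_size, dim))

word_id = 4                                  # hypothetical integer id of a word
one_hot = np.zeros((1, vocab_size))
one_hot[0, word_id] = 1.0

via_dot = one_hot @ toy_embedding            # shape (1, dim), as computed in embed
via_lookup = toy_embedding[[word_id]]        # shape (1, dim), direct row lookup

same = np.allclose(via_dot, via_lookup)
print(same)
```

A direct lookup avoids building a vector of length `vocab_size` for every word, which can matter when many words are embedded in a loop.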

In [None]:
import operator
from sklearn.metrics.pairwise import cosine_distances

# Find the word in the vocabulary whose embedding is closest to the predicted embedding
def similar(predict_emb, embedding=embedding, vocab_size=V, tokenizer=tokenizer):
    distances = []
    for word, i in tokenizer.word_index.items():
        # get the embedding of the candidate word
        sim_emb = embed(word, embedding)
        # cosine distance between the predicted and the candidate embedding
        distance = cosine_distances(predict_emb, sim_emb)
        # store the word itself: word_index starts at 1, so using i as a list position would be off by one
        distances.append((word, distance.item()))
    # the closest word is the one with the smallest distance
    predict_word, sim_distance = min(distances, key=operator.itemgetter(1))
    return predict_word, sim_distance

In [None]:
import plotly.graph_objects as go

# check whether the predicted word equals the true word
def TF(x,y):
   return x == y

# plot result
def plot(result):
  headerColor = 'grey'
  rowEvenColor = 'lightgrey'
  rowOddColor = 'white'

  fig = go.Figure(layout=go.Layout(autosize=True,margin=dict(t=0, b=0), height=180),
  data=[go.Table(
    header=dict(
      values = list(result.columns),
      line_color='darkslategray',
      fill_color=headerColor,
      align=['left','center'],
      font=dict(color='white', size=12)
    ),
    cells=dict(
      values=[result['Analogy task'], result['True word(sim1)'], result['Predicted(sim2)'],
              result['Embedding'], result['True/False']],
      line_color='darkslategray',
      # 2-D list of colors for alternating rows
      fill_color = [[rowOddColor,rowEvenColor,rowOddColor,rowEvenColor,rowOddColor,rowEvenColor,]*5],
      align = ['left', 'center'],
      font = dict(color = 'darkslategray', size = 11)
    ))]
  )

  fig.show()

In [None]:
# analogy function
from sklearn.metrics.pairwise import cosine_distances
import pandas as pd

task = [['king', 'queen', 'woman', 'man'], ['up', 'down', 'ground', 'roof'], ['good', 'bad', 'ugly', 'pretty'], 
        ['mouse', 'rabbit', 'rabbits', 'mice'], ['eat', 'read', 'reading', 'eating']]

embedding_ms = [['SG_', embedding_sgm],['CBOW_', embedding_cbow]]

def analogy(embedding_ms=embedding_ms, task=task):
  for a, b, c, true_word in task:
    ana_tasks = []   # analogy task descriptions
    t_sim1 = []      # true word (sim1)
    p_sim2 = []      # predicted word (sim2)
    model_name = []  # embedding name
    correct = []     # whether the predicted word equals the true word
    for modelname, embedding_m in embedding_ms:
      for dim, embedding in zip(dims, embedding_m):
        ana_tasks.append(b+"-"+a+", "+c+"- ?")
        # predicted embedding: e_a - e_b + e_c
        predictedEmbedding = embed(a, embedding)-embed(b, embedding)+embed(c, embedding)
        expectedEmbedding = embed(true_word, embedding)

        # sim1: cosine distance between the predicted embedding and the true word embedding
        distance = cosine_distances(predictedEmbedding, expectedEmbedding)
        t_sim1.append(true_word+"("+str(round(distance.item(),7))+")")

        # sim2: cosine distance between the predicted embedding and the closest word in the vocabulary
        predict_word, sim_distance = similar(predictedEmbedding, embedding)
        p_sim2.append(predict_word+"("+str(round(sim_distance,7))+")")
        correct.append(TF(true_word, predict_word))
        model_name.append(modelname+str(dim))
    result = pd.DataFrame(
        {'Analogy task': ana_tasks,
        'True word(sim1)': t_sim1,
        'Predicted(sim2)': p_sim2,
        'Embedding': model_name,
        'True/False': correct
    })
    plot(result)


In [None]:
# answer the analogy function using each embedding for both CBOW and Skipgram model
analogy(embedding_ms)

**Compare the performance on the analogy tasks between the word embeddings and briefly discuss your results.**

We formed five tasks that include both semantic and syntactic questions. Comparing the results of Skipgram and CBOW, both reach very low accuracy with the current models, while Skipgram performs slightly better. According to the authors of [1], Skipgram works well with a small amount of training data and represents rare words well, whereas CBOW is faster to train and has better representations for more frequent words. In our experiment, training the Skipgram models did take more time than training the CBOW models, since Skipgram involves more calculations: it is trained on 94556 samples, versus 27165 for CBOW.


## 1.4 - Discussion



Answer the following question:
* Given the same number of sentences as input, CBOW and Skipgram arrange the data into different numbers of training samples. Which one has more and why?


---



From the data shapes above we know that, for the same input (`alice.txt`), Skipgram generated 94556 training samples while CBOW generated 27165.

Skipgram generates more training instances than CBOW from the same limited amount of data because it pairs the center word with every word in its context separately, whereas CBOW builds a single sample per center word from all of its context words. More training samples can yield better performance, which is consistent with the analogy results of the models trained in Task 1.1 (Skipgram) and Task 1.2 (CBOW).

Take the sequence (x1, x2, x3) with window size 1 as an example. Skipgram generates the (input, label) pairs (x1,x2), (x2,x1), (x2,x3), (x3,x2). CBOW, written as (label, context) with _ denoting padding, generates only (x1,[x2,_]), (x2,[x1,x3]) and (x3,[x2,_]), which is one sample per word in the sequence.
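
The counting argument can be checked with a few lines of plain Python. The sketch below mirrors the window logic of the two data generators without building any arrays; the function names `count_skipgram_samples` and `count_cbow_samples` are hypothetical helpers introduced for this illustration.

```python
def count_skipgram_samples(sentence_len, window_size):
    """Number of (center, context) pairs generated for one sentence."""
    total = 0
    for index in range(sentence_len):
        start = max(0, index - window_size)
        end = min(sentence_len, index + window_size + 1)
        total += (end - start) - 1           # context words, excluding the center word
    return total

def count_cbow_samples(sentence_len, window_size):
    """One (context, center) sample per word, regardless of the window size."""
    return sentence_len

sg = count_skipgram_samples(3, 1)
cb = count_cbow_samples(3, 1)
print(sg, cb)   # 4 3
```

For a sentence of length $L \geq w$ and window size $w$, the Skipgram count equals $2wL - w(w+1)$, while the CBOW count stays at $L$. With $w = 2$ this gives almost four times as many Skipgram samples for long sentences.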




---





