Name: **Kartik More**<br>
Div: **BE09-R09**<br>
Roll no: **43149**<br>
Title: **Assignment 5**<br>

*Problem Statement:*

    Implement the Continuous Bag of Words (CBOW) Model. Stages can be:
    a. Data preparation
    b. Generate training data
    c. Train model
    d. Output


# Importing libraries

In [None]:
from keras.preprocessing import text
from keras.utils import np_utils
from keras.utils import pad_sequences
import numpy as np
import pandas as pd

In [None]:
# Sample text about COVID-19 used as the training corpus

data = """
Coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus.

Most people infected with the virus will experience mild to moderate respiratory illness and recover without requiring special treatment. However, some will become seriously ill and require medical attention. Older people and those with underlying medical conditions like cardiovascular disease, diabetes, chronic respiratory disease, or cancer are more likely to develop serious illness. Anyone can get sick with COVID-19 and become seriously ill or die at any age. 

The best way to prevent and slow down transmission is to be well informed about the disease and how the virus spreads. Protect yourself and others from infection by staying at least 1 metre apart from others, wearing a properly fitted mask, and washing your hands or using an alcohol-based rub frequently. Get vaccinated when it’s your turn and follow local guidance.
"""
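# Split into sentences so that each document holds several words of context;
# splitting on whitespace would produce one-word documents with empty windows
dl_data = [sentence.strip() for sentence in data.split('.') if sentence.strip()]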

In [None]:
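# Tokenization: build the word <-> id mappings
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(dl_data)
word2id = tokenizer.word_index

# Id 0 is reserved for padding short context windows
word2id['PAD'] = 0
id2word = {v: k for k, v in word2id.items()}
# Convert each document into a list of word ids
wids = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in dl_data]

vocab_size = len(word2id)   # number of distinct words plus PAD
embed_size = 100            # dimensionality of each word vector
window_size = 2             # context words taken on each side of the target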

print('Vocabulary Size:', vocab_size)
print('Vocabulary Sample:', list(word2id.items())[:10])

Vocabulary Size: 105
Vocabulary Sample: [('and', 1), ('disease', 2), ('the', 3), ('to', 4), ('virus', 5), ('with', 6), ('or', 7), ('covid', 8), ('19', 9), ('is', 10)]
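
The ids in the sample above follow frequency rank: the most common word gets id 1, and id 0 is kept free for padding. The same idea can be sketched in plain Python with `collections.Counter`. This is only an illustration of the ranking scheme, not the Keras implementation; `build_vocab` and the toy token list are hypothetical.

```python
from collections import Counter

def build_vocab(tokens):
    """Assign ids 1..N by descending frequency; id 0 is reserved for padding."""
    counts = Counter(tokens)
    # most_common() sorts by count; equal counts keep first-seen order
    word2id = {word: rank
               for rank, (word, _) in enumerate(counts.most_common(), start=1)}
    word2id['PAD'] = 0
    id2word = {idx: word for word, idx in word2id.items()}
    return word2id, id2word

# Toy corpus, made up for this example
tokens = "the virus spreads and the disease spreads and the virus".split()
word2id, id2word = build_vocab(tokens)
print(word2id)
```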


In [None]:
# Generate (context words, target word) training pairs
def generate_context_word_pairs(corpus, window_size, vocab_size):
    context_length = window_size * 2
    for words in corpus:
        sentence_length = len(words)
        for index, word in enumerate(words):
            start = index - window_size
            end = index + window_size + 1

            # Ids of the neighbours inside the window, excluding the target itself
            context_words = [[words[i]
                              for i in range(start, end)
                              if 0 <= i < sentence_length and i != index]]
            label_word = [word]

            # Left-pad short contexts with 0 (PAD) and one-hot encode the target
            x = pad_sequences(context_words, maxlen=context_length)
            y = np_utils.to_categorical(label_word, vocab_size)
            yield (x, y)
            
# Preview the first few pairs whose context needed no padding
i = 0
for x, y in generate_context_word_pairs(corpus=wids, window_size=window_size, vocab_size=vocab_size):
    if 0 not in x[0]:
        print('Context (X):', [id2word[w] for w in x[0]], '-> Target (Y):', id2word[np.argwhere(y[0])[0][0]])

        if i == 10:
            break
        i += 1
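
To make the windowing concrete, here is a minimal pure-Python sketch of the same pairing logic, without Keras. It assumes left padding with id 0, which is the default behaviour of `pad_sequences`; the function name `context_target_pairs` and the word ids are hypothetical.

```python
def context_target_pairs(word_ids, window_size, pad_id=0):
    """Yield (context, target) for every position, left-padding short contexts."""
    context_length = 2 * window_size
    for index, target in enumerate(word_ids):
        start = max(0, index - window_size)
        # Neighbours on the left and on the right of the target word
        context = word_ids[start:index] + word_ids[index + 1:index + window_size + 1]
        padding = [pad_id] * (context_length - len(context))
        yield padding + context, target

sentence = [11, 12, 13, 14, 15]  # hypothetical word ids for one sentence
pairs = list(context_target_pairs(sentence, window_size=2))
for context, target in pairs:
    print(context, '->', target)
```

Only the middle word of this five-word sentence has a full window; every other position needs at least one pad id.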

In [None]:
# Model building
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda

cbow = Sequential()
# Look up a vector for each of the 2 * window_size context words
cbow.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=window_size*2))
# Average the context vectors into a single embed_size vector
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embed_size,)))
# Predict the target word as a distribution over the vocabulary
cbow.add(Dense(vocab_size, activation='softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# summary() prints the table itself and returns None, so it is not wrapped in print()
cbow.summary()

# from IPython.display import SVG
# from keras.utils.vis_utils import model_to_dot

# SVG(model_to_dot(cbow, show_shapes=True, show_layer_names=False, rankdir='TB').create(prog='dot', format='svg'))

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 embedding (Embedding)       (None, 4, 100)            10500

 lambda (Lambda)             (None, 100)               0

 dense (Dense)               (None, 105)               10605

=================================================================
Total params: 21,105
Trainable params: 21,105
Non-trainable params: 0
_________________________________________________________________
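
The three layers amount to an embedding lookup, an average over the four context vectors, and a softmax over the vocabulary. The NumPy sketch below reproduces that forward pass and the parameter counts from the summary (105 × 100 = 10,500 and 100 × 105 + 105 = 10,605). The weights are random stand-ins and the initialisation ranges are assumptions, so this is an illustration of the shapes rather than the trained Keras model.

```python
import numpy as np

vocab_size, embed_size, context_length = 105, 100, 4
rng = np.random.default_rng(0)

# Random parameters with the same shapes as the Keras layers (values are placeholders)
embeddings = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_size))
dense_w = rng.normal(scale=0.1, size=(embed_size, vocab_size))
dense_b = np.zeros(vocab_size)

def cbow_forward(context_ids):
    """Return a probability distribution over the vocabulary for one context."""
    context_vectors = embeddings[context_ids]   # (4, 100) embedding lookup
    hidden = context_vectors.mean(axis=0)       # (100,) average of the context
    logits = hidden @ dense_w + dense_b         # (105,) one score per word
    exp = np.exp(logits - logits.max())         # numerically stable softmax
    return exp / exp.sum()

probs = cbow_forward(np.array([3, 5, 10, 2]))   # arbitrary example ids
n_embedding_params = embeddings.size
n_dense_params = dense_w.size + dense_b.size
print(probs.shape, n_embedding_params, n_dense_params)
```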


In [None]:
# Train for 5 epochs, feeding one (context, target) pair per batch
for epoch in range(1, 6):
    loss = 0.   # cross-entropy summed over every pair in this epoch
    i = 0
    for x, y in generate_context_word_pairs(corpus=wids, window_size=window_size, vocab_size=vocab_size):
        i += 1
        loss += cbow.train_on_batch(x, y)
        if i % 100000 == 0:
            print('Processed {} (context, word) pairs'.format(i))

    print('Epoch:', epoch, '\tLoss:', loss)
    print()

Epoch: 1 	Loss: 688.3122568130493

Epoch: 2 	Loss: 680.139594078064

Epoch: 3 	Loss: 673.7600057125092

Epoch: 4 	Loss: 668.9784500598907

Epoch: 5 	Loss: 666.2758429050446
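
Each value above is a sum of per-pair losses, because `train_on_batch` is called once per pair and its result is accumulated. That is why the totals are in the hundreds. A useful yardstick is the loss of a model that predicts every word with equal probability, which is ln(vocab_size) per pair. The sketch below computes that baseline; `mean_pair_loss` is a hypothetical helper and the numbers passed to it are placeholders, not results from this notebook.

```python
import math

def uniform_baseline(vocab_size):
    """Cross-entropy per pair when every word is predicted with probability 1/V."""
    return math.log(vocab_size)

def mean_pair_loss(epoch_loss, n_pairs):
    """Turn a summed epoch loss into an average loss per training pair."""
    return epoch_loss / n_pairs

baseline = uniform_baseline(105)
print(round(baseline, 3))

# Placeholder numbers, chosen only to show the call
example_mean = mean_pair_loss(epoch_loss=465.4, n_pairs=100)
print(round(example_mean, 3))
```

In the notebook, the number of pairs per epoch equals the number of tokens, since the generator yields one pair per word; it can be obtained as `sum(len(doc) for doc in wids)`.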



In [None]:
# The embedding matrix is the first weight array of the model
weights = cbow.get_weights()[0]
weights = weights[1:]   # drop row 0, the PAD vector
print(weights.shape)

# Row r of `weights` belongs to word id r + 1, so label the rows by id
pd.DataFrame(weights, index=[id2word[i] for i in range(1, vocab_size)]).head()

(104, 100)


word,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
and,0.014448,-0.029284,-0.035423,-0.036,0.044337,-0.006938,-0.048643,0.010988,0.023904,0.04118,...,-0.015466,-0.025648,0.048307,0.013982,-0.022741,-0.037856,0.046606,0.007796,-0.04114,0.031815
disease,-0.010541,-0.010285,-0.009707,0.020682,-0.002547,-0.034537,0.009864,0.024907,0.029813,0.002363,...,0.040963,0.013179,0.047231,-0.022594,0.030379,-0.023812,0.036356,0.023944,-0.046655,-0.016505
the,0.01551,-0.020697,-0.025914,-0.010149,-0.035583,-0.01202,0.046521,0.026137,-0.022844,-0.042199,...,-0.014969,-0.026405,0.012192,0.047313,0.017969,0.044551,0.026511,0.041246,0.015675,-0.03922
to,0.018251,0.026188,0.019935,-0.043572,-0.029743,0.019278,-0.028132,0.044867,-0.006527,-0.02488,...,-0.045433,-0.0234,0.015812,0.022574,-0.012801,-0.013299,-0.045309,-0.034459,-0.019358,0.039377
virus,0.027026,0.015665,0.030564,-0.049847,-0.02369,-0.010641,0.01999,0.001041,-0.017819,-0.020281,...,-0.021683,-0.039438,0.041375,-0.009283,-0.04091,0.02347,0.046881,0.025787,-0.024547,0.039874


In [None]:
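from sklearn.metrics.pairwise import euclidean_distances

# Pairwise Euclidean distances between all word vectors
distance_matrix = euclidean_distances(weights)
print(distance_matrix.shape)

# Row (id - 1) holds the distances for a word because the PAD row was dropped.
# argsort()[0] is the word itself, so take the next five and shift back to ids.
similar_words = {search_term: [id2word[idx] for idx in distance_matrix[word2id[search_term]-1].argsort()[1:6]+1] 
                   for search_term in ['disease']}

similar_words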
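
The lookup above can be hard to read because of the index shifts, so here is a small NumPy-only sketch of the same nearest-neighbour idea. The words and 2-d vectors are hand-picked toy values, not embeddings learned by the model, and `nearest_words` is a hypothetical helper.

```python
import numpy as np

def nearest_words(term, words, vectors, top_k=2):
    """Return the top_k words closest to `term` by Euclidean distance."""
    row = words.index(term)
    distances = np.linalg.norm(vectors - vectors[row], axis=1)
    order = distances.argsort()
    # Skip the query word itself, then keep the next top_k rows
    return [words[i] for i in order if i != row][:top_k]

# Toy 2-d embeddings, for illustration only
words = ['disease', 'illness', 'virus', 'mask']
vectors = np.array([[1.0, 1.0], [1.1, 0.9], [2.0, 1.0], [-3.0, 4.0]])
neighbours = nearest_words('disease', words, vectors)
print(neighbours)
```

With only five epochs on a corpus this small, the neighbours returned by the notebook are unlikely to be semantically meaningful; the sketch shows the mechanics of the lookup only.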