## DL Assignment No. 05
5. Implement the Continuous Bag of Words (CBOW) model. The implementation can be organised into the following stages:

a. Data preparation

b. Generate training data

c. Train model

d. Output

https://www.kaggle.com/code/aggarwalrahul/nlp-word-embedding-continuous-bag-of-words-cbow


In [1]:
import numpy as np
import re

In [2]:
sentences = """ The speed of transmission is an important point of difference between the two viruses. Influenza has a shorter median incubation period (the time from infection to appearance of symptoms) and a shorter serial interval (the time between successive cases) than COVID-19 virus. The serial interval for COVID-19 virus is estimated to be 5-6 days, while for influenza virus, the serial interval is 3 days. This means that influenza can spread faster than COVID-19. 
Further, transmission in the first 3-5 days of illness, or potentially pre-symptomatic transmission –transmission of the virus before the appearance of symptoms – is a major driver of transmission for influenza. In contrast, while we are learning that there are people who can shed COVID-19 virus 24-48 hours prior to symptom onset, at present, this does not appear to be a major driver of transmission. 
The reproductive number – the number of secondary infections generated from one infected individual – is understood to be between 2 and 2.5 for COVID-19 virus, higher than for influenza. However, estimates for both COVID-19 and influenza viruses are very context and time-specific, making direct comparisons more difficult.  """

In [3]:
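# a. Data preparation
# Replace every run of non-alphanumeric characters with a single space.
# Note: this also splits tokens such as "COVID-19" into "COVID" and "19", and "2.5" into "2" and "5".
sentences = re.sub(r'[^A-Za-z0-9]+', ' ', sentences)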
#sentences

In [4]:
# make all words lower case
sentences = sentences.lower()
#sentences

In [5]:
# Creating the vocabulary
words = sentences.split()
vocab = set(words)
vocab_size = len(vocab)
#vocab

In [6]:
# Indexing the Words
word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for i, word in enumerate(vocab)}
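
The mappings above follow the iteration order of a `set`, which for strings can differ from one Python process to the next, so the index assigned to a word is not guaranteed to be stable across runs. A minimal sketch of a reproducible alternative, assuming it is acceptable to assign indices in sorted order (`build_vocab` and the `demo_*` names are hypothetical helpers, not part of the notebook):

```python
def build_vocab(tokens):
    """Return (word_to_ix, ix_to_word) with indices assigned in sorted word order."""
    vocab = sorted(set(tokens))  # sorting makes the index assignment deterministic
    word_to_ix = {word: i for i, word in enumerate(vocab)}
    ix_to_word = {i: word for word, i in word_to_ix.items()}
    return word_to_ix, ix_to_word


# Illustrative tokens taken from the start of the corpus
demo_tokens = "the speed of transmission is an important point of difference".split()
demo_w2i, demo_i2w = build_vocab(demo_tokens)
print(len(demo_w2i), demo_w2i["an"], demo_i2w[0])
```

With sorted indices, saved weights such as `theta` can be reloaded in a later session and still line up with the same words.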

In [7]:
#ix_to_word

In [8]:
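# b. Generate training data
# Each sample pairs a target word with its four neighbours:
# two words to the left and two to the right (context window of 2).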
data = []
for i in range(2, len(words) - 2):
    context = [words[i - 2], words[i - 1], words[i + 1], words[i + 2]]
    target = words[i]
    data.append((context, target))
print(data[:5])

[(['the', 'speed', 'transmission', 'is'], 'of'), (['speed', 'of', 'is', 'an'], 'transmission'), (['of', 'transmission', 'an', 'important'], 'is'), (['transmission', 'is', 'important', 'point'], 'an'), (['is', 'an', 'point', 'of'], 'important')]
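
The loop above hard-codes a window of two words on each side. As a sketch of how the same pairing could be parameterised, the hypothetical helper below takes the window size as an argument (`make_cbow_pairs` and `window` are names introduced here for illustration only):

```python
def make_cbow_pairs(tokens, window=2):
    """Build (context, target) pairs using `window` words on each side of the target."""
    pairs = []
    for i in range(window, len(tokens) - window):
        # left neighbours followed by right neighbours, target excluded
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs


demo = "the speed of transmission is an important point".split()
pairs = make_cbow_pairs(demo, window=2)
print(pairs[0])
```

Changing `window` changes the length of the concatenated context vector, so the first dimension of `theta` (40 in this notebook, i.e. 4 context words times 10 dimensions) would have to be adjusted to `2 * window * 10`.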


In [9]:
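# One 10-dimensional embedding per vocabulary word, drawn uniformly from [0, 1).
# Note: the training loop below updates only theta; these embeddings stay fixed.
embeddings = np.random.random_sample((vocab_size, 10))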

In [10]:
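def forward(context_idxs, theta):
    # m: the four context embeddings concatenated into one row, shape (1, 40)
    m = embeddings[context_idxs].reshape(1, -1)
    # n: one unnormalised score per vocabulary word, shape (1, vocab_size)
    n = m.dot(theta)
    # o: log-softmax of n; subtracting max(n) first avoids overflow in exp
    o = np.log(np.exp(n - np.max(n)) / np.exp(n - np.max(n)).sum())
    return m, n, o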
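
The last step of `forward` is a log-softmax. Taking `np.log` of a ratio can still return `-inf` when a probability underflows to zero, so the following sketch uses the equivalent form "shifted scores minus log of the summed exponentials" instead. The function name `log_softmax` and the example scores are assumptions made for illustration:

```python
import numpy as np


def log_softmax(z):
    """Numerically stable log-softmax along the last axis."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # largest entry becomes 0
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


# Scores this large would overflow a naive exp(), but are handled here
scores = np.array([[1000.0, 1001.0, 1002.0]])
log_probs = log_softmax(scores)
print(np.exp(log_probs).sum())
```

Because softmax is unchanged when a constant is added to every score, the result for `[1000, 1001, 1002]` matches the result for `[1, 2, 3]`.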

In [11]:
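def backward(preds, theta, target_idxs):
    m, n, o = preds
    # One-hot encoding of the target word, same shape as the scores n
    out = np.zeros_like(n)
    out[np.arange(len(n)), target_idxs] = 1
    # Softmax of the scores; subtracting the row maximum avoids overflow in exp
    shifted = n - n.max(axis=-1, keepdims=True)
    softmax = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    # Gradient of the mean negative log-likelihood w.r.t. n, then w.r.t. theta
    dlog = (- out + softmax) / n.shape[0]
    dw = m.T.dot(dlog)
    return dw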
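
The gradient returned by `backward` is the standard softmax cross-entropy result, `m.T @ (softmax - one_hot)`. One way to gain confidence in it is a finite-difference check. The sketch below is self-contained and uses a toy setup (a 7-word vocabulary, a random stand-in for the concatenated context vector, and the hypothetical names `nll`, `analytic`, `numeric`); it does not reuse the notebook's variables:

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.random((1, 40))                # stand-in for four concatenated 10-d embeddings
theta = rng.uniform(-1, 1, (40, 7))    # toy vocabulary of 7 words (assumption)
target = 3                             # index of the toy target word


def nll(weights):
    """Negative log-likelihood of `target` under softmax(m @ weights)."""
    n = m.dot(weights)
    shifted = n - n.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[0, target]


# Analytic gradient, same formula as in backward()
n = m.dot(theta)
softmax = np.exp(n - n.max()) / np.exp(n - n.max()).sum()
one_hot = np.zeros_like(n)
one_hot[0, target] = 1
analytic = m.T.dot(softmax - one_hot)

# Numerical gradient by central differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(theta.shape[0]):
    for j in range(theta.shape[1]):
        bump = np.zeros_like(theta)
        bump[i, j] = eps
        numeric[i, j] = (nll(theta + bump) - nll(theta - bump)) / (2 * eps)

max_err = np.abs(analytic - numeric).max()
print(max_err)
```

A maximum difference many orders of magnitude below the gradient values themselves indicates that the analytic expression and the loss agree.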

In [12]:
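# c. Train model
# theta maps the concatenated context (4 words x 10 dimensions = 40) to vocabulary scores
theta = np.random.uniform(-1, 1, (40, vocab_size))
learning_rate = 0.03
epoch_losses = {}
for epoch in range(80):
    losses = []
    for context, target in data:
        context_idxs = np.array([word_to_ix[w] for w in context])
        preds = forward(context_idxs, theta)
        target_idxs = np.array([word_to_ix[target]])
        # Negative log-likelihood of the target word (preds[-1] holds the log-probabilities)
        loss = -preds[-1][np.arange(len(target_idxs)), target_idxs].mean()
        losses.append(loss)
        # Plain SGD step on theta, one sample at a time
        grad = backward(preds, theta, target_idxs)
        theta = theta - learning_rate * grad
    epoch_losses[epoch] = losses
    # Note: losses[epoch] is the loss of the sample at position `epoch` within this epoch,
    # not the epoch average, so the printed values fluctuate; np.mean(losses) gives the average.
    print('epoch -', epoch, '\tloss -', losses[epoch])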

epoch - 0 	loss - 5.336473570773312
epoch - 1 	loss - 3.721455354345014
epoch - 2 	loss - 4.185769530924911
epoch - 3 	loss - 5.267146301078617
epoch - 4 	loss - 4.096753216140029
epoch - 5 	loss - 5.4119348438620625
epoch - 6 	loss - 3.634939371180429
epoch - 7 	loss - 4.762962100214823
epoch - 8 	loss - 4.793614605089183
epoch - 9 	loss - 3.6810390530527903
epoch - 10 	loss - 5.28522726450934
epoch - 11 	loss - 5.588577406138822
epoch - 12 	loss - 3.3947510593754284
epoch - 13 	loss - 8.061769351899502
epoch - 14 	loss - 3.1395575890835072
epoch - 15 	loss - 3.1601178332129822
epoch - 16 	loss - 4.497819447270813
epoch - 17 	loss - 4.288797635122561
epoch - 18 	loss - 4.740924554864281
epoch - 19 	loss - 2.1283893015940447
epoch - 20 	loss - 3.6754438252418415
epoch - 21 	loss - 3.6813955124767284
epoch - 22 	loss - 3.6052383071790195
epoch - 23 	loss - 2.4416638396943617
epoch - 24 	loss - 1.8851089402375263
epoch - 25 	loss - 2.3087246871909843
epoch - 26 	loss - 0.9242279134497954
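
The values printed above come from `losses[epoch]`, which is the loss of a single training sample (the one at position `epoch`) rather than an average over the epoch, which explains why they jump around. Since `epoch_losses` keeps every per-sample loss, an average per epoch can be computed afterwards. A minimal sketch, using made-up numbers in place of the real `epoch_losses` (`mean_epoch_losses` and `toy` are hypothetical names):

```python
import numpy as np


def mean_epoch_losses(epoch_losses):
    """Map each epoch number to the mean of its per-sample losses."""
    return {epoch: float(np.mean(losses)) for epoch, losses in epoch_losses.items()}


# Made-up per-sample losses for two epochs, for illustration only
toy = {0: [4.0, 6.0, 5.0], 1: [3.0, 5.0, 4.0]}
summary = mean_epoch_losses(toy)
print(summary)
```

Applied to the notebook's own `epoch_losses`, this would give one averaged value per epoch, which is usually easier to read as a training curve.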

In [13]:
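def predict(words):
    # Return the vocabulary word with the highest log-probability
    # for the given four context words (two left, two right).
    context_idxs = np.array([word_to_ix[w] for w in words])
    preds = forward(context_idxs, theta)
    word = ix_to_word[np.argmax(preds[-1])]
    return word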

In [14]:
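def accuracy():
    # Fraction of training samples whose target word is predicted correctly.
    # This is measured on the training data itself, not on held-out text.
    wrong = 0
    for context, target in data:
        if predict(context) != target:
            wrong += 1
    return 1 - (wrong / len(data))
accuracy()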

0.9685863874345549
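
The figure above is computed on the same pairs the model was trained on, so it mostly shows how well the model has memorised them. A rough sketch of holding some pairs back for evaluation is given below; `split_pairs`, `test_fraction`, and `seed` are hypothetical names, and the toy pairs are made up. Because neighbouring windows overlap, a random split like this still shares words between the two sets, so it is only a first approximation:

```python
import random


def split_pairs(pairs, test_fraction=0.2, seed=0):
    """Shuffle (context, target) pairs and split them into train and test lists."""
    shuffled = list(pairs)                  # copy so the input list is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded for a repeatable split
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Ten made-up pairs standing in for the notebook's `data`
toy_pairs = [([f"w{i}", f"w{i + 1}", f"w{i + 3}", f"w{i + 4}"], f"w{i + 2}") for i in range(10)]
train_pairs, test_pairs = split_pairs(toy_pairs, test_fraction=0.2, seed=0)
print(len(train_pairs), len(test_pairs))
```

Training would then loop over `train_pairs` only, with accuracy reported separately on `test_pairs`.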

In [15]:
# d. Output
predict(['transmission', 'is', 'important', 'point'])

'an'
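
`predict` returns only the single most likely word. To see how confident the model is, it can help to list several candidates with their probabilities. The sketch below assumes log-probabilities shaped like the last element returned by `forward`; `top_k` and the toy inputs are hypothetical and stand in for the notebook's `preds[-1]` and `ix_to_word`:

```python
import numpy as np


def top_k(log_probs, ix_to_word, k=3):
    """Return the k most likely words with their probabilities, best first."""
    scores = np.asarray(log_probs).ravel()
    best = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(ix_to_word[int(i)], float(np.exp(scores[i]))) for i in best]


# Toy vocabulary and probabilities, for illustration only
toy_ix_to_word = {0: "an", 1: "is", 2: "of", 3: "the"}
toy_log_probs = np.log(np.array([[0.6, 0.1, 0.25, 0.05]]))
ranked = top_k(toy_log_probs, toy_ix_to_word, k=2)
print(ranked)
```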