<a href="https://colab.research.google.com/github/heshumi/NNTI-WS2021-NLP-Project/blob/main/Task_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 1: Word Embeddings (10 points)

This notebook will guide you through all steps necessary to train a word2vec model (see the PDF for a detailed description).

## Imports

This code block is reserved for your imports. 

You are free to use the following packages: 

(List of packages)

In [None]:
# Imports
import pandas as pd
import numpy as np
from math import sqrt
import random
import torch.nn as nn
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

In [None]:
torch.manual_seed(21) # set randomness

<torch._C.Generator at 0x7f1e6a82b9d0>

## 1.1 Get the data (0.5 points)

The Hindi portion of the HASOC corpus from [github.io](https://hasocfire.github.io/hasoc/2019/dataset.html) is already available in the repo at data/hindi_hatespeech.tsv. Load it into a data structure of your choice. Then, split off a small part of the corpus as a development set (~100 data points).

If you are using Colab, the first two lines will let you upload folders or files from your local file system.

In [None]:
# #TODO: implement!

# data = pd.read_csv('hindi_dataset.tsv', sep='\t', usecols=['text'])
# dev = data #.iloc[:100,]
# dev.head()

In [None]:
dev = pd.read_csv('bengali_hatespeech_sampled.csv', usecols=['text', 'task_1'])#.iloc[:100,]
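
The split into a development set and the remaining data is not shown in the cells above. A minimal sketch of one way to do it, in plain Python on placeholder rows; `split_dev`, `dev_size` and `seed` are hypothetical names and not part of the original notebook:

```python
import random

def split_dev(rows, dev_size=100, seed=21):
    """Return (dev, rest): dev_size randomly chosen rows and the remaining rows."""
    rng = random.Random(seed)          # local generator, so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    dev_idx = set(indices[:dev_size])
    dev = [r for i, r in enumerate(rows) if i in dev_idx]
    rest = [r for i, r in enumerate(rows) if i not in dev_idx]
    return dev, rest

rows = [f"tweet {i}" for i in range(1000)]  # placeholder data, not the real corpus
dev_rows, train_rows = split_dev(rows)
print(len(dev_rows), len(train_rows))
```

With a pandas DataFrame, `data.sample(n=100, random_state=21)` followed by `data.drop(dev.index)` should give an equivalent split.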

## 1.2 Data preparation (0.5 + 0.5 points)

* Prepare the data by removing everything that does not contain information. 
User names (starting with '@') and punctuation symbols clearly do not convey information, but we also want to get rid of so-called [stopwords](https://en.wikipedia.org/wiki/Stop_word), i.e. words that have little to no semantic content (and, but, yes, the...). Hindi stopwords can be found [here](https://github.com/stopwords-iso/stopwords-hi/blob/master/stopwords-hi.txt). Then, standardize the spelling by lowercasing all words.
Do this for the development section of the corpus for now.

* What about hashtags (starting with '#') and emojis? Should they be removed too? Justify your answer in the report, and explain how you accounted for this in your implementation.

In [None]:
#TODO: implement!

with open('stopwords-hi.txt', 'r') as f:
  stopwords_hi = f.read().split()

with open('stopwords-bn.txt', 'r') as f:
  stopwords_bn = f.read().split()

with open('stopwords-en.txt', 'r') as f:
  stopwords = set(f.read().split() + stopwords_hi + stopwords_bn)

punct = ':;?!-—-\"\'|।()[]{},./\\“'

def preprocess(x):
  for ch in punct:
    x = x.replace(ch, ' ')

  words = x.split(' ')

  preprocessed=[]

  for word in words: 
    word = word.lower().strip()
    if 2 < len(word) < 30 and word[0]!='@' and word not in stopwords and not word.startswith('http'): 
      preprocessed.append(word)

  if len(preprocessed) == 0:
    return np.nan

  return preprocessed

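We chose to leave the emojis and hashtags in the text because they might indicate the user's mood, which we will be trying to predict. Note that the length filter in ```preprocess``` (```2 < len(word) < 30```) still drops any standalone token of one or two characters, which includes an isolated emoji.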
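
To make the effect of this choice concrete, here is a simplified, self-contained sketch of such a filter. It is not the `preprocess` function above: `simple_preprocess` and `TOY_STOPWORDS` are made-up names, the stopword list is a placeholder, and there is no length filter.

```python
TOY_STOPWORDS = {"the", "and", "but"}  # placeholder list, not the real stopword files

def simple_preprocess(text, stopwords=TOY_STOPWORDS):
    tokens = []
    for token in text.lower().split():
        # Drop user mentions and links entirely
        if token.startswith("@") or token.startswith("http"):
            continue
        # Strip surrounding ASCII punctuation; '#' is not in the set, so hashtags survive
        token = token.strip(".,:;?!\"'()[]{}|/\\-")
        if token and token not in stopwords:
            tokens.append(token)
    return tokens

print(simple_preprocess("@user The movie was great!!! #cinema 😀 http://t.co/x"))
# ['movie', 'was', 'great', '#cinema', '😀']
```

The mention and the link are dropped, while the hashtag and the emoji stay in the token list as ordinary vocabulary items.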

In [None]:
dev['text'] = dev['text'].apply(lambda x: preprocess(x))
dev = dev.dropna()
dev = dev.reset_index(drop=True)

dev['text'][:5]

0    [বিশ্বাস, ভাই, হাজারের, কমেন্টস, দেখলাম, বিচার...
1             [ভাই, এসব, কথা, ভালো, আপনে, মানুষ, কিচো]
2    [কাওয়, কাদের, মরলে, কুততা, যবে, জানোয়ার, দের, ...
3    [এখনো, পদে, বহাল, কেনো, জাতি, চাই, জামালপুরের,...
4                               [একটা, ছাগলের, বাচ্চা]
Name: text, dtype: object

## 1.3 Build the vocabulary (0.5 + 0.5 points)

The input to the first layer of word2vec is a one-hot encoding of the current word. The output of the model is then compared to a numeric class label of the words within the size of the skip-gram window. Now,

* Compile a list of all words in the development section of your corpus and save it in a variable ```V```.

In [None]:
#TODO: implement!
V = {}

for s in dev['text']:
  for w in s:
    if w in V:
      V[w]+=1
    else: 
      V[w]=1
      
summ = sum(V.values())

* Then, write a function ```word_to_one_hot``` that returns a one-hot encoding of an arbitrary word in the vocabulary. The size of the one-hot encoding should be ```len(V)```.

In [None]:
# TODO: implement!

# # 1. Create a dictionary "{word:1-hot} for a faster access
# # Use this implementation for testing the captured semantics

onehot_dict={}

N = len(V)
for i, word in enumerate(V.keys()):
  onehot_dict[word] = np.append(np.append(np.zeros(i), [1]), np.zeros(N-i-1))

def word_to_one_hot(word):
  return onehot_dict[word]


# # 2. Create a dictionary "{word:word_num} and create the vectors on the go 
# # for memory economy. Use for training on large data

# onehot_dict={}

# N = len(V)
# for i, word in enumerate(V.keys()):
#   onehot_dict[word] = i

# def word_to_one_hot(word):
#   i = onehot_dict[word]
#   return np.append(np.append(np.zeros(i), [1]), np.zeros(N-i-1))
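
The memory-saving variant (store only an index per word and build the vector on demand) can be sketched on a toy vocabulary as follows. `build_index` and `one_hot` are hypothetical helper names, and plain lists stand in for NumPy arrays:

```python
def build_index(vocab):
    """Map each word to its position in the vocabulary."""
    return {word: i for i, word in enumerate(vocab)}

def one_hot(word, index):
    """Create the one-hot vector for `word` on the fly."""
    vec = [0.0] * len(index)
    vec[index[word]] = 1.0
    return vec

toy_vocab = ["ভাই", "কথা", "ভালো"]  # three words taken from the sample output above
idx = build_index(toy_vocab)
print(one_hot("কথা", idx))  # [0.0, 1.0, 0.0]
```

Storing indices needs memory proportional to the vocabulary size, whereas a dictionary of dense one-hot vectors grows quadratically with it. With NumPy, the same vector can be built with `np.zeros(N)` followed by a single index assignment.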

In [None]:
# for key, item in onehot_dict.items():
#   if item[-1] == 1:
#     print (key, len(item))
#     print(item)

In [None]:
len(onehot_dict)

14324

In [None]:
# onehot_dict['डालें']

## 1.4 Subsampling (0.5 points)

The probability to keep a word in a context is given by:

$P_{keep}(w_i) = \Big(\sqrt{\frac{z(w_i)}{0.001}}+1\Big) \cdot \frac{0.001}{z(w_i)}$

where $z(w_i)$ is the relative frequency of the word $w_i$ in the corpus. Now,
* Calculate word frequencies
* Define a function ```sampling_prob``` that takes a word (string) as input and returns the probability to **keep** the word in a context.

In [None]:
#TODO: implement!

def sampling_prob(word):
  z = V[word]/summ

  return (sqrt(z/0.001)+1)*0.001/z
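
A few worked values show how the formula behaves. The sketch below takes the relative frequency directly instead of a word; `keep_prob` and its parameter `t` are illustrative names:

```python
from math import sqrt

def keep_prob(z, t=0.001):
    """P_keep for a word with relative frequency z, using the formula above."""
    return (sqrt(z / t) + 1) * t / z

for z in (0.0005, 0.001, 0.01, 0.1):
    print(f"z = {z}: P_keep = {keep_prob(z):.4f}")
```

For z = 0.001 the expression evaluates to 2, and for z = 0.1 to 0.11. Values above 1 simply mean the word is always kept when compared against `random.random()`. The value drops below 1 once z exceeds roughly 0.0026, so only words more frequent than that are ever discarded.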

## 1.5 Skip-Grams (1 point)

Now that you have the vocabulary and one-hot encodings at hand, you can start to do the actual work. The skip gram model requires training data of the shape ```(current_word, context)```, with ```context``` being the words before and/or after ```current_word``` within ```window_size```. 

* Have a closer look at the original paper. Once you feel you understand how skip-gram works, implement a function ```get_target_context``` that takes a sentence as input and [yield](https://docs.python.org/3.9/reference/simple_stmts.html#the-yield-statement)s a ```(current_word, context)``` pair.

* Use your ```sampling_prob``` function to drop words from contexts as you sample them. 

In [None]:
#TODO: implement!

def get_target_context(sentence, window_size=5):
  new_sent = [w for w in sentence if random.random() < sampling_prob(w)] # Remove too frequent words 

  for i, w in enumerate(new_sent): # For each word in a sentence (or comment)
    context = new_sent[max(0, i-window_size):i] + new_sent[i+1:i+window_size+1] # clamp at 0: a negative start would count from the end of the list
    yield (w, context) # yield its context
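
The windowing itself is easiest to check on a toy sentence without subsampling. A minimal sketch, where `windowed_pairs` is a hypothetical name used only for this illustration:

```python
def windowed_pairs(tokens, window_size=2):
    """Yield (current_word, context) for every position in tokens."""
    for i, target in enumerate(tokens):
        # The left edge is clamped at 0 because a negative slice start in Python
        # counts from the end of the list instead of stopping at the beginning.
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        yield target, left + right

pairs = list(windowed_pairs(["a", "b", "c", "d"], window_size=1))
print(pairs)
# [('a', ['b']), ('b', ['a', 'c']), ('c', ['b', 'd']), ('d', ['c'])]
```

Words at the sentence boundaries get a shorter context, since there is nothing to the left of the first word or to the right of the last one.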

## 1.6 Hyperparameters (0.5 points)

According to the word2vec paper, what would be a good choice for the following hyperparameters? 

* Embedding dimension
* Window size

Initialize them in a dictionary or as independent variables in the code block below. 

In [None]:
# Set hyperparameters
window_size = 5
embedding_size = 300
input_size = len(V) # Onehot vector length
batch_size = 100

# More hyperparameters
learning_rate = 0.05
epochs = 100

## Create a dataset and a data loader

In [None]:
class HindiDataset(Dataset):
  def __init__(self, df):
    self.df = df

  def __len__(self):
    return len(self.df)

  def __getitem__(self, idx): # Returns pairs of 'word - context' in onehot
    word_list=[] # List of input words
    cont_list=[] # List of corresponding context words
    
    for word, context in get_target_context(self.df[idx], window_size= window_size):
        word_onehot = word_to_one_hot(word)

        for cont_word in context:
          cont_onehot = word_to_one_hot(cont_word)

          word_list.append(word_onehot)
          cont_list.append(cont_onehot)
    
    word_list = torch.tensor(word_list) #.type(torch.DoubleTensor)
    cont_list = torch.tensor(cont_list)

    wc = torch.stack([word_list, cont_list])
    # .type() returns a new tensor, so the converted tensor has to be returned (or reassigned)
    return wc.type(torch.DoubleTensor) # 2 stacked tensors: word vectors, context vectors


In [None]:
def my_collate(batches): # batches is a list of tensors returned from __getitem__()
  data = []
  labels = []
  for b in batches:
    data.append(b[0])
    labels.append(b[1])

  d = torch.cat(data, dim=0) # data
  l = torch.cat(labels, dim=0) # labels
  
  return (d, l)


In [None]:
dataset = HindiDataset(dev['text'])
dataloader = DataLoader(dataset, collate_fn = my_collate, batch_size= batch_size, shuffle = False, num_workers=0)

In [None]:
len(V)

14324

## Create files of tensors for a faster access

In [None]:
# for i, batch in enumerate(dataloader):
#   full_batch = torch.stack(batch).type(torch.int8)
#   print(full_batch.size())
#   print(full_batch.size()[0]*full_batch.size()[1]*full_batch.size()[2]*8 *1.25e-10)
#   torch.save(batch[0], 'Dataset/batch-test{}.pt'.format(i))
    

## 1.7 Pytorch Module (0.5 + 0.5 + 0.5 points)

Pytorch provides a wrapper for your fancy and super-complex models: [torch.nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). The code block below contains a skeleton for such a wrapper. Now,

* Initialize the two weight matrices of word2vec as fields of the class.

* Override the ```forward``` method of this class. It should take a one-hot encoding as input, perform the matrix multiplications, and finally apply a log softmax on the output layer.

* Initialize the model and save its weights in a variable. The Pytorch documentation will tell you how to do that.
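
The forward pass described in these bullets (one-hot input, two matrix multiplications, log softmax) can be traced by hand on a toy model. The sketch below uses plain Python and made-up weights `W_in` and `W_out`; it only illustrates the computation and is not the model defined in the next cell.

```python
from math import exp, log

def matvec(vec, matrix):
    """Row vector times matrix, with the matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

def log_softmax(scores):
    """Numerically stable log softmax: subtract the maximum before exponentiating."""
    m = max(scores)
    log_z = m + log(sum(exp(s - m) for s in scores))
    return [s - log_z for s in scores]

# Toy sizes: vocabulary of 3 words, embedding dimension 2 (weights are made up).
W_in = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # 3 x 2, one row per word
W_out = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]   # 2 x 3

one_hot = [0.0, 1.0, 0.0]
hidden = matvec(one_hot, W_in)                 # selects row 1 of W_in
log_probs = log_softmax(matvec(hidden, W_out))
print(hidden, log_probs)
```

Multiplying a one-hot vector by the input matrix just selects one row, which is why the rows of the input weights can be read off as word embeddings after training.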

In [None]:
# Create model 

class Word2Vec(nn.Module):
  def __init__(self):
    super().__init__()

    # Hidden layer
    self.fc1 = nn.Linear(input_size, embedding_size)

    # Output layer
    self.fc2 = nn.Linear(embedding_size, input_size)

  def forward(self, one_hot):
    x = self.fc1(one_hot)
    x = F.relu(x)
    x = self.fc2(x)
    
    # No softmax here: the cosine loss used for training normalizes the vectors itself

    return x

In [None]:
model = Word2Vec()

model.fc1.weight.shape

torch.Size([300, 14324])

In [None]:
torch.manual_seed(21) # For reproducibility

model = Word2Vec()
model.double()
# model.cuda()

nn.init.uniform_(model.fc1.weight)
nn.init.uniform_(model.fc2.weight)

Parameter containing:
tensor([[0.0668, 0.4083, 0.8056,  ..., 0.8653, 0.0113, 0.2629],
        [0.4388, 0.4043, 0.6394,  ..., 0.7573, 0.1224, 0.9858],
        [0.7768, 0.8394, 0.6113,  ..., 0.6657, 0.3316, 0.6482],
        ...,
        [0.1821, 0.0859, 0.3567,  ..., 0.5399, 0.0650, 0.7519],
        [0.8096, 0.1359, 0.5630,  ..., 0.5545, 0.0962, 0.9322],
        [0.1502, 0.6163, 0.8451,  ..., 0.2649, 0.9688, 0.9619]],
       dtype=torch.float64, requires_grad=True)

## 1.8 Loss function and optimizer (0.5 points)

Initialize variables with [optimizer](https://pytorch.org/docs/stable/optim.html#module-torch.optim) and loss function. You can take what is used in the word2vec paper, but you can use alternative optimizers/loss functions if you explain your choice in the report.

In [None]:
# Define optimizer and loss
optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate)

criterion = nn.CosineEmbeddingLoss()
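
For a target of 1, `nn.CosineEmbeddingLoss` evaluates to 1 - cos(a, b) for each pair of vectors. A plain-Python sketch of that quantity for a single pair (`cosine_loss` is an illustrative helper, not a PyTorch function):

```python
from math import sqrt

def cosine_loss(a, b):
    """1 - cosine similarity of two equally long, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

target = [0.0, 1.0, 0.0]                      # one-hot context word
print(cosine_loss(target, [0.0, 5.0, 0.0]))   # same direction: 0.0
print(cosine_loss(target, [1.0, 0.0, 0.0]))   # orthogonal: 1.0
print(cosine_loss(target, [1.0, 1.0, 0.0]))   # 45 degrees apart: about 0.293
```

The loss depends only on the direction of the output vector, not on its length, so scaling the model output leaves it unchanged.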

## 1.9 Training the model (3 points)

As everything is prepared, implement a training loop that performs several passes of the data set through the model. You are free to do this as you please, but your code should:

* Load the weights saved in 1.7 at the start of every execution of the code block
* Print the accumulated loss at least after every epoch (the accumulated loss should be reset after every epoch)
* Define a criterion for the training procedure to terminate if a certain loss value is reached. You can find the threshold by observing the loss for the development set.
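
The termination criterion from the last bullet can be sketched independently of the model. In the snippet below, `train_until`, `run_epoch` and `loss_threshold` are hypothetical names, and the loss values are made up purely to exercise the loop:

```python
def train_until(run_epoch, max_epochs, loss_threshold):
    """Call run_epoch() once per epoch and stop early when the mean loss reaches the threshold."""
    history = []
    for epoch in range(1, max_epochs + 1):
        mean_loss = run_epoch()   # one full pass over the data, returns the mean loss
        history.append(mean_loss)
        print(f"Epoch: {epoch}; mean loss: {mean_loss:.4f}")
        if mean_loss <= loss_threshold:
            break
    return history

# Stand-in for real training: these numbers are invented for illustration only.
fake_losses = iter([0.95, 0.90, 0.84, 0.80])
history = train_until(lambda: next(fake_losses), max_epochs=10, loss_threshold=0.85)
```

Here training stops after the third epoch, the first one whose mean loss is at or below the threshold. In a real run, `run_epoch` would wrap the loop over the data loader and reset its accumulated loss on every call.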

You can play around with the number of epochs and the learning rate.

In [None]:
# Define train procedure

# load initial weights
import time 

def train():
  
  print("Training started")
  
  for epoch in range(epochs):
    acc_loss=0
    words_num=0
    t1=time.time()
    for i, batch in enumerate(dataloader): # This line is slow for large data
      # t2 = time.time()
      # t3 = t2 - t1


      # print('Time for dataloader: {}'.format(t3))

      word_t = batch[0]#.cuda()
      cont_t = batch[1]#.cuda()
      
      if word_t.size()[0] == 0:
        continue

      outputs = model(word_t)

      loss = criterion(cont_t, outputs, torch.ones(len(cont_t)))#.cuda()) # Correctly computes 1 - cos(a,b)
    
      acc_loss += loss.item()
      words_num += 1

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

      # t1 = time.time()
      # t4 = t1 - t2
      
      # print('Time for the rest: {} s'.format(t4))
    
    print('Epoch: {}; Accumulated mean loss: {}'.format(epoch+1, acc_loss/words_num))
    torch.save(model.state_dict(), 'Word2Vec-bg.model')

In [None]:
train()

print("Training finished")

Training started


KeyboardInterrupt: ignored

Batch size = 200 => Error = 0.7 on the 100th epoch

In [None]:
batch_size

For batch_size = 1, the maximum time per batch on the first 100 data points is 0.03.

For batch_size = 1, the maximum time per batch on all data points is 0.43.

Each batch is processed in at most 0.8 s. However, when dealing with the large dataset, each batch takes 10 seconds, even though the batch sizes are the same.

Why does this happen?

**Experiment Results**

1) window_size = 5; lr = 0.05; embedding_size = 30; subsampling = True

* Epoch 1: error 0.915
* Epoch 17: error 0.858
* Epoch 50: error 0.8499
* Epoch 70: error 0.8473
* Epoch 90: error 0.8486
* Epoch 100: error 0.8478

2) window_size = 3; lr = 0.05; embedding_size = 300; subsampling = False

* Epoch 1: error 0.906
* Epoch 7: error 0.8496
* Epoch 25: error 0.8129
* Epoch 50: error 0.8194

3) window_size = 5; lr = 0.05; embedding_size = 300; subsampling = True; batch_size = 500

## 1.10 Train on the full dataset (0.5 points)

Now, go back to 1.1 and remove the restriction on the number of sentences in your corpus. Then, reexecute code blocks 1.2, 1.3 and 1.6 (or those relevant if you created additional ones). 

* Then, retrain your model on the complete dataset.

* Now, the input weights of the model contain the desired word embeddings! Save them together with the corresponding vocabulary items (Pytorch provides a nice [functionality](https://pytorch.org/tutorials/beginner/saving_loading_models.html) for this).

In [None]:
train()

print("Training finished")

Training started
Time for dataloader: 3.9300851821899414
Time for the rest: 0.06310796737670898 s
Time for dataloader: 3.6967005729675293
Time for the rest: 0.05514931678771973 s
Time for dataloader: 4.438810110092163
Time for the rest: 0.0655677318572998 s
Time for dataloader: 3.705820083618164
Time for the rest: 0.05425596237182617 s


KeyboardInterrupt: ignored

Data size with the first 100 points, batch_size = 100: 13478

Data size with all points, batch_size = 100: 13617

This is as expected. However, each batch is processed much more slowly, and the RAM consumption might be higher.

Training is continued on the cluster.

In [None]:
torch.save(model.state_dict(), 'Word2Vec.model')

## All data experiments

1) window_size = 5; embedding_size = 300; lr = 0.05;

Epoch: 1; Accumulated mean loss: 0.9824172251430727

Epoch: 2; Accumulated mean loss: 0.9769900029832989

Epoch: 3; Accumulated mean loss: 0.9706227188189077

Epoch: 4; Accumulated mean loss: 0.963010271279602

Epoch: 5; Accumulated mean loss: 0.9569715874365621

Epoch: 6; Accumulated mean loss: 0.9525484683097712

Epoch: 7; Accumulated mean loss: 0.9490198400897474

Epoch: 8; Accumulated mean loss: 0.9462296936510709

Epoch: 9; Accumulated mean loss: 0.9439137938502523

Epoch: 10; Accumulated mean loss: 0.9419612759007573

Epoch: 11; Accumulated mean loss: 0.9401766861262544

Epoch: 12; Accumulated mean loss: 0.9387034317395186

Epoch: 13; Accumulated mean loss: 0.9373532943693524

Epoch: 14; Accumulated mean loss: 0.936140513085175

Epoch: 30; Accumulated mean loss: 0.9279157843358515

Epoch: 50; Accumulated mean loss: 0.9227518338587359

Epoch: 80; Accumulated mean loss: 0.9193855124162842

Epoch: 90; Accumulated mean loss: 0.9186107575588937

Epoch: 100; Accumulated mean loss: 0.918196428019041

## Debugging notes

In [None]:
model = Word2Vec()#.cuda()
model.double()

model.load_state_dict(torch.load('Word2Vec-bg.model', map_location=torch.device('cpu')))
model.eval()

Word2Vec(
  (fc1): Linear(in_features=14324, out_features=300, bias=True)
  (fc2): Linear(in_features=300, out_features=14324, bias=True)
)

In [None]:
# Prints the 10 highest-scoring context words for the input word.

word = torch.from_numpy(word_to_one_hot('মুভির'))#.cuda() # देशद्रोही
word = word.view(1, len(word))

with torch.no_grad(): # inference only, no gradients needed
  context = model(word)

top = torch.topk(context, 10) # most suitable context (sorted by relevance); avoids shadowing the built-in max

idx_to_word = list(onehot_dict.keys()) # insertion order matches the one-hot index
for idx in top.indices[0]:
  print(idx_to_word[idx.item()])

চুদার
থাকি
ভাইয়ের
জানতাম
সৌদি
শুনতাম
খুবই
গান<br
হয়না
>হৃদয়


**Works well!**