# Course AI Homework 5
In Homework 5, we will train our own 'CBOW' Word2Vec embedding on the WikiText2 dataset (a small dataset).
- Change the Runtime option above to GPU if you can (max 12 hours per user).
- Save and submit the outputs of this notebook together with the model and vocab files you trained.
- You are not allowed to use other Python files or import a pretrained model.

In [None]:
# Run this command if you train the model in the Colab environment
! pip install datasets transformers

In [None]:
%pip install datasets

In [38]:
import argparse
import yaml
import os
import torch
import torch.nn as nn
import torchtext

import json
import numpy as np

from functools import partial
from torch.utils.data import DataLoader
from torchtext.data import to_map_style_dataset
from torchtext.data.utils import get_tokenizer

from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import WikiText2 # WikiText103

import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

from datasets import load_dataset



In [39]:
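device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch_seed_numb = 0
# Seed the CPU generator too, so runs without a GPU are also reproducible
torch.manual_seed(torch_seed_numb)
if device.type == 'cuda':
    torch.cuda.manual_seed(torch_seed_numb)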

In [40]:
device

device(type='cpu')

In [None]:
# If you use the Google Colab environment, mount your Google Drive here to save the model and vocab
from google.colab import drive
drive.mount('/content/drive')
root_dir = '/content/drive/MyDrive/course_ai_hw5'

### Constant Setting

In [71]:
# You could change parameters if you want.

train_batch_size =  96
val_batch_size = 96
shuffle =  True

optimizer =  'Adam'
learning_rate =  0.025
epochs = 50

result_dir = 'weights/'

# Parameters about CBOW model architecture and Vocab.
CBOW_N_WORDS = 4

MIN_WORD_FREQUENCY = 50
MAX_SEQUENCE_LENGTH = 256

EMBED_DIMENSION = 300
EMBED_MAX_NORM = 1

In [None]:
result_dir = os.path.join(root_dir, result_dir)
# makedirs also creates missing parent folders and does nothing if the folder already exists
os.makedirs(result_dir, exist_ok=True)


In [58]:
# I used this code since I worked in VS code
result_dir = "weights/"
if not os.path.exists(result_dir):
    os.mkdir(result_dir)

## Prepare dataset and vocab

In [8]:
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
train_dataset = datasets["train"]
val_dataset = datasets['validation']
test_dataset = datasets['test']
#train_dataset.map(tokenizing_word , batched= True, batch_size = 5000)


In [59]:
# Let's print one example
train_dataset['text'][11]

" Troops are divided into five classes : Scouts , Shocktroopers , Engineers , Lancers and Armored Soldier . Troopers can switch classes by changing their assigned weapon . Changing class does not greatly affect the stats gained while in a previous class . With victory in battle , experience points are awarded to the squad , which are distributed into five different attributes shared by the entire squad , a feature differing from early games ' method of distributing to different unit types . \n"

As you can see, we need to clean up the sentences, lowercase them, tokenize them, and map each word to an integer index (which stands in for a one-hot vector). Before going through the whole process, we need to build a vocabulary from the training dataset.
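
To make the lowercase, tokenize, and index steps concrete, here is a minimal pure-Python sketch. It is only an illustration: the toy corpus, `toy_tokenize`, `MIN_FREQ`, and the alphabetical ordering of the vocabulary are assumptions made for this example, and a real tokenizer and vocab builder handle punctuation and ordering differently.

```python
from collections import Counter

# Hypothetical toy corpus; the notebook itself uses WikiText2.
corpus = [
    "The cat sat on the mat .",
    "The dog sat on the log .",
]

def toy_tokenize(text):
    # Simplified stand-in for a real tokenizer: lowercase and split on whitespace.
    return text.lower().split()

MIN_FREQ = 2  # assumed frequency threshold for this toy example

# Count every token in the corpus.
counts = Counter(tok for line in corpus for tok in toy_tokenize(line))

# Index 0 is reserved for out-of-vocabulary words; rare words are dropped.
itos = ["<unk>"] + sorted(w for w, c in counts.items() if c >= MIN_FREQ)
stoi = {w: i for i, w in enumerate(itos)}

def encode(text):
    # Map every token to its index, falling back to <unk> for rare or unseen words.
    return [stoi.get(tok, stoi["<unk>"]) for tok in toy_tokenize(text)]

print(itos)
print(encode("The cat sat on the log ."))
```

Words below the frequency threshold ("cat", "log") all collapse to the `<unk>` index, which is the same role the default index plays in the real vocabulary.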

In [60]:
# references : https://pytorch.org/text/stable/vocab.html#build-vocab-from-iterator

tokenizer = get_tokenizer("basic_english", language="en")

# TODO 1): make vocabulary
# Hint) use function: build_vocab_from_iterator, use train_dataset set special tokens.. etc
def build_vocab(data_iter, tokenizer):    
    vocab = build_vocab_from_iterator(map(tokenizer, data_iter), specials=["<unk>"], min_freq=MIN_WORD_FREQUENCY)
    vocab.set_default_index(vocab["<unk>"])
    return vocab

vocab = build_vocab(train_dataset['text'], tokenizer)

We need a collate function to convert the dataset into the CBOW training format. The collate function should slide a window over each text in the batch and build the train/validation examples. Each example consists of `CBOW_N_WORDS` words on each side (left and right) as the input and the center word as the target output.  
Make the collate function return the CBOW dataset as tensors.
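
The sliding-window idea can be sketched on a toy sequence before writing the real collate function. The helper name `cbow_windows` and the made-up indices below are hypothetical and only illustrate how context/target pairs are formed.

```python
def cbow_windows(token_ids, n_words):
    """Return (context, target) pairs using n_words tokens on each side."""
    pairs = []
    window = 2 * n_words + 1
    # Every position where a full window fits yields one training example.
    for start in range(len(token_ids) - window + 1):
        chunk = token_ids[start : start + window]
        target = chunk[n_words]                           # center word
        context = chunk[:n_words] + chunk[n_words + 1 :]  # words around it
        pairs.append((context, target))
    return pairs

# Toy sequence of made-up vocabulary indices, with 2 context words per side.
pairs = cbow_windows([10, 11, 12, 13, 14, 15], n_words=2)
for context, target in pairs:
    print(context, "->", target)
```

A sequence shorter than one full window produces no pairs, which is why such lines can simply be skipped.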

In [61]:
# Here is a lambda function to tokenize sentence and change words to vocab indexes.
text_pipeline = lambda x: vocab(tokenizer(x))

![cbow](https://user-images.githubusercontent.com/74028313/204695601-51d44a38-4bd3-4a69-8891-2854aa57c034.png)

In [62]:
def collate(batch, text_pipeline):
    # Build CBOW (context, target) pairs from a batch of raw text lines.
    batch_input, batch_output = [], []

    # TODO 2): make collate function
    for text in batch:
        text_idx = text_pipeline(text)
        # skip lines too short to hold one full window
        if len(text_idx) < CBOW_N_WORDS * 2 + 1:
            continue
        if MAX_SEQUENCE_LENGTH:
            text_idx = text_idx[:MAX_SEQUENCE_LENGTH]
        # slide a window of 2 * CBOW_N_WORDS + 1 tokens over the line
        for idx in range(len(text_idx) - CBOW_N_WORDS * 2):
            ids = text_idx[idx : (idx + CBOW_N_WORDS * 2 + 1)]
            outputs = ids.pop(CBOW_N_WORDS)  # center word is the target
            inputs = ids  # remaining words are the context
            batch_input.append(inputs)
            batch_output.append(outputs)

    batch_input = torch.tensor(batch_input, dtype=torch.long)
    batch_output = torch.tensor(batch_output, dtype=torch.long)
    return batch_input, batch_output

In [63]:
train_dataloader = DataLoader(
    train_dataset['text'],
    batch_size=train_batch_size,
    shuffle=shuffle,
    collate_fn=partial(collate, text_pipeline=text_pipeline),
)

val_dataloader = DataLoader(
    val_dataset['text'],
    batch_size=val_batch_size,
    shuffle=shuffle,
    collate_fn=partial(collate, text_pipeline=text_pipeline),
)

## Make CBOW Model
![image](https://user-images.githubusercontent.com/74028313/204701161-cd9df4bf-78b8-4b4d-b8b7-ed4a3b5c3922.png)

The main idea of the CBOW model is to predict the center (target) word from its context words. As the simple architecture above shows, the 2 × `CBOW_N_WORDS` input words are mapped to the projection layer. Converting each word to an embedding requires a look-up table, for which we use torch's `nn.Embedding`. After the context embeddings are combined, a shallow linear network predicts the target word, and the prediction is compared with the center word's index using cross-entropy loss. Finally, the embedding layer (look-up table) of the trained model itself serves as the word embedding.
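
As a rough sketch of the computation described above, assume the context embeddings are combined by averaging (summing is another common choice). The symbols are notation introduced here, not part of the assignment: $C$ stands for `CBOW_N_WORDS`, $V$ for the vocabulary size, $\mathbf{e}_w$ for the embedding row of word $w$, $w_t$ for the center word, and $\mathbf{W}$, $\mathbf{b}$ for the weight and bias of the linear layer.

$$
\mathbf{h} = \frac{1}{2C} \sum_{\substack{-C \le j \le C \\ j \neq 0}} \mathbf{e}_{w_{t+j}},
\qquad
\mathbf{z} = \mathbf{W}\mathbf{h} + \mathbf{b},
\qquad
\mathcal{L} = -\log \frac{\exp(z_{w_t})}{\sum_{k=1}^{V} \exp(z_k)}
$$

Here $\mathbf{h}$ is the averaged context vector, $\mathbf{z}$ holds one score per vocabulary word, and $\mathcal{L}$ is the cross-entropy loss for a single example.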

In [138]:
''' references
  1. https://tutorials.pytorch.kr/beginner/text_sentiment_ngrams_tutorial.html
  2. https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html

  Note : the following implementation is almost the same as reference 1's implementation.
  However, please believe that I tried to fully understand what each line does, rather than just copying and pasting.
'''
class CBOW_Model(nn.Module):
    def __init__(self, vocab_size: int,EMBED_DIMENSION,EMBED_MAX_NORM):
        super(CBOW_Model, self).__init__()
        # TODO 3-1): make CBOW model using nn.Embedding and nn.Linear function
        self.embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=EMBED_DIMENSION,max_norm=EMBED_MAX_NORM) # vocab_size : V, EMBED_DIMENSION : E
        self.linear = nn.Linear(EMBED_DIMENSION,vocab_size)
        
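    def forward(self, inputs_):
        # TODO 3-2): make forward function
        # inputs_: (batch, 2 * CBOW_N_WORDS) context word indices
        _output = self.embeddings(inputs_)  # (batch, 2 * CBOW_N_WORDS, E)
        _output = _output.mean(axis=1)  # average the context embeddings -> (batch, E)
        return self.linear(_output)  # (batch, V) unnormalized scores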

## Train the model

Let's write the `_train_epoch` and `_validate_epoch` functions to train the CBOW model.  
- `model.train()` and `model.eval()` switch the model between training and evaluation mode; some layers (Dropout, BatchNorm, etc.) behave differently in each mode.
- There is an optional `lr_scheduler` that changes the learning rate from epoch to epoch. Try it if you are interested.
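
To get a feel for what an epoch-based scheduler does, the sketch below computes the learning rate by hand for a multiplicative rule of the kind `LambdaLR` applies: the base learning rate times a factor that depends on the epoch. The starting value 0.025 and the decay factor 0.95 match the values used in this notebook; the names `base_lr`, `lr_lambda`, and `schedule` are just for this illustration.

```python
base_lr = 0.025  # same starting value as learning_rate in this notebook

def lr_lambda(epoch):
    # Multiplicative factor for a given epoch: decay by 5% per epoch.
    return 0.95 ** epoch

# Learning rate in effect during each of the first five epochs.
schedule = [base_lr * lr_lambda(epoch) for epoch in range(5)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

The first epoch runs at the base learning rate, and every later epoch uses a slightly smaller step, which tends to help the loss settle toward the end of training.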

In [139]:
vocab_size = len(vocab.get_stoi()) #4119

# model = CBOW_Model(vocab_size=vocab_size, EMBED_DIMENSION = EMBED_DIMENSION, EMBED_MAX_NORM = EMBED_MAX_NORM)
model = CBOW_Model(vocab_size=vocab_size, EMBED_DIMENSION = EMBED_DIMENSION, EMBED_MAX_NORM = EMBED_MAX_NORM).to(device) # .to(device) added here because the line above raised an error
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr = learning_rate)

In [75]:
''' references
  1. https://algopoolja.tistory.com/55
  2. https://tigris-data-science.tistory.com/entry/PyTorch-modeltrain-vs-modeleval-vs-torchnograd
  3. https://tutorials.pytorch.kr/beginner/basics/optimization_tutorial.html
'''
class Train_CBOW:

    def __init__(
        self,
        model,
        epochs,
        train_dataloader,
        val_dataloader,
        loss_function,
        optimizer,
        device,
        model_dir,
        lr_scheduler = None
    ):
        self.model = model
        self.epochs = epochs
        self.train_dataloader = train_dataloader
        self.val_dataloader = val_dataloader
        self.loss_function = loss_function
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler
        self.device = device
        self.model_dir = model_dir

        self.loss = {"train": [], "val": []}
        self.model.to(self.device)

    def train(self):
        for epoch in range(self.epochs):
            self._train_epoch()
            self._validate_epoch()
            print(
                "Epoch: {}/{}, Train Loss={:.5f}, Val Loss={:.5f}".format(
                    epoch + 1,
                    self.epochs,
                    self.loss["train"][-1],
                    self.loss["val"][-1],
                )
            )
            if self.lr_scheduler is not None:
                self.lr_scheduler.step()


    
    # The following code is based on reference 3. Please note that it is not identical to that implementation, but the overall process is the same.
    
    def _train_epoch(self):
        self.model.train() # set model as train
        loss_list = []
        # TODO 4-1):
        for batch_data in self.train_dataloader:
            inputs = batch_data[0].to(self.device) # context
            labels = batch_data[1].to(self.device) # target word

            self.optimizer.zero_grad() # reset gradients every iteration; PyTorch accumulates them otherwise (ref. 2)
            outputs = self.model(inputs)
            
            loss = self.loss_function(outputs, labels)
            loss.backward()
            self.optimizer.step()
            loss_list.append(loss.item())

        # end of TODO
        epoch_loss = np.mean(loss_list)
        self.loss["train"].append(epoch_loss)

    def _validate_epoch(self):
        self.model.eval()
        loss_list = []
        
        with torch.no_grad(): # turn off the gradient calculation : validation epoch!
            # TODO 4-2):
            for batch_data in self.val_dataloader:
                inputs = batch_data[0].to(self.device)
                labels = batch_data[1].to(self.device)

                outputs = self.model(inputs)
                
                loss = self.loss_function(outputs, labels)
                loss_list.append(loss.item())
                
        # end of TODO
        epoch_loss = np.mean(loss_list)
        self.loss["val"].append(epoch_loss)


    def save_model(self):
        model_path = os.path.join(self.model_dir, "model.pt")
        torch.save(self.model, model_path)

    def save_loss(self):
        loss_path = os.path.join(self.model_dir, "loss.json")
        with open(loss_path, "w") as fp:
            json.dump(self.loss, fp)

In [76]:
# Optional: you can add or change the lr_scheduler
scheduler = LambdaLR(optimizer, lr_lambda = lambda epoch: 0.95 ** epoch)

In [77]:
trainer = Train_CBOW(
    model=model,
    epochs=epochs,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    loss_function=loss_function,
    optimizer=optimizer,
    lr_scheduler=scheduler,
    device=device,
    model_dir=result_dir,
)

trainer.train()
print("Training finished.")


Epoch: 1/50, Train Loss=5.28799, Val Loss=5.03254
Epoch: 2/50, Train Loss=4.97577, Val Loss=4.92674
Epoch: 3/50, Train Loss=4.87472, Val Loss=4.84460
Epoch: 4/50, Train Loss=4.81241, Val Loss=4.85556
Epoch: 5/50, Train Loss=4.77050, Val Loss=4.78853
Epoch: 6/50, Train Loss=4.73555, Val Loss=4.77733
Epoch: 7/50, Train Loss=4.70689, Val Loss=4.77171
Epoch: 8/50, Train Loss=4.68040, Val Loss=4.76780
Epoch: 9/50, Train Loss=4.65593, Val Loss=4.76044
Epoch: 10/50, Train Loss=4.63757, Val Loss=4.74148
Epoch: 11/50, Train Loss=4.61609, Val Loss=4.74001
Epoch: 12/50, Train Loss=4.59640, Val Loss=4.73048
Epoch: 13/50, Train Loss=4.58036, Val Loss=4.72456
Epoch: 14/50, Train Loss=4.56278, Val Loss=4.70876
Epoch: 15/50, Train Loss=4.54327, Val Loss=4.68994
Epoch: 16/50, Train Loss=4.52586, Val Loss=4.70236
Epoch: 17/50, Train Loss=4.50984, Val Loss=4.68298
Epoch: 18/50, Train Loss=4.49390, Val Loss=4.68616
Epoch: 19/50, Train Loss=4.47653, Val Loss=4.67986
Epoch: 20/50, Train Loss=4.46150, Val Lo

In [78]:
# save model
trainer.save_model()
trainer.save_loss()

vocab_path = os.path.join(result_dir, "vocab.pt")
torch.save(vocab, vocab_path)

### Result
Let's run inference with the trained word embeddings and visualize them.

In [79]:
import pandas as pd
import sys

from sklearn.manifold import TSNE
import plotly.graph_objects as go

sys.path.append("../")

In [80]:
result_dir

'weights/'

In [110]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# reload saved model and vocab
model = torch.load(os.path.join(result_dir,"model.pt"), map_location=device)
vocab = torch.load(os.path.join(result_dir,"vocab.pt"))

# embedding is model's first layer
embeddings = list(model.parameters())[0]
embeddings = embeddings.cpu().detach().numpy()

# normalization
norms = (embeddings ** 2).sum(axis=1) ** (1 / 2)
norms = np.reshape(norms, (len(norms), 1))
embeddings_norm = embeddings / norms
embeddings_norm.shape



(4119, 300)

### Make t-SNE graph of trained embedding and color numeric values

In [145]:
'''references
    1. https://bigdatamaster.tistory.com/186
    2. https://plotly.com/python-api-reference/generated/plotly.graph_objects.Scatter.html

'''

embeddings_df = pd.DataFrame(embeddings_norm)
# TODO 5-1) : make 2-d t-SNE graph of all vocabs and color only for numeric values(others, just color black)

# use a separate name so the trained CBOW `model` is not overwritten
tsne = TSNE(n_components=2)
embeddings_df_transformed = tsne.fit_transform(embeddings_df)
embeddings_df_transformed = pd.DataFrame(embeddings_df_transformed)
embeddings_df_transformed.index = vocab.get_itos()

# color numeric tokens blue and everything else black
isNumeric = embeddings_df_transformed.index.str.isnumeric()
color = np.where(isNumeric, "blue", "black")

# plot figure
figure = go.Figure()
figure.add_trace(
    go.Scatter(
        x=embeddings_df_transformed[0],
        y=embeddings_df_transformed[1],
        mode="text",
        text=embeddings_df_transformed.index,
        textposition="middle center",
        textfont=dict(color=color),
    )
)
figure.show()
figure.write_html("visulization_output.html")

![img](https://github.com/jmSNU/AI/blob/main/newplot.png?raw=true)


### Find top N similar words


In [136]:
'''references
    1.https://stackoverflow.com/questions/16486252/is-it-possible-to-use-argsort-in-descending-order
'''

def find_top_similar(word: str, vocab, embeddings_norm, topN: int = 10):
    # TODO 5-2) : make function returning top n similar words and similarity scores
    topN_dict = {}
    word_idx = vocab[word]
    
    word_vec = embeddings_norm[word_idx]
    word_vec = word_vec.flatten()
    # rows are unit-length, so the inner product with every vocab word is the cosine similarity
    dist = np.matmul(embeddings_norm, word_vec).flatten()
    
    topN_ids = np.argsort(-dist) # indices sorted by descending similarity
    topN_ids = topN_ids[1:topN+1] # skip element 0, which is the query word itself
    
    for sim_id in topN_ids:
        sim_word = vocab.lookup_token(sim_id)
        topN_dict[sim_word] = dist[sim_id]

    return topN_dict


In [137]:
for word, sim in find_top_similar("english", vocab, embeddings_norm).items():
    print("{}: {:.3f}".format(word, sim))


kannada: 0.303
bible: 0.298
mole: 0.296
celtic: 0.284
spanish: 0.274
irish: 0.250
artificial: 0.233
pine: 0.229
georgian: 0.217
american: 0.217


### Result Report

Save the Colab result and submit it together with your trained model and vocab file. Double-check that the notebook file you submit contains the results.

You can change the CBOW model parameters, the training parameters, and other details if you want.