# Course AI Homework 5
In Homework 5, we will train our own 'CBOW' Word2Vec embedding on the WikiText2 dataset (a small dataset).
- Change the Runtime option above to GPU if you can. (max 12 hours for one user)
- Save and submit the outputs of this notebook along with the model and vocab files you trained.
- You are not allowed to use other Python files or import a pretrained model.

In [1]:
# YOU should run this command if you will train the model in COLAB environment
! pip install datasets transformers

Collecting datasets
  Downloading datasets-2.16.0-py3-none-any.whl (507 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.16.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6


In [2]:
import argparse
import yaml
import os
import torch
import torch.nn as nn
import torchtext

import json
import numpy as np

from functools import partial
from torch.utils.data import DataLoader
from torchtext.data import to_map_style_dataset
from torchtext.data.utils import get_tokenizer

from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import WikiText2 # WikiText103

import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

from datasets import load_dataset



In [3]:
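device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch_seed_numb = 0
# Seed the CPU generator as well: parameter initialization and DataLoader shuffling use it
torch.manual_seed(torch_seed_numb)
if device.type == 'cuda':
    torch.cuda.manual_seed(torch_seed_numb)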

In [4]:
device

device(type='cuda')

In [5]:
# If you use the Google Colab environment, mount your Google Drive here to save the model and vocab
from google.colab import drive
drive.mount('/content/drive')
root_dir = '/content/drive/MyDrive/Continuous Bag of Words'

Mounted at /content/drive


### Constant Setting

In [6]:
# You could change parameters if you want.

train_batch_size =  96
val_batch_size = 96
shuffle =  True

optimizer =  'Adam'
learning_rate =  0.05
epochs = 50

result_dir = 'weights/'

# Parameters about CBOW model architecture and Vocab.
CBOW_N_WORDS = 4

MIN_WORD_FREQUENCY = 50
MAX_SEQUENCE_LENGTH = 256

EMBED_DIMENSION = 300
EMBED_MAX_NORM = 1

In [7]:
result_dir = os.path.join(root_dir, result_dir)
# makedirs also creates missing parent folders and does nothing if the directory already exists
os.makedirs(result_dir, exist_ok=True)


## Prepare dataset and vocab

In [8]:
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
train_dataset = datasets["train"]
val_dataset = datasets['validation']
test_dataset = datasets['test']
#train_dataset.map(tokenizing_word , batched= True, batch_size = 5000)


Downloading data:   0%|          | 0.00/733k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/6.36M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/657k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

In [9]:
# Let's print one example
train_dataset['text'][11]

" Troops are divided into five classes : Scouts , Shocktroopers , Engineers , Lancers and Armored Soldier . Troopers can switch classes by changing their assigned weapon . Changing class does not greatly affect the stats gained while in a previous class . With victory in battle , experience points are awarded to the squad , which are distributed into five different attributes shared by the entire squad , a feature differing from early games ' method of distributing to different unit types . \n"

As you can see, we need to clean up the sentences, lowercase them, tokenize them, and map each word to its vocabulary index. Before going through the whole process, we need to build a vocabulary from the training dataset.
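
As a rough illustration of that pipeline, the sketch below builds a tiny vocabulary with a minimum-frequency cutoff and maps tokens to indices using only the standard library. The helper names (`simple_tokenize`, `build_vocab`, `encode`) and the toy corpus are hypothetical. The notebook itself uses torchtext's `get_tokenizer` and `build_vocab_from_iterator`, whose tokenization rules differ from this plain whitespace split.

```python
from collections import Counter

def simple_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_vocab(texts, min_freq=2, specials=("<unk>",)):
    # Count token frequencies over the whole corpus.
    counts = Counter(tok for text in texts for tok in simple_tokenize(text))
    # Keep tokens that occur at least min_freq times,
    # ordered by descending frequency and then alphabetically.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    # Special tokens take the first indices.
    itos = list(specials) + kept
    return {tok: idx for idx, tok in enumerate(itos)}

def encode(text, stoi):
    # Tokens that were filtered out of the vocabulary map to the <unk> index.
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in simple_tokenize(text)]

corpus = ["The cat sat", "the cat ran", "A dog ran"]
stoi = build_vocab(corpus, min_freq=2)
ids = encode("The dog sat", stoi)
print(stoi)
print(ids)
```

Here "dog" and "sat" each appear only once, so they fall below the cutoff and are encoded as `<unk>`, which is the role `set_default_index` plays for the torchtext vocab.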

In [10]:
tokenizer = get_tokenizer("basic_english", language="en")

# TODO 1): make vocabulary
# Hint) use the function build_vocab_from_iterator with train_dataset, set special tokens, etc.

# Training starts from 6.7 without the tokenize function
def tokenize(data_iter):
    for text in data_iter:
        yield tokenizer(text)

# Build the vocabulary from the iterator
vocab = build_vocab_from_iterator(tokenize(train_dataset['text']), specials=["<unk>", "<pad>", "<bos>", "<eos>"], min_freq=MIN_WORD_FREQUENCY)

# Set default index for unknown tokens
vocab.set_default_index(vocab["<unk>"])

We need a collate function to turn the dataset into the CBOW training format. The collate function should slide over each text in the batch and build the train/test samples. Each sample consists of CBOW_N_WORDS words on the left and CBOW_N_WORDS words on the right as input, with the center word as the target output.  
Make the collate function return the CBOW dataset as tensors.
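
The sliding window can be sketched on a plain list of token indices, without tensors. This is only an illustration under the assumption that positions without a full window on both sides are skipped. The function name `cbow_windows` and the toy indices are hypothetical.

```python
def cbow_windows(token_ids, n_words):
    # Collect (context, target) pairs for every position that has
    # n_words neighbours on both the left and the right.
    pairs = []
    for i in range(n_words, len(token_ids) - n_words):
        context = token_ids[i - n_words:i] + token_ids[i + 1:i + n_words + 1]
        pairs.append((context, token_ids[i]))
    return pairs

pairs = cbow_windows([10, 11, 12, 13, 14, 15], n_words=2)
for context, target in pairs:
    print(context, "->", target)
```

A text shorter than `2 * n_words + 1` tokens produces no pairs at all, so very short lines contribute nothing to a batch.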

In [11]:
# Here is a lambda function to tokenize sentence and change words to vocab indexes.
text_pipeline = lambda x: vocab(tokenizer(x))

![cbow](https://user-images.githubusercontent.com/74028313/204695601-51d44a38-4bd3-4a69-8891-2854aa57c034.png)

In [12]:
def collate(batch, text_pipeline):

    batch_input, batch_output = [], []

    # TODO 2): make collate function
    for text in batch:
        # Tokenize the text
        tokenized_text = text_pipeline(text)

        # Iterate over each word in the text
        for i in range(len(tokenized_text)):
            # Define the context window
            ctx = []
            for j in range(-CBOW_N_WORDS, CBOW_N_WORDS + 1):
                if j != 0 and 0 <= i + j < len(tokenized_text):
                    ctx.append(tokenized_text[i + j])

            # Check if the context window size is correct
            if len(ctx) == 2 * CBOW_N_WORDS:
                # Add the input and output to the batch
                batch_input.append(ctx)
                batch_output.append(tokenized_text[i])

    # Convert the input and output to tensors
    batch_input = torch.tensor(batch_input)
    batch_output = torch.tensor(batch_output)

    return batch_input, batch_output

In [13]:
train_dataloader = DataLoader(
    train_dataset['text'],
    batch_size=train_batch_size,
    shuffle=shuffle,
    collate_fn=partial(collate, text_pipeline=text_pipeline),
)

val_dataloader = DataLoader(
    val_dataset['text'],
    batch_size=val_batch_size,
    shuffle=shuffle,
    collate_fn=partial(collate, text_pipeline=text_pipeline),
)

## Make CBOW Model
![image](https://user-images.githubusercontent.com/74028313/204701161-cd9df4bf-78b8-4b4d-b8b7-ed4a3b5c3922.png)

The main idea of the CBOW model is to predict the center (target) word from its context words. As the simple architecture above shows, the 2 × CBOW_N_WORDS context words are fed to the projection layer. Converting each word to an embedding requires a lookup table, for which we use torch's nn.Embedding. After the context embeddings are combined, a shallow linear layer predicts the target word, and the prediction is compared with the center word's index using cross-entropy loss. Finally, the embedding layer (lookup table) of the trained model serves as the word embeddings.
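
To make the computation concrete, here is a minimal pure-Python sketch of one forward pass and its loss. It assumes the context embeddings are combined by averaging and it ignores details such as the `max_norm` renormalization of `nn.Embedding`. The function names and the toy weights are hypothetical.

```python
import math

def cbow_forward(context_ids, embedding, weight, bias):
    # Average the embedding rows of the context words.
    dim = len(embedding[0])
    hidden = [sum(embedding[i][d] for i in context_ids) / len(context_ids)
              for d in range(dim)]
    # Linear layer: one logit per vocabulary word.
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]

def cross_entropy(logits, target):
    # Numerically stable -log(softmax(logits)[target]).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# Toy values (hypothetical): vocabulary of 3 words, embedding dimension 2.
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

logits = cbow_forward([0, 1], embedding, weight, bias)
loss = cross_entropy(logits, target=2)
print(logits, loss)
```

Training lowers this loss by adjusting both the linear weights and the embedding rows, which is why the lookup table ends up encoding word similarity.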

In [14]:
class CBOW_Model(nn.Module):
    def __init__(self, vocab_size: int, EMBED_DIMENSION, EMBED_MAX_NORM):
        super(CBOW_Model, self).__init__()
        # TODO 3-1): make CBOW model using nn.Embedding and nn.Linear function
        self.embeddings = nn.Embedding(vocab_size, EMBED_DIMENSION, max_norm=EMBED_MAX_NORM)
        self.linear = nn.Linear(EMBED_DIMENSION, vocab_size)

    def forward(self, _inputs):
        # TODO 3-2): make forward function
        # Input shape: (batch_size, 2 * CBOW_N_WORDS)
        # Embedding lookup
        embedded_inputs = self.embeddings(_inputs)

        # Average the context embeddings over the context dimension: (batch_size, EMBED_DIMENSION)
        reshaped_inputs = torch.mean(embedded_inputs, dim=1)

        # Output shape: (batch_size, vocab_size)
        _outputs = self.linear(reshaped_inputs)

        return _outputs

## Train the model

Let's make _train_epoch and _validate_epoch functions to train the CBOW model.  
- model.train() and model.eval() switch the model between training and evaluation mode; some layers (Dropout, BatchNorm, etc.) behave differently at inference time.
- There is an lr_scheduler option that changes the learning rate after each epoch. Try it if you are interested.
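
As a small illustration of what such a schedule does, the sketch below computes the learning rate for the first few epochs under an exponential decay, using the notebook's starting rate of 0.05 and a decay factor of 0.95. `LambdaLR` scales the initial learning rate by the value its `lr_lambda` returns for the current epoch. The helper `lr_at_epoch` is hypothetical and only mirrors that arithmetic.

```python
def lr_at_epoch(initial_lr, epoch, decay=0.95):
    # Multiplicative schedule: the initial rate is scaled by decay ** epoch.
    return initial_lr * decay ** epoch

schedule = [lr_at_epoch(0.05, e) for e in range(3)]
for epoch, lr in enumerate(schedule):
    print("epoch {}: lr={:.6f}".format(epoch, lr))
```

With these values the rate after 50 epochs is still a bit under a tenth of its starting value, so the decay is gentle.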

In [15]:
vocab_size = len(vocab.get_stoi())

model = CBOW_Model(vocab_size=vocab_size, EMBED_DIMENSION = EMBED_DIMENSION, EMBED_MAX_NORM = EMBED_MAX_NORM)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr = learning_rate)

In [16]:
class Train_CBOW:

    def __init__(
        self,
        model,
        epochs,
        train_dataloader,
        val_dataloader,
        loss_function,
        optimizer,
        device,
        model_dir,
        lr_scheduler = None
    ):
        self.model = model
        self.epochs = epochs
        self.train_dataloader = train_dataloader
        self.val_dataloader = val_dataloader
        self.loss_function = loss_function
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler
        self.device = device
        self.model_dir = model_dir

        self.loss = {"train": [], "val": []}
        self.model.to(self.device)

    def train(self):
        for epoch in range(self.epochs):
            self._train_epoch()
            self._validate_epoch()
            print(
                "Epoch: {}/{}, Train Loss={:.5f}, Val Loss={:.5f}".format(
                    epoch + 1,
                    self.epochs,
                    self.loss["train"][-1],
                    self.loss["val"][-1],
                )
            )
            if self.lr_scheduler is not None:
                self.lr_scheduler.step()


    def _train_epoch(self):
        self.model.train() # set model as train
        loss_list = []
        # TODO 4-1):
        for batch_input, batch_output in self.train_dataloader:
            # Move data to the device
            batch_input, batch_output = batch_input.to(self.device), batch_output.to(self.device)

            # Zero the gradients
            self.optimizer.zero_grad()

            # Forward pass
            output = self.model(batch_input)

            # Calculate the loss
            loss = self.loss_function(output, batch_output)
            loss_list.append(loss.item())

            # Backward pass
            loss.backward()

            # Update the weights
            self.optimizer.step()

        # end of TODO
        epoch_loss = np.mean(loss_list)
        self.loss["train"].append(epoch_loss)

    def _validate_epoch(self):
        self.model.eval()
        loss_list = []

        with torch.no_grad():
            # TODO 4-2):
            for batch_input, batch_output in self.val_dataloader:
                # Move data to the device
                batch_input, batch_output = batch_input.to(self.device), batch_output.to(self.device)

                # Forward pass
                output = self.model(batch_input)

                # Calculate the loss
                loss = self.loss_function(output, batch_output)
                loss_list.append(loss.item())

            # end of TODO
        epoch_loss = np.mean(loss_list)
        self.loss["val"].append(epoch_loss)


    def save_model(self):
        model_path = os.path.join(self.model_dir, "model.pt")
        torch.save(self.model, model_path)

    def save_loss(self):
        loss_path = os.path.join(self.model_dir, "loss.json")
        with open(loss_path, "w") as fp:
            json.dump(self.loss, fp)

In [17]:
# Optional: you can add or change the lr_scheduler
scheduler = LambdaLR(optimizer, lr_lambda = lambda epoch: 0.95 ** epoch)

In [18]:
trainer = Train_CBOW(
    model=model,
    epochs=epochs,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    loss_function=loss_function,
    optimizer=optimizer,
    lr_scheduler=scheduler,
    device=device,
    model_dir=result_dir,
)

trainer.train()
print("Training finished.")


Epoch: 1/50, Train Loss=5.22124, Val Loss=4.97873
Epoch: 2/50, Train Loss=4.93571, Val Loss=4.90226
Epoch: 3/50, Train Loss=4.86652, Val Loss=4.89206
Epoch: 4/50, Train Loss=4.82945, Val Loss=4.86877
Epoch: 5/50, Train Loss=4.79969, Val Loss=4.85043
Epoch: 6/50, Train Loss=4.78240, Val Loss=4.86412
Epoch: 7/50, Train Loss=4.76065, Val Loss=4.85793
Epoch: 8/50, Train Loss=4.74365, Val Loss=4.85006
Epoch: 9/50, Train Loss=4.72638, Val Loss=4.82317
Epoch: 10/50, Train Loss=4.71372, Val Loss=4.81852
Epoch: 11/50, Train Loss=4.69589, Val Loss=4.80810
Epoch: 12/50, Train Loss=4.67901, Val Loss=4.81713
Epoch: 13/50, Train Loss=4.66902, Val Loss=4.80813
Epoch: 14/50, Train Loss=4.65611, Val Loss=4.80558
Epoch: 15/50, Train Loss=4.64034, Val Loss=4.79323
Epoch: 16/50, Train Loss=4.62505, Val Loss=4.79281
Epoch: 17/50, Train Loss=4.61309, Val Loss=4.77280
Epoch: 18/50, Train Loss=4.59995, Val Loss=4.77074
Epoch: 19/50, Train Loss=4.58343, Val Loss=4.77866
Epoch: 20/50, Train Loss=4.57088, Val Lo

In [19]:
# save model
trainer.save_model()
trainer.save_loss()

vocab_path = os.path.join(result_dir, "vocab.pt")
torch.save(vocab, vocab_path)

### Result
Let's run inference with the trained word embeddings and visualize them.

In [20]:
import pandas as pd
import sys

from sklearn.manifold import TSNE
import plotly.graph_objects as go

sys.path.append("../")

In [21]:
result_dir

'/content/drive/MyDrive/Continuous Bag of Words/weights/'

In [22]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# reload saved model and vocab
model = torch.load(os.path.join(result_dir,"model.pt"), map_location=device)
vocab = torch.load(os.path.join(result_dir,"vocab.pt"))

# embedding is model's first layer
embeddings = list(model.parameters())[0]
embeddings = embeddings.cpu().detach().numpy()

# normalization
norms = (embeddings ** 2).sum(axis=1) ** (1 / 2)
norms = np.reshape(norms, (len(norms), 1))
embeddings_norm = embeddings / norms
embeddings_norm.shape



(4122, 300)

### Make t-SNE graph of trained embedding and color numeric values

In [23]:
# Apply t-SNE to the normalized embeddings
tsne = TSNE(n_components=2, random_state=0)
embeddings_tsne = tsne.fit_transform(embeddings_norm)

# Create a DataFrame to hold the t-SNE results
embeddings_df = pd.DataFrame(embeddings_tsne, columns=['x', 'y'])
# TODO 5-1) : make a 2-d t-SNE graph of the whole vocab and color only the numeric tokens (color all others black)
vocab_list = list(vocab.get_itos())
embeddings_df['word'] = vocab_list
embeddings_df['color'] = embeddings_df['word'].apply(lambda w: 'blue' if w.isdigit() else 'black')

fig = go.Figure()

# Add all words in a single scatter trace
# (one trace per word would create thousands of traces and render very slowly)
fig.add_trace(go.Scatter(
    x=embeddings_df['x'], y=embeddings_df['y'],
    text=embeddings_df['word'],
    mode='markers+text',
    marker=dict(color=embeddings_df['color']),
    textposition='top center'
))

# Update the plot layout
fig.update_layout(
    title='t-SNE Plot of Word Embeddings (Color by Numeric/Non-Numeric)',
    xaxis_title='t-SNE Dimension 1',
    yaxis_title='t-SNE Dimension 2',
)

# Show the plot
fig.show()

### Find top N similar words
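
Because the embeddings were normalized to unit length above, the cosine similarity between two words reduces to a dot product. The sketch below shows the idea on toy 2-d vectors in pure Python. The helper names and the vector values are hypothetical and are not taken from the trained model, and the sketch assumes no vector is all zeros.

```python
import math

def normalize(vec):
    # Scale a vector to unit L2 norm (assumes the vector is not all zeros).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_similar(word, vectors, top_n=2):
    # With unit-length vectors, cosine similarity is just a dot product.
    unit = {w: normalize(v) for w, v in vectors.items()}
    query = unit[word]
    scores = {w: sum(a * b for a, b in zip(query, v))
              for w, v in unit.items() if w != word}
    # Highest similarity first; the query word itself is excluded above.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy 2-d vectors (hypothetical values).
vectors = {"one": [1.0, 0.0], "two": [0.9, 0.1], "cat": [0.0, 1.0], "dog": [0.1, 0.9]}
result = top_similar("one", vectors)
for w, score in result:
    print("{}: {:.3f}".format(w, score))
```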


In [24]:
def find_top_similar(word: str, vocab, embeddings_norm, topN: int = 10):
    # TODO 5-2) : make a function returning the top N similar words and their similarity scores
    topN_dictionary = {}
    stoi = vocab.get_stoi() if hasattr(vocab, 'get_stoi') else vocab.stoi

    # Check if the word is in the vocabulary
    if word not in stoi:
        return topN_dictionary

    # Get the index and embedding of the word
    word_idx = stoi[word]
    word_embed = embeddings_norm[word_idx]

    # Calculate cosine similarity
    cos_similarity = np.dot(embeddings_norm, word_embed)

    # Get the indices of the top N similar words
    top_indices = np.argsort(cos_similarity)[::-1][1:topN+1]

    # Get the top N similar words
    itos = vocab.get_itos() if hasattr(vocab, 'get_itos') else vocab.itos
    top_words = [itos[idx] for idx in top_indices]

    # Get the top N similarity scores
    top_scores = cos_similarity[top_indices]

    # Combine words and scores in a dictionary
    topN_dictionary = dict(zip(top_words, top_scores))

    return topN_dictionary


In [25]:
for word, sim in find_top_similar("english", vocab, embeddings_norm).items():
    print("{}: {:.3f}".format(word, sim))


kannada: 0.345
celtic: 0.303
spanish: 0.294
barry: 0.261
scotland: 0.242
mole: 0.228
bible: 0.222
shakespeare: 0.210
institution: 0.204
american: 0.202


### Result Report

Save the Colab result and submit it with your trained model and vocab file. Check once more that your submitted notebook file contains the results.

You can change the CBOW model parameters, training parameters, and other details if you want.