# HOMEWORK 6: TEXT CLASSIFICATION
In this homework, you will create models to classify texts from TRUE call-center. There are two classification tasks:
1. Action Classification: Identify which action the customer would like to take (e.g. enquire, report, cancle)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming) 

In this homework, you are asked to do the following tasks:
1. Data Cleaning
2. Preprocessing data for pytorch
3. Build and evaluate a model for "action" classification
4. Build and evaluate a model for "object" classification
5. Build and evaluate a multi-task model that does both "action" and "object" classifications in one-go 


Note: we have removed phone numbers from the dataset for privacy purposes. 

In [None]:
# !wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv
# !pip install pythainlp

## Import Libs

In [2]:
%matplotlib inline
import time
import pandas
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import torch
import pandas as pd 
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from IPython.display import display
from pythainlp.tokenize import word_tokenize
from collections import defaultdict
from sklearn.metrics import accuracy_score
from torchtext.vocab import build_vocab_from_iterator
from pythainlp.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

## Loading data
First, we load the data from disk into a Dataframe.

A Dataframe is essentially a table, or 2D-array/Matrix with a name for each column.

In [76]:
# data_df = pd.read_csv('clean-phone-data-for-students.csv')

## Data cleaning

We call the DataFrame.describe() again.
Notice that there are 33 unique labels/classes for object and 10 unique labels for action that the model will try to predict.
But there are unwanted duplications e.g. Idd,idd,lotalty_card,Lotalty_card

Also note that, there are 13389 unqiue sentence utterances from 16175 utterances. You have to clean that too!

## #TODO 1: 
You will have to remove unwanted label duplications as well as duplications in text inputs. 
Also, you will have to trim out unwanted whitespaces from the text inputs. 
This shouldn't be too hard, as you have already seen it in the demo.



In [77]:
# display(data_df.describe())
# display(data_df.Object.unique())
# display(data_df.Action.unique())

In [3]:
# TODO1: Data cleaning
data_df = pd.read_csv('clean-phone-data-for-students.csv')

data_df_obj = data_df[['Sentence Utterance', 'Object']]
data_df_obj.columns = ['input', 'raw_label']
data_df_obj['clean_label'] = data_df_obj['raw_label'].str.lower().copy()
data_df_obj.drop('raw_label', axis=1, inplace=True)
data_df_obj = data_df_obj.drop_duplicates('input', keep='first')
data_obj = data_df_obj.to_numpy()

unique_label = data_df_obj.clean_label.unique()
label_2_num_map = dict(zip(unique_label, range(len(unique_label))))
num_2_label_map = dict(zip(range(len(unique_label)), unique_label))
data_obj[:, 1] = np.vectorize(label_2_num_map.get)(data_obj[:, 1])

data_df_act = data_df[['Sentence Utterance', 'Action']]
data_df_act.columns = ['input', 'raw_label']
data_df_act['clean_label'] = data_df_act['raw_label'].str.lower().copy()
data_df_act.drop('raw_label', axis=1, inplace=True)
data_df_act = data_df_act.drop_duplicates('input', keep='first')
data_act = data_df_act.to_numpy()

unique_label = data_df_act.clean_label.unique()
label_2_num_map = dict(zip(unique_label, range(len(unique_label))))
num_2_label_map = dict(zip(range(len(unique_label)), unique_label))
data_act[:, 1] = np.vectorize(label_2_num_map.get)(data_act[:, 1])

def strip_str(string):
    return string.strip()

data_obj[:, 0] = np.vectorize(strip_str)(data_obj[:, 0])
data_act[:, 0] = np.vectorize(strip_str)(data_act[:, 0])

## TODO2 : Assign index to word and labels in each sentences. 

Note that please use **word_tokenize** (https://pythainlp.github.io/docs/2.0/api/tokenize.html) as a function to tokenize each sentences.

In [28]:
# TODO2: assign index to each words and labels in sentence.
tokenized_obj, labels_obj = zip(*[(word_tokenize(sentence, engine="newmm"), label) for sentence, label in data_obj])
tokenized_act, labels_act = zip(*[(word_tokenize(sentence, engine="newmm"), label) for sentence, label in data_act])
vocab = build_vocab_from_iterator(tokenized_obj, specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
embedded_obj = [[labels_obj[i], vocab(tokens)] for i, tokens in enumerate(tokenized_obj)]
embedded_act = [[labels_act[i], vocab(tokens)] for i, tokens in enumerate(tokenized_act)]
embedded_act_obj = [[labels_act[i],labels_obj[i], vocab(tokens)] for i, tokens in enumerate(tokenized_act)]
# embedded_act_obj = []
# label_act_obj = dict()
# for i,tokens in enumerate(tokenized_act):
#   if (labels_obj[i],labels_act[i]) not in label_act_obj:
#     label_act_obj[(labels_obj[i],labels_act[i])] = len(label_act_obj)
#   embedded_act_obj.append([label_act_obj[(labels_obj[i],labels_act[i])],vocab(tokens)]) 

## TODO 2,3: Preprocessing data for pytorch
You will be using pytorch in this assignment. Please show us how you prepare your dataloader for pytorch.
Don't forget to split data into train, valdation, and test sets (normally the ratio will be 80:10:10 , respectively)


## TODO 3: Split the data

We recommend to use train_test_spilt from scikit-learn to split the data into train, validation, test set. 

In addition, it should split the data that distribution of the labels in train , validation, test set are similar. There is **stratify** variable handling this issue. 

In this case, you can choose whatever you want either "**Action**" or "**Object**" ;). 

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


In [40]:
# TODO3: split data into train, validation, test  
train_data_act, test_data_act = train_test_split(embedded_act, test_size=0.2, shuffle=False)
train_data_act, val_data_act = train_test_split(train_data_act, test_size=0.1, shuffle=False)

train_data_obj, test_data_obj = train_test_split(embedded_obj, test_size=0.2, shuffle=False)
train_data_obj, val_data_obj = train_test_split(train_data_obj, test_size=0.1, shuffle=False)

train_data_act_obj,test_data_act_obj = train_test_split(embedded_act_obj,test_size = 0.2, shuffle=False)
train_data_act_obj,val_data_act_obj = train_test_split(train_data_act_obj,test_size = 0.1, shuffle=False)

## TODO 4: Build a model for classifying these texts.


In [6]:
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

In [7]:
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
         label_list.append(_label)
         processed_text = torch.tensor(_text, dtype=torch.int64)
         text_list.append(processed_text)
         offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count

## #TODO 3: Build and evaluate a model for "action" classification


In [14]:
batch_size=8
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_class_act = len(set([label for label in labels_act]))
num_class_obj = len(set([label for label in labels_obj]))
emsize = 64

In [9]:
model = TextClassificationModel(len(vocab), emsize, num_class_act).to(device)
train_dataloader = DataLoader(train_data_act, batch_size=batch_size, shuffle=False,collate_fn=collate_batch)
val_dataloader = DataLoader(val_data_act, batch_size=batch_size, shuffle=False,collate_fn=collate_batch)
test_dataloader = DataLoader(test_data_act, batch_size=batch_size, shuffle=False,collate_fn=collate_batch)
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None
num_train = int(len(data_act) * 0.95)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(val_dataloader)
    if total_accu is not None and total_accu > accu_val:
      scheduler.step()
    else:
       total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

| epoch   1 |   500/ 1205 batches | accuracy    0.757
| epoch   1 |  1000/ 1205 batches | accuracy    0.804
-----------------------------------------------------------
| end of epoch   1 | time:  5.29s | valid accuracy    0.864 
-----------------------------------------------------------
| epoch   2 |   500/ 1205 batches | accuracy    0.821
| epoch   2 |  1000/ 1205 batches | accuracy    0.826
-----------------------------------------------------------
| end of epoch   2 | time:  2.35s | valid accuracy    0.869 
-----------------------------------------------------------
| epoch   3 |   500/ 1205 batches | accuracy    0.836
| epoch   3 |  1000/ 1205 batches | accuracy    0.839
-----------------------------------------------------------
| end of epoch   3 | time:  2.44s | valid accuracy    0.874 
-----------------------------------------------------------
| epoch   4 |   500/ 1205 batches | accuracy    0.848
| epoch   4 |  1000/ 1205 batches | accuracy    0.848
-------------------------

In [10]:
## TODO 3.5: evalaute on test set  
evaluate(test_dataloader)

0.8450336071695295

## #TODO 4: Build and evaluate a model for "object" classification



In [11]:
model = TextClassificationModel(len(vocab), emsize, num_class_act).to(device)
train_dataloader = DataLoader(train_data_act, batch_size=batch_size, shuffle=False,collate_fn=collate_batch)
val_dataloader = DataLoader(val_data_act, batch_size=batch_size, shuffle=False,collate_fn=collate_batch)
test_dataloader = DataLoader(test_data_act, batch_size=batch_size, shuffle=False,collate_fn=collate_batch)
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None
num_train = int(len(data_act) * 0.95)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(val_dataloader)
    if total_accu is not None and total_accu > accu_val:
      scheduler.step()
    else:
       total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

| epoch   1 |   500/ 1205 batches | accuracy    0.759
| epoch   1 |  1000/ 1205 batches | accuracy    0.805
-----------------------------------------------------------
| end of epoch   1 | time:  1.95s | valid accuracy    0.865 
-----------------------------------------------------------
| epoch   2 |   500/ 1205 batches | accuracy    0.825
| epoch   2 |  1000/ 1205 batches | accuracy    0.824
-----------------------------------------------------------
| end of epoch   2 | time:  1.97s | valid accuracy    0.872 
-----------------------------------------------------------
| epoch   3 |   500/ 1205 batches | accuracy    0.838
| epoch   3 |  1000/ 1205 batches | accuracy    0.840
-----------------------------------------------------------
| end of epoch   3 | time:  2.69s | valid accuracy    0.873 
-----------------------------------------------------------
| epoch   4 |   500/ 1205 batches | accuracy    0.850
| epoch   4 |  1000/ 1205 batches | accuracy    0.850
-------------------------

In [12]:
## TODO 4.5: evalaute on test set  
evaluate(test_dataloader)

0.8454070201643017

## #TODO 5: Build and evaluate a multi-task model that does both "action" and "object" classifications in one-go 

The model will have 2 separate output layers one for action classification task and another for object classification task. 

This is a rough sketch of what your model might look like:
![image](https://raw.githubusercontent.com/ekapolc/nlp_course/master/HW5/multitask_sketch.png)

In [15]:
model = TextClassificationModel(len(vocab), emsize, num_class_act_obj).to(device)
train_dataloader = DataLoader(train_data_act_obj, batch_size=batch_size, shuffle=False,collate_fn=collate_batch)
val_dataloader = DataLoader(val_data_act_obj, batch_size=batch_size, shuffle=False,collate_fn=collate_batch)
test_dataloader = DataLoader(test_data_act_obj, batch_size=batch_size, shuffle=False,collate_fn=collate_batch)
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None
num_train = int(len(data_act) * 0.95)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(val_dataloader)
    if total_accu is not None and total_accu > accu_val:
      scheduler.step()
    else:
       total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

| epoch   1 |   500/ 1205 batches | accuracy    0.469
| epoch   1 |  1000/ 1205 batches | accuracy    0.616
-----------------------------------------------------------
| end of epoch   1 | time:  2.15s | valid accuracy    0.636 
-----------------------------------------------------------
| epoch   2 |   500/ 1205 batches | accuracy    0.643
| epoch   2 |  1000/ 1205 batches | accuracy    0.691
-----------------------------------------------------------
| end of epoch   2 | time:  2.20s | valid accuracy    0.632 
-----------------------------------------------------------
| epoch   3 |   500/ 1205 batches | accuracy    0.729
| epoch   3 |  1000/ 1205 batches | accuracy    0.773
-----------------------------------------------------------
| end of epoch   3 | time:  1.97s | valid accuracy    0.663 
-----------------------------------------------------------
| epoch   4 |   500/ 1205 batches | accuracy    0.751
| epoch   4 |  1000/ 1205 batches | accuracy    0.783
-------------------------

In [36]:
class Model(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class_act,num_class_obj):
        super(Model, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc_act = nn.Linear(embed_dim, num_class_act)
        self.fc_obj = nn.Linear(embed_dim, num_class_obj)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc_act.weight.data.uniform_(-initrange, initrange)
        self.fc_act.bias.data.zero_()
        self.fc_obj.weight.data.uniform_(-initrange, initrange)
        self.fc_obj.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc_act(embedded), self.fc_obj(embedded)

In [46]:
def collate_batch_mul(batch):
    label_list_act,label_list_obj, text_list, offsets = [], [], [], [0]
    for (_label_act,_label_obj, _text) in batch:
         label_list_act.append(_label_act)
         label_list_obj.append(_label_obj)
         processed_text = torch.tensor(_text, dtype=torch.int64)
         text_list.append(processed_text)
         offsets.append(processed_text.size(0))
    label_list_act = torch.tensor(label_list_act, dtype=torch.int64)
    label_list_obj = torch.tensor(label_list_obj, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list_act.to(device),label_list_obj.to(device), text_list.to(device), offsets.to(device)

def train_mul(dataloader):
    model_mul.train()
    total_acc_act, total_count_act,total_acc_obj,total_count_obj = 0, 0,0,0
    log_interval = 500
    start_time = time.time()
    for idx, (label_act,label_obj, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label_act,predicted_label_obj = model_mul(text, offsets)
        loss = criterion(predicted_label_act, label_act) + criterion(predicted_label_obj, label_obj)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model_mul.parameters(), 0.1)
        optimizer.step()
        total_acc_act += (predicted_label_act.argmax(1) == label_act).sum().item()
        total_acc_obj += (predicted_label_obj.argmax(1) == label_obj).sum().item()
        total_count_act += label_act.size(0)
        total_count_obj += label_obj.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc_act/total_count_act))
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc_obj/total_count_obj))
            total_acc_act, total_count_act,total_acc_obj,total_count_obj = 0, 0,0,0
            start_time = time.time()

def evaluate_mul(dataloader):
    model_mul.eval()
    total_acc_act, total_count_act,total_acc_obj,total_count_obj = 0, 0,0,0
    with torch.no_grad():
        for idx, (label_act,label_obj, text, offsets) in enumerate(dataloader):
            predicted_label_act,predicted_label_obj = model_mul(text, offsets)
            loss = criterion(predicted_label_act, label_act) + criterion(predicted_label_obj, label_obj)
            total_acc_act += (predicted_label_act.argmax(1) == label_act).sum().item()
            total_acc_obj += (predicted_label_obj.argmax(1) == label_obj).sum().item()
            total_count_act += label_act.size(0)
            total_count_obj += label_obj.size(0)
    return total_acc_act/total_count_act, total_acc_obj/total_count_obj

In [50]:
model_mul = Model(len(vocab), emsize, num_class_act,num_class_obj).to(device)
train_dataloader = DataLoader(train_data_act_obj, batch_size=batch_size, shuffle=False,collate_fn=collate_batch_mul)
val_dataloader = DataLoader(val_data_act_obj, batch_size=batch_size, shuffle=False,collate_fn=collate_batch_mul)
test_dataloader = DataLoader(test_data_act_obj, batch_size=batch_size, shuffle=False,collate_fn=collate_batch_mul)
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model_mul.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu_act = None
total_accu_obj = None
num_train = int(len(data_act) * 0.95)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train_mul(train_dataloader)
    accu_val_act, accu_val_obj= evaluate_mul(val_dataloader)
    if total_accu_act is not None and total_accu_obj is not None and total_accu_act > accu_val_act and total_accu_obj > accu_val_obj:
      scheduler.step()
    else:
       total_accu_act = accu_val_act
       total_accu_obj = accu_val_obj
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f}, {:8.3f}'.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val_act,accu_val_obj))
    print('-' * 59)

| epoch   1 |   500/ 1205 batches | accuracy    0.759
| epoch   1 |   500/ 1205 batches | accuracy    0.498
| epoch   1 |  1000/ 1205 batches | accuracy    0.808
| epoch   1 |  1000/ 1205 batches | accuracy    0.628
-----------------------------------------------------------
| end of epoch   1 | time:  2.40s | valid accuracy    0.860,    0.617
-----------------------------------------------------------
| epoch   2 |   500/ 1205 batches | accuracy    0.824
| epoch   2 |   500/ 1205 batches | accuracy    0.668
| epoch   2 |  1000/ 1205 batches | accuracy    0.828
| epoch   2 |  1000/ 1205 batches | accuracy    0.703
-----------------------------------------------------------
| end of epoch   2 | time:  2.52s | valid accuracy    0.868,    0.643
-----------------------------------------------------------
| epoch   3 |   500/ 1205 batches | accuracy    0.845
| epoch   3 |   500/ 1205 batches | accuracy    0.708
| epoch   3 |  1000/ 1205 batches | accuracy    0.841
| epoch   3 |  1000/ 1205 

In [51]:
evaluate_mul(test_dataloader)

(0.8412994772218073, 0.7135922330097088)

## #TODO 6: report the result in each set-up 

**Single Task learning**

Action classification : 
Acc = 0.8450336071695295, f1 = 

Object classification : 
Acc = 0.8454070201643017, f1 = 

**Multi-task learning**

Action classification : 
Acc = 0.8412994772218073, f1 = 

Object classification : 
Acc = 0.7135922330097088, f1 = 

## TODO 7: Use pretraining word embedding & handling out of vocabulary words

Pretrained word embeddings can be used to improve the performance of a classification model, as they provide the model better representation of the words. These embeddings can be used to initialize the weights of the neural network model, providing it with a more meaningful starting point for learning especially on smaller datasets.

In this part, we will try to use pretrained word embedding to initialize the word embeddings in this corpus.





In the previous labs, we have always been using a vector of zeros to initialize words for OOVs. However, that is usually not the best method. In this part of the homework, you will try to handle these OOVs better.

**Note :** you can use any pretrained word embedding.

Repeat the model in TODO 5 with pretrained word embedding. Use a better initialization than a vector of zeroes.

Here are some ideas:

1.   [average](https://nlp.stanford.edu/~johnhew/vocab-expansion.html)
2.   [Using character n-grams from FastText](https://fasttext.cc/docs/en/unsupervised-tutorial.html)
3.   [Use a character LSTM model](https://link.springer.com/chapter/10.1007/978-3-030-18305-9_60)

In [None]:
!wget https://github.com/PyThaiNLP/pythainlp-corpus/releases/download/thai2fit_wv-v0.1/thai2vec.bin -O thai2vec.bin

In [58]:
## TODO 7.1: pretrained word embedding 
import gensim
vec = gensim.models.KeyedVectors.load_word2vec_format("thai2vec.bin", binary=True)
itos = {i:k for i,(k,v) in enumerate(vec.vocab.items())}
weight = vec.vectors
stoi = {v:k for k,v in itos.items()}
model_mul = Model(len(stoi), 300, num_class_act,num_class_obj).to(device)
with torch.no_grad():
    model_mul.embedding.weight.copy_(torch.tensor(weight))

In [60]:
## TODO 7.2: how to handle out of vocab  
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model_mul.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu_act = None
total_accu_obj = None
num_train = int(len(data_act) * 0.95)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train_mul(train_dataloader)
    accu_val_act, accu_val_obj= evaluate_mul(val_dataloader)
    if total_accu_act is not None and total_accu_obj is not None and total_accu_act > accu_val_act and total_accu_obj > accu_val_obj:
      scheduler.step()
    else:
       total_accu_act = accu_val_act
       total_accu_obj = accu_val_obj
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f}, {:8.3f}'.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val_act,accu_val_obj))
    print('-' * 59)

| epoch   1 |   500/ 1205 batches | accuracy    0.761
| epoch   1 |   500/ 1205 batches | accuracy    0.521
| epoch   1 |  1000/ 1205 batches | accuracy    0.804
| epoch   1 |  1000/ 1205 batches | accuracy    0.643
-----------------------------------------------------------
| end of epoch   1 | time:  2.52s | valid accuracy    0.868,    0.618
-----------------------------------------------------------
| epoch   2 |   500/ 1205 batches | accuracy    0.827
| epoch   2 |   500/ 1205 batches | accuracy    0.686
| epoch   2 |  1000/ 1205 batches | accuracy    0.829
| epoch   2 |  1000/ 1205 batches | accuracy    0.719
-----------------------------------------------------------
| end of epoch   2 | time:  2.45s | valid accuracy    0.869,    0.641
-----------------------------------------------------------
| epoch   3 |   500/ 1205 batches | accuracy    0.845
| epoch   3 |   500/ 1205 batches | accuracy    0.725
| epoch   3 |  1000/ 1205 batches | accuracy    0.847
| epoch   3 |  1000/ 1205 

Describe what pretrained word embedding you used, and how you handle OOVs.

**ANS** TODO 7.3

Use thai2fit pre-trained word embedding model. It can handle OOV, which are words that do not appear in the modle's vocabulary. This is because the model is trained too generalize from the words it has seen during training to similara words that it has not seen before. When an OOV word is encountered, the model can still produce a meaningful vector representation for it based on its character n-grams.