1. Make sure you fill in every cell that contains YOUR CODE HERE or YOUR ANSWER HERE.
2. When you have finished, restart the kernel and run all cells in order.

# Project II: Text Classification Using LSTM Network
## Deadline: Nov 14, 11:59 pm

You have learned the basics of neural network training and testing in class. Now let's move on to a text classification task using simple LSTM networks! In this project, you need to implement two parts:

- **Part I: Building vocabulary for LSTM network**
    - Get familiar with processing discrete text data for neural networks by building the vocabulary yourself.


- **Part II: Implementing your own LSTM Neural Network**
    - Learn to implement your own LSTM network and aim for strong performance on the given text classification task.
    - Note that you need to implement the LSTM network manually; calling any integrated package implementation will receive 0 points.
    - Your LSTM network can be 2-4 layers.
    - Expected Accuracy: >=65%.
    ![](./LSTM.png)
    


In [35]:
import torch
import pandas as pd

# nlp library of Pytorch
# from torchtext import data
from torchtext.legacy import data

import warnings as wrn
wrn.filterwarnings('ignore')
SEED = 2021

torch.manual_seed(SEED)
# the determinism flag lives under torch.backends.cudnn
torch.backends.cudnn.deterministic = True

In [36]:
data_ = pd.read_csv('./sms_spam.csv')
data_.head()
data_.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5574 entries, 0 to 5573
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   type    5574 non-null   object
 1   text    5574 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [37]:
import spacy.cli
spacy.cli.download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [25]:
# Field is a normal column 
# LabelField is the label column.

import spacy
nlp = spacy.load("en_core_web_lg")
def tokenizer(text): # create a tokenizer function
    return [tok.text for tok in nlp.tokenizer(text)]

TEXT = data.Field(tokenize=tokenizer,batch_first=True,include_lengths=True)
LABEL = data.LabelField(dtype = torch.float,batch_first=True)

In [26]:
fields = [("type",LABEL),('text',TEXT)]

In [27]:
training_data = data.TabularDataset(path="./sms_spam.csv",
                                    format="csv",
                                    fields=fields,
                                    skip_header=True
                                   )

print(vars(training_data.examples[0]))

{'type': 'ham', 'text': ['Go', 'until', 'jurong', 'point', ',', 'crazy', '..', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'Cine', 'there', 'got', 'amore', 'wat', '...']}


In [28]:
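import random
# train and validation splitting
# random.seed() returns None, so seed the global RNG first and pass its state explicitly
random.seed(SEED)
train_data,valid_data = training_data.split(split_ratio=0.75,
                                            random_state=random.getstate())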

#### Question 1 (5 points)
Implement the vocabulary building and the text to label part for training.
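
To make the token-to-integer step concrete, here is a minimal sketch of what vocabulary building does, written in plain Python rather than with torchtext. The helper names `build_vocab` and `numericalize`, the toy corpus, and the choice to place `<unk>` and `<pad>` at ids 0 and 1 are assumptions made for illustration; they are not the torchtext implementation.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=1, specials=("<unk>", "<pad>")):
    """Map each token to an integer id; special tokens take the lowest ids."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically so ids are reproducible
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi, unk_token="<unk>"):
    """Convert tokens to ids, sending out-of-vocabulary tokens to the unknown id."""
    unk = stoi[unk_token]
    return [stoi.get(tok, unk) for tok in tokens]

# Hypothetical toy corpus, already tokenized
toy_corpus = [["free", "prize", "now"], ["see", "you", "now"]]
stoi, itos = build_vocab(toy_corpus)
ids = numericalize(["free", "pizza", "now"], stoi)  # "pizza" is out of vocabulary
```

The vocabulary is built from the training split only, so tokens that appear only in validation data fall back to the unknown id, as "pizza" does here.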

In [29]:
#implement Question1 here:
#Building vocabularies => (Token to integer)
#you can use the data package built-in function to build the vocabulary, check the 'torchtext data' doc.

TEXT.build_vocab(train_data)
LABEL.build_vocab(train_data)

In [30]:
print("Size of text vocab:",len(TEXT.vocab))
print("Size of label vocab:",len(LABEL.vocab))
TEXT.vocab.freqs.most_common(10)

Size of text vocab: 9705
Size of label vocab: 2


[('.', 3658),
 ('to', 1615),
 ('I', 1478),
 (',', 1461),
 ('you', 1383),
 ('?', 1086),
 ('!', 1019),
 ('a', 1003),
 ('the', 882),
 ('...', 869)]

In [31]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

BATCH_SIZE = 8

# We'll create iterators to get batches of data when we want to use them
"""
BucketIterator batches samples of similar length together, which reduces the
number of padding tokens per batch and the computation wasted on padding.
"""
train_iterator,validation_iterator = data.BucketIterator.splits(
    (train_data,valid_data),
    batch_size = BATCH_SIZE,
    # Sort key is how to sort the samples
    sort_key = lambda x:len(x.text),
    sort_within_batch = True,
    device = device
)
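
The effect of length bucketing on padding can be checked with a small sketch. This is a simplified stand-in that assumes a plain global sort by length; it is not how `BucketIterator` is implemented internally, and the helper names and the example lengths are hypothetical.

```python
import random

def bucket_batches(lengths, batch_size, seed=0):
    """Group example indices of similar length into batches, then shuffle the batches."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)  # shuffle batch order, not batch contents
    return batches

def padding_needed(batches, lengths):
    """Count pad tokens when every batch is padded to its longest example."""
    return sum(max(lengths[i] for i in b) - lengths[i] for b in batches for i in b)

# Hypothetical token counts for eight messages
lengths = [3, 20, 4, 19, 5, 21, 2, 18]
bucketed = bucket_batches(lengths, batch_size=4)
naive = [[0, 1, 2, 3], [4, 5, 6, 7]]  # batches taken in file order

bucketed_pad = padding_needed(bucketed, lengths)
naive_pad = padding_needed(naive, lengths)
```

With these lengths the bucketed batches need far fewer pad tokens than batches taken in file order, which is the saving the iterator above is after.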

#### Question 2 (25 points)
You need to implement the embedding layer and the LSTM cell according to the given architecture, but you are not allowed to use any integrated package!
LSTM tutorial: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
![](./LSTM_CELL.png)
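
As a reference for the gate arithmetic, the following is a minimal sketch of one LSTM time step on plain Python lists, following the standard formulation described in the linked tutorial. It assumes separate weights per gate stored in a hypothetical `params` dictionary; the names `affine` and `lstm_cell_step` are illustrative, and a real implementation would use batched tensor operations instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, U, b, x, h):
    """Return W x + U h + b, with W and U stored as lists of rows."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_j
        for W_row, U_row, b_j in zip(W, U, b)
    ]

def lstm_cell_step(x, h_prev, c_prev, params):
    """One time step of a standard LSTM cell.

    params maps each gate name ("i", "f", "g", "o") to a (W, U, b) triple.
    """
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    g = [math.tanh(z) for z in affine(*params["g"], x, h_prev)]  # candidate cell values
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    c = [f_j * c_j + i_j * g_j for f_j, c_j, i_j, g_j in zip(f, c_prev, i, g)]
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return h, c

# Toy check: input size 2, hidden size 1, all weights zero
zero = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {name: zero for name in ("i", "f", "g", "o")}
h, c = lstm_cell_step([1.0, -1.0], [0.0], [1.0], params)
```

With all weights at zero every sigmoid gate equals 0.5 and the candidate is 0, so the new cell state is half the previous one. This gives a quick sanity check for a hand-written cell.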

In [34]:
import math
import torch
import torch.nn as nn
from torch.nn import Parameter

class lstm_(nn.Module):
    def __init__(self, embedding_dim, hidden_dim,n_layers=2,bidirectional=False,bias=True):
        super().__init__()
        self.input_size = embedding_dim
        self.hidden_size = hidden_dim
        self.bias = bias
        self.num_layers = n_layers
        self.bidirectional = bidirectional
        self.num_directions = 2 if bidirectional else 1
        self.param_names = []
        for layer in range(self.num_layers):
            self.param_names.append([])
            # layers after the first read the concatenated outputs of every direction
            layer_input_size = self.input_size if layer == 0 else self.hidden_size * self.num_directions
            for direction in range(self.num_directions):
                # torch.empty allocates the given shape; torch.Tensor([a, b]) would build a 2-element vector
                W = Parameter(torch.empty(layer_input_size, self.hidden_size * 4))
                U = Parameter(torch.empty(self.hidden_size, self.hidden_size * 4))
                b = Parameter(torch.empty(self.hidden_size * 4))
                layer_params = (W, U, b)
                suffix = '_reverse' if direction == 1 else ''
                param_name = ['weight_W{}{}', 'weight_U{}{}']
                if bias:
                    param_name += ['bias_{}{}']
                param_name = [x.format(layer, suffix) for x in param_name]
                # zip stops at the shorter list, so b is not registered when bias=False
                for name, param in zip(param_name, layer_params):
                    setattr(self, name, param)
                self.param_names[layer].append(param_name)
        self.reset_parameters()

    def reset_parameters(self):
        # uniform initialisation in [-1/sqrt(hidden_size), 1/sqrt(hidden_size)]
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            nn.init.uniform_(weight, -stdv, stdv)

    def _init_states(self, x):
        # one zero state per (layer, direction), on the same device and dtype as x
        shape = (self.num_layers * self.num_directions, x.size(0), self.hidden_size)
        return x.new_zeros(shape), x.new_zeros(shape)

    def forward(self, x,init_states=None):
        batch_size, seq_sz, _ = x.shape
        HS = self.hidden_size
        h_0, c_0 = self._init_states(x) if init_states is None else init_states

        h_n, c_n = [], []
        for layer in range(self.num_layers):
            outputs = []
            for direction in range(self.num_directions):
                weight = [getattr(self, name) for name in self.param_names[layer][direction]]
                idx = layer * self.num_directions + direction
                h_t, c_t = h_0[idx], c_0[idx]
                # the reverse direction reads the sequence from the last step to the first
                steps = range(seq_sz) if direction == 0 else reversed(range(seq_sz))
                hidden_seq = [None] * seq_sz
                for t in steps:
                    x_t = x[:, t, :]
                    # batch the four gate computations into a single matrix multiplication
                    gates = x_t @ weight[0] + h_t @ weight[1]
                    if self.bias:
                        gates = gates + weight[2]
                    i_t = torch.sigmoid(gates[:, :HS])            # input
                    f_t = torch.sigmoid(gates[:, HS:HS * 2])      # forget
                    g_t = torch.tanh(gates[:, HS * 2:HS * 3])     # candidate
                    o_t = torch.sigmoid(gates[:, HS * 3:])        # output
                    c_t = f_t * c_t + i_t * g_t
                    h_t = o_t * torch.tanh(c_t)
                    # store by time index so reverse outputs line up with forward ones
                    hidden_seq[t] = h_t
                outputs.append(torch.stack(hidden_seq, dim=1))
                h_n.append(h_t)
                c_n.append(c_t)
            # concatenate directions along the feature axis; this feeds the next layer
            x = torch.cat(outputs, dim=2)
        return x, (torch.stack(h_n), torch.stack(c_n))

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class LSTMNet(nn.Module):
    
    def __init__(self,vocab_size,embedding_dim,hidden_dim,output_dim,n_layers,bidirectional,dropout,pad_idx=1):
        
        super(LSTMNet,self).__init__()
        num_directions = 2 if bidirectional else 1
        # In this class, you need to implement the architecture of an LSTM network, the architecture should include:
        # 1. Embedding layer converts integer sequences to vector sequences
        #    (pad_idx defaults to 1, the index torchtext's legacy Vocab assigns to '<pad>'; index 0 is '<unk>')
        self.embedding=nn.Embedding(vocab_size,embedding_dim,padding_idx=pad_idx)
        # 2. LSTM layer processes the vector sequences
        self.lstm = lstm_(embedding_dim, hidden_dim,n_layers,bidirectional=bidirectional)
        self.tanh1 = nn.Tanh()
        # attention vector that scores each time step
        self.w = nn.Parameter(torch.zeros(hidden_dim * num_directions))
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim * num_directions, output_dim)
        # nn.Sigmoid() is a module; F.sigmoid() cannot be called without an input
        self.act = nn.Sigmoid()
    def forward(self, x,x_len):
        emb = self.embedding(x)  # [batch_size, seq_len, embedding_dim]
        H, _ = self.lstm(emb)  # [batch_size, seq_len, hidden_dim * num_directions]

        M = self.tanh1(H)  # same shape as H
        alpha = F.softmax(torch.matmul(M, self.w), dim=1).unsqueeze(-1)  # [batch_size, seq_len, 1]
        out = H * alpha  # weight each time step by its attention score
        out = torch.sum(out, 1)  # [batch_size, hidden_dim * num_directions]
        out = self.fc(self.dropout(out))  # [batch_size, output_dim]
        out = self.act(out)

        return out
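
The pooling step in `forward` (score each time step, softmax over the sequence, weighted sum) can be sketched on plain Python lists for a single sequence. The function names and the toy values below are assumptions for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(H, w):
    """Collapse hidden vectors H (seq_len x dim) into one vector of size dim."""
    # score_t = tanh(h_t) . w, one scalar per time step
    scores = [sum(math.tanh(h_j) * w_j for h_j, w_j in zip(h_t, w)) for h_t in H]
    alpha = softmax(scores)
    dim = len(H[0])
    pooled = [sum(a * h_t[j] for a, h_t in zip(alpha, H)) for j in range(dim)]
    return pooled, alpha

# Two time steps, hidden size 2, attention vector at its zero initialisation
H = [[1.0, 2.0], [3.0, 4.0]]
pooled, alpha = attention_pool(H, [0.0, 0.0])
```

Because the attention vector starts at zero, every time step initially gets the same weight and the pooled vector is the plain mean of the hidden states; training then moves the weights away from uniform.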
        
    

In [None]:
SIZE_OF_VOCAB = len(TEXT.vocab)
EMBEDDING_DIM = 300
NUM_HIDDEN_NODES = 64
NUM_OUTPUT_NODES = 1
NUM_LAYERS = 2
BIDIRECTION = True
DROPOUT = 0.1

In [None]:
model = LSTMNet(SIZE_OF_VOCAB,
                EMBEDDING_DIM,
                NUM_HIDDEN_NODES,
                NUM_OUTPUT_NODES,
                NUM_LAYERS,
                BIDIRECTION,
                DROPOUT
               )

In [None]:
import torch.optim as optim
model = model.to(device)
optimizer = optim.Adam(model.parameters(),lr=1e-4)
criterion = nn.BCELoss()
criterion = criterion.to(device)

In [None]:
def binary_accuracy(preds, y):
    #round predictions to the closest integer
    rounded_preds = torch.round(preds)
    
    correct = (rounded_preds == y).float() 
    acc = correct.sum() / len(correct)
    return acc
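
For reference, the loss and accuracy used here can be written out by hand. The sketch below is a plain-Python illustration under the assumption that predictions are probabilities in [0, 1]. The `eps` clamp is an illustrative safeguard rather than the exact behaviour of `nn.BCELoss`, and the example values are made up.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean of -(y*log(p) + (1-y)*log(1-p)); eps keeps log() away from zero."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)

def accuracy(probs, targets, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    hits = sum((p > threshold) == bool(y) for p, y in zip(probs, targets))
    return hits / len(probs)

# Hypothetical predictions and labels (1.0 = spam, 0.0 = ham)
probs = [0.9, 0.2, 0.6, 0.4]
targets = [1.0, 0.0, 0.0, 1.0]
acc_value = accuracy(probs, targets)
# A model that always outputs 0.5 scores log(2) regardless of the labels
chance_loss = binary_cross_entropy([0.5, 0.5], [1.0, 0.0])
```

A loss that stays near log(2), about 0.693, therefore means the model is doing no better than an uninformed guess.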

In [None]:
def train(model,iterator,optimizer,criterion):
    
    epoch_loss = 0.0
    epoch_acc = 0.0
    
    model.train()
    
    for batch in iterator:
        # reset the gradients accumulated from the previous batch
        optimizer.zero_grad()
        
        text,text_lengths = batch.text
        
        # forward propagation; squeeze only the output dimension so a batch of size 1 keeps its shape
        predictions = model(text,text_lengths).squeeze(1)
        # computing loss / backward propagation
        loss = criterion(predictions,batch.type)
        loss.backward()
        
        # accuracy
        acc = binary_accuracy(predictions,batch.type)
        # updating params
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    # It'll return the means of loss and accuracy
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
        

In [None]:
def evaluate(model,iterator,criterion):
    
    epoch_loss = 0.0
    epoch_acc = 0.0
    
    # deactivate the dropouts
    model.eval()
    
    # disable gradient tracking during evaluation
    with torch.no_grad():
        for batch in iterator:
            text,text_lengths = batch.text
            
            predictions = model(text,text_lengths).squeeze(1)
            
            #compute loss and accuracy
            loss = criterion(predictions, batch.type)
            acc = binary_accuracy(predictions, batch.type)
            
            #keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
EPOCH_NUMBER = 15
for epoch in range(1,EPOCH_NUMBER+1):
    
    train_loss,train_acc = train(model,train_iterator,optimizer,criterion)
    
    valid_loss,valid_acc = evaluate(model,validation_iterator,criterion)
    
    # Showing statistics
    print(f'Epoch {epoch:02d}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
    print()