<a href="https://colab.research.google.com/github/petitmi/Deep_learning-Sequential_data/blob/main/LSTM_sentiment_analysis_latest_torchtext.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 586-lab4

This lab trains a simple RNN for sentiment classification on a custom CSV text dataset. The task is binary classification with two labels, positive and negative, and the model uses LSTM (Long Short-Term Memory) cells, with GRU cells as an alternative.

In [49]:
from google.colab import drive
drive.mount('/content/drive')
%cd '/content/drive/MyDrive/586'
!ls 


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/586
cnn-lenet5-cifar10.ipynb  data	Lab4.ipynb  notebookc0cd28f5bd.ipynb  Project


In [None]:
!pip install torch --upgrade
!pip install torchtext --upgrade

In [51]:
import numpy as np 
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch
import torch.nn as nn
import torch.nn.functional as F
# from nltk.corpus import stopwords 
from collections import Counter
import string
import re
import seaborn as sns
from tqdm import tqdm
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.dataset import random_split


The following cells load the IMDB movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) for positive-negative sentiment classification from a CSV-formatted file:

In [52]:
# !gunzip -f movie_data.csv.gz 
base_csv = 'data/movie_data.csv'
df = pd.read_csv(base_csv)
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


### Splitting into train, validation, and test data

I first split the data into training (75%), validation (5%), and test (20%) sets.

In [53]:
RANDOM_SEED = 123

# Each example is a [sentiment, review] pair.
raw_data = df[['sentiment', 'review']].values.tolist()
# Fractional lengths: 75% train, 5% validation, 20% test.
train_data, valid_data, test_data = random_split(
    raw_data, [0.75, 0.05, 0.2],
    generator=torch.Generator().manual_seed(RANDOM_SEED))
print(f'Num Train: {len(train_data)}')
print(f'Num Valid: {len(valid_data)}')
print(f'Num Test: {len(test_data)}')


Num Train: 37500
Num Valid: 2500
Num Test: 10000
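
The fractional form of `random_split` can be checked on a toy dataset. The sketch below assumes PyTorch 1.13 or later, where fractions summing to 1 are accepted; `toy_data` and the variable names are placeholders, not part of the lab.

```python
import torch
from torch.utils.data import random_split

# Toy stand-in for the review list: 100 integer "examples".
toy_data = list(range(100))

# Same fractions as above; a seeded generator makes the split reproducible.
generator = torch.Generator().manual_seed(123)
train_part, valid_part, test_part = random_split(
    toy_data, [0.75, 0.05, 0.2], generator=generator
)

sizes = (len(train_part), len(valid_part), len(test_part))
print(sizes)  # (75, 5, 20)

# Every example lands in exactly one subset.
all_indices = sorted(
    list(train_part.indices) + list(valid_part.indices) + list(test_part.indices)
)
```

The same fractions applied to 50,000 reviews give the 37,500 / 2,500 / 10,000 split printed above.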


### Analysing sentiment

### Tokenization

**<font color='red'>Note</font>**: I will use the latest version of `torchtext` (`0.15.0`) rather than the version used in the example (`0.9.0`). The main differences are:
- Tokenization uses `data.utils.get_tokenizer` instead of `data.Field`
- The `vocab` API is different
- The padding process is different


In [56]:
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')

In [57]:
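from collections import Counter
from torchtext.vocab import vocab as build_vocab  # alias avoids shadowing the factory function

# Count token frequencies over all reviews.
counter = Counter()
for line in df['review']:
    counter.update(tokenizer(line))
# Special tokens come first, so '<unk>' = 0, '<BOS>' = 1, '<EOS>' = 2, '<PAD>' = 3.
vocab = build_vocab(counter, min_freq=1, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))
# Map tokens that are not in the vocabulary to '<unk>' instead of raising an error.
vocab.set_default_index(vocab['<unk>'])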

In [58]:
print("The length of the new vocab is", len(vocab))
new_stoi = vocab.get_stoi()
print("The index of '<BOS>' is", new_stoi['<BOS>'])
new_itos = vocab.get_itos()
print("The token at index 2 is", new_itos[2])

The length of the new vocab is 147160
The index of '<BOS>' is 1
The token at index 2 is <EOS>


In [59]:
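# Encode a review as token indices, wrapped in <BOS> ... <EOS>.
text_transform = lambda x: [vocab['<BOS>']] + [vocab[token] for token in tokenizer(x)] + [vocab['<EOS>']]
# Labels become floats because the binary cross-entropy loss expects float targets.
label_transform = lambda x: 1.0 if x == 1 else 0.0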

print("input to the text_transform:", 'this is me')
print("output of the text_transform:", text_transform( 'this is me'))

input to the text_transform: this is me
output of the text_transform: [1, 181, 49, 272, 2]
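
To make the mapping concrete, the sketch below rebuilds the same idea without `torchtext`: special tokens take the first indices, the remaining tokens are numbered in order of first appearance, and each text is wrapped in `<BOS>`/`<EOS>`. The whitespace tokenizer, the two toy reviews, and the helper names are illustrative assumptions; `basic_english` also normalizes punctuation, which this sketch does not.

```python
from collections import Counter

SPECIALS = ('<unk>', '<BOS>', '<EOS>', '<PAD>')

def simple_tokenize(text):
    # Assumption: lowercase + whitespace split stands in for 'basic_english'.
    return text.lower().split()

def build_stoi(texts, min_freq=1):
    """Build a token -> index dict with the special tokens first."""
    counter = Counter()
    for text in texts:
        counter.update(simple_tokenize(text))
    stoi = {tok: i for i, tok in enumerate(SPECIALS)}
    for tok, freq in counter.items():  # insertion order = first appearance
        if freq >= min_freq and tok not in stoi:
            stoi[tok] = len(stoi)
    return stoi

def encode(text, stoi):
    """Wrap the token indices in <BOS> ... <EOS>; unknown tokens map to <unk>."""
    unk = stoi['<unk>']
    ids = [stoi.get(tok, unk) for tok in simple_tokenize(text)]
    return [stoi['<BOS>']] + ids + [stoi['<EOS>']]

toy_reviews = ["this movie is great", "this is me"]
toy_stoi = build_stoi(toy_reviews)
encoded = encode("this is me", toy_stoi)
unseen = encode("this is unseen", toy_stoi)
print(encoded)  # [1, 4, 6, 8, 2]
print(unseen)   # [1, 4, 6, 0, 2]
```

The fallback to `<unk>` matters mostly at inference time, when a sentence may contain words that never appeared in the reviews.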


In [60]:
print(f'Length of vocabulary is {len(vocab)}')

Length of vocabulary is 147160



### Padding, batching and loading as tensor

In [61]:
RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# VOCABULARY_SIZE = 20000
LEARNING_RATE = 1e-4
BATCH_SIZE = 128
NUM_EPOCHS = 15

INPUT_DIM = len(vocab)
EMBEDDING_DIM = 128
HIDDEN_DIM = 256
OUTPUT_DIM = 1
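
Since the lab is about varying the embedding dimension, it helps to see where the parameters sit. The arithmetic below counts trainable parameters for an embedding layer, a single-layer unidirectional LSTM, and a linear output layer, assuming PyTorch's LSTM layout of four gates with two bias vectors each. The helper name is hypothetical.

```python
def count_parameters(vocab_size, embedding_dim, hidden_dim, output_dim):
    """Trainable parameters of Embedding -> 1-layer LSTM -> Linear."""
    embedding = vocab_size * embedding_dim
    # 4 gates, each with input weights, recurrent weights, and two biases.
    lstm = 4 * (embedding_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    linear = hidden_dim * output_dim + output_dim
    return embedding, lstm, linear

# Values from this notebook: vocabulary of 147160 tokens, 128-d embeddings, 256-d hidden state.
embedding, lstm, linear = count_parameters(147160, 128, 256, 1)
total = embedding + lstm + linear
print(embedding, lstm, linear, total)
# 18836480 395264 257 19232001
```

With these settings the embedding table holds about 98% of the parameters, so changing `EMBEDDING_DIM` changes the model size almost proportionally.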


- Use `DataLoader` to batch the data
- Pass a `collate_fn` to the `DataLoader` to pad each batch to a common length

In [62]:

from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

def custom_collate(data):
    # Each item is a [sentiment, review] pair; encode the review text.
    inputs = [torch.tensor(text_transform(d[1])) for d in data]
    labels = torch.tensor([label_transform(d[0]) for d in data])
    # The true lengths are needed later by pack_padded_sequence.
    inputs_lens = torch.tensor([len(x) for x in inputs])
    # Right-pad to the longest sequence in the batch (fills with 0 by default).
    inputs = pad_sequence(inputs, batch_first=True)

    return labels.to(DEVICE), inputs.to(DEVICE), inputs_lens.to(DEVICE)


Inspect the output of this function on the first three training examples:

In [63]:
custom_collate(list(train_data)[:3])

(tensor([1., 1., 0.], device='cuda:0'),
 tensor([[    1, 71932,    49,  ...,  3856,    24,     2],
         [    1,  2382, 39360,  ...,     0,     0,     0],
         [    1, 43646,  1841,  ...,     0,     0,     0]], device='cuda:0'),
 tensor([528,  54,  59], device='cuda:0'))
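
The padding step on its own can be seen on a toy batch. The token ids here are made up. Note that `pad_sequence` fills with 0 by default, which in this vocabulary is the index of `<unk>` rather than `<PAD>` (index 3), so passing `padding_value` explicitly is one way to keep the two apart.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 3  # index of '<PAD>' given the specials order used in this notebook

# Three made-up encoded reviews of different lengths (1 = <BOS>, 2 = <EOS>).
sequences = [
    torch.tensor([1, 10, 11, 12, 2]),
    torch.tensor([1, 20, 2]),
    torch.tensor([1, 30, 31, 2]),
]
lengths = torch.tensor([len(s) for s in sequences])

# batch_first=True gives shape (batch, max_len); shorter rows are right-padded.
padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)
print(padded.shape)  # torch.Size([3, 5])
print(padded[1])     # tensor([ 1, 20,  2,  3,  3])
print(lengths)       # tensor([5, 3, 4])
```

Because the model packs the sequences using their true lengths, the padded positions never reach the recurrent layer, so the choice of fill value does not affect the results reported here.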

In [64]:
from torch.utils.data import DataLoader
# Only the training set needs shuffling; evaluation accuracy does not depend on batch order.
train_loader = DataLoader(train_data, shuffle=True, batch_size=BATCH_SIZE, collate_fn=custom_collate)
valid_loader = DataLoader(valid_data, shuffle=False, batch_size=BATCH_SIZE, collate_fn=custom_collate)
test_loader = DataLoader(test_data, shuffle=False, batch_size=BATCH_SIZE, collate_fn=custom_collate)

Fetch one batch to inspect the data:

In [65]:
next(iter(train_loader))

(tensor([0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 1., 0., 0., 1., 0., 1.,
         1., 1., 1., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0.,
         0., 0., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1., 1., 0., 1., 0., 0., 0.,
         1., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
         1., 0., 1., 1., 0., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0., 1., 1., 1.,
         0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0., 0., 1.,
         1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0.,
         0., 1.], device='cuda:0'),
 tensor([[    1,   181,   106,  ...,     0,     0,     0],
         [    1,   457,   153,  ...,     0,     0,     0],
         [    1,   181,   637,  ...,     0,     0,     0],
         ...,
         [    1,   258,  4717,  ...,     0,     0,     0],
         [    1,    50,   371,  ...,     0,     0,     0],
         [    1,  1055, 10006,  ...,     0,     0,     0]], device='cuda:0'),
 tensor([ 

### Model

In [66]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        
        super().__init__()
        # A preliminary model using an LSTM cell.
        # The primary goal of this lab is to vary the embedding dimension and compare the results.
        # The second task is to use another RNN cell, such as GRU, tune its parameters, and report the results.

        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text, text_length):
        embedded = self.embedding(text)  # (batch, seq_len, embedding_dim)
        # Pack so the LSTM skips the padding added by pad_sequence; the lengths must be on the CPU.
        packed = pack_padded_sequence(embedded, text_length.to('cpu'), batch_first=True, enforce_sorted=False)
        packed_output, (hidden, cell) = self.rnn(packed)
        # hidden: (1, batch, hidden_dim) -> one logit per example.
        return self.fc(hidden.squeeze(0)).view(-1)
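
Packing is what makes the final hidden state independent of the padding. The sketch below checks this on a toy LSTM: a short sequence run alone and the same sequence padded inside a batch yield the same final hidden state. All sizes and values are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3)

# Batch of 2 already-embedded sequences, shape (batch, max_len, features).
batch = torch.randn(2, 5, 4)
lengths = torch.tensor([5, 2])  # the second sequence has only 2 real steps
batch[1, 2:] = 0.0              # padding positions

with torch.no_grad():
    packed = pack_padded_sequence(batch, lengths, batch_first=True,
                                  enforce_sorted=False)
    _, (hidden, _) = lstm(packed)  # hidden: (1, batch, hidden_size)

    # Run the short sequence alone, without padding: (seq_len, 1, features).
    alone = batch[1, :2].unsqueeze(1)
    _, (hidden_alone, _) = lstm(alone)

same = torch.allclose(hidden[0, 1], hidden_alone[0, 0], atol=1e-5)
print(tuple(hidden.shape), same)
```

Without packing, the LSTM would keep updating its state over the padded steps, and the final hidden state of short reviews would depend on how long the longest review in the batch happened to be.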

In [67]:
model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)
model = model.to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
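
For the second task (swapping in a different cell), a GRU version might look like the sketch below. It is not part of the original lab code: the class name `GRUClassifier` and the tiny dimensions in the smoke test are assumptions. The main difference from the LSTM is that `nn.GRU` returns only a hidden state, with no cell state.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class GRUClassifier(nn.Module):
    """Hypothetical GRU counterpart of the LSTM model in this notebook."""

    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text, text_length):
        embedded = self.embedding(text)             # (batch, seq_len, emb)
        packed = pack_padded_sequence(embedded, text_length.to('cpu'),
                                      batch_first=True, enforce_sorted=False)
        _, hidden = self.rnn(packed)                # (1, batch, hidden)
        return self.fc(hidden.squeeze(0)).view(-1)  # one logit per example

# Smoke test on a made-up batch: 2 sequences, vocabulary of 50 ids.
toy_model = GRUClassifier(input_dim=50, embedding_dim=8, hidden_dim=16, output_dim=1)
toy_text = torch.tensor([[1, 7, 9, 2], [1, 5, 2, 0]])
toy_lengths = torch.tensor([4, 3])
toy_logits = toy_model(toy_text, toy_lengths)
print(toy_logits.shape)  # torch.Size([2])
```

Since the forward signature matches the LSTM model, such a class could be dropped into the same training and evaluation code with only the constructor call changed.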

### Training

In [68]:
def compute_binary_accuracy(model, data_loader, device):
    # `device` is unused because custom_collate already moves each batch to DEVICE.
    model.eval()
    correct_pred, num_examples = 0, 0
    with torch.no_grad():
        for label, text, text_length in data_loader:
            logits = model(text, text_length)
            # A sigmoid output above 0.5 counts as a positive prediction.
            predicted_labels = (torch.sigmoid(logits) > 0.5).long()
            num_examples += label.size(0)
            correct_pred += (predicted_labels == label.long()).sum()
    return correct_pred.float() / num_examples * 100
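
The accuracy function thresholds the sigmoid of each logit at 0.5. Because the sigmoid is monotonic and equals 0.5 at 0, this is the same as checking whether the logit is positive. A plain-Python sketch with made-up logits and labels:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.3, -0.7, 0.1, -1.5]  # made-up model outputs
labels = [1, 0, 0, 0]            # made-up ground truth

via_sigmoid = [int(sigmoid(z) > 0.5) for z in logits]
via_sign = [int(z > 0) for z in logits]

correct = sum(p == y for p, y in zip(via_sigmoid, labels))
accuracy = 100.0 * correct / len(labels)
print(via_sigmoid, via_sign, accuracy)  # [1, 0, 1, 0] [1, 0, 1, 0] 75.0
```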

In [69]:
import time
start_time = time.time()

for epoch in range(NUM_EPOCHS):
    model.train()
    for idx, (label, text,text_length) in enumerate(train_loader):
        
        ### FORWARD AND BACK PROP
        logits = model(text, text_length)
        cost = F.binary_cross_entropy_with_logits(logits, label)
        optimizer.zero_grad()
        cost.backward()
        
        ### UPDATE MODEL PARAMETERS
        optimizer.step()
        
        ### LOGGING
        if not idx % 50:
             print(f'Epoch: {epoch+1:03d}/{NUM_EPOCHS:03d} | '
                   f'Batch {idx:03d}/{len(train_loader):03d} | '
                   f'Cost: {cost:.4f}')

    with torch.set_grad_enabled(False):
        print(f'training accuracy: '
              f'{compute_binary_accuracy(model, train_loader, DEVICE):.2f}%'
              f'\nvalid accuracy: '
              f'{compute_binary_accuracy(model, valid_loader, DEVICE):.2f}%')
        
    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')
    
print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')
print(f'Test accuracy: {compute_binary_accuracy(model, test_loader, DEVICE):.2f}%')

Epoch: 001/015 | Batch 000/293 | Cost: 0.6918
Epoch: 001/015 | Batch 050/293 | Cost: 0.6854
Epoch: 001/015 | Batch 100/293 | Cost: 0.6944
Epoch: 001/015 | Batch 150/293 | Cost: 0.6865
Epoch: 001/015 | Batch 200/293 | Cost: 0.6266
Epoch: 001/015 | Batch 250/293 | Cost: 0.5758
training accuracy: 71.58%
valid accuracy: 70.08%
Time elapsed: 0.87 min
Epoch: 002/015 | Batch 000/293 | Cost: 0.6088
Epoch: 002/015 | Batch 050/293 | Cost: 0.4727
Epoch: 002/015 | Batch 100/293 | Cost: 0.4751
Epoch: 002/015 | Batch 150/293 | Cost: 0.4879
Epoch: 002/015 | Batch 200/293 | Cost: 0.4724
Epoch: 002/015 | Batch 250/293 | Cost: 0.4567
training accuracy: 80.12%
valid accuracy: 77.60%
Time elapsed: 1.74 min
Epoch: 003/015 | Batch 000/293 | Cost: 0.4758
Epoch: 003/015 | Batch 050/293 | Cost: 0.4403
Epoch: 003/015 | Batch 100/293 | Cost: 0.3637
Epoch: 003/015 | Batch 150/293 | Cost: 0.5305
Epoch: 003/015 | Batch 200/293 | Cost: 0.3574
Epoch: 003/015 | Batch 250/293 | Cost: 0.3941
training accuracy: 83.37%
va
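
The first logged cost, 0.6918, sits close to ln 2 ≈ 0.6931, which is what binary cross-entropy gives when the model outputs a logit of 0 (probability 0.5), i.e. an untrained model guessing at chance. Below is a plain-Python sketch of the per-example loss, written in a numerically stable form commonly used for logits. The function name is hypothetical and this is not the library's implementation.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy for one example, computed from the raw logit."""
    # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit).
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

chance = bce_with_logits(0.0, 1.0)
confident_right = bce_with_logits(3.0, 1.0)  # small loss
confident_wrong = bce_with_logits(3.0, 0.0)  # large loss
print(round(chance, 4))  # 0.6931
print(confident_right < chance < confident_wrong)  # True
```

A cost that stays near 0.69 therefore means the model has not yet learned anything, while the drop to roughly 0.4 by the third epoch reflects predictions moving away from 0.5 in the right direction.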

### Inference

In [136]:
def predict_sentiment(model, sentence):
    # based on:
    # https://github.com/bentrevett/pytorch-sentiment-analysis/blob/
    # master/2%20-%20Upgraded%20Sentiment%20Analysis.ipynb
    model.eval()
    # Encode the sentence and add a batch dimension: shape (1, seq_len).
    tokens = torch.tensor(text_transform(sentence)).unsqueeze(0).to(DEVICE)
    length = torch.tensor([tokens.size(1)])
    with torch.no_grad():
        prediction = torch.sigmoid(model(tokens, length))

    # Probability that the sentence is positive.
    return prediction.item()

In [137]:
predict_sentiment(model,"I love it so much!")

0.8334588408470154

In [141]:
predict_sentiment(model,"I hate it, so foolish and sucks!")

0.25770819187164307