# <span style="color:turquoise">Text classification with PyTorch</span>


An example of natural language processing applied to sentiment analysis. <br> The notebook builds a binary classifier that predicts whether a movie review is positive or negative.




__Dataset:__ IMDB movie reviews from Kaggle<br>
__Model:__ LSTM


### <span style="color:teal">Todo:</span>

- ~~Read dataset~~
- ~~Preprocess text~~
- ~~Split into train, validation, and test sets~~
- ~~Convert text to indices and add paddings~~
- ~~Make model~~
- ~~Make training function~~
- ~~Make evaluation function~~
- ~~Train~~
- Evaluate on test set
- Run inference

In [1]:
import csv
import random
import numpy as np

from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

import torch
import torch.nn as nn

import time

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")



## <span style="color:teal">Read the data and split it into training, validation, and test sets</span>

In [2]:
class Reviews():
    
    def __init__(self):
        self.train = {}
        self.val = {}
        self.test = {}
        self.LABELS = {"positive":1, "negative": 0}
        self.COUNT = {"positive": 0, "negative": 0}
    
    
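    def read_data(self):
        
        dataset = []
        
        with open("IMDB_Dataset.csv", newline='') as f:
            datareader = csv.reader(f, delimiter=',')
            next(datareader, None)  # skip the header row

            for row in datareader:
                # row[0] is the review text, row[1] is "positive" or "negative"
                dataset.append([row[0], self.LABELS[row[1]]])
                self.COUNT[row[1]] += 1
            
        # Shuffling is unseeded, so the example order differs between runs.
        random.shuffle(dataset)
                
        return dataset


    def split_dataset(self,
                      dataset,
                      split=(int(50000*0.6), int(50000*0.2), int(50000*0.2))):
        
        # The default lengths assume the full 50,000-review dataset (60/20/20).
        # The seeded generator fixes which positions land in each split.
        train, val, test = torch.utils.data.random_split(
            dataset,
            split,
            generator=torch.Generator().manual_seed(43))
        
        return train, val, test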

In [3]:
rev = Reviews()
data = rev.read_data()
pos_count = rev.COUNT["positive"]
neg_count = rev.COUNT["negative"]


In [4]:
print(data[10])

["Its a very good comedy movie.Ijust liked it.I don't know why i love this movie i just love it.Storyline:It is a story of two boys Amar (Aamir Khan) and Prem (Salman Khan) who want to get rich quickly by taking all the short-cuts in the book. Amar is the son of an honest barber, Murli Manohar (Deven Verma) in Mumbai, while Prem is the son of Bankeylal Bhopali (Jagdeep), a hardworking tailor in Bhopal. Both Amar and Prem sell their father's shop and house respectively, and zero in on a hill station where a beautiful wealthy heiress Raveena (Raveena Tandon) has come from London accompanied by her friend cum secretary Karishma (Karishma Kapoor) with the intention of getting married to a virtuous Indian. The lucky man to wed Raveena will inherit her father Ram Gopal Bajaj's (Paresh Rawal) entire wealth. Amar and Prem see their get rich quick chance and woo Raveena, each trying to out do the other. Enter Teja (Paresh Rawal in a double role) whose sole ambition in life has been to grab his 

In [5]:
train, val, test = rev.split_dataset(data)
print(train[10])

["THE MATADOR is hit-man movie lite....if you can say that about a hit-man movie. The violence is never really shown but often introduced. At first I was scared I was in for another retread of mid-90s gangster-hit-man-hipster-dark comedy BUT was happily surprised when I realized this is just a sweet and humorous story about friendship. Nothing terribly exciting happens in this film but every bit of it is kept me grinning. The three leads have the best chemistry the big screen has offered in recent years and it looks like they had a great time making this film together. The writing is sharp though at times it felt as if the script had been adapted from a stage play because of the one set dialog scenes. This is a good film that I probably won't remember for too long but at the time it was a complete joy. Good film.", 1]


In [6]:
print(len(train), len(val), len(test))

30000 10000 10000
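
The default split lengths are written for a dataset of 50,000 reviews. A minimal sketch of deriving the three lengths from any dataset size follows; `split_lengths` is a hypothetical helper, not part of the notebook, and giving the rounding remainder to the training set is an assumption.

```python
def split_lengths(n_items, fractions=(0.6, 0.2, 0.2)):
    """Turn split fractions into integer lengths that sum exactly to n_items."""
    lengths = [int(n_items * fraction) for fraction in fractions]
    # Rounding down can leave a few items unassigned; give them to the first split.
    lengths[0] += n_items - sum(lengths)
    return lengths


print(split_lengths(50000))  # [30000, 10000, 10000]
print(split_lengths(1001))   # [601, 200, 200]
```

The resulting list could be passed as the `split` argument of `split_dataset`.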


In [7]:
def split_x_and_y(data):
    x = []
    y = []
    for review, label in data:
        x.append(review)
        y.append(label)
    return x, np.array(y)

In [8]:
train_x_raw, train_y = split_x_and_y(train)
val_x_raw, val_y = split_x_and_y(val)
test_x_raw, test_y = split_x_and_y(test)


print(len(train_x_raw), len(train_y))
print(train_x_raw[50], train_y[50])

30000 30000
While in one country, Spain, Luis Bunuel and Salvador Dali combined forces to create the benchmark of short-subject, cinematic surrealism, Un Chien Andalou, Walt Disney and his collaborator Ub Iwerks in America worked on Steamboat Willie, the most prominent of the early synchronized sound cartoons (it was revealed that this was not the first, contrary to other reports). It's also one of the more successfully simplistic and funny of the Mickey Mouse shorts (still in a silent-film way- the only sounds are little irks and bleeps from the Mickey and the animals). It also goes by fairly quickly for its less-than-ten minute run. But in these minutes one gets the immediate sense of how much fun Disney has with his characters, and how the newfound use of sound changes how his creation uses the animals as musical tools. There's no story to speak of, just random things that happens and occurs because of Mickey (err, Steamboat Willie) on this boat on a river. And like the better Micke

## <span style="color:teal">Preprocess text</span>

In [9]:
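def preprocess(review,
               remove_stopwords=False, 
               remove_html=True, 
               remove_punct=False, 
               lowercase=False, 
               lemmatize=False,
               maxlen=128):
    
    # Turn backslash-escaped apostrophes (\') into plain ones and the
    # Windows-1252 en dash (\x96) into a hyphen.
    review = re.sub(r"\\'", "'", review)
    review = re.sub(r"\x96", "-", review)
    
    if remove_html:
        # Match one tag at a time; a greedy '<.*>' would also delete all the
        # text between the first and the last tag of a review.
        review = re.sub(r'<[^>]+>', ' ', review)
    
    review = word_tokenize(review)
        
    if remove_stopwords:
        # NLTK's stop-word list is lowercase and this check runs before
        # lowercasing, so capitalized stop words ("The", "I") are kept.
        stop_words = set(stopwords.words("english"))
        review = [w for w in review if w not in stop_words]
        
    if remove_punct:
        # Keep alphanumeric tokens and the contraction pieces made by word_tokenize.
        contractions = ["'ll", "'s", "n't", "'d", "'m", "'ve", "'re"]
        review = [w for w in review if w.isalnum() or w in contractions]
    
    if lowercase:
        review = [w.lower() for w in review]
        
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        review = [lemmatizer.lemmatize(w) for w in review]
    
    # Truncate to at most maxlen tokens.
    return review[:maxlen]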
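
The `remove_html` step depends on how the tag pattern is written. The sketch below uses a made-up review string to show how a greedy pattern and a per-tag pattern behave when a review has several `<br />` tags.

```python
import re

# Made-up example text with the kind of line-break tags found in scraped reviews.
sample = "Great start.<br /><br />Weak middle.<br />Strong ending."

# Greedy: `.*` runs from the first '<' to the last '>' in the string.
greedy = re.sub(r"<.*>", " ", sample)

# Negated character class: each match stops at the nearest '>'.
per_tag = re.sub(r"<[^>]+>", " ", sample)

print(repr(greedy))   # 'Great start. Strong ending.'
print(repr(per_tag))  # 'Great start.  Weak middle. Strong ending.'
```

The greedy version silently drops the middle sentence, which matters for reviews that use tags as paragraph breaks.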
    


In [10]:
train_words = [preprocess(review, 
                      lowercase=True, 
                      remove_punct=True,
                      remove_stopwords=True
                     ) 
           for review in train_x_raw]

val_words = [preprocess(review, 
                    lowercase=True, 
                    remove_punct=True,
                    remove_stopwords=True
                   ) 
         for review in val_x_raw]

In [11]:
print(train_words[9592], '\n', val_words[3029])
print(len(train_words[9592]), '\n', len(val_words[3029]))

['the', 'movie', 'great', 'venezuelan', 'tourism', 'birds', 'birds', 'birds', 'only', '1', 'piranha', 'nice', 'scenery', 'the', 'highlight', 'alligator', 'seen', 'long', 'boring', 'motorcycle', 'race', 'the', 'end', 'caribe', 'drowns', 'definite', 'hollywood', 'prop', 'there', 'definite', 'storyline', 'it', 'goes', 'venezuelan', 'scenery', 'rip', 'easy', 'rider', 'diamond', 'mining', 'ruthless', 'hunter', 'going', 'crazy', 'reason', 'gets', 'end', 'a', 'low', 'budget', 'movie', 'could', 'filmed', 'anywhere', 'outtakes', 'venezuela', 'william', 'smith', 'talented', 'actor', 'made', 'good', 'movies', 'like', 'actors', 'need', 'least', 'one', 'bad', 'film', 'do', "n't", 'waste', 'dvd'] 
 ['the', 'claude', 'lelouch', "'s", 'movie', 'pretty', 'good', 'moment', 'cinema', 'one', 'touching', 'films', 'family', 'loneliness', 'surely', 'best', 'interpretation', 'french', 'actor', 'belmondo']
74 
 20


## <span style="color:teal">Convert text to indices and add paddings</span>

In [12]:
def make_vocabulary_dicts(preprocessed_data, pad_token='<PAD>', unk_token='<UNK>'):
    vocab = set()
    
    for review in preprocessed_data:
        for word in review:
            vocab.add(word)
    
            
    vocab_sorted = sorted(vocab)
    word2ind = {word : i+2 for i, word in enumerate(vocab_sorted)}
    ind2word = {i+2 : word for i, word in enumerate(vocab_sorted)}
    
    # Reserve index 0 for the pad token
    word2ind[pad_token] = 0
    ind2word[0] = pad_token
    
    # Reserve index 1 for the 'unknown' token
    word2ind[unk_token] = 1
    ind2word[1] = unk_token
    
    assert len(word2ind) == len(ind2word)

    
    return word2ind, ind2word

In [13]:
del train_x_raw, val_x_raw

In [14]:
word2ind, ind2word = make_vocabulary_dicts(train_words)

print(len(word2ind), len(ind2word))
print(word2ind['never'], word2ind['awful'])
print(ind2word[6700], ind2word[10582])

61552 61552
37504 4183
bone clockwise
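
To make the index layout concrete, here is a small sketch of the same idea on a toy corpus: indices 0 and 1 are reserved, real words start at 2, and any word not seen in training falls back to the unknown index. The names `build_toy_vocab`, `toy_w2i`, and `toy_i2w` are illustrative only.

```python
PAD, UNK = "<PAD>", "<UNK>"


def build_toy_vocab(tokenized_reviews):
    """Map each distinct word to an index, reserving 0 for PAD and 1 for UNK."""
    words = sorted({word for review in tokenized_reviews for word in review})
    word2ind = {PAD: 0, UNK: 1}
    word2ind.update({word: i + 2 for i, word in enumerate(words)})
    ind2word = {index: word for word, index in word2ind.items()}
    return word2ind, ind2word


toy_w2i, toy_i2w = build_toy_vocab([["good", "film"], ["bad", "film"]])

# "plot" never appeared in the toy training data, so it maps to UNK (index 1).
encoded = [toy_w2i.get(word, toy_w2i[UNK]) for word in ["good", "plot", "film"]]
decoded = [toy_i2w[index] for index in encoded]

print(encoded)  # [4, 1, 3]
print(decoded)  # ['good', '<UNK>', 'film']
```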


In [15]:
print(np.max([len(x) for x in train_words]))
print(np.mean([len(x) for x in train_words]))

print(np.max([len(x) for x in val_words]))
print(np.mean([len(x) for x in val_words]))

128
71.35953333333333
128
71.5128


In [16]:
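def make_padded_inputs(preprocessed_data, 
                       vocab, 
                       padded_length=128,
                       pad_token='<PAD>',
                       unk_token='<UNK>'
                      ):
    
    num_lines = len(preprocessed_data)
    pad = vocab[pad_token]
    unk = vocab[unk_token]
    
    # Start from an array filled with the pad index. int64 converts to a
    # LongTensor, which nn.Embedding accepts on every platform.
    inputs = np.full((num_lines, padded_length), pad, dtype=np.int64)
    
    for i, review in enumerate(preprocessed_data):
        # Truncate so reviews longer than padded_length cannot overflow the row.
        for j, word in enumerate(review[:padded_length]):
            # Words missing from the vocabulary map to the unknown token.
            inputs[i, j] = vocab.get(word, unk)
            
    return inputs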
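
The padding rule can be checked on a single sequence. This is a plain-Python sketch of the same right-padding and truncation; `pad_or_truncate` is a hypothetical helper and is not used elsewhere in the notebook.

```python
def pad_or_truncate(indices, length, pad_index=0):
    """Right-pad with pad_index, or cut off the tail, so the result has `length` items."""
    clipped = indices[:length]
    return clipped + [pad_index] * (length - len(clipped))


print(pad_or_truncate([7, 8, 9], 5))           # [7, 8, 9, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))  # [1, 2, 3, 4]
```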
            

In [17]:
train_x = make_padded_inputs(train_words, word2ind)
val_x = make_padded_inputs(val_words, word2ind)


print(f"""Training example at index 10:\n{train_words[10]}\n
    Converted to indices:\n{train_x[10, :]}\n""")

Training example at index 10:
['the', 'matador', 'movie', 'lite', 'say', 'movie', 'the', 'violence', 'never', 'really', 'shown', 'often', 'introduced', 'at', 'first', 'i', 'scared', 'i', 'another', 'retread', 'comedy', 'but', 'happily', 'surprised', 'i', 'realized', 'sweet', 'humorous', 'story', 'friendship', 'nothing', 'terribly', 'exciting', 'happens', 'film', 'every', 'bit', 'kept', 'grinning', 'the', 'three', 'leads', 'best', 'chemistry', 'big', 'screen', 'offered', 'recent', 'years', 'looks', 'like', 'great', 'time', 'making', 'film', 'together', 'the', 'writing', 'sharp', 'though', 'times', 'felt', 'script', 'adapted', 'stage', 'play', 'one', 'set', 'dialog', 'scenes', 'this', 'good', 'film', 'i', 'probably', 'wo', "n't", 'remember', 'long', 'time', 'complete', 'joy', 'good', 'film']

    Converted to indices:
[54695 34131 36447 32196 47544 36447 54695 58837 37504 44369 49327 38607
 28354  3761 20409 26724 47620 26724  2776 45667 11084  8038 24587 53309
 26724 44363 53514 26513 5

## <span style="color:teal">Load data into torch</span>

In [18]:
train_dataset = torch.utils.data.TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
val_dataset = torch.utils.data.TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))

In [19]:
batch_size = 128

# Reshuffle the training examples every epoch; validation order does not matter.
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size)
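
With a batch size of 128 the loaders do not divide the data evenly, so the last batch of each epoch is smaller. A quick sketch of the arithmetic, assuming the `DataLoader` default of keeping the incomplete final batch (`drop_last=False`); `batch_layout` is an illustrative helper.

```python
import math


def batch_layout(n_examples, batch_size):
    """Return (number of batches, size of the last batch) for n_examples > 0."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    n_batches = math.ceil(n_examples / batch_size)
    last_batch = n_examples - (n_batches - 1) * batch_size
    return n_batches, last_batch


print(batch_layout(30000, 128))  # (235, 48)
print(batch_layout(10000, 128))  # (79, 16)
```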

In [20]:
class SentimentClassifier(nn.Module):
    
    def __init__(self, 
                 vocab_size, 
                 d_feature, 
                 num_layers, 
                 hidden_size,
                 n_outputs,
                 bidirectional=False,
                 dropout_rate=0.2):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, d_feature)
        self.dropout = nn.Dropout(p=dropout_rate)
        self.lstm = nn.LSTM(input_size=d_feature,
                           hidden_size=hidden_size,
                           num_layers=num_layers,
                           bidirectional=bidirectional,
                           batch_first=True)
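        # A bidirectional LSTM concatenates both directions, doubling the feature size.
        self.fc = nn.Linear(hidden_size * (2 if bidirectional else 1), n_outputs)
        self.sigmoid = nn.Sigmoid()
        
        
    def forward(self, input_data): 
        
        # (batch, seq_len) -> (batch, seq_len, d_feature)
        embedded = self.dropout(self.embedding(input_data))
        lstm_out, _ = self.lstm(embedded)
        # Use the output at the last time step; for reviews shorter than the
        # padded length this step is a padding position.
        fc = self.fc(lstm_out[:,-1,:])
        # Squash the logit into a probability of the positive class.
        sigmoid = self.sigmoid(fc)
   
        return sigmoid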
    

In [21]:
vocab_size = len(word2ind)
d_feature = 128
hidden_size = 128
n_outputs = 1
num_layers = 1

model = SentimentClassifier(
                            vocab_size=vocab_size, 
                            d_feature=d_feature,  
                            num_layers=num_layers, 
                            hidden_size=hidden_size, 
                            n_outputs=n_outputs).to(device)

print(model)

SentimentClassifier(
  (embedding): Embedding(61552, 128)
  (dropout): Dropout(p=0.2, inplace=False)
  (lstm): LSTM(128, 128, batch_first=True)
  (fc): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
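
The layer shapes in the printed summary determine the number of trainable parameters. The sketch below works through that count for a single-layer, unidirectional LSTM, assuming PyTorch's layout of four gates with separate input and recurrent bias vectors; the helper functions are illustrative.

```python
def embedding_params(vocab_size, d_feature):
    # One d_feature-sized vector per vocabulary entry.
    return vocab_size * d_feature


def lstm_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights, and two bias vectors.
    per_gate = hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size
    return 4 * per_gate


def linear_params(in_features, out_features):
    # Weight matrix plus one bias per output.
    return in_features * out_features + out_features


counts = {
    "embedding": embedding_params(61552, 128),
    "lstm": lstm_params(128, 128),
    "fc": linear_params(128, 1),
}
print(counts)                # {'embedding': 7878656, 'lstm': 132096, 'fc': 129}
print(sum(counts.values()))  # 8010881
```

Almost all of the parameters sit in the embedding table, so vocabulary size drives the model size far more than the LSTM width does.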


## <span style="color:teal">Train model</span>

In [26]:
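def train_model(train_loader=train_loader,
                val_loader=val_loader,
                model=model,
                optimizer=None,
                criterion=None,
                n_epochs=6):
    
    # Build the defaults at call time so the optimizer is bound to the model
    # passed in, not to whatever `model` was when this function was defined.
    if optimizer is None:
        optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
    if criterion is None:
        criterion = nn.BCELoss()
    
    start_time = time.time()
    
    for epoch in range(n_epochs):
        model.train()
        for inputs, labels in train_loader:
            # The model lives on `device`, so each batch has to be moved there too.
            inputs, labels = inputs.to(device), labels.to(device)
            
            optimizer.zero_grad()
            output = model(inputs)
            loss = criterion(output.squeeze(1), labels.float())
            loss.backward()
            # Clip after backward() so the freshly computed gradients are clipped.
            nn.utils.clip_grad_norm_(model.parameters(), 5)
            optimizer.step()

        model.eval()
        val_losses = []
        
        # No gradients are needed for validation.
        with torch.no_grad():
            for val_inputs, val_labels in val_loader:
                val_inputs, val_labels = val_inputs.to(device), val_labels.to(device)
                val_output = model(val_inputs)
                val_loss = criterion(val_output.squeeze(1), val_labels.float())
                val_losses.append(val_loss.item())

        # Training loss is from the last batch; validation loss is the epoch mean.
        print(f"Epoch: {epoch+1}/{n_epochs}",
              f"Time taken: {((time.time() - start_time) / 60):.2f} min",
              f"Training Loss: {loss.item():.4f}",
              f"Validation Loss: {np.mean(val_losses):.4f}")
            
    print(f"Training completed in {(time.time() - start_time) / 60} min.")
    # Both values below come from the last batch, so the validation figure can
    # differ from the per-epoch mean printed above.
    print(f"Final loss: {loss}\nValidation loss: {val_loss}")
    
    return loss, val_loss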
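
Gradient clipping rescales gradients that already exist, so it only has an effect between `loss.backward()` and `optimizer.step()`. The following plain-Python sketch shows the rescaling rule for a flat list of gradient values; it mirrors the idea behind `clip_grad_norm_` under the assumption of an L2 norm, and `clip_by_global_norm` is an illustrative name.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale grads so their L2 norm is at most max_norm; return (clipped, original norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale never exceeds 1, so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm = clip_by_global_norm([6.0, 8.0], max_norm=5.0)
print(norm)     # 10.0
print(clipped)  # approximately [3.0, 4.0]

small, _ = clip_by_global_norm([0.3, 0.4], max_norm=5.0)
print(small)    # [0.3, 0.4], unchanged because the norm is already below 5
```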
    

In [27]:
train_model()

Epoch: 1/6 Time taken: 1.33 min Training Loss: 0.6925 Validation Loss: 0.6922
Epoch: 2/6 Time taken: 2.66 min Training Loss: 0.7026 Validation Loss: 0.5965
Epoch: 3/6 Time taken: 3.93 min Training Loss: 0.3667 Validation Loss: 0.4768
Epoch: 4/6 Time taken: 5.17 min Training Loss: 0.2841 Validation Loss: 0.4688
Epoch: 5/6 Time taken: 6.42 min Training Loss: 0.1517 Validation Loss: 0.5004
Epoch: 6/6 Time taken: 7.67 min Training Loss: 0.1586 Validation Loss: 0.4928
Training completed in 7.674058783054352 min.
Final loss: 0.15861161053180695
Validation loss: 0.7937467098236084


(tensor(0.1586, grad_fn=<BinaryCrossEntropyBackward>),
 tensor(0.7937, grad_fn=<BinaryCrossEntropyBackward>))
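
The first-epoch losses of about 0.692 are close to ln 2 ≈ 0.6931, which is the binary cross-entropy of a model that outputs 0.5 for every review, so the network starts at chance level. A small sketch of the per-example loss makes this baseline explicit; `bce` is an illustrative helper that follows the standard formula behind `nn.BCELoss`.

```python
import math


def bce(prob, label):
    """Binary cross-entropy for one example: prob is P(positive), label is 0 or 1."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


print(round(bce(0.5, 1), 4))  # 0.6931, an undecided prediction
print(round(bce(0.9, 1), 4))  # 0.1054, confident and correct
print(round(bce(0.9, 0), 4))  # 2.3026, confident and wrong
```

Confident mistakes are penalised heavily, which is one way the validation loss can rise in later epochs while the training loss keeps falling.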