# <span style="color:turquoise">Text classification with pytorch</span>


An example of using natural language processing for sentiment analysis. <br> Building a binary classifier of movie reviews that will predict if a review is positive or negative.




__Dataset:__ IMDB movie reviews from Kaggle<br>
__Model:__ LSTM (?)


### <span style="color:teal">Todo:</span>

- ~~Read dataset~~
- ~~Preprocess text~~
- ~~Split into train, validation, and test sets~~
- ~~Convert text to indices and add paddings~~
- ~~Make model~~
- Make training function
- Make evaluation function
- Train
- Evaluate

In [1]:
import csv
import random
import numpy as np

from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

  return torch._C._cuda_getDeviceCount() > 0


## <span style="color:teal">Read the data and split it into training, cross-validation, and test sets</span>

In [2]:
class Reviews():
    
    def __init__(self):
        self.train = {}
        self.val = {}
        self.test = {}
        self.LABELS = {"positive":1, "negative": 0}
        self.COUNT = {"positive": 0, "negative": 0}
    
    
    def read_data(self):
        
        dataset = []
        
        with open ("IMDB_Dataset.csv", newline='') as f:
            datareader = csv.reader(f, delimiter=',')
            next(datareader, None)

            for row in datareader:
                dataset.append([row[0], self.LABELS[row[1]]])
                self.COUNT[row[1]] += 1
            
            random.shuffle(dataset)
                
        return dataset




    def split_dataset(self,
                      dataset,
                      split=[int(50000*0.6), int(50000*0.2), int(50000*0.2)]):
        
        train, val, test = torch.utils.data.random_split(dataset,
                                               split,
                                               generator=torch.Generator().manual_seed(43))
          
            
        return train, val, test

In [3]:
rev = Reviews()
data = rev.read_data()
pos_count = rev.COUNT["positive"]
neg_count = rev.COUNT["negative"]


In [4]:
print(data[10])

['I bought this movie hoping that it would be another great killer toy movie. I am a big fan of the Child\'s Play series and was hoping to see the same here. Boy, was I wrong. Most of the movie was not the least bit scary, plus the only time we really see Pinocchio "alive" is the final few scenes of the film. The little girl in the film, her acting is so bad it\'s almost laughable. Plus, the ending never showed what happened to the puppet or what made them put the little girl in a asylum or wherever she was at the end of the film. So, in my opinion this movie is the worst of the "killer toy" genre. If you want a good killer toy series, stick with the Child\'s Play franchise. Pinocchio\'s Revenge is a waste of money and time.', 0]


In [5]:
train, val, test = rev.split_dataset(data)
print(train[10])

["This movie came as a huge disappointment. The anime series ended with a relatively stupid plot twist and the rushed introduction of a pretty lame villain, but I expected Shamballa to tie up all the loose ends. Unfortunately, it didn't. It added more plot holes than it resolved, and confused more than it clarified. The animation and voice acting were great, but with an idiotic plot, dull setting (most of the movie doesn't even take place in dull WWII Earth rather than the Alchemy world), and disappointing ending (Ed is useless for the rest of his days in a world with no alchemy, and he ditches Winry?), it was altogether pretty lackluster. Do yourself a favor-- disregard the last half of the anime as well as this movie, and read the manga.", 0]


In [6]:
print(len(train), len(val), len(test))

30000 10000 10000


In [7]:
def split_x_and_y(data):
    x = []
    y = []
    for review, label in data:
        x.append(review)
        y.append(label)
    return x, np.array(y)

In [8]:
train_x_raw, train_y = split_x_and_y(train)
val_x_raw, val_y = split_x_and_y(val)
test_x_raw, test_y = split_x_and_y(test)


print(len(train_x_raw), len(train_y))
print(train_x_raw[50], train_y[50])

30000 30000
I wish it were "Last Dumb Thriller". But thrillers are like that. They are like children: numerous, illogical, and often annoying. They want so desperately to be taken seriously but what is there to take seriously about a child's behaviour or a thriller's plot? Having seen this particular child - I mean... thriller - I understand why reviewers refer to it as "a hitchcockian thriller"; they might as well have called it "idiotic" for that's what "hitchcockian" means in the movie dictionary (look it up, if you don't believe me). Even the soundtrack is old-school Hollywood which is a mistake: it doesn't fit a late 70s film and makes it look phony. Besides, how dare they steal De Palma's idea of stealing from Hitchcock?! The story is absurd. Scheider's wife is killed, and her killers are never an issue. Instead, first his former employers follow him around, and later decide to kill him. Why do they decide to kill him? No explanation. Perhaps because the FBI is a dark, dark organ

## <span style="color:teal">Preprocess text</span>

In [9]:
def preprocess(review,
               remove_stopwords=False, 
               remove_html=True, 
               remove_punct=False, 
               lowercase=False, 
               lemmatize=False,
               maxlen=128):
    
    review = re.sub(r"\'", "'", review)
    review = re.sub(r"\x96", "-", review)
    
    if remove_html:
        review = re.sub(r'<.*>', ' ', review)
    
    review = word_tokenize(review)
        
    if remove_stopwords:
        stop_words = set(stopwords.words("english"))
        review = [w for w in review if w not in stop_words]
        
    if remove_punct:
        contractions = ["'ll", "'s", "n't", "'d", "'m", "'ve", "'re"]
        review = [w for w in review if w.isalnum() or w in contractions]
    
    if lowercase:
        review = [w.lower() for w in review]
        
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        review = [lemmatizer.lemmatize(w) for w in review]
    
    
    return review[:maxlen]
    


In [10]:
train_words = [preprocess(review, 
                      lowercase=True, 
                      remove_punct=True,
                      remove_stopwords=True
                     ) 
           for review in train_x_raw]

val_words = [preprocess(review, 
                    lowercase=True, 
                    remove_punct=True,
                    remove_stopwords=True
                   ) 
         for review in val_x_raw]

In [11]:
print(train_words[9592], '\n', val_words[3029])
print(len(train_words[9592]), '\n', len(val_words[3029]))

['i', 'never', 'seen', 'original', 'death', 'wish', 'book', 'either', 'death', 'wish', 'i', 'film', 'however', 'death', 'wish', '3', 'interested', 'film', 'the', 'vigilante', 'paul', 'kersey', 'tried', 'visit', 'friend', 'charlie', 'visited', 'couple', 'minutes', 'died', 'charlie', "n't", 'pay', 'protection', 'huge', 'infamous', 'underground', 'gang', 'leaded', 'manny', 'franker', 'after', 'altercation', 'time', 'jail', 'kersey', 'learned', 'franker', 'fought', 'jail', 'agenda', 'make', 'new', 'york', 'city', 'hellfire', 'also', 'influence', 'sending', 'henchmen', 'set', 'crime', 'every', 'time', 'everywhere', 'everyday', 'due', 'chance', 'given', 'insp', 'richard', 'shriker', 'know', 'profile', 'well', 'like', 'much', 'kersey', 'decided', 'set', 'war', 'franker', 'gang', 'for', 'summary', 'i', 'ok', 'death', 'wish', '3', 'due', 'explanation', 'i', 'typed', 'first', 'the', 'cast', 'action', 'good', 'extreme', 'even', 'scene', 'i', "n't", 'like', 'straightedge', 'extreme', 'violence', '

## <span style="color:teal">Convert text to indices and add paddings</span>

In [12]:
def make_vocabulary_dicts(preprocessed_data, pad_token='<PAD>', unk_token='<UNK>'):
    vocab = set()
    
    for review in preprocessed_data:
        for word in review:
            vocab.add(word)
    
            
    vocab_sorted = sorted(vocab)
    word2ind = {word : i+2 for i, word in enumerate(vocab_sorted)}
    ind2word = {i+2 : word for i, word in enumerate(vocab_sorted)}
    
    # Prepend the pad token
    word2ind[pad_token] = 0
    ind2word[0] = pad_token
    
    # Prepend the 'unknown' token
    word2ind[unk_token] = 1
    ind2word[1] = unk_token
    
    assert len(word2ind) == len(ind2word)

    
    return word2ind, ind2word

In [13]:
del train_x_raw, val_x_raw

In [14]:
word2ind, ind2word = make_vocabulary_dicts(train_words)

print(len(word2ind), len(ind2word))
print(word2ind['never'], word2ind['awful'])
print(ind2word[6700], ind2word[10582])

61697 61697
37527 4150
bonhomie closing


In [15]:
print(np.max([len(x) for x in train_words]))
print(np.mean([len(x) for x in train_words]))

print(np.max([len(x) for x in val_words]))
print(np.mean([len(x) for x in val_words]))

128
71.15276666666666
128
71.3938


In [16]:
def make_padded_inputs(preprocessed_data, 
                       vocab, 
                       padded_length=128,
                       pad_token='<PAD>',
                       unk_token='<UNK>'
                      ):
    
    num_lines = len(preprocessed_data)
    pad = vocab[pad_token]
    
    unpadded_lengths = np.zeros(num_lines, dtype='int64')
    
    inputs = np.full((num_lines, padded_length), pad)
    
    for i, review in enumerate(preprocessed_data):
        for j, word in enumerate(review):    
            inputs[i, j] = vocab.get(word, vocab[unk_token])
        unpadded_lengths[i] = j+1
            
    return inputs, unpadded_lengths
            

In [17]:
train_x, train_lengths = make_padded_inputs(train_words, word2ind)
val_x, val_lengths = make_padded_inputs(val_words, word2ind)


print(f"""Training example at index 10:\n{train_words[10]}\n
    Converted to indices:\n{train_x[10, :]}\n 
    Unpadded length of the example:\n{train_lengths[10]}""")

Training example at index 10:
['this', 'movie', 'came', 'huge', 'disappointment', 'the', 'anime', 'series', 'ended', 'relatively', 'stupid', 'plot', 'twist', 'rushed', 'introduction', 'pretty', 'lame', 'villain', 'i', 'expected', 'shamballa', 'tie', 'loose', 'ends', 'unfortunately', "n't", 'it', 'added', 'plot', 'holes', 'resolved', 'confused', 'clarified', 'the', 'animation', 'voice', 'acting', 'great', 'idiotic', 'plot', 'dull', 'setting', 'movie', "n't", 'even', 'take', 'place', 'dull', 'wwii', 'earth', 'rather', 'alchemy', 'world', 'disappointing', 'ending', 'ed', 'useless', 'rest', 'days', 'world', 'alchemy', 'ditches', 'winry', 'altogether', 'pretty', 'lackluster', 'do', 'favor', 'disregard', 'last', 'half', 'anime', 'well', 'movie', 'read', 'manga']

    Converted to indices:
[55008 36469  8312 26453 15389 54830  2659 48705 18004 45114 52767 41651
 56803 46979 28373 42635 31043 58915 26766 19107 48952 55243 32560 18028
 57518 36957 28693  1251 41651 25904 45601 11588 10331 54830

## <span style="color:teal">Load data into torch</span>

In [18]:
train_dataset = torch.utils.data.TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
val_dataset = torch.utils.data.TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))

In [19]:
batch_size = 32

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size)

## <span style="color:teal">Create model</span>

In [20]:
class SentimentClassifier(nn.Module):
    
    def __init__(self, 
                 vocab_size, 
                 d_feature, 
                 num_layers, 
                 hidden_size,
                 n_outputs,
                 bidirectional=False,
                 dropout_rate=0.9):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, d_feature)
        self.dropout = nn.Dropout(p=dropout_rate)
        self.lstm = nn.LSTM(input_size=d_feature,
                           hidden_size=hidden_size,
                           num_layers=num_layers,
                           bidirectional=bidirectional)
        self.fc = nn.Linear(hidden_size, n_outputs)
        self.logsoftmax = nn.LogSoftmax(dim=1)
        
        
    def forward(self, 
                input_data, 
                unpadded_lengths, 
                padding_value=0):
        
        embedded = self.dropout(self.embedding(input_data))
        packed_embedded = pack_padded_sequence(embedded,
                                               lengths=torch.from_numpy(unpadded_lengths),
                                               enforce_sorted=False)
        lstm_out, _ = self.lstm(packed_embedded)
        lstm_out, _ = pad_packed_sequence(lstm_out,
                                          padding_value=padding_value,
                                          total_length=128)
        fc = self.fc(lstm_out.view(len(input_data), -1))
        logsoftmax = self.logsoftmax(fc).view(batch_size, -1)
        
        
        return logsoftmax
    

In [21]:
vocab_size = len(word2ind)
d_feature = 128
hidden_size = 128
n_outputs = 2
num_layers = 2

model = SentimentClassifier(
                            vocab_size=vocab_size, 
                            d_feature=d_feature,  
                            num_layers=num_layers, 
                            hidden_size=hidden_size, 
                            n_outputs=n_outputs)

print(model)

SentimentClassifier(
  (embedding): Embedding(61697, 128)
  (dropout): Dropout(p=0.9, inplace=False)
  (lstm): LSTM(128, 128, num_layers=2)
  (fc): Linear(in_features=128, out_features=2, bias=True)
  (logsoftmax): LogSoftmax(dim=1)
)
