# <span style="color:turquoise">Text classification with PyTorch</span>


An example of natural language processing applied to sentiment analysis. <br> The notebook builds a binary classifier that predicts whether a movie review is positive or negative.




__Dataset:__ IMDB movie reviews from Kaggle<br>
__Model:__ bag-of-words + RNN(?)


### <span style="color:teal">Todo:</span>

- ~~Read dataset~~
- ~~Preprocess text~~
- ~~Split into train, validation, and test sets~~
- Convert text to indices and add padding
- Make model
- Make training function
- Make evaluation function
- Train
- Evaluate

In [1]:
import torch
import csv
import random
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")



## <span style="color:teal">Read the data and split it into training, validation, and test sets</span>

In [2]:
class Reviews():
    
    def __init__(self):
        self.train = {}
        self.val = {}
        self.test = {}
        self.LABELS = {"positive":1, "negative": 0}
        self.COUNT = {"positive": 0, "negative": 0}
    
    
    def read_data(self):
        
        dataset = []
        
        with open("IMDB_Dataset.csv", newline='') as f:
            datareader = csv.reader(f, delimiter=',')
            next(datareader, None)

            for row in datareader:
                dataset.append([row[0], self.LABELS[row[1]]])
                self.COUNT[row[1]] += 1
            
            random.shuffle(dataset)
                
        return dataset




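    def split_dataset(self,
                      dataset,
                      split=(int(50000*0.6), int(50000*0.2), int(50000*0.2))):
        
        # `split` holds the train/val/test sizes; they must sum to len(dataset).
        # The fixed generator seed makes the split itself reproducible.
        train, val, test = torch.utils.data.random_split(dataset,
                                               split,
                                               generator=torch.Generator().manual_seed(43))
            
        return train, val, test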
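
The default `split` assumes a dataset of exactly 50,000 rows. As a minimal sketch, assuming a hypothetical helper `split_lengths` that is not part of the notebook, the three lengths could instead be derived from the actual dataset size so that they always sum to `len(dataset)`, which `random_split` requires when it is given integer lengths:

```python
def split_lengths(n_items, fractions=(0.6, 0.2, 0.2)):
    """Turn fractions into integer lengths that sum exactly to n_items."""
    lengths = [int(n_items * f) for f in fractions]
    # Give any rounding remainder to the first (training) split.
    lengths[0] += n_items - sum(lengths)
    return lengths


print(split_lengths(50000))
print(split_lengths(1001))
```

The result could then be passed as the `split` argument, for example `rev.split_dataset(data, split=split_lengths(len(data)))`.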

In [3]:
rev = Reviews()
data = rev.read_data()
pos_count = rev.COUNT["positive"]
neg_count = rev.COUNT["negative"]


In [4]:
print(data[10])

['This version of "The Magic Flute" is not only the worst production of Mozart\'s great opera that I have ever seen, it is also the worst video production I have seen of any opera. This is not a movie version of "The Magic Flute." It is a filmed performance and it is not a good performance and it was not filmed very well. You can pick any other available DVD of this opera and I guarantee it will be better than this one. My preference is for the version conducted by James Levine with sets by David Hockney.', 0]


In [5]:
train, val, test = rev.split_dataset(data)
print(train[10])

["Over the years I've seen a bunch of these straight to video Segal movies, and every one holds the same amount of entertainment; unfortanetley, the entertainment level is at a low. Sure, the action sequences were amusing, but that was pretty much it. Seagal was really in his prime when he did movies like; Under Siege, Under Siege 2, and Executive Decision(at least on the action standpoint), but during the past ten years, these types of movies that star Segal really do not meet his past qualifications. On the more positive side, the movie did make good use of time, like some of the action sequences and use of wit. Just when the movie seemed to just drag on, a pretty cool action scene brought it up out of the gutter. I honestly believe that more of Segal's movies would do better if he wasn't the only one that fans recognize in the movie. Supporting actors and actresses are a very important thing, and if his current movies had this known supporting actors and actresses, maybe the movie w

In [6]:
print(len(train), len(val), len(test))

30000 10000 10000


In [7]:
def split_x_and_y(data):
    x = []
    y = []
    for review, label in data:
        x.append(review)
        y.append(label)
    return x, np.array(y)

In [9]:
train_x_raw, train_y = split_x_and_y(train)
val_x_raw, val_y = split_x_and_y(val)
test_x_raw, test_y = split_x_and_y(test)


print(len(train_x_raw), len(train_y))
print(train_x_raw[50], train_y[50])

30000 30000
Peter O'Toole is a treat to watch in roles where the lines he speaks are good and offer a chance for him to swagger in drunken stupor. The lovely Susannah York provides a good foil for O'Toole's dramatic presence. I saw the film twice over a period of 20 years--on both occasions with the name "Brotherly love". "Country dance" is a rather farcical and inappropriate title for this movie, wherever it was released as such. 1
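
`random_split` does not stratify by label, so it can be worth checking that each split keeps roughly the same share of positive reviews. The sketch below uses a hypothetical helper `positive_share` and made-up labels rather than the real data:

```python
from collections import Counter


def positive_share(labels, positive_label=1):
    """Return the fraction of labels equal to positive_label."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    return counts[positive_label] / total if total else 0.0


toy_labels = [1, 0, 1, 1, 0, 0, 1, 0]
print(positive_share(toy_labels))
```

On the notebook's data the same call would be `positive_share(train_y)`, and likewise for `val_y` and `test_y`.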


## <span style="color:teal">Preprocess text</span>

In [10]:
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

import re

In [11]:
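def preprocess(review,
               remove_stopwords=False, 
               remove_html=True, 
               remove_punct=False, 
               lowercase=False, 
               lemmatize=False,
               maxlen=128):
    
    # Replace a literal backslash-apostrophe with a plain apostrophe
    review = re.sub(r"\\'", "'", review)
    # Replace the stray \x96 character (an en dash in Windows-1252) with a hyphen
    review = re.sub(r"\x96", "-", review)
    
    if remove_html:
        # Match one tag at a time; a greedy '<.*>' would also delete the text between tags
        review = re.sub(r'<[^>]+>', ' ', review)
    
    review = word_tokenize(review)
        
    if remove_stopwords:
        stop_words = set(stopwords.words("english"))
        # Compare in lowercase so that capitalized stopwords ("This", "It") are removed too
        review = [w for w in review if w.lower() not in stop_words]
        
    if remove_punct:
        # Keep alphanumeric tokens and the contraction pieces produced by word_tokenize
        contractions = ["'ll", "'s", "n't", "'d", "'m", "'ve", "'re"]
        review = [w for w in review if w.isalnum() or w in contractions]
    
    if lowercase:
        review = [w.lower() for w in review]
        
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        review = [lemmatizer.lemmatize(w) for w in review]
    
    # Truncate to at most maxlen tokens
    return review[:maxlen]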
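
Reviews in this dataset can contain several HTML tags such as `<br />`. A greedy pattern like `<.*>` spans from the first `<` to the last `>` on a line, so it also removes the text between tags, whereas `<[^>]+>` removes one tag at a time. A small sketch on a made-up sentence (not taken from the dataset):

```python
import re

sample = "Great start.<br /><br />Weak middle.<br />Strong ending."

# Greedy: one match runs from the first '<' to the last '>'.
greedy = re.sub(r"<.*>", " ", sample)
# Per-tag: each match stops at the first '>' after its '<'.
per_tag = re.sub(r"<[^>]+>", " ", sample)

print(greedy)
print(per_tag)
```

The greedy result loses "Weak middle." entirely, while the per-tag result keeps all three sentences.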
    


In [12]:
train_x = [preprocess(review, 
                      lowercase=True, 
                      remove_punct=True,
                      remove_stopwords=True
                     ) 
           for review in train_x_raw]

val_x = [preprocess(review, 
                    lowercase=True, 
                    remove_punct=True,
                    remove_stopwords=True
                   ) 
         for review in val_x_raw]

In [14]:
print(train_x[9592], '\n', val_x[3029])
print(len(train_x[9592]), '\n', len(val_x[3029]))

['probably', 'worst', 'movie', 'i', 'ever', 'seen', 'it', 'cheesily', 'filmed', 'focus', 'even', 'supposed', 'real', 'crew', 'coming', 'hollywood', 'make', 'movie', 'no', 'cinematic', 'significance', 'whatsoever', 'i', 'could', 'take', 'back', 'almost', '1', 'hours', 'i', 'spent', 'watching', 'film', 'i', 'would', 'feel', 'much', 'better', 'worst', 'movie', 'i', 'ever', 'seen', 'positive', 'side', 'like', 'one', 'scene', 'visuals', "n't", 'bad', 'looking', 'do', 'rent'] 
 ['emma', 'favourite', 'jane', 'austen', 'novel', 'emma', 'despite', 'flaws', 'readers', 'forgive', 'love', 'relationship', 'mr', 'knightley', 'warm', 'familiar', 'respectful', 'playful', 'generating', 'warm', 'fuzzy', 'romantic', 'excitement', 'mr', 'knightley', 'perfect', 'man', 'emma', 'close', 'could', 'get', 'times', 'independent', 'clever', 'confident', 'woman', 'remember', '21', 'sure', 'matured', 'grown', 'flaws', 'who', "n't", 'want', 'emma', 'who', "n't", 'want', 'told', 'mr', 'knightley', 'this', 'version', 

## <span style="color:teal">Convert text to indices and add padding</span>

In [15]:
def make_vocabulary_dicts(preprocessed_data, pad_token='<PAD>', unk_token='<UNK>'):
    vocab = set()
    
    for review in preprocessed_data:
        for word in review:
            vocab.add(word)
    
            
    vocab_sorted = sorted(vocab)
    word2ind = {word : i+2 for i, word in enumerate(vocab_sorted)}
    ind2word = {i+2 : word for i, word in enumerate(vocab_sorted)}
    
    # Reserve index 0 for the pad token
    word2ind[pad_token] = 0
    ind2word[0] = pad_token
    
    # Reserve index 1 for the 'unknown' token
    word2ind[unk_token] = 1
    ind2word[1] = unk_token
    
    assert len(word2ind) == len(ind2word)

    
    return word2ind, ind2word
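
The vocabulary keeps every token seen in the training data, including one-off misspellings. One common variation, shown here only as a sketch with a hypothetical helper `frequent_words` and an assumed threshold `min_count`, keeps only words above a frequency cutoff so that rare words fall back to `<UNK>` at lookup time:

```python
from collections import Counter


def frequent_words(tokenized_reviews, min_count=2):
    """Return the sorted words that occur at least min_count times."""
    counts = Counter(word for review in tokenized_reviews for word in review)
    return sorted(word for word, n in counts.items() if n >= min_count)


toy_reviews = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "plot", "unfortanetley"],
]
print(frequent_words(toy_reviews))
```

The returned list could take the place of `vocab_sorted` when building `word2ind` and `ind2word`; whether a cutoff helps on this dataset is untested here.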

In [16]:
#del train_x_raw, val_x_raw

In [17]:
word2ind, ind2word = make_vocabulary_dicts(train_x)
print(len(word2ind), len(ind2word))
print(word2ind['never'], word2ind['awful'])
print(ind2word[6700], ind2word[10582])

61724 61724
37585 4171
bonbons clin


In [18]:
print(np.max([len(x) for x in train_x]))
print(np.mean([len(x) for x in train_x]))

print(np.max([len(x) for x in val_x]))
print(np.mean([len(x) for x in val_x]))

128
71.39476666666667
128
70.9034


In [19]:
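def make_padded_inputs(preprocessed_data, 
                       vocab, 
                       padded_length=128,
                       pad_token='<PAD>',
                       unk_token='<UNK>'
                      ):
    
    num_lines = len(preprocessed_data)
    pad = vocab[pad_token]
    unk = vocab[unk_token]
    
    # Start from an all-padding matrix and fill in the token indices
    inputs = np.full((num_lines, padded_length), pad)
    
    for i, review in enumerate(preprocessed_data):
        # Truncate so a review longer than padded_length cannot overflow its row
        for j, word in enumerate(review[:padded_length]):    
            inputs[i, j] = vocab.get(word, unk)
            
    # 1 marks a real token, 0 marks padding
    mask = np.where(inputs == pad, 0, 1)
            
    return inputs, mask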
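
To make the shapes concrete, here is a pure-Python sketch of the same idea on a toy vocabulary. The helper `pad_and_mask` and the toy data are illustrative assumptions, not part of the notebook:

```python
def pad_and_mask(token_lists, vocab, padded_length=5, pad_index=0, unk_index=1):
    """Map tokens to indices, right-pad each row, and build a 0/1 mask."""
    rows, masks = [], []
    for tokens in token_lists:
        # Unknown words map to unk_index; long inputs are truncated.
        ids = [vocab.get(tok, unk_index) for tok in tokens[:padded_length]]
        n_pad = padded_length - len(ids)
        rows.append(ids + [pad_index] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)
    return rows, masks


toy_vocab = {"<PAD>": 0, "<UNK>": 1, "bad": 2, "good": 3, "movie": 4}
rows, masks = pad_and_mask([["good", "movie"], ["bad", "unseen", "movie"]], toy_vocab)
print(rows)
print(masks)
```

Every row has the same length, and the word missing from the vocabulary ("unseen") is mapped to the `<UNK>` index while still counting as a real token in the mask.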
            

In [20]:
train_inputs, train_mask = make_padded_inputs(train_x, word2ind)
val_inputs, val_mask = make_padded_inputs(val_x, word2ind)


print(f"""Training example at index 10:\n{train_x[10]}\n
    Converted to indices:\n{train_inputs[10, :]}\n 
    Mask of the example:\n{train_mask[10, :]}""")

Training example at index 10:
['over', 'years', 'i', "'ve", 'seen', 'bunch', 'straight', 'video', 'segal', 'movies', 'every', 'one', 'holds', 'amount', 'entertainment', 'unfortanetley', 'entertainment', 'level', 'low', 'sure', 'action', 'sequences', 'amusing', 'pretty', 'much', 'seagal', 'really', 'prime', 'movies', 'like', 'under', 'siege', 'under', 'siege', '2', 'executive', 'decision', 'least', 'action', 'standpoint', 'past', 'ten', 'years', 'types', 'movies', 'star', 'segal', 'really', 'meet', 'past', 'qualifications', 'on', 'positive', 'side', 'movie', 'make', 'good', 'use', 'time', 'like', 'action', 'sequences', 'use', 'wit', 'just', 'movie', 'seemed', 'drag', 'pretty', 'cool', 'action', 'scene', 'brought', 'gutter', 'i', 'honestly', 'believe', 'segal', "'s", 'movies', 'would', 'better', "n't", 'one', 'fans', 'recognize', 'movie', 'supporting', 'actors', 'actresses', 'important', 'thing', 'current', 'movies', 'known', 'supporting', 'actors', 'actresses', 'maybe', 'movie', 'get', 