# <span style="color:turquoise">Text classification with pytorch</span>


An example of using natural language processing for sentiment analysis. <br> Building a binary classifier of movie reviews that will predict if a review is positive or negative.




__Dataset:__ IMDB movie reviews from Kaggle<br>
__Model:__ bag-of-words + RNN(?)


### <span style="color:teal">Todo:</span>

- ~~Read dataset~~
- ~~Preprocess text~~
- ~~Split into train, validation, and test sets~~
- Convert text to indices and add paddings
- Make model
- Make training function
- Make evaluation function
- Train
- Evaluate

In [84]:
import torch
import csv
import random

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## <span style="color:teal">Read the data and split it into training, cross-validation, and test sets</span>

In [85]:
class Reviews():
    
    def __init__(self):
        self.train = {}
        self.val = {}
        self.test = {}
        self.LABELS = {"positive":1, "negative": 0}
        self.COUNT = {"positive": 0, "negative": 0}
    
    
    def read_data(self):
        
        dataset = []
        
        with open ("IMDB_Dataset.csv", newline='') as f:
            datareader = csv.reader(f, delimiter=',')
            next(datareader, None)

            for row in datareader:
                dataset.append([row[0], self.LABELS[row[1]]])
                self.COUNT[row[1]] += 1
            
            
                
        return dataset




    def split_dataset(self,
                      dataset,
                      split=[int(50000*0.6), int(50000*0.2), int(50000*0.2)]):
        
        train, val, test = torch.utils.data.random_split(dataset,
                                               split,
                                               generator=torch.Generator().manual_seed(43))
          
            
        return train, val, test

In [86]:
rev = Reviews()
data = rev.read_data()
pos_count = rev.COUNT["positive"]
neg_count = rev.COUNT["negative"]


In [87]:
print(data[10])

['Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines. For something similar but better try "Brother from another planet"', 0]


In [88]:
train, val, test = rev.split_dataset(data)
print(train[10])

['An art student in Rome is possessed...or something. She has dreams of being nailed to a cross and Satan himself raping her. He possesses her (I think) and turns her into a sex addict. That\'s about all I could take and I turned it off. A pointless "Exorcist" rip off. I caught this on cable back in the 80s and was horrified...and not in a good way! This movie is supposed to be a horror film but turns into nothing more than a sex film disguised as a horror movie. There\'s tons of pointless female nudity and the actress playing the lead has to degrade herself more than once. We see her being raped by Satan (a hot-looking guy), masturbating, coming on to her own father...Gotta give her points for bravery. Add to that bad dubbing, editing (the rape scene looks like it was cut a bit), lousy acting and a story that makes next to no sense. The one disturbing sequence (her being nailed to the cross) ALMOST works but the lousy "special" effects ruin it. This is one of the few horror film that 

In [90]:
print(len(train), len(val), len(test))

30000 10000 10000


In [91]:
def split_x_and_y(data):
    x = []
    y = []
    for review, label in data:
        x.append(review)
        y.append(label)
    return x, y

In [92]:
train_x, train_y = split_x_and_y(train)
val_x, val_y = split_x_and_y(val)
test_x, test_y = split_x_and_y(test)


print(len(train_x), len(train_y))
print(train_x[50], train_y[50])

30000 30000
And the worst part is that it could have been good. But something horribly wrong. First thing first, they should not have cast Amitabh Bachchan in this film at all. He is too much of an Icon to tackle such a delicate and controversial topic let alone the role itself.  The worst part of the movie is perhaps the subservient portrayal of the character of Bachchan's character's wife. Her role was so underwritten and ridiculously wooden that it's impossible to actually feel any pity or concern for her. I actually felt like reaching into the screen slapping her for not reacting like any normal woman would. Instead she just stood there looking Irritated and Helpless, as I imagine much of the viewers of this film might feel after watching this train-wreck of a film. Watch at your own risk. 0


## <span style="color:teal">Preprocess text</span>

In [93]:
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

import re

In [95]:
def preprocess(review,
               remove_stopwords=False, 
               remove_html=True, 
               remove_punct=False, 
               lowercase=False, 
               lemmatize=False):
    
    review = re.sub(r"\'", "'", review)
    review = re.sub(r"\x96", "-", review)
    
    if remove_html:
        review = re.sub(r'<.*>', ' ', review)
    
    review = word_tokenize(review)
        
    if remove_stopwords:
        stop_words = set(stopwords.words("english"))
        review = [w for w in review if w not in stop_words]
        
    if remove_punct:
        contractions = ["'ll", "'s", "n't", "'d", "'m", "'ve", "'re"]
        review = [w for w in review if w.isalnum() or w in contractions]
    
    if lowercase:
        review = [w.lower() for w in review]
        
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        review = [lemmatizer.lemmatize(w) for w in review]
    
    
    return review
    


## <span style="color:teal">Convert text to indices and add paddings</span>

In [96]:
def make_vocabulary_dicts(data, pad_token='<PAD>'):
    vocab = set()
    for review in data:
        words = preprocess(review, lowercase=True, remove_punct=True)
        for word in words:
            vocab.add(word)
            
    vocab_sorted = sorted(vocab)
    word2ind = {word : i for i, word in enumerate(vocab_sorted)}
    ind2word = {i : word for i, word in enumerate(vocab_sorted)}
    
    assert len(word2ind) == len(ind2word)
    
    #append the pad token
    word2ind[pad_token] = len(word2ind)
    ind2word[len(ind2word)] = pad_token

    
    return word2ind, ind2word

In [97]:
word2ind, ind2word = make_vocabulary_dicts(train_x)
print(len(word2ind), len(ind2word))
print(word2ind['never'], word2ind['awful'])
print(ind2word[6700], ind2word[10582])

64521 64521
39207 4353
blubber churned


In [74]:
def make_padded_inputs(reviews, vocabulary):
    pass