# Neural networks from scratch - Sentiment analysis

In [1]:
import pandas as pd
import numpy as np

## The dataset

The IMDB dataset contains 50,000 polar movie reviews for natural language processing, evenly split between positive and negative sentiment.

In [2]:
INPUT_DIR = './'
with open(INPUT_DIR + 'IMDB Dataset.csv', 'r') as file:
    reviews = pd.read_csv(file)
reviews.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
reviews['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [4]:
reviews['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

### Preprocessing

- **Transforming to lowercase** so that `This`, `this`, and `THIS` are treated as the same word
- **Removing periods, commas, quotation marks, apostrophes (including the possessive `'s`), and `<br />` line breaks**

In [5]:
import re

reviews['review'] = reviews['review'].map(lambda review: re.sub(r'(\'s|[.,"\']|<br />)', '', review.lower()))

In [6]:
reviews['review'][0]

'one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictur

In [7]:
reviews['review'][1]

'a wonderful little production the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece the actors are extremely well chosen- michael sheen not only has got all the polari but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great master of comedy and his life the realism really comes home with the little things: the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell murals decorating every surface) are terribly well done'
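
Because every match is replaced with an empty string, words on either side of a `<br />` or of consecutive periods end up glued together (`methe`, `wordit`, `manyaryans` in the first review). The sketch below is one possible variant, not the preprocessing used in the rest of this notebook: the function name `preprocess_keep_word_boundaries` is hypothetical, and replacing periods and commas with a space is an assumption about the desired behaviour.

```python
import re

def preprocess_keep_word_boundaries(review):
    """Hypothetical variant of the preprocessing step that keeps words apart."""
    text = review.lower()
    # Replace HTML line breaks with a space instead of deleting them
    text = re.sub(r'<br\s*/?>', ' ', text)
    # Drop the possessive 's and quote characters, as the notebook does
    text = re.sub(r"'s|[\"']", '', text)
    # Assumption: periods and commas become spaces so neighbours stay separate
    text = re.sub(r'[.,]', ' ', text)
    # Collapse the runs of whitespace left behind by the substitutions
    return re.sub(r'\s+', ' ', text).strip()

print(preprocess_keep_word_boundaries("Happened with me.<br /><br />The first thing"))
print(preprocess_keep_word_boundaries("home to many..Aryans, Muslims"))
```

Switching to a variant like this would change the vocabulary, so the counts and accuracies reported below would no longer apply as printed.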

#### Analysis

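For each word, compute the ratio of its occurrences in positive reviews to its occurrences in negative reviews, then take the logarithm so that positive and negative polarities fall on a comparable scale centered on zero.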

In [8]:
from collections import Counter

positive_counter = Counter()
negative_counter = Counter()
total_counter = Counter()
for review, sentiment in zip(reviews['review'], reviews['sentiment']):
    for word in review.split(' '):
        total_counter[word] += 1
        if sentiment == 'positive':
            positive_counter[word] += 1
        else:
            negative_counter[word] += 1

In [9]:
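MINIMUM_WORD_COUNT = 10  # ignore rare words, whose ratios would be noisy
positive2negative_ratios = Counter()
for word, count in total_counter.most_common():
    if count >= MINIMUM_WORD_COUNT:
        # Adding 1 to the denominator avoids division by zero
        positive2negative_ratios[word] = positive_counter[word] / (negative_counter[word] + 1)

# Take the log so that positive and negative polarities are roughly symmetric around 0
for word, ratio in positive2negative_ratios.most_common():
    if ratio > 1:
        positive2negative_ratios[word] = np.log(ratio)
    else:
        # The small constant keeps the log finite when the ratio is 0
        positive2negative_ratios[word] = -np.log((1 / (ratio + 0.0001)))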

In [10]:
def print_positive_to_negative_ratio(word):
    print(f'Positive to negative ratio for {word} = {positive2negative_ratios[word]}')

print_positive_to_negative_ratio('very')
print_positive_to_negative_ratio('wonderful')
print_positive_to_negative_ratio('worst')

Positive to negative ratio for very = 0.3391205133385867
Positive to negative ratio for wonderful = 1.560890256170682
Positive to negative ratio for worst = -2.396373419560346
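
To see why the log makes the two directions comparable, the sketch below applies the same formula to made-up counts. The function name `log_polarity` and the counts are illustrative and do not come from the dataset.

```python
import math

def log_polarity(positive_count, negative_count):
    """Apply the notebook's ratio-then-log formula to one word's counts."""
    # +1 in the denominator avoids division by zero
    ratio = positive_count / (negative_count + 1)
    if ratio > 1:
        return math.log(ratio)
    # The small constant keeps the log finite when the ratio is 0
    return -math.log(1 / (ratio + 0.0001))

# Hypothetical counts: one word seen 3x more often in positive reviews,
# another seen 3x more often in negative reviews
mostly_positive = log_polarity(30, 9)    # ratio = 30 / 10 = 3
mostly_negative = log_polarity(10, 29)   # ratio = 10 / 30 = 1/3
print(mostly_positive, mostly_negative)
```

Raw ratios of 3 and 1/3 sit at very different distances from the neutral value 1, whereas their logs are close to +1.1 and -1.1, so the magnitudes can be compared directly regardless of sign.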


## Multilayer perceptron

In [11]:
import numpy as np
import re
from collections import Counter

class SentimentAnalysisNNet:
    def __init__(self, reviews, sentiments, hidden_nodes=10, output_nodes=1, min_word_count=10, polarity_threshold=0.1):
        positive_counter = Counter()
        negative_counter = Counter()
        total_counter = Counter()
        for review, sentiment in zip(reviews, sentiments):
            for word in review.split(' '):
                total_counter[word] += 1
                if sentiment == 'positive':
                    positive_counter[word] += 1
                else:
                    negative_counter[word] += 1

        positive2negative_ratios = Counter()
        for word, count in total_counter.most_common():
            if count >= min_word_count:
                positive2negative_ratios[word] = positive_counter[word] / (negative_counter[word] + 1)

        for word, ratio in positive2negative_ratios.most_common():
            if ratio > 1:
                positive2negative_ratios[word] = np.log(ratio)
            else: 
                positive2negative_ratios[word] = -np.log((1 / (ratio + 0.0001)))
        
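        # Keep frequent words whose log-ratio is at least polarity_threshold.
        # Every word with count >= min_word_count has an entry in
        # positive2negative_ratios, so no separate membership check is needed.
        # Note: this keeps only positively polarized words; comparing
        # abs(ratio) instead would also retain strongly negative words.
        self.vocabulary = [
            word for word in total_counter
            if total_counter[word] >= min_word_count
            and positive2negative_ratios[word] >= polarity_threshold
        ]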
        
        self.word2index = {}
        for i, word in enumerate(self.vocabulary):
            self.word2index[word] = i

        self.label2index = {
            'negative': 0,
            'positive': 1
        }

        self.weights = [
            np.zeros((len(self.vocabulary), hidden_nodes)),
            # np.random.normal(0.0, hidden_nodes**-0.5, (len(self.vocabulary), hidden_nodes)),
            np.random.normal(0.0, output_nodes**-0.5, (hidden_nodes, output_nodes))
        ]
        self.hidden_layer = np.zeros((1, hidden_nodes))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    
    def sigmoid_prime(self, output):
        return output * (1 - output)

    def preprocess(self, review):
        return re.sub(r'(\'s|[.,"\']|<br />)', '', review.lower())
    
    def train(self, reviews, sentiments, epochs=1, learning_rate=0.1):
        print("Training...")
        X = []
        for review in reviews:
            idxs = set()
            for word in review.split(' '):
                if word in self.word2index:
                    idxs.add(self.word2index[word])
            X.append(idxs)
        y = [self.label2index[sentiment] for sentiment in sentiments]
        X_size = len(X)
        assert(X_size == len(y))
        
        for epoch in range(epochs):
            correct_counter = 0
            for i, (review, sentiment) in enumerate(zip(X, y)):
                # FEEDFORWARD
                # Hidden layer
                #   input is whether a word is present (binary)
                #   no activation function is used in the hidden layer
                self.hidden_layer *= 0
                for index in review:
                    self.hidden_layer += self.weights[0][index]
                # Output layer
                output = self.sigmoid(self.hidden_layer.dot(self.weights[1]))

                # BACKPROPAGATION
                # Calculate error
                deltas = []
                deltas.append((output - sentiment)  * self.sigmoid_prime(output))
                deltas.append(deltas[0].dot(self.weights[1].T))
                # Update the weights
                self.weights[1] -= self.hidden_layer.T.dot(deltas[0]) * learning_rate
                for index in review:
                    self.weights[0][index] -= deltas[1][0] * learning_rate

                if output >= 0.5 and sentiment == self.label2index['positive']:
                    correct_counter += 1
                elif output < 0.5 and sentiment == self.label2index['negative']:
                    correct_counter += 1

                # max(1, ...) avoids a division by zero when there are fewer than 10 reviews
                if i != 0 and i % max(1, X_size // 10) == 0:
                    print("Epoch: {}\tProgress: {:.0f}%\tAccuracy: {:.2f}".format(epoch+1, 100 * i/float(X_size), correct_counter * 100 / float(i+1)))

            print("Epoch: {}\tProgress: {:.0f}%\tAccuracy: {:.2f}\n".format(epoch+1, 100 * i/float(X_size), correct_counter * 100 / float(i+1)))
    
    def test(self, reviews, sentiments):
        print("Testing...")
        assert(len(reviews) == len(sentiments))

        correct_counter = 0
        for i, (review, sentiment) in enumerate(zip(reviews, sentiments)):
            output = self.run(review)
            if output == sentiment:
                correct_counter += 1

        print("\nAccuracy: {:.2f}".format(correct_counter * 100 / float(i+1)))
    
    def run(self, review):
        idxs = set()
        for word in self.preprocess(review).split(' '):
            if word in self.word2index:
                idxs.add(self.word2index[word])
        self.hidden_layer *= 0
        for index in idxs:
            self.hidden_layer += self.weights[0][index]
        output = self.sigmoid(self.hidden_layer.dot(self.weights[1]))
        
        return 'positive' if output >= 0.5 else 'negative'
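
The loop that sums rows of `self.weights[0]` is a shortcut for multiplying a binary bag-of-words vector by the first weight matrix. The following minimal sketch, with made-up sizes and random weights (none of these values come from the notebook), checks that the two approaches give the same hidden layer:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_nodes = 6, 3          # illustrative sizes
W = rng.normal(size=(vocab_size, hidden_nodes))
present = {1, 4}                         # indices of words found in a review

# Dense version: binary input vector times the weight matrix
dense_input = np.zeros((1, vocab_size))
dense_input[0, list(present)] = 1
hidden_dense = dense_input.dot(W)

# Sparse version: add up only the rows of the words that are present
hidden_sparse = np.zeros((1, hidden_nodes))
for index in present:
    hidden_sparse += W[index]

print(np.allclose(hidden_dense, hidden_sparse))
```

Since a review contains only a small fraction of the vocabulary, summing the selected rows skips all the multiplications by zero that the dense product would perform.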


In [13]:
TRAIN_TEST_SPLIT = 40000
X = list(reviews['review'])
y = list(reviews['sentiment'])

mlp = SentimentAnalysisNNet(X[:TRAIN_TEST_SPLIT], y[:TRAIN_TEST_SPLIT], hidden_nodes=10)
mlp.train(X[:TRAIN_TEST_SPLIT], y[:TRAIN_TEST_SPLIT], epochs=2, learning_rate=.01)
mlp.test(X[TRAIN_TEST_SPLIT:], y[TRAIN_TEST_SPLIT:])

Training...
Epoch: 1	Progress: 10%	Accuracy: 68.83
Epoch: 1	Progress: 20%	Accuracy: 70.59
Epoch: 1	Progress: 30%	Accuracy: 71.24
Epoch: 1	Progress: 40%	Accuracy: 71.75
Epoch: 1	Progress: 50%	Accuracy: 72.18
Epoch: 1	Progress: 60%	Accuracy: 72.76
Epoch: 1	Progress: 70%	Accuracy: 73.35
Epoch: 1	Progress: 80%	Accuracy: 73.56
Epoch: 1	Progress: 90%	Accuracy: 73.79
Epoch: 1	Progress: 100%	Accuracy: 73.99

Epoch: 2	Progress: 10%	Accuracy: 79.16
Epoch: 2	Progress: 20%	Accuracy: 78.90
Epoch: 2	Progress: 30%	Accuracy: 78.56
Epoch: 2	Progress: 40%	Accuracy: 78.54
Epoch: 2	Progress: 50%	Accuracy: 78.44
Epoch: 2	Progress: 60%	Accuracy: 78.61
Epoch: 2	Progress: 70%	Accuracy: 78.65
Epoch: 2	Progress: 80%	Accuracy: 78.54
Epoch: 2	Progress: 90%	Accuracy: 78.54
Epoch: 2	Progress: 100%	Accuracy: 78.57

Testing...

Accuracy: 76.37


## Further steps

- Improve preprocessing with NLP techniques such as lemmatization or stemming
- Add several hidden layers
- Use cross-entropy instead of the sum of squared errors as the error function
- Introduce non-linearity in the hidden layers with an activation function
- Try other weight initialization methods
- Use a more complex architecture, such as a recurrent neural network (RNN)
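
As a rough illustration of the cross-entropy item, the sketch below compares the output-layer delta under the two error functions for a sigmoid output. The helper names are hypothetical, and the squared-error delta assumes the loss `0.5 * (output - target)**2`, which matches the delta computed in `train`.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def delta_squared_error(output, target):
    # Gradient of 0.5 * (output - target)**2 w.r.t. the pre-activation
    return (output - target) * output * (1 - output)

def delta_cross_entropy(output, target):
    # Gradient of binary cross-entropy w.r.t. the pre-activation:
    # the sigmoid derivative cancels out
    return output - target

# A confidently wrong prediction: the target is 0 but the output is close to 1
output, target = sigmoid(5.0), 0
print(delta_squared_error(output, target))
print(delta_cross_entropy(output, target))
```

With squared error, the `output * (1 - output)` factor shrinks the delta when the sigmoid saturates, even if the prediction is wrong; with cross-entropy the delta stays proportional to the error itself.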