#### Sentiment Analysis With a Single-Layer Perceptron

We start by importing our dependencies. We will use NLTK's `movie_reviews` corpus, an annotated dataset of movie reviews categorized as negative or positive. Our goal is a classifier that, given an unseen movie review, categorizes that review correctly. Because we treat this as a linearly separable problem, we can use the perceptron algorithm (a single neuron, the building block of neural nets).

In [1]:
import random

import nltk
from nltk.corpus import movie_reviews

# If the corpus is not installed yet, fetch it once with:
# nltk.download("movie_reviews")

Now we have to do some data preparation. We construct a tuple for every movie review containing a list with all the words in the review and the category (neg or pos). Then we shuffle the training data to make sure the ordering is random. We find the frequency distribution of all words in the movie reviews, and pick out the 2000 most common words as our feature set. 

In [2]:
training_data = []
for category in movie_reviews.categories():
    for file_id in movie_reviews.fileids(category):
        data_tuple = (list(movie_reviews.words(file_id)), category)
        training_data.append(data_tuple)

random.shuffle(training_data)

word_freqs = nltk.FreqDist(w.lower() for w in movie_reviews.words())

features = [w for (w,_) in word_freqs.most_common(2000)]
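To see what the feature selection step does, here is a minimal sketch on a toy corpus. It uses `collections.Counter` from the standard library as a stand-in for `nltk.FreqDist` (which also provides a `most_common` method); the three tiny "reviews" and the cut-off of 3 are made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: (list of words, category) tuples, shaped like training_data.
toy_data = [
    (["a", "great", "great", "film"], "pos"),
    (["a", "dull", "film"], "neg"),
    (["great", "acting", "a", "treat"], "pos"),
]

# Count every word across all reviews, lowercased as in the cell above.
toy_freqs = Counter(w.lower() for words, _ in toy_data for w in words)

# Keep only the words themselves from the 3 most common (word, count) pairs.
toy_features = [w for (w, _) in toy_freqs.most_common(3)]
print(toy_features)
```

Rare words such as "dull" or "treat" fall outside the feature set here, just as words beyond the top 2000 do in the real corpus.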


Now we need a numerical representation of the words so that our neuron can make sense of them. We use integers throughout: for the labels, 1 represents a positive review and -1 represents a negative review. We split the data so that only 3/4 of it (1500 reviews) is used to train the neuron; the remaining 1/4 is used to test the accuracy of our classifier. Each review becomes a list of 2000 integers, one per word in the 2000 most frequent words in our corpus: we append 1 if the word occurs in the review and -1 if it does not. These integer representations follow the same word order for every data point.

In [3]:
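training_data_int_rep = []

# The first 1500 shuffled reviews form the training set.
for i in range(0, 1500):
    data = training_data[i]
    words_in_review = set(data[0])  # set lookups are faster than scanning the word list
    int_list = []
    for word_f in features:
        if word_f in words_in_review:
            int_list.append(1)
        else:
            int_list.append(-1)
    training_data_int_rep.append((int_list, data[1]))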
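The encoding is easiest to see on a tiny example. The sketch below assumes a hypothetical helper `encode` and a made-up three-word vocabulary; it applies the same 1 / -1 rule as the loop above.

```python
def encode(review_words, features):
    """Return one integer per feature: 1 if the word occurs in the review, else -1."""
    present = set(review_words)
    return [1 if word in present else -1 for word in features]

# Hypothetical toy vocabulary and review, for illustration only.
toy_features = ["great", "dull", "film"]
toy_review = ["a", "great", "film"]

print(encode(toy_review, toy_features))  # [1, -1, 1]
```

Note that "a" is ignored because it is not in the toy vocabulary, and that the output always has exactly one entry per feature, in feature order.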
        

It's time to define our Neuron class with a simple perceptron algorithm. We initialize our neuron with a random weight for every connection from an input node to our neuron. Our learning constant scales how much each weight changes during the gradient-descent-style error correction. The neuron's mission is to map the pattern of our input vectors to the expected value in the binary output range (-1 or 1).

In [4]:
class Neuron:

    def __init__(self, number_of_weights):
        # One random starting weight in [-1, 1] per input.
        self.input_weights = [random.uniform(-1, 1) for _ in range(number_of_weights)]
        self.learning_constant = 0.01

    def learn(self, inputs, correct_output):
        # deviation is 0 for a correct prediction, otherwise +2 or -2.
        prediction = self.prediction(inputs)
        deviation = correct_output - prediction

        # Nudge each weight in the direction that reduces the error.
        for i in range(len(self.input_weights)):
            self.input_weights[i] += self.learning_constant * deviation * inputs[i]

    def prediction(self, inputs):
        # Weighted sum of the inputs, thresholded at 0.
        weighted_sum = 0
        for i in range(len(self.input_weights)):
            weighted_sum += inputs[i] * self.input_weights[i]

        if weighted_sum > 0:
            return 1
        return -1
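To make the update rule concrete, here is a single learning step worked through on a hypothetical three-input neuron. The starting weights and the input are made up; the learning constant 0.01 matches the class above.

```python
# Hypothetical starting weights and one training example labelled +1.
weights = [0.5, -0.4, 0.2]
inputs = [-1, 1, 1]
correct_output = 1
learning_constant = 0.01

# Forward pass: weighted sum (-0.5 - 0.4 + 0.2 = -0.7), thresholded at 0.
weighted_sum = sum(x * w for x, w in zip(inputs, weights))
prediction = 1 if weighted_sum > 0 else -1  # -1, so the neuron is wrong here

# Error correction: deviation is +2, so each weight moves by 0.02 * its input.
deviation = correct_output - prediction
weights = [w + learning_constant * deviation * x for w, x in zip(weights, inputs)]

print(prediction, deviation)           # -1 2
print([round(w, 2) for w in weights])  # [0.48, -0.38, 0.22]
```

When the prediction is already correct, the deviation is 0 and the weights stay unchanged, so only misclassified reviews move the neuron.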

2000 inputs enter our neuron with every review. For training we also pass in the expected output as an argument. We use 4 epochs (an epoch is one complete pass over the training set) to bump up our accuracy.

In [5]:
neuron = Neuron(2000)

epochs = 4
for i in range(0,epochs):
    for data in training_data_int_rep:
        if data[1] == "pos":
            correct_output = 1
        else: 
            correct_output = -1
        neuron.learn(data[0], correct_output)
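One way to check whether extra epochs help is to count the misclassified training examples in each pass. Below is a minimal sketch on a made-up, linearly separable toy set (the label is simply the first input). The data, the starting weights and the learning constant 0.5 are assumptions chosen so the arithmetic is easy to follow; they are not values from this notebook.

```python
def predict(weights, inputs):
    """Threshold the weighted sum at 0, mirroring Neuron.prediction."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > 0 else -1

# Hypothetical toy training set: (inputs, correct_output) pairs.
toy_training_data = [([1, 1], 1), ([1, -1], 1), ([-1, 1], -1), ([-1, -1], -1)]
weights = [-1.0, 0.5]
learning_constant = 0.5

mistakes_per_epoch = []
for epoch in range(4):
    mistakes = 0
    for inputs, correct_output in toy_training_data:
        deviation = correct_output - predict(weights, inputs)
        if deviation != 0:
            mistakes += 1
        # Same update rule as Neuron.learn; a zero deviation leaves weights unchanged.
        weights = [w + learning_constant * deviation * x for w, x in zip(weights, inputs)]
    mistakes_per_epoch.append(mistakes)

print(mistakes_per_epoch)  # [2, 0, 0, 0]
```

On this toy set the neuron stops making mistakes after the first epoch; on the real reviews the count would typically shrink more slowly.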

We can now test our classifier on our test set. The 500 reviews that were not used in the training process are used to test prediction accuracy.

In [6]:
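test_data_int_rep = []

# The remaining reviews (index 1500 onward) form the test set.
for i in range(1500, len(training_data)):
    data = training_data[i]
    words_in_review = set(data[0])  # set lookups are faster than scanning the word list
    int_list = []
    for word_f in features:
        if word_f in words_in_review:
            int_list.append(1)
        else:
            int_list.append(-1)
    test_data_int_rep.append((int_list, data[1]))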

In [7]:
correct_predictions = 0
for data in test_data_int_rep:
    if data[1] == "pos":
        correct_output = 1
    else: 
        correct_output = -1
    
    if neuron.prediction(data[0]) == correct_output:
        correct_predictions+=1

# Derive the test-set size from the data instead of hard-coding 500.
length_of_test_set = len(test_data_int_rep)
print("Accuracy: ", correct_predictions / length_of_test_set)

Accuracy:  0.798


This gives an accuracy of about 0.8 on the test set, despite our quite limited training data.
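A single accuracy figure does not show whether positive and negative reviews are classified equally well. As a rough sketch, the hypothetical helper `accuracy_report` below returns the overall accuracy together with the number of correct predictions per label; the predictions and labels are made up for illustration.

```python
from collections import Counter

def accuracy_report(predictions, labels):
    """Return overall accuracy and the number of correct predictions per label."""
    correct_per_label = Counter()
    for predicted, actual in zip(predictions, labels):
        if predicted == actual:
            correct_per_label[actual] += 1
    accuracy = sum(correct_per_label.values()) / len(labels)
    return accuracy, dict(correct_per_label)

# Hypothetical predictions and labels (1 = pos, -1 = neg), for illustration only.
toy_predictions = [1, 1, -1, -1, 1]
toy_labels = [1, -1, -1, -1, 1]

print(accuracy_report(toy_predictions, toy_labels))  # (0.8, {1: 2, -1: 2})
```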