For this practical, you will write a perceptron algorithm from scratch. You
will initially need to understand how the perceptron algorithm works - so
read the Moodle workbook and lecture slides first. Some pseudocode is given here which are also
included in the lecture slides, however, you will still need to load a datafile,
extract features of your choice etc. (you learnt this in week 7). You can either
use the movie reviews dataset from the nltk (from nltk.corpus import
movie_reviews) or the dataset provided for your assignment.

Challenge 1:
Write an algorithm that creates a perceptron model:
1. Load training data
2. Extract features
3. Train the algorithm and update weights as appropriately.

In [65]:
import random
import math
from nltk.corpus import movie_reviews

In [10]:
reviews = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

In [24]:
words = []
for file_id in movie_reviews.fileids():
    review = movie_reviews.words(file_id)
    for word in review:
        words.append(word)

In [37]:
all_features = list(set(words))

In [44]:
start_weights = {word : random.uniform(-1, 1) for word in all_features}

In [51]:
def create_features(review):
    return list(set([word for word in review]))

In [47]:
def predict_one(weights, features):
    
    score = 0
    for feature in features:
        
        if feature in weights.keys():
            score += weights[feature]
    
    if score > 0:
        return 'pos'
    else:
        return 'neg'

In [54]:
def update_weights(weights, features, actual_label):
    
    if actual_label == 'pos':
        change = 1
    else:
        change = -1
        
    for feature in features:
        
        if feature in weights.keys():
            weights[feature] += change
        
    return weights

In [55]:
def train_perceptron(training_data, itterations):
    weights = start_weights
    
    for i in range(itterations):
        
        for review, actual_label in training_data:
            
            features = create_features(review)
            predicted_label = predict_one(weights, features)
            
            if predicted_label != actual_label:
                weights = update_weights(weights, features, actual_label)
    
    return weights

In [61]:
trained_weights = train_perceptron(reviews, 5)

In [62]:
trained_weights

{'deschanel': -0.7608930089604422,
 'mussenden': 0.6034019839630889,
 'users': -0.5883061368455418,
 'julianne': -0.5723286694114382,
 'fonda': -1.7899223571478438,
 'ronnie': 0.1453059904913654,
 'numeric': 0.4774403484967178,
 'fitzgerald': 0.2774361750554484,
 'dement': -0.8578027721299697,
 'noise': 1.5160655142051294,
 'indicated': -0.49429632962820524,
 'rimmer': 0.3804581587808973,
 'victor': 0.14111367704159883,
 'perfs': 0.4263790047499556,
 'womens': -0.9746895994220177,
 'forgives': 1.226622189373,
 'signal': -0.5213813258108742,
 'ia': 0.9593986237960088,
 'overheated': 0.5248082496539999,
 'gratuitous': -2.0918627788669015,
 'lovable': -1.5399492630288367,
 'moronic': 1.0469227412428812,
 'lizzy': -0.8187731114100858,
 'dread': -0.7406624555815127,
 'condon': 0.23447601043637278,
 'mesmerize': -0.7761419684610609,
 'thermo': 0.9245001623821276,
 'relic': 0.13964620649725945,
 'hairs': -0.040862112791691274,
 'dominates': -0.8441417215260538,
 'neglected': 1.803802638403501

Challenge 2:
Write a program which tests/evaluates your algorithm. You will need to use
the test set to make predictions, and then calculate the accuracy (i.e. the
percentage of correct predictions).

To streamline this process, write another function which iterates
through all the instances in the test set and returns (or prints all the
predictions)

Finally, calculate the accuracy.

In [95]:
len_reviews = len(reviews)
train_size = math.ceil(len_reviews*0.75)
test_size = len_reviews - train_size

train_index = random.sample(range(len_reviews), train_size)
test_index = list(set(range(len_reviews)) - set(train_index))

In [96]:
train = [reviews[i] for i in train_index]
test = [reviews[i] for i in test_index]

In [97]:
trained_weights = train_perceptron(train, 5)

In [98]:
def predict_all(weights, reviews):
    predicted_labels = []
    for review, label in reviews:
        features = create_features(review)
        predicted_label = predict_one(weights, features)
        predicted_labels.append(predicted_label)
    return predicted_labels

In [104]:
predicted_labels = predict_all(trained_weights, test)
actual_labels = [label for review, label in test]

In [105]:
n_correct = sum([predicted_labels[i] == actual_labels[i] for i in range(test_size)])
n_correct/test_size

0.96