<div class="alert alert-danger">
**Due date:** 2017-01-27
</div>

# Lab 1: Text Classification

**Students:** Victor Tranell (victr593), Michael Sörsäter (micso554), Ludvig Noring (ludno249)

In this lab you will implement and compare the performance of two simple text classifiers: a Naive Bayes classifier and a classifier based on the averaged perceptron.

The data set that you will use in this lab is the [review polarity data set](https://www.cs.cornell.edu/people/pabo/movie-review-data/) first used by [Pang and Lee (2004)](http://www.aclweb.org/anthology/P04-1035). This data set consists of 2,000 movie reviews, each of which has been tagged as either positive or negative towards the movie at hand. The data is originally distributed as a collection of text files. For this lab we have put all files into two JSON files, one for training and one for testing.

## Introduction

Start by importing the module for this lab.

In [1]:
import nlp1

The next cell loads the training data and the test data:

In [2]:
training_data = nlp1.load_data("/home/TDDE09/labs/nlp1/review_polarity.train.json")
test_data = nlp1.load_data("/home/TDDE09/labs/nlp1/review_polarity.test.json")

As you will see, each data instance is a pair whose first component is a document, represented as a list of tokens, and whose second component is the gold-standard polarity of the review&nbsp;&ndash; either positive (`pos`) or negative (`neg`).

In [3]:
print(training_data[813])

(['this', 'film', 'is', 'extraordinarily', 'horrendous', 'and', "i'm", 'not', 'going', 'to', 'waste', 'any', 'more', 'words', 'on', 'it', '.'], 'neg')


The two classifiers that you will implement in this lab should inherit from the following class, whose only method `predict` takes a document and returns the predicted class for that document (here: the polarity).

In [4]:
class Classifier(object):

    def predict(self, d):
        return None

## Evaluation

The first thing that you will have to do is to implement a function

`accuracy(classifier, data)`

that computes the accuracy of a classifier on test data.

In [5]:
def accuracy(classifier, data):
    # Fraction of instances whose predicted label equals the gold label.
    return sum(classifier.predict(doc) == gold for doc, gold in data) / len(data)

You can test this function by computing the accuracy of a Naive Bayes classifier on the test data:

In [6]:
classifier = nlp1.NaiveBayesClassifier.train(training_data)
print(accuracy(classifier, test_data))

0.765


<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Provide your own implementation of the `accuracy()` function. Test your implementation by redoing the evaluation. You should get exactly the same results as before.
</div>
</div>

**Hint:** Using an appropriate function from the `statistics` module, this problem can be solved in a one-liner.
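
The hint can be illustrated with a small sketch, assuming that `statistics.mean` is the function it has in mind. `ConstantClassifier`, `accuracy_mean` and the toy data are hypothetical names made up for this example; they are not part of `nlp1`.

```python
from statistics import mean


class ConstantClassifier:
    """Hypothetical stand-in classifier that always predicts one label."""

    def __init__(self, label):
        self.label = label

    def predict(self, d):
        return self.label


def accuracy_mean(classifier, data):
    # mean() treats True as 1 and False as 0, so the mean of the
    # "prediction is correct" flags is the accuracy.
    return mean(classifier.predict(doc) == gold for doc, gold in data)


# Made-up data in the same (tokens, label) format as the lab data.
toy_data = [
    (["great", "film"], "pos"),
    (["awful", "plot"], "neg"),
    (["loved", "it"], "pos"),
    (["boring"], "neg"),
]

# Always predicting "pos" is right on two of the four toy reviews.
toy_accuracy = accuracy_mean(ConstantClassifier("pos"), toy_data)
print(toy_accuracy)
```

Whether this is preferable to `sum(...) / len(data)` is mostly a matter of taste. One difference is that `mean` raises `StatisticsError` on empty data, whereas the division raises `ZeroDivisionError`.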

## Naive Bayes classifier

To implement the Naive Bayes classifier, you should complete the following code:

In [7]:
from collections import Counter
from math import log
class MyNaiveBayesClassifier(Classifier):
    
    def __init__(self, pc, pp, pn):
        self.pc = pc
        self.pp = pp
        self.pn = pn

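    def predict(self, data):
        # Sum log-probabilities rather than multiplying probabilities,
        # which would underflow on long documents.
        posclass = log(self.pc[0])
        negclass = log(self.pc[1])
        for word in data:
            # Words that never occurred in the training data are skipped.
            if word in self.pp:
                posclass += log(self.pp[word])
                negclass += log(self.pn[word])
        return "pos" if posclass > negclass else "neg"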

    @classmethod
    def train(cls, data, k=1):
        # Count word occurrences separately for each class.
        posbag = Counter()
        negbag = Counter()
        poscount = 0
        for tokens, label in data:
            if label == 'pos':
                poscount += 1
                posbag.update(tokens)
            else:
                negbag.update(tokens)

        # The vocabulary is every word seen in either class.
        wordbag = set(posbag) | set(negbag)

        # Denominators for add-k smoothing: the number of tokens in the
        # class plus k for every word in the vocabulary.
        pos_denom = sum(posbag.values()) + k * len(wordbag)
        neg_denom = sum(negbag.values()) + k * len(wordbag)

        # A Counter returns 0 for unseen words, so no membership test is needed.
        Pposbag = {word: (posbag[word] + k) / pos_denom for word in wordbag}
        Pnegbag = {word: (negbag[word] + k) / neg_denom for word in wordbag}

        # Class priors: [P(pos), P(neg)].
        p_pos = poscount / len(data)
        pc = [p_pos, 1 - p_pos]

        return cls(pc, Pposbag, Pnegbag)

In this skeleton the method `predict()` should implement the Naive Bayes classification rule. The method `train()` should return a new classifier that has been trained on the specified training data using maximum likelihood estimation with add-$k$ smoothing.
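
To make the add-$k$ estimate concrete, here is a minimal sketch on a made-up three-word vocabulary. The function `add_k_estimates` and the toy data are hypothetical and only illustrate the formula (count + $k$) / (tokens in the class + $k$ · vocabulary size); they are not the lab's reference solution.

```python
from collections import Counter


def add_k_estimates(tokens, vocabulary, k=1):
    """Return add-k smoothed P(word | class) for every vocabulary word.

    `tokens` holds all tokens from the documents of one class.
    """
    counts = Counter(tokens)
    denominator = len(tokens) + k * len(vocabulary)
    # Counter returns 0 for words the class never used, so such words
    # still get the non-zero probability k / denominator.
    return {word: (counts[word] + k) / denominator for word in vocabulary}


toy_vocabulary = {"good", "bad", "film"}
toy_pos_tokens = ["good", "good", "film"]  # tokens of the positive class

toy_probs = add_k_estimates(toy_pos_tokens, toy_vocabulary, k=1)
for word in sorted(toy_probs):
    print(word, toy_probs[word])
```

With $k = 1$ the denominator is 3 + 3 = 6, so the unseen word "bad" gets probability 1/6 rather than 0, and the three estimates still sum to 1.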

To test your implementation, you can re-do the evaluation from above:

In [8]:
classifier1 = MyNaiveBayesClassifier.train(training_data)
print(accuracy(classifier1, test_data))

0.765


<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
Implement the two methods in `MyNaiveBayesClassifier`. Test your implementation by evaluating on the test data. Your results should be very similar to the ones that you got when you evaluated your accuracy function in Problem&nbsp;1.
</div>
</div>

## Averaged perceptron classifier

Here is the code skeleton for the averaged perceptron classifier:

In [9]:
class MyPerceptronClassifier(Classifier):
    
    def __init__(self, wp, wn):
        self.wp = wp
        self.wn = wn

    def predict(self, x):
        pos_score = 0
        neg_score = 0
        for word in x:
            if word in self.wp:
                pos_score += self.wp[word]
                neg_score += self.wn[word]
                
        return "pos" if pos_score >= neg_score else "neg"

    @classmethod
    def train(cls, data, n_epochs=1):
        # Start with zero weights for every word in the training data.
        wp = {}
        wn = {}
        for tokens, _ in data:
            for word in tokens:
                wp[word] = 0
                wn[word] = 0

        for e in range(n_epochs):
            for tokens, gold in data:
                pos_class = sum(wp[word] for word in tokens)
                neg_class = sum(wn[word] for word in tokens)
                pred = "pos" if pos_class >= neg_class else "neg"

                if pred != gold:
                    # Move the weights towards the gold class and away from
                    # the wrongly predicted class.
                    inc = -1 if pred == "pos" else 1
                    for word in tokens:
                        wp[word] += inc
                        wn[word] += -inc

        return cls(wp, wn)
    
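    @classmethod
    def train_avg(cls, data, n_epochs=1):
        # Current weights (wp, wn) and step-weighted sums of the updates
        # (acc_p, acc_n), all initialised to zero for every word in the data.
        wp = {}
        wn = {}
        acc_p = {}
        acc_n = {}
        for tokens, _ in data:
            for word in tokens:
                wp[word] = 0
                wn[word] = 0
                acc_p[word] = 0
                acc_n[word] = 0

        cnt = 1  # 1-based index of the current training step
        for e in range(n_epochs):
            for tokens, gold in data:
                pos_class = sum(wp[word] for word in tokens)
                neg_class = sum(wn[word] for word in tokens)
                pred = "pos" if pos_class >= neg_class else "neg"

                if pred != gold:
                    # Move the weights towards the gold class and away from
                    # the wrongly predicted class.
                    inc = -1 if pred == "pos" else 1
                    for word in tokens:
                        acc_p[word] += cnt * inc
                        acc_n[word] += cnt * -inc
                        wp[word] += inc
                        wn[word] += -inc

                cnt += 1

        # Subtracting acc / cnt from the final weights gives the average of
        # the weight vectors over all training steps (counting the initial
        # all-zero vector) without having to store them.
        for word in wp:
            wp[word] -= acc_p[word] / cnt
            wn[word] -= acc_n[word] / cnt

        return cls(wp, wn)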
    

In this skeleton, the method `predict()` should implement the perceptron classification rule. The method `train()` should return a new classifier that has been trained on the specified training data using averaged perceptron training for the specified number of epochs.
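
Averaging is usually implemented without storing every intermediate weight vector. The sketch below checks, on made-up data, that keeping a step-weighted sum of the updates gives the same result as summing the weights after every example. It is a simplified illustration, not the lab's reference solution: it assumes a single weight vector with labels `+1`/`-1` instead of one vector per class, and `averaged_perceptron_both_ways` and `toy_data` are hypothetical names.

```python
def averaged_perceptron_both_ways(data):
    """Train for one epoch and return (explicit_average, trick_average).

    `data` is a list of (tokens, label) pairs with label +1 or -1.
    """
    vocab = {word for tokens, _ in data for word in tokens}
    w = dict.fromkeys(vocab, 0)      # current weights
    total = dict.fromkeys(vocab, 0)  # sum of the weights after each example
    acc = dict.fromkeys(vocab, 0)    # step-weighted sum of the updates
    step = 1
    for tokens, label in data:
        score = sum(w[word] for word in tokens)
        pred = 1 if score >= 0 else -1
        if pred != label:
            for word in tokens:
                w[word] += label
                acc[word] += step * label
        for word in vocab:
            total[word] += w[word]
        step += 1
    # Dividing by `step` averages over the initial all-zero vector plus
    # the weight vector after each of the step - 1 examples.
    explicit = {word: total[word] / step for word in vocab}
    trick = {word: w[word] - acc[word] / step for word in vocab}
    return explicit, trick


toy_data = [
    (["good", "film"], 1),
    (["bad", "film"], -1),
    (["good", "plot"], 1),
    (["bad", "plot"], -1),
]

explicit, trick = averaged_perceptron_both_ways(toy_data)
for word in sorted(explicit):
    print(word, explicit[word], trick[word])
```

The explicit sum touches every weight after every example, while the accumulator version only touches the words of misclassified documents, which is why the latter is the common choice for large vocabularies.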

To test your implementation, as before you can train a classifier on the training data and evaluate it on the test data:

In [10]:
classifier2_1 = MyPerceptronClassifier.train(training_data, 1)
classifier2_2 = MyPerceptronClassifier.train(training_data, 2)
print("1 epoch    :", accuracy(classifier2_1, test_data))
print("2 epoch    :", accuracy(classifier2_2, test_data))

print()
classifier2_1 = MyPerceptronClassifier.train_avg(training_data, 1)
classifier2_2 = MyPerceptronClassifier.train_avg(training_data, 2)
print("1 epoch avg:", accuracy(classifier2_1, test_data))
print("2 epoch avg:", accuracy(classifier2_2, test_data))

1 epoch    : 0.64
2 epoch    : 0.75

1 epoch avg: 0.745
2 epoch avg: 0.79


<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
Implement the two methods in `MyPerceptronClassifier`. Test your implementation by evaluating on the test data. You should get results in the 70&ndash;80% range. What happens if you repeat the experiment but do not do averaging? What happens when you train the classifier for two epochs? Enter your results into the table below.
</div>
</div>

<table>
<tr><td></td><td>averaging</td><td>no averaging</td></tr>
<tr><td>1 epoch</td><td>0.745</td><td>0.64</td></tr>
<tr><td>2 epochs</td><td>0.79</td><td>0.75</td></tr>
</table>

## Switching to binary features

In the lab so far, a document is represented as a list of the words that occur in it. For sentiment classification, several authors have suggested that a *binary* document representation, where each word is represented only once, can produce better results. In the last problem you will try to confirm this finding.
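
As a small illustration of what the binary representation does to a single document, the sketch below removes repeated tokens while keeping the order of first occurrence. The helper `binarize_document` and the toy token list are hypothetical. Keeping the order is a choice made for readability here, not something the classifiers in this lab depend on.

```python
def binarize_document(tokens):
    # dict.fromkeys keeps one entry per distinct token, in the order in
    # which the tokens first appear.
    return list(dict.fromkeys(tokens))


toy_tokens = ["bad", ",", "bad", ",", "bad", "film"]
toy_binary = binarize_document(toy_tokens)
print(toy_binary)
```

After binarization, a word that a reviewer repeats three times contributes to the score once, just like a word used a single time.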

Your task is to implement a function `binarize()` that converts data into the binary representation:

In [11]:
def binarize(data):
    # Keep each distinct word once per document; the label is unchanged.
    return [(list(set(tokens)), label) for tokens, label in data]


The function is to be used in the following context:

In [12]:
binarized_training_data = binarize(training_data)
binarized_test_data = binarize(test_data)

classifier3 = MyNaiveBayesClassifier.train(binarized_training_data)
print(accuracy(classifier3, binarized_test_data))

classifier4 = MyPerceptronClassifier.train(binarized_training_data)
print(accuracy(classifier4, binarized_test_data))

0.795
0.845


<div class="panel panel-primary">
<div class="panel-heading">Problem 4</div>
<div class="panel-body">
Implement the `binarize()` function and run the evaluation. What do you observe? Summarise your results in one or two sentences.
</div>
</div>

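With the binarized data the accuracy improves for both classifiers: from 0.765 to 0.795 for Naive Bayes, and from 0.64 to 0.845 for the perceptron (one epoch, no averaging).
A likely explanation is that, because each word counts at most once per review, frequent but uninformative words no longer dominate, so "keywords" such as "tremendous" and "terrible" carry relatively more weight.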