<div class="alert alert-danger">
**Due date:** 2018-01-26
</div>

# L1: Text classification

In this lab you will implement and compare the performance of two simple text classifiers: a Naive Bayes classifier and a classifier based on the averaged perceptron. Both of these classifiers are presented in the lecture.

The data set that you will be using in this lab is the [review polarity data set](https://www.cs.cornell.edu/people/pabo/movie-review-data/) first used by [Pang and Lee (2004)](http://www.aclweb.org/anthology/P04-1035). It consists of 2,000 movie reviews, each of which has been tagged as either positive or negative towards the movie at hand. The distribution of the two classes is 50/50.

## Introduction

Start by importing the module for this lab.

In [2]:
import nlp1
import random
import math
from collections import defaultdict
from copy import copy

The next cell loads the training data and the test data:

In [3]:
training_data = nlp1.load_data("/home/TDDE09/labs/l1/data/review_polarity.train.json")
test_data = nlp1.load_data("/home/TDDE09/labs/l1/data/review_polarity.test.json")

As you will see, each data instance is a pair whose first component is a document, represented as a list of words (strings), and whose second component is the gold-standard polarity of the review (either positive `pos` or negative `neg`), represented as a string.

In [4]:
print(training_data[813])

(['this', 'film', 'is', 'extraordinarily', 'horrendous', 'and', "i'm", 'not', 'going', 'to', 'waste', 'any', 'more', 'words', 'on', 'it', '.'], 'neg')


## Evaluation

The first thing that you will have to do is to implement a function

`accuracy(classifier, data)`

that computes the accuracy of a classifier on reference data of the form described above. In this context, a *classifier* is an object with a method `predict` that takes a document $x$ as its input and returns the predicted class for&nbsp;$x$.

In [5]:
def accuracy(classifier, data):
    """Computes the accuracy of a classifier on reference data.

    Args:
        classifier: A classifier.
        data: Reference data.

    Returns:
        The accuracy of the classifier on the test data, a float.
    """
    correct = 0
    for x, y in data:
        pred = classifier.predict(x)
        if pred == y:
            correct += 1
    return correct / len(data)

You can test your function by computing the accuracy of a Naive Bayes classifier on the test data:

In [6]:
classifier1 = nlp1.NaiveBayes.train(training_data)
print(accuracy(classifier1, test_data))

0.765


<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Provide your own implementation of the `accuracy()` function in the code cell above. Test your implementation by redoing the evaluation. You should get exactly the same result as before.
</div>
</div>
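
One quick way to sanity-check an accuracy function before running it on the real data is to apply it to a classifier whose behaviour is known in advance. The sketch below assumes a hypothetical `ConstantClassifier` helper and a hand-made `toy_data` list (neither is part of `nlp1`); it repeats a compact version of `accuracy()` only so that the cell runs on its own.

```python
class ConstantClassifier:
    """Hypothetical helper (not part of nlp1): always predicts one fixed class."""

    def __init__(self, label):
        self.label = label

    def predict(self, x):
        # The document x is ignored on purpose.
        return self.label


def accuracy(classifier, data):
    """Fraction of (document, gold class) pairs that are predicted correctly."""
    correct = sum(1 for x, y in data if classifier.predict(x) == y)
    return correct / len(data)


# Hand-made toy data in the same (document, class) format as the lab data.
toy_data = [
    (["great", "film"], "pos"),
    (["awful", "film"], "neg"),
    (["loved", "it"], "pos"),
    (["boring"], "neg"),
]

# Two of the four toy documents are positive.
print(accuracy(ConstantClassifier("pos"), toy_data))
```

Since the real data set is balanced 50/50, a constant classifier would be expected to land close to 0.5 there as well, which gives a baseline for the other results.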

## Naive Bayes classifier

To implement the Naive Bayes classifier, you can start from the following code skeleton:

In [7]:
def get_vocabulary(data):
    V = set([word for art,c in data for word in art])
    return V

class NaiveBayes(object):

    def __init__(self):
        """Initialises a new classifier."""
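        # Prior probability of each class
        self.class_prob = {}
        # Vocabulary: the set of all words seen in the training data
        self.V = set()
        # Smoothed word probabilities for each class
        self.word_prob = {}
        # Smoothed probability, per class, of vocabulary words unseen in that class
        self.default_word_prob = {}
        # Training documents grouped by class
        self.class_dict = {}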

    def predict(self, x):
        """Predicts the class for a document.

        Args:
            x: A document, represented as a list of words.

        Returns:
            The predicted class, represented as a string.
        """
        current_prediction = {'c':None,'prob':-float("inf")}
        for c in self.class_prob:
            prob = math.log(self.class_prob[c])
            for word in x:
                if word in self.word_prob[c]:
                    prob += math.log(self.word_prob[c][word])
                # If word was not encountered during training for class c but is
                # in vocabulary => set to default value
                elif word in self.V:
                    prob += math.log(self.default_word_prob[c])
            if prob > current_prediction['prob']:
                current_prediction = {'c':c,'prob':prob}
        return current_prediction['c']

    
    def calculate_prior(self,data):
        
        for c in self.class_dict:
            ccount = len(self.class_dict[c])
            self.class_prob[c] = ccount/len(data)
            
    def get_class_dict(self,data):
        unique_c = set([c for x,c in data])
        for c in unique_c:
            samples = [x for x,c_ in data if c_ == c]
            self.class_dict[c] = samples
            
        
    def calculate_word_prob(self,data,k):
        self.default_word_prob = {}
        for c in self.class_dict:
            # Count of words for sentences with class c
            word_count = defaultdict(int)
            # Total count of words in class c
            total_count = 0
            for sample in self.class_dict[c]:
                # One pass per document; same counts as calling
                # sample.count(word) for every unique word, but much faster
                for word in sample:
                    word_count[word] += 1
                total_count += len(sample)
                
            # For words in test set not encountered for class c during training
            self.default_word_prob[c] = k / (total_count + len(self.V)*k)
            
            # Calculate probabilities of words given class c
            self.word_prob[c] = {}
            for word in word_count:
                self.word_prob[c][word] = (word_count[word]+k)/(total_count + len(self.V)*k)
    
    def train(self, data, k=1):
        """Trains this classifier on training data using maximum
        likelihood estimation and additive smoothing.

        Args:
            data: Training data.
            k: The smoothing constant.
        """
        
        self.V = get_vocabulary(data)
        self.get_class_dict(data)
        self.calculate_prior(data)
        self.calculate_word_prob(data,k)
        
    
   

Your implementation should meet the following requirements:

### Number of classes

Your implementation should support classification problems with an arbitrary number of classes. In particular, you should not hardwire the two classes used in the specific data set for this problem (`pos` and `neg`).

### Vocabulary

Your implementation should support the dynamic creation of the classifier&rsquo;s vocabulary from the training data. The vocabulary of the trained classifier should be the set of all words that occur in the training data.

### Use log probabilities

While the mathematical model of the Naive Bayes classifier is specified in terms of probabilities, for the implementation you should use log probabilities.
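
The reason for this requirement can be illustrated with a small experiment. The sketch below uses made-up numbers (a word probability of `1e-4` and a document of 100 words, both assumptions chosen for illustration) to compare multiplying probabilities with summing their logarithms.

```python
import math

# Assumed toy values: 100 words, each with probability 1e-4 under some class.
word_prob = 1e-4
n_words = 100

# Multiplying the raw probabilities underflows to exactly 0.0 in floating point.
product = 1.0
for _ in range(n_words):
    product *= word_prob

# Summing log probabilities stays well within the floating-point range.
log_sum = sum(math.log(word_prob) for _ in range(n_words))

print(product)
print(log_sum)
```

Once the product has underflowed for every class, the classes can no longer be compared, whereas the log scores still can.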

### Test your implementation

Test your implementation by evaluating on the test data:

In [8]:
classifier2 = NaiveBayes()
classifier2.train(training_data,1)


print(accuracy(classifier2, test_data))

KeyboardInterrupt: 

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
Finish the implementation of the `NaiveBayes` class. Test your implementation by evaluating on the test data. When choosing the smoothing constant as $k=1$, you should get exactly the same results as in Problem&nbsp;1. What happens when you experiment with different values for the smoothing constant? Report your results and provide a short discussion in the text cell below.
</div>
</div>


* Without smoothing, i.e. k=0, we get a ValueError because the prediction attempts to calculate log(0).
* Gradually decreasing k from 1 gradually improves accuracy down to and including k=0.5, which gives an accuracy of 0.78. Decreasing k further makes the predictions worse again.
* Values larger than 1 give gradually worse accuracy. This means that the ideal smoothing constant for our classifier on this test data is around 0.5.
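
To make the role of the smoothing constant more concrete, here is a minimal sketch of the additive-smoothing estimate used in `calculate_word_prob`, applied to made-up counts. The function name `smoothed_prob` and the toy counts are assumptions for illustration only.

```python
def smoothed_prob(count, total, vocab_size, k):
    """Additive smoothing: (count + k) / (total + k * |V|)."""
    return (count + k) / (total + k * vocab_size)


# Hypothetical word counts for a single class; "plot" never occurs in it.
counts = {"good": 3, "bad": 1, "plot": 0}
total = sum(counts.values())
vocab_size = len(counts)

for k in (0, 0.5, 1, 10):
    probs = {w: smoothed_prob(c, total, vocab_size, k) for w, c in counts.items()}
    print(k, probs)
```

With k=0 the unseen word gets probability zero, which is what triggers the log(0) error. As k grows, the estimates move towards a uniform distribution over the vocabulary, so the evidence from the actual counts is gradually washed out.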

## Averaged perceptron classifier

Here is the code skeleton for the averaged perceptron classifier:

In [None]:
class Perceptron(object):

    def __init__(self):
        """Initialises a new classifier."""
        self.weights = {}
        self.acc = {}

    def predict(self, x):
        """Predicts the class for a document.

        Args:
            x: A document, represented as a list of words.

        Returns:
            The predicted class, represented as a string.
        """
        pred = {'c': None,'activation': -float("inf")}
        for c in self.weights:
            out = sum([self.weights[c][word] for word in x if word in self.weights[c]])
            if out > pred['activation']:
                pred = {'c': c,'activation': out}
            # Currently prefers classes that occur later in alphabet (same as reference model)
            elif out == pred['activation'] and c > pred['c']:
                pred = {'c': c,'activation': out}
        return pred['c']    

    def update(self, x, y):
        """Updates the weight vectors with a single training instance.

        Args:
            x: A document, represented as a list of words.
            y: The gold-standard class, represented as a string.

        Returns:
            The predicted class, represented as a string.
        """
        p = self.predict(x)       
        if  p != y:
            for word in x:
                self.acc[p][word] -= self.count * 1
                self.acc[y][word] += self.count * 1
                self.weights[p][word] -= 1
                self.weights[y][word] += 1
        self.count += 1     
        return p
    
    def init_weights(self, X, y):        
        for c in set(y):
            self.weights[c] = defaultdict(int)
            self.acc[c] = defaultdict(int)
            #for word in self.V:
            #    self.weights[c][word] = 0
            #    self.acc[c][word] = 0

    def train(self, data, n_epochs=1):
        """Trains this classifier on training data using the averaged
        perceptron learning algorithm.

        Args:
            data: Training data.
            n_epochs: The number of training epochs.
        """
        self.V = list(get_vocabulary(data))
        X = [x for x, _ in data]
        y = [y for _, y in data]

        self.count = 1
        self.init_weights(X, y)
        for epoch in range(n_epochs):
            for x, c in zip(X, y): 
                self.update(x, c)
        for c in self.weights:
            for word in self.weights[c]:
                self.weights[c][word] -= self.acc[c][word] / self.count
        

Your implementation should meet the following requirements:

### Number of classes

As in the case of the Naive Bayes classifier, your implementation of the multi-class perceptron should support classification problems with an arbitrary number of classes, not just the two classes from the review data set.

### Features

As the features for your classifier, you should use all words that occur in the training data (bag-of-words features). The weight of a feature should be the number of times the corresponding word occurs in the document.

### Vector operations

To implement the perceptron, you will have to translate between the mathematical model (which is formulated in terms of vectors) and the implementation in Python that was suggested in the lecture, where feature vectors are represented as lists and weight vectors are represented as dictionaries. In particular, you will have to think about how to implement the relevant vector operations on this representation.
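
As a rough illustration of what these vector operations can look like, the sketch below implements a dot product and a scaled vector addition on the list/dictionary representation. The helper names `dot` and `add_scaled` are assumptions made for this example and are not prescribed by the lab.

```python
from collections import Counter, defaultdict


def dot(weights, doc):
    """Dot product of a weight dict with a document given as a list of words.

    Every occurrence of a word contributes its weight once, so a word that
    occurs n times contributes n * weight, matching count-valued features.
    """
    return sum(weights.get(word, 0) for word in doc)


def add_scaled(weights, doc, scale):
    """In-place update: weights += scale * feature_vector(doc)."""
    for word in doc:
        weights[word] += scale


weights = defaultdict(int)
doc = ["good", "good", "plot"]

add_scaled(weights, doc, 1)
print(dict(weights))
print(dot(weights, doc))

# The same dot product written with explicit word counts.
print(sum(weights[w] * n for w, n in Counter(doc).items()))
```

Iterating over the document rather than over the whole vocabulary keeps both operations proportional to the document length.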

### Tie-breaking

The exact results that you will get with your implementation will depend on how you break ties between classes with the same activation. For the sake of comparability, we ask you to adopt the following strategy: If more than one class gets the same activation, pick the smallest class with respect to the lexicographic ordering on class names (so `neg` will come before `pos`).
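
One possible way to express this strategy in Python is to select the class with a composite sort key. The function name `argmax_with_ties` and the example activations are assumptions for illustration; this is a sketch of the rule as stated above, not the reference implementation.

```python
def argmax_with_ties(activations):
    """Returns the class with the highest activation.

    Ties are broken in favour of the lexicographically smallest class name:
    the key sorts by descending activation first, then by class name.
    """
    return min(activations, key=lambda c: (-activations[c], c))


print(argmax_with_ties({"pos": 0, "neg": 0}))
print(argmax_with_ties({"pos": 2, "neg": 1}))
```

Inverting the strategy for the experiment in Problem 3 then only requires changing how the class name enters the key, for example by taking `max` over `(activations[c], c)` instead.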

### Test your implementation

To test your implementation, you can use the following code:

In [None]:
classifier3 = nlp1.Perceptron.train(training_data)
print(accuracy(classifier3, test_data))

classifier4 = Perceptron()
classifier4.train(training_data)

print(accuracy(classifier4, test_data))

<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
    <p>Finish the implementation of the averaged perceptron classifier. Test your implementation by evaluating on the test data. You should get exactly the same results as the reference implementation.</p>
    <p>Run experiments to address the following questions:</p>
    <ul>
        <li>What happens when you repeat the experiment but do not do averaging?</li>
        <li>What happens when you train the classifier for two epochs?</li>
        <li>What happens when you invert the tie-breaking strategy?</li>
    </ul>
    <p>Report your results and provide a short discussion in the text cell below.</p>
</div>
</div>

* Without averaging, the final weights depend heavily on the last few updates. Suppose that after the first 100 examples the weight vector is so good that no updates happen for the next 9899 examples, but the perceptron then misclassifies the last example. The weights are updated with regard to that single example, without considering all the previously correctly classified examples. Averaging counters this by giving more influence to weight vectors that survived for a long time.
    * When running our model without averaging we get an accuracy of 0.64 instead of 0.745.
<br>
<br>
* During the second epoch the weights are adapted further and the running counter used for averaging keeps increasing. If we do not have access to a lot of data, there is a risk that the model overfits the training data and therefore performs worse on the test data.
    * If we run our model for 2 epochs we get an accuracy of 0.79.
<br>
<br>
* The classifier makes different predictions depending on the tie-breaking strategy. If the developers want a default prediction in case of ties, fixing a tie-breaking strategy is one way of accomplishing this.
    * We get an accuracy of 0.73 when inverting the tie-breaking strategy.
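
The averaging step in `Perceptron.train()` relies on an accumulator instead of storing every intermediate weight vector. The sketch below checks, for a single scalar weight and a made-up sequence of updates, that `w - acc / count` equals the plain average of the weight over all steps, including the initial zero weight. The function names and the `deltas` sequence are assumptions for illustration; `Fraction` is used so that the comparison is exact.

```python
from fractions import Fraction


def naive_average(deltas):
    """Average of the weight after every step, including the initial weight 0."""
    w = 0
    snapshots = [w]
    for d in deltas:
        w += d
        snapshots.append(w)
    return Fraction(sum(snapshots), len(snapshots))


def accumulator_average(deltas):
    """The same average via the accumulator trick: w - acc / count."""
    w = 0
    acc = 0
    count = 1
    for d in deltas:
        w += d
        acc += count * d  # weight each update by the step at which it happened
        count += 1        # the counter advances on every example, updated or not
    return w - Fraction(acc, count)


# Hypothetical per-example updates to one weight (0 means "no mistake").
deltas = [1, 0, 0, -1, 0]
print(naive_average(deltas))
print(accumulator_average(deltas))
```

An update made early in training stays in the weight for many steps and therefore contributes more to the average than a late one, which is the effect described in the first bullet above.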


## Switching to binary features

In the lab so far, a document is represented as a list of the words that occur in it. For sentiment classification, several authors have suggested that a *binary* document representation, where each word is represented only once, can produce better results. In the last problem you will try to confirm this finding.

Your task is to implement a function `binarize()` that converts data into the binary representation:

In [None]:

def binarize(data):
    # For each article extract the unique words
    new_data = [(list(set(art)), c) for art, c in data]
    return new_data


The function is to be used in the following context:

In [None]:
binarized_training_data = binarize(training_data)
binarized_test_data = binarize(test_data)

classifier5 = NaiveBayes()
classifier5.train(binarized_training_data)
print(accuracy(classifier5, binarized_test_data))

classifier6 = Perceptron()
classifier6.train(binarized_training_data)
print(accuracy(classifier6, binarized_test_data))

<div class="panel panel-primary">
<div class="panel-heading">Problem 4</div>
<div class="panel-body">
Implement the `binarize()` function and run the evaluation. What do you observe? Report your results and speculate on possible explanations in the text cell below.
</div>
</div>

By limiting each word to one occurrence per article, we remove many occurrences of frequent words such as 'it', 'a' and 'is'. These words are not helpful when classifying an article as positive or negative, so removing their repeated counts reduces noise in the data. We observed that binarizing the data gave a higher accuracy.
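
To make the effect of binarization concrete, the sketch below compares word counts before and after removing repeated words from one made-up document. The helper `binarize_doc` is a hypothetical variant that uses `dict.fromkeys` to keep the first-occurrence order, which makes the output reproducible; the order does not affect either classifier.

```python
from collections import Counter


def binarize_doc(doc):
    """Keeps each word once, in order of first occurrence."""
    return list(dict.fromkeys(doc))


# Made-up example document.
doc = ["it", "is", "a", "great", "film", "it", "is", "great"]

print(Counter(doc).most_common(3))
print(binarize_doc(doc))
print(Counter(binarize_doc(doc)).most_common(3))
```

After binarization every word has a count of one, so frequent function words no longer outweigh rarer, more informative words within a single document.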