# Perceptron Algorithm for Text Classification

January 6, 2023

Your name: Ha Trung Tin

Student ID: BI11-261

**Due: 23:59 January 15, 2023**

## How to submit

- Attach notebook file (.ipynb) and submit your work to Google Class Room 
- Name your file as YourName_StudentID_Assignment1.ibynb. E.g., Nguyen_Van_A_ST099834_Assignment1.ipynb
- Copying others' assignments is strictly prohibited.
- Write your name and student ID into this notebook


## Task Description

- We will train a binary classification model to determine that a title is about a person. We will use the train dataset [here](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-train.labeled)
- We will evaluate the model on a [test dataset](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-test.labeled). We use accuracy as the evaluation measure.

## Downloading dataset

In [1]:
%%capture
!rm -f titles-en-train.labeled
!rm -f titles-en-test.labeled

!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-train.labeled
!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-test.labeled

Each sample is written in a line. There are two labels {1, -1} in the data.

```
1	FUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period .
-1	Yomi is the world of the dead .
```

## Loading Data

We will load data into a list of sentences with their labels.

In [2]:
def load_data(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            line = line.strip()
            if line == '':
                continue
            lb, text = line.split('\t')
            data.append((text,int(lb)))
            
    return data

Loading data from files

In [3]:
train_data = load_data('./titles-en-train.labeled')
test_data = load_data('./titles-en-test.labeled')

In [4]:
train_data[0]

('FUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period .',
 1)

## Building Perceptron Model

You will need to complete the implementation of the Perceptron class as follows.

In [5]:
"""
Implementation of Perceptron model
"""
from collections import defaultdict

class Perceptron:
    """Perceptron classifier
    """
    def __init__(self, eta=0.001, n_iter=10):
        self.eta = eta
        self.n_iter = n_iter
    
    def train(self, data):
        """Training the model

        Parameters
        ----------
        data: list of tuples (x,y) where x is a sentence and y is the label

        Returns
        -------
        self : object
        """
        self.w = defaultdict(float)
        for _ in range(self.n_iter):
            for x, y in data:
                phi = self.create_features(x)
                y_pred = self.predict_one(self.w, phi)
                if y != y_pred:
                        self.update_weights(self.w, phi, y)
    
    def predict_one(self, w, phi):
        """

        Parameters
        ------------
        w (dict): weights of features
        phi (dict): features extracted from input

        Returns
        ------------
        label for the input sentence (1 or -1)
        """
        #TODO: Write your code here
        score = 0
        for name, value in phi.items():
            score += value * w[name]
        if (score >= 0):
            return 1
        else:
            return -1
        pass
    
    def create_features(self, x):
        """
        Parameters
        -----------------
        x (str): Input sentence

        Returns
        -----------------
        phi: dictionary, feature vector
        """
        #TODO: Write your code here
        phi = defaultdict(float)
        words = x.split()
        for word in words:
            phi[word] += 1
        return phi
        pass

    def update_weights(self, w, phi, y):
        """
        Parameters
        -----------------
        w (dict): weights of features
        phi (dict): features extracted from input
        y (int): Gold label (1 or -1)

        Returns
        -----------------
        None
        """
        #TODO: Write your code here
        for name, value in phi.items():
            w[name] += self.eta * (y - self.predict_one(w, phi)) * value
        pass
    
    def classify(self, x):
        phi = self.create_features(x)
        return self.predict_one(self.w, phi)
    
    def predict_all(self, test_samples):
        y_preds = []
        for x in test_samples:
            y_pred = self.classify(x)
            y_preds.append(y_pred)
        return y_preds

## Training the model

In [6]:
model = Perceptron()
model.train(train_data)

## Evaluation

You need to evaluate the model on the test data and report the accuracy here.

In [7]:
from sklearn import metrics

X_test, y_true = zip(*test_data)
y_preds = model.predict_all(X_test)

print("Accuracy: ", metrics.accuracy_score(y_true, y_preds))

Accuracy:  0.9018774353524619
