# PA4: Implementing linear classifiers
### Authors: David Laessker, Peter Fagrell

### Note: We added our experiments with different parameters and made the code runnable in the notebook.

## Exercise Question

Since the perceptron is a linear classifier, it can only classify data that is linearly separable, that is, data whose classes can be separated by a line. We can represent the two training sets very crudely as follows:

Training data 1 would look like this:

x|x
-|-
x|o

and training data 2 would look like this

o|x
-|-
x|o

The top two boxes are Gothenburg/Sydney and the bottom two are Paris; the right boxes are July and the left boxes are December.


As we can see, we can easily draw a line that separates the two classes in example 1 but not in example 2, where the classes sit on opposite diagonals (an XOR pattern). This means that the perceptron can learn example 1 but not example 2.
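
The argument can be checked with a minimal sketch. It assumes the two features are encoded as 0/1 (city, month) and the labels as +1 for x and -1 for o; the function name `perceptron_separates` and the epoch limit are our own choices for illustration.

```python
import numpy as np


def perceptron_separates(X, y, n_epochs=100):
    """Return True if a perceptron with a bias term reaches zero training errors."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a constant bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * xi.dot(w) <= 0:  # misclassified (or on the boundary)
                w += yi * xi
                errors += 1
        if errors == 0:
            return True
    return False


# Rows are (city, month) with an assumed 0/1 encoding:
# city 0 = top row, city 1 = bottom row, month 0 = left, month 1 = right.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y1 = np.array([1, 1, 1, -1])    # training data 1: only one corner differs
y2 = np.array([-1, 1, 1, -1])   # training data 2: XOR pattern

print(perceptron_separates(X, y1))
print(perceptron_separates(X, y2))
```

The first call should report that the perceptron reaches zero training errors, while the second never does within the epoch limit, because no line separates the XOR pattern.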

--------------------------------
## Implementing the SVC

### The following table shows the accuracy achieved with the different classifiers

| Classifier | Accuracy |
| --- | --- |
| PegasosSVC | 0.8443 |
| PegasosLREG | 0.8359 |

# Our implementation:

In [19]:
#Imports for lab 4
import numpy as np
from aml_perceptron import LinearClassifier

import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


#### The _PegasosSVC_ was implemented by translating the pseudocode from the document _'Clarification of the pseudocode in the Pegasos paper'_. The algorithm was run with several different combinations of parameters before settling on the following:

| Parameter | Value |
| --- | --- |
| lambda_reg | 0.01 |
|n_iter | 100 000 |
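
For reference, the objective that Pegasos minimizes and the update carried out in each iteration can be written as follows. This is a summary in our own notation (training set $T$, step size $\eta_t$) that follows the code cell below rather than quoting the paper:

```latex
f(\mathbf{w}) = \frac{\lambda}{2}\lVert\mathbf{w}\rVert^2
  + \frac{1}{|T|}\sum_{(\mathbf{x}, y)\in T} \max\bigl(0,\; 1 - y\,\mathbf{w}\cdot\mathbf{x}\bigr)

\eta_t = \frac{1}{\lambda t}, \qquad
\mathbf{w} \leftarrow
\begin{cases}
(1 - \eta_t\lambda)\,\mathbf{w} + \eta_t\, y\,\mathbf{x} & \text{if } y\,\mathbf{w}\cdot\mathbf{x} \le 1,\\
(1 - \eta_t\lambda)\,\mathbf{w} & \text{otherwise.}
\end{cases}
```

Since $\eta_t\lambda = 1/t$, the shrink factor $(1 - \eta_t\lambda)$ is simply $1 - 1/t$.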

### The code cell below shows the implementation of the PegasosSVC algorithm:

In [20]:
class PegasosSVC(LinearClassifier):
    """
    Implementation of the Pegasos algorithm for SVCs.
    """

    def __init__(self, lambda_reg=0.1, n_iter=1000000):
        self.lambda_reg = lambda_reg
        self.n_iter = n_iter

    def fit(self, X, Y):

        # Preprocess the data
        self.find_classes(Y)
        Y_encoded = self.encode_outputs(Y)

        if not isinstance(X, np.ndarray):
            X = X.toarray()

        # Initialize the weights
        n_features = X.shape[1]
        self.w = np.zeros(n_features)
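        # NOTE: this overwrites the lambda_reg passed to __init__,
        # so the constructor argument has no effect on training
        self.lambda_reg = 1/n_features

        # Pegasos algorithm implemented
        # like the pseudocode in the paper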
        for t in range(1, self.n_iter):
            rand = np.random.randint(0, len(X))
            x, y = X[rand], Y_encoded[rand]

            n = 1/(self.lambda_reg*t)

            score = x.dot(self.w)

            if y*score <= 1:
                self.w = (1 - n*self.lambda_reg) * self.w + n*y*x
            else:
                self.w = (1 - n*self.lambda_reg) * self.w



#### The following two code cells are doc_classification.py, which was provided by the course. It was modified to use the _PegasosSVC_ algorithm instead of the perceptron it originally used. We split it up and simplified it to make it easier to understand and to plug in the _PegasosSVC_ and _PegasosLREG_ algorithms.

In [21]:
# This function reads the corpus, returns a list of documents, and a list
# of their corresponding polarity labels.


def read_data(corpus_file):
    X = []
    Y = []
    with open(corpus_file, encoding='utf-8') as f:
        for line in f:
            _, y, _, x = line.split(maxsplit=3)
            X.append(x.strip())
            Y.append(y)
    return X, Y
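
As a small illustration of the parsing step, the snippet below splits one made-up line. The layout (two ignored fields around the label, followed by the text) is inferred from the code above; the example line itself is hypothetical and not taken from the corpus.

```python
# Hypothetical line: <ignored> <label> <ignored> <text...>
line = "music neg 241.txt i bought this album because i loved the title song\n"

# maxsplit=3 splits off the first three fields and keeps the rest of the
# line, including its trailing newline, as one string.
_, label, _, text = line.split(maxsplit=3)

print(label)
print(repr(text))          # still ends with the newline
print(repr(text.strip()))  # strip() removes it, as read_data does
```

This is why `read_data` calls `x.strip()` before storing the document text.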


#### When we run the code cell below it will print the **accuracy** of the _PegasosSVC_ algorithm.

## Accuracy: 0.8443

In [22]:
# Read all the documents.
X, Y = read_data(
    'data/all_sentiment_shuffled.txt')
# Split into training and test parts.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2,
                                                random_state=69)

# Set up the preprocessing steps and the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(k=1000),
    Normalizer(),
    # NB that this is our PegasosSVC, not a classifier from sklearn
    PegasosSVC()
)

# Train the classifier.
t0 = time.time()
pipeline.fit(Xtrain, Ytrain)
t1 = time.time()
print('Training time: {:.2f} sec.'.format(t1-t0))
# Evaluate on the test set.
Yguess = pipeline.predict(Xtest)
print('Accuracy: {:.4f}.'.format(accuracy_score(Ytest, Yguess)))

Training time: 17.93 sec.
Accuracy: 0.8443.


#### _PegasosLREG_ is the Pegasos algorithm for logistic regression, with the following parameters:

| Parameter | Value |
| --- | --- |
| lambda_reg | 0.1 |
|n_iter | 100 000 |

### The code cell below shows the implementation of the PegasosLREG algorithm:

In [23]:
class PegasosLREG(LinearClassifier):
    """
    Implementation of the Pegasos algorithm for logistic regression.
    """

    def __init__(self, lambda_reg=0.1, n_iter=100000):
        self.lambda_reg = lambda_reg
        self.n_iter = n_iter

    def fit(self, X, Y):

        # Preprocess the data
        self.find_classes(Y)
        Y_encoded = self.encode_outputs(Y)

        if not isinstance(X, np.ndarray):
            X = X.toarray()

        # Initialize the weights
        n_features = X.shape[1]
        self.w = np.zeros(n_features)
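        # NOTE: this overwrites the lambda_reg passed to __init__,
        # so the constructor argument has no effect on training
        self.lambda_reg = 1/n_features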

        # Pegasos algorithm implemented with logistic regression
        for t in range(1, self.n_iter):
            rand = np.random.randint(0, len(X))
            x, y = X[rand], Y_encoded[rand]

            n = 1/(self.lambda_reg*t)

            score = x.dot(self.w)
            gradient = -y * x * (1 - 1/(1 + np.exp(-y * score)))

            self.w = (1 - n * self.lambda_reg) * self.w - n * gradient
            self.w *= min(1, 1 / (np.sqrt(self.lambda_reg)
                          * np.linalg.norm(self.w)))
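
The gradient expression used in `fit` can be sanity-checked numerically. The sketch below compares it with a central finite difference of the log loss $\log(1 + e^{-y\,w\cdot x})$; the helper names, the random test point and the seed are assumptions made for illustration.

```python
import numpy as np


def log_loss(w, x, y):
    """Log loss of a single example with label y in {-1, +1}."""
    return np.log1p(np.exp(-y * x.dot(w)))


def log_loss_gradient(w, x, y):
    """Same expression as the gradient line in PegasosLREG.fit."""
    score = x.dot(w)
    return -y * x * (1 - 1 / (1 + np.exp(-y * score)))


rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)
y = -1.0

# Central finite difference along each coordinate axis.
eps = 1e-6
numeric = np.array([
    (log_loss(w + eps * e, x, y) - log_loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(5)
])

print(np.allclose(numeric, log_loss_gradient(w, x, y), atol=1e-6))
```

If the two gradients agree, the analytic expression is consistent with the log loss it is meant to differentiate.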


Testing values for lambda and n_iter for logistic regression. We conclude that most values of n_iter over 50 000 give the highest accuracies and that lambda values between 0.1 and 0.0001 do not make a huge difference.

In [29]:
for lambda_reg in [0.1, 0.01, 0.001, 0.0001]:
    for exp in range(13,20):
        n_iter = 2**exp
        X, Y = read_data(
        'data/all_sentiment_shuffled.txt')
        # Split into training and test parts.
        Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2,
                                                        random_state=69)

        # Set up the preprocessing steps and the classifier.
        pipeline = make_pipeline(
            TfidfVectorizer(),
            SelectKBest(k=1000),
            Normalizer(),
            # NB that this is our PegasosLREG, not a classifier from sklearn
            PegasosLREG(lambda_reg=lambda_reg, n_iter=n_iter)
        )

        # Train the classifier.
        t0 = time.time()
        pipeline.fit(Xtrain, Ytrain)
        t1 = time.time()
        print('Training time: {:.2f} sec.'.format(t1-t0))
        # Evaluate on the test set.
        Yguess = pipeline.predict(Xtest)
        print('---------------------')
        print('Accuracy: {:.4f}.'.format(accuracy_score(Ytest, Yguess)))
        print("lambda_reg: ", lambda_reg, "n_iter: ", n_iter)


Training time: 3.72 sec.
---------------------
Accuracy: 0.8317.
lambda_reg:  0.1 n_iter:  65536
Training time: 5.76 sec.
---------------------
Accuracy: 0.8317.
lambda_reg:  0.1 n_iter:  131072
Training time: 11.75 sec.
---------------------
Accuracy: 0.8330.
lambda_reg:  0.1 n_iter:  262144
Training time: 3.41 sec.
---------------------
Accuracy: 0.8317.
lambda_reg:  0.01 n_iter:  65536
Training time: 5.57 sec.
---------------------
Accuracy: 0.8246.
lambda_reg:  0.01 n_iter:  131072
Training time: 10.52 sec.
---------------------
Accuracy: 0.8242.
lambda_reg:  0.01 n_iter:  262144
Training time: 3.52 sec.
---------------------
Accuracy: 0.8284.
lambda_reg:  0.001 n_iter:  65536
Training time: 5.88 sec.
---------------------
Accuracy: 0.8292.
lambda_reg:  0.001 n_iter:  131072
Training time: 10.21 sec.
---------------------
Accuracy: 0.8326.
lambda_reg:  0.001 n_iter:  262144
Training time: 3.43 sec.
---------------------
Accuracy: 0.8309.
lambda_reg:  0.0001 n_iter:  65536
Training 

Testing values for the linear SVC, we reach the same conclusion: a value of n_iter over 50 000 is preferred, and all lambda_reg values between 0.1 and 0.0001 work well.
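
One rough way to judge whether such small accuracy differences matter is the binomial standard error of an accuracy estimate. The sketch below assumes a hypothetical test-set size `n_test`, which is not stated in this notebook, and treats the test predictions as independent. It is only a yardstick, not a full significance test, and it ignores the randomness of training itself.

```python
import math


def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)


# n_test = 2000 is a hypothetical value chosen for illustration only.
se = accuracy_standard_error(0.8443, n_test=2000)
print('Standard error: {:.4f}'.format(se))
```

Under that assumption, differences of a few thousandths between settings are smaller than one standard error.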

In [31]:
for lambda_reg in [0.1, 0.01, 0.001, 0.0001]:
    for exp in range(13,20):
        n_iter = 2**exp
        X, Y = read_data(
        'data/all_sentiment_shuffled.txt')
        # Split into training and test parts.
        Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2,
                                                        random_state=69)

        # Set up the preprocessing steps and the classifier.
        pipeline = make_pipeline(
            TfidfVectorizer(),
            SelectKBest(k=1000),
            Normalizer(),
            # NB that this is our PegasosSVC, not a classifier from sklearn
            PegasosSVC(lambda_reg=lambda_reg, n_iter=n_iter)
        )

        # Train the classifier.
        t0 = time.time()
        pipeline.fit(Xtrain, Ytrain)
        t1 = time.time()
        print('Training time: {:.2f} sec.'.format(t1-t0))
        # Evaluate on the test set.
        Yguess = pipeline.predict(Xtest)
        print('---------------------')
        print('Accuracy: {:.4f}.'.format(accuracy_score(Ytest, Yguess)))
        print("lambda_reg: ", lambda_reg, "n_iter: ", n_iter)


Training time: 2.07 sec.
---------------------
Accuracy: 0.8443.
lambda_reg:  0.1 n_iter:  32768
Training time: 2.25 sec.
---------------------
Accuracy: 0.8389.
lambda_reg:  0.1 n_iter:  65536
Training time: 3.14 sec.
---------------------
Accuracy: 0.8393.
lambda_reg:  0.1 n_iter:  131072
Training time: 5.11 sec.
---------------------
Accuracy: 0.8431.
lambda_reg:  0.1 n_iter:  262144
Training time: 9.54 sec.
---------------------
Accuracy: 0.8443.
lambda_reg:  0.1 n_iter:  524288
Training time: 1.77 sec.
---------------------
Accuracy: 0.8372.
lambda_reg:  0.01 n_iter:  32768
Training time: 2.16 sec.
---------------------
Accuracy: 0.8393.
lambda_reg:  0.01 n_iter:  65536
Training time: 3.26 sec.
---------------------
Accuracy: 0.8405.
lambda_reg:  0.01 n_iter:  131072
Training time: 5.95 sec.
---------------------
Accuracy: 0.8397.
lambda_reg:  0.01 n_iter:  262144
Training time: 8.97 sec.
---------------------
Accuracy: 0.8435.
lambda_reg:  0.01 n_iter:  524288
Training time: 1.68

In [34]:
# Read all the documents.
X, Y = read_data(
    'data/all_sentiment_shuffled.txt')
# Split into training and test parts.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2,
                                                random_state=69)

# Set up the preprocessing steps and the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(k=1000),
    Normalizer(),
    # NB that this is our PegasosLREG, not a classifier from sklearn
    PegasosLREG(n_iter=100000, lambda_reg=0.01)
)

# Train the classifier.
t0 = time.time()
pipeline.fit(Xtrain, Ytrain)
t1 = time.time()
print('Training time: {:.2f} sec.'.format(t1-t0))
# Evaluate on the test set.
Yguess = pipeline.predict(Xtest)
print('Accuracy: {:.4f}.'.format(accuracy_score(Ytest, Yguess)))


Training time: 5.11 sec.
Accuracy: 0.8359.
