# Lab - Classification using Perceptron Learning
This notebook serves as the starter code and lab description covering **Chapter 19 - Learning from Examples (Part 2)** from the book *Artificial Intelligence: A Modern Approach.*

In [None]:
# pip install pandas
# pip install tqdm
# pip install nltk
# pip install numpy
# pip install scikit-learn

import warnings
warnings.filterwarnings("ignore")
import pandas as pd
from numpy.random import uniform
from tqdm import tqdm # you may comment this line if you don't need it

import re
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn import metrics

## OVERVIEW
In the lecture, we discussed *linear classifiers with a hard threshold* and looked at the perceptron learning update rule for model parameters. In this lab, we implement a perceptron linear classifier and use it to classify a real-world dataset.

To make things easier, we first start with a mock dataset with which we develop and test our perceptron classifier.

## Part 1 - Implementing the Perceptron Classifier
To get started on the task, let's first assume a very simple set of classification points over which we want to develop our perceptron classifier. Here are the data:

In [None]:
mock_data = pd.DataFrame([[2.7810836,2.550537003,0], [1.465489372,2.362125076,0], [3.396561688,4.400293529,0], [1.38807019,1.850220317,0], [3.06407232,3.005305973,0], [7.627531214,2.759262235,1], [5.332441248,2.088626775,1], [6.922596716,1.77106367,1], [8.675418651,-0.242068655,1], [7.673756466,3.508563011,1]], columns=['X1', 'X2', 'Y'])
mock_data_X = mock_data[['X1', 'X2']]
mock_data_y = mock_data[['Y']]
display(mock_data_X)
display(mock_data_y)

To implement the classifier, we base our model on the *multivariable linear regression* idea, so our classifier looks like: $$h_w(\textbf{x}_j) = w_0 + w_1x_{j,1}+w_2x_{j,2}$$ 

* Note: The actual *multivariable linear regression* formula goes all the way up to $w_nx_{j,n}$, but since our mock_data has only $n=2$ variables, we have simplified the equation!

Looking at our equation, we see that we need three $w$s to calculate $h_w(\textbf{x}_j)$ for each row of our mock dataset. 
Start by storing the three required $w$s in a list and randomly set their values (the values will change while we run the perceptron algorithm):
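
One possible way to do this is sketched here, assuming the `uniform` function imported at the top of the notebook is acceptable. The fixed seed is an addition of this sketch (not part of the lab) and is there only so that repeated runs give the same values:

```python
import numpy as np
from numpy.random import uniform

np.random.seed(0)  # assumption: fixed seed, only to make the sketch repeatable

# one weight per term of h_w: w_0 (bias), w_1 and w_2
weights = list(uniform(-0.5, 0.5, 3))
print(weights)
```

`uniform(low, high, size)` draws from the half-open interval `[low, high)`, so every value lands between -0.5 and 0.5.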

In [None]:
# TODO create a 'weights' list and fill it with three randomly generated values between -0.5 and 0.5

Assuming that the `weights` vector you just created is the best parameter set possible for classifying the mock dataset (it clearly is not, but the assumption helps you move forward and you will revisit it later!), implement a `classify` function that receives one record of the mock dataset and the `weights` vector and returns the classification result of the perceptron classifier, which is: $$h_w(\textbf{x}_j) = 1\ \text{if}\ w_0 + w_1x_{j,1}+w_2x_{j,2} \geq 0;\ 0\ \text{otherwise}$$
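
A minimal sketch of such a hard-threshold function is shown here in plain Python. The `example_weights` values are hypothetical and picked by hand purely for illustration; they are not meant to be good parameters:

```python
def classify(record, weights):
    """Hard-threshold perceptron: record is (x1, x2), weights is [w0, w1, w2]."""
    activation = weights[0]  # w_0 is the bias term
    for w_i, x_i in zip(weights[1:], record):
        activation += w_i * x_i
    return 1 if activation >= 0 else 0

# hypothetical weights, chosen by hand for illustration only
example_weights = [-1.0, 0.5, 0.0]
print(classify([2.7810836, 2.550537003], example_weights))  # prints 1
```

Note that an activation of exactly 0 is classified as 1, matching the $\geq$ in the formula.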

In [None]:
# TODO implement `classify` function as instructed above and test it with the first row of mock_data

Now, using the `classify` function, implement a `for` loop over the `mock_data` and classify each record. Print the classification result along with the actual expected value for each record:

In [None]:
#TODO classify and compare classification results of mock_data with the actual expected values:

Chances are that most or at least some of your dataset records are classified incorrectly (since this is a really small dataset, sometimes this does not happen, but if you re-run the experiment, you'll see it's not always perfect). Now, what if I told you that one oracle setting for `weights` is `[-0.1, 0.20653640140000007, -0.23418117710000003]`? Try this oracle answer and see the change in your results:
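
As a rough check, the sketch below applies the oracle weights to the mock dataset. The `classify` helper is one possible implementation and is an assumption of this sketch, not the required solution:

```python
import pandas as pd

mock_data = pd.DataFrame(
    [[2.7810836, 2.550537003, 0], [1.465489372, 2.362125076, 0],
     [3.396561688, 4.400293529, 0], [1.38807019, 1.850220317, 0],
     [3.06407232, 3.005305973, 0], [7.627531214, 2.759262235, 1],
     [5.332441248, 2.088626775, 1], [6.922596716, 1.77106367, 1],
     [8.675418651, -0.242068655, 1], [7.673756466, 3.508563011, 1]],
    columns=['X1', 'X2', 'Y'])
oracle_weights = [-0.1, 0.20653640140000007, -0.23418117710000003]

def classify(record, weights):
    # assumed helper: bias plus weighted sum, followed by a hard threshold at 0
    activation = weights[0] + sum(w * x for w, x in zip(weights[1:], record))
    return 1 if activation >= 0 else 0

results = []
for row in mock_data.itertuples(index=False):
    predicted = classify([row.X1, row.X2], oracle_weights)
    results.append((int(row.Y), predicted))
    print(f"expected={int(row.Y)} predicted={predicted}")
```

With these weights, every predicted value matches the expected one on the ten mock records.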

In [None]:
# TODO try the oracle answer and repeat the past cell and show the classify results:

## Finding oracle parameter values using perceptron learning
In this section, we are going to learn how to get from a random `weights` vector to an oracle one. For this purpose, we will use *stochastic gradient descent (SGD)*. The only three hyperparameters we need for SGD are the *learning rate ($\alpha$)*, the number of times we iterate over the data *(n_epochs)* while updating the model parameters (`weights` in our case), and the batch size *(b_size)*, which is the number of training data records we process before each update of the model parameters.

Here is what you need to do:
* Implement a function called `perceptron_training` that receives the training data, the learning rate, the number of epochs, and the batch size.
* Inside `perceptron_training`:
    * initialize the `weights` vector randomly as we did previously.
    * implement a for loop to iterate over the training data `n_epochs` number of times.
* Inside the `epochs` loop (which we created in the last line):
    * create a `temp_weights` vector which will contain the parameter updates before they are synchronized with the actual `weights` vector.
    * initialize `temp_weights` to the values of `weights`.
    * create a for loop that iterates over the training data.
* Inside the loop over the training data:
    * `classify` each training data record.
    * calculate the classification $\text{error}$ using $\text{actual}-\text{prediction}$.
    * add $\alpha * \text{error} * \text{input_feature}_i$ to each index $i$ of `temp_weights` (for the bias weight $w_0$ at index 0, use an input of 1).
    * if the training data record index is divisible by `b_size`, update the `weights` vector with the values of `temp_weights`.

* Return the final modified `weights` from `perceptron_training`.

At the end of each epoch, report the sum of the squared values of the calculated errors. We expect this value to decrease as the training proceeds; otherwise something is wrong!

Test your implemented `perceptron_training` algorithm with `l_rate = 0.1`, `n_epoch = 5`, and `b_size = 3` and use its reported weights to redo the last cell. Don't forget to print the `weights` your `perceptron_training` algorithm has found.
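
The steps above could be sketched roughly as follows. This is one possible reading of the instructions, not a reference solution. It makes three assumptions: record indices are counted from 1 (so a synchronization happens after every `b_size` records), any leftover partial batch is synchronized at the end of an epoch, and the extra `seed` argument (hypothetical, not in the required signature) is there only for reproducibility:

```python
import numpy as np

def predict(x, weights):
    """Hard-threshold perceptron output for one feature vector x."""
    return 1.0 if weights[0] + np.dot(weights[1:], x) >= 0 else 0.0

def perceptron_training_sketch(train_X, train_y, l_rate, n_epoch, b_size, seed=0):
    # `seed` is a hypothetical extra argument, used only for reproducibility
    rng = np.random.default_rng(seed)
    X = np.asarray(train_X, dtype=float)
    y = np.asarray(train_y, dtype=float).ravel()
    weights = rng.uniform(-0.5, 0.5, X.shape[1] + 1)  # w_0 plus one weight per feature
    for epoch in range(n_epoch):
        temp_weights = weights.copy()
        sum_squared_error = 0.0
        for i, (x, target) in enumerate(zip(X, y), start=1):
            error = target - predict(x, weights)
            sum_squared_error += error ** 2
            temp_weights[0] += l_rate * error       # bias weight: input is 1
            temp_weights[1:] += l_rate * error * x
            if i % b_size == 0:                     # assumption: 1-based record index
                weights = temp_weights.copy()
        weights = temp_weights.copy()  # assumption: flush a trailing partial batch
        print(f"epoch={epoch} sum_squared_error={sum_squared_error:.1f}")
    return weights

mock_X = np.array([[2.7810836, 2.550537003], [1.465489372, 2.362125076],
                   [3.396561688, 4.400293529], [1.38807019, 1.850220317],
                   [3.06407232, 3.005305973], [7.627531214, 2.759262235],
                   [5.332441248, 2.088626775], [6.922596716, 1.77106367],
                   [8.675418651, -0.242068655], [7.673756466, 3.508563011]])
mock_y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

learned = perceptron_training_sketch(mock_X, mock_y, l_rate=0.1, n_epoch=5, b_size=3)
print(learned)
```

Because predictions inside a batch use the last synchronized `weights`, results with `b_size > 1` can differ from the purely online rule (`b_size = 1`), so it is worth trying both.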

In [None]:
def perceptron_training(train_X, train_y, l_rate, n_epoch, b_size):
    # TODO implement `perceptron_training` here
    return []

In [None]:
# TODO use `perceptron_training` function to find the optimal `weights` 
# and, using the weights you found, re-classify the mock_data instances

## Real-world Data Application
Now that we have the perceptron classifier, we can use it to classify real-world data, namely the `SMSSpamCollection` data which we have already used previously.

Copy the `SMSSpamCollection.tsv` file from the earlier lab and use the `pandas` library to read and load it.

In [None]:
# TODO read and load up SMSSpamCollection in `sms_data`

Let's see how many words there are in the `TEXT` fields of this dataset:

In [None]:
print(sms_data['TEXT'].apply(lambda x: len(x.split(' '))).sum())

### Text Pre-processing 
One important step to improve the accuracy of the model in dealing with weird and incorrect spellings is to perform *text pre-processing*. Using the following pre-processing regular expressions, develop a `clean_text` function that receives an SMS text message, lowercases it, and cleans it up (e.g. you can call `BAD_SYMBOLS_RE.sub` to replace bad symbols with an empty string!). Also, check all the words in the text message, and drop those that appear in the `STOPWORDS` set.

In [None]:
# Run this cell only once and you're gonna be fine
# import nltk
# nltk.download('stopwords')

In [None]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # raw string avoids invalid-escape warnings
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))
def clean_text(text):
    # TODO implement this function by doing the following:
        # lowercase text
        # replace REPLACE_BY_SPACE_RE symbols by space in text
        # delete symbols which are in BAD_SYMBOLS_RE from text
        # delete stopwords from text
    return text
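
The four steps listed in the comments could look roughly like the sketch below. To keep it runnable without downloading the NLTK corpus, it uses `DEMO_STOPWORDS`, a tiny hand-picked stand-in (an assumption of this sketch) instead of `stopwords.words('english')`:

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# assumption: tiny stand-in for the NLTK English stopword list
DEMO_STOPWORDS = {'a', 'the', 'is', 'to', 'you', 'and'}

def clean_text_sketch(text, stopwords=DEMO_STOPWORDS):
    text = text.lower()                        # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # separators become spaces
    text = BAD_SYMBOLS_RE.sub('', text)        # drop every other unwanted symbol
    # split() also collapses the repeated spaces left by the substitutions
    return ' '.join(word for word in text.split() if word not in stopwords)

cleaned = clean_text_sketch("WINNER!! You have won a prize; call 0800-123 to claim.")
print(cleaned)  # prints: winner have won prize call 0800123 claim
```

Lowercasing has to come first, because `BAD_SYMBOLS_RE` would otherwise delete every uppercase letter.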


Now we can apply the `clean_text` function to the `TEXT` fields of the dataset and get the word count in it after clean-up:

In [None]:
sms_data['TEXT'] = sms_data['TEXT'].apply(clean_text)
print(sms_data['TEXT'].apply(lambda x: len(x.split(' '))).sum())

Next, we do the familiar train/test separation of the data with `train_test_split`, using a 70:30 ratio.
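
As a reminder of the call shape, here is a small sketch on made-up data. The `texts` and `labels` lists are placeholders, and `random_state=42` is an arbitrary choice made only so that the split is repeatable:

```python
from sklearn.model_selection import train_test_split

# placeholder data standing in for the cleaned messages and their labels
texts = [f"message {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))  # prints: 7 3
```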

In [None]:
# TODO create X_train, X_test, y_train, y_test as you did in previous labs

There are two other important steps to prepare the data to be fed to the perceptron classifier we just implemented.

### Vectorization
Remember that our mock_data contained numeric values which could be plugged into the perceptron classification equation $$h_w(\textbf{x}_j) = w_0 + w_1x_{j,1}+...+w_nx_{j,n}$$

However, our current dataset contains strings (words), and it doesn't make sense to multiply weights by strings. Therefore, we need to convert the words into meaningful numeric values. One important technique is to convert the messages into bag-of-words vectors. 

The process is simple:
* You first collect all the words that appear in your **training data**.
    * if a word in the test data does not appear in the training data, you are not allowed to add it to the list. Instead, you use one single *out-of-vocabulary* token to account for any word that appears in the test data but not in the training data. 
* Then you assign an *id* to each word.
* To convert an SMS message into a bag-of-words:
    * create a vector of the size of all possible distinct words in train data.
    * initialize the vector with zeros
    * for each word appearing in the SMS message, find its equivalent id and set bag-of-words\[id\] for that word to 1.
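
The process above can be sketched by hand in a few lines of plain Python. The messages are made up, and reserving id 0 for a token named `<OOV>` is an assumption of this sketch:

```python
train_messages = ["free prize call now", "call me later"]
test_message = "free call tomorrow"

# build the vocabulary from the training data only; id 0 is the out-of-vocabulary slot
vocabulary = {"<OOV>": 0}
for message in train_messages:
    for word in message.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

def to_bag_of_words(message, vocabulary):
    vector = [0] * len(vocabulary)
    for word in message.split():
        # unseen words fall back to the out-of-vocabulary id
        vector[vocabulary.get(word, vocabulary["<OOV>"])] = 1
    return vector

print(to_bag_of_words(test_message, vocabulary))  # prints [1, 1, 0, 1, 0, 0, 0]
```

Here "tomorrow" never appears in the training messages, so it switches on the out-of-vocabulary slot instead of getting an id of its own.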
    
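The `sklearn` library implements this algorithm in the `CountVectorizer` class, which you can use to vectorize your dataset. Note that, by default, `CountVectorizer` stores word counts rather than 0/1 flags (pass `binary=True` for flags) and ignores words it did not see during fitting instead of mapping them to an out-of-vocabulary token. Here is an example of how to use it: [Link](https://thatascience.com/learn-machine-learning/bag-of-words/).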

In [None]:
# TODO vectorize the train and test data 

### Tfidf Transformation
The next transformation that you need to perform on the data is to reduce the effect of common words on the result of classification. The descriptions underneath this line are from the documentation of [`TfidfTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) in the `sklearn` library:

*Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.*

*The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.*

I recommend you read more on this transformation as it is a really useful technique in information retrieval applications. Use `TfidfTransformer` on both train and test data (fit it on the training data only, then apply the fitted transformer to the test data) to prepare the data for perceptron classification.
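
A small sketch of how the two steps chain together is shown below, on made-up messages. Both objects are fitted on the training texts only and then reused, unchanged, on the test texts:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# made-up messages standing in for the cleaned SMS texts
train_texts = ["free prize call now", "call me later", "free free prize"]
test_texts = ["free call tomorrow"]

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_texts)  # learns the vocabulary
test_counts = vectorizer.transform(test_texts)        # reuses it, no refitting

tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_counts)       # learns idf weights
test_tfidf = tfidf.transform(test_counts)             # applies the same weights

print(sorted(vectorizer.vocabulary_))
print(train_tfidf.shape, test_tfidf.shape)
```

Both results are sparse matrices with one column per training-vocabulary word, so the train and test data end up with the same number of features.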

In [None]:
# TODO Tfidf Transform the vectorized train and test data 

## Perceptron classifier training
Now that the training data is prepared, you can train your implemented classifier and use it to classify the test data. Be patient, as training will take some time to run. You can start with `l_rate=0.1`, `n_epoch=5`, `b_size=8`, but you can do a little bit of grid search to find better models. Create `metrics.classification_report` and `metrics.confusion_matrix` results for your implementation.
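
For the reporting part, the two `metrics` calls take the true labels first and the predicted labels second. The sketch below uses made-up label lists in place of your real test labels and predictions:

```python
from sklearn import metrics

# made-up labels standing in for y_test and your classifier's predictions
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

matrix = metrics.confusion_matrix(y_true, y_pred)
print(matrix)  # rows are true classes, columns are predicted classes
print(metrics.classification_report(y_true, y_pred))
```

In this toy case the matrix is `[[2, 1], [1, 2]]`: two correct and one incorrect prediction for each class.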

In [None]:
#TODO use your implemented `perceptron_training` algorithm to create proper `weights` for classification of test data

### Comparison of your implementation with that of `sklearn`
Now use the `sklearn.linear_model.Perceptron` classifier instead of your own `perceptron_training` function and compare the performance of your implementation with that of `sklearn`. Create the same reports as in the previous cell and explain the reason for any differences in the results (if there are any).
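
To show the call pattern only, the sketch below fits `Perceptron` on the small mock dataset from Part 1 rather than on the SMS data. The settings `max_iter=1000`, `tol=None` and `random_state=0` are choices made for this sketch, not recommended values:

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import Perceptron

X = np.array([[2.7810836, 2.550537003], [1.465489372, 2.362125076],
              [3.396561688, 4.400293529], [1.38807019, 1.850220317],
              [3.06407232, 3.005305973], [7.627531214, 2.759262235],
              [5.332441248, 2.088626775], [6.922596716, 1.77106367],
              [8.675418651, -0.242068655], [7.673756466, 3.508563011]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# tol=None disables early stopping, so all max_iter passes are run
model = Perceptron(max_iter=1000, tol=None, random_state=0)
model.fit(X, y)
predictions = model.predict(X)

print(model.intercept_, model.coef_)  # learned w_0 and [w_1, w_2]
print(metrics.confusion_matrix(y, predictions))
```

For the lab itself, the same `fit` and `predict` calls accept the sparse tf-idf matrices directly.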

In [None]:
from sklearn.linear_model import Perceptron
# TODO use `Perceptron` class instead of `perceptron_training`