# Classification Tutorial

In this tutorial, we will learn how to do classification in Python using scikit-learn and also see how to do the same using Vowpal Wabbit. We will classify a piece of text into two categories: spam and not spam (called ham). We will also learn about feature representation using Bag of Words (BOW).

# Goals

**(1) Data exploration: Load and understand data**

**(2) Feature engineering**
- Feature Extraction/Engineering using Bag Of Words (BOW)
- Data preparation
- Transform data into BOW

**(3) Classification Metrics**

**(4) Classifiers and model evaluation**
- Majority Class classifier
- Logistic regression for classification
- Classification using Vowpal Wabbit

**(5) Summary**

# Data exploration: Load and understand data

In [None]:
# import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg 
from sklearn.model_selection import train_test_split
%matplotlib inline

In [None]:
# Read the data into a dataframe using pandas
df = pd.read_csv('spam_ham_v23.csv', sep = ",")
df.dropna(inplace=True)

In [None]:
len(df)

In [None]:
# Let's look at the first few samples
df.head()

In [None]:
# And the last few samples (notice the file seems to be sorted, so it's important that we shuffle it later)
df.tail()

In [None]:
df.groupby('Category').count()

# Feature engineering

In [None]:
# We work with numeric labels throughout (the metric helpers below compare against 1.), so map ham -> 0. and spam -> 1.
def return_label(x):
    if x=="ham":
        return 0.
    else:
        return 1.
    
df['label'] = df['Category'].apply(return_label)

The data that we have must be conducive to learning. This is where one needs some domain knowledge. Consider how you would classify text into spam or not spam yourself in a language you are completely unfamiliar with.

You don't have enough time to actually learn the language, so you look for heuristics that help you classify accurately. These give you a clue as to what might be useful to a computer.

In [None]:
# Let's look at message length
df['msg_len'] = df['Message'].apply(len)

In [None]:
df.head()

Intuition for splitting data into train and test:

When you want to measure how well you have learnt something, you want to do some practice problems or take a practice exam. That is exactly what is happening here. You are putting some questions away as practice questions for yourself and learning from the rest.
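
To make the "practice exam" idea concrete, here is a minimal pure-Python sketch of a shuffled hold-out split. The helper name `shuffle_and_split` and the toy rows are assumptions made for illustration; this is not how scikit-learn implements `train_test_split`, only the same idea in miniature.

```python
import random

def shuffle_and_split(rows, test_size=0.2, seed=2019):
    """Shuffle a copy of rows, then hold out a test_size fraction as the test set."""
    rng = random.Random(seed)      # fixed seed makes the split reproducible
    shuffled = list(rows)          # copy so the caller's data is left untouched
    rng.shuffle(shuffled)          # shuffling matters when the file is sorted by class
    n_test = int(round(len(shuffled) * test_size))
    split_at = len(shuffled) - n_test
    return shuffled[:split_at], shuffled[split_at:]

# Toy data sorted by label, much like the file above appears to be
rows = [("ham", i) for i in range(8)] + [("spam", i) for i in range(2)]
train, test = shuffle_and_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Without the shuffle, a sorted file would put only one class into the held-out part, and the practice exam would no longer resemble the real one.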

In [None]:
training_data, testing_data = train_test_split(df,random_state = 2019, test_size = 0.2)

Y_train=training_data['label'].values
Y_test=testing_data['label'].values

testing_data.groupby('label')['label'].count()

In [None]:
training_data[training_data['Category'] == 'ham']['msg_len'].describe()

Let's filter the dataframe by category (i.e. spam or ham) and then plot histograms that show how many messages there are of each message length.

In [None]:
hist_ham = training_data[training_data['Category'] == 'ham']['msg_len']
plt.figure(figsize = (10,5))
plt.xlim(0,10000)
plt.hist(hist_ham, bins=range(min(hist_ham), max(hist_ham) + 10, 10))

In [None]:
training_data[training_data['Category'] == 'spam']['msg_len'].describe()

In [None]:
hist_spam = training_data[training_data['Category'] == 'spam']['msg_len']
plt.figure(figsize = (10,5))
plt.xlim(0,10000)
plt.hist(hist_spam, bins=range(min(hist_spam), max(hist_spam) + 10, 10))

It looks like message length could be a good indicator of the message category.

# Feature Extraction/Engineering using Bag Of Words (BOW)

When dealing with text, one needs to be careful. Right now, we are not linguists who understand the semantics of the language. We want to deal with this abstractly. We want our method to be language agnostic. 

BOW is one such method. A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

 - A vocabulary of known words.
 - A measure of the presence of known words.
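
The two ingredients above can be sketched by hand in a few lines. This is only an illustration on made-up sentences with a deliberately crude whitespace tokenizer; the names `vocabulary` and `to_counts` are chosen here and are not part of any library.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]

# Tokenize: lowercase and split on whitespace (a deliberately crude tokenizer)
tokenized = [doc.lower().split() for doc in docs]

# 1. Vocabulary: every distinct word, sorted so the column order is stable
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# 2. Presence measure: here, how many times each vocabulary word occurs
def to_counts(tokens):
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

bow = [to_counts(tokens) for tokens in tokenized]

print(vocabulary)
for row in bow:
    print(row)
```

Each document becomes one row of counts, and word order inside the document is discarded, which is where the "bag" in the name comes from.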

In [None]:
s1 = "Pavithra likes everything Harry Potter "
s2 = "Pavithra is a Potter Head "
s3 = "Clearly, Pavithra likes talking about pavithra "

# Tokenization
print(s1.split())
print(s2.split())
print(s3.split())

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vec = CountVectorizer()

In [None]:
test = count_vec.fit_transform([s3])

In [None]:
test

In [None]:
print(test)

In [None]:
print(test.todense())

In [None]:
count_vec.get_feature_names_out()[3]

In [None]:
count_vec.get_feature_names_out()

In [None]:
corpus = [s1,s2,s3]
corpus

In [None]:
image = mpimg.imread("BOW.PNG")
plt.figure(figsize = (10,10))
plt.imshow(image)
plt.show()

In [None]:
test = count_vec.fit_transform(corpus)

In [None]:
print(test.todense())
print(count_vec.get_feature_names_out())

In [None]:
count_vec = CountVectorizer(stop_words='english')

In [None]:
test = count_vec.fit_transform(corpus)

In [None]:
print(test.todense())
print(count_vec.get_feature_names_out())

In [None]:
s3 = "Pavithra also likes LOTR"
doc_mat = count_vec.transform([s1,s2,s3])

In [None]:
print(doc_mat.todense())
print(count_vec.get_feature_names_out())

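Notice the difference between fit_transform and transform:
 - fit_transform: learns the vocabulary dictionary from the given documents and returns the document-term matrix.
 - transform: converts documents to a document-term matrix using the vocabulary that was already learned; words outside that vocabulary (such as LOTR above) are ignored.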
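
The split between learning a vocabulary and applying it can be mimicked with a toy class. `ToyCountVectorizer` below is a hypothetical stand-in written for this illustration; it is not scikit-learn code and skips everything `CountVectorizer` does beyond counting whitespace-separated words.

```python
class ToyCountVectorizer:
    """Hypothetical stand-in that mimics the fit/transform split."""

    def fit(self, docs):
        # Learn the vocabulary: map each distinct word to a column index
        words = {w for doc in docs for w in doc.lower().split()}
        self.vocabulary_ = {w: i for i, w in enumerate(sorted(words))}
        return self

    def transform(self, docs):
        # Count words using the vocabulary learned earlier; unseen words are skipped
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for w in doc.lower().split():
                if w in self.vocabulary_:
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)

vec = ToyCountVectorizer()
train_rows = vec.fit_transform(["harry potter", "potter head"])
new_rows = vec.transform(["harry likes lotr"])   # "likes" and "lotr" were never seen

print(vec.vocabulary_)
print(train_rows)
print(new_rows)
```

This is why the vocabulary should be learned from the training data only and then reused, unchanged, on the test data.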

In [None]:
doc_mat.shape

In [None]:
doc_mat.nnz

In [None]:
print("density (share of non-zero entries): %.2f%%" % (100.0 * doc_mat.nnz/ (doc_mat.shape[0] * doc_mat.shape[1])))

# Transform data into BOW

In [None]:
BOW_transformer = CountVectorizer(stop_words='english')
BOW_transformer.fit_transform(training_data['Message'].values)

In [None]:
train_BOW_features = BOW_transformer.transform(training_data['Message'].values)
test_BOW_features = BOW_transformer.transform(testing_data['Message'].values)

In [None]:
# append length of message to features
from scipy.sparse import hstack
X_train = hstack((train_BOW_features,training_data['msg_len'].values[:,None]))
X_test = hstack((test_BOW_features,testing_data['msg_len'].values[:,None]))

# Classification Metrics

Once we have our classifier make predictions, we want some notion of how good our classifier is. Below are three metrics that are most commonly used in classification.

(1) Accuracy
-  How many of your predictions matched the ground truth?

(2) Precision
-  Among the ones that you predicted as spam, how many were actually spam?

(3) Recall
- Among the ones that were actually spam, how many did you predict as spam?
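
A small worked example may help keep the three definitions apart. The labels below are made up (1 = spam, 0 = ham); the counts `tp`, `fp`, `fn` and `tn` are the usual true/false positive/negative tallies.

```python
truth      = [1, 1, 1, 0, 0, 0, 0, 0]
prediction = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(truth, prediction))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that slipped through
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham correctly passed

accuracy = (tp + tn) / len(pairs)                    # 6 of 8 predictions match
precision = tp / (tp + fp) if (tp + fp) else 0.0     # 2 of 3 flagged are spam
recall = tp / (tp + fn) if (tp + fn) else 0.0        # 2 of 3 spam were flagged

print(accuracy, precision, recall)
```

Accuracy looks at every prediction, precision only at the messages flagged as spam, and recall only at the messages that truly are spam.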


**Which ones should we care about?**

It depends on the problem!

# How well does a majority class classifier do?

(1) What is a majority classifier?

- A majority classifier is one that predicts the majority class in the dataset for every sample. 

(2) Why should we care about this?

- This gives us a **baseline** to compare to. This is the simplest thing you can do if you had no access to anything else but the labels. One would expect our algorithms to do better than this.

Caveat: a strict majority class is only guaranteed in binary classification; with more classes, the analogous baseline predicts the most frequent class.
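
As a sketch of the idea, the baseline can be derived from the training labels instead of being typed in by hand. The label lists here are invented for illustration (0 = ham, 1 = spam).

```python
from collections import Counter

train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_labels = [0, 1, 0, 0]

# The most frequent training label becomes the prediction for every test sample
majority_label, _ = Counter(train_labels).most_common(1)[0]
baseline_pred = [majority_label] * len(test_labels)

baseline_accuracy = sum(
    t == p for t, p in zip(test_labels, baseline_pred)
) / len(test_labels)

print(majority_label, baseline_accuracy)
```

On imbalanced data this baseline can reach a high accuracy while never finding a single spam message, which is one reason accuracy alone can mislead.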

In [None]:
pred_majority = np.array([0.]*len(testing_data))

In [None]:
testing_data.head()

In [None]:
def get_accuracy(truth, prediction):
    count_correct_pred = 0
    for i in range(len(truth)):
        if truth[i] == prediction[i]:
            count_correct_pred +=1
    return count_correct_pred/len(truth)

In [None]:
get_accuracy(Y_test, pred_majority)

In [None]:
# alternative way to compute accuracy
np.mean(Y_test == pred_majority)

In [None]:
def get_precision(truth, prediction):
    count_prec = 0
    for i in range(len(truth)):
        if prediction[i] == 1. and truth[i] == 1.:
            count_prec +=1
    if np.sum(prediction) == 0.:
        return 0.
    else:
        return count_prec/np.sum(prediction)

In [None]:
get_precision(Y_test, pred_majority)

In [None]:
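# alternative way to compute precision (gives nan with a warning when nothing is predicted as spam, as is the case for the majority classifier)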
np.mean(Y_test[pred_majority == 1])

In [None]:
def get_recall(truth, prediction):
    count_recall = 0
    for i in range(len(truth)):
        if prediction[i] == 1. and truth[i] == 1.:
            count_recall +=1
    # guard against division by zero when there are no actual spam samples
    if np.sum(truth) == 0.:
        return 0.
    else:
        return count_recall/np.sum(truth)

In [None]:
get_recall(Y_test, pred_majority)

In [None]:
# alternative way to compute recall
np.mean(pred_majority[Y_test == 1])

In [None]:
import sklearn.metrics as metrics

print(metrics.accuracy_score(Y_test,pred_majority))

print(metrics.precision_score(Y_test,pred_majority))

print(metrics.recall_score(Y_test,pred_majority))

In [None]:
print("Confusion Matrix: ")
print(metrics.confusion_matrix(Y_test, pred_majority))

# Logistic Regression for classification

Logistic Regression is a statistical model that is used in classification. It can be used for binary and multiclass classification. It passes a weighted sum of the features through the logistic (sigmoid) function to obtain a class probability, and the weights are fit by minimizing the logistic loss (log loss). For more info: [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression)
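
As a rough sketch of the prediction step only (not of training), the probability is the sigmoid of a weighted sum of the features. The weights, bias and feature values below are made-up numbers chosen for illustration, not values learned from the data.

```python
import math

def sigmoid(z):
    # Maps any real score to the interval (0, 1); fine for moderate z
    return 1.0 / (1.0 + math.exp(-z))

def predict_spam_probability(features, weights, bias):
    # Linear score first, then squash it into a probability
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

weights = [1.5, -2.0]   # hypothetical: one weight per feature
bias = -0.5             # hypothetical intercept
p = predict_spam_probability([2, 0], weights, bias)

print(round(p, 3))
```

A positive weight pushes the probability towards spam as the feature grows, and a negative weight pushes it towards ham.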

In [None]:
image = mpimg.imread("LR.PNG")
plt.figure(figsize = (10,15))
plt.imshow(image)
plt.show()

In [None]:
from sklearn.linear_model import LogisticRegression

Spam_model = LogisticRegression(solver='liblinear', penalty='l1') 

# To learn more about the solvers:
# https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-defintions

Spam_model.fit(X_train, Y_train)
pred = Spam_model.predict(X_test)

Caveats:

- This is not a cookbook that you can apply unchanged to another dataset.
- Data exploration and thinking about features are very important.
- Parameter tuning was made to look easy here, but it involves a lot of work.
- Regularization needs care (the model above uses an L1 penalty with the default strength).
- Cross validation was not used here; we relied on a single train/test split.

# Thresholding ... Defining the classification boundary

Besides hard labels, the model can output a probability for each class. What do we do with these probabilities? We have to pick a threshold and classify based on that threshold.

To understand what we need, let's do the following:
1. Find all data points which had true value as "spam"
2. Plot a histogram of the spam probabilities. 

Remember, if our classifier was perfect, it would have given high spam probability for all the data points which had true value as spam.

In [None]:
probs = Spam_model.predict_proba(X_test)

In [None]:
probs

In [None]:
prob_spam = probs[:, 1]
prob_ham = probs[:, 0]

In [None]:
plt.hist(prob_spam[Y_test == 1])

Here we see that our classifier gets most of them right. Specifically, most of the data points that were indeed spam seem to have been assigned a high 'spam probability' value.

What is a threshold? It is a value $\eta$ such that whenever the 'spam probability' is greater than $\eta$, we classify the data point as spam.

If we choose $\eta$ to be 0, every data point that is indeed spam is classified as spam, but so is essentially every ham message. If we choose $\eta$ to be 0.5 instead, we misclassify those spam data points whose 'spam probability' is less than 0.5. The histogram helps us see where we should really place $\eta$.

Note that there is a dual viewpoint here. Let's plot the same histogram for ham as well. Here $\eta$ works in the opposite way: whatever we set $\eta$ to, we misclassify all the ham data points whose 'spam probability' is larger than $\eta$. That is why we will then place both histograms on top of each other.

In [None]:
plt.hist(prob_spam[Y_test == 0], color = 'orange')

Let us now juxtapose.

In [None]:
plt.hist(prob_spam[Y_test == 1])
plt.hist(prob_spam[Y_test == 0])

Thus, whatever we set $\eta$ to, we misclassify the spam messages (blue, with matplotlib's default colors) that fall to its left and the ham messages (orange) that fall to its right.
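
The trade-off can also be counted rather than read off a plot. The sketch below uses a handful of invented probabilities and labels; `errors_at` is a helper name chosen for this illustration.

```python
# Made-up 'spam probabilities' and true labels (1 = spam, 0 = ham)
spam_prob = [0.95, 0.80, 0.40, 0.10, 0.30, 0.60, 0.05]
is_spam   = [1,    1,    1,    0,    0,    0,    0]

def errors_at(eta):
    """Return (missed spam, wrongly flagged ham) for a given threshold."""
    predicted = [1 if p > eta else 0 for p in spam_prob]
    missed_spam = sum(1 for y, yhat in zip(is_spam, predicted) if y == 1 and yhat == 0)
    flagged_ham = sum(1 for y, yhat in zip(is_spam, predicted) if y == 0 and yhat == 1)
    return missed_spam, flagged_ham

for eta in (0.2, 0.5, 0.7):
    print(eta, errors_at(eta))
```

Raising the threshold trades wrongly flagged ham for missed spam; which side matters more depends on the problem.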

In [None]:
plt.hist(prob_ham[Y_test == 0])

# Model evaluation

In [None]:
pd.Series(pred).value_counts()

In [None]:
# Accuracy: How many of your predictions matched the ground truth?
print(metrics.accuracy_score(Y_test,pred))

# Precision: Among the ones that you predicted as spam, how many were actually spam?
print(metrics.precision_score(Y_test,pred))

# Recall: Among the ones that were actually spam, how many did you predict as spam?
print(metrics.recall_score(Y_test,pred))

In [None]:
print("accuracy: ",get_accuracy(Y_test,pred))
print("precision: ", get_precision(Y_test,pred))
print("recall: ", get_recall(Y_test,pred))

In [None]:
print(metrics.confusion_matrix(Y_test, pred))

With better features, better feature engineering and more powerful algorithms, we can often do better! The stakes can be very high in some classification problems (cancer or no cancer), and different evaluation measures say different things, so one must always choose the measure that best fits the problem.

# Using a larger dataset

In [None]:
# Read the data into a dataframe using pandas
df = pd.read_csv('Spam_Ham_V2345.csv', sep = ",")

from sklearn.model_selection import train_test_split
training_data, testing_data = train_test_split(df,random_state = 2019, test_size = 0.2)
training_data = training_data.head(212014) # to make sure we are working with the same data when we explore vowpal wabbit

Y_train=training_data['label'].values
Y_test=testing_data['label'].values

In [None]:
import time
start_time = time.time()
BOW_transformer = CountVectorizer(stop_words='english')
BOW_transformer.fit_transform(training_data['Message'].values)

train_BOW_features = BOW_transformer.transform(training_data['Message'].values)
test_BOW_features = BOW_transformer.transform(testing_data['Message'].values)

# append length of message to features
from scipy.sparse import hstack
X_train = hstack((train_BOW_features,training_data['msg_len'].values[:,None]))
X_test = hstack((test_BOW_features,testing_data['msg_len'].values[:,None]))

Spam_model = LogisticRegression(solver='liblinear', penalty='l1') 

Spam_model.fit(X_train, Y_train)
pred = Spam_model.predict(X_test)

print(time.time() - start_time)

In [None]:
print("accuracy: ",get_accuracy(Y_test,pred))
print("precision: ", get_precision(Y_test,pred))
print("recall: ", get_recall(Y_test,pred))

# Classification using Vowpal Wabbit

In [None]:
from vowpalwabbit import pyvw
import re

In [None]:
def to_vw_format(document, label=None):
    return str(label or '') + ' |text ' + ' '.join(re.findall(r'\w{3,}', document.lower())) + '\n'

In [None]:
training_data.values[1][2]

In [None]:
to_vw_format(str(training_data.values[1][2]), 1 if Y_train[1] == 1 else -1)
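
The notebook does not show how the `.vw` files used in the next cells were created. One plausible way, sketched here under that assumption, is to write one formatted line per message. The toy messages, the file name `toy_train.vw` and the temporary directory are made up for illustration; the helper repeats `to_vw_format` from above so the sketch runs on its own.

```python
import os
import re
import tempfile

def to_vw_format(document, label=None):
    # Keep lowercase tokens of three or more word characters
    tokens = re.findall(r'\w{3,}', document.lower())
    return str(label or '') + ' |text ' + ' '.join(tokens) + '\n'

messages = ["Win a FREE prize now", "See you at lunch tomorrow"]
labels = [1, 0]   # 1 = spam, 0 = ham, as in the dataframe

# Hypothetical output path, used only for this sketch
vw_path = os.path.join(tempfile.gettempdir(), "toy_train.vw")
with open(vw_path, "w") as out:
    for message, label in zip(messages, labels):
        # Logistic loss in Vowpal Wabbit expects labels -1 and 1
        out.write(to_vw_format(message, 1 if label == 1 else -1))

with open(vw_path) as f:
    lines = f.read().splitlines()
print(lines)
```

Note that `str(label or '')` drops a label of 0, so the mapping to -1 and 1 has to happen before the helper is called.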

In [None]:
start_time = time.time()
vw_learn1 = pyvw.vw("--loss_function logistic -d ham_spam_train_v2345.vw -f vw_model")
vw_learn1.run_parser()
vw_learn1.finish()

vw_predict1 = pyvw.vw("-d ham_spam_test_v2345.vw -i vw_model -t -p vw_preds.txt --binary")
vw_predict1.run_parser()
vw_predict1.finish()
print(time.time() - start_time)

In [None]:
with open('vw_preds.txt') as pred_file:
    vw_preds = [float(label) 
                             for label in pred_file.readlines()]
vw_preds_formatted = [0 if x == -1 else x for x in vw_preds]

In [None]:
Y_test = []
with open('ham_spam_test_v2345.vw', 'r') as f:
    for line in f:
        Y_test.append(float(line.split('|')[0]))
Y_test_formatted = [0 if x == -1. else x for x in Y_test]

In [None]:
print("accuracy: ",get_accuracy(Y_test_formatted,vw_preds_formatted))
print("precision: ", get_precision(Y_test_formatted,vw_preds_formatted))
print("recall: ", get_recall(Y_test_formatted,vw_preds_formatted))

To learn more about Vowpal Wabbit, visit http://vowpalwabbit.org/. 

# Summary

Things to explore on your own:
- [decision trees](https://scikit-learn.org/stable/modules/tree.html)
- [random forests](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- better feature engineering ([think tf-idf](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer))
- regularization
    - recap: in high dimensional data, you want to _regularize_, i.e. shrink the coefficients of the predictors towards zero, when you fit a model with all predictors, in order to reduce variance
    - [ridge regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)
    - [lasso regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)
- [cross validation](https://scikit-learn.org/stable/modules/cross_validation.html)
    - recap: in the absence of designated test set, how to estimate test error?
- feature importance
- multiclass classification
- [naive bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)
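
For the cross validation recap in the list above, here is a minimal sketch of how k-fold splitting partitions sample indices. `k_fold_indices` is a name chosen for this illustration; in practice you would shuffle the data first and would most likely use the scikit-learn utilities linked above.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Averaging a metric over the k validation folds gives an estimate of test error without setting aside a separate test set.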

Beware of:
- highly correlated variables
- exploding features

This tutorial is not meant to be a recipe that one can blindly follow. If it were, there would never have been a need for a Data Scientist :-) It was only meant to give a basic overview of classification with logistic regression and how to implement it using standard libraries. More importantly, we hope you now have an idea of what sort of questions one should ask while creating a mathematical model, and that this has piqued your interest in exploring the capabilities of machine learning.