# Identify Spam in SMS Using Active Learning

## Dr. Omri Allouche 
(omri.allouche@gmail.com)

This notebook is the 2nd and final part in a series analyzing the effect of Active Learning on classification tasks.

In the previous notebook, we used the MNIST dataset.  
In this notebook, we'll use a dataset labeling SMS as spam/ham.  
The dataset is available for download at http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection.

In [None]:
# Import relevant packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.xkcd()

plt.rcParams["figure.figsize"] = (48,20)
# plt.rcParams["font.size"] = 14
plt.rcParams["axes.titlesize"] = 28
plt.rcParams["axes.labelsize"] = 24
# plt.rcParams["figure.titlesize"] = 50

np.random.seed(42)

In [None]:
plt.rcParams["figure.figsize"] = (20,12)

In [None]:
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report, f1_score

## Load Data
Let's first load the data. We'll use a dataset of ~5,600 text messages.  
Each row contains the text of the message and its label.

In [None]:
# Read data into a DataFrame
df = pd.read_table('SMSSpamCollection', names=['y', 'text'])
df = df.sample(frac=1)
df.head()

In [None]:
# Let's get basic summary statistics
df.groupby('y').describe()

### Analyzing a basic Classifier
Let's build a simple classifier.  
We first split our data into train (80%) and test (20%) sets.  

Our model will perform the following steps:
1. Remove stopwords
1. Count word occurrences
1. Perform Tf-Idf transformation on counts
1. Use a Multinomial Naive-Bayes classifier

In [None]:
# Split data to train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['y'], test_size=0.20, random_state=0)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

Now let's fit the model to the training set:
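In the notebook, this step presumably amounts to `model.fit(X_train, y_train)`. The sketch below rebuilds the same pipeline and fits it on a tiny made-up corpus so that it runs on its own. `toy_texts` and `toy_labels` are illustrative stand-ins, not rows from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same pipeline as defined above
model = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Toy stand-ins for X_train / y_train (made-up examples, for illustration only)
toy_texts = [
    "WINNER! Claim your free prize now",
    "Free entry to win cash, text now",
    "Are we still meeting for lunch today?",
    "Call me when you get home",
]
toy_labels = ['spam', 'spam', 'ham', 'ham']

# In the notebook: model.fit(X_train, y_train)
model.fit(toy_texts, toy_labels)
print(model.predict(["free cash prize"]))
```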

### Evaluate Model Performance

Display the classification report for the training set:

and for the test set:

and the f1 score:
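A minimal sketch of the evaluation step, assuming the labels are the strings `'ham'` and `'spam'` as in the dataset file. The helper name `report_performance` is hypothetical. Because the labels are strings, `f1_score` needs an explicit `pos_label`; treating `'spam'` as the positive class is an assumption.

```python
from sklearn.metrics import classification_report, f1_score

def report_performance(y_true, y_pred, pos_label='spam'):
    """Print a per-class report and return the F1 score of the positive class."""
    print(classification_report(y_true, y_pred, zero_division=0))
    # The default pos_label=1 does not exist among string labels, so pass it explicitly
    return f1_score(y_true, y_pred, pos_label=pos_label, zero_division=0)

# Toy usage with made-up labels; in the notebook this would be, for example,
# report_performance(y_train, model.predict(X_train)) and
# report_performance(y_test, model.predict(X_test))
y_true = ['ham', 'ham', 'spam', 'spam', 'ham']
y_pred = ['ham', 'ham', 'spam', 'ham', 'ham']
toy_f1 = report_performance(y_true, y_pred)
print(toy_f1)
```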

### Plot model confidence
Let's plot the confidence level of our model for different samples:

We can see that while the model is very confident for most of the samples, there are about 100 samples (~10%) with low confidence.  

Note that we define confidence as the prediction probability of the predicted class.  
Since there are only 2 classes, the minimal value of confidence in this case is 0.5.
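The definition above can be sketched as a one-liner on the output of `predict_proba`. The helper name `prediction_confidence` is hypothetical, and the probabilities below are made up; in the notebook the input would be `model.predict_proba(X_test)`.

```python
import numpy as np

def prediction_confidence(proba):
    """Confidence = probability assigned to the predicted (most likely) class."""
    return np.asarray(proba).max(axis=1)

# Toy predict_proba output for 3 samples and 2 classes (made-up numbers)
proba = np.array([[0.98, 0.02],
                  [0.45, 0.55],
                  [0.50, 0.50]])
confidence = prediction_confidence(proba)
print(confidence)
```

With two classes the larger of the two probabilities can never drop below 0.5, which is why the confidence axis starts there.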

Next, let's review some of the examples that got the lowest confidence.  
We can see that in many of these cases, our model misclassifies them.
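One way to list those examples is to put texts, labels, predictions and confidence side by side and sort by confidence. This is a sketch with a hypothetical helper name and made-up inputs; in the notebook the arguments would come from `X_test`, `y_test`, `model.predict(X_test)` and the confidence values.

```python
import pandas as pd

def lowest_confidence_examples(texts, y_true, y_pred, confidence, n=10):
    """Return the n samples the model was least sure about, with a correctness flag."""
    table = pd.DataFrame({
        'text': list(texts),
        'y_true': list(y_true),
        'y_pred': list(y_pred),
        'confidence': list(confidence),
    })
    table['correct'] = table['y_true'] == table['y_pred']
    return table.sort_values('confidence').head(n)

# Toy usage (made-up values)
toy_table = lowest_confidence_examples(
    texts=['a', 'b', 'c'],
    y_true=['ham', 'spam', 'ham'],
    y_pred=['ham', 'ham', 'ham'],
    confidence=[0.99, 0.51, 0.70],
    n=2,
)
print(toy_table)
```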

### Is the model confidence correlated with its performance?
Now, let's examine the model's performance as a function of its confidence (i.e., the probability it assigns to the predicted class).  
We first get predictions on the test set, and save whether they are correct or not.  
We then use Seaborn's `regplot` to plot average performance, using `x_bins=20`.
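The plot itself would come from a call along the lines of `sns.regplot(x=confidence, y=correct, x_bins=20)`. The sketch below computes roughly the quantity such a binned plot displays: mean accuracy within quantile bins of confidence. The helper name `accuracy_by_confidence` and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def accuracy_by_confidence(confidence, correct, bins=20):
    """Mean accuracy within quantile bins of confidence."""
    data = pd.DataFrame({
        'confidence': np.asarray(confidence, dtype=float),
        'correct': np.asarray(correct, dtype=float),
    })
    # Quantile bins; duplicates='drop' merges bins when many samples share a value
    data['bin'] = pd.qcut(data['confidence'], q=bins, duplicates='drop')
    return data.groupby('bin', observed=True).agg(
        mean_confidence=('confidence', 'mean'),
        accuracy=('correct', 'mean'),
        n=('correct', 'size'),
    )

# Synthetic example: predictions are "correct" only when confidence exceeds 0.75
toy_confidence = np.linspace(0.5, 1.0, 200)
toy_correct = toy_confidence > 0.75
binned = accuracy_by_confidence(toy_confidence, toy_correct, bins=4)
print(binned)
```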

As we can see, our model is fairly aware of its own performance: it performs well when its confidence is high, and makes more mistakes when its confidence is low.

This serves as a trigger for Active Learning.

## Learning Curve
Let's plot the learning curve of the classifier, i.e., its performance as a function of the number of labeled samples.  

Each time, we take a subset of the train dataset, and use only it to train the model.  
We then calculate performance on the test set (that's left intact).  
We append the results of each run into a `history` variable.

We write a function `def learning_curve(model, train_set_size_list, X_train, y_train, X_test, y_test)` and use it later:
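A possible implementation of this function is sketched below. The signature follows the text; the rest is an assumption: `history` is a list of dicts with keys `train_size`, `accuracy` and `f1`, each run trains on the first `n` rows of the (already shuffled) training set, and `'spam'` is treated as the positive class. The toy corpus and `toy_model` are made up so the block runs on its own.

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def learning_curve(model, train_set_size_list, X_train, y_train, X_test, y_test):
    """Train on the first n samples for each n and score on the fixed test set."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    history = []
    for n in train_set_size_list:
        n = min(int(n), len(X_train))
        fitted = clone(model).fit(X_train[:n], y_train[:n])  # fresh copy per run
        y_pred = fitted.predict(X_test)
        history.append({
            'train_size': n,
            'accuracy': accuracy_score(y_test, y_pred),
            # 'spam' as the positive class is an assumption about the label names
            'f1': f1_score(y_test, y_pred, pos_label='spam', zero_division=0),
        })
    return history

# Toy usage (made-up data and a simplified model, for illustration only)
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_X = ["free prize now", "win cash free", "lunch today?",
         "see you at home", "free cash prize win", "meeting at noon"]
toy_y = ['spam', 'spam', 'ham', 'ham', 'spam', 'ham']
toy_history = learning_curve(toy_model, [4, 6], toy_X, toy_y,
                             ["win a free prize", "see you at lunch"], ['spam', 'ham'])
print(toy_history)
```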

We next calculate the learning curve:

and plot the learning curve with the f1 score:
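A plotting helper for this could look as follows. It is a sketch that assumes each history is a list of dicts with a `train_size` key and one key per metric, which is an assumption about the format of `history`. Passing several named histories overlays their curves, which is also handy for comparing sampling strategies.

```python
import matplotlib.pyplot as plt

def plot_learning_curves(histories, metric='f1', x='train_size'):
    """Overlay one curve per named history (a list of dicts, one dict per run)."""
    fig, ax = plt.subplots()
    for label, history in histories.items():
        ax.plot([run[x] for run in history],
                [run[metric] for run in history],
                marker='o', label=label)
    ax.set_xlabel('Number of labeled samples')
    ax.set_ylabel(metric)
    ax.set_title('Learning curve')
    ax.legend()
    return ax

# In the notebook, for example:
# plot_learning_curves({'random sampling': history})
```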

# Applying Active Learning
Let's compare this to an Active Learning algorithm.  
We first create a dataframe `df` with columns 'X', 'y' and content from the training set:

next we define the function `run_active_learner(df, model, num_samples_in_active_learning_batch, select_next_batch_func)`:
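The loop could be sketched as below. Several details are assumptions rather than the original implementation: the test set is passed in explicitly (`X_test`, `y_test`) instead of being read from notebook globals, the first batch is drawn at random because no fitted model exists yet, the selector is called as `select_next_batch_func(model, pool, batch_size)` and returns index labels, and `'spam'` is the positive class. The toy data and the placeholder selector `take_first` exist only to make the block runnable.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def run_active_learner(df, model, num_samples_in_active_learning_batch,
                       select_next_batch_func, X_test, y_test, random_state=42):
    """Pool-based active learning loop; returns one history entry per batch."""
    labeled = pd.Series(False, index=df.index)  # which rows have been "labeled"
    history = []
    for i, batch_size in enumerate(num_samples_in_active_learning_batch):
        pool = df[~labeled]
        if pool.empty:
            break
        batch_size = min(batch_size, len(pool))
        if i == 0:
            # No fitted model yet: seed with a random batch
            batch_index = pool.sample(n=batch_size, random_state=random_state).index
        else:
            batch_index = select_next_batch_func(model, pool, batch_size)
        labeled.loc[batch_index] = True  # reveal the labels of the chosen rows
        train = df[labeled]
        model.fit(train['X'], train['y'])  # refit in place on everything labeled so far
        y_pred = model.predict(X_test)
        history.append({
            'num_labeled': int(labeled.sum()),
            'accuracy': accuracy_score(y_test, y_pred),
            'f1': f1_score(y_test, y_pred, pos_label='spam', zero_division=0),
        })
    return history

def take_first(model, pool, batch_size):
    """Placeholder selector: simply takes the first rows of the pool."""
    return pool.index[:batch_size]

# Toy usage (made-up data)
toy_df = pd.DataFrame({
    'X': ["free prize now", "lunch today?", "win cash free",
          "see you at home", "free cash prize win", "meeting at noon"],
    'y': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham'],
})
toy_al_history = run_active_learner(
    toy_df, make_pipeline(CountVectorizer(), MultinomialNB()), [2, 2, 2],
    take_first, X_test=["win a free prize", "see you at lunch"], y_test=['spam', 'ham'])
print(toy_al_history)
```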

and the function `select_next_batch_func` for selecting the next batch for labeling:
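A common choice here is uncertainty sampling: ask for labels on the samples where the model's confidence is lowest. The sketch below assumes the selector receives the fitted model, the unlabeled pool as a DataFrame with an `X` column, and the batch size, and that it returns index labels. That calling convention is an assumption, and `_StubModel` only stands in for a fitted classifier.

```python
import numpy as np
import pandas as pd

def select_next_batch_func(model, pool, batch_size):
    """Uncertainty sampling: pick the pool rows with the lowest confidence."""
    proba = model.predict_proba(pool['X'])
    confidence = proba.max(axis=1)
    lowest = np.argsort(confidence)[:batch_size]  # positions of least confident rows
    return pool.index[lowest]

class _StubModel:
    """Stand-in with fixed probabilities, only to make the example runnable."""
    def predict_proba(self, X):
        return np.array([[0.90, 0.10], [0.55, 0.45], [0.20, 0.80], [0.50, 0.50]])

toy_pool = pd.DataFrame({'X': ['a', 'b', 'c', 'd']}, index=[10, 11, 12, 13])
chosen = list(select_next_batch_func(_StubModel(), toy_pool, 2))
print(chosen)
```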

In [None]:
num_samples_in_active_learning_batch = [50] + [10]*150 + [25]*100
al_history = run_active_learner(df, model, num_samples_in_active_learning_batch, select_next_batch_func)

and plot the model's learning curve with and without active learning:

## Using selected instances in a new model
Next, let's check if the observations we asked to label based on the confidence of the Multinomial Naive Bayes model are helpful for other models. We'll train a Logistic Regression model with L1 regularization.  
We'll plot its learning curve for randomly selected samples along with the learning curve for observations that were chosen for labeling by our previous Naive Bayes classifier.

We first define a new pipeline for model2 using Logistic Regression with L1 regularization:
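Such a pipeline might look like the sketch below. Reusing the same vectorizer and Tf-Idf steps, the `liblinear` solver and the default regularization strength are assumptions; the text only specifies Logistic Regression with L1 regularization.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

model2 = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    # liblinear is one of the solvers that supports the L1 penalty
    ('clf', LogisticRegression(penalty='l1', solver='liblinear')),
])
```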

In [None]:
train_set_size_list = [10,20,30,40] + list(np.arange(50,500,20)) + list(np.arange(500,4000,100))
history_logistic_regression = learning_curve(model2, train_set_size_list, X_train, y_train, X_test, y_test)

Next we define the function `run_active_learner_with_different_model(df, model, model2, num_samples_in_active_learning_batch, select_next_batch_func)` that also computes the accuracy and f1 on the test set when model2 is fitted to the labeled instances:
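The extra step inside that loop can be sketched as a small helper that fits a fresh copy of the second model on whatever has been labeled so far and scores it on the test set. The helper name `score_second_model` and the returned keys are hypothetical. The toy usage uses a plain `LogisticRegression` as a stand-in for `model2`.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline

def score_second_model(model2, labeled_df, X_test, y_test, pos_label='spam'):
    """Fit a fresh copy of model2 on the instances labeled so far and score it."""
    fitted = clone(model2).fit(labeled_df['X'], labeled_df['y'])
    y_pred = fitted.predict(X_test)
    return {
        'num_labeled': len(labeled_df),
        'accuracy_model2': accuracy_score(y_test, y_pred),
        'f1_model2': f1_score(y_test, y_pred, pos_label=pos_label, zero_division=0),
    }

# Toy usage (made-up data; the model is a stand-in for model2)
toy_labeled = pd.DataFrame({
    'X': ["free prize now", "lunch today?", "win cash free", "see you at home"],
    'y': ['spam', 'ham', 'spam', 'ham'],
})
toy_scores = score_second_model(
    make_pipeline(CountVectorizer(), LogisticRegression()), toy_labeled,
    ["win a free prize", "see you at lunch"], ['spam', 'ham'])
print(toy_scores)
```

Calling this once per batch, on the rows selected by the first model, gives the second model's learning curve on the actively chosen samples.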

In [None]:
num_samples_in_active_learning_batch = [50] + [10]*150 + [25]*100
# num_samples_in_active_learning_batch = [50] + [10]*150

al_history_model2 = run_active_learner_with_different_model(df, model, model2, num_samples_in_active_learning_batch, select_next_batch_func)

and plot the results:

## Bootstrapping Results
Let's try a cycle (or more) of bootstrapping: we'll use the model's high-confidence predictions as "ground truth" for another round of training and prediction.
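One bootstrapping cycle can be sketched as follows: predictions above a confidence threshold are moved from the unlabeled pool into the labeled set, after which the model would be refit. The helper name `add_confident_pseudo_labels`, the 0.95 threshold and `_StubModel` are assumptions for illustration and do not reproduce `run_active_learner_with_bootstrap`.

```python
import numpy as np
import pandas as pd

def add_confident_pseudo_labels(model, labeled_df, unlabeled_df, threshold=0.95):
    """One bootstrapping step: adopt confident predictions as labels."""
    proba = model.predict_proba(unlabeled_df['X'])
    confident = proba.max(axis=1) >= threshold
    pseudo = pd.DataFrame({
        'X': unlabeled_df['X'][confident],
        # Predicted class of each confident row, used as if it were the true label
        'y': model.classes_[proba.argmax(axis=1)][confident],
    })
    new_labeled = pd.concat([labeled_df[['X', 'y']], pseudo])
    return new_labeled, unlabeled_df[~confident]

class _StubModel:
    """Stand-in for a fitted classifier, only to make the example runnable."""
    classes_ = np.array(['ham', 'spam'])
    def predict_proba(self, X):
        return np.array([[0.99, 0.01], [0.60, 0.40], [0.03, 0.97]])

toy_labeled = pd.DataFrame({'X': ['known spam'], 'y': ['spam']})
toy_unlabeled = pd.DataFrame({'X': ['u1', 'u2', 'u3']}, index=[5, 6, 7])
new_labeled, remaining = add_confident_pseudo_labels(_StubModel(), toy_labeled, toy_unlabeled)
print(new_labeled)
print(remaining)
```

Pseudo-labels can be wrong, so a high threshold trades fewer added samples for less label noise.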

In [None]:
num_samples_in_active_learning_batch = [100] + [10]*30
# al_history = run_active_learner(df, model, num_samples_in_active_learning_batch, select_next_batch_func)
al_history_bootstrap_1 = run_active_learner_with_bootstrap(df, model, num_samples_in_active_learning_batch, select_next_batch_func, bootstrap=1)