<div class="container">
  <div class="jumbotron">
    <h1>Scikit-learn Primer</h1>
    <p>Introduction to text classification using Scikit-learn</p>
  </div>
</div>

**Scikit-learn** (http://scikit-learn.org/) is an open-source machine learning library for Python that offers a variety of regression, classification and clustering algorithms.

In this section we'll perform a fairly simple classification exercise with scikit-learn. In the next section we'll leverage the machine learning strength of scikit-learn to perform natural language classifications.

# Installation and Setup

### From the command line or terminal:
> `conda install scikit-learn`
> <br>*or*<br>
> `pip install -U scikit-learn`

Scikit-learn additionally requires that NumPy and SciPy be installed. For more info visit http://scikit-learn.org/stable/install.html

# Perform Imports and Load Data
For this exercise we'll be using the **SMS Spam Collection** dataset from [UCI datasets](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) that contains more than 5,000 SMS phone messages.<br>You can check out the [**sms_readme**](../TextFiles/sms_readme.txt) file for more info.

The file is a [tab-separated-values](https://en.wikipedia.org/wiki/Tab-separated_values) (tsv) file with four columns:
> **label** - every message is labeled as either ***ham*** or ***spam***<br>
> **message** - the message itself<br>
> **length** - the number of characters in each message<br>
> **punct** - the number of punctuation characters in each message

In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv('../TextFiles/smsspamcollection.tsv', sep='\t')
df.head()

In [None]:
len(df)

## Check for missing values:
Machine learning models usually require complete data.

In [None]:
df.isnull().sum()

## Take a quick look at the *ham* and *spam* `label` column:

In [None]:
df['label'].unique()

In [None]:
df['label'].value_counts()

<div class="alert alert-success">
  We see that 4825 out of 5572 messages, or 86.6%, are ham.<br>This means that any machine learning model we create has to achieve <strong>better than 86.6%</strong> accuracy to beat a trivial classifier that labels every message as ham.
</div>
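The 86.6% figure is the accuracy of a classifier that always predicts the majority label. Here is a minimal sketch of that arithmetic, using only the counts quoted above (the spam count of 747 is simply 5572 - 4825, and the variable names are illustrative):

```python
from collections import Counter

# Label counts taken from the figures quoted above (4825 ham out of 5572).
labels = ['ham'] * 4825 + ['spam'] * 747

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a classifier that always predicts the majority label.
baseline_accuracy = majority_count / len(labels)

print(majority_label, round(baseline_accuracy, 3))
```

Any model we train later should be compared against this number rather than against 50%.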

## Visualize the data:
Since we're not ready to do anything with the message text, let's see if we can predict ham/spam labels based on message length and punctuation counts. We'll look at message `length` first:

In [None]:
df['length'].describe()

<div class="alert alert-success">
  The message lengths are heavily right-skewed: the mean value is 80.5 and yet the max length is 910. Let's plot the logarithm of the length so the histogram is easier to read.
</div>

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

df['length_log'] = np.log(df['length'])

#plt.xscale('log')
#bins = 1*(np.arange(0,10))
plt.hist(df[df['label']=='ham']['length_log'],bins=25,alpha=0.8)
plt.hist(df[df['label']=='spam']['length_log'],bins=25,alpha=0.8)
plt.legend(('ham','spam'))
plt.show()

<font color=green>It looks like there's a small range of values where a message is more likely to be spam than ham.</font>

Now let's look at the `punct` column:

In [None]:
df['punct'].describe()

In [None]:
df['punct_log'] = np.log(df['punct']+1)

#plt.xscale('log')
#bins = 1*(np.arange(0,7))
plt.hist(df[df['label']=='ham']['punct_log'],bins=10,alpha=0.8)
plt.hist(df[df['label']=='spam']['punct_log'],bins=10,alpha=0.8)
plt.legend(('ham','spam'))
plt.show()

<div class="alert alert-success">
  This looks even worse - there seem to be no values where one would pick spam over ham. We'll still try to build a machine learning classification model, but we should expect poor results.
</div>
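A note on the `+1` in the `punct_log` transform above: it is presumably there because a message with no punctuation has a count of zero, and `log(0)` is undefined. As a small sketch (the sample counts below are made up, not taken from the dataset), NumPy's `np.log1p` computes the same shifted logarithm in one call:

```python
import numpy as np

# Illustrative punctuation counts, including a zero.
punct_counts = np.array([0, 1, 3, 9, 133])

shifted_log = np.log(punct_counts + 1)  # the transform used in the cell above
log1p = np.log1p(punct_counts)          # NumPy's built-in equivalent

print(np.allclose(shifted_log, log1p))
print(log1p[0])  # a count of zero maps to 0.0 instead of -inf
```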

___
# Split the data into train & test sets:

If we wanted to divide the DataFrame into two smaller sets, we could use
> `train, test = train_test_split(df)`

For our purposes let's also set up our Features (X) and Labels (y). The Label is simple - we're trying to predict the `label` column in our data. For Features we'll use the `length` and `punct` columns. By convention, **$X$** is capitalized and **$y$** is lowercase.

## Selecting features
There are two ways to build a feature set from the columns we want. If the number of features is small, then we can pass those in directly:
> `X = df[['length','punct']]`

If the number of features is large, then it may be easier to drop the Label and any other unwanted columns:
> `X = df.drop(['label','message'], axis=1)`

These operations return new DataFrames and do not modify **df** in place, so all the original data is preserved.
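To check that the two approaches give the same feature set, here is a small sketch on a hand-made stand-in DataFrame (`demo` and its values are hypothetical, not the SMS data):

```python
import pandas as pd

# A tiny stand-in for the SMS DataFrame (values are made up for illustration).
demo = pd.DataFrame({
    'label': ['ham', 'spam', 'ham'],
    'message': ['see you soon', 'WIN a prize!!!', 'ok'],
    'length': [12, 14, 2],
    'punct': [0, 3, 0],
})

X_selected = demo[['length', 'punct']]               # pick the wanted columns
X_dropped = demo.drop(['label', 'message'], axis=1)  # drop the unwanted ones

print(X_selected.equals(X_dropped))  # both routes give the same features
print(list(demo.columns))            # the original still has all four columns
```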

In [None]:
# Create Feature and Label sets
X = df[['length','punct']]  # note the double set of brackets
y = df['label']

## Additional train/test/split arguments:
The default test size for `train_test_split` is 25%. Here we'll assign 33% of the data for testing.<br>
Also, we can set a `random_state` seed value to ensure that everyone uses the same "random" training & testing sets.
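As a quick sketch of both points, the snippet below splits a small synthetic array (`X_demo` and `y_demo` are placeholders, not the SMS data). It checks how many rows land in the test set when `test_size` is left unset, and that repeating the call with the same `random_state` reproduces the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with two features each, plus matching labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array(['ham'] * 80 + ['spam'] * 20)

# No test_size given, so the library default applies.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
default_test_rows = len(X_te)

# Repeating the call with the same seed yields the same test rows.
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, random_state=42)
same_split = np.array_equal(X_te, X_te_again)

print(default_test_rows, same_split)
```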

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print('Training Data Shape:', X_train.shape)
print('Testing Data Shape: ', X_test.shape)

Now we can pass these sets into a series of different training & testing algorithms and compare their results.
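The sections below fit each model in its own cell. As a compact alternative, here is a sketch of the same compare-several-models pattern in a single loop. It runs on synthetic count data so that it is self-contained; the `models` dictionary, the `_demo` names and the data itself are all illustrative assumptions, so the scores say nothing about the SMS dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Synthetic stand-in: two non-negative count features and a ham/spam label.
rng = np.random.default_rng(42)
X_demo = rng.poisson(lam=[80, 4], size=(300, 2))
y_demo = np.array(['ham'] * 255 + ['spam'] * 45)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# The three classifiers used in this notebook, with the same settings.
models = {
    'logistic regression': LogisticRegression(solver='lbfgs'),
    'naive Bayes': MultinomialNB(),
    'SVC': SVC(gamma='auto'),
}

# Fit each model on the training set and score it on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```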

___
# Train a Logistic Regression classifier
One of the simplest classification tools is [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Scikit-learn offers a variety of solvers for fitting it; we'll use [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS).

In [None]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(solver='lbfgs')

lr_model.fit(X_train, y_train)

## Test the Accuracy of the Model

In [None]:
from sklearn import metrics

# Create a prediction set:
predictions = lr_model.predict(X_test)

# Print a confusion matrix
print(metrics.confusion_matrix(y_test,predictions))

In [None]:
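# You can make the confusion matrix less confusing by adding labels
# (rows are the actual labels, columns are the predicted labels).
# A separate name keeps the original df from being overwritten.
cm_df = pd.DataFrame(metrics.confusion_matrix(y_test,predictions), index=['ham','spam'], columns=['ham','spam'])
cm_df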

<div class="alert alert-success">
  These results are terrible! More spam messages were misclassified as ham (241) than were correctly identified as spam (5), although relatively few ham messages (46) were misclassified as spam.
</div>
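To make the layout of the matrix concrete, here is a small sketch on hand-written labels (`y_true` and `y_pred` are made up). In scikit-learn's `confusion_matrix`, rows correspond to the actual labels and columns to the predicted labels, in the order given by `labels`:

```python
from sklearn.metrics import confusion_matrix

y_true = ['ham', 'ham', 'ham', 'ham', 'spam', 'spam']
y_pred = ['ham', 'ham', 'ham', 'spam', 'ham', 'spam']

cm = confusion_matrix(y_true, y_pred, labels=['ham', 'spam'])

# Row 0: actual ham; row 1: actual spam.
ham_as_ham, ham_as_spam = cm[0]
spam_as_ham, spam_as_spam = cm[1]

# Accuracy is the diagonal (correct predictions) over the total.
accuracy = (ham_as_ham + spam_as_spam) / cm.sum()

print(cm)
print(accuracy)
```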

In [None]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

In [None]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

<div class="alert alert-success">
  This model performed <strong>worse</strong> than a classifier that assigned all messages as "ham" would have!
</div>
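The classification report summarizes precision, recall and F1 for each class. As a sketch, the spam-class values can be worked out by hand from the three counts quoted in the confusion-matrix note above (5, 241 and 46); the variable names are illustrative:

```python
# Counts quoted in the confusion-matrix note for the logistic regression model.
spam_correct = 5    # spam predicted as spam (true positives)
spam_missed = 241   # spam predicted as ham (false negatives)
ham_flagged = 46    # ham predicted as spam (false positives)

# Precision: of the messages flagged as spam, how many really were spam?
precision = spam_correct / (spam_correct + ham_flagged)

# Recall: of the real spam messages, how many were caught?
recall = spam_correct / (spam_correct + spam_missed)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

The low recall is the main story here: almost all of the spam gets through.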

___
# Train a naïve Bayes classifier:
One of the most common - and successful - classifiers is [naïve Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes).

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()

nb_model.fit(X_train, y_train)
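`MultinomialNB` is designed for count-like features and rejects negative values. `length` and `punct` are raw counts, so they can be passed in unchanged. A minimal sketch on made-up counts (`X_counts` and `y_counts` are hypothetical):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up (length, punct) counts with labels.
X_counts = np.array([[20, 0], [35, 1], [150, 6], [160, 8]])
y_counts = np.array(['ham', 'ham', 'spam', 'spam'])

demo_nb = MultinomialNB().fit(X_counts, y_counts)

# One row per message, one column per class; each row sums to 1.
probabilities = demo_nb.predict_proba(X_counts)
print(demo_nb.classes_)
print(probabilities.shape)

# Shifting the features below zero makes fit() raise a ValueError.
try:
    MultinomialNB().fit(X_counts - 100, y_counts)
    negative_rejected = False
except ValueError:
    negative_rejected = True
print(negative_rejected)
```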

## Run predictions and report on metrics

In [None]:
predictions = nb_model.predict(X_test)
print(metrics.confusion_matrix(y_test,predictions))

<div class="alert alert-success">
  The total number of confusions dropped from <strong>287</strong> to <strong>256</strong>. [241+46=287, 246+10=256]
</div>

In [None]:
print(metrics.classification_report(y_test,predictions))

In [None]:
print(metrics.accuracy_score(y_test,predictions))

<div class="alert alert-success">
  Better, but still less accurate than the 86.6% all-ham baseline.
</div>

___
# Train a support vector machine (SVM) classifier
Among the SVM options available, we'll use [C-Support Vector Classification (SVC)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)

In [None]:
from sklearn.svm import SVC
svc_model = SVC(gamma='auto')
svc_model.fit(X_train,y_train)
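The `gamma` argument sets the width of the default RBF kernel. According to the scikit-learn documentation, `gamma='auto'` uses `1 / n_features`, while `gamma='scale'` uses `1 / (n_features * X.var())`. The sketch below evaluates both formulas on made-up data (`X_demo` is hypothetical) to show how different they can be when features are unscaled:

```python
import numpy as np

# Made-up (length, punct) rows standing in for the training features.
X_demo = np.array([[20.0, 0.0], [35.0, 1.0], [150.0, 6.0], [160.0, 8.0]])
n_features = X_demo.shape[1]

gamma_auto = 1.0 / n_features                     # what gamma='auto' uses
gamma_scale = 1.0 / (n_features * X_demo.var())   # what gamma='scale' uses

print(gamma_auto)
print(gamma_scale)
```

With features on very different scales, the two settings can lead to noticeably different models, so it is worth trying both.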

## Run predictions and report on metrics

In [None]:
predictions = svc_model.predict(X_test)
print(metrics.confusion_matrix(y_test,predictions))

<div class="alert alert-success">
The total number of confusions dropped even further to <strong>209</strong>.
</div>

In [None]:
print(metrics.classification_report(y_test,predictions))

In [None]:
print(metrics.accuracy_score(y_test,predictions))

<font color=green>And finally we have a model that performs *slightly* better than the 86.6% all-ham baseline.</font>



Great! Now you should be able to load a dataset, divide it into training and testing sets, and perform simple analyses using scikit-learn.