### ML101 : Text Classification, Spam Detection
07-09-2016,  
Jan Fait,  
Digital Marketing,  
Munich

In [1]:
%matplotlib inline

## Intuition


### How is text classification similar to usual supervised classification tasks?
+ There are labels/classes the model should learn
+ The input is a collection of character strings -> a document
+ A combination of input strings predicts a label

> To turn a document into a vector of predictors we need a **document representation**


### Document representation

Did not find a nice secular example

![](http://www.python-course.eu/images/document_representation.png)


### The problem with many documents 

Imagine the outcome of several hundred documents represented as shown before.
The results turn out:

+ **Wide** - there are many many words = too many predictors
+ **Sparse** - only a few words are found in all documents
+ **Confounded** - the words found in many documents don't have any predictive ability (Hello, Kind Regards, ...)

Example of a wide, sparse matrix
![](http://i.stack.imgur.com/7H4Kj.png)

### Feature Extraction

= selecting just the strings with predictive ability

> Feature extraction is tricky, usually demands experimenting and competing models.


**1. Compute new features**
    + too much UPPERCASE
    + too much !!!!!
    + sender, IP
**2. Standardize strings**
    + tokenize
    + lowercase
    + remove punctuation
    + probabilistic spelling correction
    + lemmatize
**3. Remove stopwords**
    + you, me, a, the
**4. Set a % of documents the feature should be in**
    + only words which are in 5% + documents
    + longer than 3 characters

Lemmatization (see [Lucene](https://github.com/larsmans/lucene-stanford-lemmatizer) for implementations):
![Lemmatization](https://www.briggsby.com/wp-content/uploads/2014/11/lemmatization.png)


## Conditional probability

*Two following slides adapted from Alexan.org [](https://alexn.org/)*

Data:

    30 emails out of a total of 74 are spam messages = P(spam)
    51 emails out of those 74 contain the word “penis” = P(penis)
    20 emails containing the word “penis” have been marked as spam = P(spam|penis)
 
We know what is the $probability(penis | spam)$, it is $\frac{20}{74}$
But what we really wanna know is the $probability(spam|penis)$

We use a rule of conditional probability:

![](https://alexn.org/assets/img/conditional-prob.png)

Now, $probability(spam|penis)$ = spam GIVEN penis

![](https://alexn.org/assets/img/spam-simple-bayes.png)

## Naive bayes

But our data has more words than 'penis' that can be spammy, we need to consider them jointly.

    25 emails out of the total also contain the word “viagra”
    24 emails out of those have been marked as spam
    1 remaining email is not spam

> To avoid doing Conditional probability on every single word, we assume their independence!

So we get $probability(spam|penis,viagra)$ by:

![](https://alexn.org/assets/img/spam-multiple-bayes-naive.png)

Conversely, $probability(ham|penis,viagra)$ is $\frac$

## Practical 

We take a nearly identical appraoch in loading data as at the last talk.

+ Getting data into 
+ Using Scikit-learn library for modelling


##### Data : SMS Spam Collection

5574 short messages dataset. Get it [here](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)


In [30]:
#pandas is a popular package for working with tabular data as data.frames, yes almost like Spark data.frames
import pandas as pd
import numpy as np
#read in the data, define separator
df = pd.read_csv('../data/datasets/spamcollection.csv', sep=';')
#show a truncated data frame
print(df[:10])

#simple validation set taking 0.8 of data, defined random seed
train=df.sample(frac=0.8,random_state=200)
#inverse selection for the test set
test=df.drop(train.index)
#check
print("Number of training rows is: "+ str(len(train)))

                                                text class
0  Go until jurong point, crazy.. Available only ...   ham
1                      Ok lar... Joking wif u oni...   ham
2  Free entry in 2 a wkly comp to win FA Cup fina...  spam
3  U dun say so early hor... U c already then say...   ham
4  Nah I don't think he goes to usf, he lives aro...   ham
5  FreeMsg Hey there darling it's been 3 week's n...  spam
6  Even my brother is not like to speak with me. ...   ham
7  As per your request 'Melle Melle (Oru Minnamin...   ham
8  WINNER!! As a valued network customer you have...  spam
9  Had your mobile 11 months or more? U R entitle...  spam
Number of training rows is: 4459


###  Standardization

The below code shows examples of how to standardize text.



In [14]:
import re
import string

#define translations for a elements in string.punctuation object
def remove_punctuation(s):
    table = s.maketrans({key: None for key in string.punctuation})
    return s.translate(table)

#ingest text and     
def tokenize(text):
    text = remove_punctuation(text)
    text = text.lower()
    return re.split("\W+", text)

def count_words(words):
    wc = {}
    for word in words:
        wc[word] = wc.get(word, 0.0) + 1.0
    return wc

s = "Ever wanted a russian wife? Buy a russian wife online now."
count_words(tokenize(s))

{'a': 2.0,
 'buy': 1.0,
 'ever': 1.0,
 'now': 1.0,
 'online': 1.0,
 'russian': 2.0,
 'wanted': 1.0,
 'wife': 2.0}

### Enter the CountVectorizer class

[Scikit docs for count vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

CountVectorizer does almost the same thing as the above 
+ Learns the total vocabulary
+ Gets counts of words for documents
+ Cleans up
+ Does indexation for performance

In [32]:
#load the scikit library
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

#count vectorizer it learns the vocabulary of the corpus and extracts word count features. 
count_vectorizer1 = CountVectorizer(max_features=1000, lowercase=True)
count_vectorizer2 = CountVectorizer(max_features=100, lowercase=True)
text_sample = df['text'].values
#just taking the first two messages
text_sample = df['text'].head(2)
counts_sample = count_vectorizer1.fit_transform(text_sample)
print(counts_sample)

  (0, 6)	1
  (0, 19)	1
  (0, 11)	1
  (0, 17)	1
  (0, 5)	1
  (0, 1)	1
  (0, 16)	1
  (0, 9)	1
  (0, 3)	1
  (0, 8)	1
  (0, 22)	1
  (0, 12)	1
  (0, 2)	1
  (0, 4)	1
  (0, 18)	1
  (0, 7)	1
  (0, 0)	1
  (0, 20)	1
  (1, 14)	1
  (1, 13)	1
  (1, 10)	1
  (1, 21)	1
  (1, 15)	1


In [33]:

#vectorize the training set
counts_train1 = count_vectorizer1.fit_transform(train['text'].values)
counts_train2 = count_vectorizer2.fit_transform(train['text'].values)
targets_train = train['class'].values
#this is the actual classifier
classifier1 = MultinomialNB()
classifier2 = MultinomialNB()
classifier1.fit(counts_train1, targets_train)
classifier2.fit(counts_train2, targets_train)

#vectorize the training set
counts_test1 = count_vectorizer1.fit_transform(test['text'].values)
counts_test2 = count_vectorizer2.fit_transform(test['text'].values)
targets_test = test['class'].values

#copy the array to form the expected
expected = targets_test

#run the test counts through the classifier
predicted_m1 = classifier1.predict(counts_test1)
predicted_m2 = classifier2.predict(counts_test2)
#look at predictions
print(predicted_m1)
print(predicted_m2)

['spam' 'ham' 'spam' ..., 'ham' 'ham' 'ham']
['spam' 'ham' 'ham' ..., 'ham' 'ham' 'ham']


###  Accuracy

Accuracy is the number of correct predictions (guessed the right class) in the test set.

In [36]:
#Hint: we prefer to use vectorized functions
def getAccuracy(exp,pre):
    return sum(exp==pre) /len(exp==pre)



getAccuracy(expected,predicted_m1),getAccuracy(expected,predicted_m2)

(0.76860986547085197, 0.85919282511210759)

#### The Confusion Matrix

The confusion matrix shows all possible results of the (expected == predicted) comparison
![](http://www.gepsoft.com/gepsoft/APS3KB/Chapter09/Section2/confusionmatrix.png)

Let us break it apart.

> TP = **True Positive** = Expected True == Predicted True  
> FN = **False Negative** = Expected True != Predicted False  
> FP = **False Positive** = Expected False != Predicted True  
> TN = **True Negative** = Expected False == Predicted False 

Accuracy: $A = \frac{(TP + TN)}{TP+FN+FP+TN}$

In [46]:
from sklearn.metrics import *
#confusion matrix 1
confusion_matrix(expected,predicted_m1)
#confusion matrix 2
confusion_matrix(expected,predicted_m2)

array([[888,  88],
       [ 69,  70]])

### Beyond accuracy

$Sensitivity/Recall = \frac{TP}{TP+FN}$ - Catching the True Positives True==True correctly  
$Specificity = \frac{TN}{TN+FP}$ - Catching the True Negatives False==False correctly  

If your model has a very high value of one of the above, but totally fails on the other one, its wrong.

In [67]:
def getSens(exp,pre):
    cm = confusion_matrix(exp, pre)
    return cm[0,0]/(cm[0,0]+cm[0,1])

#but we have to define our own for specificity
def getSpec(exp,pre):
    cm = confusion_matrix(exp, pre)
    return cm[0,1]/(cm[0,1]+cm[1,1])

spec1 = getSpec(expected, predicted_m1)
spec2 = getSpec(expected, predicted_m2)

sens1 = getSens(expected, predicted_m1)
sens2 = getSens(expected, predicted_m2)
print("specificiy1",spec1,"specificiy1",spec2)
print("sensitivity1",sens1,"sensitivity2",sens2)

specificiy1 0.727969348659 specificiy1 0.556962025316
sensitivity1 0.805327868852 sensitivity2 0.909836065574


### Summary

We saw how to classify text with Naive Bayes, the important take-aways are:

+ standardizing text input
+ basic feature selection
+ workings of the simplified Bayes rule in the Naive Bayes
+ although assuming independence conditional probabilities of two words and their effect on the outcome is courageous, the algorithm works.
+ remember accuracy is one thing, but always look at the confusion matrix before optimizing

That's it for today. Thank you.

### References

[Scikit-Learn Metrics module](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.naive_bayes)  
[Naive Bayes in 5 minutes](https://www.youtube.com/watch?v=IlVINQDk4o8)  
[Confusion Matrix](https://en.wikipedia.org/wiki/Confusion_matrix)  