## Document Classification

For this assignment I used text messages labeled as 'spam' or 'ham' (https://raw.githubusercontent.com/wtznc/Naive-Bayes-SMS-Spam-Collection/master/SMSSpamCollection).
This is a supervised learning problem. The goal of this assignment is to find the classifier that predicts most accurately whether a text message belongs to the 'spam' or 'ham' category.

In [125]:
import csv
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix, precision_score, recall_score
# grid_search, cross_validation and learning_curve now live in model_selection
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score, learning_curve, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

I read the data from a CSV file.

In [83]:
df = pd.read_csv('spam.csv',names=['category','text message'],skiprows=1,encoding='latin-1')
df.head()

Unnamed: 0,category,text message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


To apply a tf-idf model to the text, I did the following:

1. Converted all uppercase letters to lowercase.
2. Removed punctuation.
3. Removed English stopwords such as articles and prepositions.
4. Normalized words to their base forms (lemmas).

In [84]:
#converted to lower case letters
df['text message'] = df['text message'].str.lower()
df.head()

Unnamed: 0,category,text message
0,ham,"go until jurong point, crazy.. available only ..."
1,ham,ok lar... joking wif u oni...
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor... u c already then say...
4,ham,"nah i don't think he goes to usf, he lives aro..."


In [85]:
# remove punctuation
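# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['text message'] = df['text message'].str.replace(r'[^\w\s]', '', regex=True)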
df.head()

Unnamed: 0,category,text message
0,ham,go until jurong point crazy available only in ...
1,ham,ok lar joking wif u oni
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor u c already then say
4,ham,nah i dont think he goes to usf he lives aroun...
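
To see what the pattern `[^\w\s]` does in isolation, here is a minimal sketch using Python's `re` module on a made-up string (the sample text and variable names are illustrative and not taken from the data set). The pattern deletes every character that is neither a word character nor whitespace, which is why `don't` becomes `dont` in the output above.

```python
import re

# Hypothetical sample message, already lower-cased
sample = "nah i don't think he goes to usf, he lives around here though!"

# [^\w\s] matches any character that is not a letter, digit, underscore, or whitespace
cleaned = re.sub(r'[^\w\s]', '', sample)
print(cleaned)
```

Note that apostrophes are removed together with commas and periods, so contractions are merged into a single token rather than split.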


In [86]:
# remove stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(message):
    # tokenize, drop stopwords, and join the remaining tokens back into one string
    return ' '.join(w for w in word_tokenize(message) if w not in stop_words)

# apply() avoids the chained assignment df[col][i] = ..., which pandas warns about
df['text message'] = df['text message'].apply(remove_stopwords)
df.head()

Unnamed: 0,category,text message
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say early hor u c already say
4,ham,nah dont think goes usf lives around though


In [87]:
# reduce words to their base forms (lemmas)
def split_into_lemmas(message):
    words = TextBlob(message).words
    return [word.lemma for word in words]

df['text message'].head().apply(split_into_lemmas)

0    [go, jurong, point, crazy, available, bugis, n...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, 2, wkly, comp, win, fa, cup, fin...
3        [u, dun, say, early, hor, u, c, already, say]
4    [nah, dont, think, go, usf, life, around, though]
Name: text message, dtype: object

After that I converted each text message into a bag-of-words count vector that machine learning models can work with.

In [88]:
text_transformer = CountVectorizer(analyzer=split_into_lemmas).fit(df['text message'])
messages_bow = text_transformer.transform(df['text message'])
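
As a rough illustration of what the bag-of-words step produces, the sketch below runs `CountVectorizer` on a three-message toy corpus. The corpus and the `toy_*` names are made up for this example, and it uses the default analyzer rather than the lemma-based one from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration)
toy_messages = ["free entry win cup", "ok lar joking", "win free cup free"]

# The default analyzer splits on word boundaries; the notebook instead
# passes its own lemma-based analyzer
toy_vectorizer = CountVectorizer().fit(toy_messages)
toy_bow = toy_vectorizer.transform(toy_messages)

# Recover the column order of the matrix from the learned vocabulary
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(toy_bow.toarray())  # one row per message, one column per vocabulary word
```

Each row counts how often every vocabulary word occurs in one message, so the third message has a 2 in the `free` column.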

I used scikit-learn's TfidfTransformer for weighting and normalization.

In [93]:
tfidf_transformer = TfidfTransformer().fit(messages_bow)
messages_tfidf = tfidf_transformer.transform(messages_bow)
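
To make the weighting and normalization concrete, the following sketch recomputes the tf-idf values by hand for a small made-up count matrix and compares them with `TfidfTransformer`. It assumes the transformer's default settings (smoothed idf, L2 row normalization), which is what the cell above uses. The `counts` matrix and the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 3 messages x 3 vocabulary words (made-up numbers)
counts = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # number of messages containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed idf (scikit-learn default)
weighted = counts * idf                           # term frequency times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row

library = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, library))
```

Words that occur in every message (the first column here) get the smallest idf, so they contribute less to the final vector than rarer words.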

Next I split the data into training, validation, and test sets.

In [94]:
X = df.iloc[:,1].values
y = df.iloc[:,0].values
# hold out 30% of the data for testing
train_X, test_X, train_y, test_y  = train_test_split(X, y, test_size=0.3)
# split the remaining training data again to get a validation set
train_X, val_X, train_y, val_y  = train_test_split(train_X, train_y, test_size=0.3)
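
The sketch below shows how two successive 30% splits divide a data set, using 100 dummy messages in place of the SMS data. The `toy_*` and `rest_*` names are made up, and `stratify` and `random_state` are additions that the notebook does not use: they keep the spam/ham ratio similar across the parts and make the split reproducible.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 100 dummy messages with an imbalanced label (illustrative only)
toy_X = [f"message {i}" for i in range(100)]
toy_y = ['spam' if i % 5 == 0 else 'ham' for i in range(100)]

# Hold out 30% for testing, then 30% of the remainder for validation
rest_X, toy_test_X, rest_y, toy_test_y = train_test_split(
    toy_X, toy_y, test_size=0.3, stratify=toy_y, random_state=0)
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    rest_X, rest_y, test_size=0.3, stratify=rest_y, random_state=0)

print(len(toy_train_X), len(toy_val_X), len(toy_test_X))
```

With these proportions about 49% of the messages end up in the training set, 21% in the validation set, and 30% in the test set.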

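Let's train the following classifiers on the tf-idf features:
1. Naive Bayes
2. Random Forest
3. Support Vector Machines
4. Logistic Regression

and make predictions with each of them. The cell below fits and predicts on the full data set (`messages_tfidf`), so the scores that follow are measured on the same messages the models were trained on, not on the held-out test set.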

In [156]:
# Naive Bayes
nb = MultinomialNB().fit(messages_tfidf, df['category'])
nb_all_predictions = nb.predict(messages_tfidf)

# Random Forest
rf = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0).fit(messages_tfidf, df['category'])
rf_all_predictions = rf.predict(messages_tfidf)

# Support Vector Machines
svc = LinearSVC().fit(messages_tfidf, df['category'])
svc_all_predictions = svc.predict(messages_tfidf)

# Logistic Regression
lr = LogisticRegression(random_state=0).fit(messages_tfidf, df['category'])
lr_all_predictions = lr.predict(messages_tfidf)
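
The cell above fits and predicts on the same `messages_tfidf` matrix. One possible way to score a model on unseen messages instead is to wrap the vectorizer, the tf-idf transformer, and the classifier in a `Pipeline` and fit it on the training split only. This is a sketch under assumptions, not the notebook's method: the corpus and all `toy_*` names are made up, and the default analyzer stands in for `split_into_lemmas`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus standing in for the SMS data
toy_texts = ["free prize win now", "win cash free entry", "claim free prize today",
             "free cash prize claim", "urgent win cash now", "free entry claim prize",
             "ok see you later", "call me when home", "are you coming today",
             "see you at home later", "ok call me later", "coming home now"]
toy_labels = ["spam"] * 6 + ["ham"] * 6

toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_texts, toy_labels, test_size=0.3, stratify=toy_labels, random_state=0)

# The pipeline learns the vocabulary and idf weights from the training split only
toy_model = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC()),
])
toy_model.fit(toy_train_X, toy_train_y)

# Score on messages the model has never seen
toy_test_pred = toy_model.predict(toy_test_X)
print(accuracy_score(toy_test_y, toy_test_pred))
```

In the notebook, `train_X`, `train_y`, `test_X`, and `test_y` from the split cell could take the place of the toy lists, and `CountVectorizer(analyzer=split_into_lemmas)` could replace the default vectorizer.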

In [157]:
def spam_metrics(y_true, y_pred):
    # confusion_matrix sorts the labels alphabetically, so 'ham' is the negative
    # class and 'spam' the positive one; ravel() yields tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {'accuracy': accuracy_score(y_true, y_pred),
            'precision': precision_score(y_true, y_pred, pos_label='spam'),
            'recall': recall_score(y_true, y_pred, pos_label='spam'),
            'f1_score': f1_score(y_true, y_pred, pos_label='spam'),
            'specificity': tn / (tn + fp),
            'sensitivity': tp / (tp + fn)}  # identical to recall for the 'spam' class

predictions = {'Naive Bayes': nb_all_predictions,
               'Random Forest': rf_all_predictions,
               'Support Vector Machine': svc_all_predictions,
               'Logistic Regression': lr_all_predictions}

# one row of metrics per classifier
classifiers = pd.DataFrame([{'name': name, **spam_metrics(df['category'], pred)}
                            for name, pred in predictions.items()])
classifiers

Unnamed: 0,name,accuracy,precision,recall,f1_score,specificity,sensitivity
0,Naive Bayes,0.978643,1.0,0.840696,0.913455,1.0,0.840696
1,Random Forest,0.865937,0.0,0.0,0.0,1.0,0.0
2,Support Vector Machine,0.999641,1.0,0.997323,0.99866,1.0,0.997323
3,Logistic Regression,0.971644,0.991653,0.795181,0.882615,0.998964,0.795181
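
For readers who want to check how the specificity and sensitivity columns are derived, here is a small worked example on made-up labels (the `toy_true` and `toy_pred` lists are illustrative). Passing `labels=['ham', 'spam']` is an addition that spells out the label order instead of relying on alphabetical sorting.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 6 ham, 4 spam
toy_true = ['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'spam']
toy_pred = ['ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'spam', 'ham']

# Naming the label order explicitly makes 'spam' the positive class
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=['ham', 'spam']).ravel()

specificity = tn / (tn + fp)   # share of ham messages correctly kept as ham
sensitivity = tp / (tp + fn)   # share of spam messages correctly caught (same as recall)
print(tn, fp, fn, tp, specificity, sensitivity)
```

Here one ham message is flagged as spam and one spam message slips through, giving a specificity of 5/6 and a sensitivity of 3/4.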


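The Support Vector Machine returns the highest accuracy, almost 99.96%. Because the models were scored on the same data they were fit on, this figure is an optimistic estimate of performance on unseen messages. The Random Forest (with max_depth=3) predicted no message as spam: its precision and recall are 0, and its accuracy of about 86.6% is simply the share of ham messages in the data.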
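
The Random Forest row in the table (accuracy near 0.866 with a recall of 0) shows why accuracy alone can mislead on imbalanced data. As a hedged illustration, the sketch below uses scikit-learn's `DummyClassifier` on made-up labels to show that a model which always predicts the majority class reaches an accuracy equal to the majority share while catching no spam. The 87/13 split and the `toy_*` names are assumptions for the example, not the real class counts.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up imbalanced labels: 87 ham, 13 spam (features are irrelevant to this baseline)
toy_y = ['ham'] * 87 + ['spam'] * 13
toy_X = [[0]] * 100

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(toy_X, toy_y)
baseline_pred = baseline.predict(toy_X)

baseline_accuracy = accuracy_score(toy_y, baseline_pred)                 # equals the share of ham
baseline_recall = recall_score(toy_y, baseline_pred, pos_label='spam')   # no spam is caught
print(baseline_accuracy, baseline_recall)
```

Comparing each classifier against such a baseline, and reading recall and F1 alongside accuracy, gives a fairer picture than accuracy on its own.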