# Document Classification

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  http://archive.ics.uci.edu/ml/datasets/Spambase

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or taken from another source, such as your own spam folder).

More adventurous students are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), and then analyze these documents to predict how new documents should be classified.

This assignment is due end of day on Monday, 11/12.  You may work in a small team if you want.

## Data Exploration

In [52]:
import nltk
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import random
import warnings
# sklearn.cross_validation was removed in scikit-learn 0.20;
# newer releases provide the same tools in sklearn.model_selection
from sklearn import datasets, svm, cross_validation, tree, preprocessing, metrics
import sklearn.ensemble as ske
warnings.filterwarnings('ignore')
# pull in the spam dataset
path='https://raw.githubusercontent.com/nobieyi00/CUNY-SPS-DATA620/master/source_data.csv'
# read the data and store data in DataFrame spam
spam = pd.read_csv(path) 

#Summarize spam
spam.describe(include='all')

| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | spam_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | ... | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 |
| mean | 0.104553 | 0.213015 | 0.280656 | 0.065425 | 0.312223 | 0.095901 | 0.114208 | 0.105295 | 0.090067 | 0.239413 | ... | 0.038575 | 0.13903 | 0.016976 | 0.269071 | 0.075811 | 0.044238 | 5.191515 | 52.172789 | 283.289285 | 0.394045 |
| std | 0.305358 | 1.290575 | 0.504143 | 1.395151 | 0.672513 | 0.273824 | 0.391441 | 0.401071 | 0.278616 | 0.644755 | ... | 0.243471 | 0.270355 | 0.109394 | 0.815672 | 0.245882 | 0.429342 | 31.729449 | 194.89131 | 606.347851 | 0.488698 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.588 | 6.0 | 35.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.065 | 0.0 | 0.0 | 0.0 | 0.0 | 2.276 | 15.0 | 95.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.42 | 0.0 | 0.38 | 0.0 | 0.0 | 0.0 | 0.0 | 0.16 | ... | 0.0 | 0.188 | 0.0 | 0.315 | 0.052 | 0.0 | 3.706 | 43.0 | 266.0 | 1.0 |
| max | 4.54 | 14.28 | 5.1 | 42.81 | 10.0 | 5.88 | 7.27 | 11.11 | 5.26 | 18.18 | ... | 4.385 | 9.752 | 4.081 | 32.478 | 6.003 | 19.829 | 1102.5 | 9989.0 | 15841.0 | 1.0 |


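From the summary we can see that:
a) There are 4601 data points in total
b) About 39% of the data set is spam (the mean of spam_class is 0.394)
c) capital_run_length_longest is heavily right-skewed (median 15, maximum 9989)
d) The special-character frequencies are very low relative to all characters in the emails (every char_freq mean shown is below 0.3)
e) Several word features are sparse: their 75th percentile is 0, so they are absent from most emails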
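
The observations about class balance and sparsity can also be computed directly instead of being read off the summary table. The following is a minimal sketch on a made-up toy DataFrame (the values are illustrative, not taken from the spam data); the same two expressions should apply to `spam` unchanged.

```python
import pandas as pd

# Toy stand-in for the spam DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "word_freq_make": [0.0, 0.0, 0.0, 0.21, 0.0],
    "char_freq_$": [0.0, 0.18, 0.0, 0.0, 0.0],
    "spam_class": [1, 1, 0, 0, 0],
})

# Share of rows labeled spam (equals the mean of the 0/1 label)
spam_share = toy["spam_class"].mean()

# Share of rows in which each feature is exactly zero (a simple sparsity measure)
zero_share = (toy.drop(columns="spam_class") == 0).mean()

print(spam_share)
print(zero_share)
```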


In [53]:
class_counts = spam.groupby('spam_class').size()
print(class_counts)

spam_class
0    2788
1    1813
dtype: int64
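
These counts give a useful reference point for the models that follow: a classifier that always predicts the majority class (non-spam) would already be right about 61% of the time. The arithmetic, using only the counts printed above:

```python
# Class counts taken from the output above
ham_count, spam_count = 2788, 1813
total = ham_count + spam_count

# Accuracy of a classifier that always predicts the majority class (non-spam)
baseline_accuracy = max(ham_count, spam_count) / total
print(round(baseline_accuracy, 3))  # 0.606
```

Any model worth keeping should beat this baseline comfortably.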


In [54]:
#let's check skew
skew = spam.skew()
print(skew)

word_freq_make                 5.675639
word_freq_address             10.086811
word_freq_all                  3.009249
word_freq_3d                  26.227744
word_freq_our                  4.747126
word_freq_over                 5.956953
word_freq_remove               6.765580
word_freq_internet             9.724848
word_freq_order                5.226067
word_freq_mail                 8.487810
word_freq_receive              5.510250
word_freq_will                 2.867354
word_freq_people               6.955548
word_freq_report              11.754645
word_freq_addresses            6.971041
word_freq_free                10.763594
word_freq_business             5.688642
word_freq_email                5.413754
word_freq_you                  1.591674
word_freq_credit              14.602587
word_freq_your                 2.435527
word_freq_font                 9.975441
word_freq_000                  5.713775
word_freq_money               14.687028
word_freq_hp                   5.716843
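
All of the skew values shown are positive, i.e. the features have long right tails. This notebook does not transform them, and tree-based models are insensitive to monotonic feature transforms, but for scale-sensitive models a `log1p` transform is one common option. A small sketch on a synthetic lognormal feature (an assumption standing in for a `word_freq_*` column):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed feature, standing in for a word_freq_* column
feature = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

skew_before = feature.skew()
# log1p(x) = log(1 + x) is defined at zero, which matters for sparse frequencies
skew_after = np.log1p(feature).skew()

print(skew_before, skew_after)
```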


In [55]:
spam[['spam_class','char_freq_$']].groupby('spam_class').mean()

| spam_class | char_freq_$ |
|---|---|
| 0 | 0.011648 |
| 1 | 0.174478 |


We can see that the '$' character is far more frequent in spam: its mean frequency is 0.174 in spam versus 0.012 in non-spam, roughly 15 times higher.
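
The same per-class comparison can be run over every feature at once to see which ones separate the classes most. A minimal sketch on a toy DataFrame with made-up values; `mean_gap` is a hypothetical name introduced here for illustration:

```python
import pandas as pd

# Toy stand-in for the spam DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "char_freq_$": [0.18, 0.20, 0.0, 0.01],
    "word_freq_hp": [0.0, 0.0, 1.2, 0.8],
    "spam_class": [1, 1, 0, 0],
})

# Mean of every feature within each class
class_means = toy.groupby("spam_class").mean()

# Difference of means: spam (1) minus non-spam (0), largest first
mean_gap = (class_means.loc[1] - class_means.loc[0]).sort_values(ascending=False)
print(mean_gap)
```

Features with a large positive gap are more typical of spam, and features with a large negative gap are more typical of non-spam.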


In [56]:
spam.head()

| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | spam_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |


## Check for Missing values

In [57]:
np.count_nonzero(spam.isnull())

0

The count is 0, so there are no missing values.
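
When a data set does contain gaps, a per-column count is usually more informative than a single total. A small sketch on a toy DataFrame with one deliberately missing value (the column names `a` and `b` are placeholders):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in column "a"
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [0.0, 0.0, 0.0]})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Grand total, comparable to the single number computed above
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```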

## Data Preparation

Let's confirm the data types of the features

In [58]:
print(spam.dtypes)

word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float64
word_freq_hpl                 float64
word_freq_ge

The features are already numeric, so no further type conversion is needed.

## Modelling
We now split the data into training and test sets, holding out 20% of the rows for testing and training on the remaining 80%.

In [59]:
X = spam.drop(['spam_class'], axis=1).values
y = spam['spam_class'].values
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size=0.2)
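
The cell above relies on `sklearn.cross_validation`, which newer scikit-learn releases no longer ship. The sketch below shows the equivalent split with `sklearn.model_selection`, on synthetic data from `make_classification` rather than the spam set. The `stratify` and `random_state` arguments are additions not used in the original cell: they keep the class ratio equal in both parts and make the split reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam features and labels
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

# 80/20 hold-out split, stratified on the label and seeded for reproducibility
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print(X_train_demo.shape, X_test_demo.shape)
```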

First we initialize a decision tree classifier with a maximum depth of 10. Then we fit it on the training set and score it on the test set.

In [60]:
clf_dt = tree.DecisionTreeClassifier(max_depth=10)
clf_dt.fit(X_train, y_train)
clf_dt.score(X_test, y_test)

0.9229098805646037

The single hold-out split gives an accuracy of about 92%. Because that figure depends on which rows happened to land in the test set, we can use shuffle-split cross-validation to get a more stable estimate.
Here we draw 20 random 80:20 train/test splits and report the mean and standard deviation of the scores.

In [61]:
shuffle_validator = cross_validation.ShuffleSplit(len(X), n_iter=20, test_size=0.2, random_state=0)
def test_classifier(clf):
    scores = cross_validation.cross_val_score(clf, X, y, cv=shuffle_validator)
    print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std()))

test_classifier(clf_dt)

Accuracy: 0.9199 (+/- 0.01)
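
In scikit-learn 0.20 and later, `ShuffleSplit` lives in `sklearn.model_selection` and takes `n_splits` instead of the data length and `n_iter`. A minimal sketch of the same procedure with that API, run on synthetic data (so the printed accuracy will not match the figure above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the spam features and labels
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# 20 random 80:20 splits, as in the cell above
splitter = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)

# One accuracy score per split
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=10, random_state=0), X_demo, y_demo, cv=splitter
)

print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std()))
```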


This shows that our decision tree classifier has about 92% accuracy across the splits. Let's try other classifiers.

In [62]:
clf_rf = ske.RandomForestClassifier(n_estimators=50)
test_classifier(clf_rf)


Accuracy: 0.9508 (+/- 0.01)


In [63]:
clf_gb = ske.GradientBoostingClassifier(n_estimators=50)
test_classifier(clf_gb)

Accuracy: 0.9385 (+/- 0.01)


In [64]:
eclf = ske.VotingClassifier([('dt', clf_dt), ('rf', clf_rf), ('gb', clf_gb)])
test_classifier(eclf)

Accuracy: 0.9447 (+/- 0.01)


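Based on these results, the random forest classifier has the highest accuracy (0.9508), ahead of the voting ensemble (0.9447), gradient boosting (0.9385) and the single decision tree (0.9199).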
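
The four evaluations can also be run in one loop so that every model is scored on exactly the same splits. This is a sketch using the `sklearn.model_selection` API and synthetic data, with fewer splits than above to keep it quick; the names `base`, `models` and `results` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the spam features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

base = [
    ("dt", DecisionTreeClassifier(max_depth=10, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]
models = dict(base)
# VotingClassifier uses hard (majority) voting by default
models["vote"] = VotingClassifier(base)

# Mean accuracy per model over the same set of splits
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=splitter).mean()
    for name, model in models.items()
}
best = max(results, key=results.get)

print(results)
print(best)
```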


## Create classification report

In [65]:
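# Create classification report

# Fit the random forest on the training split and predict the held-out rows
clf_rf.fit(X_train, y_train)
ynew = clf_rf.predict(X_test)

# Plain-text report with precision, recall, F1 and support per class
report = metrics.classification_report(y_test, ynew)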
def classification_report_csv(report):
    report_data = []
    lines = report.split('\n')
    for line in lines[2:4]:
        row = {}
        row_data = line.split('     ')
        row['class'] = str(row_data[2])
        row['precision'] = float(row_data[3])
        row['recall'] = float(row_data[4])
        row['f1_score'] = float(row_data[5])
        row['support'] = float(row_data[6])
        report_data.append(row)
    for line in lines[5:6]:
        row = {}
        row_data = line.split('     ')
        row['class'] = row_data[0]
        row['precision'] = float(row_data[1])
        row['recall'] = float(row_data[2])
        row['f1_score'] = float(row_data[3])
        row['support'] = float(row_data[4])
        report_data.append(row)
    dataframe = pd.DataFrame.from_dict(report_data)
    print(dataframe)

classification_report_csv(report)



         class  f1_score  precision  recall  support
0            0      0.96       0.95    0.97    565.0
1            1      0.94       0.95    0.92    356.0
2  avg / total      0.95       0.95    0.95    921.0
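
Parsing the text report by splitting on runs of spaces is fragile, because it depends on the exact layout of the string. In scikit-learn 0.20 and later, `classification_report` accepts `output_dict=True`, which avoids the parsing step. A sketch with made-up labels (note that the summary rows are named differently in those versions than in the output above):

```python
import pandas as pd
from sklearn.metrics import classification_report

# Made-up true and predicted labels for illustration
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Nested dict: one entry per class plus summary entries
report_dict = classification_report(y_true, y_pred, output_dict=True)

# "accuracy" is a single float rather than a row of metrics, so set it aside
accuracy = report_dict.pop("accuracy")

# One row per class or average, one column per metric
report_df = pd.DataFrame.from_dict(report_dict, orient="index")

print(report_df)
print(accuracy)
```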


In [66]:
# Report cross-validated accuracy and area under the ROC curve (AUC)
# for the random forest. test_classifier prints the accuracy itself
# and returns None, so its result is not stored.

test_classifier(clf_rf)

scores = cross_validation.cross_val_score(clf_rf, X, y, cv=shuffle_validator, scoring='roc_auc')

AUC_result = "AUC: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std())

Accuracy: 0.9509 (+/- 0.01)


In [67]:
print(AUC_result)

AUC: 0.985 (+/- 0.004)
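
For intuition, AUC can be read as the probability that a randomly chosen spam email receives a higher score than a randomly chosen non-spam email. A tiny worked example with made-up scores: of the four (spam, non-spam) pairs, three are ranked correctly, giving 3/4.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores for four emails
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Pairs (positive, negative): (0.35, 0.1) ok, (0.35, 0.4) wrong,
# (0.8, 0.1) ok, (0.8, 0.4) ok, so 3 of 4 pairs are ordered correctly
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```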
