#### Data 620 - Assignment 6<br>July 10, 2019<br>Team 2: <ul> <li>Anthony Munoz</li> <li>Katie Evers</li> <li>Juliann McEachern</li> <li>Mia Siracusa</li></ul>
<h1 align="center">"Document Classification"</h1>

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  

Here is one example of such data:  [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/ml/datasets/Spambase).

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).
For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

This assignment is due end of day on Wednesday, July 10th.  You may work in a small team if you want.

*NOTE: This is a two week assignment.*

### Dependencies

In [1]:
# data processing packages
import pandas as pd, numpy as np, os

# nltk packages
from nltk import stem; from nltk.corpus import stopwords

# sklearn packages
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

### Data Uploading 

We read the ham and spam data from a tab-delimited file in our GitHub repository and relabeled the columns. The shape of our dataframe and a preview of the data can be viewed below. 

In [2]:
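# read the tab-delimited file, skipping malformed lines
# (on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0)
df_data = pd.read_csv('https://raw.githubusercontent.com/Anth350z/620/master/Assignments/data/ham_spam_data', 
                      on_bad_lines='skip', delimiter="\t", header=None)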

# label columns
df_data.columns = ['label','email']

# preview data
print("data shape:",df_data.shape)
df_data.head(5)

data shape: (5572, 2)


| | label | email |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
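
Before modeling, it can help to check how many messages fall into each class, since a skewed split changes how accuracy should be read. The sketch below is illustrative only: it parses a few made-up tab-separated rows (not the real file) using the same `label` and `email` column names, then counts the labels.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for the real file
sample = (
    "ham\tsee you at lunch\n"
    "spam\twin a free prize now\n"
    "ham\trunning late, sorry\n"
    "ham\tcall me later\n"
)

# Same parsing options as the notebook: tab delimiter, no header row
df_sample = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                        names=["label", "email"])

# Absolute and relative class frequencies
counts = df_sample["label"].value_counts()
shares = df_sample["label"].value_counts(normalize=True)
print(counts)
print(shares.round(2))
```

Running the same two `value_counts` calls on `df_data['label']` would give the class balance of the actual dataset.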


### Data cleaning (Stemmer/Stopwords)

In this section, we used the stemmer and stop-word list from nltk to clean the text before modeling. We created a function that lowercases each message, removes stop words, and reduces the remaining words to their stems.

In [3]:
# keep only word stems & remove stop words (e.g., a, and, the)
stemmer = stem.SnowballStemmer('english')
# use a new name so the imported nltk `stopwords` module is not overwritten
stop_words = set(stopwords.words('english'))

# create data cleaning function
def data_cleaning(data):
    # lowercase the message
    data = data.lower()
    # drop stop words
    data = [word for word in data.split() if word not in stop_words]
    # stem the remaining words and rejoin them into one string
    data = " ".join([stemmer.stem(word) for word in data])
    return data

We applied the function to our dataframe. The results of our data_cleaning can be previewed below.

In [4]:
df_data['email'] = df_data['email'].apply(data_cleaning)
df_data.head()

| | label | email |
|---|---|---|
| 0 | ham | go jurong point, crazy.. avail bugi n great wo... |
| 1 | ham | ok lar... joke wif u oni... |
| 2 | spam | free entri 2 wkli comp win fa cup final tkts 2... |
| 3 | ham | u dun say earli hor... u c alreadi say... |
| 4 | ham | nah think goe usf, live around though |
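
Because `data_cleaning` splits on whitespace, punctuation stays attached to words (for example `point,` in the preview above), so a stop word followed by a comma would not be matched. One possible refinement, shown here only as a sketch, is to tokenize with a regular expression first. The `tokenize` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins for nltk's list, and stemming is left out to keep the example dependency-free.

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses nltk's English list
STOP_WORDS = {"i", "he", "to", "the", "a", "until", "only", "in"}

def tokenize(message):
    """Lowercase a message and return its word tokens without stop words."""
    # Keep runs of letters, digits, and apostrophes; punctuation is dropped
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenize("Go until jurong point, crazy.. Available only in"))
# ['go', 'jurong', 'point', 'crazy', 'available']
```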


### Create Training and Test Datasets

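We further prepared our data by applying a term frequency–inverse document frequency (TF-IDF) vectorizer to our email values. `TfidfVectorizer` converts each message into a numeric vector in which a term's weight grows with its frequency in the message and shrinks with the number of messages that contain it. With the settings we chose, it applies sublinear term-frequency scaling, drops terms that appear in more than half of the messages (`max_df=0.5`), and removes scikit-learn's built-in English stop words. 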
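
To make the weighting concrete, here is a small hand-rolled sketch on a made-up, already tokenized corpus. It uses sublinear term frequency `1 + ln(tf)`, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which to our understanding matches scikit-learn's defaults when `sublinear_tf=True`. The `tfidf` helper and the toy documents are assumptions for illustration, not part of our pipeline.

```python
import math
from collections import Counter

# Made-up tokenized messages for illustration
docs = [
    ["free", "entri", "win", "free"],
    ["ok", "lar", "joke"],
    ["free", "call", "ok"],
]

n_docs = len(docs)
# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    weights = {}
    for term, count in counts.items():
        tf = 1 + math.log(count)                                 # sublinear tf
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1  # smoothed idf
        weights[term] = tf * idf
    # Scale the vector to unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(docs[2])
print(weights)
```

In the third document, `call` appears in only one message, so it receives a larger weight than `free` or `ok`, which each appear in two.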

From there, we used the `train_test_split` function from `sklearn` to split our data 80/20 for training and testing purposes.

In [5]:
# prepare and split the data for training/testing purposes 
email = df_data['email'].values
label = df_data['label'].values

# TFIDF vectorizer 
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
email = vectorizer.fit_transform(email)

# split the data 80/20 into training and test sets, preserving class proportions
X_train, X_test, y_train, y_test = train_test_split(
    email, label, test_size=0.2, shuffle=True, random_state=0, stratify=label)
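
Note that the vectorizer in this cell is fit on the full corpus before the split, so the learned vocabulary and document frequencies also reflect the test messages. An alternative, sketched here on a made-up ten-message corpus, is to split the raw text first and fit the vectorizer on the training portion only. The toy messages and variable names are illustrative assumptions, not our data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up messages, five per class, for illustration only
texts = [
    "win free prize now", "free entry cash prize", "claim your free reward",
    "urgent prize waiting call", "free cash offer call now",
    "see you at lunch", "running late sorry", "call me after class",
    "ok see you soon", "are you home yet",
]
labels = ["spam"] * 5 + ["ham"] * 5

# Split the raw text first so the test messages stay unseen
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

# Learn the vocabulary and idf weights from the training messages only
vectorizer = TfidfVectorizer(sublinear_tf=True)
X_train = vectorizer.fit_transform(text_train)
# Reuse the fitted vocabulary to transform the test messages
X_test = vectorizer.transform(text_test)

print(X_train.shape, X_test.shape)
```

Words that occur only in the test messages are ignored by `transform`, which mirrors what would happen with genuinely new mail.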

### Naive Bayes

We first used the multinomial Naive Bayes classifier as our prediction model. Training and test accuracy both fared well, at 0.9796 and 0.9686 respectively. 

In [6]:
# set classifier 
clf = MultinomialNB()

# fit model
clf.fit(X_train, y_train)

# make predictions
prediction = clf.predict(X_test)

# score accuracy 
clf_train_accuracy = round(clf.score(X_train, y_train),4)
clf_test_accuracy = round(clf.score(X_test, y_test),4)

# print accuracy 
print("Training Accuracy: " + str(clf_train_accuracy))
print("Testing Accuracy: " + str(clf_test_accuracy))

Training Accuracy: 0.9796
Testing Accuracy: 0.9686
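
Accuracy alone can hide how well the less frequent class is handled, so it may be worth also reporting precision and recall for the `spam` label. Below is a minimal sketch with made-up labels (not our predictions); the `precision_recall` helper is hypothetical.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Compute precision and recall for the positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels for illustration
y_true = ["ham", "ham", "ham", "spam", "spam", "ham", "spam", "ham"]
y_pred = ["ham", "ham", "ham", "spam", "ham", "ham", "spam", "spam"]

precision, recall = precision_recall(y_true, y_pred)
print(round(precision, 3), round(recall, 3))
# 0.667 0.667
```

With a fitted model, `sklearn.metrics.classification_report(y_test, prediction)` reports the same quantities per class.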


### Random Forest Classifier

Our next attempt used an ensemble of decision trees, via scikit-learn's `RandomForestClassifier`, to classify ham and spam. The model classified the training set perfectly (accuracy 1.0) and improved test accuracy slightly, from 0.9686 to 0.9722. The gap between the two scores suggests some overfitting to the training data.

In [7]:
# set classifier 
RDC = RandomForestClassifier(n_estimators=100)

# fit model
RDC.fit(X_train,y_train)

# make predictions
prediction = RDC.predict(X_test)

# score accuracy 
RDC_train_accuracy = round(RDC.score(X_train, y_train),4)
RDC_test_accuracy = round(RDC.score(X_test, y_test),4)

# print accuracy 
print("Training Accuracy: " + str(RDC_train_accuracy))
print("Testing Accuracy: " + str(RDC_test_accuracy))

Training Accuracy: 1.0
Testing Accuracy: 0.9722
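
Because a single 80/20 split can be noisy, cross-validation may give a steadier accuracy estimate. The sketch below uses synthetic data from `make_classification` as a stand-in for our TF-IDF matrix and fixes `random_state` so the results repeat; both choices are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 samples, 20 features, two classes
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Fixing random_state makes the forest reproducible across runs
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Five-fold cross-validated accuracy
scores = cross_val_score(forest, X, y, cv=5, scoring="accuracy")
print(scores.mean().round(4), scores.std().round(4))
```

Without a fixed `random_state`, as in our cell above, the reported accuracies can vary slightly from run to run.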


### Stochastic Gradient Descent (SGD)

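Our final method used a linear classifier trained with stochastic gradient descent (`SGDClassifier`). Training accuracy was slightly lower than the random forest's (0.9993 versus 1.0), but this method achieved the best test accuracy of our three models (0.9794). 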

In [8]:
# set classifier 
SGD = SGDClassifier(max_iter=1000,tol=0.001)

# fit model
SGD.fit(X_train,y_train)

# make predictions
prediction = SGD.predict(X_test)

# score accuracy 
SGD_train_accuracy = round(SGD.score(X_train, y_train),4)
SGD_test_accuracy = round(SGD.score(X_test, y_test),4)

# print accuracy 
print("Training Accuracy: " + str(SGD_train_accuracy))
print("Testing Accuracy: " + str(SGD_test_accuracy))

Training Accuracy: 0.9993
Testing Accuracy: 0.9794
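
To compare the three models side by side, the accuracies printed in the sections above can be gathered in one place. This is a small bookkeeping sketch; the `results` dictionary simply restates the reported numbers.

```python
# Accuracies reported in the sections above: (training, testing)
results = {
    "Naive Bayes": (0.9796, 0.9686),
    "Random Forest": (1.0, 0.9722),
    "SGD": (0.9993, 0.9794),
}

# The gap between training and testing accuracy hints at overfitting
for name, (train_acc, test_acc) in results.items():
    print(f"{name}: test={test_acc:.4f}, gap={train_acc - test_acc:.4f}")

# Model with the highest testing accuracy
best = max(results, key=lambda name: results[name][1])
print("Best on the test set:", best)
```

On these numbers, SGD has the highest test accuracy, Naive Bayes the smallest train/test gap, and the random forest the largest.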


In [1]:
%%html

<div style="position: relative; padding-bottom: 62.5%; height: 0;"><iframe src="https://www.loom.com/embed/1490ea4657de4ffdbcc4e88970116318" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>

# References
1. https://www.datacamp.com/community/blog/text-mining-in-r-and-python-tips
2. https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk
3. https://towardsdatascience.com/spam-or-ham-introduction-to-natural-language-processing-part-2-a0093185aebd
4. https://gtraskas.github.io/post/spamit/
5. http://archive.ics.uci.edu/ml/datasets/Spambase