## Data 620
### Document Classification - Tony Mei and Lin Li
It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  UCI Machine Learning Repository: Spambase Data Set

http://archive.ics.uci.edu/ml/datasets/Spambase

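For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). This notebook trains on the SMS Spam Collection file loaded below and then tries the model on subject lines from a personal Gmail spam folder.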

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
import string

### Get Spam-ham file and clean the data

In [2]:
# Get and read csv file
url = "https://raw.githubusercontent.com/lincarrieli/DATA620-Web-Analytics/main/SMSSpamCollection"
spam_df = pd.read_csv(url, sep='\t', header = None)
spam_df.head(n=5)

|   | 0 | 1 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
# Add column names
spam_df.columns =['spam', 'text']

In [4]:
# Get number of rows and columns
spam_df.shape

(5572, 2)

In [5]:
# Remove duplicates
spam_df.drop_duplicates(inplace = True)

# Get new shape
spam_df.shape

(5169, 2)

In [6]:
# Check for missing data
spam_df.isnull().sum()

spam    0
text    0
dtype: int64

### Use NLP to clean the text

In [7]:
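# Build the stop-word set once instead of calling stopwords.words('english') for every word
# (needs a one-time nltk.download('stopwords') if the corpus is not installed)
stop_words = set(stopwords.words('english'))

def process_text(text):
    # remove punctuation
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    # remove stopwords (case-insensitive) and return the remaining tokens
    clean_words = [word for word in nopunc.split() if word.lower() not in stop_words]
    return clean_words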

In [8]:
# Apply process_text function 
spam_df['text'].head().apply(process_text)

0    [Go, jurong, point, crazy, Available, bugis, n...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3        [U, dun, say, early, hor, U, c, already, say]
4    [Nah, dont, think, goes, usf, lives, around, t...
Name: text, dtype: object
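
The cleaning step can be checked on a single message without the NLTK corpus. The sketch below is a dependency-free variant: `process_text_fast` is a hypothetical name, and the small `STOP_WORDS` set is a stand-in assumption for `stopwords.words('english')`. It uses `str.translate` to drop punctuation in one pass.

```python
import string

# Stand-in stop-word set (assumption); in the notebook this would be
# set(stopwords.words('english')).
STOP_WORDS = {"i", "he", "so", "to", "in", "a", "the", "until", "only", "then"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def process_text_fast(text, stop_words=STOP_WORDS):
    """Strip punctuation, split on whitespace, drop stop words (case-insensitive)."""
    no_punct = text.translate(_PUNCT_TABLE)
    return [word for word in no_punct.split() if word.lower() not in stop_words]

tokens = process_text_fast("U dun say so early hor... U c already then say...")
print(tokens)
# ['U', 'dun', 'say', 'early', 'hor', 'U', 'c', 'already', 'say']
```

The result matches row 3 of the output above, since "so" and "then" are the only stop words in that message.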

### Generate vectors for ML classification and prediction with scikit-learn

In [9]:
# create bag of words model and convert to token count matrix
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer = process_text)
spam_bow = cv.fit_transform(spam_df['text'])

# get shape of spam_bow
spam_bow.shape

(5169, 11425)
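
The shape `(5169, 11425)` means one row per message and one column per distinct token. As a rough illustration of what a token count matrix holds, here is a pure-Python sketch on made-up toy documents. `to_count_row` is a hypothetical helper, and it builds dense lists, whereas `CountVectorizer` returns a sparse matrix.

```python
from collections import Counter

# Toy tokenized documents (assumption: already cleaned, as process_text would do).
docs = [
    ["free", "entry", "win", "cup"],
    ["ok", "lar", "joking"],
    ["win", "free", "free", "prize"],
]

# Vocabulary: every distinct token, sorted, mapped to a column index.
vocabulary = {
    token: col for col, token in enumerate(sorted({t for d in docs for t in d}))
}

def to_count_row(tokens, vocabulary):
    """Return one dense row of token counts; tokens outside the vocabulary are ignored."""
    row = [0] * len(vocabulary)
    for token, n in Counter(tokens).items():
        col = vocabulary.get(token)
        if col is not None:
            row[col] = n
    return row

matrix = [to_count_row(d, vocabulary) for d in docs]
print(sorted(vocabulary))
for row in matrix:
    print(row)
```

Each row sums to the number of tokens in that document, and a token that appears twice (here "free" in the third document) gets a count of 2 in its column.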

In [10]:
# Split data for training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(spam_bow, spam_df['spam'], test_size = 0.2, random_state = 0)
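
The split above is random, so the share of spam can differ a little between the training and test parts. With imbalanced classes a stratified split is a common alternative, and `train_test_split` accepts a `stratify` argument for that. The pure-Python sketch below only illustrates the idea on made-up labels; `stratified_split` is a hypothetical helper, not the scikit-learn implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # seeded so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (assumption): 80 ham and 20 spam messages.
labels = ["ham"] * 80 + ["spam"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] == "spam" for i in test_idx))
```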

In [11]:
# Fit Naive Bayes classifier to train dataset
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(X_train, y_train)
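
For intuition about what the classifier learns, multinomial Naive Bayes scores each class as its log prior plus the sum of smoothed log likelihoods of the tokens in the message. The sketch below is a simplified pure-Python version on made-up toy data, with Laplace smoothing `alpha=1.0` (the scikit-learn default); the function names are hypothetical and this is not the library's implementation.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-token log likelihoods from tokenized docs."""
    vocabulary = sorted({t for d in docs for t in d})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocabulary
        }
    return log_prior, log_likelihood

def predict(tokens, log_prior, log_likelihood):
    """Pick the class with the highest log score; tokens unseen in training are skipped."""
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][t] for t in tokens if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training data (assumption).
docs = [["free", "win", "prize"], ["free", "entry", "win"],
        ["ok", "see", "later"], ["call", "later", "ok"]]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = fit_multinomial_nb(docs, labels)
print(predict(["free", "prize", "now"], log_prior, log_likelihood))  # spam
print(predict(["ok", "later"], log_prior, log_likelihood))           # ham
```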

In [12]:
pred = classifier.predict(X_train)
print(pred)
print(y_train.values)

['ham' 'ham' 'ham' ... 'ham' 'spam' 'spam']
['ham' 'ham' 'ham' ... 'ham' 'spam' 'spam']


In [13]:
# Evaluate model on training dataset
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
print(classification_report(y_train, pred))

# Create confusion matrix
cm = confusion_matrix(y_train, pred)
print('Confusion Matrix: \n', cm)

# Get accuracy
accu = accuracy_score(y_train, pred)
print('Accuracy: ', accu)

              precision    recall  f1-score   support

         ham       1.00      1.00      1.00      3620
        spam       0.98      0.98      0.98       515

    accuracy                           1.00      4135
   macro avg       0.99      0.99      0.99      4135
weighted avg       1.00      1.00      1.00      4135

Confusion Matrix: 
 [[3612    8]
 [  12  503]]
Accuracy:  0.9951632406287787


In [14]:
# Test model on test data
pred = classifier.predict(X_test)
print(pred)
print(y_test.values)

['ham' 'ham' 'ham' ... 'ham' 'ham' 'ham']
['ham' 'ham' 'ham' ... 'ham' 'ham' 'ham']


In [15]:
# Evaluate model on test data
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
print(classification_report(y_test, pred))

# Create confusion matrix
cm = confusion_matrix(y_test, pred)
print('Confusion Matrix: \n', cm)

# Get accuracy
accu = accuracy_score(y_test, pred)
print('Accuracy: ', accu)

              precision    recall  f1-score   support

         ham       0.99      0.95      0.97       896
        spam       0.75      0.93      0.83       138

    accuracy                           0.95      1034
   macro avg       0.87      0.94      0.90      1034
weighted avg       0.96      0.95      0.95      1034

Confusion Matrix: 
 [[854  42]
 [  9 129]]
Accuracy:  0.9506769825918762


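#### The Naive Bayes model classified the test set with 95% accuracy, below the 99.5% it reached on the training set but still good. Ham is handled well (precision 0.99, recall 0.95); spam recall is 0.93, but spam precision is only 0.75 because 42 ham messages were flagged as spam.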
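
The spam row of the test report can be recomputed by hand from the test-set confusion matrix printed above. This is a small arithmetic check rather than new results; it assumes the usual scikit-learn layout, with rows as true labels and columns as predicted labels in sorted order (ham, spam).

```python
# Test-set confusion matrix from the output above: rows are true labels,
# columns are predicted labels, both in the order (ham, spam).
cm = [[854, 42],
      [9, 129]]

tn, fp = cm[0]  # ham kept as ham, ham wrongly flagged as spam
fn, tp = cm[1]  # spam missed, spam caught

accuracy = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy={accuracy:.4f} precision={spam_precision:.2f} "
      f"recall={spam_recall:.2f} f1={spam_f1:.2f}")
# prints: accuracy=0.9507 precision=0.75 recall=0.93 f1=0.83
```

The 42 false positives are what pull spam precision down to 129 / 171, while only 9 of 138 spam messages are missed.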

### Test model with spam messages

I downloaded the Spam folder from my personal Gmail account and used the model to predict whether its messages are spam. This lets us evaluate the model on a new set of data.

### Download spam mails

In [16]:
import mailbox
import email
import csv

In [17]:
# Set mbox path
mbox = mailbox.mbox('Spam.mbox')

In [19]:
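# Write one row per message: subject and Gmail labels
# (newline="" keeps the csv module from emitting blank rows on Windows)
with open("mbox.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for message in mbox:
        writer.writerow([message['subject'], message['X-Gmail-Labels']])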
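
To try the mbox-to-CSV step without a real export, the sketch below builds a tiny throwaway mbox with the standard library and writes it out the same way. The file names, subjects, and labels are made up for illustration. It also falls back to an empty string when a header is missing, since a missing header is returned as `None`.

```python
import csv
import mailbox
import os
import tempfile
from email.message import EmailMessage

# Throwaway paths and made-up messages (assumptions for this sketch).
workdir = tempfile.mkdtemp()
mbox_path = os.path.join(workdir, "Sample.mbox")
csv_path = os.path.join(workdir, "mbox.csv")

box = mailbox.mbox(mbox_path)  # creates the file if it does not exist
for subject, gmail_labels in [("BUSINESS ASSISTANCE", "Spam,Unread"), (None, "Spam")]:
    msg = EmailMessage()
    if subject is not None:
        msg["Subject"] = subject
    msg["X-Gmail-Labels"] = gmail_labels
    msg.set_content("body")
    box.add(msg)
box.flush()
box.close()

# Header lookup is case-insensitive; a missing header comes back as None.
with open(csv_path, "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    for message in mailbox.mbox(mbox_path):
        writer.writerow([message["subject"] or "", message["X-Gmail-Labels"] or ""])

with open(csv_path, newline="", encoding="utf-8") as infile:
    rows = list(csv.reader(infile))
print(rows)
```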
    

In [20]:
my_spam = pd.read_csv('mbox.csv', names=['subject', 'label'])

In [21]:
column_names = ["label", "subject"]

my_spam = my_spam.reindex(columns=column_names)

In [22]:
my_spam.head()

|   | label | subject |
|---|---|---|
| 0 | Spam,Category Updates,Unread | Amazing Sale! Get This Diet Product For a Huge... |
| 1 | Spam,Category Promotions,Unread | BUSINESS ASSISTANCE |
| 2 | Spam,Category Promotions,Unread | April EcoQuest - Check for Cherries |
| 3 | Spam,Category Updates,Unread | Walgreens still needs your input to improve pr... |
| 4 | Spam,Category Promotions,Unread | Investigate suspicious activity on your account |


In [23]:
my_spam['subject'].head().apply(process_text)

0    [Amazing, Sale, Get, Diet, Product, Huge, Disc...
1                               [BUSINESS, ASSISTANCE]
2                   [April, EcoQuest, Check, Cherries]
3    [Walgreens, still, needs, input, improve, prog...
4         [Investigate, suspicious, activity, account]
Name: subject, dtype: object

In [24]:
# Reuse cv, the vectorizer fitted on the SMS training text in In [9], so the columns
# match the 11425-token vocabulary the classifier was trained on. Fitting a new
# CountVectorizer on the subjects would give a different, incompatible feature space.
my_spam_bow = cv.transform(my_spam['subject'])

# get shape of my_spam_bow
my_spam_bow.shape

(17, 11425)

In [None]:
pred_my_spam = classifier.predict(my_spam_bow)
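
A classifier can only score rows whose columns match the vocabulary it was trained on. The sketch below uses made-up toy tokens and hypothetical helper names. It shows that encoding new text against the training vocabulary keeps the row width and column meaning, while refitting a vocabulary on the new text does not.

```python
def build_vocabulary(docs):
    """Map every distinct token to a column index, in sorted order."""
    return {tok: col for col, tok in enumerate(sorted({t for d in docs for t in d}))}

def encode(tokens, vocabulary):
    """Count tokens into a fixed-width row; tokens not in the vocabulary are dropped."""
    row = [0] * len(vocabulary)
    for token in tokens:
        col = vocabulary.get(token)
        if col is not None:
            row[col] += 1
    return row

# Toy data (assumption): two training messages and one new subject line.
train_docs = [["free", "win", "prize"], ["ok", "see", "later"]]
new_docs = [["win", "amazing", "sale"]]

train_vocab = build_vocabulary(train_docs)  # what the classifier was trained on
refit_vocab = build_vocabulary(new_docs)    # what refitting on new text would give

right = encode(new_docs[0], train_vocab)  # same width and columns as training
wrong = encode(new_docs[0], refit_vocab)  # different width, different column meaning
print(len(right), len(wrong))
print(right)
```

Only "win" survives in the correctly encoded row, because "amazing" and "sale" never appeared in training. Subject lines that share few tokens with the SMS vocabulary will therefore give the model little to work with.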
