## Data 620
### Spam detection - Tony Mei and Lin Li
It is often useful to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether a new document is spam. Here is one example of such data: UCI Machine Learning Repository: Spambase Data Set

http://archive.ics.uci.edu/ml/datasets/Spambase

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or taken from another source such as your own spam folder).

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
import string

In [2]:
url = "https://raw.githubusercontent.com/lincarrieli/DATA620-Web-Analytics/main/spambase_csv.csv"
uci_dataset = pd.read_csv(url)
uci_dataset.head(n=5)

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_%3B,char_freq_%28,char_freq_%5B,char_freq_%21,char_freq_%24,char_freq_%23,capital_run_length_average,capital_run_length_longest,capital_run_length_total,class
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


### Get Spam-ham file and clean the data

In [3]:
# Get and read csv file
url = "https://raw.githubusercontent.com/lincarrieli/DATA620-Web-Analytics/main/SMSSpamCollection"
spam_df = pd.read_csv(url, sep='\t', header = None)
spam_df.head(n=5)

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
# Add column names
spam_df.columns =['spam', 'text']

In [5]:
# Get number of rows and columns
spam_df.shape

(5572, 2)

In [6]:
# Remove duplicates
spam_df.drop_duplicates(inplace = True)

# Get new shape
spam_df.shape

(5169, 2)

In [7]:
# Check for missing data
spam_df.isnull().sum()

spam    0
text    0
dtype: int64

### Use NLP to clean the text

In [8]:
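# Build the stopword set once; calling stopwords.words('english') for every word is slow.
# Assumes the NLTK stopwords corpus is installed (nltk.download('stopwords')).
STOP_WORDS = set(stopwords.words('english'))

def process_text(text):
    # Remove punctuation
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    # Remove stopwords (case-insensitive) and return the remaining tokens
    return [word for word in nopunc.split() if word.lower() not in STOP_WORDS]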

In [9]:
# Apply process_text to the first five messages
spam_df['text'].head().apply(process_text)

0    [Go, jurong, point, crazy, Available, bugis, n...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3        [U, dun, say, early, hor, U, c, already, say]
4    [Nah, dont, think, goes, usf, lives, around, t...
Name: text, dtype: object
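
To make the two cleaning steps concrete, here is a minimal sketch that runs on a single message without needing the NLTK corpus. The stopword set `TOY_STOPWORDS` and the function name `clean_message` are hypothetical and exist only for this illustration; the notebook itself uses the NLTK English stopword list. The sketch also uses `str.translate`, which is one alternative way to strip punctuation.

```python
import string

# Hypothetical, tiny stopword set used only for this illustration
TOY_STOPWORDS = {"the", "a", "to", "is"}

def clean_message(text, stop_words=TOY_STOPWORDS):
    # Step 1: delete every punctuation character in one pass
    nopunc = text.translate(str.maketrans("", "", string.punctuation))
    # Step 2: drop stopwords, comparing in lowercase but keeping original case
    return [word for word in nopunc.split() if word.lower() not in stop_words]

tokens = clean_message("Hello, the prize is yours!!!")
print(tokens)  # ['Hello', 'prize', 'yours']
```

Note that the surviving tokens keep their original capitalization, so "Free" and "free" would count as different words unless the text is lowercased first.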

### Generate vectors for ML classification and prediction with scikit-learn

In [None]:
# create bag of words model and convert to token count matrix
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer = process_text)
spam_bow = cv.fit_transform(spam_df['text'])

# get shape of spam_bow
spam_bow.shape
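
As a rough illustration of what the token count matrix contains, the sketch below fits a `CountVectorizer` on three made-up messages. The corpus `toy_docs` is an assumption for this example only, and the vectorizer here uses scikit-learn's default tokenizer rather than `process_text`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, for illustration only
toy_docs = ["free prize now", "call me now", "free free entry"]

toy_cv = CountVectorizer()
toy_bow = toy_cv.fit_transform(toy_docs)  # sparse matrix: one row per document

# Column order follows the learned vocabulary (sorted alphabetically)
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
dense = toy_bow.toarray()

print(vocab)  # ['call', 'entry', 'free', 'me', 'now', 'prize']
print(dense)
# [[0 0 1 0 1 1]
#  [1 0 0 1 1 0]
#  [0 1 2 0 0 0]]
```

Each row is a message and each column is a vocabulary word, so the shape of the matrix is (number of messages, vocabulary size).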

In [None]:
# Split data for training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(spam_bow, spam_df['spam'], test_size = 0.2, random_state = 0)

In [None]:
# Fit Naive Bayes classifier to train dataset
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(X_train, y_train)
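
Once a classifier is fitted, a new message has to be converted with the same fitted vectorizer before it can be scored. The sketch below shows that step on a made-up four-message training set; `toy_texts`, `toy_labels`, and `new_msgs` are assumptions for this example and are not part of the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data, for illustration only
toy_texts = ["win free prize now", "free entry win cash",
             "see you at lunch", "call me after lunch"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_cv = CountVectorizer()
toy_clf = MultinomialNB().fit(toy_cv.fit_transform(toy_texts), toy_labels)

# Use transform (not fit_transform) so the new messages are mapped onto
# the vocabulary learned from the training data; unseen words are ignored
new_msgs = ["free cash prize", "lunch at noon"]
new_pred = toy_clf.predict(toy_cv.transform(new_msgs))
print(list(new_pred))  # ['spam', 'ham']
```

Calling `fit_transform` on the new messages instead would build a different vocabulary, and the resulting columns would no longer line up with what the classifier was trained on.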

In [None]:
# Predict on the training data and compare with the true labels
pred = classifier.predict(X_train)
print(pred)
print(y_train.values)

In [None]:
# Evaluate model on training dataset
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
print(classification_report(y_train, pred))

# Create confusion matrix
cm = confusion_matrix(y_train, pred)
print('Confusion Matrix: \n', cm)

# Get accuracy
accu = accuracy_score(y_train, pred)
print('Accuracy: ', accu)
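
For reading the confusion matrix, it may help to see how its cells map to error types. The sketch below uses six made-up labels (`y_true_toy` and `y_pred_toy` are assumptions for this example) and treats "spam" as the positive class. Rows are true labels and columns are predicted labels, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up labels, for illustration only
y_true_toy = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred_toy = ["ham", "ham", "spam", "spam", "ham", "ham"]

# Fix the label order explicitly: index 0 = ham, index 1 = spam
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=["ham", "spam"])
tn, fp, fn, tp = cm_toy.ravel()

print(cm_toy)
# [[3 1]    3 ham kept as ham,  1 ham flagged as spam (false positive)
#  [1 1]]   1 spam missed (false negative), 1 spam caught

acc_toy = accuracy_score(y_true_toy, y_pred_toy)  # (tn + tp) / total = 4 / 6
spam_precision = tp / (tp + fp)  # share of flagged messages that are really spam
spam_recall = tp / (tp + fn)     # share of real spam that was flagged
```

Accuracy alone can hide these two error types, which is why the classification report lists precision and recall per class.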

In [None]:
# Test model on test data
pred = classifier.predict(X_test)
print(pred)
print(y_test.values)

In [None]:
# Evaluate model on test data
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
print(classification_report(y_test, pred))

# Create confusion matrix
cm = confusion_matrix(y_test, pred)
print('Confusion Matrix: \n', cm)

# Get accuracy
accu = accuracy_score(y_test, pred)
print('Accuracy: ', accu)

#### The Naive Bayes model classified the test set with an accuracy of 95%, slightly below its accuracy on the training set but still very good.