## DATA 620 Assignment - Document Classification
##### Team: Mia Chen & Wei Zhou
##### Date: 7/10/2020

For this assignment, we are going to use the [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset/home) which tags 5,574 text messages based on whether they are "spam" or "ham" (not spam). We will build a classifier to predict whether a new text message is spam or ham.

In [22]:
# Read data from a csv file
import pandas as pd

data = pd.read_csv('spam.csv', encoding = 'latin-1')
data.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |


The first column labels each message as spam or ham, and the second column holds the message text. We will drop the other columns and rename these two.

In [23]:
# Keep only the label and text columns, then rename them
data = data[['v1', 'v2']]
data = data.rename(columns = {'v1':'class', 'v2':'text'})

data.head()

| | class | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [24]:
# Count spam
len(data[data['class']=='spam'])

747

In [25]:
# Percentage of spam
len(data[data['class']=='spam'])/len(data)

0.13406317300789664

Of the 5,572 messages loaded from the file, about 13.4% (747 messages) are spam.

## Data Cleaning

In [26]:
# Convert text to lower case
def lower_case(msg):
    msg = msg.lower()
    return msg

data['text'] = data['text'].apply(lower_case)

We will split our data into a training set and a testing set, with 10% of the data held out for testing. We will then use the TF-IDF vectorizer, fitted on the training set only, to weight each term by how important it is to a message relative to the rest of the collection.

In [27]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['class'], test_size = 0.1, random_state = 1)

# Train the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(X_train)
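To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch on a hypothetical three-message corpus (the messages are made up for illustration and are not from the dataset). It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is our understanding of the `TfidfVectorizer` defaults. Tokenization is simplified to whitespace splitting, so the numbers will not necessarily match the vectorizer's output on real text.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the SMS dataset
docs = [
    "free entry win prize",
    "ok see you later",
    "win free cash now",
]

# Simplified tokenization: split on whitespace (an assumption of this sketch)
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(tokens):
    # Raw term count times idf, then scaled to unit Euclidean length
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

Terms that appear in several messages ("free", "win") receive a lower weight than terms unique to one message ("entry", "prize"), which is the behavior the classifier relies on.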

### SVM Classifier
We will use the Support Vector Machine (SVM) classifier, since it is a classification algorithm that works well for two-group problems such as spam versus ham.

In [28]:
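# Train the classifier
from sklearn.svm import SVC

# Use a separate name so the sklearn.svm module is not shadowed
clf = SVC(C=1000)

clf.fit(X_train, y_train)

# Test the classifier against the test set
from sklearn.metrics import confusion_matrix

# Reuse the vectorizer fitted on the training data; do not refit it here
X_test = vectorizer.transform(X_test)

y_pred = clf.predict(X_test)

# Rows are true labels, columns are predicted labels (ham, then spam)
print(confusion_matrix(y_test, y_pred))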



[[490   0]
 [ 10  58]]


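In the confusion matrix, rows are true labels and columns are predicted labels, in the order ham, spam. It tells us that 490 messages are correctly classified as ham and 58 are correctly classified as spam, while 10 spam messages are misclassified as ham and no ham message is flagged as spam. That is 548 correct predictions out of 558, or about 98% accuracy!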
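Because only about 13% of the messages are spam, accuracy alone can look better than it is. The sketch below recomputes accuracy from the four counts in the confusion matrix above and adds spam precision, spam recall, and the accuracy of a trivial "always predict ham" baseline. The variable names are ours, chosen for illustration.

```python
# Counts copied from the confusion matrix above
# (rows: true ham, true spam; columns: predicted ham, predicted spam)
ham_as_ham, ham_as_spam = 490, 0
spam_as_ham, spam_as_spam = 10, 58

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam

# Share of all test messages that were classified correctly
accuracy = (ham_as_ham + spam_as_spam) / total

# Of the messages predicted as spam, how many really are spam
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)

# Of the true spam messages, how many were caught
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

# Accuracy of a baseline that labels every message as ham
baseline_accuracy = (ham_as_ham + ham_as_spam) / total

print(f"accuracy:          {accuracy:.3f}")           # 0.982
print(f"spam precision:    {spam_precision:.3f}")     # 1.000
print(f"spam recall:       {spam_recall:.3f}")        # 0.853
print(f"baseline accuracy: {baseline_accuracy:.3f}")  # 0.878
```

The classifier never flags a ham message as spam, but it misses roughly 15% of the spam in the test set, and the baseline already reaches about 88% accuracy without looking at the text.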

source: https://towardsdatascience.com/spam-or-ham-introduction-to-natural-language-processing-part-2-a0093185aebd