## Create a SpamClassifier leveraging the UCI SMS Spam Collection Dataset

The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/).

## Step 1

We will use the pandas library to read the dataset. The data has been downloaded from the repository and uploaded to an S3 bucket (sagemaker-us-east-1-XXXXXXXXXXXX). The file is tab-separated: each line holds a label (**spam** or **ham**) followed by the message text.

In [4]:
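# Load the tab-separated dataset from S3 into a DataFrame
# (note: reading an s3:// path with pandas requires the s3fs package)
import pandas as pd

df = pd.read_csv('s3://XXXXXXXXXXXXXXXXXX-XXXXXXX-XXXXXX', sep='\t', names=["label", "messages"])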
print('The shape of the dataset is:', df.shape)

The shape of the dataset is: (5572, 2)
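To make the roles of `sep='\t'` and `names=[...]` concrete, here is a minimal, dependency-free sketch that parses the same "label, tab, message" layout with the standard library. The two sample lines are made up for illustration and are not rows from the dataset; `csv.reader` stands in for `pd.read_csv` here.

```python
import csv
import io

# Two made-up lines in the "label<TAB>message" layout (not real dataset rows).
sample = "ham\tSee you at lunch\nspam\tWin a free prize now\n"

names = ["label", "messages"]  # same column names as passed to read_csv
reader = csv.reader(io.StringIO(sample), delimiter="\t")

# Pair each field with its column name, one dict per line.
rows = [dict(zip(names, fields)) for fields in reader]

for row in rows:
    print(row)
```

Because the file has no header row, the column names have to be supplied explicitly, which is what `names` does in the pandas call.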


In [6]:
# The data now has two columns: label (the dependent variable) and messages (the independent variable)
df.head(5)

|   | label | messages |
|---|-------|----------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Step 2
Data Cleaning and preprocessing

In [5]:
# import necessary libraries including the stopwords
import re
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer  # for stemming (reducing words to their root form)


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
# Clean up the messages column and apply some preprocessing
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once; avoids reloading the list for every word
corpus = []  # cleaned messages
for message in df['messages']:
    process = re.sub('[^a-zA-Z]', ' ', message)  # replace every non-letter character with a space
    process = process.lower()  # convert to lowercase
    process = process.split()  # split the sentence into a list of words

    # Drop stop words and stem the remaining words
    process = [ps.stem(word) for word in process if word not in stop_words]
    process = ' '.join(process)
    corpus.append(process)
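The effect of the cleaning steps is easier to see on a single message. The sketch below applies the same regex, lowercasing, splitting, and stop-word filtering, but it uses a tiny hypothetical stop-word set instead of NLTK's English list and skips stemming, so it runs without any downloads. The `clean` helper and `STOPWORDS` are names introduced only for this illustration.

```python
import re

# Hypothetical mini stop-word list; NLTK's English list is much longer.
STOPWORDS = {"in", "a", "to"}

def clean(message):
    """Keep letters only, lowercase, tokenize, and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", message)  # digits and punctuation become spaces
    tokens = letters_only.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

tokens = clean("Free entry in 2 a wkly comp to win FA Cup")
print(tokens)
# ['free', 'entry', 'wkly', 'comp', 'win', 'fa', 'cup']
```

Note that the digit `2` disappears along with the punctuation, since the regex keeps only the letters a-z and A-Z.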

## Why Stemming?

Stemming is a reasonable first step: start with it, look at the results, and move to lemmatization only if needed. Lemmatization is a heavier operation and usually takes longer, depending on the size of the corpus.

## Step 3

Create the Bag of Words model. To improve accuracy, you may use TF-IDF.

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000)  # start with the 5000 most frequent terms as columns
X = cv.fit_transform(corpus).toarray()

Y = pd.get_dummies(df['label'], dtype=int)  # one-hot encode the labels into 0/1 columns: ham, spam
Y = Y.iloc[:, 1].values  # keep only the spam column (1 = spam, 0 = ham); the ham column is redundant
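As a rough illustration of what a Bag of Words matrix contains, and of how TF-IDF reweights it, the following standard-library sketch builds both by hand for three made-up, already-cleaned messages. The IDF formula used, `ln((1 + n) / (1 + df)) + 1`, is an assumption based on the smoothed variant described in the scikit-learn documentation; scikit-learn additionally normalizes each row, which is omitted here.

```python
import math
from collections import Counter

# Made-up, already-cleaned messages (not dataset rows).
docs = ["free prize win", "free entry", "see you later"]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Bag of Words: one row per message, one column per vocabulary word.
counts = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Document frequency: the number of messages each word appears in.
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Smoothed IDF (assumed formula): ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

# TF-IDF: scale each count by the IDF weight of its column.
tfidf = [[c * w for c, w in zip(row, idf)] for row in counts]

print(vocab)
print(counts[0])
print([round(v, 3) for v in tfidf[0]])
```

In the first message, "free" ends up with a lower weight than "prize" because it also appears in the second message. That down-weighting of common words is the reason TF-IDF can help over raw counts.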

## Step 4

Split Train/Test

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.20, random_state = 0)
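The arithmetic behind `test_size = 0.20` can be checked by hand. The sketch below assumes the test-set size is rounded up, which is consistent with the total support of 1115 shown in the Step 6 report. The seeded shuffle only illustrates the idea of a reproducible split; it is not scikit-learn's actual shuffling procedure and will not reproduce the same rows.

```python
import math
import random

n_samples = 5572   # row count printed in Step 1
test_size = 0.20

n_test = math.ceil(test_size * n_samples)  # assumed to round up
n_train = n_samples - n_test
print(n_train, n_test)
# 4457 1115

# Seeded shuffle of the row indices, then slice into two disjoint sets.
indices = list(range(n_samples))
random.Random(0).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```

Fixing the seed (the role `random_state = 0` plays in the cell above) makes the split repeatable across runs.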

## Step 5

Naive Bayes is a probabilistic classification technique based on Bayes' theorem. It often performs well on text classification tasks such as spam detection.

In [21]:
# Train the model using a multinomial Naive Bayes classifier

from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X_train, Y_train)


In [15]:
# Predict labels for the held-out test set
y_pred = spam_detect_model.predict(X_test)
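To show what "based on probability" means in practice, here is a minimal from-scratch sketch of multinomial Naive Bayes on four made-up messages. It is not scikit-learn's implementation. It assumes add-one (Laplace) smoothing with `alpha=1.0` and ignores words not seen during training; `log_posterior` and `predict` are helper names introduced for this example.

```python
import math
from collections import Counter, defaultdict

# Made-up training messages: (tokens, label) with 1 = spam, 0 = ham.
train = [
    (["free", "prize", "win"], 1),
    (["free", "entry"], 1),
    (["see", "you", "later"], 0),
    (["call", "you", "later"], 0),
]

vocab = {w for tokens, _ in train for w in tokens}
class_docs = Counter(label for _, label in train)  # messages per class
word_counts = defaultdict(Counter)                 # word counts per class
for tokens, label in train:
    word_counts[label].update(tokens)

def log_posterior(tokens, label, alpha=1.0):
    """log P(class) + sum of log P(word | class), with add-alpha smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))
    for w in tokens:
        if w in vocab:  # words never seen in training are ignored
            score += math.log(
                (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
            )
    return score

def predict(tokens):
    """Return the class with the highest log posterior."""
    return max(class_docs, key=lambda label: log_posterior(tokens, label))

print(predict(["free", "prize"]))  # 1 (spam)
print(predict(["see", "you"]))     # 0 (ham)
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log probabilities can simply be added.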

## Step 6

Use a 2 x 2 confusion matrix, a classification report, and the accuracy score to evaluate the model on the test set.

In [24]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Compare the predictions on the test set with the true labels
print(confusion_matrix(Y_test, y_pred))
print(classification_report(Y_test, y_pred))
print("Accuracy (validation):", accuracy_score(Y_test, y_pred))

[[946   9]
 [  7 153]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       955
           1       0.94      0.96      0.95       160

    accuracy                           0.99      1115
   macro avg       0.97      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115

Accuracy (validation): 0.9856502242152466
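The numbers in the report can be recomputed directly from the confusion matrix, which is a useful sanity check. This sketch assumes the usual scikit-learn layout, where rows are true classes and columns are predicted classes, with class 1 (spam) treated as the positive class.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
tn, fp = 946, 9    # true ham:  predicted ham, predicted spam
fn, tp = 7, 153    # true spam: predicted ham, predicted spam

precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision, 2), round(recall, 2), round(f1, 2))
# 0.94 0.96 0.95
print(accuracy)
```

In other words, 9 legitimate messages were flagged as spam and 7 spam messages slipped through, out of 1115 test messages.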
