# Overview

**Context**

The SMS Spam Collection is a set of SMS tagged messages that have been collected for SMS Spam research. 
It contains one set of SMS messages in English of 5,574 messages, tagged acording being ham (legitimate) or spam.

**Content**

The files contain one message per line. Each line is composed by two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

# Approach

- Loading Data

- Input and Output Data

- Applying Regular Expression

- Each word to lower case

- Splitting words to Tokenize

- Stemming with PorterStemmer handling Stop Words

- Preparing Messages with Remaining Tokens

- Preparing WordVector Corpus

- Applying Classification

In [1]:
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [2]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

# DATA

In [3]:
df = pd.read_csv('spam.csv', encoding='latin-1')
columns_to_drop = ['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']
df = df.drop(columns=[col for col in columns_to_drop if col in df.columns])


In [4]:
df.head()

Unnamed: 0,label,message
0,ham,"Hey, how's it going? Wanna grab lunch today?"
1,ham,Don't forget the meeting at 3 PM.
2,spam,You have won a $1000 gift card! Text WIN to 12...
3,ham,Are you coming to the party tonight?
4,spam,Congratulations! You've been selected for a fr...


In [5]:
# Replace ham with 0 and spam with 1
df = df.replace(['ham','spam'],[0, 1]) 

In [6]:
df.head()

Unnamed: 0,label,message
0,0,"Hey, how's it going? Wanna grab lunch today?"
1,0,Don't forget the meeting at 3 PM.
2,1,You have won a $1000 gift card! Text WIN to 12...
3,0,Are you coming to the party tonight?
4,1,Congratulations! You've been selected for a fr...


#### Count the number of words in each Text

In [7]:
df['Count'] = 0
for i in np.arange(0, len(df.message)):
    df.loc[i, 'Count'] = len(df.loc[i, 'message'])

In [8]:
df.head()

Unnamed: 0,label,message,Count
0,0,"Hey, how's it going? Wanna grab lunch today?",44
1,0,Don't forget the meeting at 3 PM.,33
2,1,You have won a $1000 gift card! Text WIN to 12...,62
3,0,Are you coming to the party tonight?,36
4,1,Congratulations! You've been selected for a fr...,68


In [11]:
# Rename columns to ensure 'v1' is correctly named
df.columns = ['label', 'message']

# Total ham(0) and spam(1) messages
df['label'].value_counts()

ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      66 non-null     int64 
 1   v2      66 non-null     object
dtypes: int64(1), object(1)
memory usage: 1.2+ KB


In [42]:
corpus = []
ps = PorterStemmer()

In [122]:
# Original Messages

print (df['message'][0])
print (df['message'][1])

Hey, how's it going? Wanna grab lunch today?
Don't forget the meeting at 3 PM.


## Processing Messages

In [121]:
for i in range(0, len(df)):

    # Applying Regular Expression
    
    '''
    Replace email addresses with 'emailaddr'
    Replace URLs with 'httpaddr'
    Replace money symbols with 'moneysymb'
    Replace phone numbers with 'phonenumbr'
    Replace numbers with 'numbr'
    '''
    msg = df['message'][i]
    msg = re.sub('\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', df['message'][i])
    msg = re.sub('(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr', df['message'][i])
    msg = re.sub('£|\$', 'moneysymb', df['message'][i])
    msg = re.sub('\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b', 'phonenumbr', df['message'][i])
    msg = re.sub('\d+(\.\d+)?', 'numbr', df['message'][i])
    
    ''' Remove all punctuations '''
    msg = re.sub('[^\w\d\s]', ' ', df['message'][i])
    
    if i<2:
        print("\t\t\t\t MESSAGE ", i)
    
    if i<2:
        print("\n After Regular Expression - Message ", i, " : ", msg)
    
    # Each word to lower case
    msg = msg.lower()    
    if i<2:
        print("\n Lower case Message ", i, " : ", msg)
    
    # Splitting words to Tokenize
    msg = msg.split()    
    if i<2:
        print("\n After Splitting - Message ", i, " : ", msg)
    
    # Stemming with PorterStemmer handling Stop Words
    msg = [ps.stem(word) for word in msg if not word in set(stopwords.words('english'))]
    if i<2:
        print("\n After Stemming - Message ", i, " : ", msg)
    
    # preparing Messages with Remaining Tokens
    msg = ' '.join(msg)
    if i<2:
        print("\n Final Prepared - Message ", i, " : ", msg, "\n\n")
    
    # Preparing WordVector Corpus
    corpus.append(msg)

				 MESSAGE  0

 After Regular Expression - Message  0  :  Hey  how s it going  Wanna grab lunch today 

 Lower case Message  0  :  hey  how s it going  wanna grab lunch today 

 After Splitting - Message  0  :  ['hey', 'how', 's', 'it', 'going', 'wanna', 'grab', 'lunch', 'today']

 After Stemming - Message  0  :  ['hey', 'go', 'wanna', 'grab', 'lunch', 'today']

 Final Prepared - Message  0  :  hey go wanna grab lunch today 


				 MESSAGE  1

 After Regular Expression - Message  1  :  Don t forget the meeting at 3 PM 

 Lower case Message  1  :  don t forget the meeting at 3 pm 

 After Splitting - Message  1  :  ['don', 't', 'forget', 'the', 'meeting', 'at', '3', 'pm']

 After Stemming - Message  1  :  ['forget', 'meet', '3', 'pm']

 Final Prepared - Message  1  :  forget meet 3 pm 




  msg = re.sub('\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', df['message'][i])
  msg = re.sub('(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr', df['message'][i])
  msg = re.sub('£|\$', 'moneysymb', df['message'][i])
  msg = re.sub('\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b', 'phonenumbr', df['message'][i])
  msg = re.sub('\d+(\.\d+)?', 'numbr', df['message'][i])
  msg = re.sub('[^\w\d\s]', ' ', df['message'][i])


In [120]:
cv = CountVectorizer()
x = cv.fit_transform(corpus).toarray()

x

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

# Applying Classification

- Input : Prepared Sparse Matrix
- Ouput : Labels (Spam or Ham)

In [104]:
y = df['label']
print (y.value_counts())

print(y[0])
print(y[1])

label
ham     37
spam    29
Name: count, dtype: int64
ham
ham


### Encoding Labels

In [114]:
le = LabelEncoder()
y = le.fit_transform(y)

print(y[0])
print(y[1])

0
0


### Splitting to Training and Testing DATA

In [118]:
y

array([0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

In [123]:
# Ensure x and y have the same number of samples
x = x[:len(y)]

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20, random_state=0)


# Applying Guassian Naive Bayes

In [124]:
bayes_classifier = GaussianNB()
bayes_classifier.fit(xtrain, ytrain)

In [125]:
# Predicting
y_pred = bayes_classifier.predict(xtest)

## Results

In [126]:
# Evaluating
cm = confusion_matrix(ytest, y_pred)

In [127]:
cm

array([[5, 1],
       [0, 8]])

In [128]:
print ("Accuracy : %0.5f \n\n" % accuracy_score(ytest, bayes_classifier.predict(xtest)))
print (classification_report(ytest, bayes_classifier.predict(xtest)))

Accuracy : 0.92857 


              precision    recall  f1-score   support

           0       1.00      0.83      0.91         6
           1       0.89      1.00      0.94         8

    accuracy                           0.93        14
   macro avg       0.94      0.92      0.93        14
weighted avg       0.94      0.93      0.93        14



# Applying Decision Tree

In [129]:
dt = DecisionTreeClassifier(random_state=50)
dt.fit(xtrain, ytrain)

In [130]:
# Predicting
y_pred_dt = dt.predict(xtest)

## Results

In [135]:
# Evaluating
cm = confusion_matrix(ytest, y_pred_dt)

print(cm)

[[5 1]
 [4 4]]


In [136]:
print ("Accuracy : %0.5f \n\n" % accuracy_score(ytest, dt.predict(xtest)))
print (classification_report(ytest, dt.predict(xtest)))

Accuracy : 0.64286 


              precision    recall  f1-score   support

           0       0.56      0.83      0.67         6
           1       0.80      0.50      0.62         8

    accuracy                           0.64        14
   macro avg       0.68      0.67      0.64        14
weighted avg       0.70      0.64      0.64        14



# Final Accuracy

- **Decision Tree : 96.861%**
- **Guassian NB   : 87.085%**   

Thanks for having a look :) ....Please give my kernel an **UPVOTE** 