<a href="https://colab.research.google.com/github/krystal826/Natural-Language-Processing/blob/main/SMS_Spam_Classification_Latest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SMS Spam Classification

Natural Language Processing Using Machine Learning
https://medium.com/analytics-vidhya/sms-spam-classifier-natural-language-processing-1751e2b324ed

In [1]:
#Loading the libraries
import pandas as pd
import numpy as np

## 1.0 Reading the Data

In [3]:
#Reading the file: it is tab-separated (tsv) and has no header row,
#so the names of the columns are specified while reading
messages = pd.read_csv('SMSSpamCollection',sep='\t',names = ['Label','Message'])

#First five rows
messages.head()

| | Label | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## 2.0 Exploratory Data Analysis

**There are 5572 rows and 2 columns: each row is one message, and the columns are named “Label” and “Message”. There are no missing values in the data.**

In [4]:
#Info about the data
messages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Label    5572 non-null   object
 1   Message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [5]:
#Finding missing values
messages.isnull().sum()

Label      0
Message    0
dtype: int64

In [6]:
#Shape of the dataframe
messages.shape

(5572, 2)

In [7]:
#Target variables counts
messages['Label'].value_counts()

#Data is imbalanced but for now we will continue with this

ham     4825
spam     747
Name: Label, dtype: int64
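
To put the imbalance in numbers, the counts above can be turned into class shares. The sketch below uses only the two counts printed by `value_counts()`; the variable names are illustrative. It also prints the accuracy that a trivial "always ham" classifier would reach, which is a useful reference point when reading accuracy scores.

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# Share of each class in the dataset
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Accuracy of a model that labels every message as ham
baseline_accuracy = counts["ham"] / total
print(f"always-ham baseline accuracy: {baseline_accuracy:.3f}")
```

Roughly 13% of the messages are spam, so accuracy alone can look high even when many spam messages are missed.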

## 3.0 Data Preprocessing

**Calculating length of message (number of characters)**

In [8]:
#Calculating length of message (number of characters)
length=[len(message) for message in messages['Message']]

In [9]:
length

[111,
 29,
 155,
 49,
 61,
 ...]

In [10]:
#Adding Length column to the dataframe
messages['Length']=length

In [11]:
messages.head()

| | Label | Message | Length |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 |


**Calculating Punctuations in each message**

In [12]:
#Calculating Punctuations in each message

import string
punct=[]
for message in messages['Message']:
    #Count the characters of this message that are punctuation marks
    count=sum(1 for ch in message if ch in string.punctuation)
    punct.append(count)
    

In [13]:
punct

[9,
 6,
 6,
 6,
 2,
 ...]
 

In [14]:
#Adding punctuation length column to dataframe
messages["Punctuation"]=punct
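
As an alternative to building the two lists by hand, both features can be computed directly on the column. This is only a sketch on a toy dataframe (the name `toy` and the second message are made up for illustration); the first message is row 1 of the dataset, so the results can be compared with the values 29 and 6 shown above.

```python
import string
import pandas as pd

# Toy stand-in for the real dataframe (second message is hypothetical)
toy = pd.DataFrame({"Message": ["Ok lar... Joking wif u oni...", "WINNER!! Call now."]})

# Number of characters per message, without an explicit Python loop
toy["Length"] = toy["Message"].str.len()

# Number of punctuation characters per message
punctuation = set(string.punctuation)
toy["Punctuation"] = toy["Message"].apply(
    lambda msg: sum(ch in punctuation for ch in msg)
)

print(toy)
```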

<br>

### 3.1 Text Cleaning

In [25]:
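#Regex
import re

#NLTK itself (needed for nltk.download)
import nltk

#Stopwords
#(the stopword list must also be available, e.g. via nltk.download('stopwords'))
from nltk.corpus import stopwords

#Lemmatization
from nltk.stem import WordNetLemmatizer
#Creating object for Lemmatizer
lemmatizer = WordNetLemmatizer()
nltk.download('wordnet')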

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [26]:
#Removal of extra characters and stop words and lemmatization
corpus = []

#Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))

#Going through every message in the dataframe
for i in range(0,len(messages)):
    #Replace every character that is not a letter with a space
    words = re.sub('[^a-zA-Z]',' ',messages['Message'][i])
    words = words.lower()
    #Splits into list of words 
    words = words.split()
    
    #Lemmatizing the word and removing the stopwords
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    
    #Again join words to form sentences
    words = ' '.join(words)
    
    corpus.append(words)

In [27]:
#What's in Corpus
corpus[0]

'go jurong point crazy available bugis n great world la e buffet cine got amore wat'

In [28]:
corpus[1]

'ok lar joking wif u oni'
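
To see what the regex, lowercasing and stopword steps do on their own, here is a small sketch that needs no NLTK data. It is a simplification: `STOP_WORDS` is a tiny hand-written set (the notebook uses NLTK's full English list), `basic_clean` is a hypothetical helper name, and lemmatization is left out.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list
STOP_WORDS = {"you", "have", "a", "of", "to", "on", "in", "the"}

def basic_clean(text):
    # Replace anything that is not a letter with a space (drops digits and punctuation)
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace
    tokens = letters_only.lower().split()
    # Drop stopwords and join back into a single string
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

print(basic_clean("Free entry in 2 a wkly comp to win FA Cup!"))
```

Note that the pattern `[^a-zA-Z]` also removes digits, so numbers such as prices or phone numbers do not survive the cleaning.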

In [29]:
#Replacing Original Message with the Transformed Messages
messages['Message'] = corpus

In [30]:
messages.head()

| | Label | Message | Length | Punctuation |
|---|---|---|---|---|
| 0 | ham | go jurong point crazy available bugis n great ... | 111 | 9 |
| 1 | ham | ok lar joking wif u oni | 29 | 6 |
| 2 | spam | free entry wkly comp win fa cup final tkts st ... | 155 | 6 |
| 3 | ham | u dun say early hor u c already say | 49 | 6 |
| 4 | ham | nah think go usf life around though | 61 | 2 |


<br>

### 3.2 Analyzing the difference between Spam and Ham messages

In [31]:
spam_messages = messages[messages['Label'] == 'spam']
ham_messages = messages[messages['Label'] == 'ham']

In [32]:
spam_messages.head()

| | Label | Message | Length | Punctuation |
|---|---|---|---|---|
| 2 | spam | free entry wkly comp win fa cup final tkts st ... | 155 | 6 |
| 5 | spam | freemsg hey darling week word back like fun st... | 147 | 8 |
| 8 | spam | winner valued network customer selected receiv... | 157 | 6 |
| 9 | spam | mobile month u r entitled update latest colour... | 154 | 2 |
| 11 | spam | six chance win cash pound txt csh send cost p ... | 136 | 8 |


In [33]:
ham_messages.head()

| | Label | Message | Length | Punctuation |
|---|---|---|---|---|
| 0 | ham | go jurong point crazy available bugis n great ... | 111 | 9 |
| 1 | ham | ok lar joking wif u oni | 29 | 6 |
| 3 | ham | u dun say early hor u c already say | 49 | 6 |
| 4 | ham | nah think go usf life around though | 61 | 2 |
| 6 | ham | even brother like speak treat like aid patent | 77 | 2 |


In [34]:
spam_messages['Length'].mean()

138.6706827309237

In [35]:
ham_messages['Length'].mean()

71.48248704663213

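We can see that Spam messages are, on average, almost twice as long as Ham messages (about 139 vs. 71 characters). Note that Length counts characters of the original message, not words.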

In [36]:
spam_messages['Punctuation'].mean()

5.712182061579652

In [37]:
ham_messages['Punctuation'].mean()

3.9398963730569947

The same holds for Punctuation: Spam messages contain more punctuation marks on average than Ham messages (about 5.7 vs. 3.9).
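
The per-class means can also be computed in a single call with `groupby`. The sketch below runs on a toy dataframe built from four rows of the `head()` outputs above (rows 1, 2, 3 and 5), so its numbers are not the full-dataset means; `toy` and `summary` are illustrative names.

```python
import pandas as pd

# Four rows taken from the head() outputs above (not the full dataset)
toy = pd.DataFrame({
    "Label": ["ham", "spam", "ham", "spam"],
    "Length": [29, 155, 49, 147],
    "Punctuation": [6, 6, 6, 8],
})

# Mean of both features for each class in one call
summary = toy.groupby("Label")[["Length", "Punctuation"]].mean()
print(summary)
```

On the real data, the same call on `messages` would give the four means shown above in one table.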

## 4.0 Feature Extraction

In [38]:
X = messages['Message']

In [39]:
X.head()

0    go jurong point crazy available bugis n great ...
1                              ok lar joking wif u oni
2    free entry wkly comp win fa cup final tkts st ...
3                  u dun say early hor u c already say
4                  nah think go usf life around though
Name: Message, dtype: object

In [40]:
y = messages['Label']

In [41]:
y.head()

0     ham
1     ham
2    spam
3     ham
4     ham
Name: Label, dtype: object

### 4.1 Train Test Split - Split the data into 67% (training data) and 33% (test data)

In [42]:
from sklearn.model_selection import train_test_split

In [43]:
X_train , X_test , y_train , y_test = train_test_split(X , y, test_size = 0.33, random_state = 42)

In [44]:
X_train.head()

3235                                            yup comin
945     sent score sophas secondary application school...
5319                              kothi print marandratha
5528                             effect irritation ignore
247                                         asked call ok
Name: Message, dtype: object

In [45]:
X_test.head()

3245    squeeeeeze christmas hug u lik frndshp den hug...
944     also sorta blown couple time recently id rathe...
1044    mmm thats better got roast b better drink good...
2484                  mm kanji dont eat anything heavy ok
812     ring come guy costume gift future yowifes hint...
Name: Message, dtype: object
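
As a quick sanity check on the split sizes, the arithmetic can be reproduced by hand. The sketch assumes that scikit-learn rounds the test share up and assigns the remainder to training when only `test_size` is given, which agrees with the 3733 training rows and 1839 test rows that appear in the outputs of this notebook.

```python
import math

n_samples = 5572      # rows in the dataset
test_size = 0.33      # fraction passed to train_test_split

# Assumed rounding rule: test share rounded up, remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
print(f"train share: {n_train / n_samples:.1%}")
```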

### 4.2 Demonstration of Count Vectorizer

(Bag of Words)

In [46]:
from sklearn.feature_extraction.text import CountVectorizer

In [47]:
count_vect=CountVectorizer()

In [48]:
X_train_count_vect=count_vect.fit_transform(X_train).toarray()

In [49]:
X_train_count_vect

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [50]:
# 3733 rows (training messages) and 5772 columns (unique words in the vocabulary)
X_train_count_vect.shape

(3733, 5772)

**Note:**<br>
Some of the 5772 words are rare and appear only 1-2 times in the corpus. One approach to drop them is to cap the vocabulary size, for example cv = CountVectorizer(max_features = 4000).

This keeps only the 4000 most frequent words.

We can change max_features according to what we want.
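
A small sketch of the effect of `max_features`, on a made-up toy corpus in which "win" occurs four times, "free" three times, and every other word once. With `max_features=2` only the two most frequent terms should remain in the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): "win" appears 4 times, "free" 3 times, the rest once
toy_corpus = ["win free prize", "win free call", "win free", "win now"]

# Keep only the 2 most frequent terms across the corpus
cv = CountVectorizer(max_features=2)
counts = cv.fit_transform(toy_corpus)

print(sorted(cv.vocabulary_))   # the terms that were kept
print(counts.shape)             # (number of documents, number of kept terms)
print(counts.toarray())
```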

### 4.3 Demonstration of TF-IDF Vectorizer

(Term Frequency - Inverse Document Frequency)


TfidfVectorizer is equivalent to CountVectorizer (Bag of Words) followed by TfidfTransformer; Scikit-Learn provides it to combine the two steps into one.

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:
tfidf=TfidfVectorizer()

In [53]:
#Use the tfidf object here (not count_vect), so the matrix holds TF-IDF weights
X_train_tfidf_vect=tfidf.fit_transform(X_train).toarray()

In [54]:
X_train_tfidf_vect

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [55]:
X_train_tfidf_vect.shape

(3733, 5772)
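
To make the weighting concrete, here is a hand computation of TF-IDF in plain Python on a made-up, already tokenised toy corpus. It is a sketch that assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), i.e. idf = ln((1 + n) / (1 + df)) + 1 followed by scaling each row to unit length; the helper name `tfidf_vector` is illustrative.

```python
import math

# Toy corpus (hypothetical), already tokenised
docs = [["free", "win", "win"], ["call", "free"], ["call", "later"]]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents each term occurs
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_vector(doc):
    # Raw term count times idf, then scale the vector to unit Euclidean length
    weights = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

print(vocab)
for doc in docs:
    print([round(w, 3) for w in tfidf_vector(doc)])
```

Words that occur in fewer documents ("later", "win") get a larger idf than words shared across documents ("call", "free"), which is the point of the weighting.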

## 5.0 Model Building

We use a pipeline because the test data has to go through exactly the same steps as the training data before we can get predictions, and repeating those steps by hand is tiresome and error-prone.

What is convenient about the Pipeline object is that it performs all these steps for us: we provide the text directly, and it vectorizes the data and runs the classifier on it in a single call.

Pipeline takes a list of (name, step) tuples.
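
The sketch below contrasts the manual route with the pipeline route on a made-up toy dataset (all texts and variable names here are hypothetical). Both routes use the same vectorizer and classifier, so they should give the same predictions; the pipeline just removes the need to call `transform` ourselves on new data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (hypothetical), already cleaned
train_texts = ["free prize win cash", "win free entry", "see you lunch", "call later ok"]
train_labels = ["spam", "spam", "ham", "ham"]
new_texts = ["free cash prize", "lunch later"]

# Manual route: every step has to be repeated for new data
vectorizer = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(train_texts), train_labels)
manual_preds = clf.predict(vectorizer.transform(new_texts))

# Pipeline route: the same two steps bundled into one estimator
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("mnb", MultinomialNB())])
pipe.fit(train_texts, train_labels)
pipe_preds = pipe.predict(new_texts)

print(manual_preds, pipe_preds)
```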

In [56]:
from sklearn.pipeline import Pipeline

### Naive Bayes Classifier

In [57]:
from sklearn.naive_bayes import MultinomialNB

In [58]:
#Each tuple is (a name you choose, the transformer or model to run at that step)
text_mnb=Pipeline([('tfidf',TfidfVectorizer()),('mnb',MultinomialNB())])

In [59]:
#Now you can directly pass the X_train dataset.
text_mnb.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('mnb',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

In [60]:
X_test.head()

3245    squeeeeeze christmas hug u lik frndshp den hug...
944     also sorta blown couple time recently id rathe...
1044    mmm thats better got roast b better drink good...
2484                  mm kanji dont eat anything heavy ok
812     ring come guy costume gift future yowifes hint...
Name: Message, dtype: object

In [61]:
#It will take the X_test and do all the steps, vectorize it and predict it
y_preds_mnb=text_mnb.predict(X_test)

In [62]:
#Predictions of the test data
y_preds_mnb

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype='<U4')

In [63]:
#Training score
text_mnb.score(X_train,y_train)

0.975622823466381

In [64]:
#Testing score
text_mnb.score(X_test,y_test)

0.9700924415443176

**Evaluation Metrics**

In [65]:
from sklearn.metrics import confusion_matrix

In [66]:
print(confusion_matrix(y_test,y_preds_mnb))

[[1592    1]
 [  54  192]]


In [67]:
from sklearn.metrics import classification_report

In [68]:
print(classification_report(y_test,y_preds_mnb))

              precision    recall  f1-score   support

         ham       0.97      1.00      0.98      1593
        spam       0.99      0.78      0.87       246

    accuracy                           0.97      1839
   macro avg       0.98      0.89      0.93      1839
weighted avg       0.97      0.97      0.97      1839



The report above shows that the “ham” label is predicted well, but the “spam” label is not: spam recall is only 0.78, which means 54 of the 246 spam messages in the test set were classified as ham. So we can’t say that the model is excellent. We may try the same problem with an SVM model.
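
The spam row of the classification report can be reproduced from the Naive Bayes confusion matrix by hand. The sketch below copies the four numbers from the output above and assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels, in the order ham, spam); the variable names are illustrative.

```python
# Confusion matrix for Naive Bayes, copied from the output above
# rows = true label, columns = predicted label, order = [ham, spam]
cm = [[1592, 1],
      [54, 192]]

true_ham, false_spam = cm[0]    # ham predicted as ham / as spam
false_ham, true_spam = cm[1]    # spam predicted as ham / as spam

precision = true_spam / (true_spam + false_spam)   # of predicted spam, how much is spam
recall = true_spam / (true_spam + false_ham)       # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (true_ham + true_spam) / sum(map(sum, cm))

print(f"spam precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The low recall comes entirely from the 54 spam messages that were labelled ham, while accuracy stays high because ham dominates the test set.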

### Predicting on New SMS

In [69]:
text = 'Congratulations, you have won a lottery of $5000. To Won Text on,555500 '

In [70]:
def refined_text(text):
    #Applies the same cleaning steps that were used on the training messages
    #Build the stopword set once per call instead of once per word
    stop_words = set(stopwords.words('english'))

    #Removal of extra characters: replace every non-letter with a space
    words = re.sub('[^a-zA-Z]',' ',text)
    words = words.lower()
    #Splits into list of words 
    words = words.split()

    #Lemmatizing the word and removing the stopwords
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

    #Again join words to form sentences
    words = ' '.join(words)
    return words

In [71]:
refined_word = refined_text(text)

In [72]:
refined_word = [refined_word]

In [73]:
refined_word

['congratulation lottery text']

In [74]:
# Passing the single cleaned message directly to the model for prediction
text_mnb.predict(refined_word)

array(['spam'], dtype='<U4')

### SVM Classifier

In [75]:
from sklearn.svm import LinearSVC

In [76]:
#Each tuple is (a name you choose, the transformer or model to run at that step)
text_svm=Pipeline([('tfidf',TfidfVectorizer()),('svm',LinearSVC())])

In [77]:
#Now you can directly pass the X_train dataset.
text_svm.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('svm',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
               

In [78]:
X_test.head()

3245    squeeeeeze christmas hug u lik frndshp den hug...
944     also sorta blown couple time recently id rathe...
1044    mmm thats better got roast b better drink good...
2484                  mm kanji dont eat anything heavy ok
812     ring come guy costume gift future yowifes hint...
Name: Message, dtype: object

In [79]:
#It will take the X_test and do all the steps, vectorize it and predict it
y_preds_svm=text_svm.predict(X_test)

In [80]:
#Predictions of the test data
y_preds_svm

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)

In [81]:
#Training score
text_svm.score(X_train,y_train)

1.0

In [82]:
#Testing score
text_svm.score(X_test,y_test)

0.9869494290375204

**Evaluation Metrics**

In [83]:
from sklearn.metrics import confusion_matrix

In [84]:
print(confusion_matrix(y_test,y_preds_svm))

[[1589    4]
 [  20  226]]


In [85]:
from sklearn.metrics import classification_report

In [86]:
print(classification_report(y_test,y_preds_svm))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.98      0.92      0.95       246

    accuracy                           0.99      1839
   macro avg       0.99      0.96      0.97      1839
weighted avg       0.99      0.99      0.99      1839

