Importing the libraries required for the model

In [121]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
#train_test_split is a function that splits the data into training and test datasets.
# It is commonly used in machine learning to evaluate how well a model performs on data it has not seen.

from sklearn.feature_extraction.text import TfidfVectorizer
#TfidfVectorizer is a class that is used to convert a collection of raw documents into a matrix of TF-IDF features.
#TF-IDF stands for Term Frequency-Inverse Document Frequency. 
#TF-IDF score is a measure of how important a word is to a document in a collection or corpus.
#corpus means a collection of texts.

from sklearn.linear_model import LogisticRegression
#LogisticRegression is a classification algorithm that predicts the probability that an observation belongs to a class.
# Despite the name, it is not meant for continuous targets; linear regression is the method suited to predicting continuous values.

from sklearn.metrics import accuracy_score
#accuracy score is used to evaluate the performance of a classification model by measuring the proportion of correct predictions.

To read the data, use the pandas function pd.read_csv().

The file does not contain a header row, so we provide the column names using the “names” parameter. If the file were Tab (\t) separated, we would also have to pass the “sep” (separator) parameter; the call below relies on the default comma separator.

The data is stored in a DataFrame named data.

In [122]:

data = pd.read_csv("SMSSpamCollection.csv", names= ["Label" , "message"], encoding="latin-1")
#specifying the name of the columns in the CSV file.
#read_csv is a function that is used to read a CSV file and return a pandas DataFrame.
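
As a small, self-contained sketch of how the names and sep parameters behave, the snippet below reads a header-less, tab-separated sample from an in-memory string. The two sample rows and the variable names sample and df are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample: tab-separated, no header row.
sample = "ham\tOk, see you later\nspam\tFree entry, text WIN now\n"

# sep="\t" tells pandas to split on tabs, so the commas stay inside the message;
# names=... supplies the column names that the file itself does not contain.
df = pd.read_csv(io.StringIO(sample), sep="\t", names=["Label", "message"])

print(df.shape)
print(df["Label"].tolist())
```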

To view the data, evaluate data on its own; the notebook renders the DataFrame as a table.

In [123]:
data

Unnamed: 0,Label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5569,spam,This is the 2nd time we have tried 2 contact u...
5570,ham,Will Ã¼ b going to esplanade fr home?
5571,ham,"Pity, * was in mood for that. So...any other s..."
5572,ham,The guy did some bitching but I acted like i'd...


Now we use the head() function to view the first 5 rows of our DataFrame.

In [124]:
data.head()

Unnamed: 0,Label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Exploratory Data Analysis - explore the type of data, columns and rows available in our dataframe

In [125]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5574 entries, 0 to 5573
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Label    5574 non-null   object
 1   message  5574 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


Our DataFrame has 2 columns, Label and message, with 5574 rows. Both columns report 5574 non-null values, so there are no missing values.

To check for missing values explicitly, use isnull().sum().

In [126]:
data.isnull().sum()

Label      0
message    0
dtype: int64

Shape of the DataFrame: shape returns a tuple with 2 elements. The first element is the number of rows and
the second element is the number of columns.

In [127]:
data.shape

(5574, 2)

There are 5574 rows, so we have 5574 messages in total, and 2 columns: Label and message.

There are no missing values in the data. Now let's count how many messages belong to each class in the Label column.

In [128]:
data['Label'].value_counts()

Label
ham     4827
spam     747
Name: count, dtype: int64

Ham messages far outnumber spam messages: 4827 of the 5574 messages are ham and 747 are spam.
A model that always predicts ham would already be about 86.6 percent accurate, so our machine learning model needs to beat that majority-class baseline, which is a higher bar than random chance.
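
The 86.6 percent figure is the accuracy of always predicting the most frequent class, and it can be checked directly from the counts above. The variable names in this sketch are chosen for illustration.

```python
# Class counts taken from the value_counts() output above.
ham_count = 4827
spam_count = 747

total = ham_count + spam_count
# Accuracy of a trivial classifier that labels every message as ham.
baseline_accuracy = ham_count / total

print(total)
print(round(baseline_accuracy * 100, 1))
```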

Data preprocessing

Calculating the length of each message

In [129]:
# Character length of each message.
length = [len(message) for message in data['message']]

In [130]:
data['length'] = length
data.head()

Unnamed: 0,Label,message,length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61


Now counting the punctuation characters in each message

In [131]:
import string
# For each message, count the characters that appear in string.punctuation.
punct = []
for message in data['message']:
    count = sum(1 for ch in message if ch in string.punctuation)
    punct.append(count)

In [132]:
data = data.assign(punct=punct)
data

Unnamed: 0,Label,message,length,punct
0,ham,"Go until jurong point, crazy.. Available only ...",111,9
1,ham,Ok lar... Joking wif u oni...,29,6
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,6
3,ham,U dun say so early hor... U c already then say...,49,6
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,2
...,...,...,...,...
5569,spam,This is the 2nd time we have tried 2 contact u...,161,8
5570,ham,Will Ã¼ b going to esplanade fr home?,37,1
5571,ham,"Pity, * was in mood for that. So...any other s...",57,7
5572,ham,The guy did some bitching but I acted like i'd...,125,1


Text Cleaning

In [133]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sjai5\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sjai5\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [134]:
lemmatizer = WordNetLemmatizer()

In [135]:
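corpus = []
# Build the stop-word set once instead of rebuilding it for every word.
stop_words = set(stopwords.words('english'))

for i in range(0, len(data)):
    # Replace every non-letter character with a space.
    words = re.sub('[^a-zA-Z]', ' ', data['message'][i])
    words = words.lower()
    words = words.split()
    # Drop the stop words, then lemmatize the words that remain.
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    words = ' '.join(words)

    corpus.append(words)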

In [136]:
corpus[0]

'go jurong point crazy available bugis n great world la e buffet cine got amore wat'

In [137]:
data['message'] = corpus

In [138]:
data.head()

Unnamed: 0,Label,message,length,punct
0,ham,go jurong point crazy available bugis n great ...,111,9
1,ham,ok lar joking wif u oni,29,6
2,spam,free entry wkly comp win fa cup final tkts st ...,155,6
3,ham,u dun say early hor u c already say,49,6
4,ham,nah think go usf life around though,61,2


In [139]:
spam_messages = data[data['Label'] == 'spam']
ham_messages = data[data['Label'] == 'ham']


In [140]:
spam_messages.head()

Unnamed: 0,Label,message,length,punct
2,spam,free entry wkly comp win fa cup final tkts st ...,155,6
5,spam,freemsg hey darling week word back like fun st...,148,8
8,spam,winner valued network customer selected receiv...,158,6
9,spam,mobile month u r entitled update latest colour...,154,2
11,spam,six chance win cash pound txt csh send cost p ...,136,8


In [141]:
ham_messages.head()

Unnamed: 0,Label,message,length,punct
0,ham,go jurong point crazy available bugis n great ...,111,9
1,ham,ok lar joking wif u oni,29,6
3,ham,u dun say early hor u c already say,49,6
4,ham,nah think go usf life around though,61,2
6,ham,even brother like speak treat like aid patent,77,2


In [142]:
print(spam_messages['punct'].mean())
print(spam_messages['length'].mean())
print(ham_messages['punct'].mean())
print(ham_messages['length'].mean())

5.712182061579652
139.0776439089692
3.938056764035633
71.49513155168842
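
The four averages above can also be obtained in one step with groupby. The sketch below uses a tiny made-up DataFrame (toy) with the same column names, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Made-up rows that mimic the Label / length / punct columns of the notebook.
toy = pd.DataFrame({
    "Label":  ["ham", "ham", "spam", "spam"],
    "length": [20, 40, 150, 130],
    "punct":  [2, 4, 6, 8],
})

# One row per class, one column per feature, each cell holding the class mean.
class_means = toy.groupby("Label")[["length", "punct"]].mean()
print(class_means)
```

In the notebook's own output, spam messages average about 139 characters against about 71 for ham, and also contain more punctuation on average.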


Model building

In [143]:
X = data['message']
Y = data['Label']

In [144]:
X.head()

0    go jurong point crazy available bugis n great ...
1                              ok lar joking wif u oni
2    free entry wkly comp win fa cup final tkts st ...
3                  u dun say early hor u c already say
4                  nah think go usf life around though
Name: message, dtype: object

In [145]:
Y.head()

0     ham
1     ham
2    spam
3     ham
4     ham
Name: Label, dtype: object

Train test split

Using the scikit-learn library, we can split the data into train and test sets. Here I have split the data into 67% (training data) and 33% (testing data).

In [146]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.33, random_state=42)
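
To see what test_size=0.33 means in row counts, the sketch below splits a placeholder list of 5574 items (standing in for the messages) with the same arguments. The placeholder labels are made up, and the stratify option mentioned in the comment is an optional extra that the notebook does not use.

```python
from sklearn.model_selection import train_test_split

# Placeholder items standing in for the 5574 messages and their labels.
items = list(range(5574))
labels = ["spam" if i % 7 == 0 else "ham" for i in items]

train_items, test_items, train_labels, test_labels = train_test_split(
    items, labels, test_size=0.33, random_state=42
)
# Passing stratify=labels as well would keep the ham/spam ratio equal in both parts.

print(len(train_items), len(test_items))
```

The resulting sizes, 3734 training rows and 1840 test rows, match the shapes and support counts reported elsewhere in this notebook.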

Dealing with the text data

We can’t pass raw text directly to a machine learning model, because the model works only on numeric features.

To solve this problem, we will use the TF-IDF Vectorizer (Term Frequency-Inverse Document Frequency). It is a standard technique for transforming text into a meaningful numeric representation, which is then used to fit the machine learning algorithm for prediction. The cells below first use CountVectorizer to inspect the raw word-count matrix; the TF-IDF vectorizer itself is applied later, inside the pipelines.
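
As a rough illustration of the inverse-document-frequency idea, the sketch below computes a smoothed IDF by hand for a three-message toy corpus. The formula ln((1 + n) / (1 + df)) + 1 is assumed here to match scikit-learn's default (smooth_idf=True); the corpus and the names docs, df and idf are made up for this example.

```python
import math

# Made-up, already-cleaned messages.
docs = ["free entry win", "ok lar joking", "win cash free"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many messages each word occurs.
vocabulary = sorted({word for doc in tokenized for word in doc})
df = {word: sum(word in doc for doc in tokenized) for word in vocabulary}

# Smoothed inverse document frequency: rarer words get larger weights.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

for word in vocabulary:
    print(word, df[word], round(idf[word], 3))
```

Words that occur in fewer messages (such as "ok" here) receive a larger weight than words that occur in many (such as "free").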

In [147]:
Tfidf = TfidfVectorizer()
from sklearn.feature_extraction.text import CountVectorizer
count_vect=CountVectorizer()


In [148]:
X_train_count_vect=count_vect.fit_transform(X_train).toarray()


In [149]:
X_train_count_vect

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [150]:
X_train_count_vect.shape

(3734, 5782)

3734 is the number of messages in X_train, and 5782 is the number of distinct words (the vocabulary) found in those messages.

Pipelining

In [151]:
from sklearn.pipeline import Pipeline

In [152]:
from sklearn.naive_bayes import MultinomialNB

In [153]:
text_mnb=Pipeline([('tfidf',TfidfVectorizer()),('mnb',MultinomialNB())])

In [154]:
text_mnb.fit(X_train,Y_train)

In [155]:
X_test.head()

3690                                 still coming tonight
3527    hey babe far spun spk da mo dead da wrld sleep...
724                                   ya even cooky jelly
3370               sorry gone place tomorrow really sorry
468                                       going ride bike
Name: message, dtype: object

In [156]:
y_preds_mnb=text_mnb.predict(X_test)

In [157]:
y_preds_mnb

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'spam'], dtype='<U4')

In [158]:
text_mnb.score(X_train,Y_train)

0.9764327798607392

In [159]:
text_mnb.score(X_test,Y_test)

0.9657608695652173

In [160]:
from sklearn.metrics import confusion_matrix

In [161]:
print(confusion_matrix(Y_test,y_preds_mnb))

[[1584    1]
 [  62  193]]


In [162]:
from sklearn.metrics import classification_report

In [163]:
print(classification_report(Y_test,y_preds_mnb))

              precision    recall  f1-score   support

         ham       0.96      1.00      0.98      1585
        spam       0.99      0.76      0.86       255

    accuracy                           0.97      1840
   macro avg       0.98      0.88      0.92      1840
weighted avg       0.97      0.97      0.96      1840
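
The spam row of this report can be re-derived from the confusion matrix printed earlier. In the sketch below, rows are true labels and columns are predicted labels, both in the order ham, spam; the variable names are chosen for illustration.

```python
# Confusion matrix of the Naive Bayes pipeline, copied from the output above.
# Rows: true ham, true spam. Columns: predicted ham, predicted spam.
matrix = [[1584, 1],
          [62, 193]]

true_ham_as_ham, true_ham_as_spam = matrix[0]
true_spam_as_ham, true_spam_as_spam = matrix[1]

total = sum(sum(row) for row in matrix)
accuracy = (true_ham_as_ham + true_spam_as_spam) / total
# Precision: of the messages flagged as spam, how many really were spam.
spam_precision = true_spam_as_spam / (true_spam_as_spam + true_ham_as_spam)
# Recall: of the real spam messages, how many were caught.
spam_recall = true_spam_as_spam / (true_spam_as_spam + true_spam_as_ham)

print(round(accuracy, 4), round(spam_precision, 2), round(spam_recall, 2))
```

The low recall means that 62 of the 255 spam messages in the test set were classified as ham, even though only 1 ham message was flagged as spam.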



In [164]:
from sklearn.svm import LinearSVC

In [165]:
text_svm=Pipeline([('tfidf',TfidfVectorizer()),('svm',LinearSVC())])

In [166]:
from sklearn.svm import SVC

In [167]:
# Fit the TF-IDF + LinearSVC pipeline defined above.
text_svm.fit(X_train,Y_train)



In [168]:
X_test.head()

3690                                 still coming tonight
3527    hey babe far spun spk da mo dead da wrld sleep...
724                                   ya even cooky jelly
3370               sorry gone place tomorrow really sorry
468                                       going ride bike
Name: message, dtype: object

In [169]:
y_preds_svm=text_svm.predict(X_test)

In [170]:
y_preds_svm

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'spam'], dtype=object)

In [171]:
text_svm.score(X_train,Y_train)

1.0

In [172]:
text_svm.score(X_test,Y_test)

0.9836956521739131

In [173]:
print(confusion_matrix(Y_test,y_preds_svm))

[[1578    7]
 [  23  232]]


In [174]:
print(classification_report(Y_test,y_preds_svm))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1585
        spam       0.97      0.91      0.94       255

    accuracy                           0.98      1840
   macro avg       0.98      0.95      0.96      1840
weighted avg       0.98      0.98      0.98      1840



In [175]:
text = 'Congratulations, you have won a lottery of $5000. To Won Text on,555500 '

In [176]:
def refined_text(text):
    #Replace every non-letter character with a space
    words = re.sub('[^a-zA-Z]',' ',text)
    words = words.lower()
    #Split into a list of words
    words = words.split()

    #Remove the stop words and lemmatize the remaining words
    stop_words = set(stopwords.words('english'))
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

    #Join the words back into a single string
    words = ' '.join(words)
    return words

In [177]:
refined_word = refined_text(text)

In [178]:
refined_word = [refined_word]

In [179]:
refined_word

['congratulation lottery text']

In [180]:
text_mnb.predict(refined_word)

array(['spam'], dtype='<U4')
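
The pipeline accepts raw strings because the TF-IDF step runs inside it before the classifier. A minimal sketch on four made-up training messages (far too few for a real model, and not the notebook's data) shows the same fit/predict pattern; all names prefixed with toy_ are hypothetical.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data, for illustration only.
toy_texts = ["free prize win", "win cash now", "see you tonight", "lunch at home"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("mnb", MultinomialNB())])
toy_pipeline.fit(toy_texts, toy_labels)

# predict() expects a list (or other iterable) of strings, one per message.
toy_prediction = toy_pipeline.predict(["free prize"])
print(toy_prediction[0])
```

This is why the cleaned message is wrapped in a list before it is passed to text_mnb.predict().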