
Import the necessary libraries for the task.
NumPy and pandas are common to almost every project. Because we are dealing with text data, we also use the NLTK library to solve the problem.

In [1]:
#importing the required libraries
import numpy as np
import pandas as pd


When working on Google Colab, we first need to mount Google Drive in every session and enter the authorization code. Before that, make sure the data set has been uploaded to your Drive. When working in Jupyter locally, place the data set in your working directory, or change the working directory with `os.chdir` so that the data set can be found.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Then read the data set with the pandas `read_csv` function.
Remember to copy the path of your data file from Google Colab so that it can be read.

In [37]:
df = pd.read_csv('/content/sample_data/spam.csv', encoding='latin1')
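
The path above is specific to one Colab session. As a minimal sketch (the helper name `first_existing` and the candidate paths are hypothetical, not part of the original notebook), the snippet below picks the first dataset path that exists, so the same cell could run on Colab and locally:

```python
from pathlib import Path

def first_existing(candidates):
    """Return the first candidate path that exists on disk, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None

# Hypothetical locations: the Colab sample folder, a mounted Drive folder,
# and the current working directory of a local Jupyter session.
candidates = [
    "/content/sample_data/spam.csv",
    "/content/drive/MyDrive/spam.csv",
    "spam.csv",
]

csv_path = first_existing(candidates)
if csv_path is None:
    print("spam.csv not found; upload it or adjust the candidate paths")
else:
    print("Reading data from", csv_path)
```

The resulting `csv_path` can then be passed to `pd.read_csv` in place of the hard-coded string.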

### Data Cleaning & Data Understanding Steps 


`df.head()` returns the first 5 rows by default. If we want more rows, we can pass the desired number of rows inside the parentheses, for example `df.head(10)`.

In [38]:
df.head()

Unnamed: 0,type,text,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


As we can see, the columns `Unnamed: 2`, `Unnamed: 3` and `Unnamed: 4` do not carry any information in the rows shown, so we can drop them.

In [39]:
df=df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1)
df.head()

Unnamed: 0,type,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
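
Before dropping columns by name, it can help to measure how empty they really are. A small sketch on a toy frame (the values are invented for illustration; only the column names mirror the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame that mimics the layout of the raw file (values are invented)
toy = pd.DataFrame({
    "type": ["ham", "spam", "ham"],
    "text": ["see you soon", "win a prize now", "ok"],
    "Unnamed: 2": [np.nan, np.nan, np.nan],
    "Unnamed: 3": [np.nan, "stray", np.nan],
})

# Fraction of missing values per column
missing_share = toy.isnull().mean()
print(missing_share)

# Drop the columns where more than half of the values are missing
mostly_empty = missing_share[missing_share > 0.5].index
cleaned = toy.drop(columns=mostly_empty)
print(cleaned.columns.tolist())
```

The 0.5 threshold is an arbitrary choice for this sketch; on real data it is worth looking at the few non-empty cells before discarding a column.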


In [41]:
# Other commands that help us understand the data
# Printing the size of the dataset
df.shape

(5572, 2)

In [42]:
# Getting feature names
df.columns

Index(['type', 'text'], dtype='object')

In [43]:
# Check for duplicate rows and remove them
df.drop_duplicates(inplace=True)
df.shape

(5169, 2)
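
The two shapes imply that 5572 - 5169 = 403 duplicate rows were removed. As a sketch on invented toy data, the duplicates can also be counted explicitly before they are dropped:

```python
import pandas as pd

# Invented toy messages; rows 0 and 2 are identical
toy = pd.DataFrame({
    "type": ["ham", "spam", "ham", "spam"],
    "text": ["ok", "win a prize", "ok", "win a prize now"],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# drop_duplicates keeps the first occurrence; reset_index renumbers the rows
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```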

In [44]:
# Show the number of missing data for each column
df.isnull().sum()

type    0
text    0
dtype: int64

### Processing our text data with NLP
Our remaining data is clean, but it is still in text format. For the machine to understand it, we have to convert it into numerical form. NLTK already provides a tokenizer for splitting text into words, so we import NLTK and perform the operations below.

In [45]:
import nltk
from nltk.corpus import stopwords
import string

In [46]:
# Function to tokenize the text and expand the contraction token "n't" to "not"
def tokenizer(text):
    tokenized=nltk.word_tokenize(text)
    tokenized=' '.join(tokenized)
    tokenized=tokenized.replace('n\'t','not')
    return tokenized

After tokenization we remove punctuation and then remove stop words, using the English stop-word list that ships with NLTK. The function compares each word, lower-cased, against that list and drops the word if it matches. It then returns a list of the remaining clean words. Note that the returned words keep their original case, as the output further below shows.

In [49]:
# Creating a function to process punctuation and stopwords in the text data
def process_stop_punc(text):
    # 1. Remove punctuation characters
    # 2. Remove stop words (case-insensitive comparison)
    # 3. Return a list of clean words
    nopunc=[char for char in text if char not in string.punctuation]
    nopunc=''.join(nopunc)
    
    clean_words=[word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
    
    return clean_words
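
The list comprehension above evaluates `stopwords.words('english')` once per word, which can be slow on long texts. Below is a sketch of the same idea with the stop words held in a set that is built once. The tiny `STOP_WORDS` set and the function name are hypothetical stand-ins, chosen so that the snippet runs without any NLTK downloads:

```python
import string

# Hypothetical stand-in for set(stopwords.words('english'))
STOP_WORDS = {"i", "to", "the", "a", "is", "until", "only", "in"}

def remove_punct_and_stopwords(text, stop_words=STOP_WORDS):
    """Strip punctuation, then drop stop words (case-insensitive match)."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word.lower() not in stop_words]

print(remove_punct_and_stopwords("Go until jurong point, crazy.. Available only in bugis"))
# ['Go', 'jurong', 'point', 'crazy', 'Available', 'bugis']
```

Membership tests on a set take roughly constant time, so the cost per word no longer grows with the length of the stop-word list.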

After the above step we convert words into their base form, which is called stemming. This is done by the following stemming() function, which uses NLTK's PorterStemmer. Stemming is part of term normalization in the NLP process.

In [50]:
# Function to reduce words to their stem, e.g. plural to singular and past or continuous forms to the base form
def stemming(List):
    stem_obj=nltk.stem.PorterStemmer()
    List=[stem_obj.stem(i) for i in List]
    message=' '.join(List)
    return message

In [53]:
# Function to chain all of the steps above
def process(text):
    return stemming(process_stop_punc(tokenizer(text)))

In [55]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [56]:
# Show the processed text for the first five messages
df['text'].head().apply(process)

0    Go jurong point crazi avail bugi n great world...
1                                Ok lar joke wif u oni
2    free entri 2 wkli comp win FA cup final tkt 21...
3                  U dun say earli hor U c alreadi say
4                 nah think goe usf live around though
Name: text, dtype: object

### Vectorizing the words

In TF-IDF, the weight of a word increases proportionally to its count in a document (the term frequency, TF) but is offset by how many documents in the corpus contain the word; that is the inverse document frequency (IDF) part.

**TfidfVectorizer** and **CountVectorizer** are both methods for converting text data into vectors, since a model can only process numerical data.

In **CountVectorizer** we only count the number of times a word appears in each document, which biases the representation in favour of the most frequent words. Rare words, which could have helped us classify our data, end up carrying little weight.

To overcome this, we use TfidfVectorizer.

In TfidfVectorizer we consider the weight of a word across the whole collection of documents. It weights the word counts by a measure of how many documents they appear in, which lets us penalize words that occur in most documents.
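
To make the weighting concrete, here is a small from-scratch sketch of the computation. It assumes the smoothed formula that scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The three toy documents are invented:

```python
import math
from collections import Counter

docs = [
    "free prize call now",
    "call me now",
    "see you now",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency (assumed scikit-learn default)
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(tokens):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

row = tfidf_row(tokenized[0])
for term, weight in sorted(row.items()):
    print(f"{term:>6}: {weight:.3f}")
```

In the first document, "now" appears in all three documents and therefore gets the smallest weight, while "free" and "prize" appear in only one document and get the largest.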

In [57]:
# Convert the collection of messages to a matrix of TF-IDF features
# Note: this vectorizes the raw df['text'] column; process() defined above is not applied here
from sklearn.feature_extraction.text import TfidfVectorizer
message = TfidfVectorizer().fit_transform(df['text'])

In [62]:
# Getting the shape of message
message.shape

(5169, 8672)
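
The cell above fits the vectorizer on the raw `df['text']` column. If a custom cleaning function should feed the vectorizer instead, one option is the `analyzer` parameter of `TfidfVectorizer`, which accepts a callable that turns a raw document into a list of tokens. The sketch below uses a hypothetical `simple_analyzer` and a made-up stop-word set rather than the notebook's `process` function, so that it runs without NLTK downloads:

```python
import string
from sklearn.feature_extraction.text import TfidfVectorizer

STOP_WORDS = {"the", "a", "to", "you", "is"}  # hypothetical stand-in list

def simple_analyzer(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in cleaned.split() if token not in STOP_WORDS]

# Invented example messages
texts = ["Win a FREE prize!", "Are you coming to the party?", "Free entry, call now"]

vectorizer = TfidfVectorizer(analyzer=simple_analyzer)
matrix = vectorizer.fit_transform(texts)

print(matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

With a callable analyzer the vectorizer skips its own lowercasing and tokenization, so the vocabulary contains exactly the tokens the function returns.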

In [63]:
# Print how our data looks in numerical format with TF-IDF
print(message)

  (0, 8267)	0.1820760415281772
  (0, 1069)	0.32544292157369786
  (0, 3594)	0.15240463847472757
  (0, 7645)	0.15605579719351925
  (0, 2048)	0.27450748091103355
  (0, 1749)	0.31054526020101475
  (0, 4476)	0.27450748091103355
  (0, 8489)	0.22981449679298432
  (0, 3634)	0.18170677054225734
  (0, 1751)	0.27450748091103355
  (0, 4087)	0.1080194309412782
  (0, 5537)	0.15773893821302193
  (0, 1303)	0.2468122813993541
  (0, 2327)	0.2514110448509606
  (0, 5920)	0.25394599154794606
  (0, 4350)	0.32544292157369786
  (0, 8030)	0.2284782712166139
  (0, 3550)	0.1474570544871208
  (1, 5533)	0.5464988818914979
  (1, 8392)	0.4304438402468376
  (1, 4318)	0.5233434480300876
  (1, 4512)	0.406925248497845
  (1, 5504)	0.2767319100209511
  (2, 77)	0.2326251973903166
  (2, 1156)	0.16331528331958853
  :	:
  (5167, 1786)	0.2820992149566908
  (5167, 3470)	0.2744008686738812
  (5167, 2892)	0.24290552468890048
  (5167, 7049)	0.20395814718823002
  (5167, 1778)	0.13673277359621147
  (5167, 8065)	0.21062041399707843
 

CountVectorizer counts the word frequencies.

In [76]:
# Using CountVectorizer (again on the raw text column)
from sklearn.feature_extraction.text import CountVectorizer
message1=CountVectorizer().fit_transform(df['text'])
message1

<5169x8672 sparse matrix of type '<class 'numpy.int64'>'
	with 68018 stored elements in Compressed Sparse Row format>

### Splitting data into training and testing sets
Our textual data is ready for model building. Now we use scikit-learn's train_test_split function to split it 80:20 into training and testing sets, respectively.

In [77]:
# Split the TF-IDF features 80:20 into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(message,df['type'],test_size=0.2,random_state=123)
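
In the cell above, the vectorizer has already seen every message before the split. A common alternative, sketched here on invented toy data, is to split the raw text first and fit the vectorizer on the training part only, so that the test messages stay unseen. The `stratify` argument keeps the ham/spam ratio similar in both parts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented toy messages: 4 spam, 6 ham
texts = [
    "win a free prize now", "free entry call now", "claim your cash prize",
    "urgent call to win cash",
    "see you at lunch", "ok see you soon", "are you coming home",
    "call me when you are home", "lunch at noon then", "home soon ok",
]
labels = ["spam"] * 4 + ["ham"] * 6

text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=123, stratify=labels
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(text_train)  # learn vocabulary on train only
X_test = vectorizer.transform(text_test)        # reuse it for the test part

print(X_train.shape, X_test.shape)
```

Words that occur only in the test messages are ignored by `transform`, which mirrors what happens when the model meets new messages after deployment.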

In [78]:
print(X_train)

  (0, 8322)	0.3824156699859192
  (0, 2839)	0.364909991486263
  (0, 3556)	0.2640817780717849
  (0, 3526)	0.27167573719830423
  (0, 4675)	0.30505768137040207
  (0, 7049)	0.22953717497514553
  (0, 5117)	0.2377858540441012
  (0, 3532)	0.2161881827275621
  (0, 7613)	0.2265167607990493
  (0, 5477)	0.1442902004783509
  (0, 4848)	0.2157313074007234
  (0, 5525)	0.14857924246881726
  (0, 5133)	0.2152792161966106
  (0, 5570)	0.15797411905848488
  (0, 3381)	0.2530769796690938
  (0, 5367)	0.16411571836579125
  (0, 1084)	0.12640971755377062
  (0, 7756)	0.09496060052263017
  (1, 1909)	0.6851217051595304
  (1, 8594)	0.5627094801353876
  (1, 8259)	0.38284923124418946
  (1, 4087)	0.25960114834259196
  (2, 8397)	0.39148547324989375
  (2, 8016)	0.3203510753357125
  (2, 6729)	0.3302136379412883
  :	:
  (4133, 6352)	0.29436287616434814
  (4133, 4988)	0.29436287616434814
  (4133, 8364)	0.2808879527320607
  (4133, 6891)	0.25272949084534074
  (4133, 3388)	0.2296939570877491
  (4133, 4357)	0.21042470389698312
 

In [79]:
print(X_test)

  (0, 1890)	0.4834563071796509
  (0, 1321)	0.17279887941543384
  (0, 7647)	0.17755713104029425
  (0, 8328)	0.19064508663303867
  (0, 3004)	0.1687483717534898
  (0, 4729)	0.18176403366933422
  (0, 1524)	0.22281157988754302
  (0, 2471)	0.1363655694637619
  (0, 3532)	0.1366543641911891
  (0, 1498)	0.15799839013168815
  (0, 3503)	0.10109046077266595
  (0, 8315)	0.3304441482998072
  (0, 7963)	0.2987305529692703
  (0, 5477)	0.09120704636403047
  (0, 8289)	0.10213559883209336
  (0, 8537)	0.13940101126191806
  (0, 5606)	0.12718295711053096
  (0, 1889)	0.29098157559138044
  (0, 7669)	0.10516622020330577
  (0, 2109)	0.3159967802633763
  (0, 7627)	0.07341140422977838
  (0, 3770)	0.09142757966163703
  (0, 1084)	0.07990464308434181
  (0, 7756)	0.060025392340647576
  (0, 3594)	0.11320108509005652
  :	:
  (1033, 6568)	0.31431290962989383
  (1033, 226)	0.31431290962989383
  (1033, 333)	0.2383714044966735
  (1033, 6861)	0.21274602978338117
  (1033, 2752)	0.20800758986436743
  (1033, 8316)	0.26093983953

In [80]:
# Split the CountVectorizer features 80:20 (note: random_state=0 here, versus 123 for the TF-IDF split)
from sklearn.model_selection import train_test_split
X_train1,X_test1,y_train1,y_test1=train_test_split(message1,df['type'],test_size=0.2,random_state=0)

### Model building with different algorithms

1> Naive Bayes classifier

In [81]:
# Creating and training the naive bayes classifier for dataset vectorized using tf-idfvectorizer
from sklearn.naive_bayes import MultinomialNB
classifier=MultinomialNB().fit(X_train,y_train)

In [82]:
# Evaluate the model on the training dataset
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
pred=classifier.predict(X_train)
print(classification_report(y_train,pred))
print()
print('confusion Matrix:\n',confusion_matrix(y_train,pred))
print()
print(' training accuracy score:\n',accuracy_score(y_train,pred))

              precision    recall  f1-score   support

         ham       0.96      1.00      0.98      3628
        spam       1.00      0.72      0.84       507

    accuracy                           0.97      4135
   macro avg       0.98      0.86      0.91      4135
weighted avg       0.97      0.97      0.96      4135


confusion Matrix:
 [[3628    0]
 [ 142  365]]

 training accuracy score:
 0.965659008464329


In [83]:
# Evaluate the model on the test dataset
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
pred=classifier.predict(X_test)
print(classification_report(y_test,pred))
print()
print('confusion Matrix:\n',confusion_matrix(y_test,pred))
print()
print('testing accuracy score:\n',accuracy_score(y_test,pred))

              precision    recall  f1-score   support

         ham       0.94      1.00      0.97       888
        spam       1.00      0.63      0.77       146

    accuracy                           0.95      1034
   macro avg       0.97      0.82      0.87      1034
weighted avg       0.95      0.95      0.94      1034


confusion Matrix:
 [[888   0]
 [ 54  92]]

testing accuracy score:
 0.9477756286266924


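From the two results above, the gap between training accuracy (96.57%) and testing accuracy (94.78%) is small, so the model does not appear to be overfitting badly. Note, however, that spam recall is only 0.72 on the training set and 0.63 on the test set, so many spam messages are still classified as ham.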
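
The figures in the classification report can be recomputed by hand from the test-set confusion matrix printed above, treating spam as the positive class. This is only an arithmetic check, not part of the original notebook:

```python
# Test-set confusion matrix reported above: rows = actual, columns = predicted
#                 pred ham  pred spam
# actual ham         888        0
# actual spam         54       92
tn, fp = 888, 0
fn, tp = 54, 92

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.9478 precision=1.00 recall=0.63 f1=0.77
```

The high accuracy is driven mostly by the large ham class: 54 of the 146 spam messages in the test set are missed.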

In [84]:
# Creating and training the naive bayes classifier for dataset vectorized using Countvectorizer
from sklearn.naive_bayes import MultinomialNB
classifier=MultinomialNB().fit(X_train1,y_train1)

In [85]:
# Evaluate the model on the training dataset (CountVectorizer features)
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
pred=classifier.predict(X_train1)
print(classification_report(y_train1,pred))
print()
print('confusion Matrix:\n',confusion_matrix(y_train1,pred))
print()
print(' training accuracy score:\n',accuracy_score(y_train1,pred))

              precision    recall  f1-score   support

         ham       1.00      1.00      1.00      3631
        spam       0.98      0.97      0.97       504

    accuracy                           0.99      4135
   macro avg       0.99      0.98      0.99      4135
weighted avg       0.99      0.99      0.99      4135


confusion Matrix:
 [[3623    8]
 [  17  487]]

 training accuracy score:
 0.9939540507859734


In [86]:
# Evaluate the model on the test dataset (CountVectorizer features)
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
pred=classifier.predict(X_test1)
print(classification_report(y_test1,pred))
print()
print('confusion Matrix:\n',confusion_matrix(y_test1,pred))
print()
print('testing accuracy score:\n',accuracy_score(y_test1,pred))

              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       885
        spam       0.91      0.93      0.92       149

    accuracy                           0.98      1034
   macro avg       0.95      0.96      0.96      1034
weighted avg       0.98      0.98      0.98      1034


confusion Matrix:
 [[872  13]
 [ 10 139]]

testing accuracy score:
 0.9777562862669246


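Comparing both results, Naive Bayes gives better test accuracy with CountVectorizer (97.78%) than with the TF-IDF vectorizer (94.78%). Keep in mind that the two splits use different random_state values (123 and 0), so the two test sets are not identical.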
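
The two feature matrices above were split with different `random_state` values, so the models are scored on different test rows. One way to compare them on identical rows, sketched on invented toy data, is to pass both matrices to a single `train_test_split` call:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented toy messages
texts = [
    "win a free prize now", "free entry call now", "claim your cash prize",
    "urgent call to win cash",
    "see you at lunch", "ok see you soon", "are you coming home",
    "call me when you are home", "lunch at noon then", "home soon ok",
]
labels = ["spam"] * 4 + ["ham"] * 6

tfidf = TfidfVectorizer().fit_transform(texts)
counts = CountVectorizer().fit_transform(texts)

# Passing both matrices in one call splits them on the same rows
(tfidf_train, tfidf_test,
 counts_train, counts_test,
 y_train, y_test) = train_test_split(
    tfidf, counts, labels, test_size=0.2, random_state=0
)

# Both vectorizers share the same vocabulary here, so the non-zero
# pattern of the two test matrices matches when the rows are the same
same_rows = bool(
    ((tfidf_test.toarray() != 0) == (counts_test.toarray() != 0)).all()
)
print(same_rows)
```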

Let us try an SVM, using a grid search to tune its hyperparameters.

Let's try it first on the TF-IDF data.

In [92]:
# Prediction using LinearSVC and GridSearchCV on features obtained from TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
param_grid={'C':[0.1,1,10,100]}
grid=GridSearchCV(LinearSVC(),param_grid,refit=True)
grid.fit(X_train,y_train)



GridSearchCV(cv=None, error_score=nan,
             estimator=LinearSVC(C=1.0, class_weight=None, dual=True,
                                 fit_intercept=True, intercept_scaling=1,
                                 loss='squared_hinge', max_iter=1000,
                                 multi_class='ovr', penalty='l2',
                                 random_state=None, tol=0.0001, verbose=0),
             iid='deprecated', n_jobs=None, param_grid={'C': [0.1, 1, 10, 100]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [93]:
# Finding the best value of C
print(grid.best_params_)

{'C': 10}


In [94]:
# Best mean cross-validated accuracy
print(grid.best_score_)

0.9818621523579203
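
`best_score_` is the mean cross-validated score of the best parameter setting, not the accuracy on the full training set. The sketch below, on invented toy data, shows the same search wrapped in a `Pipeline`, so that the vectorizer is refitted inside every fold. The step names `tfidf` and `svm` and the choice `cv=3` are arbitrary for this illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy messages: 6 spam, 6 ham
texts = [
    "win a free prize now", "free entry call now", "claim your cash prize",
    "urgent call to win cash", "free cash prize waiting", "win free entry now",
    "see you at lunch", "ok see you soon", "are you coming home",
    "call me when you are home", "lunch at noon then", "home soon ok",
]
labels = ["spam"] * 6 + ["ham"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])

# The step-name prefix "svm__" routes the parameter to the LinearSVC step
param_grid = {"svm__C": [0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)  # mean accuracy over the 3 validation folds
```

Because `refit=True` is the default, `search.predict` afterwards uses the best pipeline retrained on all of the training data.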


In [96]:
# Prediction of test data
pred2=grid.predict(X_test)

In [97]:
# Evaluate the model on the test dataset
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
print(classification_report(y_test,pred2))
print()
print('confusion Matrix:\n',confusion_matrix(y_test,pred2))
print()
print('accuracy score:\n',accuracy_score(y_test,pred2))

              precision    recall  f1-score   support

         ham       0.98      0.99      0.99       888
        spam       0.96      0.86      0.91       146

    accuracy                           0.98      1034
   macro avg       0.97      0.93      0.95      1034
weighted avg       0.98      0.98      0.98      1034


confusion Matrix:
 [[883   5]
 [ 20 126]]

accuracy score:
 0.9758220502901354


For the **TF-IDF Vectorizer** data, SVM gives a better accuracy score than Naive Bayes, both in cross-validation on the training set (98.19%) and on the test set (97.58%).


Now let's check the accuracy score with the CountVectorizer data set.

In [98]:
# Prediction using LinearSVC and GridSearchCV on features obtained from CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
param_grid1={'C':[0.1,1,10,100]}
# Use param_grid1 defined on the line above (same values as param_grid)
grid1=GridSearchCV(LinearSVC(),param_grid1,refit=True)
grid1.fit(X_train1,y_train1)



GridSearchCV(cv=None, error_score=nan,
             estimator=LinearSVC(C=1.0, class_weight=None, dual=True,
                                 fit_intercept=True, intercept_scaling=1,
                                 loss='squared_hinge', max_iter=1000,
                                 multi_class='ovr', penalty='l2',
                                 random_state=None, tol=0.0001, verbose=0),
             iid='deprecated', n_jobs=None, param_grid={'C': [0.1, 1, 10, 100]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [99]:
# Finding the best value of C
print(grid1.best_params_)

{'C': 1}


In [101]:
# Best mean cross-validated accuracy
print(grid1.best_score_)

0.9823458282950422


In [102]:
# Fit on the training data again (redundant: grid1 was already fitted above)
grid1.fit(X_train1,y_train1)



GridSearchCV(cv=None, error_score=nan,
             estimator=LinearSVC(C=1.0, class_weight=None, dual=True,
                                 fit_intercept=True, intercept_scaling=1,
                                 loss='squared_hinge', max_iter=1000,
                                 multi_class='ovr', penalty='l2',
                                 random_state=None, tol=0.0001, verbose=0),
             iid='deprecated', n_jobs=None, param_grid={'C': [0.1, 1, 10, 100]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [103]:
# Prediction of test data
pred3=grid1.predict(X_test1)
pred3

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)

In [104]:
# Evaluate the model on the test dataset
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
print(classification_report(y_test1,pred3))
print()
print('confusion Matrix:\n',confusion_matrix(y_test1,pred3))
print()
print('accuracy score:\n',accuracy_score(y_test1,pred3))

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       885
        spam       0.99      0.91      0.95       149

    accuracy                           0.99      1034
   macro avg       0.99      0.95      0.97      1034
weighted avg       0.99      0.99      0.99      1034


confusion Matrix:
 [[884   1]
 [ 14 135]]

accuracy score:
 0.9854932301740812


Comparing the results, we get the best testing accuracy (98.55%) with SVM on the CountVectorizer features. This is an acceptable accuracy score, and the F1-score for spam (0.95) is also the highest of the four experiments, which indicates that the model classifies well with good precision and recall.

All of these metrics in the classification report are important when judging the predictions, and they help us in decision making.

HAPPY LEARNING !!!!

STAY TUNED WITH MY BLOGS @ https://www.nileshgode.info/blog