## Spam Classifier using Multinomial Naive Bayes

The spam classifier checks whether a given SMS message is spam or ham (a legitimate message).

Nowadays, to register or log in to almost any website you have to provide your email address and sometimes your phone number as well. These details are used to verify the user, but they can also be misused for promotions, fake messages and so on. For example, if you enter your bank details, phone number and email address to buy a product from a sketchy-looking website, a few days later you will probably receive an email from halfway around the world claiming that you have won 100 million dollars. Most of us know that this message is fake and that the email should end up in spam. This trick just doesn't work anymore (I hope so!).

We humans can sometimes be reckless. We enter our email address and phone number on almost every website that asks for them, and we expect our email and phone providers to make sure that no spam ends up in our inbox. So, instead of being careful while entering our details, we have decided to build algorithms that automatically read a message and decide whether it is spam or not. If it is spam, the message is removed from your inbox and not shown to you.

#### Import Libraries

In [1]:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import pandas as pd
import re
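# NLTK data used below: tokenizer models, the stopwords list and WordNet (fetch once with nltk.download)
curpos = []  # collects the cleaned messages (the corpus)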

#### Reading Dataset

In [2]:
message = pd.read_csv('SMSSpamCollection', sep = '\t', names = ['label','message'])

In [3]:
message.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
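The file is tab-separated and has no header row, which is why `sep` and `names` are passed to `read_csv`. The sketch below shows the same call on a small in-memory stand-in for the file; the three rows and the names `raw`, `sample` and `label_counts` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the tab-separated SMSSpamCollection file.
raw = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tFree entry to win a prize\n"
    "ham\tSee you at lunch\n"
)

# There is no header row, so the column names are supplied explicitly.
sample = pd.read_csv(io.StringIO(raw), sep="\t", names=["label", "message"])

# Class balance: a quick check that is worth running on the real data too.
label_counts = sample["label"].value_counts().to_dict()
print(sample.shape)
print(label_counts)
```

Running `message['label'].value_counts()` on the real data frame would show how many ham and spam messages the dataset contains.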


#### Shape of Dataset

In [15]:
message.shape

(5572, 2)

#### Data Preprocessing

In [4]:
stem = PorterStemmer()  # created but not used below; lemmatization is applied instead
lemma = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the stop-word set once, not once per word
for i in range(0, len(message)):
    review = re.sub("[^a-zA-Z]", " ", message['message'][i])  # replace everything except letters (punctuation, digits) with a space
    review = review.lower()
    review = word_tokenize(review)  # split the message into words
    review = [lemma.lemmatize(word) for word in review if word not in stop_words]  # drop stop words, lemmatize the rest
    review = ' '.join(review)  # join the words back into a single string
    curpos.append(review)
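To see what the cleaning loop does to a single message, here is a minimal sketch that needs no NLTK data. It assumes a tiny hand-written stop-word set (`STOP_WORDS`, far shorter than NLTK's English list), splits on whitespace instead of calling `word_tokenize`, and skips lemmatization; `clean_text` is a hypothetical helper, not part of the notebook.

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"a", "in", "to", "the", "is", "you", "u"}

def clean_text(text):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

cleaned = clean_text("Free entry in 2 a wkly comp to win FA Cup!")
print(cleaned)
```

Note that the digit `2` disappears along with the punctuation, because the regular expression keeps letters only.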

In [5]:
cv = CountVectorizer() 
x = cv.fit_transform(curpos).toarray()
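`CountVectorizer` turns each cleaned message into a row of word counts, with one column per vocabulary term. A small sketch on three made-up messages (the names `docs`, `toy_cv`, `counts`, `vocab` and `dense` are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-cleaned messages.
docs = ["free entry win", "ok lar joking", "win free prize win"]

toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(docs)  # SciPy sparse matrix, one row per message

vocab = sorted(toy_cv.vocabulary_)   # column order follows the sorted vocabulary
dense = counts.toarray().tolist()
print(vocab)
print(dense)
```

`MultinomialNB` also accepts the sparse matrix directly, so the `.toarray()` call is optional and may be worth dropping if memory becomes a concern on a larger corpus.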

In [6]:
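y = pd.get_dummies(message['label'])  # one-hot columns in alphabetical order: ham, spam
y = y.iloc[:,1]  # keep the spam column, so 1 means spam and 0 means ham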

#### Train Test Split

In [7]:
x_train, x_test, y_train, y_test =  train_test_split(x, y, test_size = 0.3, random_state = 49)
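The split above is a plain random 70/30 split. Spam datasets are usually imbalanced, so one optional variation (not used in this notebook) is to pass `stratify` so that both parts keep the same spam ratio. A sketch on toy data, where `features` and `labels` are made up:

```python
from sklearn.model_selection import train_test_split

# Toy data: 80 ham (0) and 20 spam (1) examples.
features = [[i] for i in range(100)]
labels = [0] * 80 + [1] * 20

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=49, stratify=labels
)

# Both parts keep the 20% spam share of the full toy set.
print(len(l_train), sum(l_train))
print(len(l_test), sum(l_test))
```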

#### Generating Model

In [8]:
naive = MultinomialNB()
naive.fit(x_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
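The output shows the defaults: `alpha=1.0` (add-one smoothing) and `fit_prior=True` (class priors estimated from the training data). To make those two settings concrete, here is a from-scratch sketch of multinomial Naive Bayes on a toy corpus. It is an illustration of the idea, not scikit-learn's implementation, and `train_nb`, `predict_nb` and the toy data are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Return log priors and smoothed log likelihoods for token-list documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    log_prior, log_like = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, lab in zip(docs, labels) if lab == c]
        # Prior: share of training documents that belong to class c.
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for doc in class_docs for tok in doc)
        # Smoothing: add alpha to every word count so no probability is zero.
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {tok: math.log((counts[tok] + alpha) / total) for tok in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Pick the class with the highest log posterior; unseen tokens are ignored."""
    scores = {
        c: log_prior[c] + sum(log_like[c][tok] for tok in doc if tok in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

toy_docs = [["free", "prize", "win"], ["win", "cash", "free"],
            ["see", "you", "lunch"], ["ok", "see", "you"]]
toy_labels = ["spam", "spam", "ham", "ham"]

prior, like = train_nb(toy_docs, toy_labels)
print(predict_nb(["free", "cash"], prior, like))
print(predict_nb(["see", "you", "later"], prior, like))
```

Without smoothing, a word that never appears in one class would push that class's probability to zero no matter what the other words say.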

In [9]:
y_pred = naive.predict(x_test)

#### Training Score

In [10]:
naive.score(x_train, y_train)

0.992051282051282

#### Test Score

In [11]:
naive.score(x_test, y_test)

0.9826555023923444

#### Confusion Matrix

In [12]:
confusion_matrix(y_test, y_pred)  # true labels first: rows are actual classes, columns are predictions

array([[1413,   24],
       [   5,  230]], dtype=int64)

#### Classification Report

In [13]:
print(classification_report(y_test, y_pred))  # true labels first, then predictions

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1437
           1       0.91      0.98      0.94       235

    accuracy                           0.98      1672
   macro avg       0.95      0.98      0.97      1672
weighted avg       0.98      0.98      0.98      1672
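The headline numbers can be re-derived by hand from the four counts in the confusion matrix: 1413 ham messages kept, 230 spam messages caught, 24 ham messages flagged as spam and 5 spam messages missed, assuming label 1 is spam. Precision and recall trade places if the true and predicted labels are passed to scikit-learn in the opposite order, so this sketch names each count explicitly; the variable names are illustrative.

```python
# Counts read from the confusion matrix, with label 1 = spam.
ham_kept, spam_missed = 1413, 5
ham_flagged, spam_caught = 24, 230

total = ham_kept + spam_missed + ham_flagged + spam_caught
accuracy = (ham_kept + spam_caught) / total

# Precision: of the messages flagged as spam, how many really were spam.
spam_precision = spam_caught / (spam_caught + ham_flagged)
# Recall: of the real spam messages, how many were flagged.
spam_recall = spam_caught / (spam_caught + spam_missed)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(round(accuracy, 4))
print(round(spam_precision, 2), round(spam_recall, 2), round(spam_f1, 2))
```

For a spam filter, the 24 legitimate messages that were flagged are arguably the more costly kind of error, since a hidden real message is worse than one spam message slipping through.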



#### Save the Model to a Pickle File

In [14]:
import pickle
# Save the trained model to disk; the context manager closes the file afterwards
with open('spamclassifier.pkl', 'wb') as f:
    pickle.dump(naive, f)
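The saved model can only score new messages if they are converted to counts with the same fitted vocabulary, so it may be worth pickling the vectorizer alongside the model. A minimal sketch on a toy corpus, assuming made-up training texts and using in-memory bytes in place of a file (`toy_cv`, `toy_model` and `blob` are illustrative names):

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up, already-cleaned training messages (1 = spam, 0 = ham).
texts = ["free prize win cash", "win free entry", "see lunch", "ok see later"]
labels = [1, 1, 0, 0]

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(texts), labels)

# Serialise the vectorizer together with the model; bytes stand in for a file here.
blob = pickle.dumps({"vectorizer": toy_cv, "model": toy_model})

# Later, possibly in another process: restore both and classify a new message.
restored = pickle.loads(blob)
new_counts = restored["vectorizer"].transform(["free cash prize"])  # transform, not fit_transform
prediction = int(restored["model"].predict(new_counts)[0])
print(prediction)
```

A new message should also go through the same cleaning steps as the training data before it is passed to `transform`.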