# Spam Filter Classifier
With the development of mobile communication technology and the spread of mobile phones, the Short Message Service (SMS) has become one of the most important modes of communication, thanks to its simple operation and low price.


Among all types of short messages, we focus on spam. Spam wastes network traffic, storage space and computational power, which translates into financial cost. According to Cloudmark statistics, the volume of mobile phone spam varies widely from region to region. For instance, in North America less than 1% of SMS messages were spam in 2010, while in parts of Asia up to 30% of messages were spam. In China in 2008, about 1.9 billion messages were sent per day, and mobile phone users received an average of 10.35 spam messages per week.

Many methods have been applied to detect SMS spam. They fall into two groups: content-based and non-content-based approaches. Social network analysis is a typical non-content-based approach; it is used more often by telecom operators than by mobile phone users. Content-based approaches, on the other hand, include automatic text classification techniques such as Support Vector Machines (SVMs), the k-nearest neighbor algorithm, logistic regression and the Winnow algorithm.


In [7]:
import sys
import numpy as np
import pandas as pd
import sklearn
import nltk

#Check versions
print('Python:  {}'.format(sys.version))
print('numpy:  {}'.format(np.__version__))
print('pandas:  {}'.format(pd.__version__))
print('sklearn:  {}'.format(sklearn.__version__))
print('nltk:  {}'.format(nltk.__version__))



Python:  3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)]
numpy:  1.16.2
pandas:  0.24.2
sklearn:  0.20.3
nltk:  3.4


## Natural Language Toolkit (NLTK)
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”

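Let's download the NLTK data packages we need. Calling nltk.download() with no arguments opens the interactive downloader; the only resource used later in this notebook is the stopwords corpus, which can also be fetched directly with nltk.download('stopwords').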

In [17]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

## About the data
This corpus has been collected from sources on the Internet that are free or free for research:

A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. 

A subset of 3,375 randomly chosen ham SMS messages from the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available.

The data file is tab-separated and has no header row, so we supply the column names ourselves.

In [18]:
# Import the Data
data = pd.read_csv('SMSSpamCollection', sep='\t', names=['label','message'])

In [19]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
label      5572 non-null object
message    5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB
None


### Data Info:
The index runs continuously from 0 to 5571.
<br>
Total rows: 5572
<br>
Total columns: 2


In [20]:
# Printing first 5 rows of dataset
print(data.head())

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [21]:
# Counting the label counts
print(data['label'].value_counts())

ham     4825
spam     747
Name: label, dtype: int64
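The label counts above show that the classes are imbalanced, which matters when reading an accuracy score later on. As a rough sketch (the variable names here are illustrative and not part of the notebook), the share of each class and the accuracy of a trivial "always predict the majority class" baseline can be computed directly from those two counts:

```python
# Label counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())

# Fraction of the corpus that belongs to each class.
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent label
# would be right exactly as often as that label occurs.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline ({majority_label}): {baseline_accuracy:.1%}")
```

Any model trained on this data should be judged against that baseline rather than against 50%.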


## Data Preprocessing

We import the packages required for data cleaning and processing.
<br>
They include:
<br>
<b>re (regular expressions):</b> Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'.
<br>
Here we use a regular expression to replace every character that is not a letter with a space (' ') in each message, and we store the cleaned messages in a list.

<b>Stopwords:</b> Stopwords are words that add little meaning to a sentence. They can usually be ignored without changing what the sentence means. For example: the, is, at, which, and on.
<br>
We remove all stopwords from our messages so that the model can concentrate on the words that carry information.
<br>
<b>PorterStemmer:</b> Stemming is a kind of normalization. Many variations of a word carry the same meaning, other than when tense is involved.
<br>
We stem in order to shorten the lookup and to normalize sentences.
<br>
Consider:
<br>
I was taking a ride in the car.<br>
I was riding in the car.<br>
These sentences mean the same thing. "in the car" is the same. "I was" is the same. Together with "was", the -ing form marks a past activity in both cases, so is it truly necessary to differentiate between ride and riding when we are only trying to work out what that activity was?
<br>

No, not really.
<br>
This is just one minor example, but imagine every word in the English language, and every possible tense and affix you can put on a word. Having individual dictionary entries per version would be highly redundant and inefficient, especially since, once we convert to numbers, the "value" is going to be identical.
<br>
So, in short, PorterStemmer reduces each word to its stem. The stem need not be a dictionary word: in the processed output further down, "crazy" becomes "crazi" and "available" becomes "avail".
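The cleaning steps described above can be sketched on a single message as follows. This is a minimal illustration, not the notebook's pipeline: `STOP_WORDS` is a hypothetical miniature list standing in for NLTK's English stopwords, `clean` is an illustrative helper name, and stemming is left out so that the example needs only the standard library.

```python
import re

# Hypothetical miniature stopword list, for illustration only;
# the notebook uses nltk's stopwords.words('english') instead.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "in", "to"}

# Matches every character that is not an ASCII letter.
NON_LETTERS = re.compile(r"[^a-zA-Z]")

def clean(text):
    """Replace non-letters with spaces, lowercase, and drop stopwords."""
    tokens = NON_LETTERS.sub(" ", text).lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

sample = "WINNER!! Call 0800_123 to claim the prize at once."
print(clean(sample))

# The upper bound of the range matters: 'A-z' also covers the ASCII
# characters between 'Z' and 'a' ([ \ ] ^ _ `), so underscores survive.
loose = re.sub(r"[^a-zA-z]", " ", "win_now")
strict = NON_LETTERS.sub(" ", "win_now")
print(repr(loose), repr(strict))
```

Building the stopword collection once as a set, rather than re-reading it for every word, also keeps the membership test cheap on a corpus of several thousand messages.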

In [22]:
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Initialising the stemmer
ps = PorterStemmer()


In [43]:
# Build the list of cleaned messages
stop_words = set(stopwords.words('english'))  # load once; set lookup is fast
messages = []
for text in data['message']:
    # keep letters only (A-Z, not A-z, which would also match [ \ ] ^ _ `)
    msg = re.sub('[^a-zA-Z]', ' ', text)
    msg = msg.lower()
    msg = msg.split()

    # stemming and removing stopwords
    msg = [ps.stem(word) for word in msg if word not in stop_words]
    msg = ' '.join(msg)
    messages.append(msg)

In [44]:
# printing first 10 lines of our processed data
print(len(messages))
messages[:10]

5572


['go jurong point crazi avail bugi n great world la e buffet cine got amor wat',
 'ok lar joke wif u oni',
 'free entri wkli comp win fa cup final tkt st may text fa receiv entri question std txt rate c appli',
 'u dun say earli hor u c alreadi say',
 'nah think goe usf live around though',
 'freemsg hey darl week word back like fun still tb ok xxx std chg send rcv',
 'even brother like speak treat like aid patent',
 'per request mell mell oru minnaminungint nurungu vettam set callertun caller press copi friend callertun',
 'winner valu network custom select receivea prize reward claim call claim code kl valid hour',
 'mobil month u r entitl updat latest colour mobil camera free call mobil updat co free']

### Observation:
From the sample printed above we can see that all punctuation and digits have been removed and all letters have been converted to lower case, so the model does not treat upper- and lower-case forms of a word as different tokens.
<br>
All stopwords have been removed from our data, and every remaining word has been reduced to its stem.

### Converting our labels to binary digits

In [45]:
# Converting class labels to binary values, 0=ham and 1=spam
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
Y = encoder.fit_transform(data['label'])

print(Y[:10])

[0 0 1 0 0 1 0 0 1 1]
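Why ham ends up as 0 and spam as 1 can be illustrated without scikit-learn. `LabelEncoder` sorts the distinct labels and numbers them in that order (the fitted order is available as `encoder.classes_`); the sketch below mimics that behaviour in plain Python on the first ten labels, which are read off the output above. The names `classes` and `to_int` are illustrative.

```python
# First ten labels of the dataset, matching the encoded output above.
labels = ["ham", "ham", "spam", "ham", "ham",
          "spam", "ham", "ham", "spam", "spam"]

# Sort the distinct labels, then number them in that order.
classes = sorted(set(labels))
to_int = {label: i for i, label in enumerate(classes)}

encoded = [to_int[label] for label in labels]
print(to_int)
print(encoded)
```

Because the order is alphabetical, "spam" being the positive class (1) is a consequence of the label names, not of anything specific to this dataset.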


## Creating Bag of Words
The bag-of-words (BOW) model is a representation that turns arbitrary text into fixed-length vectors by counting how many times each word appears. This process is often referred to as vectorization.
<br>
We need a bag of words because our machine learning algorithm works only with numerical data; it cannot learn anything useful from raw text.

In [46]:
# Creating Bag of words 

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X=cv.fit_transform(messages).toarray()
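To make the idea of vectorization concrete, here is a small pure-Python sketch of a bag-of-words matrix built from three made-up, already-cleaned messages. It is a simplification under stated assumptions: `docs` and `vectorize` are illustrative names, tokens are split on whitespace only, and single-character tokens are kept, whereas `CountVectorizer` with default settings ignores tokens shorter than two characters.

```python
from collections import Counter

# Three hypothetical cleaned messages (not taken from the dataset).
docs = ["free entri win prize", "ok lar joke", "win free prize free"]

# Vocabulary: every distinct token, sorted so the column order is fixed.
vocab = sorted({tok for doc in docs for tok in doc.split()})

def vectorize(doc):
    """Return a fixed-length vector of token counts for one document."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocab]

matrix = [vectorize(doc) for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Every row has one column per vocabulary word, so the matrix is mostly zeros; on the full corpus this is why the vectorizer returns a sparse matrix, which the notebook converts to a dense array with `.toarray()`.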


In [74]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score

# Splitting the data into testing and training data
X_train,X_test,Y_train,Y_test = train_test_split(X,Y, test_size=0.2, random_state=1)

print(len(X_train))
print(len(X_test))

4457
1115


## Model Fitting

In [75]:
# Training the model

NB = MultinomialNB()
model = NB.fit(X_train,Y_train)


In [76]:
# model prediction 
Y_pred = model.predict(X_test)

# Creating confusion matrix
pd.DataFrame(
    confusion_matrix(Y_test,Y_pred),
    index = [['actual ham','actual spam']],
    columns = [['predicted','predicted'],['ham','spam']])


| | predicted ham | predicted spam |
|---|---|---|
| actual ham | 952 | 16 |
| actual spam | 4 | 143 |


In [77]:
# Accuracy score
accuracy = accuracy_score(Y_test, Y_pred)*100
print(accuracy)

98.20627802690582


## Fitted Model Report
The Multinomial Naive Bayes model fits the data well, with an accuracy of 98.21% on the test set.
From the confusion matrix, treating spam as the positive class, we can conclude:

Of the 968 ham messages in the test set, 952 were classified correctly and 16 were wrongly predicted as spam (false positives).

Of the 147 spam messages in the test set, 143 were classified correctly and 4 were wrongly predicted as ham (false negatives).
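Because ham is far more common than spam, accuracy alone can hide how well the spam class is handled. The sketch below derives precision, recall and F1 for the spam class from the four confusion-matrix counts reported above, treating spam as the positive class. The variable names are illustrative; scikit-learn offers equivalent functions in `sklearn.metrics`, but plain arithmetic is used here to keep the calculation visible.

```python
# Counts read off the confusion matrix above (spam = positive class).
tn = 952   # ham predicted as ham
fp = 16    # ham predicted as spam
fn = 4     # spam predicted as ham
tp = 143   # spam predicted as spam

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
```

For a spam filter the two error types have different costs: a false positive hides a legitimate message from the user, while a false negative only lets one spam message through, so precision on the spam class is often the number to watch.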