#### Naive Bayes

In [1]:
#import library
import pandas as pd

In [2]:
#read data
df = pd.read_csv("spam.csv")
#display all data 
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [3]:
#print full summary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [4]:
#display data first five rows
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
#display data last five rows
df.tail()

Unnamed: 0,Category,Message
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


In [6]:
#groupby category and describe
df.groupby('Category').describe()

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


In [7]:
#describe non-spam emails 
df[df.Category == 'ham'].describe()

Unnamed: 0,Category,Message
count,4825,4825
unique,1,4516
top,ham,"Sorry, I'll call later"
freq,4825,30


In [8]:
#describe spam emails 
df[df.Category == 'spam'].describe()

Unnamed: 0,Category,Message
count,747,747
unique,1,641
top,spam,Please call our customer service representativ...
freq,747,4


In [9]:
#count the null values in each column
df.isnull().sum()

Category    0
Message     0
dtype: int64

It can be observed that there are no NaN or None values in the data set.

#### Machine learning models need numerical input

- Category and Message must both be converted to numbers first.

In [10]:
#apply a lambda to the Category column: return 1 if the label is 'spam', else 0
#store the result in a new Spam column
df['Spam'] = df['Category'].apply(lambda x: 1 if x=='spam' else 0)

# display data first five rows
df.head()

Unnamed: 0,Category,Message,Spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


- This creates a new column named "Spam" that labels each email as 1 (spam) or 0 (ham).
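
The same 0/1 encoding can also be produced without a lambda. The sketch below is only an illustration on a hypothetical three-row DataFrame (`toy` is not the `spam.csv` data); it compares the column against `'spam'` and casts the resulting booleans to integers.

```python
import pandas as pd

# Hypothetical stand-in for the Category column (not the real spam.csv data).
toy = pd.DataFrame({"Category": ["ham", "spam", "ham"]})

# The comparison yields True/False per row; astype(int) turns that into 1/0.
toy["Spam"] = (toy["Category"] == "spam").astype(int)

spam_flags = toy["Spam"].tolist()
print(spam_flags)
```

Both approaches give the same column; the vectorized comparison simply avoids calling a Python function once per row.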

In [11]:
#import the sklearn helper that splits the data into train and test sets
from sklearn.model_selection import train_test_split

In [12]:
#pass X and y as input and test_size as the split ratio; four objects are returned
#25% test and 75% train
#running the cell randomly splits the samples into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df.Message,df.Spam,test_size=0.25)
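
# The call above draws a different random split on every run. As a hedged
# aside, the sketch below shows two optional arguments of train_test_split on
# hypothetical toy data (the names messages, labels and split are made up here):
# random_state makes the split repeatable, and stratify keeps the spam/ham
# ratio the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 ham (0) and 4 spam (1) messages.
messages = [f"ham message {i}" for i in range(12)] + [f"spam message {i}" for i in range(4)]
labels = [0] * 12 + [1] * 4


def split(seed):
    # Same 75/25 ratio as above, but seeded and stratified by label.
    return train_test_split(
        messages, labels, test_size=0.25, random_state=seed, stratify=labels
    )


X_tr, X_te, y_tr, y_te = split(0)
X_tr2, X_te2, y_tr2, y_te2 = split(0)

print(len(X_tr), len(X_te))   # sizes of the two parts
print(sum(y_tr), sum(y_te))   # number of spam messages in each part
print(X_te == X_te2)          # same seed, same split
```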

In [13]:
#convert a collection of text documents to a matrix of token counts
#each unique word in the training messages becomes a column of the matrix
#the vocabulary is large, so most entries in any one row are 0

from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
X_train_count.toarray()[:3]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [14]:
#multinomial naive Bayes suits discrete features such as word counts
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()

In [15]:
#the fit method trains the model
#X_train_count is the email text converted into a matrix of word counts
model.fit(X_train_count,y_train)

MultinomialNB()
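
The vectorize-then-fit steps can also be bundled into a single object. The following is a minimal sketch using `make_pipeline` on hypothetical toy messages (`train_texts`, `train_labels` and `clf` are illustrative names, not part of this notebook); it is one possible arrangement, not a replacement for the cells above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages and labels (1 = spam, 0 = ham).
train_texts = [
    "win free prize now",
    "free cash prize",
    "see you at lunch",
    "call me at lunch",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and feeds the counts to the classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

# Raw strings go straight in; no separate transform call is needed.
pipeline_pred = clf.predict(["free prize", "lunch at noon"]).tolist()
print(pipeline_pred)
```

With a pipeline, the same fitted vocabulary is applied automatically at prediction time, which removes the risk of forgetting to transform new text.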

In [16]:
#predict data

emails = [
    'You have updated Instagram username.',
    'Click on this link to change your account details.'
]

emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1])

Of the two emails above, the first is an ordinary notification telling the user that their username has been updated, and the model classifies it as not spam (0). The second email is classified as spam (1).
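
`predict` returns only the final label. To see how confident the classifier is, `predict_proba` gives the estimated probability of each class. The sketch below uses hypothetical toy data (`train_texts`, `vec` and `nb` are illustrative names); the numbers it prints say nothing about the model trained above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy messages and labels (1 = spam, 0 = ham).
train_texts = [
    "win free prize now",
    "free cash prize",
    "see you at lunch",
    "call me at lunch",
]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
nb = MultinomialNB()
nb.fit(vec.fit_transform(train_texts), train_labels)

new_texts = ["free prize", "lunch at noon"]
# One row per message; columns follow nb.classes_, here [0, 1].
proba = nb.predict_proba(vec.transform(new_texts))

for text, (p_ham, p_spam) in zip(new_texts, proba):
    print(f"{text!r}: P(ham)={p_ham:.3f}, P(spam)={p_spam:.3f}")
```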

In [17]:
#the model works only on numerical values, not text
#so X_test must be converted into counts with the already fitted vectorizer before prediction
X_test_count = v.transform(X_test)

#check the accuracy of the model by calling the score method
#score predicts labels for X_test_count and compares them with y_test to compute the accuracy
model.score(X_test_count, y_test)

0.9863603732950467

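A test accuracy of about 98.6% shows that the naive Bayes spam filter separates spam from ham reliably on this data set. Because train_test_split is called without a fixed random_state, the exact figure changes slightly between runs.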
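
Accuracy alone can flatter a classifier when the classes are imbalanced: in the groupby output above, ham accounts for 4825 of the 5572 messages (about 87%), so a model that always answers "ham" would already score about 87%. A confusion matrix together with precision and recall gives a fuller picture. The sketch below uses hypothetical label vectors (`y_true` and `y_pred` are made up, not results from this notebook) to show the calls.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: 6 ham (0) and 2 spam (1); one spam message is missed.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred).tolist()
# Precision: of the messages flagged as spam, how many really are spam.
prec = precision_score(y_true, y_pred)
# Recall: of the real spam messages, how many were caught.
rec = recall_score(y_true, y_pred)

print(acc, cm, prec, rec)
```

In this notebook the same functions could be called with `y_test` and `model.predict(X_test_count)` to check how many spam emails slip through.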