# Spam or Ham

ML Problem: text classification

Algorithms: Naive Bayes Classification

Data: [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset)



This application will determine whether or not a text message is spam.

In [1]:
# import relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Reading and Cleaning the Data

In [2]:
# read csv file ('latin-1' is used here because 'ansi' is a Windows-only codec alias)
spam = pd.read_csv('spam.csv', encoding='latin-1')
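The `encoding` argument matters because this file does not decode as UTF-8, which is what `read_csv` assumes by default. The sketch below illustrates the difference on a single byte string. It assumes that the `Ì` character seen in some messages is stored as the byte `0xCC`; that is a guess from the displayed output, not something checked against the file.

```python
# Hypothetical sample bytes: 0xCC is assumed to be how 'Ì' is stored in the file
raw = b'Will \xcc_ b going to esplanade fr home?'

try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    # 0xCC starts a multi-byte UTF-8 sequence that '_' cannot continue
    utf8_ok = False

# latin-1 maps every byte value to exactly one character, so decoding cannot fail
text = raw.decode('latin-1')
print(utf8_ok, text)
```

A codec that accepts every byte will always load the file, but it does not guarantee that each character is the one the sender typed.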

In [3]:
spam.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


After looking at the head of the dataset, we notice something odd: there are three 'Unnamed' columns, even though the Kaggle dataset describes only two columns. Since these extra columns seem to be filled mostly with NaN values, we will look at the rows with non-NaN values.

In [4]:
spam[spam['Unnamed: 2'].notna()].head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
95,spam,Your free ringtone is waiting to be collected....,PO Box 5249,"MK17 92H. 450Ppw 16""",
281,ham,\Wen u miss someone,the person is definitely special for u..... B...,why to miss them,"just Keep-in-touch\"" gdeve.."""
444,ham,\HEY HEY WERETHE MONKEESPEOPLE SAY WE MONKEYAR...,HOWU DOIN? FOUNDURSELF A JOBYET SAUSAGE?LOVE ...,,
671,spam,SMS. ac sun0819 posts HELLO:\You seem cool,"wanted to say hi. HI!!!\"" Stop? Send STOP to ...",,
710,ham,Height of Confidence: All the Aeronautics prof...,"this wont even start........ Datz confidence..""",,


For some strange reason, some of the messages have been split across multiple columns. To rectify this, we will append the 'Unnamed' columns to 'v2' and then drop them.

In [13]:
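# replace NaN with empty strings so the text columns can be concatenated
spam_combine = spam.replace(np.nan, '')
# append the overflow columns to the message text, then drop them
spam_combine['v2'] = spam_combine['v2'] + spam_combine['Unnamed: 2'] + spam_combine['Unnamed: 3'] + spam_combine['Unnamed: 4']
spam_combine.drop(labels=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)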

In [14]:
spam_combine

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


Now that it's a little cleaner, let's look at the output of info() and describe().

In [16]:
spam_combine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      5572 non-null   object
 1   v2      5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [17]:
spam_combine.describe()

Unnamed: 0,v1,v2
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


Spam filters usually disregard punctuation, so let's do the same and remove punctuation from the texts.

In [26]:
import string

def remove_punc(text):
    '''Takes in a text message in the form of a string and returns it with the punctuation removed'''

    # create a set of excluded characters
    exclude = set(string.punctuation)

    # re-join the characters to form the text without punctuation
    no_punc = ''.join(char for char in text if char not in exclude)

    return no_punc

In [28]:
spam_combine['v2'] = spam_combine['v2'].apply(remove_punc)

In [29]:
spam_combine

Unnamed: 0,v1,v2
0,ham,Go until jurong point crazy Available only in ...
1,ham,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor U c already then say
4,ham,Nah I dont think he goes to usf he lives aroun...
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì b going to esplanade fr home
5569,ham,Pity was in mood for that Soany other suggest...
5570,ham,The guy did some bitching but I acted like id ...
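One side effect is visible in row 5569: deleting the punctuation in "So...any" glues the two words into "Soany". A possible alternative, sketched below with a hypothetical helper `remove_punc_spaced` (not part of the original notebook), replaces each punctuation character with a space and then collapses repeated whitespace.

```python
import string

# translation table that maps every punctuation character to a single space
_PUNC_TO_SPACE = str.maketrans({char: ' ' for char in string.punctuation})

def remove_punc_spaced(text):
    '''Hypothetical variant: replaces punctuation with spaces instead of deleting it'''
    spaced = text.translate(_PUNC_TO_SPACE)
    # split() with no arguments drops runs of whitespace, so words are re-joined by one space
    return ' '.join(spaced.split())

print(remove_punc_spaced('So...any other'))
print(remove_punc_spaced("I don't think so"))
```

The trade-off is that contractions such as "don't" become two tokens ("don t") instead of "dont", so which variant works better depends on how the words are used later.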


Many of the words we see are repeated across the text messages, but most of them hold no real significance beyond grammatical correctness. These words are called stop words, and we can use the Natural Language Toolkit (NLTK) to remove them.
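As a minimal sketch of the idea, the function below filters out stop words from a message. The name `remove_stopwords` and the tiny `SAMPLE_STOP_WORDS` set are assumptions made for illustration so that the example runs without any downloads. In practice the set would typically come from NLTK's English stop word list, as noted in the comments.

```python
# Tiny illustrative stop-word set (an assumption for this sketch).
# With NLTK, the list would typically be obtained like this:
#   import nltk
#   nltk.download('stopwords')
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words('english'))
SAMPLE_STOP_WORDS = {'i', 'he', 'to', 'the', 'a', 'in', 'so', 'is'}

def remove_stopwords(text, stop_words=SAMPLE_STOP_WORDS):
    '''Drops whitespace-separated words whose lowercase form is a stop word'''
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(kept)

print(remove_stopwords('Nah I dont think he goes to usf'))
```

The comparison is done in lowercase because stop word lists are usually lowercase, while the messages are not. Applying it to the whole column would follow the same pattern as before, for example `spam_combine['v2'].apply(remove_stopwords)`.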