# SMS Spam Classification

This demonstration uses basic NLP techniques to preprocess text into a bag-of-words-like set of features (TF-IDF weights). Once the data are prepared, a Naive Bayes model is used to classify whether a given SMS is spam or not.

Dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/00228/

In [1]:
import numpy as np
import pandas as pd

In [2]:
# The file is tab-separated with no header row: column 0 is the label, column 1 the message text
data = pd.read_csv('../data/smsspamcollection/smsspamcollection.data', sep='\t', header=None)
data.head()

|   | 0 | 1 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
corpus = data.values[:, 1]
corpus

array(['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
       'Ok lar... Joking wif u oni...',
       "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
       ..., 'Pity, * was in mood for that. So...any other suggestions?',
       "The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free",
       'Rofl. Its true to its name'], dtype=object)

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the 2000 most frequent terms in the corpus and weight them by TF-IDF
vectorizer = TfidfVectorizer(max_features=2000)
X = vectorizer.fit_transform(corpus)
X.shape

(5572, 2000)

Possible improvement: remove *stop words*, i.e. very common words (such as "the" or "in") that carry little information about the content of a message.

In [5]:
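# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out().tolist()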

['00',
 '000',
 '02',
 '03',
 '04',
 '05',
 '06',
 '0800',
 '08000839402',
 '08000930705',
 '08001950382',
 '0870',
 '08707509020',
 '08712300220',
 '08712460324',
 '08715705022',
 '08718720201',
 '09050090044',
 '10',
 '100',
 '1000',
 '10am',
 '10p',
 '11',
 '11mths',
 '12',
 '12hrs',
 '1327',
 '150',
 '150p',
 '150pm',
 '150ppm',
 '16',
 '18',
 '1st',
 '1x150p',
 '20',
 '200',
 '2000',
 '2003',
 '2004',
 '20p',
 '21',
 '24',
 '25',
 '250',
 '25p',
 '28',
 '2day',
 '2lands',
 '2mrw',
 '2nd',
 '2nite',
 '2optout',
 '30',
 '300',
 '3030',
 '350',
 '3510i',
 '36504',
 '3d',
 '3g',
 '3rd',
 '400',
 '40gb',
 '434',
 '4info',
 '4t',
 '4th',
 '4u',
 '50',
 '500',
 '5000',
 '50p',
 '530',
 '542',
 '5th',
 '5wb',
 '5we',
 '60p',
 '62468',
 '750',
 '7pm',
 '800',
 '80062',
 '8007',
 '80488',
 '82277',
 '83355',
 '85023',
 '86021',
 '86688',
 '87066',
 '87077',
 '8th',
 '900',
 '9am',
 'aathi',
 'abi',
 'abiola',
 'able',
 'about',
 'abt',
 'ac',
 'acc',
 'accept',
 'access',
 'account',
 'acro

In the next step, we encode the 'ham' and 'spam' labels as the binary values 0 and 1. Since this is a binary classification problem, we don't need to use OneHotEncoder.

In [6]:
from sklearn.preprocessing import LabelEncoder

y = data.values[:, 0]
y = LabelEncoder().fit_transform(y)
y

array([0, 0, 1, ..., 0, 0, 0])
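
To check which label became which number, the fitted encoder can be inspected. This is a small sketch on a hand-written label list (`labels`, `encoder`, `encoded` and `mapping` are illustrative names); `LabelEncoder` sorts the classes, so 'ham' comes before 'spam'.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels in the same format as column 0 of the dataset
labels = ["ham", "ham", "spam", "ham"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels; the position of each label is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(encoded)   # integer codes, one per input label
print(mapping)   # which label maps to which code

# inverse_transform turns integer codes back into the original labels
print(encoder.inverse_transform([1, 0]))
```

Keeping a reference to the encoder, instead of calling `LabelEncoder().fit_transform(y)` in one go, makes it possible to translate predictions back to 'ham' and 'spam' later.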

In [7]:
from sklearn.model_selection import train_test_split
# Hold out 33% of the messages for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

We are now ready to fit an ML model on the preprocessed data.

In [8]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X_train, y_train)

In [9]:
model.score(X_test, y_test)

0.9809679173463839
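
To classify a message that was not in the dataset, it has to go through the same fitted vectorizer before it reaches the model. One way to keep the two steps together is a scikit-learn pipeline. The sketch below is self-contained and trained on six made-up messages (all names and texts are illustrative), so it only demonstrates the mechanics and says nothing about accuracy on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training messages, not taken from the SMS dataset
train_texts = [
    "win a free prize now",
    "free entry win cash",
    "claim your free prize",
    "see you at home tonight",
    "are you coming home",
    "ok see you later",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline fits the vectorizer and the classifier together,
# and applies the same vocabulary to any new text at prediction time.
spam_filter = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_filter.fit(train_texts, train_labels)

new_messages = ["free cash prize", "see you at home"]
predictions = spam_filter.predict(new_messages)
for message, label in zip(new_messages, predictions):
    print(f"{label}: {message}")
```

A pipeline fitted only on the training split also learns its vocabulary and IDF weights from training messages alone, whereas the notebook above fits the vectorizer on the full corpus before splitting.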