# SMS Spam Classification

This demonstration uses basic NLP techniques to preprocess text into a bag-of-words-like set of features. Once the data are prepared, a Naive Bayes model classifies each SMS as spam or not spam.

Dataset source: https://archive.ics.uci.edu/ml/machine-learning-databases/00228/

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer

In [2]:
data = pd.read_csv('../data/smsspamcollection/smsspamcollection.data', sep='\t', header=None)
data.head()

| | 0 | 1 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
corpus = data.values[:, 1]
corpus

array(['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
       'Ok lar... Joking wif u oni...',
       "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
       ..., 'Pity, * was in mood for that. So...any other suggestions?',
       "The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free",
       'Rofl. Its true to its name'], dtype=object)

In [4]:
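# Load the stop-word list (one word per line) into a set for fast lookups.
with open('../data/general/english-stopwords.txt') as f:
    stop_words = set(f.read().splitlines())
# WordNet-based lemmatizer; requires the NLTK 'wordnet' corpus to be available.
lemma = WordNetLemmatizer()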

In [5]:
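def preprocess_text(t):
    """Lowercase, strip edge commas/periods, lemmatize, and drop stop words."""
    words = t.split()
    # Strip leading/trailing commas and periods, then lowercase each token.
    words = [word.strip(',.').lower() for word in words]
    # Reduce each token to its WordNet lemma (treated as a noun by default).
    words = [lemma.lemmatize(word) for word in words]
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)

# Apply the same preprocessing to every message.
corpus = [preprocess_text(text) for text in corpus]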

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(corpus)
X.shape

(5572, 1000)
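
To make the bag-of-words idea concrete, here is a minimal plain-Python sketch that builds raw term counts for three toy messages. The messages and the helper name `bag_of_words` are invented for illustration and are not part of the dataset or the notebook. `TfidfVectorizer` goes further by reweighting such counts by inverse document frequency, which this sketch does not attempt.

```python
from collections import Counter

# Toy messages, invented for illustration (not taken from the dataset).
docs = ["free entry win cup", "ok see you later", "win free prize free"]

# Vocabulary: every distinct token, sorted so each token gets a fixed column.
vocab = sorted({tok for doc in docs for tok in doc.split()})

def bag_of_words(doc):
    """Return a count vector with one slot per vocabulary token."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocab]

# One row per message, one column per vocabulary token.
matrix = [bag_of_words(doc) for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Word order is discarded: each message becomes a fixed-length vector, which is what lets a standard classifier consume it.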

In the next step, we encode the 'ham' and 'spam' labels as binary 0 and 1. Since this is a binary classification problem, we don't need to use OneHotEncoder.

In [7]:
from sklearn.preprocessing import LabelEncoder

y = data.values[:, 0]
y = LabelEncoder().fit_transform(y)
y

array([0, 0, 1, ..., 0, 0, 0])
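
`LabelEncoder` assigns integers to classes in sorted order, which is why 'ham' becomes 0 and 'spam' becomes 1. The mapping can be mirrored in plain Python as a rough sketch; the short `labels` list below is a hand-written sample, and the variable names are illustrative only.

```python
# Hand-written sample of labels, for illustration only.
labels = ["ham", "ham", "spam", "ham"]

# Sort the distinct classes, then number them in that order.
classes = sorted(set(labels))
mapping = {cls: i for i, cls in enumerate(classes)}

# Replace each label by its integer code.
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```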

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

We are now ready to fit an ML model on the preprocessed data.

In [9]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X_train, y_train)

In [10]:
model.score(X_test, y_test)

0.9847743338771071
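
For intuition about what the classifier does, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It works on raw token counts from four invented messages, whereas the notebook feeds TF-IDF weights to scikit-learn's `MultinomialNB`. The function names `train_nb` and `predict_nb` are hypothetical, and this is an illustration of the technique rather than the library's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())

    model = {}
    for label, n_docs in class_docs.items():
        # Denominator: all tokens seen in this class plus the smoothing mass.
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_lik": {
                tok: math.log((token_counts[label][tok] + alpha) / total)
                for tok in vocab
            },
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in doc.split():
            # Tokens outside the training vocabulary are ignored.
            if tok in params["log_lik"]:
                score += params["log_lik"][tok]
        scores[label] = score
    return max(scores, key=scores.get)

# Invented training messages, for illustration only.
train_docs = [
    "win free prize now",
    "free entry win cup",
    "see you at lunch",
    "ok call you later",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_nb(train_docs, train_labels)
prediction = predict_nb(model, "free prize")
print(prediction)
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a token that never appeared in one class from driving that class's probability to zero.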