# 04-Spam-Classifier

It's time to make our first real Machine Learning application of NLP: a spam classifier!

A spam classifier is a Machine Learning model that classifies texts (email or SMS) into two categories: spam (1) or legitimate (0).

To do that, we will reuse what we already know: we will apply preprocessing and a BOW (Bag Of Words) to a dataset of texts.
Then we will use a classifier to predict, based on the BOW, which class a new email/SMS belongs to.

First things first: import the needed libraries.

In [52]:
# Import NLTK and all the needed libraries
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Now load the dataset in *spam.csv* using pandas. Pass the 'latin-1' encoding as a loading option.

In [53]:
# TODO: Load the dataset 
spam_df = pd.read_csv('spam.csv', encoding='latin-1')
spam_df.head(5)

  Class                                            Message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


As usual, I suggest you explore this dataset a bit.

In [54]:
# TODO: explore the dataset

print("Dataset dimensions:", spam_df.shape)

print(spam_df.info())

print(spam_df.head())

Dataset dimensions: (5572, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Class    5572 non-null   object
 1   Message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
None
  Class                                            Message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


As you can see, we have one column containing the labels and one column containing the text to classify.

We will begin with the usual preprocessing: tokenization, stop word and punctuation removal, and lemmatization.

In [55]:
# TODO: Perform preprocessing over all the text
# Requires the NLTK tokenizer, stopword and WordNet data (see nltk.download)
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Build these once instead of on every call
stopwords_en = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


def preprocess(text):
    # Lowercase and tokenize the text
    tokens = word_tokenize(text.lower())

    # Remove stopwords and punctuation tokens
    # (multi-character ones such as '...' slip through this test)
    tokens = [t for t in tokens if t not in stopwords_en and t not in string.punctuation]

    # Lemmatize the remaining tokens
    return [lemmatizer.lemmatize(t) for t in tokens]


spam_df['tokens'] = spam_df['Message'].apply(preprocess)

print(spam_df['tokens'])

0       [go, jurong, point, crazy, .., available, bugi...
1                [ok, lar, ..., joking, wif, u, oni, ...]
2       [free, entry, 2, wkly, comp, win, fa, cup, fin...
3       [u, dun, say, early, hor, ..., u, c, already, ...
4        [nah, n't, think, go, usf, life, around, though]
                              ...                        
5567    [2nd, time, tried, 2, contact, u., u, ï¿½750, ...
5568                [ï¿½_, b, going, esplanade, fr, home]
5569                        [pity, mood, ..., suggestion]
5570    [guy, bitching, acted, like, 'd, interested, b...
5571                                   [rofl, true, name]
Name: tokens, Length: 5572, dtype: object
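
The output above still contains tokens such as `..` and `...`. The test `t not in string.punctuation` checks whether the token is a substring of the punctuation string, and a run of dots is not. Below is a minimal sketch of a stricter filter, assuming we want to drop every token made only of punctuation characters. The helper name `is_punctuation` and the sample tokens are hypothetical, and applying this filter would change the vocabulary size and counts reported in the rest of this notebook.

```python
import string

PUNCT_CHARS = set(string.punctuation)


def is_punctuation(token):
    """Return True if the token is non-empty and made only of punctuation characters."""
    return len(token) > 0 and all(ch in PUNCT_CHARS for ch in token)


# Sample tokens shaped like the preprocessing output (hypothetical)
tokens = ["ok", "lar", "...", "joking", "wif", "u", "oni", "...", ".."]
cleaned = [t for t in tokens if not is_punctuation(t)]
print(cleaned)
```

Tokens that mix letters and punctuation, such as `n't`, are kept by this test.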


OK, now we have our preprocessed data. The next step is to compute a BOW.

In [56]:
# TODO: compute the BOW
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object
vectorizer = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)

# Fit the vectorizer on the preprocessed text and transform it into a BOW matrix
bow = vectorizer.fit_transform(spam_df['tokens'])

# Print the shape of the BOW matrix
print("BOW matrix shape:", bow.shape)



BOW matrix shape: (5572, 8916)
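
To make that shape concrete: each row is a message and each column is a vocabulary token, with the cell holding how often the token occurs in the message. The sketch below builds such a matrix by hand with the standard library only. It is a simplified illustration on three made-up token lists, not a description of how `CountVectorizer` is implemented.

```python
from collections import Counter

# Three hypothetical, already tokenized messages
docs = [
    ["free", "entry", "win", "free"],
    ["ok", "see", "u"],
    ["win", "cash", "u"],
]

# Vocabulary: sorted unique tokens, one column per token
vocab = sorted({tok for doc in docs for tok in doc})
index = {tok: j for j, tok in enumerate(vocab)}

# One row per document, one count per vocabulary token
matrix = []
for doc in docs:
    row = [0] * len(vocab)
    for tok, n in Counter(doc).items():
        row[index[tok]] = n
    matrix.append(row)

print(vocab)
for row in matrix:
    print(row)
print("shape:", (len(matrix), len(vocab)))
```

Most cells are zero, which is why the real BOW is stored as a sparse matrix.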


Then build a new dataframe, as usual, to get a visual idea of the words used and their frequencies.

In [57]:
# TODO: Make a new dataframe with the BOW
# get_feature_names() was removed in recent scikit-learn versions; use get_feature_names_out()
bow_df = pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out())

bow_df['Class'] = spam_df['Class']

bow_df.head()



Unnamed: 0,'','an,'anything,'comfort,'d,'doctors,'heart,'help,'hex,'hw,...,ï¿½ï¿½_thanks,ï¿½ï¿½harry,ï¿½ï¿½it,ï¿½ï¿½morrow,ï¿½ï¿½rents,ï¿½ï¿½ï¿½,ï¿½ï¿½ï¿½_,ï¿½ï¿½ï¿½harry,ï¿½û¬ud,Class
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ham
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ham
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,spam
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ham
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ham


Let's check which word is the most used in the spam category and in the non-spam category.

There are two steps: first, add the class to the BOW dataframe; second, filter on a class, sum all the values and print the most frequent one.

In [58]:
# TODO: print the most used word in the spam and non spam category
# Adding the Class column to the BOW dataframe
bow_df['Class'] = spam_df['Class']

spam_token_counts = bow_df[bow_df['Class'] == 'spam'].iloc[:, :-1].sum()
ham_token_counts = bow_df[bow_df['Class'] == 'ham'].iloc[:, :-1].sum()

most_frequent_spam_token = spam_token_counts.idxmax()
most_frequent_ham_token = ham_token_counts.idxmax()

print(f"Most frequent spam word: {most_frequent_spam_token}")
print(f"Most frequent ham word: {most_frequent_ham_token}")

Most frequent spam word: call
Most frequent ham word: ...


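You should find that the most frequent spam word is 'free' or, as in the run above, 'call', depending on your preprocessing. Not so surprising, right? The top ham token, '...', is leftover punctuation: the punctuation filter in the preprocessing step does not catch multi-character tokens.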
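
Looking only at the single most frequent token hides a lot, so it can help to print the top few tokens per class. Here is a small sketch using `collections.Counter` on made-up `(label, tokens)` pairs that stand in for the rows of `spam_df`; the data is hypothetical.

```python
from collections import Counter

# Hypothetical (label, tokens) pairs standing in for spam_df rows
rows = [
    ("spam", ["free", "call", "now", "call"]),
    ("spam", ["call", "free", "prize"]),
    ("ham", ["ok", "call", "u", "later"]),
    ("ham", ["u", "ok", "u"]),
]

# Accumulate token counts separately for each class
counts = {"spam": Counter(), "ham": Counter()}
for label, tokens in rows:
    counts[label].update(tokens)

for label, counter in counts.items():
    print(label, counter.most_common(2))
```

On the real dataframe, `spam_token_counts.sort_values(ascending=False).head(10)` gives the same kind of view.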

Now we can make a classifier based on our BOW. We will use a simple logistic regression here for the example.

You're an expert, you know what to do, right? Split the data, train your model, predict and see the performance.

In [59]:
# TODO: Perform a classification to predict whether a message is a spam or not
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(bow, spam_df['Class'], test_size=0.2, random_state=42)

# Train a logistic regression classifier on the training data
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict the class labels on the testing data
y_pred = clf.predict(X_test)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

0.9766816143497757


What precision do you get? Check some predictions by hand, in particular messages the model misclassified, to understand what could go wrong...

Try to use other models and try to improve your results.

In [60]:
from sklearn.metrics import precision_score

precision = precision_score(y_test, y_pred, pos_label='spam')
print(precision)

correct_samples = X_test[y_test == y_pred][:5]
for i, sample in enumerate(correct_samples):
    print(f"Correctly predicted sample {i+1}:")
    print(vectorizer.inverse_transform(sample))

0.9920634920634921
Correctly predicted sample 1:
[array(['also', 'application', 'applying', 'contact', 'cost', 'expensive',
       'joke', 'le', 'ogunrinde', 'one', 'research', 'school', 'score',
       'secondary', 'sent', 'sophas', 'think', 'thinking'], dtype='<U51')]
Correctly predicted sample 2:
[array(["'ll", 'getting', 'know', 'let', 'made', 'morning', 'ok',
       'promise', 'soon', 'text'], dtype='<U51')]
Correctly predicted sample 3:
[array(['2', '87066', 'awarded', 'cd', 'congratulation', 'draw', 'either',
       'entry', 'free', 'gift', 'music', 'tncs', 'txt', 'ur', 'voucher',
       'weekly', 'www.ldew.com1win150ppmx3age16', 'ï¿½100', 'ï¿½500'],
      dtype='<U51')]
Correctly predicted sample 4:
[array(["'ll", 'carlos', 'hang', 'know', 'let', 'text'], dtype='<U51')]
Correctly predicted sample 5:
[array(["did't", 'k', 'k.i', 'see'], dtype='<U51')]
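
The samples above are all correct predictions. To see where the model fails, it helps to count the errors by type and list the misclassified indices. The sketch below does this by hand on made-up labels; `y_true` and `y_hat` are hypothetical stand-ins for `y_test` and `y_pred`.

```python
# Hypothetical true and predicted labels for eight messages
y_true = ["ham", "spam", "ham", "spam", "ham", "ham", "spam", "ham"]
y_hat = ["ham", "spam", "ham", "ham", "ham", "spam", "spam", "ham"]

pairs = list(zip(y_true, y_hat))

# Count the outcomes with 'spam' as the positive class
tp = sum(t == "spam" and p == "spam" for t, p in pairs)  # spam caught
fp = sum(t == "ham" and p == "spam" for t, p in pairs)   # ham flagged as spam
fn = sum(t == "spam" and p == "ham" for t, p in pairs)   # spam missed

precision = tp / (tp + fp)  # share of flagged messages that are really spam
recall = tp / (tp + fn)     # share of real spam that was flagged

# Positions of the misclassified messages, to inspect them by hand
wrong = [i for i, (t, p) in enumerate(pairs) if t != p]

print(f"precision={precision:.3f} recall={recall:.3f}")
print("misclassified indices:", wrong)
```

A high precision with a lower recall would mean the model rarely flags legitimate messages but lets some spam through.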


In [61]:

import xgboost as xgb

# XGBoost expects numeric class labels, so map ham/spam to 0/1
y_train = y_train.map({'ham': 0, 'spam': 1})
y_test = y_test.map({'ham': 0, 'spam': 1})


clf = xgb.XGBClassifier()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(accuracy)


0.9730941704035875
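
Another model worth trying on word counts is multinomial Naive Bayes. The sketch below wraps `CountVectorizer` and `MultinomialNB` in a scikit-learn pipeline, so the vocabulary is learned from the training messages only. The toy messages are made up, and this uses the default `CountVectorizer` tokenization rather than the NLTK preprocessing from earlier in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set
train_texts = [
    "free prize call now",
    "win free cash now",
    "are you coming home",
    "see you at lunch",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The vectorizer is fitted inside the pipeline, on training data only
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

preds = model.predict(["free cash prize", "see you home"])
print(list(preds))
```

On the real data, the same pipeline can be fitted on the raw `Message` column after a `train_test_split`.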


In [62]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(spam_df['Message'], spam_df['Class'], test_size=0.2, random_state=42)

# Preprocess the text data
vocab_size = 10000
max_length = 100
embedding_dim = 16
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_train = pad_sequences(X_train, maxlen=max_length, truncating='post', padding='post')
X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=max_length, truncating='post', padding='post')

# Map the string labels to binary numeric labels
y_train = y_train.map({'ham': 0, 'spam': 1})
y_test = y_test.map({'ham': 0, 'spam': 1})

# Train an RNN on the preprocessed data
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=5)

# Evaluate the accuracy of the RNN on the testing data
# model.predict returns spam probabilities; threshold them at 0.5
y_pred = model.predict(X_test)
y_pred = [1 if pred >= 0.5 else 0 for pred in y_pred]
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
0.979372197309417
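
The RNN cell turns each message into a fixed-length list of integers before training. The following is a simplified pure-Python illustration of that idea: words are mapped to indices, unknown words go to the `<OOV>` index, and the result is padded or truncated at the end. The `word_index` table and the helper names are hypothetical, and this is not the Keras implementation.

```python
# Hypothetical word index; index 0 is reserved for padding
word_index = {"<OOV>": 1, "free": 2, "call": 3, "now": 4}


def to_sequence(text):
    """Map each whitespace-separated word to its index, or to the OOV index."""
    oov = word_index["<OOV>"]
    return [word_index.get(word, oov) for word in text.lower().split()]


def pad_post(seq, maxlen, value=0):
    """Truncate at the end, then pad at the end, to exactly maxlen items."""
    truncated = seq[:maxlen]
    return truncated + [value] * (maxlen - len(truncated))


seq = to_sequence("Call now for FREE entry")
print(seq)
print(pad_post(seq, 8))
print(pad_post(seq, 3))
```

This mirrors the `padding='post'` and `truncating='post'` options used above: every message ends up with the same length, so the messages can be stacked into one array.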
