# 04-Spam-Classifier

It's time to make our first real Machine Learning application of NLP: a spam classifier!

A spam classifier is a Machine Learning model that classifies texts (emails or SMS messages) into two categories: spam (1) or legitimate (0).

To do that, we will reuse what we already know: we will apply preprocessing and a BOW (Bag Of Words) to a dataset of texts.
Then we will use a classifier to predict which class a new email/SMS belongs to, based on its BOW.

First things first: import the needed libraries.

In [1]:
# Import NLTK and all the needed libraries
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Now load the dataset in *spam.csv* using pandas. Use the 'latin-1' encoding as a loading option.

In [2]:
# Load the dataset with the 'latin-1' encoding
df = pd.read_csv('spam.csv', encoding='latin-1')

# Print the first rows
print(df.head())

   Class  Message
0  ham    Go until jurong point, crazy.. Available only ...
1  ham    Ok lar... Joking wif u oni...
2  spam   Free entry in 2 a wkly comp to win FA Cup fina...
3  ham    U dun say so early hor... U c already then say...
4  ham    Nah I don't think he goes to usf, he lives aro...


As usual, I suggest you explore this dataset a bit.

In [3]:
# Print some basic information about the dataset loaded above
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Class    5572 non-null   object
 1   Message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


As you can see, we have one column containing the labels and one column containing the text to classify.
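
Two quick checks are often useful at this stage: how balanced the classes are, and how the `ham`/`spam` strings map to the 0/1 convention from the introduction. The sketch below runs on a small made-up DataFrame that only borrows the column names `Class` and `Message`; the `label` column is an assumption added for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset, with the same column names as spam.csv
toy = pd.DataFrame({
    "Class": ["ham", "spam", "ham", "ham"],
    "Message": ["Ok lar", "Free entry to win", "See you later", "Call me when you can"],
})

# Number of messages per class, and the share of each class
class_counts = toy["Class"].value_counts()
class_share = toy["Class"].value_counts(normalize=True)

# Encode the labels as in the introduction: spam -> 1, legitimate (ham) -> 0
toy["label"] = toy["Class"].map({"ham": 0, "spam": 1})

print(class_counts)
print(class_share)
print(toy[["Class", "label"]])
```

On the real data, the same `value_counts` call shows how imbalanced the two classes are, which matters when reading the accuracy later on.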

We will begin by doing the usual preprocessing: tokenization, punctuation removal and lemmatization.

In [2]:
import pandas as pd
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string

# download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')

# load data
data = pd.read_csv('spam.csv')

# define function to perform preprocessing
def preprocess_text(text):
    # tokenize the text
    tokens = word_tokenize(text.lower())

    # remove punctuation and digits
    tokens = [word for word in tokens if word.isalpha()]

    # lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens

# apply the preprocessing function to the data
data['tokens'] = data['Message'].apply(preprocess_text)




OK, now we have our preprocessed data. The next step is to build a BOW.

In [3]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# download required nltk packages (punkt is needed by word_tokenize)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# read the data into a pandas dataframe
data = pd.read_csv('spam.csv')

# build the stop word set and the lemmatizer once, outside the function
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# define the preprocessing function
def preprocess_text(text):
    # convert to lowercase
    text = text.lower()
    # remove special characters and punctuation (raw string avoids an invalid escape warning)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # tokenize
    tokens = nltk.word_tokenize(text)
    # remove stop words
    tokens = [token for token in tokens if token not in stop_words]
    # lemmatize
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    # join tokens back into a string
    return ' '.join(tokens)

# apply the preprocessing function to the data
data['tokens'] = data['Message'].apply(preprocess_text)

# create the bag-of-words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['tokens'])
y = data['Class']

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train the Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# make predictions on the test set
y_pred = clf.predict(X_test)

# evaluate the performance of the classifier
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion matrix:', confusion_matrix(y_test, y_pred))
print('Classification report:', classification_report(y_test, y_pred))




Accuracy: 0.9739910313901345
Confusion matrix: [[948  17]
 [ 12 138]]
Classification report:               precision    recall  f1-score   support

         ham       0.99      0.98      0.98       965
        spam       0.89      0.92      0.90       150

    accuracy                           0.97      1115
   macro avg       0.94      0.95      0.94      1115
weighted avg       0.97      0.97      0.97      1115
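
As a sanity check, the figures in the report can be recomputed by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's convention (rows are true classes, columns are predicted classes, in the order ham, spam), which is consistent with the support column of the report.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions,
# in the order (ham, spam)
cm = np.array([[948, 17],
               [12, 138]])

tn, fp = cm[0]  # true ham predicted ham / true ham predicted spam
fn, tp = cm[1]  # true spam predicted ham / true spam predicted spam

accuracy = (tp + tn) / cm.sum()
spam_precision = tp / (tp + fp)  # among messages flagged as spam, how many really are spam
spam_recall = tp / (tp + fn)     # among real spam messages, how many were caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy={accuracy:.4f} precision={spam_precision:.2f} "
      f"recall={spam_recall:.2f} f1={spam_f1:.2f}")
```

The 17 legitimate messages flagged as spam are what pull the spam precision down to 0.89, even though the overall accuracy is 0.97.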



Then make a new dataframe as usual to get a visual idea of the words used and their frequencies.

In [4]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# download NLTK resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# load the dataset
data = pd.read_csv('spam.csv', encoding='latin-1')
data = data[['Class', 'Message']]

# define the preprocessing function
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # tokenize the text
    tokens = nltk.word_tokenize(text.lower())
    # remove punctuation and stop words
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    # lemmatize the tokens
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    # return the tokens as a string separated by spaces
    return ' '.join(tokens)

# apply the preprocessing function to the data
data['tokens'] = data['Message'].apply(preprocess_text)

# compute the bag of words
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(data['tokens'])

# make a new dataframe with the BOW
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
bow_df = pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out())
bow_df['Class'] = data['Class']

print(bow_df.head())

Let's check which word is used most in the spam category and in the non-spam category.

There are two steps. First, add the class to the BOW dataframe. Second, filter on a class, sum all the values and print the most frequent word.

In [6]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# download necessary NLTK data
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# load the data
data = pd.read_csv("spam.csv", encoding="latin-1")
data = data[["Class", "Message"]] # selecting the columns present in spam.csv
data.columns = ["label", "text"]

# create the stop word set and the lemmatizer once
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# define the preprocessing function
def preprocess_text(text):
    text = text.lower() # convert to lowercase
    text = re.sub(r'\d+', '', text) # remove numbers
    text = re.sub(r'[^\w\s]', '', text) # remove punctuation
    text = word_tokenize(text) # tokenize text
    text = [w for w in text if w not in stop_words] # remove stop words
    text = [lemmatizer.lemmatize(word) for word in text] # apply lemmatization
    return text

# apply the preprocessing function to the data
data['tokens'] = data['text'].apply(preprocess_text)

# create the Bag of Words: for each token, [count in ham, count in spam]
bow = {}
for index, row in data.iterrows():
    for token in row['tokens']:
        if token not in bow:
            bow[token] = [0, 0]
        bow[token][0 if row['label'] == 'ham' else 1] += 1

# create a new dataframe with the Bag of Words
df_bow = pd.DataFrame.from_dict(bow, orient='index', columns=['ham', 'spam'])

# print the most used word in the spam and non-spam category
most_used_ham = df_bow['ham'].idxmax()
most_used_spam = df_bow['spam'].idxmax()
print(f"The most used word in ham is '{most_used_ham}', with a frequency of {df_bow.loc[most_used_ham, 'ham']}.")
print(f"The most used word in spam is '{most_used_spam}', with a frequency of {df_bow.loc[most_used_spam, 'spam']}.")

You should find that the most frequent spam word is 'free', not so surprising, right?

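Now we can build a classifier based on our BOW. The exercise suggests a simple logistic regression; the cell below uses a Multinomial Naive Bayes classifier instead, which is a common baseline for word counts.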

You're an expert, you know what to do, right? Split the data, train your model, predict and see the performance.

In [7]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# load the data
data = pd.read_csv("spam.csv", encoding="latin-1")
data = data[["Class", "Message"]]  # columns present in spam.csv
data.columns = ["label", "text"]

# define the preprocessing function
lemmatizer = nltk.stem.WordNetLemmatizer()
def preprocess_text(text):
    # tokenize the text
    tokens = nltk.word_tokenize(text)
    # remove punctuation and convert to lowercase
    tokens = [word.lower() for word in tokens if word.isalpha()]
    # lemmatize the tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # return the tokens as a string
    return " ".join(tokens)

# apply the preprocessing function to the data
data['text'] = data['text'].apply(preprocess_text)

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=0.2, random_state=42)

# define the vectorizer to convert text to BOW
vectorizer = CountVectorizer()

# vectorize the training data (the vocabulary is learned on the training set only)
X_train_bow = vectorizer.fit_transform(X_train)

# train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_bow, y_train)

# vectorize the testing data with the vocabulary learned on the training set
X_test_bow = vectorizer.transform(X_test)

# predict the labels of the testing data
y_pred = clf.predict(X_test_bow)

# compute the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

What accuracy do you get? Check by hand some samples that were not predicted well to see what could go wrong...
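
One possible way to do this inspection is to put the texts, the true labels and the predictions side by side and keep only the rows where they disagree. The DataFrame below is entirely hypothetical (made-up texts and predictions); with the real data you would build it from your own test texts, test labels and model predictions.

```python
import pandas as pd

# Hypothetical test messages, true labels and predictions, for illustration only
results = pd.DataFrame({
    "text": ["free prize call now", "see you at lunch", "call me now", "win a ticket"],
    "true": ["spam", "ham", "ham", "spam"],
    "predicted": ["spam", "ham", "spam", "ham"],
})

# Keep only the rows where the prediction differs from the true label
errors = results[results["true"] != results["predicted"]]

# Cross-tabulate true vs predicted labels to see which kind of error dominates
error_table = pd.crosstab(results["true"], results["predicted"])

print(errors)
print(error_table)
```

Reading the misclassified texts often reveals patterns, for example short legitimate messages that share words with typical spam.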

Try to use other models and try to improve your results.
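
As one example of another model, here is a minimal sketch of a logistic regression on top of a BOW, wrapped in a scikit-learn pipeline. The training messages are made up, and `max_iter=1000` is an arbitrary choice for illustration; on the real data you would pass the preprocessed training texts and labels instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training messages for illustration; replace with the preprocessed spam.csv texts
train_texts = [
    "free prize call now", "win free ticket today", "free entry claim prize",
    "see you at lunch", "call me after work", "are you coming home",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline fits the vectorizer on the training texts only, then fits the classifier
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

new_texts = ["claim your free prize", "lunch at home"]
predictions = model.predict(new_texts)
probabilities = model.predict_proba(new_texts)  # columns follow model.classes_

print(list(model.classes_))
print(predictions)
print(probabilities.round(2))
```

Using a pipeline keeps the vocabulary tied to the training data, so the test set never influences the features the model is trained on.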