In [44]:
#! /usr/bin/python3

# INTRODUCTION

Welcome to this Jupyter Notebook, which explores several Natural Language Processing (NLP) techniques through hands-on examples. The goal is to understand how each technique works and where it is useful in practice.

# 1. Classify the spam using Naive Bayes

This section tackles spam classification with the Naive Bayes algorithm, a classic problem in NLP and machine learning. The goal is a model that distinguishes spam from non-spam (ham) messages based on their text. We will preprocess the data, train a Naive Bayes classifier, and evaluate its performance.

In [45]:
# import the libraries

import requests
import pandas as pd
import zipfile

In [46]:
# get the data

url = 'https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip'
data_file = 'SMSSpamCollection'

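response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
filename = url.split("/")[-1]

with open(filename, 'wb') as file:
    file.write(response.content)

# extract into the current directory (the name avoids shadowing the built-in zip)
with zipfile.ZipFile(filename, 'r') as archive:
    archive.extractall()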

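# NOTE: the file has no header row, so header = 0 consumes the first message
# as a header; pass header = None instead to keep every row.
data = pd.read_table(data_file,
                    header = 0,
                    names = ['type', 'message'])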


In [47]:
# see a sample of our data

data.sample(3)

|      | type | message |
|------|------|---------|
| 2577 | ham | Hey whats up? U sleeping all morning? |
| 3714 | ham | Oh, i will get paid. The most outstanding one ... |
| 157 | ham | Hello, my love. What are you doing? Did you ge... |


In [48]:
# description of our data

data.describe()

|        | type | message |
|--------|------|---------|
| count | 5571 | 5571 |
| unique | 2 | 5168 |
| top | ham | Sorry, I'll call later |
| freq | 4824 | 30 |


As we can see, the dataset has 5571 rows and two columns: "message" holds the text of each message, and "type" holds its label, either spam or ham (not spam).
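The same output also shows that the classes are imbalanced: `ham` is the most frequent label, with 4824 of the 5571 rows. As a quick check, the proportions can be worked out directly from those two numbers (the variable names below are only illustrative):

```python
# Counts taken from the describe() output above
total_rows = 5571
ham_rows = 4824
spam_rows = total_rows - ham_rows  # there are only two unique labels

ham_share = ham_rows / total_rows
spam_share = spam_rows / total_rows

print(f"ham:  {ham_rows} rows ({ham_share:.1%})")
print(f"spam: {spam_rows} rows ({spam_share:.1%})")
```

A classifier that always answered ham would already be right on roughly 87% of the messages, which is worth keeping in mind when choosing an evaluation metric.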

Next, we transform the messages into a form the classifier can work with: we tokenize them, remove stop words, and stem each token.

In [49]:
# tokenize the messages, remove stop words, and stem the tokens

import nltk

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')

# a set gives fast membership tests
stop = set(stopwords.words('english'))

data['tokens'] = data['message'].apply(word_tokenize)

# note: this comparison is case-sensitive, so capitalized stop words such as "You" are kept
data['tokens'] = data['tokens'].apply(lambda x: [word for word in x if word not in stop])

stemmer = PorterStemmer()

data['tokens'] = data['tokens'].apply(lambda x: [stemmer.stem(item) for item in x])



In [50]:
# see our data again

data.sample(5)

|      | type | message | tokens |
|------|------|---------|--------|
| 320 | ham | Merry Christmas to you too babe, i love ya *ki... | [merri, christma, babe, ,, love, ya, *, kiss, *] |
| 3841 | ham | Howz pain.it will come down today.do as i said... | [howz, pain.it, come, today.do, said, ystrday.... |
| 3532 | ham | Actually, my mobile is full of msg. And i m do... | [actual, ,, mobil, full, msg, ., and, work, on... |
| 2294 | spam | You have 1 new message. Please call 08718738034. | [you, 1, new, messag, ., pleas, call, 08718738... |
| 646 | ham | Do you mind if I ask what happened? You dont h... | [do, mind, i, ask, happen, ?, you, dont, say, ... |
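Some stop words, such as `you`, `and` and `do`, are still present in the `tokens` column. A likely explanation is that the stop word filter runs on the original-case tokens (`You`, `And`, `Do`), which do not match NLTK's lowercase list, and the stemmer lowercases them afterwards. The sketch below reproduces the effect with a tiny hand-written stop list (`STOP_WORDS` is a stand-in for illustration, not NLTK's list) and shows a case-insensitive variant:

```python
# Tiny stand-in stop list for illustration; NLTK's English list is longer
# but is likewise all lowercase.
STOP_WORDS = {"you", "have", "and", "do", "i"}

tokens = ["You", "have", "1", "new", "message", "."]

# Case-sensitive filter, as in the cell above: "You" is not in the list
case_sensitive = [t for t in tokens if t not in STOP_WORDS]

# Case-insensitive filter: compare the lowercased token instead
case_insensitive = [t for t in tokens if t.lower() not in STOP_WORDS]

print(case_sensitive)    # ['You', '1', 'new', 'message', '.']
print(case_insensitive)  # ['1', 'new', 'message', '.']
```

Lowercasing the message before filtering would have the same effect; whether it changes the classifier's results on this dataset has not been measured here.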


Once we have the tokens, we need to turn the text into a numeric matrix. A common representation is the document-term matrix (or feature matrix), where each row corresponds to a document (here, a message) and each column to a unique term. In this notebook we use raw term counts, a bag-of-words model built with CountVectorizer; TF-IDF (Term Frequency-Inverse Document Frequency) weighting is a common alternative.
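To make the matrix representation concrete, here is a minimal pure-Python sketch of a count-based document-term matrix. The two toy documents and the helper name `build_count_matrix` are made up for illustration; scikit-learn's `CountVectorizer` produces the same kind of matrix, but as a sparse array and with its own tokenization rules.

```python
from collections import Counter

def build_count_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    # A sorted vocabulary gives a stable column order
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ["free prize call now", "call me now now"]
vocabulary, matrix = build_count_matrix(docs)

print(vocabulary)  # ['call', 'free', 'me', 'now', 'prize']
for row in matrix:
    print(row)
# [1, 1, 0, 1, 1]
# [1, 0, 1, 2, 0]
```

Each row is one message and each column one term, so the second row records that "now" appears twice in the second message.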

In [51]:
# import the functions

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# re-join the tokens into one string per message, since CountVectorizer expects text

data['tokens'] = data['tokens'].apply(lambda x: ' '.join(x))

# split the data

x_train, x_test, y_train, y_test = train_test_split(
    data['tokens'],
    data['type'],
    test_size = 0.2
)

# create the vectorizer

vectorizer = CountVectorizer(
    strip_accents = 'ascii',
    lowercase = True
)

# fit the vectorizer on the training data only, then transform both sets

vectorizer_fit = vectorizer.fit(x_train)
x_train_transformed = vectorizer_fit.transform(x_train)
x_test_transformed = vectorizer_fit.transform(x_test)


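With the matrix prepared, we can apply Naive Bayes for classification. Naive Bayes is a probabilistic classifier: it applies Bayes' theorem, together with the "naive" assumption that features are conditionally independent given the class, to estimate the probability that a message belongs to each class, and then predicts the most probable class. We use MultinomialNB, the variant suited to count features.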
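As a rough illustration of how such class probabilities are calculated, the sketch below implements a multinomial Naive Bayes by hand: it estimates class priors and per-class word probabilities with add-one (Laplace) smoothing, then picks the class with the highest log posterior. The toy messages and the function names `train_nb` and `predict_nb` are invented for this example; `MultinomialNB` follows the same idea (its default `alpha=1.0` is also add-one smoothing) but works on the sparse count matrix.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents, labels):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    vocabulary = {token for doc in documents for token in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Add-one smoothing keeps unseen words from zeroing out a class
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocabulary)))
            for w in vocabulary
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training data, invented for this example
train_docs = [
    ["win", "free", "prize"],
    ["free", "call", "now"],
    ["see", "you", "later"],
    ["call", "you", "later"],
]
train_labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_nb(train_docs, train_labels)
print(predict_nb(["free", "prize", "now"], log_prior, log_likelihood))  # spam
print(predict_nb(["see", "you"], log_prior, log_likelihood))            # ham
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on longer messages.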

In [57]:
# import the functions

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

# train the model

naive_bayes = MultinomialNB()
naive_bayes_fit = naive_bayes.fit(x_train_transformed, y_train)

# make the predictions

train_predict = naive_bayes_fit.predict(x_train_transformed)
test_predict = naive_bayes_fit.predict(x_test_transformed)

# get the results

print(f"The training set has a balanced accuracy of {balanced_accuracy_score(y_train, train_predict)}, with the confusion matrix:\n",
        f"{confusion_matrix(y_train, train_predict)}")

print(f"\nThe test set has a balanced accuracy of {balanced_accuracy_score(y_test, test_predict)}, with the confusion matrix:\n",
        f"{confusion_matrix(y_test, test_predict)}")


The training set has a balanced accuracy of 0.9886756136847481, with the confusion matrix:
 [[3865    8]
 [  12  571]]

The test set has a balanced accuracy of 0.9582628042368753, with the confusion matrix:
 [[947   4]
 [ 13 151]]
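
The reported score is a balanced accuracy, that is, the mean of the per-class recalls, so it can be recomputed by hand from the confusion matrices above. The sketch assumes scikit-learn's convention that rows are true labels and columns are predictions, with labels in sorted order (`ham`, then `spam`); the helper name `balanced_accuracy` is only for illustration.

```python
def balanced_accuracy(matrix):
    """Mean per-class recall; rows are true classes, columns are predictions."""
    recalls = [row[i] / sum(row) for i, row in enumerate(matrix)]
    return sum(recalls) / len(recalls)

# Confusion matrices copied from the output above
train_matrix = [[3865, 8], [12, 571]]
test_matrix = [[947, 4], [13, 151]]

print(f"train: {balanced_accuracy(train_matrix):.4f}")  # 0.9887
print(f"test:  {balanced_accuracy(test_matrix):.4f}")   # 0.9583

# Spam recall on the test set: 151 of 164 spam messages were caught
print(f"test spam recall: {151 / 164:.4f}")  # 0.9207
```

On the test set, 13 of 164 spam messages are classified as ham while only 4 of 951 ham messages are flagged as spam, so most of the remaining error is missed spam.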
