# Ham or Spam?

🎯 The goal of this challenge is to classify emails as spam (1) or normal emails (0)

🧹 First, you will apply cleaning techniques to this text data

👩🏻‍🔬 Then, you will convert the cleaned texts into a numerical representation

✉️ Finally, you will apply the ***Multinomial Naive Bayes*** model to classify each email as either spam or a regular email.

## (0) The NLTK library (Natural Language Toolkit)

In [1]:
!pip install nltk



In [2]:
# When using nltk for the first time, we also need to download a few of its data packages (corpora and tokenizer models)

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/msaffan1/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /home/msaffan1/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /home/msaffan1/nltk_data...
[nltk_data] Downloading package omw-1.4 to /home/msaffan1/nltk_data...


True

In [3]:
import pandas as pd

df = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/ham_spam_emails.csv")
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


## (1) Cleaning the (text) dataset

The dataset is made up of emails that are classified as ham (0) or spam (1). You need to clean the dataset before training a prediction model.

### (1.1) Remove Punctuation

❓ Create a function to remove the punctuation. Apply it to the `text` column and add the output to a new column in the dataframe called `clean_text` ❓

In [10]:
import re

def remove_punctuation(text):
    # Drop every character that is neither a word character (\w) nor whitespace (\s)
    clean_text = re.sub(r'[^\w\s]', '', text)
    return clean_text

# Apply the remove_punctuation function to the 'text' column
df['clean_text'] = df['text'].apply(remove_punctuation)

df

Unnamed: 0,text,spam,clean_text
0,Subject naturally irresistible your corporate ...,1,Subject naturally irresistible your corporate ...
1,Subject the stock trading gunslinger fanny is...,1,Subject the stock trading gunslinger fanny is...
2,Subject unbelievable new homes made easy im w...,1,Subject unbelievable new homes made easy im w...
3,Subject 4 color printing special request addi...,1,Subject 4 color printing special request addi...
4,Subject do not have money get software cds fr...,1,Subject do not have money get software cds fr...
...,...,...,...
5723,Subject re research and development charges t...,0,Subject re research and development charges t...
5724,Subject re receipts from visit jim thanks ...,0,Subject re receipts from visit jim thanks ...
5725,Subject re enron case study update wow all ...,0,Subject re enron case study update wow all ...
5726,Subject re interest david please call shi...,0,Subject re interest david please call shi...
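
One detail of the pattern `[^\w\s]` is worth checking on a small example: `\w` also matches the underscore, so underscores are not removed. The sketch below uses a made-up sample string (not taken from the dataset) and a stricter variant, `[^\w\s]|_`, which is only one possible way to handle underscores.

```python
import re

# Hypothetical sample string, written only for this illustration
sample = "Subject: do not have money , get software_cds from here !"

# \w matches letters, digits AND the underscore, so "_" survives this pattern
basic = re.sub(r"[^\w\s]", "", sample)

# Adding "|_" to the pattern strips underscores as well
strict = re.sub(r"[^\w\s]|_", "", sample)

print(basic)
print(strict)
```

Whether underscores matter depends on the data; for most emails the simpler pattern is probably enough.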


### (1.2) Lower Case

❓ Create a function to lowercase the text. Apply it to `clean_text` ❓

In [11]:
def lowercase_text(text):
    return text.lower()
df['clean_text'] = df['clean_text'].apply(lowercase_text)
df

Unnamed: 0,text,spam,clean_text
0,Subject naturally irresistible your corporate ...,1,subject naturally irresistible your corporate ...
1,Subject the stock trading gunslinger fanny is...,1,subject the stock trading gunslinger fanny is...
2,Subject unbelievable new homes made easy im w...,1,subject unbelievable new homes made easy im w...
3,Subject 4 color printing special request addi...,1,subject 4 color printing special request addi...
4,Subject do not have money get software cds fr...,1,subject do not have money get software cds fr...
...,...,...,...
5723,Subject re research and development charges t...,0,subject re research and development charges t...
5724,Subject re receipts from visit jim thanks ...,0,subject re receipts from visit jim thanks ...
5725,Subject re enron case study update wow all ...,0,subject re enron case study update wow all ...
5726,Subject re interest david please call shi...,0,subject re interest david please call shi...


### (1.3) Remove Numbers

❓ Create a function to remove numbers from the text. Apply it to `clean_text` ❓

In [15]:
def remove_numbers(text):
    clean_text = re.sub(r'\d+', '', text)
    return clean_text
df['clean_text'] = df['clean_text'].apply(remove_numbers)
df

Unnamed: 0,text,spam,clean_text
0,Subject naturally irresistible your corporate ...,1,subject naturally irresistible your corporate ...
1,Subject the stock trading gunslinger fanny is...,1,subject the stock trading gunslinger fanny is...
2,Subject unbelievable new homes made easy im w...,1,subject unbelievable new homes made easy im w...
3,Subject 4 color printing special request addi...,1,subject color printing special request addit...
4,Subject do not have money get software cds fr...,1,subject do not have money get software cds fr...
...,...,...,...
5723,Subject re research and development charges t...,0,subject re research and development charges t...
5724,Subject re receipts from visit jim thanks ...,0,subject re receipts from visit jim thanks ...
5725,Subject re enron case study update wow all ...,0,subject re enron case study update wow all ...
5726,Subject re interest david please call shi...,0,subject re interest david please call shi...


### (1.4) Remove StopWords

❓ Create a function to remove stopwords from the text. Apply it to `clean_text`. ❓

In [18]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Build the stopword set once: set lookups are fast, and the list is not rebuilt for every word
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Tokenize the text
    words = text.split()
    # Remove stopwords
    filtered_words = [word for word in words if word.lower() not in stop_words]
    # Recreate the text without stopwords
    clean_text = ' '.join(filtered_words)
    return clean_text

df['clean_text'] = df['clean_text'].apply(remove_stopwords)
df

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/msaffan1/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,text,spam,clean_text
0,Subject naturally irresistible your corporate ...,1,subject naturally irresistible corporate ident...
1,Subject the stock trading gunslinger fanny is...,1,subject stock trading gunslinger fanny merrill...
2,Subject unbelievable new homes made easy im w...,1,subject unbelievable new homes made easy im wa...
3,Subject 4 color printing special request addi...,1,subject color printing special request additio...
4,Subject do not have money get software cds fr...,1,subject money get software cds software compat...
...,...,...,...
5723,Subject re research and development charges t...,0,subject research development charges gpg forwa...
5724,Subject re receipts from visit jim thanks ...,0,subject receipts visit jim thanks invitation v...
5725,Subject re enron case study update wow all ...,0,subject enron case study update wow day super ...
5726,Subject re interest david please call shi...,0,subject interest david please call shirley cre...


### (1.5) Lemmatize

❓ Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`. ❓

In [25]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Create the lemmatizer once instead of once per email
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    words = text.split()
    # Every token is lemmatized as a verb (pos=wordnet.VERB), then joined back into a single string
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word, wordnet.VERB) for word in words])
    return lemmatized_text

df['clean_text'] = df['clean_text'].apply(lemmatize_text)
df

Unnamed: 0,text,spam,clean_text
0,Subject naturally irresistible your corporate ...,1,subject naturally irresistible corporate ident...
1,Subject the stock trading gunslinger fanny is...,1,subject stock trade gunslinger fanny merrill m...
2,Subject unbelievable new homes made easy im w...,1,subject unbelievable new home make easy im wan...
3,Subject 4 color printing special request addi...,1,subject color print special request additional...
4,Subject do not have money get software cds fr...,1,subject money get software cds software compat...
...,...,...,...
5723,Subject re research and development charges t...,0,subject research development charge gpg forwar...
5724,Subject re receipts from visit jim thanks ...,0,subject receipt visit jim thank invitation vis...
5725,Subject re enron case study update wow all ...,0,subject enron case study update wow day super ...
5726,Subject re interest david please call shi...,0,subject interest david please call shirley cre...
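
The cleaning steps of this section can also be chained into a single function, which makes the order of operations explicit. The following is a minimal, dependency-free sketch: `clean_email` and the tiny `STOPWORDS` set are hypothetical names introduced for illustration, the stopword set stands in for NLTK's English list, and lemmatization is left out to keep the example short.

```python
import re

# Hypothetical, tiny stopword set used only to keep this sketch dependency-free;
# the notebook itself relies on NLTK's English stopword list
STOPWORDS = {"the", "a", "is", "do", "not", "have", "your", "from"}

def clean_email(text):
    text = re.sub(r"[^\w\s]", "", text)  # 1. drop punctuation
    text = text.lower()                  # 2. lowercase
    text = re.sub(r"\d+", "", text)      # 3. drop digits
    # 4. drop stopwords; split() + join() also collapses repeated spaces
    words = [word for word in text.split() if word not in STOPWORDS]
    return " ".join(words)

print(clean_email("Subject: 4 color printing, special request!"))
# → subject color printing special request
```

Splitting on whitespace and joining again has a useful side effect: the double spaces left behind by removed digits and punctuation disappear.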


## (2) Bag-of-words Modelling

### (2.1) Converting the text data into numbers

❓ Vectorize the `clean_text` to a Bag-of-Words representation with a default CountVectorizer. Save as `X_bow`. ❓

In [35]:
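from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()

# fit_transform learns the vocabulary and returns a sparse document-term matrix.
# It is kept sparse here: MultinomialNB accepts sparse input, and a dense copy
# of a matrix this wide can use a lot of memory.
X_bow = count_vectorizer.fit_transform(df['clean_text'])

count_vectorizer.get_feature_names_out()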

array(['aa', 'aaa', 'aaaenerfax', ..., 'zzn', 'zzncacst', 'zzzz'],
      dtype=object)
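
To see what a Bag-of-Words matrix contains, it can help to build one by hand on two toy documents. The sketch below is an illustration in plain Python, not how scikit-learn is implemented: the documents are made up, and the token pattern is assumed to mirror `CountVectorizer`'s default, which keeps tokens of two or more word characters.

```python
import re
from collections import Counter

# Two made-up documents, standing in for cleaned emails
docs = ["free money free offer", "meeting about research money"]

# Assumption: mimic CountVectorizer's default tokenization
# (lowercase, tokens of two or more word characters)
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Vocabulary: the sorted set of all tokens, one column per word
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, each cell counting how often the word occurs
counts = [Counter(tokens) for tokens in tokenized]
matrix = [[count[word] for word in vocabulary] for count in counts]

print(vocabulary)  # → ['about', 'free', 'meeting', 'money', 'offer', 'research']
for row in matrix:
    print(row)
# → [0, 2, 0, 1, 1, 0]
# → [1, 0, 1, 1, 0, 1]
```

Most cells are zero even in this tiny example, which is why real Bag-of-Words matrices are usually stored in a sparse format.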

### (2.2) Multinomial Naive Bayes Modelling

❓ Cross-validate a MultinomialNB model with the bag-of-words data. Score the model's accuracy. ❓

In [40]:
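from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

y = df['spam']
nb_classifier = MultinomialNB()

# 5-fold cross-validation; accuracy is already the default score for classifiers,
# it is spelled out here for readability
accuracy_scores = cross_val_score(nb_classifier, X_bow, y, cv=5, scoring='accuracy')

mean_accuracy = accuracy_scores.mean()
mean_accuracy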

0.9888272098889626
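
For intuition about what the model computes, here is a minimal sketch of Multinomial Naive Bayes in plain Python. It is a simplified illustration, not scikit-learn's implementation: the training data, the `predict` function and its variable names are all hypothetical, and it assumes additive (Laplace) smoothing with `alpha=1.0`.

```python
import math
from collections import Counter

# Toy training data (hypothetical): token lists with labels, 1 = spam, 0 = ham
train = [
    (["free", "money", "offer"], 1),
    (["free", "software", "offer"], 1),
    (["research", "meeting", "update"], 0),
    (["meeting", "money", "report"], 0),
]

vocab = sorted({token for tokens, _ in train for token in tokens})
classes = sorted({label for _, label in train})

# Count documents and word occurrences per class
doc_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in classes}
for tokens, label in train:
    word_counts[label].update(tokens)

def predict(tokens, alpha=1.0):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for c in classes:
        total_words = sum(word_counts[c].values())
        # Log prior: share of training documents belonging to class c
        score = math.log(doc_counts[c] / len(train))
        for token in tokens:
            if token not in vocab:
                continue  # words never seen in training are ignored
            # Smoothed log likelihood of the word given the class
            score += math.log(
                (word_counts[c][token] + alpha) / (total_words + alpha * len(vocab))
            )
        scores[c] = score
    return max(scores, key=scores.get)

print(predict(["free", "offer"]))      # → 1
print(predict(["meeting", "report"]))  # → 0
```

Smoothing keeps a word that never appeared in one class from driving that class's probability to zero, and working with logarithms avoids numerical underflow on long emails.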

🏁 Congratulations!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!