# Ham or Spam?

🎯 The goal of this challenge is to classify emails as spam (1) or normal emails (0)

🧹 First, you will apply cleaning techniques to this textual data

👩🏻‍🔬 Then, you will convert the cleaned texts into a numerical representation

✉️ Finally, you will apply the ***Multinomial Naive Bayes*** model to classify each email as either spam or a regular email.

## (0) The NLTK library (Natural Language Toolkit)

In [1]:
# !pip install nltk

In [2]:
# When using nltk for the first time, we also need to download a few of its data packages (corpora and tokenizer models)

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/nikkord/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/nikkord/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/nikkord/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/nikkord/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [60]:
import pandas as pd

df = pd.read_csv("emails.csv")
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


## (1) Cleaning the (text) dataset

The dataset is made up of emails that are classified as ham (0) or spam (1). You need to clean the dataset before training a prediction model.

### (1.1) Remove Punctuation

❓ Create a function to remove the punctuation. Apply it to the `text` column and add the output to a new column in the dataframe called `clean_text` ❓

In [61]:
import string

def punct(col):
    # Remove every ASCII punctuation character, one at a time
    for punctuation in string.punctuation:
        col = col.replace(punctuation, '')
    return col


In [62]:
df['clean_text'] = df['text'].apply(punct)
df.head()

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,Subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,Subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,Subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,Subject 4 color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,Subject do not have money get software cds fr...
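As a side note, the same result can be obtained in a single pass with `str.translate`. The sketch below is only an alternative to the loop above, not the expected solution; the names `PUNCT_TABLE` and `remove_punctuation` are made up for this illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character
# (hypothetical helper names, used only for this illustration)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation(text):
    # One pass over the string instead of one .replace() per punctuation mark
    return text.translate(PUNCT_TABLE)

print(remove_punctuation("Subject: hello, world!"))
```

On a DataFrame it would be applied the same way, e.g. `df['text'].apply(remove_punctuation)`.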


### (1.2) Lower Case

❓ Create a function to lowercase the text. Apply it to `clean_text` ❓

In [63]:
# df_list = df['all_cols'].values.tolist()

In [64]:
def lowercase(col):
    # Lowercase the text
    col = col.lower()
    # Note: this also strips digits, which is nominally the job of step (1.3)
    col = ''.join(char for char in col if not char.isdigit())
    return col

df['clean_text'] = df['clean_text'].apply(lowercase)
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,subject color printing special request addit...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds fr...
...,...,...,...
5723,Subject: re : research and development charges...,0,subject re research and development charges t...
5724,"Subject: re : receipts from visit jim , than...",0,subject re receipts from visit jim thanks ...
5725,Subject: re : enron case study update wow ! a...,0,subject re enron case study update wow all ...
5726,"Subject: re : interest david , please , call...",0,subject re interest david please call shi...


### (1.3) Remove Numbers

❓ Create a function to remove numbers from the text. Apply it to `clean_text` ❓

In [65]:
def rem_num(col):
    # Keep every character that is not a digit
    cleaned_text = ''.join(char for char in col if not char.isdigit())
    # Return the cleaned string (returning `col` would leave the text unchanged)
    return cleaned_text

df['clean_text'] = df['clean_text'].apply(rem_num)
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,subject color printing special request addit...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds fr...
...,...,...,...
5723,Subject: re : research and development charges...,0,subject re research and development charges t...
5724,"Subject: re : receipts from visit jim , than...",0,subject re receipts from visit jim thanks ...
5725,Subject: re : enron case study update wow ! a...,0,subject re enron case study update wow all ...
5726,"Subject: re : interest david , please , call...",0,subject re interest david please call shi...
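A quick way to check that a digit-removal function really does its job is to run it on a small string whose expected result is known. The sketch below uses a regular expression instead of the character loop; `remove_digits` is a hypothetical name chosen for this example.

```python
import re

def remove_digits(text):
    # \d+ matches one or more consecutive digits; replace each run with nothing
    return re.sub(r'\d+', '', text)

sample = "subject 4 color printing 2nd request"
cleaned = remove_digits(sample)
print(cleaned)

# Sanity check: no digit should survive the cleaning step
assert not any(char.isdigit() for char in cleaned)
```

Note that removing a number leaves the surrounding spaces in place, so runs of double spaces can appear. The tokenization step that follows splits on whitespace, so this is harmless here.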


### (1.4) Remove StopWords

❓ Create a function to remove stopwords from the text. Apply it to `clean_text`. ❓

In [66]:
from nltk.tokenize import word_tokenize
# df['tokens'] = df['clean_text'].apply(word_tokenize)
# df['tokens']

In [67]:
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,subject color printing special request addit...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds fr...
...,...,...,...
5723,Subject: re : research and development charges...,0,subject re research and development charges t...
5724,"Subject: re : receipts from visit jim , than...",0,subject re receipts from visit jim thanks ...
5725,Subject: re : enron case study update wow ! a...,0,subject re enron case study update wow all ...
5726,"Subject: re : interest david , please , call...",0,subject re interest david please call shi...


In [68]:
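from nltk.corpus import stopwords

stop_words = set(stopwords.words('english')) # you can also choose other languages

def stop_w(col):
    # Split the text into tokens, then drop the English stopwords
    tokenized = word_tokenize(col)
    tmp = [w for w in tokenized if w not in stop_words]
    # Note: this returns a list of tokens, not a string
    return tmp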


In [69]:
df['clean_text'] = df['clean_text'].apply(stop_w)
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,"[subject, naturally, irresistible, corporate, ..."
1,Subject: the stock trading gunslinger fanny i...,1,"[subject, stock, trading, gunslinger, fanny, m..."
2,Subject: unbelievable new homes made easy im ...,1,"[subject, unbelievable, new, homes, made, easy..."
3,Subject: 4 color printing special request add...,1,"[subject, color, printing, special, request, a..."
4,"Subject: do not have money , get software cds ...",1,"[subject, money, get, software, cds, software,..."
...,...,...,...
5723,Subject: re : research and development charges...,0,"[subject, research, development, charges, gpg,..."
5724,"Subject: re : receipts from visit jim , than...",0,"[subject, receipts, visit, jim, thanks, invita..."
5725,Subject: re : enron case study update wow ! a...,0,"[subject, enron, case, study, update, wow, day..."
5726,"Subject: re : interest david , please , call...",0,"[subject, interest, david, please, call, shirl..."
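To make the filtering step concrete, here is a minimal sketch that does not depend on the NLTK downloads. It assumes a tiny hand-written stop-word set (`TOY_STOP_WORDS`, invented for this example) in place of `stopwords.words('english')`, and a plain whitespace split in place of `word_tokenize`.

```python
# Tiny stand-in for NLTK's English stop-word list (assumption for this sketch)
TOY_STOP_WORDS = {"do", "not", "have", "the", "is", "a", "your"}

def remove_stopwords(text, stop_words=TOY_STOP_WORDS):
    # Whitespace split stands in for word_tokenize in this illustration
    tokens = text.split()
    # A set makes each membership test O(1) on average
    return [token for token in tokens if token not in stop_words]

print(remove_stopwords("do not have money get software cds"))
```

Wrapping the stop words in a `set`, as the cell above does, matters on a corpus of this size: membership tests on a list would scan the whole list for every token.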


### (1.5) Lemmatize

❓ Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`. ❓

In [70]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [71]:
df['clean_text']

0       [subject, naturally, irresistible, corporate, ...
1       [subject, stock, trading, gunslinger, fanny, m...
2       [subject, unbelievable, new, homes, made, easy...
3       [subject, color, printing, special, request, a...
4       [subject, money, get, software, cds, software,...
                              ...                        
5723    [subject, research, development, charges, gpg,...
5724    [subject, receipts, visit, jim, thanks, invita...
5725    [subject, enron, case, study, update, wow, day...
5726    [subject, interest, david, please, call, shirl...
5727    [subject, news, aurora, update, aurora, versio...
Name: clean_text, Length: 5728, dtype: object

In [72]:
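def lem(tokenized_text):
    # Reuse the lemmatizer created above; with no `pos` argument, words are treated as nouns
    lemmatized = [lemmatizer.lemmatize(word) for word in tokenized_text]
    # Join the tokens back into a single string
    joined = ' '.join(lemmatized)
    return joined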

df['clean_text'] = df['clean_text'].apply(lem)

In [73]:
df.loc[0,'text']

"Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  ma

In [74]:
df.loc[0,'clean_text']

'subject naturally irresistible corporate identity lt really hard recollect company market full suqgestions information isoverwhelminq good catchy logo stylish statlonery outstanding website make task much easier promise havinq ordered iogo company automaticaily become world ieader isguite ciear without good product effective business organization practicable aim hotat nowadays market promise marketing effort become much effective list clear benefit creativeness hand made original logo specially done reflect distinctive company image convenience logo stationery provided format easy use content management system letsyou change website content even structure promptness see logo draft within three business day affordability marketing break make gap budget satisfaction guaranteed provide unlimited amount change extra fee surethat love result collaboration look portfolio interested'

## (2) Bag-of-words Modelling

### (2.1) Digitizing the textual data into numbers

❓ Vectorize the `clean_text` to a Bag-of-Words representation with a default CountVectorizer. Save as `X_bow`. ❓

In [75]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(df.clean_text)
X_bow.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [80]:
vectorizer.get_feature_names_out()[100:130]

array(['absorb', 'absorbing', 'absorbowac', 'absorbs', 'absorption',
       'absoutely', 'abspielt', 'abstained', 'abstract', 'abstractive',
       'abstruse', 'absultely', 'absurd', 'abu', 'abuja', 'abundance',
       'abundancegroup', 'abundant', 'abundantly', 'abuse', 'abused',
       'abusive', 'abutler', 'aby', 'abyssinia', 'ac', 'acacia',
       'academe', 'academia', 'academic'], dtype=object)

In [78]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_bow = count_vect.fit_transform(df['clean_text'])

# Dense view of the bag-of-words matrix, with one column per vocabulary word
vect_df = pd.DataFrame(X_bow.toarray(),
                       columns=count_vect.get_feature_names_out())
vect_df

Unnamed: 0,aa,aaa,aaaenerfax,aadedeji,aagrawal,aal,aaldous,aaliyah,aall,aanalysis,...,zwzm,zxghlajf,zyban,zyc,zygoma,zymg,zzmacmac,zzn,zzncacst,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5723,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5725,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [46]:
count_vect.get_feature_names_out()

array(['aa', 'aaa', 'aaaenerfax', ..., 'zzn', 'zzncacst', 'zzzz'],
      dtype=object)

### (2.2) Multinomial Naive Bayes Modelling

❓ Cross-validate a MultinomialNB model with the bag-of-words data. Score the model's accuracy. ❓

In [77]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

cv_nb = cross_validate(MultinomialNB(), X_bow, df.spam, scoring="accuracy")

cv_nb['test_score'].mean()

0.9895252901681946
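One design point worth noting: in the cells above, the vectorizer is fitted on the whole dataset before cross-validation, so the vocabulary is learned from the validation folds too. A common way to avoid this is to wrap the vectorizer and the model in a scikit-learn `Pipeline`, so that the vocabulary is refitted on the training part of each fold. The sketch below uses a small synthetic corpus (`toy_texts` and `toy_labels`, invented for the example) rather than the emails dataset, so its score says nothing about the real task.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Synthetic corpus: 10 "spam" and 10 "ham" texts (made up for this sketch)
spam_texts = [f"win free money now offer {i}" for i in range(10)]
ham_texts = [f"meeting schedule research report {i}" for i in range(10)]
toy_texts = spam_texts + ham_texts
toy_labels = [1] * 10 + [0] * 10

# The vectorizer is fitted inside each training fold, never on the validation fold
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

cv_results = cross_validate(pipeline, toy_texts, toy_labels, cv=5, scoring="accuracy")
mean_accuracy = cv_results["test_score"].mean()
print(mean_accuracy)
```

On the real data this would presumably be called as `cross_validate(pipeline, df['clean_text'], df['spam'], ...)`. With a plain count vectorizer the difference in score is usually small, but the pipeline form becomes more important with transformers such as TF-IDF that learn corpus-wide statistics.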

🏁 Congratulations!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!