# Ham or Spam?

In [1]:
# on a fresh NLTK install, the corpora and tokenizer models used below must be downloaded once
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Joe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Joe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Joe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
import pandas as pd

df = pd.read_csv("emails.csv")

df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


The dataset is made up of emails classified as ham [0] or spam [1]. You need to clean the dataset before training a prediction model.

## Remove Punctuation

👇 Create a function to remove the punctuation. Apply it to the entire data and add the output as a new column in the dataframe called `clean_text`

In [3]:
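from nltk.tokenize import RegexpTokenizer

def clean_text():
    """Strip punctuation from df['text'] and return the result as a one-column frame."""
    # \w+ keeps runs of letters, digits and underscores; everything else is dropped
    tokenizer = RegexpTokenizer(r'\w+')
    ser = []
    for d in df['text']:
        tokens = tokenizer.tokenize(d)
        ser.append(' '.join(tokens))
    # NOTE: the exercise asks for a `clean_text` column on df;
    # this returns a separate frame with a single column named 'sub'
    df1 = pd.DataFrame(ser, columns=['sub'])
    return df1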
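
The exercise asks for a per-text function whose output lands in a `clean_text` column, whereas the cell above returns a separate frame. A minimal standard-library sketch of such a function could look like this. The name `remove_punctuation` is an assumption, not part of the notebook, and unlike `RegexpTokenizer(r'\w+')` it deletes punctuation rather than splitting on it (so `don't` becomes `dont`) and it also removes underscores.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Drop ASCII punctuation, then collapse the whitespace left behind."""
    stripped = text.translate(_PUNCT_TABLE)
    return " ".join(stripped.split())


sample = "Subject: do not have money , get software cds !"
print(remove_punctuation(sample))
```

Assuming the frame loaded above, it could be applied with `df['clean_text'] = df['text'].apply(remove_punctuation)`.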
      

In [4]:
clean_text()

Unnamed: 0,sub
0,Subject naturally irresistible your corporate ...
1,Subject the stock trading gunslinger fanny is ...
2,Subject unbelievable new homes made easy im wa...
3,Subject 4 color printing special request addit...
4,Subject do not have money get software cds fro...
...,...
5723,Subject re research and development charges to...
5724,Subject re receipts from visit jim thanks agai...
5725,Subject re enron case study update wow all on ...
5726,Subject re interest david please call shirley ...


## Lower Case

👇 Create a function to lower case the text. Apply it to `clean_text`

In [5]:
dfs = clean_text()
dfs

Unnamed: 0,sub
0,Subject naturally irresistible your corporate ...
1,Subject the stock trading gunslinger fanny is ...
2,Subject unbelievable new homes made easy im wa...
3,Subject 4 color printing special request addit...
4,Subject do not have money get software cds fro...
...,...
5723,Subject re research and development charges to...
5724,Subject re receipts from visit jim thanks agai...
5725,Subject re enron case study update wow all on ...
5726,Subject re interest david please call shirley ...


In [6]:
def lowers(dfs):
    """Return a new one-column frame with every string in dfs['sub'] lower-cased."""
    var = [text.lower() for text in dfs['sub']]
    df2 = pd.DataFrame(var, columns=['sub'])
    return df2

In [7]:
lowers(dfs)

Unnamed: 0,sub
0,subject naturally irresistible your corporate ...
1,subject the stock trading gunslinger fanny is ...
2,subject unbelievable new homes made easy im wa...
3,subject 4 color printing special request addit...
4,subject do not have money get software cds fro...
...,...
5723,subject re research and development charges to...
5724,subject re receipts from visit jim thanks agai...
5725,subject re enron case study update wow all on ...
5726,subject re interest david please call shirley ...


## Remove Numbers

👇 Create a function to remove numbers from the text. Apply it to `clean_text`

In [8]:
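def remove_num(dfs):
    """Return a new one-column frame with all digit characters removed from dfs['sub']."""
    var = []
    for text in dfs['sub']:
        # drop digits character by character; surrounding spaces are left as they are
        no_digits = ''.join(ch for ch in text if not ch.isdigit())
        var.append(no_digits)
    df3 = pd.DataFrame(var, columns=['sub'])
    return df3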


In [9]:
remove_num(dfs)

Unnamed: 0,sub
0,Subject naturally irresistible your corporate ...
1,Subject the stock trading gunslinger fanny is ...
2,Subject unbelievable new homes made easy im wa...
3,Subject color printing special request additi...
4,Subject do not have money get software cds fro...
...,...
5723,Subject re research and development charges to...
5724,Subject re receipts from visit jim thanks agai...
5725,Subject re enron case study update wow all on ...
5726,Subject re interest david please call shirley ...
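
Deleting digits one character at a time leaves the surrounding spaces behind and turns tokens such as `mp3` into `mp`. As an alternative sketch (the name `remove_numbers` is hypothetical, not part of the notebook), a regular expression can drop digit runs and then collapse the leftover whitespace:

```python
import re


def remove_numbers(text: str) -> str:
    """Remove every run of digits, then collapse repeated whitespace."""
    no_digits = re.sub(r"\d+", "", text)
    return " ".join(no_digits.split())


print(remove_numbers("Subject 4 color printing special request"))
```

Whether digits embedded in words should be stripped as well is a design choice; this sketch strips them, as the notebook's function does.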


## Remove StopWords

👇 Create a function to remove stopwords from the text. Apply it to `clean_text`.

In [10]:
nn = dfs.values.tolist()

In [11]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def stop_word(dfs):
    """Return a new one-column frame with English stop words removed from dfs['sub']."""
    # build the stop-word set once instead of once per row
    stop_words = set(stopwords.words('english'))
    var = []
    for i in dfs['sub']:
        word_tokens = word_tokenize(i)
        filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
        var.append(' '.join(filtered_sentence))
    # build the frame once, after the loop has finished
    df4 = pd.DataFrame(var, columns=['sub'])
    return df4

In [12]:
stop_word(dfs)

Unnamed: 0,sub
0,Subject naturally irresistible corporate ident...
1,Subject stock trading gunslinger fanny merrill...
2,Subject unbelievable new homes made easy im wa...
3,Subject 4 color printing special request addit...
4,Subject money get software cds software compat...
...,...
5723,Subject research development charges gpg forwa...
5724,Subject receipts visit jim thanks invitation v...
5725,Subject enron case study update wow day super ...
5726,Subject interest david please call shirley cre...
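
To make the filtering step concrete without the NLTK download, here is a small sketch that uses a hand-written stop-word set. `TOY_STOP_WORDS` and `remove_stopwords` are illustrative names, and the toy set is far smaller than NLTK's English list used in the notebook.

```python
# Hypothetical miniature stop-word list, for illustration only; the notebook
# relies on the much larger stopwords.words('english') list from NLTK.
TOY_STOP_WORDS = frozenset(
    {"the", "is", "your", "do", "not", "have", "from", "and", "to", "re"}
)


def remove_stopwords(text: str, stop_words=TOY_STOP_WORDS) -> str:
    """Keep only whitespace-separated tokens whose lower-cased form is not a stop word."""
    kept = [w for w in text.split() if w.lower() not in stop_words]
    return " ".join(kept)


print(remove_stopwords("Subject do not have money get software cds"))
```

The comparison is done on the lower-cased token so that capitalised stop words are removed too, while the surviving tokens keep their original case.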


## Lemmatize

👇 Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`.

In [13]:
from textblob import TextBlob

def lemmatize(dfs):
    """Return a new one-column frame with each word in dfs['sub'] lemmatized."""
    var = []
    for i in dfs['sub']:
        s = TextBlob(i)
        # no part-of-speech tag is passed, so words are treated as nouns
        # ("homes" -> "home", while verb forms such as "made" stay unchanged)
        lemmatized_sentence = " ".join([w.lemmatize() for w in s.words])
        var.append(lemmatized_sentence)
    df5 = pd.DataFrame(var, columns=['sub'])
    return df5


In [14]:
lemmatize(dfs)

Unnamed: 0,sub
0,Subject naturally irresistible your corporate ...
1,Subject the stock trading gunslinger fanny is ...
2,Subject unbelievable new home made easy im wan...
3,Subject 4 color printing special request addit...
4,Subject do not have money get software cd from...
...,...
5723,Subject re research and development charge to ...
5724,Subject re receipt from visit jim thanks again...
5725,Subject re enron case study update wow all on ...
5726,Subject re interest david please call shirley ...


## Bag-of-words Modelling

👇 Vectorize the `clean_text` to a Bag-of-Words representation with a default CountVectorizer. Save as `X_bow`.

In [28]:
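from sklearn.feature_extraction.text import CountVectorizer
# NOTE: this deviates from the "default CountVectorizer" the exercise asks for:
# it uses NLTK's tokenizer and keeps only the 5000 most frequent terms
vectorizer = CountVectorizer(tokenizer=nltk.word_tokenize, max_features=5000)
X_bow = vectorizer.fit_transform(dfs['sub'])
# learned vocabulary; get_feature_names() was deprecated in scikit-learn 1.0 and
# removed in 1.2 in favour of get_feature_names_out()
vectorizer.get_feature_names()
# densify only for display; the matrix itself is sparse
print(X_bow.toarray())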

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 4 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
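
Each row of the matrix above is one email and each column is one vocabulary term, with the cell holding how often the term occurs. The following standard-library sketch illustrates that representation on two made-up documents. It is not how `CountVectorizer` is implemented, and the helper name `bag_of_words` and the sample texts are assumptions for illustration.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (sorted vocabulary, count rows) for whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        # one column per vocabulary term, 0 when the term is absent
        rows.append([counts[term] for term in vocab])
    return vocab, rows


docs = ["free money free offer", "meeting about money"]
vocab, rows = bag_of_words(docs)
print(vocab)
for row in rows:
    print(row)
```

With a real corpus most cells are zero, which is why scikit-learn returns a sparse matrix rather than nested lists.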


In [29]:
import pickle
# NOTE: this stores the fitted vectorizer, not the bag-of-words matrix itself
with open('X_bow.pkl', 'wb') as fout:
    pickle.dump(vectorizer, fout)

👇 Cross-validate a MultinomialNB model with the Bag-of-words. Score the model's accuracy.

In [55]:

sms = pd.read_csv("emails.csv", delimiter=',')

In [56]:
sms.shape

(5728, 2)

In [57]:
sms.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


In [58]:
sms.spam.value_counts()

0    4360
1    1368
Name: spam, dtype: int64
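
The classes are imbalanced, so it helps to know what accuracy a trivial model would reach before judging a classifier. A short sketch using only the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
n_ham, n_spam = 4360, 1368
total = n_ham + n_spam

# Accuracy of a trivial model that labels every email as ham.
majority_baseline = n_ham / total
spam_share = n_spam / total

print(f"spam share: {spam_share:.3f}, majority-class accuracy: {majority_baseline:.3f}")
```

Any model worth keeping should score clearly above this majority-class figure.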

In [62]:
sms['label'] = sms.spam.map({0:'ham', 1:'spam'})
sms

Unnamed: 0,text,spam,label
0,Subject: naturally irresistible your corporate...,1,spam
1,Subject: the stock trading gunslinger fanny i...,1,spam
2,Subject: unbelievable new homes made easy im ...,1,spam
3,Subject: 4 color printing special request add...,1,spam
4,"Subject: do not have money , get software cds ...",1,spam
...,...,...,...
5723,Subject: re : research and development charges...,0,ham
5724,"Subject: re : receipts from visit jim , than...",0,ham
5725,Subject: re : enron case study update wow ! a...,0,ham
5726,"Subject: re : interest david , please , call...",0,ham


In [63]:
sms.head()

Unnamed: 0,text,spam,label
0,Subject: naturally irresistible your corporate...,1,spam
1,Subject: the stock trading gunslinger fanny i...,1,spam
2,Subject: unbelievable new homes made easy im ...,1,spam
3,Subject: 4 color printing special request add...,1,spam
4,"Subject: do not have money , get software cds ...",1,spam


In [65]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
print(X.shape)
print(y.shape)

(150, 4)
(150,)


In [66]:
X = sms.text
y = sms.spam
print(X.shape)
print(y.shape)

(5728,)
(5728,)


In [68]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4296,)
(1432,)
(4296,)
(1432,)


In [69]:
vect = CountVectorizer()
# learn the vocabulary from the training set and build its document-term matrix in one step
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<4296x32500 sparse matrix of type '<class 'numpy.int64'>'
	with 532010 stored elements in Compressed Sparse Row format>

In [70]:
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1432x32500 sparse matrix of type '<class 'numpy.int64'>'
	with 171108 stored elements in Compressed Sparse Row format>

In [71]:
# 1. import
from sklearn.naive_bayes import MultinomialNB

# 2. instantiate a Multinomial Naive Bayes model
nb = MultinomialNB()

In [72]:
%time nb.fit(X_train_dtm, y_train)


Wall time: 158 ms


MultinomialNB()
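
For intuition about what the fit step estimates, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on a made-up four-document corpus. It is a simplified illustration, not scikit-learn's implementation; `train_nb`, `predict_nb` and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log-likelihoods."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_counts.items():
        # add-alpha smoothing: every vocabulary term gets alpha extra counts
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_lik": {
                t: math.log((token_counts[label][t] + alpha) / denom) for t in vocab
            },
        }
    return model


def predict_nb(model, doc):
    """Return the label with the highest log-posterior; unseen tokens are skipped."""
    scores = {}
    for label, params in model.items():
        scores[label] = params["log_prior"] + sum(
            params["log_lik"][t] for t in doc.split() if t in params["log_lik"]
        )
    return max(scores, key=scores.get)


train_docs = ["free money now", "win free prize", "meeting at noon", "project meeting notes"]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham
model = train_nb(train_docs, train_labels)
print(predict_nb(model, "free prize money"))
print(predict_nb(model, "notes from meeting"))
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause.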

In [73]:
y_pred_class = nb.predict(X_test_dtm)

In [74]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.994413407821229

In [75]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[1079,    4],
       [   4,  345]], dtype=int64)
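
The accuracy reported earlier can be recomputed from this matrix, together with precision and recall for the spam class. A short sketch, assuming scikit-learn's convention that rows are true classes and columns are predicted classes:

```python
# Counts copied from the confusion matrix printed above
# (rows are true classes, columns are predicted classes; 0 = ham, 1 = spam).
tn, fp = 1079, 4
fn, tp = 4, 345

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of real spam that was caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

Because the number of false positives equals the number of false negatives here, precision and recall coincide.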

In [76]:
# print message text for the false positives (ham incorrectly classified as spam)

X_test[y_pred_class > y_test]

4327    Subject: your confirmation is needed  please r...
5487    Subject: news review update  please respond to...
2157    Subject: research family outing and volleyball...
4455    Subject: linux - - hit or miss ?  network worl...
Name: text, dtype: object

In [77]:
# print message text for the false negatives (spam incorrectly classified as ham)

X_test[y_pred_class < y_test]

636     Subject: i think , yes .  the things we sell a...
79      Subject: join focus groups to earn money  a la...
1144    Subject: checking account update  dear reader ...
566     Subject: blue horseshoe meet me  dear reader :...
Name: text, dtype: object

In [82]:
# example message from the test set (row 5723 is labelled ham and is not one of the false negatives listed above)
X_test[5723]

'Subject: re : research and development charges to gpg  here it is !  - - - - - - - - - - - - - - - - - - - - - - forwarded by shirley crenshaw / hou / ect on 08 / 14 / 2000  07 : 47 am - - - - - - - - - - - - - - - - - - - - - - - - - - -  vince j kaminski  08 / 10 / 2000 02 : 25 pm  to : vera apodaca / et & s / enron @ enron  cc : vince j kaminski / hou / ect @ ect , shirley crenshaw / hou / ect @ ect , pinnamaneni  krishnarao / hou / ect @ ect  subject : re : research and development charges to gpg  vera ,  we shall talk to the accounting group about the correction .  vince  08 / 09 / 2000 03 : 26 pm  vera apodaca @ enron  vera apodaca @ enron  vera apodaca @ enron  08 / 09 / 2000 03 : 26 pm  08 / 09 / 2000 03 : 26 pm  to : pinnamaneni krishnarao / hou / ect @ ect  cc : vince j kaminski / hou / ect @ ect  subject : research and development charges to gpg  per mail dated june 15 from kim watson , there was supposed to have occurred  a true - up of $ 274 . 7 in july for the fist six m

In [79]:
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([3.22830528e-113, 1.30541681e-030, 1.26219373e-082, ...,
       6.16117024e-027, 9.80420810e-001, 0.00000000e+000])

In [80]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.9992393515836039
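
The exercise asks for cross-validation, whereas the cells above score a single train/test split. In scikit-learn the usual route would be `cross_val_score` on a `Pipeline` that chains `CountVectorizer` and `MultinomialNB`, so the vocabulary is relearned inside each fold. The standard-library sketch below only illustrates how k-fold splitting partitions the rows; `k_fold_indices` is a hypothetical helper, not a library function.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # the first n_samples % k folds get one extra row so every row is used once
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

The preview of the data suggests that rows are grouped by label (spam first), so contiguous folds like these would give very unbalanced test sets. Shuffling or stratifying the folds avoids that, and `cross_val_score` uses stratified folds by default when given a classifier and an integer `cv`.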

⚠️ Please push the exercise once you are done 🙃

## 🏁 