# Exercise 1 Spam Ham binary classifier

In this notebook I build and evaluate a Naive Bayes classifier.

The model first uses the SKLearn CountVectorizer class to extract each post's word counts into a matrix.

It then feeds the matrix to the SKLearn Multinomial Naive Bayes classifier.

* The model was trained on the TRAIN data.
* It was evaluated on the TEST data, trying to optimise precision and recall. I tested a number of variations to optimise these metrics. It was possible to get the recall of spam very high (99% or higher), especially by using ngrams, which effectively act as context. But this came at the cost of reducing recall of ham to <50%. In the end I went for a more balanced approach as I was unsure whether it was worse to have false negatives or false positives for spam.
* Finally, it was tested on the VALIDATION data.
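
As a rough illustration of the two steps described above (word counts, then Multinomial Naive Bayes), here is a minimal sketch on a handful of made-up posts. The toy posts, labels and variable names are assumptions for illustration only. It uses the default SKLearn settings rather than the custom tokenizer and uniform priors used later in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy posts and labels invented for illustration (0 = ham, 1 = spam)
toy_posts = [
    "free coupons sign up now",
    "earn money from home free",
    "my baby will not sleep tonight",
    "feeling tired and frustrated today",
]
toy_labels = [1, 1, 0, 0]

# Step 1: turn each post into a row of word counts
toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_posts)

# Step 2: fit Multinomial Naive Bayes on the count matrix
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_matrix, toy_labels)

# New posts must go through the same fitted vectorizer before prediction
new_posts = ["free money", "baby sleep"]
toy_predictions = toy_classifier.predict(toy_vectorizer.transform(new_posts))
print(toy_predictions)
```

The important detail is that new text is passed through `transform` on the already fitted vectorizer, so the columns line up with the ones the classifier was trained on.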

Final results (VALIDATION data)

|class | precision | recall | f1-score |
|------|-----------|--------|----------|
| ham  | 0.77      | 0.74   | 0.76     |
| spam | 0.90      | 0.91   | 0.90     |



In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB
import numpy as np
from collections import Counter

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, TweetTokenizer

from sklearn.metrics import classification_report, confusion_matrix

## import the data

In [2]:
data_path = 'data/dataset.csv'

In [3]:
df = pd.read_csv(data_path, header=None)
df.columns = ['dataset', 'text', 'spamham'] 
df['target'] = 0
df.loc[df.spamham=='spam', 'target'] = 1 # set ham as 0 and spam as 1

In [4]:
df.head()

|   | dataset | text | spamham | target |
|---|---------|------|---------|--------|
| 0 | TRAIN | G.E.M.S Starting a nonprofit organization in t... | spam | 1 |
| 1 | TRAIN | Online Shopping w/ AVON Hello Mommas 😘😘♥️ If y... | spam | 1 |
| 2 | TRAIN | Shop w/ AVON Hello Strong Mommas 😘♥️ If you lo... | spam | 1 |
| 3 | TRAIN | Shopping If anyone is into make or jewelry or ... | spam | 1 |
| 4 | TEST | Make money from home... http://letty1995.hotsy... | spam | 1 |



## split out the train test and validate data

In [5]:
train_text = df.loc[df.dataset=='TRAIN'].text.values
test_text = df.loc[df.dataset=='TEST'].text.values
validate_text = df.loc[df.dataset=='VALIDATION'].text.values

In [6]:
y_train = df.loc[df.dataset=='TRAIN'].target.values
y_test = df.loc[df.dataset=='TEST'].target.values
y_validate = df.loc[df.dataset=='VALIDATION'].target.values

In [7]:
print(f'{len(train_text):,} documents in the training set')

5,015 documents in the training set


## initialise nltk functions to test tokenizer

In [8]:
stop_words = set(stopwords.words('english')) 

In [9]:
porter = PorterStemmer()

### choose a tokenizer

In [10]:
tokenize = word_tokenize # standard tokenizer

# tknzr = TweetTokenizer() # tweet friendly tokenizer -- didn't improve the model
# tknzr = TweetTokenizer(preserve_case=False, reduce_len=False)
# tokenize = tknzr.tokenize

## count word and document frequencies to filter out common and/or rare words

In [11]:
word_frequency = Counter()
doc_frequency = Counter()

for s in train_text:
    
    words = tokenize(s.lower())
    
    word_frequency.update(words)
    doc_frequency.update(set(words))
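
To make the difference between the two counters concrete, here is a small sketch on made-up documents. The toy documents are assumptions, and `split()` stands in for the nltk tokenizer so the example has no dependencies.

```python
from collections import Counter

# Toy documents invented for illustration; split() stands in for the nltk tokenizer
toy_docs = ["free free free coupons", "free samples", "baby sleep"]

toy_word_frequency = Counter()
toy_doc_frequency = Counter()

for doc in toy_docs:
    words = doc.lower().split()
    toy_word_frequency.update(words)      # counts every occurrence
    toy_doc_frequency.update(set(words))  # counts each word at most once per document

print(toy_word_frequency["free"])  # 4 occurrences in total
print(toy_doc_frequency["free"])   # appears in 2 documents
```

Filtering on document frequency rather than word frequency means a word repeated many times in a single post is still treated as rare.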

In [12]:
def tokenizer(s):
    
    """Wrapper tokenizer function to test impact of different approaches
    
    Inputs
    ------
    
    s : str
    
    Returns
    -------
    
    tokens : list [str, ]
    """

    tokens =  tokenize(s.lower()) # apply the nltk tokenizer
    tokens = [t for t in tokens if doc_frequency[t]>5 and t not in stop_words]# and doc_frequency[t]<3000]
    
    return tokens

In [13]:
# spot check tokenizer
# print(test_text[0])
# print()
# print(tokenizer(test_text[0]))

## Build the model

In [14]:
# sklearn class to extract and count features/words
vectorizer = CountVectorizer(tokenizer=tokenizer) #, ngram_range=(1, 3))#, analyzer='word'), tokenizer=tokenizer)

# sklearn class to learn conditional probabilities

# small gain in accuracy from using uniform priors - not sure why this works
classifier = MultinomialNB(fit_prior=False) 

## tested alternative bayes model that can work better with unbalanced samples
## no significant improvement 
# classifier = ComplementNB(fit_prior=False) 
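
One possible reading of the uniform-prior gain: with learned priors the model starts every prediction tilted towards the majority class (spam here), and `fit_prior=False` removes that tilt, which may help ham recall. The sketch below only shows what the setting changes, on a made-up count matrix with a 3:1 imbalance. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix with a 3:1 spam-to-ham imbalance (values invented for illustration)
X_prior_demo = np.array([[2, 0], [3, 1], [1, 0], [0, 2]])
y_prior_demo = np.array([1, 1, 1, 0])  # 0 = ham, 1 = spam

learned = MultinomialNB(fit_prior=True).fit(X_prior_demo, y_prior_demo)
uniform = MultinomialNB(fit_prior=False).fit(X_prior_demo, y_prior_demo)

# Class order follows classes_, here [0, 1] = [ham, spam]
learned_priors = np.exp(learned.class_log_prior_)
uniform_priors = np.exp(uniform.class_log_prior_)
print(learned_priors)  # priors estimated from class frequencies
print(uniform_priors)  # equal prior for each class
```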


In [15]:
# extract sparse matrix with words as columns and documents as rows
# each element is the frequency count of the word in document
X_train = vectorizer.fit_transform(train_text)

## calculate the conditional probabilities for the train data
classifier.fit(X_train, y_train)

MultinomialNB(fit_prior=False)

## Evaluate the model

I evaluated the model with precision and recall rather than accuracy alone, as there is roughly a 3 to 1 imbalance between spam and ham in the data.

I was unsure whether it is worse to allow spam to creep in or to class a user's ham post as spam, so I tried to balance the two.
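
To see why accuracy alone can mislead on imbalanced data, consider a hypothetical baseline that labels every post as spam. The class counts below are the TEST support figures from the classification report in this notebook. The baseline itself is an assumption for illustration, not something that was trained here.

```python
# Class counts taken from the support column of the TEST report in this notebook
n_ham, n_spam = 402, 1041
total = n_ham + n_spam

# Hypothetical baseline that labels every post as spam
baseline_accuracy = n_spam / total      # every spam post is right, every ham post is wrong
baseline_spam_recall = n_spam / n_spam  # all spam is caught
baseline_ham_recall = 0 / n_ham         # no ham post is ever recognised

print(f"accuracy    {baseline_accuracy:.2f}")
print(f"spam recall {baseline_spam_recall:.2f}")
print(f"ham recall  {baseline_ham_recall:.2f}")
```

The baseline reaches about 72% accuracy while being useless for ham, which is the kind of failure that per-class precision and recall expose.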


In [16]:
# apply the model to the test data
X_test = vectorizer.transform(test_text)
y_pred = classifier.predict(X_test)

In [17]:
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))

              precision    recall  f1-score   support

         ham       0.78      0.76      0.77       402
        spam       0.91      0.92      0.91      1041

    accuracy                           0.87      1443
   macro avg       0.84      0.84      0.84      1443
weighted avg       0.87      0.87      0.87      1443



I played around with a few things to improve the accuracy:
    
* lower case - **significant improvement**
* stemming
* nltk twitter specific tokenizer
* removing stop words
* removing numbers
* ngrams
* character grams
* removing priors - **significant improvement**
* removing low frequency words - **significant improvement**
* different SKLearn naive bayes classifiers

<hr>

To avoid me having to refer back to Wikipedia:
* **precision**: percentage of posts classed as spam (ham) that really are spam (ham)
* **recall**: percentage of actual spam (ham) posts that are classed as spam (ham)
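
The two definitions above can be checked by hand on a small made-up example. The labels and the helper `precision_recall` are assumptions for illustration. In the notebook itself these numbers come from `classification_report`.

```python
# Toy labels invented for illustration (0 = ham, 1 = spam)
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred_toy = [1, 1, 1, 0, 0, 1, 1, 1]

def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one class, counted by hand.

    Assumes the class is predicted at least once and occurs at least once.
    """
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # everything classed as `label`
    actual = sum(t == label for t in y_true)     # everything that really is `label`
    return true_pos / predicted, true_pos / actual

spam_precision, spam_recall = precision_recall(y_true_toy, y_pred_toy, label=1)
ham_precision, ham_recall = precision_recall(y_true_toy, y_pred_toy, label=0)

print(f"spam precision {spam_precision:.2f} recall {spam_recall:.2f}")
print(f"ham  precision {ham_precision:.2f} recall {ham_recall:.2f}")
```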


## Look at the top features/words with the biggest impact

The values are the log conditional probability P(word|spam) after Laplace smoothing.

More negative numbers mean the word is rarer in spam posts. Whether a word actually favours ham also depends on P(word|ham), which is not shown here.
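
As a sketch of where these values come from: with smoothing parameter $\alpha$, the smoothed estimate is $\log \frac{N_{w,c} + \alpha}{N_c + \alpha |V|}$, where $N_{w,c}$ is the count of word $w$ in class $c$, $N_c$ is the total word count in class $c$, and $|V|$ is the vocabulary size. The toy matrix below is an assumption for illustration. It compares a manual calculation with the `feature_log_prob_` attribute of `MultinomialNB`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: rows are documents, columns are words (invented for illustration)
X_counts = np.array([[2, 1, 0], [1, 0, 0], [0, 0, 3]])
y_labels = np.array([1, 1, 0])  # 0 = ham, 1 = spam

alpha = 1.0  # Laplace smoothing, the MultinomialNB default
nb = MultinomialNB(alpha=alpha).fit(X_counts, y_labels)

# Manual calculation for the spam class
spam_word_counts = X_counts[y_labels == 1].sum(axis=0)  # N_wc for each word
smoothed = spam_word_counts + alpha
manual_log_prob = np.log(smoothed / smoothed.sum())     # denominator = N_c + alpha * |V|

print(manual_log_prob)
print(nb.feature_log_prob_[1])  # row 1 corresponds to classes_[1], i.e. spam
```

The third toy word never occurs in spam, so it receives the smallest possible value. That would be consistent with the repeated -12.07 in the output further down: words with no spam occurrences all tie at the floor and are then ordered alphabetically by the sort.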


Looking at this identified that stop words were high predictors of spam.

This may be because spam is more likely to have proper sentences.

Although it didn't impact the accuracy much I still decided to remove stop words during tokenization. More work could be done evaluating the high impact features.


We can see some words that we would expect to be in spam, e.g. $, looking, free.

And you can clearly see something common about the words that support ham. I suppose these are words that a spammer would not use in case they offended someone.

In [18]:
def show_most_informative_features(vectorizer, clf, n=20):
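    # get_feature_names_out() and feature_log_prob_ replace get_feature_names() and coef_,
    # which have been removed from newer SKLearn releases
    feature_names = vectorizer.get_feature_names_out()
    # row 1 is log P(word|spam), the same values the old clf.coef_[0] gave for two classes
    coefs_with_fns = sorted(zip(clf.feature_log_prob_[1], feature_names))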
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    print(f'{"support ham":30} {"support spam":30}')
    print()
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print(f'{coef_1:10.2f} {fn_1:20} {coef_2:10.2f} {fn_2:20}')

In [19]:
show_most_informative_features(vectorizer, classifier, 30)

support ham                    support spam                  

    -12.07 abortion                  -3.05 .                   
    -12.07 anal                      -3.18 !                   
    -12.07 anger                     -3.20 ,                   
    -12.07 angry                     -3.94 ’                   
    -12.07 annoyed                   -4.35 :                   
    -12.07 bisexual                  -4.45 $                   
    -12.07 blocked                   -4.51 ?                   
    -12.07 cannabis                  -4.75 )                   
    -12.07 cheated                   -4.85 (                   
    -12.07 disgusting                -5.03 &                   
    -12.07 dumb                      -5.14 https               
    -12.07 frustrating               -5.19 get                 
    -12.07 fuck                      -5.22 free                
    -12.07 fucked                    -5.27 baby                
    -12.07 lesbian                   -5.2

## Spot check text where the model failed

One key thing that stood out is that many of the ham posts that are classified as spam appear to have text from a commercial/advert like offering. But the post itself is not spam.


I tried cutting each post off after the first N words. This didn't increase accuracy.
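
The truncation code is not shown in this notebook, so the following is only a sketch of one way it could be done. The helper name `truncate_words`, the whitespace splitting and the value of N are all assumptions.

```python
def truncate_words(text, n_words):
    """Keep only the first n_words whitespace-separated words of a post."""
    return " ".join(text.split()[:n_words])

# Opening of the misclassified post shown in this notebook
sample_post = "Formula samples/coupons I have samples and coupons for similac formula."
truncated = truncate_words(sample_post, 5)
print(truncated)
```

The same helper would need to be applied to the TRAIN, TEST and VALIDATION text before vectorizing, so that the model is trained and evaluated on the same kind of input.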


Possibly some kind of weighting towards words near the beginning would help with this.

But if there are too many posts where ham is very similar to spam or spam written like ham then it will push against the boundaries of what an NLP model can distinguish.

We could look at the metadata and other information for additional signal, eg past history.

In [27]:
counter = 0

for i, row in df.iterrows():

    if row.dataset != 'TEST': continue # filter for test data
    
    if row.target != 0: continue # filter for spam or ham only. Set as 1 for spam; 0 for ham
    
    # get the prediction
    X_test = vectorizer.transform([row.text])
    y_pred = classifier.predict(X_test)
    
    if y_pred[0] == row.target: continue # filter for where we got it wrong

    print(i)
    print(row.text, end='\n------\n\n') # print post that was misclassified

    counter += 1
    
    if counter == 3: break

6
Formula samples/coupons I have samples and coupons for similac formula. Does any momma on here need any? I'd be happy to mail it to you! milac FongMoms EWARDS OW SAVE GROW 3 easy steps to FREE Similac! Look inside for savings coupons good toward any Similac product And check your mail for more coming soon. SAVE $5 FREE Similac 1 2 USE COUPONS 3 Use your Similac coupons (We've included some inside) EARN POINTS Earn 5 points for every coupon you use. Points add up and rewards sent automatically GET SIMILAC Reach 35 points and get FREE Similae. Sign up for additional savings and tips at Similac.com/Email-Signup. Enter your membership ID found in the enclosed "save" envelope. FRE Sme omes Forem d nd Pet by S grem m e OptiGRO Cm N r r Bdry's Fiest Yer MAX-BASED LPONDE Infant Formula EW Similac HMO For IMMUNE SUPPORT sen ae dnt OptiGRO with Iran Memaciad PRO-ADVANCE NET T 584 NON-GMO Ing essly cep FAbbots Complete Nutrion for Your lainys First Vear ide MLK-SASED POWDER Similac Infant Formu

## Evaluate the final model on the validation data

In [21]:
X_validate = vectorizer.transform(validate_text)
y_pred_v = classifier.predict(X_validate)

In [22]:
print(classification_report(y_validate, y_pred_v))#, target_names=target_names))

              precision    recall  f1-score   support

           0       0.77      0.74      0.76       200
           1       0.90      0.91      0.90       486

    accuracy                           0.86       686
   macro avg       0.83      0.83      0.83       686
weighted avg       0.86      0.86      0.86       686



## Save the models to be used by api

In [23]:
import pickle

In [24]:
with open('models/vectorizer.pkl', 'wb') as output:  # Overwrites any existing file.
        pickle.dump(vectorizer, output, pickle.HIGHEST_PROTOCOL)

In [25]:
with open('models/classifier.pkl', 'wb') as output:  # Overwrites any existing file.
        pickle.dump(classifier, output, pickle.HIGHEST_PROTOCOL)

In [26]:
with open('models/doc_frequency_dict.pkl', 'wb') as output:  # Overwrites any existing file.
        pickle.dump(doc_frequency, output, pickle.HIGHEST_PROTOCOL)
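
On the loading side, one thing to watch is that pickle stores functions by reference (module and name) rather than by value. The custom `tokenizer` passed to the vectorizer therefore has to be importable under the same module and name when `vectorizer.pkl` is loaded. It also still needs `doc_frequency`, `stop_words` and `tokenize` to exist before it is called. The sketch below shows only a generic save and load round trip with a stand-in Counter. The helper names and the temporary path are assumptions, not part of the api code.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path

def save_pickle(obj, path):
    """Write obj to path, overwriting any existing file."""
    with open(path, "wb") as output:
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    """Read a pickled object back; only use this on files you created yourself."""
    with open(path, "rb") as source:
        return pickle.load(source)

# Round trip with a stand-in Counter; in the notebook this would be doc_frequency
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "doc_frequency_dict.pkl"
    save_pickle(Counter({"free": 12, "similac": 3}), demo_path)
    restored = load_pickle(demo_path)

print(restored["free"])
```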