# Training a Naive Bayes spam filter using scikit-learn

## Load Data

In this step, we load the data from the file. Each sample is a single line in the file and has the following format:

*{spam_or_ham},{email_text}*

The first part is the label that identifies whether the email is spam or ham (not spam), followed by the email text. For example:

`Spam,<p>But few feere in nor revellers in pride the a. Ear fathers yes begun revellers blazon one but not of take high. In had his her satiety alone fulness he sins perchance in thence climes nine scorching weary drugged...`

The data is loaded into two lists: `features`, which holds the raw text of each email, and `labels`, which holds 0 (ham) or 1 (spam) for the corresponding email.
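As a small illustration of the line format described above, the sketch below splits one raw line into a label and the email text. The helper name `parse_sample` and the toy lines are hypothetical and only serve the example. Splitting on the first comma keeps any commas inside the email text intact.

```python
def parse_sample(line):
    """Split one raw line into (label, text); return None for lines that are not samples."""
    # partition() splits on the first comma only, so commas in the email text survive
    prefix, sep, text = line.partition(',')
    if not sep:
        return None
    if prefix == 'Spam':
        return 1, text
    if prefix == 'Ham':
        return 0, text
    return None

# toy lines (made up for this example): two valid samples, an empty line, a marker line
lines = ['Spam,<p>Buy now</p>', 'Ham,<p>See you, tomorrow</p>', '', '# marker']
parsed = [parse_sample(line) for line in lines]
print(parsed)
```

Lines that do not start with a known label are mapped to `None`, so they can be filtered out before building the two lists.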

In [10]:
def read_file(path):
    """
    read and return all data in a file
    """
    with open(path, 'r') as f:
        return f.read()

data_path = "data/SpamDetectionData.txt"
all_data = read_file(data_path)
all_lines = all_data.split('\n')
all_lines[1:3]

['Spam,<p>But could then once pomp to nor that glee glorious of deigned. The vexed times childe none native. To he vast now in to sore nor flow and most fabled. The few tis to loved vexed and all yet yea childe. Fulness consecrate of it before his a a a that.</p><p>Mirthful and and pangs wrong. Objects isle with partings ancient made was are. Childe and gild of all had to and ofttimes made soon from to long youth way condole sore.</p>',
 'Spam,<p>His honeyed and land vile are so and native from ah to ah it like flash in not. That gild by in basked they lemans passed way who talethis forgot deigned nor friends his before strange. Found long little the. Talethis have soon of hellas had he. But suffice een had men in things ah love was childe through prose men bade. Now she break in shamed his brow loved spent he vaunted him that yea a. Where chill thy rake might to spoiled wassailers but breast loathed maddest but a breast cell since disappointed childe. From sad lurked lowly now was was

In [12]:
import random

def read_file(path):
    """
    read and return all data in a file
    """
    with open(path, 'r') as f:
        return f.read()

def load_data():
    """
    load and return the data in features and labels lists
    each item in features contains the raw email text
    each item in labels is either 1(spam) or 0(ham) and identifies corresponding item in features
    """
    # load all data from file
    data_path = "data/SpamDetectionData.txt"
    all_data = read_file(data_path)
    
    # split the data into lines, each line is a single sample
    all_lines = all_data.split('\n')
    print("to see the type", type(all_lines))
    # each line in the file is a sample with the format {spam_or_ham},{email_text}

    # extract the feature (email text) and label (spam or ham) from each line
    features = []
    labels = []
    for line in all_lines:
        if line[0:4] == 'Spam':
            labels.append(1)
            features.append(line[5:])
        elif line[0:3] == 'Ham':
            labels.append(0)
            features.append(line[4:])
        else:
            # ignore markers, empty lines and other lines that aren't valid samples
            # print('ignore: "{}"'.format(line))
            pass
    
    return features, labels
    
features, labels = load_data()

print("total no. of samples:",(len(labels)))
print("total no. of features:", (len(features)))
print("total no. of spam samples: {}".format(labels.count(1)))
print("total no. of ham samples: {}".format(labels.count(0)))



to see the type <class 'list'>
total no. of samples: 2100
total no. of features: 2100
total no. of spam samples: 1043
total no. of ham samples: 1057


In [15]:
# look at a randomly chosen email
print("\nPrint a random sample for inspection:")
# randrange excludes the upper bound, so the index is always valid
# (random.randint(0, len(labels)) could return len(labels) and raise IndexError)
random_idx = random.randrange(len(labels))
print("random idx is :", random_idx)
print("example feature: {}".format(features[random_idx]))
print("example label: {} ({})".format(labels[random_idx],
                                'spam' if labels[random_idx] else 'ham'))


Print a random sample for inspection:
random idx is : 1068
example feature: <p>Master wheeled gently the some that chamber bird. Word and both this gently he i there by flutter the for i chamber. Countenance silken kind spoke sat within at his grew this vainly tempest was. Streaming a something that stern a unbrokenquit. The form and that a visiter suddenly chamber above door seeing as and and dreams angels chamber this or. Fantastic in the then still my master midnight word the all i shall it. Evilprophet was so flitting over. Uttered and at art that fast door the no a melancholy but. The answer prophet my syllable demons before. Take only disaster. If within this into head a. Of the songs flown shorn and tis saintly than into we echo you devil yet feather no. Bust word dreaming. He his and and wondering fowl mystery word violet an. Hath entrance stood heard as crest bird not cushioned forgotten now sorrowsorrow be beguiling though a he now bird.</p><p>The get the once longer thereat

## Preprocess Data - Split data randomly into training and test subsets
Use [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split data into training and test subsets

In [16]:
from sklearn.model_selection import train_test_split

# load features and labels
features, labels = load_data()

# split data into training / test sets
features_train, features_test, labels_train, labels_test = train_test_split(
    features, 
    labels, 
    test_size=0.2,   # use 20% for testing
    random_state=42)

print("no. of train features: {}".format(len(features_train)))
print("no. of train labels: {}".format(len(labels_train)))
print("no. of test features: {}".format(len(features_test)))
print("no. of test labels: {}".format(len(labels_test)))


to see the type <class 'list'>
no. of train features: 1680
no. of train labels: 1680
no. of test features: 420
no. of test labels: 420
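
The split above is purely random. As a hedged aside, `train_test_split` also accepts a `stratify` argument that keeps the spam/ham ratio the same in both subsets. The sketch below uses made-up toy data (`toy_features`, `toy_labels`), not the email file.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# toy data (made up for this example): 10 spam and 10 ham samples
toy_features = ["email {}".format(i) for i in range(20)]
toy_labels = [1] * 10 + [0] * 10

# stratify=toy_labels preserves the class ratio in the train and test subsets
x_train, x_test, y_train, y_test = train_test_split(
    toy_features,
    toy_labels,
    test_size=0.2,
    random_state=42,
    stratify=toy_labels)

print(len(x_train), len(x_test))
print(Counter(y_train), Counter(y_test))
```

With classes as balanced as in this dataset (1043 spam, 1057 ham), a random split is likely close to stratified anyway, so this matters more for skewed data.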


## Preprocess Data - Vectorize text data
Use [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to vectorize text input

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

# vectorize email text into tfidf matrix
# TfidfVectorizer converts collection of raw documents to a matrix of TF-IDF features.
# It's equivalent to CountVectorizer followed by TfidfTransformer.
vectorizer = TfidfVectorizer()
# optional arguments (none are passed here):
#   input='content',      # input is actual text
#   lowercase=True,       # convert to lower case before tokenizing
#   stop_words='english'  # remove stop words

features_train_transformed = vectorizer.fit_transform(features_train)
features_test_transformed  = vectorizer.transform(features_test)
# We only call transform() on the test data so that it is encoded with the
# vocabulary and IDF weights learned from the training data.
features_test_transformed

<420x782 sparse matrix of type '<class 'numpy.float64'>'
	with 68889 stored elements in Compressed Sparse Row format>
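
To illustrate why the test data goes through `transform()` only, the sketch below fits a vectorizer on a tiny made-up corpus (`train_docs`, `test_docs` are hypothetical). The vocabulary is fixed at fit time, so a word that appears only in the test document is ignored and both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus (made up for this example)
train_docs = ["free prize inside", "meeting at noon", "free meeting"]
test_docs = ["free lunch at noon"]

toy_vectorizer = TfidfVectorizer()
train_matrix = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = toy_vectorizer.transform(test_docs)        # reuses them unchanged

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(train_matrix.shape, test_matrix.shape)
print("'lunch' in vocabulary:", 'lunch' in toy_vectorizer.vocabulary_)
```

This is also why the test matrix above has 782 columns: one per term seen in the training emails.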

## Train a Naive Bayes Classifier
Use [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) to train a Naive Bayes classifier

In [20]:
from sklearn.naive_bayes import MultinomialNB
import pickle

def save(vectorizer, classifier):
    '''
    save vectorizer and classifier to disk
    '''
    with open('model.pkl', 'wb') as file:
        pickle.dump((vectorizer, classifier), file)

def load():
    '''
    load vectorizer and classifier from disk
    '''
    with open('model.pkl', 'rb') as file:
        vectorizer, classifier = pickle.load(file)
    return vectorizer, classifier

# train a classifier
classifier = MultinomialNB()
classifier.fit(features_train_transformed, labels_train)

# save the trained model
save(vectorizer, classifier)

# score the classifier accuracy
print("classifier accuracy {:.2f}%".format(classifier.score(
    features_test_transformed, labels_test) * 100))




classifier accuracy 1.00%
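
The notebook keeps the vectorizer and the classifier as two separate objects and pickles them together. One possible alternative, shown here only as a sketch on made-up toy data, is to chain them with scikit-learn's `make_pipeline` so that raw text goes in and predictions come out of a single object.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy training data (made up for this example): 1 = spam, 0 = ham
toy_texts = [
    "win free prize now",
    "free cash prize",
    "lunch meeting tomorrow",
    "project meeting notes",
]
toy_labels = [1, 1, 0, 0]

# the pipeline vectorizes the text and then fits the classifier in one call
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(toy_texts, toy_labels)

toy_predictions = model.predict(["free prize", "meeting notes"])
print(toy_predictions)
```

A fitted pipeline can be pickled as one object, which avoids keeping the vectorizer and classifier in sync by hand.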


## Calculate F1 Score
Calculate [F1 score](https://en.wikipedia.org/wiki/F1_score) using [sklearn metrics.f1_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)

In [25]:
from sklearn import metrics

prediction = classifier.predict(features_test_transformed)
fscore = metrics.f1_score(labels_test, prediction, average='macro')
print("F score {:.2f}".format(fscore))
print(prediction)



F score 1.00
[0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 1 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 1 0 1
 0 0 1 0 1 1 1 0 1 1 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 1
 1 0 1 1 0 0 1 1 1 0 0 0 0 0 1 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 1 1 0 0 1
 0 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0
 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 0 0
 1 0 0 1 1 1 0 1 1 1 0 0 0 0 0 0 1 1 0 1 1 0 0 0 1 0 1 1 0 0 1 0 1 1 0 1 0
 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0
 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 1 1 1 1 1 1 0 1
 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 1 0 1 1 1 1 0 1 1 0 0 1 1 0 0 1 0 1 1 0 0
 1 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0
 0 1 1 0 0 1 0 1 0 0 1 1 0 1 1 0 1 0 1 1 0 0 1 0 0 1 1 1 1 1 0 1 0 1 0 1 1
 0 1 0 0 1 1 0 0 1 0 1 1 0]
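
To make the `average='macro'` setting concrete, here is a pure-Python sketch of the computation: F1 is the harmonic mean of precision and recall for one class, and the macro average is the unweighted mean of the per-class F1 scores. The helper names and the toy labels are hypothetical, and this is an illustration of the formula rather than scikit-learn's implementation.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# toy labels (made up for this example): one spam missed, one ham flagged
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print("macro F1: {:.2f}".format(macro_f1(y_true, y_pred)))  # macro F1: 0.67
```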


## Using the trained classifier for prediction

In [28]:
vectorizer, classifier = load()

print('\nPerform a test')
#email_input = 'enter your email here'
email_input = ['<p>And once that. His eyes but tapping tempest or shore clasp other me bird perched murmured nothing fancy was caught. And with the the get. Lenore and pallas that nothing. Beast napping my hope i pondered back lamplight its pallas that this it nameless above. Flown came perfumed nevermore answer fiend by door and being tempter fluttered of startled. Into a whether in was fancy bird came more. Lenore this fiend chamber stock floor tempest my disaster all gently thing his surely burden this devil bird. Peering swung my that and bird on back tapping with back be once f']
#email_input=['Fiend this have thy and my forget thrilled or. The at ungainly followed and we still in above flirt unbrokenquit bust silken. Is something discourse flitting shadow the and tis. Nearly stillness plume lining the. Raven me of the. Louder smiling flutter shall more and one surcease scarce and so smiling i that into fiery. Implore before is here once door wandering suddenly spoken and the.</p><p>No presently velvet still at i pallas plainly swung my tapping was fantastic. Aptly stronger oer chamber this and. Beguiling the leave we something ease lie is. And sculptured door songs came is this of your my then of for tapping soul late raven whose sainted. Morrow murmured more more this surely he of with thy lies my door forget i have cushions. Stronger more by implore this there sat burning farther the and. It door and of said my whom this. Tinkled quoth the rapping shaven off to. Tis utters tufted']
email_input_transformed = vectorizer.transform(email_input)
prediction = classifier.predict(email_input_transformed)

print('EMAIL:', email_input)
# predict() returns an array with one label per input; we passed a single email
print('The email is', 'SPAM' if prediction[0] == 1 else 'HAM')




Perform a test
EMAIL: ['<p>And once that. His eyes but tapping tempest or shore clasp other me bird perched murmured nothing fancy was caught. And with the the get. Lenore and pallas that nothing. Beast napping my hope i pondered back lamplight its pallas that this it nameless above. Flown came perfumed nevermore answer fiend by door and being tempter fluttered of startled. Into a whether in was fancy bird came more. Lenore this fiend chamber stock floor tempest my disaster all gently thing his surely burden this devil bird. Peering swung my that and bird on back tapping with back be once f']
The email is HAM


In [7]:
vectorizer, classifier = load()

print('\nPerform a test')
#email_input = 'enter your email here'
email_input = ['<p>running adversity childe he dear disporting sought fellow longdeserted a true on. Low loved had lines sighed childe the shameless. Glorious of nor sister to or forgot the waste and aye wrong chttery bad win</p>']
email_input_transformed = vectorizer.transform(email_input)
prediction = classifier.predict(email_input_transformed)

print('EMAIL:', email_input)
# predict() returns an array with one label per input; we passed a single email
print('The email is', 'SPAM' if prediction[0] == 1 else 'HAM')



Perform a test
EMAIL: ['<p>running adversity childe he dear disporting sought fellow longdeserted a true on. Low loved had lines sighed childe the shameless. Glorious of nor sister to or forgot the waste and aye wrong chttery bad win</p>']
The email is SPAM
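
`predict()` returns only the most likely label. If a confidence value is useful, `MultinomialNB` also provides `predict_proba()`, which returns one probability per class in the order given by `classes_`. The sketch below trains on made-up toy data (`toy_texts`, `toy_labels`) rather than loading `model.pkl`, so it runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy training data (made up for this example): 1 = spam, 0 = ham
toy_texts = [
    "win free prize now",
    "free cash prize",
    "lunch meeting tomorrow",
    "project meeting notes",
]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# one row per input email, one column per class
email = ["free prize meeting"]
probabilities = toy_classifier.predict_proba(toy_vectorizer.transform(email))[0]

# classes_ gives the column order, so look up which column belongs to spam (1)
spam_column = list(toy_classifier.classes_).index(1)
print("P(spam) = {:.2f}".format(probabilities[spam_column]))
```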
