## Project Unit 2 / Lesson 3 / Project 4 - Iterate and Evaluate Your Classifier

It's time to revisit your classifier from the previous assignment. Using the evaluation techniques we've covered here, look at your classifier's performance in more detail. Then go back and iterate by engineering new features, removing poor features, or tuning parameters. Repeat this process until you have five different versions of your classifier. Once you've iterated, answer these questions to compare the performance of each:

Do any of your classifiers seem to overfit?
Which seem to perform the best? Why?
Which features seemed to be most impactful to performance?
Write up your iterations and answers to the above questions in a few pages. Submit a link below and go over it with your mentor to see if they have any other ideas on how you could improve your classifier's performance.
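
As a reference point for the evaluation step, here is a minimal sketch of one way to look at performance in more detail than a raw error count. It assumes the techniques referred to above include a confusion matrix with sensitivity and specificity; the toy feature matrix `X` and labels `y` are hypothetical and are not drawn from the SMS data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: 8 messages, 2 binary keyword features (1 = keyword present).
X = np.array([[1, 0], [1, 1], [0, 0], [0, 0], [0, 1], [1, 0], [0, 0], [0, 0]])
y = np.array([1, 1, 0, 0, 0, 1, 1, 0])  # 1 = spam, 0 = ham

model = BernoulliNB().fit(X, y)
y_pred = model.predict(X)

# Rows are actual classes, columns are predicted classes, ordered [ham, spam].
tn, fp, fn, tp = confusion_matrix(y, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # share of spam that was caught
specificity = tn / (tn + fp)  # share of ham that was correctly left alone

print("sensitivity:", sensitivity)
print("specificity:", specificity)
```

Because spam is the minority class in most message collections, these two rates can tell a different story from overall accuracy.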

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import cross_val_score

In [2]:
pd.options.display.max_colwidth = 100

In [3]:
def get_file_encoding(fn):
    import chardet
        
    with open(fn, 'rb') as f:
        content = f.read()

    charset = chardet.detect(content)
    # {'encoding': 'EUC-JP', 'confidence': 0.99}
    #print("character set = {}".format(charset['encoding']))
    
    return charset['encoding']

In [4]:
# Let's process the data
def open_and_load_file (filename, columnnames):
    file_encoding = get_file_encoding(filename)
    df = pd.read_csv(filename, delimiter= '\t', header=None, encoding=file_encoding) 
    df.columns = columnnames
    return df, file_encoding

In [5]:
def run_Bernoulli_supervised_learning(d1, t1):
    # Our data is binary / boolean, so we're importing the Bernoulli classifier.
    from sklearn.naive_bayes import BernoulliNB

    # Instantiate our model and store it in a new variable.
    global bnb
    
    bnb = BernoulliNB()

    # Fit our model to the data.
    bnb.fit(d1, t1)

    # Classify, storing the result in a new variable.
    y_pred = bnb.predict(d1)

    # Display our results.
    return_message = "Number of mislabeled points out of a total {} points : {}".format(
        d1.shape[0],
        (t1 != y_pred).sum()
    )

    return (return_message, d1.shape[0], (t1 != y_pred).sum(), (t1 != y_pred).sum()/d1.shape[0] )

In [6]:
def do_the_keywords():
    for key in keywords:
        # Add spaces around the key so that we match the whole word,
        # not just a substring inside a longer word
        sms_raw[str(key)] = sms_raw.message.str.contains(' ' + str(key) + ' ', case=False)
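
Padding the key with spaces has a side effect: a keyword at the very start or end of a message, or next to punctuation, is not matched. The sketch below compares the space-padded match with a regex word-boundary match. The sample messages are hypothetical, and the `\b` approach is offered as an alternative, not as what this notebook uses. Note that `\b` only works for keys that begin and end with a word character, so it would not suit a key such as `'!!!'`.

```python
import re
import pandas as pd

# Hypothetical sample messages, not taken from the SMS data set.
messages = pd.Series([
    "Free entry today",   # keyword at the start
    "This is free!",      # keyword followed by punctuation
    "Feel carefree now",  # keyword only as part of a longer word
    "call me",            # keyword absent
])

key = "free"

# Space-padded match, as in do_the_keywords().
padded = messages.str.contains(" " + key + " ", case=False)

# Word-boundary match; re.escape guards against regex metacharacters in the key.
pattern = r"\b" + re.escape(key) + r"\b"
bounded = messages.str.contains(pattern, case=False, regex=True)

print(pd.DataFrame({"message": messages, "padded": padded, "bounded": bounded}))
```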

In [7]:
def do_the_heatmap(subplotno):
    plt.subplot(3,1,subplotno) 
    sns.heatmap(sms_raw.corr(), cmap="Blues")

In [8]:
def set_data_and_target():
    global data 
    data = sms_raw[keywords]
    # data.sample(5)
    
    global target 
    target = sms_raw['spam']
    # target.head(10)

In [9]:
# Test your model with different holdout groups.
def run_with_the_holdouts(holdout_percent):
    from sklearn.model_selection import train_test_split
    # Use train_test_split to create the necessary training and test groups
    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=holdout_percent, random_state=20)
    print('With {0:.0%} Holdout: '.format(holdout_percent*1) + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
    print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

In [10]:
def run_the_cross_validations():
    from sklearn.model_selection import cross_val_score
    return cross_val_score(bnb, data, target, cv=10)

In [11]:
def return_diff_min_max_cv_score(cv_in):
    print("min cv_score is {0:.4%}, and max is {1:.4%}, and delta is {2:.4%}".format(cv_in.min(),
                                                                      cv_in.max(), 
                                                                      cv_in.max()-cv_in.min()))

In [12]:
def refresh_data_frame(column_list):
    data_frame = pd.read_csv(data_path, delimiter= '\t', header=None)
    data_frame.columns = column_list
    return data_frame

In [13]:
data_path = ("https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
             "master/sms_spam_collection/SMSSpamCollection"
            )

#### Iteration 0 - the one provided in the lesson

In [14]:
print("Here is iteration 0")
# let's get the data
sms_raw = refresh_data_frame(['spam', 'message'])

keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent']

do_the_keywords()
# set_data_and_target()

sms_raw['allcaps'] = sms_raw.message.str.isupper()
sms_raw['spam'] = (sms_raw['spam'] == 'spam')

data = sms_raw[keywords + ['allcaps']]
target = sms_raw['spam']

xmsg, y, z, z2 = run_Bernoulli_supervised_learning(data, target)

run_with_the_holdouts(0.20)
cv_score=cross_val_score(bnb, data, target, cv=10)
print(cv_score)
return_diff_min_max_cv_score(cv_score)

Here is iteration 0
With 20% Holdout: 0.884304932735426
Testing on Sample: 0.8916008614501076
[0.89784946 0.89426523 0.89426523 0.890681   0.89605735 0.89048474
 0.88150808 0.89028777 0.88489209 0.89568345]
min cv_score is 88.1508%, and max is 89.7849%, and delta is 1.6341%
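
The min/max spread above depends on the two most extreme folds. A complementary summary, sketched here as a suggestion and not as part of the original workflow, is the mean and standard deviation across all folds. The array below is copied from the fold scores printed for Iteration 0.

```python
import numpy as np

# Fold scores copied from the Iteration 0 output above.
cv_score = np.array([0.89784946, 0.89426523, 0.89426523, 0.890681, 0.89605735,
                     0.89048474, 0.88150808, 0.89028777, 0.88489209, 0.89568345])

mean_score = cv_score.mean()
std_score = cv_score.std()

print("mean cv_score is {0:.4%}, and std is {1:.4%}".format(mean_score, std_score))
```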


#### Iteration 1

In [15]:
print("Here is iteration 1")
# let's get the data
sms_raw = refresh_data_frame(['spam', 'message'])

keywords = [ 'buy', 'free', 'cash', 'urgent', "i'm", 'u feel', 'please confirm', "if you reply no",
           'to claim', 'UR']

do_the_keywords()

sms_raw['alllower'] = sms_raw.message.str.islower()

sms_raw['spam'] = (sms_raw['spam'] == 'spam')

data = sms_raw[keywords + ['alllower']]
target = sms_raw['spam']

xmsg, y, z, z2 = run_Bernoulli_supervised_learning(data, target)

run_with_the_holdouts(.15)
cv_score=cross_val_score(bnb, data, target, cv=20)
print(cv_score)
type(cv_score)
return_diff_min_max_cv_score(cv_score)

Here is iteration 1
With 15% Holdout: 0.888755980861244
Testing on Sample: 0.8953697056712132
[0.89642857 0.91071429 0.88928571 0.89285714 0.89285714 0.89605735
 0.89964158 0.9028777  0.88848921 0.90647482 0.88848921 0.89928058
 0.88489209 0.89208633 0.89928058 0.89208633 0.88848921 0.89568345
 0.88489209 0.90647482]
min cv_score is 88.4892%, and max is 91.0714%, and delta is 2.5822%


#### Iteration 2

In [16]:
print("Here is iteration 2")
# let's get the data
sms_raw = refresh_data_frame(['spam', 'message'])

keywords = [ '!!!', 'Please call', 'Gent', 'PRIVATE', 'sexy', 'unlimited', 'free2day', 'sex', 'matched',
           'to order']

do_the_keywords()

sms_raw['alllower'] = sms_raw.message.str.islower()
sms_raw['allcaps'] = sms_raw.message.str.isupper()


sms_raw['spam'] = (sms_raw['spam'] == 'spam')

data = sms_raw[keywords + ['allcaps'] + ['alllower']]

target = sms_raw['spam']

xmsg, y, z, z2 = run_Bernoulli_supervised_learning(data, target)

run_with_the_holdouts(.15)
cv_score=cross_val_score(bnb, data, target, cv=20)
print(cv_score)
return_diff_min_max_cv_score(cv_score)

Here is iteration 2
With 15% Holdout: 0.8552631578947368
Testing on Sample: 0.8749102656137832
[0.875      0.86785714 0.86428571 0.86785714 0.87142857 0.86738351
 0.87455197 0.86330935 0.88489209 0.88489209 0.87410072 0.88129496
 0.87410072 0.8705036  0.88129496 0.87769784 0.88489209 0.8705036
 0.8705036  0.88129496]
min cv_score is 86.3309%, and max is 88.4892%, and delta is 2.1583%


#### Iteration 3

In [17]:
print("Here is iteration 3")
# let's get the data
sms_raw = refresh_data_frame(['spam', 'message'])

keywords = [ 'free','winner', 'follow instructions', 'txt', 'please call', 'urgent']

do_the_keywords()

sms_raw['allcaps'] = sms_raw.message.str.isupper()


sms_raw['spam'] = (sms_raw['spam'] == 'spam')

data = sms_raw[keywords + ['allcaps']]

target = sms_raw['spam']

xmsg, y, z, z2 = run_Bernoulli_supervised_learning(data, target)

run_with_the_holdouts(.15)
cv_score=cross_val_score(bnb, data, target, cv=20)
print(cv_score)
return_diff_min_max_cv_score(cv_score)

Here is iteration 3
With 15% Holdout: 0.8971291866028708
Testing on Sample: 0.9038047379755922
[0.90357143 0.91785714 0.89642857 0.90714286 0.92142857 0.91397849
 0.89964158 0.9028777  0.91726619 0.92446043 0.89208633 0.91366906
 0.9028777  0.89208633 0.88848921 0.89568345 0.89208633 0.88848921
 0.89928058 0.9028777 ]
min cv_score is 88.8489%, and max is 92.4460%, and delta is 3.5971%


#### Iteration 4

In [18]:
print("Here is iteration 4")
# let's get the data
sms_raw = refresh_data_frame(['spam', 'message'])

keywords = ['have won', 'will be charged', 'dating service', 'apply', 'pleases call',
           'to unsubscribe', 'landline', 'to claim', 'private', 'fantasies', 'to stop',
           'claim code']

do_the_keywords()

sms_raw['allcaps'] = sms_raw.message.str.isupper()


sms_raw['spam'] = (sms_raw['spam'] == 'spam')

data = sms_raw[keywords + ['allcaps']]

target = sms_raw['spam']

xmsg, y, z, z2 = run_Bernoulli_supervised_learning(data, target)

run_with_the_holdouts(.15)
cv_score=cross_val_score(bnb, data, target, cv=20)
print(cv_score)
return_diff_min_max_cv_score(cv_score)

Here is iteration 4
With 15% Holdout: 0.8755980861244019
Testing on Sample: 0.8907035175879398
[0.91071429 0.88571429 0.88571429 0.87857143 0.88571429 0.87455197
 0.90322581 0.90647482 0.88489209 0.87410072 0.86690647 0.92086331
 0.88129496 0.9028777  0.89208633 0.89208633 0.88848921 0.89208633
 0.88848921 0.89208633]
min cv_score is 86.6906%, and max is 92.0863%, and delta is 5.3957%


#### Iteration 5

In [19]:
print("Here is iteration 5")
# let's get the data
sms_raw = refresh_data_frame(['spam', 'message'])

# Earlier candidate list, superseded by the assignment below:
# keywords = ['SMS SERVICES', 'to get', 'shit', 'sex', 'right now', 'important message',
#            'latest offers']
keywords = ['reply or call', 'your number matches', 'see her', 'winner', 'reward']

do_the_keywords()

sms_raw['allcaps'] = sms_raw.message.str.isupper()


sms_raw['spam'] = (sms_raw['spam'] == 'spam')

data = sms_raw[keywords + ['allcaps']]

target = sms_raw['spam']

xmsg, y, z, z2 = run_Bernoulli_supervised_learning(data, target)

run_with_the_holdouts(.50)
cv_score=cross_val_score(bnb, data, target, cv=20)
print(cv_score)
return_diff_min_max_cv_score(cv_score)

Here is iteration 5
With 50% Holdout: 0.8743718592964824
Testing on Sample: 0.8715003589375449
[0.87142857 0.86785714 0.86785714 0.86785714 0.87142857 0.86379928
 0.86738351 0.8705036  0.86690647 0.8705036  0.86690647 0.8705036
 0.88129496 0.87769784 0.87410072 0.87410072 0.87769784 0.87769784
 0.8705036  0.87410072]
min cv_score is 86.3799%, and max is 88.1295%, and delta is 1.7496%


## Evaluation
### Do any of your classifiers seem to overfit?
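Yes. Iterations 3 and 4 each showed a spread of more than 3 percentage points between their minimum and maximum cross-validation scores (about 3.60 and 5.40 points, respectively), which suggests their accuracy depends heavily on which messages land in each fold.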
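
Another common signal of overfitting is a gap between in-sample accuracy and holdout accuracy. The sketch below computes that gap from the numbers printed for each iteration. Treat it as a rough comparison only: the "Testing on Sample" figure comes from a model refit on the full data set, and the holdout size differs between iterations (20%, 15%, and 50%).

```python
# (holdout accuracy, in-sample accuracy), copied from each iteration's printed output.
scores = {
    0: (0.884304932735426, 0.8916008614501076),
    1: (0.888755980861244, 0.8953697056712132),
    2: (0.8552631578947368, 0.8749102656137832),
    3: (0.8971291866028708, 0.9038047379755922),
    4: (0.8755980861244019, 0.8907035175879398),
    5: (0.8743718592964824, 0.8715003589375449),
}

# A positive gap means the model scores better on data it has already seen.
gaps = {i: in_sample - holdout for i, (holdout, in_sample) in scores.items()}

for i, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print("Iteration {}: gap = {:.2%}".format(i, gap))

largest_gap_iteration = max(gaps, key=gaps.get)
```

By this measure, Iterations 2 and 4 show the largest gaps (roughly 2.0 and 1.5 percentage points), so the two signals do not single out exactly the same iterations.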
### Which seem to perform the best? Why?
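Iteration 3 scored highest, at about 89.7% on the 15% holdout and 90.4% in-sample, using only six keywords plus the all-caps flag. Having too many keywords seemed to hurt performance and to cause overfitting. Less seems to be more, as in the original example from the lesson.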
### Which features seemed to be most impactful to performance?
Multi-word keywords seem to help, as does limiting the list to a few of the most relevant ones.
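
One way to probe feature impact more directly is a drop-one-feature comparison: refit without each feature and see how far the mean cross-validation accuracy falls. This is a minimal sketch on synthetic data, not a result from this notebook; the `strong` and `noise` columns and the helper `mean_cv_accuracy` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
n = 400

# Synthetic labels: roughly 30% of the rows are "spam".
target = rng.random(n) < 0.3

# 'strong' agrees with the label about 90% of the time; 'noise' is unrelated to it.
features = pd.DataFrame({
    "strong": np.where(rng.random(n) < 0.9, target, ~target),
    "noise": rng.random(n) < 0.5,
}).astype(int)

def mean_cv_accuracy(columns):
    """Mean 5-fold accuracy of a fresh BernoulliNB on the given columns."""
    return cross_val_score(BernoulliNB(), features[columns], target, cv=5).mean()

baseline = mean_cv_accuracy(list(features.columns))

# Accuracy lost when each feature is removed; a larger drop suggests more impact.
drops = {
    col: baseline - mean_cv_accuracy([c for c in features.columns if c != col])
    for col in features.columns
}
print(drops)
```

Applied to the SMS data, the same loop could run over `keywords + ['allcaps']` with `sms_raw['spam']` as the target.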