Unit 2, lesson 3.3
This notebook uses the SMS spam data set to check the model for overfitting by means of cross-validation. In the boot camp we used the helper from scikit-learn; below that I write my own code to produce the same kind of result without the library.

In [1]:
import pandas as pd
import sklearn
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Grab and process the raw data.
data_path = ("https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
             "master/sms_spam_collection/SMSSpamCollection"
            )
sms_raw = pd.read_csv(data_path, delimiter= '\t', header=None)
sms_raw.columns = ['spam', 'message']

# Enumerate our spammy keywords.
keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent']

for key in keywords:
    sms_raw[str(key)] = sms_raw.message.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

sms_raw['allcaps'] = sms_raw.message.str.isupper()
sms_raw['spam'] = (sms_raw['spam'] == 'spam')
data = sms_raw[keywords + ['allcaps']]
target = sms_raw['spam']

from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
y_pred = bnb.fit(data, target).predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 5572 points : 604


In [3]:
# Test your model with different holdout groups.
from sklearn.model_selection import train_test_split

# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

With 20% Holdout: 0.884304932735426
Testing on Sample: 0.8916008614501076


In [4]:
from sklearn.model_selection import cross_val_score
cross_val_score(bnb, data, target, cv=10)

array([0.89784946, 0.89426523, 0.89426523, 0.890681  , 0.89605735,
       0.89048474, 0.88150808, 0.89028777, 0.88489209, 0.89568345])
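
The ten fold scores above can be condensed into a mean and a spread, which makes it easier to judge whether the folds disagree enough to suggest overfitting. A minimal sketch, with the scores copied by hand from the output above (the `scores` list exists only for this illustration):

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above.
scores = [0.89784946, 0.89426523, 0.89426523, 0.890681, 0.89605735,
          0.89048474, 0.88150808, 0.89028777, 0.88489209, 0.89568345]

print("mean accuracy: {:.4f}".format(mean(scores)))
print("sample std:    {:.4f}".format(stdev(scores)))
print("range:         {:.4f} to {:.4f}".format(min(scores), max(scores)))
```

A small spread relative to the mean suggests the model behaves similarly on every fold.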

### My code:

In [5]:
# check the length of the data
len_data = len(data)

# integer-divide by 10 to get the size of one fold
len_data = len_data // 10

# assign 10th of the data to each variable
data1 = data.iloc[0 : len_data]
data2 = data.iloc[len_data : len_data*2]
data3 = data.iloc[len_data*2 : len_data*3]
data4 = data.iloc[len_data*3 : len_data*4]
data5 = data.iloc[len_data*4 : len_data*5]
data6 = data.iloc[len_data*5 : len_data*6]
data7 = data.iloc[len_data*6 : len_data*7]
data8 = data.iloc[len_data*7 : len_data*8]
data9 = data.iloc[len_data*8 : len_data*9]
data10 = data.iloc[len_data*9 : ]

# assign 10th of the target to each variable
target1 = target.iloc[0 : len_data]
target2 = target.iloc[len_data : len_data*2]
target3 = target.iloc[len_data*2 : len_data*3]
target4 = target.iloc[len_data*3 : len_data*4]
target5 = target.iloc[len_data*4 : len_data*5]
target6 = target.iloc[len_data*5 : len_data*6]
target7 = target.iloc[len_data*6 : len_data*7]
target8 = target.iloc[len_data*7 : len_data*8]
target9 = target.iloc[len_data*8 : len_data*9]
target10 = target.iloc[len_data*9 : ]


In [6]:
# Fit BernoulliNB on each group, score it on that same group,
# and store the accuracy in the list called "result"
result = []
result.append(round(bnb.fit(data1, target1).score(data1, target1), 7))
result.append(round(bnb.fit(data2, target2).score(data2, target2), 7))
result.append(round(bnb.fit(data3, target3).score(data3, target3), 7))
result.append(round(bnb.fit(data4, target4).score(data4, target4), 7))
result.append(round(bnb.fit(data5, target5).score(data5, target5), 7))
result.append(round(bnb.fit(data6, target6).score(data6, target6), 7))
result.append(round(bnb.fit(data7, target7).score(data7, target7), 7))
result.append(round(bnb.fit(data8, target8).score(data8, target8), 7))
result.append(round(bnb.fit(data9, target9).score(data9, target9), 7))
result.append(round(bnb.fit(data10, target10).score(data10, target10), 7))

print(result)


[0.8904847, 0.8761221, 0.9066427, 0.8958707, 0.8994614, 0.9012567, 0.8850987, 0.8833034, 0.8707361, 0.8962433]
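
The scores above come from fitting and scoring on the same tenth of the data, so they are in-sample accuracies rather than held-out ones. A sketch of a hold-out version follows, assuming contiguous, unshuffled folds. `kfold_indices` and `manual_cross_val` are hypothetical helper names introduced here, not part of scikit-learn, and `manual_cross_val` assumes pandas objects (it uses `.iloc`) plus a model with `fit` and `score`.

```python
def kfold_indices(n_rows, k=10):
    """Split range(n_rows) into k contiguous folds; return (train, test) index lists."""
    base, extra = divmod(n_rows, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds take one additional row so every row is used once.
        stop = start + base + (1 if i < extra else 0)
        folds.append(list(range(start, stop)))
        start = stop
    splits = []
    for i, test_idx in enumerate(folds):
        # Training rows are all rows that are not in the held-out fold.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


def manual_cross_val(model, data, target, k=10):
    """Fit on k-1 folds and score on the held-out fold, once per fold."""
    scores = []
    for train_idx, test_idx in kfold_indices(len(data), k):
        model.fit(data.iloc[train_idx], target.iloc[train_idx])
        scores.append(model.score(data.iloc[test_idx], target.iloc[test_idx]))
    return scores


# Held-out fold sizes for a 5572-row data set split ten ways.
print([len(test) for _, test in kfold_indices(5572, 10)])
```

With the notebook's objects this would be called as `manual_cross_val(bnb, data, target)`. The scores need not match `cross_val_score` exactly, since for a classifier it stratifies the folds by class.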


### A closer look at the model

In [7]:
# see the percentage of spam in the raw file
spam_prec = (len(sms_raw[sms_raw['spam'] == True])/len(sms_raw)) * 100
print("{}% of the data is actual spam".format(round(spam_prec, 3)))

13.406% of the data is actual spam


As can be seen, only a small share of the data is actually spam. This is a class imbalance issue. <br>That is, even a model that labels every message as 'not spam' would reach an accuracy of about 86.6%. <br>I will try to address this by oversampling the spam messages (by a factor of 2) and undersampling the non-spam messages.
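
The baseline figure can be worked out directly from the spam share. A minimal sketch, with the numbers copied by hand from the outputs above (the variable names are only illustrative):

```python
spam_rate = 0.13406                   # share of spam reported above
in_sample_accuracy = 0.8916008614501076  # "Testing on Sample" score from earlier

# A classifier that always answers "not spam" is right on every ham message
# and wrong on every spam message.
baseline_accuracy = 1 - spam_rate
lift = in_sample_accuracy - baseline_accuracy

print("Always-ham baseline accuracy: {:.1%}".format(baseline_accuracy))
print("Model gain over the baseline: {:.1%} points".format(lift))
```

Seen this way, the model's accuracy of about 89% is only a few percentage points better than never flagging anything.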

In [8]:
# create two dfs of only spam and only not spam so I could control the amounts
spam_df = sms_raw[sms_raw['spam'] == True]
not_spam_df = sms_raw[sms_raw['spam'] == False]

# new df with 20% fewer non-spam messages and twice as many spam messages
not_spam_df = not_spam_df.sample(frac=0.8)
spam_df = spam_df.sample(frac=2, replace=True)
frames = [not_spam_df, spam_df]
new_df = pd.concat(frames)

# randomize the data
new_df = sklearn.utils.shuffle(new_df)

# reset the index (from 0 to full length) and discard the old one
new_df = new_df.reset_index(drop=True)

new_df.head(10)

|   | spam | message | click | offer | winner | buy | free | cash | urgent | allcaps |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | Tell them the drug dealer's getting impatient | False | False | False | False | False | False | False | False |
| 1 | True | Got what it takes 2 take part in the WRC Rally... | False | False | False | False | False | False | False | False |
| 2 | False | Jus finish blowing my hair. U finish dinner al... | False | False | False | False | False | False | False | False |
| 3 | True | Natalie (20/F) is inviting you to be her frien... | False | False | False | False | False | False | False | False |
| 4 | True | Today's Offer! Claim ur £150 worth of discount... | False | False | False | False | False | False | False | False |
| 5 | True | Promotion Number: 8714714 - UR awarded a City ... | False | False | False | False | False | False | False | False |
| 6 | False | Future is not what we planned for tomorrow....... | False | False | False | False | False | False | False | False |
| 7 | False | House-Maid is the murderer, coz the man was mu... | False | False | False | False | False | False | False | False |
| 8 | False | Morning only i can ok. | False | False | False | False | False | False | False | False |
| 9 | False | The beauty of life is in next second.. which h... | False | False | False | False | False | False | False | False |


In [9]:
# check that we didn't get 'NaN' values, and print the new spam and 'ham' counts
# to confirm that the oversampling and undersampling worked.
print(new_df.isnull().sum())

print("The size of real spam is {}, and the size of not spam is {}".format(len(new_df[new_df['spam'] == True]), 
                                                                          len(new_df[new_df['spam'] == False])))
print("The spam messages are now {} of the total dataset".format(len(new_df[new_df['spam'] == True]) / 
                                                                  len(new_df)))

spam       0
message    0
click      0
offer      0
winner     0
buy        0
free       0
cash       0
urgent     0
allcaps    0
dtype: int64
The size of real spam is 1494, and the size of not spam is 3860
The spam messages are now 0.2790437056406425 of the total dataset


Run the analysis again and check the results.

In [10]:
# the target
new_target = new_df['spam']

# the data without the unnecessary columns
new_data = new_df.drop(labels=['spam', 'message'], axis=1)

new_y_pred = bnb.fit(new_data, new_target).predict(new_data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    new_data.shape[0],
    (new_target != new_y_pred).sum()
))



Number of mislabeled points out of a total 5354 points : 1143


In [11]:
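# confusion matrix (imported explicitly rather than relying on sklearn.metrics being loaded)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(new_target, new_y_pred))

# this result is much worse than the previous one; what am I doing wrong?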

[[3812   48]
 [1095  399]]
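
The matrix above can be unpacked into per-class rates, which may help with the question in the cell. A sketch, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes, with `False` before `True`); the four counts are copied by hand from the output:

```python
# Counts copied from the confusion matrix above: [[tn, fp], [fn, tp]].
tn, fp = 3812, 48
fn, tp = 1095, 399

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # share of actual spam that was caught
specificity = tn / (tn + fp)   # share of actual ham that was left alone
precision = tp / (tp + fp)     # share of spam predictions that were right

print("mislabeled:  {} of {}".format(fp + fn, total))
print("accuracy:    {:.3f}".format(accuracy))
print("recall:      {:.3f}".format(recall))
print("specificity: {:.3f}".format(specificity))
print("precision:   {:.3f}".format(precision))
```

The 1143 mislabeled points are 48 false positives plus 1095 false negatives, so most of the errors are spam messages the model misses. Since the resampled set holds twice as many spam rows, a similar miss rate would produce more errors in absolute terms; this is one possible reading, not a verified diagnosis.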
