## Challenge: Iterate and evaluate your classifier
In this sheet, I am going to look into the performance of a classifier I made in a previous task. Then I am going to edit it and see if I can make it better. Once I am done, I will answer the questions:

Do any of your classifiers seem to overfit?

Which seem to perform the best? Why?

What features seemed to be most impactful to performance?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [2]:
#Here I load the reviews and their scores, plus two word lists. One list contains roughly 2,200 buzz words that we define
#as "good" and the other roughly 3,900 buzz words that we define as "bad." Manually selecting buzz words from the reviews
#would take far too long, so these lists are a good place to start.
da = pd.read_csv(r"C:\Users\jmfra\OneDrive\Documents\Thinkful Data Science Files\sentiment labelled sentences\imdb_labelled.txt", delimiter= '\t|\t1\n', header=None)
da.columns = ['text', 'score']
good = pd.read_csv(r"C:\Users\jmfra\OneDrive\Documents\Thinkful Data Science Files\2.2.7\positive.txt", header=None)
good.columns = ['x']
bad = pd.read_csv(r"C:\Users\jmfra\OneDrive\Documents\Thinkful Data Science Files\2.2.7\negative.txt", header=None)
bad.columns = ['x']



In [3]:
#The two lists above are in a pandas dataframe. They are easier to use in a list, so I have converted them to one.
goodwords = list(good.values.flatten())
badwords = list(bad.values.flatten())

In [4]:
#Some quick data cleaning.
da['text'] = da.text.str.lower()
da['text'] = da.text.str.strip()
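#Remove literal periods; regex=False keeps '.' from being read as a regex wildcard.
da['text'] = da.text.str.replace('.', '', regex=False)
#da_t is a second name for the same DataFrame, not a copy.
da_t = da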

In [5]:
#editing the words after some trials to improve performance
goodwords.append('best')
goodwords.append('rivet')
goodwords.append('cool')
goodwords.append('10')
goodwords.append('promote')
goodwords.append('give this one a look')
goodwords.append('pretty decent')
goodwords.append('nostalgia')
goodwords.append('applause')
goodwords.append('liked')
goodwords.append('go and see')
goodwords.append('must see')
goodwords.append('fascinated')
goodwords.append('amaze')
goodwords.append('pearls')
goodwords.append('cult')
goodwords.append('taped')
goodwords.append('powerful')
goodwords.append('cutting edge')
goodwords.append('to learn more')
goodwords.append('unique')
goodwords.append('just right')
goodwords.append('really likes')
goodwords.append('not bad')
goodwords.append('impressed')
goodwords = [x for x in goodwords if x != 'improved']
goodwords = [x for x in goodwords if x != 'free']

In [6]:
badwords.append('absolutely no')
badwords.append('dislike')
badwords.append('disliked')
badwords.append('pillow')
badwords.append('crap')
badwords.append('shattered')
badwords.append('not a pleasant')
badwords.append('too many')
badwords.append('uninteresting')
badwords.append('unremarkable')
badwords.append('abstruse')
badwords.append('kill')
badwords.append('sucked')
badwords.append('witticisms')
badwords.append('1/10')
badwords.append('0/10')
badwords.append('embarrassed')
badwords.append('barely comprehensible')
badwords.append('remotely')
badwords = [x for x in badwords if x != 'less']

In [7]:
#Score is the actual label for each review: 0 is bad and 1 is good. Let's turn it into a boolean column, True for a bad
#review and False for a good one, to make it easier to use later.
da_t['scorenull'] = (da_t['score'] == 0)

In [8]:
#Since there are thousands of entries in the good words list, it is easiest to write a for loop that checks each review
#for each word. The loop adds one column per word to the original dataset, holding True where the review contains that
#word and False where it does not.
keywords_good = goodwords

for key in keywords_good:
    da_t[str(key)] = da_t.text.str.contains(str(key), case=False)

In [9]:
#Same concept as above, applied to the bad words list: True where the review contains the word.
keywords_bad = badwords

for key in keywords_bad:
    da_t[str(key)] = da_t.text.str.contains(str(key), case=True)
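The two loops above add one column at a time and treat every keyword as a regular expression, which can misfire on entries such as `1/10` or anything containing parentheses. A minimal sketch of an alternative, assuming literal matching is what is wanted: the helper name `keyword_features` and the toy reviews are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def keyword_features(text, keywords):
    """Return one boolean column per keyword: True where the keyword occurs."""
    columns = {
        str(key): text.str.contains(str(key), case=False, regex=False)
        for key in dict.fromkeys(keywords)  # drop duplicate keywords, keep order
    }
    # Build the whole frame in one step instead of inserting columns one by one.
    return pd.concat(columns, axis=1)

# Toy reviews and keywords, made up for illustration only.
reviews = pd.Series(["a great film", "1/10, total crap", "not bad (really)"])
features = keyword_features(reviews, ["great", "crap", "1/10", "(really)"])
print(features)
```

Because the result is a separate DataFrame, the review text and labels stay untouched, and the same helper could be reused for each word list.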

In [10]:
#Storing all of the outputs above in a new dataframe, and the real answer in its own as well.
data = da_t[keywords_good + keywords_bad]
target = da_t['scorenull']

In [11]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

#Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

#Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

#Display our results.
print("Number of correctly labeled points out of a total {} points : {} or a {} success ratio".format(
    data.shape[0],
    (target == y_pred).sum(),
    ((target == y_pred).sum())/data.shape[0]
))
conf_mat = confusion_matrix(target, y_pred)
print('Correct False {}'.format(conf_mat[0,0]), 'Incorrect False {}'.format(conf_mat[0,1]))
print('Incorrect True {}'.format(conf_mat[1,0]), 'Correct True {}'.format(conf_mat[1,1]))
print('Sensitivity {}'.format(conf_mat[1,1]/500))
print('Specificity {}'.format(conf_mat[0,0]/500))

Number of correctly labeled points out of a total 1000 points : 900 or a 0.9 success ratio
Correct False 433 Incorrect False 67
Incorrect True 33 Correct True 467
Sensitivity 0.934
Specificity 0.866
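The sensitivity and specificity above divide by a hard-coded 500, which only works while each class has exactly 500 reviews. A small sketch of the same calculation taken straight from the confusion matrix, using made-up labels rather than the notebook's data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; True is the "positive" class here.
y_true = np.array([True, True, True, False, False, False, False, True])
y_hat = np.array([True, False, True, False, True, False, False, True])

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
sensitivity = tp / (tp + fn)   # share of actual True rows predicted True
specificity = tn / (tn + fp)   # share of actual False rows predicted False
print(sensitivity, specificity)
```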


In [12]:
#After looking through the bernoulli arguments, the only one that will change
#the guesses is alpha which is a smoothing parameter automatically set at 1
bnb = BernoulliNB(alpha=10)

bnb.fit(data, target)

y_pred = bnb.predict(data)

print("Number of correctly labeled points out of a total {} points : {} or a {} success ratio".format(
    data.shape[0],
    (target == y_pred).sum(),
    ((target == y_pred).sum())/data.shape[0]
))
conf_mat = confusion_matrix(target, y_pred)
print('Correct False {}'.format(conf_mat[0,0]), 'Incorrect False {}'.format(conf_mat[0,1]))
print('Incorrect True {}'.format(conf_mat[1,0]), 'Correct True {}'.format(conf_mat[1,1]))
print('Sensitivity {}'.format(conf_mat[1,1]/500))
print('Specificity {}'.format(conf_mat[0,0]/500))

Number of correctly labeled points out of a total 1000 points : 789 or a 0.789 success ratio
Correct False 329 Incorrect False 171
Incorrect True 40 Correct True 460
Sensitivity 0.92
Specificity 0.658


In [13]:
#Since it looks as if increasing alpha lowers our accuracy, let's drop it down
#to its lowest possible value.
bnb = BernoulliNB(alpha=0)

bnb.fit(data, target)

y_pred = bnb.predict(data)

print("Number of correctly labeled points out of a total {} points : {} or a {} success ratio".format(
    data.shape[0],
    (target == y_pred).sum(),
    ((target == y_pred).sum())/data.shape[0]
))
conf_mat = confusion_matrix(target, y_pred)
print('Correct False {}'.format(conf_mat[0,0]), 'Incorrect False {}'.format(conf_mat[0,1]))
print('Incorrect True {}'.format(conf_mat[1,0]), 'Correct True {}'.format(conf_mat[1,1]))
print('Sensitivity {}'.format(conf_mat[1,1]/500))
print('Specificity {}'.format(conf_mat[0,0]/500))

Number of correctly labeled points out of a total 1000 points : 927 or a 0.927 success ratio
Correct False 453 Incorrect False 47
Incorrect True 26 Correct True 474
Sensitivity 0.948
Specificity 0.906
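One way to see why a large alpha hurts here: `BernoulliNB` estimates the probability that a keyword is present in a class as (count + alpha) / (class size + 2 * alpha), so a large alpha pulls every estimate toward 0.5 and washes out rare keywords. The sketch below checks that formula against scikit-learn's fitted attributes on a tiny made-up matrix; it is an illustration, not a rerun of the notebook's data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Tiny made-up data: 4 reviews, 2 keyword columns.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
y = np.array([0, 0, 1, 1])

for alpha in (0.1, 1.0, 10.0):
    model = BernoulliNB(alpha=alpha).fit(X, y)
    # Smoothed estimate of P(keyword present | class):
    # (class rows containing the keyword + alpha) / (class rows + 2 * alpha)
    manual = (model.feature_count_ + alpha) / (model.class_count_[:, None] + 2 * alpha)
    assert np.allclose(np.exp(model.feature_log_prob_), manual)
    print(alpha, manual.round(3))
```

With thousands of keyword columns that each appear in only a handful of reviews, the counts are small relative to alpha=10, which is consistent with the drop in accuracy seen above.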




In [14]:
#No matter the alpha, our specificity is always lower than our sensitivity.
#Specificity in this test is our ability to properly label a False, or positive,
#review. This could be because entries with no matching words in our lists are
#labeled as False based on the way our code was written. Let's try switching
#this to see if it improves our accuracy.

In [15]:
da_n = da
da_n['scorenull'] = (da_n['score'] == 1)
keywords_good = goodwords

for key in keywords_good:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=True)
    
keywords_bad = badwords

for key in keywords_bad:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=False)
    
data = da_n[keywords_good + keywords_bad]
target2 = da_n['scorenull']
bnb = BernoulliNB(alpha=0)

bnb.fit(data, target2)

y_pred2 = bnb.predict(data)

print("Number of correctly labeled points out of a total {} points : {} or a {} success ratio".format(
    data.shape[0],
    (target2 == y_pred2).sum(),
    ((target2 == y_pred2).sum())/data.shape[0]
))
conf_mat2 = confusion_matrix(target2, y_pred2)
print('Correct False {}'.format(conf_mat2[0,0]), 'Incorrect False {}'.format(conf_mat2[0,1]))
print('Incorrect True {}'.format(conf_mat2[1,0]), 'Correct True {}'.format(conf_mat2[1,1]))
print('Sensitivity {}'.format(conf_mat2[1,1]/500))
print('Specificity {}'.format(conf_mat2[0,0]/500))

Number of correctly labeled points out of a total 1000 points : 927 or a 0.927 success ratio
Correct False 474 Incorrect False 26
Incorrect True 47 Correct True 453
Sensitivity 0.906
Specificity 0.948




In [16]:
#the numbers evenly flipped, so it is likely there are no unaccounted for
#inputs being labeled by default.
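The exact swap seen above is what one would expect whenever relabeling changes the names of the classes but not the predictions themselves. A small sketch with made-up labels shows that negating both the truth and the predictions simply reverses the rows and columns of the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration.
y_true = np.array([True, True, False, False, False, True])
y_hat = np.array([True, False, False, True, False, True])

original = confusion_matrix(y_true, y_hat)
flipped = confusion_matrix(~y_true, ~y_hat)

# Negating both arrays reverses the row order and the column order.
assert (flipped == original[::-1, ::-1]).all()
print(original)
print(flipped)
```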

In [17]:
#In this model, I used a database of good and bad words and then added words that I
#know improved the success ratio through trial and error. The only ways to create
#new features would be to keep adding words, or to break down the original
#thousand-plus word lists to see if any parts outperform the others or if some
#can be removed altogether.

In [18]:
#Let's make 5 roughly equal groups out of each list.
print(len(goodwords))
print(len(goodwords)/5)
print(len(badwords))
print(len(badwords)/5)

2253
450.6
3923
784.6


In [19]:
goodwords1 = goodwords[0:450]
goodwords2 = goodwords[450:900]
goodwords3 = goodwords[900:1350]
goodwords4 = goodwords[1350:1800]
goodwords5 = goodwords[1800:2253]
badwords1 = badwords[0:784]
badwords2 = badwords[784:1568]
badwords3 = badwords[1568:2352]
badwords4 = badwords[2352:3136]
badwords5 = badwords[3136:3923]
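The hand-written slice boundaries above could also be computed. A minimal sketch using `numpy.array_split`, with placeholder words standing in for the real lists; note that it spreads the remainder over the first groups, so the boundaries differ slightly from the slices used in this notebook. The helper name `split_into_groups` is made up for illustration.

```python
import numpy as np

def split_into_groups(words, n_groups=5):
    """Split a list into n_groups consecutive chunks whose sizes differ by at most one."""
    chunks = np.array_split(np.array(words, dtype=object), n_groups)
    return [list(chunk) for chunk in chunks]

# Placeholder words standing in for the real lists.
demo_words = ["w{}".format(i) for i in range(2253)]
groups = split_into_groups(demo_words)
print([len(g) for g in groups])
```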

In [20]:
#then run them all through the test to see how well they do.
keywords_good = goodwords1

for key in keywords_good:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=True)
    
keywords_bad = badwords1

for key in keywords_bad:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=False)
    
data = da_n[keywords_good + keywords_bad]
target2 = da_n['scorenull']
bnb = BernoulliNB(alpha=0)

bnb.fit(data, target2)

y_pred2 = bnb.predict(data)

print("Number of correctly labeled points out of a total {} points : {} or a {} success ratio".format(
    data.shape[0],
    (target2 == y_pred2).sum(),
    ((target2 == y_pred2).sum())/data.shape[0]
))
conf_mat2 = confusion_matrix(target2, y_pred2)
print('Correct False {}'.format(conf_mat2[0,0]), 'Incorrect False {}'.format(conf_mat2[0,1]))
print('Incorrect True {}'.format(conf_mat2[1,0]), 'Correct True {}'.format(conf_mat2[1,1]))
print('Sensitivity {}'.format(conf_mat2[1,1]/500))
print('Specificity {}'.format(conf_mat2[0,0]/500))

Number of correctly labeled points out of a total 1000 points : 631 or a 0.631 success ratio
Correct False 483 Incorrect False 17
Incorrect True 352 Correct True 148
Sensitivity 0.296
Specificity 0.966




In [21]:
keywords_good = goodwords2

for key in keywords_good:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=True)
    
keywords_bad = badwords2

for key in keywords_bad:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=False)
    
data = da_n[keywords_good + keywords_bad]
target2 = da_n['scorenull']
bnb = BernoulliNB(alpha=0)

bnb.fit(data, target2)

y_pred2 = bnb.predict(data)

print("Number of correctly labeled points out of a total {} points : {} or a {} success ratio".format(
    data.shape[0],
    (target2 == y_pred2).sum(),
    ((target2 == y_pred2).sum())/data.shape[0]
))
conf_mat2 = confusion_matrix(target2, y_pred2)
print('Correct False {}'.format(conf_mat2[0,0]), 'Incorrect False {}'.format(conf_mat2[0,1]))
print('Incorrect True {}'.format(conf_mat2[1,0]), 'Correct True {}'.format(conf_mat2[1,1]))
print('Sensitivity {}'.format(conf_mat2[1,1]/500))
print('Specificity {}'.format(conf_mat2[0,0]/500))

Number of correctly labeled points out of a total 1000 points : 660 or a 0.66 success ratio
Correct False 452 Incorrect False 48
Incorrect True 292 Correct True 208
Sensitivity 0.416
Specificity 0.904




In [22]:
keywords_good = goodwords3

for key in keywords_good:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=True)
    
keywords_bad = badwords3

for key in keywords_bad:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=False)
    
data = da_n[keywords_good + keywords_bad]
target2 = da_n['scorenull']
bnb = BernoulliNB(alpha=0)

bnb.fit(data, target2)

y_pred2 = bnb.predict(data)

print("Number of correctly labeled points out of a total {} points : {} or a {} success ratio".format(
    data.shape[0],
    (target2 == y_pred2).sum(),
    ((target2 == y_pred2).sum())/data.shape[0]
))
conf_mat2 = confusion_matrix(target2, y_pred2)
print('Correct False {}'.format(conf_mat2[0,0]), 'Incorrect False {}'.format(conf_mat2[0,1]))
print('Incorrect True {}'.format(conf_mat2[1,0]), 'Correct True {}'.format(conf_mat2[1,1]))
print('Sensitivity {}'.format(conf_mat2[1,1]/500))
print('Specificity {}'.format(conf_mat2[0,0]/500))

Number of correctly labeled points out of a total 1000 points : 662 or a 0.662 success ratio
Correct False 449 Incorrect False 51
Incorrect True 287 Correct True 213
Sensitivity 0.426
Specificity 0.898




In [23]:
keywords_good = goodwords4

for key in keywords_good:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=True)
    
keywords_bad = badwords4

for key in keywords_bad:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=False)
    
data = da_n[keywords_good + keywords_bad]
target2 = da_n['scorenull']
bnb = BernoulliNB(alpha=0)

bnb.fit(data, target2)

y_pred2 = bnb.predict(data)

print("Number of correctly labeled points out of a total {} points : {} or a {} success ratio".format(
    data.shape[0],
    (target2 == y_pred2).sum(),
    ((target2 == y_pred2).sum())/data.shape[0]
))
conf_mat2 = confusion_matrix(target2, y_pred2)
print('Correct False {}'.format(conf_mat2[0,0]), 'Incorrect False {}'.format(conf_mat2[0,1]))
print('Incorrect True {}'.format(conf_mat2[1,0]), 'Correct True {}'.format(conf_mat2[1,1]))
print('Sensitivity {}'.format(conf_mat2[1,1]/500))
print('Specificity {}'.format(conf_mat2[0,0]/500))

Number of correctly labeled points out of a total 1000 points : 615 or a 0.615 success ratio
Correct False 143 Incorrect False 357
Incorrect True 28 Correct True 472
Sensitivity 0.944
Specificity 0.286




In [24]:
keywords_good = goodwords5

for key in keywords_good:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=True)
    
keywords_bad = badwords5

for key in keywords_bad:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=False)
    
data = da_n[keywords_good + keywords_bad]
target2 = da_n['scorenull']
bnb = BernoulliNB(alpha=0)

bnb.fit(data, target2)

y_pred2 = bnb.predict(data)

print("Number of correctly labeled points out of a total {} points : {} or a {} success ratio".format(
    data.shape[0],
    (target2 == y_pred2).sum(),
    ((target2 == y_pred2).sum())/data.shape[0]
))
conf_mat2 = confusion_matrix(target2, y_pred2)
print('Correct False {}'.format(conf_mat2[0,0]), 'Incorrect False {}'.format(conf_mat2[0,1]))
print('Incorrect True {}'.format(conf_mat2[1,0]), 'Correct True {}'.format(conf_mat2[1,1]))
print('Sensitivity {}'.format(conf_mat2[1,1]/500))
print('Specificity {}'.format(conf_mat2[0,0]/500))

Number of correctly labeled points out of a total 1000 points : 713 or a 0.713 success ratio
Correct False 458 Incorrect False 42
Incorrect True 245 Correct True 255
Sensitivity 0.51
Specificity 0.916




In [25]:
keywords_good = goodwords1 + goodwords2 + goodwords3 + goodwords5

for key in keywords_good:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=True)
    
keywords_bad = badwords1 + badwords2 + badwords3 + badwords5

for key in keywords_bad:
    da_n[str(key)] = da_n.text.str.contains(str(key), case=False)
    
data = da_n[keywords_good + keywords_bad]
target2 = da_n['scorenull']
bnb = BernoulliNB(alpha=0)

bnb.fit(data, target2)

y_pred2 = bnb.predict(data)

print("Number of correctly labeled points out of a total {} points : {} or a {} success ratio".format(
    data.shape[0],
    (target2 == y_pred2).sum(),
    ((target2 == y_pred2).sum())/data.shape[0]
))
conf_mat2 = confusion_matrix(target2, y_pred2)
print('Correct False {}'.format(conf_mat2[0,0]), 'Incorrect False {}'.format(conf_mat2[0,1]))
print('Incorrect True {}'.format(conf_mat2[1,0]), 'Correct True {}'.format(conf_mat2[1,1]))
print('Sensitivity {}'.format(conf_mat2[1,1]/500))
print('Specificity {}'.format(conf_mat2[0,0]/500))

Number of correctly labeled points out of a total 1000 points : 900 or a 0.9 success ratio
Correct False 464 Incorrect False 36
Incorrect True 64 Correct True 436
Sensitivity 0.872
Specificity 0.928




In [26]:
#Groups 1, 2, 3 and 5 identify False values about equally well (specificity of
#roughly 0.90 to 0.97), but the 4th group identifies True values at a ridiculous
#rate in comparison to the rest (sensitivity 0.944), at the cost of mislabeling
#most False values (specificity 0.286). If I were to try to perfect this model,
#looking at this group would be the most beneficial.
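The per-group cells above repeat the same steps with a different word list each time. A sketch of how that comparison could be wrapped in one function and run in a loop; the function name `evaluate_keywords`, the toy reviews and the two keyword groups are all made up for illustration, and the sketch uses the default alpha of 1 rather than the alpha=0 used above.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB

def evaluate_keywords(text, labels, keywords, alpha=1.0):
    """Fit BernoulliNB on keyword-presence columns and score it on the same rows."""
    X = pd.concat(
        {k: text.str.contains(k, case=False, regex=False) for k in keywords}, axis=1
    )
    predictions = BernoulliNB(alpha=alpha).fit(X, labels).predict(X)
    tn, fp, fn, tp = confusion_matrix(labels, predictions, labels=[False, True]).ravel()
    return {"accuracy": (tp + tn) / len(labels),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}

# Made-up reviews, labels (True = positive review) and keyword groups.
text = pd.Series(["great fun", "awful plot", "great cast", "boring and awful"])
labels = pd.Series([True, False, True, False])
groups = {"group 1": ["great", "fun"], "group 2": ["awful", "boring"]}

results = pd.DataFrame({name: evaluate_keywords(text, labels, words)
                        for name, words in groups.items()}).T
print(results)
```

Collecting the metrics in one table makes it easier to compare groups side by side than scrolling between separate outputs.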

Do any of your classifiers seem to overfit?

Given how many words they contain, the original sets of good and bad words are unlikely to overfit my data, since those lists were not created with this specific set of reviews in mind. My added words, however, are very likely to overfit, because I added each one with specific entries in mind. I also had only this one set of reviews to test whether each addition improved accuracy, and every score above was computed on the same 1,000 reviews the model was fit on, so the added words are unlikely to be as effective on other data sets of the same type.
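A hold-out split or cross-validation would make the overfitting question measurable. The sketch below uses a synthetic stand-in for the keyword matrix, with random features and random labels, so any training accuracy above chance is memorized noise; the gap between training and test accuracy is the quantity to watch. None of the numbers it prints describe the review data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for the keyword matrix: 1000 rows, 200 random binary columns.
# The labels are random too, so there is nothing real to learn.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(1000, 200))
y = rng.randint(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = BernoulliNB().fit(X_train, y_train)
train_accuracy = model.score(X_train, y_train)   # scored on rows the model saw
test_accuracy = model.score(X_test, y_test)      # scored on held-out rows
cv_scores = cross_val_score(BernoulliNB(), X, y, cv=5)

print("train:", train_accuracy, "test:", test_accuracy, "cv mean:", cv_scores.mean())
```

Applied to the real keyword matrix, a training score well above the held-out score would support the suspicion that the hand-added words are tuned to these particular reviews.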

Which seem to perform the best? Why?

The words with the "best" performance are the most general ones. For example, "cool" was used many more times than "not bad" and contributed more correct responses when added to the set. This is a double-edged sword, however, because this classifier uses no context: "cool" is much more likely than "not bad" to appear in a negative review, either sarcastically or as a mention of something acceptable before the reviewer moves on to the negatives. General words still win because, to increase the success ratio, you need to identify the largest number of responses with the fewest inputs. Broad words are more likely to produce false responses, but they affect a much larger number of reviews overall.

As for the broken-down lists of words, the first group correctly identifies the most Falses (negative reviews, since True was redefined as a score of 1) and the fourth group overwhelmingly identifies the most Trues (positive reviews). It is likely that by dissecting these two groups and selecting specific words out of them through trial and error, we could attain the best success ratio.

What features seemed to be most impactful to performance?

The 4th group of words had the most impact on performance. The other four groups did a pretty good job of correctly identifying negative reviews (all around 90% success), but this one beat the others at correctly identifying positive reviews by a wide margin (94.4% against 51% at best), though it did so while correctly identifying only 28.6% of the negative reviews.
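To go from groups of words to individual words, one rough heuristic is to compare the per-class log probabilities that a fitted `BernoulliNB` stores in `feature_log_prob_`. The sketch below ranks keywords by how differently they appear in the two classes; the tiny keyword matrix is made up, and the ranking ignores how often a word occurs overall, so it is a starting point rather than a definitive measure of impact.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Made-up keyword matrix: rows are reviews, columns are keyword-present flags.
X = pd.DataFrame({
    "great": [1, 1, 1, 0, 0, 0],
    "awful": [0, 0, 0, 1, 1, 0],
    "movie": [1, 0, 1, 1, 0, 1],
})
y = np.array([True, True, True, False, False, False])  # True = positive review

model = BernoulliNB(alpha=1.0).fit(X, y)

# classes_ is sorted, so row 0 is False and row 1 is True.
log_ratio = model.feature_log_prob_[1] - model.feature_log_prob_[0]
ranking = pd.Series(log_ratio, index=X.columns).sort_values(key=np.abs, ascending=False)
print(ranking)
```

A large positive value marks a word seen mostly in positive reviews, a large negative value one seen mostly in negative reviews, and a value near zero a word that does little to separate the classes.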