It's time to revisit your classifier from the previous assignment. Using the evaluation techniques we've covered here, look at your classifier's performance in more detail. Then go back and iterate by engineering new features, removing poor features, or tuning parameters. Repeat this process until you have five different versions of your classifier. Once you've iterated, answer these questions to compare the performance of each:

Do any of your classifiers seem to overfit?
Which seem to perform the best? Why?
Which features seemed to be most impactful to performance?

Write up your iterations and answers to the above questions in a few pages. Submit a link below and go over it with your mentor to see if they have any other ideas on how you could improve your classifier's performance.

Estimated time: 1 hr

My time: 11111-11

---------------------------------------

In [1]:
# import libraries
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# import yelp dataset
df = pd.read_csv(r'C:\Users\AP\Desktop\yelp_clean.csv')

# rename columns
df.columns = ['review', 'sentiment']

# import positive and negative keywords dataset
# source: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
df_neg = pd.read_csv(r'C:\Users\AP\Desktop\negative words.csv')
df_pos = pd.read_csv(r'C:\Users\AP\Desktop\positive words.csv')

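# drop the first 35 rows; copy so the in-place drops below act on
# independent frames rather than on slices of the originals
df_neg = df_neg.iloc[35:].copy()
df_pos = df_pos.iloc[35:].copy()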

# drop unnecessary columns
df_neg.drop(['Column2', 'Column3', 'Column4', 'Column5'], axis=1, inplace=True)
df_pos.drop(['Column2', 'Column3', 'Column4', 'Column5'], axis=1, inplace=True)

# convert words in columns to list
list_neg = list(df_neg['Column1'])
list_pos = list(df_pos['Column1'])

# remove non-letter characters from lists
import re

list_pos = [re.sub('[^A-Za-z0-9]+', '', mystring) for mystring in list_pos]
list_pos = list(set(list_pos))

list_neg = [re.sub('[^A-Za-z0-9]+', '', mystring) for mystring in list_neg]
list_neg = list(set(list_neg))

In [3]:
# turn sentiment column to boolean
df['sentiment'] = (df['sentiment'] == 1)

In [4]:
# match keyword list to reviews
keywords = list_neg + list_pos

for key in keywords:
    # add spaces around key so we're getting the word, not just pattern matching
    df[str(key)] = df.review.str.contains(' ' + str(key) + ' ',case=False)

# define variables for bernoulli classifier
data = df[keywords]
target = df['sentiment']

# import bernoulli classifier (data is binary/boolean)
from sklearn.naive_bayes import BernoulliNB

# instantiate model and store in new variable
bnb = BernoulliNB()

# fit model to data
bnb.fit(data, target)

# classify, store result in new variable
y_pred = bnb.predict(data)

# display results
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0], (target != y_pred).sum()))

# section break
print(100*'-')

# test model with different holdout groups
# use train_test_split to create necessary training/test groups
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

# section break
print(100*'-')

# cross validation
from sklearn.model_selection import cross_val_score

cross_val_score(bnb, data, target, cv=5)

Number of mislabeled points out of a total 1000 points : 254
----------------------------------------------------------------------------------------------------
With 20% Holdout: 0.685
Testing on Sample: 0.746
----------------------------------------------------------------------------------------------------


array([0.68 , 0.69 , 0.68 , 0.695, 0.68 ])

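Using the entire positive and negative keyword lists, this classifier labels 74.6% of reviews correctly when it is scored on the same data it was trained on. On held-out data the accuracy is lower: 68.5% with a 20% holdout, and 68% to 69.5% across the five cross-validation folds. These numbers seem low, and I bet we can adjust the lists to achieve a higher accuracy score. The cross-validation folds are consistent with one another, so the model is stable across splits, but the roughly six-point drop from in-sample to held-out accuracy suggests some mild overfitting.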
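One thing that may be holding accuracy down is the space-padded match itself: a keyword at the very start or end of a review, or right next to punctuation, has no space on one side and gets missed. Below is a minimal sketch comparing the padded pattern with a regex word-boundary (`\b`) pattern. The helper names `contains_padded` and `contains_word` and the sample reviews are made up for illustration; they are not part of the notebook. Since `str.contains` treats its pattern as a regex by default, the same `\b` pattern could be passed to it in the loop above.

```python
import re

def contains_padded(text, word):
    # Mirrors the notebook's approach: the keyword needs a space on both sides.
    pattern = ' ' + re.escape(word) + ' '
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def contains_word(text, word):
    # \b matches at a word boundary, which includes the start and end of the
    # text and positions next to punctuation.
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Made-up reviews for illustration only.
sample_reviews = [
    "Great food and friendly staff.",   # keyword at the start
    "The service was great!",           # keyword next to punctuation
    "The greatness faded.",             # keyword only as part of a longer word
]

for review in sample_reviews:
    print(contains_padded(review, 'great'), contains_word(review, 'great'), review)
```

The first two reviews are missed by the padded pattern and caught by the word-boundary pattern, while the third is rejected by both, which is the behavior the padding was meant to guarantee.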

---------------------------------



FEATURE ENGINEERING FOR FOUR ADDITIONAL CLASSIFIERS

Feature #2 - positive words only, full list

In [12]:
# match keyword list to reviews
keywords = list_pos

for key in keywords:
    # add spaces around key so we're getting the word, not just pattern matching
    df[str(key)] = df.review.str.contains(' ' + str(key) + ' ',case=False)
    
# import bernoulli classifier (data is binary/boolean)
from sklearn.naive_bayes import BernoulliNB

# define variables for bernoulli classifier
data = df[keywords]
target = df['sentiment']

# instantiate model and store in new variable
bnb = BernoulliNB()

# fit model to data
bnb.fit(data, target)

# classify, store result in new variable
y_pred = bnb.predict(data)

# display results
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0], (target != y_pred).sum()))

print(100*'-')

# test model with different holdout groups
# use train_test_split to create necessary training/test groups
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

print(100*'-')

# cross validation
from sklearn.model_selection import cross_val_score

cross_val_score(bnb, data, target, cv=5)

Number of mislabeled points out of a total 1000 points : 278
----------------------------------------------------------------------------------------------------
With 20% Holdout: 0.675
Testing on Sample: 0.722
----------------------------------------------------------------------------------------------------


array([0.67 , 0.675, 0.665, 0.72 , 0.675])

This time only the positive keyword list was used in the model; the negative keyword list was left out. Using positive words only, the model scores 72.2% in-sample and 67.5% on the 20% holdout. This is slightly lower than using both positive and negative keywords, but overall similar. Cross-validation was consistent, though with a slightly wider spread than the previous model (0.665 to 0.72). Let's try negative keywords only next.
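To put a number on "consistent" and "wider spread", here is a small sketch that summarizes the two cross-validation arrays printed so far by their mean and by the range between the best and worst fold. The scores are copied from the outputs above; the dictionary keys and the `summary` variable are my own labels.

```python
from statistics import mean

# Fold scores copied from the two cross_val_score outputs above.
cv_scores = {
    'both lists': [0.68, 0.69, 0.68, 0.695, 0.68],
    'positive only': [0.67, 0.675, 0.665, 0.72, 0.675],
}

summary = {}
for name, scores in cv_scores.items():
    fold_mean = round(mean(scores), 3)
    fold_range = round(max(scores) - min(scores), 3)  # best fold minus worst fold
    summary[name] = (fold_mean, fold_range)
    print('{}: mean={}, range={}'.format(name, fold_mean, fold_range))
```

The means are nearly identical (0.685 and 0.681), while the range grows from 0.015 to 0.055, driven by the single 0.72 fold.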


---------------------------

Feature #3 - negative words only, full list

In [13]:
# match keyword list to reviews
keywords = list_neg

for key in keywords:
    # add spaces around key so we're getting the word, not just pattern matching
    df[str(key)] = df.review.str.contains(' ' + str(key) + ' ',case=False)
    
# import bernoulli classifier (data is binary/boolean)
from sklearn.naive_bayes import BernoulliNB

# define variables for bernoulli classifier
data = df[keywords]
target = df['sentiment']

# instantiate model and store in new variable
bnb = BernoulliNB()

# fit model to data
bnb.fit(data, target)

# classify, store result in new variable
y_pred = bnb.predict(data)

# display results
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0], (target != y_pred).sum()))

print(100*'-')

# test model with different holdout groups
# use train_test_split to create necessary training/test groups
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

print(100*'-')

# cross validation
from sklearn.model_selection import cross_val_score

cross_val_score(bnb, data, target, cv=5)

Number of mislabeled points out of a total 1000 points : 347
----------------------------------------------------------------------------------------------------
With 20% Holdout: 0.585
Testing on Sample: 0.653
----------------------------------------------------------------------------------------------------


array([0.6  , 0.575, 0.585, 0.595, 0.57 ])

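This is the worst-performing model so far. Using negative words only to identify sentiment produces an in-sample accuracy of 65.3% and a holdout accuracy of 58.5%. Cross-validation is consistent across folds, as with the previous two models, so none of the models looks unstable, although each one scores about five to seven points lower on held-out data than in-sample.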
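Fold-to-fold consistency and overfitting are separate questions, so it helps to look at the gap between in-sample and held-out accuracy directly. The sketch below just does that subtraction for the three models so far, using the "Testing on Sample" and "With 20% Holdout" numbers printed above; the `results` and `gaps` names are my own.

```python
# Accuracies copied from the printed outputs of the first three models.
results = {
    'both lists':    {'in_sample': 0.746, 'holdout': 0.685},
    'positive only': {'in_sample': 0.722, 'holdout': 0.675},
    'negative only': {'in_sample': 0.653, 'holdout': 0.585},
}

# A larger gap means the model does noticeably better on data it has seen.
gaps = {name: round(r['in_sample'] - r['holdout'], 3) for name, r in results.items()}

for name, gap in gaps.items():
    print('{}: in-sample minus holdout = {}'.format(name, gap))
```

All three gaps land between 0.047 and 0.068, so the models lose a similar amount of accuracy on unseen reviews.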

The model predicts sentiment in Yelp reviews noticeably better from positive keywords alone than from negative keywords alone. That leads me to suspect that people leave more positive reviews than negative ones overall. This evidence does not prove the theory, but it is consistent with it. Checking the class balance of the sentiment column would test it directly.
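The class balance is easy to check, and it also gives a baseline: the accuracy of always guessing the most common label. Here is a minimal sketch using made-up labels rather than the real sentiment column; `majority_baseline` is a hypothetical helper, not something from the notebook. On the actual frame, `df['sentiment'].value_counts(normalize=True)` should report the same proportions.

```python
from collections import Counter

def majority_baseline(labels):
    # Return the most common label and the accuracy of always predicting it.
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)

# Made-up labels for illustration; the real values live in df['sentiment'].
toy_labels = [True, True, True, False, False]

label, baseline_accuracy = majority_baseline(toy_labels)
print('majority label:', label)
print('baseline accuracy:', baseline_accuracy)
```

If the real labels turn out to be evenly split, the baseline is 0.5 and the theory about more positive reviews would not explain the difference between the two keyword lists.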


---------------------------

Feature #4 - keyword count (multiple keywords in review) 

In [5]:
# create new dataframe with sentiment column dropped
df_1 = df.drop(['sentiment'], axis=1)

In [6]:
df['matches'] = df_1.eq(True).dot(df_1.columns+',').str[:-1].str.split(',')
df['num_matches'] = df['matches'].str.len()

In [8]:
df['num_matches']

0      1
1      1
2      1
3      2
4      1
5      2
6      1
7      2
8      1
9      1
10     1
11     1
12     1
13     1
14     2
15     1
16     1
17     2
18     1
19     1
20     1
21     1
22     1
23     1
24     1
25     1
26     1
27     2
28     2
29     1
      ..
970    1
971    1
972    1
973    1
974    1
975    1
976    2
977    1
978    1
979    1
980    1
981    1
982    1
983    1
984    1
985    3
986    1
987    1
988    1
989    1
990    1
991    1
992    1
993    1
994    1
995    1
996    1
997    1
998    1
999    3
Name: num_matches, Length: 1000, dtype: int64

In [9]:
# keyword count/list in each review, not counting first review column name
df['hits'] = df_1.iloc[:, 1:].apply(lambda x: x.index[x.astype(bool)].tolist(), 1)

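In [11]:
# each entry of df['hits'] is a list of keywords, so clean the words inside each list
# (passing the list itself to re.sub raises "TypeError: expected string or bytes-like object")
df['hits'] = df['hits'].apply(lambda words: [re.sub('[^A-Za-z0-9]+', '', str(w)) for w in words])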

In [17]:
(df_1 == 1).sum(axis=0)

review          0
smudged         0
dissidents      0
shallow         0
nebulous        0
unhelpful       0
dislike         0
devoid          0
chore           0
complained      0
distraction     0
anarchy         0
disapointed     0
fictional       0
gruff           0
imprudence      0
lament          0
complication    0
rife            0
tricked         0
emptiness       0
unaccustomed    0
intimidating    0
underpowered    0
futilely        0
unnerve         0
disorient       0
brazen          0
fatuously       0
profane         0
               ..
gems            0
reputable       0
attentive       3
titillate       0
wholesome       0
promising       0
versatility     0
sincerity       0
nourishment     0
swanky          0
affectionate    0
excels          0
enrapt          0
luxuriously     0
windfall        0
restructured    0
praise          0
notably         0
bliss           0
idolized        0
reforming       0
replaceable     0
succes          0
supporter       0
aver      

In [None]:
# match keyword list to reviews
keywords = 

for key in keywords:
    # add spaces around key so we're getting the word, not just pattern matching
    df[str(key)] = df.review.str.contains(' ' + str(key) + ' ',case=False)
    
# import bernoulli classifier (data is binary/boolean)
from sklearn.naive_bayes import BernoulliNB

# define variables for bernoulli classifier
data = df[keywords]
target = df['sentiment']

# instantiate model and store in new variable
bnb = BernoulliNB()

# fit model to data
bnb.fit(data, target)

# classify, store result in new variable
y_pred = bnb.predict(data)

# display results
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0], (target != y_pred).sum()))

print(100*'-')

# test model with different holdout groups
# use train_test_split to create necessary training/test groups
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

print(100*'-')

# cross validation
from sklearn.model_selection import cross_val_score

cross_val_score(bnb, data, target, cv=10)
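
The cell above is still a template, so here is one possible way to build the keyword-count feature, offered as a sketch rather than the finished version. It counts how many positive and negative keywords appear in each review. The function name `keyword_counts`, the tiny word sets, and the reviews are all made up for illustration; the real lists would be `list_pos` and `list_neg`. One caveat: `BernoulliNB` binarizes its inputs (the `binarize` parameter defaults to 0.0), so raw counts would be reduced to present/absent. A count-aware model such as `MultinomialNB` is one option, but that is my assumption about a next step, not something tested here.

```python
import re

def keyword_counts(review, pos_words, neg_words):
    # Lowercase and split into alphanumeric tokens, matching how the keyword
    # lists were cleaned (non-alphanumeric characters removed).
    tokens = re.findall(r'[a-z0-9]+', review.lower())
    pos = sum(token in pos_words for token in tokens)
    neg = sum(token in neg_words for token in tokens)
    return pos, neg

# Toy stand-ins for list_pos and list_neg.
pos_words = {'great', 'friendly', 'delicious'}
neg_words = {'slow', 'bland', 'rude'}

# Made-up reviews for illustration only.
sample_reviews = [
    "Great food, friendly staff, but slow service.",
    "Bland and rude.",
]

for review in sample_reviews:
    print(keyword_counts(review, pos_words, neg_words), review)
```

Using sets for the word lists keeps each lookup fast, which matters more once the full keyword lists are swapped in.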


---------------------------

Feature #5 - positive/negative keyword ratio 

In [None]:
# match keyword list to reviews
keywords = 

for key in keywords:
    # add spaces around key so we're getting the word, not just pattern matching
    df[str(key)] = df.review.str.contains(' ' + str(key) + ' ',case=False)
    
# import bernoulli classifier (data is binary/boolean)
from sklearn.naive_bayes import BernoulliNB

# define variables for bernoulli classifier
data = df[keywords]
target = df['sentiment']

# instantiate model and store in new variable
bnb = BernoulliNB()

# fit model to data
bnb.fit(data, target)

# classify, store result in new variable
y_pred = bnb.predict(data)

# display results
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0], (target != y_pred).sum()))

print(100*'-')

# test model with different holdout groups
# use train_test_split to create necessary training/test groups
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

print(100*'-')

# cross validation
from sklearn.model_selection import cross_val_score

cross_val_score(bnb, data, target, cv=10)
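
This cell is also unfinished, so below is a sketch of how the positive/negative ratio might be computed. It is an assumption about where the feature could go, not a tested result. It adds a smoothing constant to both counts so that reviews with no negative keywords do not divide by zero, then turns the ratio into a single boolean feature that `BernoulliNB` can consume. The function `pos_neg_ratio`, the `smoothing` parameter, the toy word sets, and the reviews are hypothetical.

```python
import re

def pos_neg_ratio(review, pos_words, neg_words, smoothing=1):
    # Count keyword hits on alphanumeric tokens, then take a smoothed ratio.
    tokens = re.findall(r'[a-z0-9]+', review.lower())
    pos = sum(token in pos_words for token in tokens)
    neg = sum(token in neg_words for token in tokens)
    # Add-one smoothing keeps the ratio defined when neg is 0.
    return (pos + smoothing) / (neg + smoothing)

# Toy stand-ins for list_pos and list_neg.
pos_words = {'great', 'friendly', 'delicious'}
neg_words = {'slow', 'bland', 'rude'}

# Made-up reviews for illustration only.
sample_reviews = [
    "Great food, friendly staff, but slow service.",
    "Bland and rude.",
    "It was fine.",
]

ratios = [pos_neg_ratio(r, pos_words, neg_words) for r in sample_reviews]
# A ratio above 1 means more positive than negative keywords were found.
mostly_positive = [ratio > 1 for ratio in ratios]

print(ratios)
print(mostly_positive)
```

A review with no keyword hits at all gets a ratio of exactly 1, so it falls on the "not mostly positive" side of the threshold; whether that is the right default is a judgment call.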