# Challenge: Feedback Analysis
## Thinkful Unit 2, Lesson 2, Page 7

We've mentioned that Naive Bayes is particularly good for text classification problems. Previously we built a spam filter. Now we'll perform sentiment analysis, classifying feedback left on a website as either positive or negative.

Again, the UCI Machine Learning Repository has a nice dataset of sentiment-labelled sentences for us to use. This dataset was created for the paper "From Group to Individual Labels using Deep Features," Kotzias et al., KDD 2015.

Pick one of the company data files and build your own classifier. When you're satisfied with its performance (at this point just using the accuracy measure shown in the example), test it on one of the other datasets to see how well these kinds of classifiers translate from one context to another.
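Testing how a classifier translates to another context means fitting on one dataset and scoring on the other without refitting. The sketch below shows that pattern on tiny made-up data; the `KEYWORDS` list, the `keyword_features` helper, and both toy frames are hypothetical stand-ins, not the real review files.

```python
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Hypothetical keyword list, for illustration only.
KEYWORDS = ['good', 'great', 'love', 'bad', 'worst']

def keyword_features(messages, keywords=KEYWORDS):
    """Return one boolean column per keyword (case-insensitive substring match)."""
    return pd.DataFrame(
        {word: messages.str.contains(word, case=False) for word in keywords}
    )

# Toy stand-ins for two different review sources (made-up sentences).
source = pd.DataFrame({
    'message': ['Great phone', 'Worst charger ever', 'I love it', 'Bad battery'],
    'classifier': [1, 0, 1, 0],
})
other = pd.DataFrame({
    'message': ['The food was good', 'Bad service', 'Love this place', 'Worst meal'],
    'classifier': [1, 0, 1, 0],
})

# Fit on the source domain only.
model = BernoulliNB()
model.fit(keyword_features(source['message']), source['classifier'])

# Score on the other domain without refitting.
transfer_accuracy = model.score(keyword_features(other['message']), other['classifier'])
print('Accuracy on the other dataset: {:.2f}'.format(transfer_accuracy))
```

Building the features through one shared function keeps the column order identical across datasets, which the fitted model depends on.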

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn import preprocessing
import math
from sklearn.naive_bayes import BernoulliNB

%matplotlib inline

In [2]:
# Load the tab-separated Amazon reviews: one sentence and a 0/1 sentiment label per row.
data_path = "amazon_cells_labelled.txt"
amazon_raw = pd.read_csv(data_path, delimiter='\t', header=None)
amazon_raw.columns = ['message', 'classifier']

In [3]:
goodwords = ['nice', 'easy', 'it fits', 'good', 'great', 'ideal', 'really recommend', 'happy', 
             'comfortable', 'it has all', 'very well', 'you must have', 'love', 'joy', 'satisfied', 
             'reasonable', 'highly recommend', 'best', 'very impressed', 'works', 'i like', 'excellent', 
             'fine', 'beautiful', 'i really like', 'value', 'awesome']

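# One boolean feature per keyword. This is a case-insensitive substring match,
# so 'fine' also fires on 'defined' and 'works' on 'fireworks'.
for word in goodwords:
    amazon_raw[word] = amazon_raw.message.str.contains(word, case=False)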

In [4]:
# Flag short messages (fewer than 30 characters) as a boolean feature.
amazon_raw['length'] = amazon_raw['message'].str.len() < 30


In [5]:
data = amazon_raw[goodwords + ['length']]
target = amazon_raw['classifier']

In [6]:
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 219


In [7]:
# Load the Yelp reviews, which use the same tab-separated layout.
data_path_2 = "yelp_labelled.txt"
yelp_raw = pd.read_csv(data_path_2, delimiter='\t', header=None)
yelp_raw.columns = ['message', 'classifier']

In [8]:
goodwords = ['nice', 'easy', 'it fits', 'good', 'great', 'ideal', 'really recommend', 'happy', 
             'comfortable', 'it has all', 'very well', 'you must have', 'love', 'joy', 'satisfied', 
             'reasonable', 'highly recommend', 'best', 'very impressed', 'works', 'i like', 'excellent', 
             'fine', 'beautiful', 'i really like', 'value', 'awesome']

# Build the same keyword features on the Yelp messages.
for word in goodwords:
    yelp_raw[word] = yelp_raw.message.str.contains(word, case=False)

In [9]:
# Flag short messages (fewer than 30 characters) as a boolean feature.
yelp_raw['length'] = yelp_raw['message'].str.len() < 30

In [10]:
data_2 = yelp_raw[goodwords + ['length']]
target_2 = yelp_raw['classifier']

In [11]:
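# Note: this fits a fresh model on the Yelp data rather than reusing the
# Amazon-trained model, so the result below is an in-sample score for Yelp.
bnb = BernoulliNB()
bnb.fit(data_2, target_2)
y_pred = bnb.predict(data_2)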

print("Number of mislabeled points out of a total {} points : {}".format(
    data_2.shape[0],
    (target_2 != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 298
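The mislabeled counts convert directly into the accuracy measure used so far. A short worked calculation from the two printed results (both are in-sample scores, so they are likely optimistic):

```python
# Mislabeled counts printed by the two in-sample runs.
total = 1000
mislabeled = {'amazon': 219, 'yelp': 298}

# Accuracy is the share of points that were labeled correctly.
accuracy = {name: (total - errors) / total for name, errors in mislabeled.items()}

for name, value in accuracy.items():
    print('{}: {:.3f}'.format(name, value))
# amazon: 0.781
# yelp: 0.702
```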


# Challenge: Iterate and evaluate your classifier
## Thinkful Unit 2, Lesson 3

It's time to revisit your classifier from the previous assignment. Using the evaluation techniques we've covered here, look at your classifier's performance in more detail. Then go back and iterate by engineering new features, removing poor features, or tuning parameters. Repeat this process until you have five different versions of your classifier. Once you've iterated, answer these questions to compare the performance of each:

- Do any of your classifiers seem to overfit?
- Which seem to perform the best? Why?
- Which features seemed to be most impactful to performance?

Write up your iterations and answers to the above questions in a few pages. Submit a link below and go over it with your mentor to see if they have any other ideas on how you could improve your classifier's performance.
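One possible way to approach the question of feature impact is to compare the per-class log probabilities that `BernoulliNB` stores in `feature_log_prob_`. This is a minimal sketch on made-up boolean features; the column names and labels are hypothetical, and the log ratio is only a rough guide, not a definitive importance measure.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Toy boolean features and labels, made up for illustration.
X = pd.DataFrame({
    'great': [True, True, False, False, True, False],
    'not':   [False, False, True, True, False, True],
    'food':  [True, False, True, False, True, False],
})
y = np.array([1, 1, 0, 0, 1, 0])

model = BernoulliNB().fit(X, y)

# feature_log_prob_[c, j] is log P(feature j is True | class c).
# model.classes_ is [0, 1] here, so row 1 is the positive class.
log_ratio = model.feature_log_prob_[1] - model.feature_log_prob_[0]

# Strongly negative values point toward class 0, strongly positive toward class 1;
# values near zero mean the feature barely separates the classes.
impact = pd.Series(log_ratio, index=X.columns).sort_values()
print(impact)
```

Features whose log ratio sits close to zero are candidates for removal in a later iteration.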

In [12]:
yelp_raw

Unnamed: 0,message,classifier,nice,easy,it fits,good,great,ideal,really recommend,happy,...,very impressed,works,i like,excellent,fine,beautiful,i really like,value,awesome,length
0,Wow... Loved this place.,1,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
1,Crust is not good.,0,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
2,Not tasty and the texture was just nasty.,0,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,Stopped by during the late May bank holiday of...,1,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,The selection on the menu was great and so wer...,1,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,Now I am getting angry and I want my damn pho.,0,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,Honeslty it didn't taste THAT fresh.),0,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,The potatoes were like rubber and you could te...,0,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,The fries were great too.,1,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,True
9,A great touch.,1,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,True


In [13]:
goodwords = ['nice', 'easy', 'it fits', 'good', 'great', 'really recommend', 'happy', 
             'comfortable', 'it has all', 'very well', 'you must have', 'love', 'joy', 'satisfied', 
             'highly recommend', 'best', 'very impressed', 'works', 'i like', 'excellent', 
             'fine', 'beautiful', 'i really like', 'value', 'awesome', 'amazing', 'wow']

badwords = ['not good', 'worst', 'awful', 'disgusting', 'unfortunately', 'disaster', 
            "wasn't good", 'suck', 'horrible', 'wasted', 'not', 'angry', 'bad',
           'below average', 'flop', 'problem', 'would avoid', 'should avoid']

for word in goodwords:
    yelp_raw[str(word)] = yelp_raw.message.str.contains(str(word), case=False)

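# Note: these columns are True when the phrase is ABSENT from the message.
# 'not' is a substring match, so it also fires on 'not good' and 'nothing'.
for word in badwords:
    yelp_raw[word] = ~yelp_raw.message.str.contains(word, case=False)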
    
    
data_2 = yelp_raw[goodwords + badwords]
target_2 = yelp_raw['classifier']

bnb = BernoulliNB()
bnb.fit(data_2, target_2)
y_pred = bnb.predict(data_2)

print("Number of mislabeled points out of a total {} points : {}".format(
    data_2.shape[0],
    (target_2 != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 269


In [14]:
from sklearn.metrics import confusion_matrix
confusion_matrix(target_2, y_pred)

array([[476,  24],
       [245, 255]], dtype=int64)
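Reading the matrix with scikit-learn's convention (rows are true labels, columns are predictions), the errors are far from balanced. A short worked calculation using the printed values:

```python
import numpy as np

# Confusion matrix as printed: rows are true labels, columns are predictions.
cm = np.array([[476, 24],
               [245, 255]])
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)    # share of positive reviews identified
specificity = tn / (tn + fp)    # share of negative reviews identified
accuracy = (tp + tn) / cm.sum()

print('Sensitivity: {:.3f}'.format(sensitivity))  # 0.510
print('Specificity: {:.3f}'.format(specificity))  # 0.952
print('Accuracy:    {:.3f}'.format(accuracy))     # 0.731
```

The model catches almost all negative reviews but only about half of the positive ones, so most of the 269 errors are positive reviews predicted as negative.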

In [15]:
# Test your model with different holdout groups.

from sklearn.model_selection import train_test_split
# Use train_test_split to create the necessary training and test groups


X_train, X_test, y_train, y_test = train_test_split(data_2, target_2, test_size=0.1, random_state=20)
print('With 10% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data_2, target_2).score(data_2, target_2)))

With 10% Holdout: 0.71
Testing on Sample: 0.731


In [16]:
from sklearn.model_selection import cross_val_score
cross_val_score(bnb, data_2, target_2, cv=10)

array([0.71, 0.72, 0.67, 0.74, 0.76, 0.71, 0.74, 0.72, 0.73, 0.78])
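Summarizing the fold scores makes it easier to judge overfitting: a small spread around a mean close to the in-sample score suggests the model is not overfitting much. A short calculation on the printed array:

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output.
scores = np.array([0.71, 0.72, 0.67, 0.74, 0.76, 0.71, 0.74, 0.72, 0.73, 0.78])

print('Mean accuracy: {:.3f}'.format(scores.mean()))   # 0.728
print('Std deviation: {:.3f}'.format(scores.std()))    # 0.029
print('Range: {:.2f} to {:.2f}'.format(scores.min(), scores.max()))  # 0.67 to 0.78
```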