## 2.2.7 Challenge: Feedback analysis

In this notebook, I take a set of movie reviews from IMDb and test whether a program can guess if each review is positive or negative. Every review comes with its actual label, so I compare my predictions against those labels to see how many are correct. I set the bar for success at 90% accuracy.

Once the script is finished, I run it on a set of business reviews from Yelp to see whether the approach still has value when carried over to a different kind of review.



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 

In [2]:
# Load the reviews and their scores, plus two word lists. One list holds roughly 2,500 buzzwords that we define as "good"
# and the other roughly 3,300 buzzwords that we define as "bad." Selecting buzzwords by hand from the reviews would take
# far too long, so these lists are a good place to start.
# The separator is a regular expression, so the Python parsing engine is requested explicitly.
da = pd.read_csv(r"C:\Users\jmfra\OneDrive\Documents\Thinkful Data Science Files\sentiment labelled sentences\imdb_labelled.txt", delimiter='\t|\t1\n', header=None, engine='python')
da.columns = ['text', 'score']
good = pd.read_csv(r"C:\Users\jmfra\OneDrive\Documents\Thinkful Data Science Files\2.2.7\positive.txt", header=None)
good.columns = ['x']
bad = pd.read_csv(r"C:\Users\jmfra\OneDrive\Documents\Thinkful Data Science Files\2.2.7\negative.txt", header=None)
bad.columns = ['x']
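As an alternative way to load this kind of file, here is a minimal sketch that assumes each line is simply a sentence, a tab, and a 0/1 label (an assumption about the file layout, not something confirmed above). The `sample` string and the `reviews` name are hypothetical stand-ins so the snippet runs without the real file. Passing `quoting=csv.QUOTE_NONE` tells pandas to treat quote characters as ordinary text, which matters when reviews contain unbalanced quotation marks.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for a tab-separated "sentence<TAB>label" file.
sample = 'A very "slow" movie.\t0\nLoved every minute of it.\t1\n'

reviews = pd.read_csv(
    io.StringIO(sample),
    sep='\t',                # plain tab separator, no regular expression needed
    header=None,
    names=['text', 'score'],
    quoting=csv.QUOTE_NONE,  # keep quote characters as literal text
)
print(reviews.shape)
```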



In [3]:
# The two word lists were loaded as pandas DataFrames. Plain Python lists are easier to work with here, so convert them.
goodwords = list(good.values.flatten())
badwords = list(bad.values.flatten())

In [4]:
# Some quick data cleaning: lowercase the text, trim surrounding whitespace, and remove periods.
da['text'] = da.text.str.lower()
da['text'] = da.text.str.strip()
da['text'] = da.text.str.replace('.', '', regex=False)
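The cleaning steps can be checked on a couple of made-up sentences. This is a small sketch with hypothetical sample text, not the real reviews. It passes `regex=False` so the period is removed as a literal character rather than being read as a regular-expression wildcard.

```python
import pandas as pd

# Hypothetical sample sentences.
text = pd.Series(['  Great Film.  ', 'Not good... at all.'])

cleaned = (
    text.str.lower()                        # normalise case
        .str.strip()                        # drop leading/trailing whitespace
        .str.replace('.', '', regex=False)  # remove literal periods
)
print(cleaned.tolist())  # ['great film', 'not good at all']
```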

In [5]:
# The score column holds the actual label: 0 for a negative review and 1 for a positive one. Convert it to a boolean
# to make it easier to use later. After this step, True means the review is negative and False means it is positive.
da['score'] = (da['score'] == 0)

In [6]:
# With thousands of entries in the good-words list, the simplest approach is a for loop that checks whether each word
# appears in each review. The loop adds one boolean column per word to the original dataset: True if the review contains
# the word, False otherwise. The case argument only controls case sensitivity; it does not change the value stored.
keywords_good = goodwords

for key in keywords_good:
    da[str(key)] = da.text.str.contains(str(key), case=False)

In [7]:
# Same idea for the bad words. The text is already lowercased, so a case-sensitive match gives the same result for lowercase keywords.
keywords_bad = badwords

for key in keywords_bad:
    da[str(key)] = da.text.str.contains(str(key), case=True)
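The keyword loops add thousands of columns one at a time, and `str.contains` treats each keyword as a regular expression by default. Here is a minimal sketch of one alternative, using a hypothetical three-review frame and keyword list: it escapes each keyword with `re.escape` so characters such as `+` are matched literally, and builds all indicator columns before attaching them in a single `pd.concat`.

```python
import re

import pandas as pd

# Hypothetical reviews and keywords, for illustration only.
reviews = pd.DataFrame({'text': ['a great film', 'c++ was awful', 'just fine']})
keywords = ['great', 'awful', 'c++']

# Build every indicator column first, then combine them in one step.
features = pd.concat(
    {key: reviews['text'].str.contains(re.escape(key)) for key in keywords},
    axis=1,
)
print(features)
```

Matching is still by substring, so a keyword such as `kill` would also flag `skill`. Whether that is acceptable depends on the word lists.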

In [8]:
# Store all of the keyword columns in a new DataFrame, and the actual labels in their own Series.
data = da[keywords_good + keywords_bad]
target = da['score']

In [9]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of correctly labeled points out of a total {} points : {} or a {} success ratio".format(
    data.shape[0],
    (target == y_pred).sum(),
    ((target == y_pred).sum())/data.shape[0]
))

Number of correctly labeled points out of a total 1000 points : 853 or a 0.853 success ratio
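The success ratio above is measured on the same rows the model was fitted on. As a complementary check, here is a hedged sketch of a hold-out evaluation on synthetic boolean features. The data, the 25% split, and the names `X`, `y`, `train_acc`, and `test_acc` are all assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 "reviews" with 5 boolean keyword indicators each.
X = rng.integers(0, 2, size=(200, 5)).astype(bool)
y = X[:, 0] | X[:, 1]  # synthetic label driven by the first two indicators

# Hold back a quarter of the rows so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = BernoulliNB().fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print('train accuracy:', train_acc)
print('test accuracy:', test_acc)
```

A large gap between the two numbers would suggest the keyword list is tuned to the particular reviews it was fitted on.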


In [10]:
# 0.853 is solid for a first pass: it is only 4.7 percentage points below the 90% target. In the next four cells, I add or
# remove words in the good and bad lists to address reviews that were classified incorrectly. I then rerun the original for
# loops to add the new keyword columns to the DataFrame and refit the model to see the new success ratio.
goodwords.extend([
    'best', 'rivet', 'cool', '10', 'promote', 'give this one a look',
    'pretty decent', 'nostalgia', 'applause', 'liked', 'go and see',
    'must see', 'fascinated', 'amaze', 'pearls', 'cult', 'taped',
    'powerful', 'cutting edge', 'to learn more', 'unique', 'just right',
    'really likes', 'not bad', 'impressed',
])
goodwords = [x for x in goodwords if x not in ('improved', 'free')]

keywords_good = goodwords

for key in keywords_good:
    da[str(key)] = da.text.str.contains(str(key), case=False)
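Choosing which words to add or remove starts with seeing which reviews were classified incorrectly. Here is a minimal sketch of that inspection step on a hypothetical four-review frame. The names `reviews`, `features`, and `missed` are illustrative, and the exact procedure used for the lists above is not shown in the notebook.

```python
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Hypothetical reviews; True marks a negative review, as in the notebook.
reviews = pd.DataFrame({
    'text': ['great film', 'awful plot', 'simply riveting', 'great waste of time'],
    'score': [False, True, False, True],
})
keywords = ['great', 'awful']
features = pd.DataFrame(
    {k: reviews['text'].str.contains(k, regex=False) for k in keywords}
)

model = BernoulliNB().fit(features, reviews['score'])
predicted = model.predict(features)

# Reviews where the prediction disagrees with the actual label.
missed = reviews.loc[predicted != reviews['score'], 'text']
print(missed.tolist())
```

Reading through `missed` shows which phrases the current keyword lists fail to capture.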

In [11]:
badwords.extend([
    'absolutely no', 'dislike', 'disliked', 'pillow', 'crap', 'shattered',
    'not a pleasant', 'too many', 'uninteresting', 'unremarkable',
    'abstruse', 'kill', 'sucked', 'witticisms', '1/10', '0/10',
    'embarrassed', 'barely comprehensible', 'remotely',
])
badwords = [x for x in badwords if x != 'less']

keywords_bad = badwords

for key in keywords_bad:
    da[str(key)] = da.text.str.contains(str(key), case=True)

In [12]:
data = da[keywords_good + keywords_bad]
target = da['score']

In [13]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of correctly labeled points out of a total {} points : {} or a {} success ratio".format(
    data.shape[0],
    (target == y_pred).sum(),
    ((target == y_pred).sum())/data.shape[0]
))

Number of correctly labeled points out of a total 1000 points : 900 or a 0.9 success ratio


In [14]:
# Now that the success criterion is met, run the same steps on a new set of reviews.
ds = pd.read_csv(r"C:\Users\jmfra\OneDrive\Documents\Thinkful Data Science Files\sentiment labelled sentences\yelp_labelled.txt", delimiter='\t|\t1\n', header=None, engine='python')

  


In [15]:
ds.columns = ['text', 'score']
ds['text'] = ds.text.str.lower()
ds['text'] = ds.text.str.strip()
ds['text'] = ds.text.str.replace('.', '', regex=False)
ds['score'] = (ds['score'] == 0)

In [16]:
keywords_good = goodwords

for key in keywords_good:
    ds[str(key)] = ds.text.str.contains(str(key), case=False)
    
keywords_bad = badwords

for key in keywords_bad:
    ds[str(key)] = ds.text.str.contains(str(key), case=True)

In [17]:
data = ds[keywords_good + keywords_bad]
target = ds['score']

In [18]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

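# Fit our model to the Yelp data. This is a fresh fit, so the result below measures how well the
# keyword lists carry over to Yelp, not how well the IMDb-trained model does.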
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of correctly labeled points out of a total {} points : {} or a {} success ratio".format(
    data.shape[0],
    (target == y_pred).sum(),
    ((target == y_pred).sum())/data.shape[0]
))

Number of correctly labeled points out of a total 1000 points : 844 or a 0.844 success ratio
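The Yelp result above comes from a model refitted on the Yelp reviews. To test whether a model trained on one source carries over to another, the fitted model can be scored on the second set without a second call to `fit`. Here is a minimal sketch using two tiny hypothetical frames. The names `keyword_features`, `train`, `other`, and `transfer_acc` are assumptions for illustration.

```python
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

keywords = ['great', 'awful']  # hypothetical keyword list


def keyword_features(texts):
    """Return one boolean column per keyword for a Series of review texts."""
    return pd.DataFrame({k: texts.str.contains(k, regex=False) for k in keywords})


# Hypothetical stand-ins for two review sources; True marks a negative review.
train = pd.DataFrame({
    'text': ['great film', 'awful plot', 'great cast', 'awful pacing'],
    'score': [False, True, False, True],
})
other = pd.DataFrame({
    'text': ['great tacos', 'awful service'],
    'score': [False, True],
})

model = BernoulliNB().fit(keyword_features(train['text']), train['score'])

# No second fit: the model trained on `train` is scored directly on `other`.
transfer_acc = model.score(keyword_features(other['text']), other['score'])
print(transfer_acc)
```

Both sets must be turned into features with the same keyword list in the same column order, which is why the feature step is wrapped in one function here.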
