The assignment was to take one or more datasets from http://jmcauley.ucsd.edu/data/amazon/ and build a model that could accurately (~%70) guess which reviews were positive.

#### Assignment text:
Use one of the following datasets to perform sentiment analysis on the given Amazon reviews. Pick one of the "small" datasets that is a reasonable size for your computer. The goal is to create a model to algorithmically predict if a review is positive or negative just based on its text. Try to see how these reviews compare across categories. Does a review classification model for one category work for another?

## Importing code...

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
import math
import warnings
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer


from IPython.display import display

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

# Suppress annoying harmless error.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd"
)

from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.decomposition import PCA
from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier

from sklearn import ensemble
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2, f_classif

from timeit import default_timer as timer

import pydotplus
from sklearn import tree
from sklearn import preprocessing
%matplotlib inline
sns.set_style('white')

## Uploading & subsetting data...

In [73]:
# Upload Videogame dataset
videogame_basedata = pd.read_json('reviews_Video_Games_5.json', lines=True)


In [74]:
# Sample smaller chunks of the videogame dataset to speed up testing models
all_data_vg = videogame_basedata
half_data_vg = videogame_basedata.sample(frac=.5, random_state=2, axis=0)
quarter_data_vg = videogame_basedata.sample(frac=.25, random_state=2, axis=0)
tenth_data_vg = videogame_basedata.sample(frac=.1, random_state=2, axis=0)
hundreth_data_vg = videogame_basedata.sample(frac=.01, random_state=2, axis=0)


In [50]:
# create a report function that can be used for any model

def accuracy_report(testing_X, testing_Y, model, cv):
    predictions = model.predict(testing_X)
    print('Model score:')
    print(model.score(testing_X, testing_Y))
    print(" ")
    print("Classification Report:")
    y_prediction = model.predict(testing_X)
    print(classification_report(testing_Y, y_prediction, target_names=['>4 review', '<4 review']))
    auc = roc_auc_score(testing_Y.values, predictions)
    print('AUC score:%.3f'% auc)
# Sometimes we don't want to spend the processor time calculating the cross-valuation, so we need a way to toggle it.
    if cv == 1:
        print(" ")
        print('Model cross-valuation:')
        print(sklearn.model_selection.cross_val_score(model, testing_X, testing_Y, cv = 5))
    return

## Cleaning data

In [55]:
# Create a working dataframe so that operations can be performed on any dataframe by switching out a single variable.
working_df = quarter_data_vg

# create a function to prep the data
def prep_data(working_df):
    
    # Create a new binary feature based on whether a review gave the product 4 or 5 stars
    working_df.loc[working_df.overall >= 4, 'positive_review'] = 1
    working_df.loc[working_df.overall < 4, 'positive_review'] = 0
    
    # combining the reviewText and summary columns into a single text column
    working_df['combined_review'] = working_df.reviewText + ' ' + working_df.summary
    
    # Drop all columns except for positive_review and combined_review
    working_df.drop([
        'asin',
        'helpful',
        'reviewTime',
        'reviewerID',
        'reviewerName',
        'unixReviewTime',
        'overall',
        'reviewText',
        'summary',
    ], axis=1, inplace=True)

    # Drop out any remaining rows with NA in them.
    working_df = working_df.dropna()

    # Reset the index so we can merge new features into the df seamlessly
    working_df.reset_index(drop=True, inplace=True)
    print(" ")
    
    return(working_df)

# run the function
working_df = prep_data(working_df)

 


## Vectorizing the text of the reviews

In [56]:
# write a function to vectorize the working dataset

def vectorization(working_df):
    vect = TfidfVectorizer()

    # remove English stop words
    vect.set_params(stop_words='english')

    # include 1-grams and 2-grams
    vect.set_params(ngram_range=(1, 2))

    # ignore terms that appear in more than 50% of the documents
    vect.set_params(max_df=0.5)

    # only keep terms that appear in at least X reviews, where X is a percentage of the total rows of working_df
    min_df_percentage = .02
    vect.set_params(min_df=min_df_percentage)

    # vectorize the combined_review column
    tfidf_fit = vect.fit_transform(working_df['combined_review'])
    feature_names = vect.get_feature_names()
    vectorized_df = pd.DataFrame(tfidf_fit.toarray(), columns=feature_names)


    # Drop the review text column itself, now that it's been vectorized
    working_df.drop(['combined_review'], axis=1, inplace=True)
    
    # Return the concatenation of the vectorized_df onto working_df
    return(pd.concat([working_df, vectorized_df], axis=1))

#run the function
working_df = vectorization(working_df)

In [57]:
# A collection of info-collection tools to ensure the working_df is functional

#working_df['summary_vecor'].describe()  
working_df.head(5)
#print(working_df['summary_vecor'])
#working_df['combined_review'].iloc[0]

Unnamed: 0,positive_review,10,100,12,15,20,30,34,360,3d,40,50,60,abilities,ability,able,absolutely,acting,action,actual,actually,add,added,addictive,addition,adds,adventure,age,ago,ai,air,allow,allows,alot,amazing,amazon,annoying,arcade,area,areas,aren,art,aspect,attack,attacks,attention,available,average,away,awesome,bad,based,basic,basically,battle,battles,beat,beautiful,beginning,believe,best,best game,better,big,biggest,bit,black,blast,bored,boring,boss,bosses,bought,bought game,box,break,bring,bugs,build,button,buttons,buy,buy game,buying,called,came,camera,campaign,car,card,care,cars,case,cause,certain,certainly,challenge,challenging,chance,change,changed,changes,character,characters,cheap,check,choice,choose,city,classic,clear,close,collection,combat,come,comes,coming,compared,complaint,complete,completely,computer,cons,console,constantly,content,control,controller,controllers,controls,cool,copy,couldn,couple,course,cover,create,cut,damage,dark,day,days,dead,deal,death,decent,decided,deep,definitely,depending,depth,design,designed,despite,detailed,developers,did,didn,die,difference,different,difficult,difficulty,disappointed,does,doesn,doing,don,don know,don like,dont,download,drive,driving,drop,ds,duty,ea,early,easier,easily,easy,edition,effect,effects,elements,end,ending,enemies,enemy,engine,enjoy,enjoyable,enjoyed,entertaining,entire,environment,environments,epic,especially,eventually,evil,exactly,example,excellent,expect,expected,experience,explore,extra,extremely,face,fact,fairly,fall,family,fan,fans,fantastic,fantasy,far,fast,favorite,feature,features,feel,feel like,feeling,feels,feels like,felt,fight,fighting,figure,final,final fantasy,finally,finding,fine,finish,finished,fit,flaws,follow,force,forward,fps,franchise,free,friend,friends,frustrating,fun,fun game,fun play,funny,future,game does,game fun,game game,game good,game great,game just,game like,game play,game played,game really,game ve,gameplay,gamer,gamers,games,games like,gaming,gave,general,genre,gets,getting,given,gives,giving,glad,god,goes,going,gone,good,good game,got,graphics,great,great game,ground,guess,gun,guns,guy,guys,half,halo,hand,hands,happen,happy,hard,harder,hate,haven,having,head,health,hear,heard,help,hidden,high,higher,highly,highly recommend,hit,hold,home,honestly,hope,horrible,hour,hours,house,huge,idea,ii,important,impossible,improved,included,including,incredible,incredibly,instead,interesting,isn,issue,issues,item,items,job,jump,just,just like,keeps,kept,kids,kill,killing,kind,know,lack,large,later,learn,leave,left,let,lets,level,levels,life,light,like,like game,liked,limited,line,list,literally,little,live,ll,load,long,long time,longer,look,looked,looking,looks,lose,lost,lot,lot fun,lots,love,love game,loved,loves,low,main,major,make,makes,making,man,map,maps,mario,match,matter,maybe,mean,means,mechanics,memory,mention,mind,mini,minor,minutes,miss,missing,mission,missions,mode,modern,modes,money,months,motion,moves,movie,movies,moving,multiplayer,multiple,music,near,nearly,need,needed,needs,new,nice,night,nintendo,non,normal,note,number,offer,oh,ok,okay,old,older,ones,online,open,opinion,option,options,order,original,overall,pack,parts,party,pass,past,pay,pc,people,perfect,perfectly,person,personally,pick,place,places,play,play game,played,played game,player,players,playing,playing game,plays,playstation,plenty,plot,plus,point,points,poor,possible,power,powerful,pretty,pretty good,previous,price,probably,problem,problems,product,progress,pros,ps2,ps3,purchase,purchased,puzzle,puzzles,quality,quest,quests,quick,quickly,quite,race,racing,random,range,rate,rating,read,reading,real,realistic,really,really good,reason,recommend,recommend game,recommended,red,release,released,remember,repetitive,replay,replay value,require,rest,return,review,reviews,right,room,rpg,run,running,runs,said,save,saw,say,saying,says,scenes,score,screen,second,seeing,seen,sense,sequel,series,seriously,set,setting,settings,shoot,shooter,shooters,shooting,short,shot,shows,similar,simple,simply,single,single player,skill,skills,slightly,slow,small,smooth,solid,somewhat,son,sony,soon,sort,sound,sounds,soundtrack,space,special,speed,spend,spent,stand,standard,star,stars,start,started,starts,stay,step,stick,stop,store,story,story line,storyline,straight,strategy,strong,stuck,stuff,stupid,style,super,support,supposed,sure,switch,systems,taken,takes,taking,talk,talking,team,tell,terrible,thing,things,think,thinking,thought,throw,time,times,title,titles,today,tons,took,total,totally,touch,tried,true,truly,try,trying,turn,turned,turns,tv,type,types,understand,unfortunately,unique,unless,unlike,unlock,update,upgrade,use,used,using,usually,value,variety,various,ve,ve played,ve seen,version,versions,video,video game,video games,view,voice,voice acting,wait,waiting,walk,want,wanted,war,wasn,waste,watch,watching,way,ways,weapon,weapons,went,wii,win,wish,won,wonderful,work,worked,working,works,world,worse,worst,worth,wouldn,wow,wrong,xbox,xbox 360,year,year old,years,yes
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.208,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.331,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.495,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.776,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.152,0.0,0.0,0.0,0.0,0.0,0.163,0.0,0.0,0.0,0.0,0.0,0.124,0.0,0.0,0.0,0.0,0.0,0.169,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.157,0.0,0.0,0.129,0.188,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.187,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.164,0.0,0.0,0.0,0.0,0.168,0.172,0.0,0.0,0.158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.073,0.168,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.089,0.149,0.129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.187,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.19,0.0,0.0,0.0,0.0,0.139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.242,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.099,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.163,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.084,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.174,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.134,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.292,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.134,0.171,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.159,0.0,0.0,0.0,0.0,0.0,0.0,0.256,0.0,0.0,0.21,0.307,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.257,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.152,0.0,0.0,0.0,0.0,0.0,0.308,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.145,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.121,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.253,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.52,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.279,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.267,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.181,0.0,0.0,0.0,0.137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.303,0.0,0.0,0.0,0.0,0.0,0.379,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.206,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.172,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.204,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.394,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.189,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.215,0.246,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.243,0.0,0.0,0.0,0.145,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.422,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.155,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.265,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.191,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.141,0.118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.118,0.275,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.189,0.0,0.0,0.0,0.0,0.207,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.208,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.192,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.208,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.119,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.263,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.238,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.378,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Preparing sample datasets for use in the model(s)

In [58]:
# create samples for training & testing
training_fraction = .8

training_X = working_df.sample(frac=training_fraction, random_state=10)
testing_X = working_df.drop(training_X.index)

# separate the Class feature out into Y datasets
training_Y = training_X['positive_review']
testing_Y = testing_X['positive_review']

# dropping the Class feature from the X datasets so that the model isn't able to cheat
training_X.drop('positive_review', axis=1, inplace=True)
testing_X.drop('positive_review', axis=1, inplace=True)
print(" ")

 


## Running an assortment of models to find the most successful

In [12]:
# Random Forest model
rfc = ensemble.RandomForestClassifier(n_estimators = 20)
rfc.fit(training_X,training_Y)
print('Random Forest results:')
print(' ')
accuracy_report(testing_X, testing_Y, rfc, 1)

Random Forest results:
 
Model score:
0.8144792475623436
 
Classification Report:
              precision    recall  f1-score   support

   >4 review       0.69      0.39      0.50      2727
   <4 review       0.83      0.95      0.89      8862

   micro avg       0.81      0.81      0.81     11589
   macro avg       0.76      0.67      0.69     11589
weighted avg       0.80      0.81      0.79     11589

AUC score:0.666
 
Model cross-valuation:
[0.8029323  0.81328159 0.80492016 0.81009927 0.79283556]


In [59]:
# Decision Tree

dtree = tree.DecisionTreeClassifier()
dtree.fit(training_X,training_Y)
print('Decision Tree results:')
print(' ')
accuracy_report(testing_X, testing_Y, dtree, 1)

Decision Tree results:
 
Model score:
0.743722495469842
 
Classification Report:
              precision    recall  f1-score   support

   >4 review       0.46      0.46      0.46      2727
   <4 review       0.83      0.83      0.83      8862

   micro avg       0.74      0.74      0.74     11589
   macro avg       0.64      0.64      0.64     11589
weighted avg       0.74      0.74      0.74     11589

AUC score:0.644
 
Model cross-valuation:
[0.73005606 0.73134972 0.73629694 0.72593871 0.7440656 ]


In [14]:
# Gradient Boosting Classifier

clf = ensemble.GradientBoostingClassifier()
clf.fit(training_X,training_Y)
print('GBC results:')
print(' ')
accuracy_report(testing_X, testing_Y, clf, 0)


GBC results:
 
Model score:
0.8139615152299594
 
Classification Report:
              precision    recall  f1-score   support

   >4 review       0.79      0.29      0.42      2727
   <4 review       0.82      0.98      0.89      8862

   micro avg       0.81      0.81      0.81     11589
   macro avg       0.80      0.63      0.65     11589
weighted avg       0.81      0.81      0.78     11589

AUC score:0.631


In [10]:
# KNN

neighbors = KNeighborsClassifier()
neighbors.fit(training_X,training_Y)
print('KNN results:')
print(' ')
accuracy_report(testing_X, testing_Y, neighbors, 1)

KNN results:
 
Model score:
0.7672793165933213
 
Classification Report:
              precision    recall  f1-score   support

   >4 review       0.58      0.04      0.08      2727
   <4 review       0.77      0.99      0.87      8862

   micro avg       0.77      0.77      0.77     11589
   macro avg       0.67      0.52      0.47     11589
weighted avg       0.72      0.77      0.68     11589

AUC score:0.516
 
Model cross-valuation:
[0.76282881 0.76282881 0.76607682 0.7673716  0.76478205]


In [15]:
# Logistic Regression

lr = LogisticRegression()
lr.fit(training_X,training_Y)
print('Logistic Regression results:')
print(' ')
accuracy_report(testing_X, testing_Y, lr, 1)



Logistic Regression results:
 
Model score:
0.8451980326171369
 
Classification Report:
              precision    recall  f1-score   support

   >4 review       0.75      0.51      0.61      2727
   <4 review       0.86      0.95      0.90      8862

   micro avg       0.85      0.85      0.85     11589
   macro avg       0.81      0.73      0.76     11589
weighted avg       0.84      0.85      0.83     11589

AUC score:0.731
 
Model cross-valuation:




[0.8262182  0.84519189 0.83513164 0.83815278 0.8256366 ]




In [12]:
# Naive Bayes

bnb = BernoulliNB()
bnb.fit(training_X,training_Y)
print('Naive Bayes results:')
print(' ')
accuracy_report(testing_X, testing_Y, bnb, 1)

Naive Bayes results:
 
Model score:
0.7224954698420917
 
Classification Report:
              precision    recall  f1-score   support

   >4 review       0.41      0.43      0.42      2727
   <4 review       0.82      0.81      0.82      8862

   micro avg       0.72      0.72      0.72     11589
   macro avg       0.62      0.62      0.62     11589
weighted avg       0.73      0.72      0.72     11589

AUC score:0.622
 
Model cross-valuation:
[0.71453213 0.71280724 0.75053949 0.73974968 0.73154942]


In [13]:
# SVM

svm = SVC(kernel='linear')
svm.fit(training_X,training_Y)
print('SVM results:')
print(' ')
accuracy_report(testing_X, testing_Y, svm, 0)


SVM results:
 
Model score:
0.841919061178704
 
Classification Report:
              precision    recall  f1-score   support

   >4 review       0.75      0.50      0.60      2727
   <4 review       0.86      0.95      0.90      8862

   micro avg       0.84      0.84      0.84     11589
   macro avg       0.80      0.72      0.75     11589
weighted avg       0.83      0.84      0.83     11589

AUC score:0.723


### Result:
Logistic Regression and Linear SVM seem to perform best, without optimizing, and are about on par with each other. However, SVM takes a comparatively massive amount of runtime to process, so for brevity in improving the model further I am choosing to use Logistic Regression going forward.

### Algorithm Comparison
Random Forest is better than the other two ensemble models (DT & GBC) at guessing both positive and negative reviews, but not as good as LogR or SVM.

Decision Tree is decent, and doesn't have terribly unbalanced approach to guessing either high or low reviews, but isn't as good as SVM or LogR

GBC does reasonably well at guessing low reviews, but is mediocre at guessing positive reviews

KNN is basically guessing with regards to which reviews are >4, and not very well at that.  It does decently well at guessing negative reviews, but even there it suffers compared to the other models.

Logistic Regression ties with SVM for best performer, with far less runtime

Naive Bayes is not the greatest, but still does surprisingly well for such a simple model.

SVM does very well, though it takes a lot of time to run through the model


## Perform GridSearchCV to identify optimal parameters

In [16]:
# Gridsearch CV for the Logistic Regression algorithm, using l1 penalty & solvers

grid_param_LogR = {
    'penalty': ['l1'],
    'solver': ['liblinear','saga'],
    'tol' : [.0001,.0005,.001,.00005,.00001],
    'C' : [1,.9,.8,.7,.6,.5]
}

grid_search_LogR = GridSearchCV(estimator = lr,  
                              param_grid = grid_param_LogR,
                              scoring = 'neg_mean_squared_error',
                              cv = 5)

grid_search_LogR.fit(testing_X, testing_Y)
print('Logistic Regression recommended parameters:')
print(grid_search_LogR.best_params_)
print(' ')

RFC recommended parameters:
{'C': 1, 'penalty': 'l1', 'solver': 'liblinear', 'tol': 0.0001}
 


In [17]:
# Gridsearch CV for the Logistic Regression algorithm, using l2 penalty & solvers

grid_param_LogR = {
    'penalty': ['l2'],
    'solver': ['newton-cg', 'sag', 'lbfgs'],
    'tol' : [.0001,.0005,.001,.00005,.00001],
    'C' : [1,.9,.8,.7,.6,.5]
}

grid_search_LogR = GridSearchCV(estimator = lr,  
                              param_grid = grid_param_LogR,
                              scoring = 'neg_mean_squared_error',
                              cv = 5)

grid_search_LogR.fit(testing_X, testing_Y)
print('Logistic Regression recommended parameters:')
print(grid_search_LogR.best_params_)
print(' ')

Logistic Regression recommended parameters:
{'C': 1, 'penalty': 'l2', 'solver': 'sag', 'tol': 0.001}
 


### Test l1 vs l2, using GSCV-optimized parameters

In [20]:
# Logistic Regression, optimized l1

lr = LogisticRegression(C = 1, penalty = 'l1', solver = 'liblinear', tol = 0.0001)
lr.fit(training_X,training_Y)
print('Logistic Regression results:')
print(' ')
accuracy_report(testing_X, testing_Y, lr, 1)

Logistic Regression results:
 
Model score:
0.8445077228406247
 
Classification Report:
              precision    recall  f1-score   support

   >4 review       0.74      0.52      0.61      2727
   <4 review       0.87      0.94      0.90      8862

   micro avg       0.84      0.84      0.84     11589
   macro avg       0.80      0.73      0.76     11589
weighted avg       0.84      0.84      0.83     11589

AUC score:0.734
 
Model cross-valuation:
[0.82449332 0.84131091 0.83426845 0.8372896  0.82693138]


In [21]:
# Logistic Regression, optimized l2

lr = LogisticRegression(C = 1, penalty = 'l2', solver = 'sag', tol = 0.001)
lr.fit(training_X,training_Y)
print('Logistic Regression results:')
print(' ')
accuracy_report(testing_X, testing_Y, lr, 1)

Logistic Regression results:
 
Model score:
0.8450254551730089
 
Classification Report:
              precision    recall  f1-score   support

   >4 review       0.75      0.51      0.61      2727
   <4 review       0.86      0.95      0.90      8862

   micro avg       0.85      0.85      0.85     11589
   macro avg       0.81      0.73      0.76     11589
weighted avg       0.84      0.85      0.83     11589

AUC score:0.731
 
Model cross-valuation:
[0.8262182  0.84519189 0.83513164 0.83815278 0.8256366 ]


### Result:
The difference between the optimized l1 and l2 variations is minimal, but I've chosen to use the l1 version due to the (fractionally) higher AUC and f1-scores it provides.

## Running the model on the entire videogame dataset

In [75]:
# reset the working_df to the entire videogame dataset, and run the data-cleaning and vectorization functions on it
working_df = all_data_vg
working_df = prep_data(working_df)
working_df = vectorization(working_df)

 


In [76]:
# Run the sampling setup again
training_fraction = .8
training_X = working_df.sample(frac=training_fraction, random_state=10)
testing_X = working_df.drop(training_X.index)
training_Y = training_X['positive_review']
testing_Y = testing_X['positive_review']
training_X.drop('positive_review', axis=1, inplace=True)
testing_X.drop('positive_review', axis=1, inplace=True)
print(" ")

 


In [77]:
# Logistic Regression, optimized l1, entire videogame dataset, final run

lr = LogisticRegression(C = 1, penalty = 'l1', solver = 'liblinear', tol = 0.0001)
lr.fit(training_X,training_Y)
print('Logistic Regression results:')
print(' ')
accuracy_report(testing_X, testing_Y, lr, 1)

Logistic Regression results:
 
Model score:
0.8450901717145569
 
Classification Report:
              precision    recall  f1-score   support

   >4 review       0.75      0.56      0.64     11469
   <4 review       0.87      0.94      0.90     34887

   micro avg       0.85      0.85      0.85     46356
   macro avg       0.81      0.75      0.77     46356
weighted avg       0.84      0.85      0.84     46356

AUC score:0.749
 
Model cross-valuation:
[0.83628128 0.83272217 0.84359832 0.83928379 0.84951456]


## Final Result, videogame review dataset
%84.5 accuracy is a great result, given that I was told to aim for a score of ~%70 as my target for this exercise. Given that it also has a decent (though not great) balance between f1-scores for >4 and <4 reviews, and less than %2 variation between cross-valuation folds, this is a pretty reasonable outcome for a first try at text analysis.

## Testing the model on another dataset
However, the assignment also tasked me with running another dataset through the same setup, to see whether an unrelated set of reviews would also manage to perform well in the model.  Let's have a go at the Health and Personal Care products reviews...

In [78]:
# Upload health and personal care dataset
healthproducts_basedata = pd.read_json('reviews_Health_and_Personal_Care_5.json', lines=True)

In [79]:
# reset the working_df to the entire health products dataset, and run the processing functions on it
working_df = healthproducts_basedata
working_df = prep_data(working_df)
working_df = vectorization(working_df)

 


In [80]:
# Run the sampling setup again
training_fraction = .8
training_X = working_df.sample(frac=training_fraction, random_state=10)
testing_X = working_df.drop(training_X.index)
training_Y = training_X['positive_review']
testing_Y = testing_X['positive_review']
training_X.drop('positive_review', axis=1, inplace=True)
testing_X.drop('positive_review', axis=1, inplace=True)
print(" ")

 


In [82]:
# Logistic Regression, optimized l1, entire health products dataset, comparison run

lr = LogisticRegression(C = 1, penalty = 'l1', solver = 'liblinear', tol = 0.0001)
lr.fit(training_X,training_Y)
print('Logistic Regression results:')
print(' ')
accuracy_report(testing_X, testing_Y, lr, 1)

Logistic Regression results:
 
Model score:
0.8474108934474744
 
Classification Report:
              precision    recall  f1-score   support

   >4 review       0.70      0.37      0.49     13423
   <4 review       0.86      0.96      0.91     55848

   micro avg       0.85      0.85      0.85     69271
   macro avg       0.78      0.67      0.70     69271
weighted avg       0.83      0.85      0.83     69271

AUC score:0.668
 
Model cross-valuation:
[0.84749188 0.84424396 0.84821364 0.84357179 0.84638706]


## H&PC Results
Well that went even better than the videogame reviews did, at least in terms of total accuracy.  It seems to have a harder time accurately picking out positive reviews (lower >4 review f1-score), likely due to the decreased ratio of positive to negative reviews in this dataset.  

This isn't entirely surprising; the preparation given to the videogame reviews was content-agnostic, and the Logistic Regression algorithm isn't at all dependent on the data having a particular content that would be likely to change between review datasets.  Because the model is already well set up to deal with a large text vector, it can deal with either dataset with equal ease, albeit with the previous caveats about the ratio between high and low reviews.

It's worth noting that it's entirely possible that a different algorithm might perform with more accuracy on the lower >4/<4 ratio, and testing this would likely make for a good follow-on assignment.