# Modeling v2.0

My first set of models didn't perform very well. I only had 4 features (one of which was not very important), so that seems like an area for improvement. Some additional features which could be helpful are:

* A sentiment rating for the review text, rating it as either negative, positive or neutral
* A listing of the top 3 most similar books, based on a cosine similarity analysis of each book's text description

For the second feature, I'll turn that list of most similar books into model inputs by looking at:
* The average rating of each of those books
* The average rating assigned to those books by that reader's cluster
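
As a rough illustration of the similarity step, the sketch below ranks books by the cosine similarity of their descriptions using plain word counts. The toy `descriptions` dictionary, the book ids and the `top_similar` helper are all assumptions made up for this example, and the real analysis may well use a different text representation (TF-IDF, for instance).

```python
import numpy as np

# Toy descriptions keyed by hypothetical book ids (not from the real dataset)
descriptions = {
    'b1': 'detective solves a murder in london',
    'b2': 'vampires haunt london and a detective hunts them',
    'b3': 'a superhero saves the city from villains',
    'b4': 'a young superhero learns to protect the city',
    'b5': 'detective hunts a murderer across london',
}

book_ids = list(descriptions)
vocab = sorted({word for text in descriptions.values() for word in text.split()})
word_index = {word: i for i, word in enumerate(vocab)}

# Build a books x vocabulary matrix of raw word counts
counts = np.zeros((len(book_ids), len(vocab)))
for row, book_id in enumerate(book_ids):
    for word in descriptions[book_id].split():
        counts[row, word_index[word]] += 1

# Cosine similarity: dot products of the length-normalised rows
unit = counts / np.linalg.norm(counts, axis=1, keepdims=True)
similarity = unit @ unit.T

def top_similar(book_id, n=3):
    """Return the n most similar other books, most similar first."""
    row = book_ids.index(book_id)
    order = np.argsort(-similarity[row])
    return [book_ids[i] for i in order if i != row][:n]

print(top_similar('b1'))
```

Once each book has its list of similar books, the two ratings listed above can be looked up for those ids and added as columns.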

__First, reload our data:__

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Here is our previous model's data, all cleaned and ready to go
data = pd.read_csv('data/model1_data_cleaned.csv', index_col=['user_id','book_id'])
data.head()

user_id,book_id,rating,user_avg_rating,book_cluster,user_avg_rating_by_cluster,book_avg_rating
000192962b87d560f00b06fdcbd71681,30025791,5,5.0,0,5.0,4.19
0005a08accd53b1e19c52109a1f478cb,59960,0,3.4,4,0.0,4.25
000700ecd5db3a9b0c4e392ed2e4f70b,11790194,5,5.0,0,5.0,4.04
0008931c0cde961e9c802c5a58196d23,500503,5,5.0,1,5.0,4.29
0008931c0cde961e9c802c5a58196d23,6081685,5,5.0,1,5.0,4.4


## First Feature: Sentiment Rating

For each review, we want to run a sentiment analysis on the review text to assign it a negative, positive or neutral rating.

In [3]:
# Need to reload our review dataframe
reviews = pd.read_csv('data/reviews_step3_output.csv')
reviews.head()

Unnamed: 0.1,Unnamed: 0,review_id,user_id,book_id,rating,review_text,year
0,0,66b2ba840f9bd36d6d27f46136fe4772,dc3763cdb9b2cae805882878eebb6a32,18471619,3,Sherlock Holmes and the Vampires of London \n ...,2013
1,1,72f1229aba5a88f9e72f0dcdc007dd22,bafc2d50014200cda7cb2b6acd60cd73,6315584,4,"I've never really liked Spider-Man. I am, howe...",2016
2,2,a75309355f8662caaa5e2c92ab693d3f,bafc2d50014200cda7cb2b6acd60cd73,29847729,4,"A very quick introduction, this is coming out ...",2016
3,3,c3cc5a3e1d6b6c9cf1c044f306c8e752,bafc2d50014200cda7cb2b6acd60cd73,18454118,5,I've been waiting so long for this. I first st...,2014
4,4,cc444be37ab0a42bfb4dd818cb5edd10,bafc2d50014200cda7cb2b6acd60cd73,2239435,4,The only thing more entertaining than this boo...,2013


In [4]:
# !pip install nltk
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\mdurr\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [5]:
sid = SentimentIntensityAnalyzer()
print(reviews.review_text.iloc[0])
print(sid.polarity_scores(reviews.review_text.iloc[0]))

Sherlock Holmes and the Vampires of London 
 Release Date: April 2014 
 Publisher: Darkhorse Comics 
 Story by: Sylvain Cordurie 
 Art by: Laci 
 Colors by: Axel Gonzabo 
 Cover by: Jean Sebastien Rossbach 
 ISDN: 9781616552664 
 MSRP: $17.99 Hardcover 
 "Sherlock Holmes died fighting Professor Moriarty in the Reichenbach Falls. 
 At least, that's what the press claims. 
 However, Holmes is alive and well and taking advantage of his presumed death to travel the globe. 
 Unfortunately, Holmes's plans are thwarted when a plague of vampirism haunts Britain. 
 This book collects Sherlock Holmes and the Vampires of London Volumes 1 and 2, originally created by French publisher Soleil." - Darkhorse Comics 
 When I received this copy of "Sherlock Holmes and the Vampires of London" I was Ecstatic! The cover art was awesome and it was about two of my favorite things, Sherlock Holmes and Vampires. I couldn't wait to dive into this! 
 Unfortunately, that is where my excitement ended. The story ta

Per the notes on this page: https://github.com/cjhutto/vaderSentiment#about-the-scoring, the typical thresholds for a unidimensional measure of sentiment using this methodology are:
* positive sentiment: compound score >= 0.05
* neutral sentiment: compound score strictly between -0.05 and 0.05
* negative sentiment: compound score <= -0.05

So those are the thresholds I will use.

In [6]:
# polarity_scores returns a dict; its 'compound' entry is the overall score
print(sid.polarity_scores(reviews.review_text[3])['compound'])


0.9509


In [10]:
# Create a function to return the sentiment label of a given text
def get_sentiment(text):
    # Non-string values (e.g. NaN for a missing review) cannot be scored
    if not isinstance(text, str):
        return np.nan
    # Reuse the analyzer created above instead of rebuilding it on every call
    score = sid.polarity_scores(text)['compound']
    if score >= 0.05:
        return 'positive'
    elif score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

In [8]:
# Create a function to access the review_text for a given tuple of user & book
def get_review_text(user_id, book_id):
    return reviews[(reviews['user_id']==user_id)
              & (reviews['book_id']== book_id)]['review_text'].values[0]

In [11]:
print(get_sentiment(get_review_text('dc3763cdb9b2cae805882878eebb6a32', 18471619)))

positive


In [12]:
# Create a review sentiment column in data df
# (%time runs the statement once; %timeit would repeat it several times)
%time data['review_sentiment'] = [get_sentiment(get_review_text(x,y)) for (x,y) in data.index]

KeyboardInterrupt: 
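
The loop above is slow because `get_review_text` scans the whole `reviews` frame once for every row of `data`. One possible alternative, sketched here on toy frames, is to label each review once and then join the labels onto `data` through its `(user_id, book_id)` index. `toy_sentiment` is a hypothetical stand-in for `get_sentiment` so the example runs without the VADER lexicon, and keeping only the first review per user and book pair is an assumption, not something the data guarantees.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real frames (all values are made up)
reviews_toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3'],
    'book_id': [10, 11, 10, 12],
    'review_text': ['I loved it', 'It was awful', np.nan, 'It exists'],
})
data_toy = pd.DataFrame(
    {'rating': [5, 1, 3]},
    index=pd.MultiIndex.from_tuples(
        [('u1', 10), ('u1', 11), ('u2', 10)], names=['user_id', 'book_id']),
)

def toy_sentiment(text):
    # Hypothetical stand-in for get_sentiment: keyword match instead of VADER
    if not isinstance(text, str):
        return np.nan
    if 'loved' in text:
        return 'positive'
    if 'awful' in text:
        return 'negative'
    return 'neutral'

# Label every review exactly once
reviews_toy['review_sentiment'] = reviews_toy['review_text'].apply(toy_sentiment)

# Keep one label per (user_id, book_id) pair, then align on data_toy's index
labels = (reviews_toy
          .drop_duplicates(subset=['user_id', 'book_id'])
          .set_index(['user_id', 'book_id'])['review_sentiment'])
data_toy = data_toy.join(labels)
print(data_toy)
```

Rows of the model data without a matching review, or whose review text is missing, end up with NaN in the new column.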

In [13]:
data.head()

user_id,book_id,rating,user_avg_rating,book_cluster,user_avg_rating_by_cluster,book_avg_rating
000192962b87d560f00b06fdcbd71681,30025791,5,5.0,0,5.0,4.19
0005a08accd53b1e19c52109a1f478cb,59960,0,3.4,4,0.0,4.25
000700ecd5db3a9b0c4e392ed2e4f70b,11790194,5,5.0,0,5.0,4.04
0008931c0cde961e9c802c5a58196d23,500503,5,5.0,1,5.0,4.29
0008931c0cde961e9c802c5a58196d23,6081685,5,5.0,1,5.0,4.4


In [25]:
# Try the above using df['col'].apply() instead of the loop, then compare
# the number of reviews, labelled reviews, and rows in the model data
print(len(reviews))
print(reviews.review_sentiment.notnull().sum())
print(len(data))

542338
542015
385752


In [None]:
# Attach the per-review sentiment labels held in `sentiments` to the reviews df
reviews['review_sentiment'] = sentiments
reviews.head()