# Book Recommendation System with Amazon Review Data



This is a book recommendation system specialized in children's books. The Amazon review data has several features that can be used in machine learning models. Here we use the overall book ratings, the book descriptions, and the text reviews.

- "Overall ratings" are used for collaborative filtering, which recommends books based on what similar users (those who like similar books) chose.
- Both "book descriptions" and "text reviews" are used in the vectorizer model and combined with FeatureUnion. Using the book description feature gives (qualitatively) better recommendations.
- "Text reviews" are used in the NLP word embedding model. Word embeddings are effective at picking up the semantic meaning of words in the corpus. We calculate word vectors for each book and then use cosine similarities to determine the "distances" between all books.

In [1]:
# import libraries
import os
import ujson as json
import gzip
import pandas as pd

from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup

In [2]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn import base
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
from scipy.sparse import csr_matrix
import scipy.sparse

In [3]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
import re

from spacy.lang.en.stop_words import STOP_WORDS
import spacy

import dill

from functools import reduce

import ipywidgets as widgets
from ipywidgets import interactive, interact

Below we load the necessary data files to speed up the recommendation process. The detailed steps for each model can be found in the notebooks branch.

## 1. Data Preprocessing for the Collaborative Filtering

See the details here. https://github.com/pegasuss81/book-recommendation/blob/notebooks/Ratings_model.ipynb 

Starting from the cleaned dataframe, we create a pivot table with index="title", columns="reviewerID", and values="overall" (the rating). We convert this pivot table to a sparse matrix and use a Nearest Neighbors model to select the N books nearest to the chosen book.
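
As a minimal sketch of this step, the cell below builds the pivot table, the sparse matrix, and the neighbor lookup on a hypothetical toy ratings table rather than the Amazon data. The names `toy_reviews`, `toy_pivot`, and `toy_knn` are illustrative only, and the cosine metric with brute-force search is an assumption, since the settings of the saved model are not shown in this notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy reviews; the real data has one row per Amazon review
toy_reviews = pd.DataFrame({
    "title":      ["A", "A", "B", "B", "C", "C"],
    "reviewerID": ["u1", "u2", "u1", "u2", "u3", "u4"],
    "overall":    [5, 4, 4, 5, 2, 5],
})

# Rows are books, columns are reviewers, missing ratings become 0
toy_pivot = toy_reviews.pivot_table(
    index="title", columns="reviewerID", values="overall"
).fillna(0)

# Assumed settings: cosine distance with brute-force search
toy_knn = NearestNeighbors(metric="cosine", algorithm="brute")
toy_knn.fit(csr_matrix(toy_pivot.values))

# Distances from book "A" to every book, nearest first
distances, indices = toy_knn.kneighbors(
    toy_pivot.loc["A", :].values.reshape(1, -1), n_neighbors=len(toy_pivot)
)
nearest_titles = toy_pivot.index[indices.flatten()].tolist()
```

Books "A" and "B" share reviewers with similar ratings, so "B" comes out as the nearest neighbor of "A", while "C" has no reviewers in common and sits at the maximum cosine distance of 1.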

In [4]:
# Dataframe for every review
df_merge_new_in = pd.read_csv('data/df_merge_with_URL.csv')

In [5]:
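# Pivot the reviews in chunks of rows (presumably to limit memory use), then stack the pieces.
# Note: iloc end bounds are exclusive, so the "- 1" drops the last row of each chunk,
# and rows after the final chunk boundary are not pivoted at all.
# Left as-is because the pre-saved model_knn loaded below presumably depends on this exact table.
chunk_size = 5000
chunks = [x for x in range(0, df_merge_new_in.shape[0], chunk_size)]

df_merge_pivot = pd.concat([df_merge_new_in.iloc[ chunks[i]:chunks[i + 1] - 1 ].pivot_table(index='title', columns='reviewerID', values='overall') for i in range(0, len(chunks) - 1)])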

In [6]:
df_merge_pivot.fillna(0, inplace=True)

We load the pre-saved nearest neighbors model.

In [7]:
# Nearest Neighbors Model
with open("data/model_knn.dill", "rb") as f:
    model_knn = dill.load(f)

## 2. Preprocessing for Vectorizer + FeatureUnion Using Description & ReviewText

Here we use DictVectorizer to encode which words appear in the book descriptions and in the text reviews. FeatureUnion then combines the two feature sets, with a separate weight for "reviewText" and for the book "description".
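
To illustrate how the weights act, here is a small sketch on hypothetical toy data. It assumes that each cell holds a list of tokens, and it uses `FunctionTransformer` as a stand-in for the encoder class defined below; `toy_books`, `make_pipe`, and the weights 0.2 and 1.0 are illustrative choices, not values taken from the real pipeline.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data: each cell is assumed to hold a list of tokens
toy_books = pd.DataFrame({
    "reviewText":  [["fun", "bedtime"], ["fun", "rhymes"]],
    "description": [["snow", "city"], ["snow", "farm"]],
})

def make_pipe(col):
    # Map one column to {token: 1} dictionaries, then vectorize them
    encoder = FunctionTransformer(
        lambda X: X[col].apply(lambda tokens: {t: 1 for t in tokens})
    )
    return Pipeline([("encoder", encoder), ("vectorizer", DictVectorizer())])

toy_union = FeatureUnion(
    [("reviewText", make_pipe("reviewText")),
     ("description", make_pipe("description"))],
    # Each block of columns is multiplied by its weight before stacking
    transformer_weights={"reviewText": 0.2, "description": 1.0},
)
toy_features = toy_union.fit_transform(toy_books).toarray()
```

The first three columns of `toy_features` come from the review tokens and are scaled by 0.2, while the last three come from the description tokens and keep their full weight, so description overlap dominates any cosine distance computed on the stacked matrix.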

In [8]:
# Dataframe sorted for each book/title
df_merge_review_URL = pd.read_csv('data/df_merge_review_title_with_URL.csv')

In [9]:
# Prepare data as a dictionary that can be fed into DictVectorizer
class DictEncoder(base.BaseEstimator, base.TransformerMixin):
    
    def __init__(self, col):
        self.col = col
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        
        def to_dict(l):
            try:
                return {x: 1 for x in l}
            except TypeError:
                return {}
        
        return X[self.col].apply(to_dict)

In [10]:
merge_review_pipe = Pipeline([
    ('encoder', DictEncoder('reviewText')),
    ('vectorizer', DictVectorizer())
])
merge_desc_pipe = Pipeline([
    ('encoder', DictEncoder('description')),
    ('vectorizer', DictVectorizer())
])

## 3. Preprocessing for Word2Vec Model

The details can be found here. https://github.com/pegasuss81/book-recommendation/blob/notebooks/nlp_model.ipynb

We merge all text reviews for each book and calculate the average word vectors for each.
The precalculated cosine similarities are loaded from the dill file below.
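
The averaging and similarity steps can be sketched as follows. This is only an illustration: the 3-dimensional `toy_vectors` are hypothetical stand-ins for vectors that a trained Word2Vec model would supply, and `toy_tokens` stands in for the merged review text of each book.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a trained Word2Vec model
# would supply these (typically with many more dimensions)
toy_vectors = {
    "snow": np.array([1.0, 0.0, 0.0]),
    "city": np.array([0.0, 1.0, 0.0]),
    "farm": np.array([0.0, 0.0, 1.0]),
}

# Merged review tokens per book (toy data)
toy_tokens = {
    "Book A": ["snow", "city"],
    "Book B": ["snow", "city", "snow"],
    "Book C": ["farm"],
}

def average_vector(tokens, vectors):
    # Mean of the vectors of all tokens that have a known vector
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0)

book_matrix = np.vstack(
    [average_vector(tokens, toy_vectors) for tokens in toy_tokens.values()]
)

# Cosine similarity: dot products of the row-normalized book vectors
unit_rows = book_matrix / np.linalg.norm(book_matrix, axis=1, keepdims=True)
toy_similarities = unit_rows @ unit_rows.T
```

Subtracting the similarities from 1 gives a distance, which is how the Word2Vec recommender below uses the loaded matrix.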


In [11]:
# finding cosine similarity for the vectors
with open("data/cosine_similarities.dill", "rb") as f:
    cosine_similarities = dill.load(f)
    

## Collection of Recommendation Models

Below is the summary of three recommendation models that are going to be combined later with the user input. Each recommender outputs the titles and distances relative to the book of choice. 

#### Recommender from user ratings - collaborative filtering

In [12]:
# Recommender from user ratings - collaborative filtering. pivot table + NearestNeighbors
def book_recommender_collab(string):
    
    title = df_merge_pivot[df_merge_pivot.index.str.lower().str.contains(str.lower(string))].index[0]
    
    distances, indices = model_knn.kneighbors(df_merge_pivot.loc[title, :].values.reshape(1, -1), n_neighbors=21908)
    titles = df_merge_pivot.index[np.array(indices.flatten())]
    
    return titles, distances.flatten()

#### Recommender from vectorizer + FeatureUnion

In [13]:
# Recommender using reviewText and description for each book. vectorizer + FeatureUnion + NearestNeighbors
def book_recommender_text_features(w1, w2, string):
    """
    book recommendation system using
    w1: weight for review feature
    w2: weight for description feature
    string: substring of a title
    """
    union_merge = FeatureUnion([('reviewText', merge_review_pipe),
                      ('description', merge_desc_pipe)],
                    transformer_weights={
            'reviewText': w1,
            'description': w2
        })
    features_merge_review = union_merge.fit_transform(df_merge_review_URL)
     
    union_merge_review_model = NearestNeighbors(metric='cosine', algorithm='brute')
    union_merge_review_model.fit(features_merge_review)
        
    index1 = df_merge_review_URL[df_merge_review_URL.title.str.lower().str.contains(str.lower(string))].index[0]
    title1 = df_merge_review_URL[df_merge_review_URL.title.str.lower().str.contains(str.lower(string))]['title'].values[0]
    
    distances, indices = union_merge_review_model.kneighbors(features_merge_review[index1], n_neighbors=df_merge_review_URL.shape[0])
    titles = df_merge_review_URL['title'][df_merge_review_URL.index[np.array(indices.flatten())]].tolist()
   
    return titles, distances.flatten()

#### Recommender from the Word2Vec model

In [14]:
# Recommender using Word2Vec model
def book_recommender_wv(string):
    # Row label of the first title containing the query (case-insensitive).
    # It is also used as a position in cosine_similarities, which assumes the default RangeIndex.
    mask = df_merge_review_URL.title.str.lower().str.contains(str.lower(string), na=False)
    idx = df_merge_review_URL[mask].index[0]
    
    # Convert similarities to distances; kept as (row, distance) pairs
    # because combined_model extracts the second element of each pair
    dist_scores = list(enumerate(1 - cosine_similarities[idx]))
    
    titles = df_merge_review_URL['title'].tolist()
   
    return titles, dist_scores

The helper below converts the output of each model into a dataframe, so that the results can later be merged to calculate the combined distance metric.

In [15]:
def to_dataframe(rec_tuple):
    df = pd.DataFrame(rec_tuple).T
    df.columns = ["title", "distance"]
    #df.columns = ["title", "distance", "URL", "image"]
    return df

## Combined Model

Here we combine all three models into a single distance metric: a weighted sum of the distances from each model. The weights are provided by the user.
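
As a rough sketch of the combination step, the example below merges three hypothetical distance tables on the title and ranks the books by the weighted sum. The tables, the column values, and the weights are made up for illustration.

```python
from functools import reduce

import pandas as pd

# Hypothetical distance tables from three recommenders
toy_collab = pd.DataFrame({"title": ["A", "B", "C"], "dist_collab": [0.0, 0.4, 0.9]})
toy_vect = pd.DataFrame({"title": ["A", "B", "C"], "dist_vect": [0.0, 0.8, 0.2]})
toy_wv = pd.DataFrame({"title": ["A", "B"], "dist_wv": [0.0, 0.5]})

# Outer join keeps titles that are missing from some model (as NaN)
toy_join = reduce(
    lambda left, right: pd.merge(left, right, on="title", how="outer"),
    [toy_collab, toy_vect, toy_wv],
)

# Assumed example weights; in the notebook they come from the user
w_collab, w_vect, w_wv = 1.0, 0.5, 1.0
toy_join["dist_metric"] = (
    w_collab * toy_join["dist_collab"]
    + w_vect * toy_join["dist_vect"]
    + w_wv * toy_join["dist_wv"]
)

# Smallest combined distance first; rows with NaN sort to the end
toy_ranked = toy_join.sort_values("dist_metric")["title"].tolist()
```

Note that a book missing from any one model gets a NaN combined distance and therefore ends up at the bottom of the ranking. The query book itself has distance 0 in every model, so it ranks first.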

In [22]:
def combined_model(w_collab=1.0, w_vect_desc=1.0, w_vect_review=0.2, w_feature_union=1.0, w_wv=1.0, string="Bambi", n_rec=5):
    df_collab = to_dataframe(book_recommender_collab(string))
    # book_recommender_text_features expects the review weight first, then the description weight
    df_vect = to_dataframe(book_recommender_text_features(w_vect_review, w_vect_desc, string))
    df_wv = to_dataframe(book_recommender_wv(str.lower(string)))
    df_wv['distance'] = df_wv['distance'].str[1]
    
    df_join = reduce(lambda left, right: pd.merge(left,right,on=['title'],
                                            how='outer'), [df_collab, df_vect, df_wv])
    df_join.columns = ['title', 'dist_collab', 'dist_vect', 'dist_wv']
    #df_join.columns = ['title', 'dist_collab', 'dist_vect', 'dist_wv', 'URL', 'image']
    df_join['dist_metric'] = w_collab * df_join['dist_collab'] + w_feature_union * df_join['dist_vect'] \
            + w_wv * df_join['dist_wv']

    return df_join.sort_values('dist_metric')[["title"]].head(n_rec), df_join

In [17]:
title_text = widgets.Text(
    value='The snowy day',
    placeholder='Type something',
    description="What is your child's favorite book?",
    disabled=False,
    style= {'description_width': 'initial'}
)

In [18]:
button = widgets.Button(
    description='Submit',
    disabled=False,
    button_style='', 
    tooltip='Run report',
    icon='check' 
)

In [19]:
#layout = widgets.Layout(width='auto', height='40px') #set width and height
first_weight = widgets.IntSlider(
    value=5, min=0, max=10, 
    description="Similar reviewers", 
    style= {'description_width': 'initial'})
second_weight = widgets.IntSlider(
    value=5, min=0, max=10, 
    description="Similar book descriptions", 
    style= {'description_width': 'initial'})
third_weight = widgets.IntSlider(
    value=5, min=0, max=10, 
    description="Similar text reviews", 
    style= {'description_width': 'initial'})

n_rec = widgets.IntSlider(
    value=5, min=0, max=10, 
    description="number of recommendations", 
    style= {'description_width': 'initial'})

In [20]:
box = widgets.VBox([title_text, widgets.VBox([first_weight, second_weight, third_weight]), n_rec, button])
display(box)

In [21]:
# Pass the widgets themselves (not their current .value) so the sliders defined above drive the model
recommendations = interact(combined_model, w_collab=first_weight, w_vect_desc=1.0, w_vect_review=0.2, w_feature_union=second_weight, w_wv=third_weight, string=title_text, n_rec=n_rec)

## Example Results
### The recommendation result for "The Way Back Home"
<img src="results/TheWayBackHome.png" style="width:600px"/>

### The recommendation results for "The Snowy Day"
<img src="results/TheSnowyDay.png" style="width:450px"/>

In [24]:
df_join_metric, df_join = combined_model(w_collab=0.5, w_vect_desc=1.0, w_vect_review=0.2, w_feature_union=0.2, w_wv=1.0, string="Bambi", n_rec=5)

Now we want to quickly check how much the recommendation results differ between models. In the plots below, we can see that the distributions of the relative distance metric are quite different for the different models. Hovering over the points in each plot shows that the recommended books also differ somewhat from model to model. (Results are shown for the example of "Bambi.")
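
One simple way to put a number on this difference, offered here only as a sketch, is the overlap between the top-k lists of each pair of models. The table `toy_dist` below is hypothetical; on the real data the same helper could be applied to a frame with these three distance columns.

```python
from itertools import combinations

import pandas as pd

# Hypothetical per-model distances for five books
toy_dist = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "dist_collab": [0.1, 0.2, 0.9, 0.8, 0.3],
    "dist_vect":   [0.2, 0.9, 0.1, 0.8, 0.3],
    "dist_wv":     [0.9, 0.1, 0.2, 0.8, 0.3],
})

def top_k(df, column, k=3):
    # Titles of the k books with the smallest distance in one model
    return set(df.nsmallest(k, column)["title"])

# Jaccard overlap between the top-k sets of every pair of models
toy_overlap = {}
for a, b in combinations(["dist_collab", "dist_vect", "dist_wv"], 2):
    set_a, set_b = top_k(toy_dist, a), top_k(toy_dist, b)
    toy_overlap[(a, b)] = len(set_a & set_b) / len(set_a | set_b)
```

An overlap of 1 means two models recommend the same k books, and 0 means they share none.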

In [58]:
from bokeh.io import output_file, show, push_notebook, output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool, LinearColorMapper
from bokeh.palettes import plasma
from bokeh.transform import transform
output_notebook()

source = ColumnDataSource(data=df_join.head(30)) 
hover = HoverTool(tooltips=[
    ("title", "@title")
])
mapper = LinearColorMapper(palette=plasma(256), low=min(df_join.head(30)['dist_metric'].tolist()), high=max(df_join.head(30)['dist_metric'].tolist()))

fig1 = figure(plot_width=400, plot_height=400, tools=[hover], title="")
fig1.circle('dist_collab','dist_metric', size=10, source=source,
         fill_color=transform('dist_metric', mapper)) 
fig1.xaxis.axis_label = 'dist collab'
fig1.yaxis.axis_label = 'dist metric'

# create the second plot
mapper = LinearColorMapper(palette=plasma(256), low=min(df_join.head(30)['dist_metric'].tolist()), high=max(df_join.head(30)['dist_metric'].tolist()))
fig2 = figure(plot_width=400, plot_height=400, tools=[hover], title="")
fig2.circle('dist_vect','dist_metric', size=10, source=source,
         fill_color=transform('dist_metric', mapper)) 
fig2.xaxis.axis_label = 'dist vectorizer'
fig2.yaxis.axis_label = 'dist metric'
 
# create the third plot
mapper = LinearColorMapper(palette=plasma(256), low=min(df_join.head(30)['dist_metric'].tolist()), high=max(df_join.head(30)['dist_metric'].tolist()))
fig3 = figure(plot_width=400, plot_height=400, tools=[hover], title="")
fig3.circle('dist_wv', 'dist_metric', size=10, source=source,
         fill_color=transform('dist_metric', mapper)) 
fig3.xaxis.axis_label = 'dist word2vec'
fig3.yaxis.axis_label = 'dist metric' 
# depict visualization
output_file('results/distance_models.html')
show(row(fig1, fig2, fig3), notebook_handle=True)

## Conclusion

The resulting recommendations seem reasonable for our example book, "The Way Back Home": they are all picture books for a similar age group (4-8 or 3-7 years old). The evaluation of the models is highly qualitative. 
To work more efficiently with the large data file, the project could be extended to use Spark, with the ML models adapted accordingly. 