In [2]:
import pandas as pd

# Amazon Books Recommender System

***Context:***

Recommender systems are widespread across user-facing industries, from Netflix and Amazon to LinkedIn and YouTube. Because large volumes of content, products, and posts are generated daily, users can be overwhelmed with information, and simple searches may not return results that are relevant to them.

Recommender systems address this problem: they are a subclass of information filtering systems that suggest the items most pertinent to a particular user. For example, over 70% of watch time on YouTube is spent on videos that the underlying recommender algorithm suggests. In this project, a recommender system will be designed for Amazon shoppers that recommends books they are most likely to enjoy, based on the reviews of others.


***Problem Statement:***

_How can we accurately recommend new books to users with an item-item approach, encouraging them to expand their preferences to books they may not have found on their own?_


In [3]:
# Import Ratings Data into Pandas DF
books = pd.read_csv('../data/Books.csv')
books.head()

Unnamed: 0,0001713353,A1C6M8LCIX4M6M,5.0,1123804800
0,1713353,A1REUF3A1YCPHM,5.0,1112140800
1,1713353,A1YRBRK2XM5D5,5.0,1081036800
2,1713353,A1V8ZR5P78P4ZU,5.0,1077321600
3,1713353,A2ZB06582NXCIV,5.0,1475452800
4,1713353,ACPQVNRD3Z09X,5.0,1469750400
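
The header row above is itself a rating record (`0001713353, A1C6M8LCIX4M6M, 5.0, 1123804800`), which suggests the CSV has no header line and that `read_csv` consumed the first review as column names. Below is a minimal sketch of one way to avoid that, assuming the file really is headerless. The in-memory sample (modeled on the rows above) and the choice to read the ID columns as strings are assumptions for illustration.

```python
import io

import pandas as pd

# Two sample rows modeled on the output above, standing in for Books.csv
sample = io.StringIO(
    "0001713353,A1C6M8LCIX4M6M,5.0,1123804800\n"
    "0001713353,A1REUF3A1YCPHM,5.0,1112140800\n"
)

ratings_sample = pd.read_csv(
    sample,
    header=None,  # treat the first line as data, not as column names
    names=["bookID", "userID", "rating", "timestamp"],
    dtype={"bookID": str, "userID": str},  # assumption: keep IDs as text
)
print(ratings_sample.shape)
```

Reading the IDs as strings keeps leading zeros intact, which may matter if `bookID` is later matched against the `asin` field of the metadata.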


In [4]:
# Import Metadata about the Books into Pandas Df

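import gzip
import json

def parse(path):
    # Yield one parsed JSON record per line of the gzipped file
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield json.loads(line)

def getDF(path):
    # Key each record by its position, then build the frame row by row
    df = {i: d for i, d in enumerate(parse(path))}
    return pd.DataFrame.from_dict(df, orient='index')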

metadata = getDF('../data/books_meta.gz')


In [None]:
metadata.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,[],,[It is a biology book with God&apos;s perspect...,,Biology Gods Living Creation Third Edition 10 ...,"[0669009075, B000K2P5SA, B00MD4G2N0, B000ASIPT...",,Keith Graham,[],"1,349,781 in Books (","[0019777701, B000AUCX7I, B000K2P5SA, B001CK63X...",Books,,,$39.94,0000092878,[],[],
1,"[Books, New, Used & Rental Textbooks, Medicine...",,[],,Mksap 16 Audio Companion: Medical Knowledge Se...,[],,Acp,[],"1,702,625 in Books (","[B01MUCYEV7, B01KUGTY6O]",Books,,,,000047715X,[],[],
2,"[Books, Arts & Photography, Music]",,"[Discography of American Punk, Hardcore, and P...",,"Flex! Discography of North American Punk, Hard...",[],,Burkhard Jarisch,[],"6,291,012 in Books (",[],Books,,,$199.99,0000004545,[],[],
3,"[Books, Arts & Photography, Music]",,[This is a collection of classic gospel hymns ...,,Heavenly Highway Hymns: Shaped-Note Hymnal,[],,Stamps/Baxter,[],"2,384,057 in Books (","[0006180116, 0996092730, B000QFOGY0, B06WWKNDL...",Books,,,,0000013765,[],[],
4,[],,[],,Georgina Goodman Nelson Womens Size 8.5 Purple...,[],,,[],"11,735,726 in Books (",[],Books,,,$164.10,0000000116,[],[],
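
If loading every field of the metadata file proves slow or memory-hungry, one option is to keep only the fields needed later. The following is a minimal sketch under that assumption; `load_fields`, the demo file, and its two records are hypothetical stand-ins, not part of the original notebook.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def load_fields(path, fields):
    """Stream a gzipped JSON-lines file, keeping only `fields` per record."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            # Missing fields become None instead of raising KeyError
            rows.append({name: record.get(name) for name in fields})
    return pd.DataFrame(rows, columns=fields)


# Tiny stand-in file so the sketch runs without the real dataset
demo_path = os.path.join(tempfile.mkdtemp(), "demo_meta.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write(json.dumps({"asin": "0000013765", "title": "Heavenly Highway Hymns: Shaped-Note Hymnal"}) + "\n")
    handle.write(json.dumps({"asin": "0000000116", "price": "$164.10"}) + "\n")

meta_small = load_fields(demo_path, ["asin", "title", "price"])
print(meta_small)
```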


### Understanding the Chosen Algorithm


![Recommender Systems, Categorized](../imgs/recommender_types.png)


Recommender systems tend to come in two main flavors, each with its own advantages: collaborative filtering and content-based methods. A more recent third approach, the hybrid method, combines the two.

This project takes the collaborative filtering approach, a technique that filters for items a user might like on the basis of reactions from other users. In its user-based form, it searches a large group of people to find a smaller set of users whose tastes resemble a particular user's, then combines the items they like into a ranked list of suggestions. The item-based form used here instead compares items: two books are considered similar when the same users tend to rate them alike.

In summary, our approach uses an item-based collaborative filtering algorithm, which falls within the memory-based category of the larger family of collaborative filtering algorithms. This approach offers several advantages, including scalability, accuracy, and recommendations that are much easier to explain to users.
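
To make the item-based idea concrete, here is a minimal sketch on made-up ratings. It assumes cosine similarity between book rating vectors and treats unrated cells as 0. Both are illustrative choices, and the notebook's eventual model may differ. The `most_similar` helper is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy ratings in the same long format as the ratings data (values are made up)
toy = pd.DataFrame({
    "userID": ["u1", "u1", "u2", "u2", "u3", "u3", "u3"],
    "bookID": ["A", "B", "A", "B", "A", "B", "C"],
    "rating": [5.0, 4.0, 4.0, 5.0, 1.0, 2.0, 5.0],
})

# Rows are books, columns are users; unrated cells become 0
matrix = toy.pivot_table(index="bookID", columns="userID", values="rating", fill_value=0)

# Cosine similarity between every pair of book rows
# (assumes every book has at least one nonzero rating, so no norm is 0)
values = matrix.to_numpy(dtype=float)
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
similarity = pd.DataFrame(unit @ unit.T, index=matrix.index, columns=matrix.index)


def most_similar(book_id, k=1):
    """Return the k books most similar to `book_id`, excluding itself."""
    return similarity[book_id].drop(book_id).nlargest(k)


print(most_similar("A"))
```

In this toy data, books A and B are rated alike by all three users, so B ranks as the closest neighbor of A. On the full dataset a dense matrix like this would not fit in memory, so a sparse representation would likely be needed.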

### Understanding the Data

We'll be working with two datasets, one containing rows of ratings from users about a particular book (`books`) and another containing the metadata about all the books rated (`metadata`). 

Each row of the `books` dataset has the form `bookID userID rating timestamp`, which is the format generally expected by collaborative filtering recommender algorithms.

The `metadata` dataset has information about each book that was rated, including the following fields:
- `asin` - ID of the product, e.g. 0000031852
- `title` - name of the product
- `feature` - bullet-point format features of the product
- `description` - description of the product
- `price` - price in US dollars (at time of crawl)
- `imageURL` - url of the product image
- `imageURLHighRes` - url of the high resolution product image
- `related` - related products (also bought, also viewed, bought together, buy after viewing); in this file they appear as the `also_buy` and `also_view` columns
- `rank` - sales rank information
- `brand` - brand name
- `category` - list of categories the product belongs to
- `tech1` - the first technical detail table of the product
- `tech2` - the second technical detail table of the product
- `similar_item` - similar product table

### Data Wrangling

In [None]:
# Add column headings to `books` df for clarification
books1 = books.set_axis(['bookID', 'userID', 'rating', 'timestamp'], axis=1)
books1.head()

Unnamed: 0,bookID,userID,rating,timestamp
0,1713353,A1REUF3A1YCPHM,5.0,1112140800
1,1713353,A1YRBRK2XM5D5,5.0,1081036800
2,1713353,A1V8ZR5P78P4ZU,5.0,1077321600
3,1713353,A2ZB06582NXCIV,5.0,1475452800
4,1713353,ACPQVNRD3Z09X,5.0,1469750400


In [None]:
# Get the shape of the `books` dataset
books1.shape

(51311620, 4)

In [None]:
# Get Descriptive Statistics
books1.describe()

Unnamed: 0,rating,timestamp
count,51311620.0,51311620.0
mean,4.393247,1398793000.0
std,1.045411,111058000.0
min,0.0,832550400.0
25%,4.0,1368922000.0
50%,5.0,1423699000.0
75%,5.0,1472170000.0
max,5.0,1538438000.0
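
The summary shows a minimum rating of 0.0, which falls outside the usual 1 to 5 star scale, and timestamps that look like Unix epoch seconds. Below is a hedged sketch of how both could be checked. The toy frame reuses values from the outputs above, and reading the timestamps as seconds is an assumption.

```python
import pandas as pd

# Toy frame: a 0.0 rating plus timestamps seen in the outputs above
toy_ratings = pd.DataFrame({
    "rating": [5.0, 0.0, 4.0],
    "timestamp": [1112140800, 832550400, 1475452800],
})

# Count ratings outside the assumed 1-5 star range
out_of_range = int((~toy_ratings["rating"].between(1, 5)).sum())

# Interpret timestamps as seconds since the Unix epoch (an assumption)
toy_ratings["rated_on"] = pd.to_datetime(toy_ratings["timestamp"], unit="s")

print(out_of_range)
print(toy_ratings["rated_on"].min(), toy_ratings["rated_on"].max())
```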


In [None]:
# Count missing user IDs
books1.userID.isnull().sum()


### Exploratory Data Analysis

How many unique users and unique books are in the dataset?

In [None]:
# Count number of Unique Users and Books
unique_users = books1.userID.nunique()
unique_books = books1.bookID.nunique()

In [None]:
print(f"There are {unique_books} unique books rated by {unique_users} unique users.")
print(f"In total, there are {books1.shape[0]} total reviews.")

There are 2930451 unique books rated by 15362619 unique users.
In total, there are 51311620 total reviews.


What is the average number of books that a user rated?

In [None]:
avg_num_ratings = books1.groupby('userID').size().mean()

In [None]:
print(f"The average number of books a user rated was {round(avg_num_ratings, 2)} books.")

The average number of books a user rated was 3.34 books.


What is the average number of ratings each book got?

In [None]:
avg_ratings_per_book = books1.groupby('bookID').size().mean()
print(f"The average number of ratings each book received was {round(avg_ratings_per_book, 2)}")

The average number of ratings each book received was 17.51
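
Both averages are simply the total number of reviews divided by the number of users or books. The same counts also give the density of the user-by-book rating matrix, a rough indicator of how sparse the data is. A small sketch using the figures printed above:

```python
# Counts taken from the outputs above
n_ratings = 51_311_620
n_users = 15_362_619
n_books = 2_930_451

avg_per_user = n_ratings / n_users
avg_per_book = n_ratings / n_books

# Fraction of the user-by-book matrix that actually holds a rating
density = n_ratings / (n_users * n_books)

print(f"{avg_per_user:.2f} ratings per user, {avg_per_book:.2f} ratings per book")
print(f"matrix density: {density:.2e}")
```

Such a low density suggests that nearly all user-book pairs are unrated. Averages can also hide heavy skew, so the median or a histogram of ratings per user may be worth checking as well.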
