# Recommender System Using Amazon Reviews

- Base Dataset: 
    - Main dataset page: 
        - https://nijianmo.github.io/amazon/index.html
    - Download dataset: 
        - http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Clothing_Shoes_and_Jewelry_5.json.gz


- Subset dataset used for the recommender system: 
    - Clothing_Shoes_and_Jewelry_5_reviewerID_asin_overall_unixReviewTime.csv
    

### Description of columns 
#### - Clothing_Shoes_and_Jewelry_5_reviewerID_asin_overall_unixReviewTime.csv
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
    - Ex) http://www.amazon.com/gp/cdp/member-reviews/A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
    - Ex) http://www.amazon.com/dp/0000013714
- overall - rating of the product
- unixReviewTime - time of the review (unix time)
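
The four-column subset described above could be produced from the raw download along these lines. This is a minimal sketch, assuming the `.json.gz` file holds one JSON review object per line; `extract_columns` and `COLUMNS` are hypothetical names, not part of the notebook.

```python
import csv
import gzip
import json

# Columns kept in the subset file (taken from the description above).
COLUMNS = ["reviewerID", "asin", "overall", "unixReviewTime"]

def extract_columns(src_path, dst_path, columns=COLUMNS):
    """Copy selected fields from a gzipped JSON-lines file into a CSV file."""
    rows_written = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(columns)  # header row
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            review = json.loads(line)
            # Missing fields become empty strings instead of raising KeyError.
            writer.writerow([review.get(col, "") for col in columns])
            rows_written += 1
    return rows_written
```

Streaming line by line keeps memory use flat, which matters for a file with millions of reviews.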

In [1]:
!pwd

/home/ec2-user/SageMaker/dse260-CapStone-Amazon/src


In [None]:
import sys

!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install sagemaker-experiments
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install boto3
!{sys.executable} -m pip install sagemaker
!{sys.executable} -m pip install pyspark


In [None]:
!{sys.executable} -m pip install ipython-autotime
!{sys.executable} -m pip install surprise

In [None]:
#### To measure all running time
# https://github.com/cpcloud/ipython-autotime

%load_ext autotime

In [1]:
import pandas as pd

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session
from sagemaker.analytics import ExperimentAnalytics

import gzip
import json

from pyspark.ml import Pipeline
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
from pyspark.ml.feature import StringIndexer, VectorIndexer, OneHotEncoder, VectorAssembler
from pyspark.sql.functions import *

# spark imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import UserDefinedFunction, explode, desc
from pyspark.sql.types import StringType, ArrayType
from pyspark.ml.evaluation import RegressionEvaluator

import os
import pandas as pd
from smart_open import smart_open

# from pandas_profiling import ProfileReport

ModuleNotFoundError: No module named 'pandas_profiling'

## Load dataset and analysis
- s3://dse-cohort5-group1/data-lake-landing-zone/ratings/Clothing_Shoes_and_Jewelry.csv.gz

In [None]:
# !pip install smart_open

In [None]:
# # get your credentials from environment variables (never hard-code keys in a notebook)
# aws_key = os.environ['AWS_ACCESS_KEY']
# aws_secret = os.environ['AWS_SECRET_ACCESS_KEY']

# # s3://dse-cohort5-group1/data-lake-landing-zone/reviews/Clothing_Shoes_and_Jewelry.json.gz
# bucket_name = 'dse-cohort5-group1'
# object_key = 'data-lake-landing-zone/reviews/Clothing_Shoes_and_Jewelry.json.gz'

# path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)

# review_orgin_df = pd.read_csv(smart_open(path), error_bad_lines=False)

In [None]:
!ls -al

In [None]:
number_cores = 16
memory_gb = 64

spark = SparkSession \
    .builder \
    .appName("amazon recommendation") \
    .config("spark.driver.memory", '{}g'.format(memory_gb)) \
    .config("spark.master", 'local[{}]'.format(number_cores)) \
    .getOrCreate()

# get spark context
sc = spark.sparkContext

In [None]:
DATA_PATH = './'
REVIEW_DATA = 'Clothing_Shoes_and_Jewelry.json.gz'

In [None]:
# header and inferSchema are CSV-only options; the JSON reader infers the schema by itself
ratings = spark.read.load(DATA_PATH+REVIEW_DATA, format='json')

In [None]:
ratings.show(3)

In [None]:
print("Shape of Data", (ratings.count(), len(ratings.columns)))

## Drop and Clean data
- Drop rows with a null vote.
- Reviews that received votes are assumed to be more reliable.

In [None]:
clean_ratings = ratings.na.drop(how='any', subset='vote')

In [None]:
print("Shape of Data", (clean_ratings.count(), len(clean_ratings.columns)))

In [None]:
clean_ratings.columns

In [None]:
product_ratings = clean_ratings.drop(
 'image',
 'reviewText',
 'reviewTime',
 'reviewerName',
 'style',
 'summary',
 'unixReviewTime',
 'verified',
 'vote')

In [None]:
product_ratings.show(3)

In [None]:
type(product_ratings)

In [None]:
# create csv file
product_ratings.write.csv("./voted_asin_overall_reviewerID.csv")

## Load dataset from filtered data

In [None]:
!ls -alh ./voted_asin_overall_reviewerID.csv

In [None]:
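# Spark writes one part file per partition, so read every part rather than a single hard-coded one
import glob
part_files = sorted(glob.glob('./voted_asin_overall_reviewerID.csv/part-*.csv'))
review_orgin_df = pd.concat((pd.read_csv(f, names=['asin', 'overall', 'reviewerID']) for f in part_files),
                        ignore_index=True)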

In [None]:
review_orgin_df.head(n=5)

In [None]:
review_orgin_df.shape

#### Data profiling

In [None]:
# ! conda install -c conda-forge pandas-profiling=2.6.0 -y

In [None]:
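# Requires pandas-profiling to be installed (see the commented install cell above)
from pandas_profiling import ProfileReport

profile = ProfileReport(review_orgin_df, 
                        title='Clothing_Shoes_and_Jewelry_5_reviewerID_asin_overall_unixReviewTime', 
                        minimal=True)
profile.to_file(output_file="Clothing_Shoes_and_Jewelry_5_json-Profiling-Report.html")
profile.to_notebook_iframe()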

# Background

- E-commerce companies like Amazon use different recommendation systems to provide suggestions to customers.
- Amazon currently uses item-item collaborative filtering, which scales to massive datasets and produces high-quality recommendations in real time. 
- Such a system is a kind of information filtering system that seeks to predict the "rating" or preference a user would give to an item.

<img src="refer_pics/1.png">

## Types of recommendations
There are mainly 6 types of recommendation systems:

1. Popularity-based systems: 
    - These recommend items that are viewed and purchased by most people and are rated highly. This is not a personalized recommendation.
2. Classification model-based: 
    - These use the features of the user and apply a classification algorithm to decide whether the user is interested in the product or not.
3. Content-based recommendations:
    - These rely on information about the contents of the item rather than on user opinions. The main idea is that if the user likes an item, then he or she will like "other" similar items.
4. Collaborative filtering:
    - It is based on the assumption that people like things similar to other things they like, and things that are liked by other people with similar taste. 
    - It is mainly of two types: a) User-User b) Item-Item
5. Hybrid approaches:
    - These combine collaborative filtering, content-based filtering, and other approaches.
6. Association rule mining:
    - Association rules capture the relationships between items based on their patterns of co-occurrence across transactions.
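
As an illustration of the last item, co-occurrence across transactions can be turned into pairwise rules with a support and a confidence value. This is only a sketch under stated assumptions: `pair_rules`, the toy baskets and the `min_support` threshold are hypothetical and are not used elsewhere in this notebook.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5):
    """Return {(a, b): confidence of the rule a -> b} for pairs meeting min_support."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))          # de-duplicate and fix the pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        if count / n >= min_support:         # support: share of baskets holding both items
            rules[(a, b)] = count / item_counts[a]   # confidence of a -> b
            rules[(b, a)] = count / item_counts[b]   # confidence of b -> a
    return rules

baskets = [["shoes", "socks"], ["shoes", "socks", "belt"], ["shoes", "belt"], ["socks"]]
rules = pair_rules(baskets, min_support=0.5)
```

Here every basket containing a belt also contains shoes, so the rule belt -> shoes gets the highest possible confidence, while the belt/socks pair is dropped for lack of support.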

## Attribute Information
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
    - Ex) http://www.amazon.com/gp/cdp/member-reviews/A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
    - Ex) http://www.amazon.com/dp/0000013714
- overall - rating of the product
- unixReviewTime - time of the review (unix time)    

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import math
import json
import time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import scipy.sparse
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
import warnings; warnings.simplefilter('ignore')
%matplotlib inline

In [None]:
review_orgin_df.head()

In [None]:
review_orgin_df.info()

In [None]:
#Five point summary 
review_orgin_df.describe()['overall'].T

In [None]:
#Find the minimum and maximum ratings
print('Minimum rating is: %d' %(review_orgin_df.overall.min()))
print('Maximum rating is: %d' %(review_orgin_df.overall.max()))

- Handling Missing values

In [None]:
#Check for missing values
print('Number of missing values across columns: \n',review_orgin_df.isnull().sum())

### Ratings 
- Most people gave a rating of 5.

In [None]:
# Check the distribution of the rating
with sns.axes_style('white'):
    # factorplot was renamed to catplot in seaborn 0.9
    g = sns.catplot(x="overall", data=review_orgin_df, aspect=2.0, kind='count')
    g.set_ylabels("Total number of ratings")

### Unique Users and products

In [None]:
print("Total data ")
print("-"*50)
print("\nTotal no of ratings :",review_orgin_df.shape[0])
print("Total No of Users   :", len(np.unique(review_orgin_df.reviewerID)))
print("Total No of products  :", len(np.unique(review_orgin_df.asin)))

### Analyzing the rating ( overall )

In [None]:
#Analysis of rating given by the user 
no_of_rated_products_per_user = review_orgin_df.groupby(by='reviewerID')['overall'].count().sort_values(ascending=False)
no_of_rated_products_per_user.head()

In [None]:
##### How to format how a NumPy array prints in Python
# https://kite.com/python/answers/how-to-format-how-a-numpy-array-prints-in-python

np.set_printoptions(suppress=True)

In [None]:
no_of_rated_products_per_user.describe()

In [None]:
quantiles = no_of_rated_products_per_user.quantile(np.arange(0,1.01,0.01), interpolation='higher')

plt.figure(figsize=(10,10))
plt.title("Quantiles and their Values")
quantiles.plot()
# quantiles with 0.05 difference
plt.scatter(x=quantiles.index[::5], y=quantiles.values[::5], c='orange', label="quantiles with 0.05 intervals")
# quantiles with 0.25 difference
plt.scatter(x=quantiles.index[::25], y=quantiles.values[::25], c='m', label = "quantiles with 0.25 intervals")
plt.ylabel('No of ratings by user')
plt.xlabel('Quantile')
plt.legend(loc='best')
plt.show()

In [None]:
# print('\n No of rated product more than 50 per user : {}\n'.format(sum(no_of_rated_products_per_user >= 50)) )

# 1. Popularity Based Recommendation

- A popularity-based recommendation system follows the trend. 
- It recommends the items that are currently most popular. 
    - For example, if a product is bought by most new users, the system is likely to suggest that product to a user who has just signed up.
- The problem with a popularity-based recommendation system is that it offers no personalization: 
    - even if you know a user's behaviour, you cannot tailor the recommended items to it.
    
<img src="./refer_pics/2.png">    
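
A popularity ranking can be sketched without any user information at all. The version below is a minimal illustration, assuming ratings arrive as `(asin, overall)` pairs; `most_popular` and its `min_count` parameter are hypothetical, with `min_count` playing the role of the minimum-ratings filter that is commented out in the next cell.

```python
from collections import defaultdict

def most_popular(ratings, top_n=3, min_count=2):
    """Rank products by number of ratings, breaking ties by mean rating."""
    totals = defaultdict(lambda: [0, 0.0])   # asin -> [count, sum of ratings]
    for asin, overall in ratings:
        totals[asin][0] += 1
        totals[asin][1] += overall
    ranked = [
        (asin, count, total / count)
        for asin, (count, total) in totals.items()
        if count >= min_count                # ignore rarely rated products
    ]
    ranked.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return ranked[:top_n]

toy_ratings = [("A", 5), ("A", 4), ("B", 5), ("C", 3), ("C", 3), ("C", 5)]
popular = most_popular(toy_ratings)
```

Every user receives the same list, which is exactly the lack of personalization noted above.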

In [None]:
# new_df=review_orgin_df.groupby("asin").filter(lambda x:x['overall'].count() >=300)
new_df=review_orgin_df

In [None]:
no_of_ratings_per_product = new_df.groupby(by='asin')['overall'].count().sort_values(ascending=False)

fig = plt.figure(figsize=plt.figaspect(.5))
ax = plt.gca()
plt.plot(no_of_ratings_per_product.values)
plt.title('# RATINGS per Product')
plt.xlabel('Product')
plt.ylabel('No of ratings per product')
ax.set_xticklabels([])

plt.show()

In [None]:
#Average rating of the product 

new_df.groupby('asin')['overall'].mean().head()

In [None]:
new_df.groupby('asin')['overall'].mean().sort_values(ascending=False).head()

- Total number of rating for product

In [None]:
new_df.groupby('asin')['overall'].count().sort_values(ascending=False).head()

In [None]:
ratings_mean_count = pd.DataFrame(new_df.groupby('asin')['overall'].mean())
ratings_mean_count['rating_counts'] = pd.DataFrame(new_df.groupby('asin')['overall'].count())
ratings_mean_count.head()

In [None]:
ratings_mean_count['rating_counts'].max()

In [None]:
plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
ratings_mean_count['rating_counts'].hist(bins=50)

In [None]:
plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
ratings_mean_count['overall'].hist(bins=50)

In [None]:
plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
# sns.jointplot(x='overall', y='rating_counts', data=ratings_mean_count, alpha=0.4)

### *** Recommend products based on Popularity Based Recommendation

In [None]:
popular_products = pd.DataFrame(new_df.groupby('asin')['overall'].count())
most_popular = popular_products.sort_values('overall', ascending=False)
most_popular.head(30).plot(kind = "bar")

# 2. Collaborative filtering (Item-Item recommendation)

- Collaborative filtering is commonly used for recommender systems. 
- These techniques aim to fill in the missing entries of a user-item association matrix. 
- We are going to use the collaborative filtering (CF) approach. 
    - CF is based on the idea that the best recommendations come from people who have similar tastes. 
    - In other words, it uses historical item ratings of like-minded people to predict how someone would rate an item.
    - Collaborative filtering has two sub-categories that are generally called memory-based and model-based approaches.
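
To make the memory-based idea concrete, the sketch below computes an item-item similarity directly from a sparse user-item table. It uses plain cosine similarity for simplicity, which is an assumption of this example and not the `pearson_baseline` measure configured later in the notebook; `item_cosine` and the toy ratings are hypothetical.

```python
import math

def item_cosine(ratings, item_a, item_b):
    """Cosine similarity between two items; unrated entries count as 0."""
    users_a = ratings.get(item_a, {})
    users_b = ratings.get(item_b, {})
    shared = set(users_a) & set(users_b)     # only co-rated users add to the dot product
    dot = sum(users_a[u] * users_b[u] for u in shared)
    norm_a = math.sqrt(sum(r * r for r in users_a.values()))
    norm_b = math.sqrt(sum(r * r for r in users_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# item -> {user: rating}; missing users are the "missing entries" of the matrix
toy = {
    "shirt": {"u1": 5, "u2": 4},
    "jeans": {"u1": 5, "u2": 4, "u3": 1},
    "watch": {"u3": 5},
}
sim_shirt_jeans = item_cosine(toy, "shirt", "jeans")
sim_shirt_watch = item_cosine(toy, "shirt", "watch")
```

Items rated alike by the same users come out close to 1, while items with no raters in common get a similarity of 0.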
    
#### using python module: https://surprise.readthedocs.io/en/stable/    

In [None]:
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
import os
from surprise.model_selection import train_test_split

# **** Sampling data for testing
- Original sample size: (6940556, 4)
- Size of sample: 10000

In [None]:
# NUM_SAMPLE = 10000

# print("shape of data: ", new_df.shape)
# sample_df = new_df.sample(n=NUM_SAMPLE, random_state=2)
# print("Sample shape of data: ", sample_df.shape)

In [None]:
# #Reading the dataset
# sample_df = sample_df.drop(columns=['unixReviewTime'])
# reader = Reader(rating_scale=(1, 5))
# data = Dataset.load_from_df(sample_df, reader)

# **** All Data for testing
- Original sample size: (6940556, 3)

In [None]:
#Reading the dataset
# data_df = new_df.drop(columns=['unixReviewTime'])
data_df = new_df
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(data_df, reader)
print("shape of data: ", data_df.shape)

In [None]:
#Splitting the dataset
# surprise's train_test_split expects the surprise Dataset object, not the pandas DataFrame
trainset, testset = train_test_split(data, test_size=0.3, random_state=10)

### Collaborative Filtering

- Collaborative Filtering is the most common technique used when it comes to building intelligent recommender systems that can learn to give better recommendations as more information about users is collected.

- Most websites like Amazon, YouTube, and Netflix use collaborative filtering as a part of their sophisticated recommendation systems. You can use this technique to build recommenders that give suggestions to a user on the basis of the likes and dislikes of similar users.

#### What Is Collaborative Filtering?
- Collaborative filtering is a technique that can filter out items that a user might like on the basis of reactions by similar users.

- It works by searching a large group of people and finding a smaller set of users with tastes similar to a particular user. It looks at the items they like and combines them to create a ranked list of suggestions.

- There are many ways to decide which users are similar and combine their choices to create a list of recommendations.

    - https://surprise.readthedocs.io/en/stable/knn_inspired.html
    - https://realpython.com/build-recommendation-engine-collaborative-filtering/

<img src="./refer_pics/KNNWithMeans.png">  
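
The mean-centered k-NN estimate can be sketched in a few lines. This follows the item-based form of the formula given in the Surprise documentation for `KNNWithMeans` (item mean plus a similarity-weighted average of the neighbours' deviations from their own means), but it is an illustrative re-implementation, not the library's code; `predict_with_means` and the toy values are hypothetical.

```python
def predict_with_means(user_ratings, item_means, sims, target, k=5):
    """Item-based, mean-centered k-NN estimate of one user's rating for `target`.

    user_ratings: {item: rating} for a single user
    item_means:   {item: mean rating of that item}
    sims:         {(target, other_item): similarity}
    """
    # Candidate neighbours: items this user rated that are positively similar to the target.
    neighbours = [
        (sims[(target, item)], item)
        for item in user_ratings
        if sims.get((target, item), 0) > 0
    ]
    neighbours.sort(reverse=True)            # most similar first
    top = neighbours[:k]
    denom = sum(sim for sim, _ in top)
    if denom == 0:
        return item_means[target]            # no usable neighbours: fall back to the item mean
    offset = sum(sim * (user_ratings[item] - item_means[item]) for sim, item in top)
    return item_means[target] + offset / denom

estimate = predict_with_means(
    user_ratings={"jeans": 5, "belt": 2},
    item_means={"shirt": 4.0, "jeans": 4.5, "belt": 3.0},
    sims={("shirt", "jeans"): 0.8, ("shirt", "belt"): 0.2},
    target="shirt",
)
```

Centering on each item's mean keeps a generally well-rated neighbour from inflating the estimate: only the user's deviation from that neighbour's average counts.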

### Set user_based to True or False to switch between: 
- user-based collaborative filtering (True)
- item-based collaborative filtering (False)

In [None]:
algo = KNNWithMeans(k=5, sim_options={'name': 'pearson_baseline', 'user_based': False})
algo.fit(trainset)

### Run the trained model against the testset

In [None]:
test_pred = algo.test(testset)

In [None]:
test_pred[:3]

## Get RMSE

In [None]:
print("Item-based Model : Test Set")
accuracy.rmse(test_pred, verbose=True)
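
For reference, RMSE is the square root of the mean squared difference between true and estimated ratings. The helper below is a small sketch of that definition; `rmse` and the example pairs are hypothetical and independent of the model above.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_error = sum((true - est) ** 2 for true, est in pairs)
    return math.sqrt(squared_error / len(pairs))

example_pairs = [(5.0, 4.0), (3.0, 3.0), (1.0, 3.0)]
score = rmse(example_pairs)
```

Each Surprise prediction carries the true rating as `r_ui` and the estimate as `est`, so `rmse((p.r_ui, p.est) for p in test_pred)` should agree with the value printed by `accuracy.rmse`.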

# 3. Model-based collaborative filtering system

- These methods are based on machine learning and data mining techniques. 
- The goal is to train models to be able to make predictions. 
- For example, we could use existing user-item interactions to train a model to predict the top-5 items that a user might like the most. 
- One advantage of these methods is that they are able to recommend a larger number of items to a larger number of users, compared to other methods like the memory-based approach. 
- They have large coverage, even when working with large sparse matrices.

In [None]:
# print("size of sample_df: ", sample_df.shape)
# ratings_matrix = sample_df.pivot_table(values='overall', index='reviewerID', columns='asin', fill_value=0)
# ratings_matrix.head()

In [None]:
# https://stackoverflow.com/questions/57507832/unable-to-allocate-array-with-shape-and-data-type

In [None]:
!cat /proc/sys/vm/overcommit_memory

In [None]:
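# The redirect in "sudo echo 1 > file" runs without root privileges, so pipe through sudo tee instead
!echo 1 | sudo tee /proc/sys/vm/overcommit_memory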

In [None]:
!cat /proc/sys/vm/overcommit_memory

In [None]:
print("size of data_df: ", data_df.shape)
ratings_matrix = data_df.pivot_table(values='overall', index='reviewerID', columns='asin', fill_value=0)
ratings_matrix.head()

#### As expected, the utility matrix obtained above is sparse; the unknown values are filled with 0.
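
A quick back-of-the-envelope calculation shows why a dense pivot of this kind can exhaust memory. This is a sketch under stated assumptions: cells are taken to be 8-byte floats, and the user and product counts below are made-up round numbers, not the counts of this dataset; only the rating count (6,940,556) comes from the notebook.

```python
def dense_matrix_gib(n_users, n_items, bytes_per_cell=8):
    """Memory needed to hold a dense n_users x n_items matrix, in GiB."""
    return n_users * n_items * bytes_per_cell / 2**30

def density(n_ratings, n_users, n_items):
    """Fraction of matrix cells that hold an actual rating."""
    return n_ratings / (n_users * n_items)

# Hypothetical sizes for illustration only.
size_gib = dense_matrix_gib(n_users=100_000, n_items=50_000)
fill = density(n_ratings=6_940_556, n_users=100_000, n_items=50_000)
```

Even at these modest hypothetical sizes the dense matrix needs tens of GiB while well under one percent of its cells are filled, which is why sparse representations are usually preferred for utility matrices.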

In [None]:
ratings_matrix.shape

- Transposing the matrix

In [None]:
X = ratings_matrix.T
X.head()

In [None]:
X.shape

- Unique products in subset of data

In [None]:
X1 = X

### Decomposing the Matrix

In [None]:
from sklearn.decomposition import TruncatedSVD

SVD = TruncatedSVD(n_components=10)
decomposed_matrix = SVD.fit_transform(X)
decomposed_matrix.shape

### Correlation Matrix

In [None]:
correlation_matrix = np.corrcoef(decomposed_matrix)
correlation_matrix.shape
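
The two steps above (rank reduction, then row-wise correlation) can be reproduced on a toy matrix with NumPy alone. This is a sketch, assuming an exact SVD via `np.linalg.svd`; `TruncatedSVD` uses a randomized solver by default, so its output may differ in sign and in the last digits. `X_toy` and `reduce_rank` are hypothetical names.

```python
import numpy as np

# Toy utility matrix: rows are products, columns are users, 0 means "not rated".
X_toy = np.array([
    [5.0, 4.0, 0.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 5.0, 4.0, 0.0],
    [0.0, 1.0, 4.0, 5.0, 2.0],
])

def reduce_rank(matrix, n_components):
    """Project each row onto the top singular directions (U scaled by S)."""
    U, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

embedding = reduce_rank(X_toy, n_components=3)   # one 3-dimensional vector per product
correlation = np.corrcoef(embedding)             # product-by-product correlation matrix
```

Because `np.corrcoef` treats each row as one variable, the result has one row and one column per product, matching the shape printed in the cell above.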

## Prediction: 
Index # of product ID purchased by customer

In [None]:
product_id = X.index[75]
print("Product ID purchased by customer: ", product_id)
print("https://www.amazon.com/dp/"+product_id)

https://www.amazon.com/dp/B0002NYQO6


<img src="./refer_pics/input_1.png">  

In [None]:
product_names = list(X.index)
product_ID = product_names.index(product_id)
product_ID

#### Correlation of every item with the item purchased by this customer, based on ratings from other customers who bought the same product

In [None]:
correlation_product_ID = correlation_matrix[product_ID]
correlation_product_ID.shape

#### Recommending up to 25 highly correlated products in sequence

### Here are the products to be displayed by the recommendation system to the above customer, based on the purchase history of other customers on the website.

In [None]:
# Keep products whose correlation with the purchased item exceeds 0.65 (in index order, not ranked)
Recommend = list(X.index[correlation_product_ID > 0.65])

# Removes the item already bought by the customer
Recommend.remove(product_id) 

Recommend[0:25]

for p_id in Recommend[0:25]:
    print("https://www.amazon.com/dp/"+p_id)

- https://www.amazon.com/dp/B0000AFSY8

<img src="./refer_pics/out_1.png">  

- https://www.amazon.com/dp/B0001YRE04

<img src="./refer_pics/out_2.png">  

- https://www.amazon.com/dp/B0002FHJ66    

<img src="./refer_pics/out_3.png"> 
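
The recommendation cell above keeps products in index order once they pass the 0.65 threshold. If a ranked list is wanted instead, the candidates can be sorted by correlation first. This is a sketch of that variation, not the notebook's method; `top_correlated`, the labels and the toy correlation matrix are hypothetical.

```python
import numpy as np

def top_correlated(correlations, labels, query_index, top_n=3, threshold=0.65):
    """Return (label, score) pairs sorted by descending correlation to the query item."""
    scores = np.asarray(correlations[query_index], dtype=float)
    order = np.argsort(-scores)              # highest correlation first
    picks = []
    for idx in order:
        if idx == query_index or scores[idx] <= threshold:
            continue                         # skip the purchased item and weak matches
        picks.append((labels[idx], float(scores[idx])))
        if len(picks) == top_n:
            break
    return picks

toy_labels = ["P0", "P1", "P2", "P3"]
toy_corr = np.array([
    [1.00, 0.70, 0.95, 0.10],
    [0.70, 1.00, 0.40, 0.20],
    [0.95, 0.40, 1.00, 0.30],
    [0.10, 0.20, 0.30, 1.00],
])
ranked = top_correlated(toy_corr, toy_labels, query_index=0)
```

With the notebook's variables this would be called as `top_correlated(correlation_matrix, list(X.index), product_ID, top_n=25)`.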