<a href="https://colab.research.google.com/github/kronze1996/Product-Recommendation-Engine/blob/main/Kartikey_Sharma_Product_Recommendation_Engine_Capstone_Project_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Build a recommender engine that reviews customer ratings and purchase history to recommend items and improve sales. </u></b>

### Amazon.com is one of the largest electronic commerce and cloud computing companies.

### Just a few Amazon related facts:

### They lost $4.8 million in August 2013, when their website went down for 40 minutes. They hold the patent on 1-Click buying and license it to Apple. Their Phoenix fulfilment centre is a massive 1.2 million square feet. Amazon relies heavily on a recommendation engine that reviews customer ratings and purchase history to recommend items and improve sales.


### This dataset contains over 2 million customer reviews and ratings of beauty-related products sold on their website.

### It contains

* ### the unique UserId (Customer Identification),
* ### the product ASIN (Amazon's unique product identification code for each product),
* ### Ratings (ranging from 1-5 based on customer satisfaction) and
* ### the Timestamp of the rating (in UNIX time)
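
Because the timestamp is stored as UNIX time (seconds since 1970-01-01 UTC), it usually helps to convert it to a readable datetime before any time-based analysis. The sketch below uses a small hand-made frame with the same column names as the ratings file; the `RatedAt` column name is an assumption introduced only for illustration.

```python
import pandas as pd

# Toy frame with the same columns as the ratings file (values taken from the first rows shown later)
sample = pd.DataFrame({
    "UserId": ["A39HTATAQ9V7YF", "A3JM6GV9MNOF9X"],
    "ProductId": ["0205616461", "0558925278"],
    "Rating": [5.0, 3.0],
    "Timestamp": [1369699200, 1355443200],
})

# unit="s" tells pandas the integers are seconds since the UNIX epoch
sample["RatedAt"] = pd.to_datetime(sample["Timestamp"], unit="s")
print(sample[["Timestamp", "RatedAt"]])
```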

### The ratings come from a larger Amazon dataset of product reviews and metadata, which includes 142.8 million reviews spanning May 1996 - July 2014.

### The full dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## <u><b> Collaborative Filtering </b></u> 

### This method makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on a set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person.   
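
To make the "A shares B's opinion" idea concrete, here is a minimal user-based sketch on a made-up 3x4 rating matrix. It is an illustration only, not the model built later in this notebook: the users, ratings, and the choice to treat 0 as "not rated" inside the cosine similarity are all simplifying assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users A, B, C; columns are four items; 0 means "not rated" (toy data)
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],  # A
    [5.0, 5.0, 4.0, 1.0],  # B: tastes close to A
    [1.0, 0.0, 2.0, 5.0],  # C: tastes far from A
])

# Pairwise similarity between users, based on their rating vectors
user_sim = cosine_similarity(ratings)

# Predict A's rating for item 2 as a similarity-weighted average
# over the other users who actually rated that item
target_user, target_item = 0, 2
neighbours = [u for u in range(len(ratings))
              if u != target_user and ratings[u, target_item] > 0]
weights = user_sim[target_user, neighbours]
predicted = float(np.dot(weights, ratings[neighbours, target_item]) / weights.sum())
print(round(predicted, 2))
```

Because A is much more similar to B than to C, the prediction lands closer to B's rating of 4 than to C's rating of 2.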


## <u><b> Content-Based Filtering </b></u>

### This method uses only the descriptions and attributes of the items a user has previously consumed to model that user's preferences. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user, and the best-matching items are recommended. 
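
As a small illustration of "compare candidate items with items the user liked", the sketch below turns three invented product descriptions into TF-IDF vectors and ranks the candidates by cosine similarity to one liked item. The item names and texts are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item descriptions
item_descriptions = {
    "shampoo_a": "moisturizing shampoo for dry hair",
    "shampoo_b": "volumizing shampoo for fine hair",
    "lipstick": "matte red lipstick long lasting",
}
item_names = list(item_descriptions)

# One TF-IDF row per item
tfidf = TfidfVectorizer().fit_transform(list(item_descriptions.values()))

# Score every item against the item the user liked, then rank the others
liked = item_names.index("shampoo_a")
scores = cosine_similarity(tfidf[liked], tfidf).ravel()
ranked = [item_names[i] for i in scores.argsort()[::-1] if i != liked]
print(ranked)
```

The second shampoo shares several terms with the liked item, while the lipstick shares none, so the shampoo is ranked first.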


## <u><b> Hybrid Approach </b></u>
### Recent research has demonstrated that a hybrid approach, combining collaborative filtering and content-based filtering, could be more effective than pure approaches in some cases. These methods can also be used to overcome some of the common problems in recommender systems, such as cold start and sparsity.
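
One simple way to combine the two approaches is a weighted blend of their scores. The sketch below is only one of many possible hybrid schemes; the function name `hybrid_scores`, the min-max scaling, and the weight of 0.5 are assumptions chosen for illustration. It also shows the cold-start case: when collaborative filtering has no signal for a user, the content-based scores still produce a ranking.

```python
import numpy as np

def hybrid_scores(cf_scores, cb_scores, cf_weight=0.5):
    """Blend two score vectors after min-max scaling each to [0, 1]."""
    def scale(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        # A constant vector carries no ranking information, so map it to zeros
        return np.zeros_like(x) if span == 0 else (x - x.min()) / span
    return cf_weight * scale(cf_scores) + (1.0 - cf_weight) * scale(cb_scores)

cf = [0.0, 0.0, 0.0]   # new user: collaborative filtering has nothing to go on
cb = [0.2, 0.9, 0.4]   # content-based scores for three candidate items
blended = hybrid_scores(cf, cb, cf_weight=0.5)
best_item = int(np.argmax(blended))
print(blended, best_item)
```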

In [30]:
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
import matplotlib.pyplot as plt

In [3]:
path = "/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 4/Week 4/ratings_Beauty.csv"
df = pd.read_csv(path)
review_df = pd.read_json('/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 4/Week 4/reviews_Beauty_5 (1).json.gz',lines = True)

# Exploratory Data Analysis

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2023070 entries, 0 to 2023069
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   UserId     object 
 1   ProductId  object 
 2   Rating     float64
 3   Timestamp  int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 61.7+ MB


In [5]:
review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198502 entries, 0 to 198501
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   reviewerID      198502 non-null  object
 1   asin            198502 non-null  object
 2   reviewerName    197116 non-null  object
 3   helpful         198502 non-null  object
 4   reviewText      198502 non-null  object
 5   overall         198502 non-null  int64 
 6   summary         198502 non-null  object
 7   unixReviewTime  198502 non-null  int64 
 8   reviewTime      198502 non-null  object
dtypes: int64(2), object(7)
memory usage: 13.6+ MB


In [6]:
df.head(20)

Unnamed: 0,UserId,ProductId,Rating,Timestamp
0,A39HTATAQ9V7YF,0205616461,5.0,1369699200
1,A3JM6GV9MNOF9X,0558925278,3.0,1355443200
2,A1Z513UWSAAO0F,0558925278,5.0,1404691200
3,A1WMRR494NWEWV,0733001998,4.0,1382572800
4,A3IAAVS479H7M7,0737104473,1.0,1274227200
5,AKJHHD5VEH7VG,0762451459,5.0,1404518400
6,A1BG8QW55XHN6U,1304139212,5.0,1371945600
7,A22VW0P4VZHDE3,1304139220,5.0,1373068800
8,A3V3RE4132GKRO,130414089X,5.0,1401840000
9,A327B0I7CYTEJC,130414643X,4.0,1389052800


In [7]:
df['ProductId'].value_counts()

B001MA0QY2    7533
B0009V1YR8    2869
B0043OYFKU    2477
B0000YUXI0    2143
B003V265QW    2088
              ... 
B009ZQTBOG       1
B00HSBPJRI       1
B00ATACIYM       1
B0082MROXY       1
B00JDWHWJI       1
Name: ProductId, Length: 249274, dtype: int64

In [8]:
review_df.head(20)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A1YJEY40YUW4SE,7806397051,Andrea,"[3, 4]",Very oily and creamy. Not at all what I expect...,1,Don't waste your money,1391040000,"01 30, 2014"
1,A60XNB876KYML,7806397051,Jessica H.,"[1, 1]",This palette was a decent price and I was look...,3,OK Palette!,1397779200,"04 18, 2014"
2,A3G6XNM240RMWA,7806397051,Karen,"[0, 1]",The texture of this concealer pallet is fantas...,4,great quality,1378425600,"09 6, 2013"
3,A1PQFP6SAJ6D80,7806397051,Norah,"[2, 2]",I really can't tell what exactly this thing is...,2,Do not work on my face,1386460800,"12 8, 2013"
4,A38FVHZTNQ271F,7806397051,Nova Amor,"[0, 0]","It was a little smaller than I expected, but t...",3,It's okay.,1382140800,"10 19, 2013"
5,A3BTN14HIZET6Z,7806397051,"S. M. Randall ""WildHorseWoman""","[1, 2]","I was very happy to get this palette, now I wi...",5,Very nice palette!,1365984000,"04 15, 2013"
6,A1Z59RFKN0M5QL,7806397051,"tasha ""luvely12b""","[1, 3]",PLEASE DONT DO IT! this just rachett the palet...,1,smh!!!,1376611200,"08 16, 2013"
7,AWUO9P6PL1SY8,7806397051,TreMagnifique,"[0, 1]","Chalky,Not Pigmented,Wears off easily,Not a Co...",2,"Chalky, Not Pigmented, Wears off easily, Not a...",1378252800,"09 4, 2013"
8,A3LMILRM9OC3SA,9759091062,,"[0, 0]",Did nothing for me. Stings when I put it on. I...,2,"no Lightening, no Brightening,......NOTHING",1405209600,"07 13, 2014"
9,A30IP88QK3YUIO,9759091062,Amina Bint Ibraheem,"[0, 0]",I bought this product to get rid of the dark s...,3,Its alright,1388102400,"12 27, 2013"


Exploring the data sets

In [9]:
df[df['UserId']=='A1YJEY40YUW4SE']

Unnamed: 0,UserId,ProductId,Rating,Timestamp
340,A1YJEY40YUW4SE,7806397051,1.0,1391040000
440062,A1YJEY40YUW4SE,B000V6BCSW,2.0,1391040000
807482,A1YJEY40YUW4SE,B0020YLEYK,5.0,1328140800
942164,A1YJEY40YUW4SE,B002WLWX82,1.0,1391040000
1177542,A1YJEY40YUW4SE,B004756YJA,5.0,1318896000
1337973,A1YJEY40YUW4SE,B004ZT0SSG,5.0,1318896000
1538217,A1YJEY40YUW4SE,B006XL55FK,5.0,1391040000
1982381,A1YJEY40YUW4SE,B00GZTXTZI,4.0,1391040000


In [10]:
df['Rating'].value_counts()

5.0    1248721
4.0     307740
1.0     183784
3.0     169791
2.0     113034
Name: Rating, dtype: int64

In [11]:
df['UserId'].value_counts()

A3KEZLJ59C1JVH    389
A281NPSIMI1C2R    336
A3M174IC0VXOS2    326
A2V5R832QCSOMX    278
A3LJLRIZL38GG3    276
                 ... 
A1NYR8HTVANT1O      1
A3MMV1ZSPSU027      1
A1PIU542LN6KKS      1
A3S48M4APXMJY3      1
A2JAGN2D11CDKU      1
Name: UserId, Length: 1210271, dtype: int64

Taking data for users who bought at least 5 items

Recommender systems have a problem known as user cold-start: it is hard to provide personalized recommendations for users with no or very few consumed items, because there is too little information to model their preferences.  

For this reason, we are keeping in the dataset only users with at least 5 interactions.

In [13]:
users_interactions_count_df = df.groupby(['UserId', 'ProductId']).size().groupby('UserId').size()
print('# of users: %d' % len(users_interactions_count_df))

users_with_enough_interactions_df = users_interactions_count_df[users_interactions_count_df >= 5].reset_index()[['UserId']]
print('# of users with at least 5 interactions: %d' % len(users_with_enough_interactions_df))

# of users: 1210271
# of users with at least 5 interactions: 52374


In [14]:
users_with_enough_interactions_df.value_counts()

UserId               
AZZZLM1E5JJ8C            1
A29FFXE1ZJWMDB           1
A29GNM310NIEGI           1
A29GF42C0YFQOK           1
A29GES4X1DL5JV           1
                        ..
A3III07Y9VJI8Q           1
A3IIGCFLKVFW8M           1
A3IIG6WN78DIOQ           1
A3IIDZ9XUDM7RP           1
A00414041RD0BXM6WK0GX    1
Length: 52374, dtype: int64

Checking that a user with only 1 interaction has been filtered out

In [15]:
users_with_enough_interactions_df[users_with_enough_interactions_df['UserId']=='A27TKCMDYFCFOY']

Unnamed: 0,UserId


In [16]:
print('# of interactions: %d' % len(df))
interactions_from_selected_users_df = df.merge(users_with_enough_interactions_df, 
               how = 'right',
               left_on = 'UserId',
               right_on = 'UserId')
print('# of interactions from users with at least 5 interactions: %d' % len(interactions_from_selected_users_df))

# of interactions: 2023070
# of interactions from users with at least 5 interactions: 469771


In [17]:
interactions_from_selected_users_df.head(10)

Unnamed: 0,UserId,ProductId,Rating,Timestamp
0,A00414041RD0BXM6WK0GX,B007IY97U0,3.0,1405296000
1,A00414041RD0BXM6WK0GX,B00870XLDS,2.0,1405296000
2,A00414041RD0BXM6WK0GX,B008MIRO88,1.0,1405296000
3,A00414041RD0BXM6WK0GX,B00BQYYMN0,3.0,1405296000
4,A00414041RD0BXM6WK0GX,B00GRTQBTM,5.0,1405296000
5,A00414041RD0BXM6WK0GX,B00HFP4JZU,5.0,1405296000
6,A00414041RD0BXM6WK0GX,B00JM8Z52O,4.0,1405296000
7,A00473363TJ8YSZ3YAGG9,B000052YQU,2.0,1402790400
8,A00473363TJ8YSZ3YAGG9,B00016XA0K,3.0,1399593600
9,A00473363TJ8YSZ3YAGG9,B000FABN7E,5.0,1357430400


In [23]:
interactions_from_selected_users_df[interactions_from_selected_users_df['UserId']=='A27TKCMDYFCFOY']

Unnamed: 0,UserId,ProductId,Rating,Timestamp


# Content-Based Filtering

In [25]:
import math
def smooth_user_preference(x):
    return math.log(1+x, 2)
    
interactions_full_df = interactions_from_selected_users_df \
                    .groupby(['UserId', 'ProductId'])['Rating'].sum() \
                    .apply(smooth_user_preference).reset_index()
print('# of unique user/item interactions: %d' % len(interactions_full_df))
interactions_full_df.head(10)

# of unique user/item interactions: 469771


Unnamed: 0,UserId,ProductId,Rating
0,A00414041RD0BXM6WK0GX,B007IY97U0,2.0
1,A00414041RD0BXM6WK0GX,B00870XLDS,1.584963
2,A00414041RD0BXM6WK0GX,B008MIRO88,1.0
3,A00414041RD0BXM6WK0GX,B00BQYYMN0,2.0
4,A00414041RD0BXM6WK0GX,B00GRTQBTM,2.584963
5,A00414041RD0BXM6WK0GX,B00HFP4JZU,2.584963
6,A00414041RD0BXM6WK0GX,B00JM8Z52O,2.321928
7,A00473363TJ8YSZ3YAGG9,B000052YQU,1.584963
8,A00473363TJ8YSZ3YAGG9,B00016XA0K,2.0
9,A00473363TJ8YSZ3YAGG9,B000FABN7E,2.584963
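
The smoothing above maps a rating x to log2(1 + x), which compresses the top of the scale: the gap between 4 and 5 stars becomes smaller than the gap between 1 and 2 stars. The short check below reapplies the same function to each possible rating so the values in the table can be read off directly.

```python
import math

def smooth_user_preference(x):
    # Same transform as in the cell above: log base 2 of (1 + x)
    return math.log(1 + x, 2)

# Smoothed value for every possible star rating, rounded like the table output
smoothed = {r: round(smooth_user_preference(r), 6) for r in [1.0, 2.0, 3.0, 4.0, 5.0]}
for rating, value in smoothed.items():
    print(rating, value)
```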


In [27]:
interactions_train_df, interactions_test_df = train_test_split(interactions_full_df,
                                   stratify=interactions_full_df['UserId'], 
                                   test_size=0.20,
                                   random_state=42)

print('# interactions on Train set: %d' % len(interactions_train_df))
print('# interactions on Test set: %d' % len(interactions_test_df))

# interactions on Train set: 375816
# interactions on Test set: 93955


In [28]:
#Indexing by UserId to speed up the searches during evaluation
interactions_full_indexed_df = interactions_full_df.set_index('UserId')
interactions_train_indexed_df = interactions_train_df.set_index('UserId')
interactions_test_indexed_df = interactions_test_df.set_index('UserId')

In [33]:
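#Ignoring English stopwords (very common words that carry little meaning)
import nltk
nltk.download('stopwords')
stopwords_list = stopwords.words('english')

#Trains a model whose vectors have at most 5000 features, composed of the main unigrams and bigrams found in the corpus, ignoring stopwords
vectorizer = TfidfVectorizer(analyzer='word',
                     ngram_range=(1, 2),
                     min_df=0.003,
                     max_df=0.5,
                     max_features=5000,
                     stop_words=stopwords_list)

#review_df has no 'title' or 'text' columns; the review text lives in 'summary' and 'reviewText'.
#Concatenating all reviews of a product gives one document per item, so that
#item_ids[i] is the product described by row i of tfidf_matrix.
review_df['review_doc'] = review_df['summary'].fillna('') + " " + review_df['reviewText'].fillna('')
item_docs = review_df.groupby('asin')['review_doc'].apply(' '.join)
item_ids = item_docs.index.tolist()
tfidf_matrix = vectorizer.fit_transform(item_docs)
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_feature_names = vectorizer.get_feature_names_out()
tfidf_matrix

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.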

### To model the user profile, we take the profiles of all the items the user has interacted with and average them. The average is weighted by the interaction strength; in other words, the products the user rated most highly contribute most to the final user profile.   

In [34]:
def get_item_profile(item_id):
    # Row of the TF-IDF matrix that describes this product
    idx = item_ids.index(item_id)
    item_profile = tfidf_matrix[idx:idx+1]
    return item_profile

def get_item_profiles(ids):
    item_profiles_list = [get_item_profile(x) for x in ids]
    item_profiles = scipy.sparse.vstack(item_profiles_list)
    return item_profiles

def build_users_profile(user_id, interactions_indexed_df):
    # .loc[[user_id]] always returns a DataFrame, even when the user has a single interaction
    interactions_user_df = interactions_indexed_df.loc[[user_id]]
    user_item_profiles = get_item_profiles(interactions_user_df['ProductId'])
    
    # The (smoothed) rating is used as the interaction strength
    user_item_strengths = np.array(interactions_user_df['Rating']).reshape(-1,1)
    
    # Weighted average of item profiles by the interactions strength
    user_item_strengths_weighted_avg = np.sum(user_item_profiles.multiply(user_item_strengths), axis=0) / np.sum(user_item_strengths)
    # np.asarray converts the np.matrix result into a plain array before normalizing
    user_profile_norm = normalize(np.asarray(user_item_strengths_weighted_avg))
    return user_profile_norm

def build_users_profiles(): 
    # Keep only interactions with products that have an item profile
    interactions_indexed_df = interactions_full_df[interactions_full_df['ProductId'].isin(item_ids)].set_index('UserId')
    user_profiles = {}
    for user_id in interactions_indexed_df.index.unique():
        user_profiles[user_id] = build_users_profile(user_id, interactions_indexed_df)
    return user_profiles
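
To see what the weighted average in `build_users_profile` does, here is a toy version with two items described by three features. The item vectors and strengths are invented for illustration; only the arithmetic mirrors the function above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

# Two toy item profiles (rows) over three features (columns)
item_profiles = csr_matrix(np.array([
    [1.0, 0.0, 0.0],   # item 0 is all about feature 0
    [0.0, 1.0, 0.0],   # item 1 is all about feature 1
]))

# Interaction strengths: the user liked item 0 three times as much as item 1
strengths = np.array([3.0, 1.0]).reshape(-1, 1)

# Weight each item row by its strength, sum the rows, divide by the total strength
weighted_sum = np.asarray(item_profiles.multiply(strengths).sum(axis=0))
weighted_avg = weighted_sum / strengths.sum()

# L2-normalize so that profiles of different users are comparable
user_profile = normalize(weighted_avg)
print(weighted_avg, user_profile)
```

The resulting profile leans towards feature 0, because the item carrying that feature had the higher strength.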

# Collaborative Filtering

Train/test split

In [22]:
interactions_train_df, interactions_test_df = train_test_split(interactions_from_selected_users_df,
                                   stratify=interactions_from_selected_users_df['UserId'], 
                                   test_size=0.20,
                                   random_state=42)

print('# interactions on Train set: %d' % len(interactions_train_df))
print('# interactions on Test set: %d' % len(interactions_test_df))

# interactions on Train set: 375816
# interactions on Test set: 93955


In [20]:
#Creating a sparse (mostly zero) pivot table with users in rows and items in columns, built from the training interactions
users_items_pivot_matrix_df = interactions_train_df.pivot(index='UserId',columns='ProductId',values='Rating').fillna(0)

users_items_pivot_matrix_df.head()

ValueError: ignored
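
The error message is not captured above, so its cause is not certain. One plausible cause (an assumption) is that a dense users x products pivot table is simply too large to build for tens of thousands of users. A common alternative is to skip the dense table and construct a SciPy sparse matrix directly from the (user, item, rating) triples. The sketch below shows the idea on a tiny invented frame with the same column names; the variable names are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy interactions with the same columns as the notebook's ratings data
interactions = pd.DataFrame({
    "UserId": ["u1", "u1", "u2", "u3"],
    "ProductId": ["p1", "p2", "p2", "p3"],
    "Rating": [5.0, 3.0, 4.0, 1.0],
})

# Categorical codes give each user and each product a dense integer index
user_cat = interactions["UserId"].astype("category")
item_cat = interactions["ProductId"].astype("category")

# Only the observed ratings are stored; unrated cells are implicit zeros
users_items_sparse = csr_matrix(
    (interactions["Rating"].to_numpy(),
     (user_cat.cat.codes.to_numpy(), item_cat.cat.codes.to_numpy())),
    shape=(len(user_cat.cat.categories), len(item_cat.cat.categories)),
)
print(users_items_sparse.shape, users_items_sparse.nnz)
```

`user_cat.cat.categories` and `item_cat.cat.categories` map row and column positions back to the original IDs when recommendations need to be reported.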

In [35]:
users_items_pivot_matrix = users_items_pivot_matrix_df.values
users_items_pivot_matrix[:10]

NameError: ignored