<a href="https://colab.research.google.com/github/phwangktw/data-course-sample/blob/main/Session3_Collaborative-based(surprise_package)_Recommendation_Algorithm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session3_Collaborative-based(surprise_package)_Recommendation_Algorithm.ipynb

## Step1. Package imports and utility functions


In [1]:
import pandas as pd
import numpy as np
import gzip, json
from os.path import exists
from itertools import combinations
from collections import defaultdict


import matplotlib.pyplot as plt
import seaborn as sns
import re
import datetime
!pip install surprise
from surprise import Reader
from surprise import Dataset
from surprise import KNNBasic


def parse(path):
    g = gzip.open(path, 'rb')
    for l in g:
        yield json.loads(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
     |████████████████████████████████| 11.8 MB 4.5 MB/s
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1619414 sha256=d162de6b75b286ba0d15150d16cc4749c5973cdcb9a5cc2cdca01dff29a02e2d
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


## Step2. Download data

In [2]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/All_Beauty.csv
!wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_All_Beauty.json.gz

metadata = getDF('/content/meta_All_Beauty.json.gz')
ratings = pd.read_csv('/content/All_Beauty.csv', names=['asin', 'reviewerID', 'overall', 'unixReviewTime'], header=None)



--2022-01-08 23:46:26--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/All_Beauty.csv
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15499476 (15M) [application/octet-stream]
Saving to: ‘All_Beauty.csv’


2022-01-08 23:46:27 (19.1 MB/s) - ‘All_Beauty.csv’ saved [15499476/15499476]

--2022-01-08 23:46:27--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_All_Beauty.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10329961 (9.9M) [application/octet-stream]
Saving to: ‘meta_All_Beauty.json.gz’


2022-01-08 23:46:28 (15.1 MB/s) - ‘meta_All_Beauty.json.gz’ saved [10329961/10329961]



## Step3. Parsing data

### Step3-1: Convert time format

In [3]:
ratings['DATE'] = pd.to_datetime(ratings['unixReviewTime'], unit='s')
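
As a quick check of what this conversion does, the sketch below applies the same `pd.to_datetime(..., unit='s')` call to a tiny hand-made frame. The frame `sample` and its two timestamps are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample: two Unix timestamps in seconds (not from the dataset)
sample = pd.DataFrame({'unixReviewTime': [1535760000, 1538265600]})

# unit='s' tells pandas that the integers count seconds since 1970-01-01 UTC
sample['DATE'] = pd.to_datetime(sample['unixReviewTime'], unit='s')
print(sample)
```

The resulting `DATE` column holds `datetime64` values, which is what makes the string comparisons against dates such as `'2018-09-01'` in the later cells possible.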

### Step3-2: Data preprocessing
(Same as [Session1](https://github.com/phwangktw/data-course-sample/blob/main/Session1_Rule-based_Recommendation_Algorithm.ipynb))

*   Drop duplicated rows
*   Replace blanks with `NaN`
*   Parse the `rank` column to generate `rank_num` and `rank_category`
*   Use regular expressions to extract the rank number, rank category, and price
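
To make the `rank` parsing concrete, here is a minimal sketch that runs the same two regular expressions as the cleaning cell on a single string. The value of `raw_rank` is an assumption about what a raw entry looks like; it is not copied from the dataset.

```python
import re

# Hypothetical raw value, assumed to resemble an entry of the `rank` column
raw_rank = '2,938,573 in Beauty &amp; Personal Care ('

rank = raw_rank.replace('&amp;', '&')                      # undo the HTML escape
rank_category = re.search(r'in (.*) \(', rank).group(1)    # text between "in " and " ("
rank_num = float(re.search(r'(.*) in .*', rank).group(1).replace(',', ''))

print(rank_category, rank_num)
```

Raw strings (`r'...'`) keep the backslash in `\(` from being treated as a string escape.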

In [4]:
## Cleaning data (cited from: https://github.com/yuchiahung/data-course-sample/blob/main/hw1_Ana.ipynb)
## Drop duplicated rows first, then replace empty strings with NaN
metadata_clean = metadata.loc[metadata.astype(str).drop_duplicates().index]
metadata_clean.replace('', np.nan, inplace = True)


# clean column `rank` -> Parsing out to RankNum + RankCategory
metadata_clean['rank'] = metadata_clean['rank'].str.replace('&amp;', '&')
metadata_clean['rank'].fillna('0', inplace = True)
metadata_clean['rank_category'] = [re.search(r'in (.*) \(', r).group(1) if r != '0' else None for r in metadata_clean['rank']]
metadata_clean['rank_num'] = [re.search(r'(.*) in .*', r).group(1) if r != '0' else None for r in metadata_clean['rank']]
metadata_clean['rank_num'] = metadata_clean['rank_num'].str.replace(',', '').astype(float)

# excluding category != 'Beauty & Personal Care'
metadata_clean = metadata_clean[metadata_clean.rank_category == 'Beauty & Personal Care']

# convert `price` to float
metadata_clean['price'].fillna('0', inplace = True)
metadata_clean['price'] = [re.search(r'\$(.*)', p).group(1) if re.search(r'\$(.*)', p) is not None else None for p in metadata_clean['price']]
metadata_clean['price'] = metadata_clean['price'].str.replace(',', '').astype(float)

# drop useless columns
metadata_clean.drop(
    ['category', 'tech1', 'fit', 'tech2', 'date', 'similar_item', 'feature', 'main_cat', 'rank'], 
    axis = 1, 
    inplace = True
)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast,


### Step3-3: Split the data by time into training and testing sets

In [5]:
ratings_trainings = ratings[
    (ratings['DATE'] < '2018-09-01')
]
ratings_testings = ratings[
    (ratings['DATE'] >= '2018-09-01') & 
    (ratings['DATE'] <= '2018-09-30')
]
ratings_testings_by_user = ratings_testings.groupby('reviewerID').agg(list).reset_index()[['reviewerID', 'asin']].to_dict('records')
ratings_testings_by_user = { rating['reviewerID']: rating['asin'] for rating in ratings_testings_by_user }
users = list(ratings_testings_by_user.keys())
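
The split above can be tried on a toy frame to see what each piece contains. This is only a sketch: the frame `toy` and all IDs and dates in it are made up, and the per-user dictionary is built with a shorter `groupby` expression than the one in the cell above.

```python
import pandas as pd

# Hypothetical toy ratings (IDs and dates are made up)
toy = pd.DataFrame({
    'asin': ['A1', 'A2', 'A1', 'A3'],
    'reviewerID': ['u1', 'u1', 'u2', 'u1'],
    'DATE': pd.to_datetime(['2018-08-15', '2018-09-05', '2018-09-10', '2018-09-20']),
})

# Everything before September 2018 is training data; September 2018 is the test window
toy_train = toy[toy['DATE'] < '2018-09-01']
toy_test = toy[(toy['DATE'] >= '2018-09-01') & (toy['DATE'] <= '2018-09-30')]

# Map each test user to the list of items they rated in the test window
toy_test_by_user = toy_test.groupby('reviewerID')['asin'].agg(list).to_dict()
print(toy_test_by_user)
```

Only the first row falls into the training set here; the other three rows are grouped by user into the test dictionary.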

## Step4: Item-based Algorithm Implementation
### Step4-1: Generate `dict` format (`user_to_items`)
### Step4-2: Filter users with >3 rating history
### Step4-3: Generate `pre_user_similarity` dict (e.g. `'User': {'OtherUsers': [xy xx yy]}`)
### Step4-4: Generate user_similarity

In [6]:
def recommender_suprise(training_data, users=[], k=10, user_based=False, algo=KNNBasic, similarities_module = 'cosine', min_sup = 1):

    training_dataBkup = training_data    

    # Filter out duplicate ratings of the same item by the same user (keep the latest)
    training_data = (
        training_data
        .sort_values("DATE", ascending=False)
        .groupby(['reviewerID', 'asin']).head(1)
    )

    # use data in the previous year
    training_data = training_data[training_data.DATE >= '2017-09-01']

    # preparing data
    reader = Reader(rating_scale=(0, 5))
    training_data = training_data[['reviewerID', 'asin', 'overall']]
    data = Dataset.load_from_df(training_data, reader=reader)
    
    # set the parameters & algorithm
    sim_options = {
        'name': similarities_module,
        'user_based': user_based,   # compute similarities between items
        'min_support': 2
    }
    algo_impl = algo(sim_options=sim_options)
    trainset = data.build_full_trainset()
    algo_impl.fit(trainset)

    # get the recommended items
    recommendations = {}
    k_rule = 5
    for user in users:
        items_user_rated = set(training_data.loc[training_data['reviewerID'] == user]['asin'].to_list())
        recommend_item_list = []
        recommend_item_set = set()
        for item in items_user_rated:
            iid = algo_impl.trainset.to_inner_iid(item)
            recommend_items_iid = algo_impl.get_neighbors(iid, k_rule) # recommend k items based on rated item
            for sim_item_iid in recommend_items_iid:
                item_raw_id = algo_impl.trainset.to_raw_iid(sim_item_iid)
                # if the item has not been rated nor recommended, recommend it
                if item_raw_id not in items_user_rated and item_raw_id not in recommend_item_set:
                    recommend_item_list.append(item_raw_id)
                    recommend_item_set.add(item_raw_id)

            if len(recommend_item_list) >= k_rule:
                recommend_item_list = recommend_item_list[:k_rule]
                break

        # Popular products (recommend `k_left` products)
        k_left = k - len(recommend_item_list)
        ## Best seller (by rating data) & highest rating products (recommend `k` product)
        products_rating = training_dataBkup[training_dataBkup.DATE >= '2017-09-01'].groupby('asin')[['overall']].agg(['mean', 'count'])
        products_rating.columns = products_rating.columns.droplevel(0)
        rule_recom = products_rating.sort_values(by = ['count', 'mean'], ascending = False).index.tolist()[:k_left]
        
        # concat the item lists (KNN-based neighbors first, then popular products as filler)
        user_recom = recommend_item_list + rule_recom
        recommendations[user] = user_recom

    return recommendations
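
To illustrate what item-based neighbors mean here, the following is a minimal pure-Python sketch of cosine similarity between items, computed over the users who rated both items, with a `min_support` cutoff. It is not the `surprise` implementation; the functions `item_cosine` and `nearest_items` and the toy ratings are hypothetical, and the treatment of `min_support` (similarity set to 0 below the cutoff) reflects my understanding of the library's documented behavior rather than its source code.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy ratings: (user, item, rating)
toy_ratings = [
    ('u1', 'A', 5), ('u1', 'B', 4),
    ('u2', 'A', 4), ('u2', 'B', 5), ('u2', 'C', 1),
    ('u3', 'B', 2), ('u3', 'C', 5),
]

def item_cosine(ratings, min_support=2):
    """Cosine similarity between items over co-rating users (illustrative only)."""
    by_item = defaultdict(dict)
    for user, item, r in ratings:
        by_item[item][user] = r
    sims = {}
    items = sorted(by_item)
    for i in items:
        for j in items:
            if i == j:
                continue
            common = set(by_item[i]) & set(by_item[j])
            if len(common) < min_support:
                sims[(i, j)] = 0.0  # too few co-raters: treat the pair as unrelated
                continue
            dot = sum(by_item[i][u] * by_item[j][u] for u in common)
            norm_i = sqrt(sum(by_item[i][u] ** 2 for u in common))
            norm_j = sqrt(sum(by_item[j][u] ** 2 for u in common))
            sims[(i, j)] = dot / (norm_i * norm_j)
    return sims

def nearest_items(sims, item, k):
    """Return the k items most similar to `item` (ties broken by item ID)."""
    candidates = [(s, j) for (i, j), s in sims.items() if i == item]
    return [j for s, j in sorted(candidates, key=lambda t: (-t[0], t[1]))[:k]]

sims = item_cosine(toy_ratings)
print(nearest_items(sims, 'A', k=2))
```

Items `A` and `C` share only one rater, so with `min_support=2` their similarity is 0 and `C` ranks behind `B`. On sparse data many pairs fall below the cutoff in the same way, which is one reason the neighbor lists can carry little signal.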

### Base Model setup 
The baseline is the rule-based algorithm that recommends the K most popular products of the most recent year (see Session1).

In [7]:
#Rule1: A year-based recommendation
def recommender_base(training_data, users=[], k=10):
    '''
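    * training_data: dataframe, the input training set (data before 2018-09-01)
    * users: [] users who need recommendations
    * k: int, number of products to recommend to each user
    * recommendations: dict
      {
          user 1: [recommended product 1, recommended product 2, ...],
          user 2: [...], ...
      }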
    '''
    recommendations = {}
    ## Best seller (by rating data) & highest rating products (recommend `k` product)
    products_rating = training_data[training_data.DATE >= '2017-09-01'].groupby('asin')[['overall']].agg(['mean', 'count'])
    products_rating.columns = products_rating.columns.droplevel(0)
    best_seller_lst = products_rating.sort_values(by = ['count', 'mean'], ascending = False).index.tolist()[:k]

    recommendations = {user: best_seller_lst for user in users}
    return recommendations
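
The ranking rule inside `recommender_base` can be checked on a toy frame. In this sketch the frame `toy` and its IDs and ratings are made up; only the `groupby`/`agg`/`sort_values` pattern mirrors the function above.

```python
import pandas as pd

# Hypothetical toy ratings (made-up IDs)
toy = pd.DataFrame({
    'asin': ['A1', 'A1', 'A2', 'A2', 'A3'],
    'overall': [5.0, 3.0, 5.0, 5.0, 4.0],
})

stats = toy.groupby('asin')['overall'].agg(['mean', 'count'])
# Sort by number of ratings first, then by average rating, both descending
top = stats.sort_values(by=['count', 'mean'], ascending=False).index.tolist()[:2]
print(top)
```

`A1` and `A2` both have two ratings, so the tie is broken by the mean rating and `A2` comes first.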

### Evaluation Algorithm and the Results

In [8]:
def evaluate(ratings_testings_by_user={}, ratings_by_user={}, method=None):
    '''
    * ratings_testings_by_user: dict, products actually purchased (data from 2018-09-01 onward)
    * ratings_by_user: dict, recommended products learned from the training data
    * method: str
    * score: float
    '''
    total = 0
    for d in ratings_testings_by_user:
        if d in ratings_by_user:
            total += len(set(ratings_by_user[d]) & set(ratings_testings_by_user[d]))

    score = total / len(ratings_testings)
    return score

ratings_by_user = recommender_suprise(ratings_trainings, users)
rcListBase = recommender_base(ratings_trainings, users)


score1 = evaluate(ratings_testings_by_user, ratings_by_user)
scoreBase = evaluate(ratings_testings_by_user, rcListBase)
# Evaluation scores
print(f'Rule1: \n{round(score1, 4)}')
print(f'Base_case: \n{round(scoreBase, 4)}')
print(f'Improvement of Collaborative-based method: \n{round(100*(score1-scoreBase)/scoreBase, 1)} %')

Computing the cosine similarity matrix...
Done computing similarity matrix.
Rule1: 
0.1
Base_case: 
0.0983
Improvement of Collaborative-based method: 
1.7 %
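
The score computed by `evaluate` is the number of recommended items that a user actually rated in the test window, summed over users and divided by the total number of test ratings. A small worked example follows; the dictionaries `actual` and `recommended` and all IDs in them are made up.

```python
# Hypothetical toy data (made-up IDs)
actual = {'u1': ['A2', 'A3'], 'u2': ['A1']}            # items each user rated in the test window
recommended = {'u1': ['A2', 'A9'], 'u2': ['A5', 'A6']}

# Count recommended items that also appear in the user's test ratings
hits = sum(len(set(recommended[u]) & set(items))
           for u, items in actual.items() if u in recommended)
n_test_ratings = sum(len(items) for items in actual.values())  # 3 ratings in total

score = hits / n_test_ratings
print(round(score, 4))
```

Only `A2` for `u1` is a hit, so the score is 1 out of 3 test ratings.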


## Step5: Discussions of obstacles

The score does not improve significantly for the following reasons:
*   Only a small fraction of the test users (38/584 = 6.5%) have any purchase (review) history.
*   The average score of the collaborative-based method is very low, which suggests that most of the products a user newly bought are unrelated to that user's purchase history.
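
One way a coverage ratio like the one in the first bullet could be computed is sketched below. The notebook does not show how the 38/584 figure was obtained, so this is an assumption about the method, and the user IDs are made up.

```python
# Hypothetical toy user IDs (made up)
train_users = {'u1', 'u2', 'u3'}
test_users = ['u1', 'u4', 'u5', 'u6']

# Test users who also appear in the training data, i.e. who have a history
with_history = [u for u in test_users if u in train_users]
coverage = len(with_history) / len(test_users)
print(f'{len(with_history)}/{len(test_users)} = {coverage:.1%}')
```

Users without any history receive only the popular-product filler, so for them the personalized recommender and the baseline return the same list.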

