
# Sample Code

## Infrastructure

In [51]:
import pandas as pd
import gzip, json

def parse(path):
    # Stream one JSON record per line; the context manager closes the file.
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield json.loads(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')
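
As a quick check of what the loader expects, the sketch below writes a tiny gzipped JSON-lines file and reads it back the same way `parse` does, one JSON object per line. The file name `sample.json.gz`, the two records, and the helper name `read_json_lines` are assumptions made up for illustration, not part of the dataset.

```python
import gzip
import json
import os
import tempfile

def read_json_lines(path):
    """Yield one dict per line of a gzipped JSON-lines file."""
    with gzip.open(path, 'rb') as f:
        for line in f:
            yield json.loads(line)

# Hypothetical records, not taken from the real metadata file.
sample = [
    {'asin': 'A001', 'title': 'Example item 1'},
    {'asin': 'A002', 'title': 'Example item 2'},
]

# Write the records as gzipped JSON lines into a temporary directory.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.json.gz')
with gzip.open(sample_path, 'wb') as f:
    for record in sample:
        f.write((json.dumps(record) + '\n').encode('utf-8'))

records = list(read_json_lines(sample_path))
print(records)
```

Passing `records` to `pd.DataFrame(records)` should give one row per record, much like `getDF`.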

## Load data

In [52]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/All_Beauty.csv
!wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_All_Beauty.json.gz

--2022-01-09 14:01:45--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/All_Beauty.csv
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15499476 (15M) [application/octet-stream]
Saving to: ‘All_Beauty.csv.2’


2022-01-09 14:01:45 (25.1 MB/s) - ‘All_Beauty.csv.2’ saved [15499476/15499476]

--2022-01-09 14:01:45--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_All_Beauty.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10329961 (9.9M) [application/octet-stream]
Saving to: ‘meta_All_Beauty.json.gz.2’


2022-01-09 14:01:46 (18.4 MB/s) - ‘meta_All_Beauty.json.gz.2’ saved [10329961/10329961]



In [53]:
metadata = getDF('/content/meta_All_Beauty.json.gz')
ratings = pd.read_csv('/content/All_Beauty.csv', names=['asin', 'reviewerID', 'overall', 'unixReviewTime'], header=None)

In [54]:
metadata.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,details,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
0,[],,[Loud 'N Clear Personal Sound Amplifier allows...,,Loud 'N Clear&trade; Personal Sound Amplifier,[],,idea village,[],"2,938,573 in Beauty & Personal Care (",[],{'ASIN: ': '6546546450'},All Beauty,,,,6546546450,[],[]
1,[],,[No7 Lift & Luminate Triple Action Serum 50ml ...,,No7 Lift &amp; Luminate Triple Action Serum 50...,"[B01E7LCSL6, B008X5RVME]",,,[],"872,854 in Beauty & Personal Care (",[],"{'Shipping Weight:': '0.3 ounces (', 'ASIN: ':...",All Beauty,"class=""a-bordered a-horizontal-stripes a-spa...",,$44.99,7178680776,[],[]
2,[],,[No7 Stay Perfect Foundation now stays perfect...,,No7 Stay Perfect Foundation Cool Vanilla by No7,[],,No7,[],"956,696 in Beauty & Personal Care (","[B01B8BR0O8, B01B8BR0NO, B014MHXXM8]","{'Shipping Weight:': '3.5 ounces (', 'ASIN: ':...",All Beauty,,,$28.76,7250468162,[],[]
3,[],,[],,Wella Koleston Perfect Hair Colour 44/44 Mediu...,[B0041PBXX8],,,[],"1,870,258 in Beauty & Personal Care (",[],"{'  Item Weight: ': '1.76 ounces', 'Sh...",All Beauty,,,,7367905066,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
4,[],,[Lacto Calamine Skin Balance Daily Nourishing ...,,Lacto Calamine Skin Balance Oil control 120 ml...,[],,Pirmal Healthcare,[],"67,701 in Beauty & Personal Care (","[3254895630, B007VL1D9S, B00EH9A0RI, B0773MBG4...","{'Shipping Weight:': '12 ounces (', 'ASIN: ': ...",All Beauty,,,$12.15,7414204790,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...


In [55]:
ratings.head()

Unnamed: 0,asin,reviewerID,overall,unixReviewTime
0,143026860,A1V6B6TNIC10QE,1.0,1424304000
1,143026860,A2F5GHSXFQ0W6J,4.0,1418860800
2,143026860,A1572GUYS7DGSR,4.0,1407628800
3,143026860,A1PSGLFK1NSVO,5.0,1362960000
4,143026860,A6IKXKZMTKGSC,5.0,1324771200


## Data preparation

In [56]:
ratings['DATE'] = pd.to_datetime(ratings['unixReviewTime'], unit='s')
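
`unixReviewTime` holds seconds since the Unix epoch, which is why the conversion passes `unit='s'`. A minimal check, using two timestamps copied from the `ratings.head()` output:

```python
import pandas as pd

# Two timestamps copied from the ratings.head() output.
timestamps = pd.Series([1424304000, 1324771200])
dates = pd.to_datetime(timestamps, unit='s')
print(dates.dt.strftime('%Y-%m-%d').tolist())
```

Both values are exact multiples of 86400 seconds, so these reviews are stamped at midnight UTC and carry no time-of-day information.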

## Data splitting

In [57]:
ratings_trainings = ratings[
    (ratings['DATE'] < '2018-09-01')
]
ratings_trainings_1y = ratings[
    (ratings['DATE'] >= '2017-09-01') & 
    (ratings['DATE'] < '2018-09-01')
]
ratings_trainings_3m = ratings[
    (ratings['DATE'] >= '2018-06-01') & 
    (ratings['DATE'] < '2018-09-01')
]
ratings_trainings_1m = ratings[
    (ratings['DATE'] >= '2018-08-01') & 
    (ratings['DATE'] < '2018-09-01')
]

ratings_testings = ratings[
    (ratings['DATE'] >= '2018-09-01') & 
    (ratings['DATE'] <= '2018-09-30')
]
ratings_testings_by_user = ratings_testings.groupby('reviewerID').agg(list).reset_index()[['reviewerID', 'asin']].to_dict('records')
ratings_testings_by_user = { rating['reviewerID']: rating['asin'] for rating in ratings_testings_by_user }
users = list(ratings_testings_by_user.keys())
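
The training windows differ only in their start date, so the repeated filters could be folded into one helper. The sketch below is one possible way to do that on made-up data; `slice_by_date` is a hypothetical name, and it uses half-open intervals (`start <= DATE < end`), which is an assumption rather than the notebook's exact test filter.

```python
import pandas as pd

def slice_by_date(df, start=None, end=None, col='DATE'):
    """Return rows with start <= df[col] < end; either bound may be omitted."""
    mask = pd.Series(True, index=df.index)
    if start is not None:
        mask &= df[col] >= pd.Timestamp(start)
    if end is not None:
        mask &= df[col] < pd.Timestamp(end)
    return df[mask]

# Toy data with made-up IDs, only to exercise the helper.
toy = pd.DataFrame({
    'reviewerID': ['u1', 'u2', 'u1', 'u3'],
    'asin': ['a', 'b', 'c', 'a'],
    'DATE': pd.to_datetime(['2018-07-15', '2018-08-20', '2018-09-03', '2018-09-10']),
})

train_1m = slice_by_date(toy, '2018-08-01', '2018-09-01')
test = slice_by_date(toy, '2018-09-01', '2018-10-01')

# Same shape as ratings_testings_by_user: reviewerID -> list of asins.
test_by_user = test.groupby('reviewerID')['asin'].agg(list).to_dict()
print(len(train_1m), test_by_user)
```

Selecting the `asin` column before aggregating yields the user-to-items dict directly, without the intermediate `to_dict('records')` step.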

In [58]:
print("train_all:", len(ratings_trainings))
print("train_1y:", len(ratings_trainings_1y))
print("train_3M:", len(ratings_trainings_3m))
print("train_1M:", len(ratings_trainings_1m))
print("test:", len(ratings_testings))

train_all: 370752
train_1y: 48612
train_3M: 7027
train_1M: 1662
test: 590


In [59]:
print(ratings_trainings_1y.groupby(['reviewerID']).nunique())
print(ratings_trainings_1y.groupby(['asin']).nunique())

                      asin  overall  unixReviewTime  DATE
reviewerID                                               
A0072285CLEIGDW2A0TB     1        1               1     1
A00827299BWJTEFNVKP0     1        1               1     1
A012671347N9PAVZNSK8     1        1               1     1
A0127955G1ZQHYZFUZ0A     1        1               1     1
A0148728MFRW1Y6J8UD4     1        1               1     1
...                    ...      ...             ...   ...
AZZEM9DMMK30Y            2        1               1     1
AZZHAM7ZWFCDU            1        1               1     1
AZZL11DINXDL3            1        1               1     1
AZZQKE13BD5HE            1        1               1     1
AZZZ5UJWUVCYZ            2        1               1     1

[45373 rows x 4 columns]
            reviewerID  overall  unixReviewTime  DATE
asin                                                 
0143026860           1        1               1     1
014789302X           8        1               7     7
0571

## Generate recommendations

In [60]:
!pip install surprise
import time
import pandas as pd
from itertools import combinations
from collections import defaultdict

from surprise import Reader
from surprise import Dataset
from surprise import KNNBasic




In [61]:
def recommender_by_asin(training_data, k=10):
  '''
  Return the K items with the most ratings.
  '''
  data = training_data.groupby(by=['asin']).size().sort_values(ascending=False)
  data_list = data.index.tolist()
  data_list = data_list[:k]
  return data_list

recommender_top10_1m = recommender_by_asin(ratings_trainings_1m, k=10)
recommender_top10_1m

['B01DKQAXC0',
 'B00W259T7G',
 'B013XKHA4M',
 'B01DLR9IDI',
 'B012Z7IHHI',
 'B01AVJCDYA',
 'B0195R1FT8',
 'B01CJNZKZK',
 'B018WCT01C',
 'B01C39X6TW']
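
The popularity baseline only counts how often each `asin` appears in the training window and keeps the most frequent ones. A standard-library sketch of the same idea, with made-up item IDs and a hypothetical helper name `top_k_items`:

```python
from collections import Counter

def top_k_items(asins, k=10):
    """Return the k most frequent item IDs, most frequent first."""
    return [asin for asin, _ in Counter(asins).most_common(k)]

# Made-up item IDs for illustration.
toy_asins = ['a', 'b', 'a', 'c', 'a', 'b']
print(top_k_items(toy_asins, k=2))
```

Because the list does not depend on the user, every user who falls back to it receives the same recommendations.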

### User-based collaborative filtering

In [62]:
# header: user_id,item_id,rating,timestamp

def recommender_user_cf(training_data, users=[], k=10, rule=False, rule_list=[]):

    # loading data from dataframe
    # user_to_items dict:
    # {
    #   'user': {
    #       'item': ratings...
    #   }...
    # }
    user_to_items = defaultdict(dict)
    for _, row in training_data.iterrows():
        row = dict(row)
        user = row['reviewerID']
        item = row['asin']
        rating = float(row['overall'])

        user_to_items[user][item] = rating

    print("total users before filtering: ", len(user_to_items))

    # remove obscure user to decrease data size
    # filtering params
    remove_obscure_user = True
    user_rating_threshold = 3
    all_users = list(user_to_items.keys())
    for user in all_users:
        ratings = user_to_items[user]
        if remove_obscure_user and len(ratings) < user_rating_threshold:
            del user_to_items[user]

    print("total users  after filtering: ", len(user_to_items))

    # generate item to user mapping dict
    # {
    #   'item': {
    #       'user': ratings...
    #   }...
    # }
    item_to_users = defaultdict(dict)
    for user, items in user_to_items.items():
        for item, rating in items.items():
            item_to_users[item][user] = rating

    # prepare data of computing user similarity 
    init_sim = lambda: [0 for _ in range(3)]
    factory = lambda: defaultdict(init_sim)
    pre_user_similarity = defaultdict(factory)
    n = len(item_to_users)
    index = 0
    for item, user_ratings in item_to_users.items():
        if len(user_ratings) > 1:
            # print(f"item: {item} have been rated by {len(user_ratings)} users progress: {index}/{n}")
            for user1, user2 in combinations(user_ratings.keys(), 2):
                xy = user_ratings[user1] * user_ratings[user2]
                xx = user_ratings[user1] ** 2
                yy = user_ratings[user2] ** 2
                pre_user_similarity[user1][user2][0] += xy
                pre_user_similarity[user1][user2][1] += xx
                pre_user_similarity[user1][user2][2] += yy

                pre_user_similarity[user2][user1][0] += xy
                pre_user_similarity[user2][user1][1] += xx
                pre_user_similarity[user2][user1][2] += yy
        index += 1

    user_similarity = {}
    for src_user in pre_user_similarity:
        user_similarity_order = []
        for dst_user, val in pre_user_similarity[src_user].items():
            xy = val[0]
            xx = val[1]
            yy = val[2]
            div = ((xx*yy) ** 0.5)
            if div == 0:
                continue
            similarity = xy / div
            if similarity < 0:
                continue
            for i, s in enumerate(user_similarity_order):
                target_similarity = s[1]
                if target_similarity < similarity:
                    user_similarity_order.insert(i, (dst_user, similarity))
                    break
            else:
                user_similarity_order.append((dst_user, similarity))
        user_similarity[src_user] = user_similarity_order

    recommendation = {}
    for user in users:
        if user in user_similarity:
            sim_users = user_similarity[user]
            recommended_items = []
            recommended_items_set = set()
            user_have_rated = set(user_to_items[user])
            stop_recommend = False
            for sim_user, _ in sim_users:
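                # NOTE: ascending sort, so each neighbour's lowest-rated items are considered first;
                # passing reverse=True would prefer their highest-rated items instead.
                items_from_sim_user = sorted(list(user_to_items[sim_user].items()), key=lambda item: item[1])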
                for item, _ in items_from_sim_user:
                    if item not in user_have_rated and item not in recommended_items_set:
                        recommended_items.append(item)
                        recommended_items_set.add(item)
                    if len(recommended_items) >= k:
                        stop_recommend = True
                        break
                if stop_recommend:
                    break
            recommendation[user] = recommended_items
        else:
            recommendation[user] = []
            if rule:
                recommendation[user] = rule_list
    return recommendation

ratings_by_user_user = recommender_user_cf(ratings_trainings, users)
ratings_by_user_user_rule = recommender_user_cf(ratings_trainings, users, rule=True, rule_list=recommender_top10_1m)
# ratings_by_user

total users before filtering:  323489
total users  after filtering:  4793
total users before filtering:  323489
total users  after filtering:  4793
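
The similarity used here is a cosine computed only over the items both users have rated: the three accumulators hold the sums of `x*y`, `x*x`, and `y*y`, and the score is `xy / sqrt(xx * yy)`. The sketch below reproduces that calculation for a single pair of users; the function name `cosine_on_corated` and the ratings are made up, and unlike the notebook (which skips such pairs) it returns 0.0 when there is no overlap.

```python
def cosine_on_corated(ratings_a, ratings_b):
    """Cosine similarity restricted to items present in both rating dicts."""
    xy = xx = yy = 0.0
    for item in ratings_a.keys() & ratings_b.keys():
        xy += ratings_a[item] * ratings_b[item]
        xx += ratings_a[item] ** 2
        yy += ratings_b[item] ** 2
    div = (xx * yy) ** 0.5
    return xy / div if div else 0.0

# Made-up users and ratings.
alice = {'a': 5.0, 'b': 1.0, 'c': 4.0}
bob = {'a': 1.0, 'd': 5.0}               # shares only item 'a' with alice
carol = {'a': 5.0, 'b': 2.0, 'e': 3.0}   # shares 'a' and 'b' with alice

print(cosine_on_corated(alice, bob))
print(cosine_on_corated(alice, carol))
```

Two properties follow from the formula. With a single co-rated item the similarity is 1.0 whatever the two ratings are (here 5 stars versus 1 star), and since all ratings are positive the score can never be negative, so the `similarity < 0` check never triggers on this data.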


### Item-based collaborative filtering

In [63]:
def recommender_item_cf(training_data, users=[], k=10, rule=False, rule_list=[]):

    # loading data from dataframe
    # item_to_users dict:
    # {
    #   'item': {
    #       'user': ratings...
    #   }...
    # }
    item_to_users = defaultdict(dict)
    for _, row in training_data.iterrows():
        row = dict(row)
        user = row['reviewerID']
        item = row['asin']
        rating = float(row['overall'])
        item_to_users[item][user] = rating

    print("data converted")

    user_to_items = defaultdict(dict)
    for item, rating_users in item_to_users.items():
        for user, rating in rating_users.items():
            user_to_items[user][item] = rating

    print("data inverted")

    init_sim = lambda: [0, 0, 0]
    factory = lambda: defaultdict(init_sim)
    pre_item_similarity = defaultdict(factory)
    for user, items in user_to_items.items():
        if len(items) > 1:
            for i1, i2 in combinations(items.keys(), 2):
                xy = items[i1] * items[i2]
                xx = items[i1] ** 2
                yy = items[i2] ** 2
                pre_item_similarity[i1][i2][0] += xy
                pre_item_similarity[i1][i2][1] += xx
                pre_item_similarity[i1][i2][2] += yy

                pre_item_similarity[i2][i1][0] += xy
                pre_item_similarity[i2][i1][1] += xx
                pre_item_similarity[i2][i1][2] += yy

    print("sim data prepared")

    item_similarity = {}
    for src_item in pre_item_similarity:
        item_similarity_order = []
        for dst_item, val in pre_item_similarity[src_item].items():
            xy = val[0]
            xx = val[1]
            yy = val[2]
            div = ((xx*yy) ** 0.5)
            if div == 0:
                continue
            similarity = xy / div
            if similarity < 0:
                continue
            for i, s in enumerate(item_similarity_order):
                target_similarity = s[1]
                if target_similarity < similarity:
                    item_similarity_order.insert(i, (dst_item, similarity))
                    break
            else:
                item_similarity_order.append((dst_item, similarity))
        item_similarity[src_item] = item_similarity_order

    # print(f"get {k} recommendation items for for user: {users}")

    recommendation = {}
    for user in users:
        items = []
        items_set = set()
        stop = False
        user_has_rated = set(user_to_items[user])
        for item in user_has_rated:
            if item in item_similarity:
                for sim_item, _ in item_similarity[item]:
                    # skip the item user has rated
                    if sim_item not in user_has_rated and sim_item not in items_set:
                        items.append(sim_item)
                        items_set.add(sim_item)
                    if len(items) >= k:
                        stop = True
                        break
                if stop:
                    break
        if rule:
            if len(items) == 0:
                items = rule_list
        recommendation[user] = items
    return recommendation    

ratings_by_user_item = recommender_item_cf(ratings_trainings, users)
ratings_by_user_item_rule = recommender_item_cf(ratings_trainings, users, rule=True, rule_list=recommender_top10_1m)
# ratings_by_user_item_rule

data converted
data inverted
sim data prepared
data converted
data inverted
sim data prepared
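
Both CF functions keep each neighbour list ordered with a manual insertion loop. Since Python's built-in sort is stable, `sorted(..., reverse=True)` should produce the same order, ties included. The sketch below compares the two on made-up `(item, similarity)` pairs; the function names are hypothetical.

```python
def rank_by_insertion(pairs):
    """Mirror the notebook's loop: keep the list ordered by similarity, descending."""
    ordered = []
    for name, sim in pairs:
        for i, (_, other_sim) in enumerate(ordered):
            if other_sim < sim:
                ordered.insert(i, (name, sim))
                break
        else:
            # No strictly smaller entry found: the new pair goes last.
            ordered.append((name, sim))
    return ordered

def rank_by_sorted(pairs):
    """Same ordering via the built-in stable sort."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Made-up neighbours, including a tie at 1.0.
pairs = [('i1', 0.7), ('i2', 1.0), ('i3', 0.2), ('i4', 1.0), ('i5', 0.9)]
print(rank_by_sorted(pairs))
```

The insertion loop costs quadratic time in the number of neighbours in the worst case, so the `sorted` form may be worth considering for items with many co-rated neighbours.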


### Surprise package collaborative filtering

In [64]:
def recommender_surprise(training_data, users=[], k=10, user_based=False, algo=KNNBasic, rule=False, rule_list=[]):

    training_data = (
        training_data
        .sort_values("DATE", ascending=False)
        .groupby(['reviewerID', 'asin']).head(1)
    )

    reader = Reader(rating_scale=(0, 5))
    training_data = training_data[['reviewerID', 'asin', 'overall']]
    data = Dataset.load_from_df(training_data, reader=reader)

    sim_options = {
        'name': 'cosine',
        'user_based': user_based  # False: item-item similarities; True: user-user similarities
    }
    algo_impl = algo(sim_options=sim_options)
    trainset = data.build_full_trainset()
    algo_impl.fit(trainset)

    recommendation = {}
    for user in users:
        items_user_rated = set(training_data.loc[training_data['reviewerID'] == user]['asin'].to_list())
        recommend_item_list = []
        recommend_item_set = set()
        for item in items_user_rated:
            iid = algo_impl.trainset.to_inner_iid(item)
            recommend_items_iid = algo_impl.get_neighbors(iid, k)
            for sim_item_iid in recommend_items_iid:
                item_raw_id = algo_impl.trainset.to_raw_iid(sim_item_iid)
                if item_raw_id not in items_user_rated and item_raw_id not in recommend_item_set:
                    recommend_item_list.append(item_raw_id)
                    recommend_item_set.add(item_raw_id)

            if len(recommend_item_list) >= k:
                recommend_item_list = recommend_item_list[:k]
                break
        if rule:
            if len(recommend_item_list) == 0:
                recommend_item_list = rule_list
        recommendation[user] = recommend_item_list

    return recommendation

ratings_by_user_sp_item = recommender_surprise(ratings_trainings_1y, users)
ratings_by_user_sp_item_rule = recommender_surprise(ratings_trainings_1y, users, rule=True, rule_list=recommender_top10_1m)
# ratings_by_user_sp_user = recommender_surprise(ratings_trainings_3m, users, user_based=True)
# ratings_by_user

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
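
Before handing the data to Surprise, `recommender_surprise` keeps only the most recent rating for each `(reviewerID, asin)` pair. The sketch below shows that step in isolation on made-up rows, next to a `drop_duplicates` variant that appears to give the same result.

```python
import pandas as pd

# Made-up ratings: user u1 rated item a twice.
toy = pd.DataFrame({
    'reviewerID': ['u1', 'u1', 'u2'],
    'asin': ['a', 'a', 'b'],
    'overall': [2.0, 5.0, 4.0],
    'DATE': pd.to_datetime(['2018-01-01', '2018-06-01', '2018-03-01']),
})

# Newest first, then the first row of every (user, item) group.
latest = (
    toy
    .sort_values('DATE', ascending=False)
    .groupby(['reviewerID', 'asin']).head(1)
)

# drop_duplicates keeps the first row per key, which is the newest after sorting.
latest_alt = (
    toy
    .sort_values('DATE', ascending=False)
    .drop_duplicates(['reviewerID', 'asin'], keep='first')
)

print(latest.sort_values('reviewerID')[['reviewerID', 'asin', 'overall']])
```

In this toy case the older 2-star rating from `u1` is discarded and the later 5-star rating is kept.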


## Evaluation

In [65]:
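def evaluate(ratings_testings_by_user={}, ratings_by_user={}, method=None):
    '''
    * ratings_testings_by_user: dict of the items each user actually purchased (data from 2018-09-01 onward)
    * ratings_by_user: dict of the recommended items learned from the training data
    * method: str
    * score: float
    '''
    total = 0
    for d in ratings_testings_by_user:
        if d in ratings_by_user:
            total += len(set(ratings_by_user[d]) & set(ratings_testings_by_user[d]))

    # Count the test ratings from the argument rather than the global ratings_testings
    n_testings = sum(len(items) for items in ratings_testings_by_user.values())
    score = total / n_testings
    return score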

# evaluate(ratings_testings_by_user, ratings_by_user)
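
The score counts how many recommended items appear among a user's test-period ratings and divides by the total number of test ratings. A minimal sketch on made-up data, with the hypothetical name `hit_rate`, makes the arithmetic concrete:

```python
def hit_rate(actual_by_user, recommended_by_user):
    """Share of test-period (user, item) ratings covered by the recommendations."""
    hits = 0
    total = 0
    for user, actual_items in actual_by_user.items():
        total += len(actual_items)
        hits += len(set(recommended_by_user.get(user, [])) & set(actual_items))
    return hits / total if total else 0.0

# Made-up users and items.
actual = {'u1': ['a', 'b'], 'u2': ['c']}
recommended = {'u1': ['a', 'x'], 'u2': ['y', 'z']}
print(hit_rate(actual, recommended))
```

Here one of three test ratings is covered, so the score is 1/3. Because the denominator counts ratings rather than users, users with many test ratings weigh more heavily.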

In [66]:
print("User-based collaborative filtering", evaluate(ratings_testings_by_user, ratings_by_user_user))
print("Item-based collaborative filtering", evaluate(ratings_testings_by_user, ratings_by_user_item))
print("surprise-item collaborative filtering", evaluate(ratings_testings_by_user, ratings_by_user_sp_item))

User-based collaborative filtering 0.0
Item-based collaborative filtering 0.001694915254237288
surprise-item collaborative filtering 0.001694915254237288


In [67]:
print("User-based collaborative filtering add rule", evaluate(ratings_testings_by_user, ratings_by_user_user_rule))
print("Item-based collaborative filtering add rule", evaluate(ratings_testings_by_user, ratings_by_user_item_rule))
print("surprise-item collaborative filtering add rule", evaluate(ratings_testings_by_user, ratings_by_user_sp_item_rule))

User-based collaborative filtering add rule 0.1576271186440678
Item-based collaborative filtering add rule 0.15423728813559323
surprise-item collaborative filtering add rule 0.1576271186440678
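
Since the denominator is the 590 test ratings reported in the split summary, each score converts back to a hit count. The snippet below does that conversion; the shortened method labels are made up, while the scores and the test-set size are copied from the printed outputs.

```python
# Scores and the test-set size are copied from the notebook outputs.
n_test = 590
scores = {
    'user-based': 0.0,
    'item-based': 0.001694915254237288,
    'surprise item-based': 0.001694915254237288,
    'user-based + rule': 0.1576271186440678,
    'item-based + rule': 0.15423728813559323,
    'surprise item-based + rule': 0.1576271186440678,
}

# Multiply each score by the test-set size to recover the number of hits.
hits = {name: round(score * n_test) for name, score in scores.items()}
print(hits)
```

Read this way, the collaborative-filtering methods alone hit at most one test rating, while the variants with the rule reach 91 to 93 hits, which suggests that nearly all of the improvement comes from the top-10 popularity fallback.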
