# Target

The goal is to maximize click-through rate (CTR). The events dataset contains implicit ratings (logged interactions rather than explicit scores). I will demonstrate hold-one-out evaluation, in which only each user's last click is held out as the test set.
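As a minimal sketch of the hold-one-out idea, the split can be tried on a toy click log before applying it to the real data. The frame `toy` and its values are made up for illustration and are not part of the events dataset.

```python
import pandas as pd

# Hypothetical toy click log; the values are illustrative only.
toy = pd.DataFrame({
    "visitorid": [1, 1, 1, 2, 2, 3],
    "itemid":    [10, 11, 12, 10, 13, 11],
    "timestamp": [100, 200, 300, 150, 250, 400],
})

# Row label of each user's most recent click.
last_idx = toy.groupby("visitorid")["timestamp"].idxmax()

toy_val = toy.loc[last_idx]           # one held-out click per user
toy_train = toy.drop(index=last_idx)  # all earlier clicks
```

Note that a user with a single click (user 3 here) ends up in the validation set only, with no training history.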

In [1]:
import numpy as np
import pandas as pd

In [2]:
rating_df = pd.read_csv('../ecommerce-dataset/events_small.csv')
rating_df.head()

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,1433221373311,781127,view,21989,
1,1433222147345,1076270,view,262799,
2,1433221380636,849453,view,123990,
3,1433223176926,629333,view,128394,
4,1433222897013,492414,view,279976,


In [3]:
# get the number of interactions for each user
rating_df['event_count'] = rating_df['visitorid'].map(rating_df.visitorid.value_counts())
rating_df.head()

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid,event_count
0,1433221373311,781127,view,21989,,1
1,1433222147345,1076270,view,262799,,4
2,1433221380636,849453,view,123990,,1
3,1433223176926,629333,view,128394,,1
4,1433222897013,492414,view,279976,,9


In [4]:
# sort the data frame by timestamp in ascending order for the train/val split
rating_df = rating_df.sort_values('timestamp')
rating_df.head(10)

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid,event_count
34513,1430622118534,584571,view,436195,,2
34476,1430622162554,837890,view,2519,,1
34471,1430622330806,990356,view,369532,,2
34484,1430622469247,584571,view,436195,,2
34470,1430622609378,1002397,view,77392,,2
34514,1430622790487,1375898,view,64152,,1
34472,1430622933406,823085,view,131879,,3
34477,1430623098261,1061274,view,356129,,1
34485,1430623224021,823085,view,214519,,3
34473,1430623241016,122517,view,232129,,4


In [5]:
# map user ids and item ids to integers in [0, N) and [0, M), where N = number of users and M = number of items
from sklearn.preprocessing import LabelEncoder
user_encoder = LabelEncoder()
item_encoder = LabelEncoder()

rating_df['visitorid'] = user_encoder.fit_transform(rating_df.visitorid)
rating_df['itemid'] = item_encoder.fit_transform(rating_df.itemid)

In [6]:
num_users = rating_df.visitorid.max()+1
num_items = rating_df.itemid.max()+1
num_users, num_items

(27372, 9098)

In [7]:
# group by visitorid and call "rank" on the timestamp to number each user's clicks.
# We set ascending=False so that the user's last click has appearance = 1

rating_df['appearance'] = rating_df.groupby('visitorid').timestamp.rank(ascending=False)
rating_df.head(15)

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid,event_count,appearance
34513,1430622118534,11315,view,8460,,2,2.0
34476,1430622162554,16213,view,69,,1,1.0
34471,1430622330806,19226,view,7229,,2,2.0
34484,1430622469247,11315,view,8460,,2,1.0
34470,1430622609378,19479,view,1573,,2,2.0
34514,1430622790487,26737,view,1315,,1,1.0
34472,1430622933406,15931,view,2638,,3,3.0
34477,1430623098261,20675,view,6962,,1,1.0
34485,1430623224021,15931,view,4263,,3,2.0
34473,1430623241016,2310,view,4602,,4,4.0


In [8]:
# train / val split
train_df = rating_df.loc[rating_df.appearance>1]
val_df = rating_df.loc[rating_df.appearance==1]
train_df.shape, val_df.shape

((38766, 7), (27371, 7))
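The validation set has 27371 rows although there are 27372 users. One possible explanation (an assumption, not verified against the data) is a timestamp tie: with the default `method='average'`, tied rows share a fractional rank, so a user whose latest clicks tie has no row with appearance == 1. The sketch below shows this behaviour on a made-up frame `ties` and how `method='first'` breaks the tie.

```python
import pandas as pd

# Hypothetical example: user 7 has two clicks with the same timestamp.
ties = pd.DataFrame({"visitorid": [7, 7, 8], "timestamp": [500, 500, 600]})

# Default tie handling (method='average'): both tied rows get rank 1.5.
avg = ties.groupby("visitorid").timestamp.rank(ascending=False)

# method='first' assigns distinct integer ranks, so exactly one tied row gets rank 1.
first = ties.groupby("visitorid").timestamp.rank(ascending=False, method="first")

users_with_val_avg = ties.loc[avg == 1, "visitorid"].nunique()      # user 7 is lost
users_with_val_first = ties.loc[first == 1, "visitorid"].nunique()  # both users kept
```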

In [9]:
# keep one row per (user, item) pair; assigning the result avoids pandas' SettingWithCopyWarning,
# which the in-place call raises because train_df is a slice of rating_df
train_df = train_df.drop_duplicates(['visitorid', 'itemid'])


In [10]:
'''
Credit to https://gist.github.com/bwhite/3726239
'''
def mean_reciprocal_rank(rs):
    """Score is reciprocal of the rank of the first relevant item
    First element is 'rank 1'.  Relevance is binary (nonzero is relevant).
    Example from http://en.wikipedia.org/wiki/Mean_reciprocal_rank
    >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
    >>> mean_reciprocal_rank(rs)
    0.61111111111111105
    >>> rs = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 0]])
    >>> mean_reciprocal_rank(rs)
    0.5
    >>> rs = [[0, 0, 0, 1], [1, 0, 0], [1, 0, 0]]
    >>> mean_reciprocal_rank(rs)
    0.75
    Args:
        rs: Iterator of relevance scores (list or numpy) in rank order
            (first element is the first item)
    Returns:
        Mean reciprocal rank
    """
    rs = (np.asarray(r).nonzero()[0] for r in rs)
    return np.mean([1. / (r[0] + 1) if r.size else 0. for r in rs])


In [11]:
# popularity baseline: number of training interactions per item (0 for items unseen in training)
counts = np.zeros(num_items)
item2counts = train_df.itemid.value_counts()
counts[item2counts.index] = item2counts.values

In [12]:
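# rank all items by training popularity (most popular first); every user gets the same list
hits = []
rankings = np.flip(np.argsort(counts))
for val_gt in val_df.itemid.values:
    # boolean vector with a single True at the rank position of the held-out item
    hits.append(rankings == val_gt)
mean_reciprocal_rank(hits)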

0.02096631601316712
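Because each validation row has exactly one relevant item, the MRR of this popularity baseline is just the mean of 1 / rank of the held-out item in the popularity ordering. The sketch below computes that directly without building one boolean vector per user; `toy_counts` and `toy_val_items` are made-up stand-ins for `counts` and `val_df.itemid.values`.

```python
import numpy as np

# Hypothetical stand-ins for `counts` and `val_df.itemid.values`.
toy_counts = np.array([5.0, 1.0, 3.0, 0.0])  # popularity of items 0..3
toy_val_items = np.array([2, 0, 3])          # held-out item per user

# Item ids ordered from most to least popular (same ordering as np.flip(np.argsort(...))).
order = np.argsort(toy_counts)[::-1]

# Invert the ordering: rank_of_item[i] is the 1-based rank of item i.
rank_of_item = np.empty_like(order)
rank_of_item[order] = np.arange(1, len(order) + 1)

# Mean reciprocal rank of the held-out items.
mrr = np.mean(1.0 / rank_of_item[toy_val_items])
```

On the full data this avoids materializing a `num_users x num_items` list of boolean arrays, which may matter as the catalog grows.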