# Confidence-aware UBCF MultiEval Example

Much of this structure and organization is borrowed from the LensKit sample evaluation walkthrough.

## Setup

In [1]:
import sys # set the path of the locally installed lenskit_confidence module
sys.path.insert(0,'C:\\Users\\Name\\Documents\\GitHub\\lenskit_confidence') # Looks like this on my machine

In [2]:
from lenskit.metrics import predict
import pandas as pd
import matplotlib.pyplot as plt

from lenskit.batch_ca import MultiEval
from lenskit.algorithms_ca import user_knn_ca, Recommender # *not* user_knn
from lenskit import topn, datasets, batch_ca, util # *not* batch
from lenskit.datasets import MovieLens
from lenskit import crossfold as xf
from lenskit.crossfold import partition_users, SampleN

Setting up a progress bar...

In [3]:
from tqdm.notebook import tqdm_notebook as tqdm
tqdm.pandas()


Set up logging to the notebook...

In [4]:
util.log_to_notebook()

[   INFO] lenskit.util.log notebook logging configured


Pick a dataset to run...

In [5]:
data = MovieLens('../data/ml-1m')
#data = MovieLens('../data/ml-10m')
#data = MovieLens('../data/ml-20m')
#data = MovieLens('../data/jester') # after cleaning, Jester has the same format as the ML datasets, so the ML input function works

## Experiment

Run experiment and store output in the `my-eval` directory. 

We're not producing predictions; we're generating 10-item recommendation lists and setting up 4 workers.

In [6]:
eval = MultiEval('my-eval', predict = False, recommend = 10, eval_n_jobs = 4)

We'll use 5-fold CV, partitioning users and putting 5 ratings per user in the test set.  

In [7]:
pairs = list(partition_users(data.ratings, 5, SampleN(5)))

[   INFO] lenskit.crossfold partitioning 1000209 rows for 6040 users into 5 partitions
[   INFO] lenskit.crossfold fold 0: selecting test ratings
[   INFO] lenskit.crossfold fold 0: partitioning training data
[   INFO] lenskit.crossfold fold 1: selecting test ratings
[   INFO] lenskit.crossfold fold 1: partitioning training data
[   INFO] lenskit.crossfold fold 2: selecting test ratings
[   INFO] lenskit.crossfold fold 2: partitioning training data
[   INFO] lenskit.crossfold fold 3: selecting test ratings
[   INFO] lenskit.crossfold fold 3: partitioning training data
[   INFO] lenskit.crossfold fold 4: selecting test ratings
[   INFO] lenskit.crossfold fold 4: partitioning training data
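
To make the split concrete, here is a minimal pandas sketch of the idea behind `partition_users(data.ratings, 5, SampleN(5))`: users are divided into disjoint folds, a fixed number of each test user's ratings is held out, and every other rating stays in that fold's training set. The function `sketch_partition_users` and the toy frame are hypothetical and only illustrate the idea; this is not LensKit's implementation.

```python
import numpy as np
import pandas as pd

def sketch_partition_users(ratings, n_folds, n_test, seed=0):
    """Yield (train, test) frames, one pair per fold (illustrative only)."""
    rng = np.random.default_rng(seed)
    users = ratings['user'].unique()
    rng.shuffle(users)  # random assignment of users to folds
    for fold_users in np.array_split(users, n_folds):
        candidates = ratings[ratings['user'].isin(fold_users)]
        # shuffle the rows, then keep up to n_test rows per test user
        shuffled = candidates.sample(frac=1.0, random_state=seed)
        test = shuffled.groupby('user').head(n_test)
        # everything that is not held out is training data for this fold
        train = ratings.drop(test.index)
        yield train, test

# toy data: 10 users with 8 ratings each
toy = pd.DataFrame({
    'user': np.repeat(np.arange(1, 11), 8),
    'item': np.tile(np.arange(100, 108), 10),
    'rating': 3.0,
})
folds = list(sketch_partition_users(toy, n_folds=5, n_test=5))
for i, (train, test) in enumerate(folds):
    print(i, len(train), len(test), sorted(test['user'].unique()))
```

The same arithmetic shows up in the run log for this notebook: 6040 users over 5 folds gives 1208 test users per fold, and 1000209 - 1208 * 5 = 994169 training ratings.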


Add the dataset to MultiEval with `add_datasets`.

In [8]:
eval.add_datasets(pairs, name = 'ML1M') # give the added dataset a name

In [9]:
nhbr_range = [25] # just K=25 for this sample evaluation; a fuller sweep would be [10, 25, 50, 75]

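Add the algorithms to MultiEval with `add_algorithms`. The three confidence-aware UBCF (CUBCF) options are listed; they differ only in the `variance_estimator` setting. The outputs shown in this walkthrough come from the first option, `UserKNN-CA-Average`.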

In [10]:
eval.add_algorithms([user_knn_ca.UserUserCA(nnbrs = f, aggregate = 'average', 
                                            variance_estimator = 'standard-deviation-average') for f in nhbr_range], 
                    attrs = ['nnbrs'], name = 'UserKNN-CA-Average') 

In [None]:
eval.add_algorithms([user_knn_ca.UserUserCA(nnbrs = f, aggregate = 'average', 
                                            variance_estimator = 'standard-deviation-jackknife-average') for f in nhbr_range], 
                    attrs = ['nnbrs'], name = 'UserKNN-CA-JK-Average') 

In [None]:
eval.add_algorithms([user_knn_ca.UserUserCA(nnbrs = f, aggregate = 'average', 
                                            variance_estimator = 'standard-deviation-bootstrap-average') for f in nhbr_range], 
                    attrs = ['nnbrs'], name = 'UserKNN-CA-BS-Average') 
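
The estimator names above suggest three ways of attaching an uncertainty to a neighborhood average: a plain standard deviation, a jackknife, and a bootstrap. The sketch below shows the textbook versions of those three estimators for the standard error of a mean, applied to a made-up list of neighbor ratings. It is an assumption about what the names refer to, not code from `lenskit_confidence`; the function names and the sample ratings are hypothetical.

```python
import random
import statistics

def se_plain(xs):
    """Standard error of the mean from the sample standard deviation."""
    return statistics.stdev(xs) / len(xs) ** 0.5

def se_jackknife(xs):
    """Jackknife standard error: recompute the mean leaving out one value at a time."""
    n = len(xs)
    total = sum(xs)
    leave_one_out = [(total - x) / (n - 1) for x in xs]
    centre = sum(leave_one_out) / n
    return ((n - 1) / n * sum((m - centre) ** 2 for m in leave_one_out)) ** 0.5

def se_bootstrap(xs, n_resamples=2000, seed=0):
    """Bootstrap standard error: spread of the mean over resamples with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    means = [statistics.fmean(rng.choices(xs, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(means)

neighbor_ratings = [4.0, 3.5, 5.0, 2.0, 4.5]  # hypothetical ratings from 5 neighbors
plain = se_plain(neighbor_ratings)
jack = se_jackknife(neighbor_ratings)
boot = se_bootstrap(neighbor_ratings)
print(plain, jack, boot)
```

For a simple mean, the jackknife estimate coincides with the plain standard error, so any difference between these options in practice would have to come from how the library weights neighbors or defines the statistic.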

Run the experiment...

In [11]:
eval.run(progress = tqdm)

[   INFO] lenskit.batch_ca._multi_ca starting run 1: UserUserCA(nnbrs=25, min_sim=0) on ML1M:1
[   INFO] lenskit.batch_ca._multi_ca adapting UserUserCA(nnbrs=25, min_sim=0) into a recommender
[   INFO] lenskit.batch_ca._multi_ca training algorithm UserUserCA(nnbrs=25, min_sim=0) on 994169 ratings
[   INFO] lenskit.algorithms_ca.user_knn_ca calling fit in user_knn
[   INFO] lenskit.algorithms_ca.basic_ca trained unrated candidate selector for 994169 ratings
[   INFO] lenskit.batch_ca._multi_ca trained algorithm UserUserCA(nnbrs=25, min_sim=0) in 6.24s
[   INFO] lenskit.batch_ca._multi_ca generating recommendations for 1208 users for TopN/UserUserCA(nnbrs=25, min_sim=0)
[   INFO] lenskit.sharing.shm serialized TopN/UserUserCA(nnbrs=25, min_sim=0) to 1176 pickle bytes with 13 buffers of 28104104 bytes
[   INFO] lenskit.util.parallel setting up ProcessPoolExecutor w/ 4 workers
[   INFO] lenskit.batch_ca._recommend_ca  (2) recommending with TopN/UserUserCA(nnbrs=25, min_sim=0) for 1208 user

## Analyzing Results

We need to read in experiment outputs.

First the run metadata:

In [12]:
runs = pd.read_csv('my-eval/runs.csv')
runs.set_index('RunId', inplace = True)
runs.head() # a quick visual check

RunId,DataSet,Partition,AlgoClass,AlgoStr,name,nnbrs,TrainTime,PredTime,RecTime
1,ML1M,1,UserUserCA,"UserUserCA(nnbrs=25, min_sim=0)",UserKNN-CA-Average,25,6.239759,,40.693732
2,ML1M,2,UserUserCA,"UserUserCA(nnbrs=25, min_sim=0)",UserKNN-CA-Average,25,0.477325,,40.619449
3,ML1M,3,UserUserCA,"UserUserCA(nnbrs=25, min_sim=0)",UserKNN-CA-Average,25,0.471221,,40.531695
4,ML1M,4,UserUserCA,"UserUserCA(nnbrs=25, min_sim=0)",UserKNN-CA-Average,25,0.472944,,40.113802
5,ML1M,5,UserUserCA,"UserUserCA(nnbrs=25, min_sim=0)",UserKNN-CA-Average,25,0.465477,,42.940498


This describes each run: a combination of data set, partition, and algorithm. To evaluate, we need the actual recommendations, which we will later combine with this metadata:

In [13]:
recs = pd.read_parquet('my-eval/recommendations.parquet')
recs.head()

,item,prediction,user,var,num_nbhr,rank,RunId
0,1420,4.087044,1,0.245267,3.0,1,1
1,2503,2.259555,1,0.35264,5.0,2,1
2,2197,2.256934,1,0.320579,6.0,3,1
3,3245,1.948881,1,0.480832,3.0,4,1
4,3293,1.452391,1,0.518395,2.0,5,1


In [14]:
recs['score'] = recs['prediction']

In [15]:
recs = recs[['item', 'score', 'user','rank','RunId']]
recs.head()

,item,score,user,rank,RunId
0,1420,4.087044,1,1,1
1,2503,2.259555,1,2,1
2,2197,2.256934,1,3,1
3,3245,1.948881,1,4,1
4,3293,1.452391,1,5,1


Getting the predictions... (this cell is kept for posterity; we're not actually making predictions on the test set in this run)

In [None]:
#preds = pd.read_parquet('my-eval/predictions.parquet')
#preds

We're going to compute per-(run,user) evaluations of the recommendations *before* combining with metadata. 

In order to evaluate the recommendation list, we need to build a combined set of truth data. Since this is a disjoint partition of users over a single data set, we can just concatenate the individual test frames:

In [16]:
truth = pd.concat((p.test for p in pairs), ignore_index = True)
truth.head()

,Unnamed: 0,user,item,rating,timestamp
0,16,1,2791,4.0,978302188
1,27,1,1097,4.0,978301953
2,23,1,1270,5.0,978300055
3,35,1,1907,4.0,978824330
4,46,1,1028,5.0,978301777
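
The concatenation is only valid because each user's test ratings live in exactly one fold. A small sketch of that check on toy frames follows; `fold_tests` is a hypothetical stand-in for `[p.test for p in pairs]`, and the values are made up.

```python
import pandas as pd

# hypothetical per-fold test frames
fold_tests = [
    pd.DataFrame({'user': [1, 1, 2], 'item': [10, 11, 10], 'rating': [4.0, 5.0, 3.0]}),
    pd.DataFrame({'user': [3, 3], 'item': [12, 10], 'rating': [2.0, 4.0]}),
]

# one combined truth frame with a fresh 0..n-1 index
combined = pd.concat(fold_tests, ignore_index=True)

# count how many folds each user appears in; disjoint folds give exactly 1
folds_per_user = pd.concat(
    [t[['user']].drop_duplicates().assign(fold=i) for i, t in enumerate(fold_tests)]
).groupby('user')['fold'].nunique()
users_disjoint = bool((folds_per_user == 1).all())
print(users_disjoint, len(combined))
```

If a user appeared in more than one fold, the combined truth frame would mix test ratings from different runs for that user, and a per-fold truth lookup would be needed instead.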


In [None]:
truth.to_csv('my-eval/truth.csv', index = False) # save truth values for future evaluation; index = False avoids an extra unnamed column on reload
#truth = pd.read_csv('my-eval/truth.csv')

In [17]:
truth = truth[['user', 'item', 'rating']] # just grabbing what we need

In [18]:
truth.head() # a visual check

,user,item,rating
0,1,2791,4.0
1,1,1097,4.0
2,1,1270,5.0
3,1,1907,4.0
4,1,1028,5.0


Now we can set up an analysis and compute the results.

In [19]:
rla = topn.RecListAnalysis()
rla.add_metric(topn.ndcg) # precision, recall, recip_rank, dcg, ndcg
rla.add_metric(topn.precision)
topn_compute = rla.compute(recs, truth)
topn_compute.head()

[   INFO] lenskit.topn analyzing 60400 recommendations (30200 truth rows)
[   INFO] lenskit.topn using rec key columns ['RunId', 'user']
[   INFO] lenskit.topn using truth key columns ['user']
[   INFO] lenskit.topn collecting truth data
[   INFO] lenskit.topn collecting metric results
[   INFO] lenskit.sharing.shm serialized <lenskit.topn._RLAJob object at 0x000001BC1EAEF1C0> to 1474960 pickle bytes with 12083 buffers of 2416000 bytes
[   INFO] lenskit.util.parallel setting up ProcessPoolExecutor w/ 2 workers
[   INFO] lenskit.topn measured 6040 lists in 19.53s


RunId,user,nrecs,ndcg,precision
1,1,10.0,0.0,0.0
1,3,10.0,0.0,0.0
1,4,10.0,0.0,0.0
1,8,10.0,0.0,0.0
1,22,10.0,0.0,0.0
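
To see why a row can come out as all zeros, it helps to compute the two metrics by hand for one list. The sketch below uses the common textbook definitions (precision as hits over list length, nDCG with test ratings as gains and a log2(rank + 1) discount). This is an assumption for illustration: LensKit's own discounting may differ in detail, and the helper names are hypothetical. The item IDs are user 1's first five recommendations and test items from the outputs above.

```python
import math

def precision_at_k(recommended, relevant):
    """Fraction of recommended items that appear in the test set."""
    if not recommended:
        return 0.0
    hits = sum(1 for item in recommended if item in relevant)
    return hits / len(recommended)

def ndcg(recommended, test_ratings):
    """DCG of the list divided by the DCG of the best possible ordering."""
    dcg = sum(test_ratings.get(item, 0.0) / math.log2(rank + 1)
              for rank, item in enumerate(recommended, start=1))
    ideal_gains = sorted(test_ratings.values(), reverse=True)[:len(recommended)]
    idcg = sum(gain / math.log2(rank + 1)
               for rank, gain in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# user 1: first five recommended items and the five held-out test ratings
user1_recs = [1420, 2503, 2197, 3245, 3293]
user1_truth = {2791: 4.0, 1097: 4.0, 1270: 5.0, 1907: 4.0, 1028: 5.0}

p_real = precision_at_k(user1_recs, user1_truth)
n_real = ndcg(user1_recs, user1_truth)

# hypothetical list with one test item placed at rank 1, for contrast
with_hit = [1270, 2503, 2197, 3245, 3293]
p_hit = precision_at_k(with_hit, user1_truth)
n_hit = ndcg(with_hit, user1_truth)
print(p_real, n_real, p_hit, n_hit)
```

None of user 1's listed recommendations is among the five test items, so both metrics are zero. With only 5 test ratings per user drawn from a large catalog, zeros for most users are expected.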


Next, we need to combine this with our run data, so that we know what algorithms and configurations we are evaluating:

In [20]:
topn_results = topn_compute.join(runs[['name', 'nnbrs']], on = 'RunId')
topn_results.head()

RunId,user,nrecs,ndcg,precision,name,nnbrs
1,1,10.0,0.0,0.0,UserKNN-CA-Average,25
1,3,10.0,0.0,0.0,UserKNN-CA-Average,25
1,4,10.0,0.0,0.0,UserKNN-CA-Average,25
1,8,10.0,0.0,0.0,UserKNN-CA-Average,25
1,22,10.0,0.0,0.0,UserKNN-CA-Average,25
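
The join works because `on` in `DataFrame.join` can name an index level of the calling frame, which is then matched against the index of the other frame. A toy sketch with made-up values (the frames `per_user` and `run_info` and the algorithm names are hypothetical):

```python
import pandas as pd

# per-(run, user) metrics with a two-level index, like the analysis output
per_user = pd.DataFrame(
    {'ndcg': [0.0, 0.5, 0.25]},
    index=pd.MultiIndex.from_tuples([(1, 1), (1, 3), (2, 2)], names=['RunId', 'user']),
)

# run metadata indexed by RunId, like the runs frame
run_info = pd.DataFrame(
    {'name': ['algo-a', 'algo-b'], 'nnbrs': [25, 25]},
    index=pd.Index([1, 2], name='RunId'),
)

# match the 'RunId' level of per_user against run_info's index
joined = per_user.join(run_info, on='RunId')
print(joined)
```

Every per-user row picks up the metadata of its run, and the (RunId, user) index is kept.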


We can compute the overall average performance for each algorithm configuration:

In [21]:
topn_results.fillna(0).groupby(['name', 'nnbrs'])[['ndcg', 'precision']].mean() # select columns with a list; tuple-style selection is deprecated


name,nnbrs,ndcg,precision
UserKNN-CA-Average,25,0.007986,0.005596
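
The `fillna(0)` before the group mean matters: by default pandas skips missing values when averaging, so a user with no metric value would drop out of the average instead of counting as a zero. A toy sketch of the difference, using made-up numbers and hypothetical algorithm names:

```python
import numpy as np
import pandas as pd

results = pd.DataFrame({
    'name': ['algo-a', 'algo-a', 'algo-a', 'algo-b'],
    'nnbrs': [25, 25, 25, 25],
    'ndcg': [0.5, np.nan, 0.1, 0.2],       # one user has no metric value
    'precision': [0.2, np.nan, 0.0, 0.1],
})

cols = ['ndcg', 'precision']
# missing rows are ignored: algo-a is averaged over 2 users
skip_missing = results.groupby(['name', 'nnbrs'])[cols].mean()
# missing rows count as 0: algo-a is averaged over all 3 users
zero_missing = results.fillna(0).groupby(['name', 'nnbrs'])[cols].mean()

print(skip_missing)
print(zero_missing)
```

Treating missing values as zero is the more conservative choice when comparing algorithms, since an algorithm cannot look better simply by failing to produce results for hard users.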
