**Context**

Jester is a joke recommender system developed at UC Berkeley to study social information filtering. Users of the system are presented a joke and then they rate them. This dataset is a collection of those ratings.

http://eigentaste.berkeley.edu/

Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.

**Content**

Notes from the source:

Each row is a user (Row 1 = User #1)

Each column is a joke (Column 1 = Joke #1)

Ratings are given as real values from -10.00 to +10.00

99 corresponds to a null rating

As of May 2009, the jokes 7, 8, 13, 15, 16, 17, 18, 19 are the "gauge set" (as discussed in the Eigentaste paper)

In [18]:
import pandas as pd
from sklearn.model_selection import train_test_split
from scipy.sparse.linalg import svds
import numpy as np

In [24]:
jokes_df = pd.read_csv('jester_items.tsv', sep=":\t", header=None, engine='python').rename(columns={0: "jokeID", 1: "title"})
jokes_df.head()

Unnamed: 0,jokeID,title
0,1,"A man visits the doctor. The doctor says, ""I h..."
1,2,This couple had an excellent relationship goin...
2,3,Q. What's 200 feet long and has 4 teeth? A. Th...
3,4,Q. What's the difference between a man and a t...
4,5,Q. What's O. J. Simpson's web address? A. Slas...


In [2]:
columns = ['userID'] + range(1,151)
df = pd.read_csv('jesterfinal151cols.csv', header = None, names = columns)
print df.describe()
df.head(10)

             userID        1        2        3        4             5  \
count  50692.000000  50692.0  50692.0  50692.0  50692.0  50692.000000   
mean      34.104967     99.0     99.0     99.0     99.0     97.871901   
std       33.519225      0.0      0.0      0.0      0.0     10.631768   
min        8.000000     99.0     99.0     99.0     99.0    -10.000000   
25%       11.000000     99.0     99.0     99.0     99.0     99.000000   
50%       20.000000     99.0     99.0     99.0     99.0     99.000000   
75%       42.000000     99.0     99.0     99.0     99.0     99.000000   
max      140.000000     99.0     99.0     99.0     99.0     99.000000   

             6             7             8        9     ...       \
count  50692.0  50692.000000  50692.000000  50692.0     ...        
mean      99.0     -1.952510     -0.716500     99.0     ...        
std        0.0      5.370893      5.153371      0.0     ...        
min       99.0    -10.000000    -10.000000     99.0     ...        
25

Unnamed: 0,userID,1,2,3,4,5,6,7,8,9,...,141,142,143,144,145,146,147,148,149,150
0,62,99,99,99,99,0.21875,99,-9.28125,-9.28125,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
1,34,99,99,99,99,-9.6875,99,9.9375,9.53125,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
2,18,99,99,99,99,-9.84375,99,-9.84375,-7.21875,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
3,82,99,99,99,99,6.90625,99,4.75,-5.90625,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
4,27,99,99,99,99,-0.03125,99,-9.09375,-0.40625,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
5,46,99,99,99,99,-2.90625,99,-2.34375,-0.5,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
6,99,99,99,99,99,6.21875,99,-7.4375,-0.8125,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
7,15,99,99,99,99,8.25,99,9.0,8.875,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
8,104,99,99,99,99,-5.75,99,0.28125,0.78125,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
9,24,99,99,99,99,-7.15625,99,-5.90625,-0.09375,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0


In [3]:
df = pd.melt(df, id_vars=['userID'], value_name = 'rating', var_name = 'jokeID')
df.head(10)

Unnamed: 0,userID,jokeID,rating
0,62,1,99.0
1,34,1,99.0
2,18,1,99.0
3,82,1,99.0
4,27,1,99.0
5,46,1,99.0
6,99,1,99.0
7,15,1,99.0
8,104,1,99.0
9,24,1,99.0


In [5]:
df = df[df.rating != 99]
df = df[df.rating > 0]
df.head(10)

Unnamed: 0,userID,jokeID,rating
202768,62,5,0.21875
202771,82,5,6.90625
202774,99,5,6.21875
202775,15,5,8.25
202781,109,5,0.46875
202782,42,5,6.28125
202785,77,5,6.3125
202787,16,5,4.28125
202789,16,5,5.125
202790,16,5,1.84375


In [7]:
interactions_train_df, interactions_test_df = train_test_split(df,
                                   stratify=df['userID'], 
                                   test_size=0.20)

In [8]:
interactions_full_indexed_df = df.set_index('userID')
interactions_train_indexed_df = interactions_train_df.set_index('userID')
interactions_test_indexed_df = interactions_test_df.set_index('userID')

def get_items_interacted(person_id, interactions_df):
    interacted_items = interactions_df.loc[person_id]['jokeID']
    return set(interacted_items if type(interacted_items) == pd.Series else [interacted_items])

In [9]:
#Top-N accuracy metrics consts
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100

class ModelEvaluator:


    def get_not_interacted_items_sample(self, person_id, sample_size, seed=42):
        interacted_items = get_items_interacted(person_id, interactions_full_indexed_df)
        all_items = set(articles_df['jokeID'])
        non_interacted_items = all_items - interacted_items

        random.seed(seed)
        non_interacted_items_sample = random.sample(non_interacted_items, sample_size)
        return set(non_interacted_items_sample)

    def _verify_hit_top_n(self, item_id, recommended_items, topn):        
            try:
                index = next(i for i, c in enumerate(recommended_items) if c == item_id)
            except:
                index = -1
            hit = int(index in range(0, topn))
            return hit, index

    def evaluate_model_for_user(self, model, person_id):
        #Getting the items in test set
        interacted_values_testset = interactions_test_indexed_df.loc[person_id]
        if type(interacted_values_testset['jokeID']) == pd.Series:
            person_interacted_items_testset = set(interacted_values_testset['jokeID'])
        else:
            person_interacted_items_testset = set([int(interacted_values_testset['jokeID'])])  
        interacted_items_count_testset = len(person_interacted_items_testset) 

        #Getting a ranked recommendation list from a model for a given user
        person_recs_df = model.recommend_items(person_id, 
                                               items_to_ignore=get_items_interacted(person_id, 
                                                                                    interactions_train_indexed_df), 
                                               topn=10000000000)

        hits_at_5_count = 0
        hits_at_10_count = 0
        #For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:
            #Getting a random sample (100) items the user has not interacted 
            #(to represent items that are assumed to be no relevant to the user)
            non_interacted_items_sample = self.get_not_interacted_items_sample(person_id, 
                                                                          sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS, 
                                                                          seed=item_id%(2**32))

            #Combining the current interacted item with the 100 random items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))

            #Filtering only recommendations that are either the interacted item or from a random sample of 100 non-interacted items
            valid_recs_df = person_recs_df[person_recs_df['jokeID'].isin(items_to_filter_recs)]                    
            valid_recs = valid_recs_df['jokeID'].values
            #Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        #Recall is the rate of the interacted items that are ranked among the Top-N recommended items, 
        #when mixed with a set of non-relevant items
        recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
        recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)

        person_metrics = {'hits@5_count':hits_at_5_count, 
                          'hits@10_count':hits_at_10_count, 
                          'interacted_count': interacted_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return person_metrics

    def evaluate_model(self, model):
        #print('Running evaluation for users')
        people_metrics = []
        for idx, person_id in enumerate(list(interactions_test_indexed_df.index.unique().values)):
            #if idx % 100 == 0 and idx > 0:
            #    print('%d users processed' % idx)
            person_metrics = self.evaluate_model_for_user(model, person_id)  
            person_metrics['_person_id'] = person_id
            people_metrics.append(person_metrics)
        print('%d users processed' % idx)

        detailed_results_df = pd.DataFrame(people_metrics) \
                            .sort_values('interacted_count', ascending=False)
        
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluator()

In [13]:
#Creating a sparse pivot table with users in rows and items in columns
users_items_pivot_matrix_df = interactions_train_df.pivot_table(index='userID', 
                                                          columns='jokeID', 
                                                          values='rating',
                                                         aggfunc='mean').fillna(0)

users_items_pivot_matrix_df.head(10)

jokeID,5,7,8,13,15,16,17,18,19,20,...,141,142,143,144,145,146,147,148,149,150
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8,0.0,4.104444,4.129884,4.433224,4.327764,4.354687,4.653304,4.220487,4.708026,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,7.03125,3.975487,4.063128,4.549107,4.308993,4.317131,4.525802,3.971128,4.658414,0.0,...,0.0,7.055147,2.554688,0.0,0.0,0.0,3.484375,0.0,0.0,4.748437
10,7.140625,3.83644,3.773253,4.12771,3.968697,4.290208,4.53132,4.019003,4.694113,5.578125,...,0.0,5.958705,4.210938,4.570312,4.638021,8.15625,3.203125,4.008333,0.0,3.464221
11,4.3125,3.952542,4.014002,3.989392,3.992874,4.309546,4.513327,4.019537,4.673901,3.15625,...,0.0,6.189904,4.983173,5.679688,2.774038,5.240625,5.139205,4.771875,0.0,3.800781
12,4.375,3.908478,3.913229,4.281044,3.978414,4.034483,4.49183,3.948389,4.574033,9.09375,...,6.671875,6.28125,4.295312,3.477083,3.351562,6.710938,5.28125,4.387957,5.78125,3.99268
13,6.979167,4.10368,4.112755,4.250672,4.111661,4.265763,4.59375,4.356534,4.70509,8.3125,...,5.616071,6.327415,4.935811,4.271552,4.025391,4.484375,4.029514,4.544922,5.626488,4.388232
14,8.9375,3.79684,3.722679,4.235634,4.089968,4.146081,4.506911,3.877131,4.637746,5.072917,...,4.84375,6.318182,4.1875,4.49375,3.673491,4.353365,4.615385,5.087302,5.970395,4.445542
15,4.310096,3.617595,3.717233,4.235573,3.808732,4.025243,4.414968,3.930654,4.466289,4.223958,...,2.3125,6.044207,4.936035,4.871528,4.109954,6.859375,4.588235,5.037146,5.217548,4.407835
16,4.716346,3.655645,3.844903,4.140232,3.845301,3.913837,4.38119,3.943317,4.289853,4.339286,...,3.5,5.519622,4.490057,4.394097,4.999349,4.684028,4.878125,4.079688,5.482639,4.358789
17,3.984375,3.719882,3.817364,4.15049,4.030045,3.980837,4.417619,3.84319,4.401861,3.510417,...,6.645833,6.68892,5.232449,3.998355,4.506434,5.375,4.647321,4.772727,5.190341,4.763057


In [11]:
df = df[df.duplicated(subset=['userID','jokeID'], keep=False)]
print (df)

         userID jokeID   rating
202768       62      5  0.21875
202774       99      5  6.21875
202775       15      5  8.25000
202781      109      5  0.46875
202782       42      5  6.28125
202787       16      5  4.28125
202789       16      5  5.12500
202790       16      5  1.84375
202796       80      5  9.68750
202802       49      5  9.43750
202804       23      5  2.56250
202808      113      5  3.06250
202810      110      5  2.21875
202814       17      5  5.15625
202815       50      5  7.68750
202816      107      5  1.56250
202817      120      5  9.87500
202818       22      5  4.25000
202819       14      5  4.25000
202820       80      5  0.50000
202824       17      5  3.25000
202825       16      5  0.46875
202826       45      5  5.87500
202828       64      5  4.56250
202829       16      5  6.59375
202831       13      5  3.18750
202835      100      5  6.28125
202838       15      5  7.28125
202839       99      5  1.87500
202845       37      5  1.31250
...     

In [14]:
users_items_pivot_matrix = users_items_pivot_matrix_df.as_matrix()
users_items_pivot_matrix[:10]

array([[ 0.        ,  4.10444361,  4.12988351, ...,  0.        ,
         0.        ,  0.        ],
       [ 7.03125   ,  3.97548715,  4.06312751, ...,  0.        ,
         0.        ,  4.7484375 ],
       [ 7.140625  ,  3.83643973,  3.77325259, ...,  4.00833333,
         0.        ,  3.46422101],
       ..., 
       [ 4.31009615,  3.61759479,  3.71723301, ...,  5.03714623,
         5.21754808,  4.40783514],
       [ 4.71634615,  3.65564467,  3.84490266, ...,  4.0796875 ,
         5.48263889,  4.35878906],
       [ 3.984375  ,  3.71988162,  3.81736407, ...,  4.77272727,
         5.19034091,  4.76305651]])

In [19]:
users_ids = list(users_items_pivot_matrix_df.index)
print users_ids[:10]

#The number of factors to factor the user-item matrix.
NUMBER_OF_FACTORS_MF = 15
#Performs matrix factorization of the original user item matrix
U, sigma, Vt = svds(users_items_pivot_matrix, k = NUMBER_OF_FACTORS_MF)
print U.shape
print Vt.shape
sigma = np.diag(sigma)
print sigma.shape 

all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 
print all_user_predicted_ratings

#Converting the reconstructed matrix back to a Pandas dataframe
cf_preds_df = pd.DataFrame(all_user_predicted_ratings, columns = users_items_pivot_matrix_df.columns, index=users_ids).transpose()
print cf_preds_df.head(10)

print len(cf_preds_df.columns)

[8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
(126, 15)
(15, 140)
(15, 15)
[[ 0.39166909  0.42585897  0.39203517 ...,  0.37093686 -0.18591152
   0.2277466 ]
 [ 8.25441392  3.81925797  4.51311494 ...,  0.43018615  0.81182151
   4.27239485]
 [ 7.48302067  3.65235744  3.82701477 ...,  4.86522893  0.15322443
   3.46686119]
 ..., 
 [ 6.99277362 -0.18577269  8.28239852 ...,  0.47613389  6.90427152
   8.45238079]
 [ 4.57452964  4.47041505  2.85416471 ...,  5.37586645  7.35409725
   7.80207687]
 [ 2.7874183   3.42704414  4.10658287 ...,  5.3829437   5.0577344
   5.49176757]]
             8         9         10        11        12        13        14   \
jokeID                                                                         
5       0.391669  8.254414  7.483021  5.359847  7.694647  5.993439  5.147305   
7       0.425859  3.819258  3.652357  4.468445  4.037808  3.890954  3.867969   
8       0.392035  4.513115  3.827015  3.296042  3.564325  4.230975  3.650606   
13      0.792211  4.977769  5.226

In [21]:
class CFRecommender:
    
    MODEL_NAME = 'Collaborative Filtering'
    
    def __init__(self, cf_predictions_df, items_df=None):
        self.cf_predictions_df = cf_predictions_df
        #self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        # Get and sort the user's predictions
        sorted_user_predictions = self.cf_predictions_df[user_id].sort_values(ascending=False) \
                                    .reset_index().rename(columns={user_id: 'recStrength'})

        # Recommend the highest predicted rating movies that the user hasn't seen yet.
        recommendations_df = sorted_user_predictions[~sorted_user_predictions['jokeID'].isin(items_to_ignore)] \
                               .sort_values('recStrength', ascending = False) \
                               .head(topn)
        return recommendations_df
    
cf_recommender_model = CFRecommender(cf_preds_df)

In [22]:
'''
        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'jokeID', 
                                                          right_on = 'jokeID')[['recStrength', 'jokeID', 'title', 'url', 'lang']]
'''

'\n        if verbose:\n            if self.items_df is None:\n                raise Exception(\'"items_df" is required in verbose mode\')\n\n            recommendations_df = recommendations_df.merge(self.items_df, how = \'left\', \n                                                          left_on = \'jokeID\', \n                                                          right_on = \'jokeID\')[[\'recStrength\', \'jokeID\', \'title\', \'url\', \'lang\']]\n'

In [23]:
print('Evaluating Collaborative Filtering (SVD Matrix Factorization) model...')
cf_global_metrics, cf_detailed_results_df = model_evaluator.evaluate_model(cf_recommender_model)
print('\nGlobal metrics:\n%s' % cf_global_metrics)
cf_detailed_results_df.head(10)

Evaluating Collaborative Filtering (SVD Matrix Factorization) model...


NameError: global name 'articles_df' is not defined