# Unit 4: Neighborhood-based Collaborative Filtering for Rating Prediction

In this section we generate personalized recommendations for the first time. We exploit rating similarities among users and items to find similar users and items, which in turn help us identify the relevant items to recommend to each user.

This is the fundamental idea behind Collaborative Filtering (CF). Using kNN is a so-called neighborhood-based approach to solving it. In a later unit we will also look at model-based approaches to Collaborative Filtering.

This is also the first time we predict user ratings for unknown items. We then take the top-$N$ items with the highest predicted ratings and recommend those to the user.

In [70]:
%load_ext autoreload
%autoreload 2

from collections import OrderedDict

from typing import Dict, List, Tuple
import itertools
import os
import sys
import math

import numpy as np
import scipy as sp
import sklearn

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import seaborn as sns
sns.set_context("poster")
sns.set(rc={'figure.figsize': (16, 9.)})
sns.set_style("whitegrid")

import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)


In [42]:
from recsys_training.data import Dataset
from recsys_training.evaluation import get_relevant_items

In [43]:
ml100k_ratings_filepath = '../data/raw/ml-100k/u.data'

## Load Data

In [44]:
data = Dataset(ml100k_ratings_filepath)
data.rating_split(seed=42)

The idea behind this recommender is to use the item ratings of the $k$ most similar users (neighbors). Similarity is measured with a metric applied to the ratings that the root user and a neighboring user have in common.

Thus, the steps are as follows:

1. Compute user-user similarities (we use the Pearson Correlation Coefficient here)

2. For each user:

    1. Get the $k$ nearest neighbors along with their similarities
    2. Collect the neighborhood item ratings and ignore those already rated by the root user
    3. Item Rating Prediction: Compute the similarity-weighted sum of neighborhood item ratings
    4. Recommendations: Get the $N$ items with the highest ratings that have a minimum rating count
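
As a compact restatement of steps 1 and 2.3: let $I_{uv}$ be the items rated by both users $u$ and $v$, $\bar{r}_u$ and $\bar{r}_v$ their mean ratings over $I_{uv}$, and $N_k(u, i)$ those of the $k$ nearest neighbors of $u$ that have rated item $i$. These symbols are notation introduced here for illustration; they do not appear in the code.

$$
\mathrm{sim}(u, v) = \frac{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)^2} \, \sqrt{\sum_{i \in I_{uv}} (r_{vi} - \bar{r}_v)^2}}
$$

$$
\hat{r}_{ui} = \frac{\sum_{v \in N_k(u, i)} \mathrm{sim}(u, v) \, r_{vi}}{\sum_{v \in N_k(u, i)} \mathrm{sim}(u, v)}
$$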

### User-User Similarities

In [45]:
sim_metric = 'pearson'
user_ratings = {}
user_user_sims = {}

In [46]:
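def get_entity_sim(a: int, b: int, entity_ratings: Dict[int, Dict[int, float]], metric: str = 'pearson'):
    """
    Compute the similarity between entities `a` and `b` based on their joint ratings.

    Supported metrics: 'pearson' (Pearson correlation), 'cosine' (cosine similarity)
    and 'euclidean'; 'adj_cosine' (adjusted cosine similarity) is not implemented yet.

    Jaccard similarity (intersection over union) is not a good idea as it ignores the rating values:
    even if the same users have rated two items (high Jaccard similarity suggesting high item similarity),
    their judgements of the two items may differ strongly, which would justify dissimilarity.

    Returns a tuple of (similarity or None, number of joint ratings).
    """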
    # 1. isolate e.g. users that have rated both items (a and b)
    key_intersection = set(entity_ratings[a].keys()).intersection(entity_ratings[b].keys())
    ratings = np.array([(entity_ratings[a][key], entity_ratings[b][key]) for key in key_intersection])
    n_joint_ratings = len(ratings)

    if n_joint_ratings > 1:
        # 2. apply a similarity computation technique
        if metric == 'pearson':
            # emits a RuntimeWarning and yields nan if one entity's joint ratings have zero variance
            sim = np.corrcoef(ratings, rowvar=False)[0, 1]
        elif metric == 'cosine':
            nom = ratings[:, 0].dot(ratings[:, 1])
            denom = np.linalg.norm(ratings[:, 0]) * np.linalg.norm(ratings[:, 1])
            sim = nom / denom
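        elif metric == 'euclidean':
            # relies on a helper normalized_euclidean_sim that has to be defined
            # before calling this function with metric='euclidean'
            sim = normalized_euclidean_sim(ratings[:, 0], ratings[:, 1])
        elif metric == 'adj_cosine':
            # adjusted cosine similarity is not implemented yet
            sim = None
        else:
            raise ValueError(f"Value {metric} for argument 'metric' not supported.")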
    else:
        sim = None

    return sim, n_joint_ratings
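
The `'euclidean'` branch calls `normalized_euclidean_sim`, which none of the cells here define. A minimal sketch of such a helper could look as follows. Both the normalization (root-mean-square difference) and the mapping to a similarity (`1 / (1 + distance)`) are assumptions made for illustration, not necessarily what the training material intends.

```python
import numpy as np


def normalized_euclidean_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Hypothetical helper: turn a Euclidean distance into a similarity in (0, 1]."""
    # Root-mean-square difference, so the distance does not grow
    # with the number of joint ratings (assumed normalization)
    dist = np.sqrt(np.mean((a - b) ** 2))
    # Identical rating vectors give 1.0; larger distances approach 0
    return float(1.0 / (1.0 + dist))
```

With this definition, identical rating vectors yield a similarity of 1.0, and the value shrinks as the ratings diverge.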

In [47]:
# build user rating maps
grouped = data.train_ratings[['user', 'item', 'rating']].groupby('user')
for user in data.users:
    vals = grouped.get_group(user)[['item', 'rating']].values
    user_ratings[user] = dict(zip(vals[:, 0].astype(int),
                                  vals[:, 1].astype(float)))

In [49]:
user_pairs = itertools.combinations(data.users, 2)

In [50]:
for pair in user_pairs:
    user_user_sims[pair] = get_entity_sim(pair[0], pair[1], user_ratings, sim_metric)

  c /= stddev[:, None]
  c /= stddev[None, :]


In [52]:
user_user_sims[(1,4)]

(0.9759000729485333, 5)

### Getting Recommendations

#### Implement Nearest Neighbors for a given user

For a given user id we would like to get a list of up to $k$ elements, where each element is a tuple of a neighboring user id and the similarity between the root user and that neighbor.

In [53]:
def get_k_nearest_neighbors(user: int, k: int, user_user_sims: dict) -> List[Tuple[int, float]]:
    neighbors = set(data.users)
    neighbors.remove(user)

    nearest_neighbors = dict()
    for neighbor in neighbors:
        sim = user_user_sims[tuple(sorted((user, neighbor)))][0]
        if pd.notnull(sim):
            nearest_neighbors[neighbor] = sim

    nearest_neighbors = sorted(nearest_neighbors.items(),
                               key=lambda kv: kv[1],
                               reverse=True)
    
    return nearest_neighbors[:k]

In [61]:
user_neighbors = get_k_nearest_neighbors(1, k=10, user_user_sims=user_user_sims)

In [62]:
user_neighbors

[(107, 1.0),
 (443, 1.0),
 (485, 1.0),
 (687, 1.0),
 (791, 1.0),
 (820, 1.0),
 (34, 0.9999999999999999),
 (240, 0.9999999999999999),
 (281, 0.9999999999999999),
 (384, 0.9999999999999999)]
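
Many of these neighbors have a similarity of (almost) exactly 1.0. One plausible reason, not verified against the data here, is a very small number of joint ratings: the Pearson correlation over just two joint ratings is always +1 or -1, provided neither user gave the same rating twice. The sketch below shows this with made-up ratings.

```python
import numpy as np

# Two users sharing exactly two items (ratings are made up for illustration);
# each row is one joint item: (rating of user A, rating of user B)
joint_same_direction = np.array([(4.0, 2.0),
                                 (5.0, 3.0)])
sim_same = np.corrcoef(joint_same_direction, rowvar=False)[0, 1]

joint_opposite_direction = np.array([(4.0, 3.0),
                                     (5.0, 1.0)])
sim_opposite = np.corrcoef(joint_opposite_direction, rowvar=False)[0, 1]

# sim_same is +1 and sim_opposite is -1, up to floating point error
print(sim_same, sim_opposite)
```

Since `get_entity_sim` also returns the number of joint ratings, that count could be used to filter out or down-weight such neighbors.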

### Obtain the Neighborhood Ratings

In [90]:
def get_neighborhood_ratings(user, user_neighbors: List[Tuple[int, float]]) -> Dict[int, List[Dict[str, float]]]:
    neighborhood_ratings = {}
    for neighbor, sim in user_neighbors:
        neighbor_ratings = user_ratings[neighbor].copy()
        
        # collect neighbor ratings and items
        for item, rating in neighbor_ratings.items():
            add_item = {'sim': sim, 'rating': rating}
            if item not in neighborhood_ratings.keys():
                neighborhood_ratings[item] = [add_item]
            else:
                neighborhood_ratings[item].append(add_item)
        
    # remove known items
    known_items = list(user_ratings[user].keys())
    for known_item in known_items:
        neighborhood_ratings.pop(known_item, None)
    
    return neighborhood_ratings

In [91]:
neighborhood_ratings = get_neighborhood_ratings(1, user_neighbors)

In [92]:
neighborhood_ratings

{340: [{'sim': 1.0, 'rating': 5.0},
  {'sim': 1.0, 'rating': 5.0},
  {'sim': 0.9999999999999999, 'rating': 4.0}],
 325: [{'sim': 1.0, 'rating': 3.0}],
 288: [{'sim': 1.0, 'rating': 3.0},
  {'sim': 1.0, 'rating': 3.0},
  {'sim': 1.0, 'rating': 4.0},
  {'sim': 1.0, 'rating': 3.0},
  {'sim': 1.0, 'rating': 5.0},
  {'sim': 0.9999999999999999, 'rating': 5.0}],
 312: [{'sim': 1.0, 'rating': 4.0},
  {'sim': 0.9999999999999999, 'rating': 4.0}],
 269: [{'sim': 1.0, 'rating': 5.0},
  {'sim': 1.0, 'rating': 4.0},
  {'sim': 1.0, 'rating': 4.0}],
 313: [{'sim': 1.0, 'rating': 2.0},
  {'sim': 1.0, 'rating': 4.0},
  {'sim': 1.0, 'rating': 5.0},
  {'sim': 1.0, 'rating': 5.0},
  {'sim': 0.9999999999999999, 'rating': 5.0},
  {'sim': 0.9999999999999999, 'rating': 5.0}],
 300: [{'sim': 1.0, 'rating': 1.0},
  {'sim': 0.9999999999999999, 'rating': 3.0},
  {'sim': 0.9999999999999999, 'rating': 4.0},
  {'sim': 0.9999999999999999, 'rating': 4.0}],
 264: [{'sim': 1.0, 'rating': 3.0},
  {'sim': 1.0, 'rating': 3.

### Compute Rating Predictions from Neighborhood Ratings

In [93]:
def compute_rating_pred(neighborhood_ratings: dict) -> dict:
    rating_preds = dict()
    for item, ratings in neighborhood_ratings.items():
        if len(ratings) > 0:
            sims = np.array([rating['sim'] for rating in ratings])
            ratings = np.array([rating['rating'] for rating in ratings])
            pred_rating = (sims * ratings).sum() / sims.sum()
            count = len(sims)
            rating_preds[item] = {'pred': pred_rating,
                                  'count': count}
        else:
            rating_preds[item] = {'pred': None, 'count': 0}

    return rating_preds
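
As a worked example, the similarity-weighted average can be reproduced by hand for item 340, using the three neighborhood ratings shown in the output above. The variable names `ratings_340` and `pred_340` are introduced here for illustration only.

```python
import numpy as np

# Neighborhood ratings for item 340, copied from the output above
ratings_340 = [{'sim': 1.0, 'rating': 5.0},
               {'sim': 1.0, 'rating': 5.0},
               {'sim': 0.9999999999999999, 'rating': 4.0}]

sims = np.array([entry['sim'] for entry in ratings_340])
ratings = np.array([entry['rating'] for entry in ratings_340])

# (1.0 * 5 + 1.0 * 5 + ~1.0 * 4) / (1.0 + 1.0 + ~1.0), roughly 14 / 3
pred_340 = (sims * ratings).sum() / sims.sum()
print(round(pred_340, 2))  # 4.67
```

Note that Pearson similarities can be negative. If a neighborhood mixes positive and negative similarities, `sims.sum()` may get close to zero and the prediction may leave the rating scale; a common variant divides by the sum of absolute similarities instead.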

In [94]:
rating_preds = compute_rating_pred(neighborhood_ratings)

In [95]:
rating_preds

{340: {'pred': 4.666666666666667, 'count': 3},
 325: {'pred': 3.0, 'count': 1},
 288: {'pred': 3.8333333333333335, 'count': 6},
 312: {'pred': 4.0, 'count': 2},
 269: {'pred': 4.333333333333333, 'count': 3},
 313: {'pred': 4.333333333333333, 'count': 6},
 300: {'pred': 2.9999999999999996, 'count': 4},
 264: {'pred': 3.0, 'count': 3},
 271: {'pred': 3.25, 'count': 4},
 333: {'pred': 4.0, 'count': 5},
 1243: {'pred': 3.0, 'count': 1},
 322: {'pred': 2.5, 'count': 2},
 305: {'pred': 4.0, 'count': 1},
 327: {'pred': 4.0, 'count': 3},
 302: {'pred': 4.6, 'count': 5},
 687: {'pred': 3.0, 'count': 1},
 358: {'pred': 1.0, 'count': 2},
 323: {'pred': 2.5, 'count': 2},
 286: {'pred': 3.875, 'count': 8},
 678: {'pred': 2.0, 'count': 1},
 343: {'pred': 4.0, 'count': 2},
 644: {'pred': 3.0, 'count': 1},
 309: {'pred': 5.0, 'count': 1},
 39: {'pred': 1.0, 'count': 1},
 175: {'pred': 2.0, 'count': 1},
 948: {'pred': 1.0, 'count': 1},
 294: {'pred': 2.6666666666666665, 'count': 6},
 258: {'pred': 4.0,

### Compute the Top-$N$ Recommendation Items

In [96]:
def compute_top_n(rating_preds: dict, min_count: int, N: int) -> OrderedDict:
    rating_preds = {key: val for (key, val) in rating_preds.items()
                    if val['count'] >= min_count}
    # assuming more ratings mean higher confidence in the prediction
    sorted_rating_preds = sorted(rating_preds.items(),
                                 key=lambda kv: (kv[1]['pred'], kv[1]['count']),
                                 reverse=True)

    return OrderedDict(sorted_rating_preds[:N])
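
The effect of `min_count` and of the tuple sort key is easiest to see on a small example. The sketch below repeats the same filtering and sorting logic on made-up predictions (`toy_preds` is hypothetical data).

```python
from collections import OrderedDict

# Made-up predictions for illustration
toy_preds = {
    10: {'pred': 5.0, 'count': 1},  # best prediction, but below min_count
    20: {'pred': 4.5, 'count': 2},
    30: {'pred': 4.5, 'count': 6},  # ties with item 20, but has more ratings
    40: {'pred': 3.0, 'count': 9},
}
min_count, N = 2, 2

# Drop items backed by too few neighbor ratings
kept = {item: val for item, val in toy_preds.items()
        if val['count'] >= min_count}
# Sort by prediction first and by rating count second, both descending
ranked = sorted(kept.items(),
                key=lambda kv: (kv[1]['pred'], kv[1]['count']),
                reverse=True)
toy_top_n = OrderedDict(ranked[:N])
print(list(toy_top_n))  # [30, 20]
```

Item 10 is filtered out despite its prediction of 5.0, and item 30 ranks ahead of item 20 because the rating count breaks the tie.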

In [97]:
top_n_recs = compute_top_n(rating_preds, min_count=2, N=10)

In [98]:
top_n_recs

OrderedDict([(242, {'pred': 5.0, 'count': 2}),
             (272, {'pred': 5.0, 'count': 2}),
             (340, {'pred': 4.666666666666667, 'count': 3}),
             (332, {'pred': 4.666666666666667, 'count': 3}),
             (302, {'pred': 4.6, 'count': 5}),
             (690, {'pred': 4.5, 'count': 2}),
             (313, {'pred': 4.333333333333333, 'count': 6}),
             (269, {'pred': 4.333333333333333, 'count': 3}),
             (333, {'pred': 4.0, 'count': 5}),
             (327, {'pred': 4.0, 'count': 3})])

### Combine the steps in a single method

In [99]:
def get_recommendations(user: int,
                        user_user_sims: dict,
                        k: int,
                        min_count: int,
                        N: int):
    user_neighbors = get_k_nearest_neighbors(user, k=k, user_user_sims=user_user_sims)
    neighborhood_ratings = get_neighborhood_ratings(user, user_neighbors)
    rating_preds = compute_rating_pred(neighborhood_ratings)
    top_n_recs = compute_top_n(rating_preds, min_count=min_count, N=N)
    return top_n_recs

In [100]:
get_recommendations(1, user_user_sims, 10, 2, 10)

OrderedDict([(242, {'pred': 5.0, 'count': 2}),
             (272, {'pred': 5.0, 'count': 2}),
             (340, {'pred': 4.666666666666667, 'count': 3}),
             (332, {'pred': 4.666666666666667, 'count': 3}),
             (302, {'pred': 4.6, 'count': 5}),
             (690, {'pred': 4.5, 'count': 2}),
             (313, {'pred': 4.333333333333333, 'count': 6}),
             (269, {'pred': 4.333333333333333, 'count': 3}),
             (333, {'pred': 4.0, 'count': 5}),
             (327, {'pred': 4.0, 'count': 3})])

### Evaluation

In [101]:
k = 10
min_count = 2
N = 10

In [102]:
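# relevant_items is not defined in the cells shown here; it is presumably built beforehand
# with get_relevant_items (imported above) and maps each user to their relevant items
users = relevant_items.keys()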
prec_at_N = dict.fromkeys(data.users)

for user in users:
    recommendations = get_recommendations(user, user_user_sims, k, min_count, N)
    recommendations = list(recommendations.keys())
    hits = np.intersect1d(recommendations,
                          relevant_items[user])
    prec_at_N[user] = len(hits)/N

In [89]:
np.mean([val for val in prec_at_N.values() if val is not None])

0.04627659574468085
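
To make the metric concrete, here is precision at $N$ for a single user on made-up data. The helper `precision_at_n` and the toy lists are hypothetical and only mirror the per-user computation in the loop above.

```python
from typing import List


def precision_at_n(recommended: List[int], relevant: List[int], n: int) -> float:
    """Share of the top-n recommended items that are relevant for the user."""
    hits = set(recommended[:n]) & set(relevant)
    # Divide by n even if fewer than n items were recommended,
    # as the evaluation loop above does
    return len(hits) / n


toy_recommendations = [242, 272, 340, 332, 302]
toy_relevant_items = [272, 302, 999]

# Two of the five recommended items (272 and 302) are relevant
print(precision_at_n(toy_recommendations, toy_relevant_items, n=5))  # 0.4
```

Averaging this value over all users that have relevant items gives the mean precision at $N$ reported above.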