# Hybrid Recommender Systems
In this session, we will look at recommender systems that combine collaborative filtering and content-based methods.
The focus of this hands-on exercise is on understanding how those two methods can be combined. 
The `wine-reviews` dataset from the last session is reused in a slightly different format.

In [1]:
import numpy as np
import pandas as pd 
import surprise as sp

In [2]:
# data parsing
parsed_data = pd.read_csv("wine-reviews/winemag-data-130k-v2.csv")
filtered_data = parsed_data[['country','province','region_1','variety','price','taster_name','points']]
cleaned_data = filtered_data.rename(columns={'region_1': 'region'}).dropna(subset=['country','province','region','variety','taster_name','points'])

# group all wines from a region that have the same variety, assign mean price
wines_all = cleaned_data.groupby(['country', 'province', 'region', 'variety']).agg({'price': 'mean'}).reset_index()
wines_all = wines_all.assign(id=pd.Series(range(1, wines_all.shape[0]+1), dtype=int, index=wines_all.index))
wines_all = wines_all[['id', 'country', 'province', 'region', 'variety', 'price']]

users_all = cleaned_data.groupby('taster_name').count().reset_index()[['taster_name']]
users_all = users_all.assign(id=pd.Series(range(1, users_all.shape[0]+1), dtype=int, index=users_all.index))

# link ratings to wines and users via id
wine_id_translator = {(row['country'], row['province'], row['region'], row['variety']): row['id'] for index, row in wines_all.iterrows()}
user_id_translator = {row['taster_name']: row['id'] for index, row in users_all.iterrows()}
def get_wine_id_series(data_frame):
    return pd.Series((wine_id_translator[(row['country'], row['province'], row['region'], row['variety'])] for _, row in data_frame.iterrows()), index=data_frame.index)
def get_user_id_series(data_frame):
    return pd.Series((user_id_translator[row['taster_name']] for _, row in data_frame.iterrows()), index=data_frame.index)

# aggregate average points of all ratings from a user for a wine
ratings_all = cleaned_data.assign(wine_id=get_wine_id_series, user_id=get_user_id_series)[['taster_name', 'user_id', 'wine_id', 'points']].groupby(['user_id', 'taster_name', 'wine_id']).mean().reset_index()

# only include wines that have 3 or more ratings
most_rated_wines = list(ratings_all.groupby(['wine_id']).count()[lambda x: x['points'] >= 3].reset_index()['wine_id'].values)

ratings = ratings_all.loc[ratings_all['wine_id'].isin(most_rated_wines)].astype({'wine_id': int, 'user_id': int}).reset_index(drop=True)
wines = wines_all.loc[wines_all['id'].isin(most_rated_wines)].astype({'id': int}).reset_index(drop=True)
users = users_all.loc[users_all['id'].isin(ratings['user_id'].values)].astype({'id': int}).reset_index(drop=True)

## Wines

In [3]:
wines.head()

,id,country,province,region,variety,price
0,739,Canada,Ontario,Niagara Peninsula,Riesling,42.423077
1,741,Canada,Ontario,Niagara Peninsula,Vidal Blanc,62.615385
2,757,France,Alsace,Alsace,Gewürztraminer,34.206897
3,760,France,Alsace,Alsace,Pinot Blanc,17.622047
4,778,France,Alsace,Crémant d'Alsace,Sparkling Blend,24.886256


## Ratings

In [4]:
ratings.head()

,user_id,taster_name,wine_id,points
0,1,Alexander Peartree,5069,87.666667
1,1,Alexander Peartree,5737,89.0
2,1,Alexander Peartree,5738,86.75
3,1,Alexander Peartree,5741,86.25
4,1,Alexander Peartree,5743,88.0


## Users

In [5]:
users.head()

,taster_name,id
0,Alexander Peartree,1
1,Anna Lee C. Iijima,2
2,Anne Krebiehl MW,3
3,Carrie Dykes,4
4,Christina Pickard,5


## Collaborative Filtering
`predict_cf` returns the predicted rating of the user with name `taster_name` for the item with id `wine_id`.
The function uses a k-nearest-neighbors (KNN) algorithm, `KNNBasic` from `surprise`. The model is trained on all ratings except the one being predicted.
The error (prediction minus actual rating) and the actual rating are returned as well.
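
To make the idea behind neighborhood-based prediction concrete, here is a minimal sketch. It is not the code `surprise` runs internally: the helper name `knn_predict` and the similarity values are made up for illustration, whereas `KNNBasic` derives the similarities from the training ratings itself. The prediction is the similarity-weighted mean of the ratings that the `k` most similar users gave to the item.

```python
def knn_predict(neighbor_ratings, k=2):
    """Predict a rating from (similarity, rating) pairs of users who rated the item.

    Hypothetical helper for illustration only; the similarities are assumed
    to be given, while a real KNN recommender computes them from the data.
    """
    # Keep only the k most similar users.
    top_k = sorted(neighbor_ratings, key=lambda pair: pair[0], reverse=True)[:k]
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        raise ValueError("no neighbors with positive similarity")
    # Similarity-weighted mean of the neighbors' ratings.
    return sum(sim * rating for sim, rating in top_k) / total_sim


# Three users rated the wine; the two most similar ones determine the result.
example = knn_predict([(0.5, 90.0), (0.5, 80.0), (0.1, 50.0)], k=2)
```

The third user is ignored because only the two nearest neighbors are used, so a dissimilar user with an outlying rating does not affect the prediction.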

In [6]:
# Collaborative Filtering

def predict_cf(ratings, taster_name, wine_id):
    is_target = (ratings['taster_name'] == taster_name) & (ratings['wine_id'] == wine_id)
    target = ratings[is_target].iloc[0]
    
    train_set = sp.Dataset.load_from_df(
        ratings[~is_target][['user_id', 'wine_id', 'points']], 
        sp.Reader(rating_scale=(0, 100))
    ).build_full_trainset()

    algo = sp.KNNBasic(verbose=False)
    algo.fit(train_set)
    prediction = algo.predict(target['user_id'], target['wine_id'], verbose=False)
    return prediction.est, prediction.est - target['points'], target['points']

## Content-Based
`predict_cn` returns the predicted rating of the user with name `taster_name` for the item with id `wine_id`.
The function also uses a KNN model, here `KNeighborsClassifier` from `sklearn` with a single neighbor. The model is trained on all other ratings from the same user, joined with the wine database.
The error (prediction minus actual rating) and the actual rating are returned as well.
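
The following sketch illustrates the principle on toy data, assuming only categorical attributes: each wine is one-hot encoded, and the rating of the most similar wine the user has already rated is returned. The helpers `one_hot` and `predict_nearest` and the example wines are hypothetical and not part of the notebook, which relies on `pd.get_dummies` and `KNeighborsClassifier` instead.

```python
def one_hot(wine, vocabulary):
    """Encode a dict of categorical attributes as a 0/1 vector."""
    return [1.0 if wine.get(field) == value else 0.0 for field, value in vocabulary]


def predict_nearest(rated_wines, target_wine):
    """Return the rating of the rated wine closest to target_wine.

    rated_wines: list of (attributes, rating) pairs from a single user.
    """
    all_wines = [attrs for attrs, _ in rated_wines] + [target_wine]
    # Every (field, value) combination seen in the data becomes one dimension.
    vocabulary = sorted({(f, v) for attrs in all_wines for f, v in attrs.items()})
    target_vec = one_hot(target_wine, vocabulary)

    def distance(attrs):
        vec = one_hot(attrs, vocabulary)
        return sum((a - b) ** 2 for a, b in zip(vec, target_vec))

    _, rating = min(rated_wines, key=lambda pair: distance(pair[0]))
    return rating


rated = [
    ({"country": "France", "variety": "Riesling"}, 88.0),
    ({"country": "Canada", "variety": "Vidal Blanc"}, 91.0),
]
# Shares the country with the second wine and nothing with the first.
nearest_rating = predict_nearest(rated, {"country": "Canada", "variety": "Icewine"})
```

With a single neighbor the prediction can only take a value the user has already given, which is why the content-based predictions in this notebook always coincide with one of the user's existing ratings.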

In [7]:
# Content-Based

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

def predict_cn(ratings, wines, taster_name, wine_id):
    user_ratings = ratings[ratings['taster_name'] == taster_name].join(wines.set_index('id'), on='wine_id')
    is_target = (user_ratings['wine_id'] == wine_id)
    
    features = pd.get_dummies(user_ratings.drop(columns=['points']))
    train_features = features[~is_target]
    target_features = features[is_target]
    
    encoder = LabelEncoder()
    train_labels = encoder.fit_transform(user_ratings[~is_target]['points'])
    target_label = user_ratings[is_target]['points'].iloc[0]

    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(train_features, train_labels)
    prediction = encoder.inverse_transform(clf.predict(target_features))[0]
    return prediction, prediction - target_label, target_label

## Testing the Recommenders

In [8]:
def test_classifier(taster_name, wine_id):
    pred_cf, error_cf, truth = predict_cf(ratings, taster_name, wine_id)
    pred_cn, error_cn, truth = predict_cn(ratings, wines, taster_name, wine_id)
    print("Results for {} on wine with id {}:".format(taster_name, wine_id))
    print("Collaborative Filtering: \t prediction: {:.5f} \t error: {:.5f}".format(pred_cf, error_cf))
    print("Content-Based: \t\t\t prediction: {:.5f} \t error: {:.5f}".format(pred_cn, error_cn))

In [9]:
test_classifier(taster_name='Anna Lee C. Iijima', wine_id=741)

Results for Anna Lee C. Iijima on wine with id 741:
Collaborative Filtering: 	 prediction: 89.65560 	 error: -0.01107
Content-Based: 			 prediction: 89.50000 	 error: -0.16667


In [10]:
test_classifier(taster_name='Virginie Boone', wine_id=4147)

Results for Virginie Boone on wine with id 4147:
Collaborative Filtering: 	 prediction: 87.93883 	 error: 2.83883
Content-Based: 			 prediction: 85.50000 	 error: 0.40000


## Exercise 1: Weighted Recommender
Create a weighted recommender, combining the results of `predict_cf` and `predict_cn`. The weights should be static.
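
As a starting point, the blending step on its own could look like the sketch below. The helper name `combine_weighted` and the default weight of 0.5 are assumptions made for illustration; a good weight has to be found by experiment. Inside `predict_weighted`, the two inputs would come from `predict_cf` and `predict_cn`.

```python
def combine_weighted(pred_cf, pred_cn, truth, weight_cf=0.5):
    """Blend two predictions with static weights that sum to 1.

    weight_cf is the share given to collaborative filtering; the
    content-based prediction receives the remaining 1 - weight_cf.
    """
    if not 0.0 <= weight_cf <= 1.0:
        raise ValueError("weight_cf must lie in [0, 1]")
    prediction = weight_cf * pred_cf + (1.0 - weight_cf) * pred_cn
    # Same return convention as predict_cf and predict_cn.
    return prediction, prediction - truth, truth


# Toy numbers, not taken from the dataset.
blended, blended_error, actual = combine_weighted(90.0, 80.0, 86.0, weight_cf=0.5)
```

Keeping the return convention of `predict_cf` and `predict_cn` means the result can be printed with the same format string as the other recommenders.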

In [None]:
def predict_weighted(ratings, wines, taster_name, wine_id):
    # Your code goes here
    return prediction, error, truth


pred_weighted, error_weighted, truth = predict_weighted(ratings, wines, taster_name='Anna Lee C. Iijima', wine_id=741)
print("Weighted Hybrid: \t prediction: {:.5f} \t error: {:.5f}".format(pred_weighted, error_weighted))

## Exercise 2: Feature Combination
Use the Feature Combination method to enrich the data (i.e. the wine database) that is used by `predict_cn`.
For the matrix factorization, the `NMF` class from `sklearn.decomposition` can be used.
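
One possible reading of this task, sketched on toy data: factorize the user-wine rating matrix and append the latent factors of each wine as additional columns to the wine table. The helper `add_latent_features`, the column names `latent_0`, `latent_1`, the number of components, and the choice to fill missing ratings with 0 are all assumptions made for this sketch, not part of the exercise.

```python
import pandas as pd
from sklearn.decomposition import NMF


def add_latent_features(ratings, wines, n_components=2):
    """Append NMF item factors of the rating matrix to the wine table."""
    # Rows are users, columns are wines; unrated pairs are filled with 0,
    # which is a simplifying assumption (0 is treated like a real rating).
    matrix = ratings.pivot_table(
        index="user_id", columns="wine_id", values="points", fill_value=0
    )
    model = NMF(n_components=n_components, init="nndsvda", random_state=0, max_iter=500)
    model.fit(matrix.to_numpy(dtype=float))
    # components_ has shape (n_components, n_wines): one column per wine.
    latent = pd.DataFrame(
        model.components_.T,
        index=matrix.columns,
        columns=[f"latent_{i}" for i in range(n_components)],
    )
    return wines.join(latent, on="id")


# Toy data with the same column names as the notebook's tables.
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "wine_id": [10, 11, 10, 12, 11],
    "points": [88.0, 90.0, 85.0, 91.0, 87.0],
})
toy_wines = pd.DataFrame({"id": [10, 11, 12], "variety": ["Riesling", "Pinot Blanc", "Vidal Blanc"]})
toy_wines_plus = add_latent_features(toy_ratings, toy_wines)
```

Two caveats apply to this sketch. The factorization sees every rating, including the one `predict_cn` later tries to predict, so for a fair evaluation the target rating should be left out first. The latent columns are numeric and are not touched by `pd.get_dummies`, so their scale directly influences the KNN distance; rescaling them may be worthwhile.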

In [None]:
wines_plus = # Your code goes here


pred_fc, error_fc, truth = predict_cn(ratings, wines_plus, taster_name='Anna Lee C. Iijima', wine_id=741)
print("Feature Combination: \t prediction: {:.5f} \t error: {:.5f}".format(pred_fc, error_fc))

## Bonus Exercise: Switching Hybrid
In Exercise 1 we have seen that different recommenders perform better for different users/items. Implement a switching hybrid that determines which method to use depending on the input (e.g. by looking at the number of ratings for each user/item).
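
The switching rule itself could be as simple as the sketch below. The helper `choose_method`, the threshold `min_item_ratings=5`, and the rule are hypothetical and untested on this dataset. The reasoning is that collaborative filtering depends on other users having rated the wine, while the content-based model needs at least one other rating from the same user to train on.

```python
def choose_method(n_user_ratings, n_item_ratings, min_item_ratings=5):
    """Pick 'cf' (collaborative filtering) or 'cn' (content-based).

    n_user_ratings: number of other ratings by the same user.
    n_item_ratings: number of ratings of the wine by other users.
    """
    # Enough ratings of the wine by other users: trust the neighborhood.
    # No other ratings by this user: the content-based model cannot be trained.
    if n_item_ratings >= min_item_ratings or n_user_ratings == 0:
        return "cf"
    return "cn"


method_popular_wine = choose_method(n_user_ratings=10, n_item_ratings=6)
method_rare_wine = choose_method(n_user_ratings=10, n_item_ratings=2)
```

Inside `predict_switching`, both counts can be derived from `ratings` with boolean masks on `taster_name` and `wine_id`, and the result decides whether `predict_cf` or `predict_cn` is called.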

In [None]:
def predict_switching(ratings, wines, taster_name, wine_id):
    # Your code goes here
    return prediction, error, truth


pred, error, truth = predict_switching(ratings, wines, taster_name='Anna Lee C. Iijima', wine_id=741)
print("Switching Hybrid: \t prediction: {:.5f} \t error: {:.5f}".format(pred, error))