This notebook introduces the KNNBaseline algorithm. User bias can be an issue in certain datasets: some users rate
consistently higher or lower than others. KNNBaseline aims to improve predictions by incorporating a baseline rating into the neighbourhood estimate.
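
To make the idea of a baseline rating concrete, here is a minimal pure-Python sketch of a user-based prediction of the form "baseline plus similarity-weighted deviations of the neighbours from their own baselines". The function names, the toy ratings, and the hand-picked biases and similarities are all assumptions for illustration. Surprise estimates the biases itself (the log output in this notebook mentions ALS), and its actual implementation may differ in details.

```python
def baseline(mu, b_user, b_item, u, i):
    """Baseline estimate b_ui = mu + b_u + b_i (unknown users/items get bias 0)."""
    return mu + b_user.get(u, 0.0) + b_item.get(i, 0.0)


def knn_baseline_predict(u, i, ratings, sims, mu, b_user, b_item, k=2):
    """Baseline of (u, i) plus the weighted deviations of the k most similar raters of i."""
    b_ui = baseline(mu, b_user, b_item, u, i)
    # Other users who rated item i, paired with their similarity to u.
    candidates = [
        (sims.get((u, v), 0.0), v)
        for (v, j) in ratings
        if j == i and v != u
    ]
    # Keep only positively similar users, most similar first.
    neighbours = sorted((c for c in candidates if c[0] > 0), reverse=True)[:k]
    if not neighbours:
        return b_ui  # no usable neighbours: fall back to the baseline
    num = sum(
        s * (ratings[(v, i)] - baseline(mu, b_user, b_item, v, i))
        for s, v in neighbours
    )
    den = sum(s for s, _ in neighbours)
    return b_ui + num / den


# Toy data (hypothetical values, chosen by hand).
mu = 3.5                                             # global mean rating
b_user = {"alice": 0.5, "bob": -0.5, "carol": 0.0}   # user biases
b_item = {"m1": 0.2}                                 # item biases
ratings = {("bob", "m1"): 4.0, ("carol", "m1"): 3.0}
sims = {("alice", "bob"): 0.8, ("alice", "carol"): 0.2}

pred = knn_baseline_predict("alice", "m1", ratings, sims, mu, b_user, b_item)
print(round(pred, 3))
```

In this toy case alice's baseline is 4.2, and because bob rated the item above his own baseline, the prediction is pulled upwards from there.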


In [1]:

# import libraries
from surprise import Dataset
from surprise.accuracy import rmse
from own_algorithms.UserItemKNN import UserItemKNN
from surprise import KNNBaseline
from own_algorithms.UserItemKNNv2 import UserItemKNNv2
from surprise.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
from surprise.model_selection import GridSearchCV



Before running the algorithm, we search for the best parameter setup. Unlike earlier, GridSearchCV is used for the sake of simplicity.

In [2]:
# takes a while to run (about 21 minutes)
data = Dataset.load_builtin('ml-100k')
param_grid = {'k': [10, 20, 30, 40, 50, 60],
              'sim_options': {'name': ['cosine', 'pearson_baseline'],
                              'min_support': [1, 5],
                              'user_based': [True, False]}}


# Instantiate the GridSearchCV object and fit the data
gs = GridSearchCV(KNNBaseline, param_grid, measures=['rmse', 'mae'], cv=5)
gs.fit(data)

# Print the best RMSE score and the corresponding parameters
print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.

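The long runtime follows from the size of the grid. As a rough sketch, the snippet below counts how many parameter combinations the grid above describes and how many model fits that implies with 5-fold cross-validation. The helper `count_combinations` is hypothetical and only does the arithmetic. It assumes that the nested `sim_options` lists are expanded into every combination.

```python
param_grid = {'k': [10, 20, 30, 40, 50, 60],
              'sim_options': {'name': ['cosine', 'pearson_baseline'],
                              'min_support': [1, 5],
                              'user_based': [True, False]}}


def count_combinations(grid):
    """Count parameter combinations, expanding nested dicts such as sim_options."""
    total = 1
    for values in grid.values():
        if isinstance(values, dict):
            total *= count_combinations(values)
        else:
            total *= len(values)
    return total


n_combinations = count_combinations(param_grid)  # 6 * 2 * 2 * 2
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)  # 48 240
```

Each of those fits recomputes the biases and a full similarity matrix, which is why the log repeats the same three lines many times.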
In [3]:
algo = gs.best_estimator['rmse']

# test on ml-100k
trainset, testset = train_test_split(data, test_size=0.25)
algo.fit(trainset)

predictions = algo.test(testset)
print("Unbiased accuracy on B for KNNBaseline on ml100k,", end=" ")
algo_rmse = rmse(predictions)

# test on ml-1m (reuses the parameters tuned on ml-100k)
data = Dataset.load_builtin('ml-1m')
trainset, testset = train_test_split(data, test_size=0.25)
algo.fit(trainset)
predictions = algo.test(testset)
print("Unbiased accuracy on B for KNNBaseline on ml-1m,", end=" ")
algo_rmse_1m = rmse(predictions)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Unbiased accuracy on B for KNNBaseline on ml100k, RMSE: 0.9147
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Unbiased accuracy on B for KNNBaseline on ml-1m, RMSE: 0.8603


In [4]:
KNNbaselineScores = pd.DataFrame(data={"Algo": ["KNNBaseline"],
                                       "100k": [algo_rmse],
                                       "1M": [algo_rmse_1m]})
KNNbaselineScores.to_csv('./algo_data/KNNBaselineOp.csv', index=False)


In [24]:
movies_cols = ['movie_id', 'title', 'genres']
movies_df = pd.read_csv('./ml-1m/movies.dat', sep='::', names=movies_cols, engine='python', encoding='latin-1')

In [27]:
# load dataframes with predictions
user1 = pd.read_csv('./predictions/1.csv')
user134 = pd.read_csv('./predictions/134.csv')
user398 = pd.read_csv('./predictions/398.csv')

In [29]:

from own_algorithms.top_n_list import get_top_n_list

movies = get_top_n_list(predictions, 10, '398', movies_df)
user398["KNN Baseline"] = movies
user398.to_csv('./predictions/398.csv')
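
The helper `get_top_n_list` is project-specific and its implementation is not shown in this notebook. As a rough sketch of what such a helper might do, the snippet below picks the `n` highest estimated ratings for one user and maps the item ids to titles. The function `top_n_titles`, the toy predictions, and the placeholder titles are hypothetical. The `Prediction` tuple is redefined here with the same field names Surprise uses, only so the sketch runs on its own. User ids are written as strings, as in the calls in this notebook.

```python
from collections import namedtuple

# Same field names as Surprise's prediction tuples; defined locally for a standalone sketch.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def top_n_titles(predictions, n, user_id, titles_by_id):
    """Return the titles of the n items with the highest estimated rating for one user."""
    user_preds = [p for p in predictions if p.uid == user_id]
    user_preds.sort(key=lambda p: p.est, reverse=True)
    return [titles_by_id[p.iid] for p in user_preds[:n]]


# Toy data (hypothetical ids, ratings and titles).
toy_predictions = [
    Prediction("398", "1", 4.0, 4.6, {}),
    Prediction("398", "2", 3.0, 3.1, {}),
    Prediction("398", "3", 5.0, 4.9, {}),
    Prediction("1", "1", 2.0, 2.5, {}),
]
toy_titles = {"1": "Movie A", "2": "Movie B", "3": "Movie C"}

top_two = top_n_titles(toy_predictions, 2, "398", toy_titles)
print(top_two)
```

Because the predictions come from `algo.test(testset)`, a list built this way can only contain items that appear in the test set for that user.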

In [30]:
movies = get_top_n_list(predictions, 10, '1', movies_df)
user1["KNN Baseline"] = movies
user1.to_csv('./predictions/1.csv')

In [31]:
movies = get_top_n_list(predictions, 10, '134', movies_df)
user134["KNN Baseline"] = movies
user134.to_csv('./predictions/134.csv')

In [32]:
user134

Unnamed: 0,Hybrid,KNN Basic,KNN Baseline
0,Braveheart (1995),Braveheart (1995),Braveheart (1995)
1,In the Line of Fire (1993),In the Line of Fire (1993),Forrest Gump (1994)
2,"Last of the Mohicans, The (1992)","Last of the Mohicans, The (1992)",Aladdin (1992)
3,Austin Powers: International Man of Mystery (1...,Austin Powers: International Man of Mystery (1...,Terminator 2: Judgment Day (1991)
4,"Full Monty, The (1997)","Full Monty, The (1997)",Groundhog Day (1993)
5,"Mask of Zorro, The (1998)",Office Space (1999),Saving Private Ryan (1998)
6,Being John Malkovich (1999),Being John Malkovich (1999),Galaxy Quest (1999)
7,Toy Story 2 (1999),Toy Story 2 (1999),Patriot Games (1992)
8,Chicken Run (2000),Chicken Run (2000),"Patriot, The (2000)"
9,Almost Famous (2000),Almost Famous (2000),Remember the Titans (2000)
