In this notebook, I will be testing my algoirthm(s) on unseen data. As opposed to
cross validation


In [2]:
import time

# import libraries
from surprise import Dataset
from surprise.accuracy import rmse
from own_algorithms.UserItemKNN import UserItemKNN
from surprise import KNNBasic
from own_algorithms.UserItemKNNv2 import UserItemKNNv2
from surprise.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
from own_algorithms.top_n_list import get_top_n_list


In [3]:
# load data and split into A/B sets, A for CFV and B for unbiased testing
# Load the data using the built-in function
data = Dataset.load_builtin('ml-100k')
raw_ratings= data.raw_ratings

random.seed(2001)
np.random.seed(2001)
random.shuffle(raw_ratings)

# create threshold for unseen, 80-20
cutoff = int(0.8 * len(raw_ratings))
A_raw= raw_ratings[:cutoff]
B_raw= raw_ratings[cutoff:]

# data is now only set A ratings
data.raw_ratings= A_raw

In [4]:
# testing the first version ensemble
algo= UserItemKNN()

trainset = data.build_full_trainset()
algo.fit(trainset)


testset = data.construct_testset(B_raw)  # testset is now the set B
predictions = algo.test(testset)
print("Unbiased accuracy on B for v1,", end=" ")
algo1_rmse= rmse(predictions)

#testing the "improved version"
algo2= UserItemKNNv2()
algo2.fit(trainset)
predictions= algo2.test(testset)
print("Unbiased accuracy on B for v2,", end=" ")
algo2_rmse= rmse(predictions)

algo3= KNNBasic()
algo3.fit(trainset)
predictions= algo2.test(testset)
print("Unbiased accuracy on B for KNNBasic,", end=" ")
algo3_rmse= rmse(predictions)

Unbiased accuracy on B for v1, RMSE: 0.9715
Unbiased accuracy on B for v2, RMSE: 0.9645
Computing the msd similarity matrix...
Done computing similarity matrix.
Unbiased accuracy on B for KNNBasic, RMSE: 0.9645


These scores are for the 100k dataset. This is not a realistic dataset as each film,user pair is filled.
Most real life predictions are made using a sparse dataset. The next cells will be using the ml-1m dataset.
A larger movie rating dataset with approximatly 95% sparsity. It is expected the performance will drop

In [5]:

#load the 1m dataset
data=Dataset.load_builtin('ml-1m')

#the algos havent been optimisded on data so its unncesccasy to split the dataset to validaton and test sets
trainset, testset = train_test_split(data, test_size=0.25)
# Train the algorithm on the trainset, and predict ratings for the testset

algo.fit(trainset)
predictions = algo.test(testset)
print("Unbiased accuracy on B for v1,", end=" ")
algo1_rmse_1m= rmse(predictions)

fit_start=time.time()
algo2.fit(trainset)
fit_hybrid=time.time()-fit_start
predict_start=time.time()
predictions= algo2.test(testset)
predict_hybrid=time.time()-predict_start
print("Unbiased accuracy on B for v2,", end=" ")
algo2_rmse_1m= rmse(predictions)

hybrid_stats= np.array([algo2_rmse_1m, fit_hybrid, predict_hybrid])

Unbiased accuracy on B for v1, RMSE: 0.9449
Unbiased accuracy on B for v2, RMSE: 0.9259


In [6]:
hybrid_stats

array([  0.92594911, 107.19443941, 225.53905702])

In [5]:

movies_cols = ['movie_id', 'title', 'genres']
movies_df = pd.read_csv('./ml-1m/movies.dat', sep='::', names=movies_cols, engine='python', encoding='latin-1')

# create top n list
movies=get_top_n_list(predictions, 10, '398', movies_df)
df398= pd.DataFrame({'Hybrid':movies})
movies=get_top_n_list(predictions, 10, '1', movies_df)
df1= pd.DataFrame({'Hybrid':movies})
movies=get_top_n_list(predictions, 10, '134', movies_df)
df134= pd.DataFrame({'Hybrid':movies})

In [6]:
df398

Unnamed: 0,Hybrid
0,Taxi Driver (1976)
1,"Godfather, The (1972)"
2,"Clockwork Orange, A (1971)"
3,Full Metal Jacket (1987)
4,Chinatown (1974)
5,"Manchurian Candidate, The (1962)"
6,This Is Spinal Tap (1984)
7,Roger & Me (1989)
8,Rushmore (1998)
9,Run Lola Run (Lola rennt) (1998)


In [7]:
df1

Unnamed: 0,Hybrid
0,Fargo (1996)
1,Gigi (1958)
2,Cinderella (1950)
3,One Flew Over the Cuckoo's Nest (1975)
4,Ben-Hur (1959)
5,Saving Private Ryan (1998)
6,"Christmas Story, A (1983)"
7,Ferris Bueller's Day Off (1986)
8,Awakenings (1990)
9,Toy Story 2 (1999)


In [8]:
df134

Unnamed: 0,Hybrid
0,Braveheart (1995)
1,In the Line of Fire (1993)
2,"Last of the Mohicans, The (1992)"
3,Austin Powers: International Man of Mystery (1...
4,"Full Monty, The (1997)"
5,"Mask of Zorro, The (1998)"
6,Being John Malkovich (1999)
7,Toy Story 2 (1999)
8,Chicken Run (2000)
9,Almost Famous (2000)


Well, the results here show us that our algorithm actually performs better in the ML-1M dataset.
THis was an unexpected result however the tests do not lie. This could be due to the fact the 1m dataset has 10x
the ratings than the 100k and theefore can make more accuracte predictions. I will test the KNNBasic algorithm again to compare the results

In [7]:


algo3=KNNBasic()
fit_start=time.time()
algo3.fit(trainset)
fit_basic= time.time()- fit_start
predict_start=time.time()
predictions= algo3.test(testset)
predict_basic=time.time()-predict_start
print("Unbiased accuracy on B for KNNBasic,", end=" ")
algo3_rmse_1m= rmse(predictions)

basic_stats=np.array([algo3_rmse_1m, fit_basic, predict_basic])
basic_stats


Computing the msd similarity matrix...
Done computing similarity matrix.
Unbiased accuracy on B for KNNBasic, RMSE: 0.9251


array([  0.92510662,  30.11594629, 139.11689234])

In [8]:
data=pd.DataFrame(columns=['RMSE', 'Fit Time', 'Predict Time'])
data.loc[len(data)] = basic_stats
data.loc[len(data)] = hybrid_stats
data.insert(0,'Algorithm', ['KNN', 'KNN (hybrid)'])
data.to_csv('./algo_data/KNN_1m', index=False)

In [10]:
movies=get_top_n_list(predictions, 10, '398', movies_df)
df398["KNN Basic"]= movies
movies=get_top_n_list(predictions, 10, '1', movies_df)
df1["KNN Basic"]= movies
movies=get_top_n_list(predictions, 10, '134', movies_df)
df134["KNN Basic"]= movies


In [11]:
df398

Unnamed: 0,Hybrid,KNN Basic
0,Taxi Driver (1976),Taxi Driver (1976)
1,"Godfather, The (1972)","Godfather, The (1972)"
2,"Clockwork Orange, A (1971)",Dial M for Murder (1954)
3,Full Metal Jacket (1987),"Clockwork Orange, A (1971)"
4,Chinatown (1974),Chinatown (1974)
5,"Manchurian Candidate, The (1962)","Manchurian Candidate, The (1962)"
6,This Is Spinal Tap (1984),This Is Spinal Tap (1984)
7,Roger & Me (1989),Run Lola Run (Lola rennt) (1998)
8,Rushmore (1998),"Sixth Sense, The (1999)"
9,Run Lola Run (Lola rennt) (1998),Dog Day Afternoon (1975)


In [12]:
df1

Unnamed: 0,Hybrid,KNN Basic
0,Fargo (1996),Fargo (1996)
1,Gigi (1958),Gigi (1958)
2,Cinderella (1950),Cinderella (1950)
3,One Flew Over the Cuckoo's Nest (1975),One Flew Over the Cuckoo's Nest (1975)
4,Ben-Hur (1959),Ben-Hur (1959)
5,Saving Private Ryan (1998),Saving Private Ryan (1998)
6,"Christmas Story, A (1983)","Christmas Story, A (1983)"
7,Ferris Bueller's Day Off (1986),Ferris Bueller's Day Off (1986)
8,Awakenings (1990),Awakenings (1990)
9,Toy Story 2 (1999),Toy Story 2 (1999)


In [13]:
df134

Unnamed: 0,Hybrid,KNN Basic
0,Braveheart (1995),Braveheart (1995)
1,In the Line of Fire (1993),In the Line of Fire (1993)
2,"Last of the Mohicans, The (1992)","Last of the Mohicans, The (1992)"
3,Austin Powers: International Man of Mystery (1...,Austin Powers: International Man of Mystery (1...
4,"Full Monty, The (1997)","Full Monty, The (1997)"
5,"Mask of Zorro, The (1998)",Office Space (1999)
6,Being John Malkovich (1999),Being John Malkovich (1999)
7,Toy Story 2 (1999),Toy Story 2 (1999)
8,Chicken Run (2000),Chicken Run (2000)
9,Almost Famous (2000),Almost Famous (2000)


In [14]:
df398.to_csv('./predictions/398.csv', index=False)
df1.to_csv('./predictions/1.csv', index=False)
df134.to_csv('./predictions/134.csv', index=False)

In [15]:
data={"Algo":["KNNBasic", "V1", "V2"],
      "100k": [algo3_rmse, algo1_rmse, algo2_rmse],
      "1M": [algo3_rmse_1m, algo1_rmse_1m, algo2_rmse_1m]}

results= pd.DataFrame(data)
results.to_csv('./algo_data/KNN_100Kvs1M.csv', index=False)

In [16]:
results

Unnamed: 0,Algo,100k,1M
0,KNNBasic,0.964453,0.925107
1,V1,0.971465,0.944865
2,V2,0.964453,0.925949
