
## Jake Lee (jxl180111)

# Recommendation Systems
We will use Surprise, a Python library for building recommender systems. Details are available at http://surpriselib.com

We will first work through an example using a built-in dataset and then use a custom one.

First, ensure that you have the library installed and then load the required packages.

In [None]:
!pip install scikit-surprise



In [None]:
import io

import numpy as np
import pandas as pd
from surprise import Dataset, KNNBaseline, NormalPredictor, Reader
from surprise import accuracy, get_dataset_dir
from surprise.model_selection import KFold, cross_validate

For a recommendation system, we require a file containing at least 3 things - userId, itemId, and rating. Any other information is not needed, but can be good for human analysis of results.
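To make the three required fields concrete, here is a minimal sketch that turns (user, item, rating) triples into a utility matrix. The ids and ratings are made up for illustration, and `build_utility_matrix` is a hypothetical helper, not part of Surprise.

```python
from collections import defaultdict

# Hypothetical (user_id, item_id, rating) triples; not taken from any dataset.
ratings = [
    ("u1", "i1", 4.0),
    ("u1", "i2", 3.0),
    ("u2", "i1", 5.0),
    ("u3", "i3", 2.0),
]


def build_utility_matrix(triples):
    """Return a nested dict: user -> {item -> rating}. Missing pairs are unrated."""
    matrix = defaultdict(dict)
    for user, item, rating in triples:
        matrix[user][item] = rating
    return dict(matrix)


utility = build_utility_matrix(ratings)
print(utility["u1"])             # all ratings given by user u1
print(utility["u3"].get("i1"))   # None, because u3 never rated i1
```

Most user-item pairs are missing in practice, which is why a sparse structure like this nested dict is a natural fit.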

Let's load the built-in ml-100k dataset that contains movies and ratings.

In [None]:
# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

In [None]:
# Let's see what files come with the dataset
!ls /root/.surprise_data/ml-100k/ml-100k/

allbut.pl  u1.base  u2.test  u4.base  u5.test  ub.base	u.genre  u.occupation
mku.sh	   u1.test  u3.base  u4.test  ua.base  ub.test	u.info	 u.user
README	   u2.base  u3.test  u5.base  ua.test  u.data	u.item


In [None]:
# TODO: Show the first 10 lines of the u.data, and u.item files
!head -10 /root/.surprise_data/ml-100k/ml-100k/u.data

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013
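Each line of u.data is tab-separated. The sketch below parses the first three lines shown above, assuming the column order is user id, item id, rating, and a Unix timestamp in seconds (the order described in the dataset's README). The function name `parse_ratings` is hypothetical.

```python
import csv
import io
from datetime import datetime, timezone

# First three lines of u.data as printed above (tab-separated).
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n22\t377\t1\t878887116\n"


def parse_ratings(text):
    """Parse tab-separated rating lines into a list of dicts."""
    rows = []
    for user_id, item_id, rating, ts in csv.reader(io.StringIO(text), delimiter="\t"):
        rows.append({
            "user_id": user_id,    # kept as a string, like a raw id read from file
            "item_id": item_id,
            "rating": float(rating),
            "rated_at": datetime.fromtimestamp(int(ts), tz=timezone.utc),
        })
    return rows


rows = parse_ratings(sample)
print(rows[0]["user_id"], rows[0]["item_id"], rows[0]["rating"], rows[0]["rated_at"].year)
```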


## Algorithms
Let's look at some of the algorithms available with the package

In [None]:
?KNNBaseline

Nearest-neighbor methods work by searching for similar users or items in the utility matrix. Let's build an item-based nearest-neighbor model first.

In [None]:
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
# we are going to use item-item similarity
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7c56cee60ac0>
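To see what item-item similarity means, here is a simplified sketch on a tiny, made-up utility matrix. It uses plain Pearson correlation over the users who rated both items. This is an assumption made for illustration: the `pearson_baseline` measure used above also centers ratings on baseline estimates and applies shrinkage, which this sketch leaves out. Both function names are hypothetical.

```python
from math import sqrt

# Hypothetical utility matrix: user -> {item -> rating}.
utility = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5, "C": 2},
    "u3": {"A": 1, "B": 2, "C": 5},
    "u4": {"A": 2, "C": 4},
}


def item_pearson(utility, item_x, item_y):
    """Pearson correlation between two items over users who rated both."""
    pairs = [(r[item_x], r[item_y]) for r in utility.values()
             if item_x in r and item_y in r]
    if len(pairs) < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def nearest_items(utility, target, k=2):
    """Return the k items most similar to target as (similarity, item) tuples."""
    items = {i for r in utility.values() for i in r}
    scored = [(item_pearson(utility, target, other), other)
              for other in items if other != target]
    return sorted(scored, reverse=True)[:k]


neighbors = nearest_items(utility, "A")
print(neighbors)  # B ranks first; C is perfectly anti-correlated with A
```

Users who like A tend to like B and dislike C, so B comes out as the nearest neighbor of A.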

In [None]:
!head -10 /root/.surprise_data/ml-100k/ml-100k/u.item

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
7|Twelve Monkeys (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|1|0|0|0
8|Babe (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Babe%20(1995)|0|0|0|0|1

# Id to Name Lookup
Let's write a small function that builds two lookup tables: one from raw id to movie name, and one from movie name to raw id.

In [None]:
def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """

    file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid

In [None]:
# test this function
rid_to_name, name_to_rid = read_item_names()

In [None]:
rid_to_name["1"]

'Toy Story (1995)'

In [None]:
name_to_rid["Twelve Monkeys (1995)"]

'7'

In [None]:
# Find the 10 movies most similar to the movie with raw id 200

movie_inner_id = algo.trainset.to_inner_iid("200")
movie_name = rid_to_name["200"]

# Retrieve inner ids of the nearest neighbors of the chosen movie.
movie_neighbors = algo.get_neighbors(movie_inner_id, k=10)

# Convert inner ids of the neighbors into names.
movie_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in movie_neighbors)
movie_neighbors = (rid_to_name[rid]
                       for rid in movie_neighbors)

print()

print('The 10 nearest neighbors of ' + movie_name)
for movie in movie_neighbors:
    print(movie)


The 10 nearest neighbors of Shining, The (1980)
Bonnie and Clyde (1967)
Godfather: Part II, The (1974)
Alien (1979)
Godfather, The (1972)
Raging Bull (1980)
Pulp Fiction (1994)
One Flew Over the Cuckoo's Nest (1975)
Carrie (1976)
Koyaanisqatsi (1983)
His Girl Friday (1940)
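The cell above converts between raw ids (the ids as they appear in the data file, which is why "200" is a string) and inner ids (integers the trainset uses internally). The sketch below shows the idea with a hypothetical `IdMap` class. It is not Surprise's implementation, only an illustration of a two-way mapping.

```python
class IdMap:
    """Hypothetical helper: maps arbitrary raw ids to contiguous integer inner ids."""

    def __init__(self):
        self._raw_to_inner = {}
        self._inner_to_raw = []

    def add(self, raw_id):
        """Register a raw id (if new) and return its inner id."""
        if raw_id not in self._raw_to_inner:
            self._raw_to_inner[raw_id] = len(self._inner_to_raw)
            self._inner_to_raw.append(raw_id)
        return self._raw_to_inner[raw_id]

    def to_inner(self, raw_id):
        try:
            return self._raw_to_inner[raw_id]
        except KeyError:
            raise ValueError(f"Raw id {raw_id!r} is not part of the training set")

    def to_raw(self, inner_id):
        return self._inner_to_raw[inner_id]


items = IdMap()
for raw in ["242", "302", "242", "377"]:
    items.add(raw)

print(items.to_inner("302"))  # 1
print(items.to_raw(2))        # 377
```

Note that the integer `200` and the string `"200"` would be different raw ids here, which is a common source of lookup errors.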


Let's now apply the algorithm and measure its accuracy.

In [None]:
testset = trainset.build_testset()
predictions = algo.test(testset)
# RMSE is optimistically low here because the test set is built from the training data itself
accuracy.rmse(predictions, verbose=True)

RMSE: 0.4807


0.48071109787164656
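For reference, RMSE is the square root of the mean squared difference between true and estimated ratings. Here is a minimal sketch with made-up prediction pairs. The `rmse` function below is a stand-in written for illustration and is separate from `surprise.accuracy.rmse`.

```python
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    if not pairs:
        raise ValueError("need at least one prediction")
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


# Hypothetical (true, estimated) ratings.
pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0)]
print(round(rmse(pairs), 4))  # 0.7071
```

Because the squared errors are averaged before the square root, a single large miss (the 1.0 error above) weighs more than several small ones.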

Now, let's also try some baseline methods. Follow the code available here:

https://github.com/NicolasHug/Surprise/blob/fa7455880192383f01475162b4cbd310d91d29ca/examples/baselines_conf.py

For more elaborate testing and validation, follow steps mentioned here
https://github.com/NicolasHug/Surprise/blob/fa7455880192383f01475162b4cbd310d91d29ca/examples/grid_search_usage.py
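The log line "Estimating biases using als..." refers to baseline estimates of the form mu + b_u + b_i (global mean plus a user bias plus an item bias). The sketch below is a simplified take on the alternating update, written under the assumption that each bias is a regularized average of the residuals left by the other. The function names and the default values for `n_epochs`, `reg_u`, and `reg_i` are illustrative, and the ratings are made up.

```python
from collections import defaultdict


def als_baselines(ratings, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """Estimate mu, user biases b_u and item biases b_i by alternating updates."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = defaultdict(float)
    b_i = defaultdict(float)
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(n_epochs):
        # Update item biases with user biases held fixed.
        for i, rated in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rated) / (reg_i + len(rated))
        # Update user biases with item biases held fixed.
        for u, rated in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rated) / (reg_u + len(rated))
    return mu, dict(b_u), dict(b_i)


def predict(mu, b_u, b_i, user, item):
    """Baseline estimate; unknown users or items contribute a bias of 0."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)


# Hypothetical ratings: u1 rates generously, item "a" is rated above average.
ratings = [("u1", "a", 5), ("u1", "b", 4), ("u2", "a", 3), ("u2", "b", 2), ("u3", "a", 4)]
mu, b_u, b_i = als_baselines(ratings)
print(round(mu, 2), predict(mu, b_u, b_i, "u3", "b"))
```

Larger regularization values pull the biases toward zero, which matters most for users and items with few ratings.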

# Assignment

In this part, you will use the dataset that is provided along with the following Kaggle competition

https://www.kaggle.com/arashnic/book-recommendation-dataset


I have uploaded the files for you at

Ratings file - https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Ratings.csv

Books file - https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Books.csv


Follow the steps below to create a recommendation system from this data

In [None]:
# TODO: Read both the data files into Pandas dataframes
import pandas as pd
ratings = pd.read_csv("https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Ratings.csv")
books = pd.read_csv("https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Books.csv")



In [None]:
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [None]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [None]:
# TODO: Answer the following questions:

# How many ratings and how many books are there in the dataset
print(len(ratings), len(books))

# Find the top 10 books that have received the highest count of ratings. You should output the id of the book, its title, and the count of ratings received.
top_books = ratings.groupby('ISBN').size().reset_index(name='rating_count')
top_books = pd.merge(top_books, books[['ISBN', 'Book-Title']], on='ISBN')
top_books = top_books.sort_values('rating_count', ascending=False).head(10)

print(top_books[['ISBN', 'Book-Title', 'rating_count']])

1149780 271360
              ISBN                                         Book-Title  \
215952  0971880107                                        Wild Animus   
38570   0316666343                          The Lovely Bones: A Novel   
70798   0385504209                                  The Da Vinci Code   
7344    0060928336    Divine Secrets of the Ya-Ya Sisterhood: A Novel   
32370   0312195516                The Red Tent (Bestselling Backlist)   
87397   044023722X                                    A Painted House   
21342   0142001740                            The Secret Life of Bees   
145042  067976402X                             Snow Falling on Cedars   
133142  0671027360                                Angels &amp; Demons   
93847   0446672211  Where the Heart Is (Oprah's Book Club (Paperba...   

        rating_count  
215952          2502  
38570           1295  
70798            883  
7344             732  
32370            723  
87397            647  
21342            615

In [None]:
# TODO: Important - You may not be able to use the whole dataset for model creation, so you need to create a
# smaller sample to proceed further
# Here is what I did:
ratings_short = ratings.sample(n = 1000, random_state = 42)
# you can try larger values of n, if the system allows you.

In [None]:
# TODO: Use the data to create a custom dataset in the surprise library
# Steps to do this are: https://surprise.readthedocs.io/en/stable/getting_started.html#use-a-custom-dataset
import os
from surprise import Dataset, NormalPredictor, Reader

# Book-Rating values in this file include 0 (see ratings.head() above), so the scale starts at 0
reader = Reader(rating_scale=(0, 10))
data = Dataset.load_from_df(ratings_short[['User-ID', 'ISBN', 'Book-Rating']], reader)

In [None]:
# TODO: Choose a book at random and use the KNNBasic algorithm to find out its 10 closest neighbors. Do the results make
# sense?
from surprise import KNNBasic
from surprise.model_selection import train_test_split

sim_options = {'name': 'pearson_baseline', 'user_based': False, 'shrinkage': 10}
trainset, testset = train_test_split(data, test_size=0.25)

algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)

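# Fixed ISBN (Wild Animus, the most-rated book) rather than a random draw;
# to_inner_iid raises ValueError if this ISBN did not land in the sampled training set
random_id = '0971880107'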

inner_id = algo.trainset.to_inner_iid(random_id)
neighbours = algo.get_neighbors(inner_id, k=10)
neighbours_raw_ids = [algo.trainset.to_raw_iid(inner_id) for inner_id in neighbours]

neighbor_titles = []
for raw_id in neighbours_raw_ids:
    title_series = books.loc[books['ISBN'] == raw_id, 'Book-Title']
    if not title_series.empty:
        neighbor_titles.append(title_series.iloc[0])
    else:
        neighbor_titles.append("Unknown")  # Append a placeholder

print(f"Nearest neighbors of {books.loc[books['ISBN'] == random_id, 'Book-Title'].iloc[0]}:")
for title in neighbor_titles:
    print(title)


Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Nearest neighbors of Wild Animus:
Windmills of the Gods
The Ominous Parallels: The End of Freedom in America
Der Kaiser von Amerika.
FACE ON/MILK CARTON
Suzanne's Diary for Nicholas
Death on the Downs: A Fethering Mystery (Fethering Mystery)
Folktale Cat
Chapterhouse Dune (Dune Chronicles, Book 6)
This Pen for Hire: A Jaine Austen Mystery (Levine, Laura, Jaine Austen Mystery.)
Facets
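Do these neighbors make sense? The titles look unrelated, and one likely reason is that a sample of 1000 ratings leaves very few users who rated the same pair of books, so the similarities rest on almost no evidence. A quick check is to count co-ratings per item pair. The sketch below does this on made-up triples. `co_rating_counts` is a hypothetical helper, not a Surprise function.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_rating_counts(triples):
    """Count, for every item pair, how many users rated both items."""
    items_by_user = defaultdict(set)
    for user, item, _ in triples:
        items_by_user[user].add(item)
    counts = Counter()
    for items in items_by_user.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


# Hypothetical sample: only items A and B share any raters.
sample = [
    ("u1", "A", 8), ("u1", "B", 7),
    ("u2", "A", 9), ("u2", "B", 6),
    ("u3", "C", 5),
    ("u4", "D", 0),
]
counts = co_rating_counts(sample)
print(counts)  # Counter({('A', 'B'): 2})
```

On the real sample, the same triples can be produced with `ratings_short[['User-ID', 'ISBN', 'Book-Rating']].itertuples(index=False)`. If almost every pair has a count of 0 or 1, a larger `n` is probably needed before the neighbors become meaningful.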


In [None]:
from surprise import SVD, BaselineOnly, Dataset, KNNBaseline, Reader
from surprise.model_selection import GridSearchCV, cross_validate

# KNNBaseline
param_grid_knn = {
    'k': [20, 30, 40],
    'sim_options': {
        'name': ['pearson_baseline'],
        'user_based': [False]
    },
}

# ALS Baseline
param_grid_als = {
    'bsl_options': {
        'method': ['als'],
        'n_epochs': [5, 10, 15],
        'reg_u': [5, 10, 15],
        'reg_i': [5, 10, 15],
    }
}

# SVD
param_grid_svd = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30],
    'reg_all': [0.02, 0.05, 0.1],
}

# Running GridSearchCV for KNNBaseline
print("KNNBaseline:")
grid_search_knn = GridSearchCV(KNNBaseline, param_grid_knn, measures=['rmse'], cv=3)
grid_search_knn.fit(data)
print(f"Best RMSE: {grid_search_knn.best_score['rmse']}")
print(f"Best Parameters: {grid_search_knn.best_params['rmse']}")

# Running GridSearchCV for ALS
print("ALS Baseline:")
grid_search_als = GridSearchCV(BaselineOnly, param_grid_als, measures=['rmse'], cv=3)
grid_search_als.fit(data)
print(f"Best RMSE: {grid_search_als.best_score['rmse']}")
print(f"Best Parameters: {grid_search_als.best_params['rmse']}")

# Running GridSearchCV for SVD
print("SVD:")
grid_search_svd = GridSearchCV(SVD, param_grid_svd, measures=['rmse'], cv=3)
grid_search_svd.fit(data)
print(f"Best RMSE: {grid_search_svd.best_score['rmse']}")
print(f"Best Parameters: {grid_search_svd.best_params['rmse']}")

print("Cross-validation Results:")
models = [KNNBaseline, BaselineOnly, SVD]
model_names = ['KNNBaseline', 'ALS Baseline', 'SVD']

for model, name in zip(models, model_names):
    if name == 'KNNBaseline':
        model_instance = KNNBaseline(k=grid_search_knn.best_params['rmse']['k'],
                                     sim_options=grid_search_knn.best_params['rmse']['sim_options'])
    elif name == 'ALS Baseline':
        model_instance = BaselineOnly(bsl_options=grid_search_als.best_params['rmse']['bsl_options'])
    elif name == 'SVD':
        model_instance = SVD(n_factors=grid_search_svd.best_params['rmse']['n_factors'],
                             n_epochs=grid_search_svd.best_params['rmse']['n_epochs'],
                             reg_all=grid_search_svd.best_params['rmse']['reg_all'])

    # Perform cross-validation to compare mean RMSE scores
    cv_results = cross_validate(model_instance, data, measures=['rmse'], cv=3, verbose=True)
    print(f"Mean RMSE for {name}: {cv_results['test_rmse'].mean()}")
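Each grid search above fits one model per parameter combination per fold, so it helps to estimate the cost before running it. The sketch below counts the fits for the three grids in this cell. `count_fits` is a hypothetical helper (not part of Surprise), and it assumes that nested option dicts such as `sim_options` and `bsl_options` are expanded as a full product and that no extra refit happens at the end.

```python
def count_fits(param_grid, cv):
    """Number of model fits an exhaustive grid search performs with cv folds."""
    n_combos = 1
    for values in param_grid.values():
        if isinstance(values, dict):
            # Nested option dicts contribute the product of their own value lists.
            for nested in values.values():
                n_combos *= len(nested)
        else:
            n_combos *= len(values)
    return n_combos * cv


# Same grids as in the cell above.
param_grid_knn = {
    'k': [20, 30, 40],
    'sim_options': {'name': ['pearson_baseline'], 'user_based': [False]},
}
param_grid_als = {
    'bsl_options': {
        'method': ['als'],
        'n_epochs': [5, 10, 15],
        'reg_u': [5, 10, 15],
        'reg_i': [5, 10, 15],
    }
}
param_grid_svd = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30],
    'reg_all': [0.02, 0.05, 0.1],
}

print(count_fits(param_grid_knn, cv=3))  # 9
print(count_fits(param_grid_als, cv=3))  # 81
print(count_fits(param_grid_svd, cv=3))  # 54
```

The ALS grid is the most expensive in number of fits, although each BaselineOnly fit is cheap compared with building a similarity matrix.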


KNNBaseline:
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearso