# Recommendation Enging using Surprise in Scikit-learn
## Chapter 7
### Predictive Analytics for the Modern Enterprise 

This is jupyter notebook that can be used to follow along the code examples for Chapter 7 Section 1 - Unsupervised learning of the book. The code examples go through some of the functionality that can be used to work with the sci-kit learn library in Python to work with unsupervised learning models. 

The notebook has been tested using the following pre-requisite:

Python V3.9.13 - https://www.python.org/
Anaconda Navigator V3 for Python 3.9 - https://www.anaconda.com/
Jupyter - V6.4.12 - https://jupyter.org/
Desktop computer - macOS Ventura V13.1

Documentation referece for Scikit Learn: https://scikit-learn.org/stable/

### Pre-requisites


You will need to install the scikit-learn package in your envionrment. 
In your environment run the following command

```bash
conda install -c conda-forge scikit-learn
conda install -c conda-forge scikit-surprise
```
OR
```bash
pip install -U scikit-learn
pip install scikit-surprise
```

This code uses the Movies dataset

* Orginal Dataset: https://files.grouplens.org/datasets/movielens/ml-100k.zip 
* A Local copy can be downloaded here: https://github.com/paforme/predictiveanalytics/blob/main/Chapter7/Datasets/ml-100k/ml-100k.zip

In [1]:
import pandas as pd
import numpy as np
import itertools as it

from surprise import accuracy, Dataset, SVD, KNNBasic
from surprise import KNNWithMeans, KNNBaseline, Reader
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
from collections import defaultdict
from itertools import chain

In [2]:
dataset = "ml-100k" #We are using a built in dataset
algo = SVD()
uid = '1'
top = 10

In [3]:
def data_loader(dataset):
    data = Dataset.load_builtin(dataset)
    return data

def train_model(algo, trainset):    
    algo.fit(trainset)
    return algo

#The below function is taken from surprise documentation that can be found here: https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-get-the-k-nearest-neighbors-of-a-user-or-item
#I am not the original author of this function and am simply repurposeing it
def get_top_n(predictions, n=10):
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

#The below function is taken from surprise documentation that can be found here: https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-get-the-k-nearest-neighbors-of-a-user-or-item
#I am not the original author of this function and am simply repurposeing it
def read_item_names():

    file_name = "./Datasets/ml-100k/u.item" #Change this to where the dataset is downloaded
    rid_to_name = {}
    name_to_rid = {}
    with open(file_name, encoding="ISO-8859-1") as f:
        for line in f:
            line = line.split("|")
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid

def get_movie_names(uid,predictions,top):
    top_n = get_top_n(predictions, n=top)
    top_n_pd = pd.DataFrame(top_n)
    recommendation_ids = top_n_pd[uid].apply(lambda x: x[0])
    iid_to_name, name_to_iid = read_item_names()
    recommendations = (iid_to_name[iid] for iid in recommendation_ids)
    return recommendations

def run_predictions(algo):
    data = data_loader(dataset)
    trainset, testset = train_test_split(data, test_size=0.25)
    algo = train_model(algo, trainset)
    predictions = algo.test(testset)
    print(accuracy.rmse(predictions), accuracy.mae(predictions) )

    #Than predict ratings for all pairs (u, i) that are NOT in the training set.
    testset = trainset.build_anti_testset()
    predictions = algo.test(testset)
    recommendations = get_movie_names(uid,predictions,top)
    print("\nTop 10 Movie recommendations for uid:", uid)
    for movie in recommendations:
        print(movie)


In [4]:
#SVD - With Default options 
algo = SVD()
run_predictions(algo)

RMSE: 0.9375
MAE:  0.7377
0.9375435725015225 0.737691870283514

Top 10 Movie recommendations for uid: 1
Close Shave, A (1995)
Raiders of the Lost Ark (1981)
Duck Soup (1933)
Star Wars (1977)
Bridge on the River Kwai, The (1957)
Secrets & Lies (1996)
Casablanca (1942)
Some Like It Hot (1959)
Fargo (1996)
Maltese Falcon, The (1941)


In [5]:
#KNN - Basic - item based / cosine similarity
data = Dataset.load_builtin("ml-100k")
trainset = data.build_full_trainset()

sim_options = {
    "name": "cosine",
    "user_based": False,  # compute  similarities between items
}
algo = KNNBasic(sim_options=sim_options)
run_predictions(algo)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0252
MAE:  0.8146
1.0252284380966419 0.8145548569839735

Top 10 Movie recommendations for uid: 1
Hearts and Minds (1996)
Cyclo (1995)
Visitors, The (Visiteurs, Les) (1993)
Coldblooded (1995)
My Life and Times With Antonin Artaud (En compagnie d'Antonin Artaud) (1993)
Intimate Relations (1996)
Substance of Fire, The (1996)
The Deadly Cure (1996)
He Walked by Night (1948)
Crows and Sparrows (1949)


In [6]:
#KNN - With Means - user based / cosine similarity
sim_options = {
    "name": "cosine",
    "user_based": True,  # compute  similarities between users
}
algo = KNNWithMeans(sim_options=sim_options)
run_predictions(algo)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9464
MAE:  0.7462
0.9463682476160932 0.7461656849188013

Top 10 Movie recommendations for uid: 1
Saint of Fort Washington, The (1993)
Santa with Muscles (1996)
Star Kid (1997)
Entertaining Angels: The Dorothy Day Story (1996)
Great Day in Harlem, A (1994)
Someone Else's America (1995)
Lamerica (1994)
Aiqing wansui (1994)
Leading Man, The (1996)
Some Mother's Son (1996)


In [7]:
data = data_loader(dataset)
# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9569  0.9546  0.9588  0.9543  0.9526  0.9554  0.0022  
MAE (testset)     0.7561  0.7507  0.7557  0.7556  0.7553  0.7547  0.0020  
Fit time          0.39    0.38    0.39    0.38    0.38    0.38    0.00    
Test time         2.36    2.36    2.32    2.41    2.42    2.38    0.04    


{'test_rmse': array([0.95685067, 0.9545998 , 0.95883383, 0.95429778, 0.95257005]),
 'test_mae': array([0.75611561, 0.75067524, 0.7556716 , 0.75564242, 0.75525498]),
 'fit_time': (0.3898501396179199,
  0.38194918632507324,
  0.38570117950439453,
  0.38083577156066895,
  0.3848998546600342),
 'test_time': (2.35876202583313,
  2.3632688522338867,
  2.3249189853668213,
  2.413224220275879,
  2.4221909046173096)}

### Two step recommender 


In [8]:
def get_neighbours(algo, mrid):

    #Convert Movie RawID to innerID
    movie_inner_id = algo.trainset.to_inner_iid(mrid)

    # Retrieve inner ids of the nearest neighbors of the movie mrid
    movie_neighbors = algo.get_neighbors(movie_inner_id, k=10)

    # Convert inner ids to rids
    movie_neighbors = (
        algo.trainset.to_raw_iid(inner_id) for inner_id in movie_neighbors
    )
    return movie_neighbors

#### Stage 1

In [9]:
uid = '1' #User to get recommendations for

In [10]:
#Step 1 - Load the full dataset (100k movies)
data = data_loader(dataset) #Load the dataset
trainset = data.build_full_trainset() #Build the trainingset based on complete data

In [11]:
#Get all movies rated by the user - uid
user_items = trainset.ur[int(uid)]
#Get the top 10 user rated movies
user_top_rated = sorted(user_items, key=lambda x: x[1], reverse=True)[:10]

In [12]:
#Build the retriever model - item based cosine similarity
sim_options = {'name': 'cosine', 'user_based': False}
item_model = KNNWithMeans(sim_options=sim_options)
item_model.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f88304f04c0>

In [13]:
#Get nearest neigbours for all 10 top_rated_movies 
all_neighbors = 0

for movie_rid in user_top_rated:
    neighbors = get_neighbours(item_model,str(movie_rid[0]))
    if(all_neighbors!=0):
        all_neighbors= it.chain(all_neighbors,neighbors)
    else:
        all_neighbors = neighbors

In [14]:
#Convert the all_neighbors iterator to a list
mylist = list(all_neighbors)
print(len(mylist))

100


#### Stage 2

In [15]:
#Build a new dataset with nearest neighbours of top_rated_movies
data = data_loader(dataset)
my_trainset = data.build_full_trainset()
my_data = my_trainset.build_testset() #Take full dataset

#Filter only data where the movie is in the top neighbors
filtered_data = [d for d in my_data if d[1] in mylist] 

#Add an entry from the existing set 
#just in case there is no movie rated by this user in the neighbours
for d in my_data:
    if d[0] == uid:
        filtered_data.append(d)
        break
        
filtered_data = pd.DataFrame(filtered_data) #Create a dataframe for filtered data
filtered_data.columns = ["userID", "itemID", "rating"]

reader = Reader(rating_scale=(1, 5))
#Read data back in to create a testset
my_data = Dataset.load_from_df(filtered_data
                               [["userID", "itemID", "rating"]],
                               reader) 

new_trainset = my_data.build_full_trainset() #Create filtered training set

In [16]:
print(new_trainset.n_ratings)

3034


In [17]:
#Train user model with the new dataset
sim_options = {'name': 'cosine', 'user_based': True}
user_model = KNNWithMeans(sim_options=sim_options)
user_model.fit(new_trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f88304eb580>

In [18]:
#Build an anti_testset for all known movies and users 
#in the testset where the ratings are unknown for (u, i) pair
testset = new_trainset.build_anti_testset()
predictions = user_model.test(testset)

In [19]:
#Get top n predictions for all users
#top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
#for uid, user_ratings in top_n.items():
#    print(uid, [iid for (iid, _) in user_ratings])

In [20]:
#Print top 10 movie recommendations for user(uid)
recommendations = get_movie_names(uid,predictions,10)
print("\nTop 10 Movie recommendations for user with uid:", uid)
print("---------------------------------------------------\n")
for movie in recommendations:
    print(movie)


Top 10 Movie recommendations for user with uid: 1
---------------------------------------------------

Hearts and Minds (1996)
So Dear to My Heart (1949)
Beautiful Thing (1996)
Jackie Brown (1997)
City of Lost Children, The (1995)
North by Northwest (1959)
Thieves (Voleurs, Les) (1996)
Ghost in the Shell (Kokaku kidotai) (1995)
Ninotchka (1939)
In the Name of the Father (1993)
