## Lab 8
### PSTAT 134/234

For this lab, we will work in Python, using Jupyter notebooks. You'll notice this is a change in format from our previous work, where we used Quarto and ran Python using the R package `reticulate`. We are making this change because R, for all its strengths, tends to struggle when performing complicated operations on very large data frames, and building a working recommender system usually requires exactly that.

If you have a smaller dataset, or a powerful computer available to you, it's possible to create a recommender system using R. Most of us, though, will probably find it easier and more feasible to use Python. Even using Python and Jupyter Hub, you'll find that we sometimes need to reduce the amount of data we work with. You can probably imagine that our recommendations would be even better if we were able to process all of the data that is actually available.

To experiment with recommender systems, it's best to have **two data files**, one with a set of items and one with a set of different users' reactions to those items. Their reactions can be *explicit* (for example, ratings on a scale of 1 to 5, or binary like/dislike) or *implicit* (viewing an item, adding items to a wish list, time spent reading an article).

We'll build recommender systems for two datasets in this lab -- one that recommends anime (in the form of TV shows or movies) and one that recommends books.

### Building an Anime Recommendation System

We will use the data from [this Kaggle link.](https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database) It consists of user preference data provided by $73,516$ users about $12,294$ anime. The data are stored in two files, `anime.csv` and `anime_rating.csv`.

The `anime.csv` file contains the following:

- `anime_id` - unique id identifying an anime
- `name` - full name of anime
- `genre` - comma-separated list of genres for this anime
- `type` - movie, TV show, OVA, etc.
- `episodes` - how many episodes in the anime (1 if movie)
- `rating` - average rating out of 10 for the anime
- `members` - number of community members that are in the anime's "group"

The `anime_rating.csv` file contains the following:

- `user_id` - non-identifiable randomly generated user id
- `anime_id` - the anime the user rated
- `rating` - rating out of 10 that this user has assigned (-1 if the user watched it but didn't assign a rating)

In the code chunk below, we first import a few Python libraries that will be of use to us -- the `pandas` library, the `numpy` library, the `re` library, and the `warnings` library. Then we read in the data files, `anime.csv` and `anime_rating.csv`, using the `pandas` function `read_csv()` since they are both .csv files.

In [28]:
import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings('ignore')

anime = pd.read_csv("lab-8/anime.csv")
rating = pd.read_csv("lab-8/anime_rating.csv")

We merge the two datasets, joining by the `anime_id` column:

In [31]:
fulldata = pd.merge(anime, rating, on="anime_id", suffixes=[None, "_user"])
fulldata = fulldata.rename(columns={"rating_user": "user_rating"})
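
To see what the `suffixes` argument does, here is a minimal sketch on two made-up data frames (`toy_anime` and `toy_rating` are hypothetical stand-ins, not the lab data). Both frames have a `rating` column, so the merge has to rename one of them:

```python
import pandas as pd

# Made-up stand-ins for anime.csv and anime_rating.csv
toy_anime = pd.DataFrame({"anime_id": [1, 2], "name": ["A", "B"], "rating": [8.5, 7.0]})
toy_rating = pd.DataFrame({"user_id": [10, 11, 10], "anime_id": [1, 1, 2], "rating": [9, -1, 6]})

# None keeps the left-hand name; "_user" is appended to the right-hand name
toy_full = pd.merge(toy_anime, toy_rating, on="anime_id", suffixes=(None, "_user"))
toy_full = toy_full.rename(columns={"rating_user": "user_rating"})

print(toy_full.columns.tolist())
```

The anime's average rating keeps the name `rating`, while each user's own rating ends up in `user_rating`.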

We then check the dimensions of the merged dataset and take a glimpse of the first few rows using the `head()` function. Notice that we customize the display of the data using `.style.set_properties()`. This is not required.

In [32]:
print(f"Shape of the merged dataset : {fulldata.shape}")
print("\nGlimpse of the merged dataset :")

fulldata.head().style.set_properties(**{"background-color": "white","color":"black","border": "1.5px  solid black"})

Shape of the merged dataset : (7813727, 9)

Glimpse of the merged dataset :


| | anime_id | name | genre | type | episodes | rating | members | user_id | user_rating |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 | 99 | 5 |
| 1 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 | 152 | 10 |
| 2 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 | 244 | 10 |
| 3 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 | 271 | 10 |
| 4 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 | 278 | -1 |


We'll create a copy of the full dataset and work with that copy from now on; we'll come back to using the original full dataset later.

Then we need to take care of the anime that users watched but did not rate, the ones with a `user_rating` of `-1`. We should treat those observations like missing data, so it will be easier if we just replace them with `NaN` using NumPy. We'll then drop those observations by using `dropna()` and confirm that there are no `NaN` values remaining in any of the columns:

In [33]:
data = fulldata.copy()
data["user_rating"] = data["user_rating"].replace(to_replace = -1, value = np.nan)  # assign back instead of relying on inplace on a column
data = data.dropna(axis = 0)
print("Null values after final pre-processing :")
data.isna().sum().to_frame().T.style.set_properties(**{"background-color": "white","color":"black", "border": "1.5px  solid black"})

Null values after final pre-processing :


| | anime_id | name | genre | type | episodes | rating | members | user_id | user_rating |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
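
The same replace-then-drop pattern can be checked on a tiny made-up data frame (`toy` is hypothetical). This sketch assigns the result of `replace()` back to the column rather than modifying it in place:

```python
import numpy as np
import pandas as pd

# Made-up ratings: -1 marks "watched but not rated"
toy = pd.DataFrame({"user_id": [1, 1, 2], "user_rating": [8, -1, 10]})

toy["user_rating"] = toy["user_rating"].replace(-1, np.nan)  # -1 becomes missing
toy = toy.dropna(axis=0)                                     # drop rows with any missing value

print(toy)
print(toy.isna().sum())
```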


We can move on to figuring out how many anime each user reviewed, using the function `.value_counts()`:

In [35]:
selected_users = data["user_id"].value_counts()
selected_users

user_id
42635    3747
53698    2905
57620    2689
59643    2632
51693    2620
         ... 
39633       1
21355       1
12346       1
7624        1
2680        1
Name: count, Length: 69600, dtype: int64

There are $69,600$ users in the dataset. While some of the users reviewed as many as $2,000$ or $3,000$ anime, there are some that reviewed as few as a single anime.

It's likely that the more users we retain in the dataset, the better our recommendation systems might ultimately perform. However, retaining users who only rated a single anime might not benefit our recommendations; how much can we trust users who have only ever rated one anime?

As a tradeoff, and to keep the data small enough to process in a reasonable amount of time, we will retain only those users with a relatively large number of ratings. More specifically, we'll retain only those users who rated $500$ anime or more.

In [36]:
data = data[data["user_id"].isin(selected_users[selected_users >= 500].index)]

We can then pivot this data frame, using `.pivot_table()`, so that every row represents an anime and every column a user ID, with each cell holding that user's rating of that anime. We use `.fillna()` to specify that any anime a specific user hasn't rated should receive the value $0$.
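
As a small illustration of the reshaping step, here is a sketch on three made-up ratings (`toy` and `toy_pivot` are hypothetical names):

```python
import pandas as pd

# Made-up long-format ratings: one row per (anime, user) pair
toy = pd.DataFrame({
    "name": ["A", "A", "B"],
    "user_id": [1, 2, 1],
    "user_rating": [9, 7, 5],
})

# Rows become anime, columns become users; unrated pairs are filled with 0
toy_pivot = toy.pivot_table(index="name", columns="user_id", values="user_rating").fillna(0)
print(toy_pivot)
```

Here user 2 never rated `B`, so that cell becomes `0`.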

We can then look at the first five rows to confirm this code worked:

In [37]:
data_pivot_temp = data.pivot_table(index="name",columns="user_id",values="user_rating").fillna(0)
data_pivot_temp.head(5)

| name / user_id | 226 | 271 | 294 | 392 | 446 | 478 | 661 | 741 | 771 | 786 | ... | 73234 | 73272 | 73286 | 73340 | 73356 | 73362 | 73378 | 73395 | 73499 | 73502 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `&quot;0&quot;` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `&quot;Bungaku Shoujo&quot; Kyou no Oyatsu: Hatsukoi` | 0.0 | 0.0 | 7.0 | 0.0 | 0.0 | 9.0 | 0.0 | 6.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 |
| `&quot;Bungaku Shoujo&quot; Memoire` | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 9.0 | 0.0 | 9.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 |
| `&quot;Bungaku Shoujo&quot; Movie` | 0.0 | 0.0 | 8.0 | 4.0 | 9.0 | 0.0 | 0.0 | 9.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 9.0 | 9.0 | 0.0 | 8.0 | 0.0 | 9.0 | 0.0 | 10.0 |
| `&quot;Eiji&quot;` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


We can see something a little unusual in the `name` column. Many of the anime names contain HTML entities (such as `&quot;`) and other special characters. Let's try to remove those so that the name text is more legible. We'll use a function:

In [38]:
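def text_cleaning(text):
    # Remove HTML entities and other markup left over in the titles
    text = re.sub(r'&quot;', '', text)      # escaped double quotes
    text = re.sub(r'\.hack//', '', text)    # literal ".hack//" prefix (the dot is escaped)
    text = re.sub(r'&#039;', '', text)      # escaped apostrophes
    # The next two patterns can no longer match: every &#039; was removed just above
    text = re.sub(r'A&#039;s', '', text)
    text = re.sub(r'I&#039;', 'I\'', text)
    text = re.sub(r'&amp;', 'and', text)    # "&" becomes "and"
    
    return text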
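
For comparison only, and not what this lab does, Python's standard library can decode HTML entities directly with `html.unescape()`. A minimal sketch follows; the third title is made up for illustration. Note that this restores the quotes, ampersands, and apostrophes instead of removing them:

```python
import html

raw_titles = [
    "&quot;Bungaku Shoujo&quot; Movie",
    "009-1: R&amp;B",
    "Someone&#039;s Story",   # made-up title, used only to show the apostrophe entity
]

# Decode each entity back to the character it stands for
decoded = [html.unescape(title) for title in raw_titles]
for before, after in zip(raw_titles, decoded):
    print(f"{before!r} -> {after!r}")
```

Whichever cleaning approach is used, it should be applied to every data frame whose names will later be matched against each other.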

We then need to apply that function to our new data frame, `data`, and to our original data frame, `anime`, so that our matching will go smoothly later on:

In [39]:
anime["name"] = anime["name"].apply(text_cleaning)

In [40]:
data["name"] = data["name"].apply(text_cleaning)

Now we can see how it looks after cleaning the names:

In [41]:
data_pivot = data.pivot_table(index="name",columns="user_id",values="user_rating").fillna(0)

data_pivot.head(5)

user_id,226,271,294,392,446,478,661,741,771,786,...,73234,73272,73286,73340,73356,73362,73378,73395,73499,73502
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
009 Re:Cyborg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
009-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
009-1: RandB,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Perfect. We can move on to implementing recommender systems!

#### Collaborative Recommender

The underlying assumption of a collaborative recommender system is that users who have the same or similar opinions are more likely to agree on other things than users chosen at random. Suppose you like a certain anime, and you want to figure out what other anime you might enjoy watching, based on the anime that you enjoyed. Using collaborative filtering, we might suppose that other users with similar tastes would be a good source of recommendations. We search a large group of users to find a smaller set of users with similar tastes and combine the anime that those users liked into a ranked list of suggestions.

In a recommendation system based solely on collaborative filtering, it is important to remember that similarity is **not** calculated using other factors (like user age, anime genre, or literally any other information about users or items themselves). Similarity is calculated solely on the basis of rating (either explicit or implicit) of the item. For example, regardless of whether two users are vastly different in terms of age, gender, etc., they can be considered similar if they give the same ratings to anime.

There are many, many ways to decide which users are similar and combine their choices. We'll use a popular one called **cosine similarity**. 

Before we run an algorithm on the data, we should recognize that the data, in its current format, is very sparse, meaning that most of its entries are zeros, as you can see in the preview above. In other words, most anime were not rated by most users. We'll create what is called a "compressed sparse row (CSR)" matrix using the `scipy.sparse` module, which is designed to handle sparse matrices efficiently. The CSR format stores only the non-zero values (along with their positions), so it uses far less memory and is optimized for quick row-based operations. It's useful when the dataset contains a lot of zeros and you want to reduce memory usage, or when you want to use similarity measures that can handle sparse data efficiently.

In [42]:
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import random

data_matrix = csr_matrix(data_pivot.values)
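
To make the CSR idea concrete, here is a small sketch on a made-up $3 \times 3$ matrix (`dense` and `sparse` are hypothetical names). Only the two non-zero entries are stored, together with their column positions and a pointer array marking where each row starts:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up ratings matrix: mostly zeros
dense = np.array([[0, 0, 9],
                  [0, 0, 0],
                  [7, 0, 0]], dtype=float)

sparse = csr_matrix(dense)

print(sparse.nnz)       # number of stored (non-zero) entries
print(sparse.data)      # the non-zero values, row by row
print(sparse.indices)   # the column index of each stored value
print(sparse.indptr)    # where each row starts and ends within `data`
```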

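Cosine similarity is particularly useful in high-dimensional spaces (like this one), where the magnitude of a vector is not necessarily as important as the angle between vectors. Cosine similarity values range from $-1$ to $1$, where $1$ means two vectors are perfectly aligned (high similarity), $0$ means the vectors are orthogonal (no similarity), and $-1$ means they are diametrically opposed. Because our ratings are never negative, the similarities here fall between $0$ and $1$. Note that `scikit-learn`'s `"cosine"` metric reports the cosine *distance*, $1 - \text{similarity}$, so smaller values mean more similar anime.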
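
For two rating vectors $u$ and $v$, cosine similarity is $\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The sketch below computes it by hand on three made-up vectors and compares the result with `scikit-learn`'s `cosine_distances()`; the helper `cosine_similarity()` and the vectors `a`, `b`, and `c` are hypothetical:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up rating vectors: b has the same direction as a, c is orthogonal to a
a = np.array([10.0, 0.0, 8.0])
b = np.array([5.0, 0.0, 4.0])
c = np.array([0.0, 9.0, 0.0])

print(cosine_similarity(a, b))          # same direction, different magnitude
print(cosine_similarity(a, c))          # no overlap in rated items
print(cosine_distances([a], [b, c]))    # scikit-learn reports 1 - similarity
```

Vector `b` has half the magnitude of `a` but points the same way, so the two count as maximally similar; this is why magnitude matters less than direction here.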

The next code chunk initializes a *k*-nearest neighbors model, using cosine distance as the metric. We use the `brute` algorithm, a straightforward, exhaustive approach that computes the distance from the query to every other point. Brute force is the usual choice here because `scikit-learn`'s tree-based algorithms (`kd_tree` and `ball_tree`) do not support the cosine metric or sparse input.

In [43]:
model_knn = NearestNeighbors(metric = "cosine", algorithm = "brute")
model_knn.fit(data_matrix)

Now we want to demonstrate that the model works. We'll choose an anime at random with NumPy, first setting a seed so that the selection will be reproducible. The anime we choose is in row 8073 of `data_pivot` (the number printed below is this row position, not a value from the `anime_id` column): "Super Mario Brothers: Peach-hime Kyuushutsu Daisakusen!", or "Super Mario Bros.: The Great Mission to Rescue Princess Peach!" [Looking this up,](https://en.wikipedia.org/wiki/Super_Mario_Bros.:_The_Great_Mission_to_Rescue_Princess_Peach!) we find that it is a 1986 Japanese animated adventure comedy film, based on the video game Super Mario Bros.

We then use our `model_knn` to find the nearest neighbors to this anime, using cosine distance. We ask the model for the six nearest neighbors; the closest one is the query anime itself, which leaves five recommendations.

In [44]:
np.random.seed(62)
query_no = np.random.choice(data_pivot.shape[0]) # random anime title and finding recommendation
print(f"We will find recommendations for anime ID number {query_no}, which is {data_pivot.index[query_no]}.")
distances, indices = model_knn.kneighbors(data_pivot.iloc[query_no,:].values.reshape(1, -1), n_neighbors = 6)

We will find recommendations for anime ID number 8073, which is Super Mario Brothers: Peach-hime Kyuushutsu Daisakusen!.


Now we'll organize the recommendations more neatly for viewing the results. The code above has already fit the model and located the neighbors; the code below simply prints the five recommended shows or movies, in order of distance (with the most similar first):

In [45]:
no = []
name = []
distance = []
rating = []  # note: this rebinds `rating`, which held the ratings data frame read in earlier

# Flatten once instead of on every loop iteration
flat_distances = distances.flatten()
flat_indices = indices.flatten()

print(f"Recommendations for {data_pivot.index[query_no]} viewers :\n")

# Position 0 is the query anime itself, so start at 1
for i in range(1, len(flat_distances)):
    rec_name = data_pivot.index[flat_indices[i]]
    print(f"{i}: {rec_name} , with a distance of {flat_distances[i]}")
    no.append(i)
    name.append(rec_name)
    distance.append(flat_distances[i])
    rating.append(*anime[anime["name"] == rec_name]["rating"].values)

Recommendations for Super Mario Brothers: Peach-hime Kyuushutsu Daisakusen! viewers :

1: Amada Anime Series: Super Mario Brothers , with a distance of 0.3953048996048839
2: Bulsajo Robot Phoenix King , with a distance of 0.6238361194025348
3: Gegege no Kitarou: Nippon Bakuretsu , with a distance of 0.6254538278943385
4: Super Mario World: Mario to Yoshi no Bouken Land , with a distance of 0.6325332026784489
5: New Dream Hunter Rem: Yume no Kishitachi , with a distance of 0.6472771678176812


It also might be useful for users to know which of these recommended anime were rated most highly. We can find that using the following code (and also format the results more neatly):

In [46]:
dic = {"No" : no, "Anime Name" : name, "Rating" : rating}
recommendation = pd.DataFrame(data = dic)
recommendation.set_index("No", inplace = True)
recommendation.style.set_properties(**{"background-color": "white","color":"black","border": "1.5px  solid black"})

| No | Anime Name | Rating |
|---|---|---|
| 1 | Amada Anime Series: Super Mario Brothers | 5.15 |
| 2 | Bulsajo Robot Phoenix King | 3.11 |
| 3 | Gegege no Kitarou: Nippon Bakuretsu | 6.48 |
| 4 | Super Mario World: Mario to Yoshi no Bouken Land | 6.02 |
| 5 | New Dream Hunter Rem: Yume no Kishitachi | 6.35 |


The most similar result is ["Amada Anime Series: Super Mario Brothers"](https://www.mariowiki.com/Amada_Anime_Series:_Super_Mario_Bros). This is a series of short Japanese animated stories that are **also** set in the universe of Super Mario Brothers. It is certainly plausible to think that a fan of The Great Mission to Rescue Princess Peach would be a fan of these as well.
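
The lookup steps above can be wrapped into a single function. The sketch below is one possible way to do that, shown on a made-up $4 \times 4$ ratings table so that it runs on its own; the helper `recommend_by_title()` and the names `toy_pivot` and `toy_model` are hypothetical and not part of the lab code:

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Made-up item-by-user ratings (rows are anime, columns are users)
toy_pivot = pd.DataFrame(
    [[10, 9, 0, 0],
     [9, 10, 0, 1],
     [0, 0, 8, 9],
     [0, 1, 9, 8]],
    index=["Mario A", "Mario B", "Robot A", "Robot B"],
    columns=[1, 2, 3, 4],
    dtype=float,
)
toy_model = NearestNeighbors(metric="cosine", algorithm="brute")
toy_model.fit(csr_matrix(toy_pivot.values))

def recommend_by_title(title, pivot, model, n_recs=2):
    """Return (name, cosine distance) pairs for the n_recs nearest other items."""
    row = pivot.index.get_loc(title)
    # Ask for one extra neighbor, because the query item is returned as well
    distances, indices = model.kneighbors(pivot.iloc[[row]].values, n_neighbors=n_recs + 1)
    recs = [(pivot.index[i], d)
            for d, i in zip(distances.flatten(), indices.flatten())
            if i != row]
    return recs[:n_recs]

recs = recommend_by_title("Mario A", toy_pivot, toy_model)
for rec_name, rec_distance in recs:
    print(f"{rec_name}: distance {rec_distance:.3f}")
```

Looking the anime up by title rather than by row position makes it easier to try specific anime in the exercises below.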

#### <span style="color: red;">Exercises</span>

1. Select another anime at random and generate recommendations based on that anime.
2. Assess the quality of those recommendations.
3. Try using Minkowski distance instead of cosine distance. What changes?

#### Content-Based Recommender

Let's try a different approach. It would be nice if we could incorporate some additional information about the anime in this database, since we do have some; if you remember the beginning of this lab, the `anime.csv` data file contains the genres of each anime, whether the anime is a movie or TV show, and the number of episodes. We'll take the `genre` column.

We can now incorporate some of our knowledge about processing text data! Rather than trying to determine how to dummy-code or feature hash the `genre` variable, we can treat it as a column of text data and handle it with natural language processing. In fact, we could even use something like Word2Vec here, if we wanted. It is probably simpler to calculate something like TF-IDF, though. And the `scikit-learn` library has a convenient tool for this purpose, `TfidfVectorizer`.

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents="unicode", analyzer="word",
                      token_pattern=r"\w{1,}", ngram_range=(1, 3), stop_words="english")

The code above initializes an instance of `TfidfVectorizer` with several specific parameters:

- `min_df=3` ignores any terms that appear in fewer than 3 documents; this filters out rare terms that might add noise.
- `max_features=None` uses all terms that meet the `min_df` criterion, without further limiting the number of features.
- `strip_accents="unicode"` strips the accent marks from characters, which can help normalize text with differing accent patterns.
- `analyzer="word"` specifies that the analysis should be done at the word level, not the character level.
- `token_pattern` specifies a regular expression that defines what constitutes a "token" (word). Here, `\w{1,}` means any sequence of one or more word characters (letters, digits, or underscores) constitutes a word.
- `ngram_range=(1, 3)` considers unigrams, bigrams, and trigrams (groups of 1 to 3 words). This allows the model to capture sequences of up to three words.
- `stop_words="english"` removes common English stop words.

The next chunk of code creates a copy of our original dataset to work with, then removes any duplicate entries, keeping only one record per anime (`keep = "first"`). `inplace = True` modifies `rec_data` directly rather than returning a new data frame.

In [48]:
rec_data = fulldata.copy()
rec_data.drop_duplicates(subset = "name", keep = "first", inplace = True)

The next line resets the index of `rec_data`, giving it a new sequential index from $0$ to $n - 1$, where $n$ is the number of rows after duplicates are removed:

In [49]:
rec_data.reset_index(drop = True, inplace = True)

We then take the `genre` column, parse it with `str.split`, specifying that genres are separated by commas, and convert the resulting list to a string again, so that it will be compatible with `TfidfVectorizer` (which expects strings as input).

In [50]:
genres = rec_data["genre"].str.split(", | , | ,").astype(str)
tfv_matrix = tfv.fit_transform(genres)

Now we load the sigmoid kernel from `scikit-learn`. This is another way of computing similarity: for two vectors $x$ and $y$, `sigmoid_kernel` returns $\tanh(\gamma \, x^\top y + c_0)$, an S-shaped transformation of their dot product. It is sometimes used in recommender systems as an alternative to cosine similarity. We calculate pairwise similarities between all items in `tfv_matrix`, producing the matrix of similarities `sig`, or $S$. $S$ is a square matrix where rows and columns represent items (here, anime) and each cell $(i, j)$ contains the similarity between item $i$ and item $j$ according to the sigmoid kernel.
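
As a quick check of what the kernel computes, the sketch below compares `sigmoid_kernel()` with a manual $\tanh$ calculation on a made-up matrix `X` with unit-length rows. It relies on `scikit-learn`'s default parameters, `gamma = 1 / n_features` and `coef0 = 1`:

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

# Made-up feature vectors with unit length, similar to rows of a TF-IDF matrix
X = np.array([[1.0, 0.0, 0.0],
              [0.6, 0.8, 0.0],
              [0.0, 0.0, 1.0]])

gamma = 1.0 / X.shape[1]                    # scikit-learn's default gamma
manual = np.tanh(gamma * (X @ X.T) + 1.0)   # default coef0 is 1
auto = sigmoid_kernel(X, X)

print(np.allclose(manual, auto))
print(auto.round(3))
```

With these defaults, an item's similarity to itself is the largest value in its row, but it is not equal to $1$.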

We then use the `pandas` function `Series()` to map anime names to their indices:

In [51]:
from sklearn.metrics.pairwise import sigmoid_kernel

sig = sigmoid_kernel(tfv_matrix, tfv_matrix)      # Computing sigmoid kernel

rec_indices = pd.Series(rec_data.index, index = rec_data["name"]).drop_duplicates()

Next we define a function, `give_recommendation()`, that takes as input the title of an anime and should return the top 10 recommendations. It uses the $S$ matrix we calculated previously.

First, it retrieves the row index in `rec_data` for the specified anime. Then it retrieves all the similarity scores for the specified anime at row `idx` using `sig[idx]`, and orders the similarity scores in descending order, so that the most similar anime appear first in the list. It extracts the top 10 most similar (`sig_score[1:11]`), excluding the first item, `sig_score[0]`, which is normally the anime itself (an item is at least as similar to itself as to any other item, although its sigmoid kernel value is not exactly `1`).

It then creates a list of the top 10 anime and a dictionary, `rec_dic`, with three keys:

- `"No"`: a range from 1 to 10 for numbering
- `"Anime Name"`: the name of the recommended anime
- `"Rating"`: the average rating for the recommended anime

In [52]:
# Recommendation Function
def give_recommendation(title, sig = sig):
    
    idx = rec_indices[title] # Getting the row index corresponding to title

    sig_score = list(enumerate(sig[idx]))  # Getting pairwise similarity scores 
    sig_score = sorted(sig_score, key=lambda x: x[1], reverse=True)
    sig_score = sig_score[1:11]  # Skipping the first entry, normally the anime itself
    anime_indices = [i[0] for i in sig_score]
     
    # Top 10 most similar anime
    # Caution: anime_indices are row positions in rec_data (the rows of sig), but they are
    # looked up in anime here; this is only correct where the two frames share row order
    rec_dic = {"No" : range(1,11), 
               "Anime Name" : anime["name"].iloc[anime_indices].values,
               "Rating" : anime["rating"].iloc[anime_indices].values}
    dataframe = pd.DataFrame(data = rec_dic)
    dataframe.set_index("No", inplace = True)
    
    print(f"Recommendations for {title} viewers :\n")
    
    return dataframe.style.set_properties(**{"background-color": "white","color":"black","border": "1.5px  solid black"})
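
The core of this function is a top-$k$ lookup in one row of a similarity matrix. The sketch below shows that step on a made-up $4 \times 4$ similarity matrix, using `np.argsort()` rather than sorting a list of tuples. The names `toy_items`, `toy_sim`, `toy_lookup`, and `top_k_similar()` are hypothetical. The sketch reads the recommended rows from the same table that defines the rows of the similarity matrix, so that positions always line up:

```python
import numpy as np
import pandas as pd

# Made-up items; row i of toy_sim corresponds to row i of toy_items
toy_items = pd.DataFrame({"name": ["A", "B", "C", "D"],
                          "rating": [8.0, 7.5, 6.0, 9.1]})
toy_sim = np.array([[1.0, 0.9, 0.1, 0.4],
                    [0.9, 1.0, 0.2, 0.3],
                    [0.1, 0.2, 1.0, 0.8],
                    [0.4, 0.3, 0.8, 1.0]])
toy_lookup = pd.Series(toy_items.index, index=toy_items["name"])

def top_k_similar(title, k=2):
    """Return the k items most similar to `title`, excluding the item itself."""
    idx = toy_lookup[title]
    order = np.argsort(-toy_sim[idx])    # positions sorted from most to least similar
    order = order[order != idx][:k]      # drop the query item, keep the top k
    return toy_items.iloc[order][["name", "rating"]].reset_index(drop=True)

print(top_k_similar("A"))
```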

We can then test out the performance of this recommendation system. Suppose I wanted to know what my husband should watch, given that he enjoyed a specific anime and wants to find others he might like. My husband enjoyed the anime "Haikyuu!!":

In [53]:
give_recommendation("Haikyuu!!")

Recommendations for Haikyuu!! viewers :



| No | Anime Name | Rating |
|---|---|---|
| 1 | Haikyuu!! Second Season | 8.93 |
| 2 | Ansatsu Kyoushitsu (TV) 2nd Season | 8.68 |
| 3 | Ghost in the Shell: Stand Alone Complex 2nd GIG | 8.57 |
| 4 | Hidamari Sketch x ☆☆☆ Specials | 7.94 |
| 5 | Jigoku Shoujo Mitsuganae | 7.81 |
| 6 | Onigamiden | 6.64 |
| 7 | Within the Bloody Woods | 3.65 |
| 8 | Ikkitousen | 6.62 |
| 9 | Koe no Katachi | 9.05 |
| 10 | Great Teacher Onizuka | 8.77 |


The top recommendation is the second season of that same anime, which makes a lot of sense!

For fun, I showed my husband these recommendations and asked his opinion. I had been hoping this function would recommend some of the other series he's enjoyed; instead, he said that he's never heard of most of these recommendations, but that the recommendations are "no worse than the ones he gets from Netflix," so in retrospect this performed relatively well!

Let's try two well-known anime:

In [54]:
give_recommendation("Cowboy Bebop")

Recommendations for Cowboy Bebop viewers :



| No | Anime Name | Rating |
|---|---|---|
| 1 | Bungaku Shoujo Memoire | 7.54 |
| 2 | Rikujou Bouei-tai Mao-chan | 6.4 |
| 3 | Pandora Hearts Specials | 7.49 |
| 4 | Seirei Tsukai no Blade Dance | 7.15 |
| 5 | Kamigami no Asobi | 7.32 |
| 6 | Hagure Yuusha no Aesthetica: Hajirai Ippai | 6.75 |
| 7 | Nijuu Mensou no Musume | 7.65 |
| 8 | Switch | 7.12 |
| 9 | Pokemon Omega Ruby and Alpha Sapphire: Mega Special Animation | 7.04 |
| 10 | Doraemon Movie 35: Nobita no Space Heroes | 6.91 |


In [55]:
give_recommendation("Death Note")

Recommendations for Death Note viewers :



| No | Anime Name | Rating |
|---|---|---|
| 1 | Hachimitsu to Clover Specials | 7.85 |
| 2 | Trapp Ikka Monogatari | 7.75 |
| 3 | Major S1 | 8.42 |
| 4 | Hakkenden: Touhou Hakken Ibun | 7.57 |
| 5 | Nazotokine | 4.86 |
| 6 | ef: A Tale of Melodies. | 8.18 |
| 7 | Saki Achiga-hen: Episode of Side-A Specials | 7.63 |
| 8 | One Piece: Oounabara ni Hirake! Dekkai Dekkai Chichi no Yume! | 7.43 |
| 9 | Kizumonogatari II: Nekketsu-hen | 8.73 |
| 10 | Go! Princess Precure Movie: Go! Go!! Gouka 3-bondate!!! | 6.89 |


You can go through these recommendations, if you're interested, and see what you think. Or try putting in one of your own favorites and see what recommendations you receive!

#### <span style="color: red;">Exercises</span>

4. Compare the results generated for the three anime above when you do not remove stop words. What do you think are the pros and cons of removing them?

### Building a Book Recommendation System

Now we'll take a look at a dataset of popular books and their reviews. This comes from Kaggle; you can explore more [information about the dataset here.](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset) It was collected by Cai-Nicolas Ziegler in a 4-week crawl (from August to September, 2004) of the [Book-Crossing community.](https://www.bookcrossing.com/?) It contains $278,858$ users (anonymized with demographic information) providing $1,149,780$ ratings (explicit and implicit) about $271,379$ books.

The dataset technically consists of three files (the third containing demographic information about the users), but we'll work with just two, `book.csv` and `book_rating.csv`. 

`book.csv` contains the following columns:

- `ISBN`: each book has a unique ISBN number that functions as an ID
- `Book-Title`: book title
- `Book-Author`: book author
- `Year-of-Publication`: year of publication
- `Image-URL-S`, `Image-URL-M`, `Image-URL-L`: Amazon cover images

We'll read in the data and retain the three columns that we'll use -- title, ISBN, and author:

In [58]:
df_books = pd.read_csv('lab-8/book.csv')
# Let's remove the unwanted columns:
df_books = df_books[['ISBN', 'Book-Title', 'Book-Author']]

# And we'll look at the results:
df_books.head().style.set_properties(**{"background-color": "white","color":"black","border": "1.5px  solid black"})

| | ISBN | Book-Title | Book-Author |
|---|---|---|---|
| 0 | 195153448 | Classical Mythology | Mark P. O. Morford |
| 1 | 2005018 | Clara Callan | Richard Bruce Wright |
| 2 | 60973129 | Decision in Normandy | Carlo D'Este |
| 3 | 374157065 | Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It | Gina Bari Kolata |
| 4 | 393045218 | The Mummies of Urumchi | E. J. W. Barber |


Now we'll read in the ratings information. `book_rating.csv` contains the following:

- `User-ID`: ID number for the user providing the review
- `ISBN`: unique ISBN number for the book being reviewed
- `Book-Rating`: number from 0-10, with higher values representing higher ratings; in the Book-Crossing data, 0 marks an implicit rating rather than an explicit score

In [57]:
df_ratings = pd.read_csv('lab-8/book_rating.csv')
df_ratings.head().style.set_properties(**{"background-color": "white","color":"black","border": "1.5px  solid black"})

| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


We can check for any missing data in the `df_books` data frame:

In [59]:
df_books.isnull().sum()

ISBN           0
Book-Title     0
Book-Author    2
dtype: int64

Only two rows have a missing value (both in `Book-Author`), which is a very small proportion, so we'll drop them:

In [60]:
df_books.dropna(inplace=True)
df_books.isnull().sum()

ISBN           0
Book-Title     0
Book-Author    0
dtype: int64

Let's see how many books were reviewed by each user. We may not want to retain users who rated only a small number of books:

In [61]:
# Calculate the count of ratings given by each user and store it in the 'ratings' Series
ratings = df_ratings['User-ID'].value_counts()
# Sort the 'ratings' Series in descending order based on the counts of user IDs
ratings.sort_values(ascending=False).head()

User-ID
11676     13602
198711     7550
153662     6109
98391      5891
35859      5850
Name: count, dtype: int64

Some users rated an impressively large number of books! Let's see how many rated fewer than 200 books:

In [62]:
len(ratings[ratings < 200])

104378

We remove the users who rated fewer than 200 books from the dataset and check the dimensions again:

In [63]:
df_ratings_rm = df_ratings[
  ~df_ratings['User-ID'].isin(ratings[ratings < 200].index)
]
df_ratings_rm.shape

(527556, 3)

We also might want to look into how many ratings each of the individual books has received:

In [64]:
ratings = df_ratings['ISBN'].value_counts() 
ratings.sort_values(ascending=False).head()

ISBN
0971880107    2502
0316666343    1295
0385504209     883
0060928336     732
0312195516     723
Name: count, dtype: int64

Some books got quite a few ratings -- we have an outlier that was rated $2,502$ times, for example. Let's see how many were rated fewer than $100$ times:

In [65]:
len(ratings[ratings < 100])

339825

We could choose to keep these in the dataset, or we could lower our threshold from $100$ to, say, $50$, or even to $25$ or lower. Removing them reduces the number of books we can recommend, but it may also make the models work better. We'll remove them, first checking how many of the books in `df_books` fall below the threshold:

In [66]:
df_books['ISBN'].isin(ratings[ratings < 100].index).sum()

np.int64(269422)

In [67]:
df_ratings_rm = df_ratings_rm[
  ~df_ratings_rm['ISBN'].isin(ratings[ratings < 100].index)
]
df_ratings_rm.shape

(49781, 3)
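
The two filtering steps can be traced on a small made-up ratings table. In this sketch the thresholds are set to 2 purely for illustration (the lab uses 200 and 100), and `toy`, `user_counts`, `book_counts`, and `kept` are hypothetical names. As in the lab code, the book counts come from the full table rather than from the user-filtered one:

```python
import pandas as pd

# Made-up ratings: user 3 rated one book, and book "c" was rated once
toy = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2, 3],
    "ISBN": ["a", "b", "c", "a", "b", "a"],
    "Book-Rating": [5, 0, 7, 8, 0, 10],
})

user_counts = toy["User-ID"].value_counts()
book_counts = toy["ISBN"].value_counts()

# ~ negates the mask: keep rows whose user or book is NOT below the threshold
kept = toy[~toy["User-ID"].isin(user_counts[user_counts < 2].index)]
kept = kept[~kept["ISBN"].isin(book_counts[book_counts < 2].index)]

print(kept)
```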

And then we can look at the ratings dataset again:

In [68]:
df_ratings_rm

| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 1456 | 277427 | 002542730X | 10 |
| 1469 | 277427 | 0060930535 | 0 |
| 1471 | 277427 | 0060934417 | 0 |
| 1474 | 277427 | 0061009059 | 9 |
| 1484 | 277427 | 0140067477 | 0 |
| ... | ... | ... | ... |
| 1147304 | 275970 | 0804111359 | 0 |
| 1147436 | 275970 | 140003065X | 0 |
| 1147439 | 275970 | 1400031346 | 0 |
| 1147440 | 275970 | 1400031354 | 0 |


Let's test it out by looking up a few books that are in the dataset and checking how many ratings each of them has in the filtered data. The first two books are popular reading in high school English classes; the third is by Stephen King, but it's a shorter story that's somewhat less popular than many of his others, so it makes sense that it has fewer ratings:

In [69]:
books = ["To Kill a Mockingbird", "The Catcher in the Rye", "The Girl Who Loved Tom Gordon"]

for book in books:
    print(df_ratings_rm['ISBN'].isin(df_books[df_books['Book-Title'] == book]['ISBN']).sum())

139
88
67


Then we pivot the data so that each row is a book and each column is a user, with the values in the cells representing the ratings of each book by each user. `fillna(0)` replaces missing values with $0$ (in other words, if a user hasn't rated a book, that cell is set to $0$).

In [70]:
df = df_ratings_rm.pivot_table(index=['User-ID'],columns=['ISBN'],values='Book-Rating').fillna(0).T
df.head()

User-ID,254,2276,2766,2977,3363,4017,4385,6242,6251,6323,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
ISBN,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
002542730X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0
0060008032,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0060096195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
006016848X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0060173289,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
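
To make the reshaping step concrete, here is a minimal sketch on a made-up table with three users and three books. The name `toy_ratings` and all of its values are invented for illustration; only the column names match the real data.

```python
import pandas as pd

# Toy ratings in the same long format as df_ratings_rm (values are made up)
toy_ratings = pd.DataFrame({
    "User-ID":     [1, 1, 2, 3],
    "ISBN":        ["A", "B", "A", "C"],
    "Book-Rating": [10, 0, 7, 9],
})

# One row per user, one column per book; missing pairs become NaN, then 0
toy_matrix = (
    toy_ratings
    .pivot_table(index="User-ID", columns="ISBN", values="Book-Rating")
    .fillna(0)
    .T  # transpose so that each row is a book and each column is a user
)
print(toy_matrix)
```

Note that after `fillna(0)`, a rating that was recorded as $0$ (user 1, book B) looks exactly the same as a rating that was never given (user 2, book B). That is worth keeping in mind when interpreting the zeros in the real table.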


#### Collaborative Recommender

We update the index of this dataset with the book title, so that information is now attached:

In [71]:
df.index = df.join(df_books.set_index('ISBN'))['Book-Title']

And then double-check that it looks good:

In [72]:
df = df.sort_index()
df.head()

Book-Title \ User-ID,254,2276,2766,2977,3363,4017,4385,6242,6251,6323,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We could then pull out any specific book from this table and get all its user ratings. Take, for example, *The Queen of the Damned*, a vampire novel by Anne Rice. You can see that none of the first 5 users has a nonzero rating for this book:

In [73]:
df.loc["The Queen of the Damned (Vampire Chronicles (Paperback))"][:5]

User-ID
254     0.0
2276    0.0
2766    0.0
2977    0.0
3363    0.0
Name: The Queen of the Damned (Vampire Chronicles (Paperback)), dtype: float64

Just like we did with the anime data, we'll initialize a *k*-nearest neighbors model, using cosine distance (one minus cosine similarity) as the metric and the brute-force algorithm:

In [74]:
model = NearestNeighbors(metric='cosine', algorithm='brute')
model.fit(df.values)

Let's look at the results for that same book. If we pull out the 5 nearest books in terms of cosine distance, what do we get? Notice that we set `n_neighbors=6` to get 5 recommendations: the closest match is always the original book itself, at a distance of essentially $0$.

In [75]:
title = 'The Queen of the Damned (Vampire Chronicles (Paperback))'
distance, indice = model.kneighbors([df.loc[title].values], n_neighbors=6)

print(distance)
print(indice)

[[1.11022302e-16 5.17841186e-01 5.37633845e-01 7.34506886e-01
  7.44865700e-01 7.93983542e-01]]
[[612 660 648 272 667 110]]
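
To see where these numbers come from, here is a small sketch that computes cosine distance by hand with NumPy. The three rating vectors are made up, and `cosine_distance` is a helper written for this illustration rather than something used elsewhere in the lab.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up rating vectors for three books across four users
book_x = [10, 0, 8, 0]
book_y = [5, 0, 4, 0]    # same direction as book_x, half the magnitude
book_z = [0, 9, 0, 7]    # rated by a disjoint set of users

print(cosine_distance(book_x, book_x))  # a book compared with itself
print(cosine_distance(book_x, book_y))  # proportional ratings
print(cosine_distance(book_x, book_z))  # no users in common
```

The first two distances are zero up to floating-point rounding, which is why the real output above shows a tiny value like `1.11e-16` instead of an exact `0` for the book itself. Two books with no raters in common sit at a distance of $1$.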


We can pull out the book titles:

In [76]:
df.iloc[indice[0]].index.values

array(['The Queen of the Damned (Vampire Chronicles (Paperback))',
       'The Vampire Lestat (Vampire Chronicles, Book II)',
       'The Tale of the Body Thief (Vampire Chronicles (Paperback))',
       'Interview with the Vampire',
       'The Witching Hour (Lives of the Mayfair Witches)', 'Catch 22'],
      dtype=object)

And print them more neatly:

In [82]:
pd.DataFrame({
    'title'   : df.iloc[indice[0]].index.values,
    'distance': distance[0]
}) \
.sort_values(by='distance', ascending=True)

Unnamed: 0,title,distance
0,The Queen of the Damned (Vampire Chronicles (P...,1.110223e-16
1,"The Vampire Lestat (Vampire Chronicles, Book II)",0.5178412
2,The Tale of the Body Thief (Vampire Chronicles...,0.5376338
3,Interview with the Vampire,0.7345069
4,The Witching Hour (Lives of the Mayfair Witches),0.7448657
5,Catch 22,0.7939835


Disregarding the shortest distance (the book itself), we see that recommendations 1 to 3 are other books in the same series by Anne Rice: *The Vampire Lestat* is book two, *The Tale of the Body Thief* is book four, and *Interview with the Vampire* is book one. *The Witching Hour* is another Anne Rice book, but not part of the same series.

The only somewhat unusual recommendation here is *Catch-22*, by Joseph Heller. It is the furthest away in terms of distance, to be fair.

We can write a function that takes the title of a book and returns a set of 5 recommendations similar to that book. (Note that, if you try this function, you must specify a book that exists in the current dataset, and remember that some books were filtered out if they had fewer than $100$ reviews.)

In [86]:
def get_recommends(title = ""):
  try:
    book = df.loc[title]
  except KeyError as e:
    print('The given book', e, 'does not exist')
    return

  distance, indice = model.kneighbors([book.values], n_neighbors=6)

  recommended_books = pd.DataFrame({
      'title'   : df.iloc[indice[0]].index.values,
      'distance': distance[0]
    }) \
    .sort_values(by='distance', ascending=True) \
    .head(6).values

  return [title, recommended_books]

In [87]:
books = get_recommends("The Girl Who Loved Tom Gordon")
books

['The Girl Who Loved Tom Gordon',
 array([['The Girl Who Loved Tom Gordon', 0.0],
        ['Desperation', 0.6545854230833272],
        ['The Tommyknockers', 0.6587462883314029],
        ['Needful Things', 0.6770011915640064],
        ['Dreamcatcher', 0.6938481175221921],
        ['Rose Madder', 0.7051887505359065]], dtype=object)]

Asking our recommender system for a Stephen King book like *The Girl Who Loved Tom Gordon* returns a list of five other Stephen King books!
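
One caveat: the sorted table earlier shows some titles more than once (for example, *1st to Die: A Novel*), presumably because several ISBNs share a title. For such a title, `df.loc[title]` returns several rows instead of one, so `get_recommends()` as written may not behave as intended. The sketch below shows the issue on a made-up matrix and one possible workaround; `toy` and `merged` are hypothetical names, and merging editions by taking the maximum rating is an assumption, not the only reasonable choice.

```python
import pandas as pd

# Toy book-by-user matrix with one title listed under two editions (made-up values)
toy = pd.DataFrame(
    [[9, 0, 0], [0, 8, 0], [0, 0, 7]],
    index=["Book A", "Book A", "Book B"],
    columns=[101, 102, 103],
)

print(type(toy.loc["Book A"]).__name__)  # two rows share this title
print(type(toy.loc["Book B"]).__name__)  # this title is unique

# One possible fix: merge editions, keeping each user's highest rating
merged = toy.groupby(level=0).max()
print(merged)
```

After merging, every title identifies exactly one row, so a lookup by title always yields a single rating vector.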

We can try a different author. What about Michael Crichton? If we ask the recommender system for books similar to *Jurassic Park*:

In [88]:
get_recommends("Jurassic Park")

['Jurassic Park',
 array([['Jurassic Park', 0.0],
        ['Rising Sun', 0.7093192255877424],
        ['Sphere', 0.713632857436233],
        ['The Body Farm', 0.7165718160273565],
        ['Airframe', 0.7254854232144845],
        ['The Lost World', 0.7287218514519678]], dtype=object)]

Interestingly, *The Lost World*, which is the furthest away in terms of distance here, is actually the sequel to *Jurassic Park*. To be fair, however, all five of these books are pretty close together, and **four** of them are other Michael Crichton books (*The Lost World*, *Airframe*, *Sphere*, and *Rising Sun*). *The Body Farm* is a Patricia Cornwell novel.

#### <span style="color: red;">Exercises</span>

5. Try using the `get_recommends()` function to look up a book you enjoyed. What books does it recommend to you? Have you read any of them?
6. What happens if you ask for a book that isn't in the dataset?

### Using Matrix Factorization to Find Similar Music

Both datasets that we've worked with so far have something in common: each has two data files, one with the items themselves and one with a set of user ratings for those items.

This is the most common situation you might face when designing a recommender system, but it's not what you will always encounter. Sometimes you may want to generate recommendations when you don't have explicit user ratings for items. In that case, you can use what are called **implicit** ratings. In other words, if a user rates songs on a scale from $1$ to $5$, where $1$ means they hated it and $5$ means they loved it, that is an example of an **explicit** rating. In the absence of explicit ratings, you might look at variables representing user behavior that are **implicit** indicators of their enjoyment.

For example, suppose that you want to design a system to make music recommendations, but you have no user reviews. You can use the **number of times users have played a song** instead, as a more **implicit** measure of enjoyment. There is a Python library specifically designed to handle these types of problems, called `implicit`.

The `implicit` library in Python includes code to access several popular recommender datasets in the `implicit.datasets` module. The following cells install the relevant libraries (if needed), then download the `lastfm` dataset locally and load it into memory:

In [92]:
import os
default_n_threads = 1
os.environ['OPENBLAS_NUM_THREADS'] = f"{default_n_threads}"
os.environ['MKL_NUM_THREADS'] = f"{default_n_threads}"
os.environ['OMP_NUM_THREADS'] = f"{default_n_threads}"
import sys
# Installing implicit and one of its dependencies; you might need to uncomment
# the following two lines the first time you run these code chunks
# !{sys.executable} -m pip install implicit
# !{sys.executable} -m pip install h5py

Collecting implicit
  Using cached implicit-0.7.2-cp311-cp311-manylinux2014_x86_64.whl.metadata (6.1 kB)
Using cached implicit-0.7.2-cp311-cp311-manylinux2014_x86_64.whl (8.9 MB)
Installing collected packages: implicit
Successfully installed implicit-0.7.2
Collecting h5py
  Using cached h5py-3.12.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.5 kB)
Using cached h5py-3.12.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.5 MB)
Installing collected packages: h5py
Successfully installed h5py-3.12.1


In [93]:
from implicit.datasets.lastfm import get_lastfm

Last.fm is a music discovery and streaming platform that also functions as a social network for music lovers.

In [94]:
artists, users, artist_user_plays = get_lastfm()

`artist_user_plays` is now a `scipy` sparse matrix, with each row corresponding to a different musician and each column to a different user. The non-zero entries in the matrix represent the number of times a given user has listened to a given artist. In other words, this is an example of using **implicit** reviews; if a user has listened to an artist several times, we can consider that evidence of their liking the artist.

Note that we only have **positive** examples of what users have interacted with. We *cannot* infer that, just because the user hasn't listened to an artist before, they don't like that specific artist.

The `artists` and `users` variables are arrays of string labels for each row and column in `artist_user_plays`.
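
If sparse matrices are new to you, the following sketch builds a tiny one with the same layout (artists in rows, users in columns) and lists its stored entries. All names and play counts here are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up play counts: rows are artists, columns are users
toy_artists = np.array(["artist a", "artist b", "artist c"])
toy_users = np.array(["user 1", "user 2"])
dense = np.array([
    [12, 0],
    [0, 3],
    [5, 40],
])
toy_plays = csr_matrix(dense)

print(toy_plays.shape)  # (number of artists, number of users)
print(toy_plays.nnz)    # number of stored (non-zero) play counts

# List each stored entry as (artist, user, plays)
coo = toy_plays.tocoo()
for row, col, plays in zip(coo.row, coo.col, coo.data):
    print(toy_artists[row], toy_users[col], plays)
```

Only the non-zero play counts are stored, which is what makes it feasible to hold a matrix with hundreds of thousands of rows and columns in memory.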

`implicit` provides implementations of several different algorithms for implicit feedback recommender systems. For this example, we’ll be looking at the `AlternatingLeastSquares` model, which is based on the paper [Collaborative Filtering for Implicit Feedback Datasets](http://yifanhu.net/PUB/cf.pdf). This model aims to learn a binary target of whether each user has interacted with each item, but weights each binary interaction by a confidence value reflecting how much we trust that user/item interaction. The implementation in `implicit` uses the values of a sparse matrix to represent the confidences, with non-zero entries marking the user/item pairs where an interaction occurred.

The first step in using this model is to transform the raw play counts from the original dataset into values that can be used as confidences. We want to give repeated plays more confidence in the model, but have this effect taper off as the number of repeated plays increases, to reduce the impact that a single superfan has on the model. Likewise, we want to direct some of the confidence weight away from popular items. To do this, we’ll use a [bm25](https://en.wikipedia.org/wiki/Okapi_BM25) weighting scheme inspired by classic information retrieval:

In [95]:
from implicit.nearest_neighbours import bm25_weight

# weight the matrix, both to reduce the impact of users who have played the
# same artist thousands of times and to reduce the weight given to popular items
artist_user_plays = bm25_weight(artist_user_plays, K1=100, B=0.8)

# get the transpose, since most of the functions in implicit expect
# (user, item) sparse matrices instead of (item, user)
user_plays = artist_user_plays.T.tocsr()
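
To build some intuition for the "taper off" behavior, here is a simplified sketch of the saturating term at the heart of BM25. It is not the library's code: the full BM25 scheme also includes a length normalization (controlled by `B`) and an inverse-frequency factor, both of which this sketch leaves out. The helper name `saturate` is made up for illustration.

```python
import numpy as np

def saturate(plays, K1=100.0):
    """Saturating term of BM25, ignoring length normalization and IDF."""
    plays = np.asarray(plays, dtype=float)
    return plays * (K1 + 1.0) / (K1 + plays)

plays = np.array([1, 10, 100, 1_000, 10_000])
for p, w in zip(plays, saturate(plays)):
    print(f"{p:>6} plays -> weight {w:.1f}")
```

Under this simplification the weight keeps growing with the play count but can never exceed `K1 + 1`, so a user with ten thousand plays counts for only about twice as much as a user with one hundred.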

Once we have a weighted confidence matrix, we can use that to train an ALS model using `implicit`. The code below trains the model. We specify `factors=64` (the number of latent factors or dimensions to use for the user and item feature vectors; a higher value increases the model's capacity to learn, potentially improving performance, but can also increase computation time and risk overfitting). We also set `regularization=0.05` (adds a penalty term, trying to prevent overfitting by discouraging overly complex factorization) and `alpha=2.0` (controlling the weight of confidence levels; higher values increase the importance of known user-item interactions).

Note that this code chunk may take a few minutes to run.

In [96]:
from implicit.als import AlternatingLeastSquares

model = AlternatingLeastSquares(factors=64, regularization=0.05, alpha=2.0)
model.fit(user_plays)

  0%|          | 0/15 [00:00<?, ?it/s]
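
For reference, the objective in the Hu, Koren, and Volinsky paper linked above can be written as follows. The symbols follow the paper's notation rather than anything defined in this lab: $x_u$ and $y_i$ are the latent factor vectors for user $u$ and item $i$, $r_{ui}$ is the observed interaction value, $p_{ui}$ is the binary preference, and $c_{ui}$ is the confidence. Exactly how `implicit` maps its `regularization` and `alpha` arguments onto $\lambda$ and $\alpha$ is an assumption here, so treat this as a guide to the roles of the hyperparameters and not as a description of the library's internals.

$$
\min_{x, y} \; \sum_{u, i} c_{ui} \left( p_{ui} - x_u^\top y_i \right)^2
+ \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right),
\qquad
p_{ui} =
\begin{cases}
1 & \text{if } r_{ui} > 0 \\
0 & \text{if } r_{ui} = 0
\end{cases},
\qquad
c_{ui} = 1 + \alpha \, r_{ui}
$$

"Alternating" least squares refers to how this is minimized: with the item factors held fixed, the problem is an ordinary weighted least squares problem in the user factors, and vice versa, so the algorithm alternates between the two.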

#### Recommending Music for Someone

After training the model, you can make recommendations for either a single user or a batch of users with the `.recommend` method:

In [97]:
# Get recommendations for a single user
userid = 12347
ids, scores = model.recommend(userid, user_plays[userid], N=10, filter_already_liked_items=False)

The `.recommend` call will compute the `N` best recommendations for each user in the input, and return the item IDs in the `ids` array, as well as the computed `scores` in the `scores` array. We can see what musicians are recommended for each user by looking up the IDs in the `artists` array:

In [98]:
# np.isin replaces np.in1d, which is deprecated in recent NumPy releases
pd.DataFrame({"artist": artists[ids], "score": scores, "already_liked": np.isin(ids, user_plays[userid].indices)})

Unnamed: 0,artist,score,already_liked
0,missy elliott,0.902225,False
1,aaliyah,0.877665,True
2,kanye west,0.83049,True
3,r. kelly,0.827827,False
4,ginuwine,0.826273,False
5,robin thicke,0.821549,False
6,mary j. blige,0.815405,False
7,destiny's child,0.811703,False
8,jamie foxx,0.810071,False
9,amerie,0.797745,False


The `already_liked` column there shows whether the user has interacted with the item already, and in this result, two of the items being returned have already been interacted with by the user. We can remove these items from the result set with the `filter_already_liked_items` parameter - setting it to `True` will remove all of these items from the results. The `user_plays[userid]` parameter is used to look up what items each user has interacted with, and can just be set to `None` if you aren’t filtering the users' own likes or recalculating the user representation on the fly.

There are also more filtering options present in the `filter_items` parameter and `items` parameter, as well as options for recalculating the user representation on the fly with the `recalculate_user` parameter.
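
Conceptually, a matrix factorization model scores an item for a user by taking the inner product of the user's and the item's latent factor vectors. The sketch below illustrates this with random, made-up factors in plain NumPy. It is an assumption-laden illustration of the idea, not the code that `implicit` runs, and `top_n` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned factors: 5 users and 8 items, 3 latent factors each
toy_user_factors = rng.normal(size=(5, 3))
toy_item_factors = rng.normal(size=(8, 3))

def top_n(user_id, liked_items=(), n=3):
    """Rank items for one user by the dot product of latent factors."""
    scores = toy_item_factors @ toy_user_factors[user_id]
    # optionally hide items the user has already interacted with
    scores[np.asarray(liked_items, dtype=int)] = -np.inf
    order = np.argsort(-scores)[:n]  # indices of the n highest scores
    return order, scores[order]

toy_ids, toy_scores = top_n(user_id=2)
print(toy_ids, toy_scores)

# Pretend the user already liked the top item and ask again
toy_filtered, _ = top_n(user_id=2, liked_items=[int(toy_ids[0])])
print(toy_filtered)
```

Filtering simply removes known items from the ranking, so the remaining items move up without their scores changing.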

#### <span style="color: red;">Exercises</span>

7. Remove already liked items from the recommendations and compare the results for User ID 12347.
8. Instead of estimating $64$ latent factors, try estimating only $1$ latent factor. Compare the results for User ID 12347. What happens when you reduce the dimensionality so dramatically?

#### Finding Artists that are Similar

Each model in `implicit` also has the ability to show related items through the `similar_items` method. For instance, to get the related items for The Beatles:

In [99]:
# get related items for the beatles (itemid = 252512)
ids, scores = model.similar_items(252512)

# display the results using pandas for nicer formatting
pd.DataFrame({"artist": artists[ids], "score": scores})

Unnamed: 0,artist,score
0,the beatles,1.0
1,the beach boys,0.993134
2,the rolling stones,0.993037
3,john lennon,0.992146
4,bob dylan,0.991987
5,the who,0.991887
6,david bowie,0.991514
7,simon & garfunkel,0.991148
8,led zeppelin,0.990802
9,the white stripes,0.990235


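And these make sense! Nearly all of them are rock artists who were prominent in the 1960s and 1970s, and John Lennon was himself a member of The Beatles.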
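
The score of `1.0` for the query artist suggests that similarity here is measured as the cosine similarity between latent item vectors. Assuming that is the case, the idea can be sketched in a few lines of NumPy; the factors below are random and `most_similar` is a made-up helper, so only the mechanics carry over to the real model.

```python
import numpy as np

rng = np.random.default_rng(1)
toy_item_factors = rng.normal(size=(6, 4))  # 6 hypothetical items, 4 factors

def most_similar(item_id, n=3):
    """Rank items by cosine similarity of their latent factor vectors."""
    norms = np.linalg.norm(toy_item_factors, axis=1)
    sims = toy_item_factors @ toy_item_factors[item_id] / (norms * norms[item_id])
    order = np.argsort(-sims)[:n]
    return order, sims[order]

toy_ids, toy_sims = most_similar(item_id=0)
print(toy_ids, toy_sims)
```

As in the table above, the query item comes back first with a similarity of $1$ (up to rounding), followed by the items whose factor vectors point in the most similar direction.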

### References

- [book recommendation dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)
- [anime recommendation dataset](https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database/data)
- Kaggle notebooks using both datasets
- [`implicit` tutorial](https://benfred.github.io/implicit/tutorial_lastfm.html)
- [Building a Recommendation Engine with Collaborative Filtering](https://realpython.com/build-recommendation-engine-collaborative-filtering/#:~:text=You%20can%20use%20the%20cosine,by%20subtracting%20it%20from%201.)
- *Mining Massive Datasets*, [Chapter 9](http://infolab.stanford.edu/~ullman/mmds/ch9.pdf)
- [The Anatomy of High-Performance Recommender Systems](https://www.algolia.com/blog/ai/the-anatomy-of-high-performance-recommender-systems-part-1/)
- New York Times, [*If You Liked This, You're Sure to Love That*](https://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html?_r=1&partner=permalink&exprod=permalink)
- [*Practical Recommender Systems*](https://www.manning.com/books/practical-recommender-systems)
- [*Recommender Systems Handbook*](https://dl.acm.org/doi/book/10.5555/1941884)