# Recommendation System

Recommendation is a broad problem with many applications. The main objective is to provide personalized recommendations to users based on their preferences and behavior. There are two main approaches to this problem: Collaborative Filtering and Content-Based Filtering. A third approach, Hybrid, combines both.

## 1. Content-Based Filtering

The content-based approach relies on the similarity between items. We measure the similarity between the items that the user liked and the other items in the dataset, then recommend the most similar ones. A common similarity measure is cosine similarity.
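As a minimal sketch of the idea, the snippet below compares one liked item with two candidates using cosine similarity. The genre order, the vectors, and the helper name `cosine_sim` are assumptions made for illustration; they do not come from the dataset used later.

```python
import numpy as np

def cosine_sim(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

# Hypothetical genre order: [Adventure, Children, Comedy, Fantasy, Romance]
liked = np.array([1, 1, 0, 1, 0])        # an item the user liked
candidate_a = np.array([1, 1, 1, 1, 0])  # shares three genres with it
candidate_b = np.array([0, 0, 1, 0, 1])  # shares none

sim_a = cosine_sim(liked, candidate_a)
sim_b = cosine_sim(liked, candidate_b)
print(sim_a, sim_b)
```

Under these assumptions `candidate_a` scores higher than `candidate_b`, so it would be recommended first.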

## 2. Collaborative Filtering

As the name suggests, Collaborative Filtering relies on the behavior of all users to make recommendations for a specific user. The idea is to find similar users and recommend items that they liked to our user.
This approach can be divided into two variants:
* Memory-based Collaborative Filtering: we use all the user interactions directly to find similar users and recommend items that they liked.
* Model-based Collaborative Filtering: we train a model to predict the rating that a user would give to an item.
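To make the memory-based variant concrete, here is a small sketch of a user-based prediction on a toy rating matrix. The matrix, the function name `predict_rating`, and the choice of a cosine-weighted average are assumptions for illustration, not the method implemented in this notebook. It treats unrated items as 0 when computing similarities, which is a simplification, and it assumes every user has at least one rating.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # target user (has not rated item 2)
    [5.0, 5.0, 4.0, 1.0],   # similar tastes
    [1.0, 1.0, 2.0, 5.0],   # opposite tastes
])

def predict_rating(ratings: np.ndarray, user: int, item: int) -> float:
    """Similarity-weighted average of the other users' ratings for `item`."""
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[user] / (norms * norms[user])  # cosine similarity
    rated = ratings[:, item] > 0       # only users who rated the item
    rated[user] = False                # exclude the target user
    if not rated.any():
        return 0.0
    return float(sims[rated] @ ratings[rated, item] / sims[rated].sum())

prediction = predict_rating(ratings, user=0, item=2)
print(round(prediction, 2))
```

The prediction lands closer to the rating given by the similar user than to the one given by the dissimilar user, which is the behavior the memory-based description above relies on.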




In [5]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity

import tensorflow as tf
from tensorflow.keras.layers import Embedding, Input, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2


In [21]:
movies_df = pd.read_csv('csv_files/movies.csv')
ratings_df = pd.read_csv('csv_files/ratings.csv')
tags_df = pd.read_csv('csv_files/tags.csv')

my_user_id = 42
film_id = 84

In [22]:
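# Note: `info` is referenced without parentheses, so display() shows the bound-method
# repr (which embeds a preview of each DataFrame); call `.info()` for the column summary.
display(movies_df.info)
display(ratings_df.info)
display(tags_df.info)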

<bound method DataFrame.info of        movieId                               title  \
0            1                    Toy Story (1995)   
1            2                      Jumanji (1995)   
2            3             Grumpier Old Men (1995)   
3            4            Waiting to Exhale (1995)   
4            5  Father of the Bride Part II (1995)   
...        ...                                 ...   
87580   292731           The Monroy Affaire (2022)   
87581   292737          Shelter in Solitude (2023)   
87582   292753                         Orca (2023)   
87583   292755              The Angry Breed (1968)   
87584   292757           Race to the Summit (2023)   

                                            genres  
0      Adventure|Animation|Children|Comedy|Fantasy  
1                       Adventure|Children|Fantasy  
2                                   Comedy|Romance  
3                             Comedy|Drama|Romance  
4                                           Comedy  
.

<bound method DataFrame.info of           userId  movieId  rating   timestamp
0              1       17     4.0   944249077
1              1       25     1.0   944250228
2              1       29     2.0   943230976
3              1       30     5.0   944249077
4              1       32     5.0   943228858
...          ...      ...     ...         ...
32000199  200948    79702     4.5  1294412589
32000200  200948    79796     1.0  1287216292
32000201  200948    80350     0.5  1294412671
32000202  200948    80463     3.5  1350423800
32000203  200948    87304     4.5  1350423523

[32000204 rows x 4 columns]>

<bound method DataFrame.info of          userId  movieId             tag   timestamp
0            22    26479     Kevin Kline  1583038886
1            22    79592        misogyny  1581476297
2            22   247150      acrophobia  1622483469
3            34     2174           music  1249808064
4            34     2174           weird  1249808102
...         ...      ...             ...         ...
2000067  162279    90645      Rafe Spall  1320817734
2000068  162279    91079   Anton Yelchin  1322337407
2000069  162279    91079  Felicity Jones  1322337400
2000070  162279    91658     Rooney Mara  1325828398
2000071  162279   100714     Julie Delpy  1373095449

[2000072 rows x 4 columns]>

## 1. Content-based



In [23]:
# Ochiai coefficient: similarity between two lists of labels, |A ∩ B| / sqrt(|A| * |B|)
def ochiai_coef(list_A: list, list_B: list):
    if len(list_A) == 0 or len(list_B) == 0:
        return 0
    intersect = np.intersect1d(list_A, list_B)
    return len(intersect) / (len(list_A) * len(list_B))**0.5

### Preprocessing

In [24]:
ratings_df[ratings_df.userId == my_user_id]

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 7132 | 42 | 36 | 3.0 | 855645897 |
| 7133 | 42 | 66 | 4.0 | 855646278 |
| 7134 | 42 | 150 | 4.0 | 855648714 |
| 7135 | 42 | 260 | 3.0 | 855646059 |
| 7136 | 42 | 349 | 4.0 | 855649174 |
| 7137 | 42 | 457 | 4.0 | 855648928 |
| 7138 | 42 | 494 | 3.0 | 855645897 |
| 7139 | 42 | 648 | 4.0 | 855645808 |
| 7140 | 42 | 733 | 4.0 | 855645897 |
| 7141 | 42 | 780 | 3.0 | 855645808 |


In [27]:
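# Split the pipe-separated genre string into a list of genres.
# The guard makes the cell safe to re-run after `genres` has been dropped.
if 'genres' in movies_df.columns:
    movies_df['list'] = movies_df['genres'].str.split('|')
    movies_df = movies_df.drop(columns=['genres'])
movies_df.head()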

| | movieId | title | list |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | [Adventure, Animation, Children, Comedy, Fantasy] |
| 1 | 2 | Jumanji (1995) | [Adventure, Children, Fantasy] |
| 2 | 3 | Grumpier Old Men (1995) | [Comedy, Romance] |
| 3 | 4 | Waiting to Exhale (1995) | [Comedy, Drama, Romance] |
| 4 | 5 | Father of the Bride Part II (1995) | [Comedy] |


In [28]:
ochiai_coef(movies_df['list'][0], movies_df['list'][1])

0.7745966692414834
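
The value returned above can be checked by hand. Writing $A$ and $B$ for the two genre lists, `ochiai_coef` computes the Ochiai coefficient. Toy Story has 5 genres, Jumanji has 3, and all 3 of Jumanji's genres also appear in Toy Story's list:

```latex
K(A, B) = \frac{|A \cap B|}{\sqrt{|A|\,|B|}}
\qquad\Rightarrow\qquad
K(\text{Toy Story}, \text{Jumanji}) = \frac{3}{\sqrt{5 \times 3}} = \frac{3}{\sqrt{15}} \approx 0.7746
```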

In [29]:
# movieId column as a NumPy array
movie_ids = movies_df["movieId"].to_numpy()
# map each movieId to its row position in the DataFrame
id_to_idx = {mv_id: i for i, mv_id in enumerate(movie_ids)}

We use the Ochiai coefficient to compute the similarity between two movies and return the 10 most similar items.

In [36]:
def recommend(film_id: int, k_best: int = 10):
    idx = id_to_idx[film_id]
    similarity_row = np.zeros(len(movie_ids), dtype=float)

    for j in range(len(movie_ids)):
        if j == idx:
            similarity_row[j] = -1
        else:
            similarity_row[j] = ochiai_coef(movies_df['list'][idx], movies_df['list'][j])

    # argpartition rearranges the indices so that the k_best largest scores
    # occupy the last k_best positions (unsorted)
    id_top_k = np.argpartition(similarity_row, -k_best)[-k_best:]
    # argsort orders those indices by ascending score,
    # and [::-1] reverses them into descending order
    id_top_k = id_top_k[np.argsort(similarity_row[id_top_k])[::-1]]

    return movie_ids[id_top_k], similarity_row[id_top_k]
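
The top-k selection at the end of `recommend` can be looked at in isolation. The sketch below applies the same two steps to a small, made-up score array (the values and `k = 3` are assumptions for illustration).

```python
import numpy as np

scores = np.array([0.2, 0.9, -1.0, 0.5, 0.7, 0.1])
k = 3

# Step 1: move the k largest scores into the last k slots (in arbitrary order).
top_k = np.argpartition(scores, -k)[-k:]
# Step 2: sort only those k candidates, then reverse for descending order.
top_k = top_k[np.argsort(scores[top_k])[::-1]]

print(top_k)          # indices of the 3 best scores, best first
print(scores[top_k])
```

Partitioning first means only k values are sorted, which avoids a full sort of the score array when the catalogue is large.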


In [39]:
top_id, score = recommend(film_id)

print(f"Similar movies to {movies_df[movies_df.movieId == film_id]['title'].values[0]}")
for m_id in top_id:
    print(f" - {movies_df[movies_df.movieId == m_id]['title'].values[0]}")

Similar movies to Last Summer in the Hamptons (1995)
 - Seven Blessings (2023)
 - Shelter in Solitude (2023)
 - Comeback (2023)
 - Persona Non Grata (2021)
 - For Zeko (2022)
 - Spetters (1980)
 - My Dead Dad (2021)
 - Il grande spirito (2019)
 - Learners (2007)
 - Everything Went Fine (2021)


There is a problem: each recommendation takes more than 3 seconds. Most of that time is spent computing the similarity for every movie, which means more than 80k iterations of a Python loop.
We need another approach: encoding the genres as a sparse matrix lets us vectorize the computation and operate only on the non-zero entries.
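
A matrix formulation is possible because, on multi-hot (0/1) genre vectors, cosine similarity gives the same value as the Ochiai coefficient: the dot product counts the shared genres and each norm is the square root of a genre count. The sketch below checks this on the first two movies; the `genres` vocabulary and the helper `multi_hot` are assumptions for illustration.

```python
import numpy as np

genres = ["Adventure", "Animation", "Children", "Comedy", "Fantasy"]
toy_story = ["Adventure", "Animation", "Children", "Comedy", "Fantasy"]
jumanji = ["Adventure", "Children", "Fantasy"]

def multi_hot(labels: list) -> np.ndarray:
    """1.0 where the genre is present, 0.0 elsewhere."""
    return np.array([1.0 if g in labels else 0.0 for g in genres])

a, b = multi_hot(toy_story), multi_hot(jumanji)

# Set-based Ochiai coefficient.
ochiai = len(set(toy_story) & set(jumanji)) / (len(toy_story) * len(jumanji)) ** 0.5
# Cosine similarity of the multi-hot vectors.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.isclose(ochiai, cosine))
```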

In [41]:
# similar to one-hot encoding, but each movie can carry several genres (multi-hot)
mlb = MultiLabelBinarizer(sparse_output=True)
movie_genre_token = mlb.fit_transform(movies_df["list"])

In [42]:
row_norms = np.sqrt(movie_genre_token.multiply(movie_genre_token).sum(axis=1)).A1  # ||x_i|| for every movie (A1 = flattened to 1D)

def recommend(film_id: int, k: int = 10):
    i = id_to_idx[film_id]
    m_i = movie_genre_token.getrow(i)

    # numerator: dot(m_i, movie_genre_token.T)
    # on rows of 0s and 1s, this equals len(intersect)
    dots = m_i @ movie_genre_token.T
    dots = dots.toarray().ravel()

    # denominator: ||x_i|| * ||x_j||
    denom = (row_norms[i] * row_norms)

    scores = np.divide(dots, denom, out=np.zeros_like(dots, dtype=float), where=denom != 0)

    scores[i] = -1.0  # exclude the movie itself

    idx = np.argpartition(scores, -k)[-k:]
    idx = idx[np.argsort(scores[idx])[::-1]]

    return movie_ids[idx], scores[idx]

In [43]:
recommend(film_id)

(array([292539, 292737, 292617, 276005, 275595,   5249, 275637, 275507,
        275579, 275459]),
 array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))

## 2. Collaborative Filtering

### 2.A Memory-based Collaborative Filtering

### 2.B Model-based Collaborative Filtering