# Recommendation Systems

Recommendation systems predict which items or content users will find relevant and engaging, and present those items to them. They operate behind the scenes in many of the digital platforms we interact with daily, such as streaming services, online retail sites, and social media platforms. At their core, these systems analyze large amounts of data, including user behavior, preferences, and other contextual information, to curate personalized content or product suggestions. Their importance lies in their ability to enhance user experience and engagement, drive content or product discovery, and boost business metrics like sales and retention. By offering tailor-made suggestions, they mitigate information overload, facilitate more informed choices, and help improve user satisfaction and loyalty.

There are many different types of recommendation systems. In this project, we primarily focus on implementing a `content-based recommender`.


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from recommender_helper import (
    content_movie_recommender,
    get_popularity_rmse,
    get_vote_avg_rmse,
    get_vote_count_rmse,
)

### Eventually change the below to match eda.ipynb so Ploomber can successfully build

In [2]:
%reload_ext sql
%sql duckdb:///../../movies_data.duckdb

In [3]:
df = %sql select * from movie_genre_data
df = pd.DataFrame(df)
df

Unnamed: 0,genre_names,id,original_language,overview,popularity,release_date,title,vote_average,vote_count
0,"Thriller, Action",724209,en,An intelligence operative for a shadowy global...,2813.299,2023-08-09,Heart of Stone,6.9,700
1,"Animation, Action, Adventure",569094,en,"After reuniting with Gwen Stacy, Brooklyn’s fu...",1738.308,2023-05-31,Spider-Man: Across the Spider-Verse,8.5,3696
2,"Action, Adventure, Science Fiction",298618,en,When his attempt to save his family inadverten...,1559.171,2023-06-13,The Flash,7.0,2443
3,"Comedy, Adventure, Fantasy",346698,en,Barbie and Ken are having the time of their li...,1556.661,2023-07-19,Barbie,7.4,3309
4,"Animation, Science Fiction, Action, Adventure",1121575,en,Travel across the galaxy with John Sheridan as...,1519.610,2023-08-15,Babylon 5: The Road Home,7.6,22
...,...,...,...,...,...,...,...,...,...
980,"Action, Comedy, Science Fiction",257344,en,Video game experts are recruited by the milita...,73.242,2015-07-16,Pixels,5.7,7013
981,"Action, Crime, Thriller",273481,en,An idealistic FBI agent is enlisted by a gover...,70.284,2015-09-17,Sicario,7.4,7754
982,Horror,109428,en,Five young friends find the mysterious and fie...,45.753,2013-04-05,Evil Dead,6.5,4190
983,"Action, Adventure, Science Fiction",141052,en,Fuelled by his restored faith in humanity and ...,71.816,2017-11-15,Justice League,6.1,12200


### How Content-Based Recommenders Work

Content-based recommenders work by analyzing the attributes of items as well as a user's historical interactions with such items. In this case, our items are movies and their attributes of interest are their respective `genre_names` and `overview` columns. Given that our data excludes any information on users, we will primarily focus on just comparing each movie's attributes to find similar ones.

In summary, we first vectorize each attribute of interest with TF-IDF and then compute the similarity between the resulting vectors using cosine similarity.

### TF-IDF

Below, we use TF-IDF to vectorize the text in the `overview` column. This [article](https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/#:~:text=Term%20Frequency%20%2D%20Inverse%20Document%20Frequency%20(TF%2DIDF)%20is,%2C%20relative%20to%20a%20corpus) provides a good introduction to the math behind TF-IDF.

In [4]:
# Create tf-idf matrix for text comparison
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df["overview"])
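
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three made-up overviews. It assumes the smoothed formula that scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. It also assumes a simple whitespace tokenizer, which differs from scikit-learn's default (for example, the default drops single-character tokens). The function name `tfidf_vectors` and the toy sentences are illustrative and not part of the project.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) where each row holds L2-normalised TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency (assumed to match scikit-learn's default).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in raw])
    return vocab, rows


# Made-up overviews, for illustration only.
toy_overviews = [
    "a hero saves the city",
    "a villain attacks the city",
    "two friends share a road trip",
]
vocab, rows = tfidf_vectors(toy_overviews)

# Print the non-zero weights of the first overview.
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term:>8}: {weight:.3f}")
```

In the first overview, `hero` occurs in only one document and `city` in two, so `hero` receives the larger weight, while `a`, which occurs in every overview, receives the smallest non-zero weight.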

### Cosine Similarity

Next, we compute the cosine similarity between each pair of movies' vectorized overviews. A high cosine similarity between two movies indicates that they are "close" to each other based on their vectorized components.

In [5]:
# Compute cosine similarity between all movie-descriptions
similarity = cosine_similarity(tfidf_matrix)
similarity_df = pd.DataFrame(
    similarity, index=df.title.values, columns=df.title.values
)  # noqa E501
similarity_df.head(5)

Unnamed: 0,Heart of Stone,Spider-Man: Across the Spider-Verse,The Flash,Barbie,Babylon 5: The Road Home,No Hard Feelings,Meg 2: The Trench,Cobweb,Fast X,Insidious: The Red Door,...,Just Go with It,National Lampoon's Vacation,The Twilight Saga: New Moon,Dawn of the Planet of the Apes,Ghostbusters,Pixels,Sicario,Evil Dead,Justice League,Clash of the Titans
Heart of Stone,1.0,0.019118,0.02245,0.012476,0.006621,0.021704,0.018739,0.032052,0.035122,0.020683,...,0.016203,0.016691,0.016348,0.004965,0.030542,0.002797,0.015028,0.04611,0.016735,0.035632
Spider-Man: Across the Spider-Verse,0.019118,1.0,0.100435,0.049998,0.100344,0.029335,0.033016,0.025204,0.079503,0.032446,...,0.021171,0.038298,0.037371,0.030505,0.029616,0.017029,0.042988,0.032074,0.062735,0.057223
The Flash,0.02245,0.100435,1.0,0.086477,0.090002,0.056523,0.024659,0.091211,0.073312,0.059254,...,0.121867,0.087334,0.074185,0.075716,0.020178,0.027283,0.058568,0.042569,0.153752,0.095988
Barbie,0.012476,0.049998,0.086477,1.0,0.043066,0.039475,0.045293,0.01455,0.081365,0.07682,...,0.06816,0.070323,0.030241,0.040176,0.026161,0.023496,0.039055,0.043408,0.061784,0.031715
Babylon 5: The Road Home,0.006621,0.100344,0.090002,0.043066,1.0,0.032574,0.018692,0.033806,0.034673,0.033512,...,0.022676,0.055241,0.041648,0.039612,0.014844,0.017855,0.030133,0.06787,0.034741,0.04281
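
As a rough illustration of what each cell in the table above measures, the sketch below computes cosine similarity by hand for small made-up vectors. The helper name `cosine` and the vectors are illustrative; in the notebook the computation is done by scikit-learn's `cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only.
same_direction = cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                # no shared terms
partial = cosine([1.0, 1.0], [1.0, 0.0])                   # one shared term

print(same_direction, orthogonal, partial)
```

Because the measure depends only on the angle between vectors, scaling a vector leaves the score unchanged. TF-IDF weights are non-negative, so scores for TF-IDF vectors fall between 0 and 1, with 1.0 on the diagonal where each movie is compared with itself.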


In [6]:
movie_list = similarity_df.columns.values

In [7]:
sample_movies = ["Spider-Man: Across the Spider-Verse"]

for movie in sample_movies:
    content_movie_recommender(movie, similarity_df, movie_list, 10)



Top Recommended Movies for: Spider-Man: Across the Spider-Verse are:-
 ['Spider-Man: Into the Spider-Verse' 'Spider-Man'
 'The Amazing Spider-Man 2' 'Spider-Man 3' 'Thor: Ragnarok'
 'Spider-Man: Homecoming' 'Doctor Strange in the Multiverse of Madness'
 'The Amazing Spider-Man' 'Sweet Girl' 'Spider-Man: No Way Home']
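
The implementation of `content_movie_recommender` lives in `recommender_helper` and is not shown here. As a hedged sketch of the general idea only, ranking titles by their similarity to a query could look like the following. `top_k_similar` is a hypothetical name, and the scores are copied from the `Spider-Man: Across the Spider-Verse` row of the first similarity table.

```python
def top_k_similar(title, similarity, k):
    """Return the k titles most similar to `title`, excluding the title itself.

    `similarity` maps a title to a dict of {other_title: score}.
    """
    scores = similarity[title]
    candidates = [other for other in scores if other != title]
    ranked = sorted(candidates, key=lambda other: scores[other], reverse=True)
    return ranked[:k]


# A single row of scores taken from the similarity table shown earlier.
toy_similarity = {
    "Spider-Man: Across the Spider-Verse": {
        "Heart of Stone": 0.019118,
        "Spider-Man: Across the Spider-Verse": 1.0,
        "The Flash": 0.100435,
        "Barbie": 0.049998,
        "Babylon 5: The Road Home": 0.100344,
    }
}

print(top_k_similar("Spider-Man: Across the Spider-Verse", toy_similarity, 2))
# → ['The Flash', 'Babylon 5: The Road Home']
```

The query title is excluded because every movie has a similarity of 1.0 with itself and would otherwise always rank first.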


### Using both genre and overview columns

Let's now add the movies' genres to our recommendation system. To do so, we create a `combined` column that contains both a movie's "overview" and its "genre(s)".

We can adjust the "weight" of how genres influence our recommendation system by deciding how many times they appear in the `combined` column.

In [8]:
df["combined"] = (
    df["overview"] + " " + (df["genre_names"] + ", ") * 2
)  # Duplicate genres to give more weight, experiment by adjusting
df.combined[0]

'An intelligence operative for a shadowy global peacekeeping agency races to stop a hacker from stealing its most valuable — and dangerous — weapon. Thriller, Action, Thriller, Action, '
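
To experiment with the genre weight, the string concatenation can be wrapped in a small helper. The sketch below is plain Python; `combine_text` and its `genre_weight` parameter are illustrative names and not part of the notebook, and the example overview is made up.

```python
def combine_text(overview, genre_names, genre_weight=2):
    """Append the genre string to the overview `genre_weight` times."""
    if genre_weight < 0:
        raise ValueError("genre_weight must be non-negative")
    return overview + " " + (genre_names + ", ") * genre_weight


# Made-up overview, for illustration only.
for weight in (0, 1, 3):
    print(repr(combine_text("A spy chases a hacker.", "Thriller, Action", weight)))
```

Applied to the DataFrame, the helper could be used as `df["combined"] = [combine_text(o, g, 3) for o, g in zip(df["overview"], df["genre_names"])]`. Repeating the genres raises their term counts relative to the words of the overview, so a larger weight lets shared genres contribute more to the similarity score.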

In [9]:
tfidf_combined = TfidfVectorizer(stop_words="english")
tfidf_matrix_combined = tfidf_combined.fit_transform(df["combined"])

In [10]:
similarity_combined = cosine_similarity(tfidf_matrix_combined)

similarity_df_combined = pd.DataFrame(
    similarity_combined, index=df.title.values, columns=df.title.values
)

similarity_df_combined.head(5)

Unnamed: 0,Heart of Stone,Spider-Man: Across the Spider-Verse,The Flash,Barbie,Babylon 5: The Road Home,No Hard Feelings,Meg 2: The Trench,Cobweb,Fast X,Insidious: The Red Door,...,Just Go with It,National Lampoon's Vacation,The Twilight Saga: New Moon,Dawn of the Planet of the Apes,Ghostbusters,Pixels,Sicario,Evil Dead,Justice League,Clash of the Titans
Heart of Stone,1.0,0.016677,0.018675,0.0,0.020224,0.0,0.02316,0.024101,0.045681,0.032084,...,0.0,0.0,0.0,0.056374,0.043553,0.024576,0.068152,0.0,0.017792,0.02767
Spider-Man: Across the Spider-Verse,0.016677,1.0,0.064058,0.030498,0.071086,0.0,0.02817,0.0,0.050396,0.006986,...,0.0,0.020868,0.022255,0.01308,0.03135,0.022215,0.015812,0.0,0.051559,0.030391
The Flash,0.018675,0.064058,1.0,0.035702,0.106439,0.0,0.074818,0.0,0.020541,0.005084,...,0.018025,0.04104,0.044144,0.073022,0.013085,0.079393,0.017707,0.0,0.184876,0.037766
Barbie,0.0,0.030498,0.035702,1.0,0.020188,0.037914,0.0,0.0,0.0,0.0,...,0.060588,0.093838,0.044579,0.0,0.057574,0.031239,0.0,0.0,0.01776,0.033077
Babylon 5: The Road Home,0.020224,0.071086,0.106439,0.020188,1.0,0.011931,0.081025,0.0,0.012853,0.008472,...,0.0,0.060802,0.026989,0.07908,0.014171,0.095674,0.019176,0.032357,0.077515,0.020015


In [11]:
combined_movie_list = similarity_df_combined.columns.values

In [12]:
sample_movies = ["Spider-Man: Across the Spider-Verse"]

for movie in sample_movies:
    content_movie_recommender(
        movie, similarity_df_combined, combined_movie_list, 10
    )  # noqa E501



Top Recommended Movies for: Spider-Man: Across the Spider-Verse are:-
 ['Spider-Man: Into the Spider-Verse' 'Spider-Man' 'Spider-Man 3'
 'The Amazing Spider-Man 2' 'Spider-Man: Homecoming'
 'Doctor Strange in the Multiverse of Madness' 'Spider-Man: No Way Home'
 'Ice Age: Dawn of the Dinosaurs'
 'Deathstroke: Knights & Dragons - The Movie' 'Big Hero 6']


### Evaluating Our Recommender

Normally, recommenders are evaluated using a train/test split, with metrics based on whether users have historically interacted with the recommended movies. However, since our data describes only the movies themselves, we will instead evaluate our recommender on three different metrics.

1. RMSE of `popularity`
2. RMSE of `vote_average`
3. RMSE of `vote_count`

These are fairly rudimentary metrics for evaluating a recommender system, but they suffice for learning purposes.
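
The RMSE helpers are imported from `recommender_helper` and their definitions are not shown. One plausible reading, stated here as an assumption, is that each metric compares the query movie's value with the values of the recommended movies. Under that assumption, a minimal sketch (with the hypothetical name `rmse_against_reference` and made-up numbers) is:

```python
import math


def rmse_against_reference(reference, values):
    """Root mean square error between one reference value and a list of values."""
    if not values:
        raise ValueError("values must not be empty")
    squared_errors = [(value - reference) ** 2 for value in values]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up vote averages: one query movie and three recommended movies.
query_vote_average = 8.5
recommended_vote_averages = [8.5, 7.5, 6.5]

toy_rmse = rmse_against_reference(query_vote_average, recommended_vote_averages)
print(f"RMSE: {toy_rmse:.2f}")
```

Under this reading, a lower RMSE means the recommended movies resemble the query movie more closely on that attribute. The three attributes are on very different scales, so their RMSE values should not be compared with one another.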

Try experimenting with the weight of the genres and with tuning [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), particularly its `max_df` and `stop_words` parameters.
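
As a rough guide to what `max_df` does: when it is given as a float, scikit-learn ignores terms whose document frequency is strictly higher than that fraction of the corpus. The pure-Python sketch below mimics that filter on made-up sentences. `prune_by_max_df` is a hypothetical helper with a whitespace tokenizer, not scikit-learn's implementation.

```python
from collections import Counter


def prune_by_max_df(docs, max_df):
    """Return the sorted vocabulary after dropping terms found in too many documents."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    doc_freq = Counter(tok for toks in tokenized for tok in toks)
    limit = max_df * len(docs)  # the largest document count a term may have
    return sorted(term for term, count in doc_freq.items() if count <= limit)


# Made-up overviews, for illustration only.
toy_docs = [
    "the hero saves the city",
    "the villain attacks the city",
    "the friends take a trip",
    "the hero returns",
]

kept_all = prune_by_max_df(toy_docs, 1.0)   # no term can exceed 100% of documents
kept_half = prune_by_max_df(toy_docs, 0.5)  # drops terms in more than 2 of 4 documents

print(sorted(set(kept_all) - set(kept_half)))
```

Here only `the` is dropped at `max_df=0.5`, since it is the only term appearing in more than half of the documents. A lower `max_df` acts like a corpus-specific stop-word filter, complementing the fixed list selected by `stop_words="english"`.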

In [23]:
df["combined"] = (
    df["overview"] + " " + (df["genre_names"] + ", ") * 2
)  # Duplicate genres to give more weight, experiment by adjusting

tfidf_combined = TfidfVectorizer(stop_words="english")
tfidf_matrix_combined = tfidf_combined.fit_transform(df["combined"])

similarity_combined = cosine_similarity(tfidf_matrix_combined)

similarity_df_combined = pd.DataFrame(
    similarity_combined, index=df.title.values, columns=df.title.values
)

combined_movie_list = similarity_df_combined.columns.values

In [21]:
sample_movie = "Spider-Man: Across the Spider-Verse"

recommendations = content_movie_recommender(
    sample_movie, similarity_df_combined, combined_movie_list, 10
)  # noqa E501



Top Recommended Movies for: Spider-Man: Across the Spider-Verse are:-
 ['Spider-Man: Into the Spider-Verse' 'Spider-Man' 'Spider-Man 3'
 'The Amazing Spider-Man 2' 'Spider-Man: Homecoming'
 'Doctor Strange in the Multiverse of Madness' 'Spider-Man: No Way Home'
 'Ice Age: Dawn of the Dinosaurs'
 'Deathstroke: Knights & Dragons - The Movie' 'Big Hero 6']


In [15]:
recommendations

array(['Spider-Man: Into the Spider-Verse', 'Spider-Man', 'Spider-Man 3',
       'The Amazing Spider-Man 2', 'Spider-Man: Homecoming',
       'Doctor Strange in the Multiverse of Madness',
       'Spider-Man: No Way Home', 'Ice Age: Dawn of the Dinosaurs',
       'Deathstroke: Knights & Dragons - The Movie', 'Big Hero 6'],
      dtype=object)

In [18]:
popularity_rmse = get_popularity_rmse(df, sample_movie, recommendations)

vote_avg_rmse = get_vote_avg_rmse(df, sample_movie, recommendations)

vote_count_rmse = get_vote_count_rmse(df, sample_movie, recommendations)

In [19]:
print(
    f"Root Mean Square Error (RMSE) for:\n"
    f"Popularity: {popularity_rmse:.2f}\n"
    f"Vote Average: {vote_avg_rmse:.2f}\n"
    f"Vote Count: {vote_count_rmse:.2f}"
)

Root Mean Square Error (RMSE) for:
Popularity: 1620.75
Vote Average: 1.37
Vote Count: 10494.70
