# Recommendation Systems

Recommendation systems predict which items or content a user will find relevant and engaging, and present them to that user. They operate behind the scenes in many of the digital platforms we interact with daily, such as streaming services, online retail sites, and social media platforms. At their core, these systems analyze large amounts of data, including user behavior, preferences, and other contextual information, to curate personalized content or product suggestions. Their importance lies in their ability to enhance user experience and engagement, drive content or product discovery, and boost business metrics like sales and retention. By offering tailored suggestions, they help mitigate information overload, support more informed choices, and improve user satisfaction and loyalty.

There are many different types of recommendation systems. In this project, we primarily focus on implementing a `content-based recommender`.


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from recommender_helper import (
    content_movie_recommender,
    get_popularity_rmse,
    get_vote_avg_rmse,
    get_vote_count_rmse,
)

### Eventually change the below to match eda.ipynb so Ploomber can successfully build

In [None]:
%reload_ext sql
%sql duckdb:///../../movies_data.duckdb

In [None]:
df = %sql select * from movie_genre_data
df = pd.DataFrame(df)
df

### How Content-Based Recommenders Work

Content-based recommenders work by analyzing the attributes of items as well as a user's historical interactions with such items. In this case, our items are movies and their attributes of interest are their respective `genre_names` and `overview` columns. Given that our data excludes any information on users, we will primarily focus on just comparing each movie's attributes to find similar ones.

In summary, we will first vectorize each attribute of interest with TF-IDF and then compute the similarity between the resulting vectors using cosine similarity.
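
As a rough illustration of these two steps, here is a minimal sketch on a made-up three-movie table. The titles, overviews, and the `top_n_similar` helper are assumptions for illustration only; the helper is not the `content_movie_recommender` imported from `recommender_helper`, whose implementation is not shown here.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data standing in for the movie table (titles and overviews are made up).
toy = pd.DataFrame(
    {
        "title": ["Space Quest", "Star Voyage", "Cooking Show"],
        "overview": [
            "astronauts explore a distant galaxy",
            "a crew of astronauts travels across the galaxy",
            "chefs compete in a kitchen",
        ],
    }
)

# Step 1: turn each overview into a TF-IDF vector.
toy_matrix = TfidfVectorizer().fit_transform(toy["overview"])

# Step 2: pairwise cosine similarity, labelled by title.
toy_sim = pd.DataFrame(
    cosine_similarity(toy_matrix), index=toy["title"], columns=toy["title"]
)


def top_n_similar(title, sim_df, n=5):
    """Hypothetical helper: the n titles most similar to `title`, excluding itself."""
    return sim_df[title].drop(title).sort_values(ascending=False).head(n)


print(top_n_similar("Space Quest", toy_sim, n=2))
```

Because "Space Quest" and "Star Voyage" share the words "astronauts" and "galaxy" while "Cooking Show" shares none, "Star Voyage" ranks first and "Cooking Show" gets a similarity of zero.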

### TF-IDF

Below, we use TF-IDF to vectorize the text in the `overview` column. This [article](https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/#:~:text=Term%20Frequency%20%2D%20Inverse%20Document%20Frequency%20(TF%2DIDF)%20is,%2C%20relative%20to%20a%20corpus) gives a good introduction to how the math behind TF-IDF works.

In [None]:
# Create tf-idf matrix for text comparison
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df["overview"])

### Cosine Similarity

Then we compute the cosine similarity between each pair of movies' vectorized components. A high cosine similarity between two movies indicates that they are "close" to each other based on their vectorized components.
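
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. The sketch below checks this on three small hand-picked vectors (the vectors and the `cosine` helper are illustrative, not part of the notebook's pipeline) and compares the result with scikit-learn's pairwise version.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a


def cosine(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


print(cosine(a, b))  # parallel vectors: similarity of 1 (up to rounding)
print(cosine(a, c))  # orthogonal vectors: similarity of 0

# scikit-learn computes the same quantity for every pair of rows at once.
pairwise = cosine_similarity(np.vstack([a, b, c]))
print(pairwise.round(3))
```

Since TF-IDF vectors have no negative entries, the similarities between movies fall between 0 (no shared terms) and 1 (identical direction).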

In [None]:
# Compute cosine similarity between all movie-descriptions
similarity = cosine_similarity(tfidf_matrix)
similarity_df = pd.DataFrame(
    similarity, index=df.title.values, columns=df.title.values
)  # noqa E501
similarity_df.head(5)

In [None]:
movie_list = similarity_df.columns.values

In [None]:
sample_movie = "Spider-Man: Across the Spider-Verse"

recommendations = content_movie_recommender(
    sample_movie, similarity_df, movie_list, 10
)  # noqa E501

recommendations

### Using both genre and overview columns

Let's now try to incorporate the genres of the movies into our recommendation system. To do so, we're going to create a `combined` column that includes both a movie's "overview" and "genre(s)".

We can adjust the "weight" of how genres influence our recommendation system by deciding how many times they appear in the `combined` column.
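
The effect of this repetition can be seen on a tiny made-up example. In the sketch below, row 1 shares overview words with row 0 but has a different genre, while row 2 shares only the genre. The rows and the `similarity_to_first` helper are assumptions for illustration; the `combined` text is built the same way as in the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up rows: row 1 matches row 0 on overview words, row 2 only on genre.
toy = pd.DataFrame(
    {
        "overview": [
            "a boy finds a dragon",
            "a boy finds a robot",
            "pirates hunt treasure",
        ],
        "genre_names": ["Animation", "Drama", "Animation"],
    }
)


def similarity_to_first(genre_weight):
    """Similarity of row 0 to rows 1 and 2 with genres repeated `genre_weight` times."""
    combined = toy["overview"] + " " + (toy["genre_names"] + ", ") * genre_weight
    matrix = TfidfVectorizer().fit_transform(combined)
    sims = cosine_similarity(matrix[0], matrix)[0]
    return sims[1], sims[2]


for weight in (1, 2, 4):
    overview_match, genre_match = similarity_to_first(weight)
    print(
        f"weight={weight}: overview match={overview_match:.3f}, "
        f"genre match={genre_match:.3f}"
    )
```

As the repetition count grows, the genre-only match scores higher and the overview-only match scores lower, so the count acts as a crude dial between the two signals.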

In [None]:
df["combined"] = (
    df["overview"] + " " + (df["genre_names"] + ", ") * 2
)  # Duplicate genres to give more weight, experiment by adjusting
df.combined[0]

In [None]:
tfidf_combined = TfidfVectorizer(stop_words="english")
tfidf_matrix_combined = tfidf_combined.fit_transform(df["combined"])

In [None]:
similarity_combined = cosine_similarity(tfidf_matrix_combined)

similarity_df_combined = pd.DataFrame(
    similarity_combined, index=df.title.values, columns=df.title.values
)

similarity_df_combined.head(5)

In [None]:
combined_movie_list = similarity_df_combined.columns.values

In [None]:
sample_movie = "Spider-Man: Across the Spider-Verse"
recommendations = content_movie_recommender(
    sample_movie, similarity_df_combined, combined_movie_list, 10
)  # noqa E501

recommendations

### Evaluating Our Recommender

Normally, recommenders are evaluated with a train-test split, using metrics that measure whether historical users actually interacted with the recommended movies. However, since our data describes only the movies themselves, we will instead evaluate our recommender with three different metrics.

1. RMSE of `popularity`
2. RMSE of `vote_average`
3. RMSE of `vote_count`

These are fairly rudimentary metrics for evaluating a recommender system, but they will suffice for learning purposes.
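
One plausible way to read such a metric is as the root mean square difference between the sample movie's value in a column and the values of the movies recommended for it. The sketch below implements that reading; it is an assumption, since the actual `get_popularity_rmse`, `get_vote_avg_rmse`, and `get_vote_count_rmse` helpers live in `recommender_helper` and may be defined differently. The function name `rmse_against_reference` and the toy numbers are hypothetical.

```python
import numpy as np
import pandas as pd


def rmse_against_reference(data, reference_title, recommended_titles, column):
    """Hypothetical metric: RMSE between the reference movie's value in `column`
    and the values of the recommended movies. Assumes titles are unique."""
    by_title = data.set_index("title")[column]
    reference_value = by_title.loc[reference_title]
    recommended_values = by_title.loc[list(recommended_titles)].to_numpy(dtype=float)
    return float(np.sqrt(np.mean((recommended_values - reference_value) ** 2)))


# Made-up numbers, for illustration only.
toy = pd.DataFrame({"title": ["A", "B", "C"], "vote_average": [8.0, 7.0, 5.0]})

# Differences from A are -1 and -3, so the result is sqrt((1 + 9) / 2) = sqrt(5).
toy_rmse = rmse_against_reference(toy, "A", ["B", "C"], "vote_average")
print(f"{toy_rmse:.3f}")
```

Under this reading, a lower value means the recommended movies resemble the sample movie more closely on that column.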

Try experimenting with the weight of genres and with tuning [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), particularly its `max_df` and `stop_words` parameters.
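
As a starting point for that experimentation, the sketch below shows how `stop_words` and `max_df` change the vocabulary that `TfidfVectorizer` learns. The three sentences are made up, and the `max_df=0.7` threshold is an arbitrary value chosen for this tiny corpus rather than a recommended setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "the" occurs in all three documents.
docs = [
    "the hero saves the city from the villain",
    "the hero and the villain share a secret",
    "the chef opens a bakery in the city",
]

default_vocab = TfidfVectorizer().fit(docs).vocabulary_
# stop_words="english" removes words from scikit-learn's built-in English list.
no_stop_vocab = TfidfVectorizer(stop_words="english").fit(docs).vocabulary_
# max_df=0.7 drops terms that occur in more than 70% of the documents.
capped_vocab = TfidfVectorizer(max_df=0.7).fit(docs).vocabulary_

print(len(default_vocab), len(no_stop_vocab), len(capped_vocab))
print("the" in default_vocab, "the" in no_stop_vocab, "the" in capped_vocab)
```

Both options remove "the" here, but they work differently: `stop_words` relies on a fixed word list, while `max_df` adapts to whichever terms happen to be common in the corpus being fitted.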

In [None]:
df["combined"] = (
    df["overview"] + " " + (df["genre_names"] + ", ") * 2
)  # Duplicate genres to give more weight, experiment by adjusting

tfidf_combined = TfidfVectorizer(stop_words="english")
tfidf_matrix_combined = tfidf_combined.fit_transform(df["combined"])

similarity_combined = cosine_similarity(tfidf_matrix_combined)

similarity_df_combined = pd.DataFrame(
    similarity_combined, index=df.title.values, columns=df.title.values
)

combined_movie_list = similarity_df_combined.columns.values

In [None]:
sample_movie = "Spider-Man: Across the Spider-Verse"

recommendations = content_movie_recommender(
    sample_movie, similarity_df_combined, combined_movie_list, 10
)  # noqa E501

recommendations

In [None]:
popularity_rmse = get_popularity_rmse(df, sample_movie, recommendations)

vote_avg_rmse = get_vote_avg_rmse(df, sample_movie, recommendations)

vote_count_rmse = get_vote_count_rmse(df, sample_movie, recommendations)

In [None]:
print(
    f"Root Mean Square Error (RMSE) for:\n"
    f"Popularity: {popularity_rmse:.2f}\n"
    f"Vote Average: {vote_avg_rmse:.2f}\n"
    f"Vote Count: {vote_count_rmse:.2f}"
)