<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Exercise: Recommender systems
© ExploreAI Academy

In this exercise, we will build a content-based recommendation system using a dataset of Netflix titles. We will preprocess the text data, convert it into numerical features with TF-IDF, and compute item similarities to generate recommendations. This hands-on activity will help us understand and implement key techniques in content-based filtering.

## Learning objectives

By the end of this exercise, you should be able to:
* Understand content-based recommendation systems.
* Clean and preprocess text data.
* Convert text data into numerical features using TF-IDF.
* Compute item similarities using cosine similarity.
* Build and evaluate a content-based recommendation model.

## Introduction

In this notebook, we will build a content-based recommendation system using the Netflix dataset. The goal is to recommend titles similar to those a user has already interacted with, based on the attributes of that media. Personalised recommendations of this kind help platforms like Netflix keep users engaged and encourage them to explore a broader range of content.

The dataset is derived from Netflix's collection of movies and TV shows. This dataset includes various attributes for each title, such as:

* show_id: Unique identifier for each title.
* type: The type of media (e.g., Movie, TV Show).
* title: The name of the media.
* director: Directors involved in the media.
* cast: Main actors involved in the media.
* country: Countries where the media was produced.
* date_added: The date when the media was added to Netflix.
* release_year: The year the media was released.
* rating: The rating given to the media.
* duration: Duration of the media (e.g., 90 min, 1 Season).
* listed_in: Categories or genres the media belongs to.
* description: Brief summary or synopsis of the media.

The data provides an overview of the media available on Netflix. Its per-title attributes can be analysed and explored in detail, which is essential for building a recommendation system.

Let's dive in!

Import the necessary libraries and read the data.

In [45]:
# Import necessary libraries
import numpy as np
import pandas as pd

# For text handling and regular expressions
import re
from sklearn.feature_extraction.text import TfidfVectorizer # For converting text to numerical data

# For computing cosine similarity
from sklearn.metrics.pairwise import linear_kernel


In [46]:
data = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/unsupervised_sprint/netflix_titles.csv', index_col=0)
data.head()

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


## Exercises

In this exercise, we focus on the columns `cast`, `title`, `description`, and `listed_in`. These text features describe what each title is about, who stars in it, and which genres it belongs to, which is the information a content-based filtering approach uses to measure similarity between media items.

### Exercise 1: Data cleaning and preprocessing

Before proceeding with our recommender system, we need to clean and process our data first to get the most accurate results.

We need to do the following:

* Remove rows with missing or NaN values.

* Remove punctuation and extra spaces in the text data. This standardises the text and removes noise and variation that would otherwise distort the analysis and modelling.

  **Hint**:
  > * For all the text columns, remove all characters that are not alphanumeric or whitespace.
  > * For the `cast` column, first remove all spaces and then replace commas with spaces. This ensures that each cast member's name is treated as a single token, with names separated by spaces.

* Combine the columns `listed_in`, `cast`, `title`, and `description` into a single feature for the recommendation system. This creates a richer representation of each item, allowing the recommender to consider all aspects of the content at once.

  **Hint**: Remember to drop the individual columns once they are combined, except `title`, which is kept to identify each item.

* Drop the rest of the columns so that only `type`, `title`, and `combined` remain. `type` and `title` provide context and identification, and `combined` is the main feature for calculating similarities.


In [47]:
# Remove rows with missing values in the specified columns
data.dropna(subset=['cast', 'title', 'description', 'listed_in'], inplace=True)

# Reset the index after dropping rows to maintain a clean DataFrame
data.reset_index(drop=True, inplace=True)

# Define a function to clean text data
def clean_text(text):
    # Remove non-alphanumeric characters and extra spaces
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the function to clean text columns
data['description'] = data['description'].apply(clean_text)
data['title'] = data['title'].apply(clean_text)
# For 'cast', remove spaces, replace commas with spaces, then apply clean_text
data['cast'] = data['cast'].apply(lambda x: clean_text(x.replace(' ', '').replace(',', ' ')))
data['listed_in'] = data['listed_in'].apply(clean_text)

# Combine 'listed_in', 'cast', 'title', and 'description' into one column
data['combined'] = data['listed_in'] + ' ' + data['cast'] + ' ' + data['title'] + ' ' + data['description']

# Keep only 'type', 'title', and the new 'combined' column; selecting them
# drops every other column, including the individual text columns
data = data[['type', 'title', 'combined']]

# Display the first few rows of the cleaned data
data.head()

Unnamed: 0,type,title,combined
0,Movie,Norm of the North King Sized Adventure,Children Family Movies Comedies AlanMarriott A...
1,Movie,Jandino Whatever it Takes,StandUp Comedy JandinoAsporaat Jandino Whateve...
2,TV Show,Transformers Prime,Kids TV PeterCullen SumaleeMontano FrankWelker...
3,TV Show,Transformers Robots in Disguise,Kids TV WillFriedle DarrenCriss ConstanceZimme...
4,Movie,realityhigh,Comedies NestaCooper KateWalsh JohnMichaelHigg...
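
To see what the cleaning rules do to a single record, the sketch below applies the same two regular expressions to a made-up cast string. The string and the variable names `raw_cast`, `cast_tokens`, and `naive_tokens` are illustrative assumptions, not rows or names from the dataset.

```python
import re

def clean_text(text):
    # Keep only ASCII letters, digits and whitespace, then collapse runs of whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical cast string, for illustration only
raw_cast = "Jane Doe, John A. Smith"

# Removing spaces first glues each name into one token; commas then become separators
cast_tokens = clean_text(raw_cast.replace(' ', '').replace(',', ' '))
print(cast_tokens)

# Without the space-removal step, first names and surnames become separate tokens
naive_tokens = clean_text(raw_cast)
print(naive_tokens)
```

Treating each name as one token means two titles match on cast only when they share the same actor, not merely a common first name or surname.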


### Exercise 2: Feature extraction
Next, we want to convert the combined text feature into numerical features using TF-IDF.
This enables the application of mathematical and statistical techniques for measuring similarities between different media items. In its raw form, text data cannot be directly used for similarity calculations or machine learning algorithms. By transforming the text into numerical representations, we can leverage these techniques to analyse and compare the content effectively.

* Utilise TF-IDF to convert the `combined` column into numerical vectors, which represent the importance of words in the document. Initialise your TF-IDF Vectoriser without specifying any parameters, which means it will default to single-word tokens.
* Compute the cosine similarity between these vectors to measure how similar the titles are.
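
For reference, the two quantities named above can be written out as follows, assuming scikit-learn's default settings for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`). The symbols are introduced here for illustration: $n$ is the number of titles, $\mathrm{df}(t)$ is the number of titles containing term $t$, and $\mathrm{tf}(t, d)$ is the count of term $t$ in title $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
$$

Under these defaults each title's TF-IDF vector is scaled to unit length, so the denominator equals 1 and the cosine similarity of two titles reduces to the dot product of their vectors.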


In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the 'combined' column to numerical vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(data['combined'])

# Compute the cosine similarity matrix
cosine_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
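
The same two steps can be tried on a handful of short strings to get a feel for the numbers. This is a minimal sketch on made-up text; `toy_docs`, `toy_matrix`, `toy_cosine`, and `toy_linear` are hypothetical names. It also checks that `linear_kernel` gives the same result as `cosine_similarity` here, which should hold because `TfidfVectorizer` L2-normalises each row by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Three made-up 'combined' strings, for illustration only
toy_docs = [
    "Comedies JaneDoe a funny road trip",
    "Comedies JohnSmith a funny wedding",
    "Documentaries ocean wildlife",
]

# Default vectoriser: single-word tokens, L2-normalised rows
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

# Pairwise similarities computed two ways
toy_cosine = cosine_similarity(toy_matrix)
toy_linear = linear_kernel(toy_matrix)

print(toy_cosine.round(2))
print(np.allclose(toy_cosine, toy_linear))
```

The two comedies share the tokens `comedies` and `funny`, so they score above zero against each other, while the documentary shares no tokens with either and scores zero.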

### Exercise 3: Building the recommendation function

Now, we can generate recommendations based on cosine similarity.

Define a function that, given a title, finds similar titles by looking up their cosine similarity scores and returns the top 10 recommendations based on these scores.

In [49]:
# Define the recommendation function
def get_recommendations(title, cosine_sim_matrix, data):
    # Check if the title exists in the dataset
    if title not in data['title'].values:
        return f"Title '{title}' not found in the dataset."

    # Get the row index of the first entry matching the given title
    idx = data[data['title'] == title].index[0]

    # Get the pairwise similarity scores of all titles with that title
    sim_scores = list(enumerate(cosine_sim_matrix[idx]))

    # Sort the titles based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Drop the query title itself, then keep the 10 most similar titles
    sim_scores = [score for score in sim_scores if score[0] != idx][:10]

    # Get the title indices
    title_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar titles
    return data['title'].iloc[title_indices]
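
The ranking step can be checked by hand on a small similarity matrix. The sketch below uses hypothetical titles and scores, and ranks with `np.argsort` rather than `sorted`; `top_n_similar` is an illustrative name, offered as an alternative way to express the same idea rather than a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical titles and a hand-made symmetric similarity matrix
toy_titles = pd.Series(["Alpha", "Beta", "Gamma", "Delta"])
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
])

def top_n_similar(title, titles, sim_matrix, n=2):
    # Position of the query title (first match; assumes a default 0..n-1 index)
    idx = titles[titles == title].index[0]
    # Indices sorted from most to least similar
    order = np.argsort(sim_matrix[idx])[::-1]
    # Exclude the query itself explicitly, then keep the top n
    order = order[order != idx][:n]
    return titles.iloc[order]

top_alpha = top_n_similar("Alpha", toy_titles, toy_sim)
print(top_alpha.tolist())
```

Filtering out the query index explicitly avoids relying on the query always being ranked first, which may not hold if another row has an identical feature string.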

### Exercise 4: Test the recommender

Say you are trying to decide what to watch next, and you particularly enjoyed the series `The Crown`. Run our recommender for this title and see what recommendations we get.

Would you want to watch any of these titles?


In [50]:
# Test the function with an example title
example_title = 'The Crown'
recommendations = get_recommendations(example_title, cosine_sim_matrix, data)
print(f"Recommendations for '{example_title}':\n", recommendations)

Recommendations for 'The Crown':
 369                         Witches A Century of Murder
1829                                         London Spy
5068                                              Reign
2612                                     My Hotter Half
692     The Blue Planet A Natural History of the Oceans
3915                        The Real Football Factories
1753                                         Collateral
5474                                           Lovesick
4830               Planet Earth The Complete Collection
2724                                       Age Gap Love
Name: title, dtype: object
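
When judging a list like this, it can help to see the similarity score and media type next to each title. The following is a minimal sketch on made-up data; `recommend_with_scores`, `toy`, and `toy_sim` are hypothetical names introduced for illustration. The same idea should carry over to `data` and `cosine_sim_matrix`, assuming the DataFrame keeps its default 0-based index.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up catalogue with the same three columns as the cleaned data
toy = pd.DataFrame({
    "type": ["TV Show", "TV Show", "Movie", "Movie"],
    "title": ["Royal Court", "Palace Intrigue", "Ocean Life", "Royal Wedding"],
    "combined": [
        "British TV Dramas queen palace royal family",
        "British TV Dramas palace royal scandal",
        "Documentaries ocean wildlife",
        "Comedies royal wedding chaos",
    ],
})
toy_sim = cosine_similarity(TfidfVectorizer().fit_transform(toy["combined"]))

def recommend_with_scores(title, sim_matrix, frame, n=3):
    # Row label of the first entry matching the title
    idx = frame.index[frame["title"] == title][0]
    # Attach the query row's similarity scores to a copy of the frame
    scored = frame[["type", "title"]].assign(score=sim_matrix[idx])
    # Drop the query title itself and keep the n highest scores
    return scored.drop(index=idx).sort_values("score", ascending=False).head(n)

result = recommend_with_scores("Royal Court", toy_sim, toy)
print(result)
```

A score close to zero indicates that a title shares almost no vocabulary with the query, which is a useful signal that it only appears in the list because nothing closer was available.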



<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px;"/>
</div>