# Mini Project: Recommendation Engines

Recommendation engines are algorithms designed to provide personalized suggestions or recommendations to users. These systems analyze user behavior, preferences, and interactions with items (products, movies, music, articles, etc.) to predict and offer items that users are likely to be interested in. Recommendation engines play a crucial role in enhancing user experience, driving engagement, and increasing conversion rates in various applications, including e-commerce, entertainment, content platforms, and more.

There are generally two approaches taken in collaborative filtering and content-based recommendation engines:

**1. Collaborative Filtering:**
Collaborative Filtering is a popular approach to building recommendation systems that leverages the collective behavior of users to make personalized recommendations. It is based on the idea that users who have agreed in the past will likely agree in the future. There are two main types of collaborative filtering:

- **User-based Collaborative Filtering:** This method finds users similar to the target user based on their past interactions (e.g., ratings or purchases). It then recommends items that similar users have liked but the target user has not interacted with yet.

- **Item-based Collaborative Filtering:** In this approach, the system identifies similar items based on user interactions. It recommends items that are similar to the ones the target user has already liked or interacted with.

Collaborative filtering does not require any explicit information about items but relies on the similarity between users or items. It is effective in capturing complex patterns and can provide serendipitous recommendations. However, it suffers from the cold-start problem (i.e., difficulty in recommending to new users or items with no interactions) and scalability challenges in large datasets.

**2. Content-Based Recommendation:**
Content-based recommendation is an alternative approach to building recommendation systems that focuses on the attributes or features of items and users. It leverages the characteristics of items to make recommendations. The key steps involved in content-based recommendation are:

- **Feature Extraction:** For each item, relevant features are extracted. For movies, these features could be genre, director, actors, and plot summary.

- **User Profile:** A user profile is created based on the items they have interacted with in the past. The user profile contains the weighted importance of features based on their interactions.

- **Similarity Calculation:** The similarity between items or between items and the user profile is calculated using similarity metrics like cosine similarity or Euclidean distance.

- **Recommendation:** Items that are most similar to the user profile are recommended to the user.

Content-based recommendation systems are less affected by the cold-start problem as they can still recommend items based on their features. They are also more interpretable as they rely on item attributes. However, they may miss out on providing serendipitous recommendations and can be limited by the quality of feature extraction and user profiles.

**Choosing Between Collaborative Filtering and Content-Based:**
Both collaborative filtering and content-based approaches have their strengths and weaknesses. The choice between them depends on the specific requirements of the recommendation system, the type of data available, and the user base. Hybrid approaches that combine collaborative filtering and content-based techniques are also common, aiming to leverage the strengths of both methods and mitigate their weaknesses.

In this mini-project, you'll be building both content based and collaborative filtering engines for the [MovieLens 25M dataset](https://grouplens.org/datasets/movielens/25m/). The MovieLens 25M dataset is one of the most widely used and popular datasets for building and evaluating recommendation systems. It is provided by the GroupLens Research project, which collects and studies datasets related to movie ratings and recommendations. The MovieLens 25M dataset contains movie ratings and other related information contributed by users of the MovieLens website.

**Dataset Details:**
- **Size:** The dataset contains approximately 25 million movie ratings.
- **Users:** It includes ratings from over 162,000 users.
- **Movies:** The dataset consists of ratings for more than 62,000 movies.
- **Ratings:** The ratings are provided on a scale of 1 to 5, where 1 is the lowest rating and 5 is the highest.
- **Timestamps:** Each rating is associated with a timestamp, indicating when the rating was given.

**Data Files:**
The dataset is usually split into three CSV files:

1. **movies.csv:** Contains information about movies, including the movie ID, title, genres, and release year.
   - Columns: movieId, title, genres

2. **ratings.csv:** Contains movie ratings provided by users, including the user ID, movie ID, rating, and timestamp.
   - Columns: userId, movieId, rating, timestamp

3. **tags.csv:** Contains user-generated tags for movies, including the user ID, movie ID, tag, and timestamp.
   - Columns: userId, movieId, tag, timestamp

First, import all the libraries you'll need.

Personal notes: from DataCamp series <a href='https://app.datacamp.com/learn/courses/building-recommendation-engines-in-python'>Building Recommendation Engines in Python</a>


In [1]:
import zipfile
import numpy as np
import pandas as pd
from urllib.request import urlretrieve
from sklearn.metrics.pairwise import cosine_similarity


Next, download the relevant components of the MoveLens dataset. Note, these instructions are roughly based on the colab [here](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/recommendation-systems/recommendation-systems.ipynb?utm_source=ss-recommendation-systems&utm_campaign=colab-external&utm_medium=referral&utm_content=recommendation-systems#scrollTo=O3bcgduFo4s6).

In [2]:
print("Downloading movielens data...")

urlretrieve('http://files.grouplens.org/datasets/movielens/ml-100k.zip', 'movielens.zip')
zip_ref = zipfile.ZipFile('movielens.zip', 'r')
zip_ref.extractall()
print("Done. Dataset contains:")
print(zip_ref.read('ml-100k/u.info'))

ratings_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(
    'ml-100k/u.data', sep='\t', names=ratings_cols, encoding='latin-1')

# The movies file contains a binary feature for each genre.
genre_cols = [
    "genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
]

movies_cols = [
    'movie_id', 'title', 'release_date', "video_release_date", "imdb_url"
] + genre_cols
movies = pd.read_csv(
    'ml-100k/u.item', sep='|', names=movies_cols, encoding='latin-1')

# convert ratings to a float type to save FutureWarnings on int to float conversion later during Normalization
ratings['rating'] = ratings['rating'].astype(float) 

Downloading movielens data...
Done. Dataset contains:
b'943 users\n1682 items\n100000 ratings\n'


Before doing any kind of machine learning, it's always good to familiarize yourself with the datasets you'lll be working with.

Here are your tasks:

1. Spend some time familiarizing yourself with both the `movies` and `ratings` dataframes. How many unique user ids are present? How many unique movies are there?
2. Create a new dataframe that merges the `movies` and `ratings` tables on 'movie_id'. Only keep the 'user_id', 'title', 'rating' fields in this new dataframe.

In [3]:
# Spend some time familiarizing yourself with both the movies and ratings
# dataframes. How many unique user ids are present? How many unique movies
# are there?
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596


In [4]:
ratings.describe()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
count,100000.0,100000.0,100000.0,100000.0
mean,462.48475,425.53013,3.52986,883528900.0
std,266.61442,330.798356,1.125674,5343856.0
min,1.0,1.0,1.0,874724700.0
25%,254.0,175.0,3.0,879448700.0
50%,447.0,322.0,4.0,882826900.0
75%,682.0,631.0,4.0,888260000.0
max,943.0,1682.0,5.0,893286600.0


In [5]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   user_id         100000 non-null  int64  
 1   movie_id        100000 non-null  int64  
 2   rating          100000 non-null  float64
 3   unix_timestamp  100000 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [6]:
movies.head()

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,genre_unknown,Action,Adventure,Animation,Children,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [7]:
movies.describe()

Unnamed: 0,movie_id,video_release_date,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
count,1682.0,0.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,...,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0
mean,841.5,,0.001189,0.149227,0.080262,0.02497,0.072533,0.300238,0.064804,0.029727,...,0.01308,0.014269,0.054697,0.033294,0.036266,0.146849,0.060048,0.149227,0.042212,0.016052
std,485.695893,,0.034473,0.356418,0.271779,0.156081,0.259445,0.458498,0.246253,0.169882,...,0.11365,0.118632,0.227455,0.179456,0.187008,0.354061,0.237646,0.356418,0.201131,0.125714
min,1.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,421.25,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,841.5,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1261.75,,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1682.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [8]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movie_id            1682 non-null   int64  
 1   title               1682 non-null   object 
 2   release_date        1681 non-null   object 
 3   video_release_date  0 non-null      float64
 4   imdb_url            1679 non-null   object 
 5   genre_unknown       1682 non-null   int64  
 6   Action              1682 non-null   int64  
 7   Adventure           1682 non-null   int64  
 8   Animation           1682 non-null   int64  
 9   Children            1682 non-null   int64  
 10  Comedy              1682 non-null   int64  
 11  Crime               1682 non-null   int64  
 12  Documentary         1682 non-null   int64  
 13  Drama               1682 non-null   int64  
 14  Fantasy             1682 non-null   int64  
 15  Film-Noir           1682 non-null   int64  
 16  Horror

In [9]:
# number of unique users
unique_users = ratings.groupby('user_id').size().count() 
print(f"There are {unique_users} unique users in the ratings table.")


There are 943 unique users in the ratings table.


In [10]:
# number of unique movies
unique_movies = ratings.groupby('movie_id').size().count() 
print(f"There are {unique_movies} unique movies in the ratings table, and {len(movies)} movies listed total.")


There are 1682 unique movies in the ratings table, and 1682 movies listed total.


In [11]:
# Other relevant shapes
print(f"shape of original ratings table {ratings.shape}")
print(f"shape of original movies table {movies.shape}")

shape of original ratings table (100000, 4)
shape of original movies table (1682, 24)


In [12]:
print(f"columns of ratings: {ratings.columns}")
print(f"columns of movies: {movies.columns}")

columns of ratings: Index(['user_id', 'movie_id', 'rating', 'unix_timestamp'], dtype='object')
columns of movies: Index(['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url',
       'genre_unknown', 'Action', 'Adventure', 'Animation', 'Children',
       'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
       'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western'],
      dtype='object')


In [13]:
# Merge movies and ratings dataframes
# Only keep the 'user_id', 'title', 'rating' fields
user_ratings_df = pd.merge(ratings[['user_id', 'movie_id', 'rating']], movies[['movie_id', 'title']], how='left', on='movie_id')
user_ratings_df.drop(columns=['movie_id'], inplace=True)
# Before working with the user ratings, have to remove the duplicates; there are 307 records with duplicate entries of user_id and title
# we'll average them together because that's their experience
user_ratings_df = user_ratings_df.groupby(['user_id', 'title'], as_index=False)['rating'].mean()
user_ratings_df.head()

Unnamed: 0,user_id,title,rating
0,1,101 Dalmatians (1996),2.0
1,1,12 Angry Men (1957),5.0
2,1,"20,000 Leagues Under the Sea (1954)",3.0
3,1,2001: A Space Odyssey (1968),4.0
4,1,"Abyss, The (1989)",3.0


In [14]:
# I'll use these for examining content-based recommendations later on
movies_df = movies[['title']+genre_cols].set_index('title')
movies_df.head()

Unnamed: 0_level_0,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
Toy Story (1995),0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
GoldenEye (1995),0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
Four Rooms (1995),0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
Get Shorty (1995),0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
Copycat (1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0


As mentioned in the introduction, content-Based Filtering is a recommendation engine approach that focuses on the attributes or features of items (products, movies, music, articles, etc.) and leverages these features to make personalized recommendations. The underlying idea is to match the characteristics of items with the preferences of users to suggest items that align with their interests. Content-based filtering is particularly useful when explicit user-item interactions (e.g., ratings or purchases) are sparse or unavailable.

**Key Steps in Content-Based Filtering:**

1. **Feature Extraction:**
   - For each item, relevant features are extracted. These features are typically descriptive attributes that can be represented numerically, such as genre, director, actors, author, publication date, and keywords.
   - In the case of text-based items, natural language processing techniques may be used to extract features like TF-IDF (Term Frequency-Inverse Document Frequency) scores.

2. **User Profile Creation:**
   - A user profile is created based on the items they have interacted with in the past. The user profile contains the weighted importance of features based on their interactions.
   - For example, if a user has watched several action movies, the action genre feature would receive a higher weight in their profile.

3. **Similarity Calculation:**
   - The similarity between items or between items and the user profile is calculated using similarity metrics like cosine similarity, Euclidean distance, or Pearson correlation.
   - Cosine similarity is commonly used as it measures the cosine of the angle between two vectors, which represents their similarity.

4. **Recommendation:**
   - Items that are most similar to the user profile are recommended to the user. These are items whose features have the highest similarity scores with the user profile.
   - The recommended items are presented as a list sorted by their similarity scores.

**Advantages of Content-Based Filtering:**
1. **No Cold-Start Problem:** Content-based filtering can make recommendations even for new users with no historical interactions because it relies on item features rather than user history.

2. **User Independence:** The recommendations are based solely on the features of items and do not require knowledge of other users' preferences or behavior.

3. **Transparency:** Content-based recommendations are interpretable, as they depend on the features of items, making it easier for users to understand why specific items are recommended.

4. **Serendipity:** Content-based filtering can recommend items with characteristics not seen before by the user, leading to serendipitous discoveries.

5. **Diversity in Recommendations:** The method can offer diverse recommendations since it suggests items with different feature combinations.

**Limitations of Content-Based Filtering:**
1. **Limited Discovery:** Content-based filtering may struggle to recommend items outside the scope of users' historical interactions or interests.

2. **Over-Specialization:** Users may receive recommendations that are too similar to their previous choices, leading to a lack of exposure to new item categories.

3. **Dependency on Feature Quality:** The quality and relevance of item features significantly influence the quality of recommendations.

4. **Limited for Cold Items:** Content-based filtering can struggle to recommend new items with limited feature information.

Here is your task:

1. Write a function that takes in a user id and the dataframe you created before that contains 'user_id', 'title', and 'rating'. The function should return content-based recommendations for this user. Here are steps you can take:

  A. Get the user's rated movies

  B. Create a TF-IDF matrix using movie genres. Note, this can be extracted from the `movies` dataframe.

  C. Compute the cosine similarity between movie genres. Use the [cosine_similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) function.

  D. Get the indices of similar movies to those rated by the user based on cosine similarity. Keep only the top 5.

  E. Remove duplicates and movies already rated by the user.

In [15]:
# Content-Based Filtering using Movie Genres

# build the genre matrix with user ratings
movies_ratings_df = pd.merge(user_ratings_df, movies[['title']+genre_cols], on='title' ) 
movies_ratings_df

# apply ratings to all genres per user and movie
#user_genre_wt_df = movies_ratings_df[genre_cols].multiply(movies_ratings_df['rating'], axis=0)
#user_genre_wt_df['user_id'] = movies_ratings_df['user_id'] # add the user_id that was dropped in prior op
#user_genre_wt_df = wt_df.groupby('user_id')[genre_cols].mean() # consolidate user preferences, weighted average means by genre
#user_genre_wt_df.head()


Unnamed: 0,user_id,title,rating,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,101 Dalmatians (1996),2.0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,1,12 Angry Men (1957),5.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,"20,000 Leagues Under the Sea (1954)",3.0,0,0,1,0,1,0,0,...,1,0,0,0,0,0,1,0,0,0
3,1,2001: A Space Odyssey (1968),4.0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,1,1,0,0
4,1,"Abyss, The (1989)",3.0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100650,943,"Wizard of Oz, The (1939)",3.0,0,0,1,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
100651,943,Wolf (1994),2.0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
100652,943,Wyatt Earp (1994),1.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
100653,943,Young Guns (1988),4.0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1


Using the <a href='https://scikit-learn.org/1.5/modules/generated/sklearn.preprocessing.normalize.html'>scikit-learn Normalize</a> class to normalize the dataset by rows

In [16]:
### Unused code block, will do this in the function bodies # normalize the rows for all users 
# not the set_output method with Normalizer doesn't retain the column names or index, but this direct iloc assignment does
# found here: https://stackoverflow.com/questions/52007165/normalizing-rows-of-pandas-dataframe
#user_genre_wt_df.iloc[:,:] = Normalizer().fit_transform(wt_df)
#user_genre_wt_df.head()

In [17]:
### Unused code block, will do this in the function bodies # Compute the cosine similarity for all movies based on their genres; indexing by title for lookup
# I suppose this would be useful for making recommendations based on a specific movie or movies just watched
#cs_movies_df = pd.DataFrame(cosine_similarity(movies[genre_cols]), index=movies['title'], columns=movies['title'])
#cs_movies_df.iloc[:7,:7]

In [18]:
# Content-Based Filtering using Movie Genres
def content_based_recommendation(user_id, mdf, gdf):
    # old code
    #this_user_df = user_genre_wt_df.loc[user_id]
    #user_genre_wt_df = gdf.multiply(gdf['rating'], axis=0))
    #user_genre_wt_df['user_id'] = movies_ratings_df['user_id'] # add the user_id that was dropped in prior op
    #user_genre_wt_df = wt_df.groupby('user_id')[genre_cols].mean() # consolidate user preferences, weighted average means by genre
    #csa_df = pd.DataFrame(csa, index=user_tfid_df.index, columns=user_tfid_df.index)
    
    # Get the user's rated movies
    user_df = mdf[mdf['user_id']==user_id]
    # Join the user's rated movie with their genres
    user_df = pd.merge(user_df, gdf, on='title', how='left')
    user_df.set_index('title', inplace=True)
    # apply the ratings to the genres
    user_df = user_df[genre_cols].multiply(user_df['rating'], axis=0)
    # get the means by genre for the user
    user_mean_df = user_df.mean()
    print(f"User {user_id} weights by genre: \n{user_mean_df}")
    #  Normalizing should not be done before the means, otherwise it loses the weighting of the counts of ratings.
    # for example for user 162, their original sumproduct of Comedy = 37, and Thriller = 52.
    # taking the Mean of these, shows Comedy = 0.88 and Thrillers = 1.23 reflecting the weight of the positive ratings and number of movies seen of that genre
    # If we normalize before taking the mean, Comedy = 0.227 and Thrillers = 0.223 which would give precedence to Comedies, but that's not their viewing pattern 
    # their viewing pattern shows they've watched 42 movies, but only 11 of those are Comedies and 14 are Thrillers, 
    #   so I'd expect a bias towards Thrillers instead of Comedies in the top content based recommendations, though there may be some Comedies in the top 10
    
    # Get the list of movies not rated by the user
    unseen_df = movies_df.drop(user_df.index) 
    # Compute the cosine similarity between unseen movies and user preferences by genre
    # TODO : I don't understand why this reshape works, when the original DataCamp exercises used (-1,1) for the reshape
    csa_df = pd.DataFrame(cosine_similarity(user_mean_df.values.reshape(1,-1), unseen_df).T, index=unseen_df.index, columns=['score']) 
    # Get the indices of the similar movies based on cosine similarity
    csa_df.sort_values(by='score', inplace=True, ascending=False)
    return csa_df
  

The key idea behind collaborative filtering is that users who have agreed in the past will likely agree in the future. Instead of relying on item attributes or user profiles, collaborative filtering identifies patterns of user behavior and item preferences from the interactions present in the data.

**Types of Collaborative Filtering:**
There are two main types of collaborative filtering:

**Collaborative Filtering Process:**
The collaborative filtering process typically involves the following steps:

1. **Data Collection:**
   - Gather data on user-item interactions, such as movie ratings, product purchases, or article clicks.

2. **User-Item Matrix:**
   - Organize the data into a user-item matrix, where rows represent users, columns represent items, and the entries contain the users' interactions (e.g., ratings).

3. **Similarity Calculation:**
   - Calculate the similarity between users or items using similarity metrics such as cosine similarity, Pearson correlation, or Jaccard similarity.
   - For user-based collaborative filtering, user similarities are calculated, and for item-based collaborative filtering, item similarities are calculated.

4. **Neighborhood Selection:**
   - For each user or item, select the most similar users or items as the neighborhood.
   - The size of the neighborhood (the number of similar users or items to consider) is an important parameter to control the system's behavior.

5. **Prediction Generation:**
   - Predict the ratings for items that the target user has not yet interacted with by combining the ratings of neighboring users or items.

6. **Recommendation Generation:**
   - Recommend items with the highest predicted ratings to the target user.

**Advantages of Collaborative Filtering using User-Item Interactions:**
- Collaborative filtering is based solely on user interactions and does not require knowledge of item attributes, making it useful for cases where item data is sparse or unavailable.
- It can provide serendipitous recommendations, suggesting items that users may not have discovered on their own.
- Collaborative filtering can be applied in various domains, including e-commerce, music, movie, and content recommendations.

**Limitations of Collaborative Filtering:**
- The cold-start problem: Collaborative filtering struggles to recommend to new users or items with no or limited interaction history.
- It may suffer from sparsity when data is limited or when users have only interacted with a small subset of items.
- Scalability issues can arise with large datasets and an increasing number of users or items.

Here is your task:

1. Write a function that takes in a user id and the dataframe you created before that contains 'user_id', 'title', and 'rating'. The function should return collaborative filtering recommendations for this user based on a user-item interaction matrix. Here are steps you can take:

  A. Create the user-item matrix using Pandas' [pivot_table](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html).

  B. Fill missing values with zeros in this matrix.

  C. Calculate user-user similarity matrix using cosine similarity.

  D. Get the array of similarity scores of the target user with all other users from the similarity matrix.

  E. Extract, say the the top 5 most similar users (excluding the target user).

  F. Generate movie recommendations based on the most similar users.

  G. Remove duplicate movies recommendations.

In [19]:

# Collaborative Filtering using User-Item Interactions
def collaborative_filtering_recommendation(user_id, df):
    #user_movies_df = df[df['user_id']==user_id]
    #user_movies_df = user_movies_df.pivot(index='user_id', columns='title', values='rating')
    
    # Create the user-item matrix
    pivot_df = df.pivot(index='user_id', columns='title', values='rating')
    
    # Fill missing values with 0 (indicating no rating)
    pivot_df.fillna(0, inplace=True)
  
    # Calculate user-user similarity matrix using cosine similarity
    csa = cosine_similarity(pivot_df)
    csa_df = pd.DataFrame(csa, index=pivot_df.index, columns=pivot_df.index)
    
    # Get the similarity scores of the target user with all other users
    user_csa = csa_df.loc[user_id]
    
    # Find the top N most similar users (excluding the target user)
    ordered_users_df = user_csa.sort_values(ascending=False)[1:6]
    print(ordered_users_df)
    
    # Generate movie recommendations based on the most similar users
    # personal notes: to implement, get a list of titles reviewed by those N most similar users
    #   then sum up the ratings across all the users, and divide by N (the number of other users)
    #   this gives us a weighted average rating, and movies that were not watched by all N similar users will be penalized
    #   so with this we can find top 5 movies watched and rated highly by all 5 similar users
    similar_movies_df = df[df['user_id'].isin(ordered_users_df.index)]
    similar_movies_df = similar_movies_df.groupby('title')['rating'].sum().to_frame()/len(ordered_users_df.index)
    similar_movies_df.rename(columns={'rating':'score'}, inplace=True)
    
    # Remove duplicates from recommendations
    user_seen_movies = df[df['user_id']==user_id]['title']
    unseen_movies_df = similar_movies_df.drop(index=user_seen_movies, errors='ignore').sort_values(by='score', ascending=False)
    return unseen_movies_df.head(5)



Now, test your recommendations engines! Select a few user ids and generate recommendations using both functions you've written. Are the recommendations similar? Do the recommendations make sense?

In [20]:
# Test the recommendation engines

# First test for user 162, leading with the content-based recommendation
u162_top5 = content_based_recommendation(162, user_ratings_df, movies_df)
print(f"\nTop 5 content-based recommended movies :")
u162_top5.head()

User 162 weights by genre: 
genre_unknown    0.000000
Action           1.571429
Adventure        0.904762
Animation        0.095238
Children         0.166667
Comedy           0.880952
Crime            0.595238
Documentary      0.000000
Drama            0.904762
Fantasy          0.000000
Film-Noir        0.000000
Horror           0.071429
Musical          0.000000
Mystery          0.000000
Romance          0.476190
Sci-Fi           0.833333
Thriller         1.238095
War              0.380952
Western          0.000000
dtype: float64

Top 5 content-based recommended movies :


Unnamed: 0_level_0,score
title,Unnamed: 1_level_1
Escape from L.A. (1996),0.810108
"Abyss, The (1989)",0.810108
Escape from New York (1981),0.810108
"Lost World: Jurassic Park, The (1997)",0.810108
Hard Target (1993),0.767694


In [21]:
movies_df.loc[u162_top5.index].head()

Unnamed: 0_level_0,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
Escape from L.A. (1996),0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
"Abyss, The (1989)",0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
Escape from New York (1981),0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
"Lost World: Jurassic Park, The (1997)",0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
Hard Target (1993),0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0


So in the Content-based recommendations, we can see that User #162 has watched a number of Action/Adventure/Thriller movies, and this list of 5 recommendations fits that trend as seen when the genres are listed out here for those 5 movies. On the surface, this seems like a good fit. Next, we'll see the same user, but compared to other users & ratings without the genres. 

In [22]:
# test the function for user_id 162
collab_rec_list = collaborative_filtering_recommendation(162, user_ratings_df)
print(f"\nTop 5 collaborative-based recommended movies :")
collab_rec_list

user_id
703    0.414729
117    0.388635
251    0.380955
432    0.372299
793    0.371296
Name: 162, dtype: float64

Top 5 collaborative-based recommended movies :


Unnamed: 0_level_0,score
title,Unnamed: 1_level_1
Men in Black (1997),4.4
Scream (1996),4.0
"Time to Kill, A (1996)",3.6
Mr. Holland's Opus (1995),3.6
Fargo (1996),3.0


In [23]:
# show the genres of these movies to compare against the user's genre preferences
movies_df[movies_df.index.isin(collab_rec_list.index)]

Unnamed: 0_level_0,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
Mr. Holland's Opus (1995),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
Fargo (1996),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0
Men in Black (1997),0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0
"Time to Kill, A (1996)",0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
Scream (1996),0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0


So on the surface, this is set of recommendations is quite different from the Content-based recommendations, and in fact we only have 1 Action/Adventure movie here in the top 5 recommendations. However, when looking at the movies watched and the ratings given by User 162 in the following Dataframe, a different story emerges, where only 3 movies have been given a "5" rating, and only one of them is Action/Adventure. This means that a movie like "Mr. Holland's Opus" may be a serendipitous finding of this methodology, where the average genre is less interesting than the specific movies watched and rated highly. This follows with User 162 giving "The People vs. Larry Flynt" a 5 rating, and so we can see that "Mr. Holland's Opus" may also be a good fit here since it was rated highly by other users with similar patterns of rating & watching.

In [24]:
movies_ratings_df[movies_ratings_df['user_id']==162].sort_values(by='rating', ascending=False).head()

Unnamed: 0,user_id,title,rating,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
16135,162,Multiplicity (1996),5.0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
16136,162,"People vs. Larry Flynt, The (1996)",5.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16150,162,Star Wars (1977),5.0,0,1,1,0,0,0,0,...,0,0,0,0,0,1,1,0,1,0
16119,162,"Birdcage, The (1996)",4.0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
16131,162,Jerry Maguire (1996),4.0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


Since I used User #162 during development, I had done a lot of digging into their data & patterns, so next I'll investigate a random User that has > 50 ratings, User #100 with 64 rated movies.

In [25]:
u100_top5 = content_based_recommendation(100, user_ratings_df, movies_df)
print(f"\nTop 5 content-based recommended movies :")
u100_top5.head()

User 100 weights by genre: 
genre_unknown    0.000000
Action           0.718750
Adventure        0.171875
Animation        0.000000
Children         0.031250
Comedy           0.515625
Crime            0.265625
Documentary      0.000000
Drama            1.656250
Fantasy          0.031250
Film-Noir        0.125000
Horror           0.093750
Musical          0.046875
Mystery          0.218750
Romance          0.515625
Sci-Fi           0.281250
Thriller         1.062500
War              0.265625
Western          0.000000
dtype: float64

Top 5 content-based recommended movies :


Unnamed: 0_level_0,score
title,Unnamed: 1_level_1
Outbreak (1995),0.866451
Apollo 13 (1995),0.866451
Tough and Deadly (1995),0.866451
Smilla's Sense of Snow (1997),0.866451
Fire Down Below (1997),0.866451


In [26]:
movies_df.loc[u100_top5.index].head()

Unnamed: 0_level_0,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
Outbreak (1995),0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0
Apollo 13 (1995),0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0
Tough and Deadly (1995),0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0
Smilla's Sense of Snow (1997),0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0
Fire Down Below (1997),0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0


From the genre-weighted Content-based recommendations, we can see that User #100 prefers Drama/Thriller/Action as their top genres with a very strong preference for Drama. From the recommended movies, these tick the box for Drama, and also Thriller and Action, so looks like a pretty good fit based on the content. Next I'll look at the collaboration ratings: 

In [27]:
# test the function for user_id 162
u100_collab_rec_list = collaborative_filtering_recommendation(100, user_ratings_df)
print(f"\nTop 5 collaborative-based recommended movies :")
u100_collab_rec_list

user_id
863    0.627904
784    0.600292
616    0.594277
856    0.580211
489    0.570090
Name: 100, dtype: float64

Top 5 collaborative-based recommended movies :


Unnamed: 0_level_0,score
title,Unnamed: 1_level_1
Cop Land (1997),4.2
"Devil's Advocate, The (1997)",3.8
"Edge, The (1997)",3.4
Alien: Resurrection (1997),2.8
Murder at 1600 (1997),2.8


In [28]:
# show the genres of these movies to compare against the user's genre preferences
movies_df[movies_df.index.isin(u100_collab_rec_list)]

Unnamed: 0_level_0,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1


In [29]:
movies_ratings_df[movies_ratings_df['user_id']==100].sort_values(by='rating', ascending=False).head()

Unnamed: 0,user_id,title,rating,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
11029,100,As Good As It Gets (1997),5.0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
11028,100,Apt Pupil (1998),5.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
11082,100,Titanic (1997),5.0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
11024,100,Air Force One (1997),4.0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
11030,100,"Big Bang Theory, The (1994)",4.0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


Similar to User #162, User #100 only had 3 movies with rating of "5". And in a similar fashion, the collaboration engine recommends some unusual movies based on viewing & rating patterns from similar users, yielding some surprising recommendations. Again, there is only one Action movie (Alien: Resurrection) recommended by the collaboration engine, yet that follows User #162's own trend, with only 1 Action movie in their "5" ratings grouping. What's a bit different from User #162, is that here the recommendation engine only recommended 1 Drama-genre movie, even though User #100 has watched quite a number of Dramas, they tend not to rate them highly. 

Test 3
* Finally, I'll choose another random user that has 20 rated movies and see how the engines do (there aren't any users here with fewer than 20 movies rated)
* This will be User #895

In [30]:
u895_top5 = content_based_recommendation(895, user_ratings_df, movies_df)
print(f"\nTop 5 content-based recommended movies :")
u895_top5.head()

User 895 weights by genre: 
genre_unknown    0.00
Action           1.25
Adventure        1.05
Animation        0.20
Children         0.45
Comedy           1.55
Crime            0.20
Documentary      0.00
Drama            0.85
Fantasy          0.00
Film-Noir        0.00
Horror           0.10
Musical          0.00
Mystery          0.20
Romance          1.60
Sci-Fi           0.65
Thriller         1.00
War              0.50
Western          0.00
dtype: float64

Top 5 content-based recommended movies :


Unnamed: 0_level_0,score
title,Unnamed: 1_level_1
"Princess Bride, The (1987)",0.846821
True Lies (1994),0.846821
"Empire Strikes Back, The (1980)",0.748516
So I Married an Axe Murderer (1993),0.744582
To Catch a Thief (1955),0.744582


In [31]:
# test the engines for user_id=685
movies_df.loc[u895_top5.index].head()

Unnamed: 0_level_0,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
"Princess Bride, The (1987)",0,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
True Lies (1994),0,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
"Empire Strikes Back, The (1980)",0,1,1,0,0,0,0,0,1,0,0,0,0,0,1,1,0,1,0
So I Married an Axe Murderer (1993),0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0
To Catch a Thief (1955),0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0


From the genre-weighted Content-based recommendations, we can see that User #895 prefers Romantic Comedies and occasionally Action as their top genres. From the recommended movies, these tick the boxes for Romance, Comedy, with some  Action, so looks like a pretty good fit based on the content, though "Empire Strikes Back" is a bit of a surprise but it still fulfills the requirements, and doesn't have any of the genres that are not liked, such as Fantasy or Film-Noir, so it's still a strong recommendation. Next I'll look at the collaboration ratings: 

In [32]:
u895_collab_rec_list = collaborative_filtering_recommendation(895, user_ratings_df)
print(f"\nTop 5 collaborative-based recommended movies :")
u895_collab_rec_list


user_id
759    0.458637
517    0.443175
597    0.432416
552    0.409502
730    0.408543
Name: 895, dtype: float64

Top 5 collaborative-based recommended movies :


Unnamed: 0_level_0,score
title,Unnamed: 1_level_1
Air Force One (1997),4.4
Contact (1997),3.6
"Godfather, The (1972)",2.8
Face/Off (1997),2.6
Independence Day (ID4) (1996),2.6


In [33]:
# show the genres of these movies to compare against the user's genre preferences
movies_df[movies_df.index.isin(u895_collab_rec_list)]


Unnamed: 0_level_0,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1


In [34]:
movies_ratings_df[movies_ratings_df['user_id']==895].sort_values(by='rating', ascending=False).head()

Unnamed: 0,user_id,title,rating,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
95971,895,Sense and Sensibility (1995),5.0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
95973,895,Star Wars (1977),5.0,0,1,1,0,0,0,0,...,0,0,0,0,0,1,1,0,1,0
95967,895,Return of the Jedi (1983),5.0,0,1,1,0,0,0,0,...,0,0,0,0,0,1,1,0,1,0
95964,895,Mighty Aphrodite (1995),5.0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
95976,895,Willy Wonka and the Chocolate Factory (1971),5.0,0,0,1,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0


Once again, the collaborative ratings are a bit surprising, and don't meet any of the expected genre recommendations. However, these appear to be highly rated by similar users, and personally I can see how a movie like "Contact" would be recommended given that the user rated "Star Wars" and "Return of the Jedi" highly. I think what may also be happening is we're getting more movies recommneded that have a very similar score, so if this was a system going into production, I'd probably add another 3rd layer of engine, something combining the genres and the collaboration results for cases like this where the user hasn't provided many recommendations yet, and is probably getting a bigger pool of potential movies as recommendations. Also the scores here are lower, with only 2 movies above a score of 3, and less than 3 seems to be a bit ambiguous. 

<B>Final thoughts:</B> It would be interesting to see how an algorithm like can evolve over time, especially if the user watches a movie recommended, and then rates it vastly different than the engine originally recommended. Perhaps there's some 2nd stage learning that could happen from there, along the lines of "watching these movies X/Y/Z shows a potential recommendation of A/B/C, but after empirically testing, users typically don't like B" - this may be a good fit for soemthing like a neural network, where more complex preferences can be mapped that are not apparent on the surface, or by doing more textual analysis with TF-IDF using plot information, or actual review text submitted by users.