<a href="https://colab.research.google.com/github/nalinis07/APT_Ref_Copy_Links/blob/MASTER/AT_Lesson_136_Reference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 136: Collaborative Filtering II - Cosine Similarity

---

**WARNING:** The reference notebook is meant **ONLY** for a teacher. Please **DO NOT** share it with any student.

The contents of the reference notebook are meant only to prepare a teacher for a class. To conduct the class, use the class copy of the reference notebook. The link and the instructions for the same are provided in the **Notes To The Teacher** section.


|Particulars|Description|
|-|-|
|**Topic**|Collaborative Filtering II - Cosine Similarity|
|||
|**Class Description**|In this class, a student will perform collaborative filtering using Cosine Similarity to build a movie recommender|
|||
|**Class**|C136|
|||
|**Class Time**|50 minutes|
|||
|**Goal**|Understand Cosine Similarity|
||Understand `tolist()`, `sorted()` functions|
||Understand the `pandas.DataFrame.from_dict` class method and the `isin()` method|
||Recommend similar movies based on other users' ratings|
|**Teacher Resources**|Google Account|
||Laptop with internet connectivity|
||Earphones with mic|
|||
|**Student Resources**|Google Account|
||Laptop with internet connectivity|
||Earphones with mic|

---

### Teacher-Student Activities

In the previous class, we built a collaborative filtering based movie recommender using the Pearson correlation coefficient as the similarity score.

In this class, we will build a movie recommender by performing collaborative filtering using the **Cosine Similarity** score.

Let us first recall the concepts covered in the previous class and begin this class from **Activity 1: Understanding Cosine Similarity**.




---

#### What are Collaborative Filters?

While shopping through e-commerce platforms, you must have encountered:

**Customers who bought Macbook Pro also purchased: 'ProDisplay XR' | 'LG Gaming Monitor' | 'AirPods'**

Similarly, some movie hosting/OTT platforms make suggestions such as:

If you are watching **Inception**:

**Customers also watched: 'The Matrix' | 'Gravity' | 'Tenet'**

Such suggestions are given to a user on the basis of the likes and dislikes of similar users. This is exactly what Collaborative filters do.

**Collaborative filtering** builds a model from the user's past behaviour (i.e. items purchased or searched by the user) as well as similar decisions made by other users. This model is then used to predict items that users may have an interest in.

Let us now understand the problem statement in more detail.

**Problem Statement:**

- We will build an intelligent recommender that recommends movies to a customer, say **X**, based on the customer's watch history.
- First, we need to find other users who have watched the same movies along with some other movies, and then suggest to customer **X** the movies that were appreciated by those users.

<center><img src=https://s3-whjr-v2-prod-bucket.whjr.online/whjr-v2-prod-bucket/8346c283-08f7-46c5-b37f-96a3eac57800.png></center>

In this way the customers are likely to appreciate the recommendation and as a result stay connected to the streaming platform.

Let us now explore the datasets that will be used to solve this problem statement.

---

#### Datasets

We will use the following three datasets to set up a recommender system that will recommend movies to a user based on ratings given by other users:

**1. The `movies_metadata.csv` file:**

- This is the main Movies Metadata file.
- It contains information on 45,000 movies featured in the Full [MovieLens](https://movielens.org) database.

  **Note:** This was the same dataset which we had used to build simple movie recommenders in the previous lesson.

- The features are described below:

  **Attribute Information:**
  ```
    adult: Indicates if the movie is X-Rated or Adult.
    belongs_to_collection: A stringified dictionary that gives information on the movie series the particular film belongs to.
    budget: The budget of the movie in dollars.
    genres: A stringified list of dictionaries that list out all the genres associated with the movie.
    homepage: The Official Homepage of the movie.
    id: The TMDB ID of the movie.
    imdb_id: The IMDB ID of the movie.
    original_language: The language in which the movie was originally shot in.
    original_title: The original title of the movie.
    overview: A brief blurb of the movie.
    popularity: The Popularity Score assigned by TMDB.
    poster_path: The URL of the poster image.
    production_companies: A stringified list of production companies involved with the making of the movie.
    production_countries: A stringified list of countries where the movie was shot/produced in.
    release_date: Theatrical Release Date of the movie.
    revenue: The total revenue of the movie in dollars.
    runtime: The runtime of the movie in minutes.
    spoken_languages: A stringified list of spoken languages in the film.
    status: The status of the movie (Released, To Be Released, Announced, etc.)
    tagline: The tagline of the movie.
    title: The Official Title of the movie.
    video: Indicates if there is a video present of the movie with TMDB.
    vote_average: The average rating of the movie.
    vote_count: The number of votes by users, as counted by TMDB.
 ```

**2. The `links.csv` file:**

- This file contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.
- The features are described below:
  ```
  movieId: A unique identifier for each movie
  imdbId: The IMDB ID of the movie
  tmdbId: The TMDB ID of the movie
  ```


**3. The `ratings_small.csv` file:**

- This file is a subset of 100,000 ratings from 700 users on 9,000 movies.
- The features are described below:
  ```
  userId: The user ID of the subscriber
  movieId: A unique identifier for each movie
  rating: Rating given by a subscriber (Out of 5)
  timestamp: Time at which the rating was recorded
  ```



**Acknowledgement:** These datasets are an ensemble created by Rounak Banik using the data collected from TMDB and GroupLens.

**Dataset Source:** https://www.kaggle.com/rounakbanik/the-movies-dataset

---

#### Recap

Importing Modules and Reading Data

Let us load the first dataset `movies_metadata.csv` into a pandas DataFrame.

**The `movies_metadata.csv` Dataset link:** https://drive.google.com/uc?id=1rPR-P45M2UWsbXc8vpyCzWcQAYUfgVJX




In [None]:
# Import the required modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# Load the Movies Metadata Dataset
df = pd.read_csv('https://drive.google.com/uc?id=1rPR-P45M2UWsbXc8vpyCzWcQAYUfgVJX')

# Create 'movies_df' DataFrame consisting of columns: 'id', 'imdb_id', 'title'
# Use 'copy()' so that the later in-place changes do not act on a slice of 'df'.
movies_df = df[['id', 'imdb_id', 'title']].copy()

# Drop missing values from the DataFrame.
movies_df.dropna(inplace = True)

# Convert data-type of 'id' column to float
movies_df['id'] = movies_df['id'].astype('float')

# Load 'links.csv' file into 'links_df' DataFrame.
links_df = pd.read_csv('https://drive.google.com/uc?id=1Hn83CnGeHG6evq274ztIm6VcOrImBOAF')

# Merge 'movies_df' and 'links_df' DataFrames
m_links_df = pd.merge(movies_df, links_df, left_on ='id', right_on ='tmdbId')

# Obtain the final DataFrame consisting of only 'movieId' and 'title' columns.
m_df = m_links_df[['movieId', 'title']]

# Load 'ratings_small.csv' file into 'ratings_df' DataFrame.
ratings_df = pd.read_csv('https://drive.google.com/uc?id=17xgnHVj8in4SxBGh7j9daAbtYa2zz8fw')

# Drop 'timestamp' column from 'ratings_df' DataFrame.
ratings_df = ratings_df.drop('timestamp', axis=1)

# Merge 'm_df' and 'ratings_df' DataFrames.
final_movies_df = pd.merge(m_df, ratings_df, on = 'movieId')

---

#### Data Analysis

Let us first find out the average rating of each movie by grouping movies based on their title.

In [None]:
# Group the DataFrame by 'title' column and use 'mean()' function to determine average rating.
final_movies_df.groupby('title')['rating'].mean()

# Print top 5 movies having highest mean rating.
final_movies_df.groupby('title')['rating'].mean().sort_values(ascending = False).head()

# Count the number of ratings given to each movie.
final_movies_df.groupby('title')['rating'].count()

# Print top 5 movies having highest count of ratings.
final_movies_df.groupby('title')['rating'].count().sort_values(ascending = False).head()

# Create a DataFrame with average rating and number of ratings for each movie.
all_movies_ratings = pd.DataFrame(final_movies_df.groupby('title')['rating'].mean())
all_movies_ratings['num of ratings'] = pd.DataFrame(final_movies_df.groupby('title')['rating'].count())

# Create a pivot table with index ='userId', columns ='title', values ='rating'
user_ratings = final_movies_df.pivot_table(index ='userId', columns ='title', values ='rating')
# Note: There will be a lot of NaN values in the obtained pivot table, because most people have not seen most of the movies.

# Calculate correlation coefficient between each pair of movies using 'corr()' function.
similarity_df = user_ratings.corr()

# Create a DataFrame containing the correlation coefficients of other movies with 'Toy Story'
similar_to_toystory = similarity_df["Toy Story"]
similar_to_toystory_df = pd.DataFrame(similar_to_toystory)

# Rename the column to 'correlation'.
similar_to_toystory_df.rename(columns={similar_to_toystory_df.columns[0]: 'correlation'}, inplace = True)

# Sort the above DataFrame by 'correlation' column to find top 10 highly correlated movies.
similar_to_toystory_df.sort_values('correlation',ascending=False).head(10)

# Obtain the number of ratings of each movie along with the correlation coefficients
# by joining 'all_movies_ratings['num of ratings']' DataFrame with the above DataFrame.
corr_toystory = similar_to_toystory_df.join(all_movies_ratings['num of ratings'])

# Keep only those movies whose number of ratings is greater than 100.
# Sort them in descending order of correlation and print the first 10 values.
corr_toystory[corr_toystory['num of ratings'] > 100].sort_values('correlation',ascending=False).head(10)

| title | correlation | num of ratings |
| -- | -- | -- |
| Toy Story | 1.0 | 247 |
| Toy Story 2 | 0.743352 | 125 |
| A Bug's Life | 0.677299 | 105 |
| Monsters, Inc. | 0.549582 | 130 |
| The Dark Knight | 0.540978 | 121 |
| Finding Nemo | 0.537958 | 122 |
| Austin Powers: The Spy Who Shagged Me | 0.519847 | 112 |
| The Lion King | 0.517524 | 200 |
| Spider-Man | 0.512995 | 134 |
| The Incredibles | 0.508661 | 126 |


Hence, we obtained a DataFrame which contains, for each movie, its correlation with 'Toy Story' and its number of ratings.
Let's move on to creating a collaborative filtering based recommendation system.


---

#### Recommending a Movie

Say we have a **user X** as our target person, for whom we want to recommend the best movie to watch. Suppose the following user data is available to you.

<center><img src=https://s3-whjr-v2-prod-bucket.whjr.online/whjr-v2-prod-bucket/602074ed-3289-4349-89b3-cce587168baf.png>

`Table 1: Users rating and watch history database`</center>
<br>


To obtain recommendations for similar movies based on the ratings given by other users (as given in Table 1), we will compute a **similarity score**. Collaborative filters can use a variety of similarity scores, for example:

1. Pearson Correlation Coefficient

2. Cosine Similarity

3. Singular Value Decomposition and a lot more.

In [None]:
# Define 'recommend_movies()' function.
def recommend_movies(movie_name):
  similar_movies = similarity_df[movie_name]
  similar_movies_df = pd.DataFrame(similar_movies)
  similar_movies_df.rename(columns = {similar_movies_df.columns[0]: 'correlation'}, inplace = True)
  corr_num_ratings = similar_movies_df.join(all_movies_ratings['num of ratings'])
  return corr_num_ratings[corr_num_ratings['num of ratings'] > 100].sort_values('correlation',ascending = False).head(10)

# Call 'recommend_movies()' function and pass 'Star Wars' as input.
recommend_movies('Star Wars')

| title | correlation | num of ratings |
| -- | -- | -- |
| Star Wars | 1.0 | 291 |
| Return of the Jedi | 0.747774 | 217 |
| The Empire Strikes Back | 0.70079 | 234 |
| The Dark Knight | 0.549486 | 121 |
| The Lord of the Rings: The Fellowship of the Ring | 0.477582 | 200 |
| Raiders of the Lost Ark | 0.476442 | 220 |
| The Incredibles | 0.450914 | 126 |
| The Lord of the Rings: The Two Towers | 0.448153 | 188 |
| E.T. the Extra-Terrestrial | 0.428289 | 160 |
| Star Trek: Generations | 0.413682 | 114 |


The recommendation engine defined above was based on correlation values using the `corr()` function.

**Correlation:**

- Correlation measures the strength of a linear relationship between two variables.
- A correlation coefficient is a number between -1 and 1. Values close to -1 indicate a strong negative correlation, values close to 1 indicate a strong positive correlation, and a value of zero indicates no linear correlation.

**The corr() Function:**
  
  To calculate the correlation coefficient between every pair of numeric columns in a DataFrame, use the `corr()` method of a `pandas` DataFrame. For a DataFrame with $N$ numeric columns, it returns an $N \times N$ DataFrame containing the correlation coefficient of each pair of columns.
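
As a minimal sketch of how `corr()` behaves, consider the toy ratings table below. The movie names and rating values are made up for illustration and are not taken from the dataset. The result has one row and one column per movie.

```python
import pandas as pd

# Hypothetical ratings given by four users to three movies (illustrative values only)
toy_ratings = pd.DataFrame({
    'Movie A': [5.0, 4.0, 2.0, 1.0],
    'Movie B': [4.5, 4.0, 2.5, 1.5],
    'Movie C': [1.0, 2.0, 4.0, 5.0]
})

# Pairwise Pearson correlation between the numeric columns
toy_corr = toy_ratings.corr()

print(toy_corr.shape)      # (3, 3): one row and one column per movie
print(toy_corr.round(2))   # the diagonal is 1.0 because each movie correlates perfectly with itself
```

In this toy table, 'Movie A' and 'Movie C' are rated in opposite order by the users, so their correlation coefficient is negative.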

<br>

The recommendation engine designed so far is quite good, however:

1. There are high chances that the user might have watched most of the movies recommended by this type of recommender system.

2. The movies suggested by this recommendation engine are mostly of same type which might not be a great idea. As a user you sometimes want to watch comedy, sometimes thrillers and so on.

**In this class, we will build a collaborative filtering based recommendation engine using Cosine Similarity.**

---

#### Activity 1: Understanding Cosine Similarity

**Measuring Similarity**

Measuring similarity between two points/vectors is a bit different from calculating distance. For instance, the distance between two points can easily be calculated using Euclidean distance.

When we talk about similarity, we measure how closely two vectors point in the same direction rather than how far apart they are. The similarity can be expressed in the range of $-1$ to $1$.

1. A similarity score of 1 indicates overlapping vectors, i.e. vectors pointing in the same direction.

2. A similarity score of 0 indicates orthogonal/perpendicular points/vectors.

3. A similarity score of -1 indicates opposite points/vectors.

Consider two vectors represented by $\vec{A}$ and $\vec{B}$, making an angle of $\theta$ with each other. Mathematically the similarity between these two vectors is given by:

\begin{align}
\text{sim}(A,B) = \text{cos}(\theta) = \frac{A \cdot B}{||A|| \times ||B||}
\end{align}

We can say that **Cosine Similarity** measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.

For further reading visit: [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)

For example, consider the vectors given in the figure below:



<center><img src=https://s3-whjr-v2-prod-bucket.whjr.online/whjr-v2-prod-bucket/d5a5a766-54a6-4b7b-9a24-4eec4d20c1b7.png width=800>

Fig 1.1 Vectors in a 2-D space</center>

1. Consider the vectors $\vec{u}$, and $\vec{b}$:

  The angle between these vectors is $0^o$. Hence the Cosine Similarity score between these two vectors can be calculated as:

\begin{align}
\text{sim}(u,b) = \text{cos}(0) = 1
\end{align}
  
&emsp;&emsp; A similarity score of 1 suggests that these vectors are overlapping which can be verified from figure 1.1

2. Consider the vectors $\vec{u}$, and $\vec{v}$:

  The angle between these vectors is $45^o$. Hence the Cosine Similarity score between these two vectors can be calculated as:

\begin{align}
\text{sim}(u,v) = \text{cos}(45) = 0.7071
\end{align}

&emsp;&emsp; Hence, the similarity of any two vectors inclined at an angle $\alpha$ depends on the cosine value of the angle.

3. Consider the vectors $\vec{u}$, and $\vec{w}$:

  The angle between these vectors is $90^o$. Hence the Cosine Similarity score between these two vectors can be calculated as:

\begin{align}
\text{sim}(u,w) = \text{cos}(90) = 0
\end{align}

&emsp;&emsp; A similarity score of 0 suggests that these vectors are orthogonal which can be verified from figure 1.1

4. Consider the vectors $\vec{u}$, and $\vec{a}$:

  The angle between these vectors is $180^o$. Hence the Cosine Similarity score between these two vectors can be calculated as:

\begin{align}
\text{sim}(u,a) = \text{cos}(180) = -1
\end{align}

&emsp;&emsp; A similarity score of -1 suggests that these vectors point in opposite directions which can be verified from figure 1.1
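
The four cases above can be checked numerically with the formula $\frac{A \cdot B}{||A|| \times ||B||}$. The sketch below is only an illustration: the coordinates are assumptions chosen so that the vectors make angles of $0^o$, $45^o$, $90^o$ and $180^o$ with $\vec{u}$ (they are not read from figure 1.1), and `cosine_sim()` is a hypothetical helper name.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two non-zero vectors: (a . b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumed coordinates (not taken from the figure)
u = [1, 0]
others = {
    'b (0 degrees)': [2, 0],
    'v (45 degrees)': [1, 1],
    'w (90 degrees)': [0, 1],
    'a (180 degrees)': [-1, 0],
}

for name, vec in others.items():
    print(name, round(cosine_sim(u, vec), 4))
```

Note that $\vec{b}$ is twice as long as $\vec{u}$ in this sketch, yet the score is still 1: Cosine Similarity depends only on the direction of the vectors, not on their length.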

---

Consider two users, User X and User A, and their ratings for two random movies as given in the table below:

| User | The Dark Knight | Iron Man |
| -- | -- | -- |
| User X | 4 | 3|
| User A | 5 | 5 |

Let's calculate the similarity between the users.

\begin{align}
\text{sim}(X, A) = \frac{X \cdot A}{||X|| \times ||A||}
\end{align}

1. Calculate $ X \cdot A$:

\begin{align}
X \cdot A = (4 \times 5) + (3 \times 5) = 35
\end{align}

2. Calculate $||X|| \times ||A||$:

\begin{align}
||X|| = \sqrt{4^2 + 3^2} = 5 \\
||A|| = \sqrt{5^2 + 5^2} = 7.07
\end{align}

3. Obtain the similarity between users:

\begin{align}
\text{sim}(X, A) = \frac{X \cdot A}{||X|| \times ||A||} = \frac{35}{5 \times 7.07} = 0.99
\end{align}

The similarity score of 0.99 suggests that the users are likely to provide almost similar ratings to a given movie.
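
The hand calculation above can be reproduced with a few lines of plain Python. This is only a verification sketch; the variable names are chosen for illustration. It also converts the score back into the angle between the two rating vectors, which ties the result to the geometric picture from figure 1.1.

```python
import math

x = [4, 3]   # User X: The Dark Knight, Iron Man
a = [5, 5]   # User A: The Dark Knight, Iron Man

dot = sum(p * q for p, q in zip(x, a))        # 4*5 + 3*5
norm_x = math.sqrt(sum(p ** 2 for p in x))    # sqrt(4^2 + 3^2)
norm_a = math.sqrt(sum(q ** 2 for q in a))    # sqrt(5^2 + 5^2)
similarity_xa = dot / (norm_x * norm_a)

# Angle (in degrees) between the two rating vectors
angle = math.degrees(math.acos(similarity_xa))

print(dot, norm_x, round(norm_a, 2), round(similarity_xa, 2), round(angle, 2))
# prints: 35 5.0 7.07 0.99 8.13
```

An angle of only about $8^o$ between the rating vectors is what a score close to 1 means geometrically.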

**Note:** The similarity score between users is a dynamic parameter. It changes whenever the respective users rate new movies.

<br>

Suppose the same users watch one more movie and give it the following ratings:

| User | The Dark Knight | Iron Man | Toy Story |
| -- | -- | -- | -- |
| User X | 4 | 3| 4 |
| User A | 5 | 5 | 2 |

This time let's use `sklearn` to obtain the `cosine_similarity` value for these users. For this:

1. Import `cosine_similarity` from the `sklearn.metrics.pairwise` module.

2. Use `cosine_similarity()`.

In [None]:
# S1.1: Calculate Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([[4, 3, 4]], [[5, 5, 2]])
print(similarity)
print('The variable similarity is of: ',type(similarity))

[[0.91385996]]
The variable similarity is of:  <class 'numpy.ndarray'>


Here, we observe that due to the different ratings of the $3^{rd}$ movie 'Toy Story', the similarity score between the users has decreased.

**Note:**

1. Passing 1-dimensional arrays as input data was deprecated in `sklearn` version 0.17 and raises a `ValueError` in later versions. Hence, the user ratings are passed to the `cosine_similarity()` function as 2-dimensional arrays (a list containing one list of ratings per user).

2. You can also observe that the output: `similarity` score generated by `cosine_similarity()` function is a 2-d numpy array.
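
If the ratings are already stored as 1-dimensional NumPy arrays, one way to satisfy this requirement is to reshape each of them into a single-row 2-dimensional array. The sketch below assumes the same ratings as in the code cell above; the names `ratings_x` and `ratings_a` are chosen for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings_x = np.array([4, 3, 4])   # 1-D array of shape (3,)
ratings_a = np.array([5, 5, 2])   # 1-D array of shape (3,)

# reshape(1, -1) turns each 1-D array into a 2-D array with one row, shape (1, 3)
similarity = cosine_similarity(ratings_x.reshape(1, -1), ratings_a.reshape(1, -1))

print(similarity.shape)   # (1, 1): one row per vector in the first input, one column per vector in the second
print(similarity[0][0])   # the similarity score as a single number
```

Indexing with `[0][0]` extracts the single score from the 2-d array, which is convenient when the value has to be stored or compared later.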

<br>


Suppose the users watch one more movie based on the previous recommendation engine. Based on the previously watched movie: 'Toy Story', the recommender engine would suggest users to watch 'Toy Story 2'.

Say, the users watched the movie 'Toy Story 2' based on recommendation and provided the following rating to the movie.


| User | The Dark Knight | Iron Man | Toy Story | Toy Story 2 |
| -- | -- | -- | -- | -- |
| User X | 4 | 3| 4 | 5 |
| User A | 5 | 5 | 2 | 2 |

We can safely say that the similarity score of these 2 users shall decrease further as they are showing different choices in movies.

Let us again calculate the similarity score of these users based on the updated ratings and print it.

In [None]:
# S1.2: Obtain the similarity scores based on the updated user ratings

similarity = cosine_similarity([[4, 3, 4, 5]], [[5, 5, 2, 2]])
print(similarity)

[[0.85662334]]


As expected, the similarity score is further reduced because the users like different movies.

- From this you can also conclude that even if the movie a user watched most recently was 'Toy Story', suggesting 'Toy Story 2' is not necessarily the best recommendation once the user's ratings and watch history are taken into account.

- This means we need to modify our recommendation engine to account for the user's watch history as well as the ratings of the movies in that watch history.

- This will ensure a better personalised movie title recommendation.

- Moving forward, we will apply **Cosine Similarity** scores to our ratings dataset.

---

#### Activity 2: Data Preparation

Let's elaborate the movie recommendation problem statement.

Say we have a **User X** as our target person with a particular watch history, for whom we want to recommend the best movie to watch. Suppose the following user data is available to you.

<center><img src=https://s3-whjr-v2-prod-bucket.whjr.online/whjr-v2-prod-bucket/602074ed-3289-4349-89b3-cce587168baf.png>

Table 1: Users rating and watch history database</center>

**Question:** What logic will you use to recommend movies to User X based on the ratings given in Table 1?

**Answer:** If our target is User X, we can make the following observations:

1. User X and User D have no common watch history. Hence, ratings from User D are not useful for recommending a movie to User X.

2. User B has watched every movie User X has watched and the ratings are also similar. Hence, we can surely recommend a movie from the watch history of User B.

3. User C and User A also have some part of watch history common to User X, hence we can consider their watch history to recommend a movie to User X.

<br>

**Recommendation Objective**

Imagine you have access to this kind of dataset from thousands/millions of users. Now, the main question is which movies to recommend to a particular user. Figure 2.1 below shows a recommendation system logic that is commonly used by streaming platforms.

<center><img src=https://s3-whjr-v2-prod-bucket.whjr.online/whjr-v2-prod-bucket/b329ec36-1c14-4313-8f28-a2b85ee23b72.png>

Figure 2.1: Recommendation engine design logic</center>


In simpler words, to personalize the recommendation engine for any User X having a watch history of M movies:

**Step 1**. We will get the most similar users of the User X.

**Step 2**. We will get the most similar movies of the movie set M.

**Step 3**. Then, for each movie, we combine the ratings given by the similar users with the similarity scores of those users.

**Step 4**. We will recommend only the items that have higher ratings given by the similar users.
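
To make Steps 1, 3 and 4 more concrete, here is a rough sketch of one common way to combine ratings with similarity scores: a similarity-weighted average. This is an assumption made for illustration and not necessarily the exact method used later in this lesson. All user names, movie names and numbers are hypothetical, and Step 2 is not covered by the sketch.

```python
# Step 1 (assumed already done): similarity scores of three similar users with User X
similarity_scores = {'User A': 0.9, 'User B': 0.6, 'User C': 0.3}

# Hypothetical ratings these users gave to movies that User X has not watched yet
ratings = {
    'User A': {'Movie P': 5.0, 'Movie Q': 2.0},
    'User B': {'Movie P': 4.0, 'Movie Q': 3.0},
    'User C': {'Movie Q': 5.0},
}

# Step 3: weight every rating by the similarity score of the user who gave it
weighted_sum = {}
similarity_sum = {}
for user, movie_ratings in ratings.items():
    for movie, rating in movie_ratings.items():
        weighted_sum[movie] = weighted_sum.get(movie, 0.0) + similarity_scores[user] * rating
        similarity_sum[movie] = similarity_sum.get(movie, 0.0) + similarity_scores[user]

# Weighted average rating of each movie
scores = {movie: weighted_sum[movie] / similarity_sum[movie] for movie in weighted_sum}

# Step 4: recommend the movies with the highest weighted ratings first
recommendations = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(recommendations)
```

In this toy data 'Movie P' ends up ahead of 'Movie Q', because the users who are most similar to User X rated it highly, while the high rating for 'Movie Q' comes from the least similar user.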

<br>

Now, let's generate a random watch history to work upon as we would be recommending movies based on the watch history and movie ratings for the corresponding user.

In [None]:
# Predefined user watch history

user_history = [
            {'title':'Hotel Transylvania 2', 'rating':4},
            {'title':"Indiana Jones and the Temple of Doom", 'rating':4.5},
            {'title':"Indiana Jones and the Kingdom of the Crystal Skull", 'rating':4},
            {'title':'Men in Black II', 'rating':4}
         ]
user_history_df = pd.DataFrame(user_history)
user_history_df

| | title | rating |
| -- | -- | -- |
| 0 | Hotel Transylvania 2 | 4.0 |
| 1 | Indiana Jones and the Temple of Doom | 4.5 |
| 2 | Indiana Jones and the Kingdom of the Crystal S... | 4.0 |
| 3 | Men in Black II | 4.0 |


Here, we observe that the `user_history_df` consists of the movie `title` and `rating` columns. In order to correlate the user watch history with our database, we need to assign the `movieId` to the respective movies from the database.

First, let's look for each `title` from `user_history_df` in `m_links_df`. For this we will use:

1. `tolist()` method from pandas

2. `isin()` method from pandas

#### `tolist()` Function

`tolist()` is a versatile method that is available on NumPy arrays, pandas Series, pandas Index objects etc., and returns a Python list of the corresponding items.

For further reading visit: [`tolist()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.tolist.html)

Here, we will obtain the list of the movies the user has watched from the `title` column of the user watch history DataFrame.



In [None]:
# S2.1: Obtain  the list of movie titles the user has watched

watch_list = user_history_df['title'].tolist()
print('The variable watch_list is of: ',type(watch_list))
print(watch_list)

The variable watch_list is of:  <class 'list'>
['Hotel Transylvania 2', 'Indiana Jones and the Temple of Doom', 'Indiana Jones and the Kingdom of the Crystal Skull', 'Men in Black II']


Here, we observe that the `watch_list` is of list data type.

<br>

Next, we will search for the movie titles appearing in `watch_list` in the master dataset we obtained in the previous class: `m_links_df` to obtain the `movieId` for the corresponding movies. For this we will use the pandas `isin()` method.

#### `isin()` Method

The pandas `isin()` method is used to filter a DataFrame based on a particular column. It checks each element against the given `values` and returns an object of the same shape holding *boolean* values that indicate whether or not each element is contained in `values`. When it is called on a single column, the result is a boolean Series that can be used to select the matching rows.

**Syntax**: `DataFrame.isin(values)`

Where,
`values` can be DataFrame column(s), list, series, dictionary.

For further reading visit: [`DataFrame.isin()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html)
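
The small sketch below shows the intermediate boolean result that `isin()` produces before it is used as a filter. The DataFrame `toy_df` and its values are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical movie table (illustrative values only)
toy_df = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D']
})
wanted_titles = ['Movie B', 'Movie D']

# isin() on a single column returns a boolean Series, one value per row
mask = toy_df['title'].isin(wanted_titles)
print(mask.tolist())   # [False, True, False, True]

# Using the boolean Series inside [] keeps only the rows marked True
print(toy_df[mask])
```

The filtered DataFrame keeps the original index labels of the matching rows, which is why the index of a filtered result is usually not consecutive.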

Here, we will use the `title` column of the `m_links_df` DataFrame to search for the movies in `watch_list`, i.e. the movies the user has watched and rated.

In [None]:
# S2.2: Obtain a DataFrame consisting of movieId based on user watch history

user_history_id = m_links_df[m_links_df['title'].isin(watch_list)]
user_history_id

| | id | imdb_id | title | movieId | imdbId | tmdbId |
| -- | -- | -- | -- | -- | -- | -- |
| 2014 | 87.0 | tt0087469 | Indiana Jones and the Temple of Doom | 2115 | 87469 | 87.0 |
| 5345 | 608.0 | tt0120912 | Men in Black II | 5459 | 120912 | 608.0 |
| 12668 | 217.0 | tt0367882 | Indiana Jones and the Kingdom of the Crystal S... | 59615 | 367882 | 217.0 |
| 33301 | 159824.0 | tt2510894 | Hotel Transylvania 2 | 142997 | 2510894 | 159824.0 |


From the output we observe that the `user_history_id` DataFrame now consists of the `movieId` of the movies the user has watched.

However, to work with collaborative filters we need the user ratings as well. So, merge the DataFrames `user_history_id` and `user_history_df` to obtain the workable dataset, say: `watched_movies_df`

In [None]:
# S2.3: Obtain the movieId and rating DataFrame for the user watch history

watched_movies_df = pd.merge(user_history_id, user_history_df)
watched_movies_df

| | id | imdb_id | title | movieId | imdbId | tmdbId | rating |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | 87.0 | tt0087469 | Indiana Jones and the Temple of Doom | 2115 | 87469 | 87.0 | 4.5 |
| 1 | 608.0 | tt0120912 | Men in Black II | 5459 | 120912 | 608.0 | 4.0 |
| 2 | 217.0 | tt0367882 | Indiana Jones and the Kingdom of the Crystal S... | 59615 | 367882 | 217.0 | 4.0 |
| 3 | 159824.0 | tt2510894 | Hotel Transylvania 2 | 142997 | 2510894 | 159824.0 | 4.0 |


The watch history dataset `watched_movies_df` of the user now consists of the movie `title`, `rating`, and `movieId` for all the movies user X has watched.
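
When `pd.merge()` is called without the `on` parameter, as in the cell above, pandas joins the two DataFrames on the columns they have in common, which here is `title`. The sketch below uses hypothetical toy DataFrames to show that naming the column explicitly gives the same result.

```python
import pandas as pd

# Hypothetical DataFrames sharing only the 'title' column (illustrative values only)
left = pd.DataFrame({'title': ['Movie A', 'Movie B'], 'movieId': [1, 2]})
right = pd.DataFrame({'title': ['Movie B', 'Movie A'], 'rating': [4.0, 3.5]})

implicit = pd.merge(left, right)              # joins on the common column 'title'
explicit = pd.merge(left, right, on='title')  # same join, with the key stated explicitly

print(implicit.equals(explicit))   # True
print(explicit)
```

Stating the key with `on='title'` can make the intent clearer to students, especially when the DataFrames share more than one column name.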

As per the recommendation engine logic design in figure 2.1, we need to find the subset of other users who have watched and rated the movies from the user X watch history: `watched_movies_df`.

<center>
<img src=https://s3-whjr-v2-prod-bucket.whjr.online/whjr-v2-prod-bucket/59e2d368-d408-40e4-a6a6-b8ec9ed4eb67.svg>

Image source: https://commons.wikimedia.org/wiki/File:Indiana_Jones_logo.svg </center>

<br>

This can be done in a way similar to how we searched for the `movieId`:

1. Create a list `watch_list_movieid` of the field `movieId` from the user X watch history DataFrame: `watched_movies_df` using `tolist()` function.

2. Create a new DataFrame `users_subset_df` from the movie ratings DataFrame: `ratings_df` using `isin()` method.

3. Print the shape of the obtained DataFrame to understand the user subset data.

4. Finally, display the first 10 rows of the obtained DataFrame.

In [None]:
# S2.4: Create the subset of users who have watched movies from the watch history of user X

watch_list_movieid = watched_movies_df['movieId'].tolist()
users_subset_df = ratings_df[ratings_df['movieId'].isin(watch_list_movieid)]
print('Shape of the users_subset_df DataFrame: ',users_subset_df.shape)
users_subset_df.head(10)

Shape of the users_subset_df DataFrame:  (173, 3)


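| | userId | movieId | rating |
| -- | -- | -- | -- |
| 285 | 4 | 2115 | 5.0 |
| 1407 | 15 | 2115 | 4.0 |
| 1927 | 15 | 5459 | 1.0 |
| 2258 | 15 | 59615 | 1.0 |
| 3876 | 22 | 2115 | 3.5 |
| 3965 | 22 | 5459 | 3.5 |
| 4295 | 23 | 2115 | 4.0 |
| 5442 | 30 | 2115 | 4.0 |
| 7032 | 41 | 5459 | 3.5 |
| 7493 | 48 | 5459 | 3.0 |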


Here, we observe that we have obtained a DataFrame for the users subset that has watched movies from the user watch history. For example:

1. `userId` 4 has watched the `movieId` 2115 (Indiana Jones and the Temple of Doom) and rated it 5.0 out of 5.

2. Similarly, `userId` 15 has watched 3 movies, with `movieId` 2115, 5459, and 59615 respectively.

<br>

Next, let's group this dataset with respect to users and sort it based on the most number of common movies the subset users have watched. For this:

1. Create a `users_subset_group` using the `groupby()` function based on `userId` field.

2.  Use `sorted()` function to sort the `users_subset_group` on the basis of users who have watched most movies in common with the user X.

3. Display the first 5 entries from the sorted users subset.
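
Before sorting, it helps to see what iterating over a grouped DataFrame yields: one pair per user, made of the group key (the `userId`) and a DataFrame with that user's rows. The sketch below uses a hypothetical toy ratings table, not the lesson's dataset.

```python
import pandas as pd

# Hypothetical ratings table (illustrative values only)
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2, 3, 3, 3],
    'movieId': [10, 20, 10, 10, 20, 30],
    'rating': [4.0, 3.5, 5.0, 2.0, 2.5, 3.0]
})

# Turning the grouped object into a list gives (userId, group DataFrame) pairs
groups = list(toy_ratings.groupby('userId'))

for user_id, group in groups:
    # len(group) is the number of movies rated by this user
    print(user_id, len(group))
```

Because each item is such a pair, the group DataFrame can be reached with index `1` of the pair, which is what the `key` function of `sorted()` relies on.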

<br>

#### `sorted()` Function

The `sorted()` function returns a new sorted list built from the items of an iterable object.

**Syntax of `sorted()` function**: `sorted(iterable, key=None, reverse=False)`

Parameters: `sorted()` takes three parameters, of which two are optional and must be passed as keyword arguments.

- `iterable`: A sequence (list, tuple, string), a collection (dictionary, set, frozenset) or any other iterable whose items need to be sorted.

- `key` (optional): A function that is applied to each item to obtain the value used for sort comparison.

- `reverse` (optional): If set to `True`, the items are sorted in reverse (descending) order. By default, it is `False`.


Here, we want to sort the `users_subset_group` on the basis of users who have watched most movies in common with the user X. Hence the parameters for `sorted()` function would be:

1. `users_subset_group` would be passed as iterable.

2. For `key` we need to define a `lambda` function, since we need to sort the groups (grouped by `userId`) on the basis of their length. Each item of `users_subset_group` is a pair whose first element is the `userId` and whose second element is the DataFrame of that user's ratings. Hence, for any such pair, say `u`, we sort by the number of movies the user has watched, which can be obtained using `len()`:

&emsp;&emsp;&ensp;`key = lambda u: len(u[1])`

3. Use `reverse = True` as we want to sort the groups in descending order.
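
The same pattern can be tried on a plain Python list before applying it to the grouped DataFrame. In the sketch below, `toy_groups` is a hypothetical stand-in in which each pair holds a user ID and a list of the movie IDs rated by that user.

```python
# Hypothetical (userId, list of rated movieIds) pairs standing in for the groups
toy_groups = [
    (15, [2115, 5459, 59615]),
    (4, [2115]),
    (624, [2115, 5459, 59615, 142997]),
]

# Sort by the length of the second element of each pair, largest first
sorted_groups = sorted(toy_groups, key=lambda u: len(u[1]), reverse=True)

print([user_id for user_id, movies in sorted_groups])   # [624, 15, 4]
```

`sorted()` leaves `toy_groups` unchanged and returns a new list, so the result has to be assigned to a variable to be used later.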

In [None]:
# S2.5: Create users subset group based on userId and sort it by the highest number of common movies watched

# Group by the column name (not a one-element list) so that each group key is a plain 'userId' value.
users_subset_group = users_subset_df.groupby('userId')
users_subset_group = sorted(users_subset_group,  key=lambda u: len(u[1]), reverse=True)
users_subset_group[0:5]

[(624,        userId  movieId  rating
  93859     624     2115     4.0
  94331     624     5459     2.0
  94639     624    59615     2.0
  95254     624   142997     2.5), (15,       userId  movieId  rating
  1407      15     2115     4.0
  1927      15     5459     1.0
  2258      15    59615     1.0), (73,        userId  movieId  rating
  10591      73     2115     5.0
  10992      73     5459     2.0
  11435      73    59615     1.5), (88,        userId  movieId  rating
  13636      88     2115     2.0
  13704      88     5459     2.5
  13754      88    59615     1.0), (213,        userId  movieId  rating
  29493     213     2115     3.5
  29663     213     5459     2.5
  29928     213    59615     2.0)]

Here, we observe that `userId` 624 has watched 4 movies in common with the user X. For instance,

- `userId` 624 has watched `movieId` 2115 and gave it a `rating` of 4.0

- `userId` 624 has watched `movieId` 5459 and gave it a `rating` of 2.0

- `userId` 624 has watched `movieId` 59615 and gave it a `rating` of 2.0

- `userId` 624 has watched `movieId` 142997 and gave it a `rating` of 2.5

Similarly, `userId` 15 has watched 3 movies in common with the user X and so on.

<br>

We now have the users subset ready. Next we will proceed with similarity measure using `cosine_similarity()`.

---

#### Activity 3: Obtain Similar Users for Target User X

The first step as described in **Activity 2: Data Preparation** was: **We will get the most similar users of the user X**.

In order to obtain users having similar choices in the movies to our target user X we find the similarity scores for the users in `users_subset_group` with user X.

For this:

1. Create an empty dictionary: `cosine_similarity_dict`.

2. To compute the similarity scores, we need each user and the common movies that user has watched. Iterate over `users_subset_group` with a `for` loop. Each iteration yields two items: `user` (the `userId`) and `group` (the DataFrame of that user's ratings). Inside this loop:

  - Sort the group entries by the `movieId` field using the `sort_values()` function and assign the result back to `group`.

  - Similarly, sort the watch history of user X, `watched_movies_df`, by the `movieId` field using the `sort_values()` function and assign it to a variable `input_movies`.

  - Create a `temp_df` containing only the rows of `input_movies` (the sorted DataFrame of movies user X has watched and rated) whose `movieId` also appears in `group`. These are the movies both user X and the corresponding `user` have watched.

  - Convert the `rating` column of `temp_df` to a list and store it in a temporary list, say: `temp_rating_list`. This list holds the ratings user X has given to the movies that both user X and the corresponding `user` have watched.

  - Similarly, convert the `rating` column of `group` to a list and store it in a temporary list, say: `temp_group_list`. This list holds the ratings the corresponding `user` has given to the same movies.

  - Now we have the ratings lists of `user` and user X, which will be used to determine the similarity of the two users. For this, use the `cosine_similarity()` function and pass each ratings list as a 2D array:

    - `[temp_rating_list]`, and `[temp_group_list]`. Assign this value to `similarity`.

  - Finally, add the corresponding `user` and `similarity` score to the dictionary.

**Note:** While adding the key-value pair to the dictionary, remember that the output generated by the `cosine_similarity()` function is a 2D array. Hence, you need to call `reshape(1)` on the similarity array in order to obtain a 1D array.
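To see why the reshape is needed, here is a minimal sketch that reproduces the shape of the result with NumPy only. The helper `pairwise_cosine` is a hypothetical stand-in written for this illustration (it is not the library function), and the ratings are made up.

```python
import numpy as np

def pairwise_cosine(a, b):
    """Hypothetical helper: cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale every row to unit length, then take the dot products between rows
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

ratings_x = [4.0, 2.0, 2.0]      # made-up ratings by user X
ratings_other = [4.0, 1.0, 1.0]  # made-up ratings by another user

# Wrapping each list in [] turns it into a 2D input with a single row
similarity = pairwise_cosine([ratings_x], [ratings_other])
print(similarity.shape)   # (1, 1): one row on each side gives a 1 x 1 result

flat = similarity.reshape(1)
print(flat.shape)         # (1,): a 1D array holding the single score
```

Storing the 1D array rather than the nested 2D array keeps each dictionary value to a single flat entry, which is easier to turn into one DataFrame column later.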

In [None]:
# S3.1: Create a dictionary to store the similarity scores of users subset with respect to target user X

# Generate an empty dictionary
cosine_similarity_dict = {}

# Iterate through individual users and the movie corresponding user has watched from the user subset group
for user, group in users_subset_group:

    # Sorting the target user and current user group to prevent mismatch in movieId field
    group = group.sort_values(by='movieId')
    input_movies = watched_movies_df.sort_values(by='movieId')

    # Obtain the rating for the movies that they both have in common
    temp_df = input_movies[input_movies['movieId'].isin(group['movieId'].tolist())]

    # Store these ratings in a temporary buffer list for similarity calculations
    temp_rating_list = temp_df['rating'].tolist()

    # Also put the corresponding user group rating in a temporary buffer list for similarity calculations
    temp_group_list = group['rating'].tolist()

    # Obtain the similarity scores and append to dictionary with respect to corresponding user
    similarity = cosine_similarity([temp_rating_list], [temp_group_list])
    cosine_similarity_dict[user] = similarity.reshape(1)

In the above code cell, we create a dictionary that stores the `userId` as key and the `similarity` score as value. It will be used later to obtain the recommendations.

The similarity scores are computed from the ratings provided by the corresponding user in `users_subset_group` and the ratings provided by the target user X. The dictionary `cosine_similarity_dict` holds these similarity scores for all the users in `users_subset_group` with respect to our target user X.
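The filtering and alignment performed inside the loop can be sketched on two small, made-up DataFrames. The names `x_ratings` and `other_ratings` and all the values are hypothetical.

```python
import pandas as pd

# Made-up watch history of user X (deliberately unsorted) and of one other user
x_ratings = pd.DataFrame({'movieId': [30, 10, 20, 40], 'rating': [5.0, 4.0, 3.0, 1.0]})
other_ratings = pd.DataFrame({'movieId': [20, 10], 'rating': [2.5, 4.5]})

# Sort both by movieId so that position i refers to the same movie in both lists
x_sorted = x_ratings.sort_values(by='movieId')
other_sorted = other_ratings.sort_values(by='movieId')

# Keep only the movies of user X that the other user has also rated
common = x_sorted[x_sorted['movieId'].isin(other_sorted['movieId'].tolist())]

x_list = common['rating'].tolist()
other_list = other_sorted['rating'].tolist()
print(x_list, other_list)
```

Both lists end up with the same length and the same movie order (movie 10 first, then movie 20), which is what makes an element-by-element comparison of the ratings meaningful.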

<center>
<img src=https://s3-whjr-v2-prod-bucket.whjr.online/whjr-v2-prod-bucket/a8f84199-657f-42b4-bbfb-4c255726a596.jpg width=450>

Image Source: Photo by Min An from Pexels </center>

<br>

Let's display the dictionary items in order to understand it.

In [None]:
#  S3.2: Display the key value pairs of the similarity score dictionary

cosine_similarity_dict.items()

dict_items([(624, array([0.96836405])), (15, array([0.84780105])), (73, array([0.90328481])), (88, array([0.94865528])), (213, array([0.98432694])), (468, array([0.97421458])), (471, array([0.99734724])), (481, array([0.97825343])), (580, array([0.97332853])), (664, array([0.98422368])), (22, array([0.99827437])), (93, array([0.99434562])), (134, array([0.99865342])), (150, array([0.98591396])), (176, array([0.99827437])), (212, array([0.9904049])), (287, array([0.99827437])), (294, array([1.])), (311, array([0.9856839])), (324, array([0.99778516])), (346, array([0.87373206])), (355, array([0.95281498])), (384, array([0.99996948])), (402, array([0.99705449])), (426, array([0.99705449])), (452, array([0.99949111])), (475, array([0.9486833])), (553, array([0.99328922])), (574, array([0.99827437])), (607, array([0.99784915])), (654, array([0.9904049])), (4, array([1.])), (23, array([1.])), (30, array([1.])), (41, array([1.])), (48, array([1.])), (49, array([1.])), (56, array([1.])), (57, 

Here, you can observe that the `similarity` score of `userId` 624 and user X is 0.968 and so on. We have now obtained the individual users from the `users_subset_group` and their similarity scores with respect to the target user X.
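Many scores in the output are exactly 1. One likely contributor (an assumption, the notebook does not verify it) is users who share only a single movie with user X: the cosine similarity of two one-element vectors of positive ratings is always 1, however different the ratings are. A minimal sketch with made-up ratings and a hypothetical `cosine` helper:

```python
import math

def cosine(u, v):
    """Hypothetical helper: cosine similarity of two equal-length rating lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One common movie: a 5.0 rating against a 0.5 rating still gives a perfect score
one_movie = cosine([5.0], [0.5])

# Three common movies with clearly different tastes give a lower score
three_movies = cosine([5.0, 1.0, 3.0], [0.5, 4.0, 3.0])

print(one_movie, three_movies)
```

This is worth keeping in mind when reading the top of the sorted list: a score of 1 can come from very little shared evidence.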

<br>

Next, let's convert this to a DataFrame. Since each dictionary value is stored as an array, we will use the `pd.DataFrame.from_dict` classmethod.

#### `pandas.DataFrame.from_dict` ClassMethod

To recall:

**Syntax of `from_dict()` function**: `classmethod DataFrame.from_dict(data, orient='columns')`

Where,
- `data` is the dictionary from which DataFrame has to be obtained.

- `orient` is either 'index' or 'columns'. This means “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass `columns` (default). Otherwise if the keys should be rows, pass `index`.
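The effect of `orient` is easiest to see on a tiny dictionary. In this sketch, `toy_scores` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up dictionary shaped like the similarity dictionary: userId -> 1D array
toy_scores = {101: np.array([0.9]), 102: np.array([0.8])}

# Keys become the row labels (index): one row per user
by_index = pd.DataFrame.from_dict(toy_scores, orient='index')

# Keys become the column labels: one column per user
by_columns = pd.DataFrame.from_dict(toy_scores, orient='columns')

print(by_index.shape, by_columns.shape)
```

With `orient='index'` the result has one row per key and a single unnamed column, which is why the column is renamed in the next step.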

In order to obtain the DataFrame:

1. First obtain a DataFrame say: `cosine_df` using `pd.DataFrame.from_dict` classmethod and pass the following parameters:

 - `cosine_similarity_dict` as data.

 - `orient = 'index'` to set the DataFrame index as `userId`.

2. Next, rename the column of the DataFrame as `similarity score`.

3. Finally, display the first 5 entries of the resulting DataFrame.

In [None]:
# S3.3: Obtain a DataFrame from the similarity score dictionary

cosine_df = pd.DataFrame.from_dict(cosine_similarity_dict, orient='index')
cosine_df.columns = ['similarity score']
cosine_df.head()

Unnamed: 0,similarity score
624,0.968364
15,0.847801
73,0.903285
88,0.948655
213,0.984327


We have now obtained the DataFrame and we need to create a new column for the `userId`.

1. This can be obtained from the `index` of the DataFrame.

2. Finally, reset the index of the DataFrame and display the first 5 entries of the resulting DataFrame.



In [None]:
# S3.4: Append a column for userId and reset the index of the DataFrame

cosine_df['userId'] = cosine_df.index
cosine_df.reset_index(inplace=True)
cosine_df.head()

Unnamed: 0,index,similarity score,userId
0,624,0.968364,624
1,15,0.847801,15
2,73,0.903285,73
3,88,0.948655,88
4,213,0.984327,213


Let's sort the DataFrame in descending order of the `similarity score` and assign it to a new DataFrame, say: `top_users`.

In [None]:
# S3.5: Create a new DataFrame for top users by sorting the users based on similarity scores

top_users = cosine_df.sort_values(by ='similarity score', ascending=False)
top_users.head()

Unnamed: 0,index,similarity score,userId
65,244,1.0,244
98,463,1.0,463
96,457,1.0,457
95,456,1.0,456
94,442,1.0,442


Here we have obtained the DataFrame for the users with their respective similarity scores to the user X.

<br>

Next, we will set up the recommendation engine by finding the movies the users subset has watched.

---

#### Activity 4: The Recommendation Engine

The next step in setting up the recommendation engine is: **to get the most similar movies of the user X watch history.**

For this, we need to obtain the movies watched, and the ratings given, by the similar users listed in the `top_users` DataFrame.

Recall we have created a DataFrame `final_movies_df` which consists of the movies every user has watched and rated. Let's display the `final_movies_df` DataFrame.

In [None]:
# S4.1: Display the final movies DataFrame

final_movies_df

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story,7,3.0
1,1,Toy Story,9,4.0
2,1,Toy Story,13,5.0
3,1,Toy Story,15,2.0
4,1,Toy Story,19,3.0
...,...,...,...,...
99845,161918,Sharknado 4: The 4th Awakens,624,1.5
99846,161944,The Last Brickmaker in America,287,5.0
99847,162542,Rustom,611,5.0
99848,162672,Mohenjo Daro,611,3.0


From the above code cell output we can verify that the DataFrame `final_movies_df` consists of the movies every user has watched and rated.

<br>

Next, create a DataFrame `top_users_rating` which consists of the movies and the respective ratings provided by the `userId` in the `top_users` DataFrame. For this:

1. Create a DataFrame by merging `top_users` and `final_movies_df` on the `userId` column (`on = 'userId'`) using an `inner` join (`how = 'inner'`).

2. Display the resulting DataFrame.

In [None]:
# S4.2: Obtain the movies watched and ratings provided by users of top users DataFrame

top_users_rating = top_users.merge(final_movies_df, on = 'userId', how ='inner')
top_users_rating

Unnamed: 0,index,similarity score,userId,movieId,title,rating
0,244,1.000000,244,110,Braveheart,3.5
1,244,1.000000,244,260,Star Wars,5.0
2,244,1.000000,244,318,The Shawshank Redemption,5.0
3,244,1.000000,244,356,Forrest Gump,5.0
4,244,1.000000,244,457,The Fugitive,0.5
...,...,...,...,...,...,...
49522,15,0.847801,15,160271,Central Intelligence,2.5
49523,15,0.847801,15,160563,The Legend of Tarzan,1.0
49524,15,0.847801,15,160565,The Purge: Election Year,2.0
49525,15,0.847801,15,160567,Mike and Dave Need Wedding Dates,4.0


Here you can observe that `top_users_rating` now consists of the movies watched and rated by the users similar to our target user X.

<center>
<img src=https://s3-whjr-v2-prod-bucket.whjr.online/whjr-v2-prod-bucket/e0c95ff6-dc40-499e-b98f-918133d55e98.jpg>

Image Source: https://en.wikipedia.org/wiki/Braveheart#/media/File:Braveheart_imp.jpg

<img src=https://s3-whjr-v2-prod-bucket.whjr.online/whjr-v2-prod-bucket/718b7515-9afb-4821-a2dc-2d7524529cff.svg>

Image Source: https://en.wikipedia.org/wiki/Star_Wars#/media/File:Star_wars2.svg
</center>

<br>

As of now, these ratings reflect the perception of the respective users, and we cannot be sure whether our target user X will like these movies. Hence we will proceed with the next step to **combine the ratings for each item of the movies for similar users, and the similarity score of the similar users**. For this:

1. Create a new column `weighted rating` in the DataFrame by multiplying each user's `similarity score` by their `rating`, so that ratings from users who are more similar to user X carry more weight.

2. Finally display the resulting DataFrame.

In [None]:
# S4.3: Obtain the weighted rating by combining the similarity score and movie rating for top users

top_users_rating['weighted rating'] = top_users_rating['similarity score']*top_users_rating['rating']
top_users_rating

Unnamed: 0,index,similarity score,userId,movieId,title,rating,weighted rating
0,244,1.000000,244,110,Braveheart,3.5,3.500000
1,244,1.000000,244,260,Star Wars,5.0,5.000000
2,244,1.000000,244,318,The Shawshank Redemption,5.0,5.000000
3,244,1.000000,244,356,Forrest Gump,5.0,5.000000
4,244,1.000000,244,457,The Fugitive,0.5,0.500000
...,...,...,...,...,...,...,...
49522,15,0.847801,15,160271,Central Intelligence,2.5,2.119503
49523,15,0.847801,15,160563,The Legend of Tarzan,1.0,0.847801
49524,15,0.847801,15,160565,The Purge: Election Year,2.0,1.695602
49525,15,0.847801,15,160567,Mike and Dave Need Wedding Dates,4.0,3.391204


We now have a `weighted rating` based on the `similarity score` and `rating` provided by individual users.
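As a quick check against the output above, the row for Central Intelligence gives 0.847801 × 2.5 ≈ 2.1195. The same column-wise multiplication is sketched below on a made-up DataFrame; `toy_ratings` and its values are hypothetical.

```python
import pandas as pd

# Made-up rows: user 1 is a perfect match for user X, user 2 only a partial match
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2],
    'similarity score': [1.0, 1.0, 0.5],
    'rating': [4.0, 2.0, 4.0],
})

# Element-wise product: each rating is scaled by how similar its author is to user X
toy_ratings['weighted rating'] = toy_ratings['similarity score'] * toy_ratings['rating']
print(toy_ratings)
```

Both users gave a movie 4.0, but the rating from the less similar user counts for only half as much after weighting.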

<br>

The final step is to obtain similar movies with the **highest ratings given by the similar users**.

Let's create a temporary DataFrame, say: `temp_top_users_rating`. For this:

1. Group the movies in `top_users_rating` by `movieId` and obtain the cumulative scores for the `similarity score` and `weighted rating` fields using the `sum()` function.

2. Rename the columns of the resulting DataFrame as `cumulative similarity score` and `cumulative weighted rating`.

3. Finally, display the resulting DataFrame.

In [None]:
# S4.4: Obtain the cumulative similarity scores and weighted rating for similar movies

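# Select the two numeric columns before summing so that text columns such as title are not aggregated
temp_top_users_rating = top_users_rating.groupby('movieId')[['similarity score', 'weighted rating']].sum()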
temp_top_users_rating.columns = ['cumulative similarity score','cumulative weighted rating']
temp_top_users_rating

Unnamed: 0_level_0,cumulative similarity score,cumulative weighted rating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,79.435396,291.002862
2,45.378484,142.941369
3,14.938193,42.326552
4,2.000000,2.500000
5,15.781033,46.119740
...,...,...
161155,0.847801,0.423901
161594,0.903285,2.709854
161830,0.968364,0.968364
161918,0.968364,1.452546


Here we can observe that we have now obtained the cumulative scores for individual movies.

**Note:** The index of this DataFrame is the respective `movieId`.
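The grouping and summing step can be sketched on a few made-up rows. The DataFrame `toy_weighted` and its values are hypothetical.

```python
import pandas as pd

# Made-up weighted ratings: movie 10 was rated by two similar users, movie 20 by one
toy_weighted = pd.DataFrame({
    'movieId': [10, 10, 20],
    'similarity score': [1.0, 0.5, 0.5],
    'weighted rating': [4.0, 2.0, 1.5],
})

# Sum both columns per movie; movieId becomes the index of the result
totals = toy_weighted.groupby('movieId')[['similarity score', 'weighted rating']].sum()
totals.columns = ['cumulative similarity score', 'cumulative weighted rating']
print(totals)
```

Because the values are summed, a movie rated by many similar users accumulates a larger total than a movie rated by only a few.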

<br>

The last job of the recommendation engine is to sort these movies based on the descending values of the cumulative score and **Bingo!** we have successfully designed the recommendation engine.

For this:

1. Create a copy of the temporary rating DataFrame, say: `recommendation_df`.

2. Create a new column `Id` which consists of the corresponding `movieId` from the DataFrame index.

3. Next, sort the DataFrame in descending order of the obtained `cumulative weighted rating`.

4. Display as many of the top rows as the number of movies you want to recommend to user X.

In [None]:
# S4.5: Obtain the movies recommended from the cumulative ratings obtained from the top users

recommendation_df = temp_top_users_rating.copy()
recommendation_df['Id'] = recommendation_df.index
recommendation_df = recommendation_df.sort_values(by = 'cumulative weighted rating' , ascending=False)
recommendation_df.head(15)

Unnamed: 0_level_0,cumulative similarity score,cumulative weighted rating,Id
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
260,104.258973,450.521202,260
1198,104.312532,442.914943,1198
1196,102.276437,439.599611,1196
2571,100.261682,415.656374,2571
356,100.398112,401.945802,356
4993,92.270852,394.684642,4993
1270,95.274002,385.832857,1270
1291,95.441718,381.202445,1291
296,86.422818,378.314396,296
1210,88.328291,371.3482,1210


The above DataFrame does not convey information on the movie `title`. In order to link the `title` to the respective `movieId`, let's merge `recommendation_df` and `m_links_df`. While merging these DataFrames, set the following parameters:

1. `left_on = 'Id'`

2. `right_on = 'movieId'`

In [None]:
# S4.6: Obtain the movie title for recommended movies and save it to a DataFrame

recommend_movies = pd.merge(recommendation_df, m_links_df, left_on ='Id', right_on ='movieId')
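A minimal sketch of a merge whose key columns have different names is given below. The DataFrames `toy_scores_df` and `toy_titles_df` are hypothetical, the titles are taken from the outputs in this notebook, and the scores are rounded for illustration.

```python
import pandas as pd

# Left table: the key column is called 'Id'
toy_scores_df = pd.DataFrame({'Id': [260, 356],
                              'cumulative weighted rating': [450.5, 401.9]})

# Right table: the key column is called 'movieId' and the rows are in a different order
toy_titles_df = pd.DataFrame({'movieId': [356, 260, 1],
                              'title': ['Forrest Gump', 'Star Wars', 'Toy Story']})

# left_on / right_on name the key column on each side; the default join is an inner join
merged = pd.merge(toy_scores_df, toy_titles_df, left_on='Id', right_on='movieId')
print(merged)
```

The result keeps both key columns (`Id` and `movieId`) and follows the row order of the left table, so the ranking by cumulative weighted rating is preserved. Toy Story is dropped because it has no matching row on the left.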

Finally, we can display the recommended movies based on the watch history of user X.

**TIP:** Use `print(recommend_movies.title)` to obtain only the list of titles for the recommended movies.

In [None]:
# S4.7: Display the recommended movies sorted by our recommendation engine

recommend_movies.head(10)

Unnamed: 0,cumulative similarity score,cumulative weighted rating,Id,id,imdb_id,title,movieId,imdbId,tmdbId
0,104.258973,450.521202,260,11.0,tt0076759,Star Wars,260,76759,11.0
1,104.312532,442.914943,1198,85.0,tt0082971,Raiders of the Lost Ark,1198,82971,85.0
2,102.276437,439.599611,1196,1891.0,tt0080684,The Empire Strikes Back,1196,80684,1891.0
3,100.261682,415.656374,2571,603.0,tt0133093,The Matrix,2571,133093,603.0
4,100.398112,401.945802,356,13.0,tt0109830,Forrest Gump,356,109830,13.0
5,92.270852,394.684642,4993,120.0,tt0120737,The Lord of the Rings: The Fellowship of the Ring,4993,120737,120.0
6,95.274002,385.832857,1270,105.0,tt0088763,Back to the Future,1270,88763,105.0
7,95.441718,381.202445,1291,89.0,tt0097576,Indiana Jones and the Last Crusade,1291,97576,89.0
8,86.422818,378.314396,296,680.0,tt0110912,Pulp Fiction,296,110912,680.0
9,88.328291,371.3482,1210,1892.0,tt0086190,Return of the Jedi,1210,86190,1892.0


**Bingo!** We now have a list of movies to recommend to user X that matches the user's choice of movies.


<center>
<img src=https://s3-whjr-v2-prod-bucket.whjr.online/whjr-v2-prod-bucket/34c112ff-f2f8-4cad-bcc0-0c2930f3562d.jpg width=450>

Image Source: Photo by Andrea Piacquadio from Pexels </center>



**Conclusion**

1. The recommendation engine also includes diverse recommendations so that the customer does not get bored of watching movies of the same series.

2. The recommendation engine is dynamic. Whenever the user watches and rates a new movie, it gets added to the user's watch history and the recommendations are updated based on it.

3. Here, we have used the Cosine Similarity measure to find similarity between users and rate the recommendations accordingly. We can also use other similarity measures for the same purpose.

4. Most of the top-rated streaming platforms use hybrid models as their recommendation engine. A hybrid model comprises multiple complex algorithms working in synchronization to find the best-suited recommendations.

<br>

We will stop here. In the next classes, we are going to explore the new concept of **Choropleth Maps** and learn how to visualise **Satellite Data**.



---

#### Activities


**Teacher Activities:**

1.   Collaborative Filtering II - Cosine Similarity (Class Copy)

    Link on Panel


2.  Collaborative Filtering II - Cosine Similarity (Reference)

    https://colab.research.google.com/drive/1nctJJJjIRuQBPp6tlNmh0TzkUof-ljH6

