<a href="https://colab.research.google.com/github/michalis0/DataScience_and_MachineLearning/blob/master/Week_12/Week_12.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install omdbapi

Collecting omdbapi
  Downloading omdbapi-0.7.0.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: omdbapi
  Building wheel for omdbapi (setup.py) ... done
  Created wheel for omdbapi: filename=omdbapi-0.7.0-py3-none-any.whl size=16803 sha256=49178552fe5ea67627f4135933cdb5cdf060429c207ed2dcef80bdc8884a5f2f
  Stored in directory: /root/.cache/pip/wheels/de/8b/88/5ed1c7214f5de08a6017805252f1591eb52bdff3f098d5834d
Successfully built omdbapi
Installing collected packages: omdbapi
Successfully installed omdbapi-0.7.0


In [2]:
import numpy as np
import pandas as pd
from omdbapi.movie_search import GetMovie


# Recommender Systems

<img src='https://imgs.xkcd.com/comics/star_ratings.png' width="300">

Source: [xkcd 1098](https://xkcd.com/1098/)

## Content

The goal of this walkthrough is to show you how to compute the quantities behind recommender systems and how to evaluate them. A [recommender system](https://en.wikipedia.org/wiki/Recommender_system) is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. Recommender systems are primarily used in commercial applications.

This notebook covers:
- [Implementation](#Implementation)
- [Calculations](#Calculations)
  - [Cosine similarity](#Cosine-similarity)
- [Evaluation](#Evaluation)
  - [Rating Prediction accuracy](#Rating-Prediction-accuracy)
  - [Classification accuracy](#Classification-accuracy)
  - [Appendix](#Appendix)


## Implementation
Follow [this Kaggle notebook](https://www.kaggle.com/code/rounakbanik/movie-recommender-systems) to see an example implementation of a recommender system in Python.

## Calculations
The goal of this section is to calculate some metrics manually. It will help you understand the simple mathematics behind recommender systems. Try to compute the metrics by hand with pen and paper, then check your answers with Python.

First, we will get real data from the [Open Movie Database](https://www.omdbapi.com/). You will have to register to get a free API key in order to retrieve the data.

In [3]:
# Retrieve movie metadata from the OMDb API
API = "ENTER API KEY"  # replace with your own OMDb API key
movies = ["Interstellar", "Godfather", "Life of Brian", "The Incredibles", "Monsters, Inc."]

movie = GetMovie(api_key=API)

# Query each title and collect the selected fields as one row per movie
data = []
for title in movies:
  movie.get_movie(title=title)
  data.append(movie.get_data('title', 'year', 'genre', 'director', 'imdbrating'))
df = pd.DataFrame(data)

In [4]:
display(df)

Unnamed: 0,title,year,genre,director,imdbrating
0,Interstellar,2014,"Adventure, Drama, Sci-Fi",Christopher Nolan,8.7
1,Godfather,2022,"Action, Crime, Drama",Mohan Raja,5.2
2,Life of Brian,1979,Comedy,Terry Jones,8.0
3,The Incredibles,2004,"Animation, Action, Adventure",Brad Bird,8.0
4,"Monsters, Inc.",2001,"Animation, Adventure, Comedy","Pete Docter, David Silverman, Lee Unkrich",8.1


### Cosine similarity

Let's say that a movie rating system has three users, X, Y, and Z. Each user's ratings are represented as a vector with one entry per movie.


In [5]:
x = {'User': 'X', 'Interstellar': 8.1, 'Godfather': 7.9, 'Life of Brian': 4.5, 'The Incredibles': 2, 'Monsters, Inc.': 3.5}
y = {'User': 'Y', 'Interstellar': 5.1, 'Godfather': 6.4, 'Life of Brian': 3.2, 'The Incredibles': 8.7, 'Monsters, Inc.': 8.1}
z = {'User': 'Z', 'Interstellar': 7.2, 'Godfather': 8.5, 'Life of Brian': 3.2, 'The Incredibles': 3, 'Monsters, Inc.': 2.4}
data_users = [x,y,z]

df_users = pd.DataFrame(data_users)
df_users

Unnamed: 0,User,Interstellar,Godfather,Life of Brian,The Incredibles,"Monsters, Inc."
0,X,8.1,7.9,4.5,2.0,3.5
1,Y,5.1,6.4,3.2,8.7,8.1
2,Z,7.2,8.5,3.2,3.0,2.4


What is the cosine similarity between each pair of users, and which two users are the most similar?

In [6]:
# Your turn
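
Once you have an answer on paper, one possible way to check it is sketched below. It assumes the usual definition of cosine similarity, the dot product of two vectors divided by the product of their Euclidean norms. The helper name `cosine_similarity` and the `ratings` dictionary are illustrative choices, not part of the original notebook; the values are copied from the table above.

```python
from itertools import combinations

import numpy as np

# Ratings copied from the table above, in the column order:
# Interstellar, Godfather, Life of Brian, The Incredibles, Monsters, Inc.
ratings = {
    "X": np.array([8.1, 7.9, 4.5, 2.0, 3.5]),
    "Y": np.array([5.1, 6.4, 3.2, 8.7, 8.1]),
    "Z": np.array([7.2, 8.5, 3.2, 3.0, 2.4]),
}


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Compare every unordered pair of users
similarities = {
    (u, v): cosine_similarity(ratings[u], ratings[v])
    for u, v in combinations(ratings, 2)
}

for pair, sim in similarities.items():
    print(pair, round(sim, 3))

# The pair with the largest cosine similarity is the most similar
most_similar = max(similarities, key=similarities.get)
print("Most similar users:", most_similar)
```

A value close to 1 means the two rating vectors point in almost the same direction, so the users have similar tastes regardless of how generous their ratings are overall.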


## Evaluation
There are three main ways to evaluate a recommender system:


1.   **Rating prediction accuracy**: how close are the predicted ratings to actual ratings?
2.   **Classification accuracy**: is the recommended item relevant for me?
3.   **Ranking accuracy**: does a ranked list of recommendations match my preferences?

### Rating Prediction accuracy
The error on the rating of item $i$ by user $u$ is defined as:

\begin{equation}
e_{ui} = r_{ui} - \hat r_{ui}
\end{equation}

where $r_{ui}$ is the actual rating and $\hat r_{ui}$ is the predicted rating.

Two main metrics, computed over the set $R$ of ratings being evaluated:

*   **MAE** (mean absolute error):
\begin{equation}
\frac{1}{|R|} \sum_{r_{ui}\in R} |e_{ui}|
\end{equation}


*   **RMSE** (root mean squared error):
\begin{equation}
\sqrt{\frac{1}{|R|} \sum_{r_{ui}\in R} e_{ui}^2}
\end{equation}


Based on the dataframe below, what are the MAE and the RMSE of this recommender?

In [8]:
df_rec = pd.DataFrame([['W','Interstellar',8.4, 7.5], ['W','Life of Brian',3.4, 7.5], ['V',"Godfather",8.7,8.0], ['V','The Incredibles',3.8, 5.1]], columns=["User", "Movie", "Rating", "Prediction"])
display(df_rec)

Unnamed: 0,User,Movie,Rating,Prediction
0,W,Interstellar,8.4,7.5
1,W,Life of Brian,3.4,7.5
2,V,Godfather,8.7,8.0
3,V,The Incredibles,3.8,5.1


In [53]:
# Your turn
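
To check a hand calculation, the two formulas above can be applied directly to the dataframe. The following is a minimal sketch that rebuilds `df_rec` so it runs on its own; the variable names `errors`, `mae` and `rmse` are illustrative.

```python
import numpy as np
import pandas as pd

# Same data as df_rec above, rebuilt here so the example is self-contained
df_rec = pd.DataFrame(
    [
        ["W", "Interstellar", 8.4, 7.5],
        ["W", "Life of Brian", 3.4, 7.5],
        ["V", "Godfather", 8.7, 8.0],
        ["V", "The Incredibles", 3.8, 5.1],
    ],
    columns=["User", "Movie", "Rating", "Prediction"],
)

# e_ui = r_ui - r_hat_ui for every rated (user, item) pair
errors = df_rec["Rating"] - df_rec["Prediction"]

mae = errors.abs().mean()             # mean of the absolute errors
rmse = np.sqrt((errors ** 2).mean())  # square root of the mean squared error

print(f"MAE  = {mae:.3f}")
print(f"RMSE = {rmse:.3f}")
```

Because the errors are squared before averaging, the RMSE is pulled up more than the MAE by the single large miss on Life of Brian.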


### Classification accuracy
Three main metrics, where $TP$, $FP$ and $FN$ are the numbers of true positives, false positives and false negatives:


*   Precision
\begin{equation}
P = \dfrac{TP}{TP+FP}
\end{equation}
*   Recall
\begin{equation}
R = \dfrac{TP}{TP+FN}
\end{equation}
*   F1
\begin{equation}
F_1 = 2 \cdot \dfrac{P\cdot R}{P+R} = \dfrac{2}{\frac{1}{P}+\frac{1}{R}}
\end{equation}


Based on the dataframe below, add a column with the type of each prediction (TP, FP, FN or TN), then calculate the precision, the recall and the F1 score.

In [38]:
# Retrieve five more movies and append them to the first dataframe
movies2 = ["Oppenheimer", "Gladiator", "Dune", "Home Alone", "Barbie"]
data = []
for title in movies2:
  movie.get_movie(title=title)
  data.append(movie.get_data('title', 'year', 'genre', 'director', 'imdbrating'))
df2 = pd.concat((df, pd.DataFrame(data)), ignore_index=True)
display(df2)

Unnamed: 0,title,year,genre,director,imdbrating
0,Interstellar,2014,"Adventure, Drama, Sci-Fi",Christopher Nolan,8.7
1,Godfather,2022,"Action, Crime, Drama",Mohan Raja,5.2
2,Life of Brian,1979,Comedy,Terry Jones,8.0
3,The Incredibles,2004,"Animation, Action, Adventure",Brad Bird,8.0
4,"Monsters, Inc.",2001,"Animation, Adventure, Comedy","Pete Docter, David Silverman, Lee Unkrich",8.1
5,Oppenheimer,2023,"Biography, Drama, History",Christopher Nolan,8.5
6,Gladiator,2000,"Action, Adventure, Drama",Ridley Scott,8.5
7,Dune,2021,"Action, Adventure, Drama",Denis Villeneuve,8.0
8,Home Alone,1990,"Comedy, Family",Chris Columbus,7.7
9,Barbie,2023,"Adventure, Comedy, Fantasy",Greta Gerwig,7.0


In our dataset, a movie is relevant if its actual rating is greater than or equal to 8.0, and it is recommended if its recommender rating is greater than or equal to 8.0.

In [39]:
# Work on an explicit copy so that adding columns does not trigger a SettingWithCopyWarning
df_rating = df2[['title', 'imdbrating']].copy()
df_rating['imdbrating'] = df_rating['imdbrating'].astype(float)
df_rating['Recommander rating'] = [6.9, 5.9, 8.3, 3.2, 7.4, 8.6, 5.1, 7.5, 8.5, 9.0]
# 'R' = relevant (rating >= 8.0), 'NR' = not relevant
df_rating['Actual'] = np.where(df_rating['imdbrating'] >= 8.0, 'R', 'NR')
df_rating['Prediction'] = np.where(df_rating['Recommander rating'] >= 8.0, 'R', 'NR')
display(df_rating)

Unnamed: 0,title,imdbrating,Recommander rating,Actual,Prediction
0,Interstellar,8.7,6.9,R,NR
1,Godfather,5.2,5.9,NR,NR
2,Life of Brian,8.0,8.3,R,R
3,The Incredibles,8.0,3.2,R,NR
4,"Monsters, Inc.",8.1,7.4,R,NR
5,Oppenheimer,8.5,8.6,R,R
6,Gladiator,8.5,5.1,R,NR
7,Dune,8.0,7.5,R,NR
8,Home Alone,7.7,8.5,NR,R
9,Barbie,7.0,9.0,NR,R


In [54]:
#Your code
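
One possible way to approach this exercise is sketched below. It rebuilds the rating columns from the table above so it runs on its own, and it labels each row by comparing the actual and predicted relevance at the 8.0 threshold stated earlier. The column name `Type` and the use of `np.select` are illustrative choices, not the notebook's prescribed solution.

```python
import numpy as np
import pandas as pd

# Values copied from the df_rating table above
df_rating = pd.DataFrame({
    "title": ["Interstellar", "Godfather", "Life of Brian", "The Incredibles",
              "Monsters, Inc.", "Oppenheimer", "Gladiator", "Dune",
              "Home Alone", "Barbie"],
    "imdbrating": [8.7, 5.2, 8.0, 8.0, 8.1, 8.5, 8.5, 8.0, 7.7, 7.0],
    "Recommander rating": [6.9, 5.9, 8.3, 3.2, 7.4, 8.6, 5.1, 7.5, 8.5, 9.0],
})

actual = df_rating["imdbrating"] >= 8.0             # truly relevant
predicted = df_rating["Recommander rating"] >= 8.0  # recommended

# Label each row with its prediction type; rows matching no condition are TN
df_rating["Type"] = np.select(
    [actual & predicted, ~actual & predicted, actual & ~predicted],
    ["TP", "FP", "FN"],
    default="TN",
)

counts = df_rating["Type"].value_counts()
tp, fp, fn = (int(counts.get(label, 0)) for label in ("TP", "FP", "FN"))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(df_rating[["title", "Type"]])
print(f"Precision = {precision:.3f}, Recall = {recall:.3f}, F1 = {f1:.3f}")
```

Note that true negatives do not appear in any of the three formulas, which is why only `tp`, `fp` and `fn` are extracted.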


### Precision @ k
In this section, we look at precision@k and recall@k. First, here are the definitions of these metrics.

Definitions:
- Precision@k is the proportion of recommended items in the top-k set that are relevant.
- Recall@k is the proportion of all relevant items that are found in the top-k recommendations.

And mathematically, this gives:


```
Precision@k = (# of recommended items @k that are relevant) / (# of recommended items @k)
Recall@k = (# of recommended items @k that are relevant) / (total # of relevant items)
```

Calculate the precision@5 and the recall@5 for the previous dataframe, ranking the movies by their recommender rating.


In [55]:
#your code
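
A minimal sketch of one way to compute these two metrics follows. It assumes the top-k set is obtained by sorting the movies by recommender rating in descending order, and that relevance is defined by the 8.0 threshold on the actual rating used earlier. The names `k`, `top_k` and `hits` are illustrative.

```python
import pandas as pd

# Values copied from the df_rating table above
df_rating = pd.DataFrame({
    "title": ["Interstellar", "Godfather", "Life of Brian", "The Incredibles",
              "Monsters, Inc.", "Oppenheimer", "Gladiator", "Dune",
              "Home Alone", "Barbie"],
    "imdbrating": [8.7, 5.2, 8.0, 8.0, 8.1, 8.5, 8.5, 8.0, 7.7, 7.0],
    "Recommander rating": [6.9, 5.9, 8.3, 3.2, 7.4, 8.6, 5.1, 7.5, 8.5, 9.0],
})

k = 5

# A movie is relevant if its actual rating is at least 8.0
relevant = df_rating["imdbrating"] >= 8.0

# The k movies the recommender ranks highest
top_k = df_rating.sort_values("Recommander rating", ascending=False).head(k)

# Number of relevant movies among the top-k recommendations
hits = int(relevant.loc[top_k.index].sum())

precision_at_k = hits / k
recall_at_k = hits / int(relevant.sum())

print(top_k[["title", "Recommander rating", "imdbrating"]])
print(f"Precision@{k} = {precision_at_k:.3f}")
print(f"Recall@{k}    = {recall_at_k:.3f}")
```

Unlike the threshold-based precision and recall of the previous exercise, these metrics depend only on the ranking of the recommender ratings, not on whether they exceed 8.0.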
