### MOVIE RECOMMENDATION SISTEM


Prvo importujemo potrebne biblioteke i dodajemo imena fajlova koje cemo koristiti u daljem radu.

```python

In [1]:
import pandas as pd
import numpy as np

TMDB_MOVIES = "./data/movies.csv"
TMDB_CREDITS = "./data/credits.csv"
RATINGS_SMALL = "./data/ratings.csv"

### 1. Za parsiranje fajlova koristimo pandas i funkciju:
columns je opcioni parametar koji nam omogucava da odredimo nazive kolona u tabeli koje cemo citati iz fajla.

In [2]:
def parse_csv(filepath: str, columns: list = None) -> pd.DataFrame:
    if columns is None:
        df = pd.read_csv(filepath)
    else:
        df = pd.read_csv(filepath, usecols=columns)
    return df

Citamo fajlove, spajamo i dodajemo imena kolona. 

In [3]:
# number of first n movies filters will return
RESULT_SIZE = 20
attributes = ["movie_id", "title", "cast", "crew"]
df_credits = parse_csv(TMDB_CREDITS, attributes)
# changing the name of cols
df_credits.columns = ["id", "title", "cast", "crew"]

# df movies
attributes = ["genres", "id", "keywords", "vote_average", "vote_count", "tagline","popularity", "release_date", "runtime", "overview"]
df_movies = parse_csv(TMDB_MOVIES, attributes)

df_movies = df_movies.merge(df_credits, on="id")
df_movies.head()

Unnamed: 0,genres,id,keywords,overview,popularity,release_date,runtime,tagline,vote_average,vote_count,title,cast,crew
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...",150.437577,2009-12-10,162.0,Enter the World of Pandora.,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...",139.082615,2007-05-19,169.0,"At the end of the world, the adventure begins.",6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",A cryptic message from Bond’s past sends him o...,107.376788,2015-10-26,148.0,A Plan No One Escapes,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",Following the death of District Attorney Harve...,112.31295,2012-07-16,165.0,The Legend Ends,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","John Carter is a war-weary, former military ca...",43.926995,2012-03-07,132.0,"Lost in our world, found in another.",6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


Nakon uspjesnog spajanja fajlova dobijamo potrebnu tabelu koju mozemo koristiti u nastavku.

### 2. Demografsko filtriranje
Ovi sistemi pružaju generalizovane preporuke svakom korisniku na osnovu popularnosti filma i/ili žanra. Osnovna ideja iza ovog sistema jeste da će filmovi koji su popularni i dobili kritičke pohvale imati veću verovatnoću da se sviđaju prosečnoj publici.


Metrika koju ćemo koristiti za sortiranje filmova je IMDB-ov weighted rating:


![wr.png](images/wr.png)

- v je broj glasova za film
- m je minimalni broj glasova potreban da bi se našao na listi
- R je prosječna ocjena filma
- C je prosječni broj glasova na listi

```python

In [4]:
CORRECT_FACTOR = 100000
def rate_movie(movie, m, C):
    v = movie["vote_count"]
    R = movie["vote_average"]
    return ((v/(v+m)*R) + m*(m+v)*C)/CORRECT_FACTOR

In [5]:
# filter the given dataset and returns top resultsize movies
# wr = (v/(v+m)*R) + (m/(m+v)*C)
# v is the number of votes for the movie;
# m is the minimum votes required to be listed in the chart;
# R is the average rating of the movie; And
# C is the mean vote across the whole report
def demographic_filtering(movies: pd.DataFrame, result_size: int = 20):
    # v - vote_count
    # R - vote_average
    m = movies["vote_count"].quantile(0.9)
    C = movies["vote_average"].mean()
    

    # removing all movies with low number of votes
    movies = movies.loc[movies["vote_count"] >= m]
    # rating every movie
    rate = movies.apply(lambda movie: rate_movie(movie, m, C), axis=1)
    movies.insert(0, "rate", rate)
    # sort my resulted movies by rate
    movies = movies.sort_values(by="rate", ascending=False)
    return movies[0:result_size]

Na osnovu ove metrike mozemo izracunati tezinu svakog filma i sortirati ih po toj tezini, te tako dobiti najbolje ocijenjene filmove za metodu demografskog filtriranja.

Vracamo fiksan broj najboljih filmova koji je odredjen parametrom RESLUL_SIZE.

In [6]:
results = demographic_filtering(df_movies, RESULT_SIZE)
results[["rate","title", "vote_count", "vote_average"]]

Unnamed: 0,rate,title,vote_count,vote_average
96,1746.101204,Inception,13752,8.1
65,1550.103861,The Dark Knight,12002,8.2
0,1527.480159,Avatar,11800,7.2
16,1524.792197,The Avengers,11776,7.4
788,1437.321382,Deadpool,10995,7.4
95,1422.985582,Interstellar,10867,8.1
287,1336.970744,Django Unchained,10099,7.8
94,1296.987287,Guardians of the Galaxy,9742,7.9
426,1264.843714,The Hunger Games,9455,6.9
127,1261.707759,Mad Max: Fury Road,9427,7.2


### 3. Kolaorativno filtriranje

Ovaj sistem upoređuje osobe sa sličnim interesovanjima i pruža preporuke na osnovu ovog podudaranja. Kolaborativni filteri ne zahtijevaju metapodatke elementa (filmova) kao njihovi sadržajno bazirani pandani.

Postoje dvije vrste kolaborativnog filtriranja: user-based i item-based. Implementiraćemo oba i uporediti rezultate.

Koristićemo Single Value Decomposition da smanjimo dimenzionalnost problema. 
Koristićemo metrike poput RMSE i MSE za upoređivanje rezultata.


- postavljamo imena kolona koje cemo koristiti u daljem radu
- mijenjamo ime kolone movieId u id da bi mogli da spojimo tabele
- citamo fajlove
- spajamo tabele movies i ratings

In [7]:
# COLLABORATIVE FILTERING
attributes = ["userId", "movieId", "rating"]
# loading the files
df_movies = df_movies.rename(columns={"id": "movieId"})

df_ratings = parse_csv(RATINGS_SMALL, attributes)

titles = df_movies[["title", "movieId"]]
# merging two tables
df_ratings_title = df_ratings.merge(titles, on="movieId")
df_ratings_title.head(10)

Unnamed: 0,userId,movieId,rating,title
0,1,2105,4.0,American Pie
1,4,2105,4.0,American Pie
2,15,2105,4.0,American Pie
3,30,2105,2.0,American Pie
4,34,2105,4.0,American Pie
5,35,2105,3.5,American Pie
6,41,2105,5.0,American Pie
7,49,2105,3.0,American Pie
8,59,2105,1.0,American Pie
9,73,2105,3.0,American Pie


Pogledajmo kako izgleda nasa matrica korisnika i filmova. Koristicemo funkciju get_user_item_matrix koja vraca matricu korisnika i filmova. Na x-osi su filmovi, a na y-osi korisnici. Vrijednosti u matrici su ocjene koje su korisnici dali filmovima.

In [8]:
def get_user_item_matrix(df: pd.DataFrame) -> pd.DataFrame:
    user_item_matrix = df.pivot(
        index=["userId"], columns=["movieId"], values=["rating"]
    ).fillna(0)
    return user_item_matrix


In [9]:
get_user_item_matrix(df_ratings_title)

Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
movieId,5,11,12,13,14,16,18,19,20,22,...,64499,69640,74510,77866,89492,98369,103731,114635,115210,116977
userId,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,0.0,3.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
668,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
670,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Kao sto smo mogli pretpostaviti matrica je veoma rijetka, jer svaki korisnik je ocijenio samo mali broj filmova. To je veliki problem za kolaborativno filtriranje, jer ne mozemo da izracunamo slicnost izmedju korisnika koji nisu ocijenili iste filmove. 

Zbog toga cemo koristiti SVD da smanjimo dimenzionalnost problema.

#### 1. Sta je SVD (Single Value Decomposition)?

SVD je tehnika za smanjivanje dimenzionalnosti koja se koristi za kompresiju podataka. SVD se može koristiti za smanjivanje dimenzionalnosti u skupu podataka. 

Jedan način za rukovanje skalabilnošću i retkoćom koju stvara kolaborativno filtriranje je iskorišćavanje latentnog faktorskog modela kako bi se uhvatila sličnost između korisnika i filmova. Bitno je pretvoriti problem recommendation sistema u problem optimizacije. Možemo ga posmatrati kao koliko smo dobri u predviđanju ocena za filmove datog korisnika. Jedna uobičajena metrika je Koren srednjekvadratna greška (RMSE). Što je niži RMSE, to je bolja performansa.

Sada, kada govorimo o latentnom faktoru, možda se pitate šta je to? To je široka ideja koja opisuje osobinu ili koncept koji korisnik ili stavka imaju. Na primer, za muziku, latentni faktor može se odnositi na žanr kojem muzika pripada. SVD smanjuje dimenziju matrice korisnosti izvlačeći njene latentne faktore. Bitno je mapirati svakog korisnika i svaku stavku u latentni prostor sa dimenzijom r. Na taj način, pomaže nam da bolje razumemo odnos između korisnika i stavki jer postaju direktno uporedivi. Sledeća slika ilustruje ovu ideju.

![wr.png](images/SVD-1.jpg.webp)



- Prvo cemo importovati SVD iz surprise biblioteke i sve potrebne funkcije

In [10]:
from surprise import Reader, Dataset, SVD, accuracy
from surprise.model_selection import cross_validate, train_test_split

#### 1.1 User-based kolaborativno filtriranje

- Treniramo nas model na trening setu

In [11]:
def get_svd_model(df: pd.DataFrame) -> SVD:
    reader = Reader()
    # convert pandas dataframe to surprise dataset
    data = Dataset.load_from_df(df, reader)

    svd = SVD()
    # Perform train-test split
    trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

    svd.fit(trainset)

    predictions = svd.test(testset)
    
    # Then compute RMSE
    accuracy.rmse(predictions)

    return svd

Za odredjenog korisnika mozemo dobiti njegove najbolje preporuke. Izracunavamo slicnost izmedju korisnika i sortiramo ih po slicnosti. Vracamo fiksni broj najboljih preporuka. 

Takodje mozemo koristiti funkciju resize_movie_dataset da bi suzili skup filmova na one koji su najbolje ocijenjeni.

In [12]:
# Resize movie dataset to get most popular movies to apply collaborativve filtering on them
def resize_movie_dataset(movie_df: pd.DataFrame, procentage: float = 0.7):
    m = movie_df["vote_count"].quantile(procentage)
    movie_df = movie_df.loc[movie_df["vote_count"] >= m]
    return movie_df

In [51]:
def CF_get_top_movies_for_user(
    user_id: int,
    df_users: pd.DataFrame,
    df_movies: pd.DataFrame,
    result_size: int = 20,
):
    movies = resize_movie_dataset(df_movies, 0.9)
    # movies = df_movies
    model = get_svd_model(df_users)
    # print(movies.head(5))
    # print(movies.columns)
    # rating every movie
    rate = movies.apply(
        lambda movie: model.predict(user_id, movie["movieId"])[3], axis=1
    )
    movies.insert(0, "rate", rate)
    # print(movies.head(5))
    movies = movies.sort_values(by="rate", ascending=False)
    return movies[0:result_size]

Provjeravamo korisnika sa odredjenim id - jem kako bi vidjeli koje filmove je ocijenio i koje filmove mu preporucujemo.

In [59]:
titles = df_movies[["title", "movieId"]]
# merging two tables
df_ratings_title = df_ratings.merge(titles, on="movieId")
user_ratings = df_ratings_title.loc[df_ratings_title["userId"] == 5]
# user_ratings
titles.shape
df_ratings_title.shape
user_ratings.head(10)

Unnamed: 0,userId,movieId,rating,title
48,5,2294,4.0,Jay and Silent Bob Strike Back
1605,5,364,4.0,Batman Returns
1805,5,377,4.0,A Nightmare on Elm Street
2461,5,500,4.5,Reservoir Dogs
2955,5,586,4.0,Wag the Dog
3212,5,588,3.5,Silent Hill
4169,5,595,4.0,To Kill a Mockingbird
4664,5,141,4.0,Donnie Darko
5240,5,440,4.0,Aliens vs Predator: Requiem
6191,5,1544,3.5,Imagine Me & You


In [60]:
if 'rate' in df_movies.columns:
    df_movies = df_movies.drop(columns=['rate'], axis=1)
results = CF_get_top_movies_for_user(5, df_ratings,df_movies, RESULT_SIZE)
results[["rate", "title", "vote_count", "vote_average"]]

RMSE: 0.9035


Unnamed: 0,rate,title,vote_count,vote_average
2108,4.547764,Edward Scissorhands,3601,7.5
425,4.533163,Mission: Impossible,2631,6.7
509,4.500455,Madagascar,3237,6.6
1145,4.484514,The Sixth Sense,3147,7.7
1260,4.466288,Amélie,3310,7.8
566,4.450447,Cars,3877,6.6
690,4.332125,The Green Mile,4048,8.2
197,4.324881,Harry Potter and the Philosopher's Stone,7006,7.5
2085,4.322342,Raiders of the Lost Ark,3854,7.7
150,4.311502,Men in Black II,3114,6.0


### 1.2 Item-based kolaborativno filtriranje

- Treniramo nas model na trening setu
- Za odredjeni film mozemo dobiti njegove najbolje preporuke. Izracunavamo slicnost izmedju filmova i sortiramo ih po slicnosti. Vracamo fiksni broj najboljih preporuka.

- Prvo za odredjeno ime filma dobijamo njegov id

In [61]:
def get_movie_id(movieName: str, movies: pd.DataFrame):
    movie = movies.loc[movies["title"] == movieName]
    # if we have more than one movie with the same name but they are different
    # i will just send the first because later we will send id with each movie!
    if len(movie) > 1:
        print(movie.iloc[0]["movieId"])
        return movie.iloc[0]["movieId"]
    
    elif movie.empty:
        return None
    try:
        movie_id = int(movie["movieId"])
        return movie_id
    except ValueError:
        return None

- Zatim za taj id dobijamo preporuke

In [62]:
def get_svd_model_ub(df: pd.DataFrame) -> SVD:
    reader = Reader()
    # convert pandas dataframe to surprise dataset
    data = Dataset.load_from_df(df[['movieId', 'movieId', 'rating']], reader)

    svd = SVD()
    # Perform train-test split
    trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

    svd.fit(trainset)

    predictions = svd.test(testset)
    
    # Then compute RMSE
    accuracy.rmse(predictions)

    return svd

In [63]:
def CF_get_top_movies_for_movie(movieName:str, 
                                df_users: pd.DataFrame, 
                                df_movies: pd.DataFrame,
                                result_size: int = 20):
    movie_id = get_movie_id(movieName, df_movies)
    if movie_id is None:
        print("Movie doesnt exist")
        return None
    model = get_svd_model_ub(df_users)
    rate = df_movies.apply(
        lambda movie: model.predict(movie_id, movie['movieId'])[3], axis=1
        )
    df_movies.insert(0, 'rate', rate)
    df_movies = df_movies.sort_values(by="rate", ascending=False)
    return df_movies[0:result_size]

In [74]:
if 'rate' in df_movies.columns:
    df_movies = df_movies.drop(columns=['rate'], axis=1)
user_item_matrix = get_user_item_matrix(df_ratings_title)
results = CF_get_top_movies_for_movie("Batman", df_ratings, df_movies, 10)
if not results.empty:
    # user_item_matrix.transpose
    print(results[["rate", "title", "vote_count", "vote_average"]])

268
RMSE: 0.9833
          rate                    title  vote_count  vote_average
1359  4.094753                   Batman        2096           7.0
4457  4.089622            Pandora's Box          45           7.6
514   4.028722    Ice Age: The Meltdown        2951           6.5
1025  3.995017  The Thomas Crown Affair         339           6.6
3934  3.994636                 Basquiat          94           6.6
225   3.992939              Speed Racer         354           5.7
1391  3.981178           License to Wed         253           5.2
952   3.980283    Beverly Hills Cop III         434           5.5
2569  3.978836              Match Point        1105           7.3
3338  3.977254               Flashdance         302           6.1
