# MOVIE RECOMMENDATION SISTEM


Recommendation engine je skup razlicitih algoritama koji analiziraju i filtriraju podatke te najrelevantije pokazuje korisnicama. Prvo zabiljezi posljednje postupke korisnika pa na osnovu toga predlaze neke druge ponude. 

Na primjer, ukoliko korisnik često gleda filmove određenog žanra ili kupuje proizvode određene vrste, sistem će zabeležiti ove aktivnosti. Na osnovu tih podataka, recommendation engine može preporučiti slične filmove ili proizvode koji bi mogli biti od interesa korisniku. 



Prvo importujemo potrebne biblioteke i dodajemo imena fajlova koje cemo koristiti u daljem radu.

U konstante zapisujemo putanje do fajlova koje cemo koristiti u daljem radu.

- **movies.csv** - sadrzi podatke o filmovima
- **ratings.csv** - sadrzi podatke o ocjenama filmova
- **credits.csv** - sadrzi podatke o glumcima i ekipi filma

In [1]:
import pandas as pd
import numpy as np

TMDB_MOVIES = "./data/movies.csv"
TMDB_CREDITS = "./data/credits.csv"
RATINGS_SMALL = "./data/ratings.csv"

## 1. Za parsiranje fajlova koristimo pandas i funkciju:

Funckija **read_csv()** iz pandas biblioteke nam omogucava da ucitamo csv fajlove u dataframe. **Columns** je opcioni parametar koji nam omogucava da odaberemo samo odredjene kolone koje zelimo da ucitamo u memoriju.

In [2]:
def parse_csv(filepath: str, columns: list = None) -> pd.DataFrame:
    if columns is None:
        df = pd.read_csv(filepath)
    else:
        df = pd.read_csv(filepath, usecols=columns)
    return df

Kako imamo 3 fajla koja cemo koristiti, prvo cemo ih ucitati u memoriju, a zatim biramo kolone koje su nam potrebne za dalji rad. Pa tako spajamo podatke iz fajlova **movies.csv** i **credits.csv** u jedan dataframe. I koristimo kolone:
- **id** - id filma
- **title** - naziv filma
- **cast** - glumci
- **crew** - ekipa filma
- **keywords** - kljucne rijeci filma
- **genres** - zanrovi filma
- **vote_average** - prosjecna ocjena filma
- **vote_count** - broj ocjena filma
- **overview** - kratak opis filma


In [3]:
# number of first n movies filters will return
RESULT_SIZE = 5
attributes = ["movie_id", "title", "cast", "crew"]
df_credits = parse_csv(TMDB_CREDITS, attributes)
# changing the name of cols
df_credits.columns = ["id", "title", "cast", "crew"]

# df movies
attributes = ["genres", "id", "keywords", "vote_average", "vote_count", "tagline", "overview"]
df_movies = parse_csv(TMDB_MOVIES, attributes)

df_movies = df_movies.merge(df_credits, on="id")
df_movies.head()

Unnamed: 0,genres,id,keywords,overview,tagline,vote_average,vote_count,title,cast,crew
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...",Enter the World of Pandora.,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...","At the end of the world, the adventure begins.",6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",A cryptic message from Bond’s past sends him o...,A Plan No One Escapes,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",Following the death of District Attorney Harve...,The Legend Ends,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","John Carter is a war-weary, former military ca...","Lost in our world, found in another.",6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


Nakon uspjesnog spajanja fajlova dobijamo potrebnu tabelu koju mozemo koristiti u nastavku.

## 2. Demografsko filtriranje
Ovi sistemi pružaju generalizovane preporuke svakom korisniku na osnovu popularnosti filma i/ili žanra. Osnovna ideja iza ovog sistema jeste da će filmovi koji su popularni i dobili kritičke pohvale imati veću verovatnoću da se sviđaju prosečnoj publici.


Metrika koju ćemo koristiti za sortiranje filmova je IMDB-ov weighted rating:


![wr.png](images/wr.png)

- **v** - je broj glasova za film
- **m** - je minimalni broj glasova potreban da bi se našao na listi
- **R** - je prosječna ocjena filma
- **C** - je prosječni broj glasova na listi


In [4]:
CORRECT_FACTOR = 100000
def rate_movie(movie, m, C):
    v = movie["vote_count"]
    R = movie["vote_average"]
    return ((v/(v+m)*R) + m*(m+v)*C)/CORRECT_FACTOR

In [5]:
# filter the given dataset and returns top resultsize movies
# wr = (v/(v+m)*R) + (m/(m+v)*C)
# v is the number of votes for the movie;
# m is the minimum votes required to be listed in the chart;
# R is the average rating of the movie; And
# C is the mean vote across the whole report
def demographic_filtering(movies: pd.DataFrame, result_size: int = 20):
    # v - vote_count
    # R - vote_average
    m = movies["vote_count"].quantile(0.9)
    C = movies["vote_average"].mean()
    

    # removing all movies with low number of votes
    movies = movies.loc[movies["vote_count"] >= m]
    # rating every movie
    rate = movies.apply(lambda movie: rate_movie(movie, m, C), axis=1)
    movies.insert(0, "rate", rate)
    # sort my resulted movies by rate
    movies = movies.sort_values(by="rate", ascending=False)
    return movies[0:result_size]

Na osnovu ove metrike mozemo izracunati tezinu svakog filma i sortirati ih po toj tezini, te tako dobiti najbolje ocijenjene filmove za metodu demografskog filtriranja.

Vracamo fiksan broj najboljih filmova koji je odredjen parametrom RESLULT_SIZE.

In [6]:
results = demographic_filtering(df_movies, RESULT_SIZE)
results[["rate","title", "vote_count", "vote_average"]].head(RESULT_SIZE)

Unnamed: 0,rate,title,vote_count,vote_average
96,1746.101204,Inception,13752,8.1
65,1550.103861,The Dark Knight,12002,8.2
0,1527.480159,Avatar,11800,7.2
16,1524.792197,The Avengers,11776,7.4
788,1437.321382,Deadpool,10995,7.4


## 3. Kolaborativno filtriranje

Ovaj sistem upoređuje osobe sa sličnim interesovanjima i pruža preporuke na osnovu ovog podudaranja. Kolaborativni filteri ne zahtijevaju metapodatke elementa (filmova) kao njihovi sadržajno bazirani pandani.

Postoje dvije vrste kolaborativnog filtriranja: user-based i item-based. Implementiraćemo oba i uporediti rezultate.

Koristićemo **Single Value Decomposition** da smanjimo dimenzionalnost problema. 
Koristićemo metrike poput **RMSE** i **MSE** za upoređivanje rezultata.


- postavljamo imena kolona koje cemo koristiti u daljem radu
- mijenjamo ime kolone movieId u id da bi mogli da spojimo tabele
- citamo fajlove
- spajamo tabele movies i ratings

In [7]:
# COLLABORATIVE FILTERING
attributes = ["userId", "movieId", "rating"]
# loading the files
df_movies = df_movies.rename(columns={"id": "movieId"})

df_ratings = parse_csv(RATINGS_SMALL, attributes)

titles = df_movies[["title", "movieId"]]
# merging two tables
df_ratings_title = df_ratings.merge(titles, on="movieId")
df_ratings_title.head(10)

Unnamed: 0,userId,movieId,rating,title
0,1,2105,4.0,American Pie
1,4,2105,4.0,American Pie
2,15,2105,4.0,American Pie
3,30,2105,2.0,American Pie
4,34,2105,4.0,American Pie
5,35,2105,3.5,American Pie
6,41,2105,5.0,American Pie
7,49,2105,3.0,American Pie
8,59,2105,1.0,American Pie
9,73,2105,3.0,American Pie


Pogledajmo kako izgleda nasa matrica korisnika i filmova. Koristicemo funkciju **get_user_item_matrix** koja vraca matricu korisnika i filmova. Na x-osi su filmovi, a na y-osi korisnici. Vrijednosti u matrici su ocjene koje su korisnici dali filmovima.

In [8]:
def get_user_item_matrix(df: pd.DataFrame) -> pd.DataFrame:
    user_item_matrix = df.pivot(
        index=["userId"], columns=["movieId"], values=["rating"]
    ).fillna(0)
    return user_item_matrix


In [9]:
get_user_item_matrix(df_ratings_title)

Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
movieId,5,11,12,13,14,16,18,19,20,22,...,64499,69640,74510,77866,89492,98369,103731,114635,115210,116977
userId,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,0.0,3.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
668,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
670,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Kao sto smo mogli pretpostaviti matrica je veoma rijetka, jer svaki korisnik je ocijenio samo mali broj filmova. To je veliki problem za kolaborativno filtriranje, jer ne mozemo da izracunamo slicnost izmedju korisnika koji nisu ocijenili iste filmove. 

Zbog toga cemo koristiti **SVD** da smanjimo dimenzionalnost problema.

## Sta je SVD (Single Value Decomposition)?

SVD je tehnika za smanjivanje dimenzionalnosti koja se koristi za kompresiju podataka. SVD se može koristiti za smanjivanje dimenzionalnosti u skupu podataka. 

Jedan način za rukovanje skalabilnošću i retkoćom koju stvara kolaborativno filtriranje je iskorišćavanje latentnog faktorskog modela kako bi se uhvatila sličnost između korisnika i filmova. Bitno je pretvoriti problem recommendation sistema u problem optimizacije. Možemo ga posmatrati kao koliko smo dobri u predviđanju ocena za filmove datog korisnika. Jedna uobičajena metrika je Koren srednjekvadratna greška (RMSE). Što je niži RMSE, to je bolja performansa.

Sada, kada govorimo o latentnom faktoru, možda se pitate šta je to? To je široka ideja koja opisuje osobinu ili koncept koji korisnik ili stavka imaju. Na primer, za muziku, latentni faktor može se odnositi na žanr kojem muzika pripada. SVD smanjuje dimenziju matrice korisnosti izvlačeći njene latentne faktore. Bitno je mapirati svakog korisnika i svaku stavku u latentni prostor sa dimenzijom r. Na taj način, pomaže nam da bolje razumemo odnos između korisnika i stavki jer postaju direktno uporedivi. Sledeća slika ilustruje ovu ideju.

![wr.png](images/SVD-1.jpg.webp)



![wr.png](images/svd_graph.png)

- Prvo cemo importovati SVD iz surprise biblioteke i sve potrebne funkcije

In [10]:
from surprise import Reader, Dataset, SVD, accuracy
from surprise.model_selection import cross_validate, train_test_split

### 1.1 User-based kolaborativno filtriranje

Ovi sistemi preporučuju proizvode korisniku koje su slični korisnici voljeli. Za mjerenje sličnosti između dva korisnika možemo koristiti kosinusnu sličnost. Ova tehnika filtriranja može se ilustrovati primerom. 

U sledećim matricama, svaki red predstavlja korisnika, dok kolone odgovaraju različitim filmovima osim poslednje koja beleži sličnost između tog korisnika i ciljnog korisnika. Svaka ćelija predstavlja ocjenu koju je korisnik dao tom filmu. Pretpostavimo da je korisnik E ciljni korisnik.

![wr.png](images/ub.png)

Prvo cemo odrediti slicnost izmedju korisnika. Pa pomocu te slicnosti cemo izracunati ocjenu koju bi ciljni korisnik dao filmu.

Tako dobijene filmove sortiramo po ocjeni i vracamo najbolje ocijenjene filmove.

- Treniramo nas model na trening setu

In [11]:
def get_svd_model(df: pd.DataFrame) -> SVD:
    reader = Reader()
    # convert pandas dataframe to surprise dataset
    data = Dataset.load_from_df(df, reader)

    svd = SVD()
    # Perform train-test split
    trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

    svd.fit(trainset)

    predictions = svd.test(testset)
    
    # Then compute RMSE
    accuracy.rmse(predictions)

    return svd

Za odredjenog korisnika mozemo dobiti njegove najbolje preporuke. Izracunavamo slicnost izmedju korisnika i sortiramo ih po slicnosti. Vracamo fiksni broj najboljih preporuka. 

Takodje mozemo koristiti funkciju **resize_movie_dataset** da bi suzili skup filmova na one koji su najbolje ocijenjeni.

In [12]:
# Resize movie dataset to get most popular movies to apply collaborativve filtering on them
def resize_movie_dataset(movie_df: pd.DataFrame, procentage: float = 0.7):
    m = movie_df["vote_count"].quantile(procentage)
    movie_df = movie_df.loc[movie_df["vote_count"] >= m]
    return movie_df

In [13]:
def CF_get_top_movies_for_user(
    user_id: int,
    df_users: pd.DataFrame,
    df_movies: pd.DataFrame,
    result_size: int = 20,
):
    # movies = resize_movie_dataset(df_movies, 0.9)
    movies = df_movies.copy()
    model = get_svd_model(df_users)
    # print(movies.head(5))
    # print(movies.columns)
    # rating every movie
    rate = movies.apply(
        lambda movie: model.predict(user_id, movie["movieId"])[3], axis=1
    )
    movies.insert(0, "rate", rate)
    # print(movies.head(5))
    movies = movies.sort_values(by="rate", ascending=False)
    return movies[0:result_size]

Provjeravamo korisnika sa odredjenim id - jem kako bi vidjeli koje filmove je ocijenio i koje filmove mu preporucujemo.

In [14]:
titles = df_movies[["title", "movieId"]]
# merging two tables
df_ratings_title = df_ratings.merge(titles, on="movieId")
user_ratings = df_ratings_title.loc[df_ratings_title["userId"] == 5]
# user_ratings
titles.shape
df_ratings_title.shape
user_ratings.sort_values(by="rating", ascending=False).head(10)

Unnamed: 0,userId,movieId,rating,title
7308,5,597,5.0,Titanic
7903,5,4995,4.5,Boogie Nights
2461,5,500,4.5,Reservoir Dogs
7273,5,277,4.5,Underworld
6333,5,1961,4.0,My Name Is Bruce
7523,5,1777,4.0,Fahrenheit 9/11
7021,5,104,4.0,Run Lola Run
1605,5,364,4.0,Batman Returns
48,5,2294,4.0,Jay and Silent Bob Strike Back
5240,5,440,4.0,Aliens vs Predator: Requiem


Kao sto vidimo korisnik sa id - jem 5 je gledao razlicite zanrove filmova od ljubavnih do akcionih i naucnofantasticnih. Sto otezava preporuku filmova. 

Testiracemo nas sistem na korisniku sa id - jem 5 i vidjeti koliko je dobar.

In [15]:
if 'rate' in df_movies.columns:
    df_movies = df_movies.drop(columns=['rate'], axis=1)
results = CF_get_top_movies_for_user(99, df_ratings,df_movies, 10)
results[["rate", "title", "vote_count", "vote_average"]]

RMSE: 0.9070


Unnamed: 0,rate,title,vote_count,vote_average
425,4.173839,Mission: Impossible,2631,6.7
1145,4.148357,The Sixth Sense,3147,7.7
1666,4.10242,The Good Thief,31,6.0
150,4.058924,Men in Black II,3114,6.0
1053,4.0435,Galaxy Quest,710,6.9
2278,4.041587,Dances with Wolves,1046,7.6
4457,4.037456,Pandora's Box,45,7.6
1260,3.98994,Amélie,3310,7.8
1028,3.985996,Solaris,357,7.7
1025,3.983865,The Thomas Crown Affair,339,6.6


Takodje kao rezultat naseg filtera dobijamo filmove razlicitog zanra, sto je dobro jer korisnik voli razlicite zanrove filmova. Sto znaci da sistem solidno radi svoj posao. Kada bi analizirali preporuke za korisnika vidjeli bi da se zanrovi filmova poklapaju sa zanrovima koje je korisnik gledao. Pa mozemo smatrati da item-based kolaborativno filtriranje radi dobro.

### 1.2 Item-based kolaborativno filtriranje

- Treniramo nas model na trening setu
- Za odredjeni film mozemo dobiti njegove najbolje preporuke. Izracunavamo slicnost izmedju filmova i sortiramo ih po slicnosti. Vracamo fiksni broj najboljih preporuka.
![wr.png](images/ib.png)

- Prvo za odredjeno ime filma dobijamo njegov id

In [16]:
def get_movie_id(movieName: str, movies: pd.DataFrame):
    movie = movies.loc[movies["title"] == movieName]
    # if we have more than one movie with the same name but they are different
    # i will just send the first because later we will send id with each movie!
    if len(movie) > 1:
        print(movie.iloc[0]["movieId"])
        return movie.iloc[0]["movieId"]
    
    elif movie.empty:
        return None
    try:
        movie_id = int(movie["movieId"])
        return movie_id
    except ValueError:
        return None

- Zatim za taj id dobijamo preporuke

In [17]:
def get_svd_model_ub(df: pd.DataFrame) -> SVD:
    reader = Reader()
    # convert pandas dataframe to surprise dataset
    data = Dataset.load_from_df(df[['movieId', 'userId', 'rating']], reader)

    svd = SVD()
    # Perform train-test split
    trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

    svd.fit(trainset)

    predictions = svd.test(testset)
    
    # Then compute RMSE
    accuracy.rmse(predictions)

    return svd

In [18]:
def CF_get_top_movies_for_movie(movieName:str, 
                                df_users: pd.DataFrame, 
                                df_movies: pd.DataFrame,
                                result_size: int = 20):
    movie_id = get_movie_id(movieName, df_movies)
    if movie_id is None:
        print("Movie doesnt exist")
        return None
    model = get_svd_model_ub(df_users)
    rate = df_movies.apply(
        lambda movie: model.predict(movie_id, movie['movieId'])[3], axis=1
        )
    df_movies.insert(0, 'rate', rate)
    df_movies = df_movies.sort_values(by="rate", ascending=False)
    return df_movies[0:result_size]

In [19]:
if 'rate' in df_movies.columns:
    df_movies = df_movies.drop(columns=['rate'], axis=1)
user_item_matrix = get_user_item_matrix(df_ratings_title)
results = CF_get_top_movies_for_movie("Batman", df_ratings, df_movies, 10)
if not results.empty:
    # user_item_matrix.transpose
    print(results[["rate", "title", "vote_count", "vote_average"]])

268


RMSE: 0.9077
          rate                               title  vote_count  vote_average
398   4.597019                    Ocean's Thirteen        1999           6.5
1557  4.576238                 Million Dollar Baby        2439           7.7
1006  4.543419  Indiana Jones and the Last Crusade        3152           7.6
4346  4.536804                        Transamerica         144           6.9
2166  4.515643                          To Die For         175           6.7
274   4.493813                           Gladiator        5439           7.9
5     4.465043                        Spider-Man 3        3576           5.9
2018  4.439686        There's Something About Mary        1590           6.5
4339  4.435743                              Dr. No         940           6.9
2247  4.435737                   Princess Mononoke        1983           8.2


## 4. Filtriranje na osnovu sadržaja

Da bismo do punog kapaciteta iskoristili podatke koje imamo na raspolaganju, treba da analiziramo i podatke o sadržaju filmova.

Sistemi filtriranja na osnovu sadržaja upoređuju sličnosti između sadržaja filmova i korisnicima kojima se svidio neki film možemo da preporučimo neki njemu sličan.

Dodatni podaci koje ćemo ovdje upotrijebiti predstavljaju kratak opis radnje filma (overview), ključne riječi, režisere i glavne glumce, a po potrebi je moguće iskoristiti još podataka.

In [20]:
from ast import literal_eval

MOVIE_ATTRS = [
        "id",
        "title",
        "genres",
        "budget",
        "keywords",
        "vote_average",
        "vote_count",
        "tagline",
        "popularity",
        "release_date",
        "runtime",
        "overview",
    ]
df_movies = parse_csv(TMDB_MOVIES, MOVIE_ATTRS)

# merge credits data to movies dataframe
df_credits = parse_csv(TMDB_CREDITS)
df_credits.columns = ['id','tittle','cast','crew']
df_movies = df_movies.merge(df_credits, on='id')

# convert missing string values to empty strings
df_movies['overview'] = df_movies['overview'].fillna('')
df_movies = df_movies.drop_duplicates(subset=['id', 'title'])
df_movies = df_movies.dropna()

to_numeric_cols = ['id']
df_movies[to_numeric_cols] = df_movies[to_numeric_cols].apply(pd.to_numeric, errors='coerce')

to_object_cols = ['cast', 'crew', 'keywords', 'genres']
for col in to_object_cols:
    df_movies[col] = df_movies[col].apply(literal_eval)

Definišemo funkcije pomoću kojih ćemo izvući reditelja, 3 glavna glumca, ključne riječi i žanrove.

In [21]:
# extract director's name from crew
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan


# returns the first 3 names from a list
def get_top3_names(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        if len(names) > 3:
            names = names[:3]
        return names
    return []

In [22]:
# extract data from objects
df_movies['director'] = df_movies['crew'].apply(get_director)

names_cols = ['cast', 'keywords', 'genres']
for col in names_cols:
    df_movies[col] = df_movies[col].apply(get_top3_names)

### 4.1 TF-IDF vektori

Prvo ćemo analizirati sličnosti između tekstualnog sadržaja. Da bismo to uradili moramo tekstualni sadržaj polja `overview` pretvoriti u vektor. Koristićemo TF-IDF (Term Frequency - Inverse Document Frequency) vektore teksta. Računa se po formuli TF*IDF, gdje je `TF=broj pojavljivanja pojma / ukupan broj pojmova` i `IDF = log(broj dokumenata / broj dokumenata sa pojmom)`. Pomoću scikit-learn biblioteke pravimo matricu u kojoj je svaki red vektor TF-IDF vrijednosti za jedan opis radnje filma.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df_movies['overview'])
tfidf_matrix.shape

(3959, 18655)

Vidimo da se vektori sastoje od preko 18000 pojmova.

Koristimo kosinusnu sličnost ovih vektora da odredimo koji vektori opisa radnje filmova su slični. Suštinski, računamo kosinus ugla između dva vektora pomoću vektorskog proizvoda.

![cosinus similarity](images/cos.png)

Računamo matricu sličnosti tako da se u presjeku i-tog reda i j-te kolone nalazi kosinusna sličnost i-tog i j-tog filma. Koristimo scikit-learn.

In [24]:
from sklearn.metrics.pairwise import cosine_similarity

overview_similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
overview_similarity_matrix.shape

(3959, 3959)

Definišemo funkciju koja za jedan film rangira sve ostale na osnovu sličnosti sa njim i vraća određeni broj najboljih rezultata.

In [25]:
# definisemo inverzno mapiranje naziva filmova u id-jeve
movie_indices = pd.Series(df_movies.index, index=df_movies['title']).drop_duplicates()

In [26]:
def CBF_get_top_movies_for_movie(movie_title: str, similarity_matrix: np.ndarray, result_size: int = 20) -> pd.DataFrame:
    idx = movie_indices[movie_title]

    similarity_scores = list(enumerate(similarity_matrix[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x : x[1], reverse=True)
    
    if result_size is None:
        result_size = len(similarity_scores)
    # the provided movie is always first
    best_indices = [s[0] for s in similarity_scores[1:result_size+1]]
    result = df_movies.iloc[best_indices].copy()

    # set score column on result
    result.insert(0, "score", [s[1] for s in similarity_scores[1:result_size+1]])
    return result

In [27]:
CBF_get_top_movies_for_movie("The Dark Knight", overview_similarity_matrix, 10)[['title', 'score']]

Unnamed: 0,title,score
79,Iron Man 2,0.180697
31,Iron Man 3,0.15015
1868,Cradle 2 the Grave,0.136069
7,Avengers: Age of Ultron,0.128299
538,Hostage,0.109252
119,Batman Begins,0.104002
4574,Roadside,0.0875
2044,The Little Vampire,0.08607
2633,The Clan of the Cave Bear,0.085849
570,Ransom,0.080365


### 4.2 Count vektori
Vidimo da kao rezultat dobijamo filmove sa sličnim opisima. Međutim, preporuke možemo unaprijediti koristeći i druge kolone. Upotrijebićemo još i žanrove, ključne riječi, reditelja i glavne glumce. Ove vrijednosti moramo da formatiramo i spojimo u string pomoću kojeg možemo napraviti vektore.



In [28]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
        
# Concatenate all the features in one string
def create_soup(x):
    return x['director'] + ' ' + ' '.join(x['cast']) + ' ' + ' '.join(x['genres']) + ' ' + ' '.join(x['keywords'])

In [29]:
features = ['cast', 'keywords', 'director', 'genres']
for feature in features:
    df_movies[feature] = df_movies[feature].apply(clean_data)

df_movies['soup'] = df_movies.apply(lambda x : create_soup(x), axis=1)
df_movies['soup'].head()

0    jamescameron samworthington zoesaldana sigourn...
1    goreverbinski johnnydepp orlandobloom keirakni...
2    sammendes danielcraig christophwaltz léaseydou...
3    christophernolan christianbale michaelcaine ga...
4    andrewstanton taylorkitsch lynncollins samanth...
Name: soup, dtype: object

S obzirom na to da ne želimo da degradiramo uticaj glumca ako je on učestvovao u mnogo filmova, u ovom slučaju nam je pogodnije upotrijebiti CountVectorizer.

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(stop_words='english')
count_matrix = count_vectorizer.fit_transform(df_movies['soup'])
count_similarity_matrix = cosine_similarity(count_matrix, count_matrix)
print(count_similarity_matrix.shape)
count_similarity_matrix

(3959, 3959)


array([[1.        , 0.3       , 0.2       , ..., 0.09534626, 0.        ,
        0.        ],
       [0.3       , 1.        , 0.2       , ..., 0.09534626, 0.        ,
        0.        ],
       [0.2       , 0.2       , 1.        , ..., 0.19069252, 0.        ,
        0.        ],
       ...,
       [0.09534626, 0.09534626, 0.19069252, ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

Sada možemo upotrijebiti ovu matricu sličnosti za dobijanje preporuka:

In [31]:
CBF_get_top_movies_for_movie("The Dark Knight", count_similarity_matrix, 10)[['title', 'score']]

Unnamed: 0,title,score
79,Iron Man 2,0.6
7,Avengers: Age of Ultron,0.4
16,The Avengers,0.4
26,Captain America: Civil War,0.4
31,Iron Man 3,0.4
39,TRON: Legacy,0.4
83,The Lovers,0.358569
193,After Earth,0.33541
4117,Six-String Samurai,0.33541
91,Independence Day: Resurgence,0.316228


Ili kombinovati `overview_similarity_matrix` i `count_similarity_matrix` da ne bismo zanemarili neki podatak. Možemo rezultat dobiti kao linearnu kombinaciju ovih matrica ili proizvod po elementima.

In [32]:
similarity_matrix = overview_similarity_matrix * count_similarity_matrix
print(similarity_matrix.shape)
CBF_get_top_movies_for_movie("The Dark Knight", similarity_matrix, 10)[['title', 'score']]

(3959, 3959)


Unnamed: 0,title,score
79,Iron Man 2,0.108418
31,Iron Man 3,0.06006
7,Avengers: Age of Ultron,0.051319
2912,Star Wars,0.016025
1628,Sanctum,0.015474
1868,Cradle 2 the Grave,0.013607
127,Mad Max: Fury Road,0.013046
182,Ant-Man,0.012214
0,Avatar,0.011755
1802,Highlander: The Final Dimension,0.011596


### 4.3 Predlaganje sadržaja za korisnika

Možemo upotrijebiti informacije o tome koje je filmove korisnik ranije gledao i kakavu je ocjenu ostavio da bismo preporučivali filmove sličnog sadržaja tim filmovima. Koristimo podatke o ocjenama korisnika iz gore definisanog DataFrame-a `df_ratings`

In [33]:
# helper function to convert similarity score to predicted user rating
def convert_to_user_ratings(df: pd.DataFrame, user_id: int) -> pd.DataFrame:
    min_score = df['score'].min()
    max_score = df['score'].max()
    df['rate'] = df['score'].apply(lambda x : (x - min_score) / (max_score - min_score) * 5)
    return df

In [34]:
# returns the similarity score between the given movie and the movie rated by the user, weighted by the rating
def user_movie_similarity(idx: int, x) -> float:
        if x['title'] not in movie_indices:
            return 0
        if idx > len(similarity_matrix) - 1:
            return 0
        if movie_indices[x['title']] > len(similarity_matrix) - 1:
            return 0
        return x['rating'] * similarity_matrix[idx][movie_indices[x['title']]]


# returns weighted sum of similarity scores for movies rated by user and the given movie
def user_similarity(user_id:int, movie_title: str) -> float:
    idx = movie_indices[movie_title]
    if type(idx) != np.int64:
        return 0
    user_scores = df_ratings[df_ratings['userId'] == user_id]
    user_scores = pd.merge(user_scores, df_movies, left_on='movieId', right_on='id')
    user_scores = user_scores[user_scores['title'] != movie_title]
    if user_scores.empty:
        return 0
    return sum(user_scores.apply(lambda x : user_movie_similarity(idx, x), axis=1)) / sum(user_scores['rating'])


# Returns the dataframe of up to result_size size with the highest similarity scores for a given user.
# Similarity scores are calculated based on movies rated by the user.
# The return dataframe format is the same as input one, with added score column.
def CBF_get_top_movies_for_user(user_id: str, result_size: int = 20) -> pd.DataFrame:
    similarity_scores = list(enumerate(df_movies['title'].apply(lambda x : user_similarity(user_id, x))))
    similarity_scores = sorted(similarity_scores, key=lambda x : x[1], reverse=True)
    
    if result_size is None:
        result_size = len(similarity_scores)
    best_indices = [s[0] for s in similarity_scores[:result_size]]
    result = df_movies.iloc[best_indices].copy()

    # set score column on result
    result.insert(0, "score", [s[1] for s in similarity_scores[:result_size]])
    result = convert_to_user_ratings(result, user_id)
    return result

In [35]:
CBF_get_top_movies_for_user(5, 10)[['title', 'score', 'rate']]

Unnamed: 0,title,score,rate
81,Maleficent,0.005528,5.0
2973,For Greater Glory - The True Story of Cristiada,0.004618,3.317919
2214,Three to Tango,0.004334,2.792678
7,Avengers: Age of Ultron,0.004149,2.450608
197,Harry Potter and the Philosopher's Stone,0.003711,1.639793
3634,Zero Effect,0.003294,0.870229
1111,Victor Frankenstein,0.003142,0.587528
3637,Kill the Messenger,0.003052,0.421733
897,Deck the Halls,0.002841,0.031654
2206,Ghost Town,0.002824,0.0


## 5. Hibridni model

Za najbolje moguće rezultate, potrebno je napraviti kombinaciju filtriranja na osnovu sadržaja i kolaborativnog pristupa. Ovo je upravno ono što pokušavaju hibridni modeli.

Postoji nekoliko načina da kombinujemo ove modele. Ovdje ćemo dati dva jednostavna ali efektivna načina koji su nama dali dobre rezultata.

Hajde prvo da učitamo rezultate oba recommender sistema.

In [36]:
user_id = 5
df_movies['movieId'] = df_movies['id']
df1 = CF_get_top_movies_for_user(user_id, df_ratings, df_movies, None)
df2 = CBF_get_top_movies_for_user(user_id, None)

RMSE: 0.9024


### 5.1 Težinski (Weighted) sistem

Rejting svakog filma računama kao linearnu kombinaciju rejtinga koji se dobijaju kolaborativnim i pristupom na osnovu sadržaja.

![weighted filtering](images/weighted.webp)

Pomoću parametra `alpha` možemo kontrolisati koliko koji sistem ima udjela u rezultatu.

In [37]:
def weighted_merge(df1: pd.DataFrame, df2: pd.DataFrame, alpha: float, score_col: str, result_size: int) -> pd.DataFrame:
    df1 = df1.rename(columns={score_col: 'score1'})
    df2 = df2.rename(columns={score_col: 'score2'})
    df = pd.merge(df1, df2[['id', 'score2']], on='id')
    df['score'] = alpha * df['score1'] + (1 - alpha) * df['score2']
    df = df.sort_values(by=['score'], ascending=False)
    df = df.drop(columns=['score1', 'score2'])
    return df.head(result_size)

In [38]:
weighted_merge(df1, df2, 0.5, 'rate', 10)[['title', 'score']]

Unnamed: 0,title,score
2638,Maleficent,4.386392
591,For Greater Glory - The True Story of Cristiada,3.974995
776,Three to Tango,3.846533
31,Harry Potter and the Philosopher's Stone,3.815373
2584,Avengers: Age of Ultron,3.762871
1910,Zero Effect,3.376348
2314,Victor Frankenstein,3.307207
1912,Kill the Messenger,3.266657
2378,Deck the Halls,3.171253
770,Ghost Town,3.163511


### 5.2 Miješani (Mixed) sistem

Prvo nađemo preporuke filmova pomoću svakog sistema zasebno, a zatim od tih rezultata uzimamo one najbolje.

![mixed](images/mixed.webp)

In [39]:
def mixed_merge(df1: pd.DataFrame, df2: pd.DataFrame, score_col: str, result_size: int) -> pd.DataFrame:
    df1 = df1.rename(columns={score_col: 'score1'})
    df2 = df2.rename(columns={score_col: 'score2'})
    df = pd.merge(df1, df2[['id', 'score2']], on='id')
    df['score'] = df['score1'] + df['score2']
    df = df.sort_values(by=['score'], ascending=False)
    df = df.drop(columns=['score1', 'score2'])
    return df.head(result_size)

In [40]:
mixed_merge(df1, df2, 'rate', 10)[['title', 'score']]

Unnamed: 0,title,score
2638,Maleficent,8.772783
591,For Greater Glory - The True Story of Cristiada,7.94999
776,Three to Tango,7.693067
31,Harry Potter and the Philosopher's Stone,7.630747
2584,Avengers: Age of Ultron,7.525743
1910,Zero Effect,6.752697
2314,Victor Frankenstein,6.614413
1912,Kill the Messenger,6.533314
2378,Deck the Halls,6.342506
770,Ghost Town,6.327023
