### Movie Recommender with matrix factorization by Irinej Slapal

First we import the necessary libraries (we will use NumPy and pandas) and
read the data from the CSV files into pandas dataframes.

In [2]:
import numpy as np
import pandas as pd
from datetime import datetime
import csv
import collections

In [3]:

df_ratings = pd.read_csv('data/ml-latest-small/ratings.csv')
df_movies = pd.read_csv('data/ml-latest-small/movies.csv')
df_links = pd.read_csv('data/ml-latest-small/links.csv')
df_cast = pd.read_csv('data/ml-latest-small/cast.csv')
df_tags = pd.read_csv('data/ml-latest-small/tags.csv')


TODO:
- matrix factorization algorithm
- different search and recommendation options
- simple console ui
- users backend (new users, new ratings, users adding new tags, and so on)
- frontend possibly


Now, for the matrix factorization to work, we need to form a matrix X (users × movies). Each row is a user, each column is a movie, and each entry holds the rating that user gave that movie (0 if the user has not rated it).
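
To make the shape of X concrete, here is a minimal sketch on made-up data. The `toy_ratings` frame and its values are assumptions for illustration only; the column names mirror those of `ratings.csv`.

```python
import pandas as pd

# Toy ratings in the same long format as ratings.csv (values are made up)
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0],
})

# One row per user, one column per movie; unrated pairs become 0
toy_X = toy_ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)
print(toy_X)
```

Note that `pivot` raises an error if a user has rated the same movie twice, so this only works when each (userId, movieId) pair appears once.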


In [4]:

print(df_ratings.groupby('movieId')['rating'].describe())

         count      mean       std  min   25%  50%  75%  max
movieId                                                     
1        247.0  3.872470  0.958981  1.0  3.00  4.0  5.0  5.0
2        107.0  3.401869  0.880714  1.5  3.00  3.0  4.0  5.0
3         59.0  3.161017  1.150115  0.5  2.25  3.0  4.0  5.0
4         13.0  2.384615  0.938835  1.0  1.50  3.0  3.0  3.5
5         56.0  3.267857  0.948512  1.0  3.00  3.0  4.0  5.0
...        ...       ...       ...  ...   ...  ...  ...  ...
161944     1.0  5.000000       NaN  5.0  5.00  5.0  5.0  5.0
162376     1.0  4.500000       NaN  4.5  4.50  4.5  4.5  4.5
162542     1.0  5.000000       NaN  5.0  5.00  5.0  5.0  5.0
162672     1.0  3.000000       NaN  3.0  3.00  3.0  3.0  3.0
163949     1.0  5.000000       NaN  5.0  5.00  5.0  5.0  5.0

[9066 rows x 8 columns]


In [12]:
#what tables we need from data
# - movie genres                                                ×
# - avg rating for each movie                                   ×
# - avg rating for each genre
# - rating count for each movie                                 ×
# - what movies user has rated and the ratings for each user    ×
# - how many movies has each user rated                         ×
# - movie references (everything connected to one movie id)

#matrix x for users and movies, used for recommendation system
df_X = df_ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)
np_X = df_X.to_numpy()


#other matrices for features like recommending by genre, cast, etc.
#set of movie genres
movie_genres = {genre.lower()
                    for genres in df_movies['genres']
                    for genre in genres.split('|')}


#avg rating for each movie: movieId -> [mean rating, rating count]
avgMovieRating = {movieId: [group['rating'].mean(), len(group)]
                    for movieId, group in df_ratings.groupby('movieId')}


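#avg rating for each genre: genre -> [mean rating, rating count]
df_genre_ratings = pd.merge(df_ratings, df_movies[['movieId', 'genres']], on='movieId')
df_genre_ratings['genres'] = df_genre_ratings['genres'].str.lower().str.split('|')
df_genre_ratings = df_genre_ratings.explode('genres')  #one row per (rating, genre) pair
avgGenreRating = {genre: [group['rating'].mean(), len(group)]
                    for genre, group in df_genre_ratings.groupby('genres')}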


#movie references movieId: (tags, title, cast)
list_movie_tag = list(zip(df_tags['movieId'], df_tags['tag']))

#collect every tag of a movie in a list (a plain dict would keep only the last tag)
movie_tags = collections.defaultdict(list)
for movieId, tag in list_movie_tag:
    movie_tags[movieId].append(tag)

#merge once and reuse the result
movie_cast = pd.merge(df_movies, df_cast, on='movieId')

movie_references = {movieId: (group['title'].values, group['cast'].values)
                    for movieId, group in movie_cast.groupby('movieId')}

#prepend the tags, skipping tagged movies that have no cast entry
for movieId, tags in movie_tags.items():
    if movieId in movie_references:
        title, cast = movie_references[movieId]
        movie_references[movieId] = (tags, title, cast)





[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [4. 0. 0. ... 0. 0. 0.]
 [5. 0. 0. ... 0. 0. 0.]]


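The idea behind matrix factorization is to find two smaller matrices whose product approximates the main matrix, which is why we call it matrix factorization.
Now, since we have a lot of missing values in the initial matrix, we fit the two matrices only to the ratings we actually have and use their product to estimate the rest.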
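
As a rough illustration of the shapes involved, the sketch below builds two random factor matrices and measures how far their product is from a toy rating matrix, counting only the entries that carry a rating. The sizes, the toy matrix, and the names `X_hat` and `mask` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, K = 4, 5, 2   # K latent features (illustrative sizes)

H = rng.random((n_users, K))     # user-feature matrix
W = rng.random((K, n_movies))    # feature-movie matrix
X_hat = H @ W                    # predicted ratings, shape (n_users, n_movies)

# Toy rating matrix, 0 = not rated
X = np.array([[5, 3, 0, 1, 0],
              [4, 0, 0, 1, 0],
              [1, 1, 0, 5, 0],
              [0, 1, 5, 4, 0]], dtype=float)

# Compare prediction and truth only where a rating exists
mask = X > 0
squared_error = np.sum((X[mask] - X_hat[mask]) ** 2)
print("observed ratings:", mask.sum(), "squared error:", squared_error)
```

The two factors hold `n_users * K + K * n_movies` numbers instead of `n_users * n_movies`, which is where the "smaller matrices" saving comes from when K is small.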


In [7]:

def funkSVD(X, H, W, K):
    """Factorize X (users x movies) into H (users x K) and W (K x movies).

    Only entries with X[i, j] > 0 are treated as known ratings.
    H and W are updated in place and also returned.
    """
    steps = 1000   # maximum number of passes over the data
    lr = 0.0002    # learning rate
    rp = 0.02      # regularization parameter

    rows = X.shape[0]
    columns = X.shape[1]
    progressIndicator = 0

    for step in range(steps):
        #indicate progress
        if step % (steps/100) == 0:
            progressIndicator += 1
            print("\rCalculation progress: ", progressIndicator, "%", end="")

        for i in range(rows):
            for j in range(columns):
                if X[i, j] > 0:
                    # calculate error
                    error = X[i, j] - np.dot(H[i,:],W[:,j])

                    # gradient step on both factors
                    for k in range(K):
                        H[i, k] = H[i, k] + lr * (2 * error * W[k, j] - rp * H[i, k])
                        W[k, j] = W[k, j] + lr * (2 * error * H[i, k] - rp * W[k, j])

        # regularized squared error over the known ratings
        e = 0
        for i in range(rows):
            for j in range(columns):
                if X[i, j] > 0:
                    e = e + pow(X[i, j] - np.dot(H[i,:],W[:,j]), 2)
                    for k in range(K):
                        e = e + (rp/2) * (pow(H[i, k],2) + pow(W[k, j],2))
        if e < 0.001:
            break
    return H, W
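
For a quick check of this kind of update rule without waiting on the full rating matrix, here is a self-contained sketch on a small made-up matrix. The helper `sgd_factorize`, its default hyperparameters, and the toy data are hypothetical and chosen only so the example finishes quickly. Unlike the function above, it loops over the observed entries only and uses the pre-update factor values for both gradients.

```python
import numpy as np

def sgd_factorize(X, K=2, steps=2000, lr=0.01, rp=0.02, seed=0):
    """Hypothetical helper: SGD over the observed (non-zero) entries of X."""
    rng = np.random.default_rng(seed)
    H = rng.random((X.shape[0], K))        # user-feature matrix
    W = rng.random((K, X.shape[1]))        # feature-movie matrix
    observed = np.argwhere(X > 0)          # (i, j) pairs that carry a rating
    for _ in range(steps):
        for i, j in observed:
            error = X[i, j] - H[i, :] @ W[:, j]
            h_old = H[i, :].copy()         # pre-update values for both gradients
            H[i, :] += lr * (2 * error * W[:, j] - rp * h_old)
            W[:, j] += lr * (2 * error * h_old - rp * W[:, j])
    return H, W

# Toy rating matrix (made up), 0 = not rated
X = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)

H, W = sgd_factorize(X)
X_hat = H @ W                              # filled-in rating matrix
mask = X > 0
rmse = np.sqrt(np.mean((X[mask] - X_hat[mask]) ** 2))
print(np.round(X_hat, 2))
print("RMSE on observed ratings:", round(rmse, 3))
```

The entries of `X_hat` at positions where `X` is 0 are the predicted ratings a recommender would rank to suggest unseen movies.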






      movieId                                              title  \
0           1                                   Toy Story (1995)   
1           2                                     Jumanji (1995)   
2           3                            Grumpier Old Men (1995)   
3           4                           Waiting to Exhale (1995)   
4           5                 Father of the Bride Part II (1995)   
...       ...                                                ...   
9120   162672                                Mohenjo Daro (2016)   
9121   163056                               Shin Godzilla (2016)   
9122   163949  The Beatles: Eight Days a Week - The Touring Y...   
9123   164977                           The Gay Desperado (1936)   
9124   164979                              Women of '69, Unboxed   

                                           genres  \
0     Adventure|Animation|Children|Comedy|Fantasy   
1                      Adventure|Children|Fantasy   
2                       