In [3]:
#hide
from movierecommender import movies_metadata as mmd
from movierecommender import users
import pandas as pd

# Movie Recommender

> A movie recommender that produces user-to-movie and movie-to-movie recommendations. The methods rely primarily on embeddings to infer similarity between users and items.

https://hassanhabbak.github.io/movie_recommender/

## Dependencies

`conda create --name <env> --file requirements.txt`

## How to use

## movies_metadata module

Loading the metadata features:

- Removes duplicate rows
- Converts the adult tag on movies to a bool
- Label-encodes genres after cleaning
- Drops invalid ISO language codes
- Extracts numerical features and corrects invalid values
- Bucketizes the decade in which the movie was released
- Flags whether the movie was released recently
- Applies NLP processing to the overview text (TF-IDF + LDA)
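A few of the steps above can be sketched with plain pandas. This is a minimal illustration, not the code inside `mmd.get_movie_features`: the helper name `basic_features`, the `recent_from` threshold, and the toy rows are all assumptions, and the decade is kept as a year (1990, 2010) rather than as an encoded label.

```python
import pandas as pd

# Toy rows standing in for movies_metadata.csv; the values are made up.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "adult": ["False", "True", "True", "False"],
    "release_date": ["1995-10-30", "2016-07-01", "2016-07-01", None],
})

def basic_features(df, recent_from=2010):
    """Hypothetical helper: dedupe, cast `adult`, bucket the release decade."""
    out = df.drop_duplicates(subset="id").copy()
    # The adult tag arrives as text, so compare against "true" to get a bool.
    out["adult"] = out["adult"].astype(str).str.lower().eq("true")
    # Unparseable or missing dates become NaT, so their year is NaN.
    year = pd.to_datetime(out["release_date"], errors="coerce").dt.year
    # Decade bucket, e.g. 1995 -> 1990; rows without a date stay NaN.
    out["decade"] = (year // 10) * 10
    # NaN compares as False, so undated movies are not flagged as recent.
    out["released_recently"] = year >= recent_from
    return out.set_index("id")

features = basic_features(raw)
print(features)
```

Coercing bad dates to `NaT` instead of raising keeps malformed rows in the frame, which matters when the later steps still need their other columns.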

In [4]:
meta_df = pd.read_csv('data/movies_metadata.csv')
movies_df = mmd.get_movie_features(meta_df)
movies_df.head()



| id | adult | vote_count | vote_average | runtime | popularity | decade_label | released_recently | G_0 | G_1 | G_2 | ... | G_10 | G_11 | G_12 | G_13 | G_14 | G_15 | G_16 | G_17 | G_18 | G_19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 862 | False | 5415.0 | 7.7 | 81.0 | 21.946943 | 2.0 | False | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8844 | False | 2413.0 | 6.9 | 104.0 | 17.015539 | 2.0 | False | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 15602 | False | 92.0 | 6.5 | 101.0 | 11.7129 | 2.0 | False | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 31357 | False | 34.0 | 6.1 | 127.0 | 3.859495 | 2.0 | False | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 11862 | False | 173.0 | 5.7 | 106.0 | 8.387519 | 2.0 | False | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## users module

Collaborative filtering usually relies on matrix factorization, which reduces a wide user-item rating matrix to two thin ones: one whose rows capture user similarity and one whose rows capture item similarity. With matrix completion, we can estimate how a user would rate a movie they have not rated yet. However, the dataset of user interactions is large and will keep growing over time, which is why I selected a neural network (NN) approach.

In this approach, a network is built with two embedding layers, one for the user and one for the movie, trained with the movie rating as the target. The weights of both layers are optimized to reduce the error in predicting the rating. The learned embedding vectors then serve as vector representations of users and movies, from which similarity can be computed.
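The idea can be sketched without a deep learning framework. The NumPy example below is an illustration only: it assumes the two embeddings are combined with a dot product plus a global mean and trained with plain SGD on squared error, none of which is confirmed to match what `users.train_nn_user_behaviour` builds. The toy ratings and the embedding size are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy interactions: user label, movie label, rating (made-up values).
users_idx = np.array([0, 0, 1, 1, 2, 2])
movies_idx = np.array([0, 1, 0, 2, 1, 2])
ratings = np.array([5.0, 1.0, 4.5, 1.5, 1.0, 5.0])

n_users, n_movies, dim = 3, 3, 4
user_vec = rng.normal(scale=0.1, size=(n_users, dim))    # user embedding table
movie_vec = rng.normal(scale=0.1, size=(n_movies, dim))  # movie embedding table
bias = ratings.mean()

def predict(u, m):
    # Predicted rating = global mean + dot product of the two embeddings.
    return bias + np.sum(user_vec[u] * movie_vec[m], axis=-1)

def mse():
    return float(np.mean((predict(users_idx, movies_idx) - ratings) ** 2))

loss_before = mse()
lr = 0.05
for epoch in range(500):
    for u, m, r in zip(users_idx, movies_idx, ratings):
        err = predict(u, m) - r
        # Gradients of 0.5 * err**2 with respect to each embedding row.
        grad_u = err * movie_vec[m]
        grad_m = err * user_vec[u]
        user_vec[u] -= lr * grad_u
        movie_vec[m] -= lr * grad_m
loss_after = mse()

print(f"MSE before: {loss_before:.3f}, after: {loss_after:.3f}")
```

After training, each row of `movie_vec` is the kind of vector that can be compared across movies, and each row of `user_vec` plays the same role for users.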

#### Using NN collaborative filtering

In this part, users and movies enter the network as label-encoded inputs, so rating behavior alone drives what the model learns.

In [6]:
ratings_df = pd.read_csv('data/ratings.csv')
df, user_le, movie_le = users.add_labels(ratings_df)
df.head()

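|  | userId | movieId | rating | timestamp | user_label | movie_label |
|---|---|---|---|---|---|---|
| 0 | 1 | 81834 | 5.0 | 1425942133 | 0 | 16196 |
| 1 | 1 | 112552 | 5.0 | 1425941336 | 0 | 23638 |
| 2 | 1 | 98809 | 0.5 | 1425942640 | 0 | 20011 |
| 3 | 1 | 99114 | 4.0 | 1425941667 | 0 | 20089 |
| 4 | 1 | 858 | 5.0 | 1425941523 | 0 | 843 |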
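The label columns exist because an embedding table is indexed by position, so sparse raw ids such as `movieId` need to be mapped to dense integers starting at 0. A minimal sketch of that mapping with NumPy follows; it is an assumption about the general technique, not the body of `users.add_labels`, and the id array reuses a few `movieId` values from the sample rows with one repeated.

```python
import numpy as np

# A handful of raw movie ids, with 858 repeated to show that equal ids
# receive the same label.
movie_ids = np.array([81834, 112552, 98809, 99114, 858, 858])

# np.unique returns the sorted distinct ids and, for every row, the position
# of its id in that sorted array. That position is the dense label.
classes, movie_label = np.unique(movie_ids, return_inverse=True)

print(movie_label)           # dense labels in the range 0..len(classes)-1
print(classes[movie_label])  # inverse transform back to the raw ids
```

Keeping `classes` around is what makes it possible to translate model output back to real movie ids later.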


In [8]:
movie_train, movie_val, movie_test, user_train, \
user_val, user_test, rating_train, rating_val, \
rating_test = users.create_training_data(df.movie_label.values, df.user_label.values, df.rating)
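For orientation, a three-way split of aligned arrays can be sketched as below. The helper name `three_way_split`, the 80/10/10 proportions, and the seed are assumptions for illustration; the fractions that `users.create_training_data` actually uses are not stated here.

```python
import numpy as np

def three_way_split(n_rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Hypothetical helper: shuffled index arrays for train, val and test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(n_rows * test_frac)
    n_val = int(n_rows * val_frac)
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx

# Toy stand-ins for df.movie_label, df.user_label and df.rating.
movie_label = np.arange(100) % 7
user_label = np.arange(100) % 13
rating = np.linspace(0.5, 5.0, 100)

train_idx, val_idx, test_idx = three_way_split(len(rating))

# Indexing every array with the same index set keeps rows aligned.
movie_train = movie_label[train_idx]
user_train = user_label[train_idx]
rating_train = rating[train_idx]
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting by shared indices rather than splitting each array separately guarantees that a given movie, user, and rating triple always lands in the same partition.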

In [None]:
model, history = users.train_nn_user_behaviour(df, movie_train, movie_val, user_train, user_val, rating_train, rating_val)

To extract the embedding layers:

In [None]:
movie_vec = users.extract_weights('movie_vec', model)
user_vec = users.extract_weights('user_vec', model)

And to produce the output, use:

In [None]:
users.get_user_movie_output(model, eval_df, user_le, movie_le)

#### Using content similarity and collaborative embeddings

For this part, each movie is represented as a dimensionally reduced vector of its metadata features. This is combined with the movie embeddings from the previous NN so that the representation captures both content similarity and user-behaviour similarity. Once combined, I apply UMAP to reduce the dimensionality and build a cosine similarity matrix, whose entries range from -1 to 1, for every pair of movies.
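The combination and similarity steps can be sketched with NumPy alone. This example is a simplification under stated assumptions: the two input matrices are random stand-ins, the vectors are combined by simple concatenation, and the UMAP reduction is skipped so the block has no extra dependency.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins: 5 movies, 3 content dimensions, 4 behaviour dimensions.
content_vec = rng.normal(size=(5, 3))    # reduced metadata features
behaviour_vec = rng.normal(size=(5, 4))  # learned movie embeddings

# Assumed combination: concatenate the two views of each movie.
combined = np.hstack([content_vec, behaviour_vec])

# Cosine similarity is the dot product of L2-normalised rows.
norms = np.linalg.norm(combined, axis=1, keepdims=True)
unit = combined / np.clip(norms, 1e-12, None)  # guard against zero vectors
similarity = unit @ unit.T

# Rank the other movies by similarity to movie 0, most similar first.
ranked = np.argsort(-similarity[0])
nearest = [int(i) for i in ranked if i != 0][:2]
print(nearest)
```

Each row of `similarity` can then be sorted to produce movie-to-movie recommendations, with the diagonal excluded because every movie has similarity 1 to itself.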

In [None]:
movie_to_movie_df = get_movie_to_movie_rating(model, movie_le, embedding_df)
movie_to_movie_df.to_csv('output/movie_to_movie.csv', index=False)