# Testing Data and Models
In this notebook the data is loaded and preprocessed, and the model is built from scratch.

In [None]:
# Import packages
import pandas as pd

# graphics
import matplotlib.pyplot as plt
import seaborn as sns

# text processing
import re
from datetime import datetime
import numpy as np

# for the model
import tensorflow as tf
from keras import layers, models

In [None]:
# Movies: one row per movie (movieId, title, genres)
raw_movies = pd.read_csv('../data/raw/ml-latest-small/movies.csv', encoding='utf-8')
# Users' ratings
ratings_raw = pd.read_csv('../data/raw/ml-latest-small/ratings.csv', encoding='utf-8')
# Tags: free-text tags applied by users to movies
tags_raw = pd.read_csv('../data/raw/ml-latest-small/tags.csv', encoding='utf-8')

## Exploring the Data
Here I lay out several hypotheses, each of which suggests a new feature:
+ Encode each genre and assign a weight to it. Maybe some genres have higher ratings overall.

+ Assign a weight to each movie based on the number of views it has. Maybe the most viewed movies have a higher chance of receiving a high rating from a new user.

+ Maybe the ratings given by a user should be weighted by the number of movies that user has watched.
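
The three hypotheses above could be turned into candidate features roughly as follows. This is only a minimal sketch on a tiny made-up sample: the toy frames and the names `genre_mean`, `movie_views` and `user_activity` are illustrative assumptions, and the number of ratings is used as a stand-in for the number of views.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (made-up values, same column names)
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Comedy|Drama", "Drama", "Comedy"],
})
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [1, 2, 1, 3, 1],
    "rating":  [4.0, 5.0, 3.0, 2.0, 5.0],
})

# Hypothesis 1: mean rating per genre (one row per rating/genre pair)
exploded = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")
genre_mean = (
    ratings.merge(exploded[["movieId", "genre"]], on="movieId")
    .groupby("genre")["rating"]
    .mean()
)

# Hypothesis 2: number of ratings per movie, used as a proxy for views
movie_views = ratings.groupby("movieId")["rating"].count()

# Hypothesis 3: number of distinct movies rated by each user
user_activity = ratings.groupby("userId")["movieId"].nunique()

print(genre_mean, movie_views, user_activity, sep="\n\n")
```

Each of these series can then be merged back onto the ratings table as an extra column, or used as a sample weight during training.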

In [None]:
# Exploring the data
# Try to find an encoding for each genre and assign a weight to it.
#   Then predict the rating based on the weights of the genres

# Assign a weight to each movie that represents how many users viewed it.
#   Maybe the most viewed movies have a better chance of being a great recommendation
#   for new users

# The more movies a user watches, the more valuable their ratings, maybe.

In [None]:
# Extract the year from the name of the movie (if it exists)
# dtf_products["name"] = 
#   dtf_products["title"].apply(lambda x: re.sub("[\(\[].*?[\)\]]", "", x).strip())

# For the Movie Dataframe:

# copy raw dataframe:
pro_movies = raw_movies.copy()

# Extract the year from the movie title (if it is between parentheses at the end)
pro_movies['Year'] = (
    pro_movies['title']
    .str.extract(r"\((\d{4})[^()]*\)\s*$", expand=False)  # for a range like (2006-2010), the first year is taken
    .astype(float)                                         # titles without a year get NaN
)

# Remove everything between parentheses or brackets from the title
pro_movies['title'] = pro_movies['title'].apply(lambda x: re.sub(r"[\(\[].*?[\)\]]", "", x).strip())


# copy raw ratings data:
pro_ratings = ratings_raw.copy()

# Add a date column (the timestamp is in seconds since the Unix epoch)
pro_ratings['Date'] = pd.to_datetime(pro_ratings['timestamp'], unit='s')


In [None]:
# One-hot encoding of genres: 

# First get a list of lists (one list of genres per unique genre combination)
aux = [i.split('|') for i in pro_movies.genres.unique()]
# Then build a sorted list of unique genres, dropping the "(no genres listed)" placeholder
vocab = sorted({i for k in aux for i in k} - {'(no genres listed)'})
print("Genres present in the dataset: ", vocab)

# Now, create a column for each genre (exact match on the split genre list):
for genre in vocab:
    pro_movies[genre] = pro_movies.genres.apply(lambda x: 1 if genre in x.split('|') else 0)

pro_movies

One-hot encoding is a poor fit in this case: as the heatmap below shows, the resulting matrix is very sparse. This means the model will have many inputs that receive a 0 most of the time.
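
To put a number on "very sparse", one option is to measure the fraction of zero entries in the one-hot matrix. The sketch below uses a small made-up matrix, not the real genre columns; the name `onehot` is an assumption for illustration.

```python
import numpy as np

# Toy one-hot genre matrix: 4 movies x 5 genres (made-up values)
onehot = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
])

# For a 0/1 matrix the mean is the fraction of ones,
# so one minus the mean is the fraction of zeros
sparsity = 1.0 - onehot.mean()
print(f"Fraction of zero entries: {sparsity:.2f}")
```

On the real data the same measure would presumably be `1 - pro_movies[vocab].to_numpy().mean()`.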

In [None]:
# Visualize the one-hot genre columns (mostly zeros)
sns.heatmap(pro_movies[vocab])

## Feature Engineering
There are several approaches to movie recommendation:
+ Content-based: recommend movies similar to the ones the user liked.
+ Collaborative filtering: if two users have rated some movies similarly, a movie that one of them has watched and rated highly is a good recommendation for the other, provided they haven't watched it already.

These two approaches are combined with the context of each movie (the year and the genres, in this case).
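
As an illustration of the content-based idea, movies can be compared through the cosine similarity of their genre vectors. This is a minimal sketch on made-up vectors, not the method used later in this notebook; `cosine_similarity_matrix` is a hypothetical helper written for this example.

```python
import numpy as np

# Toy genre vectors (made-up): columns are [Action, Comedy, Drama]
genre_vectors = np.array([
    [1, 0, 0],   # movie 0
    [1, 0, 1],   # movie 1
    [0, 1, 0],   # movie 2
], dtype=float)

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.where(norms == 0, 1.0, norms)   # avoid division by zero
    return unit @ unit.T

sim = cosine_similarity_matrix(genre_vectors)

# Most similar movie to movie 0, excluding movie 0 itself
scores = sim[0].copy()
scores[0] = -np.inf
most_similar = int(np.argmax(scores))
print(most_similar)
```

Collaborative filtering, by contrast, ignores the genre vectors and learns a vector per user and per movie from the ratings alone.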

In [None]:
# User-movie rating matrix, even more sparse (yikes):
tmp = pro_ratings.copy()
dtf_users = tmp.pivot_table(index="userId", columns="movieId", values="rating")
# Add a NaN column for every movie without ratings (compare movie IDs, not the row index)
missing_cols = sorted(set(pro_movies['movieId']) - set(dtf_users.columns))
dtf_users = dtf_users.reindex(columns=sorted(list(dtf_users.columns) + missing_cols))

print(dtf_users.shape[0], dtf_users.shape[1])

In [None]:
dtf_users

In [None]:
# get the number of users and movies
users_n, movies_n = len(pro_ratings.userId.unique()), len(pro_movies.movieId.unique())
print(users_n, movies_n)
print(len(raw_movies.movieId.unique()))
# define the output dimension of the embedding layer:
e_size = 75


# NOW the layers:

# Users:
users_input = layers.Input(name = "user_input", shape = (1,))
users_embedd = layers.Embedding(users_n, e_size)(users_input)
# Is this necessary? The Reshape drops the length-1 axis: (batch, 1, e_size) -> (batch, e_size)
users_final = layers.Reshape(name = "users", target_shape=(e_size, ))(users_embedd)
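
One thing to watch before feeding data to these layers: an `Embedding` layer with `input_dim = n` expects integer indices in the range `[0, n)`, while raw `userId` and `movieId` values need not be zero-based or contiguous. A minimal sketch of one way to remap them, using `pd.factorize` on made-up rows (the column names `user_idx` and `movie_idx` are assumptions for illustration):

```python
import pandas as pd

# Toy ratings with non-contiguous IDs (made-up values)
ratings = pd.DataFrame({
    "userId":  [1, 1, 7, 42],
    "movieId": [10, 500, 10, 193609],
    "rating":  [4.0, 3.5, 5.0, 2.0],
})

# factorize assigns 0, 1, 2, ... in order of first appearance
# and returns the original IDs so the mapping can be reversed later
ratings["user_idx"], user_ids = pd.factorize(ratings["userId"])
ratings["movie_idx"], movie_ids = pd.factorize(ratings["movieId"])

# These counts are safe to use as input_dim for the embeddings
n_users, n_movies = len(user_ids), len(movie_ids)
print(ratings)
print(n_users, n_movies)
```

With this mapping, `user_ids[k]` gives back the original `userId` for embedding row `k`.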

In [None]:
# Create the Model
embeddings_size = 50
usr, prd = dtf_users.shape[0], dtf_users.shape[1]

# Users (1,embedding_size)
xusers_in = layers.Input(name="xusers_in", shape=(1,))
xusers_emb = layers.Embedding(name="xusers_emb", input_dim=usr, output_dim=embeddings_size)(xusers_in)
xusers = layers.Reshape(name='xusers', target_shape=(embeddings_size,))(xusers_emb)

# Products (1,embedding_size)
xproducts_in = layers.Input(name="xproducts_in", shape=(1,))
xproducts_emb = layers.Embedding(name="xproducts_emb", input_dim=prd, output_dim=embeddings_size)(xproducts_in)
xproducts = layers.Reshape(name='xproducts', target_shape=(embeddings_size,))(xproducts_emb)

# Cosine similarity between the user and product vectors (1)
xx = layers.Dot(name='xx', normalize=True, axes=1)([xusers, xproducts])

# Predict ratings (1)
y_out = layers.Dense(name="y_out", units=1, activation='linear')(xx)

# Compile
model = models.Model(inputs=[xusers_in,xproducts_in], outputs=y_out, name="CollaborativeFiltering")
model.compile(optimizer='adam', loss='mean_absolute_error', metrics=['mean_absolute_percentage_error'])
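
To make the last two layers concrete: with `normalize=True`, the `Dot` layer L2-normalizes both vectors before taking the dot product, so its output is their cosine similarity, and the single linear `Dense` unit then rescales that value. The NumPy sketch below mirrors this computation on random vectors. The weights `w` and `b` are hypothetical (in the model they are learned), picked here so that the output covers the 0.5 to 5.0 rating scale, and Keras may handle the zero-norm case slightly differently.

```python
import numpy as np

rng = np.random.default_rng(0)
user_vec = rng.normal(size=(4, 50))    # batch of 4 user embeddings
movie_vec = rng.normal(size=(4, 50))   # batch of 4 movie embeddings

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit length (eps guards against zero rows)."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

# Row-wise dot product of normalized vectors = cosine similarity, shape (4, 1)
cosine = np.sum(l2_normalize(user_vec) * l2_normalize(movie_vec), axis=1, keepdims=True)

# Dense layer with one linear unit: y = w * cosine + b (hypothetical weights)
w, b = 2.25, 2.75
y_pred = w * cosine + b
print(y_pred.ravel())
```

Because the cosine lies in [-1, 1], a single linear unit can only stretch and shift that interval, which bounds the range of ratings the model can predict.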