# Remember to save this as a new notebook before you begin solving!! 
# Also remember to open the notebook through a virtual env that works well with keras

### This exercise is meant to teach you how to use embedding layers, and how to create a recommendation system. The data we'll use is the data from the Netflix Prize. This exercise should come after you have some experience with NN (not necessarily extensive experience)

### Author: Philip Tannor

#### Data description: The first line of each 'batch of movies' contains the movie_id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:
CustomerID,Rating,Date
MovieIDs range from 1 to 17770 sequentially. 
CustomerIDs range from 1 to 2649429, with gaps. 
There are 480189 users. 
Ratings are on a five star (integral) scale from 1 to 5. 
Dates have the format YYYY-MM-DD.

#### Note that originally the data was stored with one movie per file. This is a new, easier, format created by a kaggler (DLao @ Hong Kong) with a later touch of Ittai Haran.
Since I don't think that arranging the data teaches very much, I left in the basic preprocessing lines that I used. Feel free to delete them and start over - just make sure to split the data into train and test on your own (randomly). 
If you're more hard-working than I am - use the dates in the original data to split in a more realistic manner (with respect to time). Notice there is an apparent leakage, since we create the dictionaries based on all of the data. This is intentional, and also realistic (we can't use these ML techniques for new users or movies).
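
A time-based split could look roughly like the sketch below. It assumes the `date` column is still available (the preprocessing cell in this notebook drops it), and `split_by_date`, the toy frame, and the 60% cutoff are illustrative choices rather than part of the original exercise.

```python
import pandas as pd

def split_by_date(ratings, train_fraction=0.7):
    """Hypothetical helper: the earliest ratings (by date) go to train, the rest to test."""
    ratings = ratings.assign(date=pd.to_datetime(ratings["date"]))
    ordered = ratings.sort_values("date", kind="stable")
    n_train = int(len(ordered) * train_fraction)
    return ordered.iloc[:n_train], ordered.iloc[n_train:]

# Toy data in the same layout as the real file (values are made up)
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "movie_id": [10, 10, 11, 12, 12],
    "rating": [4, 3, 5, 1, 2],
    "date": ["2005-09-06", "2004-01-15", "2005-12-31", "2003-07-01", "2004-11-20"],
})
train_part, test_part = split_by_date(toy, train_fraction=0.6)
print(train_part["date"].max(), test_part["date"].min())
```

Every rating in the train part is then no later than any rating in the test part, which mimics predicting future ratings from past ones.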

In [1]:
import keras
import pandas as pd
import numpy as np
from collections import Counter

import os
from sklearn.model_selection import train_test_split

In [3]:
df = pd.read_csv("resources/combined_data.csv")#, nrows = 100000)
df.columns

Index(['user_id', 'grade', 'date', 'num'], dtype='object')

In [4]:
#we drop the date since this isn't used for the embedding model
df = df.rename(columns = {'num': 'movie_id', 'grade': 'rating'}).drop('date', axis = 1)[['user_id', 'movie_id', 'rating']]
df.tail()

Unnamed: 0,user_id,movie_id,rating
100480502,1790158,17770,4
100480503,1608708,17770,3
100480504,234275,17770,1
100480505,255278,17770,4
100480506,453585,17770,2


In [5]:
print('Dataset shape: {}'.format(df.shape))

Dataset shape: (100480507, 3)


In [6]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('rating', axis = 1), df.rating, test_size = 0.3, random_state = 77)

In [7]:
#this cell creates dictionaries which help with changing the unique ID's into series of int's. 
#This is necessary for embedding layers in keras (so use the transformed columns later on), 
#especially if you only use part of the data.

list_unique = list(set(df.user_id))
transforming_user = {v:k for k,v in zip(range(len(list_unique)), list_unique)}
list_unique = list(set(df.movie_id))
transforming_movie = {v:k for k,v in zip(range(len(list_unique)), list_unique)}

X_train['user_id_transformed'] = X_train['user_id'].apply(lambda x: transforming_user[x])
X_train['movie_id_transformed'] = X_train['movie_id'].apply(lambda x: transforming_movie[x])

X_test['user_id_transformed'] = X_test['user_id'].apply(lambda x: transforming_user[x])
X_test['movie_id_transformed'] = X_test['movie_id'].apply(lambda x: transforming_movie[x])
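
As an alternative sketch (not the approach used in the cell above), `pd.factorize` builds the same kind of ID-to-index mapping in one vectorized call, which tends to be faster than `.apply` with a lambda on a frame of this size. The toy frame and the `_idx` column names are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with non-consecutive IDs (values are made up)
toy = pd.DataFrame({"user_id": [1488844, 822109, 1488844, 30878],
                    "movie_id": [1, 1, 3, 3]})

# factorize returns (codes, uniques): codes are consecutive ints starting at 0
user_codes, user_uniques = pd.factorize(toy["user_id"])
movie_codes, movie_uniques = pd.factorize(toy["movie_id"])

toy["user_idx"] = user_codes
toy["movie_idx"] = movie_codes

# Mapping back from index to the original ID is a positional lookup
recovered_user_ids = user_uniques[toy["user_idx"].to_numpy()]
print(toy)
```

As with the dictionaries above, the factorization would need to run on the full frame before splitting, so that train and test share one index space.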

# OK, now go through the instructions - and then you'll be on your own for a while. 

1. Read about embedding layers here: https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12. The explanation isn't detailed enough, but you should try to think of the layer as a change of representation - from a very sparse one-hot vector (which is equivalent to the ID), to a dense vector of much lower dimension. In the new representation, each element really means something - and sometimes these elements may be translated into a scale which correlates with some intuitive feature (age, connection to Holocaust content, spendable income, etc.). 
2. Pay attention that this function which creates the change of representation (the embedding layer) is learned as the network trains. This means that at the beginning - the elements in the new representation won't mean much, but as the learning progresses the representation will be more and more meaningful.
3. Create a neural network in Keras (use the functional API - *NOT SEQUENTIAL*. This will become important later on). The NN should have 2 different embedding layers - one which gets the user_id as input, and one which gets the movie_id as input. I won't tell you the dimension of the 2 new representations - play around with these 2 numbers. Intuitively - try to think how many numbers you would need to describe a user (same thing for a movie). Notice that 'input_dim' is the size of the vocabulary (and not '1').
4. These embedding layers should be merged, and then flattened. After this, add a few dense layers. If you get the hang of the training and don't want to wait too long, or if you want to compare to my model, load the model in the hidden answers folder using keras.models.load_model. TL;DR - the saved model reaches an MSE of a bit less than 0.72.
5. The output should be a single number - this can be treated as a regression problem, or as an ordinal classification problem. Originally, it was treated as a regression problem by Netflix, so your output should be a floating number between 1 and 5 and you should minimize the MSE.
6. Bonus: treat this as an ordinal classification problem, and use the loss from this paper (squared EMD): https://arxiv.org/abs/1611.05916. This bonus shouldn't be attempted if this is the first time you've dealt with custom losses or ordinal classification. Also, notice there is more work to be done later on in the notebook (which isn't "a bonus")!
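
The "change of representation" idea from items 1 and 2 can be illustrated without Keras. The NumPy sketch below uses toy sizes and a random matrix standing in for learned weights (both are assumptions). It shows that looking up a row by integer ID gives the same result as multiplying the one-hot vector by the embedding matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 5, 3   # toy sizes, not the notebook's
# Stand-in for the learned weights of an embedding layer
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

ids = np.array([2, 0, 4])          # integer IDs in [0, vocab_size)

# What an embedding layer does: pick one row per ID
looked_up = embedding_matrix[ids]

# The equivalent (but wasteful) one-hot formulation
one_hot = np.eye(vocab_size)[ids]          # shape (3, 5)
via_matmul = one_hot @ embedding_matrix    # shape (3, 3)

print(np.allclose(looked_up, via_matmul))
```

This is also why `input_dim` has to be the vocabulary size: it is the number of rows in the lookup table.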

In [8]:
list_unique_user_id = list(set(df.user_id))
print(len(list_unique_user_id))

list_unique_movie_id = list(set(df.movie_id))
print(len(list_unique_movie_id))

480189
17770


In [13]:
from keras.models import Model
from keras.layers import Input, Embedding, Dense, concatenate, Flatten

user_dim = 20
movie_dim = 60

x_user = Input(shape = (1,))
embedding_user = Embedding(input_dim = len(list_unique_user_id), output_dim = user_dim, dtype='float32', input_length=1)(x_user)

x_movie = Input(shape = (1,))
embedding_movie = Embedding(input_dim = len(list_unique_movie_id), output_dim = movie_dim, dtype='float32', input_length=1)(x_movie)

concat = concatenate([embedding_user, embedding_movie], axis = -1)

flatten = Flatten()(concat)
dense1 = Dense(units = (user_dim + movie_dim), activation = 'tanh')(flatten)
dense2 = Dense(units = (user_dim + movie_dim), activation = 'tanh')(dense1)
dense3 = Dense(units = (user_dim + movie_dim), activation = 'tanh')(dense2)
output_layer = Dense(units = 1, activation = 'linear')(dense3)

model = Model(inputs = [x_user, x_movie], outputs = output_layer)
model.compile(optimizer = 'adam', loss = 'mse')
model.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_7 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
input_8 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 1, 20)        9603780     input_7[0][0]                    
__________________________________________________________________________________________________
embedding_5 (Embedding)         (None, 1, 60)        1066200     input_8[0][0]                    
_______________________________________________________________________________________

In [16]:
# validation_data is passed as a tuple (inputs, targets), which is the form Keras documents
model.fit(x=[X_train['user_id_transformed'], X_train['movie_id_transformed']], y=y_train, 
         validation_data=([X_test['user_id_transformed'], X_test['movie_id_transformed']], y_test),
         epochs=10, verbose=2, batch_size = 1000)

Epoch 1/10


KeyboardInterrupt: 

### Great! If you've reached here, you managed to create a neural network which uses embeddings and can predict movie ratings using only very basic information (unique ID's). 
Now we want to turn this into a recommendation system. Write a function which gets one specific user_id as an input, and outputs the names of the 10 movies with the highest expected ratings. Read the file named movie_titles.csv to connect the movie id's to the names of the movies.

In [None]:
movies_df = pd.read_csv('resources/data/movie_titles.csv', header = None)#, names = ['movie_id', 'year', 'name'])
movies_dict = {v: k for v, k in zip(movies_df[0], movies_df[2])}
untransforming_movie = {v: k for k, v in transforming_movie.items()}  # .items() replaces Python 2's .iteritems()

In [None]:
def recommend_10(user):
    recommend_df = pd.DataFrame(list(set(X_test.movie_id_transformed)), columns = ['movie_id_transformed'])
    recommend_df['user_id_transformed'] = [transforming_user[user]] * len(recommend_df)
    # predict returns shape (n, 1); ravel() flattens it into a 1-D column
    recommend_df['predictions'] = model.predict([recommend_df['user_id_transformed'], recommend_df['movie_id_transformed']]).ravel()
    top_10 = recommend_df.sort_values('predictions', ascending = False).head(10)
    top_10['movie_id'] = top_10.movie_id_transformed.apply(lambda x: untransforming_movie[x])
    return top_10.movie_id.apply(lambda x: movies_dict[x])

In [None]:
recommend_10(2242835)

### So is the system we built any good? We can check this at least somewhat by checking if the embeddings are good. This is how we'll do it: 
1. Extract the embeddings of the different movies from the network. How, you may ask? I hope you remembered to use the functional api. This will allow you to create a new model - using the trained layers that you already used. Create a new Model where the input is the movie_id, and the output is the movie embedding layer (before the merge).
2. Do not train this model. Only use the .predict of this model, and the output should be the embedding vectors which represent the different movie id's. Notice you may have to reshape the matrix of the predictions for the next steps.
3. Now use sklearn.cluster.KMeans to cluster these vectors. Use k=1000. Make sure to save the cluster number for each movie.
4. Check manually if the clusters make sense (can you find connections between movies in the same cluster?). 
5. Try to visualize the clustering by looking at only some of the clusters at a time. You can use PCA with n_components = 2 to help you visualize (use a different color for each cluster by using 'c = ...' in plt.scatter).

In [None]:
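# movies_for_pred is not defined earlier in the notebook; assumed here to be every transformed movie index
movies_for_pred = np.arange(len(transforming_movie))
model_embedding = Model(inputs = [x_movie], outputs = [embedding_movie])
# predict returns shape (n_movies, 1, movie_dim); collapse it to (n_movies, movie_dim)
vectors = model_embedding.predict(movies_for_pred).reshape(-1, movie_dim)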

In [None]:
from sklearn.cluster import KMeans

clf = KMeans(n_clusters=1000)
clusters = clf.fit_predict(vectors)

In [None]:
movie_id_transformed_to_cluster = dict(zip(movies_for_pred, clusters))
clusters_to_take = range(10)

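from sklearn.decomposition import PCA

# boolean mask selecting the movies whose cluster is one of the chosen ones
# (indexing with map(...) only worked in Python 2)
chosen_mask = np.isin(clusters, list(clusters_to_take))
vectors_chosen = vectors[chosen_mask]
pca = PCA(n_components = 2)
pcaed_vectored = pca.fit_transform(vectors_chosen)

In [None]:
import matplotlib.pyplot as plt

# color each point by its cluster label
plt.scatter(pcaed_vectored[:,0], pcaed_vectored[:,1], c=clusters[chosen_mask])
plt.show()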
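
For the manual check in step 4, it may help to list the titles in each cluster side by side. The sketch below uses toy dictionaries as stand-ins. In the notebook, the equivalent inputs would be built from `movie_id_transformed_to_cluster`, `untransforming_movie` and `movies_dict`. The helper name `titles_by_cluster` is hypothetical.

```python
from collections import defaultdict

# Toy stand-ins (assumptions): transformed movie index -> cluster, and -> title
movie_to_cluster = {0: 2, 1: 0, 2: 2, 3: 1, 4: 0}
movie_to_title = {0: "Movie A", 1: "Movie B", 2: "Movie C", 3: "Movie D", 4: "Movie E"}

def titles_by_cluster(movie_to_cluster, movie_to_title):
    """Group movie titles by the cluster their embedding was assigned to."""
    grouped = defaultdict(list)
    for movie, cluster in movie_to_cluster.items():
        grouped[cluster].append(movie_to_title[movie])
    return dict(grouped)

grouped_titles = titles_by_cluster(movie_to_cluster, movie_to_title)
for cluster in sorted(grouped_titles):
    print(cluster, grouped_titles[cluster])
```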

### Great! Now just one little sophistication. 
Write another function which gives me the top 10 movies while no 2 of them are from the same cluster. This can be used for a more sophisticated type of recommendation system.

This will be more exciting if you go back and reduce the number of clusters to not much more than 10 (and check how the previous system you created reacts).

In [None]:
def recommend_10_from_different_clusters(user):
    recommend_df = pd.DataFrame(list(set(X_test.movie_id_transformed)), columns = ['movie_id_transformed'])
    recommend_df['user_id_transformed'] = [transforming_user[user]] * len(recommend_df)
    # predict returns shape (n, 1); ravel() flattens it into a 1-D column
    recommend_df['predictions'] = model.predict([recommend_df['user_id_transformed'], recommend_df['movie_id_transformed']]).ravel()
    recommend_df['cluster'] = recommend_df.movie_id_transformed.apply(lambda x: movie_id_transformed_to_cluster[x])
    top_10 = recommend_df.sort_values('predictions', ascending = False).groupby('cluster').head(1).head(10)
    top_10['movie_id'] = top_10.movie_id_transformed.apply(lambda x: untransforming_movie[x])
    return top_10.movie_id.apply(lambda x: movies_dict[x])

In [None]:
recommend_10_from_different_clusters(2242835)
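
The `sort_values(...).groupby('cluster').head(1)` step in the function above is easier to see on a small example. The toy frame below (made-up titles, clusters and predictions) compares a plain top 3 with a top 3 that allows only one movie per cluster.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "cluster": [0, 0, 1, 2, 1, 2],
    "predictions": [4.9, 4.8, 4.7, 4.1, 3.9, 3.5],
})

ranked = toy.sort_values("predictions", ascending=False)

# Plain top 3: may contain several movies from the same cluster
plain_top_3 = ranked.head(3)

# Diverse top 3: head(1) keeps the best-ranked row of each cluster,
# preserving the sorted order, then head(3) takes the first three
diverse_top_3 = ranked.groupby("cluster").head(1).head(3)

print(list(plain_top_3["title"]))
print(list(diverse_top_3["title"]))
```

Because `groupby(...).head(1)` preserves the row order of the sorted frame, the result is still ordered by predicted rating.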

# These cells should be used for telling yourself when your code finishes running

In [9]:
import datetime
from MMMUtils import *

In [10]:
beep()

body = 'Yo Phil, my man - your code (embedding layer exercise) finished running at: ' + str(datetime.datetime.now()) \
    + '\n\n\nThis was an automated email'
send_email('ptannor@gmail.com', body = body)