Given a user ratings dataset and a movie dataset, build an autoencoder for a recommender system

In [None]:
pip install tensorflow


Using an autoencoder for a recommender system

In [2]:
import pandas as pd 
import numpy as np 

In [3]:
df = pd.read_csv('ratings.csv')
mov = pd.read_csv('movies.csv')

In [5]:
df.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523
5,1,110,4.0,1217896150
6,1,150,3.0,1217895940
7,1,161,4.0,1217897864
8,1,165,3.0,1217897135
9,1,204,0.5,1217895786


In [6]:
mov.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


The idea is to build a user-item collaborative matrix, use an autoencoder to compress each movie's row of ratings into a feature vector, and then measure the similarity between movies using those vectors.

In [7]:
df.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [8]:
df.shape

(105339, 4)

In [9]:
mov.shape

(10329, 3)

In [10]:
colab = df.pivot(index='movieId',columns='userId',values='rating').fillna(0)
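To make the pivot step concrete, here is a minimal sketch on a tiny made-up ratings table (the values are illustrative, not taken from ratings.csv). It shows how long-format rows become one row per movie and one column per user, with unrated pairs filled with 0.

```python
import pandas as pd

# Toy long-format ratings: made-up values, same column names as ratings.csv
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0],
})

# One row per movie, one column per user; missing (movie, user) pairs become 0
toy_matrix = toy.pivot(index="movieId", columns="userId", values="rating").fillna(0)
print(toy_matrix)
```

Note that `pivot` raises a ValueError if the same (movieId, userId) pair appears more than once; in that case `pivot_table` with an aggregation function is one possible alternative.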

In [11]:
colab.head(3)

movieId \ userId,1,2,3,4,5,6,7,8,9,10,...,659,660,661,662,663,664,665,666,667,668
1,0.0,5.0,0.0,0.0,4.0,0.0,0.0,5.0,0.0,0.0,...,0.0,0.0,4.0,5.0,3.0,0.0,0.0,0.0,0.0,3.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,3.0
3,0.0,2.0,0.0,0.0,0.0,0.0,0.0,4.0,3.0,0.0,...,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0


In [12]:
colab.shape

(10325, 668)

colab looks sparse; let's check how much of the matrix is actually filled

In [14]:
# percentage of non-zero entries (the complement of sparsity)
(colab > 0.0).sum().sum() / (colab.shape[0] * colab.shape[1]) *100

1.5272940801206305

Only about 1.5 percent of the entries are non-zero, so the matrix is roughly 98.5 percent sparse.
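The same fill-rate calculation can be sketched with NumPy alone. The helper name `density_percent` and the toy array are assumptions made for illustration; it counts non-zero entries, which matches the `> 0.0` check above as long as ratings are never negative.

```python
import numpy as np

def density_percent(matrix):
    """Percentage of entries that are non-zero (hypothetical helper)."""
    arr = np.asarray(matrix)
    return np.count_nonzero(arr) / arr.size * 100

# Toy 2 x 4 matrix with 2 non-zero entries out of 8
toy = np.array([[0.0, 5.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 3.0]])
toy_density = density_percent(toy)
print(toy_density)            # share of filled entries, in percent
print(100 - toy_density)      # sparsity, in percent
```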

In [15]:
from sklearn.model_selection import train_test_split 

train, test = train_test_split(colab, test_size=0.25, random_state=42) 
# remember: there is no separate target here; the autoencoder reconstructs its own input


In [16]:
print(train.shape) 
print(test.shape) 


(7743, 668)
(2582, 668)


Building the autoencoder with a neural network

In [17]:
from tensorflow.keras.layers import Input, Flatten, Dense

from tensorflow.keras.models import Model

In [19]:
inp = Input(shape=(668,))
# encoder stack
e1 = Dense(512, activation='relu')(inp)
e2 = Dense(256, activation='relu')(e1)
e3 = Dense(128, activation='relu')(e2)
e4 = Dense(64, activation='relu')(e3)

# decoder stack
d1 = Dense(128, activation='relu')(e4)
d2 = Dense(256, activation='relu')(d1)
d3 = Dense(512, activation='relu')(d2)
d4 = Dense(668, activation='relu')(d3)

model = Model(inp, d4)

In [20]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 668)]             0         
                                                                 
 dense (Dense)               (None, 512)               342528    
                                                                 
 dense_1 (Dense)             (None, 256)               131328    
                                                                 
 dense_2 (Dense)             (None, 128)               32896     
                                                                 
 dense_3 (Dense)             (None, 64)                8256      
                                                                 
 dense_4 (Dense)             (None, 128)               8320      
                                                                 
 dense_5 (Dense)             (None, 256)               33024 
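The Param # column in the summary can be checked by hand: a Dense layer has one weight per input-output pair plus one bias per output unit. The sketch below recomputes the counts for the layers listed above; `dense_params` is a hypothetical helper, not part of Keras.

```python
def dense_params(n_in, n_out):
    # weights (n_in * n_out) plus one bias per output unit
    return n_in * n_out + n_out

# Layer widths from the input (668) through dense_5 (256)
sizes = [668, 512, 256, 128, 64, 128, 256]
counts = [dense_params(a, b) for a, b in zip(sizes, sizes[1:])]
print(counts)
```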

In [21]:
# compile model
model.compile(optimizer='adam', loss='mse') 

# fit model     
model.fit(train, train, epochs=90, batch_size=256, validation_data=(test, test))

Epoch 1/90
...
Epoch 84/90
...

<keras.src.callbacks.History at 0x27e414fa4d0>

In [22]:
# the per-epoch MSE could be plotted from the History object that model.fit() returns and used to tune training

# let's focus on getting the bottleneck embedding
model.layers

[<keras.src.engine.input_layer.InputLayer at 0x27e403dda50>,
 <keras.src.layers.core.dense.Dense at 0x27e403df8b0>,
 <keras.src.layers.core.dense.Dense at 0x27e403ddb40>,
 <keras.src.layers.core.dense.Dense at 0x27e414904f0>,
 <keras.src.layers.core.dense.Dense at 0x27e41491330>,
 <keras.src.layers.core.dense.Dense at 0x27e41491f00>,
 <keras.src.layers.core.dense.Dense at 0x27e41492770>,
 <keras.src.layers.core.dense.Dense at 0x27e41493340>,
 <keras.src.layers.core.dense.Dense at 0x27e41492a10>]

In [26]:
embedd_layer = model.layers[4].output
embedd_layer

<KerasTensor: shape=(None, 64) dtype=float32 (created by layer 'dense_3')>

In [27]:
model.input

<KerasTensor: shape=(None, 668) dtype=float32 (created by layer 'input_1')>

In [28]:
embed_model = Model(model.input, embedd_layer)

In [30]:
test_embed = embed_model.predict(test)



In [33]:
total_embed = embed_model.predict(colab.values)



In [34]:
total_embed.shape

(10325, 64)

The next step amounts to computing the similarity of each movie to every other movie.

In [35]:
# let's build similarity matrix 
from sklearn.metrics.pairwise import cosine_similarity
sim_mat = cosine_similarity(total_embed)
sim_mat.shape

(10325, 10325)
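For intuition about what the similarity matrix contains, here is a minimal NumPy sketch of pairwise cosine similarity on made-up 2-dimensional vectors. The function name `cosine_sim_matrix` and the choice to give all-zero rows a similarity of 0 are assumptions of this sketch, not a description of scikit-learn's internals.

```python
import numpy as np

def cosine_sim_matrix(emb):
    """Pairwise cosine similarity between the rows of emb (illustrative sketch)."""
    emb = np.asarray(emb, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for all-zero rows
    unit = emb / norms               # scale every row to unit length
    return unit @ unit.T             # dot products of unit vectors = cosines

# Rows 0 and 1 point the same way; row 2 is orthogonal to both
toy = np.array([[1.0, 0.0],
                [2.0, 0.0],
                [0.0, 3.0]])
sim = cosine_sim_matrix(toy)
print(sim)
```

Because only the direction of each vector matters, rows 0 and 1 get a similarity of 1 even though their lengths differ.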

We take a movieId from the movies table and look up its most similar movies in the similarity matrix.

In [36]:
mov.head(4)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance


In [39]:
mov.loc[18]

movieId                                       19
title      Ace Ventura: When Nature Calls (1995)
genres                                    Comedy
Name: 18, dtype: object

In [40]:
sim_mat_df = pd.DataFrame(sim_mat, columns=colab.index, index=colab.index)

In [42]:
sim_mat_df.shape

(10325, 10325)

In [43]:
sim_mat_df.head(4)

movieId,1,2,3,4,5,6,7,8,9,10,...,144482,144656,144976,146344,146656,146684,146878,148238,148626,149532
1,1.0,0.810123,0.701381,0.579881,0.732376,0.723421,0.745652,0.562495,0.655826,0.767094,...,0.446489,0.53582,0.518827,0.520499,0.494886,0.492949,0.482887,0.532426,0.556994,0.538511
2,0.810123,1.0,0.847857,0.690491,0.8018,0.770306,0.823006,0.671271,0.731023,0.707319,...,0.609854,0.626602,0.627131,0.619346,0.637405,0.557504,0.556984,0.625283,0.638994,0.627665
3,0.701381,0.847857,1.0,0.834606,0.834774,0.835817,0.787431,0.769575,0.873239,0.701788,...,0.67579,0.747478,0.7683,0.750461,0.688561,0.661711,0.667492,0.748854,0.771947,0.746392
4,0.579881,0.690491,0.834606,1.0,0.776735,0.732021,0.715156,0.878291,0.834283,0.714623,...,0.797017,0.833067,0.82561,0.840283,0.800947,0.797282,0.8035,0.835537,0.816642,0.831037


In [44]:
sim_mat_df.loc[19].sort_values(ascending=False).head(10)

movieId
19       1.000000
344      0.943017
231      0.903316
333      0.899549
53996    0.897309
420      0.896601
1562     0.890176
45722    0.887743
208      0.887587
367      0.885839
Name: 19, dtype: float32

In [45]:
# the above are the ten highest-scoring movies for movieId 19 (the first entry is movie 19 itself)
sim_mat_df.loc[19].sort_values(ascending=False).head(10).index

Int64Index([19, 344, 231, 333, 53996, 420, 1562, 45722, 208, 367], dtype='int64', name='movieId')

In [47]:
# Finally, look up the titles of the movies most similar to Ace Ventura: When Nature Calls
# (rows come back in mov's own order, not sorted by similarity)

mov[mov['movieId'].isin(sim_mat_df.loc[19].sort_values(ascending=False).head(10).index)]

Unnamed: 0,movieId,title,genres
18,19,Ace Ventura: When Nature Calls (1995),Comedy
180,208,Waterworld (1995),Action|Adventure|Sci-Fi
202,231,Dumb & Dumber (Dumb and Dumber) (1994),Adventure|Comedy
293,333,Tommy Boy (1995),Comedy
304,344,Ace Ventura: Pet Detective (1994),Comedy
326,367,"Mask, The (1994)",Action|Comedy|Crime|Fantasy
368,420,Beverly Hills Cop III (1994),Action|Comedy|Crime|Thriller
1253,1562,Batman & Robin (1997),Action|Adventure|Fantasy|Thriller
7035,45722,Pirates of the Caribbean: Dead Man's Chest (2006),Action|Adventure|Fantasy
7422,53996,Transformers (2007),Action|Sci-Fi|Thriller|IMAX


In [48]:
sim_mat_df.loc[53996].sort_values(ascending=False).head(10)

movieId
53996    1.000000
51662    0.933098
34048    0.918943
31696    0.914599
49278    0.914213
45447    0.911847
68319    0.908503
59784    0.908280
56174    0.906658
57368    0.905278
Name: 53996, dtype: float32

In [49]:
mov[mov['movieId'].isin(sim_mat_df.loc[53996].sort_values(ascending=False).head(10).index)]

Unnamed: 0,movieId,title,genres
6490,31696,Constantine (2005),Action|Fantasy|Horror|Thriller
6662,34048,War of the Worlds (2005),Action|Adventure|Sci-Fi|Thriller
7017,45447,"Da Vinci Code, The (2006)",Drama|Mystery|Thriller
7202,49278,Déjà Vu (Deja Vu) (2006),Action|Sci-Fi|Thriller
7306,51662,300 (2007),Action|Fantasy|War|IMAX
7422,53996,Transformers (2007),Action|Sci-Fi|Thriller|IMAX
7564,56174,I Am Legend (2007),Action|Horror|Sci-Fi|Thriller|IMAX
7606,57368,Cloverfield (2008),Action|Mystery|Sci-Fi|Thriller
7741,59784,Kung Fu Panda (2008),Action|Animation|Children|Comedy|IMAX
8092,68319,X-Men Origins: Wolverine (2009),Action|Sci-Fi|Thriller


Unnamed: 0,movieId,title,genres
3157,4006,Transformers: The Movie (1986),Adventure|Animation|Children|Sci-Fi
7422,53996,Transformers (2007),Action|Sci-Fi|Thriller|IMAX
8150,69526,Transformers: Revenge of the Fallen (2009),Action|Adventure|Sci-Fi|IMAX
9028,87520,Transformers: Dark of the Moon (2011),Action|Adventure|Sci-Fi|War|IMAX
9993,112370,Transformers: Age of Extinction (2014),Action|Adventure|Sci-Fi


In [51]:
''' We could build a simple Streamlit web app: the user enters a movie name and, if that movie is present in the movie dataset,
we use its index into the similarity matrix to recommend similar movies. '''

' We could build a simple Streamlit web app: the user enters a movie name and, if that movie is present in the movie dataset,\nwe use its index into the similarity matrix to recommend similar movies. '
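The lookup such an app would need can be sketched without any web framework. The function `recommend` and its parameters are hypothetical; it assumes a movies table with `movieId` and `title` columns and a square similarity DataFrame indexed by movieId, as built above, and is demonstrated here on made-up data. It returns titles ordered by similarity and returns an empty list for unknown titles or for movies with no row in the similarity matrix (the movies table has a few more entries than the rating matrix).

```python
import pandas as pd

def recommend(title, movies, sim_df, k=5):
    """Return up to k titles most similar to `title`, most similar first."""
    match = movies.loc[movies["title"] == title, "movieId"]
    if match.empty or match.iloc[0] not in sim_df.index:
        return []                                   # unknown title or unrated movie
    movie_id = match.iloc[0]
    scores = sim_df.loc[movie_id].drop(movie_id)    # exclude the movie itself
    top_ids = scores.sort_values(ascending=False).head(k).index
    titles = movies.set_index("movieId")["title"]
    return [titles.loc[i] for i in top_ids if i in titles.index]

# Made-up stand-ins for mov and sim_mat_df
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=[1, 2, 3], columns=[1, 2, 3],
)
print(recommend("A", toy_movies, toy_sim, k=2))
```

Exact title matching is the simplest choice; a real app would probably want case-insensitive or partial matching on the user's input.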