# Movie Recommendation System

In this notebook, we work with data from MovieLens, which provides user ratings for movies.

The number of ratings is very low compared to the number of possible user-movie pairs, so we will explore using matrix factorization to help fill in the rating gaps.

Lastly, we will use KMeans clustering on the learned movie embeddings to see whether movies in the same cluster are similar in genre.


## Import packages


In [1]:
# zip
import zipfile

# import data
import pandas as pd

# nn
import torch
import numpy as np
from torch.autograd import Variable
from tqdm import tqdm_notebook as tqdm
from torch.utils.data import DataLoader
from torch.utils.data.dataset import Dataset

# kmeans
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt


## Get Data

The data for this project comes from the MovieLens dataset, so we download the zip file and extract it.

In [2]:
# Download zip file
! curl http://files.grouplens.org/datasets/movielens/ml-latest-small.zip -o ml-latest-small.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  955k  100  955k    0     0   829k      0  0:00:01  0:00:01 --:--:--  829k


In [3]:
# unzip file
with zipfile.ZipFile('ml-latest-small.zip', 'r') as zip_ref:
    zip_ref.extractall('ml-latest-small')

## Import data

In [4]:
movies_df = pd.read_csv('ml-latest-small/ml-latest-small/movies.csv')
ratings_df = pd.read_csv('ml-latest-small/ml-latest-small/ratings.csv')
print(f'Movies shape: {movies_df.shape}')
print(f'Ratings shape: {ratings_df.shape}')

Movies shape: (9742, 3)
Ratings shape: (100836, 4)


In [5]:
movies_df.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [6]:
ratings_df.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


### Map movie ID to movie name

In [7]:
movie_names = movies_df.set_index('movieId')['title'].to_dict()
ratings_df['title'] = ratings_df['movieId'].map(movie_names)
ratings_df.head()

| | userId | movieId | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) |
| 1 | 1 | 3 | 4.0 | 964981247 | Grumpier Old Men (1995) |
| 2 | 1 | 6 | 4.0 | 964982224 | Heat (1995) |
| 3 | 1 | 47 | 5.0 | 964983815 | Seven (a.k.a. Se7en) (1995) |
| 4 | 1 | 50 | 5.0 | 964982931 | Usual Suspects, The (1995) |


In [8]:
n_users = len(ratings_df.userId.unique())
n_items = len(ratings_df.movieId.unique())
print(f'Number of users: {n_users} \nNumber of movies: {n_items}')
print(f'The full ratings matrix will have {n_users * n_items} elements')

Number of users: 610 
Number of movies: 9724
The full ratings matrix will have 5931640 elements


In [9]:
print(f'Number of ratings: {len(ratings_df)}')
print(f'This means that {(len(ratings_df) /( n_users * n_items)) * 100}% of the matrix is filled')

Number of ratings: 100836
This means that 1.6999683055613624% of the matrix is filled


The results above show that we are working with a very sparse matrix: only about 1.70% of its entries have values. To estimate the missing ratings, we will now use matrix factorization.
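
As a rough sketch of the idea, assuming a plain dot-product model without bias terms and a mean squared error objective: each rating is approximated by the dot product of a user vector and a movie vector, and the vectors are fitted only on the ratings that exist. The symbols $p_u$, $q_i$, $k$, and $\mathcal{K}$ (the set of observed user-movie pairs) are notation chosen here for illustration.

```latex
\hat{r}_{ui} = p_u^\top q_i = \sum_{f=1}^{k} p_{uf}\, q_{if},
\qquad
\min_{P,\,Q} \; \frac{1}{|\mathcal{K}|} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - \hat{r}_{ui} \right)^2
```

Once $P$ and $Q$ are learned, $\hat{r}_{ui}$ can be evaluated for any pair, including the roughly 98% of pairs that have no rating.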

## Matrix Factorization

In [10]:
class MatrixFactorization(torch.nn.Module):
    def __init__(self, n_users, n_items, n_factors=20):
        super().__init__()
        # user embeddings
        self.user_factors = torch.nn.Embedding(n_users, n_factors)
        # item embeddings
        self.item_factors = torch.nn.Embedding(n_items, n_factors)

        # self.user_factors.weight.data.uniform(0, 0.05)
        # self.item_factors.weight.data.uniform(0, 0.05)

        torch.nn.init.uniform_(self.user_factors.weight, 0, 0.05)
        torch.nn.init.uniform_(self.item_factors.weight, 0, 0.05)

    def forward(self, data):
        # row-wise dot product of each user embedding with its paired item embedding
        users, items = data[:,0], data[:,1]
        return (self.user_factors(users)*self.item_factors(items)).sum(1)
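
To make the `forward` computation concrete, here is a small pure-Python sketch of the same arithmetic: look up one row per user and per item, multiply elementwise, and sum over the factors. The tiny `user_factors` and `item_factors` tables and the `predict` helper are made-up for illustration and are not trained weights.

```python
# Made-up embedding tables: 2 users x 2 factors, 3 items x 2 factors.
user_factors = [[0.1, 0.2], [0.3, 0.4]]
item_factors = [[1.0, 2.0], [0.5, 0.5], [2.0, 0.0]]


def predict(pairs):
    """Return one predicted rating per (user_index, item_index) pair."""
    preds = []
    for u, i in pairs:
        # Elementwise product summed over factors, i.e. a dot product.
        preds.append(sum(p * q for p, q in zip(user_factors[u], item_factors[i])))
    return preds


batch = [(0, 0), (1, 2), (1, 1)]
predictions = predict(batch)
print(predictions)
```

Each pair in the batch yields a single number, which is why the model output has one entry per row of the input rather than a full users-by-items matrix.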

## Preprocessing dataset for learning algorithm

### Create Torch data set

Wrapping the ratings in a torch `Dataset` lets a `DataLoader` batch and shuffle them for training.

In [15]:
class MovieDataset(Dataset):
    def __init__(self, ratings_df):
        self.ratings_df = ratings_df.copy()

        # extract user id and movie id
        # self.ratings_df['userId'], _ = pd.factorize(self.ratings_df['userId'])
        # self.ratings_df['movieId'], _ = pd.factorize(self.ratings_df['movieId'])
        users = self.ratings_df.userId.unique()
        movies = self.ratings_df.movieId.unique()

        # produce new ids for users and movies i.e unique_vals: index
        self.userid2idx = {value:idx for idx,value in enumerate(users)}
        self.movieid2idx = {value:idx for idx,value in enumerate(movies)}

        self.idx2userid = {idx:value for idx, value in enumerate(users)}
        self.idx2movieid = {idx:value for idx, value in enumerate(movies)}

        # replace raw ids with contiguous embedding indices
        self.ratings_df['userId'] = self.ratings_df['userId'].map(self.userid2idx)
        self.ratings_df['movieId'] = self.ratings_df['movieId'].map(self.movieid2idx)

        # features are (user index, movie index) pairs; the target is the rating
        self.x = torch.tensor(self.ratings_df[['userId', 'movieId']].values, dtype=torch.int64)
        self.y = torch.tensor(self.ratings_df['rating'].values, dtype=torch.float32)

    def __getitem__(self, index):
        return (self.x[index], self.y[index])

    def __len__(self):
        return len(self.ratings_df)
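
The id remapping inside `MovieDataset` matters because raw MovieLens ids are not contiguous, while an embedding table is indexed by row numbers from 0 to n-1. A minimal sketch of that mapping in plain Python, where `build_index` is a hypothetical helper and the ids are the ones shown in `ratings_df.head()` plus two repeats:

```python
def build_index(raw_ids):
    """Map each distinct id to a contiguous index in first-seen order."""
    id2idx = {}
    for raw in raw_ids:
        if raw not in id2idx:
            id2idx[raw] = len(id2idx)
    # Reverse lookup, used later to turn embedding rows back into movie ids.
    idx2id = {idx: raw for raw, idx in id2idx.items()}
    return id2idx, idx2id


movie_ids = [1, 3, 6, 47, 50, 3, 1]
movieid2idx, idx2movieid = build_index(movie_ids)
encoded = [movieid2idx[m] for m in movie_ids]
print(encoded)  # [0, 1, 2, 3, 4, 1, 0]
```

The reverse dictionary is what the clustering step relies on to translate a row of the item embedding matrix back into a movie title.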

### Set Hyperparameters

In [12]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = MatrixFactorization(n_users, n_items, n_factors=8).to(device)
print(model)


MatrixFactorization(
  (user_factors): Embedding(610, 8)
  (item_factors): Embedding(9724, 8)
)


In [13]:
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.shape)

user_factors.weight torch.Size([610, 8])
item_factors.weight torch.Size([9724, 8])


In [17]:
# loss
criterion = torch.nn.MSELoss()
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# epochs
epochs = 120

# training data, served in shuffled mini-batches of 128 ratings
train_dataset = MovieDataset(ratings_df)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

## Train model

In [18]:
for it in tqdm(range(epochs)):
    losses = []
    for x, y in train_loader:
        x = x.to(device)
        y = y.to(device)
        output = model(x)
        loss = criterion(output.squeeze(), y.type(torch.float32))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    if it % 10 == 0:
        print(f'Epoch: {it} Loss: {np.mean(losses)}')

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for it in tqdm(range(epochs)):


  0%|          | 0/120 [00:00<?, ?it/s]

Epoch: 0 Loss: 11.070700886890974
Epoch: 10 Loss: 0.7595207256260257
Epoch: 20 Loss: 0.6604842084086486
Epoch: 30 Loss: 0.6490953147865189
Epoch: 40 Loss: 0.6008479786176367
Epoch: 50 Loss: 0.51816202499691
Epoch: 60 Loss: 0.44928774025839596
Epoch: 70 Loss: 0.40615986894608147
Epoch: 80 Loss: 0.37988411621923374
Epoch: 90 Loss: 0.3629802611613939
Epoch: 100 Loss: 0.35103711822462563
Epoch: 110 Loss: 0.342304933343442
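
To show what each optimizer step is doing to the embeddings, the sketch below fits a toy factorization in pure Python. It deliberately swaps in plain stochastic gradient descent on single ratings instead of Adam on mini-batches, and the ratings, learning rate, and names (`P`, `Q`, `mse`) are made-up for illustration.

```python
import random

random.seed(0)
n_factors = 2
n_users, n_items = 3, 3
# Toy observed ratings as (user_index, item_index, rating); made-up values.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

# Small positive uniform initialisation, mirroring uniform_(0, 0.05) in the model.
P = [[random.uniform(0, 0.05) for _ in range(n_factors)] for _ in range(n_users)]
Q = [[random.uniform(0, 0.05) for _ in range(n_factors)] for _ in range(n_items)]


def mse():
    """Mean squared error over the observed ratings only."""
    errs = [(r - sum(p * q for p, q in zip(P[u], Q[i]))) ** 2 for u, i, r in ratings]
    return sum(errs) / len(errs)


initial_loss = mse()
lr = 0.01
for epoch in range(500):
    for u, i, r in ratings:
        err = r - sum(p * q for p, q in zip(P[u], Q[i]))
        for f in range(n_factors):
            p_uf, q_if = P[u][f], Q[i][f]
            # d/dp_uf of (r - p.q)^2 is -2 * err * q_if, and symmetrically for q_if,
            # so stepping against the gradient adds lr * 2 * err * (other factor).
            P[u][f] += lr * 2 * err * q_if
            Q[i][f] += lr * 2 * err * p_uf
final_loss = mse()
print(initial_loss, final_loss)
```

As in the notebook's run, the loss starts high because the near-zero initial embeddings predict ratings close to 0, then falls as the factors grow to fit the observed entries.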


## KMeans
Now that we have trained our model, we cluster the learned movie embeddings to get a sense of how closely the movies are related to each other.

In [19]:
# get unique movie factor weights
trained_movie_embeddings = model.item_factors.weight.data.cpu().numpy()
trained_movie_embeddings

array([[0.45019212, 0.4881864 , 0.25370574, ..., 0.67150515, 0.6310844 ,
        0.37505266],
       [0.22918005, 0.8606945 , 0.36849856, ..., 0.28990775, 0.6564158 ,
        0.49818307],
       [0.16191955, 0.90323293, 0.68490624, ..., 0.6758368 , 0.4424198 ,
        0.3603704 ],
       ...,
       [0.30253246, 0.33442432, 0.34778222, ..., 0.34925002, 0.31884152,
        0.3242194 ],
       [0.43096972, 0.40483552, 0.41202503, ..., 0.3970471 , 0.42232344,
        0.4097304 ],
       [0.38624316, 0.40292203, 0.4321991 , ..., 0.4238267 , 0.38585022,
        0.40153095]], dtype=float32)

In [20]:
kmeans = KMeans(n_clusters=10, random_state=42).fit(trained_movie_embeddings)
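
Conceptually, k-means alternates between assigning each embedding to its nearest centroid and moving each centroid to the mean of its members. The pure-Python sketch below runs that loop on made-up 2-D points. It is only illustrative: scikit-learn's `KMeans` adds k-means++ initialisation and other refinements, and the `assign` and `update` helpers and starting centroids here are hypothetical.

```python
def assign(points, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each point."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, k):
    """Mean of the points assigned to each cluster (assumes no cluster is empty)."""
    centroids = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroids.append([sum(dim) / len(members) for dim in zip(*members)])
    return centroids


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
centroids = [points[0], points[3]]  # hypothetical starting centroids
for _ in range(5):
    labels = assign(points, centroids)
    centroids = update(points, labels, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In the notebook the points are the 8-dimensional movie embeddings, so movies end up in the same cluster when users rated them in similar ways.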

In [26]:
# count ratings per movie once, instead of filtering ratings_df inside the loop
rating_counts = ratings_df['movieId'].value_counts()

for cluster in range(10):
    print(f'Cluster {cluster}')
    movies = []
    for movieidx in np.where(kmeans.labels_ == cluster)[0]:
        movieid = train_dataset.idx2movieid[movieidx]
        movies.append((movie_names[movieid], rating_counts[movieid]))
    # show the ten most-rated movies in each cluster
    for mov in sorted(movies, key=lambda tup: tup[1], reverse=True)[:10]:
        print(f'\t{mov[0]}')


Cluster 0
	Addams Family Values (1993)
	X2: X-Men United (2003)
	Coneheads (1993)
	Casper (1995)
	Judge Dredd (1995)
	Three Musketeers, The (1993)
	Hot Shots! Part Deux (1993)
	From Dusk Till Dawn (1996)
	Scary Movie (2000)
	Troy (2004)
Cluster 1
	Forrest Gump (1994)
	Shawshank Redemption, The (1994)
	Silence of the Lambs, The (1991)
	Matrix, The (1999)
	Star Wars: Episode IV - A New Hope (1977)
	Braveheart (1995)
	Terminator 2: Judgment Day (1991)
	Apollo 13 (1995)
	Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
	Star Wars: Episode VI - Return of the Jedi (1983)
Cluster 2
	Independence Day (a.k.a. ID4) (1996)
	Star Wars: Episode I - The Phantom Menace (1999)
	X-Men (2000)
	Twister (1996)
	Net, The (1995)
	Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
	Harry Potter and the Chamber of Secrets (2002)
	Matrix Reloaded, The (2003)
	Iron Man (2008)
	Armageddon (1998)
Cluster 3
	Ace Ventura: When Nature Calls (1

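Notice that movies in the same cluster tend to be similar in genre or audience. For example, Cluster 2 is dominated by big-budget action, sci-fi, and fantasy titles, while Cluster 1 groups widely watched classics.

Recall that the algorithm derives these relationships from embeddings learned purely from how users rated the movies; genre labels were never part of the training data.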
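
One way to check the genre claim more systematically is to tally the pipe-separated `genres` field per cluster. The sketch below does this with `collections.Counter`. The titles and genres are the five rows from `movies_df.head()`, but the cluster assignments in `cluster_of` are made-up for illustration; in the notebook they would come from `kmeans.labels_`.

```python
from collections import Counter

genres_by_movie = {
    'Toy Story (1995)': 'Adventure|Animation|Children|Comedy|Fantasy',
    'Jumanji (1995)': 'Adventure|Children|Fantasy',
    'Grumpier Old Men (1995)': 'Comedy|Romance',
    'Waiting to Exhale (1995)': 'Comedy|Drama|Romance',
    'Father of the Bride Part II (1995)': 'Comedy',
}
# Hypothetical cluster labels, not the notebook's actual KMeans output.
cluster_of = {
    'Toy Story (1995)': 0,
    'Jumanji (1995)': 0,
    'Grumpier Old Men (1995)': 1,
    'Waiting to Exhale (1995)': 1,
    'Father of the Bride Part II (1995)': 1,
}

# One Counter of genre tags per cluster.
genre_counts = {}
for title, cluster in cluster_of.items():
    counts = genre_counts.setdefault(cluster, Counter())
    counts.update(genres_by_movie[title].split('|'))

for cluster, counts in sorted(genre_counts.items()):
    print(cluster, counts.most_common(3))
```

If a few genres account for most of the tags in a cluster, that supports the observation above; a flat distribution would suggest the cluster reflects something else, such as popularity.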

## References


* Data citation: F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19.

* Data link: [MovieLens](https://grouplens.org/datasets/movielens/)

* [Tutorial](https://www.youtube.com/watch?v=G4MBc40rQ2k)

