# Programming Assignment 2: Movie recommendation

- 과제 목표: 뉴럴 네트워크 모델을 설계한 후 모델을 학습하여 각 영화들의 embedding 들을 생성하고, 영화 embedding 을 활용하여 각 사용자에게 맞춤형 영화를 추천

# Notice

<br>

- 과제를 수행하면서 각 task 마다 꼭 주어진 1개의 cell만을 사용할 필요는 없으며, 여러 개의 cell을 추가하여 자유롭게 사용해도 괜찮습니다.
- 과제 수행을 위해 필요한 module이 있다면 추가로 import 해도 괜찮습니다.

# Import Modules

In [12]:
import warnings, random
import numpy as np
import pandas as pd
import torch # For building network

from itertools import permutations # For making pairs
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.nn.functional as F
import os

warnings.filterwarnings('ignore')

# Data loading

In [13]:
dir = './MovieLens100K/'
df_ratings = pd.read_csv(dir + 'ratings.csv', usecols=['userId', 'movieId', 'rating'])
df_movies = pd.read_csv(dir + 'movies.csv', usecols=['movieId', 'title', 'genres']) # for title-matching

# Preprocessing data

<br>

> ### Problem 1 (3 points)

<br>

1. df_ratings로 부터 각 사용자들이 본 영화를 기록.
2. 사용자 마다 본 영화 목록을 $(movie1, movie2)$, $(movie2, movie1)$ 과 같이 pair로 생성.
    - 즉, 각 사용자 마다 본 영화 목록에 대해 Permutation을 수행
3. 2번 과정이 끝난 후, random을 이용해 각 pair 순서를 무작위로 shuffle.

In [14]:
#### Your Code Here
# 영화 아이디가 1302이하인 데이터만 이용
data = df_ratings.pivot(index = 'movieId', columns = 'userId', values = 'rating').fillna(0).head(1000)
data

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1298,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1299,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,0.0,0.0,3.5,0.0,0.0,0.0,0.0
1300,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,0.0,0.0,3.5,0.0,0.0,0.0,4.5
1301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
# 페어 생성
user_mv_id = df_ratings['movieId'][df_ratings['userId'] == 1]
user_mv_id = user_mv_id.loc[user_mv_id.values<data.index[-1]]
pair = torch.tensor(list(permutations(user_mv_id, 2)))

for i in range(1,df_ratings['userId'].max()):
    user_mv_id = df_ratings['movieId'][df_ratings['userId'] == i+1]
    user_mv_id = user_mv_id.loc[user_mv_id.values<data.index[-1]]
    pair = torch.cat([pair,torch.tensor(list(permutations(user_mv_id, 2)))], dim = 0)
print(pair.shape)

torch.Size([3463554, 2])


In [16]:
tr_num = int(len(pair)*0.9) # ---> train dataset 개수
val_num = len(pair) - int(len(pair)*0.9) # ---> test dataset 개수
print(tr_num, val_num)

3117198 346356


In [17]:
# split
train_dataset, test_dataset = torch.utils.data.random_split(pair, [tr_num, val_num])

# 무작위 셔플
train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=True)

# Build and train neural networks for generating movie embeddings

<br>

> ### Problem 2 (4 points)

<br>

- 각 영화 임베딩을 구하기 위해 뉴럴 네트워크 모델을 활용하여 multi-class classification 을 수행

- 설계할 신경망의 기본 구조는 **Input Layer - Hidden(Embedding) Layer - Output Layer**.
    - 7주차 강의자료 p.7 신경망 구조 이미지 참고

- 현재 Network를 통해 하고자 하는 task는 multi-class classification.
    - 예: $(movie1, movie2)$ 와 같은 입력 데이터와 정답 출력 데이터를 이용해 모델을 학습
        - Input : $movie1$의 one-hot vector
        - Output : $\widehat{movie2}$의 one-hot vector
        - Compute Loss : $\widehat{movie2}$ 와 $movie2$ 간의 Cross-entropy Loss
- 학습이 완료된 이후에 input layer와 hidden(embedding) layer 사이의 weight matrix $W_{in}$를 movie에 대한 embedding vector로 사용이 가능.
> embedding size(# of hidden units)는 100 이하로 두는 것을 권장. <br>
> embedding layer 다음 hidden layer를 더 추가하여 Genre와 같은 추가 정보를 학습에 활용 할 수도 있음 (필수적으로 고려해야할 사항은 아님).

- 설계한 뉴럴 네트워크 모델의 학습이 완료된 후, 학습된 weight matrix $W_{in}$의 행/열벡터를 각 영화에 대한 embedding vector로 간주하여 영화 embedding 들을 구할 수 있음.

In [18]:
#### Your Code Here
class Model(nn.Module): # torch.nn.Module을 상속받는 파이썬 클래스
    def __init__(self, num_cls): #
        super().__init__()
        self.embedding = nn.Linear(num_cls, 100)
        self.out = nn.Linear(100, num_cls)
        self.softmax = nn.Softmax(dim = 1)

    def forward(self, x):
        x = self.embedding(x)
        x = self.out (x)
        out = self.softmax(x) 
        return out


In [30]:
nb_epochs = 10
num_cls = data.index[-1]
model = Model(num_cls)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
save_loss = 100

In [31]:
for epoch in range(nb_epochs + 1):
  for batch_idx, samples in enumerate(train_dataloader):

    x_train, y_train = samples[:,0], samples[:,1]

    one_hot_x_train = F.one_hot(x_train.long(), num_classes=num_cls)
    one_hot_y_train = F.one_hot(y_train.long(), num_classes=num_cls)

    prediction = model(one_hot_x_train.float())


    # cost 계산
    loss = F.cross_entropy(prediction, one_hot_y_train.float())

    # cost로 H(x) 계산
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    
    if batch_idx % 1000 == 0:
      print('Epoch {:4d}/{} Batch {}/{} Loss: {:.6f}'.format(
        epoch, nb_epochs, batch_idx+1, len(train_dataloader),
        loss.item()
        ))

      torch.save({
            'Epoch': epoch,
            'Batch': batch_idx+1,
            'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict()
        }, os.path.join('./out/'+str(batch_idx)+'.pth'))

      if save_loss > loss:
        save_loss = loss
        torch.save({
            'Epoch': epoch,
            'Batch': batch_idx+1,
            'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict()
        }, os.path.join('./out/'+str(batch_idx)+'best.pth'))
      

Epoch    0/10 Batch 1/48707 Loss: 7.171658
Epoch    0/10 Batch 1001/48707 Loss: 7.170923
Epoch    0/10 Batch 2001/48707 Loss: 7.168122
Epoch    0/10 Batch 3001/48707 Loss: 7.164025


FileNotFoundError: [Errno 2] No such file or directory: './out/3000.pth'

# Recommend customized movies to user

<br>

> ### Problem 3 (3 points)

<br>

- 임의의 한명의 사용자에 대하여 해당 사용자가 봤던 영화 n개에 대해 **통합된 embedding vector**를 생성.
    - n개의 embedding vector들에 대해, element-wise한 계산을 통해 통합된 하나의 embedding vector를 생성.
    - 이 embedding vector는 해당 사용자의 전반적인 영화 시청 성향을 나타내는 embedding vector로 간주할 수 있음.
    - 즉, **사용자 1명 당 1개의 embedding vector**를 가짐.
- 통합된 embedding vector와 학습된 weight matrix $W_{in}$의 모든 영화 embedding vector들 간의 유사도를 계산.
- 그 중 유사도가 높은 (top n) 영화들을 선정, 사용자에게 추천.
    > Recommended format : MovieId, Title, Genre, Similarity 가 포함된 형식

In [None]:
#### Your Code Here
