<a href="https://colab.research.google.com/github/hotorch/DL_based_RecSys/blob/master/%5BPractice4%5DWDL(DeepCTR)%26Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The dataset used here is for predicting movie ratings. Its description is available [here](http://files.grouplens.org/datasets/movielens/ml-1m-README.txt).

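The example datasets from the official DeepCTR documentation are available [here](https://github.com/shenweichen/DeepCTR-Torch/tree/master/examples). They contain only 200 samples each, which makes them useful for a quick run on CPU to get a feel for the library.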

## 0. Load Data & install DeepCTR-torch

In [1]:
!mkdir -p data
!wget -q http://www.grouplens.org/system/files/ml-1m.zip
!unzip -o ml-1m.zip -d data
!pip install deepctr-torch

Archive:  ml-1m.zip
   creating: data/ml-1m/
  inflating: data/ml-1m/movies.dat   
  inflating: data/ml-1m/ratings.dat  
  inflating: data/ml-1m/README       
  inflating: data/ml-1m/users.dat    
Collecting deepctr-torch
  Downloading https://files.pythonhosted.org/packages/5c/33/047478cada1347e2168edd6726b91a1c2fe1d706382ed5584584aa51ab24/deepctr_torch-0.2.0-py3-none-any.whl (45kB)
Installing collected packages: deepctr-torch
Successfully installed deepctr-torch-0.2.0


In [2]:
import tensorflow as tf
print(tf.__version__)

1.15.0


In [0]:
import pandas as pd
import numpy as np
import torch
import math
import itertools
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from deepctr_torch.models import WDL
from deepctr_torch.inputs import SparseFeat,VarLenSparseFeat, get_feature_names
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

The data comes in three files, so we first prepare to merge them into one table.

In [4]:
ratings = pd.read_csv('./data/ml-1m/ratings.dat', header=None, sep="::", engine='python')
ratings.head(3) # userID / movieID/ Rating / timestamp

Unnamed: 0,0,1,2,3
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968


In [5]:
movies = pd.read_csv('./data/ml-1m/movies.dat', header=None, sep="::", engine='python')
movies.head(3) # movieID / Title / Genres

Unnamed: 0,0,1,2
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [6]:
users = pd.read_csv('./data/ml-1m/users.dat', header=None, sep="::", engine='python')
users.head(3) # UserID / Gender / Age / Occupation / Zip-code

Unnamed: 0,0,1,2,3,4
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117


- Age is chosen from the following ranges:

	*  1:  "Under 18"
	* 18:  "18-24"
	* 25:  "25-34"
	* 35:  "35-44"
	* 45:  "45-49"
	* 50:  "50-55"
	* 56:  "56+"

- Occupation is chosen from the following choices:

	*  0:  "other" or not specified
	*  1:  "academic/educator"
	*  2:  "artist"
	*  3:  "clerical/admin"
	*  4:  "college/grad student"
	*  5:  "customer service"
	*  6:  "doctor/health care"
	*  7:  "executive/managerial"
	*  8:  "farmer"
	*  9:  "homemaker"
	* 10:  "K-12 student"
	* 11:  "lawyer"
	* 12:  "programmer"
	* 13:  "retired"
	* 14:  "sales/marketing"
	* 15:  "scientist"
	* 16:  "self-employed"
	* 17:  "technician/engineer"
	* 18:  "tradesman/craftsman"
	* 19:  "unemployed"
	* 20:  "writer"

The default column names are uninformative, so we rename them. Note that the timestamp variable will not be used.

In [0]:
ratings.columns = ["uid","iid","rating",'timestamp']
movies.columns = ["iid", "movie_name","genre"]
users.columns = ["uid","Gender", "Age", "Occupation", "zipcode"]

In [8]:
print(ratings.shape)
print(movies.shape)
print(users.shape)

(1000209, 4)
(3883, 3)
(6040, 5)


We left-join the other tables onto ratings.

In [9]:
rating_join_movies = pd.merge(ratings, movies, how = 'left', on = 'iid')
rating_join_movies.head()

Unnamed: 0,uid,iid,rating,timestamp,movie_name,genre
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama
1,1,661,3,978302109,James and the Giant Peach (1996),Animation|Children's|Musical
2,1,914,3,978301968,My Fair Lady (1964),Musical|Romance
3,1,3408,4,978300275,Erin Brockovich (2000),Drama
4,1,2355,5,978824291,"Bug's Life, A (1998)",Animation|Children's|Comedy


In [10]:
rating_join_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 6 columns):
uid           1000209 non-null int64
iid           1000209 non-null int64
rating        1000209 non-null int64
timestamp     1000209 non-null int64
movie_name    1000209 non-null object
genre         1000209 non-null object
dtypes: int64(4), object(2)
memory usage: 53.4+ MB


In [11]:
data = pd.merge(rating_join_movies, users, how = 'left', on = 'uid')
data.head()

Unnamed: 0,uid,iid,rating,timestamp,movie_name,genre,Gender,Age,Occupation,zipcode
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama,F,1,10,48067
1,1,661,3,978302109,James and the Giant Peach (1996),Animation|Children's|Musical,F,1,10,48067
2,1,914,3,978301968,My Fair Lady (1964),Musical|Romance,F,1,10,48067
3,1,3408,4,978300275,Erin Brockovich (2000),Drama,F,1,10,48067
4,1,2355,5,978824291,"Bug's Life, A (1998)",Animation|Children's|Comedy,F,1,10,48067


In [12]:
data.shape

(1000209, 10)

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 10 columns):
uid           1000209 non-null int64
iid           1000209 non-null int64
rating        1000209 non-null int64
timestamp     1000209 non-null int64
movie_name    1000209 non-null object
genre         1000209 non-null object
Gender        1000209 non-null object
Age           1000209 non-null int64
Occupation    1000209 non-null int64
zipcode       1000209 non-null object
dtypes: int64(6), object(4)
memory usage: 83.9+ MB
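
The two left joins above keep the row count at 1,000,209 because `iid` and `uid` are unique keys in the movies and users tables. A minimal sketch of how that property could be checked, using toy frames (`ratings_toy` and `movies_toy` are hypothetical stand-ins, not the real data):

```python
import pandas as pd

# Toy stand-ins for the ratings and movies tables (illustrative values only).
ratings_toy = pd.DataFrame({"uid": [1, 1, 2], "iid": [10, 20, 10], "rating": [5, 3, 4]})
movies_toy = pd.DataFrame({"iid": [10, 20], "movie_name": ["A", "B"]})

# validate="many_to_one" raises a MergeError if iid is duplicated in movies_toy,
# which would otherwise silently multiply rows in the joined result.
merged = pd.merge(ratings_toy, movies_toy, how="left", on="iid", validate="many_to_one")

# A left join onto a unique key keeps exactly one row per rating.
assert len(merged) == len(ratings_toy)

# Ratings without a matching movie would show up as NaN in movie_name.
n_unmatched = int(merged["movie_name"].isna().sum())
print(len(merged), n_unmatched)
```

The same check would apply to the join with users on `uid`.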


Now that the working dataset is assembled, we can use the features provided by DeepCTR.

## 1. Specify Features and Target

In [0]:
sparse_features = ['iid', 'uid', 'Gender','Age', 'Occupation', 'zipcode', ]
target = ['rating']

## 2. Label Encoding for sparse features & count unique features for each sparse field

We apply label encoding so that every sparse field, including the string-valued ones, becomes a contiguous integer id.

In [0]:
for feat in sparse_features:
    lbe = LabelEncoder()
    data[feat] = lbe.fit_transform(data[feat])

In [16]:
data[sparse_features].head()

Unnamed: 0,iid,uid,Gender,Age,Occupation,zipcode
0,1104,0,0,0,10,1588
1,639,0,0,0,10,1588
2,853,0,0,0,10,1588
3,3177,0,0,0,10,1588
4,2162,0,0,0,10,1588
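
As an illustration of what the encoding does, the sketch below maps values to the positions of their sorted unique classes with `np.unique`. This is a stand-in for `LabelEncoder` on toy inputs, not a call to it; the arrays `ages` and `genders` are made up for the example.

```python
import numpy as np

# Raw age codes as they appear in users.dat (1, 18, 25, ...), toy sample.
ages = np.array([1, 56, 25, 1, 18])

# Sorted unique values become the classes; each value maps to its class index.
age_classes, age_codes = np.unique(ages, return_inverse=True)

# String fields work the same way: classes are sorted lexicographically.
genders = np.array(["F", "M", "M"])
gender_classes, gender_codes = np.unique(genders, return_inverse=True)

print(age_classes.tolist(), age_codes.tolist())
print(gender_classes.tolist(), gender_codes.tolist())
```

This matches the table above, where the first user (Gender `F`, Age `1`) is encoded as `0` in both columns.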


In [17]:
# Embedding size rule of thumb: a higher embedding dimension gives more degrees of freedom for learning feature representations.
# The TF tutorial sets the dimension of every feature column to 8.
# As a more informed, empirical choice, the tutorial suggests starting from a value
# on the order of k * log2(n) or k * n^(1/4),
# where n is the number of unique values in the feature column and
# k is a small constant (usually smaller than 10).
# Looking inside the DeepCTR module, however, "auto" resolves to 6 * int(pow(vocabulary_size, 0.25)).

fixlen_feature_columns = [SparseFeat(
                                      name = feat, 
                                      vocabulary_size = data[feat].nunique(),
                                      embedding_dim = "auto" # math.ceil(2*math.log(data[feat].nunique(),2))
                                      )
                    for feat in sparse_features]
fixlen_feature_columns

[SparseFeat(name='iid', vocabulary_size=3706, embedding_dim=42, use_hash=False, dtype='int32', embedding_name='iid', group_name='default_group'),
 SparseFeat(name='uid', vocabulary_size=6040, embedding_dim=48, use_hash=False, dtype='int32', embedding_name='uid', group_name='default_group'),
 SparseFeat(name='Gender', vocabulary_size=2, embedding_dim=6, use_hash=False, dtype='int32', embedding_name='Gender', group_name='default_group'),
 SparseFeat(name='Age', vocabulary_size=7, embedding_dim=6, use_hash=False, dtype='int32', embedding_name='Age', group_name='default_group'),
 SparseFeat(name='Occupation', vocabulary_size=21, embedding_dim=12, use_hash=False, dtype='int32', embedding_name='Occupation', group_name='default_group'),
 SparseFeat(name='zipcode', vocabulary_size=3439, embedding_dim=42, use_hash=False, dtype='int32', embedding_name='zipcode', group_name='default_group')]
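
The `embedding_dim` values in the output can be reproduced by hand from the formula quoted in the comment. The sketch below does so; `auto_embedding_dim` and `log_rule_dim` are hypothetical helper names, and the vocabulary sizes are copied from the output above.

```python
import math

def auto_embedding_dim(vocabulary_size: int) -> int:
    """Formula quoted in the notebook comment for embedding_dim="auto"."""
    return 6 * int(pow(vocabulary_size, 0.25))

def log_rule_dim(vocabulary_size: int, k: int = 2) -> int:
    """Alternative k * log2(n) rule, as in the commented-out expression."""
    return math.ceil(k * math.log(vocabulary_size, 2))

# Vocabulary sizes taken from the SparseFeat output above.
vocab_sizes = {"iid": 3706, "uid": 6040, "Gender": 2,
               "Age": 7, "Occupation": 21, "zipcode": 3439}

dims = {name: auto_embedding_dim(n) for name, n in vocab_sizes.items()}
log_dims = {name: log_rule_dim(n) for name, n in vocab_sizes.items()}

for name in vocab_sizes:
    print(f"{name:10s} auto={dims[name]:3d}  2*log2(n)={log_dims[name]:3d}")
```

Because of the `int()` truncation, the fourth-root rule moves in steps of 6, so fields with very different cardinalities (for example `Gender` and `Age`) can end up with the same dimension.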

## 3. process sequence features & generate feature config for sequence feature

We treat the genre field as a variable-length sequence feature.

In [0]:
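def split(x):
    # Map a '|'-separated genre string to integer ids, growing the global
    # key2index dict (defined in the next cell) as new genres appear.
    # Ids start at 1 so that 0 stays free for padding.
    key_ans = x.split('|')
    for key in key_ans:
        if key not in key2index:
            key2index[key] = len(key2index) + 1
    return [key2index[key] for key in key_ans]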

In [19]:
# preprocess the sequence feature
key2index = {}
genres_list = list(map(split, data['genre'].values))
genres_list[:6]

[[1], [2, 3, 4], [4, 5], [1], [2, 3, 6], [7, 8, 6, 5]]

In [20]:
genres_length = np.array(list(map(len, genres_list)))
genres_length[:6]

array([1, 3, 2, 1, 3, 4])

In [21]:
max_len = max(genres_length)
max_len

6

The padded result looks like this:

In [22]:
genres_list = pad_sequences(genres_list, maxlen=max(genres_length), padding='post', )
genres_list[0:6]

array([[1, 0, 0, 0, 0, 0],
       [2, 3, 4, 0, 0, 0],
       [4, 5, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [2, 3, 6, 0, 0, 0],
       [7, 8, 6, 5, 0, 0]], dtype=int32)

In [23]:
genres_list.shape

(1000209, 6)
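
To make the padding step concrete, here is a plain-Python sketch of post-padding. `pad_post` is a hypothetical helper written for illustration; it mirrors only the `padding='post'` case where `maxlen` is at least the longest sequence, so it does not cover truncation.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` (no truncation)."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            raise ValueError("sequence longer than maxlen; truncation not handled here")
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# First six genre id lists, copied from the output above.
genres_sample = [[1], [2, 3, 4], [4, 5], [1], [2, 3, 6], [7, 8, 6, 5]]
padded_sample = pad_post(genres_sample, maxlen=6)

for row in padded_sample:
    print(row)
```

Since genre ids start at 1, the padding value 0 never collides with a real genre.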

In [24]:
varlen_feature_columns = [VarLenSparseFeat(SparseFeat('genre',  
                                           vocabulary_size = len(key2index)+1,
                                           embedding_dim = "auto"),
                                           maxlen = max(genres_length), 
                                           combiner = 'mean')]  # Other combiners such as max are also possible; the YouTube recommender system reportedly found averaging to work best.
varlen_feature_columns

[VarLenSparseFeat(sparsefeat=SparseFeat(name='genre', vocabulary_size=19, embedding_dim=12, use_hash=False, dtype='int32', embedding_name='genre', group_name='default_group'), maxlen=6, combiner='mean', length_name=None)]
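
Conceptually, `combiner='mean'` turns a padded list of genre ids into a single vector by averaging the genre embeddings. The NumPy sketch below shows the idea with a random embedding table. It assumes padding id 0 is masked out of the average, and it illustrates the concept rather than DeepCTR-Torch's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 19, 12          # sizes from the genre feature column above
table = rng.normal(size=(vocab_size, emb_dim))  # stand-in embedding table

# Two padded genre sequences (0 is the padding id).
ids = np.array([[1, 0, 0, 0, 0, 0],
                [2, 3, 4, 0, 0, 0]])

emb = table[ids]                                 # (batch, maxlen, emb_dim)
mask = (ids != 0).astype(float)[..., None]       # (batch, maxlen, 1)

# Sum only the non-padding positions, then divide by the true sequence length.
pooled = (emb * mask).sum(axis=1) / mask.sum(axis=1)

print(pooled.shape)
```

Without the mask, padded positions would pull every average toward the embedding of id 0, which is why the true length matters.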

## 4. Combine Components
We combine the components built above.

In [0]:
linear_feature_columns = fixlen_feature_columns + varlen_feature_columns
dnn_feature_columns = fixlen_feature_columns + varlen_feature_columns

In [27]:
linear_feature_columns

[SparseFeat(name='iid', vocabulary_size=3706, embedding_dim=42, use_hash=False, dtype='int32', embedding_name='iid', group_name='default_group'),
 SparseFeat(name='uid', vocabulary_size=6040, embedding_dim=48, use_hash=False, dtype='int32', embedding_name='uid', group_name='default_group'),
 SparseFeat(name='Gender', vocabulary_size=2, embedding_dim=6, use_hash=False, dtype='int32', embedding_name='Gender', group_name='default_group'),
 SparseFeat(name='Age', vocabulary_size=7, embedding_dim=6, use_hash=False, dtype='int32', embedding_name='Age', group_name='default_group'),
 SparseFeat(name='Occupation', vocabulary_size=21, embedding_dim=12, use_hash=False, dtype='int32', embedding_name='Occupation', group_name='default_group'),
 SparseFeat(name='zipcode', vocabulary_size=3439, embedding_dim=42, use_hash=False, dtype='int32', embedding_name='zipcode', group_name='default_group'),
 VarLenSparseFeat(sparsefeat=SparseFeat(name='genre', vocabulary_size=19, embedding_dim=12, use_hash=False

In [26]:
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
feature_names

['iid', 'uid', 'Gender', 'Age', 'Occupation', 'zipcode', 'genre']

## 5. data split & generate input data for model
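Strictly speaking, the data should first be split into training and test sets, but here we skip that step and only evaluate. To match the input format the model expects, we build the input as follows.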

In [0]:
model_input = {name: data[name] for name in sparse_features}  
model_input["genre"] = genres_list
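
If a held-out test set is wanted, the dict-of-arrays input has to be split with one shared set of row indices so that every feature stays aligned with the labels. A minimal sketch with toy arrays follows; `split_model_input`, `test_ratio`, and `seed` are hypothetical names, not part of DeepCTR.

```python
import numpy as np

def split_model_input(model_input, labels, test_ratio=0.2, seed=42):
    """Split a dict of equally long arrays and the labels with shared indices."""
    n = len(labels)
    perm = np.random.default_rng(seed).permutation(n)
    n_test = int(n * test_ratio)
    test_idx, train_idx = perm[:n_test], perm[n_test:]

    def take(idx):
        # np.asarray also accepts pandas Series, which the real dict contains.
        return {name: np.asarray(values)[idx] for name, values in model_input.items()}

    return take(train_idx), take(test_idx), labels[train_idx], labels[test_idx]

# Toy input: 10 rows, one scalar feature and one padded sequence feature.
toy_input = {"uid": np.arange(10), "genre": np.arange(60).reshape(10, 6)}
toy_labels = np.arange(10)

train_in, test_in, y_train, y_test = split_model_input(toy_input, toy_labels)
print(train_in["uid"].shape, test_in["genre"].shape)
```

The resulting `train_in` and `y_train` could then be passed to `model.fit`, with `test_in` reserved for a final evaluation.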

In [29]:
# The input is passed as a dict
model_input

{'Age': 0          0
 1          0
 2          0
 3          0
 4          0
           ..
 1000204    2
 1000205    2
 1000206    2
 1000207    2
 1000208    2
 Name: Age, Length: 1000209, dtype: int64, 'Gender': 0          0
 1          0
 2          0
 3          0
 4          0
           ..
 1000204    1
 1000205    1
 1000206    1
 1000207    1
 1000208    1
 Name: Gender, Length: 1000209, dtype: int64, 'Occupation': 0          10
 1          10
 2          10
 3          10
 4          10
            ..
 1000204     6
 1000205     6
 1000206     6
 1000207     6
 1000208     6
 Name: Occupation, Length: 1000209, dtype: int64, 'genre': array([[ 1,  0,  0,  0,  0,  0],
        [ 2,  3,  4,  0,  0,  0],
        [ 4,  5,  0,  0,  0,  0],
        ...,
        [ 6,  1,  0,  0,  0,  0],
        [ 1,  0,  0,  0,  0,  0],
        [ 3,  1,  9, 10,  0,  0]], dtype=int32), 'iid': 0          1104
 1           639
 2           853
 3          3177
 4          2162
            ... 
 1000204   

## 6. Define Device & Model

In [30]:
device = 'cpu'
use_cuda = True
if use_cuda and torch.cuda.is_available():
    print('cuda ready...')
    device = 'cuda:0'

cuda ready...


In [31]:
model = WDL(linear_feature_columns, 
            dnn_feature_columns, 
            task='regression', 
            device=device)
model

WDL(
  (embedding_dict): ModuleDict(
    (Age): Embedding(7, 6)
    (Gender): Embedding(2, 6)
    (Occupation): Embedding(21, 12)
    (genre): Embedding(19, 12)
    (iid): Embedding(3706, 42)
    (uid): Embedding(6040, 48)
    (zipcode): Embedding(3439, 42)
  )
  (linear_model): Linear(
    (embedding_dict): ModuleDict(
      (Age): Embedding(7, 1)
      (Gender): Embedding(2, 1)
      (Occupation): Embedding(21, 1)
      (genre): Embedding(19, 1)
      (iid): Embedding(3706, 1)
      (uid): Embedding(6040, 1)
      (zipcode): Embedding(3439, 1)
    )
  )
  (out): PredictionLayer()
  (dnn): DNN(
    (dropout): Dropout(p=0, inplace=False)
    (linears): ModuleList(
      (0): Linear(in_features=168, out_features=256, bias=True)
      (1): Linear(in_features=256, out_features=128, bias=True)
    )
    (activation_layers): ModuleList(
      (0): ReLU(inplace=True)
      (1): ReLU(inplace=True)
    )
  )
  (dnn_linear): Linear(in_features=128, out_features=1, bias=False)
)
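
The `in_features=168` of the first DNN layer is the sum of the embedding dimensions of all seven features, since the deep part receives their concatenation. The arithmetic below uses only numbers printed in the model summary; the parameter count is a small worked extension of the same figures.

```python
# Embedding dimensions copied from the embedding_dict in the model summary.
embedding_dims = {"Age": 6, "Gender": 6, "Occupation": 12, "genre": 12,
                  "iid": 42, "uid": 48, "zipcode": 42}

# The DNN input is the concatenation of all feature embeddings.
dnn_input_width = sum(embedding_dims.values())

# Weights plus biases of the two hidden Linear layers (168 -> 256 -> 128).
hidden_units = [256, 128]
dnn_params = 0
fan_in = dnn_input_width
for fan_out in hidden_units:
    dnn_params += fan_in * fan_out + fan_out
    fan_in = fan_out

print(dnn_input_width, dnn_params)
```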

In [0]:
# https://github.com/shenweichen/DeepCTR-Torch/blob/master/deepctr_torch/models/basemodel.py
model.compile("adam", 
              "mse", 
              metrics=['mse'], )

## 7. Evaluate & Predict

In [33]:
history = model.fit(model_input,
                    data[target].values,
                    batch_size=256, 
                    epochs=10, 
                    verbose=1, 
                    validation_split=0.2)

cuda:0
Train on 800167 samples, validate on 200042 samples, 3126 steps per epoch
Epoch 1/10
32s - loss:  0.9661 - mse:  0.9661 - val_mse:  0.9919
Epoch 2/10
32s - loss:  0.8363 - mse:  0.8363 - val_mse:  0.9823
Epoch 3/10
32s - loss:  0.8300 - mse:  0.8300 - val_mse:  0.9838
Epoch 4/10
32s - loss:  0.8247 - mse:  0.8248 - val_mse:  0.9834
Epoch 5/10
32s - loss:  0.8089 - mse:  0.8089 - val_mse:  0.9960
Epoch 6/10
32s - loss:  0.7887 - mse:  0.7887 - val_mse:  0.9856
Epoch 7/10
32s - loss:  0.7671 - mse:  0.7671 - val_mse:  1.0286
Epoch 8/10
32s - loss:  0.7486 - mse:  0.7486 - val_mse:  0.9974
Epoch 9/10
32s - loss:  0.7346 - mse:  0.7346 - val_mse:  1.0010
Epoch 10/10
31s - loss:  0.7213 - mse:  0.7213 - val_mse:  1.0120
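
One way to read the log is to find the epoch with the lowest validation MSE and convert it to RMSE, which is on the same scale as the 1 to 5 ratings. The values below are copied from the training output; the interpretation in the note is a reading of these numbers, not a result reported by the notebook.

```python
import math

# Per-epoch metrics copied from the training log above.
train_mse = [0.9661, 0.8363, 0.8300, 0.8248, 0.8089,
             0.7887, 0.7671, 0.7486, 0.7346, 0.7213]
val_mse = [0.9919, 0.9823, 0.9838, 0.9834, 0.9960,
           0.9856, 1.0286, 0.9974, 1.0010, 1.0120]

# Epoch numbers are 1-based in the log.
best_epoch = min(range(len(val_mse)), key=val_mse.__getitem__) + 1
best_val_rmse = math.sqrt(val_mse[best_epoch - 1])

# Gap between validation and training MSE at the last epoch.
final_gap = val_mse[-1] - train_mse[-1]

print(best_epoch, round(best_val_rmse, 4), round(final_gap, 4))
```

Training MSE keeps falling while validation MSE bottoms out early, which suggests the model starts overfitting after the first few epochs; fewer epochs or some regularization may be worth trying.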
