# Wide & Deep Learning for Recommender Systems

- Paper published by Google, based on its App Store ([link](https://arxiv.org/pdf/1606.07792.pdf))
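
As a rough sketch of the idea in the paper: the wide part is a linear model over sparse (and crossed) features, the deep part is a feed-forward network over dense inputs, and the two logits are summed before a sigmoid. The function name, shapes, and random weights below are illustrative assumptions, not code from the paper or from any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wide_deep_forward(x_wide, x_deep, w_wide, W1, b1, w_deep, bias):
    """Joint prediction: sigmoid(wide logit + deep logit + bias)."""
    wide_logit = x_wide @ w_wide                 # linear model over binary features
    hidden = np.maximum(0.0, x_deep @ W1 + b1)   # a single ReLU hidden layer
    deep_logit = hidden @ w_deep
    return sigmoid(wide_logit + deep_logit + bias)

# Made-up inputs: 4 examples, 6 binary wide features, 8 dense deep features
rng = np.random.default_rng(0)
x_wide = rng.integers(0, 2, size=(4, 6)).astype(float)
x_deep = rng.normal(size=(4, 8))
probs = wide_deep_forward(
    x_wide, x_deep,
    w_wide=rng.normal(size=6),
    W1=rng.normal(size=(8, 5)), b1=np.zeros(5),
    w_deep=rng.normal(size=5), bias=0.0,
)
print(probs.shape)
```

Both parts feed one output, so in training they would share a single loss. The library used later in this notebook wires this up through `WideDeep`.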


## Official Google resources
- Google AI Blog ([link](https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html))
- Google's TensorFlow GitHub ([link](https://github.com/tensorflow/tensorflow/blob/v2.4.0/tensorflow/python/keras/premade/wide_deep.py#L34-L219))
- TensorFlow v2.4 API
  - [tf.keras.experimental.WideDeepModel](https://www.tensorflow.org/api_docs/python/tf/keras/experimental/WideDeepModel?hl=en#methods_2)
  - [tf.estimator.DNNLinearCombinedClassifier](https://www.tensorflow.org/api_docs/python/tf/estimator/DNNLinearCombinedClassifier)

## A PyTorch library worth a look
- [pytorch-widedeep](https://github.com/jrzaurin/pytorch-widedeep)

In [1]:
!pip install pytorch-widedeep

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytorch-widedeep
  Downloading pytorch_widedeep-1.2.2-py3-none-any.whl (21.1 MB)
Collecting fastparquet>=0.8.1
  Downloading fastparquet-2023.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Collecting torchmetrics
  Downloading torchmetrics-0.11.4-py3-none-any.whl (519 kB)
Collecting einops
  Downloading einops-0.6.0-py3-none-any.whl (41 kB)
Collecting pandas>=1.3.5
  Downloadin

In [2]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## DataLoader

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
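data_path = "/content/drive/My Drive/data/kmrd"

if not os.path.exists(data_path):
  # First run: clone the KMRD repository into data_path, then install from there
  !git clone https://github.com/lovit/kmrd "$data_path"
  %cd $data_path
  !python setup.py install
else:
  %cd $data_path
  print("data and path already exists!")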

path = data_path + '/kmr_dataset/datafile/kmrd-small'

/content/drive/My Drive/data/kmrd
data and path already exists!


In [5]:
df = pd.read_csv(os.path.join(path,'rates.csv'))
train_df, test_df = train_test_split(df, test_size=0.2, random_state=1234, shuffle=True)



In [6]:
print(train_df.shape,test_df.shape)
train_df.head()

(112568, 4) (28142, 4)


,user,movie,rate,time
137023,48423,10764,10,1212241560
92868,17307,10170,10,1122185220
94390,18180,10048,10,1573403460
22289,1498,10001,9,1432684500
80155,12541,10022,10,1370458140


In [7]:
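# Work on a subset; .copy() avoids SettingWithCopyWarning when columns are added later
train_df = train_df[:10000].copy()
test_df = test_df[:1000].copy()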

In [8]:
# Load all related dataframe
movies_df = pd.read_csv(os.path.join(path, 'movies.txt'), sep='\t', encoding='utf-8')
movies_df = movies_df.set_index('movie')

castings_df = pd.read_csv(os.path.join(path, 'castings.csv'), encoding='utf-8')
countries_df = pd.read_csv(os.path.join(path, 'countries.csv'), encoding='utf-8')
genres_df = pd.read_csv(os.path.join(path, 'genres.csv'), encoding='utf-8')

# Get genre information: one '/'-joined genre string per movie
genres = [(movie, '/'.join(x['genre'].values)) for movie, x in genres_df.groupby('movie')]
combined_genres_df = pd.DataFrame(data=genres, columns=['movie', 'genres'])
combined_genres_df = combined_genres_df.set_index('movie')

# Get castings information: an array of people ids per movie
castings = [(movie, x['people'].values) for movie, x in castings_df.groupby('movie')]
combined_castings_df = pd.DataFrame(data=castings, columns=['movie','people'])
combined_castings_df = combined_castings_df.set_index('movie')

# Get countries for movie information: one ','-joined country string per movie
countries = [(movie, ','.join(x['country'].values)) for movie, x in countries_df.groupby('movie')]
combined_countries_df = pd.DataFrame(data=countries, columns=['movie', 'country'])
combined_countries_df = combined_countries_df.set_index('movie')

movies_df = pd.concat([movies_df, combined_genres_df, combined_castings_df, combined_countries_df], axis=1)

print(movies_df.shape)
print(movies_df.head())

(999, 7)
                      title                           title_eng    year  \
movie                                                                     
10001                시네마 천국              Cinema Paradiso , 1988  2013.0   
10002              빽 투 더 퓨쳐           Back To The Future , 1985  2015.0   
10003            빽 투 더 퓨쳐 2    Back To The Future Part 2 , 1989  2015.0   
10004            빽 투 더 퓨쳐 3  Back To The Future Part III , 1990  1990.0   
10005  스타워즈 에피소드 4 - 새로운 희망                    Star Wars , 1977  1997.0   

         grade         genres  \
movie                           
10001   전체 관람가     드라마/멜로/로맨스   
10002  12세 관람가         SF/코미디   
10003  12세 관람가         SF/코미디   
10004   전체 관람가  서부/SF/판타지/코미디   
10005       PG   판타지/모험/SF/액션   

                                                  people   country  
movie                                                               
10001  [4374, 178, 3241, 47952, 47953, 19538, 18991, ...  이탈리아,프랑스  
10002    [1076, 4603, 917,

In [None]:
movies_df.columns

Index(['title', 'title_eng', 'year', 'grade', 'genres', 'people', 'country'], dtype='object')

In [None]:
movies_df['genres'].str.get_dummies(sep='/')

movie,SF,가족,공포,느와르,다큐멘터리,드라마,로맨스,멜로,모험,뮤지컬,...,범죄,서부,서사,스릴러,애니메이션,액션,에로,전쟁,코미디,판타지
10001,0,0,0,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
10002,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
10003,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
10004,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,1
10005,1,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
10996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
10997,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10998,0,0,0,0,0,1,0,0,1,0,...,0,0,0,1,0,1,0,0,0,0
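
To make the multi-hot encoding above concrete, here is a toy example of `Series.str.get_dummies` with a separator. The genre strings and index values are made up for illustration and do not come from the KMRD data.

```python
import pandas as pd

# Hypothetical genre strings, one '/'-separated string per movie id
toy = pd.Series(["drama/romance", "sf/comedy", "sf"], index=[1, 2, 3], name="genres")

# Each distinct token becomes a column; a row gets 1 where its string contains that token
multi_hot = toy.str.get_dummies(sep="/")
print(multi_hot)
```

A movie with several genres gets several 1s in its row, which is why this differs from a plain one-hot encoding.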


In [9]:
dummy_genres_df = movies_df['genres'].str.get_dummies(sep='/')
train_genres_df = train_df['movie'].apply(lambda x: dummy_genres_df.loc[x])
test_genres_df = test_df['movie'].apply(lambda x: dummy_genres_df.loc[x])
print(train_genres_df.head(), test_genres_df.head())

        SF  가족  공포  느와르  다큐멘터리  드라마  로맨스  멜로  모험  뮤지컬  ...  범죄  서부  서사  스릴러  \
137023   0   0   0    0      0    1    1   1   0    0  ...   0   0   0    0   
92868    0   0   0    0      0    0    0   0   0    0  ...   0   0   0    0   
94390    0   0   0    0      0    1    0   0   0    0  ...   0   0   0    0   
22289    0   0   0    0      0    1    1   1   0    0  ...   0   0   0    0   
80155    0   0   0    0      0    1    0   0   0    0  ...   0   0   0    0   

        애니메이션  액션  에로  전쟁  코미디  판타지  
137023      0   0   0   0    0    0  
92868       0   1   0   0    0    0  
94390       0   0   0   0    0    0  
22289       0   0   0   0    0    0  
80155       0   1   0   0    0    0  

[5 rows x 21 columns]         SF  가족  공포  느와르  다큐멘터리  드라마  로맨스  멜로  모험  뮤지컬  ...  범죄  서부  서사  스릴러  \
76196    0   0   0    0      0    0    0   0   0    0  ...   1   0   0    1   
109800   0   0   0    0      0    1    1   1   0    1  ...   0   0   0    0   
60479    1   0   0    0      0    0  

In [10]:
dummy_grade_df = pd.get_dummies(movies_df['grade'], prefix='grade')
train_grade_df = train_df['movie'].apply(lambda x: dummy_grade_df.loc[x])
test_grade_df = test_df['movie'].apply(lambda x: dummy_grade_df.loc[x])
train_grade_df.head()

,grade_12세 관람가,grade_15세 관람가,grade_G,grade_NR,grade_PG,grade_PG-13,grade_R,grade_전체 관람가,grade_청소년 관람불가
137023,1,0,0,0,0,0,0,0,0
92868,0,0,0,0,1,0,0,0,0
94390,1,0,0,0,0,0,0,0,0
22289,0,0,0,0,0,0,0,1,0
80155,1,0,0,0,0,0,0,0,0


In [None]:
movies_df

movie,title,title_eng,year,grade,genres,people,country
10001,시네마 천국,"Cinema Paradiso , 1988",2013.0,전체 관람가,드라마/멜로/로맨스,"[4374, 178, 3241, 47952, 47953, 19538, 18991, ...","이탈리아,프랑스"
10002,빽 투 더 퓨쳐,"Back To The Future , 1985",2015.0,12세 관람가,SF/코미디,"[1076, 4603, 917, 8637, 5104, 9986, 7470, 9987]",미국
10003,빽 투 더 퓨쳐 2,"Back To The Future Part 2 , 1989",2015.0,12세 관람가,SF/코미디,"[1076, 4603, 917, 5104, 391, 5106, 5105, 5107,...",미국
10004,빽 투 더 퓨쳐 3,"Back To The Future Part III , 1990",1990.0,전체 관람가,서부/SF/판타지/코미디,"[1076, 4603, 1031, 5104, 10001, 5984, 10002, 1...",미국
10005,스타워즈 에피소드 4 - 새로운 희망,"Star Wars , 1977",1997.0,PG,판타지/모험/SF/액션,"[1007, 535, 215, 1236, 35]",미국
...,...,...,...,...,...,...,...
10995,공포의 여정,"Journey Into Fear , 1975",,PG,스릴러,"[2464, 16573, 2101, 10619, 17815, 17814, 16848...",미국
10996,버스틴 루즈,"Bustin' Loose , 1981",,R,코미디,"[9598, 6520, 506, 11123]",미국
10997,블랙 엔젤,"Mausoleum , 1983",,청소년 관람불가,공포,"[198255, 17831, 10233, 140473, 31534, 200668, ...",미국
10998,폭주 기관차,"Runaway Train , 1985",1989.0,15세 관람가,드라마/액션/모험/스릴러,"[793, 412, 1284, 17833, 15383, 14165, 13856, 7...",미국


In [11]:
# Look up each rated movie's year; each frame must be built from its own rows
train_df['year'] = train_df.apply(lambda x: movies_df.loc[x['movie']]['year'], axis=1)
test_df['year'] = test_df.apply(lambda x: movies_df.loc[x['movie']]['year'], axis=1)
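
The row-wise lookup above can also be written as a vectorized `Series.map`, which tends to be faster on larger frames. This is a minimal sketch on made-up data; the `movies` and `ratings` frames are hypothetical stand-ins for `movies_df` and `train_df`/`test_df`.

```python
import pandas as pd

# Hypothetical movie table indexed by movie id, with one missing year
movies = pd.DataFrame(
    {"year": [1988.0, 1985.0, None]},
    index=pd.Index([10001, 10002, 10003], name="movie"),
)
# Hypothetical ratings that reference those movie ids
ratings = pd.DataFrame({"user": [1, 2, 3], "movie": [10002, 10001, 10003]})

# map() uses the Series index as a lookup table: movie id -> year
ratings["year"] = ratings["movie"].map(movies["year"])
print(ratings)
```

Movies with no year in the lookup table simply yield `NaN`, the same as the row-wise version would.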

In [None]:
print(train_df.head(), test_df.head())

In [12]:
train_df = pd.concat([train_df, train_grade_df, train_genres_df], axis=1)
test_df = pd.concat([test_df, test_grade_df, test_genres_df], axis=1)
train_df.head()

,user,movie,rate,time,year,grade_12세 관람가,grade_15세 관람가,grade_G,grade_NR,grade_PG,...,범죄,서부,서사,스릴러,애니메이션,액션,에로,전쟁,코미디,판타지
137023,48423,10764,10,1212241560,1987.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
92868,17307,10170,10,1122185220,1985.0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
94390,18180,10048,10,1573403460,2016.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
22289,1498,10001,9,1432684500,2013.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80155,12541,10022,10,1370458140,1980.0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [13]:
wide_cols = list(dummy_genres_df.columns) + list(dummy_grade_df.columns)
wide_cols

['SF',
 '가족',
 '공포',
 '느와르',
 '다큐멘터리',
 '드라마',
 '로맨스',
 '멜로',
 '모험',
 '뮤지컬',
 '미스터리',
 '범죄',
 '서부',
 '서사',
 '스릴러',
 '애니메이션',
 '액션',
 '에로',
 '전쟁',
 '코미디',
 '판타지',
 'grade_12세 관람가',
 'grade_15세 관람가',
 'grade_G',
 'grade_NR',
 'grade_PG',
 'grade_PG-13',
 'grade_R',
 'grade_전체 관람가',
 'grade_청소년 관람불가']

In [15]:
print(len(wide_cols))
print(wide_cols)

# Keep only the first five columns; the cells below use this reduced list
wide_cols = wide_cols[:5]

30
['SF', '가족', '공포', '느와르', '다큐멘터리', '드라마', '로맨스', '멜로', '모험', '뮤지컬', '미스터리', '범죄', '서부', '서사', '스릴러', '애니메이션', '액션', '에로', '전쟁', '코미디', '판타지', 'grade_12세 관람가', 'grade_15세 관람가', 'grade_G', 'grade_NR', 'grade_PG', 'grade_PG-13', 'grade_R', 'grade_전체 관람가', 'grade_청소년 관람불가']


In [16]:
# wide_cols = ['genre', 'grade']
# cross_cols = [('genre', 'grade')]
wide_cols

['SF', '가족', '공포', '느와르', '다큐멘터리']

In [17]:
from itertools import product

# Pair each wide column with every element of each 5-tuple drawn from wide_cols
unique_combinations = list(list(zip(wide_cols, element))
                           for element in product(wide_cols, repeat=len(wide_cols)))

print(unique_combinations)
# Flatten, drop self-pairs such as ('SF', 'SF'), and de-duplicate
cross_cols = [item for sublist in unique_combinations for item in sublist]
cross_cols = [x for x in cross_cols if x[0] != x[1]]
cross_cols = list(set(cross_cols))
print(cross_cols)

[[('SF', 'SF'), ('가족', 'SF'), ('공포', 'SF'), ('느와르', 'SF'), ('다큐멘터리', 'SF')], [('SF', 'SF'), ('가족', 'SF'), ('공포', 'SF'), ('느와르', 'SF'), ('다큐멘터리', '가족')], [('SF', 'SF'), ('가족', 'SF'), ('공포', 'SF'), ('느와르', 'SF'), ('다큐멘터리', '공포')], [('SF', 'SF'), ('가족', 'SF'), ('공포', 'SF'), ('느와르', 'SF'), ('다큐멘터리', '느와르')], [('SF', 'SF'), ('가족', 'SF'), ('공포', 'SF'), ('느와르', 'SF'), ('다큐멘터리', '다큐멘터리')], [('SF', 'SF'), ('가족', 'SF'), ('공포', 'SF'), ('느와르', '가족'), ('다큐멘터리', 'SF')], [('SF', 'SF'), ('가족', 'SF'), ('공포', 'SF'), ('느와르', '가족'), ('다큐멘터리', '가족')], [('SF', 'SF'), ('가족', 'SF'), ('공포', 'SF'), ('느와르', '가족'), ('다큐멘터리', '공포')], [('SF', 'SF'), ('가족', 'SF'), ('공포', 'SF'), ('느와르', '가족'), ('다큐멘터리', '느와르')], [('SF', 'SF'), ('가족', 'SF'), ('공포', 'SF'), ('느와르', '가족'), ('다큐멘터리', '다큐멘터리')], [('SF', 'SF'), ('가족', 'SF'), ('공포', 'SF'), ('느와르', '공포'), ('다큐멘터리', 'SF')], [('SF', 'SF'), ('가족', 'SF'), ('공포', 'SF'), ('느와르', '공포'), ('다큐멘터리', '가족')], [('SF', 'SF'), ('가족', 'SF'), ('공포', 'SF'), ('느와르', '공포'), ('다큐멘터리', '공포')], [('
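
The cell above enumerates all 5^5 tuples only to end up with every ordered pair of distinct columns. As a lighter alternative, `itertools.permutations` yields the same set of pairs directly. This is a sketch, not the notebook's original method, and the list order differs from the `set`-based result.

```python
from itertools import permutations

wide_cols = ['SF', '가족', '공포', '느와르', '다큐멘터리']

# All ordered pairs (a, b) with a != b
cross_cols = list(permutations(wide_cols, 2))
print(len(cross_cols))
print(cross_cols[:3])
```

If the order within a pair carries no extra information for the model, `itertools.combinations(wide_cols, 2)` would halve the list; whether that is appropriate here is a judgment call.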

In [18]:
# embed_cols = [('genre', 16),('grade', 16)]
embed_cols = list(set([(x[0], 16) for x in cross_cols]))  # give each column an embedding dim of 16
continuous_cols = ['year']

print(embed_cols)
print(continuous_cols)

[('가족', 16), ('공포', 16), ('다큐멘터리', 16), ('SF', 16), ('느와르', 16)]
['year']


In [19]:
target = train_df['rate'].apply(lambda x: 1 if x > 9 else 0).values
target

array([1, 1, 1, ..., 1, 1, 1])
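
The target marks a rating as positive only when it is above 9. The same rule can be written as a vectorized comparison, and checking the share of positives is a cheap way to see how imbalanced the labels are. The ratings below are made up; in the notebook the input would be `train_df['rate']`.

```python
import numpy as np

# Hypothetical ratings on the 1-10 scale
rates = np.array([10, 9, 10, 7, 10])

# Same rule as the cell above: 1 if rate > 9, else 0
target = (rates > 9).astype(int)
positive_share = target.mean()
print(target, positive_share)
```

With a very skewed share, plain accuracy can look good even for a model that always predicts the majority class.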

## Wide & Deep
+ See the quick start at https://github.com/jrzaurin/pytorch-widedeep
    + The code has changed from earlier versions. Some of the Deep components have been renamed to Tab

In [None]:
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split

from pytorch_widedeep import Trainer
from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
from pytorch_widedeep.models import Wide, TabMlp, WideDeep
from pytorch_widedeep.metrics import Accuracy


# prepare the data
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
X_wide = wide_preprocessor.fit_transform(train_df)

tab_preprocessor = TabPreprocessor(
    # Select categorical columns, continuous columns, or both
    cat_embed_cols=embed_cols
    # , continuous_cols=continuous_cols
)
X_tab = tab_preprocessor.fit_transform(train_df)



# build the model
wide = Wide(input_dim=np.unique(X_wide).shape[0], pred_dim=1)
tab_mlp = TabMlp(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    #continuous_cols=continuous_cols,
)
model = WideDeep(wide=wide, deeptabular=tab_mlp)


# train and validate
trainer = Trainer(model, objective="binary", metrics=[Accuracy])
trainer.fit(
    X_wide=X_wide,
    X_tab=X_tab,
    target=target,
    n_epochs=5,
    batch_size=256,
)


  0%|          | 0/40 [00:00<?, ?it/s]

In [None]:

# predict on test
X_wide_te = wide_preprocessor.transform(test_df)
X_tab_te = tab_preprocessor.transform(test_df)
preds = trainer.predict(X_wide=X_wide_te, X_tab=X_tab_te)
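
To score the predictions, the test labels need the same `rate > 9` rule that built the training target. The sketch below assumes `preds` holds 0/1 class labels, which is worth checking against the installed pytorch-widedeep version. `binary_accuracy` and the demo arrays are hypothetical helpers, not part of the library.

```python
import numpy as np

def binary_accuracy(preds, y_true):
    """Share of positions where the predicted label equals the true label."""
    preds = np.asarray(preds).ravel()
    y_true = np.asarray(y_true).ravel()
    return float((preds == y_true).mean())

# Hypothetical stand-ins; in the notebook these would be
#   preds  -> output of trainer.predict(...)
#   y_true -> (test_df['rate'] > 9).astype(int).values
preds_demo = np.array([1, 0, 1, 1])
y_demo = np.array([1, 1, 1, 0])
print(binary_accuracy(preds_demo, y_demo))  # 0.5
```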

## The code below is for the old library version
+ The deep part has changed

In [None]:
from pytorch_widedeep.preprocessing import WidePreprocessor, DensePreprocessor
from pytorch_widedeep.models import Wide, DeepDense, WideDeep
from pytorch_widedeep.metrics import Accuracy

ImportError: ignored

### Wide Component

In [None]:
preprocess_wide = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
X_wide = preprocess_wide.fit_transform(train_df)
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)

NameError: ignored

In [None]:
X_wide.size

9000

In [None]:
wide

Wide(
  (wide_linear): Embedding(29, 1, padding_idx=0)
)

### Deep Component

In [None]:
preprocess_deep = DensePreprocessor(embed_cols=embed_cols, continuous_cols=continuous_cols)
X_deep = preprocess_deep.fit_transform(train_df)
deepdense = DeepDense(
    hidden_layers=[64, 32],
    deep_column_idx=preprocess_deep.deep_column_idx,
    embed_input=preprocess_deep.embeddings_input,
    continuous_cols=continuous_cols,
)

In [None]:
deepdense

DeepDense(
  (embed_layers): ModuleDict(
    (emb_layer_가족): Embedding(3, 16)
    (emb_layer_공포): Embedding(3, 16)
    (emb_layer_SF): Embedding(3, 16)
  )
  (embed_dropout): Dropout(p=0.0, inplace=False)
  (dense): Sequential(
    (dense_layer_0): Sequential(
      (0): Linear(in_features=49, out_features=64, bias=True)
      (1): LeakyReLU(negative_slope=0.01, inplace=True)
      (2): Dropout(p=0.0, inplace=False)
    )
    (dense_layer_1): Sequential(
      (0): Linear(in_features=64, out_features=32, bias=True)
      (1): LeakyReLU(negative_slope=0.01, inplace=True)
      (2): Dropout(p=0.0, inplace=False)
    )
  )
)

### Build and Train

In [None]:
# build, compile and fit
model = WideDeep(wide=wide, deepdense=deepdense)
model.compile(method="binary", metrics=[Accuracy])
model.fit(
    X_wide=X_wide,
    X_deep=X_deep,
    target=target,
    n_epochs=5,
    batch_size=256,
    val_split=0.1,
)

  0%|          | 0/4 [00:00<?, ?it/s]

Training


epoch 1: 100%|██████████| 4/4 [00:00<00:00, 10.65it/s, loss=nan, metrics={'acc': 0.1556}]
valid: 100%|██████████| 1/1 [00:00<00:00, 15.18it/s, loss=nan, metrics={'acc': 0.14}]
epoch 2: 100%|██████████| 4/4 [00:00<00:00, 29.81it/s, loss=nan, metrics={'acc': 0.0}]
valid: 100%|██████████| 1/1 [00:00<00:00, 15.34it/s, loss=nan, metrics={'acc': 0.0}]
epoch 3: 100%|██████████| 4/4 [00:00<00:00, 30.95it/s, loss=nan, metrics={'acc': 0.0}]
valid: 100%|██████████| 1/1 [00:00<00:00, 14.27it/s, loss=nan, metrics={'acc': 0.0}]
epoch 4: 100%|██████████| 4/4 [00:00<00:00, 30.49it/s, loss=nan, metrics={'acc': 0.0}]
valid: 100%|██████████| 1/1 [00:00<00:00, 14.39it/s, loss=nan, metrics={'acc': 0.0}]
epoch 5: 100%|██████████| 4/4 [00:00<00:00, 27.75it/s, loss=nan, metrics={'acc': 0.0}]
valid: 100%|██████████| 1/1 [00:00<00:00, 14.39it/s, loss=nan, metrics={'acc': 0.0}]
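
The log above shows `loss=nan` from the first epoch. The notebook does not say why. One possible cause, offered only as a guess, is missing values in the continuous `year` column, since the `movies_df` output earlier shows movies without a year. Below is a minimal check-and-impute sketch on made-up data; the `demo` frame is hypothetical, and median imputation is just one simple choice.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for train_df with a continuous 'year' column
demo = pd.DataFrame({"year": [1988.0, np.nan, 2015.0, np.nan]})

# Count missing values before feeding the column to a preprocessor
n_missing = int(demo["year"].isna().sum())
print("missing years:", n_missing)

# Fill the gaps with the median of the observed years
demo["year"] = demo["year"].fillna(demo["year"].median())
print(demo)
```

If the count is zero on the real data, the cause lies elsewhere and this step can be dropped.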


In [None]:
X_deep.shape

In [None]:
X_wide.shape