<font color='tomato'><font color="#CC3D3D"><p>
# Cornac
https://cornac.preferred.ai/

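- **Cornac** is a comparative framework for multimodal recommender systems
- It focuses on making it convenient to work with models that use side information (e.g., item description text and images, social networks)
- New models can be implemented simply and experimented with quickly
- It is highly compatible with existing machine learning libraries (e.g., TensorFlow, PyTorch)
- It is one of the frameworks recommended by ACM RecSys 2023 for the evaluation and reproducibility of recommendation algorithms
- Cornac's experiment workflow    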
<img src=cornac_flow.jpg>

### Setup

In [9]:
!pip install numpy


In [10]:
!pip install cornac


In [4]:
!pip install tqdm

Collecting tqdm
  Using cached tqdm-4.67.0-py3-none-any.whl.metadata (57 kB)
Using cached tqdm-4.67.0-py3-none-any.whl (78 kB)
Installing collected packages: tqdm
Successfully installed tqdm-4.67.0


In [11]:
import pandas as pd
import numpy as np

# MS recommenders API 
# import sys
# sys.path.append('D:\\강의\\2023-2\\추천시스템')  # Change this to the folder where you extracted msr.zip (run pwd in a cell to check).
                                                # On Windows, write the folder separator as / or \\.
from cornac_utils import predict_ranking
from msr.python_splitters import python_stratified_split

# Cornac API 
import cornac
print(f"Cornac version: {cornac.__version__}")
from cornac.eval_methods import BaseMethod, RatioSplit, StratifiedSplit, CrossValidation
from cornac.models import NeuMF, VAECF, EASE, UserKNN, ItemKNN, MF
from cornac.metrics import Precision, Recall, NDCG, AUC, MAP
#from cornac.hyperopt import Discrete, Continuous
#from cornac.hyperopt import GridSearch, RandomSearch

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
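
The `ValueError` above is commonly raised when a compiled extension was built against a different NumPy release than the one installed at runtime. A minimal sketch for inspecting the installed versions before importing, using only the standard library; the helper name `installed_version` is hypothetical and not part of Cornac or NumPy:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions of the packages this notebook relies on.
for name in ("numpy", "cornac", "tqdm"):
    print(f"{name}: {installed_version(name)}")
```

If the reported versions look mismatched, reinstalling both packages into a fresh environment is one common remedy, though the exact fix depends on how the packages were built.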

In [3]:
# Data column definition
DEFAULT_USER_COL = 'resume_seq'
DEFAULT_ITEM_COL = 'recruitment_seq'
DEFAULT_RATING_COL = 'rating'
DEFAULT_PREDICTION_COL = 'prediction'

# Top k items to recommend
TOP_K = 5

# Random seed, Verbose, etc.
SEED = 202311
VERBOSE = True

### Data Preparation

In [4]:
# Load the data
data = pd.read_csv('apply_train.csv')
data[DEFAULT_RATING_COL] = 1  # Add a constant rating to match Cornac's UIR (User, Item, Rating) data format

In [5]:
# Split the data
train, test = python_stratified_split(
    data, 
    filter_by="user", 
    ratio=0.7,
    col_user=DEFAULT_USER_COL, col_item=DEFAULT_ITEM_COL,
    seed=SEED
)

print(
    "ratings per train user: ", train.groupby(DEFAULT_USER_COL).size().mean(), 
    "\nratings per test user: ", test.groupby(DEFAULT_USER_COL).size().mean()
) 

#train, test = [], []
#df_groupby = data.groupby(DEFAULT_USER_COL)[DEFAULT_ITEM_COL].apply(list)
#for uid, iids in zip(df_groupby.index.tolist(), df_groupby.values.tolist()):
#    for iid in iids[:-1]:
#        train.append([uid,iid])
#    test.append([uid, iids[-1]])    
#train = pd.DataFrame(train, columns=[DEFAULT_USER_COL, DEFAULT_ITEM_COL])
#test = pd.DataFrame(test, columns=[DEFAULT_USER_COL, DEFAULT_ITEM_COL])
#train[DEFAULT_RATING_COL] = 1.0
#test[DEFAULT_RATING_COL] = 1.0

ratings per train user:  4.715986795567083 
ratings per test user:  2.115656684744164
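
The idea behind a per-user stratified split can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation of `python_stratified_split`: the function name `stratified_split_by_user`, the rounding rule, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split_by_user(pairs, ratio=0.7, seed=202311):
    """Split (user, item) pairs so each user keeps about `ratio` of their items in train."""
    by_user = defaultdict(list)
    for user, item in pairs:
        by_user[user].append(item)

    rng = random.Random(seed)  # seeded so the split is reproducible
    train, test = [], []
    for user in sorted(by_user):
        items = by_user[user][:]
        rng.shuffle(items)
        n_train = round(len(items) * ratio)
        train += [(user, item) for item in items[:n_train]]
        test += [(user, item) for item in items[n_train:]]
    return train, test


# Toy data: u1 has 10 interactions, u2 has 4.
pairs = [("u1", f"i{j}") for j in range(10)] + [("u2", f"i{j}") for j in range(4)]
train, test = stratified_split_by_user(pairs)
print(len(train), len(test))
```

Splitting within each user, rather than over the whole table, keeps every user represented on both sides whenever they have enough interactions.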


### Modeling

In [6]:
models = {}  # models['model name'][0] => model object, models['model name'][1] => model parameters

##### User/Item K-Nearest-Neighbors (UserKNN/ItemKNN)

In [7]:
params = {
    'k': 20,
    'similarity': 'cosine', # ['cosine', 'pearson']
}

# U-to-U CF
model = UserKNN(**params, seed=SEED, verbose=VERBOSE)
models[model.name] = (model, params)

In [8]:
# I-to-I CF
model = ItemKNN(**params, seed=SEED, verbose=VERBOSE)
models[model.name] = (model, params)
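
For binary interaction data such as this, the `'cosine'` similarity between two users reduces to the overlap of their item sets divided by the geometric mean of the set sizes. A minimal sketch of that idea follows; the helper names and toy data are hypothetical, and this is not Cornac's internal code.

```python
import math


def cosine_similarity(items_a, items_b):
    """Cosine similarity between two users given the sets of items they interacted with."""
    if not items_a or not items_b:
        return 0.0
    overlap = len(items_a & items_b)
    return overlap / math.sqrt(len(items_a) * len(items_b))


def top_k_neighbors(target, interactions, k=20):
    """Return up to k (user, similarity) pairs most similar to `target`, best first."""
    scores = [
        (other, cosine_similarity(interactions[target], items))
        for other, items in interactions.items()
        if other != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


interactions = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"d"},
}
print(top_k_neighbors("u1", interactions, k=2))
```

The item-based variant follows the same pattern with the roles of users and items swapped.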

##### Matrix Factorization (MF)

In [9]:
params = {
    'k': 10,
    'max_iter': 25,
    'learning_rate': 0.01,
    'lambda_reg': 0.02,
    'use_bias': True,
    'early_stop': True,
}

model = MF(**params, seed=SEED, verbose=VERBOSE)
models[model.name] = (model, params)
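
As a rough illustration of what `use_bias`, `learning_rate`, and `lambda_reg` control, the sketch below shows a biased matrix factorization prediction and a single SGD update for one observed rating. It is a textbook formulation written for this example, not Cornac's implementation; the function names are hypothetical.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Biased MF prediction: global mean + user bias + item bias + dot product."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


def sgd_step(rating, mu, b_u, b_i, p_u, q_i, lr=0.01, reg=0.02):
    """Apply one SGD update for a single observed rating and return the new parameters."""
    err = rating - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


# One update on a toy example with k=2 latent factors.
mu, b_u, b_i = 1.0, 0.0, 0.0
p_u, q_i = [0.1, 0.2], [0.3, 0.4]
before = predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(2.0, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u, b_i, p_u, q_i)
print(before, after)
```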

##### Embarrassingly Shallow Autoencoders for Sparse Data (EASE)

In [10]:
params = {
    'lamb': 500,
    'posB': True,
}

model = EASE(**params, seed=SEED, verbose=VERBOSE)
models[model.name] = (model, params)
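
For reference, the closed-form solution from the EASE paper (Steck, 2019) is reproduced below. Here $X$ is the user-item interaction matrix, $\lambda$ corresponds to `lamb`, and $S$ holds the predicted scores; the symbols are notation introduced for this note rather than names from the notebook.

```latex
\hat{P} = \left(X^\top X + \lambda I\right)^{-1}, \qquad
\hat{B}_{ij} =
\begin{cases}
0 & \text{if } i = j, \\
-\dfrac{\hat{P}_{ij}}{\hat{P}_{jj}} & \text{otherwise,}
\end{cases}
\qquad
S = X \hat{B}
```

Because the solution needs only one matrix inversion over the item-item Gram matrix, training involves no iterative optimization.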

##### Neural Collaborative Filtering

In [11]:
params = {
    'num_factors': 8,
    'layers': [32, 16, 8],
    'act_fn': 'tanh', # ["tanh", "sigmoid", "relu", "leaky_relu"]
    'num_neg': 3,
    'lr': 0.001,
    'num_epochs': 10,
    'batch_size': 256,
}

model = NeuMF(**params, seed=SEED, verbose=VERBOSE)
models[model.name] = (model, params)
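
The `num_neg` parameter sets how many unobserved items are sampled as negatives for each observed interaction. A minimal sketch of that sampling scheme, assuming uniform sampling with replacement; the function name and toy data are hypothetical, and Cornac's own sampler may differ in detail.

```python
import random


def sample_negatives(user_items, all_items, num_neg=3, seed=202311):
    """For each positive (user, item), draw `num_neg` items the user has not interacted with."""
    rng = random.Random(seed)
    samples = []
    for user in sorted(user_items):
        seen = user_items[user]
        candidates = sorted(set(all_items) - seen)
        for item in sorted(seen):
            samples.append((user, item, 1))  # observed interaction, label 1
            negatives = rng.choices(candidates, k=num_neg) if candidates else []
            for neg in negatives:
                samples.append((user, neg, 0))  # sampled negative, label 0
    return samples


user_items = {"u1": {"a", "b"}, "u2": {"c"}}
all_items = ["a", "b", "c", "d", "e"]
samples = sample_negatives(user_items, all_items)
print(len(samples))
```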

##### Variational Autoencoder for Collaborative Filtering (VAECF)

In [12]:
params = {
    'k': 20,
    'autoencoder_structure': [40],
    'act_fn': "tanh",     # ["tanh", "sigmoid", "relu", "leaky_relu"]
    'likelihood': "mult", # ["bern", "mult", "gaus", "pois"]
    'n_epochs': 100,
    'batch_size': 100,
    'learning_rate': 0.005,
    'beta': 0.1,
}

model = VAECF(**params, seed=SEED, verbose=VERBOSE)
models[model.name] = (model, params)
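
The role of `beta` can be read from the objective used in Mult-VAE style models (Liang et al., 2018), where it weights the KL term. The notation below is introduced for this note and is not taken from the notebook: $x_u$ is the interaction vector of user $u$ and $z_u$ is its latent representation of size `k`.

```latex
\mathcal{L}_\beta(x_u; \theta, \phi) =
\mathbb{E}_{q_\phi(z_u \mid x_u)}\left[\log p_\theta(x_u \mid z_u)\right]
- \beta \, \mathrm{KL}\left(q_\phi(z_u \mid x_u) \,\|\, p(z_u)\right)
```

With `likelihood="mult"`, the reconstruction term is a multinomial log-likelihood over items, $\log p_\theta(x_u \mid z_u) = \sum_i x_{ui} \log \pi_i(z_u)$.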

### Experiment

In [13]:
# Set up the evaluation method
eval_method = BaseMethod.from_splits(
    train_data=np.array(train), 
    test_data=np.array(test), 
    exclude_unknowns=True,  # Unknown users and items will be ignored.
    verbose=True
)

#Random split
#ratio_split = RatioSplit(
#  data=data, test_size=0.2, exclude_unknowns=True, seed=SEED, verbose=VERBOSE
#)

#K-fold CV
#ratio_split = CrossValidation(
#  data=data, n_folds=5, exclude_unknowns=True, seed=SEED, verbose=VERBOSE
#)

rating_threshold = 1.0
exclude_unknowns = True
---
Training data:
Number of users = 8482
Number of items = 6671
Number of ratings = 40001
Max rating = 1.0
Min rating = 1.0
Global mean = 1.0
---
Test data:
Number of users = 8482
Number of items = 6671
Number of ratings = 17868
Number of unknown users = 0
Number of unknown items = 0
---
Total users = 8482
Total items = 6671
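
The effect of `exclude_unknowns=True` can be pictured as a filter over the test triples. The sketch below illustrates the idea only; `drop_unknowns` and the toy triples are hypothetical and this is not Cornac's internal code.

```python
def drop_unknowns(train, test):
    """Keep only test triples whose user and item both appear in the training triples."""
    known_users = {u for u, _, _ in train}
    known_items = {i for _, i, _ in train}
    return [(u, i, r) for u, i, r in test if u in known_users and i in known_items]


train_uir = [("u1", "a", 1.0), ("u2", "b", 1.0)]
test_uir = [
    ("u1", "b", 1.0),  # kept: user and item both seen in training
    ("u3", "a", 1.0),  # dropped: unknown user
    ("u2", "c", 1.0),  # dropped: unknown item
]
print(drop_unknowns(train_uir, test_uir))
```

In the log above, both unknown counts are 0, so no test interactions are discarded for this split.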


In [14]:
%%time

# Set the evaluation metrics
metrics = [Recall(k=TOP_K), NDCG(k=TOP_K)]

# Run the experiment
ex = cornac.Experiment(
    eval_method=eval_method,
    models=[m[0] for m in models.values()],
    metrics=metrics,
)
ex.run()  # call run() separately so that `ex` keeps the Experiment object


[UserKNN] Training started!
[UserKNN] Evaluation started!

[ItemKNN] Training started!
[ItemKNN] Evaluation started!

[MF] Training started!
Optimization finished!
[MF] Evaluation started!

[EASEᴿ] Training started!
[EASEᴿ] Evaluation started!

[NeuMF] Training started!
[NeuMF] Evaluation started!

[VAECF] Training started!
[VAECF] Evaluation started!

TEST:
...
        | NDCG@5 | Recall@5 | Train (s) | Test (s)
------- + ------ + -------- + --------- + --------
UserKNN | 0.0694 |   0.0893 |    0.4029 |  82.2444
ItemKNN | 0.0490 |   0.0621 |    0.2399 |  60.1855
MF      | 0.0005 |   0.0007 |    0.4782 |   3.1565
EASEᴿ   | 0.0729 |   0.0949 |    1.8603 |   1.7676
NeuMF   | 0.0104 |   0.0128 |    9.8463 |   6.6857
VAECF   | 0.0338 |   0.0444 |   82.9796 |   2.7932

CPU times: user 5min 15s, sys: 3min 35s, total: 8min 51s
Wall time: 4min 12s
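
To make the two columns in the table concrete, here is a minimal sketch of Recall@k and NDCG@k for a single user with binary relevance. These are the standard definitions written out for illustration; Cornac's own metric classes average such values over users and may differ in edge-case handling. The function names are hypothetical.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of `ranked`."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the top-k divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c", "d", "e"]  # recommended items, best first
relevant = {"b", "e", "z"}          # held-out items for this user
print(recall_at_k(ranked, relevant, 5), ndcg_at_k(ranked, relevant, 5))
```

Recall only counts how many held-out items were retrieved, whereas NDCG also rewards placing them nearer the top of the list.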


### Prediction

In [15]:
# Convert the full dataset to Cornac's data format
full_data = cornac.data.Dataset.from_uir(data.itertuples(index=False), seed=SEED)

# Select the model
model = UserKNN

# Retrain on the full dataset
model = model(**models['UserKNN'][1], verbose=VERBOSE, seed=SEED)
model.fit(full_data)


<cornac.models.knn.recom_knn.UserKNN at 0x32a03b850>

In [16]:
%%time

# Generate predictions for all items
all_pred = predict_ranking(
    model, data, 
    usercol=DEFAULT_USER_COL, itemcol=DEFAULT_ITEM_COL, 
    remove_seen=True
)


CPU times: user 1min 44s, sys: 2min 17s, total: 4min 1s
Wall time: 4min 1s


In [17]:
%%time

# Select the top-K items for each user
top_k = (
    all_pred
    .groupby(DEFAULT_USER_COL)
    .apply(lambda x: x.nlargest(TOP_K, DEFAULT_PREDICTION_COL))
    .reset_index(drop=True)
    .drop(DEFAULT_PREDICTION_COL, axis=1)
    .sort_values(by=DEFAULT_USER_COL)
)

# Save the submission file
t = pd.Timestamp.now()
fname = f"submit_{model.name}_{t.month:02}{t.day:02}{t.hour:02}{t.minute:02}.csv"
top_k.to_csv(fname, index=False)

CPU times: user 8.62 s, sys: 325 ms, total: 8.95 s
Wall time: 8.95 s
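
The per-user top-K selection above can also be expressed without pandas. The sketch below uses `heapq.nlargest` on plain `(user, item, score)` tuples; the function name and toy predictions are hypothetical, and ties are broken by item identifier as a side effect of tuple comparison.

```python
import heapq
from collections import defaultdict


def top_k_per_user(predictions, k=5):
    """Return {user: [items]} holding each user's k highest-scored items, best first."""
    by_user = defaultdict(list)
    for user, item, score in predictions:
        by_user[user].append((score, item))
    return {
        user: [item for _, item in heapq.nlargest(k, scored)]
        for user, scored in by_user.items()
    }


predictions = [
    ("u1", "a", 0.1),
    ("u1", "b", 0.9),
    ("u1", "c", 0.5),
    ("u2", "a", 0.3),
]
print(top_k_per_user(predictions, k=2))
```

For large prediction tables this avoids fully sorting each user's candidates, since only the K best entries are tracked.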


<font color='tomato'><font color="#CC3D3D"><p>
# End