  - Single-variable regression
  - Multivariable regression

Ensemble
  - Decision tree
  - Random forest
  - Stacking
  - Bagging
  - Boosting

In [None]:
from sklearn.ensemble import StackingRegressor

In [None]:
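'''
Stacking trains a final model on the predictions made by the individual models.
Build several regression models and train each one,
tuning hyperparameters where needed.
Use cross-validation to get a score for each model and keep, for example, the top 5.
Collect the chosen models in a list -- estimators

estimators = [
  ('randomforest', rf),
  ('extratreesreg', ext),
  ('xgboost', xgb)
]

StackingRegressor(estimators, final_estimator=xgb)

'''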
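The outline above can be tried end to end on toy data. The following is a minimal sketch, not the notebook's final pipeline: the data is synthetic (`make_regression`), the base models are limited to two scikit-learn tree ensembles, and `Ridge` is swapped in as the final estimator instead of `xgb` so that the example needs only scikit-learn. All variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: the real features are built later in the notebook)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Base models as (name, estimator) pairs
estimators = [
    ("randomforest", RandomForestRegressor(n_estimators=30, random_state=42)),
    ("extratrees", ExtraTreesRegressor(n_estimators=30, random_state=42)),
]

# The final estimator is trained on cross-validated predictions of the base models
stack = StackingRegressor(estimators=estimators, final_estimator=Ridge(), cv=3)
stack.fit(X_tr, y_tr)

demo_pred = stack.predict(X_te)
demo_r2 = stack.score(X_te, y_te)  # R^2 on the held-out quarter
```

A simple linear final estimator is a common choice because it only has to learn how to weight a handful of base-model predictions.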

Load the DACON data

In [None]:
pip install rdkit

Collecting rdkit
  Downloading rdkit-2023.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29.7/29.7 MB 28.5 MB/s eta 0:00:00
Installing collected packages: rdkit
Successfully installed rdkit-2023.3.3


In [1]:
import random
import os

import numpy as np
import pandas as pd

from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from rdkit import DataStructs
from rdkit.Chem import PandasTools, AllChem

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

In [2]:
def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

seed_everything(42)  # fix the random seed

In [None]:
!unzip '/content/drive/MyDrive/데이콘_신약개발/데이콘 신약데이터.zip'

Archive:  /content/drive/MyDrive/데이콘_신약개발/데이콘 신약데이터.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [3]:
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')

In [4]:
PandasTools.AddMoleculeColumnToFrame(train,'SMILES','Molecule')
PandasTools.AddMoleculeColumnToFrame(test,'SMILES','Molecule')

In [5]:
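def mol2fp(mol):
    # Hashed Morgan count fingerprint, radius 6, folded to 4096 entries
    fp = AllChem.GetHashedMorganFingerprint(mol, 6, nBits=4096)
    # ConvertToNumpyArray resizes `ar` to the fingerprint length
    ar = np.zeros((1,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, ar)
    return ar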

In [6]:
# Add the FPs (fingerprint) column
train["FPs"] = train.Molecule.apply(mol2fp)
test["FPs"] = test.Molecule.apply(mol2fp)

In [7]:
# Keep only the columns that will be used
train = train[['FPs','MLM', 'HLM']]
test = test[['FPs']]

In [8]:
X = train['FPs']
y = train[['MLM', 'HLM']]

In [9]:
X_X = pd.concat([pd.DataFrame(i).T for i in X])
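Building one single-row DataFrame per molecule and concatenating them works, but it can be slow for a large training set. A possible alternative, sketched here on a tiny hypothetical Series (`fps_demo` stands in for `train['FPs']`), is to stack the arrays with `np.stack`. The values are identical; only the index differs, since the concatenated frame repeats index 0 on every row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train['FPs']: a Series whose cells are 1-D arrays
fps_demo = pd.Series([
    np.array([0, 1, 0, 2], dtype=np.int8),
    np.array([1, 0, 0, 0], dtype=np.int8),
    np.array([0, 0, 3, 1], dtype=np.int8),
])

# Approach used in the notebook: one single-row DataFrame per molecule
slow = pd.concat([pd.DataFrame(i).T for i in fps_demo])

# Alternative: stack the arrays into one (n_samples, n_bits) matrix
fast = pd.DataFrame(np.stack(fps_demo.tolist()))

same_values = np.array_equal(slow.to_numpy(), fast.to_numpy())
```

If the repeated index of the concatenated frame is not wanted, `reset_index(drop=True)` gives the same 0..n-1 index as the stacked version.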

In [None]:
# Candidate model - random forest regression
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(X_X,y)
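Fitting on every training row leaves no data for estimating how well the model generalizes before submitting. One option is a hold-out split with `train_test_split`, which is already imported above. The sketch below uses synthetic two-target data in place of the fingerprints, and RMSE per target as the metric; the metric is an assumption and may not match the competition's scoring rule.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic two-target data standing in for (fingerprints, [MLM, HLM])
X_syn, y_syn = make_regression(n_samples=200, n_features=10, n_targets=2,
                               noise=5.0, random_state=42)
X_fit, X_val, y_fit, y_val = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# RandomForestRegressor handles a 2-column target directly
rf_demo = RandomForestRegressor(n_estimators=30, random_state=42)
rf_demo.fit(X_fit, y_fit)
val_pred = rf_demo.predict(X_val)

# One RMSE per target column
rmse_per_target = np.sqrt(mean_squared_error(y_val, val_pred, multioutput="raw_values"))
```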

In [None]:
test_X = pd.concat([pd.DataFrame(i).T for i in test['FPs']])
test_X_predict = rfr.predict(test_X)

In [None]:
# predict
df_submission = pd.read_csv("./sample_submission.csv")
df_submission["MLM"] = test_X_predict[:,0]
df_submission["HLM"] = test_X_predict[:,1]
df_submission.to_csv("result.csv", index = False, encoding = "utf-8-sig")

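Improvements
  - Modeling
  - Pick several regression models that look likely to perform well
  - Cross-validate each model
  - Keep the top x models (for example, 3)
  - Build an ensemble from those models
    - Stacking, bagging, boosting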
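Stacking is shown in the code cells of this notebook. As a rough illustration of the other two ensemble styles in the list, the sketch below compares a bagging ensemble and a boosting ensemble on synthetic data. The dataset, the estimator sizes, and the variable names are all assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X_ens, y_ens = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Bagging: independent trees trained on bootstrap samples, predictions averaged
bagging = BaggingRegressor(DecisionTreeRegressor(), n_estimators=20, random_state=0)

# Boosting: shallow trees fitted one after another, each correcting the current errors
boosting = GradientBoostingRegressor(n_estimators=50, random_state=0)

# Mean 3-fold R^2 for each ensemble
ensemble_scores = {
    name: cross_val_score(model, X_ens, y_ens, cv=3).mean()
    for name, model in [("bagging", bagging), ("boosting", boosting)]
}
```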

In [10]:
from sklearn.linear_model import LinearRegression, Ridge,Lasso,ElasticNet
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
import xgboost as xgb
from sklearn.model_selection import cross_val_score

In [154]:
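from sklearn.multioutput import MultiOutputRegressor

models = [
    LinearRegression(), Ridge(), Lasso(), ElasticNet(),
    RandomForestRegressor(),
    GradientBoostingRegressor(),
    SVR(), KNeighborsRegressor(), MLPRegressor(), xgb.XGBRegressor()
]

# y has two columns (MLM, HLM), but SVR and GradientBoostingRegressor accept
# only a 1-D target, so every model is wrapped to fit one copy per column.
scores_list = []
for model in models:
  # cross_val_score returns an array of per-fold scores (R^2 for regressors)
  scores = cross_val_score(MultiOutputRegressor(model), X_X, y, n_jobs=-1, cv=3)
  scores_list.append(scores.mean())

np.argsort(scores_list)[::-1][:3]  # indices of the top 3 models, best first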
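`cross_val_score` and `cross_validate` are easy to mix up: the first returns a plain array of fold scores, while the second returns a dict with the fold scores under `'test_score'`. The sketch below shows both on synthetic data and then maps the `np.argsort` ranking back to model names. The three candidate models and all variable names are illustrative, not the notebook's actual model list.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.neighbors import KNeighborsRegressor

X_toy, y_toy = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=1)
candidates = [LinearRegression(), Ridge(), KNeighborsRegressor()]

# cross_val_score returns a plain array with one score per fold
array_scores = cross_val_score(Ridge(), X_toy, y_toy, cv=3)

# cross_validate returns a dict; the fold scores live under 'test_score'
dict_scores = cross_validate(Ridge(), X_toy, y_toy, cv=3)

# Rank the candidates by mean score and keep the top 2
mean_scores = [cross_val_score(m, X_toy, y_toy, cv=3).mean() for m in candidates]
ranking = np.argsort(mean_scores)[::-1][:2]
top_names = [type(candidates[i]).__name__ for i in ranking]
```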

In [22]:
#  xgb.XGBRegressor(), RandomForestRegressor(),GradientBoostingRegressor()
from sklearn.ensemble import StackingRegressor,BaggingRegressor
from sklearn.multioutput import MultiOutputRegressor  # wraps a single-target regressor for multi-output problems

In [12]:
test_X = pd.concat([pd.DataFrame(i).T for i in test['FPs']])
test_X.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4086,4087,4088,4089,4090,4091,4092,4093,4094,4095
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
y.head(2)

Unnamed: 0,MLM,HLM
0,26.01,50.68
1,29.27,50.59


In [None]:
base_model = [
    ('xgb', xgb.XGBRegressor()),
    ('rfr',RandomForestRegressor()),
    ('gb',GradientBoostingRegressor())
]
# Meta-model (final estimator)
meta_model = RandomForestRegressor()
# Apply stacking; StackingRegressor expects a 1-D target, so fit one column at a time
stacking_regression = StackingRegressor(estimators=base_model,final_estimator=meta_model)
stacking_regression.fit(X_X,y['MLM'])
test_X_predict_MLM = stacking_regression.predict(test_X)

In [None]:
stacking_regression.fit(X_X,y['HLM'])
test_X_predict_HLM = stacking_regression.predict(test_X)
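Refitting the same stacking object once per target works, but the two fits could also be folded into one call with `MultiOutputRegressor`, which is imported above. The sketch below is one way this might look, assuming synthetic two-target data in place of (MLM, HLM), small estimator counts to keep it quick, and `Ridge` swapped in as the final estimator.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

# Synthetic two-target data standing in for (MLM, HLM)
X_two, y_two = make_regression(n_samples=120, n_features=8, n_targets=2,
                               noise=5.0, random_state=0)

stack_demo = StackingRegressor(
    estimators=[
        ("rfr", RandomForestRegressor(n_estimators=20, random_state=0)),
        ("gb", GradientBoostingRegressor(n_estimators=20, random_state=0)),
    ],
    final_estimator=Ridge(),
    cv=3,
)

# MultiOutputRegressor clones the stack and fits one copy per target column
multi_stack = MultiOutputRegressor(stack_demo)
multi_stack.fit(X_two, y_two)
two_pred = multi_stack.predict(X_two[:5])  # column 0 and column 1 follow y_two's order
```

With this wrapper the prediction comes back as a single two-column array, so the submission columns can be filled with `two_pred[:, 0]` and `two_pred[:, 1]`.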

In [None]:
# RandomForestRegressor().fit(X_X,y)

In [None]:
# predict
df_submission = pd.read_csv("./sample_submission.csv")
df_submission["MLM"] = test_X_predict_MLM
df_submission["HLM"] = test_X_predict_HLM
df_submission.to_csv("stacking_result.csv", index = False, encoding = "utf-8-sig")