  - Single-variable regression
  - Multivariable regression analysis

Ensemble
  - Decision tree
  - Random forest
  - Stacking
  - Bagging
  - Boosting

In [None]:
from sklearn.ensemble import StackingRegressor

In [None]:
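'''
Stacking trains a final model on the predictions of the individual models.
Build several regression models and train each one.
Tune hyperparameters if needed.
Use cross-validation to get a performance score for each model, then keep e.g. the top 5
and put those models in a list -- estimators

estimators = [
  ('randomforest', rf),
  ('extrtreegreg', ext),
  ('xgboost', xgb)
]

StackingRegressor(estimators, final_estimator=xgb)

'''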
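
The notes above can be sketched as a runnable example. This is only a minimal sketch under stated assumptions: the data is synthetic (`make_regression`), `GradientBoostingRegressor` stands in for xgboost so that only scikit-learn is needed, `Ridge` is an arbitrary choice of final estimator, and `top_k = 3` replaces the "top 5" in the notes because only four candidates are defined. The names `candidates`, `cv_scores`, and `top_k` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: the notes do not name a dataset here).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=42
)

# Candidate base models; GradientBoostingRegressor stands in for xgboost.
candidates = {
    "randomforest": RandomForestRegressor(n_estimators=30, random_state=42),
    "extratrees": ExtraTreesRegressor(n_estimators=30, random_state=42),
    "gradientboosting": GradientBoostingRegressor(n_estimators=30, random_state=42),
    "ridge": Ridge(),
}

# Score every candidate with 3-fold cross-validation (higher R^2 is better).
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=3, scoring="r2").mean()
    for name, model in candidates.items()
}

# Keep the top-k candidates as the `estimators` list of (name, model) pairs.
top_k = 3
best_names = sorted(cv_scores, key=cv_scores.get, reverse=True)[:top_k]
estimators = [(name, candidates[name]) for name in best_names]

# The final estimator is fitted on cross-validated predictions of the base models.
stack = StackingRegressor(estimators=estimators, final_estimator=Ridge(), cv=3)
stack.fit(X_demo, y_demo)
stack_pred = stack.predict(X_demo[:5])
```

Ranking by cross-validated score before stacking keeps weak candidates out of the ensemble; the scoring metric and the value of `top_k` would need to be chosen to match the actual task.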

Load the DACON data

In [7]:
pip install rdkit

Collecting rdkit
  Downloading rdkit-2023.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.7 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2023.3.3


In [17]:
import random
import os

import numpy as np
import pandas as pd

from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from rdkit import DataStructs
from rdkit.Chem import PandasTools, AllChem

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

In [9]:
def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

seed_everything(42) # Fix the seed

In [3]:
!unzip '/content/drive/MyDrive/데이콘_신약개발/데이콘 신약데이터.zip'

Archive:  /content/drive/MyDrive/데이콘_신약개발/데이콘 신약데이터.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [50]:
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')

In [51]:
PandasTools.AddMoleculeColumnToFrame(train,'SMILES','Molecule')
PandasTools.AddMoleculeColumnToFrame(test,'SMILES','Molecule')

In [52]:
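def mol2fp(mol):
    # Hashed Morgan fingerprint with radius 6, folded to 4096 positions
    fp = AllChem.GetHashedMorganFingerprint(mol, 6, nBits=4096)
    # ConvertToNumpyArray resizes `ar` to the fingerprint length and fills it in place
    ar = np.zeros((1,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, ar)
    return ar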

In [53]:
# Add the FPs column
train["FPs"] = train.Molecule.apply(mol2fp)
test["FPs"] = test.Molecule.apply(mol2fp)

In [54]:
# Keep only the columns that will be used
train = train[['FPs','MLM', 'HLM']]
test = test[['FPs']]

In [107]:
X = train['FPs']
y = train[['MLM', 'HLM']]

In [120]:
X_X = pd.concat([pd.DataFrame(i).T for i in X])
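
Building one single-row DataFrame per fingerprint and concatenating them gets slow as the number of rows grows, and it leaves every row with index 0. A possible alternative, shown here on a tiny made-up Series (`fps_demo` is a hypothetical stand-in for `train['FPs']`), is to stack the arrays into a 2-D matrix in one call:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train['FPs']: three 8-element fingerprint vectors.
fps_demo = pd.Series([np.arange(8, dtype=np.int8) * k for k in range(3)])

# Row-by-row pattern: one single-row DataFrame per fingerprint, then concat.
slow = pd.concat([pd.DataFrame(fp).T for fp in fps_demo])

# Alternative: stack all arrays into one (n_samples, n_bits) matrix.
fast = pd.DataFrame(np.stack(fps_demo.to_numpy()))

# Both hold the same values; only the row index differs.
same_values = np.array_equal(slow.to_numpy(), fast.to_numpy())
```

The two frames differ only in their index (`slow` repeats 0, `fast` gets 0, 1, 2), which does not affect fitting a scikit-learn model.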

In [121]:
# Candidate prediction model: random forest regression
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(X_X,y)
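
The model above is fitted on all of the training data, so there is no estimate of how well it generalizes. One way to get such an estimate, sketched here on synthetic data, is to hold out part of the rows with `train_test_split` (already imported earlier) and compute the RMSE per target. The arrays `X_demo` and `y_demo` are made-up stand-ins for the fingerprint matrix and the `MLM`/`HLM` columns, and the 80/20 split is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 300 binary "fingerprints" of 64 bits and two targets.
X_demo = rng.integers(0, 2, size=(300, 64)).astype(np.int8)
y_demo = np.column_stack([
    X_demo[:, :8].sum(axis=1) + rng.normal(0, 0.1, 300),
    X_demo[:, 8:16].sum(axis=1) + rng.normal(0, 0.1, 300),
])

# Hold out 20% of the rows for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# RandomForestRegressor handles the two-column target directly.
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_tr, y_tr)
val_pred = model.predict(X_val)

# One RMSE value per target column.
rmse = np.sqrt(mean_squared_error(y_val, val_pred, multioutput="raw_values"))
```

With the real data, the same pattern would split `X_X` and `y` before fitting, and the model could be refitted on all rows once a configuration has been chosen.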

In [132]:
test_X = pd.concat([pd.DataFrame(i).T for i in test['FPs']])
test_X_predict = rfr.predict(test_X)

In [137]:
# Fill the sample submission with the predictions and save it
df_submission = pd.read_csv("./sample_submission.csv")
df_submission["MLM"] = test_X_predict[:,0]
df_submission["HLM"] = test_X_predict[:,1]
df_submission.to_csv("result.csv", index = False, encoding = "utf-8-sig")