<a href="https://colab.research.google.com/github/kimjaehwankimjaehwan/Dacon/blob/main/5_4_8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!unzip '/content/drive/MyDrive/데이콘/open.zip' -d '/content/drive/MyDrive/데이콘'

Archive:  /content/drive/MyDrive/데이콘/open.zip
  inflating: /content/drive/MyDrive/데이콘/sample_submission.csv  
  inflating: /content/drive/MyDrive/데이콘/test.csv  
  inflating: /content/drive/MyDrive/데이콘/train.csv  


In [3]:
pip install rdkit

Collecting rdkit
  Downloading rdkit-2024.3.5-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.9 kB)
Downloading rdkit-2024.3.5-cp310-cp310-manylinux_2_28_x86_64.whl (33.1 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2024.3.5


In [7]:
import pandas as pd
import numpy as np
import os
import random

from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import warnings

# Suppress all warning messages
warnings.filterwarnings("ignore")



In [8]:
CFG = {
    'NBITS':2048,
    'SEED':42,
}

def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
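seed_everything(CFG['SEED'])  # fix the random seed for reproducibility

# Convert a SMILES string into a Morgan fingerprint bit vector
def smiles_to_fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None:
        # radius 2, folded to CFG['NBITS'] bits
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=CFG['NBITS'])
        return np.array(fp)
    else:
        # unparsable SMILES: fall back to an all-zero vector
        return np.zeros((CFG['NBITS'],))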
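
Recent RDKit releases appear to steer users away from `GetMorganFingerprintAsBitVect` and toward fingerprint generator objects. The sketch below shows what an equivalent conversion might look like with `rdFingerprintGenerator`. The function name `smiles_to_fingerprint_gen` and the constant `N_BITS` are hypothetical, and the radius and bit count are assumed to match the settings used above; this is not the notebook's own code.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

N_BITS = 2048  # assumed to match CFG['NBITS'] in the notebook

# One generator object can be reused for every molecule.
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=N_BITS)

def smiles_to_fingerprint_gen(smiles):
    """Hypothetical variant of smiles_to_fingerprint built on the generator API."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # Same fallback as the original function: an all-zero vector.
        return np.zeros((N_BITS,), dtype=np.uint8)
    return morgan_gen.GetFingerprintAsNumPy(mol)

fp = smiles_to_fingerprint_gen("CC(=O)O")  # acetic acid
print(fp.shape, int(fp.sum()))
```

Bit positions from the two APIs are not guaranteed to be identical, so a model trained on one should not be fed fingerprints from the other.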



In [9]:
# Load the ChEMBL training data
chembl_data = pd.read_csv('/content/drive/MyDrive/데이콘/train.csv')  # example file name
chembl_data.head()



Unnamed: 0,Molecule ChEMBL ID,Standard Type,Standard Relation,Standard Value,Standard Units,pChEMBL Value,Assay ChEMBL ID,Target ChEMBL ID,Target Name,Target Organism,Target Type,Document ChEMBL ID,IC50_nM,pIC50,Smiles
0,CHEMBL4443947,IC50,'=',0.022,nM,10.66,CHEMBL4361896,CHEMBL3778,Interleukin-1 receptor-associated kinase 4,Homo sapiens,SINGLE PROTEIN,CHEMBL4359855,0.022,10.66,CN[C@@H](C)C(=O)N[C@H](C(=O)N1C[C@@H](NC(=O)CC...
1,CHEMBL4556091,IC50,'=',0.026,nM,10.59,CHEMBL4345131,CHEMBL3778,Interleukin-1 receptor-associated kinase 4,Homo sapiens,SINGLE PROTEIN,CHEMBL4342485,0.026,10.59,CC(C)(O)[C@H](F)CN1Cc2cc(NC(=O)c3cnn4cccnc34)c...
2,CHEMBL4566431,IC50,'=',0.078,nM,10.11,CHEMBL4345131,CHEMBL3778,Interleukin-1 receptor-associated kinase 4,Homo sapiens,SINGLE PROTEIN,CHEMBL4342485,0.078,10.11,CC(C)(O)[C@H](F)CN1Cc2cc(NC(=O)c3cnn4cccnc34)c...
3,CHEMBL4545898,IC50,'=',0.081,nM,10.09,CHEMBL4345131,CHEMBL3778,Interleukin-1 receptor-associated kinase 4,Homo sapiens,SINGLE PROTEIN,CHEMBL4342485,0.081,10.09,CC(C)(O)[C@H](F)CN1Cc2cc(NC(=O)c3cnn4cccnc34)c...
4,CHEMBL4448950,IC50,'=',0.099,nM,10.0,CHEMBL4361896,CHEMBL3778,Interleukin-1 receptor-associated kinase 4,Homo sapiens,SINGLE PROTEIN,CHEMBL4359855,0.099,10.0,COc1cc2c(OC[C@@H]3CCC(=O)N3)ncc(C#CCCCCCCCCCCC...


1. Target protein:

  - Name: Interleukin-1 receptor-associated kinase 4 (IRAK4)
  - Organism: Homo sapiens (human)

2. Activity measurements (for the five rows shown above):

  - IC50 range: 0.022 nM to 0.099 nM
  - pIC50 range: 10.00 to 10.66

3. Compounds by activity (among the five rows shown above):

  - Most active compound: CHEMBL4443947
    - IC50: 0.022 nM
    - pIC50: 10.66

  - Least active compound: CHEMBL4448950
    - IC50: 0.099 nM
    - pIC50: 10.00

4. SMILES structure:

  - Each compound's SMILES string encodes its chemical structure.
  - For example, the SMILES of the most active compound (CHEMBL4443947) begins with "CN[C@@H](C)C(=O)N[C@H](C(=O)N1C[C@@H](".
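
The IC50 and pIC50 columns carry the same information on different scales. A minimal sketch of the conversion, assuming the usual definition pIC50 = -log10(IC50 in mol/L), which for values in nM becomes 9 - log10(IC50_nM) and is the inverse of the `pIC50_to_IC50` helper defined later in this notebook. The function name `ic50_nm_to_pic50` is hypothetical.

```python
import math

def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 given in nM to pIC50 = 9 - log10(IC50_nM)."""
    return 9.0 - math.log10(ic50_nm)

# IC50 values (nM) of the most and least active of the five rows shown above
for ic50 in (0.022, 0.099):
    print(f"IC50 = {ic50} nM -> pIC50 = {ic50_nm_to_pic50(ic50):.2f}")
```

Rounded to two decimals, these reproduce the 10.66 and 10.00 listed in the table.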

In [13]:
chembl_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1952 entries, 0 to 1951
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Molecule ChEMBL ID  1952 non-null   object 
 1   Standard Type       1952 non-null   object 
 2   Standard Relation   1952 non-null   object 
 3   Standard Value      1952 non-null   float64
 4   Standard Units      1952 non-null   object 
 5   pChEMBL Value       1952 non-null   float64
 6   Assay ChEMBL ID     1952 non-null   object 
 7   Target ChEMBL ID    1952 non-null   object 
 8   Target Name         1952 non-null   object 
 9   Target Organism     1952 non-null   object 
 10  Target Type         1952 non-null   object 
 11  Document ChEMBL ID  1952 non-null   object 
 12  IC50_nM             1952 non-null   float64
 13  pIC50               1952 non-null   float64
 14  Smiles              1952 non-null   object 
dtypes: float64(4), object(11)
memory usage: 228.9+ KB


In [21]:
chembl_data.describe(include='all')

Unnamed: 0,Molecule ChEMBL ID,Standard Type,Standard Relation,Standard Value,Standard Units,pChEMBL Value,Assay ChEMBL ID,Target ChEMBL ID,Target Name,Target Organism,Target Type,Document ChEMBL ID,IC50_nM,pIC50,Smiles
count,1952,1952,1952,1952.0,1952,1952.0,1952,1952,1952,1952,1952,1952,1952.0,1952.0,1952
unique,1952,1,1,,1,,72,1,1,1,1,66,,,1952
top,CHEMBL4443947,IC50,'=',,nM,,CHEMBL3887118,CHEMBL3778,Interleukin-1 receptor-associated kinase 4,Homo sapiens,SINGLE PROTEIN,CHEMBL3886172,,,CN[C@@H](C)C(=O)N[C@H](C(=O)N1C[C@@H](NC(=O)CC...
freq,1,1952,1952,,1952,,582,1952,1952,1952,1952,582,,,1
mean,,,,649.001365,,7.518586,,,,,,,649.001365,7.518586,
std,,,,2639.946734,,1.107959,,,,,,,2639.946734,1.107959,
min,,,,0.022,,4.26,,,,,,,0.022,4.26,
25%,,,,4.1,,6.68,,,,,,,4.1,6.68,
50%,,,,15.25,,7.82,,,,,,,15.25,7.82,
75%,,,,209.1975,,8.39,,,,,,,209.1975,8.39,


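### Interleukin-1 receptor-associated kinase 4 (IRAK4) is an important signaling protein.

1. Role:

  - IRAK4 plays a central role in the immune response.
  - It is mainly involved in regulating the innate immune response.
  - It is activated through receptors such as Toll-like receptors (TLRs), a class of pattern recognition receptors, and interleukin-1 receptors (IL-1 receptors).

2. Function:

  - When immune cells detect a pathogen or an inflammatory signal, IRAK4 is activated.
  - Together with other proteins, IRAK4 activates signaling pathways that lead to the production of immune mediators such as inflammatory cytokines.

3. Clinical significance:

  - Dysfunction or excessive activation of IRAK4 can be associated with inflammatory and autoimmune diseases.
  - IRAK4 is therefore being studied as a drug target for treating these diseases.
  - In particular, IRAK4 inhibitors could be used to suppress excessive inflammatory responses.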

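### SMILES (Simplified Molecular Input Line Entry System) is a way to represent a chemical structure as a text string.

1. Purpose:

  - It gives a compact representation of a compound's structure that is easy to store, search, and analyze.
  - Because it is plain text, computer systems can process it easily.

2. Structure:

  - Atoms: each atom is written as its element symbol. For example, carbon is "C" and oxygen is "O".
  - Bonds: single bonds are usually omitted, double bonds are written as "=", and triple bonds as "#".
  - Branches: a branch is enclosed in parentheses. For example, in acetic acid, "CC(=O)O", the "(=O)" is a branch off the second carbon, whereas unbranched ethanol is simply "CCO".
  - Rings: rings are written with digits that mark where the ring opens and closes. For example, cyclohexane is "C1CCCCC1".

3. Examples:

  - SMILES of acetic acid: "CC(=O)O"
  - SMILES of benzene: "c1ccccc1"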
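
The notation rules above can be checked by parsing the example strings with RDKit, which this notebook already installs. This is only an illustrative sketch; the dictionary `examples` and the variable names are made up for this demonstration.

```python
from rdkit import Chem

# The example molecules mentioned in the text above
examples = {
    "ethanol": "CCO",
    "cyclohexane": "C1CCCCC1",
    "acetic acid": "CC(=O)O",
    "benzene": "c1ccccc1",
}

heavy_atoms = {}
for name, smiles in examples.items():
    mol = Chem.MolFromSmiles(smiles)  # returns None if the string cannot be parsed
    # GetNumAtoms() counts heavy atoms only; hydrogens are implicit in these SMILES
    heavy_atoms[name] = mol.GetNumAtoms()
    n_aromatic = sum(atom.GetIsAromatic() for atom in mol.GetAtoms())
    print(f"{name:12s} {smiles:10s} heavy atoms={heavy_atoms[name]} aromatic atoms={n_aromatic}")
```

Lowercase atom symbols, as in benzene, mark aromatic atoms, which is why only benzene reports a nonzero aromatic count.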

Data Preprocessing

In [10]:
# Take a copy so that adding a column does not write into a slice of chembl_data
train = chembl_data[['Smiles', 'pIC50']].copy()
train['Fingerprint'] = train['Smiles'].apply(smiles_to_fingerprint)





In [23]:
train

Unnamed: 0,Smiles,pIC50,Fingerprint
0,CN[C@@H](C)C(=O)N[C@H](C(=O)N1C[C@@H](NC(=O)CC...,10.66,"[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ..."
1,CC(C)(O)[C@H](F)CN1Cc2cc(NC(=O)c3cnn4cccnc34)c...,10.59,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ..."
2,CC(C)(O)[C@H](F)CN1Cc2cc(NC(=O)c3cnn4cccnc34)c...,10.11,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ..."
3,CC(C)(O)[C@H](F)CN1Cc2cc(NC(=O)c3cnn4cccnc34)c...,10.09,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ..."
4,COc1cc2c(OC[C@@H]3CCC(=O)N3)ncc(C#CCCCCCCCCCCC...,10.00,"[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
...,...,...,...
1947,O=C(Nc1nc2cc[nH]cc-2n1)c1cccc([N+](=O)[O-])c1,4.52,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1948,CCCCn1c(NC(=O)c2cccc(Cl)c2)nc2ccccc21,4.52,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1949,O=C(Nc1nc2cc(F)c(F)cc2[nH]1)c1cccc([N+](=O)[O-...,4.52,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1950,OC[C@H]1C[C@@H](Nc2nc(Nc3ccccc3)ncc2-c2nc3cccc...,4.38,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [11]:
train_x = np.stack(train['Fingerprint'].values)
train_y = train['pIC50'].values



In [22]:
train_x

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [12]:
# Split into training and validation sets
train_x, val_x, train_y, val_y = train_test_split(train_x, train_y, test_size=0.3, random_state=42)



Train & Validation

In [50]:
# Use pipelines and cross-validation to compare several machine-learning models and select the best one
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
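
The imports above are not used further in this notebook. As a rough sketch of the comparison they point toward, the following scores three candidate models with 5-fold cross-validation. The data here is synthetic (random bit vectors with a noisy linear target) and only stands in for `train_x` and `train_y`; the candidate list and all hyperparameters are assumptions, not the author's choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for (train_x, train_y)
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(120, 64))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=120)

candidates = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "svr": make_pipeline(StandardScaler(), SVR()),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_rmse = {}
for name, estimator in candidates.items():
    # The scorer returns negative RMSE, so flip the sign to get an error
    scores = cross_val_score(estimator, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()
    print(f"{name}: CV RMSE = {cv_rmse[name]:.3f}")

best_name = min(cv_rmse, key=cv_rmse.get)
print("best:", best_name)
```

With the real fingerprints, `X` and `y` would be replaced by the training arrays, and `GridSearchCV` could wrap each candidate to tune its hyperparameters.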

In [14]:
# Train the random forest model
model = RandomForestRegressor(random_state=CFG['SEED'])
model.fit(train_x, train_y)



In [15]:
def pIC50_to_IC50(pic50_values):
    """Convert pIC50 values to IC50 (nM)."""
    return 10 ** (9 - pic50_values)



In [16]:
# Evaluate the trained model on the validation data
val_y_pred = model.predict(val_x)
mse = mean_squared_error(pIC50_to_IC50(val_y), pIC50_to_IC50(val_y_pred))
rmse = np.sqrt(mse)

print(f'RMSE: {rmse}')



RMSE: 2169.5781089857264
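
This RMSE is computed after converting back to IC50 in nM, so errors on weakly active compounds (large IC50) dominate it. The toy sketch below, with made-up values, shows how the same 0.1-unit error in pIC50 turns into very different errors in nM. The helper names are hypothetical; `pic50_to_ic50_nm` mirrors the notebook's `pIC50_to_IC50` for plain floats.

```python
import math

def pic50_to_ic50_nm(pic50):
    """Same relation as the notebook's pIC50_to_IC50, for a single float."""
    return 10 ** (9 - pic50)

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up values: one potent and one weak compound, each predicted 0.1 pIC50 units too high
true_pic50 = [9.0, 5.0]
pred_pic50 = [9.1, 5.1]

rmse_pic50 = rmse(true_pic50, pred_pic50)
rmse_nm = rmse([pic50_to_ic50_nm(v) for v in true_pic50],
               [pic50_to_ic50_nm(v) for v in pred_pic50])

print(f"RMSE in pIC50 units: {rmse_pic50:.3f}")
print(f"RMSE in nM:          {rmse_nm:.1f}")
```

Reporting both numbers side by side can make it easier to tell whether a large nM error comes from a few weak compounds or from the model as a whole.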


In [39]:
test = pd.read_csv('/content/drive/MyDrive/데이콘/test.csv')



In [40]:
test['Fingerprint'] = test['Smiles'].apply(smiles_to_fingerprint)
test['Fingerprint']





Unnamed: 0,Fingerprint
0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ..."
4,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ..."
...,...
108,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
109,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
110,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
111,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [41]:
test['Fingerprint'].values

array([array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 1, ..., 0, 0, 0]),
       array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
       array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
       array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
       array([0, 0, 0, ..., 0, 0, 0]), array([0, 1, 0, ..., 0, 0, 0]),
       array([0, 1, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
       array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
       array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
       array([0, 0, 1, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
       array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
       array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
       array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
       array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
       array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
      

In [42]:
test['Fingerprint'].shape

(113,)

In [43]:
test['Fingerprint'][0]

array([0, 0, 0, ..., 0, 0, 0])

In [44]:
test['Fingerprint'][0].shape

(2048,)

In [45]:
test_x = np.stack(test['Fingerprint'].values)


In [46]:
test_x
# print(np.stack(test['Fingerprint'].values).shape, test_x.shape)
# test_x = np.stack(test['Fingerprint'].values)



array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [47]:
test_x.shape

(113, 2048)
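
The shapes inspected above follow from how `np.stack` treats a column of arrays: the column itself is a one-dimensional object array, and stacking joins its elements into a two-dimensional numeric matrix. A small sketch with a hypothetical three-row frame and 4-bit "fingerprints":

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each cell of the column holds its own small array
df = pd.DataFrame({"Fingerprint": [np.array([0, 1, 0, 0]),
                                   np.array([1, 0, 0, 1]),
                                   np.array([0, 0, 1, 0])]})

column = df["Fingerprint"].values   # 1-D object array holding three arrays
matrix = np.stack(column)           # 2-D numeric array, one row per fingerprint

print(column.shape, column.dtype)
print(matrix.shape)
```

scikit-learn estimators expect the two-dimensional form, which is why the stacked array rather than the raw column is passed to `model.predict`.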

In [49]:
test_y_pred = model.predict(test_x)

In [35]:
submit = pd.read_csv('/content/drive/MyDrive/데이콘/sample_submission.csv')
submit['IC50_nM'] = pIC50_to_IC50(test_y_pred)
submit.head()



Unnamed: 0,ID,IC50_nM
0,TEST_000,181.961706
1,TEST_001,31.6422
2,TEST_002,10.780527
3,TEST_003,21.376667
4,TEST_004,25.312789


In [None]:
submit.to_csv('./baseline_submit.csv', index=False)