<a href="https://colab.research.google.com/github/gottalottarock/ml-intro/blob/main/practice_covid/IB_Practice_COVID.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting molecule activity against a target

*This notebook uses data from the [Global AI Challenge](https://codenrock.com/contests/global-ai#/) competition.*

The goal of this task is to predict the activity of a ligand molecule against the target, COVID-19.

![](https://cloudfront.jove.com/files/media/science-education/science-education-thumbs/11513.jpg)

## Data analysis plan:

  1. Load the training data
  2. Preprocess the data before training the model
  3. Train the model on the training set
  4. Load and preprocess the test data
  5. Validate the model on the test set


# 0. Installing and importing libraries

In [4]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# 1. Loading the data

In [2]:
!wget https://www.dropbox.com/s/48c34raijlxc0nw/train.csv
!wget https://www.dropbox.com/s/297trreazro8ivr/test_labels.csv

--2022-03-30 16:52:29--  https://www.dropbox.com/s/48c34raijlxc0nw/train.csv
Resolving www.dropbox.com (www.dropbox.com)... 162.125.4.18, 2620:100:601c:18::a27d:612
Connecting to www.dropbox.com (www.dropbox.com)|162.125.4.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/48c34raijlxc0nw/train.csv [following]
--2022-03-30 16:52:29--  https://www.dropbox.com/s/raw/48c34raijlxc0nw/train.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucf0baa4b08dc891579dc6f163b2.dl.dropboxusercontent.com/cd/0/inline/BifKkgartxowjgLoxb5lz2ra3efTKsw8dCFBcfKqYacm-8Y8qtNqtYFZVq0l8uk257_C6oYAxVKU79SnaLwS3pEYsgf1nHOLoL3U4HfEL1fwhb572539M1M7_YPqFLvOSYs3ldJNPnYq6ZnDXl97IsNWf4Nk-zbnbCxHL4A6R-wtIw/file# [following]
--2022-03-30 16:52:30--  https://ucf0baa4b08dc891579dc6f163b2.dl.dropboxusercontent.com/cd/0/inline/BifKkgartxowjgLoxb5lz2ra3efTKsw8dCFBcfKqYacm-8Y8qtNqtYFZVq0l8uk257_C6oYAxVKU7

In [5]:
DATA_PATH = "./"
TRAIN_FILE = "train.csv"
TEST_FILE = "test_labels.csv"

SMILES_COLUMN = "smiles"
TARGET_COLUMN = "Active"

In [6]:
import pandas as pd

def load_train_test_data():
    train_csv_path = os.path.join(DATA_PATH, TRAIN_FILE)
    test_csv_path = os.path.join(DATA_PATH, TEST_FILE)
    train_data = pd.read_csv(train_csv_path, index_col = 0)
    test_data = pd.read_csv(test_csv_path,index_col = 0)
    return train_data.rename(columns = {"Smiles":SMILES_COLUMN}), test_data.rename(columns = {"Smiles":SMILES_COLUMN})

## 1.1 Data analysis and formulating the machine learning task

Molecules are represented in [SMILES notation](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system).

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/00/SMILES.png/450px-SMILES.png)

In [7]:
train_data, test_data = load_train_test_data()
train_data.head()

Unnamed: 0,smiles,Active
0,COc1ccc2[nH]cc(CCN)c2c1,False
1,CCCN1CCC[C@H](c2cccc(O)c2)C1.Cl,False
2,O=C(NO)c1cnc(N2CCN(S(=O)(=O)c3ccc4ccccc4c3)CC2...,False
3,Nc1cccc(CNC(=O)c2ccc(Oc3ccc(OCc4cccc(F)c4)cc3)...,False
4,Fc1ccccc1CNCc1ccc(-c2ccnc3[nH]ccc23)cc1,False


In [8]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5557 entries, 0 to 5556
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   smiles  5557 non-null   object
 1   Active  5557 non-null   bool  
dtypes: bool(1), object(1)
memory usage: 92.3+ KB


In [9]:
train_data[TARGET_COLUMN].value_counts()

False    5351
True      206
Name: Active, dtype: int64

## 1.2 Data preprocessing

In [10]:
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover

In [11]:
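def remove_salts_and_canonicalized(smiles: str):
    """Strip Cl/Br counter-ions and return the canonical SMILES string."""
    # Only chloride and bromide are treated as salts here.
    remover = SaltRemover(defnData="[Cl,Br]")
    mol = Chem.MolFromSmiles(smiles)  # returns None if the SMILES cannot be parsed
    res = remover.StripMol(mol)
    # MolToSmiles writes canonical SMILES by default.
    processed_smiles = Chem.MolToSmiles(res)
    return processed_smiles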

In [12]:
train_data[SMILES_COLUMN] = list(map(remove_salts_and_canonicalized, train_data[SMILES_COLUMN]))
test_data[SMILES_COLUMN] = list(map(remove_salts_and_canonicalized, test_data[SMILES_COLUMN]))

In [13]:
def change_str_target_to_int(targets: pd.Series):
    # Despite the name, the training labels are booleans (see train_data.info()); map them to 0/1.
    target_map = {True: 1, False: 0}
    processed_targets = targets.map(target_map)
    return processed_targets.values

In [14]:
train_data[TARGET_COLUMN] = change_str_target_to_int(train_data[TARGET_COLUMN])
test_data[TARGET_COLUMN] = change_str_target_to_int(test_data[TARGET_COLUMN])

In [15]:
train_data.head()

Unnamed: 0,smiles,Active
0,COc1ccc2[nH]cc(CCN)c2c1,0
1,CCCN1CCC[C@H](c2cccc(O)c2)C1,0
2,O=C(NO)c1cnc(N2CCN(S(=O)(=O)c3ccc4ccccc4c3)CC2...,0
3,Nc1cccc(CNC(=O)c2ccc(Oc3ccc(OCc4cccc(F)c4)cc3)...,0
4,Fc1ccccc1CNCc1ccc(-c2ccnc3[nH]ccc23)cc1,0
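Comparing this preview with the one shown before preprocessing, row 1 has lost its `.Cl` counter-ion. As a rough, string-level illustration of that effect, the hypothetical helper below drops dot-separated fragments that are bare chloride or bromide. This is only a sketch under that assumption: RDKit's `SaltRemover` matches patterns on the molecular graph rather than on text, and `Chem.MolToSmiles` additionally canonicalizes the result, which this toy does not attempt.

```python
# Toy illustration only: real salt stripping should go through RDKit.
SALT_FRAGMENTS = {"Cl", "Br", "[Cl-]", "[Br-]"}  # assumed set for this sketch


def strip_simple_salts(smiles: str) -> str:
    """Drop dot-separated fragments that are bare Cl/Br counter-ions."""
    fragments = smiles.split(".")
    kept = [frag for frag in fragments if frag not in SALT_FRAGMENTS]
    # If every fragment was a salt, keep the input unchanged (a choice made for this sketch).
    return ".".join(kept) if kept else smiles


print(strip_simple_salts("CCCN1CCC[C@H](c2cccc(O)c2)C1.Cl"))
# CCCN1CCC[C@H](c2cccc(O)c2)C1
```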


## 1.3 Feature engineering

A molecule can be represented as a fingerprint: a vector of features computed by a specific algorithm.

We will compute fingerprints with the RDKit library. The different fingerprint types and their descriptions can be found here: https://www.rdkit.org/docs/GettingStartedInPython.html#fingerprinting-and-molecular-similarity
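To make the idea of a hashed bit-vector fingerprint concrete, here is a toy sketch using only the standard library. It is not RDKit's algorithm: it hashes short substrings of the SMILES text, whereas real fingerprints such as ECFP hash atom environments of the molecular graph. The function names, the 64-bit length, and the substring length are all assumptions chosen for illustration.

```python
import zlib


def toy_fingerprint(smiles: str, n_bits: int = 64, max_len: int = 3) -> list:
    """Hash every substring of length 1..max_len into a fixed-size bit vector."""
    bits = [0] * n_bits
    for length in range(1, max_len + 1):
        for start in range(len(smiles) - length + 1):
            fragment = smiles[start:start + length]
            # crc32 is deterministic across runs, unlike the built-in hash().
            bits[zlib.crc32(fragment.encode()) % n_bits] = 1
    return bits


def tanimoto(a: list, b: list) -> float:
    """Tanimoto (Jaccard) similarity of two equal-length bit vectors."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


fp_a = toy_fingerprint("COc1ccc2[nH]cc(CCN)c2c1")
fp_b = toy_fingerprint("Fc1ccccc1CNCc1ccc(-c2ccnc3[nH]ccc23)cc1")
print(len(fp_a), tanimoto(fp_a, fp_a), round(tanimoto(fp_a, fp_b), 2))
```

The fixed length is what lets molecules of any size become rows of one feature matrix, at the cost of occasional bit collisions between unrelated fragments.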

![](https://sun9-64.userapi.com/impf/_8Zy5WO6Mt0SIPx1YS02DeErAoZ0RHcwgc-kZg/Md98bNVzBg0.jpg?size=831x415&quality=96&sign=cb20481128a04ff523fd662dd0e604ab&type=album)


### Morgan fingerprints (ECFP)

![](https://d3i71xaburhd42.cloudfront.net/52adf3589e8b7b9855353e5815669258ef6e3405/6-Figure2-1.png)

In [16]:
from enum import Enum
from functools import partial
from rdkit import Chem, DataStructs
from rdkit.DataStructs import ExplicitBitVect
from rdkit.Chem import AllChem, MACCSkeys
from typing import List

In [17]:
class FingerprintsNames(Enum):
    ECFP4 = "morgan_2_2048"
    RDKitFP = "RDKFingerprint"
    TOPOTORSION = "topological_torsion"
    MACCS = "MACCSkeys"
    PATTERN = "PatternFingerprint"
    ATOMPAIR = "AtomPairFingerprint"



FINGERPRINTS_METHODS = {
    FingerprintsNames.ECFP4: partial(AllChem.GetMorganFingerprintAsBitVect, radius=2, nBits=2048),
    FingerprintsNames.RDKitFP: partial(Chem.RDKFingerprint, fpSize=2048),#TODO
    FingerprintsNames.TOPOTORSION: partial(AllChem.GetHashedTopologicalTorsionFingerprintAsBitVect, nBits=2048),#TODO
    FingerprintsNames.MACCS: MACCSkeys.GenMACCSKeys,#TODO
    FingerprintsNames.PATTERN: partial(Chem.PatternFingerprint,fpSize=1024),#TODO
    FingerprintsNames.ATOMPAIR: partial(AllChem.GetHashedAtomPairFingerprintAsBitVect,nBits=2048)}#TODO


In [18]:
fingerprint_type_name = FingerprintsNames.ECFP4
fingerprint_type_method = FINGERPRINTS_METHODS[fingerprint_type_name]

In [19]:
def bit_vectors_to_numpy_arrays(fps: List[ExplicitBitVect]) -> np.ndarray:
    # ConvertToNumpyArray resizes each placeholder array in place to the fingerprint length.
    output_arrays = [np.zeros((1,)) for _ in range(len(fps))]
    for fp, output_array in zip(fps, output_arrays):
        DataStructs.ConvertToNumpyArray(fp, output_array)
    return np.asarray(output_arrays)

def get_np_array_of_fps(fp_type, smiles: List[str]):
    # Compute the selected fingerprint (fp_type) for every molecule
    mols = [Chem.MolFromSmiles(m) for m in smiles]
    fp = list(map(fp_type, mols))
    return bit_vectors_to_numpy_arrays(fp)

In [20]:
train_fp = get_np_array_of_fps(fp_type=fingerprint_type_method, smiles=train_data[SMILES_COLUMN])
test_fp = get_np_array_of_fps(fp_type=fingerprint_type_method, smiles=test_data[SMILES_COLUMN])

In [21]:
y_train = train_data[TARGET_COLUMN]
y_test = test_data[TARGET_COLUMN]

# 2. Preparing to train the model

## 2.1 Cross-validation

![](https://pubs.rsc.org/image/article/2018/SC/c7sc02664a/c7sc02664a-f3_hi-res.gif)

In [33]:
from dgllife.utils import ScaffoldSplitter

Using backend: pytorch


In [34]:
class ScaffoldCVSklearn:
    def __init__(self, data, k_folds):
        self.scaffold_splits = ScaffoldSplitter.k_fold_split(data, k=k_folds)

    def split(self):
        indices_splits = []
        for train_data, val_data in self.scaffold_splits:
            train_indices = train_data.indices
            val_indices = val_data.indices
            indices_splits.append((train_indices, val_indices))
        return indices_splits

    def convert_data_to_indices(self, dataset):
        indices = [index for index, row in dataset.iterrows()]
        return indices


In [35]:
cv = ScaffoldCVSklearn(train_data, k_folds=3).split()

Start initializing RDKit molecule instances...
Creating RDKit molecule instance 1000/5557
Creating RDKit molecule instance 2000/5557
Creating RDKit molecule instance 3000/5557
Creating RDKit molecule instance 4000/5557
Creating RDKit molecule instance 5000/5557
Start computing Bemis-Murcko scaffolds.
Computing Bemis-Murcko for compound 1000/5557
Computing Bemis-Murcko for compound 2000/5557
Computing Bemis-Murcko for compound 3000/5557
Computing Bemis-Murcko for compound 4000/5557
Computing Bemis-Murcko for compound 5000/5557
Processing fold 1/3
Processing fold 2/3
Processing fold 3/3
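The log above shows the splitter computing a Bemis-Murcko scaffold for every molecule before forming the folds. The point of a scaffold split is that molecules sharing a scaffold end up in the same fold, so validation measures how well the model generalizes to unseen chemotypes. The sketch below illustrates that grouping idea only; it is not dgllife's implementation. It assumes the scaffold keys are already known (the labels are hypothetical) and uses a simple greedy assignment chosen for this example.

```python
from collections import defaultdict


def group_k_fold(group_keys, k):
    """Return k (train_indices, val_indices) pairs; a group never spans two folds."""
    groups = defaultdict(list)
    for index, key in enumerate(group_keys):
        groups[key].append(index)

    folds = [[] for _ in range(k)]
    # Place the largest groups first, always into the currently smallest fold.
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)

    splits = []
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((sorted(train), sorted(val)))
    return splits


# Hypothetical scaffold labels for eight molecules.
scaffolds = ["indole", "indole", "piperidine", "biphenyl",
             "indole", "piperidine", "biphenyl", "pyridine"]
for train_idx, val_idx in group_k_fold(scaffolds, k=3):
    print(train_idx, val_idx)
```

Because the folds are built from whole groups, they can differ in size and in the share of active molecules, which is one reason fold scores may vary more than with a random split.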


# Assignment (10 points + 3 bonus)
1. (3 points) Add a solution to the class imbalance problem

Options:
* UnderSampling
* OverSampling
* SMOTE
* Built-in model tools (`scale_pos_weight`)

2. (2 points) Use 2 more fingerprint types from `FingerprintsNames`

3. (3 points) Achieve an f1-score above 0.35 on the test dataset

Options:
* Increase the number of parameters in the hyperparameter search
* Use other hyperparameter search algorithms (for example, [RandomizedSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html))
* Use other models (Random Forest, SVC, MLPClassifier, etc.)

4. (2 points) Logging

As the final result, provide a table (a `pd.DataFrame` is fine) with the columns: Model, Fingerprint, Best Parameters, Mean Cross-Validation Score, Std Cross-Validation Score, Test Score

Analyze the results:
* Which fingerprints gave the best result?
* Which model gave the best result?
* Do the cross-validation scores correlate with the test scores?

5. (Bonus, +3 points) Achieve an f1-score above 0.45 on the test dataset
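Of the imbalance options listed under item 1, random oversampling is the simplest to reason about: minority-class rows are duplicated until both classes have the same size. The sketch below is a minimal pure-Python illustration with a hypothetical helper name and made-up data; in practice a library such as `imblearn` would be the more reliable choice. Resampling should be applied to the training part of each fold only, otherwise duplicated rows leak into validation.

```python
import random


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("cannot oversample a class with no samples")
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    order = list(range(len(y))) + extra
    rng.shuffle(order)
    return [X[i] for i in order], [y[i] for i in order]


# Tiny made-up example: 6 inactive rows, 2 active rows.
X_toy = [[0, 1], [1, 1], [0, 0], [1, 0], [1, 1], [0, 1], [1, 0], [0, 0]]
y_toy = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(sum(y_bal), len(y_bal) - sum(y_bal))  # 6 6
```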

In [53]:
from imblearn.over_sampling import SMOTEN, ADASYN
from joblib import parallel_backend
import lightgbm
import json
from tqdm.auto import tqdm

In [26]:
fingerprint_type_name = FingerprintsNames.MACCS
fingerprint_type_method = FINGERPRINTS_METHODS[fingerprint_type_name]

In [27]:
train_fp = get_np_array_of_fps(fp_type=fingerprint_type_method, smiles=train_data[SMILES_COLUMN])
test_fp = get_np_array_of_fps(fp_type=fingerprint_type_method, smiles=test_data[SMILES_COLUMN])

y_train = train_data[TARGET_COLUMN]
y_test = test_data[TARGET_COLUMN]

In [36]:
cv = ScaffoldCVSklearn(train_data, k_folds=3).split()

Start initializing RDKit molecule instances...
Creating RDKit molecule instance 1000/5557
Creating RDKit molecule instance 2000/5557
Creating RDKit molecule instance 3000/5557
Creating RDKit molecule instance 4000/5557
Creating RDKit molecule instance 5000/5557
Start computing Bemis-Murcko scaffolds.
Computing Bemis-Murcko for compound 1000/5557
Computing Bemis-Murcko for compound 2000/5557
Computing Bemis-Murcko for compound 3000/5557
Computing Bemis-Murcko for compound 4000/5557
Computing Bemis-Murcko for compound 5000/5557
Processing fold 1/3
Processing fold 2/3
Processing fold 3/3


In [37]:
y_train.shape/y_train.sum()

array([26.97572816])

We use `scale_pos_weight = 27`.
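The ratio computed above divides the total number of rows by the number of positives. A commonly cited heuristic for `scale_pos_weight` (for example in the XGBoost documentation) is negatives over positives instead. With the class counts printed earlier in this notebook the two differ by exactly 1, so either is a plausible starting point. A quick check of the arithmetic:

```python
# Class counts taken from the value_counts() output earlier in the notebook.
n_negative = 5351
n_positive = 206

total_over_positive = (n_negative + n_positive) / n_positive
negative_over_positive = n_negative / n_positive

print(round(total_over_positive, 2))     # 26.98
print(round(negative_over_positive, 2))  # 25.98
```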

In [38]:
from sklearn.metrics import f1_score

import optuna


class Objective:
    def __init__(self, X_data, y_data, cv):
        self.X_data = X_data.astype(int)
        self.y_data = y_data.values.astype(int)
        self.cv = cv
        self.train_data = []
        self.test_data = []
        for train_index, test_index in tqdm(cv):
            X_train, X_test = (
                self.X_data[train_index, :],
                self.X_data[test_index, :],
            )
            y_train, y_test = (
                self.y_data[train_index],
                self.y_data[test_index],
            )
            X_train,y_train = self.oversample(X_train,y_train)
            self.train_data.append((X_train,y_train))
            self.test_data.append((X_test,y_test))
        
    def oversample(self, x, y):
        return x,y
#         self.oversampler = SMOTEN(k_neighbors=5)
#         return self.oversampler.fit_resample(x, y)

    def __call__(self, trial):

        params = {
            # Fixed at 0, which disables bagging in LightGBM, so "subsample" below has no effect.
            "subsample_freq": trial.suggest_int("subsample_freq", 0, 0),
            "n_estimators": trial.suggest_int("n_estimators", 50, 300),  # n_trees
            "reg_alpha": trial.suggest_loguniform("reg_alpha", 1e-20, 2),
            "reg_lambda": trial.suggest_loguniform("reg_lambda", 1e-20, 2),
            "colsample_bytree": trial.suggest_uniform("colsample_bytree", 0.01, 1.0),
            "subsample": trial.suggest_uniform("subsample", 0.01, 1.0),
            "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
            "max_depth": trial.suggest_int("max_depth", 3, 20),
            "learning_rate": trial.suggest_uniform("learning_rate", 0.001, 0.3),
#             "scale_pos_weight":trial.suggest_int("scale_pos_weight",27)
            "boosting_type":trial.suggest_categorical("boosting_type",['gbdt'])
        }
        params = self.add_params(params)

        result = []
        for (X_train,y_train), (X_test,y_test) in zip(self.train_data, self.test_data):
           
            model = lightgbm.LGBMClassifier(**params)

            model.fit(X_train, y_train)

            y_pred = model.predict_proba(X_test)[:, 1].astype(float)

            f1_val = f1_score(y_test, y_pred > 0.5)

            result.append(f1_val)

        return np.mean(result)
    
    def add_params(self, params):
        params.update({
            "verbosity": -1,
            "n_jobs": 2,
            "device": 'gpu',
            "num_leaves":2 ** min(16,params["max_depth"]),    
            "scale_pos_weight":27
        })
        return params


# Load the dataset in advance for reusing it each trial execution.
objective = Objective(train_fp, y_train,cv=cv)




In [44]:
objective = Objective(train_fp, y_train,cv=cv)
study = optuna.create_study(direction="maximize")



[I 2022-05-11 14:40:40,479] A new study created in memory with name: no-name-ad997acc-7ed7-4ed7-9c00-c6d52baa7704


In [47]:
# study.enqueue_trial(best_params)
study.optimize(objective, n_trials=200, n_jobs=10)
print(study.best_trial)

[32m[I 2022-05-11 14:41:09,753][0m Trial 18 finished with value: 0.20945547040261467 and parameters: {'subsample_freq': 0, 'n_estimators': 73, 'reg_alpha': 0.003351701387542272, 'reg_lambda': 5.6307283729471915e-14, 'colsample_bytree': 0.8336910727762574, 'subsample': 0.7492357951602373, 'min_child_samples': 54, 'max_depth': 5, 'learning_rate': 0.2444327163473863, 'boosting_type': 'gbdt'}. Best is trial 4 with value: 0.22764005200919402.[0m
[32m[I 2022-05-11 14:41:14,647][0m Trial 12 finished with value: 0.2004932624724607 and parameters: {'subsample_freq': 0, 'n_estimators': 70, 'reg_alpha': 4.55496554070594e-11, 'reg_lambda': 9.422614599900819e-07, 'colsample_bytree': 0.19641824008100697, 'subsample': 0.11148345131315113, 'min_child_samples': 72, 'max_depth': 12, 'learning_rate': 0.1400929372422486, 'boosting_type': 'gbdt'}. Best is trial 4 with value: 0.22764005200919402.[0m
[32m[I 2022-05-11 14:41:14,776][0m Trial 11 finished with value: 0.22995733213563888 and parameters: 

[32m[I 2022-05-11 14:42:04,765][0m Trial 32 finished with value: 0.0 and parameters: {'subsample_freq': 0, 'n_estimators': 236, 'reg_alpha': 5.09379834502669e-08, 'reg_lambda': 2.954443804257021e-19, 'colsample_bytree': 0.01542224832797634, 'subsample': 0.9818445631832556, 'min_child_samples': 5, 'max_depth': 19, 'learning_rate': 0.005204753620513147, 'boosting_type': 'gbdt'}. Best is trial 11 with value: 0.22995733213563888.[0m
[32m[I 2022-05-11 14:42:07,623][0m Trial 34 finished with value: 0.2399315458618776 and parameters: {'subsample_freq': 0, 'n_estimators': 234, 'reg_alpha': 1.1267664392279897, 'reg_lambda': 1.5491275220753475, 'colsample_bytree': 0.6343436663278159, 'subsample': 0.9898149799757036, 'min_child_samples': 59, 'max_depth': 9, 'learning_rate': 0.05725674662726693, 'boosting_type': 'gbdt'}. Best is trial 34 with value: 0.2399315458618776.[0m
[32m[I 2022-05-11 14:42:11,242][0m Trial 31 finished with value: 0.136249674960027 and parameters: {'subsample_freq': 0

[32m[I 2022-05-11 14:42:59,924][0m Trial 41 finished with value: 0.24086509376890505 and parameters: {'subsample_freq': 0, 'n_estimators': 298, 'reg_alpha': 0.04026873705924034, 'reg_lambda': 7.250425980162591e-07, 'colsample_bytree': 0.25123010771223203, 'subsample': 0.3612934890701701, 'min_child_samples': 59, 'max_depth': 10, 'learning_rate': 0.05204268207819846, 'boosting_type': 'gbdt'}. Best is trial 43 with value: 0.2480351127896295.[0m
[32m[I 2022-05-11 14:43:06,507][0m Trial 51 finished with value: 0.21077460225349384 and parameters: {'subsample_freq': 0, 'n_estimators': 103, 'reg_alpha': 7.895092856484511e-05, 'reg_lambda': 2.5025135417173418e-15, 'colsample_bytree': 0.6830508665063145, 'subsample': 0.629640779178273, 'min_child_samples': 86, 'max_depth': 9, 'learning_rate': 0.19130332420906795, 'boosting_type': 'gbdt'}. Best is trial 43 with value: 0.2480351127896295.[0m
[32m[I 2022-05-11 14:43:12,530][0m Trial 47 finished with value: 0.19805825242718447 and parameter

[32m[I 2022-05-11 14:44:11,525][0m Trial 61 finished with value: 0.21570313059674762 and parameters: {'subsample_freq': 0, 'n_estimators': 196, 'reg_alpha': 0.0014360619486786803, 'reg_lambda': 1.7314791322552283e-09, 'colsample_bytree': 0.824222479033268, 'subsample': 0.4212563156596175, 'min_child_samples': 14, 'max_depth': 11, 'learning_rate': 0.06537663707912497, 'boosting_type': 'gbdt'}. Best is trial 43 with value: 0.2480351127896295.[0m
[32m[I 2022-05-11 14:44:14,510][0m Trial 68 finished with value: 0.24285549723896124 and parameters: {'subsample_freq': 0, 'n_estimators': 200, 'reg_alpha': 0.004467115210107101, 'reg_lambda': 1.6331587438223856e-09, 'colsample_bytree': 0.8630285503425541, 'subsample': 0.4414297684123353, 'min_child_samples': 54, 'max_depth': 11, 'learning_rate': 0.039214675482532824, 'boosting_type': 'gbdt'}. Best is trial 43 with value: 0.2480351127896295.[0m
[32m[I 2022-05-11 14:44:23,040][0m Trial 70 finished with value: 0.24094168411701247 and parame

[32m[I 2022-05-11 14:45:35,879][0m Trial 84 finished with value: 0.2411495731075626 and parameters: {'subsample_freq': 0, 'n_estimators': 225, 'reg_alpha': 0.006858444776331063, 'reg_lambda': 0.049626774087273784, 'colsample_bytree': 0.9524431382486744, 'subsample': 0.3124339565280576, 'min_child_samples': 38, 'max_depth': 14, 'learning_rate': 0.01727485159138503, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:45:36,467][0m Trial 85 finished with value: 0.22580401997830982 and parameters: {'subsample_freq': 0, 'n_estimators': 230, 'reg_alpha': 0.00600230972726785, 'reg_lambda': 0.02421219477010415, 'colsample_bytree': 0.9992848130085711, 'subsample': 0.4987893994290143, 'min_child_samples': 45, 'max_depth': 14, 'learning_rate': 0.012010159239652003, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:45:43,189][0m Trial 89 finished with value: 0.22578924965265276 and parameters: {

[32m[I 2022-05-11 14:47:04,654][0m Trial 105 finished with value: 0.22423841192460855 and parameters: {'subsample_freq': 0, 'n_estimators': 204, 'reg_alpha': 0.0004223919114727188, 'reg_lambda': 0.17582406522624192, 'colsample_bytree': 0.8499046963313949, 'subsample': 0.283935107388822, 'min_child_samples': 28, 'max_depth': 12, 'learning_rate': 0.044835981860321165, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:47:07,086][0m Trial 101 finished with value: 0.22162380708166793 and parameters: {'subsample_freq': 0, 'n_estimators': 241, 'reg_alpha': 0.0003462258033396657, 'reg_lambda': 0.0003940910090433806, 'colsample_bytree': 0.9030535351613038, 'subsample': 0.2526775981975868, 'min_child_samples': 32, 'max_depth': 15, 'learning_rate': 0.030919098733126524, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:47:12,660][0m Trial 103 finished with value: 0.23296494355317887 and param

[32m[I 2022-05-11 14:48:46,790][0m Trial 116 finished with value: 0.23907789653078063 and parameters: {'subsample_freq': 0, 'n_estimators': 256, 'reg_alpha': 0.003182807967533366, 'reg_lambda': 0.003133469342572049, 'colsample_bytree': 0.9553227788023705, 'subsample': 0.2268683543217862, 'min_child_samples': 62, 'max_depth': 16, 'learning_rate': 0.05770233523440567, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:48:48,276][0m Trial 123 finished with value: 0.23844937231210628 and parameters: {'subsample_freq': 0, 'n_estimators': 168, 'reg_alpha': 0.003036485541460723, 'reg_lambda': 2.9185578038109497e-09, 'colsample_bytree': 0.8878845089871275, 'subsample': 0.3709846930024303, 'min_child_samples': 41, 'max_depth': 10, 'learning_rate': 0.039365432411338934, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:48:53,202][0m Trial 125 finished with value: 0.238800505050505 and paramet

[32m[I 2022-05-11 14:49:49,694][0m Trial 141 finished with value: 0.24673043957756438 and parameters: {'subsample_freq': 0, 'n_estimators': 233, 'reg_alpha': 2.290875373995342e-20, 'reg_lambda': 6.269584521486397e-10, 'colsample_bytree': 0.8681290977206413, 'subsample': 0.07744124843861527, 'min_child_samples': 69, 'max_depth': 9, 'learning_rate': 0.04935584484669347, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:49:52,012][0m Trial 140 finished with value: 0.229615949799436 and parameters: {'subsample_freq': 0, 'n_estimators': 215, 'reg_alpha': 0.008397654678012493, 'reg_lambda': 1.0078295947806827e-10, 'colsample_bytree': 0.8725581439056128, 'subsample': 0.44785181026034176, 'min_child_samples': 37, 'max_depth': 9, 'learning_rate': 0.04903878788538539, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:49:57,513][0m Trial 138 finished with value: 0.23729169757207144 and parame

[32m[I 2022-05-11 14:50:58,573][0m Trial 168 finished with value: 0.18935041378281656 and parameters: {'subsample_freq': 0, 'n_estimators': 226, 'reg_alpha': 7.67364580953976e-11, 'reg_lambda': 6.09595927177505e-09, 'colsample_bytree': 0.8404186490455912, 'subsample': 0.1256185859844235, 'min_child_samples': 47, 'max_depth': 3, 'learning_rate': 0.20437413229927603, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:51:09,290][0m Trial 164 finished with value: 0.2210956817597373 and parameters: {'subsample_freq': 0, 'n_estimators': 227, 'reg_alpha': 0.0012674149544937463, 'reg_lambda': 1.948235997935126e-10, 'colsample_bytree': 0.8353221242047949, 'subsample': 0.023203917638679475, 'min_child_samples': 58, 'max_depth': 8, 'learning_rate': 0.04940837801181824, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:51:09,926][0m Trial 165 finished with value: 0.23429387331256488 and paramet

[32m[I 2022-05-11 14:51:56,896][0m Trial 178 finished with value: 0.18990845031066947 and parameters: {'subsample_freq': 0, 'n_estimators': 221, 'reg_alpha': 0.04411455161248252, 'reg_lambda': 1.580530572952059e-11, 'colsample_bytree': 0.8983745317755413, 'subsample': 0.06458728638936216, 'min_child_samples': 50, 'max_depth': 13, 'learning_rate': 0.15543500455218018, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:52:04,887][0m Trial 179 finished with value: 0.2433784853139692 and parameters: {'subsample_freq': 0, 'n_estimators': 218, 'reg_alpha': 0.05088337102023184, 'reg_lambda': 5.6733514762328886e-08, 'colsample_bytree': 0.737215189289989, 'subsample': 0.09293583285287217, 'min_child_samples': 50, 'max_depth': 13, 'learning_rate': 0.031702600487532456, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:52:07,679][0m Trial 180 finished with value: 0.21959036547844332 and parame

[32m[I 2022-05-11 14:53:10,034][0m Trial 198 finished with value: 0.23201429033219564 and parameters: {'subsample_freq': 0, 'n_estimators': 210, 'reg_alpha': 0.2205002785733699, 'reg_lambda': 1.3880273601496087e-08, 'colsample_bytree': 0.8045748645953185, 'subsample': 0.3197405307167675, 'min_child_samples': 47, 'max_depth': 12, 'learning_rate': 0.020213300938566596, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:53:10,103][0m Trial 196 finished with value: 0.25826057181989387 and parameters: {'subsample_freq': 0, 'n_estimators': 210, 'reg_alpha': 0.19977725525812978, 'reg_lambda': 9.536865592481625e-08, 'colsample_bytree': 0.8053232287488338, 'subsample': 0.09169229467677023, 'min_child_samples': 19, 'max_depth': 12, 'learning_rate': 0.030077248474543593, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:53:11,151][0m Trial 200 finished with value: 0.24655548255201076 and param

[32m[I 2022-05-11 14:54:08,644][0m Trial 212 finished with value: 0.21610524551701019 and parameters: {'subsample_freq': 0, 'n_estimators': 191, 'reg_alpha': 0.015043652198604611, 'reg_lambda': 3.830815739726157e-09, 'colsample_bytree': 0.9755965085608612, 'subsample': 0.3537856304577166, 'min_child_samples': 21, 'max_depth': 14, 'learning_rate': 0.048948480906600235, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:54:10,199][0m Trial 217 finished with value: 0.2033909273280269 and parameters: {'subsample_freq': 0, 'n_estimators': 194, 'reg_alpha': 0.0008700728771744578, 'reg_lambda': 3.503810753949442e-09, 'colsample_bytree': 0.9767204719820819, 'subsample': 0.08960066102303739, 'min_child_samples': 40, 'max_depth': 9, 'learning_rate': 0.2152247444869479, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 14:54:13,728][0m Trial 219 finished with value: 0.2461551137858146 and paramet

FrozenTrial(number=75, values=[0.2655040166473093], datetime_start=datetime.datetime(2022, 5, 11, 14, 44, 14, 513377), datetime_complete=datetime.datetime(2022, 5, 11, 14, 44, 43, 46694), params={'subsample_freq': 0, 'n_estimators': 222, 'reg_alpha': 0.010136808578101649, 'reg_lambda': 0.0038821761773882787, 'colsample_bytree': 0.9384661847570124, 'subsample': 0.3790813020841549, 'min_child_samples': 69, 'max_depth': 13, 'learning_rate': 0.038987514479036006, 'boosting_type': 'gbdt'}, distributions={'subsample_freq': IntUniformDistribution(high=0, low=0, step=1), 'n_estimators': IntUniformDistribution(high=300, low=50, step=1), 'reg_alpha': LogUniformDistribution(high=2.0, low=1e-20), 'reg_lambda': LogUniformDistribution(high=2.0, low=1e-20), 'colsample_bytree': UniformDistribution(high=1.0, low=0.01), 'subsample': UniformDistribution(high=1.0, low=0.01), 'min_child_samples': IntUniformDistribution(high=100, low=5, step=1), 'max_depth': IntUniformDistribution(high=20, low=3, step=1), '
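The trial record lists `LogUniformDistribution(high=2.0, low=1e-20)` for `reg_alpha` and `reg_lambda`, meaning values are drawn uniformly in log space. The sketch below reproduces that kind of draw with the standard library and computes what fraction of the range lies above a threshold. It is an illustration under the assumption of independent draws; Optuna's samplers may adapt their proposals based on earlier trials, so this is not a model of the actual search.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def log_range_fraction(low, high, threshold):
    """Fraction of the log-uniform probability mass lying above `threshold`."""
    return (math.log(high) - math.log(threshold)) / (math.log(high) - math.log(low))


rng = random.Random(0)
draws = [sample_log_uniform(1e-20, 2.0, rng) for _ in range(5)]
print(draws)
print(round(log_range_fraction(1e-20, 2.0, 1e-3), 3))  # 0.163
```

Under this assumption, more than 80% of independent draws fall below 1e-3, where the regularization penalty is likely negligible. Raising the lower bound (for instance to something like 1e-8) may make the search spend more of its budget on values that change the model.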

In [56]:
study.optimize(objective, n_trials=100, n_jobs=10)
print(study.best_trial)

[32m[I 2022-05-11 15:07:05,219][0m Trial 224 finished with value: 0.22054499426073557 and parameters: {'subsample_freq': 0, 'n_estimators': 58, 'reg_alpha': 0.001328336111034693, 'reg_lambda': 6.809616214280116e-10, 'colsample_bytree': 0.8257651224032235, 'subsample': 0.2993531663662367, 'min_child_samples': 13, 'max_depth': 12, 'learning_rate': 0.027219551496184623, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 15:07:35,491][0m Trial 230 finished with value: 0.22638125605122172 and parameters: {'subsample_freq': 0, 'n_estimators': 240, 'reg_alpha': 0.05885119612865432, 'reg_lambda': 1.979703567876392e-06, 'colsample_bytree': 0.9465154295847428, 'subsample': 0.14647098828450716, 'min_child_samples': 73, 'max_depth': 12, 'learning_rate': 0.03618363541390248, 'boosting_type': 'gbdt'}. Best is trial 75 with value: 0.2655040166473093.[0m
[32m[I 2022-05-11 15:07:35,697][0m Trial 222 finished with value: 0.2608065625307005 and paramet

KeyboardInterrupt: 

In [49]:
best_params = objective.add_params(study.best_params)

In [50]:
# adasyn = ADASYN(sampling_strategy=1,
#                n_neighbors=2)
# smote = SMOTEN(k_neighbors=5)
# Set subsample_freq before constructing the model; assigning it afterwards has no effect.
best_params['subsample_freq'] = 0
final_model = lightgbm.LGBMClassifier(**best_params)
final_model.fit(train_fp, y_train)
test_predictions = final_model.predict(test_fp)
score = f1_score(y_test, test_predictions)
print(f"Best model test f1 score is {round(score, 3)}")

Best model test f1 score is 0.439
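F1 is the harmonic mean of precision and recall on the positive class, so it ignores true negatives. That suits this data set, where only 206 of the 5557 training molecules are active and plain accuracy would look high even for a model that never predicts "active". The sketch below reimplements the metric for illustration (the notebook itself relies on `sklearn.metrics.f1_score`) and shows why a model that predicts no positives scores 0.0, as some trials in the log above did. The toy labels are made up.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
all_negative = [0] * 10
two_hits_one_false_alarm = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print(f1_from_labels(y_true, all_negative))                        # 0.0
print(round(f1_from_labels(y_true, two_hits_one_false_alarm), 3))  # 0.667
```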


In [51]:
best_params

{'subsample_freq': 0,
 'n_estimators': 222,
 'reg_alpha': 0.010136808578101649,
 'reg_lambda': 0.0038821761773882787,
 'colsample_bytree': 0.9384661847570124,
 'subsample': 0.3790813020841549,
 'min_child_samples': 69,
 'max_depth': 13,
 'learning_rate': 0.038987514479036006,
 'boosting_type': 'gbdt',
 'verbosity': -1,
 'n_jobs': 2,
 'device': 'gpu',
 'num_leaves': 8192,
 'scale_pos_weight': 27}

In [55]:
with open("best_params_maccs_2.json",'w') as f:
    json.dump(best_params,f)
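Item 4 of the assignment asks for a results table. One possible way to collect its rows is sketched below with the standard library only. The helper name is hypothetical, and the numbers in the usage example are placeholders that show the shape of a row; they are not results from this notebook. Only the column names come from the assignment text.

```python
import statistics


def summarize_run(model, fingerprint, best_params, fold_scores, test_score):
    """Build one row of the results table from per-fold cross-validation scores."""
    return {
        "Model": model,
        "Fingerprint": fingerprint,
        "Best Parameters": best_params,
        "Mean Cross-Validation Score": statistics.mean(fold_scores),
        "Std Cross-Validation Score": statistics.stdev(fold_scores),
        "Test Score": test_score,
    }


# Placeholder numbers purely to show the shape of a row; not measured results.
row = summarize_run(
    model="LGBMClassifier",
    fingerprint="MACCSkeys",
    best_params={"n_estimators": 100},
    fold_scores=[0.20, 0.25, 0.30],
    test_score=0.40,
)
results = [row]  # append one row per (model, fingerprint) experiment
print(sorted(row))
```

A list of such dictionaries can be passed to `pd.DataFrame(results)` to obtain the table. Note that the `Objective` class above returns only the mean over folds, so the per-fold scores would have to be kept as well (for example via Optuna's `trial.set_user_attr`) before a standard deviation can be reported.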

## PubChem fingerprint

In [57]:
from PyFingerprint.fingerprint import get_fingerprint, get_fingerprints
from tqdm.auto import tqdm

In [58]:
train_fp_pubchem = [get_fingerprint(smile, "pubchem") for smile in tqdm(train_data.smiles.values)]
train_fp_pubchem = np.stack([fp.to_numpy() for fp in train_fp_pubchem])


In [59]:
test_fp_pubchem = [get_fingerprint(smile, "pubchem") for smile in tqdm(test_data.smiles.values)]
test_fp_pubchem = np.stack([fp.to_numpy() for fp in test_fp_pubchem])


In [60]:
objective = Objective(train_fp_pubchem, y_train,cv=cv)



In [61]:
study = optuna.create_study(direction="maximize")
# study.enqueue_trial(best_params)
study.optimize(objective, n_trials=200, n_jobs=10)
print(study.best_trial)

[I 2022-05-11 15:08:42,844] A new study created in memory with name: no-name-4de76583-fbd1-4388-8214-70d77d2408fb
[I 2022-05-11 15:08:47,752] Trial 6 finished with value: 0.19040701441883753 and parameters: {'subsample_freq': 0, 'n_estimators': 72, 'reg_alpha': 8.712520548970911e-15, 'reg_lambda': 0.00799757862990873, 'colsample_bytree': 0.39414361896016287, 'subsample': 0.08151113171536317, 'min_child_samples': 97, 'max_depth': 6, 'learning_rate': 0.19278202190328259, 'boosting_type': 'gbdt'}. Best is trial 6 with value: 0.19040701441883753.
[I 2022-05-11 15:08:52,888] Trial 9 finished with value: 0.0 and parameters: {'subsample_freq': 0, 'n_estimators': 272, 'reg_alpha': 1.680845308239597e-19, 'reg_lambda': 5.77566254948995e-20, 'colsample_bytree': 0.03580549527879794, 'subsample': 0.2865200200521715, 'min_child_samples': 58, 'max_depth': 5, 'learning_rate': 0.002113073796533035, 'boosting_type': 'gbdt'}. Best is trial 6 with value: 0.19040701441883

[I 2022-05-11 15:09:47,388] Trial 20 finished with value: 0.24355001606757562 and parameters: {'subsample_freq': 0, 'n_estimators': 207, 'reg_alpha': 0.9088762869229303, 'reg_lambda': 6.083804990459589e-17, 'colsample_bytree': 0.12295554768431324, 'subsample': 0.7036243085303481, 'min_child_samples': 42, 'max_depth': 15, 'learning_rate': 0.06998323693579427, 'boosting_type': 'gbdt'}. Best is trial 12 with value: 0.2701040615251076.
[I 2022-05-11 15:09:47,416] Trial 23 finished with value: 0.26073841136734216 and parameters: {'subsample_freq': 0, 'n_estimators': 117, 'reg_alpha': 2.2319711791715762e-07, 'reg_lambda': 1.2721633080933487e-07, 'colsample_bytree': 0.11692318641563704, 'subsample': 0.9221193099255445, 'min_child_samples': 61, 'max_depth': 18, 'learning_rate': 0.1505524121099715, 'boosting_type': 'gbdt'}. Best is trial 12 with value: 0.2701040615251076.
[I 2022-05-11 15:10:02,212] Trial 26 finished with value: 0.2569644505980525 and paramete

[I 2022-05-11 15:10:42,501] Trial 41 finished with value: 0.13147240623250303 and parameters: {'subsample_freq': 0, 'n_estimators': 154, 'reg_alpha': 5.888593357906554e-06, 'reg_lambda': 2.2052453810840945e-05, 'colsample_bytree': 0.010382469143678325, 'subsample': 0.8674326936693462, 'min_child_samples': 53, 'max_depth': 13, 'learning_rate': 0.116899280002713, 'boosting_type': 'gbdt'}. Best is trial 12 with value: 0.2701040615251076.
[I 2022-05-11 15:10:44,710] Trial 30 finished with value: 0.26021478930972286 and parameters: {'subsample_freq': 0, 'n_estimators': 128, 'reg_alpha': 1.8669098565467498e-07, 'reg_lambda': 1.9260393162534756e-06, 'colsample_bytree': 0.20613156986177822, 'subsample': 0.9639862071492218, 'min_child_samples': 67, 'max_depth': 20, 'learning_rate': 0.1547966862783023, 'boosting_type': 'gbdt'}. Best is trial 12 with value: 0.2701040615251076.
[I 2022-05-11 15:10:48,207] Trial 43 finished with value: 0.13801544018527204 and para

[I 2022-05-11 15:11:41,548] Trial 52 finished with value: 0.2195139803835456 and parameters: {'subsample_freq': 0, 'n_estimators': 85, 'reg_alpha': 2.91382481537295e-17, 'reg_lambda': 5.9495058701016315e-08, 'colsample_bytree': 0.33037500234702477, 'subsample': 0.8785273172487768, 'min_child_samples': 38, 'max_depth': 17, 'learning_rate': 0.22618650039400817, 'boosting_type': 'gbdt'}. Best is trial 12 with value: 0.2701040615251076.
[I 2022-05-11 15:11:42,199] Trial 53 finished with value: 0.22532546339442894 and parameters: {'subsample_freq': 0, 'n_estimators': 91, 'reg_alpha': 0.000561690291769916, 'reg_lambda': 5.3954473761502845e-14, 'colsample_bytree': 0.29534134477837726, 'subsample': 0.11824881618944746, 'min_child_samples': 38, 'max_depth': 17, 'learning_rate': 0.03270181195058028, 'boosting_type': 'gbdt'}. Best is trial 12 with value: 0.2701040615251076.
[I 2022-05-11 15:11:47,172] Trial 55 finished with value: 0.25628084357666936 and paramet

[I 2022-05-11 15:13:04,939] Trial 81 finished with value: 0.23736854373191832 and parameters: {'subsample_freq': 0, 'n_estimators': 51, 'reg_alpha': 1.5993040538463997e-18, 'reg_lambda': 0.001727502687829426, 'colsample_bytree': 0.265592850448767, 'subsample': 0.46050173558633245, 'min_child_samples': 57, 'max_depth': 14, 'learning_rate': 0.1409935306164763, 'boosting_type': 'gbdt'}. Best is trial 12 with value: 0.2701040615251076.
[I 2022-05-11 15:13:06,825] Trial 80 finished with value: 0.2284133763084009 and parameters: {'subsample_freq': 0, 'n_estimators': 71, 'reg_alpha': 2.8690621276737102e-18, 'reg_lambda': 2.1336947844243418e-18, 'colsample_bytree': 0.26909360259068493, 'subsample': 0.4813262233278844, 'min_child_samples': 56, 'max_depth': 14, 'learning_rate': 0.13687271219989589, 'boosting_type': 'gbdt'}. Best is trial 12 with value: 0.2701040615251076.
[I 2022-05-11 15:13:09,093] Trial 82 finished with value: 0.20864188122123573 and paramete

[I 2022-05-11 15:13:52,304] Trial 89 finished with value: 0.24862233724081426 and parameters: {'subsample_freq': 0, 'n_estimators': 125, 'reg_alpha': 1.5847044468155129e-13, 'reg_lambda': 7.122226761329178e-06, 'colsample_bytree': 0.208075722960595, 'subsample': 0.9506972882841054, 'min_child_samples': 73, 'max_depth': 18, 'learning_rate': 0.11616376227653456, 'boosting_type': 'gbdt'}. Best is trial 12 with value: 0.2701040615251076.
[I 2022-05-11 15:14:01,527] Trial 90 finished with value: 0.27265783506385016 and parameters: {'subsample_freq': 0, 'n_estimators': 121, 'reg_alpha': 0.12371715156312237, 'reg_lambda': 1.9520844224378866e-15, 'colsample_bytree': 0.4473236695437951, 'subsample': 0.2615014527273104, 'min_child_samples': 60, 'max_depth': 18, 'learning_rate': 0.06795652023312676, 'boosting_type': 'gbdt'}. Best is trial 90 with value: 0.27265783506385016.
[I 2022-05-11 15:14:03,896] Trial 101 finished with value: 0.24124194927807865 and parame

[I 2022-05-11 15:15:08,608] Trial 115 finished with value: 0.24319727891156462 and parameters: {'subsample_freq': 0, 'n_estimators': 93, 'reg_alpha': 0.004519747776231572, 'reg_lambda': 1.4069958445659801e-11, 'colsample_bytree': 0.5320321866091329, 'subsample': 0.1240619880644308, 'min_child_samples': 100, 'max_depth': 17, 'learning_rate': 0.051955625954690804, 'boosting_type': 'gbdt'}. Best is trial 90 with value: 0.27265783506385016.
[I 2022-05-11 15:15:19,844] Trial 116 finished with value: 0.2521498366338912 and parameters: {'subsample_freq': 0, 'n_estimators': 95, 'reg_alpha': 0.004111429614768603, 'reg_lambda': 1.9844613365459522e-07, 'colsample_bytree': 0.5385965939301767, 'subsample': 0.11687538113071037, 'min_child_samples': 66, 'max_depth': 17, 'learning_rate': 0.08118664133068476, 'boosting_type': 'gbdt'}. Best is trial 90 with value: 0.27265783506385016.
[I 2022-05-11 15:15:22,439] Trial 114 finished with value: 0.27543512504752815 and pa

[I 2022-05-11 15:16:13,938] Trial 131 finished with value: 0.24397844674718017 and parameters: {'subsample_freq': 0, 'n_estimators': 111, 'reg_alpha': 0.7322533482303286, 'reg_lambda': 1.5311501894489305e-12, 'colsample_bytree': 0.5128779893906925, 'subsample': 0.022290391808508107, 'min_child_samples': 51, 'max_depth': 20, 'learning_rate': 0.04041708142045037, 'boosting_type': 'gbdt'}. Best is trial 128 with value: 0.2816323034948219.
[I 2022-05-11 15:16:26,554] Trial 133 finished with value: 0.2576216912405419 and parameters: {'subsample_freq': 0, 'n_estimators': 108, 'reg_alpha': 0.03136143081014461, 'reg_lambda': 6.221092530451024e-08, 'colsample_bytree': 0.49781311511176674, 'subsample': 0.1936923170371905, 'min_child_samples': 53, 'max_depth': 16, 'learning_rate': 0.0789756867043014, 'boosting_type': 'gbdt'}. Best is trial 128 with value: 0.2816323034948219.
[I 2022-05-11 15:16:29,573] Trial 134 finished with value: 0.24491814674731271 and param

[I 2022-05-11 15:17:43,990] Trial 151 finished with value: 0.1603701591909139 and parameters: {'subsample_freq': 0, 'n_estimators': 97, 'reg_alpha': 0.0001352648323822314, 'reg_lambda': 3.1485113866699656e-14, 'colsample_bytree': 0.6344392144456883, 'subsample': 0.14992310346958648, 'min_child_samples': 32, 'max_depth': 20, 'learning_rate': 0.16359535361014976, 'boosting_type': 'gbdt'}. Best is trial 128 with value: 0.2816323034948219.
[I 2022-05-11 15:17:45,406] Trial 150 finished with value: 0.20570907788952905 and parameters: {'subsample_freq': 0, 'n_estimators': 162, 'reg_alpha': 0.0016332210691114852, 'reg_lambda': 1.990085400356479e-15, 'colsample_bytree': 0.6500605385572961, 'subsample': 0.15254414037933572, 'min_child_samples': 57, 'max_depth': 20, 'learning_rate': 0.16359177350068016, 'boosting_type': 'gbdt'}. Best is trial 128 with value: 0.2816323034948219.
[I 2022-05-11 15:17:51,507] Trial 160 finished with value: 0.21905969041630277 and p

[I 2022-05-11 15:19:22,236] Trial 176 finished with value: 0.2336413397398666 and parameters: {'subsample_freq': 0, 'n_estimators': 85, 'reg_alpha': 0.3463725782910659, 'reg_lambda': 1.4968783233767396e-17, 'colsample_bytree': 0.6029678602475912, 'subsample': 0.03608566623181646, 'min_child_samples': 44, 'max_depth': 11, 'learning_rate': 0.033997237850531555, 'boosting_type': 'gbdt'}. Best is trial 128 with value: 0.2816323034948219.
[I 2022-05-11 15:19:27,828] Trial 170 finished with value: 0.23878126703349603 and parameters: {'subsample_freq': 0, 'n_estimators': 105, 'reg_alpha': 1.0218871560358776e-20, 'reg_lambda': 5.284592281066935e-10, 'colsample_bytree': 0.6863198933153956, 'subsample': 0.17811188756601726, 'min_child_samples': 42, 'max_depth': 19, 'learning_rate': 0.03612067597674144, 'boosting_type': 'gbdt'}. Best is trial 128 with value: 0.2816323034948219.
[I 2022-05-11 15:19:28,983] Trial 161 finished with value: 0.22287354658175326 and pa

[I 2022-05-11 15:20:45,856] Trial 192 finished with value: 0.2654865614591741 and parameters: {'subsample_freq': 0, 'n_estimators': 109, 'reg_alpha': 1.8975400282263413e-19, 'reg_lambda': 1.6120215383775922e-09, 'colsample_bytree': 0.5512204391827215, 'subsample': 0.5636609636481346, 'min_child_samples': 51, 'max_depth': 15, 'learning_rate': 0.041256054499049434, 'boosting_type': 'gbdt'}. Best is trial 128 with value: 0.2816323034948219.
[I 2022-05-11 15:20:51,535] Trial 195 finished with value: 0.2602233139876918 and parameters: {'subsample_freq': 0, 'n_estimators': 112, 'reg_alpha': 1.2716071049595508e-19, 'reg_lambda': 1.5504933009544595e-10, 'colsample_bytree': 0.5512999151485792, 'subsample': 0.7326191881704349, 'min_child_samples': 51, 'max_depth': 10, 'learning_rate': 0.040779721757923985, 'boosting_type': 'gbdt'}. Best is trial 128 with value: 0.2816323034948219.
[I 2022-05-11 15:20:51,748] Trial 197 finished with value: 0.2537751163212704 and

FrozenTrial(number=128, values=[0.2816323034948219], datetime_start=datetime.datetime(2022, 5, 11, 15, 15, 23, 452311), datetime_complete=datetime.datetime(2022, 5, 11, 15, 15, 59, 413014), params={'subsample_freq': 0, 'n_estimators': 114, 'reg_alpha': 0.015868560983727742, 'reg_lambda': 5.464446011087016e-08, 'colsample_bytree': 0.5865195742947396, 'subsample': 0.18317392695047902, 'min_child_samples': 51, 'max_depth': 20, 'learning_rate': 0.04588590354697835, 'boosting_type': 'gbdt'}, distributions={'subsample_freq': IntUniformDistribution(high=0, low=0, step=1), 'n_estimators': IntUniformDistribution(high=300, low=50, step=1), 'reg_alpha': LogUniformDistribution(high=2.0, low=1e-20), 'reg_lambda': LogUniformDistribution(high=2.0, low=1e-20), 'colsample_bytree': UniformDistribution(high=1.0, low=0.01), 'subsample': UniformDistribution(high=1.0, low=0.01), 'min_child_samples': IntUniformDistribution(high=100, low=5, step=1), 'max_depth': IntUniformDistribution(high=20, low=3, step=1),

In [65]:
best_params = objective.add_params(study.best_params)

In [74]:
# adasyn = ADASYN(sampling_strategy=1,
#                n_neighbors=2)
# smote = SMOTEN(k_neighbors=5)
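# subsample_freq must be set before the model is constructed to take effect.
# In LightGBM, subsample_freq=0 disables bagging, so `subsample` is ignored.
best_params['subsample_freq'] = 0
final_model = lightgbm.LGBMClassifier(**best_params)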
final_model.fit(train_fp_pubchem, y_train)
test_predictions = final_model.predict(test_fp_pubchem)
score = f1_score(y_test, test_predictions)
print(f"Best model test f1 score is {round(score, 3)}")

Best model test f1 score is 0.371
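
The F1 score reported here is the harmonic mean of precision and recall for the positive (active) class, which makes it a sensible metric when actives are rare. As a minimal sketch of what `f1_score` computes, the example below derives it by hand from true/false positive counts; the labels are made up purely for illustration and are not taken from the dataset.

```python
# Hypothetical toy labels (1 = active, 0 = inactive); illustrative only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

# Count the outcomes that matter for the positive class.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # share of predicted actives that are truly active
recall = tp / (tp + fn)     # share of true actives that were found

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

True negatives do not enter the formula at all, so a model that labels almost everything inactive can have high accuracy and still a low F1.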


In [68]:
with open("./best_params_pubchem.json",'w') as f:
    json.dump(best_params, f)

## Results table

In [69]:
with open("best_params_maccs_2.json",'r') as f:
    best_params_maccs = json.load(f)
with open("best_params_pubchem.json",'r') as f:
    best_params_pubchem = json.load(f)

In [75]:
from sklearn.model_selection import cross_val_score

In [80]:
best_params_maccs

{'subsample_freq': 0,
 'n_estimators': 222,
 'reg_alpha': 0.010136808578101649,
 'reg_lambda': 0.0038821761773882787,
 'colsample_bytree': 0.9384661847570124,
 'subsample': 0.3790813020841549,
 'min_child_samples': 69,
 'max_depth': 13,
 'learning_rate': 0.038987514479036006,
 'boosting_type': 'gbdt',
 'verbosity': -1,
 'n_jobs': 2,
 'device': 'gpu',
 'num_leaves': 8192,
 'scale_pos_weight': 27}

In [86]:
from pprint import pprint
def print_report(params, fingerprint_name, mean_score, test_score):
    """Print the tuned parameters and the CV and test scores for one fingerprint type."""
    print(f"LGBM on {fingerprint_name} fingerprints")
    print("With parameters:")
    pprint(params)
    print(f"mean CV score: {mean_score}")
    print(f"Test f1 score is {round(test_score, 3)}")

In [87]:
print_report(best_params_maccs, "MACCS", 0.265, 0.439)
print()
print_report(best_params_pubchem, "PUBCHEM", 0.282, 0.371)

LGBM on MACCS fingerprints
With parameters:
{'boosting_type': 'gbdt',
 'colsample_bytree': 0.9384661847570124,
 'device': 'gpu',
 'learning_rate': 0.038987514479036006,
 'max_depth': 13,
 'min_child_samples': 69,
 'n_estimators': 222,
 'n_jobs': 2,
 'num_leaves': 8192,
 'reg_alpha': 0.010136808578101649,
 'reg_lambda': 0.0038821761773882787,
 'scale_pos_weight': 27,
 'subsample': 0.3790813020841549,
 'subsample_freq': 0,
 'verbosity': -1}
mean CV score: 0.265
Test f1 score is 0.439

LGBM on PUBCHEM fingerprints
With parameters:
{'boosting_type': 'gbdt',
 'colsample_bytree': 0.5865195742947396,
 'device': 'gpu',
 'learning_rate': 0.04588590354697835,
 'max_depth': 20,
 'min_child_samples': 51,
 'n_estimators': 114,
 'n_jobs': 2,
 'num_leaves': 65536,
 'reg_alpha': 0.015868560983727742,
 'reg_lambda': 5.464446011087016e-08,
 'scale_pos_weight': 27,
 'subsample': 0.18317392695047902,
 'subsample_freq': 0,
 'verbosity': -1}
mean CV score: 0.282
Test f1 score is 0.371
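
To make the comparison easier to read, the two reports can be condensed into a small table. The sketch below uses plain Python and the scores printed above; the `results` structure and its key names are assumptions introduced for illustration.

```python
# Scores copied from the two reports above.
results = [
    {"fingerprint": "MACCS", "cv_score": 0.265, "test_f1": 0.439},
    {"fingerprint": "PUBCHEM", "cv_score": 0.282, "test_f1": 0.371},
]

# Print a fixed-width table: one row per fingerprint type.
header = f"{'fingerprint':<12}{'mean CV score':>15}{'test F1':>10}"
print(header)
print("-" * len(header))
for row in results:
    print(f"{row['fingerprint']:<12}{row['cv_score']:>15.3f}{row['test_f1']:>10.3f}")

# Which fingerprint wins under each criterion?
best_by_cv = max(results, key=lambda row: row["cv_score"])
best_by_test = max(results, key=lambda row: row["test_f1"])
print(f"Best by CV: {best_by_cv['fingerprint']}, best on test: {best_by_test['fingerprint']}")
```

The ranking differs between the two columns: PubChem fingerprints have the higher mean CV score, while MACCS fingerprints score higher on the test set.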
