<a href="https://colab.research.google.com/github/gottalottarock/ml-intro/blob/main/practice_covid/IB_Practice_COVID.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting molecular activity against a target

*This notebook uses data from the [Global AI Challenge](https://codenrock.com/contests/global-ai#/) competition.*

The goal of this task is to predict the activity of a ligand molecule against the target, COVID-19.

![](https://cloudfront.jove.com/files/media/science-education/science-education-thumbs/11513.jpg)

## Data analysis plan:

  1. Load the training data
  2. Preprocess the data before training the model
  3. Train the model on the training set
  4. Load and preprocess the test data
  5. Validate the model on the test set


# 0. Installing and importing libraries

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# 1. Loading the data

In [2]:
!wget https://www.dropbox.com/s/48c34raijlxc0nw/train.csv
!wget https://www.dropbox.com/s/297trreazro8ivr/test_labels.csv


In [3]:
DATA_PATH = "./"
TRAIN_FILE = "train.csv"
TEST_FILE = "test_labels.csv"

SMILES_COLUMN = "smiles"
TARGET_COLUMN = "Active"

In [4]:
import pandas as pd

def load_train_test_data():
    # Read both CSV files and normalise the SMILES column name.
    train_csv_path = os.path.join(DATA_PATH, TRAIN_FILE)
    test_csv_path = os.path.join(DATA_PATH, TEST_FILE)
    train_data = pd.read_csv(train_csv_path, index_col=0)
    test_data = pd.read_csv(test_csv_path, index_col=0)
    rename_map = {"Smiles": SMILES_COLUMN}
    return train_data.rename(columns=rename_map), test_data.rename(columns=rename_map)

## 1.1 Data analysis and formulating the machine learning task

The molecules are represented in [SMILES notation](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system).

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/00/SMILES.png/450px-SMILES.png)

In [5]:
train_data, test_data = load_train_test_data()
train_data.head()

Unnamed: 0,smiles,Active
0,COc1ccc2[nH]cc(CCN)c2c1,False
1,CCCN1CCC[C@H](c2cccc(O)c2)C1.Cl,False
2,O=C(NO)c1cnc(N2CCN(S(=O)(=O)c3ccc4ccccc4c3)CC2...,False
3,Nc1cccc(CNC(=O)c2ccc(Oc3ccc(OCc4cccc(F)c4)cc3)...,False
4,Fc1ccccc1CNCc1ccc(-c2ccnc3[nH]ccc23)cc1,False


In [6]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5557 entries, 0 to 5556
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   smiles  5557 non-null   object
 1   Active  5557 non-null   bool  
dtypes: bool(1), object(1)
memory usage: 92.3+ KB


In [7]:
train_data[TARGET_COLUMN].value_counts()

False    5351
True      206
Name: Active, dtype: int64
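
The counts above show a strongly imbalanced target: only about 3.7% of the molecules are active. A small sketch, using just the two counts printed above, shows why plain accuracy would be misleading here and why the notebook later evaluates with F1 instead.

```python
# Class counts copied from the value_counts() output above.
n_negative, n_positive = 5351, 206
n_total = n_negative + n_positive

positive_share = n_positive / n_total
# A trivial classifier that always predicts "inactive" gets every negative right
# and every positive wrong, so its accuracy equals the share of negatives.
majority_accuracy = n_negative / n_total

print(f"positive share: {positive_share:.3%}")
print(f"accuracy of always predicting inactive: {majority_accuracy:.3%}")
```

Such a trivial classifier finds no active molecule at all, so its recall and F1 on the active class are zero despite the high accuracy.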

## 1.2 Data preprocessing

In [8]:
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover

In [9]:
def remove_salts_and_canonicalized(smiles: str):
    # Strip Cl/Br counter-ions, then return the canonical SMILES of what remains.
    remover = SaltRemover(defnData="[Cl,Br]")
    mol = Chem.MolFromSmiles(smiles)
    res = remover.StripMol(mol)
    processed_smiles = Chem.MolToSmiles(res)
    return processed_smiles

In [10]:
train_data[SMILES_COLUMN] = list(map(remove_salts_and_canonicalized, train_data[SMILES_COLUMN]))
test_data[SMILES_COLUMN] = list(map(remove_salts_and_canonicalized, test_data[SMILES_COLUMN]))

In [11]:
def change_str_target_to_int(targets: pd.Series):
    target_map = {True: 1, False: 0}
    processed_targets = targets.map(target_map)
    return processed_targets.values

In [12]:
train_data[TARGET_COLUMN] = change_str_target_to_int(train_data[TARGET_COLUMN])
test_data[TARGET_COLUMN] = change_str_target_to_int(test_data[TARGET_COLUMN])

In [13]:
train_data.head()

Unnamed: 0,smiles,Active
0,COc1ccc2[nH]cc(CCN)c2c1,0
1,CCCN1CCC[C@H](c2cccc(O)c2)C1,0
2,O=C(NO)c1cnc(N2CCN(S(=O)(=O)c3ccc4ccccc4c3)CC2...,0
3,Nc1cccc(CNC(=O)c2ccc(Oc3ccc(OCc4cccc(F)c4)cc3)...,0
4,Fc1ccccc1CNCc1ccc(-c2ccnc3[nH]ccc23)cc1,0


## 1.3 Feature engineering

A molecule can be represented as a fingerprint: a feature vector computed by a specific algorithm.

We will compute fingerprints with the RDKit library. The available fingerprint types are described in the RDKit documentation: https://www.rdkit.org/docs/GettingStartedInPython.html#fingerprinting-and-molecular-similarity
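
Once molecules are encoded as bit vectors, they can be compared directly. A minimal sketch of the Tanimoto similarity, a common measure for binary fingerprints, is shown below. The two vectors are made up for illustration and are not real fingerprints; `tanimoto` is a hypothetical helper, not an RDKit function.

```python
import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto (Jaccard) similarity of two binary vectors."""
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention chosen here for two all-zero vectors
    return float(np.logical_and(a, b).sum() / union)

# Toy 8-bit vectors standing in for fingerprints.
fp_a = np.array([1, 0, 1, 1, 0, 0, 1, 0])
fp_b = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(tanimoto(fp_a, fp_b))
```

The vectors share 3 set bits out of 5 bits set in either one, so the similarity is 3/5.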

![](https://sun9-64.userapi.com/impf/_8Zy5WO6Mt0SIPx1YS02DeErAoZ0RHcwgc-kZg/Md98bNVzBg0.jpg?size=831x415&quality=96&sign=cb20481128a04ff523fd662dd0e604ab&type=album)


### Morgan fingerprints (ECFP)

![](https://d3i71xaburhd42.cloudfront.net/52adf3589e8b7b9855353e5815669258ef6e3405/6-Figure2-1.png)

In [14]:
from enum import Enum
from functools import partial
from rdkit import Chem, DataStructs
from rdkit.DataStructs import ExplicitBitVect
from rdkit.Chem import AllChem, MACCSkeys
from typing import List

In [30]:
class FingerprintsNames(Enum):
    ECFP4 = "morgan_2_2048"
    RDKitFP = "RDKFingerprint"
    TOPOTORSION = "topological_torsion"
    MACCS = "MACCSkeys"
    PATTERN = "PatternFingerprint"
    ATOMPAIR = "AtomPairFingerprint"



FINGERPRINTS_METHODS = {
    FingerprintsNames.ECFP4: partial(AllChem.GetMorganFingerprintAsBitVect, radius=2, nBits=2048),
    FingerprintsNames.RDKitFP: partial(Chem.RDKFingerprint, fpSize=2048),#TODO
    FingerprintsNames.TOPOTORSION: partial(AllChem.GetHashedTopologicalTorsionFingerprintAsBitVect, nBits=2048),#TODO
    FingerprintsNames.MACCS: MACCSkeys.GenMACCSKeys,#TODO
    FingerprintsNames.PATTERN: partial(Chem.PatternFingerprint,fpSize=1024),#TODO
    FingerprintsNames.ATOMPAIR: partial(AllChem.GetHashedAtomPairFingerprintAsBitVect,nBits=2048)}#TODO


In [46]:
fingerprint_type_name = FingerprintsNames.MACCS
fingerprint_type_method = FINGERPRINTS_METHODS[fingerprint_type_name]

In [47]:
def bit_vectors_to_numpy_arrays(fps: List[ExplicitBitVect]) -> np.ndarray:
    # ConvertToNumpyArray resizes each destination array to the fingerprint length.
    output_arrays = [np.zeros((1,)) for _ in range(len(fps))]
    for fp, output_array in zip(fps, output_arrays):
        DataStructs.ConvertToNumpyArray(fp, output_array)
    return np.asarray(output_arrays)

def get_np_array_of_fps(fp_type, smiles: List[str]):
    # Compute the selected fingerprint type for every molecule
    mols = [Chem.MolFromSmiles(m) for m in smiles]
    fp = list(map(fp_type, mols))
    return bit_vectors_to_numpy_arrays(fp)

In [48]:
train_fp = get_np_array_of_fps(fp_type=fingerprint_type_method, smiles=train_data[SMILES_COLUMN])
test_fp = get_np_array_of_fps(fp_type=fingerprint_type_method, smiles=test_data[SMILES_COLUMN])

In [23]:
y_train = train_data[TARGET_COLUMN]
y_test = test_data[TARGET_COLUMN]

# 2. Preparing to train the model

## 2.1 Cross-validation

![](https://pubs.rsc.org/image/article/2018/SC/c7sc02664a/c7sc02664a-f3_hi-res.gif)

In [30]:
from dgllife.utils import ScaffoldSplitter

Using backend: pytorch


In [31]:
class ScaffoldCVSklearn:
    def __init__(self, data, k_folds):
        self.scaffold_splits = ScaffoldSplitter.k_fold_split(data, k=k_folds)

    def split(self):
        indices_splits = []
        for train_data, val_data in self.scaffold_splits:
            train_indices = train_data.indices
            val_indices = val_data.indices
            indices_splits.append((train_indices, val_indices))
        return indices_splits

    def convert_data_to_indices(self, dataset):
        indices = [index for index, row in dataset.iterrows()]
        return indices


In [32]:
cv = ScaffoldCVSklearn(train_data, k_folds=3).split()

Start initializing RDKit molecule instances...
Creating RDKit molecule instance 1000/5557
Creating RDKit molecule instance 2000/5557
Creating RDKit molecule instance 3000/5557
Creating RDKit molecule instance 4000/5557
Creating RDKit molecule instance 5000/5557
Start computing Bemis-Murcko scaffolds.
Computing Bemis-Murcko for compound 1000/5557
Computing Bemis-Murcko for compound 2000/5557
Computing Bemis-Murcko for compound 3000/5557
Computing Bemis-Murcko for compound 4000/5557
Computing Bemis-Murcko for compound 5000/5557
Processing fold 1/3
Processing fold 2/3
Processing fold 3/3
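
The log above shows Bemis-Murcko scaffolds being computed before the folds are built. The idea of a scaffold split is that molecules sharing a scaffold never end up on both sides of a train/validation split. The sketch below illustrates that idea with a simple greedy assignment; it is an assumption about the general technique, not necessarily the exact algorithm inside `dgllife`, and the scaffold labels and the helper `grouped_k_fold` are hypothetical.

```python
from collections import defaultdict

def grouped_k_fold(groups, k):
    """Split sample indices into k folds so that no group spans two folds."""
    members = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)

    folds = [[] for _ in range(k)]
    # Place the largest groups first, each into the currently smallest fold.
    for group_indices in sorted(members.values(), key=len, reverse=True):
        smallest = min(folds, key=len)
        smallest.extend(group_indices)

    splits = []
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, val))
    return splits

# Hypothetical scaffold labels, one per molecule.
scaffolds = ["benzene", "indole", "benzene", "pyridine", "indole", "benzene", "furan"]
splits = grouped_k_fold(scaffolds, k=3)
for train_idx, val_idx in splits:
    print(train_idx, val_idx)
```

The result has the same shape as `cv` in this notebook: a list of `(train_indices, val_indices)` pairs.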


# Assignment (10 points + 3 bonus)
1. (3 points) Add a solution to the class imbalance problem

Options:
* UnderSampling
* OverSampling
* SMOTE
* Built-in model tools (`scale_pos_weight`)

2. (2 points) Use 2 more fingerprint types from `FingerprintsNames`

3. (3 points) Achieve an f1-score above 0.35 on the test dataset

Options:
* Increase the number of parameters in the hyperparameter search
* Use other hyperparameter search algorithms (for example, [RandomizedSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html))
* Use other models (Random Forest, SVC, MLPClassifier, etc.)

4. (2 points) Logging

As the final result, provide a table (a `pd.DataFrame` is fine) with the columns: Model, Fingerprint, Best Parameters, Mean Cross-Validation Score, Std Cross-Validation Score, Test Score

Analyze the results:
* Which fingerprints gave the best result?
* Which model gave the best result?
* Do the cross-validation scores correlate with the test scores?

5. (Bonus, +3 points) Achieve an f1-score above 0.45 on the test dataset
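
The OverSampling option from item 1 can be sketched in a few lines of NumPy. This is plain random oversampling (duplicating minority rows), not SMOTE, and `random_oversample` is a hypothetical helper written for illustration. It takes and returns `(X, y)`, the same shape of interface as the `oversample` placeholder in the `Objective` class of this notebook, and it should be applied to training folds only.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate random minority-class rows until both classes have equal size."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    if len(pos_idx) < len(neg_idx):
        minority, majority = pos_idx, neg_idx
    else:
        minority, majority = neg_idx, pos_idx
    # Sample with replacement as many extra minority rows as are missing.
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy data: 8 negatives and 2 positives.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_res, y_res = random_oversample(X_toy, y_toy)
print(np.bincount(y_res))
```

Because the duplicated rows are exact copies, oversampling before splitting would leak validation molecules into training, which is why it belongs inside each fold.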

In [38]:
from imblearn.over_sampling import SMOTEN, ADASYN
from joblib import parallel_backend
import lightgbm
from tqdm.auto import tqdm

In [54]:
y_train.shape/y_train.sum()

array([26.97572816])

We use scale_pos_weight = 27
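
The ratio computed above is total samples divided by positives. A commonly used starting heuristic for `scale_pos_weight` is negatives divided by positives instead; that this heuristic applies here is an assumption, and the two values differ by exactly 1. The sketch below compares them using the counts from this notebook.

```python
# Counts taken from the notebook: 5557 training molecules, 206 of them active.
n_total, n_positive = 5557, 206
n_negative = n_total - n_positive

# What the cell above computes: total / positives.
ratio_total_to_positive = n_total / n_positive
# The usual heuristic: negatives / positives.
ratio_negative_to_positive = n_negative / n_positive

print(round(ratio_total_to_positive, 3), round(ratio_negative_to_positive, 3))
```

With this level of imbalance the difference is small, so treating the weight as a tunable hyperparameter around these values is a reasonable alternative to fixing it.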

In [17]:
from sklearn.metrics import f1_score

import optuna


class Objective:
    def __init__(self, X_data, y_data, cv):
        self.X_data = X_data.astype(int)
        self.y_data = y_data.values.astype(int)
        self.cv = cv
        self.train_data = []
        self.test_data = []
        for train_index, test_index in tqdm(cv):
            X_train, X_test = (
                self.X_data[train_index, :],
                self.X_data[test_index, :],
            )
            y_train, y_test = (
                self.y_data[train_index],
                self.y_data[test_index],
            )
            X_train,y_train = self.oversample(X_train,y_train)
            self.train_data.append((X_train,y_train))
            self.test_data.append((X_test,y_test))
        
    def oversample(self, x, y):
        return x,y
#         self.oversampler = SMOTEN(k_neighbors=5)
#         return self.oversampler.fit_resample(x, y)

    def __call__(self, trial):

        params = {
            # subsample_freq=0 disables bagging in LightGBM, so `subsample` below has no effect.
            "subsample_freq": trial.suggest_int("subsample_freq", 0, 0),
            "n_estimators": trial.suggest_int("n_estimators", 50, 300),  # n_trees
            "reg_alpha": trial.suggest_loguniform("reg_alpha", 1e-20, 2),
            "reg_lambda": trial.suggest_loguniform("reg_lambda", 1e-20, 2),
            "colsample_bytree": trial.suggest_uniform("colsample_bytree", 0.01, 1.0),
            "subsample": trial.suggest_uniform("subsample", 0.01, 1.0),
            "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
            "max_depth": trial.suggest_int("max_depth", 3, 20),
            "learning_rate": trial.suggest_uniform("learning_rate", 0.001, 0.3),
#             "scale_pos_weight":trial.suggest_int("scale_pos_weight",27)
            "boosting_type":trial.suggest_categorical("boosting_type",['gbdt'])
        }
        params = self.add_params(params)

        result = []
        for (X_train,y_train), (X_test,y_test) in zip(self.train_data, self.test_data):
           
            model = lightgbm.LGBMClassifier(**params)

            model.fit(X_train, y_train)

            y_pred = model.predict_proba(X_test)[:, 1].astype(float)

            f1_val = f1_score(y_test, y_pred > 0.5)

            result.append(f1_val)

        return np.mean(result)
    
    def add_params(self, params):
        params.update({
            "verbosity": -1,
            "n_jobs": 2,
            "device": 'gpu',
            "num_leaves":2 ** min(16,params["max_depth"]),    
            "scale_pos_weight":27
        })
        return params


# Load the dataset in advance for reusing it each trial execution.
objective = Objective(train_fp, y_train, cv=cv)
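
The search space above samples `reg_alpha` and `reg_lambda` log-uniformly between 1e-20 and 2. Log-uniform means uniform in the logarithm of the value, so every decade gets equal probability. The sketch below reproduces that sampling rule with the standard library only (it does not call Optuna; `sample_loguniform` is a hypothetical helper) to show how much of the budget lands on very small values.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
samples = [sample_loguniform(1e-20, 2.0, rng) for _ in range(10_000)]

# Share of draws below 1e-10, i.e. in the lower ten of roughly twenty decades.
share_below_1e_10 = sum(s < 1e-10 for s in samples) / len(samples)
print(round(share_below_1e_10, 2))
```

Roughly half of the draws fall below 1e-10, where the regularisation term is negligible, so raising the lower bound would likely focus the search on values that change the model.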





In [56]:
study = optuna.create_study(direction="maximize")
# study.enqueue_trial(best_params)
study.optimize(objective, n_trials=100, n_jobs=10)
print(study.best_trial)

[I 2022-03-30 16:35:44,870] A new study created in memory with name: no-name-c1d77f41-f543-4f66-be80-48d37533093f
[... per-trial log lines truncated in the source; the running best improved from 0.17428936942549744 (trial 9) to 0.25013555154044936 (trial 7), 0.259382179434645 (trial 81), and 0.2629737580717973 (trial 86) ...]

FrozenTrial(number=86, values=[0.2629737580717973], datetime_start=datetime.datetime(2022, 3, 30, 16, 44, 35, 788738), datetime_complete=datetime.datetime(2022, 3, 30, 16, 46, 30, 167944), params={'subsample_freq': 0, 'n_estimators': 235, 'reg_alpha': 4.517845153419324e-20, 'reg_lambda': 0.0003814408820850157, 'colsample_bytree': 0.43913094108760964, 'subsample': 0.7444318013887783, 'min_child_samples': 17, 'max_depth': 10, 'learning_rate': 0.01250188105888474, 'boosting_type': 'gbdt'}, distributions={'subsample_freq': IntUniformDistribution(high=0, low=0, step=1), 'n_estimators': IntUniformDistribution(high=300, low=50, step=1), 'reg_alpha': LogUniformDistribution(high=2.0, low=1e-20), 'reg_lambda': LogUniformDistribution(high=2.0, low=1e-20), 'colsample_bytree': UniformDistribution(high=1.0, low=0.01), 'subsample': UniformDistribution(high=1.0, low=0.01), 'min_child_samples': IntUniformDistribution(high=100, low=5, step=1), 'max_depth': IntUniformDistribution(high=20, low=3, step=1),

In [60]:
best_params = objective.add_params(study.best_params)

In [61]:
# adasyn = ADASYN(sampling_strategy=1,
#                n_neighbors=2)
# smote = SMOTEN(k_neighbors=5)
# Set subsample_freq before building the model so the setting actually takes effect.
best_params['subsample_freq'] = 0
final_model = lightgbm.LGBMClassifier(**best_params)
final_model.fit(train_fp, y_train)
test_predictions = final_model.predict(test_fp)
score = f1_score(y_test, test_predictions)
print(f"Best model test f1 score is {round(score, 3)}")

Best model test f1 score is 0.396
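
The score reported above is F1, the harmonic mean of precision and recall on the active class. A short sketch with hypothetical confusion counts (not taken from the notebook's model) shows that the count-based formula and the harmonic mean agree; `f1_from_counts` is a helper written for illustration.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical confusion counts, not taken from the notebook's model.
tp, fp, fn = 20, 30, 40
precision = tp / (tp + fp)
recall = tp / (tp + fn)
harmonic_mean = 2 * precision * recall / (precision + recall)

print(f1_from_counts(tp, fp, fn), harmonic_mean)
```

True negatives do not appear in the formula, which is why F1 is not inflated by the large inactive class.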


## PubChem fingerprint

In [18]:
from PyFingerprint.fingerprint import get_fingerprint, get_fingerprints
from tqdm.auto import tqdm

In [19]:
train_fp_pubchem  = [get_fingerprint(smile,"pubchem") for smile in tqdm(train_data.smiles.values)]
train_fp_pubchem =np.stack([fp.to_numpy() for fp in train_fp_pubchem])

100%|██████████| 5557/5557 [00:36<00:00, 152.20it/s]


In [20]:
test_fp_pubchem  = [get_fingerprint(smile,"pubchem") for smile in tqdm(test_data.smiles.values)]
test_fp_pubchem =np.stack([fp.to_numpy() for fp in test_fp_pubchem])

100%|██████████| 1614/1614 [00:09<00:00, 161.92it/s]


In [27]:
y_train

0       0
1       0
2       0
3       0
4       0
       ..
5552    0
5553    0
5554    0
5555    0
5556    0
Name: Active, Length: 5557, dtype: int64

In [36]:
objective = Objective(train_fp_pubchem, y_train,cv=cv)


100%|██████████| 3/3 [00:00<00:00, 56.09it/s]


In [39]:
study = optuna.create_study(direction="maximize")
# study.enqueue_trial(best_params)
study.optimize(objective, n_trials=100, n_jobs=8)
print(study.best_trial)

[I 2022-03-30 16:55:28,165] A new study created in memory with name: no-name-f98a19b4-2435-433c-84b8-59473b5cf736
[... per-trial log lines truncated in the source; the first visible entries are trial 5 (0.23159729684552377) and trial 6 (0.20899066737380376) ...]

KeyboardInterrupt: 

In [None]:
study.optimize(objective, n_trials=100, n_jobs=10)


[... per-trial log lines truncated in the source; the visible entries cover trials 217 to 256, and the best reported in them is trial 148 with value 0.28779342723004697 ...]

In [48]:
best_params = objective.add_params(study.best_params)

In [49]:
# adasyn = ADASYN(sampling_strategy=1,
#                n_neighbors=2)
# smote = SMOTEN(k_neighbors=5)
# Set subsample_freq before building the model so the setting actually takes effect.
best_params['subsample_freq'] = 0
final_model = lightgbm.LGBMClassifier(**best_params)
final_model.fit(train_fp_pubchem, y_train)
test_predictions = final_model.predict(test_fp_pubchem)
score = f1_score(y_test, test_predictions)
print(f"Best model test f1 score is {round(score, 3)}")

Best model test f1 score is 0.397
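
Item 4 of the assignment asks for a summary table. The `Objective` class in this notebook returns only the mean fold score, so the fold-level standard deviation is not recorded by the runs above. A minimal sketch of how one table row could be assembled from per-fold scores follows. `make_row` is a hypothetical helper, and every number and parameter in the example is a placeholder for illustration, not a result from this notebook.

```python
import statistics

COLUMNS = [
    "Model",
    "Fingerprint",
    "Best Parameters",
    "Mean Cross-Validation Score",
    "Std Cross-Validation Score",
    "Test Score",
]

def make_row(model, fingerprint, best_params, fold_scores, test_score):
    """Build one row of the results table from per-fold CV scores."""
    return {
        "Model": model,
        "Fingerprint": fingerprint,
        "Best Parameters": best_params,
        "Mean Cross-Validation Score": statistics.mean(fold_scores),
        # Population standard deviation, matching the default of np.std.
        "Std Cross-Validation Score": statistics.pstdev(fold_scores),
        "Test Score": test_score,
    }

# Placeholder inputs for illustration only.
rows = [
    make_row("ModelA", "FingerprintX", {"max_depth": 5}, [0.25, 0.26, 0.27], 0.30),
    make_row("ModelB", "FingerprintY", {"max_depth": 7}, [0.20, 0.30, 0.40], 0.35),
]
for row in rows:
    print([row[column] for column in COLUMNS])
```

Passing `rows` to `pd.DataFrame(rows)` would give the tabular form the assignment mentions.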
