This notebook contains the regression models. I tested the models on three kinds of fingerprints (MACCSFP, Klekota&Roth, hashed) and on their combination. In practically every case, the combination of the three sets turned out to be the best.

After the preliminary data analysis (see data_analysis.ipynb), it seemed to me that removing some of the least significant features would be a good idea. I did this in two ways: first by keeping the features that occur in the largest number of compounds, and second with SelectKBest from sklearn. It turned out, however, that a larger number of features still improves the results. Models that use only a small part of the data can nevertheless reach fairly similar results, with a much shorter run time. This also suggests that this approach can identify a subset of features that is most responsible for the cardiotoxicity of the compounds.

In the notebook I tested a number of shallow models and a simple neural network. I used r2_score to evaluate the results. The best result came from SVR (R^2 ~0.68); for the neural network it was R^2 ~0.62. More detailed results can be traced below.

Because I tested quite a lot of model/data/parameter combinations, the whole notebook takes a long time to run. For your own tests, I recommend calling the functions with a reduced number of features, which should run faster. This should be possible by removing some of the parameters at the place where the code is called.
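
The first selection approach (keeping features that occur in enough compounds) lives in utils, which is not shown here. A minimal sketch of how such a filter could look, assuming binary fingerprint columns plus an IC50 target column; the function name keep_frequent_bits, the parameter min_frac and the toy data are hypothetical, and the real utils code may differ (for example in how the threshold is compared):

```python
import pandas as pd

def keep_frequent_bits(df, min_frac, target="IC50"):
    """Keep fingerprint columns that are non-zero in at least min_frac of the rows.

    Hypothetical helper: not the implementation used by utils.get_*_fingerprints.
    """
    bits = df.drop(columns=[target])
    # Fraction of compounds in which each bit is switched on
    frac_on = (bits != 0).mean(axis=0)
    kept = frac_on[frac_on >= min_frac].index
    return df[list(kept) + [target]]

# Toy data: four compounds, three fingerprint bits
toy = pd.DataFrame({
    "bit_a": [1, 1, 1, 1],   # present in 100% of compounds
    "bit_b": [1, 0, 0, 0],   # present in 25%
    "bit_c": [1, 1, 0, 0],   # present in 50%
    "IC50": [5.1, 6.3, 4.8, 7.0],
})

filtered = keep_frequent_bits(toy, min_frac=0.5)
print(list(filtered.columns))
```

With min_frac=0.5 the rare bit_b is dropped while bit_a, bit_c and the target remain.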

In [1]:
import numpy as np

from sklearn.linear_model import Ridge, Lasso, ElasticNet, BayesianRidge, LogisticRegression, SGDRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import r2_score

from warnings import filterwarnings

# Import all Keras pieces from tensorflow.keras; mixing it with the standalone
# keras package can combine objects from two different implementations.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import History, EarlyStopping, LearningRateScheduler, ModelCheckpoint

import matplotlib.pyplot as plt

import utils

In [2]:
seed = 1

In [4]:
def RidgeRegr(X, y, cv):
    params = {'alpha':np.hstack([np.linspace(1e-9, 2, 10), np.linspace(10, 1e9, 3), np.linspace(0.1, 2, 3), np.array([1.0])]),
              'max_iter':np.array([50000])}

    grid = GridSearchCV(Ridge(), params, cv=cv, return_train_score=True)
    grid.fit(X, y)
    return grid
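
The alpha grid in RidgeRegr concatenates several np.linspace ranges. The sketch below only inspects what those ranges evaluate to; the logarithmic grid at the end is an assumed alternative for illustration, not something the notebook used:

```python
import numpy as np

# The same concatenation as in RidgeRegr
alpha_grid = np.hstack([
    np.linspace(1e-9, 2, 10),
    np.linspace(10, 1e9, 3),
    np.linspace(0.1, 2, 3),
    np.array([1.0]),
])

# linspace spaces values evenly, so the wide range yields only three points:
# 10, roughly 5e8 and 1e9. Nothing between 10 and 5e8 is ever tried.
wide_range = np.linspace(10, 1e9, 3)

# 2.0 is the endpoint of two of the ranges, so it is cross-validated twice.
unique_alphas = np.unique(alpha_grid)

# Hypothetical alternative: one candidate per decade from 1e-9 to 1e9
log_grid = np.logspace(-9, 9, 19)

print(len(alpha_grid), len(unique_alphas))
print(wide_range)
print(log_grid[:3], log_grid[-3:])
```

A logarithmic grid covers the same span of magnitudes with a comparable number of fits, which may explain why the selected values above tend to sit at a range boundary such as 10.0.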

In [5]:
def LassoRegr(X, y, cv):
    params = {'alpha':np.hstack([np.linspace(1e-9, 2, 1), np.linspace(10, 1e9, 1), np.linspace(0.1, 2, 1), np.array([1.0])]),
              'max_iter':np.array([10000])}

    grid = GridSearchCV(Lasso(), params, cv=cv, return_train_score=True)
    grid.fit(X, y)
    return grid

In [6]:
def ElasticNetRegr(X, y, cv):
    params = {'alpha':np.hstack([np.linspace(1e-9, 2, 3), np.linspace(10, 1e9, 3), np.linspace(0.1, 2, 1), np.array([1.0])]),
              'max_iter':np.array([10000])}
    
    grid = GridSearchCV(ElasticNet(), params, cv=cv, return_train_score=True)
    grid.fit(X, y)
    return grid

In [7]:
def BayesianRidgeRegr(X, y, cv):
    params = {'alpha_1':np.hstack([np.linspace(1e-9, 2, 5), np.linspace(10, 1e9, 3), np.linspace(0.1, 2, 3), np.array([1e-6])]),
              'alpha_2':np.hstack([np.linspace(1e-9, 2, 5), np.linspace(10, 1e9, 3), np.linspace(0.1, 2, 3), np.array([1e-6])])}
    
    grid = GridSearchCV(BayesianRidge(), params, cv=cv, return_train_score=True)
    grid.fit(X, y)
    return grid

In [8]:
def SGDRegressorRegr(X, y, cv):
    params = {'alpha':np.hstack([np.linspace(1e-9, 2, 10), np.linspace(10, 1e9, 3), np.linspace(0.1, 2, 3), np.array([0.0001])])}

    grid = GridSearchCV(SGDRegressor(), params, cv=cv, return_train_score=True)
    grid.fit(X, y)
    return grid

In [9]:
def SVMRegr(X, y, cv):
    params = {'C':np.hstack([np.linspace(1e-9, 2, 1), np.linspace(10, 1e9, 1), np.linspace(0.1, 2, 1), np.array([1.0])])}
    grid = GridSearchCV(SVR(), params, cv=cv, return_train_score=True)
    grid.fit(X, y)
    return grid

In [10]:
def DecisionTreeRegr(X, y, cv):
    params = {'ccp_alpha':np.hstack(np.array([0]))}
    grid = GridSearchCV(DecisionTreeRegressor(), params, cv=cv, return_train_score=True)
    grid.fit(X, y)
    return grid

In [11]:
def KNeighborsRegr(X, y, cv):
    params = {'algorithm':np.array(['auto'])}
    grid = GridSearchCV(KNeighborsRegressor(), params, cv=cv, return_train_score=True)
    grid.fit(X, y)
    return grid

In [15]:
def perform_regression(X, y, svm_only=False):
    
    model_functions = [
    ("Ridge", RidgeRegr),
    ("Lasso", LassoRegr),
    ("Elastic Net", ElasticNetRegr),
    ("Bayesian", BayesianRidgeRegr),
    ("SGD", SGDRegressorRegr),
    ("SVM", SVMRegr),
    ("Decision Tree", DecisionTreeRegr),
    ("K Neighbors", KNeighborsRegr),
    ]
    
    if svm_only:
        model_functions = [ ("SVM", SVMRegr) ]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)

    kfold = KFold(n_splits=5, random_state=seed, shuffle=True)
    
    np.seterr(divide='ignore', invalid='ignore')
    filterwarnings('ignore')

    print(f"{'Regression model:'.ljust(19)}   R^2:")

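    # Start below any achievable score so that a model is still reported
    # when every R^2 on the test set is negative.
    best_score = float('-inf')
    best_result = ("None", 0, {})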

    for (name, function) in model_functions:
        model = function(X_train, y_train, kfold)
        prediction = model.best_estimator_.predict(X_test)
        score = r2_score(y_test, prediction)
        if (score > best_score):
            best_score = score
            best_result = (name, score, model.best_params_)

        print(f"{name.ljust(19)}   {str(round(score, 6)).ljust(9)}   {model.best_params_}")

    print(f"\nBest\n{best_result[0].ljust(19)}   {str(round(best_result[1], 6)).ljust(9)}   {best_result[2]}\n")
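
The scores printed by perform_regression are coefficients of determination, R^2 = 1 - SS_res / SS_tot. A small worked sketch with made-up numbers (r2_manual is a hypothetical helper written only for this illustration) shows why the value is 0 for a model that always predicts the mean and can drop below 0 for a model that is worse than that:

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the definition behind sklearn's r2_score."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

mean_pred = np.full(4, y_true.mean())         # always predict the mean
reversed_pred = np.array([4.0, 3.0, 2.0, 1.0])  # systematically wrong

print(r2_manual(y_true, y_true))         # 1.0
print(r2_manual(y_true, mean_pred))      # 0.0
print(r2_manual(y_true, reversed_pred))  # -3.0
```

This is why a strongly negative score, as seen for some of the neural network runs, means the predictions are worse than a constant baseline.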
    

In [None]:
for perc in [0.9, 0.8, 0.5, 0.3, 0.1]:
    df_hashed = utils.get_hashed_fingerprints(min_perc_used=perc)
    df_maccsfp = utils.get_MACCSFP_fingerprints(min_perc_used=perc)
    df_klekota = utils.get_KlekotaRoth_fingerprints(min_perc_used=perc)
    df_mixed = utils.get_mixed_fingerprints(min_perc_used=perc)
    dfs = [
        (df_hashed, "Hashed Extended Fingerprints"), 
        (df_maccsfp, "MACCSFP Fingerprints"), 
        (df_klekota, "Klekota&Roth Fingerprints"), 
        (df_mixed, "Mixed Fingerprints")
    ]
    
    print(f'\nUsing features that are present in at least {perc*100}% of substances.\n')
    for df, title in dfs:
        X = df.drop('IC50', axis=1)
        y = df['IC50']
        print(title, '\n')
        perform_regression(X, y)
        print()
    print('\n\n')

Preparing (ready_sets/cardiotoxicity_hERG_ExtFP.csv) file.
DataFrame base shape: (11504, 1025)
Shape after removing wrong values: (10635, 1025)
Shape after removing least used features: (10635, 51)
Shape after removing outliers: (10396, 51)

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: (10635, 15)
Shape after removing outliers: (10396, 15)

Preparing (ready_sets/cardiotoxicity_hERG_KlekFP.csv) file.
DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing least used features: (10635, 9)
Shape after removing outliers: (10396, 9)

Preparing files for mixed fingerprints.

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: (10635, 15)
Shape after removing outliers: (1039

DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: (10635, 59)
Shape after removing outliers: (10396, 59)

Preparing (ready_sets/cardiotoxicity_hERG_KlekFP.csv) file.
DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing least used features: (10635, 36)
Shape after removing outliers: (10396, 36)

Preparing files for mixed fingerprints.

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: (10635, 59)
Shape after removing outliers: (10396, 59)

Preparing (ready_sets/cardiotoxicity_hERG_KlekFP.csv) file.
DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing least used features: (10635, 36)
Shape after removing outliers: (10396, 36)

Preparing (ready_sets/cardiotoxicity_hERG_ExtFP.cs

DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing least used features: (10635, 161)
Shape after removing outliers: (10396, 161)

Preparing files for mixed fingerprints.

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: (10635, 118)
Shape after removing outliers: (10396, 118)

Preparing (ready_sets/cardiotoxicity_hERG_KlekFP.csv) file.
DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing least used features: (10635, 161)
Shape after removing outliers: (10396, 161)

Preparing (ready_sets/cardiotoxicity_hERG_ExtFP.csv) file.
DataFrame base shape: (11504, 1025)
Shape after removing wrong values: (10635, 1025)
Shape after removing least used features: (10635, 1007)
Shape after removing outliers: (10396, 1007)


Using features that are present in at l

Here I had to interrupt the computation because of the run time. I re-ran it for the remaining parameters, this time only for the combined fingerprints, which had given the best results.

In [None]:
for perc in [0.1, 0.01, 0]:
#     df_hashed = utils.get_hashed_fingerprints(min_perc_used=perc)
#     df_maccsfp = utils.get_MACCSFP_fingerprints(min_perc_used=perc)
#     df_klekota = utils.get_KlekotaRoth_fingerprints(min_perc_used=perc)
    df_mixed = utils.get_mixed_fingerprints(min_perc_used=perc)
    dfs = [
#         (df_hashed, "Hashed Extended Fingerprints"), 
#         (df_maccsfp, "MACCSFP Fingerprints"), 
#         (df_klekota, "Klekota&Roth Fingerprints"), 
        (df_mixed, "Mixed Fingerprints")
    ]
    
    print(f'\nUsing features that are present in at least {perc*100}% of substances.\n')
    for df, title in dfs:
        X = df.drop('IC50', axis=1)
        y = df['IC50']
        print(title, '\n')
        perform_regression(X, y)
        print()
    print('\n\n')

Preparing files for mixed fingerprints.

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: (10635, 118)
Shape after removing outliers: (10396, 118)

Preparing (ready_sets/cardiotoxicity_hERG_KlekFP.csv) file.
DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing least used features: (10635, 161)
Shape after removing outliers: (10396, 161)

Preparing (ready_sets/cardiotoxicity_hERG_ExtFP.csv) file.
DataFrame base shape: (11504, 1025)
Shape after removing wrong values: (10635, 1025)
Shape after removing least used features: (10635, 1007)
Shape after removing outliers: (10396, 1007)


Using features that are present in at least 10.0% of substances.

Mixed Fingerprints 

Regression model:     R^2:
Ridge                 0.472615    {'alpha': 10.0, 'max_iter': 50000}
Lasso                 0.458676    {'alpha

In [16]:
for perc in [0.01, 0]:
#     df_hashed = utils.get_hashed_fingerprints(min_perc_used=perc)
#     df_maccsfp = utils.get_MACCSFP_fingerprints(min_perc_used=perc)
#     df_klekota = utils.get_KlekotaRoth_fingerprints(min_perc_used=perc)
    df_mixed = utils.get_mixed_fingerprints(min_perc_used=perc)
    dfs = [
#         (df_hashed, "Hashed Extended Fingerprints"), 
#         (df_maccsfp, "MACCSFP Fingerprints"), 
#         (df_klekota, "Klekota&Roth Fingerprints"), 
        (df_mixed, "Mixed Fingerprints")
    ]
    
    print(f'\nUsing features that are present in at least {perc*100}% of substances.\n')
    for df, title in dfs:
        X = df.drop('IC50', axis=1)
        y = df['IC50']
        print(title, '\n')
        perform_regression(X, y, svm_only=True)
        print()
    print('\n\n')

Preparing files for mixed fingerprints.

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: (10635, 138)
Shape after removing outliers: (10396, 138)

Preparing (ready_sets/cardiotoxicity_hERG_KlekFP.csv) file.
DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing least used features: (10635, 619)
Shape after removing outliers: (10396, 619)

Preparing (ready_sets/cardiotoxicity_hERG_ExtFP.csv) file.
DataFrame base shape: (11504, 1025)
Shape after removing wrong values: (10635, 1025)
Shape after removing least used features: (10635, 1008)
Shape after removing outliers: (10396, 1008)


Using features that are present in at least 1.0% of substances.

Mixed Fingerprints 

Regression model:     R^2:
SVM                   0.684308    {'C': 10.0}

Best
SVM                   0.684308    {'C': 10.0}





Preparin

Here I test SelectKBest for feature selection. The results are slightly worse than with the previous approach.

In [13]:
for perc in [0.01, 0.1, 0.3, 0.5, 0.8, 0.9]:
    df_hashed = utils.get_hashed_fingerprints(min_perc_used=perc)
    df_maccsfp = utils.get_MACCSFP_fingerprints(min_perc_used=perc)
    df_klekota = utils.get_KlekotaRoth_fingerprints(min_perc_used=perc)
    df_mixed = utils.get_mixed_fingerprints(min_perc_used=perc)
    dfs = [
        (df_hashed, "Hashed Extended Fingerprints"), 
        (df_maccsfp, "MACCSFP Fingerprints"), 
        (df_klekota, "Klekota&Roth Fingerprints"), 
        (df_mixed, "Mixed Fingerprints")
    ]
    
    print(f'\nUsing custom feature selection with {perc*100}% of all features.\n')
    for df, title in dfs:
        X = df.drop('IC50', axis=1)
        y = df['IC50']
        select = int(X.shape[1] * perc)
        np.seterr(divide='ignore', invalid='ignore')
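        # Note: the selector is fitted on the full data set, i.e. before the train/test split
        # done inside perform_regression, so the test rows influence which features are kept.
        X = SelectKBest(f_regression, k=select).fit_transform(X, y)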
        print(title, '\n')
        perform_regression(X, y)
        print()
    print('\n\n')
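
In the cell above, SelectKBest sees all rows before the data is split. One possible alternative, shown here only as a sketch on synthetic data, is to put the selector inside a sklearn Pipeline so that it is refitted on the training part of every fold. The synthetic arrays, the step names "select" and "svr", and the parameter values are assumptions made for the example:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

rng = np.random.default_rng(1)

# Synthetic binary "fingerprints": 200 compounds, 30 bits.
# Only the first three bits carry signal.
X_demo = rng.integers(0, 2, size=(200, 30)).astype(float)
y_demo = X_demo[:, 0] + 2 * X_demo[:, 1] - X_demo[:, 2] + rng.normal(0, 0.1, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Selection and regression are one estimator, so every CV fold selects
# features from its own training rows only.
pipe = Pipeline([
    ("select", SelectKBest(f_regression)),
    ("svr", SVR()),
])
param_grid = {"select__k": [3, 10, 30], "svr__C": [1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=KFold(n_splits=5, shuffle=True, random_state=1))
search.fit(X_tr, y_tr)

# For a regressor, GridSearchCV.score reports R^2 of the best pipeline
test_r2 = search.score(X_te, y_te)
print(search.best_params_, round(test_r2, 3))
```

Treating k as a tuned parameter also replaces the outer loop over percentages, at the cost of more fits per search.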

Preparing (ready_sets/cardiotoxicity_hERG_ExtFP.csv) file.
DataFrame base shape: (11504, 1025)
Shape after removing wrong values: (10635, 1025)
Shape after removing least used features: (10635, 1008)
Shape after removing outliers: (10396, 1008)

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: (10635, 138)
Shape after removing outliers: (10396, 138)

Preparing (ready_sets/cardiotoxicity_hERG_KlekFP.csv) file.
DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing least used features: (10635, 619)
Shape after removing outliers: (10396, 619)

Preparing files for mixed fingerprints.

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: (10635, 138)
Shape after removing outl


Preparing (ready_sets/cardiotoxicity_hERG_KlekFP.csv) file.
DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing least used features: (10635, 70)
Shape after removing outliers: (10396, 70)

Preparing files for mixed fingerprints.

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: (10635, 80)
Shape after removing outliers: (10396, 80)

Preparing (ready_sets/cardiotoxicity_hERG_KlekFP.csv) file.
DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing least used features: (10635, 70)
Shape after removing outliers: (10396, 70)

Preparing (ready_sets/cardiotoxicity_hERG_ExtFP.csv) file.
DataFrame base shape: (11504, 1025)
Shape after removing wrong values: (10635, 1025)
Shape after removing least used features: (10635, 655)
Shape after removing outliers: 

DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing least used features: (10635, 15)
Shape after removing outliers: (10396, 15)

Preparing (ready_sets/cardiotoxicity_hERG_ExtFP.csv) file.
DataFrame base shape: (11504, 1025)
Shape after removing wrong values: (10635, 1025)
Shape after removing least used features: (10635, 84)
Shape after removing outliers: (10396, 84)


Using custom feature selection with 80.0% of all features.

Hashed Extended Fingerprints 

Regression model:     R^2:
Ridge                 0.098948    {'alpha': 10.0, 'max_iter': 50000}
Lasso                 0.097875    {'alpha': 1e-09, 'max_iter': 10000}
Elastic Net           0.097875    {'alpha': 1e-09, 'max_iter': 10000}
Bayesian              0.099751    {'alpha_1': 10.0, 'alpha_2': 1e-09}
SGD                   0.090023    {'alpha': 1e-09}
SVM                   0.29427     {'C': 10.0}
Decision Tree         0.219541    {'ccp_alpha': 0}
K Neighbors           0.231465

Finally, I also tested a simple neural network.

In [14]:
def perform_NN_regression(X, y):
    history = History()
    model = Sequential()

    size = X.shape[1]
    model.add(Dense(size//2, activation="sigmoid",input_shape=(size,)))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))

    model.add(Dense(size//4, activation="sigmoid"))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))

    model.add(Dense(size//8, activation="sigmoid"))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))

    model.add(Dense(size//16, activation="sigmoid"))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))

    model.add(Dense(size//32, activation="sigmoid"))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))

    model.add(Dense(1))
    model.summary()


    early_stopping = EarlyStopping(patience=30, monitor="val_loss")
    model.compile(loss='mean_absolute_error', optimizer="Adam")

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)

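    # Keras lets validation_data override validation_split, so the redundant
    # validation_split=0.2 argument is dropped. Note that early stopping monitors
    # the test set here, which makes the final R^2 optimistic.
    model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=256, epochs=200, callbacks=[early_stopping, history], verbose=0)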
    prediction = model.predict(X_test)
    score = r2_score(y_test, prediction)
    print(f'\nFinal R^2 score:   {score}')
    print()
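
The hidden layers in perform_NN_regression are sized by integer division of the feature count. A small sketch of that arithmetic follows; layer_widths and its min_units floor are hypothetical helpers added only for illustration. It shows that the deepest layers shrink to zero units when few features survive the filtering, for example with 14 inputs, which is consistent with the 7-unit and 3-unit layers in one of the summaries below:

```python
def layer_widths(n_features, depth=5, min_units=0):
    """Widths n//2, n//4, ..., n//2**depth, as used for the Dense layers.

    min_units is a hypothetical floor; the notebook code applies none (min_units=0).
    """
    return [max(min_units, n_features // 2 ** (i + 1)) for i in range(depth)]

print(layer_widths(72))               # [36, 18, 9, 4, 2]
print(layer_widths(14))               # [7, 3, 1, 0, 0]
print(layer_widths(14, min_units=1))  # [7, 3, 1, 1, 1]
```

A layer with zero units passes no information forward, so for the heavily filtered data sets a floor like this, or a shallower network, may be worth trying before comparing scores.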
    

In [15]:
for perc in reversed([0, 0.01, 0.1, 0.3, 0.5, 0.8, 0.9]):
    df_hashed = utils.get_hashed_fingerprints(min_perc_used=perc)
    df_maccsfp = utils.get_MACCSFP_fingerprints(min_perc_used=perc)
    df_klekota = utils.get_KlekotaRoth_fingerprints(min_perc_used=perc)
    df_mixed = utils.get_mixed_fingerprints(min_perc_used=perc)
    dfs = [
        (df_hashed, "Hashed Extended Fingerprints"), 
        (df_maccsfp, "MACCSFP Fingerprints"), 
        (df_klekota, "Klekota&Roth Fingerprints"), 
        (df_mixed, "Mixed Fingerprints")
    ]
    
    print(f'\nUsing features that are present in at least {perc*100}% of substances.\n')
    for df, title in dfs:
        X = df.drop('IC50', axis=1)
        y = df['IC50']
        print(title, '\n')
        perform_NN_regression(X, y)
        print()
    print('\n\n')

Preparing (ready_sets/cardiotoxicity_hERG_ExtFP.csv) file.
DataFrame base shape: (11504, 1025)
Shape after removing wrong values: (10635, 1025)
Shape after removing least used features: (10635, 51)
Shape after removing outliers: (10396, 51)

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: (10635, 15)
Shape after removing outliers: (10396, 15)

Preparing (ready_sets/cardiotoxicity_hERG_KlekFP.csv) file.
DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing least used features: (10635, 9)
Shape after removing outliers: (10396, 9)

Preparing files for mixed fingerprints.

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: (10635, 15)
Shape after removing outliers: (1039


Final R^2 score:   -10.789249898487117


Mixed Fingerprints 

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_18 (Dense)            (None, 36)                2628      
                                                                 
 batch_normalization_15 (Bat  (None, 36)               144       
 chNormalization)                                                
                                                                 
 dropout_15 (Dropout)        (None, 36)                0         
                                                                 
 dense_19 (Dense)            (None, 18)                666       
                                                                 
 batch_normalization_16 (Bat  (None, 18)               72        
 chNormalization)                                                
                                                         


Final R^2 score:   -10.789249898487117


Klekota&Roth Fingerprints 

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_36 (Dense)            (None, 7)                 105       
                                                                 
 batch_normalization_30 (Bat  (None, 7)                28        
 chNormalization)                                                
                                                                 
 dropout_30 (Dropout)        (None, 7)                 0         
                                                                 
 dense_37 (Dense)            (None, 3)                 24        
                                                                 
 batch_normalization_31 (Bat  (None, 3)                12        
 chNormalization)                                                
                                                  


Final R^2 score:   0.32007536149258176


MACCSFP Fingerprints 

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_54 (Dense)            (None, 29)                1711      
                                                                 
 batch_normalization_45 (Bat  (None, 29)               116       
 chNormalization)                                                
                                                                 
 dropout_45 (Dropout)        (None, 29)                0         
                                                                 
 dense_55 (Dense)            (None, 14)                420       
                                                                 
 batch_normalization_46 (Bat  (None, 14)               56        
 chNormalization)                                                
                                                       


Final R^2 score:   0.3973103893275761





Preparing (ready_sets/cardiotoxicity_hERG_ExtFP.csv) file.
DataFrame base shape: (11504, 1025)
Shape after removing wrong values: (10635, 1025)
Shape after removing least used features: (10635, 655)
Shape after removing outliers: (10396, 655)

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: (10635, 80)
Shape after removing outliers: (10396, 80)

Preparing (ready_sets/cardiotoxicity_hERG_KlekFP.csv) file.
DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing least used features: (10635, 70)
Shape after removing outliers: (10396, 70)

Preparing files for mixed fingerprints.

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing least used features: 


Final R^2 score:   0.055933103508097215


Mixed Fingerprints 

Model: "sequential_15"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_90 (Dense)            (None, 401)               322003    
                                                                 
 batch_normalization_75 (Bat  (None, 401)              1604      
 chNormalization)                                                
                                                                 
 dropout_75 (Dropout)        (None, 401)               0         
                                                                 
 dense_91 (Dense)            (None, 200)               80400     
                                                                 
 batch_normalization_76 (Bat  (None, 200)              800       
 chNormalization)                                                
                                                       


Final R^2 score:   0.15548729740158285


Klekota&Roth Fingerprints 

Model: "sequential_18"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_108 (Dense)           (None, 80)                12880     
                                                                 
 batch_normalization_90 (Bat  (None, 80)               320       
 chNormalization)                                                
                                                                 
 dropout_90 (Dropout)        (None, 80)                0         
                                                                 
 dense_109 (Dense)           (None, 40)                3240      
                                                                 
 batch_normalization_91 (Bat  (None, 40)               160       
 chNormalization)                                                
                                                 


Final R^2 score:   0.5655981119947822


MACCSFP Fingerprints 

Model: "sequential_21"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_126 (Dense)           (None, 68)                9384      
                                                                 
 batch_normalization_105 (Ba  (None, 68)               272       
 tchNormalization)                                               
                                                                 
 dropout_105 (Dropout)       (None, 68)                0         
                                                                 
 dense_127 (Dense)           (None, 34)                2346      
                                                                 
 batch_normalization_106 (Ba  (None, 34)               136       
 tchNormalization)                                               
                                                       


Final R^2 score:   0.6191569557104494





Preparing (ready_sets/cardiotoxicity_hERG_ExtFP.csv) file.
DataFrame base shape: (11504, 1025)
Shape after removing wrong values: (10635, 1025)
Shape after removing outliers: (10396, 1025)

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing outliers: (10396, 167)

Preparing (ready_sets/cardiotoxicity_hERG_KlekFP.csv) file.
DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after removing outliers: (10396, 4861)

Preparing files for mixed fingerprints.

Preparing (ready_sets/cardiotoxicity_hERG_MACCSFP.csv) file.
DataFrame base shape: (11504, 167)
Shape after removing wrong values: (10635, 167)
Shape after removing outliers: (10396, 167)

Preparing (ready_sets/cardiotoxicity_hERG_KlekFP.csv) file.
DataFrame base shape: (11504, 4861)
Shape after removing wrong values: (10635, 4861)
Shape after


Final R^2 score:   0.5382226773563157


Mixed Fingerprints 

Model: "sequential_27"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_162 (Dense)           (None, 3025)              18304275  
                                                                 
 batch_normalization_135 (Ba  (None, 3025)             12100     
 tchNormalization)                                               
                                                                 
 dropout_135 (Dropout)       (None, 3025)              0         
                                                                 
 dense_163 (Dense)           (None, 1512)              4575312   
                                                                 
 batch_normalization_136 (Ba  (None, 1512)             6048      
 tchNormalization)                                               
                                                         