The following are the steps for using AutoML for a regression task:

_Note: Setting `featurization=True` represents each molecule using five representation techniques._

1. Requires an input pandas dataframe consisting of two columns:<br>
    * SMILES strings<br>
    * target property values<br>


2. Molecules are represented as:<br>
    * Coulomb matrix<br>
    * RDKit Morgan fingerprints<br>
    * MACCS keys<br>
    * RDKit hashed topological torsion<br>
    * RDKit molecular descriptors (all)<br>



3. Screens through various sklearn regressor models:<br>
    * [MLPRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html)<br>
    * [GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)<br>
    * [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)<br>
    * [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)<br>
    * [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)<br>
    * [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)<br>
    * [ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html)<br>
    * [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)<br>


Yields the `n_best` models, with optimized hyperparameters.
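The screening idea can be illustrated with plain scikit-learn. This is a minimal sketch, not ChemML's implementation: it assumes synthetic data, default hyperparameters, and 3-fold cross-validated MAE as the ranking score, whereas the run log further down reports a genetic algorithm for tuning. The names `candidates`, `ranked`, and `n_best` are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a featurized dataset (assumption: 60 samples, 5 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=60)

# A few of the regressors listed above, with default hyperparameters
candidates = {
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Score every candidate by cross-validated mean absolute error (lower is better)
ranked = []
for name, model in candidates.items():
    neg_mae = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error")
    ranked.append((name, -neg_mae.mean()))
ranked.sort(key=lambda item: item[1])

# Keep the two best-scoring candidates
n_best = ranked[:2]
for name, mae in n_best:
    print(f"{name}: MAE = {mae:.3f}")
```

In a real screen, each candidate would also be tuned and evaluated once per featurization, so the ranking covers model and feature combinations rather than models alone.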

Returns a dataframe of error metrics, machine learning model, algorithm, tuned hyperparameter values, and featurization technique.
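For orientation, several of the reported error metrics can be computed by hand. The sketch below uses the standard textbook definitions; the helper name `regression_metrics` is hypothetical, and the sign convention for ME (predicted minus true) is an assumption, so ChemML's own definitions may differ in detail.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return a few common regression error metrics as a dict."""
    n = len(y_true)
    # Assumption: error is predicted minus true
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "ME": sum(errors) / n,                      # mean (signed) error
        "MAE": sum(abs(e) for e in errors) / n,     # mean absolute error
        "MSE": mse,                                 # mean squared error
        "RMSE": math.sqrt(mse),                     # root mean squared error
        "MAPE": 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
        "MaxAE": max(abs(e) for e in errors),       # largest absolute error
        "r_squared": 1 - sum(e * e for e in errors) / ss_tot,
    }

# Toy densities in Kg/m3 (made-up values for illustration)
metrics = regression_metrics([1000.0, 1200.0, 1400.0], [1010.0, 1190.0, 1400.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```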

# Load your data 

In [1]:
import pandas as pd
import numpy as np
from chemml.chem import Molecule
from chemml.datasets import load_organic_density



In [2]:
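# Load SMILES strings, target densities, and a descriptor subset (unused here)
molecules, target, dragon_subset = load_organic_density()
# Combine SMILES and target into the two-column dataframe described above
df = pd.concat([molecules, target], axis=1)
df.head()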

| | smiles | density_Kg/m3 |
|---|---|---|
| 0 | C1CSC(CS1)c1ncc(s1)CC1CCCC1 | 1184.64 |
| 1 | Oc1nccnc1c1coc(c1)c1cnccn1 | 1333.85 |
| 2 | N1CN(CN(C1)c1cncs1)c1ccc(cn1)c1cocc1 | 1332.41 |
| 3 | OC1(CSCCS1)c1ccoc1O | 1370.11 |
| 4 | N1CNC(N(C1)c1cscn1)c1cocc1c1cscn1 | 1387.8 |
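If your data does not come from the bundled dataset, the same two-column layout can be assembled by hand. A minimal sketch, reusing the first three rows shown above; the variable name `my_df` is illustrative, and the column names only need to match what you later pass as `smiles` and `target`.

```python
import pandas as pd

# First three rows of the organic density dataset, copied from the output above
my_df = pd.DataFrame(
    {
        "smiles": [
            "C1CSC(CS1)c1ncc(s1)CC1CCCC1",
            "Oc1nccnc1c1coc(c1)c1cnccn1",
            "N1CN(CN(C1)c1cncs1)c1ccc(cn1)c1cocc1",
        ],
        "density_Kg/m3": [1184.64, 1333.85, 1332.41],
    }
)

# Sanity checks before screening: no missing SMILES, numeric target column
has_all_smiles = bool(my_df["smiles"].notna().all())
target_is_numeric = pd.api.types.is_numeric_dtype(my_df["density_Kg/m3"])
print(my_df.shape, has_all_smiles, target_is_numeric)
```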


# Run autoML for a regression task

In [3]:
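from chemml.autoML import ModelScreener

# Featurize the SMILES column, then screen regressors against the density target
MS = ModelScreener(df, target="density_Kg/m3", featurization=True, smiles="smiles",
                   screener_type="regressor", output_file="testing.txt")
# Keep the four best-scoring models
scores = MS.screen_models(n_best=4)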

featurizing molecules in batches of 62 ...
Merging batch features ...    [DONE]

Running model no:  1 ; Name:  RandomForestRegressor
RandomForestRegressor : GeneticAlgorithm - complete
Model: RandomForestRegressor
GA time(hours):  0.14414592577351465


scores_list:  [         ME        MAE         MSE       RMSE      MSLE     RMSLE      MAPE  \
0  3.766114  33.290882  1855.76677  43.078612  0.001197  0.034601  2.655691   

     MaxAPE     RMSPE       MPE      MaxAE   deltaMaxE  r_squared        std  \
0  9.274192  3.500977  0.126405  102.68927  184.648282   0.693355  77.793578   

   time(seconds)                  Model  \
0      537.54183  RandomForestRegressor   

                                          parameters        Feature  
0  {'bootstrap': True, 'ccp_alpha': 0.0, 'criteri...  CoulombMatrix  ]
--------------------------------------------------------------------------------

Running model no:  2 ; Name:  Ridge
Ridge : GeneticAlgorithm - complete
Model: Ridge
GA time(hours):  

Ridge : GeneticAlgorithm - complete
Model: Ridge
GA time(hours):  0.0015956753492355348


scores_list:  [         ME        MAE         MSE       RMSE      MSLE     RMSLE      MAPE  \
0  3.766114  33.290882  1855.76677  43.078612  0.001197  0.034601  2.655691   

     MaxAPE     RMSPE       MPE      MaxAE   deltaMaxE  r_squared        std  \
0  9.274192  3.500977  0.126405  102.68927  184.648282   0.693355  77.793578   

   time(seconds)                  Model  \
0      537.54183  RandomForestRegressor   

                                          parameters        Feature  
0  {'bootstrap': True, 'ccp_alpha': 0.0, 'criteri...  CoulombMatrix  ,          ME        MAE          MSE      RMSE      MSLE     RMSLE      MAPE  \
0 -9.877151  41.996243  2860.304026  53.48181  0.001757  0.041913  3.318766   

     MaxAPE     RMSPE       MPE       MaxAE   deltaMaxE  r_squared        std  \
0  9.493894  4.233819 -0.775979  128.591294  246.767488   0.527366  77.793578   

   time(seconds)  Model  

RandomForestRegressor : GeneticAlgorithm - complete
Model: RandomForestRegressor
GA time(hours):  0.0032238522503111097


Lasso : GeneticAlgorithm - complete
Model: Lasso
GA time(hours):  0.0015968909528520372


RandomForestRegressor : GeneticAlgorithm - complete
Model: RandomForestRegressor
GA time(hours):  0.009141929613219368


Lasso : GeneticAlgorithm - complete
Model: Lasso
GA time(hours):  0.0033649594916237723



Running model no:  4 ; Name:  ElasticNet
ElasticNet : GeneticAlgorithm - complete
Model: ElasticNet
GA time(hours):  0.005486866633097331


RandomForestRegressor : GeneticAlgorithm - complete
Model: RandomForestRegressor
GA time(hours):  0.08338304936885833


Ridge : GeneticAlgorithm - complete
Model: Ridge
GA time(hours):  0.0007020504607094659


Lasso : GeneticAlgorithm - complete
Model: Lasso
GA time(hours):  0.0018881410360336304


ElasticNet : GeneticAlgorithm - complete
Model: ElasticNet
GA time(hours):  0.0019778504636552598


In [4]:
scores

| | ME | MAE | MSE | RMSE | MSLE | RMSLE | MAPE | MaxAPE | RMSPE | MPE | MaxAE | deltaMaxE | r_squared | std | time(seconds) | Model | parameters | Feature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.720613 | 10.683362 | 173.020379 | 13.153721 | 0.000105 | 0.010237 | 0.83854 | 2.256886 | 1.028124 | -0.370118 | 30.497531 | 56.140709 | 0.97141 | 77.793578 | 6.962296 | Lasso | `{'alpha': 0.003162277660168382, 'copy_X': True...` | rdkit_descriptors |
| 0 | -4.85961 | 10.759084 | 180.048202 | 13.418204 | 0.00011 | 0.010499 | 0.846624 | 2.187539 | 1.054594 | -0.386489 | 26.585163 | 52.046855 | 0.970249 | 77.793578 | 7.30605 | ElasticNet | `{'alpha': 0.003162277660168382, 'copy_X': True...` | rdkit_descriptors |
| 0 | -6.393704 | 12.457197 | 258.461391 | 16.076734 | 0.000161 | 0.012694 | 0.98761 | 3.204478 | 1.279798 | -0.515773 | 40.295881 | 63.947872 | 0.957292 | 77.793578 | 2.596274 | Ridge | `{'alpha': 9.9, 'copy_X': True, 'fit_intercept'...` | rdkit_descriptors |
| 0 | -6.000613 | 22.519728 | 837.234426 | 28.935003 | 0.000545 | 0.023355 | 1.80575 | 8.17048 | 2.372345 | -0.559577 | 87.613689 | 145.405067 | 0.861656 | 77.793578 | 13.748923 | ElasticNet | `{'alpha': 0.10000000000000002, 'copy_X': True,...` | morganfingerprints_radius3 |
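Every row in the returned frame carries the index label 0, so label-based lookups such as `scores.loc[0]` match all rows at once. One way to pick out the best combination is to reset the index and sort by a metric of your choice. The sketch below rebuilds a small stand-in frame (`scores_demo`, an illustrative name) from four columns of the table above and ranks by MAE, which is an arbitrary choice.

```python
import pandas as pd

# Stand-in for `scores`, using a subset of the columns and values shown above
scores_demo = pd.DataFrame(
    {
        "MAE": [10.683362, 10.759084, 12.457197, 22.519728],
        "r_squared": [0.97141, 0.970249, 0.957292, 0.861656],
        "Model": ["Lasso", "ElasticNet", "Ridge", "ElasticNet"],
        "Feature": [
            "rdkit_descriptors",
            "rdkit_descriptors",
            "rdkit_descriptors",
            "morganfingerprints_radius3",
        ],
    },
    index=[0, 0, 0, 0],  # mimic the repeated index label of the real output
)

# Give each row a unique label, then rank by mean absolute error (lower is better)
ranked = scores_demo.reset_index(drop=True).sort_values("MAE")
best = ranked.iloc[0]
print(best["Model"], best["Feature"])
```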


# Save scores to csv

In [5]:
scores.to_csv("autoML_test.csv",index=False)
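When the CSV is read back, the `parameters` column comes back as plain strings rather than dictionaries. The sketch below shows one way to recover them with `ast.literal_eval`. It round-trips a small illustrative frame through an in-memory buffer instead of a file, and the parameter values in it are made up for the example. This only works when every parameter value is a plain Python literal (numbers, strings, booleans, `None`).

```python
import ast
import io

import pandas as pd

# Illustrative frame with a dict-valued column (parameter values are hypothetical)
demo = pd.DataFrame(
    {"Model": ["Ridge"], "parameters": [{"alpha": 9.9, "copy_X": True}]}
)

# Round-trip through CSV using an in-memory buffer in place of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After loading, the dict has been flattened to its string representation
was_string = isinstance(loaded.loc[0, "parameters"], str)

# Parse the strings back into dictionaries
loaded["parameters"] = loaded["parameters"].apply(ast.literal_eval)
print(was_string, loaded.loc[0, "parameters"]["alpha"])
```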