# AutoML (Auto-sklearn) on Matbench v0.1

### AutoML-Benchmark in materials design
- This algorithm is a modification of the *AutoML Benchmark framework* from [Conrad2022AutoMLBench](https://www.nature.com/articles/s41598-022-23327-1).
- The original framework combines four AutoML tools and selects the best-performing one.
- To avoid conflicts between their dependencies, each AutoML tool runs in its own container.
- Further information on the implementation can be found in the publication and the GitHub repository: https://github.com/mm-tud/automl-materials
- The framework is simplified for this benchmark so that Docker is not needed.
- Consequently, only the AutoML tool that performed best on the specific task is used.

```
Conrad, F., Mälzer, M., Schwarzenberger, M. et al. Benchmarking AutoML for regression tasks 
on small tabular data in materials design. Sci Rep 12, 19350 (2022). 
https://doi.org/10.1038/s41598-022-23327-1
```
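
The selection step described above (run several tools, keep the best one per task) can be pictured with the minimal sketch below. The tool names, the scores, and the use of MAE as the selection criterion are all assumptions made for illustration; they are not taken from the publication or the repository.

```python
# Hypothetical validation MAE per tool. Names and numbers are placeholders,
# not results from the publication.
validation_mae = {
    "tool_a": 110.0,
    "tool_b": 95.0,
    "tool_c": 102.5,
    "tool_d": 99.0,
}


def select_best_tool(scores):
    """Return the key with the lowest error (lower MAE is better)."""
    return min(scores, key=scores.get)


best_tool = select_best_tool(validation_mae)
print(best_tool)
```

In this notebook the selection has already been made, which is why only one tool appears in the code.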

### Auto-sklearn
- The best framework for the task *matbench_steels* is Auto-sklearn, so it is the only tool needed here.
- More details on Auto-sklearn can be found in [Feurer2015autosklearn](https://automl.github.io/auto-sklearn/master/).

```
Feurer, M., Klein, A., Eggensperger, K. et al. Efficient and Robust Automated Machine Learning.
Advances in Neural Information Processing Systems 28, 2962--2970, (2015)
```

## Defining Parameters for Run
- The parameters below exactly match those used in the publication [Conrad2022AutoMLBench](https://www.nature.com/articles/s41598-022-23327-1).

In [None]:
CONSTANTS = dict(INNER_SPLITS = 10,
                 NUM_CORES = 8,
                 MAX_TIME_MINUTES = 60,
                 SEED=1,
                 AUTO_TEMP_FOLDER = 'temp_autosklearn' )
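
The time-related constant is given in minutes, while the run cell passes it to Auto-sklearn in seconds. The short sketch below works through that arithmetic. `N_OUTER_FOLDS = 5` is an assumption about the number of Matbench folds and is not part of the constants above.

```python
# Same value as MAX_TIME_MINUTES in the CONSTANTS dict.
MAX_TIME_MINUTES = 60

# Assumption for illustration: Matbench evaluates each task on 5 outer folds.
N_OUTER_FOLDS = 5

# The run cell multiplies the constant by 60 and by 5, giving seconds.
time_left_for_this_task = MAX_TIME_MINUTES * 60  # total search budget per fold
per_run_time_limit = MAX_TIME_MINUTES * 5        # cap for one model evaluation

# How many full-length evaluations fit into the budget of one fold.
runs_at_full_limit = time_left_for_this_task // per_run_time_limit

# Search budget summed over all outer folds, in hours.
total_budget_hours = N_OUTER_FOLDS * MAX_TIME_MINUTES / 60

print(time_left_for_this_task, per_run_time_limit, runs_at_full_limit, total_budget_hours)
```

Under these assumptions a single model evaluation may use at most one twelfth of the per-fold budget.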

In [None]:
import sys
sys.path.insert(0,'../..')
from matbench.bench import MatbenchBenchmark

mb = MatbenchBenchmark(autoload=False, subset=['matbench_steels'])



for task in mb.tasks:
    print(mb.tasks)
    print(task)
    task.load()
    for fold in task.folds:

        # Inputs are either chemical compositions as strings
        # or crystal structures as pymatgen.Structure objects.
        # Outputs are either floats (regression tasks) or bools (classification tasks)
        train_inputs, train_outputs = task.get_train_and_val_data(fold)

        import autosklearn.regression
        from autosklearn.metrics import r2
        from autosklearn.metrics import mean_squared_error as mse
        from autosklearn.metrics import mean_absolute_error as mae
        from autosklearn.metrics import median_absolute_error as mabse
        from autosklearn.regression import AutoSklearnRegressor as Regressor
        import pandas as pd
       
        # Helper: pair up an alternating [element, amount, element, amount, ...] list
        # into an {element: amount} dict (a trailing unpaired item is dropped)
        def Convert(a):
            it = iter(a)
            res_dct = dict(zip(it, it))
            return res_dct
        
        # Definition of the model (part of the framework from: https://github.com/mm-tud/automl-materials)
        class my_model:
            def __init__(self):
                self.train_inputs = None
                self.train_outputs = None
                self.model = None
                self.columns = None
                self.test_inputs = None
                self.predictions = None
                
            def data_conversion_composition(self, data):
                data_comp = data.str.split(r'([\d.]+)')
                for n in range(len(data_comp)):
                    data_comp.iat[n] = Convert(data_comp.iat[n])
                data_comp = pd.json_normalize(data_comp)
                data_comp = data_comp.fillna(0)
                data_comp = data_comp.astype(float)
                return data_comp
            
            # Note: this method is not called anywhere in this notebook.
            def data_conversion_label(self, data):
                data_comp = data.str.split(r'([\d.]+)')
                for n in range(len(data_comp)):
                    data_comp.at[n,'composition'] = Convert(data_comp.at[n,'composition'])
                data_comp = data_comp.fillna(0)
                return data_comp
                
            
            def train_and_validate(self, train_inputs, train_outputs):
                
                self.train_inputs = self.data_conversion_composition(train_inputs)
                self.columns = self.train_inputs.columns
                self.train_outputs = train_outputs
                

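                # Time limits are in seconds: the total budget is MAX_TIME_MINUTES * 60,
                # and each single model fit gets 1/12 of it (MAX_TIME_MINUTES * 5).
                self.model = Regressor(time_left_for_this_task=CONSTANTS['MAX_TIME_MINUTES']*60,
                                       per_run_time_limit=CONSTANTS['MAX_TIME_MINUTES']*5,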
                                       resampling_strategy='cv',
                                       resampling_strategy_arguments={'folds':CONSTANTS['INNER_SPLITS']},
                                       n_jobs=CONSTANTS['NUM_CORES'],
                                       seed=CONSTANTS['SEED'],
                                       scoring_functions=[r2, mse, mae, mabse],
                                       tmp_folder=CONSTANTS['AUTO_TEMP_FOLDER'],
                                       delete_tmp_folder_after_terminate=True)


                self.model.fit(self.train_inputs, self.train_outputs)
            
            def predict(self, test_inputs):
                self.test_inputs = self.data_conversion_composition(test_inputs)
                self.test_inputs = self.test_inputs[self.columns]
                self.predictions = self.model.predict(self.test_inputs)
                return self.predictions
        
        # Use a separate name for the instance so the class itself is not shadowed
        model = my_model()
        
        # Train and validate the model
        model.train_and_validate(train_inputs, train_outputs)

        # Get testing data
        test_inputs = task.get_test_data(fold, include_target=False)

        # Predict on the testing data
        # Your output should be a pandas series, numpy array, or python iterable
        # where the array elements are floats or bools
        predictions = model.predict(test_inputs)

        # Record your data!
        task.record(fold, predictions)

# Save your results
mb.to_file("results.json.gz")
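
The composition handling inside the model class can be hard to follow in place. The sketch below reproduces the same split-and-pair idea with the standard library only. The function names `parse_composition` and `to_feature_rows` and the example formulas are hypothetical and chosen for illustration; the notebook itself uses pandas for this step.

```python
import re


def parse_composition(formula):
    """Turn a string such as 'Fe0.7C0.3' into {'Fe': 0.7, 'C': 0.3}."""
    # re.split with a capturing group keeps the matched numbers in the result:
    # 'Fe0.7C0.3' -> ['Fe', '0.7', 'C', '0.3', '']
    parts = re.split(r'([\d.]+)', formula)
    it = iter(parts)
    # zip(it, it) pairs consecutive items; the trailing '' has no partner and is dropped.
    return {element: float(amount) for element, amount in zip(it, it)}


def to_feature_rows(formulas, columns=None):
    """Build one numeric row per formula; elements absent from a formula become 0.0."""
    parsed = [parse_composition(f) for f in formulas]
    if columns is None:
        # Collect element names in order of first appearance (training case).
        columns = []
        for composition in parsed:
            for element in composition:
                if element not in columns:
                    columns.append(element)
    rows = [[composition.get(element, 0.0) for element in columns]
            for composition in parsed]
    return columns, rows


# Made-up formulas, used only to show the data flow.
columns, train_rows = to_feature_rows(['Fe0.7C0.3', 'Fe0.5Ni0.5'])
# Reuse the training columns so test rows have the same layout.
_, test_rows = to_feature_rows(['Ni0.2Fe0.8'], columns=columns)

print(columns)
print(train_rows)
print(test_rows)
```

One design difference is worth noting: this sketch fills an element that is missing from a test formula with 0.0, whereas selecting columns with `self.test_inputs[self.columns]` in pandas raises a `KeyError` if a training column is absent from the test data.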

## Load and show results (MAE)

In [None]:
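import gzip
import json
import pandas as pd

# A .json.gz file is gzip-compressed, so read it with gzip in text mode
with gzip.open('results.json.gz', 'rt') as f:
    results = json.load(f)

# Flatten the nested result dict into dotted column names, then keep the MAE scores
df = pd.json_normalize(results)
df.filter(regex='scores').filter(regex='mae')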
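
To make the flatten-and-filter step more concrete, the sketch below mimics it with the standard library: it writes a small nested dict to a gzip-compressed JSON file, reads it back, joins nested keys with dots (the default separator of `pd.json_normalize`), and keeps the entries whose name contains both `scores` and `mae`. The nesting of the toy dict, the task name, and all numbers are made up for illustration and are not actual benchmark results; `flatten` is a hypothetical helper.

```python
import gzip
import json
import os
import tempfile

# Toy stand-in for a results file; structure and values are invented.
toy = {"tasks": {"toy_task": {"results": {
    "fold_0": {"scores": {"mae": 1.5, "rmse": 2.0}},
    "fold_1": {"scores": {"mae": 2.5, "rmse": 3.0}},
}}}}


def flatten(d, prefix=""):
    """Flatten nested dicts into a single dict with dot-joined keys."""
    flat = {}
    for key, value in d.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_results.json.gz")
    # Write and read in text mode through gzip.
    with gzip.open(path, "wt") as f:
        json.dump(toy, f)
    with gzip.open(path, "rt") as f:
        loaded = json.load(f)

flat = flatten(loaded)
mae_columns = {k: v for k, v in flat.items() if "scores" in k and "mae" in k}
mean_mae = sum(mae_columns.values()) / len(mae_columns)

print(mae_columns)
print(mean_mae)
```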