# Automatminer v1.0.3.20200727 on Matbench v0.1
###### Created April 15, 2021

![logo](logo.png)


# Description

This submission is for Automatminer Express v1.0.3.20200727. Automatminer is a fully-automated, feature-based algorithm for prototyping ML pipelines with AutoML search for hyperparameter optimization. Find the full source code of automatminer at [https://github.com/hackingmaterials/automatminer](https://github.com/hackingmaterials/automatminer).

# Benchmark name
Matbench v0.1

# Package versions
```
- matbench==0.1.0
- automatminer==1.0.3.20200727 # many packages, such as matminer, have specific required versions specified in this version of automatminer
```

# Algorithm description
The Automatminer express model is an abridged version of the full Automatminer pipeline described in [Dunn et al.](https://doi.org/10.1038/s41524-020-00406-3).

Automatminer works in 4 distinct steps:
- **Autofeaturization**: Sequentially apply many featurizers implemented in matminer, most of which are "hand crafted" (meaning a scientist has thought about their physical interpretation and relevance). Featurizers that fail preliminary validity pre-checks or have high failure rates are removed automatically.
- **Data cleaning**: Drop or impute samples containing NaN values, and remove erroneous or infinite-valued features produced by autofeaturization.
- **Feature reduction**: Intelligently reduce the number of features - for example, removing similar features between distinct featurizers with correlation-based feature reduction, then sequentially applying tree-model-based feature reduction.
- **AutoML**: Using TPOT's genetic algorithm-based optimization, evolve different machine learning pipelines based on internal validation scores. This optimization also automatically tunes the hyperparameters of the models. Some of the included models are Random Forest, Gradient Boosted Trees, Logistic Regression, Linear Regression (with various regularizations), and Support Vector Machines; a full specification of all the models and hyperparameter grids defined for the preset is available in the automatminer [source code](https://github.com/hackingmaterials/automatminer).
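
To make the correlation-based part of the feature reduction step concrete, here is a minimal pure-Python sketch. It is not automatminer's implementation: the feature names, the 0.95 threshold, and the keep-the-first-seen tie-break are all assumptions chosen for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated in this sketch
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept.

    `features` maps a feature name to its list of values. The threshold and
    the keep-first rule are illustrative assumptions.
    """
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical features; the second is a rescaled copy of the first
features = {
    "mean_atomic_mass": [1.0, 2.0, 3.0, 4.0],
    "mean_atomic_mass_x2": [2.0, 4.0, 6.0, 8.0],
    "band_center": [0.3, -1.2, 0.8, 0.1],
}
reduced = drop_correlated(features)
print(sorted(reduced))
```

In this toy input the rescaled copy is dropped because it carries no information beyond the first feature, while the weakly correlated third feature is kept.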

![pipe](pipe.png)


The primary data store of automatminer is the pandas dataframe.

![df](dataframe_pipe.png)

The primary working object in automatminer is the MatPipe, which holds comprehensive information about each of the above steps. MatPipes use a fit/predict style syntax similar to sklearn's `BaseEstimator`. After being fit, a MatPipe can be inspected (detailed internal information for each pipeline stage) with the `.inspect()` method or summarized (more abbreviated internal information) with the `.summarize()` method.
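
A rough sketch of that fit/predict/inspect workflow is shown below. The helper name `fit_and_inspect` and its arguments are hypothetical, and the code assumes the input dataframes already hold the column automatminer expects (a composition or structure column) plus the target column for training. What `summarize()` and `inspect()` return is passed through as-is rather than assumed.

```python
def fit_and_inspect(train_df, test_df, target, n_jobs=2):
    """Fit an express MatPipe on train_df, predict on test_df, and
    collect the abbreviated and detailed pipeline reports.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so this sketch can be loaded without automatminer installed
    from automatminer import MatPipe

    pipe = MatPipe.from_preset("express", n_jobs=n_jobs)
    pipe.fit(train_df, target)

    # Predictions come back as a dataframe with a "<target> predicted" column
    predicted_df = pipe.predict(test_df)

    summary = pipe.summarize()  # abbreviated internal information
    details = pipe.inspect()    # detailed internal information
    return predicted_df[f"{target} predicted"], summary, details
```

Fitting with the express preset can take many hours, so a helper like this is best tried on a small dataset first.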


# Relevant citations
- [Dunn et al.](https://doi.org/10.1038/s41524-020-00406-3)
- [Olson et al.](http://doi.acm.org/10.1145/2908812.2908918)
- [Ward et al.](https://doi.org/10.1016/j.commatsci.2018.05.018)


# Any other relevant info

This notebook is best run on many-core machines with a high `n_jobs` setting for parallelization. Save intermediate results so that intermittent failures do not force the entire benchmark to be rerun. Request approximately 26 hours of walltime per task, since AutoML optimization alone can run for up to 24 hours per task.
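
One way to make a long run restartable is to skip folds whose fitted pipeline has already been saved. The sketch below assumes the `{dataset_name}-{fold}-pipeline.p` file naming used when pipelines are saved in this notebook; the helper names and the demo dataset name are hypothetical. It only tracks saved pipelines, so predictions already recorded in the benchmark object would still need to be persisted separately.

```python
import os
import tempfile


def checkpoint_path(directory, dataset_name, fold):
    """File name matching the pattern used when saving each fitted pipeline."""
    return os.path.join(directory, f"{dataset_name}-{fold}-pipeline.p")


def pending_folds(directory, dataset_name, folds):
    """Return the folds that have no saved pipeline file yet."""
    return [
        fold for fold in folds
        if not os.path.exists(checkpoint_path(directory, dataset_name, fold))
    ]


# Demonstration in a temporary directory with a hypothetical dataset name
with tempfile.TemporaryDirectory() as tmp:
    # Pretend folds 0 and 1 finished before a crash
    for done in (0, 1):
        open(checkpoint_path(tmp, "matbench_demo", done), "wb").close()
    remaining = pending_folds(tmp, "matbench_demo", range(5))

print(remaining)
```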


---

In [None]:
# Import our required classes
from matbench import MatbenchBenchmark
from automatminer import MatPipe

# Running the actual benchmark

Fit a pipeline for each fold of each task. 

If desired, a `cache_src` can be specified as a powerup argument to `from_preset` so that features for non-unique inputs are computed only once, drastically reducing the compute time for large datasets such as `matbench_mp_e_form`.
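
The idea behind such a cache can be illustrated with a small in-memory sketch. This is not how automatminer implements `cache_src`; the function names and the toy featurizer are hypothetical, and a plain dict stands in for whatever storage the real cache uses.

```python
def featurize_with_cache(inputs, featurizer, cache=None):
    """Featurize each input, computing features only once per unique input."""
    cache = {} if cache is None else cache
    rows = []
    for item in inputs:
        if item not in cache:
            cache[item] = featurizer(item)  # expensive call happens here only
        rows.append(cache[item])
    return rows, cache


calls = []


def toy_featurizer(composition):
    """Hypothetical stand-in for an expensive featurizer."""
    calls.append(composition)
    return {"n_chars": len(composition)}


rows, cache = featurize_with_cache(["Fe2O3", "SiO2", "Fe2O3"], toy_featurizer)
print(len(rows), len(calls))
```

The repeated `"Fe2O3"` entry reuses the stored result, so the featurizer runs twice for three inputs. The savings grow with the number of repeated inputs across folds and tasks.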



In [None]:
# Create a benchmark
mb = MatbenchBenchmark(autoload=False)

# Run the benchmark with automatminer on every task
for task in mb.tasks:
    task.load()
    
    for fold in task.folds:
        
        # On a lower-core machine, reduce n_jobs to something like 2-4
        pipe = MatPipe.from_preset("express", n_jobs=10)
        
        # Get the training data
        train_df = task.get_train_and_val_data(fold, as_type="df")

        # Fit the automatminer pipeline automatically
        target = task.metadata.target
        pipe.fit(train_df, target)
        
        # Get test data (a dataframe of pymatgen.Structure objects or string compositions, e.g., "Fe2O3")
        test_inputs = task.get_test_data(fold, include_target=False, as_type="df")
        
        # Make predictions on the test data; the "<target> predicted" column holds bool or float values, depending on the problem
        predicted_df = pipe.predict(test_inputs)
        predictions = predicted_df[f"{target} predicted"].tolist()
        params = {"best_pipeline": pipe.learner.best_pipeline, "features": pipe.learner.features}
        
        # Record predictions
        task.record(fold, predictions, params=params)
        
        # Save the entire pipeline to file, as a means of checkpointing
        pipe.save(f"{task.dataset_name}-{fold}-pipeline.p")
        

# Validate and record pipeline config

We should record the pipeline config, since future versions of automatminer may have different preset configurations.

In [None]:
from automatminer.presets import get_preset_config

# Make sure our benchmark is valid
valid = mb.is_valid
print(f"is valid: {valid}")

# Record the pipe config into benchmark metadata, which is the same for all tasks
pipe_config = {
    'autofeaturizer_kwargs': {'n_jobs': 10, 'preset': 'express'},
    'cleaner_kwargs': {'feature_na_method': 'drop',
                    'max_na_frac': 0.1,
                    'na_method_fit': 'mean',
                    'na_method_transform': 'mean'},
    'learner_kwargs': {'max_eval_time_mins': 20,
                    'max_time_mins': 1440,
                    'memory': 'auto',
                    'n_jobs': 10,
                    'population_size': 200},
    'learner_name': 'TPOTAdaptor',
    'reducer_kwargs': {'reducers': ['corr', 'tree'],
                    'tree_importance_percentile': 0.99}
}
mb.add_metadata(pipe_config)
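
As a reading aid for the `cleaner_kwargs` above, the following pure-Python sketch shows one plausible interpretation of `max_na_frac=0.1` with `feature_na_method='drop'` and mean imputation. It is not automatminer's code: the function name and feature names are hypothetical, and treating a feature at exactly the 10% boundary as acceptable is an assumption.

```python
from math import isnan


def clean_features(columns, max_na_frac=0.1):
    """Drop features with too many NaNs, then mean-impute the remaining NaNs."""
    cleaned = {}
    for name, values in columns.items():
        present = [v for v in values if not isnan(v)]
        na_frac = (len(values) - len(present)) / len(values)
        if na_frac > max_na_frac or not present:
            continue  # mirrors feature_na_method="drop"
        mean = sum(present) / len(present)
        # mirrors the "mean" imputation setting
        cleaned[name] = [mean if isnan(v) else v for v in values]
    return cleaned


nan = float("nan")
columns = {
    # 1 of 10 values missing (10%): kept and imputed
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, nan],
    # 2 of 10 values missing (20%): dropped
    "feature_b": [1.0, nan, nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
}
cleaned = clean_features(columns)
print(list(cleaned), cleaned["feature_a"][-1])
```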

# Save our benchmark to file



In [None]:
# Save the valid benchmark to file
mb.to_file("results.json.gz")
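
To peek at a saved results file without loading matbench, a generic helper like the one below may be enough. It assumes the file is gzip-compressed JSON, as the `.json.gz` extension suggests, and makes no assumption about the layout of the stored data. The demo payload and its key are hypothetical.

```python
import gzip
import json
import os
import tempfile


def load_json_gz(path):
    """Load a gzip-compressed JSON file into plain Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Round-trip a small hypothetical payload to show the helper working
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        json.dump({"example_key": [1, 2, 3]}, f)
    loaded = load_json_gz(demo_path)

print(loaded)
```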