![](logo.png)

# Welcome to the automatminer basic tutorial!
#### Versions used to make this notebook (`automatminer 2019.10.14` and `matminer 0.6.2`)

---

[Automatminer](https://github.com/hackingmaterials/automatminer) is a package for *automatically* creating ML pipelines using matminer's featurizers, feature reduction techniques, and Automated Machine Learning (AutoML). Automatminer works end to end, from raw data to prediction, without requiring *any* human input.

#### Put in a dataset, get out a machine that predicts materials properties.

Automatminer is competitive with state-of-the-art hand-tuned machine learning models across multiple domains of materials informatics. Automatminer also includes utilities for running MatBench, a materials science ML benchmark.

#### Learn more about Automatminer and MatBench from the [official documentation](http://hackingmaterials.lbl.gov/automatminer/). 



# How does automatminer work?
Automatminer automatically decorates a dataset using hundreds of descriptor techniques from matminer’s descriptor library, picks the most useful features for learning, and runs a separate AutoML pipeline. Once a pipeline has been fit, it can be summarized in a text file, saved to disk, or used to make predictions on new materials.

![](pipe.png)

Materials primitives (e.g., crystal structures) go in one end, and property predictions come out the other. MatPipe handles the intermediate operations such as assigning descriptors, cleaning problematic data, data conversions, imputation, and machine learning.

### MatPipe is the main Automatminer object
`MatPipe` is the central object in Automatminer. It follows the scikit-learn `BaseEstimator` convention for `fit` and `predict` operations: `fit` on your training data, then `predict` on your testing data.

### MatPipe uses [pandas](https://pandas.pydata.org) dataframes as inputs and outputs.
Put dataframes (of materials) in, get dataframes (of property predictions) out.
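
To make the "dataframes in, dataframes out" convention concrete, here is a minimal sketch of a toy estimator with the same `fit`/`predict` shape. The `MeanBaseline` class, the toy rows, and the `"<target> predicted"` column name are all assumptions made for illustration; this is not how `MatPipe` is implemented.

```python
import pandas as pd


class MeanBaseline:
    """Toy fit/predict estimator (hypothetical, NOT part of Automatminer)."""

    def fit(self, df: pd.DataFrame, target: str) -> "MeanBaseline":
        # "Training" here is just memorizing the mean of the target column.
        self.target = target
        self.mean_ = float(df[target].mean())
        return self

    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        # Return a copy of the input with one added prediction column,
        # leaving the caller's dataframe untouched.
        out = df.copy()
        out[f"{self.target} predicted"] = self.mean_
        return out


# Placeholder rows; compositions and gaps are illustrative only.
train = pd.DataFrame({"composition": ["A", "B", "C"], "gap expt": [0.0, 1.5, 3.0]})
new_materials = pd.DataFrame({"composition": ["D", "E"]})

model = MeanBaseline().fit(train, "gap expt")
predictions = model.predict(new_materials)
print(predictions)
```

A real pipeline does far more between `fit` and `predict`, but the calling pattern is the same: a training dataframe plus a target name goes into `fit`, and a dataframe without the target goes into `predict`.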


# What's in this notebook?

In this notebook, we walk through the basic steps of using Automatminer to train and predict on data. We'll also view the internals of our AutoML pipeline using Automatminer's API. 

* First, we'll load a dataset of ~4,600 band gaps collected from experimental sources.
* Next, we'll fit an Automatminer `MatPipe` (pipeline) to the data.
* Then, we'll predict experimental band gap from chemical composition, and see how our predictions do.
* Finally, we'll examine our pipeline with `MatPipe`'s introspection methods. 
* Bonus: We'll repeat the problem but for classifying metals and nonmetals.

*Note: for the sake of brevity, we will use a single train-test split in this notebook. To run a full Automatminer benchmark, see the documentation for `MatPipe.benchmark`.*

# Preparing a dataset

Let's load a dataset to play around with. For this example, we will use matminer to load one of the MatBench v0.1 datasets. If you have been through some of the machine learning or data retrieval tutorials in this repo, you will be familiar with the commands needed to fetch our dataset as a dataframe.


In [3]:
from matminer.datasets import load_dataset

df = load_dataset("matbench_expt_gap")

# Let's look at our dataset
df.describe()

| | gap expt |
|---|---|
| count | 4604.0 |
| mean | 0.975951 |
| std | 1.445034 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 1.8125 |
| max | 11.7 |


### Looking at the data

In [4]:
df.head()

| | composition | gap expt |
|---|---|---|
| 0 | Ag(AuS)2 | 0.0 |
| 1 | Ag(W3Br7)2 | 0.0 |
| 2 | Ag0.5Ge1Pb1.75S4 | 1.83 |
| 3 | Ag0.5Ge1Pb1.75Se4 | 1.51 |
| 4 | Ag2BBr | 0.0 |
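
The `describe()` output shows that both the 25th percentile and the median of `gap expt` are 0.0, so at least half of the entries have a zero band gap. A quick way to quantify this is to count the zero-gap rows. The sketch below does so on only the five rows printed by `df.head()`, so the fraction it reports describes that sample, not the full dataset; the variable names are illustrative.

```python
import pandas as pd

# The five rows shown by df.head() in this notebook.
sample = pd.DataFrame(
    {
        "composition": [
            "Ag(AuS)2",
            "Ag(W3Br7)2",
            "Ag0.5Ge1Pb1.75S4",
            "Ag0.5Ge1Pb1.75Se4",
            "Ag2BBr",
        ],
        "gap expt": [0.0, 0.0, 1.83, 1.51, 0.0],
    }
)

# Boolean mask that is True wherever the measured gap is exactly zero.
is_zero_gap = sample["gap expt"] == 0.0

n_zero = int(is_zero_gap.sum())
frac_zero = n_zero / len(sample)
print(f"{n_zero} of {len(sample)} sample rows have zero gap ({frac_zero:.0%})")
```

Running the same two lines on the full `df` would give the dataset-wide fraction, which is useful context for the metal/nonmetal classification variant of this problem.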


### Seeing how many unique compositions are present
We should find all the compositions are unique.

In [6]:
# How many unique compositions do we have?
df["composition"].unique().shape[0]

4604
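
If the unique count had come out smaller than the row count, the next step would be to find the repeated entries. The sketch below shows one way to do that with pandas on a small made-up frame that contains a deliberate duplicate. The compositions and gap values are placeholders, not real data.

```python
import pandas as pd

# Placeholder rows with one deliberately repeated composition.
toy = pd.DataFrame(
    {
        "composition": ["A2B", "AB3", "A2B"],
        "gap expt": [0.5, 0.0, 0.6],
    }
)

# Number of distinct compositions, and a yes/no uniqueness check.
n_unique = toy["composition"].nunique()
all_unique = toy["composition"].is_unique

# keep=False marks every member of a duplicated group, not just the repeats.
duplicates = toy[toy["composition"].duplicated(keep=False)]

print(n_unique, all_unique)
print(duplicates)
```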

### Generate a train-test split

In [8]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)

### Remove the target property from the test_df

Let's remove the testing dataframe's target property so we can be sure we are not giving Automatminer any test information.

Our target variable is `"gap expt"`.

In [10]:
target = "gap expt"
prediction_df = test_df.drop(columns=[target])
prediction_df.head()

| | composition |
|---|---|
| 4514 | ZnSb |
| 834 | Co1Te1.88 |
| 4481 | Zn2Ni9O13 |
| 3958 | TiAlAu2 |
| 3087 | Pr(MnSi)2 |
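
Note that `DataFrame.drop` returns a new dataframe by default and does not modify the original, so `test_df` still holds the true band gaps for scoring the predictions later. The sketch below demonstrates this on a placeholder frame (the rows are made up) and adds a simple guard against leaking the target into the prediction input.

```python
import pandas as pd

target = "gap expt"

# Placeholder test rows; compositions and gaps are illustrative only.
test_df = pd.DataFrame({"composition": ["X2Y", "XY3"], target: [0.0, 1.2]})

# drop() returns a new frame; test_df itself is left unchanged.
prediction_df = test_df.drop(columns=[target])

# Guard: the frame passed to predict() must not contain the target...
assert target not in prediction_df.columns
# ...while the original frame keeps it for later evaluation.
assert target in test_df.columns

print(list(prediction_df.columns), list(test_df.columns))
```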


In [11]:
prediction_df.describe()

| | composition |
|---|---|
| count | 921 |
| unique | 921 |
| top | Sc5Bi3 |
| freq | 1 |
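
The count of 921 follows from the split settings: with 4,604 rows and `test_size=0.2`, the test set would be 920.8 rows, and scikit-learn rounds a fractional test size up. The arithmetic can be sketched as follows; it assumes the usual `train_test_split` behavior when only `test_size` is given, with the remaining rows forming the training set.

```python
import math

n_samples = 4604   # rows in the full dataset
test_size = 0.2    # fraction passed to train_test_split

# A fractional test size is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)

# With no explicit train_size, the training set is whatever remains.
n_train = n_samples - n_test

print(n_test, n_train)
```

This gives 921 test rows and 3,683 training rows.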


# Fitting an Automatminer MatPipe

Our dataset contains 4,604 unique stoichiometries and experimentally measured band gaps. We have everything we need to start our AutoML pipeline.

For simplicity, we will use a `MatPipe` preset. `MatPipe` is highly customizable and has hundreds of configuration options, but most use cases will be satisfied by using one of the preset configurations. We use the `from_preset` method.

In this example, we'll use the `"debug"` preset to get quick results. If you want better accuracy, try the `"express"` preset by uncommenting the second `pipe = ...` line.


In [13]:
from automatminer import MatPipe

# If you have a few minutes
pipe = MatPipe.from_preset("debug")

# If you have a couple of hours
# pipe = MatPipe.from_preset("express")

### Fitting the pipeline

To fit an Automatminer `MatPipe` to the data, pass in your training data and desired target.

In [None]:
pipe.fit(train_df, target)

2019-10-14 20:42:33 INFO     Problem type is: regression
2019-10-14 20:42:33 INFO     Fitting MatPipe pipeline to data.
2019-10-14 20:42:33 INFO     AutoFeaturizer: Starting fitting.
2019-10-14 20:42:33 INFO     AutoFeaturizer: Compositions detected as strings. Attempting conversion to Composition objects...




2019-10-14 20:42:33 INFO     AutoFeaturizer: Guessing oxidation states of compositions, as they were not present in input.




