[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/leschultz/materials_application_domain_machine_learning/blob/main/examples/jupyter/tutorial_1.ipynb)


## Setup

Install the dependencies.

In [None]:
!pip install madml

Import the packages used in this run.

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from madml.models import dissimilarity, calibration, domain, combine
from madml.splitters import BootstrappedLeaveClusterOut
from madml.assess import nested_cv
from madml import datasets

## Load data
Several datasets are available. The following command lists the names that can be used to load them.

In [None]:
datasets.list_data()

Any of the supported datasets can be loaded in the same way. You can also load your own data instead of a supported dataset if needed.

In [None]:
data = datasets.load('strength')
X = data['data']
y = data['target']
g = data['class_name']
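
For your own data, a minimal sketch is shown below. It assumes (the tutorial does not confirm this) that the model only needs a feature table `X`, a target vector `y`, and one group label per row `g`, mirroring the three objects loaded above. The CSV text and its column names are hypothetical.

```python
import csv
import io

# Hypothetical CSV content standing in for a user's own file.
raw = """group,feature_1,feature_2,target
steel,1.0,10.0,100.0
steel,2.0,20.0,200.0
alloy,3.0,30.0,300.0
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Split the table into features, target, and per-row group labels.
feature_names = ['feature_1', 'feature_2']
X_own = [[float(r[name]) for name in feature_names] for r in rows]
y_own = [float(r['target']) for r in rows]
g_own = [r['group'] for r in rows]

print(X_own)
print(y_own)
print(g_own)
```

Converting these lists to NumPy arrays or a pandas data frame before fitting is likely needed; check the madml documentation for the accepted types.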

## Build model
We define three model types: uncertainty quantification, distance, and regression. If we want uncertainty quantification, the regression model must be an ensemble model (e.g., random forest or bagged LASSO).

We start with a distance model.

In [None]:
ds_model = dissimilarity(dis='kde')
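
To build intuition for what a KDE-based distance measures, here is a one-dimensional sketch in plain Python. It is not madml's implementation: the Gaussian kernel, the fixed bandwidth, and the normalization by the largest training density are assumptions chosen for illustration. The idea is that points in sparsely sampled regions of feature space get a dissimilarity near 1.

```python
import math


def kde_density(x, train, bandwidth):
    """Gaussian kernel density estimate at x from 1D training points."""
    norm = len(train) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in train)
    return total / norm


def kde_dissimilarity(x, train, bandwidth=0.5):
    """Return a score in [0, 1]; larger means farther from the training data."""
    max_density = max(kde_density(t, train, bandwidth) for t in train)
    score = 1.0 - kde_density(x, train, bandwidth) / max_density
    return min(max(score, 0.0), 1.0)


train = [0.0, 0.1, 0.2, 0.3, 0.4]
near = kde_dissimilarity(0.2, train)   # inside the training cloud
far = kde_dissimilarity(10.0, train)   # far from all training points
print(near, far)
```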

Now we add a polynomial uncertainty quantification model. The number of values passed to the `params` argument defines the degree of the polynomial, and the values themselves are the initial guesses for the optimizer.

In [None]:
uq_model = calibration(params=[0.0, 1.0])
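
As an illustration of how the length of `params` sets the polynomial degree, the sketch below evaluates a polynomial of the uncalibrated uncertainty. The coefficient order (constant term first) and the absolute value are assumptions made for this example, not confirmed details of `calibration`. Under that assumption, `[0.0, 1.0]` is a first-degree polynomial that starts as the identity map.

```python
def calibrated_uncertainty(params, sigma):
    """Evaluate sum(params[i] * sigma**i), kept non-negative."""
    total = sum(c * sigma ** i for i, c in enumerate(params))
    return abs(total)


params = [0.0, 1.0]            # two values -> degree 1
degree = len(params) - 1
print(degree)
print(calibrated_uncertainty(params, 0.25))

# Three values would give a quadratic, e.g. 0.1 + 0.9*s + 0.5*s**2.
print(calibrated_uncertainty([0.1, 0.9, 0.5], 2.0))
```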

Now we define the regression model. The regression model must be a grid search object wrapping a pipeline. The example here does not iterate over folds for hyperparameter optimization, but it can be modified to do so.

In [None]:
# ML
scale = StandardScaler()
model = RandomForestRegressor()

# The grid for grid search
grid = {}
grid['model__n_estimators'] = [100]

# The machine learning pipeline
pipe = Pipeline(steps=[
                        ('scaler', scale),
                        ('model', model),
                        ])

# The gridsearch model
gs_model = GridSearchCV(
                        pipe,
                        grid,
                        cv=((slice(None), slice(None)),),  # No splits
                        )
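
The `cv` argument above is an iterable holding a single `(train, test)` pair. Because `slice(None)` selects every element, both halves of the pair cover the full dataset, so the grid search fits and scores on the same rows. The sketch below checks this with a plain list standing in for the data.

```python
# One (train, test) pair where both entries select everything.
cv = ((slice(None), slice(None)),)

rows = ['a', 'b', 'c', 'd']  # stand-in for the rows of X

for train, test in cv:
    train_rows = rows[train]
    test_rows = rows[test]
    print(train_rows, test_rows)

n_splits = len(cv)
```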

## Building the splits
Here comes the fun part. The performance of a model on a test set depends on many things. We want to guard against using predictions on data that are sampled dissimilarly to the data used for training. First, we build splits where test data are sampled similarly to training data. We give these splits the special name of "fit" so that we only use this kind of splitter for our uncertainty quantification model.

In [None]:
n_repeats = 2  # The number of times to repeat splits
splits = [('fit', RepeatedKFold(n_repeats=n_repeats))]

Now we need to tell the model which data are dissimilar. We pre-cluster the data and split them accordingly. Here, we use agglomerative clustering with 2 and 3 clusters.

In [None]:
for i in [2, 3]:

    # Cluster Splits
    top_split = BootstrappedLeaveClusterOut(
                                            AgglomerativeClustering,
                                            n_repeats=n_repeats,
                                            n_clusters=i
                                            )

    splits.append(('agglo_{}'.format(i), top_split))
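
To make the idea of cluster-based splits concrete, the sketch below holds out one cluster at a time given precomputed cluster labels. It is a simplified stand-in, not the `BootstrappedLeaveClusterOut` implementation: it skips the clustering step and the bootstrapping, and the helper name `leave_cluster_out` is hypothetical.

```python
def leave_cluster_out(labels):
    """Yield (train_idx, test_idx) with one whole cluster held out."""
    for held_out in sorted(set(labels)):
        test_idx = [i for i, c in enumerate(labels) if c == held_out]
        train_idx = [i for i, c in enumerate(labels) if c != held_out]
        yield train_idx, test_idx


# Hypothetical cluster labels for six data points in three clusters.
labels = [0, 0, 1, 1, 2, 2]
cluster_splits = list(leave_cluster_out(labels))

for train_idx, test_idx in cluster_splits:
    print('train:', train_idx, 'test:', test_idx)
```

Because every test set is an entire cluster, the model is scored on data that differ from anything it saw during training, which is the situation the domain model needs to learn to flag.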

## Fitting and Assessing the Model

We can fit a single model without assessment, which is faster because it skips nested cross-validation. However, overfitting may occur.

In [None]:
model = combine(gs_model, ds_model, uq_model, splits)
model.fit(X, y, g)

We can assess the model through nested cross-validation and then fit a final model on all data. The assessment of the model is saved in a directory of the user's choice. The model created here should be the one used for deployment.

In [None]:
cv = nested_cv(model, X, y, splitters=splits)
df, df_bin, fit_model = cv.test()

## Example of Model Use
Our assessment returns a model. We can also use dill to load the saved model. Here, we predict on the features used to build the model.

In [None]:
df = fit_model.predict(X)
print(df)
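
Loading a saved model with dill follows the same pattern as `pickle`. The sketch below does a round trip with a plain dictionary standing in for the fitted model; the file name `model.dill` is hypothetical, so substitute the path written by your own assessment run. The `pickle` fallback is only there so the example runs where dill is not installed.

```python
import os
import tempfile

try:
    import dill as serializer
except ImportError:  # fallback with the same dump/load interface
    import pickle as serializer

stand_in = {'name': 'fitted model placeholder'}

with tempfile.TemporaryDirectory() as save_dir:
    path = os.path.join(save_dir, 'model.dill')  # hypothetical file name

    # Write the object to disk.
    with open(path, 'wb') as handle:
        serializer.dump(stand_in, handle)

    # Load the saved object back for later predictions.
    with open(path, 'rb') as handle:
        loaded = serializer.load(handle)

print(loaded)
```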

If the predefined domain thresholds are insufficient, we can supply a manual threshold.

In [None]:
df = fit_model.predict(X, 0.5)
print(df)
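
Conceptually, a manual threshold turns a continuous dissimilarity score into an in-domain or out-of-domain label. The sketch below assumes that scores run from 0 to 1 and that values below the threshold count as in-domain; the function name, the labels, and the direction of the comparison are illustrative assumptions, so inspect the columns of the returned `df` to see what `predict` actually reports.

```python
def label_domain(scores, threshold):
    """Label each score 'ID' (in-domain) or 'OD' (out-of-domain)."""
    return ['ID' if s < threshold else 'OD' for s in scores]


scores = [0.05, 0.40, 0.50, 0.95]  # hypothetical dissimilarity scores
labels = label_domain(scores, 0.5)
print(labels)
```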