# Hands on Machine Learning (ML) and Sequential Learning (SL)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/paolodeangelis/AEM/blob/main/2-Hands_on_ML_SL.ipynb)

## Preamble

In this section we install and import the modules needed for the first part of the exercise (importing and handling data, training and validating the ML model, interpreting the ML model); we also define helper functions for materials featurization.

Installing the necessary libraries with `pip` (each version is pinned with `==`)

In [None]:
%pip install matminer==0.8.0
%pip install pymatgen==2020.1.28
%pip install scikit_learn==0.22.2
%pip install shap==0.38.1

Importing the necessary modules

In [2]:
import numpy as np
import pandas as pd
import shap
import sklearn
from matminer.featurizers import composition as cf
from matminer.featurizers.base import MultipleFeaturizer
from pymatgen.core import Composition
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

Defining functions for efficiently extracting materials features

In [3]:
def get_compostion(c):
    """Build a pymatgen Composition from a chemical formula; return None if parsing fails."""
    try:
        return Composition(c)
    except Exception:  # any parsing failure is mapped to None
        return None


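def featurizing(data, property_interest=None):
    """Compute composition-based features for each formula in data["Components"].

    If property_interest is given, that column of data is appended as the last column.
    """
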
    # Featurizer
    f = MultipleFeaturizer(
        [
            cf.Stoichiometry(),
            cf.ElementProperty.from_preset("magpie"),
            cf.ValenceOrbital(props=["avg"]),
            cf.IonProperty(fast=True),
        ]
    )

    # Inputs
    data["composition"] = [get_compostion(mat) for mat in data.Components]

    featurized_data = pd.DataFrame(
        f.featurize_many(data["composition"], ignore_errors=True),
        columns=f.feature_labels(),
        index=data["Components"],
    )
    if property_interest:
        featurized_data[property_interest] = data[property_interest].values
    return featurized_data

## Data handling

In this section we import the Superconductors database (two columns: chemical formula and critical temperature) and extract composition-based features; we then drop every row that contains at least one NaN (Not a Number) value. The features are the *translation* of chemical formulae into numbers that describe the characteristics of each material; this step is necessary because ML models work with numbers. It is the equivalent of the length and width of sepals and petals in the iris database (already seen in the first introductory lecture on Python), with the only difference that here we perform a regression of a scalar quantity (the critical temperature) rather than a classification of the type of flower.

If you want to know the meaning of those 145 features, see https://www.nature.com/articles/npjcompumats201628?report=reader (and the corresponding supplementary material, PDF).

In brief, they are:
* **Stoichiometric attributes** that depend only on the fractions of elements present and not what those elements actually are
* **Elemental property statistics**, which are defined as the mean, mean absolute deviation, range, minimum, maximum and mode of 22 different elemental properties.
* **Electronic structure attributes**, which are the average fraction of electrons from the *s*, *p*, *d* and *f* valence shells across all elements present.
* **Ionic compound attributes** that include whether it is possible to form an ionic compound assuming all elements are present in a single oxidation state.
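
To make the first family of attributes concrete, here is a minimal, dependency-free sketch of stoichiometric attributes computed as $L^p$ norms of the element fractions. It is a hand-written illustration, not matminer's implementation; the helper names `element_fractions` and `stoichiometric_p_norm`, the choice of $p$ values, and the example compound are assumptions made for this example.

```python
# Illustrative only: hand-written stoichiometric attributes, not matminer code.

def element_fractions(amounts):
    """Convert element amounts (as read from a formula) to atomic fractions."""
    total = sum(amounts.values())
    return {element: n / total for element, n in amounts.items()}


def stoichiometric_p_norm(fractions, p):
    """L^p norm of the fraction vector; p = 0 is taken as the element count."""
    values = list(fractions.values())
    if p == 0:
        return float(len(values))
    return sum(x ** p for x in values) ** (1.0 / p)


# YBa2Cu3O7 written as element amounts (example compound chosen for illustration).
ybco = element_fractions({"Y": 1, "Ba": 2, "Cu": 3, "O": 7})
for p in (0, 2, 3):
    print(f"p = {p}: {stoichiometric_p_norm(ybco, p):.4f}")
```

Relabelling the elements (for example swapping `"Y"` for any other symbol) leaves these numbers unchanged, which is what "depend only on the fractions" means in the list above.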


Downloading the database

In [None]:
# source National Institute of Materials Science, Materials Information Station,
# SuperCon, http://supercon.nims.go.jp/index_en.html (2011)

!wget https://raw.githubusercontent.com/paolodeangelis/AEM/main/data/Supercon_data_clean.xlsx

In [5]:
data = pd.read_excel(r"Supercon_data_clean.xlsx")  # Import data

In [None]:
Featurized_data = featurizing(data, "Tc")  # Extract composition based features

In [7]:
Featurized_data = Featurized_data.dropna()

In [None]:
Featurized_data

### **Data for the report**
Each group will perform the exercise with a different dataset. Your dataset is obtained by picking 2000 rows from the original one (containing 12914 rows). The rows are selected at random; to make this random selection *reproducible*, we set a random state by means of a SEED. The value of the SEED must be equal to the number of your group.

**In the cell below, set the SEED equal to your group number.**

The 2000 random rows are picked by setting the variable N_data equal to 2000.

**In the cell below, set N_data equal to 2000.**

In [None]:
SEED = 
N_data = 

In [8]:
Featurized_data = Featurized_data.sample(n=N_data, random_state=SEED)
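
The role of the seed can be illustrated with a small stand-alone sketch. It uses Python's built-in `random` module rather than the pandas sampler used in the notebook, so the selected rows will not match those of `Featurized_data.sample`; `SEED_EXAMPLE` and `N_DATA_EXAMPLE` are hypothetical values chosen only for this example.

```python
import random

rows = list(range(12914))   # stand-in for the row labels of the full table
SEED_EXAMPLE = 7            # hypothetical group number
N_DATA_EXAMPLE = 2000

# Two samplers built from the same seed, and one built from a different seed.
first = random.Random(SEED_EXAMPLE).sample(rows, N_DATA_EXAMPLE)
second = random.Random(SEED_EXAMPLE).sample(rows, N_DATA_EXAMPLE)
other = random.Random(SEED_EXAMPLE + 1).sample(rows, N_DATA_EXAMPLE)

print("same seed, same rows:", first == second)
print("different seed, same rows:", first == other)
```

Fixing the seed is what lets each group, and whoever checks the report, regenerate exactly the same 2000-row subset.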

In [None]:
Featurized_data

## Machine Learning
In this section we train and validate a Random Forest Regressor for constructing a model which predicts the critical temperature of materials, on the basis of their chemical composition.

#### **Training and testing sets**
We split the 2000-row dataset (obtained in the previous section) into a training set (80% of 2000 rows = 1600 rows) and a testing set (20% of 2000 rows = 400 rows). The split is random. To make it *reproducible*, we set the usual random state, whose value must be equal to the number of your group.

In [10]:
train_df, test_df = train_test_split(
    Featurized_data, test_size=0.2, random_state=SEED
)  # split data in training set (80% of the dataset) and testing set (20% of the dataset)
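
As a rough picture of what a seeded 80/20 split does, the sketch below shuffles row labels with a fixed seed and cuts off the first 20% as the testing set. It is a simplified stand-in written with the standard library, not scikit-learn's implementation; `split_rows` and the seed value are hypothetical.

```python
import random


def split_rows(rows, test_size, seed):
    """Shuffle a copy of rows with a fixed seed, then cut off the test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = split_rows(range(2000), test_size=0.2, seed=7)
print(len(train_rows), len(test_rows))
```

No row ends up in both sets, so the testing set measures performance on materials the model has never seen.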

#### **Machine Learning model definition**
As the Machine Learning model for regression, we choose a Random Forest Regressor.

* Since it is affected by randomness, we set the usual random state (it has to be equal to the number of your group).

* The only hyperparameter we consider is the number of estimators (the number of trees in the forest). It has to be equal to 100.

**To achieve this, in the cell below set the variable N_trees equal to 100**

In [None]:
N_trees = 

In [11]:
rf = RandomForestRegressor(random_state=SEED, n_estimators=N_trees)

#### **Model training**
We now train the model `rf` (defined in the previous cell) with the command ```rf.fit(X_training_set, y_training_set)```. Here the features are all the columns except the last one, and the target is the last column (`Tc`).



In [None]:
rf.fit(train_df.iloc[:, :-1], train_df.iloc[:, -1])

#### **Model performances**
We compare the real critical temperatures of the materials in the testing set with the corresponding values predicted by the trained model.
We assess the performance by means of three measures:
* coefficient of determination $R^2$, defined as
\begin{equation}
R^2(\mathbf{y}, \mathbf{\hat{y}}) = 1-\frac{\sum_{i=1}^k(y_i-\hat{y}_i)^2}{\sum_{i=1}^k(y_i-\overline{y})^2}
\end{equation}
where $\mathbf{y}$ is the $k$-dimensional vector of real values, $\mathbf{\hat{y}}$ is the $k$-dimensional vector of predicted values, and $\overline{y}=k^{-1}\sum_{i=1}^k y_i$ is the average over the real values. $R^2 = 1$ if the prediction is perfect (i.e., all the red dots in the figure below lie on the blue line, with $y_i=\hat{y}_i$ for all $i\in\{1,\dots,k\}$).

* Mean Absolute Error $\mathrm{MAE}$, defined as
\begin{equation}
\mathrm{MAE}(\mathbf{y}, \mathbf{\hat{y}}) = \frac{1}{k}\sum_{i=1}^k|y_i - \hat{y}_i|
\end{equation}
with the same notation as above.

* Root Mean Squared Error $\mathrm{RMSE}$, defined as
\begin{equation}
\mathrm{RMSE}(\mathbf{y}, \mathbf{\hat{y}}) = \left(\frac{1}{k}\sum_{i=1}^k(y_i - \hat{y}_i)^2 \right)^{1/2}
\end{equation}
with the same notation as above.

In our case, $k$ is the length (i.e., the number of materials) of the testing set.
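
The three definitions above can be checked by hand on a tiny example. The sketch below implements them directly in plain Python; the function names and the four made-up data points are assumptions for illustration, and the notebook itself relies on the scikit-learn functions instead.

```python
import math


def r2_manual(y_true, y_pred):
    """Coefficient of determination, following the formula above."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae_manual(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse_manual(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [1.0, 2.0, 3.0, 4.0]   # made-up "real" values
y_pred = [1.5, 2.0, 2.5, 4.0]   # made-up predictions

print(f"R2   = {r2_manual(y_true, y_pred):.3f}")
print(f"MAE  = {mae_manual(y_true, y_pred):.3f}")
print(f"RMSE = {rmse_manual(y_true, y_pred):.3f}")
```

For these numbers the residual sum of squares is 0.5 and the total sum of squares is 5, so $R^2 = 0.9$, $\mathrm{MAE} = 0.25$ and $\mathrm{RMSE} = \sqrt{0.125} \approx 0.354$. RMSE is never smaller than MAE because squaring weights large errors more heavily.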

In [None]:
test_predictions = rf.predict(
    test_df.iloc[:, :-1]
)  # Predicted y over samples of the testing set
test_labels = test_df.iloc[:, -1].values  # True y over samples of the testing set

r2 = sklearn.metrics.r2_score(
    test_labels, test_predictions
)  # coefficient of determination
mae = mean_absolute_error(test_labels, test_predictions)  # mean absolute error
rmse = np.sqrt(
    mean_squared_error(test_labels, test_predictions)
)  # root mean squared error
delta = max(test_labels) - min(test_labels)

import matplotlib.pyplot as plt

plt.figure(figsize=(3, 3), dpi=190)
plt.scatter(test_labels, test_predictions, c="crimson", alpha=0.2)
p1 = max(max(test_predictions), max(test_labels))
p2 = min(min(test_predictions), min(test_labels))
plt.plot([p1, p2], [p1, p2], "b-")
plt.annotate(
    "$R^2$ = %0.3f" % r2,
    xy=(0.02 * delta, 0.95 * delta),
    xytext=(0.02 * delta, 0.95 * delta),
)
plt.annotate(
    "MAE = %0.2f K" % mae,
    xy=(0.02 * delta, 0.85 * delta),
    xytext=(0.02 * delta, 0.85 * delta),
)
plt.annotate(
    "RMSE = %0.2f K" % rmse,
    xy=(0.02 * delta, 0.75 * delta),
    xytext=(0.02 * delta, 0.75 * delta),
)
plt.xlabel(r"True $T_\mathrm{c}$ (K)")
plt.ylabel(r"Predicted $T_\mathrm{c}$ (K)")
plt.show()

**WARNING**: The figure above may show either very poor or very good performance. DON'T WORRY: this is normal and depends on how lucky you are in picking the random rows from the original database. You will not get a worse grade for poor model performance.

## Interpretability
Thanks to the TreeSHAP algorithm, we can find the most relevant features, ranking them in terms of importance with respect to the output.

In [14]:
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(test_df.iloc[:, :-1])
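
To give an idea of what a SHAP value represents, the sketch below computes exact Shapley values for a toy three-feature function by averaging marginal contributions over all feature orderings. This is not the TreeSHAP algorithm used by `shap.TreeExplainer`: it is brute force, and it assumes that an "absent" feature is simply replaced by a fixed baseline value. The names `shapley_values` and `toy_model` are hypothetical.

```python
from itertools import permutations


def shapley_values(f, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)          # start with every feature "absent"
        previous_output = f(current)
        for j in order:
            current[j] = x[j]             # switch feature j on
            output = f(current)
            phi[j] += output - previous_output
            previous_output = output
    return [value / len(orderings) for value in phi]


def toy_model(v):
    """Toy function with an interaction term between features 0 and 1."""
    return v[0] * v[1] + 2.0 * v[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The interaction term is shared equally between features 0 and 1, and the contributions add up to the difference between the prediction and the baseline output. This additivity is what makes averaging the absolute values per feature a meaningful importance measure.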

#### **Cumulative of normalized feature importances**
We produce the cumulative curve of normalized importances. The number of features accounting for 75% of the total importance is shown.

In [None]:
N = np.shape(test_df.iloc[:, :-1])[1]
k = 0.75
import matplotlib.pyplot as plt

# Mean |SHAP| per feature, sorted from largest to smallest, then accumulated
cumsum = np.cumsum(np.sort(np.mean(abs(shap_values), axis=0))[::-1])
normalized_cumulative = cumsum / np.max(cumsum)


fig, ax = plt.subplots(figsize=(3, 3), dpi=190)
ax.plot(np.arange(N), normalized_cumulative)
ax.plot(np.arange(N), k * np.ones(N))
ind_cross1 = np.argmin(
    np.fabs(normalized_cumulative - k * max(normalized_cumulative) * np.ones(N))
)
# plt.yticks(np.array([0, 0.5, 1]))

ax.annotate(
    "%i features" % (ind_cross1 + 1),
    xy=(ind_cross1 + 1, 0.01),
    xytext=(ind_cross1 + 10, 0.2),
    arrowprops=dict(facecolor="black", shrink=0.000005, width=0.1, headwidth=4),
)
ax.annotate(
    "75% of\nthe maximum",
    xy=(70, 0.73),
    xytext=(80, 0.45),
    arrowprops=dict(facecolor="black", shrink=0.0005, width=0.1, headwidth=4),
)
plt.scatter(ind_cross1, normalized_cumulative[ind_cross1], color="orange")
plt.plot(
    (ind_cross1, ind_cross1),
    (normalized_cumulative[ind_cross1], 0),
    color="orange",
    ls=":",
)
plt.ylim(0, 1.04)
plt.xlabel("Features")
plt.ylabel("Normalized\ncumulative importance")
plt.show()

The blue curve above is the cumulative curve of the importance coefficients: the coefficients are sorted from the largest to the smallest and then summed progressively. For instance, the 15th point of this curve is the sum of the 15 largest coefficients. The overall sum is 1.
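
The construction of the curve and of the 75% mark can be followed on a handful of made-up importances. The sketch below is plain Python and only illustrative; it also contrasts the "closest point to 0.75" rule, which is what the `np.argmin` call in the cell above implements, with a "first point at or above 0.75" rule, which is an alternative introduced here for comparison.

```python
from itertools import accumulate

mean_abs_shap = [0.10, 0.50, 0.05, 0.20, 0.15]   # made-up importances

ordered = sorted(mean_abs_shap, reverse=True)     # largest first
cumulative = list(accumulate(ordered))            # running sum
normalized = [c / cumulative[-1] for c in cumulative]

k = 0.75
# Rule used in the notebook cell: index whose cumulative value is closest to k.
closest = min(range(len(normalized)), key=lambda i: abs(normalized[i] - k))
# Alternative rule: first index whose cumulative value reaches k.
first_reaching = next(i for i, c in enumerate(normalized) if c >= k)

print("closest-point rule:", closest + 1, "features")
print("first-crossing rule:", first_reaching + 1, "features")
```

With these numbers the closest-point rule returns 2 features (cumulative importance 0.70), while the first-crossing rule returns 3 (cumulative importance 0.85), so the reported number of features can correspond to slightly less than 75% of the total.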

#### **Ranking of normalized feature importances**
We produce the dataset "Output_mean_shap" containing the list of features sorted by their importance. The importances sum to 1.

In [16]:
Output_shap = pd.DataFrame(
    shap_values, index=test_df.iloc[:, :-1].index, columns=test_df.iloc[:, :-1].columns
)
Output_mean_shap = pd.DataFrame(
    abs(Output_shap).describe().loc["mean"]
    / sum(abs(Output_shap).describe().loc["mean"])
).sort_values("mean", ascending=False)

In [17]:
Output_mean_shap.to_excel(
    "Output_mean_shap.xlsx"
)  # list of the features ranked in terms of importance (importances sum up to 1)

## Sequential Learning
We compare three Sequential Learning acquisition functions to choose the next material to be tested, starting from an initial pool of known materials. 

As the regression method over the training set, we use the Random Forest Regressor from lolopy (which has nothing to do with the random forest regressor used for the predictive model in the previous sections). Note that the import below rebinds the name `RandomForestRegressor` to the lolopy class.

In [None]:
%pip install lolopy

In [19]:
import tqdm
from lolopy.learners import RandomForestRegressor

In [20]:
def MEI(X: np.ndarray, y: np.ndarray, n_steps: int) -> np.ndarray:
    """Acquisition function MEI (Maximum Expected Improvement).

    Args:
        X (numpy.ndarray): matrix with n rows (number of total materials for
            which doing the SL, in our case 100) and d columns (number of features
            taken into account for the optimization)
        y (numpy.ndarray): vector with n rows (target property)
        n_steps (int): number of steps allowed for doing SL (in our case,
            100 total materials - 50 materials in the initial training set
            = maximum 50 steps to find the optimum)

    Returns:
        numpy.ndarray: indices of the materials chosen at each step; entries
            after the step at which the search stops are left at 0
    """

    arr = y
    minima = arr.argsort()[0:50]
    in_train = np.zeros(len(X), dtype=bool)  # np.bool was removed in recent NumPy
    in_train[minima] = True

    all_inds = set(range(len(y)))
    F = np.zeros(n_steps)
    G = np.zeros(n_steps)
    mei_train = [list(set(np.where(in_train)[0].tolist()))]
    mei_train_inds = []

    T = 10

    for i in tqdm.tqdm(range(n_steps)):
        mei_train_inds = mei_train[-1].copy()
        mei_search_inds = list(all_inds.difference(mei_train_inds))

        mei_selection_index = []
        for j in range(T):
            model.fit(X[mei_train_inds], y[mei_train_inds])
            mei_y_pred_prov = model.predict(X[mei_search_inds])
            mei_selection_index.append(np.argmax(mei_y_pred_prov))

        mei_index_G = max(set(mei_selection_index), key=mei_selection_index.count)
        mei_index = mei_search_inds[mei_index_G]  # Pick the most preferred entry
        mei_train_inds.append(mei_search_inds[mei_index_G])
        mei_train.append(mei_train_inds)
        G[i] = mei_index
        F[i] = mei_train_inds[-1]
        if mei_train_inds[-1] == np.argmax(y):
            break

    return F


def MLI(X: np.ndarray, y: np.ndarray, n_steps: int) -> np.ndarray:
    """Acquisition function MLI (Maximum Likelihood Improvement).

    Args:
        X (numpy.ndarray): matrix with n rows (number of total materials for
            which doing the SL, in our case 100) and d columns (number of features
            taken into account for the optimization)
        y (numpy.ndarray): vector with n rows (target property)
        n_steps (int): number of steps allowed for doing SL (in our case,
            100 total materials - 50 materials in the initial training set
            = maximum 50 steps to find the optimum)

    Returns:
        numpy.ndarray: indices of the materials chosen at each step; entries
            after the step at which the search stops are left at 0
    """
    arr = y
    minima = arr.argsort()[0:50]
    in_train = np.zeros(len(X), dtype=bool)  # np.bool was removed in recent NumPy
    in_train[minima] = True

    all_inds = set(range(len(y)))
    K = np.zeros(n_steps)
    L = np.zeros(n_steps)
    mli_train = [list(set(np.where(in_train)[0].tolist()))]
    mli_train_inds = []

    T = 10

    for i in tqdm.tqdm(range(n_steps)):
        mli_train_inds = mli_train[-1].copy()
        mli_search_inds = list(all_inds.difference(mli_train_inds))

        mli_selection_index = []
        for j in range(T):
            model.fit(X[mli_train_inds], y[mli_train_inds])
            mli_y_pred_prov, mli_y_std_prov = model.predict(
                X[mli_search_inds], return_std=True
            )
            mli_selection_index.append(
                np.argmax(
                    np.divide(
                        mli_y_pred_prov - np.max(y[mli_train_inds]), mli_y_std_prov
                    )
                )
            )

        mli_index_L = max(set(mli_selection_index), key=mli_selection_index.count)
        mli_index = mli_search_inds[mli_index_L]  # Pick the most preferred entry

        mli_train_inds.append(mli_search_inds[mli_index_L])
        mli_train.append(mli_train_inds)
        L[i] = mli_index
        K[i] = mli_train_inds[-1]
        if mli_train_inds[-1] == np.argmax(y):
            break

    return K


def MU(X: np.ndarray, y: np.ndarray, n_steps: int) -> np.ndarray:
    """Acquisition function MU (Maximum Uncertainty).

    Args:
        X (numpy.ndarray): matrix with n rows (number of total materials for
            which doing the SL, in our case 100) and d columns (number of features
            taken into account for the optimization)
        y (numpy.ndarray): vector with n rows (target property)
        n_steps (int): number of steps allowed for doing SL (in our case,
            100 total materials - 50 materials in the initial training set
            = maximum 50 steps to find the optimum)

    Returns:
        numpy.ndarray: indices of the materials chosen at each step; entries
            after the step at which the search stops are left at 0
    """

    arr = y
    minima = arr.argsort()[0:50]
    in_train = np.zeros(len(X), dtype=bool)  # np.bool was removed in recent NumPy
    in_train[minima] = True

    all_inds = set(range(len(y)))
    R = np.zeros(n_steps)
    S = np.zeros(n_steps)
    mu_train = [list(set(np.where(in_train)[0].tolist()))]
    mu_train_inds = []

    T = 10

    for i in tqdm.tqdm(range(n_steps)):
        mu_train_inds = mu_train[-1].copy()
        mu_search_inds = list(all_inds.difference(mu_train_inds))

        mu_selection_index = []
        for j in range(T):
            model.fit(X[mu_train_inds], y[mu_train_inds])
            mu_y_pred_prov, mu_y_std_prov = model.predict(
                X[mu_search_inds], return_std=True
            )
            mu_selection_index.append(np.argmax(mu_y_std_prov))

        mu_index_R = max(set(mu_selection_index), key=mu_selection_index.count)
        mu_index = mu_search_inds[mu_index_R]  # Pick the most preferred entry

        mu_train_inds.append(mu_search_inds[mu_index_R])
        mu_train.append(mu_train_inds)
        R[i] = mu_index
        S[i] = mu_train_inds[-1]
        if mu_train_inds[-1] == np.argmax(y):
            break

    return S
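
Each of the three functions above refits the model `T = 10` times per step and keeps the candidate that was selected most often, via `max(set(votes), key=votes.count)`. The sketch below shows the same majority vote written with `collections.Counter`; the helper name `most_frequent` and the list of votes are made up for illustration.

```python
from collections import Counter


def most_frequent(votes):
    """Return the value that occurs most often in votes."""
    return Counter(votes).most_common(1)[0][0]


votes = [3, 7, 3, 2, 7, 3, 3, 7, 2, 3]   # made-up picks from T = 10 refits
print(most_frequent(votes))
```

The vote reduces the effect of the randomness of a single random forest fit on the chosen candidate. One difference worth knowing: with `Counter`, ties are resolved in favour of the value encountered first, whereas with `max(set(...))` the winner of a tie depends on the iteration order of the set.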

In [28]:
def produce_Data_SL(
    Data: pd.DataFrame,
    Output_mean_shap: pd.DataFrame,
    n_relevant: int,
    target_property: str,
) -> tuple:
    """Function to produce datasets for SL starting from the complete database (Featurized_data).

    Args:
        Data (pandas.DataFrame): complete database (Featurized_data)
        Output_mean_shap (pandas.DataFrame): ranking of features used in terms of importance
        n_relevant (int): number of relevant features
        target_property (str): name of the target property in the complete database

    Returns:
        tuple: containing:
            -  pandas.DataFrame: dataset with only the relevant features + the target property
            -  pandas.DataFrame: dataset with relevant features + set of irrelevant
                (least important) features (summing up to 30 columns) + the target property
    """

    relevant_features = list(Output_mean_shap.iloc[:n_relevant].index)
    unrelevant_features = list(
        Output_mean_shap.sort_values("mean").iloc[: int(30 - n_relevant)].index
    )
    all_features = relevant_features + unrelevant_features

    relevant_features.append(target_property)
    all_features.append(target_property)

    Data_sampled = Data.sample(
        n=100, random_state=SEED
    )  # SEED is the global random state (your group number)

    Data_relevant_features = pd.DataFrame(Data_sampled, columns=relevant_features)
    Data_all_features = pd.DataFrame(Data_sampled, columns=all_features)

    return (Data_relevant_features, Data_all_features)

### **Definition of the SL predictor**
The SL predictor is constructed on the basis of the Random Forest Regressor by lolopy.

In [22]:
model = RandomForestRegressor()

### **Datasets production for SL**
We produce two datasets:
* Data_relevant_features, with 100 random rows and the relevant features accounting for 75% of the total importance of the predictive model (see section Interpretability) + target property column
* Data_all_features, with 100 random rows and 30 features, given by the relevant features (above point) + a set of irrelevant (least important) features + target property column

In [23]:
Data_relevant_features, Data_all_features = produce_Data_SL(
    Featurized_data, Output_mean_shap, ind_cross1 + 1, "Tc"
)

In [None]:
Data_relevant_features.columns

In [None]:
Data_all_features.columns

![https://raw.githubusercontent.com/paolodeangelis/AEM/main/img/Image_Data.png](https://raw.githubusercontent.com/paolodeangelis/AEM/main/img/Image_Data.png)

#### **Sequential Learning relevant features**
We start with the pool of the 50 worst samples in terms of the target $y$ (critical temperature); the SL strategies suggest the next material to be evaluated. The objective is to find the optimum among the other 50 materials with as few evaluations as possible. The search stops when the material with the maximum $y$ (among those remaining 50) is chosen.

Three strategies are compared: Maximum Expected Improvement (MEI), Maximum Likelihood Improvement (MLI), Maximum Uncertainty (MU).
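
The three selection rules, as implemented in the functions `MEI`, `MLI` and `MU` defined earlier, differ only in the score they maximize over the candidates: the predicted mean, the predicted improvement divided by the predicted standard deviation, and the predicted standard deviation alone. The sketch below applies them to made-up predictions; the helper names are hypothetical and the standard deviations are assumed to be strictly positive.

```python
def argmax(values):
    """Index of the largest entry (first one in case of ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def select_mei(mean):
    """MEI: candidate with the highest predicted value."""
    return argmax(mean)


def select_mli(mean, std, best_so_far):
    """MLI: candidate with the highest (mean - best_so_far) / std."""
    return argmax([(m - best_so_far) / s for m, s in zip(mean, std)])


def select_mu(std):
    """MU: candidate with the highest predicted uncertainty."""
    return argmax(std)


pred_mean = [5.0, 4.0, 1.0]   # made-up predicted Tc for three candidates
pred_std = [2.0, 0.2, 3.0]    # made-up predicted standard deviations
best_known = 3.5              # made-up best Tc in the current training set

print(select_mei(pred_mean))
print(select_mli(pred_mean, pred_std, best_known))
print(select_mu(pred_std))
```

Here each rule picks a different candidate: MEI takes candidate 0 (highest mean), MLI takes candidate 1 (a smaller but much more certain improvement), and MU takes candidate 2 (highest uncertainty, regardless of the predicted value).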

<div class="alert alert-block alert-warning">
<b>WARNING: </b> 

Since the Random Forest Regressor by lolopy used for SL is also not deterministic, running the code several times will give different *trajectories* of evaluations.
</div>

![https://raw.githubusercontent.com/paolodeangelis/AEM/main/img/Image_SL.png](https://raw.githubusercontent.com/paolodeangelis/AEM/main/img/Image_SL.png)

In [None]:
MEI_index_relevant = MEI(
    Data_relevant_features.iloc[:, :-1].values,
    Data_relevant_features.iloc[:, -1].values,
    50,
)

In [None]:
MLI_index_relevant = MLI(
    Data_relevant_features.iloc[:, :-1].values,
    Data_relevant_features.iloc[:, -1].values,
    50,
)

In [None]:
MU_index_relevant = MU(
    Data_relevant_features.iloc[:, :-1].values,
    Data_relevant_features.iloc[:, -1].values,
    50,
)

In the cell below, write the number of evaluations performed in each of the three runs above.

In [None]:
N_MEI_index_relevant = 
N_MLI_index_relevant = 
N_MU_index_relevant = 
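
One possible way to read the number of evaluations off an array returned by `MEI`, `MLI` or `MU` is sketched below. It relies on how those functions are written: the array is pre-filled with zeros, one chosen index is stored per step, and the loop stops when the index of the maximum of `y` is chosen. The helper `count_evaluations` and the example trajectory are hypothetical.

```python
def count_evaluations(trajectory, best_index):
    """Number of picks made up to and including best_index, or None if absent."""
    for step, chosen in enumerate(trajectory, start=1):
        if int(chosen) == best_index:
            return step
    return None


# Made-up trajectory: the optimum (index 88) is found at the third pick,
# and the remaining entries keep their initial value 0.
example_trajectory = [62.0, 17.0, 88.0, 0.0, 0.0]
print(count_evaluations(example_trajectory, best_index=88))
```

In the notebook, `best_index` would be `int(np.argmax(y))` for the same `y` vector that was passed to the acquisition function. The progress bar printed by `tqdm` during each run gives the same count.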

#### **Sequential Learning relevant + irrelevant features**
We start with the pool of the 50 worst samples in terms of the target $y$ (critical temperature); the SL strategies suggest the next material to be evaluated. The objective is to find the optimum among the other 50 materials with as few evaluations as possible. The search stops when the material with the maximum $y$ (among those remaining 50) is chosen.

Three strategies are compared: Maximum Expected Improvement (MEI), Maximum Likelihood Improvement (MLI), Maximum Uncertainty (MU).

<div class="alert alert-block alert-warning">
<b>WARNING: </b> 

Since the Random Forest Regressor by lolopy used for SL is also not deterministic, running the code several times will give different *trajectories* of evaluations.
</div>

In [None]:
MEI_index_all = MEI(
    Data_all_features.iloc[:, :-1].values, Data_all_features.iloc[:, -1].values, 50
)

In [None]:
MLI_index_all = MLI(
    Data_all_features.iloc[:, :-1].values, Data_all_features.iloc[:, -1].values, 50
)

In [None]:
MU_index_all = MU(
    Data_all_features.iloc[:, :-1].values, Data_all_features.iloc[:, -1].values, 50
)

In the cell below, write the number of evaluations performed in each of the three runs above.

In [None]:
N_MEI_index_all = 
N_MLI_index_all = 
N_MU_index_all = 

#### **Comprehensive comparison between more strategies and different sets of features**
We plot the performance of SL in terms of the number of evaluations needed to find the optimum, normalized by the average number of evaluations needed with a *naive* random choice.
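
The random-choice baseline can be estimated with a short simulation: pick candidates in random order, without repetition, until the single optimum is hit, and average the number of picks. The sketch below is a stand-alone illustration using the standard library; the function name, the seed and the number of trials are arbitrary choices.

```python
import random


def random_search_length(n_candidates, rng):
    """Number of draws without replacement until the single optimum is hit."""
    order = list(range(n_candidates))
    rng.shuffle(order)
    return order.index(0) + 1   # candidate 0 plays the role of the optimum


rng = random.Random(0)
n_candidates = 50
trials = 20000
average = sum(random_search_length(n_candidates, rng) for _ in range(trials)) / trials

print(f"simulated average: {average:.2f}")
print(f"closed form (n + 1) / 2: {(n_candidates + 1) / 2}")
```

For 50 candidates the exact expectation is $(50 + 1)/2 = 25.5$ picks; the notebook normalizes by the rounded value $50/2 = 25$, which changes the bar heights by about 2%.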

In [None]:
N = 50 / 2  # average number of evaluations with a random choice
labels = ["MEI", "MLI", "MU"]
relevant = [
    N_MEI_index_relevant / N,
    N_MLI_index_relevant / N,
    N_MU_index_relevant / N,
]  # normalized numbers of evaluations for MEI, MLI, MU with only the relevant features
all_features = [
    N_MEI_index_all / N,
    N_MLI_index_all / N,
    N_MU_index_all / N,
]  # normalized numbers of evaluations for MEI, MLI, MU with also the irrelevant features
# (the list is named all_features to avoid shadowing the built-in all)

x = np.arange(len(labels))  # the label locations
width = 0.2  # the width of the bars

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 3), dpi=190)

rects1 = ax.bar(x - width / 2, relevant, width, label="%i features" % (ind_cross1 + 1))
rects2 = ax.bar(x + width / 2, all_features, width, label="30 features")
plt.axhline(y=1, color="k", linewidth=1, linestyle="--")


# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel("Number of experiments/\n number of random experiments")
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

ax.annotate(
    "random choice",
    xy=(1, 1),
    xytext=(1.3, 1.2),
    arrowprops=dict(facecolor="black", width=0.1, headwidth=4),
)


fig.tight_layout()

plt.show()

**WARNING**: In the figure above, all the bars may lie above the random-choice line. DON'T WORRY, you will not get a worse grade for this.

## Assignment 
### Notebook
* In this notebook, set ```SEED``` equal to the number of your group. Set ```N_data``` equal to 2000 (for the first part of the report). Set ```N_trees``` equal to 100 (for the first part of the report).
* In this notebook, after the SL runs, in the subsection "Comprehensive comparison between more strategies and different sets of features", replace the numbers of evaluations
```N_MEI_index_relevant ```,
```N_MLI_index_relevant ```,
```N_MU_index_relevant ```,
```N_MEI_index_all ```,
```N_MLI_index_all ```,
```N_MU_index_all ```

with the ones that you obtain in your runs. Note that, since the code is not deterministic, running the same code several times will in general give different results. Don't worry: just report the number of evaluations of one run for each of the 6 cases (3 strategies with 2 different sets of features).

### Report
* **First part**: with this notebook modified as prescribed above, obtain main results (model performances, relevant features, Sequential Learning performances).
* **Second part**: redo the *Machine Learning* section and the *Interpretability* section modifying the number of materials ```N_data``` in the dataset and the number of estimators ```N_trees``` in the Random Forest Regressor (for the predictive model). Combinations of those parameters are provided for each group in the document "Assignments for the machine learning lab". For each combination, please, re-run the code from the SuperCon database import. In this second part DO NOT redo the Sequential Learning.



