# Machine learning models – Exercises

In these exercises, we'll load a cleaned and featurized dataframe, then use scikit-learn to predict materials properties.

Before starting, we need to use matminer's `load_dataframe_from_json()` function to load a cleaned and featurized version of the `dielectric_constant` dataset. We will use this dataset for all the exercises.

In [1]:
import os
from matminer.utils.io import load_dataframe_from_json

df = load_dataframe_from_json(os.path.join("resources", "dielectric_constant_featurized.json"))
df.head()



| | structure | total_dielectric | composition | MagpieData minimum Number | MagpieData maximum Number | MagpieData range Number | MagpieData mean Number | MagpieData avg_dev Number | MagpieData mode Number | MagpieData minimum MendeleevNumber | ... | MagpieData mode GSmagmom | MagpieData minimum SpaceGroupNumber | MagpieData maximum SpaceGroupNumber | MagpieData range SpaceGroupNumber | MagpieData mean SpaceGroupNumber | MagpieData avg_dev SpaceGroupNumber | MagpieData mode SpaceGroupNumber | density | vpa | packing fraction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [[1.75725875 1.2425695 3.04366125] Rb, [5.271... | 6.23 | (Rb, Te) | 37.0 | 52.0 | 15.0 | 42.0 | 6.666667 | 37.0 | 4.0 | ... | 0.0 | 152.0 | 229.0 | 77.0 | 203.333333 | 34.222222 | 229.0 | 3.108002 | 53.167069 | 0.753707 |
| 1 | [[0. 0. 0.] Cd, [ 4.27210959 2.64061969 13.13... | 6.73 | (Cd, Cl) | 17.0 | 48.0 | 31.0 | 27.333333 | 13.777778 | 17.0 | 70.0 | ... | 0.0 | 64.0 | 194.0 | 130.0 | 107.333333 | 57.777778 | 64.0 | 3.611055 | 28.099366 | 0.284421 |
| 2 | [[0. 0. 0.] Mn, [-2.07904300e-06 2.40067320e+... | 10.64 | (Mn, I) | 25.0 | 53.0 | 28.0 | 43.666667 | 12.444444 | 53.0 | 52.0 | ... | 0.0 | 64.0 | 217.0 | 153.0 | 115.0 | 68.0 | 64.0 | 4.732379 | 36.111958 | 0.318289 |
| 3 | [[-1.73309900e-06 2.38611186e+00 5.95256328e... | 17.99 | (La, N) | 7.0 | 57.0 | 50.0 | 32.0 | 25.0 | 7.0 | 13.0 | ... | 0.0 | 194.0 | 194.0 | 0.0 | 194.0 | 0.0 | 194.0 | 5.760192 | 22.040641 | 0.730689 |
| 4 | [[1.677294 2.484476 2.484476] Mn, [0. 0. 0.] M... | 7.12 | (Mn, F) | 9.0 | 25.0 | 16.0 | 14.333333 | 7.111111 | 9.0 | 52.0 | ... | 0.0 | 15.0 | 217.0 | 202.0 | 82.333333 | 89.777778 | 15.0 | 3.726395 | 13.8044 | 0.302832 |


## Exercise 1: Split the dataset into target property and features

You first need to partition the data into the target property and the features used for learning. For this dataset, the target property is contained in the `total_dielectric` column. The features are all other columns except `structure` and `composition`.

The target property data should be stored in the `y` variable. The set of features used for learning should be stored in the `X` variable.

*Hint: remember to exclude the target property from the feature set.*

In [2]:
# Fill in the blanks

y = df["total_dielectric"].values

X = df.drop(["structure", "composition", "total_dielectric"], axis=1)
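
As a quick sanity check of the split, the sketch below applies the same pattern to a tiny hand-made dataframe. The `toy_df` name and all of its values are made up for illustration; only the column names `structure`, `composition`, and `total_dielectric` come from the dataset above.

```python
import pandas as pd

# Hypothetical toy dataframe (values invented for illustration only).
toy_df = pd.DataFrame({
    "structure": ["s0", "s1", "s2"],
    "composition": ["c0", "c1", "c2"],
    "total_dielectric": [6.23, 6.73, 10.64],
    "density": [3.1, 3.6, 4.7],
    "vpa": [53.2, 28.1, 36.1],
})

# Target property as a NumPy array.
toy_y = toy_df["total_dielectric"].values

# Features: every column except the two non-numeric ones and the target.
toy_X = toy_df.drop(["structure", "composition", "total_dielectric"], axis=1)

# The target must not leak into the feature set.
assert "total_dielectric" not in toy_X.columns
print(list(toy_X.columns), toy_y.shape)
```

The same check, `"total_dielectric" not in X.columns`, can be run on the real `X` after filling in the blanks.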

## Exercise 2: Train a random forest model on the dataset

Train a random forest model with 150 estimators on the dataset. Next, use the model to get predictions for all samples and store them in the `y_pred` variable.

In [3]:
from sklearn.ensemble import RandomForestRegressor

# Fill in the blanks below

rf = RandomForestRegressor(n_estimators=150)

rf.fit(X, y)

y_pred = rf.predict(X)

To see how well your model fits the data it was trained on, run the next cell.

In [4]:
import numpy as np
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y, y_pred)
print('training RMSE = {:.3f}'.format(np.sqrt(mse)))

training RMSE = 6.620
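
For reference, the RMSE printed above is the square root of the mean squared difference between the true and predicted values. A minimal sketch on made-up numbers (the arrays `y_true_demo` and `y_pred_demo` are hypothetical, not taken from the dataset) suggests that computing it by hand with NumPy agrees with the `mean_squared_error` route used in the cell:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values, for illustration only.
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

# RMSE by hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# Same quantity via scikit-learn's mean squared error.
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print('{:.3f} {:.3f}'.format(rmse_manual, rmse_sklearn))
```

Because the training RMSE is computed on the same samples the model was fitted to, it tends to be optimistic.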


## Exercise 3: Evaluate the model using cross validation

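Evaluate your random forest model using cross validation with 5 splits. Because each fold is scored on samples the model did not see during training, this gives a more realistic idea of how well your model will perform in practice than the training RMSE from Exercise 2.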

In [5]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Fill in the blanks below

kfold = KFold(n_splits=5)

scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=kfold)
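
To make the splitting concrete, the sketch below runs `KFold` on ten dummy samples (the `demo` array is hypothetical) and records which indices land in each test fold. With the default `shuffle=False`, the test folds are contiguous blocks taken in row order.

```python
import numpy as np
from sklearn.model_selection import KFold

demo = np.arange(10)  # ten dummy samples, for illustration only
kfold_demo = KFold(n_splits=5)

test_folds = []
for train_idx, test_idx in kfold_demo.split(demo):
    # Each sample appears in exactly one test fold across the 5 splits.
    test_folds.append(test_idx.tolist())
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

If the rows of a dataset are ordered in some meaningful way, passing `shuffle=True` (optionally with a `random_state`) to `KFold` randomizes the fold assignment; whether that is appropriate for this exercise is a judgment call.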

The final cross validation score can be printed by running the cell below.

In [6]:
rmse_scores = [np.sqrt(abs(s)) for s in scores]
print('Mean RMSE: {:.3f}'.format(np.mean(rmse_scores)))

Mean RMSE: 19.938
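
The `abs()` call above is needed because scikit-learn scorers follow a higher-is-better convention, so `neg_mean_squared_error` returns the MSE multiplied by -1. The following sketch illustrates this on synthetic data; the dataset from `make_regression`, the small forest size, and the fixed `random_state` values are assumptions chosen only to keep the example fast and reproducible.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (not the dielectric dataset).
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
rf_demo = RandomForestRegressor(n_estimators=10, random_state=0)

# One negated MSE value per fold; all entries are <= 0.
neg_mse = cross_val_score(
    rf_demo, X_demo, y_demo,
    scoring='neg_mean_squared_error',
    cv=KFold(n_splits=5),
)

# Flip the sign before taking the square root to get per-fold RMSE.
rmse_per_fold = np.sqrt(-neg_mse)
print('Mean RMSE: {:.3f}'.format(rmse_per_fold.mean()))
```

Negating the scores with `-neg_mse` and taking `abs()` are equivalent here, since every entry is non-positive.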
