# Evaluate Models
When choosing a machine learning model, it is necessary to explore how changes in the data affect the model's accuracy metrics. This notebook runs group k-fold cross validation on several types of models in parallel to help choose the right one. The number of folds is set by the `splits` argument in the cross validation cell.


In [31]:
#### Standard Libraries ####
import os
import multiprocessing as mp
from functools import partial
import timeit

#### Third-party Libraries ####
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.utils import resample
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier as SRC
from lolopy.learners import RandomForestClassifier as LRC
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

#### Local Libraries ####
from utils.utils import (Result, run_k_folds,
                         report_column_labels,
                         compile_data, oversample)
from utils.data_manager import DataManager
from utils.featurizer import Featurizer



## Configuration
Use this cell to set any necessary parameters.
* `np.random.seed()`: Sets the random seed of the notebook for reproducibility.
* `load_path`: Path to the training data.
* `save_path`: Where to save the results of cross validation.
* `mp_api_key_path`: Path to a `.txt` file containing a [Materials Project](https://materialsproject.org/) API key.
* `oversample`: Corrects class imbalance by oversampling the minority class.
* `data_ramp`: Runs cross validation sequentially on even divisions of the training data. Used to investigate the impact of additional training data.
* `feature_set`: A list of keywords chosen from 'standard', 'cmpd_energy', 'energy_a', and 'energy_b' that sets which [MatMiner](https://hackingmaterials.lbl.gov/matminer/) composition features to apply.

In [10]:
# Configuration

np.random.seed(8)
load_path = os.path.join('..','data','training_data.csv')
save_path = os.path.join('..','results','final_testing.csv')
mp_api_key_path = os.path.join('..','configuration','mp_api_key.txt')
oversample = True
data_ramp = False
feature_set = ['standard']
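
The `data_ramp` option is described above as running cross validation on even divisions of the training data. The following is a minimal sketch of that idea, assuming the divisions are cumulative (each step adds one more equal-sized chunk of shuffled rows). The helper name `ramp_sizes` is hypothetical and is not part of `utils`; the actual implementation may divide the data differently.

```python
import numpy as np

def ramp_sizes(n_records, n_steps):
    """Return cumulative training-set sizes for a data ramp (hypothetical helper)."""
    # Evenly spaced sizes from n_records / n_steps up to the full data set.
    return [int(round(n_records * (i + 1) / n_steps)) for i in range(n_steps)]

# Shuffled row indices stand in for the shuffled training data.
rng = np.random.default_rng(8)
indices = rng.permutation(20)

sizes = ramp_sizes(len(indices), n_steps=4)
# Each subset contains the previous one plus one more chunk of rows.
subsets = [indices[:size] for size in sizes]
for size, subset in zip(sizes, subsets):
    print(size, subset)
```

Comparing the metrics obtained at each size gives a rough learning curve, which indicates whether more training data is likely to help.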

## Load Data
The `DataManager` handles loading and storing training data from a CSV file.

In [7]:
# Load Data
with open(mp_api_key_path, 'r') as f:
    mp_api_key = f.readline().rstrip()
dm = DataManager(load_path, save_path)
dm.load()

'Loaded 2572 records.'


We can also use the `DataManager` to sample from the data to improve the speed of analysis.

In [8]:
# Sample data
if not data_ramp:
    dm.sample_data(100)

## Formatting and Converting Training Data
The `DataManager` class provides several methods for transforming the native training data into the binary classifier schema.
* `to_binary_classes()` converts the formula and output columns of the `data` attribute into a series of binary compounds that represent the stability vector.
* `get_pymatgen_composition()` converts formulas to pymatgen `Composition` objects, which work with MatMiner's composition featurizers.
* `remove_noble_gasses()` removes any noble gases, which we know to be unstable when combined with other elements.
* `remove_features()` strips out the pre-featurized data provided. We have checked that the provided features are identical to the 'standard' feature set used here (stoichiometric norms and magpie elemental features).

In [9]:
# Format and create composition objects
dm.to_binary_classes()
dm.get_pymatgen_composition()
dm.remove_noble_gasses()
dm.remove_features()

In [64]:
# Shuffle the rows to accommodate the data ramp feature
if data_ramp:
    dm.data = dm.data.sample(frac=1).reset_index(drop=True)

The `Featurizer` class applies sets of MatMiner composition features to your training data. If you wish to use the energy features, you must supply a valid Materials Project API key. Featurization can take up to 30 minutes when using energy features.

In [11]:
f = Featurizer(feature_set, mp_api_key)

In [12]:
dm.featurized_data = f.featurize(dm.data)

While converting the data to the binary classification scheme, we labeled each chemical system (a formulaA, formulaB pair) with a unique number. These labels let us group the integer formulas from the same system so that they are never split across the training and testing sets during cross validation.

In [13]:
# Set group labels for group K-folds
dm.groups = dm.data['group']

In [14]:
# Set training labels
dm.outputs = dm.data['stable']
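
To illustrate why the group labels matter, here is a small sketch using scikit-learn's `GroupKFold` on toy data. Whether `run_k_folds` uses `GroupKFold` internally is an assumption; the point is only to show that rows sharing a group label never appear on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 8 rows belonging to 4 chemical systems (groups), two rows each.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 0, 1, 0, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # Groups present on both sides of the split; this set should be empty.
    shared = set(groups[train_idx].tolist()) & set(groups[test_idx].tolist())
    overlaps.append(shared)
    print("test groups:", sorted(set(groups[test_idx].tolist())), "shared:", shared)
```

Without grouping, closely related formulas from one system could land in both the training and testing sets and make the scores look better than they are.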

## Cross Validation
Cross validation is a technique for assessing model accuracy. The `run_k_folds` function runs group k-fold cross validation, with or without oversampling or data ramping. It is designed to be applied to several models at once, so that many types of models can be evaluated efficiently in the search for the best fit to the problem.

In [32]:
# Configure group K-folds
k_folds = partial(run_k_folds, inputs=dm.featurized_data,
                  outputs=dm.outputs, groups=dm.groups,
                  sampling=oversample, ramp=data_ramp, splits=5)
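
The cell above relies on `functools.partial` to bind the data and settings, leaving a callable that takes only a model. That one-argument shape is what a `map` call needs. A minimal sketch with a stand-in function (the `evaluate` function and its arguments are hypothetical, not the signature of `run_k_folds`):

```python
from functools import partial

def evaluate(model, inputs, outputs, splits):
    """Stand-in for a cross-validation routine (hypothetical)."""
    return f"{model}: {len(inputs)} rows, {splits} splits"

# Fix everything except the model, leaving a one-argument callable.
evaluate_model = partial(evaluate, inputs=[1, 2, 3], outputs=[0, 1, 0], splits=5)

# map passes each model as the single remaining positional argument.
reports = list(map(evaluate_model, ["model_a", "model_b"]))
print(reports)
```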

We set which models we want to run cross validation on by passing a list of scikit-learn style models.

In [16]:
# Set the models we want to run group K-folds on
models = [GaussianNB(), SVC(), SRC(), LogisticRegression(), DummyClassifier(strategy="most_frequent")]
#models = [SRC()]

Each model's cross validation is assigned a core on the CPU so that evaluation happens in parallel.

In [34]:
# Set up workers for each job
pool = mp.Pool(processes=mp.cpu_count())


The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.

Precision and F-score are ill-defined and being set to 0.0 due to no predicted samples.

In [35]:
%%capture
start_time = timeit.default_timer()
results = pool.map(k_folds, models)
elapsed = timeit.default_timer() - start_time
# Release the worker processes once all models have been evaluated
pool.close()
pool.join()

Results are stored in a CSV file and can be analyzed in notebook 2.

In [36]:
compiled = []
for result in results:
    compiled.extend(compile_data(result))
res_df = pd.DataFrame(compiled, columns=report_column_labels)
res_df.to_csv(save_path)
res_df

| | type | data_size | accuracy | accuracy_std | f1 | f1_std | recall | recall_std | precision | precision_std |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GaussianNB | 871 | 0.75945 | 0.094639 | 0.43976 | 0.073688 | 0.550386 | 0.136503 | 0.401819 | 0.130783 |
| 1 | SVC | 871 | 0.833208 | 0.0304 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | RandomForestClassifier | 871 | 0.904609 | 0.02335 | 0.663163 | 0.070201 | 0.572087 | 0.112946 | 0.816171 | 0.113099 |
| 3 | LogisticRegression | 871 | 0.827673 | 0.063149 | 0.61431 | 0.09482 | 0.779972 | 0.077787 | 0.519593 | 0.131976 |
| 4 | DummyClassifier | 871 | 0.833208 | 0.0304 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
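
The SVC and DummyClassifier rows are identical, with zero F1, recall, and precision. That pattern is consistent with a classifier that predicts only the majority class, which would also explain the "no predicted samples" warnings. The sketch below computes the metrics by hand on toy labels to show how high accuracy and zero F1 can coexist; the labels are illustrative and are not taken from the training data.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # With no positive predictions, precision is undefined; report 0.0,
    # which is what scikit-learn does by default (with a warning).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

# Toy labels: 5 of 6 samples are in the majority (negative) class.
y_true = [0, 0, 0, 0, 0, 1]
y_pred = [0] * len(y_true)  # always predict the majority class

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For imbalanced data like this, F1, recall, and precision are more informative than accuracy alone when comparing models.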


## Fitting a Single Model
The code below fits a single model to all of the training data. It also lets you export the model as a pickle for use in other notebooks.

In [None]:
# Configuration
model_export_path = os.path.join('..','models','model.sav')

# Choose which model from the list used for K-folds
model = models[0]

In [90]:
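# Apply oversampling to the training data.
# Import the function under an alias: the `oversample` flag set in the configuration cell shadows it.
from utils.utils import oversample as oversample_indices
train = oversample_indices(np.arange(len(dm.featurized_data)), dm.outputs)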
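
The cell above passes row indices and labels to the project's oversampling helper and gets training indices back. As a rough illustration of what index-based oversampling can look like, here is a sketch that resamples the minority class with replacement until the two classes are balanced. The function `oversample_minority` is a hypothetical stand-in that assumes binary labels; the helper in `utils` may behave differently.

```python
import numpy as np

def oversample_minority(indices, labels, seed=0):
    """Return indices with the minority class resampled up to the majority count.

    Hypothetical stand-in for the project's helper; assumes two classes.
    """
    indices = np.asarray(indices)
    labels = np.asarray(labels)[indices]
    classes, counts = np.unique(labels, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    rng = np.random.default_rng(seed)
    # Draw extra minority rows with replacement until the classes are balanced.
    extra = rng.choice(indices[labels == minority], size=n_extra, replace=True)
    return np.concatenate([indices, extra])

labels = np.array([0, 0, 0, 0, 0, 1, 1])
balanced = oversample_minority(np.arange(len(labels)), labels)
print(np.bincount(labels[balanced]))
```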

In [91]:
# Fit the model
model.fit(dm.featurized_data[train], dm.outputs[train])


The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [92]:
# Save and export the model
import pickle
# Use a context manager so the file is closed after writing
with open(model_export_path, 'wb') as f:
    pickle.dump(model, f)
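
To use the exported model elsewhere, it has to be read back with `pickle.load`. The sketch below shows the round trip using a temporary directory and a plain dictionary as a stand-in for the fitted model; the file name and the dictionary contents are illustrative only.

```python
import os
import pickle
import tempfile

# A plain dictionary stands in for the fitted model.
model = {"name": "example", "n_estimators": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.sav")
    # Write the object; the context manager closes the file afterwards.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Read it back, as another notebook would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Only unpickle files from a source you trust, and load scikit-learn models with the same library version that saved them where possible.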