# Predict Stability Vector
This notebook predicts the stability vector for the provided test data using a previously trained model. Most of the process is encapsulated in the `DataManager` class. Once you have provided the necessary file paths in the configuration cell, the steps are as follows:

1. Load in test data from a .csv to a pandas dataframe.
2. Convert the input columns into integer chemical formulas.
3. Featurize the input data using MatMiner's stoichiometric norms, Magpie elemental features, and Materials Project cohesive energies.
4. Load and deserialize the chosen machine learning model using `pickle`.
5. Make predictions on the featurized inputs.
6. Convert the predicted value of the binary classifier to the stability vector.


In [35]:
#### Standard Libraries ####
import os
import pickle
import numpy as np

#### Local Libraries ####
from utils.utils import Result
from utils.data_manager import DataManager
from utils.featurizer import Featurizer

### Configuration
Use this cell to set any necessary parameters.
* `np.random.seed()` Set the random seed of the notebook for reproducibility.
* `load_test_path` Path to the test data `.csv` file.
* `load_model_path` Path to the pickled model to load.
* `save_path` Where to save the labeled test data.
* `mp_api_key_path` Path to a `.txt` file containing a [Materials Project](https://materialsproject.org/) API key.
* `feature_set` A list of keywords chosen from 'standard', 'cmpd_energy', 'energy_a', and 'energy_b' that sets which [MatMiner](https://hackingmaterials.lbl.gov/matminer/) composition features to apply.

In [23]:
# Configuration
load_test_path = os.path.join('..','data','test_data.csv')
load_model_path = os.path.join('..','models','rfc.sav')
save_path = os.path.join('..','data','test_csv_labeled.csv')
feature_set = ['standard','cmpd_energy']
mp_api_key_path = os.path.join('..','configuration','mp_api_key.txt')

In [24]:
# Load Data
with open(mp_api_key_path, 'r') as f:
    mp_api_key = f.readline().rstrip()
dm = DataManager(load_test_path, save_path)
dm.load()

'Loaded 749 records.'


### Validating the test data
1. Using either `csvkit` or Excel, we check that the test data has 749 rows plus a header row. The load method above reports 749 records, which matches our expected value.
2. Every entry in the first two columns must be a string.
3. Every string in the first two columns should have a maximum length of two.
4. The first letter of every value in the first two columns should be capitalized, and the second letter, if present, should be lowercase.
5. The first letter may not be J.
6. However, the simplest path is to use the `valid` attribute of a pymatgen `Composition` object.
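
Rules 2 through 5 can be sketched as a single regular-expression check. This is only an illustration of the rules as listed, not the logic inside `dm.validate_data()`; the names `ELEMENT_SYMBOL` and `looks_like_element_symbol` are hypothetical.

```python
import re

# One uppercase letter other than J, optionally followed by one lowercase letter.
ELEMENT_SYMBOL = re.compile(r"[A-IK-Z][a-z]?")


def looks_like_element_symbol(value):
    """Return True if value passes rules 2-5 (string, length <= 2, casing, no leading J)."""
    return isinstance(value, str) and ELEMENT_SYMBOL.fullmatch(value) is not None


print(looks_like_element_symbol("Na"))  # well-formed two-letter symbol
print(looks_like_element_symbol("na"))  # fails the capitalization rule
```

A check like this only tests the shape of the string, so a made-up symbol such as "Xx" would still pass. That gap is why rule 6 defers to pymatgen, which knows the actual periodic table.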

In [25]:
# Validate data
dm.validate_data()

'All input elements are valid'


## Converting Test Data
Just as we trained on data converted into systems of binary compounds represented by integer formulas, we must convert the test data inputs.

In [26]:
# Convert the inputs
dm.convert_inputs()
dm.get_pymatgen_composition()
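
As a rough illustration of what an integer formula looks like, the sketch below reduces a binary composition A(1-x)B(x) to its smallest whole-number ratio. It assumes compositions are sampled in tenths (x = 0.0, 0.1, ..., 1.0); that sampling, and the helper name `integer_formula`, are assumptions made for this example and may differ from what `dm.convert_inputs()` actually does.

```python
from math import gcd


def integer_formula(elem_a, elem_b, tenths_b):
    """Smallest-integer formula for A(1-x)B(x), where x = tenths_b / 10."""
    if not 0 <= tenths_b <= 10:
        raise ValueError("tenths_b must be between 0 and 10")
    n_a, n_b = 10 - tenths_b, tenths_b
    divisor = gcd(n_a, n_b)  # gcd(n, 0) == n, so pure elements reduce to 1
    n_a //= divisor
    n_b //= divisor
    parts = []
    if n_a:
        parts.append(f"{elem_a}{n_a}")
    if n_b:
        parts.append(f"{elem_b}{n_b}")
    return "".join(parts)


# Hypothetical Na-Cl system, one formula per composition step
formulas = [integer_formula("Na", "Cl", t) for t in range(11)]
print(formulas)
```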

Featurization is carried out using MatMiner. Depending on your chosen feature set, featurization may take up to 15 minutes. Expect warning messages about missing noble gas electronegativities during this process.

In [27]:
# featurize data
f = Featurizer(feature_set, mp_api_key)
dm.featurized_data = f.featurize(dm.data)

# Here you can choose how to impute missing values, such as the
# electronegativity of a noble gas. We have chosen to convert them to 0.

dm.featurized_data = np.nan_to_num(dm.featurized_data)


No electronegativity for Ar. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.


No electronegativity for He. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.


No electronegativity for Ne. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.
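
The zero-fill imputation chosen in the featurization cell can be compared with one alternative on a toy array. The sketch below is illustrative only: the array values are made up, and column-mean imputation is shown as one possible option, not something this notebook uses.

```python
import numpy as np

# Toy feature matrix with missing entries (values are illustrative only)
features = np.array([[1.0, np.nan],
                     [3.0, 4.0],
                     [np.nan, 8.0]])

# Option used in this notebook: replace every NaN with 0
zero_filled = np.nan_to_num(features)

# Alternative: replace each NaN with the mean of its column, ignoring NaNs
col_means = np.nanmean(features, axis=0)
mean_filled = np.where(np.isnan(features), col_means, features)

print(zero_filled)
print(mean_filled)
```

Note that `np.nan_to_num` also replaces infinities with very large finite numbers by default. Whichever imputation you choose, it should match what was applied to the training features.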



## Making Predictions
Once we have converted and featurized the test data, it is time to make predictions about each compound's stability. We start by loading our chosen model, in this case a Random Forest trained on super-sampled data featurized using stoichiometric norms, the Magpie elemental properties, and cohesive energies.

In [28]:
# Load in our final model
with open(load_model_path, "rb") as f:
    model = pickle.load(f)
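
For readers unfamiliar with `pickle`, the round trip behind a `.sav` file looks roughly like the sketch below. A plain dictionary stands in for the trained model, and the temporary path is hypothetical, so this shows only the serialization pattern and not the actual model file.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model object
stand_in_model = {"name": "rfc", "note": "placeholder, not a real classifier"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rfc.sav")

    # Serialize to disk in binary mode
    with open(path, "wb") as handle:
        pickle.dump(stand_in_model, handle)

    # Deserialize, mirroring the loading cell above
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == stand_in_model)
```

Because unpickling can execute arbitrary code, only load model files from sources you trust. A pickled scikit-learn model is also generally expected to be loaded with the same library version that saved it.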

In [29]:
# Predict binary stability classifier for all formulas
dm.data['stable'] = model.predict(dm.featurized_data)

In [36]:
# Convert the predicted binary classifier results to a stability vector
dm.binary_to_vec()
dm.labeled_data.to_csv(save_path)

Finally, we convert the predictions back to stability vectors and label the original test data. You can find the results at your specified save path as a CSV file.
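
The internals of `dm.binary_to_vec()` are not shown in this notebook. As a minimal sketch of the idea, the original run featurized 8239 formulas for 749 records, which works out to 11 formulas per record. The example below assumes the flat predictions are ordered record by record and simply regroups them; the function name `binary_to_vectors` and that ordering are assumptions.

```python
import numpy as np


def binary_to_vectors(flat_predictions, n_per_record=11):
    """Regroup flat 0/1 predictions into one row (vector) per input record."""
    flat = np.asarray(flat_predictions, dtype=int)
    if flat.size % n_per_record != 0:
        raise ValueError("prediction count is not a multiple of n_per_record")
    return flat.reshape(-1, n_per_record)


# Two hypothetical records, 11 predictions each
flat = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
        1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
vectors = binary_to_vectors(flat)
print(vectors.shape)
print(vectors[0])
```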