# Predict Stability Vector
This notebook predicts the stability vector for the provided test data using a previously trained model. Most of the process is encapsulated in the `DataManager` class. Once the necessary file paths are set in the configuration cell, the steps are as follows:

1. Load the test data from a CSV file into a pandas DataFrame.
2. Convert the input columns into integer chemical formulas.
3. Featurize the input data using matminer's stoichiometric norms, Magpie elemental features, and Materials Project cohesive energies.
4. Load and deserialize the chosen machine learning model using `pickle`.
5. Make predictions on the featurized inputs.
6. Convert the predicted values of the binary classifier into the stability vector.


In [130]:
#### Standard Libraries ####
import os
import pickle
from pprint import pprint
import numpy as np
import pandas as pd
import multiprocessing as mp
from functools import partial
from itertools import product
import timeit
import uuid

#### Third-party Libraries ####
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier as SRC
from lolopy.learners import RandomForestClassifier as LRC
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.utils import resample
from pymatgen.core.composition import Composition  # the top-level `from pymatgen import Composition` was removed in newer releases

#### Local Libraries ####
from utils import (Result, run_k_folds, 
                   report_column_labels,
                   compile_data)
from data_manager import DataManager
from featurizer import Featurizer
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [143]:
# Configuration
load_test_path = os.path.join('data','test_data.csv')
load_model_path = os.path.join('rfc.sav')
save_path = os.path.join('results','final_model.csv')
feature_set = ['standard','cmpd_energy']
mp_api_key = '7n6DwPUQ5cf8ZTWO'

In [144]:
# Load Data
dm = DataManager(load_test_path, save_path)
dm.load()

'Loaded 749 records.'


### Validating the test data
1. Using either `csvkit` or Excel, we check that the test data has 749 rows plus a header row. The load method above reports 749 records, which matches our expected value.
2. Every entry in the first two columns must be a string.
3. Every string in the first two columns should have a maximum length of two.
4. In the first two columns, the first letter of every value should be uppercase and the second, if present, should be lowercase.
5. The first letter may not be J, since no element symbol starts with J.
6. Alternatively, these checks could be done with a pymatgen `Composition` object.
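
Rules 2 through 5 can also be checked without pymatgen. The following is a minimal sketch, not the notebook's own method: `is_plausible_symbol` is a hypothetical helper, and it only tests the shape of the string, so a well-formed but nonexistent symbol such as `Xx` would still pass.

```python
import re

# One uppercase letter other than J, optionally followed by one lowercase letter.
# Assumption: this checks only the shape of the symbol, not whether the element exists.
_SYMBOL_PATTERN = re.compile(r"[A-IK-Z][a-z]?")


def is_plausible_symbol(value) -> bool:
    """Return True if `value` is a string shaped like an element symbol."""
    return isinstance(value, str) and _SYMBOL_PATTERN.fullmatch(value) is not None


# Illustrative inputs: two valid shapes followed by one violation of each rule.
candidates = ["Fe", "O", "fe", "FE", "Jo", "Uuo", 26]
shape_ok = [is_plausible_symbol(c) for c in candidates]
```

A shape check like this is cheap to run on every row before the more expensive `Composition` parsing.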

In [140]:
# Validate data - move to data manager
test_col_1 = dm.data.iloc[:,0].apply(lambda x: Composition(x).valid)
test_col_2 = dm.data.iloc[:,1].apply(lambda x: Composition(x).valid)
if not all(test_col_1) or not all(test_col_2):
    pprint("Invalid element in data")

In [145]:
# Convert the inputs
dm.convert_inputs()
dm.get_pymatgen_composition()

# Featurize the data
f = Featurizer(feature_set, mp_api_key)
dm.featurized_data = f.featurize(dm.data)
# Replace NaN feature values (e.g. missing electronegativities) with 0
dm.featurized_data = np.nan_to_num(dm.featurized_data)

No electronegativity for Ar. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.


No electronegativity for He. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.


No electronegativity for Ne. Setting to NaN. This has no physical meaning, and is mainly done to avoid errors caused by the code expecting a float.
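
These warnings mean that some feature values for the noble gases come back as NaN, which `np.nan_to_num` then replaces with 0. A small sketch of that behavior on a made-up feature matrix (the values are illustrative and are not taken from the featurizer):

```python
import numpy as np

# Hypothetical feature matrix: each NaN stands in for a missing electronegativity.
features = np.array([[1.5, np.nan],
                     [np.nan, 2.0]])

# By default nan_to_num maps NaN to 0.0 (and +/-inf to very large finite numbers).
cleaned = np.nan_to_num(features)
```

Zero-filling treats a missing value as if it were a real measurement of 0, so an explicit imputation strategy may be worth considering if these features turn out to matter to the model.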



In [146]:
# Load in our final model
with open(load_model_path, "rb") as f:
    model = pickle.load(f)

In [147]:
# predict
dm.data['stable'] = model.predict(dm.featurized_data)

In [82]:
# Convert to vector

In [148]:
# move to data manager
def binary_to_vec(series: pd.Series) -> list:
    """Collect one system's binary stability predictions into a list."""
    return series.tolist()

In [164]:
results_df = pd.read_csv(load_test_path)

In [165]:
# Assumes the (sorted) group order matches the row order of results_df
results_df['stabilityVec'] = dm.data.groupby('system')['stable'].agg(binary_to_vec).values
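
To make the grouping step concrete, here is a minimal sketch on a toy frame. The `system` labels and `stable` values are invented for illustration, and `sort=False` is an assumption that the output should list systems in the order they first appear; by default `groupby` sorts the group keys, which can reorder the vectors relative to the input file.

```python
import pandas as pd

# Toy stand-in for dm.data: one row per composition, labelled by its system.
toy = pd.DataFrame({
    "system": ["Na-Cl", "Na-Cl", "Na-Cl", "Fe-O", "Fe-O", "Fe-O"],
    "stable": [1, 0, 1, 1, 1, 0],
})

# Collect each system's predictions into a list, keeping first-appearance order.
vectors = toy.groupby("system", sort=False)["stable"].agg(lambda s: s.tolist())

# With the default sort=True the groups come back in sorted key order instead.
sorted_vectors = toy.groupby("system")["stable"].agg(lambda s: s.tolist())
```

If the row order of the results file cannot be guaranteed to match either ordering, merging on the `system` key would be a safer way to attach the vectors than assigning `.values` positionally.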

In [166]:
results_df.to_csv('data/test_csv_labeled.csv')