## Evaluating performance of composition based feature vectors

This notebook walks through an example comparing two featurisation methods, 'fractional' (referred to here as onehot for legacy reasons) and 'magpie', with each other.

We'll use 80/20 train/test splits, then LOCO-CV and kernelised LOCO-CV.

Then we'll compare to random projections of the same size.

## First some imports and definitions

In [1]:
import pandas as pd
import os
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, r2_score
import json
from utilities import do_loco_cv

In [2]:
data_folder = 'data/case_studies'
task_info = 'task_info.json'

### For ease we've put information about the different tasks into a large dictionary

In [3]:
with open(task_info) as f:
    tasks = json.load(f)
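
The cells that follow read only three keys from each task entry: `type`, `study_folder` and `task_folder`. As a rough sketch of what one entry might look like, the snippet below builds a hypothetical example; the key names come from this notebook, but the folder values (and the assumption that GFA is a classification task) are placeholders for illustration, not the contents of the real `task_info.json`.

```python
import json

# Hypothetical entry: only the key names are taken from the notebook,
# the values are made-up placeholders.
example_task_info = """
{
    "GFA": {
        "type": "classification",
        "study_folder": "example_study",
        "task_folder": "example_task"
    }
}
"""

example_tasks = json.loads(example_task_info)

# The 'type' field is what later selects the metric and the model.
example_type = example_tasks["GFA"]["type"]
print(example_type)
```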

## Now let's set up our experiment

Our options for featurisation are as follows:
* 'oliynyk' : Originally designed for prediction of Heusler structured intermetallics [13], the Oliynyk feature set as implemented in previous work includes 44 features [5]. For each of these, the weighted mean, sum, range, and variance of that feature amongst the constituent elements of the compound are taken. Features include atomic weight, metal, metalloid or non-metallic properties, periodic table based properties (period, group, atomic number), various measures of radii (atomic, Miracle, covalent), electronegativity, valency features (such as the number of s, p, d, and f valence electrons), and thermal features (such as boiling point and specific heat capacity).
* 'jarvis' : JARVIS combines structural descriptors with chemical descriptors to create “classical force-field inspired descriptors” (CFID). Structural descriptors include bond angle distributions neighbouring atomic sites, dihedral atom distributions, and radial distributions, among others. Chemical descriptors used include atomic mass and mean charge distributions. Original work generated CFIDs for tens of thousands of DFT-calculated crystal structures [14], and subsequent work adapted CFIDs for individual elements to be used in CBFVs for arbitrary compositions without known structures.
* 'magpie' : While the Materials-Agnostic Platform for Informatics and Exploration (MAGPIE) is the name of a library associated with Ward et al.’s work, it has become synonymous with the 115 features used in the paper, and as such we will use Magpie to refer to the feature set. These features include 6 stoichiometric attributes, which are different normalisation methods ($L^p$ norms) of the elements present. These capture information about the ratios of the elements in a material without taking into account what the elements are. 115 elemental based attributes are used, which are derived from the minimum, maximum, range, standard deviation, mode (property of the most prevalent element) and weighted average of 23 elemental properties including atomic number, Mendeleev number and atomic weight, among others. Remaining features are derived from valence orbital occupation and ionic compound attributes (which are based on differences in electronegativity between constituent elements in a compound).
* 'random_200' : A random vector featurisation used by Murdock et al. to represent a lower bound for performance.
* 'onehot' : (referred to as fractional in the paper, but onehot in code for legacy reasons). This is an implementation of a one-hot style encoding of composition which includes average, sum, range, and variance of each element.
* 'compVec' : A one-hot style encoding of composition as used in ElemNet (containing only the proportions of each element in a composition). Differences between this and fractional are further discussed in section 2.1 of the associated paper.
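
To make the idea of a proportion-only encoding concrete, here is a minimal sketch of a compVec-like vector. It is not ElemNet's implementation or the code used to build the data files in this repository: the function name `comp_vec_sketch`, the five-element list and the simple formula parser are all assumptions made for illustration.

```python
import re

# Small illustrative element list; a real encoding would cover many more elements.
ELEMENTS = ["H", "O", "Na", "Cl", "Fe"]


def comp_vec_sketch(formula, elements=ELEMENTS):
    """Return the proportion of each element in `formula`, ordered as `elements`.

    Only handles simple formulae such as 'H2O' or 'Fe2O3' (no brackets).
    Elements missing from `elements` still count towards the total,
    but get no slot in the returned vector.
    """
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    total = sum(counts.values())
    return [counts.get(element, 0.0) / total for element in elements]


water_vector = comp_vec_sketch("H2O")
salt_vector = comp_vec_sketch("NaCl")
print(water_vector)
print(salt_vector)
```

The 'onehot' (fractional) featurisation described above would add further statistics on top of these proportions, which this sketch does not attempt.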

In [4]:
task = 'GFA'
featurisations = ['magpie','onehot'] # For legacy reasons we refer to onehot in the paper as fractional

In [5]:
metric = accuracy_score if tasks[task]['type'] == 'classification' else r2_score
model = RandomForestClassifier() if tasks[task]['type'] == 'classification' else RandomForestRegressor() 

## First let's look at scores with an 80/20 train/test split

In [6]:
cbfv_train_test_score = {} #We will later compare these to random projections, and to LOCO-CV scores

In [7]:
for featurisation_method in featurisations:
    #Find files
    task_folder = os.path.join(data_folder, #where the data is
                 'CBFV_data', #whether we are investigating CBFVs or random projections
                 tasks[task]['study_folder'], #Which study?
                 '80_20_split',#80_20_split or LOCO-CV?
                 tasks[task]['type'], #regression or classification?
                 tasks[task]['task_folder']) #Which task?
    train_file = os.path.join(task_folder, f'{featurisation_method}_train_CBFV.csv')
    test_file = os.path.join(task_folder, f'{featurisation_method}_test_CBFV.csv')
    
    #Load in files
    train_df = pd.read_csv(train_file)
    test_df = pd.read_csv(test_file)
    
    #Train model
    train_x = train_df.drop(['target','formula'], axis=1)
    train_y = train_df['target']
    model.fit(train_x, train_y)
    
    #Make predictions on test set
    test_x = test_df.drop(['target','formula'], axis=1)
    test_y = test_df['target']
    predictions = model.predict(test_x)
    
    #Measure performance
    score = metric(test_y, predictions)
    print(f'Score for {featurisation_method} is {round(score,3)}')
    cbfv_train_test_score[featurisation_method] = score

Score for magpie is 0.568
Score for onehot is 0.543


## Now LOCO-CV and kernelised LOCO-CV
LOCO-CV and kernelised LOCO-CV are scored in exactly the same way; the difference is in how the data are clustered. For reproducibility we use the same clusters that are reported in the paper. For an example of how to implement kernelised LOCO-CV, please see preparing_kernelised_LOCO_CV.ipynb.
We have defined a function to go through the LOCO-CV. From the source code we can see it is quite simple.
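
The kernelised split files used below end in `_rbf`, which suggests a radial basis function (RBF) kernel is involved in the clustering step. The following is only a sketch under that assumption: it computes an RBF kernel matrix whose rows could be clustered in place of the raw feature vectors. The function name and the `gamma` value are illustrative choices, not taken from the repository.

```python
import math


def rbf_kernel_matrix(rows, gamma=1.0):
    """Return K where K[i][j] = exp(-gamma * ||rows[i] - rows[j]||^2)."""
    kernel = []
    for a in rows:
        kernel_row = []
        for b in rows:
            squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
            kernel_row.append(math.exp(-gamma * squared_distance))
        kernel.append(kernel_row)
    return kernel


features = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
kernel = rbf_kernel_matrix(features, gamma=0.5)

# Each row of `kernel` describes one sample by its similarity to every other
# sample; a clustering algorithm such as k-means could be run on these rows.
for kernel_row in kernel:
    print([round(value, 3) for value in kernel_row])
```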

In [8]:
?? do_loco_cv

Signature: do_loco_cv(clusters, data, model, metric, return_score_breakdown=False)
Source:   
def do_loco_cv(clusters, data, model, metric, return_score_breakdown=False):
    """Performs LOCO-CV given predefined clusters.

    Args:
        clusters (list): Clusters with which to apply LOCO-CV
            in the form [{'k':2, 'formulae':['H2O','NaCl'....],'clusters':[0,1...]},{'k':2...]
        data (pandas.DataFrame): data to apply LOCO-CV to.
        model (any model that uses SKlearn style .fit, .predict interface): the model to evaluate.

In [9]:
cbfv_loco_cv_score = {}
cbfv_kernelised_loco_cv_score = {}
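
The general idea behind leave-one-cluster-out cross-validation can be sketched in a few lines: hold out each cluster in turn, train on the remaining clusters, score on the held-out one, and combine the scores. The code below is a simplified illustration and not the source of `do_loco_cv`; the function `loco_cv_sketch`, the toy `MajorityClassModel`, the single set of cluster labels and the use of a plain mean are all assumptions.

```python
class MajorityClassModel:
    """Toy stand-in for a scikit-learn style model: predicts the most common training label."""

    def fit(self, x, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, x):
        return [self.label_ for _ in x]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def loco_cv_sketch(features, targets, cluster_labels, model, metric):
    """Hold out each cluster in turn, train on the rest, and average the scores."""
    scores = []
    for held_out in sorted(set(cluster_labels)):
        train_idx = [i for i, c in enumerate(cluster_labels) if c != held_out]
        test_idx = [i for i, c in enumerate(cluster_labels) if c == held_out]
        model.fit([features[i] for i in train_idx], [targets[i] for i in train_idx])
        predictions = model.predict([features[i] for i in test_idx])
        scores.append(metric([targets[i] for i in test_idx], predictions))
    return sum(scores) / len(scores)


features = [[0.1], [0.2], [0.9], [1.0], [0.5], [0.6]]
targets = [0, 0, 1, 0, 0, 0]
cluster_labels = [0, 0, 1, 1, 2, 2]
score = loco_cv_sketch(features, targets, cluster_labels, MajorityClassModel(), accuracy)
print(round(score, 3))
```

Because the model never sees any sample from the held-out cluster, the score reflects how well it extrapolates to a group of compositions unlike those it was trained on.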

In [10]:
# This takes a while to run
for featurisation_method in featurisations:
    #Find files
    task_folder = os.path.join(data_folder, #where the data is
                 'CBFV_data', #whether we are investigating CBFVs or random projections
                 tasks[task]['study_folder'], #Which study?
                 'LOCO-CV',#80_20_split or LOCO-CV?
                 tasks[task]['type'], #regression or classification?
                 tasks[task]['task_folder']) #Which task?
    data_file = os.path.join(task_folder,f'{featurisation_method}_CBFV.csv')
    loco_cv_split_file = os.path.join(task_folder,f'{featurisation_method}_CBFV.json')
    kernelised_loco_cv_split_file = os.path.join(task_folder,f'{featurisation_method}_CBFV_rbf.json')
    
    data = pd.read_csv(data_file)
    if tasks[task]['type'] == 'classification':
        data['target'] = data['target'].astype(int)
    with open(loco_cv_split_file) as f:
        loco_cv_split = json.load(f)
    with open(kernelised_loco_cv_split_file) as f:
        kernelised_loco_cv_split = json.load(f)
    
    loco_cv_score = do_loco_cv(loco_cv_split, data, model, metric)
    print(f'LOCO-CV score for {featurisation_method} is {round(loco_cv_score,3)}')
    cbfv_loco_cv_score[featurisation_method] = loco_cv_score
    
    kernelised_loco_cv_score = do_loco_cv(kernelised_loco_cv_split, data, model, metric)
    print(f'kernelised LOCO-CV score for {featurisation_method} is {round(kernelised_loco_cv_score,3)}')
    cbfv_kernelised_loco_cv_score[featurisation_method] = kernelised_loco_cv_score
    
    

LOCO-CV score for magpie is 0.64
kernelised LOCO-CV score for magpie is 0.876
LOCO-CV score for onehot is 0.586
kernelised LOCO-CV score for onehot is 0.743


## Let's compare this to random projections of the same size
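
This notebook loads precomputed projection files and does not show how they were generated. As a hedged illustration of the general technique, the sketch below projects feature rows onto random Gaussian directions; the function name, the seed handling and the scaling are assumptions, and the projections in the data files may have been produced differently.

```python
import random


def random_projection(rows, n_components, seed=0):
    """Project each row onto `n_components` random Gaussian directions."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    # Scale the entries so that squared lengths are preserved in expectation.
    scale = 1.0 / n_components ** 0.5
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(n_features)]
              for _ in range(n_components)]
    return [[sum(w * x for w, x in zip(direction, row)) for direction in matrix]
            for row in rows]


rows = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 3.0, 0.0]]
projected = random_projection(rows, n_components=3)
print(len(projected), len(projected[0]))
```

Choosing `n_components` equal to the length of a given CBFV gives a random baseline "of the same size" as that featurisation.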

In [11]:
random_projection_train_test_score = {} #We will later compare these to the CBFV scores

In [12]:
for featurisation_method in featurisations:
    #Find files
    task_folder = os.path.join(data_folder, #where the data is
                 'random_projection_data', #whether we are investigating CBFVs or random projections
                 tasks[task]['study_folder'], #Which study?
                 '80_20_split',#80_20_split or LOCO-CV?
                 tasks[task]['type'], #regression or classification?
                 tasks[task]['task_folder']) #Which task?
    train_file = os.path.join(task_folder, f'{featurisation_method}_train_projection.csv')
    test_file = os.path.join(task_folder, f'{featurisation_method}_test_projection.csv')
    #Load in files
    train_df = pd.read_csv(train_file)
    test_df = pd.read_csv(test_file)
    
    #Train model
    train_x = train_df.drop(['target','formula'], axis=1)
    train_y = train_df['target']
    model.fit(train_x, train_y)
    
    #Make predictions on test set
    test_x = test_df.drop(['target','formula'], axis=1)
    test_y = test_df['target']
    predictions = model.predict(test_x)
    
    #Measure performance
    score = metric(test_y, predictions)
    print(f'Score for {featurisation_method} is {round(score,3)}')
    random_projection_train_test_score[featurisation_method] = score

Score for magpie is 0.87
Score for onehot is 0.875


In [14]:
random_projection_loco_cv_score = {}
random_projection_kernelised_loco_cv_score = {}

In [15]:
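# This takes a while to run
# NOTE: the paths below point at 'CBFV_data' and the '*_CBFV' files, the same
# inputs as In [10], so this cell re-scores the CBFVs rather than the random projections.
for featurisation_method in featurisations:
    #Find files
    task_folder = os.path.join(data_folder, #where the data is
                 'CBFV_data', #whether we are investigating CBFVs or random projections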
                 tasks[task]['study_folder'], #Which study?
                 'LOCO-CV',#80_20_split or LOCO-CV?
                 tasks[task]['type'], #regression or classification?
                 tasks[task]['task_folder']) #Which task?
    data_file = os.path.join(task_folder,f'{featurisation_method}_CBFV.csv')
    loco_cv_split_file = os.path.join(task_folder,f'{featurisation_method}_CBFV.json')
    kernelised_loco_cv_split_file = os.path.join(task_folder,f'{featurisation_method}_CBFV_rbf.json')
    
    data = pd.read_csv(data_file)
    if tasks[task]['type'] == 'classification':
        data['target'] = data['target'].astype(int)
    with open(loco_cv_split_file) as f:
        loco_cv_split = json.load(f)
    with open(kernelised_loco_cv_split_file) as f:
        kernelised_loco_cv_split = json.load(f)
    
    loco_cv_score = do_loco_cv(loco_cv_split, data, model, metric)
    print(f'LOCO-CV score for {featurisation_method} is {round(loco_cv_score,3)}')
    random_projection_loco_cv_score[featurisation_method] = loco_cv_score
    
    kernelised_loco_cv_score = do_loco_cv(kernelised_loco_cv_split, data, model, metric)
    print(f'kernelised LOCO-CV score for {featurisation_method} is {round(kernelised_loco_cv_score,3)}')
    random_projection_kernelised_loco_cv_score[featurisation_method] = kernelised_loco_cv_score
    

LOCO-CV score for magpie is 0.628
kernelised LOCO-CV score for magpie is 0.876
LOCO-CV score for onehot is 0.592
kernelised LOCO-CV score for onehot is 0.742


## Let's see how much each featurisation method improves over a random projection of the same size
Negative numbers mean it's worse than a random projection of the same size.

In [16]:
for featurisation_method in featurisations:
    change = ((cbfv_train_test_score[featurisation_method]/random_projection_train_test_score[featurisation_method]) - 1) * 100
    print(f'When measuring using an 80/20 train/test split {featurisation_method} performs {round(change,3)}% better than an equally sized random projection')
    
    #Use the LOCO-CV scores here, not the train/test scores
    change = ((cbfv_loco_cv_score[featurisation_method]/random_projection_loco_cv_score[featurisation_method]) - 1) * 100
    print(f'When measuring using LOCO-CV {featurisation_method} performs {round(change,3)}% better than an equally sized random projection')
    
    change = ((cbfv_kernelised_loco_cv_score[featurisation_method]/random_projection_kernelised_loco_cv_score[featurisation_method]) - 1) * 100
    print(f'When measuring using kernelised LOCO-CV {featurisation_method} performs {round(change,3)}% better than an equally sized random projection')
    print()

When measuring using an 80/20 train/test split magpie performs -34.759% better than an equally sized random projection
When measuring using LOCO-CV magpie performs 1.911% better than an equally sized random projection
When measuring using kernelised LOCO-CV magpie performs -0.041% better than an equally sized random projection

When measuring using an 80/20 train/test split onehot performs -37.647% better than an equally sized random projection
When measuring using LOCO-CV onehot performs -1.014% better than an equally sized random projection
When measuring using kernelised LOCO-CV onehot performs 0.177% better than an equally sized random projection

The two LOCO-CV percentages are recomputed from the rounded LOCO-CV scores printed earlier (0.64 vs 0.628 for magpie, 0.586 vs 0.592 for onehot), so their final digits are approximate.

