# Scenario Overview

In this notebook, a client wants to test our technology's ability to predict a functional property from material composition. The idea behind this proof of concept is to use data to solve a canonical thermodynamics problem: given a pair of elements, predict the stable binary compounds that form on mixing. The customer provided roughly 2500 element pairs as training data.

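The data science team has suggested training labels that are a discretization of the 1D binary phase diagram at 10% intervals. Each label is therefore a vector of 11 entries, running from pure A to pure B, in which a 1 marks a stable compound at that composition.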

For example, the label for OsTi ([1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0]) translates into the following stable compounds:

- Os 1.0 Ti 0.0, i.e. Os
- Os 0.5 Ti 0.5, i.e. OsTi
- Os 0.0 Ti 1.0, i.e. Ti
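
To make the encoding concrete, here is a minimal sketch that decodes a label vector back into compositions. The helper name `decode_stability_vector` is illustrative and not part of the customer's data format; it only assumes that position `i` of the vector corresponds to a B fraction of `i/10`, as in the OsTi example.

```python
def decode_stability_vector(vec):
    """Return (fraction_a, fraction_b) pairs for every stable grid point."""
    n = len(vec) - 1  # 11 entries -> 10 intervals of 10%
    stable = []
    for i, flag in enumerate(vec):
        if flag:
            frac_b = round(i / n, 1)
            frac_a = round(1 - frac_b, 1)
            stable.append((frac_a, frac_b))
    return stable


os_ti = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
for frac_os, frac_ti in decode_stability_vector(os_ti):
    print(f"Os {frac_os} Ti {frac_ti}")
```

The two end members (pure A and pure B) are always stable, so only the nine interior positions carry information.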

Model development and the approach taken by the team are described below. Additional business considerations are provided in the accompanying slide presentation.

# Notebook Prep

In [None]:
!python -m pip install -U pycaret
!python -m pip install pycaret[tuners]
!python -m pip install -U optuna
!python -m pip install -U shap
!python -m pip install -U scikit-learn
!python -m pip install -U ipywidgets
!python -m pip install -U plotly-express
!python -m pip install iterative-stratification
!python -m pip install -U summarytools
!python -m pip install -U wandb
!python -m pip install -U torch torchvision torchaudio
!python -m pip install -U fastai


In [1]:
import wandb
import numpy as np
import pandas as pd
import json
import pycaret
import optuna
import shap
from summarytools import dfSummary
from sklearn.model_selection import train_test_split
import plotly_express as px
import plotly.graph_objects as go
from pycaret.classification import *
import fastai
from fastai.tabular.all import *

We will be using Weights and Biases to log data during this investigation. If you do not want to use it for logging, please feel free to comment out the wandb items.

In [2]:
wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: alanfiler. Use `wandb login --relogin` to force relogin


True

# Data Wrangling

## Data Import

In [3]:
# import data
df = pd.read_csv('training_data.csv', 
                 dtype={'formulaA' : 'category',
                        'formulaB' : 'category',
                        })

## EDA

The first order of business is to explore the dataset and get a feel for its structure.

In [4]:
df.head()

Unnamed: 0,formulaA,formulaB,formulaA_elements_AtomicVolume,formulaB_elements_AtomicVolume,formulaA_elements_AtomicWeight,formulaB_elements_AtomicWeight,formulaA_elements_BoilingT,formulaB_elements_BoilingT,formulaA_elements_BulkModulus,formulaB_elements_BulkModulus,...,formulaB_elements_Row,formulaA_elements_ShearModulus,formulaB_elements_ShearModulus,formulaA_elements_SpaceGroupNumber,formulaB_elements_SpaceGroupNumber,avg_coordination_A,avg_coordination_B,avg_nearest_neighbor_distance_A,avg_nearest_neighbor_distance_B,stabilityVec
0,Ac,Ag,37.433086,17.075648,227.0,107.8682,3473.0,2435.0,0.0,100.0,...,5,0.0,30.0,225,225,12.0,12.0,3.99462,2.94195,"[1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0]"
1,Ac,Al,37.433086,16.594425,227.0,26.981539,3473.0,2792.0,0.0,76.0,...,3,0.0,26.0,225,225,12.0,12.0,3.99462,2.85595,"[1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0]"
2,Ac,As,37.433086,21.723966,227.0,74.9216,3473.0,887.0,0.0,22.0,...,4,0.0,0.0,225,166,12.0,3.0,3.99462,2.5579,"[1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0]"
3,Ac,Ba,37.433086,64.969282,227.0,137.327,3473.0,2143.0,0.0,9.6,...,6,0.0,4.9,225,229,12.0,8.0,3.99462,4.35637,"[1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0]"
4,Ac,Bi,37.433086,35.483459,227.0,208.9804,3473.0,1837.0,0.0,31.0,...,6,0.0,12.0,225,12,12.0,3.0,3.99462,3.11221,"[1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0]"


In [5]:
df.describe()

Unnamed: 0,formulaA_elements_AtomicVolume,formulaB_elements_AtomicVolume,formulaA_elements_AtomicWeight,formulaB_elements_AtomicWeight,formulaA_elements_BoilingT,formulaB_elements_BoilingT,formulaA_elements_BulkModulus,formulaB_elements_BulkModulus,formulaA_elements_Column,formulaB_elements_Column,...,formulaA_elements_Row,formulaB_elements_Row,formulaA_elements_ShearModulus,formulaB_elements_ShearModulus,formulaA_elements_SpaceGroupNumber,formulaB_elements_SpaceGroupNumber,avg_coordination_A,avg_coordination_B,avg_nearest_neighbor_distance_A,avg_nearest_neighbor_distance_B
count,2572.0,2572.0,2572.0,2572.0,2572.0,2572.0,2572.0,2572.0,2572.0,2572.0,...,2572.0,2572.0,2572.0,2572.0,2572.0,2572.0,2572.0,2572.0,2572.0,2572.0
mean,2207.340923,2220.778005,112.319674,113.247322,2733.916283,2740.693188,74.569868,78.194751,7.992224,8.06493,...,4.84409,4.857698,34.256726,35.582387,187.127138,189.957232,9.152102,9.271954,3.114011,3.130684
std,8729.184304,8751.899407,65.258759,65.877,1507.624155,1510.148266,93.757854,96.094178,5.496219,5.475384,...,1.377499,1.373744,50.611912,51.760457,56.243399,53.868652,3.637761,3.597788,0.708516,0.716404
min,7.297767,7.297767,4.002602,4.002602,4.07,4.07,0.0,0.0,1.0,1.0,...,1.0,1.0,0.0,0.0,2.0,2.0,1.0,1.0,1.42409,1.42409
25%,15.858734,15.858734,55.845,55.845,1469.0,1615.0,6.3,7.7,3.0,3.0,...,4.0,4.0,0.0,0.0,194.0,194.0,6.0,8.0,2.62673,2.66567
50%,26.082658,26.966785,107.8682,107.8682,2973.0,2973.0,38.7,41.0,7.0,7.0,...,5.0,5.0,18.0,18.0,194.0,194.0,12.0,12.0,2.94955,2.94955
75%,34.784501,34.784501,164.93032,167.259,3680.0,3676.25,110.0,120.0,13.0,13.0,...,6.0,6.0,38.0,38.0,225.0,225.0,12.0,12.0,3.54101,3.56059
max,37236.03556,37236.03556,238.02891,238.02891,5869.0,5869.0,380.0,380.0,18.0,18.0,...,7.0,7.0,222.0,222.0,229.0,229.0,12.0,12.0,5.32395,5.32395


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2572 entries, 0 to 2571
Data columns (total 99 columns):
 #   Column                                      Non-Null Count  Dtype   
---  ------                                      --------------  -----   
 0   formulaA                                    2572 non-null   category
 1   formulaB                                    2572 non-null   category
 2   formulaA_elements_AtomicVolume              2572 non-null   float64 
 3   formulaB_elements_AtomicVolume              2572 non-null   float64 
 4   formulaA_elements_AtomicWeight              2572 non-null   float64 
 5   formulaB_elements_AtomicWeight              2572 non-null   float64 
 6   formulaA_elements_BoilingT                  2572 non-null   float64 
 7   formulaB_elements_BoilingT                  2572 non-null   float64 
 8   formulaA_elements_BulkModulus               2572 non-null   float64 
 9   formulaB_elements_BulkModulus               2572 non-null   float64 
 10  

## Data Cleaning

### Multi-class to multi-label

Now that we have a general idea of the state of the dataset, it is time to do some manipulation. The first step is to turn the dataset from a multi-class problem into a multi-label one, for two reasons:

1. Treating each distinct stability vector as a class gives 100+ classes, many with only one sample.
2. In a multi-label problem, even a model that cannot predict the entire stability vector may still predict some of the compounds correctly.
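
The second point can be illustrated with a small sketch. The label rows and predictions here are made up for illustration; the two scores are the usual exact-match accuracy (the whole vector must be right) and per-label accuracy (each position is scored separately).

```python
# Toy ground truth and predictions over nine label positions (illustrative only).
y_true = [
    [0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
y_pred = [
    [0, 0, 1, 0, 0, 0, 0, 0, 0],  # misses one compound, gets the other
    [0, 0, 1, 0, 0, 0, 0, 0, 0],  # fully correct
    [0, 0, 0, 0, 0, 0, 0, 1, 0],  # one false positive
]

# Multi-class view: a row only counts if the entire vector matches.
exact_match = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Multi-label view: every position is scored on its own.
n_labels = len(y_true) * len(y_true[0])
per_label = sum(
    ti == pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p)
) / n_labels

print(f"exact match: {exact_match:.2f}, per label: {per_label:.2f}")
```

Only one of the three rows is an exact match, yet 25 of the 27 individual labels are correct, which is the partial credit the multi-label framing is meant to capture.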

In [7]:
# transform string stabilityVec into numeric
df['stabilityVec'] = df['stabilityVec'].apply(json.loads)

In [8]:
df['stabilityVec']

0       [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
1       [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
2       [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
3       [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
4       [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
                                 ...                           
2567    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
2568    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0]
2569    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
2570    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
2571    [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
Name: stabilityVec, Length: 2572, dtype: object

In [9]:
# convert every entry of each stability vector from float to int
df['stabilityVec'] = df['stabilityVec'].apply(lambda vec: [int(x) for x in vec])
df['stabilityVec'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 2572 entries, 0 to 2571
Series name: stabilityVec
Non-Null Count  Dtype 
--------------  ----- 
2572 non-null   object
dtypes: object(1)
memory usage: 20.2+ KB


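The following value counts show the problem with the multi-class approach: there are 121 distinct vectors, a single vector accounts for 1344 of the 2572 rows, and many vectors occur only once.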

In [10]:
df['stabilityVec'].value_counts()

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]    1344
[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]     168
[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]      98
[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]      81
[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1]      75
                                     ... 
[1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1]       1
[1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1]       1
[1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1]       1
[1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]       1
[1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1]       1
Name: stabilityVec, Length: 121, dtype: int64

We will now expand the stability vectors to create a multi-label situation (instead of multi-class).

In [11]:
# create new dataframe to house target columns
target_df = pd.DataFrame()

In [12]:
# expand each stabilityVec into one column per composition in the target dataframe
target_columns = ['100A_0B', '90A_10B', '80A_20B', '70A_30B', '60A_40B', '50A_50B',
                  '40A_60B', '30A_70B', '20A_80B', '10A_90B', '0A_100B']
target_df = pd.DataFrame(df['stabilityVec'].tolist(), columns=target_columns, index=df.index)

In [13]:
target_df.head()

Unnamed: 0,100A_0B,90A_10B,80A_20B,70A_30B,60A_40B,50A_50B,40A_60B,30A_70B,20A_80B,10A_90B,0A_100B
0,1,0,0,1,0,1,0,0,0,0,1
1,1,0,0,1,0,0,0,0,0,0,1
2,1,0,0,0,0,0,0,0,0,0,1
3,1,0,0,0,0,0,0,0,0,0,1
4,1,0,0,0,0,0,0,0,0,0,1


In [14]:
target_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2572 entries, 0 to 2571
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   100A_0B  2572 non-null   int64
 1   90A_10B  2572 non-null   int64
 2   80A_20B  2572 non-null   int64
 3   70A_30B  2572 non-null   int64
 4   60A_40B  2572 non-null   int64
 5   50A_50B  2572 non-null   int64
 6   40A_60B  2572 non-null   int64
 7   30A_70B  2572 non-null   int64
 8   20A_80B  2572 non-null   int64
 9   10A_90B  2572 non-null   int64
 10  0A_100B  2572 non-null   int64
dtypes: int64(11)
memory usage: 221.2 KB


In [15]:
#use summarytools to get a sense of the data
dfSummary(target_df)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,100A_0B [int64],1. 1,"2,572 (100.0%)",,0 (0.0%)
2,90A_10B [int64],1. 0 2. 1,"2,522 (98.1%) 50 (1.9%)",,0 (0.0%)
3,80A_20B [int64],1. 0 2. 1,"2,484 (96.6%) 88 (3.4%)",,0 (0.0%)
4,70A_30B [int64],1. 0 2. 1,"1,974 (76.7%) 598 (23.3%)",,0 (0.0%)
5,60A_40B [int64],1. 0 2. 1,"2,396 (93.2%) 176 (6.8%)",,0 (0.0%)
6,50A_50B [int64],1. 0 2. 1,"1,988 (77.3%) 584 (22.7%)",,0 (0.0%)
7,40A_60B [int64],1. 0 2. 1,"2,387 (92.8%) 185 (7.2%)",,0 (0.0%)
8,30A_70B [int64],1. 0 2. 1,"2,212 (86.0%) 360 (14.0%)",,0 (0.0%)
9,20A_80B [int64],1. 0 2. 1,"2,180 (84.8%) 392 (15.2%)",,0 (0.0%)
10,10A_90B [int64],1. 0 2. 1,"2,512 (97.7%) 60 (2.3%)",,0 (0.0%)


### Column Cleaning

After looking through the columns, a few things stand out.

1. The chemical formulas are already label encoded by the 'formulaA_elements_Number' and 'formulaB_elements_Number' columns, so 'formulaA' and 'formulaB' should be dropped.
2. The 100A_0B and 0A_100B target columns are constant, so they should also be dropped.

In [16]:
# create feature and target dataframes
feature_df = df.drop(columns=['formulaA', 'formulaB', 'stabilityVec'])
target_df.drop(columns=['100A_0B', '0A_100B'], inplace=True)

In [17]:
feature_df.head()

Unnamed: 0,formulaA_elements_AtomicVolume,formulaB_elements_AtomicVolume,formulaA_elements_AtomicWeight,formulaB_elements_AtomicWeight,formulaA_elements_BoilingT,formulaB_elements_BoilingT,formulaA_elements_BulkModulus,formulaB_elements_BulkModulus,formulaA_elements_Column,formulaB_elements_Column,...,formulaA_elements_Row,formulaB_elements_Row,formulaA_elements_ShearModulus,formulaB_elements_ShearModulus,formulaA_elements_SpaceGroupNumber,formulaB_elements_SpaceGroupNumber,avg_coordination_A,avg_coordination_B,avg_nearest_neighbor_distance_A,avg_nearest_neighbor_distance_B
0,37.433086,17.075648,227.0,107.8682,3473.0,2435.0,0.0,100.0,3,11,...,7,5,0.0,30.0,225,225,12.0,12.0,3.99462,2.94195
1,37.433086,16.594425,227.0,26.981539,3473.0,2792.0,0.0,76.0,3,13,...,7,3,0.0,26.0,225,225,12.0,12.0,3.99462,2.85595
2,37.433086,21.723966,227.0,74.9216,3473.0,887.0,0.0,22.0,3,15,...,7,4,0.0,0.0,225,166,12.0,3.0,3.99462,2.5579
3,37.433086,64.969282,227.0,137.327,3473.0,2143.0,0.0,9.6,3,2,...,7,6,0.0,4.9,225,229,12.0,8.0,3.99462,4.35637
4,37.433086,35.483459,227.0,208.9804,3473.0,1837.0,0.0,31.0,3,15,...,7,6,0.0,12.0,225,12,12.0,3.0,3.99462,3.11221


In [18]:
dfSummary(feature_df)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,formulaA_elements_AtomicVolume [float64],Mean (sd) : 2207.3 (8729.2) min < med < max: 7.3 < 26.1 < 37236.0 IQR (CV) : 18.9 (0.3),82 distinct values,,0 (0.0%)
2,formulaB_elements_AtomicVolume [float64],Mean (sd) : 2220.8 (8751.9) min < med < max: 7.3 < 27.0 < 37236.0 IQR (CV) : 18.9 (0.3),82 distinct values,,0 (0.0%)
3,formulaA_elements_AtomicWeight [float64],Mean (sd) : 112.3 (65.3) min < med < max: 4.0 < 107.9 < 238.0 IQR (CV) : 109.1 (1.7),82 distinct values,,0 (0.0%)
4,formulaB_elements_AtomicWeight [float64],Mean (sd) : 113.2 (65.9) min < med < max: 4.0 < 107.9 < 238.0 IQR (CV) : 111.4 (1.7),82 distinct values,,0 (0.0%)
5,formulaA_elements_BoilingT [float64],Mean (sd) : 2733.9 (1507.6) min < med < max: 4.1 < 2973.0 < 5869.0 IQR (CV) : 2211.0 (1.8),79 distinct values,,0 (0.0%)
6,formulaB_elements_BoilingT [float64],Mean (sd) : 2740.7 (1510.1) min < med < max: 4.1 < 2973.0 < 5869.0 IQR (CV) : 2061.2 (1.8),79 distinct values,,0 (0.0%)
7,formulaA_elements_BulkModulus [float64],Mean (sd) : 74.6 (93.8) min < med < max: 0.0 < 38.7 < 380.0 IQR (CV) : 103.7 (0.8),49 distinct values,,0 (0.0%)
8,formulaB_elements_BulkModulus [float64],Mean (sd) : 78.2 (96.1) min < med < max: 0.0 < 41.0 < 380.0 IQR (CV) : 112.3 (0.8),49 distinct values,,0 (0.0%)
9,formulaA_elements_Column [int64],Mean (sd) : 8.0 (5.5) min < med < max: 1.0 < 7.0 < 18.0 IQR (CV) : 10.0 (1.5),18 distinct values,,0 (0.0%)
10,formulaB_elements_Column [int64],Mean (sd) : 8.1 (5.5) min < med < max: 1.0 < 7.0 < 18.0 IQR (CV) : 10.0 (1.5),18 distinct values,,0 (0.0%)


In [19]:
target_df.head()

Unnamed: 0,90A_10B,80A_20B,70A_30B,60A_40B,50A_50B,40A_60B,30A_70B,20A_80B,10A_90B
0,0,0,1,0,1,0,0,0,0
1,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0


In [20]:
dfSummary(target_df)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,90A_10B [int64],1. 0 2. 1,"2,522 (98.1%) 50 (1.9%)",,0 (0.0%)
2,80A_20B [int64],1. 0 2. 1,"2,484 (96.6%) 88 (3.4%)",,0 (0.0%)
3,70A_30B [int64],1. 0 2. 1,"1,974 (76.7%) 598 (23.3%)",,0 (0.0%)
4,60A_40B [int64],1. 0 2. 1,"2,396 (93.2%) 176 (6.8%)",,0 (0.0%)
5,50A_50B [int64],1. 0 2. 1,"1,988 (77.3%) 584 (22.7%)",,0 (0.0%)
6,40A_60B [int64],1. 0 2. 1,"2,387 (92.8%) 185 (7.2%)",,0 (0.0%)
7,30A_70B [int64],1. 0 2. 1,"2,212 (86.0%) 360 (14.0%)",,0 (0.0%)
8,20A_80B [int64],1. 0 2. 1,"2,180 (84.8%) 392 (15.2%)",,0 (0.0%)
9,10A_90B [int64],1. 0 2. 1,"2,512 (97.7%) 60 (2.3%)",,0 (0.0%)


The target labels are heavily imbalanced: positives range from 1.9% of rows (90A_10B) to 23.3% (70A_30B). We will need techniques that compensate for this when we train the models later.
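
A short sketch shows why this imbalance matters for model selection. It uses the 90A_10B counts from the summary and a hypothetical model that always predicts 0 (no stable compound). Treating precision and recall as 0 when they are undefined is a convention assumed here.

```python
negatives, positives = 2522, 50  # 90A_10B counts from the summary table

# Confusion-matrix entries for a trivial model that always predicts 0.
tp, fp = 0, 0
fn, tn = positives, negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (
    2 * precision * recall / (precision + recall)
    if (precision + recall)
    else 0.0
)

print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

The trivial model scores about 98% accuracy while finding none of the stable compounds, which is why a metric such as F1 is more informative than accuracy for these labels.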

Before computing correlations, we should mark the categorical columns as such. They all hold numeric values, but the values stand in for categories (they are essentially label encoded).

In [21]:
categorical_columns = ['formulaA_elements_Column',
                       'formulaB_elements_Column',
                       'formulaA_elements_IsAlkali',
                       'formulaB_elements_IsAlkali',
                       'formulaA_elements_IsDBlock',
                       'formulaB_elements_IsDBlock',
                       'formulaA_elements_IsFBlock',
                       'formulaB_elements_IsFBlock',
                       'formulaA_elements_IsMetal',
                       'formulaB_elements_IsMetal',
                       'formulaA_elements_IsMetalloid',
                       'formulaB_elements_IsMetalloid',
                       'formulaA_elements_IsNonmetal',
                       'formulaB_elements_IsNonmetal',
                       'formulaA_elements_Number',
                       'formulaB_elements_Number',
                       'formulaA_elements_Row',
                       'formulaB_elements_Row',
                       'formulaA_elements_SpaceGroupNumber',
                       'formulaB_elements_SpaceGroupNumber']

In [22]:
# save numeric feature column names
numerical_columns = [col for col in feature_df.columns if col not in categorical_columns]
numerical_columns

['formulaA_elements_AtomicVolume',
 'formulaB_elements_AtomicVolume',
 'formulaA_elements_AtomicWeight',
 'formulaB_elements_AtomicWeight',
 'formulaA_elements_BoilingT',
 'formulaB_elements_BoilingT',
 'formulaA_elements_BulkModulus',
 'formulaB_elements_BulkModulus',
 'formulaA_elements_CovalentRadius',
 'formulaB_elements_CovalentRadius',
 'formulaA_elements_Density',
 'formulaB_elements_Density',
 'formulaA_elements_ElectronSurfaceDensityWS',
 'formulaB_elements_ElectronSurfaceDensityWS',
 'formulaA_elements_Electronegativity',
 'formulaB_elements_Electronegativity',
 'formulaA_elements_FirstIonizationEnergy',
 'formulaB_elements_FirstIonizationEnergy',
 'formulaA_elements_GSbandgap',
 'formulaB_elements_GSbandgap',
 'formulaA_elements_GSenergy_pa',
 'formulaB_elements_GSenergy_pa',
 'formulaA_elements_GSestBCClatcnt',
 'formulaB_elements_GSestBCClatcnt',
 'formulaA_elements_GSestFCClatcnt',
 'formulaB_elements_GSestFCClatcnt',
 'formulaA_elements_GSmagmom',
 'formulaB_elements_GSm

In [23]:
# set categorical features to dtype 'category'
feature_df[categorical_columns] = feature_df[categorical_columns].astype('category')
feature_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2572 entries, 0 to 2571
Data columns (total 96 columns):
 #   Column                                      Non-Null Count  Dtype   
---  ------                                      --------------  -----   
 0   formulaA_elements_AtomicVolume              2572 non-null   float64 
 1   formulaB_elements_AtomicVolume              2572 non-null   float64 
 2   formulaA_elements_AtomicWeight              2572 non-null   float64 
 3   formulaB_elements_AtomicWeight              2572 non-null   float64 
 4   formulaA_elements_BoilingT                  2572 non-null   float64 
 5   formulaB_elements_BoilingT                  2572 non-null   float64 
 6   formulaA_elements_BulkModulus               2572 non-null   float64 
 7   formulaB_elements_BulkModulus               2572 non-null   float64 
 8   formulaA_elements_Column                    2572 non-null   category
 9   formulaB_elements_Column                    2572 non-null   category
 10  

Now that the categorical features are marked, we can calculate the correlation matrix for the numeric features. For every pair of features whose absolute correlation exceeds 0.95, we will drop one of the two.
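
The following toy sketch shows why only the upper triangle of the correlation matrix is inspected: each pair is then seen once, so only one column of a highly correlated pair is flagged. The column names `a`, `b`, `c` and the data are made up for illustration, and the sketch uses plain NumPy rather than the pandas calls in the notebook cells.

```python
import numpy as np

names = ["a", "b", "c"]
X = np.array([
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 4.0],
    [4.0, 8.0, 2.0],
    [5.0, 10.1, 3.0],  # b is almost exactly 2 * a; c is unrelated
])

# Absolute pairwise correlations between columns.
corr = np.abs(np.corrcoef(X, rowvar=False))

# Keep entries strictly above the diagonal; everything else becomes 0.
upper = np.triu(corr, k=1)

# A column is flagged if it correlates above 0.95 with any earlier column.
flagged = [names[j] for j in range(len(names)) if (upper[:, j] > 0.95).any()]
print(flagged)
```

Only `b` is flagged, so `a` survives and the information shared by the pair is kept once.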

In [24]:
# calculate absolute pairwise correlations between the numeric features
corr = feature_df[numerical_columns].corr().abs()
corr.head()

Unnamed: 0,formulaA_elements_AtomicVolume,formulaB_elements_AtomicVolume,formulaA_elements_AtomicWeight,formulaB_elements_AtomicWeight,formulaA_elements_BoilingT,formulaB_elements_BoilingT,formulaA_elements_BulkModulus,formulaB_elements_BulkModulus,formulaA_elements_CovalentRadius,formulaB_elements_CovalentRadius,...,formulaA_elements_NsValence,formulaB_elements_NsValence,formulaA_elements_Polarizability,formulaB_elements_Polarizability,formulaA_elements_ShearModulus,formulaB_elements_ShearModulus,avg_coordination_A,avg_coordination_B,avg_nearest_neighbor_distance_A,avg_nearest_neighbor_distance_B
formulaA_elements_AtomicVolume,1.0,0.00651,0.232801,0.004795,0.441138,0.010947,0.199619,0.009338,0.418746,0.002134,...,0.114496,0.037405,0.271104,0.002626,0.169865,0.003178,0.140814,0.004554,0.240167,0.005904
formulaB_elements_AtomicVolume,0.00651,1.0,0.004134,0.210762,0.020419,0.442287,0.025315,0.204943,0.00699,0.412908,...,0.010565,0.111204,0.010091,0.269235,0.014997,0.173124,0.005553,0.12571,0.019055,0.247128
formulaA_elements_AtomicWeight,0.232801,0.004134,1.0,0.01025,0.368286,0.005871,0.027401,0.023877,0.579407,0.00466,...,0.129965,0.008922,0.303203,0.005826,0.153475,0.003599,0.212331,0.012079,0.297141,0.000139
formulaB_elements_AtomicWeight,0.004795,0.210762,0.01025,1.0,0.020574,0.353179,0.005178,0.04469,0.001971,0.565303,...,0.006155,0.111477,0.00267,0.294927,0.008845,0.180489,0.004403,0.226561,0.004687,0.30958
formulaA_elements_BoilingT,0.441138,0.020419,0.368286,0.020574,1.0,0.01585,0.555017,0.012688,0.277414,0.019266,...,0.057299,0.009183,0.129726,0.027374,0.529127,0.010422,0.254951,0.014152,0.292845,0.00615


In [25]:
# display correlation matrix heatmap
fig = px.imshow(corr, text_auto=True, width=1500, height=1500)
fig.show()

In [26]:
# flag one column from every pair correlated above 0.95 (upper triangle only)
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
to_drop

['formulaA_elements_GSestFCClatcnt',
 'formulaB_elements_GSestFCClatcnt',
 'formulaA_elements_GSvolume_pa',
 'formulaB_elements_GSvolume_pa']

In [27]:
# drop highly correlated columns from dataframe
feature_df_lowcorr = feature_df.drop(columns=to_drop)
feature_df_lowcorr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2572 entries, 0 to 2571
Data columns (total 92 columns):
 #   Column                                      Non-Null Count  Dtype   
---  ------                                      --------------  -----   
 0   formulaA_elements_AtomicVolume              2572 non-null   float64 
 1   formulaB_elements_AtomicVolume              2572 non-null   float64 
 2   formulaA_elements_AtomicWeight              2572 non-null   float64 
 3   formulaB_elements_AtomicWeight              2572 non-null   float64 
 4   formulaA_elements_BoilingT                  2572 non-null   float64 
 5   formulaB_elements_BoilingT                  2572 non-null   float64 
 6   formulaA_elements_BulkModulus               2572 non-null   float64 
 7   formulaB_elements_BulkModulus               2572 non-null   float64 
 8   formulaA_elements_Column                    2572 non-null   category
 9   formulaB_elements_Column                    2572 non-null   category
 10  

In [28]:
# update list of numerical column names (safe to re-run)
numerical_columns = [col for col in numerical_columns if col not in to_drop]
numerical_columns

['formulaA_elements_AtomicVolume',
 'formulaB_elements_AtomicVolume',
 'formulaA_elements_AtomicWeight',
 'formulaB_elements_AtomicWeight',
 'formulaA_elements_BoilingT',
 'formulaB_elements_BoilingT',
 'formulaA_elements_BulkModulus',
 'formulaB_elements_BulkModulus',
 'formulaA_elements_CovalentRadius',
 'formulaB_elements_CovalentRadius',
 'formulaA_elements_Density',
 'formulaB_elements_Density',
 'formulaA_elements_ElectronSurfaceDensityWS',
 'formulaB_elements_ElectronSurfaceDensityWS',
 'formulaA_elements_Electronegativity',
 'formulaB_elements_Electronegativity',
 'formulaA_elements_FirstIonizationEnergy',
 'formulaB_elements_FirstIonizationEnergy',
 'formulaA_elements_GSbandgap',
 'formulaB_elements_GSbandgap',
 'formulaA_elements_GSenergy_pa',
 'formulaB_elements_GSenergy_pa',
 'formulaA_elements_GSestBCClatcnt',
 'formulaB_elements_GSestBCClatcnt',
 'formulaA_elements_GSmagmom',
 'formulaB_elements_GSmagmom',
 'formulaA_elements_HHIp',
 'formulaB_elements_HHIp',
 'formulaA_e

# Data Modeling

## Train and Test Set

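Our labels are heavily imbalanced, so we will use a stratified splitter designed for multilabel data (from the iterative-stratification package). This keeps the relative label proportions similar between the train and test sets. With `n_splits=5`, each fold holds out about 20% of the rows; the loop overwrites the split variables on every iteration, so the last fold ends up as our train/test split.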

In [29]:
# perform a stratified train test split
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=123)

# each iteration overwrites the previous split, so the last fold is the one kept
for train_index, test_index in mskf.split(feature_df_lowcorr, target_df):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = feature_df_lowcorr.iloc[train_index], feature_df_lowcorr.iloc[test_index]
    y_train, y_test = target_df.iloc[train_index], target_df.iloc[test_index]

TRAIN: [   0    1    2 ... 2568 2570 2571] TEST: [   3   11   20   21   27   29   30   31   36   42   51   52   71   72
   75   78   80   84   87   98  104  106  118  121  122  128  133  134
  144  145  146  153  154  157  164  166  173  176  180  184  189  190
  193  204  208  216  217  237  242  244  261  267  269  280  282  286
  287  293  299  301  311  322  323  325  335  344  354  356  358  368
  377  381  386  389  398  400  410  411  417  425  428  430  434  438
  440  441  445  449  460  462  468  472  475  476  478  480  488  496
  499  501  504  507  511  514  516  522  531  534  536  542  545  550
  562  568  573  579  582  589  591  596  597  600  609  615  631  632
  643  644  645  651  652  653  654  659  662  683  690  692  696  697
  709  713  724  725  727  728  736  740  746  748  750  751  754  757
  766  768  769  774  783  786  790  795  797  801  809  813  819  836
  847  851  852  857  858  864  884  885  903  904  911  918  923  926
  929  935  937  946  952  9

In [30]:
y_train

Unnamed: 0,90A_10B,80A_20B,70A_30B,60A_40B,50A_50B,40A_60B,30A_70B,20A_80B,10A_90B
0,0,0,1,0,1,0,0,0,0
1,0,0,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
2567,0,0,0,0,0,0,0,0,0
2568,0,0,0,0,0,0,0,1,0
2569,0,0,0,0,0,0,0,0,0
2570,0,0,0,0,0,0,0,0,0


In [31]:
X_train

Unnamed: 0,formulaA_elements_AtomicVolume,formulaB_elements_AtomicVolume,formulaA_elements_AtomicWeight,formulaB_elements_AtomicWeight,formulaA_elements_BoilingT,formulaB_elements_BoilingT,formulaA_elements_BulkModulus,formulaB_elements_BulkModulus,formulaA_elements_Column,formulaB_elements_Column,...,formulaA_elements_Row,formulaB_elements_Row,formulaA_elements_ShearModulus,formulaB_elements_ShearModulus,formulaA_elements_SpaceGroupNumber,formulaB_elements_SpaceGroupNumber,avg_coordination_A,avg_coordination_B,avg_nearest_neighbor_distance_A,avg_nearest_neighbor_distance_B
0,37.433086,17.075648,227.000,107.868200,3473.0,2435.0,0.0,100.0,3,11,...,7,5,0.0,30.0,225,225,12.0,12.0,3.99462,2.94195
1,37.433086,16.594425,227.000,26.981539,3473.0,2792.0,0.0,76.0,3,13,...,7,3,0.0,26.0,225,225,12.0,12.0,3.99462,2.85595
3,37.433086,64.969282,227.000,137.327000,3473.0,2143.0,0.0,9.6,3,2,...,7,6,0.0,4.9,225,229,12.0,8.0,3.99462,4.35637
4,37.433086,35.483459,227.000,208.980400,3473.0,1837.0,0.0,31.0,3,15,...,7,6,0.0,12.0,225,12,12.0,3.0,3.99462,3.11221
6,37.433086,8.825090,227.000,12.010700,3473.0,4300.0,0.0,33.0,3,14,...,7,2,0.0,0.0,225,194,12.0,3.0,3.99462,1.42409
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2567,23.265943,32.865683,91.224,232.038060,4682.0,5093.0,0.0,54.0,4,3,...,5,7,33.0,31.0,194,225,12.0,12.0,3.19147,3.56059
2568,23.265943,28.640877,91.224,204.383300,4682.0,1746.0,0.0,43.0,4,13,...,5,6,33.0,2.8,194,194,12.0,8.0,3.19147,3.43253
2569,23.265943,13.844898,91.224,50.941500,4682.0,3680.0,0.0,160.0,4,5,...,5,4,33.0,47.0,194,229,12.0,8.0,3.19147,2.59229
2570,23.265943,36952.924020,91.224,131.293000,4682.0,165.0,0.0,0.0,4,18,...,5,5,33.0,0.0,194,225,12.0,12.0,3.19147,4.85032


We should compare the class proportions of the original target dataframe against those of the test data to ensure the split worked as intended.
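
As a compact numeric complement to the summary tables, the comparison can be reduced to one number: the largest gap in positive rate across labels. The helper names `positive_rates` and `max_rate_gap` and the toy rows are hypothetical, written in plain Python for illustration.

```python
def positive_rates(rows):
    """Fraction of 1s in each label column of a list of 0/1 rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def max_rate_gap(full_rows, subset_rows):
    """Largest absolute difference in positive rate over all labels."""
    full = positive_rates(full_rows)
    subset = positive_rates(subset_rows)
    return max(abs(f - s) for f, s in zip(full, subset))


# Toy data with two labels: a well-stratified subset and a skewed one.
full = [[1, 0], [0, 0], [1, 1], [0, 0], [1, 0], [0, 1], [1, 0], [0, 0]]
stratified = [[1, 0], [0, 0], [1, 1], [0, 0]]
skewed = [[1, 0], [1, 1], [1, 0], [1, 0]]

print(max_rate_gap(full, stratified))  # small gap: proportions preserved
print(max_rate_gap(full, skewed))      # large gap: first label over-represented
```

With the notebook's dataframes, an equivalent check would be `(target_df.mean() - y_test.mean()).abs().max()`, since the mean of a 0/1 column is its positive rate.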

In [32]:
dfSummary(target_df)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,90A_10B [int64],1. 0 2. 1,"2,522 (98.1%) 50 (1.9%)",,0 (0.0%)
2,80A_20B [int64],1. 0 2. 1,"2,484 (96.6%) 88 (3.4%)",,0 (0.0%)
3,70A_30B [int64],1. 0 2. 1,"1,974 (76.7%) 598 (23.3%)",,0 (0.0%)
4,60A_40B [int64],1. 0 2. 1,"2,396 (93.2%) 176 (6.8%)",,0 (0.0%)
5,50A_50B [int64],1. 0 2. 1,"1,988 (77.3%) 584 (22.7%)",,0 (0.0%)
6,40A_60B [int64],1. 0 2. 1,"2,387 (92.8%) 185 (7.2%)",,0 (0.0%)
7,30A_70B [int64],1. 0 2. 1,"2,212 (86.0%) 360 (14.0%)",,0 (0.0%)
8,20A_80B [int64],1. 0 2. 1,"2,180 (84.8%) 392 (15.2%)",,0 (0.0%)
9,10A_90B [int64],1. 0 2. 1,"2,512 (97.7%) 60 (2.3%)",,0 (0.0%)


In [33]:
dfSummary(y_test)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,90A_10B [int64],1. 0 2. 1,504 (98.1%) 10 (1.9%),,0 (0.0%)
2,80A_20B [int64],1. 0 2. 1,496 (96.5%) 18 (3.5%),,0 (0.0%)
3,70A_30B [int64],1. 0 2. 1,394 (76.7%) 120 (23.3%),,0 (0.0%)
4,60A_40B [int64],1. 0 2. 1,479 (93.2%) 35 (6.8%),,0 (0.0%)
5,50A_50B [int64],1. 0 2. 1,397 (77.2%) 117 (22.8%),,0 (0.0%)
6,40A_60B [int64],1. 0 2. 1,477 (92.8%) 37 (7.2%),,0 (0.0%)
7,30A_70B [int64],1. 0 2. 1,442 (86.0%) 72 (14.0%),,0 (0.0%)
8,20A_80B [int64],1. 0 2. 1,435 (84.6%) 79 (15.4%),,0 (0.0%)
9,10A_90B [int64],1. 0 2. 1,502 (97.7%) 12 (2.3%),,0 (0.0%)


The class proportions are very close, so it is safe to proceed.

## Pycaret

### Model Comparison and Hyperparameter Tuning

PyCaret is a low-code ML library that speeds up ML workflows. It can compare 10+ algorithms in a single command using cross-validation; we will select the top-performing model for each label. PyCaret will also handle some of our preprocessing steps, such as feature transformation and normalization, which most linear and nearest-neighbor models need in order to work well. Because our classes are heavily imbalanced, we will also use its support for over- and undersampling techniques such as SMOTE. To summarize, PyCaret will handle:

1. Feature transformation
2. Normalization/standardization
3. SMOTE
4. Model comparison
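
Because one model is selected per label, each training run uses a single target column and must hide the other eight label columns from the model, otherwise they would leak label information into the features. A minimal sketch of that bookkeeping follows; the helper name `ignore_list_for` is hypothetical and the PyCaret call is only indicated in a comment.

```python
target_names = ['90A_10B', '80A_20B', '70A_30B', '60A_40B', '50A_50B',
                '40A_60B', '30A_70B', '20A_80B', '10A_90B']


def ignore_list_for(current_target, all_targets):
    """Label columns to hide from the model when training on current_target."""
    return [name for name in all_targets if name != current_target]


for current_target in target_names:
    features_to_ignore = ignore_list_for(current_target, target_names)
    # A per-target run would go here, passing target=current_target and
    # ignore_features=features_to_ignore to PyCaret's setup().
    print(current_target, "ignores", len(features_to_ignore), "other labels")
```

Building a new list for each target, rather than removing items from a shared list, leaves `target_names` intact so the loop can be re-run safely.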

#### Pycaret Train and Test Set Setup

In [35]:
# create combined training set

pycaret_train = pd.concat([X_train, y_train], axis='columns')
pycaret_train.head()

Unnamed: 0,formulaA_elements_AtomicVolume,formulaB_elements_AtomicVolume,formulaA_elements_AtomicWeight,formulaB_elements_AtomicWeight,formulaA_elements_BoilingT,formulaB_elements_BoilingT,formulaA_elements_BulkModulus,formulaB_elements_BulkModulus,formulaA_elements_Column,formulaB_elements_Column,...,avg_nearest_neighbor_distance_B,90A_10B,80A_20B,70A_30B,60A_40B,50A_50B,40A_60B,30A_70B,20A_80B,10A_90B
0,37.433086,17.075648,227.0,107.8682,3473.0,2435.0,0.0,100.0,3,11,...,2.94195,0,0,1,0,1,0,0,0,0
1,37.433086,16.594425,227.0,26.981539,3473.0,2792.0,0.0,76.0,3,13,...,2.85595,0,0,1,0,0,0,0,0,0
3,37.433086,64.969282,227.0,137.327,3473.0,2143.0,0.0,9.6,3,2,...,4.35637,0,0,0,0,0,0,0,0,0
4,37.433086,35.483459,227.0,208.9804,3473.0,1837.0,0.0,31.0,3,15,...,3.11221,0,0,0,0,0,0,0,0,0
6,37.433086,8.82509,227.0,12.0107,3473.0,4300.0,0.0,33.0,3,14,...,1.42409,0,0,0,0,0,0,0,0,0


In [36]:
pycaret_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2058 entries, 0 to 2571
Columns: 101 entries, formulaA_elements_AtomicVolume to 10A_90B
dtypes: category(20), float64(42), int64(39)
memory usage: 1.3 MB


In [37]:
# create combined test set

pycaret_test = pd.concat([X_test, y_test], axis='columns')
pycaret_test.head()

Unnamed: 0,formulaA_elements_AtomicVolume,formulaB_elements_AtomicVolume,formulaA_elements_AtomicWeight,formulaB_elements_AtomicWeight,formulaA_elements_BoilingT,formulaB_elements_BoilingT,formulaA_elements_BulkModulus,formulaB_elements_BulkModulus,formulaA_elements_Column,formulaB_elements_Column,...,avg_nearest_neighbor_distance_B,90A_10B,80A_20B,70A_30B,60A_40B,50A_50B,40A_60B,30A_70B,20A_80B,10A_90B
2,37.433086,21.723966,227.0,74.9216,3473.0,887.0,0.0,22.0,3,15,...,2.5579,0,0,0,0,0,0,0,0,0
5,37.433086,42.527825,227.0,79.904,3473.0,332.0,0.0,1.9,3,17,...,2.38875,0,0,1,0,0,0,0,0,0
9,37.433086,11.777365,227.0,55.845,3473.0,3134.0,0.0,170.0,3,8,...,2.46654,0,0,0,0,0,0,0,0,0
10,37.433086,26.082658,227.0,114.818,3473.0,2345.0,0.0,0.0,3,13,...,3.33944,0,0,1,0,0,0,0,1,0
28,37.433086,28.640877,227.0,204.3833,3473.0,1746.0,0.0,43.0,3,13,...,3.43253,0,0,0,0,0,0,0,1,0


#### Pycaret Modelling

In [None]:
# this cell is used for comparative model generation

# set feature and target names for easy setup
feature_names = ['90A_10B', '80A_20B', '70A_30B', '60A_40B', '50A_50B', '40A_60B', '30A_70B', '20A_80B', '10A_90B']
current_target = '90A_10B'
feature_names.remove(current_target)
features_to_ignore = feature_names

# track training and model data using wandb

wandb.init(project="binary_stability_prediction")

# Pycaret initial setup function
s = setup(
          pycaret_train,
          target = current_target,
          ignore_features=features_to_ignore,
          categorical_features=categorical_columns,
          numeric_features=numerical_columns,
          max_encoding_ohe = 1,
          fix_imbalance=True,
          fix_imbalance_method='SMOTETomek',
          test_data=pycaret_test,
          transformation=True,
          normalize=True,
          log_experiment="wandb",
          log_plots= True,
          log_data=True,
          experiment_name=current_target + " Multilabel Manual Transform",
          session_id=123
          )

# generate comparative models
best_model = compare_models(sort='F1')

# tune hyperparameters
best_model = tune_model(best_model, optimize='F1', search_library='optuna')

# save model for later use
save_model(best_model, current_target + '_Pipeline')

# construct SHAP interpretation
interpret_model(best_model, save=True)

# log SHAP chart to wandb
wandb.log({"shap_values": wandb.Image('SHAP summary.png')})

#finish wandb run
wandb.finish()

### Weights and Biases Data Viewers

Weights and Biases is an online MLOps platform for experiment tracking. We will also use it to present results.

#### Full Project Dashboard

In [38]:
# load wandb dashboard for data viewing

%wandb alanfiler/binary_stability_prediction -h 1080

#### Specific Reports

##### Algorithm Comparison

In [39]:
# algorithm comparison

%wandb alanfiler/binary_stability_prediction/reports/Algorithm-Comparison--Vmlldzo0OTIxMDMw

##### Model Training Scores By Class

In [40]:
# model training scores by class

%wandb alanfiler/binary_stability_prediction/reports/Model-Training-Scores-by-Class--Vmlldzo0OTIwODgx

##### Imbalanced class correction technique comparison

In [None]:
# imbalanced class correction technique comparison

%wandb alanfiler/binary_stability_prediction/reports/Imbalanced-Class-Correction-Technique-Comparison--Vmlldzo0OTIwOTMx

##### Various Charts

In [None]:
%wandb alanfiler/binary_stability_prediction/reports/Charts--Vmlldzo0OTIwOTcw

### Additional Model Generators

In [None]:
# same setup as above, but uses create_model instead of compare_models
# to build a single model

feature_names = ['90A_10B', '80A_20B', '70A_30B', '60A_40B', '50A_50B', '40A_60B', '30A_70B', '20A_80B', '10A_90B']
current_target = '90A_10B'
feature_names.remove(current_target)
features_to_ignore = feature_names

wandb.init(project="binary_stability_prediction")
s = setup(
          pycaret_train,
          target = current_target,
          ignore_features=features_to_ignore,
          categorical_features=categorical_columns,
          numeric_features=numerical_columns,
          max_encoding_ohe = 1,
          fix_imbalance=True,
          fix_imbalance_method='SMOTETomek',
          test_data=pycaret_test,
          transformation=True,
          normalize=True,
          log_experiment="wandb",
          log_plots= True,
          log_data=True,
          experiment_name=current_target + " Multilabel Stratified KNN predict model",
          session_id=123
          )
best_model = create_model('rf')
interpret_model(best_model, save=True)
wandb.log({"shap_values": wandb.Image('SHAP summary.png')})
exp_y_pred = predict_model(best_model, data=None)
wandb.finish()

In [52]:
# this cell wraps the steps above in a for loop to train one model per class, in order
# wandb logging is omitted since it doesn't seem to work well with multiple training loops
from pycaret.classification import save_model

feature_names_forloop = ['90A_10B', '80A_20B', '70A_30B', '60A_40B', '50A_50B', '40A_60B', '30A_70B', '20A_80B', '10A_90B']

for i in feature_names_forloop:
    feature_names = ['90A_10B', '80A_20B', '70A_30B', '60A_40B', '50A_50B', '40A_60B', '30A_70B', '20A_80B', '10A_90B']
    current_target = i
    feature_names.remove(current_target)
    features_to_ignore = feature_names

    s = setup(
          pycaret_train,
          target = current_target,
          ignore_features=features_to_ignore,
          categorical_features=categorical_columns,
          numeric_features=numerical_columns,
          max_encoding_ohe = 1,
          fix_imbalance=True,
          fix_imbalance_method='SMOTETomek',
          test_data=pycaret_test,
          transformation=True,
          normalize=True,
          session_id=123
          )
    best_model = compare_models(sort='F1')
    best_model = tune_model(best_model, optimize='F1', search_library='optuna')
    save_model(best_model, current_target + '_Pipeline')



Unnamed: 0,Description,Value
0,Session id,123
1,Target,90A_10B
2,Target type,Binary
3,Original data shape,"(2572, 101)"
4,Transformed data shape,"(4550, 93)"
5,Transformed train set shape,"(4036, 93)"
6,Transformed test set shape,"(514, 93)"
7,Ignore features,8
8,Ordinal features,12
9,Numeric features,72


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.9772,0.8709,0.45,0.4743,0.4191,0.4085,0.4287,0.332
rf,Random Forest Classifier,0.9796,0.8695,0.375,0.55,0.4125,0.4029,0.4256,0.304
lightgbm,Light Gradient Boosting Machine,0.9733,0.9061,0.325,0.3889,0.3224,0.31,0.3258,0.283
gbc,Gradient Boosting Classifier,0.9645,0.8935,0.375,0.3529,0.322,0.3066,0.3253,0.283
knn,K Neighbors Classifier,0.9334,0.7939,0.6,0.1746,0.2662,0.2436,0.2969,0.65
ada,Ada Boost Classifier,0.9388,0.841,0.375,0.2899,0.2633,0.2443,0.2716,0.284
lda,Linear Discriminant Analysis,0.9267,0.8356,0.525,0.2006,0.2595,0.2367,0.279,0.286
svm,SVM - Linear Kernel,0.9388,0.0,0.475,0.1953,0.2566,0.2348,0.2684,0.27
lr,Logistic Regression,0.931,0.8522,0.5,0.1934,0.2536,0.2314,0.2694,0.38
dt,Decision Tree Classifier,0.9495,0.6189,0.275,0.1905,0.2227,0.2051,0.2065,0.317


Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.9709,0.9777,0.5,0.3333,0.4,0.3857,0.394
1,0.9806,0.927,0.25,0.5,0.3333,0.3246,0.3448
2,0.966,0.9814,1.0,0.3636,0.5333,0.5197,0.5925
3,0.9709,0.9938,1.0,0.4,0.5714,0.5592,0.623
4,0.9272,0.8614,0.5,0.1333,0.2105,0.1856,0.2314
5,0.9515,0.8205,0.25,0.125,0.1667,0.1445,0.1538
6,0.9563,0.7246,0.25,0.1429,0.1818,0.1611,0.1678
7,0.9709,0.9183,0.25,0.25,0.25,0.2351,0.2351
8,0.9707,0.949,0.75,0.375,0.5,0.4866,0.5179
9,0.961,0.9838,1.0,0.3333,0.5,0.4849,0.5657


[I 2023-09-07 06:31:36,981] Searching the best hyperparameters using 2058 samples...
[I 2023-09-07 06:34:21,531] Finished hyperparemeter search!


Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
Transformation Pipeline and Model Successfully Saved


Unnamed: 0,Description,Value
0,Session id,123
1,Target,80A_20B
2,Target type,Binary
3,Original data shape,"(2572, 101)"
4,Transformed data shape,"(4490, 93)"
5,Transformed train set shape,"(3976, 93)"
6,Transformed test set shape,"(514, 93)"
7,Ignore features,8
8,Ordinal features,12
9,Numeric features,72


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.9538,0.7583,0.2286,0.3116,0.2536,0.2311,0.2389,0.353
rf,Random Forest Classifier,0.95,0.7878,0.2429,0.3347,0.2475,0.2237,0.2407,0.591
gbc,Gradient Boosting Classifier,0.9213,0.7794,0.2571,0.1923,0.209,0.1751,0.1796,0.337
knn,K Neighbors Classifier,0.8761,0.7307,0.4143,0.1333,0.1994,0.155,0.1831,0.313
lightgbm,Light Gradient Boosting Machine,0.9315,0.7573,0.1857,0.1846,0.1724,0.1409,0.1454,0.3
dt,Decision Tree Classifier,0.8984,0.6097,0.3,0.1161,0.1609,0.1197,0.1381,0.605
ada,Ada Boost Classifier,0.8814,0.741,0.2857,0.1248,0.1578,0.1142,0.1291,0.318
nb,Naive Bayes,0.6939,0.7432,0.7143,0.0797,0.1427,0.0868,0.163,0.293
lr,Logistic Regression,0.8736,0.7281,0.2,0.1036,0.1166,0.072,0.0789,0.325
ridge,Ridge Classifier,0.862,0.0,0.2,0.0603,0.0915,0.0419,0.05,0.297


Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.9369,0.911,0.7143,0.3125,0.4348,0.4067,0.4461
1,0.9417,0.9289,0.8571,0.3529,0.5,0.4747,0.528
2,0.8592,0.6181,0.1429,0.0417,0.0645,0.0126,0.0154
3,0.8883,0.9483,0.8571,0.2143,0.3429,0.3051,0.3947
4,0.7718,0.6762,0.2857,0.0455,0.0784,0.021,0.033
5,0.8738,0.7337,0.0,0.0,0.0,-0.0523,-0.0598
6,0.8155,0.8521,0.8571,0.1395,0.24,0.1928,0.2992
7,0.8932,0.9081,0.5714,0.1739,0.2667,0.2264,0.2738
8,0.8683,0.7814,0.4286,0.1154,0.1818,0.1353,0.1705
9,0.8732,0.9538,1.0,0.2121,0.35,0.3112,0.4293


[I 2023-09-07 06:35:37,987] Searching the best hyperparameters using 2058 samples...
[I 2023-09-07 06:38:38,172] Finished hyperparemeter search!


Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
Transformation Pipeline and Model Successfully Saved


Unnamed: 0,Description,Value
0,Session id,123
1,Target,70A_30B
2,Target type,Binary
3,Original data shape,"(2572, 101)"
4,Transformed data shape,"(3598, 93)"
5,Transformed train set shape,"(3084, 93)"
6,Transformed test set shape,"(514, 93)"
7,Ignore features,8
8,Ordinal features,12
9,Numeric features,72


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
knn,K Neighbors Classifier,0.761,0.821,0.7703,0.4934,0.5993,0.4412,0.4658,0.34
et,Extra Trees Classifier,0.7955,0.8349,0.5991,0.556,0.5739,0.4402,0.4427,0.349
rf,Random Forest Classifier,0.7727,0.817,0.593,0.5143,0.5476,0.398,0.4018,0.36
nb,Naive Bayes,0.7017,0.7668,0.7579,0.4234,0.5403,0.3461,0.3815,0.32
gbc,Gradient Boosting Classifier,0.7401,0.783,0.6223,0.4692,0.5269,0.3561,0.3676,0.339
lightgbm,Light Gradient Boosting Machine,0.7538,0.7759,0.5597,0.4815,0.5106,0.3495,0.3555,0.623
ridge,Ridge Classifier,0.7037,0.0,0.555,0.41,0.4686,0.2726,0.2789,0.35
lda,Linear Discriminant Analysis,0.7008,0.6718,0.5383,0.4101,0.4622,0.2646,0.2687,0.334
lr,Logistic Regression,0.6862,0.6821,0.5153,0.3781,0.4309,0.2252,0.2314,0.368
dt,Decision Tree Classifier,0.6862,0.6225,0.5033,0.3831,0.4306,0.2245,0.2281,0.328


Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.6748,0.7236,0.625,0.3797,0.4724,0.2571,0.2738
1,0.7476,0.8902,0.8958,0.4778,0.6232,0.4587,0.51
2,0.699,0.8378,0.8125,0.4239,0.5571,0.3617,0.4057
3,0.733,0.8514,0.9167,0.4632,0.6154,0.4429,0.5036
4,0.767,0.8775,0.8542,0.5,0.6308,0.477,0.5136
5,0.7573,0.8459,0.7917,0.4872,0.6032,0.4423,0.4693
6,0.6456,0.7673,0.6458,0.3563,0.4593,0.2272,0.2494
7,0.7427,0.8678,0.8542,0.4713,0.6074,0.4389,0.4819
8,0.7659,0.9006,0.9362,0.4944,0.6471,0.4957,0.5524
9,0.8585,0.9582,0.9787,0.6216,0.7603,0.6669,0.7015


[I 2023-09-07 06:39:59,695] Searching the best hyperparameters using 2058 samples...
[I 2023-09-07 06:41:41,795] Finished hyperparemeter search!


Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
Transformation Pipeline and Model Successfully Saved


Unnamed: 0,Description,Value
0,Session id,123
1,Target,60A_40B
2,Target type,Binary
3,Original data shape,"(2572, 101)"
4,Transformed data shape,"(4336, 93)"
5,Transformed train set shape,"(3828, 93)"
6,Transformed test set shape,"(514, 93)"
7,Ignore features,8
8,Ordinal features,12
9,Numeric features,72


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
knn,K Neighbors Classifier,0.8333,0.7999,0.6143,0.2446,0.3381,0.2672,0.3108,0.434
gbc,Gradient Boosting Classifier,0.8785,0.8342,0.4305,0.3225,0.3329,0.277,0.2969,0.93
rf,Random Forest Classifier,0.897,0.8126,0.3652,0.3849,0.3274,0.2788,0.3021,0.552
et,Extra Trees Classifier,0.8873,0.8053,0.3648,0.339,0.3037,0.2519,0.2748,0.52
lightgbm,Light Gradient Boosting Machine,0.9018,0.8073,0.3024,0.3985,0.3033,0.257,0.2788,1.788
nb,Naive Bayes,0.7881,0.7656,0.6367,0.206,0.3023,0.2216,0.273,0.432
ada,Ada Boost Classifier,0.861,0.7476,0.3452,0.3037,0.2724,0.208,0.2293,0.74
ridge,Ridge Classifier,0.8236,0.0,0.3967,0.2073,0.2634,0.186,0.1971,0.401
lr,Logistic Regression,0.8314,0.7465,0.3824,0.2124,0.2607,0.1855,0.1968,0.585
dt,Decision Tree Classifier,0.8606,0.6248,0.3519,0.2425,0.2522,0.1884,0.2081,0.452


Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.8058,0.7865,0.8571,0.24,0.375,0.3007,0.387
1,0.8058,0.9129,0.7857,0.2292,0.3548,0.279,0.353
2,0.8398,0.8826,0.7857,0.2683,0.4,0.3324,0.3968
3,0.7087,0.8103,0.7857,0.1618,0.2683,0.1753,0.2616
4,0.9029,0.8158,0.5,0.35,0.4118,0.3606,0.3675
5,0.8495,0.8616,0.6429,0.2571,0.3673,0.2993,0.3401
6,0.8398,0.8599,0.8571,0.2791,0.4211,0.3549,0.4308
7,0.6699,0.9455,1.0,0.1807,0.3061,0.2085,0.3411
8,0.8,0.79,0.7143,0.2128,0.3279,0.2488,0.3124
9,0.8732,0.8044,0.5714,0.2857,0.381,0.3189,0.3428


[I 2023-09-07 06:43:38,550] Searching the best hyperparameters using 2058 samples...
[I 2023-09-07 06:46:12,601] Finished hyperparemeter search!


Transformation Pipeline and Model Successfully Saved


Unnamed: 0,Description,Value
0,Session id,123
1,Target,50A_50B
2,Target type,Binary
3,Original data shape,"(2572, 101)"
4,Transformed data shape,"(3624, 93)"
5,Transformed train set shape,"(3102, 93)"
6,Transformed test set shape,"(514, 93)"
7,Ignore features,8
8,Ordinal features,12
9,Numeric features,72


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.8227,0.8814,0.6149,0.6429,0.6116,0.4996,0.5113,0.575
knn,K Neighbors Classifier,0.7809,0.8443,0.7413,0.5246,0.6087,0.4653,0.4829,0.458
lightgbm,Light Gradient Boosting Machine,0.8271,0.8644,0.6221,0.632,0.6042,0.4982,0.5109,2.043
rf,Random Forest Classifier,0.8135,0.8791,0.6131,0.6169,0.5931,0.4756,0.4896,0.611
gbc,Gradient Boosting Classifier,0.7863,0.8321,0.6221,0.5364,0.5608,0.4241,0.4368,0.796
dt,Decision Tree Classifier,0.7265,0.6705,0.5678,0.4277,0.4837,0.3048,0.3131,0.683
nb,Naive Bayes,0.6822,0.7147,0.6147,0.3749,0.4596,0.2539,0.2764,0.448
ada,Ada Boost Classifier,0.6964,0.7055,0.526,0.3702,0.4223,0.2287,0.2445,0.556
qda,Quadratic Discriminant Analysis,0.6706,0.6592,0.4952,0.3247,0.381,0.1736,0.1879,0.467
lr,Logistic Regression,0.673,0.665,0.4159,0.325,0.354,0.148,0.1524,0.581


Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.7573,0.858,0.7447,0.4795,0.5833,0.4232,0.4437
1,0.7621,0.8333,0.7234,0.4857,0.5812,0.4239,0.4403
2,0.8204,0.9026,0.8511,0.5714,0.6838,0.565,0.5869
3,0.7767,0.8488,0.7447,0.5072,0.6034,0.4557,0.472
4,0.8107,0.865,0.5532,0.5909,0.5714,0.4501,0.4505
5,0.9078,0.9295,0.6596,0.9118,0.7654,0.7099,0.7243
6,0.7524,0.8193,0.5745,0.4655,0.5143,0.3506,0.3541
7,0.7136,0.8912,0.9348,0.4343,0.5931,0.4146,0.4874
8,0.8537,0.9003,0.6304,0.6905,0.6591,0.5662,0.5671
9,0.8878,0.94,0.6739,0.7949,0.7294,0.6592,0.6628


[I 2023-09-07 06:48:15,063] Searching the best hyperparameters using 2058 samples...
[I 2023-09-07 06:51:30,965] Finished hyperparemeter search!


Transformation Pipeline and Model Successfully Saved


Unnamed: 0,Description,Value
0,Session id,123
1,Target,40A_60B
2,Target type,Binary
3,Original data shape,"(2572, 101)"
4,Transformed data shape,"(4320, 93)"
5,Transformed train set shape,"(3818, 93)"
6,Transformed test set shape,"(514, 93)"
7,Ignore features,8
8,Ordinal features,12
9,Numeric features,72


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.8888,0.7731,0.3371,0.389,0.3281,0.2752,0.2898,0.665
knn,K Neighbors Classifier,0.8285,0.7533,0.5519,0.2358,0.3225,0.247,0.279,0.568
rf,Random Forest Classifier,0.8888,0.752,0.2971,0.4155,0.299,0.2477,0.2702,0.72
nb,Naive Bayes,0.7596,0.7094,0.5686,0.2032,0.2719,0.1828,0.2264,0.555
gbc,Gradient Boosting Classifier,0.848,0.7661,0.3386,0.2661,0.2634,0.1935,0.2077,1.111
dt,Decision Tree Classifier,0.8411,0.615,0.351,0.2284,0.2582,0.185,0.1953,0.564
lightgbm,Light Gradient Boosting Machine,0.879,0.7615,0.279,0.2622,0.2527,0.1949,0.2006,2.184
ada,Ada Boost Classifier,0.774,0.6736,0.359,0.1338,0.1885,0.0954,0.1109,0.795
svm,SVM - Linear Kernel,0.7848,0.0,0.2357,0.1283,0.1451,0.0564,0.0599,0.548
lr,Logistic Regression,0.8033,0.6415,0.229,0.1377,0.1374,0.0543,0.0663,0.727


Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.835,0.5365,0.4,0.1935,0.2609,0.1804,0.1956
1,0.8447,0.756,0.4,0.2069,0.2727,0.1955,0.2089
2,0.8592,0.8017,0.3333,0.2083,0.2564,0.1832,0.1894
3,0.8447,0.785,0.4,0.2069,0.2727,0.1955,0.2089
4,0.8932,0.8618,0.3333,0.2941,0.3125,0.2549,0.2554
5,0.9175,0.9431,0.6,0.45,0.5143,0.4702,0.476
6,0.8689,0.8932,0.6,0.3,0.4,0.3355,0.361
7,0.6893,0.6366,0.4667,0.1111,0.1795,0.0701,0.0978
8,0.8976,0.9353,0.9286,0.3939,0.5532,0.5058,0.5655
9,0.9512,0.8553,0.3571,0.8333,0.5,0.4786,0.5266


[I 2023-09-07 06:54:02,346] Searching the best hyperparameters using 2058 samples...
[I 2023-09-07 06:57:59,044] Finished hyperparemeter search!


Transformation Pipeline and Model Successfully Saved


Unnamed: 0,Description,Value
0,Session id,123
1,Target,30A_70B
2,Target type,Binary
3,Original data shape,"(2572, 101)"
4,Transformed data shape,"(4018, 93)"
5,Transformed train set shape,"(3486, 93)"
6,Transformed test set shape,"(514, 93)"
7,Ignore features,8
8,Ordinal features,12
9,Numeric features,72


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.8314,0.7873,0.5193,0.4525,0.4605,0.3667,0.3806,0.742
rf,Random Forest Classifier,0.8261,0.7936,0.4953,0.4599,0.4458,0.3497,0.3664,1.769
knn,K Neighbors Classifier,0.7731,0.755,0.5778,0.3409,0.4169,0.2942,0.3155,0.864
lightgbm,Light Gradient Boosting Machine,0.8144,0.7987,0.4536,0.4041,0.3999,0.2973,0.3129,2.643
gbc,Gradient Boosting Classifier,0.8008,0.777,0.4679,0.4076,0.3975,0.2889,0.3103,1.248
nb,Naive Bayes,0.6871,0.709,0.6751,0.2598,0.3702,0.2156,0.2644,0.599
dt,Decision Tree Classifier,0.74,0.6218,0.4576,0.2851,0.3408,0.199,0.2096,0.623
qda,Quadratic Discriminant Analysis,0.7343,0.6737,0.4054,0.2684,0.3042,0.1637,0.1729,0.977
ada,Ada Boost Classifier,0.7348,0.702,0.4603,0.2333,0.302,0.1681,0.1879,1.01
lr,Logistic Regression,0.7459,0.6661,0.3526,0.2406,0.2764,0.1361,0.142,0.869


Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.8641,0.8736,0.7931,0.5111,0.6216,0.5435,0.5629
1,0.801,0.7329,0.5517,0.3636,0.4384,0.3236,0.334
2,0.8544,0.8365,0.6207,0.4865,0.5455,0.4603,0.4651
3,0.7524,0.8147,0.7931,0.3382,0.4742,0.3449,0.3985
4,0.733,0.6809,0.5172,0.2679,0.3529,0.2056,0.2233
5,0.8544,0.8821,0.5862,0.4857,0.5312,0.4459,0.4487
6,0.8932,0.8395,0.4138,0.7059,0.5217,0.4662,0.4873
7,0.7184,0.8212,0.7931,0.3067,0.4423,0.3002,0.3609
8,0.8195,0.794,0.3214,0.3333,0.3273,0.2231,0.2231
9,0.9073,0.9044,0.3571,0.9091,0.5128,0.4722,0.5357


[I 2023-09-07 07:01:17,252] Searching the best hyperparameters using 2058 samples...
[I 2023-09-07 07:05:33,081] Finished hyperparemeter search!


Transformation Pipeline and Model Successfully Saved


Unnamed: 0,Description,Value
0,Session id,123
1,Target,20A_80B
2,Target type,Binary
3,Original data shape,"(2572, 101)"
4,Transformed data shape,"(3950, 93)"
5,Transformed train set shape,"(3446, 93)"
6,Transformed test set shape,"(514, 93)"
7,Ignore features,8
8,Ordinal features,12
9,Numeric features,72


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
knn,K Neighbors Classifier,0.7775,0.7844,0.6516,0.377,0.466,0.3421,0.3701,0.684
rf,Random Forest Classifier,0.7994,0.7759,0.4576,0.4004,0.4154,0.2995,0.3056,0.846
et,Extra Trees Classifier,0.794,0.7585,0.4668,0.4036,0.4117,0.2944,0.3057,0.874
nb,Naive Bayes,0.6794,0.7096,0.7052,0.2877,0.4002,0.2376,0.2869,0.667
dt,Decision Tree Classifier,0.7561,0.6427,0.4797,0.3306,0.3816,0.2437,0.2527,0.697
gbc,Gradient Boosting Classifier,0.777,0.7768,0.4576,0.3764,0.3786,0.2549,0.2715,1.101
lightgbm,Light Gradient Boosting Machine,0.796,0.7757,0.403,0.4139,0.3717,0.2583,0.2735,2.077
ada,Ada Boost Classifier,0.7062,0.6652,0.4384,0.3314,0.3099,0.1576,0.1836,0.89
ridge,Ridge Classifier,0.7105,0.0,0.3992,0.2571,0.2918,0.137,0.1466,1.19
qda,Quadratic Discriminant Analysis,0.7644,0.6269,0.2773,0.3398,0.2717,0.1433,0.1553,0.671


Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.8689,0.8915,0.7742,0.5455,0.64,0.5628,0.5757
1,0.7621,0.7993,0.8065,0.3676,0.5051,0.3761,0.4263
2,0.8301,0.8524,0.7097,0.4583,0.557,0.4578,0.4746
3,0.7961,0.8956,0.871,0.4154,0.5625,0.4505,0.503
4,0.7961,0.8276,0.4516,0.359,0.4,0.2791,0.2818
5,0.801,0.8524,0.8125,0.4262,0.5591,0.4463,0.4851
6,0.7621,0.7756,0.6562,0.3559,0.4615,0.3257,0.3508
7,0.6505,0.6008,0.3438,0.1774,0.234,0.0366,0.04
8,0.8244,0.7861,0.5161,0.4324,0.4706,0.3663,0.3684
9,0.8488,0.8027,0.5161,0.5,0.5079,0.4186,0.4187


[I 2023-09-07 07:08:34,248] Searching the best hyperparameters using 2058 samples...
[I 2023-09-07 07:11:17,948] Finished hyperparemeter search!


Transformation Pipeline and Model Successfully Saved


Unnamed: 0,Description,Value
0,Session id,123
1,Target,10A_90B
2,Target type,Binary
3,Original data shape,"(2572, 101)"
4,Transformed data shape,"(4532, 93)"
5,Transformed train set shape,"(4018, 93)"
6,Transformed test set shape,"(514, 93)"
7,Ignore features,8
8,Ordinal features,12
9,Numeric features,72


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
qda,Quadratic Discriminant Analysis,0.951,0.7653,0.5,0.4598,0.4169,0.3985,0.4254,0.763
ada,Ada Boost Classifier,0.9247,0.8693,0.5,0.233,0.2902,0.2676,0.2978,0.831
lr,Logistic Regression,0.9009,0.8853,0.56,0.2758,0.2839,0.2593,0.3112,1.12
knn,K Neighbors Classifier,0.916,0.793,0.555,0.202,0.2823,0.2577,0.296,0.796
et,Extra Trees Classifier,0.9145,0.7276,0.43,0.2707,0.2689,0.2457,0.276,0.804
rf,Random Forest Classifier,0.9063,0.752,0.43,0.3174,0.2685,0.2456,0.2839,1.024
lightgbm,Light Gradient Boosting Machine,0.9,0.81,0.39,0.2695,0.2616,0.2381,0.2585,2.585
gbc,Gradient Boosting Classifier,0.8903,0.8328,0.45,0.2447,0.2554,0.2279,0.2586,1.235
svm,SVM - Linear Kernel,0.899,0.0,0.44,0.2652,0.2404,0.2151,0.2556,0.707
lda,Linear Discriminant Analysis,0.8781,0.8608,0.64,0.1255,0.2032,0.1747,0.2437,0.725


Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.8738,0.698,0.2,0.0435,0.0714,0.0329,0.0442
1,0.8592,0.9075,0.8,0.125,0.2162,0.1819,0.2807
2,0.8592,0.99,1.0,0.1471,0.2564,0.2236,0.3547
3,0.9903,0.9363,0.8,0.8,0.8,0.795,0.795
4,0.9175,0.6343,0.6,0.1667,0.2609,0.2317,0.2863
5,0.8835,0.8776,0.2,0.0476,0.0769,0.0393,0.0511
6,0.9757,0.9572,0.0,0.0,0.0,0.0,0.0
7,0.9951,0.9741,0.8,1.0,0.8889,0.8864,0.8922
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,0.9756,0.99,1.0,0.4444,0.6154,0.6047,0.6583


[I 2023-09-07 07:14:20,737] Searching the best hyperparameters using 2058 samples...
[I 2023-09-07 07:17:29,323] Finished hyperparemeter search!


Transformation Pipeline and Model Successfully Saved


### Test Set Predictions

We will use Pycaret to make predictions on the test set, then calculate some metrics from them.

In [53]:
from pycaret.classification import load_model

# create empty dataframe
y_pred_df = pd.DataFrame()

# set up model name list and dictionary
trained_models = ['90A_10B_Pipeline', '80A_20B_Pipeline','70A_30B_Pipeline','60A_40B_Pipeline','50A_50B_Pipeline','40A_60B_Pipeline','30A_70B_Pipeline','20A_80B_Pipeline','10A_90B_Pipeline']
column_names = {'90A_10B_Pipeline' : '90A_10B', 
                '80A_20B_Pipeline' : '80A_20B',
                '70A_30B_Pipeline' : '70A_30B',
                '60A_40B_Pipeline' : '60A_40B',
                '50A_50B_Pipeline' : '50A_50B',
                '40A_60B_Pipeline' : '40A_60B',
                '30A_70B_Pipeline' : '30A_70B',
                '20A_80B_Pipeline' : '20A_80B',
                '10A_90B_Pipeline' : '10A_90B'}

# loop to load model, run prediction on test set, and combine predictions
for i in trained_models:
    # load saved model
    loaded_model = load_model(i)

    # predict on test set
    pred_target_df = predict_model(loaded_model, data=pycaret_test)

    # grab prediction label
    pred_target_df = pred_target_df['prediction_label']

    # rename prediction label column
    pred_target_df = pred_target_df.rename(i)

    # add column to dataframe
    y_pred_df = pd.concat([y_pred_df, pred_target_df], axis='columns')

# rename columns
y_pred_df.rename(columns=column_names, inplace=True)
y_pred_df

Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.9786,0.9612,0.4,0.4444,0.4211,0.4102,0.4108


Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.9669,0.8966,0.2222,0.5714,0.32,0.3064,0.3429


Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,K Neighbors Classifier,0.8132,0.8782,0.8083,0.5706,0.669,0.5442,0.5602


Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,K Neighbors Classifier,0.8191,0.8716,0.6571,0.2212,0.3309,0.255,0.306


Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.8327,0.9134,0.7436,0.6084,0.6692,0.5587,0.5638


Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.8969,0.8481,0.5946,0.3667,0.4536,0.4002,0.4145


Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.8619,0.8958,0.625,0.5056,0.559,0.4782,0.482


Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,K Neighbors Classifier,0.821,0.8838,0.7089,0.448,0.549,0.4444,0.4626


Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Quadratic Discriminant Analysis,0.9416,0.9202,0.6667,0.2353,0.3478,0.3245,0.3736


Unnamed: 0,90A_10B,80A_20B,70A_30B,60A_40B,50A_50B,40A_60B,30A_70B,20A_80B,10A_90B
2,0,0,1,1,0,0,0,0,0
5,0,0,1,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0
10,0,0,1,0,0,0,0,0,0
28,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
2540,0,0,1,0,1,1,0,1,0
2547,0,0,0,0,0,0,0,0,0
2558,0,0,1,0,0,0,0,1,0
2563,0,0,0,0,0,0,0,0,0


In [54]:
dfSummary(y_pred_df)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,90A_10B [int64],1. 0 2. 1,505 (98.2%) 9 (1.8%),,0 (0.0%)
2,80A_20B [int64],1. 0 2. 1,507 (98.6%) 7 (1.4%),,0 (0.0%)
3,70A_30B [int64],1. 0 2. 1,344 (66.9%) 170 (33.1%),,0 (0.0%)
4,60A_40B [int64],1. 0 2. 1,410 (79.8%) 104 (20.2%),,0 (0.0%)
5,50A_50B [int64],1. 0 2. 1,371 (72.2%) 143 (27.8%),,0 (0.0%)
6,40A_60B [int64],1. 0 2. 1,454 (88.3%) 60 (11.7%),,0 (0.0%)
7,30A_70B [int64],1. 0 2. 1,425 (82.7%) 89 (17.3%),,0 (0.0%)
8,20A_80B [int64],1. 0 2. 1,389 (75.7%) 125 (24.3%),,0 (0.0%)
9,10A_90B [int64],1. 0 2. 1,480 (93.4%) 34 (6.6%),,0 (0.0%)


In [55]:
dfSummary(y_test)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,90A_10B [int64],1. 0 2. 1,504 (98.1%) 10 (1.9%),,0 (0.0%)
2,80A_20B [int64],1. 0 2. 1,496 (96.5%) 18 (3.5%),,0 (0.0%)
3,70A_30B [int64],1. 0 2. 1,394 (76.7%) 120 (23.3%),,0 (0.0%)
4,60A_40B [int64],1. 0 2. 1,479 (93.2%) 35 (6.8%),,0 (0.0%)
5,50A_50B [int64],1. 0 2. 1,397 (77.2%) 117 (22.8%),,0 (0.0%)
6,40A_60B [int64],1. 0 2. 1,477 (92.8%) 37 (7.2%),,0 (0.0%)
7,30A_70B [int64],1. 0 2. 1,442 (86.0%) 72 (14.0%),,0 (0.0%)
8,20A_80B [int64],1. 0 2. 1,435 (84.6%) 79 (15.4%),,0 (0.0%)
9,10A_90B [int64],1. 0 2. 1,502 (97.7%) 12 (2.3%),,0 (0.0%)


# Evaluation

Pycaret provides an interactive viewer with many graphs and charts for evaluating our models. For ease of communication, we will also calculate a few standard metrics that summarize each model in a single number.

## Pycaret Model Test Evaluation Data Viewer

In [56]:
# cell to show some evaluation plots for further exploration

test_model = load_model('40A_60B_Pipeline')
evaluate_model(test_model)

Transformation Pipeline and Model Successfully Loaded



## F_Score, Precision, Recall

We now calculate final metrics on the test set predictions. Because the dataset is highly imbalanced, accuracy alone says little, so the following metrics, computed for the positive class, are more useful:

1. F1_Score
2. Precision
3. Recall
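To illustrate why accuracy is a weak yardstick here, take the 90A_10B target, where `dfSummary(y_test)` reports 504 negatives and 10 positives. The sketch below scores a hypothetical classifier that always predicts 0. The helper name `binary_scores` is an assumption made for illustration and is not part of the notebook; undefined ratios (0/0) are set to 0.0 by convention.

```python
# Hypothetical helper (not part of the notebook): positive-class metrics from raw counts.
def binary_scores(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Ratios with an empty denominator are reported as 0.0 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, precision, recall, f1

# 90A_10B test counts from dfSummary(y_test): 504 negatives, 10 positives.
# An always-negative classifier gets every negative right and misses every positive.
accuracy, precision, recall, f1 = binary_scores(tp=0, fp=0, fn=10, tn=504)
print(f"accuracy={accuracy:.4f} precision={precision} recall={recall} f1={f1}")
```

Accuracy comes out at roughly 98% even though no stable compound is ever found, while recall and F1 are both zero.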

In [57]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

feature_names = ['90A_10B', '80A_20B', '70A_30B', '60A_40B', '50A_50B', '40A_60B', '30A_70B', '20A_80B', '10A_90B']

# initialize lists
f1_list = []
precision_list = []
recall_list = []

# wandb logging
wandb.init(project="binary_stability_prediction")

# for loop to calculate f1_score, precision_score, and recall_score across all classes
for i in feature_names:
    # f1_score
    f1 = f1_score(y_test[i].reset_index(drop=True), y_pred_df[i].reset_index(drop=True), average=None)[1]

    # precision_score
    p = precision_score(y_test[i].reset_index(drop=True), y_pred_df[i].reset_index(drop=True), average=None)[1]

    # recall_score
    r = recall_score(y_test[i].reset_index(drop=True), y_pred_df[i].reset_index(drop=True), average=None)[1]

    # append to appropriate list
    f1_list.append(f1)
    precision_list.append(p)
    recall_list.append(r)

    # wandb logging
    wandb.log({'f1_score': f1, 'precision_score': p, 'recall_score': r})
wandb.finish()

# calculate and print averages
f1_avg = np.average(f1_list)
precision_avg = np.average(precision_list)
recall_avg = np.average(recall_list)
    
print(f1_list, precision_list, recall_list, f1_avg, precision_avg, recall_avg)

Run history:
f1_score,▃▁█▁█▄▆▆▂
precision_score,▅▇▇▁█▄▆▅▁
recall_score,▃▁█▆▇▅▆▇▆

Run summary:
f1_score,0.34783
precision_score,0.23529
recall_score,0.66667


[0.4210526315789474, 0.32, 0.6689655172413792, 0.3309352517985611, 0.6692307692307694, 0.4536082474226804, 0.5590062111801242, 0.5490196078431373, 0.3478260869565218] [0.4444444444444444, 0.5714285714285714, 0.5705882352941176, 0.22115384615384615, 0.6083916083916084, 0.36666666666666664, 0.5056179775280899, 0.448, 0.23529411764705882] [0.4, 0.2222222222222222, 0.8083333333333333, 0.6571428571428571, 0.7435897435897436, 0.5945945945945946, 0.625, 0.7088607594936709, 0.6666666666666666] 0.4799604803613467 0.44128727417271146 0.6029344641158988


In [60]:
f1 = f1_score(y_test['90A_10B'].reset_index(drop=True), y_pred_df['90A_10B'].reset_index(drop=True), average=None)
f1

array([0.98909812, 0.42105263])

In [61]:
class_proportion_list = [0.019, 0.035, 0.233, 0.07, 0.23, 0.07, 0.14, 0.152, 0.023]
f1_gain_list =[]
precision_gain_list =[]
recall_gain_list = []

# wandb logging
wandb.init(project="binary_stability_prediction")

for i in range(len(f1_list)):
    f1_gain = (f1_list[i] - class_proportion_list[i]) / (f1_list[i]*(1-class_proportion_list[i]))
    f1_gain_list.append(f1_gain)

    p_gain = (precision_list[i] - class_proportion_list[i]) / (precision_list[i]*(1-class_proportion_list[i]))
    precision_gain_list.append(p_gain)

    r_gain = (recall_list[i] - class_proportion_list[i]) / (recall_list[i]*(1-class_proportion_list[i]))
    recall_gain_list.append(r_gain)

    # wandb logging
    wandb.log({'f1_gain_score': f1_gain, 'precision_gain_score': p_gain, 'recall_gain_score': r_gain})

f1_gain_avg = np.average(f1_gain_list)
precision_gain_avg = np.average(precision_gain_list)
recall_gain_avg = np.average(recall_gain_list)

wandb.log({'f1_gain_score_average': f1_gain_avg, 'precision_gain_average': precision_gain_avg, 'recall_gain_avg': recall_gain_avg})
wandb.finish()

0,1
f1_gain_score,█▅▁▁▁▄▂▁▇
f1_gain_score_average,▁
precision_gain_average,▁
precision_gain_score,██▂▁▃▅▄▂▆
recall_gain_avg,▁
recall_gain_score,▇▁▄▆▂▆▃▄█

0,1
f1_gain_score,0.95586
f1_gain_score_average,0.89286
precision_gain_average,0.8529
precision_gain_score,0.92349
recall_gain_avg,0.93281
recall_gain_score,0.98823
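
The gain scores above rescale each metric against the positive-class proportion π of the target, so that a score equal to π maps to 0 and a perfect score maps to 1: gain(x) = (x − π) / ((1 − π) · x). A minimal sketch in plain Python follows. The helper name `metric_gain` is hypothetical and is not part of the notebook. The inputs are the F1 score and class proportion of the last target ('10A_90B') taken from the lists above.

```python
def metric_gain(value, positive_proportion):
    """Rescale a metric (F1, precision or recall) against the class proportion.

    Hypothetical helper mirroring the formula used in the loop above:
    (value - pi) / ((1 - pi) * value).
    """
    if value <= 0:
        # the formula divides by the metric value, so it is undefined at 0
        raise ValueError("gain is undefined for a metric value of 0 or less")
    return (value - positive_proportion) / (value * (1 - positive_proportion))


# F1 score and class proportion for '10A_90B', copied from f1_list and class_proportion_list
f1_last = 0.3478260869565218
pi_last = 0.023

f1_gain_last = metric_gain(f1_last, pi_last)
print(round(f1_gain_last, 5))
```

Because the denominator contains the metric itself, a target whose score is exactly 0 would raise a division error in the original loop as well; the guard here only makes that case explicit.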


In [62]:
f1_list[0]

0.4210526315789474

In [63]:
# load wandb dashboard for data viewing

%wandb alanfiler/binary_stability_prediction/reports/Final-Test-F1_Score-Precision-Recall--Vmlldzo0OTIwODE3

## Final Accuracy on Test Stability Vectors

We will now transform our prediction dataframe back into the stability vector format and compare to the ground truth for a final accuracy.

In [65]:
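# add '100A_0B' back to test results (always set to 1)
# size the column from the dataframe instead of hard-coding the 514 test rows
y_pred_df['100A_0B'] = np.ones(len(y_pred_df), dtype=int)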
y_pred_df.head()

Unnamed: 0,90A_10B,80A_20B,70A_30B,60A_40B,50A_50B,40A_60B,30A_70B,20A_80B,10A_90B,100A_0B
2,0,0,1,1,0,0,0,0,0,1
5,0,0,1,0,0,0,0,0,0,1
9,0,0,0,0,0,0,0,0,0,1
10,0,0,1,0,0,0,0,0,0,1
28,0,0,1,0,0,0,0,0,0,1


In [66]:
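# add '0A_100B' back to test results (always set to 1)
# size the column from the dataframe instead of hard-coding the 514 test rows
y_pred_df['0A_100B'] = np.ones(len(y_pred_df), dtype=int)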
y_pred_df.head()

Unnamed: 0,90A_10B,80A_20B,70A_30B,60A_40B,50A_50B,40A_60B,30A_70B,20A_80B,10A_90B,100A_0B,0A_100B
2,0,0,1,1,0,0,0,0,0,1,1
5,0,0,1,0,0,0,0,0,0,1,1
9,0,0,0,0,0,0,0,0,0,1,1
10,0,0,1,0,0,0,0,0,0,1,1
28,0,0,1,0,0,0,0,0,0,1,1


In [67]:
# reorder columns
y_pred_df = y_pred_df.loc[:,['100A_0B', '90A_10B', '80A_20B', '70A_30B', '60A_40B', '50A_50B', '40A_60B', '30A_70B', '20A_80B', '10A_90B', '0A_100B']]
y_pred_df

Unnamed: 0,100A_0B,90A_10B,80A_20B,70A_30B,60A_40B,50A_50B,40A_60B,30A_70B,20A_80B,10A_90B,0A_100B
2,1,0,0,1,1,0,0,0,0,0,1
5,1,0,0,1,0,0,0,0,0,0,1
9,1,0,0,0,0,0,0,0,0,0,1
10,1,0,0,1,0,0,0,0,0,0,1
28,1,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
2540,1,0,0,1,0,1,1,0,1,0,1
2547,1,0,0,0,0,0,0,0,0,0,1
2558,1,0,0,1,0,0,0,0,1,0,1
2563,1,0,0,0,0,0,0,0,0,0,1


In [68]:
# convert dataframe to stability vector series
pred_sV = pd.Series(y_pred_df.values.tolist())
pred_sV = pred_sV.rename('stabilityVec_pred')
pred_sV = pred_sV.to_frame()
pred_sV = pred_sV.set_index(y_pred_df.index)
pred_sV

Unnamed: 0,stabilityVec_pred
2,"[1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1]"
5,"[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]"
9,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]"
10,"[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]"
28,"[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]"
...,...
2540,"[1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1]"
2547,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]"
2558,"[1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]"
2563,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]"


In [69]:
stabilityVec_df = pred_sV.join(df['stabilityVec'], how='left')
stabilityVec_df

Unnamed: 0,stabilityVec_pred,stabilityVec
2,"[1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1]","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]"
5,"[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]","[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]"
9,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]"
10,"[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]","[1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]"
28,"[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]","[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]"
...,...,...
2540,"[1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1]","[1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]"
2547,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]"
2558,"[1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]","[1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1]"
2563,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]"


In [70]:
stabilityVec_df_missing = stabilityVec_df.loc[stabilityVec_df.apply(lambda row: row['stabilityVec_pred'] != row['stabilityVec'], axis=1), ['stabilityVec_pred', 'stabilityVec']]
stabilityVec_df_missing

Unnamed: 0,stabilityVec_pred,stabilityVec
2,"[1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1]","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]"
10,"[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]","[1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]"
28,"[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]","[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]"
37,"[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1]","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]"
46,"[1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1]","[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1]"
...,...,...
2451,"[1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]","[1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1]"
2453,"[1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1]","[1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]"
2526,"[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]"
2540,"[1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1]","[1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]"


In [72]:
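# final stability vector accuracy calculation
# counts come from the dataframes above (the original cell hard-coded 514 rows and 271 mismatches)
n_total = len(stabilityVec_df)
n_mismatch = len(stabilityVec_df_missing)
print('Final Accuracy: {}'.format((n_total - n_mismatch) / n_total))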

Final Accuracy: 0.4727626459143969
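
This accuracy is an exact-match score: a test pair counts as correct only when all 11 entries of its predicted stability vector equal the ground truth. A minimal sketch of that calculation in plain Python is shown here, assuming a hypothetical helper name `exact_match_accuracy` and using only the four rows (index 2, 5, 9, 10) visible in the comparison table above rather than the full test set.

```python
def exact_match_accuracy(predicted, actual):
    """Fraction of rows whose predicted vector equals the true vector entry for entry."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same number of rows")
    if not predicted:
        raise ValueError("at least one row is required")
    matches = sum(1 for p, a in zip(predicted, actual) if list(p) == list(a))
    return matches / len(predicted)


# predicted and true stability vectors for rows 2, 5, 9 and 10 of the table above
pred_rows = [
    [1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
]
true_rows = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1],
]

toy_accuracy = exact_match_accuracy(pred_rows, true_rows)
print(toy_accuracy)  # rows 5 and 9 match, rows 2 and 10 do not
```

Exact match is a strict criterion: a single wrong composition out of 11 marks the whole pair as incorrect, which is one reason this figure sits well below the per-class scores.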


The final exact-match accuracy is about 47% (243 of 514 test pairs). This is low, but not unexpected given the heavy class imbalance. The model will be more accurate for classes with a higher proportion of positive cases in the training data, like '50A_50B'.

Our F1 scores under macro and weighted averaging are acceptable (in line with the training sets), but the scores for the stable (positive) class alone are weak. The test set is also small, with a support of just 5 in some cases, so the scores may improve as more data becomes available. They could equally stay the same, since the class proportions were preserved and rare classes are simply harder to predict.
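
To make the gap between the averaged scores and the stable-class score concrete, here is a minimal sketch in plain Python. It uses hypothetical toy labels (95 unstable, 5 stable), not the notebook's data, and a hand-written `f1_for_class` helper that is an assumption for illustration rather than the scikit-learn call used earlier.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# hypothetical toy data: 95 unstable (0) and 5 stable (1) compositions
y_true_toy = [0] * 95 + [1] * 5
# 93 true negatives, 2 false positives, then 2 true positives and 3 false negatives
y_pred_toy = [0] * 93 + [1] * 2 + [1] * 2 + [0] * 3

f1_unstable = f1_for_class(y_true_toy, y_pred_toy, 0)  # 186 / 191
f1_stable = f1_for_class(y_true_toy, y_pred_toy, 1)    # 4 / 9

# macro average: plain mean over classes
macro_f1 = (f1_unstable + f1_stable) / 2
# weighted average: mean weighted by class support (95 and 5)
weighted_f1 = (95 * f1_unstable + 5 * f1_stable) / 100

print(f1_unstable, f1_stable, macro_f1, weighted_f1)
```

In this toy case the weighted average is dominated by the majority class, so it stays high even though fewer than half of the stable compositions are predicted correctly. That is the same pattern described above for the rare targets.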

Further steps:

1. Train a neural network and see whether performance improves.
2. Gather more data by running experiments and feeding the results back into the model.

## Fast AI (this is incomplete)

## Initial Setup

In [34]:
fast_ai_df = pd.concat([feature_df_lowcorr, target_df], axis='columns')
fast_ai_df

Unnamed: 0,formulaA_elements_AtomicVolume,formulaB_elements_AtomicVolume,formulaA_elements_AtomicWeight,formulaB_elements_AtomicWeight,formulaA_elements_BoilingT,formulaB_elements_BoilingT,formulaA_elements_BulkModulus,formulaB_elements_BulkModulus,formulaA_elements_Column,formulaB_elements_Column,...,avg_nearest_neighbor_distance_B,90A_10B,80A_20B,70A_30B,60A_40B,50A_50B,40A_60B,30A_70B,20A_80B,10A_90B
0,37.433086,17.075648,227.000,107.868200,3473.0,2435.0,0.0,100.0,3,11,...,2.94195,0,0,1,0,1,0,0,0,0
1,37.433086,16.594425,227.000,26.981539,3473.0,2792.0,0.0,76.0,3,13,...,2.85595,0,0,1,0,0,0,0,0,0
2,37.433086,21.723966,227.000,74.921600,3473.0,887.0,0.0,22.0,3,15,...,2.55790,0,0,0,0,0,0,0,0,0
3,37.433086,64.969282,227.000,137.327000,3473.0,2143.0,0.0,9.6,3,2,...,4.35637,0,0,0,0,0,0,0,0,0
4,37.433086,35.483459,227.000,208.980400,3473.0,1837.0,0.0,31.0,3,15,...,3.11221,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2567,23.265943,32.865683,91.224,232.038060,4682.0,5093.0,0.0,54.0,4,3,...,3.56059,0,0,0,0,0,0,0,0,0
2568,23.265943,28.640877,91.224,204.383300,4682.0,1746.0,0.0,43.0,4,13,...,3.43253,0,0,0,0,0,0,0,1,0
2569,23.265943,13.844898,91.224,50.941500,4682.0,3680.0,0.0,160.0,4,5,...,2.59229,0,0,0,0,0,0,0,0,0
2570,23.265943,36952.924020,91.224,131.293000,4682.0,165.0,0.0,0.0,4,18,...,4.85032,0,0,0,0,0,0,0,0,0


In [35]:
fast_ai_df[categorical_columns] = fast_ai_df[categorical_columns].astype('category')

In [36]:
fast_ai_df[target_df.columns] = fast_ai_df[target_df.columns].astype('category')

In [37]:
fast_ai_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2572 entries, 0 to 2571
Columns: 101 entries, formulaA_elements_AtomicVolume to 10A_90B
dtypes: category(29), float64(42), int64(30)
memory usage: 1.5 MB


## Model Training

In [38]:
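from imblearn.combine import SMOTETomek
from imblearn.combine import SMOTEENN

# function to balance the classes for later training
# relies on the notebook-level X_train, y_train, X_test and y_test
def imbalance_class_correction(current_target):

    # SMOTE oversampling followed by Edited Nearest Neighbours cleaning
    smt = SMOTEENN(random_state=123)

    # resample the training split only, so the test split keeps its original class proportions
    X_res, y_res = smt.fit_resample(X_train, y_train[current_target])

    res_train_df = pd.concat([X_res, y_res], axis='columns')

    test_df = pd.concat([X_test, y_test[current_target]], axis='columns')

    # stack the resampled training rows on top of the untouched test rows
    res_df = pd.concat([res_train_df, test_df], ignore_index=True)

    # positions of the test rows in res_df, used later as valid_idx
    res_test_index_range = list(res_df.index[len(res_train_df):len(res_df)])

    return res_df, res_test_index_range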

In [80]:
  
# this cell is for a single target model

from fastai.callback.wandb import *

# wandb logging
wandb.init(project="binary_stability_prediction")

# set FastAI seed
set_seed(123, True)

# imbalance class balancing

current_target = '90A_10B'
res_df, res_test_index_range = imbalance_class_correction(current_target)

# Data Processors
procs = [Categorify, FillMissing, Normalize]

dls = TabularDataLoaders.from_df(res_df, 
                                 y_names=current_target, 
                                 cat_names=categorical_columns,
                                 cont_names=numerical_columns,
                                 procs=procs,
                                 valid_idx=res_test_index_range,
                                 device = torch.device('cpu'),
                                 y_block = CategoryBlock()
                                 )
    
tab_learn = tabular_learner(dls, loss_func=FocalLossFlat(gamma=2.0, axis=-1, weight=None, reduction='mean'), metrics=[F1Score(average='weighted'), F1Score(average=None), Precision(average='weighted'), Precision(average=None), Recall(average='weighted'), Recall(average=None), MatthewsCorrCoef(sample_weight=None)])
    
tab_learn.fit_one_cycle(20, cbs=[EarlyStoppingCallback(monitor='f1_score', min_delta=0.001, patience=5), SaveModelCallback(monitor='f1_score', min_delta=0.001)])

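# validate once and reuse the result: [valid_loss, then the metrics in the order passed to tabular_learner]
val_results = tab_learn.validate()

f1 = val_results[1]
p = val_results[3]
r = val_results[5]

# index 1 of each per-class array is the minority (stable) class
f1_minor = val_results[2][1]
p_minor = val_results[4][1]
r_minor = val_results[6][1]

matt_corrcoef = val_results[7]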

# wandb logging
wandb.log({'f1_score_weighted_average': f1, 'precision_score_weighted_average': p, 'recall_score_weighted_average': r, 'f1_score': f1_minor, 'precision_score': p_minor, 'recall_score': r_minor, 'matthew_correlation_coefficient': matt_corrcoef})

wandb.finish()


epoch,train_loss,valid_loss,f1_score,f1_score.1,precision_score,precision_score.1,recall_score,recall_score.1,matthews_corrcoef,time
0,0.116856,0.074989,0.92195,[0.93572181 0.2278481 ],0.980879,[0.99775281 0.13043478],0.881323,[0.88095238 0.9 ],0.316399,00:01
1,0.045132,0.040378,0.953355,[0.96636086 0.29787234],0.978059,[0.99371069 0.18918919],0.935798,[0.94047619 0.7 ],0.342262,00:00
2,0.019864,0.024751,0.967698,[0.98101898 0.2962963 ],0.973285,[0.98792757 0.23529412],0.963035,[0.97420635 0.4 ],0.289017,00:00
3,0.010948,0.030389,0.972677,[0.98507463 0.34782609],0.974788,[0.98802395 0.30769231],0.970817,[0.98214286 0.4 ],0.336163,00:00
4,0.006351,0.041052,0.974055,[0.98711596 0.31578947],0.973438,[0.98613861 0.33333333],0.974708,[0.98809524 0.3 ],0.303378,00:00
5,0.004612,0.01744,0.982521,[0.99209486 0.5 ],0.981934,[0.98818898 0.66666667],0.984436,[0.99603175 0.4 ],0.509258,00:00
6,0.003926,0.029508,0.970151,[0.98305085 0.32 ],0.973943,[0.98797595 0.26666667],0.966926,[0.9781746 0.4 ],0.310321,00:00
7,0.003384,0.048815,0.967829,[0.98207171 0.25 ],0.970986,[0.986 0.21428571],0.964981,[0.9781746 0.3 ],0.236039,00:00
8,0.002552,0.038385,0.969034,[0.98308458 0.26086957],0.971334,[0.98602794 0.23076923],0.966926,[0.98015873 0.3 ],0.24645,00:00
9,0.002096,0.037613,0.971404,[0.98406375 0.33333333],0.974337,[0.988 0.28571429],0.968872,[0.98015873 0.4 ],0.322575,00:00


Better model found at epoch 0 with f1_score value: 0.9219499503491545.
Better model found at epoch 1 with f1_score value: 0.9533552431204054.
Better model found at epoch 2 with f1_score value: 0.967697528008812.
Better model found at epoch 3 with f1_score value: 0.9726767953499292.
Better model found at epoch 4 with f1_score value: 0.9740551298806339.
Better model found at epoch 5 with f1_score value: 0.9825210316666924.
No improvement since epoch 5: early stopping



0,1
f1_score,▁
f1_score_weighted_average,▁
matthew_correlation_coefficient,▁
precision_score,▁
precision_score_weighted_average,▁
recall_score,▁
recall_score_weighted_average,▁

0,1
f1_score,0.5
f1_score_weighted_average,0.98252
matthew_correlation_coefficient,0.50926
precision_score,0.66667
precision_score_weighted_average,0.98193
recall_score,0.4
recall_score_weighted_average,0.98444


In [46]:
# training function to be able to use wandb sweeps for hyperparameter tuning
# must enter current_target for proper functioning
def train(config=None, current_target='90A_10B'):
  
  set_seed(123, True)
  
  wandb.init(project="binary_stability_prediction", config=config)
  config = wandb.config

  # imbalance class balancing

  res_df, res_test_index_range = imbalance_class_correction(current_target)

  # Data Processors
  procs = [Categorify, FillMissing, Normalize]

  dls = TabularDataLoaders.from_df(res_df, 
                                 y_names=current_target, 
                                 cat_names=categorical_columns,
                                 cont_names=numerical_columns,
                                 procs=procs,
                                 valid_idx=res_test_index_range,
                                 device = torch.device('cpu'),
                                 y_block = CategoryBlock()
                                 )
      
  learn = tabular_learner(dls, wd=config.wd, layers=[config.layers], metrics=[F1Score(average='weighted')])

  learn.fit_one_cycle(config.epochs, lr_max=config.learning_rate, cbs=[EarlyStoppingCallback(monitor='f1_score', min_delta=0.001, patience=3), SaveModelCallback(monitor='f1_score', min_delta=0.001)])

  #f1_score_weighted    
  f1 = learn.validate()[1]

  # wandb logging
  wandb.log({'f1_score_weighted_average': f1})

In [48]:
sweep_config = {
  "method": "bayes",
  "metric": {
    "name": "f1_score_weighted_average",
    "goal": "maximize"
  },
  "parameters": {
    "epochs": {
      "value": 20
    },
    "wd": {
        "min": 0.01,
        "max": 0.1
    },
    "learning_rate": {
      "min": 0.0001,
      "max": 0.01
    },
    "layers": {
      'min': 5,
      'max': 100
    }    
  }
}

sweep_id = wandb.sweep(sweep_config, project="binary_stability_prediction")

Create sweep with ID: ix93hz8q
Sweep URL: https://wandb.ai/alanfiler/binary_stability_prediction/sweeps/ix93hz8q


In [49]:
wandb.agent(sweep_id, function=train, count=1)

[34m[1mwandb[0m: Agent Starting Run: ns4guwf4 with config:
[34m[1mwandb[0m: 	epochs: 20
[34m[1mwandb[0m: 	layers: 20
[34m[1mwandb[0m: 	learning_rate: 0.008102543448992892
[34m[1mwandb[0m: 	wd: 0.02147881636376668


epoch,train_loss,valid_loss,f1_score,time
0,0.313575,0.287621,0.92195,00:03
1,0.126764,0.069466,0.979615,00:00
2,0.062073,0.076547,0.97538,00:00
3,0.04243,0.077199,0.981147,00:00
4,0.027933,0.086072,0.971276,00:00
5,0.023655,0.071629,0.979483,00:00
6,0.023243,0.084639,0.974055,00:00


Better model found at epoch 0 with f1_score value: 0.9219499503491545.
Better model found at epoch 1 with f1_score value: 0.9796151969916379.
Better model found at epoch 3 with f1_score value: 0.9811466593867073.
No improvement since epoch 3: early stopping


0,1
f1_score_weighted_average,▁

0,1
f1_score_weighted_average,0.98115


In [None]:
from fastai.callback.wandb import *

## this cell wraps the standard training code (not the hyperparameter sweep function) in a for loop to train one model per class, in order

feature_names_forloop = ['90A_10B', '80A_20B', '70A_30B', '60A_40B', '50A_50B', '40A_60B', '30A_70B', '20A_80B', '10A_90B']
f1_score_weighted_fastai = []
precision_score_weighted_fastai = []
recall_score_weighted_fastai = []

f1_score_minor_fastai = []
precision_score_minor_fastai = []
recall_score_minor_fastai = []
matt_corrcoef_fastai = []

# wandb logging
wandb.init(project="binary_stability_prediction")

for i in feature_names_forloop:
    set_seed(123, True)
    current_target = i
    # list the other targets without overwriting the notebook-level feature_names list
    features_to_ignore = [name for name in feature_names_forloop if name != current_target]
    # note: this frame is not used below; res_df is rebuilt from X_train/X_test
    fast_ai_df_single_target = fast_ai_df.drop(columns=features_to_ignore)

    # imbalance class balancing

    res_df, res_test_index_range = imbalance_class_correction(current_target)

    # Data Processors
    procs = [Categorify, FillMissing, Normalize]

    dls = TabularDataLoaders.from_df(res_df, 
                                 y_names=current_target, 
                                 cat_names=categorical_columns,
                                 cont_names=numerical_columns,
                                 procs=procs,
                                 valid_idx=res_test_index_range,
                                 device = torch.device('cpu'),
                                 y_block = CategoryBlock()
                                 )
    
    tab_learn = tabular_learner(dls, loss_func=FocalLossFlat(gamma=2.0, axis=-1, weight=None, reduction='mean'), metrics=[F1Score(average='weighted'), F1Score(average=None), Precision(average='weighted'), Precision(average=None), Recall(average='weighted'), Recall(average=None), MatthewsCorrCoef(sample_weight=None)])
    
    tab_learn.fit_one_cycle(20, cbs=[EarlyStoppingCallback(monitor='f1_score', min_delta=0.001, patience=5), SaveModelCallback(monitor='f1_score', min_delta=0.001)])

    # validate once and reuse the result: [valid_loss, then the metrics in the order passed to tabular_learner]
    val_results = tab_learn.validate()

    f1 = val_results[1]
    f1_score_weighted_fastai.append(f1)

    p = val_results[3]
    precision_score_weighted_fastai.append(p)

    r = val_results[5]
    recall_score_weighted_fastai.append(r)

    # index 1 of each per-class array is the minority (stable) class
    f1_minor = val_results[2][1]
    f1_score_minor_fastai.append(f1_minor)

    p_minor = val_results[4][1]
    precision_score_minor_fastai.append(p_minor)

    r_minor = val_results[6][1]
    recall_score_minor_fastai.append(r_minor)

    matt_corrcoef = val_results[7]
    matt_corrcoef_fastai.append(matt_corrcoef)

    # wandb logging
    wandb.log({'f1_score_weighted_average': f1, 'precision_score_weighted_average': p, 'recall_score_weighted_average': r, 'f1_score': f1_minor, 'precision_score': p_minor, 'recall_score': r_minor, 'matthew_correlation_coefficient': matt_corrcoef})

wandb.finish()

print(f1_score_weighted_fastai)
print(precision_score_weighted_fastai)
print(recall_score_weighted_fastai)


In [None]:
wandb.finish()

In [None]:
## this cell is used to calculate gain statistics as well as the averages for f1, precision, recall, and Matthews correlation coefficient

class_proportion_list = [0.019, 0.035, 0.233, 0.07, 0.23, 0.07, 0.14, 0.152, 0.023]
f1_gain_list =[]
precision_gain_list =[]
recall_gain_list = []

# wandb logging
wandb.init(project="binary_stability_prediction")

for i in range(len(f1_score_minor_fastai)):
    f1_gain = (f1_score_minor_fastai[i] - class_proportion_list[i]) / (f1_score_minor_fastai[i]*(1-class_proportion_list[i]))
    f1_gain_list.append(f1_gain)

    p_gain = (precision_score_minor_fastai[i] - class_proportion_list[i]) / (precision_score_minor_fastai[i]*(1-class_proportion_list[i]))
    precision_gain_list.append(p_gain)

    r_gain = (recall_score_minor_fastai[i] - class_proportion_list[i]) / (recall_score_minor_fastai[i]*(1-class_proportion_list[i]))
    recall_gain_list.append(r_gain)

    # wandb logging
    wandb.log({'f1_gain_score': f1_gain, 'precision_gain_score': p_gain, 'recall_gain_score': r_gain})

f1_gain_avg = np.average(f1_gain_list)
precision_gain_avg = np.average(precision_gain_list)
recall_gain_avg = np.average(recall_gain_list)

f1_avg = np.average(f1_score_minor_fastai)
precision_avg = np.average(precision_score_minor_fastai)
recall_avg = np.average(recall_score_minor_fastai)
matt_corrcoef_avg = np.average(matt_corrcoef_fastai)

wandb.log({'f1_gain_score_average': f1_gain_avg, 'precision_gain_average': precision_gain_avg, 'recall_gain_avg': recall_gain_avg, 'f1_score_average': f1_avg, 'precision_score_average': precision_avg, 'recall_score_average': recall_avg, 'matthew_correlation_coefficient_average': matt_corrcoef_avg})
wandb.finish()