This notebook computes the cross-validation and training MSE and MAE for two reference models against which we compare our other models:
- predicting the HSE06 bandgap by the mean HSE06 bandgap
    - baseline: Our models should do a lot better than this.
- predicting the HSE06 bandgap as a linear function of the PBE bandgap
    - goalpost: Given the computational expense of computing the PBE bandgap, it would be great if our models matched or beat the accuracy of this model.

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, cross_validate
from sklearn.metrics import mean_squared_error, mean_absolute_error

##### import bandgaps:

In [40]:
# import the MOF ids from one of the reduced, training feature sets
df_MOFs = pd.read_csv('../data/datasets_main/Stoich45_FeatureSelected_dataset.csv', sep=',')[['MOF']]

# import qmof_ids and bandgaps from the original qmof csv
df_qmof = pd.read_csv('../data/QMOF/qmof_database/qmof_database/qmof.csv', sep=',', usecols=['qmof_id','outputs.pbe.bandgap','outputs.hse06.bandgap'])

# merge df_qmof onto df_MOFs to restrict to the MOFs in the Stoich45 training set and keep them in the same order,
# ... then drop the two id columns
df_bandgaps = df_MOFs.merge(df_qmof, left_on='MOF', right_on='qmof_id', how='left').drop(columns = ['MOF', 'qmof_id'])
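
A left merge keeps the row order of the left frame, but any id that is missing from the right frame produces NaN bandgaps, which would cause problems when fitting later. A minimal sketch of one way to check for this, using made-up ids and values (the frames `df_ids` and `df_src` and their contents are hypothetical and not taken from the QMOF data):

```python
import pandas as pd

# Hypothetical stand-ins for df_MOFs and df_qmof; ids and values are invented.
df_ids = pd.DataFrame({"MOF": ["qmof-b", "qmof-a", "qmof-x"]})
df_src = pd.DataFrame({
    "qmof_id": ["qmof-a", "qmof-b", "qmof-c"],
    "outputs.pbe.bandgap": [1.0, 2.0, 3.0],
    "outputs.hse06.bandgap": [1.5, 2.9, 4.1],
})

# indicator=True adds a "_merge" column that records where each row came from.
merged = df_ids.merge(
    df_src, left_on="MOF", right_on="qmof_id", how="left", indicator=True
)

# Rows marked "left_only" had no match in df_src, so their bandgaps are NaN.
unmatched = merged.loc[merged["_merge"] == "left_only", "MOF"].tolist()
n_missing = int(merged["outputs.hse06.bandgap"].isna().sum())

print("row order preserved:", merged["MOF"].tolist() == df_ids["MOF"].tolist())
print("unmatched ids:", unmatched)
print("rows with missing HSE06 bandgap:", n_missing)
```

If `unmatched` is empty on the real data, the merged frame has one complete row per MOF in the training set.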

##### make definitions for cross validation:

In [35]:
target = 'outputs.hse06.bandgap'
kfold = KFold(n_splits = 4, shuffle = True, random_state = 1234)

def get_mean_cv_mse(model, df_feature_set):
    # mean of the per-fold test MSEs (sklearn returns the negated MSE, hence the minus sign)
    return -cross_val_score(
        model,
        X = df_feature_set.drop(columns = [target]), y = df_feature_set[target],
        cv = kfold, scoring = 'neg_mean_squared_error',
        n_jobs = 4
    ).mean()

def get_mean_cv_mse_and_mae(model, df_feature_set):
    # means of the per-fold test MSEs and MAEs, returned as (mse, mae)
    scores = cross_validate(
        model,
        X = df_feature_set.drop(columns = [target]), y = df_feature_set[target],
        cv = kfold, scoring = ['neg_mean_squared_error', 'neg_mean_absolute_error'],
        n_jobs = 4
    )
    return (-scores['test_neg_mean_squared_error'].mean(), -scores['test_neg_mean_absolute_error'].mean())

def get_training_mse_and_mae(model, df_feature_set):
    # fit on the full data set and score on that same data, returned as (mse, mae)
    model.fit(df_feature_set.drop(columns = [target]), df_feature_set[target])
    preds = model.predict(df_feature_set.drop(columns = [target]))
    return (
        mean_squared_error(df_feature_set[target], preds),
        mean_absolute_error(df_feature_set[target], preds)
    )
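
To make the sign handling in the helpers above concrete, here is a small sketch on synthetic data (the arrays `X` and `y` are invented for illustration). With a list of scorers, `cross_validate` returns one array per scorer under the key `test_<scorer name>`, with one entry per fold. The `neg_*` scorers are negated errors, so flipping the sign recovers the usual non-negative MSE and MAE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic data: a noisy linear relationship, purely for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 1))
y = 1.3 * X[:, 0] + 0.5 + rng.normal(scale=0.3, size=100)

kfold = KFold(n_splits=4, shuffle=True, random_state=1234)
scores = cross_validate(
    LinearRegression(), X, y, cv=kfold,
    scoring=["neg_mean_squared_error", "neg_mean_absolute_error"],
)

# One score per fold; negate to get ordinary (non-negative) errors.
fold_mse = -scores["test_neg_mean_squared_error"]
fold_mae = -scores["test_neg_mean_absolute_error"]

print("per-fold MSE:", fold_mse)
print("per-fold MAE:", fold_mae)
print("mean CV MSE:", fold_mse.mean(), "mean CV MAE:", fold_mae.mean())
```

Inspecting the per-fold arrays before averaging is a cheap way to see how much the error varies between folds.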
    

#### baseline
What are the mean cross-validation MSE and MAE for the above k-fold split using just the constant predictor? The training MSE and MAE are reported alongside for comparison.

In [38]:
# create a data frame with just the target and a constant feature
df_constant_only = df_bandgaps[[target]].copy()
df_constant_only['constant'] = np.ones_like(df_bandgaps[target])

# fit a linear model without intercept on the constant feature; its single coefficient is the mean of the training targets
model = LinearRegression(fit_intercept = False)
cv_mse_no_features, cv_mae_no_features = get_mean_cv_mse_and_mae(model, df_constant_only)
train_mse_no_features, train_mae_no_features = get_training_mse_and_mae(model, df_constant_only)

print('Predicting the HSE06 bandgap by its mean:\n')
print(f"{'Cross-Validation MSE = ':>25}{cv_mse_no_features}")
print(f"{'Training MSE = ':>25}{train_mse_no_features}")
print()
print(f"{'Cross-Validation MAE = ':>25}{cv_mae_no_features}")
print(f"{'Training MAE = ':>25}{train_mae_no_features}")

Predicting the HSE06 bandgap by its mean:

  Cross-Validation MSE = 1.1471062514401023
          Training MSE = 1.14624060698406

  Cross-Validation MAE = 0.8076878730740906
          Training MAE = 0.8073838735312149
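
The constant-feature construction works because least squares on a column of ones, with no intercept, is minimized by the mean of the targets. A minimal sketch that checks this on synthetic targets (the array `y` is invented), and compares against scikit-learn's `DummyRegressor`, which is an alternative to the approach used in this notebook rather than what the cell above does:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

# Synthetic targets standing in for the HSE06 bandgaps.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=200)
X_const = np.ones((y.size, 1))  # single constant feature

# Approach used in the notebook: no intercept, one constant column.
lin = LinearRegression(fit_intercept=False).fit(X_const, y)

# Alternative: DummyRegressor ignores the features and predicts the training mean.
dummy = DummyRegressor(strategy="mean").fit(X_const, y)

coef_is_mean = bool(np.isclose(lin.coef_[0], y.mean()))
same_predictions = bool(np.allclose(lin.predict(X_const), dummy.predict(X_const)))

# The training MSE of the mean predictor is the (population) variance of y.
train_mse = float(np.mean((y - lin.predict(X_const)) ** 2))

print("coefficient equals mean:", coef_is_mean)
print("matches DummyRegressor:", same_predictions)
print("training MSE equals variance:", bool(np.isclose(train_mse, np.var(y))))
```

One consequence is that the training MSE reported above can be read as the variance of the HSE06 bandgap over the training set.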


#### goalpost
What are the mean cross-validation MSE and MAE for the above k-fold split using the PBE bandgap as the only feature in a linear model? The training MSE and MAE are reported alongside for comparison.

In [39]:
model = LinearRegression()
cv_mse_PBE, cv_mae_PBE = get_mean_cv_mse_and_mae(model, df_bandgaps)
train_mse_PBE, train_mae_PBE = get_training_mse_and_mae(model, df_bandgaps)

print('Predicting the HSE06 bandgap as a linear function of the PBE bandgap:\n')
print(f"{'Cross-Validation MSE = ':>25}{cv_mse_PBE}")
print(f"{'Training MSE = ':>25}{train_mse_PBE}")
print()
print(f"{'Cross-Validation MAE = ':>25}{cv_mae_PBE}")
print(f"{'Training MAE = ':>25}{train_mae_PBE}")

Predicting the HSE06 bandgap as a linear function of the PBE bandgap:

  Cross-Validation MSE = 0.3382744129126684
          Training MSE = 0.33777515353575815

  Cross-Validation MAE = 0.44612804418753527
          Training MAE = 0.44587849931130136
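
To put the two reference models side by side, the sketch below takes the cross-validation values printed above and computes the relative error reduction of the PBE-based model over the mean baseline, along with the square root of each mean MSE. The variable names are illustrative only, and the square root of the mean cross-validation MSE is not in general identical to the mean of the per-fold RMSEs:

```python
import math

# Cross-validation values copied from the printed outputs above.
cv_mse_baseline, cv_mae_baseline = 1.1471062514401023, 0.8076878730740906
cv_mse_pbe, cv_mae_pbe = 0.3382744129126684, 0.44612804418753527

# Fraction of the baseline error removed by the PBE-based linear model.
mse_reduction = 1 - cv_mse_pbe / cv_mse_baseline
mae_reduction = 1 - cv_mae_pbe / cv_mae_baseline

# Square roots of the mean MSEs, in the same units as the bandgap.
rmse_baseline = math.sqrt(cv_mse_baseline)
rmse_pbe = math.sqrt(cv_mse_pbe)

print(f"MSE reduction over baseline: {mse_reduction:.1%}")
print(f"MAE reduction over baseline: {mae_reduction:.1%}")
print(f"sqrt(CV MSE): baseline {rmse_baseline:.3f}, PBE model {rmse_pbe:.3f}")
```

The same calculation can be applied to any later model to see where it falls between the baseline and the goalpost.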
