The 'ft_pretrained_100k.pth' model in /streamlit_utils/models/ was pretrained on a large number of possibly correct SMILES strings and then fine-tuned on the 303 molecules in the dataset. Although this model has learned SMILES structure, it still needs to be fine-tuned to predict a specific target variable.

The column indices and names of the variables that can be predicted:
- [6] OB(CO2) : oxygen balance with respect to CO2
- [7] r0 : density
- [8] HGAS : gas-phase formation enthalpy
- [9] HSUB : sublimation enthalpy
- [10] Q : heat of explosion
- [11] V : detonation velocity
- [12] p : detonation pressure
- [13] EG : Gurney energy
- [14] h50(obs) : measured drop-weight impact height
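
The list above can be kept as a small lookup table so that an invalid `target_var_index` is caught before training starts. This is only a sketch: `TARGET_COLUMNS` and `target_name` are hypothetical helpers that are not part of `launcher_of_sm`, and the names are copied from the list rather than read from 'data/Dm.csv'.

```python
# Index-to-column mapping copied from the list above (not read from data/Dm.csv).
TARGET_COLUMNS = {
    6: "OB(CO2)",
    7: "r0",
    8: "HGAS",
    9: "HSUB",
    10: "Q",
    11: "V",
    12: "p",
    13: "EG",
    14: "h50(obs)",
}


def target_name(index: int) -> str:
    """Return the column name for a target index, or raise for an invalid one."""
    if index not in TARGET_COLUMNS:
        valid = ", ".join(str(i) for i in sorted(TARGET_COLUMNS))
        raise ValueError(f"index {index} is not a predictable column; choose one of: {valid}")
    return TARGET_COLUMNS[index]


print(target_name(11))  # V
```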

In [None]:
target_var_index = 11
model_name = 'detonation_velocity'

# Train

In [None]:
from launcher_of_sm import train_predictor

k = 10 # k-fold cross validation

train_predictor(data_path='data/Dm.csv', 
                pretrained_path='streamlit_utils/models/ft_pretrained_100k.pth', 
                target_index=target_var_index, 
                epochs=100, 
                k=k, # number of cross-validation folds
                SMILE_enumeration_level=50, # determines amount of training data - use 50/100
                save_filename=model_name,
                )
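
For readers unfamiliar with k-fold cross validation, the sketch below shows how 303 molecules would be divided into 10 folds. It is a generic illustration only: how `train_predictor` actually splits or shuffles the data is not shown in this notebook, and `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous validation folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, val))
        start += size
    return folds


folds = kfold_indices(303, 10)
print([len(val) for _, val in folds])  # [31, 31, 31, 30, 30, 30, 30, 30, 30, 30]
```

Each fold produces one model, which is why the selection step that follows loops over `k` candidate files.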

### Select best model

In [None]:
import pandas as pd
import os

BEST_MODEL_CRITERIA = 'r2'

# Identify the best-performing fold (assumes one row per fold, in fold order,
# and that a higher value of the criterion is better)
training_performance = pd.read_csv(f'training_records/{model_name}_Train_Performance.csv')
best_index = training_performance[BEST_MODEL_CRITERIA].idxmax()

# Delete the models from all other folds
for i in range(k):
    if i != best_index and os.path.exists(f'training_records/{model_name}-{i}.pth'):
        os.remove(f'training_records/{model_name}-{i}.pth')

# Delete the performance record
if os.path.exists(f'training_records/{model_name}_Train_Performance.csv'):
    os.remove(f'training_records/{model_name}_Train_Performance.csv')

# Move the best model to the streamlit models folder
if os.path.exists(f'training_records/{model_name}-{best_index}.pth'):
    os.rename(f'training_records/{model_name}-{best_index}.pth', f'streamlit_utils/models/{model_name}.pth')
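
The selection step can be tried on made-up numbers before running it on real training records. In this sketch the `rmse` column and the scores are invented for illustration, and `best_fold` is a hypothetical helper. It also shows that an error-type criterion needs the minimum rather than the maximum.

```python
import pandas as pd

# Made-up scores for three folds; the real performance file may have other columns.
performance = pd.DataFrame({
    "r2": [0.71, 0.84, 0.79],
    "rmse": [0.52, 0.38, 0.44],
})


def best_fold(df, criterion, higher_is_better=True):
    """Return the row label of the best fold for a single criterion column."""
    column = df[criterion]
    return int(column.idxmax() if higher_is_better else column.idxmin())


print(best_fold(performance, "r2"))                            # 1
print(best_fold(performance, "rmse", higher_is_better=False))  # 1
```

Selecting the column before calling `idxmax` avoids computing the maximum over every column in the file, which can fail if any column is non-numeric.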


# Prediction
The input needs to be a CSV file with one column of SMILES strings.

Only one variable is predicted at a time, as each model is separate.

The Synthetic Accessibility (SA) scores are always calculated and written as well.
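
A one-column input file can be written with the standard library, as sketched below. Several details are assumptions: the file name 'example_molecules.csv', the `write_smiles_csv` helper, and the presence of a header row are illustrative, and whether `score` expects a header is not stated here. The molecules (nitromethane, TNT, RDX) are examples only and are not the contents of 'three_isomers.csv'.

```python
import csv


def write_smiles_csv(path, smiles, header="SMILES"):
    """Write SMILES strings to a single-column CSV, one molecule per row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([header])  # assumption: the reader expects a header row
        for s in smiles:
            writer.writerow([s.strip()])


# Illustrative molecules only.
example_smiles = [
    "C[N+](=O)[O-]",                                       # nitromethane
    "Cc1c(cc(cc1[N+](=O)[O-])[N+](=O)[O-])[N+](=O)[O-]",   # TNT
    "C1N(CN(CN1[N+](=O)[O-])[N+](=O)[O-])[N+](=O)[O-]",    # RDX
]

write_smiles_csv("example_molecules.csv", example_smiles)
```

With the SMILES in the first column, the column index passed to `score` would be 0, matching `SMILE_index_1=0` in the call that follows.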

In [None]:
from launcher_of_sm import score

molecules_file = 'three_isomers.csv'
save_filename = 'three_isomers_V.csv' # note - only one variable is predicted at a time

# Change these if you wish to predict a different variable from the one trained above
model_file = f'streamlit_utils/models/{model_name}.pth'
predict_var_index = target_var_index 

score(train_data_path='data/Dm.csv', 
      data_path=molecules_file, 
      model_path=model_file, 
      saving_path=save_filename, 
      SMILE_index_1=0, 
      SMILE_index_2=0, 
      target_index=predict_var_index)