# Training a Model
The 'ft_pretrained_100k.pth' model in /streamlit_utils/models/ was pretrained on a large number of possibly correct SMILES strings, then fine-tuned on the 303 molecules in the dataset. Although this model has learned SMILES structure, it still needs to be fine-tuned to predict a target variable.

The following columns, and therefore variables, can be predicted:
- OB(CO2) : oxygen balance with respect to CO2
- r0 : density
- HGAS : gas-phase formation enthalpy
- HSUB : sublimation enthalpy
- Q : heat of explosion
- V : detonation velocity
- p : detonation pressure
- EG : Gurney energy
- h50(obs) : measured drop-weight impact height
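The training call later in this notebook selects its target by column index (`target_index`) rather than by name, so it can help to look the index up from the CSV header instead of counting columns by hand. The following is a minimal sketch using only the standard library. The helper `column_index` and the sample header are hypothetical; the real header of data/Dm.csv may order its columns differently.

```python
import csv
import io


def column_index(csv_file, column_name):
    """Return the zero-based position of column_name in the CSV header row."""
    header = next(csv.reader(csv_file))
    return header.index(column_name)


# Hypothetical excerpt for illustration only; not the real layout of data/Dm.csv.
sample = io.StringIO("name,SMILES,OB(CO2),r0\nexample,C,0.0,1.0\n")
r0_index = column_index(sample, "r0")
print(r0_index)
```

With the real file, the same helper could be called as `column_index(open('data/Dm.csv', newline=''), 'r0')`, assuming the file starts with a header row.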

In [1]:
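target_var_index = 7 # Column index of the target in data/Dm.csv (originally noted as detonation velocity; the run below reports r0, so confirm the index)
model_name = 'density' # Filename stem for the saved models; should name the same property as target_var_index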

k = 10 # k-fold cross validation
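To make the role of `k` concrete, here is a rough sketch of how k-fold cross-validation partitions a dataset: each fold is held out once for validation while the remaining folds are used for training. This is an illustration under assumptions, not the splitting logic of `train_predictor`, which may shuffle the data or split before SMILES enumeration. The helper `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, valid_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


# 303 molecules and k = 10, as in this notebook.
folds = list(k_fold_indices(303, 10))
for fold_number, (train, valid) in enumerate(folds):
    print(fold_number, len(train), len(valid))
```

Under this scheme each molecule is used for validation exactly once, which is why one model per fold is trained.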

In [2]:
from launcher_of_sm import train_predictor

train_predictor(data_path='data/Dm.csv', 
                pretrained_path='streamlit_utils/models/ft_pretrained_100k.pth', 
                target_index=target_var_index, 
                epochs=100, 
                k=k, # number of folds for cross validation
                SMILE_enumeration_level=50, # determines the amount of training data; use 50 or 100
                save_filename=model_name,
                )

Tokens: [' ', '#', '(', ')', '-', '/', '1', '2', '3', '4', '=', 'C', 'Cl', 'F', 'N', 'O', '[C@@H]', '[N+]', '[N-]', '[O-]', '[n+]', '[nH]', '^', 'c', 'n', 'o']
Property name: r0


Training density-0: 100%|██████████| 100/100 [00:52<00:00,  1.92epoch/s]
Training density-1: 100%|██████████| 100/100 [00:57<00:00,  1.74epoch/s]
Training density-2: 100%|██████████| 100/100 [00:58<00:00,  1.72epoch/s]
Training density-3: 100%|██████████| 100/100 [00:58<00:00,  1.70epoch/s]
Training density-4: 100%|██████████| 100/100 [00:58<00:00,  1.71epoch/s]
Training density-5: 100%|██████████| 100/100 [00:59<00:00,  1.69epoch/s]
Training density-6: 100%|██████████| 100/100 [00:57<00:00,  1.73epoch/s]
Training density-7: 100%|██████████| 100/100 [00:58<00:00,  1.72epoch/s]
Training density-8: 100%|██████████| 100/100 [00:58<00:00,  1.70epoch/s]
Training density-9: 100%|██████████| 100/100 [00:57<00:00,  1.74epoch/s]


--- Final Results ---
mae 0.01084136874445023
rmse 0.01882944650298795
r2 0.9539277071865216
Time: 734.3068132400513
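For reference, the three reported metrics can be computed from true and predicted values as sketched below. This assumes the standard definitions of mean absolute error, root mean squared error, and the coefficient of determination; how `train_predictor` aggregates them across folds is not shown in this notebook. The helper `regression_metrics` and the sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R^2 using their standard definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Made-up values for illustration; not taken from the dataset.
metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(metrics)
```

Note that lower is better for `mae` and `rmse`, while higher is better for `r2`.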


### Select best model

In [3]:
import pandas as pd
import os

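BEST_MODEL_CRITERIA = 'r2' # Column of the performance CSV used to rank the folds

# Identify the best-performing fold (assumes one row per fold, in fold order)
training_performance = pd.read_csv(f'training_records/{model_name}_Train_Performance.csv')
scores = training_performance[BEST_MODEL_CRITERIA]
# Higher is better for r2; lower is better for error metrics
best_index = scores.idxmax() if BEST_MODEL_CRITERIA == 'r2' else scores.idxmin()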

# Delete the models from the other folds, then move the best one to the streamlit models folder
for i in range(k):
    if i != best_index and os.path.exists(f'training_records/{model_name}-{i}.pth'):
        os.remove(f'training_records/{model_name}-{i}.pth')

if os.path.exists(f'training_records/{model_name}_Train_Performance.csv'):
    os.remove(f'training_records/{model_name}_Train_Performance.csv')

if os.path.exists(f'training_records/{model_name}-{best_index}.pth'):
    os.rename(f'training_records/{model_name}-{best_index}.pth', f'streamlit_utils/models/{model_name}.pth')
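The cell above deletes the other folds before it checks that the best model file exists, so a wrong `best_index` could leave no model at all. A more defensive ordering is sketched below: move the best model first, then clean up. The function `promote_best_model` is hypothetical, and the demonstration runs in a temporary directory with placeholder files rather than real checkpoints.

```python
import os
import shutil
import tempfile


def promote_best_model(records_dir, models_dir, model_name, best_index, k):
    """Move the best fold's model into models_dir, then delete the other folds."""
    best_path = os.path.join(records_dir, f"{model_name}-{best_index}.pth")
    if not os.path.exists(best_path):
        # Fail before anything is deleted.
        raise FileNotFoundError(best_path)
    os.makedirs(models_dir, exist_ok=True)
    target = os.path.join(models_dir, f"{model_name}.pth")
    shutil.move(best_path, target)
    for i in range(k):
        other = os.path.join(records_dir, f"{model_name}-{i}.pth")
        if os.path.exists(other):
            os.remove(other)
    return target


# Demonstration with placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    records = os.path.join(root, "training_records")
    models = os.path.join(root, "models")
    os.makedirs(records)
    for i in range(3):
        with open(os.path.join(records, f"density-{i}.pth"), "w") as fh:
            fh.write(f"fold {i}")
    promoted = promote_best_model(records, models, "density", best_index=1, k=3)
    remaining = sorted(os.listdir(records))
    with open(promoted) as fh:
        promoted_content = fh.read()
```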


# Prediction
Input needs to be a CSV with one column of SMILES strings.

Only one variable is predicted at a time, as each model is separate.

The Synthetic Accessibility (SA) scores are always calculated and written as well.
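An input file in this shape can be produced with the standard `csv` module, as in the minimal sketch below. The helper `write_smiles_csv` is hypothetical, the three C3H8O isomers are illustrative and are not the contents of `three_isomers.csv`, and whether `score` expects a header row is an open assumption (none is written here).

```python
import csv
import io


def write_smiles_csv(smiles_list, csv_file):
    """Write one SMILES string per row, in a single column, with no header."""
    writer = csv.writer(csv_file)
    for smiles in smiles_list:
        writer.writerow([smiles])


# Propan-1-ol, propan-2-ol and methoxyethane, chosen only as an example.
isomers = ["CCCO", "CC(C)O", "CCOC"]
buffer = io.StringIO()
write_smiles_csv(isomers, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open(path, 'w', newline='')` in place of the in-memory buffer.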

In [4]:
from launcher_of_sm import score

molecules_file = 'three_isomers.csv'
save_filename = 'three_isomers_density.csv' # note - only one variable is predicted at a time

# Change these if you wish to predict a different variable from the one trained above
model_file = f'streamlit_utils/models/{model_name}.pth'
predict_var_index = target_var_index 

score(train_data_path='data/Dm.csv', 
      data_path=molecules_file, 
      model_path=model_file, 
      saving_path=save_filename, 
      SMILE_index_1=0, 
      SMILE_index_2=0, 
      target_index=predict_var_index)

Excluding smile due to an unknown token. O=[N+](C1=NON=C1[N]2=C([N+]([O-])=O)NC([N+]([O-])=O)=N2)[O-]
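The message above means the molecule contains a token that is absent from the vocabulary printed during training. The sketch below shows one way to find the offending token in advance. The vocabulary is copied from the `Tokens:` line in the training output; the regex tokenizer is an assumption and may not match the project's own tokenizer exactly.

```python
import re

# Vocabulary copied from the "Tokens:" line printed during training.
VOCAB = {' ', '#', '(', ')', '-', '/', '1', '2', '3', '4', '=', 'C', 'Cl',
         'F', 'N', 'O', '[C@@H]', '[N+]', '[N-]', '[O-]', '[n+]', '[nH]',
         '^', 'c', 'n', 'o'}

# Assumed tokenization: bracket atoms, two-letter halogens, then single characters.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Cl|Br|.")


def tokenize(smiles):
    """Split a SMILES string into tokens using the assumed pattern."""
    return TOKEN_PATTERN.findall(smiles)


def unknown_tokens(smiles, vocab=VOCAB):
    """Return the sorted set of tokens that are not in the vocabulary."""
    return sorted({token for token in tokenize(smiles) if token not in vocab})


excluded = "O=[N+](C1=NON=C1[N]2=C([N+]([O-])=O)NC([N+]([O-])=O)=N2)[O-]"
print(unknown_tokens(excluded))
```

Under this tokenization the only token outside the vocabulary is the uncharged bracket atom `[N]`.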
