<a href="https://colab.research.google.com/github/learningmatter-mit/uvvisml/blob/main/uvvisml_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

The code in this notebook uses Chemprop v1.x, which supports `python>=3.7, <3.9`. When the notebook was first published, Google Colab's default Python version met this requirement. As of July 2025, Colab uses Python 3.11, and it is difficult to run a different Python version in Colab or to change the Colab Python kernel. This notebook therefore includes several workarounds ("hacks") so that the code runs under Python 3.8. If the notebook is executed locally in a Python 3.8 environment rather than in Colab, most of the setup steps and the `%%py38` cell magic at the top of each cell can be removed.
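
To confirm which interpreter a given cell is actually using, the version can be compared against the range quoted above. The following is a minimal sketch; the helper name `is_supported_python` is hypothetical and not part of Chemprop.

```python
import sys

def is_supported_python(version=None):
    """Return True if the version lies in the range stated for Chemprop v1.x (>=3.7, <3.9)."""
    # Fall back to the running interpreter when no version tuple is given.
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 7) <= major_minor < (3, 9)

print(sys.version_info[:2], is_supported_python())
```

Run in a plain Colab cell, this reports on the default interpreter; run in a `%%py38` cell, it reports on the Python 3.8 kernel.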

In [1]:
# Install Python 3.8 (the maximum version supported by Chemprop 1.x) - Colab uses Python 3.11 as of July 2025
# Copied from: https://raw.githubusercontent.com/j3soon/colab-python-version/refs/heads/main/scripts/py38.sh

!wget -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-py38_23.11.0-2-Linux-x86_64.sh
!bash ./miniconda.sh -b -f -p /usr/local
!conda install -q -y jupyter google-colab traitlets=5.5.0 -c conda-forge  # should take ~2 minutes
!python -m ipykernel install --name "py38" --user
!rm ./miniconda.sh

--2025-07-30 14:38:23--  https://repo.anaconda.com/miniconda/Miniconda3-py38_23.11.0-2-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.32.241, 104.16.191.158, 2606:4700::6810:bf9e, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.32.241|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 131844786 (126M) [application/octet-stream]
Saving to: ‘miniconda.sh’


2025-07-30 14:38:24 (130 MB/s) - ‘miniconda.sh’ saved [131844786/131844786]

PREFIX=/usr/local
Unpacking payload ...
                                                                                    
Installing base environment...


Downloading and Extracting Packages:

Preparing transaction: - \ | / - \ | / done
Executing transaction: \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / done
installation finished.
    You currently have a PYTHO

In [2]:
!python --version # Python 3.8.18

Python 3.8.18


In [3]:
!python3.8 -m pip install chemprop==1.7.1 numpy==1.24.4 # must specify numpy version > 1.22 to avoid C-API import error; should take ~3 minutes

Collecting chemprop==1.7.1
  Downloading chemprop-1.7.1-py3-none-any.whl.metadata (74 kB)
Collecting numpy==1.24.4
  Downloading numpy-1.24.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting flask<=2.1.3,>=1.1.2 (from chemprop==1.7.1)
  Downloading Flask-2.1.3-py3-none-any.whl.metadata (3.9 kB)
Collecting Werkzeug<3 (from chemprop==1.7.1)
  Downloading werkzeug-2.3.8-py3-none-any.whl.metadata (4.1 kB)
Collecting hyperopt>=0.2.3 (from chemprop==1.7.1)
  Downloading hyperopt-0.2.7-py2.py3-none-any.whl.metadata (1.7 kB)
Collecting matplotlib>=3.1.3 (from chemprop==1.7.1)
  Downloading matplotlib-3.7.5-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.7 kB)
Collecting pandas-flavor>=0.2.0 (from chem

In [4]:
# Start a persistent Python 3.8 kernel in the background; the cell magic below sends each cell's code to it.
# Without a persistent kernel, variables and package imports would not carry over from cell to cell.

from IPython.core.magic import register_cell_magic
from jupyter_client import KernelManager
import json, atexit, textwrap

km = KernelManager(kernel_name="python3")      # path resolves to python3.8 if on $PATH
km.kernel_cmd = ["python3.8", "-m", "ipykernel_launcher", "-f", "{connection_file}"]
km.start_kernel()

kc = km.client()
kc.start_channels()

@atexit.register
def _clean():
    kc.stop_channels()
    km.shutdown_kernel(now=True)

# Create new cell magic for cells to run their code through a Python 3.8 interpreter instead of the Colab default Python

@register_cell_magic
def py38(line, cell):
    # send code, wait for reply, print results
    msg_id = kc.execute(textwrap.dedent(cell))
    while True:
        msg = kc.get_iopub_msg()
        if msg['parent_header'].get('msg_id') == msg_id:
            if msg['msg_type'] == 'stream':
                print(msg['content']['text'], end="")
            elif msg['msg_type'] == 'error':
                print("\n".join(msg['content']['traceback']))
            elif msg['msg_type'] == 'execute_result':
                print(msg['content']['data']['text/plain'])
            elif msg['msg_type'] == 'status' and msg['content']['execution_state'] == 'idle':
                break
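
The loop in `py38` dispatches on the Jupyter message type. One way to test that dispatch without a running kernel is to factor it into a pure function. The sketch below assumes messages have the same dictionary layout used in the magic above; `render_iopub_message` is a hypothetical name, and the message list is hand-written for illustration.

```python
def render_iopub_message(msg):
    """Map one IOPub message dict to (text_to_print, finished).

    Mirrors the branches of the `py38` magic: stream output, tracebacks,
    execute results, and the 'idle' status that ends the cell.
    """
    msg_type = msg["msg_type"]
    content = msg["content"]
    if msg_type == "stream":
        return content["text"], False
    if msg_type == "error":
        return "\n".join(content["traceback"]) + "\n", False
    if msg_type == "execute_result":
        return content["data"]["text/plain"] + "\n", False
    if msg_type == "status" and content["execution_state"] == "idle":
        return "", True
    return "", False  # any other message type is ignored


# Replay a hand-written message sequence instead of talking to a real kernel.
fake_messages = [
    {"msg_type": "status", "content": {"execution_state": "busy"}},
    {"msg_type": "stream", "content": {"name": "stdout", "text": "hello\n"}},
    {"msg_type": "execute_result", "content": {"data": {"text/plain": "42"}}},
    {"msg_type": "status", "content": {"execution_state": "idle"}},
]
output = ""
for message in fake_messages:
    text, finished = render_iopub_message(message)
    output += text
    if finished:
        break
print(output, end="")
```

Keeping the dispatch separate from the kernel client also makes it easier to add handling for further message types later.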




In [5]:
%%py38
import chemprop  # if this import fails, try sys.path.append('/usr/local/lib/python3.8/site-packages/') first

In [6]:
!git clone https://github.com/learningmatter-mit/uvvisml

Cloning into 'uvvisml'...
remote: Enumerating objects: 147, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 147 (delta 11), reused 6 (delta 6), pack-reused 123 (from 1)
Receiving objects: 100% (147/147), 9.28 MiB | 7.17 MiB/s, done.
Resolving deltas: 100% (56/56), done.


In [7]:
%%py38
import pandas as pd
import os
os.chdir('uvvisml/uvvisml')

In [8]:
!cd uvvisml/uvvisml; bash get_model_files.sh # may take ~2-10 minutes (download speeds from Zenodo are typically ~1-5MB/s)

--2025-07-30 14:44:27--  https://zenodo.org/record/5573027/files/models.tar.gz
Resolving zenodo.org (zenodo.org)... 188.185.45.92, 188.185.48.194, 188.185.43.25, ...
Connecting to zenodo.org (zenodo.org)|188.185.45.92|:443... connected.
HTTP request sent, awaiting response... 301 MOVED PERMANENTLY
Location: /records/5573027/files/models.tar.gz [following]
--2025-07-30 14:44:28--  https://zenodo.org/records/5573027/files/models.tar.gz
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 651010218 (621M) [application/octet-stream]
Saving to: ‘models.tar.gz’


2025-07-30 14:52:32 (1.28 MB/s) - ‘models.tar.gz’ saved [651010218/651010218]



# Data

In [9]:
%%py38
test_file = 'data/splits/lambda_max_abs/deep4chem/group_by_smiles/smiles_target_test.csv'
df = pd.read_csv(test_file)
df

                                                 smiles  ... peakwavs_max
0                   CCN(CC)c1ccc2c(C(F)(F)F)cc(=O)oc2c1  ...        376.0
1                   CCN(CC)c1ccc2c(C(F)(F)F)cc(=O)oc2c1  ...        392.0
2                   CCN(CC)c1ccc2c(C(F)(F)F)cc(=O)oc2c1  ...        396.0
3                   CCN(CC)c1ccc2c(C(F)(F)F)cc(=O)oc2c1  ...        400.0
4                   CCN(CC)c1ccc2c(C(F)(F)F)cc(=O)oc2c1  ...        413.0
...                                                 ...  ...          ...
1705           c1cc2c3ccc[n+]4cccc(c5ccc[n+](c1)c25)c34  ...        424.0
1706           c1cc2c3ccc[n+]4cccc(c5ccc[n+](c1)c25)c34  ...        432.0
1707  COc1cc(C)c(-c2cc(-c3c(C)cc(OC)cc3C)c3ccc4c(-c5...  ...        367.0
1708  N#Cc1c(N2CCCCC2)cc(-c2cccc3ccccc23)c2c1-c1cccc...  ...        358.0
1709        N#Cc1c(N2CCCC2)cc(-c2ccccc2)c2c1Cc1ccccc1-2  ...        382.0

[1710 rows x 3 columns]
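
The preview shows the same SMILES string on several consecutive rows with different peak values. A minimal sketch of summarising that repetition per molecule follows. It uses only rows visible in the preview (the column hidden behind `...` is left out), so the toy frame is an illustration rather than the real test set.

```python
import pandas as pd

# Rows copied from the preview above; the elided middle column is omitted.
toy = pd.DataFrame({
    "smiles": ["CCN(CC)c1ccc2c(C(F)(F)F)cc(=O)oc2c1"] * 5
              + ["c1cc2c3ccc[n+]4cccc(c5ccc[n+](c1)c25)c34"] * 2,
    "peakwavs_max": [376.0, 392.0, 396.0, 400.0, 413.0, 424.0, 432.0],
})

# Number of measurements and spread of peak wavelengths for each molecule.
summary = toy.groupby("smiles")["peakwavs_max"].agg(["count", "min", "max"])
summary["range_nm"] = summary["max"] - summary["min"]
print(summary)
```

On the full frame, the same `groupby` call can be applied to `df` inside a `%%py38` cell.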


# Make Predictions

## Predict experimental peak with model trained on combined training set

**Equivalent to command line:**

python uvvisml/predict.py --test_file uvvisml/data/splits/lambda_max_abs/deep4chem/group_by_smiles/smiles_target_test.csv --property absorption_peak_nm_expt --method chemprop --preds_file test_preds.csv

In [10]:
%%py38
arguments = [
  '--test_path', test_file,
  '--preds_path', '/dev/null',
  '--checkpoint_dir', 'models/lambda_max_abs/chemprop/combined/production/fold_0',
  '--number_of_molecules', '2',
  #'--gpu', '0'
]

args = chemprop.args.PredictArgs().parse_args(arguments)
preds = chemprop.train.make_predictions(args=args)

preds = [x[0] for x in preds]
df['peakwavs_max_pred'] = preds
df

Loading training args
  vars(torch.load(path, map_location=lambda storage, loc: storage)["args"]),
Setting molecule featurization parameters to default.
Loading data
1710it [00:00, 203019.13it/s]
100%|██████████| 1710/1710 [00:00<00:00, 108164.20it/s]
Validating SMILES
Test size = 1,710
  state = torch.load(path, map_location=lambda storage, loc: storage)
  state = torch.load(path, map_location=lambda storage, loc: storage)
Loading pretrained parameter "encoder.encoder.0.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.0.W_i.weight".
Loading pretrained parameter "encoder.encoder.0.W_h.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.bias".
Loading pretrained parameter "encoder.encoder.1.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.1.W_i.weight".
Loading pretrained parameter "encoder.encoder.1.W_h.weight".
Loading pretrained parameter "encoder.encoder.1.W_
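
With both `peakwavs_max` and `peakwavs_max_pred` in the frame, a natural next step is an error summary. The helper below is a sketch and not part of uvvisml; `regression_errors` is a hypothetical name, and the numbers in the example are made up to exercise the function.

```python
import math

def regression_errors(y_true, y_pred):
    """Return (MAE, RMSE) for two equal-length sequences of numbers."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse

# Made-up wavelengths in nm, only to exercise the function.
mae, rmse = regression_errors([400.0, 410.0, 420.0], [403.0, 406.0, 420.0])
print(f"MAE = {mae:.2f} nm, RMSE = {rmse:.2f} nm")
```

In a `%%py38` cell, the two columns can be passed directly: `regression_errors(df['peakwavs_max'], df['peakwavs_max_pred'])`.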

## Predict TDDFT peak in vacuum

**Equivalent to command line:**

python uvvisml/predict.py --test_file uvvisml/data/splits/lambda_max_abs/deep4chem/group_by_smiles/smiles_target_test.csv --property vertical_excitation_eV_tddft --method chemprop --preds_file test_preds.csv

In [11]:
%%py38
arguments = [
  '--test_path', test_file,
  '--preds_path', '/dev/null',
  '--checkpoint_dir', 'models/lambda_max_abs_wb97xd3/chemprop/all_wb97xd3/production/fold_0',
  '--number_of_molecules', '1',
  #'--gpu', '0'
]

args = chemprop.args.PredictArgs().parse_args(arguments)
preds = chemprop.train.make_predictions(args=args)

preds = [x[0] for x in preds] # predictions are in eV
df['peakwavs_max_pred'] = preds
df['peakwavs_max_pred'] = 1240/df['peakwavs_max_pred'] # convert from eV to nm
df

Loading training args
  vars(torch.load(path, map_location=lambda storage, loc: storage)["args"]),
Setting molecule featurization parameters to default.
Loading data
1710it [00:00, 170273.49it/s]
100%|██████████| 1710/1710 [00:00<00:00, 115144.89it/s]
Validating SMILES
Test size = 1,710
  state = torch.load(path, map_location=lambda storage, loc: storage)
  state = torch.load(path, map_location=lambda storage, loc: storage)
Loading pretrained parameter "encoder.encoder.0.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.0.W_i.weight".
Loading pretrained parameter "encoder.encoder.0.W_h.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".

  0%|          | 0/35 [00:00<?, ?it/s]
  3%|▎         | 1/35 [00:06<03:56,  6.95s
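
The factor 1240 in the cell above is the rounded value of hc in eV·nm, so that wavelength (nm) = hc / energy (eV). A short sketch of the conversion with the unrounded constant follows; `ev_to_nm` is a hypothetical helper name.

```python
HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV*nm

def ev_to_nm(energy_ev):
    """Convert a photon energy in eV to a wavelength in nm."""
    if energy_ev <= 0:
        raise ValueError("energy must be positive")
    return HC_EV_NM / energy_ev

for energy in (2.0, 3.0, 4.0):
    print(f"{energy:.1f} eV -> {ev_to_nm(energy):.1f} nm")
```

Rounding hc to 1240 shifts the result by roughly 0.01%, which is about 0.05 nm at 400 nm.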

## Predict experimental peak with model trained on Deep4Chem training set

**Equivalent to command line:**

python uvvisml/predict.py --test_file uvvisml/data/splits/lambda_max_abs/deep4chem/group_by_smiles/smiles_target_test.csv --property absorption_peak_nm_expt --method chemprop --preds_file test_preds.csv --train_dataset deep4chem

In [12]:
%%py38
arguments = [
  '--test_path', test_file,
  '--preds_path', '/dev/null',
  '--checkpoint_dir', 'models/lambda_max_abs/chemprop/deep4chem/production/fold_0',
  '--number_of_molecules', '2',
  #'--gpu', '0'
]

args = chemprop.args.PredictArgs().parse_args(arguments)
preds = chemprop.train.make_predictions(args=args)

preds = [x[0] for x in preds]
df['peakwavs_max_pred'] = preds
df

Loading training args
  vars(torch.load(path, map_location=lambda storage, loc: storage)["args"]),
Setting molecule featurization parameters to default.
Loading data
1710it [00:00, 159019.57it/s]
100%|██████████| 1710/1710 [00:00<00:00, 101888.82it/s]
Validating SMILES
Test size = 1,710
  state = torch.load(path, map_location=lambda storage, loc: storage)
  state = torch.load(path, map_location=lambda storage, loc: storage)
Loading pretrained parameter "encoder.encoder.0.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.0.W_i.weight".
Loading pretrained parameter "encoder.encoder.0.W_h.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.bias".
Loading pretrained parameter "encoder.encoder.1.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.1.W_i.weight".
Loading pretrained parameter "encoder.encoder.1.W_h.weight".
Loading pretrained parameter "encoder.encoder.1.W_o.weight".
Loading p
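
The prediction cells in this notebook build the same argument list each time and vary only the checkpoint directory, the number of molecules, and a few optional flags. A small helper can make that pattern explicit. This is a sketch: `build_predict_arguments` is a hypothetical name, and it only emits flags that already appear in this notebook.

```python
def build_predict_arguments(test_path, checkpoint_dir, number_of_molecules,
                            preds_path="/dev/null", features_path=None,
                            ensemble_variance=False):
    """Assemble the list passed to chemprop.args.PredictArgs().parse_args()."""
    arguments = [
        "--test_path", test_path,
        "--preds_path", preds_path,
        "--checkpoint_dir", checkpoint_dir,
        "--number_of_molecules", str(number_of_molecules),
    ]
    if features_path is not None:
        arguments += ["--features_path", features_path]
    if ensemble_variance:
        arguments.append("--ensemble_variance")
    return arguments

test_file = "data/splits/lambda_max_abs/deep4chem/group_by_smiles/smiles_target_test.csv"
arguments = build_predict_arguments(
    test_file,
    "models/lambda_max_abs/chemprop/deep4chem/production/fold_0",
    number_of_molecules=2,
)
print(arguments)
```

The resulting list can be passed to `chemprop.args.PredictArgs().parse_args(arguments)` exactly as in the cells above.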

## Predict experimental peak with multi-fidelity model

**Equivalent to command line:**

python uvvisml/predict.py --test_file uvvisml/data/splits/lambda_max_abs/deep4chem/group_by_smiles/smiles_target_test.csv --property absorption_peak_nm_expt --method chemprop_tddft --preds_file test_preds.csv

In [13]:
%%py38
# TDDFT Predictions
arguments = [
  '--test_path', test_file,
  '--preds_path', 'test_tddft_preds.csv',
  '--checkpoint_dir', 'models/lambda_max_abs_wb97xd3/chemprop/all_wb97xd3/production/fold_0',
  '--number_of_molecules', '1',
  #'--gpu', '0'
]

args = chemprop.args.PredictArgs().parse_args(arguments)
_ = chemprop.train.make_predictions(args=args)

# Convert Predictions to Features File
!python models/tddft_to_features_file.py

# Experimental Predictions
arguments = [
  '--test_path', test_file,
  '--preds_path', '/dev/null',
  '--checkpoint_dir', 'models/lambda_max_abs/chemprop_tddft/combined/production/fold_0',
  '--number_of_molecules', '2',
  '--features_path', 'features_test.csv',
  #'--gpu', '0'
]

args = chemprop.args.PredictArgs().parse_args(arguments)
preds = chemprop.train.make_predictions(args=args)

preds = [x[0] for x in preds]
df['peakwavs_max_pred'] = preds
df

Loading training args
  vars(torch.load(path, map_location=lambda storage, loc: storage)["args"]),
Setting molecule featurization parameters to default.
Loading data
1710it [00:00, 184785.38it/s]
100%|██████████| 1710/1710 [00:00<00:00, 111460.49it/s]
Validating SMILES
Test size = 1,710
  state = torch.load(path, map_location=lambda storage, loc: storage)
  state = torch.load(path, map_location=lambda storage, loc: storage)
Loading pretrained parameter "encoder.encoder.0.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.0.W_i.weight".
Loading pretrained parameter "encoder.encoder.0.W_h.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".

  0%|          | 0/35 [00:00<?, ?it/s]
  3%|▎         | 1/35 [00:07<03:58,  7.02s
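
The second stage pairs each row of the test file with a row of `features_test.csv`. A quick sanity check before predicting is to confirm that both files have the same number of data rows. The sketch below assumes that each file has a single header line and that rows are matched by position; neither assumption is confirmed by this notebook, and the helper names are hypothetical.

```python
import csv
import os
import tempfile

def count_data_rows(csv_path):
    """Count non-empty rows in a CSV file, excluding the header line."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header
        return sum(1 for row in reader if row)

def rows_match(test_csv, features_csv):
    """Return True if both files contain the same number of data rows."""
    return count_data_rows(test_csv) == count_data_rows(features_csv)

# Demonstration with two throwaway files (contents are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    test_csv = os.path.join(tmp, "test.csv")
    feats_csv = os.path.join(tmp, "features.csv")
    with open(test_csv, "w", newline="") as handle:
        handle.write("smiles,target\nC,1.0\nCC,2.0\n")
    with open(feats_csv, "w", newline="") as handle:
        handle.write("feature\n0.5\n0.7\n")
    aligned = rows_match(test_csv, feats_csv)
print(aligned)
```

In this notebook the corresponding call would be `rows_match(test_file, 'features_test.csv')`, run after the features file has been written.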

## Predict experimental peak with model trained on combined training set (with ensemble variance)

**Equivalent to command line:**

python uvvisml/predict.py --test_file uvvisml/data/splits/lambda_max_abs/deep4chem/group_by_smiles/smiles_target_test.csv --property absorption_peak_nm_expt --method chemprop --preds_file test_preds.csv

In [14]:
%%py38
arguments = [
  '--test_path', test_file,
  '--preds_path', 'test_preds.csv',
  '--checkpoint_dir', 'models/lambda_max_abs/chemprop/combined/production/fold_0',
  '--number_of_molecules', '2',
  '--ensemble_variance',
  #'--gpu', '0'
]

args = chemprop.args.PredictArgs().parse_args(arguments)
_ = chemprop.train.make_predictions(args=args)

df = pd.read_csv('test_preds.csv')
df

Loading training args
  vars(torch.load(path, map_location=lambda storage, loc: storage)["args"]),
Setting molecule featurization parameters to default.
Loading data
1710it [00:00, 145052.38it/s]
100%|██████████| 1710/1710 [00:00<00:00, 7184.86it/s]
Validating SMILES
Test size = 1,710
  state = torch.load(path, map_location=lambda storage, loc: storage)
  state = torch.load(path, map_location=lambda storage, loc: storage)
Loading pretrained parameter "encoder.encoder.0.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.0.W_i.weight".
Loading pretrained parameter "encoder.encoder.0.W_h.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.bias".
Loading pretrained parameter "encoder.encoder.1.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.1.W_i.weight".
Loading pretrained parameter "encoder.encoder.1.W_h.weight".
Loading pretrained parameter "encoder.encoder.1.W_o.weight".
Loading pre
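
Once `test_preds.csv` has been read back, the variance can be turned into a standard deviation and used to rank predictions by uncertainty. The name of the variance column depends on the Chemprop version and is not shown in this notebook, so the sketch below works on plain sequences; `most_uncertain` is a hypothetical helper and the values are made up.

```python
import math

def most_uncertain(smiles, variances, k=3):
    """Return the k (smiles, std) pairs with the largest ensemble standard deviation."""
    pairs = [(s, math.sqrt(v)) for s, v in zip(smiles, variances)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up variances in nm^2 for three placeholder molecules.
ranked = most_uncertain(["C", "CC", "CCC"], [4.0, 25.0, 9.0], k=2)
print(ranked)
```

To apply this to the real output, inspect `df.columns` first to find the variance column, then pass that column together with the SMILES column.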