# Porting the latest academic code to AWS SageMaker

Artificial intelligence and machine learning (AI/ML) have long been prominent topics in academia. Many researchers share their work as code repositories and Jupyter notebooks, but running them locally can be challenging. This notebook shows how to port such code to AWS SageMaker, a fully managed service for building, training, and deploying machine learning models. The learning curve for SageMaker can be steep, but it is worth the effort: it simplifies setting up environments and dependencies and keeps them consistent with the original academic code. In my experience, SageMaker makes experimenting with, training, and deploying the latest ML codebases much easier than a local setup.

## Step 1: Choose a codebase to test

I have recently been teaching myself about generative models for molecular design. There are several popular codebases for this purpose, such as REINVENT, MolGAN, and JT-VAE. Recently, a new codebase called `s4-for-de-novo-drug-design` has been published. You can find the publication [here](https://www.nature.com/articles/s41467-024-50469-9) and the GitHub repository [here](https://github.com/molML/s4-for-de-novo-drug-design/tree/main). I have been reading reviews by Prof. Grisoni's group and wanted to test their code to gain a better understanding of their work. In this notebook, I will demonstrate how to port their code repository into AWS SageMaker and run examples for pre-training, fine-tuning, and molecular generation.

## Step 2: Set up AWS SageMaker

To set up AWS SageMaker, you need to have an AWS account. If you don't have one, you can create one [here](https://aws.amazon.com/). Once you have an AWS account, you can access AWS SageMaker by logging into the AWS Management Console.

## Step 3: Create a SageMaker notebook instance

To create a SageMaker notebook instance, follow these steps:

1. Open the AWS Management Console and search for SageMaker.

2. Click on the SageMaker service.

3. Click on the Notebook instances tab.

4. Click on the Create notebook instance button.

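5. Fill in the details for your notebook instance. Choose an instance type such as ml.t2.medium, ml.t2.large, or ml.t2.xlarge; a small CPU instance is enough, because the training jobs later in this notebook run on a separate GPU instance (ml.g4dn.2xlarge). Also choose the IAM role that the notebook instance will use.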

6. Click on the Create notebook instance button.
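
The console steps above can also be scripted. The following is a minimal sketch, assuming `boto3` is installed and AWS credentials are configured; the instance name, role ARN, and volume size are hypothetical placeholders, not values from this notebook.

```python
def notebook_instance_request(name, role_arn, instance_type="ml.t2.medium", volume_gb=20):
    """Build the keyword arguments for SageMaker's CreateNotebookInstance API."""
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,
        "VolumeSizeInGB": volume_gb,
    }


def create_notebook_instance(request):
    """Submit the request; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request can be built without AWS access

    return boto3.client("sagemaker").create_notebook_instance(**request)


# Hypothetical name and role ARN, for illustration only.
request = notebook_instance_request(
    name="s4dd-porting-demo",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)
print(request)
# create_notebook_instance(request)  # uncomment to actually create the instance
```

Building the request separately from the API call makes it easy to review the settings before anything is created in your account.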

## Step 4: Clone the code repository

In [None]:
%ls
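
The notebook does not show the clone command itself, so here is one possible sketch. It assumes `git` is available on the notebook instance; the clone URL is derived from the repository link in Step 1, and the shallow clone (`--depth 1`) and the target directory name are my own choices.

```python
import subprocess
from pathlib import Path

# Clone URL derived from the GitHub link given in Step 1.
REPO_URL = "https://github.com/molML/s4-for-de-novo-drug-design.git"


def clone_command(repo_url, target_dir):
    """Return the git command for a shallow clone into target_dir."""
    return ["git", "clone", "--depth", "1", repo_url, str(target_dir)]


def clone_if_missing(repo_url, target_dir):
    """Clone the repository unless target_dir already exists; return True if cloned."""
    if Path(target_dir).exists():
        return False
    subprocess.run(clone_command(repo_url, target_dir), check=True)
    return True


print(" ".join(clone_command(REPO_URL, "s4-for-de-novo-drug-design")))
# clone_if_missing(REPO_URL, "s4-for-de-novo-drug-design")  # uncomment to clone
```

In a notebook cell, `!git clone <url>` does the same thing; the function form simply avoids cloning twice when the cell is re-run.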

In [1]:
import boto3
import sagemaker
from time import gmtime, strftime, sleep
from sagemaker.deserializers import CSVDeserializer
from sagemaker.serializers import CSVSerializer
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import (
    ProcessingInput, 
    ProcessingOutput, 
    ScriptProcessor,
    FrameworkProcessor
)
from sagemaker.pytorch.processing import PyTorchProcessor

from sagemaker.inputs import TrainingInput

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import (
    ProcessingStep, 
    TuningStep,
    TrainingStep, 
    CreateModelStep
)
from sagemaker.workflow.check_job_config import CheckJobConfig
from sagemaker.workflow.parameters import (
    ParameterInteger, 
    ParameterFloat, 
    ParameterString, 
    ParameterBoolean
)
from sagemaker.workflow.clarify_check_step import (
    ModelBiasCheckConfig, 
    ClarifyCheckStep, 
    ModelExplainabilityCheckConfig
)
from sagemaker import Model
from sagemaker.inputs import CreateModelInput
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.conditions import (
    ConditionGreaterThan,
    ConditionLessThan,
    ConditionGreaterThanOrEqualTo
)
from sagemaker.workflow.pipeline_experiment_config import PipelineExperimentConfig
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import (
    Join,
    JsonGet
)

from sagemaker.lambda_helper import Lambda

from sagemaker.model_metrics import (
    MetricsSource, 
    ModelMetrics, 
    FileSource
)
from sagemaker.drift_check_baselines import DriftCheckBaselines

from sagemaker.image_uris import retrieve
from sagemaker.pytorch import PyTorch

# IAM client, used below to look up the SageMaker execution role
iam = boto3.client('iam')

sagemaker.__version__

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/ldodda/Library/Application Support/sagemaker/config.yaml


'2.232.2'

In [3]:
sm_role = iam.get_role(RoleName='AmazonSageMaker-ExecutionRole-20211206T145568')['Role']['Arn']

INFO:botocore.tokens:Loading cached SSO token for discovery_account


In [3]:
#!aws s3 cp ./datasets s3://nimbustx-sagemaker/denovo_design/s4dd/ --recursive
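
The commented command above copies the local `datasets` folder to S3. A rough Python equivalent is sketched below, assuming `boto3` and valid credentials; the helper names are hypothetical, and the local layout (`datasets/chemblv31/train.zip` and so on) is inferred from the paths the training scripts read.

```python
from pathlib import Path


def plan_uploads(local_root, key_prefix):
    """Map every file under local_root to the S3 key it would be uploaded to."""
    root = Path(local_root)
    prefix = key_prefix.strip("/")
    return {
        path: f"{prefix}/{path.relative_to(root).as_posix()}"
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def upload_directory(local_root, bucket, key_prefix):
    """Upload the planned files; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the plan can be inspected without AWS access

    s3 = boto3.client("s3")
    for path, key in plan_uploads(local_root, key_prefix).items():
        s3.upload_file(str(path), bucket, key)


# Bucket and prefix taken from the commented command above.
# upload_directory("./datasets", "nimbustx-sagemaker", "denovo_design/s4dd")
```

Printing `plan_uploads("./datasets", "denovo_design/s4dd")` first is a cheap way to confirm that the keys match the paths the training scripts expect.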

In [4]:
%%writefile scripts/pretraining.py
from s4dd import S4forDenovoDesign
from argparse import ArgumentParser
import os
if __name__ == "__main__":
    parser = ArgumentParser('Pretrain an S4 model on a small subset of ChEMBL')
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    #parser.add_argument("--full-data", type=str, default=os.environ["SM_CHANNEL_DATA_FULL"])
    parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test", type=str, default=os.environ["SM_CHANNEL_TEST"])
    args = parser.parse_args().__dict__

    # Create an S4 model
    s4 = S4forDenovoDesign(
        n_max_epochs=3,  # For demonstration only. Set this to a (much) higher value for actual training. Default: 400.
        batch_size=64,  # For demonstration only. The value in the paper is 2048.
        device="cuda",  # Replace this with "cpu" if you don't have a CUDA-enabled GPU.
    )
    # Pretrain the model on a small subset of ChEMBL
    s4.train(
        training_molecules_path=f"{args['train']}/train.zip",
        val_molecules_path=f"{args['test']}/valid.zip",
    )
    # Save the model
    s4.save(args['model_dir'])

Overwriting scripts/pretraining.py
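
The script reads its default paths from `SM_*` environment variables, which exist only inside a SageMaker training container; outside one, `os.environ["SM_MODEL_DIR"]` raises a `KeyError`. The sketch below shows one way to make the same parser usable for a local smoke test. The `build_parser` helper and the fallback table are my own additions; the fallback values are the paths SageMaker normally uses inside the container.

```python
import os
from argparse import ArgumentParser

# Paths that SageMaker normally exposes inside the training container.
SAGEMAKER_DEFAULTS = {
    "SM_MODEL_DIR": "/opt/ml/model",
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_CHANNEL_TEST": "/opt/ml/input/data/test",
    "SM_OUTPUT_DATA_DIR": "/opt/ml/output/data",
}


def build_parser(env=None):
    """Create the argument parser, falling back to the usual container paths."""
    env = os.environ if env is None else env

    def default(name):
        return env.get(name, SAGEMAKER_DEFAULTS[name])

    parser = ArgumentParser("Pretrain an S4 model on a small subset of ChEMBL")
    parser.add_argument("--model-dir", type=str, default=default("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=default("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=default("SM_CHANNEL_TEST"))
    parser.add_argument("--output", type=str, default=default("SM_OUTPUT_DATA_DIR"))
    return parser


# Passing an empty list keeps Jupyter's own command-line arguments out of the parse.
args = vars(build_parser().parse_args([]))
print(args)
```

For a local run, either export the `SM_*` variables to point at local folders or pass `--train`, `--test`, and `--model-dir` explicitly.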


The next script, `scripts/all_together.py`, runs the whole workflow in a single SageMaker training job: it pretrains an S4 model from the `s4dd` library, fine-tunes it, and generates new molecules. It runs as a standalone program and uses `argparse` to read its input and output directories from the command line.

The script imports `S4forDenovoDesign` from `s4dd`, `ArgumentParser` from `argparse`, and `os` for access to environment variables. All of the work happens inside the `if __name__ == "__main__":` block, so it runs only when the script is executed directly.

An `ArgumentParser` defines four arguments: the model directory, the training data, the test data, and the output directory. Each one defaults to an environment variable that SageMaker sets inside the training container (`SM_MODEL_DIR`, `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST`, and `SM_OUTPUT_DATA_DIR`). The parsed arguments are stored in a dictionary named `args`.

Next, the script creates an `S4forDenovoDesign` instance with settings chosen for demonstration: a small number of epochs (`n_max_epochs=3`), a reduced batch size (`batch_size=128`), and `device="cuda"` for GPU acceleration. Set the device to `"cpu"` if no CUDA-enabled GPU is available.

The script then pretrains the S4 model on a small subset of the ChEMBL dataset. The training and validation paths are built from the parsed arguments, the model is trained with the `train` method, and the pretrained model is saved to a subdirectory of the model directory.

After pretraining, the same model is fine-tuned on a small subset of bioactive molecules (the `pkm2` folder). As before, the training and validation paths are specified and `train` is called again. The fine-tuned model is saved to a second subdirectory of the model directory.

Finally, the script generates new molecular designs with the fine-tuned model. The `design_molecules` method takes the number of designs, the batch size, and the sampling temperature, and returns the designs together with their log-likelihoods. These are written to two separate files in the output directory.

In short, this one script covers pretraining, fine-tuning, and molecule generation with an S4 model, which makes it convenient to run end to end as a single job.

In [9]:
%%writefile scripts/all_together.py
from s4dd import S4forDenovoDesign
from argparse import ArgumentParser
import os
if __name__ == "__main__":
    parser = ArgumentParser('Pretrain an S4 model, fine-tune and generate new molecules')
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    #parser.add_argument("--full-data", type=str, default=os.environ["SM_CHANNEL_DATA_FULL"])
    parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test", type=str, default=os.environ["SM_CHANNEL_TEST"])
    parser.add_argument("--output", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    args = parser.parse_args().__dict__

    # Create an S4 model with (almost) the same parameters as in the paper.
    s4 = S4forDenovoDesign(
        n_max_epochs=3,  # For demonstration only. Set this to a (much) higher value for actual training. Default: 400.
        batch_size=128,  # For demonstration only. The value in the paper is 2048.
        device="cuda",  # Replace this with "cpu" if you don't have a CUDA-enabled GPU.
    )
    # Pretrain the model on a small subset of ChEMBL
    s4.train(
        training_molecules_path=f"{args['train']}/chemblv31/train.zip",
        val_molecules_path=f"{args['test']}/chemblv31/valid.zip",
    )

    # save the pretrained model
    s4.save(f"{args['model_dir']}/demo/pretrained_model/")

    # Fine-tune the model on a small subset of bioactive molecules
    s4.train(
        training_molecules_path=f"{args['train']}/pkm2/train.zip",
        val_molecules_path=f"{args['test']}/pkm2/valid.zip",  # validation data comes from the test channel, as in pretraining
    )

    # save the fine-tuned model
    s4.save(f"{args['model_dir']}/demo/finetuned_model/")


    # Design new molecules
    designs, lls = s4.design_molecules(n_designs=128, batch_size=64, temperature=1)

    # Save the designs
    with open(f"{args['output']}/designs.smiles", "w") as f:
        f.write("\n".join(designs))

    # Save the log-likelihoods of the designs
    with open(f"{args['output']}/lls.txt", "w") as f:
        f.write("\n".join([str(ll) for ll in lls]))

Overwriting scripts/all_together.py
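
Once the job finishes, SageMaker archives the contents of the output directory as `output.tar.gz` under the job's S3 output path. After downloading and extracting it, the two files can be paired again. The sketch below is one way to do that, assuming each line of `lls.txt` parses as a plain float (this depends on what `design_molecules` returns); the `load_ranked_designs` helper and the demo values are hypothetical.

```python
import tempfile
from pathlib import Path


def load_ranked_designs(output_dir):
    """Pair each SMILES string with its log-likelihood, most likely first."""
    out = Path(output_dir)
    # split("\n") is the exact inverse of the "\n".join(...) used when writing.
    designs = (out / "designs.smiles").read_text().split("\n")
    lls = [float(value) for value in (out / "lls.txt").read_text().split("\n")]
    if len(designs) != len(lls):
        raise ValueError(f"{len(designs)} designs but {len(lls)} log-likelihoods")
    return sorted(zip(designs, lls), key=lambda pair: pair[1], reverse=True)


# Small demo with made-up placeholder values, not real model output.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "designs.smiles").write_text("\n".join(["CCO", "c1ccccc1"]))
    Path(demo_dir, "lls.txt").write_text("\n".join(["-12.5", "-8.0"]))
    for smiles, ll in load_ranked_designs(demo_dir):
        print(f"{ll:8.2f}  {smiles}")
```

Sorting by log-likelihood is only a convenient first look; it says nothing about chemical validity or activity.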


In [6]:
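train_estimator = PyTorch(
    entry_point='pretraining.py',
    source_dir="scripts",
    role=sm_role,
    framework_version='1.13.1',
    py_version='py39',
    instance_count=1,
    instance_type='ml.g4dn.2xlarge',  # single-GPU instance
    max_run=432000,  # maximum runtime in seconds (5 days)
    # `wait` is an argument of fit(), not of the estimator, so it is not set here.
)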

In [None]:
train_estimator.fit({'train': 's3://nimbustx-sagemaker/denovo_design/s4dd/chemblv31/', 
                     'test': 's3://nimbustx-sagemaker/denovo_design/s4dd/chemblv31/'})
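
By default, `fit()` blocks and streams the job logs into the notebook. To submit the job and keep working, pass `wait=False` to `fit()` and check the status later. The sketch below polls the `DescribeTrainingJob` API through a boto3 SageMaker client; the `wait_for_training_job` helper is hypothetical and not part of the SageMaker SDK.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll the job until it reaches a terminal status and return that status."""
    while True:
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)


# Usage sketch (needs AWS credentials and a submitted job):
# train_estimator.fit({...}, wait=False)
# job_name = train_estimator.latest_training_job.name
# status = wait_for_training_job(boto3.client("sagemaker"), job_name)
```

The `sleep` parameter exists only so the loop can be exercised without real delays; in normal use the default is fine.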

In [6]:
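all_estimator = PyTorch(
    entry_point='all_together.py',
    source_dir="scripts",
    role=sm_role,
    framework_version='1.13.1',
    py_version='py39',
    instance_count=1,
    instance_type='ml.g4dn.2xlarge',  # single-GPU instance
    max_run=432000,  # maximum runtime in seconds (5 days)
    # `wait` is an argument of fit(), not of the estimator, so it is not set here.
)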

In [None]:
all_estimator.fit({'train': 's3://nimbustx-sagemaker/denovo_design/s4dd/',
                   'test': 's3://nimbustx-sagemaker/denovo_design/s4dd/'})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2024-10-23-17-18-28-645


2024-10-23 17:18:30 Starting - Starting the training job...
2024-10-23 17:18:44 Starting - Preparing the instances for training...
2024-10-23 17:19:32 Downloading - Downloading the training image..................
2024-10-23 17:22:35 Training - Training image download completed. Training in progress....bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
2024-10-23 17:22:58,438 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-10-23 17:22:58,460 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-10-23 17:22:58,475 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-10-23 17:22:58,478 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-10-23 17:22:59,724 sagemaker-training-toolkit INFO     Installing depe