# HyenaDNA Model Training Pipeline for Genomic Data
This notebook demonstrates an automated pipeline for training HyenaDNA models on genomic data in AWS. It reads genomic data from AWS HealthOmics and uses Amazon SageMaker to train, monitor, and deploy the model.


![notebook_diagram](../assets/notebook_diagram.jpeg)


## Overview of the Process
1. Data Access: We securely access genomic data from AWS HealthOmics
2. Training Setup: We configure and launch the model training process
3. Monitoring: We track the training progress in real-time
4. Deployment: We make the trained model available for predictions

## Summary of Benefits
This automated pipeline provides several key advantages:
1. **Efficiency**: Automated processes reduce manual work and potential errors
2. **Scalability**: Can handle increasing amounts of genomic data
3. **Cost-effectiveness**: Resources are used only when needed
4. **Reproducibility**: The process can be repeated consistently
5. **Monitoring**: Clear visibility into the training process
6. **Security**: Maintains data security through AWS's infrastructure

## 1. Initial Setup
First, we switch to the project directory, install the required packages, and import the custom handler modules in `scripts/` that manage the training process.

In [8]:
import os
os.chdir('/home/ec2-user/SageMaker')
print(os.listdir(os.getcwd()))


['notebooks', 'scripts', '.virtual_documents', '.Trash-1000', 'lost+found', 'assets', 'README.md', 'requirements.txt', '.sparkmagic', '.ipynb_checkpoints']


In [11]:
import sys
sys.path.append('/home/ec2-user/SageMaker/scripts')

In [12]:
%pip install -qU pip
%pip install -qU sagemaker boto3 awscli ipywidgets

from session_handler import SessionHandler
from data_handler import HealthOmicsHandler
from training_handler import TrainingHandler


Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


## 2. AWS Session Initialization
Here we set up secure connections to AWS services. Think of this as logging into all the necessary AWS systems we'll need to use.

In [13]:
# Initialize sessions
session_handler = SessionHandler()
session_info = session_handler.get_session_info

print(f"Using region: {session_info['region_name']}")
print(f"Using S3 bucket: {session_info['bucket']}")

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker:Created S3 bucket: sagemaker-us-west-2-051581800907


Using region: us-west-2
Using S3 bucket: sagemaker-us-west-2-051581800907


## 3. Data Access Configuration
Now we set up access to our genomic data stored in AWS HealthOmics.

In [None]:
import json  # used below to pretty-print the IAM policy

# Store configuration
store_id = "8721910603"  # replace with your store id
store_type = "reference"   # "reference" or "sequence"

# Setup data access
omics_handler = HealthOmicsHandler(region_name=session_info['region_name'])
access_info = omics_handler.get_store_access(
    store_id=store_id,
    store_type=store_type
)

# Get and display required IAM policy
iam_policy = omics_handler.get_iam_policy(access_info["s3_arn"], access_info["key_arn"])
print(f"Required IAM policy for {store_type} store:")
print(json.dumps(iam_policy, indent=2))

print(f"\nData URI for {store_type} store:")
print(access_info["data_uri"])
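
The policy printed above grants the training role read access to the store's S3 access point and permission to decrypt with the store's KMS key. The sketch below shows roughly what such a policy document could look like. `build_store_read_policy` is a hypothetical helper, not the implementation of `get_iam_policy`, and the action list and placeholder ARNs are assumptions for illustration.

```python
import json


def build_store_read_policy(s3_arn: str, key_arn: str) -> dict:
    """Hypothetical helper: build a read-only policy for one store."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read objects through the store's S3 access point (assumed actions)
                "Sid": "ReadStoreObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [s3_arn, f"{s3_arn}/*"],
            },
            {
                # Decrypt objects encrypted with the store's KMS key
                "Sid": "DecryptStoreObjects",
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": [key_arn],
            },
        ],
    }


# Placeholder ARNs for illustration only
example_policy = build_store_read_policy(
    "arn:aws:s3:us-west-2:111122223333:accesspoint/example-access-point",
    "arn:aws:kms:us-west-2:111122223333:key/example-key-id",
)
print(json.dumps(example_policy, indent=2))
```

Compare any hand-written policy against the one returned by `get_iam_policy` before attaching it to the SageMaker execution role.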

## 4. Training Configuration
Here we define the specifications for our training process. These settings determine:
- How the model will learn from the data
- How long it will train
- How much computing power it will use
- What metrics we'll track to measure success

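Key Parameters Explained:
- Model ID: The pretrained HyenaDNA checkpoint used as the starting point
- Epochs: How many full passes the model makes over the dataset
- Max Length: The maximum sequence length, in tokens, of each training example
- Batch Size: How many sequences the model processes per training step
- Learning Rate: The step size of each weight update; larger values learn faster but can become unstable
- Weight Decay: The regularization strength applied to the weights to limit overfitting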
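
To get a feel for how these settings interact, the arithmetic below estimates the work per training step and the total number of steps. It is a rough sketch: `num_sequences` is a made-up dataset size, and the calculation assumes a single device with no gradient accumulation, which may not match the actual training script.

```python
import math

batch_size = 4        # sequences per training step
max_length = 32_000   # maximum tokens per sequence
epochs = 150          # full passes over the dataset

# Hypothetical dataset size, for illustration only
num_sequences = 1_000

# Upper bound on tokens processed in one step (every sequence at full length)
tokens_per_step = batch_size * max_length

# The last batch of an epoch may be partial, so round up
steps_per_epoch = math.ceil(num_sequences / batch_size)
total_steps = steps_per_epoch * epochs

print(f"Tokens per step (max): {tokens_per_step:,}")
print(f"Steps per epoch:       {steps_per_epoch:,}")
print(f"Total training steps:  {total_steps:,}")
```

With `log_interval` set to 100, a run of this size would emit a training log line every 100 steps.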

In [None]:
# Model and job configuration
MODEL_ID = 'LongSafari/hyenadna-small-32k-seqlen-hf'
TRAINING_JOB_NAME = f'hyenaDNA-pretraining-{store_type}'  # includes the store type
EXPERIMENT_NAME = "hyenaDNA-pretraining-v2"

# Training hyperparameters
hyperparameters = {
    "species": "mouse",
    "epochs": 150,
    "model_checkpoint": MODEL_ID,
    "max_length": 32_000,
    "batch_size": 4,
    "learning_rate": 6e-4,
    "weight_decay": 0.1,
    "log_level": "INFO",
    "log_interval": 100
}

# Metrics to track
metric_definitions = [
    {"Name": "epoch", "Regex": "Epoch: ([0-9.]*)"},
    {"Name": "step", "Regex": "Step: ([0-9.]*)"},
    {"Name": "train_loss", "Regex": "Train Loss: ([0-9.e-]*)"},
    {"Name": "train_perplexity", "Regex": "Train Perplexity: ([0-9.e-]*)"},
    {"Name": "eval_loss", "Regex": "Eval Average Loss: ([0-9.e-]*)"},
    {"Name": "eval_perplexity", "Regex": "Eval Perplexity: ([0-9.e-]*)"}
]
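
Each metric definition pairs a name with a regular expression whose first capture group holds the value. The sketch below applies the same patterns to a made-up log line to show what gets extracted. The log line format is an assumption based on the patterns, and `extract_metrics` is a hypothetical helper for local testing, not part of SageMaker.

```python
import re

metric_definitions = [
    {"Name": "epoch", "Regex": "Epoch: ([0-9.]*)"},
    {"Name": "step", "Regex": "Step: ([0-9.]*)"},
    {"Name": "train_loss", "Regex": "Train Loss: ([0-9.e-]*)"},
    {"Name": "train_perplexity", "Regex": "Train Perplexity: ([0-9.e-]*)"},
    {"Name": "eval_loss", "Regex": "Eval Average Loss: ([0-9.e-]*)"},
    {"Name": "eval_perplexity", "Regex": "Eval Perplexity: ([0-9.e-]*)"},
]


def extract_metrics(line, definitions):
    """Return {metric name: value} for every pattern that matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if not match or not match.group(1):
            continue  # no match, or the '*' quantifier captured an empty string
        try:
            found[definition["Name"]] = float(match.group(1))
        except ValueError:
            pass  # captured text such as "1.0e" is not a valid number
    return found


# Hypothetical log line in the format the patterns expect
sample_log = "Epoch: 3 Step: 1200 Train Loss: 1.2345 Train Perplexity: 3.4367"
parsed = extract_metrics(sample_log, metric_definitions)
print(parsed)
```

Note that the character class `[0-9.e-]` does not include `+`, so a value logged as `1.0e+05` would be cut off at `1.0e`. Checking the patterns against real log output before launching a long job avoids silently missing metrics.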

## 5. Model Training
This is where the actual learning happens. We're using Amazon SageMaker, AWS's machine learning platform, to:
- Distribute the training across powerful GPU computers
- Automatically manage the computing resources
- Track the model's learning progress
- Ensure efficient use of computing resources

The training process will continue automatically until completion, with all progress metrics being logged.

In [None]:
# Setup and run training
training_handler = TrainingHandler(session_info)
training_handler.setup_training(
    experiment_name=EXPERIMENT_NAME,
    base_job_name=TRAINING_JOB_NAME,
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions
)

print(f"Starting training using data from {store_type} store...")
training_job_name = training_handler.train(data_uri=access_info["data_uri"])
print(f"Training job name: {training_job_name}")
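
Because the job name is built from `store_type`, it can be worth checking that the result is a valid SageMaker training job name before launching. The sketch below uses the name pattern documented for the `CreateTrainingJob` API (alphanumeric characters and hyphens, 1 to 63 characters). `is_valid_job_name` is a hypothetical helper, and the SDK may append a timestamp to the base name, so the final name will be longer than the base.

```python
import re

# Pattern documented for TrainingJobName: alphanumerics separated by hyphens,
# starting and ending with an alphanumeric, at most 63 characters in total
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")


def is_valid_job_name(name: str) -> bool:
    """Hypothetical helper: check a candidate training job name."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None


for candidate in [
    "hyenaDNA-pretraining-reference",  # name produced by this notebook
    "hyenaDNA_pretraining",            # underscores are not allowed
    "hyenaDNA-pretraining-",           # must end with an alphanumeric
]:
    status = "valid" if is_valid_job_name(candidate) else "invalid"
    print(f"{candidate}: {status}")
```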

## 6. Training Progress Visualization
We use TensorBoard, an interactive visualization tool, to monitor the training progress in real-time. This allows us to see:
- How quickly the training loss is decreasing
- Whether training and evaluation perplexity are improving
- If there are any issues that need attention, such as a loss that stalls or diverges

The visualization provides both technical teams and business stakeholders with clear insights into the training process.
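
Loss and perplexity carry the same information on different scales. Assuming the training script reports the average cross-entropy loss in nats, which is the usual convention but not confirmed here, perplexity is simply the exponential of the loss. The sketch below illustrates the relationship with a simple baseline.

```python
import math


def perplexity_from_loss(avg_cross_entropy: float) -> float:
    """Convert an average cross-entropy loss (in nats) to perplexity."""
    return math.exp(avg_cross_entropy)


# Baseline: a model that guesses uniformly among the four nucleotides
# A, C, G, T has a loss of ln(4) and therefore a perplexity of about 4
uniform_loss = math.log(4)
print(f"Uniform baseline loss:       {uniform_loss:.4f}")
print(f"Uniform baseline perplexity: {perplexity_from_loss(uniform_loss):.4f}")

# A lower loss means the model is less "surprised" by the next nucleotide
print(f"Perplexity at loss 1.0:      {perplexity_from_loss(1.0):.4f}")
```

A perplexity that drops below the uniform baseline indicates that the model has learned something about sequence structure beyond base frequencies.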

In [None]:
user_profile = "your-profile-name"  # replace with your SageMaker Studio user profile name


tensorboard_url = session_handler.setup_tensorboard(
    training_job_name=training_job_name,
    user_profile=user_profile
)
print(f"TensorBoard URL: {tensorboard_url}")

## 7. Model Deployment and Inference
Finally, we deploy the trained model to a SageMaker endpoint, send it a test sample, and delete the endpoint once we are done.

In [None]:
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

In [None]:
# Configure endpoint
endpoint_name = f'hyenaDNA-pretrained-{store_type}-ep'  # includes the store type
pytorch_deployment_uri = f"763104351884.dkr.ecr.{session_info['region_name']}.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker"

In [None]:
# Create model
hyenaDNAModel = PyTorchModel(
    model_data=training_handler.estimator.model_data,
    role=session_info['role'],
    image_uri=pytorch_deployment_uri,
    entry_point="inference.py",
    source_dir="scripts/",
    sagemaker_session=session_info['sagemaker_session'],
    name=endpoint_name,
    env = {
        'MMS_MAX_REQUEST_SIZE': '2000000000',
        'MMS_MAX_RESPONSE_SIZE': '2000000000',
        'MMS_DEFAULT_RESPONSE_TIMEOUT': '900',
        'TS_MAX_RESPONSE_SIZE': '2000000000',
        'TS_MAX_REQUEST_SIZE': '2000000000',
    }
)

In [None]:
# Environment configuration
env = {
    'SAGEMAKER_MODEL_SERVER_TIMEOUT': '7200',
    'TS_MAX_RESPONSE_SIZE': '2000000000',
    'TS_MAX_REQUEST_SIZE': '2000000000',
    'MMS_MAX_RESPONSE_SIZE': '2000000000',
    'MMS_MAX_REQUEST_SIZE': '2000000000'
}

# Deploy the model
print(f"Deploying model to endpoint: {endpoint_name}")
predictor = hyenaDNAModel.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
    endpoint_name=endpoint_name,
    env=env,
)

In [None]:
# Load test data
print("Loading test data...")
sample_genome_data = []
with open("./sample_mouse_data.json") as file:
    for line in file:
        sample_genome_data.append(json.loads(line))

print(f"Loaded {len(sample_genome_data)} test samples")

# Make prediction
print("Making prediction...")
data = [sample_genome_data[0]]
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()
embeddings = predictor.predict(data=data)

print("Prediction complete!")
# JSONDeserializer returns plain Python objects (lists or dicts), which have no .shape
print(f"Embedding container: {type(embeddings).__name__} with {len(embeddings)} top-level items")
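
The loading loop above reads the test file as JSON Lines, one JSON object per line, and would raise an error on a blank line. A slightly more defensive loader is sketched below. `load_jsonl` is a hypothetical helper, and the `sequence` field name in the demo file is an assumption, since the real schema of `sample_mouse_data.json` is not shown here.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSON Lines file, skipping blank lines and reporting bad ones."""
    records = []
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
    return records


# Demo with a throwaway file; the "sequence" field is a made-up schema
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write('{"sequence": "ACGT"}\n\n{"sequence": "GGCA"}\n')
    tmp_path = tmp.name

samples = load_jsonl(tmp_path)
os.remove(tmp_path)
print(f"Loaded {len(samples)} samples")
```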

In [None]:
# Cleanup endpoint
print(f"Cleaning up endpoint: {endpoint_name}")
predictor.delete_endpoint()
print("Cleanup complete!")