# Lab1 (inference): Generate protein sequences at scale with Progen2 on AWS Batch

This notebook walks you through defining Progen2 prompts and their generation parameters, submitting AWS Batch jobs, monitoring job status, and viewing the generated sequences.

#### Prerequisites
- IAM roles configured with appropriate permissions
- Progen2 Docker image pushed to ECR
- Batch resources provisioned (Compute Environment, Job Queue, Job Definition)

## Step 1: Setup and Configuration

First, let's get our AWS account information and set up variables we'll use throughout the notebook.

In [None]:
%pip install biopython

In [None]:
import boto3
import datetime

##########################################################

# Get AWS account information
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()['Account']
region = boto3.Session().region_name

# Define S3 bucket and folder names
S3_BUCKET = f'workshop-data-{account_id}'
LAB1_FOLDER = 'lab1-progen'
LAB2_FOLDER = 'lab2-amplify'
LAB3_FOLDER = 'lab3-esmfold'

print(f"Account ID: {account_id}")
print(f"Region: {region}")
print(f"S3 Bucket: {S3_BUCKET}")

##########################################################

# Define model 
model_version = 'progen2-small'
model_id = f'hugohrban/{model_version}' 

# Create batch client
batch_client = boto3.client('batch')

# Define Batch job queue and job definition 
job_queue_name = 'progen2-batch-job-queue'
job_definition_name = 'progen2-job-definition'

## Step 2: Define prompts with parameters to generate protein sequences

The cell below writes the inference parameters file locally, and the following cell uploads it to S3, where the Batch jobs read it for sequence generation.

In [None]:
%%writefile data/$LAB1_FOLDER/inference-params.json

{
    "inference-params": [
        {"prompt_id": "prompt-001", "prompt": "MEVVIVTGMSGAGK", 
        "max_length":100, "temperature": 0.001, "top_p":0.9, "top_k":50},

        {"prompt_id": "prompt-002", "prompt": "MEVVIVTGMSGAGK", 
        "max_length":100, "temperature": 0.7, "top_p":0.9, "top_k":50},

        {"prompt_id": "prompt-003", "prompt": "MEVVIVTGMSGAGK", 
        "max_length":100, "temperature": 0.001, "top_p":0.9, "top_k":50},

        {"prompt_id": "prompt-004", "prompt": "MEVVIVTGMSGAGK", 
        "max_length":100, "temperature": 0.7, "top_p":0.9, "top_k":50},

        {"prompt_id": "prompt-005", "prompt": "MEVVIVTGMSGAGK", 
        "max_length":100, "temperature": 0.001, "top_p":0.9, "top_k":50},

        {"prompt_id": "prompt-006", "prompt": "MEVVIVTGMSGAGK", 
        "max_length":100, "temperature": 0.7, "top_p":0.9, "top_k":50},

        {"prompt_id": "prompt-007", "prompt": "MEVVIVTGMSGAGK", 
        "max_length":100, "temperature": 0.001, "top_p":0.9, "top_k":50},

        {"prompt_id": "prompt-008", "prompt": "MEVVIVTGMSGAGK", 
        "max_length":100, "temperature": 0.7, "top_p":0.9, "top_k":50},

        {"prompt_id": "prompt-009", "prompt": "MEVVIVTGMSGAGK", 
        "max_length":100, "temperature": 0.001, "top_p":0.9, "top_k":50},

        {"prompt_id": "prompt-010", "prompt": "MEVVIVTGMSGAGK", 
        "max_length":100, "temperature": 0.7, "top_p":0.9, "top_k":50}
    ]
}


In [None]:
!aws s3 cp data/$LAB1_FOLDER/inference-params.json s3://$S3_BUCKET/$LAB1_FOLDER/inference-params.json
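
The ten entries above differ only in `prompt_id` and in a temperature that alternates between 0.001 and 0.7. For larger sweeps, the same structure could be generated in code rather than typed by hand. The following is a minimal sketch; the helper name `build_inference_params` is hypothetical and not part of the workshop code.

```python
import json


def build_inference_params(prompt, temperatures, count,
                           max_length=100, top_p=0.9, top_k=50):
    """Build a dict with the same layout as inference-params.json.

    Temperatures are cycled, so [0.001, 0.7] yields the alternating
    pattern used in the hand-written file above.
    """
    entries = []
    for i in range(count):
        entries.append({
            "prompt_id": f"prompt-{i + 1:03d}",
            "prompt": prompt,
            "max_length": max_length,
            "temperature": temperatures[i % len(temperatures)],
            "top_p": top_p,
            "top_k": top_k,
        })
    return {"inference-params": entries}


params = build_inference_params("MEVVIVTGMSGAGK", [0.001, 0.7], count=10)
print(json.dumps(params, indent=4))
```

Writing the result with `json.dump` to `data/lab1-progen/inference-params.json` would produce a file equivalent to the one created by the `%%writefile` cell.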

## Step 3: Generate protein sequences with inference parameters

### Step 3.1: Submit Batch jobs

* `batch_count` defines how many jobs will be created
* `batch_size` defines how many sequences will be generated by each job

In [None]:
batch_count = 1
batch_size = 10

jobs = []
for batch_number in range(batch_count):

    # Generate a unique job name from the batch number and a timestamp
    timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    job_name = f'progen2-batch-job-{batch_number}-{timestamp}'

    # Submit the job (Batch job parameters must be passed as strings)
    response = batch_client.submit_job(
        jobName=job_name,
        jobQueue=job_queue_name,
        jobDefinition=job_definition_name,
        parameters={
            'hfModelId': model_id,
            's3InputParamsPath': f's3://{S3_BUCKET}/{LAB1_FOLDER}/inference-params.json',
            'batchId': f'batch-10{batch_number}',
            'batchSize': f'{batch_size}',
            'batchNumber': f'{batch_number}',
            's3OutputPath': f's3://{S3_BUCKET}/{LAB1_FOLDER}'
        }
    )
    jobs.append(response)

    job_id = response['jobId']
    print(f"Submitted job: {response['jobName']}")
    print(f"   Job ID: {job_id}")
    print(f"   Job ARN: {response['jobArn']}")
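
The `parameters` argument of `submit_job` is a map of string keys to string values, which is why the numeric settings above are wrapped in f-strings. As a sketch of how the parameter map could be assembled and inspected before any job is submitted, the hypothetical helper `build_job_parameters` below reproduces the same keys and converts every value to a string. The bucket name in the demo call is a placeholder.

```python
def build_job_parameters(model_id, bucket, folder, batch_number, batch_size):
    """Return the Batch job parameters used in this lab as a str -> str map."""
    values = {
        "hfModelId": model_id,
        "s3InputParamsPath": f"s3://{bucket}/{folder}/inference-params.json",
        "batchId": f"batch-10{batch_number}",
        "batchSize": batch_size,
        "batchNumber": batch_number,
        "s3OutputPath": f"s3://{bucket}/{folder}",
    }
    # Batch expects string values, so convert integers explicitly.
    return {key: str(value) for key, value in values.items()}


# Placeholder bucket name, for illustration only.
demo_parameters = build_job_parameters(
    "hugohrban/progen2-small", "workshop-data-EXAMPLE", "lab1-progen",
    batch_number=0, batch_size=10,
)
for key, value in demo_parameters.items():
    print(f"{key}: {value}")
```

Printing the map first makes it easy to confirm the S3 paths before jobs are queued.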

### Step 3.2: Monitor status of the submitted jobs

In [None]:
def print_job_status(job_id):
    """Print detailed status information for a single Batch job."""
    response = batch_client.describe_jobs(jobs=[job_id])
    if not response['jobs']:
        print(f"Job {job_id} not found")
        return
    job = response['jobs'][0]
    
    print(f"Job Name: {job['jobName']}")
    print(f"Job ID: {job['jobId']}")
    print(f"Status: {job['status']}")
    
    if 'statusReason' in job:
        print(f"Status Reason: {job['statusReason']}")
    
    # Batch reports timestamps in milliseconds since the Unix epoch
    if 'startedAt' in job:
        started_at = datetime.datetime.fromtimestamp(job['startedAt'] / 1000)
        print(f"Started At: {started_at}")
    
    if 'stoppedAt' in job:
        stopped_at = datetime.datetime.fromtimestamp(job['stoppedAt'] / 1000)
        print(f"Stopped At: {stopped_at}")
    

# Check the status of each submitted job
for job in jobs:
    print_job_status(job['jobId'])
    print()
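
The cell above reports status once, so it has to be re-run until the jobs finish. One alternative is to poll until every job reaches a terminal state (`SUCCEEDED` or `FAILED`). The sketch below is one way to do that; `wait_for_jobs` is a hypothetical helper, and the polling interval and retry limit are arbitrary defaults. It takes the describe function as an argument so it can be tried without AWS access.

```python
import time

# Batch jobs end in one of these two states.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_jobs(describe_jobs, job_ids, poll_seconds=30, max_polls=120,
                  sleep=time.sleep):
    """Poll until all jobs are terminal or max_polls is reached.

    describe_jobs is expected to behave like batch_client.describe_jobs,
    which accepts up to 100 job IDs per call.
    Returns a dict mapping job ID to the last observed status.
    """
    job_ids = list(job_ids)
    statuses = {}
    for _ in range(max_polls):
        response = describe_jobs(jobs=job_ids)
        statuses = {job["jobId"]: job["status"] for job in response["jobs"]}
        all_seen = len(statuses) == len(job_ids)
        if all_seen and all(s in TERMINAL_STATES for s in statuses.values()):
            break
        sleep(poll_seconds)
    return statuses


# In the notebook this could be called as:
# final = wait_for_jobs(batch_client.describe_jobs, [j['jobId'] for j in jobs])
```

If the retry limit is reached first, the returned statuses may still contain non-terminal values such as `RUNNABLE` or `RUNNING`, so the caller should check them before downloading results.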


## Step 4: View generated sequences

### Step 4.1: Download FASTA files with generated sequences from S3

In [None]:
!aws s3 cp s3://$S3_BUCKET/$LAB1_FOLDER ./data/$LAB1_FOLDER --recursive --exclude "*" --include "*.fasta"

### Step 4.2: Read FASTA file(s) and print generated sequences

In [None]:
from Bio import SeqIO
import os

path = f"data/{LAB1_FOLDER}"
for file in os.listdir(path):
    if file.endswith(".fasta"):

        file_path = os.path.join(path, file)    
        for record in SeqIO.parse(file_path, "fasta"):
            print(f"ID: {record.id}")
            print(f"Description: {record.description}")
            print(f"Sequence: {record.seq}")
            print("-" * 40)
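
Because half of the prompts use a temperature of 0.001, where sampling is close to greedy, it may be worth checking how many of the generated sequences are actually distinct. The sketch below counts total and unique sequences and reports the length range; `summarize_sequences` is a hypothetical helper, and the demo strings are placeholders rather than real model output.

```python
from collections import Counter


def summarize_sequences(sequences):
    """Return counts of total and unique sequences plus the length range."""
    counts = Counter(str(seq) for seq in sequences)
    lengths = [len(seq) for seq in counts]
    return {
        "total": sum(counts.values()),
        "unique": len(counts),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
    }


# Placeholder strings for illustration only, not real Progen2 output.
demo_sequences = ["MEVVIVTGMSGAGKAA", "MEVVIVTGMSGAGKAA", "MEVVIVTGMSGAGKCCD"]
print(summarize_sequences(demo_sequences))
```

With Biopython, the same function could be fed directly from a file, for example `summarize_sequences(record.seq for record in SeqIO.parse(file_path, "fasta"))`.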