Instructions for using AWS Batch

These are supplementary instructions for setting up AWS Batch in order to use our workflow. They assume that you understand the basics of AWS, and of AWS Batch in particular. As a warning, this is not for the faint of heart, but we provide some guidance.

In summary you must

  • do the set-up (this need only be done once -- unless your data set sizes vary considerably or you change region)
    • decide which AWS region you want to run in
    • create an AWS S3 bucket in that region as a work directory (I'll refer to this as eg-aws-bucket in these instructions; replace it with your own bucket name)
      • if your input and output are going to AWS S3 they should be in the same region -- the input could be on your local disk or on S3, and the output could go to your local disk or to S3. It would probably make sense for the input to be on S3 since you might want to run several times, but that is completely up to you and YMMV. Do remember to set the security on the buckets so as to protect any sensitive data
    • set up an AWS Batch Compute Environment with appropriate permissions
    • set up an AWS Batch queue that connects to the batch environment
    • set up a Nextflow config file that puts all this together
  • run the workflow

NB: AWS Batch requires all resources to be set up in the same AWS region. A major cause of misconfiguration is failing to observe this (it is easy to do), so pay attention to which region you are in. In particular, the S3 buckets and AWS Batch need to be in the same region.

1. AWS region and set-up

Log into your AWS console.

Choose the AWS region based on what is convenient for you and has the best pricing. It probably doesn't make a big difference. Our University's ethics committee prefers us to run in af-south-1 since data stored there are subject to South African privacy laws, but your requirements may well be different. Remember that any buckets you use must be in the same region.

You need to have a bucket that can be used for temporary storage. It could be an existing bucket. Make sure you are the only person who can read and write to it.
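
If you prefer the command line, a suitable bucket can be created and locked down with the AWS CLI. This is only a sketch; the bucket name and region are placeholders for your own:

    # create the work bucket in your chosen region
    aws s3 mb s3://eg-aws-bucket --region af-south-1
    # block all public access to it (recommended if the data are at all sensitive)
    aws s3api put-public-access-block --bucket eg-aws-bucket \
        --public-access-block-configuration \
        BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true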

Make sure you have a suitable VPC (Virtual Private Cloud) for Batch to run in. If you haven't done this before, consider following our instructions in Section 5 (Addendum below) before setting up the Batch Compute environment.
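
A quick way to see which VPCs already exist in your chosen region (the region here is just the one used in these examples):

    aws ec2 describe-vpcs --region af-south-1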

2. Set up your Batch Compute Environment

This defines what resources you need for your workflow. There are a number of steps, but mostly you can use the default values. However, you do need to be careful to set things up so that you have enough resources.

2.1 Disk space

By default, the AWS instances that run Batch jobs have 30GB of disk space. We think that you need an image at least 4x bigger than your input data size to run safely. If your need is less than 30GB, there's no problem and you can skip the rest of 2.1. If not, there's an extra configuration step to set up an environment with disks of the correct size.

The easiest way of doing this is to set up a launch template. You can do this using the console, but in my experience it is more complex than using the command-line tools.

Install the AWS command-line tools and the boto3 library using pip or yum or the like (e.g., yum install python3-boto)
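
For example, one of the following should work on most systems (package names vary between distributions, so treat these as illustrative):

    # via pip
    pip3 install --user boto3 awscli
    # or via the system package manager on an RPM-based system
    sudo yum install python3-boto awscli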

There is a file called launch-template.json in this directory. Download it to your machine. Change the value of the LaunchTemplateName field to something meaningful and unique to you, and the VolumeSize field to the size you want (in GB).
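
For orientation, the JSON has roughly the following shape; treat the launch-template.json shipped in this directory as the authoritative version, since the template name, device name and volume size below are only illustrative:

    {
        "LaunchTemplateName": "eg-100GB-template",
        "LaunchTemplateData": {
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/xvda",
                    "Ebs": { "VolumeSize": 100, "VolumeType": "gp2", "DeleteOnTermination": true }
                }
            ]
        }
    }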

Then create the launch template using the following command (mutatis mutandis -- that is, change the region, and the name of the template file if you've changed it).

Note that launch templates are account- and region-specific, so if you define a template for af-south-1 it will not show up in us-east-1.

 aws ec2 --region af-south-1 create-launch-template --cli-input-json file://h3a-100GB-template.json
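
You can check that the template was created in the region you expect with:

    aws ec2 describe-launch-templates --region af-south-1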

2.2 RAM size

The bigger your data files the more RAM you need. This is not an issue you have to worry about at configuration time. When you run the workflow you may have to set the plink_mem_req and other_mem_req parameters as discussed elsewhere.
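
For example (the values here are illustrative, not recommendations), you can either set these in the params block of the aws.config shown in section 3 below, or append them to the run command:

    --plink_mem_req 16GB --other_mem_req 16GB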

2.3 Setting up the environment and queue


2.3.1 Setting up an environment

You should be able to follow the default settings unless you need a launch template. (A rough command-line equivalent of these console steps is sketched after the list below.)

Choose

  • Compute Environment Configuration
    • Managed environment type
    • give a meaningful name
    • enable environment
    • under "Service role" pick AWSBatchServiceRole
    • under additional settings for "Compute environment configuration" (this sometimes only appears once you click in the next section -- the UI can be confusing here, so be patient) pick
      • AWSBatchServiceRole as the service role
      • ecsInstanceRole as the instance role
    • Choose a keypair for the region (this is only needed if you intend to ssh into the instances that spin up and so would not normally be done)
  • Instance Configuration
    • Spot pricing (choose the percentage your pocket can afford)
    • Optimal for allowed instance types
    • SPOT_CAPACITY_OPTIMIZED for allocation strategy
    • You don't need a "Spot fleet role" if you have chosen SPOT_CAPACITY_OPTIMIZED
    • Under Additional settings (if you have ever defined a launch template for this region)
      • pick none, or a template you have defined if you need one. Note that for each template you need to define a new environment (see section 2.1 above). If you haven't defined a template for this region there will be nothing for you to do and you won't be able to select an option; that's OK.
      • You don't need to pick an AMI and should do so only if you really know what you are doing.
  • Under networking add a VPC -- note that under additional settings are the definitions of the security groups which define access
  • Add tags if you want to: they may be helpful for tracking resources
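
If you prefer to drive this from the command line rather than the console, the same kind of environment can be created with the AWS CLI. The sketch below is only illustrative: the environment name, vCPU limits, subnet and security-group IDs, and launch template name are placeholders that you must replace with your own values, and you can drop the launchTemplate entry if you do not need bigger disks. Put the JSON in a file, say compute-env.json:

    {
      "computeEnvironmentName": "eg-batch-env",
      "type": "MANAGED",
      "state": "ENABLED",
      "serviceRole": "AWSBatchServiceRole",
      "computeResources": {
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "bidPercentage": 100,
        "minvCpus": 0,
        "maxvCpus": 64,
        "desiredvCpus": 0,
        "instanceTypes": ["optimal"],
        "instanceRole": "ecsInstanceRole",
        "subnets": ["subnet-xxxxxxxx"],
        "securityGroupIds": ["sg-xxxxxxxx"],
        "launchTemplate": { "launchTemplateName": "eg-100GB-template" }
      }
    }

and then create the environment with

    aws batch create-compute-environment --region af-south-1 --cli-input-json file://compute-env.json

The console route described above is entirely equivalent.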

2.3.2 Adding S3 access permissions

The AWS Batch instances will need to access S3 and so you need to give them this permission.

  • From the AWS Console, choose "Services" and then "IAM"
  • Choose "Roles"
  • Choose "ecsInstanceRole"
  • Choose Attach Policies
  • In the filter bar type in AmazonS3FullAccess (NB: no spaces) and select

The ecsInstanceRole now has two policies attached: AmazonEC2ContainerServiceforEC2Role and AmazonS3FullAccess
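
The same permission can be granted from the command line; afterwards you can list the attached policies to confirm (this assumes the role is named ecsInstanceRole, as above):

    # attach the S3 access policy to the instance role used by Batch
    aws iam attach-role-policy --role-name ecsInstanceRole \
        --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
    # check which policies are now attached
    aws iam list-attached-role-policies --role-name ecsInstanceRole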

2.3.3 Set up a queue

In the Amazon Console, go back to AWS Batch from the list of services

Create a job queue. Unless you need to do something fancy just pick the default options.

  • for convenience call the queue the same as the environment
  • attach the environment you created to the queue

You use the queue name in your Nextflow config file.
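
Equivalently, a queue can be created and attached to the environment from the command line (the queue and environment names below are placeholders):

    aws batch create-job-queue --region af-south-1 \
        --job-queue-name eg-batch-queue \
        --state ENABLED --priority 1 \
        --compute-environment-order order=1,computeEnvironment=eg-batch-env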

3. Create an additional Nextflow configuration file

It will look something like this. I'll call this aws.config for the example, but you can name it whatever you want. Please note that the accessKey and secretKey must be valid for the account and region in which you created the environment and queue. Also, if you are using an IAM user, that user must have permission to run Batch jobs.


params {
   plink_mem_req = "8GB"
   other_mem_req = "8GB"
}

profiles {

    awsbatch {
         region = "af-south-1"
         accessKey = 'YourAccessKeyForTheRegion'
         secretKey = 'AssociatedSecretKey'
         aws.uploadStorageClass = 'ONEZONE_IA'
         process.queue = 'QueueYouCreated'
         process.executor = "awsbatch"
    }

}
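
If you would rather not put keys in a file, Nextflow can also pick up AWS credentials from the standard environment variables (or from ~/.aws/credentials), in which case you can leave accessKey and secretKey out of the profile. For example:

    export AWS_ACCESS_KEY_ID='YourAccessKeyForTheRegion'
    export AWS_SECRET_ACCESS_KEY='AssociatedSecretKey'
    export AWS_DEFAULT_REGION='af-south-1'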

Then run the workflow with something like this. Note that in this example I am using input that's already in S3, except for one file which is local (this is to show you that the data can be in different places -- it would probably make sense for the phenotype file to be stored in S3, but perhaps you are trying to be extra careful).

nextflow run h3agwas/qc/main.nf \
       -profile awsbatch -c aws.config \
       -work-dir 's3://za-batch-work' \
       --input_dir 's3://za-batch-data' --input_pat sim1_s \
       --output_dir 's3://za-batch-data/qc' --output sim1_s_qc \
       --data 's3://za-batch-data/out_qt.pheno' \
       --case_control data/see.fam --case_control_col=pheno
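
While the workflow runs, you can watch the jobs that Nextflow submits to your queue from the command line (replace the queue name with your own):

    aws batch list-jobs --region af-south-1 --job-queue QueueYouCreated --job-status RUNNING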

4. Clean up

Note that the work bucket you give will start to fill up (as will any output buckets). If you do lots of analyses it's possible for the work bucket to reach hundreds of GB quite quickly. There may be some sensitive data there, and AWS will also charge you for the storage, so remember to regularly delete objects from your work bucket.
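
For example, to see how much has accumulated in the work bucket used above and then empty it (double-check the bucket name first -- deletion is irreversible):

    # summarise what is in the work bucket
    aws s3 ls s3://za-batch-work --recursive --summarize --human-readable
    # delete everything in it
    aws s3 rm s3://za-batch-work --recursive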

5. Addendum: setting up a VPC

We have found that one of the reasons for Batch not working is that the networking for Batch is not set up correctly. For those who are not very familiar with AWS, these instructions may be useful. If you find that your workflow just hangs, one possible reason is that you have not set up networking properly. These instructions are best followed before setting up the compute environment.

Your Batch jobs will run in a VPC (Virtual Private Cloud). This VPC needs to communicate with you and with S3. There are many ways of doing this, and although it is not in the scope of our instructions, you may find the following useful. As a warning, our instructions use public IP addresses -- this should be secure, but you may want to consider using private IP addresses only; that is really out of scope of our documentation.

  1. In the Console Services choose "VPC"
  2. Choose Create VPC
  3. Select VPC + more
    • Name your VPC meaningfully -- you will need this name
    • Use an IPv4 CIDR block: I've used 10.0.0.0/16 and 172.31.0.0/16, but any sensible choice of private IP range should work (AWS requires a block between /16 and /28)
    • Select the number of Availability Zones (AZs): the default of 2 is probably good
    • Choose public subnets, using the same number as the AZs you chose. Do not create private subnets
    • Don't add a NAT
    • Create
  4. The creation of the VPC will also create an internet gateway -- its ID will start with "igw-" and will have the name you gave your VPC as part of its name.
  5. Once the VPC has been created, select it and look for the main routing table -- its ID will start with "rtb-". Select that
    • Click on "routes" and "Edit Routes"
    • Choose "Add Routes"
    • Add the entry 0.0.0.0/0 and then under Target choose "Internet gateway" and select the gateway that was created
    • Save
  6. While still on the "VPC Dashboard", click on Subnets in the panel on the left. For each of the subnets that belong to the VPC:
    • Select the subnet, then from the Actions menu select Edit subnet settings
    • Tick Enable auto-assign public IPv4 address
    • Save

When you create your Batch compute environment, use this VPC.
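
If you want to double-check the result from the command line, something like this will show whether the subnets will auto-assign public IPs (the VPC ID is a placeholder for your own):

    aws ec2 describe-subnets --region af-south-1 \
        --filters Name=vpc-id,Values=vpc-xxxxxxxx \
        --query 'Subnets[].[SubnetId,MapPublicIpOnLaunch]' --output table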