**Goal of this demo**

We'll deploy an HPC cluster and run a simple job using AWS ParallelCluster.

**What is AWS ParallelCluster**

AWS ParallelCluster is a CloudFormation-based tool for building pre-configured HPC clusters.
Services used:
AWS CloudFormation, AWS Identity and Access Management (IAM), Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), Amazon Elastic Compute Cloud (Amazon EC2), Amazon EC2 Auto Scaling, Amazon Elastic Block Store (Amazon EBS), Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB and Amazon FSx for Lustre.

AWS ParallelCluster is open-source: https://github.com/aws/aws-parallelcluster

Some key features in the initial release of ParallelCluster that were not in CfnCluster are:

* AWS Batch integration
* Multiple EBS volumes
* Better scaling performance: faster, with Auto Scaling updates applied all at once
* Support for custom AMIs (“bring your own AMI”)
* Private cluster using proxy

Let's get started.

**Install AWS ParallelCluster**

We use pip to install ParallelCluster. With the `--user` option, the `pcluster` command is installed at ~/.local/bin/pcluster

In [15]:
!pip install --user aws-parallelcluster
!mkdir -p /home/ec2-user/.parallelcluster



**Pre-requisites**

Before starting a cluster we will create the following AWS resources using the AWS Python SDK boto3:
* An S3 bucket that will be used for importing and exporting files to an Amazon FSx for Lustre file system
* A VPC with one subnet, an internet gateway and a route table with a public route. The cluster's master and worker nodes will be launched in this subnet when we create our cluster
* An EC2 key pair that we will use to SSH into the cluster master node

Remember to change the bucket name to something globally unique, e.g. **your-initials-pcluster**
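
Bucket names also have to follow S3's naming rules, so checking the name locally can save a failed API call. The sketch below is a minimal check, assuming the general-purpose bucket rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; first and last character a letter or digit; no consecutive dots; not shaped like an IP address). `is_valid_bucket_name` is a hypothetical helper, not part of boto3, and it cannot tell whether a name is already taken by someone else.

```python
import ipaddress
import re

# Lowercase letters, digits, dots and hyphens; 3-63 characters in total;
# the first and last character must be a letter or a digit.
_BUCKET_RE = re.compile(r'[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]')


def is_valid_bucket_name(name):
    """Return True if `name` passes basic S3 naming checks (hypothetical helper)."""
    if not _BUCKET_RE.fullmatch(name):
        return False
    if '..' in name:
        return False
    try:
        # Names formatted like an IPv4 address (e.g. 192.168.5.4) are not allowed
        ipaddress.IPv4Address(name)
        return False
    except ValueError:
        return True


print(is_valid_bucket_name('peerjako-pcluster'))
print(is_valid_bucket_name('My_Bucket'))
```

Uniqueness across all AWS accounts can only be confirmed by the `create_bucket` call itself.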

In [2]:
# Create an S3 bucket using boto3
import boto3

your_bucket_name = 'peerjako-pcluster' # CHANGE THIS!!!

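region = boto3.Session().region_name
s3 = boto3.resource('s3')
# us-east-1 is the default location and rejects an explicit LocationConstraint
bucket_config = {} if region == 'us-east-1' else {
    'CreateBucketConfiguration': {'LocationConstraint': region}}
bucket = s3.create_bucket(Bucket=your_bucket_name, **bucket_config)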

In [3]:
# Create a VPC
ec2 = boto3.resource('ec2')
ec2Client = boto3.client('ec2')
# create VPC
vpc = ec2.create_vpc(CidrBlock='10.0.0.0/16')
# we can assign a name to the VPC, or any resource, by using a tag
vpc.create_tags(Tags=[{"Key": "Name", "Value": "pcluster_vpc"}])
vpc.wait_until_available()
print(vpc.id)

vpc-0ffdf77a2685939c3


In [4]:
# Create then attach internet gateway to the VPC
ig = ec2.create_internet_gateway()
vpc.attach_internet_gateway(InternetGatewayId=ig.id)
print(ig.id)

igw-02940828dd78790b6


In [5]:
# Create a route table and a public route
route_table = vpc.create_route_table()
route = route_table.create_route(
    DestinationCidrBlock='0.0.0.0/0',
    GatewayId=ig.id
)
print(route_table.id)

rtb-0580c2e5abd02cfa2


In [6]:
# Create a subnet
subnet = ec2.create_subnet(CidrBlock='10.0.1.0/24', VpcId=vpc.id)
print(subnet.id)

# and associate the route table with the subnet
route_table.associate_with_subnet(SubnetId=subnet.id)


subnet-03a42c202d3745009


ec2.RouteTableAssociation(id='rtbassoc-0434d33aa2146758a')
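
The subnet's CIDR block has to fall inside the VPC's block. As a quick local check, the sketch below uses Python's standard `ipaddress` module with the two CIDR blocks from the cells above; the variable names are illustrative only. The figure of five reserved addresses per subnet is AWS's documented behaviour for VPC subnets.

```python
import ipaddress

vpc_cidr = ipaddress.ip_network('10.0.0.0/16')
subnet_cidr = ipaddress.ip_network('10.0.1.0/24')

# The subnet must be contained in the VPC's address range
print(subnet_cidr.subnet_of(vpc_cidr))

# AWS reserves 5 addresses in every subnet (the first four and the last one)
usable_addresses = subnet_cidr.num_addresses - 5
print(usable_addresses)
```

A /24 subnet leaves plenty of addresses for the master node and the four worker nodes used in this demo.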

In [7]:
# Enable DNS support and hostnames for the VPC. This is required by ParallelCluster
ec2Client.modify_vpc_attribute( VpcId = vpc.id , EnableDnsSupport = { 'Value': True } )
ec2Client.modify_vpc_attribute( VpcId = vpc.id , EnableDnsHostnames = { 'Value': True } )

{'ResponseMetadata': {'RequestId': '4f5beaab-c2b7-4b20-8f49-4ef569576861',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'text/xml;charset=UTF-8',
   'content-length': '237',
   'date': 'Fri, 19 Apr 2019 16:07:58 GMT',
   'server': 'AmazonEC2'},
  'RetryAttempts': 0}}

In [8]:
# Create an EC2 key pair 
keypair = ec2.create_key_pair(KeyName='aws_rsa')

# and save the key material to a file: aws_rsa.pem
with open("aws_rsa.pem", "w") as text_file:
    text_file.write(keypair.key_material)
    
!chmod 400 aws_rsa.pem
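
Between writing the file and running `chmod`, the private key briefly exists with default permissions. A possible alternative, sketched below with only the standard library, creates the file with mode 0400 from the start. `write_private_key` is a hypothetical helper, and the placeholder string stands in for `keypair.key_material`; the sketch assumes a POSIX system.

```python
import os
import stat
import tempfile


def write_private_key(path, key_material):
    """Create `path` readable only by its owner and write the key to it."""
    # O_EXCL makes the call fail rather than overwrite an existing key file
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o400)
    with os.fdopen(fd, 'w') as key_file:
        key_file.write(key_material)


# Demo in a temporary directory with placeholder content (not a real key)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo_key.pem')
write_private_key(demo_path, 'PLACEHOLDER KEY MATERIAL\n')
demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
print(oct(demo_mode))
```

In the notebook this would be called as `write_private_key('aws_rsa.pem', keypair.key_material)`, which makes the separate `chmod` step unnecessary.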

**Create a cluster configuration**

Before launching a cluster, we'll need to create a ParallelCluster configuration. You can see all the configuration options here: https://aws-parallelcluster.readthedocs.io/en/develop/configuration.html

Some of the configuration options we use for our demo cluster are:
* initial_queue_size = 4  # The cluster will start with 4 worker nodes
* scheduler = slurm  # The cluster will use the slurm scheduler. Other options are sge, torque and AWS Batch (docker containers instead of worker nodes). For a comparison between slurm, torque and sge: https://bitsanddragons.wordpress.com/2017/08/29/slurm-vs-torque-vs-sge/
* fsx_settings = fs  # Create an FSx for Lustre file system to be used by our cluster nodes

In [16]:
# Define cluster config
pcluster_config = (
    '[aws]\n'
    'aws_region_name = ' + boto3.Session().region_name + '\n'
    '\n'
    '[cluster default]\n'
    'vpc_settings = public\n'
    'key_name = ' + keypair.key_name + '\n'
    'scheduler = slurm\n'
    'initial_queue_size = 4\n'
    'maintain_initial_size = true\n'
    'fsx_settings = fs\n'
    '\n'
    '[vpc public]\n'
    'master_subnet_id = ' + subnet.id + '\n'
    'vpc_id = ' + vpc.id + '\n'
    '\n'
    '[global]\n'
    'update_check = true\n'
    'sanity_check = true\n'
    'cluster_template = default\n'
    '\n'
    '[fsx fs]\n'
    'shared_dir = /fsx\n'
    'storage_capacity = 3600\n'
    'import_path = s3://' + bucket.name + '\n'
    'imported_file_chunk_size = 1024\n'
    'export_path = s3://' + bucket.name + '/export\n'
    'weekly_maintenance_start_time = 1:00:00\n'
    '\n'
    '[aliases]\n'
    'ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}\n')

with open("/home/ec2-user/.parallelcluster/config", "w") as text_file:
    text_file.write(pcluster_config)
    
print(pcluster_config)

[aws]
aws_region_name = eu-west-1

[cluster default]
vpc_settings = public
key_name = aws_rsa
scheduler = slurm
initial_queue_size = 4
maintain_initial_size = true
fsx_settings = fs

[vpc public]
master_subnet_id = subnet-03a42c202d3745009
vpc_id = vpc-0ffdf77a2685939c3

[global]
update_check = true
sanity_check = true
cluster_template = default

[fsx fs]
shared_dir = /fsx
storage_capacity = 3600
import_path = s3://peerjako-pcluster
imported_file_chunk_size = 1024
export_path = s3://peerjako-pcluster/export
weekly_maintenance_start_time = 1:00:00

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
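
The config ties sections together by name: `vpc_settings = public` points at `[vpc public]`, and `fsx_settings = fs` points at `[fsx fs]`. The sketch below uses Python's standard `configparser` to catch a dangling reference before running `pcluster create`. It only mirrors the naming pattern visible in the config above; `missing_sections` is a hypothetical helper and not ParallelCluster's own validation, and `SAMPLE_CONFIG` is a shortened copy of the generated file.

```python
import configparser

SAMPLE_CONFIG = """
[global]
cluster_template = default

[cluster default]
vpc_settings = public
fsx_settings = fs
scheduler = slurm

[vpc public]
vpc_id = vpc-0ffdf77a2685939c3
master_subnet_id = subnet-03a42c202d3745009

[fsx fs]
shared_dir = /fsx
"""


def missing_sections(config_text):
    """List sections that the active cluster template refers to but that are absent."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    cluster = 'cluster ' + parser.get('global', 'cluster_template')
    if not parser.has_section(cluster):
        return [cluster]
    missing = []
    # e.g. "vpc_settings = public" must be backed by a "[vpc public]" section
    for key, prefix in (('vpc_settings', 'vpc'), ('fsx_settings', 'fsx')):
        if parser.has_option(cluster, key):
            section = prefix + ' ' + parser.get(cluster, key)
            if not parser.has_section(section):
                missing.append(section)
    return missing


print(missing_sections(SAMPLE_CONFIG))
```

In the notebook, passing `pcluster_config` instead of `SAMPLE_CONFIG` would run the same check on the real configuration string.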



**Copying demo files into the bucket**

For our demo we will need a couple of files (more info on those later). We copy these files into our bucket so that they are available to the cluster through the Lustre file system.

In [17]:
s3.Bucket(bucket.name).upload_file('pi-mpi.c','pi-mpi.c')
s3.Bucket(bucket.name).upload_file('batch.sh','batch.sh')

**Create the AWS ParallelCluster cluster**

We are now ready to create the cluster using the "pcluster create" command. 
The cluster can take 10-20 minutes to create.

In [None]:
# Create our first cluster and call it hello-cluster1
!~/.local/bin/pcluster create hello-cluster1

Beginning cluster creation for cluster: hello-cluster1
Creating stack named: parallelcluster-hello-cluster1
Status: SQSPolicy - CREATE_IN_PROGRESS                                          
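
Since creation takes a while, it can be handy to follow the stack status from a second notebook or script. The sketch below is a generic polling loop with a timeout. `wait_for_status` and its parameters are hypothetical names, and the demo feeds it canned statuses rather than calling AWS.

```python
import time


def wait_for_status(get_status, done_states, poll_seconds=30,
                    timeout_seconds=1800, sleep=time.sleep):
    """Poll get_status() until it returns one of done_states or the timeout expires."""
    waited = 0
    while True:
        status = get_status()
        if status in done_states:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError('still ' + status + ' after ' + str(waited) + ' seconds')
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with canned statuses instead of real API calls
fake_statuses = iter(['CREATE_IN_PROGRESS', 'CREATE_IN_PROGRESS', 'CREATE_COMPLETE'])
final_status = wait_for_status(
    lambda: next(fake_statuses),
    done_states={'CREATE_COMPLETE', 'CREATE_FAILED', 'ROLLBACK_COMPLETE'},
    sleep=lambda seconds: None,  # skip the real delay in the demo
)
print(final_status)
```

Against a real cluster, `get_status` could wrap `boto3.client('cloudformation').describe_stacks(StackName='parallelcluster-hello-cluster1')['Stacks'][0]['StackStatus']`, using the stack name printed in the output above.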

**Check out the cluster resources**

pcluster uses AWS CloudFormation to create AWS resources (infrastructure as code):

https://eu-west-1.console.aws.amazon.com/cloudformation/home

The FSx for Lustre file system can be seen here:

https://eu-west-1.console.aws.amazon.com/fsx/home#file-systems

Once the file system is created, you can see that pcluster creates the master and worker EC2 instances:

https://eu-west-1.console.aws.amazon.com/ec2/v2/home#Instances:sort=instanceId

**SSH into the cluster and test**

Open a terminal and ssh into the cluster:

`~/.local/bin/pcluster ssh hello-cluster1 -i aws_rsa.pem`

Let's check whether we have compute nodes available:

`sinfo`

Let's run a simple job to check that everything is working the right way:

`srun -n 1 /bin/hostname`

Now let's run four tasks instead of a single one (`-n` sets the number of tasks, which Slurm distributes across the available compute nodes):

`srun -n 4 /bin/hostname`


**Calculate PI**

OK, we now have a cluster and have checked that we can run a job on several nodes.

Now let's do something slightly more complex and run an MPI job that calculates Pi.

Change directory into the Lustre partition and check the content of the file **pi-mpi.c**:

`cd /fsx && cat pi-mpi.c`
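
The contents of pi-mpi.c are not reproduced in this notebook. As an illustration only, a common way to compute Pi in parallel is to integrate 4/(1+x²) over [0, 1] with the midpoint rule, give each rank every `size`-th interval, and sum the partial results (the job an `MPI_Reduce` call would do). The pure-Python sketch below simulates the ranks in a loop; it is an assumption about the general technique, not a translation of pi-mpi.c, and all names in it are hypothetical.

```python
import math


def partial_pi(rank, size, intervals):
    """Midpoint-rule contribution of one rank, which handles every size-th interval."""
    width = 1.0 / intervals
    total = 0.0
    for i in range(rank, intervals, size):
        x = (i + 0.5) * width
        total += 4.0 / (1.0 + x * x)
    return total * width


def simulated_pi(size=4, intervals=100000):
    """Sum the partial results of all ranks, as a reduction step would."""
    return sum(partial_pi(rank, size, intervals) for rank in range(size))


estimate = simulated_pi()
print(estimate, abs(estimate - math.pi))
```

With four simulated ranks each one evaluates a quarter of the intervals, which is the work split that running the real job with `-n 4` relies on.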

GCC and Open MPI have been pre-installed by AWS ParallelCluster, so we just need to set up the environment:

`module load mpi && which mpicc`

OK, now we have everything we need to compile our small MPI application:

`mpicc -v -lm -o pi-mpi pi-mpi.c`

We should now have a pi-mpi binary ready to be launched. We have a simple batch.sh bash file that we will use to run a batch job:

`cat batch.sh`

We can now submit it to the queue: 

`sbatch -n 4 ./batch.sh`

and check the status of the queue:

`squeue`

The job might be pending while AWS ParallelCluster launches the resources it needs.
Wait for the resources to be launched; once the job has finished, you'll find a slurm-<jobID>.out file in your directory containing the output of the job:
    
`tail -f slurm-*.out`
    
Once you have the Pi result, press Ctrl-C to stop tailing the file.
    
**Here we are: you have run your first MPI parallel job on AWS. Congratulations!**


**Play with elasticity**

Submit several jobs in the queue (sbatch), look at the available resources (sinfo), look at the queue (squeue) and see how the situation evolves.


**Recover the job files**

Both the master node and the worker nodes have the FSx for Lustre file system mounted at /fsx.

Using the lfs client you can archive files from the cluster (both master and workers) to your S3 bucket's export path.

Our Slurm output file(s) are in /fsx, so let's archive them to the S3 bucket:

`sudo lfs hsm_archive /fsx/*.out`

Read more about how to use the Lustre client with S3 here:

https://docs.aws.amazon.com/fsx/latest/LustreGuide/fsx-data-repositories.html

**List the job output files that were exported from Lustre into the S3 bucket**

In [None]:
for object_summary in bucket.objects.filter(Prefix="export/"):
    print(object_summary.key)

**Tear down the cluster**

In [None]:
# Delete the cluster
!~/.local/bin/pcluster delete hello-cluster1

**Delete the other AWS resources (VPC, EC2 key pair and S3 bucket)**

In [None]:
# Delete the VPC
ec2Client.delete_subnet(SubnetId = subnet.id)
ec2Client.delete_route_table(RouteTableId = route_table.id)
ec2Client.detach_internet_gateway(InternetGatewayId = ig.id, VpcId = vpc.id)
ec2Client.delete_internet_gateway(InternetGatewayId = ig.id)
ec2Client.delete_vpc(VpcId = vpc.id)

In [None]:
# Delete the EC2 key pair and the pem file
ec2Client.delete_key_pair(KeyName = keypair.key_name)
!rm -rf aws_rsa.pem

In [None]:
# Delete the bucket - Warning! This will also delete all the files.
bucket.objects.all().delete()
bucket.delete()

**Documentation**

https://aws-parallelcluster.readthedocs.io/en/latest/index.html
