## EC2 Control

This notebook demonstrates how to set up a single G4dn.4xlarge EC2 instance and launch a Docker container running Jupyter Lab.

Prerequisites: Before running this notebook, ensure that you have set up your AWS CLI credentials.

### Read in configuration files

Sample EC2 instance configurations are included in this repo under `docs/sample_configs`. Read in the G4dn.4xlarge config.

In [6]:
import yaml
import pprint

with open("../sample_configs/g4dn4x.yaml") as in_config:
    config = yaml.safe_load(in_config)
pprint.pprint(config)

{'BlockDeviceMappings': [{'DeviceName': '/dev/sda1',
                          'Ebs': {'VolumeSize': 1000, 'VolumeType': 'gp2'}}],
 'ImageId': 'ami-078a7f1dda72c0775',
 'InstanceType': 'g4dn.4xlarge',
 'KeyName': 'your_keypair_name',
 'MaxCount': 1,
 'MinCount': 1,
 'Monitoring': {'Enabled': False},
 'Placement': {'AvailabilityZone': 'us-east-1b'},
 'SecurityGroupIds': ['sg-your_security_group'],
 'SubnetId': 'subnet-your_subnet',
 'TagSpecifications': [{'ResourceType': 'instance',
                        'Tags': [{'Key': 'Name',
                                  'Value': 'some_name_for_your_instance'}]}]}


Note that a few values still need to be filled in for your own account. The key pair can be any previously created EC2 key pair. The subnet must be one associated with the selected availability zone; it can be found on the EC2 setup page of the AWS console. `SecurityGroupIds` can list one or more security groups, which are also found in the console. Finally, under tags, give your instance any name you'd like.

In [24]:
config['KeyName'] = 'keyname'
config['SecurityGroupIds'] = ['some_security_group']
config['TagSpecifications'][0]['Tags'][0]['Value'] = 'instance_name'
config['SubnetId'] = 'subnet'
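
Before launching, it can help to confirm that no placeholder values remain in the configuration. The sketch below is not part of `ec2_control`; `find_placeholders` is a hypothetical helper that walks a nested config and reports any string still containing `your_`, the placeholder convention the sample config above happens to use. The `SubnetId` value in the sample data is made up for illustration.

```python
def find_placeholders(node, marker="your_", path=""):
    """Return the paths of string values that still contain the marker."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits.extend(find_placeholders(value, marker, f"{path}/{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_placeholders(value, marker, f"{path}/{index}"))
    elif isinstance(node, str) and marker in node:
        hits.append(path)
    return hits


# A trimmed-down config in the same shape as the sample YAML.
sample = {
    "KeyName": "your_keypair_name",
    "SecurityGroupIds": ["sg-your_security_group"],
    "SubnetId": "subnet-0abc",  # hypothetical filled-in value
    "TagSpecifications": [
        {"Tags": [{"Key": "Name", "Value": "some_name_for_your_instance"}]}
    ],
}
placeholder_paths = find_placeholders(sample)
print(placeholder_paths)
```

Calling `find_placeholders(config)` just before `run_instances` and checking that it returns an empty list is one way to catch a forgotten value early.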

To keep these tools as transparent and extensible as possible, the EC2 controls provided in this package are thin wrappers that simplify existing tools for working with instances. Wherever possible, I use existing tools directly. Below, use Boto3 to launch your instance.

In [25]:
import boto3
ec2_session = boto3.Session(region_name="us-east-1")
ec2_client = ec2_session.client("ec2")
ec2_resource = ec2_session.resource("ec2")

In [26]:
# launch instance
response = ec2_client.run_instances(**config)

Now we can use the `ec2_control` ssh tool to communicate with the instance. First, we need to get the public IP address of the instance.

In [27]:
from ec2_control import ssh

In [30]:
instances = [instance['InstanceId'] for instance in response['Instances']]
status = ec2_resource.meta.client.describe_instances(InstanceIds=instances)
public_ips = [instance['PublicIpAddress'] for instance in status['Reservations'][0]['Instances']]
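
A freshly launched instance may not have a public IP address yet, so the lookup above can fail if it runs immediately after `run_instances`. A minimal sketch of one way around this, assuming a standard Boto3 EC2 client: block on the client's `instance_running` waiter, then collect addresses across every reservation. `wait_for_public_ips` is a hypothetical helper name, not part of `ec2_control`.

```python
def wait_for_public_ips(client, instance_ids):
    """Block until the instances are running, then return their public IPs."""
    # Boto3 EC2 clients provide an "instance_running" waiter that polls
    # describe_instances until every listed instance is in the running state.
    client.get_waiter("instance_running").wait(InstanceIds=instance_ids)
    status = client.describe_instances(InstanceIds=instance_ids)
    return [
        instance["PublicIpAddress"]
        for reservation in status["Reservations"]
        for instance in reservation["Instances"]
        if "PublicIpAddress" in instance  # key is absent without a public IP
    ]
```

With the objects defined earlier in this notebook, this would be called as `public_ips = wait_for_public_ips(ec2_client, instances)`.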

In [32]:
# create an SSH client
ssh_client = ssh.SSH(public_ips, '/root/.aws/keyfile.pem')

In [35]:
# test connection
pci = ssh_client.run_on_all('lspci')
pprint.pprint(pci)

[{'stderr': '',
  'stdout': '00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC '
            '[Natoma]\n'
            '00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA '
            '[Natoma/Triton II]\n'
            '00:01.3 Non-VGA unclassified device: Intel Corporation '
            '82371AB/EB/MB PIIX4 ACPI (rev 08)\n'
            '00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111\n'
            '00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device '
            '8061\n'
            '00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network '
            'Adapter (ENA)\n'
            '00:1e.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)\n'
            '00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD '
            'Controller\n'}]


Notice that the above command uses `run_on_all`. The ssh client can send commands to multiple EC2 nodes at the same time. In this case, there is only one node, so `run_on_all` and `run_on_master` do the same thing. When dealing with multiple nodes, there are also the options to use `run_on_workers`, or to send commands to a specific node by using `run_on_node` and supplying that node's IP address.

To get data in and out of your instance, you need to supply it with AWS credentials associated with your account. This can be done on the instance itself by running `aws configure`. As a simple way to get started, here are the commands to pass your local credentials to the AWS CLI configuration on the instance.

In [38]:
# Read local credentials from the current user's home directory
# (expanduser resolves ~ on Linux and Mac alike)

import os
import configparser
credentials = configparser.ConfigParser()
credentials.read(os.path.expanduser('~/.aws/credentials'))
# use a separate name so the EC2 launch config above is not overwritten
aws_config = configparser.ConfigParser()
aws_config.read(os.path.expanduser('~/.aws/config'))

# run AWS configure passing in access keys and secret
ssh_client.run_on_all('aws configure set aws_access_key_id {}'.format(credentials['default']['aws_access_key_id']))
ssh_client.run_on_all('aws configure set aws_secret_access_key {}'.format(credentials['default']['aws_secret_access_key']))
ssh_client.run_on_all('aws configure set default.region {}'.format(aws_config['default']['region']))

# make sure to delete credentials so we don't accidentally do something bad with them
del credentials
del aws_config
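
Because these commands are interpolated into a shell command line, a value containing spaces or shell metacharacters would be misread on the remote side. A minimal sketch of a defensive variant, assuming only the standard library's `shlex.quote`; `aws_configure_commands` is a hypothetical helper, and the key values shown are made-up placeholders.

```python
import shlex


def aws_configure_commands(access_key_id, secret_access_key, region):
    """Build `aws configure set` commands with shell-quoted values."""
    settings = [
        ("aws_access_key_id", access_key_id),
        ("aws_secret_access_key", secret_access_key),
        ("default.region", region),
    ]
    # shlex.quote leaves safe strings untouched and wraps anything else
    # in single quotes, so the remote shell sees exactly one argument.
    return [
        f"aws configure set {name} {shlex.quote(value)}"
        for name, value in settings
    ]


# Placeholder values for illustration only.
commands = aws_configure_commands("EXAMPLEKEYID", "example secret", "us-east-1")
for command in commands:
    print(command)
```

Each resulting string can then be passed to `ssh_client.run_on_all` as in the cell above.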

The G4dn and P3dn instances include high-speed NVMe drives that can be useful when training neural nets that require high data throughput. Below, we mount the drive to a directory called `shared_workspace`, which we will later make available inside the Docker container.

Note that this section is optional, but can be useful in some cases. Be aware that anything stored on NVMe storage will be deleted when the instance is stopped, so if you want this instance to keep data for future use, it may be best not to use NVMe.

In [40]:
lsblk = ssh_client.run_on_all('lsblk')

In [43]:
# check the name of the drive we want to mount
pprint.pprint(lsblk)

[{'stderr': '',
  'stdout': 'NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT\n'
            'loop0         7:0    0    18M  1 loop '
            '/snap/amazon-ssm-agent/1566\n'
            'loop2         7:2    0  93.9M  1 loop /snap/core/9066\n'
            'loop3         7:3    0    97M  1 loop /snap/core/9289\n'
            'nvme0n1     259:0    0 209.6G  0 disk \n'
            'nvme1n1     259:1    0  1000G  0 disk \n'
            '└─nvme1n1p1 259:2    0  1000G  0 part /\n'}]
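
Picking the drive by eye works for a single node; the same check can be sketched in code. The helper below is a hypothetical illustration (not part of `ec2_control`) that assumes the default `lsblk` column layout shown above: it returns disks that have no mount point and no child partitions.

```python
def find_unused_disks(lsblk_stdout):
    """Return names of disks with no mount point and no partitions."""
    unused = []
    last_disk = None
    for line in lsblk_stdout.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue
        name, dev_type = fields[0], fields[5]
        has_mountpoint = len(fields) > 6
        if not name[0].isalnum():
            # Rows drawn with tree characters are children (partitions)
            # of the preceding disk, so that disk is already in use.
            if last_disk in unused:
                unused.remove(last_disk)
            continue
        if dev_type == "disk":
            last_disk = name
            if not has_mountpoint:
                unused.append(name)
        else:
            last_disk = None
    return unused


# A shortened copy of the lsblk output shown above.
sample_stdout = (
    "NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT\n"
    "loop2         7:2    0  93.9M  1 loop /snap/core/9066\n"
    "nvme0n1     259:0    0 209.6G  0 disk \n"
    "nvme1n1     259:1    0  1000G  0 disk \n"
    "└─nvme1n1p1 259:2    0  1000G  0 part /\n"
)
unused = find_unused_disks(sample_stdout)
print(unused)
```

In this notebook the input would be `lsblk[0]['stdout']`.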


In [44]:
# nvme0n1 is not mounted, so that's the drive we'll use

# create a directory
ssh_client.run_on_all('mkdir -p ~/shared_workspace')

# format nvme0n1 with XFS and mount it to that directory,
# then give broad read, write, and execute permissions
# so it can be shared with Docker
ssh_client.run_on_all('sudo mkfs -t xfs /dev/nvme0n1')
ssh_client.run_on_all('sudo mount /dev/nvme0n1 ~/shared_workspace')
ssh_client.run_on_all('mkdir -p ~/shared_workspace/data')
ssh_client.run_on_all('sudo chmod -R 777 ~/shared_workspace')

[{'stdout': '', 'stderr': ''}]

At this point, you might want to do something like load data onto your drive for later use. One way is to pull data from S3 onto your instance. First, use s3fs to explore your S3 buckets and find the data you're interested in, then send the command to download to your node.

In [46]:
from s3fs import S3FileSystem

In [48]:
# unless otherwise specified, s3fs will use
# your default AWS credentials
s3 = S3FileSystem()

In [1]:
s3.ls('s3://some-bucket/data/train')

In [59]:
download_command = "aws s3 cp --recursive s3://some-bucket/data/train ~/shared_workspace/data"
ssh_client.run_on_all("mkdir -p ~/shared_workspace/data")

[{'stdout': '', 'stderr': ''}]

By default, the ssh client will submit a command and wait for it to complete. When downloading data, we might not want to wait. With the option `wait=False`, the ssh client instead returns a Python thread that monitors the status of the command.

In [61]:
data_download_thread = ssh_client.run_on_all(download_command, wait=False)
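
If the returned object is a standard `threading.Thread` (an assumption; the exact type returned by `ec2_control` is not shown here), it can be polled with `is_alive()` or waited on with `join()`. The self-contained sketch below uses a stand-in task instead of a real download to show the pattern.

```python
import threading
import time


def slow_task(results):
    """Stand-in for a long-running remote command."""
    time.sleep(0.2)
    results.append("done")


results = []
worker = threading.Thread(target=slow_task, args=(results,))
worker.start()

# Other work can continue here while the task runs in the background.

# Block until the task finishes, giving up after 5 seconds.
worker.join(timeout=5)
finished = not worker.is_alive()
print(finished, results)
```

Under the same assumption, `data_download_thread.join()` would block until the download completes.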

Included in this repo is a Dockerfile to create a container that runs Jupyter Lab and includes a number of common TensorFlow development tools. The next cell deploys this Dockerfile to the instance and builds the container. If using another container, skip this section.

Two notes about the lines below. First, `scp_local_to_master` copies local files to the master node. As with the `run_on_...` commands, this can be used to interface with master, worker, or all nodes. There is also a tool to copy from nodes to local.

Second, we copy `/opt/amazon/efa` into the `docker` directory. This directory contains the EFA drivers for using the high-speed EFA interconnect between nodes. While this feature is not available on G4dn nodes, it's worth adding to the Docker image in case the image is later used on a P3dn or similar node.

In [None]:
dockerhub_user = 'your dockerhub user name'
dockerhub_repo = 'ec2_notebook'
dockerhub_tag = 'tutorial'

ssh_client.scp_local_to_master('../docker', 'docker', recursive=True)
ssh_client.run_on_master('cp -R /opt/amazon/efa docker/')
ssh_client.run_on_master('cd docker && docker build -t {}/{}:{} .'.format(dockerhub_user,
                                                                          dockerhub_repo,
                                                                          dockerhub_tag))

In [65]:
ec2_client.terminate_instances(InstanceIds=instances)

{'TerminatingInstances': [{'CurrentState': {'Code': 32,
    'Name': 'shutting-down'},
   'InstanceId': 'i-02acc8f01fbaa06ad',
   'PreviousState': {'Code': 16, 'Name': 'running'}}],
 'ResponseMetadata': {'RequestId': '61284b4a-0350-4ca3-890a-09d5f58ba6d4',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '61284b4a-0350-4ca3-890a-09d5f58ba6d4',
   'content-type': 'text/xml;charset=UTF-8',
   'transfer-encoding': 'chunked',
   'vary': 'accept-encoding',
   'date': 'Fri, 12 Jun 2020 01:57:12 GMT',
   'server': 'AmazonEC2'},
  'RetryAttempts': 0}}