# Amazon Basics 3: Basic Scripting

## Usage Notes

This notebook reduces the boilerplate involved in two tasks: uploading and downloading files to and from Amazon Elastic Compute Cloud (EC2) instances, and running installation scripts on multiple servers (essentially, the management tasks usually performed by tools like Puppet or Chef).

## Notebook Imports

In [None]:
from __future__ import print_function
from aws_base import *
from IPython.utils.py3compat import *
import os
import pysftp
import subprocess

## Update Your User

Now we'll update our user to give them permissions to read information about Amazon EC2 instances.

https://console.aws.amazon.com/iam/home#users

Select the user that you created in the previous notebook and click on the **Add Permissions** button. On the permissions screen, select **Attach existing policies directly** and add ``AmazonEC2ReadOnlyAccess``.

## Specify Private Key File

When creating an EC2 instance, you need to specify the key pair whose private key you will use to access the instance. If you don't have a key pair yet, please create one in this region.

http://console.aws.amazon.com/ec2/v2/home#KeyPairs

The scripts will need to use your private key in order to transfer files to the servers. Please set the on-disk location of your private key here.

In [None]:
private_key_name = 'mdang-training-shared'
private_key_location = '~/Dropbox/Amazon/mdang-training-shared.pem'

We will want to confirm that the file exists.

In [None]:
private_key_location = os.path.expanduser(private_key_location)
assert private_key_location is not None and os.path.isfile(private_key_location)

SSH keys are not shared across regions. The following checks that the key name specified above matches one of the key pairs in the region our commands will use, which is the region set through `aws configure`.

In [None]:
private_key_json = aws(
    'ec2', 'describe-key-pairs', '--key-names', private_key_name)

private_keys = None

if private_key_json is not None:
    private_keys = private_key_json['KeyPairs']

Note that the command this time is `ec2`, and that we're doing a read command. This is the reason that we wanted the `AmazonEC2ReadOnlyAccess` permission on our user.
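
The cell above stores the `KeyPairs` list but does not inspect it. Here is a minimal sketch of one way the result could be checked, assuming `aws()` returns the parsed JSON as a dictionary (or `None` on failure), as the code above implies. The helper name `find_key_pair` and the sample response are hypothetical and hand-written for illustration.

```python
def find_key_pair(response, key_name):
    """Return the entry for key_name from a describe-key-pairs response, or None."""
    if response is None:
        return None

    for key_pair in response.get('KeyPairs', []):
        if key_pair.get('KeyName') == key_name:
            return key_pair

    return None

# Hand-written sample shaped like the CLI's JSON output; not real data.
sample_response = {
    'KeyPairs': [
        {'KeyName': 'example-key', 'KeyFingerprint': '00:11:22'}
    ]
}

print(find_key_pair(sample_response, 'example-key') is not None)
print(find_key_pair(sample_response, 'missing-key') is not None)
```

A check like this could back an `assert` so that a mismatch between the key name and the region is caught before any servers are contacted.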

## Utility Methods

### Local Folders

This series of notebooks creates a lot of intermediate files. In order to reduce clutter in the local folder, we'll create those intermediate files in special folders.

In [None]:
if not os.path.isdir('awscli'):
    os.mkdir('awscli')

if not os.path.isdir('output'):
    os.mkdir('output')

if not os.path.isdir('scripts'):
    os.mkdir('scripts')
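
The three checks above follow the same pattern, so they could be folded into one helper. This is only a sketch: it assumes Python 3.2 or later (for the `exist_ok` argument), and the name `ensure_folders` is hypothetical. The demonstration runs in a temporary directory so that it does not touch the notebook folder.

```python
import os
import tempfile

def ensure_folders(base_dir, folder_names):
    """Create each folder under base_dir if it does not already exist."""
    created = []

    for name in folder_names:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder is present
        created.append(path)

    return created

# Demonstrate in a throwaway directory rather than the notebook folder.
demo_dir = tempfile.mkdtemp()
folders = ensure_folders(demo_dir, ['awscli', 'output', 'scripts'])

# Calling it a second time is harmless because existing folders are accepted.
ensure_folders(demo_dir, ['awscli', 'output', 'scripts'])

print(all(os.path.isdir(folder) for folder in folders))
```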

### Backwards Compatible check_output

In [None]:
"""
Utility method which calls subprocess.check_output and wraps the result
in bytes_to_str() and strips the extra whitespace
"""
def check_output(args):
    return bytes_to_str(subprocess.check_output(args)).strip()
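
The wrapper exists because `subprocess.check_output` returns bytes on Python 3, usually with a trailing newline. The sketch below shows the same idea without the IPython helper, as an illustration rather than a replacement. The name `check_output_text` is hypothetical, and it assumes the command's output is UTF-8.

```python
import subprocess
import sys

def check_output_text(args):
    """Run a command and return its standard output as a stripped string."""
    raw = subprocess.check_output(args)

    # On Python 3 the result is bytes and must be decoded.
    # On Python 2 it is already a str, so it is left alone.
    if not isinstance(raw, str):
        raw = raw.decode('utf-8')

    return raw.strip()

# Use the current Python interpreter as a command that exists everywhere.
result = check_output_text([sys.executable, '-c', 'print("  hello  ")'])
print(repr(result))
```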

### Check Known Hosts

Most SSH-based commands against an EC2 server (such as the SFTP uploads and remote scripts below) only work from a notebook if the server is recognized as a known host.

In [None]:
"""
Utility method to determine if the host is registered in the known_hosts file.
Uses ssh-keygen directly to determine this.
"""
def is_known_host(host_name):
    is_known_host_name = subprocess.call(['ssh-keygen', '-F', host_name])
    return is_known_host_name == 0

"""
Utility method to add a known host to the known_hosts file.
"""
def add_known_host(host_name, key_type, key_value):
    remove_known_host(host_name)

    with open(os.path.expanduser('~/.ssh/known_hosts'), 'a') as known_hosts:
        # Each entry must end with a newline so the next entry starts on its own line
        known_hosts.write('%s %s %s\n' % (host_name, key_type, key_value))

    print('%s added as known host' % host_name)

"""
Utility method to remove a known host from the known_hosts file.
"""
def remove_known_host(host_name):
    if is_known_host(host_name):
        subprocess.call(['ssh-keygen', '-R', host_name])
        print('%s removed from known hosts' % host_name)
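
An unhashed `known_hosts` entry is one line of the form `host key-type key`. The sketch below appends two entries to a temporary file and reads them back to show that layout. The helper names, host names and key values are all made up for illustration, and the real `~/.ssh/known_hosts` is not touched.

```python
import os
import tempfile

def format_known_host_entry(host_name, key_type, key_value):
    """Build one known_hosts line, terminated by a newline."""
    return '%s %s %s\n' % (host_name, key_type, key_value)

def append_known_host(known_hosts_path, host_name, key_type, key_value):
    """Append a single entry to the given known_hosts file."""
    with open(known_hosts_path, 'a') as known_hosts:
        known_hosts.write(
            format_known_host_entry(host_name, key_type, key_value))

# Write to a temporary file; host names and keys are placeholders.
demo_path = os.path.join(tempfile.mkdtemp(), 'known_hosts')
append_known_host(demo_path, 'host-a.example.com', 'ssh-ed25519', 'AAAAexamplekeyA')
append_known_host(demo_path, 'host-b.example.com', 'ssh-ed25519', 'AAAAexamplekeyB')

with open(demo_path) as known_hosts:
    lines = known_hosts.read().splitlines()

print(len(lines))
```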

### Upload Files

Some steps will require uploading a file to all servers. The following wrapper checks the file size to decide how to send it: files smaller than 1 MB are uploaded to the server(s) directly over SFTP, and files of 1 MB or more go through Amazon S3 as an intermediary.

In [None]:
"""
Utility method which uploads the given file via SFTP or S3 depending on size.
"""
def upload_file(user_name, host_names, source_file_name, target_file_name = None):
    file_size = os.path.getsize(source_file_name)

    if target_file_name is None:
        target_file_name = os.path.basename(source_file_name)

    if file_size < 1024 * 1024:
        upload_file_sftp(user_name, host_names, source_file_name, target_file_name)
    else:
        upload_file_s3(user_name, host_names, source_file_name, target_file_name)
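
The size check can be isolated into a small pure function, which makes the boundary behaviour easy to see. This is a sketch only: `choose_upload_method` and `SIZE_THRESHOLD` are hypothetical names, and the threshold mirrors the `1024 * 1024` used in `upload_file`.

```python
SIZE_THRESHOLD = 1024 * 1024  # same cutoff as upload_file, in bytes

def choose_upload_method(file_size, threshold=SIZE_THRESHOLD):
    """Return 'sftp' for files below the threshold and 's3' otherwise."""
    if file_size < threshold:
        return 'sftp'

    return 's3'

for size in (0, SIZE_THRESHOLD - 1, SIZE_THRESHOLD, 5 * SIZE_THRESHOLD):
    print(size, choose_upload_method(size))
```

Note that a file of exactly `1024 * 1024` bytes takes the S3 path, because the comparison is strict.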

#### Upload File - SFTP

In [None]:
"""
Utility method which uploads the given file via SFTP to multiple servers.
"""
def upload_file_sftp(user_name, host_names, source_file_name, target_file_name):
    global private_key_location

    print('Uploading %s to %d servers' % (source_file_name, len(host_names)))

    for host_name in host_names:
        if not is_known_host(host_name):
            print('%s is not a known host' % host_name)
            continue

        with pysftp.Connection(
            host_name, username = user_name,
            private_key = private_key_location) as sftp:

            sftp.put(source_file_name, target_file_name)

#### Upload File - S3

Some steps will require uploading a large file to all servers. We can use our parallel script runner in order to accomplish this by using S3 as an intermediary.

In [None]:
"""
Utility method which copies a file to an S3 bucket (with a mostly-unique ID) and
downloads the file onto multiple servers. Optionally can be used to upload a
file to S3 but not to any servers by passing an empty list of host names.
"""

def upload_file_s3(user_name, host_names, source_file_name, target_file_name):
    with open('awscli/bucket.txt', 'r') as bucket_file:
        bucket_name = bucket_file.read().strip()

    target_file_path = 's3://%s/%s' % (bucket_name, target_file_name)

    if host_names is not None and len(host_names) > 0:
        local_host_name = check_output(['hostname', '-s'])
        local_timestamp = check_output(['date', '+%s'])

        suffix = '.' + local_host_name + '.' + local_timestamp
        target_file_path += suffix

    aws('s3', 'cp', source_file_name, target_file_path)

    if host_names is None or len(host_names) == 0:
        return

    script_file_name = 'download_from_s3.sh'

    with open('scripts/' + script_file_name, 'w') as script_file:
        script_file.write('#!/bin/bash\n')
        script_file.write('source ~/.profile\n')
        script_file.write('aws s3 cp %s %s' % (target_file_path, target_file_name))

    run_script(user_name, host_names, script_file_name)
    aws('s3', 'rm', target_file_path)
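
The "mostly-unique ID" is the local host name plus a Unix timestamp, appended only when the file is going on to servers. The sketch below builds the same S3 path without running any commands. It is an approximation: `build_s3_target` is a hypothetical name, and `socket.gethostname().split('.')[0]` stands in for `hostname -s`.

```python
import socket
import time

def build_s3_target(bucket_name, target_file_name, host_names,
                    local_host_name=None, timestamp=None):
    """Return the S3 path, with a host/timestamp suffix when servers are given."""
    target_file_path = 's3://%s/%s' % (bucket_name, target_file_name)

    # With no servers, the file stays in S3 under its plain name.
    if not host_names:
        return target_file_path

    if local_host_name is None:
        local_host_name = socket.gethostname().split('.')[0]

    if timestamp is None:
        timestamp = int(time.time())

    return '%s.%s.%d' % (target_file_path, local_host_name, timestamp)

# Bucket, file, host and timestamp values here are placeholders.
print(build_s3_target('example-bucket', 'data.bin', []))
print(build_s3_target('example-bucket', 'data.bin', ['10.0.0.1'],
                      local_host_name='laptop', timestamp=1700000000))
```

The suffix keeps two people who upload a file with the same name at about the same time from overwriting each other's temporary copy.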

### Run Remote Scripts

Many of the things we build involve creating small shell scripts which we then execute on all servers in parallel.

In [None]:
"""
Utility method which uploads a file in the scripts folder to multiple servers
and prepares a script that will run it on all servers.
"""
def run_script(user_name, host_names, script_name):
    global private_key_location

    upload_file(user_name, host_names, 'scripts/' + script_name)

    with open('scripts/run_script.sh', 'w') as script_file:
        for host_name in host_names:
            if not is_known_host(host_name):
                print('%s is not a known host' % host_name)
                continue

            # Ensure that the file has execute permissions on the host

            script_file.write('ssh -i %s %s@%s "chmod u+x %s"\n' % \
                (private_key_location, user_name, host_name, script_name))

            # Execute the script on the host, but log to a local file to avoid
            # filling up the notebook with text. Also run each script in the
            # background since the nodes do not depend on each other.

            output_log = 'output/' + script_name + '.' + host_name + '.log'

            script_file.write('ssh -i %s %s@%s "./%s" > %s 2>&1 &\n' % \
                (private_key_location, user_name, host_name, script_name, output_log))

        # Wait for all background processes to finish.

        script_file.write('wait\n')

    print('Executing %s on %d servers' % (script_name, len(host_names)))

    subprocess.call(['chmod', 'u+x', 'scripts/run_script.sh'])
    subprocess.call('scripts/run_script.sh', shell = True)

    print('Completed %s on %d servers' % (script_name, len(host_names)))
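
To make the generated `run_script.sh` easier to picture, the sketch below builds the same text as a string instead of writing it to disk. It leaves out the known-host check and does not quote its arguments, so it assumes paths and host names contain no spaces. The name `build_run_script` and all sample values are hypothetical.

```python
def build_run_script(private_key, user_name, host_names, script_name):
    """Return the text of a shell script that runs script_name on every host."""
    lines = []

    for host_name in host_names:
        target = '%s@%s' % (user_name, host_name)
        output_log = 'output/%s.%s.log' % (script_name, host_name)

        # Make the uploaded script executable on the host.
        lines.append('ssh -i %s %s "chmod u+x %s"' %
                     (private_key, target, script_name))

        # Run it in the background and capture its output in a local log.
        lines.append('ssh -i %s %s "./%s" > %s 2>&1 &' %
                     (private_key, target, script_name, output_log))

    # Block until every background ssh process has finished.
    lines.append('wait')

    return '\n'.join(lines) + '\n'

script_text = build_run_script(
    '/tmp/key.pem', 'ubuntu', ['10.0.0.1', '10.0.0.2'], 'setup.sh')

print(script_text)
```

Each host contributes a foreground `chmod` line and a background run line, and the final `wait` is what lets all the hosts work at the same time.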

### Upload Folders

Some steps will require uploading a folder to all servers. The following is a wrapper which transforms the folder into an archive and uploads the archive to all servers. Note that it will not auto-extract the archive, and relies on the caller to do so afterwards.

In [None]:
"""
Utility method which creates an archive of the specified folder and uploads it to
all servers using the same name as the folder.
"""
def upload_archive(user_name, host_names, source_folder_name):
    archive_name = os.path.basename(source_folder_name) + '.tar.gz'
    subprocess.call(['tar', '-acf', archive_name, source_folder_name])
    upload_file(user_name, host_names, archive_name)
    os.remove(archive_name)
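
The archive step can also be written with Python's standard `tarfile` module instead of the `tar` command. This is an alternative sketch, not what `upload_archive` does. The name `make_archive` and the `config` folder are hypothetical, and everything happens in a temporary directory.

```python
import os
import tarfile
import tempfile

def make_archive(source_folder_name, output_dir):
    """Create <folder>.tar.gz in output_dir and return the archive's path."""
    folder = os.path.normpath(source_folder_name)
    folder_name = os.path.basename(folder)
    archive_path = os.path.join(output_dir, folder_name + '.tar.gz')

    with tarfile.open(archive_path, 'w:gz') as archive:
        # arcname stores paths relative to the folder, not the absolute path.
        archive.add(folder, arcname=folder_name)

    return archive_path

# Build a small sample folder to archive.
work_dir = tempfile.mkdtemp()
source_dir = os.path.join(work_dir, 'config')
os.mkdir(source_dir)

with open(os.path.join(source_dir, 'settings.txt'), 'w') as settings_file:
    settings_file.write('example=1\n')

archive_path = make_archive(source_dir, work_dir)

# List the archive contents, as a caller might before extracting it.
with tarfile.open(archive_path, 'r:gz') as archive:
    member_names = archive.getnames()

print(member_names)
```

Because the archive stores paths under the folder name, a caller who extracts it on a server gets the folder back under the same name.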

### Install AWS CLI Remotely

In order to access our S3 buckets, we'll need to install the AWS CLI on each server. The version available from the Ubuntu repositories is out of date, so the script below installs it with `pip` instead. The script also assumes there is a `region.txt` file in the home directory which tells it which region to use by default.

In [None]:
%%writefile scripts/install_awscli.sh
#!/bin/bash

if [ "" != "$(uname -a | grep Ubuntu)" ]; then
    # Add mirror to avoid slow downloads

    APT_MIRROR="mirror://mirrors.ubuntu.com/mirrors.txt"
    APT_REPOSITORIES="main restricted universe multiverse"

    #sudo sed -i -e "s@[^ ]*ec2.archive.ubuntu.com[^ ]*@$APT_MIRROR@g" \
    #    /etc/apt/sources.list

    sudo apt-get update

    # Update Python SSL libraries

    sudo apt-get -y install gcc libffi-dev libssl-dev python-dev
else
    sudo yum -y update
    sudo yum -y install gcc libffi-devel openssl-devel python-devel

fi

# Install pip
wget --quiet https://bootstrap.pypa.io/get-pip.py
sudo -H python get-pip.py

sudo -H $(which pip) install --upgrade ndg-httpsclient
sudo -H $(which pip) install awscli

# Set defaults on AWS configuration

EC2_REGION=$(cat region.txt | cut -d'"' -f 4)

echo "

$EC2_REGION
json" | aws configure

Leverage our parallel script runner in order to install AWS CLI on all machines.

In [None]:
"""
Utility method which installs AWS CLI on all machines and configures its default
region to be the same as the region for the local machine.
"""
def install_awscli(user_name, host_names):
    global region

    with open('awscli/region.txt', 'w') as region_file:
        region_file.write(region)

    upload_file(user_name, host_names, 'awscli/region.txt')
    run_script(user_name, host_names, 'install_awscli.sh')

### Set Default Bucket

All future notebooks assume that you have access to an S3 bucket that stores files used by installers or files that need to be uploaded/downloaded during the installation process.

http://console.aws.amazon.com/s3/home

In order to allow it to be a variable in these installers, it's useful to be able to set it as an environment variable.

In [None]:
%%writefile scripts/set_bucket.sh
#!/bin/bash

touch $HOME/.profile

S3_BUCKET=$(cat bucket.txt)
echo >> $HOME/.profile
echo "# Added for AWS CLI" >> $HOME/.profile
echo "export S3_BUCKET=$S3_BUCKET" >> $HOME/.profile

Provide a utility method which sets the `S3_BUCKET` environment variable on all hosts, assuming it is in the correct region.

In [None]:
"""
Utility method which sets an environment variable specifying a bucket which can
be referenced in other scripts.
"""
def set_bucket(user_name, host_names, bucket_name):
    global region

    bucket_url = 'http://s3-%s.amazonaws.com/%s/' % (region, bucket_name)
    check_bucket_url = check_output(['curl', '-s', bucket_url])

    if check_bucket_url.find('PermanentRedirect') != -1:
        print('Bucket is not in region %s' % region)

    with open('awscli/bucket.txt', 'w') as bucket_file:
        bucket_file.write(bucket_name)

    upload_file(user_name, host_names, 'awscli/bucket.txt')
    run_script(user_name, host_names, 'set_bucket.sh')
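
The region check works because S3 answers a request sent to the wrong regional endpoint with an error whose code is `PermanentRedirect`. The sketch below isolates that test so it can be tried without network access. The name `describe_bucket_check` is hypothetical, and the sample bodies are abbreviated and hand-written rather than captured from S3.

```python
def describe_bucket_check(response_body, region):
    """Return a warning if the response indicates the bucket is elsewhere."""
    if 'PermanentRedirect' in response_body:
        return 'Bucket is not in region %s' % region

    return None

# Abbreviated, hand-written samples of the two kinds of response body.
redirect_body = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Error><Code>PermanentRedirect</Code></Error>'
)
other_body = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Error><Code>AccessDenied</Code></Error>'
)

print(describe_bucket_check(redirect_body, 'us-west-2'))
print(describe_bucket_check(other_body, 'us-west-2'))
```

Returning the message instead of printing it lets the caller decide whether to stop or to continue with a warning.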

## Convert Notebook to Script

The following cell will use `jupyter nbconvert` to build an `aws_util.py` which will be used in future notebooks in this series.

In [None]:
%%javascript
var script_file = 'aws_util.py';

var notebook_name = window.document.getElementById('notebook_name').innerHTML;
var nbconvert_command = 'jupyter nbconvert --stdout --to script ' + notebook_name;

var grep_command = "grep -v '^#' | grep -v -F get_ipython | sed '/^$/N;/^\\n$/D'";
var command = '!' + nbconvert_command + ' | ' + grep_command + ' > ' + script_file;

if (Jupyter.notebook.kernel) {
    Jupyter.notebook.kernel.execute(command);
}
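
The shell pipeline after `nbconvert` does three things: it drops lines that start with `#`, drops lines that contain `get_ipython`, and squeezes runs of blank lines into one. The sketch below approximates the same filter in Python to make those steps explicit. The name `filter_script` is hypothetical, and it treats only completely empty lines as blank.

```python
def filter_script(text):
    """Drop comment and get_ipython lines, then squeeze repeated blank lines."""
    kept = []
    previous_blank = False

    for line in text.splitlines():
        # Equivalent of: grep -v '^#' | grep -v -F get_ipython
        if line.startswith('#') or 'get_ipython' in line:
            continue

        # Equivalent of the sed expression: keep one blank line per run.
        is_blank = (line == '')

        if is_blank and previous_blank:
            continue

        kept.append(line)
        previous_blank = is_blank

    return '\n'.join(kept) + '\n'

# Hand-written sample resembling nbconvert's script output.
sample_script = (
    '# coding: utf-8\n'
    '\n'
    '# In[1]:\n'
    '\n'
    'import os\n'
    '\n'
    '\n'
    'get_ipython().system("ls")\n'
    '\n'
    'print(os.getcwd())\n'
)

print(filter_script(sample_script))
```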