# ML and AWS practical

You will have been sent an email with a login link to the Amazon console.

## Logging in and the console

Once logged in, you will be placed in the AWS management console.

<img src="aws1.png" />

A few things to note.
First, AWS is hosted in <a href="https://aws.amazon.com/about-aws/global-infrastructure/">multiple locations world-wide</a>.

<img src="aws2.png" />


From Amazon:
 > These locations are composed of Regions and Availability Zones. Each Region is a separate geographic area. Each Region has multiple, isolated locations known as Availability Zones. Amazon EC2 provides you the ability to place resources, such as instances, and data in multiple locations. Although rare, failures can occur that affect the availability of instances that are in the same location. If you host all your instances in a single location that is affected by such a failure, none of your instances would be available.
 
We will just be using one region today, Ireland (eu-west-1). You can see (and select) the region from the drop down list in the top right of the console.

Under 'all services' one can select which tool of AWS one wishes to use. We will restrict ourselves for today to Elastic Compute Cloud (EC2), Simple Storage Service (S3) and Elastic Map Reduce (EMR).

First we'll explore EC2, set up an 'instance' (virtual machine) and connect to it.

- Click on EC2.
- Click on 'Instances' from the left pane. You will see an empty table and a big button 'Launch Instance' - click it!

#### Step 1: Choose AMI

The first step is deciding what machine image you want on your new instance. An easy start is to use one of the images created by Amazon. <span style="background-color: #FFFF00">Select the **Ubuntu Server 18.04 LTS (HVM)** machine image</span>.
<img src="aws4.png" />
In the future one could use a custom image with software or data preinstalled, for example.

#### Step 2: Choose Instance Type

The second step is selecting the (virtual) hardware the instance will run on. <span style="background-color: #FFFF00">For this practical **we ask that you use the t2.nano, t2.micro, t2.small or t2.medium instance type**</span> (as we have increased the limit on this account to allow you all to start one).

Click <span style="background-color: #FFFF00">Next: Configure Instance Details.</span>

#### Step 3: Configure Instance Details

**There is no need to modify any of this.** But a couple of items of interest:

**Autoscaling**: In the future, even for ML problems, it can be helpful to configure a load-balancer and autoscaling to increase or decrease the number of instances depending on the demand.

**Spot pricing**: Typically AWS will not be using all its computational resources. To make use of this 'spare' hardware, AWS offers a service called 'spot pricing', which is typically considerably cheaper than the on-demand price but comes at the cost of an instance that might be terminated with two minutes' warning.

Click <span style="background-color: #FFFF00">Next: Add Storage</span>

#### Step 4: Add Storage

Simply leave it as an 8 GiB general purpose SSD.

A note here: There are many types of storage provided by AWS:

- Low cost, slow access: Amazon Glacier
- Elastic Block Store: This is the type of storage you need in your EC2 instance. (<a href="https://aws.amazon.com/ebs/features/#Amazon_EBS_volume_types">More info</a>) This comes in four flavours,
  - slowest/cheapest: sc1 (cold HDD)
  - still cheap: st1 (throughput-optimised HDD)
  - solid-state: gp2 (general purpose SSD)
  - fast/expensive: io1.
  
Click <span style="background-color: #FFFF00">Next: Add Tags</span>

#### Step 5: Add Tags

As everyone is using the same account it is very useful to **label your instance**. Here it might be worth adding your id so you can find it again. For example, click *Add Tag* then use Key = "email" and Value = your email. Without this you might struggle to find your instance again!

Click <span style="background-color: #FFFF00">Next: Configure Security Group</span>

#### Step 6: Configure Security Group

Security groups are how EC2 organises access to the instances you create. I've already created one called "justssh" which gives access to the SSH port from anywhere. Typically one would restrict this to be from just your IP address, for example. Feel free to either use a security group that already exists or create a new one. You'll need to be able to SSH into the server later.

Click Review and Launch.

#### Step 7: Review and Launch

When you click 'launch' you'll be asked to Select or create a key pair.

A quick detour, from ssh.com:

> Each SSH key pair includes two keys:

> A **public key** that is copied to the SSH server(s). Anyone with a copy of the public key can encrypt data which can then only be read by the person who holds the corresponding private key. Once an SSH server receives a public key from a user and considers the key trustworthy, the server marks the key as authorized in its authorized_keys file. Such keys are called authorized keys.

> A **private key** that remains (only) with the user. The possession of this key is proof of the user's identity. Only a user in possession of a private key that corresponds to the public key at the server will be able to authenticate successfully. The private keys need to be stored and handled carefully, and no copies of the private key should be distributed. The private keys used for user authentication are called identity keys.

Previously, when connecting to SHARC or ICEBERG you were using a username/password. In this case AWS will be using a key pair. Key pairs are typically more secure than passwords, make automation easier, and are the standard way to authenticate over SSH.
<img src="aws6.png" />

So <span style="background-color: #FFFF00">click 'create a new key pair' and enter a **unique** key pair name, e.g. "mtsmithsecretkey".</span> Then click 'download key pair'. You'll receive a file called "mtsmithsecretkey.pem" (or whatever). You'll need this to SSH into this new instance. Depending on your operating system there are a few different things to do to achieve this (see next section).

Finally click 'Launch Instances'.

#### SSHing into your new instance

<img src="aws7.png" />
You'll be shown a summary stating your instances are launching. Either click the link to the instance
or return to the <a href="https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#Instances:sort=instanceId">list of instances</a> and filter by the tag email address you entered.

You might need to wait a few seconds while the instance starts.

Click on the instance, then right click (or use the button at the top) and select "Connect". This will give instructions on how to SSH in. On Linux and macOS you simply need to make the key file readable only by you, `chmod 400 filename.pem`, then ssh:

    ssh -i "mtsmithsecretkey.pem" ubuntu@ec2-34-243-65-51.eu-west-1.compute.amazonaws.com
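
For reference, `chmod 400` leaves the file readable by its owner and by nobody else, which is what `ssh` expects of a private key. The sketch below does the same thing from Python on a throwaway temporary file (a stand-in for your `.pem` file, not a real key) and prints the resulting mode. It assumes a Linux or macOS machine; the function name `make_owner_read_only` is made up for this example.

```python
import os
import stat
import tempfile

def make_owner_read_only(path):
    """Equivalent of `chmod 400 path`: the owner may read, nobody may write or execute."""
    os.chmod(path, stat.S_IRUSR)
    # Return just the permission bits of the file's mode.
    return stat.S_IMODE(os.stat(path).st_mode)

# A temporary file stands in for the downloaded key (hypothetical content).
with tempfile.NamedTemporaryFile(suffix=".pem", delete=False) as f:
    f.write(b"not a real key")
    demo_path = f.name

mode = make_owner_read_only(demo_path)
print(oct(mode))  # the permission bits, in octal
os.remove(demo_path)
```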

On Windows things are more complicated. <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html?icmpid=docs_ec2_console">AWS has instructions on how to get this working with PuTTY</a>. Mike Croucher recommended <a href="https://mobaxterm.mobatek.net/">MobaXterm</a>. This works by default with the pem file. See <a href="https://angus.readthedocs.io/en/2016/amazon/log-in-with-mobaxterm-win.html">these instructions</a> if you need more help.

If you can't get that working there are <a href="https://www.bitvise.com/tunnelier">alternative SSH clients</a>.

# Elastic Map Reduce (EMR)

We'll now set up a cluster to run a simple Recommender system (a little like the one you set up earlier in the course). Here we're not interested in the algorithm, more in some of the aspects of the cloud.

Click on Services, and find "<a href="https://eu-west-1.console.aws.amazon.com/elasticmapreduce/home?region=eu-west-1#">EMR</a>" (under analytics), then click **Create cluster**. **Then choose Advanced Options.**

#### Step 1: Software and Steps
Here we get to select what software will be preinstalled on the cluster's nodes. In this case we just want hadoop and spark:
<img src="aws8.png" />

#### Step 2: Hardware
<span style="background-color: #FFFF00">**Edit the Master and Core to use the 'r5a.xlarge' instance type**</span> (we've increased the AWS limits on this type. You'll find that using other instance types will probably fail). Select about 3 instances for the core node set. You can save us money if you select "Spot" pricing, but there is a small risk your instances will get terminated by AWS during a spike in demand.
<img src="aws9.png" />

#### Step 3: General Cluster Settings
Name your cluster something sensible <span style="background-color: #FFFF00">and unique</span>, maybe the first part of your email address: "m.t.smith.cluster". Uncheck Logging, Debugging and Termination protections - as we won't be using these features. It might be worth adding the same tag you gave your instance before so you can find it.
<img src="aws10.png" />

#### Step 4: Security
Select the EC2 keypair you created previously. You could also try disabling 'cluster visible to all IAM users in account' to avoid clutter for other users. The default security groups are fine. Click **Create cluster**.

<img src="aws11.png" />

(note, I've added inbound SSH to the default security groups here - you'll need to do this if you work with this in the future).

#### Connecting

You will be able to select the cluster from the EMR <a href="https://eu-west-1.console.aws.amazon.com/elasticmapreduce/home?region=eu-west-1#">clusters page</a>, and click on the SSH button. You can ssh in the same way as before. Please be patient, it takes a while to get the cluster spun up. Initially ssh won't be available, and even when it is pyspark etc might not be installed when you first arrive - wait a minute and try again!
    
<img src="aws12.png" />

## The AWS command line interface

We now have a cluster running and ready for our commands.

Let's try out the AWS command line interface. It is often the case that one will want to perform a console operation repeatedly or describe a complex action programmatically to make it easy to see what has been done. To this end Amazon provide both a command line tool and an API. All the actions you can do with the console can be done via these alternative interfaces. As we're now logged into a machine that's got the AWS CLI installed, we could quickly try it out.

#### s3 access

First we're going to look at Amazon's Simple Storage Service (S3). This has a <a href="https://s3.console.aws.amazon.com/s3/home?region=eu-west-1#">web interface</a>, but we'll be accessing it using the command line. On your new instance, type:

    aws s3 ls
    
This will list the bucket(s). To list the contents of the bucket:

    aws s3 ls ml10m100k
    
Then to copy the data to our instance,

    aws s3 cp s3://ml10m100k . --recursive

<img src="aws13.png" />

#### ec2 via the command line

Also try `aws ec2 describe-instances` to list all the instances currently in this account.

The command line interface allows you to also create and destroy instances, configure users, launch clusters, and do any other operation that you can also do with the web interface. You may find you don't have access to some of these commands however.

Be careful, you can terminate your fellow students' instances as you are all in the same account and have sufficient permissions. For more help on a particular aspect, you can type something like,

    aws ec2 help
    
## Control via the API
    
We can also access AWS using the API. While still logged into the cluster, we need to install boto3, a library for accessing the AWS API from Python. On the command line, type,

    sudo pip install boto3

Then start a python terminal,

    python
    
And then import boto3,

    import boto3
    
We can then start a session connecting to the AWS API to control ec2,

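    sess = boto3.client('ec2')
    reservations = sess.describe_instances()['Reservations']

    # each reservation holds one or more instances
    for res in reservations:
        for j in res['Instances']:
            # instances that aren't running have no public IP, so use .get()
            print(j['InstanceType'], j.get('PublicIpAddress'))
            if 'Tags' in j: print(j['Tags'])

This will print out the instance type and public IP address of every instance, along with its tags (if it has any).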
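
If you want to pick out just your own instance, one option is to filter on the tag you added earlier. The sketch below is a rough illustration rather than a finished tool: it runs on a small made-up sample shaped like `describe_instances()['Reservations']`, so it works without AWS access. The function name `instances_with_tag`, the sample IP address and the email address are all invented for the example.

```python
def instances_with_tag(reservations, key, value):
    """Return (instance type, public IP or None) for instances carrying the given tag."""
    matches = []
    for reservation in reservations:
        for instance in reservation["Instances"]:
            # Tags arrive as a list of {"Key": ..., "Value": ...} dictionaries.
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if tags.get(key) == value:
                # Instances that aren't running have no public IP, hence .get().
                matches.append((instance["InstanceType"], instance.get("PublicIpAddress")))
    return matches

# Made-up sample in the shape of describe_instances()["Reservations"].
sample_reservations = [
    {"Instances": [{"InstanceType": "t2.micro", "PublicIpAddress": "192.0.2.10",
                    "Tags": [{"Key": "email", "Value": "m.t.smith@example.com"}]}]},
    {"Instances": [{"InstanceType": "t2.small"}]},
]

mine = instances_with_tag(sample_reservations, "email", "m.t.smith@example.com")
print(mine)
```

On the cluster you could pass in the real `reservations` list from the boto3 call instead of the sample.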

# Using pyspark to make a recommendation

We need to put the datafiles we're going to use within reach of hadoop. For this we simply run

    hadoop fs -put *.dat /
    
This copies the data to the <a href="https://www.aosabook.org/en/hdfs.html">Hadoop Distributed File System</a>.
Type `pyspark` to open the interactive python/spark command line.

We load the ratings data from the hadoop file system. This is a :: separated table of ratings. The lambda/map functions are to split by these :: and assign the types to the three columns.

    from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
    data = sc.textFile('/ratings.dat')
    ratings = data.map(lambda l: l.split('::')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
    
This will be very quick as spark only runs things when it needs to (hence none of the above is run until the following lines are run...)

Let's ask for the first item, to check it's working:
    
    ratings.first()

    > Rating(user=1, product=122, rating=5.0)                                         
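
To see what the two `map` steps do to a single line, here is a plain-Python sketch that runs without Spark. The `Rating` namedtuple is a stand-in for pyspark's class, and the sample line is an illustrative assumption about the file's layout (user, item and rating separated by `::`, with any further fields ignored).

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating, so this runs without Spark.
Rating = namedtuple("Rating", ["user", "product", "rating"])

def parse_rating(line):
    """Split on '::' and convert the first three fields to int, int, float."""
    fields = line.split("::")
    return Rating(int(fields[0]), int(fields[1]), float(fields[2]))

sample_line = "1::122::5::838985046"  # illustrative line; the fourth field is ignored
parsed = parse_rating(sample_line)
print(parsed)
```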

Let's set up the model:

    test, train = ratings.randomSplit(weights=[0.3, 0.7], seed=1)
    rank = 10
    numIterations = 10
    model = ALS.train(train, rank, numIterations)
    
    testdata = test.map(lambda p: (p[0],p[1]))
    predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
    ratesAndPreds = test.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
    MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
    print("Mean Squared Error = " + str(MSE))    
    
    > 0.7009719027724838
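
The join-then-average logic is easier to follow on a toy example. This is a plain-Python sketch of the same idea, not the Spark code itself: dictionaries keyed by `(user, product)` play the role of the paired RDDs, and intersecting the keys plays the role of `.join()`. The ratings below are invented for illustration.

```python
def mean_squared_error(actual, predicted):
    """Both arguments map (user, product) -> rating; only shared keys are compared."""
    shared = actual.keys() & predicted.keys()  # plays the role of .join()
    squared_errors = [(actual[k] - predicted[k]) ** 2 for k in shared]
    return sum(squared_errors) / len(squared_errors)

# Toy numbers, invented for illustration.
toy_actual = {(1, 10): 4.0, (1, 20): 3.0, (2, 10): 5.0}
toy_predicted = {(1, 10): 3.5, (1, 20): 3.0, (2, 10): 4.0, (3, 30): 2.0}

toy_mse = mean_squared_error(toy_actual, toy_predicted)
print("Mean Squared Error = " + str(toy_mse))
```

The prediction for `(3, 30)` has no matching rating, so it drops out, just as an unmatched key would in the join.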
    
Finally, a quick look at the MSE on the training data:

    traindata = train.map(lambda p: (p[0],p[1]))
    predictionstrain = model.predictAll(traindata).map(lambda r: ((r[0], r[1]), r[2]))
    ratesAndPreds = train.map(lambda r: ((r[0], r[1]), r[2])).join(predictionstrain)
    MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
    print("Mean Squared Error = " + str(MSE))
    
How do these compare?

## Additional exercises

A global network of air pollution sensors is aggregated by openAQ. They archive this data on s3 (see here: https://openaq-data.s3.amazonaws.com/index.html)

We can copy one of these csv files to our instance from the command line using:

    aws s3 cp s3://openaq-data/2019-01-01.csv .
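
An `s3://` address is just a bucket name followed by the object's key (its path within the bucket). As a small sketch, the standard library can pull the two apart; the helper name `split_s3_uri` is made up for this example.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("expected an s3:// URI, got " + uri)
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_uri("s3://openaq-data/2019-01-01.csv")
print(bucket, key)
```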

Tip: This might be faster if the instance were in the same region as the data!

Or, from inside pyspark, you can read the data directly:

    df = spark.read.csv("s3a://openaq-data/2019-01-01.csv")
    
This could be improved by treating the headers properly for a start. Add options to do this (hint: set `header=True`).

Feel free to try analysing this data. For example, count the number of measurements in Uganda.

Hint: the 'country' column would equal 'UG'...

    df[df['country']=='UG'].count()
    
Note that without the `.count()` the result is immediate - spark doesn't do any computation until it has to.
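
Plain Python has a similar flavour of laziness, which makes for a quick analogy (this is not how Spark is implemented, just the same idea in miniature). In the sketch below, building the `map`/`filter` pipeline does no work; only consuming it, the analogue of `.count()`, triggers the parsing. The data rows are invented.

```python
calls = []

def slow_parse(line):
    calls.append(line)  # record that real work happened
    return line.split(",")

lines = ["UG,pm25,12", "GB,pm25,7", "UG,no2,30"]

# Building the pipeline does no work: map() and filter() are lazy in Python 3.
pipeline = filter(lambda row: row[0] == "UG", map(slow_parse, lines))
calls_before = len(calls)

# Consuming the pipeline (the analogue of .count()) forces the computation.
n_ug = sum(1 for _ in pipeline)
calls_after = len(calls)

print(calls_before, n_ug, calls_after)
```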

You'll find the operation is far quicker the second time thanks to caching. Also look at `.persist()` <a href="https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence">here</a>.

#### Starting and stopping EC2 instances from the command line

Example code, modify to start an instance yourself.

    aws ec2 run-instances --image-id ami-xxxxxxxx --count 1 --instance-type t2.micro --key-name MyKeyPair --security-group-ids sg-903004f8 --subnet-id subnet-6e7f829e

Think about where you can get the security-group-id (hint: either from the web console or, for bonus credit, from the command line interface with `aws ec2 describe-security-groups`).
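
The CLI usually prints its results as JSON, so looking up an ID by group name is a short exercise in parsing. The sketch below works on a trimmed, made-up sample of `describe-security-groups` output (real output has many more fields per group, and the IDs here are invented); `group_id_for` is a hypothetical helper name.

```python
import json

def group_id_for(describe_output, group_name):
    """Look up a security group's ID by name in describe-security-groups JSON."""
    for group in json.loads(describe_output)["SecurityGroups"]:
        if group["GroupName"] == group_name:
            return group["GroupId"]
    return None

# Trimmed, made-up output; real responses carry many more fields per group.
sample_output = """
{"SecurityGroups": [
    {"GroupName": "default", "GroupId": "sg-00000000000000001"},
    {"GroupName": "justssh", "GroupId": "sg-00000000000000002"}
]}
"""

ssh_group = group_id_for(sample_output, "justssh")
print(ssh_group)
```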

## Lambda Demo

This exercise is <span style="background-color: #FFFF00">**currently unavailable due to permission issues**</span>.

If you've time you could try using Lambda. This is a way of writing code that runs when triggered by a particular event, without needing to set up your own server.

### Lambda

1. Click on Services and find Lambda under 'Compute'.
2. Click 'Create a function'
3. Then 'Author from scratch', and select:
    - A Function name (e.g. demofunction)
    - Then runtime, I used python 3.7.
4. You'll want to 'Use an existing role' and select 'service-role/testfunction-role-np5tat7d'. This gives the lambda function basic permission to upload logs to Amazon CloudWatch Logs. You won't be able to create a new role due to the restrictions placed on the student accounts we created.

<img src='aws14.png' />

5. Click Create function!

### The Lambda interface

This page is quite complicated. Make sure you're in the 'Configuration' section. 

On the left of that panel you can select triggers - these are things that will trigger your lambda code to run. Why not choose S3?

- Select your bucket as the trigger.
- Leave it as "All object creation events"
- Click Add.
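
Assuming the trigger is an S3 'object created' event (as the bucket selection above suggests), the function receives an `event` dictionary describing the new objects. A minimal sketch of a handler that just reports what was created might look like the following. The sample event is made up and heavily trimmed so the handler can be tried locally, and the bucket and file names are hypothetical.

```python
def lambda_handler(event, context):
    """Return the s3:// addresses of the objects named in an S3 event notification."""
    created = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        created.append("s3://" + bucket + "/" + key)
    print(created)  # in Lambda, printed output goes to CloudWatch Logs
    return {"created": created}

# Made-up, heavily trimmed event for local testing.
sample_event = {"Records": [{"s3": {"bucket": {"name": "2018com6012demo"},
                                     "object": {"key": "hello.txt"}}}]}
result = lambda_handler(sample_event, None)
```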

At the moment there might be a permission issue with this activity.

## s3 Bucket
Visit the s3 page (via the Services tab). You'll be able to see the buckets we have, and create a new one.

<img src="aws15.png" />

<span style="background-color: #FFFF00">**Bucket names must be unique across all of Amazon S3 for everyone! So please name all your buckets in the form <span style="color: #444444">2018com6012yourbucketname</span> to try and avoid polluting the name space.**</span> You'll also need the region to be the same as the one you're going to be running the lambda function in (keep it in Ireland). Click Next, leave all the other settings unchanged, then click Create Bucket!

You'll be able to see the bucket in the list now.
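
If you create buckets from a script, a quick sanity check on the name can save a failed request. This sketch adds the course prefix and applies a simplified version of the naming rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit). S3 enforces further rules that are not checked here, and `course_bucket_name` is a hypothetical helper.

```python
import re

# Simplified pattern: 3-63 characters, lowercase letters, digits, dots and hyphens,
# starting and ending with a letter or digit. S3 has further rules not checked here.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def course_bucket_name(suffix, prefix="2018com6012"):
    """Prefix the suffix, lowercase it, and reject names that look invalid."""
    name = (prefix + suffix).lower()
    if not _BUCKET_RE.match(name):
        raise ValueError("not a valid-looking bucket name: " + name)
    return name

print(course_bucket_name("mtsmithdemo"))
```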