### MACS 30113 Lab Session: Working with EMR Clusters/Spark

### Today's Lab Agenda

1. Walk through the Spark ML notebook on EMR (the same workflow as Assignment 8)
2. Understand the different feature types in this particular dataset (e.g., what is a good example of a spatial feature? A datetime feature?)

### Create S3 Bucket

First, create an S3 bucket to store our files.

In [9]:
import boto3

In [10]:
# Initialize boto3 handler
s3 = boto3.resource('s3')
iam_client = boto3.client('iam')
role = iam_client.get_role(RoleName='LabRole')

# Create a new bucket to store your files
# (bucket names are globally unique, so replace this name with your own)
BUCKETNAME = 'rei-example-bucket'
s3.create_bucket(Bucket=BUCKETNAME)

# This is what we will use to interface with the specific bucket
bucket = s3.Bucket(BUCKETNAME)
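
As a minimal sketch of how the `bucket` handle might be used, the helpers below upload a local file and list the keys stored in the bucket. The function names, the file name `data.csv`, and the key are hypothetical and only for illustration; `upload_file` and `objects.all()` are methods of a boto3 `Bucket` resource.

```python
def upload_to_bucket(bucket, local_path, key):
    """Upload a local file to the bucket under the given key."""
    bucket.upload_file(Filename=local_path, Key=key)
    return key


def list_bucket_keys(bucket):
    """Return the keys of all objects currently in the bucket."""
    return [obj.key for obj in bucket.objects.all()]


# Example usage (hypothetical file name; requires AWS credentials):
# upload_to_bucket(bucket, 'data.csv', 'data/data.csv')
# print(list_bucket_keys(bucket))
```

Passing `bucket` in as an argument keeps the helpers independent of the bucket name chosen above.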

### Launching EMR Cluster

Next, launch an EMR cluster from the terminal (bash).

In [11]:
%%bash 
# NOTE: replace the bucket name in --configurations below with your own bucket
aws emr create-cluster \
    --name "Spark Cluster" \
    --release-label "emr-6.2.0" \
    --applications Name=Hadoop Name=Hive Name=JupyterEnterpriseGateway Name=JupyterHub Name=Livy Name=Pig Name=Spark Name=Tez \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --region us-east-1 \
    --ec2-attributes '{"KeyName": "vockey"}' \
    --configurations '[{"Classification": "jupyter-s3-conf", "Properties": {"s3.persistence.enabled": "true", "s3.persistence.bucket": "rei-example-bucket"}}]'


{
    "ClusterId": "j-3SJQZ7GWR6GDP",
    "ClusterArn": "arn:aws:elasticmapreduce:us-east-1:014303519904:cluster/j-3SJQZ7GWR6GDP"
}
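
A new cluster typically needs several minutes before it is ready. One way to check on it from the notebook is sketched below, assuming the `ClusterId` printed above. The helper names are hypothetical; `describe_cluster` and the `cluster_running` waiter belong to the boto3 EMR client.

```python
def get_cluster_state(emr_client, cluster_id):
    """Return the cluster's current state, e.g. 'STARTING' or 'WAITING'."""
    response = emr_client.describe_cluster(ClusterId=cluster_id)
    return response['Cluster']['Status']['State']


def wait_until_running(emr_client, cluster_id):
    """Block until the cluster is up, then return its state."""
    emr_client.get_waiter('cluster_running').wait(ClusterId=cluster_id)
    return get_cluster_state(emr_client, cluster_id)


# Example usage (requires AWS credentials; use your own ClusterId):
# import boto3
# emr = boto3.client('emr', region_name='us-east-1')
# print(wait_until_running(emr, 'j-3SJQZ7GWR6GDP'))
```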


### Start the Interactive Sessions

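JupyterHub runs on the cluster itself. To work with it interactively, forward its port to your machine over SSH, replacing the host name below with your own cluster's public DNS name: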
```
$ ssh -i "labsuser.pem" -NL 9443:localhost:9443 hadoop@ec2-50-17-49-119.compute-1.amazonaws.com
```
This forwards port 9443 on the cluster to your local machine, so you can open `https://localhost:9443` and log in with username `jovyan` and password `jupyter`.
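
The host name in the `ssh` command differs for every cluster. The sketch below shows one way to look it up and assemble the tunnel command, assuming the same key file and port as above. Both helper names are hypothetical; `MasterPublicDnsName` is a field of the boto3 `describe_cluster` response and is only populated once the cluster has started.

```python
def get_master_dns(emr_client, cluster_id):
    """Return the public DNS name of the cluster's master node."""
    response = emr_client.describe_cluster(ClusterId=cluster_id)
    return response['Cluster']['MasterPublicDnsName']


def build_tunnel_command(dns_name, key_path='labsuser.pem', port=9443):
    """Build the ssh port-forwarding command for the given host."""
    return f'ssh -i "{key_path}" -NL {port}:localhost:{port} hadoop@{dns_name}'


# Example usage (requires AWS credentials; use your own ClusterId):
# import boto3
# emr = boto3.client('emr', region_name='us-east-1')
# print(build_tunnel_command(get_master_dns(emr, 'j-3SJQZ7GWR6GDP')))
```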

### Remember to shut down EMR and clean the bucket
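
A minimal sketch of the cleanup, assuming the `bucket` object created earlier and the `ClusterId` returned at launch. The function name `shut_down` is hypothetical; `terminate_job_flows`, `objects.all().delete()`, and `bucket.delete()` are boto3 calls. Deleting the bucket is permanent, so copy out anything you want to keep first.

```python
def shut_down(emr_client, bucket, cluster_id):
    """Terminate the EMR cluster, then empty and delete the S3 bucket."""
    # Stop the cluster so it no longer accrues charges
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
    # A bucket must be empty before it can be deleted
    bucket.objects.all().delete()
    bucket.delete()


# Example usage (requires AWS credentials; use your own ClusterId):
# import boto3
# emr = boto3.client('emr', region_name='us-east-1')
# shut_down(emr, bucket, 'j-3SJQZ7GWR6GDP')
```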