# Overview

To use Spark in a cluster environment, you first need to spin up a cluster and then attach a notebook to it.

## Dependencies

In [8]:
import boto3
import os
from configparser import ConfigParser

## Constants

### Credentials

In [14]:
PROFILE_ID='default'
config_object = ConfigParser()
config_object.read("/home/jovyan/.aws/credentials")
profile_info = config_object[PROFILE_ID]

ACCESS_KEY = profile_info.get('aws_access_key_id')
SECRET_KEY = profile_info.get('aws_secret_access_key')
AWS_SESSION_TOKEN = profile_info.get('aws_session_token')
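The cell above assumes the credentials file and the profile both exist. As a minimal sketch, the same lookup can be wrapped in a helper that fails with a clear message when either is missing. The helper name `load_aws_profile` and the placeholder values in the usage part are hypothetical, not part of the original notebook.

```python
import os
import tempfile
from configparser import ConfigParser


def load_aws_profile(credentials_path, profile="default"):
    """Return the credential fields of one profile as a plain dict."""
    parser = ConfigParser()
    # ConfigParser.read() silently skips missing files, so check its result.
    if not parser.read(credentials_path):
        raise FileNotFoundError(f"No credentials file at {credentials_path}")
    if not parser.has_section(profile):
        raise KeyError(f"Profile {profile!r} not found in {credentials_path}")
    section = parser[profile]
    return {
        "aws_access_key_id": section.get("aws_access_key_id"),
        "aws_secret_access_key": section.get("aws_secret_access_key"),
        # Only present for temporary credentials; None otherwise.
        "aws_session_token": section.get("aws_session_token"),
    }


# Usage against a throwaway file holding placeholder (not real) values.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "credentials")
    with open(demo_path, "w") as fh:
        fh.write(
            "[default]\n"
            "aws_access_key_id = AKIAEXAMPLE\n"
            "aws_secret_access_key = example-secret\n"
        )
    demo_profile = load_aws_profile(demo_path)
```

In the notebook itself the call would presumably be `load_aws_profile("/home/jovyan/.aws/credentials", PROFILE_ID)`.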

### Cluster Args

In [3]:
CLUSTER_NAME="jvs-fraud"
LOG_URI="s3://aws-logs-119657064844-us-east-1/"
RELEASE_LABEL="emr-5.32.0" # Max version for use with Notebooks
NUM_CORE_NODES=6
NUM_INSTANCES=1 + NUM_CORE_NODES # 1 master + N core nodes

# Must use AMD EPYC instances, i.e. m5a.*
INSTANCE_TYPE="m5a.xlarge"  # $0.172/hr, 4 vCPU, 16 GB RAM

EC2_KEY="jvs_ec2_key"
EC2_SUBNET="subnet-171c8c4b"
REGION="us-east-1"

# This is where we want our notebook data to persist
BUCKET_NAME="jvs-fraud-research"
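The per-instance price noted in the comment above gives a rough estimate of what the whole cluster costs while it is running. The sketch below is only that arithmetic. It assumes the $0.172/hr figure is still current and covers EC2 only; the EMR service charge is billed on top and is not included. The function name `cluster_cost` is hypothetical.

```python
NUM_CORE_NODES = 6
NUM_INSTANCES = 1 + NUM_CORE_NODES  # 1 master + 6 core nodes = 7
EC2_PRICE_PER_HOUR = 0.172  # USD per m5a.xlarge, taken from the comment above


def cluster_cost(hours, instances=NUM_INSTANCES, price=EC2_PRICE_PER_HOUR):
    """Estimated EC2 cost in USD for running the cluster for `hours`."""
    return round(hours * instances * price, 2)


hourly_cost = cluster_cost(1)   # 7 * 0.172 = 1.204, rounded to 1.2
workday_cost = cluster_cost(8)  # 8 * 7 * 0.172 = 9.632, rounded to 9.63
```

Since the cluster is created with `KeepJobFlowAliveWhenNoSteps` set to `True`, it keeps accruing this cost until it is terminated explicitly.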

# Create EMR Cluster

## Create EC2 Key Pair and PEM File

1. Go to the [Amazon EC2 console](https://console.aws.amazon.com/ec2/home).
2. Click **Key Pairs**.
3. Click **Create key pair**.
4. Enter a name for the key pair (this is the value to use for `EC2_KEY`).
5. Save the PEM file locally.

## Using AWS CLI
If you prefer to start the cluster with the AWS CLI, set the `USE_CLI` constant below to `True`.

In [19]:
USE_CLI=False

### Start Cluster

Build the CLI command:

In [22]:
if USE_CLI:
    make_cluster_cmd = f"""
        aws emr create-cluster \\
        --release-label {RELEASE_LABEL} \\
        --applications Name=Spark \\
        --instance-type {INSTANCE_TYPE} \\
        --instance-count {NUM_INSTANCES} \\
        --name {CLUSTER_NAME} \\
        --log-uri {LOG_URI} \\
        --use-default-roles \\
        --ec2-attributes KeyName={EC2_KEY},SubnetId={EC2_SUBNET} \\
        --region {REGION}
    """
    print(make_cluster_cmd)

Run the command and **create the cluster**

In [23]:
if USE_CLI:
    !{make_cluster_cmd}
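Outside a notebook the `!` shell escape is not available. As a sketch of an alternative, the same command can be built as an argument list and passed to `subprocess.run`, which also avoids shell quoting issues. The function name `build_create_cluster_args` is hypothetical; the flags and values are the ones already used in this notebook.

```python
import shlex


def build_create_cluster_args(name, release_label, instance_type, instance_count,
                              log_uri, key_name, subnet_id, region):
    """Return the `aws emr create-cluster` call as a list of arguments."""
    return [
        "aws", "emr", "create-cluster",
        "--release-label", release_label,
        "--applications", "Name=Spark",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--name", name,
        "--log-uri", log_uri,
        "--use-default-roles",
        "--ec2-attributes", f"KeyName={key_name},SubnetId={subnet_id}",
        "--region", region,
    ]


cli_args = build_create_cluster_args(
    name="jvs-fraud",
    release_label="emr-5.32.0",
    instance_type="m5a.xlarge",
    instance_count=7,
    log_uri="s3://aws-logs-119657064844-us-east-1/",
    key_name="jvs_ec2_key",
    subnet_id="subnet-171c8c4b",
    region="us-east-1",
)

# Printable form of the command, for review before running it.
printable_cmd = shlex.join(cli_args)

# To actually create the cluster (requires the AWS CLI and valid credentials):
# import subprocess
# result = subprocess.run(cli_args, check=True, capture_output=True, text=True)
```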

### Terminate Cluster

Copy the cluster ID printed above into the cell below, then run the cell to **terminate the cluster**.

In [24]:
cluster_id = "j-3GTMZ6TB4TEEM"
if USE_CLI:
    !aws emr terminate-clusters --cluster-ids {cluster_id} --region {REGION}
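Copying the cluster ID by hand can be avoided by parsing the output of `aws emr create-cluster`. The sketch below assumes the CLI is configured for JSON output (its default); the helper name `parse_cluster_id` is hypothetical, and the sample string is an illustration with a placeholder account number, not captured output.

```python
import json


def parse_cluster_id(cli_output):
    """Extract the cluster ID from the JSON printed by `aws emr create-cluster`."""
    payload = json.loads(cli_output)
    return payload["ClusterId"]


# Illustrative output only; the ARN uses a placeholder account number.
sample_output = """
{
    "ClusterId": "j-3GTMZ6TB4TEEM",
    "ClusterArn": "arn:aws:elasticmapreduce:us-east-1:123456789012:cluster/j-3GTMZ6TB4TEEM"
}
"""

parsed_cluster_id = parse_cluster_id(sample_output)
```

In a notebook, the output can be captured with `output = !{make_cluster_cmd}` and joined into one string with `"\n".join(output)` before parsing.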

## Using Python - boto3

### Create a Bucket
First create an S3 client. The bucket is only created if it doesn't already exist.

In [25]:
s3 = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=AWS_SESSION_TOKEN
)

Create the bucket if it doesn't exist:

In [28]:
if BUCKET_NAME not in [val['Name'] for val in s3.list_buckets()['Buckets']]:
    s3.create_bucket(Bucket=BUCKET_NAME)
else:
    print(f'Bucket s3://{BUCKET_NAME} exists')

Bucket s3://jvs-fraud-research exists
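The `create_bucket` call above works without a location because this notebook uses `us-east-1`. In any other region S3 expects a `CreateBucketConfiguration` with a matching `LocationConstraint`, while `us-east-1` must omit it. A small sketch of building the arguments either way (the helper name `create_bucket_kwargs` is hypothetical):

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for `s3.create_bucket` for the given region."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed explicitly.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


default_region_kwargs = create_bucket_kwargs("jvs-fraud-research", "us-east-1")
other_region_kwargs = create_bucket_kwargs("jvs-fraud-research", "eu-west-1")

# With the client from the cell above:
# s3.create_bucket(**create_bucket_kwargs(BUCKET_NAME, REGION))
```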


### Create Cluster

https://stackoverflow.com/questions/26314316/how-to-launch-and-configure-an-emr-cluster-using-boto

In [30]:
emr = boto3.client('emr', region_name=REGION)

response = emr.run_job_flow(
    Name=CLUSTER_NAME,
    LogUri=LOG_URI,
    ReleaseLabel=RELEASE_LABEL,
    Instances={
        'MasterInstanceType': INSTANCE_TYPE,
        'SlaveInstanceType': INSTANCE_TYPE,
        'InstanceCount': NUM_INSTANCES,
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
        'Ec2KeyName': EC2_KEY,
        'Ec2SubnetId': EC2_SUBNET
    },
    Applications=[
        {'Name': 'Spark'},
        {'Name': 'Hadoop'},
        {'Name': 'Livy'},
        {'Name': 'JupyterEnterpriseGateway'}
    ],
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole'
)
print(f"EMR ID: {response['JobFlowId']}")

EMR ID: j-3QXDA47I2Y20B
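`run_job_flow` returns as soon as the request is accepted, while the cluster itself takes several minutes to provision. One way to block until it is usable is the boto3 `cluster_running` waiter, sketched below. The function name `wait_until_running` and the delay and attempt values are assumptions chosen for illustration.

```python
def wait_until_running(emr_client, cluster_id, delay=30, max_attempts=60):
    """Block until the cluster is up, then return its current state.

    `emr_client` is a boto3 EMR client, e.g. boto3.client('emr', region_name=...).
    The waiter polls every `delay` seconds, up to `max_attempts` times, and
    raises an error if the cluster terminates or the attempts run out.
    """
    waiter = emr_client.get_waiter("cluster_running")
    waiter.wait(
        ClusterId=cluster_id,
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
    description = emr_client.describe_cluster(ClusterId=cluster_id)
    return description["Cluster"]["Status"]["State"]


# With the objects from the cell above:
# state = wait_until_running(emr, response['JobFlowId'])
# print(f"Cluster state: {state}")
```

A matching `cluster_terminated` waiter can be used the same way after terminating the cluster.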


Once the EMR cluster is up and running, go to the [EMR Notebooks page](https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1#notebooks-list:) to create (or open) a notebook to work from.

### Terminate Cluster

In [56]:
emr.terminate_job_flows(JobFlowIds=[response['JobFlowId']])

{'ResponseMetadata': {'RequestId': 'b02a5a98-824c-4290-9915-2e1bbdd7431d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b02a5a98-824c-4290-9915-2e1bbdd7431d',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Fri, 18 Dec 2020 22:59:44 GMT'},
  'RetryAttempts': 0}}

## Storing Configs

https://towardsdatascience.com/getting-started-with-pyspark-on-amazon-emr-c85154b6b921

## Notebook Configuration

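In an EMR notebook, the `%%configure` magic sets Spark session options such as executor memory. The `-f` flag forces the session to be re-created with the new settings: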
```
%%configure -f
{"executorMemory":"4G"}
```
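The body of a `%%configure` cell is plain JSON, so it can be generated rather than typed by hand. The sketch below builds such a body in Python. The helper name `sparkmagic_config` is hypothetical, and the core count and shuffle-partition values are illustrative assumptions, not tuned recommendations for this cluster.

```python
import json


def sparkmagic_config(executor_memory="4G", executor_cores=None,
                      driver_memory=None, spark_conf=None):
    """Return a JSON string suitable as the body of a `%%configure` cell."""
    body = {"executorMemory": executor_memory}
    if executor_cores is not None:
        body["executorCores"] = executor_cores
    if driver_memory is not None:
        body["driverMemory"] = driver_memory
    if spark_conf:
        # Arbitrary Spark properties go under the "conf" key.
        body["conf"] = dict(spark_conf)
    return json.dumps(body)


# Illustrative values only.
cell_text = "%%configure -f\n" + sparkmagic_config(
    executor_cores=2,
    spark_conf={"spark.sql.shuffle.partitions": "48"},
)
```

Calling `sparkmagic_config()` with no arguments reproduces the settings from the cell above.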