# Overview

To use Spark in a cluster environment, you first need to spin up a cluster and then attach a notebook to it.

## Dependencies

In [64]:
import boto3

# Create EMR Cluster

In [86]:
CLUSTER_NAME="jvs-fraud"
LOG_URI="s3://aws-logs-119657064844-us-east-1/"
RELEASE_LABEL="emr-5.32.0" # Max version for use with Notebooks
NUM_CORE_NODES=2
NUM_INSTANCES=1 + NUM_CORE_NODES # 1 master + N core nodes

# Must use AMD EPYC instances, i.e. m5a.*
INSTANCE_TYPE="m5a.xlarge"  # $0.172/hr 4core-16GB RAM

EC2_KEY="jvs_ec2_key"
EC2_SUBNET="subnet-171c8c4b"
REGION="us-east-1"
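As a rough sanity check on cost, the hourly rate quoted in the comment above can be multiplied by the instance count. The sketch below is only an estimate under stated assumptions: it covers the EC2 rate alone (the EMR service charge, storage, and data transfer are extra), the rate may have changed since the notebook was written, and `ec2_cost_usd` is a hypothetical helper name.

```python
# Values copied from the configuration cell above.
NUM_CORE_NODES = 2
NUM_INSTANCES = 1 + NUM_CORE_NODES
EC2_HOURLY_RATE_USD = 0.172  # m5a.xlarge rate quoted in the notebook comment


def ec2_cost_usd(hours, instances=NUM_INSTANCES, rate=EC2_HOURLY_RATE_USD):
    """Estimate the EC2-only cost of running the cluster for `hours` hours."""
    return round(hours * instances * rate, 2)


# Estimated EC2 cost of an 8-hour working session on the 3-node cluster.
print(ec2_cost_usd(8))
```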

## Create EC2 Key Pair and PEM File

1. Go to the [Amazon EC2 console](https://console.aws.amazon.com/ec2/home).
2. Click **Key Pairs**.
3. Click **Create key pair**.
4. Enter a name for the key pair.
5. Save the PEM file locally.
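The console steps above can also be done from Python. The following is a minimal sketch, not the method this notebook used: it assumes an EC2 client created with `boto3.client("ec2", ...)`, and `save_key_pair` and the PEM path are hypothetical names chosen for illustration.

```python
import os
import stat


def save_key_pair(ec2_client, key_name, pem_path):
    """Create an EC2 key pair and write its private key to pem_path."""
    # create_key_pair returns the private key only once, in "KeyMaterial".
    response = ec2_client.create_key_pair(KeyName=key_name)
    with open(pem_path, "w") as pem_file:
        pem_file.write(response["KeyMaterial"])
    # Make the file owner read-only, as ssh expects for private keys.
    os.chmod(pem_path, stat.S_IRUSR)
    return pem_path


# Example usage (needs AWS credentials, so it is left commented out):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# save_key_pair(ec2, "jvs_ec2_key", "jvs_ec2_key.pem")
```

The call fails if a key pair with the same name already exists, so this is only suitable for first-time setup.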

## Using AWS CLI

Build the CLI command:

In [67]:
make_cluster_cmd = f"""
    aws emr create-cluster \\
    --release-label {RELEASE_LABEL} \\
    --applications Name=Spark \\
    --instance-type {INSTANCE_TYPE} \\
    --instance-count {NUM_INSTANCES} \\
    --name {CLUSTER_NAME} \\
    --log-uri s3://aws-logs-{CLUSTER_NAME}/elasticmapreduce/ \\
    --use-default-roles \\
    --ec2-attributes KeyName={EC2_KEY},SubnetId={EC2_SUBNET} \\
    --region {REGION}
"""
print(make_cluster_cmd)


    aws emr create-cluster \
    --release-label emr-5.32.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --name jvs-fraud \
    --log-uri s3://aws-logs-jvs-fraud/elasticmapreduce/ \
    --use-default-roles \
    --ec2-attributes KeyName=jvs_ec2_key \
    --region us-east-1



Run the command and **create the cluster**

In [68]:
#!{make_cluster_cmd}
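Instead of copying the cluster ID by hand, the command can be run from Python and the ID read from its JSON output. This is a sketch under assumptions: the AWS CLI is installed and configured, `--output json` is appended to force JSON output, and `to_argv`, `parse_cluster_id`, and `run_create_cluster` are hypothetical helper names.

```python
import json
import shlex
import subprocess


def to_argv(command):
    """Turn a multi-line shell command with backslash continuations into argv."""
    # Replace each backslash-newline pair so shlex sees one logical line.
    return shlex.split(command.replace("\\\n", " "))


def parse_cluster_id(cli_output):
    """Extract the cluster ID from the JSON printed by `aws emr create-cluster`."""
    return json.loads(cli_output)["ClusterId"]


def run_create_cluster(command):
    """Run the create-cluster command and return the new cluster's ID."""
    args = to_argv(command) + ["--output", "json"]
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return parse_cluster_id(result.stdout)


# Example usage (creates a billable cluster, so it is left commented out):
# CLUSTER_ID = run_create_cluster(make_cluster_cmd)
```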

Copy the cluster ID from the output above into the cell below, then run the command to **terminate the cluster**

In [61]:
CLUSTER_ID="j-3GTMZ6TB4TEEM"
#!aws emr terminate-clusters --cluster-ids {CLUSTER_ID} --region us-east-1

## Using Python - boto3

### Create a Bucket

This is where the notebook data will persist.

In [73]:
BUCKET_NAME="jvs-fraud-research"

In [74]:
s3 = boto3.client('s3', region_name=REGION)

Create the bucket if it doesn't exist

In [81]:
if BUCKET_NAME not in [val['Name'] for val in s3.list_buckets()['Buckets']]:
    s3.create_bucket(Bucket=BUCKET_NAME)
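The call above works as written because the region is `us-east-1`. In any other region, S3 expects a `LocationConstraint` in the request. The sketch below shows one way to handle both cases; `ensure_bucket` is a hypothetical helper, and it takes the S3 client as an argument so it can be reused with any region.

```python
def ensure_bucket(s3_client, bucket_name, region):
    """Create bucket_name if this account does not own it yet.

    Returns True when a bucket was created, False when it already existed.
    """
    existing = {bucket["Name"] for bucket in s3_client.list_buckets()["Buckets"]}
    if bucket_name in existing:
        return False
    if region == "us-east-1":
        # us-east-1 is the default location and takes no LocationConstraint.
        s3_client.create_bucket(Bucket=bucket_name)
    else:
        s3_client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    return True


# Example usage with the notebook's variables (left commented out):
# ensure_bucket(s3, BUCKET_NAME, REGION)
```

Note that `list_buckets` only shows buckets owned by the calling account, so creation can still fail if the name is taken by another account.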

### Create Cluster

https://stackoverflow.com/questions/26314316/how-to-launch-and-configure-an-emr-cluster-using-boto

In [87]:
emr = boto3.client('emr', region_name=REGION)

response = emr.run_job_flow(
    Name=CLUSTER_NAME,
    LogUri=LOG_URI,
    ReleaseLabel=RELEASE_LABEL,
    Instances={
        'MasterInstanceType': INSTANCE_TYPE,
        'SlaveInstanceType': INSTANCE_TYPE,
        'InstanceCount': NUM_INSTANCES,
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
        'Ec2KeyName': EC2_KEY,
        'Ec2SubnetId': EC2_SUBNET
    },
    Applications=[
        {'Name': 'Spark'},
        {'Name': 'Hadoop'},
        {'Name': 'Livy'}
    ],
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole'
    
)
print(f"EMR ID: {response['JobFlowId']}")

EMR ID: j-2X4G7C1R1F89B


Now that the EMR cluster is up, go to the [EMR Notebooks Page](https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1#notebooks-list:) to create (or open) a notebook to work from.
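`run_job_flow` returns as soon as the request is accepted, and the cluster typically takes several minutes to become usable. A minimal sketch of blocking until it is ready, using boto3's `cluster_running` waiter, might look like this. `wait_until_ready` and its default polling values are assumptions for illustration, not part of the original notebook.

```python
def wait_until_ready(emr_client, cluster_id, delay_seconds=30, max_attempts=60):
    """Block until the cluster is running, then return its current state."""
    waiter = emr_client.get_waiter("cluster_running")
    waiter.wait(
        ClusterId=cluster_id,
        # Poll every delay_seconds, giving up after max_attempts polls.
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )
    description = emr_client.describe_cluster(ClusterId=cluster_id)
    return description["Cluster"]["Status"]["State"]


# Example usage with the client and response from the cell above
# (left commented out because it needs a live cluster):
# print(wait_until_ready(emr, response["JobFlowId"]))
```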

In [56]:
emr.terminate_job_flows(JobFlowIds=[response['JobFlowId']])

{'ResponseMetadata': {'RequestId': 'b02a5a98-824c-4290-9915-2e1bbdd7431d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b02a5a98-824c-4290-9915-2e1bbdd7431d',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Fri, 18 Dec 2020 22:59:44 GMT'},
  'RetryAttempts': 0}}
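The HTTP 200 above only confirms that the termination request was accepted; shutdown itself takes a little longer. One way to confirm that the cluster has actually stopped is to check its state with `describe_cluster`, as sketched below. `cluster_state` and `is_terminated` are hypothetical helper names.

```python
# States in which an EMR cluster has fully shut down.
TERMINAL_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def cluster_state(emr_client, cluster_id):
    """Return the cluster's current state string, e.g. 'TERMINATING'."""
    description = emr_client.describe_cluster(ClusterId=cluster_id)
    return description["Cluster"]["Status"]["State"]


def is_terminated(emr_client, cluster_id):
    """Return True once the cluster has reached a terminal state."""
    return cluster_state(emr_client, cluster_id) in TERMINAL_STATES


# Example usage with the client and response from the cells above
# (left commented out because it calls AWS):
# print(is_terminated(emr, response["JobFlowId"]))
```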

## Storing Configs

https://towardsdatascience.com/getting-started-with-pyspark-on-amazon-emr-c85154b6b921