# Spark on Amazon EMR-on-EKS Starter Notebook

## Table of Contents:

1. [Overview](#Overview)
2. [Dependencies](#Dependencies) <br>
2.1. [Install the AWS CLI](#Install-the-AWS-CLI) <br>
2.2. [Install or Upgrade eksctl](#Install-or-Upgrade-eksctl) <br>
2.3. [Install kubectl](#Install-kubectl) <br>
2.4. [Configure AWS Credentials](#Configure-AWS-Credentials) <br>
3. [Launch an Amazon EKS Cluster](#Launch-an-Amazon-EKS-Cluster)
4. [Create Amazon EMR Virtual Clusters](#Create-Amazon-EMR-Virtual-Clusters)
5. [Submit Spark Jobs](#Submit-Spark-Jobs)
6. [Clean Up Resources](#Clean-Up-Resources)

## Overview

This notebook gets you started on Amazon EMR on EKS from a SageMaker notebook instance launched in a VPC.

Useful links for later reading:
- EKS Getting Started: https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html
- eksctl Introduction: https://eksctl.io/introduction/
- EKS Workshop: https://www.eksworkshop.com/

## Dependencies

Let's install a Jupyter extension that displays the execution time of each cell. This helps measure how long each step takes.

In [None]:
!pip install jupyter_contrib_nbextensions
!jupyter contrib nbextension install --user
!jupyter nbextension enable execute_time/ExecuteTime

<div class="alert alert-block alert-warning">
Note: Refresh your browser after you execute the cell above.
</div>

In [None]:
!sudo apt update
!sudo apt upgrade -y
!sudo apt install -y curl

### Install the AWS CLI

Expected version: 1.18.157 or later (AWS CLI v1), or 2.0.56 or later (AWS CLI v2). The cell below installs AWS CLI v2.

In [None]:
%%sh
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

In [None]:
!aws --version
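
To check the reported version against the minimums listed above, a small helper like the one below can parse the `aws --version` output. This is a minimal sketch: the function names and the sample version string are illustrative assumptions, and it only assumes the output starts with `aws-cli/<major>.<minor>.<patch>`.

```python
import re

# Minimum versions stated in this notebook, keyed by AWS CLI major version.
MINIMUM_VERSIONS = {1: (1, 18, 157), 2: (2, 0, 56)}


def parse_cli_version(version_output):
    """Extract (major, minor, patch) from the text printed by `aws --version`."""
    match = re.search(r"aws-cli/(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_output!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version_output):
    """Return True if the installed CLI is at or above the minimum for its major version."""
    version = parse_cli_version(version_output)
    minimum = MINIMUM_VERSIONS.get(version[0])
    if minimum is None:
        # A major version newer than any listed one counts as "or later".
        return version[0] > max(MINIMUM_VERSIONS)
    return version >= minimum


# Illustrative sample; paste the actual output of `aws --version` here.
sample = "aws-cli/2.0.56 Python/3.7.3 Linux/4.14.193 exe/x86_64"
print(meets_minimum(sample))
```

Tuples compare element by element, so `(1, 18, 156)` correctly sorts below `(1, 18, 157)` without any string handling.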

### Install or Upgrade eksctl

In [None]:
%%sh
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

In [None]:
!eksctl version

### Install kubectl

In [None]:
%%sh
curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.18.8/2020-09-18/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin

In [None]:
!kubectl version --short --client

### Configure AWS Credentials

Create a local AWS profile by running the following in the Jupyter terminal (Jupyter Home -> New -> Terminal):

```
$>aws configure
```

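`aws configure` stores the keys in `~/.aws/credentials` and the region in `~/.aws/config` (as `region = us-west-2` under `[default]`). The `~/.aws/credentials` file should have an entry like the one below:

```
[default]
aws_access_key_id = <your access key>
aws_secret_access_key = <your secret key>
```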

## Launch an Amazon EKS Cluster


**Here we launch our EKS cluster with three m5.2xlarge nodes.** eksctl creates a dedicated VPC for the cluster, which is the recommended setup.

**This creates two CloudFormation stacks and should take around 15-20 minutes.**

- Make sure your account is under the VPC and Elastic IP limits.
- Create an EC2 key pair and pass its name to `--ssh-public-key` (the example below uses my key pair, **vm_oregon**).

In [None]:
!eksctl version

In [None]:
!eksctl create cluster \
--name <eks-cluster-name> \
--version 1.18 \
--region us-west-2 \
--nodegroup-name linux-nodes \
--node-type m5.2xlarge \
--nodes 3 \
--nodes-min 1 \
--nodes-max 4 \
--ssh-access \
--ssh-public-key vm_oregon \
--managed

In [None]:
!kubectl config get-clusters

In [None]:
!kubectl get namespace

## Create Amazon EMR Virtual Clusters

We will launch the EMR Virtual Cluster in the 'default' namespace.

In [None]:
!eksctl create iamidentitymapping \
    --cluster <eks-cluster-name> \
    --namespace default \
    --service-name "emr-containers"

In [None]:
!aws emr-containers create-virtual-cluster \
--name <emr-virtual-cluster-name> \
--container-provider '{"id": "<eks-cluster-name>","type": "EKS","info": {"eksInfo": {"namespace": "default"}} }'
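
The `--container-provider` argument is inline JSON, which is easy to break with a stray quote. One option is to build it in Python and let `json.dumps` and `shlex.quote` handle the quoting. This is a sketch only: the helper names and the cluster names in the example call are placeholders, not values from this notebook.

```python
import json
import shlex


def container_provider(eks_cluster_name, namespace="default"):
    """Build the container provider document expected by create-virtual-cluster."""
    return {
        "id": eks_cluster_name,
        "type": "EKS",
        "info": {"eksInfo": {"namespace": namespace}},
    }


def create_virtual_cluster_command(virtual_cluster_name, eks_cluster_name, namespace="default"):
    """Return a shell-safe command string equivalent to the cell above."""
    provider_json = json.dumps(container_provider(eks_cluster_name, namespace))
    args = [
        "aws", "emr-containers", "create-virtual-cluster",
        "--name", virtual_cluster_name,
        "--container-provider", provider_json,
    ]
    # shlex.quote wraps the JSON in single quotes so the shell passes it through unchanged.
    return " ".join(shlex.quote(arg) for arg in args)


# Placeholder names; substitute your own cluster names.
print(create_virtual_cluster_command("my-virtual-cluster", "my-eks-cluster"))
```

The printed command can be pasted into a `!` cell or a terminal.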

Our EMR Virtual Cluster should be up and running. Let's first get familiar with some commands before we submit Spark jobs.

In [None]:
!aws emr-containers describe-virtual-cluster --id <emr-virtual-cluster-id>
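
The `<emr-virtual-cluster-id>` value is the `id` field returned by `create-virtual-cluster`. If you no longer have that output, `aws emr-containers list-virtual-clusters` returns the same information. The sketch below picks the ID of a running virtual cluster by name from that response. The sample response is hand-written for illustration (the IDs are made up), and `find_virtual_cluster_id` is a hypothetical helper.

```python
def find_virtual_cluster_id(response, name):
    """Return the ID of the single RUNNING virtual cluster called `name`."""
    matches = [
        cluster["id"]
        for cluster in response.get("virtualClusters", [])
        if cluster.get("name") == name and cluster.get("state") == "RUNNING"
    ]
    if len(matches) != 1:
        raise LookupError(f"expected one running virtual cluster named {name!r}, found {len(matches)}")
    return matches[0]


# Hand-written sample in the shape of a list-virtual-clusters response (IDs are made up).
sample_response = {
    "virtualClusters": [
        {"id": "aaaa1111example", "name": "my-virtual-cluster", "state": "TERMINATED"},
        {"id": "bbbb2222example", "name": "my-virtual-cluster", "state": "RUNNING"},
    ]
}
print(find_virtual_cluster_id(sample_response, "my-virtual-cluster"))
```

Filtering on `RUNNING` matters because deleted virtual clusters with the same name can still appear in the listing with a `TERMINATED` state.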

## Submit Spark Jobs

### Set Up the Spark Job Execution Role

Let's now submit some Spark jobs. First, we need to create an EMR Spark job execution role with the IAM policy below.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:PutLogEvents",
                "logs:CreateLogStream",
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams"
            ],
            "Resource": [
                "arn:aws:logs:*:*:*"
            ]
        }
    ]
} 
```

Navigate to the IAM console to create the role. Let's call the IAM Role `EMR_EKS_Job_Execution_Role`.
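
The policy above allows the S3 actions on every resource (`"Resource": "*"`). If your jobs only touch one bucket, a narrower policy may be preferable. The sketch below generates one under that assumption: `job_execution_policy` is a hypothetical helper and `my-emr-eks-bucket` is a placeholder bucket name. Jobs that read scripts or data from other buckets would need those added as well.

```python
import json


def job_execution_policy(bucket_name):
    """Build a job execution policy limited to a single S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level actions apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                # ListBucket applies to the bucket itself, not its objects.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # CloudWatch Logs statement, unchanged from the policy above.
                "Effect": "Allow",
                "Action": [
                    "logs:PutLogEvents",
                    "logs:CreateLogStream",
                    "logs:DescribeLogGroups",
                    "logs:DescribeLogStreams",
                ],
                "Resource": ["arn:aws:logs:*:*:*"],
            },
        ],
    }


# Placeholder bucket name; replace with the bucket your jobs use.
print(json.dumps(job_execution_policy("my-emr-eks-bucket"), indent=4))
```

The printed JSON can be pasted into the IAM console policy editor when creating the role.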

### Set Up the Trust Policy for the IAM Job Execution Role

In [None]:
!aws emr-containers update-role-trust-policy \
       --cluster-name <eks-cluster-name> \
       --namespace default \
       --role-name EMR_EKS_Job_Execution_Role

### Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster 

In [None]:
!eksctl utils associate-iam-oidc-provider --cluster <eks-cluster-name> --approve

### Submit and Monitor the Spark Job

In [None]:
!aws emr-containers start-job-run \
--virtual-cluster-id <emr-virtual-cluster-id> \
--cli-input-json file://./start-job-run-request.json
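
The command above reads `start-job-run-request.json`, which this notebook does not show. The sketch below writes one possible version of that file. Everything specific in it is an assumption to adapt: the job name, the account ID `111122223333`, the release label, the bundled `pi.py` example as the entry point, the Spark settings, and the log group name. Because the policy above does not allow creating log groups, the CloudWatch log group is assumed to exist already.

```python
import json
from pathlib import Path


def build_start_job_run_request(execution_role_arn, log_group_name, release_label="emr-6.2.0-latest"):
    """Build a request body for `aws emr-containers start-job-run --cli-input-json`."""
    return {
        "name": "sample-spark-pi",
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                # Assumed entry point: the Pi example shipped inside the EMR Spark image.
                "entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
                "sparkSubmitParameters": (
                    "--conf spark.executor.instances=2 "
                    "--conf spark.executor.memory=2G "
                    "--conf spark.executor.cores=2 "
                    "--conf spark.driver.cores=1"
                ),
            }
        },
        "configurationOverrides": {
            "monitoringConfiguration": {
                "cloudWatchMonitoringConfiguration": {
                    "logGroupName": log_group_name,
                    "logStreamNamePrefix": "spark-pi",
                }
            }
        },
    }


# Placeholder ARN and log group; replace with your own values.
request = build_start_job_run_request(
    execution_role_arn="arn:aws:iam::111122223333:role/EMR_EKS_Job_Execution_Role",
    log_group_name="/emr-on-eks/spark-jobs",
)
Path("start-job-run-request.json").write_text(json.dumps(request, indent=2))
```

The virtual cluster ID is left out of the file because the command passes it with `--virtual-cluster-id`.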

In [None]:
!aws emr-containers describe-job-run --virtual-cluster-id <emr-virtual-cluster-id> --id <job-run-id>
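
Rather than re-running `describe-job-run` by hand, you can poll until the job reaches a terminal state. The sketch below is one way to do that. `wait_for_job_run` and `cli_fetch_state` are hypothetical helpers; the state is read from the `jobRun.state` field of the response, and `COMPLETED`, `FAILED`, and `CANCELLED` are treated as terminal. The demo at the bottom uses a scripted sequence of states so it runs without AWS access.

```python
import subprocess
import time

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def cli_fetch_state(virtual_cluster_id, job_run_id):
    """Read the current job state through the AWS CLI (requires configured credentials)."""
    result = subprocess.run(
        [
            "aws", "emr-containers", "describe-job-run",
            "--virtual-cluster-id", virtual_cluster_id,
            "--id", job_run_id,
            "--query", "jobRun.state",
            "--output", "text",
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_job_run(fetch_state, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call fetch_state() until it returns a terminal state, then return that state."""
    for _ in range(max_polls):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"job did not finish after {max_polls} polls")


# Offline demo: a scripted sequence of states stands in for the real CLI call.
scripted_states = iter(["PENDING", "SUBMITTED", "RUNNING", "COMPLETED"])
print(wait_for_job_run(lambda: next(scripted_states), sleep=lambda seconds: None))
```

Against a real job, pass `lambda: cli_fetch_state("<emr-virtual-cluster-id>", "<job-run-id>")` as `fetch_state` and keep the default `sleep`.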

You can open the Spark History Server from the EMR console:

EMR Console -> EMR on EKS -> Virtual Clusters -> select the virtual cluster -> Job Runs -> view the logs for the job run.

## Clean Up Resources

### Delete the EMR Virtual Cluster

In [None]:
!aws emr-containers delete-virtual-cluster --id <emr-virtual-cluster-id>

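Delete the CloudFormation stacks from the CloudFormation console. Alternatively, running `eksctl delete cluster --name <eks-cluster-name> --region us-west-2` should remove the stacks that eksctl created.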