## EMR Setup with Python SDK (boto3)
This notebook shows how to set up an EMR (Elastic MapReduce) Spark cluster, and the AWS resources it depends on, using boto3, the Python SDK for AWS.

Boto3 Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html

---

#### Package Import

---

In [27]:
import boto3
import configparser

---

#### Loading Credentials from file

---

In [28]:
#AWS Credentials
aws_path = "/home/rambino/.aws/credentials"
aws_cred = configparser.ConfigParser()
aws_cred.read(aws_path)

['/home/rambino/.aws/credentials']
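The lookup above can be wrapped in a small helper that fails early when the file, the profile, or a key is missing. This is only a sketch: `load_aws_credentials` is a hypothetical name, not part of boto3 or configparser.

```python
import configparser


def load_aws_credentials(path, profile="default"):
    """Return the access key pair stored under `profile` in an AWS credentials file.

    Hypothetical helper (not part of boto3).
    """
    parser = configparser.ConfigParser()
    # ConfigParser.read() silently skips unreadable files and returns the list it parsed
    if not parser.read(path):
        raise FileNotFoundError(f"Could not read credentials file: {path}")
    if profile not in parser:
        raise KeyError(f"Profile '{profile}' not found in {path}")
    section = parser[profile]
    # Indexing raises KeyError if either key is absent from the profile
    return {
        "aws_access_key_id": section["aws_access_key_id"],
        "aws_secret_access_key": section["aws_secret_access_key"],
    }
```

Because the returned keys match the keyword arguments used in this notebook, the dict could be unpacked directly, e.g. `boto3.client('ec2', region_name="us-east-1", **creds)`.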

---

#### Create SSH keypair for connecting to EC2 instances

---

In [29]:
ec2 = boto3.client('ec2',
    region_name             = "us-east-1",
    aws_access_key_id       = aws_cred['default']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['default']['aws_secret_access_key']
)

In [10]:
response = ec2.create_key_pair(
    KeyName = 'spark_ec2_key',
    DryRun=False,
    KeyFormat='pem'
)

In [25]:
#KeyMaterial is a single PEM string, so write() is sufficient
with open('/home/rambino/.aws/spark_keypair.pem',"w") as file:
    file.write(response['KeyMaterial'])
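OpenSSH refuses to use a private key file that other users can read, so it can help to restrict the file's permissions at the moment it is created. A minimal sketch, assuming a POSIX system; `save_private_key` is a hypothetical helper name.

```python
import os


def save_private_key(key_material, path):
    """Write PEM key material to `path`, accessible by the file owner only."""
    # O_EXCL refuses to overwrite an existing key; mode 0o600 is applied at creation
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as key_file:
        key_file.write(key_material)
    return path
```

With the response from `create_key_pair`, this would be called as `save_private_key(response['KeyMaterial'], '/home/rambino/.aws/spark_keypair.pem')`.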

---

#### Setting up VPC for the EMR cluster

---

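If no subnet (and therefore no VPC) is specified for an EMR cluster, AWS chooses where to launch it: typically a subnet of the account's default VPC, or EC2-Classic on older accounts that still support it.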

Creating default VPC:

In [55]:
!aws ec2 create-default-vpc --profile default


An error occurred (DefaultVpcAlreadyExists) when calling the CreateDefaultVpc operation: A Default VPC already exists for this account in this region.


Getting **first** subnetId for this VPC:

In [30]:
vpc_output = ec2.describe_vpcs()

#Getting first (and only) VPC:
vpcId = vpc_output['Vpcs'][0]['VpcId']

subnet_output = ec2.describe_subnets(
    Filters=[
        {
            'Name':'vpc-id',
            'Values':[vpcId]
        }
    ]
)

subnetId = subnet_output['Subnets'][0]['SubnetId']
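Taking `['Vpcs'][0]` only returns the default VPC when the account has a single VPC. A sketch of a more explicit lookup, using the `is-default` filter of `describe_vpcs`; the function name is hypothetical and the EC2 client is passed in as a parameter.

```python
def first_subnet_of_default_vpc(ec2_client):
    """Return (vpc_id, subnet_id) for the first subnet of the region's default VPC."""
    vpcs = ec2_client.describe_vpcs(
        Filters=[{"Name": "is-default", "Values": ["true"]}]
    )["Vpcs"]
    if not vpcs:
        raise LookupError("No default VPC found in this region")
    vpc_id = vpcs[0]["VpcId"]

    subnets = ec2_client.describe_subnets(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )["Subnets"]
    if not subnets:
        raise LookupError(f"VPC {vpc_id} has no subnets")
    return vpc_id, subnets[0]["SubnetId"]
```

Usage with the client created earlier would look like `vpcId, subnetId = first_subnet_of_default_vpc(ec2)`.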

---

#### Creating EMR Cluster

---

**Steps needed to set up and connect to EMR:**
1. Set up the cluster with the correct specifications
2. Get the 'master public DNS' for the cluster
3. Edit the security group to allow my computer to connect via SSH (add an inbound rule to allow SSH connections from my IP)
   1. Note: The security group is a distinct entity from the cluster, so why not just set it up beforehand?
      1. Note: It IS possible to set up a security group beforehand and to specify it for the master and slave nodes. For a more official setup, this is probably better, since it ensures that the security group used for EMR is custom-defined (and not default).
      2. UPDATE: When you CREATE a cluster, security groups are created for it automatically on the default VPC. I could set up custom security groups *beforehand*, or I could create the cluster and then edit the generated security groups as needed. Since I can't think of a reason why creating them beforehand would be better, this code edits the ones created for me.
4. Set up a proxy to access the "persistent web UI for Spark"?
   1. This looks like a way to view the Spark UI, but the proxy settings and URL filters seem really hacky (e.g., they filter URLs matching "http://10.*"). I'm not sure I want to set this up until I know that it's much better than using AWS' built-in UI viewer.
   2. Update: **It turns out that AWS also recommends using FoxyProxy (or other tools) to connect to Spark UIs on EMR**, so I will in fact do this now.
      1. [read more here](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html)

In [31]:
emr = boto3.client('emr',
    region_name             = "us-east-1",
    aws_access_key_id       = aws_cred['default']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['default']['aws_secret_access_key']
)

In [46]:
#With boto3
emr.run_job_flow(
            Name='spark-cluster',
            LogUri='s3://emrlogs/',
            ReleaseLabel='emr-5.28.0',
            Instances={
                'MasterInstanceType': 'm5.xlarge',
                'SlaveInstanceType': 'm5.xlarge',
                'InstanceCount': 4,
                'Ec2KeyName':'spark_ec2_key',
                'KeepJobFlowAliveWhenNoSteps': True
                #'EmrManagedMasterSecurityGroup': security_groups['manager'].id,
                #'EmrManagedSlaveSecurityGroup': security_groups['worker'].id,
            },
            Applications=[
                {
                    "Name":"Spark"
                },
                {
                    "Name":"Zeppelin"
                }
            ],
            JobFlowRole='EMR_EC2_DefaultRole',
            ServiceRole='EMR_DefaultRole',
            VisibleToAllUsers=True
        )

#NOTE: Under the 'Applications' specification of the EMR cluster above, you can also load other applications
# besides Spark and Zeppelin, such as TensorFlow, Presto, and Hadoop!

{'JobFlowId': 'j-1G9MDNZM8AK80',
 'ClusterArn': 'arn:aws:elasticmapreduce:us-east-1:549653882425:cluster/j-1G9MDNZM8AK80',
 'ResponseMetadata': {'RequestId': 'fd542ec6-5c0d-4248-b1b0-f0c9fdb7aaeb',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'fd542ec6-5c0d-4248-b1b0-f0c9fdb7aaeb',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '118',
   'date': 'Wed, 14 Sep 2022 18:07:14 GMT'},
  'RetryAttempts': 0}}
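The boto3 call above does not pass the `subnetId` looked up earlier, whereas the AWS CLI version below pins the subnet via `SubnetId`. In `run_job_flow`, the corresponding key inside `Instances` is `Ec2SubnetId`. A sketch of building that argument, with `build_instances_config` as a hypothetical helper name:

```python
def build_instances_config(key_name, subnet_id=None,
                           instance_type="m5.xlarge", instance_count=4):
    """Build the `Instances` argument for emr.run_job_flow (hypothetical helper)."""
    instances = {
        "MasterInstanceType": instance_type,
        "SlaveInstanceType": instance_type,
        "InstanceCount": instance_count,
        "Ec2KeyName": key_name,
        "KeepJobFlowAliveWhenNoSteps": True,
    }
    if subnet_id is not None:
        # Counterpart of SubnetId=... under --ec2-attributes in the AWS CLI
        instances["Ec2SubnetId"] = subnet_id
    return instances
```

The result would be passed as `Instances=build_instances_config('spark_ec2_key', subnetId)` in place of the literal dict.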

In [32]:
#with AWS CLI:

!aws emr create-cluster --name test-cluster \
    --use-default-roles \
    --release-label emr-5.28.0 \
    --instance-count 4 \
    --applications Name=Spark Name=Zeppelin \
    --ec2-attributes KeyName='spark_ec2_key',SubnetId='subnet-0b6cc9cfba9463659' \
    --instance-type m5.xlarge \
    --log-uri s3://emrlogs/ \
    --visible-to-all-users

{
    "ClusterId": "j-YR7XBL53LHQW",
    "ClusterArn": "arn:aws:elasticmapreduce:us-east-1:549653882425:cluster/j-YR7XBL53LHQW"
}


---

#### Configuring Cluster

---

In [52]:
cluster_list = emr.list_clusters(
    ClusterStates=['STARTING','RUNNING']
)
print(cluster_list)
cluster_id = cluster_list['Clusters'][0]['Id']

{'Clusters': [{'Id': 'j-1G9MDNZM8AK80', 'Name': 'spark-cluster', 'Status': {'State': 'STARTING', 'StateChangeReason': {}, 'Timeline': {'CreationDateTime': datetime.datetime(2022, 9, 14, 20, 7, 15, 117000, tzinfo=tzlocal())}}, 'NormalizedInstanceHours': 0, 'ClusterArn': 'arn:aws:elasticmapreduce:us-east-1:549653882425:cluster/j-1G9MDNZM8AK80'}], 'ResponseMetadata': {'RequestId': '030d99e7-4228-410a-8fd0-f754cf43d04e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '030d99e7-4228-410a-8fd0-f754cf43d04e', 'content-type': 'application/x-amz-json-1.1', 'content-length': '279', 'date': 'Wed, 14 Sep 2022 18:08:42 GMT'}, 'RetryAttempts': 0}}


In [55]:
new_cluster = emr.describe_cluster(
    ClusterId = cluster_id
)
new_cluster
secGroup_master = new_cluster['Cluster']['Ec2InstanceAttributes']['EmrManagedMasterSecurityGroup']
cluster_dns = new_cluster['Cluster']['MasterPublicDnsName']
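While the cluster is still in the `STARTING` state, `MasterPublicDnsName` may not be present in the `describe_cluster` response yet, so the lookup above can raise a `KeyError` if it runs too early. A sketch that waits first, using boto3's built-in `cluster_running` waiter for EMR; the function name is hypothetical.

```python
def wait_for_master_dns(emr_client, cluster_id):
    """Block until the cluster is up, then return its master public DNS name."""
    # 'cluster_running' is a built-in boto3 EMR waiter that polls describe_cluster
    waiter = emr_client.get_waiter("cluster_running")
    waiter.wait(ClusterId=cluster_id)

    cluster = emr_client.describe_cluster(ClusterId=cluster_id)["Cluster"]
    return cluster["MasterPublicDnsName"]
```

Calling `cluster_dns = wait_for_master_dns(emr, cluster_id)` can block for several minutes while the instances are provisioned.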


Configure Cluster Security Groups to only accept SSH ingress from my IP address

In [62]:
#Getting my public IP address from the ifconfig.me website (IP is last element of returned array)
myIP = !curl ifconfig.me
myIP = myIP[-1]

In [63]:
#'/32' is the CIDR prefix length (not a port): it restricts the rule to exactly this one IPv4 address
prefixLength = '32'
myCidrIp = myIP + "/" + prefixLength
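The CIDR string can also be built with the standard-library `ipaddress` module, which validates the address returned by the lookup service before it reaches the security group rule. A minimal sketch; `single_host_cidr` is a hypothetical helper name.

```python
import ipaddress


def single_host_cidr(ip):
    """Return the CIDR block that matches exactly one address (/32 for IPv4)."""
    # ip_address() raises ValueError on malformed input
    address = ipaddress.ip_address(ip.strip())
    # max_prefixlen is 32 for IPv4 and 128 for IPv6
    return f"{address}/{address.max_prefixlen}"
```

Note that an IPv6 result would need to go under `Ipv6Ranges` / `CidrIpv6` rather than `IpRanges` / `CidrIp` in the ingress rule.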

In [64]:
response = ec2.authorize_security_group_ingress(
    GroupId=secGroup_master,
    IpPermissions=[
        {
            'FromPort': 22,
            'IpProtocol': 'tcp',
            'IpRanges': [
                {
                    'CidrIp': myCidrIp,
                    'Description': 'SSH access to Spark EMR on AWS from Kevins Computer',
                },
            ],
            'ToPort': 22,
        },
    ],
)

ClientError: An error occurred (InvalidPermission.Duplicate) when calling the AuthorizeSecurityGroupIngress operation: the specified rule "peer: 88.117.176.62/32, TCP, from port: 22, to port: 22, ALLOW" already exists
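The `InvalidPermission.Duplicate` error above only means the rule already exists, so a re-run of the notebook can safely ignore it. A sketch of an idempotent version follows. To stay free of imports, it inspects the `response` attribute that botocore's `ClientError` carries; catching `botocore.exceptions.ClientError` directly would be the more precise choice. The function name is hypothetical.

```python
def allow_ssh_from(ec2_client, group_id, cidr_ip, description="SSH access"):
    """Add an inbound SSH rule; return False if the same rule already exists."""
    try:
        ec2_client.authorize_security_group_ingress(
            GroupId=group_id,
            IpPermissions=[{
                "IpProtocol": "tcp",
                "FromPort": 22,
                "ToPort": 22,
                "IpRanges": [{"CidrIp": cidr_ip, "Description": description}],
            }],
        )
        return True
    except Exception as exc:
        # botocore's ClientError exposes the service error code under .response
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "InvalidPermission.Duplicate":
            return False
        raise  # any other error is a real problem and should surface
```

In this notebook the call would be `allow_ssh_from(ec2, secGroup_master, myCidrIp)`.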

---

#### Interacting with Cluster

---

In [57]:
#File path where cluster login information is kept on my machine:
pem_path = '/home/rambino/.aws/spark_keypair.pem'

Connect to Cluster via SSH

In [58]:
#Command to use in terminal (interactive):
print(f"ssh hadoop@{cluster_dns} -i {pem_path}")

ssh hadoop@ec2-34-236-148-89.compute-1.amazonaws.com -i /home/rambino/.aws/spark_keypair.pem


---

#### Proxy connection to allow interaction with Spark UI

---

Setting up FoxyProxy to allow connection to Spark UI from localhost
[AWS Documentation on Port forwarding for EMR connections](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel.html)


I needed to install the browser extension FoxyProxy to allow my browser to interface with the EMR cluster. Once I installed it, I then needed to set up a new proxy with these settings:
- IP address: `localhost`
- Port: `8157` (arbitrary; it only needs to match the port used for dynamic port forwarding below)

Then, in the 'pattern matching' part, I needed to specify which URLs should be forwarded in this way. This was already specified by Udacity. The json file accompanying this notebook named 'foxyproxy...' shows these patterns.

In [60]:
#Copying the private key (.pem) file to the master node (not sure why yet)
print(f"scp -i {pem_path} {pem_path} hadoop@{cluster_dns}:/home/hadoop/")

scp -i /home/rambino/.aws/spark_keypair.pem /home/rambino/.aws/spark_keypair.pem hadoop@ec2-34-236-148-89.compute-1.amazonaws.com:/home/hadoop/


In [61]:
#`-D 127.0.0.1:8157` sets up dynamic port forwarding: SSH opens a local SOCKS proxy on port 8157 and tunnels its traffic through the master node (allowing interactivity)
#NOTE: `-N` means no remote command is run, so the terminal remains open when this request succeeds - and needs to remain running while accessing Spark UI

#Note: Getting this SSH connection to work has been unpredictable at times. Often get 'connection refused' errors, but then it
#suddenly works. Should ideally figure out what's going on there...

print(f"ssh -v -i {pem_path} -N -D 127.0.0.1:8157 hadoop@{cluster_dns}")

ssh -v -i /home/rambino/.aws/spark_keypair.pem -N -D 127.0.0.1:8157 hadoop@ec2-34-236-148-89.compute-1.amazonaws.com
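Instead of printing the command and pasting it into a terminal, the same tunnel could be assembled as an argument list that `subprocess` can launch from Python. A sketch under that assumption; `build_socks_tunnel_command` is a hypothetical helper, and the key path and hostname are the values used in this notebook.

```python
import shlex


def build_socks_tunnel_command(pem_path, master_dns, local_port=8157, user="hadoop"):
    """Return the argv list for an SSH dynamic port forward (SOCKS proxy) to the master node."""
    return [
        "ssh",
        "-i", pem_path,
        "-N",                             # no remote command, forwarding only
        "-D", f"127.0.0.1:{local_port}",  # dynamic forwarding: local SOCKS proxy
        f"{user}@{master_dns}",
    ]


command = build_socks_tunnel_command(
    "/home/rambino/.aws/spark_keypair.pem",
    "ec2-34-236-148-89.compute-1.amazonaws.com",
)
# shlex.join (Python 3.8+) renders the list as a copy-pasteable shell command
print(shlex.join(command))
```

Passing `command` to `subprocess.Popen` would start the tunnel in the background, and calling `terminate()` on the returned process would close it again.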


#### Accessing Spark UI:
- Base URL:           http://ec2-34-236-148-89.compute-1.amazonaws.com

- Spark History:      http://ec2-34-236-148-89.compute-1.amazonaws.com:18080/
- YARN Node Manager:  http://ec2-34-236-148-89.compute-1.amazonaws.com:8042/


[See more ports here](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html)

---

#### Deleting EMR Cluster (Teardown)

---

In [65]:
emr.terminate_job_flows(
    JobFlowIds=[
        cluster_id
    ]
)

{'ResponseMetadata': {'RequestId': '0712bc2f-b997-4dac-8f6a-896c46c58bda',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '0712bc2f-b997-4dac-8f6a-896c46c58bda',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Wed, 14 Sep 2022 19:52:32 GMT'},
  'RetryAttempts': 0}}
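`terminate_job_flows` returns immediately, and the HTTP 200 above only confirms that the request was accepted. To confirm that the cluster has actually shut down (and stopped incurring charges), boto3's built-in `cluster_terminated` waiter can be used. A sketch, with a hypothetical function name:

```python
def terminate_and_wait(emr_client, cluster_id):
    """Terminate one cluster, block until it is terminated, and return its final state."""
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
    # 'cluster_terminated' is a built-in boto3 EMR waiter that polls describe_cluster
    emr_client.get_waiter("cluster_terminated").wait(ClusterId=cluster_id)
    cluster = emr_client.describe_cluster(ClusterId=cluster_id)["Cluster"]
    return cluster["Status"]["State"]
```

In this notebook it would be called as `terminate_and_wait(emr, cluster_id)`.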