## EMR Setup with Python SDK (boto3)
This notebook will show how to set up some AWS resources using the Python SDK for AWS, boto3.

Boto3 Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift.html

In [3]:
from pyspark.sql import SparkSession

aws_key      = aws_cred['default']['aws_access_key_id']
aws_secret   = aws_cred['default']['aws_secret_access_key']

spark = SparkSession \
    .builder \
    .config("spark.jars.packages","com.amazonaws:aws-java-sdk-s3:1.12.311") \
    .config("spark.jars.packages","org.apache.hadoop:hadoop-aws:3.3.4") \
    .getOrCreate()
    

sc = spark.sparkContext
sc.setSystemProperty('com.amazonaws.services.s3.enableV4','true')

sc._jsc.hadoopConfiguration().set('fs.s3a.access.key',aws_key)
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key',aws_secret)
sc._jsc.hadoopConfiguration().set('spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled', 'true')
sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint','s3-us-west-2.amazonaws.com')
sc._jsc.hadoopConfiguration().set('fs.s3a.connection.ssl.enabled','true')

22/10/04 19:18:20 WARN Utils: Your hostname, rambino-AERO-15-XD resolves to a loopback address: 127.0.1.1; using 192.168.0.234 instead (on interface wlp48s0)
22/10/04 19:18:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/home/rambino/.local/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/rambino/.ivy2/cache
The jars for the packages stored in: /home/rambino/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-72b3acfb-237f-441a-ae1c-1b7c35cc41a5;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.4 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
:: resolution report :: resolve 117ms :: artifacts dl 3ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.12.262 from central in [default]
	org.apache.hadoop#hadoop-aws;3.3.4 from central in [default]
	org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	-------------------------------

22/10/04 19:18:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [5]:
from pyspark.sql.types import IntegerType, StringType, FloatType, StructType, StructField

song_schema = StructType([
    StructField('num_songs',IntegerType(),True),
    StructField('artist_id',StringType(),True),
    StructField('artist_latitude',FloatType(),True),
    StructField('artist_longitude',FloatType(),True),
    StructField('artist_location',StringType(),True),
    StructField('artist_name',StringType(),True),
    StructField('song_id',StringType(),True),
    StructField('title',StringType(),True),
    StructField('duration',FloatType(),True),
    StructField('year',IntegerType(),True)
])

log_schema = StructType([
    StructField('artist',StringType(),True),
    StructField('auth',StringType(),True),
    StructField('firstName',StringType(),True),
    StructField('gender',StringType(),True),
    StructField('itemInSession',IntegerType(),True),
    StructField('lastName',StringType(),True),
    StructField('length',IntegerType(),True),
    StructField('level',StringType(),True),
    StructField('location',StringType(),True),
    StructField('method',StringType(),True),
    StructField('page',StringType(),True),
    StructField('registration',StringType(),True),
    StructField('sessionId',IntegerType(),True),
    StructField('song',StringType(),True),
    StructField('status',IntegerType(),True),
    StructField('ts',FloatType(),True),
    StructField('userAgent',StringType(),True),
    StructField('userId',StringType(),True)
])

In [None]:
song_df = spark.read.format('json').schema(song_schema).load('s3a://udacity-dend/song_data/*/*/*')#/A/B/C/TRABCEI128F424C983.json')
log_df = spark.read.format('json').schema(log_schema).load('s3a://udacity-dend/log_data/*/*')#/2018/11/2018-11-12-events.json')

In [None]:
log_df.toPandas()

In [None]:
log_df = log_df.where("page = 'NextSong'")

In [None]:
from pyspark.sql.functions import udf
from datetime import datetime

get_hour = udf(lambda x: x.hour)
get_day = udf(lambda x: x.day)
get_week = udf(lambda x: x.isocalendar().week)
get_month = udf(lambda x: x.month)
get_year = udf(lambda x: x.year)
get_weekday = udf(lambda x: x.weekday())

get_datetime = udf(lambda x: datetime.fromtimestamp(x/1000))
log_df = log_df.withColumn('timestamp',get_datetime('ts'))

In [None]:
# extract columns to create time table

from pyspark.sql import functions as F

log_df = log_df \
    .withColumn('songplay_id',F.expr("uuid()")) \
    .withColumn('hour',get_hour('timestamp')) \
    .withColumn('day',get_day('timestamp')) \
    .withColumn('week',get_week('timestamp')) \
    .withColumn('month',get_month('timestamp')) \
    .withColumn('year',get_year('timestamp')) \
    .withColumn('weekday',get_weekday('timestamp'))

In [None]:
songplays_table = songplays_table \
    .withColumn('songplay_id',F.expr("uuid()"))

In [None]:
#
#data.withColumn('timestamp',ts_to_timestamp('ts')).show()
match_condition = ((log_df.song == song_df.title) & (log_df.artist == song_df.artist_name))
songplays_table = log_df.join(song_df, match_condition, "left") \
    .select(
        log_df.ts, log_df.userId, log_df.level, log_df.sessionId,
        log_df.location, log_df.userAgent, log_df.month, log_df.year,
        song_df.song_id, song_df.artist_id
    )                


#songplays_table.limit(1).show()

In [None]:
filter_df = songplays_table.filter("song_id != 'None'").limit(5).show()

In [None]:
songplays_table \
    .select("songplay_id","ts","userId","level","song_id","artist_id","sessionId","location","userAgent","year","month") \
    .write \
    .option("header",True) \
    .partitionBy("year","month") \
    .csv("./_out/" + "songplays")   

%md

To do (Sept. 30):
1. I saved the output of a local ETL to the _out folder. Take a look at it and see if the data looks right
   1. ~~Why are so many entries missing 'song_id' and 'artist_id'? (333 / 6280)~~
   2. ~~Query the data and see what kind of results I get (Compared to one query for Redshift - I GET THE SAME RESULTS!! Looks like I (probably) did it right)~~
   3. ~~take a look at double-checks I did for Redshift project - any I should implement here?~~
      1. ~~Yes, need: unique **songs, users and artists** - should implement this check in notebook after running ETL locally.~~
2. ~~Run the etl.py again with limited data (Nov. 22 has at least 1 match in song + artist - use that?)~~
3. Clean up this notebook - should have EMR creation code + code to pull in data from S3 and inspect it.
4. Test writing as parquet
5. Cleanup
   1. Clean etl.py file to not have any errant comments or code. Docstrings in place?
   2. Clean EMR_boto3Setup notebook so that testing code is neatly organized or in separate notebook.
   3. Delete _out folder with test data
6. Finish rest of this notebook to spin up EMR
7. Use built-in notebook to run low-data code once
8. Upload .py to EMR via SSH and run

In [6]:
songplays_schema = StructType([
    StructField('songplay_id',StringType(),False),
    StructField('ts',FloatType(),True),
    StructField('userId',StringType(),True),
    StructField('level',StringType(),True),
    StructField('song_id',StringType(),True),
    StructField('artist_id',StringType(),True),
    StructField('sessionId',IntegerType(),True),
    StructField('location',StringType(),True),
    StructField('userAgent',StringType(),True),
    StructField('year',IntegerType(),True),
    StructField('month',IntegerType(),True),
])

In [9]:
songplays.filter("level = 'free'").show()

NameError: name 'songplays' is not defined

In [82]:
from pyspark.sql import functions as F
#from pyspark.sql.functions import cast as choohoh
from pyspark.sql import types as T
#year, month, dayofmonth, hour, weekofyear, date_format, desc, to_timestamp

#make_timestamp = F.udf(lambda x: F.cast(typ=T.TimestampType(),val=x/1000)) #F.udf(lambda x: F.unix_timestamp(x/1000,'dd-MM-yyyy HH:mm:ss.SSS'))

#songplays.select(to_timestamp('ts')).show()
songplays = songplays.withColumn('timestamp',make_timestamp('ts'))
songplays \
    .withColumn('date',F.to_timestamp(songplays['ts']/1000)) \
    .withColumn('month',F.month('date')) \
    .withColumn('hour',F.dayofmonth('date')) \
    .show()
#songplays.withColumn("timestamp",F.date_format(songplays.ts.cast(dataType=T.TimestampType()), "yyyy-MM-dd")).show()

AttributeError: module 'pyspark.sql.functions' has no attribute 'day'

## Testing

In [8]:
read_path_prefix = "./_out/"

In [10]:
songplays = spark.read \
    .format('csv') \
    .option('header',True) \
    .load(read_path_prefix + "songplays")

songplays.createOrReplaceTempView('songplays_tbl')

In [13]:
# Do we have any duplicate userIds?
spark.sql('''
SELECT *
FROM songplays_tbl
WHERE song_id IS NOT NULL
LIMIT 20
''').toPandas()

Unnamed: 0,songplay_id,ts,userId,level,song_id,artist_id,sessionId,location,userAgent,year,month
0,7dc063ba-685e-497f-9233-5e0bf7f5e92c,1542267080000.0,49,paid,SOCUITT12AB0187A32,ARKS2FE1187B99325D,606,"San Francisco-Oakland-Hayward, CA",Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20...,2018,11
1,370b1546-146f-44a8-8692-2b2418a2e3a8,1542278610000.0,80,paid,SOSDZFY12A8C143718,AR748W61187B9B6AB8,611,"Portland-South Portland, ME","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",2018,11
2,32b383cc-62c8-4f65-9f5e-2cb92f34afab,1542278870000.0,80,paid,SOTNWCI12AAF3B2028,ARS54I31187FB46721,611,"Portland-South Portland, ME","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",2018,11
3,b85d7ea5-71ec-4bad-be9b-37586c957d2b,1542279140000.0,80,paid,SOBONKR12A58A7A7E0,AR5E44Z1187B9A1D74,611,"Portland-South Portland, ME","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",2018,11
4,54443a43-6805-4035-af07-7fefe8056041,1542280310000.0,80,paid,SOLZOBD12AB0185720,ARPDVPJ1187B9ADBE9,611,"Portland-South Portland, ME","""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",2018,11
5,15630d35-df92-4218-878d-4f5b75266d8a,1542282020000.0,30,paid,SOULTKQ12AB018A183,ARKQQZA12086C116FC,324,"San Jose-Sunnyvale-Santa Clara, CA",Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) G...,2018,11
6,a477f6b4-4551-465b-8caa-dacd6afb1eba,1542284900000.0,30,paid,SOIOESO12A6D4F621D,ARVLXWP1187FB5B94A,324,"San Jose-Sunnyvale-Santa Clara, CA",Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) G...,2018,11
7,b036aa59-a542-403e-9051-f1caa1e80556,1542286210000.0,30,paid,SOSQIRI12A8C133897,AR1XIHA1187FB4AED3,324,"San Jose-Sunnyvale-Santa Clara, CA",Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) G...,2018,11
8,01ad1f8b-d918-4d66-bbd5-908359ff05cb,1542286740000.0,30,paid,SONZWDK12A6701F62B,ARL4UQB1187B9B74E3,324,"San Jose-Sunnyvale-Santa Clara, CA",Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) G...,2018,11
9,4f1bec2c-87a0-453c-9b84-ca925651d0b8,1542287790000.0,30,paid,SOYUKXG12A58A77837,ARZNLRE1187B99B6C3,324,"San Jose-Sunnyvale-Santa Clara, CA",Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) G...,2018,11


### Songs

In [9]:
songs = spark.read \
    .format('csv') \
    .option('header',True) \
    .load(read_path_prefix + "songs")

songs.createOrReplaceTempView('songs_tbl')

                                                                                

In [8]:
# Do we have any duplicate song Ids?
spark.sql('''
SELECT song_id, COUNT(song_id) count
FROM songs_tbl
GROUP BY song_id
ORDER BY count DESC
LIMIT 5
''').show()



+------------------+-----+
|           song_id|count|
+------------------+-----+
|SOPIHUA12A8AE45B1D|    1|
|SOAXEBM12AB017E5F9|    1|
|SOVJXVJ12A8C13517D|    1|
|SOBTWBZ12A6D4FBE3E|    1|
|SORJVDO12AF72A1970|    1|
+------------------+-----+



                                                                                

### Artists

In [36]:
artists = spark.read \
    .format('csv') \
    .option('header',True) \
    .load(read_path_prefix + "artists")

artists.createOrReplaceTempView('artists_tbl')

                                                                                

In [38]:
# Do we have any duplicate artist Ids?
spark.sql('''
SELECT artist_id, COUNT(artist_id) count
FROM artists_tbl
GROUP BY artist_id
ORDER BY count DESC
LIMIT 5
''').show()



+------------------+-----+
|         artist_id|count|
+------------------+-----+
|ARLGUTA1187B9B605F|    1|
|AR82DJK1187B991107|    1|
|ARBS7RY1187FB3B72F|    1|
|ARTHJGQ1187FB42F1F|    1|
|AR23EC41187FB4805D|    1|
+------------------+-----+



                                                                                

In [16]:
songplays = spark.read \
    .format('csv') \
    .schema(songplays_schema) \
    .option('header',True) \
    .load('./_out/songplays')



## Sample Analytics

In [22]:
#Example analytics: get locations where songs were played on Nov. 11, 2018

spark.sql('''
SELECT count(*) AS freq, location
from play_tbl
WHERE song_id IS NOT NULL
AND (ts/1000) > 1543532400
GROUP BY location
ORDER BY freq DESC
''').show()

+----+--------------------+
|freq|            location|
+----+--------------------+
|   9|San Francisco-Oak...|
|   2|Janesville-Beloit...|
|   2|       Red Bluff, CA|
|   1|Birmingham-Hoover...|
|   1|          Eugene, OR|
|   1|Houston-The Woodl...|
+----+--------------------+



                                                                                



In [None]:
%run etl.py

In [None]:
log_df.select('artist').dropDuplicates().toPandas()

In [None]:
from pyspark.sql.functions import desc
import pandas as pd
pd.set_option('max_colwidth', 800)

#Extract data to make songs table:
df = data.select('userId','firstName','lastName','gender','level').orderBy(desc('ts')).dropDuplicates(['userId'])
df.toPandas()
#sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint','s3-us-west-2.amazonaws.com')

#writing to S3 as parquet:
# df.write \
#     .option("header",True) \
#     .partitionBy("year","artist_id") \
#     .parquet('s3a://rambino-output/test-output-parquet')

#This works ^ but it takes FOREVER. I had 23 records in this test output and it still took 26 MINUTES. Absolutely insane.
#To do:
#1. When testing my code, avoid writing to S3 until the last minute.
#2. Try to do some research as to why this is so slow and how to make it work better.

---

#### Package Import

---

In [None]:
from asyncio import as_completed
import concurrent.futures

def func(x,y,z):
    return x+y+z

results = []
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = []
    for cluster in [2,4]:
        futures.append(
            executor.submit(func,cluster,2,3)
        )
    for future in concurrent.futures.as_completed(futures):
        results.append(future.result())

results

In [1]:
import boto3
import configparser

---

#### Loading Credentials from file

---

In [2]:
#AWS Credentials
aws_path = "/home/rambino/.aws/credentials"
aws_cred = configparser.ConfigParser()
aws_cred.read(aws_path)

['/home/rambino/.aws/credentials']

---

#### Create SSH keypair for connecting to EC2 instances

---

In [None]:
ec2 = boto3.client('ec2',
    region_name             = "us-east-1",
    aws_access_key_id       = aws_cred['default']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['default']['aws_secret_access_key']
)

In [None]:
response = ec2.create_key_pair(
    KeyName = 'spark_ec2_key',
    DryRun=False,
    KeyFormat='pem'
)

In [None]:
with open('/home/rambino/.aws/spark_keypair.pem',"w") as file:
    file.writelines(response['KeyMaterial'])

---

#### Setting up VPC for the EMR cluster

---

If no VPC is specified for an EMR cluster, then the cluster is launched in the normal AWS cloud

Creating default VPC:

In [None]:
!aws ec2 create-default-vpc --profile default

Getting **first** subnetId for this VPC:

In [None]:
vpc_output = ec2.describe_vpcs()

#Getting first (and only) VPC:
vpcId = vpc_output['Vpcs'][0]['VpcId']

subnet_output = ec2.describe_subnets(
    Filters=[
        {
            'Name':'vpc-id',
            'Values':[vpcId]
        }
    ]
)

subnetId = subnet_output['Subnets'][0]['SubnetId']

---

#### Creating EMR Cluster

---

**Steps needed to set up and connect to EMR:**
1. set up cluster with correct specifications
2. get 'master public DNS' for the cluster
3. edit security group to allow my computer to connect via SSH (add inbound rule to allow SSH connection from my IP)
   1. Note: Security group is distinct entity from cluster - why not just set this up beforehand?
      1. Note: It IS possible to set up a security group beforehand - and to specify this security group for the master and slave nodes. For a more official setup, it's probably better to do this to ensure that the security group we set up for EMR is custom-defined (and not default).
      2. UPDATE: Well, actually when you CREATE a cluster, security groups are created automatically for the cluster on the default VPC. I could go through the trouble to set up custom security groups *beforehand*, or I could just create the cluster and then change the security groups as needed once they are created. Since I can't think of a reason it would be better to create custom security groups beforehand rather than just edit the ones which are created for me, I will just edit the ones created for me in this code.
4. Set up proxy to access "persistent web UI for Spark"?
   1. This looks like it's for being able to view the Spark UI somehow, but the way they're setting up the proxy settings and filtering URLs seems really hacky (e.g., they're filtering urls matching "http://10.*)". I'm not sure I want to set this up until I know that it's much better than using AWS' built-in UI viewer.
   2. Update: **It turns out that AWS also recommends using FoxyProxy (or other tools) to connect to Spark UIs on EMR**, so I will in fact do this now.
      1. [read more here](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html)

In [None]:
emr = boto3.client('emr',
    region_name             = "us-east-1",
    aws_access_key_id       = aws_cred['default']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['default']['aws_secret_access_key']
)

In [None]:
#With boto3
emr.run_job_flow(
            Name='spark-cluster',
            LogUri='s3://emrlogs/',
            ReleaseLabel='emr-5.28.0',
            Instances={
                'MasterInstanceType': 'm5.xlarge',
                'SlaveInstanceType': 'm5.xlarge',
                'InstanceCount': 4,
                'Ec2KeyName':'spark_ec2_key',
                'KeepJobFlowAliveWhenNoSteps': True
                #'EmrManagedMasterSecurityGroup': security_groups['manager'].id,
                #'EmrManagedSlaveSecurityGroup': security_groups['worker'].id,
            },
            Applications=[
                {
                    "Name":"Spark"
                },
                {
                    "Name":"Zeppelin"
                }
            ],
            JobFlowRole='EMR_EC2_DefaultRole',
            ServiceRole='EMR_DefaultRole',
            VisibleToAllUsers=True
        )

#NOTE: Under the 'Applications' specification of the EMR cluster above, you can also load in applications like
# Spark, TensorFlow, Presto, and Hadoop!

In [None]:
#with AWS CLI:

!aws emr create-cluster --name test-cluster \
    --use-default-roles \
    --release-label emr-5.28.0 \
    --instance-count 4 \
    --applications Name=Spark Name=Zeppelin \
    --ec2-attributes KeyName='spark_ec2_key',SubnetId='subnet-0b6cc9cfba9463659'\
    --instance-type m5.xlarge \
    --log-uri s3://emrlogs/ \
    --visible-to-all-users

---

#### Configuring Cluster

---

In [None]:
cluster_list = emr.list_clusters(
    ClusterStates=['STARTING','RUNNING']
)
print(cluster_list)
cluster_id = cluster_list['Clusters'][0]['Id']

In [None]:
new_cluster = emr.describe_cluster(
    ClusterId = cluster_id
)
new_cluster
secGroup_master = new_cluster['Cluster']['Ec2InstanceAttributes']['EmrManagedMasterSecurityGroup']
cluster_dns = new_cluster['Cluster']['MasterPublicDnsName']


Configure Cluster Security Groups to only accept SSH ingress from my IP address

In [None]:
#Getting my public IP address from config.me website (IP is last element of returned array)
myIP = !curl ifconfig.me
myIP = myIP[-1]

In [None]:
#Specifying internal port (arbitrary?)
myPort = '32'
myCidrIp = myIP + "/" + myPort

In [None]:
response = ec2.authorize_security_group_ingress(
    GroupId=secGroup_master,
    IpPermissions=[
        {
            'FromPort': 22,
            'IpProtocol': 'tcp',
            'IpRanges': [
                {
                    'CidrIp': myCidrIp,
                    'Description': 'SSH access to Spark EMR on AWS from Kevins Computer',
                },
            ],
            'ToPort': 22,
        },
    ],
)

---

#### Interacting with Cluster

---

In [None]:
#File path where cluster login information is kept on my machine:
pem_path = '/home/rambino/.aws/spark_keypair.pem'

Connect to Cluster via SSH

In [None]:
#Command to use in terminal (interactive):
print(f"ssh hadoop@{cluster_dns} -i {pem_path}")

---

#### Proxy connection to allow interaction with Spark UI

---

Setting up FoxyProxy to allow connection to Spark UI from localhost
[AWS Documentation on Port forwarding for EMR connections](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel.html)


I needed to install the browser extension FoxyProxy to allow my browser to interface with the EMR cluster. Once I installed it, I then needed to set up a new proxy with these settings:
- IP address: `localhost`
- Port: `8157` (only needed to match dynamic port forwarding below)

Then, in the 'pattern matching' part, I needed to specify which URLs should be forwarded in this way. This was already specified by Udacity. The json file accompanying this notebook named 'foxyproxy...' shows these patterns.

In [None]:
#Copying credentials file to the master node (not sure why yet)
print(f"scp -i {pem_path} {pem_path} hadoop@{cluster_dns}:/home/hadoop/")

In [None]:
#This sets up port forwarding (somehow) so that data from our local machine on port 8157 is forwarded to the master node (allowing interactivity)
#NOTE: Terminal remains open when this request succeeds - and needs to remain running while accessing Spark UI

#Note: Getting this SSH connection to work has been unpredictable at times. Often get 'connection refused' errors, but then it
#suddenly works. Should ideally figure out what's going on there...

print(f"ssh -v -i {pem_path} -N -D 127.0.0.1:8157 hadoop@{cluster_dns}")

#### Accessing Spark UI:
- Base URL:           http://ec2-54-87-42-167.compute-1.amazonaws.com

- Spark History:      http://ec2-54-87-42-167.compute-1.amazonaws.com:18080/
- YARN Node Manager:  http://ec2-54-87-42-167.compute-1.amazonaws.com:8042/


[See more ports here](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html)

---

#### Deleting EMR Cluster (Teardown)

---

In [None]:
emr.terminate_job_flows(
    JobFlowIds=[
        cluster_id
    ]
)