## EMR Setup with Python SDK (boto3)
This notebook shows how to set up an EMR cluster and its supporting AWS resources (EC2 key pair, VPC subnet, security group rule) using boto3, the Python SDK for AWS.

Boto3 Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html

---

To do (Sept. 30):
1. I saved the output of a local ETL to the _out folder. Take a look at it and see if the data looks right
   1. ~~Why are so many entries missing 'song_id' and 'artist_id'? (333 / 6280)~~
   2. ~~Query the data and see what kind of results I get (Compared to one query for Redshift - I GET THE SAME RESULTS!! Looks like I (probably) did it right)~~
   3. ~~take a look at double-checks I did for Redshift project - any I should implement here?~~
      1. ~~Yes, need: unique **songs, users and artists** - should implement this check in notebook after running ETL locally.~~
2. ~~Run the etl.py again with limited data (Nov. 22 has at least 1 match in song + artist - use that?)~~
3. ~~Clean up this notebook - should have EMR creation code + code to pull in data from S3 and inspect it.~~
4. ~~Test writing as parquet~~
5. Cleanup
   1. ~~Clean etl.py file to not have any errant comments or code. Docstrings in place?~~
   2. ~~Clean EMR_boto3Setup notebook so that testing code is neatly organized or in separate notebook.~~
   3. Delete _out folder with test data
6. Finish rest of this notebook to spin up EMR
7. Use built-in notebook to run low-data code once
8. Upload .py to EMR via SSH and run

---

## EMR Setup with Boto3

---

#### Package Import

---

In [None]:
import boto3
import configparser

---

#### Loading Credentials from file

---

In [None]:
#AWS Credentials
aws_path = "/home/rambino/.aws/credentials"
aws_cred = configparser.ConfigParser()
aws_cred.read(aws_path)
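One thing to keep in mind with the cell above: `ConfigParser.read` does not raise if the file is missing, so a wrong path only shows up later as a `KeyError`. The sketch below parses the same INI layout from an in-memory string and checks for the profile explicitly. The helper name `load_profile` and the sample key values are hypothetical placeholders, not part of the original notebook.

```python
import configparser

# Hypothetical sample in the same INI layout as ~/.aws/credentials;
# the values are placeholders, not real credentials.
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = AKIAEXAMPLEKEYID
aws_secret_access_key = exampleSecretAccessKey
"""


def load_profile(text, profile="default"):
    """Return (access_key_id, secret_access_key) for one profile."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Fail early with a clear message if the profile section is absent.
    if profile not in parser:
        raise KeyError(f"profile {profile!r} not found in credentials")
    section = parser[profile]
    return section["aws_access_key_id"], section["aws_secret_access_key"]


key_id, secret = load_profile(SAMPLE_CREDENTIALS)
```

With a real file, the same check can be applied after `aws_cred.read(aws_path)` by testing `'default' in aws_cred`.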

---

#### Create SSH keypair for connecting to EC2 instances

---

In [None]:
ec2 = boto3.client('ec2',
    region_name             = "us-east-1",
    aws_access_key_id       = aws_cred['default']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['default']['aws_secret_access_key']
)

In [None]:
response = ec2.create_key_pair(
    KeyName = 'spark_ec2_key',
    DryRun=False,
    KeyFormat='pem'
)

In [None]:
# Save the private key now: AWS returns it only once, at creation time
with open('/home/rambino/.aws/spark_keypair.pem', "w") as file:
    file.write(response['KeyMaterial'])
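The `ssh` client usually refuses a private key file that other users can read, so it may help to create the file with owner-only permissions from the start. The following is a minimal sketch of that idea on a POSIX system. The helper `write_private_key` is hypothetical, and the demonstration writes placeholder text to a temporary directory rather than the real key.

```python
import os
import stat
import tempfile


def write_private_key(path, key_material):
    """Create `path` with mode 0o600 and write the PEM text to it."""
    # O_EXCL makes the call fail instead of overwriting an existing key file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(key_material)


# Demonstration with placeholder text in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_keypair.pem")
write_private_key(demo_path, "-----BEGIN RSA PRIVATE KEY-----\n(placeholder)\n")

# Permission bits of the new file (group and other bits should be zero).
demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
```

In the notebook, the call would take the `.pem` path and `response['KeyMaterial']` as arguments.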

---

#### Setting up VPC for the EMR cluster

---

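If no subnet is passed via `Ec2SubnetId`, the cluster is not launched in a VPC of your choosing (on accounts that still support EC2-Classic it launches there). The cells below therefore create a default VPC and look up one of its subnets.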

Creating default VPC:

In [None]:
!aws ec2 create-default-vpc --profile default

Getting **first** subnetId for this VPC:

In [None]:
vpc_output = ec2.describe_vpcs()

#Getting first (and only) VPC:
vpcId = vpc_output['Vpcs'][0]['VpcId']

subnet_output = ec2.describe_subnets(
    Filters=[
        {
            'Name':'vpc-id',
            'Values':[vpcId]
        }
    ]
)

subnetId = subnet_output['Subnets'][0]['SubnetId']
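Indexing `['Vpcs'][0]` works when the account has exactly one VPC in the region. If there are several, a safer variation is to pick the one flagged `IsDefault` in the `describe_vpcs` response. The sketch below shows this on a hand-written response dictionary. The helper name and the VPC IDs are made up for illustration.

```python
def pick_default_vpc_id(describe_vpcs_response):
    """Return the VpcId of the VPC marked as default, or raise if none is."""
    for vpc in describe_vpcs_response.get("Vpcs", []):
        if vpc.get("IsDefault"):
            return vpc["VpcId"]
    raise LookupError("no default VPC found in this region")


# Hypothetical response with two VPCs; only the second one is the default.
sample_response = {
    "Vpcs": [
        {"VpcId": "vpc-0aaa", "IsDefault": False},
        {"VpcId": "vpc-0bbb", "IsDefault": True},
    ]
}

default_vpc_id = pick_default_vpc_id(sample_response)
```

In the notebook this would be called as `pick_default_vpc_id(ec2.describe_vpcs())`.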

---

#### Creating EMR Cluster

---

In [None]:
emr = boto3.client('emr',
    region_name             = "us-east-1",
    aws_access_key_id       = aws_cred['default']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['default']['aws_secret_access_key']
)

In [None]:
#With boto3
emr.run_job_flow(
            Name='spark-cluster',
            LogUri='s3://emrlogs-rambino/',
            ReleaseLabel='emr-5.36.0',
            Instances={
                'MasterInstanceType': 'm5.xlarge',
                'SlaveInstanceType': 'm5.xlarge',
                'InstanceCount': 3,
                'Ec2KeyName':'spark_ec2_key',
                'KeepJobFlowAliveWhenNoSteps': True,
                'Ec2SubnetId': subnetId
            },
            Applications=[
                {
                    "Name":"Spark"
                },
                {
                    "Name":"Zeppelin"
                },
                {
                    "Name":"Hadoop"
                },
                {
                    "Name":"Ganglia"
                },
                {
                    "Name":"Livy"
                },
                {
                    "Name":"JupyterEnterpriseGateway"
                }
            ],
            Configurations=[
                {
                    #Note: setting timeout of 'livy' to be longer to try to fix 'session not active' errors
                    'Classification': 'livy-conf',
                    'Properties': {'livy.server.session.timeout':'3h'}
                }
            ],
            JobFlowRole='EMR_EC2_DefaultRole',
            ServiceRole='EMR_DefaultRole',
            VisibleToAllUsers=True,
            AutoTerminationPolicy={
                'IdleTimeout': 1800
            }
        )
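`run_job_flow` returns the new cluster's ID under `JobFlowId`, and the EMR client offers a `cluster_running` waiter that blocks until the cluster is usable. A possible way to combine the two is sketched below. The function `launch_and_wait` and the stub classes are hypothetical. The stubs only exist so the sketch runs without AWS access, and a real `boto3.client('emr')` would be passed instead.

```python
def launch_and_wait(emr_client, job_flow_kwargs):
    """Start a cluster, block until it is usable, and return its ID."""
    response = emr_client.run_job_flow(**job_flow_kwargs)
    cluster_id = response["JobFlowId"]
    # The "cluster_running" waiter polls until the cluster reaches
    # RUNNING or WAITING, and raises if it terminates instead.
    emr_client.get_waiter("cluster_running").wait(ClusterId=cluster_id)
    return cluster_id


class _StubWaiter:
    """Hypothetical stand-in for a boto3 waiter; returns immediately."""

    def wait(self, ClusterId):
        return None


class _StubEmrClient:
    """Hypothetical stand-in for boto3.client('emr')."""

    def run_job_flow(self, **kwargs):
        return {"JobFlowId": "j-EXAMPLE12345"}

    def get_waiter(self, name):
        assert name == "cluster_running"
        return _StubWaiter()


demo_cluster_id = launch_and_wait(_StubEmrClient(), {"Name": "spark-cluster"})
```

Keeping the returned ID avoids having to look the cluster up again through `list_clusters`.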


---

#### Configuring Cluster

---

In [None]:
cluster_list = emr.list_clusters(
    ClusterStates=['STARTING','RUNNING','WAITING']
)
print(cluster_list)
cluster_id = cluster_list['Clusters'][0]['Id']
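Taking `['Clusters'][0]` assumes that only one active cluster exists. If that might not hold, one option is to select the cluster by the name given at creation. The sketch below does this on a hand-written `list_clusters`-style dictionary. The helper name and the cluster IDs are invented for the example.

```python
def find_cluster_id(list_clusters_response, name):
    """Return the ID of the single cluster called `name`."""
    matches = [
        cluster["Id"]
        for cluster in list_clusters_response.get("Clusters", [])
        if cluster.get("Name") == name
    ]
    # Zero or several matches are both ambiguous, so refuse to guess.
    if len(matches) != 1:
        raise LookupError(f"expected 1 cluster named {name!r}, found {len(matches)}")
    return matches[0]


# Hypothetical response containing two active clusters.
sample_clusters = {
    "Clusters": [
        {"Id": "j-AAAA", "Name": "other-cluster"},
        {"Id": "j-BBBB", "Name": "spark-cluster"},
    ]
}

found_id = find_cluster_id(sample_clusters, "spark-cluster")
```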

In [None]:
new_cluster = emr.describe_cluster(
    ClusterId = cluster_id
)
print(new_cluster)
secGroup_master = new_cluster['Cluster']['Ec2InstanceAttributes']['EmrManagedMasterSecurityGroup']
iam_service_role = new_cluster['Cluster']['Ec2InstanceAttributes']['IamInstanceProfile']
cluster_dns = new_cluster['Cluster']['MasterPublicDnsName']

Configure Cluster Security Groups to only accept SSH ingress from my IP address

In [None]:
#Getting my public IP address from the ifconfig.me website (IP is last element of returned list)
myIP = !curl ifconfig.me
myIP = myIP[-1]

In [None]:
#CIDR prefix length: /32 matches exactly one IPv4 address (this is a netmask, not a port)
prefixLength = '32'
myCidrIp = myIP + "/" + prefixLength
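The suffix after the slash is a CIDR prefix length: `/32` covers exactly one IPv4 address. Because the address comes from an external service, it can be worth validating before it goes into a security group rule. The standard-library `ipaddress` module can do that. The helper name below is hypothetical, and the addresses come from documentation-only ranges.

```python
import ipaddress


def single_host_cidr(ip_text):
    """Validate an IP address string and return it as a one-host CIDR block."""
    # ip_address raises ValueError for anything that is not a valid IP.
    address = ipaddress.ip_address(ip_text.strip())
    prefix = 32 if address.version == 4 else 128
    return f"{address}/{prefix}"


example_cidr = single_host_cidr("203.0.113.7\n")
print(example_cidr)  # 203.0.113.7/32
```

Note that an IPv6 result would belong in `Ipv6Ranges` / `CidrIpv6` rather than `IpRanges` / `CidrIp` in the ingress call.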

In [None]:
response = ec2.authorize_security_group_ingress(
    GroupId=secGroup_master,
    IpPermissions=[
        {
            'FromPort': 22,
            'IpProtocol': 'tcp',
            'IpRanges': [
                {
                    'CidrIp': myCidrIp,
                    'Description': 'SSH access to Spark EMR on AWS from Kevins Computer',
                },
            ],
            'ToPort': 22,
        },
    ],
)

response

---

#### Running notebook on EMR cluster

---

Note: I uploaded the notebook to EMR manually, but this could also be done via an S3 upload with boto3.
Similarly, the service role used below was created automatically with this cluster, but I could create my own custom role with IAM and boto3 if needed.

In [None]:
#'s3://aws-emr-resources-549653882425-us-east-1/notebooks/e-B41LV1OZ58I8ZG299XTVEG6Y0/emr_spark_code.ipynb',

#Starting EMR notebook
response = emr.start_notebook_execution(
    EditorId='e-B41LV1OZ58I8ZG299XTVEG6Y0',
    RelativePath='emr_spark_code.ipynb',
    ExecutionEngine={
        'Id': cluster_id,
        'Type': 'EMR'
    },
    ServiceRole="EMR_Notebooks_DefaultRole"#iam_service_role
)

exec_id = response['NotebookExecutionId']
print(response)

In [None]:
#Checking execution status:
response = emr.describe_notebook_execution(
    NotebookExecutionId=exec_id
)
response

---

#### Interacting with Cluster

---

Uploading Python Notebook to S3 for EMR usage

Connect to Cluster via SSH

In [None]:
#File path where cluster login information is kept on my machine:
pem_path = '/home/rambino/.aws/spark_keypair.pem'

In [None]:
#Command to use in terminal (interactive):
print(f"ssh hadoop@{cluster_dns} -i {pem_path}")

---

#### Proxy connection to allow interaction with Spark UI

---

In [None]:
#Copy the key pair file (.pem) to the master node
print(f"scp -i {pem_path} {pem_path} hadoop@{cluster_dns}:/home/hadoop/")

In [None]:
#Open an SSH tunnel with dynamic port forwarding (-D): ssh runs a local SOCKS proxy on port 8157 and relays its traffic through the master node, so a browser using that proxy can reach the Spark UI
#NOTE: The terminal stays open when this command succeeds (-N runs no remote command) and must keep running while accessing the Spark UI

print(f"ssh -v -i {pem_path} -N -D 127.0.0.1:8157 hadoop@{cluster_dns}")
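If the tunnel should be started from Python instead of pasted into a terminal, the command can be assembled as an argument list, which avoids quoting problems with paths or host names. This is a minimal sketch: the helper name, the example host, and the key path are hypothetical, and the flags are the same `-i`, `-N` and `-D` used above.

```python
import shlex


def socks_tunnel_command(pem_path, host, local_port=8157, user="hadoop"):
    """Build the ssh argument list for a local SOCKS proxy through `host`."""
    return [
        "ssh",
        "-i", pem_path,                    # private key for authentication
        "-N",                              # run no remote command, tunnel only
        "-D", f"127.0.0.1:{local_port}",   # dynamic (SOCKS) port forwarding
        f"{user}@{host}",
    ]


# Hypothetical host name and key path, for illustration only.
tunnel_argv = socks_tunnel_command("/tmp/demo_keypair.pem", "ec2-host.example.com")
print(shlex.join(tunnel_argv))
```

The list could then be handed to `subprocess.Popen(tunnel_argv)` to keep the tunnel running in the background while the browser uses the proxy at `127.0.0.1:8157`.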

---

#### Deleting EMR Cluster (Teardown)

---

In [None]:
emr.terminate_job_flows(
    JobFlowIds=[
        cluster_id
    ]
)

---

## Testing

In [None]:
read_path_prefix = "./_out/"

---

### Users

---

In [None]:
users = spark.read \
    .format('csv') \
    .option('header',True) \
    .load(read_path_prefix + "users")

users.createOrReplaceTempView('users_tbl')

In [None]:
# Do we have any duplicate userIds?
spark.sql('''
SELECT userId, COUNT(userId) count
FROM users_tbl
GROUP BY userId
ORDER BY count DESC
LIMIT 5
''').show()

---

### Songs

---

In [None]:
songs = spark.read \
    .format('csv') \
    .option('header',True) \
    .load(read_path_prefix + "songs")

songs.createOrReplaceTempView('songs_tbl')

In [None]:
# Do we have any duplicate song Ids?
spark.sql('''
SELECT song_id, COUNT(song_id) count
FROM songs_tbl
GROUP BY song_id
ORDER BY count DESC
LIMIT 5
''').show()

---

### Artists

---

In [None]:
artists = spark.read \
    .format('csv') \
    .option('header',True) \
    .load(read_path_prefix + "artists")

artists.createOrReplaceTempView('artists_tbl')

In [None]:
# Do we have any duplicate artist Ids?
spark.sql('''
SELECT artist_id, COUNT(artist_id) count
FROM artists_tbl
GROUP BY artist_id
ORDER BY count DESC
LIMIT 5
''').show()
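The three uniqueness checks above differ only in table and key column, so the query text could be generated by one helper. The sketch below is a variation rather than the queries used above: it adds `HAVING COUNT(*) > 1`, so an empty result means every key is unique. The function name is hypothetical, and the identifiers are interpolated directly, so it should only be used with trusted table and column names.

```python
def duplicate_check_sql(table, key, limit=5):
    """Return a query listing key values that occur more than once."""
    return (
        f"SELECT {key}, COUNT(*) AS n_rows\n"
        f"FROM {table}\n"
        f"GROUP BY {key}\n"
        f"HAVING COUNT(*) > 1\n"
        f"ORDER BY n_rows DESC\n"
        f"LIMIT {limit}"
    )


# Build one query per dimension table checked above.
checks = {
    name: duplicate_check_sql(f"{name}_tbl", key)
    for name, key in [("users", "userId"), ("songs", "song_id"), ("artists", "artist_id")]
}
print(checks["songs"])
```

Each query string could then be run with `spark.sql(checks["songs"]).show()`, assuming the temp views have been registered as in the cells above.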

In [None]:
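# Assumes `songplays_schema` (a StructType for the songplays output) was defined earlier in the session
songplays = spark.read \
    .format('csv') \
    .schema(songplays_schema) \
    .option('header',True) \
    .load(read_path_prefix + "songplays")

# Register the view that the analytics query reads from
songplays.createOrReplaceTempView('play_tbl')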

## Sample Analytics

In [None]:
#Example analytics: count plays per location after the cut-off 1543532400 (2018-11-29 23:00 UTC), most frequent first

spark.sql('''
SELECT count(*) AS freq, location
FROM play_tbl
WHERE song_id IS NOT NULL
AND (ts/1000) > 1543532400
GROUP BY location
ORDER BY freq DESC
''').show()
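The cut-off in the `WHERE` clause is a Unix timestamp in seconds (`ts` is divided by 1000 because it is stored in milliseconds). Converting the constant back to a date is a quick way to confirm which instant the filter actually uses. The sketch below does this in UTC and also shows how a cut-off for a chosen date could be derived. The variable names are illustrative only.

```python
from datetime import datetime, timezone

CUTOFF_SECONDS = 1543532400  # constant used in the query above

# Convert the epoch value to a timezone-aware UTC datetime.
cutoff_utc = datetime.fromtimestamp(CUTOFF_SECONDS, tz=timezone.utc)
print(cutoff_utc.isoformat())  # 2018-11-29T23:00:00+00:00

# Going the other way: epoch seconds for midnight UTC on a chosen date.
nov_11_seconds = int(datetime(2018, 11, 11, tzinfo=timezone.utc).timestamp())
print(nov_11_seconds)
```

Using an explicit time zone avoids the result depending on the local settings of the machine that runs the notebook.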