## EMR Setup with Python SDK (boto3)
This notebook shows how to set up the AWS resources for an EMR Spark cluster (an EC2 key pair, a VPC subnet, and the cluster itself) using boto3, the AWS SDK for Python.

Boto3 Documentation (EMR): https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html

---

#### Package Import

---

In [3]:
import boto3
import configparser

---

#### Loading Credentials from file

---

In [4]:
#AWS Credentials
aws_path = "/home/rambino/.aws/credentials"
aws_cred = configparser.ConfigParser()
aws_cred.read(aws_path)


['/home/rambino/.aws/credentials']
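
The same three arguments (region, access key, secret key) are passed to every `boto3.client` call further down. A minimal sketch of a helper that builds them once and fails early when the file or profile is missing. The name `client_kwargs` is hypothetical and not part of boto3; the default region is an assumption taken from the cells below.

```python
import configparser


def client_kwargs(cred_path, profile, region="us-east-1"):
    """Build keyword arguments for boto3.client from one credentials profile.

    Hypothetical helper for this notebook, not a boto3 function.
    """
    parser = configparser.ConfigParser()
    # ConfigParser.read returns the list of files it parsed;
    # an empty list means the path could not be read.
    if not parser.read(cred_path):
        raise FileNotFoundError(f"Could not read credentials file: {cred_path}")
    if not parser.has_section(profile):
        raise KeyError(f"Profile {profile!r} not found in {cred_path}")
    section = parser[profile]
    return {
        "region_name": region,
        "aws_access_key_id": section["aws_access_key_id"],
        "aws_secret_access_key": section["aws_secret_access_key"],
    }
```

Usage would look like `ec2 = boto3.client('ec2', **client_kwargs(aws_path, 'udacity_course'))`, and the same for `'emr'`.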

---

#### Create SSH keypair for connecting to EC2 instances

---

In [5]:
ec2 = boto3.client('ec2',
    region_name             = "us-east-1",
    aws_access_key_id       = aws_cred['udacity_course']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['udacity_course']['aws_secret_access_key']
)

In [10]:
response = ec2.create_key_pair(
    KeyName = 'spark_ec2_key',
    DryRun=False,
    KeyFormat='pem'
)

In [25]:
# KeyMaterial is a single PEM string, so write() is the right call here.
with open('/home/rambino/.aws/spark_keypair.pem', "w") as file:
    file.write(response['KeyMaterial'])
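
The cell above writes the private key with default file permissions, and `ssh` typically refuses private keys that other users can read. A sketch of a stricter write, assuming a POSIX file system; `save_private_key` is a hypothetical helper name.

```python
import os


def save_private_key(key_material, path):
    """Write PEM key material to `path` with owner-only permissions.

    Hypothetical helper; assumes a POSIX file system.
    """
    # O_EXCL makes the call fail instead of overwriting an existing key file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as key_file:
        key_file.write(key_material)
    # Tighten to read-only for the owner once the content is written.
    os.chmod(path, 0o400)
    return path
```

Usage would be `save_private_key(response['KeyMaterial'], '/home/rambino/.aws/spark_keypair.pem')`. Note that `create_key_pair` returns the key material only once, so failing on an existing file is a deliberate choice here.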

---

#### Setting up VPC for the EMR cluster

---

An EMR cluster runs inside a VPC subnet. If no subnet is specified for the cluster, EMR picks one itself, which generally relies on the account having a default VPC in the region.

Creating default VPC:

In [55]:
!aws ec2 create-default-vpc --profile default


An error occurred (DefaultVpcAlreadyExists) when calling the CreateDefaultVpc operation: A Default VPC already exists for this account in this region.


Getting **first** subnetId for this VPC:

In [53]:
output = ec2.describe_vpcs()

#Getting first (and only) VPC:
vpcId = output['Vpcs'][0]['VpcId']

output = ec2.describe_subnets(
    Filters=[
        {
            'Name':'vpc-id',
            'Values':[vpcId]
        }
    ]
)

subnetId = output['Subnets'][0]['SubnetId']
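
Taking `output['Vpcs'][0]` only works while the account has a single VPC. A sketch that selects the default VPC by its `IsDefault` flag instead and checks for empty results. The function name is hypothetical, and pagination of the two describe calls is ignored for brevity.

```python
def first_subnet_of_default_vpc(ec2):
    """Return (vpc_id, subnet_id) for the first subnet of the default VPC.

    `ec2` is a boto3 EC2 client. Hypothetical helper, not part of boto3.
    """
    vpcs = ec2.describe_vpcs()["Vpcs"]
    # Each VPC description carries an IsDefault flag;
    # select on that rather than on list position.
    default_vpcs = [vpc for vpc in vpcs if vpc.get("IsDefault")]
    if not default_vpcs:
        raise LookupError("No default VPC found in this region")
    vpc_id = default_vpcs[0]["VpcId"]

    subnets = ec2.describe_subnets(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )["Subnets"]
    if not subnets:
        raise LookupError(f"VPC {vpc_id} has no subnets")
    return vpc_id, subnets[0]["SubnetId"]
```

Usage would be `vpcId, subnetId = first_subnet_of_default_vpc(ec2)`.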

---

#### Setting up EMR Cluster

---

In [26]:
emr = boto3.client('emr',
    region_name             = "us-east-1",
    aws_access_key_id       = aws_cred['udacity_course']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['udacity_course']['aws_secret_access_key']
)

In [27]:
response = emr.describe_cluster(
    ClusterId='j-F5E5B0USH8J8'
)
response

{'Cluster': {'Id': 'j-F5E5B0USH8J8',
  'Name': 'spark-cluster',
  'Status': {'State': 'STARTING',
   'StateChangeReason': {},
   'Timeline': {'CreationDateTime': datetime.datetime(2022, 9, 10, 20, 1, 26, 677000, tzinfo=tzlocal())}},
  'Ec2InstanceAttributes': {'Ec2KeyName': 'spark_ec2_key',
   'Ec2SubnetId': 'subnet-091ed9f12b84d7496',
   'RequestedEc2SubnetIds': ['subnet-091ed9f12b84d7496'],
   'Ec2AvailabilityZone': 'us-east-1f',
   'RequestedEc2AvailabilityZones': [],
   'IamInstanceProfile': 'EMR_EC2_DefaultRole',
   'EmrManagedMasterSecurityGroup': 'sg-0febcf83330efd2ef',
   'EmrManagedSlaveSecurityGroup': 'sg-09710e98a447090d5'},
  'InstanceCollectionType': 'INSTANCE_GROUP',
  'LogUri': 's3n://aws-logs-549653882425-us-east-1/elasticmapreduce/',
  'ReleaseLabel': 'emr-5.20.0',
  'AutoTerminate': False,
  'TerminationProtected': False,
  'VisibleToAllUsers': True,
  'Applications': [{'Name': 'Ganglia', 'Version': '3.7.2'},
   {'Name': 'Spark', 'Version': '2.4.0'},
   {'Name': 'Zepp

In [29]:
emr.run_job_flow(
            Name='spark-cluster-test',
            LogUri='s3://emrlogs/',
            ReleaseLabel='emr-5.28.0',
            Instances={
                'MasterInstanceType': 'm5.xlarge',
                'SlaveInstanceType': 'm5.xlarge',
                'InstanceCount': 4,
                'Ec2KeyName':'spark_ec2_key',
                'KeepJobFlowAliveWhenNoSteps': True
                #'EmrManagedMasterSecurityGroup': security_groups['manager'].id,
                #'EmrManagedSlaveSecurityGroup': security_groups['worker'].id,
            },
            Applications=[
                {
                    "Name":"Spark"
                },
                {
                    "Name":"Zeppelin"
                }
            ],
            JobFlowRole='EMR_EC2_DefaultRole',
            ServiceRole='EMR_DefaultRole',
            VisibleToAllUsers=True
        )

{'JobFlowId': 'j-MXUROK0Y3TK9',
 'ClusterArn': 'arn:aws:elasticmapreduce:us-east-1:549653882425:cluster/j-MXUROK0Y3TK9',
 'ResponseMetadata': {'RequestId': 'efb07b3c-f06a-4566-81c8-d6f3e6f4c216',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'efb07b3c-f06a-4566-81c8-d6f3e6f4c216',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '116',
   'date': 'Sat, 10 Sep 2022 18:20:07 GMT'},
  'RetryAttempts': 0}}
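
The `run_job_flow` call above does not say which subnet to use, even though `subnetId` was looked up earlier. A sketch of an `Instances` argument that pins the cluster to that subnet through the `Ec2SubnetId` key. The helper name and its defaults are assumptions mirroring the values used in this notebook.

```python
def job_flow_instances(key_name, subnet_id, instance_count=4,
                       instance_type="m5.xlarge"):
    """Build the `Instances` argument for emr.run_job_flow with an explicit subnet.

    Hypothetical helper; defaults mirror the values used in this notebook.
    """
    return {
        "MasterInstanceType": instance_type,
        "SlaveInstanceType": instance_type,
        "InstanceCount": instance_count,
        "Ec2KeyName": key_name,
        # Launch the cluster in this specific VPC subnet.
        "Ec2SubnetId": subnet_id,
        "KeepJobFlowAliveWhenNoSteps": True,
    }
```

It would be passed as `Instances=job_flow_instances('spark_ec2_key', subnetId)`, with the remaining `run_job_flow` arguments unchanged.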

Alternatively, create the EMR cluster via the command line:

In [60]:
subnetId

'subnet-0b6cc9cfba9463659'

In [70]:
!aws emr create-cluster --name spark-cluster \
    --use-default-roles \
    --release-label emr-5.28.0 \
    --instance-count 3 \
    --applications Name=Spark Name=Zeppelin \
    --ec2-attributes KeyName='spark_ec2_key',SubnetId='subnet-0b6cc9cfba9463659' \
    --instance-type m5.xlarge \
    --log-uri s3://emrlogs/

{
    "ClusterId": "j-253YWRF4ZNWTJ",
    "ClusterArn": "arn:aws:elasticmapreduce:us-east-1:549653882425:cluster/j-253YWRF4ZNWTJ"
}
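
Whichever way the cluster is created, it starts in the `STARTING` state (as in the `describe_cluster` output earlier) and needs several minutes before it accepts work. A sketch of a polling loop over `describe_cluster`; the function name, poll interval, and poll limit are assumptions. boto3 also provides a built-in waiter, `emr.get_waiter('cluster_running')`, which may be the simpler option.

```python
import time

READY_STATES = {"RUNNING", "WAITING"}
FAILED_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_ready(emr, cluster_id, poll_seconds=30, max_polls=60,
                     sleep=time.sleep):
    """Poll describe_cluster until the cluster is RUNNING or WAITING.

    `emr` is a boto3 EMR client. Hypothetical helper; the interval and
    poll limit are arbitrary defaults.
    """
    for _ in range(max_polls):
        response = emr.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if state in READY_STATES:
            return state
        if state in FAILED_STATES:
            raise RuntimeError(f"Cluster {cluster_id} ended in state {state}")
        # Still STARTING or BOOTSTRAPPING: wait before asking again.
        sleep(poll_seconds)
    raise TimeoutError(f"Cluster {cluster_id} not ready after {max_polls} polls")
```

Usage would be `wait_until_ready(emr, 'j-253YWRF4ZNWTJ')`, using the cluster ID returned above.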
