## Script to upload files to EMR cluster

---

#### Module import

---

In [1]:
import boto3
import configparser

---

#### Credential Upload

---

In [2]:
#AWS Credentials
aws_path = "/home/rambino/.aws/credentials"
aws_cred = configparser.ConfigParser()
aws_cred.read(aws_path)

['/home/rambino/.aws/credentials']
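
`ConfigParser.read` does not raise when a file is missing; it returns the list of files it managed to parse, which is why the output above is worth checking. A minimal sketch of a loader that fails early instead, assuming the same credentials file layout (the function name `load_aws_credentials` and its `profile` parameter are hypothetical, not part of this notebook):

```python
import configparser


def load_aws_credentials(path, profile="default"):
    """Return (access_key_id, secret_access_key) for one profile of an AWS credentials file."""
    parser = configparser.ConfigParser()
    # read() returns the list of files it parsed; an empty list means nothing was read.
    if not parser.read(path):
        raise FileNotFoundError(f"Could not read credentials file: {path}")
    if profile not in parser:
        raise KeyError(f"Profile '{profile}' not found in {path}")
    section = parser[profile]
    return section["aws_access_key_id"], section["aws_secret_access_key"]
```

The returned pair could then be passed to `boto3.client` in place of the direct `aws_cred['default'][...]` lookups.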

---

#### Uploading data files to S3

---

In [3]:
s3 = boto3.client('s3',
    region_name             = "us-east-1",
    aws_access_key_id       = aws_cred['default']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['default']['aws_secret_access_key']
)

Creating Bucket

In [8]:
bucketName = 'emr-input-output'

In [6]:
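response = s3.create_bucket(
    ACL='public-read-write',  # NOTE: anyone can read and write this bucket; 'private' is the safer choice
    Bucket=bucketName,
    CreateBucketConfiguration={
        # The bucket is created in us-east-2, although the client above targets us-east-1
        'LocationConstraint': "us-east-2"
    }
)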
response

{'ResponseMetadata': {'RequestId': 'MFZNCTA4V619TW2P',
  'HostId': 'xictsv1/1GglA/aH5tCI8WrWM7Xpu1g4XmPcP5Zw11J2xDuA6+O5Lz5M/r1zPJkNQv+IfZ5AoPI=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'xictsv1/1GglA/aH5tCI8WrWM7Xpu1g4XmPcP5Zw11J2xDuA6+O5Lz5M/r1zPJkNQv+IfZ5AoPI=',
   'x-amz-request-id': 'MFZNCTA4V619TW2P',
   'date': 'Wed, 14 Sep 2022 15:26:36 GMT',
   'location': 'http://emr-input-output.s3.amazonaws.com/',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'Location': 'http://emr-input-output.s3.amazonaws.com/'}
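
One detail of `create_bucket` is easy to trip over: `us-east-1` is the default location and must not be passed as a `LocationConstraint`, while every other region needs one. A small sketch of a helper that builds the arguments either way (the name `build_create_bucket_kwargs` is hypothetical, and the ACL is left out on purpose so the bucket keeps the default private setting):

```python
def build_create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for S3 create_bucket for the given region."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs
```

With the client from this notebook, the call would look like `s3.create_bucket(**build_create_bucket_kwargs(bucketName, "us-east-2"))`.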

Uploading files to bucket

In [16]:
#Configuring location of resources:
s3_input_path = "input/cities.csv"
s3_output_path = "output/"
s3_config_path = "config/testPaths.cfg"

In [18]:
#Input data

with open('cities.csv', 'rb') as data:
    # upload_fileobj returns None, so there is no response to keep
    s3.upload_fileobj(data, bucketName, s3_input_path)

In [21]:
#Adjusting config file to reflect input + output data path:
cfg_path = 'testPaths.cfg'

path = configparser.ConfigParser()
path.read(cfg_path)

path['PATHS']['input'] = s3_input_path
path['PATHS']['output'] = s3_output_path

#Saving Config file:
with open(cfg_path,"w") as file:
    path.write(file)


In [22]:
#Config data
with open(cfg_path, 'rb') as data:
    # upload_fileobj returns None, so there is no response to keep
    s3.upload_fileobj(data, bucketName, s3_config_path)
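
The two uploads above follow the same pattern, so they could be folded into one loop. A sketch using the client's `upload_file` method, which takes a local filename and opens the file itself (the helper name `upload_files` and the dictionary layout are assumptions made for illustration):

```python
def upload_files(s3_client, bucket, files):
    """Upload each local file to its S3 key; `files` maps local path -> S3 key."""
    uploaded = []
    for local_path, key in files.items():
        # upload_file opens and streams the file, so no explicit open() is needed.
        s3_client.upload_file(local_path, bucket, key)
        uploaded.append(f"s3://{bucket}/{key}")
    return uploaded
```

In this notebook that would be `upload_files(s3, bucketName, {'cities.csv': s3_input_path, 'testPaths.cfg': s3_config_path})`, which returns the full `s3://` locations of the uploaded objects.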

---

#### Uploading script to EMR

---

Now that all of my resources are in S3, I just need to upload my Python script to my EMR cluster and submit the Spark job.

First, I upload my Python script to the EMR master node by running this `scp` command from my local machine (the paths are absolute, so it can be run from any directory):

```
scp -i /home/rambino/.aws/spark_keypair.pem /home/rambino/dev/DataEngineering_Udacity/05_Spark_DataLakes/AWS_EMR_Practice/emr_upload_files/emr_sparkTest.py hadoop@[EC2-INSTANCE-DNS-HERE]:/home/hadoop
```

Next, I'll need to SSH into the EMR master node and:
1. First figure out where the `spark-submit` command is by running `which spark-submit`
2. Run this command on my new script and see the output (substituting spark-submit path as needed):

`/usr/bin/spark-submit --master yarn emr_sparkTest.py`

##### Post-exercise Notes
1. It does not seem possible to install Python modules on the EMR master node from within an SSH session (I tried to install `configparser`, but got 'permission denied' errors).
2. If I run this code again, I will need to either:
   1. Set up the cluster so that `configparser` is installed on it, or
   2. Change the code so that it does not require this package.
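
The second option could be sketched with `argparse`, which ships with the Python standard library (2.7 and 3.x), so nothing has to be installed on the cluster. The `--input` and `--output` flag names and the `parse_paths` helper are hypothetical; the actual Spark script is not shown in this notebook.

```python
import argparse


def parse_paths(argv=None):
    """Read the input and output locations from the command line instead of a config file."""
    parser = argparse.ArgumentParser(description="Input/output locations for the Spark job")
    parser.add_argument("--input", required=True, help="location of the input data")
    parser.add_argument("--output", required=True, help="location to write results to")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)
```

Arguments placed after the script name are passed through to the script by `spark-submit`, so the job could be started like this:

```
/usr/bin/spark-submit --master yarn emr_sparkTest.py --input s3://emr-input-output/input/cities.csv --output s3://emr-input-output/output/
```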

---

#### Using HDFS instead of S3

---

While S3 is very convenient, there might be times when using HDFS is better. EMR clusters come with HDFS installed, so only a little setup is required to use it.

---

First, we'll need to copy the data files to the EMR master node with `scp`, as we did with the .pem file and the Python script we ran on EMR.

Next, we'll make a new HDFS directory called `/data` and copy our local files into it:

```
hdfs dfs -mkdir /data

hdfs dfs -copyFromLocal cities.csv /data/
```

Now that the data is in the Hadoop file system, we only need to change the URL for the data in our Python script, and we can access the HDFS files just as easily as we did the S3 files. (See the accompanying Python scripts for syntax.)
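
As a rough illustration of that URL change, the sketch below builds either an HDFS or an S3 location from the same relative path. The helper name `data_uri` and its parameters are hypothetical and not taken from the accompanying scripts; `hdfs:///` with an empty host relies on the cluster's default NameNode.

```python
def data_uri(storage, path, bucket=None):
    """Return the location of a data file on HDFS or S3."""
    clean_path = path.lstrip("/")
    if storage == "hdfs":
        # An empty authority (hdfs:///...) uses the cluster's default file system.
        return "hdfs:///" + clean_path
    if storage == "s3":
        if not bucket:
            raise ValueError("A bucket name is required for S3 paths")
        return "s3://" + bucket + "/" + clean_path
    raise ValueError("Unknown storage type: " + storage)
```

For the file copied above, `data_uri("hdfs", "/data/cities.csv")` gives `hdfs:///data/cities.csv`, and the S3 equivalent is `data_uri("s3", "input/cities.csv", bucket="emr-input-output")`; either string can be handed to whatever read call the Spark script uses.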

>Don't forget that you can access the HDFS UI on your cluster by following the port-forwarding commands in the `EMR_boto3Setup` notebook and using **Port 50070** (for an EMR notebook v5.28). See the Python notebook for a list of other ports if you are using a different EMR version.



