### MACS 30113 Lab Session: Working with EMR Clusters/Spark

### Today's Lab Agenda

1. Present two methods for working with Spark
2. Introduce some simple PySpark commands

### Create S3 Bucket

First, create an S3 bucket to store our files.

In [18]:
import boto3

In [19]:
# Initialize boto3 handler
s3 = boto3.resource('s3')
iam_client = boto3.client('iam')
role = iam_client.get_role(RoleName='LabRole')

# Create a new bucket to store your files
BUCKETNAME = 'rei-example-bucket'
s3.create_bucket(Bucket=BUCKETNAME)

# This is what we will use to interface with the specific bucket
bucket = s3.Bucket(BUCKETNAME)

In [20]:
# Upload the .py file
with open('KEY_lab_wk7_spark.py', 'rb') as py_file:
    bucket.put_object(Key='lab_wk7/lab_wk7_spark.py', Body=py_file)

print("Files uploaded to S3 under 'lab_wk7/' folder.")

Files uploaded to S3 under 'lab_wk7/' folder.
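
To confirm that the upload worked, you can list the keys stored under the `lab_wk7/` prefix. The following is a minimal sketch: `list_keys` is a hypothetical helper name (not part of the lab files), and it assumes `bucket` is the boto3 `Bucket` resource created above.

```python
def list_keys(bucket, prefix):
    """Return the keys of all objects in `bucket` whose key starts with `prefix`."""
    # `bucket` is expected to be a boto3 Bucket resource, e.g. s3.Bucket(BUCKETNAME).
    # objects.filter(Prefix=...) yields one object summary per matching key.
    return sorted(obj.key for obj in bucket.objects.filter(Prefix=prefix))


# Usage in the notebook (this sends a real request to S3, so it needs valid
# AWS credentials):
# print(list_keys(bucket, 'lab_wk7/'))
```

If the upload succeeded, the returned list should include `lab_wk7/lab_wk7_spark.py`.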


### Launching EMR Cluster

Next, launch an EMR cluster from the terminal (bash).

In [21]:
%%bash 
# NOTE: replace rei-example-bucket below with your own bucket name
aws emr create-cluster \
    --name "Spark Cluster" \
    --release-label "emr-6.2.0" \
    --applications Name=Hadoop Name=Hive Name=JupyterEnterpriseGateway Name=JupyterHub Name=Livy Name=Pig Name=Spark Name=Tez \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --region us-east-1 \
    --ec2-attributes '{"KeyName": "vockey"}' \
    --configurations '[{"Classification": "jupyter-s3-conf", "Properties": {"s3.persistence.enabled": "true", "s3.persistence.bucket": "rei-example-bucket"}}]'


{
    "ClusterId": "j-3R36A8AV7QM2I",
    "ClusterArn": "arn:aws:elasticmapreduce:us-east-1:014303519904:cluster/j-3R36A8AV7QM2I"
}
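
The cluster takes several minutes to start, and you need its public DNS name before you can `ssh` into it. As a rough sketch, the state and address can be read with the boto3 EMR client. `cluster_summary` is a hypothetical helper name, and the cluster ID in the usage comments is the one printed above, so replace it with your own.

```python
def cluster_summary(emr_client, cluster_id):
    """Return (state, primary_node_public_dns) for an EMR cluster."""
    cluster = emr_client.describe_cluster(ClusterId=cluster_id)['Cluster']
    state = cluster['Status']['State']  # e.g. 'STARTING', 'RUNNING', 'WAITING'
    # The DNS name is missing until the primary node has been provisioned.
    dns = cluster.get('MasterPublicDnsName')
    return state, dns


# Usage in the notebook (makes real AWS calls, so it needs valid credentials):
# import boto3
# emr = boto3.client('emr', region_name='us-east-1')
# emr.get_waiter('cluster_running').wait(ClusterId='j-3R36A8AV7QM2I')
# print(cluster_summary(emr, 'j-3R36A8AV7QM2I'))
```

The DNS name returned here is what goes after `hadoop@` in the `ssh` and `scp` commands.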


#### Method 1: `ssh` Directly

1. When creating a new cluster, make sure to adjust the security settings to allow `ssh` access. See `emr_cheatsheet.md` in the Week 7 course materials.
2. Download `labsuser.pem` and save it to your `.aws` folder.
3. Run this command in your terminal so that only you can read the key file: `chmod 400 labsuser.pem`

Then connect to the cluster (replace the host name with your own cluster's public DNS name):
```
$ ssh -i "labsuser.pem" hadoop@ec2-54-197-37-22.compute-1.amazonaws.com
```

Uploading a folder called `mystuff` locally -> EMR:
```
$ scp -i "labsuser.pem" -r mystuff hadoop@EMR-PUBLIC-ADDRESS:/home/hadoop
```

Downloading a folder called `mystuff` from EMR -> locally:
```
$ scp -i "labsuser.pem" -r hadoop@EMR-PUBLIC-ADDRESS:/home/hadoop/mystuff .
```
---

After uploading your files, you can run Spark jobs on the cluster with:
``` 
[EMR] spark-submit mystuff/myfile.py
```
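Alternatively, if your files are saved on S3, pass the script's S3 URI to `spark-submit`. Anything after the URI (here, the bucket name) is passed to the script as an argument: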
```
[EMR] spark-submit s3://rei-example-bucket/lab_wk7/lab_wk7_spark.py rei-example-bucket
```
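
If you would rather not `ssh` in at all, the same `spark-submit` command can be queued as an EMR step from the notebook. This is an alternative sketch, not the method used in this lab: `spark_step` is a hypothetical helper that only builds the step definition, and the commented call at the end is what would actually submit it.

```python
def spark_step(script_uri, *script_args, name='Spark job'):
    """Build an EMR step definition that runs `spark-submit script_uri args...`."""
    return {
        'Name': name,
        # Keep the cluster running if the job fails.
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            # command-runner.jar runs the given command on the primary node.
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', script_uri, *script_args],
        },
    }


# Same script and argument as the spark-submit command above.
step = spark_step(
    's3://rei-example-bucket/lab_wk7/lab_wk7_spark.py',
    'rei-example-bucket',
)

# Submitting the step makes a real AWS call (replace the cluster ID with yours):
# import boto3
# emr = boto3.client('emr', region_name='us-east-1')
# emr.add_job_flow_steps(JobFlowId='j-3R36A8AV7QM2I', Steps=[step])
```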

#### Method 2: Interactive Sessions

You can also launch a Jupyter server directly on EMR and work with it interactively.
```
$ ssh -i "labsuser.pem" -NL 9443:localhost:9443 hadoop@ec2-54-197-37-22.compute-1.amazonaws.com
```
This forwards port 9443 on the cluster to your own machine, so you can open `https://localhost:9443` in a browser and log in with username `jovyan` and password `jupyter`.

#### Alternative Options: Running Spark on Midway

To submit batch jobs with `sbatch`, refer to `in-class-activities/07_Spark/7M_Spark_EDA_ML/midway` in the Week 7 course materials.

You can also work with Spark interactively on Midway using `sinteractive`:
```bash
$ sinteractive --time=01:00:00 --nodes=1 --ntasks=10 --mem=40G --partition=caslake --account=macs30113
```
Then set up the PySpark environment and start `pyspark`:
```bash
$ module load python/anaconda-2022.05 spark/3.3.2
$ pyspark --total-executor-cores 9 --executor-memory 4G --driver-memory 4G
```
Finally, forward the remote port to your local machine:
```bash
$ ssh -NL 8888:10.50.250.12:8888 <your-CNetID>@midway3.rcc.uchicago.edu
```


## Remember to shut down EMR and clean the bucket
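
One possible way to do this from the notebook is sketched below. `clean_up` is a hypothetical helper name; it assumes `bucket` is the boto3 `Bucket` resource created at the top of this lab and that you pass in your own cluster ID. Note that it permanently deletes every object in the bucket.

```python
def clean_up(emr_client, bucket, cluster_id):
    """Terminate the EMR cluster, then empty and delete the S3 bucket."""
    # Request termination; the cluster shuts down asynchronously.
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
    # A bucket must be empty before it can be deleted.
    bucket.objects.all().delete()
    bucket.delete()


# Usage in the notebook (makes real, irreversible AWS calls):
# import boto3
# emr = boto3.client('emr', region_name='us-east-1')
# clean_up(emr, bucket, 'j-3R36A8AV7QM2I')
```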