### Overview

This demo walks through connecting to Ceph from an EPIC cluster using PySpark.

### Setup

- On your client machine, run the script `./scripts/end_user_scripts/ceph/1_demo_server_setup.sh` to set up a Ceph Nano server on the RDP Server host
- Add the EPIC Spark 2.4 image
- Configure EPIC with Active Directory [see README](https://github.com/bluedata-community/bluedata-demo-env-aws-terraform/blob/master/docs/README-AD.md)
- Set up the Demo Tenant with Active Directory [see README](https://github.com/bluedata-community/bluedata-demo-env-aws-terraform/blob/master/docs/README-AD.md)
- Provision a Spark 2.4 cluster in the Demo Tenant with:
  - 1 x Spark Controller (small)
  - 1 x Jupyter Hub (small)
- Open a classic Jupyter notebook in JupyterHub (open JupyterHub and navigate to Help -> Launch Classic Notebook)
- SSH into the RDP Host and upload a dataset:

```
wget https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv
sed -i -e "s/\r/\n/g" airline-safety.csv # convert line endings
s3cmd put ./airline-safety.csv s3://sandboxbucket/airline-safety.csv
```
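The `sed` step above replaces every carriage return with a newline, which suits a file that uses bare `\r` line endings. As a minimal sketch, the same normalization could be done in Python. The helper names `normalize_line_endings` and `convert_file` are hypothetical, and unlike the `sed` command this version also collapses `\r\n` pairs into a single `\n`.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Return data with CRLF and bare CR line endings converted to LF."""
    # Handle CRLF first so each pair becomes one newline, not two.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(path: str) -> None:
    """Rewrite the file at `path` in place with LF line endings."""
    with open(path, "rb") as f:
        cleaned = normalize_line_endings(f.read())
    with open(path, "wb") as f:
        f.write(cleaned)


# Example (assumes the CSV was downloaded to the current directory):
# convert_file("airline-safety.csv")
```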

### Connect

- Verify that the Ceph instance responds to requests. We should see something like:

```
<?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>anonymous</ID><DisplayName></DisplayName></Owner><Buckets></Buckets></ListAllMyBucketsResult>
```
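One way to run this check from Python is sketched below. It assumes that an unauthenticated GET against the root of the S3 endpoint returns the `ListAllMyBucketsResult` document shown above. The helper names `parse_bucket_listing` and `check_endpoint` are hypothetical, and the URL is a placeholder for the private IP and port of the RDP server.

```python
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by the S3 bucket listing response
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def parse_bucket_listing(xml_bytes: bytes):
    """Return (owner_id, [bucket names]) from a ListAllMyBucketsResult document."""
    root = ET.fromstring(xml_bytes)
    owner_id = root.findtext("s3:Owner/s3:ID", default="", namespaces=S3_NS)
    names = [el.text for el in root.findall("s3:Buckets/s3:Bucket/s3:Name", S3_NS)]
    return owner_id, names


def check_endpoint(url: str, timeout: float = 5.0):
    """Fetch the endpoint root and parse the bucket listing it returns."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_bucket_listing(resp.read())


# Example (replace with the private IP of the RDP server):
# owner, buckets = check_endpoint("http://10.1.0.216:8080")
```

A result with the owner `anonymous` and an empty bucket list would correspond to the sample response above.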

- Test that we are able to connect to the Spark context:

In [1]:

```
sc
```

```
<SparkContext master=spark://bluedata-1.demo.bdlocal:7077 appName=IBM Spark Kernel>
```

- Set the connection to Ceph:

In [2]:

```
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Note: these JVM options only take effect if `conf` is passed in when the
# SparkContext is created; the notebook's existing `sc` is configured below.
conf = (
    SparkConf()
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
)

sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")

# Point the S3A connector at the Ceph endpoint
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "sandboxAccessKey")
hadoop_conf.set("fs.s3a.secret.key", "sandboxSecretKey")
hadoop_conf.set("fs.s3a.endpoint", "10.1.0.216:8080")  # Change to the private IP of the RDP server
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "false")
hadoop_conf.set("com.amazonaws.services.s3a.enableV4", "true")
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
```
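The `fs.s3a.*` settings above can also be collected in one place so that the endpoint and credentials are easy to change. The following is only a sketch: `s3a_settings` and `apply_s3a_settings` are hypothetical helper names, and `sc` is assumed to be the SparkContext that the notebook already provides.

```python
def s3a_settings(endpoint, access_key, secret_key):
    """Build the Hadoop S3A properties used in this demo as a dict."""
    return {
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.endpoint": endpoint,
        # The demo Ceph server is reached over plain HTTP
        "fs.s3a.connection.ssl.enabled": "false",
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


def apply_s3a_settings(sc, settings):
    """Copy each property into the Hadoop configuration of an existing SparkContext."""
    hadoop_conf = sc._jsc.hadoopConfiguration()
    for key, value in settings.items():
        hadoop_conf.set(key, value)


# Example (replace the endpoint with the private IP of the RDP server):
# apply_s3a_settings(sc, s3a_settings("10.1.0.216:8080", "sandboxAccessKey", "sandboxSecretKey"))
```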

- Retrieve some data:

In [3]:

```
sql = SparkSession(sc)
csv_df = sql.read.csv("s3a://sandboxbucket/airline-safety.csv")
csv_df.head()
```

```
Row(_c0='airline', _c1='avail_seat_km_per_week', _c2='incidents_85_99', _c3='fatal_accidents_85_99', _c4='fatalities_85_99', _c5='incidents_00_14', _c6='fatal_accidents_00_14', _c7='fatalities_00_14')
```
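The row above is the CSV header line returned as data under the default `_c0`, `_c1`, ... column names, because `read.csv` treats every line as data unless told otherwise. A possible variation, sketched here with a hypothetical helper name `read_csv_with_header`, passes `header=True` and `inferSchema=True` so that the first line supplies the column names and column types are inferred.

```python
def read_csv_with_header(spark, path):
    """Read a CSV whose first line holds column names, inferring column types."""
    return spark.read.csv(path, header=True, inferSchema=True)


# Example, using the SparkSession created in the cell above:
# csv_df = read_csv_with_header(sql, "s3a://sandboxbucket/airline-safety.csv")
# csv_df.printSchema()
```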