### Exploring JupyterHub and Spark

JupyterHub and Spark are installed by default with Open Data Hub.  You can create Jupyter Notebooks and connect to Spark.

Running the next cell should connect to Spark and print output similar to:
```
['jupyterhub-nb-kube-3aadmin']
```

In [None]:
from pyspark.sql import SparkSession, SQLContext
import os
import socket
    
# Add the Hadoop and AWS jars needed to access Ceph from Spark.
# This can be omitted if S3 storage access is not required.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    f"--conf spark.jars.ivy={os.environ['HOME']} "
    "--packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4 "
    "pyspark-shell"
)

# Create a Spark session against the cluster named in SPARK_CLUSTER
spark_cluster_url = f"spark://{os.environ['SPARK_CLUSTER']}:7077"
spark = SparkSession.builder.master(spark_cluster_url).getOrCreate()

# Test the Spark connection: collect the distinct hostnames where the tasks ran
spark.range(5, numPartitions=5).rdd.map(lambda x: socket.gethostname()).distinct().collect()
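
The cell above raises a `KeyError` if the `SPARK_CLUSTER` environment variable is not set. As a minimal sketch, the hypothetical helper below builds the master URL and falls back to Spark's local mode (`local[*]`) when the variable is missing. The helper name, the fallback choice, and the example cluster name are assumptions for illustration, not part of Open Data Hub.

```python
import os


def resolve_spark_master(env=os.environ, port=7077):
    """Return a Spark master URL built from SPARK_CLUSTER, or local mode if unset.

    Hypothetical helper: the fallback to "local[*]" is an assumption for
    running the notebook outside a cluster.
    """
    cluster = env.get("SPARK_CLUSTER", "").strip()
    if not cluster:
        # "local[*]" runs Spark in-process using all available cores
        return "local[*]"
    return f"spark://{cluster}:{port}"


# Illustrative cluster name; a plain dict stands in for os.environ here
print(resolve_spark_master({"SPARK_CLUSTER": "spark-cluster-admin"}))
print(resolve_spark_master({}))
```

The returned string could be passed to `SparkSession.builder.master(...)` in place of `spark_cluster_url`.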

### Object Storage

Let's access data on an Object Store (such as Ceph or AWS S3) using the S3 API.  For instructions on installing Ceph, refer to the Advanced Installation [documentation](https://opendatahub.io/docs/administration/advanced-installation/object-storage.html).

To access S3 directly, we'll use the `boto3` library.  We'll download a sample data file with `wget` and then upload it to our S3 storage using `boto3`.

After running the following cell, you should see `sample_data.csv` in your S3 bucket.

In [None]:
# Edit this section using your own credentials
s3_region = 'region-1' # AWS region or blank for Ceph
s3_endpoint_url = 'https://s3.storage.server'
s3_access_key_id = 'AccessKeyId-ChangeMe'
s3_secret_access_key = 'SecretAccessKey-ChangeMe'
s3_bucket = 'my-bucket'

# for easy download
!pip install wget
    
import wget
import boto3

# Configure the boto3 S3 client (region passed by keyword for clarity)
s3 = boto3.client('s3',
                  region_name=s3_region,
                  endpoint_url=s3_endpoint_url,
                  aws_access_key_id=s3_access_key_id,
                  aws_secret_access_key=s3_secret_access_key)

# Download the sample data file
url = "https://gitlab.com/opendatahub/opendatahub.io/raw/master/assets/files/tutorials/basic/sample_data.csv"
file = wget.download(url=url, out='sample_data.csv')

# Upload the file to storage
s3.upload_file(file, s3_bucket, "sample_data.csv")
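
To confirm the upload from the notebook itself, one option is to list the bucket for that key. The sketch below assumes the object store supports the S3 `ListObjectsV2` call (exposed by boto3 as `list_objects_v2`). `object_exists` is a hypothetical helper, and `FakeS3Client` is a stand-in used only so the snippet runs without credentials.

```python
def object_exists(client, bucket, key):
    """Return True if `key` is present in `bucket`.

    `client` is expected to behave like a boto3 S3 client, such as the `s3`
    object created in the notebook. Hypothetical helper, not part of boto3.
    """
    response = client.list_objects_v2(Bucket=bucket, Prefix=key)
    # "Contents" is absent from the response when nothing matches the prefix
    return any(obj["Key"] == key for obj in response.get("Contents", []))


class FakeS3Client:
    """Minimal stand-in that mimics the response shape of list_objects_v2."""

    def __init__(self, keys):
        self._keys = keys

    def list_objects_v2(self, Bucket, Prefix):
        matches = [{"Key": k} for k in self._keys if k.startswith(Prefix)]
        return {"Contents": matches} if matches else {}


fake = FakeS3Client(["sample_data.csv", "sample_data.csv.bak"])
print(object_exists(fake, "my-bucket", "sample_data.csv"))  # True
print(object_exists(fake, "my-bucket", "missing.csv"))      # False
```

In the notebook, `object_exists(s3, s3_bucket, "sample_data.csv")` would run the same check against the real bucket.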

### Spark + Object Storage

Now, let's access that same data file from Spark so you can analyze it.

In [None]:
# Point Spark's S3A connector at the object store configured above
hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.endpoint", s3_endpoint_url)
hadoopConf.set("fs.s3a.access.key", s3_access_key_id)
hadoopConf.set("fs.s3a.secret.key", s3_secret_access_key)
hadoopConf.set("fs.s3a.path.style.access", "true")
hadoopConf.set("fs.s3a.connection.ssl.enabled", "true") # false if not https
    
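# Read the CSV from the bucket; without inferSchema, every column is read as a string
data = spark.read.csv('s3a://' + s3_bucket + '/sample_data.csv', sep=",", header=True)
# Convert to a pandas DataFrame (collects all rows to the driver) and preview it
df = data.toPandas()
df.head()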
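
The `fs.s3a.connection.ssl.enabled` value has to match the scheme of the endpoint URL. As a sketch, the hypothetical helpers below derive it from the URL and build the `s3a://` path. The function names and the example endpoint are illustrative assumptions, while the `fs.s3a.*` property names are the ones set in the cell above.

```python
from urllib.parse import urlparse


def s3a_settings(endpoint_url, access_key, secret_key):
    """Build the fs.s3a.* Hadoop properties for an S3-compatible endpoint.

    Hypothetical helper: SSL is enabled only when the endpoint uses https.
    """
    use_ssl = urlparse(endpoint_url).scheme == "https"
    return {
        "fs.s3a.endpoint": endpoint_url,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": "true" if use_ssl else "false",
    }


def s3a_uri(bucket, key):
    """Return the s3a:// URI that addresses `key` inside `bucket`."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


# Illustrative plain-http endpoint, so SSL comes out disabled
settings = s3a_settings("http://ceph.example:8080", "key-id", "secret")
print(settings["fs.s3a.connection.ssl.enabled"])
print(s3a_uri("my-bucket", "sample_data.csv"))
```

Each pair in the returned dictionary could then be applied with `hadoopConf.set(name, value)`.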