<h1><center>Initial smoke test of the Dell Data Analytics Engine (powered by Starburst)</center></h1>

<a id='the-runtime-environment'></a>
## The runtime environment

This notebook provides a quick way to validate that
[Apache Spark](https://spark.apache.org/) code can run on the
[Dell Data Analytics Engine](https://dell.starburst.io/latest/index.html) -- *powered by [Starburst](https://www.starburst.io/)*.

<a id='installing-spark'></a>
## Installing Spark

> These instructions were lifted & enhanced from [Colab and PySpark](https://colab.research.google.com/drive/1G894WS7ltIUTusWWmsCnF_zQhQqZCDOc), whose source file can be downloaded from [here](https://github.com/jacobceles/knowledge-repo/blob/master/pyspark/Colab%20and%20PySpark.ipynb) and then used with any Jupyter notebook.

Install Dependencies:

1.   Java 8 (the Dell appliance requires 22, but so far 8 is working from the notebook)
2.   Apache Spark with Hadoop (settled on 3.5.1 for starters, since Spark Connect needs >= 3.4)
3.   Findspark (used to locate Spark on the system)

> If you have issues with spark version, please upgrade to the latest version from [here](https://archive.apache.org/dist/spark/).

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
!tar xf spark-3.5.1-bin-hadoop3.tgz
!pip install -q findspark

In [None]:
!ls

Set Environment Variables:

In [1]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"
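
Before moving on, it can help to confirm that both variables point at directories that actually exist. The following is a minimal sketch, not part of the original notebook: `missing_env_dirs` is a hypothetical helper name, and it only checks that each path exists, not that the Java or Spark install inside it is usable.

```python
import os

def missing_env_dirs(names, environ=os.environ):
    """Return the names whose value is unset or is not an existing directory."""
    return [name for name in names if not os.path.isdir(environ.get(name, ""))]

# Hypothetical check: an empty list means both paths exist on this machine.
print(missing_env_dirs(["JAVA_HOME", "SPARK_HOME"]))
```

If the list is not empty, the download and extract cell above most likely did not finish, or the notebook is running somewhere other than `/content`.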

<a id='dell-cli-tasks'></a>
## Dell CLI tasks



Obtain & set Spark Connect uri:

> Full details in the [CLI docs](https://dell.starburst.io/latest/dell-data-processing-engine/cli.html),
but here are the general steps after installation.

Run the following wherever you have the Dell CLI installed.

`./dell-data-processing-engine login`

Replace `ACCESS_KEY` and `SECRET_KEY` with your own credentials, then create the Spark Connect instance:

```
./dell-data-processing-engine submit \
	--conf spark.hadoop.fs.s3a.access.key=ACCESS_KEY \
	--conf spark.hadoop.fs.s3a.secret.key=SECRET_KEY \
	--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
	--conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
	--conf spark.hadoop.fs.s3a.endpoint= \
	--conf spark.sql.repl.eagerEval.enabled=True \
	--spark-connect
```

Copy the `sparkId` value from the output to your clipboard and use it in place of `REPLACE-ME` in the next step:

`./dell-data-processing-engine instance uris REPLACE-ME`

Copy the `Spark Connect` URI (starts with `sc://`) to your clipboard and use it in the next code cell.

**Note: when all done be sure to run `./dell-data-processing-engine instance delete REPLACE-ME`**


In [None]:
#
# Run this cell and paste the Spark Connect URI into the text box that appears (and press <enter>, of course)
#

sparkConnectUri = input("Spark Connect URI: ").strip()
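
Since the pasted value is expected to start with `sc://`, a quick format check can catch a bad paste before the session is created. This is only a sketch under stated assumptions: `looks_like_spark_connect_uri` is a hypothetical helper, the example host and port are made up, and passing the check says nothing about whether the endpoint is reachable.

```python
from urllib.parse import urlparse

def looks_like_spark_connect_uri(uri):
    """Return True when the URI uses the sc:// scheme and names a host."""
    parsed = urlparse(uri.strip())
    return parsed.scheme == "sc" and bool(parsed.hostname)

# Hypothetical example values, not real endpoints.
print(looks_like_spark_connect_uri("sc://example-host:15002"))  # True
print(looks_like_spark_connect_uri("https://example-host"))     # False
```

In the notebook, the same check could be applied to `sparkConnectUri` right after the prompt.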

<a id='run-spark'></a>
## Run Spark


Create the SparkSession:

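> Evaluating `spark` by itself should print something similar to
`<pyspark.sql.connect.session.SparkSession at 0x7fe9f73bbe90>`; the cell below ends with `spark.version`, so it displays the Spark version string instead.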

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .remote(sparkConnectUri) \
    .getOrCreate()
spark.version

Create a DataFrame from hard-coded data and display it:

In [None]:
from datetime import datetime, date
from pyspark.sql import Row

bogus_df = spark.createDataFrame([
  Row(aNbr=1, nutherNbr=2, aString='string1', aDate=date(2000, 1, 1), aTimestamp=datetime(2000, 1, 1, 12, 0)),
  Row(aNbr=2, nutherNbr=3, aString='string2', aDate=date(2000, 2, 1), aTimestamp=datetime(2000, 1, 2, 12, 0)),
  Row(aNbr=4, nutherNbr=5, aString='string3', aDate=date(2000, 3, 1), aTimestamp=datetime(2000, 1, 3, 12, 0)),
  Row(aNbr=8, nutherNbr=7, aString='string4', aDate=date(2000, 4, 1), aTimestamp=datetime(2000, 1, 4, 12, 0)),
])
bogus_df.show()

## Are you done?

If so (or when you are), don't forget to run the following command.

**`./dell-data-processing-engine instance delete REPLACE-ME`**


<a id='transformation-logic'></a>
## Transformation logic

We are using the publicly available Bluebikes - Hubway dataset. Read more information [about Blue Bikes Boston](http://bluebikes.com/about), a bicycle-sharing program based in Boston since 2011.

We are focusing on the [transactional records](https://bluebikes.com/system-data) of the bike trips from start to finish.

<a id='exploring-the-raw-data'></a>
### Exploring the raw data

In [None]:
# let's just grab a single CSV to explore
s3_file_path = "s3a://starburst101-handsonlab-nyc-uber-rides/blue_bikes/raw_trips-2022_01-2022-09/202201-bluebikes-tripdata.csv"

# read CSV file into a DataFrame
df = spark.read.csv(s3_file_path, header=True, inferSchema=True)

# Show the DataFrame
df.show()

In [None]:
# Q: how many rows
df.count()

# RAISES EXCEPTION -- DON'T RUN!!
#  Jordan is submitting a bug on this (4/23/2025)

In [None]:
# Q: any null values?
from pyspark.sql.functions import col
df.filter(col("tripduration").isNull()).show()

# A: no null values found (that's good!)

In [None]:
# Q: do the tripduration values seem realistic? note: time is in seconds
from pyspark.sql.functions import min, max, avg, count
df.select(count("tripduration"),
          min("tripduration"),
          max("tripduration"),
          avg("tripduration")
          ).show()

# A: a min trip of about a minute seems OK, but a max trip of 27 DAYS **seems** WRONG;
#     maybe this rider just didn't check the bike back in for a month.
#     An average of 20 minutes seems reasonable.
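
Raw second counts are hard to judge by eye, so converting them to days, hours, and minutes makes the sanity check easier. The sketch below uses a hypothetical helper, `describe_seconds`, and illustrative round inputs rather than the actual query results.

```python
from datetime import timedelta

def describe_seconds(seconds):
    """Render a duration given in seconds as days, hours, and minutes."""
    td = timedelta(seconds=seconds)
    hours, remainder = divmod(td.seconds, 3600)
    minutes = remainder // 60
    return f"{td.days}d {hours}h {minutes}m"

# Illustrative inputs only: one minute, and exactly 27 days.
print(describe_seconds(60))          # 0d 0h 1m
print(describe_seconds(27 * 86400))  # 27d 0h 0m
```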

In [None]:
# Q: are there a BUNCH of super long rides? Say greater than 16 hours (kept it with you all day)
# note: 16 hours = 57600 seconds

df.filter("tripduration > 57600").sort("tripduration", ascending=False).show(200)

#df.filter("tripduration > 57600").select(count("tripduration")).show()

# A: swap the comments on the two lines above to see that fewer than 100 rides qualify
#     (out of the 81613 counted in the prior cell), which seems reasonable
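
Because `tripduration` is in seconds, the threshold is easy to get wrong when typed as a literal. One option, sketched here with hypothetical helper names (`hours_to_seconds`, `long_ride_filter`), is to derive the filter expression from a number of hours.

```python
SECONDS_PER_HOUR = 3600

def hours_to_seconds(hours):
    """Convert a duration in hours to whole seconds."""
    return int(hours * SECONDS_PER_HOUR)

def long_ride_filter(hours, column="tripduration"):
    """Build a SQL-style filter string for rides longer than `hours`."""
    return f"{column} > {hours_to_seconds(hours)}"

print(hours_to_seconds(16))   # 57600
print(long_ride_filter(16))   # tripduration > 57600
```

The resulting string could then be passed to `df.filter(...)` in place of the hard-coded expression.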