# Helium and Delta Lake

Helium now publishes all Oracle S3 data files in the Delta Lake format. Delta Lake and Spark are widely used standards for storing and querying very large data tables, and Delta Lake is easy to use once the right tools are installed.
With Spark, Jupyter, and Delta Lake you can efficiently query Helium data using standard SQL commands.
    
We hope the community enjoys these new data tools and look forward to future community contributions.

The following Jupyter notebook shows how to access Helium's S3 files and query them with SQL using Spark DataFrames.

## References

Delta Lake
https://delta.io/

Getting Started with Delta Lake Spark in AWS
https://towardsdatascience.com/getting-started-with-delta-lake-spark-in-aws-the-easy-way-9215f2970c58

Spark by Examples
https://sparkbyexamples.com/

## Jupyter Setup Instructions
From a console, install `delta-spark` and create the AWS credentials file, then add your access key and secret key to it:

```
pip install delta-spark
mkdir -p ~/.aws
touch ~/.aws/credentials
vi ~/.aws/credentials
```
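
The credentials file uses the standard AWS INI layout: a `[default]` profile containing `aws_access_key_id` and `aws_secret_access_key`. As a minimal sketch of that layout, the Python below writes such a profile. The helper name `write_aws_credentials` is hypothetical and not part of Helium's or AWS's tooling, and the key values are supplied by you.

```python
import configparser
import os
from pathlib import Path


def write_aws_credentials(path, access_key_id, secret_access_key, profile="default"):
    """Write (or update) one profile in an AWS-style credentials file.

    Hypothetical helper for illustration only.
    """
    path = Path(path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)

    config = configparser.ConfigParser()
    config.read(path)  # keeps any profiles already in the file; a missing file is ignored
    config[profile] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    with open(path, "w") as handle:
        config.write(handle)
    os.chmod(path, 0o600)  # restrict the file to its owner
    return path
```

Calling `write_aws_credentials("~/.aws/credentials", "<your key id>", "<your secret key>")` would produce the same file you would otherwise create by hand in `vi`.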

## Start Spark Session

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Local Spark session with the Delta Lake extensions enabled.
# S3A settings use the "spark.hadoop." prefix so Spark passes them on to Hadoop.
builder = (
    SparkSession.builder.master("local[*]")
    .appName("PySparkLocal")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
)

# hadoop-aws provides the s3a:// filesystem used to read from S3.
my_packages = ["org.apache.hadoop:hadoop-aws:3.3.4"]
spark = configure_spark_with_delta_pip(builder, extra_packages=my_packages).getOrCreate()

print(spark.version)
```

```
3.4.0
```


## Read Table from Delta Lake

```python
delta_table_uri = "s3a://foundation-data-lake-requester-pays/bronze/packet_router_packet_report_v1/"

# Load the Delta table, then register it as a temp view for SQL queries.
# createOrReplaceTempView() returns None, so keep the DataFrame in its own variable.
df = spark.read.format("delta").load(delta_table_uri)
df.createOrReplaceTempView("packets")

sqlDF = spark.sql("SELECT * FROM packets WHERE date = '2023-06-01'")
sqlDF.show()

# Equivalent filter using the DataFrame API:
# from pyspark.sql.functions import col, lit
# df_by_date = df.filter(col("date") == lit("2023-06-01"))
# df_by_date.show()
```

```
+------------+---+------+----+---------+-----+--------+------+--------------------+--------------------+------------+-----+------+------------------+----------+--------------------+
|gateway_tmst|oui|net_id|rssi|frequency|  snr|datarate|region|             gateway|        payload_hash|payload_size| free|  type|received_timestamp|      date|                file|
+------------+---+------+----+---------+-----+--------+------+--------------------+--------------------+------------+-----+------+------------------+----------+--------------------+
|   658952164|  1|    36|-101|865402500|  0.8|        | IN865|[00 E7 36 B0 62 B...|[AC 47 D5 30 D3 E...|          23|false|      |     1685577861837|2023-06-01|s3://foundation-p...|
|  1606199439| 12|    36|-125|905300000|-12.8|SF9BW125|      |[00 E2 70 E6 BC 3...|[AB 37 01 00 F7 7...|          24|false|uplink|     1685578035208|2023-06-01|s3://foundation-p...|
|  2993796500| 12|    36|-140|868500000|-22.0|        | EU868|[00 BC 08 55 20 C...|[82 AF 
```

## Query Table using SQL

```python
spark.sql("""
SELECT payload_size
FROM packets
WHERE date = '2023-06-01'
LIMIT 5
""").show(truncate=False)
```

```
+------------+
|payload_size|
+------------+
|23          |
|24          |
|23          |
|23          |
|23          |
+------------+
```
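
When you query many dates, it can help to build the SQL text from a `datetime.date` so the date literal is always zero-padded (`2023-06-01` rather than `2023-6-1`). The sketch below is one way to do that. The function name `packets_query` and its parameters are hypothetical, and it assumes the `packets` temp view registered earlier and trusted column names.

```python
from datetime import date


def packets_query(day, columns=("*",), limit=None):
    """Build a SQL string that selects one date from the `packets` view.

    Hypothetical helper: column names are inserted as-is, so pass only
    names you trust (for example, names taken from the table schema).
    """
    if not isinstance(day, date):
        raise TypeError("day must be a datetime.date")
    day_literal = day.strftime("%Y-%m-%d")  # zero-padded ISO date
    sql = f"SELECT {', '.join(columns)} FROM packets WHERE date = '{day_literal}'"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


print(packets_query(date(2023, 6, 1), columns=("payload_size",), limit=5))
# SELECT payload_size FROM packets WHERE date = '2023-06-01' LIMIT 5
```

The resulting string can be passed to `spark.sql(...)` in the same way as the hand-written query.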



```python
df.printSchema()
```

```
root
 |-- gateway_tmst: decimal(23,0) (nullable = true)
 |-- oui: decimal(23,0) (nullable = true)
 |-- net_id: long (nullable = true)
 |-- rssi: integer (nullable = true)
 |-- frequency: long (nullable = true)
 |-- snr: float (nullable = true)
 |-- datarate: string (nullable = true)
 |-- region: string (nullable = true)
 |-- gateway: binary (nullable = true)
 |-- payload_hash: binary (nullable = true)
 |-- payload_size: long (nullable = true)
 |-- free: boolean (nullable = true)
 |-- type: string (nullable = true)
 |-- received_timestamp: decimal(23,0) (nullable = true)
 |-- date: date (nullable = false)
 |-- file: string (nullable = true)
```
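
The `received_timestamp` column is a `decimal(23,0)`, which PySpark returns as a Python `Decimal` when rows are collected. The sample rows suggest it holds milliseconds since the Unix epoch: `1685577861837` appears alongside `date = 2023-06-01`. That interpretation is an assumption drawn from the sample output, not a documented guarantee. Under it, a value can be converted to a UTC datetime as sketched below; the helper name `received_at` is hypothetical.

```python
from datetime import datetime, timezone
from decimal import Decimal


def received_at(received_timestamp):
    """Convert an epoch-milliseconds value (int or Decimal) to a UTC datetime.

    Assumes the value is milliseconds since 1970-01-01T00:00:00Z.
    """
    millis = int(received_timestamp)
    seconds, remainder_ms = divmod(millis, 1000)
    whole = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return whole.replace(microsecond=remainder_ms * 1000)


ts = received_at(Decimal("1685577861837"))
print(ts.isoformat())  # 2023-06-01T00:04:21.837000+00:00
```

Integer arithmetic is used instead of dividing by `1000.0` so the millisecond part is not affected by floating-point rounding.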



## Working with Local Files

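Querying S3 directly can incur egress costs.
To work from a local filesystem instead, install the AWS CLI and sync the Helium S3 files to the local filesystem under `../work/s3`.
The command below syncs into the current directory (`.`), so run it from that directory. Because the bucket name indicates requester-pays, the AWS CLI may also need the `--request-payer requester` option.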

```
aws s3 sync s3://foundation-data-lake-requester-pays/bronze/packet_router_packet_report_v1/ . --exclude "date=1970-01-01/"
```
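
A Delta table keeps its transaction log in a `_delta_log` directory at the table root, so a quick check that the sync put files in the expected place is to look for that directory before loading. The sketch below does only that. The function name `looks_like_delta_table` is hypothetical, and the check does not validate the table's contents.

```python
from pathlib import Path


def looks_like_delta_table(path):
    """Return True if `path` contains a non-empty `_delta_log` directory."""
    log_dir = Path(path) / "_delta_log"
    return log_dir.is_dir() and any(log_dir.iterdir())
```

For example, `looks_like_delta_table("/home/jovyan/work/s3")` should return `True` once the sync has completed into that directory.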

```python
df_local = spark.read.format("delta").load("/home/jovyan/work/s3")
df_local.show()
```

```
+------------+---+------+----+---------+-----+---------+-------+--------------------+--------------------+------------+-----+----+------------------+----------+--------------------+
|gateway_tmst|oui|net_id|rssi|frequency|  snr| datarate| region|             gateway|        payload_hash|payload_size| free|type|received_timestamp|      date|                file|
+------------+---+------+----+---------+-----+---------+-------+--------------------+--------------------+------------+-----+----+------------------+----------+--------------------+
|  3882617758|  1|    36|-108|923400000|-11.5|SF10BW125|AS923_1|[00 79 B6 20 8D E...|[EB 11 B3 60 51 3...|          23|false|    |                 0|1970-01-01|s3://foundation-p...|
|  2880966735|  1|    36|-130|865402500|-18.5|         |  IN865|[00 20 52 FC 6B F...|[9E F4 28 1A 44 3...|          23|false|    |                 0|1970-01-01|s3://foundation-p...|
|   634893225|  1|    36|-114|865985000|-15.5|         |  IN865|[00 59 69 3F 98 C...|[22 D
```