In [5]:
# run first. then have fun.
from pyspark.sql.functions import col
from delta.tables import DeltaTable

# keep the default compression codec as zstd
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

# First Steps: Delta Lake Streaming
> Note: This notebook relies heavily on the Apache Spark ecosystem. In the future we will have Rust-driven notebooks under `first steps` as well.

We will discover how to easily create a Delta Lake table using the `datasets/ecomm_behavior_data/parquet/[sm|lg]/` data created in the [../notebooks/100-pre-processing/ecomm_csv_to_parquet.ipynb](./notebooks/100-pre-processing/ecomm_csv_to_parquet.ipynb).

1. We will convert the `parquet` data into a [Delta Lake table](https://docs.delta.io/latest/delta-batch.html#create-a-table).
2. We will also look at creating the table using the [DeltaTable](https://docs.delta.io/latest/api/python/index.html) builder methods.

In [6]:
dataset_dir = '/opt/spark/work-dir/hitchhikers_guide/datasets/ecomm_behavior_data'
# note: if you download the full dataset from https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store,
# just use the following and comment out the `-sm.csv` datasets.

datasets = ['2019-Oct-sm.csv','2019-Nov-sm.csv']

source_dir = 'sm' if datasets[1].endswith('-sm.csv') else 'lg'
source_parquet_dir = f"{dataset_dir}/parquet/{source_dir}/"

# view the source parquet path
print(source_parquet_dir)

# delta sink information
delta_path = f"{dataset_dir}/delta"
dl_unmanaged_table = "ecomm"

# managed delta table
dl_managed_table = "ecomm_by_day"

/opt/spark/work-dir/hitchhikers_guide/datasets/ecomm_behavior_data/parquet/sm/


# Delta Lake Tables
We will learn to create an `empty` Delta Lake table next. There are many reasons to create empty tables. For one, an empty table lets you make the `promise` of eventual data while first locking things like the table's `schema` into place. If none of this makes sense yet, never fear: you'll learn about `schemas` and `tblproperties` next.

If you recall, we used a `StructType` to create a schema when we read the `ecomm_behavior_data` in the [100-pre-processing](../100-pre-processing/ecomm_csv_to_parquet.ipynb) notebook. A `StructType` is to a DataFrame what structured data is to a table row: both are strongly typed and provide a bit of peace of mind when working with a dataset.

Structured Data is also one of the most important concepts to keep in mind while working with Streaming datasets.

## Structured Data as our Data Contract
Delta Lake uses a technique called `schema-on-write`. The **initial write transaction**, which occurs at the time of table **creation**, establishes the table's `schema`. From then on, all data written by the `writer` or `producer` of a dataset must conform to that known `schema`, and each `write` in Delta Lake is encapsulated in a `transaction`. The importance of the `schema` is that it is `type-safe`. Type safety is also of critical importance for streaming, since a change in type, say from `string` to `integer`, would break `backwards-compatibility` and `corrupt` our table. We don't want corrupt tables, so by using `schema-on-write` and `schema-enforcement`, both tenets of the Delta Lake architecture, we can rest assured that any changes to our `schema` are backwards compatible*.

> note and warning: (*) In the case where we must break backwards compatibility, we can, but it comes at the cost of `overwriting` the entire table and `schema`. This pattern is ripe for broken promises when a breaking change isn't communicated to every downstream consumer (someone or some team that is relying on your data for their data product).
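
To make the idea of schema enforcement concrete, here is a minimal pure-Python sketch. It is a deliberately simplified model and not Delta Lake's actual implementation: the helper `schema_violations` and the dictionary-based schemas are hypothetical, and the real rules are richer (for example, columns missing from a write are filled with nulls rather than rejected).

```python
from typing import Dict, List


def schema_violations(table_schema: Dict[str, str],
                      write_schema: Dict[str, str]) -> List[str]:
    """Return the reasons a write would be rejected (toy model, not Delta's code)."""
    problems = []
    for name, dtype in write_schema.items():
        if name not in table_schema:
            # the write carries a column the table has never seen
            problems.append(f"unknown column '{name}'")
        elif table_schema[name] != dtype:
            # the write changes the type of an existing column
            problems.append(f"column '{name}' is {table_schema[name]}, got {dtype}")
    return problems


# a three-column subset of the ecomm schema, for illustration only
table_schema = {"event_type": "string", "product_id": "int", "price": "float"}

# a conforming write produces no violations
ok = schema_violations(
    table_schema, {"event_type": "string", "product_id": "int", "price": "float"})

# changing product_id from int to string breaks the contract
bad = schema_violations(
    table_schema, {"event_type": "string", "product_id": "string", "price": "float"})

print(ok)
print(bad)
```

The point of the sketch is the contract: once the table schema exists, a writer that changes a column's type is stopped before any data lands in the table.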

We will create an Empty Table next that will hold our ecommerce data.

## Creating an Unmanaged Delta Lake Table

In [7]:
# steal the schema pattern
source_parquet = (spark.read
 .format("parquet")
 .load(source_parquet_dir)
)

source_schema = source_parquet.schema.simpleString()
# using the output of the schema from the reference `parquet` table, we can steal enough information to create our empty table
print(source_schema)

struct<event_time:timestamp,event_type:string,product_id:int,category_id:bigint,category_code:string,brand:string,price:float,user_id:int,user_session:string,event_date:date>
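
The printed schema string carries everything needed for the column list of a `CREATE TABLE` statement. The sketch below shows one way to turn it into column definitions. The helper `columns_from_simple_string` is hypothetical, `printed_schema` is a copy of the output above, and the parsing assumes a flat schema (no nested `struct`, `array`, `map`, or `decimal(p,s)` types), which holds for this dataset.

```python
from typing import List, Tuple


def columns_from_simple_string(simple: str) -> List[Tuple[str, str]]:
    """Split a flat 'struct<name:type,...>' string into (name, TYPE) pairs."""
    if not (simple.startswith("struct<") and simple.endswith(">")):
        raise ValueError("expected a struct<...> simple string")
    inner = simple[len("struct<"):-1]
    pairs = []
    for field in inner.split(","):
        name, dtype = field.split(":", 1)
        pairs.append((name, dtype.upper()))
    return pairs


# copy of the schema string printed by the previous cell
printed_schema = (
    "struct<event_time:timestamp,event_type:string,product_id:int,"
    "category_id:bigint,category_code:string,brand:string,price:float,"
    "user_id:int,user_session:string,event_date:date>"
)

columns = columns_from_simple_string(printed_schema)
# one "name TYPE" line per column, ready to paste into a CREATE TABLE statement
column_ddl = ",\n".join(f"    {name} {dtype}" for name, dtype in columns)
print(column_ddl)
```

Spark SQL accepts both `INT` and `INTEGER`, so the generated definitions match the hand-written ones in the next cell.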


In [8]:
## Delta Lake Table Location on the File System
# > note: Delta Lake tables come in two variants (unmanaged and managed)

spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {dl_unmanaged_table} (
        event_time TIMESTAMP,
        event_type STRING,
        product_id INTEGER,
        category_id BIGINT,
        category_code STRING,
        brand STRING,
        price FLOAT,
        user_id INTEGER,
        user_session STRING,
        event_date DATE
    ) USING DELTA
    LOCATION '{delta_path}/{dl_unmanaged_table}'
    PARTITIONED BY (event_date)
    TBLPROPERTIES('delta.logRetentionDuration'='interval 28 days');
   """)

Hive Session ID = fb06f80d-5aa6-4e87-96e4-a56a62b6c3aa





The prior `CREATE TABLE` command will generate an empty Delta Lake table. This doesn't mean that the `table` is actually empty though. The table contains `metadata`, which provides information such as the table properties, partition columns, and table location. Given that the `source parquet data` is partitioned by `event_date`, we preserve the same `daily` partitions in our Delta Lake table. This means we don't have to think about how to partition new data as it is added: each record is written into the partition matching its `event_date` without our supervision.

If you have `tree` installed on your local machine, take a look at the output of calling:

`tree ./hitchhikers_guide/datasets/ecomm_behavior_data/delta/`.

```
./hitchhikers_guide/datasets/ecomm_behavior_data/delta/
└── ecomm
    └── _delta_log
        └── 00000000000000000000.json

3 directories, 1 file
```

> Note: If you are using a Mac and have brew installed, run `brew install tree`.


## Populate our Empty Table using our Parquet Source Table
In order to add records to our newly created `empty` table, we need to just read and write into the new table. 

> **NOTE**: If you are reading the entire October and November ecomm data, you may see a JVM OOM followed by `Py4JError: py4j does not exist in the JVM`. This means the driver just crashed.

> **TIP**: If you are importing the full two months of data, use `isin` to import more days at a time.
> `.where(col("event_date").isin("2019-10-01","2019-10-02","2019-10-03"))`
> This way you can also figure out at what point the data is too big and an inevitable crash will occur.
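
If you would rather not type the dates by hand, the batches for `isin` can be generated. This is a small sketch using only the standard library. The helper `date_batches` is hypothetical, and the batch size of 3 is an arbitrary starting point that you would tune to your driver's memory.

```python
from datetime import date, timedelta
from typing import Iterator, List


def date_batches(start: date, end: date, size: int) -> Iterator[List[str]]:
    """Yield lists of ISO date strings covering start..end inclusive, `size` days at a time."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    current = start
    while current <= end:
        batch.append(current.isoformat())
        if len(batch) == size:
            yield batch
            batch = []
        current += timedelta(days=1)
    if batch:
        # the final batch may be shorter than `size`
        yield batch


batches = list(date_batches(date(2019, 10, 1), date(2019, 10, 7), size=3))
print(batches)

# each batch can then drive one read and append, for example:
#   source_parquet.where(col("event_date").isin(*batch))
```

Growing `size` step by step is one way to find the point at which the import no longer fits in memory.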

In [None]:
# providing the source_parquet for completeness. This has been created earlier in the notebook to steal the parquet schema.

# if you want to play around with resolving OOM on the big dataset, don't import by day, and watch things fall over. (~14gb into 1gb memory)...
#source_parquet = (spark.read
# .format("parquet")
# .load(f"{dataset_dir}/parquet/{source_dir}/")
#)

source_parquet = (spark.read
  .format("parquet")
  .load(source_parquet_dir)
  .where(col("event_date").eqNullSafe("2019-10-01"))
  # .where(col("event_date").isin("2019-11-29","2019-11-30"))
)

# TIP: sometimes you just want to make sure the data exists
# if you view what you are going to write, then you can see if the upstream is empty
# with a quick visual (when you are in notebooks), when you are running outside of notebooks
# you'll need to rely on `count()`, or other file listing techniques to see if the data
# exists, otherwise, your import job could pass with flying colors - while there is still sadly
# no data being moved from a -> b.

#source_parquet.show(10)

(source_parquet
 .write
 .format("delta")
 .option("path", f"{delta_path}/{dl_unmanaged_table}")
 .mode("append")
 .save()
)

1. What have you learned in the process of reading the source parquet into the new Delta Table location?
2. What patterns have you picked up here? Did you try playing with different strategies for selecting data using the `.where(col("event_date")....)`? What about using a collection of dates, or matches like `2019-10-1*`? If you are new to using PySpark or Spark in general, take a look at the [pyspark.sql.functions](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html) package to help you on your way.
3. If you were using the large datasets, what kinds of issues did you run into? 

Depending on how many actions you were taking, you probably saw:

```
[warning][gc,alloc] Executor task launch worker for task 5.0 in stage 603.0 (TID 8573): Retried waiting for GCLocker too often allocating 262144 words
23/06/19 21:22:03 WARN TaskMemoryManager: Failed to allocate a page (2097136 bytes), try again
```

Learning to use warnings and exceptions to your advantage can really help you understand what sorts of pressure points exist in your applications. It is also much more fun to break things locally than to experience problems in production.

### What Makes up a Delta Lake Table?
A Delta Lake table is composed of the `_delta_log` directory plus the data files, named `part-{uuid}.c000.{compression}.parquet`. For partitioned tables, the data files live inside `partition`-based directories. For simple tables, they sit directly in the table root.

```
./hitchhikers_guide/datasets/ecomm_behavior_data/delta/
└── ecomm
    ├── _delta_log
    │   ├── 00000000000000000000.json
    │   ├── 00000000000000000001.json
    └── event_date=2019-10-01
        ├── part-00002-abb10ec6-6425-4ef1-91e8-ccb05489fa35.c000.zstd.parquet
        ├── part-00004-2eda7d21-dcf8-48b5-8e76-0dc6e71575d2.c000.zstd.parquet
```
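
Each numbered file in `_delta_log` is newline-delimited JSON, where every line holds one action such as `commitInfo`, `metaData`, `protocol`, `add`, or `remove`. The sketch below counts the actions per commit using only the standard library. The helpers `summarize_commit` and `summarize_log` are hypothetical, and the sketch only looks at `*.json` commit files, ignoring checkpoints.

```python
import json
from pathlib import Path
from typing import Dict, List, Tuple


def summarize_commit(commit_file: Path) -> Dict[str, int]:
    """Count the action types (commitInfo, metaData, add, ...) in one commit file."""
    counts: Dict[str, int] = {}
    for line in commit_file.read_text().splitlines():
        if not line.strip():
            continue
        action = json.loads(line)  # one JSON object per line
        for action_type in action:
            counts[action_type] = counts.get(action_type, 0) + 1
    return counts


def summarize_log(table_path: str) -> List[Tuple[str, Dict[str, int]]]:
    """Summarize every JSON commit under <table_path>/_delta_log, in version order."""
    log_dir = Path(table_path) / "_delta_log"
    return [(f.name, summarize_commit(f)) for f in sorted(log_dir.glob("*.json"))]


# example usage against the table created in this notebook (path as seen from your machine):
#   for name, counts in summarize_log("./hitchhikers_guide/datasets/ecomm_behavior_data/delta/ecomm"):
#       print(name, counts)
```

Because the file names are zero-padded version numbers, sorting them by name also sorts them by table version.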

You will notice the `zstd` compression. ZSTD compression is like compression on steroids. We set this earlier on in the notebook using `spark.conf.set("spark.sql.parquet.compression.codec", "zstd")`.

For comparison, if you comment out the line in the first cell of the notebook, Spark will use the default `snappy` compression codec. This is still a powerful compression codec, but compare the size on disk:

```
ls -lh ./hitchhikers_guide/datasets/ecomm_behavior_data/delta/ecomm/event_date=2019-10-01
-rw-r--r--  1 {me}  staff    37M Jun 19 13:57 part-00002-3e39ad07-cc35-4ef1-a0eb-77652c3cbc07.c000.snappy.parquet
-rw-r--r--  1 {me}  staff   5.6M Jun 19 13:57 part-00004-fb1843f4-bdb7-4578-b14a-257b2525f6b5.c000.snappy.parquet
-rw-r--r--  1 {me}  staff    18M Jun 19 13:59 part-00002-abb10ec6-6425-4ef1-91e8-ccb05489fa35.c000.zstd.parquet
-rw-r--r--  1 {me}  staff   3.7M Jun 19 13:59 part-00004-2eda7d21-dcf8-48b5-8e76-0dc6e71575d2.c000.zstd.parquet
```

> For the exact same data, zstd compression results in a ~51% reduction in size for the larger file (37 MB to 18 MB) and a ~34% reduction for the smaller file (5.6 MB to 3.7 MB). Overall, that is roughly half the size on disk, which is bonkers.
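
The percentages can be checked with a few lines of arithmetic. The sizes below are read off the `ls -lh` listing above, so they are rounded to what `ls` printed. The helper `reduction_pct` is hypothetical.

```python
def reduction_pct(before_mb: float, after_mb: float) -> float:
    """Percentage by which a file shrank, relative to its original size."""
    return (before_mb - after_mb) / before_mb * 100


# sizes in MB from the listing: (snappy, zstd)
files = {"part-00002": (37.0, 18.0), "part-00004": (5.6, 3.7)}

for name, (snappy_mb, zstd_mb) in files.items():
    print(f"{name}: {reduction_pct(snappy_mb, zstd_mb):.0f}% smaller with zstd")

# compare the combined size of both files
total_snappy = sum(snappy_mb for snappy_mb, _ in files.values())
total_zstd = sum(zstd_mb for _, zstd_mb in files.values())
overall = reduction_pct(total_snappy, total_zstd)
print(f"overall: {overall:.0f}% smaller with zstd")
```

Note the difference between "the new file is 66% of the old size" and "the file shrank by 66%". Only the second is a reduction, and this function computes the reduction.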

## Converting an existing External Delta Lake Table to a Managed Table
The Delta Lake table we created uses the `./delta/ecomm/` path on the filesystem. This means we need to understand where in the world a given Table lives, which is not a big problem when there are only a few tables (probably stored somewhere using AWS S3 or Azure Blob Storage, or Google Cloud Storage), but this becomes more problematic as more and more tables become available. At a certain point, it becomes essential to use Managed tables.

> Managed Delta Lake tables use the Hive Metastore (or a Hive-compatible metastore) for OSS Delta, and if you're working inside Databricks, you can just use [Unity Catalog](https://www.databricks.com/product/unity-catalog) to combine access control and authentication with your table metadata.

Given this project is all about using OSS Delta, we're riding the `local` spark-warehouse route, which can be seen under the `spark-warehouse` directory to the left of this notebook (in the filesystem view). You'll also notice there is a `metastore_db`. This directory stores the information commonly stored in the Hive Metastore.

In [None]:
# if you want to check what current databases exist, or what tables exist you can use the following.
spark.catalog.listDatabases()
spark.catalog.listTables()


In [None]:
# Create the Managed Table Definition
# also note - we are keeping the `unmanaged` location intact and copying files into the new location
# The only difference between the external table definition and the managed table definition is that the
# managed table omits the LOCATION clause: it is addressed as `database.table` rather than by path (delta.`/path/to/table`).
spark.sql(f"""
  CREATE TABLE IF NOT EXISTS default.`{dl_managed_table}` (
    event_time TIMESTAMP,
    event_type STRING,
    product_id INTEGER,
    category_id BIGINT,
    category_code STRING,
    brand STRING,
    price FLOAT,
    user_id INTEGER,
    user_session STRING,
    event_date DATE
  ) USING DELTA
  PARTITIONED BY (event_date)
  TBLPROPERTIES(
    'delta.logRetentionDuration'='interval 28 days',
    'catalog.team_name'='dldg_authors',
    'catalog.engineering.comms.slack'='https://delta-users.slack.com/archives/CG9LR6LN4'
  );
""")

In [None]:
# Add some additional table properties like the table classification (pii? all-access?)
spark.sql(f"""
  ALTER TABLE default.`{dl_managed_table}` 
  SET TBLPROPERTIES (
    'catalog.engineering.comms.email'='dldg_authors@gmail.com',
    'catalog.table.classification'='all-access'
  )""")

In [None]:
# a little glimpse into the table history (what actions have occurred - otherwise known as transactions)
(DeltaTable.forName(spark, f"default.{dl_managed_table}")
 .history(10)
 .select("version", "timestamp", "operation", "operationParameters")
 .show(10, truncate=True, vertical=False)
)

In [None]:
# using the createIfNotExists utility
# can be used instead of the prior two cells
(DeltaTable.createIfNotExists(spark)
    .tableName(f"default.{dl_managed_table}")
    .property("description", "Retail Ecomm Dataset. This can be used to forecast holiday seasonality for multiple categories")
    .addColumn("event_time", "TIMESTAMP")
    .addColumn("event_type", "STRING")
    .addColumn("product_id", "INTEGER")
    .addColumn("category_id", "BIGINT")
    .addColumn("category_code", "STRING")
    .addColumn("brand", "STRING")
    .addColumn("price", "FLOAT")
    .addColumn("user_id", "INTEGER")
    .addColumn("user_session", "STRING")
    .addColumn("event_date", "DATE")
    .partitionedBy("event_date")
    .property("catalog.team_name", "dldg_authors")
    .property("catalog.engineering.comms.slack",
              "https://delta-users.slack.com/archives/CG9LR6LN4")
    .property("catalog.engineering.comms.email","dldg_authors@gmail.com")
    .property("catalog.table.classification","all-access")
    .execute()
)

The only immediate difference between creating a non-managed table and a managed table all comes down to the table location: 

```
delta.`{delta_path}/{dl_unmanaged_table}` vs default.`{dl_managed_table}`
```

With the managed table, you can also create additional `databases` using the `CREATE DATABASE` syntax. We are currently using the `default` database.

In [None]:
# see the unmanaged Delta Table path
print(f"{delta_path}/{dl_unmanaged_table}")

In [None]:
# look up the unmanaged source table by path
#dt = DeltaTable.forPath(spark, f"{delta_path}/{dl_unmanaged_table}").detail().show(1, truncate=False, vertical=True)

In [None]:
# note: the startup.sh adds this information to the session. Having a shared warehouse directory allows us to reuse tables between notebooks
spark.conf.get("spark.sql.warehouse.dir")

In [None]:
# read from the external Delta Lake location
# write into the Managed Delta Lake location

sourceTable = DeltaTable.forPath(spark, f"{delta_path}/{dl_unmanaged_table}")
tableDf = sourceTable.toDF()
(
    tableDf
    .where(col("event_date").eqNullSafe("2019-10-01"))
    #.where(col("event_date").isin("2019-11-28", "2019-11-29", "2019-11-30"))
    .write
    .format("delta")
    .mode("append")
    .saveAsTable(f"default.{dl_managed_table}")
)

## Inspect the Table
> Now that you have data in both your unmanaged and managed tables, take a peek. What is interesting to you?


In [None]:
# peek at table details
(DeltaTable.forName(spark, f"default.{dl_unmanaged_table}")
   .detail()
   .show(1, truncate=False, vertical=True)
)
# view differences in describe extended
#spark.sql("describe extended default.ecomm_by_day").show(truncate=True, vertical=False)

In [None]:
# Looking at just the Table Properties
table_info = DeltaTable.forName(spark, "default.ecomm_by_day").detail()
# view the table properties locally (call first then slice by the properties index)
tblproperties = table_info.first()['properties']
tblproperties
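
Since the `properties` value comes back as a plain Python dictionary, it can be sliced further with ordinary dictionary tools. The sketch below keeps only the keys under a given prefix. The helper `properties_with_prefix` is hypothetical, and `example_properties` is a hand-written stand-in for `tblproperties` that reuses values set earlier in this notebook.

```python
from typing import Dict


def properties_with_prefix(properties: Dict[str, str], prefix: str) -> Dict[str, str]:
    """Keep only the table properties whose key starts with `prefix`, sorted by key."""
    return {key: value
            for key, value in sorted(properties.items())
            if key.startswith(prefix)}


# stand-in for `table_info.first()['properties']`
example_properties = {
    "delta.logRetentionDuration": "interval 28 days",
    "catalog.team_name": "dldg_authors",
    "catalog.table.classification": "all-access",
}

catalog_only = properties_with_prefix(example_properties, "catalog.")
print(catalog_only)
```

In the notebook you would pass `tblproperties` instead of `example_properties`. Namespacing keys with a prefix such as `catalog.` is what makes this kind of filtering possible.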

In [None]:
spark.table(f"default.{dl_managed_table}").show(2, truncate=True, vertical=True)

## Converting an existing Parquet Table to Delta Lake
Using the `convertToDelta` method via the `DeltaTable` Python utility enables us to easily create our Delta Lake table in place. In place just means that the table does not have to be copied or moved. Furthermore, since Delta Lake uses Parquet, all that changes is the addition of the `_delta_log` directory in the root of the table.

> note: if you want to use the convertToDelta utility function, just uncomment the following cell and run it, otherwise, skip on to creating a new Delta Lake table.

In [None]:
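# The convertToDelta method will take an existing `Parquet` table, and convert it in place to a Delta Lake table.
# Because the source parquet table is partitioned by event_date, the partition schema is passed as the third argument.
#parquet_table_dir = f"{dataset_dir}/parquet/{source_dir}/"
# dt = DeltaTable.convertToDelta(spark, f"parquet.`{parquet_table_dir}`", "event_date DATE")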

# What We Learned
1. How to use an existing Parquet Table to create an Unmanaged and Managed Delta Lake Table
2. How to View the Table Metadata using the `describe extended table` SQL command and the `DeltaTable.forName...detail()` view.
3. How to slice the Table detail and view the `properties`. This allows us to quickly view important metadata about the Delta Lake table, and we'll see how to use the Table Properties more and more in other parts of the Guide.

## What's Next?
Now we can use the Managed `ecomm_by_day` table as we explore how to use Delta Lake Streaming.

1. [Delta Lake Streaming 101](./dl-streaming-101.ipynb) is a gentle introduction to Delta Lake Streaming. This is a necessary part of learning how to effectively use Delta Lake for fun and profit.