## Use Databricks Delta

This example uses public data from Lending Club covering all funded loans from 2012 to 2017. Each loan record includes information provided by the applicant, the current loan status (Current, Late, Fully Paid, etc.), and the latest payment information. For a full description of the fields, see the data dictionary available [here](https://resources.lendingclub.com/LCDataDictionary.xlsx). Download the data sample [here](https://www.lendingclub.com/info/download-data.action).

![Loan_Data](https://preview.ibb.co/d3tQ4R/Screen_Shot_2018_02_02_at_11_21_51_PM.png)

## About Databricks Delta

Databricks Delta is an optimization layer on top of blob storage that provides performance, reliability, and low latency for streaming and batch data pipelines:
* Open-source Parquet columnar file format
* Performance: Indexing and Partitioning.
* Reliability: ACID compliance, ANSI SQL UPDATE, DELETE, and MERGE commands, and Schema Validation.
* Low Latency: Auto-compaction for real-time streaming ingest of data.

## Import Data and create pre-Databricks Delta Table
* This creates many small Parquet files, emulating the typical small-file problem that occurs with streaming or highly transactional data.

In [0]:
loan_stats = spark.table("loanstats3a_csv")
print(str(loan_stats.count()) + " loans opened by Lending Club...")

# Create pre-Databricks Delta table
loan_stats.repartition(200).write.parquet("/dennyl/loan_stats_predelta.pq")
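
As a rough illustration of why `repartition(200)` leads to small files, the sketch below divides the table's row count by the number of partitions. It assumes the 42537 rows that this notebook reports for the table and an even spread of rows across partitions; the helper name `average_rows_per_file` is made up for this example.

```python
def average_rows_per_file(total_rows, num_partitions):
    """Average number of rows each output file receives after repartitioning."""
    return total_rows / num_partitions

# 42537 is the row count this notebook reports for the table;
# 200 is the argument passed to repartition() above.
avg_rows = average_rows_per_file(42537, 200)
print(f"about {avg_rows:.0f} rows per Parquet file")
```

With only a couple of hundred rows per file, each file is dominated by fixed per-file overhead, which is the situation the rest of this notebook sets out to fix.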

## Review Parquet file structure for pre-Databricks Delta Table

In [0]:
%fs ls /dennyl/loan_stats_predelta.pq/

path,name,size
dbfs:/dennyl/loan_stats_predelta.pq/_SUCCESS,_SUCCESS,0
dbfs:/dennyl/loan_stats_predelta.pq/_committed_6870305189401602184,_committed_6870305189401602184,19622
dbfs:/dennyl/loan_stats_predelta.pq/_started_6870305189401602184,_started_6870305189401602184,0
dbfs:/dennyl/loan_stats_predelta.pq/part-00000-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-98-c000.snappy.parquet,part-00000-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-98-c000.snappy.parquet,102912
dbfs:/dennyl/loan_stats_predelta.pq/part-00001-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-99-c000.snappy.parquet,part-00001-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-99-c000.snappy.parquet,91745
dbfs:/dennyl/loan_stats_predelta.pq/part-00002-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-100-c000.snappy.parquet,part-00002-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-100-c000.snappy.parquet,101098
dbfs:/dennyl/loan_stats_predelta.pq/part-00003-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-101-c000.snappy.parquet,part-00003-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-101-c000.snappy.parquet,101362
dbfs:/dennyl/loan_stats_predelta.pq/part-00004-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-102-c000.snappy.parquet,part-00004-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-102-c000.snappy.parquet,95643
dbfs:/dennyl/loan_stats_predelta.pq/part-00005-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-103-c000.snappy.parquet,part-00005-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-103-c000.snappy.parquet,91307
dbfs:/dennyl/loan_stats_predelta.pq/part-00006-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-104-c000.snappy.parquet,part-00006-tid-6870305189401602184-513fb086-72fc-4dca-96a9-4a517cc4e19f-104-c000.snappy.parquet,94828
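
To put numbers on the small-file problem, a listing like the one above can be summarized with plain Python. This is a minimal sketch: it assumes the listing is available as CSV text with the `path,name,size` header shown, and it uses shortened, hypothetical file names (with sizes taken from the first two data files) to stay readable.

```python
import csv
import io

# Sizes come from the first two data files in the listing above; the long
# file names are shortened here (hypothetical stand-ins) for readability.
LISTING = """path,name,size
dbfs:/dennyl/loan_stats_predelta.pq/_SUCCESS,_SUCCESS,0
dbfs:/dennyl/loan_stats_predelta.pq/part-00000.snappy.parquet,part-00000.snappy.parquet,102912
dbfs:/dennyl/loan_stats_predelta.pq/part-00001.snappy.parquet,part-00001.snappy.parquet,91745
"""

def summarize_listing(listing_text):
    """Return (data_file_count, total_bytes, mean_bytes) for Parquet data files."""
    rows = csv.DictReader(io.StringIO(listing_text))
    # Skip marker files such as _SUCCESS; keep only Parquet data files.
    sizes = [int(row["size"]) for row in rows if row["name"].endswith(".parquet")]
    total = sum(sizes)
    mean = total / len(sizes) if sizes else 0.0
    return len(sizes), total, mean

count, total, mean = summarize_listing(LISTING)
print(f"{count} data files, {total} bytes total, {mean / 1024:.1f} KiB on average")
```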


In [0]:
# How long does a simple count take with a lot of small files?
spark.read.parquet("/dennyl/loan_stats_predelta.pq/").count()

This query takes ~10 seconds.
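
The timing quoted here comes from the notebook UI. To capture a comparable number in code, one option is a small wall-clock helper like the sketch below. The `timed` function is a hypothetical helper, not part of Spark or Databricks, and the workload is a stand-in so the example runs anywhere.

```python
import time

def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload so the sketch runs outside a Spark session. In the
# notebook you would pass the Spark action instead, for example:
#   timed(lambda: spark.read.parquet("/dennyl/loan_stats_predelta.pq/").count())
result, elapsed = timed(lambda: sum(range(1_000_000)))
print(f"result={result}, elapsed={elapsed:.3f}s")
```

Wall-clock timings vary with cluster size and caching, so treat a single measurement as indicative only.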

In [0]:
# Remove the Databricks Delta table files if they exist (True = recursive)
dbutils.fs.rm("/dennyl/loan_stats_delta.pq", True)

## Recreate same table for downstream Databricks Delta processing

In [0]:
# Import Data
loan_stats = spark.table("loanstats3a_csv")
print(str(loan_stats.count()) + " loans opened by Lending Club...")

# Create Databricks Delta table
loan_stats.repartition(200).write.format("delta").save("/dennyl/loan_stats_delta.pq")

## Review table file structure
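* Note that, at this point, the Parquet data files have the same sizes as those of the pre-Databricks Delta table; the visible difference is the added `_delta_log/` directory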

In [0]:
%fs ls /dennyl/loan_stats_delta.pq

path,name,size
dbfs:/dennyl/loan_stats_delta.pq/_delta_log/,_delta_log/,0
dbfs:/dennyl/loan_stats_delta.pq/part-00000-403cda02-739d-4199-bd83-cb03c8aac238-c000.snappy.parquet,part-00000-403cda02-739d-4199-bd83-cb03c8aac238-c000.snappy.parquet,102912
dbfs:/dennyl/loan_stats_delta.pq/part-00001-ff5c094e-3341-496f-a81c-c9c7f7586824-c000.snappy.parquet,part-00001-ff5c094e-3341-496f-a81c-c9c7f7586824-c000.snappy.parquet,91745
dbfs:/dennyl/loan_stats_delta.pq/part-00002-23db680b-54c8-46c8-a041-2a8dae5115ac-c000.snappy.parquet,part-00002-23db680b-54c8-46c8-a041-2a8dae5115ac-c000.snappy.parquet,101098
dbfs:/dennyl/loan_stats_delta.pq/part-00003-0696edf0-23ae-4f3d-8b75-322f63312c3f-c000.snappy.parquet,part-00003-0696edf0-23ae-4f3d-8b75-322f63312c3f-c000.snappy.parquet,101362
dbfs:/dennyl/loan_stats_delta.pq/part-00004-cf2787bc-1e7c-47c4-a6c5-63e116057c2f-c000.snappy.parquet,part-00004-cf2787bc-1e7c-47c4-a6c5-63e116057c2f-c000.snappy.parquet,95643
dbfs:/dennyl/loan_stats_delta.pq/part-00005-e80eaad6-4d60-4eb8-b989-b8efc3e1aaa6-c000.snappy.parquet,part-00005-e80eaad6-4d60-4eb8-b989-b8efc3e1aaa6-c000.snappy.parquet,91307
dbfs:/dennyl/loan_stats_delta.pq/part-00006-1d3a370f-2778-46b2-a9c7-322c61be9988-c000.snappy.parquet,part-00006-1d3a370f-2778-46b2-a9c7-322c61be9988-c000.snappy.parquet,94828
dbfs:/dennyl/loan_stats_delta.pq/part-00007-c0c1e2a3-cf98-4819-939d-e24af7732e29-c000.snappy.parquet,part-00007-c0c1e2a3-cf98-4819-939d-e24af7732e29-c000.snappy.parquet,92544
dbfs:/dennyl/loan_stats_delta.pq/part-00008-a5300c74-452c-420b-9a5d-fce6bb8fd40c-c000.snappy.parquet,part-00008-a5300c74-452c-420b-9a5d-fce6bb8fd40c-c000.snappy.parquet,89791


## Create Databricks Delta Table and Optimize
* Using Spark SQL, we can create a Databricks Delta table via `USING DELTA`
* Afterwards, we can execute `OPTIMIZE`, which merges the small files together into larger ones
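
The idea of merging small files can be pictured as grouping them into bins up to a target size. The sketch below is an illustration only, not the algorithm `OPTIMIZE` actually uses; the greedy strategy, the `plan_compaction` name, and the 300,000-byte target are all assumptions chosen for the example. The sizes are the first seven data files from the listing above.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group file sizes into bins of at most target_bytes each.

    A file larger than target_bytes ends up alone in its own bin.
    """
    bins, current, current_size = [], [], 0
    for size in file_sizes:
        # Close the current bin if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

small_files = [102912, 91745, 101098, 101362, 95643, 91307, 94828]
plan = plan_compaction(small_files, target_bytes=300_000)
for index, group in enumerate(plan):
    print(f"output file {index}: {len(group)} inputs, {sum(group)} bytes")
```

Each bin would become one larger output file, so fewer files have to be opened to answer a query.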

In [0]:
%sql
DROP TABLE IF EXISTS loan_stats_delta 

In [0]:
%sql
CREATE TABLE loan_stats_delta
USING DELTA
LOCATION '/dennyl/loan_stats_delta.pq'

In [0]:
%sql
select count(1) from loan_stats_delta

count(1)
42537


## Query Times pre-Databricks Delta OPTIMIZE
* As noted in the preceding cell, the count query took ~4 seconds to complete

In [0]:
%sql
-- Optimize the loan_stats_delta Databricks Delta table
OPTIMIZE loan_stats_delta

path
""


In [0]:
%fs ls /dennyl/loan_stats_delta.pq

path,name,size
dbfs:/dennyl/loan_stats_delta.pq/_delta_log/,_delta_log/,0
dbfs:/dennyl/loan_stats_delta.pq/part-00000-2d6cbdbf-9f56-4f48-a3ab-3db1fe604f1f-c000.snappy.parquet,part-00000-2d6cbdbf-9f56-4f48-a3ab-3db1fe604f1f-c000.snappy.parquet,10510576
dbfs:/dennyl/loan_stats_delta.pq/part-00000-403cda02-739d-4199-bd83-cb03c8aac238-c000.snappy.parquet,part-00000-403cda02-739d-4199-bd83-cb03c8aac238-c000.snappy.parquet,102912
dbfs:/dennyl/loan_stats_delta.pq/part-00001-ff5c094e-3341-496f-a81c-c9c7f7586824-c000.snappy.parquet,part-00001-ff5c094e-3341-496f-a81c-c9c7f7586824-c000.snappy.parquet,91745
dbfs:/dennyl/loan_stats_delta.pq/part-00002-23db680b-54c8-46c8-a041-2a8dae5115ac-c000.snappy.parquet,part-00002-23db680b-54c8-46c8-a041-2a8dae5115ac-c000.snappy.parquet,101098
dbfs:/dennyl/loan_stats_delta.pq/part-00003-0696edf0-23ae-4f3d-8b75-322f63312c3f-c000.snappy.parquet,part-00003-0696edf0-23ae-4f3d-8b75-322f63312c3f-c000.snappy.parquet,101362
dbfs:/dennyl/loan_stats_delta.pq/part-00004-cf2787bc-1e7c-47c4-a6c5-63e116057c2f-c000.snappy.parquet,part-00004-cf2787bc-1e7c-47c4-a6c5-63e116057c2f-c000.snappy.parquet,95643
dbfs:/dennyl/loan_stats_delta.pq/part-00005-e80eaad6-4d60-4eb8-b989-b8efc3e1aaa6-c000.snappy.parquet,part-00005-e80eaad6-4d60-4eb8-b989-b8efc3e1aaa6-c000.snappy.parquet,91307
dbfs:/dennyl/loan_stats_delta.pq/part-00006-1d3a370f-2778-46b2-a9c7-322c61be9988-c000.snappy.parquet,part-00006-1d3a370f-2778-46b2-a9c7-322c61be9988-c000.snappy.parquet,94828
dbfs:/dennyl/loan_stats_delta.pq/part-00007-c0c1e2a3-cf98-4819-939d-e24af7732e29-c000.snappy.parquet,part-00007-c0c1e2a3-cf98-4819-939d-e24af7732e29-c000.snappy.parquet,92544


## Query Times post-Databricks Delta OPTIMIZE
* Notice the large ~10.5 MB file (10,510,576 bytes) among all of the ~0.1 MB files
* Databricks Delta has merged all of the small files together into a larger file
* All of the smaller files are still there to ensure that no current transactions are interrupted
* As noted in the following cell, the query took less than 1 second to complete (compared with ~4 seconds before `OPTIMIZE` and ~10 seconds against the pre-Databricks Delta Parquet files)
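
The small files can remain on storage without affecting query results because the transaction log in `_delta_log/` records which files belong to the current table version through add and remove actions. The sketch below replays a heavily simplified log to show the idea. The commit contents, file names, and sizes are hypothetical, and real log entries carry many more fields than shown here.

```python
import json

# Hypothetical, simplified commits: one JSON action per line.
COMMIT_0 = "\n".join([
    json.dumps({"add": {"path": "small-0.parquet", "size": 102912}}),
    json.dumps({"add": {"path": "small-1.parquet", "size": 91745}}),
])
# A compaction commit removes the small files and adds the merged file.
# The merged size is made up for illustration.
COMMIT_1 = "\n".join([
    json.dumps({"remove": {"path": "small-0.parquet"}}),
    json.dumps({"remove": {"path": "small-1.parquet"}}),
    json.dumps({"add": {"path": "compacted-0.parquet", "size": 194657}}),
])

def live_files(commits):
    """Replay commits in order and return {path: size} for the current snapshot."""
    files = {}
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files[action["add"]["path"]] = action["add"]["size"]
            elif "remove" in action:
                files.pop(action["remove"]["path"], None)
    return files

print(live_files([COMMIT_0]))            # snapshot before compaction
print(live_files([COMMIT_0, COMMIT_1]))  # snapshot after compaction
```

A reader that started before the compaction commit keeps using the small files, while new readers see only the compacted file, even though both sets still exist on storage.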

In [0]:
%sql
select count(1) from loan_stats_delta

count(1)
42537


## Reference
For more information:
* [Databricks Delta Guide](https://docs.azuredatabricks.net/delta/index.html)
* [Building a Mobile Gaming Events Data Pipeline with Databricks Delta](https://databricks.com/blog/2018/07/02/build-a-mobile-gaming-events-data-pipeline-with-databricks-delta.html)