## Without Delta Pipeline, with Spark and Parquet

![stream](https://kpistoropen.blob.core.windows.net/collateral/delta/non-delta-new.png)

In [2]:
#quick clean up of demo folder:
dbutils.fs.rm("dbfs:/workshop/nodelta/", True)

#### Historical and new data are often written as very small files in very small directories (for example, by Event Hubs Capture):
+ This data is also partitioned by arrival time, not event time!
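To see why partitioning by arrival time matters, here is a minimal plain-Python sketch (not Spark). The three events and their timestamps are hypothetical; the second one occurs just before midnight but arrives just after it, so it lands in a different daily bucket depending on which timestamp is used.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events as (event_time, arrival_time) pairs.
events = [
    (datetime(2018, 1, 1, 23, 50), datetime(2018, 1, 1, 23, 55)),
    (datetime(2018, 1, 1, 23, 58), datetime(2018, 1, 2, 0, 3)),   # arrives after midnight
    (datetime(2018, 1, 2, 8, 0), datetime(2018, 1, 2, 8, 1)),
]

def partition_counts(events, key_index):
    """Count events per calendar day, keyed on event time (0) or arrival time (1)."""
    counts = defaultdict(int)
    for event in events:
        counts[event[key_index].date().isoformat()] += 1
    return dict(counts)

by_event_time = partition_counts(events, 0)
by_arrival_time = partition_counts(events, 1)
print(by_event_time)
print(by_arrival_time)
```

A query for "everything that happened on 2018-01-01" would miss the late event if it only read the arrival-time folder for that day.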

![stream](https://docs.microsoft.com/en-us/azure/data-lake-store/media/data-lake-store-archive-eventhub-capture/data-lake-store-eventhub-data-sample.png)

# Step 0: Read data

In [5]:
%fs ls mnt/databricks-workshop-datasets/Contoso-retail/structured-streaming/events

path,name,size
dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/structured-streaming/events/file-0.json,file-0.json,72530
dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/structured-streaming/events/file-1.json,file-1.json,72961
dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/structured-streaming/events/file-10.json,file-10.json,73025
dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/structured-streaming/events/file-11.json,file-11.json,72999
dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/structured-streaming/events/file-12.json,file-12.json,72987
dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/structured-streaming/events/file-13.json,file-13.json,73006
dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/structured-streaming/events/file-14.json,file-14.json,73003
dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/structured-streaming/events/file-15.json,file-15.json,73007
dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/structured-streaming/events/file-16.json,file-16.json,72978
dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/structured-streaming/events/file-17.json,file-17.json,73008


In [6]:
%fs head "/mnt/databricks-workshop-datasets/Contoso-retail/structured-streaming/events/file-0.json"

In [7]:
from pyspark.sql.functions import expr

# Read the raw events, then add a couple of columns for demo purposes
rawData = spark.read \
  .option("inferSchema", "true") \
  .json("/mnt/databricks-workshop-datasets/Contoso-retail/structured-streaming/events") \
  .drop("time") \
  .withColumn("date", expr("cast(concat('2018-01-', cast(rand(5) * 30 as int) + 1) as date)")) \
  .withColumn("deviceId", expr("cast(rand(5) * 100 as int)"))
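The arithmetic behind the two demo columns can be sketched in plain Python. `demo_columns` is a hypothetical helper, not part of the notebook, and it takes the random draw `r` as an argument instead of calling Spark's `rand`. Both expressions use the same seed, so the sketch assumes they see the same draw for a given row; the sample output in this notebook is consistent with that (for instance `2018-01-03` appears with device `8`), but treat it as an assumption.

```python
def demo_columns(r):
    """Mirror the two SQL expressions for a single random draw r in [0, 1)."""
    day = int(r * 30) + 1        # cast(rand * 30 as int) + 1  -> 1..30
    device_id = int(r * 100)     # cast(rand * 100 as int)     -> 0..99
    # The SQL builds the string '2018-01-<day>' and casts it to a date;
    # here we simply format the resulting date.
    return f"2018-01-{day:02d}", device_id

print(demo_columns(0.085))
print(demo_columns(0.575))
```

Under that assumption, the date and the device ID are not independent: a low draw gives both an early date and a low device ID.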

In [8]:
rawData.rdd.getNumPartitions()

In [9]:
display(rawData)

action,date,deviceId
Close,2018-01-03,8
Close,2018-01-18,57
Open,2018-01-21,69
Close,2018-01-14,43
Open,2018-01-15,49
Open,2018-01-18,59
Close,2018-01-18,57
Close,2018-01-01,0
Close,2018-01-01,0
Open,2018-01-25,83


# Step 1: Write out raw data and create staging table

In [11]:
#Define path where to write to -- by default, in this workshop, we write to the workspace filestore
writeBase = "dbfs:/workshop/nodelta/"
writePath = writeBase + "iotPipeline/"

#If multiple users are working on the same instance, use this writeBase instead, replacing $USERNAME with your user name here and in every subsequent write/read
#writeBase = "dbfs:/workshop/nodelta/$USERNAME/"
#writePath = writeBase + "iotPipeline/"

#As a backup, you can always write to this blob
#writeBase = "dbfs:/mnt/databricks-workshop-exercises/Contoso-retail/nodelta/"
#writePath = writeBase + "iotPipeline/"

In [12]:
#make sure it uses writePath
rawData.write.format("parquet").partitionBy("date").save(writePath)

In [13]:
%fs ls dbfs:/workshop/nodelta/iotPipeline/

path,name,size
dbfs:/workshop/nodelta/iotPipeline/_SUCCESS,_SUCCESS,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-01/,date=2018-01-01/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-02/,date=2018-01-02/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-03/,date=2018-01-03/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-04/,date=2018-01-04/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-05/,date=2018-01-05/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-06/,date=2018-01-06/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-07/,date=2018-01-07/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-08/,date=2018-01-08/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-09/,date=2018-01-09/,0


In [14]:
%fs ls dbfs:/workshop/nodelta/iotPipeline/date=2018-01-01

In [15]:
%sql
-- make sure it uses writePath
DROP TABLE IF EXISTS demo_iot_data;
CREATE TABLE demo_iot_data (action STRING, date DATE, deviceID INTEGER)
USING parquet
OPTIONS (path = "dbfs:/workshop/nodelta/iotPipeline/")
PARTITIONED BY (date)

# Step 2: Query the data

In [17]:
%sql
SELECT count(*) FROM demo_iot_data

count(1)
0


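Wait, no results? That's strange. The files are on disk, but the table was just created and the metastore has not registered any of the `date=` partition directories yet. Let's repair the table.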
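Conceptually, the repair step scans the table location for Hive-style `column=value` directories and registers them. The following is a minimal plain-Python sketch of that idea, not Spark's implementation: `catalog` is a hypothetical stand-in for the metastore, and a temporary local directory stands in for the DBFS path.

```python
import os
import tempfile

def discover_partitions(table_path, column="date"):
    """List Hive-style partition directories (column=value) under table_path."""
    prefix = column + "="
    return sorted(
        name for name in os.listdir(table_path)
        if name.startswith(prefix) and os.path.isdir(os.path.join(table_path, name))
    )

# Hypothetical metastore: it knows the table, but no partitions yet.
catalog = {"demo_iot_data": []}

with tempfile.TemporaryDirectory() as table_path:
    # Data written directly to storage, as the Parquet write did.
    for day in ("2018-01-01", "2018-01-02"):
        os.makedirs(os.path.join(table_path, "date=" + day))
    open(os.path.join(table_path, "_SUCCESS"), "w").close()

    visible_before = list(catalog["demo_iot_data"])              # what a query would scan
    catalog["demo_iot_data"] = discover_partitions(table_path)   # the "repair" step
    visible_after = list(catalog["demo_iot_data"])

print(visible_before)
print(visible_after)
```

Until the catalog is updated, a query plans against an empty partition list, which is why the count comes back as 0 even though the files exist.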

In [19]:
%sql
SHOW PARTITIONS demo_iot_data

partition


In [20]:
%sql

MSCK REPAIR TABLE demo_iot_data

In [21]:
%sql
SHOW PARTITIONS demo_iot_data

partition
date=2018-01-01
date=2018-01-02
date=2018-01-03
date=2018-01-04
date=2018-01-05
date=2018-01-06
date=2018-01-07
date=2018-01-08
date=2018-01-09
date=2018-01-10


In [22]:
%sql

SELECT count(*) FROM demo_iot_data

count(1)
100000


# Step 3: Appending new data

In [24]:
new_data = spark.range(100000) \
  .selectExpr("'Open' as action", "'2018-01-30' date") \
  .withColumn("deviceId", expr("cast(rand(5) * 500 as int)"))

In [25]:
display(new_data)

action,date,deviceId
Open,2018-01-30,43
Open,2018-01-30,289
Open,2018-01-30,348
Open,2018-01-30,219
Open,2018-01-30,247
Open,2018-01-30,296
Open,2018-01-30,289
Open,2018-01-30,3
Open,2018-01-30,3
Open,2018-01-30,415


## Note: Simply appending to the production table is dangerous.
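One reason is that a directory of Parquet files has no transaction log, so nothing records whether a multi-file write finished. The sketch below is a deliberately simplified plain-Python illustration; `append_files` and the file names are hypothetical, and Spark's real commit protocol is more careful than this. It only shows the general shape of the problem: a job that dies partway can leave files behind that later readers pick up.

```python
import os
import tempfile

def append_files(partition_dir, file_names, fail_after=None):
    """Write files one by one; optionally fail partway, as a crashed job would."""
    for i, name in enumerate(file_names):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated job failure")
        with open(os.path.join(partition_dir, name), "w") as f:
            f.write("rows")

with tempfile.TemporaryDirectory() as partition_dir:
    new_files = [f"part-{i:05d}.parquet" for i in range(5)]
    try:
        append_files(partition_dir, new_files, fail_after=3)
    except RuntimeError:
        pass  # the job is gone, but its partial output is not
    leftover = sorted(os.listdir(partition_dir))

print(leftover)
```

Anyone listing the directory afterwards sees three of the five intended files and has no way to tell that the append was incomplete.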

In [27]:
new_data.write.format("parquet").partitionBy("date").mode("append").save(writePath)

In [28]:
%fs ls dbfs:/workshop/nodelta/iotPipeline/

path,name,size
dbfs:/workshop/nodelta/iotPipeline/_SUCCESS,_SUCCESS,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-01/,date=2018-01-01/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-02/,date=2018-01-02/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-03/,date=2018-01-03/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-04/,date=2018-01-04/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-05/,date=2018-01-05/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-06/,date=2018-01-06/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-07/,date=2018-01-07/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-08/,date=2018-01-08/,0
dbfs:/workshop/nodelta/iotPipeline/date=2018-01-09/,date=2018-01-09/,0


# Step 4: Query should show new results

In [30]:
%sql

SELECT count(*) FROM demo_iot_data

count(1)
100000


That's strange. The new `date=2018-01-30` partition has not been registered yet. Well, we can repair the table again, right?

In [32]:
%sql

MSCK REPAIR TABLE demo_iot_data

In [33]:
%sql

SELECT count(*) FROM demo_iot_data

count(1)
200000


# Step 5: Upserts / Changes (on previously written data)

In [35]:
new_data.count()

In [36]:
new_data.drop("date").write.format("parquet").mode("overwrite").save(writePath + "date=2018-01-30/")

# Step 6: Query should reflect new data

In [38]:
%sql

SELECT count(*) FROM demo_iot_data

That's strange. I guess we need to refresh the metadata.
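The underlying issue is that the table remembers which files it listed earlier, while the overwrite replaced those files directly on storage. Here is a minimal plain-Python sketch of that staleness. `CachedTable` is a hypothetical class, not a Spark API, and raising an error on a missing file is just one way the problem can surface; depending on the runtime, a stale cache might instead return an outdated result.

```python
import os
import tempfile

class CachedTable:
    """Hypothetical table handle that remembers the files it listed last time."""

    def __init__(self, path):
        self.path = path
        self.files = None

    def refresh(self):
        """Re-list the directory, discarding the cached file names."""
        self.files = sorted(os.listdir(self.path))

    def scan(self):
        """Return the cached file list, failing if a cached file has vanished."""
        if self.files is None:
            self.refresh()
        for name in self.files:
            if not os.path.exists(os.path.join(self.path, name)):
                raise FileNotFoundError(name)
        return list(self.files)

with tempfile.TemporaryDirectory() as partition_dir:
    old_file = os.path.join(partition_dir, "part-old.parquet")
    open(old_file, "w").close()

    table = CachedTable(partition_dir)
    first_scan = table.scan()          # caches ["part-old.parquet"]

    # Overwrite the partition directly on storage, behind the table's back.
    os.remove(old_file)
    open(os.path.join(partition_dir, "part-new.parquet"), "w").close()

    try:
        table.scan()
        stale_failed = False
    except FileNotFoundError:
        stale_failed = True

    table.refresh()                    # analogous in spirit to REFRESH TABLE
    after_refresh = table.scan()

print(stale_failed, after_refresh)
```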

In [40]:
%sql

REFRESH TABLE demo_iot_data

In [41]:
%sql

SELECT count(*) FROM demo_iot_data

# Step 7: Add historical data

In [43]:
from pyspark.sql.functions import expr
  
old_batch_data = spark.range(100000) \
  .repartition(200) \
  .selectExpr("'Open' as action", "cast(concat('2018-01-', cast(rand(5) * 15 as int) + 1) as date) as date") \
  .withColumn("deviceId", expr("cast(rand(5) * 100 as int)"))

old_batch_data.write.format("parquet").partitionBy("date").mode("append").save(writePath)

In [44]:
%sql

SELECT count(*) FROM demo_iot_data

The count won't be up to date until we call refresh.

In [46]:
%sql

REFRESH TABLE demo_iot_data

In [47]:
%sql

SELECT count(*) FROM demo_iot_data

# Performance Improvements

Now we want to build other pipelines on this data: write it out to our data lake and let data scientists query it quickly. The query above took 7 seconds even though there is not much data, so the data is probably just poorly laid out.

To get reasonable performance, you would have to build a whole pipeline just to manage file sizes and optimize the layout for later querying.
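The core of such a file-management pipeline is deciding which small files to rewrite together. Below is a minimal plain-Python sketch of that planning step, assuming a simple greedy grouping; `plan_compaction`, the file names, and the sizes are all hypothetical, and a real job would still have to read each batch, write one larger file, and swap it in safely.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group files, in order, into batches of at most target_bytes each.

    A single file larger than target_bytes still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Start a new batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Ten hypothetical 1,000-byte files, compacted toward 4,000-byte outputs.
small_files = [(f"part-{i:05d}.parquet", 1_000) for i in range(10)]
plan = plan_compaction(small_files, target_bytes=4_000)
print([len(batch) for batch in plan])
```

Here ten small files become three rewrite batches, so a later query opens three files in that partition instead of ten.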

In [49]:
%fs ls dbfs:/workshop/nodelta/iotPipeline/

In [50]:
%fs ls dbfs:/workshop/nodelta/iotPipeline/date=2018-01-01/

### Use the cell below as the benchmark for a no-delta pipeline

In [52]:
%sql

SELECT count(*) FROM demo_iot_data

## Next Step

[Simple Pipeline with Delta]($../5-Delta/5-02 Simple Pipeline with Delta)

&copy; 2018 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>