# Lakehouse architecture with Databricks and PySpark

We will build a lakehouse architecture on Delta Lake: an end-to-end ELT pipeline in Azure Databricks, plus some near-real-time dashboards.
Because this is Databricks, the example is programmer-friendly: the entire pipeline is written in Python, or more specifically PySpark.

The use case is a commerce company that has an e-commerce website as well as traditional retail stores. They want to analyse online clickstream data to better understand their customers.

Clickstream data describes how users interact with your e-commerce website: which ads they click, which products they view, and which pages they spend the most time on. This behavioural data gives you insight into your products and customers, so you can market to your customer base more effectively. It's important to start any data project like this with a vision. In my case, that could be to eventually develop ML models that provide product recommendations, or to identify products that customers do not like and understand the churn rate.

We will use the Dynamics product and customer data in the data lake for lookups and joins that enrich the raw data (the Bronze Delta table) and produce a more refined Silver Delta table. Finally, we aggregate the Silver data into a Gold Delta table and run some basic analytics right within Databricks.
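
To make the Bronze, Silver and Gold flow concrete before touching Spark, here is a minimal pure-Python sketch of the same idea. The sample rows, the lookup dictionaries `customers` and `products`, and the functions `to_silver` and `to_gold` are illustrative assumptions and not part of the notebook; the actual pipeline below does the equivalent work with streaming Delta tables.

```python
from collections import Counter
from datetime import datetime

# Hypothetical bronze rows: raw click events as they arrive.
bronze = [
    {"itemid": "D0002", "userid": "US_SI_0062", "device": "tablet",
     "date": datetime(2022, 4, 22, 7, 8, 53)},
    {"itemid": "M0002", "userid": "US_SI_0062", "device": "mobile",
     "date": datetime(2022, 4, 22, 8, 8, 53)},
    {"itemid": "D0002", "userid": "US_SI_0063", "device": "tablet",
     "date": datetime(2022, 4, 22, 7, 8, 58)},
]

# Hypothetical lookup data standing in for the Dynamics customer and product tables.
customers = {"US_SI_0062": "UT", "US_SI_0063": "MI"}
products = {"D0002": "Cabinet", "M0002": "MidRangeSpeakerUnit"}


def to_silver(rows):
    """Enrich each event with state and product name (left-join semantics:
    a missing lookup yields None instead of dropping the event)."""
    return [
        {**r,
         "State": customers.get(r["userid"]),
         "ProductName": products.get(r["itemid"]),
         "Hour": r["date"].hour}
        for r in rows
    ]


def to_gold(rows):
    """Aggregate the enriched events: count per (State, device, Hour)."""
    return Counter((r["State"], r["device"], r["Hour"]) for r in rows)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Each layer only depends on the one before it, which is why the notebook can run them as independent streaming queries.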

In [0]:
from pyspark.sql.functions import col, from_json, lit, window
from pyspark.sql.types import (
    DoubleType,
    IntegerType,
    StringType,
    StructType,
    TimestampType,
)
import time

In [0]:
storageAccount="salabcommercedatalake"

mountpoint_click = "/mnt/commercedata"
storageEndPoint_click ="abfss://commercedata@{}.dfs.core.windows.net/".format(storageAccount)
mountpoint_fo = "/mnt/dynamics365-financeandoperations"
storageEndPoint_fo ="abfss://dynamics365-financeandoperations@{}.dfs.core.windows.net/".format(storageAccount)
print ('Mount Point ='+mountpoint_click)
print ('Mount Point ='+mountpoint_fo)

# Client ID, tenant ID and secret of the Azure AD application (service principal)
clientID ="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
tenantID ="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
clientSecret ="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
oauth2Endpoint = "https://login.microsoftonline.com/{}/oauth2/token".format(tenantID)


configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": clientID,
           "fs.azure.account.oauth2.client.secret": clientSecret,
           "fs.azure.account.oauth2.client.endpoint": oauth2Endpoint}

# Mount each container only if it is not mounted yet, so that real mount errors are not hidden
mounted = {m.mountPoint for m in dbutils.fs.mounts()}

for source, mount_point in [(storageEndPoint_click, mountpoint_click),
                            (storageEndPoint_fo, mountpoint_fo)]:
    if mount_point in mounted:
        print("Already mounted...."+mount_point)
    else:
        dbutils.fs.mount(
            source = source,
            mount_point = mount_point,
            extra_configs = configs)
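
The mount settings above can also be assembled by small helpers, which keeps the secret out of the notebook source. This is only a sketch: the function names `abfss_uri` and `build_oauth_configs` are hypothetical, and the secret scope and key mentioned in the comment are placeholders you would need to create yourself.

```python
def abfss_uri(container: str, storage_account: str) -> str:
    """Build the ABFSS endpoint for a container in an ADLS Gen2 account."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/"


def build_oauth_configs(client_id: str, tenant_id: str, client_secret: str) -> dict:
    """Return the OAuth settings used as extra_configs when mounting."""
    endpoint = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint": endpoint,
    }


# In a Databricks notebook the secret could be read from a secret scope instead
# of being pasted in, for example (scope and key names are placeholders):
#   client_secret = dbutils.secrets.get(scope="my-scope", key="sp-client-secret")
example_configs = build_oauth_configs("my-client-id", "my-tenant-id", "my-secret")
example_source = abfss_uri("commercedata", "salabcommercedatalake")
```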

In [0]:
display(dbutils.fs.ls("dbfs:/mnt/dynamics365-financeandoperations/d365commerce.sandbox.operations.dynamics.com/Tables/Common/Customer/Main/CustTable"))

path,name,size,modificationTime
dbfs:/mnt/dynamics365-financeandoperations/d365commerce.sandbox.operations.dynamics.com/Tables/Common/Customer/Main/CustTable/CUSTTABLE_00001.csv,CUSTTABLE_00001.csv,265272,1650158448000
dbfs:/mnt/dynamics365-financeandoperations/d365commerce.sandbox.operations.dynamics.com/Tables/Common/Customer/Main/CustTable/index.json,index.json,160,1650158448000


In [0]:
# Reading customer csv files in a dataframe
df_cust= spark.read.format("csv").option("header",False).load("dbfs:/mnt/dynamics365-financeandoperations/d365commerce.sandbox.operations.dynamics.com/Tables/Common/Customer/Main/CustTable/CUSTTABLE_00001.csv")

In [0]:
display(df_cust.limit(10))

_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19,_c20,_c21,_c22,_c23,_c24,_c25,_c26,_c27,_c28,_c29,_c30,_c31,_c32,_c33,_c34,_c35,_c36,_c37,_c38,_c39,_c40,_c41,_c42,_c43,_c44,_c45,_c46,_c47,_c48,_c49,_c50,_c51,_c52,_c53,_c54,_c55,_c56,_c57,_c58,_c59,_c60,_c61,_c62,_c63,_c64,_c65,_c66,_c67,_c68,_c69,_c70,_c71,_c72,_c73,_c74,_c75,_c76,_c77,_c78,_c79,_c80,_c81,_c82,_c83,_c84,_c85,_c86,_c87,_c88,_c89,_c90,_c91,_c92,_c93,_c94,_c95,_c96,_c97,_c98,_c99,_c100,_c101,_c102,_c103,_c104,_c105,_c106,_c107,_c108,_c109,_c110,_c111,_c112,_c113,_c114,_c115,_c116,_c117,_c118,_c119,_c120,_c121,_c122,_c123,_c124,_c125,_c126,_c127,_c128,_c129,_c130,_c131,_c132,_c133,_c134,_c135,_c136,_c137,_c138,_c139,_c140,_c141,_c142,_c143,_c144,_c145,_c146,_c147,_c148,_c149,_c150,_c151,_c152,_c153,_c154,_c155,_c156,_c157,_c158,_c159,_c160,_c161,_c162,_c163,_c164,_c165,_c166,_c167,_c168,_c169,_c170,_c171,_c172,_c173,_c174,_c175,_c176,_c177,_c178,_c179,_c180,_c181,_c182,_c183,_c184,_c185,_c186,_c187,_c188,_c189,_c190,_c191,_c192,_c193,_c194,_c195,_c196,_c197,_c198,_c199,_c200,_c201,_c202,_c203,_c204,_c205,_c206,_c207,_c208,_c209,_c210,_c211,_c212,_c213,_c214,_c215,_c216,_c217,_c218,_c219
22565420942,,,2022-04-17T01:20:48.4770605Z,22565420942,Net30,,,,US_SI_0002,0,0,,,,,0,,,0,,0,,,,,,,,,,,0,0,0,,0,0,0,0,0.0,,,USD,,0,0,0,20,,0,0,0,0,,,,,,0,,,0,0,0,,,0,,,0,0,0,,0,0,0,0,0,0,0,0,,,0,,,0,0,0,,,,0,,,0,0,0,0,0,,,0,0,0,0,,,,,,,0,,,,0,,22565424586,,,,,,ELECTRONIC,,,,0,,,,,0,0,0,,WC,,,,0,,,0,0,,,,,0,,0,,WA,,,0,0,0,0,0,,,0,,0,,0,0,,2017-07-02T18:37:56.0000000,Yoichiro,2016-09-21T03:10:07.0000000,ussi,0,5637144576,0,0,,,0,0,0,0,0,0,0,1900-01-01T00:00:00.0000000,,0,,0.0,1900-01-01T00:00:00.0000000,0,0.0,,1900-01-01T00:00:00.0000000,0,,,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,0,0,0,0,,0,0,0,0,0,0,0,0,0
22565420943,,,2022-04-17T01:20:48.4771324Z,22565420943,Net30,,,,US_SI_0003,0,0,,,,,0,,,0,,0,,,,,,,,,,,0,0,0,,0,0,0,0,0.0,,,USD,,0,0,0,20,,0,0,0,0,,,,,,0,,,0,0,0,,,0,,,0,0,0,,0,0,0,0,0,0,0,0,,,0,,,0,0,0,,,,0,,,0,0,0,0,0,,,0,0,0,0,,,,,,,0,,,,0,,22565424587,,,,,,ELECTRONIC,,,,0,,,,,0,0,0,,,,,,0,,,0,0,,,,,0,,0,,WA,,,0,0,0,0,0,,,0,,0,,0,0,,2017-06-06T18:38:18.0000000,Admin,2016-09-21T03:10:10.0000000,ussi,0,5637144576,0,0,,,0,0,0,0,0,0,0,1900-01-01T00:00:00.0000000,,0,,0.0,1900-01-01T00:00:00.0000000,0,0.0,,1900-01-01T00:00:00.0000000,0,,,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,0,0,0,0,,0,0,0,0,0,0,0,0,0
22565421250,,,2022-04-17T01:20:48.4771925Z,22565421250,Net30,,,,US_SI_0062,0,0,,,,,0,,,0,,0,,,,,,,,,,,0,0,0,,0,0,0,0,0.0,,,USD,,0,0,0,20,,0,0,0,0,,,,,,0,,,0,0,0,,,0,,,0,0,0,,0,0,0,0,0,0,0,0,,,0,,,0,0,0,,,,0,,,0,0,0,0,0,,3100.0,0,0,0,0,,,,,,,0,,,,0,,22565424884,,,,,,ELECTRONIC,,,,0,,,,,0,0,0,,IW,,,,0,,,0,0,,,,,0,,0,,UT,,,0,0,0,0,0,,,0,,0,,0,0,,2019-02-16T18:48:16.0000000,Admin,2016-09-22T03:44:17.0000000,ussi,0,5637144576,0,0,,,0,0,0,0,0,0,0,1900-01-01T00:00:00.0000000,,0,,0.0,1900-01-01T00:00:00.0000000,0,0.0,,1900-01-01T00:00:00.0000000,0,,,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,0,0,0,0,,0,0,0,0,0,0,0,0,0
22565421251,,,2022-04-17T01:20:48.4772804Z,22565421251,Net30,,,,US_SI_0063,0,0,,,,,0,,,0,,0,,,,,,,,,,,0,0,0,,0,0,0,0,0.0,,,USD,,0,0,0,20,,0,0,0,0,,,,,,0,,,0,0,0,,,0,,,0,0,0,,0,0,0,0,0,0,0,0,,,0,,,0,0,0,,,,0,,,0,0,0,0,0,,,0,0,0,0,,,,,,,0,,,,0,,22565424885,,,,,,ELECTRONIC,,,,0,,,,,0,0,0,,,,,,0,,,0,0,,,,,0,,0,,MI,,,0,0,0,0,0,,,0,,0,,0,0,,2017-06-06T18:38:18.0000000,Admin,2016-09-22T03:44:18.0000000,ussi,0,5637144576,0,0,,,0,0,0,0,0,0,0,1900-01-01T00:00:00.0000000,,0,,0.0,1900-01-01T00:00:00.0000000,0,0.0,,1900-01-01T00:00:00.0000000,0,,,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,0,0,0,0,,0,0,0,0,0,0,0,0,0
22565421252,,,2022-04-17T01:20:48.4773572Z,22565421252,Net30,,,,US_SI_0064,0,0,,,,,0,,,0,,0,,,,,,,,,,,0,0,0,,0,0,0,0,0.0,,,USD,,0,0,0,20,,0,0,0,0,,,,,,0,,,0,0,0,,,0,,,0,0,0,,0,0,0,0,0,0,0,0,,,0,,,0,0,0,,,,0,,,0,0,0,0,0,,,0,0,0,0,,,,,,,0,,,,0,,22565424886,,,,,,ELECTRONIC,,,,0,,,,,0,0,0,,,,,,0,,,0,0,,,,,0,,0,,WA,,,0,0,0,0,0,,,0,,0,,0,0,,2017-06-06T18:38:18.0000000,Admin,2016-09-22T03:44:18.0000000,ussi,0,5637144576,0,0,,,0,0,0,0,0,0,0,1900-01-01T00:00:00.0000000,,0,,0.0,1900-01-01T00:00:00.0000000,0,0.0,,1900-01-01T00:00:00.0000000,0,,,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,0,0,0,0,,0,0,0,0,0,0,0,0,0
22565421253,,,2022-04-17T01:20:48.4774333Z,22565421253,Net30,,,,US_SI_0065,0,0,,,,,0,,,0,,0,,,,,,,,,,,0,0,0,,0,0,0,0,0.0,,,USD,,0,0,0,20,,0,0,0,0,,,,,,0,,,0,0,0,,,0,,,0,0,0,,0,0,0,0,0,0,0,0,,,0,,,0,0,0,,,,0,,,0,0,0,0,0,,,0,0,0,0,,,,,,,0,,,,0,,22565424887,,,,,,ELECTRONIC,,,,0,,,,,0,0,0,,,,,,0,,,0,0,,,,,0,,0,,WA,,,0,0,0,0,0,,,0,,0,,0,0,,2017-06-06T18:38:18.0000000,Admin,2016-09-22T03:44:18.0000000,ussi,0,5637144576,0,0,,,0,0,0,0,0,0,0,1900-01-01T00:00:00.0000000,,0,,0.0,1900-01-01T00:00:00.0000000,0,0.0,,1900-01-01T00:00:00.0000000,0,,,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,0,0,0,0,,0,0,0,0,0,0,0,0,0
22565421254,,,2022-04-17T01:20:48.4774964Z,22565421254,Net30,,,,US_SI_0066,0,0,,,,,0,,,0,,0,,,,,,,,,,,0,0,0,,0,0,0,0,0.0,,,USD,,0,0,0,20,,0,0,0,0,,,,,,0,,,0,0,0,,,0,,,0,0,0,,0,0,0,0,0,0,0,0,,,0,,,0,0,0,,,,0,,,0,0,0,0,0,,,0,0,0,0,,,,,,,0,,,,0,,22565424888,,,,,,ELECTRONIC,,,,0,,,,,0,0,0,,,,,,0,,,0,0,,,,,0,,0,,TX,,,0,0,0,0,0,,,0,,0,,0,0,,2017-06-06T18:38:18.0000000,Admin,2016-09-22T03:44:18.0000000,ussi,0,5637144576,0,0,,,0,0,0,0,0,0,0,1900-01-01T00:00:00.0000000,,0,,0.0,1900-01-01T00:00:00.0000000,0,0.0,,1900-01-01T00:00:00.0000000,0,,,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,0,0,0,0,,0,0,0,0,0,0,0,0,0
22565421255,,,2022-04-17T01:20:48.4775708Z,22565421255,Net30,,,,US_SI_0067,0,0,,,,,0,,,0,,0,,,,,,,,,,,0,0,0,,0,0,0,0,0.0,,,USD,,0,0,0,20,,0,0,0,0,,,,,,0,,,0,0,0,,,0,,,0,0,0,,0,0,0,0,0,0,0,0,,,0,,,0,0,0,,,,0,,,0,0,0,0,0,,,0,0,0,0,,,,,,,0,,,,0,,22565424889,,,,,,ELECTRONIC,,,,0,,,,,0,0,0,,,,,,0,,,0,0,,,,,0,,0,,WA,,,0,0,0,0,0,,,0,,0,,0,0,,2017-06-06T18:38:18.0000000,Admin,2016-09-22T03:44:18.0000000,ussi,0,5637144576,0,0,,,0,0,0,0,0,0,0,1900-01-01T00:00:00.0000000,,0,,0.0,1900-01-01T00:00:00.0000000,0,0.0,,1900-01-01T00:00:00.0000000,0,,,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,0,0,0,0,,0,0,0,0,0,0,0,0,0
22565421256,,,2022-04-17T01:20:48.4776265Z,22565421256,Net30,,,,US_SI_0068,0,0,,,,,0,,,0,,0,,,,,,,,,,,0,0,0,,0,0,0,0,0.0,,,USD,,0,0,0,20,,0,0,0,0,,,,,,0,,,0,0,0,,,0,,,0,0,0,,0,0,0,0,0,0,0,0,,,0,,,0,0,0,,,,0,,,0,0,0,0,0,,,0,0,0,0,,,,,,,0,,,,0,,22565424890,,,,,,ELECTRONIC,,,,0,,,,,0,0,0,,,,,,0,,,0,0,,,,,0,,0,,CO,,,0,0,0,0,0,,,0,,0,,0,0,,2017-06-06T18:38:18.0000000,Admin,2016-09-22T03:44:18.0000000,ussi,0,5637144576,0,0,,,0,0,0,0,0,0,0,1900-01-01T00:00:00.0000000,,0,,0.0,1900-01-01T00:00:00.0000000,0,0.0,,1900-01-01T00:00:00.0000000,0,,,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,0,0,0,0,,0,0,0,0,0,0,0,0,0
22565421257,,,2022-04-17T01:20:48.4777446Z,22565421257,Net30,,,,US_SI_0069,0,0,,,,,0,,,0,,0,,,,,,,,,,,0,0,0,,0,0,0,0,0.0,,,USD,,0,0,0,20,,0,0,0,0,,,,,,0,,,0,0,0,,,0,,,0,0,0,,0,0,0,0,0,0,0,0,,,0,,,0,0,0,,,,0,,,0,0,0,0,0,,,0,0,0,0,,,,,,,0,,,,0,,22565424891,,,,,,ELECTRONIC,,,,0,,,,,0,0,0,,,,,,0,,,0,0,,,,,0,,0,,OR,,,0,0,0,0,0,,,0,,0,,0,0,,2017-06-06T18:38:18.0000000,Admin,2016-09-22T03:44:18.0000000,ussi,0,5637144576,0,0,,,0,0,0,0,0,0,0,1900-01-01T00:00:00.0000000,,0,,0.0,1900-01-01T00:00:00.0000000,0,0.0,,1900-01-01T00:00:00.0000000,0,,,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,1900-01-01T00:00:00.0000000,0,0,0,0,,0,0,0,0,0,0,0,0,0


In [0]:
# rename columns that we need and create a new dataframe
df_custSmall =  df_cust.selectExpr(
    '_c9 AS CustomerId',
    '_c155 AS State',
    '_c175 AS Company')

display(df_custSmall.limit(10))

CustomerId,State,Company
US_SI_0002,WA,ussi
US_SI_0003,WA,ussi
US_SI_0062,UT,ussi
US_SI_0063,MI,ussi
US_SI_0064,WA,ussi
US_SI_0065,WA,ussi
US_SI_0066,TX,ussi
US_SI_0067,WA,ussi
US_SI_0068,CO,ussi
US_SI_0069,OR,ussi
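
Because the export has no header row, columns are picked by position, and the positions are easy to mistype. One way to keep them in a single place is a small mapping that generates the `selectExpr` expressions. This is a sketch; `CUSTOMER_COLUMNS` and `select_exprs` are hypothetical names, and the positions are the ones used in the cell above.

```python
# Position of each needed column in the header-less CustTable export.
CUSTOMER_COLUMNS = {9: "CustomerId", 155: "State", 175: "Company"}


def select_exprs(mapping):
    """Turn {position: name} into Spark SQL expressions such as '_c9 AS CustomerId'."""
    return [f"_c{idx} AS {name}" for idx, name in sorted(mapping.items())]


exprs = select_exprs(CUSTOMER_COLUMNS)

# In the notebook this would be used as:
#   df_custSmall = df_cust.selectExpr(*exprs)
```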


In [0]:
#create a view from dataframe
df_custSmall.createOrReplaceTempView("vw_Customers")


In [0]:
%sql
select * from vw_Customers limit 10

CustomerId,State,Company
US_SI_0002,WA,ussi
US_SI_0003,WA,ussi
US_SI_0062,UT,ussi
US_SI_0063,MI,ussi
US_SI_0064,WA,ussi
US_SI_0065,WA,ussi
US_SI_0066,TX,ussi
US_SI_0067,WA,ussi
US_SI_0068,CO,ussi
US_SI_0069,OR,ussi


In [0]:
display(dbutils.fs.ls("dbfs:/mnt/dynamics365-financeandoperations/d365commerce.sandbox.operations.dynamics.com/Tables/SupplyChain/ProductInformationManagement/Main/EcoResProduct"))

path,name,size,modificationTime
dbfs:/mnt/dynamics365-financeandoperations/d365commerce.sandbox.operations.dynamics.com/Tables/SupplyChain/ProductInformationManagement/Main/EcoResProduct/ECORESPRODUCT_00001.csv,ECORESPRODUCT_00001.csv,1392132,1645207720000
dbfs:/mnt/dynamics365-financeandoperations/d365commerce.sandbox.operations.dynamics.com/Tables/SupplyChain/ProductInformationManagement/Main/EcoResProduct/index.json,index.json,155,1645207720000


In [0]:
# Reading product csv files in a dataframe
df_product= spark.read.format("csv").option("header",False).load("dbfs:/mnt/dynamics365-financeandoperations/d365commerce.sandbox.operations.dynamics.com/Tables/SupplyChain/ProductInformationManagement/Main/EcoResProduct/ECORESPRODUCT_00001.csv")

display(df_product.limit(10))

_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19,_c20,_c21,_c22,_c23,_c24,_c25
22565421183,,,2022-02-03T02:00:46.9157225Z,22565421183,,,,,,,0,D0001,13678,0,1,MidRangeSpeaker,,0,0,5637144576,0,,0,0,0
22565421184,,,2022-02-03T02:00:46.9157463Z,22565421184,,,,,,,0,D0002,13678,0,1,Cabinet,,0,0,5637144576,0,,0,0,0
22565421185,,,2022-02-03T02:00:46.9157627Z,22565421185,,,,,,,0,D0003,13678,0,1,StandardSpeaker,,0,0,5637144576,0,,0,0,0
22565421187,,,2022-02-03T02:00:46.9157762Z,22565421187,,,,,,,0,L0001,13678,0,1,MidRangeSpeaker2,,0,0,5637144576,0,,0,0,0
22565421188,,,2022-02-03T02:00:46.9157886Z,22565421188,,,,,,,0,M0001,13678,0,1,WiringHarness,,0,0,5637144576,0,,0,0,0
22565421189,,,2022-02-03T02:00:46.9158037Z,22565421189,,,,,,,0,M0002,13678,0,1,MidRangeSpeakerUnit,,0,0,5637144576,0,,0,0,0
22565421190,,,2022-02-03T02:00:46.9158168Z,22565421190,,,,,,,0,M0003,13678,0,1,TweeterSpeakerUnit,,0,0,5637144576,0,,0,0,0
22565421191,,,2022-02-03T02:00:46.9158307Z,22565421191,,,,,,,0,M0004,13678,0,1,Crossover,,0,0,5637144576,0,,0,0,0
22565421192,,,2022-02-03T02:00:46.9158424Z,22565421192,,,,,,,0,M0005,13678,0,1,Enclosure,,0,0,5637144576,0,,0,0,0
22565421193,,,2022-02-03T02:00:46.9158548Z,22565421193,,,,,,,0,M0006,13678,0,1,BindingPosts,,0,0,5637144576,0,,0,0,0


In [0]:
# rename columns that we need and create a new dataframe
df_productSmall =  df_product.selectExpr(
    '_c12 AS ProductId',
    '_c16 AS ProductName')

display(df_productSmall.limit(10))

ProductId,ProductName
D0001,MidRangeSpeaker
D0002,Cabinet
D0003,StandardSpeaker
L0001,MidRangeSpeaker2
M0001,WiringHarness
M0002,MidRangeSpeakerUnit
M0003,TweeterSpeakerUnit
M0004,Crossover
M0005,Enclosure
M0006,BindingPosts


In [0]:
#create a view from dataframe
df_productSmall.createOrReplaceTempView("vw_Products")


In [0]:
%sql
select * from vw_Products limit 10

ProductId,ProductName
D0001,MidRangeSpeaker
D0002,Cabinet
D0003,StandardSpeaker
L0001,MidRangeSpeaker2
M0001,WiringHarness
M0002,MidRangeSpeakerUnit
M0003,TweeterSpeakerUnit
M0004,Crossover
M0005,Enclosure
M0006,BindingPosts


In [0]:
display(dbutils.fs.ls("dbfs:/mnt/commercedata"))


path,name,size,modificationTime
dbfs:/mnt/commercedata/clickstream-hist/,clickstream-hist/,0,1650434061000
dbfs:/mnt/commercedata/clickstreamdata/,clickstreamdata/,0,1650606885000
dbfs:/mnt/commercedata/tempDirs/,tempDirs/,0,1650511450000


In [0]:
#Creating the schema for the clickstream data json structure
clickjsonschema = StructType() \
.add("itemid", StringType()) \
.add("userid", StringType()) \
.add("device", StringType()) \
.add("sessionid", IntegerType()) \
.add("event_name", StringType()) \
.add("date", TimestampType())  

#Creating the schema for the sales data json structure
salesjsonschema = StructType() \
.add("orderid", StringType()) \
.add("itemid", StringType()) \
.add("customerid", StringType()) \
.add("channelid", StringType()) \
.add("qty", IntegerType()) \
.add("amount", DoubleType()) \
.add("cost", DoubleType()) \
.add("date", TimestampType())  
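
To see what a message matching `clickjsonschema` might look like, the sketch below parses one event in plain Python and checks the field types. The payload shape and the ISO-8601 date format are assumptions inferred from the schema, not taken from the actual Event Hubs producer, and `parse_click` is a hypothetical helper.

```python
import json
from datetime import datetime

# Field names and types mirror clickjsonschema (date is handled separately).
EXPECTED_FIELDS = {
    "itemid": str,
    "userid": str,
    "device": str,
    "sessionid": int,
    "event_name": str,
}


def parse_click(raw: str) -> dict:
    """Parse one JSON click event and validate it against the expected fields."""
    event = json.loads(raw)
    for field, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(event.get(field), expected_type):
            raise ValueError(f"{field} is missing or not of type {expected_type.__name__}")
    # Assumed ISO-8601 timestamp, e.g. 2022-04-22T07:08:53.360
    event["date"] = datetime.fromisoformat(event["date"])
    return event


sample = json.dumps({
    "itemid": "D0002",
    "userid": "US_SI_0062",
    "device": "tablet",
    "sessionid": 248831,
    "event_name": "Logout",
    "date": "2022-04-22T07:08:53.360",
})
click = parse_click(sample)
```

This helper raises on a bad record, which is convenient for testing a producer before wiring it to the stream.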


In [0]:
# Helper functions that return the checkpoint and data paths under the mount point
def checkpoint_dir(layer="Bronze"):
    return f"/mnt/commercedata/clickstreamdata/{layer}/chkpnt/"

def delta_dir(layer="Bronze"):
    return f"/mnt/commercedata/clickstreamdata/{layer}/delta/"

def hist_chkpt_dir(layer="Hist"):
    return f"/mnt/commercedata/clickstream-hist/{layer}/chkpnt"

def hist_dir(layer="Hist"):
    return f"/mnt/commercedata/clickstream-hist/{layer}/Data"

In [0]:
#Event Hubs for Kafka configuration details
BOOTSTRAP_SERVERS = "salabcommerce-eventhubs.servicebus.windows.net:9093"
EH_SASL = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://salabcommerce-eventhubs.servicebus.windows.net/;SharedAccessKeyName=EH-ASA-Access;SharedAccessKey=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";'
GROUP_ID = "$Default"
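
The JAAS string is easier to read, and easier to keep free of embedded secrets, when it is built from the connection string at run time. A minimal sketch follows; `build_eh_sasl` and `bootstrap_server` are hypothetical helper names, and the connection string shown is a dummy value.

```python
def bootstrap_server(namespace: str) -> str:
    """Kafka endpoint of an Event Hubs namespace (port 9093)."""
    return f"{namespace}.servicebus.windows.net:9093"


def build_eh_sasl(connection_string: str) -> str:
    """Build the JAAS config: the user name is the literal $ConnectionString
    and the password is the Event Hubs connection string."""
    return (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{connection_string}";'
    )


# Dummy connection string for illustration only.
demo_conn = ("Endpoint=sb://salabcommerce-eventhubs.servicebus.windows.net/;"
             "SharedAccessKeyName=EH-ASA-Access;SharedAccessKey=dummy")
demo_sasl = build_eh_sasl(demo_conn)
demo_servers = bootstrap_server("salabcommerce-eventhubs")
```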

In [0]:
# Function to read data from Event Hubs (Kafka endpoint) and write it in Delta format
def append_kafkadata_stream(topic="clickstream-eventhub"):
    
    kafkaDF = (spark.readStream \
        .format("kafka") \
        .option("subscribe", topic) \
        .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS) \
        .option("kafka.sasl.mechanism", "PLAIN") \
        .option("kafka.security.protocol", "SASL_SSL") \
        .option("kafka.sasl.jaas.config", EH_SASL) \
        .option("kafka.request.timeout.ms", "60000") \
        .option("kafka.session.timeout.ms", "60000") \
        .option("kafka.group.id", GROUP_ID) \
        .option("failOnDataLoss", "false") \
        .option("startingOffsets", "latest") \
        .load().withColumn("source", lit(topic)))
    
    newkafkaDF=kafkaDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)","source").withColumn('clickstream', from_json(col('value'),schema=clickjsonschema))
    kafkajsonDF=newkafkaDF.select("key","value","source", "clickstream.*")
    query=kafkajsonDF.selectExpr(
                      "itemid"	   \
                      ,"userid"	\
                      ,"device" \
                      ,"sessionid" \
                      ,"event_name" \
                      ,"date" \
                      ,"source") \
                .writeStream.format("delta") \
                .outputMode("append") \
                .option("checkpointLocation",checkpoint_dir("Bronze")) \
                .start(delta_dir("Bronze")) 

    return query

In [0]:
# Function to read historical files from ADLS with the readStream file source (a plain CSV stream, not Auto Loader's cloudFiles format) and write them in Delta format
def append_batch_source():
    topic ="clickstream-hist"

    histDF = (spark.readStream \
        .schema(clickjsonschema) \
        .format("csv") \
        .load(hist_dir("Hist")).withColumn("source", lit(topic))) # this needs to point to source FO ADLS data

    query=histDF.selectExpr(
                      "itemid"	   \
                      ,"userid"	\
                      ,"device" \
                      ,"sessionid" \
                      ,"event_name" \
                      ,"date" \
                      ,"source") \
                .writeStream.format("delta") \
                .option("checkpointLocation",checkpoint_dir("Hist")) \
                .outputMode("append") \
                .start(delta_dir("Hist")) 

    return query

In [0]:
# Reading data from EventHubs for Kafka
query_source1 = append_kafkadata_stream(topic='clickstream-eventhub')

# Reading data from the historical location (in this example, ADLS Gen2 holding historical clickstream data)
# Historical files with the same schema may be added to this location from other sources. Because readStream keeps polling the ADLS location, new files are ingested as they arrive.

query_source2 = append_batch_source()
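
Both calls return a `StreamingQuery` handle, so it is worth checking that the streams are actually running before moving on. The sketch below formats one status line per query from the `name`, `id`, `isActive` and `status` attributes of the handle; `describe_queries` is a hypothetical helper, and the stub objects only stand in for real queries so the example runs outside Spark.

```python
from types import SimpleNamespace


def describe_queries(queries):
    """Return one human-readable status line per streaming query."""
    lines = []
    for q in queries:
        state = "active" if q.isActive else "stopped"
        label = q.name or q.id  # unnamed queries fall back to their id
        lines.append(f"{label}: {state} - {q.status['message']}")
    return lines


# Stand-ins for StreamingQuery objects (illustration only).
stub_queries = [
    SimpleNamespace(name=None, id="q-1", isActive=True,
                    status={"message": "Waiting for data to arrive"}),
    SimpleNamespace(name="hist", id="q-2", isActive=False,
                    status={"message": "Stopped"}),
]
report = describe_queries(stub_queries)

# In the notebook: print("\n".join(describe_queries([query_source1, query_source2])))
```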


In [0]:
# Dropping all Delta tables if required
def DropDeltaTables(confirm=1):
  
  if(confirm ==1):
    spark.sql("DROP TABLE IF EXISTS RetailClickstream.RetailDelta_Bronze")
    spark.sql("DROP TABLE IF EXISTS RetailClickstream.RetailDelta_Silver")
    spark.sql("DROP TABLE IF EXISTS RetailClickstream.RetailDelta_Gold")
    spark.sql("DROP TABLE IF EXISTS RetailClickstream.RetailDelta_Historical")

In [0]:
#Call the function that drops all Delta tables. To avoid dropping the tables, call it with confirm=0
DropDeltaTables(confirm=0)

In [0]:
# Wait for 10 seconds before we create the delta tables else it might error out that delta location is not created
time.sleep(10)
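
A fixed sleep can be too short on a cold cluster and wastefully long otherwise. An alternative is to poll until the Delta location exists. The sketch below is a generic polling helper; `wait_until` is a hypothetical name, and the Databricks check in the comment is an assumption about how you might test for the `_delta_log` folder.

```python
import time


def wait_until(predicate, timeout_s=60.0, interval_s=1.0):
    """Call predicate() repeatedly until it returns True or the timeout expires.
    Returns True on success and False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Possible predicate in the notebook (assumes dbutils.fs.ls raises while the
# path does not exist yet):
#   def bronze_ready():
#       try:
#           dbutils.fs.ls(delta_dir("Bronze") + "_delta_log")
#           return True
#       except Exception:
#           return False
#   wait_until(bronze_ready, timeout_s=120)
```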

In [0]:
%sql
-- Creating the database that will hold the Delta tables
CREATE DATABASE IF NOT EXISTS RetailClickstream;

In [0]:
%sql
--Create historical delta table
CREATE TABLE IF NOT EXISTS RetailClickstream.RetailDelta_Historical 
USING DELTA LOCATION "dbfs:/mnt/commercedata/clickstreamdata/Hist/delta/"

In [0]:
%sql
-- Creating the delta table on delta location for Bronze data
CREATE TABLE IF NOT EXISTS RetailClickstream.RetailDelta_Bronze
USING DELTA LOCATION "dbfs:/mnt/commercedata/clickstreamdata/Bronze/delta"

In [0]:
%sql
describe formatted RetailClickstream.RetailDelta_Bronze

col_name,data_type,comment
itemid,string,
userid,string,
device,string,
sessionid,int,
event_name,string,
date,timestamp,
source,string,
,,
# Partitioning,,
Not partitioned,,


In [0]:
#Stream from the Bronze Delta table so that only new data arriving from Event Hubs is picked up and loaded into the Silver Delta table.
df_bronze=spark.readStream.format("delta").option("latestFirst", "true").table("RetailClickstream.RetailDelta_Bronze")

In [0]:
#Creating Temp View on Bronze DF
df_bronze.createOrReplaceTempView("vw_TempBronze")

In [0]:
%sql
select count(*) from vw_TempBronze

count(1)
57


In [0]:
%sql
select * from vw_TempBronze limit 10

itemid,userid,device,sessionid,event_name,date,source
D0002,US_SI_0062,tablet,248831,Logout,2022-04-22T07:08:53.360+0000,clickstream-eventhub
M0002,US_SI_0062,mobile,473064,DeleteFromCart,2022-04-22T08:08:53.716+0000,clickstream-eventhub
M0002,US_SI_0062,mobile,676344,AddPromoCode,2022-04-22T11:08:54.865+0000,clickstream-eventhub
D0002,US_SI_0062,tablet,551906,CheckOrderStatus,2022-04-22T12:08:55.240+0000,clickstream-eventhub
D0002,US_SI_0062,computer,512184,CheckOrderStatus,2022-04-22T14:08:56.009+0000,clickstream-eventhub
M0001,US_SI_0062,computer,529470,IncreaseQuantity,2022-04-22T16:08:56.734+0000,clickstream-eventhub
L0001,US_SI_0062,computer,361245,IncreaseQuantity,2022-04-22T17:08:57.114+0000,clickstream-eventhub
L0001,US_SI_0062,tablet,362784,CheckoutAsGuestCompleteOrder,2022-04-22T19:08:57.861+0000,clickstream-eventhub
M0001,US_SI_0063,computer,536169,IncreaseQuantity,2022-04-22T06:08:58.235+0000,clickstream-eventhub
D0002,US_SI_0063,computer,365463,AddPromoCode,2022-04-22T07:08:58.596+0000,clickstream-eventhub


In [0]:
%sql
-- select count(*),hour(eventtime) as hour, day(eventtime) as day from vw_TempSilver group by hour(eventtime),day(eventtime)
select *, Year(date) as Year, month(date) as Month,day(date) as Day, hour(date) as Hour from vw_TempBronze limit 10

itemid,userid,device,sessionid,event_name,date,source,Year,Month,Day,Hour
D0002,US_SI_0062,tablet,248831,Logout,2022-04-22T07:08:53.360+0000,clickstream-eventhub,2022,4,22,7
M0002,US_SI_0062,mobile,473064,DeleteFromCart,2022-04-22T08:08:53.716+0000,clickstream-eventhub,2022,4,22,8
M0002,US_SI_0062,mobile,676344,AddPromoCode,2022-04-22T11:08:54.865+0000,clickstream-eventhub,2022,4,22,11
D0002,US_SI_0062,tablet,551906,CheckOrderStatus,2022-04-22T12:08:55.240+0000,clickstream-eventhub,2022,4,22,12
D0002,US_SI_0062,computer,512184,CheckOrderStatus,2022-04-22T14:08:56.009+0000,clickstream-eventhub,2022,4,22,14
M0001,US_SI_0062,computer,529470,IncreaseQuantity,2022-04-22T16:08:56.734+0000,clickstream-eventhub,2022,4,22,16
L0001,US_SI_0062,computer,361245,IncreaseQuantity,2022-04-22T17:08:57.114+0000,clickstream-eventhub,2022,4,22,17
L0001,US_SI_0062,tablet,362784,CheckoutAsGuestCompleteOrder,2022-04-22T19:08:57.861+0000,clickstream-eventhub,2022,4,22,19
M0001,US_SI_0063,computer,536169,IncreaseQuantity,2022-04-22T06:08:58.235+0000,clickstream-eventhub,2022,4,22,6
D0002,US_SI_0063,computer,365463,AddPromoCode,2022-04-22T07:08:58.596+0000,clickstream-eventhub,2022,4,22,7


In [0]:
#Streaming Data from History Delta Table
df_historical=spark.readStream.format("delta").option("latestFirst", "true").table("RetailClickstream.RetailDelta_Historical")

In [0]:
#Combining (union, not join) the historical and Bronze streaming data
df_bronze_hist = df_bronze.union(df_historical)

In [0]:
df_bronze_hist.createOrReplaceTempView("vw_TempBronzeHistorical")

In [0]:
%sql
select count(*) from vw_TempBronzeHistorical

count(1)
88


In [0]:
%sql
select * from vw_TempBronzeHistorical limit 10

itemid,userid,device,sessionid,event_name,date,source
D0002,US_SI_0062,tablet,248831,Logout,2022-04-22T07:08:53.360+0000,clickstream-eventhub
M0002,US_SI_0062,mobile,473064,DeleteFromCart,2022-04-22T08:08:53.716+0000,clickstream-eventhub
M0002,US_SI_0062,mobile,676344,AddPromoCode,2022-04-22T11:08:54.865+0000,clickstream-eventhub
D0002,US_SI_0062,tablet,551906,CheckOrderStatus,2022-04-22T12:08:55.240+0000,clickstream-eventhub
D0002,US_SI_0062,computer,512184,CheckOrderStatus,2022-04-22T14:08:56.009+0000,clickstream-eventhub
M0001,US_SI_0062,computer,529470,IncreaseQuantity,2022-04-22T16:08:56.734+0000,clickstream-eventhub
L0001,US_SI_0062,computer,361245,IncreaseQuantity,2022-04-22T17:08:57.114+0000,clickstream-eventhub
L0001,US_SI_0062,tablet,362784,CheckoutAsGuestCompleteOrder,2022-04-22T19:08:57.861+0000,clickstream-eventhub
M0001,US_SI_0063,computer,536169,IncreaseQuantity,2022-04-22T06:08:58.235+0000,clickstream-eventhub
D0002,US_SI_0063,computer,365463,AddPromoCode,2022-04-22T07:08:58.596+0000,clickstream-eventhub


In [0]:
# Create the Silver Delta table by joining the combined Bronze + historical view with the Customers and Products views from Dynamics

df_silver= spark.sql("select s.*, c.State, c.Company, p.ProductName, Year(date) as Year, month(date) as Month,day(date) as Day, \
                     hour(date) as Hour  \
                     from vw_TempBronzeHistorical s \
                     left join vw_Customers c on s.userid = c.CustomerId \
                     left join vw_Products p on s.itemid = p.ProductId") \
            .writeStream.format("delta").option("MergeSchema","True") \
            .outputMode("append") \
            .option("checkpointLocation",checkpoint_dir("Silver"))  \
            .start(delta_dir("Silver"))

In [0]:

# Wait for 5 seconds before we create the delta tables else it might error out that delta location is not created
time.sleep(5)

In [0]:
%sql
-- DROP TABLE IF EXISTS RetailClickstream.RetailDelta_Silver;
CREATE TABLE IF NOT EXISTS RetailClickstream.RetailDelta_Silver
USING DELTA LOCATION "dbfs:/mnt/commercedata/clickstreamdata/Silver/delta/"

In [0]:
%sql
select count(*) from RetailClickstream.RetailDelta_Silver

count(1)
88


In [0]:
%sql
describe formatted RetailClickstream.RetailDelta_Silver

col_name,data_type,comment
itemid,string,
userid,string,
device,string,
sessionid,int,
event_name,string,
date,timestamp,
source,string,
State,string,
Company,string,
ProductName,string,


In [0]:
%sql
select * from RetailClickstream.RetailDelta_Silver limit 10

itemid,userid,device,sessionid,event_name,date,source,State,Company,ProductName,Year,Month,Day,Hour
D0002,US_SI_0062,tablet,248831,Logout,2022-04-22T07:08:53.360+0000,clickstream-eventhub,UT,ussi,Cabinet,2022,4,22,7
M0002,US_SI_0062,mobile,473064,DeleteFromCart,2022-04-22T08:08:53.716+0000,clickstream-eventhub,UT,ussi,MidRangeSpeakerUnit,2022,4,22,8
M0002,US_SI_0062,mobile,676344,AddPromoCode,2022-04-22T11:08:54.865+0000,clickstream-eventhub,UT,ussi,MidRangeSpeakerUnit,2022,4,22,11
D0002,US_SI_0062,tablet,551906,CheckOrderStatus,2022-04-22T12:08:55.240+0000,clickstream-eventhub,UT,ussi,Cabinet,2022,4,22,12
D0002,US_SI_0062,computer,512184,CheckOrderStatus,2022-04-22T14:08:56.009+0000,clickstream-eventhub,UT,ussi,Cabinet,2022,4,22,14
M0001,US_SI_0062,computer,529470,IncreaseQuantity,2022-04-22T16:08:56.734+0000,clickstream-eventhub,UT,ussi,WiringHarness,2022,4,22,16
L0001,US_SI_0062,computer,361245,IncreaseQuantity,2022-04-22T17:08:57.114+0000,clickstream-eventhub,UT,ussi,MidRangeSpeaker2,2022,4,22,17
L0001,US_SI_0062,tablet,362784,CheckoutAsGuestCompleteOrder,2022-04-22T19:08:57.861+0000,clickstream-eventhub,UT,ussi,MidRangeSpeaker2,2022,4,22,19
M0001,US_SI_0063,computer,536169,IncreaseQuantity,2022-04-22T06:08:58.235+0000,clickstream-eventhub,MI,ussi,WiringHarness,2022,4,22,6
D0002,US_SI_0063,computer,365463,AddPromoCode,2022-04-22T07:08:58.596+0000,clickstream-eventhub,MI,ussi,Cabinet,2022,4,22,7


In [0]:
# create a Gold table with some aggregation for analytics purposes

df_gold=(spark.readStream.format("delta").option("latestFirst", "true").table("RetailClickstream.RetailDelta_Silver") \
                                 .groupBy(window('date',"1 hour"),"State","device","Month","Day","Hour").count()) \
                                 .writeStream.format("delta") \
                                              .outputMode("complete") \
                                              .option("checkpointLocation",checkpoint_dir("Gold"))  \
                                              .start(delta_dir("Gold"))
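
The `window('date', "1 hour")` grouping assigns each event to a tumbling one-hour bucket that starts on the hour, with the start inclusive and the end exclusive. The pure-Python sketch below reproduces that bucketing on a few made-up events; `hour_window` and the sample data are illustrative assumptions, not part of the notebook.

```python
from collections import Counter
from datetime import datetime, timedelta


def hour_window(ts):
    """Return the [start, end) one-hour window that contains ts."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    return start, start + timedelta(hours=1)


# Hypothetical (timestamp, State, device) events.
events = [
    (datetime(2022, 4, 22, 11, 8, 54), "WA", "mobile"),
    (datetime(2022, 4, 22, 11, 45, 0), "WA", "mobile"),
    (datetime(2022, 4, 22, 12, 0, 0), "WA", "mobile"),  # exactly on the boundary
]

# Count events per (window start, State, device), like groupBy(...).count().
counts = Counter((hour_window(ts)[0], state, device) for ts, state, device in events)
```

The event at exactly 12:00 falls into the 12:00 to 13:00 window, not the earlier one.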

In [0]:
time.sleep(10)

In [0]:
#Create Gold delta table

spark.sql("CREATE TABLE IF NOT EXISTS RetailClickstream.RetailDelta_Gold USING DELTA LOCATION '{}'".format(delta_dir("Gold")))

In [0]:
df_gold =(spark.readStream.format("delta").table("RetailClickstream.RetailDelta_Gold"))
df_gold.createOrReplaceTempView("vw_GoldAggDetails")

In [0]:
%sql
-- Viewing data from the Gold Delta table
select * from RetailClickstream.RetailDelta_Gold
ORDER BY Month DESC, Day Desc,count desc  limit 10

window,State,device,Month,Day,Hour,count
"List(2022-04-22T11:00:00.000+0000, 2022-04-22T12:00:00.000+0000)",WA,mobile,4,22,11,2
"List(2022-04-22T09:00:00.000+0000, 2022-04-22T10:00:00.000+0000)",WA,tablet,4,22,9,2
"List(2022-04-22T06:00:00.000+0000, 2022-04-22T07:00:00.000+0000)",WA,computer,4,22,6,2
"List(2022-04-22T08:00:00.000+0000, 2022-04-22T09:00:00.000+0000)",WA,tablet,4,22,8,2
"List(2022-04-22T07:00:00.000+0000, 2022-04-22T08:00:00.000+0000)",WA,tablet,4,22,7,2
"List(2022-04-22T13:00:00.000+0000, 2022-04-22T14:00:00.000+0000)",WA,tablet,4,22,13,1
"List(2022-04-22T10:00:00.000+0000, 2022-04-22T11:00:00.000+0000)",MI,mobile,4,22,10,1
"List(2022-04-22T15:00:00.000+0000, 2022-04-22T16:00:00.000+0000)",UT,mobile,4,22,15,1
"List(2022-04-22T21:00:00.000+0000, 2022-04-22T22:00:00.000+0000)",WA,mobile,4,22,21,1
"List(2022-04-22T17:00:00.000+0000, 2022-04-22T18:00:00.000+0000)",WA,tablet,4,22,17,1


In [0]:
%sql
-- Viewing data from the Gold streaming view
select * from vw_GoldAggDetails limit 10

window,State,device,Month,Day,Hour,count
"List(2022-04-22T08:00:00.000+0000, 2022-04-22T09:00:00.000+0000)",WA,tablet,4,22,8,2
"List(2022-04-22T19:00:00.000+0000, 2022-04-22T20:00:00.000+0000)",WA,tablet,4,22,19,1
"List(2022-04-22T17:00:00.000+0000, 2022-04-22T18:00:00.000+0000)",WA,mobile,4,22,17,1
"List(2022-04-22T18:00:00.000+0000, 2022-04-22T19:00:00.000+0000)",UT,mobile,4,22,18,1
"List(2022-04-22T11:00:00.000+0000, 2022-04-22T12:00:00.000+0000)",WA,mobile,4,22,11,2
"List(2022-04-22T16:00:00.000+0000, 2022-04-22T17:00:00.000+0000)",MI,mobile,4,22,16,1
"List(2022-04-22T08:00:00.000+0000, 2022-04-22T09:00:00.000+0000)",UT,mobile,4,22,8,1
"List(2022-04-22T15:00:00.000+0000, 2022-04-22T16:00:00.000+0000)",WA,computer,4,22,15,1
"List(2022-04-22T15:00:00.000+0000, 2022-04-22T16:00:00.000+0000)",MI,computer,4,22,15,1
"List(2022-04-22T06:00:00.000+0000, 2022-04-22T07:00:00.000+0000)",TX,mobile,4,22,6,1


In [0]:
# write Silver table data to Synapse table, this needs a temp Blob/ADLS storage

blobStorage = "salabcommercedatalake.dfs.core.windows.net"
blobContainer = "commercedata"
blobAccessKey = "xxxxxxxxxxxxxxxxxxxxxxxx"

tempDir = "abfss://" + blobContainer + "@" + blobStorage +"/tempDirs"

print (tempDir)


In [0]:

acntInfo = "fs.azure.account.key."+ blobStorage

sc._jsc.hadoopConfiguration().set(acntInfo, blobAccessKey)

#spark.conf.set(acntInfo, blobAccessKey)

In [0]:
#read Silver table data in a dataframe
retailSilverDf = spark.readStream.format("delta").option("latestFirst", "true").table("RetailClickstream.RetailDelta_Silver")


In [0]:
# write Silver table data to Synapse dedicated pool table
retailSilverDf.writeStream \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://salabcommerce-synapse.sql.azuresynapse.net:1433;database=databaseName;user=sqladminuser@salabcommerce-synapse;password=xxxxxxxxxxxxxxxxxxxxxxxx;encrypt=true;trustServerCertificate=true;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;") \
.option("tempDir", tempDir) \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "Retailclickstreamtable") \
.option("checkpointLocation", "/mnt/commercedata/clickstreamdata/Synapse/chkpnt/") \
.start() 


In [0]:
# write Silver table data to Cosmos DB - define a config
"""
writeConfig = {
    "Endpoint": "https://salabcommerce-cosmosdb.documents.azure.com:443/",
    "Masterkey": "xxxxxxxxxxxxxxxxxxxxxxxx",
    "Database": "Retail",
    "Collection": "Clickstream",
    "Upsert": "true",
    "WritingBatchSize": "500"
   }
"""


In [0]:
"""
changeFeed = (retailSilverDf
               .writeStream
               .format("com.microsoft.azure.cosmosdb.spark.streaming.CosmosDBSinkProvider")
               .outputMode("append")
               .options(**writeConfig)
               .option("checkpointLocation", "/mnt/commercedata/clickstreamdata/cosmos/chkpnt/")
               .start())
"""

In [0]:
%sql

SELECT State, device,Month,Day,COUNT(*) as TotalDevices

FROM vw_GoldAggDetails

GROUP BY State, device,Month,Day

ORDER BY Month DESC,TotalDevices DESC

State,device,Month,Day,TotalDevices
WA,tablet,4,22,11
WA,mobile,4,22,7
WA,computer,4,22,5
MI,mobile,4,22,5
UT,computer,4,22,5
UT,mobile,4,22,5
MI,tablet,4,22,5
UT,tablet,4,22,4
MI,computer,4,22,3
TX,mobile,4,22,1


In [0]:
%sql

SELECT State, device,Month,COUNT(*) as TotalDevices

FROM vw_GoldAggDetails

GROUP BY State, device,Month

State,device,Month,TotalDevices
MI,computer,4,3
WA,computer,4,5
UT,mobile,4,5
UT,computer,4,5
TX,tablet,4,1
MI,mobile,4,5
UT,tablet,4,4
TX,mobile,4,1
WA,mobile,4,7
WA,tablet,4,11
