# DEMO: Integrating Azure Event Hubs and Azure Databricks
---
Author: Max Fisher
  
---
In this demo we establish a real-time stream of data using the Python client provided in the lab files. The stream flows through Azure Event Hubs, a widely used Azure service for ingesting incoming messages and data. We then show how to connect Event Hubs to a Structured Streaming pipeline in Azure Databricks.  
  
First, let's provision Azure Event Hubs. Come back to this notebook once you have this provisioned.  
  
Second, let's take a quick look at the code in the Python producer. It is called `sender.py` and lives in the Demo folder.

Lastly, enter the credentials for your Azure Event Hub in `sender.py` and run the program.
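The contents of `sender.py` are not reproduced in this notebook, so as a rough orientation, here is a minimal sketch of how a producer could build one event payload. The field names are taken from the messages shown later in this demo; the `PRODUCTS` catalog, the `make_event` helper, and the store ID range are assumptions for illustration, and the step that actually sends the payload to Event Hubs is left out.

```python
import json
import random
import time

# Hypothetical catalog: producttype -> (name, price). Values are copied from
# the sample messages in this demo; the real sender.py may use a different list.
PRODUCTS = {
    6: ("Saucepan", 21.99),
    7: ("Blender", 25.99),
    9: ("Rice Cooker", 29.99),
    10: ("Cutting Board", 12.99),
}


def make_event(rng=random):
    """Build one JSON-encoded sales event (illustrative helper, not from sender.py)."""
    product_type = rng.choice(list(PRODUCTS))
    name, price = PRODUCTS[product_type]
    event = {
        "storeId": rng.randint(1000, 1009),  # assumed range, based on the sample data
        "timestamp": time.time(),            # seconds since the epoch, as a float
        "producttype": product_type,
        "name": name,
        "category": "Kitchen",
        "price": price,
        "quantity": rng.randint(1, 3),
    }
    return json.dumps(event)


payload = make_event()
print(payload)
```

Each payload would then be sent as the body of one Event Hubs message, which is why the stream later exposes it as a binary `body` column.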

Now we have data flowing from our Python Producer to Azure Event Hubs. The next step is to set up the connection to Azure Databricks and our structured streaming pipeline.

In [2]:
# The connection string to your Event Hubs Namespace
connectionString = "<connection-string>;EntityPath=<hub-name>"
# Event Hubs Connection Configuration
ehConf = {
  'eventhubs.connectionString' : connectionString
}
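The connection string above ends with `;EntityPath=<hub-name>`, which names the hub to read from. Since a missing `EntityPath` is an easy mistake to make, here is a small sketch of a helper that appends it only when it is absent. `with_entity_path` is a hypothetical name, not part of the Event Hubs connector, and the sample values are placeholders.

```python
def with_entity_path(connection_string, hub_name):
    """Return the connection string with ';EntityPath=<hub_name>' appended if missing."""
    # Split into non-empty 'Key=Value' segments (this also drops a trailing ';')
    parts = [p for p in connection_string.strip().split(";") if p]
    has_entity_path = any(p.lower().startswith("entitypath=") for p in parts)
    if not has_entity_path:
        parts.append("EntityPath=" + hub_name)
    return ";".join(parts)


# Placeholder values for illustration only
namespace_cs = (
    "Endpoint=sb://<namespace>.servicebus.windows.net/;"
    "SharedAccessKeyName=<key-name>;SharedAccessKey=<key>"
)
print(with_entity_path(namespace_cs, "<hub-name>"))
```

Depending on the connector version installed on your cluster, the connection string may also need to be encrypted before it goes into `ehConf`; check the connector's documentation for the version you use.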

The next step is to set up the `readStream` query. For this stream we are not going to use any triggers or windows, so that we can focus on the Event Hubs functionality.  
  
`readStream` takes a `format` that specifies the data source to read from. **"org.apache.spark.sql.eventhubs.EventHubsSourceProvider"** is the format you need to specify for messages coming from Azure Event Hubs.  
  
Next, `options` passes in the **ehConf** dictionary, which we defined in the earlier cell. It holds the configuration details for the connection to the hub you are streaming from. **Note**: Make sure you append **";EntityPath=hub-name"** to the end of the namespace's connection string; otherwise the connection string does not identify which hub to read from.

In [4]:
productsSoldStream = spark \
  .readStream \
  .format("org.apache.spark.sql.eventhubs.EventHubsSourceProvider") \
  .options(**ehConf) \
  .load()

In [5]:
display(productsSoldStream)

body,offset,sequenceNumber,enqueuedTime,publisher,partitionKey
eyJzdG9yZUlkIjogMTAwNCwgInRpbWVzdGFtcCI6IDE1NDE2OTQyMTAuNDY5NDE0LCAicHJvZHVjdHR5cGUiOiA3LCAibmFtZSI6ICJCbGVuZGVyIiwgImNhdGVnb3J5IjogIktpdGNoZW4iLCAicHJpY2UiOiAyNS45OSwgInF1YW50aXR5IjogM30=,6384160,29464,2018-11-08T16:23:19.340+0000,,
eyJzdG9yZUlkIjogMTAwNCwgInRpbWVzdGFtcCI6IDE1NDE2OTQyMTUuNTQzNTM2LCAicHJvZHVjdHR5cGUiOiAxMCwgIm5hbWUiOiAiQ3V0dGluZyBCb2FyZCIsICJjYXRlZ29yeSI6ICJLaXRjaGVuIiwgInByaWNlIjogMTIuOTksICJxdWFudGl0eSI6IDN9,6569264,30300,2018-11-08T16:23:24.430+0000,,
eyJzdG9yZUlkIjogMTAwMSwgInRpbWVzdGFtcCI6IDE1NDE2OTQyMjAuNjE5NTk2MiwgInByb2R1Y3R0eXBlIjogOSwgIm5hbWUiOiAiUmljZSBDb29rZXIiLCAiY2F0ZWdvcnkiOiAiS2l0Y2hlbiIsICJwcmljZSI6IDI5Ljk5LCAicXVhbnRpdHkiOiAyfQ==,6384352,29465,2018-11-08T16:23:29.484+0000,,
eyJzdG9yZUlkIjogMTAwNCwgInRpbWVzdGFtcCI6IDE1NDE2OTQyMjUuNzE2MjQwNCwgInByb2R1Y3R0eXBlIjogMTAsICJuYW1lIjogIkN1dHRpbmcgQm9hcmQiLCAiY2F0ZWdvcnkiOiAiS2l0Y2hlbiIsICJwcmljZSI6IDEyLjk5LCAicXVhbnRpdHkiOiAzfQ==,6569464,30301,2018-11-08T16:23:34.602+0000,,
eyJzdG9yZUlkIjogMTAwNSwgInRpbWVzdGFtcCI6IDE1NDE2OTQyMzAuODA5ODYzMywgInByb2R1Y3R0eXBlIjogOSwgIm5hbWUiOiAiUmljZSBDb29rZXIiLCAiY2F0ZWdvcnkiOiAiS2l0Y2hlbiIsICJwcmljZSI6IDI5Ljk5LCAicXVhbnRpdHkiOiAxfQ==,6384552,29466,2018-11-08T16:23:39.672+0000,,
eyJzdG9yZUlkIjogMTAwMywgInRpbWVzdGFtcCI6IDE1NDE2OTQyMzUuOTEyOTQ0OCwgInByb2R1Y3R0eXBlIjogNywgIm5hbWUiOiAiQmxlbmRlciIsICJjYXRlZ29yeSI6ICJLaXRjaGVuIiwgInByaWNlIjogMjUuOTksICJxdWFudGl0eSI6IDJ9,6569664,30302,2018-11-08T16:23:44.790+0000,,
eyJzdG9yZUlkIjogMTAwNywgInRpbWVzdGFtcCI6IDE1NDE2OTQyNDAuOTkwNDExLCAicHJvZHVjdHR5cGUiOiAxMCwgIm5hbWUiOiAiQ3V0dGluZyBCb2FyZCIsICJjYXRlZ29yeSI6ICJLaXRjaGVuIiwgInByaWNlIjogMTIuOTksICJxdWFudGl0eSI6IDF9,6384752,29467,2018-11-08T16:23:49.844+0000,,
eyJzdG9yZUlkIjogMTAwOCwgInRpbWVzdGFtcCI6IDE1NDE2OTQyNDYuMDYyMTI1LCAicHJvZHVjdHR5cGUiOiAyLCAibmFtZSI6ICJTcGF0dWxhIiwgImNhdGVnb3J5IjogIktpdGNoZW4iLCAicHJpY2UiOiAyLjk5LCAicXVhbnRpdHkiOiAyfQ==,6569856,30303,2018-11-08T16:23:54.948+0000,,
eyJzdG9yZUlkIjogMTAwOSwgInRpbWVzdGFtcCI6IDE1NDE2OTQyNTEuMTM3NTg0NywgInByb2R1Y3R0eXBlIjogNywgIm5hbWUiOiAiQmxlbmRlciIsICJjYXRlZ29yeSI6ICJLaXRjaGVuIiwgInByaWNlIjogMjUuOTksICJxdWFudGl0eSI6IDF9,6384952,29468,2018-11-08T16:24:00.002+0000,,
eyJzdG9yZUlkIjogMTAwNCwgInRpbWVzdGFtcCI6IDE1NDE2OTQyNTYuMjEzOTk0MywgInByb2R1Y3R0eXBlIjogMTAsICJuYW1lIjogIkN1dHRpbmcgQm9hcmQiLCAiY2F0ZWdvcnkiOiAiS2l0Y2hlbiIsICJwcmljZSI6IDEyLjk5LCAicXVhbnRpdHkiOiAzfQ==,6570048,30304,2018-11-08T16:24:05.121+0000,,


Here we define the query that will run on the events streaming into Azure Databricks. We first need to cast the message to a string because the content of the message is stored as binary in the `body` field of the `productsSoldStream` DataFrame. The cast is done inside the `select` function, which selects the `body` column of each streamed event.  
  
You might also notice that we use `alias` to name the resulting column **message**. Note that `alias` only renames the column; the type conversion is done by `cast`.

In [7]:
IncrementalQuery = productsSoldStream.select(productsSoldStream.body.cast("string").alias('message'))
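To see what the cast achieves, it can help to reproduce it outside of Spark. The `body` values in the earlier output are Base64 text, which is how the binary column is rendered for display. The minimal sketch below, in plain Python, takes the first `body` value from that output and decodes it back to the JSON the producer sent. It only illustrates the encoding; it is not how Spark performs the cast internally.

```python
import base64
import json

# First body value from the display(productsSoldStream) output above
body_b64 = "eyJzdG9yZUlkIjogMTAwNCwgInRpbWVzdGFtcCI6IDE1NDE2OTQyMTAuNDY5NDE0LCAicHJvZHVjdHR5cGUiOiA3LCAibmFtZSI6ICJCbGVuZGVyIiwgImNhdGVnb3J5IjogIktpdGNoZW4iLCAicHJpY2UiOiAyNS45OSwgInF1YW50aXR5IjogM30="

raw_bytes = base64.b64decode(body_b64)  # the binary payload stored in `body`
message = raw_bytes.decode("utf-8")     # plain-Python analogue of cast("string")
event = json.loads(message)             # the JSON object built by the producer

print(event["name"], event["quantity"])  # prints: Blender 3
```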

We can take a quick look at how the data will be written by using the `display()` function for the dataframe.

In [9]:
display(IncrementalQuery)

message
"{""storeId"": 1004, ""timestamp"": 1541694190.0443802, ""producttype"": 10, ""name"": ""Cutting Board"", ""category"": ""Kitchen"", ""price"": 12.99, ""quantity"": 2}"
"{""storeId"": 1000, ""timestamp"": 1541694184.965651, ""producttype"": 6, ""name"": ""Saucepan"", ""category"": ""Kitchen"", ""price"": 21.99, ""quantity"": 3}"
"{""storeId"": 1008, ""timestamp"": 1541694195.120094, ""producttype"": 6, ""name"": ""Saucepan"", ""category"": ""Kitchen"", ""price"": 21.99, ""quantity"": 1}"
"{""storeId"": 1006, ""timestamp"": 1541694200.2508397, ""producttype"": 5, ""name"": ""Roasting Pan"", ""category"": ""Kitchen"", ""price"": 17.99, ""quantity"": 1}"
"{""storeId"": 1009, ""timestamp"": 1541694205.3889291, ""producttype"": 1, ""name"": ""Chef's Knife"", ""category"": ""Kitchen"", ""price"": 15.99, ""quantity"": 1}"
"{""storeId"": 1004, ""timestamp"": 1541694210.469414, ""producttype"": 7, ""name"": ""Blender"", ""category"": ""Kitchen"", ""price"": 25.99, ""quantity"": 3}"
"{""storeId"": 1004, ""timestamp"": 1541694215.543536, ""producttype"": 10, ""name"": ""Cutting Board"", ""category"": ""Kitchen"", ""price"": 12.99, ""quantity"": 3}"
"{""storeId"": 1001, ""timestamp"": 1541694220.6195962, ""producttype"": 9, ""name"": ""Rice Cooker"", ""category"": ""Kitchen"", ""price"": 29.99, ""quantity"": 2}"
"{""storeId"": 1004, ""timestamp"": 1541694225.7162404, ""producttype"": 10, ""name"": ""Cutting Board"", ""category"": ""Kitchen"", ""price"": 12.99, ""quantity"": 3}"
"{""storeId"": 1005, ""timestamp"": 1541694230.8098633, ""producttype"": 9, ""name"": ""Rice Cooker"", ""category"": ""Kitchen"", ""price"": 29.99, ""quantity"": 1}"


Now that we have reviewed what the data will look like, let's actually write the data to our desired folder using the `writeStream` event.

In [11]:
IncrementalQuery.writeStream \
          .outputMode("append") \
          .format("json") \
          .option("path", "/wwComDemo/data") \
          .option("checkpointLocation", "/wwComDemo/checkpoints") \
          .start()

Now let's check whether any files have been written, using `dbutils`.

In [13]:
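# Check to see if the files made it!
# Note: dbutils.fs.ls raises an error if the path does not exist yet.
files = dbutils.fs.ls("/wwComDemo/data")

if not files:
  print("There is no data! :(")
else:
  print("The data made it!")
  # Print the last entry; a fixed index such as [3] fails on short listings
  print(files[-1])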
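The listing of a streaming output folder usually contains more than data files: Spark's file sink also keeps a `_spark_metadata` directory there, so picking an entry by position can be fragile. The sketch below filters a listing by file name instead. It assumes each entry exposes `name` and `size` attributes, as the objects returned by `dbutils.fs.ls` do; the `FileInfo` stand-in, the `data_files` helper, and the sample file names are made up so the snippet runs outside Databricks.

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.ls (assumption: only the
# fields used below are modeled)
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])


def data_files(listing, suffix=".json"):
    """Keep only non-empty data files, skipping entries such as _spark_metadata/."""
    return [
        f for f in listing
        if f.name.startswith("part-") and f.name.endswith(suffix) and f.size > 0
    ]


# Made-up listing for illustration
listing = [
    FileInfo("dbfs:/wwComDemo/data/_spark_metadata/", "_spark_metadata/", 0),
    FileInfo("dbfs:/wwComDemo/data/part-00000-example.json", "part-00000-example.json", 142),
    FileInfo("dbfs:/wwComDemo/data/part-00001-example.json", "part-00001-example.json", 0),
]

for f in data_files(listing):
    print(f.name, f.size)
```

In the notebook, the same helper could be called as `data_files(dbutils.fs.ls("/wwComDemo/data"))`.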