### Create DLT with PySpark

### Ingest data

In [0]:
%sh
rm -rf /dbfs/device_stream
mkdir -p /dbfs/device_stream
wget -O /dbfs/device_stream/device_data1.csv https://github.com/MicrosoftLearning/mslearn-databricks/raw/main/data/device_data.csv
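
The same setup can be done from a Python cell instead of `%sh`. The following is a minimal sketch, not part of the original exercise: `reset_dir` and `download` are hypothetical helper names, and the usage comments assume the `/dbfs` FUSE mount is available on the cluster.

```python
import shutil
import urllib.request
from pathlib import Path


def reset_dir(path):
    """Delete `path` if it exists, then recreate it empty (like rm -rf + mkdir -p)."""
    target = Path(path)
    shutil.rmtree(target, ignore_errors=True)
    target.mkdir(parents=True, exist_ok=True)
    return target


def download(url, dest):
    """Fetch `url` and save it to `dest` (like wget -O)."""
    urllib.request.urlretrieve(url, dest)
    return Path(dest)


# Usage in a notebook cell (paths and URL taken from the %sh cell above):
# folder = reset_dir("/dbfs/device_stream")
# download(
#     "https://github.com/MicrosoftLearning/mslearn-databricks/raw/main/data/device_data.csv",
#     folder / "device_data1.csv",
# )
```

Because `reset_dir` ignores a missing folder, the cell can be re-run safely on a fresh workspace.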

### Use delta tables for streaming data

In [0]:
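# Import only the types used by the schema (avoids wildcard imports that shadow built-ins)
from pyspark.sql.types import (
    DoubleType,
    StringType,
    StructField,
    StructType,
    TimestampType,
)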

# Define the schema for the incoming data
schema = StructType([
    StructField("device_id", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("temperature", DoubleType(), True),
    StructField("humidity", DoubleType(), True)
])

# Read streaming data from folder
inputPath = '/device_stream/'
iotstream = spark.readStream.schema(schema).option("header", "true").csv(inputPath)
print("Source stream created...")

In [0]:
# Write the data to a Delta table
query = (iotstream
        .writeStream
        .format("delta")
        .option("checkpointLocation", "/tmp/checkpoints/iot_data")
        .start("/tmp/delta/iot_data"))
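
The pipeline created in the next section reads from `/tmp/delta/iot_data`, so it can help to confirm that the stream has processed at least one micro-batch before starting it. The sketch below is one possible way to do that: `wait_for_first_batch` is a hypothetical helper, and it relies only on the `isActive` and `lastProgress` attributes of the streaming query object, assuming the progress report exposes a `numInputRows` entry.

```python
import time


def wait_for_first_batch(query, timeout_s=60.0, poll_s=1.0):
    """Poll a streaming query until it reports a micro-batch with input rows.

    Returns True if such a batch was seen before the timeout, False otherwise.
    Raises RuntimeError if the query stops while we are waiting.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not query.isActive:
            raise RuntimeError("streaming query stopped before writing any data")
        progress = query.lastProgress  # None until the first micro-batch completes
        if progress is not None and progress.get("numInputRows", 0) > 0:
            return True
        time.sleep(poll_s)
    return False


# Usage in the notebook, after the writeStream cell:
# wait_for_first_batch(query)
```

If the helper returns `False`, the input folder is probably empty or the path does not match the download location.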

### Create a Delta Live Table Pipeline
A pipeline is the main unit for configuring and running data processing workflows with Delta Live Tables. It links data sources to target datasets through a Directed Acyclic Graph (DAG) declared in Python or SQL.
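
To illustrate what the DAG means in practice, the sketch below orders the two datasets defined later in this exercise by their dependencies. It uses Python's standard `graphlib` module purely as an illustration; it is not how Delta Live Tables is implemented internally.

```python
from graphlib import TopologicalSorter

# Each key is a dataset; its value is the set of datasets it reads from.
dependencies = {
    "raw_iot_data": set(),
    "transformed_iot_data": {"raw_iot_data"},
}

# A valid run order processes every dataset after the ones it depends on.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)  # ['raw_iot_data', 'transformed_iot_data']
```

The "acyclic" part matters: if two datasets read from each other, no valid run order exists.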

1. Select **Jobs & Pipelines** in the left sidebar and then select **Create ETL Pipeline**.
1. On the **Create pipeline** page, create a new pipeline with the following settings:
    - **Pipeline name:** Ingestion Pipeline
    - **Product edition:** Advanced
    - **Pipeline mode:** Triggered
    - **Source code:** Leave blank
    - **Storage options:** Hive Metastore
    - **Storage location:** dbfs:/pipelines/device_stream
    - **Target schema:** default
    - **Cluster mode:** Fixed size
    - **Workers:** 0
    - **Driver type:** Ds3v2
1. Select **Create** to create the pipeline (which will also create a blank notebook for the pipeline code).
1. Once the pipeline is created, open the link to the blank notebook under Source code in the right-side panel. This opens the notebook in a new browser tab.
1. In the first cell of the blank notebook, enter (but don’t run) the following code to create Delta Live Tables and transform the data:

<pre>
import dlt
from pyspark.sql.functions import col, current_timestamp
    
@dlt.table(
    name="raw_iot_data",
    comment="Raw IoT device data"
)
def raw_iot_data():
    return spark.readStream.format("delta").load("/tmp/delta/iot_data")

@dlt.table(
    name="transformed_iot_data",
    comment="Transformed IoT device data with derived metrics"
)
def transformed_iot_data():
    return (
        dlt.read("raw_iot_data")
        .withColumn("temperature_fahrenheit", col("temperature") * 9/5 + 32)
        .withColumn("humidity_percentage", col("humidity") * 100)
        .withColumn("event_time", current_timestamp())
    )
</pre>
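
As a quick sanity check of the derived columns, the same arithmetic can be reproduced in plain Python outside the pipeline notebook. This is a sketch under two assumptions that the transformation code implies but the source data does not confirm: `temperature` is in degrees Celsius and `humidity` is a fraction between 0 and 1. The function name `derive_metrics` is hypothetical.

```python
def derive_metrics(temperature_c, humidity_fraction):
    """Mirror the two numeric withColumn expressions for a single row."""
    return {
        # Same operator order as col("temperature") * 9/5 + 32
        "temperature_fahrenheit": temperature_c * 9 / 5 + 32,
        # Assumes humidity is stored as a fraction (0 to 1)
        "humidity_percentage": humidity_fraction * 100,
    }


print(derive_metrics(25.0, 0.5))
# {'temperature_fahrenheit': 77.0, 'humidity_percentage': 50.0}
```

If the humidity values in the CSV are already percentages, the `humidity_percentage` column will be 100 times too large, so it is worth checking a few rows of the raw data first.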

6. Close the browser tab containing the notebook (the contents are automatically saved) and return to the pipeline. Then select **Start**.

7. After the pipeline has successfully completed, run the following code to list the tables it created:

In [0]:
%sql
SHOW TABLES;

### View results as a visualization
1. Run the following query to display the contents of the transformed_iot_data table.
1. From the output, select **+** and then select **Visualization** to open the visualization editor, and then apply the following options:
    - **Visualization type:** Line
    - **X Column:** timestamp
    - **Y Column:** Add a new column and select temperature_fahrenheit. Apply the **Sum** aggregation.

In [0]:
%sql
SELECT * FROM transformed_iot_data
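
To make the **Sum** aggregation concrete, the sketch below groups rows by the X column and adds up the Y values, which is roughly what the chart plots for each timestamp. The sample rows are hypothetical and are not taken from the device data; `sum_by_timestamp` is an illustrative helper, not a Databricks function.

```python
from collections import defaultdict

# Hypothetical sample rows: (timestamp, temperature_fahrenheit)
rows = [
    ("2024-01-01 00:00:00", 70.0),
    ("2024-01-01 00:00:00", 72.5),
    ("2024-01-01 00:01:00", 68.0),
]


def sum_by_timestamp(rows):
    """Group rows by timestamp and sum the temperature values in each group."""
    totals = defaultdict(float)
    for ts, temp_f in rows:
        totals[ts] += temp_f
    # Sort by timestamp so the points appear in order along the X axis
    return dict(sorted(totals.items()))


print(sum_by_timestamp(rows))
# {'2024-01-01 00:00:00': 142.5, '2024-01-01 00:01:00': 68.0}
```

When several devices report at the same timestamp, the summed line shows their combined value rather than any single device's reading.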


### Stop the streaming query

In [0]:
query.stop()