In [16]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark

## Verify that Spark Connect is set up correctly
Run a "hello world" DataFrame example

In [18]:
from datetime import datetime, date
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()

+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  1|2.0|string1|2000-01-01|2000-01-01 01:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 01:00:00|
|  4|5.0|string3|2000-03-01|2000-01-03 01:00:00|
+---+---+-------+----------+-------------------+



## Hello world - Stream data to a Kafka topic

We'll use the rate source to generate a stream of data and write it to a Kafka topic.

Although we run the PySpark code locally, Spark Connect executes it inside the Spark Docker container. Because of this Docker setup, the Kafka broker has to be addressed as host.docker.internal, which resolves to the host machine from inside the container.

The Kafka stack's docker-compose.yml file advertises the following listeners:
```
KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka1:19092,EXTERNAL://${DOCKER_HOST_IP:-127.0.0.1}:9092,DOCKER://host.docker.internal:29092
```
The relevant port is 29092, the one advertised by the DOCKER listener.
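To make the choice of listener explicit, here is a minimal sketch that splits the advertised-listeners string into a name-to-address mapping and picks the DOCKER entry. The helper name `parse_advertised_listeners` is made up for this example, and the `${DOCKER_HOST_IP:-127.0.0.1}` placeholder is assumed to expand to its default value.

```python
def parse_advertised_listeners(value: str) -> dict[str, str]:
    """Map each listener name to its host:port address."""
    listeners = {}
    for entry in value.split(","):
        # Each entry looks like NAME://host:port
        name, _, address = entry.strip().partition("://")
        listeners[name] = address
    return listeners


# Same listeners as in docker-compose.yml, with DOCKER_HOST_IP
# assumed to be unset so that the default 127.0.0.1 applies.
ADVERTISED = (
    "INTERNAL://kafka1:19092,"
    "EXTERNAL://127.0.0.1:9092,"
    "DOCKER://host.docker.internal:29092"
)

listeners = parse_advertised_listeners(ADVERTISED)
bootstrap_servers = listeners["DOCKER"]
print(bootstrap_servers)
```

The DOCKER entry is the address used as the bootstrap server in the streaming cells of this notebook.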

Feel free to try a different bootstrap host and port, but note that localhost:9092 will not work: inside the Spark container, localhost refers to the container itself, where no Kafka broker is listening.



In [24]:
from pyspark.sql.functions import expr

# Define a rate source for streaming queries
rateDataFrame = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 10)
    .load()
)

# Define the sink for the streaming query, in this case a Kafka topic
TOPIC_NAME_OUTPUT = "myOutputTopic"

# Kafka broker address as seen from inside the Spark container (DOCKER listener)
BOOTSTRAP_SERVERS = "host.docker.internal:29092"

kafkaOutput = (
    rateDataFrame.selectExpr("CAST(value AS STRING) AS value")
    .writeStream.outputMode("append")
    .format("kafka")
    .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
    .option("topic", TOPIC_NAME_OUTPUT)
    .option("checkpointLocation", f"/opt/bitnami/spark/checkpoint/{TOPIC_NAME_OUTPUT}")
    .start()
)
kafkaOutput

<pyspark.sql.connect.streaming.query.StreamingQuery at 0x1252f4ed750>

## Go to the Conduktor UI and see the data being written to the topic
### http://localhost:8080

You'll see real-time data being written to the Kafka topic.
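The query handle returned by `.start()` offers a second way to confirm that data is flowing without leaving the notebook. A minimal sketch, assuming the standard `StreamingQuery` attributes `id`, `isActive`, `status`, and `lastProgress`; the helper name `summarize_query` is made up for this example.

```python
def summarize_query(query):
    """Collect a few health indicators from a StreamingQuery handle."""
    # lastProgress is None until the query has reported its first progress update
    progress = query.lastProgress
    return {
        "id": query.id,
        "active": query.isActive,
        "message": query.status["message"],
        "rows_in_last_batch": progress.get("numInputRows") if progress else None,
    }
```

Calling `summarize_query(kafkaOutput)` a few seconds after starting the query should show it as active, with a row count from the most recent micro-batch once one has completed.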

### Next, we'll read the data from the Kafka topic and write it to a Delta Lake table

In [27]:
from pyspark.sql.functions import expr

tableName = "my_kafka_delta_table"
SPARK_WAREHOUSE_BASE_CHECKPOINT = f"/opt/bitnami/spark/spark-warehouse/{tableName}/_checkpoint"

# Define Kafka source for streaming queries
kafkaDataFrame = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
    .option("subscribe", TOPIC_NAME_OUTPUT)
    .option("startingOffsets", "earliest")
    .load()
)

# Selecting the data and casting the value from Kafka's binary format to string
valueDataFrame = kafkaDataFrame.selectExpr("CAST(value AS STRING)")

# Define the sink for streaming queries, in this case, a Delta Lake table
q1 = (
    valueDataFrame.writeStream.outputMode("append")
    .trigger(processingTime="5 seconds")
    .format("delta")
    .option("checkpointLocation", SPARK_WAREHOUSE_BASE_CHECKPOINT)
    .toTable(tableName)
)
q1

<pyspark.sql.connect.streaming.query.StreamingQuery at 0x1252f4eec10>
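The streaming write runs in the background, so a query against the table issued immediately afterwards may run before the first micro-batch has been committed. A minimal sketch of a polling helper, assuming that a non-`None` `lastProgress` is a good enough signal that the query has started reporting batches; the name `wait_for_first_batch` and its timeout defaults are made up for this example.

```python
import time


def wait_for_first_batch(query, timeout_s=60.0, poll_s=1.0):
    """Poll until the query reports progress.

    Returns True once lastProgress is available, and False if the
    query stops first or the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if query.lastProgress is not None:
            return True
        if not query.isActive:
            return False
        time.sleep(poll_s)
    return False
```

With the 5-second trigger used above, calling `wait_for_first_batch(q1)` before reading the table gives the first batch time to land.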

In [30]:
spark.sql("SELECT * FROM my_kafka_delta_table").show(truncate=False)



+-----+
|value|
+-----+
|10813|
|10816|
|10815|
|10817|
|10818|
|10812|
|10811|
|10819|
|10810|
|10814|
|10825|
|10827|
|10826|
|10828|
|10824|
|10820|
|10829|
|10822|
|10823|
|10821|
+-----+
only showing top 20 rows



## Let's query the metadata from the Delta Lake table

In [32]:
# DESCRIBE DETAIL shows table-level metadata such as location, partition columns, number of files, and size
spark.sql(f"DESCRIBE DETAIL {tableName}").show(truncate=False)

# DESCRIBE HISTORY lists the table's commit history, including each operation and its metrics
spark.sql(f"DESCRIBE HISTORY {tableName}").show(truncate=False)


+------+------------------------------------+------------------------------------------+-----------+------------------------------------------------------------+----------------------+-----------------------+----------------+--------+-----------+----------+----------------+----------------+------------------------+
|format|id                                  |name                                      |description|location                                                    |createdAt             |lastModified           |partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|tableFeatures           |
+------+------------------------------------+------------------------------------------+-----------+------------------------------------------------------------+----------------------+-----------------------+----------------+--------+-----------+----------+----------------+----------------+------------------------+
|delta |94318b5d-4d84-4eaa-80ac-c4cb750fb21d|spar