# Structured Streaming with Apache Kafka

## Example 1

Read a Kafka topic hosted in AWS.
Before running this code, replace the `kafka.bootstrap.servers` value (`34.73.125.79:9094`) with the address of your bootstrap server.

In [0]:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "34.73.125.79:9094") \
  .option("subscribe", "toots") \
  .load()
  
schema = StructType(
    [
        StructField('id', StringType(), True),
        StructField('content', StringType(), True),
        StructField('created_at', StringType(), True),
        StructField('account', StringType(), True)
    ]
)
df.printSchema()

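# Kafka delivers key and value as binary: cast both to strings,
# parse the JSON value with the schema, then flatten its fields into columns.
dataset = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp") \
    .withColumn("value", from_json("value", schema)) \
    .select(col('key'), col("timestamp"), col('value.*'))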

In [0]:
dataset.writeStream \
 .outputMode("append") \
 .format("memory") \
 .option("truncate", "false") \
 .queryName("toots_topic") \
 .start()

In [0]:
%sql
SELECT
  *
FROM
  toots_topic

## Exercise 1

Apply a sliding window with a duration of 5 minutes that advances every minute, and count the toots per `server` in each window. In Mastodon, a server is the domain in the `account` column.
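To make the window semantics concrete, here is a minimal plain-Python sketch (no Spark required) of which 5-minute windows, sliding every minute, a single event falls into, and of how counts per server would accumulate. It assumes `account` has the form `user@domain`, which the source does not confirm. The names `sliding_windows`, `server_of` and `count_per_window_and_server` are hypothetical helpers, not part of any library. In Structured Streaming the same grouping is usually expressed with `window(col("timestamp"), "5 minutes", "1 minute")` from `pyspark.sql.functions`.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # window duration
SLIDE = timedelta(minutes=1)   # a new window starts every minute
EPOCH = datetime(1970, 1, 1)


def sliding_windows(ts, window=WINDOW, slide=SLIDE):
    """Return every [start, end) window, aligned to the slide, that contains ts."""
    # Latest window start <= ts, aligned on the slide interval since the epoch.
    start = ts - ((ts - EPOCH) % slide)
    windows = []
    # Walk backwards while the window still covers ts (end is exclusive).
    while start > ts - window:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


def server_of(account):
    """Assumed format 'user@domain': return the domain part, or None if absent."""
    _, sep, domain = account.rpartition("@")
    return domain if sep else None


def count_per_window_and_server(events):
    """events: iterable of (timestamp, account) pairs."""
    counts = Counter()
    for ts, account in events:
        # One event contributes to every window that contains it.
        for win in sliding_windows(ts):
            counts[(win, server_of(account))] += 1
    return counts
```

With these durations each event lands in exactly five overlapping windows, which is why the same toot shows up in several rows of a sliding-window aggregation.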

---



## Exercise 2

Every minute, get the number of toots received in the last 5 minutes.

---



## Exercise 3

Get the top words with more than 3 letters in 1-minute slots.
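As a rough illustration of the logic, the following plain-Python sketch buckets toots into non-overlapping (tumbling) 1-minute slots and ranks the words longer than 3 letters in each slot. It assumes `content` is plain text; if it carries HTML markup, that would need stripping first. The names `long_words` and `top_words_per_minute` are hypothetical. In Spark, a similar result could be built from `split`, `explode`, `length` and `window(col("timestamp"), "1 minute")`.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of Unicode letters only


def long_words(text, min_len=4):
    """Lower-cased words with more than 3 letters (at least min_len characters)."""
    return [w.lower() for w in WORD_RE.findall(text) if len(w) >= min_len]


def top_words_per_minute(events, n=3):
    """events: iterable of (timestamp, content) pairs.

    Returns {slot_start: [(word, count), ...]} with the n most frequent words.
    """
    slots = defaultdict(Counter)
    for ts, content in events:
        # Truncate to the minute: each event belongs to exactly one slot.
        slot = ts.replace(second=0, microsecond=0)
        slots[slot].update(long_words(content))
    return {slot: counter.most_common(n) for slot, counter in slots.items()}
```

Unlike the sliding window of Exercise 1, a tumbling slot assigns each toot to a single window, so no word is counted twice across slots.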

---



## Clean up DBFS

In [0]:
%scala
// Clean up: recursively delete everything under dbfs:/tmp/
val PATH = "dbfs:/tmp/"
dbutils.fs.ls(PATH)
            .map(_.name)
            .foreach((file: String) => dbutils.fs.rm(PATH + file, true))