## Read Data from Kafka Topic

Let us go through the details of reading data from a Kafka topic using the Spark Structured Streaming APIs.
* Data from sources such as web server logs must first be ingested into a Kafka topic, using tools like **Kafka Connect**.
* We can leverage the `spark-sql-kafka*` package to integrate Kafka with Spark Structured Streaming.
* We can pass the package either using `config` (via `spark.jars.packages`) as demonstrated below, or using `--packages` while launching the Pyspark CLI.
* Once the required package is loaded, we need to pass **kafka** to `spark.readStream.format`, along with the broker and topic information using `option`. This results in a streaming Data Frame.
* Let us go ahead and create a streaming Data Frame and preview its schema.

In [1]:
from pyspark.sql import SparkSession

import getpass

# OS username, used to keep the warehouse path, application name and topic name user-specific
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1'). \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Kafka and Spark Integration'). \
    master('yarn'). \
    getOrCreate()
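
The value passed to `spark.jars.packages` above is a Maven coordinate whose suffix encodes the Scala build (`2.12`) and the Spark version (`3.0.1`); these should match the Spark installation in use. As a minimal sketch, the coordinate can be built from the two versions. The helper name `kafka_package_coordinate` is hypothetical and not part of any Spark API.

```python
def kafka_package_coordinate(spark_version, scala_version='2.12'):
    """Build the Maven coordinate of the Spark-Kafka integration package.

    Hypothetical helper for illustration only; it just formats a string.
    """
    return f'org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}'


# Coordinate for Spark 3.0.1 built against Scala 2.12, as used in the session above
coordinate = kafka_package_coordinate('3.0.1')
print(coordinate)

# The same coordinate can be supplied when launching the CLI instead of using config:
#   pyspark --packages <coordinate>
```

On a cluster running a different Spark or Scala version, only the two arguments would change.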

In [2]:
# Comma-separated list of Kafka brokers in host:port form
kafka_bootstrap_servers = 'w01.itversity.com:9092,w02.itversity.com:9092'

In [3]:
df = spark. \
  readStream. \
  format('kafka'). \
  option('kafka.bootstrap.servers', kafka_bootstrap_servers). \
  option('subscribe', f'{username}_retail'). \
  load()
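
The reader above relies on the default starting position, which for a streaming query is `latest`, so only messages that arrive after the query starts are read. The following sketch collects the source options in a dictionary and adds `startingOffsets` so that the topic can be read from the beginning. The helper names and the topic name `demo_retail` are assumptions made for illustration; `read_kafka_stream` is defined but not called here because it needs a live `SparkSession` and reachable brokers.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets='latest'):
    """Return the Kafka source options as a dictionary (hypothetical helper)."""
    return {
        'kafka.bootstrap.servers': bootstrap_servers,
        'subscribe': topic,
        # 'earliest' reads the topic from the beginning, 'latest' only new messages
        'startingOffsets': starting_offsets,
    }


def read_kafka_stream(spark, options):
    """Create a streaming Data Frame from Kafka using the given options."""
    return spark.readStream.format('kafka').options(**options).load()


# 'demo_retail' is a placeholder topic name
options = kafka_source_options(
    'w01.itversity.com:9092,w02.itversity.com:9092',
    'demo_retail',
    starting_offsets='earliest'
)
print(options)
```

With a session available, `df = read_kafka_stream(spark, options)` would be expected to behave like the chained `option` calls shown above.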

In [4]:
df.isStreaming

True

In [5]:
df.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)
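
As the schema shows, `key` and `value` arrive as binary columns, so they are usually cast before further processing. The sketch below casts both to strings using `selectExpr`, assuming the messages carry text payloads such as log lines. The names `KAFKA_STRING_EXPRS` and `decode_kafka_payload` are hypothetical.

```python
# Expressions that cast the binary key and value to strings and keep the metadata columns
KAFKA_STRING_EXPRS = [
    'CAST(key AS STRING) AS key',
    'CAST(value AS STRING) AS value',
    'topic',
    'partition',
    'offset',
    'timestamp',
]


def decode_kafka_payload(df):
    """Return a Data Frame with key and value cast to strings.

    Assumes the payload is text; other encodings (Avro, for example)
    need a different decoding step.
    """
    return df.selectExpr(*KAFKA_STRING_EXPRS)
```

Applied to the streaming Data Frame created above, `decode_kafka_payload(df)` should again return a streaming Data Frame, now with `key` and `value` as strings.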

