### Structured Streaming with Kafka 
In this notebook, we'll examine how to connect Structured Streaming to Apache Kafka, a popular publish-subscribe system, to stream Wikipedia edits in real time across many different languages.

#### Objectives:
* Learn about Kafka
* Learn how to establish a connection with Kafka
* Learn more about creating visualizations

First, run the following cell to import the data and make various utilities available for our experimentation.

In [0]:
%run "./Includes/Classroom-Setup"

### 1.0. The Kafka Ecosystem
Kafka is software built on the **publish/subscribe** messaging pattern. In publish/subscribe messaging, a sender (publisher) sends a message that is not directed to any particular receiver (subscriber). The publisher classifies the message somehow, and the receiver subscribes to receive certain categories of messages. There are other usage patterns for Kafka, but this is the pattern we focus on in this course.

Publisher/subscriber systems typically have a central point where messages are published, called a **broker**. The broker receives messages from publishers, assigns offsets to them and commits messages to storage.

The Kafka version of a unit of data is an array of bytes called a **message**. A message can also contain a bit of information related to partitioning called a **key**.  In Kafka, messages are categorized into **topics**.
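
To make these terms concrete, here is a minimal in-memory sketch of a broker. It is not how Kafka is implemented; `TinyBroker` and its methods are hypothetical names invented for illustration, and each topic is modeled as a single append-only log with no partitions or storage.

```python
from collections import defaultdict


class TinyBroker:
    """Toy broker: one append-only log per topic (hypothetical, for illustration only)."""

    def __init__(self):
        self._logs = defaultdict(list)  # topic -> list of (key, value) records

    def publish(self, topic, value, key=None):
        """Append a message to the topic's log and return the offset assigned to it."""
        log = self._logs[topic]
        log.append((key, value))
        return len(log) - 1

    def read(self, topic, offset=0):
        """Return (offset, key, value) for every message at or after `offset`."""
        log = self._logs.get(topic, [])
        return [(i, k, v) for i, (k, v) in enumerate(log) if i >= offset]


broker = TinyBroker()
broker.publish("en", b'{"page": "Apache Kafka"}')             # offset 0 in topic "en"
broker.publish("es", b'{"page": "Madrid"}')                   # offset 0 in topic "es"
last = broker.publish("en", b'{"page": "Apache Spark"}', key=b"spark")

print(last)                          # offset assigned to the second "en" message
print(broker.read("en", offset=1))   # a subscriber resuming from offset 1
```

Note that the publisher never names a receiver: it only names a topic, and any subscriber can read that topic from whatever offset it chooses.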


#### 1.1. The Kafka Server
The Kafka server is fed by a separate TCP server that reads the Wikipedia edits, in real time, from the various language-specific IRC channels to which Wikimedia posts them.  That server parses the IRC data, converts the results to JSON, and sends the JSON to a Kafka server, with the edits segregated by language. The various languages are **topics**.  For example, the Kafka topic "en" corresponds to edits for en.wikipedia.org.


##### Required Options
When consuming from a Kafka source, you **must** specify at least two options:
1.  The Kafka bootstrap servers, for example: `dsr.option("kafka.bootstrap.servers", "server1.databricks.training:9092")`
2.  Some indication of the topics you want to consume.

#### 1.2. Specifying a Topic
There are three mutually exclusive ways to specify the topics for consumption:

| Option        | Value                                          | Description                            | Example |
| ------------- | ---------------------------------------------- | -------------------------------------- | ------- |
| **subscribe** | A comma-separated list of topics               | A list of topics to which to subscribe | `dsr.option("subscribe", "topic1")` <br/> `dsr.option("subscribe", "topic1,topic2,topic3")` |
| **assign**    | A JSON string indicating topics and partitions | Specific topic-partitions to consume.  | `dsr.option("assign", '{"topic1": [1,3], "topic2": [2,5]}')` |
| **subscribePattern**   | A (Java) regular expression           | A pattern to match desired topics      | `dsr.option("subscribePattern", "e[ns]")` <br/> `dsr.option("subscribePattern", "topic[123]")`|
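
As a rough sketch of the "exactly one selector" rule, the helper below builds an options dictionary and rejects any combination other than a single topic selector. `kafka_source_options` is a hypothetical helper written for this illustration, not part of Spark. The pattern check uses Python's `re` module and assumes whole-name matching; Kafka itself uses Java regular expressions, which behave the same way for a simple character class like this one.

```python
import json
import re


def kafka_source_options(bootstrap_servers, subscribe=None, assign=None, subscribe_pattern=None):
    """Build Kafka source options, requiring exactly one topic selector (hypothetical helper)."""
    selectors = {
        "subscribe": ",".join(subscribe) if subscribe else None,  # comma-separated list
        "assign": json.dumps(assign) if assign else None,         # JSON string
        "subscribePattern": subscribe_pattern,                    # regular expression
    }
    chosen = {name: value for name, value in selectors.items() if value is not None}
    if len(chosen) != 1:
        raise ValueError("Specify exactly one of subscribe, assign, or subscribePattern")
    return {"kafka.bootstrap.servers": bootstrap_servers, **chosen}


opts = kafka_source_options("server1.databricks.training:9092", subscribe=["en", "es"])
print(opts)

# Which topic names would the pattern "e[ns]" select from a made-up list of topics?
pattern = re.compile("e[ns]")
matched = [topic for topic in ["en", "es", "eo", "de"] if pattern.fullmatch(topic)]
print(matched)
```

In a notebook, a dictionary like `opts` could be handed to the reader with `spark.readStream.format("kafka").options(**opts)`.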

**Note:** In the example to follow, we're using the "subscribe" option to select the topics we're interested in consuming.  We've selected only the "en" topic, corresponding to edits for the English Wikipedia.  If we wanted to consume multiple topics (multiple Wikipedia languages, in our case), we could just specify them as a comma-separated list:

```dsr.option("subscribe", "en,es,it,fr,de,eo")```

There are other, optional arguments you can give the Kafka source. For more information, see the <a href="https://people.apache.org//~pwendell/spark-nightly/spark-branch-2.1-docs/latest/structured-streaming-kafka-integration.html#" target="_blank">Structured Streaming and Kafka Integration Guide</a>.

#### 1.3. The Kafka Schema
Reading from Kafka returns a `DataFrame` with the following fields:

| Field             | Type   | Description |
|------------------ | ------ |------------ |
| **key**           | binary | The key of the record (not needed) |
| **value**         | binary | Our JSON payload. We'll need to cast it to STRING |
| **topic**         | string | The topic this record is received from (not needed) |
| **partition**     | int    | The Kafka topic partition from which this record is received (not needed). This server only has one partition. |
| **offset**        | long   | The position of this record in the corresponding Kafka topic partition (not needed) |
| **timestamp**     | timestamp | The timestamp of this record  |
| **timestampType** | int    | The timestamp type of a record (not needed) |
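
Outside of Spark, the step from a binary `value` to usable JSON can be sketched in plain Python. The payload below is a made-up sample, not real data from the server; decoding the bytes as UTF-8 plays the role that `cast("STRING")` plays on the `value` column.

```python
import json

# Made-up sample payload, standing in for the binary "value" column of one record
raw_value = b'{"wikipedia": "en", "page": "Apache Kafka", "isAnonymous": false}'

text = raw_value.decode("utf-8")  # bytes -> str, analogous to cast("STRING")
edit = json.loads(text)           # str -> dict, analogous to parsing the JSON payload

print(type(raw_value).__name__, "->", type(text).__name__)
print(edit["page"], edit["isAnonymous"])
```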

In the example below, the only column we want to keep is `value`.

**Note:** The default value of `spark.sql.shuffle.partitions` is 200. This setting controls the number of partitions used by shuffle operations such as `groupBy`. Here, we set it to match the current number of cores.

In [0]:
from pyspark.sql.functions import col
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

kafkaServer = "server1.databricks.training:9092"   # US (Oregon)
# kafkaServer = "server2.databricks.training:9092" # Singapore

editsDF = (spark.readStream                        # Get the DataStreamReader
  .format("kafka")                                 # Specify the source format as "kafka"
  .option("kafka.bootstrap.servers", kafkaServer)  # Configure the Kafka server name and port
  .option("subscribe", "en")                       # Subscribe to the "en" Kafka topic
  .option("startingOffsets", "earliest")           # Rewind stream to beginning when we restart notebook
  .option("maxOffsetsPerTrigger", 1000)            # Throttle Kafka's processing of the streams
  .load()                                          # Load the DataFrame
  .select(col("value").cast("STRING"))             # Cast the "value" column to STRING
)
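
To see what the `maxOffsetsPerTrigger` cap means, here is a simplified model in plain Python. It assumes a single partition and a fixed backlog of offsets, and `plan_batches` is a hypothetical function for illustration; the real source also handles multiple partitions and data that keeps arriving.

```python
def plan_batches(start_offset, end_offset, max_offsets_per_trigger):
    """Split the half-open offset range [start_offset, end_offset) into capped micro-batches."""
    batches = []
    current = start_offset
    while current < end_offset:
        upper = min(current + max_offsets_per_trigger, end_offset)
        batches.append((current, upper))  # one micro-batch reads offsets current..upper-1
        current = upper
    return batches


# A backlog of 2,500 messages read from the earliest offset with a cap of 1,000 per trigger
batches = plan_batches(0, 2500, 1000)
print(batches)
```

With the cap in place, a large backlog is consumed over several triggers rather than in one oversized first batch.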

Let's display some data.

In [0]:
myStreamName = "lesson04a_ps"
display(editsDF,  streamName = myStreamName)

Wait until stream is done initializing...

In [0]:
untilStreamIsReady(myStreamName)

Make sure to stop the stream before continuing.

In [0]:
stopAllStreams()

### 2.0. Use Kafka to Display the Raw Data

The Kafka server acts as a sort of "firehose" (or asynchronous buffer) that delivers raw data. Since raw data coming in from a stream is transient, we'd like to save it to a more permanent data structure.  The first step is to define the schema for the JSON payload.

**Note:** Only those fields of future interest are commented below.

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType
from pyspark.sql.functions import from_json, unix_timestamp

schema = StructType([
  StructField("channel", StringType(), True),
  StructField("comment", StringType(), True),
  StructField("delta", IntegerType(), True),
  StructField("flag", StringType(), True),
  StructField("geocoding", StructType([                 # (OBJECT): Added by the server, field contains IP address geocoding information for anonymous edit.
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("countryCode2", StringType(), True),
    StructField("countryCode3", StringType(), True),
    StructField("stateProvince", StringType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("longitude", DoubleType(), True),
  ]), True),
  StructField("isAnonymous", BooleanType(), True),      # (BOOLEAN): Whether or not the change was made by an anonymous user
  StructField("isNewPage", BooleanType(), True),
  StructField("isRobot", BooleanType(), True),
  StructField("isUnpatrolled", BooleanType(), True),
  StructField("namespace", StringType(), True),         # (STRING): Page's namespace. See https://en.wikipedia.org/wiki/Wikipedia:Namespace 
  StructField("page", StringType(), True),              # (STRING): Printable name of the page that was edited
  StructField("pageURL", StringType(), True),           # (STRING): URL of the page that was edited
  StructField("timestamp", StringType(), True),         # (STRING): Time the edit occurred, in ISO-8601 format
  StructField("url", StringType(), True),
  StructField("user", StringType(), True),              # (STRING): User who made the edit or the IP address associated with the anonymous editor
  StructField("userURL", StringType(), True),
  StructField("wikipediaURL", StringType(), True),
  StructField("wikipedia", StringType(), True),         # (STRING): Short name of the Wikipedia that was edited (e.g., "en" for the English)
])

Next we can use the function `from_json` to parse out the full message with the schema specified above.

In [0]:
from pyspark.sql.functions import col, from_json

jsonEdits = editsDF.select(
  from_json("value", schema).alias("json"))  # Parse the column "value" and name it "json"

When parsing a value from JSON, we end up with a single column containing a complex object. We can clearly see this by simply printing the schema.

In [0]:
jsonEdits.printSchema()

The fields of a complex object can be referenced with a "dot" notation as in: `col("json.geocoding.countryCode3")` 
Since a large number of these fields/columns can become unwieldy, it's common to extract the sub-fields and represent them as first-level columns as seen below:

In [0]:
from pyspark.sql.functions import isnull, unix_timestamp

anonDF = (jsonEdits
  .select(col("json.wikipedia").alias("wikipedia"),      # Promoting from sub-field to column
          col("json.isAnonymous").alias("isAnonymous"),  #     "       "      "      "    "
          col("json.namespace").alias("namespace"),      #     "       "      "      "    "
          col("json.page").alias("page"),                #     "       "      "      "    "
          col("json.pageURL").alias("pageURL"),          #     "       "      "      "    "
          col("json.geocoding").alias("geocoding"),      #     "       "      "      "    "
          col("json.user").alias("user"),                #     "       "      "      "    "
          col("json.timestamp").cast("timestamp"))       # Promoting and converting to a timestamp
  .filter(col("namespace") == "article")                 # Limit result to just articles
  .filter(~isnull(col("geocoding.countryCode3")))        # We only want results that are geocoded
)
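
The same promote-and-filter logic can be sketched on a few records in plain Python, which may help when reasoning about which rows survive. The sample payloads are invented for this illustration, and `promote` and `is_geocoded_article` are hypothetical helper names; fields missing from a payload come back as `None`, much as they would be null in the parsed column.

```python
import json


def promote(raw_json):
    """Parse one JSON payload and lift selected sub-fields to top-level keys."""
    record = json.loads(raw_json)
    return {
        "wikipedia": record.get("wikipedia"),
        "isAnonymous": record.get("isAnonymous"),
        "namespace": record.get("namespace"),
        "page": record.get("page"),
        "geocoding": record.get("geocoding"),
        "user": record.get("user"),
    }


def is_geocoded_article(row):
    """Keep only article edits that carry a geocoded country code."""
    geocoding = row["geocoding"] or {}
    return row["namespace"] == "article" and geocoding.get("countryCode3") is not None


# Made-up payloads: a geocoded article edit, a talk-page edit, and a non-geocoded edit
samples = [
    '{"wikipedia": "en", "namespace": "article", "page": "Apache Kafka", '
    '"isAnonymous": true, "user": "203.0.113.7", "geocoding": {"countryCode3": "USA"}}',
    '{"wikipedia": "en", "namespace": "talk", "page": "Talk:Apache Kafka", '
    '"isAnonymous": true, "user": "203.0.113.8", "geocoding": {"countryCode3": "DEU"}}',
    '{"wikipedia": "en", "namespace": "article", "page": "Apache Spark", '
    '"isAnonymous": false, "user": "ExampleUser"}',
]

rows = [promote(sample) for sample in samples]
kept = [row for row in rows if is_geocoded_article(row)]
print([row["page"] for row in kept])
```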

#### 2.1. Mapping Anonymous Editors' Locations

When you run the query, the default output is a [live] HTML table. The geocoded information allows us to associate an anonymous edit with a country. We can then use that geocoded information to plot edits on a [live] world map. To create a world map visualization of the data, configure the plot as described below.

Under **Plot Options**, use the following:
* **Keys:** `countryCode3`
* **Values:** `count`

In **Display type**, use **World map** and click **Apply**.

<img src="https://files.training.databricks.com/images/eLearning/Structured-Streaming/plot-options-map-04.png"/>

By invoking a `display` action on a DataFrame created from a `readStream` transformation, we can generate a LIVE visualization! 

**Note:** Keep an eye on the plot for a minute or two and watch the colors change.

In [0]:
mappedDF = (anonDF
  .groupBy("geocoding.countryCode3") # Aggregate by country (code)
  .count()                           # Produce a count of each aggregate
)
display(mappedDF, streamName = myStreamName)

Wait until stream is done initializing...

In [0]:
untilStreamIsReady(myStreamName)

Stop the streams.

In [0]:
stopAllStreams()
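
The colors on the map change because the per-country counts are running totals that are updated as each micro-batch arrives. A plain-Python sketch of that idea, using made-up batches and a hypothetical `update_counts` helper, looks like this:

```python
from collections import Counter


def update_counts(running, batch):
    """Fold one micro-batch of rows into the running per-country totals."""
    running.update(row["countryCode3"] for row in batch)
    return running


# Made-up micro-batches of already-filtered, geocoded edits
micro_batches = [
    [{"countryCode3": "USA"}, {"countryCode3": "DEU"}],
    [{"countryCode3": "USA"}],
]

running = Counter()
snapshots = []
for batch in micro_batches:
    update_counts(running, batch)
    snapshots.append(dict(running))  # what the live table would show after this batch

print(snapshots)
```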

#### Review Questions

**Q:** What `format` should you use with Kafka?<br>
**A:** `format("kafka")`

**Q:** How do you specify a Kafka server?<br>
**A:** `.option("kafka.bootstrap.servers", "server1.databricks.training:9092")`

**Q:** What verb should you use in conjunction with `readStream` and Kafka to start the streaming job?<br>
**A:** `load()`, but with no parameters since we are pulling from a Kafka server.

**Q:** What fields are returned in a Kafka DataFrame?<br>
**A:** Reading from Kafka returns a DataFrame with the following fields:
key, value, topic, partition, offset, timestamp, timestampType

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [0]:
%run "./Includes/Classroom-Cleanup"

##### Additional Topics & Resources

* <a href="http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-stream#" target="_blank">Create a Kafka Source Stream</a>
* <a href="https://kafka.apache.org/documentation/" target="_blank">Official Kafka Documentation</a>
* <a href="https://www.confluent.io/blog/okay-store-data-apache-kafka/" target="_blank">Use Kafka to store data</a>