## Part 1 - Produce Data
We elected to use the Datagen connector to generate fake data for this assignment. The topic we used was 'stocktrades'. The steps were as follows:
*  Open a browser and go to http://localhost:9021/
*  Select the available cluster
*  On the menu bar, select Connect
*  Click on the connect-default cluster in the Connect Clusters list.
*  Click on Add connector
*  Select DatagenConnector
*  Enter connector_stock_trades in the Name field

Then generate a data stream with the following configuration:
```
{
  "name": "connector_stock_trades",
  "connector.class": "io.confluent.kafka.connect.datagen.DatagenConnector",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "kafka.topic": "stocktrades",
  "max.interval": "100",
  "quickstart": "Stock_Trades"
}
```
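
The same settings could in principle be submitted without the UI. As a minimal sketch, the snippet below only builds the request body in the `{"name": ..., "config": {...}}` shape that the Kafka Connect REST API accepts on `POST /connectors`. The helper name `build_connector_request` is hypothetical, and because the Connect worker URL for this setup is not stated in the original, no request is actually sent.

```python
import json

# Connector settings copied from the configuration above.
connector_config = {
    "name": "connector_stock_trades",
    "connector.class": "io.confluent.kafka.connect.datagen.DatagenConnector",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "kafka.topic": "stocktrades",
    "max.interval": "100",
    "quickstart": "Stock_Trades",
}


def build_connector_request(config):
    """Hypothetical helper: wrap a flat connector config in the
    name/config body used when creating a connector over REST."""
    settings = {k: v for k, v in config.items() if k != "name"}
    return {"name": config["name"], "config": settings}


request_body = build_connector_request(connector_config)
print(json.dumps(request_body, indent=2))
```

The printed JSON could then be posted to the Connect worker with any HTTP client once its address is known.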


## Part 2 - Using Ksql to create at least 2 streams with filtering from topics

To begin, we create a stream called stocktrades over the topic with no filtering in place.

### Stream 1 - Sell Stream
It may be in the interest of the business to view only the trades where the stock was sold rather than bought. This would be useful in identifying which shares should be taken as a 'short' position.

### Stream 2 - Buy Stream
It may also be interesting to the business to see trades that were large buys.
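
The filtering logic behind these two streams can be illustrated in plain Python. This is only a sketch of the predicates, not the KSQL statements themselves. The sample records reuse values from the flattened output shown in Part 3, while the `LARGE_BUY_QUANTITY` threshold and the helper names are assumptions made for the example.

```python
# Sample trades taken from the flattened output shown in Part 3.
trades = [
    {"side": "BUY", "quantity": 2057, "price": 979, "symbol": "ZXZZT"},
    {"side": "SELL", "quantity": 3951, "price": 194, "symbol": "ZVV"},
    {"side": "BUY", "quantity": 785, "price": 216, "symbol": "ZXZZT"},
]

# Assumed cut-off for a "large" buy; the original does not state one.
LARGE_BUY_QUANTITY = 2000


def is_sell(trade):
    """Keep only trades where the stock was sold."""
    return trade["side"] == "SELL"


def is_large_buy(trade, min_quantity=LARGE_BUY_QUANTITY):
    """Keep only buys whose quantity reaches the assumed threshold."""
    return trade["side"] == "BUY" and trade["quantity"] >= min_quantity


sell_stream = [t for t in trades if is_sell(t)]
large_buy_stream = [t for t in trades if is_large_buy(t)]

print([t["symbol"] for t in sell_stream])
print([t["quantity"] for t in large_buy_stream])
```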

### Table 1 - Aggregated Buy Trades


### Table 2 - Aggregated Sell Trades

## Part 3 - Consume/Transform data with Spark Streaming

In [1]:
from pyspark.sql import SparkSession
from IPython.display import display, clear_output
import time
from pyspark.sql import functions as F
from pyspark.sql.types import StructType,StringType, StructField, IntegerType, FloatType, BinaryType

In [2]:
spark = SparkSession.builder \
        .appName('kafka') \
        .getOrCreate()

In [3]:
spark.version

'3.1.1'

In [4]:
spark._jvm.org.apache.hadoop.util.VersionInfo.getVersion()

'3.2.0'

## Raw Data Streams

In [5]:
stream_df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "broker:29092") \
  .option("startingOffsets", "earliest") \
  .option("subscribe", "stocktrades") \
  .load()

In [6]:
stream_df.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



In [7]:
raw_stream = stream_df \
    .writeStream \
    .format("memory") \
    .queryName("raw_stocktrades_view") \
    .start()

In [10]:
clear_output(wait=True)
display(spark.sql('SELECT key, value FROM raw_stocktrades_view').show(20))
time.sleep(1)

+----------------+--------------------+
|             key|               value|
+----------------+--------------------+
|[5A 58 5A 5A 54]|[7B 22 73 63 68 6...|
|[5A 54 45 53 54]|[7B 22 73 63 68 6...|
|      [5A 56 56]|[7B 22 73 63 68 6...|
|   [5A 42 5A 58]|[7B 22 73 63 68 6...|
|   [5A 42 5A 58]|[7B 22 73 63 68 6...|
|[5A 54 45 53 54]|[7B 22 73 63 68 6...|
|[5A 58 5A 5A 54]|[7B 22 73 63 68 6...|
|[5A 54 45 53 54]|[7B 22 73 63 68 6...|
|[5A 58 5A 5A 54]|[7B 22 73 63 68 6...|
|[5A 56 5A 5A 54]|[7B 22 73 63 68 6...|
|   [5A 42 5A 58]|[7B 22 73 63 68 6...|
|      [5A 56 56]|[7B 22 73 63 68 6...|
|[5A 4A 5A 5A 54]|[7B 22 73 63 68 6...|
|[5A 57 5A 5A 54]|[7B 22 73 63 68 6...|
|      [5A 56 56]|[7B 22 73 63 68 6...|
|      [5A 56 56]|[7B 22 73 63 68 6...|
|[5A 54 45 53 54]|[7B 22 73 63 68 6...|
|[5A 4A 5A 5A 54]|[7B 22 73 63 68 6...|
|[5A 58 5A 5A 54]|[7B 22 73 63 68 6...|
|[5A 4A 5A 5A 54]|[7B 22 73 63 68 6...|
+----------------+--------------------+
only showing top 20 rows



None

In [11]:
raw_stream.stop()

### Convert Key Value pairs to strings

In [13]:
string_stream_df = stream_df \
    .withColumn("key", stream_df["key"].cast(StringType())) \
    .withColumn("value", stream_df["value"].cast(StringType()))

In [14]:
string_stream = string_stream_df \
    .writeStream \
    .format("memory") \
    .queryName("string_stocktrades_view") \
    .start()

In [15]:
clear_output(wait=True)
display(spark.sql('SELECT key, value FROM string_stocktrades_view').show(20))
time.sleep(1)

+-----+--------------------+
|  key|               value|
+-----+--------------------+
|ZXZZT|{"schema":{"type"...|
|ZTEST|{"schema":{"type"...|
|  ZVV|{"schema":{"type"...|
| ZBZX|{"schema":{"type"...|
| ZBZX|{"schema":{"type"...|
|ZTEST|{"schema":{"type"...|
|ZXZZT|{"schema":{"type"...|
|ZTEST|{"schema":{"type"...|
|ZXZZT|{"schema":{"type"...|
|ZVZZT|{"schema":{"type"...|
| ZBZX|{"schema":{"type"...|
|  ZVV|{"schema":{"type"...|
|ZJZZT|{"schema":{"type"...|
|ZWZZT|{"schema":{"type"...|
|  ZVV|{"schema":{"type"...|
|  ZVV|{"schema":{"type"...|
|ZTEST|{"schema":{"type"...|
|ZJZZT|{"schema":{"type"...|
|ZXZZT|{"schema":{"type"...|
|ZJZZT|{"schema":{"type"...|
+-----+--------------------+
only showing top 20 rows



None

In [16]:
string_stream.stop()
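
To see what the cast does, the hex dumps from the raw view can be decoded by hand. The sketch below uses only the Python standard library and assumes the bytes are UTF-8 encoded text; `decode_kafka_bytes` is a hypothetical helper name.

```python
def decode_kafka_bytes(hex_dump):
    """Decode a space-separated hex dump, as printed in the raw view,
    into a string (assuming UTF-8 encoded bytes)."""
    return bytes.fromhex(hex_dump).decode("utf-8")


# Keys from the raw view become the symbols seen in the string view.
print(decode_kafka_bytes("5A 58 5A 5A 54"))  # ZXZZT
print(decode_kafka_bytes("5A 56 56"))        # ZVV

# The first bytes of each value are the start of a JSON document.
print(decode_kafka_bytes("7B 22 73 63 68"))  # {"sch
```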

## Transformation

In [17]:
schema_stocktrades =  StructType([
    StructField('payload', StructType([
        StructField("side", StringType(),  True),
        StructField("quantity", IntegerType(),  True),
        StructField("price", IntegerType(),  True),
        StructField("symbol", StringType(),  True),
        StructField("account", StringType(), True),
         StructField("userid", StringType(), True)
    ]))
])

In [18]:
json_stream_df = string_stream_df\
    .withColumn("value", F.from_json("value", schema_stocktrades))

In [19]:
json_stream_df.printSchema()

root
 |-- key: string (nullable = true)
 |-- value: struct (nullable = true)
 |    |-- payload: struct (nullable = true)
 |    |    |-- side: string (nullable = true)
 |    |    |-- quantity: integer (nullable = true)
 |    |    |-- price: integer (nullable = true)
 |    |    |-- symbol: string (nullable = true)
 |    |    |-- account: string (nullable = true)
 |    |    |-- userid: string (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)
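
The schema nests every field under `payload` because each message value is a JSON envelope with a `schema` part and a `payload` part, as the `{"schema":{"type"...` prefix in the string view suggests. As a rough illustration, the sketch below parses a hand-written sample message with the standard `json` module. The `schema` content is abbreviated and the sample as a whole is an assumption, with field values borrowed from the flattened output further down.

```python
import json

# Hand-written sample in the envelope shape; the "schema" part is
# abbreviated here and is not the full schema sent by the connector.
sample_value = json.dumps({
    "schema": {"type": "struct"},
    "payload": {
        "side": "BUY",
        "quantity": 2057,
        "price": 979,
        "symbol": "ZXZZT",
        "account": "XYZ789",
        "userid": "User_5",
    },
})

# Parsing the string gives a dict; the trade fields sit under "payload".
record = json.loads(sample_value)
payload = record["payload"]

print(sorted(record.keys()))
print(payload["side"], payload["quantity"], payload["symbol"])
```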



In [20]:
json_stream = json_stream_df \
    .writeStream \
    .format("memory") \
    .queryName("extract_stocktrades_view") \
    .start()

In [21]:
clear_output(wait=True)
display(spark.sql('SELECT * FROM extract_stocktrades_view').show(20, False))
time.sleep(1)

+---+-----+-----+---------+------+---------+-------------+
|key|value|topic|partition|offset|timestamp|timestampType|
+---+-----+-----+---------+------+---------+-------------+
+---+-----+-----+---------+------+---------+-------------+



None

In [22]:
json_stream.stop()

### Flatten Data

In [23]:
stocktrades_stream_df = json_stream_df \
    .select( \
        F.col("key").alias("event_key"), \
        F.col("topic").alias("event_topic"), \
        F.col("timestamp").alias("event_timestamp"), \
        "value.payload.side", \
        "value.payload.quantity", \
        "value.payload.price", \
        "value.payload.symbol", \
        "value.payload.account", \
        "value.payload.userid"
    )

In [24]:
stocktrades_stream_df.printSchema()

root
 |-- event_key: string (nullable = true)
 |-- event_topic: string (nullable = true)
 |-- event_timestamp: timestamp (nullable = true)
 |-- side: string (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- price: integer (nullable = true)
 |-- symbol: string (nullable = true)
 |-- account: string (nullable = true)
 |-- userid: string (nullable = true)
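
The flattening step amounts to renaming three top-level fields and lifting the `payload` fields up one level. A minimal plain-Python sketch of the same reshaping follows; `flatten_event` is a hypothetical helper and the timestamp in the sample event is a placeholder, not a value from the stream.

```python
def flatten_event(event):
    """Rename the Kafka metadata fields and lift the nested payload
    fields to the top level, mirroring the select above."""
    flat = {
        "event_key": event["key"],
        "event_topic": event["topic"],
        "event_timestamp": event["timestamp"],
    }
    flat.update(event["value"]["payload"])
    return flat


sample_event = {
    "key": "ZVV",
    "topic": "stocktrades",
    "timestamp": "2021-01-01 00:00:00",  # placeholder timestamp
    "value": {
        "payload": {
            "side": "SELL",
            "quantity": 3951,
            "price": 194,
            "symbol": "ZVV",
            "account": "XYZ789",
            "userid": "User_4",
        }
    },
}

flat = flatten_event(sample_event)
print(list(flat.keys()))
```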



In [25]:
stocktrades_stream = stocktrades_stream_df \
    .writeStream \
    .format("memory") \
    .queryName("stocktrades_view") \
    .start()

In [26]:
clear_output(wait=True)
display(spark.sql('SELECT * FROM stocktrades_view').show(20))
time.sleep(1)

+---------+-----------+--------------------+----+--------+-----+------+-------+------+
|event_key|event_topic|     event_timestamp|side|quantity|price|symbol|account|userid|
+---------+-----------+--------------------+----+--------+-----+------+-------+------+
|    ZXZZT|stocktrades|2021-06-05 08:13:...| BUY|    2057|  979| ZXZZT| XYZ789|User_5|
|    ZTEST|stocktrades|2021-06-05 08:13:...| BUY|    4709|  793| ZTEST| ABC123|User_6|
|      ZVV|stocktrades|2021-06-05 08:13:...|SELL|    3951|  194|   ZVV| XYZ789|User_4|
|     ZBZX|stocktrades|2021-06-05 08:13:...|SELL|    1822|  821|  ZBZX| LMN456|User_2|
|     ZBZX|stocktrades|2021-06-05 08:13:...|SELL|    2851|  337|  ZBZX| XYZ789|User_7|
|    ZTEST|stocktrades|2021-06-05 08:13:...|SELL|    2874|  980| ZTEST| XYZ789|User_6|
|    ZXZZT|stocktrades|2021-06-05 08:13:...| BUY|     785|  216| ZXZZT| ABC123|User_2|
|    ZTEST|stocktrades|2021-06-05 08:13:...|SELL|    2617|  867| ZTEST| ABC123|User_7|
|    ZXZZT|stocktrades|2021-06-05 08:13:...

None

In [32]:
stocktrades_stream.stop()

## Create neater function to generate stream
This function wraps the steps above, from reading the stocktrades topic to starting the in-memory query, so the stream can be generated with one line of code and is easier to call in later components.

In [44]:
def generate_stocktrades_stream():
    # Read the raw stocktrades topic from Kafka, starting at the earliest offset
    stream_df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "broker:29092") \
        .option("startingOffsets", "earliest") \
        .option("subscribe", "stocktrades") \
        .load()
    # Cast the binary key and value columns to strings
    string_stream_df = stream_df \
        .withColumn("key", stream_df["key"].cast(StringType())) \
        .withColumn("value", stream_df["value"].cast(StringType()))
    # Schema of the JSON value: the trade fields sit under "payload"
    schema_stocktrades = StructType([
        StructField("payload", StructType([
            StructField("side", StringType(), True),
            StructField("quantity", IntegerType(), True),
            StructField("price", IntegerType(), True),
            StructField("symbol", StringType(), True),
            StructField("account", StringType(), True),
            StructField("userid", StringType(), True)
        ]))
    ])
    # Parse the JSON string into a struct column
    json_stream_df = string_stream_df \
        .withColumn("value", F.from_json("value", schema_stocktrades))
    # Flatten the nested payload into top-level columns
    stocktrades_stream_df = json_stream_df \
        .select(
            F.col("key").alias("event_key"),
            F.col("topic").alias("event_topic"),
            F.col("timestamp").alias("event_timestamp"),
            "value.payload.side",
            "value.payload.quantity",
            "value.payload.price",
            "value.payload.symbol",
            "value.payload.account",
            "value.payload.userid"
        )
    # Start an in-memory query that can be read with spark.sql
    return stocktrades_stream_df \
        .writeStream \
        .format("memory") \
        .queryName("stocktrades_view") \
        .start()

In [46]:
stocktrades_stream = generate_stocktrades_stream()

In [47]:
clear_output(wait=True)
display(spark.sql('SELECT * FROM stocktrades_view').show(20))
time.sleep(1)

+---------+-----------+--------------------+----+--------+-----+------+-------+------+
|event_key|event_topic|     event_timestamp|side|quantity|price|symbol|account|userid|
+---------+-----------+--------------------+----+--------+-----+------+-------+------+
|    ZXZZT|stocktrades|2021-06-05 08:13:...| BUY|    2057|  979| ZXZZT| XYZ789|User_5|
|    ZTEST|stocktrades|2021-06-05 08:13:...| BUY|    4709|  793| ZTEST| ABC123|User_6|
|      ZVV|stocktrades|2021-06-05 08:13:...|SELL|    3951|  194|   ZVV| XYZ789|User_4|
|     ZBZX|stocktrades|2021-06-05 08:13:...|SELL|    1822|  821|  ZBZX| LMN456|User_2|
|     ZBZX|stocktrades|2021-06-05 08:13:...|SELL|    2851|  337|  ZBZX| XYZ789|User_7|
|    ZTEST|stocktrades|2021-06-05 08:13:...|SELL|    2874|  980| ZTEST| XYZ789|User_6|
|    ZXZZT|stocktrades|2021-06-05 08:13:...| BUY|     785|  216| ZXZZT| ABC123|User_2|
|    ZTEST|stocktrades|2021-06-05 08:13:...|SELL|    2617|  867| ZTEST| ABC123|User_7|
|    ZXZZT|stocktrades|2021-06-05 08:13:...

None

In [48]:
stocktrades_stream.stop()