## Part 1 - Produce Data
We elected to use the Datagen connector to generate fake data for this assignment. The topic we used was 'stocktrades'. The steps were as follows:
*  Open a browser and go to http://localhost:9021/
*  Select the available cluster
*  On the menu bar, select Connect
*  Click on the connect-default cluster in the Connect Clusters list.
*  Click on Add connector
*  Select DatagenConnector
*  Enter connector_stock_trades in the Name field

Then generate a data stream with the following configuration:
```
Key converter class: org.apache.kafka.connect.storage.StringConverter
kafka.topic: stocktrades
max.interval: 100
quickstart: Stock_Trades
```
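The same settings can also be written as a JSON payload of the kind the Kafka Connect REST API accepts. The sketch below only builds and prints that payload, it does not send it. The connector class name is an assumption (it is not stated above), and the REST endpoint you would post to depends on your deployment.

```python
import json

# Connector name and settings as entered in the UI steps above.
connector_name = "connector_stock_trades"
connector_config = {
    # Assumption: class name behind the "DatagenConnector" entry in the UI.
    "connector.class": "io.confluent.kafka.connect.datagen.DatagenConnector",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "kafka.topic": "stocktrades",
    "max.interval": "100",
    "quickstart": "Stock_Trades",
}

# Kafka Connect expects a body of the form {"name": ..., "config": {...}}.
payload = {"name": connector_name, "config": connector_config}
payload_json = json.dumps(payload, indent=2)
print(payload_json)
```

Keeping the configuration in a file like this makes the setup repeatable without clicking through the UI each time.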


## Part 2 - Using KSQL to create at least 2 streams with filtering from topics

To begin, create a base stream called STOCKTRADES over the 'stocktrades' topic, with no filtering in place.

In [None]:
CREATE STREAM STOCKTRADES
   (SIDE STRING, QUANTITY INTEGER, SYMBOL STRING, PRICE INTEGER, ACCOUNT STRING, USERID STRING)
       WITH (KAFKA_TOPIC='stocktrades', VALUE_FORMAT='AVRO');

### Stream 1 - Sell Stream
It may be in the interest of the business to view only the trades where the stock was sold rather than bought. This could help identify which shares might be candidates for a 'short' position.

In [None]:
CREATE STREAM SELL_TRADES WITH (KAFKA_TOPIC='SELL_TRADES', PARTITIONS=1, REPLICAS=1) AS SELECT
  STOCKTRADES.QUANTITY QUANTITY,
  STOCKTRADES.SYMBOL SYMBOL,
  STOCKTRADES.PRICE PRICE,
  STOCKTRADES.ACCOUNT ACCOUNT,
  STOCKTRADES.USERID USERID
FROM STOCKTRADES STOCKTRADES
WHERE (STOCKTRADES.SIDE = 'SELL')
EMIT CHANGES;


### Stream 2 - Buy Stream
It may also be of interest to the business to isolate the buy trades. The query below filters on SIDE = 'BUY' only; it does not filter on trade size.

In [None]:
CREATE STREAM BUY_TRADES WITH (KAFKA_TOPIC='BUY_TRADES', PARTITIONS=1, REPLICAS=1) AS SELECT
  STOCKTRADES.QUANTITY QUANTITY,
  STOCKTRADES.SYMBOL SYMBOL,
  STOCKTRADES.PRICE PRICE,
  STOCKTRADES.ACCOUNT ACCOUNT,
  STOCKTRADES.USERID USERID
FROM STOCKTRADES STOCKTRADES
WHERE (STOCKTRADES.SIDE = 'BUY')
EMIT CHANGES;

### Table 1 - Aggregated Buy Trades


In [None]:
CREATE TABLE AGG_BUY_ORDERS WITH (KAFKA_TOPIC='AGG_BUY_ORDERS', PARTITIONS=1, REPLICAS=1) AS SELECT
  BUY_TRADES.SYMBOL SYMBOL,
  SUM(BUY_TRADES.QUANTITY) QUANTITY_AGG,
  AVG(BUY_TRADES.PRICE) PRICE_AVG,
  SUM((BUY_TRADES.QUANTITY * BUY_TRADES.PRICE)) VALUE_TRADED
FROM BUY_TRADES BUY_TRADES
WINDOW TUMBLING ( SIZE 60 SECONDS )
GROUP BY BUY_TRADES.SYMBOL
EMIT CHANGES;


### Table 2 - Aggregated Sell Trades

In [None]:
CREATE TABLE AGG_SELL_ORDERS WITH (KAFKA_TOPIC='AGG_SELL_ORDERS', PARTITIONS=1, REPLICAS=1) AS SELECT
  SELL_TRADES.SYMBOL SYMBOL,
  SUM(SELL_TRADES.QUANTITY) QUANTITY_AGG,
  AVG(SELL_TRADES.PRICE) PRICE_AVG,
  SUM((SELL_TRADES.QUANTITY * SELL_TRADES.PRICE)) VALUE_TRADED
FROM SELL_TRADES SELL_TRADES
WINDOW TUMBLING ( SIZE 60 SECONDS )
GROUP BY SELL_TRADES.SYMBOL
EMIT CHANGES;
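
To make the windowed aggregation in the two tables concrete, here is a minimal plain-Python sketch of a 60-second tumbling window grouped by symbol. It is not how KSQL implements it: it ignores late-arriving records and incremental emission, and the sample trades and the `ts_ms` field name are hypothetical.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # mirrors WINDOW TUMBLING ( SIZE 60 SECONDS )


def window_start(ts_ms, size_ms=WINDOW_MS):
    """Return the start of the tumbling window that contains ts_ms."""
    return ts_ms - (ts_ms % size_ms)


def aggregate(trades, size_ms=WINDOW_MS):
    """Group trades by (symbol, window start) and compute the three aggregates."""
    acc = defaultdict(lambda: {"qty": 0, "price_sum": 0, "count": 0, "value": 0})
    for t in trades:
        a = acc[(t["symbol"], window_start(t["ts_ms"], size_ms))]
        a["qty"] += t["quantity"]
        a["price_sum"] += t["price"]
        a["count"] += 1
        a["value"] += t["quantity"] * t["price"]
    return {
        key: {
            "QUANTITY_AGG": a["qty"],
            "PRICE_AVG": a["price_sum"] / a["count"],
            "VALUE_TRADED": a["value"],
        }
        for key, a in acc.items()
    }


# Hypothetical sample trades (timestamps in milliseconds), not real topic data.
sample = [
    {"symbol": "ZVZZT", "quantity": 10, "price": 100, "ts_ms": 5_000},
    {"symbol": "ZVZZT", "quantity": 20, "price": 200, "ts_ms": 59_999},
    {"symbol": "ZVZZT", "quantity": 5, "price": 300, "ts_ms": 60_000},
    {"symbol": "ZTEST", "quantity": 1, "price": 50, "ts_ms": 10_000},
]
result = aggregate(sample)
for key in sorted(result):
    print(key, result[key])
```

The third ZVZZT trade lands exactly on the 60-second boundary, so it starts a new window instead of joining the first two.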

## Part 3 - Consume/Transform data with Spark Streaming

In [50]:
from pyspark.sql import SparkSession
from IPython.display import display, clear_output
import time
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, StructField, IntegerType, FloatType, BinaryType

In [66]:
spark = SparkSession.builder \
        .appName('kafka') \
        .getOrCreate()

In [67]:
spark.version

'3.1.1'

In [68]:
spark._jvm.org.apache.hadoop.util.VersionInfo.getVersion()

'3.2.0'

## Raw Data Streams

In [69]:
stream_df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "broker:29092") \
  .option("startingOffsets", "earliest") \
  .option("subscribe", "BUY_TRADES") \
  .load()

In [70]:
stream_df.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



In [71]:
raw_stream = stream_df \
    .writeStream \
    .format("memory") \
    .queryName("raw_stocktrades_view") \
    .start()

In [72]:
clear_output(wait=True)
# show() prints the table itself and returns None, which display() then echoes after the table.
display(spark.sql('SELECT key, value FROM raw_stocktrades_view').show(20,False))
time.sleep(1)

+----+-------------------------------------------------------------------------------------------------------+
|key |value                                                                                                  |
+----+-------------------------------------------------------------------------------------------------------+
|null|[00 00 00 00 03 02 8E 43 02 0A 5A 57 5A 5A 54 02 FC 0A 02 0C 58 59 5A 37 38 39 02 0C 55 73 65 72 5F 37]|
|null|[00 00 00 00 03 02 BC 01 02 0A 5A 57 5A 5A 54 02 8C 0D 02 0C 58 59 5A 37 38 39 02 0C 55 73 65 72 5F 37]|
|null|[00 00 00 00 03 02 F2 03 02 0A 5A 4A 5A 5A 54 02 C0 07 02 0C 4C 4D 4E 34 35 36 02 0C 55 73 65 72 5F 35]|
|null|[00 00 00 00 03 02 A8 10 02 0A 5A 4A 5A 5A 54 02 EA 0C 02 0C 4C 4D 4E 34 35 36 02 0C 55 73 65 72 5F 32]|
|null|[00 00 00 00 03 02 D2 2A 02 06 5A 56 56 02 C0 02 02 0C 4C 4D 4E 34 35 36 02 0C 55 73 65 72 5F 34]      |
|null|[00 00 00 00 03 02 D2 0F 02 0A 5A 54 45 53 54 02 96 04 02 0C 4C 4D 4E 34 35 36 02 0C 55 73 65 72 5F 37]|

None

In [73]:
raw_stream.stop()

### Convert Key Value pairs to strings

In [74]:
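# Helper UDF: read a binary column as a big-endian integer and return it as a string.
# Null input (the keys in this topic are null) is passed through as None.
binary_to_string = F.udf(
    lambda x: str(int.from_bytes(x, byteorder='big')) if x is not None else None,
    StringType())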

In [75]:
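# Cast the binary key and value columns to strings.
# The value column is already binary, so no intermediate cast is needed.
string_stream_df = stream_df \
    .withColumn("key", stream_df["key"].cast(StringType())) \
    .withColumn("value", stream_df["value"].cast(StringType()))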

In [76]:
string_stream = string_stream_df \
    .writeStream \
    .format("memory") \
    .queryName("string_stocktrades_view") \
    .start()

In [77]:
clear_output(wait=True)
display(spark.sql('SELECT key, value FROM string_stocktrades_view').show(20, False))
time.sleep(1)

+----+----------------------------------+
|key |value                             |
+----+----------------------------------+
|null|    �C
ZWZZT�
XYZ789User_7|
|null|    �
XYZ789User_7|
|null|    �
ZJZZT�LMN456User_5|
|null|    �
ZJZZT�LMN456User_2|
|null|    �*ZVV�LMN456User_4  |
|null|    �
ZTEST�LMN456User_7|
|null|    �I
ZJZZTABC123User_7 |
|null|    �
ZVZZTLMN456User_9|
|null|    �>
ZJZZT�LMN456User_3|
|null|    �,
ZXZZT�
LMN456User_5|
|null|    �2
ZVZZT(XYZ789User_9 |
|null|    �L
ZTESTABC123User_5|
|null|    �"
ZWZZTABC123User_3 |
|null|    � ZBZX�XYZ789User_2 |
|null|    �
ZTEST�
XYZ789User_7|
|null|    �+
ZVZZT�LMN456User_4|
|null|    �3
ZXZZT�LMN456User_9|
|null|    �4ZBZX�XYZ789User_3 |
|null|    �D
ZWZZTBABC123User_6 |
|null|    �#
ZVZZTABC123User_8|
+----+----------------------------------+
only showing top 20 rows

None

In [51]:
string_stream.stop()

## Transformation

In [73]:
schema_stocktrades = StructType([
    StructField("side", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("price", IntegerType(), True),
    StructField("symbol", StringType(), True),
    StructField("account", StringType(), True),
    StructField("userid", StringType(), True),
])

In [74]:
json_stream_df = string_stream_df\
    .withColumn("value", F.from_json("value", schema_stocktrades))

In [75]:
json_stream_df.printSchema()

root
 |-- key: string (nullable = true)
 |-- value: struct (nullable = true)
 |    |-- side: string (nullable = true)
 |    |-- quantity: integer (nullable = true)
 |    |-- price: integer (nullable = true)
 |    |-- symbol: string (nullable = true)
 |    |-- account: string (nullable = true)
 |    |-- userid: string (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



In [68]:
json_stream = json_stream_df \
    .writeStream \
    .format("memory") \
    .queryName("extract_stocktrades_view") \
    .start()

In [71]:
clear_output(wait=True)
display(spark.sql('SELECT * FROM extract_stocktrades_view').show(20))
time.sleep(1)

+-----+--------------------+-----------+---------+------+--------------------+-------------+
|  key|               value|      topic|partition|offset|           timestamp|timestampType|
+-----+--------------------+-----------+---------+------+--------------------+-------------+
|ZVZZT|{null, null, null...|stocktrades|        0|     0|2021-06-05 05:39:...|            0|
|ZVZZT|{null, null, null...|stocktrades|        0|     1|2021-06-05 05:39:...|            0|
|ZWZZT|{null, null, null...|stocktrades|        0|     2|2021-06-05 05:39:...|            0|
|ZJZZT|{null, null, null...|stocktrades|        0|     3|2021-06-05 05:39:...|            0|
|ZVZZT|{null, null, null...|stocktrades|        0|     4|2021-06-05 05:39:...|            0|
|ZXZZT|{null, null, null...|stocktrades|        0|     5|2021-06-05 05:39:...|            0|
|ZVZZT|{null, null, null...|stocktrades|        0|     6|2021-06-05 05:39:...|            0|
| ZBZX|{null, null, null...|stocktrades|        0|     7|2021-06-05 05

None

In [72]:
json_stream.stop()
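
The all-null `value` structs in the output above are consistent with `from_json` being handed input that is not JSON. The streams in Part 2 were created with VALUE_FORMAT='AVRO', so the message values are probably Avro bytes with Schema Registry framing, not JSON text. The sketch below is a rough check on a single payload. The helper name `classify_payload` is hypothetical, and the framing rule it tests (a leading zero byte followed by a 4-byte schema ID) is an assumption about how the topic was serialized.

```python
import json


def classify_payload(payload: bytes) -> str:
    """Make a rough guess at how a Kafka message value is encoded."""
    # Assumed Schema Registry framing: magic byte 0, then a 4-byte schema ID.
    if len(payload) >= 5 and payload[0] == 0:
        return "confluent-framed"
    try:
        json.loads(payload.decode("utf-8"))
        return "json"
    except (UnicodeDecodeError, json.JSONDecodeError):
        return "unknown"


# First value shown in the raw stream output earlier in this notebook.
raw_value = bytes.fromhex(
    "00 00 00 00 03 02 8E 43 02 0A 5A 57 5A 5A 54 02 FC 0A "
    "02 0C 58 59 5A 37 38 39 02 0C 55 73 65 72 5F 37"
)
# Hypothetical JSON-encoded trade, for comparison only.
json_value = b'{"side": "BUY", "quantity": 10, "symbol": "ZVZZT", "price": 100}'

print(classify_payload(raw_value))
print(classify_payload(json_value))
```

If the payload is framed Avro, parsing it needs an Avro decoder plus the schema. Casting to a string and calling `from_json` will not work.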

In [65]:
spark.stop()

In [14]:
hex_string = '00 00 00 00 01 08 53 45 4C 4C AE 12 0A 5A 56 5A 5A 54 D8 03 0C 4C 4D 4E 34 35 36 0C 55 73 65 72 5F 39'

In [15]:
bytes_object = bytes.fromhex(hex_string)

In [18]:
print(bytes_object)

b'\x00\x00\x00\x00\x01\x08SELL\xae\x12\nZVZZT\xd8\x03\x0cLMN456\x0cUser_9'


In [20]:
b'\x00\x00\x00\x00\x01'.decode("ASCII")

'\x00\x00\x00\x00\x01'
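
The leading `\x00\x00\x00\x00\x01` matches Schema Registry framing: one magic byte followed by a 4-byte big-endian schema ID. The rest of the payload can be read by hand as Avro binary, where ints are zigzag varints and strings are length-prefixed UTF-8. The sketch below is a minimal hand-rolled decoder, not a replacement for a real Avro library. It assumes the field order follows the column order declared in Part 2. It also assumes the derived BUY_TRADES stream wraps each field in a `["null", type]` union, inferred from the `02` byte that precedes every field in the raw output. The function names are hypothetical.

```python
def read_varint(buf: bytes, pos: int):
    """Read one zigzag-encoded Avro int/long; return (value, next position)."""
    shift = 0
    raw = 0
    while True:
        b = buf[pos]
        pos += 1
        raw |= (b & 0x7F) << shift
        if not (b & 0x80):  # high bit clear marks the last byte
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1), pos


def read_string(buf: bytes, pos: int):
    """Read a length-prefixed UTF-8 string; return (text, next position)."""
    length, pos = read_varint(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def decode_trade(payload: bytes, fields, nullable: bool):
    """Decode a framed Avro record made only of int and string fields."""
    if payload[0] != 0:
        raise ValueError("missing magic byte")
    schema_id = int.from_bytes(payload[1:5], byteorder="big")
    pos = 5
    record = {}
    for name, kind in fields:
        if nullable:
            # Assumption: ["null", type] union, where branch 0 means null.
            branch, pos = read_varint(payload, pos)
            if branch == 0:
                record[name] = None
                continue
        reader = read_varint if kind == "int" else read_string
        record[name], pos = reader(payload, pos)
    return schema_id, record


STOCKTRADES_FIELDS = [("side", "str"), ("quantity", "int"), ("symbol", "str"),
                      ("price", "int"), ("account", "str"), ("userid", "str")]
BUY_TRADES_FIELDS = STOCKTRADES_FIELDS[1:]  # the derived stream drops SIDE

# The hex string explored above (source topic, non-nullable fields).
source_msg = bytes.fromhex(
    "00 00 00 00 01 08 53 45 4C 4C AE 12 0A 5A 56 5A 5A 54 D8 03 "
    "0C 4C 4D 4E 34 35 36 0C 55 73 65 72 5F 39")
# First row of the raw BUY_TRADES output shown earlier.
buy_msg = bytes.fromhex(
    "00 00 00 00 03 02 8E 43 02 0A 5A 57 5A 5A 54 02 FC 0A "
    "02 0C 58 59 5A 37 38 39 02 0C 55 73 65 72 5F 37")

print(decode_trade(source_msg, STOCKTRADES_FIELDS, nullable=False))
print(decode_trade(buy_msg, BUY_TRADES_FIELDS, nullable=True))
```

In a Spark job, the same idea would mean stripping the first five bytes and decoding the remainder with an Avro reader and the matching schema.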