# Exercise

You have been given the task of displaying measurements from a continuous glucose monitoring (CGM) device, `GlucoSpark`. Control engineers from your company have already set up a test device that sends data through a socket on port `65432` directly to your machine.
The signal sent by the device is a comma-separated line of the form `<eventTime>,<glucoseMeasurement>,<displayUnit>,<cgmId>`, terminated by a newline character.
Unfortunately, the device sometimes picks up background noise and reports implausible, negative glucose measurements.
The end user needs only the timestamp and the value of each measurement, which are required to detect anomalies in the blood test readings.
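Before involving Spark, the record layout can be checked with plain Python. The snippet below is a minimal sketch, not part of the exercise solution: `parse_reading` is a hypothetical helper, the sample unit `mg/dL` and device id `cgm-001` are made up for illustration, and the timestamp is assumed to arrive in an ISO-like `YYYY-MM-DD HH:MM:SS.ffffff` form.

```python
from datetime import datetime
from typing import Optional, Tuple


def parse_reading(line: str) -> Optional[Tuple[datetime, int]]:
    """Return (eventTime, glucoseMeasurement), or None for a noisy or malformed line."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != 4:
        return None  # not the documented four-field layout
    event_time_raw, glucose_raw, _display_unit, _cgm_id = fields
    try:
        # Assumption: the wire format is ISO-like and parseable by fromisoformat.
        event_time = datetime.fromisoformat(event_time_raw)
        glucose = int(glucose_raw)
    except ValueError:
        return None  # a field could not be cast to its target type
    if glucose <= 0:
        # Mirrors the `glucoseMeasurement > 0` filter used in the solution,
        # which drops negative noise values and also a zero reading.
        return None
    return event_time, glucose


# Hypothetical sample lines: unit and device id are illustrative only.
sample = "2024-04-26 14:04:51.895314,106,mg/dL,cgm-001\n"
noisy = "2024-04-26 14:04:52.000000,-17,mg/dL,cgm-001\n"

print(parse_reading(sample))  # (datetime.datetime(2024, 4, 26, 14, 4, 51, 895314), 106)
print(parse_reading(noisy))   # None
```

The same four operations (split, cast, filter, select) are what the streaming job has to express with DataFrame functions.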


1. Read streaming data using the `socket` format, with host `127.0.0.1` and port `65432`
2. Split the device input signal into separate columns
3. Cast `eventTime` to the *timestamp* type and `glucoseMeasurement` to *integer*
4. Filter out negative glucose measurements
5. Select only `eventTime` and `glucoseMeasurement`
6. Write the data to the console in `append` mode
7. Trigger processing of the measurements at a `1 minute` interval

To start the measurements, open a terminal and run the script in the `exercises` directory
> python sever_glucose.py

This starts a server that listens for incoming connections and, once a connection has been made, sends the device data through the socket.
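The contents of `sever_glucose.py` are not shown here, so the following is only a rough sketch of how such a server could behave, assuming it accepts a single client and writes newline-terminated records. The functions `format_reading`, `serve_readings` and `read_lines`, the unit `mg/dL`, the device id `cgm-001` and the candidate glucose values are all hypothetical. `read_lines` is a small client that can be used to confirm that data is flowing before connecting Spark.

```python
import random
import socket
import time
from datetime import datetime
from typing import List


def format_reading(event_time: datetime, glucose: int,
                   unit: str = "mg/dL", cgm_id: str = "cgm-001") -> str:
    """Build one newline-terminated record in the documented field order."""
    return f"{event_time.isoformat(sep=' ')},{glucose},{unit},{cgm_id}\n"


def serve_readings(server: socket.socket, count: int,
                   interval_s: float = 0.0) -> None:
    """Accept one client on a listening socket and send `count` records."""
    conn, _addr = server.accept()
    with conn:
        for _ in range(count):
            # Occasionally emit a negative value to mimic background noise.
            glucose = random.choice([-1, 100, 106, 112])
            record = format_reading(datetime.now(), glucose)
            conn.sendall(record.encode("utf-8"))
            time.sleep(interval_s)


def read_lines(host: str, port: int) -> List[str]:
    """Connect as a client and collect lines until the server closes the stream."""
    with socket.create_connection((host, port)) as client:
        with client.makefile("r", encoding="utf-8") as stream:
            return [line.rstrip("\n") for line in stream]


# Example wiring, commented out because accept() blocks until a client connects:
# server = socket.create_server(("127.0.0.1", 65432))
# serve_readings(server, count=10, interval_s=4.0)
```

Because the server closes the connection after the last record, a client sees the stream end, which is also what Spark reports with a `Stream closed by` warning.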

# TODO

In [6]:
# solution

from pyspark.sql import SparkSession
from pyspark.sql.functions import split

host, port = ("127.0.0.1", "65432")

spark = SparkSession \
    .builder \
    .appName("GlucoSpark") \
    .getOrCreate()

lines = spark \
    .readStream \
    .format('socket') \
    .option('host', host) \
    .option('port', port) \
    .load()

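# Split each incoming line once and reuse the result for every column.
fields = split(lines.value, ',')

# Cast each field to its target type, then name the resulting column.
CSVData = lines.select(
    fields.getItem(0).cast("timestamp").alias("eventTime"),
    fields.getItem(1).cast("int").alias("glucoseMeasurement"),
    fields.getItem(2).alias("displayUnit"),
    fields.getItem(3).alias("cgmId"),
)

# Keep only the two columns the end user needs and drop noisy readings
# (the filter removes negative values and also a zero reading).
selectAndFilter = CSVData.select("eventTime", "glucoseMeasurement") \
    .where("glucoseMeasurement > 0")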
    
query = selectAndFilter \
    .writeStream \
    .queryName("BloodTest") \
    .outputMode("append") \
    .format("console") \
    .option('truncate', 'false')\
    .trigger(processingTime="1 minute") \
    .start()

24/04/26 14:04:51 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/04/26 14:04:51 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/04/26 14:04:51 WARN TextSocketSourceProvider: The socket source should not be used for production applications! It does not support recovery.
24/04/26 14:04:51 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-5f263963-4681-465f-9808-48a2b772f2f2. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/04/26 14:04:51 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


-------------------------------------------
Batch: 0
-------------------------------------------
+---------+------------------+
|eventTime|glucoseMeasurement|
+---------+------------------+
+---------+------------------+

-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------------+------------------+
|eventTime                 |glucoseMeasurement|
+--------------------------+------------------+
|2024-04-26 14:04:51.895314|106               |
|2024-04-26 14:04:55.899513|100               |
|2024-04-26 14:04:59.903914|112               |
+--------------------------+------------------+

-------------------------------------------
Batch: 2
-------------------------------------------
+--------------------------+------------------+
|eventTime                 |glucoseMeasurement|
+--------------------------+------------------+
|2024-04-26 14:05:03.908259|113               |
|2024-04-26 14:05:51.959483|118               |
|2024-0

24/04/26 14:14:19 WARN TextSocketMicroBatchStream: Stream closed by 127.0.0.1:65432


-------------------------------------------
Batch: 11
-------------------------------------------
+--------------------------+------------------+
|eventTime                 |glucoseMeasurement|
+--------------------------+------------------+
|2024-04-26 14:14:00.476874|112               |
|2024-04-26 14:14:04.48119 |107               |
|2024-04-26 14:14:08.485701|117               |
|2024-04-26 14:14:12.49006 |102               |
|2024-04-26 14:14:16.494347|113               |
+--------------------------+------------------+



In [4]:
query.lastProgress

{'id': '24a5632c-914e-456e-bdf0-5c48106ce821',
 'runId': '7856fcd9-2d16-4a33-a1a3-9f579fd12721',
 'name': 'BloodTest',
 'timestamp': '2024-04-26T11:57:00.000Z',
 'batchId': 1,
 'numInputRows': 0,
 'inputRowsPerSecond': 0.0,
 'processedRowsPerSecond': 0.0,
 'durationMs': {'latestOffset': 0, 'triggerExecution': 1},
 'stateOperators': [],
 'sources': [{'description': 'TextSocketV2[host: 127.0.0.1, port: 65432]',
   'startOffset': -1,
   'endOffset': -1,
   'latestOffset': -1,
   'numInputRows': 0,
   'inputRowsPerSecond': 0.0,
   'processedRowsPerSecond': 0.0}],
 'sink': {'description': 'org.apache.spark.sql.execution.streaming.ConsoleTable$@dbbf241',
  'numOutputRows': 0}}

24/04/26 13:59:03 WARN TextSocketMicroBatchStream: Stream closed by 127.0.0.1:65432
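The progress report above is verbose, so a short summary can be easier to scan while the query runs. The helper below is a hedged sketch: `summarize_progress` is a hypothetical function, and it assumes `query.lastProgress` yields a plain dictionary shaped like the one printed above (or `None` before the first batch completes). The sample dictionary is a trimmed copy of that output.

```python
from typing import Any, Dict, Optional


def summarize_progress(progress: Optional[Dict[str, Any]]) -> str:
    """Condense a lastProgress-style dictionary into a one-line status string."""
    if not progress:
        return "no batch completed yet"
    source = progress["sources"][0]  # this job reads from a single socket source
    return (
        f"query={progress['name']} batch={progress['batchId']} "
        f"rows_in={progress['numInputRows']} "
        f"rows_out={progress['sink']['numOutputRows']} "
        f"source={source['description']}"
    )


# Trimmed copy of the report shown above.
sample_progress = {
    "name": "BloodTest",
    "batchId": 1,
    "numInputRows": 0,
    "sources": [{"description": "TextSocketV2[host: 127.0.0.1, port: 65432]"}],
    "sink": {"numOutputRows": 0},
}

print(summarize_progress(sample_progress))
```

In the notebook this would be called as `summarize_progress(query.lastProgress)`. A report with `numInputRows` of 0, as in the one above, indicates that the most recent trigger read nothing from the socket.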


In [5]:
# Cleanup Classroom
query.stop()
spark.stop()