# Exercise

You have been given the task of displaying measurements from a continuous glucose monitoring (CGM) device, `GlucoSpark`. The device is already set up to send data through a socket on port `65432` directly on your machine.
The signal sent by the device is a comma-separated line of the form `<eventTime>,<glucoseMeasurement>,<displayUnit>,<cgmId>`, terminated by a newline character.
Unfortunately, the device sometimes picks up background noise and reports impossible, negative glucose measurements.
The end user only needs the timestamp and the value of each measurement, which are necessary to detect anomalies in blood test readings.


1. Read the streaming data using the `socket` format, with host `127.0.0.1` and port `65432`
2. Split the device input signal into separate columns
3. Cast `eventTime` to *timestamp* type and `glucoseMeasurement` to *int*
4. Filter out negative glucose measurements
5. Select only `eventTime` and `glucoseMeasurement`
6. Write the data to the `memory` sink
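
To make steps 2 to 5 concrete, here is a minimal plain-Python sketch of what they do to a single signal line. It is not Spark code and not the exercise solution; it only illustrates the intended transformation. The names `Reading` and `parse_signal` are hypothetical, and the timestamp layout (`YYYY-MM-DD HH:MM:SS`) is an assumption, since the exercise does not state the format the device uses.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Reading(NamedTuple):
    """The two fields the end user needs (hypothetical helper type)."""
    event_time: datetime
    glucose_measurement: int


def parse_signal(line: str) -> Optional[Reading]:
    """Split one device line, cast the fields, and drop negative readings."""
    # Step 2: split the comma-separated line into its four fields.
    event_time, glucose, _display_unit, _cgm_id = line.rstrip("\n").split(",")
    # Step 3: cast the measurement to int and the event time to a timestamp
    # (assumes an ISO-like "YYYY-MM-DD HH:MM:SS" layout).
    value = int(glucose)
    timestamp = datetime.fromisoformat(event_time)
    # Step 4: negative measurements are noise, so they are filtered out.
    if value < 0:
        return None
    # Step 5: keep only the timestamp and the measurement value.
    return Reading(timestamp, value)
```

For example, a line such as `2024-01-01 12:00:00,105,mg/dL,cgm-1` (made-up sample values) yields a `Reading`, while the same line with `-3` as the measurement yields `None`.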




In this demo, we're going to define a Spark Structured Streaming job, and then we'll run the job and see how easily we can filter out the data that we don't want. What I've done for this demo is I've connected to my Linux server by using SSH, which is an extremely common way to access a command line or terminal when you're dealing with Linux. I'm also using another application called tmux, or terminal multiplexer, which allows me to have two console windows side by side.

Here in this demo, we're going to look at a very simple client-server relationship and a very basic query. So first, let's take a look at the server. I've written a really simple server in some basic Python code, and I'm using a text editor known as Vim to edit the text here, although you may use something like Nano, which is really straightforward.

Now here what I'm doing is I'm importing some libraries for Python. Then I'm specifying the host IP for the server and the port. This is the identifying information that Spark's going to need to be able to access the server. Then I simply open up a socket and listen on that IP and port combination, waiting for something to connect. Now, this server is very rudimentary. It can only handle one connection at a time. And once it's made the connection, we can see that what it's doing is it's sending out some text data. It's doing that twice, and in both cases, it's sending out the timestamp of the event (when the event was created), a blood glucose reading, which is a number, and then a column for the device ID. This is some really basic information, but we're going to use Spark Structured Streaming to filter out some of this and pick only the columns we want. Later on, we're going to do more complicated analyses.

So we have the server. Let's get it running. Okay, so the server's started. Now let's take a look at our code, or our query.
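
The server code itself is not reproduced in this transcript, so the following is only a rough sketch of what such a rudimentary server might look like, using Python's standard `socket` module. The function names `format_reading` and `serve`, the device ID `cgm-1`, the sample readings `105` and `-3`, the number of rounds, and the one-second delay are all assumptions made for illustration.

```python
import socket
import time
from datetime import datetime

HOST = "127.0.0.1"  # address the Spark job will connect to
PORT = 65432        # port the Spark job will connect to


def format_reading(event_time, blood_glucose, device_id):
    """Build one newline-terminated line: eventTime,bloodGlucose,deviceID."""
    return f"{event_time:%Y-%m-%d %H:%M:%S},{blood_glucose},{device_id}\n"


def serve(host=HOST, port=PORT, rounds=6, delay=1.0):
    """Accept a single client and send it a few rounds of readings."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        # Allow quick restarts of the demo on the same port.
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        # Blocks until one client (here, the Spark job) connects.
        conn, addr = server.accept()
        with conn:
            print(f"Connected by {addr}")
            for _ in range(rounds):
                # Send two lines per round: one plausible reading and one
                # negative reading that the query is expected to drop.
                for glucose in (105, -3):
                    line = format_reading(datetime.now(), glucose, "cgm-1")
                    conn.sendall(line.encode("utf-8"))
                time.sleep(delay)
```

Calling `serve()` from a script starts the server and blocks until a client connects. Because `accept()` is called only once, the sketch handles a single connection, which matches the limitation described above.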
So here again, I'm using Python code. I think Python is really readable, and it's common and usually already installed on many different types of Linux systems. So at least in my experience, it's really easy to get up and running with Python. Here, again, we're importing some libraries. In this case, we're focusing mostly on the `pyspark.sql` library so that we can get some functions that we need. Then, we've got our host and port again. We're creating a Spark session, and we're saying, okay, we're going to get our data by using a socket connection. We're going to connect directly to the server and just read whatever it sends us. Now, normally, you're not going to do this. You're going to be reading from a system such as Kafka, which is a messaging or eventing system, or you may be reading from comma-separated value files. But in this case, for demo purposes, we're just going to read straight from the socket.

And really, the important part is below. First, we're manually splitting up the text data that we get based on the commas. Every comma delimits a new column, and so we're manually telling it, hey, item 0 in the array is `eventTime`, and then `bloodGlucose`, and then `deviceID`. Normally, you're going to be, say, working with comma-separated value files, and you'll be able to specify a schema ahead of time in a more elegant way, but here we're doing it manually.

Next, and this is the actual query, this is the key part that we care about, we're calling two functions. We're calling `select`, where we say, you know what? I know that we defined these three columns, but for this query, I only want these two. I want `eventTime` and I want `bloodGlucose`. And then, we're using the `where` clause, which works just like it does in regular SQL, to filter out rows where `bloodGlucose` is negative, because it's physically impossible to have a negative blood glucose. That doesn't mean anything. And so if we're receiving that data, it means it's an error.
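
The query code is not shown in this transcript, so here is a hedged sketch of how the steps just described could be written in PySpark. It is not the demo's actual code: the transformation uses `selectExpr` with SQL expression strings rather than whichever `pyspark.sql` functions the demo imports, and the names `clean_readings`, `run_query`, and the application name `GlucoseFilter` are hypothetical.

```python
HOST = "127.0.0.1"
PORT = 65432


def clean_readings(lines):
    """Turn raw socket lines into (eventTime, bloodGlucose) rows.

    `lines` is expected to be a DataFrame with one string column named
    `value`, which is what Spark's socket source produces.
    """
    # Split each line on commas and name the pieces by position.
    parsed = lines.selectExpr(
        "CAST(split(value, ',')[0] AS timestamp) AS eventTime",
        "CAST(split(value, ',')[1] AS int) AS bloodGlucose",
        "split(value, ',')[2] AS deviceID",
    )
    # Keep two of the three columns, then drop negative readings.
    return parsed.select("eventTime", "bloodGlucose").where("bloodGlucose >= 0")


def run_query(host=HOST, port=PORT):
    """Read from the socket, apply the query, and print each batch."""
    # Imported here so clean_readings can be reused without starting Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("GlucoseFilter").getOrCreate()
    lines = (
        spark.readStream.format("socket")
        .option("host", host)
        .option("port", port)
        .load()
    )
    query = (
        clean_readings(lines)
        .writeStream.outputMode("append")
        .format("console")
        .start()
    )
    # Block until the stream is stopped or the server closes the connection.
    query.awaitTermination()
```

With the server already running, calling `run_query()` would start the stream. To write to an in-memory table instead of the console, one option is to swap `.format("console")` for `.format("memory")` together with a `.queryName(...)` of your choosing.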
Finally, we're taking the query we defined, and we're saying, you know what? Go ahead and write it in append mode, which means that as soon as we get data, we spit it back out, and then write it to the console. Again, normally, you're going to save it to a database or to CSV files or something like that.

So let's go and run it and see what happens. It's going to warn us because we're talking to our local IP. We're talking to the same machine, which you should never do in production, but it's going to let us do the work. And so we can see over on the right that it's connected, and we can see on the left that it's receiving data. Again, this is a very simple example. But one of the things you'll see is that the `eventTime` is going to increase. There's not enough space for it to show the seconds, but eventually it's going to increment up to 12. The other thing that I want you to notice is that the -3 `bloodGlucose` never shows up, and that's because we're filtering it out. So we've selected our two columns, we're filtering out -3, and we're receiving batches of data.

So this is a very simple query. As we go through the rest of the course, we're going to look at some more complicated and more interesting types of queries that we can perform with Spark Structured Streaming.
Summary
