<a href="https://colab.research.google.com/github/jbpost2/ST-554-Big-Data-with-Python/blob/main/02_Common_Streaming_Tasks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Common Streaming Tasks

> Justin Post

Let's get familiar with some of the common tasks we need to do with streaming data!

Note: These types of webpages are built from Jupyter notebooks (`.ipynb` files). You can access your own versions of them by [clicking here](https://colab.research.google.com/github/jbpost2/ST-554-Big-Data-with-Python/blob/main/02_Common_Streaming_Tasks.ipynb). **It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you'd like!**

Preprocessing/Sending alerts
+ Missing data from a sensor
+ Tracking a fleet of vehicles on speed, geo-fences, etc.

Must be able to filter and process the data
+ Data often in the form of a log
+ We'll deal with data in JSON form
+ For now, ignore streaming aspect and parse data in a data frame using a time variable
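
As a rough sketch of the parsing idea above, the snippet below reads a few JSON log lines into a data frame and filters on a time variable. The log lines, the field names (`sensor_id`, `timestamp`, `value`), and the cutoff time are all made up for illustration.

```python
import json
import pandas as pd

# Hypothetical log: one JSON record per line (field names are invented)
raw_log = """\
{"sensor_id": "a1", "timestamp": "2022-01-01 10:00:00", "value": 2.6}
{"sensor_id": "a1", "timestamp": "2022-01-01 11:00:00", "value": 9.1}
{"sensor_id": "b2", "timestamp": "2022-01-01 11:30:00", "value": -200}
"""

# Parse each non-empty line into a dict, then build a data frame
records = [json.loads(line) for line in raw_log.splitlines() if line.strip()]
log_df = pd.DataFrame(records)

# Convert the time variable to a proper datetime so we can filter on it
log_df["timestamp"] = pd.to_datetime(log_df["timestamp"])

# Keep only the records at or after 11:00
recent = log_df[log_df["timestamp"] >= pd.Timestamp("2022-01-01 11:00:00")]
print(recent)
```

Once the time variable is a real datetime, the same comparison works for any window we care about.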

# Common Issue: Detecting Trends, Counting, and Averages

Often want to understand basic information about a stream of data

- How many events have occurred so far?
    + Basic counting
- How many instances did this record contain?
    + Perhaps word count in a text string of input
- How many records in a row have we increased/decreased?
    + Basic trends over time
- What is the average/standard deviation of the data?
- Average over certain time windows?
    + Moving averages
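
To make these questions concrete, here is a small sketch that updates a count, a running mean, and an increase/decrease streak as values 'arrive' one at a time, then computes a moving average with `pandas`. The readings and the window size of 3 are invented for illustration.

```python
import pandas as pd

# Made-up readings arriving one at a time
readings = [2.6, 2.0, 2.2, 2.2, 1.6, 1.2, 1.5]

count = 0        # how many events so far
total = 0.0      # running sum, used for the running mean
streak = 0       # consecutive increases (+) or decreases (-)
previous = None

for value in readings:
    count += 1
    total += value
    if previous is not None:
        if value > previous:
            # extend an increasing streak or start a new one
            streak = streak + 1 if streak > 0 else 1
        elif value < previous:
            # extend a decreasing streak or start a new one
            streak = streak - 1 if streak < 0 else -1
        else:
            streak = 0
    previous = value

running_mean = total / count
print(count, running_mean, streak)

# Moving average over a window of 3 observations
moving_avg = pd.Series(readings).rolling(window=3).mean()
print(moving_avg)
```

The first two moving averages are missing because a full window of 3 values isn't available yet.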

# Example: Air Quality Data

From: <https://archive.ics.uci.edu/ml/datasets/Air+quality>

>  ...dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device

In [2]:
import pandas as pd
air_data = pd.read_csv("https://www4.stat.ncsu.edu/online/datasets/AirQualityUCI.csv", sep = ";", decimal = ",")
air_data

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,10/03/2004,18.00.00,2.6,1360.0,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578,,
1,10/03/2004,19.00.00,2.0,1292.0,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255,,
2,10/03/2004,20.00.00,2.2,1402.0,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502,,
3,10/03/2004,21.00.00,2.2,1376.0,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867,,
4,10/03/2004,22.00.00,1.6,1272.0,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9466,,,,,,,,,,,,,,,,,
9467,,,,,,,,,,,,,,,,,
9468,,,,,,,,,,,,,,,,,
9469,,,,,,,,,,,,,,,,,


In [3]:
air_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9471 entries, 0 to 9470
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           9357 non-null   object 
 1   Time           9357 non-null   object 
 2   CO(GT)         9357 non-null   float64
 3   PT08.S1(CO)    9357 non-null   float64
 4   NMHC(GT)       9357 non-null   float64
 5   C6H6(GT)       9357 non-null   float64
 6   PT08.S2(NMHC)  9357 non-null   float64
 7   NOx(GT)        9357 non-null   float64
 8   PT08.S3(NOx)   9357 non-null   float64
 9   NO2(GT)        9357 non-null   float64
 10  PT08.S4(NO2)   9357 non-null   float64
 11  PT08.S5(O3)    9357 non-null   float64
 12  T              9357 non-null   float64
 13  RH             9357 non-null   float64
 14  AH             9357 non-null   float64
 15  Unnamed: 15    0 non-null      float64
 16  Unnamed: 16    0 non-null      float64
dtypes: float64(15), object(2)
memory usage: 1.2+ MB


In [4]:
  type(air_data.Time[0])

str

# Dates and Times in Python

Most standard date/time operations can be handled via the `datetime` module.  Includes data types:

- `date`: attributes of `year`, `month`, `day`
- `time`: attributes of `hour`, `minute`, `second`, `microsecond`, and `tzinfo`
- `datetime`: attributes of both `date` and `time`
- `timedelta`: difference between two `date`, `time` or `datetime` instances

With this functionality we can add and subtract dates/times to get meaningful info while keeping the data in a more readable format (rather than, say, looking at the data as days since Jan 1, 1960).
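
A quick sketch of these types in action. The particular dates are arbitrary and just chosen for illustration.

```python
from datetime import date, datetime, timedelta

start = datetime(2004, 3, 10, 18, 0, 0)   # 10 March 2004, 18:00
end = datetime(2004, 3, 11, 4, 0, 0)      # 11 March 2004, 04:00

# Subtracting two datetime objects gives a timedelta
elapsed = end - start
print(elapsed)
print(elapsed.total_seconds() / 3600)     # elapsed time in hours

# Adding a timedelta to a datetime gives a new datetime
one_week_later = start + timedelta(weeks=1)
print(one_week_later.date())

# Individual pieces are available as attributes
print(start.year, start.month, start.day, start.hour)
```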


# Dates and Times in `pandas`

- Uses `NumPy`'s `datetime64` and `timedelta64` dtypes
- Very similar functionality for doing useful things with dates! [Link](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)

In [5]:
a = pd.to_datetime(["04-01-2022 10:00"], dayfirst=True)
a

DatetimeIndex(['2022-01-04 10:00:00'], dtype='datetime64[ns]', freq=None)

In [7]:
b = pd.to_datetime(["04-01-2022 11:00"])
a-b

TimedeltaIndex(['-88 days +23:00:00'], dtype='timedelta64[ns]', freq=None)

In [9]:
b.day

Index([1], dtype='int32')

In [10]:
b.hour

Index([11], dtype='int32')

In [11]:
air_data = pd.read_csv("https://www4.stat.ncsu.edu/online/datasets/AirQualityUCI.csv", sep = ";", decimal = ",", parse_dates = [["Date", "Time"]])
air_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9471 entries, 0 to 9470
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date_Time      9471 non-null   object 
 1   CO(GT)         9357 non-null   float64
 2   PT08.S1(CO)    9357 non-null   float64
 3   NMHC(GT)       9357 non-null   float64
 4   C6H6(GT)       9357 non-null   float64
 5   PT08.S2(NMHC)  9357 non-null   float64
 6   NOx(GT)        9357 non-null   float64
 7   PT08.S3(NOx)   9357 non-null   float64
 8   NO2(GT)        9357 non-null   float64
 9   PT08.S4(NO2)   9357 non-null   float64
 10  PT08.S5(O3)    9357 non-null   float64
 11  T              9357 non-null   float64
 12  RH             9357 non-null   float64
 13  AH             9357 non-null   float64
 14  Unnamed: 15    0 non-null      float64
 15  Unnamed: 16    0 non-null      float64
dtypes: float64(15), object(1)
memory usage: 1.2+ MB


In [12]:
air_data = air_data.rename(columns = {'CO(GT)': 'co_gt'})
air_data

Unnamed: 0,Date_Time,co_gt,PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,10/03/2004 18.00.00,2.6,1360.0,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578,,
1,10/03/2004 19.00.00,2.0,1292.0,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255,,
2,10/03/2004 20.00.00,2.2,1402.0,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502,,
3,10/03/2004 21.00.00,2.2,1376.0,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867,,
4,10/03/2004 22.00.00,1.6,1272.0,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9466,nan nan,,,,,,,,,,,,,,,
9467,nan nan,,,,,,,,,,,,,,,
9468,nan nan,,,,,,,,,,,,,,,
9469,nan nan,,,,,,,,,,,,,,,
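
Note that the `info()` output above lists `Date_Time` as `object`, so the combined column is still stored as strings. One possible fix, sketched here on a few values copied from the table, is to convert it with `pd.to_datetime()` and an explicit format. The format string is an assumption based on how the values are printed (day/month/year, with periods between the time parts), and `errors="coerce"` turns entries that can't be parsed (like `nan nan`) into `NaT`.

```python
import pandas as pd

# A few rows in the same layout as the Date_Time column
sample = pd.DataFrame({
    "Date_Time": ["10/03/2004 18.00.00", "10/03/2004 19.00.00", "nan nan"],
    "co_gt": [2.6, 2.0, float("nan")],
})

# Assumed format: day/month/year hour.minute.second
sample["Date_Time"] = pd.to_datetime(
    sample["Date_Time"], format="%d/%m/%Y %H.%M.%S", errors="coerce"
)
print(sample.dtypes)
print(sample)
```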


# Common Task: Preprocessing/Sending alerts

Checking conditions on the data:

- Check if the data is missing
- Check if data is in an appropriate range
- etc.
    + If not, print an alert, write the event to a file, and/or send an email

Let's focus on the `co_gt` variable (true hourly averaged CO concentration (mg/m^3))

- 'Take data in over time' (via a loop over the rows)
- If the data exceeds 8 we print a message

In [13]:
for i in range(air_data.shape[0]):
    if air_data.iloc[i].co_gt > 8:
        print("High CO Concentration at " + str(air_data.Date_Time[i]))

High CO Concentration at 15/03/2004 09.00.00
High CO Concentration at 22/10/2004 18.00.00
High CO Concentration at 25/10/2004 18.00.00
High CO Concentration at 26/10/2004 17.00.00
High CO Concentration at 26/10/2004 18.00.00
High CO Concentration at 02/11/2004 20.00.00
High CO Concentration at 04/11/2004 18.00.00
High CO Concentration at 05/11/2004 17.00.00
High CO Concentration at 17/11/2004 18.00.00
High CO Concentration at 19/11/2004 19.00.00
High CO Concentration at 19/11/2004 20.00.00
High CO Concentration at 23/11/2004 18.00.00
High CO Concentration at 23/11/2004 19.00.00
High CO Concentration at 23/11/2004 20.00.00
High CO Concentration at 23/11/2004 21.00.00
High CO Concentration at 24/11/2004 20.00.00
High CO Concentration at 26/11/2004 18.00.00
High CO Concentration at 26/11/2004 21.00.00
High CO Concentration at 02/12/2004 19.00.00
High CO Concentration at 13/12/2004 18.00.00
High CO Concentration at 14/12/2004 18.00.00
High CO Concentration at 16/12/2004 19.00.00
...

- 'Take data in over time' (via a loop over the rows)
- If the data exceeds 8 we print a message
- If the data is less than 0 we print a message (-200 represents missing here)
    + Write either occurrence to a log file (or perhaps a database)

In [14]:
import os
os.makedirs('logs', exist_ok = True)  # open(..., 'a') creates files but not folders

for i in range(air_data.shape[0]):
    temp = air_data.iloc[i]
    dt = temp.Date_Time
    value = temp.co_gt
    if value > 8:
        # High reading: print an alert and append it to the 'high' log
        print("High CO Concentration at " + str(dt))
        with open('logs/COHigh.txt', 'a') as f:
            f.write(str(dt) + "," + str(value) + "\n")
    elif value < 0:
        # Negative reading (e.g. -200 for missing): log it as invalid
        print("Invalid CO Concentration at " + str(dt))
        with open('logs/COInvalid.txt', 'a') as f:
            f.write(str(dt) + "," + str(value) + "\n")

Invalid CO Concentration at 11/03/2004 04.00.00
...
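
As one possible way to organize these checks, the sketch below pulls the conditions into a small function and makes sure the log folder exists before writing. The function name `classify_co`, the file names, and the three readings are hypothetical; a temporary folder is used only so the example runs anywhere.

```python
import os
import tempfile

def classify_co(value, high=8):
    """Return 'high', 'invalid', or 'ok' for a CO reading."""
    if value > high:
        return "high"
    if value < 0:          # negative values such as -200 flag missing data
        return "invalid"
    return "ok"

# Made-up (time, value) pairs standing in for rows arriving over time
stream = [("2022-01-01 10:00", 2.6),
          ("2022-01-01 11:00", -200.0),
          ("2022-01-01 12:00", 8.1)]

# Write to a temporary folder here; a real job would pick a fixed path
log_dir = os.path.join(tempfile.mkdtemp(), "logs")
os.makedirs(log_dir, exist_ok=True)   # create the folder before opening files

for dt, value in stream:
    status = classify_co(value)
    if status != "ok":
        print(status + " CO concentration at " + dt)
        path = os.path.join(log_dir, "CO_" + status + ".txt")
        with open(path, "a") as f:
            f.write(dt + "," + str(value) + "\n")
```

Keeping the check in its own function makes it easy to reuse the same rule when the data really does arrive as a stream.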

# Check on Status of a Producer

- Check to see if the `producer` seems to be down. If so, send an email! [Article](https://realpython.com/python-send-email/) (let's go through an example notebook)

In [15]:
import os
os.makedirs('logs', exist_ok = True)  # make sure the log folder exists

missing = 0  # number of missing (-200) readings in a row
for i in range(air_data.shape[0]):
    temp = air_data.iloc[i]
    dt = temp.Date_Time
    value = temp.co_gt
    if value > 8:
        print("High CO Concentration at " + str(dt))
        with open('logs/COHigh.txt', 'a') as f:
            f.write(str(dt) + "," + str(value) + "\n")
        missing = 0
    elif value < 0:
        print("Invalid CO Concentration at " + str(dt))
        with open('logs/COInvalid.txt', 'a') as f:
            f.write(str(dt) + "," + str(value) + "\n")
        if value == -200:
            missing += 1
        if missing == 6:
            # Six missing readings in a row: the producer may be down
            # (code to send an email would go here)
            pass
    else:
        missing = 0
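
For the email step itself, here is a minimal sketch using the standard library's `smtplib` and `email.message` modules. The function names, addresses, timestamp, and SMTP host/port are placeholders, not a real configuration, and the message is only built and printed here rather than sent.

```python
import smtplib
from email.message import EmailMessage

def build_alert_email(n_missing, last_time,
                      sender="monitor@example.com",
                      recipient="oncall@example.com"):
    """Create (but do not send) an alert message about a silent producer."""
    msg = EmailMessage()
    msg["Subject"] = "Producer may be down"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"{n_missing} consecutive missing readings, most recently at {last_time}."
    )
    return msg

def send_alert(msg, host="localhost", port=25):
    """Send the message through an SMTP server (host and port are placeholders)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)

alert = build_alert_email(6, "2022-01-01 06:00")
print(alert)
# send_alert(alert)  # uncomment once a real SMTP server is configured
```

Separating 'build the message' from 'send the message' lets us check the alert text without needing a mail server.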

# Common Issue: Combining Data Streams

- Often have multiple data streams that need to be combined

    + Usually combined via a shared key or time stamps

- Once combined we can then preprocess/summarize/etc.

![](https://www4.stat.ncsu.edu/online/datasets/impressions_clicks.png)

[Example](https://docs.databricks.com/_static/notebooks/stream-stream-joins-python.html) of streams using Google Ads-type data

- Impression - ad seen by a user
- Clicks - ad was clicked on by user

In [16]:
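import pandas as pd
import numpy as np
np.random.seed(10)
# 500 ad impressions at random times during 2022-01-01
# (note: the impression time is stored in a column named clickTime)
impressions = pd.DataFrame({
  'adId': range(500),
  'clickTime': (pd.to_datetime('2022-01-01') + pd.to_timedelta(np.random.rand(500), unit = "D")).sort_values()
})
# Sample 30 impressions (with replacement) to act as the clicked ads
clicks = impressions.iloc[np.random.randint(size = 30, low = 0, high = 499)].sort_index()
# Each click happens a short random time (up to 0.01 days) after its impression
clicks.clickTime = clicks.clickTime + pd.to_timedelta(np.random.rand(30)/100, unit = "D")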

In [17]:
impressions

Unnamed: 0,adId,clickTime
0,0,2022-01-01 00:02:32.033682620
1,1,2022-01-01 00:05:41.130210730
2,2,2022-01-01 00:12:28.161050454
3,3,2022-01-01 00:13:16.703184279
4,4,2022-01-01 00:14:04.126023142
...,...,...
495,495,2022-01-01 23:33:50.531738794
496,496,2022-01-01 23:34:05.138372043
497,497,2022-01-01 23:42:10.841031524
498,498,2022-01-01 23:44:02.426570551


In [18]:
clicks

Unnamed: 0,adId,clickTime
35,35,2022-01-01 01:30:34.800044013
38,38,2022-01-01 01:37:35.836843295
58,58,2022-01-01 02:42:47.536707642
90,90,2022-01-01 04:10:54.616200590
94,94,2022-01-01 04:13:45.397812575
117,117,2022-01-01 05:29:50.861708782
123,123,2022-01-01 06:03:17.862539408
127,127,2022-01-01 06:16:44.203962287
140,140,2022-01-01 06:45:03.548189309
170,170,2022-01-01 08:07:54.971475997


In [19]:
pd.merge(left = impressions, right = clicks, on = "adId", how = 'right')

Unnamed: 0,adId,clickTime_x,clickTime_y
0,35,2022-01-01 01:24:48.442588485,2022-01-01 01:30:34.800044013
1,38,2022-01-01 01:33:27.963279930,2022-01-01 01:37:35.836843295
2,58,2022-01-01 02:30:39.145926778,2022-01-01 02:42:47.536707642
3,90,2022-01-01 04:02:01.489706816,2022-01-01 04:10:54.616200590
4,94,2022-01-01 04:11:43.920512629,2022-01-01 04:13:45.397812575
5,117,2022-01-01 05:26:33.216858528,2022-01-01 05:29:50.861708782
6,123,2022-01-01 05:51:10.808292309,2022-01-01 06:03:17.862539408
7,127,2022-01-01 06:04:34.678241175,2022-01-01 06:16:44.203962287
8,140,2022-01-01 06:41:53.342051029,2022-01-01 06:45:03.548189309
9,170,2022-01-01 07:56:14.148555253,2022-01-01 08:07:54.971475997


Idea is easy :)  Will be harder with actual data streams!
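
A small variation on the join above, sketched with tiny made-up data frames: giving the two time columns distinct names (`impressionTime` and `clickTime`, chosen here for illustration) avoids the `_x`/`_y` suffixes, and once the streams are combined we can compute the delay between impression and click and keep only clicks within some time window. The one hour window is an arbitrary choice.

```python
import pandas as pd

# Tiny made-up streams that share the adId key
impressions = pd.DataFrame({
    "adId": [1, 2, 3],
    "impressionTime": pd.to_datetime(
        ["2022-01-01 00:05", "2022-01-01 00:10", "2022-01-01 00:20"]),
})
clicks = pd.DataFrame({
    "adId": [1, 3],
    "clickTime": pd.to_datetime(["2022-01-01 00:07", "2022-01-01 02:00"]),
})

# Inner join on the shared key keeps only ads that were clicked
joined = pd.merge(impressions, clicks, on="adId", how="inner")

# Time between seeing the ad and clicking on it
joined["delay"] = joined["clickTime"] - joined["impressionTime"]

# Keep clicks that happened within one hour of the impression
within_hour = joined[joined["delay"] <= pd.Timedelta(hours=1)]
print(within_hour)
```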

# Recap

Often need to check/validate the data

- Basic checks to create an issue log file or entries in an issues database

- Might send email as a notification

- Could of course do basic ETL type operations as the data comes in too!
    + Filter observations
    + Quick transformations
    + Combine data streams
    + etc.