# Data Downtime Challenge | Exercise 1

## 0. Setup

In [1]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import plotly.express as px
import re
from datetime import datetime, date, timedelta

In [3]:
import sys
sys.path.append("..")

In [5]:
from data.utils import get_days_index, show_reports

all_days = get_days_index(200)

In [6]:
import sqlite3

conn = sqlite3.connect("../data/dbs/Ex1.db")
c = conn.cursor()

## 1. Introduction

Welcome to the Data Downtime Challenge! In this exercise, we'll learn Data Observability through practice on some sample datasets. Each subproblem will ask you to craft some `SQL` queries that help us learn about the state of our tables and identify Data Downtime issues.

For these exercises, we'll be using mock astronomical data to identify habitable planets.

![SegmentLocal](/tree/data/assets/planets.gif "segment")

The `Ex1.db` database contains a single table, `EXOPLANETS`, which has information on nearly 2000 exoplanets across the Milky Way Galaxy.

In [7]:
c.execute("PRAGMA table_info(EXOPLANETS);")
c.fetchall()

[(0, '_id', 'TEXT', 0, None, 0),
 (1, 'distance', 'REAL', 0, None, 0),
 (2, 'g', 'REAL', 0, None, 0),
 (3, 'orbital_period', 'REAL', 0, None, 0),
 (4, 'avg_temp', 'REAL', 0, None, 0),
 (5, 'date_added', 'TEXT', 0, None, 0)]

A database entry in `EXOPLANETS` contains the following info:

0. `_id`: A UUID corresponding to the planet.
1. `distance`: Distance from Earth, in light-years.
2. `g`: Surface gravity as a multiple of $g$, the standard gravitational acceleration at Earth's surface.
3. `orbital_period`: Length of a single orbital cycle in days.
4. `avg_temp`: Average surface temperature in kelvins.
5. `date_added`: The date our system discovered the planet and added it automatically to our databases.

Note that one or more of `distance`, `g`, `orbital_period`, and `avg_temp` may be `NULL` for a given planet as a result of missing or erroneous data.
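To get a feel for how often those fields are missing, one option is to count `NULL`s per column. The sketch below is only an illustration: it builds a small in-memory table with the same columns as `EXOPLANETS` and made-up rows, so the names `conn_demo` and `null_counts` and the sample values are assumptions rather than part of the exercise database.

```python
import sqlite3

# Hypothetical in-memory table that mirrors the EXOPLANETS columns.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE EXOPLANETS (_id TEXT, distance REAL, g REAL, "
    "orbital_period REAL, avg_temp REAL, date_added TEXT)"
)
sample_rows = [
    ("a", 34.6, None, 476.5, None, "2020-01-01"),
    ("b", 110.2, 2.5, 839.8, None, "2020-01-01"),
    ("c", 7.5, 5.5, 763.6, 245.1, "2020-01-02"),
]
conn_demo.executemany(
    "INSERT INTO EXOPLANETS VALUES (?, ?, ?, ?, ?, ?)", sample_rows
)

# COUNT(column) skips NULLs, so COUNT(*) - COUNT(column) is the NULL count.
null_counts = conn_demo.execute(
    """
    SELECT
        COUNT(*) - COUNT(distance)       AS distance_nulls,
        COUNT(*) - COUNT(g)              AS g_nulls,
        COUNT(*) - COUNT(orbital_period) AS orbital_period_nulls,
        COUNT(*) - COUNT(avg_temp)       AS avg_temp_nulls
    FROM EXOPLANETS
    """
).fetchone()
print(null_counts)
```

The same `SELECT` should run unchanged against the real `conn` from the setup cells, since it only relies on the column names listed above.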

In [8]:
pd.read_sql_query("SELECT * FROM EXOPLANETS LIMIT 10", conn)

Unnamed: 0,_id,distance,g,orbital_period,avg_temp,date_added
0,c168b188-ef0c-4d6a-8cb2-f473d4154bdb,34.627304,,476.480044,,2020-01-01
1,e7b56e84-41f4-4e62-b078-01b076cea369,110.19692,2.525074,839.837817,,2020-01-01
2,a27030a0-e4b4-4bd7-8d24-5435ed86b395,26.695795,10.276497,301.018816,,2020-01-01
3,54f9cf85-eae9-4f29-b665-855357a14375,54.888352,,173.788968,328.644125,2020-01-01
4,4d06ec88-f5c8-4d03-91ef-7493a12cd89e,153.264217,0.922875,200.712662,,2020-01-01
5,e16250b8-2d9d-49f3-aaef-58eed9a8864c,7.454811,5.503701,763.56171,245.129285,2020-01-01
6,a0a6bf97-90d5-4686-8ccb-10753f8d335e,4.925946,0.953473,486.053323,267.786557,2020-01-01
7,b28b4e19-8517-4ab5-97f0-c445f1aae6c4,94.540173,7.118254,629.287426,368.859206,2020-01-01
8,7e34e44e-663f-491c-96c5-bb5acb8d5f1e,19.786255,3.999081,744.106326,180.445029,2020-01-01
9,305e8ea0-663b-4311-b6b3-4198c051c335,95.65403,0.677212,472.344447,,2020-01-01


## 2. Exercise: Visualizing Freshness

Grouping by the `DATE_ADDED` column can give us insight into how `EXOPLANETS` updates daily. For example, we can query for the number of new IDs added per day:

In [9]:
SQL = """
SELECT
    DATE_ADDED,
    COUNT(*) AS ROWS_ADDED
FROM
    EXOPLANETS
GROUP BY
    DATE_ADDED
"""

In [None]:
rows_added = pd.read_sql_query(SQL, conn)
rows_added = rows_added.rename(columns={clmn: clmn.lower() for clmn in rows_added.columns})
rows_added = rows_added.set_index("date_added")
rows_added = rows_added.reindex(all_days)

It looks like `EXOPLANETS` typically updates with around 100 new entries each day. Something looks off in a few places, though. We have what we'd call a **freshness** incident: on a couple of occasions, the table doesn't update at all for three or more days, leaving it with "stale" (3+ day old) data.
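The gaps only become visible because the grouped counts are laid out against a full calendar of days. Here is a minimal standard-library sketch of that idea, assuming (the source doesn't show it) that `get_days_index` returns a continuous run of daily dates. The counts and the names `daily_counts`, `calendar`, and `missing_days` are made up for illustration.

```python
from datetime import date, timedelta

# Hypothetical per-day counts with no rows for Jan 3 and Jan 4.
daily_counts = {"2020-01-01": 84, "2020-01-02": 92, "2020-01-05": 100}

# Build a continuous calendar covering the whole range.
start, end = date(2020, 1, 1), date(2020, 1, 5)
calendar = [
    (start + timedelta(days=i)).isoformat()
    for i in range((end - start).days + 1)
]

# Days without rows map to None, much like the NaN that reindexing produces.
filled = {day: daily_counts.get(day) for day in calendar}
missing_days = [day for day, n in filled.items() if n is None]
print(missing_days)
```

A `GROUP BY` on its own never emits a row for a day with zero inserts, which is why the missing days have to come from the calendar side.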

In [14]:
fig = px.bar(x=all_days, y=rows_added["rows_added"])
fig.update_xaxes(title="Date")
fig.update_yaxes(title="Rows Added")
fig.show()

In this exercise, we'll try writing some `SQL` code that returns timestamps for when freshness incidents occur. Feel free to use the query above as a starting point.

- *Hint 1*: The `LAG` window function should help in comparing two subsequent rows in a query.
- *Hint 2*: `SQLite`'s `JULIANDAY()` converts a date string to a Julian day number, so subtracting two results gives a difference in days.
- *Hint 3*: An example solution is given in `solutions/exercise_1.ipynb`, if needed for comparison.

To start, let's just copy the `SQL` statement from above, which gives the count of entries added per day.

In [16]:
# YOUR CODE HERE
SQL = """
SELECT
    DATE_ADDED,
    COUNT(*) AS ROWS_ADDED
FROM
    EXOPLANETS
GROUP BY
    DATE_ADDED;
"""
# END YOUR CODE

In [17]:
pd.read_sql_query(SQL, conn).head(5)

Unnamed: 0,date_added,ROWS_ADDED
0,2020-01-01,84
1,2020-01-02,92
2,2020-01-03,101
3,2020-01-04,102
4,2020-01-05,100


Verify that your output looks like this:
![SegmentLocal](/tree/data/assets/ex1img1.png "segment")

Great! Since we've grouped by the `DATE_ADDED` field, we now have one row for each date on which data came in. As a next step, let's devise a way to compare adjacent dates in our grouped output. For example, in row 1 above, we'd like to know that the previous date (on row 0) was `2020-01-01`.

A great way to compare adjacent rows in SQL is to use the [`LAG` window function](https://www.sqltutorial.org/sql-window-functions/sql-lag/). Also, you can try including our data from above using [SQL's `WITH` prefix](https://modern-sql.com/feature/with).
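Before applying this to `EXOPLANETS`, it can help to see `LAG` and `WITH` on a toy table. The sketch below uses a hypothetical `DAILY` table and made-up counts in an in-memory database. It assumes the SQLite library bundled with your Python is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical toy table: one row per day that received data.
lag_conn = sqlite3.connect(":memory:")
lag_conn.execute("CREATE TABLE DAILY (DATE_ADDED TEXT, ROWS_ADDED INTEGER)")
lag_conn.executemany(
    "INSERT INTO DAILY VALUES (?, ?)",
    [("2020-01-01", 84), ("2020-01-02", 92), ("2020-01-05", 100)],
)

# LAG looks one row back in the ORDER BY sequence; the first row has no
# predecessor, so its lagged value is NULL (None in Python).
lag_rows = lag_conn.execute(
    """
    WITH UPDATES AS (
        SELECT DATE_ADDED, ROWS_ADDED FROM DAILY
    )
    SELECT
        DATE_ADDED,
        LAG(DATE_ADDED) OVER (ORDER BY DATE_ADDED) AS LAST_DATE_ADDED
    FROM UPDATES
    ORDER BY DATE_ADDED
    """
).fetchall()

for row in lag_rows:
    print(row)
```

Note that ISO-formatted date strings (`YYYY-MM-DD`) sort correctly as plain text, which is what lets `ORDER BY DATE_ADDED` work on a `TEXT` column.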

In [22]:
# YOUR CODE HERE
SQL = """
WITH RC_UPDATES AS(
    SELECT
        DATE_ADDED,
        COUNT(*) AS ROWS_ADDED
    FROM
        EXOPLANETS
    GROUP BY
        DATE_ADDED
)

SELECT
    DATE_ADDED,
    LAG(DATE_ADDED) OVER(ORDER BY DATE_ADDED) AS LAST_DATE_ADDED,
    ROWS_ADDED
FROM
    RC_UPDATES;
"""
# END YOUR CODE

In [23]:
pd.read_sql_query(SQL, conn).head(5)

Unnamed: 0,DATE_ADDED,LAST_DATE_ADDED,ROWS_ADDED
0,2020-01-01,,84
1,2020-01-02,2020-01-01,92
2,2020-01-03,2020-01-02,101
3,2020-01-04,2020-01-03,102
4,2020-01-05,2020-01-04,100


Check that your output looks like this:
![SegmentLocal](/tree/data/assets/ex1img2.png "segment")

Awesome! The ability to compare adjacent dates is crucial for detecting stale data. Our next step is this: given two adjacent dates, calculate the *difference in days* between those dates. We're answering the question, "How many days old is the previous batch?"

Since we're in SQLite, we can convert our date strings to Julian day numbers with `JULIANDAY()`, and [easily find the difference between them](https://www.w3resource.com/sqlite/sqlite-julianday.php).
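As a quick sanity check of what `JULIANDAY()` returns, the sketch below evaluates it on literal date strings in a throwaway in-memory connection. The variable names are illustrative only. The two dates are taken from the first gap in the detections further down.

```python
import sqlite3

jd_conn = sqlite3.connect(":memory:")

# JULIANDAY returns a floating-point day count, so subtraction yields days.
julian_value, gap_days, bad_input = jd_conn.execute(
    """
    SELECT
        JULIANDAY('2020-01-01'),
        JULIANDAY('2020-02-08') - JULIANDAY('2020-01-31'),
        JULIANDAY('not a date')
    """
).fetchone()

print(type(julian_value).__name__, gap_days, bad_input)
```

An unparseable string yields `NULL` rather than an error, so malformed `DATE_ADDED` values would silently drop out of any comparison built on the difference.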

In [24]:
# YOUR CODE HERE
SQL = """
WITH RC_UPDATES AS(
    SELECT
        DATE_ADDED,
        COUNT(*) AS ROWS_ADDED
    FROM
        EXOPLANETS
    GROUP BY
        DATE_ADDED
)

SELECT
    DATE_ADDED,
    LAG(DATE_ADDED) OVER(ORDER BY DATE_ADDED) AS LAST_DATE_ADDED,
    JULIANDAY(DATE_ADDED) - JULIANDAY(LAG(DATE_ADDED) OVER(ORDER BY DATE_ADDED)) AS DAYS_SINCE_LAST_UPDATE,
    ROWS_ADDED
FROM
    RC_UPDATES;
"""
# END YOUR CODE

In [25]:
pd.read_sql_query(SQL, conn).head(5)

Unnamed: 0,DATE_ADDED,LAST_DATE_ADDED,DAYS_SINCE_LAST_UPDATE,ROWS_ADDED
0,2020-01-01,,,84
1,2020-01-02,2020-01-01,1.0,92
2,2020-01-03,2020-01-02,1.0,101
3,2020-01-04,2020-01-03,1.0,102
4,2020-01-05,2020-01-04,1.0,100


See if you can get something looking like this:
![SegmentLocal](/tree/data/assets/ex1img3.png "segment")

Well done! We're basically all the way there. Recall that our original task was to identify **freshness incidents** -- that is, dates where the previous data entry is more than 1 day old. After adding another `WITH` statement and a `WHERE` clause, our query should be able to do just that.
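One possible shape for that final query, shown on toy data so it doesn't replace your own attempt: the sketch below chains two `WITH` clauses side by side (separated by a comma) instead of nesting them, and passes the staleness threshold as a bound parameter. The helper name `find_stale_gaps`, its `max_gap_days` argument, and the sample rows are all hypothetical. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

STALE_GAPS_SQL = """
WITH RC_UPDATES AS (
    SELECT DATE_ADDED, COUNT(*) AS ROWS_ADDED
    FROM EXOPLANETS
    GROUP BY DATE_ADDED
),
UPDATE_DELAYS AS (
    SELECT
        DATE_ADDED,
        JULIANDAY(DATE_ADDED)
            - JULIANDAY(LAG(DATE_ADDED) OVER (ORDER BY DATE_ADDED))
            AS DAYS_SINCE_LAST_UPDATE
    FROM RC_UPDATES
)
SELECT DATE_ADDED, DAYS_SINCE_LAST_UPDATE
FROM UPDATE_DELAYS
WHERE DAYS_SINCE_LAST_UPDATE > ?
ORDER BY DATE_ADDED
"""


def find_stale_gaps(connection, max_gap_days=1):
    """Return (date, gap) pairs where the gap exceeds max_gap_days."""
    return connection.execute(STALE_GAPS_SQL, (max_gap_days,)).fetchall()


# Hypothetical toy data: gaps of 1, 3, 1 and 2 days between update dates.
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE EXOPLANETS (_id TEXT, DATE_ADDED TEXT)")
toy_dates = [
    "2020-01-01", "2020-01-02", "2020-01-02",
    "2020-01-05", "2020-01-06", "2020-01-08",
]
toy_conn.executemany(
    "INSERT INTO EXOPLANETS VALUES (?, ?)",
    [(str(i), d) for i, d in enumerate(toy_dates)],
)

gaps_over_1 = find_stale_gaps(toy_conn, max_gap_days=1)
gaps_over_2 = find_stale_gaps(toy_conn, max_gap_days=2)
print(gaps_over_1)
print(gaps_over_2)
```

The first date never appears in the results because its lagged value is `NULL`, and a `NULL` comparison in a `WHERE` clause filters the row out. Making the threshold a parameter is a design choice for experimentation: raising it to 2 keeps only the longer outage.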

In [34]:
# YOUR CODE HERE
SQL = """
WITH UPDATE_DELAYS AS(

    WITH RC_UPDATES AS(
        SELECT
            DATE_ADDED,
            COUNT(*) AS ROWS_ADDED
        FROM
            EXOPLANETS
        GROUP BY
            DATE_ADDED
    )

    SELECT
        DATE_ADDED,
        LAG(DATE_ADDED) OVER(ORDER BY DATE_ADDED) AS LAST_DATE_ADDED,
        JULIANDAY(DATE_ADDED) - JULIANDAY(LAG(DATE_ADDED) OVER(ORDER BY DATE_ADDED)) AS DAYS_SINCE_LAST_UPDATE,
        ROWS_ADDED
    FROM
        RC_UPDATES
)

SELECT
    *
FROM
    UPDATE_DELAYS
WHERE
    DAYS_SINCE_LAST_UPDATE > 1;
"""
# END YOUR CODE

In [36]:
detections = pd.read_sql_query(SQL, conn)
detections

Unnamed: 0,DATE_ADDED,LAST_DATE_ADDED,DAYS_SINCE_LAST_UPDATE,ROWS_ADDED
0,2020-02-08,2020-01-31,8.0,83
1,2020-03-30,2020-03-26,4.0,117
2,2020-05-14,2020-05-06,8.0,103
3,2020-06-07,2020-06-04,3.0,82
4,2020-06-17,2020-06-12,5.0,87
5,2020-06-30,2020-06-27,3.0,86


If your result looks like this:
![SegmentLocal](/tree/data/assets/ex1img4.png "segment")

then congratulations! You've built a detector for **freshness incidents**, a key part of any data observability solution. With the following code, you can visualize your detections along with the update data.

In [38]:
for _, row in detections.iterrows():
    fig.add_vline(x=row['DATE_ADDED'], line_color='red')
fig.show()

In the next exercise, we'll build off of these simpler reports to handle scenarios with multiple tables and lineage information.