# Data Downtime Challenge | Exercise 3

## 0. Setup

In [None]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [None]:
import pandas as pd
import plotly.express as px
import re
from datetime import datetime, date, timedelta

In [None]:
import sys
sys.path.append("..")

In [None]:
from data.utils import get_days_index

all_days = get_days_index(350)

In [None]:
import sqlite3

conn = sqlite3.connect("../data/dbs/Ex3.db")
c = conn.cursor()

## 1. Introduction

In the last exercise, we looked at incidents spanning multiple tables in a database. Yet, we've still only looked at _individual_ metrics like the row count, rate of null values, and so on. In practice, many genuine data downtime incidents involve _conjunctions_ of events across multiple upstream and downstream tables. In this exercise, we practice crafting single queries that can handle such conjunctive events.

## 2. Data

In this exercise, we'll continue to use the `EXOPLANETS`, `HABITABLES`, and `EXOPLANETS_SCHEMA` tables.

In [None]:
# show all tables in DB
pd.read_sql_query("""
    SELECT 
        NAME
    FROM 
        SQLITE_MASTER 
    WHERE 
        TYPE ='table' AND 
        NAME NOT LIKE 'sqlite_%';
    """,
    conn
)

## 3. Exercise: Pillars of Data Observability in Conjunction

Why care about conjunctions of events, when each individual event can already be detected on its own?

One important factor is **noise** -- looking at simultaneous events reduces the total number of events you're worried about, and makes it more likely that the issues are genuine. Another factor is **causality** -- given an issue in some table, looking at simultaneous events in upstream tables might help you determine the root cause, and reveal the path to a solution.
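To make the noise argument concrete, here is a minimal sketch in plain Python. The alert dates are made up for illustration and do not come from the exercise database; the point is only that requiring two monitors to fire on the same day shrinks the alert set.

```python
# Hypothetical alert dates from two independent monitors (made-up values).
downstream_alerts = {"2020-01-03", "2020-01-07", "2020-01-09", "2020-01-12"}
upstream_alerts = {"2020-01-07", "2020-01-20"}

# Keep only the dates on which both monitors fired.
joint_alerts = sorted(downstream_alerts & upstream_alerts)
print(joint_alerts)
```

Four downstream alerts collapse to a single joint alert, and that date also points at an upstream event worth investigating.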

Take the past exercise as an example -- we saw that the `habitability` field had anomalous rates of zeroed values:

In [None]:
h_zero = pd.read_sql_query("""
SELECT
    DATE_ADDED,
    CAST(SUM(CASE WHEN HABITABILITY IS 0 THEN 1 ELSE 0 END) AS FLOAT) / COUNT(*) AS ZERO_RATE
FROM
    HABITABLES
GROUP BY
    DATE_ADDED
""", conn)

In [None]:
h_zero = h_zero \
    .rename(columns={clmn: clmn.lower() for clmn in h_zero.columns}) \
    .set_index("date_added") \
    .reindex(all_days)

Suppose we wanted to detect these anomalous values. One simple approach could be to define a threshold, and alert whenever the zero rate exceeded that threshold. How about 30%, for now?

In [None]:
h_zero_alerts = pd.read_sql_query("""
WITH HABITABILITY_ZERO_RATE AS(
    SELECT
        DATE_ADDED,
        CAST(SUM(CASE WHEN HABITABILITY IS 0 THEN 1 ELSE 0 END) AS FLOAT) / COUNT(*) AS ZERO_RATE
    FROM
        HABITABLES
    GROUP BY
        DATE_ADDED
)

SELECT
    DATE_ADDED
FROM
    HABITABILITY_ZERO_RATE
WHERE
    ZERO_RATE IS NOT NULL AND
    ZERO_RATE > 0.3
""", conn)

In [None]:
h_zero_alerts = h_zero_alerts.rename(columns={clmn: clmn.lower() for clmn in h_zero_alerts.columns})

In [None]:
fig = px.bar(x=all_days, y=h_zero["zero_rate"])
for alert in h_zero_alerts["date_added"]: fig.add_vline(x=alert, line_color='red')
fig.update_xaxes(title="Date")
fig.update_yaxes(title="Habitability Zero Rate")
fig.show()

This clearly identifies some problematic timestamps, but it's *way too noisy*. We wouldn't want a notification for every red line on the above graph. Looking for other (upstream) events can not only prune our alerts, but also help us identify the issue's cause.

See if you can `JOIN` the timestamps from `h_zero_alerts` with timestamps identifying a `schema_change`. For a refresher on `JOIN`ing in SQLite, check out [this link](https://www.sqlitetutorial.net/sqlite-inner-join/).

*Hint*: try querying the `EXOPLANETS_SCHEMA` table using the same approach from Exercise 2. For reference, the SQL that returns a schema change looks like this:
```
WITH CHANGES AS(
    SELECT
        DATE,
        SCHEMA,
        LAG(SCHEMA) OVER(ORDER BY DATE) AS PAST_SCHEMA
    FROM
        EXOPLANETS_SCHEMA
)

SELECT
    *
FROM
    CHANGES
WHERE
    SCHEMA != PAST_SCHEMA;
```
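If the shape of such a join is unclear, the sketch below shows the general pattern on a toy in-memory database. The tables `ALERTS` and `TOY_SCHEMA` and all of their rows are hypothetical stand-ins, not the exercise data, and this is one possible way to structure the query rather than the expected answer.

```python
import sqlite3

# Toy in-memory database; ALERTS and TOY_SCHEMA are hypothetical tables.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE ALERTS (DATE_ADDED TEXT);
    INSERT INTO ALERTS VALUES ('2020-01-02'), ('2020-01-03'), ('2020-01-05');

    CREATE TABLE TOY_SCHEMA (DATE TEXT, SCHEMA TEXT);
    INSERT INTO TOY_SCHEMA VALUES
        ('2020-01-01', 'a,b'),
        ('2020-01-02', 'a,b'),
        ('2020-01-03', 'a,b,c'),
        ('2020-01-04', 'a,b,c');
""")

rows = toy.execute("""
    WITH CHANGES AS (
        SELECT
            DATE,
            SCHEMA,
            LAG(SCHEMA) OVER (ORDER BY DATE) AS PAST_SCHEMA
        FROM TOY_SCHEMA
    ),
    SCHEMA_CHANGES AS (
        -- the first row has a NULL PAST_SCHEMA, so it is filtered out here
        SELECT DATE FROM CHANGES WHERE SCHEMA != PAST_SCHEMA
    )
    SELECT A.DATE_ADDED
    FROM ALERTS AS A
    INNER JOIN SCHEMA_CHANGES AS S
        ON A.DATE_ADDED = S.DATE
""").fetchall()
toy.close()
print(rows)
```

Chaining two common table expressions keeps each signal readable on its own, and the `INNER JOIN` keeps only the dates present in both.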

In [None]:
# YOUR CODE HERE
SQL = """

"""
# END YOUR CODE

In [None]:
joint_anoms = pd.read_sql_query(SQL, conn)
joint_anoms = joint_anoms \
    .rename(columns={clmn: clmn.lower() for clmn in joint_anoms.columns})
joint_anoms

See if you can get an output looking like this:

![SegmentLocal](../data/assets/ex3img1.png "segment")

Below, we'll see the "de-noising" effect that joining alerts can have, enhancing the clarity of our data observability:

In [None]:
fig = px.bar(x=all_days, y=h_zero["zero_rate"])
for alert in joint_anoms["date"]: fig.add_vline(x=alert, line_color='red')
fig.update_xaxes(title="Date")
fig.update_yaxes(title="Habitability Zero Rate")
fig.show()

You should see a single reported date, `2020-07-19`. Not only have we reduced the number of reports, but we potentially learn something -- that the schema change in `EXOPLANETS` _caused_ the zero rate in `HABITABLES` to spike. By combining data observability pillars, we're one step closer to resolving the problem!

![SegmentLocal](../data/assets/comet.gif "segment")

## 4. Exercise: Diagnosing Another Distribution Issue

Here's another quick mystery. It looks like the `HABITABLES` table returns to normal after a while, if we only look at zero rates. But probing into the **volume** of the table reveals something odd:

In [None]:
rows_added = pd.read_sql_query("""
SELECT
    DATE_ADDED,
    COUNT(*) AS ROWS_ADDED
FROM
    HABITABLES
GROUP BY
    DATE_ADDED
""", conn)
rows_added = rows_added \
    .rename(columns={clmn: clmn.lower() for clmn in rows_added.columns}) \
    .set_index("date_added") \
    .reindex(all_days)

In [None]:
fig = px.bar(x=all_days, y=rows_added["rows_added"])
fig.update_xaxes(title="Date")
fig.update_yaxes(title="Rows Added")
fig.show()

The number of rows added per day seems to rise to roughly 1.5x its previous level starting around `2020-09-05`. We could detect this using a naive threshold:

In [None]:
h_rc_anoms = pd.read_sql_query("""
WITH ROW_COUNTS AS(
    SELECT
        DATE_ADDED,
        COUNT(*) AS ROWS_ADDED
    FROM
        HABITABLES
    GROUP BY
        DATE_ADDED
)
SELECT
    DATE_ADDED
FROM
    ROW_COUNTS
WHERE
    ROWS_ADDED > 130 -- this is my "detection parameter" - very naive!
""", conn)
h_rc_anoms = h_rc_anoms \
    .rename(columns={clmn: clmn.lower() for clmn in h_rc_anoms.columns})

In [None]:
fig = px.bar(x=all_days, y=rows_added["rows_added"])
for alert in h_rc_anoms["date_added"]: fig.add_vline(x=alert, line_color='red')
fig.update_xaxes(title="Date")
fig.update_yaxes(title="Rows Added")
fig.show()

But again, that's too noisy to be informative. The issue is compounded because volume is usually only treated as a problem when it *decreases* (as we saw earlier with freshness), so an increase is easy to overlook. Still, something must be causing this volume change and, as it turns out, it's a genuine issue.

As another exercise in understanding **distribution**, we're going to try querying for _uniqueness_. Uniqueness is pretty simple: for a given field, what % of field values are distinct?
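As a quick illustration of the definition, here is a minimal sketch in plain Python. The values in `ids` are made up for the example.

```python
# Hypothetical field values; "a1" appears twice.
ids = ["a1", "b2", "a1", "c3"]

# Uniqueness = number of distinct values / total number of values.
pct_unique = len(set(ids)) / len(ids)
print(pct_unique)
```

Three distinct values out of four gives a uniqueness of 0.75; a field with no repeats would score 1.0.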

Let's take a look at the `HABITABLES` table again:

In [None]:
c.execute("PRAGMA table_info(HABITABLES);")
c.fetchall()

While all fields are interesting, we want to pay special attention to the `_id` field. IDs are an interesting piece of the data observability landscape, since they're generally expected to be unique.

As a starting point, see if you can use SQLite's [`DISTINCT` keyword](https://www.tutorialspoint.com/sqlite/sqlite_distinct_keyword.htm) to find the number of distinct `_id`s added per day.

In [None]:
# YOUR CODE HERE
SQL = """

"""
# END YOUR CODE

In [None]:
pd.read_sql_query(SQL, conn).head(5)

Can you get a result looking like this?

![SegmentLocal](../data/assets/ex3img2.png "segment")

If so, great! Let's now try to get the *rate* of `_id` values that are distinct per day. Once again, `CAST(... AS FLOAT)` will be your friend.
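Why the cast matters: SQLite divides two integers with integer division, which would truncate any rate below 1 down to 0. The sketch below demonstrates this on a throwaway in-memory connection (the name `toy` is just for this example) with literal numbers rather than the exercise tables.

```python
import sqlite3

# Throwaway in-memory connection, separate from the exercise database.
toy = sqlite3.connect(":memory:")

# Same numerator and denominator, with and without the cast.
int_div, float_div = toy.execute(
    "SELECT 3 / 4, CAST(3 AS FLOAT) / 4"
).fetchone()
toy.close()

print(int_div, float_div)
```

Only the cast version yields the fractional value 0.75; the plain integer division returns 0.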

Also, remember to name your column `PCT_UNIQUE` (or `pct_unique`) so it can be used properly by the visualization code below:

In [None]:
# YOUR CODE HERE
SQL = """

"""
# END YOUR CODE

In [None]:
h_uniq = pd.read_sql_query(SQL, conn)
h_uniq = h_uniq.rename(columns={clmn: clmn.lower() for clmn in h_uniq.columns}) \
    .set_index("date_added") \
    .reindex(all_days)

In [None]:
fig = px.bar(x=all_days, y=h_uniq["pct_unique"])
fig.update_xaxes(title="Date")
fig.update_yaxes(title="_id Uniqueness")
fig.show()

A proper query here should reveal something telling -- the `_id` field in `HABITABLES` is not unique, meaning we may be adding duplicate entries to our table! Semantically, `_id` should be 100% unique. Try writing a query that turns up the offending dates below:

In [None]:
# YOUR CODE HERE
SQL = """

"""
# END YOUR CODE

In [None]:
h_uniq_anoms = pd.read_sql_query(SQL, conn)
h_uniq_anoms = h_uniq_anoms \
    .rename(columns={clmn: clmn.lower() for clmn in h_uniq_anoms.columns})

In [None]:
fig = px.bar(x=all_days, y=h_uniq["pct_unique"])
for alert in h_uniq_anoms["date_added"]: fig.add_vline(x=alert, line_color='red')
fig.update_xaxes(title="Date")
fig.update_yaxes(title="_id Uniqueness")
fig.show()

Seems like we've caught the issue... but does this look too noisy to you? :)

# Great work!

This exercise showed that pillars of data observability often need to be combined to give meaningful alerts (volume and uniqueness; schema change and downstream distributions; etc.). In the next exercise, we'll look at some concepts from machine learning to scale our approach.