## Intro and Dataset
This notebook is the first real-world use of the datools implementation of the DIFF algorithm [1]. I'm trying to replicate the experiments in the Scorpion paper [2] using the Intel Sensor dataset [3].

In Sections 8.1 and 8.4 of the Scorpion paper, the system explains why various sensors (motes) placed throughout a lab report abnormally high temperature values. The paper gives clear explanations, but not all of the details (like the time ranges in which the anomalies occur). Luckily, the experiment code is publicly accessible [4], and my guess from it is that the anomalies fall between 2004-03-01 and 2004-03-10. Let's see what happens when we run DIFF against the dataset.

[1] DIFF: A Relational Interface for Large-Scale Data Explanation
    by Firas Abuzaid, Peter Kraft, Sahaana Suri, Edward Gan,
    Eric Xu, Atul Shenoy, Asvin Ananthanarayan, John Sheu,
    Erik Meijer, Xi Wu, Jeff Naughton, Peter Bailis, and Matei Zaharia.
    http://www.bailis.org/papers/diff-vldb2019.pdf
    
[2] Scorpion: explaining away outliers in aggregate queries.
    by Eugene Wu and Samuel Madden.
    https://dspace.mit.edu/bitstream/handle/1721.1/89076/scorpion-vldb13.pdf
    
[3] Intel Lab Data
    by Samuel Madden, Peter Bodik, Wei Hong, Carlos Guestrin, 
    Mark Paskin, and Romain Thibaux
    http://db.csail.mit.edu/labdata/labdata.html
    As an aside, the dataset is 17 (!!!) years old, from before I started grad school. Wow!

[4] Scorpion experiments
    by Eugene Wu and Samuel Madden
    https://github.com/sirrice/scorpion/blob/ba1af715ebc33bc4c4a63612d63debd8650ee1cf/scorpion/tests/gentestdata.py#L80
    (At least the part I think I care about)

In [10]:
!wget http://db.csail.mit.edu/labdata/data.txt.gz
!gunzip data.txt.gz
!sed -i '1s/^/day time_of_day epoch moteid temperature humidity light voltage\n/' data.txt
!head data.txt

--2021-11-21 15:52:03--  http://db.csail.mit.edu/labdata/data.txt.gz
Resolving db.csail.mit.edu (db.csail.mit.edu)... 128.52.128.91
Connecting to db.csail.mit.edu (db.csail.mit.edu)|128.52.128.91|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34422518 (33M) [application/x-gzip]
Saving to: ‘data.txt.gz’


2021-11-21 15:52:03 (158 MB/s) - ‘data.txt.gz’ saved [34422518/34422518]

day time_of_day epoch moteid temperature humidity light voltage
2004-03-31 03:38:15.757551 2 1 122.153 -3.91901 11.04 2.03397
2004-02-28 00:59:16.02785 3 1 19.9884 37.0933 45.08 2.69964
2004-02-28 01:03:16.33393 11 1 19.3024 38.4629 45.08 2.68742
2004-02-28 01:06:16.013453 17 1 19.1652 38.8039 45.08 2.68742
2004-02-28 01:06:46.778088 18 1 19.175 38.8379 45.08 2.69964
2004-02-28 01:08:45.992524 22 1 19.1456 38.9401 45.08 2.68742
2004-02-28 01:09:22.323858 23 1 19.1652 38.872 45.08 2.68742
2004-02-28 01:09:46.109598 24 1 19.1652 38.8039 45.08 2.68742
2004-02-28 01:10:16.6789 25 1 19.1456 3

## Import the dataset into SQLite

In [None]:
import sys
!{sys.executable} -m pip install sqlite-utils

In [11]:
!sqlite-utils insert intel-sensor.sqlite readings data.txt --csv --sniff --detect-types

  [####################################]  100%

In [26]:
!sqlite-utils schema intel-sensor.sqlite

CREATE TABLE "readings" (
   [day] TEXT,
   [time_of_day] TEXT,
   [epoch] INTEGER,
   [moteid] INTEGER,
   [temperature] FLOAT,
   [humidity] FLOAT,
   [light] FLOAT,
   [voltage] FLOAT
);
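
The 2004-03-01 to 2004-03-10 window mentioned in the intro is a guess, so a quick sanity check may be worth running before DIFF: count the readings above the temperature threshold per day and see where they cluster. The sketch below uses Python's built-in `sqlite3` module on a small in-memory table with made-up rows so that it runs on its own. The `hot_readings_per_day` helper and the toy rows are illustrative and not part of datools; pointing `sqlite3.connect` at `intel-sensor.sqlite` instead should show the real distribution.

```python
import sqlite3


def hot_readings_per_day(conn, threshold=100.0):
    """Return (day, count) pairs for readings hotter than `threshold`."""
    query = """
        SELECT day, COUNT(*) AS n_hot
        FROM readings
        WHERE temperature > ?
        GROUP BY day
        ORDER BY day  -- ISO dates stored as TEXT sort chronologically
    """
    return conn.execute(query, (threshold,)).fetchall()


# Toy stand-in for the real table (made-up rows, same column names).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (day TEXT, moteid INTEGER, temperature FLOAT)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("2004-03-02", 15, 121.0),
        ("2004-03-02", 18, 118.5),
        ("2004-03-02", 1, 19.9),
        ("2004-03-31", 1, 122.2),
    ],
)

hot_by_day = hot_readings_per_day(conn)
for day, n_hot in hot_by_day:
    print(day, n_hot)
```

Days with a large count are candidates for the anomaly window; isolated hot readings elsewhere (like the 2004-03-31 row at the top of `data.txt`) fall outside it.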


## When only considering moteid, we replicate the Scorpion paper
If we consider only `moteid` (the only set-valued attribute; the paper calls it `sensorid`), the Scorpion result replicates: `moteid = 15` is the top offender, with `moteid = 18` a distant second. Per the paper:
> For the first INTEL workload, the outliers are generated by Sensor 15, so Scorpion consistently returns the predicate sensorid = 15.

In [7]:
from sqlalchemy import create_engine
from datools.explanation.algorithms import diff
from datools.models import Column
from datools.sqlalchemy_utils import query_results_pretty_print

engine = create_engine('sqlite:///intel-sensor.sqlite')

In [8]:
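# Test relation: too-hot readings; control relation: all other readings in the same window.
# The strict comparisons on the TEXT day column cover 2004-03-02 through 2004-03-09.
# Date literals use single quotes, the standard SQL string syntax
# (SQLite tries to resolve double-quoted tokens as identifiers first).
candidates = diff(
    engine=engine,
    test_relation="SELECT moteid, temperature, humidity, light, voltage FROM readings WHERE temperature > 100 AND day > '2004-03-01' AND day < '2004-03-10'",
    control_relation="SELECT moteid, temperature, humidity, light, voltage FROM readings WHERE temperature <= 100 AND day > '2004-03-01' AND day < '2004-03-10'",
    on_column_values={Column('moteid')},  # set-valued column: equality predicates
    on_column_ranges={},                  # no range-valued columns in this run
    min_support=0.05,
    min_risk_ratio=2.0,
    max_order=1,  # single-column explanations only
)
for candidate in candidates:
    print(candidate)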

Explanation(predicates=(Predicate(moteid EQUALS 15),), risk_ratio=404.8320855614973)
Explanation(predicates=(Predicate(moteid EQUALS 18),), risk_ratio=200.5765335449176)
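
For intuition about the `risk_ratio` values printed above: as I understand the definitions in the DIFF paper [1], the risk ratio of a predicate compares how likely a matching row is to be in the test relation with how likely a non-matching row is, and support is the fraction of test rows that match. The sketch below computes both with plain `sqlite3` on made-up data. It is a minimal illustration under those definitions, not the datools implementation, and `risk_ratio_and_support` is a hypothetical helper name.

```python
import sqlite3


def risk_ratio_and_support(conn, test_sql, control_sql, predicate):
    """Return (support, risk_ratio) of a SQL predicate for test vs. control."""

    def count(relation_sql, where):
        query = f"SELECT COUNT(*) FROM ({relation_sql}) WHERE {where}"
        return conn.execute(query).fetchone()[0]

    a_test = count(test_sql, predicate)              # test rows that match
    a_ctrl = count(control_sql, predicate)           # control rows that match
    b_test = count(test_sql, f"NOT ({predicate})")   # test rows that don't
    b_ctrl = count(control_sql, f"NOT ({predicate})")

    support = a_test / (a_test + b_test)
    risk_matching = a_test / (a_test + a_ctrl)
    risk_other = b_test / (b_test + b_ctrl)
    ratio = float("inf") if risk_other == 0 else risk_matching / risk_other
    return support, ratio


# Made-up readings: mote 15 is mostly hot, mote 1 is mostly normal.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (moteid INTEGER, temperature FLOAT)")
rows = [(15, 120.0)] * 8 + [(15, 20.0)] * 2 + [(1, 120.0)] * 2 + [(1, 20.0)] * 88
conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)

TEST_SQL = "SELECT moteid FROM readings WHERE temperature > 100"
CONTROL_SQL = "SELECT moteid FROM readings WHERE temperature <= 100"

support, ratio = risk_ratio_and_support(conn, TEST_SQL, CONTROL_SQL, "moteid = 15")
print(f"support={support:.2f} risk_ratio={ratio:.2f}")
```

In this toy table, 8 of the 10 rows from mote 15 are hot versus 2 of the 90 rows from other motes, so the ratio is (8/10) / (2/90) = 36. A risk ratio in the hundreds, as in the real output, means the hot readings are concentrated almost entirely in rows that match the predicate.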


## Including range-valued attributes, the implementation can't yet replicate the Scorpion paper
In the Scorpion paper, the authors also consider columns beyond moteid:

>  However, when c approaches 1, Scorpion generates the predicate, light ∈ [0, 923] & voltage ∈ [2.307, 2.33] & sensorid = 15.

The DIFF implementation doesn't yet support combinations of more than one column (`max_order > 1`), so it can't produce a three-column predicate like the one above. It does support range-valued attributes like humidity/light/voltage via the `on_column_ranges` argument. The Scorpion paper splits these attributes into 15 bucketed ranges:

> "The Naive and MC partitioner algorithms were configured to split each continuous attribute’s domain into 15 equi-sized ranges."

When we try to replicate this (moteid plus 15 equi-sized ranges for humidity/light/voltage), the result only partially replicates. The top single-column explanation becomes `voltage >= 2.32 AND voltage < 2.3291`. This falls inside the voltage range in the Scorpion predicate, [2.307, 2.33], but surprisingly it ranks higher than `moteid = 15` alone. `moteid = 15` (the sensor identified in the paper) is only the 4th explanation, behind two voltage ranges and one humidity range.

Why might this be?
* This DIFF implementation is new. Perhaps there's a bug?
* We're trying to replicate Scorpion with an implementation of DIFF. Perhaps the artistic license I'm taking to do that (the temperature threshold I picked, using the risk ratio instead of Scorpion's metric) explains the difference in results.
* Skeptically, I wonder if we should have even considered humidity/light/voltage as explanations. I don't know electronics well, but if a sensor is failing to measure temperature, won't it also fail to measure things like humidity? The first row of `data.txt` hints at this: it pairs a temperature of 122.153 with a humidity of -3.91901. If so, those columns are symptoms of the same failure rather than explanations of a bad temperature reading, so why are we considering them?

In [10]:
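# Same test/control relations as before, now with three range-valued columns.
# The strict comparisons on the TEXT day column cover 2004-03-02 through 2004-03-09.
# Date literals use single quotes, the standard SQL string syntax
# (SQLite tries to resolve double-quoted tokens as identifiers first).
candidates = diff(
    engine=engine,
    test_relation="SELECT moteid, temperature, humidity, light, voltage FROM readings WHERE temperature > 100 AND day > '2004-03-01' AND day < '2004-03-10'",
    control_relation="SELECT moteid, temperature, humidity, light, voltage FROM readings WHERE temperature <= 100 AND day > '2004-03-01' AND day < '2004-03-10'",
    on_column_values={Column('moteid')},  # set-valued column: equality predicates
    on_column_ranges={Column('humidity'), Column('light'), Column('voltage')},  # bucketed into ranges
    min_support=0.05,
    min_risk_ratio=2.0,
    max_order=1,  # single-column explanations only
)
for candidate in candidates:
    print(candidate)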

Explanation(predicates=(Predicate(voltage GTEQ 2.32), Predicate(voltage LT 2.3291)), risk_ratio=732.5350873788292)
Explanation(predicates=(Predicate(humidity GTEQ -3.91901), Predicate(humidity LT -1.3392)), risk_ratio=688.1392174704278)
Explanation(predicates=(Predicate(voltage GTEQ 2.31097), Predicate(voltage LT 2.32)), risk_ratio=413.2085152838428)
Explanation(predicates=(Predicate(moteid EQUALS 15),), risk_ratio=404.8320855614973)
Explanation(predicates=(Predicate(humidity GTEQ -1.3392), Predicate(humidity LT 1.13812)), risk_ratio=378.017473789316)
Explanation(predicates=(Predicate(humidity GTEQ 2.80408), Predicate(humidity LT 8.13082)), risk_ratio=378.017473789316)
Explanation(predicates=(Predicate(humidity GTEQ 1.13812), Predicate(humidity LT 2.80408)), risk_ratio=377.6413965087282)
Explanation(predicates=(Predicate(voltage LT 2.30202),), risk_ratio=373.6497102499095)
Explanation(predicates=(Predicate(humidity LT -3.91901),), risk_ratio=373.6124796373681)
Explanation(predicates=(P
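
The Scorpion quote describes splitting each continuous attribute's domain into 15 equi-sized ranges. Reading that as equal-width buckets (an assumption), the split can be sketched as below; `equal_width_edges` and `bucket_index` are hypothetical helpers, and the 0 to 30 domain is a toy example. Note that the humidity boundaries printed above are not evenly spaced (-3.91901, -1.3392, 1.13812, 2.80408, 8.13082), which hints that this DIFF implementation may bucket differently, possibly by quantile. That is only a guess from the output.

```python
def equal_width_edges(lo, hi, n_buckets=15):
    """Return the n_buckets + 1 boundaries of equal-width ranges over [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    width = (hi - lo) / n_buckets
    # Use hi itself as the last edge to avoid floating-point drift.
    return [lo + i * width for i in range(n_buckets)] + [hi]


def bucket_index(value, lo, hi, n_buckets=15):
    """Return the 0-based bucket for value; the top edge goes in the last bucket."""
    if not lo <= value <= hi:
        raise ValueError("value outside [lo, hi]")
    index = int((value - lo) / (hi - lo) * n_buckets)
    return min(index, n_buckets - 1)


edges = equal_width_edges(0.0, 30.0)
print(edges[:4])                      # first few boundaries
print(bucket_index(5.0, 0.0, 30.0))   # bucket covering [4.0, 6.0)
```

If the two systems bucket differently, their ranges aren't directly comparable, which would be one more candidate reason for the mismatch.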

In [9]:
!rm data.txt* intel-sensor.sqlite*

rm: cannot remove 'data.txt*': No such file or directory
rm: cannot remove 'intel-sensor.sqlite*': No such file or directory
