# Computing Experiment Datasets #2: Experiment Observations

This Lab is part of a multi-part series focused on computing useful experiment datasets. In this Lab, we'll use [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) to compute _experiment observations_.

<!-- We use an external image URL rather than a relative path so that this notebook will be rendered correctly on the Optimizely Labs website -->
![Experiment observations computation](https://raw.githubusercontent.com/optimizely/labs/master/labs/computing-experiment-subjects/img/observations_computation.png)

**Experiment observations** map [experiment subjects](https://www.optimizely.com/labs/computing-experiment-subjects/) onto numerical observations made about each subject during an experiment:

| subject_id | experiment_id | variation_id | timestamp                | ordered? | order_count | items_ordered | revenue |
|------------|---------------|--------------|--------------------------|----------|-------------|---------------|---------|
| visitor_1  | 12345         | A            | July 20th, 2020 14:25:00 | 0        | 0           | 0             | 0       |
| visitor_2  | 12345         | B            | July 20th, 2020 14:28:13 | 1        | 2           | 12            | 65.21   |
| visitor_3  | 12345         | A            | July 20th, 2020 14:31:01 | 1        | 1           | 1             | 5.99    |


In this Lab we'll join a visitor-level **subjects** input dataset with an event-level **conversions** dataset to produce a subject-level **observations** output dataset.  This output dataset records who was exposed to our experiment, which treatment they received, and when they first received it, along with a set of numerical observations about each subject.  This dataset is useful for
- computing aggregate statistics about an experiment, such as the number of visitors who saw each "variation" on a given day or the total number of purchases associated with each experimental treatment
- computing and visualizing business metrics for each variation in an experiment
- performing statistical analysis on the change observed in your business metrics in an experiment
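
As a minimal sketch of the first use case, the snippet below aggregates the three example rows from the table above in plain Python (no Spark required). The `rows` list and the `summarize` helper are illustrative names and are not part of this Lab's code.

```python
from collections import defaultdict

# Example observations copied from the table above:
# (subject_id, variation_id, ordered, order_count, items_ordered, revenue)
rows = [
    ("visitor_1", "A", 0, 0, 0, 0.0),
    ("visitor_2", "B", 1, 2, 12, 65.21),
    ("visitor_3", "A", 1, 1, 1, 5.99),
]

def summarize(observations):
    """Aggregate per-subject observations into per-variation totals."""
    totals = defaultdict(lambda: {"subjects": 0, "converters": 0, "revenue": 0.0})
    for _subject, variation, ordered, _orders, _items, revenue in observations:
        stats = totals[variation]
        stats["subjects"] += 1
        stats["converters"] += ordered
        stats["revenue"] += revenue
    # Conversion rate = share of subjects who ordered at least once.
    return {
        variation: {**stats, "conversion_rate": stats["converters"] / stats["subjects"]}
        for variation, stats in totals.items()
    }

summary = summarize(rows)
print(summary["A"])
print(summary["B"])
```

Because each subject contributes exactly one row, per-variation metrics reduce to simple sums and ratios over the observation columns.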

This Lab is generated from a Jupyter Notebook.  Scroll to the bottom of this page for instructions on how to run it on your own machine.

## Analysis parameters

We'll use the following parameters to configure our computation.

### Local Data Storage

These parameters specify where this notebook should read and write data. The default location is `./example_data` in this notebook's directory.  You can point the notebook to another data directory by setting the `OPTIMIZELY_DATA_DIR` environment variable, e.g.

```sh
$ export OPTIMIZELY_DATA_DIR=~/optimizely_data
```

In [8]:
import os

# Local data storage locations
base_data_dir = os.environ.get("OPTIMIZELY_DATA_DIR", "./example_data")
subjects_data_dir = os.path.join(base_data_dir, "type=subjects")
events_data_dir = os.path.join(base_data_dir, "type=events")
observations_output_dir = os.path.join(base_data_dir, "type=observations")

### Subject ID column

This column is used to join our _experiment subjects_ dataset with our _conversion events_ dataset.  See the [experiment subjects](https://www.optimizely.com/labs/computing-experiment-subjects/) Lab for more details on this parameter.

In [2]:
# Subject ID column
subject_id = "visitor_id"

### Analysis Window

Your analysis window determines which conversion events are included in your analysis.

The default window in this notebook is large enough that *all* subjects and events will be included in the computation, but you can adjust this to focus on a specific time window if desired.

In [6]:
from datetime import datetime

# Analysis window
events_start = "2000-01-01 00:00:00"
events_end = "2099-12-31 23:59:59"

# The following are useful for converting timestamps from Optimizely results page URLs into
# an analysis window that can be used by the queries in this notebook.
# events_start = datetime.fromtimestamp(1592416545427 / 1000).strftime('%Y-%m-%d %H:%M:%S')
# events_end = datetime.fromtimestamp(1593108481887 / 1000).strftime('%Y-%m-%d %H:%M:%S')
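
One caveat about the commented-out conversion above: `datetime.fromtimestamp` without a `tz` argument uses the machine's local time zone, so the resulting strings differ between machines. The sketch below pins the conversion to UTC. It assumes the event timestamps in your data are stored in UTC, which you should verify for your own export; `ms_to_window_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def ms_to_window_timestamp(epoch_ms):
    """Convert epoch milliseconds to a 'YYYY-MM-DD HH:MM:SS' string in UTC."""
    moment = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")

# The two millisecond timestamps from the commented-out example above.
window_start = ms_to_window_timestamp(1592416545427)
window_end = ms_to_window_timestamp(1593108481887)
print(window_start, window_end)
```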

### Attribution Window

The _attribution window_ denotes the time period after a visitor is first exposed to an experiment during which events are recorded for analysis. If the attribution window is 24 hours, for example, only conversion events recorded during the first 24 hours after a visitor is first exposed to an experiment will be included in the analysis.

Note that Optimizely's [experiment results reports](https://help.optimizely.com/Analyze_Results/The_Experiment_Results_page_for_Optimizely_X) do not use a fixed attribution window for computing experiment results.  With this in mind, we set the default value to a large number (60 days), but this may be adjusted if desired.

In [4]:
# Attribution window

attribution_window_hours = 24 * 60
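
To make the window arithmetic concrete, here is a small plain-Python sketch of the attribution test for a single event. The helper name `is_attributed` and the example timestamps are illustrative; the inclusive bounds mirror the `BETWEEN` comparison this notebook uses when it computes experiment events.

```python
from datetime import datetime, timedelta

def is_attributed(exposure_time, event_time, window_hours):
    """Return True if event_time falls within window_hours after exposure_time (inclusive)."""
    window_end = exposure_time + timedelta(hours=window_hours)
    return exposure_time <= event_time <= window_end

first_exposure = datetime(2020, 5, 25, 15, 27, 33)
window_hours = 24 * 60  # 60 days, the default used in this notebook

# One week after exposure: inside the window.
inside = is_attributed(first_exposure, datetime(2020, 6, 1, 9, 0, 0), window_hours)
# The day before exposure: events before first exposure are never attributed.
before = is_attributed(first_exposure, datetime(2020, 5, 24, 9, 0, 0), window_hours)
# More than 60 days after exposure: outside the window.
after = is_attributed(first_exposure, datetime(2020, 8, 1, 9, 0, 0), window_hours)
print(inside, before, after)
```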

## Creating a Spark Session

In [14]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL") \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .config("spark.sql.repl.eagerEval.truncate", 120) \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .getOrCreate()

## Load subject data

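We'll start by loading the experiment subjects dataset from `subjects_data_dir` and registering it as a temporary view named `experiment_subjects`.  See the [experiment subjects](https://www.optimizely.com/labs/computing-experiment-subjects/) Lab for details on how this dataset is computed.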

In [20]:
spark.read.parquet(subjects_data_dir).createOrReplaceTempView("experiment_subjects")

spark.sql("SELECT * FROM experiment_subjects LIMIT 5")

visitor_id,variation_id,timestamp,experiment_id
visitor_1590445653085,18174970251,2020-05-25 15:27:33.085,18156943409
visitor_1590445653325,18112613000,2020-05-25 15:27:33.325,18156943409
visitor_1590445653565,18112613000,2020-05-25 15:27:33.565,18156943409
visitor_1590445653805,18174970251,2020-05-25 15:27:33.805,18156943409
visitor_1590445654045,18174970251,2020-05-25 15:27:34.045,18156943409


## Load conversion data

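Next we'll load the conversion events and keep only those whose timestamps fall within the analysis window defined by `events_start` and `events_end`.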

In [21]:
spark.read.parquet(events_data_dir).createOrReplaceTempView("loaded_events")

spark.sql(f"""
    CREATE OR REPLACE TEMPORARY VIEW events as
        SELECT 
            *
        FROM 
            loaded_events
        WHERE
            timestamp BETWEEN '{events_start}' AND '{events_end}'
""")

spark.sql("SELECT * FROM events LIMIT 1")

uuid,timestamp,process_timestamp,visitor_id,session_id,account_id,experiments,entity_id,attributes,user_ip,user_agent,referer,event_type,event_name,revenue,value,quantity,tags,revision,client_engine,client_version
235ABEC8-C9A1-4484-94AF-FB107524BFF8,2020-05-24 17:34:27.448,2020-05-24 17:41:59.059,visitor_1590366867448,-1274245065,596780373,"[[18128690585, 18142600572, 18130191769, false]]",15776040040,"[[100,, browserId, ff], [300,, device, ipad], [600,, source_type, campaign], [200,, campaign, frequent visitors], [, ...",174.222.139.0,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/53...",https://app.optimizely.com/,,add_to_cart,1000,1000.00001,,[],,ricky/fakedata.pwned,1.0.0


We can group by `event_name` to count the number of each event type loaded.

In [22]:
spark.sql("SELECT event_name, COUNT(1) as `count of events` FROM events GROUP BY event_name").toPandas()

Unnamed: 0,event_name,count of events
0,add_to_cart,11109


## Compute experiment events

Next we'll isolate the events that can be attributed to each experiment, which we'll call _experiment events_.  An experiment event is an event triggered during a finite window of time (called the _attribution window_) after a visitor has been exposed to an experiment treatment.

In [25]:
spark.sql(f"""
    CREATE OR REPLACE TEMPORARY VIEW experiment_events AS
        SELECT
            u.experiment_id,
            u.variation_id,
            e.*
        FROM
            experiment_subjects u INNER JOIN events e ON u.{subject_id} = e.{subject_id}
        WHERE
            e.timestamp BETWEEN u.timestamp AND (u.timestamp + INTERVAL {attribution_window_hours} HOURS)
""")

spark.sql("""
    SELECT 
        experiment_id,
        variation_id,
        event_name,
        COUNT(1) as `experiment_event_count` 
    FROM 
        experiment_events 
    GROUP BY
        experiment_id,
        variation_id,
        event_name
"""
)

experiment_id,variation_id,event_name,experiment_event_count
18156943409,18112613000,add_to_cart,2577
18156943409,18174970251,add_to_cart,2655


## Compute experiment observations

Next we'll compute a versatile dataset for experiment analysis: _experiment observations_.

In this dataset we map each experiment subject (visitor) to a set of numerical observations made in the course of the experiment.  These observations may then be aggregated to compute _metrics_ we can analyze to measure the impact of our experiment.

We'll start by creating a dictionary to store observations.  Observations are specified by a name and a query.  The query should operate on our `experiment_events` table, and should select two columns: the subject ID column (`visitor_id` by default) and a numeric column named `observation`.

In [9]:
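from IPython.display import display

metrics = {}

def add_metric(name, query):
    """Add a metric to the global metrics dictionary.
    
    Each metric is specified by a metric name and a query.  The query should
    return one row per subject, with the subject ID column and an
    `observation` column."""
    query_frame = spark.sql(query)
    metrics[name] = query_frame
    # Preview the first few observations
    display(query_frame.toPandas().head(5))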

Now we'll define a set of metrics by executing simple queries on our experiment events.  Each query computes a single _observation_ for each subject.

### Metric: `add_to_cart` unique conversions

In this query we compute the number of unique conversions on a particular event. The resulting observation should be `1` if the visitor triggered the event in question during the _attribution window_ and `0` otherwise.  

Since _any_ visitor who triggered an appropriate experiment event should be counted, we can simply select a `1`. 

In [10]:
## Unique conversions on the "add_to_cart" event.
add_metric('add_to_cart_Unique_Conversions',
    f"""
        SELECT
            {subject_id},
            1 as observation
        FROM
            experiment_events
        WHERE
            event_name = 'add_to_cart'
        GROUP BY
            {subject_id}
    """
)



Unnamed: 0,visitor_id,observation
0,visitor_1590445778365,1
1,visitor_1590445862125,1
2,visitor_1590445964845,1
3,visitor_1590445992445,1
4,visitor_1590446016445,1


### Metric: `add_to_cart` conversion counts

In this query we compute the total number of conversions on a particular event: the number of times each visitor triggered the event during the _attribution window_.

In [11]:
## Total conversions on the "add_to_cart" event.
add_metric('add_to_cart_Total_Conversions',
    f"""
        SELECT
            {subject_id},
            count(1) as observation
        FROM
            experiment_events
        WHERE
            event_name = 'add_to_cart'
        GROUP BY
            {subject_id}
        ORDER BY
            observation DESC
    """
)

Unnamed: 0,visitor_id,observation
0,visitor_1590445778365,1
1,visitor_1590445862125,1
2,visitor_1590445964845,1
3,visitor_1590445992445,1
4,visitor_1590446016445,1


### Metric: `add_to_cart` revenue

In this query we sum the revenue of the attributed `add_to_cart` events for each visitor.

In [12]:
## Total revenue from the "add_to_cart" event.
add_metric('add_to_cart_Revenue',
    f"""
        SELECT
            {subject_id},
            sum(revenue) as observation
        FROM
            experiment_events
        WHERE
            event_name = 'add_to_cart'
        GROUP BY
            {subject_id}
    """
)

Unnamed: 0,visitor_id,observation
0,visitor_1590445778365,0
1,visitor_1590445862125,0
2,visitor_1590445964845,0
3,visitor_1590445992445,0
4,visitor_1590446016445,0
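
The three metric queries above differ only in their aggregate expression. The plain-Python sketch below reproduces that logic on a handful of hypothetical events, which may help when designing new metrics. The `toy_events` data and the `observe` helper are illustrative and not part of this Lab.

```python
from collections import defaultdict

# Hypothetical attributed events: (visitor_id, event_name, revenue)
toy_events = [
    ("visitor_a", "add_to_cart", 1000),
    ("visitor_a", "add_to_cart", 500),
    ("visitor_b", "add_to_cart", 250),
    ("visitor_b", "page_view", 0),
]

def observe(events, event_name):
    """Compute unique conversions, total conversions, and revenue per visitor."""
    observations = defaultdict(lambda: {"unique": 0, "total": 0, "revenue": 0})
    for visitor_id, name, revenue in events:
        if name != event_name:
            continue  # plays the role of the WHERE event_name = ... filter
        row = observations[visitor_id]
        row["unique"] = 1          # SELECT 1 ... GROUP BY visitor
        row["total"] += 1          # COUNT(1)
        row["revenue"] += revenue  # SUM(revenue)
    return dict(observations)

result = observe(toy_events, "add_to_cart")
print(result)
```

As in the SQL queries, visitors with no matching events do not appear in the result at all.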


## Compute Experiment subject observations

Finally, we join our metric tables with our `experiment_subjects` table in order to produce a unified `observations` table.  The `observations` table maps every subject in our experiment to the set of observations made about that subject.

We assume that the observation value for any subject that does not appear in a given metric table should be 0.

In [13]:
# Reset the subject observations view
spark.sql(f"""
    CREATE OR REPLACE TEMPORARY VIEW observations AS
    SELECT * FROM experiment_subjects
""")

for metric_name, metric_df in metrics.items():
    metric_df.createOrReplaceTempView("metric")
    spark.sql(f"""
        CREATE OR REPLACE TEMPORARY VIEW observations AS
            SELECT
                o.*,
                COALESCE(m.observation, 0) as `{metric_name}`
            FROM
                observations o LEFT JOIN metric m on o.{subject_id} = m.{subject_id}
    """)

spark.sql("SELECT * FROM observations LIMIT 5").toPandas()

Unnamed: 0,visitor_id,variation_id,timestamp,add_to_cart_Unique_Conversions,add_to_cart_Total_Conversions,add_to_cart_Revenue
0,visitor_1590445669165,18112613000,2020-05-25 15:27:49.165,0,0,0
1,visitor_1590445676365,18174970251,2020-05-25 15:27:56.365,0,0,0
2,visitor_1590445778365,18174970251,2020-05-25 15:29:38.365,1,1,0
3,visitor_1590445862125,18112613000,2020-05-25 15:31:02.125,1,1,0
4,visitor_1590445923325,18112613000,2020-05-25 15:32:03.325,0,0,0
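
The zero-filling behavior comes from combining `LEFT JOIN` with `COALESCE`. The sketch below demonstrates the same pattern on toy data, using Python's built-in `sqlite3` module in place of Spark so that it runs anywhere. The table contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subjects (visitor_id TEXT, variation_id TEXT);
    CREATE TABLE metric (visitor_id TEXT, observation INTEGER);
    INSERT INTO subjects VALUES ('visitor_a', 'A'), ('visitor_b', 'B'), ('visitor_c', 'A');
    INSERT INTO metric VALUES ('visitor_a', 2);
""")

# LEFT JOIN keeps every subject; COALESCE turns the missing (NULL) observations into 0.
joined = conn.execute("""
    SELECT s.visitor_id, s.variation_id, COALESCE(m.observation, 0) AS observation
    FROM subjects s LEFT JOIN metric m ON s.visitor_id = m.visitor_id
    ORDER BY s.visitor_id
""").fetchall()
conn.close()

print(joined)
```

An `INNER JOIN` here would silently drop `visitor_b` and `visitor_c`, which would bias any per-variation average computed from the result.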


## Write observation data

Now we'll write our observation data to disk in [Apache Parquet](https://parquet.apache.org/) format.

In [14]:
spark.sql("""SELECT * FROM observations""").write.mode('overwrite').parquet(observations_output_dir)