# Computing Experiment Datasets #2: Computing Experiment Units and Events

This notebook is part of a multi-part series focused on computing useful experiment datasets. In this notebook, we'll use [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) to compute two useful experiment datasets: _experiment units_ and _experiment events_.

**Experiment units** are the individual units that are exposed to a control or treatment in the course of an online experiment.  In most online experiments, subjects are website visitors or app users. However, depending on your experiment design, treatments may also be applied to individual user sessions, service requests, search queries, etc. 

An **experiment event** is an event, such as a button click or a purchase, that was influenced by an experiment.  We compute this view by isolating the conversion events triggered during a finite window of time (called the _attribution window_) after a visitor has been exposed to an experiment treatment.

The transformations executed in this notebook are part of a larger [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph) that can be used to perform sequential hypothesis testing with event-level experiment data:

![Experiment Analysis DAG](img/transformations.png)

This notebook is _experiment agnostic_ and can be used to compute experiment units and events for arbitrary _decision_ and _conversion_ input data.

## Global parameters

The following global parameters are used to control the execution in this notebook.  These parameters may be overridden by setting environment variables prior to launching the notebook, e.g.:

```
export OPTIMIZELY_DATA_DIR=~/my_analysis_dir
```

In [41]:
import os

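# Determines whether writing output data back to disk should be skipped.
# Defaults to False; setting this to True may be useful when running this notebook
# as part of a larger workflow. Environment variables are always strings, so only
# "1", "true", or "yes" (case-insensitive) are treated as True.
SKIP_WRITING_OUTPUT_DATA_TO_DISK = os.environ.get(
    "SKIP_WRITING_OUTPUT_DATA_TO_DISK", "false"
).strip().lower() in ("1", "true", "yes")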

# Default path for reading and writing analysis data
OPTIMIZELY_DATA_DIR = os.environ.get("OPTIMIZELY_DATA_DIR", "./covid_test_data")

## Load Decision and Conversion Data into Spark Dataframes

We'll use [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) to transform data in this notebook. We'll start by creating a new local Spark session.

In [55]:
from pyspark.sql import SparkSession

num_cores = 1
driver_ip = "127.0.0.1"
driver_memory_gb = 1
executor_memory_gb = 2

# Create a local Spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL") \
    .master(f"local[{num_cores}]") \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .config("spark.sql.repl.eagerEval.truncate", 120) \
    .config("spark.driver.bindAddress", driver_ip) \
    .config("spark.driver.host", driver_ip) \
    .config("spark.driver.memory", f"{driver_memory_gb}g") \
    .config("spark.executor.memory", f"{executor_memory_gb}g") \
    .getOrCreate()

Next we'll load our decision data into a Spark dataframe:

In [56]:
import os
from lib import util

decisions_dir = os.path.join(OPTIMIZELY_DATA_DIR, "type=enriched_decisions")

# load enriched decision data from disk into a new Spark dataframe
decisions = util.read_parquet_data_from_disk(
    spark_session=spark,
    data_path=decisions_dir,
    view_name="enriched_decisions"
)
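
The `util.read_parquet_data_from_disk` helper comes from this repository's `lib` package and its source is not shown here. As a rough sketch only, assuming the helper does little more than read Parquet files and register a temporary view, it might look like the function below. The name `read_parquet_as_view` is hypothetical; `DataFrameReader.parquet` and `DataFrame.createOrReplaceTempView` are standard PySpark calls.

```python
def read_parquet_as_view(spark_session, data_path, view_name):
    """Hypothetical stand-in for util.read_parquet_data_from_disk (an assumption)."""
    # Load every Parquet file found under data_path into one dataframe
    df = spark_session.read.parquet(data_path)
    # Register the dataframe under view_name so spark.sql() queries can refer to it
    df.createOrReplaceTempView(view_name)
    return df
```

Registering the view is the step that makes a name such as `enriched_decisions` usable inside `spark.sql(...)`.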

Now we can write SQL-style queries against our `enriched_decisions` view.  Let's use a simple query to examine our data:

In [57]:
spark.sql("""
    SELECT
        *
    FROM
        enriched_decisions
    LIMIT 3
""")

experiment_name,variation_name,reference_variation_id,uuid,timestamp,process_timestamp,visitor_id,session_id,account_id,campaign_id,variation_id,attributes,user_ip,user_agent,referer,is_holdback,revision,client_engine,client_version,date,experiment,experiment_id
covid_messaging_experiment,control,18802093142,3e1d31a3-41c2-4f22-9b77-6a67f44a269a,2020-09-14 11:21:40.177,2020-09-14 11:22:05.571,user_0,-535397001,596780373,18811053836,18802093142,"[[$opt_bot_filtering, $opt_bot_filtering, custom, false], [$opt_enrich_decisions, $opt_enrich_decisions, custom, true...",162.227.140.251,python-requests/2.24.0,,False,99,python-sdk,3.5.2,2020-09-14,18786493712,18786493712
covid_messaging_experiment,control,18802093142,e37700df-856c-46c9-9c6d-b771ff244c2c,2020-09-14 11:21:40.279,2020-09-14 11:22:12.287,user_1,-1393366513,596780373,18811053836,18802093142,"[[$opt_bot_filtering, $opt_bot_filtering, custom, false], [$opt_enrich_decisions, $opt_enrich_decisions, custom, true...",162.227.140.251,python-requests/2.24.0,,False,99,python-sdk,3.5.2,2020-09-14,18786493712,18786493712
covid_messaging_experiment,control,18802093142,aa14c408-19a3-410d-8536-7cf120777f00,2020-09-14 11:21:41.199,2020-09-14 11:22:13.222,user_10,582981235,596780373,18811053836,18802093142,"[[$opt_bot_filtering, $opt_bot_filtering, custom, false], [$opt_enrich_decisions, $opt_enrich_decisions, custom, true...",162.227.140.251,python-requests/2.24.0,,False,99,python-sdk,3.5.2,2020-09-14,18786493712,18786493712


Next we'll load conversion data:

In [58]:
# oevents downloads conversion data into the type=events subdirectory
conversions_dir = os.path.join(OPTIMIZELY_DATA_DIR, "type=events")

# load conversion data from disk into a new Spark dataframe
conversions = util.read_parquet_data_from_disk(
    spark_session=spark,
    data_path=conversions_dir,
    view_name="events"
)

Let's take a look at our data:

In [59]:
spark.sql("""
    SELECT
        *
    FROM
        events
    LIMIT 3
""")

uuid,timestamp,process_timestamp,visitor_id,session_id,account_id,experiments,entity_id,attributes,user_ip,user_agent,referer,event_type,event_name,revenue,value,quantity,tags,revision,client_engine,client_version,date,event
9f91d071-535a-41a5-83f9-eaea10b097dc,2020-09-14 10:32:22.544,2020-09-14 10:33:05.618,user_0,2107132638,596780373,"[[18803622799, 18805683213, 18821642160, false]]",18822540003,"[[$opt_bot_filtering, $opt_bot_filtering, custom, false], [$opt_enrich_decisions, $opt_enrich_decisions, custom, true...",162.227.140.251,python-requests/2.24.0,,,homepage_view,0,,0,[],91,python-sdk,3.5.2,2020-09-14,homepage_view
ceaf941d-8267-4942-a952-0e7885726fe4,2020-09-14 11:21:40.178,2020-09-14 11:22:05.571,user_0,-535397001,596780373,"[[18803622799, 18805683213, 18821642160, false], [18811053836, 18786493712, 18802093142, false]]",18822540003,"[[$opt_bot_filtering, $opt_bot_filtering, custom, false], [$opt_enrich_decisions, $opt_enrich_decisions, custom, true...",162.227.140.251,python-requests/2.24.0,,,homepage_view,0,,0,[],99,python-sdk,3.5.2,2020-09-14,homepage_view
97e8b812-4608-4329-8d13-52bf18aa7058,2020-09-14 10:32:22.648,2020-09-14 10:33:12.359,user_1,-1316211988,596780373,"[[18803622799, 18805683213, 18809464474, false]]",18822540003,"[[$opt_bot_filtering, $opt_bot_filtering, custom, false], [$opt_enrich_decisions, $opt_enrich_decisions, custom, true...",162.227.140.251,python-requests/2.24.0,,,homepage_view,0,,0,[],91,python-sdk,3.5.2,2020-09-14,homepage_view


## Compute Experiment Units

**Experiment units** are the individual units that are exposed to a control or treatment in the course of an online experiment.  In most online experiments, subjects are website visitors or app users. However, depending on your experiment design, treatments may also be applied to individual user sessions, service requests, search queries, etc. 

<table>
    <tr>
        <td>
            <img src="img/transformations_1.png" alt="Experiment Units" style="width:100%; padding-left:0px">
        </td>
        <td>
            <img src="img/tables_1.png" alt="Experiment Units" style="width:100%; padding-left:0px">
        </td>
    </tr>
</table>

**Note:** The attribution logic captured in the following query is identical to the logic used to count visitors on an [Optimizely Full Stack](https://www.optimizely.com/platform/full-stack/) Experiment Results Page.  It differs slightly from the logic used on an [Optimizely Web](https://www.optimizely.com/platform/experimentation/) Experiment Results Page.  In Optimizely Web, conversion events sent from the browser contain some historical information about which experiments and variations a visitor has seen.  That information is used to count visitors in an experiment even if an explicit decision event is not sent to Optimizely. This ensures that even in the rare case where a decision event is "dropped" (which may occur when a browser window is closed or there are connectivity issues) a visitor will still be counted.

In [30]:
experiment_units = spark.sql(f"""
    SELECT
        *
    FROM (
        SELECT
            *,
            RANK() OVER (PARTITION BY experiment_id, visitor_id ORDER BY timestamp ASC) AS rnk
        FROM
            enriched_decisions
    )
    WHERE
        rnk = 1
    ORDER BY timestamp ASC
""").drop("rnk")
experiment_units.createOrReplaceTempView("experiment_units")
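
The query above keeps each visitor's earliest decision per experiment. The plain-Python sketch below applies the same rule to a few in-memory rows. The function name `first_decisions` and the second `user_0` row are made up for illustration; the other values come from the sample output shown earlier.

```python
from datetime import datetime

def first_decisions(decisions):
    """Return the earliest decision per (experiment_id, visitor_id), oldest first."""
    earliest = {}
    for row in decisions:
        key = (row["experiment_id"], row["visitor_id"])
        # Strict "<" keeps exactly one row per key, even when timestamps tie
        if key not in earliest or row["timestamp"] < earliest[key]["timestamp"]:
            earliest[key] = row
    return sorted(earliest.values(), key=lambda r: r["timestamp"])

decisions = [
    {"experiment_id": "18786493712", "visitor_id": "user_0",
     "variation_name": "control", "timestamp": datetime(2020, 9, 14, 11, 21, 40, 177000)},
    # Hypothetical repeat decision for user_0; it should be dropped
    {"experiment_id": "18786493712", "visitor_id": "user_0",
     "variation_name": "control", "timestamp": datetime(2020, 9, 14, 12, 0, 0)},
    {"experiment_id": "18786493712", "visitor_id": "user_1",
     "variation_name": "control", "timestamp": datetime(2020, 9, 14, 11, 21, 40, 279000)},
]

units = first_decisions(decisions)
```

One difference is worth noting: `RANK()` assigns rank 1 to every row tied for the earliest timestamp, so a visitor with two decisions at an identical timestamp would appear twice in `experiment_units`. The sketch keeps a single row per key, which is how `ROW_NUMBER()` would behave in SQL.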

Let's examine our experiment unit dataset:

In [31]:
spark.sql("""
    SELECT
        visitor_id,
        experiment_name,
        variation_name,
        timestamp
    FROM
        experiment_units
    LIMIT 3
""")

visitor_id,experiment_name,variation_name,timestamp
user_0,covid_messaging_experiment,control,2020-09-14 11:21:40.177
user_1,covid_messaging_experiment,control,2020-09-14 11:21:40.279
user_2,covid_messaging_experiment,control,2020-09-14 11:21:40.381


Let's count the number of visitors in each experiment variation:

In [32]:
spark.sql("""
    SELECT 
        experiment_name,
        variation_name,
        count(*) as unit_count
    FROM 
        experiment_units
    GROUP BY 
        experiment_name,
        variation_name
    ORDER BY
        experiment_name ASC,
        variation_name ASC
""")

experiment_name,variation_name,unit_count
covid_messaging_experiment,control,3304
covid_messaging_experiment,message_1,3367
covid_messaging_experiment,message_2,3329


## Compute Experiment Events

An **experiment event** is an event, such as a button click or a purchase, that was influenced by an experiment.  We compute this view by isolating the conversion events triggered during a finite window of time (called the _attribution window_) after a visitor has been exposed to an experiment treatment.

<table>
    <tr>
        <td>
            <img src="img/transformations_2.png" alt="Experiment Events" style="width:100%; padding-left:0px">
        </td>
        <td>
            <img src="img/tables_2.png" alt="Experiment Events" style="width:100%; padding-left:0px">
        </td>
    </tr>
</table>

**Note:** The attribution logic captured in the following query is similar to, but does not exactly match, the logic used by Optimizely's experiment results page.  On Optimizely's results page, all events triggered between the moment that a visitor is added to an experiment and the end of the specified results time range are attributed to the variation to which that visitor was exposed.  For more information, see the ["How Optimizely Counts Conversions" Knowledge Base article](https://help.optimizely.com/Analyze_Results/How_Optimizely_counts_conversions).

In [33]:
# Experiment events are the conversion events that we believe were influenced by an experiment
# The "attribution window" specifies the length of time that an experiment's "influence" lasts
# after a user is exposed to a treatment for the first time.
ATTRIBUTION_WINDOW_HOURS = 24 * 60  # 60 days, expressed in hours

# Create the experiment_events view
experiment_events = spark.sql(f"""
    SELECT
        u.experiment_id,
        u.experiment_name,
        u.variation_id,
        u.variation_name,
        e.*
    FROM
        experiment_units u INNER JOIN events e ON u.visitor_id = e.visitor_id
    WHERE
        e.timestamp BETWEEN u.timestamp AND (u.timestamp + INTERVAL {ATTRIBUTION_WINDOW_HOURS} HOURS)
""")
experiment_events.createOrReplaceTempView("experiment_events")
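
To make the attribution window concrete, here is a small plain-Python sketch of the same test the `WHERE` clause performs. The function name `in_attribution_window` is hypothetical; the timestamps are `user_0`'s decision and two events from the sample rows shown earlier.

```python
from datetime import datetime, timedelta

ATTRIBUTION_WINDOW_HOURS = 24 * 60  # same value as the cell above

def in_attribution_window(first_exposure, event_time,
                          window_hours=ATTRIBUTION_WINDOW_HOURS):
    """Mirror the SQL BETWEEN clause: both ends of the window are inclusive."""
    window_end = first_exposure + timedelta(hours=window_hours)
    return first_exposure <= event_time <= window_end

# user_0's first decision and two of their homepage_view events
first_exposure = datetime(2020, 9, 14, 11, 21, 40, 177000)
event_before = datetime(2020, 9, 14, 10, 32, 22, 544000)
event_after = datetime(2020, 9, 14, 11, 21, 40, 178000)

attributed = [t for t in (event_before, event_after)
              if in_attribution_window(first_exposure, t)]
```

Only `event_after` ends up in `attributed`: the earlier event happened before `user_0` was exposed to the experiment, so it is not treated as influenced by it.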

Let's examine our Experiment Events dataset:

In [34]:
spark.sql("""
    SELECT
        timestamp,
        visitor_id,
        experiment_name,
        variation_name,
        event_name,
        tags,
        revenue
    FROM
        experiment_events
    LIMIT 10
""")

timestamp,visitor_id,experiment_name,variation_name,event_name,tags,revenue
2020-09-14 11:21:40.178,user_0,covid_messaging_experiment,control,homepage_view,[],0
2020-09-14 11:21:40.279,user_1,covid_messaging_experiment,control,homepage_view,[],0
2020-09-14 11:21:41.199,user_10,covid_messaging_experiment,control,homepage_view,[],0
2020-09-14 11:21:50.365,user_100,covid_messaging_experiment,message_2,homepage_view,[],0
2020-09-14 11:23:21.951,user_1000,covid_messaging_experiment,control,homepage_view,[],0
2020-09-14 11:23:22.053,user_1001,covid_messaging_experiment,message_1,homepage_view,[],0
2020-09-14 11:23:22.154,user_1002,covid_messaging_experiment,message_1,homepage_view,[],0
2020-09-14 11:23:22.255,user_1003,covid_messaging_experiment,control,homepage_view,[],0
2020-09-14 11:23:22.357,user_1004,covid_messaging_experiment,message_2,homepage_view,[],0
2020-09-14 11:23:22.458,user_1005,covid_messaging_experiment,control,homepage_view,[],0


As above, let's count the number of events that were influenced by each variation:

In [35]:
spark.sql(f"""
    SELECT
        experiment_name,
        variation_name,
        event_name,
        count(*) as event_count
    FROM
        experiment_events
    GROUP BY
        experiment_name,
        variation_name,
        event_name
    ORDER BY
        experiment_name ASC,
        variation_name ASC,
        event_name ASC
""")

experiment_name,variation_name,event_name,event_count
covid_messaging_experiment,control,add_to_cart,326
covid_messaging_experiment,control,detail_page_view,1799
covid_messaging_experiment,control,homepage_view,3304
covid_messaging_experiment,control,purchase,326
covid_messaging_experiment,message_1,add_to_cart,338
covid_messaging_experiment,message_1,detail_page_view,2414
covid_messaging_experiment,message_1,homepage_view,3367
covid_messaging_experiment,message_1,purchase,338
covid_messaging_experiment,message_2,add_to_cart,446
covid_messaging_experiment,message_2,detail_page_view,2900


## Writing our datasets to disk

We'll write our experiment units and events datasets to disk so that they may be used for other analysis tasks.  Experiment unit data is partitioned by `experiment_id` and experiment event data is partitioned by `event_name`.

In [38]:
if not SKIP_WRITING_OUTPUT_DATA_TO_DISK: 
    
    experiment_units_output_dir = os.path.join(OPTIMIZELY_DATA_DIR, "type=experiment_units")
    
    experiment_units \
        .coalesce(1) \
        .write.mode('overwrite') \
        .partitionBy("experiment_id") \
        .parquet(experiment_units_output_dir)
    
    experiment_events_output_dir = os.path.join(OPTIMIZELY_DATA_DIR, "type=experiment_events")
    
    experiment_events \
        .coalesce(1) \
        .write.mode('overwrite') \
        .partitionBy("event_name") \
        .parquet(experiment_events_output_dir)
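
Spark's `partitionBy` writes one `column=value` subdirectory per distinct value of the partition column. The sketch below builds the directory paths we would expect to find after the cell above runs. The helper name `partition_dir` is hypothetical, the experiment ID is taken from the sample rows, and the data directory is the notebook's default.

```python
import os

def partition_dir(base_dir, dataset_type, column, value):
    """Build the Hive-style partition directory: <base>/type=<type>/<column>=<value>."""
    return os.path.join(base_dir, f"type={dataset_type}", f"{column}={value}")

data_dir = "./covid_test_data"  # default value of OPTIMIZELY_DATA_DIR

units_dir = partition_dir(data_dir, "experiment_units", "experiment_id", "18786493712")
events_dir = partition_dir(data_dir, "experiment_events", "event_name", "purchase")

print(units_dir)
print(events_dir)
```

Because the partition column is encoded in the directory name, a later analysis can read a single experiment's units or a single event type without scanning the rest of the dataset.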
    
    