# Computing Experiment Datasets #1: Enriching Optimizely Decision data with experiment metadata

This Lab is part of a multi-part series focused on computing useful experiment datasets. In this Lab, we'll use [PySpark](https://spark.apache.org/docs/latest/api/python/index.html), [Bravado](https://github.com/Yelp/bravado), and Optimizely's [Experiments API](https://library.optimizely.com/docs/api/app/v2/index.html#tag/Experiments) to enrich Optimizely ["decision"](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-data-specification#decisions-2) data with human-readable experiment and variation names.

Why is this useful?  Exported Optimizely [decision](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-data-specification#decisions-2) data contains a record of every "decision" made by Optimizely clients during your experiment.  Each "decision" event records the moment that a visitor is added to a particular variation, and includes unique identifiers for the experiment and variation in question.  For example: 

| visitor_id | experiment_id | variation_id | timestamp                |
|------------|---------------|--------------|--------------------------|
| visitor_1  | 12345         | 678          | July 20th, 2020 14:25:00 |
| visitor_2  | 12345         | 789          | July 20th, 2020 14:28:13 |
| visitor_3  | 12345         | 678          | July 20th, 2020 14:31:01 |

In order to work productively with this data, it may be useful to enrich it with human-readable names for your experiments and variations, yielding, for example:

| visitor_id | experiment_id | variation_id | timestamp                | experiment_name          | variation_name |
|------------|---------------|--------------|--------------------------|--------------------------|----------------|
| visitor_1  | 12345         | 678          | July 20th, 2020 14:25:00 | free_shipping_experiment | control        |
| visitor_2  | 12345         | 789          | July 20th, 2020 14:28:13 | free_shipping_experiment | treatment      |
| visitor_3  | 12345         | 678          | July 20th, 2020 14:31:01 | free_shipping_experiment | control        |
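
Conceptually, the enrichment is a lookup keyed on the experiment and variation identifiers. Here is a minimal pure-Python sketch using the illustrative rows from the tables above. The `names` dictionary and the `enrich` helper are hypothetical and exist only for this illustration; the notebook itself performs the join in Spark SQL.

```python
# Illustrative only: the IDs and names mirror the example tables above.
decisions = [
    {"visitor_id": "visitor_1", "experiment_id": "12345", "variation_id": "678"},
    {"visitor_id": "visitor_2", "experiment_id": "12345", "variation_id": "789"},
    {"visitor_id": "visitor_3", "experiment_id": "12345", "variation_id": "678"},
]

# Hypothetical lookup table: (experiment_id, variation_id) -> (experiment_name, variation_name)
names = {
    ("12345", "678"): ("free_shipping_experiment", "control"),
    ("12345", "789"): ("free_shipping_experiment", "treatment"),
}


def enrich(decision, names):
    """Return a copy of `decision` with experiment and variation names added."""
    key = (decision["experiment_id"], decision["variation_id"])
    # Unknown IDs are kept, with None in place of the missing names
    experiment_name, variation_name = names.get(key, (None, None))
    return {**decision, "experiment_name": experiment_name, "variation_name": variation_name}


enriched = [enrich(d, names) for d in decisions]
for row in enriched:
    print(row["visitor_id"], row["experiment_name"], row["variation_name"])
```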

This Lab is generated from a Jupyter Notebook.  Scroll to the bottom of this page for instructions on how to run it on your own machine.

## Global parameters

The following global parameters are used to control the execution in this notebook.  These parameters may be overridden by setting environment variables prior to launching the notebook, e.g.:

```sh
export OPTIMIZELY_DATA_DIR=~/my_analysis_dir
```

In [1]:
import os
from getpass import getpass

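# Determines whether writing output data back to disk should be skipped.
# Defaults to False; setting this to True may be useful when running this notebook
# as part of a larger workflow. Environment variables are strings, so the value is
# parsed explicitly (the string "False" would otherwise be truthy).
SKIP_WRITING_OUTPUT_DATA_TO_DISK = (
    os.environ.get("SKIP_WRITING_OUTPUT_DATA_TO_DISK", "false").strip().lower()
    in ("1", "true", "yes")
)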

# This notebook requires an Optimizely API token.  It is read from the environment;
# if OPTIMIZELY_API_TOKEN is not set, you will be prompted to enter it.
OPTIMIZELY_API_TOKEN = os.environ.get("OPTIMIZELY_API_TOKEN") or getpass("Optimizely API token: ")

# Default path for reading and writing analysis data
OPTIMIZELY_DATA_DIR = os.environ.get("OPTIMIZELY_DATA_DIR", "./covid_test_data")
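
Environment variables are always read as strings, so boolean flags such as `SKIP_WRITING_OUTPUT_DATA_TO_DISK` need explicit parsing (the string `"False"` is truthy in Python). Below is a minimal sketch of a reusable parser. The `env_flag` helper and the spellings it accepts are assumptions, not part of the notebook.

```python
import os


def env_flag(name, default=False, environ=os.environ):
    """Interpret an environment variable as a boolean flag.

    Hypothetical helper: the set of accepted "true" spellings is an assumption.
    """
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# A stand-in environment is passed in so the example does not depend on your shell
print(env_flag("SKIP_WRITING_OUTPUT_DATA_TO_DISK",
               environ={"SKIP_WRITING_OUTPUT_DATA_TO_DISK": "False"}))
```

Passing the environment as an argument keeps the helper easy to test without modifying `os.environ`.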

## Create an Optimizely REST API client

First, we'll create an API client using the excellent [Bravado](https://github.com/Yelp/bravado) library.

**Note:** In order to execute this step, you'll need an Optimizely [Personal Access Token](https://docs.developers.optimizely.com/web/docs/personal-token).  You can supply this token to the notebook via the `OPTIMIZELY_API_TOKEN` environment variable.  If `OPTIMIZELY_API_TOKEN` has not been set, you will be prompted to enter an access token manually.
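
The prompt-if-missing behaviour described in this note can be sketched as follows. The `resolve_token` helper is hypothetical; it reads `OPTIMIZELY_API_TOKEN` from the environment and falls back to `getpass.getpass`, which hides the token as it is typed.

```python
import os
from getpass import getpass


def resolve_token(environ=os.environ, prompt=getpass):
    """Return the API token from the environment, prompting only if it is unset or empty."""
    token = environ.get("OPTIMIZELY_API_TOKEN")
    if not token:
        token = prompt("Enter your Optimizely personal access token: ")
    return token


# Demonstration with a stand-in environment, so nothing is read from the keyboard
print(resolve_token(environ={"OPTIMIZELY_API_TOKEN": "example-token"}))
```

Keeping the token out of the notebook source avoids accidentally committing credentials along with the analysis.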

In [2]:
from bravado.requests_client import RequestsClient
from bravado.client import SwaggerClient

# Create a custom requests client for authentication
requests_client = RequestsClient()
requests_client.set_api_key(
    "api.optimizely.com",
    f"Bearer {OPTIMIZELY_API_TOKEN}",
    param_name="Authorization",
    param_in="header",
)

# Create an API client using Optimizely's swagger/OpenAPI specification
api_client = SwaggerClient.from_url(
    "https://api.optimizely.com/v2/swagger.json",
    http_client=requests_client,
    config={
        "validate_swagger_spec": False,  # validation produces several warnings
    }
)

Now we'll test that this client can successfully authenticate to Optimizely.

In [3]:
import bravado.exception

try:
    api_client.Projects.list_projects().response().result
    print("Successfully authenticated to Optimizely.")
except bravado.exception.HTTPUnauthorized as e:
    print(f"Failed to authenticate to Optimizely: {e}")

Successfully authenticated to Optimizely.


## Create a Spark Session

In [4]:
from pyspark.sql import SparkSession

num_cores = 1
driver_ip = "127.0.0.1"
driver_memory_gb = 1
executor_memory_gb = 2

# Create a local Spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL") \
    .master(f"local[{num_cores}]") \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .config("spark.sql.repl.eagerEval.truncate", 120) \
    .config("spark.driver.bindAddress", driver_ip) \
    .config("spark.driver.host", driver_ip) \
    .config("spark.driver.memory", f"{driver_memory_gb}g") \
    .config("spark.executor.memory", f"{executor_memory_gb}g") \
    .getOrCreate()

## Load decision data

We'll start by loading decision data from local storage so that we can query it with Spark SQL.

### Local Data Storage

These parameters specify where this notebook should read and write data. The default location is `./covid_test_data` in this notebook's directory. You can point the notebook to another data directory by setting the `OPTIMIZELY_DATA_DIR` environment variable prior to starting Jupyter Lab, e.g.:

```sh
export OPTIMIZELY_DATA_DIR=~/optimizely_data
```

In [5]:
# Local data storage locations
decisions_data_dir = os.path.join(OPTIMIZELY_DATA_DIR, "type=decisions")
enriched_decisions_output_dir = os.path.join(OPTIMIZELY_DATA_DIR, "type=enriched_decisions")

### Read decision data from disk

We'll create a `decisions` view with the loaded data.

In [6]:
from lib import util

util.read_parquet_data_from_disk(
    spark_session=spark,
    data_path=decisions_data_dir,
    view_name="decisions"
)

spark.sql("SELECT * FROM decisions LIMIT 1")

uuid,timestamp,process_timestamp,visitor_id,session_id,account_id,campaign_id,experiment_id,variation_id,attributes,user_ip,user_agent,referer,is_holdback,revision,client_engine,client_version,date,experiment
0244b48f-cd2c-45fe-86b5-accb0864aa9f,2020-09-14 11:38:10.022,2020-09-14 11:39:09.401,user_9763,-1361007105,596780373,18811053836,18786493712,18802093142,"[[$opt_bot_filtering, $opt_bot_filtering, custom, false], [$opt_enrich_decisions, $opt_enrich_decisions, custom, true...",162.227.140.251,python-requests/2.24.0,,False,99,python-sdk,3.5.2,2020-09-14,18786493712


## Enrich decision data

Next we'll query our decision data to list the distinct `experiment_id` values found in our dataset.  

In [7]:
from IPython.display import display, Markdown

experiment_ids = spark.sql("SELECT DISTINCT experiment_id FROM decisions").toPandas().experiment_id

print("Found these experiment IDs in the loaded decision data:")
for exp_id in experiment_ids:
    print(f"    {exp_id}")

Found these experiment IDs in the loaded decision data:
    18786493712


In [8]:
import pandas as pd
import warnings

# The Optimizely REST API spec causes Bravado to throw several warnings
warnings.filterwarnings("ignore")

def get_human_readable_name(obj):
    """Return a human-readable name from an Optimizely Experiment or Variation object.
    
    This function is handy because Optimizely Web and Full Stack experiments use different 
    attribute names to store human-readable names."""
    if hasattr(obj, "key") and obj.key is not None: # Optimizely Full Stack experiments
        return obj.key
    elif hasattr(obj, "name") and obj.name is not None: # Optimizely Web experiments
        return obj.name
    return None

def get_experiment_names(api_client, exp_id):
    """Retrieve human-readable names for an experiment and its variations from Optimizely's Experiments API."""
    experiment = api_client.Experiments.get_experiment(experiment_id=exp_id).response().result
    return pd.DataFrame([
        {
            'experiment_id' : str(experiment.id),
            'variation_id' : str(variation.variation_id),
            'experiment_name' : get_human_readable_name(experiment),
            'variation_name' : get_human_readable_name(variation),
            'reference_variation_id' : str(experiment.variations[0].variation_id)
        }
        for variation in experiment.variations
    ])

names = pd.concat([get_experiment_names(api_client, exp_id) for exp_id in experiment_ids])

spark.createDataFrame(names).createOrReplaceTempView("names")

spark.sql("SELECT * FROM names LIMIT 10")

experiment_id,variation_id,experiment_name,variation_name,reference_variation_id
18786493712,18802093142,covid_messaging_experiment,control,18802093142
18786493712,18818611832,covid_messaging_experiment,message_1,18802093142
18786493712,18817551468,covid_messaging_experiment,message_2,18802093142
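
To see how `get_human_readable_name` chooses between `key` and `name`, the fallback logic can be exercised offline. This sketch repeats an equivalent version of the function so that it runs on its own, and uses `types.SimpleNamespace` objects as stand-ins for real API responses. The sample values are made up.

```python
from types import SimpleNamespace


def get_human_readable_name(obj):
    """Prefer `key` (Full Stack), fall back to `name` (Web), otherwise return None."""
    if getattr(obj, "key", None) is not None:
        return obj.key
    if getattr(obj, "name", None) is not None:
        return obj.name
    return None


# Stand-ins for API objects; real responses carry many more fields
full_stack_experiment = SimpleNamespace(key="covid_messaging_experiment", name=None)
web_experiment = SimpleNamespace(name="Homepage banner test")
unnamed = SimpleNamespace(id=123)

print(get_human_readable_name(full_stack_experiment))
print(get_human_readable_name(web_experiment))
print(get_human_readable_name(unnamed))
```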


Finally, we'll join our decision data with this mapping in order to enrich it with human-readable names.

In [9]:
spark.sql(f"""
    CREATE OR REPLACE TEMPORARY VIEW enriched_decisions AS
        SELECT
            names.experiment_name,
            names.variation_name,
            names.reference_variation_id,
            decisions.*
        FROM
            decisions LEFT JOIN names on decisions.variation_id = names.variation_id
""")

spark.sql("SELECT * FROM enriched_decisions LIMIT 1")

experiment_name,variation_name,reference_variation_id,uuid,timestamp,process_timestamp,visitor_id,session_id,account_id,campaign_id,experiment_id,variation_id,attributes,user_ip,user_agent,referer,is_holdback,revision,client_engine,client_version,date,experiment
covid_messaging_experiment,control,18802093142,0244b48f-cd2c-45fe-86b5-accb0864aa9f,2020-09-14 11:38:10.022,2020-09-14 11:39:09.401,user_9763,-1361007105,596780373,18811053836,18786493712,18802093142,"[[$opt_bot_filtering, $opt_bot_filtering, custom, false], [$opt_enrich_decisions, $opt_enrich_decisions, custom, true...",162.227.140.251,python-requests/2.24.0,,False,99,python-sdk,3.5.2,2020-09-14,18786493712
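
Because the view is built with a `LEFT JOIN`, decisions whose IDs have no entry in `names` are kept, with `NULL` in the name columns. The sketch below illustrates that behaviour and one way to spot such rows. It uses Python's built-in `sqlite3` module as a stand-in for Spark SQL so that it runs anywhere; the tables are cut down to a few columns and the second variation ID is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE decisions (visitor_id TEXT, experiment_id TEXT, variation_id TEXT);
    CREATE TABLE names (experiment_id TEXT, variation_id TEXT, variation_name TEXT);
    INSERT INTO decisions VALUES
        ('user_1', '18786493712', '18802093142'),
        ('user_2', '18786493712', '99999999999');
    INSERT INTO names VALUES ('18786493712', '18802093142', 'control');
""")

# Same join shape as the notebook's query, here matching on both IDs
rows = conn.execute("""
    SELECT decisions.visitor_id, names.variation_name
    FROM decisions
    LEFT JOIN names
        ON decisions.experiment_id = names.experiment_id
        AND decisions.variation_id = names.variation_id
    ORDER BY decisions.visitor_id
""").fetchall()

# Rows that could not be enriched come back with None for the name
unmatched = [visitor for visitor, variation_name in rows if variation_name is None]
print(rows)
print(unmatched)
```

A non-empty `unmatched` list would suggest that the metadata lookup missed an experiment or variation, for example one that has since been archived or deleted.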


## Writing our enriched decisions dataset to disk

We'll store our enriched decision data in the directory specified by `enriched_decisions_output_dir`.  The output is partitioned into one subdirectory per experiment found in the input decision data.

In [10]:
if not SKIP_WRITING_OUTPUT_DATA_TO_DISK: 
    spark.sql("""SELECT * FROM enriched_decisions""") \
        .coalesce(1) \
        .write.mode('overwrite') \
        .partitionBy("experiment_id") \
        .parquet(enriched_decisions_output_dir)
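
Spark's `partitionBy` writes one `column=value` subdirectory per distinct partition value. Here is a small sketch of the resulting layout, using the default data directory and the experiment ID from the example output. The `partition_dir` helper is hypothetical: it only builds the path and does not touch Spark.

```python
import os


def partition_dir(output_dir, column, value):
    """Build the `column=value` subdirectory used for one partition."""
    return os.path.join(output_dir, f"{column}={value}")


output_dir = os.path.join("./covid_test_data", "type=enriched_decisions")
print(partition_dir(output_dir, "experiment_id", "18786493712"))
```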

## Conclusion

In this Lab we used [PySpark](https://spark.apache.org/docs/latest/api/python/index.html), [Bravado](https://github.com/Yelp/bravado), and Optimizely's [Experiments API](https://library.optimizely.com/docs/api/app/v2/index.html#tag/Experiments) to enrich Optimizely ["decision"](https://docs.developers.optimizely.com/optimizely-data/docs/enriched-events-data-specification#decisions-2) data with human-readable experiment and variation names. 

We've written our output dataset to disk so that it can be used in other analyses and in future installments of the Experiment Datasets Lab series.

## How to run this notebook

This notebook lives in the [Optimizely Labs](http://github.com/optimizely/labs) repository.  You can download it and everything you need to run it by doing one of the following:
- Downloading a zipped copy of this Lab directory on the [Optimizely Labs page](https://www.optimizely.com/labs/computing-experiment-subjects/)
- Downloading a [zipped copy of the Optimizely Labs repository](https://github.com/optimizely/labs/archive/master.zip) from GitHub
- Cloning the [GitHub repository](http://github.com/optimizely/labs)

Once you've downloaded this Lab directory (on its own, or as part of the [Optimizely Labs](http://github.com/optimizely/labs) repository), follow the instructions in the `README.md` file for this Lab.