Copyright 2022 Google LLC.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# Design geo experiment

This notebook analyses pre-existing KPI data to determine suitable parameters for running a geo experiment.
That is, it selects the Test and Control geos and determines the minimum detectable uplift, which in turn
helps to estimate the minimum experiment budget.

### Requirements

Historical daily KPI data at the geo level (at least the last 3-6 months; 12 months is ideal).
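
As a rough illustration of the expected layout, the sketch below builds a small synthetic dataset in wide format: one row per day, a `Date` column, and one KPI column per geo. The geo names, date range, and values are made up for illustration; only the `Date` column name is taken from the loading step in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical geos and date range, for illustration only
dates = pd.date_range('2018-01-01', periods=180, freq='D')
geos = ['Geo_1', 'Geo_2', 'Geo_3']

# Shared weekly seasonality plus geo-specific noise
weekly = 10 * np.sin(2 * np.pi * np.arange(len(dates)) / 7)
example = pd.DataFrame(
    {geo: 100 + weekly + rng.normal(0, 5, len(dates)) for geo in geos},
    index=dates)
example.index.name = 'Date'

# Layout of the CSV: a Date column followed by one column per geo
example_csv = example.reset_index()
print(example_csv.head())
```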

### Install and import required modules

In [None]:
import pandas as pd
import numpy as np
from utils import run_ci
import seaborn as sns
from tqdm import tqdm
import warnings
from IPython.display import display
warnings.filterwarnings('ignore')

### Set parameters

In [None]:
# Inputs
pre_data_csv = 'data/dummy_pre_data.csv'

## Load data

In [None]:
data = pd.read_csv(pre_data_csv, parse_dates=['Date'], index_col='Date')
data.head()

## Examine data

In [None]:
# Period of data
print(min(data.index).date(), max(data.index).date())

In [None]:
# Plot the time series of the data and examine for any inconsistency
data.plot(figsize=(20, 7))
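
Besides the visual check, a few simple tests can flag common inconsistencies. The helper below is a minimal sketch, not part of this notebook's `utils` module: `summarize_data_quality` is a hypothetical name, and it assumes a frame with a daily `DatetimeIndex` and one column per geo.

```python
import pandas as pd

def summarize_data_quality(frame: pd.DataFrame) -> dict:
    """Report gaps, duplicates, nulls and negative values in daily geo data."""
    expected = pd.date_range(frame.index.min(), frame.index.max(), freq='D')
    missing_dates = expected.difference(frame.index)
    return {
        'missing_dates': list(missing_dates),
        'duplicate_dates': int(frame.index.duplicated().sum()),
        'null_counts': frame.isna().sum().to_dict(),
        'negative_counts': (frame < 0).sum().to_dict(),
    }

# Toy data with one missing day, one null and one negative value
toy = pd.DataFrame(
    {'Geo_1': [10.0, 12.0, None], 'Geo_2': [5.0, -1.0, 7.0]},
    index=pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-04']))
report = summarize_data_quality(toy)
print(report)
```

In the notebook, the same helper could be called as `summarize_data_quality(data)`.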

In [None]:
# Check correlations between time series data
data_corr = data.corr()
data_corr.round(3)

In [None]:
# Generate a custom diverging colormap of correlations
cmap = sns.diverging_palette(230, 20, as_cmap=True)

sns.heatmap(data_corr, cmap=cmap)

### Select Test and Control geos

#### Using the correlations and business decisions:
Test and Control geos can be selected using the pairwise correlations between
the time series calculated above. That is, we identify pairs of geos with
high correlation and pick the Test and Control geos from them. A high
correlation between the Test and Control time series is important for training
a good counterfactual model, which in turn gives more reliable results in the
post analysis. As a general rule of thumb, we can use 0.6 or 0.7 as the
minimum correlation when selecting Test and Control geos. In addition, some of
the selected geos may have to be filtered out because of important
business considerations.

Let's use a minimum correlation of 0.7 to select the Test and Control geos for
this use case and assume there are no other business considerations affecting
geo selection.

In [None]:
correlation_threshold = 0.7

data_corr_melt = data_corr.reset_index().melt(id_vars='index')

data_corr_melt.columns = ['Geo X', 'Geo Y', 'corr']

high_corr_geos =\
    data_corr_melt[(data_corr_melt['corr'] >= correlation_threshold) &
                   (data_corr_melt['Geo X'] != data_corr_melt['Geo Y'])]\
    .sort_values('Geo X')

high_corr_geos.round(3)
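
Because the correlation matrix is symmetric, the table above lists every pair twice (once as X/Y and once as Y/X). A possible variation, sketched here on a toy matrix with hypothetical geo names, keeps only the upper triangle so that each pair appears once. `unique_high_corr_pairs` is an illustrative helper, not part of this notebook's utilities.

```python
import numpy as np
import pandas as pd

def unique_high_corr_pairs(corr: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Return each geo pair once, keeping pairs at or above the threshold."""
    # Boolean mask that keeps entries strictly above the diagonal
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ['Geo X', 'Geo Y', 'corr']
    # The comparison also discards any masked (NaN) entries
    return (pairs[pairs['corr'] >= threshold]
            .sort_values('corr', ascending=False)
            .reset_index(drop=True))

toy_corr = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.75],
     [0.2, 0.75, 1.0]],
    index=['Geo_A', 'Geo_B', 'Geo_C'],
    columns=['Geo_A', 'Geo_B', 'Geo_C'])
pairs = unique_high_corr_pairs(toy_corr, threshold=0.7)
print(pairs)
```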

In [None]:
# From the above results, let's say we select the following Test and Control geos.
# These geos are validated further below.

tentative_test_geos = ['Geo_1', 'Geo_3']
tentative_control_geos = ['Geo_4', 'Geo_5', 'Geo_6']

### Why do we need to validate?

Further validation of the selected Test and Control geos is required because, while a good correlation with the test geo in the pre period is necessary for a good control, it is not sufficient. The time series model (predicting the test geo from the control geos) should also perform well and have well-behaved residuals. Validating the test and control geos in the pre period also serves as an 'A/A' test: we should not observe a significant uplift between the actual test KPI and its counterfactual prediction in the pre period.
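
One way to see why correlation alone can mislead: two series that merely share a trend are highly correlated in levels even when their day-to-day movements are unrelated. The sketch below uses synthetic data (the trend, noise level, and seed are arbitrary assumptions) and compares the correlation of the levels with the correlation of the daily changes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_days = 200

# Both series share the same linear trend but have independent noise
trend = np.linspace(0, 100, n_days)
series = pd.DataFrame({
    'test': trend + rng.normal(0, 5, n_days),
    'control': trend + rng.normal(0, 5, n_days),
})

# Correlation of the raw levels is dominated by the shared trend
level_corr = series['test'].corr(series['control'])
# Correlation of day-over-day changes removes the trend
diff_corr = series['test'].diff().corr(series['control'].diff())

print(f'correlation of levels: {level_corr:.2f}')
print(f'correlation of daily changes: {diff_corr:.2f}')
```

A control like this would pass a correlation threshold yet explain little of the short-term variation in the test geo, which is what the model diagnostics in the validation step are meant to catch.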

### Validate the Test and Control geos

In this step we further validate the Test and Control geo combinations selected
tentatively above. Here, for each Test and Control geo combination, we run the
counterfactual analysis using historical data as outlined below:

1. Select a pre period and a post period based on some arbitrary date (as a rule of thumb we can use the last 4-8 weeks as the post period)
2. Select the control time series having correlations greater than some threshold (by tightening or loosening this threshold we can further fine-tune the selected geos)
3. Train a structural time series model on the pre period data
4. Make the counterfactual prediction on the post period and estimate the uplift
5. Determine whether the selected test and control geo combination is sufficient

For each Test geo, the selected Control geos are sufficient if they satisfy the following criteria:
1. The trained time series model is good (high r-squared value (>=0.6), low MAPE, close match of the predicted vs actual values, residuals distributed evenly around zero, etc.)
2. There is no statistically significant uplift observed for the post period
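
The steps and criteria above can be sketched on synthetic data. Note the assumptions: this stand-in uses ordinary least squares rather than the structural time series model run by `run_ci`, and the interval is a crude normal approximation that treats post-period errors as independent. `aa_check` and its parameters are hypothetical names used only for illustration.

```python
import numpy as np

def aa_check(test, control, n_post, min_r_squared=0.6, z=1.96):
    """Fit on the pre period, predict the post period, check for spurious uplift.

    test: 1-D array of the test geo KPI.
    control: 2-D array (days x control geos).
    n_post: number of trailing days used as the simulated post period.
    """
    test = np.asarray(test, dtype=float)
    control = np.asarray(control, dtype=float)
    design = np.column_stack([np.ones(len(test)), control])
    pre, post = slice(None, -n_post), slice(-n_post, None)

    # Steps 3 and 4 (OLS stand-in): train on pre, predict post
    coef, *_ = np.linalg.lstsq(design[pre], test[pre], rcond=None)
    residuals = test[pre] - design[pre] @ coef

    # Criterion 1: model quality on the pre period
    r_squared = 1 - residuals.var() / test[pre].var()
    mape = np.mean(np.abs(residuals / test[pre]))

    # Criterion 2: cumulative post-period effect and a rough interval
    cum_effect = np.sum(test[post] - design[post] @ coef)
    sigma = residuals.std(ddof=design.shape[1])
    half_width = z * sigma * np.sqrt(n_post)
    lower, upper = cum_effect - half_width, cum_effect + half_width

    return {
        'r_squared': r_squared,
        'mape': mape,
        'cum_effect_lower': lower,
        'cum_effect_upper': upper,
        'acceptable': bool(r_squared >= min_r_squared and lower < 0 < upper),
    }

# Synthetic example with no real intervention in the last 28 days
rng = np.random.default_rng(1)
n_days = 120
control = 100 + 10 * np.sin(np.arange(n_days) / 7) + rng.normal(0, 2, n_days)
test = 5 + 2 * control + rng.normal(0, 2, n_days)
result = aa_check(test, control.reshape(-1, 1), n_post=28)
print(result)
```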

In [None]:
# Fake intervention date
post_start_date = '2018-10-01'

# Replace this with desired pre start
pre_start_date = str(min(data.index).date())

# Replace this with desired post end
post_end_date = str(max(data.index).date())

# Day before the post period starts, formatted as a date like the other bounds
pre_end_date = str(
    (pd.to_datetime(post_start_date) - pd.Timedelta(1, 'D')).date())
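
Instead of hard-coding the fake intervention date, the split can be derived from the rule of thumb of using the last 4-8 weeks as the post period. The helper below is a sketch under that assumption; `split_periods` and `post_weeks` are hypothetical names.

```python
import pandas as pd

def split_periods(index: pd.DatetimeIndex, post_weeks: int = 6):
    """Return ([pre_start, pre_end], [post_start, post_end]) as date strings."""
    def fmt(timestamp):
        return str(timestamp.date())

    post_end = index.max()
    # The post period covers exactly post_weeks * 7 days, end date inclusive
    post_start = post_end - pd.Timedelta(weeks=post_weeks) + pd.Timedelta(days=1)
    pre_end = post_start - pd.Timedelta(days=1)
    pre_start = index.min()
    return [fmt(pre_start), fmt(pre_end)], [fmt(post_start), fmt(post_end)]

toy_index = pd.date_range('2018-01-01', '2018-12-31', freq='D')
pre_period, post_period = split_periods(toy_index, post_weeks=6)
print(pre_period)   # ['2018-01-01', '2018-11-19']
print(post_period)  # ['2018-11-20', '2018-12-31']
```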

In [None]:
df = data.reset_index()

In [None]:
min_model_r_squared = 0.6

selected_control_geos = []
model_diagnostics_metrics = []
cumulative_effect_upper = []
cumulative_effect_lower = []
min_cumulative_effect_size = []
acceptable_test_control = []

for tentative_test_geo in tqdm(tentative_test_geos):
    ci_out = run_ci.run_ci_analysis(input_params=run_ci.CausalImpactInput(
        df=data.reset_index(),
        date_col='Date',
        test_col=tentative_test_geo,
        control_cols=tentative_control_geos,
        pre_period=[pre_start_date, pre_end_date],
        post_period=[post_start_date, post_end_date],
        corr_threshold=correlation_threshold,
        confidence_level=0.95))

    print('\n----------------------------------------------------------------')

    print_results = {'Test geo =': ci_out.test_col,
              'Selected control geos =': ci_out.selected_control_cols,
              'Model diagnostics metrics =': ci_out.diag_metrics,
             }
    display(print_results)

    print('Test and control correlations:')
    print(ci_out.test_control_corr)

    print('Time series plots:')
    ci_out.ts_plot.get_figure()

    print('CausalImpact analysis results:')
    ci_out.ci_results.plot()
    print(ci_out.ci_results.summary())

    selected_control_geos.append(ci_out.selected_control_cols)
    model_diagnostics_metrics.append(ci_out.diag_metrics)

    # Take the last post-period row with .iloc[-1]; positional [-1] on a
    # labelled Series is deprecated in recent pandas versions
    inferences = ci_out.ci_results.inferences
    cum_effect = inferences.post_cum_effects.iloc[-1]
    cum_effect_lower = inferences.post_cum_effects_lower.iloc[-1]
    cum_effect_upper = inferences.post_cum_effects_upper.iloc[-1]

    cumulative_effect_lower.append(cum_effect_lower)
    cumulative_effect_upper.append(cum_effect_upper)
    # Distance from the point estimate to the lower bound of the interval
    min_cumulative_effect_size.append(cum_effect - cum_effect_lower)

    # Acceptable if the model fits well and the interval contains zero
    acceptable_test_control.append(
        bool(ci_out.diag_metrics['r-squared'] >= min_model_r_squared
             and cum_effect_lower < 0 < cum_effect_upper))

sim_results =\
    pd.DataFrame({'Test geo': tentative_test_geos,
                  'Selected control geos': selected_control_geos,
                  'Model diagnostics metrics': model_diagnostics_metrics,
                  'Post cumulative effect lower bound':
                      cumulative_effect_lower,
                  'Post cumulative effect upper bound':
                      cumulative_effect_upper,
                  'Sufficient test and control setting':
                      acceptable_test_control,
                  'Min estimated effect size': min_cumulative_effect_size})

In [None]:
sim_results

## Final test and control geo selections

From the above results, we finally select the following test and control geo combinations for our experiment (those with good model fit and no
statistically significant effect in the simulated post period):

In [None]:
sim_results.loc[sim_results['Sufficient test and control setting'],
                ['Test geo', 'Selected control geos']]

## Minimum effect size
Given the historical KPI variance in the selected geos, the cumulative uplift in the experiment must be at least this large to be detected as statistically significant:

In [None]:
min_uplift =\
    np.round(sim_results['Min estimated effect size'].max()).astype(int)

min_uplift

We can use this to estimate the required level of investment if we assume some expected value of the ROI:

In [None]:
expected_roi = 1.2

min_investment = min_uplift / expected_roi

np.round(min_investment).astype(int)
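
The arithmetic behind the last two steps can be written out with made-up numbers. This sketch assumes, as the cells above do, that the minimum effect size is the distance from the point estimate to the lower bound of the cumulative effect interval, and that ROI is the incremental KPI divided by the investment. The function names and input values are hypothetical.

```python
def min_detectable_uplift(point_estimate: float, lower_bound: float) -> float:
    """Distance from the cumulative effect estimate to its lower bound."""
    return point_estimate - lower_bound

def min_investment_for(uplift: float, expected_roi: float) -> float:
    """Spend needed to generate the uplift at the assumed ROI."""
    if expected_roi <= 0:
        raise ValueError('expected_roi must be positive')
    return uplift / expected_roi

# Hypothetical A/A results for two test geos: (point estimate, lower bound)
aa_results = [(150.0, -11850.0), (-300.0, -9300.0)]
half_widths = [min_detectable_uplift(est, low) for est, low in aa_results]

# Take the largest value so the uplift is detectable in every test geo
required_uplift = max(half_widths)
budget = min_investment_for(required_uplift, expected_roi=1.2)
print(required_uplift)  # 12000.0
print(round(budget))    # 10000
```

A lower assumed ROI raises the required budget proportionally, so it is worth checking the estimate against a pessimistic ROI as well.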