In [1]:
import pandas as pd
import numpy as np
import random
from simply import redshift
from scipy import stats

### STEP 1: Make a copy of this notebook in your own folder!
#### Preferably, copy it into the `src` folder associated with the ticket that pulls the groups for your experiment.
#### Do not commit directly to the version in the sample-size repo!

---

### STEP 2: Generate the population from which to draw the samples
Enter your query below to generate a list of `user_ref`s from the population of interest. For this example I took all currently enrolled customers (removing known fraudsters). If your experiment has any exclusionary criteria, add them to this query. The result should be a list of all possible `user_ref`s from which the sample will be drawn. **Note: Be cautious when adding exclusionary criteria. The results of the experiment can only inform our understanding of the population the sample was drawn from.**

If you want to apply any insights or inferences to the entire customer base, you must draw from the entire customer base (i.e., no exclusionary criteria). If you want to understand a subset of customers (e.g., people who live in cities), limit the population to that subset (e.g., the list of all customers who live in cities), with the awareness that the results _cannot_ be used to make inferences about customers outside the population (e.g., after this experiment we will still know nothing about customers who do not live in cities).

Reach out if you have any questions!

In [2]:
df = redshift(
"""
SELECT user_ref
FROM curated.dim_user
WHERE num_open_accounts > 0
"""
)

### STEP 3: Draw the random samples

Next, set the sizes of the experimental and control groups. Often these are the same size, but they can differ. If you use different sizes, keep in mind that both groups must be large enough to give the test enough statistical power to detect the effect of interest. The required size can be calculated from a power analysis. Again, reach out if unsure. Once you have calculated the size for each group, enter them below.
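
If you have not run a power analysis before, here is a minimal sketch of one for a proportion metric such as contact rate. It uses the standard normal-approximation formula for a two-sided two-proportion test. The helper name `n_per_group`, the 8.9% baseline, and the 8.0% target are illustrative assumptions, not values from an actual experiment plan. Treat the result as a rough floor and check it with whoever owns the experiment design.

```python
from math import ceil, sqrt

from scipy.stats import norm


def n_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion z-test.

    Hypothetical helper: assumes equal group sizes and a normal approximation.
    """
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_beta = norm.ppf(power)           # critical value for the desired power
    p_bar = (p_control + p_treatment) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_control * (1 - p_control) + p_treatment * (1 - p_treatment))
    ) ** 2
    return ceil(numerator / (p_control - p_treatment) ** 2)


# Illustrative inputs only: detect a drop in contact rate from 8.9% to 8.0%
needed = n_per_group(0.089, 0.080)
print(f"Customers needed per group: {needed}")
```

Smaller expected effects need much larger groups, so it is worth running this with the smallest effect you would still care about.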

In [3]:
# Modify the numbers here to reflect the number of customers you want to include in
# the experimental and control groups
experimental_num = 80000
control_num = 80000

The cell below draws the experimental and control groups at random, without replacement and with no overlap between the two.

In [4]:
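# De-duplicate the population so each user_ref can be drawn at most once
possible = pd.DataFrame(df.user_ref.unique())
possible.columns = ['user_ref']
# Draw the experimental group without replacement
experimental_users = random.sample(list(possible.user_ref), experimental_num)
# Draw the control group from the users not already in the experimental group
remaining = possible[~possible.user_ref.isin(experimental_users)]
control_users = random.sample(list(remaining.user_ref), control_num)
# Users assigned to neither group
unused = possible[(~possible.user_ref.isin(experimental_users))&(~possible.user_ref.isin(control_users))]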
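
The draw above gives different groups every time it runs. If you need to be able to reproduce a particular pull, one option is to fix a seed. The sketch below is an alternative, not what this notebook does: `draw_groups`, its `seed` argument, and the demo `user-N` ids are all hypothetical, and it uses NumPy's random generator instead of `random.sample`.

```python
import numpy as np
import pandas as pd


def draw_groups(user_refs, experimental_num, control_num, seed=0):
    """Return two disjoint random groups drawn from the unique user_refs.

    Hypothetical helper: shuffles once with a seeded generator, then slices.
    """
    unique_refs = pd.Series(user_refs).drop_duplicates().to_numpy()
    if experimental_num + control_num > len(unique_refs):
        raise ValueError("Population is too small for the requested group sizes")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_refs)
    experimental = shuffled[:experimental_num].tolist()
    control = shuffled[experimental_num:experimental_num + control_num].tolist()
    return experimental, control


# Demo on made-up ids; in the notebook you would pass df.user_ref instead
demo_refs = [f"user-{i}" for i in range(1000)]
demo_exp, demo_con = draw_groups(demo_refs, 100, 100, seed=42)
print(len(demo_exp), len(demo_con), len(set(demo_exp) & set(demo_con)))
```

If you re-pull because the groups were unbalanced, change the seed and record the one you ended up using in the ticket.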

### STEP 4: Get the baseline metrics for experimental and control groups 

Now that we have our list of experimental and control users, we get their baseline metrics and compare them to the total population. I have included average balance, deposits, swipe count, and swipe volume, each over the past 90 days. If you would like to include others (e.g., KPIs or metrics you care about in the experiment), add them to the query below. However, keep in mind that the more metrics you check, the more stringent your statistical test needs to be to avoid the risk of a false positive.

**Rule of thumb: Focus on (ideally) 2-3 key metrics.** These will depend on the particular experiment and hypotheses. If you are conducting a similar test in two different populations (e.g., profitable versus non-profitable) - treat these as distinct experiments and focus on 2-3 key metrics in each (they can be the same or different metrics). For example, you might have a different hypothesis/assumption for profitable than for unprofitable customers; in this case your metrics of interest may differ.
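
To see why the number of metrics matters, here is a small arithmetic sketch. It assumes the metrics are independent (real metrics such as swipe count and swipe volume are correlated, so this is only a rough guide), and the function name is hypothetical. It computes the chance that at least one metric falls below a threshold purely by chance when the groups are in fact well balanced.

```python
def familywise_false_alarm_rate(num_metrics, alpha):
    """Chance that at least one of num_metrics independent tests has p < alpha
    when there is no real difference between the groups."""
    return 1 - (1 - alpha) ** num_metrics


# With a re-pull threshold of p < .1, compare checking 1, 3, 4 and 10 metrics
for num_metrics in (1, 3, 4, 10):
    rate = familywise_false_alarm_rate(num_metrics, alpha=0.1)
    print(f"{num_metrics} metrics: {rate:.1%} chance of at least one false alarm")
```

With three independent metrics and a threshold of .1, roughly 27% of perfectly good draws would still trip at least one check, which is why keeping the list to 2-3 key metrics helps.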

In [5]:
baseline_query = """
SELECT 
    user_ref, 
    avg(balance_eod) AS avg_balance,
    sum(deposit_amount) AS deposits_past90,
    sum(swipe_amount) AS swipe_vol_past90,
    sum(swipe_count) AS swipe_count_past90
FROM curated.fact_customer_day
WHERE user_ref IN :users
    AND date > CURRENT_DATE-90
GROUP BY user_ref
"""

In [18]:
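# Full experimental group as a tuple; the queries below reference exp_ids
exp_ids = tuple(experimental_users)
# First and second halves of the experimental group
exp_ids_1 = tuple(experimental_users[:len(experimental_users)//2])
exp_ids_2 = tuple(experimental_users[len(experimental_users)//2:])
cont_ids = tuple(control_users)
all_ids = tuple(possible.user_ref)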


In [20]:
len(exp_ids_1)

40000

In [12]:
test = redshift(
"""
SELECT distinct user_ref 
FROM curated.fact_customer_day
WHERE user_ref IN :ids
     AND date >= CURRENT_DATE-7
"""
, params = {'ids': exp_ids})

In [13]:
test.describe()

| | user_ref |
|---|---|
| count | 80000 |
| unique | 80000 |
| top | 1447397f-8620-4edd-abb3-4aa556e6df36 |
| freq | 1 |


In [14]:
def get_activity(ids):
    return redshift(baseline_query, params = {'users':ids})
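
If a query with a very long `IN` list turns out to be too large to run in one go, one option is to send the ids in batches and stack the results. This is a sketch under that assumption, not something this notebook requires: `fetch_in_batches` and `fake_query` are hypothetical, and `fake_query` stands in for a call such as `get_activity` so that the example runs without a database.

```python
import pandas as pd


def fetch_in_batches(run_query, ids, batch_size=40000):
    """Call run_query on successive slices of ids and concatenate the results.

    run_query must accept a tuple of ids and return a DataFrame.
    """
    ids = list(ids)
    frames = [
        run_query(tuple(ids[start:start + batch_size]))
        for start in range(0, len(ids), batch_size)
    ]
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


def fake_query(batch):
    """Stand-in for a database call: echoes the ids it was given."""
    return pd.DataFrame({"user_ref": list(batch)})


demo = fetch_in_batches(fake_query, [f"user-{i}" for i in range(10)], batch_size=4)
print(len(demo))
```

In the notebook the call would look like `fetch_in_batches(get_activity, exp_ids)`. This only works for queries that return one row per user, since each user then lands in exactly one batch.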

In [21]:
experimental_baseline = get_activity(exp_ids)
control_baseline = get_activity(cont_ids)
# all_baseline = get_activity(all_ids)

In [23]:
# all_baseline = get_activity(all_ids)

In [25]:
#all_baseline.describe()

In [26]:
experimental_baseline.describe()

| | avg_balance | deposits_past90 | swipe_vol_past90 | swipe_count_past90 |
|---|---|---|---|---|
| count | 80000.0 | 80000.0 | 80000.0 | 80000.0 |
| mean | 1963.520019 | 3070.893 | -1101.49138 | 25.970994 |
| std | 9732.304378 | 12574.5 | 2703.575076 | 58.768781 |
| min | -4609.33 | 0.0 | -91245.52 | 0.0 |
| 25% | 0.0 | 0.0 | -535.3225 | 0.0 |
| 50% | 4.6175 | 0.0 | 0.0 | 0.0 |
| 75% | 514.58875 | 2235.722 | 0.0 | 14.0 |
| max | 804411.303 | 2025903.0 | 0.0 | 889.0 |


In [27]:
control_baseline.describe()

| | avg_balance | deposits_past90 | swipe_vol_past90 | swipe_count_past90 |
|---|---|---|---|---|
| count | 80000.0 | 80000.0 | 80000.0 | 80000.0 |
| mean | 2030.86 | 3057.911226 | -1095.593705 | 25.659644 |
| std | 11015.78 | 9265.672578 | 2683.185136 | 57.779242 |
| min | -3620.061 | 0.0 | -92858.07 | 0.0 |
| 25% | 0.0 | 0.0 | -536.6125 | 0.0 |
| 50% | 4.55 | 0.0 | 0.0 | 0.0 |
| 75% | 512.3227 | 2237.2875 | 0.0 | 14.0 |
| max | 1311382.0 | 799602.39 | 0.0 | 871.0 |


### STEP 5: Make sure the groups don't differ on contact rate

Note: You only need to run this part if you will be measuring the impact on contact rate in the experiment.

**THIS SECTION (QUERY, ETC.) NEEDS TO BE UPDATED ONCE CALL DATA IS ADDED TO REDSHIFT**

In [35]:
contact_rate_query = """

select distinct
	du.user_ref,
    isnull(count(distinct dc.chat_ref),0) as num_chats,
    isnull(count(distinct dc.user_ref),0) as chat_flag,
    isnull(count(distinct c.case_number),0) as num_calls,
    isnull(count(distinct c.user_ref),0) as call_flag
from dim_user_pii du
left join (select distinct
                du.user_ref,
                dc.chat_ref
           from dim_user du
           join dim_chat dc on du.user_ref = dc.user_ref
           where created_by = 'CUSTOMER'
                and dc.created_date >= current_date - 120) dc on du.user_ref = dc.user_ref
left join (select distinct
                du.user_ref,
                c.case_number
           from segment_salesforce.cases c
           join curated.dim_user du on ltrim(c.customer_c,'Customers:') = du.user_ref
           where origin = 'Phone'
                and c.created_date::date >= current_date - 120) c on du.user_ref = c.user_ref
where du.user_ref in :ids
group by 1
;


"""

In [36]:
exp_contact = redshift(contact_rate_query, params = {'ids':exp_ids})

In [37]:
con_contact = redshift(contact_rate_query, params = {'ids':cont_ids})

In [38]:
exp_contact.describe()

| | num_chats | chat_flag | num_calls | call_flag |
|---|---|---|---|---|
| count | 80000.0 | 80000.0 | 80000.0 | 80000.0 |
| mean | 0.135425 | 0.0892 | 0.221025 | 0.082325 |
| std | 0.544439 | 0.285034 | 1.065053 | 0.274861 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 26.0 | 1.0 | 54.0 | 1.0 |


In [39]:
con_contact.describe()

| | num_chats | chat_flag | num_calls | call_flag |
|---|---|---|---|---|
| count | 80000.0 | 80000.0 | 80000.0 | 80000.0 |
| mean | 0.132887 | 0.088438 | 0.216175 | 0.083013 |
| std | 0.559068 | 0.283932 | 1.011252 | 0.275903 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 32.0 | 1.0 | 35.0 | 1.0 |


In [44]:
# Calculation Chat Flag - Chi Square Test
# Note: the Numpy array needs to be 2-dimensional! 
obs = np.array([[exp_contact[exp_contact.chat_flag ==1].user_ref.count(),
                 exp_contact[exp_contact.chat_flag ==0].user_ref.count()],
                [con_contact[con_contact.chat_flag ==1].user_ref.count(),
                 con_contact[con_contact.chat_flag ==0].user_ref.count()]])

chi2, p, dof, expected = stats.chi2_contingency(obs)

#### This p-value indicates whether there's a statistical difference in the proportion of people who contact in each group (experimental versus control).
Re-pull the groups if p < .1.
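
Since the same 2x2 test is run for both the chat flag and the call flag, it can be wrapped in a small helper. The sketch below is self-contained and runs on simulated 0/1 flags; `flag_balance_pvalue` and the simulated 9% contact rate are illustrative assumptions. In the notebook you would pass columns such as `exp_contact.chat_flag` and `con_contact.chat_flag`.

```python
import numpy as np
from scipy import stats


def flag_balance_pvalue(exp_flags, con_flags):
    """Chi-square p-value for whether a 0/1 flag is equally common in both groups."""
    exp_flags = np.asarray(exp_flags)
    con_flags = np.asarray(con_flags)
    # Rows are groups, columns are (flag == 1, flag == 0) counts
    table = np.array([
        [(exp_flags == 1).sum(), (exp_flags == 0).sum()],
        [(con_flags == 1).sum(), (con_flags == 0).sum()],
    ])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    return p


# Simulated flags: both groups drawn from the same 9% contact rate
rng = np.random.default_rng(1)
demo_exp_flags = rng.binomial(1, 0.09, size=5000)
demo_con_flags = rng.binomial(1, 0.09, size=5000)
demo_p = flag_balance_pvalue(demo_exp_flags, demo_con_flags)
print(f"p = {demo_p:.3f}")
```

Note that `chi2_contingency` raises an error if a whole column of the table is zero, for example if nobody in either group has the flag set.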

In [45]:
#P Value for the Chi Square Test
p

0.5980031694205838

In [46]:
#Returns P Value for Number of Chats
[tstat,pvalue] = stats.ttest_ind(exp_contact.num_chats, con_contact.num_chats)

#### This p-value indicates whether there's a statistical difference in the mean number of chats per customer between the groups (experimental versus control).
Re-pull the groups if the p-value < .1.
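
Contact counts are mostly zeros with a long tail, so you may wonder whether the equal-variance assumption of the default t-test matters. As a sketch, the comparison below runs both the default test and Welch's variant (`equal_var=False`) on simulated counts. The Poisson rates are made up for illustration and are not the real contact data.

```python
import numpy as np
from scipy import stats

# Simulated chat counts per customer; rates are illustrative only
rng = np.random.default_rng(0)
exp_counts = rng.poisson(0.135, size=5000)
con_counts = rng.poisson(0.133, size=5000)

# Default Student's t-test (assumes equal variances), as used in this notebook
student = stats.ttest_ind(exp_counts, con_counts)
# Welch's t-test, which drops the equal-variance assumption
welch = stats.ttest_ind(exp_counts, con_counts, equal_var=False)

print(f"Student: t = {student.statistic:.4f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.4f}, p = {welch.pvalue:.4f}")
```

When the two groups are the same size, as they are here, the two tests give the same t statistic and differ only in their degrees of freedom, so the choice matters mainly when the experimental and control groups have different sizes.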

In [47]:
pvalue

0.35772372327661983

In [50]:
# Calculation Unique Call Flag - Chi Square Test
# Note: the Numpy array needs to be 2-dimensional! 
obs = np.array([[exp_contact[exp_contact.call_flag ==1].user_ref.count(),
                 exp_contact[exp_contact.call_flag ==0].user_ref.count()],
                [con_contact[con_contact.call_flag ==1].user_ref.count(),
                 con_contact[con_contact.call_flag ==0].user_ref.count()]])

chi2, p, dof, expected = stats.chi2_contingency(obs)
p

0.6239710711954163

In [51]:
#Returns P Value for Total Calls
[tstat,pvalue] = stats.ttest_ind(exp_contact.num_calls, con_contact.num_calls)
pvalue

0.35028536005041777

### STEP 6: Conduct statistical tests to ensure that the experimental and control groups are statistically indistinguishable from each other and from the population (on the metrics we are interested in, other than contact)

Here we will conduct independent t-tests for each measure to investigate baseline differences between (1) experimental group and all customers and (2) experimental and control groups. We are looking for high p-values here to indicate no difference. (If the p-values are trending towards significance there is a problem.) 

**Note that it is important to find statistical similarity (e.g., p > .2) in the variables of interest.** For example, if you are going to be testing for an increase in ADB, then it is important that the experimental and control groups have similar ADB at the beginning of the experiment. If you run the t-test and find that the groups have different mean ADB, and the p-value is trending towards significance, I would re-pull the groups. A p-value of .1 may not be below .05, but it is still too low: it means that if the two groups truly did not differ, a mean difference at least this large would show up in only about 10% of random draws. Feel free to rerun the notebook until you get similar means and high p-values.
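
To make the pass/fail decision easier to read than a column of printed test results, the checks can be collected into one table. This is a minimal sketch on simulated data: `balance_report`, the `repull` column, and the simulated means are hypothetical, and the `threshold=0.2` default simply mirrors the p > .2 guideline above.

```python
import numpy as np
import pandas as pd
from scipy import stats


def balance_report(exp_df, con_df, metrics, threshold=0.2):
    """One row per metric: group means, t-test p-value, and a re-pull flag."""
    rows = []
    for metric in metrics:
        result = stats.ttest_ind(exp_df[metric], con_df[metric])
        rows.append({
            "metric": metric,
            "exp_mean": exp_df[metric].mean(),
            "con_mean": con_df[metric].mean(),
            "pvalue": result.pvalue,
            "repull": bool(result.pvalue < threshold),
        })
    return pd.DataFrame(rows)


# Simulated baselines: avg_balance is balanced, swipe_count_past90 is deliberately shifted
rng = np.random.default_rng(7)
demo_exp = pd.DataFrame({
    "avg_balance": rng.normal(2000, 500, size=2000),
    "swipe_count_past90": rng.normal(25, 5, size=2000),
})
demo_con = pd.DataFrame({
    "avg_balance": rng.normal(2000, 500, size=2000),
    "swipe_count_past90": rng.normal(30, 5, size=2000),
})
report = balance_report(demo_exp, demo_con, ["avg_balance", "swipe_count_past90"])
print(report)
```

In the notebook the call would be `balance_report(experimental_baseline, control_baseline, [...])` with the 2-3 key metrics for your experiment.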

In [53]:

# print('Comparing experimental to total population')
# print('-------------------------------------------')
# print('Average balance:',stats.ttest_ind(experimental_baseline.avg_balance, all_baseline.avg_balance))
# print('Deposits:',stats.ttest_ind(experimental_baseline.deposits_past90, all_baseline.deposits_past90))
# print('Swipe volume:',stats.ttest_ind(experimental_baseline.swipe_vol_past90, all_baseline.swipe_vol_past90))
# print('Swipe count:',stats.ttest_ind(experimental_baseline.swipe_count_past90, all_baseline.swipe_count_past90))
print('----------------------------------------------------')
print('----------------------------------------------------')
print('Comparing experimental to control')
print('-------------------------------------------')
print('Average balance:',stats.ttest_ind(experimental_baseline.avg_balance, control_baseline.avg_balance))
print('Deposits:',stats.ttest_ind(experimental_baseline.deposits_past90, control_baseline.deposits_past90))
print('Swipe volume:',stats.ttest_ind(experimental_baseline.swipe_vol_past90, control_baseline.swipe_vol_past90))
print('Swipe count:',stats.ttest_ind(experimental_baseline.swipe_count_past90, control_baseline.swipe_count_past90))

----------------------------------------------------
----------------------------------------------------
Comparing experimental to control
-------------------------------------------
Average balance: Ttest_indResult(statistic=-1.2957536717195983, pvalue=0.1950622330430885)
Deposits: Ttest_indResult(statistic=0.2350776285004822, pvalue=0.8141487829115092)
Swipe volume: Ttest_indResult(statistic=-0.4379354162102582, pvalue=0.661433692455781)
Swipe count: Ttest_indResult(statistic=1.0685338845218917, pvalue=0.28528136619068967)


### STEP 7: Save the experimental and control user_refs as CSV files

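Save the CSV files in the `output` folder (presuming you are currently working in the `src` folder). You can then link to those CSV files in the ticket. The first cell below also writes the experimental `user_ref`s to a Redshift table, which the measurement note at the end of this notebook refers to.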

In [56]:
redshift("""
drop table if exists public.louise_self_service_card_reorder_test;
create table public.louise_self_service_card_reorder_test as 
select 
    user_ref, 
    :test_group as test_group
from curated.dim_user 
where user_ref in :ids
;

select * 
from public.louise_self_service_card_reorder_test
limit 1
;""", params={'ids': tuple(exp_ids), 'test_group': 'TEST'})

| | user_ref | test_group |
|---|---|---|
| 0 | 00152d70-2af4-4f5e-a427-2f23adf63033 | TEST |


In [57]:
experimental_baseline.user_ref.to_csv('../output/experimental_ids.csv', header=True, index=False)
control_baseline.user_ref.to_csv('../output/control_ids.csv', header=True, index=False)
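
The `to_csv` calls above fail if the `output` folder does not exist yet. As a sketch, the helper below creates the folder first and refuses to write if a `user_ref` appears in both groups. `save_groups` and the demo ids are hypothetical, and the demo writes to a temporary folder so it does not touch your real `output` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_groups(experimental_ids, control_ids, output_dir):
    """Write experimental_ids.csv and control_ids.csv into output_dir."""
    overlap = set(experimental_ids) & set(control_ids)
    if overlap:
        raise ValueError(f"{len(overlap)} user_refs appear in both groups")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    paths = {}
    for name, ids in (("experimental", experimental_ids), ("control", control_ids)):
        path = output_dir / f"{name}_ids.csv"
        pd.Series(list(ids), name="user_ref").to_csv(path, header=True, index=False)
        paths[name] = path
    return paths


# Demo with made-up ids, written to a throwaway folder
demo_dir = tempfile.mkdtemp()
paths = save_groups(["user-1", "user-2"], ["user-3", "user-4"], demo_dir)
print(paths["experimental"].name, paths["control"].name)
```

In the notebook the call would be `save_groups(experimental_users, control_users, '../output')`.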

## How to measure the test vs. control
Compare the `user_ref`s in `public.louise_self_service_card_reorder_test` with the customers who have the feature flag OFF.