# Starbucks Capstone Challenge - Exploratory Data Analysis

## Introduction

Now that we have a new perspective on the data, we'll take a closer look and try to answer the questions presented at the beginning.

## Setup

In [1]:
import sys

!{sys.executable} -m pip install -e ../ --quiet

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

from sb_capstone.wrangling import (
    clean_transcript_group
)

%matplotlib inline

In [3]:
transcript_group = pd.read_csv("../data/processed/transcript_group.csv")
transcript_group = clean_transcript_group(transcript_group)

transcript_group.head()

Unnamed: 0,id,wave,received,viewed,completed,amount,reward,non_offer_amount,mapped_offer,offer_type,...,web,email,mobile,social,gender,age,income,membership_year,membership_month,membership_day
0,1,2,True,True,False,0.0,0.0,0.0,10,discount,...,True,True,True,False,U,,,2017,2,12
1,2,2,False,False,False,0.0,0.0,0.0,0,no_offer,...,False,False,False,False,F,55.0,112000.0,2017,7,15
2,3,2,True,True,False,0.0,0.0,0.0,4,bogo,...,True,True,True,False,U,,,2018,7,12
3,4,2,True,True,True,19.67,0.0,29.72,8,informational,...,False,True,True,True,F,75.0,100000.0,2017,5,9
4,5,2,False,False,False,0.0,0.0,0.0,0,no_offer,...,False,False,False,False,U,,,2017,8,4


In [4]:
transcript_group.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102000 entries, 0 to 101999
Data columns (total 22 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   id                102000 non-null  int64   
 1   wave              102000 non-null  int64   
 2   received          102000 non-null  bool    
 3   viewed            102000 non-null  bool    
 4   completed         102000 non-null  bool    
 5   amount            102000 non-null  float64 
 6   reward            102000 non-null  float64 
 7   non_offer_amount  102000 non-null  float64 
 8   mapped_offer      102000 non-null  category
 9   offer_type        102000 non-null  category
 10  difficulty        102000 non-null  float64 
 11  duration          102000 non-null  float64 
 12  web               102000 non-null  bool    
 13  email             102000 non-null  bool    
 14  mobile            102000 non-null  bool    
 15  social            102000 non-null  bool    
 16  ge

AB TESTING NOTES

* State Context (goal of having promotions)
* State potential disadvantages
* Pre-requisites
   * Choose Key Metrics
   * Aggregate (total) metrics are only a fair comparison when N_control = N_treatment
   * If not, use per-customer metrics (normalization)
      * Revenue per user
   * Randomization Units
      * Users, assume there's enough
* Experimentation
   * Target all users/specific users
   * Understand the journey
   * Select a population (choose which stage in the journey you wish to study)
   * Sample size
      * (16 * sigma^2) / delta^2
      * sigma - STD of the population
      * delta - difference between treatment and control
   * Practical significance boundary
      * How much revenue increase per user is needed to outweigh the cost
   * Determine
      * Power of the test: 80%
      * Significance Level: 5%
   * How Long
      * Ramp-up plan
      * Day-of-week effect (if there are special days, run for whole weeks)
      * Seasonality (should not be used for analysis)
      * Primacy and novelty effects
      * Better than overpowered, account for unique users
      * Running the experiment too long yields diminishing returns
   * Results to Decision
      * Check that the assignment of users to groups is truly random
      * Statistically and practically significant
         * If one of them fails, the result is not conclusive; rerun the test with more power
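
The sample-size rule in the notes can be sketched as below. This is a minimal illustration, not the method used later in the notebook: the function names are hypothetical, and the result is read as the number of users needed per group. The factor 16 is a rounded form of 2(z_0.975 + z_0.80)^2, which is about 15.7 for a 5% two-sided significance level and 80% power, so the second function shows the unrounded version for comparison.

```python
from math import ceil
from statistics import NormalDist


def sample_size_rule_of_thumb(sigma, delta):
    """Users per group from the rule n = 16 * sigma^2 / delta^2.

    sigma: standard deviation of the metric in the population
    delta: difference between treatment and control we want to detect
    """
    return ceil(16 * sigma ** 2 / delta ** 2)


def sample_size_exact(sigma, delta, alpha=0.05, power=0.80):
    """Same calculation without rounding the constant to 16."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


# Illustrative numbers only (not taken from the Starbucks data).
n_rule = sample_size_rule_of_thumb(sigma=10.0, delta=2.0)
n_exact = sample_size_exact(sigma=10.0, delta=2.0)
print(n_rule, n_exact)
```

With these illustrative inputs the rule of thumb asks for 400 users per group, slightly more than the unrounded calculation, so it errs on the conservative side.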

https://www.kaggle.com/ekrembayar/a-b-testing-step-by-step-hypothesis-testing
https://towardsdatascience.com/the-math-behind-a-b-testing-with-example-code-part-1-of-2-7be752e1d06f
https://github.com/mnguyenngo/ab-framework/blob/master/src/stats.py


Promotions and offers are not about giving things away for free; they are meant to attract customers to buy when they had no initial intention to. Revenue is therefore the total amount spent, offer and non-offer, minus the discounts given. Since our control and treatment groups are not balanced, we'll use the mean revenue per customer.
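
One way to compute this metric is sketched below on a toy frame. It assumes revenue is `amount + non_offer_amount - reward`, using the column names shown in `transcript_group`, and that a customer's rows are summed before averaging. The helper name `revenue_per_customer` and the toy values are hypothetical.

```python
import pandas as pd


def revenue_per_customer(df):
    """Mean revenue per customer for each offer type.

    Assumption: revenue = offer amount + non-offer amount - reward given.
    """
    revenue = df["amount"] + df["non_offer_amount"] - df["reward"]
    # Total revenue of each customer within each offer type.
    per_customer = (
        df.assign(revenue=revenue)
        .groupby(["offer_type", "id"], observed=True)["revenue"]
        .sum()
    )
    # Average those customer totals per offer type.
    return per_customer.groupby(level="offer_type", observed=True).mean()


# Toy data with the same column names as transcript_group (values are made up).
toy = pd.DataFrame(
    {
        "id": [1, 1, 2, 3],
        "offer_type": ["bogo", "bogo", "bogo", "discount"],
        "amount": [10.0, 5.0, 20.0, 8.0],
        "reward": [5.0, 0.0, 10.0, 2.0],
        "non_offer_amount": [0.0, 5.0, 0.0, 4.0],
    }
)

result = revenue_per_customer(toy)
print(result)
```

Dividing by the number of customers in each group keeps the comparison fair even when the groups differ in size, which is the reason for preferring this metric over total revenue here.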

## Q1: Which offer yields the best results?

### Metrics

Revenue per customer.

In [9]:
transcript_group[~transcript_group.received]

Unnamed: 0,id,wave,received,viewed,completed,amount,reward,non_offer_amount,mapped_offer,offer_type,...,web,email,mobile,social,gender,age,income,membership_year,membership_month,membership_day
1,2,2,False,False,False,0.0,0.0,0.0,0,no_offer,...,False,False,False,False,F,55.0,112000.0,2017,7,15
4,5,2,False,False,False,0.0,0.0,0.0,0,no_offer,...,False,False,False,False,U,,,2017,8,4
5,6,2,False,False,False,0.0,0.0,0.0,0,no_offer,...,False,False,False,False,M,68.0,70000.0,2018,4,26
10,11,2,False,False,False,0.0,0.0,0.0,0,no_offer,...,False,False,False,False,U,,,2017,8,24
16,17,2,False,False,False,0.0,0.0,0.0,0,no_offer,...,False,False,False,False,M,49.0,52000.0,2014,11,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101972,16973,1,False,False,False,0.0,0.0,0.0,0,no_offer,...,False,False,False,False,F,44.0,51000.0,2017,1,19
101973,16974,1,False,False,False,0.0,0.0,0.0,0,no_offer,...,False,False,False,False,M,30.0,57000.0,2015,10,12
101979,16980,1,False,False,False,0.0,0.0,0.0,0,no_offer,...,False,False,False,False,F,63.0,52000.0,2013,9,22
101995,16996,1,False,False,False,0.0,0.0,0.0,0,no_offer,...,False,False,False,False,F,45.0,54000.0,2018,6,4


#### Invariant Metrics

Before proceeding with the analysis, let us first check whether the control and treatment groups are balanced in size, and whether any difference between them is statistically significant.

$$ H_{0}: N_{ctrl} - N_{treat} = 0 $$

$$ H_{1}: N_{ctrl} - N_{treat} \neq 0 $$
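
A minimal sketch of how these hypotheses could be tested, assuming that under $H_{0}$ each customer is equally likely to land in either group (p = 0.5) and using the normal approximation to the binomial. The function name and the example counts are hypothetical, not results from the data.

```python
from math import sqrt
from statistics import NormalDist


def group_size_balance_test(n_ctrl, n_treat, alpha=0.05):
    """Two-sided z-test of H0: N_ctrl - N_treat = 0.

    Assumption: under H0, N_ctrl ~ Binomial(N, 0.5) with N = n_ctrl + n_treat.
    Returns (z, p_value, reject_h0).
    """
    n_total = n_ctrl + n_treat
    expected = 0.5 * n_total            # expected control size under H0
    std = sqrt(n_total * 0.5 * 0.5)     # binomial standard deviation under H0
    z = (n_ctrl - expected) / std
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Made-up counts for illustration only.
z_bal, p_bal, reject_bal = group_size_balance_test(500, 500)
z_unb, p_unb, reject_unb = group_size_balance_test(400, 600)
print(z_bal, p_bal, reject_bal)
print(z_unb, p_unb, reject_unb)
```

In this notebook the two counts could plausibly come from the `received` flag, for example `(~transcript_group.received).sum()` for control and `transcript_group.received.sum()` for treatment, if rows without an offer are treated as the control group. That mapping is an assumption.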

## Q2: Which demographics are more likely to be influenced by offers?

## Q3: Which type of offer works best for a given demographic?