# Starbucks Capstone Challenge - Exploratory Data Analysis

## Introduction

Now that we have a new perspective on the data, we'll look closer and try to answer the questions presented at the beginning.

## Setup

In [1]:
import sys

!{sys.executable} -m pip install -e ../ --quiet

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sb_capstone.wrangling import (
    clean_transcript_group
)

%matplotlib inline

In [3]:
transcript_group = pd.read_csv("../data/processed/transcript_group.csv")
transcript_group = clean_transcript_group(transcript_group)

transcript_group.head()

Unnamed: 0,id,wave,received,viewed,completed,amount,reward,non_offer_amount,mapped_offer,offer_type,difficulty,duration,web,email,mobile,social,gender,age,income,became_member_on
0,1,2,True,True,False,0.0,0.0,0.0,10,discount,10.0,7.0,True,True,True,False,U,,,2017-02-12
1,2,2,False,False,False,0.0,0.0,0.0,0,no_offer,0.0,0.0,False,False,False,False,F,55.0,112000.0,2017-07-15
2,3,2,True,True,False,0.0,0.0,0.0,4,bogo,5.0,7.0,True,True,True,False,U,,,2018-07-12
3,4,2,True,True,True,19.67,0.0,29.72,8,informational,0.0,3.0,False,True,True,True,F,75.0,100000.0,2017-05-09
4,5,2,False,False,False,0.0,0.0,0.0,0,no_offer,0.0,0.0,False,False,False,False,U,,,2017-08-04


In [4]:
transcript_group.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102000 entries, 0 to 101999
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   id                102000 non-null  int64   
 1   wave              102000 non-null  int64   
 2   received          102000 non-null  bool    
 3   viewed            102000 non-null  bool    
 4   completed         102000 non-null  bool    
 5   amount            102000 non-null  float64 
 6   reward            102000 non-null  float64 
 7   non_offer_amount  102000 non-null  float64 
 8   mapped_offer      102000 non-null  category
 9   offer_type        102000 non-null  category
 10  difficulty        102000 non-null  float64 
 11  duration          102000 non-null  float64 
 12  web               102000 non-null  bool    
 13  email             102000 non-null  bool    
 14  mobile            102000 non-null  bool    
 15  social            102000 non-null  bool    
 16  ge

## Q1: Which offer yields the best results?

### Metrics

To answer this question, we first need to choose which metrics to measure. We can look at two: **Incremental Response Rate (IRR)** and **Net Incremental Revenue (NIR)**.

**Incremental Response Rate (IRR)**

IRR depicts how many more customers purchased the product with the promotion, as compared to if they didn't receive the promotion. Mathematically, it's the ratio of the number of purchasers in the promotion group to the total number of customers in the promotion group (_treatment_), minus the ratio of the number of purchasers in the non-promotional group to the total number of customers in the non-promotional group (_control_).

$$ IRR = \frac{purch_{treat}}{cust_{treat}} - \frac{purch_{ctrl}}{cust_{ctrl}} $$
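
As a minimal sketch of the formula above, the function below computes IRR from four counts. The function name and the example counts are hypothetical and chosen only for illustration; they are not taken from `transcript_group`.

```python
def incremental_response_rate(purch_treat, cust_treat, purch_ctrl, cust_ctrl):
    """Return IRR = purch_treat / cust_treat - purch_ctrl / cust_ctrl."""
    if cust_treat <= 0 or cust_ctrl <= 0:
        raise ValueError("Both groups must contain at least one customer.")
    return purch_treat / cust_treat - purch_ctrl / cust_ctrl


# Hypothetical counts, for illustration only (not from the dataset):
# 300 of 1000 treated customers purchased, 200 of 1000 control customers purchased.
irr = incremental_response_rate(
    purch_treat=300, cust_treat=1000, purch_ctrl=200, cust_ctrl=1000
)
print(f"IRR = {irr:.3f}")
```

A positive IRR suggests the promotion group purchased at a higher rate than the control group; a value near zero suggests the promotion made little difference.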


**Net Incremental Revenue (NIR)**

NIR depicts how much is made (or lost) by sending out the promotion. Mathematically, this is 10 times the total number of purchasers that received the promotion minus 0.15 times the number of promotions sent out, minus 10 times the number of purchasers who were not given the promotion.

$$ NIR = (10\cdot purch_{treat} - 0.15 \cdot cust_{treat}) - 10 \cdot purch_{ctrl}$$
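
The NIR formula can be sketched the same way. Here the constants 10 and 0.15 from the formula are exposed as parameters named `revenue_per_purchase` and `cost_per_promotion`; those names are an interpretation added for readability, and the example counts are hypothetical.

```python
def net_incremental_revenue(
    purch_treat,
    cust_treat,
    purch_ctrl,
    revenue_per_purchase=10.0,  # the factor 10 in the formula (name is an assumption)
    cost_per_promotion=0.15,    # the factor 0.15 in the formula (name is an assumption)
):
    """Return NIR = (10 * purch_treat - 0.15 * cust_treat) - 10 * purch_ctrl."""
    treatment_net = revenue_per_purchase * purch_treat - cost_per_promotion * cust_treat
    control_revenue = revenue_per_purchase * purch_ctrl
    return treatment_net - control_revenue


# Hypothetical counts, for illustration only (not from the dataset).
nir = net_incremental_revenue(purch_treat=300, cust_treat=1000, purch_ctrl=200)
print(f"NIR = {nir:.2f}")
```

A positive NIR means the promotion brought in more than it cost relative to the control group; a negative value means it lost money.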

We'll use these two metrics to determine the success of an offer.

#### Invariant Metrics

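Before we can proceed with our analysis, let us first check whether the control and treatment groups are balanced in size, that is, whether any difference between the number of customers in each group is statistically significant.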

$$ H_{0}: N_{ctrl} - N_{treat} = 0 $$

$$ H_{1}: N_{ctrl} - N_{treat} \neq 0 $$
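
One way to test these hypotheses is a two-sided z-test on the group sizes. The sketch below assumes that, under $H_{0}$, each customer is equally likely to land in either group, so $N_{ctrl}$ follows a binomial distribution with $p = 0.5$, and it uses the normal approximation to that binomial. The function name and the example counts are hypothetical, and this is not necessarily the test the rest of the analysis uses.

```python
import math


def group_balance_p_value(n_ctrl, n_treat):
    """Two-sided p-value for H0: N_ctrl - N_treat = 0.

    Assumes N_ctrl ~ Binomial(n, 0.5) under H0 and applies the
    normal approximation (no continuity correction).
    """
    n_total = n_ctrl + n_treat
    if n_total <= 0:
        raise ValueError("At least one customer is required.")
    expected = 0.5 * n_total                 # mean of Binomial(n, 0.5)
    std_dev = math.sqrt(n_total * 0.25)      # sqrt(n * p * (1 - p))
    z = (n_ctrl - expected) / std_dev
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical group sizes, for illustration only (not from the dataset).
p_value = group_balance_p_value(n_ctrl=5000, n_treat=5100)
print(f"p-value = {p_value:.4f}")
```

If the p-value is above the chosen significance level (for example 0.05), we fail to reject $H_{0}$ and can treat the group sizes as balanced.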

## Q2: Which demographic is more likely to be influenced by offers?

## Q3: Which type of offer works best for a given demographic?