# Table of contents

- [01. Project Overview](#overview-main)
- [02. Approach](#approach-main)
    - [Import and Understand Data](#import-data)
    - [Establish Hypotheses and Acceptance Criteria](#hypothesis-acceptance)
    - [Calculate Observed and Expected Frequencies](#calculation)
    - [Results & Interpretation](#results)
    - [Discussion & Conclusion](#discussion)
- [03. Concept Overview](#concept-overview)

# Project Overview <a name="overview-main"></a>

Earlier in the year, our client, a grocery retailer, ran a campaign to promote their new "Delivery Club" - an initiative that costs a customer 100 dollars per year for membership, but offers free grocery deliveries rather than the normal cost of 10 dollars per delivery.

For the campaign promoting the club, customers were randomly assigned to three groups: the first group received a low quality, low cost mailer, the second group received a high quality, high cost mailer, and the third group was a control group that received no mailer at all.

The client knows that customers who were contacted signed up for the Delivery Club at a far higher rate than the control group, but now wants to understand whether there is a significant difference in signup rate between the cheap mailer and the expensive mailer. This will allow them to make more informed decisions in the future, with the overall aim of optimising campaign ROI.



## Import and understand data <a name="import-data"></a>

In [None]:
import pandas as pd
from scipy.stats import chi2_contingency, chi2

In [5]:
campaign_data = pd.read_excel("./grocery_database.xlsx", sheet_name = 'campaign_data')
campaign_data

Unnamed: 0,customer_id,campaign_name,campaign_date,mailer_type,signup_flag
0,74,delivery_club,2020-07-01,Mailer1,1
1,524,delivery_club,2020-07-01,Mailer1,1
2,607,delivery_club,2020-07-01,Mailer2,1
3,343,delivery_club,2020-07-01,Mailer1,0
4,322,delivery_club,2020-07-01,Mailer2,1
...,...,...,...,...,...
865,372,delivery_club,2020-07-01,Mailer2,1
866,104,delivery_club,2020-07-01,Mailer1,1
867,393,delivery_club,2020-07-01,Mailer2,1
868,373,delivery_club,2020-07-01,Control,0


In [6]:
campaign_data.mailer_type.value_counts()

mailer_type
Mailer1    375
Mailer2    336
Control    159
Name: count, dtype: int64

In [7]:
campaign_data['mailer_type'].unique()

array(['Mailer1', 'Mailer2', 'Control'], dtype=object)

In [8]:
# remove customers who were in the control group
campaign_data = campaign_data[campaign_data.mailer_type != "Control"]

In [9]:
campaign_data.mailer_type.value_counts()

mailer_type
Mailer1    375
Mailer2    336
Name: count, dtype: int64

In [10]:
pd.pivot_table(campaign_data, index=['mailer_type'], values=['customer_id'], aggfunc='count')

| mailer_type | customer_id |
|---|---|
| Mailer1 | 375 |
| Mailer2 | 336 |


In [11]:
pd.pivot_table(campaign_data, index = ['mailer_type'], values = ['signup_flag'])

| mailer_type | signup_flag |
|---|---|
| Mailer1 | 0.328 |
| Mailer2 | 0.377976 |


**Mailer 2 appears to have a higher signup rate (37.8%) than Mailer 1 (32.8%). Is that difference statistically significant, or could it be down to random chance?**

### Establish null and alternate hypotheses and acceptance criteria <a name="hypothesis-acceptance"></a>

In [12]:
null_hypothesis = "There is no relationship between mailer type and signup rate. They are independent"
alternate_hypothesis = "There is a relationship between mailer type and signup rate. They are not independent"
acceptance_criteria = 0.05

### Calculate observed frequencies and expected frequencies <a name="calculation"></a>

Note that the observed frequencies are the counts we actually see in the data, that is, the number of customers in each combination of mailer type and signup outcome. The expected frequencies are the counts we would expect to see in each cell if signup were independent of mailer type, based on the row and column totals of all of the data combined.

Expected frequency = (row sum x column sum) / table sum

The code below summarises the dataset to a 2x2 matrix of `signup_flag` by `mailer_type`.

In [14]:
observed_values = pd.crosstab(campaign_data.mailer_type, campaign_data.signup_flag).values
observed_values

array([[252, 123],
       [209, 127]], dtype=int64)
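
As a cross-check on the formula above, the expected frequencies and the chi-square statistic can be worked out by hand from these observed counts. This is a minimal sketch using NumPy (which the notebook itself does not import); the variable names are illustrative, and the counts are re-entered from the crosstab output rather than read from the dataset. The results should agree with what `chi2_contingency` reports when it is called with `correction=False`.

```python
import numpy as np

# Observed counts re-entered from the crosstab output
# rows: Mailer1, Mailer2; columns: signup_flag 0, signup_flag 1
observed = np.array([[252, 123],
                     [209, 127]])

row_sums = observed.sum(axis=1)   # customers per mailer type
col_sums = observed.sum(axis=0)   # customers per signup outcome
table_sum = observed.sum()        # all customers in the table

# Expected frequency = (row sum x column sum) / table sum, for every cell
expected = np.outer(row_sums, col_sums) / table_sum

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi2_by_hand = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi2_by_hand)
```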

In [24]:
chi2_contingency(observed_values, correction=False)

Chi2ContingencyResult(statistic=1.9414468614812481, pvalue=0.16351152223398197, dof=1, expected_freq=array([[243.14345992, 131.85654008],
       [217.85654008, 118.14345992]]))
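
The call above passes `correction=False`. For a 2x2 table (one degree of freedom), `chi2_contingency` applies Yates' continuity correction by default, so it can be worth checking how much that choice matters here. The sketch below re-enters the observed counts and runs both variants side by side; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts re-entered from the crosstab output
observed = np.array([[252, 123],
                     [209, 127]])

# Without Yates' continuity correction (as used in this notebook)
stat_raw, p_raw, dof_raw, _ = chi2_contingency(observed, correction=False)

# With Yates' continuity correction (the SciPy default when dof == 1)
stat_yates, p_yates, dof_yates, _ = chi2_contingency(observed, correction=True)

print(f"uncorrected: statistic={stat_raw:.4f}, p-value={p_raw:.4f}")
print(f"corrected:   statistic={stat_yates:.4f}, p-value={p_yates:.4f}")
```

With these counts the corrected statistic is smaller and its p-value larger, so the decision at an acceptance criteria of 0.05 is the same either way.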

In [41]:
chi2_stats, p_value, dof, exp_val = chi2_contingency(observed_values, correction=False)

In [46]:
# collect the values returned by chi2_contingency rather than hard-coding them
results = {"statistic": chi2_stats, "pvalue": p_value,
           "dof": dof, "expected_freq": exp_val}

for key, val in results.items():
    print(key, val)

# critical value of the chi-square distribution at our acceptance criteria
critical_value = chi2.ppf(1 - acceptance_criteria, dof)
print(critical_value)

statistic 1.9414468614812481
pvalue 0.16351152223398197
dof 1
expected_freq [[243.14345992 131.85654008]
 [217.85654008 118.14345992]]
3.841458820694124


In [47]:
# print results based on p-value
if p_value <= acceptance_criteria:
    print(f"As our p-value of {p_value} is lower than our acceptance_criteria of {acceptance_criteria} - we reject the null hypothesis, and conclude that: {alternate_hypothesis}")
else:
    print(f"As our p-value of {p_value} is higher than our acceptance_criteria of {acceptance_criteria} - we retain the null hypothesis, and conclude that: {null_hypothesis}")


As our p-value of 0.16351152223398197 is higher than our acceptance_criteria of 0.05 - we retain the null hypothesis, and conclude that: There is no relationship between mailer type and signup rate. They are independent


In [49]:
# print results based on chi2-value
if chi2_stats >= critical_value:
    print(f"As our chi-square statistic of {chi2_stats} is higher than our critical value of {critical_value} - we reject the null hypothesis, and conclude that: {alternate_hypothesis}")
else:
    print(f"As our chi-square statistic of {chi2_stats} is lower than our critical value of {critical_value} - we retain the null hypothesis, and conclude that: {null_hypothesis}")


As our chi-square statistic of 1.9414468614812481 is lower than our critical value of 3.841458820694124 - we retain the null hypothesis, and conclude that: There is no relationship between mailer type and signup rate. They are independent
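
The two decision rules above always agree, because they are two views of the same comparison: the p-value is the upper-tail probability of the chi-square distribution at the observed statistic, and the critical value is the point whose upper-tail probability equals the acceptance criteria. The sketch below illustrates this with the statistic copied from the output above; `alpha` and the `reject_by_*` names are illustrative.

```python
from scipy.stats import chi2

alpha = 0.05                      # acceptance criteria used in this notebook
dof = 1                           # degrees of freedom for a 2x2 table
statistic = 1.9414468614812481    # chi-square statistic copied from the output

# p-value: probability of a statistic at least this large under the null
p_value = chi2.sf(statistic, dof)

# critical value: the statistic whose upper-tail probability equals alpha
critical_value = chi2.ppf(1 - alpha, dof)

# Both rules lead to the same decision
reject_by_p = p_value <= alpha
reject_by_stat = statistic >= critical_value

print(p_value, critical_value)
print(reject_by_p, reject_by_stat)
```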


### Conclusion <a name="discussion"></a>

While we saw that the higher cost Mailer 2 had a higher signup rate (37.8%) than the lower cost Mailer 1 (32.8%), it appears that this difference is not significant, at least at our acceptance criteria of 0.05. This does not prove that the two mailers perform identically, only that this sample does not provide enough evidence of a difference.

Without running this Hypothesis Test, the client may have concluded that they should always look to go with higher cost mailers - and from what we've seen in this test, that may not be a great decision. It would result in them spending more, but not necessarily gaining any extra revenue as a result.