In [50]:
import pandas as pd 
import numpy as np
import scipy.stats as stats

## Importing Data and Modifying it

In [51]:
# Importing data
campaign_df = pd.read_excel("grocery_database.xlsx", sheet_name= "campaign_data")

# Removing rows with the "Control" mailer type
campaign_df = campaign_df[campaign_df.mailer_type != "Control"]
print(campaign_df.head())

   customer_id  campaign_name campaign_date mailer_type  signup_flag
0           74  delivery_club    2020-07-01     Mailer1            1
1          524  delivery_club    2020-07-01     Mailer1            1
2          607  delivery_club    2020-07-01     Mailer2            1
3          343  delivery_club    2020-07-01     Mailer1            0
4          322  delivery_club    2020-07-01     Mailer2            1


# Using the chi-square test of independence

H0: There is no significant relationship between mailer_type and signup_flag.

H1: There is a significant relationship between mailer_type and signup_flag.

In [52]:
contingency_table = pd.crosstab(campaign_df["mailer_type"], campaign_df["signup_flag"])
print(contingency_table)

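# Sign-ups (signup_flag == 1) for each mailer type
contingency_mailer = contingency_table[1]

# Number of customers who received each mailer
d1 = campaign_df[campaign_df.mailer_type == "Mailer1"]
mailer1_count = d1.shape[0]

d2 = campaign_df[campaign_df.mailer_type == "Mailer2"]
mailer2_count = d2.shape[0]

# Look up by label: integer keys on a string-labelled Series rely on deprecated positional fallback
print("The signup rate for mailer1 is: " + str(contingency_mailer.loc["Mailer1"]/mailer1_count))
print("The signup rate for mailer2 is: " + str(contingency_mailer.loc["Mailer2"]/mailer2_count))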

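# chi2_contingency applies Yates' continuity correction by default when dof is 1 (a 2x2 table)
stat,p,dof = stats.chi2_contingency(contingency_table)[0:3]

print("The chisquare stat is: " + str(stat))
# Use the degrees of freedom returned by the test rather than a hard-coded value
print("chi-square critical value is: " + str(stats.chi2.ppf(1-0.05, df=dof)))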
print("The chisquare p-value is: " + str(p))

signup_flag    0    1
mailer_type          
Mailer1      252  123
Mailer2      209  127
The signup rate for mailer1 is: 0.328
The signup rate for mailer2 is: 0.37797619047619047
The chisquare stat is: 1.728424144871394
chi-square critical value is: 3.841458820694124
The chisquare p-value is: 0.1886122739808747
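
To make the reported statistic less of a black box, the sketch below recomputes it by hand from the four counts printed above. It assumes the default behaviour of `stats.chi2_contingency`, which applies Yates' continuity correction to a 2x2 table; the names `observed`, `expected`, `manual_stat` and `manual_p` are illustrative and not part of the original notebook.

```python
import numpy as np
import scipy.stats as stats

# Observed counts copied from the contingency table printed above
observed = np.array([[252, 123],   # Mailer1: no signup, signup
                     [209, 127]])  # Mailer2: no signup, signup

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()

# Expected counts under independence: (row total * column total) / n
expected = row_totals @ col_totals / n

# Yates' continuity correction: shrink each |observed - expected| by 0.5
corrected_diff = np.abs(observed - expected) - 0.5
manual_stat = (corrected_diff ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
manual_p = stats.chi2.sf(manual_stat, df=dof)

# Cross-check against scipy's own result
scipy_stat, scipy_p, scipy_dof, scipy_expected = stats.chi2_contingency(observed)

print("manual stat:", manual_stat, "scipy stat:", scipy_stat)
print("manual p-value:", manual_p, "scipy p-value:", scipy_p)
```

The manual and scipy values should agree, which confirms that the statistic is the sum of corrected squared deviations divided by the expected counts.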


## Conclusion:

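Since the p-value (0.189) is greater than 0.05, or equivalently the chi-square statistic (1.728) is less than the critical value (3.841), we cannot reject the null hypothesis.

Therefore, we have no statistically significant evidence of a relationship between mailer_type and signup_flag. The observed difference in signup rates (32.8% for Mailer1 versus 37.8% for Mailer2) could plausibly be due to chance.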
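
The two decision rules used in the conclusion always agree, because the critical value is simply the statistic at which the p-value equals the significance level. A minimal sketch, assuming a significance level of 0.05 and reusing the statistic printed above (the variable names are illustrative):

```python
import scipy.stats as stats

alpha = 0.05                   # significance level used in this notebook
chi2_stat = 1.728424144871394  # statistic copied from the output above
dof = 1                        # (2 - 1) * (2 - 1) for a 2x2 table

# Rule 1: compare the p-value with alpha
p_value = stats.chi2.sf(chi2_stat, df=dof)
reject_by_p = p_value < alpha

# Rule 2: compare the statistic with the critical value
critical_value = stats.chi2.ppf(1 - alpha, df=dof)
reject_by_stat = chi2_stat > critical_value

print("Reject H0 (p-value rule):", reject_by_p)
print("Reject H0 (critical-value rule):", reject_by_stat)
```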

## Why are we using chi-square distribution in this case?

The chi-square test of independence fits cases where:
1. the variables being compared are categorical (here, mailer_type and signup_flag)
2. all the groups of each categorical variable are mutually exclusive
3. all the observations are independent
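
Conditions 2 and 3 can be partly sanity-checked in code. The sketch below assumes a frame shaped like `campaign_df` with `customer_id` and `mailer_type` columns; the helper `check_assumptions` and the toy data are hypothetical. Independence cannot be fully verified from the data alone, so the one-row-per-customer check is only a rough proxy.

```python
import pandas as pd

def check_assumptions(df: pd.DataFrame) -> dict:
    """Rough checks for conditions 2 and 3 on a frame shaped like campaign_df."""
    # Condition 2: every customer belongs to exactly one mailer_type group
    groups_per_customer = df.groupby("customer_id")["mailer_type"].nunique()
    mutually_exclusive = bool((groups_per_customer == 1).all())

    # Condition 3 (proxy only): no customer contributes more than one row
    one_row_per_customer = not bool(df["customer_id"].duplicated().any())

    return {
        "mutually_exclusive": mutually_exclusive,
        "one_row_per_customer": one_row_per_customer,
    }

# Toy data for illustration only, not taken from grocery_database.xlsx
toy_df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "mailer_type": ["Mailer1", "Mailer2", "Mailer1", "Mailer2"],
    "signup_flag": [1, 0, 0, 1],
})

result = check_assumptions(toy_df)
print(result)
```

On the real data, the same call would be `check_assumptions(campaign_df)`.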

#### Gaussian Distribution

1. To use the Gaussian distribution, we need to check the assumption that our data is normally distributed. Here the outcome, signup_flag, is binary (0 or 1), so it is not.
2. The Gaussian distribution is typically used to compare a variable with a value (e.g., a sample mean compared with a population mean), while in our case we are comparing two groups (Mailer1 and Mailer2).

In [48]:
d1
mailer2_count
contingency_mailer.loc["Mailer1"]

123