# Case Study on Testing of Hypothesis 

### Import required libraries

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from scipy.stats import chi2_contingency

### Read the data

In [2]:
sales_data=pd.read_csv(r"C:\Users\aksmk\OneDrive\Desktop\DSA\Datasets\Sales_add.csv")
sales_data.head()

Unnamed: 0,Month,Region,Manager,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
0,Month-1,Region - A,Manager - A,132921,270390
1,Month-2,Region - A,Manager - C,149559,223334
2,Month-3,Region - B,Manager - A,146278,244243
3,Month-4,Region - B,Manager - B,152167,231808
4,Month-5,Region - C,Manager - B,159525,258402


## Clarify whether there is any increase in sales after stepping into digital marketing

### State the null and alternative hypotheses

Null hypothesis, H0: Digital marketing has no effect on sales, i.e. mean sales after stepping into digital marketing equal mean sales before it.

Alternative hypothesis, HA: Sales increase after stepping into digital marketing, i.e. mean sales after are higher than mean sales before.

### Identify the test statistic

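A t-test checks whether the difference between two sample means is larger than chance alone would explain, and it suits small samples where the population variance is unknown. We have only 22 monthly observations, so we use a t-test. Note that `ttest_ind` treats the before and after columns as independent samples and returns a two-sided p-value. Because both columns describe the same months, a paired test (`scipy.stats.ttest_rel`) would arguably match the design more closely.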

In [3]:
t_stat, p_value = ttest_ind(
    sales_data['Sales_before_digital_add(in $)'],
    sales_data['Sales_After_digital_add(in $)'],
)
p_value

2.614368006904645e-16

Set the level of significance to 5%, i.e. alpha = 0.05.

In [4]:
alpha=0.05

In [5]:
if p_value<alpha:
    print('Reject null hypothesis')
else:
    print('Fail to reject null hypothesis')

Reject null hypothesis


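The p-value is far below alpha, so we reject the null hypothesis at the 5% significance level. The sales after the change are also higher than the sales before it in the rows shown above, so we conclude that sales increased significantly after stepping into digital marketing.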
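
Since each row records the same month before and after the change, a paired, one-sided test is a natural cross-check on this result. The sketch below is illustrative only: it uses just the five rows printed by `head()` rather than the full dataset, and it assumes SciPy 1.6 or later for the `alternative` argument of `ttest_rel`. The variable names are hypothetical and not part of the original notebook.

```python
import numpy as np
from scipy.stats import ttest_rel

# The five rows shown by sales_data.head(); NOT the full dataset.
before = np.array([132921, 149559, 146278, 152167, 159525])
after = np.array([270390, 223334, 244243, 231808, 258402])

# Paired test on the month-by-month differences.
# alternative='greater' tests H_A: mean(after - before) > 0.
t_stat_paired, p_value_paired = ttest_rel(after, before, alternative='greater')

alpha = 0.05
if p_value_paired < alpha:
    print('Reject null hypothesis')
else:
    print('Fail to reject null hypothesis')
```

On the full data, the same call would take the two sales columns of `sales_data` in place of the hard-coded arrays.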

## Check whether there is any dependency between the features 'Region' and 'Manager'

### State the null and alternative hypotheses

Null hypothesis, H0: The features 'Region' and 'Manager' are independent.

Alternative hypothesis, HA: The features 'Region' and 'Manager' are not independent.

### Identify the test statistic

Both features are qualitative (categorical), so we use the chi-square test of independence.

### 1. Using count

We cross-tabulate the 'Region' and 'Manager' counts and then run the chi-square test on that table.

In [6]:
pd.crosstab(sales_data.Region,sales_data.Manager)

Region,Manager - A,Manager - B,Manager - C
Region - A,4,3,3
Region - B,4,1,2
Region - C,1,3,1


In [7]:
chi_square_args = pd.crosstab(sales_data.Region, sales_data.Manager).values
chi_stat, p_value, dof, expected = chi2_contingency(chi_square_args)
p_value

0.5493991051158094

The level of significance is again 5%, i.e. alpha = 0.05.

In [8]:
if p_value<alpha:
    print('Reject null hypothesis')
else:
    print('Fail to reject null hypothesis')

Fail to reject null hypothesis


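The test fails to reject the null hypothesis, so the counts give no evidence of a dependency between 'Region' and 'Manager' at the 5% significance level. This is not proof of independence: with only 22 observations, every expected cell count is below 5, which makes the chi-square approximation unreliable.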
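
The expected frequencies returned by `chi2_contingency` show how thin the data is for this test. The following is a minimal sketch that re-enters the counts from the cross-tabulation by hand, so that it runs without the CSV file; the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the Region x Manager cross-tabulation.
observed = np.array([
    [4, 3, 3],   # Region - A
    [4, 1, 2],   # Region - B
    [1, 3, 1],   # Region - C
])

chi_stat_counts, p_value_counts, dof_counts, expected_counts = chi2_contingency(observed)

# Expected counts under independence: row total * column total / grand total.
print(np.round(expected_counts, 2))
print('Cells with expected count < 5:', int((expected_counts < 5).sum()))
```

A common rule of thumb asks for expected counts of at least 5 in most cells. When that does not hold, an exact or permutation-based test is usually considered instead.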

### 2. Using sum of sales

We cross-tabulate the total sales for each 'Region' and 'Manager' combination and then run the chi-square test on that table. The before and after sales sit in separate columns, so we first reshape the data into long format with the melt function.

In [9]:
sales_data_new = pd.melt(
    sales_data,
    id_vars=['Manager', 'Region'],
    value_vars=['Sales_before_digital_add(in $)', 'Sales_After_digital_add(in $)'],
    value_name='Sales',
)
sales_data_new.head()

Unnamed: 0,Manager,Region,variable,Sales
0,Manager - A,Region - A,Sales_before_digital_add(in $),132921
1,Manager - C,Region - A,Sales_before_digital_add(in $),149559
2,Manager - A,Region - B,Sales_before_digital_add(in $),146278
3,Manager - B,Region - B,Sales_before_digital_add(in $),152167
4,Manager - B,Region - C,Sales_before_digital_add(in $),159525


In [10]:
sales_data_new.tail() 

Unnamed: 0,Manager,Region,variable,Sales
39,Manager - B,Region - C,Sales_After_digital_add(in $),191517
40,Manager - A,Region - B,Sales_After_digital_add(in $),227040
41,Manager - B,Region - A,Sales_After_digital_add(in $),212579
42,Manager - A,Region - B,Sales_After_digital_add(in $),263388
43,Manager - C,Region - A,Sales_After_digital_add(in $),243020


In [11]:
df = pd.crosstab(
    sales_data_new['Region'],
    sales_data_new['Manager'],
    values=sales_data_new['Sales'],
    aggfunc='sum',
)
df

Region,Manager - A,Manager - B,Manager - C
Region - A,1624951,1123683,1121946
Region - B,1510751,383975,760034
Region - C,376799,1113131,352731


In [12]:
chi_stat,p_value,dof,expected=chi2_contingency(df)
p_value

0.0

In [13]:
if p_value<alpha:
    print('Reject null hypothesis')
else:
    print('Fail to reject null hypothesis')

Reject null hypothesis


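The p-value is below alpha, so we reject the null hypothesis at the 5% significance level: total sales are not spread across managers in the same proportions in every region. Treat this result with caution, because the chi-square test assumes that the cells hold frequency counts. With dollar totals the statistic grows with the unit of measurement, so a p-value of 0.0 reflects the size of the numbers as much as any real dependency.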
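
To see why the unit of measurement matters, the sketch below runs the same test on the sales totals expressed in dollars and in thousands of dollars. It re-enters the totals from the cross-tabulation by hand, and the variable names are hypothetical. The proportions in the table are identical in both runs, yet the statistic is not.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Sales totals copied from the Region x Manager cross-tabulation (in $).
sales_dollars = np.array([
    [1624951, 1123683, 1121946],   # Region - A
    [1510751, 383975, 760034],     # Region - B
    [376799, 1113131, 352731],     # Region - C
], dtype=float)

# Same table, once in dollars and once in thousands of dollars.
chi_dollars = chi2_contingency(sales_dollars)[0]
chi_thousands = chi2_contingency(sales_dollars / 1000)[0]

# The statistic scales linearly with the unit, so the ratio is about 1000.
print(chi_dollars / chi_thousands)
```

Because the statistic can be made larger or smaller just by changing units, a comparison of group means (for example a two-way ANOVA on the monthly sales) may be a better fit for this question than a chi-square test on sums.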