# CASE STUDY 4 - HYPOTHESIS TESTING

> Done by Jose Johnylal

----

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math

In [2]:
sale = pd.read_csv('Sales_add.csv')

In [3]:
sale.head()

|   | Month | Region | Manager | `Sales_before_digital_add(in $)` | `Sales_After_digital_add(in $)` |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | Month-1 | Region - A | Manager - A | 132921 | 270390 |
| 1 | Month-2 | Region - A | Manager - C | 149559 | 223334 |
| 2 | Month-3 | Region - B | Manager - A | 146278 | 244243 |
| 3 | Month-4 | Region - B | Manager - B | 152167 | 231808 |
| 4 | Month-5 | Region - C | Manager - B | 159525 | 258402 |


In [4]:
sale.isna().sum()

Month                             0
Region                            0
Manager                           0
Sales_before_digital_add(in $)    0
Sales_After_digital_add(in $)     0
dtype: int64

##  Is there any increase in sales after stepping into digital marketing?

#### Background research on the dataset
 

In [5]:
sale.shape

(22, 5)

* Observations 
    - Two columns need to be compared to answer the question: **Sales_before_digital_add** and **Sales_After_digital_add**.
    - Each column is treated as a separate group that is independent of the other.
    - Each group has 22 rows.
    - There are no null values present in the dataset.

#### Constructing the hypothesis

  > * **Null Hypothesis:** There is no difference between the means of the two columns, i.e., x1 = x2
  > * **Alternate Hypothesis:** The mean sales after digital marketing are higher than the mean sales before it, i.e., x1 < x2

Note:- **x1** refers to the mean of 'Sales_before_digital_add' and **x2** refers to the mean of 'Sales_After_digital_add'

#### Testing the hypothesis with an experiment

Since the sample size for each group is less than 30, we conduct a **one-tailed, two-sample independent t-test.**

Degrees of freedom = 22 + 22 - 2 = 42

Significance level: 5%

Critical value: approximately -1.682 (lower tail, because the alternate hypothesis is x1 < x2)

Let, 
* x1 be the mean of 'Sales_before_digital_add'
* x2 be the mean of 'Sales_After_digital_add'
* s1 be the variance of 'Sales_before_digital_add'
* s2 be the variance of 'Sales_After_digital_add'
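
With the symbols defined above, the equal-variance two-sample statistic is usually written as shown here. This is the textbook form rather than anything specific to this notebook; $n_1 = n_2 = 22$ are the group sizes, and $s_p^2$ is a name introduced here for the pooled (common) variance. Note that $s_1$ and $s_2$ already denote variances, so they are not squared.

$$
s_p^2 = \frac{(n_1 - 1)\,s_1 + (n_2 - 1)\,s_2}{n_1 + n_2 - 2},
\qquad
t = \frac{x_1 - x_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
$$

With equal group sizes the pooled variance reduces to the simple average $(s_1 + s_2)/2$.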

In [6]:
x1 = sale['Sales_before_digital_add(in $)'].mean()
x2 = sale['Sales_After_digital_add(in $)'].mean()
print('The mean of Sales_before_digital_add(in $) is ', x1)
print('The mean of Sales_After_digital_add(in $) is', x2)

The mean of Sales_before_digital_add(in $) is  149239.95454545456
The mean of Sales_After_digital_add(in $) is 231123.72727272726


In [7]:
s1 = sale['Sales_before_digital_add(in $)'].var()
s2 = sale['Sales_After_digital_add(in $)'].var()
print('The variance of Sales_before_digital_add(in $) is ', s1)
print('The variance of Sales_After_digital_add(in $) is', s2)

The variance of Sales_before_digital_add(in $) is  220345610.2359307
The variance of Sales_After_digital_add(in $) is 653148853.7316018


In [8]:
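n1 = n2 = len(sale)  # 22 observations in each group
# Pooled variance: each sample variance is weighted by its degrees of freedom (n - 1)
S = ((n1 - 1)*s1 + (n2 - 1)*s2)/(n1 + n2 - 2)
print('The common variance is', round(S, 2))

The common variance is 436747231.98


In [9]:
S = math.sqrt(S)  # pooled standard deviation
l = math.sqrt(1/n1 + 1/n2)  # equals sqrt(1/11) for two groups of 22
t = (x1 - x2)/(S * l)
print('The t-test score is', round(t, 4))

The t-test score is -12.9951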


#### Analyzing and reporting the result of the hypothesis testing

Since the t-test score lies far below the critical value, it falls in the rejection region of the lower-tailed test, so **the null hypothesis is rejected.**

The strongly negative t-test score means that x1 is significantly smaller than x2, which shows that there is an **increase in sales after the company has stepped into digital marketing.** 
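
As a cross-check, the decision can be recomputed from the summary statistics alone, without the CSV file. The sketch below is a minimal standard-library version. It assumes the means and variances printed earlier in the notebook and an approximate lower-tail critical value for 42 degrees of freedom, and it weights each sample variance by n - 1, which is the standard pooled-variance convention. The helper name `pooled_t` is hypothetical.

```python
import math

# Summary statistics copied from the notebook output above.
n1 = n2 = 22
x1, x2 = 149239.95454545456, 231123.72727272726  # sample means
s1, s2 = 220345610.2359307, 653148853.7316018    # sample variances (pandas .var(), ddof=1)


def pooled_t(mean1, var1, size1, mean2, var2, size2):
    """Two-sample t statistic under the equal-variance assumption."""
    pooled_var = ((size1 - 1) * var1 + (size2 - 1) * var2) / (size1 + size2 - 2)
    std_err = math.sqrt(pooled_var * (1 / size1 + 1 / size2))
    return (mean1 - mean2) / std_err


t_stat = pooled_t(x1, s1, n1, x2, s2, n2)
critical = -1.682  # assumed: approximate 5% lower-tail point for 42 degrees of freedom
reject_null = t_stat < critical  # lower-tailed test: reject when t is below the critical value

print("t statistic:", round(t_stat, 3))
print("reject null hypothesis:", reject_null)
```

Keeping the computation in a function makes it easy to rerun with other summary statistics, for example for a single region.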

## Is there any dependency between the features 'Region' and 'Manager'?

|                  | Manager - A  | Manager - B | Manager - C | **Total** |
| :---:            |     :---:      |    :---:      | :---:         |  :---:  |
| **Region - A**     |    1,030,437   | 656,832       | 701,262       | 2,388,531 |
| **Region - B**     |    939,851     | 231,808       | 429,436       | 1,601,095 |
| **Region - C**     |    229,336     |  643,654      |  222,106      |  1,095,096 |
| **Total**        |   2,199,624    |  1,532,294    |  1,352,804    |  5,084,722 |

The above table shows the sum of all sales (after the company has stepped into digital marketing) for each manager in each region.
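
The sketch below illustrates how such a table is built by grouping on the pair (region, manager) and summing the sales. It assumes only the five rows shown by `sale.head()` rather than the full 22-row file, so most totals are partial. The names `rows` and `totals` are illustrative.

```python
from collections import defaultdict

# The five rows shown by sale.head(): (region, manager, sales after digital marketing).
rows = [
    ("Region - A", "Manager - A", 270390),
    ("Region - A", "Manager - C", 223334),
    ("Region - B", "Manager - A", 244243),
    ("Region - B", "Manager - B", 231808),
    ("Region - C", "Manager - B", 258402),
]

# Sum the sales for every (region, manager) combination.
totals = defaultdict(int)
for region, manager, sales_after in rows:
    totals[(region, manager)] += sales_after

for (region, manager), total in sorted(totals.items()):
    print(region, manager, total)
```

In the notebook itself, a call such as `pd.crosstab(sale['Region'], sale['Manager'], values=sale['Sales_After_digital_add(in $)'], aggfunc='sum', margins=True)` should produce the full table, including the totals.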


To find out whether there is a dependency between the features 'Region' and 'Manager', we conduct the chi-square test for independence. The null hypothesis is that the two features are independent.

* Degrees of freedom: (3 - 1) × (3 - 1) = 4
* Significance level: 0.05
* Critical value: 9.488
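
For reference, the test statistic compares each observed cell $O_{ij}$ with the value $E_{ij}$ expected under independence. The symbols $R_i$ (row total), $C_j$ (column total) and $N$ (grand total) are introduced here for illustration and refer to the totals in the table above.

$$
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

For example, for Region - A and Manager - A: $E_{11} = \dfrac{2{,}388{,}531 \times 2{,}199{,}624}{5{,}084{,}722} \approx 1{,}033{,}266$.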

| Observed (O) | Expected (E) | (O - E)² | Chi-square value, (O - E)² / E |
| :---: | :---: | :---: | :---: |
| 1,030,437 | 1,033,266 | 8,003,241 | 7.746 |
| 656,832 | 719,790 | 3,963,709,764 | 5,506.759 |
| 701,262 | 635,475 | 4,327,929,369 | 6,810.542 |
| 939,851 | 692,625 | 61,120,695,076 | 88,245.003 |
| 231,808 | 482,494 | 62,843,470,596 | 130,247.155 |
| 429,436 | 425,976 | 11,971,600 | 28.104 |
| 229,336 | 473,733 | 59,729,893,609 | 126,083.456 |
| 643,654 | 330,010 | 98,372,558,736 | 298,089.630 |
| 222,106 | 291,353 | 4,795,147,009 | 16,458.204 |
| **Total** |  |  | **671,476.599** |

Since 671,476.599 > 9.488, we reject the null hypothesis of independence and conclude that **there is a dependency between 'Region' and 'Manager'.**
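
The expected values and the chi-square total in the table above can be reproduced with a few lines of standard-library Python. This is a minimal sketch that assumes the observed sums from the Region by Manager table. Because it keeps the expected values unrounded, its total may differ from the table in the last digits.

```python
# Observed sales sums copied from the Region x Manager table above.
observed = [
    [1030437, 656832, 701262],  # Region - A
    [939851, 231808, 429436],   # Region - B
    [229336, 643654, 222106],   # Region - C
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected value under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all nine cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)
critical = 9.488  # critical value quoted above for 4 degrees of freedom at the 5% level

print("degrees of freedom:", dof)
print("chi-square statistic:", round(chi_square, 3))
print("reject independence:", chi_square > critical)
```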