# 1. Objective:

Important statistical concepts that we will learn in this notebook are:

- Variability in Mean
- Sampling Distribution and Bootstrap Resampling
- Central Limit Theorem
- Hypothesis Testing
- One-Sample and Two-Sample Tests

The above concepts are very important from the perspective of exploratory data analysis and A/B Testing.

# 2. Variability in Statistics

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

import warnings
warnings.filterwarnings('ignore')

In [None]:
DATA_PATH = 'https://raw.githubusercontent.com/manaranjanp/MLCourseV1/main/Session_3/'

In [None]:
cust_df = pd.read_csv(DATA_PATH+'Customers.csv')

In [None]:
cust_df.info()

In [None]:
cust_df.shape

## 2.1 Sample Distribution

- Sample distribution is the distribution of the sample taken from the population.

In [None]:
sn.kdeplot(data = cust_df,
           x = 'Income');

In [None]:
cust_df.Income.mean()

In [None]:
cust_df.Income.std()

## 2.2 Variability in Mean:

- What would the mean have been if we had drawn a different sample?
- How much would the mean have varied from sample to sample?

## 2.2 Resample from the observed data

- Bootstrap Resampling (with replacement)

The process of taking repeated samples with replacement from observed data.

<img src="resampling.png" alt="Normal Distribution" width="300"/>

In [None]:
sample_1 = cust_df['Income'].sample(500, replace=True)

In [None]:
sample_1.mean()

In [None]:
sample_means = []

for i in range(200):
    samp = cust_df['Income'].sample(2000, replace = True)
    sample_means.append(samp.mean())    

In [None]:
sn.kdeplot(sample_means);

## 2.3 Sampling Distribution

- The sampling distribution considers the distribution of sample statistics, for example, mean.
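
The Central Limit Theorem, listed in the objectives, is the reason the sampling distribution of the mean tends to look bell-shaped even when the underlying data do not. The sketch below illustrates this under stated assumptions: the population is synthetic (exponential, strongly right-skewed), and the names `population`, `sample_size` and `means` are hypothetical and not part of the course data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic, right-skewed "population" (assumption: not the Customers data)
population = rng.exponential(scale=2.0, size=100_000)

sample_size = 200
# Draw 1000 samples with replacement and record each sample mean
means = np.array([
    rng.choice(population, size=sample_size, replace=True).mean()
    for _ in range(1000)
])

pop_skew = stats.skew(population)    # skewness of the raw data
means_skew = stats.skew(means)       # skewness of the sample means

print("Skewness of population  :", round(pop_skew, 2))
print("Skewness of sample means:", round(means_skew, 2))
print("Std of sample means     :", round(means.std(ddof=1), 4))
print("sigma / sqrt(n)         :", round(population.std(ddof=1) / np.sqrt(sample_size), 4))
```

The sample means should be far less skewed than the raw data, and their spread should be close to $\sigma / \sqrt{n}$.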


<img src="samplingdist.png" alt="Normal Distribution" width="400"/>

In [None]:
np.array(sample_means).mean()

In [None]:
np.array(sample_means).std()

## 2.3 Standard Error

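The **standard error (SE)** of a statistic is the standard deviation of its sampling distribution. In this case, it is the standard deviation of the distribution of sample means. In the formula below, $\sigma$ is the standard deviation of the data and $n$ is the sample size.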

$$ SE = {\sigma \over \sqrt n} $$
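
As a rough check of this formula, the analytic standard error can be compared with the standard deviation of bootstrap means. This is a minimal sketch on synthetic data (an assumption, not the Customers data); `data`, `analytic_se` and `bootstrap_se` are hypothetical names. Note that each bootstrap sample has the same size `n` as the data, so both quantities refer to the same sample size.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic observed sample (assumption: stands in for a column such as Income)
data = rng.normal(loc=100.0, scale=15.0, size=2000)
n = len(data)

# Analytic standard error: sigma / sqrt(n), with sigma estimated from the sample
analytic_se = data.std(ddof=1) / np.sqrt(n)

# Bootstrap standard error: std of the means of resamples of size n
boot_means = [
    rng.choice(data, size=n, replace=True).mean()
    for _ in range(1000)
]
bootstrap_se = np.std(boot_means, ddof=1)

print("Analytic SE :", round(analytic_se, 4))
print("Bootstrap SE:", round(bootstrap_se, 4))
```

The two values should agree closely. If the resamples were smaller or larger than `n`, the bootstrap spread would correspond to that resample size instead.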

In [None]:
cust_df.Income.std()/np.sqrt(len(cust_df))

In [None]:
from scipy import stats

In [None]:
mean_interval = stats.norm.interval(0.95,
                    np.array(sample_means).mean(),
                    np.array(sample_means).std())

np.round(mean_interval, 2)

#### Note:

We can be 95% confident that the actual population mean is between **108692.5** and **112757.72**.
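
The interval from `stats.norm.interval` assumes the sampling distribution is approximately normal. A common alternative is the percentile bootstrap interval, which reads the 2.5th and 97.5th percentiles directly from the bootstrap means. The sketch below compares the two on synthetic data (an assumption); all variable names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic observed sample (assumption: not the course data)
data = rng.normal(loc=50.0, scale=10.0, size=500)

# Bootstrap distribution of the sample mean
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(2000)
])

# Normal-approximation interval: mean +/- z * SE
normal_ci = stats.norm.interval(0.95,
                                loc=boot_means.mean(),
                                scale=boot_means.std(ddof=1))

# Percentile interval: taken directly from the bootstrap means
percentile_ci = np.percentile(boot_means, [2.5, 97.5])

print("Normal approximation:", np.round(normal_ci, 2))
print("Percentile bootstrap:", np.round(percentile_ci, 2))
```

When the bootstrap means are roughly symmetric and bell-shaped, the two intervals should be nearly identical.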

# 3. Hypothesis Test

In hypothesis testing, an analyst tests a statistical sample, with the objective of finding evidence on the plausibility of the null hypothesis.

- **Null hypothesis**: There is no relationship between two variables, i.e., no observed effect.

- **Alternative hypothesis**: There is a statistically significant relationship between two measured phenomena, i.e., some observed effect. It is what the researcher tries to prove.

## 3.1 Examples of Hypothesis Test

[Source](https://studiousguy.com/hypothesis-testing-examples-in-real-life/#1_To_Check_the_Manufacturing_Processes)


### Example 1:


**Null Hypothesis**: The average number of defective products produced is the same before and after the implementation of the new manufacturing method.

$$H_{0}: \mu_{before} = \mu_{after}$$

**Alternative Hypothesis**: The average number of defective products produced is different before and after the implementation of the new manufacturing method, i.e., μafter ≠ μbefore

$$H_{1}: \mu_{before} \neq \mu_{after}$$


### Example 2:


**Null Hypothesis**: The average sales are the same before and after the rise in the digital advertisement budget, i.e., μafter = μbefore

$$H_{0}: {AverageSales}_{after} = {AverageSales}_{before}$$

**Alternative Hypothesis**: The average sales increase after the rise in the digital advertisement budget, i.e., μafter > μbefore

$$H_{1}: {AverageSales}_{after} > {AverageSales}_{before}$$

## 3.2 Case -1 : Change in Height

A new study claims that children between the ages of 5 and 10 grow by more than 2.5 inches per year on average.

Existing belief:

Kids tend to get taller at a pretty steady pace, growing about 2.5 inches (6 to 7 centimeters) each year.

[Kidshealth.org](https://kidshealth.org/en/parents/growth-6-12.html#:~:text=Kids%20tend%20to%20get%20taller,per%20year%20until%20puberty%20starts)


**Null Hypothesis**: The average growth in height is 2.5 inches per year, i.e., μ = 2.5

$$H_{0}: \mu_{\text{growth}} = 2.5 \text{ inches}$$

**Alternative Hypothesis**: The average growth in height is more than 2.5 inches per year, i.e., μ > 2.5

$$H_{1}: \mu_{\text{growth}} > 2.5 \text{ inches}$$


In [None]:
heights_df = pd.read_csv(DATA_PATH+"heights_v1.csv")

In [None]:
heights_df.info();

In [None]:
heights_df.sample(10)

In [None]:
plt.figure(figsize=(15, 5))
sn.kdeplot(heights_df.growth);

In [None]:
heights_df.growth.mean()

In [None]:
sample_means = []

for i in range(200):
    samp = heights_df['growth'].sample(100, replace = True)
    sample_means.append(samp.mean())    

In [None]:
plt.figure(figsize=(15, 5))
sn.kdeplot(sample_means, label = 'Observed Growth in Height');
plt.axvline(2.5, color = 'r', label = 'Average Growth in Height');
plt.legend();

## 3.2.1 p-value and alpha value

The **p-value** measures the probability, assuming the null hypothesis is true, of getting a value at least as extreme as the one you got from the experiment.

- In this case, it is the cumulative proportion of the distribution of mean growth in height that lies below 2.5.

**Alpha** (the significance level) is the threshold value that we measure p-values against. It specifies how extreme the observed results must be in order to reject the null hypothesis of a significance test.

- The p-value is less than or equal to alpha. In this case, we reject the null hypothesis. When this happens, we say that the result is statistically significant. In other words, we are reasonably sure that there is something besides chance alone that gave us the observed sample.

- The p-value is greater than alpha. In this case, we fail to reject the null hypothesis. When this happens, we say that the result is not statistically significant. In other words, our observed data could plausibly be explained by chance alone.

[source](https://www.thoughtco.com/)
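
The decision rule above can be sketched in a few lines. This is only an illustration under stated assumptions: `decide` is a hypothetical helper, the growth data are synthetic, and the fraction of bootstrap means below 2.5 follows the informal reading of the p-value given in this section rather than a formal test.

```python
import numpy as np

def decide(p_value, alpha=0.05):
    """Hypothetical helper: apply the p-value vs. alpha rule."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

rng = np.random.default_rng(7)

# Synthetic yearly growth values in inches (assumption: not heights_v1.csv)
growth = rng.normal(loc=2.6, scale=0.5, size=100)

# Bootstrap distribution of the mean growth
boot_means = np.array([
    rng.choice(growth, size=len(growth), replace=True).mean()
    for _ in range(2000)
])

# Share of the bootstrap means that fall below the hypothesized 2.5 inches
frac_below = np.mean(boot_means < 2.5)

print("Fraction of bootstrap means below 2.5:", frac_below)
print("Decision at alpha = 0.05:", decide(frac_below))
```

The same helper applies to a p-value returned by any test, for example `decide(0.33)` gives `"fail to reject H0"`.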


<img src="hypothesis.png" alt="Normal Distribution" width="500"/>

## 3.3 Case - 2 : Change in Height

Suppose we had received the following results instead of the earlier ones.

In [None]:
heights_new_df = pd.read_csv(DATA_PATH+"heights_new_v1.csv")

In [None]:
np.round(heights_new_df.growth.mean(), 2)

In [None]:
sample_means = []

for i in range(200):
    samp = heights_new_df['growth'].sample(100, replace = True)
    sample_means.append(samp.mean())   

plt.figure(figsize=(15, 5))
sn.kdeplot(sample_means, label = 'Observed Growth in Height');
plt.axvline(2.5, color = 'r', label = 'Average Growth in Height');
plt.legend();

## 3.4 One Sample Test

In [None]:
from scipy import stats

In [None]:
stats.ttest_1samp(heights_df['growth'], 2.5)

#### Note:

The null hypothesis is retained as the p-value (0.33) is greater than the alpha value (0.05).

In [None]:
stats.ttest_1samp(heights_new_df['growth'], 2.5)

#### Note:

The null hypothesis is rejected as the p-value is less than alpha value (0.05).
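
To see what `stats.ttest_1samp` computes, the t statistic can be reproduced by hand as $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$. The sketch below uses synthetic growth data (an assumption) and hypothetical variable names. By default `ttest_1samp` reports a two-sided p-value, whereas the alternative hypothesis in this case is one-sided (growth greater than 2.5), so the one-sided p-value is also computed from the t distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic yearly growth values in inches (assumption: not the course data)
growth = rng.normal(loc=2.6, scale=0.5, size=100)
mu0 = 2.5                      # value stated by the null hypothesis
n = len(growth)

# t statistic: distance of the sample mean from mu0 in standard-error units
t_manual = (growth.mean() - mu0) / (growth.std(ddof=1) / np.sqrt(n))

# Two-sided p-value (both tails) and one-sided p-value (upper tail only)
p_two_sided = 2 * stats.t.sf(abs(t_manual), df=n - 1)
p_greater = stats.t.sf(t_manual, df=n - 1)

result = stats.ttest_1samp(growth, mu0)

print("Manual t   :", round(t_manual, 4), " SciPy t:", round(result.statistic, 4))
print("Two-sided p:", round(p_two_sided, 4), " SciPy p:", round(result.pvalue, 4))
print("One-sided p (mean > 2.5):", round(p_greater, 4))
```

Recent SciPy versions also accept `alternative='greater'` in `ttest_1samp` to obtain the one-sided p-value directly.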

## 3.5 Two Sample Test

Company XYZ, which offers services, has upgraded its catalog web page and wants to verify which of the two web pages, the old one or the new one, is better at selling.

The typical sales cycle is lengthy and takes about 2-3 months to complete a sale, so it would take a long time to gather enough sales data to verify which web page is superior. The company therefore decides to measure the impact using a proxy variable: how much time a user spends on the site. The amount of time a user spends on the page is called session duration or session length, and it can be measured in seconds. It may have been established from earlier data that session duration is highly positively correlated with actual sales.

In [None]:
sessions_df = pd.read_csv(DATA_PATH+'sessions_v1.csv')

In [None]:
sessions_df

In [None]:
sn.kdeplot(sessions_df.oldpage, label = 'oldpage');
sn.kdeplot(sessions_df.newpage, label = 'newpage');
plt.legend();

In [None]:
oldpage_mean = []

for i in range(200):
    samp = sessions_df['oldpage'].sample(1000, replace = True)
    oldpage_mean.append(samp.mean())   
    
newpage_mean = []

for i in range(200):
    samp = sessions_df['newpage'].sample(1000, replace = True)
    newpage_mean.append(samp.mean())   

sn.kdeplot(oldpage_mean, label = 'oldpage');
sn.kdeplot(newpage_mean, label = 'newpage');
plt.legend();

In [None]:
stats.ttest_ind(sessions_df['oldpage'],
                sessions_df['newpage'])

#### Note:

The null hypothesis is rejected as the p-value is less than alpha value (0.05).
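
By default `stats.ttest_ind` assumes both groups have equal variance. When that is doubtful, Welch's variant (`equal_var=False`) is a common choice. The sketch below uses synthetic session durations (an assumption, not `sessions_v1.csv`) with hypothetical names, and reproduces the Welch t statistic by hand as the difference in means divided by the standard error of that difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Synthetic session durations in seconds (assumption: not the course data)
oldpage = rng.normal(loc=60.0, scale=20.0, size=1000)
newpage = rng.normal(loc=65.0, scale=25.0, size=1000)

# Welch's t-test: does not assume equal variances in the two groups
welch = stats.ttest_ind(oldpage, newpage, equal_var=False)

# Manual statistic: difference in means / standard error of the difference
se_diff = np.sqrt(oldpage.var(ddof=1) / len(oldpage) +
                  newpage.var(ddof=1) / len(newpage))
t_manual = (oldpage.mean() - newpage.mean()) / se_diff

print("Manual t:", round(t_manual, 4))
print("Welch t :", round(welch.statistic, 4), " p-value:", welch.pvalue)
```

A negative t statistic here means the first group (old page) has the smaller mean.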

# Chi-square Distribution

The chi-square test is a hypothesis test that verifies whether the observed frequencies in the data match the expected frequencies.

The chi-square goodness-of-fit test is used to decide whether a variable is likely to come from a given distribution.

The probability of getting a head or a tail when tossing a fair coin is 0.5. But what if you get 60 heads and 40 tails after 100 tosses? Does that mean the coin is biased towards heads?
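
The coin question can be worked through directly. With 100 tosses of a fair coin we expect 50 heads and 50 tails, so the statistic is $(60-50)^2/50 + (40-50)^2/50 = 4$. The sketch below computes it by hand and with `stats.chisquare`; the variable names are hypothetical.

```python
import numpy as np
from scipy import stats

observed = np.array([60, 40])   # heads, tails from the example above
expected = np.array([50, 50])   # fair coin, 100 tosses

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2_manual = np.sum((observed - expected) ** 2 / expected)

# p-value from the chi-square distribution with (categories - 1) degrees of freedom
p_manual = stats.chi2.sf(chi2_manual, df=len(observed) - 1)

result = stats.chisquare(f_obs=observed, f_exp=expected)

print("Manual statistic:", chi2_manual, " p-value:", round(p_manual, 4))
print("SciPy statistic :", result.statistic, " p-value:", round(result.pvalue, 4))
```

The p-value comes out just under 0.05, so at an alpha of 0.05 this result would narrowly count as evidence that the coin is biased.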

## Example: 

**Null Hypothesis**: $H_{0}$: The proportion of customers is the same across all age groups.


**Alternative Hypothesis**: $H_{1}$: The proportion of customers is not the same across all age groups.

In [None]:
visitors_df = pd.read_csv(DATA_PATH+'visitors_v1.csv')

In [None]:
visitors_df.head()

In [None]:
sn.barplot(data = visitors_df,
           x = 'agegroup',
           y = 'count');

In [None]:
observed_frequency = list(visitors_df['count'])

In [None]:
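# Equal expected counts per group; observed and expected totals must match for stats.chisquare
expected_frequency = [sum(observed_frequency) / len(observed_frequency)] * len(observed_frequency)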

$$ \chi^{2} = \sum {{(observed - expected)}^{2} \over {expected}}$$

In [None]:
stats.chisquare(f_obs= observed_frequency,   # Array of observed counts
                f_exp= expected_frequency)   # Array of expected counts

#### Note:

The null hypothesis is retained as the p-value is greater than the alpha value (0.05), i.e., there is no significant difference in the frequency of visitors from different age groups.