In [2]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
from scipy.stats import norm

## Confidence interval of population mean

Now that the surveying agency has plotted the sampling distribution of the mean, the next task is to find the confidence interval of the population mean at the 95% confidence level.

The sampling distribution of the mean is approximately normal because the agency is working with a large sample.

Let us see how the agency locates two points (p1, p2) on the sampling distribution curve such that 95% of the total area under the curve lies between them. This is shown in the diagram below.

![](4_ci.PNG)

We can see that:
- Point p1 encloses 2.5% of the total area under the curve to its left
- Point p2 encloses 95+2.5 = 97.5% of the total area under the curve to its left. 


In [3]:
p1 = norm.ppf(0.025)
p1

-1.9599639845400545

In [4]:
p2 = norm.ppf(0.975)
p2

1.959963984540054

The output of the `norm.ppf()` function suggests that:

| Point | Position                                          |
|-------|---------------------------------------------------|
| p1    | 1.96 std deviations away to the left of the mean  |
| p2    | 1.96 std deviations away to the right of the mean |

The agency has already estimated the mean and the standard error of the sampling distribution of the mean. Points p1 and p2 can therefore be found as shown below.

![](4_ci2.PNG)

Thus, the surveying agency has successfully found the confidence interval of population mean to be (408.755, 417.795).
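The calculation in the diagram can be sketched in code. This is a minimal sketch that assumes the standard error is $s/\sqrt{n}$ and reuses the sample mean (413.2759), sample variance (5956.749), and sample size (1120) reported in this notebook; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Values reported in this notebook (variable names are illustrative)
sample_mean = 413.2759
sample_var = 5956.749
n = 1120

# Assumed standard error of the mean: s / sqrt(n)
std_error = np.sqrt(sample_var / n)

# Critical Z values that leave 2.5% of the area in each tail
z_low, z_high = norm.ppf(0.025), norm.ppf(0.975)

# Move z standard errors to the left and right of the sample mean
ci_lower = sample_mean + z_low * std_error
ci_upper = sample_mean + z_high * std_error
print(ci_lower, ci_upper)
```

The result should agree with the interval quoted above up to rounding. Calling `norm.interval(0.95, loc=sample_mean, scale=std_error)` gives the same pair of points in one step.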

__Solution to Challenge 3.1:__ The surveying agency has estimated the population mean, i.e. the mean mark that would have been scored in the surprise test if all the students in the country had taken it.
- The point estimate of the population mean was found to be 413.2759.
- The 95% confidence interval was found to be (408.755, 417.795).

### Confidence interval of population mean: Small sample

As discussed earlier, if the sample size is less than 30 and the population standard deviation is unknown, the sampling distribution of the mean follows a t distribution with n-1 degrees of freedom, provided the population is approximately normal.

Thus, when dealing with small samples (n < 30) and an unknown population standard deviation, the confidence interval is given by:

$$ \text{CI of population mean} = \bar{X} \pm t_{critical} \cdot \frac{s}{\sqrt{n}} $$

Here $t_{critical}$ is the critical t value for the chosen confidence level, $s$ is the sample standard deviation, and $n$ is the sample size.

Just as we used the `norm.ppf()` function to find the critical Z value, we can use the `t.ppf()` function from `scipy.stats` to find the critical t value.
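As a minimal sketch of the small-sample case, the block below computes a 95% confidence interval with `t.ppf()`. The `marks` array is hypothetical data invented for illustration and is not taken from the agency's survey.

```python
import numpy as np
from scipy.stats import t

# Hypothetical small sample of test marks (n < 30), for illustration only
marks = np.array([412, 398, 455, 380, 421, 407, 433, 390, 415, 402])

n = marks.size
sample_mean = marks.mean()
s = marks.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# Critical t value for a two-sided interval with n - 1 degrees of freedom
confidence = 0.95
t_critical = t.ppf(1 - (1 - confidence) / 2, df=n - 1)

# Margin of error and the resulting interval
margin = t_critical * s / np.sqrt(n)
ci = (sample_mean - margin, sample_mean + margin)
print(t_critical, ci)
```

With 9 degrees of freedom the critical t value is larger than the critical Z value of 1.96, so the interval is wider than a Z-based interval would be for the same data.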

### Confidence interval of population mean: Large sample

We have seen how the surveying agency has calculated the confidence interval of population mean by analyzing a sample of 1120 students and without the knowledge of population parameters. We can summarize the formula for finding the confidence interval of population mean as shown below.

For a large sample (n >= 30), when the population standard deviation is __unknown__, the confidence interval of the population mean is given by:

$$ \text{CI of population mean} = \bar{X} \pm Z_{critical} \cdot \frac{s}{\sqrt{n}} $$

where $\bar{X}$ is the sample mean,  
$Z_{critical}$ is the critical Z value for the specified confidence level,  
$s$ is the sample standard deviation, and  
$n$ is the sample size.  


For a large sample (n >= 30), when the population standard deviation is __known__, the confidence interval of the population mean is given by:

$$ \text{CI of population mean} = \bar{X} \pm Z_{critical} \cdot \frac{\sigma}{\sqrt{n}} $$

where
$\sigma$ is the population standard deviation.  

The table below gives the formula for confidence interval of mean for all possible conditions. 

![](4_ci3.PNG)

## Point estimate of population variance

### Challenge 3.2: Estimating population variance
After estimating population mean, the surveying agency has to estimate the population variance and standard deviation. The task before the agency is to find out:

- Point estimate of population variance 
- Confidence interval of population variance with 95% confidence level
- Confidence interval of population standard deviation with 95% confidence level

To find point estimate of population variance, the agency needs to:
- find the maximum likelihood estimator of population variance. 
- verify whether the estimator is  unbiased in nature. 

The agency starts its estimation by finding the maximum likelihood estimator of population variance in the same way it found the maximum likelihood estimator of population mean.

The agency takes the likelihood function for normally distributed data and finds the condition under which it attains its maximum value. It then finds the value of the population variance ($\sigma^2$) that satisfies this condition, treating $\mu$ as a constant.

This value is called the maximum likelihood estimate of the population variance.

By doing so, the agency finds the maximum likelihood estimate of the population variance to be

$$ \frac{1}{n}\sum_{i=1}^n(X_i - \bar{X})^2 $$

where  
$n$ - sample size  
$\bar{X}$ - sample mean

The next task before the agency is to verify whether this estimator is unbiased. This can be done by finding the expected value of the above quantity and checking whether it is equal to the population variance $\sigma^2$.

#### Unbiased estimator of population variance

![](4_var.PNG)

![](4_var2.PNG)

Thus the surveying agency concludes that __maximum likelihood estimator of population variance is a biased estimator__.

The expected value of the maximum likelihood estimator is $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$, so multiplying the estimator by the correction factor $\frac{n}{n-1}$ removes the bias. The resulting unbiased estimator is

$$ s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2 $$

This is the formula for the __sample variance__.

Hence, the agency concludes that the sample variance $s^2$ is an unbiased estimator of the population variance and can be used as its point estimator.
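The bias can also be checked numerically. The following is a minimal simulation sketch, not the agency's procedure: it draws many small samples from a normal population whose variance is chosen arbitrarily (`sigma2 = 4.0` is an assumption made for the demonstration) and compares the average of the two estimators.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma2 = 4.0         # true population variance (assumed for the simulation)
n = 5                # a small sample size makes the bias easy to see
n_samples = 200_000  # number of repeated samples

# Each row is one sample of size n drawn from a normal population
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_samples, n))

mle_var = samples.var(axis=1, ddof=0)       # divides by n
unbiased_var = samples.var(axis=1, ddof=1)  # divides by n - 1

print(mle_var.mean())       # close to (n - 1) / n * sigma2
print(unbiased_var.mean())  # close to sigma2
```

In NumPy the `ddof` argument selects the denominator: `ddof=0` gives the maximum likelihood estimator and `ddof=1` gives the sample variance.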

The __sample variance__ was calculated by the agency and was found to be __5956.749__, which is used to estimate the population variance. 

After finding the point estimate of population variance, the surveying agency finds the interval estimate or the confidence interval of population variance.



#### Interval estimate of population variance

The surveying agency aims to find the confidence interval of the population variance with 95% confidence level.

Confidence interval of population variance with 95% confidence level can be calculated by locating two points on the sampling distribution of variance curve, that enclose 95% of the total area under the curve. 

Thus, the surveying agency has to plot the sampling distribution of variance. Let's understand some properties of sampling distribution of sample variance.

Recall the formulas for the population variance and the sample variance.

The population variance, represented as $\sigma^2$, is given by:

$$ \sigma^2 = \frac{1}{N}\sum_{i=1}^N(X_i - \mu)^2 $$

The sample variance, represented as $s^2$, is given by:

$$ s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2 $$

where  
$N$ - Population size  
$\mu$ - Population mean    
$n$ - Sample size  
$\bar{X}$ - Sample mean

If we take multiple samples of size n from a population of size N, calculate the sample variance of each sample, and plot these values, the resulting distribution is called the sampling distribution of sample variance for sample size n.

### Properties of sampling distribution of sample variance

Sampling distribution of sample variance has the following properties:
- The mean of the sampling distribution of sample variance is equal to the population variance.
- If the population distribution is normal, then
    $$\frac{(n-1)s^2}{\sigma^2} $$
    
  has a chi-square distribution with n-1 degrees of freedom, where n = sample size, $s^2$ = sample variance and $\sigma^2$ = population variance.
    
- The ratio of two independent chi-square variables, each divided by its degrees of freedom, has an F distribution. Hence the ratio of the sample variances of two independent samples drawn from normal populations with equal variances has an F distribution with $n_1-1$ and $n_2-1$ degrees of freedom, where $n_1$ and $n_2$ are the respective sample sizes.  
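The first two properties can be illustrated with a short simulation. This is a sketch under assumed values (a normal population with mean 50 and variance 9, samples of size 10), none of which come from the agency's data. It relies on the fact that a chi-square variable with k degrees of freedom has mean k and variance 2k.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma2 = 9.0         # population variance (assumed for the simulation)
n = 10               # sample size
n_samples = 100_000  # number of repeated samples

# Each row is one sample of size n from a normal population
samples = rng.normal(loc=50.0, scale=np.sqrt(sigma2), size=(n_samples, n))
s2 = samples.var(axis=1, ddof=1)  # sample variance of each sample

# Property 1: the mean of the sample variances is close to sigma2
mean_of_s2 = s2.mean()

# Property 2: (n - 1) * s^2 / sigma^2 behaves like chi-square with n - 1 df,
# so its mean should be near n - 1 and its variance near 2 * (n - 1)
scaled = (n - 1) * s2 / sigma2
print(mean_of_s2, scaled.mean(), scaled.var())
```

Matching the first two moments does not prove that the distribution is chi-square, but a histogram of `scaled` plotted against `scipy.stats.chi2.pdf` with `n - 1` degrees of freedom would give a fuller visual check.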

Let us use the dice output data from the section on the sampling distribution of the mean to understand the properties of the sampling distribution of sample variance.

