# Case study

Each year, the federal government checks the quality of education across the country. A surveying agency has been appointed to carry out this exercise for primary school education. The agency assesses the quality of education by conducting surprise tests for primary school students.

The agency has two options for checking the quality of education:

- Option 1: Conduct a surprise test for every student in the country, analyze their performance, and then draw a conclusion about the quality of primary school education in the country.

- Option 2: Conduct a surprise test for a selected group of students across various schools in the country. Analyze the marks of these selected students and draw inferences about the quality of primary school education in the entire country.


Option 1 is about collecting all the required data and performing statistical operations in order to arrive at conclusions. This type of statistical analysis falls under __descriptive statistics__.

Option 2 deals with collecting a subset of the data, known as a __sample__ in statistical terminology, from the entire data, known as the __population__. The sample is analyzed and conclusions are drawn about the population based on the sample. This type of analysis falls under __statistical inference (also known as inferential statistics)__.

To perform descriptive statistics one needs to analyze the entire population. However, it may not be feasible to collect all the data because of the complexity and cost involved in data collection. For example, it is not feasible to conduct a surprise test for every student across all the schools in the country.

The surveying agency prefers option 2.

The surveying agency uses various statistical inference techniques to draw conclusions about the population, by analyzing the sample. Throughout this course we are going to learn such techniques.

> Statistical inference is about analyzing various sample statistics to determine the population parameters.

| Term      | Denotes                             | Example                              |
|-----------|-------------------------------------|--------------------------------------|
| Statistic | a quantity/property of a sample     | sample mean, sample variance         |
| Parameter | a quantity/property of a population | population mean, population variance |


In the given scenario, the surveying agency

- identifies the sample (the set of students to appear for the surprise test) and conducts the surprise test
- calculates various sample statistics 
- draws inferences about the population parameters 
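
The three steps above can be sketched in code. This is a minimal illustration on synthetic data, not the agency's actual procedure: the population of 100,000 scores, its mean and spread, the sample size of 500, and the use of simple random sampling are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical population: test scores (out of 100) for 100,000 students.
population = rng.normal(loc=70, scale=12, size=100_000).clip(0, 100)

# Step 1: identify the sample (assumed here: simple random sample, no repeats).
sample = rng.choice(population, size=500, replace=False)

# Step 2: calculate sample statistics.
sample_mean = sample.mean()
sample_var = sample.var(ddof=1)   # ddof=1 gives the unbiased sample variance

# Step 3: compare with the population parameters, which are unknown in practice.
pop_mean = population.mean()
pop_var = population.var()        # ddof=0: variance over the whole population

print(f"mean:     sample = {sample_mean:.2f}, population = {pop_mean:.2f}")
print(f"variance: sample = {sample_var:.2f}, population = {pop_var:.2f}")
```

In a real survey only the sample statistics are available; the population parameters are printed here only because the population was simulated.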

## Challenges faced in estimating the quality of education

> Given below are the challenges that the surveying agency must resolve. This course is all about finding solutions to these challenges.

> 1. Deciding the number of students to appear for the surprise test
> 2. Deciding how these students should be selected
> 3. Estimating the performance of all the students in the country by analyzing the performance of the selected students in the surprise test
>     - Estimating the mean score from the sample
>     - Estimating the variance among the marks from the sample
> 4. Validating the estimates, i.e. determining how good the estimates are
> 5. Estimating the difference between the performances of two different schools
> 6. Determining the effect of various factors on the performance of students
>     - Effect of individual factors such as school, teacher, or extra effort put in by a student
>     - Effect of combined factors such as gender and teacher
    
***
    
# Introduction to Sampling

## Challenge 1: How many students to select for the surprise test?

The surveying agency has decided to conduct the surprise test for a selected set of students. Now they have to decide the number of students who should appear for the surprise test, such that they represent the performance of all the students in the country.

In statistical terms, the surveying agency has to determine the __sample size__. 

The required sample size is determined by the __level of accuracy__ we desire. The __level of accuracy__ indicates how close we want our estimates to be to the population parameters.

The level of accuracy is measured in terms of:
1. confidence level
2. maximum permissible error

#### Factors affecting sample size: Confidence Level

The surveying agency is estimating the quality of primary school education in the country by analyzing the performance of a selected set of students in the surprise test.

If the agency gives a single point estimate of the performance of students, e.g. that the average performance of primary school students in the country is 72%, the estimate may not be equal to the actual value.

If the agency instead proposes an interval, e.g. that the average performance of primary school students in the country lies between 70% and 75%, there is greater certainty associated with the estimate.

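The interval is called a __confidence interval__ and the level of certainty associated with the interval is characterized by the __confidence level__.

The confidence level describes how reliable the interval-building procedure is: it is the proportion of intervals, each computed in the same way from a fresh sample, that would contain the population parameter.

A 95% confidence interval therefore means that if the survey were repeated many times and an interval computed each time, about 95% of those intervals would contain the population parameter.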

To associate a greater certainty, i.e. a higher confidence level, with the estimates without widening the interval, one must analyze a larger sample so that the population is better represented.

Thus, if we want a __higher level of confidence__ in our analysis, we may need to __increase the sample size__.
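
The meaning of a 95% confidence level can be checked by simulation. The sketch below is illustrative only: it assumes a normally distributed population with a known standard deviation, and the values of `mu`, `sigma`, `n` and the number of repeated surveys are made up for the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=1)

mu, sigma = 72.0, 10.0      # hypothetical population mean and std deviation
n = 50                      # students per simulated survey
z_c = norm.ppf(0.975)       # two-sided critical value for 95% confidence
half_width = z_c * sigma / np.sqrt(n)

n_surveys = 10_000
# Each row is one simulated survey of n students.
samples = rng.normal(mu, sigma, size=(n_surveys, n))
means = samples.mean(axis=1)

# An interval covers the truth when mu lies inside mean +/- half_width.
covered = np.abs(means - mu) <= half_width
coverage = covered.mean()

print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should come out close to 0.95: the confidence level is a property of the procedure over many repetitions, not of any single interval.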

#### Factors affecting sample size: Maximum Permissible Error

Estimated parameters differ from the population parameters because we consider only a sample and not the entire data.

An accurate analysis means that the difference between the population parameters and the estimated ones is small.

This difference is restricted by specifying a maximum permissible error, which is the largest error that we can accommodate in our analysis.

To reduce the maximum permissible error in our analysis, we may need to analyze a larger sample.

The relation between sample size and maximum permissible error is represented in the graph below:

![](1_mpe.PNG)

Therefore, increasing the sample size is the way both to raise the confidence level and to reduce the maximum permissible error of the analysis.
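
The trade-off between sample size and error can also be tabulated. The sketch below assumes the standard relation for estimating a mean, $E = Z_c \sigma / \sqrt{n}$, which is a rearrangement of the sample size formula given later in this section. The standard deviation of 20 marks and the list of sample sizes are illustrative assumptions.

```python
import math
from scipy.stats import norm

sigma = 20.0                 # assumed population standard deviation (marks)
z_c = norm.ppf(0.975)        # two-sided critical value for 95% confidence


def error_for(n):
    """Maximum permissible error achieved by a sample of size n."""
    return z_c * sigma / math.sqrt(n)


sizes = [25, 100, 400, 1600]
errors = [error_for(n) for n in sizes]

for n, e in zip(sizes, errors):
    print(f"n = {n:5d}  ->  E = {e:.2f} marks")
```

Because the error shrinks with the square root of the sample size, halving the error requires roughly four times as many students.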

#### Factors affecting sample size: Population Variance

Another factor that affects determination of sample size is the variance in the population.

Consider two sets of numbers given below as two different populations and calculate the respective population means.

![](1_pop.PNG)

Observe that in population 1 the variance among the observations is low, i.e. the observations are close to each other. In population 2, the variance is high.

As the variance in population 1 is low, a sample of size 7 is sufficient to closely represent the population.

As the variance in population 2 is high, smaller samples may not be sufficient to closely represent the population. They may not include the extreme values that cause the high variance in the population.

Therefore, for populations with higher variance, a larger sample may be required, so that the selected sample is a true representative of the population. 

__Thus, if the variance/standard deviation in the population is higher, we may need to consider a larger sample.__
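
The effect of population variance can be seen by drawing many samples and checking how much the sample mean moves around. This is a sketch on synthetic data: the two populations, their standard deviations (2 and 20), the sample sizes, and sampling with replacement are assumptions chosen for illustration, not the numbers from the figure above.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Two hypothetical populations with the same mean but different spread.
low_var = rng.normal(loc=50, scale=2, size=50_000)
high_var = rng.normal(loc=50, scale=20, size=50_000)


def spread_of_sample_means(population, n, repeats=2_000):
    """Std deviation of the sample mean over repeated samples of size n."""
    # rng.choice samples with replacement by default, which is fine here
    # because the population is much larger than the sample.
    means = [rng.choice(population, size=n).mean() for _ in range(repeats)]
    return float(np.std(means))


results = {}
for n in (7, 700):
    results[n] = (
        spread_of_sample_means(low_var, n),
        spread_of_sample_means(high_var, n),
    )
    low, high = results[n]
    print(f"n = {n:3d}: low-variance spread = {low:.2f}, "
          f"high-variance spread = {high:.2f}")
```

With only 7 observations, the sample mean of the high-variance population is far less stable than that of the low-variance one; a much larger sample is needed to bring it to a comparable stability.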

#### Determining the sample size

We have found that determination of sample size depends upon 3 major factors, namely,

- confidence level
- maximum permissible error
- population variance/standard deviation

The formula below can be used to calculate the sample size:

$$ n = \left( \frac{Z_c \, \sigma}{E} \right)^2 $$

where,  
n - sample size  
$Z_c$ - critical Z statistic for the specified confidence level  
$\sigma$ - population standard deviation  
E - maximum permissible error  
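
The formula translates directly into a small helper. This is a sketch rather than a library routine: the function name `required_sample_size` and its parameter names are made up for this example, it assumes a two-sided confidence interval for a mean with known $\sigma$, and it rounds up so that the error bound is still met.

```python
import math
from scipy.stats import norm


def required_sample_size(confidence_level, sigma, max_error):
    """Smallest n for which Z_c * sigma / sqrt(n) <= max_error."""
    if not 0 < confidence_level < 1:
        raise ValueError("confidence_level must be between 0 and 1")
    if sigma <= 0 or max_error <= 0:
        raise ValueError("sigma and max_error must be positive")
    # Two-sided interval: the leftover probability is split over both tails.
    z_c = norm.ppf(1 - (1 - confidence_level) / 2)
    n = (z_c * sigma / max_error) ** 2
    return math.ceil(n)  # round up; rounding down would exceed max_error


# Illustrative inputs only: sigma = 15 marks, error of at most 2 marks.
n_example = required_sample_size(0.95, sigma=15.0, max_error=2.0)
print(n_example)
```

Raising the confidence level, increasing `sigma`, or lowering `max_error` each makes the returned sample size grow, matching the three factors listed above.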

The surveying agency chooses 0.95 or __95% confidence level__.

The critical Z value for a specified confidence level can be found by looking it up in the Z table.

Alternatively, it can be computed with Python's SciPy library, using [scipy.stats.norm](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) and its [norm.ppf()](https://kite.com/python/docs/scipy.stats.norm.ppf) method (reference: [Probability to z-score and vice-versa in python](https://stackoverflow.com/questions/20864847/probability-to-z-score-and-vice-versa-in-python)).

`ppf` is the percent point function (the inverse of the cdf): given a cumulative probability q, it returns the value of the random variable at which the cdf equals q.

![](1_ppf.PNG)

    from scipy.stats import norm

    # z-value that has 95% of the area under the curve to its left
    norm.ppf(0.95)   # 1.6448536269514722

    # area under the curve up to the z-value 1.64
    norm.cdf(1.64)   # 0.9494974165258963


![](1_ppf2.PNG)

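For a two-sided 95% confidence interval, 2.5% of the area lies in each tail, so the critical Z statistic is the z-value with a cumulative probability of 0.975. This agrees with the Z table value of about 1.96:

    norm.ppf(0.975)  # 1.959963984540054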
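
The same conversion works for any confidence level. The helper below is a small sketch (the name `critical_z` is made up for this example) that assumes a two-sided interval, so the probability outside the interval is split equally between the two tails.

```python
from scipy.stats import norm


def critical_z(confidence_level):
    """Two-sided critical Z value for a confidence level given as a fraction."""
    alpha = 1 - confidence_level            # total probability in both tails
    return float(norm.ppf(1 - alpha / 2))   # each tail holds alpha / 2


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> Z_c = {critical_z(level):.3f}")
```

Note that `norm.ppf(0.95)` is the critical value for a two-sided 90% interval (or a one-sided 95% bound), which is why the 95% case uses 0.975.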

The surveying agency sets the maximum permissible error to 5 marks, i.e. the estimated value should not differ from the actual value by more than 5 marks.

The agency does not know the population standard deviation. Hence, it refers to previous surveys and takes the standard deviation found there, 85.35 marks, as its value.

| Parameter                         | Value       |
|-----------------------------------|-------------|
| $Z_c$                             | 1.96        |
| Population std deviation $\sigma$ | 85.35 marks |
| Maximum permissible error (E)     | 5 marks     |

The agency substitutes these values into the sample size formula:

$$ n = \left( \frac{1.96 \times 85.35}{5} \right)^2 \approx 1119.4 $$

Since the sample size must be a whole number and rounding down would exceed the permissible error, the surveying agency rounds it up to 1120.
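
The arithmetic, and the reason for rounding up rather than down, can be checked with a few lines of Python. This sketch uses the rounded value 1.96 from the table; the helper `achieved_error` is introduced here only for illustration.

```python
import math

z_c = 1.96        # critical Z for 95% confidence, rounded as in the table
sigma = 85.35     # standard deviation taken from previous surveys (marks)
max_error = 5.0   # maximum permissible error (marks)

n_raw = (z_c * sigma / max_error) ** 2
n = math.ceil(n_raw)          # round up to the next whole student


def achieved_error(sample_size):
    """Error bound Z_c * sigma / sqrt(n) delivered by a given sample size."""
    return z_c * sigma / math.sqrt(sample_size)


print(f"raw n = {n_raw:.2f}, rounded up n = {n}")
print(f"error with {n - 1} students: {achieved_error(n - 1):.4f} marks")
print(f"error with {n} students: {achieved_error(n):.4f} marks")
```

One student fewer would push the error bound slightly above 5 marks, so the next whole number is the smallest sample size that satisfies the requirement.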

__Solution to Challenge 1:__ The surveying agency determines the sample size to be 1120, i.e. 1120 students will be selected from all over the country to appear for the surprise test.

The next task for the agency is to identify these 1120 students.