# Introduction To Statistics - Sample Vs Population Metrics

© Explore Data Science Academy

## Learning Objectives
By the end of this train, you should be able to:

* Differentiate between sample and population metrics,
* Understand where the Central Limit theorem is utilised, and
* Describe the function and measurement of confidence intervals.

## Outline
To do this we will:

* Examine sample and population mean and variance,
* Look at the Central Limit theorem, and
* Discuss confidence intervals. 

## Introduction

In statistics, we refer to a *population* as the complete pool from which a sample can be drawn. This pool could consist of people, objects, events, categories, measurements, or virtually any set of entities. When considering the potential size of a population, it is often considered impractical to *test all of its data* for two reasons: 
   
   - It is often not feasible, and 
   - It can be extremely computationally expensive to do so. 

Instead, we can try to estimate a population using *sample statistics*. An example of this would be a scenario in which we would like to test 500 patients from Johannesburg General (Charlotte Maxeke) Hospital, and then generalise the results to the rest of Johannesburg's public hospital system. The former sample is obtainable, while the latter would need to cover possibly tens of thousands of individuals across a large geographic area. However, how *representative* would our sample be seeing that it is far smaller than the total population, and is limited to only one hospital site? This question of *representativeness* is a primary concern when working with sample statistics. 

Consider below some of the differences in notation between population and sample statistics:

![image](http://www.astrocyte.in/articles/2014/1/1/images/Astrocyte_2014_1_1_62_131867_t1.jpg)

The aim within this train is to familiarise ourselves with some of these parameters - both from a population and sample statistic perspective.  

## Mean and Variance

Suppose that we are interested in the salaries of data scientists across the globe (and why wouldn't we be??). It's an impossible task to know the distribution of **ALL** these salaries; in practice, that would require information on the salary of every data scientist. We consequently end up relying on inferences made from a representative sample of data scientists. The true values of the population mean and variance are therefore, in practice, unknown.

We can however calculate a sample mean and variance. The mean $\bar{x}$ is simply the average of the observations $x_t$, where $t = 1, 2, \ldots, n$:

$$\bar{x} = \displaystyle \frac{1}{n} \sum_{t=1}^n x_t.$$

The sample variance is calculated as:

$$s^2_x = \displaystyle \frac{1}{n-1} \sum_{t=1}^n (x_t - \bar{x})^2.$$
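As a small worked example of the two formulas, the sketch below computes the sample mean and sample variance by hand and checks them against Python's built-in `statistics` module. The salary figures are made up purely for illustration.

```python
import statistics

# Hypothetical salaries (in thousands) for a sample of five data scientists
salaries = [50, 60, 70, 80, 90]
n = len(salaries)

# Sample mean: sum of the observations divided by n
sample_mean = sum(salaries) / n

# Sample variance: sum of squared deviations from the mean divided by n - 1
sample_variance = sum((x - sample_mean) ** 2 for x in salaries) / (n - 1)

print(sample_mean)       # 70.0
print(sample_variance)   # 250.0

# statistics.variance also uses the n - 1 divisor, so the results agree
print(sample_variance == statistics.variance(salaries))   # True
```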

If you're paying attention, you may be asking why we divide by $n-1$ here, rather than $n$, as seems intuitive and as we did for the sample mean. While a little technical at this stage, this has to do with wanting an unbiased estimate of the population variance (in other words, if we repeated the sample a very large number of times and averaged the sample variances, we'd want that average to be equal to the population variance). Technically, we lose a degree of freedom here because we've had to use the sample mean $\bar{x}$ as our estimate of $\mu$: the sample mean together with $n-1$ of the observations determine the value of the $n$th observation; this is no longer a free parameter. If you don't get that, don't sweat it too much: what's important at this stage is that you know that if we want our sample variance to be unbiased, we must use a divisor of $n-1$ rather than $n$.
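One way to see the unbiasedness argument in action is a small simulation. The sketch below is only illustrative: it assumes a normal population with a variance of 4, draws many samples of size 5, and averages the variance estimates obtained with each divisor. The helper `variance_with_divisor` is a name introduced here for the example.

```python
import random

def variance_with_divisor(values, offset):
    """Sum of squared deviations from the sample mean, divided by (n - offset)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - offset)

rng = random.Random(42)      # fixed seed so the run is repeatable
true_variance = 4.0          # assumed population: normal with sigma = 2
n, repeats = 5, 20000        # illustrative sample size and number of repeated samples

biased_total = 0.0
unbiased_total = 0.0
for _ in range(repeats):
    sample = [rng.gauss(0.0, 2.0) for _ in range(n)]
    biased_total += variance_with_divisor(sample, 0)     # divide by n
    unbiased_total += variance_with_divisor(sample, 1)   # divide by n - 1

avg_biased = biased_total / repeats
avg_unbiased = unbiased_total / repeats

print(f"average when dividing by n:     {avg_biased:.3f}")
print(f"average when dividing by n - 1: {avg_unbiased:.3f}")
print(f"population variance:            {true_variance:.3f}")
```

With these settings the $n-1$ average should land close to 4, while the $n$ average should be pulled down towards $\frac{n-1}{n} \times 4 = 3.2$.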

## Central Limit Theorem 

It's time now to introduce the concept of a sampling distribution, but first let's remind ourselves of a tool that will prove valuable in this quest: the Central Limit Theorem.

The Central Limit Theorem states that if a random variable $X$ is the sum (or, equivalently, the mean) of a large number of independent, identically distributed random outcomes with finite variance, it will approximately follow a normal distribution, *even if the underlying random variables are not normally distributed*. Proof of this is beyond the scope of the course, but you should realise that this is an asymptotic result: it gets closer and closer to being true as the number of observations increases. As a general rule of thumb, at least 30 observations are needed before relying on the Central Limit Theorem, although strongly skewed distributions may need more.

In his work, [*Intro to Statistics - Probability and Distributions*](https://www.fd.cvut.cz/department/k611/PEDAGOG/THO_A/A_soubory/statistics_firstfive.pdf), Keone Hon introduces the Central Limit Theorem in the following manner:

"Start with a population with a given mean $\mu$  and standard deviation $\sigma$. Take samples of size $n$, where $n$ is a sufficiently large (generally at least 30) number, and compute the mean of each sample.

With the following results:

* The set of all sample means will be approximately normally distributed.

* The mean of the set of samples will equal $\mu$, the mean of the population.

* The standard deviation, $\sigma_{\bar{x}}$, of the set of sample means will be approximately $\frac{\sigma}{\sqrt{n}}$"
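The three results quoted above can be checked with a short simulation. This is a sketch under assumed settings rather than a proof: the population is taken to be exponential with rate 1 (so $\mu = 1$ and $\sigma = 1$, and clearly not normal), and the sample size of 50 and the 5000 repetitions are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(0)        # fixed seed so the run is repeatable
mu, sigma = 1.0, 1.0          # mean and sd of an exponential distribution with rate 1
n, num_samples = 50, 5000     # illustrative sample size and number of samples

# Draw many samples of size n and record the mean of each one
sample_means = []
for _ in range(num_samples):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = sigma / math.sqrt(n)

# If the sample means are roughly normal, about 95% of them should fall
# within 1.96 standard deviations of mu
within = sum(abs(m - mu) < 1.96 * expected_sd for m in sample_means) / num_samples

print(f"mean of sample means: {mean_of_means:.4f} (population mean: {mu})")
print(f"sd of sample means:   {sd_of_means:.4f} (sigma / sqrt(n): {expected_sd:.4f})")
print(f"share within 1.96 sd: {within:.3f}")
```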

## Confidence Intervals

Using the Central Limit Theorem, we can find the probability that a sample mean lies within a given interval around the population mean. This is essentially the same as the probability that the population mean lies within an interval of the same width around the sample mean.

As Keone puts it: "We can determine how confident we are that the population mean lies within a certain interval of a sample mean."

If a random variable is normally distributed, we can make statements about how certain we are that it will fall within certain multiples of the standard deviation about the mean. One well-known feature, with which you will become familiar, is that 95% of observations lie within 1.96 standard deviations of the mean. 

Before we get into the code, let's define some terms first:

 * **CDF**:
     - Stands for "Cumulative Distribution Function"
     - The CDF will tell us the probability of a value being below $x$.
<br>

 * **PPF**:
     - Stands for "Percent Point Function".
     - This is the inverse of a CDF.
     - Another name for a _quantile function_.


If you want to understand PPFs and CDFs in more detail, go to [this link](https://www.countbayesie.com/blog/2015/4/4/parameter-estimation-the-pdf-cdf-and-quantile-function), which describes those concepts and other distributions.

Let's now look at things in a practical manner. We start by importing our de facto Python statistical library, SciPy:

```python
import scipy.stats as st
```

```python
# Show that roughly 95% of the probability weight lies within 1.96 standard deviations of the mean
print(st.norm.cdf(1.96) - st.norm.cdf(-1.96))
```

```
0.950004209703559
```


```python
# Calculate the exact z-value for a confidence interval (CI): for a 95% CI we want 2.5% of the
# probability weight above the upper bound and 2.5% below the lower bound. Since the
# distribution is symmetric, we only need the point below which 97.5% of the weight lies.
print(st.norm.ppf(0.975))

# Repeat for 90%, 99% and 99.5% CIs
print(st.norm.ppf(0.95))
print(st.norm.ppf(0.995))
print(st.norm.ppf(0.9975))
```

```
1.959963984540054
1.6448536269514722
2.5758293035489004
2.807033768343811
```


You may be wondering where that 1.96 came from, and why one bound is negative while the other is positive. This is where Z-tables come in. There is a negative and a positive Z-table below (Table 1 and Table 2 respectively). So how do we use them?

If you are 95% confident, you are 5% not confident, which corresponds to a probability of 0.05. A confidence interval around the mean of the standard bell curve leaves out two tails, one on each side, so that probability is split equally between them: 0.05/2 = 0.025 per tail.

The next step is to look at a Z-table. How do you know whether to use the positive or the negative table? The decimal numbers in the body of each table are cumulative probabilities: values greater than 0.5 are on the positive table and values less than 0.5 are on the negative table. We therefore use the negative Z-table to find the probability 0.025, which sits in the row labelled -1.9 and the column labelled 0.06. Put those two numbers together and you get -1.96.

The same approach works on the positive Z-table. There we look up 0.975 rather than 0.95, because the table gives the probability of everything below $z$: the central 95% plus the 2.5% in the lower tail (remember that we divided 0.05 by 2). That probability sits in row 1.9, column 0.06, giving 1.96.

**Table 1: Negative z-value Table**

![image](http://www.z-table.com/uploads/2/1/7/9/21795380/9340559_orig.png)

As you can see, if you trace a Z value of -1.96, you will find that the corresponding probability is 0.025.

And here is the positive Z-table:

**Table 2: Positive z-value table**

![image](http://www.z-table.com/uploads/2/1/7/9/21795380/8573955.png?759)

If you look up the positive Z value of 1.96, you will see that the corresponding probability is 0.975.

So we can see that 95% of the probability lies between the 2.5th and 97.5th percentiles, that is, between Z values of -1.96 and 1.96.
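The table lookups can be reproduced in code: reading a probability off a Z-table is a CDF evaluation, and finding the Z value for a given probability is a PPF evaluation. A minimal sketch using the standard normal distribution from SciPy:

```python
import scipy.stats as st

# Z value -> cumulative probability (what the body of a Z-table contains)
lower_tail = st.norm.cdf(-1.96)   # negative Z-table: row -1.9, column 0.06
upper_tail = st.norm.cdf(1.96)    # positive Z-table: row 1.9, column 0.06
print(round(lower_tail, 4))       # 0.025
print(round(upper_tail, 4))       # 0.975

# Cumulative probability -> Z value (the PPF is the inverse of the CDF)
print(round(st.norm.ppf(0.025), 2))   # -1.96
print(round(st.norm.ppf(0.975), 2))   # 1.96

# The probability weight between the two bounds is roughly 95%
print(round(upper_tail - lower_tail, 2))   # 0.95
```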


How can we use this? Well, if we assume that the population variance $\sigma^2$ is known then we can say with 95% confidence that 

$$\displaystyle \mu - 1.96 \frac{\sigma}{\sqrt{n}} < \bar{X} < \mu + 1.96 \frac{\sigma}{\sqrt{n}}.$$

Just a little rearrangement should convince you that this is equivalent to saying that:

$$\displaystyle \bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}.$$

For a 99% confidence interval, we'd substitute 2.58 for 1.96, but the principle remains the same for establishing confidence intervals around our estimate of the unknown population mean.
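The formula can be wrapped in a small helper so that the multiplier is looked up for any confidence level instead of being hard-coded. This is a sketch that assumes, as above, that the population standard deviation is known; the function name `mean_confidence_interval` and the input numbers are hypothetical.

```python
import math
import scipy.stats as st

def mean_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided confidence interval for the population mean, assuming sigma is known."""
    # Split the leftover probability equally between the two tails
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean of 100 from 36 observations, known sigma of 15
low_95, high_95 = mean_confidence_interval(100.0, 15.0, 36)
low_99, high_99 = mean_confidence_interval(100.0, 15.0, 36, confidence=0.99)

print(f"95% CI: ({low_95:.2f}, {high_95:.2f})")
print(f"99% CI: ({low_99:.2f}, {high_99:.2f})")
```

Asking for more confidence widens the interval, which is why the 99% bounds sit outside the 95% bounds.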

---
## Exercises

#### Central Limit Theorem

Foodland shoppers have a mean R60 grocery bill with a standard deviation of R40. What is the probability that a sample of 100 Foodland shoppers will have a mean grocery bill of over R70?

#### Confidence Interval

At a factory, batteries are produced with a life expectancy that has a standard deviation of 2.4 months.
In a sample of 64 batteries, the mean life expectancy is 12.35 months. Find a 95% confidence interval estimate for the life expectancy of all batteries produced at the plant.

---

## Conclusion

Understanding statistics is fundamentally important for a data scientist. Population and sample statistics let us draw conclusions about a whole population from a manageable subset, which keeps data collection and computation feasible. The Central Limit Theorem links the two: it tells us that sample means are approximately normally distributed around the population mean, with standard deviation $\frac{\sigma}{\sqrt{n}}$. Confidence intervals use this to place upper and lower limits around our estimates when making inferential decisions.

## Solutions
#### Central Limit Theorem
Since the sample size is greater than 30, we can apply the Central Limit Theorem. By this theorem, the set of sample means of size 100 has mean R60 and standard deviation $\frac{\text{R}40}{\sqrt{100}} = \text{R}4$. Thus, R70 represents a z-score of $\frac{\text{R}70 - \text{R}60}{\text{R}4} = 2.5$.

Since the set of sample means of size 100 is normally distributed, we can compare a z-score of 2.5 to the table of normal curve areas. The area between z = 0 and z = 2.5 is 0.4938, so the probability is 0.5 - 0.4938 = 0.0062.
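The same answer can be checked with SciPy instead of a printed table. In this sketch `st.norm.sf` is the survival function, which returns the probability of a value above $z$ (that is, 1 minus the CDF).

```python
import math
import scipy.stats as st

mu, sigma, n = 60, 40, 100                # population mean, population sd, sample size
standard_error = sigma / math.sqrt(n)     # sd of the sample means: R4
z = (70 - mu) / standard_error            # z-score of a R70 sample mean
prob = st.norm.sf(z)                      # probability of a sample mean above R70

print(standard_error)     # 4.0
print(z)                  # 2.5
print(round(prob, 4))     # 0.0062
```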

#### Confidence Interval

Since the sample has n larger than 30, the central limit theorem applies. Let the standard deviation of the set of sample means of size 64 be $\sigma_{\bar{x}}$. Then by the central limit theorem, $\sigma_{\bar{x}} = \frac{2.4}{\sqrt{64}} = 0.3$ months.

Looking at the table of normal curve areas (or referring to the Confidence Intervals section above), 95% of the normal curve area is between the z-scores of -1.96 and 1.96. Since the standard deviation of the sample means is 0.3 months, a z-score of -1.96 corresponds to 1.96 × 0.3 = 0.588 months below the sample mean, and a z-score of 1.96 to 0.588 months above it. So we have 95% confidence that the mean life expectancy of all batteries is between 12.35 - 0.588 = 11.762 months and 12.35 + 0.588 = 12.938 months.
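As a cross-check of the arithmetic, the interval can be computed with SciPy, using `st.norm.ppf(0.975)` in place of the rounded table value of 1.96:

```python
import math
import scipy.stats as st

sigma, n, sample_mean = 2.4, 64, 12.35    # values given in the exercise (months)
standard_error = sigma / math.sqrt(n)     # sd of the sample means: 0.3 months
z = st.norm.ppf(0.975)                    # about 1.96 for a 95% interval
margin = z * standard_error

lower = sample_mean - margin
upper = sample_mean + margin
print(round(lower, 3), round(upper, 3))   # 11.762 12.938
```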

## Appendix

* [Intro to Statistics - Probability and Distributions, Keone Hon](https://www.fd.cvut.cz/department/k611/PEDAGOG/THO_A/A_soubory/statistics_firstfive.pdf)
* [An Introduction to the Science of Statistics, Joseph Watkins](https://www.math.arizona.edu/~jwatkins/statbook.pdf)
* [A Brief summary of some basic statistical concepts](https://towardsdatascience.com/machine-learning-probability-statistics-f830f8c09326)