In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import random
import math

## **Lab 2 : Confidence Intervals**

Statistical inference is the process of analyzing sample data to gain insight into the population from which the data was collected and to investigate differences between data samples. In data analysis, we are often interested in the characteristics of some large population, but collecting data on the entire population may be infeasible. For example, leading up to U.S. presidential elections it could be very useful to know the political leanings of every single eligible voter, but surveying every voter is not feasible. Instead, we could poll some subset of the population, such as a thousand registered voters, and use that data to make inferences about the population as a whole.

## **Exercise 1**

## **Point Estimates**

Point estimates are estimates of population parameters based on sample data. For instance, if we wanted to know the average age of registered voters in the U.S., we could take a survey of registered voters and then use the average age of the respondents as a point estimate of the average age of the population as a whole. The average of a sample is known as the sample mean.
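As a minimal sketch of the idea, the snippet below builds a small toy population (the uniform ages, the seed and the names `toy_population` and `toy_sample` are illustrative assumptions, not the lab's data) and uses the mean of a random sample as a point estimate of the population mean.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, chosen only for reproducibility

# Hypothetical toy population: 50,000 ages drawn uniformly between 18 and 90
toy_population = rng.integers(low=18, high=91, size=50_000)

# Draw 500 individuals without replacement, as a survey would
toy_sample = rng.choice(toy_population, size=500, replace=False)

population_mean = toy_population.mean()  # the parameter (normally unknown)
sample_mean = toy_sample.mean()          # the point estimate

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"difference:      {sample_mean - population_mean:+.2f}")
```

The two means should be close but not identical; the gap comes from sampling randomness alone here, since the sample is drawn uniformly at random.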

The sample mean is usually not exactly the same as the population mean. This difference can be caused by many factors including poor survey design, biased sampling methods and the randomness inherent to drawing a sample from a population. Let's investigate point estimates by generating a population of random age data and then drawing a sample from it to estimate the mean:

## **Question 1 : point estimate of a mean**

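We generate a synthetic population made of two groups with different age distributions: `population_ages1` (older adults) and `population_ages2` (younger adults). Note that both groups use `loc=18`, so every simulated age is at least 18.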

In [None]:
np.random.seed(10)
population_ages1 = stats.poisson.rvs(loc=18, mu=35, size=150000)
population_ages2 = stats.poisson.rvs(loc=18, mu=10, size=100000)

## **Question 1-a**

Concatenate the two populations. What is the mean of the whole population?

## **Question 1-b**

Sample 1000 values from the whole population and show the sample mean. What is the difference between the sample mean and the true mean? Comment!


## **Question 2 : point estimate of a proportion**

Another point estimate that may be of interest is the proportion of the population that belongs to some category or subgroup. For example, we might like to know the race of each voter we poll, to get a sense of the overall demographics of the voter base.
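A proportion can be estimated in the same way as a mean. The sketch below uses a hypothetical three-category population (the labels `A`, `B`, `C` and their counts are assumptions made for illustration) and compares the sample proportions with the true ones.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # hypothetical seed

# Hypothetical toy population: 60% "A", 30% "B", 10% "C"
toy_labels = np.array(["A"] * 6000 + ["B"] * 3000 + ["C"] * 1000)

# Poll 500 individuals without replacement
toy_sample = rng.choice(toy_labels, size=500, replace=False)

# value_counts(normalize=True) returns relative frequencies, i.e. proportions
true_props = pd.Series(toy_labels).value_counts(normalize=True)
sample_props = pd.Series(toy_sample).value_counts(normalize=True)

print(pd.DataFrame({"true": true_props, "estimate": sample_props}))
```

Each sample proportion is a point estimate of the corresponding population proportion; rare categories tend to be estimated with larger relative error.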

We simulate a synthetic example

In [None]:
population_students = (["Humanities"]*500000) + (["Sciences"]*300000) +\
                   (["Medicine"]*200000) + (["Economics"]*400000) +\
                   (["other"]*30000)

Sample 1000 values and calculate a point estimate of the proportion of students in each category. Compare these estimates with the true proportions in the population. Comment!

## **Exercise 2**

## **Sampling Distributions and discovering empirically the Central Limit Theorem**

Many statistical procedures assume that data follows a normal distribution, because the normal distribution has nice properties such as symmetry and having the majority of the data clustered within a few standard deviations of the mean. Unfortunately, real-world data is often not normally distributed, and the distribution of a sample tends to mirror the distribution of the population. This means a sample taken from a population with a skewed distribution will also tend to be skewed.
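To see a sample inherit the skew of its population, here is a small sketch on a stand-in population. The exponential distribution, its scale and the variable names are assumptions chosen only because the exponential is clearly right-skewed; `scipy.stats.skew` computes the sample skewness.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)  # hypothetical seed

# Hypothetical right-skewed population (exponential, theoretical skewness 2)
skewed_population = rng.exponential(scale=10.0, size=100_000)

# A random sample of 1,000 values from that population
skewed_sample = rng.choice(skewed_population, size=1_000, replace=False)

population_skew = stats.skew(skewed_population)
sample_skew = stats.skew(skewed_sample)

print(f"population skewness: {population_skew:.2f}")
print(f"sample skewness:     {sample_skew:.2f}")
```

A skewness near 0 would indicate a symmetric distribution; here both values should be clearly positive.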

## **Question 1**

Transform the numpy array **population_ages** into a DataFrame and visualize its histogram. Recall what skewness is and print the skewness of this population. Is it consistent with the histogram? Does the data follow a Gaussian distribution?

## **Question 2**

Extract a sample from the whole population and repeat the same analysis (histogram and skewness). Comment!

The sample has roughly the same shape as the underlying population. This suggests that we cannot apply techniques that assume a normal distribution to this data set, since it is not normal. In practice we often can, thanks to the central limit theorem.

The central limit theorem is one of the most important results of probability theory and serves as the foundation of many methods of statistical analysis. At a high level, the theorem states that the distribution of many sample means, known as a sampling distribution, is approximately normal when the samples are large enough. This holds even if the underlying distribution itself is not normally distributed (provided it has finite variance). As a result, we can treat the sample mean as if it were drawn from a normal distribution.

To illustrate, let's create a sampling distribution by taking 200 samples from our population and then making 200 point estimates of the mean:
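A minimal sketch of this procedure on a stand-in population is given here. The exponential population, the sample size of 500 and the variable names are assumptions for illustration; the lab's own population would be used in the same way.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)  # hypothetical seed

# Hypothetical skewed population standing in for the lab's data
skewed_population = rng.exponential(scale=10.0, size=100_000)

# Take 200 samples of size 500 (with replacement, the default of rng.choice)
# and keep the mean of each one: 200 point estimates of the mean
sample_means = np.array([
    rng.choice(skewed_population, size=500).mean()
    for _ in range(200)
])

print(f"population mean:          {skewed_population.mean():.3f}")
print(f"mean of the 200 means:    {sample_means.mean():.3f}")
print(f"skewness of population:   {stats.skew(skewed_population):.2f}")
print(f"skewness of sample means: {stats.skew(sample_means):.2f}")
```

The skewness of the 200 sample means should be much closer to 0 than that of the population, and their average should sit very near the population mean.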

## **Question 3**

## **Central limit theorem to the rescue!**

## **Question 3-a**

Create a sampling distribution by taking 200 samples from the whole population. Then compute the mean of each sample to obtain 200 point estimates of the mean.

## **Question 3-b**

Use a histogram to visualize the distribution of the 200 point estimates of the mean. Compare the mean of these new observations with the true mean of the whole population and comment!

## **Exercise 3**

## **Confidence Intervals**

A point estimate can give you a rough idea of a population parameter like the mean, but estimates are prone to error, and taking multiple samples to get improved estimates may not be feasible. A confidence interval is a range of values above and below a point estimate that captures the true population parameter at some predetermined confidence level. For example, if you want to have a 95% chance of capturing the true population parameter with a point estimate and a corresponding confidence interval, you'd set your confidence level to 95%. Higher confidence levels result in wider confidence intervals.

Calculate a confidence interval by taking a point estimate and then adding and subtracting a margin of error to create a range. Margin of error is based on your desired confidence level, the spread of the data and the size of your sample. The way you calculate the margin of error depends on whether you know the standard deviation of the population or not.

## **Part I**

## **First case : you know the standard deviation of the population**

In this case, the margin of error is equal to:

$$z \times \frac{\sigma}{\sqrt{n}}$$

Where σ (sigma) is the population standard deviation, n is the sample size, and z is a number known as the z-critical value. The z-critical value is the number of standard deviations you'd have to go from the mean of the normal distribution to capture the proportion of the data associated with the desired confidence level. For instance, we know that roughly 95% of the data in a normal distribution lies within 2 standard deviations of the mean, so we could use 2 as the z-critical value for a 95% confidence interval (although it is more exact to get z-critical values with stats.norm.ppf(), which gives about 1.96).
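The formula can be sketched in code as follows. The normal toy population, the sample size `n = 400` and all variable names are assumptions for illustration; only `stats.norm.ppf()` comes from the text. For a two-sided interval at confidence level $c$, the z-critical value is the quantile at $1 - (1-c)/2$.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(4)  # hypothetical seed

# Hypothetical toy population whose standard deviation we treat as known
toy_population = rng.normal(loc=50.0, scale=12.0, size=100_000)
sigma = toy_population.std()  # population standard deviation (ddof=0)

n = 400
toy_sample = rng.choice(toy_population, size=n, replace=False)
sample_mean = toy_sample.mean()

confidence = 0.95
# Two-sided interval: put (1 - confidence) / 2 in each tail
z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)

margin_of_error = z_critical * sigma / np.sqrt(n)
interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print(f"z-critical value: {z_critical:.4f}")
print(f"margin of error:  {margin_of_error:.3f}")
print(f"95% CI:           ({interval[0]:.2f}, {interval[1]:.2f})")
print(f"true mean:        {toy_population.mean():.2f}")
```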

## **Question 1**

Calculate a 95% confidence interval for the mean point estimate calculated in Exercise 1 using the formula given in Lecture 2:

## **Question 2**

## **Question 2-a**

Sample the whole population several times and build a confidence interval from each sample using the formula above.
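One way to check what the confidence level means is to count how often such intervals contain the true mean. The sketch below does this on a stand-in population; the exponential distribution, the sample size of 200 and the 1,000 repetitions are assumptions chosen for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(5)  # hypothetical seed

# Hypothetical population with known mean and standard deviation
toy_population = rng.exponential(scale=10.0, size=100_000)
true_mean = toy_population.mean()
sigma = toy_population.std()

n = 200
z_critical = stats.norm.ppf(0.975)        # 95% two-sided
margin = z_critical * sigma / np.sqrt(n)  # same margin for every sample

n_intervals = 1_000
captured = 0
for _ in range(n_intervals):
    m = rng.choice(toy_population, size=n).mean()
    # The interval "captures" the true mean if it lies between the bounds
    if m - margin <= true_mean <= m + margin:
        captured += 1

coverage = captured / n_intervals
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The fraction should land close to 0.95: about 1 interval in 20 misses the true mean, which is exactly what a 95% confidence level allows.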

## **Question 2-b**

Plot them to get a better sense of what it means to "capture" the true mean:

## **Part II**

## **Second case : you do not know the standard deviation of the population**

In this case, you have to use the standard deviation of your sample as a stand-in when creating confidence intervals. Since the sample standard deviation may not match the population parameter, the interval will have more error when you don't know the population standard deviation. To account for this error, we use what's known as a t-critical value instead of the z-critical value. The t-critical value is drawn from what's known as a t-distribution, a distribution that closely resembles the normal distribution but that gets wider and wider as the sample size falls. The t-distribution is available in scipy.stats with the nickname "t", so we can get t-critical values with stats.t.ppf().
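The widening of the t-distribution for small samples can be seen directly in the critical values. This short sketch compares `stats.t.ppf()` with `stats.norm.ppf()` at the 95% level; the sample sizes listed are arbitrary choices, and the degrees of freedom are taken as n - 1.

```python
import scipy.stats as stats

confidence = 0.95
q = 1 - (1 - confidence) / 2  # upper quantile for a two-sided interval

z_critical = stats.norm.ppf(q)

# t-critical value for several (arbitrary) sample sizes, with df = n - 1
t_criticals = {n: stats.t.ppf(q, df=n - 1) for n in (5, 25, 100, 1000)}

print(f"z-critical:            {z_critical:.4f}")
for n, t_critical in t_criticals.items():
    print(f"t-critical (n = {n:>4}): {t_critical:.4f}")
```

The t-critical value is always larger than the z-critical value, and the gap shrinks as the sample size grows.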

Let's take a new, smaller sample and then create a confidence interval without the population standard deviation, using the t-distribution:

## **Question 4**

## **Question 4-a**

Take a new, smaller sample of size 25

Create a confidence interval from the formula without the population standard deviation, using the t-distribution. Comment!

## **Question 4-b**

What happens when you have a large sample? Compare the t-critical value with the z-critical value. Comment!

## **Question 4-c**

Calculate the confidence interval using the Python function stats.t.interval():
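As a hedged sketch of how this function is typically called, the example below uses a hypothetical sample of 25 normal values (not the lab's data) and checks that `stats.t.interval()` agrees with the manual formula. The confidence level and degrees of freedom are passed positionally, and `scale` is the standard error of the mean rather than the raw standard deviation.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(6)  # hypothetical seed

# Hypothetical small sample of size 25
toy_sample = rng.normal(loc=40.0, scale=8.0, size=25)
n = toy_sample.size

sample_mean = toy_sample.mean()
sample_std = toy_sample.std(ddof=1)      # sample standard deviation
standard_error = sample_std / np.sqrt(n)

# Manual interval: mean +/- t-critical * standard error
t_critical = stats.t.ppf(0.975, df=n - 1)
manual_interval = (sample_mean - t_critical * standard_error,
                   sample_mean + t_critical * standard_error)

# Same interval from SciPy: confidence level, degrees of freedom, loc, scale
scipy_interval = stats.t.interval(0.95, n - 1,
                                  loc=sample_mean, scale=standard_error)

print("manual:", manual_interval)
print("scipy: ", scipy_interval)
```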

## **Part III**

## **Confidence interval for a population proportion**

We can also make a confidence interval for a point estimate of a population proportion. In this case, the margin of error equals:

$$z \times \sqrt{\frac{p(1-p)}{n}}$$

Where z is the z-critical value for our confidence level, p is the point estimate of the population proportion and n is the sample size.
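The formula translates almost directly into code. In the sketch below, the values `p_hat = 0.25` and `n = 400` are hypothetical, chosen only to illustrate the computation; the last lines show that `stats.norm.interval()` gives the same interval when given the point estimate as `loc` and the standard error as `scale`.

```python
import math
import scipy.stats as stats

# Hypothetical inputs: sample proportion and sample size
p_hat = 0.25
n = 400

confidence = 0.95
z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)

# Standard error of a proportion, then the margin of error
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin_of_error = z_critical * standard_error

manual_interval = (p_hat - margin_of_error, p_hat + margin_of_error)

# Equivalent computation with SciPy
scipy_interval = stats.norm.interval(confidence, loc=p_hat, scale=standard_error)

print(f"margin of error: {margin_of_error:.4f}")
print(f"manual interval: ({manual_interval[0]:.4f}, {manual_interval[1]:.4f})")
print(f"scipy interval:  ({scipy_interval[0]:.4f}, {scipy_interval[1]:.4f})")
```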

## **Question 5**

## **Question 5-a**

Calculate a 95% confidence interval for the proportion of Humanities students, using the sample proportion ($p=0.338$) we calculated earlier:

## **Question 5-b**

Compute the same interval with Python.