## Sampling Error and Confidence Intervals

### Confidence Interval for Population Mean using SciPy

When the sample is large (n ≥ 30), the Central Limit Theorem lets us treat the sampling distribution of the sample mean as approximately normal. If the population standard deviation σ is also known, the confidence interval for the mean can be computed with the stats.norm.interval() function in SciPy.

In [None]:
# Define a sample dataset (exam grades for a class of 40 students):

In [10]:
import numpy as np
from scipy import stats
np.random.seed(42)
grades = np.random.randint(100, size=40)
print(grades)

[51 92 14 71 60 20 82 86 74 74 87 99 23  2 21 52  1 87 29 37  1 63 59 20
 32 75 57 21 88 48 90 58 41 91 59 79 14 61 61 46]


In [11]:
# Calculate the sample mean:
xbar = np.mean(grades)
xbar

53.15

In [14]:
# Calculate the standard error of the mean:
# SE (standard error) = population_standard_deviation / sqrt(sample_size)

n = len(grades)  # sample size
sigma = 28  # population standard deviation (from past experience)

sem = sigma/np.sqrt(n)

print(sem)

4.427188724235731


In [16]:
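# Calculate the 95% confidence interval around the mean using scipy.stats.
# xbar is the sample mean and sem is the standard error of the mean.
# The confidence level is passed positionally because the keyword was renamed
# from `alpha` to `confidence` in newer SciPy releases.

stats.norm.interval(0.95, loc=xbar, scale=sem)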

(44.47286954773614, 61.82713045226386)
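
As a cross-check, the same interval can be rebuilt by hand as xbar ± z · sigma / sqrt(n), where z is the standard normal critical value. The following is a minimal sketch and not part of the original notebook: it hardcodes the values printed above (xbar = 53.15, sigma = 28, n = 40), and the names `confidence`, `z_crit`, `margin`, `lower`, and `upper` are illustrative.

```python
import numpy as np
from scipy import stats

# Values taken from the cells above (hardcoded so the snippet runs on its own)
xbar = 53.15   # sample mean
sigma = 28     # assumed population standard deviation
n = 40         # sample size

# Standard error of the mean with a known population standard deviation
sem = sigma / np.sqrt(n)

# A two-sided 95% interval leaves 2.5% in each tail,
# so the critical value is the 97.5th percentile of the standard normal
confidence = 0.95
z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)

# Margin of error and interval bounds
margin = z_crit * sem
lower, upper = xbar - margin, xbar + margin
print(lower, upper)
```

The bounds should agree with the result of stats.norm.interval() up to floating-point rounding.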

### Population Variance Unknown (using t-score)

When the population standard deviation σ is unknown, the sample standard deviation s is used in its place. This introduces extra uncertainty, because s varies from sample to sample, so the t distribution (with n − 1 degrees of freedom) is used instead of the normal distribution. In this case, the confidence interval for the mean can be computed with the stats.t.interval() function in SciPy.

In [None]:
# Define a sample dataset (exam grades for a class of 20 students):

In [19]:
np.random.seed(101)
grades = np.random.randint(100, size=20)
grades

array([95, 11, 81, 70, 63, 87, 75,  9, 77, 40,  4, 63, 40, 60, 92, 64,  5,
       12, 93, 40])

In [25]:
# Calculate the 95% confidence interval around the mean using scipy.stats:

stats.t.interval(0.95,  # confidence level (positional; the keyword is `confidence` in newer SciPy, `alpha` in older)
                 df=np.size(grades) - 1,  # degrees of freedom
                 loc=np.mean(grades),  # sample mean
                 scale=stats.sem(grades)  # standard error of the mean
                )

(39.205382746648176, 68.89461725335181)
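
The t-based interval can also be assembled step by step as xbar ± t · s / sqrt(n), which makes the role of the degrees of freedom explicit. This is a hedged sketch rather than the notebook's own method: the grades array is copied from the output above, and the names `t_crit`, `margin`, `lower`, and `upper` are illustrative.

```python
import numpy as np
from scipy import stats

# Grades copied from the notebook output above
grades = np.array([95, 11, 81, 70, 63, 87, 75, 9, 77, 40,
                   4, 63, 40, 60, 92, 64, 5, 12, 93, 40])

n = grades.size
xbar = grades.mean()

# Sample standard deviation (ddof=1) and the resulting standard error
s = grades.std(ddof=1)
sem = s / np.sqrt(n)

# Critical value from the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)

# Margin of error and interval bounds
margin = t_crit * sem
lower, upper = xbar - margin, xbar + margin
print(lower, upper)
```

Because the t distribution has heavier tails than the normal distribution, `t_crit` is larger than the corresponding z value, so the interval is wider than a normal-based one with the same standard error.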

In [22]:
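# Standard error computed by hand.
# Note: np.std defaults to ddof=0 (population formula), so this value is slightly
# smaller than stats.sem(grades), which uses ddof=1 (sample formula).
scale = np.std(grades) / np.sqrt(len(grades))
scale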

6.91284131164603

In [None]:
# Calculate the standard error of the mean

# stats.sem()
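
A short sketch of how stats.sem() relates to the manual calculation may help here. It assumes the same 20 grades shown earlier (copied in so the snippet is self-contained); the variable names are illustrative. stats.sem() uses ddof=1 by default, whereas np.std() uses ddof=0, so the two only agree when the same ddof is passed to both.

```python
import numpy as np
from scipy import stats

# Grades copied from the notebook output above
grades = np.array([95, 11, 81, 70, 63, 87, 75, 9, 77, 40,
                   4, 63, 40, 60, 92, 64, 5, 12, 93, 40])
n = grades.size

# stats.sem divides the sample standard deviation (ddof=1) by sqrt(n)
sem_scipy = stats.sem(grades)

# Manual equivalents with the two common choices of ddof
sem_sample = np.std(grades, ddof=1) / np.sqrt(n)   # matches stats.sem
sem_population = np.std(grades) / np.sqrt(n)       # ddof=0, slightly smaller

print(sem_scipy, sem_sample, sem_population)
```

For a t-based interval, the ddof=1 version is the one that matches stats.sem() and the sample standard deviation s.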