# Confidence Intervals

In [1]:
# Libraries
import pandas as pd
import numpy as np
import scipy.stats as stats

## Challenge 1
We want to estimate the average height of the men of a country with a confidence level of 80%. Assuming that the population standard deviation of the heights is 4, compute the confidence interval from a randomly selected sample of men whose heights are:

````
heights = [167, 167, 168, 168, 168, 169, 171, 172, 173, 175, 175, 175, 177, 182, 195]
````

**Hint**: function `stats.norm.interval` from `scipy` can help you get through this exercise. 

In [2]:
# your code here
heights = [167, 167, 168, 168, 168, 169, 171, 172, 173, 175, 175, 175, 177, 182, 195]
len_sample = len(heights)
len_sample

15

In [3]:
heigths_mean =np.mean(heights)
heigths_mean

173.46666666666667

In [4]:
height_std = 4

In [5]:
# Get the critical value. The population standard deviation is known,
# so the normal (z) distribution applies rather than Student's t.
# For an 80% level the upper cutoff is 1 - (1 - 0.80) / 2 = 0.90
z = stats.norm.ppf(0.90)
z

1.2815515655446004

In [6]:
marginal_height = z * height_std / np.sqrt(len_sample)

In [9]:
# Calculate the confidence interval manually
print(f"lower bound: {heigths_mean - marginal_height:.4f}")
print(f"upper bound: {heigths_mean + marginal_height:.4f}")

lower bound: 172.1431
upper bound: 174.7902


In [10]:
stats.norm.interval(0.80, heigths_mean, scale=height_std/np.sqrt(15))

(172.14308590115726, 174.79024743217607)
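
As a cross-check, the same interval can be computed without SciPy. The sketch below is not part of the original exercise: `z_interval` is a hypothetical helper name, and it relies on the standard library's `statistics.NormalDist` for the critical value, assuming the population standard deviation is known.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_interval(sample, sigma, confidence):
    """Confidence interval for the mean when the population sigma is known."""
    n = len(sample)
    # Two-sided critical value: leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    center = mean(sample)
    return center - margin, center + margin


heights = [167, 167, 168, 168, 168, 169, 171, 172, 173, 175, 175, 175, 177, 182, 195]
lower, upper = z_interval(heights, sigma=4, confidence=0.80)
print(f"80% CI: ({lower:.2f}, {upper:.2f})")  # 80% CI: (172.14, 174.79)
```

The result agrees with the `stats.norm.interval` output above to the displayed precision.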

## Challenge 2 
In a sample of 105 shops selected randomly from an area, we note that 27 of them had losses this month. Compute an interval for the proportion of businesses in the area with losses at a confidence level of 80% and at a confidence level of 90%.

**Hint**: function `stats.norm.interval` from `scipy` can help you get through this exercise. 

In [11]:
# First I want to look at the proportion of shops with losses in our sample
sample_proportion_losses = 27 / 105
sample_proportion_losses

0.2571428571428571

In [12]:
# Then I need the standard error to measure how close the sample_proportion_losses is to the actual proportion for the entire population
se_sample_prop = np.sqrt(sample_proportion_losses * (1 - sample_proportion_losses) / 105)
se_sample_prop

0.042652572988124506

In [None]:
# Now I want to estimate the proportion of shops with losses in the whole population.
# I don't know the true proportion, so I compute confidence intervals to get
# a range in which the population proportion is likely to fall

In [13]:
# confidence interval with level of 80%
# get the critical value: the interval for a proportion uses the normal approximation,
# so it comes from the z-distribution (1 - (1 - 0.80) / 2 = 0.90)
z_80 = stats.norm.ppf(0.90)
z_80

1.2815515655446004

In [14]:
# se_sample_prop is already the standard error, so it is not divided by sqrt(n) again
marginal_proportion_80 = z_80 * se_sample_prop

In [15]:
# manual calculation
print(f"lower bound: {sample_proportion_losses - marginal_proportion_80:.4f}")
print(f"upper bound: {sample_proportion_losses + marginal_proportion_80:.4f}")

lower bound: 0.2025
upper bound: 0.3118


In [16]:
# stats.norm.interval calculation (scale is the standard error itself)
lower_80, upper_80 = stats.norm.interval(0.80, sample_proportion_losses, scale=se_sample_prop)
print(f"({lower_80:.4f}, {upper_80:.4f})")

(0.2025, 0.3118)

In [17]:
# confidence interval with level of 90%
# get the critical value from the z-distribution (1 - (1 - 0.90) / 2 = 0.95)
z_90 = stats.norm.ppf(0.95)
z_90

1.6448536269514722

In [18]:
# se_sample_prop is already the standard error, so it is not divided by sqrt(n) again
marginal_proportion_90 = z_90 * se_sample_prop

In [19]:
# manual calculation
print(f"lower bound: {sample_proportion_losses - marginal_proportion_90:.4f}")
print(f"upper bound: {sample_proportion_losses + marginal_proportion_90:.4f}")

lower bound: 0.1870
upper bound: 0.3273


In [20]:
# stats.norm.interval calculation (scale is the standard error itself)
lower_90, upper_90 = stats.norm.interval(0.90, sample_proportion_losses, scale=se_sample_prop)
print(f"({lower_90:.4f}, {upper_90:.4f})")

(0.1870, 0.3273)
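
For reference, here is a minimal sketch of the normal-approximation (Wald) interval for a proportion, $\hat{p} \pm z_{\frac{\alpha}{2}}\sqrt{\hat{p}(1-\hat{p})/n}$, in which the standard error is used directly as the scale. The helper name `proportion_interval` is hypothetical, and the standard library's `statistics.NormalDist` stands in for SciPy so the block runs on its own.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, confidence):
    """Wald (normal-approximation) confidence interval for a proportion."""
    p_hat = successes / n
    # Standard error of the sample proportion; sqrt(n) is already inside it
    se = sqrt(p_hat * (1 - p_hat) / n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_hat - z * se, p_hat + z * se


for level in (0.80, 0.90):
    lower, upper = proportion_interval(27, 105, level)
    print(f"{level:.0%} CI: ({lower:.4f}, {upper:.4f})")
```

A higher confidence level gives a wider interval, since the critical value grows while the standard error stays the same.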

## Bonus Challenge
The margin of error (sampling error) for the first challenge is given by the following expression:
$$Error = z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt n}$$
where $z_{\frac{\alpha}{2}}$ is the critical value of the standard normal distribution N(0,1).


Using the data from challenge 1, suppose we want to estimate the population mean height with a confidence level of 99% (that is, almost certainly), so that the error is no greater than half a centimeter.
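
The required sample size follows from solving the error expression for $n$. Writing $E$ for the maximum error allowed (a symbol introduced here for convenience):

$$z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt n} \le E \iff \sqrt n \ge \frac{z_{\frac{\alpha}{2}}\,\sigma}{E} \iff n \ge \left(\frac{z_{\frac{\alpha}{2}}\,\sigma}{E}\right)^2$$

Since $n$ must be a whole number and the error must not exceed $E$, the result is rounded up.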

#### 1.- Determine what size the selected sample of men should be.

In [24]:
# your code here
# we already have the standard deviation of the population (σ) and the error
height_std = 4
error = 0.5

In [30]:
# zα/2 is our critical value. The error formula uses the standard normal distribution
# (σ is known), so I take it from the z-distribution for a confidence level of 99%
# (1 - (1 - 0.99) / 2 = 0.995)
z = stats.norm.ppf(0.995)
z

2.5758293035489004

In [34]:
# Finally, I rearrange the expression to solve for the sample size (n),
# rounding up so that the error does not exceed the limit
import math
n = ((z * height_std)/error)**2
math.ceil(n)

425
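
The sample-size calculation can be checked with a small helper that uses the normal critical value from the error formula. This is only a sketch: `sample_size_for_mean` is a hypothetical name, and `statistics.NormalDist` from the standard library replaces SciPy so the block is self-contained.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_for_mean(sigma, max_error, confidence):
    """Smallest n for which z * sigma / sqrt(n) <= max_error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / max_error) ** 2)


n_required = sample_size_for_mean(sigma=4, max_error=0.5, confidence=0.99)
z_99 = NormalDist().inv_cdf(0.995)
print(n_required)
# The error meets the target at n_required but not with one observation fewer
print(z_99 * 4 / sqrt(n_required) <= 0.5)
print(z_99 * 4 / sqrt(n_required - 1) <= 0.5)
```

Rounding up rather than down matters here: with one observation fewer the error would exceed half a centimeter.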

#### 2.- For the second challenge, we have the following error:
$$ Error = z_{\frac{\alpha}{2}}\sqrt{\frac{p\times q}{n}} $$
#### Determine the sample size required to not exceed an error of 1% with a confidence of 80%.
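
Solving this expression for $n$, with $E$ denoting the maximum error allowed (a symbol introduced here for convenience) and $q = 1 - p$:

$$z_{\frac{\alpha}{2}}\sqrt{\frac{p\times q}{n}} \le E \iff n \ge \frac{z_{\frac{\alpha}{2}}^2 \, p \times q}{E^2}$$

The true $p$ is unknown, so the sample proportion from the second challenge is used as its estimate.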

In [35]:
# Error is defined as 1% or 0.01
error = 0.01

In [36]:
# zα/2 is our critical value, taken from the standard normal distribution as in the error formula,
# for a confidence level of 80% (1 - (1 - 0.80) / 2 = 0.90)
z = stats.norm.ppf(0.90)
z

1.2815515655446004

In [37]:
# this value stands for p while q = 1 - p
sample_proportion_losses

0.2571428571428571

In [41]:
# solve the error expression for n and round up
n = ((z ** 2) * sample_proportion_losses * (1 - sample_proportion_losses)) / error ** 2
math.ceil(n)

3138
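
The required sample size depends on the assumed proportion. The sketch below (hypothetical helper `sample_size_for_proportion`, standard library only) compares the sample estimate p = 27/105 with the worst case p = 0.5, which maximizes p × q and is a common planning choice when no prior estimate exists. The worst-case comparison is an addition, not part of the original exercise.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(p, max_error, confidence):
    """Smallest n for which z * sqrt(p * (1 - p) / n) <= max_error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / max_error ** 2)


n_estimate = sample_size_for_proportion(27 / 105, max_error=0.01, confidence=0.80)
n_worst_case = sample_size_for_proportion(0.5, max_error=0.01, confidence=0.80)
print(n_estimate, n_worst_case)
```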

## Bonus Challenge

Let's consider the following problem:

Build a 94% confidence interval for the real difference between the durations of two brands of spotlights, given that a random sample of 40 spotlights of the first brand had an average duration of 418 hours, and a sample of 50 spotlights of the other brand had an average duration of 402 hours. The standard deviations of the two populations are 26 hours and 22 hours, respectively.

Sometimes we are interested in the difference between two groups of random variables. We can build a confidence interval for that too! There are several cases depending on the variances, but for this specific case (the variances are different and known) we have:

$$\overline{X} - \overline{Y}  \sim N(\mu_{X} - \mu_{Y} , \sqrt{\frac{\sigma_{X}^2}{n_X}+\frac{\sigma_{Y}^2}{n_Y}})$$
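
Here the second parameter is the standard deviation of the difference. Standardizing the difference and bounding it by the critical value $z_{\frac{\alpha}{2}}$ of N(0,1) gives the interval for $\mu_{X} - \mu_{Y}$ at confidence level $1 - \alpha$:

$$(\overline{X} - \overline{Y}) \pm z_{\frac{\alpha}{2}}\sqrt{\frac{\sigma_{X}^2}{n_X}+\frac{\sigma_{Y}^2}{n_Y}}$$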

Solve the problem with this information.

In [42]:
# your code here
sample_a = 40 # this is nX
avg_duration_a = 418 # this is X¯
std_a = 26 # this is σX

In [43]:
sample_b = 50 # this is nY
avg_duration_b = 402 # this is Y¯
std_b = 22 # this is σy

In [44]:
# first I calculate the standard error - the final part of the expression
standard_error = np.sqrt((std_a ** 2 / sample_a) + (std_b ** 2 / sample_b))
standard_error

5.155579501860096

In [45]:
# get the critical value for a confidence level of 94%:
# the upper cutoff is 94% + (100% - 94%) / 2 = 97%
z = stats.norm.ppf(.97)
z

1.8807936081512509

In [46]:
# margin of error for the difference between the two average durations
difference_duration = z * standard_error

In [47]:
print("lower bound: ", (avg_duration_a - avg_duration_b) - difference_duration)
print("upper bound: ", (avg_duration_a - avg_duration_b) + difference_duration)

lower bound:  6.303419026585921
upper bound:  25.69658097341408
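
The steps above can be collected into one function. This is a sketch under the same assumptions (independent samples, known and different population standard deviations). `difference_interval` is a hypothetical name, and `statistics.NormalDist` from the standard library provides the critical value.

```python
from math import sqrt
from statistics import NormalDist


def difference_interval(mean_x, sigma_x, n_x, mean_y, sigma_y, n_y, confidence):
    """CI for mu_X - mu_Y with known, unequal population standard deviations."""
    # Standard error of the difference between the two sample means
    se = sqrt(sigma_x ** 2 / n_x + sigma_y ** 2 / n_y)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = mean_x - mean_y
    return diff - z * se, diff + z * se


lower, upper = difference_interval(418, 26, 40, 402, 22, 50, confidence=0.94)
print(f"94% CI for the difference: ({lower:.2f}, {upper:.2f})")
```

The interval lies entirely above zero, so at the 94% level the data are consistent with the first brand lasting longer on average.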
