# Confidence Intervals: Examples

This notebook collects practical exercises of the concepts introduced in the course videos.

Overview of contents:

1. Definitions
2. Personal Notes: Using Distributions with Scipy
3. CI of One Proportion
4. CI of One Mean
5. CI of Two Proportions
6. CI of Two Means

## 1. Definitions

We must distinguish between:
- Population: the entire group of subjects we want to measure.
- Sample: the subset of the population we actually measure, due to economic limits; measurements are assumed to be independent and identically distributed (iid.).

Even though we measure the sample, we can infer parameters of a population with confidence intervals (CI) defined around a best parameter estimate we have.

Note that we also distinguish between:
- Sample distribution: the distribution of the data we have collected. We can compute its parameters: mean, median, variance, proportions, etc.
- Sampling distribution: if we draw many iid. samples from the population and compute a parameter for each of them, the distribution of that parameter is the sampling distribution. According to the Central Limit Theorem (CLT), it tends to be normal as the sample size grows (for statistics such as means and proportions).
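
The difference between the two distributions can be checked with a small simulation. The following is a minimal sketch, assuming a hypothetical exponential population with mean 2.0 (chosen only because it is clearly non-normal); the seed, sample size and number of samples are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Hypothetical population: exponential with mean 2.0 (its std is also 2.0)
population_mean = 2.0
population_std = 2.0
n = 50            # size of each sample
n_samples = 5000  # number of iid. samples drawn from the population

# Each row is one sample; the row means form the sampling distribution of the mean
samples = rng.exponential(scale=population_mean, size=(n_samples, n))
sample_means = samples.mean(axis=1)

# The sampling distribution is centered near the population mean and its
# spread is close to the standard error: population_std / sqrt(n)
print("Mean of sample means:", sample_means.mean())
print("Std of sample means: ", sample_means.std(ddof=1))
print("Theoretical SE:      ", population_std / np.sqrt(n))
```

A histogram of `sample_means` should look roughly bell-shaped, even though each individual sample is strongly skewed.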

Having a confidence level of 95% means that if we draw 100 independent samples and compute the parameter and its CI with the same method each time, about 95 of the CIs are expected to contain the real parameter of the population. Thus, the confidence is associated with the method we use, not with any single interval.
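
This interpretation can be verified empirically. The sketch below assumes a hypothetical normal population with known mean and standard deviation (values made up for illustration), builds a T-based 95% CI of the mean for many independent samples, and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(42)  # fixed seed, only for reproducibility

# Hypothetical population parameters (unknown in a real study)
true_mean, true_std = 10.0, 3.0
n = 30            # sample size
n_repeats = 2000  # number of independent samples / CIs

k = t.ppf(0.975, df=n - 1)  # two-sided 95% multiplier
covered = 0
for _ in range(n_repeats):
    sample = rng.normal(true_mean, true_std, size=n)
    best_estimate = sample.mean()
    standard_error = sample.std(ddof=1) / np.sqrt(n)
    lower = best_estimate - k * standard_error
    upper = best_estimate + k * standard_error
    covered += int(lower <= true_mean <= upper)

# Fraction of CIs that contain the true mean; expected to be close to 0.95
coverage = covered / n_repeats
print(coverage)
```

The fraction is a property of the procedure: any single interval either contains the true mean or it does not.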

In general, we use the following formula for the computation of the CI:

`Confidence Interval` = `Best Estimate` $\pm$ `Margin of Error`

The terms are obtained as follows:

- The `Best Estimate` is the parameter of our sample: sample mean, sample proportion.
- The `Margin of Error` is `K x Estimated Standard Error`; that is, `K` is how many standard errors we want to cover in the sampling distribution.
- `K` is defined as the value that covers X% in a symmetric Z or T distribution; generally, `Z*(95%)` or `T*(95%,df=n-1)` are taken. Note that `Z*(95%) = 1.96`. The T distribution tends to Z with large sample sizes `n`.
- `Estimated Standard Error = sqrt(estimated variance / n)`, where the estimated variance is the sample variance `s^2` for a mean and `p(1-p)` for a sample proportion `p`.
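
The terms above can be put together in a few lines. This is a minimal sketch with made-up numbers (the measurements and the counts are hypothetical), using a T multiplier for the mean and a Z multiplier for the proportion.

```python
import numpy as np
from scipy.stats import norm, t

# --- CI of one mean (hypothetical measurements) ---
data = np.array([4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 4.7, 5.2, 5.0])
n = len(data)
best_estimate = data.mean()                      # sample mean
standard_error = np.sqrt(data.var(ddof=1) / n)   # sqrt(s^2 / n)
k = t.ppf(0.975, df=n - 1)                       # T*(95%, df=n-1)
margin = k * standard_error
ci_mean = (best_estimate - margin, best_estimate + margin)
print("CI of the mean:", ci_mean)

# --- CI of one proportion (hypothetical counts) ---
n_obs, successes = 1000, 520
p_hat = successes / n_obs                        # sample proportion
se_p = np.sqrt(p_hat * (1 - p_hat) / n_obs)      # sqrt(p(1-p) / n)
z = norm.ppf(0.975)                              # Z*(95%)
ci_prop = (p_hat - z * se_p, p_hat + z * se_p)
print("CI of the proportion:", ci_prop)
```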


## 2. Personal Notes: Using Distributions with Scipy

In the course videos, fixed values are used for `Z*(95%)` and `T*(95%,df)`. However, it is possible to obtain exact values with `scipy`.

For the CI computation, note that the `95%` coverage in the chosen distribution is two-sided; the significance level `alpha` related to that CI would be: `1 - alpha = 0.95 -> alpha = 0.05`. However, tables and the `ppf` function are one-sided: for a given value `v` they relate to `P(x < v)`, whereas we want `1 - alpha = P(-v < x < v)`. In a symmetric distribution, each tail outside `(-v, v)` holds `alpha/2`, so the value we need is the one that satisfies `1 - alpha/2 = P(x < v)`.
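
The effect of the `alpha/2` correction can be checked numerically. The following sketch compares the central coverage obtained with the one-sided and the two-sided quantiles of the standard normal; `norm.interval` is used at the end as a cross-check, since it returns the central interval directly.

```python
from scipy.stats import norm

alpha = 0.05

# Quantile read one-sided: P(x < v) = 1 - alpha
v_one_sided = norm.ppf(1 - alpha)
# Quantile corrected for two sides: P(x < v) = 1 - alpha/2
v_two_sided = norm.ppf(1 - alpha / 2)

# Central coverage P(-v < x < v) for each value
cov_one = norm.cdf(v_one_sided) - norm.cdf(-v_one_sided)  # only 1 - 2*alpha
cov_two = norm.cdf(v_two_sided) - norm.cdf(-v_two_sided)  # the desired 1 - alpha
print(v_one_sided, cov_one)
print(v_two_sided, cov_two)

# Cross-check: central interval that contains 95% of the distribution
low, high = norm.interval(0.95)
print(low, high)
```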

In [24]:
from scipy.stats import norm,t

In [25]:
# Confidence 95% -> significance level alpha = 0.05
# Since we have two sides, we need to consider: alpha/2 = 0.05/2
# Thus, the quantile we look up is: 1 - alpha/2 = 0.975
T_star_95 = t(df=10).ppf(0.975)
print(T_star_95)

2.2281388519649385


In [26]:
Z_star_95 = norm.ppf(0.975)
print(Z_star_95)

1.959963984540054
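
The two values above differ because `df=10` is small. As a quick check of how the T multiplier approaches the Z multiplier, the following sketch evaluates `T*(95%,df)` for a few arbitrarily chosen degrees of freedom.

```python
from scipy.stats import norm, t

z_star = norm.ppf(0.975)

# Degrees of freedom chosen only for illustration
dfs = (5, 10, 30, 100, 1000)
t_stars = [t.ppf(0.975, df=df) for df in dfs]

for df, t_star in zip(dfs, t_stars):
    # The difference T* - Z* shrinks as df grows
    print(f"df={df:5d}  T*={t_star:.4f}  T*-Z*={t_star - z_star:.4f}")
```

For small samples the T multiplier is noticeably larger, which yields wider intervals.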


### 2.1 Further Notes on How to Use Distributions

Load distributions:
```python
from scipy.stats import binom,norm,cauchy
```
Instantiate a distribution with its parameters:
```python    
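dist = binom(n=10, p=0.5)          # example values: 10 trials, success probability 0.5
dist = norm(loc=0.0, scale=1.0)    # example values: mean 0, standard deviation 1
dist = cauchy(loc=0.0, scale=1.0)  # example values: location 0, scale 1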
...
```

All `scipy.stats` distributions accept a `loc` parameter, and continuous ones also accept `scale`; for `norm`, these are the mean and the standard deviation.

Get data:
```python
dist.rvs(N) # N random variates (samples) drawn from the distribution
dist.pmf(x) # Probability Mass Function at values x for discrete distributions
dist.pdf(x) # Probability Density Function at values x for continuous distributions
dist.cdf(x) # Cumulative Distribution Function at values x for any distribution
dist.ppf(q) # Percent point function (inverse of `cdf`) at q (% of accumulated area) of the given RV
```
Note:
- `dist.cdf(v)` = $P(x \le v)$; $P(x < \infty) = 1$
- `dist.ppf(q)` = the value $v$ such that $P(x \le v) = q$
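
The inverse relationship between `cdf` and `ppf` can be verified with a round trip. This is a small sketch on a hypothetical normal distribution; the `loc` and `scale` values are made up.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical distribution: mean 100, standard deviation 15
dist = norm(loc=100.0, scale=15.0)

# Accumulated areas -> values that leave that area to their left
q = np.array([0.025, 0.5, 0.975])
v = dist.ppf(q)

# Going back with the cdf recovers the original areas
q_back = dist.cdf(v)
print(v)
print(q_back)
```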

Fitting data to a distribution:
```python
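import numpy as np
from scipy.stats import norm

# Choose a distribution or iterate through a set of candidates
# Note: fit() is called on the distribution itself (e.g. norm), not on a frozen instance
dist = norm
# Data: replace this with a real dataset (example values used here)
data = dist.rvs(loc=5.0, scale=2.0, size=1000)
# Fit (maximum likelihood); returns (shape parameters..., loc, scale)
params = dist.fit(data)
# Separate parts of parameters
arg = params[:-2]
loc = params[-2]
scale = params[-1]
# Empirical density y at the histogram bin centers x
y, edges = np.histogram(data, bins=30, density=True)
x = (edges[:-1] + edges[1:]) / 2
# Calculate fitted PDF and the sum of squared errors of the fit
pdf = dist.pdf(x, *arg, loc=loc, scale=scale)
sse = np.sum(np.power(y - pdf, 2.0))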
```

Get parameters:
```python
params = dist.stats(moments='mvsk') # Mean ('m'), variance ('v'), skew ('s'), and/or kurtosis ('k')
m = dist.mean()
std = dist.std()
...
```

Documentation:

    help(scipy.stats)
    https://docs.scipy.org/doc/scipy/reference/index.html