# <center> Day 24 - Python - Confidence Intervals</center>

I'll start by importing the necessary modules for you.

In [1]:
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
%matplotlib inline

---

# Z-test<font color ="red">*</font> (Normal test)

<font size = 1><font color ="red">*</font>Of all the stupid names in statistics, <i>Z-test</i> is arguably the worst.</font>

A **z-test** is just a two-tailed hypothesis test regarding the <b>mean of a Normally-distributed random variable</b> (or approximately Normal, by the CLT), <b>given that we know the variance</b>. I'll explain where the name "Z-test" comes from later in the lecture.

We've already encountered this situation in the lectures and in the homeworks. Let's learn how to do it in Python.

Suppose that $X_1, \cdots, X_{10} \sim Normal(\mu, 2)$ (that is, the variance is $\sigma^2 = 2$), and our estimate is $\hat{\mu} = 1$. We want to test

$$H_0: \mu = 0 \qquad H_1: \mu \neq 0$$

at the $\alpha = 0.10$ significance level.

Let's test by constructing a 90% confidence interval.

Begin by giving names to the variables, for convenience.

In [2]:
muhat = 1
mu_hyp = 0
sigmasq = 2
sigma = np.sqrt(sigmasq)
n = 10
alpha = 0.10

### Finding c

The next step is tricky: we have to find $c$, and to find that we need to find the boundaries of the two rejection regions.

<img src="ci1.png">

<img src="ci3.png">

We're looking for numbers $m_1$ and $m_2$ that satisfy

$$P(\hat{\mu} \leq m_1) = \frac{\alpha}{2} \qquad P(\hat{\mu} \geq m_2) = \frac{\alpha}{2}$$

Or, in terms of the CDF of $\hat{\mu}$

$$F_{\hat{\mu}}(m_1) = \frac{\alpha}{2} \qquad F_{\hat{\mu}}(m_2) = 1 - \frac{\alpha}{2}$$

Graphically, $m_1$ and $m_2$ are here:

<img src="cdf.png" width = 400>

Ok, so how would you find $m_1$, for example? One way would be to try out many values of the CDF until you hit $\alpha/2$:

In [64]:
ss.norm(mu_hyp, np.sqrt(sigmasq/n)).cdf(-1)  # Too low...

0.012673659338734126

In [65]:
ss.norm(mu_hyp, np.sqrt(sigmasq/n)).cdf(-.5)  # ...Too high...

0.13177623864148635

In [66]:
ss.norm(mu_hyp, np.sqrt(sigmasq/n)).cdf(-.7356)    # You try it!

0.05000020861307778
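
The trial-and-error search above can be automated. The sketch below is a plain bisection on the CDF. The helper name `bisect_cdf` and the starting bracket $[-5, 0]$ are assumptions made for illustration and are not part of the lecture.

```python
import numpy as np
import scipy.stats as ss

# Values from the running example
mu_hyp, sigmasq, n, alpha = 0, 2, 10, 0.10
dist = ss.norm(mu_hyp, np.sqrt(sigmasq / n))

def bisect_cdf(dist, target, lo, hi, tol=1e-10):
    """Find x in [lo, hi] with dist.cdf(x) close to target (hypothetical helper)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if dist.cdf(mid) < target:
            lo = mid   # CDF too low: the answer is to the right
        else:
            hi = mid   # CDF too high: the answer is to the left
    return (lo + hi) / 2

m1_guess = bisect_cdf(dist, alpha / 2, lo=-5, hi=0)
print(m1_guess)
```

This works because the CDF is increasing, so each guess tells you which half of the bracket to keep.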

### **Inverses!**

Wouldn't it be great if we had some sort of... inverse function?

$$m_1 = F^{-1}_{\hat{\mu}}\left(\frac{\alpha}{2}\right) 
\qquad \qquad m_2 = F^{-1}_{\hat{\mu}}\left(1 - \frac{\alpha}{2}\right)$$

This function exists. It is called the **inverse CDF**, or (more commonly) the **quantile function**, or the **percentile function**, or (by Python) the **percent point function**. This is what it looks like:

<img src="inversecdf.png" width = 400>

The Python usage goes like this:

In [67]:
m1 = ss.norm(mu_hyp, np.sqrt(sigmasq/n)).ppf(alpha/2)        
m2 = ss.norm(mu_hyp, np.sqrt(sigmasq/n)).ppf(1 - alpha/2)  
print("Critical values: ", m1, m2)

Critical values:  -0.73560090458 0.73560090458
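
A quick sanity check, sketched with the same values as above, is that `ppf` really undoes `cdf`. The variable names `x`, `x_alt` and `roundtrip` are just illustrative. The second call uses the non-frozen form `ss.norm.ppf(q, loc=..., scale=...)`, which should give the same number.

```python
import numpy as np
import scipy.stats as ss

mu_hyp, sigmasq, n, alpha = 0, 2, 10, 0.10
se = np.sqrt(sigmasq / n)            # standard deviation of muhat
dist = ss.norm(mu_hyp, se)

q = alpha / 2
x = dist.ppf(q)                      # the point with probability q to its left
x_alt = ss.norm.ppf(q, loc=mu_hyp, scale=se)   # same quantile, non-frozen form
roundtrip = dist.cdf(x)              # applying the CDF should give q back
print(x, x_alt, roundtrip)
```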


Now compute $c$:

In [68]:
c = (m2 - m1)/2
print("c:", c)

c: 0.73560090458


And output the confidence interval $[\hat{\mu} - c, \hat{\mu} + c]$:

In [69]:
Lhat = muhat - c
Uhat = muhat + c
print("Confidence Interval: ", Lhat, Uhat)

Confidence Interval:  0.26439909542 1.73560090458


+ Given these thresholds, do we reject the null $H_0: \mu = 0$?

+ Given these thresholds, would we reject the null $H_0: \mu = 1$?

+ What do you think would happen to the value of $c$ if $\sigma^2 = 20$? 

+ And what if $n = 1000$?
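
To experiment with these questions, the steps above can be bundled into one function. This is a minimal sketch: the name `z_confidence_interval` is hypothetical, and it uses the fact that the Normal is symmetric, so the half-width $c$ can be read directly from the upper quantile of a Normal centered at $0$.

```python
import numpy as np
import scipy.stats as ss

def z_confidence_interval(muhat, sigmasq, n, alpha):
    """Return (lower, upper, c) when the variance sigmasq is known (hypothetical helper)."""
    se = np.sqrt(sigmasq / n)                # standard deviation of muhat
    c = ss.norm(0, se).ppf(1 - alpha / 2)    # half-width of the interval
    return muhat - c, muhat + c, c

lower, upper, c = z_confidence_interval(muhat=1, sigmasq=2, n=10, alpha=0.10)
print(lower, upper, c)
```

Calling it again with `sigmasq=20` or `n=1000` is one way to check your answers to the last two questions.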

### **But why the name "Z-test"?**

We've been working with the distribution of $\hat{\mu}$, which is $Normal(\mu_{hyp}, \  \sigma^2/n)$, for particular values of $\mu_{hyp}$, $\sigma^2$ and $n$. 

Had the values been different, we'd have no problem with that, because we can always use Python to query anything we want to know about it:

```python
ss.norm(17, 274).cdf(.05)
```

But what did people do in the *olden days* of pre-Pythonhood? The most common way to go about this problem was to have *one* table containing values for the $Normal(0,1)$ distribution, and work with the **normalized** $\hat{\mu}$. Recall this from Day 22.

$$\hat{\mu} \sim Normal(\mu, \frac{\sigma^2}{n}) 
\qquad \implies \qquad
\frac{\hat{\mu} - \mu}{\sqrt{\frac{\sigma^2}{n}}} \sim Normal(0, 1)$$ 

The normalized statistic used to be called the **z-statistic**, and that's why this test is also known as the **z-test**. And at the end of statistics textbooks you'll often find **z-tables** containing - nothing more than - values of the $Normal(0,1)$ distribution! 

Don't believe me? See <a href="http://www.stat.ufl.edu/~athienit/Tables/Ztable.pdf">here</a>.
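
If you'd rather check the normalization yourself, a small simulation can help. The sketch below assumes a "true" mean `mu_true = 1` and an arbitrary seed, neither of which comes from the lecture. It draws many samples of size 10, normalizes each sample mean, and checks that the result has mean close to 0 and standard deviation close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, chosen arbitrarily
mu_true, sigmasq_sim, n_sim = 1, 2, 10    # assumed values for the simulation
reps = 100_000

# Each row is one sample X_1, ..., X_n; take the sample mean of each row
samples = rng.normal(mu_true, np.sqrt(sigmasq_sim), size=(reps, n_sim))
muhats = samples.mean(axis=1)

# Normalize: subtract the mean, divide by the standard deviation of muhat
z_sim = (muhats - mu_true) / np.sqrt(sigmasq_sim / n_sim)
print(z_sim.mean(), z_sim.std())          # should be close to 0 and 1
```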

In [372]:
z_stat = (muhat - mu_hyp)/np.sqrt(sigmasq/n)  # Normalizing the statistic
m1 = ss.norm(0, 1).ppf(alpha/2)               # Working with N(0,1)!
m2 = ss.norm(0, 1).ppf(1 - alpha/2)           # Working with N(0,1)!
c = (m2 - m1)/2                         
print("Confidence interval boundaries: ", z_stat - c, z_stat + c)

Confidence interval boundaries:  0.591214350548 3.88092160445
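
The normalized interval carries the same information as the one computed earlier. As a sketch, undoing the normalization (multiply by the standard deviation of $\hat{\mu}$, then add $\mu_{hyp}$ back) should recover $[\hat{\mu} - c, \hat{\mu} + c]$ on the original scale. The names `se`, `c_z`, `lower` and `upper` are illustrative.

```python
import numpy as np
import scipy.stats as ss

muhat, mu_hyp, sigmasq, n, alpha = 1, 0, 2, 10, 0.10
se = np.sqrt(sigmasq / n)                  # standard deviation of muhat

z_stat = (muhat - mu_hyp) / se
c_z = ss.norm(0, 1).ppf(1 - alpha / 2)     # half-width on the N(0,1) scale

# Undo the normalization: multiply by se, then add mu_hyp back
lower = mu_hyp + se * (z_stat - c_z)
upper = mu_hyp + se * (z_stat + c_z)
print(lower, upper)
```

On the normalized scale, $\mu_{hyp}$ corresponds to $0$, so checking whether $0$ lies in the normalized interval is the same as checking whether $\mu_{hyp}$ lies in `[lower, upper]`.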


### Important! [&#x2620;]

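For every $\mu_{hyp}$, we need to compute a new confidence interval and check whether the hypothesized value falls inside or outside it. On the normalized scale used here, $\mu_{hyp}$ itself corresponds to $0$, so the check is whether $0$ lies in the interval.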

Try changing the $\mu_{hyp}$ below, and see if you would or would not reject $H_0$.

In [70]:
mu_hyp = 1                                    # New mu_hyp
z_stat = (muhat - mu_hyp)/np.sqrt(sigmasq/n)  # Z-statistic using new mu_hyp
m1 = ss.norm(0, 1).ppf(alpha/2)               # Working with N(0,1)!
m2 = ss.norm(0, 1).ppf(1 - alpha/2)           # Working with N(0,1)!
c = (m2 - m1)/2                         
print("Confidence interval boundaries: ", z_stat - c, z_stat + c) 

Confidence interval boundaries:  -1.64485362695 1.64485362695


Sadly, the normal was kind of special because the value of $c$ did not change with $\mu_{hyp}$. 

For most tests, the opposite will be the norm: the value of *c* will change depending on the value of the hypothetical parameter. See below.
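
As one illustration, take the Binomial case, where the spread of the statistic depends on the hypothesized parameter. The sketch below uses arbitrary values (`n_trials = 50` and three values of `p_hyp`) that are not from the lecture. Because the Binomial is discrete, the quantiles only approximately cut off $\alpha/2$ in each tail.

```python
import scipy.stats as ss

n_trials, alpha = 50, 0.10     # arbitrary illustrative values
half_widths = {}

for p_hyp in (0.1, 0.3, 0.5):
    dist = ss.binom(n_trials, p_hyp)    # distribution of the count n * phat
    m1 = dist.ppf(alpha / 2)            # lower critical value
    m2 = dist.ppf(1 - alpha / 2)        # upper critical value
    half_widths[p_hyp] = (m2 - m1) / 2
    print(p_hyp, m1, m2, half_widths[p_hyp])
```

Unlike the Normal example, the half-width here is not the same for every hypothesized value.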

---

# Named tests

In the literature you'll find a plethora of tests: **Binomial test**, **t-test**, **$\chi^2$ test**, **F-tests**, and so on. 

Despite the confusing names, **the only thing that will change is the distribution of the statistic**. That is, a "t-test" uses the *Student-t distribution*, a $\chi^2$ test uses a distribution called $\chi^2$, and so on. 


$$X_1, \cdots, X_n \sim Normal(\mu, \sigma^2) \  \implies \ \frac{\hat{\mu} - \mu}{\sqrt{\frac{\sigma^2}{n}}} \sim Normal(0, 1)$$

$$X_1, \cdots, X_n \sim Bernoulli(p) \  \implies \ n\hat{p} \sim Binomial(n, p) $$

$$X_1, \cdots, X_n \sim Normal(\mu, \sigma^2) \  \implies
\frac{n}{\sigma^2}\hat{\sigma}^2 \ \ \sim \ \ \chi^2_{n-1} \qquad $$

$$X_1, \cdots, X_n \sim Normal(\mu, \sigma^2) \  \implies \ \frac{\hat{\mu} - \mu}{\sqrt{\frac{\hat{\sigma}^2}{n}}} \sim t_{n-1}$$

<font size =1>(Note the hat on the sigma!)</font>

Really there's only one rule to keep in mind:

<br>
<center>**Use the right distribution for the right statistic**</center>
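
In Python, swapping distributions mostly means swapping the `scipy.stats` object you call `ppf` on. A minimal sketch, assuming $n = 10$ and $\alpha = 0.10$ as in the earlier example (the variable names are illustrative):

```python
import scipy.stats as ss

n, alpha = 10, 0.10

z_crit = ss.norm(0, 1).ppf(1 - alpha / 2)        # known variance: Normal(0, 1)
t_crit = ss.t(df=n - 1).ppf(1 - alpha / 2)       # estimated variance: Student-t
chi2_lo = ss.chi2(df=n - 1).ppf(alpha / 2)       # variance statistic: chi-squared
chi2_hi = ss.chi2(df=n - 1).ppf(1 - alpha / 2)

print(z_crit, t_crit, chi2_lo, chi2_hi)
```

Note that the Student-t critical value is larger than the Normal one, and that the $\chi^2$ critical values are not symmetric around anything.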

You'll do some exercises on this in the next homework.