# Inferential Statistics (Chapter 4)

We wish to infer parameters of, and draw conclusions about, a population using a statistic (a number describing a characteristic of a sample).

Goal for this module:
* Understanding sampling distributions
* Method 1: Point estimates
* Method 2: Confidence intervals
* Method 3: Hypothesis testing

Let's return to the red wine quality data set and try to better understand samples versus populations. There are two different viewpoints, both valid:
* We have only *sampled* some of the red wines produced in Northern Portugal. Hence, the data set we have is a *sample*, which we can use to infer characteristics of red wines produced in Northern Portugal.
* We can think of our data set as the *population*, e.g., the data covers all the varieties of red wine produced at wineries in Northern Portugal in a certain month.

We shall take the latter interpretation for the rest of the lecture. Let's begin by importing our data set as before.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

# url  =  "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine = pd.read_csv("data/winequality-red.csv", sep=";")

In the last lecture, we learned how to find the mean pH of the wines. Let's call this the population mean, $\mu$.

In [None]:
mu = wine["pH"].mean()
print("mean pH, mu = " + str(mu))

Perhaps someone did not believe that red wine was so acidic and demanded that we re-measure the pH of the wines. This might be prohibitively time consuming and expensive, and we don't want to measure all 1599 red wines again. Instead, we want to take a few random samples of the wines and convince the appropriate party that the reported mean pH was reasonable. How do we do that? How do we quantify our confidence in the reported pH? To answer this, we first need to understand how meaningful it is to take the mean of a sample. A key concept is *the sampling distribution of the sample mean*.

Suppose that our initial measurements were accurate (i.e., if we measured the same wine again, the reported pH would be the same). Suppose we were only interested in re-measuring the pH of $n=30$ wines. If we drew all possible samples of size $n=30$ and computed the mean of each sample, the probability distribution of these means is called the *sampling distribution of the sample mean*. For this example, there are a total of $\binom{1599}{30}$ possible samples, far too many to enumerate. We can instead approximate the sampling distribution by drawing a large number of samples of size $n=30$, say 1000.

In [None]:
N_tests = 1000
n = 30  # sample size, matching the discussion above
means = [0] * N_tests

where the last command generates a list with `N_tests` elements, each initialized to $0$. Let's now generate each sample and store its mean in our new list.

In [None]:
for i in range(N_tests):
    observations = np.random.choice(wine.index.values, n, replace=False)  # n distinct wines
    sampled_wines = wine.loc[observations]
    means[i] = sampled_wines["pH"].mean()

Let's now generate a histogram of these sample means. (Note: assigning the result to the underscore suppresses the matplotlib output.)

In [None]:
nbins = 20
_ = plt.hist(means,nbins )

The central limit theorem tells us that as $n$ increases, this distribution approaches a normal distribution. Let's fit a normal distribution to these sample means.

In [None]:
from scipy.stats import norm
xbar, s = norm.fit(means)
print("mean = %g, standard deviation of distribution = %g"%(xbar, s))

We now superimpose the fitted normal density on top of our histogram:

In [None]:
_ = plt.hist(means,bins=20,density=True )
xmin, xmax = plt.xlim()
x = np.linspace(xmin,xmax,100)
p = norm.pdf(x, xbar, s)
_ = plt.plot(x, p, 'r', linewidth = 2)

So, our sampling distribution is well approximated by the Gaussian (normal) distribution. Let's review some properties of the normal distribution:
* The distribution is symmetric about its mean;
* There is a single peak, with mean = median = mode (the most frequently occurring value), located at $x = \mu$;
* The distribution has inflection points at $\mu \pm \sigma$, where $\sigma$ is the standard deviation;
* The area under the distribution is 1;
* The area under the curve to the left (right) of $\mu$ is 0.5;
* The curve approaches, but never reaches, the horizontal axis.

As we saw in the last lecture, the normal distribution $N(\mu,\sigma^2)$ has the probability density function
$$ f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left(-\frac{(x-\mu)^2}{2 \sigma^2} \right).$$
Often, we standardize normal data by computing $z$-scores, so that any normal curve $N(\mu,\sigma^2)$ can be transformed into the standard normal curve $N(0,1)$,
$$z = \frac{x-\mu}{\sigma}.$$
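
As a minimal sketch of standardization, the cell below uses synthetic pH-like values (the numbers are made up for illustration and are not the wine data). It checks that the z-scores have mean 0 and standard deviation 1, and that a probability computed under $N(\mu,\sigma^2)$ matches the one computed under $N(0,1)$ after the transformation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)                    # fixed seed for reproducibility
x = rng.normal(loc=3.3, scale=0.15, size=1000)    # synthetic, illustrative values

mu_x, sigma_x = x.mean(), x.std()                 # treat x as the whole population (ddof=0)
z = (x - mu_x) / sigma_x                          # z-scores

print("mean of z = %g, std of z = %g" % (z.mean(), z.std()))

# P(X <= x0) under N(mu_x, sigma_x^2) equals P(Z <= z0) under N(0, 1)
x0 = 3.5
z0 = (x0 - mu_x) / sigma_x
p_direct = norm.cdf(x0, loc=mu_x, scale=sigma_x)
p_standard = norm.cdf(z0)
print("P(X <= %g): direct = %g, via z-score = %g" % (x0, p_direct, p_standard))
```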

## Method 1: Point Estimates
We are now ready to describe Method 1: using sample data to give a point estimate (best guess) of a population parameter. The idea is to take a sample of the population, e.g. $n=30$ wines, and compute the mean of the sample, $\bar{x}$. One then estimates the population mean, $\mu$, and reports the standard error, $SE$, using
$$ \hat{\mu} = \bar{x}, \qquad SE = \frac{\sigma}{\sqrt{n}},$$ where $\sigma$ is the standard deviation of the population. If the standard deviation of the population is not known, one can use the sample standard deviation, $s$, provided the population distribution is not skewed and $n \ge 30$.

In [None]:
n = 30
observations = np.random.choice(wine.index.values, n, replace=False)  # n distinct wines
sampled_wines = wine.loc[observations]
xbar = sampled_wines["pH"].mean()
sigma = wine["pH"].std()
se = sigma/np.sqrt(n)
print("Estimate of population mean = %g, standard error = %g"%(xbar,se))

## Method 2: Confidence Intervals
The $100\cdot(1-\alpha)\%$ confidence interval for $\mu$ is,
$$ \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}},$$
i.e., we use the point estimate for the mean, the $z$-score that determines the confidence level, and the standard error of the mean. Let's try to understand this formula by starting with the standard normal (using the $z$-score), and then transforming to the problem at hand. Suppose we want the confidence level $C\% = (1-\alpha)100\%$. Then
\begin{align}
1-\alpha &= P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) \\
&= P\left(-z_{\alpha/2} \le \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right)\\
&= P\left(-z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le (\bar{x}-\mu) \le z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)\\
&= P\left(-\bar{x}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le -\mu \le -\bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)\\
&= P\left(\bar{x}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \ge \mu \ge \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)\\
&= P\left(\bar{x}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right).
\end{align}

For our example, if we want a 95% confidence interval, this means $\alpha=0.05$, which (from a z-table) gives $z_{\alpha/2} = 1.96$.  

In [None]:
alpha = 0.05
z_alphadiv2 = norm.ppf(1-alpha/2)
confidence = norm.cdf(z_alphadiv2)-norm.cdf(-z_alphadiv2)
ci = [xbar - se*z_alphadiv2, xbar + se*z_alphadiv2 ]
print("We are", 100*confidence, "% confident that the interval = ",  ci,  "contains the mean.")

**WARNING**: what does "95% confident" mean? We cannot say that the specific interval we computed has a 95% chance of containing the true parameter. Rather, a correct interpretation is: if we were to take 100 samples and compute 100 confidence intervals, we would expect about 95 of them to contain the true mean of the population. Let's explore and see:

In [None]:
N_test = 100
n = 30
means = np.zeros(N_test)      # sample means
sigma = np.zeros(N_test)      # sample standard deviations
ci = np.zeros((N_test, 2))    # confidence intervals (lower, upper)
mu = wine["pH"].mean()  # true mean
for i in range(N_test):
    observations = np.random.choice(wine.index.values, n, replace=False)  # n distinct wines
    sampled_wines = wine.loc[observations]
    means[i] = sampled_wines["pH"].mean()
    sigma[i] = sampled_wines["pH"].std()
    half_width = sigma[i] * z_alphadiv2 / np.sqrt(n)
    ci[i] = means[i] + np.array([-half_width, half_width])

out1 = ci[:,0] > mu # flag CI that do not contain the "true" mean
out2 = ci[:,1] < mu # flag CI that do not contain the "true" mean


fig, ax = plt.subplots(1, 1, figsize=(12, 5))
ind = np.arange(1, N_test+1)
ax.axhline(y = mu, 
           color = [0, 0, 0])  # spans the full axes width by default

ci = np.transpose(ci)
ax.plot([ind,ind], 
        ci, 
        color = '0.75', 
        marker = '_', 
        ms = 0, 
        linewidth = 3)
ax.plot([ind[out1],ind[out1]], 
        ci[:, out1], 
        color = [1, 0, 0, 0.8], 
        marker = '_', 
        ms = 0, 
        linewidth = 3)
ax.plot([ind[out2],ind[out2]], 
        ci[:, out2], 
        color = [1, 0, 0, 0.8], 
        marker = '_',
        ms = 0, 
        linewidth = 3)
ax.plot(ind, 
        means, 
        color = [0, .8, .2, .8], 
        marker = '.',
        ms = 10, 
        linestyle = '')
ax.set_ylabel("Confidence interval for the samples' mean estimate",
              fontsize = 12)
ax.set_xlabel('Samples (with %d observations). '  %n, 
              fontsize = 12)
plt.show()

If the population standard deviation, $\sigma$, is unknown, then we need to estimate the population standard deviation.  Compare:
\begin{align}
  Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)
\end{align}
and
\begin{align}
T = \frac{\bar{x}-\mu}{s/\sqrt{n}} \sim t(df = n-1)
\end{align}
Let's compare the $t$-distribution to the normal distribution.

In [None]:
from scipy.stats import t

xmin = -10
xmax = 10
x = np.linspace(xmin,xmax,100)
p = norm.pdf(x, 0, 1)
_ = plt.plot(x, p, 'r', linewidth = 2,label="z-curve")

dof = 29
q = t.pdf(x,dof)
_ = plt.plot(x, q, 'b', linewidth = 2,label="T, dof=%d"%(dof))
_ = plt.legend()

Like the normal distribution, the $t$-distribution is symmetric about its mean. However, it has a lower peak and heavier tails than the z-curve. It has an extra parameter, the degrees of freedom. In our case, the degrees of freedom is one less than the number of observations. If the population standard deviation is not known (often the case), one should use the $t$-distribution rather than the normal distribution.
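
To make the difference concrete, here is a small sketch comparing the 95% critical values of the two distributions. The particular degrees of freedom are arbitrary choices for illustration. The $t$ critical value is always larger than the $z$ value, so intervals built from it are wider, and it approaches the $z$ value as the degrees of freedom grow.

```python
from scipy.stats import norm, t

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)        # about 1.96
print("z critical value: %.3f" % z_crit)

# t critical values for a few (arbitrarily chosen) degrees of freedom
t_crits = {}
for dof in [4, 13, 29, 99, 999]:
    t_crits[dof] = t.ppf(1 - alpha / 2, dof)
    print("dof = %4d, t critical value: %.3f" % (dof, t_crits[dof]))
```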

Exercise: An engineer working for Ford is interested in the population of all vehicles that have an engine size of 3.0L or larger, and is particularly interested in $\mu$, the mean highway mileage (mpg). Assume the population is normally distributed. The sample mean among a random sample of 14 vehicles is 18.3 mpg, and the sample standard deviation is 5.1 mpg (note: $\sigma$ is unknown). What is the 95% CI for $\mu$?

In [None]:
alpha = 0.05
xbar = 18.3
n = 14
se = 5.1/np.sqrt(n) # valid only if 5.1 were the population standard deviation
z_alphadiv2 = norm.ppf(1-alpha/2)
confidence = norm.cdf(z_alphadiv2)-norm.cdf(-z_alphadiv2)
ci = [xbar - se*z_alphadiv2, xbar + se*z_alphadiv2 ]
print("We are", 100*confidence, "% confident that the interval = ",  ci,  "contains the mean.")

However, we actually have $s = 5.1$, the sample standard deviation, so we need to use the $t$-distribution:

In [None]:
alpha = 0.05
xbar = 18.3
n = 14
dof = n-1
se = 5.1/np.sqrt(n) # 5.1 is the sample standard deviation
t_alphadiv2 = t.ppf(1-alpha/2,dof)
confidence = t.cdf(t_alphadiv2,dof)-t.cdf(-t_alphadiv2,dof)
ci = [xbar - se*t_alphadiv2, xbar + se*t_alphadiv2 ]
print("We are", 100*confidence, "% confident that the interval = ",  ci,  "contains the mean.")

## Method 3: Hypothesis Testing
Hypothesis: a statement to be tested. The statement under test is referred to as the null hypothesis, $H_0$. The alternative hypothesis, $H_a$, is, as the name suggests, the alternative to the null hypothesis (if $H_0$ is not true, what do I suspect might be true instead?).

There are two competing camps for hypothesis testing:
* Bayesian inference: a probability is assigned/computed for a hypothesis
* Frequentist approach: depends on likelihood of observed/unobserved data.

### Comment:
We will cover (as per the textbook) the frequentist approach, using frequentist measures like $p$-values and confidence intervals. This has been the dominant approach to conducting hypothesis testing over the last two decades. I do want to take a few minutes to talk about the Bayesian approach, since more recently Bayesian inference (MA 5770) has been garnering enthusiasm in fields like machine learning. Bayesian inference uses the idea of conditional probability (e.g. $P(A|B)$: the probability that event A happens given event B) to determine which hypothesis is most probable. One has to specify a prior distribution that represents one's belief about the quantity of interest, and to use Bayesian inference one needs a full understanding of the statistical model. Let's return to the frequentist approach for hypothesis testing.

### Five-Step Procedure
1. Formulate the appropriate hypothesis
2. Decide on an appropriate test statistic
3. Specify the critical region for the test statistic
4. Conduct the experiment and find the specific value for the test statistic
5. Reach an appropriate conclusion and state it.

## Understanding Hypotheses
A common example is drawn from the judicial system: innocent until proven guilty.  Here,
* $H_0$: defendant is innocent
* $H_a$: defendant is guilty

    
<table border=1px>
    <tr>
        <th>Defendant State</th>
        <th>Convict (reject $H_0$) </th>
        <th>Acquit (do not reject $H_0$)</th>
    </tr>
    <tr>
        <td>Innocent ($H_0$ is true)</td>
        <td>Type I error</td>
        <td> OK</td>
    </tr>
    <tr>
        <td>Guilty ($H_0$ is false)</td>
        <td>OK</td>
        <td>Type II error</td>
    </tr>
    </table>

* Type I error: mistake of rejecting the null hypothesis when it was in fact true.  The probability of committing a type I error is often denoted by $\alpha$ (note: overloaded use of variable $\alpha$).
* Type II error: mistake of failing to reject the null hypothesis when it is false.  The probability of committing a type II error is often denoted by $\beta$.
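
The two error rates can be estimated by simulation. The sketch below assumes a normal population with known $\sigma$ and uses made-up values (`mu0`, `mu_true`, `sigma`, `n` are hypothetical). It runs a two-tailed $z$-test many times, first when $H_0$ is true (rejections are type I errors), then when $H_0$ is false (non-rejections are type II errors).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu0, sigma, n = 17.0, 2.0, 30           # hypothetical null value, known sigma, sample size
mu_true = 18.0                          # hypothetical true mean when H0 is false
alpha = 0.05
n_trials = 20000
z_crit = norm.ppf(1 - alpha / 2)

def reject_fraction(true_mean):
    """Fraction of simulated samples for which the two-tailed z-test rejects H0."""
    samples = rng.normal(loc=true_mean, scale=sigma, size=(n_trials, n))
    z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
    return np.mean(np.abs(z) > z_crit)

type1_rate = reject_fraction(mu0)             # H0 true: every rejection is a type I error
type2_rate = 1.0 - reject_fraction(mu_true)   # H0 false: every non-rejection is a type II error
print("Estimated type I error rate  = %g (alpha = %g)" % (type1_rate, alpha))
print("Estimated type II error rate = %g" % type2_rate)
```

The estimated type I error rate should be close to the chosen $\alpha$, while the type II error rate depends on how far the true mean is from $\mu_0$.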

There are several common types of hypotheses:
1. Equal versus not equal hypothesis (a.k.a. two-tailed test)
    * $H_0$: parameter = some value (e.g. $H_0: \mu = 17$)
    * $H_a$: parameter $\neq$ some value (e.g. $H_a: \mu \neq 17$)
2. Equal versus greater than hypothesis (a.k.a. right-tailed test)
    * $H_0$: parameter = some value (e.g. $H_0: \mu = 17$)
    * $H_a$: parameter $>$ some value (e.g. $H_a: \mu > 17$)    
3. Equal versus less than hypothesis (a.k.a. left-tailed test)
    * $H_0$: parameter = some value (e.g. $H_0: \mu = 17$)
    * $H_a$: parameter $<$ some value (e.g. $H_a: \mu < 17$)   
    
## Test Statistic
Let $\mu_0$ be the nominal value for $\mu$, i.e., for the three types of hypotheses above:
1. $H_0: \mu = \mu_0$, $H_a: \mu \neq \mu_0$
2. $H_0: \mu = \mu_0$, $H_a: \mu > \mu_0$
3. $H_0: \mu = \mu_0$, $H_a: \mu < \mu_0$

If $\sigma$ (the population standard deviation) is known, use
$$ Z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} \sim N(0,1). $$
Otherwise, if $\sigma$ is not known, use
$$ T = \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \sim t(df=n-1). $$
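
As a quick check of the $T$ formula, the sketch below computes the statistic by hand on a synthetic sample (the data and `mu0` are made up for illustration) and compares it with the value returned by `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=17.5, scale=2.0, size=14)   # synthetic sample
mu0 = 17.0                                           # hypothetical nominal value

n = len(sample)
xbar = sample.mean()
s = sample.std(ddof=1)                  # sample standard deviation (divides by n-1)
T = (xbar - mu0) / (s / np.sqrt(n))

# scipy's one-sample t-test computes the same statistic
result = stats.ttest_1samp(sample, popmean=mu0)
print("manual T = %g, scipy T = %g" % (T, result.statistic))
```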

## Critical Regions
two-tail test ($H_0: \mu = \mu_0$, $H_a: \mu \neq \mu_0$):
<img src="figures/two_tail.png">

left-tail test ($H_0: \mu = \mu_0$, $H_a: \mu < \mu_0$):
<img src="figures/left_tail.png">

right-tail test ($H_0: \mu = \mu_0$, $H_a: \mu > \mu_0$):
<img src="figures/right_tail.png">
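
The boundaries of these critical regions can be computed from the inverse CDF. This is a minimal sketch for the $z$ statistic, assuming $\alpha = 0.05$; for the $T$ statistic one would use `t.ppf` with the appropriate degrees of freedom instead.

```python
from scipy.stats import norm

alpha = 0.05

# Two-tailed: reject when z < -z_{alpha/2} or z > z_{alpha/2}
z_two = norm.ppf(1 - alpha / 2)
# Left-tailed: reject when z < -z_alpha
z_left = norm.ppf(alpha)
# Right-tailed: reject when z > z_alpha
z_right = norm.ppf(1 - alpha)

print("two-tail:   |z| > %.3f" % z_two)
print("left-tail:   z < %.3f" % z_left)
print("right-tail:  z > %.3f" % z_right)
```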

## Conduct the experiment
Find the specific value of the test statistic.  Note:
* One should always complete the first three steps **before** any data is collected
* A well-planned experiment should have clear criteria for making decisions before the data collection, in order to ensure objectivity

## Conclusions
If the test statistic falls inside the critical region, we reject $H_0$
* We reject $H_0$ if we have **significant** evidence at level $\alpha$ that $H_0$ is false
* We do not reject $H_0$ if data is **NOT significant** at level $\alpha$.

Let's quantify this. One often computes a $p$-value: the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one computed. The $p$-value measures, in some sense, how far into the tail we are, i.e., how significant the observation is. The closer the $p$-value is to zero, the more evidence we have against $H_0$. We reject $H_0$ if $p$-value $< \alpha$.
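
How the $p$-value is computed depends on the alternative hypothesis. The following sketch handles the three cases for a $z$ statistic; `p_value` is a hypothetical helper written for illustration, and the value of `z` is made up.

```python
from scipy.stats import norm

def p_value(z, alternative="two-sided"):
    """p-value of a z statistic for the given alternative hypothesis."""
    if alternative == "two-sided":
        return 2 * norm.sf(abs(z))      # area in both tails
    if alternative == "greater":
        return norm.sf(z)               # area in the right tail
    if alternative == "less":
        return norm.cdf(z)              # area in the left tail
    raise ValueError("unknown alternative: %s" % alternative)

z = 2.1                                 # made-up value of the test statistic
for alt in ["two-sided", "greater", "less"]:
    print("%-9s p-value = %.4f" % (alt, p_value(z, alt)))
```

Here `norm.sf` is the survival function, $1 - \text{cdf}$, which is numerically more accurate far out in the tail.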

Example: let's go back to the pH of our red wines. Let's recompute the population mean and standard deviation:

In [None]:
mu = wine["pH"].mean()
sigma = wine["pH"].std()
print("mean = %g, sigma = %g"%(mu,sigma))

Suppose now that we take a sample of 30 wines. Our null hypothesis is that the mean pH is 3.31. The alternative hypothesis is that the mean is different from 3.31 (i.e., a two-tailed test). Let's sample the population and compute the $p$-value.

In [None]:
n = 30
mu0 = 3.31  # hypothesized mean under H_0
observations = np.random.choice(wine.index.values, n, replace=False)  # n distinct wines
sampled_wines = wine.loc[observations]
xbar = sampled_wines["pH"].mean()

z = (xbar - mu0) / (sigma/np.sqrt(n))

alpha = 0.05
pvalue = 2*(1-norm.cdf(np.abs(z))) # note: the factor of 2 is here because of the two-tailed test

print("The p-value is %g"%(pvalue))

if pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")

There are convenient relationships between confidence intervals and hypothesis testing. 
* Sometimes, we compute a CI for $\mu$ after rejecting $H_0$ to report the location of plausible values for $\mu$.
* A two-sided hypothesis test with a significance level of $\alpha$ is
equivalent to constructing a $(1- \alpha)100\%$ confidence interval for $\mu$.
* We can check whether the CI contains $\mu_0$:
    * If the interval does contain $\mu_0$, then we fail to reject $H_0$.
    * If the interval does not contain $\mu_0$, then we reject $H_0$.
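
The equivalence can be checked numerically. In this sketch, `z_test_and_ci` is a hypothetical helper and all the numbers are made up. For each sample mean, it reports whether the two-sided $z$-test rejects $H_0$ via the $p$-value and whether the $(1-\alpha)100\%$ confidence interval excludes $\mu_0$. The two decisions should always agree.

```python
import numpy as np
from scipy.stats import norm

def z_test_and_ci(xbar, mu0, sigma, n, alpha=0.05):
    """Return (reject_by_pvalue, reject_by_ci) for a two-sided z-test."""
    se = sigma / np.sqrt(n)
    z = (xbar - mu0) / se
    pvalue = 2 * norm.sf(abs(z))
    z_crit = norm.ppf(1 - alpha / 2)
    lo, hi = xbar - z_crit * se, xbar + z_crit * se
    return bool(pvalue < alpha), not (lo <= mu0 <= hi)

# Made-up sample means around a hypothetical mu0
for xbar in [3.25, 3.30, 3.33, 3.40]:
    by_p, by_ci = z_test_and_ci(xbar, mu0=3.31, sigma=0.15, n=30)
    print("xbar = %.2f: reject by p-value = %s, reject by CI = %s" % (xbar, by_p, by_ci))
```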

## Power of a test
The power of a hypothesis test is the probability that we reject the null hypothesis when the alternative hypothesis is true,
\begin{align}
P(\text{reject } H_0 \,| \, H_a \text{ is true}) = 1 - \beta,
\end{align}
where $\beta$ is the probability of a type II error. For example, suppose we want to test $H_0: \mu = \mu_0$ against $H_a: \mu> \mu_0$. The test statistic is
$$ z = \frac{\bar{x}-\mu_0}{ \frac{\sigma}{\sqrt{n}}}.$$
The critical region is $z>z_{\alpha}$, so the power is $P(z>z_\alpha \,| \,\mu = \mu_1)$, where $\mu_1 > \mu_0$ is a specific value of $\mu$ under $H_a$.
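
For this right-tailed test the power has a closed form. If the true mean is some $\mu_1 > \mu_0$, the statistic $z$ is normal with mean $(\mu_1-\mu_0)/(\sigma/\sqrt{n})$ and unit variance, so the power is $1 - \Phi\left(z_\alpha - \frac{\mu_1-\mu_0}{\sigma/\sqrt{n}}\right)$. The sketch below evaluates this; `power_right_tailed` is a hypothetical helper, and the numbers are made up.

```python
import numpy as np
from scipy.stats import norm

def power_right_tailed(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the right-tailed z-test when the true mean is mu1."""
    z_alpha = norm.ppf(1 - alpha)
    shift = (mu1 - mu0) / (sigma / np.sqrt(n))   # mean of the z statistic when mu = mu1
    return norm.sf(z_alpha - shift)

# Made-up numbers: the power grows with the sample size
for n in [10, 30, 100]:
    print("n = %3d, power = %.3f" % (n, power_right_tailed(17.0, 18.0, 2.0, n)))
```

When $\mu_1 = \mu_0$ the expression reduces to $\alpha$, as expected.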