# Hypothesis and Inference

In this chapter, we test hypotheses.  First, let's test the hypothesis that a coin is fair, based on a series of flips.  The chapter also builds on functions from earlier chapters.

### Assumptions:

1. Each flip is a Bernoulli trial, meaning that the number of heads `X` is a binomial `(n, p)` random variable.
2. `X` can be approximated using a normal distribution.
3. The normal CDF is the probability that a variable is below a threshold.
4. A variable is above the threshold if it is not below it.
5. A variable is between the thresholds if it is less than `hi` but not less than `lo`.
6. A variable is outside the thresholds if it is not between them.

In [28]:
import math

# normal approximation to a Binomial(n, p) variable (#1, #2)
def normal_approximation_to_binomial(n, p):
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma

# probability that a normal variable is below a threshold (#3)
def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

normal_probability_below = normal_cdf

# probability that a normal variable is above a threshold (#4)
def normal_probability_above(lo, mu=0, sigma=1):
    return 1 - normal_cdf(lo, mu, sigma)

# probability that a normal variable is between lo and hi (#5)
def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# probability that a normal variable is outside lo and hi (#6)
def normal_probability_outside(lo, hi, mu=0, sigma=1):
    return 1 - normal_probability_between(lo, hi, mu, sigma)
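
As a quick sanity check of these helpers, the sketch below redefines the two functions it needs so that it runs on its own, then confirms two identities: "below" and "above" are complements, and roughly 95% of a standard normal lies within 1.96 standard deviations of the mean. The thresholds 1.0 and 1.96 are illustrative choices, not values taken from this chapter.

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# "below" and "above" a threshold are complements
below = normal_cdf(1.0)
above = 1 - normal_cdf(1.0)

# share of a standard normal within +/- 1.96 standard deviations
central = normal_probability_between(-1.96, 1.96)
outside = 1 - central

print(round(central, 4), round(outside, 4))
```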

We can also do the reverse: given a probability, find either the nontail region or the symmetric interval around the mean that accounts for it.  This relies on `inverse_normal_cdf`:

In [3]:
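def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
    # if not standard, compute the standard result and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    low_z, low_p = -10.0, 0   # normal_cdf(-10) is (very close to) 0
    hi_z, hi_p = 10.0, 1      # normal_cdf(10) is (very close to) 1
    # binary search for the z whose CDF value is p
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2   # consider the midpoint
        mid_p = normal_cdf(mid_z)    # and the CDF's value there
        if mid_p < p:
            # midpoint is still too low, search above it
            low_z, low_p = mid_z, mid_p
        elif mid_p > p:
            # midpoint is still too high, search below it
            hi_z, hi_p = mid_z, mid_p
        else:
            break
    return mid_z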

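def normal_upper_bound(probability, mu=0, sigma=1):
    # returns the z for which P(Z <= z) = probability
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability, mu=0, sigma=1):
    # returns the z for which P(Z >= z) = probability
    return inverse_normal_cdf(1 - probability, mu, sigma)

def normal_two_sided_bounds(probability, mu=0, sigma=1):
    # returns the symmetric (about the mean) bounds that contain the given probability
    tail_probability = (1 - probability) / 2
    # the upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)
    # the lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)
    return lower_bound, upper_bound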
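
One way to gain confidence in the bisection search is to compare it with an independent implementation. The sketch below uses a condensed copy of `inverse_normal_cdf` so it runs on its own, and checks it against `statistics.NormalDist` from the Python standard library (Python 3.8+). The probability 0.975 is just an illustrative input.

```python
import math
from statistics import NormalDist

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
    # same bisection idea as above, condensed for this sketch
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    low_z, hi_z = -10.0, 10.0
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2
        if normal_cdf(mid_z) < p:
            low_z = mid_z
        else:
            hi_z = mid_z
    return (low_z + hi_z) / 2

z_bisect = inverse_normal_cdf(0.975)
z_stdlib = NormalDist().inv_cdf(0.975)  # standard library reference value
round_trip = normal_cdf(z_bisect)       # should be close to 0.975

print(z_bisect, z_stdlib, round_trip)
```

The two values should agree to within the chosen `tolerance`, and feeding the result back into `normal_cdf` should recover the original probability.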

Now that we have our functions, let's begin testing.  Let `n = 1000`, where `n` is the number of coin flips.  If our hypothesis is true, `X` should be approximately normal with a mean of 500.

In [6]:
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)

print(mu_0, sigma_0)

500.0 15.811388300841896
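
The printed values follow directly from the binomial mean and standard deviation. Written out for `n = 1000` and `p = 0.5`:

$$
\mu_0 = np = 1000 \times 0.5 = 500, \qquad \sigma_0 = \sqrt{np(1-p)} = \sqrt{250} \approx 15.81
$$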


So, we have our `mu` (mean) and `sigma` (standard deviation) values.  Next, we need to choose a significance level: how willing we are to accept a false positive, in which we reject the hypothesis even though it is true.  Here we set it at `5%`.

In [7]:
normal_two_sided_bounds(0.95, mu_0, sigma_0)

(469.01026640487555, 530.9897335951244)
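
These bounds are the mean plus or minus a multiple of the standard deviation. Assuming the standard normal 97.5th percentile of roughly 1.96 (a standard value, not one computed in this chapter), the arithmetic is approximately:

$$
\mu_0 \pm z_{0.975}\,\sigma_0 \approx 500 \pm 1.96 \times 15.81 \approx (469.0,\ 531.0)
$$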

The values 469 and 531 are our lower and upper bounds, respectively.  If `H_0` (our hypothesis that the coin is fair, that is, `p = 0.5`) is true, there is only a 5% chance that we observe an `X` outside this interval, so the test will give the correct result 19 times out of 20.

Our next goal is to determine the *power* of our test.  Significance controls the probability of a type 1 error (a false positive); power is the probability of not making a type 2 error (failing to reject `H_0` even though it is false).  To measure it, we must specify what `H_0` being false means, so we pick a specific alternative value of `p`.  Here we check what happens if `p = 0.55`.

In [8]:
# set vars for determining power of our test
lo, hi = normal_two_sided_bounds(0.95, mu_0, sigma_0)
print(lo, hi)

469.01026640487555 530.9897335951244


In [11]:
# set vars for determining power if p = 0.55
mu_1, sigma_1 = normal_approximation_to_binomial(1000, 0.55)
print(mu_1, sigma_1)


550.0 15.732132722552274


Now we can compute the power.  A type 2 error means we fail to reject `H_0`, which happens when `X` still lands inside the original interval from 469 to 531 even though `p` is really 0.55.  Note that the two-sided test also rejects `H_0` when `X` falls below 469, which is very unlikely to happen if `p = 0.55`.

In [15]:
type_2_probability = normal_probability_between(lo, hi, mu_1, sigma_1)
power = 1 - type_2_probability
print(power)

0.8865480012953671


To get a more powerful test, we can use a one-sided test that rejects `H_0` when `X` is much larger than 500, but not when it is smaller.  One-sided tests are useful when the alternative hypothesis `H_1` is known to be biased in one direction.

In [17]:
hi = normal_upper_bound(0.95, mu_0, sigma_0)
print(hi)

526.0073585242053


In [23]:
type_2_probability = normal_probability_below(hi, mu_1, sigma_1)
power = 1 - type_2_probability
print(power)

0.9363794803307173
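
The power values above rely on the normal approximation. As a rough cross-check, the sketch below computes the same two quantities from the exact binomial distribution with `p = 0.55`. The helper names `binomial_pmf` and `probability_in_range` are hypothetical, introduced only for this illustration, and the integer cutoffs are read off the bounds printed earlier (469.01, 530.99 and 526.01).

```python
import math

def binomial_pmf(k, n, p):
    # computed in log space so that large n does not overflow or underflow
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def probability_in_range(lo_k, hi_k, n, p):
    # P(lo_k <= X <= hi_k) for X ~ Binomial(n, p)
    return sum(binomial_pmf(k, n, p) for k in range(lo_k, hi_k + 1))

n, p_alt = 1000, 0.55

# the two-sided test fails to reject H_0 when 470 <= X <= 530
two_sided_power = 1 - probability_in_range(470, 530, n, p_alt)

# the one-sided test fails to reject H_0 when X <= 526
one_sided_power = 1 - probability_in_range(0, 526, n, p_alt)

print(two_sided_power, one_sided_power)
```

The exact figures will differ slightly from the normal-approximation results, but the one-sided test should still come out as the more powerful of the two.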


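That's a lot better.  The new test no longer rejects `H_0` when `X` is below 469 (which is very unlikely if `H_1` is true) and instead rejects `H_0` when `X` is between 526 (the new `hi`) and 531 (the upper bound of the two-sided test), which is somewhat likely if `H_1` is true.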

Another way to think about the test is through *p-values*.  Instead of choosing bounds from a probability cutoff, we compute the probability, assuming `H_0` is true, of seeing a value at least as extreme as the one we actually observed.

In [30]:
def two_sided_p_value(x, mu=0, sigma=1):
    if x >= mu:
        return 2 * normal_probability_above(x, mu, sigma)
    else:
        return 2 * normal_probability_below(x, mu, sigma)

# Use 529.5 rather than 530 as a continuity correction for observing 530 heads:
# the normal probability between 529.5 and 530.5 is a better estimate of the
# probability of exactly 530 heads than the probability between 530 and 531.
two_sided_p_value(529.5, mu_0, sigma_0)

0.06207721579598857
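
To illustrate why the continuity correction helps, the sketch below compares the exact binomial probability of seeing exactly 530 heads with two normal estimates: the interval from 529.5 to 530.5 (corrected) and the interval from 530 to 531 (uncorrected). It redefines `normal_cdf` so that it runs on its own and uses `math.comb`, which requires Python 3.8 or later.

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

n, p = 1000, 0.5
mu_0 = n * p
sigma_0 = math.sqrt(n * p * (1 - p))

# exact probability of exactly 530 heads in 1000 fair flips
exact = math.comb(n, 530) / 2 ** n

# normal estimates with and without the continuity correction
corrected = normal_cdf(530.5, mu_0, sigma_0) - normal_cdf(529.5, mu_0, sigma_0)
uncorrected = normal_cdf(531, mu_0, sigma_0) - normal_cdf(530, mu_0, sigma_0)

print(exact, corrected, uncorrected)
```

The corrected estimate should land noticeably closer to the exact value than the uncorrected one.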

A quick way to check that this is a sensible estimate is to run a simulation:

In [34]:
import random
extreme_value_count = 0

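for _ in range(100000):
    # count the number of heads in 1000 flips of a fair coin
    num_heads = sum(1 if random.random() < 0.5 else 0
                    for _ in range(1000))
    # count how often the result is at least as extreme as 530 heads
    if num_heads >= 530 or num_heads <= 470:
        extreme_value_count += 1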

print(extreme_value_count / 100000)

0.06181


So what does this value mean?  Since the p-value is larger than our 5% significance level, we don't reject the null hypothesis.  If we had seen slightly more heads, 532 instead of 530, the outcome would be different:

In [35]:
two_sided_p_value(531.5, mu_0, sigma_0)

0.046345287837786575

Since this value falls below our 5% threshold, we would reject the null hypothesis.

For a one-sided test, we define the following functions:

In [36]:
upper_p_value = normal_probability_above
lower_p_value = normal_probability_below

upper_p_value(524.5, mu_0, sigma_0)

0.06062885772582083

With 525 heads, as above, we wouldn't reject the null hypothesis.  But if we saw 527 heads:

In [37]:
upper_p_value(526.5, mu_0, sigma_0)

0.04686839508859242

This p-value is below 5%, so the one-sided test would reject the null hypothesis.

Another approach is to construct a *confidence interval* around the observed value of the parameter.  By the central limit theorem, the average of the Bernoulli variables behind `X` should be approximately normal, with mean `p` and standard deviation:

`math.sqrt(p * (1 - p) / 1000)`

We don't know `p`, so instead we use an estimate.  For example, if we observe 525 heads out of 1000 flips:

In [40]:
p_hat  = 525 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)
print(sigma)

0.015791611697353755


In [41]:
normal_two_sided_bounds(0.95, mu, sigma)

(0.4940490278129096, 0.5559509721870904)

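So, using the normal approximation, we can say that we are 95% confident that the interval from 0.4940 to 0.5560 contains the true `p`.  Since 0.5 falls inside this interval, we do not conclude that the coin is unfair.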

By contrast, if we had seen 540 heads:

In [42]:
p_hat = 540 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)
print(sigma)

0.015760710643876435


In [43]:
normal_two_sided_bounds(0.95, mu, sigma)

(0.5091095927295919, 0.5708904072704082)

Here the fair-coin value `p = 0.5` (our `H_0`) falls outside the confidence interval, so the fair-coin hypothesis does not pass the test.
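
The two interval calculations can be wrapped into a single helper. The sketch below is one possible way to do that, not the chapter's own code: the names `proportion_confidence_interval` and `contains` are hypothetical, and the critical value comes from `statistics.NormalDist` in the standard library rather than from `inverse_normal_cdf`.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, trials, level=0.95):
    # hypothetical helper: normal-approximation interval around the observed proportion
    p_hat = successes / trials
    sigma = math.sqrt(p_hat * (1 - p_hat) / trials)
    # z such that the central `level` share of a standard normal lies within +/- z
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return p_hat - z * sigma, p_hat + z * sigma

def contains(interval, value):
    lo, hi = interval
    return lo <= value <= hi

interval_525 = proportion_confidence_interval(525, 1000)
interval_540 = proportion_confidence_interval(540, 1000)

print(interval_525, contains(interval_525, 0.5))
print(interval_540, contains(interval_540, 0.5))
```

Checking whether 0.5 falls inside each interval reproduces the two conclusions above: the fair-coin value is inside the interval for 525 heads and outside it for 540 heads.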

A procedure that erroneously rejects the null hypothesis only 5% of the time will, by definition, still erroneously reject it 5% of the time.  This opens the door to *p-hacking*: if you test enough hypotheses against a data set, one of them will almost certainly appear significant, and if you remove the right outliers you can probably get a p-value below `0.05`.  This is not a sound way to validate results.  A good data scientist should develop hypotheses before looking at the data, clean the data without the hypotheses in mind, and remember that p-values are not a substitute for common sense.
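
To see why running many tests is risky, the sketch below simulates 1000 experiments of 1000 flips each with a genuinely fair coin and counts how often the two-sided 5% test rejects fairness anyway. The function names and the random seed are illustrative assumptions. The cutoffs 469 and 531 are the rounded two-sided bounds computed earlier.

```python
import random

def run_experiment(num_flips=1000):
    # flip a fair coin num_flips times; True means heads
    return [random.random() < 0.5 for _ in range(num_flips)]

def reject_fairness(experiment):
    # two-sided test at the 5% level, using the bounds 469 and 531
    num_heads = sum(experiment)
    return num_heads < 469 or num_heads > 531

random.seed(0)  # arbitrary seed, only so that the run is repeatable
experiments = [run_experiment() for _ in range(1000)]
num_rejections = sum(reject_fairness(e) for e in experiments)

print(num_rejections / 1000)
```

Even though every coin here is fair, a few percent of the experiments get flagged as significant. If you only report those, you will appear to have found biased coins.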

When comparing two sets of data, it may be appropriate to use an *A/B test*.  In this example, we compare the popularity of two ads, A and B.  If `N_A` people see ad A and `n_A` of them click it, and `N_B` people see ad B and `n_B` of them click it, then `n_A / N_A` is approximately a normal random variable with mean `p_A` and standard deviation `math.sqrt(p_A * (1 - p_A) / N_A)`, and likewise for ad B.

In [45]:
def estimated_parameters(N, n):
    p = n / N
    sigma = math.sqrt(p * (1 - p) / N)
    return p, sigma

def a_b_test_statistic(N_A, n_A, N_B, n_B):
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)
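
In symbols, the statistic returned by `a_b_test_statistic` can be written as follows, where `p_A = n_A / N_A` and `p_B = n_B / N_B` are the estimated click rates. Assuming the two samples are independent, it is approximately standard normal when the true click rates are equal:

$$
z = \frac{p_B - p_A}{\sqrt{\sigma_A^2 + \sigma_B^2}}, \qquad \sigma_A = \sqrt{\frac{p_A (1 - p_A)}{N_A}}, \quad \sigma_B = \sqrt{\frac{p_B (1 - p_B)}{N_B}}
$$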

So, if ad A ("Tastes Great") gets 200 clicks from 1000 views and ad B ("Less Bias") gets 180 clicks from 1000 views:

In [46]:
z = a_b_test_statistic(1000, 200, 1000, 180)
print(z)

-1.1403464899034472


The probability of seeing a difference at least this large if the means were actually equal would be:

In [47]:
two_sided_p_value(z)

0.254141976542236

Which is large enough that you can't conclude there's much of a difference.  On the other hand, if "Less Bias" only got 150 clicks:

In [48]:
z = a_b_test_statistic(1000, 200, 1000, 150)
print(z)

-2.948839123097944


In [49]:
two_sided_p_value(z)

0.003189699706216853

Which means there's only a 0.003 probability that you'd see such a large difference if the ads were equally effective.

A final approach is to treat the unknown parameters themselves as random variables.  Starting from a *prior distribution* for the parameters, we use the observed data and *Bayes's Theorem* to get an updated *posterior distribution*.  This lets us make probability judgments about the parameters themselves rather than about the tests.

For example, when the unknown parameter is a probability like in the coin flipping example, we often use a prior from the *Beta distribution*, which puts all its probability between 0 and 1:

In [51]:
def B(alpha, beta):
    # normalizing constant so that the total probability is 1
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(x, alpha, beta):
    if x < 0 or x > 1:
        return 0
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B(alpha, beta)
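
With a Beta prior and coin-flip data, the posterior is again a Beta distribution: starting from `Beta(alpha, beta)` and observing `h` heads and `t` tails gives `Beta(alpha + h, beta + t)`. The sketch below illustrates that update with made-up data (3 heads in 10 flips) and a uniform prior. It is self-contained, so it defines its own `beta_pdf`. Computing the normalizing constant with `math.lgamma` is an implementation choice of this sketch that avoids overflow for large parameters. The helper names `update` and `beta_mean` are hypothetical.

```python
import math

def beta_pdf(x, alpha, beta):
    # Beta density; the endpoints 0 and 1 are treated as 0 for simplicity
    if x <= 0 or x >= 1:
        return 0.0
    log_B = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return math.exp((alpha - 1) * math.log(x)
                    + (beta - 1) * math.log(1 - x) - log_B)

def update(alpha, beta, heads, tails):
    # conjugate update: add the observed counts to the prior parameters
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

prior = (1, 1)                                # uniform prior, an illustrative choice
posterior = update(*prior, heads=3, tails=7)  # made-up data: 3 heads in 10 flips

# crude midpoint-rule check that the posterior density integrates to about 1
steps = 10000
area = sum(beta_pdf((i + 0.5) / steps, *posterior) for i in range(steps)) / steps

print(posterior, beta_mean(*posterior), area)
```

The posterior mean sits between the prior mean of 0.5 and the observed proportion of 0.3, and it moves closer to the observed proportion as more flips are added.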

