### Homework 4

**Exercise 1**

Suppose we want to test whether the mean weight of apples in a grocery store is 150 grams. 

We randomly sample 20 apples from the store and measure their weights, getting the following data:<br>
Apple_weights = [145, 155, 160, 146, 142, 152, 150, 147, 148, 149, 148, 152, 153, 155, 154, 148, 151, 147, 153, 146]

* What test should we use and why?<br>
  * <font color='blue'>A **t-test** would be the default choice here, since those are meant for testing hypotheses about population means.</font>
  * <font color='blue'>We will go with a **one-sample** test, because we compare a single sample against a test value (rather than against another sample).</font>
  * <font color='blue'>We will go with a **two-tailed** test, because the alternative hypothesis is that the mean is unequal to the test value (rather than smaller or larger than the test value)</font>  
  * <font color='blue'>If there is no strong evidence against the hypothesis that the sample comes from a normally distributed population, we will go with the **Student t-test**, as that is the most powerful test when dealing with normally distributed data (it is strong because it exploits that information). Otherwise we will go with the **Wilcoxon signed-rank test**.</font>
* State the null and alternative hypotheses.<br>
<font color='blue'>$H_0: \mu_{pop} = 150$</font><br>
<font color='blue'>$H_1: \mu_{pop} \neq 150$</font><br>
* Choose a significance level (α) <br>
<font color='blue'><b>Short answer</b><br>
Let's choose $\alpha = .05$, because that is what the whole world uses and what Iljas probably wants us to do :-P</font><br><br>
<font color='blue'><b>Long answer</b><br>
<font color='blue'>While most scientists use .05 as a rejection level, this is not a golden rule and it lacks theoretical justification. Even Ronald Fisher himself (the father of the $p$-value and the first person to propose $\alpha = .05$ as a suitable rejection level) thought it would be absurd to use the same rejection level in every situation:<br><br>
<i>"A man who ‘rejects’ a hypothesis provisionally, as a matter of habitual practice, when the significance is at the 1% level or higher, will certainly be mistaken in not more than 1% of such decisions. For when the hypothesis is correct he will be mistaken in just 1% of these cases, and when it is incorrect he will never be mistaken in rejection. [...] However, the calculation is absurdly academic, **for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.** It should not be forgotten that the cases chosen for applying a test are manifestly a highly selected set, and that the conditions of selection cannot be specified even for a single worker; nor that in the argument used it would clearly be illegitimate for one to choose the actual level of significance indicated by a particular trial as though it were his lifelong habit to use just this level."</i>
<br>(Statistical Methods and Scientific Inference, 1956, p. 42-45)<br><br>
In practice, what is a suitable rejection level depends on many factors, including the cost of making certain types of decision errors. For example, if it is very costly to wrongfully reject a true H0, then we may want to be conservative and choose a very low alpha level. If it is very costly to fail to reject a false H0, then we may want to be liberal and use a relatively high alpha level.
</font>

* Determine the degrees of freedom (df) of the sample. <br>
<font color='blue'>For a one-sample t-test, the number of degrees of freedom equals the sample size minus one. Hence, in this case: df = 19</font>
* Determine the critical value of t based on the significance level and degrees of freedom.<br> 
<font color='blue'>For a two-tailed test with α = 0.05 and df = 19, the critical value is 2.093:</font>
```python
import scipy.stats as stats
t_crit = stats.t.ppf(.975, 19)   # this gives 2.093
```
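
The same call also shows how the critical value depends on the chosen α. Below is a minimal sketch for df = 19; the three α levels are picked purely for illustration and are not part of the exercise.

```python
import scipy.stats as stats

df = 19
t_crits = {}
for alpha in [.10, .05, .01]:
    # two-tailed test: alpha/2 goes into each tail
    t_crits[alpha] = stats.t.ppf(1 - alpha / 2, df)
    print(f'alpha={alpha:.2f}: reject H0 if |t| > {t_crits[alpha]:.3f}')
```

A lower α pushes the critical value outward, which makes it harder to reject H0.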
* Compare and interpret the results of the test to the critical value<br>
<font color='blue'>Using the code below, we find $t = 0.052$. Since $|t| < t_{\mathrm{crit}}$, we conclude that the current data give us no reason to reject H0.</font><br>

In [1]:
import scipy.stats as stats
import numpy as np

apple_weights = np.array([145, 155, 160, 146, 142, 152, 150, 147, 148, 149, 148, 152, 153, 155, 154, 148, 151, 147, 153, 146])
test_value = 150

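# t statistic: (sample mean - test value) / standard error of the mean
t = (apple_weights.mean() - test_value) / (apple_weights.std(ddof=1) / np.sqrt(apple_weights.size))
print(f't={t:.3f}')
# two-tailed p-value; abs(t) keeps this correct when t is negative
print(f'p={2 * stats.t.sf(abs(t), apple_weights.size - 1):.3f}')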

t=0.052
p=0.959
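
As a cross-check on the hand computation, and to act on the normality caveat from the first question, one could run something like the sketch below. It relies on SciPy's `shapiro`, `ttest_1samp` and `wilcoxon`. Using Shapiro-Wilk as the normality check is my own assumption; the exercise does not prescribe a particular check.

```python
import numpy as np
import scipy.stats as stats

apple_weights = np.array([145, 155, 160, 146, 142, 152, 150, 147, 148, 149,
                          148, 152, 153, 155, 154, 148, 151, 147, 153, 146])
test_value = 150

# Normality check (assumption: Shapiro-Wilk is an acceptable choice here)
_, p_normal = stats.shapiro(apple_weights)

# Built-in one-sample, two-tailed Student t-test
t_stat, p_t = stats.ttest_1samp(apple_weights, test_value)

# Non-parametric fallback: Wilcoxon signed-rank test on the differences
_, p_w = stats.wilcoxon(apple_weights - test_value)

print(f'Shapiro-Wilk p={p_normal:.3f}')
print(f't={t_stat:.3f}, p={p_t:.3f}')
print(f'Wilcoxon p={p_w:.3f}')
```

The `ttest_1samp` line should reproduce the t and p values computed by hand above. The data contain ties and one zero difference, so `wilcoxon` may fall back to a normal approximation for its p-value.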


**Exercise 2**

Suppose we want to test whether the mean height of all men in a population is 180 cm, assuming that the population standard deviation is 2 cm. We randomly sample 50 men from the population and measure their heights, getting the following data:

Men_height = [177, 180, 182, 179, 178, 181, 176, 183, 179, 180, 178, 181, 177, 178, 180, 179, 182, 180, 183, 181, 179, 177, 180, 181, 178, 180, 182, 179, 177, 182, 178, 181, 183, 179, 180, 181, 183, 178, 177, 181, 179, 182, 180, 181, 178, 180, 179, 181, 183, 179]

* What test should we use and why?<br>
  * <font color='blue'>We have a relatively large sample (>30 measurements). Furthermore, assuming that the population is normally distributed and that we know the standard deviation, it would make sense to use a **Z-test** here.</font>
  * <font color='blue'>We will go with a **one-sample** test, because we compare a single sample against a test value (rather than against another sample).</font>
  * <font color='blue'>We will go with a **two-tailed** test, because the alternative hypothesis is that the mean is unequal to the test value (rather than smaller or larger than the test value)</font>

* State the null and alternative hypotheses<br>
<font color='blue'>$H_0: \mu_{pop} = 180$</font><br>
<font color='blue'>$H_1: \mu_{pop} \neq 180$</font><br>

* Choose a significance level (α).<br>
<font color='blue'>See my answer in **Exercise 1**</font>

* Determine the degrees of freedom (df) of the sample.<br>
<font color='blue'><b>Short answer</b><br>There are no degrees of freedom in a Z-test, because we assume that the population standard deviation is known.<br><br>
<b>Answer with explanations</b><br>In a t-test we estimate the population standard deviation from the sample. That estimate is subject to sampling variability, which we need to take into account when performing inference about the population mean. That is where the degrees of freedom come into play (the larger the df, the lower the sampling variability of that estimate, and the stronger our inference). In a Z-test, the standard deviation is taken as known, so there is no such sampling variability to account for. Hence, there are no degrees of freedom (or, equivalently, they are effectively infinite).</font>

* Determine the critical value.<br>
<font color='blue'>For a two-tailed z-test with $\alpha = .05$, the critical value is 1.96. This means that we reject H0 if we find a sample mean that deviates by more than 1.96 standard deviations from the hypothesized population mean. Note that <i>standard deviation</i> here refers to the <i>distribution of the sample mean</i>, not the sample itself.</font>
```python
import scipy.stats as stats
z_crit = stats.norm.ppf(.975)   # this gives 1.96
```
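
The remark that the degrees of freedom of a Z-test are "essentially infinite" can be illustrated numerically: the two-tailed t critical value shrinks toward the z critical value as df grows. The sketch below uses a handful of df values chosen only for illustration.

```python
import scipy.stats as stats

z_crit = stats.norm.ppf(.975)
dfs = [5, 19, 49, 1000]
# two-tailed critical values at alpha = .05 for increasing df
t_crits = [stats.t.ppf(.975, df) for df in dfs]

for df, t_crit in zip(dfs, t_crits):
    print(f'df={df:>4}: t_crit={t_crit:.3f}  (z_crit={z_crit:.3f})')
```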
* Compare and interpret the results of the test to the critical value.<br>
<font color='blue'><b>Short answer</b><br>The z-value equals $\frac{179.84-180}{2/\sqrt{50}} = -0.57$. Its absolute value is smaller than the critical value, so we do not reject H0.</font><br><br>
<font color='blue'><b>Answer with explanations</b><br>What we wish to answer here is the following question:</font><br><br>
<font color='blue'><i>Does the mean of this sample deviate by more than 1.96 standard deviations from the hypothesized population mean?</i><br><br>To answer this, we need to know the distribution of the sample mean. It can be shown that the sample mean itself follows a normal distribution (exactly so if the population is normal, and approximately so for large samples by the CLT). Assuming that the standard deviation of the population is $\sigma=2$, the standard deviation of that distribution is $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$. In the present case, we find $\sigma_{\bar{x}} = \frac{2}{\sqrt{50}} = 0.283$. The z-value thus equals $\frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} = \frac{179.84 - 180.00}{0.283} = -0.57$. This is well within the critical values of -1.96 and 1.96, which means that the current data do not give us a reason to reject the hypothesis that they come from a population with a mean equal to 180 cm.</font>
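
The computation for this exercise can be sketched in code as follows. The variable names are mine, and σ = 2 is taken from the exercise text as a known quantity.

```python
import numpy as np
import scipy.stats as stats

men_height = np.array([177, 180, 182, 179, 178, 181, 176, 183, 179, 180,
                       178, 181, 177, 178, 180, 179, 182, 180, 183, 181,
                       179, 177, 180, 181, 178, 180, 182, 179, 177, 182,
                       178, 181, 183, 179, 180, 181, 183, 178, 177, 181,
                       179, 182, 180, 181, 178, 180, 179, 181, 183, 179])
mu_0 = 180   # hypothesized population mean
sigma = 2    # population standard deviation, given in the exercise

se = sigma / np.sqrt(men_height.size)   # standard deviation of the sample mean
z = (men_height.mean() - mu_0) / se
p = 2 * stats.norm.sf(abs(z))           # two-tailed p-value

print(f'mean={men_height.mean():.2f}')
print(f'z={z:.3f}')
print(f'p={p:.3f}')
```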


**Exercise 3**

Suppose we want to test whether the mean weight of a population of cats is different from 4 kg. We randomly sample 50 cats from the population and measure their weights, getting the following data:

Cats_weights = [3.9, 4.2, 4.5, 4.1, 4.3, 3.8, 4.6, 4.2, 3.7, 4.3, 3.9, 4.0, 4.1, 4.5, 4.2, 3.8, 3.9, 4.3, 4.1, 4.0, 4.4, 4.2, 4.1, 4.6, 4.4, 4.2, 4.1, 4.3, 4.0, 4.4, 4.3, 3.8, 4.1, 4.5, 4.2, 4.3, 4.0, 4.1, 4.2, 3.9, 4.3, 3.7, 4.1, 4.5, 4.2, 4.0, 4.2, 4.4, 4.1, 4.5]

* Perform one sample two tailed Z-Test to determine whether the mean weight of the sampled cats is significantly different from 4 kg.<br>
<font color='blue'>This is not possible, because the standard deviation of the population is not given.</font>
* State the null and alternative hypotheses.<br>
<font color='blue'>H0: $\mu_{pop} = 4$</font><br>
<font color='blue'>H1: $\mu_{pop} \neq 4$</font>
* Choose a significance level, $\alpha$<br>
<font color='blue'>See answer to this question in Exercise 1</font><br>
* Assuming that the standard deviation is equal to the sample mean, calculate the z-score using the formula $Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$<br>
<font color='blue'>I assume that what was meant here is:<br><br><i>Assuming that the standard deviation **of the population** is equal to the sample **standard deviation**, calculate the z-score using the formula $Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$</i>.<br><br>
The answer then is $Z = \frac{4.17 - 4}{0.230 / \sqrt{50}} = 5.23$, where 0.230 is the sample standard deviation computed with $n-1$ in the denominator.</font>
* Look up the critical z-value at the chosen significance level (α) using a z-table.<br>
<font color='blue'>If we choose $\alpha=.05$, then the critical value is 1.96</font><br>    
* Compare the calculated z-score to the critical z-values. If the calculated z-score falls outside the range between the critical z-values, we reject the null hypothesis in favor of the alternative hypothesis.<br>
<font color='blue'>The z value is larger than the critical value, which means that we have sufficient evidence to reject H0 at the chosen $\alpha$ level.</font><br>    



In [2]:
y = np.array([3.9, 4.2, 4.5, 4.1, 4.3, 3.8, 4.6, 4.2, 3.7, 4.3, 3.9, 4.0, 4.1, 4.5, 4.2, 3.8, 3.9, 4.3, 4.1, 4.0, 4.4, 4.2, 4.1, 4.6, 4.4, 4.2, 4.1, 4.3, 4.0, 4.4, 4.3, 3.8, 4.1, 4.5, 4.2, 4.3, 4.0, 4.1, 4.2, 3.9, 4.3, 3.7, 4.1, 4.5, 4.2, 4.0, 4.2, 4.4, 4.1, 4.5])
# z-score, using the sample standard deviation as a stand-in for the population value
z = (y.mean() - 4) / (y.std(ddof=1) / np.sqrt(y.size))
# two-tailed p-value, matching the two-sided alternative hypothesis
p = 2 * stats.norm.sf(abs(z))
print(f'z={z:.3f}')
print(f'p={p:.3f}')


z=5.234
p=0.000
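
Since the population standard deviation is not actually known here, a one-sample t-test would arguably be the more defensible choice. The sketch below, which uses SciPy's `ttest_1samp`, shows that the test statistic is the same number as the z-score above, while the p-value is now taken from a t distribution with 49 degrees of freedom. The variable names are mine.

```python
import numpy as np
import scipy.stats as stats

cats_weights = np.array([3.9, 4.2, 4.5, 4.1, 4.3, 3.8, 4.6, 4.2, 3.7, 4.3,
                         3.9, 4.0, 4.1, 4.5, 4.2, 3.8, 3.9, 4.3, 4.1, 4.0,
                         4.4, 4.2, 4.1, 4.6, 4.4, 4.2, 4.1, 4.3, 4.0, 4.4,
                         4.3, 3.8, 4.1, 4.5, 4.2, 4.3, 4.0, 4.1, 4.2, 3.9,
                         4.3, 3.7, 4.1, 4.5, 4.2, 4.0, 4.2, 4.4, 4.1, 4.5])

# One-sample, two-tailed t-test against 4 kg (df = n - 1 = 49)
t_stat, p_t = stats.ttest_1samp(cats_weights, 4)

# Same statistic referred to the standard normal distribution instead
p_z = 2 * stats.norm.sf(abs(t_stat))

print(f't={t_stat:.3f}')
print(f'p (t, df=49) = {p_t:.2e}')
print(f'p (normal)   = {p_z:.2e}')
```

With 50 observations the two p-values are both tiny, so the conclusion does not change.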


# Side note to Iljas in case he reads this

In your lectures you talked a few times about 'accepting the null hypothesis' when $p > .05$. 

However, while tempting, it is incorrect to interpret a high p value as evidence in favor of H0. 

The p value can only be used to quantify evidence *against* H0, never to argue *in favor* of it. 

The reason is that a high p value can occur for multiple reasons:
1. H0 is true
2. H0 is false, but you have too little data to detect this

We cannot distinguish these cases. It is intuitive to think that we could compute a proper sample size to exclude the second option (so that a high p means evidence in support of H0 when using that sample size), but for that we would need to know the 'effect size' in case H0 is false - and we don't know that.
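
The second case can be made visible with a small simulation. This is only a sketch under made-up assumptions: a hypothetical population with true mean 0.2 and standard deviation 1, so that H0: $\mu = 0$ is false by construction, and two arbitrary sample sizes.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
true_mean, sd = 0.2, 1.0         # hypothetical population; H0 (mean = 0) is false
n_sim = 2000                     # number of simulated experiments per sample size

frac_high = {}
for n in [10, 1000]:
    # each row is one simulated experiment with n observations
    samples = rng.normal(true_mean, sd, size=(n_sim, n))
    _, pvalues = stats.ttest_1samp(samples, 0, axis=1)
    frac_high[n] = np.mean(pvalues > .05)
    print(f'n={n:>4}: fraction of experiments with p > .05 = {frac_high[n]:.2f}')
```

With the small sample size, most experiments yield $p > .05$ even though H0 is false in every one of them.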

To give a more concrete demonstration of why a high p value is not evidence in support of H0, consider the following situation:
* ```x = [-30, 30]```
* H0: $\mu = 100$

If $p > .05$ meant "evidence in favor of H0", then we should surely not find a high p value here, right?

The p value turns out to be 0.19 for this example. Not because there is strong evidence that H0 is true, but purely because we have too little data to disprove it.

In [3]:
data = [-30, 30]
test_value = 100

_, pvalue = stats.ttest_1samp(data, test_value)

pvalue.round(2)

0.19

In [9]:
salary_women = [32000, 17000]  # 32k, 17k
salary_men = [42000, 91000] # 42k, 91k

_, pvalue = stats.ttest_ind(salary_women, salary_men)

pvalue.round(2)

0.24