# Statistics 

## This notebook is in partial fulfillment of a HDip in Data Analytics in Computing

### Author: Katie O'Brien 
***
***


In [1]:
from scipy.stats import fisher_exact
import numpy as np
import pandas as pd

# Week 1

*Exercise 1*

*Use scipy's version of Fisher's exact test to simulate the Lady Tasting Tea problem.*

## Lady Tasting Tea


Lady tasting tea is an experiment designed by the polymath Ronald Fisher and reported in his 1935 book "The Design of Experiments".[1]

![](Youngronaldfisher2.jpeg)

This experiment is the original exposition of his notion of a null hypothesis, i.e. a statement that is assumed to be possibly true and that researchers work to reject, nullify or disprove. The concept of the null hypothesis is a fundamental part of the scientific process.[2]

In his book he describes a null hypothesis, as something which is "never proved or established, but is possibly disproved, in the course of experimentation".

A colleague of Fisher's, Muriel Bristol, claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher was extremely sceptical of this proclamation, and he and a third colleague, William Roach, put together an experiment to test her ability.

It was this simple experiment that led Fisher to do important work in the design of statistically valid experiments based on the statistical significance of experimental results. He developed Fisher's exact test to assess the probabilities and statistical significance of experiments.

### The experiment

Fisher proposed to give her eight cups, four of each variety, in random order. One could then ask what the probability was for her getting the specific number of cups she identified correct, but just by chance.

If there are 4 cups with tea poured first and 4 cups with milk poured first, what is the chance that she would pick the correct cups purely by guessing?

In null-hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct.[6]
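
As a minimal sketch of the counting argument behind the experiment, the block below enumerates how many ways four cups can be chosen from eight, and how many of those selections contain exactly k of the true milk-first cups. The variable names are illustrative and are not taken from Fisher's text; only the Python standard library is assumed.

```python
from math import comb

total_cups = 8   # cups presented to the subject
milk_first = 4   # cups that truly had milk poured first

# Number of distinct ways to choose which four cups to label "milk first".
n_selections = comb(total_cups, milk_first)

# Probability of each possible number of correct picks under pure guessing.
distribution = {}
for k in range(milk_first + 1):
    ways = comb(milk_first, k) * comb(total_cups - milk_first, milk_first - k)
    distribution[k] = ways / n_selections

p_all_correct = distribution[milk_first]
print(n_selections)    # 70
print(p_all_correct)   # 1/70, roughly 0.0143
```

Only one of the 70 possible selections is entirely correct, so under the null hypothesis a perfect score has probability 1/70.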




[1]: https://home.iitk.ac.in/~shalab/anova/DOE-RAF.pdf
[2]: https://www.statisticshowto.com/probability-and-statistics/null-hypothesis/
[3]: https://en.wikipedia.org/wiki/Ronald_Fisher
[5]: https://web.archive.org/web/20040710084649/http://www.dean.usma.edu/math/people/Sturdivant/images/MA376/dater/ladytea.pdf
[6]: https://en.wikipedia.org/wiki/P-value

We need to figure out the chance that the correct 4 cups will be picked from the 8. While there are a number of ways to determine this, we will use the "fisher_exact" function from SciPy.[1] This takes a 2 x 2 table as input, cross-classifying how each cup was actually prepared against what the subject said. Because there is only 1 subject and there are exactly 4 cups of each kind, a single cell fixes the other three. This makes the table below appear repetitious; however, the full layout is necessary for the function to run correctly.


[1]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html
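
One way to see where the 2 x 2 layout comes from is to tabulate the true preparation of each cup against the subject's guess. The sketch below does this with `pd.crosstab` for a hypothetical run in which every cup is identified correctly; the `truth` and `guess` lists are made up for illustration.

```python
import pandas as pd

# Hypothetical record of one run in which every cup is identified correctly.
truth = ["Tea First"] * 4 + ["Milk First"] * 4
guess = ["Tea First"] * 4 + ["Milk First"] * 4

# Rows are the true preparation, columns are the subject's guess.
table = pd.crosstab(
    pd.Series(truth, name="Actual"),
    pd.Series(guess, name="Guess"),
)
print(table)
```

Here the counts land on the main diagonal because rows and columns share the same ordering. A table with the column order reversed carries the same information, with the counts on the other diagonal.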

In [2]:
# Creating a dataframe in order to clearly display the data. 

ar=np.array([[0.0, 4.0],[4.0, 0.0]])    
df=pd.DataFrame(ar, columns=["Milk First", "Tea First"])
df.index=["Tea First", "Milk First"] 
df 

|            | Milk First | Tea First |
|------------|------------|-----------|
| Tea First  | 0.0        | 4.0       |
| Milk First | 4.0        | 0.0       |


It should be noted that while in the table above we have the subject picking the 4 "tea first" cups, it would produce exactly the same result if the subject picked the "milk first" cups instead.
We have the table layout of the results we need, now to find out the p-value:

In [3]:
oddsratio,pvalue =fisher_exact([[0, 4],[4, 0]])  
pvalue

0.028571428571428567

In scipy.stats we have to use a table of shape (2,2) in order to obtain a p-value for the result. The input for the test above matches the table displayed. Because every row and column total is fixed at 4, one cell determines the other three, so the table looks duplicated; however, the function needs the full layout.

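By default, "fisher_exact" performs a two-sided test: the p-value counts every table that is at least as unlikely as the one observed. Here that includes both the table where all eight cups are classified correctly and the table where all eight are classified wrongly. By process of elimination, when the subject is picking the "tea first" cups, they are also **not** picking the "milk first" cups, so only one of those two tables counts as a success. Because the two tables are equally likely under the null hypothesis, we can get the one-sided result from the "fisher_exact" test by dividing the p-value by 2.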


In [4]:
result = pvalue/2
result

0.014285714285714284
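
As an alternative to halving the p-value by hand, `fisher_exact` accepts an `alternative` argument that requests a one-sided test directly. The sketch below assumes the same table layout as the cell above, where the top-left cell counts misclassified cups, so the "all correct" outcome corresponds to `alternative="less"`. With the counts on the main diagonal instead, the matching choice would be `"greater"`.

```python
from scipy.stats import fisher_exact

table = [[0, 4], [4, 0]]  # same layout as the table used above

# Default two-sided test: counts both "all correct" and "all wrong".
_, p_two_sided = fisher_exact(table)

# One-sided test: probability that the top-left cell is as small as observed.
_, p_one_sided = fisher_exact(table, alternative="less")

print(p_two_sided)  # 2/70
print(p_one_sided)  # 1/70
```

The one-sided value agrees with dividing the two-sided p-value by 2, which works here only because the two extreme tables are equally likely.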

### Result:

As we can see above, the probability of the subject simply guessing the correct cups is 0.014285.... When multiplied by 100 to obtain the percentage, we can see that this is approximately 1.4%, or a 1 in 70 chance.


Fisher was willing to reject the null hypothesis if, and only if, the lady correctly picked all of the cups, effectively acknowledging the lady's ability at a 1.4% significance level (but without quantifying her ability). Fisher later discussed the benefits of more trials and repeated tests.

Author David Salsburg reports that a colleague of Fisher, H. Fairfield Smith, revealed that in the actual experiment the lady succeeded in identifying all eight cups correctly.




### *Exercise 2* 
 *Calculate the minimum number of cups of tea required to ensure the probability of randomly selecting the correct cups is less than or equal to 1%.*

We can modify the result above to determine how many cups that the subject would have to guess in order to ensure the probability is less than 1%

In [5]:
# Change the values within the square brackets to explore other designs
oddsratio, pvalue = fisher_exact([[5, 0], [0, 5]])
# Halve the two-sided p-value and express it as a percentage
print((pvalue / 2) * 100, "%")

0.3968253968253969 %


We can see that having the subject pick 5 correct cups from 10 would give a 0.397% chance of randomly selecting the correct cups, which is below the 1% threshold.
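
To check whether this really is the minimum, the search can be sketched with `math.comb` rather than by editing the table by hand. The helper names below are hypothetical. The sketch assumes the subject is told how many cups of each kind there are, so the chance of a perfect guess is one over the number of possible selections.

```python
from math import comb

def chance_all_correct(total_cups, milk_first):
    """Chance of guessing every cup when the split is known in advance."""
    return 1 / comb(total_cups, milk_first)

def smallest_design(threshold=0.01, balanced=True, max_cups=40):
    """Return (total cups, milk-first cups, chance) for the first design
    whose chance of a perfect guess is at or below the threshold."""
    for total in range(2, max_cups + 1):
        if balanced and total % 2:
            continue  # a balanced design needs an even number of cups
        splits = [total // 2] if balanced else range(1, total)
        # The most even split gives the largest number of possible selections.
        k = max(splits, key=lambda s: comb(total, s))
        chance = chance_all_correct(total, k)
        if chance <= threshold:
            return total, k, chance
    return None

print(smallest_design(balanced=True))   # equal numbers of each kind
print(smallest_design(balanced=False))  # unequal splits allowed
```

Under these assumptions, 10 cups is the smallest design with equal numbers of each kind (1 in 252). If an unequal split is allowed, 9 cups split 4 and 5 already falls below 1% (1 in 126).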

References:

- Fisher 1971, Chapter II. The Principles of Experimentation, Illustrated by a Psycho-physical Experiment, Section 8. The Null Hypothesis.
- Salsburg, D. (2002) The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century, W.H. Freeman / Owl Book. ISBN 0-8050-7134-2


***
***
***

## Week 2 

*Take the code from the Examples section of the scipy stats documentation for independent samples t-tests, add it to your own notebook and explain how it works using MarkDown cells and code comments. Improve it in any way you think it could be improved.*

Student's t-test was developed by William Sealy Gosset, a scientist working at the Guinness brewery in Dublin around the turn of the 20th century. He became interested in the problems around small sample sizes, as the brewery was sometimes limited to small samples of raw materials. One theory about the naming of the test is that company policy at the time was to use an alias when publishing. Gosset used the alias "Student", hence the name Student's t-test. The other theory is that Guinness wanted to run quality assurance tests on the raw materials used in production and did not want its competitors to know.

The t-distribution is part of a family of continuous probability distributions that arise when estimating the mean of a normally distributed population where the sample size is small and the standard deviation is unknown. This distribution plays a role in Student's t-test for assessing the statistical significance of the difference between two sample means, in the construction of confidence intervals for the difference between two population means, and in linear regression analysis.

### Types of T-tests

The most frequently used t-tests are one-sample and two-sample tests:

- A one-sample location test of whether the mean of a population has a value specified in a null hypothesis.

- A two-sample location test of the null hypothesis such that the means of two populations are equal. All such tests are usually called Student's t-tests, though strictly speaking that name should only be used if the variances of the two populations are also assumed to be equal; the form of the test used when this assumption is dropped is sometimes called Welch's t-test. These tests are often referred to as unpaired or independent samples t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping.
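
As a small illustration of the one-sample case, the sketch below tests whether a simulated sample is consistent with a population mean of 5 and compares SciPy's statistic with the textbook formula. The data, the seed and the hypothesised mean are arbitrary choices made for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable
sample = rng.normal(loc=5, scale=10, size=200)

# One-sample test of the null hypothesis that the population mean is 5.
one_sample = stats.ttest_1samp(sample, popmean=5)

# The same statistic by hand: (sample mean - hypothesised mean) / standard error.
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))
t_manual = (sample.mean() - 5) / standard_error

print(one_sample.statistic, t_manual)
```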


#### Paired samples
Paired samples t-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice (a "repeated measures" t-test).

A typical example of the repeated measures t-test would be where subjects are tested prior to a treatment, say for high blood pressure, and the same subjects are tested again after treatment with a blood-pressure-lowering medication. By comparing the same patient's numbers before and after treatment, we are effectively using each patient as their own control. That way the correct rejection of the null hypothesis (here: of no difference made by the treatment) can become much more likely.
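
The idea of each patient acting as their own control can be sketched with `scipy.stats.ttest_rel`. The blood pressure readings below are made up for illustration. The paired test is equivalent to a one-sample t-test on the per-patient differences, and on data like these it gives a smaller p-value than treating the two columns as independent samples.

```python
import numpy as np
from scipy import stats

# Made-up systolic blood pressure readings (mmHg) for eight patients.
before = np.array([152, 148, 160, 155, 149, 158, 162, 150])
after = np.array([145, 146, 151, 150, 147, 149, 155, 148])

# Paired test: compares each patient with themselves.
paired = stats.ttest_rel(before, after)

# Equivalent formulation: one-sample test on the differences against zero.
on_differences = stats.ttest_1samp(before - after, popmean=0)

# Ignoring the pairing treats the readings as two independent groups.
unpaired = stats.ttest_ind(before, after)

print(paired.pvalue, on_differences.pvalue, unpaired.pvalue)
```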

#### Independent (unpaired) samples
The independent samples t-test is used when two separate sets of independent and identically distributed samples are obtained, and one variable from each of the two populations is compared. For example, suppose we are evaluating the effect of a medical treatment, and we enroll 100 subjects into our study, then randomly assign 50 subjects to the treatment group and 50 subjects to the control group. In this case, we have two independent samples and would use the unpaired form of the t-test.

##### Scipy.stats.ttest_ind

This function calculates the t-test for the means of 2 independent samples of scores. By this we mean that the data in one sample is unrelated to the data in the other. Compare this to a paired samples t-test, which can compare before/after scores in intervention studies.
It is important that researchers and analysts understand the data that they are analysing, in order to pick the correct analysis tools for the job. 

In the independent sample t-test the test assumes that the populations have identical variances by default. The test is for the null hypothesis that 2 independent samples have identical average (expected) values.[1]

[1]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
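
To make the equal-variance assumption concrete, the sketch below computes the pooled-variance t-statistic and its two-sided p-value by hand and compares them with the default `ttest_ind` result. The simulated samples and seed are arbitrary; the formula is the standard pooled two-sample one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
x = rng.normal(loc=5, scale=10, size=50)
y = rng.normal(loc=5, scale=10, size=80)

n1, n2 = len(x), len(y)

# Pooled variance: a weighted average of the two sample variances.
pooled_var = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)

# t-statistic and two-sided p-value with n1 + n2 - 2 degrees of freedom.
t_manual = (x.mean() - y.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

result = stats.ttest_ind(x, y)  # equal_var=True by default
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```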

### Code

The following code is pulled directly from the examples on the scipy.stats website: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

In [6]:
# Importing required packages and selecting a random number generator from numpy
from scipy import stats
rng = np.random.default_rng()

In [7]:
# The samples are random variates (rvs) drawn from normal distributions.
# loc sets the mean and scale sets the standard deviation.
# The sample size is either 500 or 100, and random_state uses the NumPy generator created above.
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
rvs2 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
rvs3 = stats.norm.rvs(loc=5, scale=20, size=500, random_state=rng)
rvs4 = stats.norm.rvs(loc=5, scale=20, size=100, random_state=rng)
rvs5 = stats.norm.rvs(loc=8, scale=20, size=100, random_state=rng)

***
Test with sample with identical means:

In [8]:
# The t-test is used by simply using 2 of the samples above and calling the function 
stats.ttest_ind(rvs1, rvs2)

# The result is returned as follows: The t-value, and the p-value:

Ttest_indResult(statistic=0.5112799769225118, pvalue=0.6092681155724654)

We can also tell the function whether the 2 samples have equal population variances by using "equal_var". This takes a boolean value. If False, the function performs Welch's t-test instead of Student's t-test.

In [19]:
# equal_var takes a boolean value
# If True (default), perform a standard independent 2 sample test that assumes equal population variances
# If False, perform Welch’s t-test, which does not assume equal population variance
stats.ttest_ind(rvs1, rvs2, equal_var=False)

Ttest_indResult(statistic=0.5112799769225118, pvalue=0.609268181959482)

***

SciPy's ttest_ind underestimates the p-value for unequal variances when "equal_var" is left at its default of True. Note that rvs3 has a larger standard deviation than rvs1.

In [20]:
stats.ttest_ind(rvs1, rvs3)

Ttest_indResult(statistic=0.8637384634744444, pvalue=0.3879391124860613)

In [11]:
stats.ttest_ind(rvs1, rvs3, equal_var=False)

Ttest_indResult(statistic=0.8637384634744444, pvalue=0.3880243536985388)

***
When the two samples differ in size, the equal variance t-statistic (Student's t-test) is no longer equal to the unequal variance t-statistic (Welch's t-test).
This can be seen in the next 2 examples:

In [12]:
# Equal variance t-statistic
stats.ttest_ind(rvs1, rvs4)

Ttest_indResult(statistic=-0.06783076785459832, pvalue=0.945943023946996)

In [21]:
# Unequal variance t-statistic
stats.ttest_ind(rvs1, rvs4, equal_var=False)
# Note the differences in the t-value, and p-value

Ttest_indResult(statistic=-0.04157852895796947, pvalue=0.9669124049967042)
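
The reason the two statistics agree for equal sample sizes but not for unequal ones can be seen from their denominators. The sketch below writes both out by hand; the helper names `pooled_t` and `welch_t` and the simulated data are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

def pooled_t(x, y):
    """Student's t-statistic: one pooled variance for both samples."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled * (1 / n1 + 1 / n2))

def welch_t(x, y):
    """Welch's t-statistic: each sample keeps its own variance."""
    standard_error = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return (x.mean() - y.mean()) / standard_error

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable
a = rng.normal(5, 10, 500)
b_same_size = rng.normal(5, 20, 500)
b_smaller = rng.normal(5, 20, 100)

print(pooled_t(a, b_same_size), welch_t(a, b_same_size))  # identical
print(pooled_t(a, b_smaller), welch_t(a, b_smaller))      # different
```

With equal sample sizes the two denominators reduce to the same expression, so only the degrees of freedom (and therefore the p-value) differ slightly. With unequal sizes the pooled variance is dominated by the larger sample, and the statistics diverge.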

***
Running a T-test with different means, variance, and sample size:

In [22]:
stats.ttest_ind(rvs1, rvs5)

Ttest_indResult(statistic=-2.062958713514101, pvalue=0.03954770461634422)

In [23]:
# Again, with a different mean, sample size, and variance- this time telling the test the variances are unequal
stats.ttest_ind(rvs1, rvs5, equal_var=False)

Ttest_indResult(statistic=-1.3380428621013978, pvalue=0.18367316338174833)
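
A rough way to see why the choice of "equal_var" matters is to simulate many experiments in which the null hypothesis is true and count how often each test rejects it at the 5% level. The sketch below is a simulation under assumed settings (2000 repetitions, a fixed seed, and the same sizes and standard deviations as rvs1 and rvs4); it is not part of the SciPy documentation example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # fixed seed so the sketch is repeatable
n_sims = 2000

# Both populations have mean 5, so every rejection is a false positive.
large_sample = rng.normal(5, 10, size=(n_sims, 500))   # small variance, n=500
small_sample = rng.normal(5, 20, size=(n_sims, 100))   # large variance, n=100

# axis=1 runs one test per row, i.e. one test per simulated experiment.
student = stats.ttest_ind(large_sample, small_sample, axis=1)
welch = stats.ttest_ind(large_sample, small_sample, axis=1, equal_var=False)

student_rate = np.mean(student.pvalue < 0.05)
welch_rate = np.mean(welch.pvalue < 0.05)
print(student_rate, welch_rate)
```

In this setup the pooled test rejects far more often than the nominal 5%, while Welch's test stays close to it. That is consistent with the remark that the default test underestimates the p-value when the variances are unequal.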

#### Permutations

The "permutations" parameter controls how p-values are calculated. If 0 or None (default), use the t-distribution to calculate p-values. Otherwise, permutations is the number of random permutations that will be used to estimate p-values using a permutation test. If permutations equals or exceeds the number of distinct partitions of the pooled data, an exact test is performed instead (i.e. each distinct partition is used exactly once).

The t-test quantifies the difference between the arithmetic means of the two samples. The p-value quantifies the probability of observing more extreme values assuming the null hypothesis, that the samples are drawn from populations with the same population means, is true. A p-value larger than a chosen threshold (e.g. 5% or 1%) indicates that our observation is not so unlikely to have occurred by chance. Therefore, we do not reject the null hypothesis of equal population means. If the p-value is smaller than our threshold, then we have evidence against the null hypothesis of equal population means.

The permutation test can be computationally expensive and not necessarily more accurate than the analytical test, but it does not make strong assumptions about the shape of the underlying distribution.

When performing a permutation test, more permutations typically yield more accurate results.
Pass a np.random.Generator as random_state to control the random number generation (seed the generator if the result needs to be reproducible):

In [16]:
stats.ttest_ind(rvs1, rvs5, permutations=10000,
                random_state=rng)

Ttest_indResult(statistic=-2.062958713514101, pvalue=0.0401)
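
The idea behind a permutation test can be sketched in a few lines of NumPy. This is a simplified illustration, not SciPy's implementation: it uses the absolute difference in means as the test statistic, whereas `ttest_ind` permutes the t-statistic. The function name `permutation_pvalue` and its defaults are hypothetical.

```python
import numpy as np

def permutation_pvalue(x, y, n_resamples=2000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    observed = abs(x.mean() - y.mean())

    count = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are exchangeable,
        # so shuffle the pooled data and split it into two new groups.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(x)].mean() - shuffled[len(x):].mean())
        if diff >= observed:
            count += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (count + 1) / (n_resamples + 1)

rng = np.random.default_rng(3)
same = permutation_pvalue(rng.normal(0, 1, 40), rng.normal(0, 1, 40))
shifted = permutation_pvalue(rng.normal(0, 1, 40), rng.normal(2, 1, 40))
print(same, shifted)
```

Because the p-value is estimated from a finite number of shuffles, its resolution is limited by `n_resamples`, which is why more permutations tend to give a more precise estimate.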

Use of trimming is commonly referred to as the trimmed t-test. At times called Yuen’s t-test, this is an extension of Welch’s t-test, with the difference being the use of winsorized means in calculation of the variance and the trimmed sample size in calculation of the statistic.
Winsorization is the transformation of statistics by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers. Trimming is recommended if the underlying distribution is long-tailed or contaminated with outliers, as in sample "a" below.

If "trim" is nonzero, performs a trimmed (Yuen’s) t-test. Defines the fraction of elements to be trimmed from each end of the input samples. If 0 (default), no elements will be trimmed from either side. The number of trimmed elements from each tail is the floor of the trim times the number of elements. Valid range is [0, .5).


Take these two samples, one of which has an extreme tail.

In [17]:
# sample a has an extreme tail in this instance
a = (56, 128.6, 12, 123.8, 64.34, 78, 763.3)
b = (1.1, 2.9, 4.2)

Use the trim keyword to perform a trimmed (Yuen) t-test. For example, using 20% trimming, trim=.2, the test will reduce the impact of one (np.floor(trim*len(a))) element from each tail of sample a. It will have no effect on sample b because np.floor(trim*len(b)) is 0.

In [None]:
stats.ttest_ind(a, b, trim=.2)
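
To illustrate what trimming and winsorizing do to these two samples, the sketch below applies both by hand. The helper `trim_and_winsorize` is hypothetical and only shows the data transformations; it does not reproduce the full Yuen statistic computed by `ttest_ind`.

```python
import numpy as np

a = (56, 128.6, 12, 123.8, 64.34, 78, 763.3)
b = (1.1, 2.9, 4.2)
trim = 0.2

def trim_and_winsorize(sample, trim):
    """Return (elements cut per tail, trimmed sample, winsorized sample)."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    g = int(np.floor(trim * len(ordered)))  # elements affected in each tail
    trimmed = ordered[g:len(ordered) - g]   # drop g values from each end
    winsorized = ordered.copy()
    if g > 0:
        winsorized[:g] = ordered[g]         # pull low tail up to the next value
        winsorized[-g:] = ordered[-g - 1]   # pull high tail down to the next value
    return g, trimmed, winsorized

g_a, trimmed_a, winsorized_a = trim_and_winsorize(a, trim)
g_b, trimmed_b, winsorized_b = trim_and_winsorize(b, trim)

print(g_a, trimmed_a, winsorized_a)
print(g_b, trimmed_b, winsorized_b)
```

For sample a, one value is affected in each tail, so the extreme reading of 763.3 no longer dominates. Sample b is left unchanged because 20% of three elements rounds down to zero.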

References: 

https://www.scribbr.com/statistics/t-test/#:~:text=A%20t%2Dtest%20is%20a,does%20a%20t%2Dtest%20measure%3F