# Data Analysis

## Why are statistical significance tests useful?

* They provide a formalized framework for comparing and evaluating data
* They enable us to evaluate whether perceived effects in our dataset reflect differences across the whole population

## Normal Distribution (Gaussian Distribution, Bell Curve)

### Two parameters associated:

* Mean $$\mu$$
* Standard deviation $$\sigma$$


These two parameters plug into the following probability density function, which describes a Gaussian distribution:

![title](img/normal-d.jpg)

$$f(x) = \frac{1}{{\sqrt {2\pi \sigma^2} }}e^{ - \frac{{(x - \mu)^2}}{2\sigma^2}}$$

* The <b>expected value</b> of a variable described by a Gaussian distribution is the <b>mean</b> $$\mu$$, and its <b>variance</b> is the square of the <b>standard deviation</b>, $$\sigma^2$$.

* Normal distributions are also symmetric about their mean.
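
As a minimal sketch of the density function above, the following evaluates the formula directly with the standard library. The function name `gaussian_pdf` and the values of `mu` and `sigma` are illustrative assumptions, not part of the original notes.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate the Gaussian probability density at x (hypothetical helper)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)


# Illustrative parameters (assumed values)
mu, sigma = 10.0, 2.0

peak = gaussian_pdf(mu, mu, sigma)         # density at the mean
left = gaussian_pdf(mu - 1.5, mu, sigma)   # 1.5 below the mean
right = gaussian_pdf(mu + 1.5, mu, sigma)  # 1.5 above the mean

print(peak, left, right)
```

The density is highest at the mean, and points at equal distance on either side of the mean give the same value, which is the symmetry property noted above.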

## Statistical Significance Tests

## t-Test
One of the most common parametric tests that we can use to compare two sets of data.

* Aims at rejecting, or failing to reject, a <b>null hypothesis</b> (generally a statement that we are trying to disprove by running our test)

<b>TEST STATISTIC:</b> reduces the dataset to one number that helps us decide whether to reject the <b>null hypothesis</b>. When performing a t-Test, we compute a test statistic called <b>t</b>.

$$ tTest \rightarrow t $$

Depending on the value of the test statistic t, we can decide whether or not to reject the null hypothesis.

### Two Sample t-Test
A few different versions depending on assumptions:
* Equal sample size?
* Same variance?

$$t = \frac{\mu_1 - \mu_2}{{\sqrt {\frac {\sigma_1^2}{N_1} + \frac {\sigma_2^2}{N_2} }}}$$

Where:
* Sample mean for the i'th sample: $$\mu_i$$
* Sample variance for the i'th sample: $$\sigma_i^2$$
* Sample size for the i'th sample: $$N_i$$

To estimate the number of degrees of freedom:
$$\nu \approx \frac{(\frac{\sigma_1^2}{N_1}+\frac{\sigma_2^2}{N_2})^2}{\frac{\sigma_1^4}{N_1^2 \nu_1}+\frac{\sigma_2^4}{N_2^2 \nu_2}}$$

Where:

$$\nu_i = N_i - 1$$

is the degrees of freedom associated with the i'th variance estimate.
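
The two formulas above can be sketched in plain Python as follows. The helper name `welch_t_and_df` and the two small samples are illustrative assumptions. `statistics.variance` computes the sample variance (dividing by N - 1), which is what the formulas expect.

```python
import math
import statistics


def welch_t_and_df(sample_1, sample_2):
    """Return (t, nu) for two samples without assuming equal variances.

    Hypothetical helper that mirrors the two formulas above term by term.
    """
    n1, n2 = len(sample_1), len(sample_2)
    mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
    var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)

    # sigma_i^2 / N_i for each sample
    term_1, term_2 = var_1 / n1, var_2 / n2

    t = (mean_1 - mean_2) / math.sqrt(term_1 + term_2)
    nu = (term_1 + term_2) ** 2 / (
        term_1 ** 2 / (n1 - 1) + term_2 ** 2 / (n2 - 1)
    )
    return t, nu


# Two small illustrative samples
list_1 = [1, 2, 3, 4, 5, 6]
list_2 = [5, 4, 3, 2, 6, 7, 8, 9, 10]

t, nu = welch_t_and_df(list_1, list_2)
print(t, nu)
```

Note that the estimated number of degrees of freedom is generally not an integer.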

With these two values we can estimate the P value, which is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true (the P value IS NOT the probability that the null hypothesis is true given the data).

* P-value: probability of obtaining a test statistic <b>at least</b> as extreme as ours if the null hypothesis were true
* Choose a threshold Pcritical (commonly 0.05) before running the test -> if P < Pcritical: REJECT NULL HYPOTHESIS else CANNOT REJECT NULL HYPOTHESIS
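
A minimal sketch of this decision rule, assuming SciPy is available: the two-sided P value is twice the upper-tail area of Student's t distribution beyond |t|. The values of `t_value`, `nu` and `p_critical` are illustrative assumptions, and `two_sided_p_value` is a hypothetical helper name.

```python
from scipy import stats


def two_sided_p_value(t, nu):
    """Two-sided P value for statistic t with nu degrees of freedom."""
    # sf is the survival function, 1 - CDF: the area in the upper tail
    return 2.0 * stats.t.sf(abs(t), nu)


# Illustrative inputs (assumed values)
t_value = -2.1004
nu = 12.96
p_critical = 0.05

p_value = two_sided_p_value(t_value, nu)

if p_value < p_critical:
    decision = "reject the null hypothesis"
else:
    decision = "cannot reject the null hypothesis"

print(p_value, decision)
```

With these inputs the P value lands slightly above 0.05, so the null hypothesis cannot be rejected at that threshold.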

### t-Test in Python: SciPy

```python
import scipy.stats

# two sets of data
list_1 = [1, 2, 3, 4, 5, 6]
list_2 = [5, 4, 3, 2, 6, 7, 8, 9, 10]

# equal_var=False runs Welch's t-test (no equal-variance assumption);
# the test is two-sided by default
scipy.stats.ttest_ind(list_1, list_2, equal_var=False)
# returns a tuple-like result: (t-value, p-value for a two-tailed test)
```

```
Ttest_indResult(statistic=-2.1004201260420148, pvalue=0.05583466515003168)
```

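#### For a one-sided test: halve the two-sided p-value (one tail of the distribution)

Reject the null hypothesis in favor of "greater than" when:

$$\frac{P}{2} < P_{critical} \text{ and } t > 0$$

Reject the null hypothesis in favor of "less than" when:

$$\frac{P}{2} < P_{critical} \text{ and } t < 0$$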
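
The one-sided rule can be sketched as a small helper that takes the t statistic and the two-sided P value. The name `one_sided_p_value` and its `alternative` argument are hypothetical. The `1 - P/2` branch is an addition that relies on the symmetry of the t distribution: when the sign of t contradicts the chosen direction, the one-sided P value is the complement of the halved value.

```python
def one_sided_p_value(t, p_two_sided, alternative):
    """Convert a two-sided P value into a one-sided one (hypothetical helper).

    alternative: "greater" (first mean larger) or "less" (first mean smaller).
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    sign_matches = t > 0 if alternative == "greater" else t < 0
    half = p_two_sided / 2.0
    # If t points the wrong way, the one-sided P value is the complement
    return half if sign_matches else 1.0 - half


# Values from a two-sided test on two small samples
t_value = -2.1004201260420148
p_two_sided = 0.05583466515003168
p_critical = 0.05

p_less = one_sided_p_value(t_value, p_two_sided, "less")
p_greater = one_sided_p_value(t_value, p_two_sided, "greater")

print(p_less < p_critical)     # t < 0 and P/2 < Pcritical
print(p_greater < p_critical)  # t has the wrong sign for "greater"
```

Here the two-sided test does not reject at 0.05, but the one-sided "less than" test does, which is why the direction should be chosen before looking at the data.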