### Imports

import numpy as np
import scipy.stats as stats

### Parts to a probability distribution
- A probability distribution can be defined over either discrete or continuous values.
________________________________________________________________________
||Discrete Variable|Continuous Variable|
|---|---|---|
|**Def**|Represent counts|Represent measurable amounts|
|**Ex**|Number of calls received in a call center per hour|Height, employee salaries, temperature, etc.|

### Types of Distributions
- Consider the situation at hand and determine the appropriate distribution type.
- Create a distribution object from one of the following distributions, then generate random values from it with `.rvs()`.
________________________________________________________________________
||Uniform|Normal|Binomial|Poisson|
|----|---|---|---|---|
|**Use Case**|Models outcomes that are all equally likely|Models a continuous random variable where outcomes become less likely the further they are from the mean|Models the number of successes in a fixed number of trials|Models the number of events occurring over a time interval|
|**Variable Type**|Discrete|Continuous|Discrete|Discrete|
|**Example**|Rolling a die, flipping a coin, drawing a card from a deck|A store's daily sales|Number of heads you would expect after flipping a coin multiple times|The number of emails sent by a mail server in a day|
|**Function**|`stats.randint()`|`stats.norm()`|`stats.binom()`|`stats.poisson()`|
|**Input**|Range of values (`low` inclusive, `high` exclusive)|Mean and standard deviation|Number of trials and a probability of success|Mean rate|
|**Output**|Distribution object|Distribution object|Distribution object|Distribution object|
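
The table above can be sketched in code as follows. This is a minimal illustration, not a prescribed workflow: the parameter values (mean 100, rate 4, and so on), the variable names, and the seeds are assumptions chosen only to make the example concrete.

```python
import scipy.stats as stats

# Each call returns a "frozen" distribution object with fixed parameters.
die = stats.randint(1, 7)           # discrete uniform on 1..6 (upper bound is exclusive)
daily_sales = stats.norm(100, 15)   # hypothetical mean 100, standard deviation 15
coin_flips = stats.binom(10, 0.5)   # 10 trials, probability of success 0.5
emails = stats.poisson(4)           # hypothetical mean rate of 4 events per interval

# Draw random values from each distribution object with .rvs().
# random_state is fixed only so that reruns give the same numbers.
die_rolls = die.rvs(size=1000, random_state=42)
sales_sample = daily_sales.rvs(size=1000, random_state=42)
heads_counts = coin_flips.rvs(size=1000, random_state=42)
email_counts = emails.rvs(size=1000, random_state=42)

print(die_rolls[:10])
print(sales_sample[:3])
```

Note that `stats.randint(low, high)` excludes `high`, so a six-sided die needs `high=7`.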

### Distribution Methods

- First make a distribution object from scipy (uniform, normal, binomial, poisson...), then call the method that best fits your scenario.
________________________________________________________________________
||probability mass function|probability density function|cumulative distribution function |percent point function|survival function|inverse survival function|
|----|---|---|---|---|----|----|
|**Primary Check**|You have a given value and want the probability of exactly that value|You have a given value and want the probability density at that value (for a continuous variable, the probability of any single exact value is 0)|Probability our random variable takes on a value `<=` a given point|Probability our random variable takes on a value `<=` a given point|Probability our random variable takes on a value `>` a given point|Probability our random variable takes on a value `>` a given point|
|**Secondary Check**|Is your distribution `made of` continuous or discrete random variables?|Is your distribution `made of` continuous or discrete random variables?|Do you `have` the value or the probability?|Do you `have` the value or the probability?|Do you `have` the value or the probability?|Do you `have` the value or the probability?|
|**Variable Type**|Discrete|Continuous|Value|Probability|Value|Probability|
|**Function**|`.pmf()`|`.pdf()`|`.cdf()`|`.ppf()`|`.sf()`|`.isf()`|
|**Input**|Integer value|Float value|Any numeric value|Probability between 0 and 1|Any numeric value|Probability between 0 and 1|
|**Output**|Probability|Probability density|Probability|Value|Probability|Value|
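
As a rough sketch of how these six methods relate, the snippet below calls each one on a frozen distribution object. The height distribution (mean 170, standard deviation 10) and all variable names are hypothetical, picked only for illustration.

```python
import scipy.stats as stats

die = stats.randint(1, 7)        # discrete: fair six-sided die
heights = stats.norm(170, 10)    # continuous: hypothetical heights in cm

p_three = die.pmf(3)             # probability of rolling exactly a 3
density = heights.pdf(170)       # density at 170 (a density, not a probability)

p_below = heights.cdf(180)       # have a value, want P(X <= 180)
cutoff_90 = heights.ppf(0.90)    # have a probability, want x with P(X <= x) = 0.90

p_above = heights.sf(180)        # have a value, want P(X > 180)
top_10 = heights.isf(0.10)       # have a probability, want x with P(X > x) = 0.10

print(p_three, p_below, p_above)
print(cutoff_90, top_10)
```

`.cdf()` and `.sf()` are complements, so `p_below + p_above` equals 1, and `.ppf(0.90)` and `.isf(0.10)` return the same cutoff.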

### Comparing Means
- **Parametric** vs **Non-Parametric**: parametric tests assume that the data follow a particular distribution. For `t-tests` and `ANOVA`, that distribution is the normal distribution.
- **Central Limit Theorem**: If you have a population (regardless of distribution) with mean `μ` and take sufficiently large random samples (usually N > 30 for each independent sample) from the population, with replacement, then the sample means will be approximately normally distributed.
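
The Central Limit Theorem can be illustrated with a small simulation. This is only a sketch: the exponential population, the sample size of 50, and the number of repetitions are assumptions chosen to show a clearly skewed population producing roughly symmetric sample means.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # seed fixed only for reproducibility

# A strongly right-skewed population (exponential), far from normal.
population = rng.exponential(scale=2.0, size=100_000)

# Take 5,000 random samples of size 50, with replacement, and average each one.
samples = rng.choice(population, size=(5000, 50), replace=True)
sample_means = samples.mean(axis=1)

print("population mean:      ", population.mean())
print("mean of sample means: ", sample_means.mean())
print("population skewness:  ", stats.skew(population))
print("sample-mean skewness: ", stats.skew(sample_means))
```

The sample means cluster around the population mean and are much less skewed than the population itself, which is why a t-test can still be reasonable on large samples of non-normal data.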

**When do we compare means**

We compare means when answering the following type of question:
- Are the salaries of the marketing department higher than the company average?
_________________________________________________________________________
|steps||One Sample t-test|Independent t-test (Two sample)|ANOVA|
|----|----|----|----|----|
|**1**|**Goal**|Compare observed mean to theoretical mean|Compare `two` observed means|Comparing `more than` two observed means|
||**Plot distribution from random numbers**|`stats.norm()` then `.rvs()` then `plot`|`stats.norm()` then `.rvs()` then `plot`|`stats.norm()` then `.rvs()` then `plot`|
|**2**|**Null hypothesis $H_0$**|$\mu_{obs} = \mu_{th}$|$\mu_{a} = \mu_{b}$|$\mu_{a} = \mu_{b} = \dots = \mu_{n}$|
||**Alternative hypothesis $H_a$**|$\mu_{obs} \neq \mu_{th}$|$\mu_{a} \neq \mu_{b}$|At least one group mean differs from the others|
|**3**|**Significance level (`Alpha`)**|0.05|0.05|0.05|
|**4**|**Verify Assumptions**|`Normal` distribution, or `>= 30` observations.|`Independent`, `normally` distributed and have `equal variance`.|`Independent`, `normally` distributed and have `equal variance`.|
||**Verify variance equality**||`stats.levene()`|`stats.levene()`|
||||Variances are equal if $p_{value}$ > `alpha`|Variances are equal if $p_{value}$ > `alpha`|
|**5**|**If distribution is not normal**|`stats.wilcoxon()`|`stats.mannwhitneyu()`|`stats.kruskal()`|
||**Test statistics for normal distribution**|`stats.ttest_1samp()`|`stats.ttest_ind()`|`stats.f_oneway()`|
||**Input**|observed (`array-like`) and theoretical (`mean`)|2 `array-like` samples|`Multiple` array-like samples|
||**Output**|`t-statistic` and `p_value`|`t-statistic` and `p_value`|`F-statistic` and `p_value`|
|**6**|**Conclude**|Reject $H_0$ if $p_{value} < \alpha$ (one-tailed: also check that the sign of $t$ matches $H_a$)|Reject $H_0$ if $p_{value} < \alpha$ (one-tailed: also check that the sign of $t$ matches $H_a$)|Reject $H_0$ if $p_{value} < \alpha$|
|||else `fail` to reject $H_0$|else `fail` to reject $H_0$|else `fail` to reject $H_0$|
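
The steps in the table can be sketched end to end for the two-sample and ANOVA columns. The three groups below are simulated with `stats.norm().rvs()`, and their means, standard deviation, sizes, and seeds are assumptions made up for the example; with real data you would load your own samples instead.

```python
import scipy.stats as stats

alpha = 0.05  # step 3: significance level

# Hypothetical samples drawn from normal distributions (step 1).
group_a = stats.norm(50, 5).rvs(size=40, random_state=1)
group_b = stats.norm(60, 5).rvs(size=40, random_state=2)
group_c = stats.norm(55, 5).rvs(size=40, random_state=3)

# Step 4: check equality of variances with Levene's test.
levene_stat, levene_p = stats.levene(group_a, group_b)
equal_var = bool(levene_p > alpha)  # variances treated as equal if p > alpha

# Step 5: independent two-sample t-test (H0: mean of a == mean of b).
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=equal_var)

# Step 6: conclude (two-tailed).
if p_value < alpha:
    print("Reject H0: the two group means differ")
else:
    print("Fail to reject H0")

# ANOVA for more than two groups returns an F-statistic and a p-value.
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, anova_p)
```

Passing the Levene result into `equal_var` is one possible design choice; it means the t-test falls back to the unequal-variance version whenever the variance check fails.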

**Important Note**: For one-tailed questions, compare $p_{value}$ / 2 against `alpha` in **step 6**, because SciPy's t-tests return a two-tailed p-value by default.
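
A one-tailed version of the marketing salary question might look like the sketch below. The salary figures, sample size, and seed are invented for illustration and do not come from any real data set.

```python
import scipy.stats as stats

alpha = 0.05
company_mean = 55_000  # hypothetical theoretical mean

# Hypothetical marketing salaries, simulated for the example.
marketing = stats.norm(62_000, 8_000).rvs(size=50, random_state=7)

# H0: marketing mean == company mean
# Ha: marketing mean >  company mean (one-tailed)
t_stat, p_two_tailed = stats.ttest_1samp(marketing, company_mean)

# Halve the two-tailed p-value and require t > 0 for the "greater than" direction.
p_one_tailed = p_two_tailed / 2
reject_null = bool(p_one_tailed < alpha and t_stat > 0)

print(t_stat, p_one_tailed, reject_null)
```

For a "less than" question the sign check flips to `t_stat < 0`. Recent SciPy versions also accept an `alternative` argument on the t-test functions, which returns the one-tailed p-value directly.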

![image.png](attachment:image.png)

**More Notes**:
- If assumptions can't be met, the equivalent non-parametric test can be used.   
- The normal distribution assumption can be met by having a large enough sample (due to the Central Limit Theorem), or by rescaling the data with a Gaussian scaler.   
- If the equal variance assumption is not met, the `equal_var` argument of `stats.ttest_ind()` can be set to `False` (Welch's t-test). 
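
The fallbacks mentioned in these notes could be called as in the following sketch. The skewed "waiting time" samples, the reference value of 2.0, and the variable names are assumptions for illustration only.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)  # seed fixed only for reproducibility

# Hypothetical skewed (non-normal) samples.
wait_a = rng.exponential(scale=2.0, size=25)
wait_b = rng.exponential(scale=3.0, size=25)
wait_c = rng.exponential(scale=4.0, size=25)

# One sample: Wilcoxon signed-rank test on differences from a reference value.
w_stat, w_p = stats.wilcoxon(wait_a - 2.0)

# Two independent samples: Mann-Whitney U test.
u_stat, u_p = stats.mannwhitneyu(wait_a, wait_b, alternative="two-sided")

# More than two samples: Kruskal-Wallis H test.
h_stat, h_p = stats.kruskal(wait_a, wait_b, wait_c)

# Unequal variances but otherwise suitable for a t-test: Welch's t-test.
welch_t, welch_p = stats.ttest_ind(wait_a, wait_b, equal_var=False)

print(w_p, u_p, h_p, welch_p)
```

Each test returns a statistic and a p-value, so the decision rule from step 6 (compare the p-value with `alpha`) carries over unchanged.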

add later:

sum of all probabilities in your sample space = 1
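
A quick way to check this for a discrete distribution is to add up `.pmf()` over every possible outcome. The two distributions below are just convenient examples.

```python
import numpy as np
import scipy.stats as stats

# Fair die: outcomes 1..6, each with probability 1/6.
die_total = stats.randint(1, 7).pmf(np.arange(1, 7)).sum()

# 10 coin flips: possible numbers of heads are 0..10.
binom_total = stats.binom(10, 0.5).pmf(np.arange(0, 11)).sum()

print(die_total, binom_total)
```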