# **DS Cheat Sheet**
### Henry Chan

-----------

# Content
1. [Sampling Method](#1.-Sampling-Method)
1. Preprocessing
1. [Hypothesis Test](#3.-Hypothesis-Test)
1. Supervised ML
1. Unsupervised ML
1. Visualisation
1. [Statistics](#7.-Statistics)

-----------

# 1. Sampling Method

## For numpy array:
- `numpy.random.choice(x, size, replace=True)`
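
A minimal sketch of the effect of `replace`, assuming NumPy's newer `Generator` API (`rng.choice` takes the same `x, size, replace` arguments as `numpy.random.choice`). The array and the seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
x = np.arange(10)               # hypothetical data

# With replacement (the default): the same value may be drawn more than once
with_repl = rng.choice(x, size=5, replace=True)

# Without replacement: every drawn value is distinct, so size must be <= len(x)
without_repl = rng.choice(x, size=5, replace=False)

print(with_repl, without_repl)
```

Sampling with replacement is what bootstrapping relies on; sampling without replacement is the usual choice for drawing a subset.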

## For pandas DataFrame:
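- `df.sample(frac, replace=False, random_state)`
- stratified sampling (`df.sample` has no `stratified` argument): `df.groupby(col).sample(frac, random_state)`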
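
`DataFrame.sample` draws a simple random sample and has no stratification argument of its own. One common way to stratify is to sample within each group via `groupby(...).sample(...)`. The sketch below assumes a made-up DataFrame with a hypothetical `group` column as the stratum.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a"] * 8 + ["b"] * 4,  # hypothetical strata column
    "value": range(12),
})

# Simple random sample: 50% of all rows, ignoring the groups
simple = df.sample(frac=0.5, random_state=42)

# Stratified sample: 50% of the rows within each group
stratified = df.groupby("group").sample(frac=0.5, random_state=42)

print(stratified["group"].value_counts())
```

The stratified version keeps the group proportions of the original data, which the simple random sample only does on average.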

------------

# 3. Hypothesis Test

## Z-score
- used to compare a sample mean to a population value of interest when the population SD ($\sigma$) is known
- $Z = \frac{\bar{X} - \mu}{\text{standard error}} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
- p-value for the left tail: `scipy.stats.norm.cdf(z)`
- p-value for the right tail: `1 - scipy.stats.norm.cdf(z)`
- two-sided p-value: `2 * scipy.stats.norm.sf(abs(z))`
- if $p < \alpha$, reject the null hypothesis
- `scipy.stats.ttest_1samp(array, popmean, alternative)` is the t-test counterpart: it estimates the SD from the sample instead of using a known $\sigma$
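
A worked sketch of the z-score formula and the tail p-values listed above. All numbers (sample mean, `mu`, `sigma`, `n`) are hypothetical and only there to make the arithmetic concrete.

```python
import numpy as np
from scipy import stats

# Hypothetical numbers, for illustration only
sample_mean = 103.0  # observed sample mean
mu = 100.0           # population mean under the null hypothesis
sigma = 15.0         # known population SD
n = 50               # sample size
alpha = 0.05         # significance level

# Z = (sample mean - mu) / (sigma / sqrt(n))
z = (sample_mean - mu) / (sigma / np.sqrt(n))

p_left = stats.norm.cdf(z)               # H1: mean < mu
p_right = stats.norm.sf(z)               # H1: mean > mu (same as 1 - cdf)
p_two_sided = 2 * stats.norm.sf(abs(z))  # H1: mean != mu

print(f"z = {z:.3f}, right-tail p = {p_right:.4f}, two-sided p = {p_two_sided:.4f}")
print("reject H0" if p_two_sided < alpha else "fail to reject H0")
```

`norm.sf` is the survival function, `1 - cdf`, and tends to be numerically more accurate for large `z`.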

## Bootstrapping (Standard Error)

Bootstrap SE = standard deviation of the bootstrap replicates ≈ analytical standard error ($s / \sqrt{n}$)

```python
import numpy as np

np.random.seed(123)  # set the random seed
sample = np.random.normal(loc=0, scale=1, size=100)  # draw 100 samples from a normal distribution

boot_rep = []  # bootstrap replicate list
for i in range(1000):  # B=1000; B > 100 is the usual minimum
    rep = np.random.choice(sample, size=len(sample))  # resample with replacement
    boot_rep.append(np.mean(rep))  # take the mean of the resampled array

boot_se = np.std(boot_rep, ddof=1)
calc_se = np.std(sample, ddof=1) / np.sqrt(len(sample))
print(f"Bootstrap SE is {boot_se}. Calculated SE is {calc_se}")
```

Bootstrap SE is 0.11374654937038005. Calculated SE is 0.11339243375361954
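
The same comparison can be done without an explicit Python loop by drawing all resampling indices at once. This is a sketch under the assumption that the `Generator` API is used instead of `np.random.seed`, so the printed numbers will differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(123)  # Generator API; the seed is arbitrary
sample = rng.normal(loc=0, scale=1, size=100)
n, B = len(sample), 2000          # B bootstrap replicates

# One row of resampling indices per bootstrap replicate (with replacement)
idx = rng.integers(0, n, size=(B, n))
boot_means = sample[idx].mean(axis=1)

boot_se = boot_means.std(ddof=1)
analytic_se = sample.std(ddof=1) / np.sqrt(n)
print(f"Bootstrap SE = {boot_se:.4f}, analytical SE = {analytic_se:.4f}")
```

The index matrix costs `B * n` integers of memory, which is fine at this size but may need chunking for large samples.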


## t-test
- used when the population SD is unknown and has to be estimated from the sample
- *Independent t-test*: two groups from different populations
    - using `t, p = scipy.stats.ttest_ind(a, b, equal_var=True, alternative)`
    - set `equal_var=False` for Welch's t-test when the group variances differ
- *Paired t-test*: two related groups, e.g. measurements pre- and post-treatment
    - using `t, p = scipy.stats.ttest_rel(a, b, alternative)`
- for *alternative*: `alternative={'two-sided', 'less', 'greater'}`
- independent t-test: $dof = n_1 + n_2 - 2$ vs. paired t-test: $dof = n - 1$ ($n$ = number of pairs)
- as $n \to \infty$, the t-distribution converges to the normal distribution
- the left-tail p-value can also be obtained from `scipy.stats.t.cdf(t, df)`; two-sided: `2 * scipy.stats.t.sf(abs(t), df)`
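
A short sketch that runs both tests on the same made-up pre/post data and then reproduces the paired p-value by hand from the t-distribution with $dof = n - 1$. The data, seed, and effect size are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical data: 30 subjects measured before and after a treatment
pre = rng.normal(loc=50, scale=10, size=30)
post = pre + rng.normal(loc=2, scale=5, size=30)

# Independent t-test: treats the two arrays as unrelated groups
t_ind, p_ind = stats.ttest_ind(pre, post, equal_var=True, alternative="two-sided")

# Paired t-test: works on the per-subject differences
t_rel, p_rel = stats.ttest_rel(pre, post, alternative="two-sided")

# Two-sided p-value for the paired test, recomputed from the t-distribution
dof = len(pre) - 1
p_manual = 2 * stats.t.sf(abs(t_rel), df=dof)

print(f"independent: t = {t_ind:.3f}, p = {p_ind:.4f}")
print(f"paired:      t = {t_rel:.3f}, p = {p_rel:.4f}, manual p = {p_manual:.4f}")
```

For related measurements like these, the paired test is the appropriate one; the independent test is shown only for comparison.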

## ANOVA
- analysis of variance: tests whether the means of more than two groups differ
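
As a minimal sketch, a one-way ANOVA can be run with `scipy.stats.f_oneway`, which takes one array per group. The three groups below are simulated, and their means and sizes are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical measurements for three groups
g1 = rng.normal(loc=10.0, scale=2.0, size=25)
g2 = rng.normal(loc=10.5, scale=2.0, size=25)
g3 = rng.normal(loc=13.0, scale=2.0, size=25)

# H0: all group means are equal
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.3f}, p = {p_value:.4g}")
```

A small p-value only says that at least one group mean differs; it does not say which one.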


## Pingouin
- first `import pingouin`
- t-test: `pingouin.ttest(x, y, paired=False, alternative)`
- ANOVA: `pingouin.anova(data, dv, between)`

------------

# 7. Statistics

## Error
- *Type I Error*: the null hypothesis is true but we reject it (false positive; probability $\alpha$)
- *Type II Error*: the null hypothesis is false but we do not reject it (false negative; probability $\beta$)
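
Both error rates can be estimated by simulation. The sketch below repeats a one-sample t-test many times, first when the null hypothesis is true and then when it is false. The sample size, the number of simulations, and the true mean of 0.3 are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, n, n_sim = 0.05, 30, 2000

# H0 is true: the population mean really is 0
null_data = rng.normal(loc=0.0, scale=1.0, size=(n_sim, n))
p_null = stats.ttest_1samp(null_data, popmean=0.0, axis=1).pvalue
type_1_rate = np.mean(p_null < alpha)  # share of false rejections, close to alpha

# H0 is false: the true mean is 0.3 (hypothetical effect size)
alt_data = rng.normal(loc=0.3, scale=1.0, size=(n_sim, n))
p_alt = stats.ttest_1samp(alt_data, popmean=0.0, axis=1).pvalue
type_2_rate = np.mean(p_alt >= alpha)  # share of missed effects

print(f"Type I rate ~ {type_1_rate:.3f}, Type II rate ~ {type_2_rate:.3f}")
```

The Type I rate is fixed by the choice of $\alpha$, while the Type II rate falls as the sample size or the true effect grows.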