
# Notes on Statistics and Finance

If we have a random variable $X$, then a sample of size $n$ consists of $n$ observations of $X$ {cite}`fundamentals_quant_finance`. 
The characteristics of this sample can be summarized using sample statistics such as:

* **Mean $\bar x$**: a measure of the location of the sample. Alternative measures of sample location are the median and the mode.
* **Variance $s^2$**: a measure of the dispersion of the sample.
* **Skewness**: a measure of asymmetry in the sample. Asset returns are skewed due to the trending nature of markets.
* **Kurtosis**: a measure of the relative weight in the tails and the centre. Asset returns present high kurtosis because of market crashes and because trading is discontinuous, so prices can jump.
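
The four statistics above can be computed with numpy and scipy. The sketch below is only illustrative: the `returns` array is simulated from a Student-t distribution as a stand-in for asset returns, not real market data, and the seed and scaling are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Hypothetical sample: 1,000 simulated "daily returns" (not real market data)
rng = np.random.default_rng(seed=0)
returns = rng.standard_t(df=5, size=1000) * 0.01

sample_mean = np.mean(returns)
sample_var = np.var(returns, ddof=1)           # ddof=1 gives the unbiased estimator s^2
sample_skew = skew(returns)                    # 0 for a perfectly symmetric sample
sample_kurt = kurtosis(returns, fisher=False)  # 3 for a normal distribution

print(sample_mean, sample_var, sample_skew, sample_kurt)
```

With `fisher=False`, `kurtosis` reports the raw kurtosis, so values above 3 indicate tails heavier than those of a normal distribution.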

We are not interested in the values of the sample statistics themselves, however, but in the values of the equivalent population statistics (e.g. $\mu$, $\sigma^2$), which describe the random variable $X$. Those values cannot be known exactly, but we can estimate them from the sample values using *statistical inference*.

## Statistical Inference


**Inference Variance** 

$\alpha$ **-quantile** For any given random variable $X$, it can be useful to know the number $x_\alpha$ for which the probability of $X$ being less than $x_\alpha$ is $\alpha$: e.g. the 95% quantile of the standard normal distribution is approximately 1.645. 


In [25]:
from scipy.stats import norm
alpha = 0.95
x_alpha = norm.ppf(alpha)
print(x_alpha)

1.6448536269514722
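
As a quick sanity check of the definition, the quantile function `norm.ppf` is the inverse of the cumulative distribution function `norm.cdf`, so feeding the quantile back into the CDF should recover $\alpha$. A minimal sketch:

```python
from scipy.stats import norm

alpha = 0.95
x_alpha = norm.ppf(alpha)       # quantile function (inverse CDF)
recovered = norm.cdf(x_alpha)   # P(X < x_alpha) for a standard normal X
print(recovered)                # equals alpha up to floating-point error
```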



**Confidence Intervals** We extend this concept to a range: an $\alpha$ confidence interval for a random variable $X$ is a range in which $X$ falls with probability $\alpha$. For the standard normal distribution, the central 95% interval is:

In [28]:
print(norm.interval(0.95))

(-1.959963984540054, 1.959963984540054)
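
The interval returned by `norm.interval` is central, meaning it leaves the same probability in each tail. The sketch below checks this by comparing its endpoints with the corresponding quantiles and by measuring the probability mass between them.

```python
from scipy.stats import norm

confidence = 0.95
low, high = norm.interval(confidence)

# The endpoints are the quantiles that leave (1 - confidence) / 2 in each tail
tail = (1 - confidence) / 2
print(norm.ppf(tail), norm.ppf(1 - tail))

# Probability mass between the two endpoints
mass = norm.cdf(high) - norm.cdf(low)
print(mass)
```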


#### Central Limit Theorem
Suppose we draw a sample of size $n$ from a population described by a random variable $X$ with mean $\mu$ and variance $\sigma^2$. The CLT states that as $n\to\infty$,

$$\frac{\bar x - \mu}{\frac{\sigma}{\sqrt n}} \approx N(0,1) $$


Surprisingly, this doesn't require the random variable $X$ to be normally distributed, only that it has a finite mean $\mu$ and a finite variance $\sigma^2$. 
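
The CLT can be illustrated with a small simulation. The sketch below draws many samples from an exponential population, which is clearly not normal, and standardizes each sample mean. The population, sample size, number of repetitions, and seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n, n_samples = 100, 10_000
mu, sigma = 1.0, 1.0   # mean and standard deviation of Exponential(scale=1)

# Each row is one sample of size n from the exponential population
samples = rng.exponential(scale=1.0, size=(n_samples, n))

# Standardized sample means: (x_bar - mu) / (sigma / sqrt(n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

coverage = np.mean(np.abs(z) < 1.96)
print(z.mean(), z.std())   # should be close to 0 and 1
print(coverage)            # should be close to 0.95
```

If the approximation holds, roughly 95% of the standardized means fall inside the standard normal 95% interval.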

**Applications of CLT** 

We can use the CLT to test a hypothesis on the mean of the population, given the sample mean and some confidence level. 

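Since we don't know the population $\sigma$, there is an additional theoretical result we can use for our purpose: if we suppose that the random variable $X$ is approximately normally distributed, $N(\mu, \sigma^2)$, then replacing $\sigma$ with the sample standard deviation $s$ gives a Student-t distribution with $n-1$ degrees of freedom: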

$$\frac{\bar x - \mu}{\frac{s}{\sqrt n}} \approx t_{n-1} $$

An example taken from {cite}`fundamentals_quant_finance`, computing a 95% confidence interval for $\mu$ with scipy:


In [2]:
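from scipy.stats import t
import numpy as np

# Sample of size n = 7 with sample mean 5 and sample standard deviation 4
n, x_bar, s = 7, 5, 4
student_dist = t(n - 1)
interval = student_dist.interval(0.95)  # 2.5% and 97.5% quantiles of t_6
# Invert (x_bar - mu) / (s / sqrt(n)) to obtain the bounds on mu
a, b = [x_bar - s * q / np.sqrt(n) for q in interval]
print(min(a, b), max(a, b))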


1.3006170102856744 8.699382989714326
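
As a cross-check, scipy's `loc` and `scale` arguments should give the same interval directly. This sketch assumes the numbers implied by the cell above: sample mean 5, sample standard deviation 4, and sample size 7.

```python
import numpy as np
from scipy.stats import t

# Values assumed from the example above
n, x_bar, s = 7, 5.0, 4.0

# Shift the t distribution to x_bar and scale it by the standard error
low, high = t.interval(0.95, n - 1, loc=x_bar, scale=s / np.sqrt(n))
print(low, high)
```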


#### Hypothesis Testing

Rules
1. State the null and alternative hypotheses
2. Choose a test statistic and state its distribution
3. Find the value of the test statistic given your data, under the null hypothesis
4. Choose a significance level
5. Find the critical value(s) of the test statistic for that significance level
6. Compare your value with the critical value: if it falls in the critical region (beyond the critical value, in the direction of the alternative hypothesis), reject the null hypothesis 



You have a sample of size 25, with sample mean $\bar x = -2$ and sample standard deviation $s = 5$. You want to test the null hypothesis $\mu = 0$ against the alternative hypothesis $\mu < 0$. 

What is the value of your test statistic under the null hypothesis?

$t_0 = \frac{-2 - 0}{5/\sqrt{25}} = -2$

Let's take a 5% significance level. What is the critical value of the test statistic at this level?
It is the value $t^*$ such that $P(T < t^*) = 0.05$, where $T \sim t_{24}$.




In [34]:
t(24).ppf(0.05)  

-1.7108820799094282

Since $t_0 < t^*$, we reject the null hypothesis in favor of the alternative hypothesis. 
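
The whole test can be sketched in a few lines. The variable names are illustrative, and the p-value is an addition not used in the text above: it is the probability, under the null hypothesis, of a test statistic at least as low as the one observed.

```python
import numpy as np
from scipy.stats import t

n, x_bar, s, mu_0 = 25, -2.0, 5.0, 0.0
significance = 0.05

t_0 = (x_bar - mu_0) / (s / np.sqrt(n))   # test statistic under the null hypothesis
t_crit = t(n - 1).ppf(significance)       # lower-tail critical value
p_value = t(n - 1).cdf(t_0)               # P(T <= t_0) under the null hypothesis

reject = t_0 < t_crit
print(t_0, t_crit, p_value, reject)
```

Rejecting when `t_0 < t_crit` is equivalent to rejecting when `p_value < significance`, so the two criteria always agree.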

## Sampling Distribution and Hypothesis Testing
The sampling distributions are valid for hypothesis tests and confidence intervals only under the assumption of a normal population. Without that assumption, we can rely on the CLT only when testing the mean of a population with a very large sample. 

The normal or Student-t distributions are used for inference from sample means to population means, while the chi-squared distribution is used for inference on the variance. 
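
As an illustration of inference on the variance: for a normal population, $(n-1)s^2/\sigma^2$ follows a chi-squared distribution with $n-1$ degrees of freedom, which yields a confidence interval for $\sigma^2$. The sketch below reuses $n = 25$ and $s = 5$ as assumed inputs.

```python
from scipy.stats import chi2

n, s = 25, 5.0          # assumed sample size and sample standard deviation
confidence = 0.95
df = n - 1

# Central chi-squared quantiles with df degrees of freedom
lower_q, upper_q = chi2(df).interval(confidence)

# Invert (n - 1) * s^2 / sigma^2 to obtain bounds on sigma^2;
# the larger quantile gives the lower bound and vice versa
var_low = df * s**2 / upper_q
var_high = df * s**2 / lower_q
print(var_low, var_high)
```

The interval is not symmetric around $s^2 = 25$ because the chi-squared distribution is skewed.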

**Type I errors**: We reject the null hypothesis even though it is true. The probability of committing a Type I error is equal to the significance level $\alpha$. 

**Type II errors**: We fail to reject the null hypothesis even though it is false. 

The statistical power of a study is the probability of not committing a Type II error. We can increase the power by increasing the significance level (but this also increases the probability of a Type I error) or by increasing the sample size (since the variance of the sampling distribution depends on the sample size).
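
To make the trade-off concrete, here is a minimal sketch of the power of a lower-tail test of $\mu = \mu_0$ against a specific true mean $\mu_1 < \mu_0$. For simplicity it assumes a z-test with known $\sigma$; the function name `power_lower_tail_z` and the numbers are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def power_lower_tail_z(mu_0, mu_1, sigma, n, alpha):
    """Power of a lower-tail z-test of H0: mu = mu_0 when the true mean is mu_1."""
    # Mean of the standardized test statistic when the true mean is mu_1
    shift = (mu_1 - mu_0) / (sigma / np.sqrt(n))
    # Probability that the statistic falls below the critical value norm.ppf(alpha)
    return norm.cdf(norm.ppf(alpha) - shift)

base = power_lower_tail_z(0.0, -2.0, 5.0, n=25, alpha=0.05)
more_data = power_lower_tail_z(0.0, -2.0, 5.0, n=100, alpha=0.05)
looser = power_lower_tail_z(0.0, -2.0, 5.0, n=25, alpha=0.10)
print(base, more_data, looser)
```

Both a larger sample and a larger significance level raise the power relative to the base case. When $\mu_1 = \mu_0$, the function returns $\alpha$, the Type I error probability.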

