# Tutorial 3: Confidence Interval and Hypothesis Testing

In this tutorial, we use the distributions covered in the last tutorial to test hypotheses and to model financial data.

A confidence interval tells us how accurate the sample mean is as an estimate of the population mean.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import quandl
quandl.ApiConfig.api_key = 'PGNosZasWCLCBMfGND4h'

In [3]:
#get data from quandl
spy_table = quandl.get('BCIW/_SPXT')
spy_total = spy_table[['Open','Close']]

#calculate log returns
spy_log_return = np.log(spy_total.Close).diff().dropna()
print 'Population mean:', np.mean(spy_log_return)
print 'Population standard deviation:',np.std(spy_log_return)

Population mean: 0.000548884514811
Population standard deviation: 0.00906545189634


In [4]:
print '10 days sample returns:', np.mean(spy_log_return.tail(10))
print '10 days sample standard deviation:', np.std(spy_log_return.tail(10))
print '1000 days sample returns:', np.mean(spy_log_return.tail(1000))
print '1000 days sample standard deviation:', np.std(spy_log_return.tail(1000))

10 days sample returns: 0.000628690620669
10 days sample standard deviation: 0.00192302156592
1000 days sample returns: 0.000498744418002
1000 days sample standard deviation: 0.00774787143568


## Confidence Interval
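
For a sample of size $n$ with sample mean $\bar{x}$ and sample standard deviation $s$, the cells below appear to use the normal-approximation interval for the mean. The symbols $\bar{x}$, $s$, $n$ and $z$ are notation introduced here for illustration; $z = 1.96$ corresponds to the 95% level:

$$
\bar{x} \pm z \cdot \frac{s}{\sqrt{n}}
$$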

In [5]:
# 95% confidence interval for the mean: sample mean +/- 1.96 * sample std / sqrt(n)
sample_10 = spy_log_return.tail(10)
sample_1000 = spy_log_return.tail(1000)
bottom_1 = np.mean(sample_10) - 1.96*np.std(sample_10)/np.sqrt(len(sample_10))
upper_1 = np.mean(sample_10) + 1.96*np.std(sample_10)/np.sqrt(len(sample_10))
bottom_2 = np.mean(sample_1000) - 1.96*np.std(sample_1000)/np.sqrt(len(sample_1000))
upper_2 = np.mean(sample_1000) + 1.96*np.std(sample_1000)/np.sqrt(len(sample_1000))

In [6]:
print '10 days 95% confidence interval:', (bottom_1,upper_1)
print '1000 days 95% confidence interval:', (bottom_2,upper_2)

10 days 95% confidence interval: (-0.00056321049436542151, 0.0018205917357026068)
1000 days 95% confidence interval: (1.8526371206507907e-05, 0.00097896246479839541)
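
The same calculation can be wrapped in a small helper. This is only a sketch: `mean_confidence_interval` is a hypothetical name, and the returns are synthetic normal draws rather than the S&P 500 series, so the printed widths will not match the numbers above.

```python
import numpy as np

def mean_confidence_interval(sample, z=1.96):
    """Return (lower, upper) for the mean using mean +/- z * std / sqrt(n)."""
    sample = np.asarray(sample, dtype=float)
    # np.std defaults to ddof=0, matching the cells above
    half_width = z * np.std(sample) / np.sqrt(len(sample))
    center = np.mean(sample)
    return center - half_width, center + half_width

# Synthetic returns (an assumption, not the real data) to compare interval widths
rng = np.random.RandomState(0)
fake_returns = rng.normal(loc=0.0005, scale=0.009, size=1000)

low_10, up_10 = mean_confidence_interval(fake_returns[-10:])
low_1000, up_1000 = mean_confidence_interval(fake_returns)
print('10-sample width:   {:.6f}'.format(up_10 - low_10))
print('1000-sample width: {:.6f}'.format(up_1000 - low_1000))
```

Because the half-width shrinks with the square root of the sample size, the larger sample should give a much narrower interval.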


## Confidence Interval of Normal Distribution

- critical values for two-sided 90%, 95%, and 99% confidence levels are approximately 1.645, 1.96, and 2.576
- '3 sigma rule' (68-95-99.7 rule): 

![3 Sigma Rule](https://upload.wikimedia.org/wikipedia/commons/a/a9/Empirical_Rule.PNG)
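
Critical values and the coverage of 1, 2 and 3 standard deviations can be checked numerically. The sketch below assumes `scipy` is installed and uses `scipy.stats.norm.ppf` (the inverse CDF) and `scipy.stats.norm.cdf`; the variable names are illustrative.

```python
import scipy.stats as st

# Two-sided critical value: the z that leaves (1 - level) / 2 in each tail
for level in (0.90, 0.95, 0.99):
    z_crit = st.norm.ppf(1 - (1 - level) / 2)
    print('{:.0%} level: z = {:.3f}'.format(level, z_crit))

# Probability of falling within k standard deviations of the mean
coverage = {k: st.norm.cdf(k) - st.norm.cdf(-k) for k in (1, 2, 3)}
for k in (1, 2, 3):
    print('within {} sigma: {:.4f}'.format(k, coverage[k]))
```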

## Central Limit Theorem

- Tells us that, given a sufficiently large sample size from a population with finite variance, the means of repeated samples from that population are approximately normally distributed and centered on the population mean, regardless of the shape of the population's distribution.
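
A quick simulation can make this concrete. The sketch below is an illustration under assumed settings (an exponential population with mean 1, samples of size 100, 5000 repetitions), not part of the S&P 500 analysis.

```python
import numpy as np

rng = np.random.RandomState(42)

# Exponential population: clearly non-normal, mean = 1 and standard deviation = 1
n, n_samples = 100, 5000
samples = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print('mean of sample means: {:.4f}'.format(sample_means.mean()))
print('std of sample means:  {:.4f}'.format(sample_means.std()))
print('sigma / sqrt(n):      {:.4f}'.format(1.0 / np.sqrt(n)))

# Share of sample means within 1.96 standard errors of the population mean
within = np.mean(np.abs(sample_means - 1.0) <= 1.96 / np.sqrt(n))
print('share within 1.96 standard errors: {:.3f}'.format(within))
```

If the normal approximation holds, the spread of the sample means should be close to sigma / sqrt(n), and roughly 95% of them should fall within 1.96 standard errors of the population mean.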

## Hypothesis Testing

- essentially testing your inference based on a sample

**Example**
- The daily returns of the S&P 500 form the population
- Assume we don't know the population mean
- Suppose we guess that the population mean is 0. To check this guess, we test the hypothesis against a sample.

First, compute the sample statistics needed for the confidence interval:

In [8]:
mean_1000 = np.mean(spy_log_return.tail(1000))
std_1000 = np.std(spy_log_return.tail(1000))
mean_10 = np.mean(spy_log_return.tail(10))
std_10 = np.std(spy_log_return.tail(10))
s = pd.Series([mean_10,std_10,mean_1000,std_1000],index = ['mean_10', 'std_10','mean_1000','std_1000'])
print s

mean_10      0.000629
std_10       0.001923
mean_1000    0.000499
std_1000     0.007748
dtype: float64


In [9]:
bottom = 0 - 1.64*std_1000/np.sqrt(1000)
upper = 0 + 1.64*std_1000/np.sqrt(1000)
print (bottom, upper)

(-0.00040181510038027937, 0.00040181510038027937)


At the 90% confidence level, the sample mean of 0.000499 lies outside this interval, so we reject the hypothesis that the mean daily return on the S&P 500 from August 2010 is 0.

Can we still claim this at the 95% confidence level?

In [10]:
bottom = 0 - 1.96*std_1000/np.sqrt(1000)
upper = 0 + 1.96*std_1000/np.sqrt(1000)
print (bottom, upper)

(-0.00048021804679594372, 0.00048021804679594372)


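Yes, but only barely: the sample mean of 0.000499 is still just above the upper bound of 0.000480.

If the sample mean falls outside the interval built around the hypothesized value, you can reject the **null hypothesis**.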
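
The decision rule can be written as a small function. This is a minimal sketch: `reject_null` is a hypothetical helper, and the inputs are made-up round numbers chosen so the arithmetic is easy to follow.

```python
import math

def reject_null(sample_mean, sample_std, n, mu0=0.0, z_crit=1.96):
    """Return True if sample_mean lies outside mu0 +/- z_crit * sample_std / sqrt(n)."""
    half_width = z_crit * sample_std / math.sqrt(n)
    return abs(sample_mean - mu0) > half_width

# Hypothetical numbers: standard error = 1.0 / sqrt(100) = 0.1, half-width = 0.196
print(reject_null(0.5, 1.0, 100))   # 0.5 is outside (-0.196, 0.196)
print(reject_null(0.1, 1.0, 100))   # 0.1 is inside  (-0.196, 0.196)
```

A larger `z_crit` widens the interval, so a stricter confidence level makes the null hypothesis harder to reject.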


The interval-based method above is somewhat roundabout. We can reverse the process and compute the test statistic, or **Z-score**, directly, then compare it with the critical value.

In [14]:
print np.sqrt(1000)*(mean_1000 - 0)/std_1000

2.03561499991


The higher the absolute Z-score, the further the sample mean is from the hypothesized value, measured in standard errors. So a Z-score closer to 0 is better for supporting a hypothesis.
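
The Z-score calculation generalizes to any sample and hypothesized mean. The helper below is a sketch with a hypothetical name (`z_score`) and a toy sample rather than the return series.

```python
import numpy as np

def z_score(sample, mu0=0.0):
    """Z-score of the sample mean against the hypothesized mean mu0."""
    sample = np.asarray(sample, dtype=float)
    # np.std uses ddof=0 by default, as in the cell above
    return np.sqrt(len(sample)) * (np.mean(sample) - mu0) / np.std(sample)

# Toy sample (hypothetical values): mean 3, standard deviation sqrt(2), n = 5
z = z_score([1, 2, 3, 4, 5], mu0=2.0)
print('z-score = {:.4f}'.format(z))
print('exceeds 1.96: {}'.format(abs(z) > 1.96))
```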

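A Z-score can be converted into the probability left in the tail beyond it, which tells us how wide the corresponding confidence interval is: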

In [17]:
import scipy.stats as st

# a z-score of 1.9488 leaves about 2.57% in the upper tail, i.e. roughly a 95% two-sided interval
print (1 - st.norm.cdf(1.9488))  

0.025659656888
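
To go from a Z-score to the two-sided confidence level it corresponds to, the tail probability is doubled and subtracted from 1. The sketch below assumes `scipy` is available; `two_sided_confidence` is a hypothetical helper name.

```python
import scipy.stats as st

def two_sided_confidence(z):
    """Probability that a standard normal variable lies within (-z, z)."""
    # sf(z) = 1 - cdf(z) is the upper-tail probability
    return 1.0 - 2.0 * st.norm.sf(abs(z))

for z in (1.645, 1.9488, 1.96, 2.576):
    print('z = {:.4f} -> confidence level {:.4f}'.format(z, two_sided_confidence(z)))
```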


Next we test the hypothesis that the population mean is 0 with a larger sample of 1,200 observations. Note that the p-value computed here is one-sided (upper tail only).

In [18]:
mean_1200 = np.mean(spy_log_return.tail(1200))
std_1200 = np.std(spy_log_return.tail(1200))
z_score = np.sqrt(1200)*(mean_1200 - 0)/std_1200
print 'z-score = ',z_score
p_value = (1 - st.norm.cdf(z_score))
print 'p_value = ',p_value

z-score =  2.49627411833
p_value =  0.00627527865775
