# Quantile

In [77]:
import scipy.stats as st
import numpy as np

## Standard Normal

In [78]:
# Standard Normal corresponds to mu = 0 and sigma = 1.
mu = 0
sigma = 1

Question 1 : Use `st.norm.ppf` to calculate the quantile at $\alpha = 0.95$.


The `st.norm.ppf()` method takes a probability (the area under the curve to the left of a point) and returns the corresponding z-score.

In [79]:
# The 0.95 quantile is the critical value for a two-sided 90% confidence interval.
st.norm.ppf(0.95)

1.6448536269514722
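
Because `ppf` is the inverse of `cdf`, the critical value of a two-sided interval at confidence level `c` is the quantile at `1 - (1 - c) / 2`. The following is a minimal sketch of that relationship; the helper name `z_critical` is hypothetical and not part of SciPy.

```python
import scipy.stats as st

def z_critical(confidence):
    """Two-sided standard normal critical value for a confidence level."""
    tail = (1 - confidence) / 2        # probability left in each tail
    return st.norm.ppf(1 - tail)

for confidence in (0.90, 0.95, 0.99):
    z = z_critical(confidence)
    # cdf undoes ppf, so cdf(z) recovers 1 - tail
    print(f"confidence = {confidence:.2f}, z = {z:.4f}, cdf(z) = {st.norm.cdf(z):.4f}")
```

This is why the 0.95 quantile pairs with the 90% interval and the 0.975 quantile pairs with the 95% interval.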

Question 2 : Now, use ``st.norm.ppf`` to calculate the quantile at $\alpha = 0.975$.

In [80]:
st.norm.ppf(0.975)

1.959963984540054

## Student's t

The following code evaluates `st.t.ppf` at $\alpha = 0.7$ for various degrees of freedom:


In [81]:
alpha=0.7
liste = [10,50,100,2000,100000]
for df in liste:
    val = st.t.ppf(alpha,df)
    print( 'Degree of Freedom = %7d,  Quantile = %10f' %(df,val))

Degree of Freedom =      10,  Quantile =   0.541528
Degree of Freedom =      50,  Quantile =   0.527760
Degree of Freedom =     100,  Quantile =   0.526076
Degree of Freedom =    2000,  Quantile =   0.524484
Degree of Freedom =  100000,  Quantile =   0.524402


Question 1 : Modify the previous code to get the quantiles at $\alpha=0.95$ and $\alpha=0.975$.

In [82]:
alpha1=0.95
alpha2=0.975
liste = [10,50,100,2000,100000]
print ("    - With alpha = 0.95")
for df in liste:
    val = st.t.ppf(alpha1,df)
    print( 'Degree of Freedom = %7d,  Quantile = %10f' %(df,val))
print ("\n    - With alpha = 0.975")
for df in liste:
    val2 = st.t.ppf(alpha2,df)
    print( 'Degree of Freedom = %7d,  Quantile = %10f' %(df,val2))

    - With alpha = 0.95
Degree of Freedom =      10,  Quantile =   1.812461
Degree of Freedom =      50,  Quantile =   1.675905
Degree of Freedom =     100,  Quantile =   1.660234
Degree of Freedom =    2000,  Quantile =   1.645616
Degree of Freedom =  100000,  Quantile =   1.644869

    - With alpha = 0.975
Degree of Freedom =      10,  Quantile =   2.228139
Degree of Freedom =      50,  Quantile =   2.008559
Degree of Freedom =     100,  Quantile =   1.983972
Degree of Freedom =    2000,  Quantile =   1.961151
Degree of Freedom =  100000,  Quantile =   1.959988


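Question 2 : What do you conclude about the relationship between the quantiles and the degrees of freedom?

As the degrees of freedom increase, the quantile gets smaller and converges to the corresponding Standard Normal quantile (1.644854 for $\alpha = 0.95$ and 1.959964 for $\alpha = 0.975$).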
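
The trend in the tables above can be checked directly against the Standard Normal quantile. This is a small sketch, using the same degrees of freedom as the cells above; the variable names are illustrative.

```python
import scipy.stats as st

alpha = 0.975
dfs = (10, 50, 100, 2000, 100000)
z = st.norm.ppf(alpha)                      # limiting value for large df

# Gap between the Student-t quantile and the normal quantile for each df
gaps = [st.t.ppf(alpha, df) - z for df in dfs]
for df, gap in zip(dfs, gaps):
    print(f"df = {df:7d}, t quantile = {z + gap:.6f}, gap to normal = {gap:.6f}")
```

The gap stays positive because the Student-t distribution has heavier tails, and it shrinks as the degrees of freedom grow.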

##  Chi-square:

Question 1- : Calculate P(X <= 8) with degree of freedom = 5.

In [83]:
st.chi2.cdf(8,5)

0.8437643724222779

Question 2- : Use `st.chi2.ppf` to find the quantile at alpha = 0.843764373 with degree of freedom = 5. It should recover the value 8 from Question 1.


In [84]:
st.chi2.ppf(0.843764373,5)

8.000000010482703
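
The two questions above form a round trip: `cdf` maps a value to a probability and `ppf` maps it back. A short sketch of that check follows; it also uses `st.chi2.sf`, the survival function, to get the upper-tail probability.

```python
import scipy.stats as st

df = 5
p = st.chi2.cdf(8, df)            # P(X <= 8)
x_back = st.chi2.ppf(p, df)       # quantile at that probability, back to 8
upper_tail = st.chi2.sf(8, df)    # P(X > 8), equal to 1 - cdf

print(f"P(X <= 8) = {p:.6f}")
print(f"ppf(P)    = {x_back:.6f}")
print(f"P(X > 8)  = {upper_tail:.6f}")
```

Passing the unrounded probability to `ppf` avoids the small residual seen when the rounded value 0.843764373 is typed in by hand.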



# Interval Estimation of the Mean:

In [85]:
# x contains a sample.
# n = sample size.
x = np.array([25,24,24,27,29,31,28,24,25,26,25,18,30,28,23,26,27,23,16,20,22,22,25,24, 24,25,25,27,26,30,25,25,26,26,25,24])
n = len(x)

Question 1 : Calculate the Standard Error of the Mean (SEM). Note that to use the sample standard deviation (based on the unbiased variance estimate), you should set `ddof=1`.
Hint, where $\sigma$ is estimated by the sample standard deviation:  \begin{equation}
\mathrm{SE}=\frac{\sigma}{\sqrt{n}}
\end{equation}


In [86]:
# SEM = Standard Error of the Mean.
SEM = np.std(x, ddof=1) / np.sqrt(n)
print (SEM)

0.50709255283711
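
As a cross-check, SciPy provides `st.sem`, which uses `ddof=1` by default. The sketch below repeats the sample from the cell above so that it runs on its own.

```python
import numpy as np
import scipy.stats as st

x = np.array([25, 24, 24, 27, 29, 31, 28, 24, 25, 26, 25, 18,
              30, 28, 23, 26, 27, 23, 16, 20, 22, 22, 25, 24,
              24, 25, 25, 27, 26, 30, 25, 25, 26, 26, 25, 24])

sem_manual = np.std(x, ddof=1) / np.sqrt(len(x))   # formula from the hint
sem_scipy = st.sem(x)                               # ddof=1 is the default

print(sem_manual, sem_scipy)
```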


<img width="500" height="500" src="https://analystnotes.com/graph/quan/SS02SDlosn1.gif"/>

<img width="500" height="500" src="https://i.ytimg.com/vi/sJyZ9vRhP7o/maxresdefault.jpg"/>



## 90% Confidence Interval:

Documentation : https://www0.gsb.columbia.edu/faculty/pglasserman/B6014/ConfidenceIntervals.pdf


Question 1 : Using the approximated quantiles of the Standard Normal distribution, use the following equation:

\begin{equation}
\left ( \bar{X}-1.645 *\frac{\sigma}{\sqrt{n}}, \bar{X}+1.645 *\frac{\sigma}{\sqrt{n}} \right )
\end{equation}

In [87]:
mean = x.mean()
# lower/upper avoid shadowing the built-in min() and max().
lower = mean - 1.645 * SEM
upper = mean + 1.645 * SEM
print(f'The confidence interval is : [{lower:.4f}, {upper:.4f}]')

The confidence interval is : [24.1658, 25.8342]


Question 2 : Using the exact quantiles of the Standard Normal (hint : replace the constant $1.645$ with `st.norm.ppf(0.95)`; a two-sided 90% interval leaves 5% in each tail).


In [88]:
# Two-sided 90% interval: 5% in each tail, so the critical value is the 0.95 quantile.
v = st.norm.ppf(0.95)
lower = mean - v * SEM
upper = mean + v * SEM
print(f'The confidence interval is : [{lower:.4f}, {upper:.4f}]')

The confidence interval is : [24.1659, 25.8341]


Question 3 : Use the `interval()` function from the SciPy library to get the 90% confidence interval based on the Normal distribution : `st.norm.interval(percentage, loc, scale)`


Documentation:

https://stackoverflow.com/questions/28242593/correct-way-to-obtain-confidence-interval-with-scipy

In [89]:
st.norm.interval(0.9 , loc=mean , scale= SEM)

(24.1659069752658, 25.8340930247342)

Question 4 : Same as Question 2, get the 90% confidence interval (using the exact quantiles) of the Student-t distribution. A two-sided 90% interval leaves 5% in each tail, so use `st.t.ppf(0.95, df=n-1)`.


Documentation:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html



In [90]:
# Two-sided 90% interval: the critical value is the 0.95 quantile with n-1 degrees of freedom.
v = st.t.ppf(0.95, df=n-1)
lower = mean - v * SEM
upper = mean + v * SEM
print(f'The confidence interval is : [{lower:.4f}, {upper:.4f}]')

The confidence interval is : [24.1432, 25.8568]


Question 5 : Use the `interval()` function from the SciPy library (Student-t) : `st.t.interval(percentage, df, loc, scale)`

In [91]:
st.t.interval(0.9 , df = n-1, loc= mean , scale= SEM)

(24.14323039111625, 25.85676960888375)
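
The two routes used in this section (explicit quantile times SEM, and `interval()`) should give the same interval when the quantile is taken at `1 - (1 - confidence) / 2`. Here is a minimal sketch of that equivalence; the helper `mean_ci` and the short `demo` sample are hypothetical, chosen only for illustration.

```python
import numpy as np
import scipy.stats as st

def mean_ci(sample, confidence=0.95, use_t=True):
    """Two-sided confidence interval for the mean of a sample."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(n)
    upper_prob = 1 - (1 - confidence) / 2    # e.g. 0.95 for a 90% interval
    if use_t:
        crit = st.t.ppf(upper_prob, df=n - 1)
    else:
        crit = st.norm.ppf(upper_prob)
    return mean - crit * sem, mean + crit * sem

demo = [25, 24, 24, 27, 29, 31, 28, 24, 25, 26]    # hypothetical small sample
manual = mean_ci(demo, confidence=0.90)
builtin = st.t.interval(0.90, df=len(demo) - 1,
                        loc=np.mean(demo), scale=st.sem(demo))
normal = mean_ci(demo, confidence=0.90, use_t=False)

print("manual t :", manual)
print("builtin t:", builtin)
print("normal   :", normal)
```

For a small sample the Student-t interval is wider than the normal one, which reflects the extra uncertainty from estimating the standard deviation.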

## 95% Confidence Interval

Question 1- Using the approximated quantiles of the Standard Normal.


In [92]:
mean = x.mean()
# lower/upper avoid shadowing the built-in min() and max().
lower = mean - 1.96 * SEM
upper = mean + 1.96 * SEM
print(f'The confidence interval is : [{lower:.4f}, {upper:.4f}]')

The confidence interval is : [24.0061, 25.9939]


Question 2- Using the exact quantiles of the Standard Normal.

In [93]:
# Two-sided 95% interval: 2.5% in each tail, so the critical value is the 0.975 quantile.
v = st.norm.ppf(0.975)
lower = mean - v * SEM
upper = mean + v * SEM
print(f'The confidence interval is : [{lower:.4f}, {upper:.4f}]')

The confidence interval is : [24.0061, 25.9939]


Question 3- : Using the interval() function from the SciPy library (Standard Normal). 


In [94]:
st.norm.interval(0.95 , loc=mean , scale= SEM)

(24.006116859610792, 25.993883140389208)

Question 4- : Using the exact quantiles of the Student-t.


In [95]:
# Two-sided 95% interval: the critical value is the 0.975 quantile with n-1 degrees of freedom.
v = st.t.ppf(0.975, df=n-1)
lower = mean - v * SEM
upper = mean + v * SEM
print(f'The confidence interval is : [{lower:.4f}, {upper:.4f}]')

The confidence interval is : [23.9705, 26.0295]


Question 5 - :  Using the interval() function from the SciPy library (Student-t).

In [96]:
st.t.interval(0.95 , df = n-1, loc= mean , scale= SEM)

(23.97054738812868, 26.02945261187132)

## 99% Confidence Interval

Question 1 : Using the approximated quantiles of the Standard Normal.


In [97]:
mean = x.mean()
# lower/upper avoid shadowing the built-in min() and max().
lower = mean - 2.575 * SEM
upper = mean + 2.575 * SEM
print(f'The confidence interval is : [{lower:.4f}, {upper:.4f}]')

The confidence interval is : [23.6942, 26.3058]


Question 2- Using the exact quantiles of the Standard Normal.


In [98]:
# Two-sided 99% interval: 0.5% in each tail, so the critical value is the 0.995 quantile.
v = st.norm.ppf(0.995)
lower = mean - v * SEM
upper = mean + v * SEM
print(f'The confidence interval is : [{lower:.4f}, {upper:.4f}]')

The confidence interval is : [23.6938, 26.3062]


Question 3 : Using the interval() function from the SciPy library (Standard Normal). 


In [99]:
st.norm.interval(0.99 , loc=mean , scale= SEM)

(23.69381614279075, 26.30618385720925)

Question 4- : Using the exact quantiles of the Student-t.


In [100]:
# Two-sided 99% interval: the critical value is the 0.995 quantile with n-1 degrees of freedom.
v = st.t.ppf(0.995, df=n-1)
lower = mean - v * SEM
upper = mean + v * SEM
print(f'The confidence interval is : [{lower:.4f}, {upper:.4f}]')

The confidence interval is : [23.6188, 26.3812]


Question 5- : Using the interval() function from the SciPy library (Student-t).


In [101]:
st.t.interval(0.99 , df = n-1, loc= mean , scale= SEM)

(23.618778470336505, 26.381221529663495)

# Interval Estimation of the Proportion: 

Question 1 : Suppose there are two candidates in an election: T and B (T for Trump, B for Biden).
Candidate T wants to survey his approval rating.
Out of 100 people surveyed, 55 answered positively.
Can T be sure of winning this election? 
- Assume that T gets elected with 50% or more. 
- Use a 95% confidence interval.

Draw your conclusion using the confidence interval.

In [102]:
# 95% Confidence Interval.
# We apply the standard error (SE) formula for proportions from the lecture.
n_surveyed = 100
p = 55/n_surveyed
q = 1-p
alpha = 0.5*(1-0.95)   # probability in each tail
# From the table, the z-value for an upper tail of 0.025 is 1.96.
lower = p - 1.96 * np.sqrt((p*q)/n_surveyed)
upper = p + 1.96 * np.sqrt((p*q)/n_surveyed)
print(f'The confidence interval is : [{lower:.4f}, {upper:.4f}]')

The confidence interval is : [0.4525, 0.6475]


Conclusion : 0.5 (50%) lies inside the confidence interval, so a true approval rating below 50% cannot be ruled out. T cannot be sure of his election.


Question 2 : Out of 1000 people surveyed, 550 answered positively. Can T be sure of winning this election?
- Assume that T gets elected with 50% or more. 
- Use a 95% confidence interval.

Draw your conclusion using the confidence interval.

In [103]:
# 95% Confidence Interval.
# We apply the expression for the standard error (SE) from the lecture.
# We ignore additional corrections.
n_surveyed = 1000      # the sample size must match the survey (1000, not 100)
p = 550/n_surveyed
q = 1-p
alpha = 0.5*(1-0.95)   # probability in each tail
# From the table, the z-value for an upper tail of 0.025 is 1.96.
lower = p - 1.96 * np.sqrt((p*q)/n_surveyed)
upper = p + 1.96 * np.sqrt((p*q)/n_surveyed)
print(f'The confidence interval is : [{lower:.4f}, {upper:.4f}]')

The confidence interval is : [0.5192, 0.5808]


Conclusion : 0.5 (50%) lies below the lower bound of the confidence interval, so at the 95% confidence level T's approval rating is above 50%. T can be reasonably confident of his election, though not certain.
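
The role of the sample size can be made explicit by wrapping the proportion formula in a function. This is a sketch under the same normal approximation used above; the helper name `proportion_ci` is hypothetical, and the critical value comes from `st.norm.ppf` rather than the table.

```python
import numpy as np
import scipy.stats as st

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)   # shrinks like 1/sqrt(n)
    return p_hat - margin, p_hat + margin

for successes, n in ((55, 100), (550, 1000)):
    lower, upper = proportion_ci(successes, n)
    contains_half = lower <= 0.5 <= upper
    print(f"n = {n:5d}: [{lower:.4f}, {upper:.4f}], contains 0.5: {contains_half}")
```

The observed proportion is 0.55 in both surveys, but the margin of error shrinks by a factor of about $\sqrt{10}$ when the sample grows from 100 to 1000, which is what moves 0.5 outside the interval.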

# Correlation



Some resources for learning more about these correlation coefficients:

https://datascience.stackexchange.com/questions/64260/pearson-vs-spearman-vs-kendall/64261


https://towardsdatascience.com/clearly-explained-pearson-v-s-spearman-correlation-coefficient-ada2f473b8

In [104]:
import pandas as pd
import numpy as np
import scipy.stats as st
import os
import seaborn as sns

In [105]:
sns.get_dataset_names()


['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'geyser',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'tips',
 'titanic']

In [124]:
df = sns.load_dataset('iris')

In [138]:
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [139]:
type(df)

pandas.core.frame.DataFrame

In [140]:
x = df.petal_length
y = df.sepal_length

## P-Value

Documentation : 
https://towardsdatascience.com/p-value-basics-with-python-code-ae5316197c52

Question 1 : Using the SciPy function: 
- calculate correlation and p-value. hint : ``np.round(st.pearsonr(..),..)``  
- interpret the results 

In [158]:
# Rounding to 3 decimals turns the tiny p-value into 0.0,
# so the unrounded result is displayed instead.
np.round(st.pearsonr(x,y),3)
st.pearsonr(x,y)

(0.8717537758865832, 1.0386674194497525e-47)

Question2 : Using the Pandas function.


In [147]:
x.corr(y)

0.8717537758865831

Question 3: Correlation array, use: `np.round(..)`   

In [148]:
# Correlation array.
np.round(x.corr(y),3)

0.872
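
For more than two variables, `DataFrame.corr()` returns the full correlation matrix in one call. The sketch below uses a small synthetic frame (the column names `a`, `b`, `c` and the data are made up for illustration) so that it runs without downloading a dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
frame = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=200),   # strongly related to a
    "c": rng.normal(size=200),                      # independent noise
})

# Pairwise Pearson correlations, rounded for display
corr_matrix = frame.corr(method="pearson").round(3)
print(corr_matrix)
```

The `method` argument also accepts `"spearman"` and `"kendall"`.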

## Spearman

Question 1 : Use the SciPy Spearman function to compute the correlation and p-value.
hint : ``np.round(st.spearmanr(..),..) ``         

In [159]:
# Rounding to 3 decimals turns the tiny p-value into 0.0,
# so the unrounded result is displayed instead.
np.round(st.spearmanr(x,y),3)
st.spearmanr(x,y)

SpearmanrResult(correlation=0.8818981264349859, pvalue=3.443087278047102e-50)

## Kendall

Question 1: Use the SciPy function to compute the Kendall correlation and p-value.

In [149]:
st.kendalltau(x,y)

KendalltauResult(correlation=0.7185159275387325, pvalue=1.1691259442824597e-36)
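
The three coefficients measure different things: Pearson captures linear association, while Spearman and Kendall depend only on ranks. A small sketch on synthetic data (an exponential curve, chosen here as an assumption for illustration) makes the difference visible.

```python
import numpy as np
import scipy.stats as st

x_demo = np.linspace(0, 5, 50)
y_demo = np.exp(x_demo)          # strictly increasing, but far from linear

pearson_r, _ = st.pearsonr(x_demo, y_demo)
spearman_rho, _ = st.spearmanr(x_demo, y_demo)
kendall_tau, _ = st.kendalltau(x_demo, y_demo)

print(f"Pearson  r   = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
print(f"Kendall  tau = {kendall_tau:.3f}")
```

The rank-based coefficients reach 1 for any strictly increasing relationship, while Pearson stays below 1 because the relationship is not a straight line.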

Question : Confidence Interval of the Pearson Correlation: 

In [157]:
# Apply the Fisher's z-transformation.
# See the lecture.
r, p = st.pearsonr(x,y)   # pearsonr returns the correlation first, then the p-value
alpha = 0.05              # 95% confidence level
r_z = np.arctanh(r)
se = 1/np.sqrt(len(x)-3)
z = st.norm.ppf(1-alpha/2)
lower_z, upper_z = r_z-z*se, r_z+z*se
lower, upper = np.tanh((lower_z,upper_z))
print(f'The confidence interval is : [{lower:.4f}, {upper:.4f}]')

The confidence interval is : [0.8270, 0.9055]
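
The Fisher z-transformation steps can be collected into a reusable function that returns its result as a dictionary. This is a sketch, not a SciPy API: the name `pearson_ci`, the dictionary keys, and the synthetic demo data are all assumptions made for illustration.

```python
import numpy as np
import scipy.stats as st

def pearson_ci(x, y, confidence=0.95):
    """Confidence interval for Pearson's r via Fisher's z-transformation."""
    r, p_value = st.pearsonr(x, y)          # correlation first, then p-value
    r_z = np.arctanh(r)                     # Fisher z-transform of r
    se = 1 / np.sqrt(len(x) - 3)            # standard error in z-space
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    lower, upper = np.tanh([r_z - z * se, r_z + z * se])   # back to r-space
    return {"r": r, "lower": lower, "upper": upper}

# Synthetic, positively correlated data (hypothetical)
rng = np.random.default_rng(42)
x_demo = rng.normal(size=150)
y_demo = 0.9 * x_demo + rng.normal(scale=0.5, size=150)

ci_95 = pearson_ci(x_demo, y_demo, confidence=0.95)
ci_99 = pearson_ci(x_demo, y_demo, confidence=0.99)
print(ci_95)
print(ci_99)
```

The interval always contains the estimated `r` and stays inside (-1, 1) because the bounds are mapped back through `tanh`.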


In [113]:
# 95% confidence interval. 
# Expressed as a dictionary object.


Question 1 : 99% confidence interval, expressed as a dictionary object.

In [114]:
# 99% confidence interval.
# Expressed as a dictionary object.

