# Quantile

In [None]:
import scipy.stats as st
import numpy as np

## Standard Normal



---


>* *Video* [Z-Scores and Percentiles: Crash Course Statistics #18](https://www.youtube.com/watch?v=uAxyI_XfqXk&list=PL8dPuuaLjXtNM_Y-bUAhblSAdWRnmBUcr&index=19)
>
>Note: percentiles are the same thing as quantiles, just expressed as percentages.


---



In [None]:
# Standard Normal corresponds to mu = 0 and sigma = 1.
mu = 0
sigma = 1

Question 1 : Use `st.norm.ppf` to calculate the quantile at $\alpha = 0.95$


The `st.norm.ppf()` method takes a cumulative probability (the area under the curve to the left of a point) and returns the corresponding z-score.
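As a quick check of that description, the sketch below feeds a probability to `st.norm.ppf` and then passes the resulting z-score back through `st.norm.cdf`. The variable names `z` and `p` are only illustrative.

```python
import scipy.stats as st

# ppf maps a cumulative probability to a z-score ...
z = st.norm.ppf(0.95)
# ... and cdf maps that z-score back to the cumulative probability.
p = st.norm.cdf(z)
print(z, p)
```

The two functions are inverses of each other, so `p` should come back as the probability that was passed in.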

In [None]:
# This quantile is the critical value for a two-sided 90% confidence interval.
st.norm.ppf(0.95)

Question 2 : Now, use ``st.norm.ppf`` to calculate the quantile at $\alpha = 0.975$

In [None]:
st.norm.ppf(0.975)

## Student's t



---
>What I took away
>* The t distribution is useful for representing the sampling distribution of the sample mean when the sample size is too small to apply the central limit theorem and approximate it with a normal distribution.
>
>*Video* [Introduction to the t Distribution (non-technical)](https://www.youtube.com/watch?v=Uv6nGIgZMVw&t=181s)
>
>* The shape of the t distribution depends on the degrees of freedom df (which depend directly on the sample size n, since df = n - 1): the higher df is, the closer the t distribution gets to the normal distribution.


---




The following code evaluates `st.t.ppf` at $\alpha = 0.7$ for various degrees of freedom:


In [None]:
alpha=0.7
liste = [10,50,100,2000,100000]
for df in liste:
    val = st.t.ppf(alpha,df)
    print( 'Degree of Freedom = %7d,  Quantile = %10f' %(df,val))

In [None]:
st.norm.ppf(0.7)

Question 1 : Modify the previous code to get the quantiles at $\alpha=0.95$ and $\alpha=0.975$

In [None]:
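liste = [10,50,100,2000,100000]
# Loop over both requested levels, then over the degrees of freedom.
for alpha in (0.95,0.975):
    for df in liste:
        val = st.t.ppf(alpha,df)
        print( 'alpha = %5.3f,  Degree of Freedom = %7d,  Quantile = %10f' %(alpha,df,val))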

Question 2 : What do you conclude about the relationship between the quantiles and the number of degrees of freedom?
 
 


>As the number of degrees of freedom of the t distribution increases, its quantile at a given alpha converges to the quantile of the standard normal distribution at the same alpha.
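A minimal sketch of that convergence, reusing the same degrees of freedom as above: it prints the gap between the t quantile and the normal quantile at one level. The names `dof`, `gap` and `gaps` are illustrative, and $\alpha = 0.975$ is just one choice of level.

```python
import scipy.stats as st

alpha = 0.975
z = st.norm.ppf(alpha)  # normal quantile, the limiting value

gaps = []
for dof in [10, 50, 100, 2000, 100000]:
    # Difference between the t quantile and the normal quantile.
    gap = st.t.ppf(alpha, dof) - z
    gaps.append(gap)
    print('Degree of Freedom = %7d,  t - z = %10f' % (dof, gap))
```

The gap stays positive (the t distribution has heavier tails) and shrinks toward zero as the degrees of freedom grow.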

##  Chi-square:

Question 1 : Calculate $P(X \le 8)$ with 5 degrees of freedom.

In [None]:
df=5
st.chi2.cdf(8,df)

Question 2 : Use `st.chi2.ppf` to get the quantile at $\alpha = 0.843764373$ with 5 degrees of freedom.


In [None]:
st.chi2.ppf(0.843764373,df)
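The two chi-square questions are inverse operations of each other. A small sketch of the round trip, with `dof`, `p` and `x_back` as illustrative names:

```python
import scipy.stats as st

dof = 5
# Cumulative probability P(X <= 8) for a chi-square with 5 degrees of freedom.
p = st.chi2.cdf(8, dof)
# Feeding that probability to ppf should recover the starting point.
x_back = st.chi2.ppf(p, dof)
print(p, x_back)
```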



# Interval Estimation of the Mean:



---
>In order, I watched:
>* *Video* [Introduction to confidence intervals](https://www.youtube.com/watch?v=27iSnzss2wM&list=PLvxOuBpazmsMdPBRxBTvwLv5Lhuk0tuXh&index=1)
>* *Video* [Deriving a Confidence Interval for the Mean (The Rationale Behind the Confidence Interval Formula)](https://www.youtube.com/watch?v=-iYDu8flFXQ&list=PLvxOuBpazmsMdPBRxBTvwLv5Lhuk0tuXh&index=2)
>* *Video* [Confidence Intervals: Crash Course Statistics #20](https://www.youtube.com/watch?v=yDEvXB6ApWc&list=PL8dPuuaLjXtNM_Y-bUAhblSAdWRnmBUcr&index=21)

---



In [None]:
# x contains a sample.
# n = sample size.
x = np.array([25,24,24,27,29,31,28,24,25,26,25,18,30,28,23,26,27,23,16,20,22,22,25,24, 24,25,25,27,26,30,25,25,26,26,25,24])
n = len(x)

Question 1 : Calculate the Standard Error of the Mean (SEM). Note: set `ddof=1` so that the standard deviation is based on the unbiased sample variance.
Hint (with $\sigma$ estimated by the sample standard deviation):
\begin{equation}
\mathrm{SE}=\frac{\sigma}{\sqrt{n}}
\end{equation}


In [None]:
# SEM = Standard Error of the Mean.
SE=np.std(x,ddof=1)/np.sqrt(n)
SE
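As a cross-check, SciPy's `st.sem` computes the same quantity and uses `ddof=1` by default. The sketch below compares it with the manual formula on a short illustrative array (the first six values of the sample above); `sample`, `manual` and `auto` are names chosen here.

```python
import numpy as np
import scipy.stats as st

# Illustrative data: the first six values of the sample used above.
sample = np.array([25, 24, 24, 27, 29, 31])

# Manual formula: sample standard deviation over the square root of n.
manual = np.std(sample, ddof=1) / np.sqrt(len(sample))
# Built-in equivalent; st.sem defaults to ddof=1.
auto = st.sem(sample)
print(manual, auto)
```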

<img width="500" height="500" src="https://analystnotes.com/graph/quan/SS02SDlosn1.gif"/>

<img width="500" height="500" src="https://i.ytimg.com/vi/sJyZ9vRhP7o/maxresdefault.jpg"/>



## 90% Confidence Interval:

Documentation : https://www0.gsb.columbia.edu/faculty/pglasserman/B6014/ConfidenceIntervals.pdf


Question 1 : Using the approximated quantiles of the Standard Normal distribution, use the following equation:

\begin{equation}
\left ( \bar{X}-1.645 *\frac{\sigma}{\sqrt{n}}, \bar{X}+1.645 *\frac{\sigma}{\sqrt{n}} \right )
\end{equation}

In [None]:
# 1.645 is the approximate 0.95 quantile of the Standard Normal.
b_inf=x.mean()-1.645*SE
b_sup=x.mean()+1.645*SE
print(f'({b_inf},{b_sup})')

Question 2 : Using the exact quantiles of the Standard Normal (hint : in $1.645 \frac{\sigma}{\sqrt{n}}$, replace the constant 1.645 with a call to `st.norm.ppf`)


In [None]:
q=0.9
a=(1-q)/2
borne_inf=x.mean()- (-st.norm.ppf(a)*SE) #because st.norm.ppf(0.05) gives -1.645
borne_sup=x.mean()+ (-st.norm.ppf(a)*SE)
print (f"({borne_inf},{borne_sup})")

In [None]:
(st.norm.ppf(a)*SE)

Question 3 : Use the `interval()` function from the SciPy library to get the 90% confidence interval based on the Standard Normal : `st.norm.interval(percentage, loc, scale)`


Documentation:

https://stackoverflow.com/questions/28242593/correct-way-to-obtain-confidence-interval-with-scipy

In [None]:
st.norm.interval(0.9,loc=x.mean(),scale=SE)
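The manual formula and `st.norm.interval` should agree. A minimal sketch, assuming a hypothetical helper `normal_ci` (not part of SciPy) and a short illustrative sample:

```python
import numpy as np
import scipy.stats as st

def normal_ci(sample, level):
    """Two-sided normal-based confidence interval for the mean (hypothetical helper)."""
    sample = np.asarray(sample, dtype=float)
    se = np.std(sample, ddof=1) / np.sqrt(len(sample))
    # Upper quantile: for level 0.90 this is ppf(0.95), about 1.645.
    z = st.norm.ppf(1 - (1 - level) / 2)
    m = sample.mean()
    return (m - z * se, m + z * se)

sample = [25, 24, 24, 27, 29, 31, 28, 24]  # illustrative values
lo, hi = normal_ci(sample, 0.90)
ref = st.norm.interval(0.90, loc=np.mean(sample), scale=st.sem(sample))
print((lo, hi), ref)
```

Using the upper quantile `1 - (1 - level) / 2` avoids the sign flip that is needed when starting from the lower tail.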

Question 4 : Same as Question 1, but get the **90% confidence interval (using the exact quantiles)** of **the Student-t** distribution (consider using `st.t.ppf(0.95, df=n-1)`).


Documentation:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html



In [None]:
df=n-1  # degrees of freedom: sample size minus one
inf=x.mean()-st.t.ppf(0.95,df)*SE
sup=x.mean()+st.t.ppf(0.95,df)*SE
print(f'({inf},{sup})')

Question 5 : Using the `st.t.interval()` function from the SciPy library (Student-t).

In [None]:
st.t.interval(0.90,loc=x.mean(),scale=SE,df=n-1)

## 95% Confidence Interval

Question 1- Using the approximated quantiles of the Standard Normal.


In [None]:
(x.mean()-1.96*SE,x.mean()+1.96*SE)

Question 2- Using the exact quantiles of the Standard Normal.

In [None]:
(x.mean()-(-st.norm.ppf(0.025)*SE),x.mean()+(-st.norm.ppf(0.025)*SE))

Question 3- : Using the interval() function from the SciPy library (Standard Normal). 


In [None]:
st.norm.interval(0.95,x.mean(),SE)

Question 4- : Using the exact quantiles of the Student-t.


In [None]:
# Degrees of freedom for the Student-t: n-1.
(x.mean()-(-st.t.ppf(0.025,n-1)*SE),x.mean()+(-st.t.ppf(0.025,n-1)*SE))

Question 5 - :  Using the interval() function from the SciPy library (Student-t).

In [None]:
st.t.interval(0.95,loc=x.mean(),scale=SE,df=n-1)

## 99% Confidence Interval

Question 1 : Using the approximated quantiles of the Standard Normal.


In [None]:
(x.mean()-2.575*SE,x.mean()+2.575*SE)

Question 2- Using the exact quantiles of the Standard Normal.


In [None]:
a=(1-0.99)/2
(x.mean()-(-st.norm.ppf(a)*SE),x.mean()+(-st.norm.ppf(a)*SE))

Question 3 : Using the interval() function from the SciPy library (Standard Normal). 


In [None]:
st.norm.interval(0.99,x.mean(),SE)

Question 4- : Using the exact quantiles of the Student-t.


In [None]:
a=(1-0.99)/2
# Degrees of freedom for the Student-t: n-1.
(x.mean()-(-st.t.ppf(a,n-1)*SE),x.mean()+(-st.t.ppf(a,n-1)*SE))

Question 5- : Using the interval() function from the SciPy library (Student-t).


In [None]:
st.t.interval(0.99,loc=x.mean(),scale=SE,df=n-1)
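To see how much wider the Student-t intervals are than the normal ones for a sample of this size, the sketch below compares the critical values at the three levels used above. It assumes n = 36 (the length of the sample); `ratios` and `upper` are illustrative names.

```python
import scipy.stats as st

n = 36  # sample size used in this section
ratios = []
for level in (0.90, 0.95, 0.99):
    upper = 1 - (1 - level) / 2      # upper-tail quantile level
    z = st.norm.ppf(upper)           # normal critical value
    t = st.t.ppf(upper, n - 1)       # Student-t critical value, df = n-1
    ratios.append(t / z)
    print('level = %.2f,  z = %.4f,  t = %.4f,  t/z = %.4f' % (level, z, t, t / z))
```

Since both intervals share the same standard error, the ratio `t/z` is also the ratio of their widths.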

# Interval Estimation of the Proportion: 



---
>* *Video* [01 - Estimating Population Proportions, Part 1 - Learn Confidence Intervals in Statistics](https://www.youtube.com/watch?v=e6HsIWQJjdM&t=689s)
>* [A Confidence Interval for A Population Proportion](https://openstax.org/books/introductory-business-statistics/pages/8-3-a-confidence-interval-for-a-population-proportion) 
>* to understand the formula : [The Central Limit Theorem for Proportions](https://openstax.org/books/introductory-business-statistics/pages/7-3-the-central-limit-theorem-for-proportions)


---




Question 1 : Suppose there are two candidates for the election: T and B. (T for Trump, B for Biden, just saying... lol)
Candidate T wants to survey his approval rating.
Out of 100 people surveyed, 55 answered positively.
Can T be sure of winning this election?
- Assume that T gets elected with 50% or more. 
- Use a 95% confidence interval.

Draw your conclusion using the confidence interval.

In [None]:
# 95% Confidence Interval.
# We apply the standard error (SE) formula for proportions from the lecture.
n=100
z_c=-st.norm.ppf((1-0.95)/2)
p_hat=0.55
SE=np.sqrt((p_hat*(1-p_hat))/n)
inf=p_hat-z_c*SE
sup=p_hat+z_c*SE
(inf,sup)

In [None]:
0.55*(1-0.55)

Question 2 : Out of 1000 people surveyed, 550 answered positively. Can T be sure of winning this election?
- Assume that T gets elected with 50% or more. 
- Use a 95% confidence interval.

Draw your conclusion using the confidence interval.

In [None]:
# 95% Confidence Interval.
# We apply the standard error (SE) formula for proportions from the lecture.
# We ignore additional corrections.
n=1000
x=550
z_c=-st.norm.ppf((1-0.95)/2)
p_hat=x/n
SE=np.sqrt((p_hat*(1-p_hat))/n)
inf=p_hat-z_c*SE
sup=p_hat+z_c*SE
(inf,sup)

Because n is larger, the standard error is smaller even though the sample proportion is the same.
The 95% confidence interval we get lies entirely above 50%, so at the 95% confidence level T can expect to be elected.
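The two surveys can be compared side by side with a small helper. This is only a sketch: `proportion_ci` is a hypothetical function name, and it uses the same normal-approximation formula as the cells above, without any correction.

```python
import numpy as np
import scipy.stats as st

def proportion_ci(successes, n, level=0.95):
    """Normal-approximation confidence interval for a proportion (hypothetical helper)."""
    p_hat = successes / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    z = st.norm.ppf(1 - (1 - level) / 2)
    return (p_hat - z * se, p_hat + z * se)

for successes, n in [(55, 100), (550, 1000)]:
    low, high = proportion_ci(successes, n)
    print('n = %4d: (%.4f, %.4f), contains 0.5: %s' % (n, low, high, low <= 0.5 <= high))
```

With 100 respondents the interval still contains 50%, so the survey is not conclusive; with 1000 respondents the whole interval is above 50%.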

# Correlation



Some resources to learn more about these correlations:

https://datascience.stackexchange.com/questions/64260/pearson-vs-spearman-vs-kendall/64261


https://towardsdatascience.com/clearly-explained-pearson-v-s-spearman-correlation-coefficient-ada2f473b8



---

>* [NumPy, SciPy, and Pandas: Correlation With Python](https://realpython.com/numpy-scipy-pandas-correlation-python/)


---



In [None]:
import pandas as pd
import numpy as np
import scipy.stats as st
import os
import seaborn as sns

In [None]:
sns.get_dataset_names()


In [None]:
df = sns.load_dataset('iris')

In [None]:
df.head()

In [None]:
type(df)

In [None]:
x = df.petal_length
y = df.sepal_length

## P-Value

Documentation : 
https://towardsdatascience.com/p-value-basics-with-python-code-ae5316197c52

Question 1 : Using the SciPy function: 
- calculate correlation and p-value. hint : ``np.round(st.pearsonr(..),..)``  
- interpret the results 

`st.pearsonr(x,y)` returns a pair containing
* the Pearson correlation coefficient of the two arrays, 
* and the associated p-value: the probability of observing a correlation at least this extreme if the two variables were actually uncorrelated.
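To make the first returned value concrete, the sketch below computes Pearson's r directly from its definition (covariance divided by the product of the standard deviations) and compares it with `st.pearsonr`. The arrays `a` and `b` are synthetic data generated only for this illustration.

```python
import numpy as np
import scipy.stats as st

# Synthetic, linearly related data (illustration only).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 0.8 * a + rng.normal(scale=0.5, size=200)

# Pearson r from its definition.
da, db = a - a.mean(), b - b.mean()
r_manual = np.sum(da * db) / np.sqrt(np.sum(da ** 2) * np.sum(db ** 2))

r_scipy, p_value = st.pearsonr(a, b)
print(r_manual, r_scipy, p_value)
```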



In [None]:
st.pearsonr(x,y)


In [None]:
r,p_value=st.pearsonr(x,y)

In [None]:
# Round to 4 decimals; without a second argument np.round rounds to integers.
np.round(st.pearsonr(x,y),4)



Question 2 : Using the Pandas function.


In [None]:
x.corr(y)

In [None]:
y.corr(x)

Question 3: Correlation array, use: `np.round(..)`   

In [None]:
# Correlation array, rounded to 4 decimals.
np.round(np.corrcoef(x,y),4)

## Spearman

Question 1 : Using the SciPy Spearman function, calculate the correlation and p-value.
hint : ``np.round(st.spearmanr(..),..) ``         

In [None]:
st.spearmanr(x,y)
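One way to read the Spearman coefficient is as a Pearson correlation computed on ranks. A minimal sketch on synthetic data without ties (the arrays `a` and `b` are made up for this illustration):

```python
import numpy as np
import scipy.stats as st

# Synthetic data with a monotonic but non-linear relationship (illustration only).
rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = np.exp(a) + rng.normal(scale=0.1, size=100)

rho, _ = st.spearmanr(a, b)
# Pearson correlation of the ranks of a and b.
r_on_ranks, _ = st.pearsonr(st.rankdata(a), st.rankdata(b))
print(rho, r_on_ranks)
```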

## Kendall

Question 1 : Using the SciPy Kendall function, calculate the correlation and p-value.

In [None]:
st.kendalltau(x,y)

## Confidence Interval of the Pearson Correlation



---
>* [Confidence interval on pearson r correlation](https://onlinestatbook.com/2/estimation/correlation_ci.html)
>* [Sampling Distributions: Pearson's r](https://www.youtube.com/watch?v=I0SjHVOHztc)


---




In [None]:
# Apply the Fisher's z-transformation.
# See the lecture.
z=.5*(np.log(r+1)-np.log(1-r))
z

In [None]:
# 95% confidence interval. 
# Expressed as a dictionary object.

#CI in terms of z
mean=z
SE=1/np.sqrt(df.shape[0]-3)
Zc=-st.norm.ppf((1-0.95)/2)
low_z=mean-SE*Zc
up_z=mean+SE*Zc
#CI in terms of r
low_r=(np.exp(2*low_z)-1)/(np.exp(2*low_z)+1)
up_r=(np.exp(2*up_z)-1)/(np.exp(2*up_z)+1)
d={}
d['lower_limit']=low_r
d['upper_limit']=up_r
d

Question 1 : 99% confidence interval, expressed as a dictionary object.

In [None]:
# 99% confidence interval.
# Expressed as a dictionary object.

#CI in terms of z
mean=z
SE=1/np.sqrt(df.shape[0]-3)
Zc=-st.norm.ppf((1-0.99)/2)
low_z=mean-SE*Zc
up_z=mean+SE*Zc

#CI in terms of r
low_r=(np.exp(2*low_z)-1)/(np.exp(2*low_z)+1)
up_r=(np.exp(2*up_z)-1)/(np.exp(2*up_z)+1)
d={}
d['lower_limit']=low_r
d['upper_limit']=up_r
d
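The 95% and 99% cells differ only in the confidence level, so the steps can be wrapped in one function. This is a sketch: `pearson_ci` is a hypothetical helper, it uses `np.arctanh` and `np.tanh` (which are the Fisher z-transformation and its inverse), and the value `r = 0.87` with `n = 150` is just an illustrative input.

```python
import numpy as np
import scipy.stats as st

def pearson_ci(r, n, level=0.95):
    """Confidence interval for Pearson's r via Fisher's z (hypothetical helper)."""
    z = np.arctanh(r)                 # same as 0.5*(log(1+r) - log(1-r))
    se = 1 / np.sqrt(n - 3)           # standard error in z space
    zc = st.norm.ppf(1 - (1 - level) / 2)
    # Transform the limits back to the r scale with tanh.
    return {'lower_limit': np.tanh(z - zc * se),
            'upper_limit': np.tanh(z + zc * se)}

ci95 = pearson_ci(0.87, 150, 0.95)    # illustrative r and n
ci99 = pearson_ci(0.87, 150, 0.99)
print(ci95)
print(ci99)
```

The 99% interval is wider than the 95% one, and both are asymmetric around r because the back-transformation is non-linear.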