# Statistical Inference 
1. The Central Limit Theorem
2. Confidence intervals for Means
3. Hypothesis testing 
     - the z test
     - single sample t test
     - independent samples t test



## Central Limit Theorem


The Central Limit Theorem describes the sampling distribution of the sample mean for samples of size n drawn from a population with mean μ and finite standard deviation σ. Properties 1 and 2 hold for any sample size; the normal shape in property 3 is what needs a large n.

1. Sampling distribution's mean = Population mean (μ)
2. Sampling distribution's standard deviation (standard error) = σ/√n
3. As n grows, the sampling distribution of the mean approaches a normal distribution, whatever the shape of the population.
4. As a rule of thumb, n ≥ 30 is usually treated as large enough for practical purposes, although strongly skewed populations may need more.
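
As a quick sanity check of properties 1 and 2, the sketch below simulates sample means from a synthetic exponential population. The population, the seed, and the variable names are illustrative assumptions, not the dataset used in the rest of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
population = rng.exponential(scale=2.0, size=100_000)   # skewed, clearly non-normal

n = 30                                  # sample size
num_samples = 5000                      # number of repeated samples

# Draw all samples at once: each row is one sample of size n (with replacement)
samples = rng.choice(population, size=(num_samples, n), replace=True)
sample_means = samples.mean(axis=1)

mu = population.mean()
sigma = population.std()                # population standard deviation (ddof=0)

print("population mean         :", round(mu, 3))
print("mean of sample means    :", round(sample_means.mean(), 3))
print("sigma / sqrt(n)         :", round(sigma / np.sqrt(n), 3))
print("std of the sample means :", round(sample_means.std(), 3))
```

The first two printed values should be close to each other, and so should the last two.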


In [None]:
#!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
# !wget https://www.dropbox.com/s/d0azrfwynya0xjb/train.csv?dl=0 --no-check-certificate
!wget https://www.dropbox.com/s/d0azrfwynya0xjb/train.csv

In [None]:

import numpy as np 
import pandas as pd 
import random

import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as st
import warnings
warnings.filterwarnings('ignore')


In [None]:
df = pd.read_csv("train.csv")
print(df.shape)
df.head()

In [None]:
df['SalePrice'].head()

In [None]:
#sns.set_style('whitegrid') 
sns.histplot(df.SalePrice, kde=True, color='red', bins=100)  # histplot replaces the deprecated distplot

In [None]:
plt.hist(df.SalePrice, bins=100, color='pink')
plt.axvline(x=df.SalePrice.mean(), color='g')

**Observation:**
The plot above shows that the population is not normally distributed. To see the Central Limit Theorem at work, we draw many samples of size 30, compute the mean of each one (the sample means), and plot those means. Their distribution should look approximately normal.


In [None]:
x1bar = df.SalePrice.sample(n=30).mean()   # sample S1, mean = x1bar
x2bar = df.SalePrice.sample(n=30).mean()   # sample S2, mean = x2bar
print(x1bar, x2bar)

In [None]:
list1=[]
num_samples=5000  # s1 to s5000
for i in range(0, num_samples):
    list1.append(df.SalePrice.sample(n=30, replace=True).mean())


In [None]:
len(list1)

In [None]:
ax = sns.histplot(list1, kde=True, color='red', bins=100)  # histplot replaces the deprecated distplot

### Sampling distribution approaching Normal distribution 
For sample sizes of about 30 or more, the sampling distribution of the mean is approximately normal, even though the population itself is not.
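
One way to put a number on "approximately normal" is skewness, which is 0 for a normal distribution. The sketch below uses synthetic exponential data (an assumption, as is the helper name `skew_of_sample_means`) and shows the skewness of the sample means shrinking as n grows.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)

def skew_of_sample_means(n, num_samples=5000):
    """Skewness of the means of `num_samples` exponential samples of size n."""
    samples = rng.exponential(scale=1.0, size=(num_samples, n))
    return skew(samples.mean(axis=1))

for n in (2, 10, 30, 100):
    print(f"n={n:>3}  skewness of sample means = {skew_of_sample_means(n):.2f}")
```

At n = 30 the skewness is small but not zero, which is why the rule of thumb says "approximately" rather than "exactly" normal.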

In [None]:
# Homework -- fun experiment
from scipy.stats import expon
data = expon.rvs(size=2000)
sns.histplot(data, kde=True, color='red', bins=100)  # histplot replaces the deprecated distplot
# validate the Central Limit Theorem on this data, using code similar to the above

In [None]:
type(data)

In [None]:
df = pd.DataFrame(data, columns=['data'])
print(df.shape)
df.head()

In [None]:
data

In [None]:
list1=[]
num_samples=5000  # s1 to s5000
for i in range(0, num_samples):
    list1.append(df.data.sample(n=30, replace=True).mean())

In [None]:
ax = sns.histplot(list1, kde=True, color='red', bins=100)  # histplot replaces the deprecated distplot

## Confidence Intervals
**Confidence Interval (CI)** is a type of statistical estimate that proposes a range of plausible values for an unknown parameter (for example, the mean). The interval has an associated confidence level, which describes how often intervals built this way capture the true parameter.
A 95% confidence interval comes from a procedure that captures the population mean in 95% of repeated samples; informally, you can be 95% confident that the interval contains the population mean. A large sample pins down the mean much more precisely than a small one, so the confidence interval computed from a large sample is narrower.
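
The simulation below is a minimal sketch of both statements, under assumed values: a normal population with mean 50 and standard deviation 10 (both made up for illustration). It builds many 95% intervals, counts how often they contain the true mean, and compares the interval width for a small and a large sample.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)
mu, sigma = 50.0, 10.0        # assumed "true" population parameters
z = st.norm.ppf(0.975)        # critical value for a 95% interval

def coverage_and_width(n, trials=10_000):
    """Fraction of 95% z-intervals that contain mu, and their common width."""
    samples = rng.normal(mu, sigma, size=(trials, n))
    xbar = samples.mean(axis=1)
    moe = z * sigma / np.sqrt(n)             # known-sigma margin of error
    covered = (xbar - moe <= mu) & (mu <= xbar + moe)
    return covered.mean(), 2 * moe

for n in (30, 300):
    cov, width = coverage_and_width(n)
    print(f"n={n:>3}  coverage={cov:.3f}  width={width:.2f}")
```

The coverage should stay near 0.95 for both sample sizes, while the width shrinks as n grows.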


![](https://i.imgur.com/AjRb5aG.png)


### Calculating the Confidence Interval

1. Start with  xbar (mean), s (std dev), n (no. of obs.)
  - We should use the standard deviation of the entire population, but in many cases we won't know it.
  - We can use the standard deviation for the sample if we have enough observations (at least n=30, hopefully more)

2. Decide what confidence level we want: 95% or 99% are common choices. Then find the z value for that confidence level, from a standard normal table or with `st.norm.ppf` (z ≈ 1.96 for 95%, z ≈ 2.58 for 99%).

3. Use that Z value in this formula for the Confidence Interval
  - Confidence Interval = [PointEstimate - MoE, PointEstimate + MoE]
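
Written out for the mean when the population standard deviation σ is known, the margin of error and the interval take the form below. The symbol $z_{\alpha/2}$ for the critical value is notation introduced here, with α = 1 − confidence level.

$$
\text{MoE} = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}, \qquad
\text{CI} = \left[\,\bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}},\ \ \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\,\right]
$$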

![](https://i.imgur.com/jdrj6wC.png)




### Cumulative Distribution Function, Percent Point Function

1. *Cumulative Distribution Function (CDF)*: **stats.norm.cdf** - Returns the **probability** of an observation **less than or equal to** a given value. Put differently: given a z-score, it returns the cumulative probability up to that z-score.
2. *Percent Point Function (PPF)*: **stats.norm.ppf** - Returns the **observation value** at or below which the given cumulative probability lies. Put differently: given a cumulative probability, it returns the z-score.
3. The PPF is the inverse of the CDF.
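
A short round trip illustrates the relationship between the two functions; the value 1.96 is chosen only as a familiar example.

```python
import scipy.stats as st

z = 1.96
p = st.norm.cdf(z)            # P(Z <= 1.96)
print(round(p, 4))            # about 0.975

z_back = st.norm.ppf(p)       # the PPF undoes the CDF
print(round(z_back, 2))       # 1.96

# Two-sided 95% interval: 2.5% in each tail
lower, upper = st.norm.ppf(0.025), st.norm.ppf(0.975)
print(round(lower, 2), round(upper, 2))   # -1.96 1.96
```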

### Confidence Interval using z-distribution

In [5]:
n=30
xbar=54
sig=6
conf_level=.88
# z_conf_int(xbar, sig, conf_level, n)

In [6]:
import numpy as np
import scipy.stats as st
cumProb = (1 + conf_level)/2
z = round(st.norm.ppf(cumProb),2)
print('z:', z)
moe = round(z * (sig/np.sqrt(n)),2)
CI = (xbar - moe, xbar + moe)
print('CI:', CI)

z: 1.55
CI: (52.3, 55.7)


In [None]:
cumProb = (1+0.95)/2

In [None]:
round(st.norm.ppf(0.025),2), round(st.norm.ppf(0.975),2)

In [7]:
# confidence interval for the mean using the z distribution (sigma known)
def z_conf_int(xbar, sig, conf_level, n):
    import scipy.stats as st
    import numpy as np
    area = (1 + conf_level) / 2        # cumulative probability at the upper bound
    z = round(st.norm.ppf(area), 2)    # critical z value
    se = sig / np.sqrt(n)              # standard error
    moe = z * se                       # margin of error
    lb = round(xbar - moe, 1)
    ub = round(xbar + moe, 1)
    print('z-score:', z)
    print(f'the confidence interval is ({lb},{ub}) ')

- A survey of 30 adults found that the mean age of a person’s primary vehicle is 5.6 years.
- Assuming the standard deviation of the population is 0.8 years, find the 99% confidence interval of the population mean.

In [14]:
n=30
xbar=5.6
sig=0.8
conf_level=.99
z_conf_int(xbar, sig, conf_level, n)

z-score: 2.58
the confidence interval is (5.2,6.0) 
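
As a cross-check, SciPy can compute the same interval directly with `st.norm.interval`. This is a sketch of an alternative route, not the method used above; it skips the intermediate rounding of z, so it should agree with the result above to one decimal place.

```python
import numpy as np
import scipy.stats as st

n, xbar, sig, conf_level = 30, 5.6, 0.8, 0.99

se = sig / np.sqrt(n)                              # standard error of the mean
lb, ub = st.norm.interval(conf_level, loc=xbar, scale=se)
print(round(lb, 2), round(ub, 2))
```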


### Confidence Interval using t-distribution

In [17]:

import numpy as np 
import pandas as pd 
import random

import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as st
import warnings
warnings.filterwarnings('ignore')

In [15]:
!wget https://www.dropbox.com/s/kr6f2lednm1pvc4/heart_failure_clinical_records_dataset.csv
# !wget https://www.dropbox.com/s/kr6f2lednm1pvc4/heart_failure_clinical_records_dataset%20%281%29.csv --no-check-certificate


In [18]:
# reading data
data = pd.read_csv('heart_failure_clinical_records_dataset.csv')
print(data.shape)
data.head()

(299, 13)


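| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |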


**Heart Failure Clinical Records data**
1. Find the 95% confidence interval for the mean blood platelet count.
2. Find the 95% confidence interval for the mean serum creatinine level.
3. Find the 95% confidence interval for the mean ejection fraction.


In [23]:
data.columns

Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
       'DEATH_EVENT'],
      dtype='object')

In [19]:
from scipy.stats import t
import numpy as np

n = 10
conf_level = 0.95
sample = data['platelets'].sample(n, random_state=1)
area = (1 + conf_level) / 2
dof = n - 1                    # degrees of freedom
t_crit = t.ppf(area, dof)      # critical t value (separate name, so `t` is not overwritten)
xbar = np.mean(sample)
# np.std uses ddof=0 (divides by n); sample.std(ddof=1) is the usual choice
# for a t interval and would give a slightly wider interval than shown below
s = np.std(sample)
se = s / np.sqrt(n)            # standard error
moe = t_crit * se              # margin of error
lb = round(xbar - moe)
ub = round(xbar + moe)
print('lower bound:', lb)
print('upper bound:', ub)
print('Confidence interval:', (lb, ub))
print('the confidence interval of average of blood platelets in a human with 95% confidence:', (lb, ub))

lower bound: 218359
upper bound: 378041
Confidence interval: (218359, 378041)
the confidence interval of average of blood platelets in a human with 95% confidence: (218359, 378041)
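
The manual steps can be cross-checked against `st.t.interval`. The sketch below uses a synthetic sample (made-up values standing in for platelet counts, since the real file is not bundled here) and the sample standard deviation with `ddof=1`, which is the conventional choice for a t interval.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
# Synthetic stand-in for a small sample of platelet counts (made-up values)
sample = rng.normal(loc=260_000, scale=90_000, size=10)

n = len(sample)
xbar = sample.mean()
s = sample.std(ddof=1)                    # sample standard deviation (n - 1 in the denominator)
t_crit = st.t.ppf(0.975, n - 1)           # two-sided 95%, n - 1 degrees of freedom
moe = t_crit * s / np.sqrt(n)             # margin of error
manual = (xbar - moe, xbar + moe)

# Same interval from SciPy; st.sem also uses ddof=1 by default
builtin = st.t.interval(0.95, n - 1, loc=xbar, scale=st.sem(sample))

print(np.round(manual), np.round(builtin))
```

Both routes should print the same bounds.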


In [25]:
# confidence interval for the mean using the t distribution
def t_conf_int(data, var, conf_level, n):
    from scipy.stats import t
    import numpy as np
    sample = data[var].sample(n, random_state=1)
    area = (1 + conf_level) / 2
    dof = n - 1                  # degrees of freedom
    t_crit = t.ppf(area, dof)    # critical t value
    xbar = np.mean(sample)
    s = np.std(sample)           # ddof=0 here; sample.std(ddof=1) is the usual choice for a t interval
    se = s / np.sqrt(n)          # standard error
    moe = t_crit * se            # margin of error
    lb = round((xbar - moe), 2)
    ub = round((xbar + moe), 2)
    return lb, ub
    

In [26]:
#find the 95% conf interval for  blood platelets
ci = t_conf_int(data,'platelets', .95, 20 )
print('The confidence interval:',ci)

The confidence interval: (243270.0, 351565.8)


In [22]:
print( 'The 95% confidence interval width is ',ci[1] - ci[0])

The 95% confidence interval width is  108296


In [27]:
ci = t_conf_int(data,'serum_creatinine', .95, 20 )
print('The confidence interval:',ci)


The confidence interval: (1.0, 1.32)


In [None]:
#find the 99% conf interval for  blood platelets
ci = t_conf_int(data,'platelets', .99, 20 )
print('The confidence interval:',ci)

In [None]:
#print( 'The 99% confidence interval width is ',348094-242442)
print( 'The 99% confidence interval width is ',ci[1] - ci[0])
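
For a fixed sample, the interval width is 2 × t × s/√n, so moving from 95% to 99% confidence scales the width by the ratio of the two critical t values. A minimal sketch for n = 20 (19 degrees of freedom); the variable names are illustrative.

```python
from scipy import stats as st

dof = 19                                   # n = 20 observations
t95 = st.t.ppf((1 + 0.95) / 2, dof)        # about 2.09
t99 = st.t.ppf((1 + 0.99) / 2, dof)        # about 2.86

print(round(t95, 3), round(t99, 3))
print("width ratio 99% / 95%:", round(t99 / t95, 2))
```

Higher confidence needs a larger critical value, so the 99% interval is wider than the 95% interval from the same sample.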