## Sampling Distribution of a Statistic
    -> The term sampling distribution of a statistic refers to the distribution of some sample statistic over many samples drawn from the same population.

### Sample statistic
    -> A metric calculated for a sample of data drawn from a larger population.
    -> Ex: sum, mean, median, ...
    -> The distribution of a sample statistic such as the mean is likely to be more regular and bell-shaped than the distribution of the data itself.
    -> The larger the sample, the narrower the distribution of the sample statistic.
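
The last point can be checked with a small simulation. This is a minimal sketch, assuming a hypothetical right-skewed population (exponential with mean 1) and an arbitrary seed; it uses only the standard library so it runs without the loan dataset. The helper name `sample_means` is made up for illustration.

```python
import random
import statistics

rng = random.Random(0)  # arbitrary fixed seed so the run is repeatable

# Hypothetical right-skewed population (exponential with mean 1).
population = [rng.expovariate(1.0) for _ in range(10_000)]

def sample_means(sample_size, n_samples=500):
    """Means of n_samples random samples (drawn with replacement) of the given size."""
    return [
        statistics.mean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]

spread_individual = statistics.stdev(population)
spread_means_5 = statistics.stdev(sample_means(5))
spread_means_50 = statistics.stdev(sample_means(50))

print(f"std of individual values:    {spread_individual:.3f}")
print(f"std of means, samples of 5:  {spread_means_5:.3f}")
print(f"std of means, samples of 50: {spread_means_50:.3f}")
```

The spread should shrink from the individual values to the means of 5, and again to the means of 50.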

### Central Limit Theorem
    -> The central limit theorem states that if you take sufficiently large random samples from a population (with replacement), the distribution of the sample means will be approximately normal (bell-shaped).
    -> This holds true even if the population itself is not normally distributed.
    -> In data science practice, the central limit theorem is less central than in traditional statistics, because formal hypothesis tests and confidence intervals play a smaller role.
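
One rough way to see the theorem at work is to compare how much of the data falls within one standard deviation of the mean. The sketch below assumes a hypothetical exponential population, samples of size 50, and an arbitrary seed; `fraction_within_one_sd` is a helper invented for this illustration.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary seed so the run is repeatable

def fraction_within_one_sd(values):
    """Share of values within one standard deviation of their mean."""
    center = statistics.mean(values)
    spread = statistics.stdev(values)
    return sum(abs(v - center) <= spread for v in values) / len(values)

# Hypothetical skewed population: exponential with mean 1.
raw_values = [rng.expovariate(1.0) for _ in range(5_000)]

# Means of 2,000 random samples of size 50 from the same population.
sample_means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(50)])
    for _ in range(2_000)
]

frac_raw = fraction_within_one_sd(raw_values)
frac_means = fraction_within_one_sd(sample_means)

print(f"raw values within 1 sd:   {frac_raw:.2f}")
print(f"sample means within 1 sd: {frac_means:.2f}")
```

A normal distribution puts about 68% of its mass within one standard deviation of the mean, so the second figure should land near 0.68, while the first stays noticeably higher (about 0.86 in theory for an exponential distribution).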
    
#### Detailed Reference
     https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_probability/BS704_Probability12.html
     https://en.wikipedia.org/wiki/Central_limit_theorem
     https://www.scribbr.com/statistics/central-limit-theorem/
     https://www.youtube.com/watch?v=zeJD6dqJ5lo
     https://statisticsbyjim.com/basics/central-limit-theorem/

### Standard Error
    -> The standard error is a single metric that sums up the variability in the sampling distribution for a statistic.
    -> It is estimated from the standard deviation s of the sample values and the sample size n.

$$
\text{Standard Error} = SE = \frac{s}{\sqrt{n}}
$$

    -> As the sample size increases, the standard error decreases.
    -> The relationship between standard error and sample size is referred to as the square root of n rule: to decrease SE by a factor of 2, the sample size must be increased by a factor of 4.
    -> The following approach can be used to measure SE directly:
        1. Collect a number of brand-new samples from the population.
        2. For each new sample, calculate the statistic (e.g., mean).
        3. Calculate the standard deviation of the statistics computed in step 2; use this as the estimate of standard error.
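
The formula and the three-step approach can be compared side by side. This is a minimal sketch, assuming a hypothetical normal population (mean 100, standard deviation 15), a sample size of 40, and an arbitrary seed; `draw_sample` is a made-up helper standing in for "collect a brand-new sample".

```python
import math
import random
import statistics

rng = random.Random(2)  # arbitrary seed so the run is repeatable

def draw_sample(n):
    """Hypothetical population: normal with mean 100 and standard deviation 15."""
    return [rng.gauss(100, 15) for _ in range(n)]

n = 40
sample = draw_sample(n)

# Formula: SE = s / sqrt(n), computed from a single sample.
se_formula = statistics.stdev(sample) / math.sqrt(n)

# Steps 1-3: draw many new samples, take each mean, then the std of those means.
means = [statistics.mean(draw_sample(n)) for _ in range(2_000)]
se_empirical = statistics.stdev(means)

# Square root of n rule: four times the data should roughly halve the SE.
means_4n = [statistics.mean(draw_sample(4 * n)) for _ in range(2_000)]
se_empirical_4n = statistics.stdev(means_4n)

print(f"SE from formula (one sample):  {se_formula:.2f}")
print(f"SE from repeated sampling:     {se_empirical:.2f}")
print(f"SE with 4x the sample size:    {se_empirical_4n:.2f}")
```

The formula value depends on the one sample that happened to be drawn, so it will only roughly agree with the repeated-sampling estimate.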

    -> Collecting brand-new samples is usually not feasible, so in modern statistics the bootstrap has become the standard way to estimate standard error.
    -> It can be used for any statistic and does not rely on the central limit theorem or other distributional assumptions.

### NOTE 
    Do not confuse these:
    -> Standard Deviation - measures the variability of individual data points.
    -> Standard Error - measures the variability of a sample metric (e.g., the mean).
    
### Further Reading 
    https://onlinestatbook.com/stat_sim/sampling_dist/

### Bootstrapping
    -> Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples.
    -> This process allows you to calculate standard errors, construct confidence intervals, and perform hypothesis testing for numerous types of sample statistics.
    -> Conceptually, the bootstrap can be imagined as replicating the original sample thousands or millions of times so that we have a hypothetical population which represents all the knowledge from the original sample.
    -> We can then draw samples from this hypothetical population and create a sampling distribution.
    -> The algorithm for a bootstrap resampling of the mean, for a sample size n, is as follows:
        1. Draw a sample value, record it, then replace it.
        2. Repeat n times.
        3. Record the mean of the n resampled values.
        4. Repeat steps 1-3 R times.
        5. Use the R results to:
            a. Calculate their standard deviation (this estimates the standard error of the sample mean).
            b. Produce a histogram or boxplot.
            c. Find a confidence interval.
        where R is the number of iterations of the bootstrap. The more iterations you do, the more accurate the estimate of the standard error, or the confidence interval.
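
The steps above can be sketched in plain Python. This is an illustration under stated assumptions: the ten sample values are made up, the seed is arbitrary, `bootstrap_means` is a hypothetical helper, and the 90% percentile interval is just one simple way to carry out step 5c.

```python
import random
import statistics

def bootstrap_means(sample, n_iterations, rng):
    """Steps 1-4: resample n values with replacement, record the mean, repeat R times."""
    n = len(sample)
    return [
        statistics.mean(rng.choices(sample, k=n))
        for _ in range(n_iterations)
    ]

rng = random.Random(3)  # arbitrary seed so the run is repeatable

# Made-up sample values, for illustration only.
sample = [12, 15, 9, 20, 14, 11, 18, 25, 13, 16]

boot = sorted(bootstrap_means(sample, n_iterations=2_000, rng=rng))

# Step 5a: the standard deviation of the R means estimates the standard error.
standard_error = statistics.stdev(boot)

# Step 5c: a simple 90% percentile interval, trimming 5% from each end.
lower = boot[int(0.05 * len(boot))]
upper = boot[int(0.95 * len(boot)) - 1]

print(f"sample mean:             {statistics.mean(sample):.2f}")
print(f"bootstrap SE of mean:    {standard_error:.2f}")
print(f"90% percentile interval: ({lower:.2f}, {upper:.2f})")
```

Passing a different statistic in place of `statistics.mean` (for example `statistics.median`) gives the bootstrap for that statistic, with no change to the resampling loop.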

### Detailed Reading
    https://statisticsbyjim.com/hypothesis-testing/bootstrapping/
    https://www.mastersindatascience.org/learning/machine-learning-algorithms/bootstrapping/

In [1]:
import pandas as pd
from sklearn.utils import resample

In [2]:
loan_df = pd.read_csv("../../datasets/loan_data_set.csv")
loan_df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [3]:
## scikit-learn's resample function draws a bootstrap sample (with replacement)
import numpy as np

data = loan_df["ApplicantIncome"]

# One generator shared across all iterations. Passing the integer 42 on every
# call would return the identical resample each time, which collapses the
# standard error to 0.
rng = np.random.RandomState(42)

results = []
for nrepeat in range(1000):
    sample = resample(data, replace=True, random_state=rng)
    results.append(sample.median())

results = pd.Series(results)
print("Bootstrap Statistics:")
print(f"original: {data.median()}")  # 3812.5 for this dataset
print(f"bias: {results.mean() - data.median()}")
print(f"standard error: {results.std()}")
