## Sampling Distribution of a Statistic
    -> The term sampling distribution of a statistic refers to the distribution of some sample statistic over many samples drawn from the same population.

### Sample statistic
    -> A metric calculated for a sample of data drawn from a larger population.
    -> Examples: sum, mean, median.
    -> The distribution of a sample statistic such as the mean is likely to be more regular and bell-shaped than the distribution of the data itself.
    -> The larger the sample, the narrower the distribution of the sample statistic.

### Central Limit Theorem
    -> The central limit theorem states that if you take sufficiently large random samples from a population (with replacement), the distribution of the sample means will be approximately normal (bell-shaped).
    -> This holds true even if the population is not normally distributed.
    -> In data science, the central limit theorem is less central than in traditional statistics, because formal hypothesis tests and confidence intervals play a small role in data science.
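
A small simulation can make the theorem concrete. The sketch below assumes a made-up, strongly right-skewed population (exponential with mean 1000) and compares the skewness of the raw values with the skewness of the means of samples of size 50. The population, the sample size, the seed, and the `skewness` helper are illustrative choices, not part of the notes above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed population: exponential values with mean 1000.
population = rng.exponential(scale=1000, size=100_000)

def skewness(values):
    """Sample skewness: about 0 for a symmetric, bell-shaped distribution."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

# Draw 2000 random samples of size 50 (with replacement) and keep each mean.
sample_means = np.array([
    rng.choice(population, size=50, replace=True).mean()
    for _ in range(2000)
])

print(f"skewness of population:   {skewness(population):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
print(f"std of population:        {population.std():.1f}")
print(f"std of sample means:      {sample_means.std():.1f}")
```

The sample means should come out far less skewed and far less spread out than the population itself, which is the behaviour the theorem describes.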
    
#### Detailed Reference
     https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_probability/BS704_Probability12.html
     https://en.wikipedia.org/wiki/Central_limit_theorem
     https://www.scribbr.com/statistics/central-limit-theorem/
     https://www.youtube.com/watch?v=zeJD6dqJ5lo
     https://statisticsbyjim.com/basics/central-limit-theorem/

### Standard Error
    -> The standard error is a single metric that sums up the variability in the sampling distribution for a statistic.
    -> It is estimated from the standard deviation s of the sample values and the sample size n.

$$
\text{Standard Error} = SE = \frac{s}{\sqrt{n}}
$$

    -> As the sample size increases, the standard error decreases.
    -> The relationship between standard error and sample size is referred to as the square root of n rule: to decrease the SE by a factor of 2, the sample size must be increased by a factor of 4.
    -> One approach to estimating the SE is as follows:
        1. Collect a number of brand-new samples from the population.
        2. For each new sample, calculate the statistic (e.g., the mean).
        3. Calculate the standard deviation of the statistics computed in step 2; use this as the estimate of the standard error.

    -> In modern statistics, the bootstrap has become the standard way to estimate standard error.
    -> It can be used for any statistic and does not rely on the central limit theorem or other distributional assumptions.
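
The formula and the repeated-sampling approach can be compared directly. The following is a minimal sketch, assuming a hypothetical normal population with mean 50 and standard deviation 10; the helper names `formula_se` and `simulated_se` are invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population used only for illustration.
population = rng.normal(loc=50, scale=10, size=200_000)

def formula_se(sample):
    """SE = s / sqrt(n), with s the sample standard deviation (ddof=1)."""
    return sample.std(ddof=1) / np.sqrt(len(sample))

def simulated_se(n, n_samples=2000):
    """Standard deviation of the means of many fresh samples of size n."""
    means = [rng.choice(population, size=n).mean() for _ in range(n_samples)]
    return np.std(means, ddof=1)

for n in (25, 100, 400):
    sample = rng.choice(population, size=n)
    print(f"n={n:4d}  formula SE={formula_se(sample):.3f}  "
          f"simulated SE={simulated_se(n):.3f}")
```

Each fourfold increase in n should roughly halve both estimates, in line with the square root of n rule.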

### NOTE 
    Do not confuse these:
    -> Standard Deviation - measures the variability of individual data points.
    -> Standard Error - measures the variability of a sample metric (e.g., the mean).
    
### Further Reading 
    https://onlinestatbook.com/stat_sim/sampling_dist/

### Bootstrapping
    -> Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples.
    -> This process allows you to calculate standard errors, construct confidence intervals, and perform hypothesis testing for numerous types of sample statistics.
    -> Conceptually, the bootstrap can be imagined as replicating the original sample thousands or millions of times so that we have a hypothetical population which represents all the knowledge from the original sample.
    -> We can then draw samples from this hypothetical population and create a sampling distribution.
    -> The algorithm for a bootstrap resampling of the mean, for a sample size n, is as follows:
        1. Draw a sample value, record it, then replace it.
        2. Repeat n times
        3. Record the mean of the n resampled values.
        4. Repeat steps 1-3 R times
        5. Use the R results to:
            a. Calculate their standard deviation (this estimates the standard error of the sample mean).
            b. Produce a histogram or boxplot.
            c. Find a confidence interval
        where R is the number of iterations of the bootstrap. The more iterations you do, the more accurate the estimate of the standard error, or the confidence interval.
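
The steps above can be sketched with NumPy alone. This is an illustration under stated assumptions: the original sample is a made-up set of 40 exponential values, R is set to 2000, and the 90% percentile interval is just one common way to carry out step 5c.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical original sample of n = 40 observations.
sample = rng.exponential(scale=5000, size=40)
n = len(sample)
R = 2000  # number of bootstrap iterations

boot_means = np.empty(R)
for i in range(R):
    # Steps 1-2: draw n values with replacement from the original sample.
    resampled = rng.choice(sample, size=n, replace=True)
    # Step 3: record the mean of the resampled values.
    boot_means[i] = resampled.mean()

# Step 5a: the standard deviation of the R means estimates the standard error.
boot_se = boot_means.std(ddof=1)
# Step 5c: a 90% percentile confidence interval from the bootstrap distribution.
ci_low, ci_high = np.percentile(boot_means, [5, 95])

print(f"sample mean:       {sample.mean():.1f}")
print(f"bootstrap SE:      {boot_se:.1f}")
print(f"formula s/sqrt(n): {sample.std(ddof=1) / np.sqrt(n):.1f}")
print(f"90% interval:      ({ci_low:.1f}, {ci_high:.1f})")
```

For the mean, the bootstrap standard error should land close to the s / sqrt(n) value; the advantage of the bootstrap is that the same loop works for statistics that have no such formula.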

### Detailed Reading
    https://statisticsbyjim.com/hypothesis-testing/bootstrapping/
    https://www.mastersindatascience.org/learning/machine-learning-algorithms/bootstrapping/

In [1]:
import pandas as pd
from sklearn.utils import resample

In [2]:
loan_df = pd.read_csv("../../datasets/loan_data_set.csv")
loan_df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [3]:
## In Python, scikit-learn provides the resample function to achieve this
data = loan_df["ApplicantIncome"]
results = []
for nrepeat in range(1000):
    # Vary the seed per iteration. With random_state=1 fixed inside the loop,
    # every resample is identical; that run printed bias 102.0 and a
    # standard error of 0.0.
    # n_samples=30 draws 30 rows; n_samples=len(data) would match the full sample.
    sample = resample(data, replace=True, n_samples=30, random_state=nrepeat)
    results.append(sample.median())

results = pd.Series(results)
print("Bootstrap Statistics:")
print(f"original: {data.median()}")  # 3812.5 for this dataset
print(f"bias: {results.mean() - data.median()}")
print(f"standard error: {results.std()}")


### Bagging (Bootstrap Aggregating)
    -> The bootstrap can be used with multivariate data, where the rows are sampled as units.
    -> A model can be run on the bootstrapped data, for example to estimate the stability (variability) of model parameters, or to improve predictive power.
    -> With classification and regression trees (decision trees), running multiple trees on bootstrap samples and then averaging their predictions (or with classification, taking a majority vote) generally performs better than using a single tree.
    -> This process is called bagging (or bootstrap aggregating)
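
The row-wise resampling described above can be sketched as follows. To stay dependency-light, this illustration swaps the decision tree for a simple straight-line fit (`np.polyfit`), so it shows the mechanics of bootstrapping rows and averaging predictions, not the accuracy gain that bagging brings to trees. The dataset, the number of models, and `x_new` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical two-column dataset: y depends linearly on x plus noise.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 5.0 + rng.normal(scale=4.0, size=200)
rows = np.column_stack([x, y])

n_models = 500
coefs = np.empty((n_models, 2))  # slope and intercept of each fitted model
for i in range(n_models):
    # Sample whole rows with replacement so each (x, y) pair stays together.
    idx = rng.integers(0, len(rows), size=len(rows))
    boot = rows[idx]
    coefs[i] = np.polyfit(boot[:, 0], boot[:, 1], deg=1)

# Variability of the fitted parameters across bootstrap samples.
slope_se, intercept_se = coefs.std(axis=0, ddof=1)
print(f"slope:     {coefs[:, 0].mean():.2f} +/- {slope_se:.2f}")
print(f"intercept: {coefs[:, 1].mean():.2f} +/- {intercept_se:.2f}")

# Bagged prediction: average the predictions of all fitted models.
x_new = 4.0
bagged_prediction = (coefs[:, 0] * x_new + coefs[:, 1]).mean()
print(f"bagged prediction at x={x_new}: {bagged_prediction:.2f}")
```

The spread of the slopes and intercepts is the stability estimate; the averaged prediction is the aggregating step.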

### Further Reading
    -> To calculate bootstrap confidence interval
    -> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html

### Confidence Intervals
    -> Confidence intervals are another way to understand the potential error in a sample estimate, alongside frequency tables, box plots, histograms, and standard errors.
    -> A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data.
    OR
    -> A confidence interval expresses how much uncertainty there is in a particular statistic.
    -> Confidence intervals are often used with a margin of error. It tells you how confident you can be that the results from a poll or survey reflect what you would expect to find if it were possible to survey the entire population.

### Further Reading
    -> https://www.youtube.com/watch?v=hlM7zdf7zwU - Confidence intervals and margin of error
    -> https://www.youtube.com/playlist?list=PLFP3i_O9BWPzCMklCHqjJHQm_T_w7UnRX - Playlist for sampling distribution, confidence intervals, and central limit theorem
    -> statisticshowto.com/probability-and-statistics/confidence-interval/
    -> http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
    -> https://www.simplilearn.com/tutorials/data-analytics-tutorial/confidence-intervals-in-statistics
    
### Books
    -> For a bootstrap approach to confidence intervals: Introductory Statistics and Analytics: A Resampling Perspective by Peter Bruce,
    or Statistics: Unlocking the Power of Data, 2nd ed., by Robin Lock et al.
    -> For prediction intervals: Modern Engineering Statistics by Thomas Ryan (mainly for engineering measurements).

### Important Notes on confidence intervals
    -> The percentage associated with the confidence interval is termed - level of confidence
    -> The higher the level of confidence, the wider the interval.
    -> The smaller the sample, the wider the interval (i.e., the greater the uncertainty).
    -> The above two statements can be summed up as: the more confident you want to be and the less data you have, the wider you must make the confidence interval to be sufficiently assured of capturing the true value.
    -> For a data scientist, a confidence interval is a tool that can be used to get an idea of how variable a sample result might be.
    -> This is mainly to communicate the potential error in an estimate and perhaps to learn whether a larger sample is needed.
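
The two effects on interval width can be checked with a percentile bootstrap. This is a sketch under assumptions: the measurements are simulated (normal, mean 100, standard deviation 15), the small sample is simply the first 25 values of the large one, and `bootstrap_ci` is a helper written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def bootstrap_ci(sample, level, n_boot=2000):
    """Percentile bootstrap confidence interval for the mean."""
    means = np.array([
        rng.choice(sample, size=len(sample), replace=True).mean()
        for _ in range(n_boot)
    ])
    tail = (1 - level) / 2 * 100
    return np.percentile(means, [tail, 100 - tail])

# Hypothetical measurements; the small sample is a subset of the large one.
large_sample = rng.normal(loc=100, scale=15, size=400)
small_sample = large_sample[:25]

for name, sample in (("n=400", large_sample), ("n=25", small_sample)):
    for level in (0.90, 0.99):
        low, high = bootstrap_ci(sample, level)
        print(f"{name}  level={level:.0%}  "
              f"interval=({low:.1f}, {high:.1f})  width={high - low:.1f}")
```

The 99% intervals should be wider than the 90% ones, and the n=25 intervals wider than the n=400 ones.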

### Normal Distribution
    -> The normal distribution, also known as Gaussian distribution, is the most important probability distribution in statistics for independent, random variables.
    -> It is represented by a bell curve
    -> The normal distribution is a continuous probability distribution which is symmetric around its mean, which means most of the observations cluster around the central peak and taper off equally in both directions.
    -> While a normal distribution is symmetrical, not all symmetrical distributions are normal.
    -> It is important because it accurately describes the distribution of values for many natural phenomena.
    -> Characteristics that are the sum of many independent processes tend to follow a normal distribution. For example, heights, blood pressure, measurement error, and IQ scores are approximately normally distributed.

#### Parameters of the Normal Distribution
    -> As with any probability distribution, the parameters for Normal Distribution define its shape and probabilities entirely.
    -> It has two parameters mean and standard deviation. The shape of the distribution changes based on parameters.
    
    Mean - The mean is the central tendency of the normal distribution. It defines the location of the peak for the bell curve.
            Most of the values cluster around the mean. Changing the mean can shift the entire curve left or right on X axis.

    Standard Deviation - The standard deviation defines the variability of values around the mean. It defines the width of the normal distribution and determines how far away from the mean the values tend to fall. It represents the typical distance between the observations and the mean.
    On a graph changing the standard deviation can increase or decrease the width of the curve along X axis.
    For narrow distributions, the chances are that values won't fall far from the mean.
    As the spread of the bell curve increases, the likelihood of the values far from mean will also increase.
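
The effect of each parameter can be seen by drawing values from a few normal distributions. The (mean, standard deviation) pairs below are arbitrary illustrative settings, and the "2 units" threshold is a hypothetical cut-off chosen to show how the spread changes.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical (mean, standard deviation) pairs for illustration.
settings = [(0, 1), (5, 1), (0, 3)]

for mean, sd in settings:
    values = rng.normal(loc=mean, scale=sd, size=100_000)
    # Share of values more than 2 units away from the mean.
    far = np.mean(np.abs(values - mean) > 2)
    print(f"mean={mean}, sd={sd}: sample mean={values.mean():.2f}, "
          f"sample sd={values.std():.2f}, share beyond +/-2 units={far:.3f}")
```

Changing the mean only moves the centre, while the larger standard deviation puts a much larger share of values far from the mean.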
     

### Common properties of Normal Distribution
    Despite the different shapes, all forms of the normal distribution have the following characteristic properties.

    -> They’re all unimodal, symmetric bell curves. The Gaussian distribution cannot model skewed distributions.
    -> The mean, median, and mode are all equal.
    -> Half of the population is less than the mean and half is greater than the mean.
    -> The Empirical Rule allows you to determine the proportion of values that fall within certain distances from the mean.

### Note
    -> The normal distribution is just one of many probability distributions in statistics, and it does not fit all populations.
    -> Refer to https://statisticsbyjim.com/hypothesis-testing/identify-distribution-data/ to determine whether the normal distribution provides the best fit for your data.
    -> Also this https://statisticsbyjim.com/basics/assessing-normality-histograms-probability-plots/
    -> If you have continuous data that are skewed, you’ll need to use a different distribution, such as the Weibull, lognormal, exponential, or gamma distribution.
    -> Skewed Distributions - https://statisticsbyjim.com/basics/skewed-distribution/


### Empirical Rule for Normal Distribution
    -> When you have normally distributed data, standard deviation can be used to determine the proportion of values that fall within a specified number of standard deviations.
    -> The Empirical rule describes the percentage of data that fall within specific numbers of standard deviations from the mean for bell shaped curves.
    -> Range around the mean            Approximate share of values
         Mean +/- 1 Std. Deviation    -    68 %
         Mean +/- 2 Std. Deviations   -    95 %
         Mean +/- 3 Std. Deviations   -    99.7 %

    -> If the data does not follow a normal (Gaussian) distribution, use Chebyshev's Theorem to find the proportions for a given number of standard deviations.
    -> Link - https://statisticsbyjim.com/basics/chebyshevs-theorem-in-statistics/
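
The empirical-rule percentages can be reproduced from the normal distribution itself, and compared with the weaker guarantee Chebyshev's Theorem gives for any distribution with a finite mean and variance. This is a standard-library-only sketch; the function names are invented for the illustration.

```python
import math

def share_within(k):
    """Exact share of a normal distribution within k standard deviations."""
    return math.erf(k / math.sqrt(2))

def chebyshev_lower_bound(k):
    """Chebyshev's theorem: at least 1 - 1/k^2 of any distribution (k > 1)."""
    return 1 - 1 / k ** 2

for k in (1, 2, 3):
    line = f"within {k} sd: normal = {share_within(k):.1%}"
    if k > 1:
        line += f", Chebyshev guarantees at least {chebyshev_lower_bound(k):.1%}"
    print(line)
```

The normal shares come out at 68.3%, 95.4%, and 99.7%, which the empirical rule rounds to 68, 95, and 99.7. Chebyshev's bounds (75.0% and 88.9% for 2 and 3 standard deviations) are lower because they must hold for every distribution shape.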

### Further Reading
    -> https://statisticsbyjim.com/probability/empirical-rule/

### Standard Normal Distribution
    -> The normal distribution can have many shapes depending on parameter values. 
    -> The special case where the mean is 0 and the std. deviation is 1 is called the Standard Normal Distribution.
    -> This distribution is also known as the Z-distribution.
    -> A value on the standard normal distribution is known as a standard score or a Z-score.
    -> A standard score represents the number of standard deviations above or below the mean that a specific observation falls.

### Standardization: How to Calculate Z-scores
    -> Standard scores are a great way to understand where a specific observation falls relative to the entire normal distribution
    -> They also allow you to take observations drawn from normally distributed populations that have different means and standard deviations and place them on a standard scale.
    -> This standard scale enables you to compare observations that would otherwise be difficult.
    -> This process is called standardization, and it allows you to compare observations and calculate probabilities across different populations.
    -> To standardize your data, you need to convert the raw measurements into Z-scores.
    -> After you standardize your data, you can place them within the standard normal distribution. 
    -> In this manner, standardization allows you to compare different types of observations based on where each observation falls within its own distribution (e.g., comparing apples to oranges).
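
A Z-score is the observation minus the mean, divided by the standard deviation. The sketch below applies that to the apples-and-oranges comparison; the population means, standard deviations, and the two observed weights are hypothetical numbers chosen for illustration.

```python
# Hypothetical population parameters (grams), assumed normally distributed.
apple_mean, apple_sd = 100.0, 15.0
orange_mean, orange_sd = 140.0, 25.0

def z_score(value, mean, sd):
    """Number of standard deviations that value lies above or below the mean."""
    return (value - mean) / sd

apple_z = z_score(110.0, apple_mean, apple_sd)
orange_z = z_score(100.0, orange_mean, orange_sd)

print(f"apple z-score:  {apple_z:.2f}")   # 0.67
print(f"orange z-score: {orange_z:.2f}")  # -1.60
```

On the standard scale, the 110 g apple sits about two thirds of a standard deviation above a typical apple, while the 100 g orange is 1.6 standard deviations below a typical orange, so the orange is the more unusual of the two.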

### Further Reading
    -> Z scores: https://statisticsbyjim.com/basics/z-score/