## The Bootstrap


In [8]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.utils import resample


loans_income = pd.read_csv("loans_income.csv").squeeze('columns')



In [10]:
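results = []
for _ in range(1000):
    # Draw a bootstrap sample: same size as the data, with replacement
    sample = resample(loans_income)
    # Record the median of this bootstrap sample
    results.append(sample.median())
results = pd.Series(results)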

print('Bootstrap Statistics:')
print(f'original: {loans_income.median()}')
print(f'bias: {results.mean() - loans_income.median()}')
print(f'std. error: {results.std()}')

Bootstrap Statistics:
original: 62000.0
bias: -73.97350000000006
std. error: 219.5400541658228


## Explanation
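resample (from sklearn.utils) resamples arrays or sparse matrices in a consistent way. By default it draws with replacement and returns a sample the same size as the input, which is exactly one bootstrap sample.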
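To make the default behaviour of resample concrete, here is a minimal sketch on toy data. The values in toy are hypothetical and are not taken from loans_income; random_state is set only so the draw is repeatable.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data, not the loans_income dataset.
toy = pd.Series([10, 20, 30, 40, 50], name="x")

# Defaults: replace=True and n_samples=len(toy), i.e. one bootstrap sample.
boot = resample(toy, random_state=0)

# The sample keeps the original length and contains only original values;
# because it is drawn with replacement, some index labels may repeat.
print(len(boot) == len(toy))
print(set(boot).issubset(set(toy)))
print(boot.index.duplicated().any())

# Drawing without replacement is also possible, for a smaller subsample.
subsample = resample(toy, replace=False, n_samples=3, random_state=0)
print(subsample.index.is_unique)
```

The index of the returned Series keeps the original row labels, which is why the index in a resampled loans_income looks shuffled and can contain repeats.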

Bootstrapping is a statistical technique for estimating the sampling distribution of a statistic (such as the mean or median), and from it quantities like bias and standard error, by repeatedly resampling the dataset with replacement.

The loop runs 1000 times. In each iteration:

1) Resample: resample(loans_income) draws a new sample from the loans_income dataset. The sample is drawn with replacement, so the same data point can appear multiple times, and it has the same size as the original data.

2) Compute Median: The median of this bootstrap sample is calculated using sample.median().

3) Store Result: The computed median is appended to the results list.

At the end of the loop, the results list contains 1000 bootstrap sample medians.
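The three steps above can be wrapped in a small helper. This is only a sketch: the function name bootstrap_medians, its parameters, and the synthetic incomes data are assumptions made for illustration and are not part of the original notebook. Unlike the notebook cell, it passes a seeded RandomState so repeated runs give the same medians.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample


def bootstrap_medians(data, n_boot=1000, seed=0):
    """Return a Series of medians from n_boot bootstrap samples of data.

    Hypothetical helper: name and parameters are illustrative only.
    """
    # One shared RandomState, so each call to resample draws a new sample
    # while the whole sequence stays reproducible for a given seed.
    rng = np.random.RandomState(seed)
    medians = []
    for _ in range(n_boot):
        sample = resample(data, random_state=rng)  # step 1: resample
        medians.append(sample.median())            # steps 2 and 3
    return pd.Series(medians)


# Synthetic stand-in for loans_income (assumed, not the real data).
incomes = pd.Series(np.random.RandomState(42).lognormal(11, 0.5, size=500))

medians = bootstrap_medians(incomes, n_boot=200, seed=1)
print(len(medians))
print(medians.mean() - incomes.median())  # bootstrap estimate of bias
print(medians.std())                      # bootstrap standard error
```

Seeding is a design choice for reproducibility; without it, the bias and standard error change slightly on every run, as they would in the notebook cell.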


## Summary Statistics
The results list is converted into a Pandas Series, which provides convenient statistical methods like mean() and std().

Original Median:

loans_income.median() calculates the median of the original dataset (not the bootstrap samples). This serves as the reference statistic.

Bias:

Bias is the difference between the mean of the bootstrap medians (results.mean()) and the original median (loans_income.median()). It estimates how far, on average, the sample median falls from the value it is estimating. Here it is about -74, which is small relative to the median of 62000.

Standard Error:

results.std() calculates the standard deviation of the bootstrap medians. This estimates the standard error of the sample median, that is, how much the median would vary from one sample to another. Here it is about 220.
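The two formulas can be checked by hand on a tiny example. The five medians below are hypothetical values chosen to keep the arithmetic easy; they are not output from the notebook. The sketch also shows that pandas std() uses the n - 1 denominator by default.

```python
import numpy as np
import pandas as pd

# Hypothetical bootstrap medians and original median, for illustration only.
boot_medians = pd.Series([61000.0, 62000.0, 62000.0, 62500.0, 63000.0])
original_median = 62000.0

# Bias: mean of the bootstrap medians minus the original median.
# Mean is 310500 / 5 = 62100, so the bias is 100.
bias = boot_medians.mean() - original_median

# Standard error: sample standard deviation of the bootstrap medians,
# written out with the (n - 1) denominator that pandas std() uses by default.
deviations = boot_medians - boot_medians.mean()
manual_se = np.sqrt((deviations ** 2).sum() / (len(boot_medians) - 1))

print(bias)
print(np.isclose(manual_se, boot_medians.std()))
```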


In [18]:
test = resample(loans_income)
test

16880    57000
3950     62000
33363    50000
33388    45000
25801    60300
         ...  
1228     94000
11301    60000
18272    55000
34309    35000
23485    62000
Name: x, Length: 50000, dtype: int64

In [20]:
results

0      62000.0
1      62000.0
2      62000.0
3      62000.0
4      62000.0
        ...   
995    62000.0
996    62000.0
997    62000.0
998    62000.0
999    62000.0
Length: 1000, dtype: float64
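The truncated display above shows only the first and last five entries, all equal to 62000.0, but the nonzero standard error means that other values occur among the 1000 medians. One way to see them is to count the distinct values. The sketch below uses a short hypothetical Series, results_demo, in place of results.

```python
import pandas as pd

# Hypothetical stand-in for the bootstrap medians in results.
results_demo = pd.Series([62000.0] * 7 + [61500.0] * 2 + [62500.0])

# Count how often each distinct median occurs, ordered by value.
counts = results_demo.value_counts().sort_index()
print(counts)

# describe() gives the mean, std, and quantiles in one call.
print(results_demo.describe())
```

Applied to the real data, results.value_counts().sort_index() would list every median that appeared across the bootstrap samples.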