# Inferential Statistics (Part I)

![Image](./images/central_limit_theorem.jpeg)

---

## Data and Sampling Distributions

#### Population Distribution 

The _population_ is assumed to follow an underlying but usually _unknown_ distribution (unless we are dealing with a physical process that can be modeled).

#### Sample Distribution

The _sample_ is the observed data. From it we obtain an _empirical_ distribution, which we use to form hypotheses about the population distribution.





---

### Random Sampling and Sample Bias

In random sampling, every data point has the same probability of being selected in each draw. Sampling can be performed with or without replacement.

- __Sample:__ A subset from a larger dataset.

- __Population:__ The larger dataset or idea of a dataset.

- __N:__ The size of the population (_n:_ the size of the sample).

--

- __Random sampling:__ Drawing elements into a sample at random. 

- __Stratified sampling:__ Dividing the population into strata and randomly sampling from each stratum.

- __Stratum:__ A homogeneous subgroup of a population with common characteristics.

- __Simple random sampling:__ The sample that results from random sampling without stratifying the population.

--

- __Bias:__ Systematic error (not random).

- __Sample bias:__ A sample that misrepresents the population.
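
The sampling terms above can be illustrated with a minimal sketch. The toy population, its `group` and `value` columns, and the 10% sampling fraction are assumptions made up for this example; they are not part of the course datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy population: 1,000 rows in two unequal strata
rng = np.random.default_rng(seed=1)
population = pd.DataFrame({
    'group': np.repeat(['A', 'B'], [900, 100]),
    'value': rng.normal(loc=50, scale=10, size=1000),
})

# Simple random sampling without replacement (each row at most once)
srs_without = population.sample(n=100, replace=False, random_state=1)

# Simple random sampling with replacement (rows may repeat)
srs_with = population.sample(n=100, replace=True, random_state=1)

# Stratified sampling: draw 10% at random from each stratum
stratified = population.groupby('group').sample(frac=0.1, random_state=1)

print(srs_without['group'].value_counts())
print(stratified['group'].value_counts())
```

With simple random sampling the share of the small stratum `B` varies from draw to draw, while the stratified sample keeps the 90/10 split by construction.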



In [None]:
# imports

import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import sem

import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Population vs. sample

np.random.seed(seed=1)
x = np.linspace(-3, 3, 300)
xsample = stats.norm.rvs(size=1000)

fig, axes = plt.subplots(ncols=2, figsize=(15, 5))

ax = axes[0]
ax.fill(x, stats.norm.pdf(x))
ax.set_axis_off()
ax.set_xlim(-3, 3)

ax = axes[1]
ax.hist(xsample, bins=30)
ax.set_axis_off()
ax.set_xlim(-3, 3)
plt.show()

---

### Sampling Distribution of a Statistic


- __Data distribution:__ The frequency distribution of individual _values_ in a dataset.

- __Sample statistic:__ A metric calculated for a sample of data drawn from a larger population.

- __Sampling distribution:__ The frequency distribution of a sample statistic over many samples or resamples.

- __Central limit theorem:__ The tendency of the sampling distribution to take on a normal shape as sample size rises.

- __Standard error:__ The variability (standard deviation) of a sample _statistic_ over many samples.



In [None]:
loans_income = pd.read_csv('./datasets/loans_income.csv').squeeze('columns')
loans_income

In [None]:
sample_data = pd.DataFrame({'income': loans_income.sample(1000),
                            'type': 'Data'})
sample_data

In [None]:
sample_mean_05 = pd.DataFrame({'income': [loans_income.sample(5).mean() for _ in range(1000)],
                               'type': 'Mean of 5'})
sample_mean_05

In [None]:
sample_mean_20 = pd.DataFrame({'income': [loans_income.sample(20).mean() for _ in range(1000)],
                               'type': 'Mean of 20'})
sample_mean_20

In [None]:
results = pd.concat([sample_data,
                     sample_mean_05,
                     sample_mean_20])
results

In [None]:
# Sample distribution vs. Sample statistic (mean) distribution

g = sns.FacetGrid(results, col='type', col_wrap=1, 
                  height=4, aspect=2)
g.map(plt.hist, 'income', range=[0, 200000], bins=40)
g.set_axis_labels('Income', 'Count')
g.set_titles('{col_name}')

plt.tight_layout()

#### Important observations:

- The distribution of a sample statistic such as the mean is likely to be more regular and bell-shaped than the distribution of the data itself (Central Limit Theorem).

- The larger the sample the statistic is based on, the more this is true.

- The larger the sample, the narrower the distribution of the sample statistic.

Check out [this simulator](https://onlinestatbook.com/stat_sim/sampling_dist/) to explore sampling distributions interactively.
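
The narrowing described in the observations above can be checked numerically. The sketch below uses a synthetic exponential population as a stand-in for skewed data (an assumption for illustration, not the loans dataset) and compares the spread of the sample means with $\sigma / \sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical right-skewed population (exponential), not the loans data
population = rng.exponential(scale=50_000, size=100_000)

empirical, theoretical = {}, {}
for n in (5, 20, 80):
    # Sampling distribution of the mean: 2,000 samples of size n
    means = np.array([rng.choice(population, size=n).mean()
                      for _ in range(2000)])
    empirical[n] = means.std(ddof=1)
    theoretical[n] = population.std() / np.sqrt(n)
    print(f'n={n:>2}  empirical SE: {empirical[n]:,.0f}  '
          f'sigma/sqrt(n): {theoretical[n]:,.0f}')
```

Quadrupling the sample size should roughly halve the spread of the sample means.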

In [None]:
# Standard Error

sample_size = 20

sample = loans_income.sample(sample_size)

sample_standard_error = sample.std() / np.sqrt(sample_size)   # Data approach

sampling_dist_se = sample_mean_20['income'].std()   # Sampling distribution of a statistic approach

print('Dataset mean:', loans_income.mean(),
      '\nDataset median:',loans_income.median(),
      '\nSample size:', sample_size,
      '\nSample standard error:', sample_standard_error,
      '\nStandard error of sampling distribution:', sampling_dist_se)

In [None]:
# Using scipy 

print('Sample standard error:', sem(sample))

The standard error estimates how accurately the mean of a given sample represents the true mean of the population.
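
For reference, the "data approach" used in the cell above corresponds to the usual estimate of the standard error of the mean, where $s$ is the sample standard deviation (with the $n-1$ denominator that `Series.std()` and `scipy.stats.sem` use by default) and $n$ is the sample size:

$$
\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}}, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
$$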

---

### Normal Distribution (a.k.a. Z-Distribution)

In a normal distribution, approximately 68% of the data lies within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
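
These percentages can be recovered from the cumulative distribution function of the standard normal. A short check with `scipy.stats.norm.cdf` (the `coverage` dictionary is just a name chosen for this example):

```python
from scipy import stats

coverage = {}
for k in (1, 2, 3):
    # Probability mass between -k and +k standard deviations
    coverage[k] = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f'Within {k} SD: {coverage[k]:.4f}')

# Within 1 SD: 0.6827
# Within 2 SD: 0.9545
# Within 3 SD: 0.9973
```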

- __Error:__ The difference between a data point and a predicted or average value.

- __Standardize:__ Subtract the mean and divide by the standard deviation.

- __Z-score:__ The result of standardizing an individual data point.

- __Standard normal:__ A normal distribution with mean = 0 and standard deviation = 1.

- __QQ-Plot:__ A plot to visualize how close a sample distribution is to a specified distribution (e.g., the normal distribution).
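
As a minimal sketch of the last few definitions, the snippet below standardizes a sample into z-scores and draws a QQ-plot against the normal distribution with `scipy.stats.probplot`. The sample is synthetic (mean 100, standard deviation 15 are arbitrary values assumed for this example).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(seed=1)
data = rng.normal(loc=100, scale=15, size=500)  # hypothetical sample

# Standardize: subtract the mean, divide by the standard deviation
z_scores = (data - data.mean()) / data.std(ddof=1)

# QQ-plot: ordered z-scores against theoretical normal quantiles
fig, ax = plt.subplots(figsize=(4, 4))
(osm, osr), (slope, intercept, r) = stats.probplot(z_scores, dist='norm', plot=ax)
ax.set_title('QQ-plot against the normal distribution')
plt.show()
```

If the points fall close to the diagonal, the sample is approximately normal; heavy tails show up as points bending away from the line at both ends.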

In [None]:
# Normally distributed data
norm_sample = stats.norm.rvs(size=10000)

# Non-normally distributed data
sp500_px = pd.read_csv('./datasets/sp500_data.csv')
nflx = sp500_px['NFLX']
not_norm_sample = np.diff(np.log(nflx[nflx>0]))
not_norm_sample

In [None]:
norm_sample = pd.Series(norm_sample)
norm_sample_sd = norm_sample.std()
print(f'Standard deviation of the sample: {norm_sample_sd}')
norm_sample.plot.hist();

In [None]:
not_norm_sample = pd.Series(not_norm_sample)
not_norm_sample_sd = not_norm_sample.std()
print(f'Standard deviation of the sample: {not_norm_sample_sd}')
not_norm_sample.plot.hist();

Other common distributions include:

- Long-Tailed Distributions

- Student's t-Distribution

- Binomial Distribution

- Chi-Square Distribution

- F-Distribution

- Poisson Distribution

- Exponential Distribution

- Weibull Distribution

---