# Statistical Inference with Python

## What is Statistical Inference
Statistical inference is the process of drawing conclusions about a population based on a sample of data. It's a powerful tool that allows us to move from observations to generalizations. In the world of machine learning, statistical inference provides the bedrock for:
* Estimating model parameters and quantifying their uncertainty
* Evaluating the performance of machine learning models and making comparisons
* Making predictions on new, unseen data with confidence intervals
* Understanding the underlying relationships between variables.

## Key Concepts and Notation
Before we proceed, let's establish some essential terminology and notation:
* **Population:** The entire collection of individuals or objects about which we want to draw conclusions. We often use Greek letters to denote population parameters such as:
    * µ (mu) for the population mean
    * σ (sigma) for the population standard deviation
    * σ² (sigma squared) for the population variance 
* **Sample:** A subset of the population from which we collect data. Sample statistics are usually denoted with Roman letters:
    * x̄ (x-bar) for the sample mean.
    * s for the sample standard deviation.
    * s² for the sample variance
    * n for the sample size
* **Random Variable:** A variable whose value is numerical outcome of a random phenomenon. We often use uppercase letters (e.g., X, Y) to denote random variables.
* **Probability Distribution:** A function that describes the likelihood of different outcomes for a random variable. Common examples include the normal distribution, binomial distribution, and Poisson distribution. 
* **Sampling Distribution:** The probability of a statistic (e.g., the distribution of sample means from many different samples). This concept is central to understanding how well a sample statistic estimates a population parameter.

## Estimation with Python
The goal of estimation is to use sample data to approximate unknown population parameters. 

### Point Estimation
A point estimate is a single value that serves as our "best guess" for a population parameter. For example, the sample mean (x̅) is a common point estimator for the population mean(µ).

**Desirable Properties of Estimators:**
* **Unbiasedness:** An estimator is unbiased if its expected value is equal to the true parameter value (i.e., on average, it hits the target). Mathematically, for an estimator θ̂ (theta-hat) of a parameter θ (theta), E(θ̂)=θ.
* **Consistency:** An estimator is consistent if it converges to the true parameter value as the sample size increases (i.e., more data leads to better accuracy).
* **Efficiency:** An estimator is efficient if it has the smallest variance among all unbiased estimators (i.e., it's the most precise).

In [10]:
# ----- Point Estimation Example with Python -----

import numpy as np

# Generate 10 random heights (in cm) between 150 and 200
data = np.random.randint(150, 200, size=10) 

# Calculate the sample mean (point estimate for the population mean)
sample_mean = np.mean(data)
print(f"Sample Mean: {sample_mean}")

Sample Mean: 178.4


### Interval Estimation
An interval estimate provides a range of plausible values for a population parameter. A **confidence interval** is an interval constructed with a certain level of confidence (e.g., 95%) that it captures the true population parameter. For example, a 95% confidence interval for the population mean (µ) when the population standard deviation (σ) is known is given by:

x̄ ± z<sub>α/2</sub> * (σ / √n)

where: 

* x̄ (x-bar) is the sample mean
* z<sub>α/2</sub> (z-alpha-by-2) is the critical value from the standard normal distribution corresponding to the desired confidence level (for a 95% confidence interval, α=0.05 and z<sub>α/2</sub> ≈1.96)
* n is the sample size

**Interpretation:** If we were to repeatedly take samples and construct confidence intervals in this way, 95% of those intervals would contain the true population mean.

In [7]:
# ------ Interval Estimation Example with Python -----
import numpy as np
from scipy.stats import norm

# Generate 10 random heights (in cm) between 150 and 200
data = np.random.randint(150, 200, size=10)

# Sample mean and standard deviation
sample_mean = np.mean(data)
sample_std = np.std(data)

# Confidence level (e.g., 95%)
confidence_level = 0.95
alpha = 1-confidence_level

# Critical value from the standard normal distribution
z_critical = norm.ppf(1 - alpha/2)

# Margin of Error
margin_of_error = z_critical * (sample_std / np.sqrt(len(data)))

# Confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
print("Confidence Interval:", confidence_interval)

Confidence Interval: (np.float64(162.6177545879594), np.float64(181.1822454120406))


## Hypothesis Testing Overview
Hypothesis testing provides a formal framework for using data to evaluate claims about a population. We'll explore this in detail in a later lesson, but here's a brief overview:

* **The Core Idea:** We formulate a hypothesis about a population parameter and then use sample data to determine if there's enough evidence to reject that hypothesis.
* **Key Components:**
    * Null Hypothesis (H<sub>0</sub>): The default assumption (e.g., "no effect" or "no difference").
    * Alternative Hypothesis (H<sub>1</sub> or H<sub>α</sub>): The claim we want to test (e.g., "there is an effect").
    * Test statistic: A value computed from the sample data to assess the evidence against H₀.
    * p-value: The probability of observing the test statistic or something more extreme under H₀.
    * Significance level (α (alpha)): The threshold for rejecting H₀ (e.g., 0.05).

We'll delve deeper into the process of hypothesis testing, different types of tests, and how to interpret results in the dedicated lesson.

## Summary
* Statistical inference allows us to draw conclusions about populations using sample data.
* Key concepts include populations, samples, random variables, and probability distributions.
* Estimation involves point estimates (single values) and interval estimates (confidence intervals).
* Hypothesis testing provides a framework for evaluating claims about population parameters.