## <span style="color:red"> **LECTURE 2** - Probability & Statistics I </span>

- PDF, CDF, quantile
- Real and empirical distributions
- Errors: heteroscedastic and homoscedastic
- Kolmogorov axioms and probability
- Bayes' theorem
- Transformations of random variables

### **Probability Density Function (PDF), Cumulative Distribution Function (CDF), and Quantile**

- **PDF (Probability Density Function)**: Describes the probability density of a continuous variable around a given value (the probability of any single exact value is zero). The area under the PDF over an interval gives the probability of the variable falling within that interval.
- **CDF (Cumulative Distribution Function)**: Obtained by integrating the PDF from $-\infty$ up to a given value $x$. It gives the probability that the random variable is less than or equal to $x$.

$$
H(x) = \int_{-\infty}^{x} h(x')\, dx'
$$


- **Quantile**: The quantile function is the inverse of the CDF: it returns the value below which a given fraction of the distribution falls. For example, the 0.25 quantile (or 25th percentile) is the value below which 25% of the data lie.
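
The relation between PDF, CDF and quantile can be checked numerically. The sketch below is only an illustration: it assumes a standard normal distribution (not prescribed by the notes) and uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Standard normal, chosen purely as an illustrative distribution.
dist = NormalDist(mu=0.0, sigma=1.0)

x = 1.0
density = dist.pdf(x)                    # h(x): a density, not a probability
cumulative = dist.cdf(x)                 # H(x) = P(X <= x)
recovered_x = dist.inv_cdf(cumulative)   # the quantile function undoes the CDF

# Probability of an interval = area under the PDF = difference of two CDF values.
prob_within_1sigma = dist.cdf(1.0) - dist.cdf(-1.0)

q25 = dist.inv_cdf(0.25)                 # 25th percentile
print(density, cumulative, recovered_x, prob_within_1sigma, q25)
```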

---

### **Empirical and Theoretical Distributions**

- **Theoretical Distribution**: A probability distribution derived from a known mathematical model (e.g., Normal, Poisson).
- **Empirical Distribution**: Based on observed data. It approximates the distribution of a dataset and is typically represented by the empirical CDF or histogram.
- Empirical distributions are used when the true distribution is unknown or difficult to model.

---

### **Homoscedastic and Heteroscedastic Errors**

- **Homoscedasticity**: The variance of the errors is constant across all data points.
- **Heteroscedasticity**: The variance of the errors changes from one data point to another.

---

### **Kolmogorov's Axioms and Probability**

Kolmogorov formalized the foundation of probability with three axioms:

1. **Non-negativity**: For any event $A$, the probability is non-negative:  
   $P(A) \geq 0$
2. **Normalization**: The probability of the entire sample space is 1:  
   $P(\Omega) = 1$
3. **Additivity**: For any two mutually exclusive events $A$ and $B$:  
   $P(A \cup B) = P(A) + P(B)$

These axioms form the basis of modern probability theory.

---

### **Bayes' Theorem**

Bayes' Theorem updates the probability of a hypothesis based on new evidence:

$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$

- $P(A|B)$: Posterior probability (updated belief)  
- $P(B|A)$: Likelihood of observing $B$ given $A$  
- $P(A)$: Prior probability of $A$  
- $P(B)$: Marginal probability of $B$. In general, the marginal probability of $x$ can be written as $p(x) = \int p(x,y)\, dy = \int p(x|y)\,p(y)\, dy$

Used in many fields like medicine, machine learning, and decision theory.
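
As a small worked example, the sketch below applies Bayes' theorem to a diagnostic test. All numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) are hypothetical, and the function name `posterior` is only illustrative. The evidence $P(B)$ is obtained by marginalizing over "has the disease" and "does not have the disease".

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) from Bayes' theorem, with P(B) from the law of total probability."""
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate.
p_disease_given_positive = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(p_disease_given_positive)
```

Even with a fairly accurate test, the posterior stays well below 50% here because the prior is small.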

---

### **Transformations of Random Variables**

Transforming a random variable means applying a function to it, creating a new variable.

- **Example**: Let $X$ be a random variable and $Y = g(X)$ a transformation.
- To find the **distribution of $Y$**:
  - If $X$ is continuous with PDF $f_X$ and $g$ is invertible and differentiable, then:

$$
f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right|
$$

- This is used to derive distributions of functions of random variables (e.g., squares, sums, logarithms).
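
A minimal numerical check of the change-of-variables formula, assuming (for illustration only) $X \sim \mathcal{N}(0,1)$ and $Y = e^X$, so that $g^{-1}(y) = \ln y$ and $|dg^{-1}/dy| = 1/y$. The integral of the transformed PDF over an interval is compared with the probability computed directly in terms of $X$.

```python
import math
from statistics import NormalDist

f_X = NormalDist(0.0, 1.0).pdf

def f_Y(y):
    """PDF of Y = exp(X): f_X(g^{-1}(y)) * |d g^{-1}/dy| with g^{-1}(y) = ln y."""
    return f_X(math.log(y)) / y

# Integrate f_Y over [a, b] with the midpoint rule and compare with
# P(a <= Y <= b) = P(ln a <= X <= ln b).
a, b, n = 0.5, 3.0, 10_000
width = (b - a) / n
area = sum(f_Y(a + (i + 0.5) * width) for i in range(n)) * width
expected = NormalDist().cdf(math.log(b)) - NormalDist().cdf(math.log(a))
print(area, expected)
```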


## <span style="color:red"> **LECTURE 3** - Probability & Statistics II </span>

- Monte Carlo integration (crude / hit-or-miss)  
- Mean, median, and expected value  
- Standard deviation, MAD1, variance, MAD2, quantile region, interquartile range, mode  
- Skewness  
- Kurtosis  
- Statistics of the PDF and sample; Bessel's correction  
- Uncertainties of estimators  
- PDFs: uniform, Gaussian, log-normal, chi-squared, Poisson  
- Importance sampling  

### **Monte Carlo Integration (Crude and Hit-or-Miss)**

- **Monte Carlo integration** uses random sampling to approximate definite integrals.
- **Crude Monte Carlo**:  
  Estimate the integral $\int_a^b f(x) \, dx$ by sampling $x_i \sim \mathcal{U}(a, b)$ and computing:  
  $$
  I \approx (b - a) \cdot \frac{1}{N} \sum_{i=1}^N f(x_i)
  $$
- **Hit-or-Miss method**:  
  Sample uniformly in a rectangle that encloses the graph of $f(x)$.  
  The integral is approximated by the fraction of points that fall below the curve times the area of the rectangle.
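
Both estimators can be sketched in a few lines. The example below is an illustration under stated assumptions: the integrand $f(x) = x^2$ on $[0, 1]$ (exact integral $1/3$), the bounding height `f_max`, the sample size and the helper names are all choices made here, not part of the notes.

```python
import random

def crude_mc(f, a, b, n, rng):
    """(b - a) times the average of f at n uniform points in [a, b]."""
    return (b - a) * sum(f(rng.uniform(a, b)) for _ in range(n)) / n

def hit_or_miss(f, a, b, f_max, n, rng):
    """Fraction of uniform points in [a, b] x [0, f_max] that fall under the curve,
    times the box area. Assumes 0 <= f(x) <= f_max on [a, b]."""
    hits = sum(rng.uniform(0.0, f_max) <= f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * f_max * hits / n

rng = random.Random(0)
f = lambda x: x * x                      # exact integral on [0, 1] is 1/3
crude = crude_mc(f, 0.0, 1.0, 100_000, rng)
hom = hit_or_miss(f, 0.0, 1.0, 1.0, 100_000, rng)
print(crude, hom)
```

For the same number of points the hit-or-miss estimate is usually noisier, because each point only contributes a yes/no answer.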

---

### **Mean, Median, Expected Value, and Mode**

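**Pay attention to the different uses of $\bar{x}$ and $\mu$**: $\bar{x}$ is computed from a sample, while $\mu$ is a property of the underlying distribution.

- **Mean**: Arithmetic average of a dataset, $\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i$. It estimates the population mean $\mu = \mathbb{E}[X]$, where $X$ is the random variable the data are drawn from.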
- **Median**: Middle value when data are ordered. Less sensitive to outliers.
- **Expected value** $\mathbb{E}[X]$ : Theoretical mean of a random variable. For continuous variables:  
  $$
  \mathbb{E}[X] = \int x f(x) \, dx
  $$
- **Mode**: Most frequent value in a dataset.

---

### **Standard Deviation, Variance, MAD, Quantiles, and IQR**

- **Variance** (2nd-order moment):  
  $$
  \sigma^2 = \text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \int_{-\infty}^{+\infty}(x - \mu)^2 f(x) \, dx
  $$
- **Standard Deviation** $\sigma$: Measures spread around the mean. It is the square root of the variance:
  $$
  \sigma = \sqrt{\text{Var}(X)} = \sqrt{\sigma^2}
  $$
- **MAD_1 (Mean Absolute Deviation)**:  
  $$
  \text{MAD}_1 = \frac{1}{N} \sum_{i=1}^N |x_i - \bar{x}|
  $$
  Note: the absolute value is not differentiable where its argument is zero ($x_i = \bar{x}$), so it is sometimes avoided in optimization.
- **MAD_2 (Median Absolute Deviation)**:  
  $$
  \text{MAD}_2 = \text{median}\left( |x_i - M| \right)
  $$
  where $M$ is the median of the data.
- **Quantile region**: Range containing a central portion of the distribution (e.g., 95% interval).
- **Interquartile Range (IQR)**:  
  $$
  \text{IQR} = Q_{75} - Q_{25}
  $$
  Contains the central 50% of the data.
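
The sketch below computes these spread measures on a small made-up sample with one outlier, using only the standard library. The median absolute deviation is taken as the median of $|x_i - M|$, and the quartiles use the default convention of `statistics.quantiles`; both are assumptions of this example.

```python
import statistics

# Made-up sample; the last point is an outlier.
data = [2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2, 3.3, 25.0]

mean = statistics.fmean(data)
median = statistics.median(data)
std = statistics.pstdev(data)                                   # 1/N normalisation
mad_mean = statistics.fmean(abs(x - mean) for x in data)        # MAD_1
mad_median = statistics.median(abs(x - median) for x in data)   # MAD_2
q25, q50, q75 = statistics.quantiles(data, n=4)                 # quartiles
iqr = q75 - q25

print(mean, median, std, mad_mean, mad_median, iqr)
```

The outlier inflates the mean and the standard deviation, while the median, MAD_2 and IQR barely notice it.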

---

### **Skewness and Kurtosis**

- **Skewness**: Measures asymmetry of a distribution (3rd-order moment).  
  - Positive skew: tail to the right.  
  - Negative skew: tail to the left.  
  - Formula:
    $$
    \text{Skewness} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\sigma} \right)^3
    $$
- **Kurtosis**: Measures how likely extreme values (far from the mean) are (4th-order moment).  
  - High kurtosis: heavy tails.  
  - Low kurtosis: light tails.  
  - Normal distribution has kurtosis = 3.  
  - Formula:
    $$
    \text{Kurtosis} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\sigma} \right)^4
    $$

---

### **PDF vs Sample Statistics and Bessel's Correction**

- **PDF statistics**: Theoretical values computed from a probability distribution.
- **Sample statistics**: Estimates of those values based on observed data.
- **Bessel's correction**: When estimating variance from a sample, use:
  $$
  s^2 = \frac{1}{N - 1} \sum_{i=1}^N (x_i - \bar{x})^2
  $$
  This gives an **unbiased estimate** of the population variance.  
  You can skip Bessel's correction only when $N$ is large, where the difference between $1/N$ and $1/(N-1)$ is negligible.
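
The bias can be seen in a quick simulation. This is a sketch under assumed settings (Gaussian data with true variance 1, samples of only $N = 5$ points, 10,000 repetitions): averaging the $1/N$ estimator over many samples should give roughly $(N-1)/N$, while the $1/(N-1)$ estimator should average to about 1.

```python
import random
import statistics

rng = random.Random(1)
N, trials = 5, 10_000
biased, unbiased = [], []
for _ in range(trials):
    sample = [rng.gauss(0.0, 1.0) for _ in range(N)]   # true variance = 1
    biased.append(statistics.pvariance(sample))        # divides by N
    unbiased.append(statistics.variance(sample))       # divides by N - 1 (Bessel)

mean_biased = statistics.fmean(biased)       # expected near (N - 1) / N = 0.8
mean_unbiased = statistics.fmean(unbiased)   # expected near 1.0
print(mean_biased, mean_unbiased)
```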

---

### **Uncertainties of Estimators**

When we compute estimators like the mean, variance, or IQR from a sample, there's **uncertainty** because we have only a finite number of points. This is captured by the **standard error (SE)**.

- **Sample Mean**:
  $$
  \text{SE}(\bar{x}) = \frac{\sigma}{\sqrt{N}}
  $$

- **Sample Standard Deviation** and **Variance** (for Gaussian data):
  $$
  \text{SE}(\sigma) \approx \frac{\sigma}{\sqrt{2N}}, \quad \text{SE}(\sigma^2) \approx \sigma^2 \sqrt{\frac{2}{N}}
  $$

- **Interquartile Range (IQR)** (for Gaussian data):
  $$
  \text{SE}(\text{IQR}) \approx \frac{1.57 \times \sigma}{\sqrt{N}}
  $$

General rule: **more data → smaller standard error**, since uncertainty scales as $\frac{1}{\sqrt{N}}$.



---

### **PDFs: Uniform, Gaussian, Log-Normal, Chi-Squared, Poisson**

- **Uniform**: All values in an interval have equal probability density.  
  $$
  f(x) = \frac{1}{b - a} \quad \text{for } x \in [a, b]
  $$

  This distribution has standard deviation $\sigma = \frac{b-a}{\sqrt{12}}$.
- **Gaussian (Normal)**: Curve defined by mean $\mu$ and std $\sigma$.  
  $$
  f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
  $$
  - The convolution of two Gaussians is also a Gaussian.
  - It is the queen of distributions, because many natural processes follow this shape and it is quite easy to work with.
  - About 68% of the probability lies within $1\sigma$ of the mean, and about 95% within $2\sigma$.

- **Log-Normal**: $X \sim \text{LogNormal}$ means $\ln X \sim \text{Normal}$.
- **Chi-squared** ($\chi^2$):  
  If we define standardized variables as  
  $$
  z_i = \frac{x_i - \mu}{\sigma},
  $$  
  then the sum of their squares  
  $$
  Q = \sum_{i=1}^N z_i^2
  $$  
  follows a **chi-squared distribution** with $K$ degrees of freedom, provided the $x_i$ are Gaussian.

  The number of degrees of freedom $K$ is equal to the number of **independent** data points used in the sum (so $K = N$ when all $N$ points are independent).

- **Poisson**: Discrete distribution for count data.  
  $$
  P(k; \mu) = \frac{\mu^k e^{-\mu}}{k!}
  $$
  - Where $\mu$ is the mean (the expected number of events) and $k$ is the number of events occurring
  - Known as the "law of rare events"

---

### **Importance Sampling**

- Hit-or-miss and crude Monte Carlo are inefficient if the integrand is zero (or negligible) over large regions, or if the integration domain is very extended. This is because both methods sample from the uniform distribution.
- Instead of sampling from the uniform, sample from a **proposal distribution** $g(x)$ and weight each sample by $f(x)/g(x)$.
- Works best when $g(x)$ is close to the shape of $f(x)$.
- Reduces variance and computational cost if $g(x)$ is well chosen.
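
A minimal sketch of importance sampling, assuming the standard estimator $I \approx \frac{1}{N}\sum_i f(x_i)/g(x_i)$ with $x_i \sim g$. The integrand $f(x) = x^2 e^{-x}$ on $[0, \infty)$ (exact integral 2) and the exponential proposal with rate 0.5 are hypothetical choices made for this example.

```python
import math
import random

def importance_sampling(f, g_pdf, g_sample, n):
    """Estimate the integral of f as the mean of f(x) / g(x) with x drawn from g."""
    total = 0.0
    for _ in range(n):
        x = g_sample()
        total += f(x) / g_pdf(x)
    return total / n

rng = random.Random(2)
rate = 0.5                                       # hypothetical proposal: exponential with this rate
f = lambda x: x * x * math.exp(-x)               # integral over [0, inf) is exactly 2
g_pdf = lambda x: rate * math.exp(-rate * x)
g_sample = lambda: rng.expovariate(rate)

estimate = importance_sampling(f, g_pdf, g_sample, 100_000)
print(estimate)
```

A uniform sampler could not even cover this unbounded domain; the proposal puts points where the integrand actually contributes.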


## <span style="color:red"> **LECTURE 4** - Probability & Statistics III </span>

- Central Limit Theorem  
- Law of Large Numbers  
- Multidimensional PDFs (mean, sigma x and y, covariance, correlation coefficient, principal axes, 2D confidence level)  
- Correlation vs causation (Pearson, Spearman, Kendall)  
- Rejection sampling  
- Inverse sampling  

### **Central Limit Theorem (CLT)**

The CLT states that the sum (or mean) of a large number of independent, identically distributed random variables with finite variance tends to follow a **normal distribution**, regardless of the original distribution.
This theorem is the foundation for repeated measurements.
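
The theorem can be illustrated with a short simulation. The settings are assumptions of this sketch: each "experiment" averages $N = 30$ uniform draws, and the experiment is repeated 20,000 times. The spread of the means should be close to $\sigma/\sqrt{N}$, and about 68% of them should fall within that distance of the true mean, as for a Gaussian.

```python
import math
import random
import statistics

rng = random.Random(3)
N, trials = 30, 20_000        # points per average, number of repeated experiments

# Each entry is the mean of N uniform draws on [0, 1): mu = 0.5, sigma = 1/sqrt(12).
means = [sum(rng.random() for _ in range(N)) / N for _ in range(trials)]

expected_sigma = (1 / math.sqrt(12)) / math.sqrt(N)      # sigma / sqrt(N)
observed_sigma = statistics.pstdev(means)
frac_within_1sigma = sum(abs(m - 0.5) <= expected_sigma for m in means) / trials
print(observed_sigma, expected_sigma, frac_within_1sigma)
```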

---

### **Law of Large Numbers (LLN)** (Bernoulli's theorem)

- The LLN states that as the number of observations $N$ increases, the sample mean $\bar{x}$ converges to the true mean $\mu$; the same holds for the sample standard deviation:
  $$
  \lim_{N \to \infty} \bar{x} = \mu \quad, \quad \lim_{N \to \infty} s = \sigma
  $$
- This is a statement about convergence **in probability**.

---

### **Multidimensional PDFs**

- In 2D, the joint distribution can be described by:
  - **Mean vector**:  
    $$
    \vec{\mu} = (\mu_x, \mu_y)
    $$
    where, for example, $\mu_x = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}x\, h(x,y)\, dx\, dy$
  - **Covariance matrix**:  
    $$
    \Sigma = \begin{pmatrix}
    \sigma_x^2 & \text{cov}(x, y) \\
    \text{cov}(y, x) & \sigma_y^2
    \end{pmatrix}
    $$
    The two off-diagonal values are equal to each other, and they are 0 if and only if $x$ and $y$ are uncorrelated.
    $$\sigma^2_x = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}(x-\mu_x)^2 h(x,y) dx dy$$
    $$\sigma_{xy} = Cov(x,y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}(x-\mu_x) (y-\mu_y) h(x,y) dx dy$$

  - **Correlation coefficient**:  
    $$
    \rho = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y}
    $$
    It is a dimensionless number between $-1$ and $1$ that expresses the degree of linear correlation between the two variables.

  - **Principal axes**: determined by the eigenvectors of $\Sigma$; note that the correlation vanishes in this coordinate system by construction.
  - **2D Confidence Ellipses**: contours of constant joint probability density. Pay attention: the number of sigmas has a different meaning in each dimension. For a 2D Gaussian, the $1\sigma$ ellipse contains only about 39% of the probability. One can draw the contour that encloses 68% by analogy with 1D, but it is not the $1\sigma$ ellipse.
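
The quantities above can be estimated from samples. The sketch below is an illustration under assumed settings (a 2D Gaussian with unit variances and a hypothetical correlation of 0.6). It computes the covariance matrix, the correlation coefficient, the principal-axis variances (eigenvalues of the 2x2 matrix in closed form), and the fraction of points inside the $1\sigma$ ellipse, which should be near $1 - e^{-1/2} \approx 0.39$.

```python
import math
import random

rng = random.Random(4)
rho_true, n = 0.6, 50_000          # hypothetical correlation and sample size
xs, ys = [], []
for _ in range(n):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    xs.append(z1)
    ys.append(rho_true * z1 + math.sqrt(1 - rho_true**2) * z2)

mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs) / n
syy = sum((y - my) ** 2 for y in ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
rho = sxy / math.sqrt(sxx * syy)

# Eigenvalues of the 2x2 covariance matrix = variances along the principal axes.
half_trace = 0.5 * (sxx + syy)
root = math.sqrt((0.5 * (sxx - syy)) ** 2 + sxy**2)
lam1, lam2 = half_trace + root, half_trace - root
angle = 0.5 * math.atan2(2 * sxy, sxx - syy)      # orientation of the major axis

def mahalanobis_sq(x, y):
    """Squared distance from the mean in units of the covariance ellipse."""
    u, v = (x - mx) / math.sqrt(sxx), (y - my) / math.sqrt(syy)
    return (u * u - 2 * rho * u * v + v * v) / (1 - rho * rho)

frac_1sigma = sum(mahalanobis_sq(x, y) <= 1 for x, y in zip(xs, ys)) / n
print(rho, lam1, lam2, angle, frac_1sigma)
```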

---

### **Correlation vs Causation**

Correlation does not imply causation!
Just because the sun burns our skin and also makes us thirsty, it doesn't mean that thirst causes sunburn!

- **Pearson's coefficient** ($r$): Measures the linear correlation between two datasets; it takes values between $-1$ and $1$, and the two datasets are linearly uncorrelated only if $r = 0$.
  It has two problems:
  - it is sensitive to outliers
  - it does not take measurement errors into account
- **Spearman's coefficient** ($r_s$): Measures monotonic (rank-based) correlation.
- **Kendall's coefficient** ($\tau$): Measures ordinal association between two variables.
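
The difference between linear and rank-based correlation shows up on data that are monotonic but not linear. The sketch below uses made-up data ($y = e^x$) and hand-written helpers; the `ranks` function assumes there are no ties, which keeps the example short.

```python
import math

def pearson(x, y):
    """Pearson's linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks from 1 to N; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman's coefficient: Pearson's coefficient computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(v) for v in x]       # monotonic but strongly non-linear
print(pearson(x, y), spearman(x, y))
```

Spearman's coefficient is 1 because the ordering is perfectly preserved, while Pearson's is clearly below 1.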

---

### **Rejection Sampling**

Rejection sampling is a method to generate random samples from a complex distribution $h(x)$, using a simpler proposal distribution $q(x)$.

The procedure works as follows:

1) Decide on a straightforward *proposal distribution* $q(x)$ to propose new samples. It should be wide enough to capture the tails of $h(x)$. A uniform distribution is the usual choice.

2) Generate a random sample from $q(x)$, $x_i$.

3) Now generate a random sample, $u$, from a uniform distribution in the range $[0,\mathrm{max}(h(x))]$, where the upper bound should be as large as or larger than the maximum density of $h(x)$. (This could be worked out analytically or by histogramming the data.)

4) If $u\leq h(x_i)$ accept the point, or else reject it and try again from step 2.

The set of accepted $x$ values will follow the target distribution $h(x)$.
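
The four steps translate almost line by line into code. This is a sketch with a uniform proposal; the target density $h(x) = 2x$ on $[0, 1]$ (mean $2/3$, maximum 2) and the function name are illustrative assumptions.

```python
import random

def rejection_sample(h, a, b, h_max, n, rng):
    """Draw n samples from the density h on [a, b] using a uniform proposal.
    Assumes h(x) <= h_max everywhere on [a, b]."""
    samples = []
    while len(samples) < n:
        x = rng.uniform(a, b)          # step 2: propose from q(x)
        u = rng.uniform(0.0, h_max)    # step 3: uniform height
        if u <= h(x):                  # step 4: accept, otherwise try again
            samples.append(x)
    return samples

rng = random.Random(5)
h = lambda x: 2.0 * x                  # illustrative target density on [0, 1]
samples = rejection_sample(h, 0.0, 1.0, 2.0, 20_000, rng)
sample_mean = sum(samples) / len(samples)     # exact mean of h is 2/3
print(sample_mean)
```

With this target half of the proposed points are rejected on average, which is the inefficiency the next section avoids.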


---

### **Inverse Transform Sampling**
Rejection sampling works, but wouldn't it be awesome if we didn't have to discard *any* points during our sampling? This is the power and simplicity of **inverse transform sampling**.

- Used to sample from a distribution $h(x)$ whose CDF $H(x)$ and quantile function $H^{-1}(u)$ are known.
- Steps:
  1. Sample $u$ from $\mathcal{U}(0, 1)$.
  2. Using the quantile function $H^{-1}(u)$, find the value of $x$ below which a fraction $u$ of the distribution is contained.
  3. The $x$ value you get is a random sample from $h(x)$.


Normalization is really important here: $h(x)$ must integrate to 1 for $H(x)$ to be a valid CDF.
The CDF and the quantile function can be computed numerically if they cannot be derived by hand.
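
A minimal sketch for a case where the quantile function is known in closed form: the exponential distribution, with $H(x) = 1 - e^{-\lambda x}$ and $H^{-1}(u) = -\ln(1-u)/\lambda$. The rate $\lambda = 2$ and the sample size are arbitrary choices for this example.

```python
import math
import random

def quantile_exponential(u, rate):
    """H^{-1}(u) for the exponential distribution, H(x) = 1 - exp(-rate * x)."""
    return -math.log(1.0 - u) / rate

rng = random.Random(10)
rate = 2.0                                       # illustrative rate parameter
# Every uniform draw is converted into a sample: nothing is rejected.
samples = [quantile_exponential(rng.random(), rate) for _ in range(50_000)]
sample_mean = sum(samples) / len(samples)        # exact mean is 1 / rate
print(sample_mean)
```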

## <span style="color:red"> **LECTURE 5** - Frequentist Inference I </span>

- Population  
- Sample  
- Statistics  
- Estimators  
- Uncertainties and intervals  
- Frequentist vs Bayesian  
- Maximum Likelihood Estimator (MLE)  
- Properties of estimators  
- Likelihood  
- Chi-squared  
- Minimization  
- Mean and error of MLE with heteroscedastic and homoscedastic errors  
- Non-Gaussian likelihoods 

### **Population, Sample, Statistic, Estimators, Uncertainty and Intervals**

- A **population** is the full set of data or measurements we are interested in.
- A **sample** is a subset of the population, used to infer properties of the whole.
- A **statistic** is a function of the sample (e.g. the sample mean $\bar{x}$).
- An **estimator** is a rule or formula to estimate population parameters from the sample.
- All estimators have **uncertainties** due to random sampling.
- A **confidence interval** gives a range that, over repeated experiments, contains the true value with a stated frequency (e.g. 95%).

---

### **Frequentist vs Bayesian**

- **Frequentist**: Probability is defined through the frequency of events over repeated experiments. Parameters are fixed, data are random.
  In frequentist inference we have confidence levels.
- **Bayesian**: Probability expresses belief or uncertainty about what we know. Parameters have distributions while data are fixed. In Bayesian inference we have credible regions derived from the posterior distribution of the parameters.


- Bayesian statistics is built on **Bayes' theorem**:
  $$
  P(\theta | \text{data}) = \frac{P(\text{data} | \theta) \cdot P(\theta)}{P(\text{data})}
  $$

---

### **Maximum Likelihood Estimator (MLE)**

- The **MLE** is the value of the parameter $\theta$ that **maximizes the likelihood** of the observed data:
  $$
  \hat{\theta}_{\text{MLE}} = \arg \max_\theta \mathcal{L}(\theta)
  $$
- It is useful in both the frequentist and the Bayesian approach.
- **Remember**: the **likelihood** is defined as the product of the probabilities (or probability densities) of the observed data, assuming a given model or parameter value.

  For independent data points $x_1, x_2, ..., x_N$:

  $$
  \mathcal{L}(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)
  $$

  Where:
  - $\mathcal{L}(\theta)$ is the likelihood function,
  - $p(x_i \mid \theta)$ is the probability (or density) of observing $x_i$ given parameter $\theta$,
  - The product assumes all $x_i$ are independent.

  Often, we work with the **log-likelihood**:
  $$
  \log \mathcal{L}(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta)
  $$
  which is easier to compute and optimize.
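
As an illustration, the sketch below evaluates a Gaussian log-likelihood on a grid of candidate means and picks the maximum. The simulated data, the known $\sigma = 2$ and the grid are assumptions of this example; a real analysis would normally use a numerical optimizer instead of a grid.

```python
import math
import random

def log_likelihood(mu, data, sigma):
    """Gaussian log-likelihood of independent points with known sigma."""
    const = -0.5 * math.log(2 * math.pi * sigma**2)
    return sum(const - 0.5 * ((x - mu) / sigma) ** 2 for x in data)

rng = random.Random(6)
sigma = 2.0                                            # assumed known
data = [rng.gauss(10.0, sigma) for _ in range(200)]    # simulated measurements

grid = [8.0 + 0.005 * i for i in range(801)]           # candidate values of mu in [8, 12]
mu_mle = max(grid, key=lambda mu: log_likelihood(mu, data, sigma))
sample_mean = sum(data) / len(data)
print(mu_mle, sample_mean)
```

Up to the grid resolution, the maximum of the log-likelihood coincides with the sample mean.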

---

### **Properties of Estimators**

- **Unbiasedness**: $\mathbb{E}[\hat{\theta}] = \theta$
- **Consistency**: $\hat{\theta} \to \theta_{true}$ as $N \to \infty$
- **Normality** : As $N \to \infty$ , the distribution of the estimator approaches a normal distribution.
- **Efficiency**: The estimator reaches the minimum possible variance (the Cramér-Rao bound)

---

### **Likelihood, Chi-squared and Minimization**

- The **likelihood** $\mathcal{L}(\theta)$ is the probability of the data given parameters.
- If we assume that the errors are Gaussian, the likelihood is proportional to $\exp(-\chi^2/2)$.
- In Gaussian cases, maximizing the log-likelihood is equivalent to **minimizing the chi-squared**:
  $$
  \chi^2 = \sum_{i=1}^N \left( \frac{y_i - f(x_i; \theta)}{\sigma_i} \right)^2
  $$
- Minimizing $\chi^2$ gives the best-fit parameters.
- The MLE method tells us to think of the likelihood as a function of the (unknown) model parameters; by minimizing the $\chi^2$ we find the parameter values that maximize the likelihood.

---

### **Mean and MLE Error: Homoscedastic vs Heteroscedastic**

- **Homoscedastic**: All data points have the same uncertainty $\sigma$. If we minimize the $\chi^2$, we recover the **sample mean**:
    $$
    \bar{x} = \frac{1}{N} \sum x_i \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}
    $$
- **Heteroscedastic**: Each data point has its own uncertainty $\sigma_i$. In this case we recover the **weighted mean**:
    $$
    \bar{x} = \frac{\sum x_i / \sigma_i^2}{\sum 1 / \sigma_i^2}
    $$
    $$
    \sigma_{\bar{x}}^2 = \frac{1}{\sum 1/\sigma_i^2}
    $$

These two formulas follow from setting the derivative of the log-likelihood to zero, because we are looking for its maximum.
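
Both cases can be handled by one small function, since the homoscedastic result is the special case of equal $\sigma_i$. The measurements and uncertainties below are made up for illustration.

```python
import math

def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean and its uncertainty."""
    weights = [1.0 / s**2 for s in sigmas]
    mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return mean, math.sqrt(1.0 / sum(weights))

values = [10.2, 9.8, 10.5]      # made-up measurements
sigmas = [0.1, 0.2, 0.5]        # made-up heteroscedastic uncertainties

mean_het, err_het = weighted_mean(values, sigmas)
mean_hom, err_hom = weighted_mean(values, [0.2] * 3)   # equal errors: plain mean, sigma / sqrt(N)
print(mean_het, err_het, mean_hom, err_hom)
```

The weighted mean is pulled towards the most precise measurement, and its uncertainty is smaller than the smallest individual $\sigma_i$.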

Our Maximum Likelihood Estimator (MLE) is not perfect: every estimate has an associated **uncertainty** due to the finite sample size.

Under general conditions, the MLE becomes **asymptotically normal**, meaning that for large $N$ the distribution of $\hat{\theta}$ approaches a **Gaussian** centered at the true parameter value $\theta_0$, and the likelihood function itself is well approximated by a Gaussian around its maximum.

To quantify the uncertainty, we expand the **log-likelihood** around its maximum using a second-order **Taylor expansion**:

$$
\log \mathcal{L}(\theta) \approx \log \mathcal{L}(\hat{\theta}) - \frac{1}{2} (\theta - \hat{\theta})^2 F(\hat{\theta})
$$

Here, $F(\hat{\theta})$ is the **Fisher Information Matrix**, defined as the negative second derivative (Hessian) of the log-likelihood:

$$
F_{ij}(\theta) = - \frac{\partial^2 \log \mathcal{L}(\theta)}{\partial \theta_i\, \partial \theta_j}
$$

The **covariance matrix** of the estimator $\hat{\theta}$ is then given by the **inverse** of the Fisher matrix:

$$
\text{Cov}(\hat{\theta}) = F^{-1}(\hat{\theta})
$$

For a **single parameter** $\theta$, this simplifies to:

$$
\sigma_{\hat{\theta}} = \sqrt{\frac{1}{F(\hat{\theta})}}
$$

This implies that asymptotically, the MLE is **normally distributed**:

$$
\hat{\theta} \sim \mathcal{N} \left( \theta_0, F^{-1}(\theta_0) \right)
$$
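
For a single parameter the recipe can be checked numerically. The sketch below assumes made-up Gaussian measurements with a known, common $\sigma$, estimates the second derivative of the log-likelihood at its maximum by finite differences, and compares the resulting uncertainty with $\sigma/\sqrt{N}$. The helper name and step size are choices of this example.

```python
import math

def neg_second_derivative(log_like, theta, step=1e-3):
    """Central finite-difference estimate of -d^2 logL / d theta^2."""
    return -(log_like(theta + step) - 2.0 * log_like(theta) + log_like(theta - step)) / step**2

data = [9.7, 10.4, 10.1, 9.9, 10.6, 9.5, 10.2, 10.0]    # made-up measurements
sigma = 0.4                                            # assumed known, homoscedastic

def log_like(mu):
    # Gaussian log-likelihood with constant terms dropped.
    return sum(-0.5 * ((x - mu) / sigma) ** 2 for x in data)

mu_hat = sum(data) / len(data)                  # MLE of the mean
fisher = neg_second_derivative(log_like, mu_hat)
sigma_mu = 1.0 / math.sqrt(fisher)              # should match sigma / sqrt(N)
print(fisher, sigma_mu, sigma / math.sqrt(len(data)))
```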

---

### **Non-Gaussian Likelihoods**

- When the data don't follow a Gaussian distribution, use the appropriate **likelihood model**, such as **Poisson**, **Binomial**, **Exponential**, **Log-normal**, etc.
- The MLE approach still applies: choose the model, write the likelihood, and maximize it numerically.
- In many cases the result is close to the Gaussian one, especially when the number of data points is large.

## <span style="color:red"> **LECTURE 6** - Frequentist Inference II </span>

- Fit  
- Outliers (Huber loss function)  
- Goodness of fit  
- Reduced chi-squared  
- Model misspecification  
- Occam's Razor  
- AIC (Akaike Information Criterion)  
- Bootstrap  
- Jackknife

### **Fit**

- Fitting means adjusting model parameters so that the model best matches the observed data.
- Typically done by minimizing a **loss function**, such as the **sum of squared residuals** or **negative log-likelihood**.
- The goal is to find the best estimate $\hat{\theta}$ that explains the data.

---

### **Outliers and Huber Loss Function**

- **Outliers** are data points that deviate significantly from the trend of the rest of the data.
- Summing the squares of the residuals ($\chi^2=\sum_{i=1}^N (y_i - M(x_i))^2/\sigma^2$) is sensitive to outliers
- How do we deal with outliers? By modifying the likelihood!
- The **Huber loss** combines the squared loss for small errors and absolute loss for large errors:

$$
L_{\text{Huber}}(t) =
\begin{cases}
\frac{1}{2} t^2 & \text{if } |t| \leq c \\
c |t| - \frac{1}{2} c^2 & \text{if } |t| > c
\end{cases}
$$

- Where $t = \frac{y - M(\theta)}{\sigma}$ represents the **standardized residual**, i.e. how far the observed value $y$ is from the model prediction $M(\theta)$, in units of the known uncertainty $\sigma$.
- $c$ is the **tuning constant** (or confidence threshold), which determines the cutoff point where the loss switches from quadratic to linear. A common value is $c \approx 1.345$, which gives good balance between efficiency and robustness under normal errors.
- This approach makes the fit more **robust** to outliers: small residuals behave like in least squares, but large residuals are penalized less harshly.
- Note that by doing this, we are effectively putting **prior information** into the analysis. In fact, in a frequentist approach one would prefer to redo the measurements or simply discard the few outliers.
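
The piecewise definition is short enough to write down directly. The sketch below compares the Huber loss with the plain squared loss on a few made-up standardized residuals, one of which is an outlier; the default $c = 1.345$ follows the value quoted above.

```python
def huber_loss(t, c=1.345):
    """Quadratic for |t| <= c, linear beyond; t is the standardized residual."""
    a = abs(t)
    return 0.5 * a * a if a <= c else c * a - 0.5 * c * c

residuals = [0.3, -0.8, 1.1, 12.0]       # made-up standardized residuals, the last is an outlier
squared_total = sum(0.5 * t * t for t in residuals)
huber_total = sum(huber_loss(t) for t in residuals)
print(squared_total, huber_total)
```

The outlier dominates the squared loss, while its contribution to the Huber loss grows only linearly.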

---

### **Goodness of Fit: Reduced Chi-squared**

- Measures how well the model describes the data. Remember GIGO (Garbage In, Garbage Out): if the model is wrong, finding the "best" parameters does not really mean anything.
- A good fit should show residuals randomly scattered around zero.

- The **reduced chi-squared** is defined as:

  $$
  \chi^2_{\text{red}} = \frac{1}{\nu} \sum_{i=1}^N \left( \frac{y_i - f(x_i)}{\sigma_i} \right)^2
  $$

  where **$\nu = N - k$** is the number of **degrees of freedom** (data points minus number of parameters).

- Interpretation:
  - $\chi^2_{\text{red}} \approx 1$: good fit
  - $\chi^2_{\text{red}} \gg 1$: underfitting or underestimated errors
  - $\chi^2_{\text{red}} \ll 1$: overfitting or overestimated errors
- If the model is **wrong** (misspecified), goodness-of-fit measures can be misleading.
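
A small sketch of the computation, using made-up data scattered around the line $y = 2x + 1$. The line is treated as if its two parameters had been fitted ($k = 2$), and the quoted uncertainties are hypothetical. Inflating the error bars shows how $\chi^2_{\text{red}}$ drops well below 1.

```python
def reduced_chi2(y, model, sigma, n_params):
    """Chi-squared per degree of freedom, with nu = N - k."""
    chi2 = sum(((yi - mi) / si) ** 2 for yi, mi, si in zip(y, model, sigma))
    return chi2 / (len(y) - n_params)

# Made-up data around y = 2x + 1, treated as a fitted model with k = 2 parameters.
x = [0, 1, 2, 3, 4, 5]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]
model = [2 * xi + 1 for xi in x]

chi2_red = reduced_chi2(y, model, [0.15] * len(x), n_params=2)           # plausible errors
chi2_red_inflated = reduced_chi2(y, model, [0.5] * len(x), n_params=2)   # overestimated errors
print(chi2_red, chi2_red_inflated)
```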

---

### **Model Comparison, Occam's Razor, AIC and BIC**
- The reduced $\chi^2$ cannot be used with the Huber loss, because the corresponding likelihood is not Gaussian. We need other tools.
- When comparing two models with the **same number of parameters**, we can simply compare their **maximum log-likelihood** values:
- Larger log-likelihood ⇒ better fit

The **Huber loss** clearly performs better (less negative log-likelihood), meaning it fits the data more effectively, especially in the presence of outliers.


When models have **different numbers of parameters**, simply comparing likelihoods is not fair: more complex models might fit better **just by chance**. We need to penalize complexity; this is known as the **Occam penalty**.


A simple method to compare models with different complexity is the **AIC** (Akaike Information Criterion):
  
- A lower AIC means a better description of the dataset
- It is made of several terms: the first is the $\chi^2$ (from the maximum likelihood), while the second and third penalize model complexity
- If models fit the data equally well, AIC prefers the one with fewer parameters.



The **BIC** (Bayesian Information Criterion) is another way to compare models, especially when they have different numbers of parameters.
It's similar to AIC, but it **penalizes complex models more strongly**, especially when the dataset is large.


- Lower **BIC** means a better model.
- BIC prefers **simpler models**, especially when N is large.
- It's often used in **Bayesian statistics**, but it doesn't need a full Bayesian analysis and is often used in frequentist analyses too.
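
The notes do not spell out the formulas, so the sketch below assumes the standard definitions: $\text{AIC} = -2\ln\mathcal{L}_{\max} + 2k$, optionally with the small-sample term $2k(k+1)/(N-k-1)$, and $\text{BIC} = -2\ln\mathcal{L}_{\max} + k\ln N$, where $k$ is the number of parameters and $N$ the number of data points. The two "fits" being compared are hypothetical numbers.

```python
import math

def aic(log_l_max, k, n=None):
    """AIC = -2 ln L_max + 2k; if n is given, add the small-sample term 2k(k+1)/(n-k-1)."""
    value = -2.0 * log_l_max + 2.0 * k
    if n is not None:
        value += 2.0 * k * (k + 1) / (n - k - 1)
    return value

def bic(log_l_max, k, n):
    """BIC = -2 ln L_max + k ln n."""
    return -2.0 * log_l_max + k * math.log(n)

# Hypothetical fits to the same n = 50 points: the complex model fits slightly better.
n = 50
aic_simple, aic_complex = aic(-60.0, 2, n), aic(-59.0, 5, n)
bic_simple, bic_complex = bic(-60.0, 2, n), bic(-59.0, 5, n)
print(aic_simple, aic_complex, bic_simple, bic_complex)
```

In this example the small gain in likelihood does not pay for three extra parameters under either criterion, and the BIC penalty is the larger of the two.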


---

### **Bootstrap**

- A **resampling method** to estimate uncertainties and confidence intervals.
- Be careful: it can look as if it creates information out of nothing, but it only reuses the information already in the sample.
- Steps:
  1. Resample data (with replacement) to create many datasets. 
    The probability of getting back the original dataset is extremely low ($N! / N^N$).
  2. Fit the model to each resampled dataset.
  3. Analyze the distribution of the fitted parameters.
- Useful when the analytical uncertainty is hard to compute or is too large (for example, when we have only a few points from a Gaussian distribution).
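
The steps above can be sketched as follows, here for the uncertainty on the median of a simulated Gaussian sample. The sample, the number of resamples (2000) and the use of a percentile interval are assumptions made for this example.

```python
import random
import statistics

def bootstrap(data, statistic, n_resamples, rng):
    """Evaluate `statistic` on n_resamples datasets drawn with replacement from data."""
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

rng = random.Random(7)
data = [rng.gauss(5.0, 2.0) for _ in range(100)]       # simulated sample
n_resamples = 2000
medians = sorted(bootstrap(data, statistics.median, n_resamples, rng))

std_err = statistics.stdev(medians)                    # bootstrap standard error
ci_low = medians[n_resamples * 25 // 1000]             # 2.5th percentile
ci_high = medians[n_resamples * 975 // 1000]           # 97.5th percentile
print(std_err, ci_low, ci_high)
```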

---

###  **Jackknife Method** 

The **Jackknife** is a method to estimate the **uncertainty** (standard error) and **bias** of a statistic, like the **mean** or **standard deviation**, using your data.

Suppose you have a dataset of $N$ values.

1. Leave out **one** data point at a time → you get $N$ new datasets, each containing $N-1$ points.
2. Compute your statistic (e.g. mean, std) on each of these.
3. From the $N$ results, estimate:
   - A **better (bias-corrected)** value of the statistic
   - The **uncertainty** on that value


Jackknife works **well** when the statistic is:
- The **mean**
- The **standard deviation**

It works **poorly** for:
- The **median**
- **Quantiles** (e.g. the 25th percentile)

These are called **rank-based statistics**, and removing one point at a time doesn't change them much, so the jackknife underestimates the uncertainty.


####  **Jackknife vs Bootstrap**

|                      | Jackknife             | Bootstrap                  |
|----------------------|-----------------------|----------------------------|
| Type                 | Leaves out one point  | Resamples with replacement |
| Fast?                | ✅ Yes                | ❌ Slower                  |
| Repeatable?          | ✅ Always same result | ❌ Changes each time       |
| Works for all stats? | ❌ Not for medians    | ✅ Yes                     |
| Confidence intervals | ❌ Approximate        | ✅ Full distribution       |
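
A sketch of the leave-one-out procedure, assuming the standard jackknife formulas (they are not written out in these notes): with $\hat\theta_{(i)}$ the statistic computed without point $i$ and $\bar\theta$ their average, the standard error is $\sqrt{\frac{N-1}{N}\sum_i(\hat\theta_{(i)} - \bar\theta)^2}$ and the bias-corrected estimate is $N\hat\theta - (N-1)\bar\theta$. The data are made up.

```python
import math
import statistics

def jackknife(data, statistic):
    """Return (bias-corrected estimate, standard error) from leave-one-out resamples."""
    n = len(data)
    full = statistic(data)
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(n)]   # N datasets of N-1 points
    loo_mean = sum(loo) / n
    corrected = n * full - (n - 1) * loo_mean
    std_err = math.sqrt((n - 1) / n * sum((t - loo_mean) ** 2 for t in loo))
    return corrected, std_err

data = [4.1, 5.3, 4.8, 6.0, 5.5, 4.4, 5.1, 5.9]       # made-up sample
estimate, std_err = jackknife(data, statistics.fmean)
print(estimate, std_err)
```

For the mean, the jackknife standard error reproduces the familiar $s/\sqrt{N}$, and the result is identical on every run.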


## <span style="color:red"> **LECTURE 7** - Frequentist Inference III </span>

- Hypothesis testing (p-value)  
- Null hypothesis  
- Type I and Type II errors  
- KS test (Kolmogorov-Smirnov)  
- Histograms  
- Number of bins (Scott's & Freedman-Diaconis rules)  
- Rug plot  
- Kernel Density Estimation (Gaussian and Epanechnikov)  

### **Hypothesis Testing and p-value**

Hypothesis testing is a fundamental procedure in statistics used to decide whether there is enough evidence in a sample of data to infer that a certain condition holds for the entire population.

- **Null Hypothesis ($H_0$):** This is the starting assumption or the default claim about the population. It usually represents the idea that there is **no effect**, **no difference**, or **no relationship** between variables. For example, $H_0$ might state that the mean of a population is equal to a specific value.

- **Alternative Hypothesis ($H_1$):** This is the hypothesis you want to test or provide evidence for. It represents a change, effect, or difference from what the null hypothesis states. 

- **Test Statistic:** To test the hypotheses, a test statistic is computed from the sample data. This statistic measures how far the observed data are from what would be expected if $H_0$ were true. Different tests have different statistics (e.g., t-test, z-test, chi-square test).

- **p-value:** The p-value is the probability, assuming the null hypothesis $H_0$ is true, of obtaining a test statistic at least as extreme as the one observed. In other words, it quantifies how likely your data would be if there were actually no effect.

  - A **small p-value** indicates that the observed data is unlikely under $H_0$, so we have evidence to reject the null hypothesis.
  - A **large p-value** suggests the data is consistent with $H_0$, and we do not reject it.

    $$
        p_i = \int_{x_i}^{\infty} h_0(x)dx = 1 - \int_{-\infty}^{x_i}h_0(x)dx = 1- H_0(x_i)
    $$

- **Significance Level ($\alpha$):** This is a threshold probability set before the test (commonly 0.05 or 5%). If the p-value is less than $\alpha$, the result is called statistically significant, and we reject the null hypothesis in favor of the alternative.


#### Important notes:

- **Failing to reject $H_0$ is not the same as accepting $H_0$.** It means the data do not provide strong enough evidence against $H_0$, but $H_0$ might still be false.
- The p-value does **not** measure the probability that $H_0$ is true or false; it only measures data compatibility with $H_0$.

#### Example:

Suppose you want to test if a coin is fair.  
- $H_0$: The coin is fair (probability of heads = 0.5).  
- $H_1$: The coin is biased (probability of heads ≠ 0.5).

You flip the coin 100 times, get 60 heads, and compute a test statistic. The p-value tells you how likely it is to get a result at least this extreme (60 or more heads; for the two-sided alternative above, also 40 or fewer) assuming the coin is fair. If the p-value is below your threshold (e.g., 0.05), you reject $H_0$ and conclude the coin is likely biased.
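
For this example the p-value can be computed exactly from the binomial distribution. The sketch below (helper name chosen here) gives both the one-sided value for "60 or more heads" and the two-sided value that also counts "40 or fewer", which by symmetry of the fair coin is twice as large.

```python
from math import comb

def binomial_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_one_sided = binomial_tail(100, 60)      # 60 or more heads
p_two_sided = 2 * p_one_sided             # fair coin is symmetric: add 40 or fewer heads
print(p_one_sided, p_two_sided)
```

The one-sided value is below 0.05 while the two-sided one is slightly above it, so the conclusion depends on which alternative hypothesis was fixed before looking at the data.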


--- 
### **Facts about p-value**
1) Not the chance the hypothesis is true:
A p-value does not tell you the probability that the null hypothesis is true. Instead, it tells you how likely your results are if the null hypothesis were true.

2) Not the chance it's "just random":
A p-value is not the probability that your results happened by chance alone. It's based on the assumption that the null hypothesis is correct and measures how well your data fit that assumption.

3) The 0.05 rule is just a guideline:
The 0.05 cutoff for "significance" is a tradition, not a scientific rule. A result just below or above 0.05 should not be seen as automatically meaningful or meaningless.

4) Doesn't tell how big or important an effect is:
A small p-value doesn't mean the effect is big or important. Even tiny effects can be "significant" if the sample is large enough; that's why we also need to consider effect size.

---

### **Type I and Type II Errors**

**TYPE I ERRORS (false positives, or false alarms)**

- The null hypothesis is true, but incorrectly rejected.
- False positive probability is dictated by the significance level $\alpha$. 

aka *That pixel was just background but I think it's a real source.*

**TYPE II ERRORS (false negatives, or false dismissals)**

- The null hypothesis is false, but not rejected.
- The false-negative probability is denoted by $\beta$; its complement $(1-\beta)$ is called the ***detection probability***.

aka *That was a real galaxy but I missed it!*

For a sample of size $N$ (containing background noise and sources), where a fraction $a$ of the objects are true sources and $x_c$ is the classification threshold, the **expected number of spurious sources (Type I / false positives)** is 

$$ n_\mathrm{spurious} = N(1-a)\alpha = N(1-a)\int_{x_c}^\infty h_B(x)dx$$ 

and the **expected number of missed sources (Type II / false negatives)** is

$$ n_\mathrm{missed} = Na\beta = Na\int_0^{x_c}h_S(x)dx.$$

The **total number of classified sources** (that is number of instances where we reject the null hypothesis) is

$$ n_\mathrm{source} = Na - n_\mathrm{missed} + n_\mathrm{spurious} = N[(1-\beta)a + (1-a)\alpha].$$

The **sample completeness** (or **detection probability**) is defined as

$$ \eta = \frac{Na - n_\mathrm{missed}}{Na} = 1-\int_0^{x_c}h_S(x)dx = 1-\beta$$

Finally, the **sample contamination** is

$$ \epsilon = \frac{n_\mathrm{spurious}}{n_\mathrm{source}}$$

where $(1-\epsilon)$ is sometimes called the **classification efficiency**.
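
A numerical illustration of these formulas. Everything here is an assumption of the sketch: Gaussian background $h_B = \mathcal{N}(0, 1)$ and source $h_S = \mathcal{N}(3, 1)$ distributions (so the integrals start from $-\infty$ rather than 0), a source fraction $a = 0.1$, $N = 10{,}000$ objects and a threshold $x_c = 2$.

```python
from statistics import NormalDist

N, a, x_c = 10_000, 0.1, 2.0          # hypothetical sample size, source fraction, threshold
h_B = NormalDist(0.0, 1.0)            # assumed background distribution
h_S = NormalDist(3.0, 1.0)            # assumed source distribution

alpha = 1.0 - h_B.cdf(x_c)            # Type I: background above the threshold
beta = h_S.cdf(x_c)                   # Type II: source below the threshold

n_spurious = N * (1 - a) * alpha
n_missed = N * a * beta
n_source = N * a - n_missed + n_spurious
completeness = 1.0 - beta
contamination = n_spurious / n_source
print(n_spurious, n_missed, n_source, completeness, contamination)
```

Raising $x_c$ lowers the contamination but also lowers the completeness, which is the usual trade-off between the two error types.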

---

### **Kolmogorov-Smirnov (KS) Test**

- We would like to compare two samples and understand whether they were drawn from the same distribution.
- The KS test is a non-parametric test that compares a sample with a reference distribution, or two samples with each other.
- It measures the maximum distance between the two (empirical) CDFs $\rightarrow D = \max|F_1 - F_2|$
- Outputs a statistic $D$ and a p-value.
- Remarkably, the distribution of $D$ under the null hypothesis does not depend on the underlying distribution we care about.
- Useful to test goodness-of-fit.
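
The statistic $D$ itself is easy to compute by hand. The sketch below covers only the one-sample case (a sample against a reference CDF) and does not compute the p-value; the simulated data and the two reference distributions are illustrative choices.

```python
import random
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """One-sample KS statistic: largest gap between the empirical CDF and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    # The empirical CDF jumps from i/n to (i+1)/n at xs[i]; check both sides of each jump.
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))

rng = random.Random(8)
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
d_same = ks_statistic(sample, NormalDist(0.0, 1.0).cdf)       # reference matches the data
d_shifted = ks_statistic(sample, NormalDist(0.5, 1.0).cdf)    # reference is offset by 0.5
print(d_same, d_shifted)
```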

---

### **Histograms and Number of Bins**

Choosing the number of bins affects the histogram shape. The bin width is a hyperparameter that has to be tuned to recover the true underlying distribution:

- **Scott's Rule:**  
$$
\text{bin width} = \Delta_b =\frac{3.5 \times \sigma}{N^{1/3}}
$$
This is a good rule only if we know the $\sigma$ of the distribution; often it is not usable.

- **Freedman-Diaconis Rule:**  
$$
\text{bin width} = \Delta_b =  \frac{2 \times IQR}{N^{1/3}} = \frac{2.7 \times \sigma_G}{N^{1/3}}
$$
where $IQR$ is the interquartile range (between the 75% and 25% quantiles), $\sigma_G = 0.7413 \times IQR$ and $N$ is the number of data points.

- By making a histogram you lose some information, depending on the bin width you choose. The uncertainty on the bin height can be estimated with a simple rule: 
$$
  \sigma_k = \frac{\sqrt{n_k}}{\Delta_b \cdot N}
$$
where $N$ is the total number of data points, $n_k$ is the number of counts in the $k$-th bin and $\Delta_b$ is the bin width.
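
The Freedman-Diaconis width and the bin-height uncertainty can be sketched as follows. This is a standard-library illustration on synthetic Gaussian data; the function names and the way the bins are anchored at the minimum of the data are choices made for this example.

```python
import math
import random
import statistics

def freedman_diaconis_width(data):
    """Bin width 2 * IQR / N^(1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return 2.0 * (q3 - q1) / len(data) ** (1.0 / 3.0)

def density_histogram(data, width):
    """Normalized bin heights n_k / (width * N) and their errors sqrt(n_k) / (width * N)."""
    lo, n_total = min(data), len(data)
    n_bins = int((max(data) - lo) / width) + 1
    counts = [0] * n_bins
    for x in data:
        counts[int((x - lo) / width)] += 1
    heights = [n / (width * n_total) for n in counts]
    errors = [math.sqrt(n) / (width * n_total) for n in counts]
    return heights, errors

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]   # synthetic sample
width = freedman_diaconis_width(data)
heights, errors = density_histogram(data, width)
area = sum(heights) * width                            # should be 1 for a density
print(f"bin width = {width:.3f}, number of bins = {len(heights)}, area = {area:.3f}")
```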


---

### **Rug Plot**

- A simple plot showing individual data points as small vertical lines (ticks) along an axis.
- Useful to visualize the distribution of data points on top of other plots (like histograms or density plots).

---

### **Kernel Density Estimation (KDE)**

- The core idea is to place a distribution (the kernel) on each data point, rather than a Dirac delta.
- All these kernels are summed to produce the estimate of the PDF.
- Any distribution could be used as a kernel.

#### Common kernels:

- **Gaussian kernel:**  
$$
K(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}
$$

- **Epanechnikov kernel:**  
  $$
  K(x) = \frac{3}{4} (1 - x^2) \quad \text{for } |x| \leq 1, \quad 0 \text{ otherwise}
  $$
  Parabolic with finite support; more localized than the Gaussian.

- **Linear (triangular) kernel:**  
  $$
  K(x) = 1 - |x| \quad \text{for } |x| \leq 1, \quad 0 \text{ otherwise}
  $$  
  Linearly decreasing weights as distance from the reference point increases. **Less smooth** than Gaussian.


- **Uniform (or tophat) kernel:**  
  $$
  K(x) = \frac{1}{2} \quad \text{for } |x| \leq 1, \quad 0 \text{ otherwise}
  $$  
  Assigns equal weight to all points in a fixed window. Simple but **can produce less accurate estimates** due to lack of smoothness.


The KDE bandwidth controls the smoothness (similar to the bin width for histograms). It is a hyperparameter that has to be tuned before the analysis, typically with cross-validation. 
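
A minimal Gaussian-kernel KDE, to make the "sum of kernels" idea concrete. The data points and the bandwidth `h` are arbitrary values picked for this sketch; no bandwidth tuning is performed.

```python
import math

def gaussian_kernel(u):
    """Standard normal kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, bandwidth):
    """Density estimate f(x) = 1/(N h) * sum_i K((x - x_i) / h)."""
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in data)
    return total / (len(data) * bandwidth)

data = [-1.2, -0.8, -0.5, 0.1, 0.3, 0.9, 2.5]     # toy sample
h = 0.5                                           # bandwidth (assumed)
dx = 0.01
grid = [-6.0 + dx * i for i in range(1401)]       # from -6 to 8
density = [kde(x, data, h) for x in grid]
area = sum(density) * dx                          # numerical normalization check
print(f"integral of the KDE = {area:.4f}")
```

A smaller `h` gives a spikier estimate that follows individual points, a larger `h` a smoother one that can wash out real structure.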


## <span style="color:red"> **LECTURE 8** - Bayesian Inference I </span>

- Bayes recap  
- Bayesian method  
- Prior  
- 3 Bayesian principles  
- Credibility regions  


### **Bayes Recap: Principles and Rules**

Bayes' theorem allows us to **update our belief** about a hypothesis or a parameter after observing new data. The core formula is:

$P(\theta| D) = \dfrac{P(D | \theta) \cdot P(\theta)}{P(D)}$

Where:
- $\theta$: parameter values
- $D$: observed data
- $P(\theta)$: **prior**, the initial belief before seeing the data
- $P(D | \theta)$: **likelihood**, the probability of observing $D$ assuming $\theta$ is true
- $P(\theta | D)$: **posterior**, the updated belief after seeing the data
- $P(D)$: normalization constant (also called **evidence**)


$$
\text{Posterior probability} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}
$$

---

### **Bayesian Method** 

1. **Define the problem**: Choose the model and the parameter $\theta$ to estimate.
2. **Assign the prior** $P(\theta)$: Express your knowledge or assumptions about $\theta$ before the data.
3. **Define the likelihood** $P(D | \theta)$: Describe how the data is generated from $\theta$.
4. **Compute the posterior** $P(\theta | D)$: Using Bayes' theorem.
5. **Estimate the parameter**: Use the posterior to get a point estimate (e.g. MAP, mean, median).
6. **Quantify uncertainty**: Through credibility intervals, variance, etc.
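
The steps above can be sketched on a grid for a one-parameter problem. The example assumes a coin with unknown heads probability $\theta$, a binomial likelihood, a flat prior and toy data (7 heads in 10 flips); all of these are choices made for illustration.

```python
def grid_posterior(k, n, prior, n_grid=1001):
    """Posterior of the heads probability theta on a uniform grid.

    Likelihood: k heads in n flips (binomial, constant factor dropped).
    prior: any function of theta.
    """
    thetas = [i / (n_grid - 1) for i in range(n_grid)]
    unnorm = [prior(t) * t**k * (1 - t) ** (n - k) for t in thetas]
    d_theta = thetas[1] - thetas[0]
    norm = sum(unnorm) * d_theta              # plays the role of the evidence P(D)
    return thetas, [u / norm for u in unnorm]

thetas, post = grid_posterior(k=7, n=10, prior=lambda t: 1.0)   # flat prior
d_theta = thetas[1] - thetas[0]
theta_map = thetas[post.index(max(post))]                       # point estimate: MAP
theta_mean = sum(t * p for t, p in zip(thetas, post)) * d_theta # point estimate: mean
print(f"MAP = {theta_map:.3f}, posterior mean = {theta_mean:.3f}")
```

With a flat prior the MAP coincides with the maximum-likelihood value $k/n$, while the posterior mean is pulled slightly towards 0.5.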

---

### **Prior Distributions** 

In Bayesian statistics, the prior distribution represents what we believe about a parameter before observing any data. Choosing a prior is a crucial step because it influences the final result (the posterior).

There are two main types of priors:

1. **Informative Prior**

    - A prior that incorporates specific, pre-existing knowledge or beliefs about the parameter.

    - When to use it: When you have reliable background information, from previous experiments, expert opinion, or strong theoretical expectations.

Example:
Suppose you're estimating the heads probability of a possibly biased coin.
If previous tests suggest it lands heads ~70% of the time, you could use a Beta(7, 3) prior (mean 0.7), which reflects your prior belief.

2. **Uninformative (or non-informative) Prior**

    - A prior that is intentionally vague or flat, expressing no strong belief about the parameter before seeing the data.

    - When to use it: When you want the data to speak for itself and avoid influencing the result with prior assumptions.

Example:
For the same coin-flip scenario, if you have no idea about the bias, you might use a uniform prior over [0, 1], which assumes all probabilities are equally likely.



When no strong prior information is available, there are three main principles to guide the choice:

- **Principle of indifference**: Assign equal probabilities when there is no reason to prefer one value over another. 
- **Invariance principle**: The prior should remain consistent under reparameterization.
- **Maximum entropy**: Among all distributions compatible with known constraints, choose the one with the highest entropy (least informative).

If you choose the wrong prior, it's your fault (GIGO: garbage in, garbage out). Only an enormous amount of data can correct your initial belief: in that case you are **"data dominated"**, otherwise you are **"prior dominated"**.


---

### **Bayesian Credible Regions**

In the **frequentist paradigm**, the confidence interval $\mu_0 \pm \sigma_{\mu}$ is 
the interval that would contain the true $\mu$ (from which the data were drawn) in $68\%$ (or $X\%$) of the cases
in a large number of imaginary repeated experiments (each with a different set of $N$ values $\{x_i\}$). 

However, the meaning of the so-called **Bayesian credible region** is fundamentally different: it is the interval that **I believe** contains the true $\mu$ with a probability of $68\%$ (or $X\%$) after I've collected my data (my dear, one and only dataset; no imaginary repetitions). This credible region is the 
relevant quantity in the context of scientific measurements. 

There are several important features of a Bayesian posterior distribution:
- They represent how well we believe a parameter is constrained within a certain range
- We often quote the posterior maximum (**Maximum A Posteriori (MAP)**).
- We also often quote the posterior expectation value (i.e. mean) $\bar{\theta} = \int \theta\, p(\theta|D)d\theta$, or most often median (recall: robust estimator).
- **The credible regions are not unique**. We can compute them in two different ways
    1. We can integrate downwards from the MAP to enclose $X\%$ ("highest probability density interval"), or
    2. We can integrate inwards from each tail by $X/2\%$ ("equal-tailed interval")
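
Both kinds of credible region can be estimated directly from posterior samples. The sketch below uses samples from a skewed Gamma distribution as a stand-in for a posterior (an assumption of this example) and assumes the posterior is unimodal, so that the highest-density region is the shortest interval containing the requested fraction of samples.

```python
import random

def equal_tailed_interval(samples, level=0.68):
    """Cut (1 - level)/2 of the samples from each tail."""
    s = sorted(samples)
    n = len(s)
    lo = s[round(n * (1 - level) / 2)]
    hi = s[round(n * (1 + level) / 2) - 1]
    return lo, hi

def hpd_interval(samples, level=0.68):
    """Shortest interval containing `level` of the samples (unimodal case)."""
    s = sorted(samples)
    n = len(s)
    m = round(level * n)                               # samples inside the interval
    widths = [s[i + m - 1] - s[i] for i in range(n - m + 1)]
    i_best = widths.index(min(widths))
    return s[i_best], s[i_best + m - 1]

random.seed(1)
samples = [random.gammavariate(2.0, 1.0) for _ in range(20_000)]  # skewed toy posterior
et = equal_tailed_interval(samples)
hpd = hpd_interval(samples)
print(f"equal-tailed: [{et[0]:.2f}, {et[1]:.2f}]   HPD: [{hpd[0]:.2f}, {hpd[1]:.2f}]")
```

For a symmetric posterior the two intervals coincide; for a skewed one the highest-density interval is shorter and shifted towards the mode.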

## <span style="color:red"> **LECTURE 9** - Bayesian Inference II </span>

- Odds ratios  
- Bayes factors  
- Frequentist vs Bayesian  


Model comparison and hypothesis testing in Bayesian inference are enormously different from classical/frequentist statistics. We don't have p-values here. **In Bayesian inference, we probabilistically rank models based on how well they explain the data under our prior knowledge.** 

### **Odds Ratio $(\mathcal{O})$**

- The **Odds Ratio** is the ratio of the posterior probabilities of two competing models $M_1$ and $M_2$, given the data $D$ and the background information $I$:

  $$
    O_{21} = \frac{p(M_2|D,I)}{p(M_1|D,I)} = \frac{p(D\,|\,M_2,I)\,p(M_2\,|\,I)}{p(D\,|\,M_1,I)\,p(M_1\,|\,I)} \equiv B_{21} \, \frac{p(M_2\,|\,I)}{p(M_1\,|\,I)}
  $$

- It combines the **Bayes Factor** and the **prior odds** to update our belief in which model is more likely after seeing the data.

---

### **Bayes Factor $(\mathcal{B})$**

- The **Bayes Factor** is the ratio of the marginal likelihoods (evidences) of the two models:

  $$
  B_{21} = \frac{P(D | M_2, I)}{P(D | M_1, I)} = \frac{\mathcal{Z}_2}{\mathcal{Z}_1}
  $$

- $\mathcal{Z}$ is called the evidence.
- It measures how well each model explains the observed data, **independent of prior model probabilities**.

- Interpretation (Jeffreys scale):

  | $B_{21}$ Value         | Strength of Evidence for $M_2$ |
  |------------------------|-------------------------------|
  | $<1$                   | Evidence against $M_2$        |
  | $1 - 3$                | Weak                          |
  | $3 - 10$               | Moderate                      |
  | $10 - 100$             | Strong                        |
  | $>100$                 | Decisive                      |
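
A small worked example of a Bayes factor and odds ratio, for a case where both evidences are known in closed form. The two models are assumptions of this sketch: $M_1$ is a fair coin ($\theta = 0.5$, no free parameters) and $M_2$ is a coin with $\theta$ uniform on $[0, 1]$, for which the binomial likelihood integrates to $1/(n+1)$.

```python
import math

def evidence_fair_coin(k, n):
    """M1: theta fixed at 0.5, so the evidence is just the likelihood."""
    return math.comb(n, k) * 0.5**n

def evidence_free_coin(k, n):
    """M2: theta ~ Uniform(0, 1); the binomial likelihood integrates to 1/(n+1)."""
    return 1.0 / (n + 1)

def odds_ratio(k, n, prior_odds=1.0):
    """O_21 = B_21 * prior odds, with B_21 = Z_2 / Z_1."""
    bayes_factor = evidence_free_coin(k, n) / evidence_fair_coin(k, n)
    return bayes_factor * prior_odds

for k in (5, 9):
    print(f"{k} heads in 10 flips: O_21 = {odds_ratio(k, 10):.3f}")
```

With 5 heads out of 10 the simpler fair-coin model is preferred (the extra freedom of $M_2$ is penalized), while 9 heads out of 10 favours the free-$\theta$ model.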

---

### **Frequentist vs Bayesian Approach**

|              | **Frequentist**                                                 | **Bayesian**                                                        |
|----------------------|------------------------------------------------------------------|----------------------------------------------------------------------|
| **Parameters**        | Treated as **fixed but unknown** values                         | Treated as **random variables** with probability distributions       |
| **Probability**       | Interpreted as **long-run frequency** of events                 | Interpreted as **degree of belief** or subjective certainty          |
| **Data**              | Considered as **repeatable random samples** from a population   | Considered as **fixed once observed**; inference is updated via Bayes' rule |
| **Use of Prior**      | **Not used**; all inference is based on data                    | **Essential**; priors express beliefs before seeing the data         |
| **Inference**         | Based on **sampling distribution** and repeated hypothetical experiments | Based on **posterior distribution**, combining prior and data        |
| **Model Comparison**  | Uses **p-values, confidence intervals, likelihood ratios**      | Uses **Bayes factors, posterior probabilities, credible intervals**  |
| **Interpretation**    | Results apply to **what would happen in repeated experiments**  | Results apply to **the current data and model assumptions**          |



## <span style="color:red"> **LECTURE 10** - Bayesian Inference III </span>

- Monte Carlo  
- Markov chains (detailed balance)  
- MCMC (Markov Chain Monte Carlo)  
- Metropolis-Hastings algorithm  
- Corner plot  
- Trace plot  
- Burn-in  
- Steps

###  **Monte Carlo Methods**

- **Monte Carlo methods** are computational algorithms that rely on repeated **random sampling** to estimate numerical results.
- Widely used in physics, statistics, and Bayesian inference.
- Particularly helpful when analytical solutions are difficult or impossible.
- However, in high-dimensional spaces, standard Monte Carlo can become highly **inefficient**, which motivates the use of improved techniques like MCMC.
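
The classic toy example of a Monte Carlo estimate is the value of $\pi$ from the fraction of random points that fall inside a quarter circle. The function name and sample size below are arbitrary choices for this sketch.

```python
import random

def mc_pi(n_samples, seed=0):
    """Estimate pi from uniform random points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:          # point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / n_samples

estimate = mc_pi(100_000)
print(f"pi is approximately {estimate:.4f}")
```

The statistical error shrinks only as $1/\sqrt{N}$, which is one reason why naive sampling becomes expensive.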

---

###  **Markov Chains**

A **Markov chain** is a sequence of random variables (states) where the probability of moving to the next state depends **only on the current state**, not on the path taken to get there. This is called the **Markov property**, or **memorylessness**.

Mathematically, if you're in state $x$ now, the probability of moving to state $x'$ is the transition probability
$$
P(x \to x') \equiv P(x' \mid x)
$$

#### **Stationary Distribution**

A **stationary distribution** $\pi(x)$ is a probability distribution over the states that **remains unchanged** as the Markov chain evolves.

It satisfies the condition:
$$
\sum_{x} \pi(x) \cdot P(x \to x') = \pi(x') \quad \text{for every } x'
$$

This means that the probability of being in state $x'$ at the next step is the sum of the probabilities of arriving at $x'$ from every state $x$, taking into account:
- how likely it was to be in each of those states ($\pi(x)$), and
- the probability of transitioning from $x$ to $x'$ ($P(x \to x')$).

---

### **Detailed Balance Condition**

A **sufficient condition** for a Markov chain to have a stationary distribution $\pi(x)$ is the **detailed balance condition**:

$$
\pi(x) \cdot P(x \to x') = \pi(x') \cdot P(x' \to x)
$$

This condition says that the **flow of probability** from state $x$ to $x'$ is the same as from $x'$ to $x$.  
It implies **reversibility** of the chain and ensures that $\pi(x)$ is a stationary distribution.
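
A two-state chain is enough to see both the stationary distribution and detailed balance at work. The transition matrix below is an arbitrary example; repeatedly applying it to any starting distribution converges to $\pi$.

```python
def step(dist, P):
    """One step of the chain: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# P[i][j] = probability of moving from state i to state j (rows sum to 1)
P = [[0.9, 0.1],
     [0.5, 0.5]]

dist = [1.0, 0.0]            # start with certainty in state 0
for _ in range(200):
    dist = step(dist, P)     # converges to the stationary distribution

flow_01 = dist[0] * P[0][1]  # probability flow 0 -> 1
flow_10 = dist[1] * P[1][0]  # probability flow 1 -> 0
print(f"stationary distribution = {dist}, flows = {flow_01:.4f}, {flow_10:.4f}")
```

The stationary distribution here is $(5/6, 1/6)$, and the two flows are equal, as detailed balance requires.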


---

### **Markov Chain Monte Carlo (MCMC)**

- **MCMC** combines the idea of Monte Carlo sampling with Markov chains to draw samples from complex distributions.
- The key idea is to construct a Markov chain whose stationary distribution is the **target distribution** $\pi(x)$ (typically the posterior in Bayesian inference).
- Even if $\pi(x)$ is only known **up to a normalization constant**, MCMC methods can still be used to sample from it.


####  Why Use MCMC?

In Bayesian inference, we are often interested in the **posterior distribution**, but computing it explicitly is hard.  
MCMC allows us to **approximate** this distribution by generating samples from it, rather than calculating it directly.

This makes MCMC a powerful and flexible tool for inference in complex models.

---

### **Metropolis-Hastings Algorithm**

- A widely used MCMC algorithm.
- Generates a sequence of samples that approximates $\pi(x)$.

#### Steps:

1. Start from an initial point $x$.
2. Propose a new point $x'$ from a **proposal distribution** $T(x'|x)$.
3. Compute the **acceptance probability** $\alpha$:
    $$
    \alpha = \min\left(1, \frac{\pi(x') \cdot T(x|x')}{\pi(x) \cdot T(x'|x)}\right)
    $$  

    If the proposal distribution is **symmetric**, i.e. $T(x'|x) = T(x|x')$, the acceptance probability simplifies to:

    $$
    \alpha = \min\left(1, \frac{\pi(x')}{\pi(x)}\right)
    $$
4. Draw a uniform random number $u$ between 0 and 1. If $u < \alpha$, accept $x'$; otherwise stay at $x$ (the current point is recorded again in the chain).
5. Repeat the process to create a Markov chain.
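
The steps above can be sketched as a short random-walk Metropolis sampler with a symmetric Gaussian proposal. The target (an unnormalized standard normal), the starting point, the step size and the burn-in length are all assumptions made for this example.

```python
import math
import random

def metropolis(log_target, x0, n_steps, step_size, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x, logp = x0, log_target(x0)
    chain, n_accepted = [], 0
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, step_size)          # propose x' ~ T(x'|x)
        logp_new = log_target(x_new)
        accept_prob = math.exp(min(0.0, logp_new - logp))  # min(1, pi(x')/pi(x))
        if rng.random() < accept_prob:
            x, logp = x_new, logp_new
            n_accepted += 1
        chain.append(x)                                # a rejected step repeats x
    return chain, n_accepted / n_steps

def log_target(x):
    """Unnormalized log-density of a standard normal."""
    return -0.5 * x * x

chain, acc = metropolis(log_target, x0=5.0, n_steps=50_000, step_size=2.4)
burned = chain[1000:]                                  # discard the early steps
mean = sum(burned) / len(burned)
var = sum((x - mean) ** 2 for x in burned) / len(burned)
print(f"acceptance = {acc:.2f}, mean = {mean:.3f}, variance = {var:.3f}")
```

Working with log-densities avoids numerical underflow when the target takes very small values.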


#### How Can We Compute $\alpha$ If We Don't Know $\pi(x)$?

In Bayesian inference, the target distribution $\pi(x)$ is typically the **posterior**:

$$
\pi(x) = \frac{p(x) \cdot p(\text{data} \mid x)}{p(\text{data})}
$$
Let's plug this expression for $\pi(x)$ into the acceptance ratio:

$$
\frac{\pi(x')}{\pi(x)} = \frac{p(x') \cdot p(\text{data} \mid x')}{p(x) \cdot p(\text{data} \mid x)}
$$

Notice:  
The term $p(\text{data})$ appears in both numerator and denominator, **so it cancels out**!

This means you only need:

- The **prior** $p(x)$  
- The **likelihood** $p(\text{data} \mid x)$

Both of which are known or chosen by the modeler.
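
As a concrete case, the ratio $\pi(x')/\pi(x)$ can be evaluated from the prior and the likelihood alone. The sketch assumes a coin with heads probability `theta`, a flat prior on $(0, 1)$ and toy data (7 heads in 10 flips); the evidence never appears.

```python
import math

K_HEADS, N_FLIPS = 7, 10          # toy data (assumed for this example)

def log_prior(theta):
    """Flat prior on (0, 1)."""
    return 0.0 if 0.0 < theta < 1.0 else -math.inf

def log_likelihood(theta):
    """Binomial log-likelihood, constant term dropped."""
    return K_HEADS * math.log(theta) + (N_FLIPS - K_HEADS) * math.log(1.0 - theta)

def log_unnormalized_posterior(theta):
    lp = log_prior(theta)
    if not math.isfinite(lp):
        return -math.inf          # outside the prior support
    return lp + log_likelihood(theta)

def posterior_ratio(theta_new, theta_old):
    """pi(theta_new) / pi(theta_old); p(data) cancels and is never needed."""
    return math.exp(log_unnormalized_posterior(theta_new)
                    - log_unnormalized_posterior(theta_old))

ratio_inside = posterior_ratio(0.7, 0.5)
ratio_outside = posterior_ratio(1.2, 0.5)
print(ratio_inside, ratio_outside)
```

A proposal outside the prior support gets a ratio of zero and is therefore always rejected.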


**Pay attention:** MCMC gives you a set of samples drawn from the posterior PDF, not the PDF itself! To recover $\pi(x)$ you have to run a density estimation method (e.g. KDE) on the samples.

---

### **Burn-in and Plots**

- The early steps of MCMC may not represent the target distribution well.
- The **burn-in period** refers to the initial segment of the chain that is discarded.
- A **trace plot** shows the sampled values as a function of the step number; it is used to check convergence.
- A **corner plot** (also called pair plot) is used to visualize multidimensional posterior distributions: it shows the 1D histogram of each parameter and the 2D joint distribution of each pair of parameters.

---

#### **Correlation Length**

In an MCMC simulation, **correlation length** (also called **autocorrelation time**) refers to how many steps it takes for samples in the chain to become approximately **independent** from each other.

- If samples are highly correlated, the chain is moving slowly through the space, and many steps are needed to obtain independent samples.
- The **effective number of samples** is smaller than the total number of steps taken.

This is why understanding and **reducing correlation** is important to improve sampling efficiency.

We often define the correlation length $\tau$ such that:

$$
\text{Effective samples} \approx \frac{N_\text{total}}{\tau}
$$
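
One simple way to estimate $\tau$ is the integrated autocorrelation time $\tau = 1 + 2\sum_k \rho_k$, where $\rho_k$ is the autocorrelation at lag $k$. The sketch below truncates the sum at the first non-positive $\rho_k$, which is a crude heuristic chosen for brevity, and compares uncorrelated noise with a correlated AR(1) sequence used as a stand-in for an MCMC chain.

```python
import random

def autocorr_time(chain, max_lag=500):
    """Integrated autocorrelation time, summed up to the first non-positive rho_k."""
    n = len(chain)
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    var = sum(c * c for c in centered) / n
    tau = 1.0
    for lag in range(1, min(max_lag, n - 1)):
        rho = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n * var)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau

rng = random.Random(0)
n_total = 10_000
white = [rng.gauss(0.0, 1.0) for _ in range(n_total)]      # independent samples
ar1 = [0.0]
for _ in range(n_total - 1):
    ar1.append(0.9 * ar1[-1] + rng.gauss(0.0, 1.0))         # strongly correlated samples

tau_white = autocorr_time(white)
tau_ar1 = autocorr_time(ar1)
print(f"white noise: tau = {tau_white:.1f}, effective samples = {n_total / tau_white:.0f}")
print(f"AR(1) chain: tau = {tau_ar1:.1f}, effective samples = {n_total / tau_ar1:.0f}")
```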

---

#### **Step Size Tuning**

The **step size** controls how far the chain jumps between states. Choosing it well is crucial:

- If the step is **too small**: the chain moves slowly and samples are highly correlated, so the exploration is inefficient.
- If the step is **too large**: the chain proposes states far from the current one, and many of them get **rejected**, which is again inefficient.

The **goal** is to balance **acceptance rate** and **decorrelation**.

Typical strategy:
- Tune the step size to achieve a **moderate acceptance rate** (e.g. ~20-40% for Metropolis-Hastings).
- Monitor the **autocorrelation** or use **diagnostic plots** (e.g. trace plot) to check if the chain is mixing well.

There is no universal best step size: it depends on the shape of the target distribution and the algorithm used.
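
The effect of the step size on the acceptance rate can be checked with a quick experiment. The target (a standard normal) and the three step sizes are arbitrary choices for this sketch.

```python
import math
import random

def acceptance_rate(step_size, n_steps=20_000, seed=0):
    """Fraction of accepted random-walk Metropolis proposals for a standard-normal target."""
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, step_size)
        log_ratio = 0.5 * (x * x - x_new * x_new)        # log[pi(x') / pi(x)]
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x, accepted = x_new, accepted + 1
    return accepted / n_steps

rates = {s: acceptance_rate(s) for s in (0.1, 2.4, 50.0)}
for s, r in rates.items():
    print(f"step size {s:5.1f}: acceptance rate = {r:.2f}")
```

The tiny step is almost always accepted but barely moves; the huge step is almost always rejected. Neither explores the target efficiently.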



## <span style="color:red"> **LECTURE 11** - Bayesian Inference IV </span>

- Thinning  
- Adaptive Metropolis  
- Single Component Adaptive Metropolis  
- Hamiltonian Monte Carlo  
- Emcee  
- Gibbs sampling  

### **Thinning**

**Definition**: Thinning is the practice of keeping only every *k*-th sample from an MCMC chain (e.g., every 10th sample).

- The goal of MCMC is to approximate the **posterior distribution** by generating samples from it.
- However, successive samples from the MCMC are usually **autocorrelated**.
- Thinning attempts to reduce this autocorrelation by discarding intermediate samples.
- **Note**: Modern practice often recommends storing all samples and addressing autocorrelation during post-processing, since thinning can discard useful information and reduce effective sample size unnecessarily.

---

### **Adaptive Metropolis**

**Definition**: An MCMC method that adapts the proposal distribution $T$ based on the history of the chain.

- In the **Metropolis-Hastings** algorithm, choosing a good proposal distribution is crucial.
- Adaptive Metropolis (AM) automatically tunes the **covariance matrix** of the proposal distribution as the chain progresses.
- This allows better exploration of the posterior, especially in high-dimensional or correlated parameter spaces.
- This method uses the entire history of the chain, not only the last point, so the chain is no longer Markovian.
- To fix this, we often let the algorithm "learn" during an initial phase (called the tuning stage), where it adapts the proposal. After that, we stop adapting and keep the proposal fixed. From that moment on, the chain is Markovian again and gives valid Bayesian results.

---

### **Single Component Adaptive Metropolis (SCAM)**

**Definition**: A variant of Adaptive Metropolis where only one parameter (component) is updated at a time.

- Standard MCMC methods like Metropolis-Hastings (and AM) suffer from low acceptance rates in high-dimensional spaces.
- SCAM is especially useful when parameters have **different scales or conditional dependencies**.
- At each iteration, only one dimension of the parameter vector is updated, often using an adaptive univariate proposal.
- This can be more efficient than updating all parameters jointly, especially in the presence of strong correlations.
- The adaptation improves sampling efficiency over time.

---

### **Other Methods**

**Hamiltonian Monte Carlo**: An MCMC algorithm that uses concepts from physics (Hamiltonian dynamics) to make informed proposals in parameter space, improving efficiency in exploring complex posterior distributions.


**Differential Evolution**: A population-based optimization algorithm that can be adapted for MCMC by evolving a set of candidate solutions using differences between randomly selected members of the population to guide proposals.

---

### **`emcee`**

- It is a Python package.
- `emcee` is designed to efficiently sample from **complex, anisotropic posterior distributions**.
- It uses multiple parallel "**walkers**" that share information and adapt proposals to the geometry of the target distribution.
- The sampler needs a starting guess for the walkers. It does not need to be very precise: the chain will eventually converge to the target distribution.
- You also need to choose the **number of steps** of the chain and the **burn-in region** to discard.
- It provides tools to estimate the **autocorrelation length** and to thin the chain accordingly.

---

### **Gibbs Sampling**

**Gibbs sampling** is a type of Markov Chain Monte Carlo (MCMC) algorithm that avoids the need for rejection: **every proposed sample is accepted**. It's especially efficient when the conditional distributions of the parameters are known and easy to sample from.


#### **How It Works**

1. **Initialize** the sampler at some starting point in parameter space.
2. **Iterate over each parameter** in turn.
3. For each parameter:
   - Keep all other parameters fixed.
   - Sample a new value from the **conditional posterior distribution** of that parameter (which you must know).
4. Repeat this process for **many iterations (Gibbs steps)** to build your Markov chain.
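
A standard textbook case where the conditionals are known is a bivariate normal with unit variances and correlation $\rho$: each conditional is $N(\rho\,\cdot\text{other}, 1-\rho^2)$. The sketch below uses that target (chosen for this example) and checks that the samples reproduce the correlation.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_steps, seed=0):
    """Gibbs sampler for a zero-mean, unit-variance bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sigma = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    chain = []
    for _ in range(n_steps):
        x = rng.gauss(rho * y, cond_sigma)    # draw x | y, always accepted
        y = rng.gauss(rho * x, cond_sigma)    # draw y | x, always accepted
        chain.append((x, y))
    return chain

chain = gibbs_bivariate_normal(rho=0.8, n_steps=20_000)
n = len(chain)
mx = sum(x for x, _ in chain) / n
my = sum(y for _, y in chain) / n
cov = sum((x - mx) * (y - my) for x, y in chain) / n
sx = math.sqrt(sum((x - mx) ** 2 for x, _ in chain) / n)
sy = math.sqrt(sum((y - my) ** 2 for _, y in chain) / n)
corr = cov / (sx * sy)
print(f"sample correlation = {corr:.3f}")
```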


####  **Key Features**

-  Every sample is **automatically accepted** (no rejection step).
-  Very **fast and efficient**, especially in high-dimensional spaces.
-  **Short burn-in period**: reaches equilibrium quickly.
-  **Limitation**: Requires knowledge of the **conditional distributions** of all parameters.


Use it when:
- You know how to compute and sample from the **conditional posterior distributions**.
- You need a **simple, efficient MCMC** method.

---

### **Conjugate Prior**

**Definition**:  
In Bayesian statistics, a **conjugate prior** is a prior distribution that, when combined with a particular **likelihood function**, results in a **posterior distribution** that is in the same family as the prior.

- It simplifies calculations.
- The posterior has a known and tractable form.
- Useful for analytical solutions and understanding posterior updates.
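
The best-known conjugate pair is a Beta prior with a binomial likelihood: a Beta$(a, b)$ prior updated with $k$ successes in $n$ trials gives a Beta$(a + k, b + n - k)$ posterior. The numbers below are toy values for illustration.

```python
def beta_binomial_update(a, b, k, n):
    """Beta(a, b) prior + k successes in n trials -> Beta(a + k, b + n - k) posterior."""
    return a + k, b + (n - k)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

a_prior, b_prior = 7, 3                       # informative prior with mean 0.7
a_post, b_post = beta_binomial_update(a_prior, b_prior, k=3, n=10)
print(f"prior mean = {beta_mean(a_prior, b_prior):.2f}")
print(f"posterior = Beta({a_post}, {b_post}), mean = {beta_mean(a_post, b_post):.2f}")
```

No integration or sampling is needed: the update is just arithmetic on the two parameters.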




## <span style="color:red"> **LECTURE 12** - Bayesian Inference V </span>

- Savage-Dickey ratio  
- Nested Sampling (Dynesty)

### **Savage-Dickey Density Ratio**

It's a shortcut to compare two models when one is a **special case** of the other.

Let's say:

- $M_1$: the **simple model**, where a parameter $A = 0$ (e.g., "no signal"; $A$ stands for amplitude)
- $M_2$: the **full model**, where $A$ can be anything (e.g., "signal allowed")

Then, instead of computing evidence for both models, we use this trick:

$$
\mathcal{B} = \frac{p(A = 0)}{p(A = 0 \mid \text{data})}
$$

This is the **Bayes factor** between $M_2$ and $M_1$.

#### What Does It Mean?

- $p(A = 0)$ is how much we believed in $A = 0$ **before seeing the data**.
- $p(A = 0 \mid \text{data})$ is how much we believe in $A = 0$ **after seeing the data**.

If the data makes $A = 0$ **less likely**, then $M_1$ is disfavored.


#### How Do We Use It?

1. Run MCMC on the full model $M_2$.
2. From the samples, estimate $p(A = 0 \mid \text{data})$.
   - For example, use a histogram or KDE on the samples of $A$.
3. Calculate $\mathcal{B}$ using the ratio.
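
The procedure can be tried on a toy problem where the exact answer is known. In this sketch the nested value is $\theta = 0.5$ (a fair coin) instead of $A = 0$, the full model has a flat prior on $\theta$, and direct draws from the known Beta posterior stand in for MCMC samples; all of this is assumed for illustration.

```python
import math
import random

k, n = 15, 20                               # toy data: 15 heads in 20 flips
rng = random.Random(0)
# Stand-in for an MCMC run on the full model: with a flat prior the
# posterior of theta is Beta(k + 1, n - k + 1), so we sample it directly.
samples = [rng.betavariate(k + 1, n - k + 1) for _ in range(200_000)]

def density_at(samples, x0, half_width=0.02):
    """Histogram-style density estimate: fraction of samples in a small window around x0."""
    inside = sum(1 for s in samples if abs(s - x0) < half_width)
    return inside / (len(samples) * 2.0 * half_width)

prior_density = 1.0                         # flat prior on [0, 1] evaluated at 0.5
posterior_density = density_at(samples, 0.5)
bayes_factor_21 = prior_density / posterior_density

# Exact evidence ratio for comparison: Z_2 = 1/(n+1), Z_1 = C(n,k) * 0.5^n
exact = (1.0 / (n + 1)) / (math.comb(n, k) * 0.5**n)
print(f"Savage-Dickey estimate = {bayes_factor_21:.2f}, exact = {exact:.2f}")
```

The estimate gets noisy when the nested value lies far in the tail of the posterior, because few samples land near it.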


#### Why Is This Useful?

- You don't need to compute full evidences $\mathcal{Z}_1$ and $\mathcal{Z}_2$.
- Fast and simple, especially when comparing null hypotheses.


| Concept | Description |
| --- | --- |
| **Bayes Factor** | A **general method** to compare two models ($M_1$ and $M_2$) by taking the ratio of their **evidences**: $\mathcal{B}_{21} = \frac{P(\text{data} \mid M_2)}{P(\text{data} \mid M_1)}$. It tells you how much more the data supports one model over the other. Requires full integration over parameter space. |
| **Savage-Dickey Ratio** | A **special case** of the Bayes factor, used **only** when: <br>- $M_1$ is a **special case** of $M_2$ (e.g. $A=0$) <br>- You can compute the **prior** and **posterior density** at a single point (e.g. $A=0$). <br><br>Then: $\mathcal{B}_{21} = \frac{p(A=0)}{p(A=0 \mid \text{data})}$. It's a shortcut that **gives the Bayes factor** without needing to compute full evidences. |


---

### **Nested Sampling**

Nested Sampling is a method to **compute the evidence** $\mathcal{Z}$ in Bayesian inference:

$$
\mathcal{Z} = \int \mathcal{L}(\theta) \, \pi(\theta) \, d\theta
$$

This is crucial for **comparing models**, since:

$$
\text{Bayes Factor} = \frac{\mathcal{Z}_2}{\mathcal{Z}_1}
$$

But computing $\mathcal{Z}$ is hard, especially in high dimensions. That's where Nested Sampling comes in.


#### How It Works

1. Start with $N$ random points from the **prior**.
2. Find the one with the **lowest likelihood**, call it $L_{\text{min}}$.
3. Remove it, and **replace** it with a new point sampled from the prior **subject to** $\mathcal{L} > L_{\text{min}}$.
4. Keep track of the shrinking prior volume and the associated likelihoods.
5. Approximate the 1D integral $\mathcal{Z}$ using the sequence of $(X_i, \mathcal{L}_i)$ points.
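
The steps above can be sketched for a 1D toy problem: a uniform prior on $[0, 1]$ and a narrow Gaussian likelihood, for which $\mathcal{Z} \approx \sigma\sqrt{2\pi}$. The prior volume after $i$ iterations is approximated by its expected value $X_i = e^{-i/N}$, and new points are found by plain rejection sampling, which only works for toy problems like this one. All names and settings are assumptions of this sketch.

```python
import math
import random

def likelihood(theta, mu=0.5, sigma=0.05):
    """Unnormalized Gaussian likelihood (toy choice)."""
    return math.exp(-0.5 * ((theta - mu) / sigma) ** 2)

def nested_sampling(n_live=400, n_iter=2400, seed=0):
    rng = random.Random(seed)
    live = [rng.random() for _ in range(n_live)]          # N draws from the prior
    live_L = [likelihood(t) for t in live]
    Z, X_prev = 0.0, 1.0
    for i in range(1, n_iter + 1):
        worst = live_L.index(min(live_L))                 # lowest-likelihood live point
        L_min = live_L[worst]
        X = math.exp(-i / n_live)                         # expected remaining prior volume
        Z += L_min * (X_prev - X)                         # add one slab of the 1D integral
        X_prev = X
        while True:                                       # new prior draw with L > L_min
            theta = rng.random()
            if likelihood(theta) > L_min:
                break
        live[worst], live_L[worst] = theta, likelihood(theta)
    Z += X_prev * sum(live_L) / n_live                    # contribution of the final live points
    return Z

Z_est = nested_sampling()
Z_true = 0.05 * math.sqrt(2.0 * math.pi)
print(f"Z estimate = {Z_est:.4f}, analytic value = {Z_true:.4f}")
```

Real implementations replace the rejection step with much smarter ways of sampling inside the likelihood contour.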

In Python, the `dynesty` library implements this algorithm.

---

### **Very important** 
- Samples that come out of a nested sampling run are **weighted**. The result of nested sampling is the samples and their weights *together*. Do not use the samples by themselves: it doesn't make sense.
- If an MCMC is taking too long, you can interrupt it and use the samples you got so far (assuming you're past the burn-in period). This is not possible with nested sampling: samples don't make any sense until the algorithm has reached the top of the posterior! You have to let it run until the end.


## <span style="color:red"> **LECTURE 13** - Data Mining & Machine Learning </span>

- What machine learning is and its main purpose
- Features / samples / classes / instances
- What scikit-learn is and how it expects the data
- The following scikit-learn methods (`model.fit`, `model.predict`, `model.predict_proba`, `model.score`, `model.transform`)
- The scikit-learn estimator object
- Supervised learning (classification and regression), Netflix example
- KNN, briefly
- Unsupervised learning: what changes compared to before (clustering, dimensionality reduction), with a quick and simple explanation of each
- PCA, briefly
- Isomap
- K-means clustering, briefly
- Model validation (confusion matrix, training set / test set)

### **What is Machine Learning and Its Main Purpose**

Machine Learning (ML) is a branch of artificial intelligence where computers learn patterns from data to make decisions or predictions without being explicitly programmed. The main goal is to build models that generalize well on new, unseen data.

---

### **Key Terms**

- **Features:** The input variables or attributes used to describe each data point (e.g., height, weight).  
- **Sample / Instance:** A single data point or observation with its features (e.g., one person's measurements).  
- **Classes:** Categories or labels that data points belong to in classification tasks (e.g., cat, dog).  
- **Target:** The output or label we want to predict.

---

### **`Scikit-learn`** 

Scikit-learn is a popular Python library for machine learning. It provides easy-to-use tools for data preprocessing, modeling, and evaluation.

- **Data format:**  
  - Input features: a 2D array/matrix of shape $(n\_samples, n\_features)$. It must always be in this form; scikit-learn is very picky.
  - If your `X` is just 1D, you have to reshape it to 2D, e.g. with `X[:, np.newaxis]` or `X.reshape(-1, 1)`.
  - Target labels: a 1D array of length $n\_samples$.

---

### **Common Scikit-learn Methods**

An **estimator** in scikit-learn is any object implementing at least the methods `.fit()` and `.predict()` or `.transform()`. It is usually called `model`.

- **`model.fit(X, y)`:** Trains the model on data $X$ with labels $y$.  
- **`model.predict(X)`:** Predicts labels for new data $X$.  
- **`model.predict_proba(X)`:** Gives the probability estimates for classification classes (if available).  
- **`model.score(X, y)`:** Returns a performance score on data $X$ compared to the true labels $y$ (mean accuracy for classifiers, $R^2$ for regressors).  
- **`model.transform(X)`:** Applies a transformation to data $X$ (used in dimensionality reduction, feature extraction).
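
To show the `fit` / `predict` / `score` pattern without depending on any library, here is a toy k-nearest-neighbours classifier written in plain Python. `TinyKNNClassifier` is a hypothetical class made up for this sketch; it only imitates the scikit-learn interface and is not part of scikit-learn.

```python
from collections import Counter

class TinyKNNClassifier:
    """Toy k-nearest-neighbours classifier with a scikit-learn-like interface."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # X: list of samples, each a list of features; y: one label per sample
        self.X_, self.y_ = list(X), list(y)     # KNN simply memorizes the training set
        return self

    def predict(self, X):
        labels = []
        for x in X:
            dist = [sum((a - b) ** 2 for a, b in zip(x, xt)) for xt in self.X_]
            nearest = sorted(range(len(dist)), key=dist.__getitem__)[: self.n_neighbors]
            votes = Counter(self.y_[i] for i in nearest)
            labels.append(votes.most_common(1)[0][0])   # majority vote
        return labels

    def score(self, X, y):
        pred = self.predict(X)
        return sum(p == t for p, t in zip(pred, y)) / len(y)   # accuracy

X_train = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
y_train = ["a", "a", "a", "b", "b", "b"]
model = TinyKNNClassifier(n_neighbors=3).fit(X_train, y_train)
pred = model.predict([[0.1, 0.1], [5.0, 5.0]])
acc = model.score(X_train, y_train)
print(pred, acc)
```

Note that `X_train` has the 2D shape (samples, features) and `y_train` is 1D, the same layout scikit-learn expects.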

---


### **Supervised Learning**

In supervised learning, the model learns from labeled data:

- **Classification:** Predicting discrete labels. We use the properties of a labeled dataset to predict the labels of unlabeled data.
- **Regression:** Predicting continuous values.

---

### **Unsupervised Learning**

Unsupervised learning deals with unlabeled data:

- **Clustering:** Grouping similar data points.  
- **Dimensionality Reduction:** Reducing the number of features while preserving structure; useful in data visualization.
- **Density Estimation:** Determining the distribution of the data within the parameter space.

The main difference is that no label is provided; the goal is to find hidden patterns or structure.

---


### **Model Validation**
Determine how well your model will generalize from the training dataset to future unlabeled data.

- **Confusion Matrix:** A matrix showing true vs. predicted classes to evaluate classification accuracy and errors. The elements on the diagonal are the correctly identified samples; the off-diagonal elements are the misclassified (confused) ones.
- **Training Set:** Data used to train the model.  
- **Test Set:** Separate data used to evaluate model performance on unseen (but labeled) data.
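
A small standard-library sketch of a random train/test split and of a confusion matrix. The labels and predictions below are invented for the example, and both helper functions are illustrative rather than library code.

```python
import random

def train_test_split(X, y, test_fraction=0.25, seed=0):
    """Randomly split the data into a training part and a test part."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(test_fraction * len(X)))
    test, train = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train], [X[i] for i in test],
            [y[i] for i in train], [y[i] for i in test])

def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] = number of samples with true label i predicted as label j."""
    pos = {lab: i for i, lab in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[pos[t]][pos[p]] += 1
    return matrix

y_true = ["star", "star", "galaxy", "galaxy", "galaxy", "star"]
y_pred = ["star", "galaxy", "galaxy", "galaxy", "star", "star"]
cm = confusion_matrix(y_true, y_pred, labels=["star", "galaxy"])
accuracy = (cm[0][0] + cm[1][1]) / len(y_true)     # diagonal = correct predictions
print(cm, f"accuracy = {accuracy:.2f}")

X_tr, X_te, y_tr, y_te = train_test_split(list(range(8)), list("aabbaabb"))
print(len(X_tr), len(X_te))
```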



## <span style="color:red"> **LECTURE 14** - Clustering </span>

- What hyperparameters are, with examples
- Cross-validation for hyperparameter tuning
- Training / validation / test 
- K-fold cross-validation
- Combining with MCMC if the maximum lies between two grid points
- Clustering: we don't know how it does it, but it works
- Mean shift clustering


### **Hyperparameters**

Hyperparameters are parameters that are **not learned from the data**, but set **before** the learning process begins. They control the learning process and model structure.
They can easily fool us into thinking something wrong about the data.

### Examples:
- Number of clusters in K-means ($K$)
- Number of bins in a histogram
- Depth of a decision tree
- Bandwidth in Kernel Density Estimation (KDE)

These are typically chosen via **validation** methods like **cross-validation** (see below).

---


### **Training / Validation / Test Sets**

We can divide the dataset into:

1. **Training set**: Used to **fit** the model.
2. **Validation set**: Used to **tune hyperparameters** and select the best model version.
3. **Test set**: Used only **at the end** to report the final unbiased performance of the selected model.

But we know that less data is bad for ML, and a single split also makes the result dependent on what ends up in each set (think of outliers: if they fall into the test set, they give a different result).
We can solve this problem with cross-validation (CV).

---

### **K-Fold Cross Validation**

Cross-validation is a technique to **evaluate the generalization ability** of a model by partitioning the data into multiple subsets.

### Why it matters:
- Helps prevent **overfitting** and **underfitting**
- Makes full use of data (especially important with small datasets)
- Gives a better estimate of model performance

K-fold is the most common form of cross-validation:

1. Split the training data into $K$ equal parts (folds).
2. For each fold:
   - Use $K-1$ folds for training
   - Use the remaining fold for validation
3. Repeat $K$ times so every point gets to be in the validation set once.
4. Choose the hyperparameter value with the best average score over the $K$ validation folds.
5. Use the test data to evaluate the final model.

We can take this to the extreme by setting $K = N$, the number of data points; this is called "leave-one-out" cross-validation. It dramatically increases the computational cost, but each model is trained on almost all of the data and the result no longer depends on how the folds happen to be chosen.
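
The steps above can be sketched with scikit-learn. This is a minimal example, not the lecture's own code: the synthetic two-bump data, the bandwidth grid, and the choice of the KDE bandwidth as the hyperparameter are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KernelDensity

# Synthetic 1D data (assumed for illustration): two Gaussian bumps
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 1.0, 200)])[:, None]

# Hold out a test set that is never used during tuning
x_train, x_test = train_test_split(x, test_size=0.2, random_state=0)

# 5-fold CV over candidate bandwidths; the score is the log-likelihood of
# each validation fold under the KDE fitted on the other 4 folds
bandwidths = np.linspace(0.05, 1.5, 30)
search = GridSearchCV(KernelDensity(kernel="gaussian"),
                      {"bandwidth": bandwidths}, cv=5)
search.fit(x_train)

best_h = search.best_params_["bandwidth"]
# Final check on the untouched test set (total log-likelihood)
test_loglike = search.best_estimator_.score(x_test)
print(best_h, test_loglike)
```

The test set enters only in the last line, after the bandwidth has been fixed by the validation folds.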

---


## **Clustering**

Clustering is **unsupervised learning**: grouping similar data points without knowing the labels.  
It aims to discover structure in data by finding clusters (dense regions) of similar observations.

---

### **K-Means Clustering**

**K-Means** is a simple and popular **centroid-based** clustering algorithm.

#### How it works:
1. Choose the number of clusters $k$ (a hyperparameter).
2. Initialize $k$ **centroids** (usually randomly).
3. Assign each data point to the **nearest centroid**.
4. Update centroids as the **mean of points assigned** to each cluster.
5. Repeat steps 3-4 until convergence (no significant change in centroids or assignments).

**Objective:** Minimize the **within-cluster sum of squares** (WCSS):

$$
\text{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2
$$

where $C_i$ is the set of points in cluster $i$, $\mu_i$ is the centroid of $C_i$, and $k$ is the number of clusters. You are minimizing the squared distance between the points and the centroid of their cluster.
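
A minimal NumPy sketch of the algorithm above (Lloyd's iteration), assuming random data points as initial centroids and two made-up, well-separated blobs as test data. The function name `kmeans` and its arguments are illustrative, not part of any library.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: returns centroids, labels and the WCSS."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids as k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to the nearest centroid
        dist2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        # Step 4: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and within-cluster sum of squares
    dist2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dist2.argmin(axis=1)
    wcss = dist2[np.arange(len(X)), labels].sum()
    return centroids, labels, wcss

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, (100, 2)),
               rng.normal([5, 5], 0.5, (100, 2))])
centroids, labels, wcss = kmeans(X, k=2)
```

In practice `sklearn.cluster.KMeans` does the same job with smarter initialization and several restarts.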

#### Strengths:
- Efficient and scalable
- Easy to implement

#### Weaknesses:
- Requires choosing $k$
- Assumes spherical, equally sized clusters
- Sensitive to initialization
- Sensitive to outliers

---

### **Mean Shift Clustering**

**Mean Shift** is a **non-parametric**, **density-based**, **centroid-shifting** clustering algorithm.

#### How it works (Density Gradient Ascent):
1. Define a **window (kernel)** around each data point, typically a Gaussian with bandwidth $h$.
2. Compute the **mean of data points** within the window.
3. Move (shift) the window center toward the **mean**.
4. Repeat steps 2-3 until convergence (i.e., the center stops moving).
5. Merge points converging to the same center into a **single cluster**.


It performs **gradient ascent** on a kernel density estimate (KDE). The mode (maximum) of the KDE becomes the cluster center.
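
A minimal NumPy sketch of these steps with a Gaussian kernel. The function `mean_shift`, the toy blobs, and the rule of merging end points closer than $h/2$ are assumptions chosen for illustration, not a reference implementation.

```python
import numpy as np

def mean_shift(X, h, n_iter=200, tol=1e-5):
    """Shift every point uphill on a Gaussian KDE of bandwidth h until it stops moving."""
    modes = X.copy()
    for _ in range(n_iter):
        # Gaussian weights of all data points as seen from each current position
        d2 = ((modes[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-0.5 * d2 / h**2)
        # Weighted mean of the data: the new position of each window
        shifted = (w @ X) / w.sum(axis=1, keepdims=True)
        moved = np.abs(shifted - modes).max()
        modes = shifted
        if moved < tol:
            break
    # Merge end points closer than h/2 into the same cluster (a simple heuristic)
    centers, labels = [], np.empty(len(X), dtype=int)
    for i, m in enumerate(modes):
        for j, c in enumerate(centers):
            if np.linalg.norm(m - c) < h / 2:
                labels[i] = j
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return np.array(centers), labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (100, 2)),
               rng.normal([5, 5], 0.5, (100, 2))])
centers, labels = mean_shift(X, h=1.0)
```

Note that the number of clusters is never passed in: it comes out of the number of distinct modes found for the chosen bandwidth.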


#### Strengths:
- **Does not require predefining the number of clusters**
- Can identify clusters of **arbitrary shape**
- **Robust** to outliers
- Works well when clusters correspond to **modes in the density**

#### Weaknesses:
- Computationally expensive (especially on large datasets)
- **Bandwidth selection** is critical: too large merges clusters, too small splits them
- Does not scale well in high dimensions




## <span style="color:red"> **LECTURE 15** - Dimensional Reduction I </span>

- Curse of dimensionality
- PCA (apply a transform to the data such that the new axes are aligned with the directions of maximal variance; an orthogonalization in several dimensions, keeping fewer than the original number because some are discarded; in the end it is a diagonalization)
- Data preparation for PCA: subtract the mean, divide by the standard deviation, normalize each sample
- In galaxy spectra, every peak sits on top of a background; that is physics, not noise, but PCA can remove it
- Scree plot: the first 2 components explain 96% of the variance in our example
- Interpreting PCA results
- PCA is linear: it struggles with non-linear structure
- Reconstruction of dark areas with PCA
- Overview of non-negative matrix factorization
- Overview of ICA (just know it exists)

### **Curse of Dimensionality**

As the number of features (dimensions) increases:
- The volume of space increases exponentially.
- Data becomes **sparse**, even if you have a lot of it.
- Models struggle to generalize well.
- Distance metrics (like Euclidean distance) lose meaning: all points start to look equally distant.

Example: If each feature has a 50% chance of matching, the probability that all $n$ match is $0.5^n$. Even with just 4 features, that's only 6.25%!

---
### **PCA (Principal Component Analysis)**

PCA is a technique to **reduce dimensionality** by projecting data to a new space:
- New axes are the directions of **maximum variance**.
- These axes (principal components) are **orthogonal**.
- Redundant dimensions are **discarded**.
- The process is equivalent to **diagonalizing the covariance matrix**.

#### Steps:
1. **Center the data**: Subtract the mean.
2. **Scale the data**: Divide by the standard deviation.
3. **Normalize samples** (optional, done for spectral images).
4. Compute the **covariance matrix**.
5. Compute **eigenvectors/eigenvalues**.
6. Sort eigenvectors by decreasing eigenvalue → these are the principal components (the eigenvalues reflect the variance along each component).

Once you have the eigenvectors $e_j(k)$, you can reconstruct a data vector $x_i(k)$ in the eigenvector basis as:

$$
    x_i(k) = \mu(k) + \sum_j \theta_{ij}e_j(k)
$$
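
A minimal NumPy sketch of the steps and of the reconstruction formula, on made-up data. The scaling and per-sample normalization steps are skipped here to keep the example short; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (assumed): 200 samples, 5 features, most variance along 2 hidden directions
latent = rng.normal(size=(200, 2)) * [3.0, 1.0]
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

mu = X.mean(axis=0)                      # center the data
Xc = X - mu
C = np.cov(Xc, rowvar=False)             # covariance matrix (5 x 5)
eigvals, eigvecs = np.linalg.eigh(C)     # eigen-decomposition (C is symmetric)
order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

theta = Xc @ eigvecs                     # coefficients theta_ij of sample i on component j
X_full = mu + theta @ eigvecs.T          # sum over all j: exact reconstruction
n_keep = 2
X_trunc = mu + theta[:, :n_keep] @ eigvecs[:, :n_keep].T   # sum truncated to 2 components
```

With all components the reconstruction is exact; truncating the sum is what performs the dimensionality reduction.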

#### PCA Limitations

- **Linear**: PCA can't handle **non-linear structures** in data.
- Struggles with **curved manifolds** (e.g., spirals).
- All the components are needed to explain 100% of the variance.
- How many components should we keep? Cross-validation is the answer.

See the lecture video for a better understanding of how it works.

---

### **Scree Plot & Explained Variance**

A **scree plot** shows eigenvalues (variance explained) by each principal component.

In our example the first 2 components explain 96% of the variance:
- So you can reduce your data to 2D while keeping most information.
- Useful for **visualization** and **noise reduction**.
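
The numbers behind a scree plot can be obtained as below. This sketch uses the iris dataset bundled with scikit-learn as a stand-in (an assumption: it is not the lecture's example), and the 95% threshold is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                     # 150 samples, 4 features
pca = PCA().fit(X)                       # keep all components

ratio = pca.explained_variance_ratio_    # fraction of variance per component (scree plot values)
cumulative = np.cumsum(ratio)
# Smallest number of components whose cumulative variance reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(ratio.round(3), n_keep)
```

Plotting `ratio` against the component index gives the scree plot; `cumulative` shows how much information is kept when truncating.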

---

### **Dark Area Reconstruction with PCA**

You can use PCA to reconstruct **missing or corrupted data** (e.g., missing regions in astronomical images):
- Fit PCA on complete data.
- Project corrupted sample into the PC space.
- Reconstruct using the leading PCs → fill in missing values based on structure learned from the rest.
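
One simple way to do the projection step when some values are missing is to fit the PCA coefficients by least squares on the observed entries only. The sketch below assumes noise-free toy data built from 3 hidden patterns; it illustrates the idea and is not necessarily the procedure used in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
# Complete training data (assumed): 300 samples of 20 "pixels" built from 3 hidden patterns
patterns = rng.normal(size=(3, 20))
X = rng.normal(size=(300, 3)) @ patterns

# Fit PCA on the complete data via SVD of the centered matrix
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
E = Vt[:3]                                # leading 3 principal components (rows)

# A new sample with a "dark area": pixels 5..9 are missing
x_true = rng.normal(size=3) @ patterns
observed = np.ones(20, dtype=bool)
observed[5:10] = False

# Find the coefficients using only the observed pixels (least squares),
# then rebuild the whole sample, including the missing pixels
coeffs, *_ = np.linalg.lstsq(E[:, observed].T,
                             x_true[observed] - mu[observed], rcond=None)
x_rec = mu + coeffs @ E
```

Because the toy data lie exactly in a 3-dimensional subspace, the missing pixels are recovered essentially exactly; with real, noisy data the reconstruction is only approximate.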

---

### **Overview: Non-Negative Matrix Factorization (NMF)**

**NMF** is a technique that factorizes a matrix $X$ into the product of two matrices:

$$
X \approx WH
$$

with the important constraint that **all elements of** $W$ and $H$ are **non-negative** (i.e., no negative numbers allowed).

This makes the results easier to interpret in many real-world cases, especially when the data naturally can't be negative (like pixel intensities or word counts).

**Applications:**
- Discovering topics in a collection of documents
- Breaking down images into basic components
- Analyzing spectral data in astronomy or chemistry

Compared to **PCA**, which can use negative values, **NMF tends to produce more interpretable results**, often representing **distinct parts** of the input (like separate topics or objects).
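
A minimal sketch with scikit-learn's `NMF`, on made-up non-negative data mixing two "parts". The data and the settings (`init`, `max_iter`) are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Non-negative toy data (assumed): 100 samples mixing 2 non-negative parts
parts = rng.uniform(0, 1, size=(2, 30))
X = rng.uniform(0, 1, size=(100, 2)) @ parts

model = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
W = model.fit_transform(X)    # (100, 2) non-negative weights
H = model.components_         # (2, 30) non-negative parts
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Both factors contain only non-negative entries, so each sample is described as an additive combination of parts.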

---

### **Just Know It Exists: ICA (Independent Component Analysis)**

**ICA** is another method for decomposing data, but instead of focusing on variance (like PCA), it looks for **independent** components.

That means it tries to separate a complex signal into **underlying sources** that are as statistically **independent** from each other as possible.

**Typical use cases:**
- **Blind source separation** (e.g., separating different voices recorded by multiple microphones)
- Analyzing EEG brain signals
- Uncovering independent trends in financial time series


---


## <span style="color:red"> **LECTURE 16** - Dimensional Reduction II </span>

- random forest
- manifold learning techniques
- Locally Linear Embedding (what is it, scheme of what is doing)
- IsoMap (what is it, scheme of what is doing)
- t-SNE (overview)
- Density estimation (recap)
- Non-parametric density estimation (KDE, nearest-neighbor density estimation)
- Parametric density estimation (Gaussian mixture models)


PCA, ICA and NMF perform poorly on the handwritten digits dataset, so let's look at other dimensionality reduction methods that are more helpful in this case.

### **Random Forest**

**Random Forest** is a machine learning method for making **predictions**, such as: Is this email spam? What's the price of this house?

- It builds **many decision trees**.
- Each tree gives its own answer.
- Then it **combines the answers**:
  - For classification (like spam/not spam): it chooses the **most common** answer (majority vote).
  - For regression (like price): it takes the **average**.

How the trees are built:

- Each tree is trained on a **random sample** of the data drawn with replacement (a bootstrap sample).
- Each tree looks at only a **random set of features** when making decisions.

**Random Forest = Many random trees working together to make smart predictions!**

**Key Idea**: Combine many weak learners (trees) into a strong learner.
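
A minimal sketch with scikit-learn's `RandomForestClassifier`, on made-up data with three well-separated classes. The data and the parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy labelled data (assumed): 3 well-separated classes in 4 dimensions
centers = [[0, 0, 0, 0], [6, 6, 6, 6], [-6, 6, -6, 6]]
X, y = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,       # number of trees
    max_features="sqrt",    # random subset of features tried at each split
    bootstrap=True,         # each tree sees a bootstrap sample of the training set
    random_state=0,
)
forest.fit(X_train, y_train)
accuracy = forest.score(X_test, y_test)   # fraction of correct test predictions
```

For regression the analogous class is `RandomForestRegressor`, which averages the predictions of the trees.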

---

## **Manifold Learning Techniques**

Manifold learning methods are **non-linear dimensionality reduction techniques** that assume data lies on a low-dimensional manifold embedded in a high-dimensional space. These techniques aim to **uncover the underlying structure** of the data.


### **Locally Linear Embedding (LLE)**

LLE is a **non-linear dimensionality reduction** algorithm that preserves the local neighborhood geometry around each point.

**What is it?**

- Assumes each data point and its neighbors lie on a locally linear patch of the manifold.
- Computes weights that best reconstruct each point from its neighbors.
- Finds low-dimensional embeddings that best preserve these local relationships.

**Scheme:**

1. For each point, identify $k$ nearest neighbors.
2. Compute weights $w_{ij}$ to reconstruct point $x_i$ **from** its neighbors:  
   $x_i \approx \sum_j w_{ij} x_j$, where $j$ runs over the neighbors of $x_i$.
3. Find low-dimensional representations $y_i$ that are best reconstructed by the same weights, i.e. that minimize $\sum_i \left\| y_i - \sum_j w_{ij} y_j \right\|^2$.
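
A minimal sketch with scikit-learn's `LocallyLinearEmbedding` on the swiss-roll toy dataset. The dataset and the value of `n_neighbors` are assumptions for illustration.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 3D points lying on a rolled-up 2D sheet; t is the position along the roll
X, t = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Y = lle.fit_transform(X)     # (500, 2) embedding that tries to keep the local weights
```

The number of neighbors is a hyperparameter: too few breaks the manifold into pieces, too many violates the locally linear assumption.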


### **IsoMap**

IsoMap is a **global non-linear dimensionality reduction** method that preserves geodesic distances between points.
This method assumes data lies on a smooth manifold.

**Scheme:**

1. Construct a neighborhood graph (e.g., $k$-nearest neighbors).
2. Compute shortest paths (geodesic distances) between all pairs of points along the graph, e.g. with Dijkstra's algorithm.
3. Apply classical MDS (Multi-Dimensional Scaling) to the geodesic distance matrix.

IsoMap preserves the **intrinsic geometry** of the data better than PCA in non-linear settings.
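
A minimal sketch with scikit-learn's `Isomap`, again on the swiss-roll toy dataset (an assumption for illustration, as is the value of `n_neighbors`).

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, t = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)

# Neighborhood graph -> shortest paths -> classical MDS, all inside fit_transform
iso = Isomap(n_neighbors=10, n_components=2)
Y = iso.fit_transform(X)     # (500, 2) "unrolled" coordinates
```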


### **t-SNE** 

**t-SNE** is a tool that helps you **visualize data** with many features (high-dimensional) in just **2D or 3D**.

What it does:

- It takes complex data (with lots of numbers/features) and shows it as a simple **2D or 3D plot**.
- Points that are **similar** in the original data end up **close together** in the plot.
- It's really good at showing **clusters** or groups in the data.

How it works:

- Figures out how similar the data points are.
- Then tries to **keep those similarities** when showing the data in 2D or 3D.
- Uses special math (like **probabilities** and **Student-t distribution**) to do it well.


In short : **t-SNE = A smart way to draw complex data in 2D or 3D so you can spot patterns.**


---

### **Summary Table of Dimensionality Reduction Methods**

| Method     | Type        | Preserves           | Suitable for Visualization | Parametric | Main Use Case                           |
|------------|-------------|---------------------|----------------------------|------------|-----------------------------------------|
| PCA        | Linear      | Global variance     | Yes                        | Yes        | General reduction, noise filtering      |
| NMF        | Linear      | Parts-based         | Limited                    | Yes        | Topic modeling, text data               |
| ICA        | Linear      | Indep. sources      | No                         | Yes        | Signal separation (e.g., EEG, audio)    |
| LLE        | Non-linear  | Local linearity     | Yes                        | No         | Manifold learning, visualizing clusters |
| IsoMap     | Non-linear  | Geodesic distances  | Yes                        | No         | Unfolding non-linear structures         |
| t-SNE      | Non-linear  | Local similarities  | Yes                        | No         | Visualization of high-dim. data         |



---

### **Density Estimation (Recap)**

Density estimation aims to model the **probability distribution** of a dataset based on observed data.

Two broad classes:

1. **Parametric**: Assumes a specific distribution (e.g., Gaussian).
2. **Non-Parametric**: Makes fewer assumptions; adapts to data complexity.

---

### **Non-Parametric Density Estimation**

**Kernel Density Estimation (KDE)**

- Places a kernel (e.g., Gaussian) at each data point.
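- Estimates the density at a point $x$ by summing the contributions of all the kernels: $\hat{f}(x) = \frac{1}{n h} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$ (in one dimension, with bandwidth $h$).
- The bandwidth controls the smoothness of the resulting density.

We can think of it as replacing each point with a probability cloud.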
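
A minimal NumPy sketch of a one-dimensional Gaussian KDE, with made-up data and an arbitrary bandwidth. The function name is illustrative.

```python
import numpy as np

def gaussian_kde_1d(x_grid, data, h):
    """f_hat(x) = 1/(n h) * sum_i K((x - x_i)/h) with a standard normal kernel K."""
    u = (x_grid[:, None] - data[None, :]) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=300)
x_grid = np.linspace(-8, 8, 1601)
density = gaussian_kde_1d(x_grid, data, h=0.4)
area = np.sum(density) * (x_grid[1] - x_grid[0])   # numerical integral, close to 1
```

Re-running with a smaller or larger `h` shows the estimate going from spiky to over-smoothed.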

**Nearest-Neighbor Density Estimation**

- Estimates density based on the volume $V_k$ containing the $k$ nearest neighbors of $x$.
- Formula:  
  $\hat{f}(x) = \frac{k}{n V_k}$

- Adapts well to **local variations** in data density.
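
A minimal one-dimensional sketch of the formula, where the "volume" $V_k$ is the length of the interval reaching the $k$-th nearest neighbor on both sides. The data and the value of `k` are made up for illustration.

```python
import numpy as np

def knn_density_1d(x_query, data, k):
    """f_hat(x) = k / (n * V_k), where in 1D V_k = 2 * (distance to the k-th neighbor)."""
    dists = np.abs(x_query[:, None] - data[None, :])
    d_k = np.sort(dists, axis=1)[:, k - 1]
    return k / (len(data) * 2.0 * d_k)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=1000)
x_query = np.array([0.0, 1.0, 3.0])
f_hat = knn_density_1d(x_query, data, k=30)   # highest near 0, lowest in the tail
```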

---

### **Parametric Density Estimation: Gaussian Mixture Models (GMM)**

GMM assumes that the data is generated from a **mixture of several Gaussian distributions**.

**Definition:**

- Probability density function:  
  $p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$  
  where $\pi_k$ are the mixing coefficients, and $\mathcal{N}$ is the multivariate normal.

**Estimation:**

- Parameters $(\pi_k, \mu_k, \Sigma_k)$ are learned using the **Expectation-Maximization (EM)** algorithm.

GMMs are widely used in **clustering**, **anomaly detection**, and **density modeling**.
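
A minimal sketch with scikit-learn's `GaussianMixture`, on made-up one-dimensional data drawn from two Gaussians. The number of components is assumed known here; in practice it is a hyperparameter.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])[:, None]

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # fitted with EM
means = np.sort(gmm.means_.ravel())    # estimates of mu_k
weights = gmm.weights_                 # mixing coefficients pi_k, sum to 1
log_density = gmm.score_samples(X)     # log p(x) at each point
labels = gmm.predict(X)                # most probable component per point (clustering)
```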



## <span style="color:red"> **LECTURE 17** - Regression I </span>

- Regression
- Bayesian regression
- Linear regression (homoscedastic, scikit-learn `LinearRegression()`)
- Polynomial regression
- Basis function regression
- Kernel regression / Nadaraya-Watson regression
- Over- / underfitting: cross-validation for the best model
- Trend of RMS or BIC vs degree of the polynomial fit
- Learning curves

### **Regression (What is it?)**

Regression is the **supervised** process of finding the relation between $x$ and $y$.

That is, for a given $x$, instead of trying to estimate the **full probability distribution function (PDF)** of $y$, we often settle for a **point estimate**: the expected value of $y$ given $x$.

Crudely: regression = **curve fitting**: finding the best function that explains the observed data.

In contrast with **unsupervised learning** (like clustering), regression **requires labeled data** ‚Äî pairs of $(x_i, y_i)$.

---

### **Bayesian Regression**

In **regular regression** (like least squares), we try to find **one best-fit line** through the data.

But in **Bayesian regression**, we don't just pick one line: we look at **many possible lines**, and figure out how likely each one is.

How it works (in simple terms):

- We **start with a belief** (called a **prior**) about what the model could look like.
- Then we **update that belief** using the data we observe (this gives us the **posterior**).
- The result is a **range of possible models**, not just one.

What makes it special:

- It gives **probabilistic predictions**: we get a prediction *and* how uncertain it is.
- It's **regularized by priors**, meaning it avoids overfitting by starting with assumptions.
- It's great when we want to **include uncertainty** in our results.


When to use it:

- When you care about **uncertainty** in predictions
- When data is **limited** or **noisy**
- When you want to **combine prior knowledge** with data
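
One readily available implementation is scikit-learn's `BayesianRidge`, which puts Gaussian priors on the coefficients. The sketch below uses made-up straight-line data and is only one of many ways to do Bayesian regression.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 1.0, size=30)   # toy data: slope 2, intercept 1

model = BayesianRidge()          # Gaussian prior on the coefficients
model.fit(X, y)

X_new = np.array([[2.0], [5.0], [20.0]])
# Prediction and its uncertainty; the error bar grows when extrapolating to x = 20
y_mean, y_std = model.predict(X_new, return_std=True)
```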


---

### **Linear Regression (Homoscedastic)**

This models the response $y$ as a **linear function of inputs**:

$$
y = \theta_1 x + \theta_0 + \epsilon
$$

Where $\epsilon$ is a noise term assumed to have **constant variance** (homoscedasticity):

$$
\epsilon \sim \mathcal{N}(0, \sigma^2)
$$

Each data point restricts the set of plausible lines in parameter space $(\theta_0, \theta_1)$. As more points are added, the intersection of these constraints narrows.

```python
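import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # toy data: y = 2x + 1 + Gaussian noise
X_train = rng.uniform(0, 10, size=(50, 1))
y_train = 2 * X_train[:, 0] + 1 + rng.normal(0, 0.5, size=50)
model = LinearRegression()
model.fit(X_train, y_train)  # intercept_ ~ theta_0, coef_ ~ theta_1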
```


### **Linear Regression (Heteroscedastic)**

This still models the response $y$ as a **linear function of inputs**:

$$
y = \theta_1 x + \theta_0 + \epsilon
$$

But now, the **variance of the noise** is **not constant** across all data points; this is called **heteroscedasticity**:

$$
\epsilon \sim \mathcal{N}(0, \sigma^2(x))
$$

This means that:

- Some data points are more "reliable" (lower variance).
- Others are noisier and should influence the model **less**.
- The model should **give different weights** to different data points when fitting.

If the errors are different for each point, it is better to think of the problem in matrix notation:

$$
Y = M \theta
$$

where $Y$ is an $N$-dimensional vector of values $y_i$,

$$
Y = 
\begin{bmatrix}
y_0 \\
\vdots \\
y_{N-1}
\end{bmatrix}.
$$

For the straight line model, $\theta$ is simply a two-dimensional vector of regression coefficients,

$$
\theta =
\begin{bmatrix}
\theta_0 \\
\theta_1
\end{bmatrix},
$$

and $M$ is called the design matrix

$$
M =
\begin{bmatrix}
1 & x_0 \\
\vdots & \vdots \\
1 & x_{N-1}
\end{bmatrix},
$$

where the constant in the first column of $M$ captures the zeropoint (i.e. the constant $y$-intercept) in the regression.
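
With Gaussian errors of known covariance $C$ (diagonal, with entries $\sigma_i^2$, for independent points), the standard maximum-likelihood solution is $\theta = (M^T C^{-1} M)^{-1} M^T C^{-1} Y$. This formula is a textbook result rather than something stated above; the sketch below applies it to made-up data with error bars that grow with $x$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
sigma = 0.2 + 0.3 * x                        # per-point error bars (assumed known)
y = 1.0 + 2.0 * x + rng.normal(0, sigma)     # true theta = (1, 2)

M = np.column_stack([np.ones_like(x), x])    # design matrix
C_inv = np.diag(1.0 / sigma**2)              # inverse covariance of the errors

A = M.T @ C_inv @ M
theta = np.linalg.solve(A, M.T @ C_inv @ y)  # best-fit (theta_0, theta_1)
theta_cov = np.linalg.inv(A)                 # covariance of the estimate
```

Points with small $\sigma_i$ get a large weight $1/\sigma_i^2$, which is exactly the "noisier points influence the model less" behaviour described above.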


### **Multivariate Regression**

It works as before, but instead of having only the two variables $x$ and $y$, you can add more input variables, such as $y = ax + bz + ck + \dots$
The coefficients $a, b, c, \dots$ are again obtained through the **design matrix**, which now has one column per variable.

---

### **Polynomial Regression**

Extends linear regression by adding polynomial terms:

$$
y = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d + \epsilon
$$

This is still **linear in parameters**, just not linear in $x$.

- More expressive models
- Risk of **overfitting** for large degree $d$

```python
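from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# X has shape (n_samples, 1), y has shape (n_samples,)
model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
model.fit(X, y)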
```

For a cubic polynomial (degree 3), for example, the design matrix becomes:

$$
M = \begin{pmatrix}
1 & x_0 & x_0^2 & x_0^3 \\
1 & x_1 & x_1^2 & x_1^3 \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{N-1} & x_{N-1}^2 & x_{N-1}^3
\end{pmatrix}.
$$


---

### **Basis Function Regression**

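We can replace the powers of $x$ with arbitrary **basis functions** $\phi_j(x)$, keeping the model linear in the parameters: $y = \sum_j \theta_j \phi_j(x) + \epsilon$.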

Examples:
- Polynomial: $\phi_j(x) = x^j$
- Gaussian: $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2\sigma^2}\right)$
- Fourier: $\phi_j(x) = \cos(jx), \sin(jx)$

By choosing a suitable basis, you can fit almost any shape.
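
A minimal sketch of Gaussian basis function regression: build a design matrix with one column per basis function and solve the linear least-squares problem. The target function, the number of Gaussians and their width are assumptions for illustration.

```python
import numpy as np

def gaussian_design_matrix(x, centers, width):
    """Columns: a constant term, then one Gaussian phi_j(x) per center mu_j."""
    phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    return np.column_stack([np.ones_like(x), phi])

x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)                                  # noise-free target, for illustration

centers = np.linspace(0, 2 * np.pi, 10)        # evenly spaced Gaussians
M = gaussian_design_matrix(x, centers, width=0.7)
theta, *_ = np.linalg.lstsq(M, y, rcond=None)  # linear least squares in theta
y_fit = M @ theta
rms = np.sqrt(np.mean((y_fit - y) ** 2))
```

Swapping `gaussian_design_matrix` for a polynomial or Fourier version changes the basis without changing the fitting step.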

---

### **Kernel Regression / Nadaraya-Watson Estimator**

In the case of Gaussian Basis Regression, Gaussians are evenly spaced over the range of interest. If we instead placed Gaussians at the location of every data point, we get Gaussian Kernel Regression instead. Or just Kernel Regression more generally since we don't have to have a Gaussian kernel function. It is also called Nadaraya-Watson regression.

This smooths the data without fitting a fixed global model.

The best bandwidth is found, as usual, with cross-validation.
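
A minimal NumPy sketch of the Nadaraya-Watson estimator with a Gaussian kernel: the prediction at $x$ is the kernel-weighted average of the $y_i$. The noisy sine data and the bandwidth are made up for illustration.

```python
import numpy as np

def nadaraya_watson(x_query, x, y, h):
    """Gaussian-kernel weighted average of y_i, with one kernel centered on each x_i."""
    w = np.exp(-0.5 * ((x_query[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.2, size=200)
x_query = np.linspace(0, 10, 50)
y_smooth = nadaraya_watson(x_query, x, y, h=0.5)
```

Since the output is a weighted average with positive weights, it always stays between the smallest and largest observed $y_i$.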

#### **Kernel Regression vs Kernel Density Estimation (KDE)**

| Feature             | Kernel Regression (Nadaraya-Watson)              | Kernel Density Estimation (KDE)              |
|---------------------|--------------------------------------------------|----------------------------------------------|
| **Goal**            | Predict $y$ for a given $x$                      | Estimate the **density** of the data         |
| **Input**           | Pairs $(x_i, y_i)$                               | Single variable $x_i$                        |
| **Output**          | Smoothed estimate of $y$ as function of $x$      | Probability density function over $x$        |
| **Formula**         | Weighted average of $y_i$'s                      | Weighted sum of kernels centered at $x_i$    |
| **Kernel center**   | Centered at **each $x_i$**                       | Also centered at **each $x_i$**              |
| **Kernel function** | Usually Gaussian or other symmetric functions    | Same (Gaussian, Epanechnikov, etc.)          |
| **Bandwidth**       | Controls smoothing (chosen via cross-validation) | Controls smoothing (can use rules or CV)     |
| **Used for**        | Non-parametric regression                        | Density estimation / plotting distributions  |

---

### **Overfitting / Underfitting: Cross-Validation**

- **Underfitting**: Model too simple → can't capture patterns
- **Overfitting**: Model too complex → memorizes noise

Use **cross-validation** to estimate model performance:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

# model, X, y as defined above; cv=5 means 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print("Mean score:", np.mean(scores))
```

The cross-validation score helps select:
- Best model complexity (e.g., polynomial degree)
- Regularization parameters

---

### **RMS or BIC vs Polynomial Degree**

More regression coefficients improve the ability of the model to fit all the points (reduced bias), but at the expense of model complexity and variance. Of course we can pass a polynomial of degree $N-1$ exactly through $N$ data points, but that would be foolish. We'll determine the best trade-off between bias and variance through cross-validation.

When we increase the complexity of a model, the data points fit the model more and more closely. However, this process does not necessarily result in a better fit to the data. Rather, if the degree is too high, then we are overfitting the data. The model has high variance, meaning that a small change in a training point can change the model dramatically.

We can evaluate this using a training set, a cross-validation set and a test set.

Plotting **RMS or BIC vs polynomial degree** for both the CV set and training set can help choose the optimal degree, where adding complexity no longer improves performance.

#### Root Mean Squared Error (RMS):

$$
\text{RMS} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
$$

#### Bayesian Information Criterion (BIC):

$$
\text{BIC} = k \ln(n) - 2 \ln(\hat{L})
$$

- $k$: number of parameters  
- $n$: number of observations  
- $\hat{L}$: maximum likelihood  

#### Interpretation of RMS and BIC

For low order, both the training and CV error are high. This is a sign of a high-bias model that is underfitting the data.  
For high order, the training error becomes small (by definition), but the CV error is large. This is the sign of a high-variance model that is overfitting the data.  
The BICs give similar results.  
We'd like to minimize the RMS or BIC, and the minimum should be the same.
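
A minimal sketch of this comparison on made-up data from a cubic model. It assumes Gaussian errors with the variance set to its maximum-likelihood value $\text{RSS}/n$, so that $-2\ln\hat{L} = n \ln(2\pi\,\text{RSS}/n) + n$, and it counts only the polynomial coefficients as parameters; both choices are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=60)
y = 1 - 2 * x + 3 * x**3 + rng.normal(0, 0.3, size=60)   # true model: cubic
X = x[:, None]

degrees = np.arange(1, 11)
train_rms, cv_rms, bic = [], [], []
for d in degrees:
    model = make_pipeline(PolynomialFeatures(degree=int(d)), LinearRegression())
    resid_train = y - model.fit(X, y).predict(X)           # residuals on the training data
    resid_cv = y - cross_val_predict(model, X, y, cv=5)    # residuals on held-out folds
    train_rms.append(np.sqrt(np.mean(resid_train**2)))
    cv_rms.append(np.sqrt(np.mean(resid_cv**2)))
    # BIC = k ln(n) - 2 ln(L_hat), with k = d + 1 polynomial coefficients
    n, k = len(y), d + 1
    neg2_lnL = n * np.log(2 * np.pi * np.mean(resid_train**2)) + n
    bic.append(k * np.log(n) + neg2_lnL)

best_cv = degrees[int(np.argmin(cv_rms))]
best_bic = degrees[int(np.argmin(bic))]
```

Plotting `train_rms` and `cv_rms` against `degrees` reproduces the behaviour described above: the training error keeps decreasing, while the CV error eventually turns back up.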


---

### **Learning Curves**

Learning curves show **train vs validation error** as the dataset size increases.

Key patterns:
- **High bias**: both train and val errors are high → increase model complexity
- **High variance**: large gap between train and val → add more data or regularize

```python
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(model, X, y, cv=5)
```

Plot to diagnose model behavior and data sufficiency. We can see two regimes:

- The training and CV errors have converged. This indicates that the model is dominated by bias. Increasing the number of training points is futile. If the error is too high, you instead need a more complex model, not more training data.
- The training error is smaller than the CV error. This indicates that the model is dominated by variance. Increasing the number of training points may help to improve the model.


## <span style="color:red"> **LECTURE 18** - Regression II </span>

- Regularization
- Ridge regression
- LASSO regularization
- Differences and similarities between ridge and LASSO
- Locally linear regression (LOWESS / LOESS) (overview)
- Non-linear regression (overview)
- Gaussian process regression



### **Regularization**

When we make models more complex (for example by using very high-degree polynomials), they can start to fit the training data *too* well. This is called **overfitting**. It means the model learns not only the true pattern but also the noise in the data. As a result, the model does great on the data it has seen but performs badly on new, unseen data.

**Regularization** helps prevent overfitting by adding a penalty that discourages the model from becoming too complex. This penalty keeps the model simpler and helps it generalize better to new data by balancing two things:
- **Bias** (how much the model assumptions simplify the real data)
- **Variance** (how much the model changes when trained on different data samples)


Regularization is something extra we add during fitting to avoid overfitting.
- Fitting = learning the best parameters from data.
- Regularization = gently forcing the model to stay simple during fitting.

---

### **Ridge Regression (L2 Regularization)**

Ridge regression tackles overfitting by adding a penalty on the *squared size* of the coefficients (parameters). The loss function it tries to minimize becomes:

$$
\text{Loss}_{\text{ridge}} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \theta_j^2
$$

- The first part measures how well the model fits the data.
- The second part (with $\lambda$) penalizes large coefficients to prevent overly complex models: very large parameters mean the function has to change a lot between neighboring points, which is a sign of overfitting.
- $\lambda$ is known as the regularization parameter.

Key points:
- **Coefficients get smaller** but don't become exactly zero.
- Good when many features contribute but might be correlated.
- Keeps all features in the model but controls their impact.

---

### **LASSO Regression (L1 Regularization)**

LASSO adds a penalty based on the *absolute value* of the coefficients:

$$
\text{Loss}_{\text{lasso}} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p |\theta_j|
$$

What makes LASSO special:
- It can shrink some coefficients **exactly to zero**, effectively removing those features from the model.
- This means LASSO does **feature selection** automatically.
- Useful when you expect only a few important features out of many.

---

### **Ridge vs. LASSO: Similarities and Differences**

| Feature                 | Ridge                            | LASSO                               |
|-------------------------|---------------------------------|-----------------------------------|
| Penalty type            | Squares of coefficients ($L_2$) | Absolute values of coefficients ($L_1$) |
| Feature selection       | No                              | Yes (some coefficients become zero) |
| Effect on coefficients  | Shrinks smoothly towards zero   | Produces sparse solutions (some zero exactly) |
| Best use case           | When many features matter, even if correlated | When only a few features are really important |

The difference between Ridge and LASSO is just the shape of the constraint region. For LASSO, the shape is such that some of the parameters may end up being 0, which is super beneficial.

Setting $\lambda = 0$ is mathematically identical to no regularization. But that's not necessarily true in the scikit-learn implementation: `Ridge` and `Lasso` with $\lambda = 0$ might not give the same result as `LinearRegression`, because the regularized solvers have additional sophistications to improve convergence.

**How do we choose $\lambda$?**
We use cross-validation, just as we discussed before. In fact, scikit-learn has versions of ridge and LASSO regression that do this automatically for you: see `RidgeCV` and `LassoCV`.
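
A minimal sketch comparing the two penalties on made-up data where only 2 of 10 features matter. Note that scikit-learn calls the regularization strength `alpha`, and that its `Lasso` rescales the squared-error term by $1/(2n)$, so `alpha` is not numerically the same as the $\lambda$ in the formulas above. The values of `alpha` used here are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV, Ridge, RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=100)   # only 2 features matter

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
n_zero_ridge = int(np.sum(ridge.coef_ == 0))   # ridge shrinks but keeps all features
n_zero_lasso = int(np.sum(lasso.coef_ == 0))   # lasso sets irrelevant ones exactly to 0

# Cross-validated choice of the regularization strength
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
lasso_cv = LassoCV(cv=5, random_state=0).fit(X, y)
print(n_zero_ridge, n_zero_lasso, ridge_cv.alpha_, lasso_cv.alpha_)
```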

---

### **Locally Linear Regression (LOWESS / LOESS)**

LOWESS and LOESS are simple ways to make a smooth curve through data without assuming a fixed formula.

How it works:
- For each point, it looks at nearby points only.
- Gives more importance to points that are closer.
- Fits a simple line just to those nearby points.
- Does this for every point, making a smooth curve that follows local changes.

Why use it?
- Very flexible and easy to understand.
- Good when the relationship between variables changes in different areas.
- Doesn't force one shape to fit all data.
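
A simplified sketch of the idea: at each query point, fit a straight line by weighted least squares and keep its value there. It uses Gaussian weights with a fixed bandwidth `h`; actual LOWESS/LOESS implementations typically use a tricube weight over the nearest neighbors plus robustness iterations, so this is an illustration of the principle only.

```python
import numpy as np

def local_linear(x_query, x, y, h):
    """At each query point, fit a weighted straight line to the data and evaluate it there."""
    y_hat = np.empty(len(x_query))
    for i, x0 in enumerate(x_query):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)        # closer points weigh more
        M = np.column_stack([np.ones_like(x), x - x0])
        A = M.T @ (w[:, None] * M)                    # weighted normal equations
        b = M.T @ (w * y)
        intercept, slope = np.linalg.solve(A, b)
        y_hat[i] = intercept                          # value of the local line at x0
    return y_hat

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(0, 0.2, size=100)
x_query = np.linspace(1, 9, 30)
y_smooth = local_linear(x_query, x, y, h=1.0)
```

If the data lie exactly on a straight line, every local fit returns that same line, whatever the bandwidth.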

---

### **Non-Linear Regression (Overview)**

Non-linear regression fits curves or complex shapes to data, not just straight lines.

Examples:
- S-shaped growth curves
- Exponential growth or decay
- Neural networks (many layers of curves)

How it works:
- Uses iterative optimization methods (like gradient descent) to find the best curve.
- It can be tricky to find the best fit and might need good starting guesses.
- Useful when data clearly isn't a straight line.
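
A minimal sketch with `scipy.optimize.curve_fit`, fitting an exponential decay to made-up data. The model function `decay`, the true parameters and the starting guess `p0` are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(t, amplitude, rate):
    """Exponential decay: non-linear in the parameter `rate`."""
    return amplitude * np.exp(-rate * t)

rng = np.random.default_rng(0)
t = np.linspace(0, 5, 50)
y = decay(t, 3.0, 1.2) + rng.normal(0, 0.05, size=50)

p0 = [1.0, 1.0]                                   # starting guess for the optimizer
popt, pcov = curve_fit(decay, t, y, p0=p0)        # best-fit parameters and their covariance
perr = np.sqrt(np.diag(pcov))                     # 1-sigma parameter uncertainties
```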


---

### **Gaussian Process Regression (GPR)**

Gaussian Process Regression is a powerful **non-parametric** regression method. Unlike traditional regression techniques that assume a specific functional form (like a straight line or a polynomial), GPR assumes that the data come from a **distribution over functions**.

**What is a Gaussian Process?**

A **Gaussian Process (GP)** is a collection of random variables, any finite number of which have a **joint Gaussian distribution**.

Think of a GP not as a single curve, but as a *distribution* over all possible smooth curves that could fit your data. When you observe some data points, you can narrow down this distribution and make predictions with uncertainty included.

**GPR in Simple Words**

- You give GPR some data: $x_i$, $y_i$ (inputs and outputs).
- GPR looks at these data and says: "what are all the *smooth* functions that could have produced this?"
- Then, for a new input $x^*$, it doesn't just give a single $y^*$ value but a **probability distribution** for what $y^*$ could be.

**Mathematical Form**

Let's say we want to predict values of $y$ from inputs $x$. In GPR, we assume:

$$
y(x) \sim \mathcal{GP}(m(x), k(x, x'))
$$

Where:

- $m(x)$ is the **mean function** (usually taken as zero: $m(x) = 0$).
- $k(x, x')$ is the **kernel** or **covariance function**, defining how correlated the outputs are depending on their inputs.

A popular choice for the kernel is the **Radial Basis Function (RBF)**:

$$
k(x, x') = \sigma_f^2 \exp\left( -\frac{(x - x')^2}{2 \ell^2} \right)
$$

where:
- $\sigma_f^2$ controls the variance (how "high" the function goes),
- $\ell$ is the length scale (how "wiggly" the function is).

**What GPR Gives You**

When you input some new $x^*$ values, GPR gives you:
- The **mean** prediction $\mu(x^*)$
- The **variance** $\sigma^2(x^*)$

This means you get **error bars** on every prediction!
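
A minimal sketch with scikit-learn's `GaussianProcessRegressor` and the RBF kernel above, on a handful of made-up noise-free points. The small `alpha` (a jitter term added to the diagonal) and the initial kernel values are assumptions for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

X_train = np.linspace(0, 5, 8)[:, None]
y_train = np.sin(X_train[:, 0])

# sigma_f^2 * exp(-(x - x')^2 / (2 l^2)); both hyperparameters are tuned during fit
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-6, random_state=0)
gpr.fit(X_train, y_train)

X_new = np.array([[2.5], [15.0]])                 # one point inside the data, one far away
mu, std = gpr.predict(X_new, return_std=True)     # mean prediction and error bar
```

The error bar is small between the training points and grows far from them, where the data no longer constrain the function.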




## <span style="color:red"> **LECTURE 19** - Classification I </span>

- Generative vs discriminative classification (differences and main concepts)
- Performance of classifiers
- Generative classification (discriminant function, Bayes classifier, decision boundary)
- Naive Bayes
- Gaussian naive Bayes
- Linear and quadratic discriminant analysis
- GMM and Bayes classification
- K-nearest-neighbor classifier

### **Generative vs. Discriminative Classification**

Density estimation (e.g., KDE) and clustering are unsupervised forms of classification; we now look at the supervised case.
There are two different types:

- **Generative classification**:
    If we ask which category is most likely to have generated the observed result, then we are using **density estimation** for classification; this is referred to as **generative classification**. Here we have a full model of the density for each class, or a model that describes how data could be generated from each class.

- **Discriminative classification**:
    If we don't care about the full distribution, then we are doing something more like clustering: we don't need to map the distribution, we just need to define boundaries. Classification that finds the **decision boundary** separating the classes is called **discriminative classification**.

For example, in the figure below, to classify a new object, it would suffice to know:
1. model 1 is a better fit than model 2 (***generative classification***), or 
2. that the decision boundary is at $x=1.4$ (***discriminative classification***).

![Ivezic, Figure 9.1](http://www.astroml.org/_images/fig_bayes_DB_1.png)

---

### **Performance of Classifiers**

- **Confusion matrix** (binary case):
    - **True Positive** = **correctly identified**  = TP
    - **True Negative** = **correctly rejected**  = TN
    - **False Positive** = **incorrectly identified** = FP 
    - **False Negative** = **incorrectly rejected** = FN

- **Metrics**:
  - **Accuracy**: $\frac{TP + TN}{TP + TN + FP + FN}$

    The proportion of all correct predictions (both positive and negative) out of the total number of predictions.

  - **Completeness**: $\frac{TP}{TP + FN}$

    Out of all the actual positives, how many did the model correctly identify?

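  - **Contamination**: $\frac{FP}{TP + FP}$

    Out of all the instances the model predicted as positive, what fraction are actually negative?

  - **Precision**: $\frac{TP}{TP + FP}$

    Out of all the instances the model predicted as positive, how many were actually positive? Note that Precision = 1 - Contamination.

  - **F1-score**: the harmonic mean of precision and completeness, $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Completeness}}{\text{Precision} + \text{Completeness}}$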

- **Receiver Operating Characteristic (ROC)**

   A ROC curve plots the true-positive rate against the false-positive rate.

   If there are many more background events than source events, even a small false-positive rate can swamp the signal. In these cases we can instead plot completeness versus efficiency, where efficiency = 1 - contamination.

   Note that you **choose** the completeness and efficiency that you want by choosing a **threshold (decision boundary)**.
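
The metrics above follow directly from the four confusion-matrix counts. The sketch below computes them; the function name and the example counts (a rare positive class against a large background) are made up for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics defined above from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    completeness = tp / (tp + fn)      # also known as recall
    contamination = fp / (tp + fp)
    precision = tp / (tp + fp)         # equals 1 - contamination
    f1 = 2 * precision * completeness / (precision + completeness)
    return {
        "accuracy": accuracy,
        "completeness": completeness,
        "contamination": contamination,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts: 90 true sources hidden in a large background sample.
m = classification_metrics(tp=40, tn=900, fp=10, fn=50)
```

With these counts the accuracy is high (0.94) even though fewer than half of the real positives are recovered, which is why accuracy alone is misleading for unbalanced classes.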


---

### **Generative Classification**

Generative classification is a probabilistic approach that models **how the data is generated**. The key idea is to estimate the distribution of features for each class and then apply **Bayes' theorem** to make predictions. This contrasts with discriminative methods, which model the decision boundary directly.

####  Discriminant Function

We can relate classification to regression: in regression we estimate a function $f(y \mid x)$ to predict continuous values.  
In classification we do the same, but $y$ is discrete, e.g., $y \in \{0, 1\}$.

So we define a **discriminant function** $g(x)$ that estimates the probability of class membership:
$$
g(x) = p(y = 1 \mid x)
$$

This function returns a probability, and we classify based on whether that probability exceeds a threshold (typically $0.5$).



####  Bayes Classifier

The **Bayes classifier** uses the discriminant function to make optimal decisions under uncertainty. For binary classification, it works as follows:

$$
\hat{y} =
\begin{cases}
1 & \text{if } g(x) > \frac{1}{2} \\
0 & \text{otherwise}
\end{cases}
$$

This rule assigns a new input $x$ to the class with the **highest posterior probability**.

It can also be extended to multi-class classification by choosing the class with the highest $p(y = k \mid x)$ for all $k$.



####  Decision Boundary

The **decision boundary** is the set of all $x$ where the classifier is **uncertain**, i.e., where the posterior probabilities of two (or more) classes are equal.

For the binary case:
$$
p(y=1 \mid x) = p(y=0 \mid x)
$$

This defines the surface (or line, or point) in the input space where the predicted class changes. In generative models, this boundary results from the underlying class distributions.
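
To make the discriminant function, the Bayes classifier and the decision boundary concrete, here is a small sketch with one feature and two classes. The Gaussian class models, their parameters and the equal priors are assumptions chosen for illustration only.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1D Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical generative model: prior p(y=k) and (mu, sigma) for each class.
PRIORS = {0: 0.5, 1: 0.5}
PARAMS = {0: (0.0, 1.0), 1: (3.0, 1.0)}

def posterior_class1(x):
    """Discriminant function g(x) = p(y=1 | x), obtained via Bayes' theorem."""
    joint = {k: PRIORS[k] * gaussian_pdf(x, *PARAMS[k]) for k in PRIORS}
    return joint[1] / (joint[0] + joint[1])

def bayes_classify(x):
    """Bayes classifier: pick class 1 when g(x) exceeds 1/2."""
    return 1 if posterior_class1(x) > 0.5 else 0
```

With equal priors and equal widths, the two posteriors are equal halfway between the means, so the decision boundary of this toy model sits at $x = 1.5$.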


---

### **Naive Bayes**

Naive Bayes is one of the simplest and most effective classification algorithms.  
It is called "naive" because it assumes **conditional independence** between features given the class.


**A brief aside:** a feature is a single piece of information recorded for each element of the dataset. For example, if you are filtering spam email, a possible feature is "how many times does the word 'free' appear in the email?".
In this approach, each feature is assumed to be independent of all the others once the class is known.


Formally, the model works in this way:
$$
P(\mathbf{x} \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)
$$
where $C_k$ is the $k$-th class, and $x_i$ are the individual features of input $\mathbf{x}$.

This means that, for a given class, the model computes the probability of each feature value under that class and then multiplies them all together.

- Since Naive Bayes is a **supervised learning** method, we assume that we already know the true class labels in the training set.  
- We use these to estimate the distributions $P(x_i \mid C_k)$ and the priors $P(C_k)$,  
- and then apply **Bayes' theorem** to predict the class of new, unseen data.

####  Pros:
- Simple to implement
- Very fast, both in training and prediction
- Works surprisingly well even when the independence assumption is violated
- Performs well on high-dimensional data

---

### **Gaussian Naive Bayes**

Gaussian Naive Bayes is a variant of Naive Bayes used when the features are **continuous-valued** (i.e., real numbers rather than categories or binary values).

Instead of estimating probabilities using histograms or counts, we assume that each feature $x_i$ follows a **Gaussian distribution** within each class $C_k$:
$$
P(x_i \mid C_k) = \frac{1}{\sqrt{2\pi \sigma_{ik}^2}} \exp\left( -\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2} \right)
$$

Here:
- $\mu_{ik}$ is the **mean** of feature $x_i$ for class $C_k$
- $\sigma_{ik}^2$ is the **variance** of that feature within the class


This model keeps the **Naive Bayes assumption** that all features are conditionally independent given the class:
$$
P(\mathbf{x} \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)
$$

Then, we use **Bayes' theorem** to compute the posterior probability of each class and choose the most likely one.
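
The whole procedure (estimate priors, per-feature means and variances, then apply Bayes' theorem) fits in a few lines. This is a minimal standard-library sketch, not a reference implementation: the function names, the tiny variance floor of `1e-9` and the toy data are assumptions.

```python
import math
from collections import defaultdict

def fit_gnb(X, y):
    """Estimate the prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small floor (assumption) avoids division by zero for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def log_joint(model, x):
    """log P(C_k) + sum_i log P(x_i | C_k): the posterior up to a constant."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        log_like = sum(
            -0.5 * math.log(2.0 * math.pi * var) - (xi - m) ** 2 / (2.0 * var)
            for xi, m, var in zip(x, means, variances)
        )
        scores[label] = math.log(prior) + log_like
    return scores

def predict_gnb(model, x):
    """Return the class with the highest posterior probability."""
    scores = log_joint(model, x)
    return max(scores, key=scores.get)

# Hypothetical two-feature training set with two well-separated classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.7]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gnb(X, y)
```

Working with log-probabilities turns the product over features into a sum, which avoids numerical underflow when there are many features.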

---

### **Linear and Quadratic Discriminant Analysis (LDA / QDA)**

Both **Linear Discriminant Analysis (LDA)** and **Quadratic Discriminant Analysis (QDA)** are **generative classification models** that assume each class follows a **Gaussian distribution**.  
They differ in how they treat the **covariance matrix**, which controls the shape and orientation of the Gaussian.


####  Linear Discriminant Analysis (LDA)

- Assumes the data for each class is normally distributed:
  $$
  \mathbf{x} \mid C_k \sim \mathcal{N}(\mu_k, \Sigma)
  $$
- All classes **share the same covariance matrix** $\Sigma$
- Class-specific means $\mu_k$ are different
- Because of the shared $\Sigma$, the **decision boundary is linear**

This means the model separates the classes with straight lines (or hyperplanes in higher dimensions).

####  Quadratic Discriminant Analysis (QDA)

- Still assumes Gaussian distributions:
  $$
  \mathbf{x} \mid C_k \sim \mathcal{N}(\mu_k, \Sigma_k)
  $$
- But now **each class has its own covariance matrix** $\Sigma_k$
- This allows more flexibility in the shape of each class distribution
- As a result, the **decision boundaries are quadratic curves**

QDA is more powerful than LDA but also requires estimating more parameters, so it needs more data to avoid overfitting.
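
The "more parameters" statement can be made quantitative by counting the free parameters of each model. The sketch below counts only means and covariances (class priors are left out), and the function names are illustrative.

```python
def n_params_lda(n_classes, n_features):
    """K mean vectors plus one shared symmetric covariance matrix."""
    means = n_classes * n_features
    shared_cov = n_features * (n_features + 1) // 2
    return means + shared_cov

def n_params_qda(n_classes, n_features):
    """K mean vectors plus one symmetric covariance matrix per class."""
    means = n_classes * n_features
    class_covs = n_classes * (n_features * (n_features + 1) // 2)
    return means + class_covs

# Example: 3 classes described by 10 features.
lda_count = n_params_lda(3, 10)
qda_count = n_params_qda(3, 10)
```

For 3 classes and 10 features this gives 85 parameters for LDA and 195 for QDA, and the gap grows quadratically with the number of features.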


####  Summary

| Property          | LDA                                | QDA                                 |
|------------------|-------------------------------------|--------------------------------------|
| Covariance       | Shared $\Sigma$                     | Class-specific $\Sigma_k$            |
| Decision Surface | Linear                              | Quadratic                            |
| Flexibility      | Less (simpler model)                | More (can fit complex boundaries)    |
| Data Requirement | Lower (fewer parameters)            | Higher (more parameters to estimate) |


---

### **Gaussian Mixture Models (GMM) and Bayes Classification**

So far, our generative classifiers have relied on fairly **strong assumptions**, such as:
- **Conditional independence** of features (Naive Bayes)
- **Single Gaussian distribution** per class (LDA/QDA)

These models work well, but they may fail when the true data distribution is more complex.


####  Bayes Classification with GMM

A more **flexible and expressive** approach is to use **Gaussian Mixture Models** to represent the class-conditional distributions.

Instead of assuming that each class is modeled by a single Gaussian, we assume it is a **mixture of multiple Gaussians**.

This allows us to **model complex, multimodal distributions** for each class.

This method is called the **GMM Bayes classifier**.

####  Why use GMMs?

- More expressive than a single Gaussian
- Can model **non-linear, complex class boundaries**
- Especially useful when data exhibits **clusters within each class**

#### Considerations

- Requires choosing the number of components $K$
- Can be computationally expensive
- Risk of **overfitting** with too many components
- **NOTE:** We can take this to the extreme by having one mixture component at each training point. We also don't have to restrict ourselves to a Gaussian kernel, we can use any kernel that we like. The resulting ***non-parametric*** Bayes classifier is referred to as **Kernel Discriminant Analysis (KDA)**.



---

### **K-Nearest Neighbor Classifier** 

- **Non-parametric** method
- Classify new point $\mathbf{x}$ by majority vote among its $k$ nearest neighbors
- A larger $k$ decreases the variance of the classification but increases the bias
- You can pick the best $k$ by cross-validation, choosing the value that minimizes the classification error

Pros:
- No training needed
- Simple

Cons:
- Expensive at test time
- Choice of $k$ and distance metric affects performance
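
The majority-vote rule is short enough to write out directly. This sketch assumes Euclidean distance and uses a brute-force search over the training set; the function name and the toy points are illustrative.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Brute force: compute every distance, then keep the k smallest.
    neighbors = sorted(
        ((math.dist(x, x_new), label) for x, label in zip(X_train, y_train)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set with two clumps.
X_train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y_train = ["a", "a", "a", "b", "b", "b"]
```

All the work happens at prediction time (every query scans the full training set), which is the "expensive at test time" cost listed above.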


## <span style="color:red"> **LECTURE 20** - Classification II </span>

- Discriminative Classification
- Logistic regression
- Support Vector Machines
- Kernel method
- Decision trees
- Splitting criteria
- Ensemble learning
- Bagging
- Random forest
- Boosting
- What should I use?

### **Discriminative Classification & Advanced Models**

Discriminative classifiers do not model how data "came to be" (that's generative classification), but instead **directly learn** the mapping from inputs $x$ to labels $y$.  

---

### **Logistic Regression**

Logistic regression predicts $P(y=1 \mid x)$ by fitting a linear model in log-odds space.

The only difference between LDA and logistic regression is how the regression coefficients are estimated. In LDA they are chosen to minimize the density-estimation error, whereas in logistic regression they are chosen to minimize the classification error.
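
"Linear in the log-odds" means that $\log\frac{p}{1-p} = b + \mathbf{w} \cdot \mathbf{x}$, so the probability is the logistic (sigmoid) function of a linear score. A minimal sketch of the prediction step follows; the weights and bias are hypothetical values, not fitted ones.

```python
import math

def predict_proba(x, weights, bias):
    """P(y=1 | x) for a logistic regression model with given coefficients."""
    log_odds = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients for a two-feature problem.
p = predict_proba([1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
```

The decision boundary $p = 0.5$ corresponds to a log-odds of zero, that is, to the hyperplane $b + \mathbf{w} \cdot \mathbf{x} = 0$.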

---

### **Support Vector Machines (SVM)**

An SVM defines a hyperplane (a plane in $N-1$ dimensions) that maximizes the distance to the closest points of each class. This distance is the "margin". The points that touch the margin (or that are on the wrong side) are called **support vectors**. 

**What happens if the classes overlap?**

When this happens, a **hard-margin SVM** (which requires perfect separation) won't work.  
That's why we use the **soft-margin SVM**, which allows some mistakes but still tries to find the best separating hyperplane.

- The model introduces **slack variables** that allow some points to be:
  - Inside the margin
  - Or even misclassified
- The optimization now balances two goals:
  1. **Maximize the margin**
  2. **Minimize the total slack** (how many violations we allow)

This is controlled by a **regularization parameter** $C$:
- Large $C$ → penalize misclassifications more (tighter margin, less tolerant)
- Small $C$ → allow more violations (wider margin, more tolerant)

The soft-margin SVM still tries to separate the classes as cleanly as possible,  
but it **accepts some overlap** to achieve a better **generalization** on unseen data.
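
The trade-off controlled by $C$ can be seen in the standard soft-margin objective, $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i$ with slack $\xi_i = \max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))$. The sketch below only evaluates this objective for a given hyperplane (it does not optimize it); the function name, the hyperplane and the three points are illustrative assumptions.

```python
def soft_margin_objective(w, b, X, y, C=1.0):
    """Return (objective, slacks) for labels y in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slacks = [
        max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, yi in zip(X, y)
    ]
    return margin_term + C * sum(slacks), slacks

# Hypothetical data: the third point lies on the wrong side of the boundary.
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, -1, -1]
objective, slacks = soft_margin_objective(w=(1.0, 0.0), b=0.0, X=X, y=y, C=1.0)
```

Points beyond the margin have zero slack and do not contribute, which is why only the support vectors determine the boundary; raising $C$ makes the single violation here much more expensive.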

**Some important notes:**
1) SVM is not scale invariant, so it is worth rescaling each feature to zero mean and unit variance.
2) Once the support vectors are determined, changes to the positions or numbers of points beyond the margin will not change the decision boundary.
3) This gives SVM a strong resilience to outliers.
4) This is also why SVM reaches a high completeness compared to the other methods: it does not matter if the background population is much larger than the source population. SVM simply determines the best boundary between the small source clump and the large background clump.
5) This completeness, however, comes at the cost of a relatively large contamination level.


---

### **Kernel Method**

If the contamination is driven by non-linear effects, it may be worth implementing a **non-linear decision boundary**. We can do this by **kernelization**.

A first possibility is simply to add a new dimension to the dataset (going from 2D to 3D, for example). We construct the values along this new dimension ourselves, choosing something that clearly separates the points into different clusters; we can then find the best plane that separates them.
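
The same idea works one dimension lower, which makes it easy to check by hand. In this sketch a 1D dataset that no single threshold can split becomes linearly separable after adding $x^2$ as a second coordinate. The data, the `lift` function and the threshold are all illustrative assumptions, and this is an explicit feature map rather than the kernel trick itself.

```python
# Hypothetical 1D data: class 1 sits on both sides of class 0,
# so no single cut on x separates the two classes.
xs = [-3.0, -2.5, -0.5, 0.0, 0.4, 2.6, 3.1]
labels = [1, 1, 0, 0, 0, 1, 1]

def lift(x):
    """Map a 1D point to 2D by adding x^2 as a new coordinate."""
    return (x, x * x)

lifted = [lift(x) for x in xs]

# In the lifted space a straight horizontal line (x^2 = 2) separates the classes.
threshold = 2.0
predicted = [1 if z2 > threshold else 0 for _, z2 in lifted]
```

A kernel SVM reaches the same effect without ever building the extra coordinates, because it only needs inner products in the lifted space.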

---

### **Decision Trees**

A **decision tree** is similar to the process of classification that you might do by hand: 

- define some criteria to separate the sample into 2 groups (not necessarily equal),
- then take those sub-groups and do it again.  
- keep going until you reach a stopping point such as not having a minimum number of objects to split again.  

In short, we have done a hierarchical application of decision boundaries.

The tree structure is as follows:
- top node contains the entire data set
- at each branch the data are subdivided into two child nodes 
- split is based on a predefined decision boundary (usually axis aligned)
- splitting repeats, recursively, until we reach a predefined stopping criteria 

Build a tree by **recursively splitting** on feature thresholds until leaves are pure or other stopping criteria are met.

#### Splitting Criteria

At each node, choose feature $j$ and threshold $t$ to minimize weighted impurity.
The typical process for finding the optimal decision boundary is to perform trial splits along each feature one at a time, within which the value of the feature to split at is also trialed. The feature that allows for the maximum information gain is the one that is split at this level.

- **Gini Impurity:**  
  $$
  G = 1 - \sum_{k=1}^K p_k^2
  $$
  It essentially estimates the probability of incorrect classification by choosing both a point and (separately) a class randomly from the data.
- **Entropy (Information Gain):**  
  $$
  H = -\sum_{k=1}^K p_k \log p_k
  $$
  where $p_k$ is the fraction of samples of class $k$ in the node.  
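
Both impurity measures, and the trial-split search described above, can be sketched for a single feature as follows. The helper names are illustrative, and the entropy uses base-2 logarithms (an assumption, since the formula above leaves the base unspecified).

```python
import math
from collections import Counter

def class_fractions(labels):
    """Fractions p_k of each class among the labels in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    return 1.0 - sum(p * p for p in class_fractions(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_fractions(labels))

def best_threshold(values, labels, impurity=gini):
    """Try every split 'value <= t' and return (t, weighted impurity) of the best."""
    n = len(labels)
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:   # the largest value would leave one side empty
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if weighted < best[1]:
            best = (t, weighted)
    return best
```

A full tree builder would repeat `best_threshold` over every feature at each node and recurse on the two children.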

When constructing a decision tree, if your stopping criteria are too loose, further splitting just ends up fitting noise. Using cross-validation to optimize the depth of the tree (and so avoid overfitting) is therefore the best choice.

---

### **Ensemble Learning**

Combine multiple "weak learners" (such as the methods seen before) to form a stronger model.

#### **Bagging (Bootstrap Aggregating)**

- Draw **bootstrap** samples of size $N$ (with replacement) from the training set.
- Train an independent tree on each sample.
- **Aggregate** by majority vote (classification) or average (regression).

Reduces **variance** without increasing bias.

For a sample of $N$ points in a training set, bagging generates $B$ equally sized bootstrap samples from which to estimate the function $f_i(x)$. The final estimator for $\hat{y}$, defined by bagging, is then

$$\hat{y} = f(x) = \frac{1}{B} \sum_{i=1}^{B} f_i(x).$$

**NOTE**: in scikit-learn you can set the parameter `n_jobs=-1`, which tells the estimator to use all the cores of your machine. That is one of the benefits of bagging: it can be parallelized trivially, because one bagging process has nothing to do with the others. You just average them all together when you are done. 
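
The bootstrap-and-average recipe can be sketched without any library. Here the base learner is a 1-nearest-neighbor regressor, chosen only because it is a simple high-variance model (trees are what the lecture uses); the function names, the noise-free toy data and the seed are assumptions.

```python
import random

def fit_1nn(sample):
    """Base learner f_i: predict the y of the closest x in the bootstrap sample."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagging_fit(data, n_estimators=50, seed=0):
    """Train one base learner on each of B bootstrap samples of size N."""
    rng = random.Random(seed)
    n = len(data)
    # rng.choices draws with replacement, which is exactly a bootstrap sample.
    return [fit_1nn(rng.choices(data, k=n)) for _ in range(n_estimators)]

def bagging_predict(estimators, x):
    """Aggregate by averaging the B individual predictions."""
    return sum(f(x) for f in estimators) / len(estimators)

# Hypothetical toy data on y = x^2, sampled at x = 0.0, 0.1, ..., 3.0.
data = [(i / 10.0, (i / 10.0) ** 2) for i in range(31)]
estimators = bagging_fit(data, n_estimators=50, seed=0)
y_hat = bagging_predict(estimators, 1.5)
```

Each call to `fit_1nn` is independent of the others, so the list comprehension in `bagging_fit` is the part that could be distributed across cores.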

#### **Random Forest**

Random forests extend bagging by generating decision trees from the bootstrap samples. In addition to drawing random samples from our training set with replacement, we may also draw random subsets of features for training the individual trees.
 
- In Random Forests, the splitting features on which to generate the tree are selected at random from the full set of features in the data.
- The number of features selected per split level is typically the square root of the total number of features, $\sqrt{D}$. 
- The fundamental difference is that in Random forests, only a subset of features are selected at random out of the total and the best split feature from the subset is used to split each node in a tree, unlike in bagging where all features are considered for splitting a node. 
- The final classification from the random forest is based on the averaging of the classifications of each of the individual decision trees.
- As always, use cross-validation to determine the optimal depth of the trees
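
The extra randomization that distinguishes a random forest from plain bagging is the choice of candidate features at each split. A minimal sketch of that single step, assuming the $\sqrt{D}$ rule quoted above (the function name is illustrative):

```python
import math
import random

def candidate_features(n_features, rng):
    """Pick about sqrt(D) distinct feature indices to consider for one split."""
    m = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), m))

rng = random.Random(0)
subset = candidate_features(16, rng)   # 4 of the 16 features
```

The best split is then searched only among these features, which decorrelates the trees and makes the averaging more effective.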

---
### **Boosting**

**Boosting** is a way to make a strong model by combining **many weak models** that aren't very good on their own.

#### Basic Idea

1. Train a simple model.
2. See which points it gets wrong.
3. Focus more on those hard points next time.
4. Repeat this several times.
5. At the end, combine all the models to get a better result.

Each new model **tries to fix the mistakes** made by the ones before.

#### What Makes Boosting Special?

- It gives **more importance** to points that are hard to classify.
- The final prediction is a **weighted vote** of all the models.
- The better a model is, the more say it gets.

#### AdaBoost (Adaptive Boosting)

- A popular type of boosting.
- After each round, it **boosts** (increases) the weight of the wrong points.
- This helps the next model **focus** on the tough stuff.
- You can tune the **learning rate**: halving the learning rate means that the weights are boosted half as much at each iteration.
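
One round of the reweighting can be written out explicitly. This sketch follows the textbook discrete AdaBoost update for labels in $\{-1, +1\}$; scaling the model weight $\alpha$ by `learning_rate` is an assumption about how the learning rate enters, and the function name and example predictions are illustrative.

```python
import math

def adaboost_round(weights, y_true, y_pred, learning_rate=1.0):
    """Return (alpha, new_weights) after one boosting round.

    Assumes the weighted error is strictly between 0 and 1.
    """
    total = sum(weights)
    err = sum(w for w, yt, yp in zip(weights, y_true, y_pred) if yt != yp) / total
    # Better models (smaller err) get a larger say in the final vote.
    alpha = learning_rate * 0.5 * math.log((1.0 - err) / err)
    # Misclassified points (yt * yp = -1) are boosted, correct ones are shrunk.
    boosted = [w * math.exp(-alpha * yt * yp) for w, yt, yp in zip(weights, y_true, y_pred)]
    norm = sum(boosted)
    return alpha, [w / norm for w in boosted]

# Hypothetical round: four equally weighted points, the last one misclassified.
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]
alpha, new_weights = adaboost_round([0.25] * 4, y_true, y_pred)
```

After this round the single misclassified point carries half of the total weight, so the next weak learner is strongly pushed to get it right.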

#### Downside

- Boosting is **slow** because models are built **one after the other**.
- Can be sensitive to noisy data.
- Not easy to run in parallel like Random Forests.

---

## **Model Selection: "What Should I Use?"**

| Method                     | Type            | Strengths                                                                 | Weaknesses                                                         |
|----------------------------|------------------|---------------------------------------------------------------------------|--------------------------------------------------------------------|
| **Logistic Regression**    | Discriminative   | Fast, interpretable, works well on linearly separable data                | Struggles with nonlinearity                                        |
| **SVM (linear / kernel)**  | Discriminative   | Strong margin, effective in high-dim spaces, kernels for nonlinearity     | Computationally heavy for large $N$                                |
| **Decision Tree**          | Discriminative   | Intuitive rules, handles mixed data types                                 | High variance, prone to overfitting                                |
| **Random Forest**          | Discriminative   | Excellent off-the-shelf performance, handles large feature sets           | Less interpretable, many parameters                                |
| **Bagging**                | Discriminative   | Reduces variance of unstable learners                                     | Requires multiple models, less interpretable                       |
| **Boosting** | Discriminative | High accuracy, focuses on hard examples                                   | Slower to train, sensitive to noise and overfitting                |
| **Naive Bayes**            | Generative       | Very fast, works surprisingly well with many features                     | Assumes feature independence, which is often unrealistic           |
| **Gaussian Naive Bayes**   | Generative       | Handles continuous data with a normal distribution assumption             | Poor with non-Gaussian data or correlated features                 |
| **LDA (Linear Disc. Analysis)** | Generative | Simple, fast, works well with Gaussian data and equal covariances         | Assumes same covariance matrix across classes                      |
| **QDA (Quadratic Disc. Analysis)** | Generative | More flexible than LDA, models class-specific covariance          | Needs more data, prone to overfitting with small samples           |
| **GMM + Bayes Classifier** | Generative       | Can model complex, multimodal distributions per class                     | Training (EM algorithm) is slower, sensitive to initialization     |
| **K-Nearest Neighbors (KNN)** | Generative | No training needed, simple and intuitive                                  | Slow at test time, sensitive to irrelevant features & scaling      |


Naive Bayes and its variants are by far the easiest to compute. Linear support vector machines are more expensive, though several fast algorithms exist. Random forests can be easily parallelized. Boosting helps with challenging classification problems (but at that point you might want to go all the way to deep learning).

## <span style="color:red"> **LECTURE 21** - Deep Learning I </span>

- Loss function
- Gradient Descent
- AdaBoost
- Neural Networks
- In more detail
- Backpropagation
- About derivative
- number of layers
- number of neurons
- activation function


### **Loss function**

A **loss function** measures how "bad" a model's prediction is compared to the true value. The smaller the loss, the better your model is doing.

It helps guide the **training** of a machine learning model, by telling it *how wrong* its predictions are, so it can improve.


#### **Common Loss Functions**

#### L2 Loss (Mean Squared Error - MSE)

This is typical in **regression problems**:

$$
L_2 = (y - f(x))^2
$$

- Squared difference between prediction and actual value.
- Penalizes **larger errors more heavily**.
- Smooth and easy to differentiate → good for gradient descent.
- Assumes Gaussian noise in the data.

#### L1 Loss (Mean Absolute Error - MAE)

Also used in regression:

$$
L_1 = |y - f(x)|
$$

- Less sensitive to outliers than L2.
- The same absolute-value norm, used as a penalty on the coefficients, leads to **sparser** models (this is what LASSO does).
- Less smooth (not differentiable at zero) → gradient descent can be trickier.


#### Classification loss function

In **classification**, especially **binary classification**, your true labels are {-1, +1} and your model's prediction is a **score** or **probability**.

So we look at:

$$
y \cdot f(x)
$$

- If this is **positive**, your model got the right class ( +1 * +1  or -1 * -1).
- If this is **negative**, it got it wrong ( +1 * -1  or -1 * +1).
- Larger positive → more confident correct prediction.
- More negative → more confident wrong prediction.



If we reuse a regression loss here and write it as a function of $y f(x)$ (a parabola with its vertex at $(1, 0)$ for the L2 loss, or the corresponding absolute value for the L1 loss), it does something reasonable for $y f(x) \le 1$. However, look at what happens at larger values, where we are even more confident that $y f(x)$ is positive and that our class should be $+1$: the loss goes **up**. That's bad.

We need a loss function that makes sense for classification.

- The first we'll try is the so-called **Zero-One Loss**. It is 1 for $y f(x)<0$ and 0 for $y f(x)>0$; thus the name. You increment the loss by 1 every time you make a wrong prediction, so it is just a count of the total number of mistakes. However, the Zero-One loss is hard to minimize, so instead we can try something that makes the loss a continuous function of $y f(x)$.  

- The **Hinge Loss** looks like $${\rm max}(0,1-y f(x)).$$ Here there is no contribution to the loss for values $y f(x) \ge 1$, but there is a linearly increasing loss for smaller values. So it penalizes both wrong predictions and correct predictions that have low confidence.

- The **Logistic Loss** (also called the *log loss* or *cross-entropy loss*) has similar properties (shown in **blue**), but it is smoother and its penalty keeps decreasing for more and more confident $+1$ predictions.
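
The losses discussed above can be compared numerically as functions of the margin $m = y f(x)$. In this sketch the squared loss is written as $(1 - m)^2$, which equals $(y - f(x))^2$ when $y \in \{-1, +1\}$, the logistic loss is taken as $\log(1 + e^{-m})$, and the zero-one loss is set to 0 at exactly $m = 0$ (a convention chosen here, since the definition above leaves that point open).

```python
import math

def squared_loss(margin):
    """L2 loss in terms of the margin m = y * f(x), valid for y in {-1, +1}."""
    return (1.0 - margin) ** 2

def zero_one_loss(margin):
    """1 for a wrong prediction (m < 0), 0 otherwise."""
    return 1.0 if margin < 0 else 0.0

def hinge_loss(margin):
    """No penalty for m >= 1, linear penalty below that."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Smooth loss that keeps decreasing for confident correct predictions."""
    return math.log(1.0 + math.exp(-margin))

# Evaluate every loss on a few margins, from confidently wrong to confidently right.
margins = [-2.0, 0.5, 1.0, 3.0]
table = {
    name: [loss(m) for m in margins]
    for name, loss in [
        ("squared", squared_loss),
        ("zero_one", zero_one_loss),
        ("hinge", hinge_loss),
        ("logistic", logistic_loss),
    ]
}
```

At $m = 3$ the squared loss has climbed back up to 4, while the hinge loss is exactly zero and the logistic loss is small but still decreasing.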

---

### **Gradient Descent**

**Gradient descent** is a method to find the best parameters $\theta$ that minimize a loss function.

In this course, we've been trying to find $\theta$ that gives the best fit to our training data, without overfitting.

There are different ways to do that:
- Sometimes we can calculate the best $\theta$ directly.
- Sometimes we try many random values (like in MCMC).
- But gradient descent is like standing on a hill and walking downhill until you reach the bottom.

We update the parameters using:

$$
\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla J(\theta)
$$

- $\eta$  is the **learning rate** (step size).
- $\nabla J(\theta)$  is the slope (gradient) of the cost function.


#### Learning rate

- Too small → takes forever to reach the minimum.
- Too big → can skip over the minimum or even diverge.

#### Other notes

- We start from random values of $\theta$.
- Gradient descent works well when the cost function has only one minimum (like with L2 loss).
- It's also useful when the dataset is too big to load all at once, because the gradient can be estimated from small batches of data (stochastic or mini-batch gradient descent).
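
The update rule and the effect of the learning rate can be checked on a one-parameter toy problem. The cost $J(\theta) = (\theta - 3)^2$, the starting point and the three learning rates below are assumptions chosen for illustration.

```python
def gradient_descent(grad, theta0, learning_rate=0.1, n_steps=100):
    """Repeat theta <- theta - eta * grad(theta) and return the final theta."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - learning_rate * grad(theta)
    return theta

def grad_J(theta):
    """Gradient of the toy cost J(theta) = (theta - 3)^2, minimized at theta = 3."""
    return 2.0 * (theta - 3.0)

theta_good = gradient_descent(grad_J, theta0=0.0, learning_rate=0.1)
theta_small = gradient_descent(grad_J, theta0=0.0, learning_rate=0.001)
theta_big = gradient_descent(grad_J, theta0=0.0, learning_rate=1.1)
```

With $\eta = 0.1$ the iterate converges to 3; with $\eta = 0.001$ it has barely moved after 100 steps; with $\eta = 1.1$ every step overshoots the minimum by more than the previous error and the iterate diverges.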

![https://miro.medium.com/max/1400/0*GaO7X6j3coh3oNwf.png](https://miro.medium.com/max/1400/0*GaO7X6j3coh3oNwf.png)

---

### **AdaBoost**

See Lecture 20.

---



## <span style="color:red"> **LECTURE 22** - Deep Learning II </span>

- keras
- 