### **<span style="color:red"> LECTURE 2  </span>**

### **Probability Density Function (PDF), Cumulative Distribution Function (CDF), and Quantile**

- **PDF (Probability Density Function)**: Describes the probability density of a continuous variable. The probability of any single exact value is zero; the area under the PDF over an interval gives the probability of the variable falling within that interval.
- **CDF (Cumulative Distribution Function)**: Gives the probability that a random variable is less than or equal to a certain value $x$. It is obtained by integrating the PDF from $-\infty$ up to $x$.
- **Quantile**: The inverse of the CDF, i.e. the value below which a given fraction of observations fall. For example, the 0.25 quantile (or 25th percentile) is the value below which 25% of the data lie.
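
As a quick illustration of how the three functions relate, here is a minimal sketch using `statistics.NormalDist` from the Python standard library. The standard normal distribution is only an assumed example; any distribution with a known CDF would behave the same way.

```python
from statistics import NormalDist

# Standard normal (mean 0, standard deviation 1), chosen only as an example
dist = NormalDist(mu=0.0, sigma=1.0)

x = 1.0
density = dist.pdf(x)               # PDF: a density, not a probability
prob_below = dist.cdf(x)            # CDF: P(X <= x)
x_back = dist.inv_cdf(prob_below)   # quantile function: inverse of the CDF

# Probability of an interval = area under the PDF = difference of CDF values
prob_interval = dist.cdf(1.0) - dist.cdf(-1.0)

print(f"pdf(1)           = {density:.4f}")
print(f"cdf(1)           = {prob_below:.4f}")
print(f"quantile(cdf(1)) = {x_back:.4f}")
print(f"P(-1 <= X <= 1)  = {prob_interval:.4f}")
```

Applying the quantile function to a CDF value returns the starting point, which is what "inverse of the CDF" means in practice.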

---

### **Empirical and Theoretical Distributions**

- **Theoretical Distribution**: A probability distribution derived from a known mathematical model (e.g., Normal, Poisson).
- **Empirical Distribution**: Based on observed data. It approximates the distribution of a dataset and is typically represented by the empirical CDF or histogram.
- Empirical distributions are used when the true distribution is unknown or difficult to model.

---

### **Homoscedastic and Heteroscedastic Errors**

- **Homoscedasticity**: The variance of the errors is constant.
- **Heteroscedasticity**: The variance of the errors is not constant; it changes from one data point to another.

---

### **Kolmogorov's Axioms and Probability**

Kolmogorov formalized the foundation of probability with three axioms:

1. **Non-negativity**: For any event A, the probability is non-negative:  
   $P(A) \geq 0$
2. **Normalization**: The probability of the entire sample space is 1:  
   $P(\Omega) = 1$
3. **Additivity**: For any two mutually exclusive events A and B:  
   $P(A \cup B) = P(A) + P(B)$

These axioms form the basis of modern probability theory.

---

### **Bayes' Theorem**

Bayes' Theorem updates the probability of a hypothesis based on new evidence:

$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$

- $P(A|B)$: Posterior probability (updated belief)  
- $P(B|A)$: Likelihood of observing B given A  
- $P(A)$: Prior probability of A  
- $P(B)$: Marginal probability of B  

It is used in many fields, such as medicine, machine learning, and decision theory.
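
A small worked example may help to see the update in action. The sketch below is illustrative only: the function name `posterior` and the numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) are hypothetical and not taken from the lecture.

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """Return P(A|B) given P(A), P(B|A) and P(B|not A)."""
    # Marginal probability P(B) from the law of total probability
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical medical test: A = "has the disease", B = "test is positive"
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(disease | positive test) = {p:.3f}")
```

Even with a fairly accurate test, the posterior stays low because the prior (the prevalence) is small.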

---

### **Transformations of Random Variables**

Transforming a random variable means applying a function to it, creating a new variable.

- **Example**: Let $X$ be a random variable and $Y = g(X)$ a transformation.
- To find the **distribution of $Y$**:
  - If $X$ is continuous with PDF $f_X$ and $g$ is invertible and differentiable, then:

$$
f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right|
$$

- This is used to derive distributions of functions of random variables (e.g., squares, sums, logarithms).


### **<span style="color:red"> LECTURE 3  </span>**

### **Monte Carlo Integration (Crude and Hit-or-Miss)**

- **Monte Carlo integration** uses random sampling to approximate definite integrals.
- **Crude Monte Carlo**:  
  Estimate the integral $\int_a^b f(x) \, dx$ by sampling $x_i \sim \mathcal{U}(a, b)$ and computing:  
  $$
  I \approx (b - a) \cdot \frac{1}{N} \sum_{i=1}^N f(x_i)
  $$
- **Hit-or-Miss method**:  
  Sample uniformly in a rectangle that encloses the graph of $f(x)$.  
  The integral is approximated by the fraction of points that fall below the curve times the area of the rectangle.
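
Both estimators can be sketched in a few lines. This is a minimal illustration, assuming the test integrand $\sin(x)$ on $[0, \pi]$ (exact value 2) and a non-negative function bounded by `f_max`; the function names are hypothetical.

```python
import math
import random

def crude_mc(f, a, b, n, rng):
    """Crude Monte Carlo: (b - a) times the average of f at uniform points."""
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

def hit_or_miss(f, a, b, f_max, n, rng):
    """Hit-or-miss: fraction of points under the curve times the box area.

    Assumes 0 <= f(x) <= f_max on [a, b].
    """
    hits = 0
    for _ in range(n):
        x = rng.uniform(a, b)
        y = rng.uniform(0.0, f_max)
        if y < f(x):
            hits += 1
    return (b - a) * f_max * hits / n

rng = random.Random(0)  # fixed seed so the run is repeatable
crude = crude_mc(math.sin, 0.0, math.pi, 100_000, rng)
hom = hit_or_miss(math.sin, 0.0, math.pi, 1.0, 100_000, rng)
print(f"crude MC:    {crude:.4f}")
print(f"hit-or-miss: {hom:.4f}")
```

Both results should land close to 2; for the same number of points the crude estimate usually scatters less than the hit-or-miss one.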

---

### **Mean, Median, and Expected Value**

- **Mean**: Arithmetic average of a dataset.
- **Median**: Middle value when data are ordered. Less sensitive to outliers.
- **Expected value** ($\mathbb{E}[X]$): Theoretical mean of a random variable. For continuous variables:  
  $$
  \mathbb{E}[X] = \int x f(x) \, dx
  $$

---

### **Standard Deviation, MAD (1), Variance, MAD (2), Quantile Region, Interquartile Range, Mode**

- **Standard deviation** ($\sigma$): Measures spread around the mean.
- **MAD_1 (Mean Absolute Deviation)**:  
  $$
  \text{MAD}_1 = \frac{1}{N} \sum_{i=1}^N |x_i - \bar{x}|
  $$
- **Variance**:  
  $$
  \text{Var}(X) = \mathbb{E}[(X - \mu)^2] \quad \text{with } \mu = \mathbb{E}[X]
  $$
- **MAD_2 (Median Absolute Deviation)**: $\text{MAD}_2 = \text{median}\left(|x_i - \text{median}(x)|\right)$
- **Quantile region**: Range containing a central portion of the distribution (e.g., 95% interval).
- **Interquartile range (IQR)**:  
  $$
  \text{IQR} = Q_{75} - Q_{25}
  $$
  It contains the central 50% of the dataset.
- **Mode**: Most frequent value in a dataset.

---

### **Skewness and Kurtosis**

- **Skewness**: Measures asymmetry of a distribution.
  - Positive skew: tail to the right.
  - Negative skew: tail to the left.
- **Kurtosis**: Measures how likely extreme values (far from the average) are in a distribution.
  - High kurtosis: heavy tails.
  - Low kurtosis: light tails.
  - Normal distribution has kurtosis $= 3$.

---

### **PDF vs Sample Statistics, Bessel's Correction**

- **PDF statistics**: Theoretical values (mean, variance, etc.) computed from a probability distribution.
- **Sample statistics**: Estimates of these quantities based on data.
- **Bessel’s correction**: When estimating variance from a sample, divide by $N - 1$ instead of $N$ to correct bias:  
  $$
  s^2 = \frac{1}{N - 1} \sum_{i=1}^N (x_i - \bar{x})^2
  $$

---

### **Uncertainties of Estimators**

- Every estimator has **uncertainty** due to finite sample size.
- For the **sample mean**:
  $$
  \text{Standard error} = \frac{\sigma}{\sqrt{N}}
  $$
- For the **sample standard deviation** $s$ (Gaussian data), the standard error can be approximated as:
  $$
  \text{SE}(s) \approx \frac{\sigma}{\sqrt{2N}}
  $$
  where $\sigma$ is the true standard deviation and $N$ is the sample size.
- For the **interquartile range (IQR)**, the uncertainty depends on the density around the quartiles; for Gaussian data and large $N$, a rough estimate of its standard error is:
  $$
  \text{SE}(\text{IQR}) \approx \frac{1.57 \, \sigma}{\sqrt{N}} \approx \frac{1.17 \times \text{IQR}}{\sqrt{N}}
  $$
- Confidence intervals express the likely range of the true parameter.

---

### **PDFs: Uniform, Gaussian, Log-Normal, Chi-Squared, Poisson**

- **Uniform**: All values in an interval have equal probability.  
  $$
  f(x) = \frac{1}{b - a} \quad \text{for } x \in [a, b]
  $$

  This distribution has $\sigma = \frac{b-a}{\sqrt{12}}$.
- **Gaussian (Normal)**: Curve defined by mean $\mu$ and std $\sigma$.  
  $$
  f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
  $$
  - The convolution of two Gaussians is also a Gaussian.
  - It is the queen of distributions: many quantities approximately follow this shape, and it is easy to work with.
  - About 68% of the probability lies within $1\sigma$ of the mean, and about 95% within $2\sigma$.

- **Log-Normal**: $X \sim \text{LogNormal}$ means $\ln X \sim \text{Normal}$.
- **Chi-squared** ($\chi^2$):  
  If the $x_i$ are independent Gaussian variables with mean $\mu$ and standard deviation $\sigma$, and we define standardized variables as  
  $$
  z_i = \frac{x_i - \mu}{\sigma},
  $$  
  then the sum of their squares  
  $$
  Q = \sum_{i=1}^K z_i^2
  $$  
  follows a **chi-squared distribution** with $K$ degrees of freedom.

  The number of degrees of freedom $K$ is equal to the number of **independent** data points used in the sum.

- **Poisson**: Discrete distribution for count data.  
  $$
  P(k; \mu) = \frac{\mu^k e^{-\mu}}{k!}
  $$
  - Where $\mu$ is the mean (the expected number of events) and $k$ is the number of events occurring.
  - Known as "law of rare events"

---

### **Importance Sampling**

- Hit-or-miss and crude MC are inefficient if the integrand is zero (or negligible) over large regions, or if the integration domain is very extended. This is because both methods sample from the uniform distribution.
- Instead of sampling from the uniform distribution, sample from a **proposal distribution** $g(x)$.
- It works best when $g(x)$ is close to the shape of $f(x)$.
- It reduces the variance and the computational cost if $g(x)$ is well chosen.
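
A minimal sketch of the idea: draw $x_i$ from $g$ and average the weights $f(x_i)/g(x_i)$, so that $I \approx \frac{1}{N}\sum_i f(x_i)/g(x_i)$. The integrand $f(x) = x^2 e^{-x}$ on $[0, \infty)$ (exact value 2) and the exponential proposal $g(x) = e^{-x}$ are assumptions chosen for illustration; the function names are hypothetical.

```python
import math
import random

def importance_sampling(f, g_pdf, g_sample, n, rng):
    """Estimate the integral of f as the mean of f(x)/g(x), with x drawn from g."""
    total = 0.0
    for _ in range(n):
        x = g_sample(rng)
        total += f(x) / g_pdf(x)
    return total / n

def f(x):
    return x**2 * math.exp(-x)      # integrand, exact integral on [0, inf) is 2

def g_pdf(x):
    return math.exp(-x)             # proposal density: exponential with rate 1

def g_sample(rng):
    return rng.expovariate(1.0)     # draws from the same exponential

rng = random.Random(0)
estimate = importance_sampling(f, g_pdf, g_sample, 100_000, rng)
print(f"importance sampling estimate: {estimate:.3f}")
```

A uniform proposal could not even cover this unbounded domain, while the exponential proposal places most points where the integrand is large.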


### **<span style="color:red"> LECTURE 4  </span>**

### **Central Limit Theorem (CLT)**

- The CLT states that the sum (or mean) of a large number of independent, identically distributed random variables with finite variance tends to follow a **normal distribution**, regardless of the shape of the original distribution.

---

### **Law of Large Numbers (LLN)**

- The LLN states that as the number of observations $N$ increases, the sample mean $\bar{x}$ converges to the true mean $\mu$:
  $$
  \lim_{N \to \infty} \bar{x} = \mu
  $$
- This is a statement about convergence **in probability**.

---

### **Multidimensional PDFs**

- In 2D, the joint distribution can be described by:
  - **Mean vector**:  
    $$
    \vec{\mu} = (\mu_x, \mu_y)
    $$

  - **Covariance matrix**:  
    $$
    \Sigma = \begin{pmatrix}
    \sigma_x^2 & \text{cov}(x, y) \\
    \text{cov}(y, x) & \sigma_y^2
    \end{pmatrix}
    $$
    The two off-diagonal values are zero if and only if $x$ and $y$ are uncorrelated.

  - **Correlation coefficient**:  
    $$
    \rho = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y}
    $$
    It expresses the strength of the linear correlation between the two variables, on a scale from $-1$ to $1$.

  - **Principal axes**: determined by the eigenvectors of $\Sigma$; note that the correlation vanishes in this coordinate system by construction.
  - **2D Confidence Ellipses**: contours of constant joint probability density. Be careful: the probability enclosed within a given number of sigma depends on the dimension. In 2 dimensions the $1\sigma$ ellipse contains only 39% of the probability. One can choose the ellipse that encloses 68%, by analogy with 1D, but that is not the $1\sigma$ ellipse.
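
The 39% figure can be checked directly. For a 2D Gaussian, the probability enclosed by the $k\sigma$ ellipse is $1 - e^{-k^2/2}$ (a standard result, stated here as an assumption since it is not derived in these notes). The helper names below are hypothetical.

```python
import math

def prob_inside_2d(k: float) -> float:
    """Probability enclosed by the k-sigma ellipse of a 2D Gaussian."""
    return 1.0 - math.exp(-0.5 * k * k)

def k_for_prob_2d(p: float) -> float:
    """Number of sigma whose 2D ellipse encloses probability p."""
    return math.sqrt(-2.0 * math.log(1.0 - p))

p_1sigma = prob_inside_2d(1.0)
k_68 = k_for_prob_2d(0.6827)
print(f"2D, 1 sigma encloses {100 * p_1sigma:.1f}%")
print(f"2D, 68.27% needs k = {k_68:.2f} sigma")
```

So an ellipse enclosing 68% of the probability in 2D corresponds to roughly 1.5 sigma rather than 1 sigma.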

---

### **Correlation vs Causation**

Correlation does not imply causation!
Just because the sun burns our skin and also makes us thirsty, it doesn't mean that thirst causes sunburn!

- **Pearson's correlation** ($r$): Measures the linear correlation between two datasets. It takes values between $-1$ and $1$; $r = 0$ means there is no linear correlation. It has two problems:
  - it is sensitive to outliers
  - it does not take measurement errors into account

- **Spearman's rho**: Measures monotonic (rank-based) correlation.
- **Kendall's tau**: Measures ordinal association between two variables.

---

### **Rejection Sampling**

Rejection sampling is a method to generate random samples from a complex distribution $p(x)$, using a simpler proposal distribution $q(x)$.

The procedure works as follows:

1. **Choose a proposal distribution** $q(x)$ from which it's easy to sample (often a uniform distribution).  
   Make sure it's "wide enough" to cover the shape of $p(x)$, including its tails.

2. **Find a constant** $M$ such that for all $x$:
   $$
   p(x) \leq M q(x)
   $$
   This ensures the proposal dominates the target distribution.

3. **Generate a candidate sample** $x$ from $q(x)$.

4. **Draw a random number** $u$ from $\mathcal{U}(0, 1)$.

5. **Accept or reject**:
   - Accept $x$ if  
     $$
     u < \frac{p(x)}{M q(x)}
     $$
   - Otherwise, reject $x$ and go back to step 3.

The set of accepted $x$ values will follow the target distribution $p(x)$.
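
The steps above can be sketched as follows. This is only an illustration: the target $p(x) = 6x(1-x)$ on $[0, 1]$, the uniform proposal $q(x) = 1$ and the constant $M = 1.5$ are assumptions chosen because the bound $p(x) \leq M q(x)$ is easy to verify.

```python
import random

def target_pdf(x: float) -> float:
    """Target density p(x) = 6 x (1 - x) on [0, 1]; its maximum is 1.5 at x = 0.5."""
    return 6.0 * x * (1.0 - x)

PROPOSAL_PDF = 1.0   # q(x) = 1 for the uniform proposal on [0, 1]
M = 1.5              # smallest constant with p(x) <= M q(x) everywhere

def rejection_sample(n: int, rng: random.Random):
    """Return n accepted samples and the total number of candidates drawn."""
    samples = []
    trials = 0
    while len(samples) < n:
        x = rng.random()                 # step 3: candidate from q(x)
        u = rng.random()                 # step 4: uniform number in [0, 1)
        trials += 1
        if u < target_pdf(x) / (M * PROPOSAL_PDF):   # step 5: accept or reject
            samples.append(x)
    return samples, trials

samples, trials = rejection_sample(20_000, random.Random(1))
sample_mean = sum(samples) / len(samples)
acceptance = len(samples) / trials
print(f"sample mean     = {sample_mean:.3f}  (target mean is 0.5)")
print(f"acceptance rate = {acceptance:.3f}  (expected 1/M = 0.667)")
```

The acceptance rate equals $1/M$ for normalized densities, which is why a proposal that hugs the target closely wastes fewer candidates.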


---

### **Inverse Transform Sampling**

- Used to sample from a distribution with known CDF $F(x)$ and quantile function $F^{-1}(u)$.
- Steps:
  1. Sample $u$ from $\mathcal{U}(0, 1)$.
  2. Compute $x = F^{-1}(u)$.
- Normalization is really important here: the PDF must integrate to 1, so that the CDF goes from 0 to 1.
- If the CDF and the quantile function cannot be derived by hand, they can be computed numerically.
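
A minimal sketch of the two steps, assuming an exponential distribution with rate $\lambda$ as the example: its CDF is $F(x) = 1 - e^{-\lambda x}$, so the quantile function is $F^{-1}(u) = -\ln(1-u)/\lambda$. The function names are hypothetical.

```python
import math
import random

def exponential_quantile(u: float, rate: float) -> float:
    """Inverse CDF of the exponential distribution: F^-1(u) = -ln(1 - u) / rate."""
    return -math.log(1.0 - u) / rate

def inverse_transform_sample(n: int, rate: float, rng: random.Random):
    """Draw n exponential samples by pushing uniform numbers through F^-1."""
    # rng.random() lies in [0, 1), so 1 - u is never zero and the log is safe
    return [exponential_quantile(rng.random(), rate) for _ in range(n)]

rate = 2.0
draws = inverse_transform_sample(50_000, rate, random.Random(0))
draws_mean = sum(draws) / len(draws)
print(f"sample mean = {draws_mean:.3f}  (expected 1/rate = {1 / rate})")
```

Unlike rejection sampling, every uniform number produces one accepted sample, at the cost of needing the quantile function.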

### **<span style="color:red"> LECTURE 5  </span>**

### **Population, Sample, Statistic, Estimators, Uncertainty and Intervals**

- A **population** is the full set of data or measurements we are interested in.
- A **sample** is a subset of the population, used to infer properties of the whole.
- A **statistic** is a function of the sample (e.g. the sample mean $\bar{x}$).
- An **estimator** is a rule or formula to estimate population parameters from the sample.
- All estimators have **uncertainties** due to random sampling.
- A **confidence interval** gives a range likely to contain the true value.

---

### **Frequentist vs Bayesian**

- **Frequentist**: Probability is derived from the frequency of events. Parameters are fixed, data are random. In frequentist inference we have confidence levels.
- **Bayesian**: Probability expresses belief or uncertainty about what we know. Parameters have distributions while data are fixed. In Bayesian inference we have credible regions derived from the posterior distribution of the parameters.


- Bayesian statistics is based on **Bayes’ theorem**:
  $$
  P(\theta | \text{data}) = \frac{P(\text{data} | \theta) \cdot P(\theta)}{P(\text{data})}
  $$

---

### **Maximum Likelihood Estimator (MLE)**

- The **MLE** is the value of the parameter $\theta$ that **maximizes the likelihood** of the observed data:
  $$
  \hat{\theta}_{\text{MLE}} = \arg \max_\theta \mathcal{L}(\theta)
  $$
- It is useful in both the frequentist and the Bayesian approach.
- **Remember**: the **likelihood** is defined as the product of the probabilities (or probability densities) of the observed data, assuming a given model or parameter value.

  For independent data points $x_1, x_2, ..., x_N$:

  $$
  \mathcal{L}(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)
  $$

  Where:
  - $\mathcal{L}(\theta)$ is the likelihood function,
  - $p(x_i \mid \theta)$ is the probability (or density) of observing $x_i$ given parameter $\theta$,
  - The product assumes all $x_i$ are independent.

  Often, we work with the **log-likelihood**:
  $$
  \log \mathcal{L}(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta)
  $$
  which is easier to compute and optimize.

---

### **Properties of Estimators**

- **Unbiasedness**: $\mathbb{E}[\hat{\theta}] = \theta$
- **Consistency**: $\hat{\theta} \to \theta$ as $N \to \infty$
- **Efficiency**: The estimator has the minimum possible variance, i.e. it reaches the Cramér-Rao bound.

---

### **Likelihood, Chi-squared and Minimization**

- The **likelihood** $\mathcal{L}(\theta)$ is the probability of the data given the parameters.
- If we assume that the errors are Gaussian, the likelihood is proportional to $\exp(-\chi^2/2)$.
- In Gaussian cases, maximizing the log-likelihood is equivalent to **minimizing the chi-squared**:
  $$
  \chi^2 = \sum_{i=1}^N \left( \frac{y_i - f(x_i; \theta)}{\sigma_i} \right)^2
  $$
  where $y_i$ are the measured values, $f(x_i; \theta)$ the model predictions and $\sigma_i$ the uncertainties.
- Minimizing $\chi^2$ gives the best-fit parameters.
- The MLE method tells us to treat the likelihood as a function of the (unknown) model parameters; by minimizing the $\chi^2$, we find the parameter values that maximize the likelihood.

---

### **Mean and MLE Error: Homoscedastic vs Heteroscedastic**

- **Homoscedastic**: All data points have the same uncertainty $\sigma$. We will use the mean:
    $$
    \bar{x} = \frac{1}{N} \sum x_i \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}
    $$
- **Heteroscedastic**: Uncertainties vary for each data point $\sigma_i$. Then use a **weighted mean**:
    $$
    \bar{x} = \frac{\sum x_i / \sigma_i^2}{\sum 1 / \sigma_i^2}
    $$
    $$
    \sigma_{\bar{x}}^2 = \frac{1}{\sum 1/\sigma_i^2}
    $$
These two formulas follow from setting the derivative of the log-likelihood to zero, because we are searching for a maximum.
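
The weighted-mean formulas can be sketched as below. The function name and the example numbers are hypothetical; the second call shows that equal uncertainties recover the ordinary mean with error $\sigma/\sqrt{N}$.

```python
import math

def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean and its uncertainty."""
    weights = [1.0 / s**2 for s in sigmas]            # w_i = 1 / sigma_i^2
    mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    error = math.sqrt(1.0 / sum(weights))             # sigma_mean^2 = 1 / sum(w_i)
    return mean, error

# Heteroscedastic case: the more precise point dominates
mean_het, err_het = weighted_mean([10.0, 12.0], [1.0, 2.0])
print(f"weighted mean = {mean_het:.3f} +/- {err_het:.3f}")

# Homoscedastic case: reduces to the plain mean and sigma / sqrt(N)
mean_hom, err_hom = weighted_mean([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(f"plain mean    = {mean_hom:.3f} +/- {err_hom:.3f}")
```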

Our Maximum Likelihood Estimator (MLE) is not perfect — every estimate has an associated **uncertainty** due to the finite sample size.

Under general conditions, the MLE becomes **asymptotically normal**: for large $N$, the likelihood function can be approximated by a **Gaussian** centered at the maximum-likelihood estimate $\hat{\theta}$, which itself converges to the true parameter value $\theta_0$.

To quantify the uncertainty, we expand the **log-likelihood** around its maximum using a second-order **Taylor expansion**:

$$
\log \mathcal{L}(\theta) \approx \log \mathcal{L}(\hat{\theta}) - \frac{1}{2} (\theta - \hat{\theta})^2 F(\hat{\theta})
$$

Here, $F(\hat{\theta})$ is the **Fisher Information Matrix**, defined as the negative second derivative (Hessian) of the log-likelihood:

$$
F_{ij}(\theta) = - \frac{\partial^2 \log \mathcal{L}(\theta)}{\partial \theta_i \, \partial \theta_j}
$$

The **covariance matrix** of the estimator $\hat{\theta}$ is then given by the **inverse** of the Fisher matrix:

$$
\text{Cov}(\hat{\theta}) = F^{-1}(\hat{\theta})
$$

For a **single parameter** $\theta$, this simplifies to:

$$
\sigma_{\hat{\theta}} = \sqrt{\frac{1}{F(\hat{\theta})}}
$$

This implies that asymptotically, the MLE is **normally distributed**:

$$
\hat{\theta} \sim \mathcal{N} \left( \theta_0, F^{-1}(\theta_0) \right)
$$

---

### **Non-Gaussian Likelihoods**

- When the data do not follow a Gaussian distribution, use the appropriate **likelihood model**, such as **Poisson**, **Binomial**, **Exponential**, **Log-normal**, etc.
- The MLE approach still applies: choose the model, write the likelihood, and maximize it numerically.
- In many cases the result is the same as in the Gaussian case; for example, the MLE of a Poisson mean is still the sample mean.

### **<span style="color:red"> LECTURE 6 </span>**

### **Fit**

- Fitting means adjusting model parameters so that the model best matches the observed data.
- Typically done by minimizing a loss function, such as the **sum of squared residuals** or **negative log-likelihood**.
- The goal is to find the best estimate $\hat{\theta}$ that explains the data.

---

### **Outliers and Huber Loss Function**

- **Outliers** are data points that deviate significantly from the trend of the rest of the data.
- Standard least squares is very sensitive to outliers.
- How do we deal with outliers? By modifying the likelihood!
- The **Huber loss** combines the squared loss for small errors and absolute loss for large errors:

$$
L_{\text{Huber}}(t) =
\begin{cases}
\frac{1}{2} t^2 & \text{if } |t| \leq c \\
c |t| - \frac{1}{2} c^2 & \text{if } |t| > c
\end{cases}
$$

- Where $t = \left( \frac{y - M(\theta)}{\sigma} \right)$ represents the **standardized residual**, i.e. how far the observed value $y$ is from the model prediction $M(\theta)$, in units of the known uncertainty $\sigma$.
- $c$ is the **tuning constant** (or confidence threshold), which determines the cutoff point where the loss switches from quadratic to linear. A common value is $c \approx 1.345$, which gives good balance between efficiency and robustness under normal errors.
- This approach makes the fit more **robust** to outliers: small residuals behave like in least squares, but large residuals are penalized less harshly.
- Note that by doing this, we are effectively putting **prior information** into the analysis.
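
A minimal sketch of the Huber loss as defined above, compared with the plain squared loss on a few standardized residuals. The residual values are made up for illustration, with the last one playing the role of an outlier.

```python
def huber_loss(t: float, c: float = 1.345) -> float:
    """Huber loss of a standardized residual t with tuning constant c."""
    if abs(t) <= c:
        return 0.5 * t * t                 # quadratic regime (like least squares)
    return c * abs(t) - 0.5 * c * c        # linear regime (outliers count less)

def squared_loss(t: float) -> float:
    """Ordinary least-squares loss, for comparison."""
    return 0.5 * t * t

residuals = [0.5, -1.0, 8.0]   # hypothetical standardized residuals
for t in residuals:
    print(f"t = {t:5.1f}  squared = {squared_loss(t):7.3f}  huber = {huber_loss(t):7.3f}")
```

For the small residuals both losses agree, while the outlier contributes far less to the Huber total than to the squared one.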

---

### **Goodness of Fit : Reduced Chi-squared**

- Measures how well the model describes the data. Remember GIGO (Garbage In, Garbage Out): if the model is wrong, finding the "best" parameters does not mean much.
- A good fit should show residuals randomly scattered around zero.

- The **reduced chi-squared** is defined as:

$$
\chi^2_{\text{red}} = \frac{1}{\nu} \sum_{i=1}^N \left( \frac{y_i - f(x_i)}{\sigma_i} \right)^2
$$

where $\nu = N - k$ is the number of degrees of freedom (data points minus number of parameters).

- Interpretation:
  - $\chi^2_{\text{red}} \approx 1$: good fit
  - $\chi^2_{\text{red}} \gg 1$: underfitting or underestimated errors
  - $\chi^2_{\text{red}} \ll 1$: overfitting or overestimated errors
- If the model is **wrong** (misspecified), goodness-of-fit measures can be misleading.
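
As a minimal sketch, the reduced chi-squared can be computed as below. The data, the model values and the uncertainties are invented for illustration, and `n_params = 2` assumes a straight-line model with two fitted parameters.

```python
def reduced_chi2(y, model, sigma, n_params):
    """Chi-squared divided by the degrees of freedom nu = N - k."""
    chi2 = sum(((yi - mi) / si) ** 2 for yi, mi, si in zip(y, model, sigma))
    dof = len(y) - n_params
    if dof <= 0:
        raise ValueError("need more data points than fitted parameters")
    return chi2 / dof

# Hypothetical measurements compared with the model prediction f(x_i)
y_obs = [1.1, 1.9, 3.2, 3.9, 5.1]
y_model = [1.0, 2.0, 3.0, 4.0, 5.0]
y_err = [0.1, 0.1, 0.1, 0.1, 0.1]

chi2_red = reduced_chi2(y_obs, y_model, y_err, n_params=2)
print(f"reduced chi-squared = {chi2_red:.2f}")
```

A value well above 1, as in this made-up case, would suggest either a poor model or underestimated uncertainties.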

---

### **Model Comparison, Occam’s Razor , AIC and BIC**
- The $\chi^2_{\text{red}}$ cannot be used with the Huber loss, because the corresponding likelihood is not Gaussian!
- When comparing two models with the **same number of parameters**, we can simply compare their **maximum log-likelihood** values.
- Larger log-likelihood ⇒ better fit

The **Huber loss** clearly performs better (less negative log-likelihood), meaning it fits the data more effectively, especially in the presence of outliers.


When models have **different numbers of parameters**, simply comparing likelihoods is not fair: more complex models might fit better **just by chance**. We need to penalize complexity — this is known as the **Occam penalty**.


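A simple method to compare models with different complexity is the **AIC** (Akaike Information Criterion):

$$
\text{AIC} = -2 \ln \mathcal{L}_{\max} + 2k + \frac{2k(k+1)}{N - k - 1}
$$

where $k$ is the number of model parameters and $N$ the number of data points.

- A lower AIC means a better description of the dataset.
- It is composed of three terms: the first is the $\chi^2$ (for Gaussian errors, up to a constant), while the second and third penalize model complexity.
- If models fit the data equally well, AIC prefers the one with fewer parameters.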



The **BIC** (Bayesian Information Criterion) is another way to compare models, especially when they have different numbers of parameters.
It’s similar to AIC, but it **penalizes complex models more strongly**, especially when the dataset is large:

$$
\text{BIC} = -2 \ln \mathcal{L}_{\max} + k \ln N
$$

- Lower **BIC** means a better model.
- BIC prefers **simpler models**, especially when $N$ is large.
- It’s often used in **Bayesian statistics**, but it doesn’t need a full Bayesian analysis, so it is often used in frequentist analysis too.
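
A minimal sketch of both criteria, assuming the small-sample corrected form $\text{AIC} = -2\ln\mathcal{L}_{\max} + 2k + \frac{2k(k+1)}{N-k-1}$ and the standard $\text{BIC} = -2\ln\mathcal{L}_{\max} + k\ln N$. The two models and their log-likelihood values are hypothetical.

```python
import math

def aic(log_l_max: float, k: int, n: int) -> float:
    """Corrected Akaike Information Criterion (assumed form, see lead-in)."""
    return -2.0 * log_l_max + 2.0 * k + 2.0 * k * (k + 1) / (n - k - 1)

def bic(log_l_max: float, k: int, n: int) -> float:
    """Bayesian Information Criterion."""
    return -2.0 * log_l_max + k * math.log(n)

n_points = 100
# Hypothetical fits: model B fits slightly better but uses 3 more parameters
aic_a, bic_a = aic(-120.0, 2, n_points), bic(-120.0, 2, n_points)
aic_b, bic_b = aic(-119.5, 5, n_points), bic(-119.5, 5, n_points)

print(f"model A: AIC = {aic_a:.2f}, BIC = {bic_a:.2f}")
print(f"model B: AIC = {aic_b:.2f}, BIC = {bic_b:.2f}")
```

In this made-up case the small gain in likelihood does not pay for the extra parameters, so both criteria prefer model A, and BIC does so by a wider margin.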


---

### **Bootstrap**

- A **resampling method** to estimate uncertainties and confidence intervals.
- Be careful: it may look as if it creates information out of nothing, but the resampled datasets contain no more information than the original one.
- Steps:
  1. Resample the data (with replacement) to create many datasets. The probability of drawing exactly the original dataset is extremely low ($N! / N^N$).
  2. Fit the model to each resampled dataset.
  3. Analyze the distribution of the fitted parameters.
- Useful when the analytical uncertainty is hard to compute or too large (such as when we have only a few points from a Gaussian distribution).
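
The steps above can be sketched as follows, using the sample mean and the median as the "fitted" quantities. The synthetic Gaussian data, the number of resamples and the function name are assumptions made for illustration.

```python
import random
import statistics

def bootstrap_std_error(data, statistic, n_resamples, rng):
    """Standard deviation of a statistic over bootstrap resamples."""
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Step 1: resample with replacement, same size as the original dataset
        resample = [data[rng.randrange(n)] for _ in range(n)]
        # Step 2: evaluate the statistic on the resampled dataset
        estimates.append(statistic(resample))
    # Step 3: the spread of the estimates approximates the uncertainty
    return statistics.stdev(estimates)

rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]   # synthetic dataset

se_mean = bootstrap_std_error(data, statistics.fmean, 2000, rng)
se_median = bootstrap_std_error(data, statistics.median, 2000, rng)
expected = statistics.stdev(data) / len(data) ** 0.5   # analytic s / sqrt(N)

print(f"bootstrap SE of the mean   = {se_mean:.4f}")
print(f"analytic  SE of the mean   = {expected:.4f}")
print(f"bootstrap SE of the median = {se_median:.4f}")
```

For the mean, the bootstrap value should be close to the analytic $s/\sqrt{N}$; the same function also handles the median, where no simple formula is at hand.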

---

###  **Jackknife Method** 

The **Jackknife** is a method to estimate the **uncertainty** (standard error) and **bias** of a statistic — like the **mean** or **standard deviation** — using your data.

Suppose you have a dataset of $N$ values.

1. Leave out **one** data point at a time → you get $N$ new datasets.
2. Compute your statistic (e.g. mean, std) on each of these.
3. From the $N$ results, estimate:
   - A **better (bias-corrected)** value of the statistic
   - The **uncertainty** on that value


Jackknife works **well** when the statistic is:
- The **mean**
- The **standard deviation**

It works **poorly** for:
- The **median**
- **Quantiles** (e.g. the 25th percentile)

These are called **rank-based statistics**, and removing one point at a time doesn’t change them much — so the jackknife underestimates the uncertainty.
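
A minimal sketch of the procedure, assuming the standard leave-one-out formulas (not written out in these notes): the bias-corrected value is $N\hat{\theta} - (N-1)\bar{\theta}_{\text{loo}}$ and the standard error is $\sqrt{\frac{N-1}{N}\sum_i (\hat{\theta}_i - \bar{\theta}_{\text{loo}})^2}$. The data and the function name are hypothetical.

```python
import math
import statistics

def jackknife(data, statistic):
    """Leave-one-out jackknife: bias-corrected estimate and standard error."""
    n = len(data)
    full = statistic(data)
    # Steps 1-2: statistic on each dataset with one point removed
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = sum(loo) / n
    # Step 3: bias correction and uncertainty from the N leave-one-out values
    bias_corrected = n * full - (n - 1) * loo_mean
    std_error = math.sqrt((n - 1) / n * sum((v - loo_mean) ** 2 for v in loo))
    return bias_corrected, std_error

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]   # hypothetical measurements
estimate, std_error = jackknife(data, statistics.fmean)
analytic = statistics.stdev(data) / math.sqrt(len(data))

print(f"jackknife mean = {estimate:.4f} +/- {std_error:.4f}")
print(f"analytic  s/sqrt(N) = {analytic:.4f}")
```

For the mean, the jackknife error reproduces $s/\sqrt{N}$ exactly, which is a convenient sanity check before applying it to other statistics.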

---

### **Jackknife vs Bootstrap**

|                | Jackknife                | Bootstrap                |
|----------------|--------------------------|--------------------------|
| Type           | Leaves out one point     | Resamples with replacement |
| Fast?          | ✅ Yes                   | ❌ Slower               |
| Repeatable?    | ✅ Always same result     | ❌ Changes each time    |
| Works for all stats? | ❌ Not for medians       | ✅ Yes                 |
| Confidence intervals | ❌ Approximate         | ✅ Full distribution     |


### **<span style="color:red"> LECTURE 7  </span>**

### **<span style="color:red"> LECTURE 8  </span>**