# MODELS

## MODELS FOR PREDICTION

### Simple Linear Regression Model

#### Model

The simple linear regression model is:

$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
$$

**Where:**
- $y_i$: dependent variable (outcome) for observation $i$  
- $x_i$: independent variable (predictor) for observation $i$  
- $\beta_0$: intercept (value of $y$ when $x=0$)  
- $\beta_1$: slope (change in $y$ for a one-unit increase in $x$)  
- $\varepsilon_i$: error term for observation $i$, assumed to have mean 0

**While:**

- **Known (from the data):** $x_i, y_i$ (the observations, given in the dataset).  
- **Unknown but to be estimated:** $\beta_0, \beta_1$ (parameters of the regression model).  
- **Not directly known but assumed:** $\varepsilon_i$ (error terms, assumed to have mean 0 and constant variance).

In regression, the **Residual Sum of Squares (RSS)** measures the total squared difference between the observed values $y_i$ and the predicted values $\hat{y}_i$:

$$
RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

- $y_i$ is the observed value for observation $i$.  
- $\hat{y}_i$ is the predicted value from the regression model: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.  
- The difference $e_i = y_i - \hat{y}_i$ is called the residual.  

The coefficients $\beta_0, \beta_1$ are estimated via **Ordinary Least Squares (OLS)**, which chooses the values that minimize the RSS:

$$
RSS(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2
$$

- The residual for observation $i$ is $e_i = y_i - \hat{y}_i$.  
- Minimizing $RSS$ gives the “best-fitting line” through the data points.
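
The minimization above has a well-known closed-form solution, $\hat{\beta}_1 = S_{xy}/S_{xx}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. The following is a minimal NumPy sketch of that solution; the helper name `fit_simple_ols` and the data are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (beta0_hat, beta1_hat) minimizing the RSS for y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    s_xy = np.sum((x - x_bar) * (y - y_bar))  # sum of cross-deviations
    s_xx = np.sum((x - x_bar) ** 2)           # sum of squared deviations of x
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Illustrative data only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
beta0, beta1 = fit_simple_ols(x, y)
rss = np.sum((y - (beta0 + beta1 * x)) ** 2)
```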

#### Weighting

Weighting is used when some observations are more important or more reliable than others.

The weighted simple linear regression model is:

$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1,2,\dots,n
$$

- Standard regression minimizes the Residual Sum of Squares (RSS):

$$
RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

- Weighted regression introduces weights $w_i > 0$ for each observation:

$$
WRSS = \sum_{i=1}^n w_i \left(y_i - \hat{y}_i\right)^2
$$

where:
- $w_i$: weight for observation $i$  
- $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$  

Minimize $WRSS$ with respect to $\beta_0$ and $\beta_1$:

$$
(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^n w_i (y_i - \beta_0 - \beta_1 x_i)^2
$$

The weighted formulas for the coefficients:

$$
\hat{\beta}_1 = \frac{\sum_{i=1}^n w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sum_{i=1}^n w_i (x_i - \bar{x}_w)^2}
$$

$$
\hat{\beta}_0 = \bar{y}_w - \hat{\beta}_1 \bar{x}_w
$$

where weighted means are:

$$
\bar{x}_w = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}, \quad
\bar{y}_w = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i}
$$

Observations with larger weights $w_i$ have more influence on the fitted line, so assigning small weights to less reliable points reduces their impact on the coefficient estimates.
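
The weighted formulas can be sketched directly in NumPy. The helper name `fit_simple_wls`, the data, and the choice of weights below are assumptions made for illustration.

```python
import numpy as np

def fit_simple_wls(x, y, w):
    """Weighted least squares for y = b0 + b1 * x using weighted means."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    x_w = np.sum(w * x) / np.sum(w)  # weighted mean of x
    y_w = np.sum(w * y) / np.sum(w)  # weighted mean of y
    beta1 = np.sum(w * (x - x_w) * (y - y_w)) / np.sum(w * (x - x_w) ** 2)
    beta0 = y_w - beta1 * x_w
    return beta0, beta1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 20.0])  # the last point is treated as unreliable
w = np.array([1.0, 1.0, 1.0, 1.0, 0.1])   # so it receives a small weight
b0_w, b1_w = fit_simple_wls(x, y, w)
b0_u, b1_u = fit_simple_wls(x, y, np.ones_like(x))  # equal weights reduce to OLS
```

With equal weights the result coincides with ordinary least squares; down-weighting the last point pulls the slope back toward the trend of the first four points.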

#### Performance

**Cross-Validation**

Cross-validation estimates the generalization performance of the regression model on unseen data.  

**k-Fold Cross-Validation**

1. Split the data into $k$ roughly equal folds (subsets): $D_1, D_2, \dots, D_k$.  
2. For each fold $j = 1, \dots, k$:
   - Train the model on the remaining $k-1$ folds: $D_{-j} = D \setminus D_j$  
   - Fit the model to obtain $\hat{\beta}_0^{(-j)}, \hat{\beta}_1^{(-j)}$  
   - Predict on the left-out fold $D_j$: $\hat{y}_i^{(-j)} = \hat{\beta}_0^{(-j)} + \hat{\beta}_1^{(-j)} x_i$ for $i \in D_j$  
3. Compute the prediction error (e.g., Mean Squared Error) for each fold:

$$
MSE_j = \frac{1}{|D_j|} \sum_{i \in D_j} \left(y_i - \hat{y}_i^{(-j)}\right)^2
$$

4. Average over all folds to get cross-validated MSE:

$$
CV_{MSE} = \frac{1}{k} \sum_{j=1}^{k} MSE_j
$$

- *Common choices are $k=5$ or $k=10$*  
- *Leave-One-Out CV (LOOCV) is the special case $k=n$, where each observation is used once as a single test case.*

$CV_{MSE}$ gives an estimate of the expected prediction error on new/unseen data.  

Lower $CV_{MSE}$ means better generalization performance.  

This method also helps detect overfitting: a model with very low training MSE but high $CV_{MSE}$ is overfitting the training data.
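
The k-fold procedure can be sketched as follows. This is an illustrative implementation under stated assumptions: the function name `kfold_cv_mse`, the random shuffling with a fixed seed, and the synthetic data are not from the original notes.

```python
import numpy as np

def kfold_cv_mse(x, y, k=5, seed=0):
    """Cross-validated MSE of simple linear regression with k folds."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)  # index sets D_1, ..., D_k
    fold_mse = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False               # D_{-j}: everything except D_j
        # np.polyfit returns [slope, intercept] for degree 1
        b1, b0 = np.polyfit(x[train_mask], y[train_mask], 1)
        y_hat = b0 + b1 * x[test_idx]              # predict on the held-out fold
        fold_mse.append(np.mean((y[test_idx] - y_hat) ** 2))
    return float(np.mean(fold_mse))                # average of MSE_j over folds

x = np.arange(20, dtype=float)
y = 3.0 + 2.0 * x + np.sin(x)              # deterministic wiggle around a line
cv_mse = kfold_cv_mse(x, y, k=5)
loocv_mse = kfold_cv_mse(x, y, k=len(x))   # LOOCV: k = n
```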

**via Residuals**

$$
e_i = y_i - \hat{y}_i
$$

**Where:**
- $y_i$: observed value for observation $i$  
- $\hat{y}_i$: predicted value from the regression model

**Interpretation:**  
- The residual represents the difference between the actual and predicted value for each observation.  
- Small residuals indicate that the model predictions are close to the actual values.

**via Mean Squared Error (MSE)**

$$
MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{RSS}{n}
$$

**Where:**
- $n$: number of observations  
- $RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2$

**Interpretation:**  
- MSE measures the average squared difference between observed and predicted values.  
- Lower MSE indicates better model fit, but the unit is the square of the dependent variable.

**via Root Mean Squared Error (RMSE)**

$$
RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
$$

**Interpretation:**  
- RMSE converts MSE back to the original units of the dependent variable and it represents the typical size of the prediction error.

**via Mean Absolute Error (MAE)**

$$
MAE = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|
$$

**Interpretation:**  
- MAE is the average of the absolute residuals. It is more robust to outliers than MSE/RMSE and gives a straightforward measure of average prediction error.

**via R-squared ($R^2$)**

$$
R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = 1 - \frac{RSS}{TSS}
$$

**Where:**
- $TSS = \sum_{i=1}^n (y_i - \bar{y})^2$: total sum of squares (total variation of the observed data around its mean)  
- $\bar{y}$: mean of observed values

**Interpretation:**  
- $R^2$ measures the proportion of variance in the dependent variable explained by the model.  
- Values closer to 1 indicate better fit; 0 means the model explains nothing.
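
The residual-based metrics above can be computed together. The sketch below uses a hypothetical helper `regression_metrics` and made-up observed and predicted values.

```python
import numpy as np

def regression_metrics(y, y_hat):
    """MSE, RMSE, MAE and R^2 from observed and predicted values."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    resid = y - y_hat                    # e_i = y_i - y_hat_i
    rss = np.sum(resid ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    mse = rss / len(y)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(resid)),
        "R2": 1.0 - rss / tss,
    }

y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])
m = regression_metrics(y, y_hat)
# Residuals are [1, 0, -1, 0]: MSE = 0.5, MAE = 0.5, R2 = 1 - 2/20 = 0.9
```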

#### Prediction

Linear regression model for prediction is:

$$
\hat{y}_{new} = \hat{\beta}_0 + \hat{\beta}_1 x_{new}
$$

**Where:**
- $\hat{\beta}_0$: estimated intercept  
- $\hat{\beta}_1$: estimated slope coefficient  
- $x_{new}$: new value of the predictor variable  
- $\hat{y}_{new}$: predicted value of the response variable

To predict a new individual observation at $x_0$, use a **prediction interval (PI)**, which estimates the range in which a new individual response lies for a given $x_0$:

$$
\hat{y}_0 \; \pm \; t_{\alpha/2, \, n-2} \cdot SE_{pred}(\hat{y}_0)
$$

**Where:**
- $\hat{y}_0$: predicted mean response at $x_0$  
- $SE_{pred}(\hat{y}_0) = \sqrt{SE(\hat{y}_0)^2 + \sigma^2}$  
- $\sigma^2$: variance of the error term (residual variance), estimated in practice by $\hat{\sigma}^2 = RSS/(n-2)$

To quantify the uncertainty of the mean response at a given $x_0$, use a **confidence interval (CI)**, which estimates the range in which the mean response lies for a given $x_0$:

$$
\hat{y}_0 \; \pm \; t_{\alpha/2, \, n-2} \cdot SE(\hat{y}_0)
$$

**Where:**
- $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$: predicted mean response at $x_0$  
- $t_{\alpha/2, \, n-2}$: critical value from the $t$-distribution with $n-2$ degrees of freedom  
- $SE(\hat{y}_0)$: standard error of the predicted mean  

The PI is always wider than the CI, because it includes both the uncertainty of the mean estimate and the random error of a new observation.
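
Both intervals can be sketched for simple linear regression using the standard result $SE(\hat{y}_0) = \hat{\sigma}\sqrt{1/n + (x_0 - \bar{x})^2 / S_{xx}}$ with $\hat{\sigma}^2 = RSS/(n-2)$. In this sketch the helper name `mean_ci_and_pi` and the data are assumptions, and the critical value `t_crit` is passed in by the caller (it could be obtained from, for example, `scipy.stats.t.ppf(1 - alpha/2, n - 2)`).

```python
import numpy as np

def mean_ci_and_pi(x, y, x0, t_crit):
    """Return (y0_hat, CI for the mean response, PI for a new observation)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    b1, b0 = np.polyfit(x, y, 1)                      # [slope, intercept]
    resid = y - (b0 + b1 * x)
    sigma2_hat = np.sum(resid ** 2) / (n - 2)         # estimate of sigma^2
    s_xx = np.sum((x - x.mean()) ** 2)
    y0 = b0 + b1 * x0
    se_mean = np.sqrt(sigma2_hat * (1.0 / n + (x0 - x.mean()) ** 2 / s_xx))
    se_pred = np.sqrt(se_mean ** 2 + sigma2_hat)      # adds new-observation noise
    ci = (y0 - t_crit * se_mean, y0 + t_crit * se_mean)
    pi = (y0 - t_crit * se_pred, y0 + t_crit * se_pred)
    return y0, ci, pi

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
# For a 95% interval with n - 2 = 3 degrees of freedom, t is about 3.182
y0, ci, pi = mean_ci_and_pi(x, y, x0=3.5, t_crit=3.182)
```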

**Extrapolation** occurs when a regression model is used to predict values outside the range of the observed data. It can lead to unreliable predictions, because the model was not fitted to values outside that range. Even if the relationship is linear within the data, the true relationship outside the data range may differ.

#### Diagnostics

**Outliers**

Observations with unusually large residuals compared to what the model predicts.  
- **Check:**  
  - Standardized residuals ($|r_i| > 2$ or $3$, where $r_i = e_i / (\hat{\sigma}\sqrt{1 - h_{ii}})$ and $h_{ii}$ is the leverage).  
  - Studentized residuals.

**Influential Points**

Points that disproportionately affect the estimated regression coefficients.  
- **Check:**  
  - Cook’s Distance ($D_i > 1$ is often problematic).  
  - Leverage values ($h_{ii}$, diagonal elements of the hat matrix).

**Heteroskedasticity**

The variance of the residuals is not constant ($Var(\varepsilon_i) \neq \sigma^2$).  
- **Check:**  
  - Residuals vs. Fitted plot (fan or cone shape suggests heteroskedasticity).  
  - Breusch–Pagan test.

**Non-normality of Errors**

Residuals are not normally distributed ($\varepsilon_i \sim N(0, \sigma^2)$ assumption violated).  
- **Check:**  
  - Q-Q plot.  
  - Shapiro-Wilk test.

**Correlated Errors**

Residuals are dependent on each other (common in time-series data).  
- **Check:**  
  - Durbin–Watson statistic ($DW \approx 2$ is good; $<1$ or $>3$ indicates problems).  
  - Residuals vs. Time plot.
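
The Durbin–Watson statistic is $DW = \sum_{t=2}^{n} (e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$, computed on residuals in time order. A minimal sketch, with a hypothetical helper `durbin_watson` and two made-up residual sequences:

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic for residuals given in time order."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

# Sign-flipping residuals suggest negative autocorrelation (DW well above 2)
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
# Slowly drifting residuals suggest positive autocorrelation (DW well below 2)
trending = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])

dw_alt = durbin_watson(alternating)   # 20 / 6
dw_trend = durbin_watson(trending)    # 8 / 28
```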

**Non-Linearity**

If the true relationship is not linear, the model is misspecified.  
- **Check:**  
  - Partial residual plots (component + residual plots).  
  - Ramsey RESET test.

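Note that with a single predictor, the separate fitting tests and regularization described for multiple linear regression are usually unnecessary: the overall F-test is equivalent to the t-test on $\beta_1$, and there are no correlated predictors to penalize.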

### Multiple Linear Regression Model

#### Model

The multiple linear regression model is:

$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i
$$

**Where:**

- $y_i$ : dependent variable (outcome) for observation $i$
- $x_{ij}$ : value of the $j$-th independent variable (predictor) for observation $i$  
- $\beta_0$ : intercept (expected value of $y$ when all predictors are 0)  
- $\beta_j$ : regression coefficient for predictor $x_{ij}$ (change in $y$ for a one-unit increase in $x_{ij}$, holding other predictors constant)  
- $\varepsilon_i$ : error term for observation $i$, assumed to have mean 0  

**While:**

- **Known (from the data):** $x_{ij}, y_i$ (the observations, given in the dataset).  
- **Unknown but to be estimated:** $\beta_0, \beta_1, \dots, \beta_p$ (parameters of the regression model).  
- **Not directly known but assumed:** $\varepsilon_i$ (error terms, assumed to have mean 0, constant variance, and no correlation across observations).  

In regression, the **Residual Sum of Squares (RSS)** measures the total squared difference between the observed values $y_i$ and the predicted values $\hat{y}_i$:

$$
RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

- $y_i$ is the observed value for observation $i$.  
- $\hat{y}_i$ is the predicted value from the regression model:  

$$
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_p x_{ip}
$$  

- The difference $e_i = y_i - \hat{y}_i$ is called the residual.  

The coefficients $\beta_0, \beta_1, \dots, \beta_p$ are estimated via **Ordinary Least Squares (OLS)**, which chooses the values that minimize the RSS:

$$
RSS(\beta_0, \beta_1, \dots, \beta_p) = \sum_{i=1}^n \Big( y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \Big)^2
$$

- The residual for observation $i$ is $e_i = y_i - \hat{y}_i$.  
- Minimizing $RSS$ gives the “best-fitting hyperplane” through the data points.  
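
In matrix form the OLS solution is $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$, where $\mathbf{X}$ includes a column of ones. A minimal sketch using NumPy's least-squares solver follows; the helper name `fit_ols` and the noise-free data are illustrative assumptions.

```python
import numpy as np

def fit_ols(X_raw, y):
    """Return (beta0_hat, beta1_hat, ..., betap_hat) minimizing the RSS."""
    X_raw = np.asarray(X_raw, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), X_raw])  # prepend the intercept column
    # lstsq solves the least-squares problem without forming an explicit inverse
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 1.0 + 2.0 * X_raw[:, 0] - 0.5 * X_raw[:, 1]  # noise-free, so OLS recovers it
beta_hat = fit_ols(X_raw, y)
```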

#### Weighting

Weighting is used when some observations are more important or more reliable than others.

The weighted multiple linear regression model is:

$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1,2,\dots,n
$$

- Standard regression minimizes the Residual Sum of Squares (RSS):

$$
RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

- Weighted regression introduces weights $w_i > 0$ for each observation:

$$
WRSS = \sum_{i=1}^n w_i \left(y_i - \hat{y}_i\right)^2
$$

where:
- $w_i$: weight for observation $i$  
- $\hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^p \hat{\beta}_j x_{ij}$  

Minimize $WRSS$ with respect to all coefficients $\beta_0, \beta_1, \dots, \beta_p$:

$$
(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p) = \arg\min_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^n w_i \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2
$$

- In matrix form:

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{W} \mathbf{y}
$$

where:
- $\mathbf{X}$ is the $n \times (p+1)$ design matrix (including a column of 1's for the intercept)  
- $\mathbf{W} = \text{diag}(w_1, w_2, \dots, w_n)$ is the diagonal **weight matrix**  
- $\mathbf{y}$ is the vector of observed responses  
- $\hat{\boldsymbol{\beta}} = (\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p)^\top$  

Observations with larger weights $w_i$ have more influence on the fitted hyperplane, so assigning small weights to less reliable points reduces their impact on the coefficient estimates.  

#### Interpretation

**via Correlated Predictors**

The coefficient $\beta_j$ represents the expected change in $y$ for a one-unit increase in $x_j$, holding all other predictors constant:

$$
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_p x_{ip}
$$

- If predictors are correlated, "holding all other predictors constant" may describe a situation rarely seen in the data, so $\beta_j$ must be interpreted with care.  
- High correlation can make $\hat{\beta}_j$ unstable (large standard errors).

**via Multicollinearity**

Multicollinearity occurs when predictors are highly linearly related; it inflates the standard errors of the coefficient estimates.

- Mathematically, this makes $(\mathbf{X}^\top \mathbf{X})$ nearly singular, increasing variance of coefficient estimates:

$$
Var(\hat{\beta}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}
$$

- Detect using Variance Inflation Factor (VIF):

$$
VIF_j = \frac{1}{1 - R_j^2}
$$

where $R_j^2$ is the $R^2$ of regressing $x_j$ on all other predictors.  
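
The VIF can be sketched by running the auxiliary regressions explicitly. The helper name `vif` and the data are assumptions; a common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.

```python
import numpy as np

def vif(X_raw):
    """Variance inflation factor for each column of X_raw (no intercept column)."""
    X_raw = np.asarray(X_raw, dtype=float)
    n, p = X_raw.shape
    out = np.empty(p)
    for j in range(p):
        target = X_raw[:, j]
        # Regress x_j on an intercept plus all other predictors
        others = np.column_stack([np.ones(n), np.delete(X_raw, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        rss = np.sum((target - others @ coef) ** 2)
        tss = np.sum((target - target.mean()) ** 2)
        r2_j = 1.0 - rss / tss
        out[j] = 1.0 / (1.0 - r2_j)
    return out

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = x1 + np.array([0.1, -0.1, 0.2, -0.2, 0.1, -0.1])  # nearly a copy of x1
x3 = np.array([1.0, -1.0, -1.0, 1.0, 1.0, -1.0])       # unrelated pattern
vifs = vif(np.column_stack([x1, x2, x3]))
```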

**via Confounding Variables**

A confounder $x_k$ affects both $y$ and another predictor $x_j$; omitting it from the model can bias the estimates.  

- When the true model contains $x_j$ and $x_k$ but only $x_j$ is included, $\hat{\beta}_j$ is biased by:  

$$
Bias(\hat{\beta}_j) = Cov(x_j, x_k) \cdot \beta_k / Var(x_j)
$$

- Including $x_k$ in the model controls for its confounding effect.
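
The bias formula can be checked numerically. In the sketch below the data are made up and noise-free, so that the slope from the regression omitting $x_k$ differs from the true $\beta_j$ by exactly the formula's value when sample covariance and variance are used.

```python
import numpy as np

x_k = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])                 # confounder
x_j = 0.5 * x_k + np.array([1.0, -1.0, 0.5, -0.5, 1.0, -1.0])  # correlated with x_k
beta_0, beta_j, beta_k = 1.0, 2.0, 3.0
y = beta_0 + beta_j * x_j + beta_k * x_k                        # no noise, for clarity

# Short regression: y on x_j only (x_k omitted)
slope_short, _ = np.polyfit(x_j, y, 1)

# Bias predicted by the formula, using sample moments
cov_jk = np.cov(x_j, x_k, ddof=1)[0, 1]
var_j = np.var(x_j, ddof=1)
predicted_bias = beta_k * cov_jk / var_j

# Long regression: including x_k removes the bias
X = np.column_stack([np.ones_like(x_j), x_j, x_k])
beta_long, *_ = np.linalg.lstsq(X, y, rcond=None)
```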

**via Interaction Terms**

Interaction terms show how the effect of one predictor depends on another. If the effect of $x_1$ on $y$ depends on $x_2$, include an interaction:

$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 (x_{i1} \cdot x_{i2}) + \varepsilon_i
$$

  - $\beta_3$: change in the effect of $x_1$ on $y$ for each one-unit increase in $x_2$; the slope of $x_1$ is $\beta_1 + \beta_3 x_2$.

**via Main Effects**

Main effects represent the isolated effect of each predictor. The main effect of a predictor $x_j$ is its effect ignoring interactions:

$$
\text{Main effect of } x_j = \frac{\partial \hat{y}_i}{\partial x_j} \text{ when all interaction terms are 0 or baseline.}
$$

- If interaction terms exist, main effects are conditional and should be interpreted at a reference value of the interacting variable(s).  

#### Performance

**Cross-Validation**

Cross-validation estimates the generalization performance of a multiple linear regression model on unseen data.

**k-Fold Cross-Validation**

1. Split the data into $k$ roughly equal folds (subsets): $D_1, D_2, \dots, D_k$.  
2. For each fold $j = 1, \dots, k$:
   - Train the model on the remaining $k-1$ folds: $D_{-j} = D \setminus D_j$  
   - Fit the model to obtain $\hat{\boldsymbol{\beta}}^{(-j)} = (\hat{\beta}_0^{(-j)}, \hat{\beta}_1^{(-j)}, \dots, \hat{\beta}_p^{(-j)})$  
   - Predict on the left-out fold $D_j$:

$$
\hat{y}_i^{(-j)} = \hat{\beta}_0^{(-j)} + \hat{\beta}_1^{(-j)} x_{i1} + \hat{\beta}_2^{(-j)} x_{i2} + \cdots + \hat{\beta}_p^{(-j)} x_{ip}, \quad i \in D_j
$$

3. Compute the prediction error (e.g., Mean Squared Error) for each fold:

$$
MSE_j = \frac{1}{|D_j|} \sum_{i \in D_j} \left(y_i - \hat{y}_i^{(-j)}\right)^2
$$

4. Average over all folds to get cross-validated MSE:

$$
CV_{MSE} = \frac{1}{k} \sum_{j=1}^{k} MSE_j
$$

- *Common choices are $k=5$ or $k=10$*  
- *Leave-One-Out CV (LOOCV) is also a special case: $k=n$, each observation is used as a single test case.*  

$CV_{MSE}$ gives an estimate of the expected prediction error on new/unseen data.  

Lower $CV_{MSE}$ means better generalization performance.  

This method also helps detect overfitting: a model with very low training MSE but high $CV_{MSE}$ is likely overfitting.

**via Residuals**

$$
e_i = y_i - \hat{y}_i
$$

**Where:**
- $y_i$: observed value for observation $i$  
- $\hat{y}_i$: predicted value from the multiple regression model  

$$
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_p x_{ip}
$$  

**Interpretation:**  
- The residual represents the difference between the actual and predicted value for each observation.  
- Small residuals indicate that the model predictions are close to the actual values.  

**via Mean Squared Error (MSE)**

$$
MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{RSS}{n}
$$

**Where:**
- $n$: number of observations  
- $RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2$

**Interpretation:**  
- MSE measures the average squared difference between observed and predicted values.  
- Lower MSE indicates better model fit, but the unit is the square of the dependent variable.  

**via Root Mean Squared Error (RMSE)**

$$
RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
$$

**Interpretation:**  
- RMSE converts MSE back to the original units of the dependent variable and it represents the typical size of the prediction error.  

**via Mean Absolute Error (MAE)**

$$
MAE = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|
$$

**Interpretation:**  
- MAE is the average of the absolute residuals. It is more robust to outliers than MSE/RMSE and gives a straightforward measure of average prediction error.  

**via R-squared ($R^2$)**

$$
R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = 1 - \frac{RSS}{TSS}
$$

**Where:**
- $TSS = \sum_{i=1}^n (y_i - \bar{y})^2$: total sum of squares (total variation of the observed data around its mean)  
- $\bar{y}$: mean of observed values  

**Interpretation:**  
- $R^2$ measures the proportion of variance in the dependent variable explained by the model.  
- Values closer to 1 indicate better fit; 0 means the model explains nothing.  

**via Adjusted R-squared ($\bar{R}^2$)**

$$
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}
$$

**Where:**
- $n$: number of observations  
- $p$: number of predictors (independent variables, excluding intercept)  

**Interpretation:**  
- Adjusted $R^2$ penalizes adding irrelevant predictors to the model: it increases only if a new predictor improves the model more than would be expected by chance.  
- It is preferred over $R^2$ when comparing multiple regression models with different numbers of predictors.  
- Values closer to 1 indicate better fit; unlike $R^2$, adjusted $R^2$ can be negative when the model fits poorly.
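
A small numeric sketch of the penalty follows; the helper name `adjusted_r2` and all input values are made up for illustration. A slightly higher $R^2$ obtained with many more predictors can yield a lower adjusted $R^2$.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

adj_small = adjusted_r2(r2=0.80, n=50, p=3)    # 1 - 0.20 * 49 / 46
adj_large = adjusted_r2(r2=0.81, n=50, p=10)   # 1 - 0.19 * 49 / 39
adj_poor = adjusted_r2(r2=0.05, n=20, p=5)     # negative: weak fit, many predictors
```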

#### Selection

**via Stepwise Regression**

Stepwise regression is a model selection technique used when there are multiple predictors ($x_1, x_2, \dots, x_p$) and we want to find the best subset of predictors for the model.

Stepwise regression iteratively adds or removes predictors based on a selection criterion (e.g., adjusted $R^2$ or AIC).

**Forward Selection:**
1. Start with no predictors: $y_i = \beta_0 + \varepsilon_i$  
2. For each candidate predictor $x_j$ not yet in the model, fit the model with $x_j$ added. At the first step this is:

$$
y_i = \beta_0 + \beta_j x_{ij} + \varepsilon_i
$$

3. Choose the predictor with the best improvement according to the selection criterion.  
4. Repeat by adding one predictor at a time until no further improvement is possible.

**Backward Elimination:**
1. Start with all predictors:  

$$
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i
$$

2. Remove the least significant predictor according to the criterion.  
3. Repeat until all remaining predictors meet the significance threshold.

**Stepwise (Both Directions):**
- Combines forward selection and backward elimination: predictors can be added or removed at each step.

**Selection Criteria**

- **Adjusted $R^2$**:

$$
\text{Adjusted } R^2 = 1 - \frac{(1-R^2)(n-1)}{n-p-1}
$$

- **Akaike Information Criterion (AIC)**:

$$
AIC = n \ln\left(\frac{RSS}{n}\right) + 2(p+1)
$$

Here, $RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ and $p$ is the number of predictors in the model.  
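
Forward selection with the AIC formula above can be sketched as follows. The function names `aic_for` and `forward_select`, the stopping rule (stop when AIC no longer decreases), and the simulated data are assumptions made for illustration.

```python
import numpy as np

def aic_for(X_raw, y, cols):
    """AIC = n ln(RSS/n) + 2(p+1) for the model using the predictor columns in cols."""
    n = len(y)
    X = np.column_stack([np.ones(n)] + [X_raw[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)  # assumes RSS > 0 (no perfect fit)
    return n * np.log(rss / n) + 2 * (len(cols) + 1)

def forward_select(X_raw, y):
    """Greedily add the predictor that lowers AIC the most; stop when none does."""
    remaining = list(range(X_raw.shape[1]))
    selected = []
    best_aic = aic_for(X_raw, y, selected)  # intercept-only model
    while remaining:
        cand_aic, cand = min((aic_for(X_raw, y, selected + [j]), j) for j in remaining)
        if cand_aic >= best_aic:
            break                           # no candidate improves the criterion
        selected.append(cand)
        remaining.remove(cand)
        best_aic = cand_aic
    return selected, best_aic

# Simulated data: only columns 0 and 2 truly affect y
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 5))
y = 2.0 + 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 2] + rng.normal(scale=0.5, size=200)
selected, best_aic = forward_select(X_raw, y)
```

Backward elimination follows the same pattern in reverse, starting from all predictors and removing one at a time.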

#### Prediction

The multiple linear regression model for prediction is:

$$
\hat{y}_{new} = \hat{\beta}_0 + \hat{\beta}_1 x_{new,1} + \hat{\beta}_2 x_{new,2} + \cdots + \hat{\beta}_p x_{new,p}
$$

**Where:**
- $\hat{\beta}_0$: estimated intercept  
- $\hat{\beta}_j$: estimated slope coefficient for predictor $j$  
- $x_{new,j}$: new value of the $j$-th predictor variable  
- $\hat{y}_{new}$: predicted value of the response variable  

To predict a new individual observation at the predictor vector $\mathbf{x}_0 = (1, x_{0,1}, x_{0,2}, \dots, x_{0,p})^\top$, use a **prediction interval (PI)**, which estimates the range in which a new individual response lies for the given predictors:

$$
\hat{y}_0 \; \pm \; t_{\alpha/2, \, n-p-1} \cdot SE_{pred}(\hat{y}_0)
$$

**Where:**
- $\hat{y}_0 = \mathbf{x}_0^\top \hat{\boldsymbol{\beta}}$: predicted mean response at $\mathbf{x}_0$  
- $SE_{pred}(\hat{y}_0) = \sqrt{SE(\hat{y}_0)^2 + \sigma^2}$  
- $\sigma^2$: variance of the error term (residual variance), estimated in practice by $\hat{\sigma}^2 = RSS/(n-p-1)$  

To quantify the uncertainty of the mean response at given predictors $\mathbf{x}_0$, use a **confidence interval (CI)**, which estimates the range in which the mean response lies:

$$
\hat{y}_0 \; \pm \; t_{\alpha/2, \, n-p-1} \cdot SE(\hat{y}_0)
$$

**Where:**
- $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_{0,1} + \cdots + \hat{\beta}_p x_{0,p}$: predicted mean response at $\mathbf{x}_0$  
- $t_{\alpha/2, \, n-p-1}$: critical value from the $t$-distribution with $n-p-1$ degrees of freedom  
- $SE(\hat{y}_0)$: standard error of the predicted mean  

The prediction interval (PI) is always wider than the confidence interval (CI), because PI accounts for both the uncertainty in the mean estimate and the random error of a new observation.  

**Extrapolation** occurs when using a regression model to predict values outside the range of the observed data. This can lead to unreliable predictions, because the model was not trained on such values. Even if the model is linear, the true relationship outside the observed range may differ.

#### Diagnostics

**Outliers**

Observations with unusually large residuals compared to what the model predicts.  
- **Check:**  
  - Standardized residuals ($|r_i| > 2$ or $3$, where $r_i = e_i / (\hat{\sigma}\sqrt{1 - h_{ii}})$ and $h_{ii}$ is the leverage).  
  - Studentized residuals.  
  - In multiple regression, consider **Mahalanobis distance** to detect outliers in multivariate predictor space.

**Influential Points**

Points that disproportionately affect the estimated regression coefficients.  
- **Check:**  
  - Cook’s Distance ($D_i > 1$ is often problematic).  
  - Leverage values ($h_{ii}$, diagonal elements of the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$).  
  - In multiple regression, influential points can be unusual **combinations of predictor values**, not just extreme in a single predictor.
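
Leverage and Cook's distance can both be derived from the hat matrix. The sketch below uses the standard form $D_i = \frac{r_i^2}{p+1} \cdot \frac{h_{ii}}{1 - h_{ii}}$ with $r_i$ the standardized residual; the helper name `leverage_and_cooks` and the data are illustrative assumptions.

```python
import numpy as np

def leverage_and_cooks(X_raw, y):
    """Return leverages h_ii, standardized residuals, and Cook's distances."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.column_stack([np.ones(n), np.asarray(X_raw, dtype=float)])
    n_params = X.shape[1]                        # p + 1
    H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
    h = np.diag(H)                               # leverage values
    resid = y - H @ y                            # e = (I - H) y
    sigma2_hat = np.sum(resid ** 2) / (n - n_params)
    std_resid = resid / np.sqrt(sigma2_hat * (1.0 - h))
    cooks = (std_resid ** 2 / n_params) * h / (1.0 - h)
    return h, std_resid, cooks

# The last point is far from the others in x and off their trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 8.0])
h, std_resid, cooks = leverage_and_cooks(x, y)
```

The leverages sum to $p+1$ (the trace of the hat matrix), which is a quick sanity check on the computation.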

**Heteroskedasticity**

The variance of the residuals is not constant ($Var(\varepsilon_i) \neq \sigma^2$).  
- **Check:**  
  - Residuals vs. Fitted plot (fan or cone shape suggests heteroskedasticity).  
  - Breusch–Pagan test.  
  - White test (does not assume a particular form for the heteroskedasticity).

**Non-normality of Errors**

Residuals are not normally distributed ($\varepsilon_i \sim N(0, \sigma^2)$ assumption violated).  
- **Check:**  
  - Q-Q plot.  
  - Shapiro-Wilk test.  
  - Multiple regression often tolerates mild non-normality if $n$ is large (Central Limit Theorem).

**Correlated Errors**

Residuals are dependent on each other (common in time-series or panel data).  
- **Check:**  
  - Durbin–Watson statistic ($DW \approx 2$ is good; $<1$ or $>3$ indicates problems).  
  - Residuals vs. Time plot.  

**Non-Linearity**

If the true relationship is not linear in the predictors, the model is misspecified.  
- **Check:**  
  - Partial residual plots (component + residual plots) for each predictor.  
  - Ramsey RESET test.  
  - Consider adding polynomial or interaction terms in multiple regression.

#### Fitting

**Overall F-test**

Tests whether the regression model provides a better fit than a model with no predictors (intercept only).

$$
H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0 \quad \text{vs.} \quad H_a: \text{at least one } \beta_j \neq 0
$$

F-statistic:

$$
F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)} = \frac{ESS / p}{RSS / (n-p-1)}
$$

Where:  
- $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares  
- $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual sum of squares  
- $ESS = TSS - RSS$ is the explained sum of squares  
- $p$ is the number of predictors  
- $n$ is the number of observations  

**Interpretation:**  

- Large $F$ (small $p$-value) indicates that the model explains a significant amount of variance in $y$ compared to the null model.

**Coefficient t-tests**

Test whether each individual predictor has a statistically significant effect on the dependent variable.

$$
t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}
$$

Where $SE(\hat{\beta}_j)$ is the standard error of the coefficient $\hat{\beta}_j$.  

- Null hypothesis: $H_0: \beta_j = 0$  
- Alternative hypothesis: $H_a: \beta_j \neq 0$  

**Interpretation:**  

- Large $|t_j|$ (small $p$-value) suggests that predictor $x_j$ significantly contributes to the model.
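
Both test statistics can be sketched from the fitted model, using the standard result $SE(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 [(\mathbf{X}^\top \mathbf{X})^{-1}]_{jj}}$ with $\hat{\sigma}^2 = RSS/(n-p-1)$. The helper name `f_and_t_statistics` and the data are assumptions; p-values are not computed here (they could be obtained from, for example, `scipy.stats.f.sf` and `scipy.stats.t.sf`).

```python
import numpy as np

def f_and_t_statistics(X_raw, y):
    """Return the overall F-statistic and the t-statistics (intercept first)."""
    X_raw = np.asarray(X_raw, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X_raw.shape
    X = np.column_stack([np.ones(n), X_raw])
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    rss = np.sum((y - X @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    sigma2_hat = rss / (n - p - 1)
    f_stat = ((tss - rss) / p) / sigma2_hat        # (ESS / p) / (RSS / (n-p-1))
    se = np.sqrt(sigma2_hat * np.diag(xtx_inv))    # SE of each coefficient
    t_stats = beta_hat / se
    return f_stat, t_stats

x1 = np.arange(1.0, 9.0)
x2 = np.array([1.0, -1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0])
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.1, -0.3, 0.2])
y = 1.0 + 2.0 * x1 + noise                         # x2 has no true effect
X_raw = np.column_stack([x1, x2])
f_stat, t_stats = f_and_t_statistics(X_raw, y)
```

With a single predictor the two tests coincide: the F-statistic equals the square of the slope's t-statistic.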

#### Regularizing

Regularizing is used to prevent overfitting and improve prediction performance when predictors are highly correlated or numerous.

**L2 Regularization (Ridge)**

Adds a penalty proportional to the square of the coefficients.

$$
\hat{\boldsymbol{\beta}}^{ridge} = \arg\min_{\boldsymbol{\beta}} \Bigg\{ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \Bigg\}
$$

Where:  
- $\lambda \ge 0$: regularization strength (a hyperparameter), typically chosen via cross-validation to optimize predictive performance.

**Interpretation:**  

- Shrinks coefficients towards zero, but generally not exactly to zero.
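
Ridge regression has a closed-form solution. Because the intercept is not penalized, one common approach (assumed here) is to center the predictors and the response, solve $(\mathbf{X}_c^\top \mathbf{X}_c + \lambda \mathbf{I})\boldsymbol{\beta} = \mathbf{X}_c^\top \mathbf{y}_c$, and then recover $\beta_0$. The helper name `fit_ridge` and the data are illustrative; in practice predictors are usually standardized first so the penalty treats them equally.

```python
import numpy as np

def fit_ridge(X_raw, y, lam):
    """Ridge coefficients with an unpenalized intercept. Returns (beta0, beta)."""
    X_raw = np.asarray(X_raw, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X_raw.mean(axis=0), y.mean()
    Xc, yc = X_raw - x_mean, y - y_mean          # centering removes the intercept
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 1.0 + 2.0 * X_raw[:, 0] - 0.5 * X_raw[:, 1]
b0_ols, b_ols = fit_ridge(X_raw, y, lam=0.0)      # lambda = 0 reduces to OLS
b0_ridge, b_ridge = fit_ridge(X_raw, y, lam=10.0)  # coefficients are shrunk
```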

**L1 Regularization (Lasso)**

Adds a penalty proportional to the absolute value of the coefficients.

$$
\hat{\boldsymbol{\beta}}^{lasso} = \arg\min_{\boldsymbol{\beta}} \Bigg\{ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p |\beta_j| \Bigg\}
$$

**Interpretation:**  

- It can shrink some coefficients exactly to zero, which amounts to automatic feature selection.
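
The lasso has no closed-form solution in general. One standard approach, assumed here, is cyclic coordinate descent with soft-thresholding: for the objective above, each update is $\beta_j \leftarrow S(\rho_j, \lambda/2) / \sum_i x_{ij}^2$, where $\rho_j$ is the inner product of predictor $j$ with the partial residual and $S(\rho, t) = \text{sign}(\rho)\max(|\rho| - t, 0)$. The function names and data below are illustrative.

```python
import numpy as np

def soft_threshold(rho, t):
    """S(rho, t) = sign(rho) * max(|rho| - t, 0)."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def fit_lasso(X_raw, y, lam, n_iter=200):
    """Lasso via coordinate descent, unpenalized intercept. Returns (beta0, beta)."""
    X_raw = np.asarray(X_raw, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X_raw.mean(axis=0), y.mean()
    Xc, yc = X_raw - x_mean, y - y_mean
    p = Xc.shape[1]
    beta = np.zeros(p)
    col_ss = np.sum(Xc ** 2, axis=0)             # sum of squares of each column
    for _ in range(n_iter):
        for j in range(p):
            # Residual with predictor j's current contribution added back
            partial_resid = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_ss[j]
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

# Orthogonal, centered predictors make the solution easy to verify by hand
X_raw = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
y = 3.0 * X_raw[:, 0] + 0.2 * X_raw[:, 1]
_, beta_lasso = fit_lasso(X_raw, y, lam=4.0)   # rho = (12, 0.8), threshold = 2
_, beta_ols = fit_lasso(X_raw, y, lam=0.0)     # no penalty: OLS coefficients
```

In this example the weak predictor's coefficient is set exactly to zero, while the strong one is shrunk from 3 to 2.5.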

Both L1 and L2 can be combined in **Elastic Net** regularization.

## MODELS FOR CLASSIFICATION

### Binary Logistic Regression Model

#### Model

The binary logistic regression model is:

$$
p_i = P(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}
$$

**Where:**
- $y_i \in \{0,1\}$: binary dependent variable (outcome) for observation $i$  
- $x_i$: independent variable (predictor) for observation $i$  
- $\beta_0$: intercept (baseline log-odds of $y=1$ when $x=0$)  
- $\beta_1$: slope (change in the log-odds of $y=1$ for a one-unit increase in $x$)  
- $p_i$: predicted probability that $y_i=1$ given $x_i$  

Instead of modeling $y_i$ directly, logistic regression models the log-odds (logit):

$$
\text{logit}(p_i) = \ln\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_i
$$

**While:**

- **Known (from the data):** $x_i, y_i$ (observations, with $y_i \in \{0,1\}$).  
- **Unknown but to be estimated:** $\beta_0, \beta_1$ (parameters of the logistic model).  
- **Not directly known but assumed:** The probabilities $p_i$, derived from the logistic function.  

In logistic regression, parameters are estimated by **Maximum Likelihood Estimation (MLE)**.

The likelihood function is:

$$
L(\beta_0, \beta_1) = \prod_{i=1}^n p_i^{y_i} (1-p_i)^{1-y_i}
$$

And the log-likelihood is:

$$
\ell(\beta_0, \beta_1) = \sum_{i=1}^n \Big[ y_i \ln(p_i) + (1-y_i) \ln(1-p_i) \Big]
$$

- Estimators $\hat{\beta}_0, \hat{\beta}_1$ are obtained by maximizing $\ell(\beta_0, \beta_1)$.  
- This gives the “best-fitting logistic curve” that separates the two outcome classes.
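
The log-likelihood has no closed-form maximizer, so it is maximized numerically. The sketch below uses Newton-Raphson, one common choice; the function names, the fixed iteration count, and the data (chosen so the classes overlap and the maximum exists) are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, x, y):
    """ell(beta0, beta1) = sum of y ln(p) + (1 - y) ln(1 - p)."""
    p = sigmoid(beta[0] + beta[1] * x)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_logistic(x, y, n_iter=25):
    """Maximize the log-likelihood by Newton-Raphson. Returns (beta0, beta1)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        gradient = X.T @ (y - p)                       # score vector
        hessian = -(X * (p * (1 - p))[:, None]).T @ X  # second derivatives
        beta = beta - np.linalg.solve(hessian, gradient)
    return beta

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])  # overlapping classes
beta_hat = fit_logistic(x, y)
X_design = np.column_stack([np.ones(len(x)), x])
score_at_hat = X_design.T @ (y - sigmoid(X_design @ beta_hat))  # near zero at the MLE
```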

#### Performance

**Cross-Validation**

Cross-validation estimates the generalization performance of the logistic regression model on unseen data.  

**k-Fold Cross-Validation**

1. Split the data into $k$ roughly equal folds (subsets): $D_1, D_2, \dots, D_k$.  
2. For each fold $j = 1, \dots, k$:
   - Train the model on the remaining $k-1$ folds: $D_{-j} = D \setminus D_j$  
   - Fit the model to obtain $\hat{\beta}_0^{(-j)}, \hat{\beta}_1^{(-j)}$  
   - Predict on the left-out fold $D_j$: $\hat{p}_i^{(-j)} = \frac{1}{1 + e^{-(\hat{\beta}_0^{(-j)} + \hat{\beta}_1^{(-j)} x_i)}}$ for $i \in D_j$  
3. Compute a performance metric (e.g., log loss, accuracy, or AUC) on each held-out fold.  
4. Average the metric over all folds to obtain the cross-validated estimate.  

- *Common choices are $k=5$ or $k=10$*  
- *Leave-One-Out CV (LOOCV) is the special case $k=n$, where each observation is used once as a single test case.*  

Cross-validation gives an estimate of the expected performance on new/unseen data.  

Lower log loss or higher accuracy indicates better generalization performance.  

This method also helps detect overfitting: a model with very high training accuracy but much lower cross-validation accuracy is overfitting the training data.

**via Accuracy**

Proportion of all predictions that are correct.

$$
\text{Accuracy} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(\hat{y}_i = y_i)
$$  

Where $\hat{y}_i = 1$ if $\hat{p}_i \geq 0.5$, else $0$.   

**Interpretation:**

- Higher percentage means the model correctly classifies more observations.

**via Precision**

Proportion of predicted positives that are actually positive.  

$$
\text{Precision} = \frac{TP}{TP + FP}
$$  

**Where:**  
- $TP$: True Positives (correctly predicted positives)  
- $FP$: False Positives (incorrectly predicted positives)  

**Interpretation:**

- Higher percentage means the model makes fewer false positive errors.

**via Recall (Sensitivity)**

Proportion of actual positives that are correctly identified.

$$
\text{Recall} = \frac{TP}{TP + FN}
$$  

**Where:**  
- $FN$: False Negatives (missed positives)  

**Interpretation:**

- Higher percentage means the model detects more of the positive cases.

**via Specificity**

Proportion of actual negatives that are correctly identified.   

$$
\text{Specificity} = \frac{TN}{TN + FP}
$$  

**Where:**  
- $TN$: True Negatives (correctly predicted negatives)  
- $FP$: False Positives (incorrectly predicted positives)  

**Interpretation:**

- Higher percentage means the model correctly rejects more negative cases, avoiding false alarms.    

**via F1-score**

Harmonic mean of Precision and Recall.   

$$
F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$  

**Interpretation:**

- F1 balances false positives and false negatives; a higher value indicates better balanced performance.  
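
The threshold-based metrics above all derive from the four confusion-matrix counts. A minimal sketch, with a hypothetical helper `classification_metrics` and made-up labels and probabilities:

```python
import numpy as np

def classification_metrics(y_true, p_hat, threshold=0.5):
    """Accuracy, precision, recall, specificity and F1 at a given threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(p_hat) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    # Guard against empty denominators (e.g. no predicted positives)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    f1 = 2 * precision * recall / (precision + recall) if tp else float("nan")
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
p_hat = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7])
m = classification_metrics(y_true, p_hat)
# Counts at threshold 0.5: TP = 3, FN = 1, FP = 1, TN = 3
```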

**via ROC and AUC**

- The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate ($TPR = \text{Recall}$) against the False Positive Rate ($FPR = 1 - \text{Specificity}$) across classification thresholds. The AUC (Area Under the Curve) summarizes the model's ability to discriminate between classes.  

**Interpretation:**  

- AUC = 0.5 means random guessing and AUC closer to 1 means better discrimination.  
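
AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below computes it by comparing all positive-negative pairs; the helper name `roc_auc` and the data are illustrative, and this pairwise approach is only practical for small samples.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive score minus every negative
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
p_hat = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7])
auc = roc_auc(y_true, p_hat)  # 15 of the 16 pairs are ordered correctly
```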

**via Log Loss (Cross-Entropy Loss)**  

Penalizes confident but wrong predictions.

$$
\text{LogLoss} = -\frac{1}{n} \sum_{i=1}^n \Big[ y_i \ln(\hat{p}_i) + (1-y_i) \ln(1-\hat{p}_i) \Big]
$$  

**Interpretation:**   

- Lower log loss is better; confident incorrect predictions are penalized heavily.  
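
The penalty on confident mistakes is easy to see numerically. In this sketch the helper name `log_loss`, the clipping constant `eps` (used to avoid $\ln 0$), and the example probabilities are assumptions.

```python
import numpy as np

def log_loss(y_true, p_hat, eps=1e-15):
    """Mean negative log-likelihood of the labels under the predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_hat, dtype=float), eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

confident_right = log_loss([1, 0], [0.99, 0.01])  # -ln(0.99)
hedged = log_loss([1, 0], [0.6, 0.4])             # -ln(0.6)
confident_wrong = log_loss([1, 0], [0.01, 0.99])  # -ln(0.01)
```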

#### Prediction

Binary logistic regression model for prediction is:

$$
\hat{p}_{new} = \frac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 x_{new})}}
$$

**Where:**
- $\hat{\beta}_0$: estimated intercept  
- $\hat{\beta}_1$: estimated slope coefficient  
- $x_{new}$: new value of the predictor variable  
- $\hat{p}_{new}$: predicted probability that $y=1$ for the new observation

To estimate the uncertainty of the predicted probability $\hat{p}_0$ at $x_0$, a **confidence interval** can be constructed using the standard error of the logit:

1. Compute the logit of the predicted probability:

$$
\hat{\text{logit}}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0
$$

2. Construct a CI for the logit:

$$
\hat{\text{logit}}_0 \pm z_{\alpha/2} \cdot SE(\hat{\text{logit}}_0)
$$

where

$$
SE(\hat{\text{logit}}_0) = \sqrt{\mathbf{x}_0^\top \, \text{Cov}(\hat{\beta}) \, \mathbf{x}_0}
$$

**Where:**
- $SE(\hat{\text{logit}}_0)$: standard error of the predicted logit  
- $\mathbf{x}_0 = \begin{bmatrix} 1 \\ x_0 \end{bmatrix}$: predictor vector including the intercept term  
- $\text{Cov}(\hat{\beta})$: estimated covariance matrix of the coefficients  
- $z_{\alpha/2}$: critical value from the standard normal distribution (e.g., 1.96 for a 95% CI)  

3. Transform both endpoints of the logit interval back to the probability scale using the logistic function (shown here for the point estimate):

$$
\hat{p}_0 = \frac{1}{1 + e^{-\hat{\text{logit}}_0}}
$$

The CI gives a range in which the true probability for a given $x_0$ is likely to lie.  
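
The three steps can be sketched as follows, assuming the coefficient estimates and their 2x2 covariance matrix are already available from a fitted model. All numeric values and the function name `predicted_probability_ci` are hypothetical.

```python
import math


def predicted_probability_ci(b0, b1, cov, x0, z=1.96):
    """Return (p_hat, lower, upper) for a single-predictor logistic model.

    cov is the 2x2 covariance matrix of (b0, b1):
    [[var_b0, cov_b0_b1], [cov_b0_b1, var_b1]].
    """
    def expit(t):
        return 1.0 / (1.0 + math.exp(-t))

    logit = b0 + b1 * x0
    # Quadratic form x^T Cov x with x = (1, x0), written out term by term.
    var_logit = cov[0][0] + 2 * x0 * cov[0][1] + x0 ** 2 * cov[1][1]
    se = math.sqrt(var_logit)
    # The logistic function is monotone, so the endpoints map directly.
    return expit(logit), expit(logit - z * se), expit(logit + z * se)


# Hypothetical estimates (assumption, not from a real fit).
p, lo, hi = predicted_probability_ci(
    b0=-1.0, b1=0.5, cov=[[0.04, -0.01], [-0.01, 0.01]], x0=2.0
)
print(p, lo, hi)
```

Building the interval on the logit scale and transforming afterwards keeps both endpoints inside $(0, 1)$, which an interval built directly on the probability scale would not guarantee.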

Unlike linear regression, there is no standard **prediction interval** for a single observation, because the response is binary.  

**Extrapolation** occurs when $x_{new}$ is outside the range of observed $x_i$; predicted probabilities may be unreliable.

#### Lifting

Lift measures the effectiveness of a predictive model compared to random guessing, especially for rare classes such as fraud cases in fraud detection.    

$$
\text{Lift} = \frac{\text{Proportion of positives in model-selected group}}{\text{Proportion of positives in entire population}}
$$

**Interpretation:**

- Lift > 1 means model is better than random at identifying positives.  
- Lift = 1 means model is no better than random.  
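
A minimal sketch of the lift calculation, assuming the model-selected group is the top fraction of observations ranked by predicted score. The fraction, the data, and the name `lift_at_fraction` are hypothetical.

```python
def lift_at_fraction(y_true, scores, fraction=0.1):
    """Lift of the top-scoring fraction of observations over the base rate."""
    n = len(y_true)
    n_selected = max(1, int(round(n * fraction)))
    # Rank observations from highest to lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda t: t[0], reverse=True)
    selected = [y for _, y in ranked[:n_selected]]
    rate_selected = sum(selected) / n_selected
    rate_overall = sum(y_true) / n
    return rate_selected / rate_overall


# Toy data with a 20% positive rate (assumption, for illustration only).
y = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
lift = lift_at_fraction(y, s, fraction=0.2)
print(lift)
```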

When the positive class is rare, standard models may underperform. Techniques such as oversampling, undersampling, and weighting can improve the model's ability to identify the rare class, which improves the lift metrics.

1. **Oversampling (Synthetic or Duplicate Sampling)**

Increase the number of minority class samples:

$$
n_{minority}^{new} > n_{minority}^{original}
$$

With synthetic generation (SMOTE), for each minority sample $x_i$, a new sample is generated along the line segment connecting $x_i$ to one of its $k$ nearest minority-class neighbors $x_{nn}$:

$$
x_{new} = x_i + \lambda \cdot (x_{nn} - x_i), \quad \lambda \sim U(0,1)
$$

The model sees more minority examples and can learn the patterns of the rare class, which tends to increase the lift.
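
The interpolation step alone can be sketched as below. The nearest-neighbor search is deliberately omitted and the neighbor is passed in directly, so this is not a full SMOTE implementation; `smote_point` and the coordinates are hypothetical.

```python
import random


def smote_point(x_i, x_nn, rng):
    """Create one synthetic point on the segment between x_i and x_nn."""
    lam = rng.random()  # lambda drawn from U(0, 1)
    return [a + lam * (b - a) for a, b in zip(x_i, x_nn)]


# Two hypothetical minority-class points in a 2-D feature space.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
x_new = smote_point([1.0, 2.0], [3.0, 6.0], rng)
print(x_new)
```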

2. **Undersampling (Majority Reduction)**

Reduce the number of majority class samples:

$$
n_{majority}^{new} < n_{majority}^{original}
$$

Randomly select a subset of majority samples to balance classes:

$$
n_{majority}^{new} \approx n_{minority}^{original} \quad \text{or some ratio } r
$$

Prevents model from being biased toward majority class, improving rare class detection.

3. **Up-Down Weighting (Class Weighting)**

Assign higher weight $w_i$ to minority class samples during training:

$$
w_i =
\begin{cases}
\frac{n}{2 \cdot n_{minority}} & \text{if } y_i = 1 \\
\frac{n}{2 \cdot n_{majority}} & \text{if } y_i = 0
\end{cases}
$$

Loss function becomes weighted:

$$
\text{Weighted Loss} = \sum_{i=1}^{n} w_i \cdot \ell(y_i, \hat{p}_i)
$$

The model penalizes misclassification of the minority class more heavily, which improves detection of that class.
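
The weight formula above can be sketched for any number of classes as $n / (\text{number of classes} \cdot n_{class})$. The helper name `balanced_class_weights` and the 90/10 class split are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight for each class: n / (number_of_classes * class_count)."""
    n = len(labels)
    counts = Counter(labels)
    k = len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}


# Hypothetical imbalanced sample: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights each class contributes the same total weight to the loss, and the weights sum to $n$ over all observations.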

#### Fitting

**1. Hosmer-Lemeshow Test**

The Hosmer-Lemeshow test evaluates how well the predicted probabilities from a binary logistic regression model fit the observed outcomes.

1. Sort the predicted probabilities and divide them into $g$ groups (commonly $g=10$ deciles).  
2. For each group $j$, calculate the observed ($O_j$) and expected ($E_j$) number of events (positives).  
3. Compute the test statistic:

$$
C = \sum_{j=1}^{g} \frac{(O_j - E_j)^2}{E_j (1 - E_j / n_j)}
$$

- $n_j$: number of observations in group $j$  
- Under the null hypothesis of good fit, $C$ approximately follows a $\chi^2$ distribution with $g-2$ degrees of freedom.  

**Interpretation:**  
- $p > 0.05$: no evidence of poor fit (the predictions are consistent with the observed outcomes)  
- $p < 0.05$: evidence of poor fit  
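
A sketch of the test statistic only, assuming equal-sized groups formed from the sorted predicted probabilities. The p-value is left out because it needs a chi-square distribution function that the Python standard library does not provide; `hosmer_lemeshow_statistic` is a hypothetical name.

```python
def hosmer_lemeshow_statistic(y_true, p_hat, g=10):
    """Hosmer-Lemeshow statistic C over g groups of sorted predictions."""
    pairs = sorted(zip(p_hat, y_true))
    n = len(pairs)
    stat = 0.0
    for j in range(g):
        group = pairs[j * n // g:(j + 1) * n // g]
        n_j = len(group)
        if n_j == 0:
            continue  # can happen when n < g
        o_j = sum(y for _, y in group)   # observed events in group j
        e_j = sum(p for p, _ in group)   # expected events in group j
        # Assumes 0 < e_j < n_j; otherwise the denominator is zero.
        stat += (o_j - e_j) ** 2 / (e_j * (1 - e_j / n_j))
    return stat


# Tiny hypothetical example with g = 2 groups.
stat = hosmer_lemeshow_statistic([0, 1, 1, 1], [0.2, 0.2, 0.8, 0.8], g=2)
print(stat)
```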


**2. Deviance / Likelihood Ratio Test**

Compares the fitted model to a baseline (intercept-only) model to assess improvement in fit.

$$
G = -2 \big( \ell_0 - \ell_1 \big)
$$

- $\ell_0$: log-likelihood of the intercept-only model  
- $\ell_1$: log-likelihood of the fitted model  
- $G \sim \chi^2_{df}$ where $df = \text{number of predictors}$  

**Interpretation:**  
- Large $G$ and small $p$-value indicate that the fitted model significantly improves fit over the null model.

**3. Residual Analysis**

Residuals are used to detect lack-of-fit or outliers.

**Deviance residuals:**

$$
r_i^{(D)} = \text{sign}(y_i - \hat{p}_i) \sqrt{-2 \Big[ y_i \ln(\hat{p}_i) + (1-y_i)\ln(1-\hat{p}_i) \Big]}
$$

**Pearson residuals:**

$$
r_i^{(P)} = \frac{y_i - \hat{p}_i}{\sqrt{\hat{p}_i (1 - \hat{p}_i)}}
$$

- Large absolute residuals indicate potential outliers or observations poorly fitted by the model.  
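
Both residuals can be computed per observation as in the sketch below, which assumes $0 < \hat{p}_i < 1$. The function names and the example values are hypothetical.

```python
import math


def pearson_residual(y, p):
    """Raw residual scaled by the binomial standard deviation."""
    return (y - p) / math.sqrt(p * (1 - p))


def deviance_residual(y, p):
    """Signed square root of the observation's deviance contribution."""
    log_lik = y * math.log(p) + (1 - y) * math.log(1 - p)
    return math.copysign(math.sqrt(-2 * log_lik), y - p)


# An event (y = 1) that the model considered unlikely (p = 0.2).
r_p = pearson_residual(1, 0.2)
r_d = deviance_residual(1, 0.2)
print(r_p, r_d)
```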

#### Regularizing

Regularizing helps prevent overfitting by penalizing large coefficient values in binary logistic regression.

**L2 Regularization (Ridge)**

Adds a penalty proportional to the square of the coefficients.

$$
\text{Loss}_{\text{Ridge}} = - \ell(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2
$$

**Where:**
- $\ell(\beta)$: log-likelihood of the binary logistic regression model  
- $\beta_j$: coefficient of predictor $j$  
- $p$: number of predictors  
- $\lambda \ge 0$: regularization strength (hyperparameter), chosen via cross-validation to optimize predictive performance.

**Interpretation:**  
- Shrinks coefficients towards zero.

**L1 Regularization (Lasso)**

Adds a penalty proportional to the absolute value of the coefficients.

$$
\text{Loss}_{\text{Lasso}} = - \ell(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|
$$

**Interpretation:**  
- It can shrink some coefficients exactly to zero, which performs automatic feature selection.

Both L1 and L2 can be combined in **Elastic Net** regularization.  

### Multinomial Logistic Regression Model

#### Model

Multinomial logistic regression model is:

$$
p_{ik} = P(y_i = k \mid x_i) = \frac{\exp(\beta_{0k} + \beta_{1k} x_i)}{\sum_{j=1}^K \exp(\beta_{0j} + \beta_{1j} x_i)},
\quad k = 1,2,\dots,K
$$

- One category (say $K$) is chosen as the baseline/reference category and for this baseline, parameters are set to $\beta_{0K} = 0, \; \beta_{1K} = 0$.  

**Where:**
- $y_i \in \{1,2,\dots,K\}$: multinomial dependent variable (outcome) for observation $i$  
- $x_i$: independent variable (predictor) for observation $i$  
- $\beta_{0k}$: intercept for category $k$ relative to the baseline  
- $\beta_{1k}$: slope for category $k$ relative to the baseline  
- $p_{ik}$: predicted probability that $y_i = k$ given $x_i$  

Also for multiple predictors, the multinomial logistic regression model generalizes to:

$$
P(y_i = k \mid \mathbf{x}_i) = \frac{\exp(\theta_k + \mathbf{x}_i^\top \boldsymbol{\beta}_k)}{\sum_{j=1}^{K} \exp(\theta_j + \mathbf{x}_i^\top \boldsymbol{\beta}_j)}, \quad k = 1, \dots, K
$$

where:

- $\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{ip})^\top$ is a vector of $p$ predictors for observation $i$  
- $\boldsymbol{\beta}_k = (\beta_{k1}, \beta_{k2}, \dots, \beta_{kp})^\top$ is the coefficient vector for class $k$  
- $\theta_k$ is the intercept for class $k$  
- $K$ is the total number of outcome categories  

This formulation explicitly accounts for multiple predictors and allows each category to have its own set of coefficients.

Instead of modeling $y_i$ directly, logistic regression models the log-odds (logit) and the model estimates $K-1$ log-odds equations relative to the baseline:

$$
\ln\left(\frac{P(y_i = k \mid x_i)}{P(y_i = K \mid x_i)}\right) = \beta_{0k} + \beta_{1k} x_i,
\quad k = 1,2,\dots,K-1
$$
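
The baseline parameterization can be sketched for a single predictor as follows: the baseline class contributes a linear predictor of zero, and the log-odds against the baseline recover the linear terms. The coefficient values and the name `multinomial_probs` are hypothetical.

```python
import math


def multinomial_probs(x, intercepts, slopes):
    """Class probabilities with the last class as baseline (coefficients 0).

    intercepts and slopes hold the K-1 non-baseline classes.
    """
    etas = [b0 + b1 * x for b0, b1 in zip(intercepts, slopes)] + [0.0]
    m = max(etas)  # subtract the maximum for numerical stability
    exps = [math.exp(e - m) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical coefficients for K = 3 classes.
probs = multinomial_probs(x=1.0, intercepts=[0.5, -0.2], slopes=[1.0, 0.3])
print(probs)
```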

**While:**

- **Known (from the data):** $x_i, y_i$ (observations, with $y_i \in \{1,2,\dots,K\}$).  
- **Unknown but to be estimated:** $\beta_{0k}, \beta_{1k}$ for $k=1,\dots,K-1$ (parameters of the multinomial logistic model).  
- **Not directly known but assumed:** The probabilities $p_{ik}$, derived from the multinomial logistic function.  

In multinomial logistic regression, parameters are estimated by **Maximum Likelihood Estimation (MLE)**.

The likelihood function is:

$$
L(\beta) = \prod_{i=1}^n \prod_{k=1}^K p_{ik}^{I(y_i = k)}
$$

where $I(y_i=k)$ is an indicator function that equals $1$ if observation $i$ belongs to category $k$, and $0$ otherwise.

and the log-likelihood is:

$$
\ell(\beta) = \sum_{i=1}^n \sum_{k=1}^K I(y_i = k) \ln(p_{ik})
$$

- Estimators $\hat{\beta}_{0k}, \hat{\beta}_{1k}$ (for $k = 1,\dots,K-1$) are obtained by maximizing $\ell(\beta)$.  
- This gives the “best-fitting set of multinomial logistic functions” that separates the $K$ outcome classes.

#### Performance

**Cross-Validation**

Cross-validation is used to estimate the generalization performance of the multinomial logistic regression model on unseen data.  

**k-Fold Cross-Validation**

1. Split the data into $k$ roughly equal folds (subsets): $D_1, D_2, \dots, D_k$.  
2. For each fold $j = 1, \dots, k$:
   - Train the model on the remaining $k-1$ folds: $D_{-j} = D \setminus D_j$  
   - Fit the model to obtain $\hat{\beta}_{0k}^{(-j)}, \hat{\beta}_{1k}^{(-j)}$ for $k=1,\dots,K-1$  
   - Predict on the left-out fold $D_j$:  
$$
\hat{p}_{ik}^{(-j)} =
\frac{\exp(\hat{\beta}_{0k}^{(-j)} + \hat{\beta}_{1k}^{(-j)} x_i)}
{\sum_{m=1}^K \exp(\hat{\beta}_{0m}^{(-j)} + \hat{\beta}_{1m}^{(-j)} x_i)},
\quad i \in D_j
$$

3. Compute the prediction error using a multiclass classification loss (e.g., Multinomial Log Loss, Accuracy, Macro-F1) for each fold.  

- *Common choices are $k=5$ or $k=10$*  
- *Leave-One-Out CV (LOOCV) is the special case $k=n$, in which each observation is used once as a single test case.*  

Cross-validation gives an estimate of the expected performance on new/unseen data.  

Lower multinomial log loss or higher accuracy/F1 indicates better generalization performance.  

This method also helps detect overfitting: a model with very high training accuracy but much lower cross-validation accuracy is overfitting the training data.
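
The splitting step of k-fold cross-validation can be sketched as below. Model fitting and scoring are left out, and the generator name `k_fold_indices` and the seed are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # shuffle once, reproducibly
    folds = [idx[j::k] for j in range(k)]     # k roughly equal folds
    for j in range(k):
        test = folds[j]
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train, test


splits = list(k_fold_indices(n=10, k=5))
for train, test in splits:
    print(sorted(test))
```

Each observation appears in exactly one test fold, so every prediction used for scoring comes from a model that did not see that observation during training.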

**via Accuracy**

**Per-Class:**  
Accuracy calculated for each class $k$ individually.

$$
\text{Accuracy}_k = \frac{TP_k + TN_k}{TP_k + TN_k + FP_k + FN_k}, \quad k = 1,\dots,K
$$  

**Where:**  
- $TP_k$: true positives for class $k$  
- $TN_k$: true negatives for class $k$  
- $FP_k$: false positives for class $k$  
- $FN_k$: false negatives for class $k$  

**Micro:**  
Proportion of all correctly classified observations.

$$
\text{Micro-Accuracy} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(\hat{y}_i = y_i)
$$  

**Where:**  
- $n$: total number of observations  
- $\hat{y}_i = \arg\max_{k \in \{1,\dots,K\}} \hat{p}_{ik}$ (predicted class)  
- $y_i$: true class  

**Macro:**  
Average of per-class accuracies.

$$
\text{Macro-Accuracy} = \frac{1}{K} \sum_{k=1}^K \text{Accuracy}_k
$$  

**Where:**  
- $K$: number of classes  

**Interpretation:**  
- Higher percentage means the model correctly classifies more observations.  

**via Precision**

**Per-Class:**

For each class $k$ individually.

$$
\text{Precision}_k = \frac{TP_k}{TP_k + FP_k}, \quad k=1,\dots,K
$$  

**Macro:**

Average over all classes.  

$$
\text{Macro-Precision} = \frac{1}{K} \sum_{k=1}^K \text{Precision}_k
$$  

**Micro:**

Global TP/FP across all classes.  

$$
\text{Micro-Precision} = \frac{\sum_{k=1}^K TP_k}{\sum_{k=1}^K (TP_k + FP_k)}
$$  

**Interpretation:**  
- Higher percentage means that more of the observations predicted as class $k$ truly belong to class $k$ (fewer false positives).  

**via Recall (Sensitivity)**

**Per-Class:**  
For each class $k$ individually.

$$
\text{Recall}_k = \frac{TP_k}{TP_k + FN_k}, \quad k=1,\dots,K
$$  

**Macro:**  
Average over all classes.

$$
\text{Macro-Recall} = \frac{1}{K} \sum_{k=1}^K \text{Recall}_k
$$  

**Micro:**  
Global TP/FN across all classes.

$$
\text{Micro-Recall} = \frac{\sum_{k=1}^K TP_k}{\sum_{k=1}^K (TP_k + FN_k)}
$$  

**Interpretation:**  
- Higher percentage means the model detects more of the positive cases.

**via Specificity**

**Per-Class:**  
For each class $k$ individually.

$$
\text{Specificity}_k = \frac{TN_k}{TN_k + FP_k}, \quad k=1,\dots,K
$$  

**Where:**  
- $TN_k$: True Negatives for class $k$  
- $FP_k$: False Positives for class $k$  

**Macro:**  
Average over all classes.

$$
\text{Macro-Specificity} = \frac{1}{K} \sum_{k=1}^K \text{Specificity}_k
$$  

**Where:**  
- $K$: number of classes  

**Micro:**  
Global TN/FP across all classes.

$$
\text{Micro-Specificity} = \frac{\sum_{k=1}^K TN_k}{\sum_{k=1}^K (TN_k + FP_k)}
$$  

**Where:**  
- $TN_k$: True Negatives for class $k$  
- $FP_k$: False Positives for class $k$  

**Interpretation:**  
- Higher percentage means the model correctly rejects more negative cases, avoiding false alarms.

**via F1-score**

**Per-Class:**  
For each class $k$ individually.

$$
F1_k = \frac{2 \cdot \text{Precision}_k \cdot \text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}, \quad k=1,\dots,K
$$  

**Where:**  
- $\text{Precision}_k = \frac{TP_k}{TP_k + FP_k}$  
- $\text{Recall}_k = \frac{TP_k}{TP_k + FN_k}$  
- $TP_k$: True Positives for class $k$  
- $FP_k$: False Positives for class $k$  
- $FN_k$: False Negatives for class $k$  

**Macro:**  
Average over all classes.

$$
\text{Macro-F1} = \frac{1}{K} \sum_{k=1}^K F1_k
$$  

**Where:**  
- $K$: number of classes  

**Micro:**  
Global TP/FP/FN across all classes.

$$
\text{Micro-F1} = \frac{2 \cdot \sum_{k=1}^K TP_k}{2 \cdot \sum_{k=1}^K TP_k + \sum_{k=1}^K (FP_k + FN_k)}
$$  

**Where:**  
- $TP_k$: True Positives for class $k$  
- $FP_k$: False Positives for class $k$  
- $FN_k$: False Negatives for class $k$  

**Interpretation:**  
- Balances false positives and false negatives; a higher value indicates better-balanced performance.
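
Macro and micro averaging can be compared on a small confusion matrix. The sketch assumes rows are true classes and columns are predicted classes, and uses the equivalent count form $F1_k = 2TP_k / (2TP_k + FP_k + FN_k)$; the matrix and function names are hypothetical.

```python
def per_class_counts(conf):
    """Return (tp, fp, fn) per class; conf[i][j] = true class i, predicted j."""
    k = len(conf)
    counts = []
    for c in range(k):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                          # rest of row c
        fp = sum(conf[r][c] for r in range(k)) - tp     # rest of column c
        counts.append((tp, fp, fn))
    return counts


def macro_micro_f1(conf):
    counts = per_class_counts(conf)
    # Per-class F1; assumes each class has at least one TP, FP or FN.
    f1s = [2 * tp / (2 * tp + fp + fn) for tp, fp, fn in counts]
    macro = sum(f1s) / len(f1s)
    tp_sum = sum(tp for tp, _, _ in counts)
    fp_sum = sum(fp for _, fp, _ in counts)
    fn_sum = sum(fn for _, _, fn in counts)
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
    return macro, micro


# Hypothetical 3-class confusion matrix.
conf = [[5, 1, 0],
        [2, 3, 1],
        [0, 1, 7]]
macro, micro = macro_micro_f1(conf)
print(macro, micro)
```

When every observation has exactly one true and one predicted class, micro-F1 equals overall accuracy, whereas macro-F1 weights each class equally and is therefore more sensitive to poorly predicted small classes.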

**via ROC To AUC**

For each class $k$:  

- True Positive Rate (TPR) = $\frac{TP_k}{TP_k + FN_k}$  
- False Positive Rate (FPR) = $\frac{FP_k}{FP_k + TN_k}$  

- Compute AUC$_k$ for each class  
- **Macro-AUC:** average over classes  
- **Micro-AUC:** global TPR/FPR across all classes  

**Interpretation:**  
- AUC closer to 1 = better discrimination between class $k$ vs rest.

**via Log Loss (Cross-Entropy Loss)**

Penalizes confident but wrong predictions.

Multinomial cross-entropy:

$$
\text{LogLoss} = -\frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K \mathbf{1}(y_i = k) \ln(\hat{p}_{ik})
$$  

**Interpretation:**  
- Higher log loss means worse predictions; heavily penalizes confident incorrect predictions.

**via Cohen's Kappa**

Cohen's Kappa measures the agreement between predicted and true classes, adjusted for chance agreement.

$$
\kappa = \frac{p_o - p_e}{1 - p_e}
$$

**Where:**  
- $p_o = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(\hat{y}_i = y_i)$ : observed agreement (accuracy)  
- $p_e = \sum_{k=1}^{K} p_{k}^{\text{pred}} \cdot p_{k}^{\text{true}}$ : expected agreement by chance  
  - $p_{k}^{\text{pred}}$: proportion of predictions in class $k$  
  - $p_{k}^{\text{true}}$: proportion of true instances in class $k$  

**Interpretation:**  
- $\kappa = 1$: perfect agreement  
- $\kappa = 0$: agreement equivalent to chance  
- $\kappa < 0$: worse than chance  
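
A minimal sketch of the kappa formula from class labels; `cohens_kappa` and the toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Agreement between true and predicted labels, corrected for chance."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal proportions, summed by class.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    return (p_o - p_e) / (1 - p_e)


# Toy labels for three classes (assumption, for illustration only).
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
kappa = cohens_kappa(y_true, y_pred)
print(kappa)
```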

**via Top-K Accuracy**

Top-K Accuracy evaluates whether the true class is among the classes with the highest predicted probabilities. The cutoff is written $m$ below, with $1 \le m < K$, to keep it distinct from the number of classes $K$.

$$
\text{Top-}m\text{ Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\Big(y_i \in \text{Top}_m(\hat{\mathbf{p}}_i)\Big)
$$

**Where:**  
- $y_i$: true class for observation $i$  
- $\hat{\mathbf{p}}_i = (\hat{p}_{i1}, \dots, \hat{p}_{iK})$: predicted probability vector for observation $i$  
- $\text{Top}_m(\hat{\mathbf{p}}_i)$: set of classes corresponding to the $m$ highest predicted probabilities for observation $i$  

**Interpretation:**  
- Top-1 Accuracy = standard accuracy  
- Top-$m$ Accuracy is never lower than Top-1 Accuracy; it credits the model when the true class is ranked near the top even if it is not the first choice
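
A sketch of the calculation, where the number of top-ranked classes considered is called `m` in the code (an assumed name) and classes are indexed from 0. The probability rows are hypothetical.

```python
def top_m_accuracy(y_true, prob_rows, m):
    """Share of observations whose true class is among the m highest-scored."""
    hits = 0
    for y, probs in zip(y_true, prob_rows):
        # Class indices ordered from highest to lowest predicted probability.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if y in ranked[:m]:
            hits += 1
    return hits / len(y_true)


# Three observations, three classes (assumption, for illustration only).
y_true = [0, 1, 2]
prob_rows = [[0.6, 0.3, 0.1],
             [0.5, 0.4, 0.1],
             [0.3, 0.5, 0.2]]
top1 = top_m_accuracy(y_true, prob_rows, m=1)
top2 = top_m_accuracy(y_true, prob_rows, m=2)
print(top1, top2)
```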

#### Prediction

Multinomial logistic regression model for prediction is:

$$
\hat{p}_{new,k} = \frac{\exp(\hat{\beta}_{0k} + \hat{\beta}_{1k} x_{new})}{\sum_{j=1}^K \exp(\hat{\beta}_{0j} + \hat{\beta}_{1j} x_{new})},
\quad k = 1,2,\dots,K
$$

**Where:**  
- $\hat{\beta}_{0k}$: estimated intercept for class $k$  
- $\hat{\beta}_{1k}$: estimated slope coefficient for class $k$  
- $x_{new}$: new value of the predictor variable  
- $\hat{p}_{new,k}$: predicted probability that $y=k$ for the new observation  

To estimate the uncertainty of the predicted probability $\hat{p}_{new,k}$ at $x_{new}$, a **confidence interval** can be constructed using the standard error of the logit (log-odds) for each class relative to the baseline:

1. Compute the logit of the predicted probability relative to the baseline class $K$:

$$
\hat{\text{logit}}_{new,k} = \hat{\beta}_{0k} + \hat{\beta}_{1k} x_{new}, \quad k = 1, \dots, K-1
$$

2. Construct a CI for the logit:

$$
\hat{\text{logit}}_{new,k} \pm z_{\alpha/2} \cdot SE(\hat{\text{logit}}_{new,k})
$$

where

$$
SE(\hat{\text{logit}}_{new,k}) = \sqrt{x_{new}^\top \, \text{Cov}(\hat{\beta}_k) \, x_{new}}, \quad k = 1, \dots, K-1
$$

**Where:**
- $SE(\hat{\text{logit}}_{new,k})$: standard error of the predicted logit for class $k$  
- $z_{\alpha/2}$: critical value from the standard normal distribution (e.g., 1.96 for 95% CI)  
- $x_{new}$: predictor vector for the new observation (including intercept)  
- $\text{Cov}(\hat{\beta}_k)$: estimated covariance matrix of the coefficients for category $k$  
- $K$: total number of categories; category $K$ is chosen as the baseline  

3. Transform back to probability scale using the softmax function:

$$
\hat{p}_{\text{new},k} =
\frac{
\exp\big(\hat{\beta}_{0k} + \hat{\beta}_{1k} x_{\text{new}}\big)
}{
\sum_{j=1}^{K} \exp\big(\hat{\beta}_{0j} + \hat{\beta}_{1j} x_{\text{new}}\big)
}, \quad k = 1, \dots, K
$$


CI gives a range in which the true probability for each class is likely to lie.  

Unlike linear regression, there is no standard **prediction interval** for a single observation, because the response is categorical.  

**Extrapolation** occurs when $x_{new}$ is outside the range of observed $x_i$; predicted probabilities may be unreliable.

#### Lifting

Lift measures the effectiveness of a predictive model compared to random guessing, especially for rare classes.  

**Per-Class Lift:**

For class $k$:

$$
\text{Lift}_k = \frac{\text{Proportion of class } k \text{ in model-selected group}}{\text{Proportion of class } k \text{ in entire population}}
$$

**Where:**  
- Model-selected group: top decile or bin of observations with highest predicted probability $\hat{p}_{ik}$ for class $k$  
- Entire population: all $n$ observations  
- $k = 1, 2, \dots, K$  

**Interpretation:**  

- $\text{Lift}_k > 1$ means model is better than random at identifying class $k$.  
- $\text{Lift}_k = 1$ means model performs no better than random.  

**Macro-Lift:** Average lift across all classes.

$$
\text{Macro-Lift} = \frac{1}{K} \sum_{k=1}^K \text{Lift}_k
$$

**Micro-Lift:** Weighted lift across all classes based on class frequencies.

$$
\text{Micro-Lift} = \frac{\sum_{k=1}^K n_k \cdot \text{Lift}_k}{\sum_{k=1}^K n_k}
$$

When a class is rare, standard models may underperform. Techniques such as oversampling, undersampling, and weighting can improve the model's ability to identify the rare class, which improves the lift metrics.

1. **Oversampling (Synthetic or Duplicate Sampling)**

Increase the number of minority class samples for class $k$.

$$
n_{k}^{new} > n_{k}^{original}
$$

By synthetic generation (SMOTE) for each minority sample $x_i$, generate a new sample along the line connecting $x_i$ to one of its $k$ nearest neighbors $x_{nn}$:

$$
x_{new} = x_i + \lambda \cdot (x_{nn} - x_i), \quad \lambda \sim U(0,1)
$$

2. **Undersampling (Majority Reduction)**

Reduce the number of majority class samples.

$$
n_{majority}^{new} < n_{majority}^{original}
$$

Randomly select a subset of majority samples to balance classes:

$$
n_{majority}^{new} \approx n_{minority}^{original} \quad \text{or some ratio } r
$$

3. **Up-Down Weighting (Class Weighting)**

Assign higher weight $w_i$ to minority class samples during training.

$$
w_i = \frac{n}{K \cdot n_{y_i}}
$$

where $n_{y_i}$ is the number of training observations in the class of observation $i$, so rarer classes receive larger weights.

Loss function becomes weighted:

$$
\text{Weighted Loss} = \sum_{i=1}^{n} w_i \cdot \ell(y_i, \hat{p}_{ik})
$$

The model penalizes misclassification of minority classes more heavily, which improves detection of those classes.

#### Fitting

**Deviance & Pearson Chi-Square Tests**

**Deviance:** Measures the discrepancy between the observed data and the model-predicted probabilities.

$$
D = 2 \sum_{i=1}^n \sum_{k=1}^K I(y_i=k) \ln \frac{I(y_i=k)}{\hat{p}_{ik}}
$$

**Pearson Chi-Square:** Compares observed and expected counts in each category.

$$
X^2 = \sum_{i=1}^n \sum_{k=1}^K \frac{(I(y_i=k) - \hat{p}_{ik})^2}{\hat{p}_{ik}}
$$

**Where:**  
- $n$ = number of observations  
- $K$ = number of outcome categories  
- $y_i \in \{1, \dots, K\}$ = observed class for observation $i$  
- $\hat{p}_{ik}$ = predicted probability that $y_i = k$  
- $I(y_i=k)$ = indicator function, $1$ if $y_i=k$, $0$ otherwise  

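The test statistic can be compared to a chi-square distribution with degrees of freedom:

$$
df = (K-1)(n - p - 1)
$$

where $p$ is the number of predictors, so that $(K-1)(p+1)$ parameters are estimated including the intercepts. The chi-square approximation is reliable only when observations are grouped into covariate patterns with adequate counts. Large values of $D$ or $X^2$ indicate poor model fit.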

**Likelihood Ratio Test (LRT)**

The LRT compares a nested model (simpler) with a full model (more parameters) to evaluate whether additional predictors significantly improve the fit:

$$
G^2 = -2 \left( \ell_0 - \ell_1 \right)
$$

**Where:**  
- $\ell_0$ = log-likelihood of the nested (smaller) model  
- $\ell_1$ = log-likelihood of the full model  

Degrees of freedom:  

$$
df = \text{number of additional parameters in the full model}
$$

A large $G^2$ value indicates that the full model significantly improves the fit compared to the nested model.

**Pseudo R² Measures**

Since traditional R² is not defined for logistic models, several pseudo R² metrics are used to assess the goodness-of-fit:

**McFadden R²**:
$$
R^2_\text{McFadden} = 1 - \frac{\ell_1}{\ell_0}
$$

**Cox & Snell R²**:
$$
R^2_\text{Cox-Snell} = 1 - \left(\frac{L_0}{L_1}\right)^{2/n}
$$

**Nagelkerke R²** (adjusted Cox & Snell):
$$
R^2_\text{Nagelkerke} = \frac{R^2_\text{Cox-Snell}}{1 - L_0^{2/n}}
$$

**Where:**  
- $L_0 = \exp(\ell_0)$ = likelihood of the intercept-only model  
- $L_1 = \exp(\ell_1)$ = likelihood of the fitted model  
- $n$ = number of observations  

Higher pseudo R² values indicate better model fit, although they do not have a direct interpretation as variance explained like in linear regression.
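
The three measures can be computed from the two log-likelihoods alone, as in the sketch below. Working with $\ell_0$ and $\ell_1$ rather than $L_0$ and $L_1$ is a design choice that avoids numerical underflow, since likelihoods of realistic samples are extremely small. The log-likelihood values and the name `pseudo_r2` are hypothetical.

```python
import math


def pseudo_r2(ll_null, ll_model, n):
    """McFadden, Cox & Snell and Nagelkerke pseudo R-squared values."""
    mcfadden = 1 - ll_model / ll_null
    # (L0 / L1)^(2/n) = exp(2 * (l0 - l1) / n)
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    # L0^(2/n) = exp(2 * l0 / n)
    nagelkerke = cox_snell / (1 - math.exp(2 * ll_null / n))
    return mcfadden, cox_snell, nagelkerke


# Hypothetical log-likelihoods (assumption, not from a real fit).
mcf, cs, nk = pseudo_r2(ll_null=-100.0, ll_model=-80.0, n=150)
print(mcf, cs, nk)
```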

#### Regularizing

Regularizing is used to prevent overfitting by penalizing large coefficients.

**Ridge (L2) Regularization**

Ridge penalizes the sum of squared coefficients.

$$
\mathcal{L}(\beta) = - \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i=k) \ln(\hat{p}_{ik}) + \lambda \sum_{k=1}^{K-1} \sum_{j=0}^{p} \beta_{jk}^2
$$

**Where:**  
- $n$ = number of observations  
- $K$ = number of classes  
- $p$ = number of predictors  
- $\beta_{jk}$ = coefficient for predictor $j$ and class $k$  
- $\lambda$ = regularization strength (hyperparameter), chosen via cross-validation to optimize predictive performance  
- $\hat{p}_{ik}$ = predicted probability for observation $i$ in class $k$  
- $I(y_i=k)$ = indicator function, $1$ if $y_i=k$, $0$ otherwise  

**Interpretation:**  
- Shrinks coefficients towards zero.

**Lasso (L1) Regularization**

Lasso penalizes the sum of absolute coefficients.

$$
\mathcal{L}(\beta) = - \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i=k) \ln(\hat{p}_{ik}) + \lambda \sum_{k=1}^{K-1} \sum_{j=0}^{p} |\beta_{jk}|
$$

**Interpretation:**  
- It can shrink some coefficients exactly to zero, which performs automatic feature selection.

Both L1 and L2 can be combined in **Elastic Net** regularization.

### Ordinal Logistic Regression Model

#### Model

Ordinal logistic regression model is:

$$
\text{logit}\big(P(y_i \le k \mid x_i)\big) = \ln\left(\frac{P(y_i \le k \mid x_i)}{P(y_i > k \mid x_i)}\right) = \theta_k - \beta x_i, \quad k = 1,2,\dots,K-1
$$

- $\theta_k$: intercept (cutpoint) for threshold $k$  
- $\beta$: slope coefficient for predictor $x_i$, common across all thresholds (proportional odds assumption)  
- $y_i$: ordinal outcome for observation $i$  
- $x_i$: independent variable  

Also for multiple predictors, the ordinal logistic regression model generalizes to:

$$
\text{logit}\big(P(y_i \le k \mid \mathbf{x}_i)\big) = \theta_k - \mathbf{x}_i^\top \boldsymbol{\beta}, \quad k = 1, \dots, K-1
$$

where:

- $\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{ip})^\top$ is a vector of $p$ predictors for observation $i$  
- $\boldsymbol{\beta} = (\beta_1, \beta_2, \dots, \beta_p)^\top$ is the coefficient vector (common across thresholds, proportional odds assumption)  
- $\theta_k$ is the intercept (cutpoint) for threshold $k$  

This explicitly accounts for multiple predictors while maintaining the proportional odds assumption.

**Category probabilities** are obtained as:

$$
\begin{aligned}
P(y_i = 1 \mid x_i) &= \text{logit}^{-1}(\theta_1 - \beta x_i) \\
P(y_i = k \mid x_i) &= \text{logit}^{-1}(\theta_k - \beta x_i) - \text{logit}^{-1}(\theta_{k-1} - \beta x_i), \quad k = 2,\dots,K-1 \\
P(y_i = K \mid x_i) &= 1 - \text{logit}^{-1}(\theta_{K-1} - \beta x_i)
\end{aligned}
$$
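
The differencing of cumulative probabilities can be sketched for a single predictor as follows. The thresholds must be increasing for the category probabilities to be positive; the values and the name `ordinal_probs` are hypothetical.

```python
import math


def ordinal_probs(x, thresholds, beta):
    """Category probabilities of a proportional-odds model with K-1 cutpoints."""
    def expit(t):
        return 1.0 / (1.0 + math.exp(-t))

    # Cumulative probabilities P(y <= k) for k = 1..K-1, then P(y <= K) = 1.
    cum = [expit(theta - beta * x) for theta in thresholds] + [1.0]
    # First category, then successive differences.
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]


# Hypothetical cutpoints and slope for K = 4 ordered categories.
probs = ordinal_probs(x=1.0, thresholds=[-1.0, 0.5, 2.0], beta=0.5)
print(probs)
```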

**Maximum Likelihood Estimation (MLE)** is used to estimate the parameters:

$$
L(\theta, \beta) = \prod_{i=1}^n \prod_{k=1}^K P(y_i = k \mid x_i)^{I(y_i = k)}
$$

Log-likelihood function:

$$
\ell(\theta, \beta) = \sum_{i=1}^n \sum_{k=1}^K I(y_i = k) \ln P(y_i = k \mid x_i)
$$

Estimators $\hat{\theta}_k$ and $\hat{\beta}$ are obtained by maximizing $\ell(\theta, \beta)$.

#### Performance

**Cross-Validation**

Cross-validation is used to estimate the generalization performance of the ordinal logistic regression model on unseen data.  

**k-Fold Cross-Validation**

1. Split the data into $k$ roughly equal folds (subsets): $D_1, D_2, \dots, D_k$.  
2. For each fold $j = 1, \dots, k$:
   - Train the model on the remaining $k-1$ folds: $D_{-j} = D \setminus D_j$  
   - Fit the model to obtain $\hat{\theta}_k^{(-j)}, \hat{\beta}^{(-j)}$ for $k=1,\dots,K-1$  
   - Predict cumulative probabilities on the left-out fold $D_j$:  
$$
\hat{P}(y_i \le k \mid x_i)^{(-j)} = \text{logit}^{-1}(\hat{\theta}_k^{(-j)} - \hat{\beta}^{(-j)} x_i), \quad i \in D_j
$$
   - Obtain category probabilities:  
$$
\hat{P}(y_i = 1 \mid x_i)^{(-j)} = \hat{P}(y_i \le 1 \mid x_i)^{(-j)}, \quad
\hat{P}(y_i = k \mid x_i)^{(-j)} = \hat{P}(y_i \le k \mid x_i)^{(-j)} - \hat{P}(y_i \le k-1 \mid x_i)^{(-j)}, \quad k=2,\dots,K-1
$$
$$
\hat{P}(y_i = K \mid x_i)^{(-j)} = 1 - \hat{P}(y_i \le K-1 \mid x_i)^{(-j)}
$$

3. Compute prediction error using ordinal-appropriate metrics (e.g., accuracy, macro-F1, or ordinal log loss) for each fold.  

- *Common choices are $k=5$ or $k=10$*  
- *Leave-One-Out CV (LOOCV) is a special case where $k=n$.*  

Cross-validation provides an estimate of expected performance on new/unseen data.  

Lower log loss or higher accuracy indicates better generalization performance.  

**Log-Loss**

Log-loss measures the uncertainty of predictions by penalizing incorrect classifications.  

For $n$ observations and $K$ categories:

$$
\text{LogLoss} = - \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i = k) \cdot \log(\hat{p}_{ik})
$$

- $I(y_i=k)$: indicator function (1 if true, 0 otherwise)  
- $\hat{p}_{ik}$: predicted probability that observation $i$ belongs to class $k$   

**Mean Absolute Error (MAE)**

MAE measures the average magnitude of errors between predicted and true categories.

$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} | y_i - \hat{y}_i |
$$

- $y_i$: true ordinal class (numeric)  
- $\hat{y}_i$: predicted ordinal class (numeric)  

**Quadratic Weighted Kappa (QWK)**

QWK penalizes larger disagreements more heavily using quadratic weights.

$$
w_{ij} = \frac{(i-j)^2}{(K-1)^2}
$$

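Then:

$$
\kappa_{QW} = 1 - \frac{\sum_{i=1}^{K}\sum_{j=1}^{K} w_{ij} O_{ij}}{\sum_{i=1}^{K}\sum_{j=1}^{K} w_{ij} E_{ij}}
$$

**Where:**  
- $O_{ij}$: observed number of observations with true category $i$ and predicted category $j$  
- $E_{ij}$: expected count under chance agreement, computed from the marginal totals as $E_{ij} = \frac{1}{n} \big(\sum_{j'} O_{ij'}\big) \big(\sum_{i'} O_{i'j}\big)$  
- $K$: number of ordered categories  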
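
A sketch of the QWK calculation, assuming categories are coded $1, \dots, K$ and that the expected counts are the products of the row and column totals of the observed matrix divided by $n$ (the usual convention). The name `quadratic_weighted_kappa` and the labels are hypothetical.

```python
def quadratic_weighted_kappa(y_true, y_pred, k):
    """QWK for ordinal labels coded 1..k."""
    n = len(y_true)
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t - 1][p - 1] += 1
    row = [sum(r) for r in observed]                                 # true totals
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]  # predicted totals
    num = 0.0
    den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2        # quadratic disagreement weight
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n         # expected count under chance
    return 1 - num / den


# Perfect agreement and fully reversed predictions (toy examples).
qwk_perfect = quadratic_weighted_kappa([1, 2, 3, 1, 2, 3], [1, 2, 3, 1, 2, 3], k=3)
qwk_reversed = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], k=3)
print(qwk_perfect, qwk_reversed)
```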

**Rank Correlation Measures**

Ordinal predictions can also be evaluated using rank correlation metrics.

- **Kendall’s Tau**:

$$
\tau = \frac{(C - D)}{\tfrac{1}{2} n (n-1)}
$$

where $C$ = number of concordant pairs, $D$ = number of discordant pairs.  

- **Somers’ D** (asymmetric measure, prediction vs. true labels):

$$
D_{yx} = \frac{C - D}{C + D + T_y}
$$

where $T_y$ = number of ties in the true variable.  

- **Concordance Index (C-index):**

$$
C = \frac{\text{Number of concordant pairs} + 0.5 \times \text{Number of ties}}{\text{Total comparable pairs}}
$$

#### Prediction

Ordinal logistic regression model for prediction is based on cumulative probabilities.

$$
\hat{P}(y \le k \mid x_{\text{new}}) = \frac{1}{1 + \exp\!\big(-(\hat{\theta}_k - \hat{\beta}^\top x_{\text{new}})\big)},
\quad k = 1,2,\dots,K-1
$$

**Where:**  
- $\hat{\theta}_k$: estimated threshold (cutpoint) for category $k$  
- $\hat{\beta}$: estimated slope coefficient vector (common across categories)  
- $x_{\text{new}}$: predictor vector for the new observation  
- $\hat{P}(y \le k \mid x_{\text{new}})$: predicted cumulative probability up to category $k$  

The category-specific probabilities are obtained by differencing cumulative probabilities:

$$
\hat{p}_{\text{new},1} = \hat{P}(y \le 1 \mid x_{\text{new}})
$$

$$
\hat{p}_{\text{new},k} = \hat{P}(y \le k \mid x_{\text{new}}) - \hat{P}(y \le k-1 \mid x_{\text{new}}),
\quad k = 2, \dots, K-1
$$

$$
\hat{p}_{\text{new},K} = 1 - \hat{P}(y \le K-1 \mid x_{\text{new}})
$$

To estimate the uncertainty of the predicted cumulative logit at $x_{\text{new}}$, a **confidence interval (CI)** can be constructed:

1. Compute the cumulative logit for threshold $k$:

$$
\hat{\text{logit}}_{\text{new},k} = \hat{\theta}_k - \hat{\beta}^\top x_{\text{new}}
$$

2. Construct a CI for the logit:

$$
\hat{\text{logit}}_{\text{new},k} \pm z_{\alpha/2} \cdot SE(\hat{\text{logit}}_{\text{new},k})
$$

with

$$
SE(\hat{\text{logit}}_{\text{new},k}) = \sqrt{
\mathbf{x}_{\text{new}}^\top \, \text{Cov}(\hat{\boldsymbol{\beta}}) \, \mathbf{x}_{\text{new}}
+ \text{Var}(\hat{\theta}_k)
- 2 \, \mathbf{x}_{\text{new}}^\top \, \text{Cov}(\hat{\boldsymbol{\beta}}, \hat{\theta}_k)
}
$$

**Where:**  
- $SE(\hat{\text{logit}}_{\text{new},k})$: standard error of the cumulative logit at threshold $k$  
- $z_{\alpha/2}$: critical value from the standard normal distribution (e.g., 1.96 for 95% CI)  
- $\text{Cov}(\hat{\beta})$: estimated covariance matrix of slope coefficients  
- $\text{Cov}(\hat{\boldsymbol{\beta}}, \hat{\theta}_k)$: estimated covariance vector between slopes and threshold $k$  
- $\text{Var}(\hat{\theta}_k)$: estimated variance of threshold $k$
- $\mathbf{x}_{\text{new}}$: predictor vector for the new observation

3. Transform the CI back to the probability scale using the logistic function.

Unlike linear regression, there is no standard **prediction interval** for an individual observation.  

**Extrapolation** occurs when $x_{\text{new}}$ is outside the observed range; predicted probabilities may become unstable.

#### Fitting

**Deviance & Pearson Chi-Square Tests**

**Deviance:** Measures the discrepancy between the observed data and the model-predicted probabilities.

$$
D = 2 \sum_{i=1}^n \sum_{k=1}^K I(y_i=k) \ln \frac{I(y_i=k)}{\hat{p}_{ik}}
$$

**Pearson Chi-Square:** Compares observed and expected counts in each category.

$$
X^2 = \sum_{i=1}^n \sum_{k=1}^K \frac{(I(y_i=k) - \hat{p}_{ik})^2}{\hat{p}_{ik}}
$$

**Where:**  
- $n$ = number of observations  
- $K$ = number of outcome categories  
- $y_i \in \{1, \dots, K\}$ = observed class for observation $i$  
- $\hat{p}_{ik}$ = predicted probability that $y_i = k$, obtained from cumulative logits:  

$$
\hat{p}_{ik} = \hat{P}(y \leq k \mid x_i) - \hat{P}(y \leq k-1 \mid x_i)
$$

- $I(y_i=k)$ = indicator function, $1$ if $y_i=k$, $0$ otherwise  

The test statistic can be compared to a chi-square distribution with degrees of freedom:

$$
df = n - \text{number of estimated parameters}
$$

Large values of $D$ or $X^2$ indicate poor model fit.

**Likelihood Ratio Test (LRT)**

The LRT compares a nested model (simpler) with a full model (more predictors) to evaluate whether additional predictors significantly improve the fit:

$$
G^2 = -2 \left( \ell_0 - \ell_1 \right)
$$

**Where:**  
- $\ell_0$ = log-likelihood of the nested (smaller) model  
- $\ell_1$ = log-likelihood of the full model  

Degrees of freedom:  

$$
df = \text{number of additional parameters in the full model}
$$

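Under the null hypothesis that the additional coefficients are zero, $G^2$ approximately follows a chi-square distribution with these degrees of freedom. A $G^2$ value that is large relative to this distribution (a small p-value) indicates that the full model fits significantly better than the nested model.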
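As a minimal sketch, the test can be carried out with the standard library alone. The chi-square upper-tail probability is computed here with the recurrence $Q(a+1, t) = Q(a, t) + t^a e^{-t} / \Gamma(a+1)$ for the regularized upper incomplete gamma function; the helper names and the log-likelihood values are hypothetical, and a statistics library would normally supply the tail probability.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi2_df > x) for integer df >= 1."""
    if x <= 0:
        return 1.0
    t = x / 2.0
    # Closed-form starting points: Q(1/2, t) = erfc(sqrt(t)), Q(1, t) = exp(-t)
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(t))
    else:
        a, q = 1.0, math.exp(-t)
    # Step a up by 1 until it reaches df / 2
    while a < df / 2.0:
        q += t ** a * math.exp(-t) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(loglik_nested, loglik_full, extra_params):
    """Return (G^2, p-value) for a nested model against a full model."""
    g2 = -2.0 * (loglik_nested - loglik_full)
    return g2, chi2_sf(g2, extra_params)

# Hypothetical log-likelihoods; the full model has 2 extra parameters
g2, p_value = likelihood_ratio_test(-120.5, -115.2, extra_params=2)
print(g2, p_value)
```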

**Pseudo R² Measures**

Since traditional R² is not defined for logistic models, several pseudo R² metrics are used to assess the goodness-of-fit:

**McFadden R²**:
$$
R^2_\text{McFadden} = 1 - \frac{\ell_1}{\ell_0}
$$

**Cox & Snell R²**:
$$
R^2_\text{Cox-Snell} = 1 - \left(\frac{L_0}{L_1}\right)^{2/n}
$$

**Nagelkerke R²** (adjusted Cox & Snell):
$$
R^2_\text{Nagelkerke} = \frac{R^2_\text{Cox-Snell}}{1 - L_0^{2/n}}
$$

**Where:**  
- $L_0 = \exp(\ell_0)$ = likelihood of the intercept-only model  
- $L_1 = \exp(\ell_1)$ = likelihood of the fitted model  
- $n$ = number of observations  

Higher pseudo R² values indicate better model fit, but they cannot be interpreted as the proportion of variance explained, as R² can in linear regression.
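The three measures can be computed from the two log-likelihoods alone. The sketch below works on the log scale, because $L_0$ and $L_1$ themselves underflow to zero for even moderately large $n$; the function name `pseudo_r2` and the input values are hypothetical.

```python
import math

def pseudo_r2(loglik_null, loglik_model, n):
    """Return McFadden, Cox & Snell, and Nagelkerke pseudo R² values."""
    mcfadden = 1.0 - loglik_model / loglik_null
    # (L0 / L1)^(2/n) = exp(2 * (l0 - l1) / n)
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    # L0^(2/n) = exp(2 * l0 / n)
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * loglik_null / n))
    return mcfadden, cox_snell, nagelkerke

# Hypothetical log-likelihoods for an intercept-only and a fitted model
mcfadden, cox_snell, nagelkerke = pseudo_r2(-150.0, -120.0, n=200)
print(mcfadden, cox_snell, nagelkerke)
```

Nagelkerke R² rescales Cox & Snell R² by its maximum attainable value, so it is always at least as large.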

#### Regularizing

Regularizing is used to prevent overfitting by penalizing large coefficients.  

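In ordinal logistic regression, the loss is built from the cumulative probabilities $\hat{P}(y_i \leq k \mid x_i)$: each of the $K-1$ thresholds contributes a binary cross-entropy term for the split $y_i \leq k$ versus $y_i > k$.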

**Ridge (L2) Regularization**

Ridge penalizes the sum of squared coefficients.

$$
\mathcal{L}(\beta) = - \sum_{i=1}^{n} \sum_{k=1}^{K-1}
\Big[
I(y_i \leq k) \ln \hat{P}(y_i \leq k \mid x_i)
+ I(y_i > k) \ln\big(1 - \hat{P}(y_i \leq k \mid x_i)\big)
\Big]
+ \lambda \sum_{j=1}^{p} \beta_{j}^2
$$

**Where:**  
- $n$ = number of observations  
- $K$ = number of ordered categories  
- $p$ = number of predictors  
- $\beta_j$ = coefficient for predictor $j$ (common across thresholds)  
- $\lambda$ = regularization strength (hyperparameter), chosen via cross-validation  
- $\hat{P}(y_i \leq k \mid x_i)$ = predicted cumulative probability  
- $I(\cdot)$ = indicator function  

**Interpretation:**  
- Shrinks the slope coefficients $\beta_j$ towards zero without setting them exactly to zero; the thresholds are not penalized.

**Lasso (L1) Regularization**

Lasso penalizes the sum of absolute coefficients.

$$
\mathcal{L}(\beta) = - \sum_{i=1}^{n} \sum_{k=1}^{K-1}
\Big[
I(y_i \leq k) \ln \hat{P}(y_i \leq k \mid x_i)
+ I(y_i > k) \ln\big(1 - \hat{P}(y_i \leq k \mid x_i)\big)
\Big]
+ \lambda \sum_{j=1}^{p} |\beta_{j}|
$$

**Interpretation:**  
- Can shrink some coefficients exactly to zero, performing automatic feature selection.

The L1 and L2 penalties can be combined in **Elastic Net** regularization, which adds a weighted sum of both penalties to the loss.
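The penalized losses can be evaluated with a short standard-library sketch. It rests on two assumptions that are not fixed by the formulas here: the cumulative logit is parameterized as $\theta_k - \mathbf{x}^\top \boldsymbol{\beta}$, and the elastic net penalty is mixed as $\lambda \left( \alpha \sum_j |\beta_j| + (1 - \alpha) \sum_j \beta_j^2 \right)$, with the hypothetical argument `l1_ratio` playing the role of $\alpha$. The function name and data are illustrative only.

```python
import math

def penalized_loss(y, X, beta, thetas, lam, l1_ratio=0.0):
    """Cumulative-split cross-entropy plus a penalty on the slopes only.

    l1_ratio = 0 gives the ridge loss, l1_ratio = 1 the lasso loss.
    Assumes logit P(y <= k | x) = theta_k - x'beta.
    """
    expit = lambda v: 1.0 / (1.0 + math.exp(-v))
    nll = 0.0
    for y_i, x_i in zip(y, X):
        eta = sum(b * x for b, x in zip(beta, x_i))
        for k, theta_k in enumerate(thetas, start=1):
            p_le_k = expit(theta_k - eta)  # P(y_i <= k | x_i)
            nll -= math.log(p_le_k) if y_i <= k else math.log(1.0 - p_le_k)
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    # Thresholds are deliberately left out of the penalty
    return nll + lam * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)

# Hypothetical data: 3 observations, 1 predictor, K = 3 categories
y = [1, 2, 3]
X = [[0.5], [1.0], [2.0]]
beta, thetas = [1.2], [0.0, 1.5]

unpenalized = penalized_loss(y, X, beta, thetas, lam=0.0)
ridge = penalized_loss(y, X, beta, thetas, lam=0.1, l1_ratio=0.0)
lasso = penalized_loss(y, X, beta, thetas, lam=0.1, l1_ratio=1.0)
print(unpenalized, ridge, lasso)
```

Minimizing this loss over `beta` and `thetas` for several values of `lam`, and comparing the results by cross-validation, is how the regularization strength would be chosen.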

# MODELS IN PYTHON

## MODELS FOR PREDICTION IN PYTHON

```python
# SOON
```

## MODELS FOR CLASSIFICATION IN PYTHON

```python
# SOON
```