# **Chapter 3. Modeling**

## **3.1. Statistical Models**

### **3.1.1. Definition**

A **statistical model** is a mathematical representation of observed data. It is used to make inferences about the underlying processes that generated the data, predict future outcomes, or understand relationships between variables. Statistical models provide a framework for analyzing data and are fundamental in quantitative research across various fields, including economics, biology, engineering, and social sciences.

**Definition**: A statistical model is an abstraction that describes the relationship between input variables (predictors) and output variables (responses). It formalizes assumptions about the data, which can be leveraged to generate insights.

**Components of a Statistical Model**:
   - **Variables**: These can be dependent (response) variables that we aim to predict or explain, and independent (predictor) variables that are used to predict the behavior of the dependent variable.
   - **Parameters**: These are the numerical constants in the model that define its specific form. In many cases, they are estimated from the data.
   - **Random Error (Noise)**: This encompasses the variability in the response variable that cannot be explained by the predictor variables. It acknowledges that data are not perfect and contain inherent uncertainty.

**Formulation**: Statistical models can typically be expressed mathematically as:
     $$
     Y = f(X) + \epsilon
     $$
Where:
- $Y$ is the dependent variable.
- $f(X)$ is a function that describes the relationship between $Y$ and the independent variable(s) $X$.
- $\epsilon$ represents the random error term.
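To make the formulation concrete, the sketch below simulates data from $Y = f(X) + \epsilon$. The straight-line form of `f`, the noise level, and the sample size are assumptions chosen only for illustration; they are not taken from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible


def f(x):
    """Hypothetical systematic component: a straight line chosen for illustration."""
    return 2.0 + 0.5 * x


n = 1000
X = rng.uniform(0.0, 10.0, size=n)      # independent variable
epsilon = rng.normal(0.0, 1.0, size=n)  # random error with mean 0 and standard deviation 1
Y = f(X) + epsilon                      # observed response

# Removing the systematic part from the response leaves only the error term.
residual = Y - f(X)
print(f"mean of error term: {residual.mean():.3f}")
print(f"std of error term:  {residual.std():.3f}")
```

In practice $f$ is unknown, and the purpose of modeling is to estimate it from the observed pairs of $X$ and $Y$.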

**Assumptions**: Each statistical model comes with its own set of assumptions that should hold, at least approximately, for its conclusions to be reliable. Common assumptions include:
- Linearity: In linear regression, the relationship between independent and dependent variables is assumed to be linear.
- Independence: Observations should be independent of each other.
- Homoscedasticity: Variability of the error terms should remain constant across all levels of the independent variables.
- Normality: For parametric tests, it is often assumed that the residuals (errors) are normally distributed.
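Some of these assumptions can be checked informally from the residuals of a fitted model. The following is a minimal sketch on synthetic data, assuming a straight-line fit with `np.polyfit`; the two diagnostics shown (a variance ratio and a curvature correlation) are simple illustrative choices, not formal statistical tests.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 10.0, size=200))
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=200)  # synthetic data with constant-variance noise

# Least-squares straight line; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Homoscedasticity: the residual spread should be similar for small and large x.
low_half, high_half = residuals[:100], residuals[100:]
variance_ratio = high_half.var() / low_half.var()

# Linearity: if the straight line is adequate, the residuals should show
# no leftover curvature, so they should be roughly uncorrelated with (x - mean)^2.
curvature_corr = np.corrcoef((x - x.mean()) ** 2, residuals)[0, 1]

print(f"variance ratio (high x / low x): {variance_ratio:.2f}")
print(f"correlation of residuals with curvature term: {curvature_corr:.2f}")
```

A variance ratio far from 1 or a curvature correlation far from 0 would suggest that the corresponding assumption deserves a closer look, for example with a residual plot.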

**Applications**: Statistical models are widely used in various applications, including:
- Predictive analytics (forecasting trends)
- Experimental design (testing treatments)
- Quality control (monitoring process variations)
- Risk assessment (evaluating uncertainties)

In summary, statistical models serve as powerful tools for understanding relationships within data, guiding decision-making, and making predictions. They form the backbone of statistical analysis and facilitate the interpretation of complex datasets.

### **3.1.2. Modeling Process**

The process of creating a statistical model involves several steps:
1. **Model Specification**: Defining the model structure and identifying relevant variables.
2. **Parameter Estimation**: Using statistical methods to estimate model parameters from the data.
3. **Model Evaluation**: Assessing the goodness-of-fit and validity of the model, checking its predictive power, and making necessary adjustments.
4. **Model Validation**: Testing the model on new data to verify its accuracy and reliability.
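The four steps can be sketched end to end on a small synthetic example. The data, the straight-line specification, and the 90/30 split below are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0.0, 5.0, size=120)
y = 1.5 + 0.8 * x + rng.normal(0.0, 0.3, size=120)  # synthetic data for illustration

# Hold out the last 30 points to play the role of "new" data.
x_fit, y_fit = x[:90], y[:90]
x_new, y_new = x[90:], y[90:]

# 1. Model specification: assume y = b0 + b1 * x + error.
# 2. Parameter estimation: least squares on the fitting data.
b1, b0 = np.polyfit(x_fit, y_fit, deg=1)


def mse(y_true, y_hat):
    """Mean squared error between observed and predicted values."""
    return float(np.mean((y_true - y_hat) ** 2))


# 3. Model evaluation: goodness of fit on the data used for estimation.
mse_fit = mse(y_fit, b0 + b1 * x_fit)

# 4. Model validation: error on data the model has not seen.
mse_new = mse(y_new, b0 + b1 * x_new)

print(f"estimated intercept = {b0:.3f}, slope = {b1:.3f}")
print(f"MSE on fitting data = {mse_fit:.4f}, MSE on new data = {mse_new:.4f}")
```

When the error on new data is much larger than the error on the fitting data, the specification or the estimation step usually needs to be revisited.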

### **3.1.3. Types of Model**

Statistical models can be categorized in various ways based on their characteristics, underlying assumptions, and intended applications. The primary categories include:

1. **Descriptive vs. Inferential Models**:
   - **Descriptive Models**: These models summarize and describe the characteristics of the data without making predictions or inferences about a larger population.
   - **Inferential Models**: These models are used to make predictions or inferences about a population based on a sample of data.


2. **Parametric vs. Non-Parametric Models**:
   - **Parametric Models**: These models assume a specific form for the function relating the predictor and response variables, characterized by a finite number of parameters (e.g., linear regression).
   - **Non-Parametric Models**: These models do not assume a specific functional form and can adapt to the underlying data structure more flexibly (e.g., decision trees, kernel smoothing).


3. **Linear vs. Non-linear Models**:

   - **Linear Models**: These models assume a linear relationship between the predictor and response variables; with a single predictor, this is a straight line.
   - **Non-linear Models**: These models allow for relationships that cannot be represented as a straight line (e.g., quadratic, exponential). Note that some curved relationships, such as a quadratic in $X$, can still be fitted with a model that is linear in its parameters.


4. **Static vs. Dynamic Models**:

   - **Static Models**: These models analyze data at a fixed point in time.
   - **Dynamic Models**: These models consider changes over time and can include time series data.

5. **Regression vs. Classification**:
    - **Regression**: Regression analysis involves predicting a continuous output variable based on one or more input variables (predictors). The goal is to model the relationship between the predictor(s) and the response variable.
    - **Classification**: Classification is the process of predicting categorical outcomes (labels). Given a set of input features, the goal is to assign a class label based on the learned patterns from the training data.
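As an illustration of the parametric versus non-parametric distinction, the sketch below fits the same curved synthetic data with a two-parameter straight line and with a Gaussian kernel smoother. The smoother (a Nadaraya-Watson weighted average), the helper name `kernel_smooth`, and the bandwidth of 0.3 are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=200))
y = np.sin(x) + rng.normal(0.0, 0.2, size=200)  # synthetic curved relationship

# Parametric: a straight line described by two parameters (slope and intercept).
slope, intercept = np.polyfit(x, y, deg=1)
y_line = intercept + slope * x


def kernel_smooth(x_train, y_train, x_query, bandwidth=0.3):
    """Non-parametric fit: each prediction is a Gaussian-weighted average of
    the training responses, with nearby points receiving larger weights."""
    diffs = (x_query[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)
    return (weights @ y_train) / weights.sum(axis=1)


y_smooth = kernel_smooth(x, y, x)

mse_line = float(np.mean((y - y_line) ** 2))
mse_smooth = float(np.mean((y - y_smooth) ** 2))
print(f"in-sample MSE, straight line:   {mse_line:.3f}")
print(f"in-sample MSE, kernel smoother: {mse_smooth:.3f}")
```

The comparison here is in-sample, so it only shows that the flexible model can follow the curve; whether it generalizes better has to be judged on held-out data.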

### **3.1.4. Bias versus Variance**

**Bias** refers to the error introduced by approximating a real-world problem with a simplified model. A high-bias model makes assumptions that systematically deviate from the true relationship between the input features and the target variable, so its error persists no matter which training sample is used.

**Variance**, on the other hand, refers to the model's sensitivity to fluctuations in the training data. It measures how much the predictions of the model vary for different training datasets.

In summary:
- Bias represents the model's systematic error or the assumptions it makes.
- Variance represents the model's inconsistency or sensitivity to training data fluctuations.
- **High bias** → **Underfit**
- **High variance** → **Overfit**

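The bias-variance tradeoff: as model complexity increases, bias tends to fall while variance tends to rise, so the total error is typically lowest at an intermediate level of complexity.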

![Bias-variance tradeoff](images/bias_variance_tradeoff.png)
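For squared-error loss, the expected prediction error at a point decomposes as $\text{Bias}^2 + \text{Variance} + \sigma^2$, where $\sigma^2$ is the irreducible noise. The sketch below estimates the first two terms by refitting polynomials of different degrees on many simulated training sets. The true function $\sin(2\pi x)$, the noise level, and the chosen degrees are assumptions made purely for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_f(x):
    """Hypothetical true relationship used to generate the data."""
    return np.sin(2.0 * np.pi * x)


x_train = np.linspace(0.0, 1.0, 30)  # fixed design; only the noise changes between repeats
x_grid = np.linspace(0.0, 1.0, 50)   # points at which predictions are compared
noise_sd = 0.3
n_repeats = 200
degrees = (1, 4, 12)

bias_sq, variance = {}, {}
for degree in degrees:
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        y_train = true_f(x_train) + rng.normal(0.0, noise_sd, size=x_train.size)
        preds[r] = Polynomial.fit(x_train, y_train, degree)(x_grid)
    mean_pred = preds.mean(axis=0)
    # Squared bias: how far the average prediction is from the truth.
    bias_sq[degree] = float(np.mean((mean_pred - true_f(x_grid)) ** 2))
    # Variance: how much predictions change from one training set to another.
    variance[degree] = float(np.mean(preds.var(axis=0)))

for degree in degrees:
    print(f"degree {degree:2d}: bias^2 = {bias_sq[degree]:.4f}, variance = {variance[degree]:.4f}")
```

Under these assumptions, the degree-1 fit should show the largest squared bias (underfitting) and the degree-12 fit the largest variance (overfitting).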

## **3.2. Regression**

Regression analysis is a powerful statistical tool used for modeling the relationship between a dependent variable and one or more independent variables. This section will cover the fundamentals of regression, how to evaluate regression models, and a detailed exploration of linear regression.

### **3.2.1. Fundamentals of Regression**

In regression analysis, the main objective is to estimate the relationships among variables. The model can provide insights about how the expected value of the dependent variable changes as the independent variable(s) change.

Key concepts in regression include:
- **Dependent Variable**: The outcome or response variable that we aim to predict.
- **Independent Variables**: The predictors or features that are used to estimate the dependent variable.
- **Regression Coefficients**: These parameters represent the expected change in the dependent variable for a one-unit change in an independent variable, holding the other independent variables constant.
- **Error Terms**: Variability in the dependent variable that cannot be explained by the independent variables.
  
Regression analysis helps in making predictions, understanding relationships, and assessing trends in data, making it a critical component in data analysis and machine learning.

### **3.2.2. Regression Model Evaluation**

Evaluating the performance of regression models involves assessing how well the model's predictions match the actual outcomes. Several metrics are commonly used to quantify model performance:

1. **Mean Absolute Error (MAE)**:  Measures the average magnitude of errors in a set of predictions, without considering their direction. It is the average over the test sample of the absolute differences between prediction and actual observation.
$$
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$

2. **Mean Squared Error (MSE)**: Measures the average of the squares of the errors, which means larger errors have a disproportionately higher impact on the metric. 
$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

3. **Root Mean Squared Error (RMSE)**: This is the square root of the mean squared error. It is expressed in the same units as the dependent variable, so it can be read as a typical size of the prediction error.
$$
RMSE = \sqrt{MSE}
$$

4. **R-squared (R²)**: Represents the proportion of the variance for the dependent variable that's explained by the independent variables in the model. It provides a measure of how well observed outcomes are replicated by the model.
$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
$$

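Where:
- $SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the sum of squares of residuals.
- $SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares, with $\bar{y}$ the mean of the observed values.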

Choosing the right evaluation metric depends on the specific context and objectives for modeling.
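The formulas above can be checked by hand on a tiny example. The four observed and predicted values below are made up for illustration, and the helper name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R^2 directly from their definitions."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    y_mean = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((yt - y_mean) ** 2 for yt in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, rmse, r2


y_true = [3.0, 5.0, 2.0, 7.0]  # made-up observations
y_pred = [2.5, 5.0, 3.0, 8.0]  # made-up predictions

mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE = {mae}, MSE = {mse}, RMSE = {rmse}, R^2 = {r2:.4f}")
```

Here the absolute errors are 0.5, 0, 1, and 1, which gives MAE = 0.625, MSE = 0.5625, and RMSE = 0.75. The two errors of size 1 contribute most of the MSE, which illustrates how squaring emphasizes larger errors.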

### **3.2.3. Linear Regression**

**Definition:**
Linear regression is a simple yet powerful statistical technique that models the relationship between a scalar dependent variable and one or more independent variables using a linear equation. In simple linear regression, the model takes the form:

$$
Y = \beta_0 + \beta_1X + \epsilon
$$

Where:
- $Y$ is the dependent (response) variable.
- $\beta_0$ is the y-intercept (constant term).
- $\beta_1$ is the coefficient for the independent variable $X$.
- $\epsilon$ is the error term.
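For a single predictor, the coefficients can be estimated in closed form. The sketch below assumes ordinary least squares, the estimation method behind scikit-learn's `LinearRegression`, for which $\hat{\beta}_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. The five data points and the helper name `fit_simple_ols` are made up for illustration.

```python
import math


def fit_simple_ols(x, y):
    """Closed-form least-squares estimates for y = beta0 + beta1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up predictor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up responses, roughly y = 2x

beta0, beta1 = fit_simple_ols(x, y)
print(f"intercept = {beta0:.3f}, slope = {beta1:.3f}")
```

For these values the estimated slope is about 1.99 and the intercept about 0.05, which is close to the line $y = 2x$ that the data were chosen to resemble.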

**Example of Linear Regression using Scikit-Learn:**
In this example, we will perform linear regression using the popular `scikit-learn` library in Python. We will create a synthetic dataset, fit a linear regression model, and evaluate its performance.

**Import Required Libraries:**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

**Create the dataset:**

In [None]:
# Create synthetic dataset
np.random.seed(0)
x = 2 * np.random.rand(100, 1)
y = 4 + 3 * x + 0.2*np.random.randn(100, 1)
plt.scatter(x, y)

**Split the dataset:**

In [None]:
# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

**Linear regression model:**

In [None]:
# Create and fit the model
model = LinearRegression()
model.fit(x_train, y_train)

**Evaluate model:**

In [None]:
# Make predictions
y_pred = model.predict(x_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print evaluation metrics
print(f'Mean Absolute Error: {mae}')
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

# Plot the results
fig = plt.figure()
plt.scatter(x_test, y_test, color='blue', label='Actual')
plt.plot(x_test, y_pred, color='red', label='Predicted')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression: Actual vs Predicted')
plt.legend()
plt.show()

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 1</b></p>

1. Load the file `Solubility.csv` and use the following columns as inputs for predicting the output (solubility) column:
   - AMW: average molecular weight
   - num_rings: number of rings
   - fraction_CSP3: fraction of $sp^3$-hybridized carbons
   - num_hba: number of hydrogen bond acceptors
   - num_hbd: number of hydrogen bond donors
   - logP: partition coefficient
   - TPSA: topological polar surface area
2. Split the dataset into training and testing sets (70/30 split).
3. Create a linear regression model using scikit-learn, fit the model to the training data, and make predictions on the testing data.
4. Evaluate your model using the metrics introduced above (MAE, MSE, and R²).
5. Visualize the actual vs predicted values on a scatter plot.

## **3.3. Classification**

Classification is a fundamental concept in statistical modeling and machine learning where the goal is to predict categorical outcomes based on input features. This section will delve into the fundamentals of classification, how to evaluate classification models, and a specific exploration of logistic regression.

### **3.3.1. Fundamentals of Classification**

Classification involves assigning a category label to a given input based on patterns learned from the training data. The process typically involves defining a model that maps the input features to one or more class labels.

Key concepts in classification include:
- **Classes**: The distinct categories or labels that the model aims to predict. For example, in a binary classification problem, the classes can be "0" or "1."
- **Predictors**: The input features utilized to classify observations. These inputs can be numerical or categorical.
- **Decision Boundary**: The boundary in feature space that separates the classes. The model assigns a class to an input according to the side of the boundary on which it falls.
  
Common classification tasks include spam detection (spam vs. not spam), medical diagnosis (disease vs. no disease), and sentiment analysis (positive vs. negative), making classification a widely used technique in various fields.

### **3.3.2. Classification Model Evaluation**

Assessing the performance of a classification model is crucial to understanding its effectiveness. Several metrics are used to evaluate classification models:

1. **Confusion Matrix**: A confusion matrix is a table that describes the performance of a classification model by comparing the predicted labels with the true labels. It provides insights into the types of errors made by the classifier. A common layout for a binary problem is shown below (scikit-learn's `confusion_matrix` instead orders rows and columns by sorted label, so for labels 0 and 1 it returns `[[TN, FP], [FN, TP]]`):
$$
   \begin{array}{|c|c|c|}
   \hline
   & \text{Predicted Positive} & \text{Predicted Negative} \\
   \hline
   \text{Actual Positive} & TP & FN \\
   \hline
   \text{Actual Negative} & FP & TN \\
   \hline
   \end{array}
$$
   
2. **Accuracy**: The ratio of the number of correct predictions to the total number of predictions made. It is one of the simplest evaluation metrics.
$$
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
$$

Where
- $TP$ = True Positives
- $TN$ = True Negatives
- $FP$ = False Positives
- $FN$ = False Negatives.

3. **Precision**: Measures the accuracy of positive predictions. It represents the proportion of true positive results in all positive predictions.
$$
Precision = \frac{TP}{TP + FP}
$$

4. **Recall (Sensitivity)**: Measures the ability of the model to identify all relevant instances. It assesses how many actual positives were captured by the model.
$$
Recall = \frac{TP}{TP + FN}
$$

5. **F1 Score**: The harmonic mean of precision and recall. It provides a single metric that balances both considerations, making it helpful when dealing with imbalanced classes.
$$
F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
$$

6. **Receiver Operating Characteristic (ROC) Curve** and **Area Under the Curve (AUC)**: The ROC curve plots the true positive rate against the false positive rate at various threshold settings. AUC quantifies the entire two-dimensional area underneath the ROC curve and gives an aggregate measure of performance across all classification thresholds.

Choosing appropriate metrics depends on the specific use case and the consequences of false positives and false negatives.
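The threshold-based metrics above follow directly from the four confusion-matrix counts. The sketch below computes them for ten made-up labels; the helper names `confusion_counts` and `classification_metrics` are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1; a zero denominator yields 0.0."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return accuracy, precision, recall, f1


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # made-up true labels
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # made-up predictions

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy, precision, recall, f1 = classification_metrics(tp, tn, fp, fn)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"accuracy={accuracy}, precision={precision}, recall={recall}, F1={f1}")
```

With one false positive and one false negative among ten predictions, accuracy is 0.8 while precision, recall, and F1 are each 0.75, which shows that accuracy alone can look better than the performance on the positive class.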

### **3.3.3. Logistic Regression**

**Definition:**
Logistic regression is a statistical method used for binary classification problems, which predicts the probability that an instance belongs to a particular class. It models the relationship between the dependent binary variable and one or more independent variables using the logistic function. The logistic function outputs values between 0 and 1, which can be interpreted as probabilities.

The logistic regression model is given by:
$$
P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X)}}
$$

Where:
- $P(Y=1|X)$ is the probability that the dependent variable $Y$ equals 1 (i.e., belongs to class 1) given the input features $X$.
- $\beta_0$ is the intercept and $\beta_1$ is the coefficient representing the direction and strength of the relationship with the independent variable $X$. With several independent variables, each has its own coefficient.
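The model formula can be evaluated directly to see how a probability becomes a class label. In the sketch below, the coefficient values $\beta_0 = -3$ and $\beta_1 = 1.5$ and the 0.5 threshold are assumptions chosen for illustration, not fitted values.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


beta0, beta1 = -3.0, 1.5  # hypothetical coefficients, not estimated from data


def predict_proba(x):
    """P(Y = 1 | X = x) under the single-predictor logistic model."""
    return logistic(beta0 + beta1 * x)


def predict_class(x, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return int(predict_proba(x) >= threshold)


# With a 0.5 threshold, the decision boundary is where beta0 + beta1 * x = 0.
boundary = -beta0 / beta1

for x in (0.0, 2.0, 4.0):
    print(f"x = {x}: P(Y=1|X) = {predict_proba(x):.3f}, class = {predict_class(x)}")
print(f"decision boundary at x = {boundary}")
```

For these coefficients the predicted probability is exactly 0.5 at $x = 2$, so inputs above 2 are assigned to class 1 and inputs below 2 to class 0.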

**Example of Logistic Regression using Scikit-Learn:**
In this example, we will perform logistic regression using the `scikit-learn` library in Python. We will use the Breast Cancer Wisconsin dataset bundled with scikit-learn, which labels tumors as malignant or benign.

**Import Required Libraries:**

In [None]:
import matplotlib.pyplot as plt  # needed for the ROC plot below
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc

**Load the dataset:**

In [None]:
breast_cancer = load_breast_cancer()
x = breast_cancer.data  # Features
# In this dataset, target 0 is malignant and 1 is benign;
# recode so that malignant is the positive class (1)
y = (breast_cancer.target == 0).astype(int)

**Split the dataset:**

In [None]:
# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

**Logistic regression model:**

In [None]:
# Create and fit the model
model = LogisticRegression(max_iter=5000)
model.fit(x_train, y_train)

**Evaluate model:**

In [None]:
# Make predictions
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1]  # Probability estimates for the positive class

# Evaluate the model
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Print evaluation metrics
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')
print(f'Confusion Matrix:\n{conf_matrix}')

# Plot ROC Curve
plt.figure()
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC Curve (AUC = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 2</b></p>

1. Load the file `BBBP.csv`, a dataset for predicting blood-brain barrier penetration. The output column is `p_np` and all other columns are used as inputs.
2. Perform a logistic regression analysis to predict a binary outcome.
3. Split the dataset into training and testing sets (e.g., 80/20 split).
4. Fit a logistic regression model using scikit-learn and make predictions on the testing data.
5. Evaluate your model using accuracy, precision, recall, and F1 score.
6. Visualize the confusion matrix using a heatmap for better interpretation of results.