# Machine Learning Morphisms: Mathematical Formulation

## 1. Morphism Space Definition
Let $\mathcal{M} = \{S, X, Y\}$ be the morphism space, where:

- $S$: Sample Space
- $X$: Feature Space
- $Y$: Target Space

### 1.1 Morphism Transformation Operator
Define the morphism transformation operator $\mathcal{M}: S \to Y$ as:

$$\mathcal{M}(s) = \bigcup_{i=1}^{k} F_i(s)$$

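Where $F_1, \dots, F_k$ are the individual transformation functions applied to a sample $s \in S$. The symbol $\mathcal{M}$ denotes both the morphism space and this operator.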

---

## 2. Data Preprocessing Morphism $\mathcal{M}_1$

### 2.1 Preprocessing Transformation
$$\mathcal{M}_1: S \to X$$

$$X = \{\xi \mid \xi = f_{prep}(s), s \in S\}$$

Preprocessing functions:

- Null value handling: $f_{null}(s)$
- Outlier detection: $f_{outlier}(s)$
  - IQR-based outlier removal
  - Mahalanobis distance-based outlier detection
  - Minimum Covariance Determinant (MCD) outlier detection
  - Time series decomposition-based outlier removal
- Standardization: $f_{std}(s)$

The preprocessing pipeline:

$$f_{prep}(s) = f_{std}(f_{outlier}(f_{null}(s)))$$

---

### 2.2 Outlier Detection Methods

#### 2.2.1 IQR-based Outlier Removal
Outliers are identified using the Interquartile Range (IQR):

1. Compute the first ($Q_1$) and third quartiles ($Q_3$) of the data.
2. Calculate the IQR:
   $$
   IQR = Q_3 - Q_1
   $$
3. Define lower and upper bounds:
   $$
   \text{Lower Bound} = Q_1 - 1.5 \cdot IQR
   $$
   $$
   \text{Upper Bound} = Q_3 + 1.5 \cdot IQR
   $$
4. Identify outliers:
   $$
   x \text{ is an outlier if } x < \text{Lower Bound or } x > \text{Upper Bound.}
   $$
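The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the pipeline's actual code; the function name `iqr_outlier_mask`, the parameter `k`, and the sample values are assumptions made for the example.

```python
import numpy as np

def iqr_outlier_mask(x, k=1.5):
    """Return a boolean mask that is True where x lies outside the IQR fences."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])          # first and third quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr    # lower and upper bounds
    return (x < lower) | (x > upper)

# Illustrative data: one reading far above the rest
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])
mask = iqr_outlier_mask(values)
cleaned = values[~mask]                          # keep only the inliers
```

Returning a mask rather than the filtered array keeps the row positions available, which matters when several features must be filtered consistently.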



#### 2.2.2 Mahalanobis Distance-based Outlier Detection
Mahalanobis distance is used to measure the distance of a point from the center of a multivariate distribution:

1. Compute the mean vector $\mu$ and covariance matrix $\Sigma$ of the dataset.
2. Calculate the Mahalanobis distance for each data point $x$:
   $$
   D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}
   $$
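3. Define a threshold to classify outliers. For approximately Gaussian data with $p$ features, $D_M^2$ follows a Chi-squared distribution with $p$ degrees of freedom, so a common choice is $\text{Threshold} = \sqrt{\chi^2_{p,\,1-\alpha}}$ (e.g., $\alpha = 0.025$):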
   $$
   D_M(x) > \text{Threshold} \implies x \text{ is an outlier.}
   $$
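A minimal NumPy sketch of these steps, under stated assumptions: the data are synthetic two-dimensional Gaussian samples with one planted outlier, and the threshold uses the closed-form Chi-squared quantile that exists for two degrees of freedom. The names `mahalanobis_distances` and `alpha` are illustrative.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of every row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), evaluated row by row
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [8.0, -8.0]                        # planted outlier (assumption for the demo)

d = mahalanobis_distances(X)
alpha = 0.025
# For 2 degrees of freedom the Chi-squared quantile is -2 * ln(alpha)
threshold = np.sqrt(-2.0 * np.log(alpha))
outliers = np.flatnonzero(d > threshold)
```

For other dimensions the quantile has no such closed form and would normally come from a statistics library.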



#### 2.2.3 Minimum Covariance Determinant (MCD) Outlier Detection
The MCD method is a robust estimator of covariance and location:

1. Compute the MCD location $\mu_{MCD}$ and covariance $\Sigma_{MCD}$ from the data.
2. Calculate the Mahalanobis distance using MCD estimators:
   $$
   D_{MCD}(x) = \sqrt{(x - \mu_{MCD})^T \Sigma_{MCD}^{-1} (x - \mu_{MCD})}
   $$
3. Identify outliers:
   $$
   D_{MCD}(x) > \text{Threshold} \implies x \text{ is an outlier.}
   $$



#### 2.2.4 Time Series Decomposition-based Outlier Removal
Outliers in time series data are identified by decomposing the data into trend, seasonality, and residuals:

1. Decompose the time series:
   $$
   x_t = T_t + S_t + R_t
   $$
   Where:
   - $T_t$: Trend component
   - $S_t$: Seasonal component
   - $R_t$: Residual (random) component
2. Identify outliers in the residual component:
   $$
   R_t \text{ is an outlier if } |R_t| > k \cdot \sigma(R)
   $$
   Where $\sigma(R)$ is the standard deviation of the residuals and $k$ is a chosen threshold (e.g., $k=3$ for 3 standard deviations).
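The decomposition and residual rule can be sketched with a classical additive decomposition. This is only one possible way to obtain $T_t$, $S_t$, and $R_t$: the trend is a centered moving average over one period (an odd period is assumed), the seasonal component is the per-phase mean of the detrended series, and the function name `residual_outliers` and the synthetic series are illustrative.

```python
import numpy as np

def residual_outliers(x, period, k=3.0):
    """Flag points whose residual exceeds k standard deviations.

    Assumes an odd `period` so the moving-average window is symmetric.
    """
    x = np.asarray(x, dtype=float)
    half = period // 2
    padded = np.pad(x, half, mode="edge")                  # soften edge effects
    trend = np.convolve(padded, np.ones(period) / period, mode="valid")
    detrended = x - trend
    # Seasonal component: mean of the detrended values at each phase
    seasonal = np.array([detrended[p::period].mean() for p in range(period)])
    resid = detrended - seasonal[np.arange(x.size) % period]
    return np.abs(resid) > k * resid.std()

t = np.arange(70)
series = 0.1 * t + 2.0 * np.sin(2.0 * np.pi * t / 7)       # trend + weekly cycle
series[30] += 10.0                                         # planted spike
flagged = residual_outliers(series, period=7)
```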



### 2.3 Standardization Function
$$f_{std}(x) = \frac{x - \mu(X)}{\sigma(X)}$$

Where:
- $\mu(X)$ is the mean of the feature $X$.
- $\sigma(X)$ is the standard deviation of the feature $X$.




---

## 3. Feature Engineering Morphism $\mathcal{M}_2$

### 3.1 Feature Transformation
$$\mathcal{M}_2: X \to X'$$

$$X' = \{\xi' \mid \xi' = f_{eng}(\xi), \xi \in X\}$$

Feature engineering operations:

- **Log transformation**: $f_{log}(x)$
- **Correlation analysis**: $f_{corr}(X)$
- **Feature selection**: $f_{select}(X)$
- **One-hot encoding for categorical variables**: $f_{onehot}(X)$

The feature engineering pipeline:

$$f_{eng}(x) = f_{select}(f_{corr}(f_{log}(f_{onehot}(x))))$$



### 3.2 Correlation Matrix Computation
The correlation matrix $\rho$ measures the pairwise relationships between features:

$$\rho_{i,j} = \frac{Cov(X_i, X_j)}{\sigma(X_i)\sigma(X_j)}$$

Where:
- $Cov(X_i, X_j)$ is the covariance between features $X_i$ and $X_j$.
- $\sigma(X_i)$ and $\sigma(X_j)$ are the standard deviations of $X_i$ and $X_j$, respectively.



### 3.3 Feature Selection Based on High Correlation
Highly correlated independent features introduce redundancy. For each pair of features whose absolute correlation coefficient exceeds a threshold, $|\rho_{i,j}| > \tau$, one of the two features is removed:

1. Compute the correlation matrix $\rho$ for all independent features.
2. Define a threshold $\tau$ (e.g., $\tau = 0.85$).
3. For each pair $(X_i, X_j)$ with $i \neq j$ where $|\rho_{i,j}| > \tau$:
   - Remove one of the features $X_i$ or $X_j$, typically based on domain knowledge or variance contribution.

Mathematically, if the later feature of each correlated pair is the one removed:
$$
X' = \{X_i \in X \mid \forall j < i, \ |\rho_{i,j}| \leq \tau\}
$$
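As a rough sketch of correlation-based selection, the function below drops the later feature (in column order) of every pair whose absolute correlation exceeds the threshold. That tie-breaking rule is an assumption made for the example; the text also allows domain knowledge or variance contribution to decide. The name `select_uncorrelated` and the synthetic columns are illustrative.

```python
import numpy as np

def select_uncorrelated(X, tau=0.85):
    """Return indices of columns kept after dropping the later feature
    of every pair whose absolute correlation exceeds tau."""
    X = np.asarray(X, dtype=float)
    rho = np.abs(np.corrcoef(X, rowvar=False))   # |rho_ij| for all column pairs
    keep = []
    for i in range(X.shape[1]):
        if all(rho[i, j] <= tau for j in range(i)):
            keep.append(i)
    return keep

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + 0.01 * rng.normal(size=500)   # near-duplicate of a
c = rng.normal(size=500)              # independent feature
kept = select_uncorrelated(np.column_stack([a, b, c]), tau=0.85)
```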



### 3.4 One-Hot Encoding for Categorical Variables
Categorical variables are transformed into numerical representations using one-hot encoding:

1. For a categorical variable $C$ with $k$ unique categories $\{c_1, c_2, \dots, c_k\}$:
   - Create $k$ binary features $\{C_1, C_2, \dots, C_k\}$, where:
     $$
     C_j = 
     \begin{cases} 
     1 & \text{if } C = c_j \\
     0 & \text{otherwise}
     \end{cases}
     $$
2. Replace $C$ in the dataset with the binary features $\{C_1, C_2, \dots, C_k\}$.

Let $X_{cat}$ represent the set of categorical features in $X$. The transformed dataset after one-hot encoding is:
$$
f_{onehot}(X) = (X \setminus X_{cat}) \cup \bigcup_{C \in X_{cat}} \{C_1, C_2, \dots, C_k\}
$$
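A small pure-Python sketch of the encoding for a single categorical variable. The function name `one_hot` and the weather labels are illustrative assumptions; categories are sorted only to make the column order deterministic.

```python
def one_hot(values):
    """Encode a list of category labels as binary indicator rows."""
    categories = sorted(set(values))                 # c_1, ..., c_k
    index = {c: j for j, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                            # C_j = 1 iff C = c_j
        rows.append(row)
    return categories, rows

categories, encoded = one_hot(["rain", "sun", "rain", "snow"])
```

Each row contains exactly one 1, so the $k$ indicator columns of one variable always sum to 1.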



### 3.5 Final Feature Engineering Pipeline
The complete feature engineering process:
1. Apply one-hot encoding for categorical variables:
   $$
   X \to f_{onehot}(X)
   $$
2. Apply log transformation to numerical features to reduce skewness:
   $$
   X \to f_{log}(X)
   $$
3. Compute the correlation matrix of the resulting features:
   $$
   X \to f_{corr}(X)
   $$
4. Remove one feature from each pair with $|\rho_{i,j}| > \tau$, retaining the most relevant features for modeling:
   $$
   X \to f_{select}(X)
   $$

The resulting feature set is:
$$
X' = f_{eng}(X)
$$

---

## 4. Time Series Stationarity Morphism $\mathcal{M}_3$

### 4.1 Stationarity Transformation
$$\mathcal{M}_3: X' \to X''$$

$$X'' = \{\xi'' \mid \xi'' = f_{stat}(\xi'), \xi' \in X'\}$$

Stationarity tests, transformations, and diagnostics:

- **Augmented Dickey-Fuller (ADF) test**: $f_{adf}(x)$
- **Differencing**: $f_{diff}(x)$
- **Seasonal decomposition**: $f_{seasonal}(x)$
- **Autocorrelation Function (ACF) plot**: $f_{acf}(x)$
- **Partial Autocorrelation Function (PACF) plot**: $f_{pacf}(x)$

$$
f_{stat}(x') = 
\begin{cases} 
x' & \text{if } f_{adf}(x') \text{ rejects the unit-root null hypothesis (stationary)} \\
f_{diff}(x') & \text{otherwise}
\end{cases}
$$

### 4.2 ACF and PACF Plot Computation
To assess autocorrelation and partial autocorrelation, the following plots are generated:

- **ACF Plot**:
  $$ACF(lag) = \frac{\sum_{t=1}^{T-lag} (x_t - \mu)(x_{t+lag} - \mu)}{\sum_{t=1}^{T} (x_t - \mu)^2}$$

- **PACF Plot**:
  The PACF at lag $k$ is calculated by fitting autoregressive models of order $k$ and examining the correlation between $x_t$ and $x_{t-k}$ after removing the effects of the intermediate lags $1, 2, \dots, k-1$.

These plots are used to:
1. Identify the order of differencing needed (if any).
2. Determine the parameters $p$ (AR order) and $q$ (MA order) for SARIMA modeling.
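The ACF formula can be evaluated directly with NumPy. The sketch below is illustrative (the function name `acf` and the periodic test series are assumptions) and covers only the ACF; the PACF requires fitting autoregressive models and is not shown.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag, following the formula above."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[: x.size - lag] * centered[lag:]) / denom
        for lag in range(max_lag + 1)
    ])

# A series with period 4: strong positive ACF at lag 4, negative at lag 2
x = np.tile([1.0, 0.0, -1.0, 0.0], 25)
r = acf(x, max_lag=4)
```

Because the numerator sums over $T - lag$ terms while the denominator sums over all $T$, the estimate shrinks slightly toward zero as the lag grows.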

---



## 5. Model Training Morphism $\mathcal{M}_4$

### 5.1 Model Representation
$$
\mathcal{M}_4: X'' \to \mathcal{M}, \quad M \in \mathcal{M}
$$



Where $\mathcal{M}$ here denotes the set of candidate models (not the morphism space of Section 1) and $M$ is one of:

1. **Baseline Model**: $M_{SARIMAX}$
2. **1D-CNN**: $M_{1D-CNN}$
3. **LSTM**: $M_{LSTM}$
4. **LSTM with Attention**: $M_{LSTM-Attention}$

---

### 5.1.1 Baseline Model: SARIMAX ($M_{SARIMAX}$)

The **Seasonal Autoregressive Integrated Moving Average with Exogenous Regressors (SARIMAX)** model includes both time series components and exogenous features $Z_t$ (e.g., wind speed, humidity, cloud cover). The model is represented as:

$$
\Phi_p(B) \Phi_P(B^s) (1 - B)^d (1 - B^s)^D y_t = \Theta_q(B) \Theta_Q(B^s) \varepsilon_t + \beta Z_t
$$

- **Components**:
  - $B$: Backshift operator, where $B y_t = y_{t-1}$.
  - $\Phi_p(B)$: Non-seasonal AR polynomial of order $p$:
    $$
    \Phi_p(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p
    $$
  - $\Theta_q(B)$: Non-seasonal MA polynomial of order $q$:
    $$
    \Theta_q(B) = 1 + \theta_1 B + \theta_2 B^2 + \cdots + \theta_q B^q
    $$
  - $\Phi_P(B^s)$: Seasonal AR polynomial of order $P$ in $B^s$.
  - $\Theta_Q(B^s)$: Seasonal MA polynomial of order $Q$ in $B^s$.
  - $d$: Non-seasonal differencing order.
  - $D$: Seasonal differencing order.
  - $s$: Seasonality period (e.g., 12 for monthly data).
  - $Z_t$: Exogenous predictors at time $t$ (e.g., weather variables such as temperature, wind speed, etc.).
  - $\beta$: Coefficient vector for the exogenous predictors.
  - $\varepsilon_t$: White noise at time $t$.

#### **Objective for SARIMAX**:
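For fixed orders $(p, d, q)$ and $(P, D, Q, s)$, SARIMAX estimates the AR and MA coefficients together with the exogenous coefficients $\beta$. Written as a least-squares objective, it minimizes the residual sum of squares (implementations commonly maximize a Gaussian likelihood instead):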

$$
\mathcal{L}_{SARIMAX} = \sum_{t=1}^n (y_t - \hat{y}_t)^2
$$

#### **Diagnostics and Parameter Selection**:
1. Use **ACF** and **PACF plots** for $p$ and $q$ selection.
2. Use feature importance analysis for selecting relevant exogenous predictors $Z_t$.

---

### 5.1.2 1D-CNN ($M_{1D-CNN}$)

The **1D-CNN** model for weather prediction captures local temporal patterns in the data. Its forward pass is defined as:

$$
h^{(l)} = f(W^{(l)} * h^{(l-1)} + b^{(l)})
$$

- **Components**:
  - $h^{(l-1)}$: Input tensor from the previous layer, with $h^{(0)} = X$ the input sequence.
  - $W^{(l)}$: Convolutional kernel (filter).
  - $*$: Convolution operator.
  - $b^{(l)}$: Bias term for layer $l$.
  - $f$: Activation function (e.g., ReLU or tanh).
  - $h^{(l)}$: Output tensor of layer $l$.

The CNN is trained to minimize the Mean Squared Error (MSE):

$$
\mathcal{L}_{1D-CNN} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$
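To make the forward pass concrete, here is a minimal NumPy sketch of one 1D convolutional layer with ReLU activation. It assumes "valid" padding, stride 1, and the cross-correlation convention commonly used in deep learning frameworks; the array shapes and the name `conv1d_layer` are illustrative, not taken from the original model.

```python
import numpy as np

def conv1d_layer(h_prev, W, b):
    """One 1D convolution layer followed by ReLU.

    h_prev: (T, C_in) input sequence
    W:      (K, C_in, C_out) kernel
    b:      (C_out,) bias
    """
    K = W.shape[0]
    T_out = h_prev.shape[0] - K + 1              # "valid" padding, stride 1
    out = np.empty((T_out, W.shape[2]))
    for t in range(T_out):
        window = h_prev[t : t + K]               # (K, C_in) slice of the input
        out[t] = np.tensordot(window, W, axes=([0, 1], [0, 1])) + b
    return np.maximum(out, 0.0)                  # ReLU activation

h0 = np.arange(5, dtype=float).reshape(5, 1)     # single-channel toy input
W1 = np.ones((2, 1, 1))                          # kernel that sums adjacent steps
b1 = np.array([-4.0])
h1 = conv1d_layer(h0, W1, b1)
```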


---

### 5.1.3 LSTM ($M_{LSTM}$)

The **Long Short-Term Memory (LSTM)** model is used to capture long-term dependencies in the weather data. Its equations are:

1. Forget gate:
   $$
   f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
   $$
2. Input gate:
   $$
   i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
   $$
   $$
   \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
   $$
3. Cell state update:
   $$
   C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
   $$
4. Output gate:
   $$
   o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
   $$
   $$
   h_t = o_t \odot \tanh(C_t)
   $$

The LSTM model is trained by minimizing the Mean Absolute Error (MAE):

$$
\mathcal{L}_{LSTM} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|
$$
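The gate equations translate almost line by line into NumPy. The sketch below performs a single time step; the parameter dictionary layout and the name `lstm_step` are assumptions for illustration, and each weight matrix has shape (hidden, hidden + input) to match the concatenation $[h_{t-1}, x_t]$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])     # candidate state
    C_t = f_t * C_prev + i_t * C_tilde                       # cell state update
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

hidden, n_in = 3, 2
# All-zero parameters: every gate evaluates to 0.5 and the candidate to 0
params = {f"W_{g}": np.zeros((hidden, hidden + n_in)) for g in "fiCo"}
params.update({f"b_{g}": np.zeros(hidden) for g in "fiCo"})
h_t, C_t = lstm_step(np.ones(n_in), np.zeros(hidden), np.ones(hidden), params)
```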

---

### 5.1.4 LSTM with Attention ($M_{LSTM-Attention}$)

The **LSTM with Attention** model improves predictions by focusing on the most relevant time steps in the sequence. The attention mechanism is defined as:

1. Compute attention scores:
   $$
   e_t = \text{score}(h_t, s)
   $$
   where $\text{score}$ is an alignment function (e.g., a dot product) and $s$ is a query vector, for example the final LSTM hidden state.

2. Normalize scores using softmax:
   $$
   \alpha_t = \frac{\exp(e_t)}{\sum_{t'} \exp(e_{t'})}
   $$

3. Compute context vector:
   $$
   c = \sum_t \alpha_t h_t
   $$

4. Combine context vector and LSTM output:
   $$
   \hat{y}_t = f(c, h_t)
   $$

The attention-enhanced predictions are evaluated using Mean Squared Error (MSE):

$$
\mathcal{L}_{LSTM-Attention} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$
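Steps 1 to 3 of the attention mechanism can be sketched as below, assuming a dot-product score and a query vector supplied by the caller. The final combination step $f(c, h_t)$ is model-specific and is not shown; the name `attention_context` and the toy hidden states are illustrative.

```python
import numpy as np

def attention_context(H, query):
    """Dot-product attention over hidden states.

    H:     (T, d) hidden states h_1..h_T
    query: (d,) query vector (assumed; e.g. the last hidden state)
    """
    e = H @ query                        # attention scores e_t
    e = e - e.max()                      # stabilise the softmax numerically
    alpha = np.exp(e) / np.exp(e).sum()  # normalised weights
    c = alpha @ H                        # context vector
    return alpha, c

H = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
alpha_uniform, c_uniform = attention_context(H, np.zeros(2))      # all scores equal
alpha_peaked, c_peaked = attention_context(H, np.array([10.0, 10.0]))
```

With a zero query every score is equal, so the context vector reduces to the plain average of the hidden states; a query aligned with one state concentrates nearly all weight on it.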

---

### 5.2 Loss Function
Unless a model section states otherwise (the LSTM in Section 5.1.3 is trained with MAE), the loss function is the Mean Squared Error:

$$
\mathcal{L}(M) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

Each model computes $\hat{y}_i$ with its own architecture and feature representation.

---

## 6. Hyperparameter Tuning Morphism $\mathcal{M}_6$

### 6.1 Tuning with Optuna
Optuna is a hyperparameter optimization framework that searches the hyperparameter space $\Theta$, using the results of earlier trials to guide later ones, to minimize the objective function $\mathcal{L}(M)$:

$$
\theta^* = \underset{\theta \in \Theta}{\arg\min} \mathcal{L}(M(\theta))
$$

Optuna works internally using the following steps:

---

#### 6.1.1 Trial and Objective Function Morphism
Each **trial** represents a single evaluation of the objective function $\mathcal{L}(M)$ for a given set of hyperparameters $\theta$. Let the trial space $\mathcal{T}$ be:

$$
\mathcal{T} = \{\tau_1, \tau_2, \dots, \tau_k\}
$$

For each trial $\tau_i$, Optuna performs:
1. **Sampling**: Generate a candidate hyperparameter set $\theta_i$ from the search space $\Theta$:
   $$
   \theta_i \sim P(\Theta)
   $$
   Where $P(\Theta)$ is the sampling distribution over $\Theta$ defined by the search strategy (Section 6.1.2).
2. **Evaluation**: Evaluate the model $M$ using the sampled hyperparameters:
   $$
   \mathcal{L}(M(\theta_i)) = \frac{1}{n}\sum_{j=1}^n (y_j - \hat{y}_j)^2
   $$
3. **Storage**: Store the results of the trial $(\theta_i, \mathcal{L}(M(\theta_i)))$ in the study history.
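The sample, evaluate, and store loop can be illustrated without Optuna itself. The following pure-Python sketch uses uniform random sampling and a toy objective; it is not Optuna's implementation, and the names `run_random_search`, `toy_objective`, and the `lr` hyperparameter are hypothetical.

```python
import random

def run_random_search(objective, search_space, n_trials, seed=0):
    """Sample, evaluate, and store trials; return the best one and the history."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_trials):
        # Sampling: draw each hyperparameter uniformly from its (low, high) range
        theta = {name: rng.uniform(lo, hi) for name, (lo, hi) in search_space.items()}
        # Evaluation and storage
        history.append((theta, objective(theta)))
    best_theta, best_loss = min(history, key=lambda record: record[1])
    return best_theta, best_loss, history

def toy_objective(theta):
    # Stand-in for L(M(theta)); minimised at lr = 0.1 (assumption for the demo)
    return (theta["lr"] - 0.1) ** 2

best_theta, best_loss, history = run_random_search(
    toy_objective, {"lr": (0.0, 1.0)}, n_trials=50
)
```

In a real study the objective would train the model and return a validation loss, and a guided sampler would replace the uniform draw.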

---

#### 6.1.2 Search Strategy Morphism
Optuna selects hyperparameters with one of the following sampling strategies, only some of which use the results of previous trials:

1. **Random Search**:
   $$
   \theta \sim \text{Uniform}(\Theta)
   $$
   This explores the search space randomly, with no reliance on previous results.

2. **Bayesian Optimization**:
   Optuna's default sampler is the **Tree-structured Parzen Estimator (TPE)**. Instead of modeling $P(\mathcal{L} \mid \theta)$ directly, TPE models $P(\theta \mid \mathcal{L})$ with two densities separated by a threshold $\mathcal{L}^*$:
   - $l(\theta) = P(\theta \mid \mathcal{L} < \mathcal{L}^*)$: Density of hyperparameters from the better-performing trials.
   - $g(\theta) = P(\theta \mid \mathcal{L} \geq \mathcal{L}^*)$: Density of hyperparameters from the remaining trials.

   The next hyperparameter $\theta_{next}$ is selected to maximize the expected improvement (EI):
   $$
   \theta_{next} = \underset{\theta}{\arg\max} \ \text{EI}(\theta)
   $$
   Where:
   $$
   \text{EI}(\theta) = \int_{-\infty}^{\mathcal{L}^*} (\mathcal{L}^* - \mathcal{L}) \cdot P(\mathcal{L} \mid \theta) \, d\mathcal{L}
   $$
   and $\mathcal{L}^*$ is a quantile of the observed objective values rather than the single best value. Under the TPE model, maximizing $\text{EI}(\theta)$ is equivalent to maximizing the ratio $l(\theta) / g(\theta)$.

3. **Grid Search**: Optuna also provides a grid sampler (`GridSampler`) that exhaustively evaluates all combinations within a discretized $\Theta$.

---

#### 6.1.3 Pruning Morphism
Optuna implements pruning to terminate underperforming trials early. Let $\mathcal{T}_{active} \subseteq \mathcal{T}$ represent active trials. For each trial $\tau_i$:
1. Report the intermediate objective value at training step $m$ (e.g., the validation loss after epoch $m$, computed over $n_{val}$ validation samples):
   $$
   \mathcal{L}_m(\tau_i) = \frac{1}{n_{val}} \sum_{j=1}^{n_{val}} (y_j - \hat{y}_j^{(m)})^2
   $$
2. Compare with earlier trials at the same step. The exact rule depends on the configured pruner; with a median rule (as in Optuna's `MedianPruner`):
   $$
   \tau_i \text{ is pruned if } \mathcal{L}_m(\tau_i) > \operatorname{median}\{\mathcal{L}_m(\tau_j) : j < i\}.
   $$
3. Remove pruned trials from $\mathcal{T}_{active}$:
   $$
   \mathcal{T}_{active} = \mathcal{T}_{active} \setminus \{\tau_i \mid \text{pruned}\}.
   $$
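As a rough illustration of a pruning decision, the sketch below applies a simplified median rule: a trial is stopped when its intermediate loss at a given step is worse than the median of earlier trials at the same step. This is an assumption-laden simplification in the spirit of Optuna's `MedianPruner`, which also has startup and warm-up settings that are ignored here; the function name `should_prune` is hypothetical.

```python
from statistics import median

def should_prune(current_value, previous_values_at_step):
    """Simplified median rule for a minimisation objective.

    current_value:            intermediate loss of the running trial at this step
    previous_values_at_step:  intermediate losses of earlier trials at the same step
    """
    if not previous_values_at_step:
        return False                      # nothing to compare against yet
    return current_value > median(previous_values_at_step)

earlier = [0.5, 0.6, 0.7]                 # illustrative losses at the same step
prune_bad = should_prune(0.9, earlier)    # worse than the median 0.6
prune_good = should_prune(0.55, earlier)  # better than the median
prune_first = should_prune(1.0, [])       # first trial is never pruned
```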

---

#### 6.1.4 Final Optimization Morphism
After $k$ trials, Optuna selects the best hyperparameters $\theta^*$ based on the study history:

$$
\theta^* = \underset{\theta \in \{\theta_1, \dots, \theta_k\}}{\arg\min} \mathcal{L}(M(\theta))
$$

---

### 6.2 Summary of Optuna's Internal Process
1. **Trial Generation**:
   - Random or guided sampling (e.g., TPE) is used to select $\theta$.
2. **Evaluation**:
   - The objective function $\mathcal{L}(M(\theta))$ is computed for the sampled $\theta$.
3. **Pruning**:
   - Poor trials are terminated early based on intermediate performance.
4. **Selection**:
   - The best hyperparameters $\theta^*$ are chosen after $k$ trials.

This adaptive process lets Optuna find well-performing hyperparameters for the model $M$ within a fixed trial budget, although it does not guarantee a global optimum.


### 6.3 Time Series Cross-Validation
To evaluate model performance during tuning, **time series cross-validation** is applied using Scikit-learn's `TimeSeriesSplit`. This ensures that temporal dependencies are respected.

1. **Splitting Mechanism**:
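   Let $T$ be the total time steps, $t$ be the initial training window size, $v$ be the validation window size, and $k$ be the number of splits. In `TimeSeriesSplit`, $k$ corresponds to `n_splits` and $v$ to `test_size`, so the initial window is determined as $t = T - kv$.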

   For each split $i$:
   - Training set:
     $$
     T_{\text{train}}^{(i)} = \{1, 2, \dots, t + (i-1)v\}
     $$
   - Validation set:
     $$
     T_{\text{val}}^{(i)} = \{t + (i-1)v + 1, \dots, t + iv\}
     $$

2. **Example**:
   If $T = 100$, $t = 60$, $v = 10$, and $k = 4$, the splits are:
   - Split 1:
     $T_{\text{train}}^{(1)} = \{1, 2, \dots, 60\}$, $T_{\text{val}}^{(1)} = \{61, \dots, 70\}$
   - Split 2:
     $T_{\text{train}}^{(2)} = \{1, 2, \dots, 70\}$, $T_{\text{val}}^{(2)} = \{71, \dots, 80\}$
   - Split 3:
     $T_{\text{train}}^{(3)} = \{1, 2, \dots, 80\}$, $T_{\text{val}}^{(3)} = \{81, \dots, 90\}$
   - Split 4:
     $T_{\text{train}}^{(4)} = \{1, 2, \dots, 90\}$, $T_{\text{val}}^{(4)} = \{91, \dots, 100\}$

3. **Loss Calculation**:
   The cross-validation loss is computed as the average over all splits:
   $$
   \mathcal{L}_{CV} = \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}^{(i)}
   $$
   where $\mathcal{L}^{(i)}$ is the loss for the $i$-th split.
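The splitting formulas can be checked with a few lines of pure Python. This sketch reproduces the expanding-window scheme with 1-indexed time steps, taking the initial window as $t = T - kv$; it is an illustration and not scikit-learn's code, and the name `expanding_window_splits` is hypothetical. In scikit-learn, `TimeSeriesSplit(n_splits=4, test_size=10)` on 100 samples should give the same windows with 0-indexed positions.

```python
def expanding_window_splits(T, v, k):
    """Return k (train, validation) ranges of 1-indexed time steps."""
    t = T - k * v                          # initial training window size
    if t <= 0:
        raise ValueError("not enough time steps for the requested splits")
    splits = []
    for i in range(1, k + 1):
        train_end = t + (i - 1) * v
        train = range(1, train_end + 1)                    # {1, ..., t + (i-1)v}
        val = range(train_end + 1, train_end + v + 1)      # {t + (i-1)v + 1, ..., t + iv}
        splits.append((train, val))
    return splits

splits = expanding_window_splits(T=100, v=10, k=4)
bounds = [(tr[0], tr[-1], va[0], va[-1]) for tr, va in splits]

# Cross-validation loss: average of per-split losses (illustrative values)
fold_losses = [0.40, 0.35, 0.30, 0.35]
cv_loss = sum(fold_losses) / len(fold_losses)
```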

---

This morphism describes how hyperparameter tuning and evaluation are performed efficiently using Optuna and time series cross-validation.






## 7. Model Evaluation Morphism $\mathcal{M}_7$

### 7.1 Performance Metrics

**Mean Squared Error (MSE):**

$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

**Root Mean Squared Error (RMSE):**

$$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

**Mean Absolute Error (MAE):**

$$MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$

**Mean Absolute Percentage Error (MAPE):**

$$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100$$

Where:
- $y_i$: Actual value at observation $i$ (MAPE requires $y_i \neq 0$ for every observation).
- $\hat{y}_i$: Predicted value at observation $i$.
- $n$: Number of observations.
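The four metrics can be computed together in a few lines. This is a minimal NumPy sketch; the function name `regression_metrics` and the sample values are illustrative, and the function raises an error when MAPE is undefined because a true value is zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and MAPE (in percent)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        raise ValueError("MAPE is undefined when a true value is zero")
    err = y_true - y_pred
    mse = float(np.mean(err ** 2))
    return {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "mae": float(np.mean(np.abs(err))),
        "mape": float(np.mean(np.abs(err / y_true)) * 100.0),
    }

metrics = regression_metrics([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
```

RMSE and MAE are in the units of the target, while MAPE is scale-free, which is why the two kinds of metric are usually reported side by side.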

---

### 7.2 Model Selection Criteria

**Akaike Information Criterion (AIC):**

The AIC is used to evaluate model fit while penalizing model complexity:

$$AIC = 2k - 2 \ln(L)$$

Where:
- $k$: Number of model parameters.
- $L$: Maximum likelihood of the model.

**Bayesian Information Criterion (BIC):**

The BIC is another metric for model selection. Its penalty for model complexity is stronger than that of AIC whenever $\ln(n) > 2$, that is, for $n \geq 8$:

$$BIC = k \ln(n) - 2 \ln(L)$$

Where:
- $k$: Number of model parameters.
- $n$: Number of data points.
- $L$: Maximum likelihood of the model.

Both AIC and BIC are minimized to select the best model. Lower values indicate a better balance of model fit and complexity.
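Both criteria are simple functions of the maximized log-likelihood $\ln(L)$. The sketch below evaluates them for illustrative numbers; the function names and the example values of $k$, $n$, and $\ln(L)$ are assumptions, not results from any fitted model.

```python
import math

def aic(k, log_likelihood):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

# Illustrative values: 3 parameters, 100 observations, ln(L) = -120
aic_value = aic(k=3, log_likelihood=-120.0)
bic_value = bic(k=3, n=100, log_likelihood=-120.0)
```

With $n = 100$ the per-parameter penalty is $\ln(100) \approx 4.6$ for BIC against 2 for AIC, so BIC favors the smaller model more strongly.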

---

### 7.3 Bias-Variance Decomposition

The expected squared prediction error of a model can be decomposed into three components:

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

Where:
- **Bias** measures the error due to overly simplistic assumptions in the model.
- **Variance** measures the sensitivity of the model to fluctuations in the training data.
- **Irreducible Error** represents the noise inherent in the data that no model can explain.

---

### 7.4 Final Performance Evaluation
A comprehensive evaluation of the models includes:
1. **Error Metrics**:
   - $MSE$, $RMSE$, $MAE$, $MAPE$
2. **Model Selection Criteria**:
   - $AIC$, $BIC$
3. **Error Decomposition**:
   - Bias-Variance tradeoff analysis.

## 8. Comprehensive Morphism Composition

### 8.1 Total Morphism Operator
$$\mathcal{M}_{total}: S \to Y$$

$$\mathcal{M}_{total} = \mathcal{M}_7 \circ \mathcal{M}_6 \circ \mathcal{M}_4 \circ \mathcal{M}_3 \circ \mathcal{M}_2 \circ \mathcal{M}_1$$

The index $\mathcal{M}_5$ does not appear because no morphism with that index is defined in the preceding sections.

### 8.2 Probabilistic Model Representation
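$$P(Y|X) = \int P(Y|M,X)P(M|X)dM$$

This is a model-averaged predictive distribution: each model $M$ contributes its prediction $P(Y|M,X)$, weighted by $P(M|X)$. For the finite set of candidate models in Section 5, the integral reduces to a sum over those models.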

---


