<div class="alert block alert-info alert">

# <center> Scientific Programming in Python

## <center>Karl N. Kirschner<br>Bonn-Rhein-Sieg University of Applied Sciences<br>Sankt Augustin, Germany

# <center> Machine Learning (ML) Overview

<center><b>Machine Learning</b> is the study and usage of computer algorithms that <b>learn</b> (i.e., improve automatically) through <b>experience</b> (i.e., data).</center>

<br>

- **"Can machines think [in the way that we do]?"** [1]
    - The term "machine learning" was <font color='dodgerblue'>**first used in 1959**</font> by Arthur Samuel (an IBM researcher)

## Core Mathematical Foundations

- The "work-horse" of ML is **Statistics** (analyzing and modeling data).

- Other core disciplines include:
    - **Calculus:** Used for **optimization** (finding the minimum of a cost/loss function) via **derivatives** (e.g., gradient descent).
    - **Linear Algebra:** Used for representing data and operations efficiently using **vectors**, **matrices** and **tensors**.
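
As a minimal sketch of how calculus and linear algebra work together, the example below fits a straight line with gradient descent using NumPy arrays. The synthetic data, the learning rate, and the number of iterations are assumptions chosen for illustration only.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise
rng = np.random.default_rng(seed=0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.05, size=200)

slope, intercept = 0.0, 0.0   # initial guesses for the model parameters
learning_rate = 0.1           # step size (a hyperparameter)

for _ in range(2000):
    residual = (slope * x + intercept) - y
    # Derivatives of the MSE loss, mean(residual**2), w.r.t. each parameter
    grad_slope = 2.0 * np.mean(residual * x)
    grad_intercept = 2.0 * np.mean(residual)
    # Step against the gradient to reduce the loss
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The recovered parameters should land close to the values used to generate the data (3 and 2); libraries such as scikit-learn hide this loop behind a single `fit` call.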


Researchers developed the individual components over many decades. Only comparatively recently were they collected into code libraries, which made the ideas far more accessible.


## Machine Learning Categories

1. <font color='dodgerblue'>**Shallow learning**</font> (e.g., **s**ci**k**it-**learn**; a.k.a., **sklearn**)
    - <font color='dodgerblue'>**Predefined features**</font>
        - Relies on **Feature Engineering**: Features must be **manually selected, extracted, and pre-processed** by the user (domain expert).
    - Models are simpler (e.g., Linear Regression, Decision Trees, Support Vector Machines).

<br>

2. <font color='dodgerblue'>**Deep learning**</font> (e.g., TensorFlow, PyTorch)
    - <font color='dodgerblue'>**Feature learning**</font> (note: can also have feature engineering involved as a pre-processing step)
        - Strength is **Feature Learning**: The model **automatically learns the optimal features** (i.e., representations) directly from the raw input data (e.g., pixels, raw text).
    - Utilizes **Deep Neural Networks (DNNs)**: complex architectures where many simple algorithmic **"layers"** are stacked and trained end-to-end.
        - Mostly <font color='dodgerblue'>**combines shallow learning**</font> instances together into the <font color='dodgerblue'>**"layers"**</font>


## The Three Main ML Task Types

Almost all ML problems fall into one of these <font color='dodgerblue'>three categories</font>:
1.  <font color='dodgerblue'>**Supervised Learning:**</font> Training data includes the correct **labels** (answers). Used for **Prediction**.
    - **Examples**
        - **Classification** (e.g., Is this email spam?)
        - **Regression** (e.g., What is the price of this house?).

<br>

2.  <font color='dodgerblue'>**Unsupervised Learning:**</font> Training data has **no labels**. Used for **Discovery**.
    - **Examples**
        - **Clustering** (grouping similar data points),
        - **Dimensionality Reduction**.

<br>

3.  **Reinforcement Learning (RL):** An **Agent** learns to make sequential **decisions** by interacting with an **Environment** to maximize a <font color='dodgerblue'>**Reward**</font>.
    - **Example**
        - Training autonomous vehicles, playing games (AlphaGo).
<br>

**Sources**:
1. Turing, Alan M. "Computing machinery and intelligence." Parsing the Turing test. Springer, Dordrecht, 2009. 23-65.

**Additional Resources**:
1. https://en.wikipedia.org/wiki/Machine_learning

<hr style="border:2px solid gray"></hr>

# Supervised vs. Unsupervised Learning

Many traditional (Shallow) ML models fall into these two fundamental categories (excluding Reinforcement Learning).

---

## I. Supervised Learning (Learning with Target/Label Values)

In supervised learning, the model is trained on data that includes the target output, or **"label"** ($\mathbf{y}$).
- **Goal**: to **learn a mapping function** $f: \mathbf{X} \rightarrow \mathbf{y}$.
    - **Training** features ($\mathbf{X}$): **independent** variables (i.e., input).
    - **Target** feature ($\mathbf{y}$): **dependent** variable(s) (i.e., output).

### Key <font color='dodgerblue'>Supervised</font> Learning Tasks

| Regression (Predicts a **Continuous Value**) | Classification (Predicts a **Discrete Label**) |
| :---: | :---: |
| <font color='dodgerblue'>**Linear**</font> | Logistic Regression |
| <font color='dodgerblue'>**Polynomial**</font> | <font color='dodgerblue'>**Support Vector Machine (SVM)**</font> |
| Stepwise | Naive Bayes |
| <font color='dodgerblue'>**Ridge**</font> | Nearest Neighbor |
| <font color='dodgerblue'>**Lasso**</font> | Decision Tree |
| <font color='dodgerblue'>**ElasticNet**</font> | <font color='dodgerblue'>**Random Forest**</font> |

## II. Unsupervised Learning (Learning without Target/Label Values)

In unsupervised learning, the model is trained on data with **no labels**.
- **Goal**: to **discover hidden structure**, **patterns**, or **relationships** within the input data ($\mathbf{X}$).

### Key <font color='dodgerblue'>Unsupervised</font> Learning Tasks

| Clustering (Group Data by **Similarity**) | Dimensionality Reduction (Simplify **Data Structure**) |
| :---: | :---: |
| <font color='dodgerblue'>**K-means**</font> | <font color='dodgerblue'>**Principal Component Analysis (PCA)**</font> |
| Mean-Shift | Linear Discriminant Analysis (LDA; requires class labels) |
| DBSCAN | Generalized Discriminant Analysis |
| Agglomerative **Hierarchical** | Autoencoder |
| Spectral | Non-Negative Matrix Factorization |
| Gaussian Mixture Model | UMAP (Uniform Manifold Approximation and Projection) |
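
The two task families in the table can be sketched with scikit-learn's `KMeans` and `PCA`. The two synthetic "blobs" below are an assumption made purely for illustration; note that no labels are passed to either model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic, unlabeled data (an assumption): two well-separated blobs in 5 dimensions
rng = np.random.default_rng(seed=0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 5))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 5))
X = np.vstack([blob_a, blob_b])

# Clustering: group the points by similarity (no y is passed to fit)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project the 5 features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(np.bincount(cluster_ids))        # number of points in each cluster
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variance per component
```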

<hr style="border:2px solid gray"></hr>

# Deep Learning

Deep Learning is defined by its use of **Deep Neural Networks (DNNs)**: models composed of multiple **hidden layers** between the input and the output.

- DNNs are powerful because they can perform **<font color='dodgerblue'>Feature Learning</font>** versus human-controlled **<font color='dodgerblue'>Feature Engineering</font>** (Shallow Learning).
    - **Feature Learning**: an algorithm automatically finds the **best data representations**

<br>

Deep Learning models can be applied to all **three major task types**:

1. **Supervised Deep Learning:** Used for complex **Regression** (e.g., predicting sequence data) or **Classification** (e.g., image recognition) using models like
    - **Convolutional Neural Networks (CNNs)**, or
    - **Recurrent Neural Networks (RNNs)**.
    - **Advantage**: trained on <font color='dodgerblue'>labeled data</font>, with specific and <font color='dodgerblue'>clearly defined goals</font>.
        - Learn a <font color='dodgerblue'>mapping</font> from <font color='dodgerblue'>input features</font> to known <font color='dodgerblue'>output labels</font>.

<br>

2. **Unsupervised Deep Learning:** Used for tasks like advanced **Clustering** or **Dimensionality Reduction** using models like
    - **Autoencoders**, and
    - **Generative Adversarial Networks (GANs)**.
    - **Advantage**: Discover hidden <font color='dodgerblue'>patterns, structures, relationships</font>, or <font color='dodgerblue'>insights</font> within the data itself.

<br>

3. **Reinforcement Learning (RL):** The learning agent is often a **Deep Neural Network** that learns the optimal policy through trial-and-error to maximize a **Reward** signal.

<br>

## The Neural Network Structure

A **Neural Network** is a highly flexible function that maps inputs to outputs through a series of interconnected computational units called **neurons** (or **nodes**).


1. **Input Layer:** Receives the raw **features** ($\mathbf{X}$) of the dataset. Each node represents one feature (observable).
2. **Encoder**: input $\rightarrow$ hidden layers (focuses on <font color='dodgerblue'>data reduction</font> and <font color='dodgerblue'>representation</font>)
3. **Hidden Layer(s):** Where the **actual learning occurs**. Each node in a hidden layer:
    - Calculates a **weighted sum** of its inputs (from the previous layer).
    - Applies a non-linear **<font color='dodgerblue'>Activation Function</font>** (e.g., ReLU, Sigmoid) to the sum.
    - These layers create a **<font color='dodgerblue'>compressed knowledge representation</font>** of the original input. The more layers, the "deeper" the network.
4. **Decoder**: hidden layers $\rightarrow$ <font color='dodgerblue'>output</font>
5. **Output Layer:** Returns the final **predicted result** ($\mathbf{\hat{y}}$, where the "hat" indicates a prediction).
    - The number of nodes will correspond to the task: one node for Regression; one node per class for Classification.
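
The weighted-sum-plus-activation step described above can be sketched as a single forward pass in NumPy. The layer sizes and the random, untrained weights are assumptions for illustration; a real network would learn the weights during training.

```python
import numpy as np

def relu(z):
    """ReLU activation: element-wise max(0, z)."""
    return np.maximum(0.0, z)

rng = np.random.default_rng(seed=0)

# Four samples with three features each (the input layer has three nodes)
X = rng.normal(size=(4, 3))

# Randomly initialized (untrained) parameters; training would adjust these
W_hidden = rng.normal(size=(3, 5))   # input (3 nodes) -> hidden (5 nodes)
b_hidden = np.zeros(5)
W_output = rng.normal(size=(5, 1))   # hidden (5 nodes) -> output (1 node)
b_output = np.zeros(1)

# Hidden layer: weighted sum of the inputs, then a non-linear activation
hidden = relu(X @ W_hidden + b_hidden)

# Output layer: a single node, as for a regression task
y_hat = hidden @ W_output + b_output

print(hidden.shape)   # (4, 5)
print(y_hat.shape)    # (4, 1)
```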

<center><img alt="neural network" width="800" src="00_images/31_machine_learning/deep_neural_network.png" align="center" hspace="10px" vspace="0px"></center>

<center>Image Source: https://www.studytonight.com/post/understanding-deep-learning</center>

<hr style="border:2px solid gray"></hr>

# <font color='dodgerblue'>Key Libraries</font> for Machine Learning

## I. Foundational Libraries (Data Structure and Numerics)

1.  **NumPy** (Numerical Python)
2.  **Pandas**

<br>

## II. Machine Learning Toolkits

### A. <font color='dodgerblue'>Shallow Learning</font>

1.  **scikit-learn (sklearn):**
    - The **standard library** for traditional, high-performance ML algorithms (Regression, Classification, Clustering, etc.).
    - Known for its **unified API** (models use the same `.fit()`, `.predict()`, and `.transform()` methods), making it ideal for beginners.
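
A small sketch of what the unified API looks like in practice: a transformer and two different classifiers are driven by the same method names. The toy dataset is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Tiny toy dataset (an assumption): one feature, class 1 for larger values
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Transformers expose fit() and transform()
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Different estimators share the same fit() / predict() calls
for model in (LogisticRegression(), DecisionTreeClassifier(random_state=0)):
    model.fit(X_scaled, y)
    predictions = model.predict(X_scaled)
    print(type(model).__name__, predictions)
```

Swapping one model for another usually means changing only the line that constructs it.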

<center><img alt="scikit-learn machine learning map" width="800" src="00_images/31_machine_learning/scikit_learn_ml_map.png" align="center"></center>

<center>Image Source (interactive): https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html</center>

### B. <font color='dodgerblue'>Deep Learning</font>

1.  **TensorFlow (TF):**
    - An open-source library developed by Google.
    - Handles complex model building, GPU acceleration, and deployment in production environments.
2.  **Keras (integrated into $\text{tf.keras}$):**
    - A high-level API designed to make building and training neural networks **simple and fast**.
    - It acts as a user-friendly interface to TensorFlow's core functionality.
3.  <font color='dodgerblue'>**PyTorch:**</font>
    - An open-source library originally developed by Facebook's AI Research lab (now part of Meta).
    - Known for its **Pythonic** feel and **dynamic computation graphs**, making it highly popular for research and **easy prototyping**.

<hr style="border:2px solid gray"></hr>

# General Workflow For Model Creation
- The scikit-learn Workflow


## 1. Understand Your Goal (Define the ML Task)

This first step will dictate the type of algorithm you must use.

- **Prediction Goal:** What type of data are you predicting?
    - **Numerical/Continuous Data** $\rightarrow$ **Regression** Task (e.g., predicting temperature).
    - **Categorical/Discrete Data** $\rightarrow$ **Classification** Task (e.g., predicting "cat" or "dog").

<br>

## 2. Data Preparation and Splitting

This step transforms the raw data into the necessary NumPy arrays ($\mathbf{X}$ and $\mathbf{y}$) and prepares them for training.

- **Data Cleaning:** Collect, clean (e.g., handle missing values), and handle **outliers**.
- **Define Features vs. Target:**
    - **Training Features ($\mathbf{X}$):** **Independent Variables** (i.e., input data, predictors).
    - **Target Feature ($\mathbf{y}$):** **Dependent Variable(s)** (i.e., output data, what we want to predict).
- **Feature Scaling/Normalization:** Adjust magnitudes of features so they are on a similar scale (e.g., using `StandardScaler` or `MinMaxScaler`).
- **Encoding:** Convert any categorical data (e.g., "Male", "Female") into numerical representations (e.g., 0, 1) using techniques like **One-Hot Encoding**.
- **Data Splitting:** Divide the prepared data into distinct sets so that **overfitting** can be detected: the model is trained on one set and evaluated on data it has not seen.
    - Use sklearn's `train_test_split` function.
    - Results in four sets:
        - $\mathbf{X}_{\text{train}}$, $\mathbf{X}_{\text{test}}$, and
        - $\mathbf{y}_{\text{train}}$, $\mathbf{y}_{\text{test}}$.
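
The preparation steps above might look as follows. The toy table and its column names are hypothetical. In this sketch the scaler is fitted after the split and on the training set only, which is a common precaution against information from the test set leaking into training; treat that ordering as a choice of this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are illustrative only
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0, 165.0, 175.0, 185.0],
    "sex": ["Female", "Male", "Female", "Male", "Male", "Female", "Male", "Female"],
    "weight_kg": [50.0, 60.0, 65.0, 80.0, 90.0, 58.0, 75.0, 72.0],
})

# Encoding: one-hot encode the categorical column
features = pd.get_dummies(df[["height_cm", "sex"]], columns=["sex"], dtype=float)

X = features.to_numpy()           # training features
y = df["weight_kg"].to_numpy()    # target feature

# Splitting: hold out 25% of the rows as unseen test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling: fit on the training set only, then apply to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)   # (6, 3) (2, 3)
```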

<br>

## 3. Model Exploration and Training (The $\text{sklearn}$ API)

The key to $\text{sklearn}$ is the **unified API** - every model object follows the same fundamental pattern.

- **Model Selection:** Choose several candidate models (e.g., `LinearRegression`, `RandomForestClassifier`).
- **Training:** Fit the model to the training data. This is where the model learns the relationship between $\mathbf{X}_{\text{train}}$ and $\mathbf{y}_{\text{train}}$.
    - **Example code:** `model.fit(X_train, y_train)`
- **Hyperparameter Tuning:** Systematically optimize model settings (e.g., tree depth, learning rate) that are not learned from the data.
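
A hedged sketch of training and tuning, using `GridSearchCV` as one possible way to search hyperparameters systematically. The synthetic dataset and the grid of `max_depth` values are arbitrary assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic classification data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Training: fit a single model with fixed settings
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Hyperparameter tuning: try each max_depth with 3-fold cross-validation
# on the training data only; the grid values are arbitrary examples
search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 4, None]},
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_)            # the best-scoring max_depth
best_model = search.best_estimator_   # refit on the full training set
```

Because the search uses cross-validation within the training data, the test set stays untouched until the final evaluation.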

<br>

## 4. Model Evaluation and Determination

We test the model's performance on the **unseen** $\mathbf{X}_{\text{test}}$ data.

- **Prediction:** Generate predictions ($\mathbf{\hat{y}}$) using the input test data/features ($\mathbf{X}_{\text{test}}$).
    - **Example code:** `predictions = model.predict(X_test)`
- **Evaluation:** Compare the predictions ($\mathbf{\hat{y}}$) to the **actual target values** ($\mathbf{y}_{\text{test}}$) using appropriate metrics.
    - Regression Metrics: Mean Squared Error (MSE), $R^2$.
    - Classification Metrics: Accuracy, Precision, Recall, F1 Score.
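
The prediction and evaluation steps can be sketched for a regression task as follows. The synthetic data from `make_regression` is an assumption for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

# Prediction: the model sees only the test features
predictions = model.predict(X_test)

# Evaluation: compare predictions with the held-out targets
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MSE = {mse:.1f}, R^2 = {r2:.3f}")
```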

<br>

## 5. Apply the Model using New Data

The finalized, best-performing model is deployed to make predictions on entirely new data.

- **Final Outcome:** Apply the **trained model** to **new data**.

<hr style="border:2px solid gray"></hr>

# Trained Model Evaluation

## <font color='dodgerblue'>Classification Metrics</font>

Evaluating classification models: How well do they correctly assign instances to the correct category?

### Core Metrics Based on Counts

In all formulas:

- **True Positives (TP):** Model <font color='dodgerblue'>**correctly**</font> predicted the **positive** class.
- **True Negatives (TN):** Model <font color='dodgerblue'>**correctly**</font> predicted the **negative** class.
- **False Positives (FP):** Model <font color='red'>**incorrectly**</font> predicted **positive** (i.e., a false alarm).
- **False Negatives (FN):** Model <font color='red'>**incorrectly**</font> predicted **negative** (i.e., a miss).


### 1. Accuracy

- **Definition:** The proportion of correctly classified instances out of the total instances.
- **Formula:**
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
- **Caution:** Can be misleading with **imbalanced datasets**
    - e.g., 95% of data is in one class
- **sklearn Function:** `accuracy_score`
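
To illustrate the caution about imbalanced datasets, the sketch below scores a "model" that always predicts the majority class. The 95/5 label split is an assumption matching the example above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Imbalanced labels (an assumption): 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(accuracy)   # 0.95
print(recall)     # 0.0
```

The accuracy looks excellent even though not a single positive instance is found, which is why accuracy alone is not enough for imbalanced data.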

---

### 2. Confusion Matrix (a summary table)

The foundation for many other metrics. It summarizes performance by showing the counts of correct and incorrect predictions for each class.

| | **Actual (Target) Positive** | **Actual (Target) Negative** |
| :--- | :---: | :---: |
| **Predicted Positive** | True Positive (**TP**) | False Positive (**FP**) (Type I Error) |
| **Predicted Negative** | False Negative (**FN**) (Type II Error) | True Negative (**TN**) |
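
When computing the matrix with scikit-learn's `confusion_matrix`, keep in mind that its layout differs from the table above: rows are the actual classes, columns are the predicted classes, and labels are sorted, so the negative class comes first. The small label arrays below are an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Small hand-made example (an assumption): 1 = positive, 0 = negative
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# scikit-learn layout for labels 0 and 1: [[TN, FP], [FN, TP]]
# (actual classes in rows, predicted classes in columns)
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")   # TP=3, FP=1, FN=1, TN=3
```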

---

**Example for a "good" model** (F1 score $\approx 0.86$):

| | **Actual (target) Positive** | **Actual (target) Negative** |
| :--- | :---: | :---: |
| **Predicted Positive** | 150 | 30 |
| **Predicted Negative** | 20 | 200 |

---

### 3. Recall (a.k.a. Sensitivity or True Positive Rate)

- **Definition:** Of all **actual positive** instances, what proportion did the model correctly identify?
- **Formula:**
$$\text{Recall} = \frac{TP}{TP + FN}$$
- **Use Case:** Minimizing **False Negatives** is critical (e.g., medical screening where missing a sick patient is dangerous, or fraud detection).
- **sklearn Function:** `recall_score`

---

### 4. F1-Score

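- **Definition:** The **harmonic mean** of precision and recall, where precision ($\frac{TP}{TP + FP}$) is the proportion of predicted positives that are actually positive. It provides a single score that balances both metrics.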
- **Formula:**
$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
- **Use Case:** A robust metric, especially when dealing with **class imbalance**.
- **sklearn Function:** `f1_score`
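
As a worked check, the counts from the "good" model example (TP = 150, FP = 30, FN = 20, TN = 200) can be pushed through the formulas by hand. Precision, $\frac{TP}{TP + FP}$, is computed along the way because the F1-score needs it.

```python
# Counts from the "good" model example in the confusion-matrix section
tp, fp, fn, tn = 150, 30, 20, 200

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of all predicted positives, how many are correct
recall = tp / (tp + fn)      # of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.3f}")    # 0.875
print(f"precision = {precision:.3f}")   # 0.833
print(f"recall    = {recall:.3f}")      # 0.882
print(f"F1        = {f1:.3f}")          # 0.857
```

The result agrees with the F1 score of roughly 0.86 quoted for that example.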


<!-- ## 2. Advanced Metrics

### **ROC Curve and AUC** (Area Under the Curve)

- **ROC Curve:** Plots the True Positive Rate (**Recall**) against the False Positive Rate (FPR) at various classification thresholds.
- **Formula:** $$\text{FPR} = \frac{FP}{FP + TN}$$
- **AUC:** The Area Under the ROC Curve.
- **Use Case:** Evaluates the model's ability to distinguish between classes across **all possible decision thresholds**. A higher AUC (closer to 1.0) indicates better overall discrimination.

### **Log Loss** (Cross-Entropy Loss)

- **Definition:** Measures the performance of a classification model where the prediction is a **probability** (rather than a hard class label). It heavily penalizes confident predictions that are wrong.

- **Use Case:** Essential for evaluating models (like Logistic Regression or Neural Networks) that output probability scores. -->

<br>

## <font color='dodgerblue'>Regression Metrics</font>

These metrics are typically used for **Regression Problems** (i.e., predicting numerical, continuous data - e.g., house price, temperature, or age).

In all formulas:
- $\mathbf{n}$ is the number of data points.
- $\mathbf{y}_i$ is the actual (target) value.
- $\mathbf{\hat{y}}_i$ is the predicted value.
- $\mathbf{\bar{y}}$ is the mean of the actual values.

---

### 1. Mean Absolute Error (MAE)

- **Definition:** The average magnitude of the errors (the average absolute difference between the actual and predicted values).
- **Formula:**
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
- **Strengths:**
    - Easy to understand and interpret.
    - **More robust to outliers** than MSE (errors scale linearly rather than quadratically).
    - Units are the same as the target variable's units (i.e., intuitive).
- **Weaknesses:**
    - Not differentiable at zero error (less suitable as a primary loss function for gradient-based optimization).
- **sklearn Function:** `mean_absolute_error`

---
### 2. Mean Squared Error (MSE)

- **Definition:** The average of the squared errors.
- **Formula:**
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
- **Strengths:**
    - **Differentiable** (excellent for use as a loss function in algorithms like Linear Regression and Neural Networks).
    - **Penalizes larger errors** more heavily due to the squaring operation.
- **Weaknesses:**
    - Highly **sensitive to outliers**.
    - Units are the **square** of the target variable's units (less intuitive).
- **sklearn Function:** `mean_squared_error`

---
### 3. Root Mean Squared Error (RMSE)

- **Definition:** The square root of the MSE. This operation brings the error magnitude back to the original units of the target variable.
- **Formula:**
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
- **Strengths:**
    - Commonly used and widely understood.
    - Units are the same as the target variable's units (intuitive).
    - Penalizes large errors more than MAE.
- **Weaknesses:**
    - Still sensitive to outliers (though less so than raw MSE).
- **sklearn Function:** `root_mean_squared_error` (scikit-learn 1.4 and later); otherwise take the square root of `mean_squared_error`.
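
The three error metrics can be written directly from their formulas in NumPy. The small arrays below are made-up values, chosen so that one prediction set contains a single large error and the effect of an outlier on each metric becomes visible.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(mse(y_true, y_pred))

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_small = np.array([11.0, 19.0, 31.0, 39.0])     # every error is 1
y_pred_outlier = np.array([11.0, 19.0, 31.0, 60.0])   # one error is 20

for name, y_pred in [("small errors", y_pred_small), ("one outlier", y_pred_outlier)]:
    print(name, mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

With uniform errors of 1, all three metrics equal 1. With the single outlier, MAE rises to 5.75 while MSE jumps to 100.75 (RMSE is about 10.04), showing how squaring amplifies large errors.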

---


### 4. R-squared ($\mathbf{R^2}$) (a.k.a. Coefficient of Determination)

- **Definition:** A measure of how well the model explains the variability of the target variable. It compares the model's error to the error of a simple mean-based model.
- **Formula:**
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
- **Interpretation:**
    - Ranges from $1.0$ (a perfect fit) to $0$ (the model is no better than simply predicting the mean) and can be negative (worse than the mean).
- **Strengths:**
    - Provides a **relative measure** of fit, easy to understand as a percentage of explained variance.
- **Weaknesses:**
    - Can be misleading: on the training data it never decreases when features are added, even irrelevant ones (which can hide overfitting).
    - Doesn't tell you the errors' magnitude in the original units.
- **sklearn Function:** `r2_score`
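
The three regimes in the interpretation above (perfect fit, no better than the mean, worse than the mean) can be reproduced with a hand-written version of the formula. The helper name `r_squared` and the toy values are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)            # model's squared error
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # mean-model's squared error
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

perfect = y_true.copy()
mean_only = np.full_like(y_true, y_true.mean())
reversed_trend = y_true[::-1]

print(r_squared(y_true, perfect))          # 1.0
print(r_squared(y_true, mean_only))        # 0.0
print(r_squared(y_true, reversed_trend))   # -3.0

# The hand-written function agrees with scikit-learn's r2_score
assert np.isclose(r_squared(y_true, reversed_trend), r2_score(y_true, reversed_trend))
```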


<!-- Adjusted R-squared

Formula: $$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$ where n is the number of data points and p is the number of predictors (features).
Interpretation: Adjusted R-squared accounts for the number of predictors in the model. It will only increase if the new features significantly improve the model, penalizing the addition of irrelevant features.
Strengths:
    More reliable than R2 for comparing models with different numbers of predictors.
Weaknesses:
    Still a relative measure.

Mean Absolute Percentage Error (MAPE)

Formula: $$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%$$
Interpretation: MAPE expresses the error as a percentage of the actual value. This makes it useful for comparing models across different scales.
Strengths:
    Scale-independent, good for comparing performance across different datasets or models where the target variable has different magnitudes.
    Easy to understand as a percentage.
Weaknesses:
    Undefined if $y_i$ is zero.
    Can heavily penalize errors when $y_i$ is very small.
    Asymmetric (penalizes over-predictions differently from under-predictions).

Root Mean Squared Logarithmic Error (RMSLE)

Formula: $$RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\log(y_i + 1) - \log(\hat{y}_i + 1))^2}$$
Interpretation: RMSLE measures the ratio between actual and predicted values rather than the difference. It penalizes under-predictions more heavily than over-predictions and is robust to outliers, especially when the target variable has a wide range of values.
Strengths:
    Useful when you care about percentage errors, not just absolute errors (e.g., predicting prices where a $10 error on a $100 item is much worse than on a $1,000,000 item).
    Less sensitive to large errors than RMSE.
Weaknesses:
    Cannot be used if $y_i$ or $\hat{y}_i$ are negative.
    The interpretation isn't as straightforward as MAE or RMSE. -->

<hr style="border:2px solid gray"></hr>

# Take Home Messages

## Core Concepts & Structure

### 1. The Data Structure
* All ML models use **Input Features ($\mathbf{X}$)** and, optionally, a **Target Variable ($\mathbf{y}$)**, typically stored as NumPy arrays.
* **Shallow Learning** (e.g., $\text{sklearn}$) relies on **Feature Engineering**.
* **Deep Learning** (e.g., TensorFlow, PyTorch) relies on **Feature Learning** using **Deep Neural Networks**.

### 2. The Three ML Tasks
| Task Type | Target ($\mathbf{y}$) | Goal | Primary Tools |
| :--- | :--- | :--- | :--- |
| <font color='dodgerblue'>**Supervised**</font> | Known (Labeled) | Prediction | Regression, Classification |
| <font color='dodgerblue'>**Unsupervised**</font> | Unknown (Unlabeled) | Discovery of structure | Clustering, Dimensionality Reduction |
| **Reinforcement** | No target; uses **Reward** | Sequential decision-making | Deep Q-Learning (often uses DNNs) |

---

## The General $\text{sklearn}$ Workflow

A five-step process for building and evaluating traditional ML models:

1.  **Goal Definition:** Determine if the task is **Regression** (continuous $\mathbf{y}$) or **Classification** (discrete $\mathbf{y}$).
2.  **Data Preparation:** Clean, encode, scale features, and use `train_test_split` to create **Training** ($\mathbf{X}_{\text{train}}, \mathbf{y}_{\text{train}}$) and **Test** ($\mathbf{X}_{\text{test}}, \mathbf{y}_{\text{test}}$) sets.
3.  **Model Creation and Training**: create a model object (e.g., `LinearRegression`) and train it (`fit`).
4.  **Model Evaluation:** Evaluate the model's performance on unseen data $\mathbf{X}_{\text{test}}$ and $\mathbf{y}_{\text{test}}$ (e.g., Accuracy, F1-Score, or RMSE).
5.  **Model Application:** Apply the model for what it was designed to do (new input data).

<br><br>

<div class="alert alert-block alert-warning">
<hr style="border:1.5px dashed gray"></hr>


#### Specific Example(s): Autoencoders - generative models (i.e., <font color='dodgerblue'>create new things</font>)

**Autoencoder** neural networks are unsupervised learning models (i.e., they use unlabeled input data). They **encode an input** (i.e., something that is human-relatable), **transform it into a different representation** within the latent space, and then **decode** it back into something **human-relatable**. This allows new things to be generated.


- https://www.jeremyjordan.me/autoencoders/
- <font color='dodgerblue'>Sparse</font> Autoencoder
    - **hidden** layers have the **same number of nodes** as the **input** and **output** layers
    - loss function includes a penalty for "activating" a node within the hidden layer

<br>

- <font color='dodgerblue'>Denoising</font> Autoencoder
    - slightly **corrupt** the **input data** (i.e., add noise) to help make the encoding/decoding more generalizable
    - **target data** remains **uncorrupted**
    - make the decoding (reconstruction function) insensitive to small changes in the input

<br>

- <font color='dodgerblue'>Contractive</font> Autoencoder
    - make the **encoding** (feature extraction function) **less sensitive** to **small changes** within the **input data**
    - learn similar encoding (hidden layer) for different inputs that vary slightly

<br>

- <font color='dodgerblue'>Variational</font> Autoencoder (VAE)
    - https://arxiv.org/abs/1606.05908
    - training using **backpropagation** (aka **backward propagation of error**)
        - backpropagation - https://www.ibm.com/think/topics/backpropagation
        - starting from the **output**, compute the **importance** (measured as a gradient) that each neural network **parameter** has for the model's **error** (i.e., the loss function, which compares predicted values to targets)
    - encoding is **regularized** (a penalty term is added to the model's loss function during the learning process) to ensure that the latent space has good properties (thus allowing generative models to be created)
        - regularization - https://en.wikipedia.org/wiki/Regularization_(mathematics)


<hr style="border:1.5px dashed gray"></hr>
<!-- - Generative Adversarial Networks (GANs)
    - two networks oppose each other (a generator and a discriminator), for which both iteratively improve -->