<div class="alert block alert-info alert">

# <center> Scientific Programming in Python

## <center>Karl N. Kirschner<br>Bonn-Rhein-Sieg University of Applied Sciences<br>Sankt Augustin, Germany

# <center> Machine Learning (ML) Overview

**"Can machines think [in the way that we do]?"** [1]


- The ML term was <font color='dodgerblue'>**first used in 1959**</font> by Arthur Samuel (an IBM researcher)


- Mathematical Foundation
    - <font color='dodgerblue'>**Statistics**</font> (the "work-horse" of ML)
    - Calculus (derivatives; optimizations)
    - Algebra (vectors, matrices, tensors)


- The individual components were developed by researchers over many years. Only recently were they collected into libraries that make the ideas more accessible.

## Machine Learning Categories

1. <font color='dodgerblue'>**Shallow learning** (e.g., **s**ci**k**it-**learn**, a.k.a. **sklearn**)</font>
    - <font color='dodgerblue'>**predefined features**</font>

1. Deep learning (e.g. TensorFlow, PyTorch)
    - <font color='dodgerblue'>**feature learning**</font>
    - mostly <font color='dodgerblue'>**combines shallow learning**</font> instances together into <font color='dodgerblue'>**"layers"**</font>


**Sources**:
1. Turing, Alan M. "Computing machinery and intelligence." Parsing the Turing test. Springer, Dordrecht, 2009. 23-65.

**Additional Resources**:
1. https://en.wikipedia.org/wiki/Machine_learning

In [None]:
## For extra information given within the lectures

from IPython.display import HTML, display


def set_code_background(color: str):
    ''' Set the background color for code cells.

        Source: psychemedia via https://stackoverflow.com/questions/49429585/
                how-to-change-the-background-color-of-a-single-cell-in-a-jupyter-notebook-jupy

        To match Jupyter's dev class colors:
            "alert alert-block alert-warning" = #fcf8e3

        Args:
            color: HTML color, rgba, hex
    '''

    script = ("var cell = this.closest('.code_cell');"
              "var editor = cell.querySelector('.input_area');"
              f"editor.style.background='{color}';"
              "this.parentNode.removeChild(this)")
    display(HTML(f'<img src onerror="{script}">'))


set_code_background(color='#fcf8e3')

# Two General Types of ML

1. **Shallow Learning**
2. **Deep Learning**

<br>
<br>

<hr style="border:2px solid gray"></hr>

# Shallow Learning

## Categories

| Regression | Classification | Clustering | Dimensionality Reduction |
| :-: | :-: | :-: | :-: |
| <font color='dodgerblue'>Linear</font> | Logistic Regression | <font color='dodgerblue'>K-means</font> | <font color='dodgerblue'>Principal Component Analysis</font> |
| <font color='dodgerblue'>Polynomial</font> | <font color='dodgerblue'>Support Vector Machine</font> | Mean-Shift | Linear Discriminant Analysis |
| Stepwise | Naive Bayes | DBSCAN | Generalized Discriminant Analysis |
| Ridge | Nearest Neighbor | Agglomerative Hierarchical | Autoencoder |
| Lasso | Decision Tree | Spectral Clustering | Non-Negative Matrix Factorization |
| ElasticNet | <font color='dodgerblue'>Random Forest</font> | Gaussian Mixture | UMAP |

## Supervised vs. Unsupervised Learning

1. **Supervised** - the **target information is known** in the data set, and we **train to reproduce** that information
    - <font color='dodgerblue'>regression</font>
    - <font color='dodgerblue'>classification</font>

1. **Unsupervised** - the **target information is unknown**, with the goal to:
    - group the data by similarity (<font color='dodgerblue'>clustering</font>)
    - determine the distribution of the data (<font color='dodgerblue'>density estimation</font>)
    - reduce the number of dimensions for exploration and visualization (<font color='dodgerblue'>dimensionality reduction</font>)

<p><img alt="scikit-learn machine learning map" width="800" src="00_images/31_machine_learning/scikit_learn_ml_map.png" align="center" hspace="10px" vspace="0px"></p>

Image Source (interactive): https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
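
As a minimal sketch of the supervised/unsupervised distinction (pure Python with invented data; the helpers `fit_line` and `two_means` are hypothetical names for this illustration and are not scikit-learn functions): the supervised fit uses the known targets `y_train`, while the clustering only ever sees the observations.

```python
def fit_line(x, y):
    """Supervised: least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def two_means(x, n_iter=10):
    """Unsupervised: split 1D data into two clusters (k-means with k=2)."""
    centers = [min(x), max(x)]  # simple initial guess
    for _ in range(n_iter):
        groups = [[], []]
        for xi in x:
            # assign each point to its nearest center
            nearest = 0 if abs(xi - centers[0]) <= abs(xi - centers[1]) else 1
            groups[nearest].append(xi)
        # move each center to the mean of its group
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    labels = [0 if abs(xi - centers[0]) <= abs(xi - centers[1]) else 1 for xi in x]
    return centers, labels


# Supervised: the targets are known and the model is trained to reproduce them
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [3.0, 5.0, 7.0, 9.0]   # invented data following y = 2x + 1
slope, intercept = fit_line(x_train, y_train)
print(slope, intercept)

# Unsupervised: only observations are given; the groups are discovered
x_unlabeled = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, labels = two_means(x_unlabeled)
print(centers, labels)
```

The fit recovers a slope of 2 and an intercept of 1 because the targets were supplied; the clustering finds two groups centered near 1 and 5 without ever being told that they exist.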

<hr style="border:2px solid gray"></hr>

# Deep Learning
- **Supervised Learning**: trained on <font color='dodgerblue'>labeled data</font>, with specified goals (clearly defined <font color='dodgerblue'>prediction/classification</font> goals)
    - learn a <font color='dodgerblue'>mapping</font> from <font color='dodgerblue'>input features</font> to known <font color='dodgerblue'>output labels</font>

<br>

- **Unsupervised Learning**: trained on <font color='dodgerblue'>unlabeled data</font>, with specified goals (discovery and insights)
    - discover hidden <font color='dodgerblue'>patterns, structures, relationships</font>, or <font color='dodgerblue'>insights</font> within the data itself.

<br>

- **Reinforcement Learning**: learning through interaction with an environment, with the aim of maximizing rewards (no labeled data, no specified goal)

<br>

#### Neural network
- **Input Layer**: <font color='dodgerblue'>features (observables)</font> should have some degree of correlation (i.e., structure; nonlinear relationships)
- Encoder: input $\rightarrow$ hidden layers (<font color='dodgerblue'>data reduction</font>)
- **Hidden Layer**: a <font color='dodgerblue'>compressed knowledge representation</font> of the original input
- Decoder: hidden layers $\rightarrow$ <font color='dodgerblue'>output</font>
- **Output Layer**

<p><img alt="neural network" width="800" src="00_images/31_machine_learning/deep_neural_network.png" align="center" hspace="10px" vspace="0px"></p>

Image Source: https://www.studytonight.com/post/understanding-deep-learning
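
The encoder/decoder flow listed above can be sketched as a single forward pass. This is an illustrative toy in pure Python with random, untrained weights; the layer sizes (4 → 2 → 4), the `tanh` activation, and the helpers `dense` and `random_layer` are assumptions chosen for the example, not something prescribed by the lecture or by any library.

```python
import math
import random

random.seed(0)  # reproducible random weights


def dense(inputs, weights, biases, activation=math.tanh):
    """One fully connected layer: activation(W @ inputs + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def random_layer(n_in, n_out):
    """Create untrained weights and biases for a layer (n_in -> n_out)."""
    weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


n_features, n_hidden = 4, 2
enc_w, enc_b = random_layer(n_features, n_hidden)   # encoder: input -> hidden
dec_w, dec_b = random_layer(n_hidden, n_features)   # decoder: hidden -> output

x = [0.5, -0.2, 0.1, 0.9]                 # input layer (features)
hidden = dense(x, enc_w, enc_b)           # compressed representation
output = dense(hidden, dec_w, dec_b)      # output layer

print(len(x), len(hidden), len(output))   # layer sizes: 4 2 4
```

Training would adjust `enc_w`, `enc_b`, `dec_w`, and `dec_b` so that the output matches a target; here they stay random, so only the data flow and the layer sizes are meaningful.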

<br>

Types of Deep Learning Neural Networks (NNs):
1. Convolutional NNs (CNNs)
2. Recurrent NNs (RNNs)
    - Long Short-Term Memory (LSTM) Networks (a special type of RNN)
3. Generative Adversarial Networks (GANs)
4. Reinforcement Learning (RL) with Deep Learning (Deep RL)


## Python Libraries
1. <font color='dodgerblue'>**TensorFlow**</font>
    - open-source library
2. <font color='dodgerblue'>**Keras**</font> (integrated into TensorFlow: `tf.keras`)
    - designed for fast experimentation
3. <font color='dodgerblue'>**PyTorch**</font>
    - open-source library
    - easy to use (Pythonic)
    - easy prototyping

<div class="alert alert-block alert-warning">
<hr style="border:1.5px dashed gray"></hr>


#### Specific Example(s): Autoencoders - generative models (i.e., <font color='dodgerblue'>create new things</font>)

**Autoencoder** neural networks are unsupervised learning models (i.e., they use unlabeled input data). They **encode an input** (i.e., something that is human-relatable), **transform it into a different representation** within the latent space, and then **decode** it back into something **human-relatable**. This allows new things to be generated.


- https://www.jeremyjordan.me/autoencoders/
- <font color='dodgerblue'>Sparse</font> Autoencoder
    - **hidden** layers have the **same number of nodes** as the **input** and **output** layers
    - loss function includes a penalty for "activating" a node within the hidden layer

<br>

- <font color='dodgerblue'>Denoising</font> Autoencoder
    - slightly **corrupt** the **input data** (i.e., add noise) to help make the encoding/decoding more generalizable
    - **target data** remains **uncorrupted**
    - make the decoding (reconstruction function) insensitive to small changes in the input

<br>

- <font color='dodgerblue'>Contractive</font> Autoencoder
    - make the **encoding** (feature extraction function) **less sensitive** to **small changes** within the **input data**
    - learn similar encoding (hidden layer) for different inputs that vary slightly

<br>

- <font color='dodgerblue'>Variational</font> Autoencoder (VAE)
    - https://arxiv.org/abs/1606.05908
    - training using **backpropagation** (aka **backward propagation of error**)
        - backpropagation - https://www.ibm.com/think/topics/backpropagation
        - starting from the **output**, compute the **gradient** of the model's **error** (i.e., the loss function, which compares predicted and target values) with respect to each neural network **parameter**; the gradient measures how strongly that parameter influences the error
    - encoding is **regularized** (i.e., a penalty term is added to the model's loss function during training) to ensure that the latent space has good properties (thus allowing new data to be generated from it)
        - regularization - https://en.wikipedia.org/wiki/Regularization_(mathematics)
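
To make backpropagation and regularization concrete, the following minimal sketch uses a hypothetical one-parameter model $\hat{y} = w x$ (not an autoencoder) with a squared-error loss plus an L2 penalty. The gradient that backpropagation would deliver is written out by hand with the chain rule and checked against a finite-difference estimate. The names `loss`, `grad`, and the penalty weight `lam` are assumptions made for this illustration.

```python
def loss(w, x, y, lam):
    """Squared error of the prediction plus an L2 regularization penalty."""
    y_hat = w * x
    return (y_hat - y) ** 2 + lam * w ** 2


def grad(w, x, y, lam):
    """d(loss)/dw via the chain rule: error term plus penalty term."""
    y_hat = w * x
    return 2 * (y_hat - y) * x + 2 * lam * w


x, y, lam = 2.0, 3.0, 0.1   # one invented data point and penalty weight
w = 0.0                      # initial parameter value
learning_rate = 0.05

# Check the analytic gradient against a central finite difference
eps = 1e-6
numeric = (loss(w + eps, x, y, lam) - loss(w - eps, x, y, lam)) / (2 * eps)
print(grad(w, x, y, lam), numeric)

# Gradient descent: repeatedly step against the gradient
for _ in range(200):
    w -= learning_rate * grad(w, x, y, lam)

print(w)
```

Without the penalty, the best parameter would be $w = y/x = 1.5$; with it, the optimum is $w = xy/(x^2 + \lambda) \approx 1.463$, which shows how regularization pulls parameters toward smaller values.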


<hr style="border:1.5px dashed gray"></hr>
<!-- - Generative Adversarial Networks (GANs)
    - two networks oppose each other (a generator and a discriminator), for which both iteratively improve -->

# General Workflow For Model Creation (and Prediction)

1. **Understand your goal** - do you want to predict
    - Categorical data (i.e., noncontinuous data)
    - Numerical data (i.e., continuous data)

<br>

2. **Data**
    - Collect, clean (e.g. drop rows with missing data) and adjust magnitudes (e.g., **normalize**)
    - Determine training features versus target features (i.e. what you want to predict)
        - Training features (**independent variables** -- x-axis data)
        - Target features (**dependent variable(s)** -- y-axis data)
    - Encode any categorical data present (i.e., provide numerical values)
    - Data splitting (**training and test sets**)

<br>

3. **Model Exploration**
    - Choose several models to try 
    - Start with the default parameters ("hyperparameters")
    - Identify good candidates (see #4 below)
    - Optimize hyperparameters

<br>

4. **Model Evaluation and Determination**
    - Choose and compute different metrics

<br>

5. **Apply Model using New Data**
    - I.e., make predictions
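
The data-preparation step (step 2) can be sketched on a tiny, invented data set using only the standard library. The helper names (`min_max_normalize`, `one_hot`, `split`) and the 75/25 split are hypothetical choices for illustration; in practice scikit-learn offers equivalents such as `MinMaxScaler`, `OneHotEncoder`, and `train_test_split`.

```python
import random

# Invented raw data: (size, color, price); None marks a missing value
rows = [
    (10.0, "red", 1.0),
    (20.0, "blue", 2.1),
    (None, "red", 2.9),      # will be dropped during cleaning
    (30.0, "green", 3.2),
    (40.0, "blue", 3.9),
]

# Clean: drop rows with missing data
clean = [r for r in rows if None not in r]


def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def one_hot(values):
    """Encode categories as 0/1 vectors (one column per category)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


sizes = min_max_normalize([r[0] for r in clean])
colors = one_hot([r[1] for r in clean])

# Training features (independent variables) and target (dependent variable)
features = [[s] + c for s, c in zip(sizes, colors)]
target = [r[2] for r in clean]


def split(features, target, test_fraction=0.25, seed=0):
    """Shuffle the row indices, then split into training and test sets."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([features[i] for i in train_idx], [target[i] for i in train_idx],
            [features[i] for i in test_idx], [target[i] for i in test_idx])


x_train, y_train, x_test, y_test = split(features, target)
print(len(x_train), len(x_test))
```

Shuffling before splitting matters whenever the rows are ordered (as they are here, by size); otherwise the test set would contain only the largest or smallest items.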

## Trained Model Evaluation

### Classification-Data Metrics:

Evaluating classification models: How well do they assign instances to the correct class?

1. <font color='dodgerblue'>**Accuracy**</font>: The proportion of correctly classified instances out of the total instances.

    **Caution**: Can be misleading with imbalanced datasets (e.g., 95% of data is in one class).

<br>

2. <font color='dodgerblue'>**Confusion Matrix**</font> (foundation for many other metrics): A summarizing performance matrix that shows how many:
    - True Positives (**TP**): A model's outcome that **correctly** predicted the **positive** (e.g., "yes"; "with disease") class
    - True Negatives (**TN**): A model's outcome that **correctly** predicted the **negative** (e.g., "no"; "without disease") class
    - False Positives (**FP**) (a.k.a. Type I error): A model's outcome that **incorrectly** predicted the **positive** class
    - False Negatives (**FN**) (a.k.a. Type II error): A model's outcome that **incorrectly** predicted the **negative** class

<br>

|               | **Actual (target) Positive** | **Actual (target) Negative** |
| :------------ | :------------------ | :------------------ |
| **Predicted Positive** | # True Positive (TP)  | # False Positive (FP) |
| **Predicted Negative** | # False Negative (FN) | # True Negative (TN)  |

An example for a "good" model (i.e., F1 score = 0.86):

|               | **Actual (target) Positive** | **Actual (target) Negative** |
| :------------ | :------------------ | :------------------ |
| **Predicted Positive** | 150  | 30 |
| **Predicted Negative** | 20 | 200 |
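
The four counts can be tallied directly from paired lists of actual and predicted labels. The sketch below (pure Python; `confusion_counts` is a hypothetical helper and the labels are invented) also illustrates the accuracy caution: a model that always predicts the negative class on a data set that is 95% negative scores 95% accuracy while finding no positives at all.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # correctly predicted positive
        elif p == positive and a != positive:
            fp += 1          # Type I error
        elif p != positive and a == positive:
            fn += 1          # Type II error
        else:
            tn += 1          # correctly predicted negative
    return tp, fp, fn, tn


# Imbalanced, invented data: 5 positives and 95 negatives
actual = [1] * 5 + [0] * 95
always_negative = [0] * 100          # a "model" that never predicts positive

tp, fp, fn, tn = confusion_counts(actual, always_negative)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(tp, fp, fn, tn)   # 0 0 5 95
print(accuracy)         # 0.95
```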

<br>

3. <font color='dodgerblue'>**Precision**</font>: Of all instances predicted as positive, what proportion were actually positive?
    - Formula: $\Large\frac{TP}{TP+FP}$
    - Use: When minimizing false positives is critical (e.g., a medical diagnosis that leads to an invasive treatment, where a false alarm is costly).

<br>

4. <font color='dodgerblue'>**Recall**</font> (a.k.a. Sensitivity or True Positive Rate): Of all actual positive instances, what proportion did the model correctly identify?
    - Formula: $\Large\frac{TP}{TP+FN}$
    - Use: When minimizing false negatives is critical (e.g., fraud detection where missing actual fraud is very costly, or disease screening where missing a sick patient is dangerous).

<br>

5. <font color='dodgerblue'>**F1-Score**</font>: The harmonic mean of precision and recall. (It balances the precision and recall metrics.)
    - Formula: $\Large\frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
    - Use: A single metric that balances precision and recall, especially where there is a class imbalance.
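
As a worked check, the formulas can be applied to the counts from the example confusion matrix (TP = 150, FP = 30, FN = 20, TN = 200). The function name `classification_metrics` is a hypothetical helper for this sketch; scikit-learn provides comparable functions in `sklearn.metrics`.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts taken from the example confusion matrix
accuracy, precision, recall, f1 = classification_metrics(tp=150, fp=30, fn=20, tn=200)

print(f"accuracy  = {accuracy:.3f}")   # 0.875
print(f"precision = {precision:.3f}")  # 0.833
print(f"recall    = {recall:.3f}")     # 0.882
print(f"F1 score  = {f1:.3f}")         # 0.857
```

The F1 score of 0.857 rounds to the 0.86 quoted for the example, and it lies between precision and recall, as expected for a harmonic mean.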

<!-- 
<br>

6. <font color='dodgerblue'>**ROC Curve** and AUC</font> (Area Under the Receiver Operating Characteristic Curve):
    - ROC Curve: Plots the True Positive Rate (i.e., Recall) against the False Positive Rate at various classification thresholds.
    - AUC: Overall classifier performance,irrespective of the classification threshold. (A higher AUC indicates better discrimination between classes.)
    - Use: evaluate the model's ability to distinguish between classes across all possible thresholds (good for imbalanced datasets).

<br>

7. <font color='dodgerblue'>**Log Loss**</font> (Cross-Entropy Loss): Measures the performance of a classification model where the prediction is a probability. It penalizes confident wrong predictions heavily.
    - Use: You need to evaluate probabilistic outputs from models like logistic regression or neural networks. -->

### Numerical-Data Metrics
These are typically regression problems (e.g., predicting a house price, temperature, or age).

1. <font color='dodgerblue'>**Mean Absolute Error (MAE)**</font>:
    - Formula: $\Large MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| $, where $\Large y_i$ is the target value and $\Large\hat{y}_i$ is the predicted value
    - Interpretation: It tells you the average magnitude of the errors.
    - Strengths:
        - Easy to understand and interpret.
        - Robust to outliers.
        - Units are those of the target variable's units (i.e., intuitive)
    - Weaknesses:
        - Not differentiable at zero error (less convenient as a loss function in gradient-based optimization algorithms).

<br>

2. <font color='dodgerblue'>**Mean Squared Error (MSE)**</font>:
    - Formula: $\Large MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $
    - Interpretation: Penalizes larger errors more heavily than smaller ones.
    - Strengths:
        - Differentiable (e.g., usable in Linear Regression, Neural Networks)
        - Penalizes large errors.
    - Weaknesses:
        - Units are the square of the target variable's units.
        - Highly sensitive to outliers.

<br>

3. <font color='dodgerblue'>**Root Mean Squared Error (RMSE)**</font>:
    - Formula: $\Large RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 } $
    - Interpretation: It brings the error back to the original units of the target variable.
    - Strengths:
        - Commonly used and widely understood.
        - Penalizes large errors more than MAE.
        - Units are those of the target variable's units.
    - Weaknesses:
        - Sensitive to outliers (less so than raw MSE).

<br>

4. <font color='dodgerblue'>**R-squared (R2)**</font> (a.k.a. Coefficient of Determination):
    - Formula: $\Large R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $
    - Interpretation: It measures how well the model explains the variability of the target variable.
    - Strengths:
        - Provides a relative measure of fit, ranging from $-\infty$ (very poor) to $0$ (poor) to $1$ (a perfect fit)
        - Easy to understand as a percentage
    - Weaknesses:
        - Can increase simply by adding more features (this can lead to overfitting).
        - Doesn't tell you the errors' magnitude in the original units.
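
The four regression metrics can be computed side by side to see how differently they react to a single large error. This is a minimal pure-Python sketch with invented numbers; `regression_metrics` is a hypothetical helper, and scikit-learn offers comparable functions in `sklearn.metrics`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R2 for paired target/predicted values."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                  # residual sum of squares
    ss_tot = sum((yt - mean_true) ** 2 for yt in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]       # invented target values
y_good = [1.0, 2.0, 3.0, 4.0, 6.0]       # one small error
y_outlier = [1.0, 2.0, 3.0, 4.0, 10.0]   # one large error (outlier)

print(regression_metrics(y_true, y_good))
print(regression_metrics(y_true, y_outlier))
```

With these numbers, the single large error raises MAE by a factor of 5 (0.2 to 1.0) but MSE by a factor of 25 (0.2 to 5.0), and R2 drops from 0.9 to -1.5, which illustrates both the outlier sensitivity of the squared metrics and the fact that R2 can be negative.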

<!-- Adjusted R-squared

Formula: $R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$, where $n$ is the number of data points and $p$ is the number of predictors (features).
Interpretation: Adjusted R-squared accounts for the number of predictors in the model. It will only increase if the new features significantly improve the model, penalizing the addition of irrelevant features.
Strengths:
    More reliable than R2 for comparing models with different numbers of predictors.
Weaknesses:
    Still a relative measure.

Mean Absolute Percentage Error (MAPE)

Formula: $MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%$
Interpretation: MAPE expresses the error as a percentage of the actual value. This makes it useful for comparing models across different scales.
Strengths:
    Scale-independent, good for comparing performance across different datasets or models where the target variable has different magnitudes.
    Easy to understand as a percentage.
Weaknesses:
    Undefined if $y_i$ is zero.
    Can heavily penalize errors when $y_i$ is very small.
    Asymmetric (penalizes over-predictions differently from under-predictions).

Root Mean Squared Logarithmic Error (RMSLE)

Formula: $RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(y_i + 1) - \log(\hat{y}_i + 1) \right)^2}$
Interpretation: RMSLE measures the ratio between actual and predicted values rather than the difference. It penalizes under-predictions more heavily than over-predictions and is robust to outliers, especially when the target variable has a wide range of values.
Strengths:
    Useful when you care about percentage errors, not just absolute errors (e.g., predicting prices where a $10 error on a $100 item is much worse than on a $1,000,000 item).
    Less sensitive to large errors than RMSE.
Weaknesses:
    Cannot be used if $y_i$ or $\hat{y}_i$ is negative.
    The interpretation isn't as straightforward as MAE or RMSE. -->

 <font color='dodgerblue'>**Take Home Message**</font>: Each metric does something slightly **different**, and you have to **use it** and **discuss** it in the proper **context**.