# Some important (high-level) considerations

When applying a machine learning algorithm to a dataset, several considerations are crucial. This supplementary document covers some of the basic choices that matter for your problem.



## Problem types *or* Goals

```{figure} ../img/ml-goals1.png
---
width: 70%
name: ml-goals1
---
Goals in building a model (image source: Stefano Tempesta).
```

```{figure} ../img/ml-goals2.png
---
width: 70%
name: ml-goals2
---
Goals in building a model, continued (image source: Stefano Tempesta).
```

## Learning algorithms

### Supervised learning
To model relationships and dependencies between input and output.

**Regression**

For example, can we predict the mass of a penguin given its other characteristics?

In [None]:
import seaborn as sns
import pandas as pd
sns.set_palette('bright')

penguins = sns.load_dataset('penguins')
penguins = penguins[~penguins.isna().any(axis='columns')]
penguins = penguins.sort_values('flipper_length_mm')

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error

X = penguins[['flipper_length_mm']]
y = penguins['body_mass_g']

# Linear Regression
lr = LinearRegression()
lr.fit(X, y)
y_pred_lr = lr.predict(X)
print(f"Linear Regression RMSE: {root_mean_squared_error(y, y_pred_lr)}")

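# Nearest Neighbors (default hyperparameters, filled in here as an illustrative assumption)
knn = KNeighborsRegressor()
knn.fit(X, y)
y_pred_knn = knn.predict(X)
print(f"kNN RMSE: {root_mean_squared_error(y, y_pred_knn)}")

# Random Forest (default hyperparameters, filled in here as an illustrative assumption)
rf = RandomForestRegressor(random_state=0)
rf.fit(X, y)
y_pred_rf = rf.predict(X)
print(f"Random Forest RMSE: {root_mean_squared_error(y, y_pred_rf)}")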

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

sns.scatterplot(y='body_mass_g', x='flipper_length_mm', color='k', data=penguins, ax=ax[0])

y_preds = [y_pred_lr, y_pred_knn, y_pred_rf]
labels = ['linear reg.', 'kNN', 'RF']
linestyles = ['-', '--', ':']
markerstyles = ['.', 'D', 'x']

for j, y_pred in enumerate(y_preds):
    ax[0].plot(X, y_pred, label=labels[j], linestyle=linestyles[j])
    ax[1].scatter(X, y - y_pred, label=labels[j], marker=markerstyles[j], alpha=0.5)
ax[0].legend()
ax[1].legend()
plt.tight_layout()

**Classification**

For example, can we predict where a penguin lives given its other characteristics?

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

# Define inputs and outputs
penguins = penguins.sample(frac=1)
X = penguins.drop("island", axis=1)
y = penguins["island"]

# Encode categorical variables
enc = LabelEncoder()
y = enc.fit_transform(y)
X = pd.get_dummies(X)

In [None]:
models = [LogisticRegression, SVC, RandomForestClassifier]
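One way to compare the classifiers listed above is k-fold cross-validation. The following is a minimal sketch, not the author's procedure: it assumes 5 folds, accuracy as the metric, default hyperparameters, and a standardization step, and it uses a synthetic three-class dataset (`X_demo`, `y_demo`, hypothetical names) in place of the encoded penguin table so that it runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded penguin table (assumption: 3 classes, like the 3 islands)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

models = [LogisticRegression, SVC, RandomForestClassifier]

cv_scores = {}
for model_cls in models:
    # Scale features first; logistic regression and SVC are sensitive to feature scale
    clf = make_pipeline(StandardScaler(), model_cls())
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[model_cls.__name__] = scores
    print(f"{model_cls.__name__}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With the penguin data, `X_demo` and `y_demo` would be replaced by the `X` and `y` defined in the previous cell.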

### Unsupervised learning
To identify structure or relationships.

**Clustering**

For example, can we group the penguins to identify the species using their characteristics?

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix

In [None]:
# Define input and output
X = penguins.drop(["species", "island", 'sex'], axis=1)
y = penguins["species"]

# Encode categorical variables
enc = LabelEncoder()
y = enc.fit_transform(y)

In [None]:
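# clustering fitting and prediction
# Three clusters, one per species; n_init and random_state are illustrative choices
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
y_pred = kmeans.fit_predict(X)

# Cluster ids are arbitrary, so they need not match the encoded species labels
penguins['species_pred'] = y_pred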

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.scatterplot(y='body_mass_g', x='flipper_length_mm', hue='species', data=penguins, ax=ax[0])
sns.scatterplot(y='body_mass_g', x='flipper_length_mm', hue='species_pred', data=penguins, ax=ax[1])

### Semi-supervised learning
Only some outputs are labeled and most are not; this setting arises most often in classification problems.

```{figure} ../img/ex-semi-supervised.png
---
width: 80%
name: ex-semi-supervised
---
Example of a semi-supervised learning model {cite:p}`berthelot2019mixmatch`.
```
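As a rough illustration of the semi-supervised setting, the sketch below hides most labels of a synthetic dataset and lets a graph-based method infer them. It is not the MixMatch method from the figure. It uses scikit-learn's `LabelSpreading`, and the dataset, the 10% labeling rate, and the `knn` kernel settings are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

# Synthetic 3-class data standing in for a mostly unlabeled dataset
X_demo, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# Keep 10% of the labels; scikit-learn marks unlabeled samples with -1
rng = np.random.default_rng(0)
y_partial = np.full_like(y_true, -1)
labeled_idx = rng.choice(len(y_true), size=30, replace=False)
y_partial[labeled_idx] = y_true[labeled_idx]

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X_demo, y_partial)

# transduction_ holds the label inferred for every training sample
unlabeled = y_partial == -1
accuracy_unlabeled = (model.transduction_[unlabeled] == y_true[unlabeled]).mean()
print(f"Accuracy on originally unlabeled points: {accuracy_unlabeled:.3f}")
```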

### Reinforcement learning
The algorithm learns by acting and observing reward.  The goal is to identify an "optimal" policy.

```{figure} ../img/reinforcement-learning.png
---
width: 70%
name: reinforcement
---
Generic schematic of a reinforcement learning model.
```
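The act, observe, update loop can be sketched with tabular Q-learning on a toy problem. Everything here is an assumption made for illustration: a five-state chain where the agent starts on the left and receives a reward of 1 only on reaching the rightmost state, an epsilon-greedy exploration rule, and the particular values of the learning rate, discount, and exploration rate.

```python
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
goal = n_states - 1
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate


def step(state, action):
    """Move along the chain; the reward is 1 only when the goal state is reached."""
    next_state = max(state - 1, 0) if action == 0 else min(state + 1, goal)
    reward = 1.0 if next_state == goal else 0.0
    return next_state, reward, next_state == goal


rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

for episode in range(500):
    state = 0
    for _ in range(100):  # cap on episode length
        # Epsilon-greedy: explore at random with probability epsilon, else act greedily
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            # Break ties at random so untrained states do not always pick "left"
            best = np.flatnonzero(Q[state] == Q[state].max())
            action = int(rng.choice(best))
        next_state, reward, done = step(state, action)
        # Q-learning update; a terminal state contributes no future value
        target = reward + (0.0 if done else gamma * Q[next_state].max())
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
        if done:
            break

# Greedy policy read off the learned action values
policy = Q.argmax(axis=1)
print(policy[:goal])
```

The learned policy is the greedy action in each state; in this toy chain the optimal policy is to always move right.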

## Training, testing, and validation

For a brief, interactive introduction, see https://mlu-explain.github.io/train-test-validation/.
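A minimal sketch of a three-way split follows, assuming a 60/20/20 partition, a ridge regression model, and synthetic data; none of these choices come from the linked page. The validation set is used to pick a hyperparameter, and the test set is touched only once at the end.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 60% train, 20% validation, 20% test (the proportions are an illustrative choice)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0
)

# Use the validation set to choose a hyperparameter
val_rmse = {}
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_rmse[alpha] = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
best_alpha = min(val_rmse, key=val_rmse.get)

# Report the held-out test error once, for the selected model only
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_rmse = mean_squared_error(y_test, final_model.predict(X_test)) ** 0.5
print(f"best alpha: {best_alpha}, test RMSE: {test_rmse:.2f}")
```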

## Regularization and hyperparameter tuning

Example: regularized variants of linear regression, each of which adds a penalty on the weights $w$ scaled by a hyperparameter $\alpha$.

Lasso ($\ell_1$):

$$\min_{w} { \frac{1}{2n} ||X w - y||_2 ^ 2 + \alpha ||w||_1}$$

Ridge ($\ell_2$):

$$\min_{w} { \frac{1}{2n} ||X w - y||_2 ^ 2 + \alpha ||w||_2 ^ 2}$$

Elastic Net ($\ell_1$ and $\ell_2$ combined, with mixing parameter $\rho \in [0, 1]$):

$$\min_{w} { \frac{1}{2n} ||X w - y||_2 ^ 2 + \alpha \rho ||w||_1 +
\frac{\alpha(1-\rho)}{2} ||w||_2 ^ 2}$$

```{figure} ../img/lasso-ridge.png
---
width: 70%
name: lasso-ridge
---
Lasso vs. ridge regularization {cite:p}`efron2021computer`.
```
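The practical difference between the penalties can be seen by counting coefficients that are exactly zero. This is a sketch under stated assumptions: synthetic data in which only 3 of 10 features matter, and an arbitrary `alpha = 5.0`. In scikit-learn, `l1_ratio` plays the role of $\rho$. Note that scikit-learn's `Ridge` uses an unscaled squared-error term (no $\frac{1}{2n}$ factor), so its `alpha` is not on the same scale as in the formula above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data: only 3 of the 10 features actually influence y
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

alpha = 5.0  # regularization strength (illustrative value)
models = {
    "lasso": Lasso(alpha=alpha),
    "ridge": Ridge(alpha=alpha),
    "elastic net": ElasticNet(alpha=alpha, l1_ratio=0.8),  # l1_ratio corresponds to rho
}

n_zero = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)
    n_zero[name] = int(np.sum(model.coef_ == 0))
    print(f"{name}: {n_zero[name]} of {model.coef_.size} coefficients are exactly zero")
```

The $\ell_1$ penalty can set coefficients exactly to zero, while the $\ell_2$ penalty only shrinks them toward zero.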

In [None]:
from sklearn.linear_model import lasso_path, enet_path

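# Exclude the target, plus the cluster labels added earlier (they were derived from body_mass_g)
X = penguins.drop(columns=['body_mass_g', 'species_pred'], errors='ignore')
X = pd.get_dummies(X)

y = penguins['body_mass_g']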

In [None]:
# Compute the regularization paths for the lasso and the elastic net
eps = 5e-4  # path length: ratio alpha_min / alpha_max
alphas_lasso, coefs_lasso, _ = lasso_path(X, y, eps=eps)
alphas_enet, coefs_enet, _ = enet_path(X, y, eps=eps, l1_ratio=0.8)

In [None]:
import numpy as np 
from itertools import cycle

plt.figure(1)
colors = cycle(["b", "r", "g", "c", "k"])
neg_log_alphas_lasso = -np.log10(alphas_lasso)
neg_log_alphas_enet = -np.log10(alphas_enet)
for coef_l, coef_e, c in zip(coefs_lasso, coefs_enet, colors):
    l1 = plt.plot(neg_log_alphas_lasso, coef_l, c=c)
    l2 = plt.plot(neg_log_alphas_enet, coef_e, linestyle="--", c=c)

plt.xlabel("-log10(alpha)")
plt.ylabel("coefficients")
plt.title("Lasso and Elastic-Net Paths")
plt.legend((l1[-1], l2[-1]), ("Lasso", "Elastic-Net"), loc="lower left")
plt.axis("tight")

In [None]:
X.columns

In [None]:
# Coefficients at the last (smallest) alpha on the path; index 99 with the default 100 alphas
coefs_lasso[:, -1]
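A regularization path shows the coefficients for every $\alpha$; choosing a single value is a hyperparameter-tuning problem. One common approach is cross-validation, sketched here with scikit-learn's `LassoCV`. The synthetic data and the choice of 5 folds are assumptions made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data: only 3 of the 10 features actually influence y
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# 5-fold cross-validation over an automatically generated grid of alphas
lasso_cv = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)

print(f"selected alpha: {lasso_cv.alpha_:.4f}")
print(f"nonzero coefficients: {(lasso_cv.coef_ != 0).sum()} of {lasso_cv.coef_.size}")
```

`lasso_cv.mse_path_` holds the validation error for each alpha and fold, which can be plotted against `lasso_cv.alphas_` to see how the selection was made.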

## One step toward automating machine learning model selection

In [None]:
from autogluon.tabular import TabularDataset, TabularPredictor

penguins = sns.load_dataset('penguins')
penguins = penguins[~penguins.isna().any(axis='columns')]

### Regression

### Classification