# Machine Learning Compendium

The aim of this chapter is to understand and visualize how different machine learning algorithms work by applying them to example data. Knowing the advantages and disadvantages of each method is essential when choosing a model. In addition, finding the best model always involves some trial and error. The validation step in particular is necessary to decide whether the chosen method is appropriate and performs well enough.

`sklearn` provides comparison overviews of different [classifiers](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py) (supervised learning) and [clustering](http://scikit-learn.org/stable/modules/clustering.html) (unsupervised learning) methods. Here, we will focus on some of them.

In [None]:
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

In [None]:
plt.rcParams["figure.figsize"] = (16.0, 8.0)
plt.style.use("ggplot")

In [None]:
# Helper functions
from ml_functions import decision_boundaries, svm_decision_function

#### Colorcode
There are several [different colormaps](https://matplotlib.org/users/colormaps.html), each suited to a particular purpose. We are going to use either `viridis`, as it has a strong contrast, or `Set1`.

In [None]:
colorcode = "viridis"
# colorcode = 'Set1'

## Generating Example Data

Several [sample generators](http://scikit-learn.org/stable/modules/classes.html#samples-generator) come with `sklearn`, and we can use them to illustrate different ML methods. In the following, we use these types of example data points:

- **Circles**: Two concentric circles whose noise is of the same magnitude as the distance between them. Separating them is a tough task.

- **Blobs**: Three Gaussian-distributed blobs with good separation. Most models should be able to tell them apart.

- **Moons**: Two half moons with low noise level (`noise=0.09`). Both distributions are intertwined but they do not overlap. They will show how important tuning and validation are.

- **Silly moons**: As above but with a higher noise level (`noise=0.3`). Those distributions show the differences between overfitted and well-performing models.



In [None]:
from sklearn.datasets import make_circles, make_moons, make_blobs

In [None]:
n_samples = 200
rdm_seed = 42

X_circle, y_circle = make_circles(
    n_samples=n_samples, noise=0.1, factor=0.8, shuffle=False, random_state=rdm_seed
)
X_blobs, y_blobs = make_blobs(
    n_samples=n_samples, centers=3, shuffle=False, random_state=rdm_seed
)
X_moons, y_moons = make_moons(
    n_samples=n_samples, noise=0.09, shuffle=False, random_state=rdm_seed
)
X_moons2, y_moons2 = make_moons(
    n_samples=n_samples, noise=0.3, shuffle=False, random_state=rdm_seed
)

data = {
    "Circles": (X_circle, y_circle),
    "Blobs": (X_blobs, y_blobs),
    "Moons": (X_moons, y_moons),
    "Silly moons": (X_moons2, y_moons2),
}

In [None]:
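# Function to plot generated data.
def plot_test_data(X, y, show_clusters=False, title="", ax=None):
    # Create a new figure only if no axes were passed in.
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 10))

    # Color by label if requested, otherwise use a single color.
    if show_clusters:
        c = y
    else:
        c = "dodgerblue"

    # Draw on the given axes; the colormap only applies to label colors.
    ax.set_title(title)
    ax.scatter(X[:, 0], X[:, 1], c=c, cmap=colorcode if show_clusters else None)
    ax.axis("off")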

In [None]:
# Plot the data.
show_clusters = False

plt.figure(figsize=(10, 10))
gs = gridspec.GridSpec(2, 2, height_ratios=[1, 1], width_ratios=[1, 1])
gs.update(wspace=0.2, hspace=0.2)

ax = plt.subplot(gs[0])
plot_test_data(X_circle, y_circle, show_clusters=show_clusters, title="Circles", ax=ax)
ax = plt.subplot(gs[1])
plot_test_data(X_blobs, y_blobs, show_clusters=show_clusters, title="Blobs", ax=ax)
ax = plt.subplot(gs[2])
plot_test_data(X_moons, y_moons, show_clusters=show_clusters, title="Moons", ax=ax)
ax = plt.subplot(gs[3])
plot_test_data(
    X_moons2, y_moons2, show_clusters=show_clusters, title="Silly moons", ax=ax
)

### Task

- We can plot the test data with the given cluster assignment (**label**). To do so, set the **`show_clusters`** variable to `True`.

- Go back to the sample generators and change parameters like **`n_samples`** or **`noise`** to see what happens.

In [None]:
# It's your turn!

When we do not know which cluster each data point belongs to, a typical machine learning task is to find possible clusters. This kind of task is called unsupervised learning.

## Unsupervised Learning - Clustering

**Clustering methods** help to detect clusters in unlabeled data sets:

**K-Means** needs the number ***K*** of clusters as input. The algorithm picks *K* initial cluster centers as starting points. In each iteration, every data point is assigned to the nearest center, and each center is then recalculated as the mean of its assigned members. K-Means performs very well on well-separated data with "spherical" clusters.

The **Gaussian Mixture** algorithm also requires the number *K* of clusters as input. The method fits a mixture of multidimensional Gaussian distributions to the data set. It can therefore handle "non-spherical" clusters.

With the clustering algorithm **[DBSCAN](https://en.wikipedia.org/wiki/DBSCAN)** (**D**ensity-**B**ased **S**patial **C**lustering of **A**pplications with **N**oise) we do not set the number of clusters, but two algorithm parameters that determine which clusters are found.
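
The K-Means iteration described above (assign each point to the nearest center, then move each center to the mean of its members) can be sketched in plain NumPy. This is a minimal illustration and not how `sklearn.cluster.KMeans` is implemented; the function name `kmeans_sketch` and the toy data are made up for this example.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=10, seed=0):
    """Plain K-Means: assign points to the nearest center, then re-average."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (fancy indexing copies).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance of every point to every center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers


# Two well-separated toy groups (made-up numbers for illustration).
X_toy = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
)
labels, centers = kmeans_sketch(X_toy, k=2)
print(labels)
print(centers)
```

Which group receives label 0 depends on the random starting centers, so only the grouping itself is meaningful.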

### K-Means

In [None]:
from sklearn.cluster import KMeans

In [None]:
# Model definition
n_clusters = 3
max_iterations = 10

model = KMeans(n_clusters, max_iter=max_iterations, init="random", n_init=1)

# Layout
plt.figure(figsize=(12, 12))
gs = gridspec.GridSpec(2, 2, height_ratios=[1, 1], width_ratios=[1, 1])
gs.update(wspace=0.2, hspace=0.2)
cmap = plt.get_cmap("Set1")

for i, d in enumerate(data.keys()):
    # Set input and target (not used)
    X, y = data[d]
    # Model fitting
    model.fit(X)
    predictions = model.predict(X)
    # Visualization
    ax = plt.subplot(gs[i])
    ax.set_title(f"{d} with K-Means")
    ax.scatter(X[:, 0], X[:, 1], c=cmap(predictions))
    ax.scatter(
        model.cluster_centers_[:, 0],
        model.cluster_centers_[:, 1],
        c="black",
        s=200,
        label="Cluster centers",
    )
    ax.legend()

### Gaussian Mixture

The results of a [**Gaussian Mixture**](http://scikit-learn.org/stable/modules/mixture.html) algorithm look similar to the K-Means clusters, but elliptical distributions can also be handled. Correlated data typically show this type of distribution.
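
As a small sketch of what "elliptical" means in practice, the example below stretches two blobs with a linear map so that the features become correlated, then fits a `GaussianMixture` with one full covariance matrix per component. The stretch matrix and all variable names are arbitrary choices for illustration. Besides hard labels, the model also returns a soft assignment per component via `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two blobs, stretched by a linear map so that the features become correlated.
X_raw, _ = make_blobs(n_samples=200, centers=2, random_state=42)
stretch = np.array([[0.6, -0.6], [-0.4, 0.8]])  # arbitrary illustrative matrix
X_stretched = X_raw @ stretch

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=42)
gmm.fit(X_stretched)

hard_labels = gmm.predict(X_stretched)  # one cluster index per point
soft_labels = gmm.predict_proba(X_stretched)  # one probability per component

print(gmm.covariances_.shape)  # one 2x2 covariance matrix per component
print(soft_labels[:3].round(3))
```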

In [None]:
from sklearn.mixture import GaussianMixture

In [None]:
# Model definition
n_components = 2
model = GaussianMixture(n_components)

# Layout
plt.figure(figsize=(12, 12))
gs = gridspec.GridSpec(2, 2, height_ratios=[1, 1], width_ratios=[1, 1])
gs.update(wspace=0.2, hspace=0.2)
cmap = plt.get_cmap("Set1")

for i, d in enumerate(data.keys()):
    # Set input and target (not used)
    X, y = data[d]
    # Model fitting
    model.fit(X)
    predictions = model.predict(X)
    # Visualization
    ax = plt.subplot(gs[i])
    ax.set_title(f"{d} with Gaussian Mixture")
    ax.scatter(X[:, 0], X[:, 1], c=cmap(predictions))

### DBSCAN

[**DBSCAN**](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) determines the number of clusters automatically, as it depends on the chosen parameters. To get a reasonable result we have to tune them. The most important parameters are **eps** (the maximum distance between two points for them to count as neighbors, i.e. the radius) and **min_samples** (the minimum number of points within that radius), which together define the neighborhood.

To set up DBSCAN we need to know the general type of distribution. In addition, the size of the data set, and thereby its density, can strongly influence the choice of *eps*. A more detailed explanation is given [here](https://en.wikipedia.org/wiki/DBSCAN).
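
One common way to get a first guess for *eps* is to look at how far each point is from its `min_samples`-th nearest neighbor (counting the point itself). The sketch below computes these distances with `NearestNeighbors` and picks the 90th percentile as *eps*. That percentile is a heuristic assumption for this example, not a rule, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X_demo, _ = make_moons(n_samples=200, noise=0.09, random_state=42)
min_samples = 5

# Distance from every point to its min_samples-th nearest neighbor.
# The query points are the fitted points, so each point counts as its own
# first neighbor, which matches how DBSCAN counts min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_distances = np.sort(distances[:, -1])

# Heuristic choice (an assumption of this sketch): take the 90th percentile.
eps = float(np.percentile(k_distances, 90))

db = DBSCAN(eps=eps, min_samples=min_samples).fit(X_demo)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(f"eps={eps:.3f}, clusters={n_clusters}, noise points={n_noise}")
```

Plotting `k_distances` and looking for the point where the curve bends sharply is a more careful variant of the same idea.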

In [None]:
from sklearn.cluster import DBSCAN

In [None]:
# Model definition
model = DBSCAN(eps=0.163, min_samples=5)

# Layout
plt.figure(figsize=(12, 12))
gs = gridspec.GridSpec(2, 2, height_ratios=[1, 1], width_ratios=[1, 1])
gs.update(wspace=0.2, hspace=0.2)
cmap = plt.get_cmap("Set1")

for i, d in enumerate(data.keys()):
    # Set input and target (not used)
    X, y = data[d]
    # Model fitting
    predictions = model.fit_predict(X)
    # Visualization
    ax = plt.subplot(gs[i])
    ax.set_title(f"{d} with DBSCAN")
    cluster_points = predictions != -1
    noise_points = predictions == -1
    ax.scatter(
        X[cluster_points][:, 0],
        X[cluster_points][:, 1],
        c=cmap(predictions[cluster_points]),
    )
    ax.scatter(X[noise_points][:, 0], X[noise_points][:, 1], c="black", marker="x")

### Task

Try to find a set of parameters that clusters the different data sets correctly. You may want to go back to the data set generators to change them as well!



In [None]:
# Solutions for blobs and moons:
# model = DBSCAN(eps=2, min_samples=5) #Blobs
# model = DBSCAN(eps=0.2, min_samples=5) #Moons

## Supervised Learning

After a first attempt at clustering (unsupervised learning), we continue with the same data sets but apply **Supervised Learning**. Now the training is told which class each data point belongs to. There are different classifiers that can be applied to labeled data.

### Support Vector Machines

A [**Support Vector Machine**](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) (SVM) is an ML algorithm that classifies data by separating the classes with one or more (hyper)planes.

Complex distributions cannot be separated by straight lines. Therefore, the data set is transformed into a higher-dimensional space. In this new coordinate system there are *straight lines* (hyperplanes) that can separate the data easily. The data points closest to the separating plane are called ***support vectors***. The distance between the support vectors and the separating plane is maximized in order to leave as much *free space* as possible between the classes.
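
The idea of moving to a higher-dimensional space can be made explicit with a hand-written feature. In the sketch below, two concentric circles cannot be split by a line in 2D, but after adding the squared distance from the origin as a third feature a linear SVM can separate them. The data parameters and variable names are assumptions chosen for illustration. A kernel such as `rbf` achieves a similar effect without constructing the extra features by hand.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X_c, y_c = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=42)

# Linear SVM on the original two features.
linear_2d = SVC(kernel="linear").fit(X_c, y_c)
acc_2d = linear_2d.score(X_c, y_c)

# Explicit lift: add the squared distance from the origin as a third feature.
X_lifted = np.column_stack([X_c, (X_c**2).sum(axis=1)])
linear_3d = SVC(kernel="linear").fit(X_lifted, y_c)
acc_3d = linear_3d.score(X_lifted, y_c)

print(f"linear SVM, 2 features: training accuracy {acc_2d:.2f}")
print(f"linear SVM, 3 features: training accuracy {acc_3d:.2f}")
# The support vectors are the training points that define the margin.
print("support vectors in the lifted space:", linear_3d.support_vectors_.shape[0])
```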

In [None]:
from sklearn.svm import SVC as SVM

In [None]:
# Model definition - linear kernel
model = SVM(kernel="linear")

# Model fit
model.fit(X_moons, y_moons)

# Plot
plt.figure(figsize=(5, 5))
plt.title("Moons with SVM (linear)")
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap=colorcode)
svm_decision_function(clf=model, ax=plt.gca(), colors="black")

In [None]:
# Model definition - Radial basis function as kernel
model = SVM(kernel="rbf", gamma=2)

# Model fit
model.fit(X_moons, y_moons)

# Plot
plt.figure(figsize=(5, 5))
plt.title("Moons with SVM (rbf)")
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap=colorcode)
svm_decision_function(clf=model, ax=plt.gca(), colors="black")

### Task

Try out different kernel functions or data sets. In addition, vary the parameter **`gamma`**.

In [None]:
kernel = "rbf"
gamma = 0.5
model = SVM(kernel=kernel, gamma=gamma)

dataset = X_moons  # X_circle X_moons X_moons2 #X_blobs
target = y_moons  # y_circle y_moons y_moons2 #y_blobs

model.fit(dataset, target)

# Plot
plt.figure(figsize=(5, 5))
plt.scatter(dataset[:, 0], dataset[:, 1], c=target, cmap=colorcode)

# Use the generic boundary plot for the blobs data set.
if dataset is not X_blobs:
    svm_decision_function(clf=model, ax=plt.gca(), colors="black")
else:
    decision_boundaries(clf=model, ax=plt.gca())

### K-Nearest-Neighbors

The [**KNeighborsClassifier**](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) (kNN) method is a mathematically simple approach to classification. For each data point a majority decision is made: which class is the most frequent among its *k nearest neighbors*?

Training simply consists of storing all training data points. As simple as the calculation is, for big data sets and a large *k* it amounts to brute force.
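
The majority vote can be written out in a few lines of NumPy. This is a brute-force sketch for illustration only (the function name `knn_predict` and the toy data are made up); it assumes the labels are non-negative integers.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k):
    """Classify each query point by a majority vote among its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train).
    distances = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for every query point.
    nearest = np.argsort(distances, axis=1)[:, :k]
    # Majority vote; argmax resolves ties in favor of the lower label.
    return np.array([np.bincount(y_train[row]).argmax() for row in nearest])


X_train = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.5, 0.5], [5.5, 5.5], [2.0, 2.0]])

predicted = knn_predict(X_train, y_train, X_query, k=3)
print(predicted)  # [0 1 0]
```

Every prediction computes the distance to all training points, which is why the cost grows with the size of the training set.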

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Model definition
k = 1
model = KNeighborsClassifier(k)

# Model fit
model.fit(X_moons, y_moons)

# Plot
plt.figure(figsize=(5, 5))
plt.title(f"Moons with kNN (k={k})")
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap=colorcode)
decision_boundaries(clf=model, ax=plt.gca())

### Task
Try out different values for **`k`** (1, 5, 10, 20, 50, 100). Do you expect an improvement?

In [None]:
# Model definition
k = 1  # (1, 5, 10, 20, 50, 100)
model = KNeighborsClassifier(k)

dataset = X_moons  # X_circle X_moons X_moons2 #X_blobs
target = y_moons  # y_circle y_moons y_moons2 #y_blobs

model.fit(dataset, target)

# Plot
plt.figure(figsize=(5, 5))
plt.scatter(dataset[:, 0], dataset[:, 1], c=target, cmap=colorcode)

decision_boundaries(clf=model, ax=plt.gca())

### Decision Tree

A [**Decision Tree**](http://scikit-learn.org/stable/modules/tree.html) cuts the space into rectangular sections. The **`max_depth`** parameter is the most important one. It defines how many consecutive decisions may be made.
The blobs are separated by two lines, so a depth of two should be sufficient. With more and more *decisions*, curved separations can be approximated.

Already in this example we see a tendency towards overfitting at higher depth. The circles and silly moons show no smooth separation. New data following the same distributions are therefore likely to be misclassified in the boundary regions.
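
One way to make the overfitting visible in numbers is to hold back part of the data and compare the accuracy on the training part with the accuracy on the held-back part. The sketch below does this for several values of `max_depth` on a noisy moons data set. The split ratio, the depths, and the variable names are choices made for this illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_noisy, y_noisy = make_moons(n_samples=200, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y_noisy, test_size=0.3, random_state=42
)

scores = {}
for depth in (1, 2, 5, 10, None):  # None lets the tree grow until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for depth, (train_acc, test_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")
```

A training accuracy that keeps rising while the test accuracy stalls or drops is the typical sign of overfitting.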

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
model = DecisionTreeClassifier(max_depth=10)

plt.figure(figsize=(12, 12))
gs = gridspec.GridSpec(2, 2, height_ratios=[1, 1], width_ratios=[1, 1])
gs.update(wspace=0.2, hspace=0.2)

for i, d in enumerate(data.keys()):
    X, y = data[d]
    model.fit(X, y)

    ax = plt.subplot(gs[i])
    ax.set_title(f"{d} with Decision Tree")
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=colorcode)
    decision_boundaries(clf=model, ax=plt.gca())

### Random Forest

A [**Random Forest**](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) consists of many individual *Decision Trees*. The final decision is made by a majority vote of all trees, which also yields something like a probability for the decision. This reduces overfitting.
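
The "probability" of a forest decision can be inspected directly. Note that scikit-learn's `RandomForestClassifier` combines the trees by averaging their predicted class probabilities rather than by counting one hard vote per tree. The sketch below reproduces that average by hand for a single query point; the point and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X_rf, y_rf = make_moons(n_samples=200, noise=0.3, random_state=42)
forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_rf, y_rf)

point = np.array([[0.5, 0.25]])  # arbitrary query point between the two moons

# Class probabilities of every single tree, shape (n_trees, n_classes).
per_tree = np.array([tree.predict_proba(point)[0] for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_proba = forest.predict_proba(point)[0]

print("average over trees:", manual_average.round(3))
print("forest.predict_proba:", forest_proba.round(3))
```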

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model = RandomForestClassifier(n_estimators=100, max_depth=10)

plt.figure(figsize=(12, 12))
gs = gridspec.GridSpec(2, 2, height_ratios=[1, 1], width_ratios=[1, 1])
gs.update(wspace=0.2, hspace=0.2)

for i, d in enumerate(data.keys()):
    X, y = data[d]
    model.fit(X, y)

    ax = plt.subplot(gs[i])
    ax.set_title(f"{d} with Random Forest")
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=colorcode)
    decision_boundaries(clf=model, ax=plt.gca())

### Task
Try out different values of **`max_depth`** for the single Decision Tree and the Random Forest and compare the results. In the case of a Random Forest you can also set the number of individual trees with **`n_estimators`**.

### Multi Layer Perceptron (MLP)

Especially for larger training data sets and complex structures, a [**Multi-Layer Perceptron**](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) can be very useful. But simple data sets reveal some of its limits as well.

MLPs have a large number of (hyper)parameters that have to be set. With a single hidden layer and a small number of neurons (e.g. 10), only highly symmetrical distributions like the blobs and circles are classified well. We can extend an MLP in width (adding more neurons per layer) and in depth (adding more layers).
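
Width and depth are both controlled by `hidden_layer_sizes`: each entry of the tuple is one hidden layer, and its value is the number of neurons in that layer. The sketch below fits three architectures on the moons data and prints the shapes of the fitted weight matrices (`coefs_`). The specific architectures and names are examples chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X_mlp, y_mlp = make_moons(n_samples=200, noise=0.09, random_state=42)

architectures = {
    "one hidden layer, 10 neurons": (10,),
    "wider: one hidden layer, 100 neurons": (100,),
    "deeper: three hidden layers, 20 neurons each": (20, 20, 20),
}

weight_shapes = {}
for name, hidden in architectures.items():
    mlp = MLPClassifier(hidden_layer_sizes=hidden, max_iter=2000, random_state=42)
    mlp.fit(X_mlp, y_mlp)
    # One weight matrix per connection between consecutive layers.
    weight_shapes[name] = [w.shape for w in mlp.coefs_]
    accuracy = mlp.score(X_mlp, y_mlp)
    print(f"{name}: weights {weight_shapes[name]}, training accuracy {accuracy:.2f}")
```

The first matrix always has two rows here because the data has two input features.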

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
model = MLPClassifier(
    hidden_layer_sizes=10,  # try 10, 50, 100
    max_iter=10000,
    activation="relu",
    alpha=0.001,
)

plt.figure(figsize=(12, 12))
gs = gridspec.GridSpec(2, 2, height_ratios=[1, 1], width_ratios=[1, 1])
gs.update(wspace=0.2, hspace=0.2)

for i, d in enumerate(data.keys()):
    X, y = data[d]
    model.fit(X, y)

    ax = plt.subplot(gs[i])
    ax.set_title(f"{d} with MLP")
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=colorcode)
    decision_boundaries(clf=model, ax=plt.gca())

#### Task

Try to build a ***Deep Neural Network*** by adding additional layers. The parameter **`hidden_layer_sizes`** has to be changed for that purpose. Does it perform better on some data sets? Do you see first signs of overfitting?

In addition, you can change the activation function (**`activation`**).

## kNN (overview)

As mentioned before, kNN gives quite reasonable results. With *k*=20 all sample data sets show a good classification. The edges are almost smooth for all distributions. Only in the case of the circles does it show a higher uncertainty. kNN struggles with overlapping distributions.
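
Instead of judging *k* by eye, it can be compared with cross-validation. The sketch below scores several values of *k* on a noisy moons data set using 5-fold cross-validation. The data parameters, the list of *k* values, and the names are assumptions made for this illustration, and the resulting numbers depend on them.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_cv, y_cv = make_moons(n_samples=200, noise=0.3, random_state=42)

mean_scores = {}
for k in (1, 5, 10, 20, 50, 100):
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: each fold is held out once for scoring.
    fold_scores = cross_val_score(knn, X_cv, y_cv, cv=5)
    mean_scores[k] = fold_scores.mean()
    print(f"k={k:3d}: mean accuracy {mean_scores[k]:.3f}")

best_k = max(mean_scores, key=mean_scores.get)
print("best k in this run:", best_k)
```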

In [None]:
model = KNeighborsClassifier(n_neighbors=20)

plt.figure(figsize=(12, 12))
gs = gridspec.GridSpec(2, 2, height_ratios=[1, 1], width_ratios=[1, 1])
gs.update(wspace=0.2, hspace=0.2)

for i, d in enumerate(data.keys()):
    X, y = data[d]
    model.fit(X, y)

    ax = plt.subplot(gs[i])
    ax.set_title(f"{d} with kNN (k=20)")
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=colorcode)
    decision_boundaries(clf=model, ax=plt.gca())

## SVM (overview)

Using the correct kernel function is essential for an SVM. Therefore, the underlying distributions should be known when applying an SVM to the data. For all samples the SVM shows quite reasonable results and a smooth separation.
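
When the right kernel is not known in advance, a small grid search with cross-validation can compare candidates. The sketch below tries a linear and an RBF kernel with a few `gamma` values on the moons data. The grid values and variable names are arbitrary examples, not recommended settings.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_gs, y_gs = make_moons(n_samples=200, noise=0.09, random_state=42)

param_grid = {
    "kernel": ["linear", "rbf"],
    "gamma": [0.1, 0.5, 2, 10],  # has no effect on the linear kernel
}

# Every parameter combination is scored with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_gs, y_gs)

print("best parameters:", search.best_params_)
print("best mean accuracy:", round(search.best_score_, 3))
```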

In [None]:
model = SVM(kernel="rbf")

plt.figure(figsize=(12, 12))
gs = gridspec.GridSpec(2, 2, height_ratios=[1, 1], width_ratios=[1, 1])
gs.update(wspace=0.2, hspace=0.2)

for i, d in enumerate(data.keys()):
    X, y = data[d]
    model.fit(X, y)

    ax = plt.subplot(gs[i])
    ax.set_title(f"{d} with SVM")
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=colorcode)
    decision_boundaries(clf=model, ax=plt.gca())

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_

_The used functions_ `decision_boundaries` _and_ `svm_decision_function` _are licensed under a [MIT License](https://github.com/Python4AstronomersAndParticlePhysicists/PythonWorkshop-ICE/blob/master/LICENSE). Copyright © 2017 [Python4AstronomersAndParticlePhysicists](https://github.com/Python4AstronomersAndParticlePhysicists/PythonWorkshop-ICE)_