Copyright © 2020 IUBH Internationale Hochschule

#### Machine Learning

"
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.

Learning problems fall into a few categories:

**supervised learning**, in which the data comes with additional attributes that we want to predict. This problem can be either:

- **classification:**

samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem would be handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning where one has a limited number of categories and for each of the n samples provided, one is to try to label them with the correct category or class.

- **regression:**

if the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.

**unsupervised learning**, in which the training data consists of a set of input vectors x without any corresponding target values.

The goal in such problems may be to discover groups of similar examples within the data, where it is called **clustering**, or to determine the distribution of data within the input space, known as **density estimation**, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization." [1]

[1], source: [scikit-learn_tutorial](https://scikit-learn.org/stable/tutorial/basic/tutorial.html#machine-learning-the-problem-setting), visited on 28.07.2020
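The three problem settings quoted above differ mainly in what is passed to `fit`. The following is a minimal sketch, assuming synthetic data from `make_blobs` and a made-up continuous target; the choice of estimators (`LogisticRegression`, `LinearRegression`, `KMeans`) is illustrative and not taken from the tutorial.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data: 200 samples, 2 features, 3 groups (illustrative choice).
X, y_class = make_blobs(n_samples=200, n_features=2, centers=3, random_state=0)

# Supervised, classification: the targets are discrete class labels.
classifier = LogisticRegression(max_iter=1000).fit(X, y_class)
predicted_classes = classifier.predict(X[:5])

# Supervised, regression: the target is continuous.
# The target here is a made-up linear function of the two features.
y_continuous = 2.0 * X[:, 0] - 1.0 * X[:, 1]
regressor = LinearRegression().fit(X, y_continuous)
predicted_values = regressor.predict(X[:5])

# Unsupervised, clustering: no targets are passed to fit().
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_labels = clusterer.labels_

print(predicted_classes, predicted_values, cluster_labels[:5])
```

Note that the cluster IDs returned by the unsupervised model are arbitrary integers; they need not coincide with the class labels in `y_class`.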

**Clustering**

Cluster analysis, or clustering, is an unsupervised machine learning task.

It involves automatically discovering natural grouping in data. Unlike supervised learning (like predictive modeling), clustering algorithms only interpret the input data and find natural groups or clusters in feature space.

"Clustering techniques apply when there is no class to be predicted but rather when the instances are to be divided into natural groups."

- [Page 141, Data Mining: Practical Machine Learning Tools and Techniques, 2016.](https://amzn.to/2R0G3uG)

A cluster is often an area of density in the feature space where examples from the domain (observations or rows of data) are closer to that cluster than to other clusters. The cluster may have a center (the centroid) that is a sample or a point in feature space, and it may have a boundary or extent.

"These clusters presumably reflect some mechanism at work in the domain from which instances are drawn, a mechanism that causes some instances to bear a stronger resemblance to each other than they do to the remaining instances."

- [Pages 141-142, Data Mining: Practical Machine Learning Tools and Techniques, 2016.](https://amzn.to/2R0G3uG)

Clustering can be helpful as a data analysis activity in order to learn more about the problem domain, so-called pattern discovery or knowledge discovery.

For example:

- The [phylogenetic tree](https://en.wikipedia.org/wiki/Phylogenetic_tree) could be considered the result of a manual clustering analysis.
- Separating normal data from outliers or anomalies may be considered a clustering problem.
- Separating customers into groups based on their natural behavior is a clustering problem, referred to as market segmentation.


Clustering can also be useful as a type of feature engineering, where existing and new examples can be mapped and labeled as belonging to one of the identified clusters in the data.
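A minimal sketch of this idea, assuming K-Means as the clustering model and a hypothetical batch of new examples (`X_new`) derived from the existing data: the cluster ID of each row is appended as an extra feature column.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Existing examples (synthetic, for illustration only).
X_existing, _ = make_blobs(n_samples=300, n_features=2, centers=3, random_state=1)

# Hypothetical "new" examples: a few existing rows with small random jitter.
rng = np.random.default_rng(1)
picked = rng.choice(len(X_existing), size=10, replace=False)
X_new = X_existing[picked] + rng.normal(scale=0.1, size=(10, 2))

# Identify clusters on the existing data.
model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_existing)

# Label existing and new examples with their cluster ID.
existing_ids = model.labels_
new_ids = model.predict(X_new)

# Append the cluster ID as an additional feature column.
X_existing_aug = np.column_stack([X_existing, existing_ids])
X_new_aug = np.column_stack([X_new, new_ids])

print(X_existing_aug.shape, X_new_aug.shape)
```

This works for estimators that offer a `predict` method, such as `KMeans` and `GaussianMixture`. `AgglomerativeClustering` and `DBSCAN` have no `predict`, so labeling new examples with them needs a different approach.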

Evaluation of identified clusters is subjective and may require a domain expert, although many clustering-specific quantitative measures do exist. Typically, clustering algorithms are compared academically on synthetic datasets with pre-defined clusters, which an algorithm is expected to discover.

"Clustering is an unsupervised learning technique, so it is hard to evaluate the quality of the output of any given method."

- Page 534, [Machine Learning: A Probabilistic Perspective, 2012.](https://amzn.to/2TwpXuC)
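Two of the quantitative measures mentioned above can be sketched as follows. The example assumes a synthetic dataset with known groups and uses the adjusted Rand index (which needs the true labels) and the silhouette score (which does not); both metrics are examples chosen for illustration, not a recommendation from the cited sources.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic dataset with pre-defined clusters (y_true is known here).
X, y_true = make_blobs(n_samples=500, centers=3, cluster_std=0.5, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# External measure: agreement with the known grouping (1.0 = identical).
ari = adjusted_rand_score(y_true, labels)

# Internal measure: cohesion vs. separation, computed from X and labels only.
sil = silhouette_score(X, labels)

print(f"adjusted Rand index: {ari:.3f}, silhouette: {sil:.3f}")
```

On real data without ground-truth labels, only internal measures such as the silhouette score are available, which is one reason evaluation stays partly subjective.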

**Clustering Algorithms**<br>
There are many types of clustering algorithms.<br>
Many algorithms use similarity or distance measures between examples in the feature space in an effort to discover dense regions of observations. As such, it is often good practice to scale data prior to using clustering algorithms.

"Central to all of the goals of cluster analysis is the notion of the degree of similarity (or dissimilarity) between the individual objects being clustered. A clustering method attempts to group the objects based on the definition of similarity supplied to it."

- Page 502, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.
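Because distance-based methods are sensitive to feature scale, one common way to scale data first is standardization. The sketch below assumes `StandardScaler` and a deliberately exaggerated second feature; the factor of 1000 is made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, n_features=2, centers=3, random_state=2)

# Exaggerate the scale of the second feature so it dominates raw distances.
X_mixed = X * np.array([1.0, 1000.0])

# Standardize each feature to zero mean and unit variance.
scaled = StandardScaler().fit_transform(X_mixed)
print(scaled.mean(axis=0), scaled.std(axis=0))

# Chain scaling and clustering so both steps are always applied together.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=2),
)
labels = pipeline.fit_predict(X_mixed)
```

Putting the scaler inside a pipeline means the same transformation is applied whenever the model is refit or used on new data.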

Some clustering algorithms require you to specify or guess at the number of clusters to discover in the data, whereas others require the specification of some minimum distance between observations in which examples may be considered “close” or “connected.”

As such, cluster analysis is an iterative process where subjective evaluation of the identified clusters is fed back into changes to algorithm configuration until a desired or appropriate result is achieved.
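One way to make this feedback loop partly automatic is to try several configurations and compare a quality score. The sketch below assumes K-Means and the silhouette score as the criterion; the candidate range of 2 to 7 clusters is arbitrary, and the score only supports a human judgment rather than replacing it.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=3)

# Try several values for the number of clusters and record a score for each.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=3).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Candidate with the highest silhouette score.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```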

The scikit-learn library provides a suite of different clustering algorithms to choose from.

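Four of the more popular clustering algorithms are listed below. The fifth entry, KNeighborsClassifier, is a supervised classifier rather than a clustering algorithm and is shown alongside them.

- [Agglomerative Clustering](#agglomerative)
- [DBSCAN](#dbscan)
- [K-Means](#kmeans)
- [Mixture of Gaussians](#mfg)
- [KNeighborsClassifier](#knc)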

Each algorithm offers a different approach to the challenge of discovering natural groups in data.

There is no best clustering algorithm, and no easy way to find the best algorithm for your data without using controlled experiments.

The examples below are meant to be copied and adapted so that you can test the methods on your own data.

Additional information about clustering with scikit-learn:

- [Clustering, scikit-learn API.](https://scikit-learn.org/stable/modules/clustering.html)

<a id="agglomerative">**Agglomerative Clustering**</a>

Agglomerative clustering involves merging examples until the desired number of clusters is achieved.<br>
It is a part of a broader class of hierarchical clustering methods and you can learn more here:

- [Hierarchical clustering, Wikipedia.](https://en.wikipedia.org/wiki/Hierarchical_clustering)

It is implemented via the [AgglomerativeClustering class](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) and the main hyperparameter to tune is “n_clusters”, an estimate of the number of clusters in the data, e.g. 2.

source: https://machinelearningmastery.com/clustering-algorithms-with-python/

**Example of agglomerative clustering**

Show the agglomerative clustering algorithm with sample datasets [make_circles](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html#sklearn.datasets.make_circles) and [make_blobs](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs).

Parameters:<br>
Agglomerative clustering:
- n_clusters: int or None, default=2<br>"The number of clusters to find. It must be None if distance_threshold is not None."<br> **nClusters = 2**
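The relationship between `n_clusters` and `distance_threshold` can be sketched as follows. The threshold value of 10.0 is an arbitrary illustration and would need tuning for real data; the default linkage is left unchanged.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=4)

# Option 1: fix the number of clusters (distance_threshold stays None).
fixed = AgglomerativeClustering(n_clusters=3).fit(X)

# Option 2: stop merging at a linkage distance; n_clusters must then be None.
# The value 10.0 is illustrative only.
by_distance = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0).fit(X)

print("fixed:", fixed.n_clusters_, "by distance:", by_distance.n_clusters_)

# Setting both parameters at once is rejected by scikit-learn.
both_set_error = None
try:
    AgglomerativeClustering(n_clusters=3, distance_threshold=10.0).fit(X)
except ValueError as err:
    both_set_error = str(err)
print("error when both are set:", both_set_error)
```

With `distance_threshold`, the number of clusters is a result of the fit (available as `n_clusters_`) rather than an input.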


In [None]:
# resources
from numpy import unique, where
from sklearn.datasets import make_circles, make_blobs
from sklearn.cluster import AgglomerativeClustering
from matplotlib import pyplot
#
# ----------------------------------------------------------------------------
# generate datasets
#
Xb, _ = make_blobs(n_samples=1000, n_features=2, centers=None, random_state=4)
# fix random_state so the circles are reproducible, like the blobs above
Xc, _ = make_circles(n_samples=1000, factor=0.8, noise=0.05, random_state=4)
#
# ----------------------------------------------------------------------------
# define algorithm model:
#
nClusters = 2
#
model = AgglomerativeClustering(n_clusters=nClusters)
#
# fit model and predict clusters
# - make_blob dataset:
blob_yhat = model.fit_predict(Xb)
# - make_circle dataset:
circle_yhat = model.fit_predict(Xc)
#
#
blobClusters = unique(blob_yhat)
circleClusters = unique(circle_yhat)
#
# scatter plots for each dataset and their clusters
#
figAgglo, axes = pyplot.subplots(nrows=2, ncols=2, figsize=(10, 8))
    
axes[0,0].set_title("original data (blob)")
axes[1,0].set_title("original data (circle)")
axes[0,1].set_title("predicted clusters (blob)")
axes[1,1].set_title("predicted clusters (circle)")
        
axes[0,0].scatter(Xb[:, 0], Xb[:, 1])
axes[1,0].scatter(Xc[:, 0], Xc[:, 1])

for cluster in blobClusters:
    # get row indexes for samples with this cluster
    row_ix = where(blob_yhat == cluster)
    # create scatter of these samples
    axes[0,1].scatter(Xb[row_ix[0], 0], Xb[row_ix[0], 1])
#  
for cluster in circleClusters:
    # get row indexes for samples with this cluster
    row_ix = where(circle_yhat == cluster)
    # create scatter of these samples
    axes[1,1].scatter(Xc[row_ix[0], 0], Xc[row_ix[0], 1])
#

<a id="dbscan">**DBSCAN**</a>

DBSCAN Clustering (where DBSCAN is short for Density-Based Spatial Clustering of Applications with Noise) involves finding high-density areas in the domain and expanding those areas of the feature space around them as clusters.

"… we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it"

- [A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996.](https://www.osti.gov/biblio/421283)

The technique is described in the paper:

- [A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996.](https://www.osti.gov/biblio/421283)

It is implemented via the [DBSCAN class](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) and the main hyperparameters to tune are “eps” and “min_samples”.

source: https://machinelearningmastery.com/clustering-algorithms-with-python/

**Example of DBSCAN clustering**

Show the [DBSCAN clustering algorithm](https://scikit-learn.org/stable/modules/clustering.html#dbscan) with sample datasets [make_circles](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html#sklearn.datasets.make_circles) and [make_blobs](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs).

Parameters:<br>
DBSCAN:
- eps: <br>"The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function."<br> **eps = 0.3**
- min_samples: <br>"The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself."<br> **minSamples = 8**
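To see how `eps` affects the result, the sketch below runs DBSCAN with several values and counts clusters and noise points. scikit-learn marks noise with the label `-1`; the list of `eps` values is an arbitrary illustration.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, n_features=2, centers=None, random_state=4)

# Map each eps value to (number of clusters, number of noise points).
summary = {}
for eps in (0.1, 0.3, 0.5, 1.0):
    labels = DBSCAN(eps=eps, min_samples=8).fit_predict(X)
    n_noise = int((labels == -1).sum())           # label -1 marks noise
    n_clusters = len(set(labels.tolist()) - {-1})  # noise is not a cluster
    summary[eps] = (n_clusters, n_noise)

for eps, (n_clusters, n_noise) in summary.items():
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

For a fixed `min_samples`, a larger `eps` can only turn noise points into cluster members, so the noise count never increases as `eps` grows. When plotting DBSCAN results, the `-1` group is the noise and not a cluster of its own.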

In [None]:
# resources
#
from numpy import unique, where
from sklearn.datasets import make_circles, make_blobs
from sklearn.cluster import DBSCAN
from matplotlib import pyplot
#
#-----------------------------------------------------------------------------
# generate datasets
Xb, _ = make_blobs(n_samples=1000, n_features=2, centers=None, random_state=4)
Xc, _ = make_circles(n_samples=1000, factor=0.8, noise=0.1, random_state=4)
#
#-----------------------------------------------------------------------------
#
eps = 0.3
minSamples = 8
#
model = DBSCAN(eps=eps, min_samples=minSamples)
#
# fit model and predict clusters
#
blob_yhat = model.fit_predict(Xb)
#
circle_yhat = model.fit_predict(Xc)
#
#
blobClusters = unique(blob_yhat)
circleClusters = unique(circle_yhat)
#
# scatter plots for each dataset and their clusters
#
figDBSCAN, axes = pyplot.subplots(nrows=2, ncols=2, figsize=(10, 8))
    
axes[0,0].set_title("original data (blob)")
axes[1,0].set_title("original data (circle)")
axes[0,1].set_title("predicted clusters (blob)")
axes[1,1].set_title("predicted clusters (circle)")
        
axes[0,0].scatter(Xb[:, 0], Xb[:, 1])
axes[1,0].scatter(Xc[:, 0], Xc[:, 1])

for cluster in blobClusters:
    # get row indexes for samples with this cluster
    row_ix = where(blob_yhat == cluster)
    # create scatter of these samples
    axes[0,1].scatter(Xb[row_ix, 0], Xb[row_ix, 1])
#  
for cluster in circleClusters:
    # get row indexes for samples with this cluster
    row_ix = where(circle_yhat == cluster)
    # create scatter of these samples
    axes[1,1].scatter(Xc[row_ix, 0], Xc[row_ix, 1])
#  


<a id="kmeans">**K-Means**</a><br>
[K-Means Clustering](https://en.wikipedia.org/wiki/K-means_clustering) may be the most widely known clustering algorithm and involves assigning examples to clusters in an effort to minimize the variance within each cluster.

"The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called ‘k-means,’ appears to give partitions which are reasonably efficient in the sense of within-class variance."

- [Some methods for classification and analysis of multivariate observations, 1967.](https://projecteuclid.org/euclid.bsmsp/1200512992)

The technique is described here:

* [k-means clustering, Wikipedia.](https://en.wikipedia.org/wiki/K-means_clustering)

It is implemented via the [KMeans class](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) and the main configuration to tune is the “n_clusters” hyperparameter set to the estimated number of clusters in the data.

source: https://machinelearningmastery.com/clustering-algorithms-with-python/

**Example of K-Means clustering**


Show the [K-Means clustering algorithm](https://scikit-learn.org/stable/modules/clustering.html#k-means) with sample datasets [make_circles](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html#sklearn.datasets.make_circles) and [make_blobs](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs).

Parameters:<br>
K-Means:
- n_clusters: <br>"The number of clusters to form as well as the number of centroids to generate."<br> **nClusters = 4**
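The "variance within each cluster" that K-Means tries to minimize can be checked by hand. The sketch below recomputes the within-cluster sum of squared distances from the fitted centroids and compares it with the `inertia_` attribute reported by scikit-learn; `n_init` and `random_state` are set only for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, n_features=2, centers=None, random_state=4)

model = KMeans(n_clusters=4, n_init=10, random_state=4).fit(X)

# Centroid assigned to each sample, shape (n_samples, n_features).
assigned_centers = model.cluster_centers_[model.labels_]

# Within-cluster sum of squares: squared distance of each sample to its centroid.
wcss = float(((X - assigned_centers) ** 2).sum())

print("manual WCSS:", wcss)
print("model.inertia_:", model.inertia_)
```

Both numbers should agree up to floating-point error. Inertia always decreases as `n_clusters` grows, so it cannot be used on its own to pick the number of clusters.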

In [None]:
# resources:
#
from numpy import unique, where
from sklearn.datasets import make_circles, make_blobs
from sklearn.cluster import KMeans
from matplotlib import pyplot
#
#-----------------------------------------------------------------------------
# generate datasets
#
Xb, _ = make_blobs(n_samples=1000, n_features=2, centers=None, random_state=4)
Xc, _ = make_circles(n_samples=1000, factor=0.8, noise=0.1, random_state=4)
#
#-----------------------------------------------------------------------------
nClusters = 4
#
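# n_init and random_state are set explicitly so repeated runs give the same result
model = KMeans(n_clusters=nClusters, n_init=10, random_state=4)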
#
# fit model and predict clusters
# - make_blob dataset:
blob_yhat = model.fit_predict(Xb)
# - make_circle dataset:
circle_yhat = model.fit_predict(Xc)
#
#
blobClusters = unique(blob_yhat)
circleClusters = unique(circle_yhat)
#
# scatter plots for each dataset and their clusters
#
figKMeans, axes = pyplot.subplots(nrows=2, ncols=2, figsize=(10, 8))
    
axes[0,0].set_title("original data (blob)")
axes[1,0].set_title("original data (circle)")
axes[0,1].set_title("predicted clusters (blob)")
axes[1,1].set_title("predicted clusters (circle)")
        
axes[0,0].scatter(Xb[:, 0], Xb[:, 1])
axes[1,0].scatter(Xc[:, 0], Xc[:, 1])

for cluster in blobClusters:
    # get row indexes for samples with this cluster
    row_ix = where(blob_yhat == cluster)
    # create scatter of these samples
    axes[0,1].scatter(Xb[row_ix, 0], Xb[row_ix, 1])
#  
for cluster in circleClusters:
    # get row indexes for samples with this cluster
    row_ix = where(circle_yhat == cluster)
    # create scatter of these samples
    axes[1,1].scatter(Xc[row_ix, 0], Xc[row_ix, 1])
#  

<a id="mfg">**Gaussian Mixture Model**</a><br>
A Gaussian mixture model summarizes a multivariate probability density function with a mixture of Gaussian probability distributions, as its name suggests.

For more on the model, see:

* [Mixture model, Wikipedia.](https://en.wikipedia.org/wiki/Mixture_model)

It is implemented via the [GaussianMixture class](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html) and the main hyperparameter to tune is “n_components”, which specifies the estimated number of mixture components (clusters) in the data.

source: https://machinelearningmastery.com/clustering-algorithms-with-python/

**Example of Gaussian mixture clustering**

Show the [Gaussian mixture clustering algorithm](https://scikit-learn.org/stable/modules/mixture.html#gmm) with sample datasets [make_circles](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html#sklearn.datasets.make_circles) and [make_blobs](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs).

Parameters:<br>
Gaussian mixture model:
- n_components: <br>"The number of mixture components."<br> **nComponents = 6**
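Unlike K-Means, a Gaussian mixture yields a probability for each component, and its fit can be compared across values of `n_components`. The sketch below shows both; using the Bayesian information criterion (BIC) for the comparison is one common option, chosen here as an assumption rather than taken from the source.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1000, n_features=2, centers=None, random_state=4)

model = GaussianMixture(n_components=3, random_state=4).fit(X)

# Hard assignment: index of the most probable component per sample.
hard = model.predict(X)

# Soft assignment: one probability per component, each row sums to 1.
soft = model.predict_proba(X)
print(soft[:3].round(3))

# Compare candidate numbers of components with the BIC (lower is better).
bic_by_components = {
    k: GaussianMixture(n_components=k, random_state=4).fit(X).bic(X)
    for k in range(1, 7)
}
best_n = min(bic_by_components, key=bic_by_components.get)
print("n_components with lowest BIC:", best_n)
```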

In [None]:
# resources
#
from numpy import unique, where
from sklearn.datasets import make_circles, make_blobs
from sklearn.mixture import GaussianMixture
from matplotlib import pyplot
#
#-----------------------------------------------------------------------------
#
# generate datasets
Xb, _ = make_blobs(n_samples=1000, n_features=2, centers=None, random_state=4)
Xc, _ = make_circles(n_samples=1000, factor=0.8, noise=0.1, random_state=4)
#
# define model
#
nComponents = 6
#
model = GaussianMixture(n_components = nComponents)
#
# fit model and predict clusters
# - make_blob dataset:
blob_yhat = model.fit_predict(Xb)
# - make_circle dataset:
circle_yhat = model.fit_predict(Xc)
#
#
blobClusters = unique(blob_yhat)
circleClusters = unique(circle_yhat)
#
# scatter plots for each dataset and their clusters
#
figGMM, axes = pyplot.subplots(nrows=2, ncols=2, figsize=(10, 8))
    
axes[0,0].set_title("original data (blob)")
axes[1,0].set_title("original data (circle)")
axes[0,1].set_title("predicted clusters (blob)")
axes[1,1].set_title("predicted clusters (circle)")
        
axes[0,0].scatter(Xb[:, 0], Xb[:, 1])
axes[1,0].scatter(Xc[:, 0], Xc[:, 1])

for cluster in blobClusters:
    # get row indexes for samples with this cluster
    row_ix = where(blob_yhat == cluster)
    # create scatter of these samples
    axes[0,1].scatter(Xb[row_ix, 0], Xb[row_ix, 1])
#  
for cluster in circleClusters:
    # get row indexes for samples with this cluster
    row_ix = where(circle_yhat == cluster)
    # create scatter of these samples
    axes[1,1].scatter(Xc[row_ix, 0], Xc[row_ix, 1])
#  

<a id="knc">**KNeighborsClassifier**</a>
"
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.

scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user."

source: https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification
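The majority vote described in the quotation can be written out directly. The sketch below is a simplified brute-force version under stated assumptions (Euclidean distance, uniform weights, a hypothetical helper named `majority_vote_predict`); it is compared against `KNeighborsClassifier` on a held-out split rather than on the training data.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_circles(n_samples=1000, factor=0.8, noise=0.1, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4
)

k = 5


def majority_vote_predict(X_train, y_train, X_query, k):
    """Brute-force k-nearest-neighbors vote (hypothetical helper)."""
    # Squared Euclidean distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)
    # Indices of the k closest training samples for each query point.
    nearest = np.argsort(dist2, axis=1)[:, :k]
    # Labels of those neighbors; the most frequent label wins.
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])


manual = majority_vote_predict(X_train, y_train, X_test, k)
reference = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_test)

print("agreement with scikit-learn:", (manual == reference).mean())
print("held-out accuracy:", (manual == y_test).mean())
```

Evaluating on held-out data gives a more honest picture than predicting the samples the model has stored, since each training point counts itself among its own nearest neighbors.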

**Example of KNeighborsClassifier classification**

Show the [KNeighborsClassifier classification algorithm](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification) with sample datasets [make_circles](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html#sklearn.datasets.make_circles) and [make_blobs](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs). Unlike the clustering examples above, the classifier is fit on the labels returned together with each dataset.

Parameters:<br>
KNeighborsClassifier:
- n_neighbors: <br>"Number of neighbors to use by default for kneighbors queries."<br> **nNeighbors = 5**

In [None]:
# resources:
#
from numpy import unique, where
from sklearn.datasets import make_circles, make_blobs
from sklearn.neighbors import KNeighborsClassifier
from matplotlib import pyplot
#
#------------------------------------------------------------------
#
Xb, Yb = make_blobs(n_samples=1000, n_features=2, centers=None, random_state=4)
Xc, Yc = make_circles(n_samples=1000, factor=0.8, noise=0.1, random_state=4)
#
#
nNeighbors = 5
#
model= KNeighborsClassifier(n_neighbors = nNeighbors)
#
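# fit the classifier on the labeled data, then predict the class of each sample
# (predictions are made on the training data itself, for illustration only)
# - make_blob dataset:
model.fit(Xb,Yb)
blob_yhat = model.predict(Xb)
# - make_circle dataset:
model.fit(Xc,Yc)
circle_yhat = model.predict(Xc)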
#
#
blobClusters = unique(blob_yhat)
circleClusters = unique(circle_yhat)
#
# scatter plots for each dataset and their clusters
#
figKNN, axes = pyplot.subplots(nrows=2, ncols=2, figsize=(10, 8))
    
axes[0,0].set_title("original data (blob)")
axes[1,0].set_title("original data (circle)")
axes[0,1].set_title("predicted classes (blob)")
axes[1,1].set_title("predicted classes (circle)")
        
axes[0,0].scatter(Xb[:, 0], Xb[:, 1])
axes[1,0].scatter(Xc[:, 0], Xc[:, 1])

for cluster in blobClusters:
    # get row indexes for samples with this cluster
    row_ix = where(blob_yhat == cluster)
    # create scatter of these samples
    axes[0,1].scatter(Xb[row_ix, 0], Xb[row_ix, 1])
#  
for cluster in circleClusters:
    # get row indexes for samples with this cluster
    row_ix = where(circle_yhat == cluster)
    # create scatter of these samples
    axes[1,1].scatter(Xc[row_ix, 0], Xc[row_ix, 1])
#  
