# Unsupervised learning

<hr style="border:2px solid gray">

## Index: <a id='index'></a>

1. [Unsupervised learning](#UL)
1. [K-Means](#KM)
1. [Exercise 2](#Exercise_2)
1. [Density-Based Spatial Clustering of Applications with Noise](#DBSCAN)


## So what is unsupervised learning and what is it used for?  [^](#index)
<a id='UL'></a>


As you might have guessed, unsupervised learning is where you don't have the answer to what you are looking for, i.e. you don't have the target, so you cannot train your favourite classifier. Instead, you need an algorithm that will train a model to group things that are "the same". A commonly used example is trying to distinguish coins by weight and diameter without knowing which coin is which.

The most common form is clustering (although fault detection and density estimation are other common forms) and the most common form of clustering is K-Means, so this is the example that we will look at. Again, this is an example taken from {homl}.

## K-Means [^](#index)
<a id='KM'></a>

K-means can take an unlabeled data set and group it into clusters. So let's generate some random data made up of clusters.

In [None]:
import numpy as np
import scipy as sp
import sklearn as sk
import matplotlib.pyplot as plt

In [None]:
from sklearn.datasets import make_blobs
blob_centers = np.array(
    [[ 0.2,  2.3],
     [-1.5 ,  2.3],
     [-2.8,  1.8],
     [-2.8,  2.8],
     [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])


In [None]:
X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

In [None]:
def plot_clusters(X, y=None):
    plt.scatter(X[:, 0], X[:, 1], c=y, s=1)
    plt.xlabel("$x_1$", fontsize=14)
    plt.ylabel("$x_2$", fontsize=14, rotation=0)

In [None]:
plt.figure(figsize=(8, 4))
plot_clusters(X)
print(X)
plt.show()

We have five new blobs of data, but how can we make a model identify each blob? If we knew the centroids of each blob or the identity of each entry, this task would be easy. However, in unsupervised learning, we have neither of these.

K-Means starts by randomly selecting centroids and classifying each instance based on its nearest centroid. It then updates the centroids using the associated data. This process continues iteratively, with instances being reclassified and centroids recalculated until the centroids no longer move. It's a simple yet effective approach. Let's give it a try here:

In [None]:
from sklearn.cluster import KMeans
k = 5
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)

Unfortunately, you have to tell it how many clusters to look for, but hey, now each of the data points has been assigned to one of the 5 clusters:

In [None]:
print(y_pred[0:20])

In [None]:
kmeans.cluster_centers_ # to find the centres of the clusters

Let's look at how well it did:

In [None]:
plt.figure(figsize=(8, 4))
plot_clusters(X)
#print(X)
plt.plot(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],"ro")
#print(kmeans.inertia_)
plt.show() # pretty good

You can use it to predict the label for new data:

In [None]:
X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]]) #try adding some more to see what happens
kmeans.predict(X_new)
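
For intuition, the assign-and-update loop described earlier, and the nearest-centroid rule used to label new points, can be sketched in plain NumPy. This is a minimal illustration under simplifying assumptions (random data points as starting centroids, a fixed iteration cap), not how scikit-learn implements `KMeans`; the names `assign_to_nearest`, `simple_kmeans` and `X_toy` are made up for this example.

```python
import numpy as np

def assign_to_nearest(X, centroids):
    """Return, for each row of X, the index of the closest centroid."""
    # distances[i, j] = Euclidean distance from point i to centroid j
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def simple_kmeans(X, k, n_iter=100, seed=0):
    """Toy K-Means: random start, then alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign_to_nearest(X, centroids)
        # Move each centroid to the mean of its points (keep it if it has none)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, assign_to_nearest(X, centroids)

# Two well-separated toy blobs (made-up data for illustration)
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
                   rng.normal(5, 0.3, size=(50, 2))])
toy_centroids, toy_labels = simple_kmeans(X_toy, k=2)
print(toy_centroids)

# Labelling new points is just the assignment step with fixed centroids
print(assign_to_nearest(np.array([[0.1, -0.2], [4.8, 5.3]]), toy_centroids))
```

The cluster numbers themselves are arbitrary: which blob ends up as cluster 0 depends on the random starting centroids.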

You can plot the decision boundaries as a Voronoi plot (code taken straight from {homl}):

In [None]:
def plot_data(X):
    plt.plot(X[:, 0], X[:, 1], 'k.', markersize=2)  # Plotting data points from input X

def plot_centroids(centroids, weights=None, circle_color='w', cross_color='k'):
    if weights is not None:
        centroids = centroids[weights > weights.max() / 10]  # Filter centroids based on their weights
    plt.scatter(centroids[:, 0], centroids[:, 1],  # Plot centroids
                marker='o', s=30, linewidths=8,
                color=circle_color, zorder=10, alpha=0.9)
    plt.scatter(centroids[:, 0], centroids[:, 1],  # Plot centroids
                marker='x', s=10, linewidths=10,
                color=cross_color, zorder=11, alpha=1)

def plot_decision_boundaries(clusterer, X, resolution=1000, show_centroids=True,
                             show_xlabels=True, show_ylabels=True):
    mins = X.min(axis=0) - 0.1 
    maxs = X.max(axis=0) + 0.1
    xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution),  # Generate grid of points in the defined limits
                         np.linspace(mins[1], maxs[1], resolution))
    Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()])  # Perform clustering on the grid points
    Z = Z.reshape(xx.shape)  # Reshape results to have same shape as the grid

    plt.contourf(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),  # Plot the filled contours (decision boundaries)
                cmap="Pastel2")
    plt.contour(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),  # Plot the contour lines
                linewidths=1, colors='k')
    plot_data(X)  # Plot the original data
    if show_centroids:
        plot_centroids(clusterer.cluster_centers_)  # Plot the centroids if specified

    if show_xlabels:
        plt.xlabel("$x_1$", fontsize=14)  # Show x-axis label if specified
    else:
        plt.tick_params(labelbottom=False)  # Hide x-axis labels
    if show_ylabels:
        plt.ylabel("$x_2$", fontsize=14, rotation=0)  # Show y-axis label if specified
    else:
        plt.tick_params(labelleft=False)  # Hide y-axis labels


In [None]:
plt.figure(figsize=(8, 4))
plot_decision_boundaries(kmeans, X)
plt.show()

If you happen to have a vague idea of where the centroids are, you can tell it where to start:
```python
init_guess = np.array([[-3, 1.0], [-3, 2], [-3, 3], [-1, 2], [0, 2]])
# With explicit starting centroids a single run is enough, so n_init=1
kmeans = KMeans(n_clusters=5, init=init_guess, n_init=1)
```

### Inertia

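When the K-Means algorithm runs, it can actually run several times from different random starting centroids. The number of runs is set by `n_init`. The default was 10 in older versions of scikit-learn; recent versions default to `'auto'`, which means 10 runs for `init='random'` and a single run for `init='k-means++'` (the default) or an explicit array of centroids. It then uses a performance metric to determine which run is the best and keeps that one. The metric is called *inertia*: the sum of the squared distances between each instance and its closest centroid. This is available to you as an attribute of the fitted model.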

In [None]:
kmeans.inertia_
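
As a sanity check on what inertia measures, it can be recomputed by hand from the fitted centroids and labels. The sketch below uses a small made-up dataset (`X_demo`) so that it runs on its own; the variable names are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small made-up dataset so the block runs on its own
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# For every point, look up the centroid of the cluster it was assigned to
assigned_centroids = km.cluster_centers_[km.labels_]

# Inertia: sum over all points of the squared distance to that centroid
manual_inertia = np.sum((X_demo - assigned_centroids) ** 2)

print(manual_inertia, km.inertia_)  # the two values should agree
```

Because inertia is a sum over instances, it grows with the size of the dataset, so it is only meaningful when comparing models fitted to the same data.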

<div style="background-color:#C2F5DD">

## Exercise [^](#index)
<a id='Exercise_2'></a>

Plot what happens to the inertia score as you change the number of centroids in your algorithm. Can you use this to determine how many centroids you should have?

<div style="background-color:#C2F5DD">

## Exercise

Try to use K-Means on the iris data to separate the species without using the labels. Even though we have only been using 2D data to show plots, the algorithm will happily run in more dimensions.


In [None]:
import numpy as np
import scipy as sp 
from sklearn.datasets import load_iris
iris=load_iris()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris['data'],iris['target'], test_size=0.2) 
k = 4
kmeans = KMeans(n_clusters=k, random_state=42,n_init=10)
y_pred = kmeans.fit_predict(X_train)

In [None]:
yp=kmeans.predict(X_test)

In [None]:
print(yp)

In [None]:
print(y_test)

In [None]:
# How accurate was this? Can you write code to quantify the accuracy?
# (Remember that the cluster numbers are arbitrary and need not match the target labels.)


## Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [^](#index)
<a id='DBSCAN'></a>

DBSCAN is a popular clustering algorithm used in machine learning and data mining. Unlike K-Means, which requires specifying the number of clusters in advance, DBSCAN automatically determines the number of clusters based on the density of the data.

DBSCAN operates by grouping together data points that are close to each other in a dense region, while separating regions of lower density. It identifies core points, which have a sufficient number of neighboring points within a specified distance (epsilon), and expands clusters by including reachable points within this distance. Any points that are not part of a cluster are considered outliers or noise.
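
The ideas of core points and noise can be seen on a tiny made-up dataset. The sketch below assumes two tight groups plus one isolated point; `eps=1.0` and `min_samples=5` are illustrative values chosen for this toy data, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight made-up groups plus one isolated point
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(30, 2))
group_b = rng.normal(loc=[3.0, 3.0], scale=0.1, size=(30, 2))
outlier = np.array([[10.0, 10.0]])
X_demo = np.vstack([group_a, group_b, outlier])

eps, min_samples = 1.0, 5
db = DBSCAN(eps=eps, min_samples=min_samples).fit(X_demo)

# Noise points get the label -1; clusters are numbered from 0
n_clusters = len(set(db.labels_) - {-1})
n_noise = int(np.sum(db.labels_ == -1))
print(n_clusters, n_noise)

# Core points by hand: at least min_samples points (itself included) within eps
pairwise = np.linalg.norm(X_demo[:, None, :] - X_demo[None, :, :], axis=2)
is_core = (pairwise <= eps).sum(axis=1) >= min_samples
print(np.array_equal(np.flatnonzero(is_core), db.core_sample_indices_))
```

Note that the number of clusters was never passed in: it falls out of the density of the data and the two parameters.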

Practical usage: In high-energy physics experiments, particle tracks are reconstructed from the signals recorded by particle detectors. DBSCAN can be applied to identify and group together the recorded signals that belong to the same particle track. By clustering these signals based on their spatial proximity, DBSCAN helps in reconstructing the paths of particles accurately.


<div style="background-color:#C2F5DD">

## Optional Exercise


We have a dataset of particles with two features: momentum and charge. The goal is to group these particles based on these two features.

Let's first simulate this data:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

np.random.seed(50)

# Make up data for 4 different particles
data, _ = make_blobs(n_samples=500, centers=4, cluster_std=1)

# Let's say the first feature is momentum and the second is charge
momentum = data[:, 0]
charge = data[:, 1]

plt.scatter(momentum, charge)
plt.xlabel('Momentum')
plt.ylabel('Charge')
plt.title('Simulated Particle Data')
plt.grid(True)
plt.show()


Here, we are simulating data from four types of particles, each with different distributions of momentum and charge.

Next, let's use DBSCAN to identify the particle types:

DBSCAN takes two main parameters: **eps**, which specifies the maximum distance between two samples for them to be considered as in the same neighborhood, and **min_samples**, which is the number of samples in a neighborhood (including the point itself) required for a point to be considered a core point. You can alter these parameters based on the density of your data points.


In [None]:
# Perform DBSCAN on data
dbscan = DBSCAN(eps=0.7, min_samples=10)
clusters = dbscan.fit_predict(data)

# Plot the clustered data
plt.scatter(momentum, charge, c=clusters, cmap='viridis')
plt.xlabel('Momentum')
plt.ylabel('Charge')
plt.title('DBSCAN Clustering of Particle Data')
plt.grid(True)
plt.show()


Try altering the values of `eps` and `min_samples` in DBSCAN to see how the plot changes.
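
One possible way to explore this more systematically is to loop over a few values of `eps` and count the clusters and noise points each time. The sketch below uses stand-in data generated with a fixed `random_state` (an assumption made here for repeatability), so the numbers will not match the plot above exactly; the list of `eps` values is arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Stand-in data; random_state is fixed only to make the sketch repeatable
data_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=1,
                          random_state=50)

summary = []
for eps in [0.3, 0.5, 0.7, 1.0, 1.5]:
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(data_demo)
    n_clusters = len(set(labels) - {-1})  # -1 marks noise, not a cluster
    n_noise = int(np.sum(labels == -1))
    summary.append((eps, n_clusters, n_noise))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With `min_samples` held fixed, increasing `eps` can only turn noise points into cluster members, never the reverse, so the noise count should not go up as you move down the list.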