# Unsupervised learning with marbles

In this example we explore **unsupervised machine learning** by clustering the marbles data set and evaluating the result. In unsupervised learning, the model has no access to labelled data during training. Instead, it tries to assign labels by discovering structure in the data automatically, i.e. by clustering similar data points together.

This approach is used to detect unknown patterns in data and to perform unbiased analyses. There are a variety of clustering methods implemented in `sklearn`. A broad overview is given in the [documentation](http://scikit-learn.org/stable/modules/clustering.html).

In [None]:
import os
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm

In [None]:
# plt.get_cmap is used because cm.get_cmap was removed in matplotlib 3.9
cmap = plt.get_cmap('Set1')

## Data Import, Preparation, Feature Engineering

In [None]:
def parse_lines(lines):
    """Parse a string of marble data into a DataFrame and keep its first three columns."""
    lines = lines[2:-2]
    rows = [d.split(', ') for d in lines.split('), (')]
    data = [[int(v.replace(')][(', '')) for v in r] for r in rows]
    return pd.DataFrame(data)[[0, 1, 2]]

files = [
    'blue-white-glass.data',
    'cyan-glass.data',
    'glass-blue.data',
    'glass-green.data',
    'glass-red.data',
    'glass-yellow.data',
    'planet-black-blue.data',
    'planet-green.data',
    'planet-ocean.data',
]

dfs = []
for i, fname in enumerate(files):
    print(f'Load data {i}: {fname}')

    with zipfile.ZipFile(f'../.assets/data/marbles/{fname}.zip', 'r') as zipf:
        with zipf.open(f'{fname}', 'r') as infile:
            content = infile.readlines()[0].decode()
            dfs.append(parse_lines(content).assign(color=f'{fname}'.replace('.data', '')))

df = pd.concat(dfs)
df.columns=['R', 'G', 'B', 'color']

def generate_xy_values(df):
    df['X'] = 0.5 * np.sqrt(3) * df['G'] - 0.5 * np.sqrt(3) * df['B']
    df['Y'] = df['R'] - (1 / 3 * df['G']) - (1 / 3 * df['B'])
    
def generate_intensity_values(df):
    df['I'] = np.square(df['X']) + np.square(df['Y'])

def generate_angles(df):
    df['Phi'] = np.arctan2(df['Y'], df['X'])

# Feature Engineering I     
generate_xy_values(df)
generate_intensity_values(df)
generate_angles(df)

In [None]:
# Add target ID
ids = {'blue-white-glass': 0,
       'cyan-glass': 1,
       'glass-blue': 2,
       'glass-green': 3,
       'glass-red': 4,
       'glass-yellow': 5,
       'planet-black-blue': 6,
       'planet-green': 7,
       'planet-ocean': 8,}

df['cat'] = df['color'].map(ids)

In [None]:
df.sample(5)

# K-Means

K-Means is an unsupervised machine learning method that only needs the number of clusters ($k$) as input. It starts with $k$ randomly placed cluster centers and assigns each data point to the nearest one. In each iteration it then recomputes every center as the mean of the data points assigned to it and reassigns the points based on their distance to the updated centers.
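
The loop described above can be sketched in a few lines of NumPy. This is only an illustration under simple assumptions: `kmeans_sketch` is a hypothetical helper written for this text, the two blobs are invented data, and `sklearn`'s `KMeans` is implemented differently and more carefully.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Minimal K-Means loop (illustrative helper, not sklearn's implementation)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points at random as starting centers
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()

    def assign(current_centers):
        # Distance of every point to every center, then index of the nearest center
        distances = np.linalg.norm(points[:, None, :] - current_centers[None, :, :], axis=2)
        return distances.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    # Final assignment so that the labels match the returned centers
    return centers, assign(centers)

# Invented toy data: two well-separated blobs
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[10, 10], scale=0.5, size=(100, 2)),
])

centers, labels = kmeans_sketch(points, k=2, n_iter=10)
print(centers)
print(np.bincount(labels, minlength=2))
```

Calling the helper with `n_iter=1` and then with larger values shows how the centers move from their random starting points towards the cluster means.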

Let's see how the [**K-Means**](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) algorithm performs on two marble types that are very well separated.



In [None]:
# Two marbles as input
plt.figure(figsize=(10,6))

cat1=1
cat2=4

var1='X'
var2='Y'

plt.scatter(df[df['cat']==cat1][var1], df[df['cat']==cat1][var2], s=10, alpha=0.01)
plt.scatter(df[df['cat']==cat2][var1], df[df['cat']==cat2][var2], s=10, alpha=0.01)
plt.xlabel(var1)
plt.ylabel(var2);

Let's use two clusters as input.

In [None]:
# For a better visualization we use just part of the dataset.
dataset = pd.concat([
    df[df['cat']==cat1][[var1,var2]].head(1000),
    df[df['cat']==cat2][[var1,var2]].head(1000)
])

### Model definition

In [None]:
# Import of model
from sklearn.cluster import KMeans

In [None]:
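model = KMeans(
    n_clusters=2,  # number of clusters
    max_iter=1,  # maximum number of iterations

    n_init=1,  # run the algorithm once, from a single initialization
    init='random',  # choose the initial centers at random
    # precompute_distances was removed in scikit-learn 1.0 and is omitted here
    random_state=20  # fixed seed for reproducibility
)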

### Model Fitting

In [None]:
X = dataset
kmeans = model.fit(X)
predictions = kmeans.predict(X)

### Results _basics_

In [None]:
plt.figure(figsize=(10, 6))

# Plot of samples
plt.scatter(df[df['cat']==cat1][var1].head(1000), df[df['cat']==cat1][var2].head(1000), s=10, alpha=0.2)
plt.scatter(df[df['cat']==cat2][var1].head(1000), df[df['cat']==cat2][var2].head(1000), s=10, alpha=0.2)

# Plot of clusters
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], color='black', marker='o', s=250)

plt.xlabel(var1)
plt.ylabel(var2);

### Task
What happens when we increase the **`max_iter`** parameter? Does the model perform better? Are all samples assigned correctly?

In [None]:
# It's your turn!






### Results _validation_
For a first validation, we use a test data set consisting of the next thousand samples of each marble type. Wrongly assigned samples are shown in the plot with a red marker.

In [None]:
# Create test dataset
marbles_test = pd.concat([
    df[df['cat']==cat1].loc[1001:2000][[var1,var2]],
    df[df['cat']==cat2].loc[1001:2000][[var1,var2]]
]).values
 
# Apply trained model to test dataset
predictions = kmeans.predict(marbles_test)

# Truth of test dataset (assumes cluster 1 is the first marble type; K-Means cluster numbering is arbitrary)
marbles_true = np.array([1] * 1000 + [0] * 1000)

# Check which sample is wrongly assigned
marbles_false = marbles_test[(predictions - marbles_true) != 0]

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(marbles_test[:,0][:1000], marbles_test[:,1][:1000], s=20, alpha=0.2)
plt.scatter(marbles_test[:,0][1000:2000], marbles_test[:,1][1000:2000], s=20, alpha=0.2)

plt.scatter(marbles_false[:,0], marbles_false[:,1], color='red', marker='v', s=50)

plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], color='black', marker='o', s=250)
plt.xlabel(var1)
plt.ylabel(var2);
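
The truth vector above hard-codes which cluster index belongs to which marble type, but K-Means numbers its clusters arbitrarily, so a different seed can flip the labels and mark every sample as wrong. The sketch below uses invented toy labels to show two ways around this: a score that ignores the numbering (`adjusted_rand_score` from `sklearn.metrics`) and, for the two-cluster case only, taking the better of the two possible numberings.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Invented toy labels: the same perfect clustering, once with swapped cluster indices
truth = np.array([0, 0, 0, 1, 1, 1])
pred_same = np.array([0, 0, 0, 1, 1, 1])
pred_swapped = np.array([1, 1, 1, 0, 0, 0])

# A naive comparison depends on the cluster numbering
naive_error_swapped = np.mean(pred_swapped != truth)

# The adjusted Rand index is invariant to a renaming of the clusters
ari_same = adjusted_rand_score(truth, pred_same)
ari_swapped = adjusted_rand_score(truth, pred_swapped)

# For exactly two clusters: count mismatches under both numberings, keep the smaller
n_wrong = min(np.sum(pred_swapped != truth), np.sum(pred_swapped == truth))

print(naive_error_swapped, ari_same, ari_swapped, n_wrong)
```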

### Task
Give it another try and check what happens in each iteration. The red data points show the wrongly assigned samples. Can you find other parameters that improve the performance?

In [None]:
# It's your turn!







## More marble types

Next we check how K-Means performs when there are more possible clusters. We now use all marble types.

In [None]:
cat = [0,1,2,3,4,5,6,7,8]

X = df[df['cat'].isin(cat)][['X','Y','cat']]

# Reduced data set size
X = X.sample(10000)

# Define target for visualization
target = X['cat']
X=X.drop(['cat'],axis=1).values

In [None]:
# Raw data
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], color=cmap(target), s=5, label ='Truth')
plt.legend();

### Model Definition and Training

In [None]:
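model = KMeans(
    n_clusters=9,  # number of clusters
    max_iter=5,  # maximum number of iterations

    n_init=1,  # run the algorithm once, from a single initialization
    init='random',  # choose the initial centers at random
    # precompute_distances was removed in scikit-learn 1.0 and is omitted here
    random_state=20  # fixed seed for reproducibility
)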
kmeans = model.fit(X)
predictions = kmeans.predict(X)

### Results

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], s=20, color = cmap(predictions), alpha=0.5, label='Predictions K-Means')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], color='black', marker='o', s=150, label ='Cluster centers')
plt.legend();

In [None]:
plt.figure(figsize=(10, 6))

plt.scatter(X[:, 0], X[:, 1], color=cmap(target), s=5, label='Truth')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], color='black', marker='o', s=150, label ='Cluster centers')
plt.legend();
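
Beyond comparing the two plots by eye, a contingency table of true class against cluster index shows which marble types end up in which cluster. The following is a minimal sketch on synthetic blobs (an assumption made to keep it self-contained); with the notebook's data, `target` and `predictions` would take the place of `truth` and `labels`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the marbles data: three well-separated blobs
points, truth = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

# Rows: true class, columns: cluster index assigned by K-Means
table = pd.crosstab(
    pd.Series(truth, name='truth'),
    pd.Series(labels, name='cluster'),
)
print(table)
```

A row whose counts are spread over several columns indicates a class that was split across clusters; a column with counts from several rows indicates a cluster that merges classes.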

### Task

Reduce the number of marble types and rerun K-Means. Do not forget to adapt the `n_clusters` parameter.

In [None]:
# It's your turn!







### Remarks
K-Means is an easy-to-use unsupervised clustering method, but its results depend strongly on the shape of the distributions. Even clearly separated samples are hard to cluster when their geometry is complicated. In addition, it cannot separate overlapping data at all.

# Gaussian Mixture

The results of a [**Gaussian Mixture**](http://scikit-learn.org/stable/modules/mixture.html) model look similar to the K-Means clusters, but elliptical distributions can be handled as well. Instead of relying only on distances as K-Means does, it assumes that the data points are drawn from a mixture of Gaussian-distributed clusters.
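
To make the difference to K-Means more concrete, the sketch below fits a `GaussianMixture` to two invented, strongly elongated clusters and inspects what the model learns: one mean and one full covariance matrix per component, plus a membership probability for every point. The data and all parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Invented data: two elongated (elliptical) clusters stacked on top of each other
cov = [[4.0, 0.0], [0.0, 0.1]]
points = np.vstack([
    rng.multivariate_normal([0, 0], cov, size=300),
    rng.multivariate_normal([0, 3], cov, size=300),
])

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(points)

hard_labels = gmm.predict(points)        # most probable component per point
soft_labels = gmm.predict_proba(points)  # probability of each component per point

print(gmm.means_)        # one center per component
print(gmm.covariances_)  # one 2x2 covariance matrix (ellipse shape) per component
print(soft_labels[:3].round(3))
```

The covariance matrices are what allow the model to describe ellipses, and the probabilities give a notion of how confident an assignment is, which K-Means does not provide.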

In [None]:
from sklearn.mixture import GaussianMixture

In [None]:
# Create training data set
size = 1000
cat1=1
cat2=4
var1='X'
var2='Y'

dataset = pd.concat([
    df[df['cat']==cat1][[var1,var2]].head(size),
    df[df['cat']==cat2][[var1,var2]].head(size)
])

### Model Definition and Training

In [None]:
X = dataset
model = GaussianMixture(n_components=2)
X = X.sample(1000).values
model.fit(X)
predictions = model.predict(X)

### Results

In [None]:
plt.figure(figsize=(10, 6))
# c already holds RGBA colors, so an additional cmap argument would be ignored
plt.scatter(X[:, 0], X[:, 1], s=20, c=cmap(predictions), label='Predictions Gaussian Mixture')
plt.legend();

## More marble types


In [None]:
size = 9000
cat=[0,1,2,3,4,5,6,7,8]

X = df[df['cat'].isin(cat)][['X','Y','cat']]

# Reduced data set size
X = X.sample(size)

# Define target for visualization
target = X['cat']
X=X.drop(['cat'],axis=1).values

In [None]:
# Raw data
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=cmap(target), s=5, alpha = 0.5, label ='Truth')
plt.legend();

### Model Definition and Training

In [None]:
model = GaussianMixture(n_components=9, init_params='kmeans')
model.fit(X)
predictions = model.predict(X)

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], s=20, color = cmap(predictions), alpha=0.5, label='Predictions Gaussian Mixture')
plt.legend();

### Remarks
The results of clustering all marbles depend strongly on the `init_params` parameter, which can be set, for example, to `random` or `kmeans`. Some of the detected clusters look quite promising, but as one would expect, overlapping samples cannot be clearly separated.
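
One way to see this dependence on the initialization is to fit the same model several times and compare the fits. The sketch below does this on synthetic blobs (an assumption, not the marbles data) and reports the fitted attributes `converged_`, `n_iter_` and `lower_bound_` of `GaussianMixture`; a higher lower bound means a better fit to the training data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data with four blobs
points, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.0, random_state=1)

results = []
for init in ('kmeans', 'random'):
    for seed in range(3):
        gmm = GaussianMixture(
            n_components=4,
            init_params=init,
            max_iter=200,
            random_state=seed,
        ).fit(points)
        results.append((init, seed, gmm.converged_, gmm.n_iter_, gmm.lower_bound_))

for init, seed, converged, n_iter, lower_bound in results:
    print(f'{init:>6} seed={seed} converged={converged} '
          f'iterations={n_iter} lower_bound={lower_bound:.3f}')
```

If the runs differ noticeably, increasing `n_init` lets `GaussianMixture` try several initializations and keep the best one.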

## Task
- Try to find as many types of marbles as possible that can be clustered with Gaussian Mixture. You may have to tune the parameters of the model!

- In addition, try to use more features to train the model. For visualization you should stick to X and Y <br>(`X = df[df['cat'].isin(cat)][['X','Y','R','G','B','I','Phi','cat']]`)

In [None]:
# It's your turn!







# DBSCAN

[DBSCAN](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.dbscan.html) (**D**ensity-**B**ased **S**patial **C**lustering of **A**pplications with **N**oise) is another clustering algorithm. One important feature is that we do not have to set the number of clusters. Instead, we need to tune the key parameter `eps` (the maximum distance between two data points for them to be considered neighbors), which determines how many clusters are found. A suitable value can vary strongly with the sample size. A good illustration of the strategy can be found [here](https://en.wikipedia.org/wiki/DBSCAN).
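
Since a good `eps` depends on how densely the samples are packed, one common heuristic (not part of this notebook's workflow) is to look at the distance from each point to its `min_samples`-th nearest neighbor and pick `eps` from the upper end of that distribution. The sketch below tries this on synthetic blobs; the 95th percentile is an arbitrary assumption, and in practice the sorted distances are usually plotted and inspected.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in data: two blobs
points, _ = make_blobs(n_samples=500, centers=[[0, 0], [8, 8]], cluster_std=0.6, random_state=0)

min_samples = 10

# Distance of every point to its min_samples-th nearest neighbor.
# When the training data is passed to kneighbors, each point counts as its own first neighbor.
neighbors = NearestNeighbors(n_neighbors=min_samples).fit(points)
distances, _ = neighbors.kneighbors(points)
k_distances = np.sort(distances[:, -1])

# Assumed rule of thumb: take a high percentile of these distances as eps
eps = np.percentile(k_distances, 95)

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f'eps={eps:.2f}, clusters={n_clusters}, noise points={n_noise}')
```

Repeating this with a different `n_samples` shows how the suggested `eps` shifts with the sample size.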

In [None]:
from sklearn.cluster import DBSCAN

In [None]:
# Create training data set
size = 1000
cat1=1
cat2=4
var1='X'
var2='Y'

dataset = pd.concat([
    df[df['cat']==cat1][[var1,var2]].head(size),
    df[df['cat']==cat2][[var1,var2]].head(size)
])

### Model Definition and Training

In [None]:
X = dataset
model = DBSCAN(eps=5, min_samples=10)
X = X.sample(1000).values
predictions = model.fit_predict(X)

In [None]:
pred_sort=np.sort(predictions)
if pred_sort[-1]+1 == 0:
    print('There are no clusters')
else:
    print(f'Number of clusters: {pred_sort[-1]+1}')
    print(f'There are {sum(pred_sort==-1)} samples without an assigned cluster (Noise).')

### Results

In [None]:
plt.figure(figsize=(10, 6))
# c already holds RGBA colors, so an additional cmap argument would be ignored
plt.scatter(X[:, 0], X[:, 1], s=20, c=cmap(predictions), label='Predictions DBSCAN')
plt.legend();

## More marble types


In [None]:
size = 9000
cat=[0,1,2,3,4,5,6,7,8]

X = df[df['cat'].isin(cat)][['X','Y','cat']]

# Reduced data set size
X = X.sample(size)

# Define target for visualization
target = X['cat']
X=X.drop(['cat'],axis=1).values

In [None]:
# Raw data
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=cmap(target), s=5, alpha = 0.5, label ='Truth')
plt.legend();

### Model Definition and Training

In [None]:
model = DBSCAN(eps=5, min_samples=10)
predictions = model.fit_predict(X)

In [None]:
pred_sort=np.sort(predictions)
if pred_sort[-1]+1 == 0:
    print('There are no clusters')
else:
    print(f'Number of clusters: {pred_sort[-1]+1}')
    print(f'There are {sum(pred_sort==-1)} samples without an assigned cluster (Noise).')

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], s=20, color = cmap(predictions), alpha=0.5, label='Predictions DBSCAN')
plt.legend();

As for K-Means, the method struggles with overlapping distributions. However, it is capable of clustering more difficult geometries of distributions.
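
The point about difficult geometries can be illustrated with the two interleaved half-moons from `sklearn.datasets.make_moons`. This is a sketch on synthetic data with hand-picked parameters (`eps=0.2`, `min_samples=5` are assumptions that suit this toy set, not the marbles data); the adjusted Rand index compares each clustering with the known truth.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: clearly separated, but not blob-shaped
points, truth = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(points)

kmeans_ari = adjusted_rand_score(truth, kmeans_labels)
dbscan_ari = adjusted_rand_score(truth, dbscan_labels)

print(f'K-Means ARI: {kmeans_ari:.2f}')
print(f'DBSCAN  ARI: {dbscan_ari:.2f}')
```

K-Means separates the points with a straight boundary between its two centers and therefore cuts through both moons, whereas DBSCAN follows the dense regions along each arc.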

## Task
- Try to find as many types of marbles as possible that can be clustered with DBSCAN. You may have to tune the parameters of the model!

- In addition, try to use more features to train the model. For visualization you should stick to X and Y <br>(`X = df[df['cat'].isin(cat)][['X','Y','R','G','B','I','Phi','cat']]`)

In [None]:
# It's your turn!









---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_