This is a simple notebook to work with some galaxy images and apply the KMeans clustering algorithm.

It accompanies Chapter 7 of the book but only shows up in sec. 7.6.

Copyright: Viviana Acquaviva (2023); see also data credits below.

Modifications by Julieta Gruszko (2025)

License: [BSD-3-clause](https://opensource.org/license/bsd-3-clause/)

In [None]:
!pip install scikit-image #Using a new library to handle image data

In [None]:
import numpy as np
import pandas as pd
import os
import random
import matplotlib.pyplot as plt

import skimage
from skimage.transform import resize, rescale
from skimage import io
from skimage.feature import blob_dog, blob_log, blob_doh #Aren't these the coolest names
from skimage.color import rgb2gray

This data set is composed of 200 images randomly selected from the Kaggle Galaxy Zoo challenge:

https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge

The code below displays the first 25 galaxies in your data set. Note: you might get an error message (IOPub data rate exceeded); in that case, see here:

https://stackoverflow.com/questions/43288550/iopub-data-rate-exceeded-in-jupyter-notebook-when-viewing-image

In [None]:
fig, axes = plt.subplots(ncols= 5, nrows = 5,figsize=(50,50))

ax = axes.ravel()

for i in range(ax.shape[0]):

    img = skimage.io.imread('../Data/Images/Image_'+str(i)+'.png')
    ax[i].imshow(img, cmap='gray')
    ax[i].set_xticks([])
    ax[i].set_yticks([])

    

The code below implements a very rudimentary technique to identify and mask multiple sources, by finding bright blobs away from the center of the image. Results are shown for the first 5 images. As I said, it is very rudimentary! However, I don't think the results really depend on it.

In [None]:
#This shows how multiple sources can be identified and masked.

n_ob = 5

fig, ax = plt.subplots(2, n_ob, figsize=(50, 20))

for i in range(n_ob):

    img = skimage.io.imread('../Data/Images/Image_'+str(i)+'.png')

    image_gray = rgb2gray(img)

    blobs_log = blob_log(image_gray, max_sigma=30, num_sigma=10, threshold=.1)

    # Compute radii in the 3rd column.
    
    blobs_log[:, 2] = blobs_log[:, 2] * np.sqrt(2)
    
    blobs_log = blobs_log[blobs_log[:,2].argsort()[::-1]]
    
    ax[0,i].imshow(img, interpolation='nearest')

    X, Y = np.ogrid[:img.shape[0], :img.shape[1]]
    
    center = np.array([img.shape[0]/2, img.shape[1]/2]) #center
    
    for blob in blobs_log:    
        y, x, r = blob    
        c = plt.Circle((x, y), r, color = 'yellow', linewidth=2, fill=False)
        ax[0,i].add_patch(c)
        
        if (np.linalg.norm(np.array([y,x])-center)) > 10: #If not in center (center is in row, column order)
        
            mask = (X - blob[0])**2 + (Y - blob[1])**2 < r**2
            img[mask] = 0
    
    ax[1,i].imshow(img, interpolation='nearest')
        
    print('I found', int(len(blobs_log)), 'sources.')
    
    if len(blobs_log) > 1 and blobs_log[1,2] > 0.5*blobs_log[0,2]: #second source bigger than half first (guard avoids an IndexError with fewer than two blobs)
        print('Multiple large sources detected in image', str(i))
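
The masking step in the cell above can be looked at in isolation. The following is a minimal sketch on a toy array, not the notebook's actual data: the 9x9 image and the blob position and radius are made-up values, chosen only to show how `np.ogrid` and a boolean mask zero out a circular region.

```python
import numpy as np

# Toy 9x9 "image" with three channels, all pixels set to 1 (hypothetical data).
img = np.ones((9, 9, 3))

# Row and column index grids, built the same way as in the cell above.
rows, cols = np.ogrid[:img.shape[0], :img.shape[1]]

# Hypothetical blob in (row, column, radius) order, as unpacked in the cell above.
y, x, r = 2, 6, 2

# Boolean mask of the pixels strictly inside the circle of radius r.
mask = (rows - y) ** 2 + (cols - x) ** 2 < r ** 2

masked = img.copy()
masked[mask] = 0  # zero out every channel of the masked pixels

print(mask.sum(), "pixels masked")
```

Because the index grids broadcast against each other, `mask` has the same height and width as the image, and assigning through it affects all color channels at once.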


On my computer, it took about a minute to remove the background sources from the first 200 images. The resulting "No Sources" images are available in the Data folder, so if this is taking a long time, you can abort execution and use the existing files.

In [None]:
for i in range(200):

    img = skimage.io.imread('../Data/Images/Image_'+str(i)+'.png')

    image_gray = rgb2gray(img)

    blobs_log = blob_log(image_gray, max_sigma=30, num_sigma=10, threshold=.1)

    # Compute radii in the 3rd column.
    blobs_log[:, 2] = blobs_log[:, 2] * np.sqrt(2)
    
    blobs_log = blobs_log[blobs_log[:,2].argsort()[::-1]]
    
    X, Y = np.ogrid[:img.shape[0], :img.shape[1]]
    
    center = np.array([img.shape[0]/2, img.shape[1]/2]) #center
    
    for blob in blobs_log:    
        y, x, r = blob    
#        c = plt.Circle((x, y), r, color = 'yellow', linewidth=2, fill=False)
#        ax.add_patch(c)
        
        if (np.linalg.norm(np.array([y,x])-center)) > 10: #If not in center (center is in row, column order)
        
            mask = (X - blob[0])**2 + (Y - blob[1])**2 < r**2
            img[mask] = 0
    
    skimage.io.imsave('../Data/Images/NoSources_Image_'+str(i)+'.png',img)
    
    if np.mod(i, 10) == 0:
        print('Processing image', i)

#### These are the first 25 objects after the spurious source removal (did I mention it's very rudimentary?).

In [None]:
fig, axes = plt.subplots(ncols= 5, nrows = 5,figsize=(50,50))

ax = axes.ravel()

for i in range(ax.shape[0]):

    img = skimage.io.imread('../Data/Images/NoSources_Image_'+str(i)+'.png')
    ax[i].imshow(img, cmap='gray')
    ax[i].set_xticks([])
    ax[i].set_yticks([])


#### Let's read in the images, resize them to something a bit more manageable, and compile them in a numpy array.


In [None]:
images = []

for i in range(200):
    img =skimage.io.imread('../Data/Images/NoSources_Image_'+str(i)+'.png')
    img_resized = resize(img,(100,100))
    length = np.prod(img_resized.shape)
    img_resized = np.reshape(img_resized,length)
    images.append(img_resized)
    
images = np.vstack(images)

In [None]:
images.shape

### A reasonable clustering hypothesis would be that we can separate galaxies into two clusters, one for the ellipticals (roundish and red) and one for the spirals (bluish and with evident substructure).

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, n_init=10, random_state = 10)
kmeans.fit(images)
y_kmeans = kmeans.predict(images)

### Question:
- How many features is k-means using to cluster the instances? What are the features?

In this case, the predictions (the cluster to which each image belongs) can only take the values 0 and 1. Here we show a quick way to count how many galaxies are predicted to belong to each cluster.

In [None]:
print(len(np.where([y_kmeans == 0])[1]))

In [None]:
print(len(np.where([y_kmeans == 1])[1]))
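
An alternative way to get the same counts in one call is `np.bincount`. This is only a sketch: the `labels` array below is a made-up stand-in for `y_kmeans`, so the numbers it prints are not the notebook's results.

```python
import numpy as np

# Hypothetical label array standing in for y_kmeans.
labels = np.array([0, 1, 1, 0, 1, 1, 0, 1])

# np.bincount counts the occurrences of each non-negative integer label.
counts = np.bincount(labels)

for cluster, n in enumerate(counts):
    print("Cluster", cluster, "has", n, "members")

# Indices of the instances assigned to cluster 0.
members_0 = np.flatnonzero(labels == 0)
print(members_0)
```

`np.flatnonzero(labels == 0)` returns the same indices as the `np.where([...])[1]` pattern used in this notebook, without the extra list wrapping.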

#### We can use the code below to take a look at 25 galaxies that were placed in the first cluster and see if they look somewhat alike.

In [None]:
fig, axes = plt.subplots(ncols= 5, nrows = 5,figsize=(50,50))

ax = axes.ravel()

for i in range(min(len(np.where([y_kmeans == 0])[1]),25)):
    #Note: the line below selects galaxies that are assigned to cluster 0
    img = skimage.io.imread('../Data/Images/NoSources_Image_'+str(np.where([y_kmeans == 0])[1][i])+'.png')
    ax[i].imshow(img, cmap='gray')
    ax[i].set_xticks([])
    ax[i].set_yticks([])

#### We can do the same thing for the second cluster.

In [None]:
fig, axes = plt.subplots(ncols= 5, nrows = 5,figsize=(50,50))

ax = axes.ravel()

for i in range(min(len(np.where([y_kmeans == 1])[1]),25)):
    #Note: the line below selects galaxies that are assigned to cluster 1
    img = skimage.io.imread('../Data/Images/NoSources_Image_'+str(np.where([y_kmeans == 1])[1][i])+'.png')
    ax[i].imshow(img, cmap='gray')
    ax[i].set_xticks([])
    ax[i].set_yticks([])

### Question:
- From what you see in these examples, what aspect of the images is k-means using to cluster the galaxies into two groups? 

This makes perfect sense, of course: the cost function is based on the Euclidean distance, calculated pixel by pixel, between different images. Images with a similar number of background pixels will be considered similar, as the difference between a dark pixel and a bright one is larger than the difference due to color or intensity.
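
To make the argument above concrete, here is a small sketch with synthetic images (the helper `centered_square` and all sizes and brightness values are assumptions for illustration, not part of the notebook). It compares the pixel-by-pixel Euclidean distance between two sources of the same size but different brightness with the distance between two sources of the same brightness but different size.

```python
import numpy as np

def centered_square(size, side, value):
    """Return a size x size dark image with a centered square of the given brightness."""
    img = np.zeros((size, size))
    start = (size - side) // 2
    img[start:start + side, start:start + side] = value
    return img

def distance(a, b):
    """Euclidean distance between two images, computed on the flattened pixels."""
    return np.linalg.norm(a.ravel() - b.ravel())

small_bright = centered_square(20, 4, 1.0)   # small source, full brightness
small_dim = centered_square(20, 4, 0.8)      # same size, slightly dimmer
large_bright = centered_square(20, 12, 1.0)  # same brightness, larger source

d_intensity = distance(small_bright, small_dim)
d_size = distance(small_bright, large_bright)

print("Same size, different brightness:", d_intensity)
print("Same brightness, different size:", d_size)
```

In this toy setup the size difference changes many pixels from dark to bright, so it produces a much larger distance than the modest brightness change.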

### Let's now build a clustering scheme with three clusters:

In [None]:
kmeans = KMeans(n_clusters=3, n_init = 10, random_state = 10)
kmeans.fit(images)
y_kmeans = kmeans.predict(images)

In [None]:
#Let's see how big the clusters are.

for i in range(3):
    print(len(np.where([y_kmeans == i])[1]))

We can investigate the first few objects in the smallest one.

In [None]:
fig, axes = plt.subplots(ncols= 5, nrows = 1,figsize=(50,10))

ax = axes.ravel()

for i in range(min(len(np.where([y_kmeans == 2])[1]),5)): #change index here as necessary, using the index corresponding to the smallest cluster
    #Note: the line below selects galaxies that are assigned to cluster 2
    img = skimage.io.imread('../Data/Images/NoSources_Image_'+str(np.where([y_kmeans == 2])[1][i])+'.png') #and here
    ax[i].imshow(img, cmap='gray')
    ax[i].set_xticks([])
    ax[i].set_yticks([])

#### I found that this was a useful exercise: k-means is able to pick out weird objects (basically, objects with a saturated background). So while it is not helpful for separating galaxies based on morphology and color, it is quite apt at detecting spurious objects that should probably be eliminated from the data set before further processing.
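
The smallest cluster can also be located programmatically rather than by reading the counts by hand. The sketch below uses made-up stand-ins (`features` for the `images` array and `labels` for `y_kmeans`) and shows one possible way to drop the flagged instances; whether they should actually be removed is a judgment call for the analysis.

```python
import numpy as np

# Hypothetical stand-ins: 8 instances with 4 features each, and their cluster labels.
features = np.arange(32, dtype=float).reshape(8, 4)
labels = np.array([0, 1, 2, 1, 0, 1, 0, 1])

# Size of each cluster and the index of the smallest one.
counts = np.bincount(labels)
smallest = int(np.argmin(counts))

# Indices of the instances flagged as potentially spurious.
outlier_idx = np.flatnonzero(labels == smallest)

# Keep only the instances that are not in the smallest cluster.
keep = labels != smallest
features_clean = features[keep]

print("Smallest cluster:", smallest, "with members", outlier_idx)
print("Remaining instances:", features_clean.shape[0])
```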

A look at the other two clusters reveals no surprises (objects are clustered according to size).

In [None]:
fig, axes = plt.subplots(ncols= 5, nrows = 5, figsize=(50,50))

ax = axes.ravel()

for i in range(min(len(np.where([y_kmeans == 1])[1]),25)):
    #Note: the line below selects galaxies that are assigned to cluster 1
    img = skimage.io.imread('../Data/Images/NoSources_Image_'+str(np.where([y_kmeans == 1])[1][i])+'.png')
    ax[i].imshow(img, cmap='gray')
    ax[i].set_xticks([])
    ax[i].set_yticks([])

In [None]:
fig, axes = plt.subplots(ncols= 5, nrows = 5, figsize=(50,50))

ax = axes.ravel()

for i in range(min(len(np.where([y_kmeans == 0])[1]),25)):
    #Note: the line below selects galaxies that are assigned to cluster 0
    img = skimage.io.imread('../Data/Images/NoSources_Image_'+str(np.where([y_kmeans == 0])[1][i])+'.png')
    ax[i].imshow(img, cmap='gray')
    ax[i].set_xticks([])
    ax[i].set_yticks([])

### Questions: 
- Currently, KMeans is classifying galaxies according to size. How could we fix this? Give at least one option.
- Currently, would you expect this approach to classify oval/oblong galaxy images at different angles (e.g. vertically or horizontally oriented) in the same cluster, or different clusters? If they would be grouped in different clusters, give one option for how to fix this. 

### Conclusions

IMHO, clustering algorithms are powerful when they are semi-supervised.

Pre-processing seems to be quite important; defining a proper distance metric can also help.
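
As one illustration of how pre-processing changes the effective distance, the sketch below rescales each flattened image to unit length before comparing. This is an assumed example, not a step this notebook performs: the random vector stands in for a flattened image, and `unit_norm` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flattened "image" and a copy that is uniformly twice as bright.
image = rng.random(100)
brighter = 2.0 * image

def unit_norm(v):
    """Rescale a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)

# Raw Euclidean distance sees the two images as different.
d_raw = np.linalg.norm(image - brighter)

# After rescaling, an overall brightness factor no longer contributes.
d_normalized = np.linalg.norm(unit_norm(image) - unit_norm(brighter))

print("Raw distance:", d_raw)
print("Distance after unit-norm rescaling:", d_normalized)
```

For unit-length vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity, so running k-means on rescaled images behaves like clustering on the angle between them rather than on overall brightness.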

### Acknowledgement statement:

Upload both notebooks to Gradescope for this week!