This notebook trains machine learning classifiers to distinguish barred spiral galaxies from normal spiral galaxies, using images taken from the Carnegie-Irvine Galaxy Survey.

In [2]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

First, I import the images and rescale them to 256 x 256 pixels to speed up the training process.

In [74]:
image_width, image_height = 256, 256

def load_images(directory, label):
    images = []
    labels = []
    for file in os.listdir(directory):
        if file.endswith('.jpg'):
            # target_size expects (height, width); both are 256 here
            img = load_img(os.path.join(directory, file), target_size=(image_height, image_width))
            img_array = img_to_array(img) / 255.0  # Normalize pixel values
            images.append(img_array)
            labels.append(label)
    return images, labels

barred_spirals, y_barred = load_images('Barred_Spirals', 0)
normal_spirals, y_normal = load_images('Normal_Spirals', 1)

X = np.array(barred_spirals + normal_spirals)
y = np.array(y_barred + y_normal)

Count the number of barred spirals and normal spirals in the dataset.

In [75]:
len(barred_spirals),len(normal_spirals)

(104, 201)

In [76]:
X.shape # 104 + 201 = 305 total images of dimensions 256 x 256

(305, 256, 256, 3)

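Because the data comes as .jpg images, each image is loaded as a three-channel (RGB) array, with pixel values already scaled to the range 0 to 1 during loading. I now convert each image to grayscale by averaging over the color channels, so that each image is a 2-dimensional array, and then flatten it into a vector of pixel intensities.

After that, I split the data into training and testing sets.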

In [77]:
# Preprocess data
gray_images = []
# Make all the images in the dataset gray by averaging over the color channels.
for i in range(len(X)):
    gray_image = np.mean(X[i],axis=2)
    gray_images.append(gray_image)


In [81]:
X = np.array(gray_images)
X.shape

(305, 256, 256)

In [84]:
# Flatten each image into a 1D vector of pixel intensities (256 * 256 = 65536 values).
# Reassign X itself so that the train/test split below receives the flattened array.
X = X.reshape(len(X), -1)
X.shape

(305, 65536)
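
The grayscale loop and the flattening step can also be written as two vectorized NumPy operations. The following is a minimal sketch, assuming a stack of RGB images shaped (n, height, width, 3); the random array `rgb_stack` is a hypothetical stand-in for the galaxy images, not the real data.

```python
import numpy as np

# Hypothetical stand-in for the image stack: 5 RGB images of 256 x 256 pixels in [0, 1)
rng = np.random.default_rng(0)
rgb_stack = rng.random((5, 256, 256, 3))

# Average over the last (color) axis to get one grayscale image per galaxy
gray_stack = rgb_stack.mean(axis=-1)

# Flatten each grayscale image into a single row of pixel intensities
flat_stack = gray_stack.reshape(len(gray_stack), -1)

print(gray_stack.shape, flat_stack.shape)
```

Using `len(...)` instead of a hard-coded 305 keeps the code valid if images are added to or removed from the folders.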

In [85]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [86]:
X_train.shape

(244, 65536)

I will start with a *stochastic gradient descent* classifier, using Scikit-Learn's SGDClassifier class.

In [87]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state = 42)
sgd_clf.fit(X_train,y_train)

Now I use this classifier to predict the class of the first image in the stack.

In [97]:
y[0] # Barred Spiral

0

In [100]:
sgd_clf.predict([X[0]])

array([0])

The classifier predicts that this image is a barred spiral, which is correct! Note, however, that this image may have been part of the training set, so this is not an independent test.

I will now cross-validate the classifier. I use the cross_val_score() function to evaluate the SGDClassifier model using $k$-fold cross-validation with $k = 3$. This splits the training set into $k$ folds, then trains the model $k$ times, holding out a different fold each time for validation.

In [103]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf,X_train,y_train,cv=3,scoring="accuracy")

array([0.53658537, 0.62962963, 0.62962963])
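
To make the mechanics of $k$-fold cross-validation concrete, the loop below reproduces by hand what cross_val_score() does. This is a sketch under stated assumptions: `X_demo` and `y_demo` are hypothetical synthetic data from make_classification, not the galaxy images, and I assume that an integer `cv` with a classifier corresponds to unshuffled StratifiedKFold splits.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in data: 240 samples, 20 features, two classes
X_demo, y_demo = make_classification(n_samples=240, n_features=20, random_state=0)

clf = SGDClassifier(random_state=42)
skfolds = StratifiedKFold(n_splits=3)  # unshuffled, stratified folds

manual_scores = []
for train_idx, val_idx in skfolds.split(X_demo, y_demo):
    fold_clf = clone(clf)  # fresh, unfitted copy for each fold
    fold_clf.fit(X_demo[train_idx], y_demo[train_idx])
    y_pred = fold_clf.predict(X_demo[val_idx])
    # Accuracy on the held-out fold
    manual_scores.append((y_pred == y_demo[val_idx]).mean())

library_scores = cross_val_score(clf, X_demo, y_demo, cv=3, scoring="accuracy")
print(manual_scores)
print(library_scores)
```

Writing the loop out is mainly useful when more control is needed than cross_val_score() offers, for example to inspect the model fitted on each fold.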

This array of accuracy scores shows that the classifier gets approximately 54% of its predictions correct in the 1st fold and about 63% correct in the 2nd and 3rd folds. Next, I will use a dummy classifier that assigns every image to the most frequent class, which in our case is the normal spiral galaxies.

In [106]:
from sklearn.dummy import DummyClassifier
dummy_clf = DummyClassifier()
dummy_clf.fit(X_train,y_train)
print(any(dummy_clf.predict(X_train)))

True


Now, we look at the dummy classifier's cross-validation score and compare it with that of the SGD classifier.

In [107]:
cross_val_score(dummy_clf,X_train,y_train,cv=3,scoring="accuracy")

array([0.65853659, 0.66666667, 0.65432099])

Wow, it does a better job than the SGD classifier! But the scores all cluster around 2/3. One hunch is that about 2/3 of the pictures in the dataset are of normal spiral galaxies and only about 1/3 are of barred spiral galaxies, so if you guess that EVERY image is a normal spiral galaxy, you will be right about 2/3 of the time.

In [109]:
len(normal_spirals)/(len(barred_spirals) + len(normal_spirals))

0.659016393442623
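
The link between the class balance and the dummy classifier's accuracy can be checked directly. The sketch below assumes only the class counts reported earlier (104 barred, 201 normal); `y_all` and `X_placeholder` are hypothetical arrays built for illustration, and I set strategy="most_frequent" explicitly rather than relying on the default.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the same class counts as the dataset: 104 barred (0), 201 normal (1)
y_all = np.array([0] * 104 + [1] * 201)
X_placeholder = np.zeros((len(y_all), 1))  # the dummy classifier ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_all)

baseline_accuracy = baseline.score(X_placeholder, y_all)
majority_fraction = (y_all == 1).mean()
print(baseline_accuracy, majority_fraction)
```

The two printed numbers coincide because a classifier that always predicts the majority class is correct exactly as often as that class occurs.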

This is a good example of why accuracy should not be the preferred performance metric for classifiers, especially when it comes to **skewed datasets**.

I will now compute the **confusion matrix**, which counts the number of times instances of class A are classified as class B, for all A/B pairs. In my example, this just counts the number of times normal spirals are confused for barred spirals and vice versa.

In [110]:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf,X_train,y_train,cv=3)

Like cross_val_score(), this function performs $k$-fold cross-validation, but instead of returning the evaluation scores, it returns the predictions made on each test fold. This means that I get out-of-sample predictions for each instance in the training set: every prediction is made by a model that never saw that instance during training.
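
To illustrate why out-of-sample predictions matter, the sketch below compares them with in-sample predictions. It rests on assumptions that are not part of this notebook: the data are synthetic (make_classification with noisy labels), and a decision tree is swapped in for the SGD classifier because a fully grown tree memorizes its training data, which makes the contrast easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data with 20% randomly assigned labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, flip_y=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)

# In-sample: predict on the same data the model was fitted on
in_sample_pred = tree.fit(X_demo, y_demo).predict(X_demo)
in_sample_acc = (in_sample_pred == y_demo).mean()

# Out-of-fold: each prediction comes from a model that did not see that instance
oof_pred = cross_val_predict(tree, X_demo, y_demo, cv=3)
oof_acc = (oof_pred == y_demo).mean()

print(in_sample_acc, oof_acc)
```

cross_val_predict() returns one prediction per training instance, so its output can be passed to any metric in place of in-sample predictions.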

In [111]:
# get the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_train,y_train_pred)
cm

array([[ 24,  59],
       [ 39, 122]])

Each row in the confusion matrix represents an actual class, while each column represents a predicted class. Here the positive class is 1 (normal spirals) and the negative class is 0 (barred spirals). The top left corner counts true negatives, the top right corner counts false positives (type I errors), the bottom left counts false negatives (type II errors), and the bottom right counts true positives.
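
A labeled table makes the layout easier to read. The sketch below copies the counts from the confusion matrix above; the label strings and the `cm_table` name are my own choices for illustration.

```python
import numpy as np
import pandas as pd

# Confusion matrix from the cell above (rows: actual class, columns: predicted class)
cm = np.array([[24, 59],
               [39, 122]])

labels = ["barred (0)", "normal (1)"]
cm_table = pd.DataFrame(
    cm,
    index=[f"actual {name}" for name in labels],
    columns=[f"predicted {name}" for name in labels],
)
print(cm_table)

# For a binary problem, ravel() yields the four cells in the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```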

In [112]:
y_train_perfect_predictions = y_train # pretend we reached perfection
confusion_matrix(y_train,y_train_perfect_predictions)

array([[ 83,   0],
       [  0, 161]])

### Next we look at precision and recall metrics to evaluate model performance.

In [113]:
from sklearn.metrics import precision_score, recall_score
precision_score(y_train,y_train_pred) # TP / (TP + FP)

0.6740331491712708

In [115]:
122/(122+59) # same value by hand: TP / (TP + FP) from the confusion matrix

0.6740331491712708

In [116]:
recall_score(y_train,y_train_pred) # TP / (TP + FN)

0.7577639751552795

In [117]:
from sklearn.metrics import f1_score
f1_score(y_train,y_train_pred)

0.7134502923976608
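
All three scores follow directly from the four cells of the confusion matrix. As a check, this short sketch recomputes them by hand, taking the counts from the matrix above and treating class 1 (normal spirals) as the positive class.

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 24, 59, 39, 122

precision = tp / (tp + fp)  # fraction of predicted normal spirals that really are normal
recall = tp / (tp + fn)     # fraction of actual normal spirals that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```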