# Polar Ring Galaxy Example: NGC 660

NGC 660 is an example of a polar ring galaxy, the type of galaxy this project focuses on identifying.

![NGC 660 Polar Ring Galaxy](https://upload.wikimedia.org/wikipedia/commons/thumb/9/9d/NGC_660_Polar_Galaxy_Gemini_Observatory.jpg/800px-NGC_660_Polar_Galaxy_Gemini_Observatory.jpg)

*Image Credit: Gemini Observatory/AURA*

# Galaxy Classification
This notebook uses the fastai package (built on PyTorch) together with scikit-learn (sklearn) to build a supervised machine learning model that classifies images of galaxies. The data for this project came from the Hyper Suprime-Cam Subaru Strategic Program (https://hsc.mtk.nao.ac.jp/ssp/). Images were manually classified into categories.

This notebook also makes use of NVIDIA CUDA, a parallel computing platform which, in this case, allows the model to run directly on the CUDA cores of a PC's Graphics Processing Unit (GPU). This speeds up the model because the GPU can handle batches of galaxy images more efficiently than the CPU can.

If importing an already created model, skip down to the "Import Model and Full Dataset Test" section. Make sure you run the imports below first, though!

In [None]:
import pandas as pd
from pathlib import Path
import shutil  # The split cell below calls shutil.copy
from fastai.vision.all import *
from sklearn.model_selection import train_test_split
import torch

# ONLY RELEVANT IF USING A GPU FOR ADDITIONAL HORSEPOWER
print(torch.cuda.is_available())  # Should print True if GPU is enabled
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # Shows the name of your GPU

## Quick pre-processing

Not a lot of pre-processing is necessary. The only important step is to make sure that no truncated images (images that won't open properly) make it into either the training run or the final data classification; otherwise the model will crash and waste a lot of time. Alternatively, a try/except block could be used in the final model, but I prefer to remove the problem images outright. The following code identifies and deletes images that do not open.

This code is not necessary if you know your images are intact, but it is worth running if you are uncertain or if it is your first time working with those images.

In [None]:
from PIL import Image

# Path to your dataset
dataset_path = Path('ims')

# List to store corrupted file paths
corrupted_files = []

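# Check all image files in the folder
for img_path in dataset_path.iterdir():
    if not img_path.is_file():
        continue  # Skip sub-folders
    try:
        # verify() checks the file structure without decoding the pixel data
        with Image.open(img_path) as img:
            img.verify()
        # Re-open and fully decode the image; this is the step that fails on truncated files
        with Image.open(img_path) as img:
            img.load()
    except Exception as e:
        print(f"Corrupted file: {img_path}, Error: {e}")
        corrupted_files.append(img_path)  # Save the corrupted file path to the list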

In [None]:
# Delete all corrupted files
for file_path in corrupted_files:
    try:
        file_path.unlink()  # Deletes the file
        print(f"Deleted: {file_path}")
    except Exception as e:
        print(f"Failed to delete {file_path}. Error: {e}")
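
If deleting files outright feels too risky, one alternative is to move the suspect files into a separate folder so they can be inspected or restored later. The sketch below assumes a list of Path objects like corrupted_files; the quarantine helper and the folder name are hypothetical and not part of the original workflow.

```python
from pathlib import Path
import shutil

def quarantine(files, quarantine_dir):
    """Move each file into quarantine_dir and return the new paths."""
    quarantine_dir = Path(quarantine_dir)
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for file_path in files:
        file_path = Path(file_path)
        target = quarantine_dir / file_path.name
        try:
            shutil.move(str(file_path), str(target))  # Move instead of delete
            moved.append(target)
        except OSError as e:
            print(f"Failed to move {file_path}. Error: {e}")
    return moved
```

Calling quarantine(corrupted_files, 'quarantine') would stand in for the deletion cell. Keep the quarantine folder outside the dataset folder so it is not picked up on the next scan.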

## Create training sets

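This section splits the data in the training folder (variable source_folder) using train_test_split from sklearn. Each class folder is split 80/20, and the images are copied into train and valid subfolders of source_folder.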

In [None]:
#Split train and validation data randomly
source_folder = Path('training data')
classes = [folder.name for folder in source_folder.iterdir() if folder.is_dir() and folder.name not in ['train', 'valid', 'models']]


for cls in classes:
    img_path = source_folder / cls
    images = list(img_path.iterdir())
    
    # Split into train and valid
    train_imgs, valid_imgs = train_test_split(images, test_size=0.2, random_state=42)

    # Create target folders
    (source_folder / 'train' / cls).mkdir(parents=True, exist_ok=True)
    (source_folder / 'valid' / cls).mkdir(parents=True, exist_ok=True)

    # Copy images (the originals stay in their class folders)
    for img in train_imgs:
        shutil.copy(str(img), str(source_folder / 'train' / cls / img.name))
    for img in valid_imgs:
        shutil.copy(str(img), str(source_folder / 'valid' / cls / img.name))
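
A quick sanity check after splitting is to count how many files ended up in each class for each split. The sketch below assumes the source_folder/train/class and source_folder/valid/class layout created above; split_counts is a hypothetical helper, not part of fastai or sklearn.

```python
from pathlib import Path

def split_counts(root, splits=("train", "valid")):
    """Return {class_name: {split: number_of_files}} for a root/split/class layout."""
    root = Path(root)
    counts = {}
    for split in splits:
        split_dir = root / split
        if not split_dir.is_dir():
            continue  # Split folder has not been created yet
        for cls_dir in sorted(split_dir.iterdir()):
            if cls_dir.is_dir():
                n_files = sum(1 for p in cls_dir.iterdir() if p.is_file())
                counts.setdefault(cls_dir.name, {})[split] = n_files
    return counts
```

Running split_counts('training data') should show a valid count of roughly 20% of each class total, given test_size=0.2. A class with very few validation images is a sign that its accuracy numbers will be noisy.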



## Create learning model

The following cell defines our DataLoaders (dls) and points it at the training data folder. This is where different parameters can be set. I have found that a batch size (bs) of 64 works well on an RTX 5070 Ti, but it will likely need to be reduced on a GPU with less VRAM. VRAM is the main limit on batch size, and therefore on training time. The architecture can be changed in the vision_learner call at the bottom. ConvNeXt_tiny has shown an improvement in accuracy over ResNet. It seems to be on par with the larger ConvNeXt models and is significantly faster as well.

The cell after this one uses fastai's learning rate finder to suggest a learning rate. It is probably not strictly necessary, but I believe that adaptability is always good, so I like to use it.

In [None]:
#Load train data
source_folder = Path('training data')
dls = ImageDataLoaders.from_folder(
    source_folder, # Trains on pre-defined source folder above (training data)
    train='train', 
    valid='valid',
    num_workers=12, #Sets CPU to use specified workers (I am using a Ryzen 9 5900X with 12 cores, trying to maximize power) 
    bs=64, #Sets batch size to bs to leverage GPU in processing by increasing batch size, or decreasing if VRAM runs out
    pin_memory=True #Improves data transfer between CPU and GPU
) 
#, item_tfms=Resize(224)) #Resize images (speeds up model but decreases accuracy (probably) because image looks worse)

#Create model
learn = vision_learner(dls, convnext_tiny, metrics=accuracy)

In [None]:
#Find a suggested learning rate instead of guessing
lr_suggestion = learn.lr_find()  # Returns the suggested rates (the 'valley' point by default)
print(lr_suggestion)
optimal_rate = lr_suggestion.valley

### Model Training

Simple training using fine_tune. Training stops early if valid_loss fails to improve for 3 epochs in a row. If using a powerful GPU, there is no reason not to let the model run for as many epochs as it wants, since it will stop itself.

The cell afterwards creates a confusion matrix to show how training went. It is optional.

Finally, if the model was good and you want to save time on re-training, there's a line to export the model.

In [None]:
#Fine tune
learn.fine_tune(
    5, #epochs (full passes over the training data)
    cbs=[
        EarlyStoppingCallback(monitor='valid_loss', patience=3), #Stops training if valid_loss has not improved for 3 epochs in a row
        SaveModelCallback(monitor='valid_loss', fname='best_model') #Saves the model whenever valid_loss improves and reloads the best one when training ends
    ],
    base_lr = optimal_rate)
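
To make the patience=3 setting concrete, here is a plain-Python sketch of patience-based early stopping. It only illustrates the idea and is not fastai's implementation; early_stop_epoch and the loss values in the usage note are hypothetical.

```python
def early_stop_epoch(valid_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0  # Number of consecutive epochs without a new best loss
    for epoch, loss in enumerate(valid_losses):
        if loss < best - min_delta:
            best = loss  # New best loss: reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None
```

For the made-up sequence [0.9, 0.7, 0.72, 0.71, 0.75, 0.6], the function returns 4: the best loss (0.7) is reached at epoch 1, and epochs 2, 3, and 4 all fail to beat it. The improvement at epoch 5 is never seen, which is the trade-off of a small patience value.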

In [None]:
#Evaluate model with confusion matrix (not necessary but useful visual)
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

In [None]:
#Save classifier
learn.export('full_model2.8.pkl')

# Import Model and Full Dataset Test

If the model has already been created, skip to this point to import it and run it on the dataset.

Some trial and error was necessary to get the model to use the GPU properly here, hence the commented-out cpu=False argument. I left it in just in case, but the two lines that move the model to 'cuda' should work. If the two print lines in the following cell do not print True and cuda, try passing cpu=False to load_learner after the model name. This problem only seems to appear when importing a model; everything works fine when going straight through from model creation above.

In [None]:
#Load in learner
learn = load_learner('full_model2.8.pkl') #cpu=False
dls = learn.dls
print("Model loaded.")
learn.model.to('cuda')
learn.dls.device = torch.device("cuda")

In [None]:
print(next(learn.model.parameters()).is_cuda)
print(learn.dls.device)

### Run Model on Full Dataset

Create a path to the dataset we want to classify, then build a test DataLoader from the learner's dls.

In [None]:
#Path to large dataset
# full_dataset = Path('ims')
full_dataset = Path('small_test_ims')
image_files = list(full_dataset.iterdir())

test_dls = learn.dls.test_dl(image_files, pin_memory=True) #Test DataLoader for the unlabeled images; this is a good point to adjust loader settings before predicting

#### Creating Probabilities and Predicting Classes

At least for the time being, the model seems to cap its predicted probability (around 0.23). I suspect this comes from questionable training data that leaves the model with little confidence in its answers. When examining the logits, they are all approximately the same value, except for the predicted class, which is a bit higher. I believe this can be fixed by cleaning up the training sets, but until then, temperature scaling is applied in the temperature_softmax function to sharpen the probabilities used in the catalog.
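
One other possible contributor to the cap is worth ruling out: if the values passed to a softmax are already probabilities rather than logits, the output is bounded well below 1. In fastai, Learner.get_preds applies the loss function's activation by default, so it is worth checking what the predictions actually contain. The pure-Python sketch below assumes a hypothetical ten-class problem (the real number of classes is not stated here) and shows that bound, along with the effect of a temperature below 1.

```python
import math

def softmax(values, T=1.0):
    """Softmax with temperature T over a list of floats."""
    scaled = [v / T for v in values]
    m = max(scaled)  # Subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

n_classes = 10  # Hypothetical; substitute len(learn.dls.vocab)

# A fully confident probability vector: 1 for one class, 0 elsewhere
confident = [1.0] + [0.0] * (n_classes - 1)

# Softmax applied to probabilities: the top value is e / (e + n_classes - 1)
cap = max(softmax(confident))
print(round(cap, 3))  # about 0.232 for ten classes

# A temperature below 1 sharpens the same input
sharpened = max(softmax(confident, T=0.28))
print(round(sharpened, 3))
```

If the observed cap matches e / (e + n - 1) for your number of classes n, the predictions are probably being passed through softmax twice, and the fix lies in the prediction code rather than in the training data.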

In [None]:
import torch.nn.functional as F

def temperature_softmax(logits, T=0.5):
    return F.softmax(logits / T, dim=1)

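# Get predictions from your model
# (get_preds applies the loss function's activation by default, so these may already be probabilities rather than raw logits)
preds, _ = learn.get_preds(dl=test_dls)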

# Try with a temperature
probs_temp = temperature_softmax(preds, T=0.28)

# Save probabilities along with class predictions
predicted_classes = [learn.dls.vocab[i] for i in probs_temp.argmax(dim=1)]
probs_list = probs_temp.tolist()  # Convert tensor to a Python list


### Catalog Creation

Saves a catalog of all predicted images with their file name, predicted class, and probability of each class.

In [None]:
# Create a DataFrame of probabilities with each column named after the corresponding class
prob_df = pd.DataFrame(probs_temp.numpy(), columns=learn.dls.vocab)

# Retrieve filenames directly from the test DataLoader's dataset (assuming items are file paths)
filenames = [Path(item).name for item in test_dls.dataset.items]

# Create the catalog DataFrame
catalog = pd.DataFrame({
    'filename': filenames,
    'predicted_class': predicted_classes
})

# Concatenate the probabilities: one column per galaxy type
catalog = pd.concat([catalog, prob_df], axis=1)

# Round probabilities
catalog.update(catalog.select_dtypes(float).round(8))

# Save the catalog to a CSV file
catalog.to_csv('catalog_small.csv', index=False)
print("Catalog saved as catalog_small.csv")
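
Once the catalog exists, a common next step is to pull out the high-confidence candidates for one class. The sketch below reads the CSV with the standard library and assumes the column layout written above (filename, predicted_class, then one column per class). The high_confidence_rows helper, the class name 'polar_ring', and the 0.9 threshold are all hypothetical.

```python
import csv

def high_confidence_rows(catalog_path, class_name, threshold=0.9):
    """Return catalog rows predicted as class_name with probability >= threshold."""
    selected = []
    with open(catalog_path, newline="") as f:
        for row in csv.DictReader(f):
            # csv returns strings, so the probability must be converted to float
            if row["predicted_class"] == class_name and float(row[class_name]) >= threshold:
                selected.append(row)
    return selected
```

For example, high_confidence_rows('catalog_small.csv', 'polar_ring') would return the rows worth inspecting by eye first. Because the probabilities are temperature-scaled, the threshold is a ranking aid rather than a calibrated confidence level.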
