# Presentation of classification algorithm, Matthias Walter 1780559, October 2025


## **Data cleaning**
We start with data cleaning. The goal is to loop through all image folders, read the RGB pixel data of each 64x64 image, and attach the image's object ID.
Each image becomes a 1D vector of pixel values together with its object ID (which starts with 123). These vectors are matched against the Zoospec dataset provided on CV: the object ID in the Zoospec table links each image's pixel data to its label, "spiral" or "elliptical".



In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import os
from PIL import Image

build = False
## data cleaning start --------------------------------------------------------------------------------------------------------------------------------
# read the classification labels, keyed by object id
# keep only the necessary columns for a better overview
if build:
    classification_df = pd.read_csv("ZooSpecPhotoDR19_torradeflot.csv", usecols=['objid', 'spiral', 'elliptical'])



    directory = r"C:\Users\matth\OneDrive - Universität Graz\Uni\Master\Semester4\Statistics and Data analysis\project_galaxy_data\HIPS2FITS_petro"


    all_image_data = []


    # os.walk goes through all subfolders and files in a directory
    # the .ipynb checkpoints folder and 0.jpg were deleted beforehand for simplicity
    for root, _, files in os.walk(directory):
        print(f"Processing folder: {root}") # show the current folder, because there are many folders
        # 'root' is the path to the current subfolder, e.g. 99
        # 'files' is a list of all filenames in that subfolder; each filename is the object id

        for file_name in files:
            # Process only images and skip any other files in the subfolder
            if file_name.lower().endswith(('.jpg', '.jpeg')):
                # strip the file extension (here .jpg) from the filename,
                # because the filename is the object id, which is needed for matching later
                objid = os.path.splitext(file_name)[0]
                # abs path
                image_path = os.path.join(root, file_name)
                
                # create vector of that image
                with Image.open(image_path) as img:
                    # note: images are RGB, so there are 3 channels of 64x64 pixels each (previous mistake)

                    # Ensure image is in RGB mode (3 channels)
                    img_rgb = img.convert('RGB')
                    
                    # Convert the 64x64x3 image to a NumPy array
                    pixel_array = np.array(img_rgb)
                    
                    # Separate the channels and flatten each one
                    r_channel = pixel_array[:, :, 0].flatten() # All red pixels
                    g_channel = pixel_array[:, :, 1].flatten() # All green pixels
                    b_channel = pixel_array[:, :, 2].flatten() # All blue pixels
                    
                    # 2D-> 1D: Concatenate all channels into one long row
                    flattened_pixels = np.concatenate([r_channel, g_channel, b_channel])
                    
                    # append objectid name                  
                    row = [objid] + flattened_pixels.tolist()

                    # Add this row to our main data list
                    all_image_data.append(row)

    print("Image processing complete.")

    # convert to pandas df 

    # create vectors of pixels ordered, first all red pixels then all green then all blue pixels
    num_pixels = 64 * 64  # 4096 pixels per RGB dimension

    # Create column names for each channel; use i+1 so that pixel numbering starts at 1
    # (not strictly necessary, it just produces a nicer data frame)
    r_cols = [f'R_pix_{i+1}' for i in range(num_pixels)]
    g_cols = [f'G_pix_{i+1}' for i in range(num_pixels)]
    b_cols = [f'B_pix_{i+1}' for i in range(num_pixels)]



    # Combine all column names
    column_names = ['objid'] + r_cols + g_cols + b_cols


    # Create the DataFrame
    image_df = pd.DataFrame(all_image_data, columns=column_names)

    # make sure datatypes match (previous mistake because str and int cannot match of course)
    image_df['objid'] = image_df['objid'].astype(str)
    classification_df['objid'] = classification_df['objid'].astype(str)

    # now use objid to merge both dataframes
    final_df = pd.merge(image_df, classification_df, on='objid')


    # not every object is spiral or elliptical; some have 0 in both columns and are dropped
    print(f"Count of (0, 0) cases: {len(final_df[(final_df['spiral'] == 0) & (final_df['elliptical'] == 0)])}")
    clean_index = (final_df['spiral'] == 1) | (final_df['elliptical'] == 1)

    # final df for classification task 
    final_df = final_df[clean_index]


    # Display the first 5 rows for checking
    print("Merged DataFrame preview:")
    print(final_df.head())

    # save to csv for reuse (iterating through all of the images takes long if the algorithm has to be adapted)
    final_df.to_csv('new_galaxy_data.csv', index=False)

    # if build is false just read the csv that has previously been built by the exact algorithm above
else:
    final_df = pd.read_csv('new_galaxy_data.csv')

## data cleaning end --------------------------------------------------------------------------------------------------------------------------------

### Data cleaning: what to look out for
* The images are RGB, so each one has 3 channels of 64^2 pixels (12288 values in total)
* There is an .ipynb checkpoints folder inside image folder 00
* objid has different data types (str vs. int) in the two dataframes, which must be aligned before merging
* Some objects were labeled neither "spiral" nor "elliptical"; those entries were dropped because they are not useful for a binary classifier
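
The last two points can be illustrated on a toy example. This is only a minimal sketch: the object IDs, pixel values, and labels below are made up and not taken from the Zoospec table; only the column names follow the code above.

```python
import pandas as pd

# Hypothetical toy data: object ids as strings (as they come from file names) ...
image_df = pd.DataFrame({"objid": ["1231", "1232", "1233"], "R_pix_1": [10, 20, 30]})
# ... and as integers (as they are typically read from a CSV table)
labels_df = pd.DataFrame({
    "objid": [1231, 1232, 1233],
    "spiral": [1, 0, 0],
    "elliptical": [0, 1, 0],
})

# Align the key dtypes before merging
labels_df["objid"] = labels_df["objid"].astype(str)
merged = pd.merge(image_df, labels_df, on="objid")

# Keep only rows that carry a usable label
labeled = merged[(merged["spiral"] == 1) | (merged["elliptical"] == 1)]
print(len(merged), len(labeled))
```

The third toy object has 0 in both label columns, so it is dropped by the filter.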

## **Training the ML classification algorithm**

### Data preparation
The cleaned data is now prepared so that it is easy for a machine learning model to "digest".
The goal is to use the pixel data of the images as input for binary classification into "spiral" and "elliptical".
Because every remaining object is labeled either "spiral" or "elliptical", a single label vector with values 0 and 1 is enough as the target variable: a 1 in spiral implies a 0 in elliptical and vice versa.

PCA keeps the components with the most variance. Why? A region of the domain where a function is constant carries less information than a region where the function changes, and we want to learn how the function behaves in different regions of its domain. In this context, each pixel (= feature) is one discrete point of that domain. A pixel that is dark in every image has almost no variance across images and therefore carries less information than a pixel whose value changes from image to image. **Two pictures could therefore be distinguished just as well if the always-dark pixel were neglected, but less data would have to be stored, which makes the model faster.**
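
As an illustration of this argument, here is a small NumPy sketch with three hypothetical "pixels", one of which is dark in every image. The data is randomly generated for the example, and PCA is computed directly from an SVD of the centered data rather than with sklearn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "images": 200 samples with 3 pixels each
n_images = 200
always_dark = np.zeros(n_images)                 # same value in every image
changing_a = rng.normal(100.0, 30.0, n_images)   # varies strongly between images
changing_b = rng.normal(50.0, 10.0, n_images)    # varies less
X_toy = np.column_stack([always_dark, changing_a, changing_b])

# Variance of each pixel across images
pixel_variance = X_toy.var(axis=0)

# PCA via SVD of the centered data: squared singular values are
# proportional to the variance captured by each component
X_centered = X_toy - X_toy.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)

print(pixel_variance)
print(explained_ratio)
```

The always-dark pixel has zero variance, and the last principal component explains essentially none of the total variance, so dropping it loses no information about the differences between images.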

What matters for distinguishing images, however, is not the absolute variance of a pixel but its relative variance: for example, the center region of the images is always brighter than the edges. Therefore every feature (= pixel) is rescaled with StandardScaler, so that the values of each pixel have $\mu = 0$ and $\sigma = 1$.
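
The rescaling amounts to $z = (x - \mu)/\sigma$ for each pixel, with $\mu$ and $\sigma$ taken from the training data only. The following NumPy sketch shows this on made-up pixel values; the array names are hypothetical and chosen just for this example.

```python
import numpy as np

# Hypothetical pixel values: rows are images, columns are pixels
X_train_toy = np.array([[200.0, 10.0],
                        [220.0, 12.0],
                        [240.0, 14.0]])
X_test_toy = np.array([[210.0, 16.0]])

# Mean and standard deviation come from the training set only
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)

# z = (x - mu) / sigma, applied with the same mu and sigma to both sets
Z_train = (X_train_toy - mu) / sigma
Z_test = (X_test_toy - mu) / sigma

print(Z_train.mean(axis=0), Z_train.std(axis=0))
print(Z_test)
```

After rescaling, the bright pixel and the dim pixel have the same spread in the training set, so neither dominates the PCA just because of its brightness.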


The principal components are then computed. Because the data set is quite big and my PC quite slow, incremental PCA is used. 
It processes the data batch by batch and retrieves a given number of components. That number is first estimated by fitting a regular PCA on a smaller subset of the data; this is a bit of a trade-off, but computationally necessary.
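
How a number of components can be derived from a variance target is sketched below with plain NumPy. This is an illustration on randomly generated data, not the exact rule sklearn applies internally; the helper `components_for_variance` and the toy data are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 samples, 20 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(500, 20))

def components_for_variance(X, target=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratio)
    return int(np.searchsorted(cumulative, target) + 1)

# Estimate k on a subset, then compare with the estimate from the full data
k_subset = components_for_variance(X_toy[:100])
k_full = components_for_variance(X_toy)
print(k_subset, k_full)
```

Comparing the two values on real data gives a rough idea of how much the subset estimate deviates, which is the trade-off mentioned above.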





In [2]:
## now to the fun part -------------------------------------------------------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import IncrementalPCA


y = final_df['spiral']  # spiral label is the target variable; 0 means elliptical (see data cleaning)
X = final_df.drop(columns=['objid', 'spiral', 'elliptical'])  # only the pixel values are used as features for the model


X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.3,    # 30% of data for testing 
    random_state=69,  # for reproducibility 
    stratify=y        # this makes sure that train and test data have the same proportion of classes -> better training quality
)

# Without rescaling, a bright pixel (value 250) would be treated as "more important"
# than a dim one (value 20).
# StandardScaler rescales all features to have a mean of 0
# and a standard deviation of 1.

scaler = StandardScaler()

#  fit the scaler ONLY on the training data
X_train_scaled = scaler.fit_transform(X_train)

# transform the test data using the same scaler
X_test_scaled = scaler.transform(X_test)


# because of RAM limitations, incremental PCA is used; it requires a batch size as input

n_sample_size = 5000 
pca_finder = PCA(n_components=0.95, random_state= 69)
pca_finder.fit(X_train_scaled[:n_sample_size]) # Fit on just 5000 rows

# Get the number of components it found
k = pca_finder.n_components_


pca = IncrementalPCA(n_components=k, batch_size = 1000)


# Fit PCA ONLY on the training data
X_train_pca = pca.fit_transform(X_train_scaled)

# Transform the test data using the same PCA fit
X_test_pca = pca.transform(X_test_scaled)

print(f"Original features: {X_train_scaled.shape[1]}")
print(f"PCA-reduced features: {X_train_pca.shape[1]}")
print("-" * 30)


Original features: 12288
PCA-reduced features: 577
------------------------------


## **Choice, training and evaluation of a ML model**
### Training
The training data is now used to train an ML algorithm (I use algorithms from the sklearn library): $X_{train}$ and $y_{train}$ are used for training, and the trained model then makes predictions on the test data $X_{test}$, which has been held out until now. The predictions $y_{pred}$ are then compared with the actual labels $y_{test}$. 

### Evaluation
The entries of the classification report mean the following:
* **precision**: of the galaxies classified as "spiral", how many were actually "spiral"? Low precision means many false positives.
* **recall**: of the actual "spiral" galaxies, how many were identified as such? Low recall means many false negatives.
* **f1 score**: the harmonic mean of precision and recall, $2 \cdot \mathrm{prec} \cdot \mathrm{recall} / (\mathrm{prec} + \mathrm{recall})$
* **accuracy**: the fraction of correct predictions. It can be misleading here because the two classes do not occur 50/50.
* **macro average**: the unweighted mean of a metric over both classes, so the smaller class counts as much as the larger one, which addresses the problem above
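
These definitions can be checked by hand against the confusion matrix printed at the end of this notebook. The sketch below recomputes the metrics with NumPy, assuming the usual sklearn layout (rows are true classes, columns are predicted classes; class 0 is elliptical, class 1 is spiral).

```python
import numpy as np

# Confusion matrix from the final output: rows = true class, columns = predicted class
cm = np.array([[2816, 1009],
               [49, 16003]])

correct = np.diag(cm)                     # correctly classified samples per class
precision = correct / cm.sum(axis=0)      # relative to everything predicted as that class
recall = correct / cm.sum(axis=1)         # relative to everything truly in that class
f1 = 2 * precision * recall / (precision + recall)

accuracy = correct.sum() / cm.sum()
macro_f1 = f1.mean()                      # unweighted mean over both classes

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
print(round(float(accuracy), 2), round(float(macro_f1), 2))
```

Rounded to two decimals, the values agree with the classification report: the accuracy of 0.95 hides the fact that only about 74% of the elliptical galaxies are recovered.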


### Used model
#### Decision tree
A decision tree is constructed like a flowchart that asks questions about the input data and eventually outputs a classification. A question could be, for example: is the red value of pixel 138 > 0.7? Depending on the answer, further questions are asked consecutively until a final result such as "spiral" is reached. (Since the model here is trained on the PCA-reduced data, the questions are actually about component values rather than single pixels.) The actual learning lies in how the content, order, and thresholds of those questions are chosen. They are chosen with the goal of minimizing the impurity of the data: each question should separate "spiral" from "elliptical" as effectively as possible. Additional questions are asked until the tree has partitioned the data into regions that contain only "spiral" or only "elliptical" data points. A common impurity measure is the Shannon entropy $$H(S) = - \sum_{i=1}^{C} p_i \log_2(p_i)$$ with C being the number of classes, in our case 2, and $p_i$ being the fraction of elements in a node S that belong to class i. Note that sklearn's `RandomForestClassifier` uses the closely related Gini impurity by default; the entropy is selected with `criterion="entropy"`.
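
To make the entropy criterion concrete, the following sketch evaluates $H(S)$ for a node and the entropy reduction (information gain) of two candidate questions. The class counts are hypothetical, and the helper functions are written for this example only; they are not part of sklearn.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a node, given the number of samples per class."""
    total = sum(counts)
    entropy = 0.0
    for count in counts:
        if count > 0:                      # 0 * log2(0) is treated as 0
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

def information_gain(parent, children):
    """Entropy reduction achieved by splitting `parent` into `children`."""
    total = sum(parent)
    weighted = sum(sum(child) / total * shannon_entropy(child) for child in children)
    return shannon_entropy(parent) - weighted

# Hypothetical node with 50 spiral and 50 elliptical samples
parent = [50, 50]
good_split = [[45, 5], [5, 45]]    # separates the classes well
poor_split = [[25, 25], [25, 25]]  # separates nothing

print(shannon_entropy(parent))
print(information_gain(parent, good_split))
print(information_gain(parent, poor_split))
```

A question that leaves both child nodes as mixed as the parent gains nothing, while a question that sorts most samples into the correct child reduces the entropy substantially; the tree prefers the latter.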

#### Random forest
A random forest consists of many decision trees, but not every tree in the forest is trained on the whole data set. Each tree is trained on a bootstrapped sample of the original data, so the trees give different outputs: for example, tree 1 votes "spiral" and tree 3 votes "elliptical", because they were trained on different samples of the data. The final result is obtained by averaging over all trees, which reduces overfitting and the overinterpretation of noise by a single decision tree. 
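
The two ingredients, bootstrapping and combining the trees, can be sketched with NumPy. The votes below are made up, and the hard majority vote is a simplification: sklearn's implementation averages the class probabilities predicted by the trees instead of counting votes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bootstrap sample: draw n row indices with replacement from n training rows
n_samples = 1000
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
fraction_seen = np.unique(bootstrap_idx).size / n_samples  # share of distinct rows this tree sees

# Hypothetical votes of 5 trees (rows) for 4 galaxies (columns); 1 = spiral, 0 = elliptical
votes = np.array([[1, 0, 1, 0],
                  [1, 0, 0, 0],
                  [1, 1, 1, 0],
                  [0, 0, 1, 0],
                  [1, 0, 1, 1]])

# Fraction of trees voting "spiral", then a majority decision per galaxy
spiral_fraction = votes.mean(axis=0)
forest_prediction = (spiral_fraction > 0.5).astype(int)

print(round(fraction_seen, 2))
print(spiral_fraction, forest_prediction)
```

Each bootstrap sample leaves out a sizeable part of the rows, which is why the trees differ, and a single dissenting tree does not change the forest's decision.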

In [3]:
# actual training now 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=69)

# Fit the model on the training data
rf_classifier.fit(X_train_pca, y_train)
# Make predictions on the test data
y_pred = rf_classifier.predict(X_test_pca)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred)) 



Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.74      0.84      3825
           1       0.94      1.00      0.97     16052

    accuracy                           0.95     19877
   macro avg       0.96      0.87      0.90     19877
weighted avg       0.95      0.95      0.94     19877

Confusion Matrix:
[[ 2816  1009]
 [   49 16003]]
