# DS3 Datathon 2024 - Celestial Bodies

This notebook generates predictions for the celestial bodies dataset of the DS3 Datathon 2024.

Created by:

* [Borys Łangowicz (neloduka_sobe)](https://www.linkedin.com/in/borys-langowicz/)

* [Martin Pellikka](https://www.linkedin.com/in/martinpellikka/)



[Link to the kaggle competition](https://www.kaggle.com/competitions/ds3-datathon-celestial-labelling)

## Imports

In [1]:
# Numbers
import pandas as pd
import numpy as np

# Graphs
import seaborn as sns
import matplotlib.pyplot as plt

# ML
import sklearn
import sklearn.model_selection
from sklearn.metrics import accuracy_score

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

from xgboost import XGBClassifier
from xgboost import plot_importance

from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
import random

# Pipeline
from sklearn.model_selection import train_test_split # For train/test splits
from sklearn.feature_selection import VarianceThreshold # Feature selector
from sklearn.pipeline import Pipeline # For setting up pipeline

# Various pre-processing steps
from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler, PowerTransformer, MaxAbsScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV # For optimization

from sklearn.metrics import balanced_accuracy_score




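### Setting final/development mode

Because this is a datathon submission, we train on the entire dataset to produce more accurate results for the final submission. Setting `generating_final_result` to `False` instead holds out a third of the data for evaluation.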

In [None]:
generating_final_result = True

### Fixing seeds

[Source](https://sklearn-genetic-opt.readthedocs.io/en/stable/tutorials/reproducibility.html)

In [None]:
random_seed = 5643
np.random.seed(random_seed)
random.seed(random_seed)
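
As a minimal sketch of what the seeding achieves: re-seeding the global generators replays the same sequence of draws. scikit-learn estimators created without an explicit `random_state` draw from NumPy's global generator, so seeding it before fitting should make such runs repeatable. The helper name `draw` is hypothetical and exists only for this illustration.

```python
import random

import numpy as np

random_seed = 5643


def draw(seed):
    """Seed both global generators, then draw a few numbers from each."""
    np.random.seed(seed)
    random.seed(seed)
    return np.random.rand(3).tolist(), [random.random() for _ in range(3)]


first = draw(random_seed)
second = draw(random_seed)
print(first == second)  # True
```

Passing `random_state=random_seed` directly to an estimator is an alternative that does not depend on global state.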

## Loading Data

### Loading training data

In [None]:
data_train = pd.read_csv("/content/space/celestial_train.csv")

### Separating X values for data_train

In [None]:
data_trainX = data_train.loc[:,data_train.columns != 'class']

### Separating Y values for data_train

In [None]:
data_trainY = data_train["class"]

## Description of the Data

The provided dataset consists of the following columns:

`id` = Object Identifier, the unique value that identifies the object in the image catalog used by the CAS.

`alpha` = Right Ascension angle (at J2000 epoch).

`delta` = Declination angle (at J2000 epoch).

`u` = Ultraviolet filter in the photometric system.

`g` = Green filter in the photometric system.

`r` = Red filter in the photometric system.

`i` = Near Infrared filter in the photometric system.

`z` = Infrared filter in the photometric system.

`run_ID` = Run Number used to identify the specific scan.

`rerun_ID` = Rerun Number to specify how the image was processed.

`cam_col` = Camera column to identify the scanline within the run.

`field_ID` = Field number to identify each field.

`spec_obj_ID` = Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class).

`redshift` = redshift value based on the increase in wavelength.

`plate` = plate ID, identifies each plate in SDSS.

`MJD` = Modified Julian Date, used to indicate when a given piece of SDSS data was taken.

`fiber_ID` = fiber ID that identifies the fiber that pointed the light at the focal plane in each observation.

`class` = object class [GALAXY: galaxy, STAR: star or QSO: quasar object].


Acknowledgements:
[Sloan Digital Sky Survey](https://www.sdss4.org/science/image-gallery/)

## Data Cleaning
[Source](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

### Label encoding

In [None]:
le = LabelEncoder()
le.fit(data_trainY)
data_trainY = le.transform(data_trainY)
data_trainY
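
To make the encoding concrete, here is a small sketch on made-up labels (the list `labels` and the name `le_demo` are illustrative, not part of the dataset). `LabelEncoder` sorts the class names and maps each to an integer, and `inverse_transform` maps the integers back.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels using the three classes from the dataset description
labels = ["GALAXY", "STAR", "QSO", "GALAXY"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

print(le_demo.classes_.tolist())                    # ['GALAXY', 'QSO', 'STAR']
print(encoded.tolist())                             # [0, 2, 1, 0]
print(le_demo.inverse_transform(encoded).tolist())  # ['GALAXY', 'STAR', 'QSO', 'GALAXY']
```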

### Dropping insignificant columns (Determined by feature importance)

We drop the columns that were identified as insignificant.

In [None]:
data_trainX = data_trainX.drop(columns=['id', 'fiber_ID', 'cam_col', 'rerun_ID', 'alpha', 'delta', 'run_ID', 'field_ID'])

## Pipeline

We create the main pipeline for the model.

### Train/Test split

In [None]:
if not generating_final_result:
  X_train, X_test, y_train, y_test = train_test_split(
      data_trainX,
      data_trainY,
      test_size=1/3,
      random_state=0)

  print(X_train.shape)
  print(X_test.shape)
else:
  X_train = data_trainX
  y_train = data_trainY

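We use a random forest to classify the celestial objects. The features are first standardized, then filtered by an L1-penalized linear SVM, before reaching the classifier.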

In [None]:
pipe = Pipeline(steps=[
  ('scaler', StandardScaler()),
  ('selector', SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False))),
  ('classifier', RandomForestClassifier()),
])

pipe.fit(X_train,y_train)
if not generating_final_result:
  y_pred = pipe.predict(X_test)

print('Training set score: ' + str(pipe.score(X_train,y_train)))
if not generating_final_result:
  print('Test set score: ' + str(pipe.score(X_test,y_test)))

if not generating_final_result:
  print()
  print("Accuracy on test data:", accuracy_score(y_test, y_pred))
  print("Balanced accuracy on test data:", balanced_accuracy_score(y_test, y_pred))
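
The two metrics can differ sharply when classes are imbalanced. The sketch below uses made-up labels (not the datathon data) and a hypothetical classifier that always predicts the majority class: plain accuracy looks good, while balanced accuracy, the mean of per-class recall, does not.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy labels (made up for illustration): 8 samples of class 0, 2 of class 1
y_true = [0] * 8 + [1] * 2
# A classifier that always predicts the majority class
y_hat = [0] * 10

acc = accuracy_score(y_true, y_hat)               # 8 of 10 correct
bal_acc = balanced_accuracy_score(y_true, y_hat)  # mean of recalls 1.0 and 0.0

print(acc, bal_acc)
```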

## Optimization

In [None]:
parameters = {'scaler': [StandardScaler(), MinMaxScaler(),
              Normalizer(), MaxAbsScaler()],
              'classifier__max_depth': [2,4,6],
              'classifier__min_samples_leaf': [x for x in range(1,10)],
              }
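
A short sketch of how this grid is read, assuming the step names `scaler` and `classifier` from the pipeline above: the key `scaler` swaps out the whole step, while `classifier__max_depth` sets the `max_depth` parameter on the step named `classifier`. `ParameterGrid` can be used to count the candidates before starting a search.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   StandardScaler)

parameters = {
    'scaler': [StandardScaler(), MinMaxScaler(), Normalizer(), MaxAbsScaler()],
    'classifier__max_depth': [2, 4, 6],
    'classifier__min_samples_leaf': list(range(1, 10)),
}

# 4 scalers x 3 depths x 9 leaf sizes
n_candidates = len(ParameterGrid(parameters))
cv_folds = 2

print(n_candidates)             # 108
print(n_candidates * cv_folds)  # 216 fits during the search
```

With the default `refit=True`, `GridSearchCV` fits the best candidate once more on the full training data after the search.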

Calculating the accuracy of the model

In [None]:
if not generating_final_result:
  grid = GridSearchCV(pipe, parameters, cv=2).fit(X_train, y_train)

  y_pred = grid.predict(X_test)

  print('Training set score: ' + str(grid.score(X_train, y_train)))
  print('Test set score: ' + str(grid.score(X_test, y_test)))
  print()
  print("Accuracy on test data:", accuracy_score(y_test, y_pred))
  print("Balanced accuracy on test data:", balanced_accuracy_score(y_test, y_pred))

Determining feature importance

In [None]:
# grid only exists in development mode, so guard against a NameError
if not generating_final_result:
  # Access the best set of parameters
  best_params = grid.best_params_
  print(best_params)
  # Store the optimal model in best_pipe
  best_pipe = grid.best_estimator_
  print(best_pipe)
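
Printing the best estimator does not by itself show feature importances. The following is a minimal sketch of how they could be read from a fitted pipeline with the same step names, run on synthetic stand-in data (the columns `signal` and `noise`, and all `demo` names, are made up for illustration). Because the selector may discard columns, the importances are labeled with the columns that survived selection.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 300
# Synthetic stand-in data: 'signal' determines the label, 'noise' does not
X_demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_demo = (X_demo["signal"] > 0).astype(int)

demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("selector", SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False))),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])
demo_pipe.fit(X_demo, y_demo)

# Columns kept by the selector, in their original order
kept = X_demo.columns[demo_pipe.named_steps["selector"].get_support()]

importances = pd.Series(
    demo_pipe.named_steps["classifier"].feature_importances_,
    index=kept,
).sort_values(ascending=False)
print(importances)
```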

We decided to drop the following columns:

`id`, `fiber_ID`, `cam_col`, `rerun_ID`, `alpha`, `delta`, `run_ID`, `field_ID`

## Creating data for submission

Loading, standardizing, and preparing the results for the final datathon submission.

### Loading test data

In [None]:
data_test = pd.read_csv("/content/space/celestial_test.csv")

## Predicting

In [None]:
# Keep only the feature columns the pipeline was trained on, in the same order
y_pred = pipe.predict(data_test[X_train.columns])

In [None]:
y_pred = le.inverse_transform(y_pred)

### Saving predictions to CSV

We save final predictions to a file to be submitted.

In [None]:
# ids come from the test set's id column
return_dataset = pd.DataFrame({'id': data_test['id'], 'output': y_pred}, columns=['id', 'output'])
return_dataset

In [None]:
return_dataset.to_csv("celestial_solutions.csv", index = False)
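
Before uploading, it can help to read the file back and check its shape. The sketch below does this on a hypothetical three-row stand-in written to an in-memory buffer instead of the real `celestial_solutions.csv`; the column names `id` and `output` follow the cell above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real predictions
demo = pd.DataFrame({"id": [101, 102, 103],
                     "output": ["GALAXY", "STAR", "QSO"]})

# Write to an in-memory buffer, then read it back as a CSV
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Basic sanity checks on the submission layout
assert list(round_trip.columns) == ["id", "output"]
assert round_trip["id"].is_unique
assert round_trip["output"].isin(["GALAXY", "STAR", "QSO"]).all()
print(round_trip.shape)  # (3, 2)
```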