### "ML-powered Exoplanet Habitability Pipeline Development"
##### Kishor Baniya
##### Mathematics, Computer Science, and Physics
##### Caldwell University, NJ 07006

This is my first-ever data science project. In it, I explore how to manipulate data so that ML models can be trained effectively, and I assemble those steps into a pipeline for flagging potentially habitable exoplanets. The model learns by pattern recognition: it is trained on how each exoplanet compares to Earth.
This was a learn-by-building project, and I really enjoyed the process. I hope you will too. This file contains all the code and data manipulation behind the poster displayed at the Sagan Workshop at Caltech :)

In [1]:
import pandas as pd
# This cell reads a CSV file of exoplanet data, renames its columns, drops incomplete rows, and labels every record as habitable.

f = pd.read_csv("hwc_table_all.csv")

# renaming the columns for clarity
f.columns = ['name', 'type', 'detection_method', 'mass', 'radius', 'flux',
                  'temperature', 'period', 'distance', 'age', 'esi']

# further refining
f.dropna(subset=['mass', 'radius', 'temperature', 'flux'], inplace=True)

#adding a new column showcasing that all of the records are habitable exoplanets (True or 1)
f['potentially habitable'] = 1
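
To see what the cleaning step does, here is a minimal sketch on a made-up three-row table. The planet names and values are hypothetical, not rows from `hwc_table_all.csv`. It shows that `dropna(subset=...)` removes a row only when one of the four listed features is missing, so a planet with an unknown `age` is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the real catalogue
toy = pd.DataFrame({
    'name': ['Planet A', 'Planet B', 'Planet C'],
    'mass': [1.0, np.nan, 5.3],          # Planet B has no mass measurement
    'radius': [1.0, 1.2, 2.1],
    'temperature': [288.0, 260.0, 245.0],
    'flux': [1.0, 0.9, 0.5],
    'age': [4.5, 1.5, np.nan],           # Planet C has no age, which is not a required feature
})

# Drop rows missing any of the four model features, then add the label column
cleaned = toy.dropna(subset=['mass', 'radius', 'temperature', 'flux']).copy()
cleaned['potentially habitable'] = 1

print(cleaned)
```

Planet B is dropped because its mass is missing, while Planet C survives with a `NaN` age.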


In [2]:
# With the CSV file alone, our model would only learn what a habitable exoplanet looks like.
# However, it is just as important that it understands what a non-habitable exoplanet is.
# Therefore, we now create synthetic data for non-habitable exoplanets and mix it with the original dataset to make training possible.

import numpy as np
import pandas as pd

n = 250  # Number of synthetic samples per group; the three groups below give 750 samples in total.
         # Although the samples are synthetic, they follow the trends of real non-habitable exoplanets:
         # planets that are too small/large, too hot/cold, too close/far, or too light/massive to be habitable.
         # One of the three groups (250 samples) covers planets in the mid-range of the habitable zone that are still labeled non-habitable.

np.random.seed(42)

# Lower-bound non-habitable samples
low_extremes = pd.DataFrame({
    'mass': np.random.uniform(0.001, 0.05, size=n),             
    'radius': np.random.uniform(0.1, 0.4, size=n),              
    'temperature': np.random.uniform(10, 150, size=n),          
    'flux': np.random.uniform(0.01, 0.5, size=n),               
    'potentially habitable': 0
})

# Mid-range non-habitable samples
# These samples represent planets that are neither too small nor too large, but still not habitable.
mid_extremes = pd.DataFrame({
    'mass': np.random.uniform(0.3, 5, size=n),
    'radius': np.random.uniform(0.5, 2.5, size=n),
    'temperature': np.random.uniform(230, 400, size=n),
    'flux': np.random.uniform(0.2, 2.0, size=n),
    'potentially habitable': 0
})

# Upper-bound non-habitable samples
high_extremes = pd.DataFrame({
    'mass': np.random.uniform(20, 300, size=n),                 
    'radius': np.random.uniform(5, 20, size=n),                 
    'temperature': np.random.uniform(500, 2000, size=n),        
    'flux': np.random.uniform(5, 100, size=n),                  
    'potentially habitable': 0
})

# Combining all three groups into a single synthetic non-habitable dataset
synthetic_non_habitable = pd.concat([low_extremes, mid_extremes, high_extremes], ignore_index=True)

# Merging the original habitable dataset with the synthetic non-habitable dataset
final_dataset = pd.concat([f, synthetic_non_habitable], ignore_index=True)

final_dataset.head(n=830)


| | name | type | detection_method | mass | radius | flux | temperature | period | distance | age | esi | potentially habitable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TOI-904 c | M Warm Superterran | Transit | 5.340000 | 2.167000 | 0.524434 | 244.820750 | 83.99970 | 150.322365 | 1.5 | 0.658603 | 1 |
| 1 | TOI-700 e | M Warm Terran | Transit | 0.818000 | 0.953000 | 1.278049 | 305.688111 | 27.80978 | 101.520947 | 1.5 | 0.912032 | 1 |
| 2 | TOI-700 d | M Warm Terran | Transit | 1.250000 | 1.073000 | 0.859827 | 276.939259 | 37.42396 | 101.520947 | 1.5 | 0.941176 | 1 |
| 3 | GJ 357 d | M Warm Superterran | Radial Velocity | 6.100000 | 2.340000 | 0.382595 | 226.261383 | 55.66100 | 30.795030 | NaN | 0.575518 | 1 |
| 4 | GJ 3293 d | M Warm Superterran | Radial Velocity | 7.600000 | 2.670000 | 0.588593 | 251.304031 | 48.13450 | 65.851875 | NaN | 0.629777 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 815 | NaN | NaN | NaN | 166.004793 | 12.384882 | 87.203097 | 1671.982784 | NaN | NaN | NaN | NaN | 0 |
| 816 | NaN | NaN | NaN | 37.819455 | 13.659185 | 19.940955 | 1367.447599 | NaN | NaN | NaN | NaN | 0 |
| 817 | NaN | NaN | NaN | 252.784584 | 17.983657 | 34.429847 | 720.444064 | NaN | NaN | NaN | NaN | 0 |
| 818 | NaN | NaN | NaN | 187.713982 | 19.711090 | 32.554326 | 1716.685655 | NaN | NaN | NaN | NaN | 0 |
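
The merged table is heavily imbalanced, which is worth checking before training. The sketch below assumes the row count implied by the output above: the index runs from 0 to 818, so there are 819 rows, of which 750 are synthetic, leaving 69 habitable records. The `labels` series is a stand-in built from those counts, not the notebook's actual `final_dataset`.

```python
import pandas as pd

# Stand-in for final_dataset['potentially habitable'], built from the counts
# inferred above (69 habitable rows, 750 synthetic non-habitable rows)
labels = pd.Series([1] * 69 + [0] * 750, name='potentially habitable')

counts = labels.value_counts()                 # absolute number of rows per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares.round(3))
```

With roughly one habitable row for every eleven non-habitable ones, a stratified split and class weighting are reasonable choices for the training cell.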


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

features = ['mass', 'radius', 'temperature', 'flux']
X = final_dataset[features]
y = final_dataset['potentially habitable']

# Splitting the dataset into training and testing so that we can evaluate the model later
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling the features so that they have a mean of 0 and a standard deviation of 1
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
model.fit(X_train_scaled, y_train)

# Evaluating the model on the held-out test set.
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))

# Saving the model and the scaler
import joblib
joblib.dump(model, 'habitability_model.pkl')
joblib.dump(scaler, 'scaler.pkl')
print("Model trained and saved successfully.")

              precision    recall  f1-score   support

           0       0.98      0.98      0.98       150
           1       0.79      0.79      0.79        14

    accuracy                           0.96       164
   macro avg       0.88      0.88      0.88       164
weighted avg       0.96      0.96      0.96       164
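
The report can be turned back into raw counts, which makes the errors easier to read. The sketch below is an inference from the rounded figures, not the notebook's actual predictions: with 14 habitable test rows, a recall of 0.79 corresponds to 11 correct detections, and a precision of 0.79 then implies 3 false alarms. The arrays `y_true` and `y_pred` are reconstructed to match those counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Reconstructed labels (an assumption consistent with the report above)
y_true = np.array([0] * 150 + [1] * 14)
y_pred = np.array(
    [0] * 147 + [1] * 3    # non-habitable rows: 147 correct, 3 flagged as habitable
    + [1] * 11 + [0] * 3   # habitable rows: 11 found, 3 missed
)

cm = confusion_matrix(y_true, y_pred)          # rows are true classes, columns are predictions
precision = precision_score(y_true, y_pred)    # 11 / (11 + 3)
recall = recall_score(y_true, y_pred)          # 11 / (11 + 3)
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()    # (147 + 11) / 164

print(cm)
print(round(precision, 2), round(recall, 2), round(accuracy, 2))
```

Under this reading, the model misses 3 of the 14 habitable planets in the test set and mislabels 3 of the 150 non-habitable ones.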

Model trained and saved successfully.
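
Once the model and scaler are saved, they can be reloaded and applied to a new planet. The sketch below is self-contained, so it trains a small stand-in model on toy data rather than reading the real `habitability_model.pkl`. The toy ranges, the temporary directory, and the Earth-like `candidate` row are all assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

features = ['mass', 'radius', 'temperature', 'flux']
rng = np.random.default_rng(0)

# Toy training data (hypothetical): Earth-like planets vs. hot giants
earth_like = pd.DataFrame({
    'mass': rng.uniform(0.5, 2.0, 50),
    'radius': rng.uniform(0.8, 1.3, 50),
    'temperature': rng.uniform(250, 300, 50),
    'flux': rng.uniform(0.7, 1.3, 50),
})
hot_giants = pd.DataFrame({
    'mass': rng.uniform(20, 300, 50),
    'radius': rng.uniform(5, 20, 50),
    'temperature': rng.uniform(500, 2000, 50),
    'flux': rng.uniform(5, 100, 50),
})
X = pd.concat([earth_like, hot_giants], ignore_index=True)
y = np.array([1] * 50 + [0] * 50)

# Fit the scaler and model the same way as the training cell does
scaler = StandardScaler().fit(X)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(scaler.transform(X), y)

# Save both objects, then load them back as a later session would
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, 'habitability_model.pkl')
    scaler_path = os.path.join(tmp, 'scaler.pkl')
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# An Earth-like candidate, with columns in the same order used for training
candidate = pd.DataFrame([[1.0, 1.0, 288.0, 1.0]], columns=features)
candidate_scaled = loaded_scaler.transform(candidate)

prediction = loaded_model.predict(candidate_scaled)[0]
probability = loaded_model.predict_proba(candidate_scaled)[0, 1]
print(prediction, probability)
```

The important detail is that new data must pass through the same saved scaler, with the same column order, before it reaches the model.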
