<a href="https://colab.research.google.com/github/najat-ttt/Exo-Hunter/blob/master/Exo_Hunter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Task**
Create and train a Keras model and a LightGBM model on an exoplanet dataset collected from **NASA**, for the **Exo-Hunter** project to be submitted to the **NSAC-2025** event.


# **01. Load the Dataset**

i. Connect Drive Storage with Colab

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


ii. Load the data from "/content/drive/MyDrive/Exo-Hunter/exoplanets_vs_false_FIXED.csv" into a pandas DataFrame.

In [2]:
import pandas as pd

file_path = '/content/drive/MyDrive/Exo-Hunter/exoplanets_vs_false_FIXED.csv'
try:
    df = pd.read_csv(file_path)
    display(df.head())
    print("\n")
    df.info()  # info() prints its summary itself and returns None
except FileNotFoundError:
    print(f"Error: File not found at {file_path}. Please check the file path and try again.")

| | mission | object_name | disposition | period | planet_radius | star_temp | star_radius | star_mass | discovery_facility |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ARCHIVE | PLANET_1734 | CONFIRMED | 3.943328 | 7.968167 | 5314.834165 | 0.969259 | 0.246349 | UNKNOWN |
| 1 | Kepler | K00711.03 | CONFIRMED | 124.524522 | 2.69 | 5497.0 | 1.046 | 0.988 | NaN |
| 2 | Kepler | K01501.01 | CONFIRMED | 2.617028 | 1.57 | 4831.0 | 0.724 | 0.706 | NaN |
| 3 | ARCHIVE | PLANET_168 | CONFIRMED | 300.262512 | 13.2379 | 9170.291947 | 3.498023 | 2.619901 | UNKNOWN |
| 4 | ARCHIVE | PLANET_1768 | CONFIRMED | 325.259434 | 13.519703 | 5557.471709 | 3.712298 | 1.821366 | UNKNOWN |




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16025 entries, 0 to 16024
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   mission             16025 non-null  object 
 1   object_name         16025 non-null  object 
 2   disposition         16025 non-null  object 
 3   period              15992 non-null  float64
 4   planet_radius       15647 non-null  float64
 5   star_temp           15735 non-null  float64
 6   star_radius         15646 non-null  float64
 7   star_mass           13242 non-null  float64
 8   discovery_facility  6013 non-null   object 
dtypes: float64(5), object(4)
memory usage: 1.1+ MB



# **02. Preprocessing the Dataset**
The dataset has been loaded and inspected. The next step is to preprocess it: select the relevant features, impute missing values (median for the numerical columns, a constant `'missing'` category for `discovery_facility`), standardize the numerical features, one-hot encode the categorical feature, and split the data into training and validation sets.

In [3]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Define features and target columns
features = ['period', 'planet_radius', 'star_temp', 'star_radius', 'star_mass', 'discovery_facility']
target = 'disposition'

# Check that every feature column and the target column exist in the DataFrame
missing = [col for col in features + [target] if col not in df.columns]
if missing:
    raise ValueError(f"CSV is missing required columns: {missing}")

# Filter for relevant dispositions and map to numerical labels
# (.copy() avoids pandas' SettingWithCopyWarning when adding the label column)
df_filtered = df[df[target].isin(['CONFIRMED', 'FALSE POSITIVE'])].copy()
df_filtered['label'] = (df_filtered[target] == 'CONFIRMED').astype(int)

# Separate features and target
X = df_filtered[features]
y = df_filtered['label']

# Identify numerical and categorical features
numerical_features = ['period', 'planet_radius', 'star_temp', 'star_radius', 'star_mass']
categorical_features = ['discovery_facility']

# Create preprocessing pipelines for numerical and categorical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), # Use median imputation for numerical features
    ('scaler', StandardScaler()) # Scale numerical features
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), # Use constant imputation for categorical features
    ('onehot', OneHotEncoder(handle_unknown='ignore')) # One-hot encode categorical features
])

# Create a column transformer to apply different transformations to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough' # No effect here: X holds only the feature columns listed above
)

# Apply preprocessing to the dataset
X_preprocessed = preprocessor.fit_transform(X)

print(f"X_preprocessed shape: {X_preprocessed.shape}, y shape: {y.shape}")

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_preprocessed, y, test_size=0.2, stratify=y, random_state=42)

print(f"Training data shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"Validation data shape: {X_val.shape}, y_val shape: {y_val.shape}")

X_preprocessed shape: (16025, 7), y shape: (16025,)
Training data shape: (12820, 7), y_train shape: (12820,)
Validation data shape: (3205, 7), y_val shape: (3205,)
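
Note that the cell above calls `fit_transform` on all rows before splitting, so the imputation medians and scaling statistics also see the validation rows. A minimal sketch of a leak-free ordering follows: split first, fit the preprocessor on the training rows only, then transform the validation rows. It uses a small synthetic frame (`toy` and `labels` are hypothetical stand-ins for `X` and `y`) and only two of the numerical columns, so it is an illustration rather than a drop-in replacement for the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data standing in for the real features and labels
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "period": rng.lognormal(2.0, 1.0, n),
    "planet_radius": rng.lognormal(1.0, 0.5, n),
    "discovery_facility": ["UNKNOWN" if r < 0.4 else np.nan for r in rng.random(n)],
})
toy.loc[rng.choice(n, 20, replace=False), "period"] = np.nan  # inject missing values
labels = pd.Series(rng.integers(0, 2, n), name="label")

# 1) Split the raw rows first
X_tr, X_va, y_tr, y_va = train_test_split(
    toy, labels, test_size=0.2, stratify=labels, random_state=42
)

# 2) Same preprocessing steps as in the notebook
prep = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["period", "planet_radius"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["discovery_facility"]),
])

# 3) Fit on training rows only, then reuse the fitted statistics for validation
X_tr_p = prep.fit_transform(X_tr)
X_va_p = prep.transform(X_va)
print(X_tr_p.shape, X_va_p.shape)
```

With this ordering the validation metrics reflect data the preprocessing has never seen; the reported numbers in this notebook were produced with the original ordering.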


# **03. Model Creation and Training**
The fixed dataset has been preprocessed and split into training and validation sets. The next step is to define and train the models on the training split, using the validation split to monitor performance. We will train two models for comparison:

*   Basic Tabular **Keras** Model
*   **LightGBM**

---

# Keras Model


In [4]:
from tensorflow.keras import layers, models

def make_tabular_model(input_dim):
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(32, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Create the Keras model with new input dimension
input_dim = X_train.shape[1]
basic_keras_model = make_tabular_model(input_dim)

# Train the Keras model
print("Training Basic Keras Model with the dataset...")
history_keras_model = basic_keras_model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=70, batch_size=32, verbose=1)
print("Training Done!")

Training Basic Keras Model with the dataset...
Epoch 1/70
401/401 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.7398 - loss: 0.4807 - val_accuracy: 0.7934 - val_loss: 0.3755
Epoch 2/70
401/401 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8101 - loss: 0.3578 - val_accuracy: 0.8190 - val_loss: 0.3617
Epoch 3/70
401/401 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8197 - loss: 0.3474 - val_accuracy: 0.8162 - val_loss: 0.3549
Epoch 4/70
401/401 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8292 - loss: 0.3340 - val_accuracy: 0.8321 - val_loss: 0.3418
Epoch 5/70
401/401 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8288 - loss: 0.3354 - val_accuracy: 0.8287 - val_loss: 0.3365
Epoch 6/70
401/401 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8391 - loss: 0.3177 - val_accuracy: 0.8365

The basic Keras model has been trained on the dataset. The next step is to evaluate the trained Keras model on the validation set to get the accuracy and loss.

In [5]:
# Evaluate the basic Keras model on the fixed validation set
print("Evaluating Basic Keras Model with dataset...")
loss_basic_keras, accuracy_basic_keras = basic_keras_model.evaluate(X_val, y_val, verbose=0)

print(f"Basic Keras Model Evaluation - Loss: {loss_basic_keras:.4f}, Accuracy: {accuracy_basic_keras:.4f}")

Evaluating Basic Keras Model with dataset...
Basic Keras Model Evaluation - Loss: 0.2727, Accuracy: 0.8718
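
Accuracy alone does not show which kind of error the classifier makes (false positives kept as planets versus confirmed planets rejected). A minimal sketch of how the sigmoid outputs could be turned into a confusion matrix, precision, and recall is given below. The arrays `y_true` and `y_prob` are made-up illustrative values; in the notebook they would correspond to `y_val` and `basic_keras_model.predict(X_val).ravel()`, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predicted probabilities (not results from this notebook)
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_prob = np.array([0.91, 0.62, 0.35, 0.20, 0.55, 0.80, 0.10, 0.70])

# Threshold the probabilities to obtain hard class predictions
threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```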


# LightGBM Model

In [6]:
!pip install LightGBM



In [7]:
import lightgbm as lgb

# Fine-tune the regularized configuration further
model_lgb = lgb.LGBMClassifier(
    random_state=42,
    n_estimators=1500,       # Upper bound on trees; early stopping below can stop sooner
    learning_rate=0.025,     # Small step size, paired with many trees
    num_leaves=20,           # Maximum leaves per tree
    max_depth=5,             # Maximum tree depth
    min_child_samples=40,    # Minimum samples required in a leaf
    subsample=0.85,          # Row sampling ratio; takes effect only when subsample_freq > 0 (default 0)
    colsample_bytree=0.85,   # Fraction of features sampled per tree
    reg_alpha=0.25,          # L1 regularization
    reg_lambda=0.25,         # L2 regularization
    force_col_wise=True
)

print("\nTraining Fine-tuned LightGBM Model...")
model_lgb.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)],
    # 'accuracy' is not a built-in LightGBM metric name (the log below reports only
    # binary_logloss); 'binary_error' (1 - accuracy) is the built-in alternative
    eval_metric='logloss',
    callbacks=[
        lgb.early_stopping(150, verbose=False),  # Stop after 150 rounds without validation improvement
        lgb.log_evaluation(100)                  # Log metrics every 100 rounds
    ]
)
print("Training Done!")


Training Fine-tuned LightGBM Model...
[LightGBM] [Info] Number of positive: 8020, number of negative: 4800
[LightGBM] [Info] Total Bins 1279
[LightGBM] [Info] Number of data points in the train set: 12820, number of used features: 7
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.625585 -> initscore=0.513323
[LightGBM] [Info] Start training from score 0.513323
[100]	training's binary_logloss: 0.265525	valid_1's binary_logloss: 0.278004
[200]	training's binary_logloss: 0.226034	valid_1's binary_logloss: 0.245105
[300]	training's binary_logloss: 0.20281	valid_1's binary_logloss: 0.227909
[400]	training's binary_logloss: 0.187226	valid_1's binary_logloss: 0.219
[500]	training's binary_logloss: 0.175366	valid_1's binary_logloss: 0.213684
[600]	training's binary_logloss: 0.167564	valid_1's binary_logloss: 0.21143
[700]	training's binary_logloss: 0.160086	valid_1's binary_logloss: 0.210182
[800]	training's binary_logloss: 0.153803	valid_1's binary_logloss: 0.209015
[900]	training's binary

The LightGBM model has been trained on the dataset. The next step is to evaluate the trained LightGBM model on the validation set to get the accuracy and loss (logloss).
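
For reference, the log loss reported here is the mean binary cross-entropy over the $N$ validation samples, where $y_i \in \{0, 1\}$ is the true label and $p_i$ is the predicted probability of the positive class (`CONFIRMED`):

$$
\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]
$$

This is, up to numerical clipping of the probabilities, the same quantity that Keras reports as `binary_crossentropy`, which is why the loss values of the two models can be compared directly.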

In [8]:
from sklearn.metrics import accuracy_score, log_loss

# Evaluate the LightGBM model on the validation set
print("Evaluating LightGBM Model with dataset...")
y_pred_lgb = model_lgb.predict(X_val)
y_pred_proba_lgb = model_lgb.predict_proba(X_val)[:, 1]
accuracy_lgb = accuracy_score(y_val, y_pred_lgb)
# labels=[0, 1] keeps log_loss defined even if y_val contained a single class
logloss_lgb = log_loss(y_val, y_pred_proba_lgb, labels=[0, 1])

print(f"LightGBM Model Evaluation - Loss: {logloss_lgb:.4f}, Accuracy: {accuracy_lgb:.4f}")

Evaluating LightGBM Model with dataset...




LightGBM Model Evaluation - Loss: 0.2088, Accuracy: 0.9164


# **04. Comparison between the Models**
Compare the validation loss and accuracy of the two models. On this validation set, LightGBM reaches both a lower loss and a higher accuracy than the basic Keras model.

In [10]:
# Compare the validation loss and accuracy scores

print("Comparison between the Models")
print(f"Basic Keras Model - Loss: {loss_basic_keras:.4f}, Accuracy: {accuracy_basic_keras:.4f}")
print(f"LightGBM Model - Loss: {logloss_lgb:.4f}, Accuracy: {accuracy_lgb:.4f}")

Comparison between the Models
Basic Keras Model - Loss: 0.2727, Accuracy: 0.8718
LightGBM Model - Loss: 0.2088, Accuracy: 0.9164
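
To put these accuracies in context, it can help to compare them with a majority-class baseline, that is, always predicting the more frequent class. The sketch below uses the class counts from the LightGBM training log (8020 positive, 4800 negative). These are training-split counts; because the split is stratified, the validation split is assumed to have nearly the same ratio.

```python
# Class counts reported in the LightGBM training log above
n_pos, n_neg = 8020, 4800

# Accuracy of a classifier that always predicts the majority class
baseline_acc = max(n_pos, n_neg) / (n_pos + n_neg)

# Validation accuracies reported above
results = {"Basic Keras Model": 0.8718, "LightGBM Model": 0.9164}

print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
for name, acc in results.items():
    print(f"{name}: {acc:.4f} ({acc - baseline_acc:+.4f} vs. baseline)")
```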
