<h3>Exoplanet Machine Learning Project</h3>


The goal of this project is to design and train a machine learning model that predicts whether a distant star has one or more exoplanets orbiting it.  

---

**Dataset Preprocessing**  
The dataset is "Exoplanet Hunting in Deep Space", available on Kaggle (https://www.kaggle.com/datasets/keplersmachines/kepler-labelled-time-series-data).  It contains observations from multiple Kepler space telescope campaigns.  Each row represents one star monitored by Kepler, and the columns after the label hold the star's observed light intensity (flux) at successive points in time.  A change in flux can be caused by an exoplanet that orbits the star and blocks part of the observable light as it passes.  Stars are labeled either 1 (without exoplanets) or 2 (with one or more exoplanets).
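
To make the idea of a transit dip concrete, here is a minimal sketch that builds a synthetic light curve with NumPy. It is not taken from the Kaggle data: the function name `synthetic_light_curve` and all parameter values (period, duration, depth, noise level) are made up for illustration, and real transits are not perfectly box-shaped.

```python
import numpy as np


def synthetic_light_curve(n_steps=1000, period=200, duration=10,
                          depth=50.0, noise=5.0, seed=0):
    """Return a noisy, flat flux series with periodic box-shaped dips.

    All defaults are illustrative values, not properties of the Kepler data.
    """
    rng = np.random.default_rng(seed)
    flux = rng.normal(0.0, noise, size=n_steps)   # star with constant brightness plus noise
    t = np.arange(n_steps)
    in_transit = (t % period) < duration          # time steps where the planet blocks light
    flux[in_transit] -= depth                     # light drops while the planet passes
    return flux, in_transit


flux, in_transit = synthetic_light_curve()
print("Mean flux in transit:", flux[in_transit].mean())
print("Mean flux out of transit:", flux[~in_transit].mean())
```

A model trained on the real data has to pick up on this kind of repeating dip, usually buried in much messier noise than this toy example.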

---

**Getting Familiar with the Data**

In [1]:
import pandas as pd

# The dataset is already split into a portion for training and a portion for testing.
train_df = pd.read_csv("exoTrain.csv")
test_df = pd.read_csv("exoTest.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'exoTest.csv'

In [None]:
# Get information about the nature of the dataset
train_df.info()
test_df.info()

In [None]:
train_df.head()

In [None]:
# Shape = (num_rows, num_columns)
print(f"Train Shape: {train_df.shape}")
print(f"Test Shape: {test_df.shape}")

In [None]:
# Check for missing data
train_df.isnull().sum()


In [None]:
# Confirm that labels are either 1 (non-exoplanet star) or 2 (exoplanet star)
print("Unique Labels:", train_df.iloc[:, 0].unique())

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Plot a few random light curves to understand the flux measurements.
# Draws an equal number of exoplanet and non-exoplanet light curves on a shared flux scale.
def plot_random_light_curves(df, num_samples_per_class=3):
    plt.figure(figsize=(10, num_samples_per_class * 3))

    # Separate classes (the first column holds the label)
    exoplanet_df = df[df.iloc[:, 0] == 2]
    non_exoplanet_df = df[df.iloc[:, 0] == 1]

    # Draw a reproducible random sample from each class
    exo_samples = exoplanet_df.sample(num_samples_per_class, random_state=42)
    non_exo_samples = non_exoplanet_df.sample(num_samples_per_class, random_state=42)

    # Combine samples for plotting
    samples = pd.concat([non_exo_samples, exo_samples])

    # Use the flux range of the sampled rows as the shared y-axis scale.
    # (The range of the whole dataset is dominated by outliers and would flatten the curves.)
    flux_min = samples.iloc[:, 1:].min().min()
    flux_max = samples.iloc[:, 1:].max().max()

    for i, (_, row) in enumerate(samples.iterrows()):
        flux_values = row.iloc[1:].values

        plt.subplot(num_samples_per_class * 2, 1, i + 1)
        plt.plot(flux_values, color="blue")
        plt.ylim(flux_min, flux_max)  # apply the shared scale to every subplot
        plt.title(f"Light Curve (Label: {int(row.iloc[0])})")
        plt.xlabel("Time Step")
        plt.ylabel("Flux")

    plt.tight_layout()
    plt.show()

plot_random_light_curves(train_df, num_samples_per_class=3)

In [None]:
# Split dataset into X (features) and y (target)
X_train = train_df.drop(columns=["LABEL"])
y_train = train_df["LABEL"]

X_test = test_df.drop(columns=["LABEL"])
y_test = test_df["LABEL"]

# Check class distribution in the training set
unique, counts = np.unique(y_train, return_counts=True)
print("Training Set Class Distribution:", dict(zip(unique, counts)))

# Check class distribution in the test set
unique_test, counts_test = np.unique(y_test, return_counts=True)
print("Test Set Class Distribution:", dict(zip(unique_test, counts_test)))

&nbsp;  
**Analysis**  
&nbsp;  
The dataset is heavily skewed toward the label 1 class, with only a small fraction of stars labeled 2.  With so few positive examples, a model can reach high accuracy by always predicting label 1 while never identifying an exoplanet star.  Oversampling the minority class with the Synthetic Minority Oversampling Technique (SMOTE) may help offset this imbalance.  
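
The effect of the imbalance on accuracy can be illustrated with a small sketch. The label counts below (990 versus 10) are invented for the example and are not the counts from `exoTrain.csv`. The "model" is just a baseline that always predicts the majority class.

```python
import numpy as np

# Hypothetical label vector: 990 stars labeled 1 and 10 labeled 2 (made-up counts)
y = np.array([1] * 990 + [2] * 10)

# Baseline "model" that always predicts the majority class (label 1)
y_pred = np.full_like(y, 1)

# Accuracy looks excellent because label 1 dominates
accuracy = (y_pred == y).mean()

# Recall for label 2: fraction of true exoplanet stars that were found
recall_exo = ((y_pred == 2) & (y == 2)).sum() / (y == 2).sum()

print(f"Accuracy: {accuracy:.3f}")
print(f"Recall for label 2: {recall_exo:.3f}")
```

On skewed data like this, it makes sense to judge a model by recall and precision on label 2 and not by accuracy alone.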
&nbsp;

---

&nbsp;  
**Finalize Data Prep**  
- Normalize flux values. Many deep learning models train better on normalized features, especially when values vary widely in scale.
- Apply SMOTE to the training set to balance the class distribution.
- Reshape the data. The flux values form a time series, so a sequence model is well suited to this task.  Such models typically expect 3D input of shape (samples, time steps, features).
&nbsp;
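
The three steps above can be sketched end to end on toy data using only NumPy. This is an illustration under stated assumptions, not the notebook's actual pipeline: the array sizes are invented, and the oversampling step is a simplified SMOTE-style interpolation that picks a random minority partner, whereas real SMOTE interpolates toward one of the k nearest minority neighbors.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for the flux table: 20 majority rows, 4 minority rows, 8 time steps
X_train = rng.normal(size=(24, 8))
y_train = np.array([1] * 20 + [2] * 4)
X_test = rng.normal(size=(6, 8))

# 1. Standardize using statistics from the training set only
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mean) / std
X_test_s = (X_test - mean) / std

# 2. Simplified SMOTE-style oversampling: each synthetic row lies on the
#    segment between a minority row and a different, randomly chosen minority row
minority = X_train_s[y_train == 2]
n_new = int((y_train == 1).sum() - len(minority))
base = rng.integers(0, len(minority), size=n_new)
offset = rng.integers(1, len(minority), size=n_new)
partner = (base + offset) % len(minority)      # never equal to base
gap = rng.random((n_new, 1))                   # interpolation factor in [0, 1)
synthetic = minority[base] + gap * (minority[partner] - minority[base])

X_bal = np.vstack([X_train_s, synthetic])
y_bal = np.concatenate([y_train, np.full(n_new, 2)])

# 3. Add a trailing feature axis: (samples, time steps) -> (samples, time steps, 1)
X_bal_3d = X_bal[..., np.newaxis]
X_test_3d = X_test_s[..., np.newaxis]

print("Balanced train shape:", X_bal_3d.shape)
print("Test shape:", X_test_3d.shape)
```

Only the training set is oversampled. The test set is scaled with the training statistics and reshaped, but its class distribution is left untouched so that evaluation reflects the real imbalance.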


In [None]:
from sklearn.preprocessing import StandardScaler

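# Normalize flux values for better model performance.
# Fit the scaler on the training set only, then reuse its statistics on the test set
# so that no information from the test data leaks into preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)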

In [None]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE to balance classes in the training dataset
sm = SMOTE(sampling_strategy="auto", random_state=42)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train_scaled, y_train)

# Check new class distribution
unique_resampled, counts_resampled = np.unique(y_train_resampled, return_counts=True)
print("Balanced Training Data:", dict(zip(unique_resampled, counts_resampled)))

In [None]:
# Reshape from 2D (num_samples, num_time_steps) to 3D (num_samples, num_time_steps, num_features).
# num_features = 1 because each time step has only 1 value (flux).
X_train_reshaped = np.reshape(X_train_resampled, (X_train_resampled.shape[0], X_train_resampled.shape[1], 1))
X_test_reshaped = np.reshape(X_test_scaled, (X_test_scaled.shape[0], X_test_scaled.shape[1], 1))

# Confirm new shape
print("X_train shape:", X_train_reshaped.shape)
print("X_test shape:", X_test_reshaped.shape)
print("y_train shape:", y_train_resampled.shape)

In [None]:
# Rename for easier use later
X_train = X_train_reshaped
X_test = X_test_reshaped
y_train = y_train_resampled

# Save data as numpy arrays
np.save("X_train.npy", X_train)
np.save("X_test.npy", X_test)
np.save("y_train.npy", y_train)
np.save("y_test.npy", y_test)
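
As a quick sanity check, saved arrays can be loaded back with `np.load` and compared against the originals. The sketch below uses small placeholder arrays and a temporary directory so that it runs on its own. The shapes and values are invented and stand in for the real `X_train` and `y_train`.

```python
import os
import tempfile

import numpy as np

# Placeholder arrays standing in for the prepared data (invented shapes and values)
X = np.zeros((4, 8, 1), dtype=np.float32)
y = np.array([1, 2, 1, 2])

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X_train.npy")
    y_path = os.path.join(tmp, "y_train.npy")

    # Save, then load back from disk
    np.save(x_path, X)
    np.save(y_path, y)
    X_loaded = np.load(x_path)
    y_loaded = np.load(y_path)

# Shape, dtype, and values should survive the round trip unchanged
print("X round trip OK:", X_loaded.shape == X.shape and X_loaded.dtype == X.dtype)
print("y round trip OK:", np.array_equal(y_loaded, y))
```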