Machine learning models to classify exoplanets from the raw dataset. Using Python: Sklearn, Joblib, NumPy, Pandas, Matplotlib.

hanesy/machine-learning-challenge

Machine Learning - Exoplanet Exploration

This project explores different machine learning models capable of classifying candidate exoplanets from the raw dataset.

Over nine years in deep space, NASA's Kepler space telescope hunted for planets outside our solar system.

Research Questions

Which machine learning model best fits the data?

Getting Started

I explored four different machine learning models, each contained in its own notebook.

A comparative summary of the hypertuned models is in a separate notebook.

Prerequisites

The libraries I used are as follows:

  • sklearn
  • joblib
  • numpy
  • pandas
  • matplotlib

All notebooks contain pip install cells for joblib and sklearn that can be uncommented to make sure the versions on your machine are up to date.

Data Cleaning and Modeling

Data Cleaning

In each of the notebooks, I clean the source data to drop null values and remove the error columns.
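The cleaning step could look roughly like the sketch below. It is not the notebooks' actual code: the tiny stand-in table, the `clean` helper, and the assumption that error columns can be recognised by `_err` in their names are all illustrative.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the raw Kepler table (values and column names are illustrative).
raw = pd.DataFrame({
    "koi_disposition": ["CONFIRMED", "FALSE POSITIVE", "CANDIDATE", "CONFIRMED"],
    "koi_period": [54.4, 19.9, np.nan, 2.5],
    "koi_period_err1": [2.8e-4, 1.5e-5, 3.7e-6, 1.1e-6],
    "koi_period_err2": [-2.8e-4, -1.5e-5, -3.7e-6, -1.1e-6],
    "koi_depth": [874.8, 10829.0, 8079.2, 603.3],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Remove the error columns, then drop rows that still contain nulls."""
    # Assumption: error columns are the ones with "_err" in their name.
    err_cols = [col for col in df.columns if "_err" in col]
    return df.drop(columns=err_cols).dropna().reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Dropping the error columns before the null rows is a choice made for this sketch: it avoids discarding a row only because one of its error columns is empty.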

Developing the model (Pre-process, train, test)

The "y" variable for machine learning is "koi_disposition," which classifies each candidate as "confirmed", "candidate", or "false positive."

The "x" variables are the remaining columns in the dataset. The column definitions are provided at the end of each model notebook, and they can also be found on Kaggle or in the data dictionary.

From the cleaned dataframe, I create a stratified train test split from the data with random_state=42.
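A stratified split with `random_state=42` can be sketched as follows. The feature matrix and labels are synthetic stand-ins, and the default test size of 25% is an assumption, since the README does not state the split ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the cleaned features and the koi_disposition labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array(["CANDIDATE"] * 20 + ["CONFIRMED"] * 30 + ["FALSE POSITIVE"] * 50)

# stratify=y keeps the class proportions (close to) equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
```

Stratifying matters here because the three outcomes are not equally frequent; without it, a random split could under-represent one class in the test set.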

I scale the data using a quantile transformer and normalizer.
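One way to chain a quantile transformer and a normalizer is sketched below. The order of the two steps, the `n_quantiles` value, and the synthetic data are assumptions, not details taken from the notebooks.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, QuantileTransformer

# Synthetic, skewed features standing in for the train and test splits.
rng = np.random.default_rng(0)
X_train = rng.lognormal(size=(200, 3))
X_test = rng.lognormal(size=(50, 3))

scaler = make_pipeline(
    # Maps each feature onto a uniform [0, 1] distribution.
    QuantileTransformer(n_quantiles=100, output_distribution="uniform", random_state=42),
    # Rescales each sample (row) to unit L2 norm.
    Normalizer(),
)

# Fit on the training data only, then apply the same transform to the test data.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training split alone keeps information from the test split out of the preprocessing step.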

I train and test each model, then hypertune it.
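The train, test, and hypertune loop might look like the sketch below, which uses scikit-learn's `GridSearchCV` on a random forest. The synthetic dataset, the parameter grid, and the choice of grid search itself are assumptions; the notebooks may tune different parameters or use a different search method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data standing in for the scaled Kepler features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Baseline (non-hypertuned) model.
base = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)
base_score = base.score(X_test, y_test)

# Hypertuned model: cross-validated search over a small, illustrative grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_train, y_train)
tuned_score = grid.score(X_test, y_test)

print(grid.best_params_, base_score, tuned_score)
```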

Comparison

As a final step, I compile the scores and classification reports for the hypertuned models and observe the best-fit model in this notebook.
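Compiling accuracy and per-outcome precision into one table could be done along these lines. The labels and predictions are made up for illustration, and `model_a` and `model_b` are hypothetical names rather than the models in this project.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

# Made-up test labels and predictions from two hypothetical models.
y_test = ["CANDIDATE", "CANDIDATE", "CONFIRMED", "CONFIRMED",
          "FALSE POSITIVE", "FALSE POSITIVE"]
predictions = {
    "model_a": ["CANDIDATE", "CANDIDATE", "CONFIRMED", "CONFIRMED",
                "FALSE POSITIVE", "FALSE POSITIVE"],
    "model_b": ["CANDIDATE", "CONFIRMED", "CONFIRMED", "CONFIRMED",
                "FALSE POSITIVE", "FALSE POSITIVE"],
}

rows = []
for name, y_pred in predictions.items():
    # output_dict=True returns the report as nested dicts instead of text.
    report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
    row = {"model": name, "accuracy": accuracy_score(y_test, y_pred)}
    for label in sorted(set(y_test)):
        row[f"precision_{label}"] = report[label]["precision"]
    rows.append(row)

# One row per model, best accuracy first.
summary = pd.DataFrame(rows).set_index("model").sort_values("accuracy", ascending=False)
print(summary)
```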

Which model best fits the data?

The hypertuned RFC (random forest classifier) with standard scaler fits the data best, judging by the model score (0.90). Its classification report also shows the best precision for the "CONFIRMED" outcome (0.80).

Hypertuning has little impact on model scores: accuracy changes by at most 0.02 for any model.

See the details in the sections below.

Details for Each Model

The following sections show the overall scores for the hypertuned and non-hypertuned models, and the per-outcome precision for the hypertuned models only.

For details on the hypertuned models' classification reports (including recall, F1 score, etc.), see the Model_Comparison notebook. For classification reports of the non-hypertuned models, see each individual notebook.


Test Scores

| Logistic - Type | Accuracy Score |
| --- | --- |
| Non-hypertuned | 0.87 |
| Hypertuned | 0.89 |

Hypertuned Model Outcome Precision

| Outcome | Precision Score |
| --- | --- |
| CANDIDATE | 0.82 |
| CONFIRMED | 0.76 |
| FALSE POSITIVE | 0.99 |

Test Scores

| RFC - Type | Accuracy Score |
| --- | --- |
| Non-hypertuned | 0.90 |
| Hypertuned | 0.90 |

Hypertuned Model Outcome Precision

| Outcome | Precision Score |
| --- | --- |
| CANDIDATE | 0.86 |
| CONFIRMED | 0.80 |
| FALSE POSITIVE | 0.97 |

Test Scores

| SVC - Type | Accuracy Score |
| --- | --- |
| Non-hypertuned | 0.89 |
| Hypertuned | 0.88 |

Hypertuned Model Outcome Precision

| Outcome | Precision Score |
| --- | --- |
| CANDIDATE | 0.80 |
| CONFIRMED | 0.76 |
| FALSE POSITIVE | 0.99 |

Test Scores

| KNN - Type | Accuracy Score |
| --- | --- |
| Non-hypertuned | 0.89 |
| Hypertuned | 0.89 |

Hypertuned Model Outcome Precision

| Outcome | Precision Score |
| --- | --- |
| CANDIDATE | 0.86 |
| CONFIRMED | 0.75 |
| FALSE POSITIVE | 0.99 |

Author
