# Identifying Habitable Exoplanets using Machine Learning
## Data Science Capstone Project of BBM467, 2022 Fall

#### Hikmet Güner, Deniz Erkin Kasaplı
#### 21946179, 21946328

This project clusters exoplanets with machine learning to gain insight into potentially habitable planets outside our solar system.
Because even the closest planets outside the solar system are unreachable, no planet can be classified as habitable or not with certainty. However, some key properties are considered necessary for habitability. The aim of this project is to learn how grouping planets, without computing any habitability index, affects filtering based on habitability. The dataset originates from NASA's Exoplanet Archive and is clustered for further analysis. The results could be used to filter out false positives, saving the time and resources that would otherwise be spent on those planets.


## Table of Contents

[Problem](#problem)   
[Data Understanding](#data_understanding)   
[Data Preparation](#data_preparation)   
[Modeling](#modeling)   
[Evaluation](#evaluation)   
[References](#references)   


## Problem <a class="anchor" id="problem"></a>


Because the tools used to identify exoplanets are biased, it seems reasonable to begin the analysis without any prior knowledge of the planets and let the algorithm group planets that correlate. These groupings can then be examined by locating key planets that are known to be habitable or close to habitable, such as Earth and Mars. However, the results should not be taken at face value; they serve only to filter out unwanted planets or false positives as far as possible.

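The idea of grouping planets first and locating a reference planet afterwards can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data is synthetic, the two features (radius and mass in Earth units) are hypothetical stand-ins for whatever planet attributes are kept, and `n_clusters=2` is chosen only because the toy data has two obvious groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for planet features: [radius (Earth radii), mass (Earth masses)].
small_rocky = rng.normal([1.0, 1.0], [0.2, 0.3], size=(100, 2))
gas_giants = rng.normal([11.0, 300.0], [1.0, 30.0], size=(100, 2))
planets = np.vstack([small_rocky, gas_giants])

# Scale every feature to [0, 1] so that mass does not dominate the distance.
scaler = MinMaxScaler()
planets_scaled = scaler.fit_transform(planets)

# k=2 only because the toy data has two groups; not a tuned value.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(planets_scaled)

# Reference planet: Earth has 1 Earth radius and 1 Earth mass by definition.
earth = np.array([[1.0, 1.0]])
earth_cluster = kmeans.predict(scaler.transform(earth))[0]

# Planets sharing Earth's cluster are the ones kept for closer study.
candidates = planets[kmeans.labels_ == earth_cluster]
print(len(candidates), "of", len(planets), "planets share Earth's cluster")
```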
Imports required for the notebook

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

from sklearn.cluster import KMeans
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.preprocessing import normalize
from scipy.spatial.distance import cdist

Reading the data

In [3]:
df = pd.read_csv("exoplanets.csv",header=0, index_col=0)

## Data Understanding<a class="anchor" id="data_understanding"></a>

The dataset originates from NASA's Exoplanet Archive; the Kaggle user "SATHYANARAYAN RAO" compiled it into a comma-separated values (.csv) file.

The dataset is available <a href="https://www.kaggle.com/datasets/sathyanarayanrao89/nasa-exoplanetary-system">here</a> (<a href="exoplanets.csv">.csv format</a>).

The data contains many fields used only for labelling, which are outside the interest of the project. A lot of information about the host star of each system is also present and could support further analysis; however, because the project focuses solely on the attributes of the planets, these columns are of no use here. There are also some columns for which no suitable definition could be found.

The column analysis spreadsheet is available <a href="columns.xlsx">here</a>.

In [4]:
df.head(10)

| Unnamed: 0 | pl_name | hostname | default_flag | sy_snum | sy_pnum | discoverymethod | disc_year | disc_facility | soltype | pl_controv_flag | ... | sy_vmagerr2 | sy_kmag | sy_kmagerr1 | sy_kmagerr2 | sy_gaiamag | sy_gaiamagerr1 | sy_gaiamagerr2 | rowupdate | pl_pubdate | releasedate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11 Com b | 11 Com | 1 | 2 | 1 | Radial Velocity | 2007 | Xinglong Station | Published Confirmed | 0 | ... | -0.023 | 2.282 | 0.346 | -0.346 | 4.44038 | 0.003848 | -0.003848 | 2014-05-14 | 2008-01 | 2014-05-14 |
| 1 | 11 Com b | 11 Com | 0 | 2 | 1 | Radial Velocity | 2007 | Xinglong Station | Published Confirmed | 0 | ... | -0.023 | 2.282 | 0.346 | -0.346 | 4.44038 | 0.003848 | -0.003848 | 2014-07-23 | 2011-08 | 2014-07-23 |
| 2 | 11 UMi b | 11 UMi | 0 | 1 | 1 | Radial Velocity | 2009 | Thueringer Landessternwarte Tautenburg | Published Confirmed | 0 | ... | -0.005 | 1.939 | 0.27 | -0.27 | 4.56216 | 0.003903 | -0.003903 | 2018-04-25 | 2011-08 | 2014-07-23 |
| 3 | 11 UMi b | 11 UMi | 1 | 1 | 1 | Radial Velocity | 2009 | Thueringer Landessternwarte Tautenburg | Published Confirmed | 0 | ... | -0.005 | 1.939 | 0.27 | -0.27 | 4.56216 | 0.003903 | -0.003903 | 2018-09-04 | 2017-03 | 2018-09-06 |
| 4 | 11 UMi b | 11 UMi | 0 | 1 | 1 | Radial Velocity | 2009 | Thueringer Landessternwarte Tautenburg | Published Confirmed | 0 | ... | -0.005 | 1.939 | 0.27 | -0.27 | 4.56216 | 0.003903 | -0.003903 | 2018-04-25 | 2009-10 | 2014-05-14 |
| 5 | 14 And b | 14 And | 0 | 1 | 1 | Radial Velocity | 2008 | Okayama Astrophysical Observatory | Published Confirmed | 0 | ... | -0.023 | 2.331 | 0.24 | -0.24 | 4.91781 | 0.002826 | -0.002826 | 2014-07-23 | 2011-08 | 2014-07-23 |
| 6 | 14 And b | 14 And | 1 | 1 | 1 | Radial Velocity | 2008 | Okayama Astrophysical Observatory | Published Confirmed | 0 | ... | -0.023 | 2.331 | 0.24 | -0.24 | 4.91781 | 0.002826 | -0.002826 | 2014-05-14 | 2008-12 | 2014-05-14 |
| 7 | 14 Her b | 14 Her | 0 | 1 | 2 | Radial Velocity | 2002 | W. M. Keck Observatory | Published Confirmed | 0 | ... | -0.023 | 4.714 | 0.016 | -0.016 | 6.383 | 0.000351 | -0.000351 | 2021-09-20 | 2021-05 | 2021-09-20 |
| 8 | 14 Her b | 14 Her | 0 | 1 | 2 | Radial Velocity | 2002 | W. M. Keck Observatory | Published Confirmed | 0 | ... | -0.023 | 4.714 | 0.016 | -0.016 | 6.383 | 0.000351 | -0.000351 | 2018-04-25 | 2003-01 | 2014-08-21 |
| 9 | 14 Her b | 14 Her | 0 | 1 | 2 | Radial Velocity | 2002 | W. M. Keck Observatory | Published Confirmed | 0 | ... | -0.023 | 4.714 | 0.016 | -0.016 | 6.383 | 0.000351 | -0.000351 | 2018-04-25 | 2008-04 | 2014-08-21 |

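The preview shows several rows per planet (for example, `11 UMi b` appears three times), and `default_flag` is set to 1 on at most one of them. A minimal sketch of reducing such a frame to one row per planet follows. It assumes that `default_flag == 1` marks the preferred parameter set for a planet and that one row per planet is what the clustering needs; neither is stated in this notebook. The toy frame is hypothetical and only mirrors the first rows of the preview.

```python
import pandas as pd

# Toy frame mirroring the first five preview rows (three columns only).
toy = pd.DataFrame({
    "pl_name": ["11 Com b", "11 Com b", "11 UMi b", "11 UMi b", "11 UMi b"],
    "default_flag": [1, 0, 0, 1, 0],
    "disc_year": [2007, 2007, 2009, 2009, 2009],
})

# Assumption: default_flag == 1 marks the preferred solution for each planet.
one_per_planet = toy[toy["default_flag"] == 1].drop_duplicates(subset="pl_name")

print(one_per_planet["pl_name"].tolist())
```

Planets that have no row with `default_flag == 1` would disappear under this filter, so the row counts before and after are worth checking on the real data.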

## Data Preparation<a class="anchor" id="data_preparation"></a>


First, the columns that are not useful for the model are dropped; see [Data Understanding](#data_understanding) for details about the columns.
Then the null values in the remaining data are analysed. KNN imputation appears to fill the missing data consistently, judging by the minimal changes in the column means and standard deviations before and after imputation.

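The null analysis and the before/after check on the imputation can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual preparation code: the values are synthetic, the three column names (`pl_rade`, `pl_bmasse`, `pl_orbper`) are hypothetical picks in the archive's naming style, and `n_neighbors=5` is simply scikit-learn's default.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Hypothetical planet features with synthetic values.
features = pd.DataFrame({
    "pl_rade": rng.normal(2.0, 0.5, 200),     # planet radius (Earth radii)
    "pl_bmasse": rng.normal(8.0, 2.0, 200),   # planet mass (Earth masses)
    "pl_orbper": rng.normal(30.0, 5.0, 200),  # orbital period (days)
})

# Knock out roughly 10% of the values to simulate missing measurements.
mask = rng.random(features.shape) < 0.10
features_missing = features.mask(mask)

# Share of nulls per column, the kind of summary a null analysis starts from.
null_share = features_missing.isna().mean()
print(null_share.round(3))

# n_neighbors=5 is the library default, used here only as a placeholder.
imputer = KNNImputer(n_neighbors=5)
imputed = pd.DataFrame(
    imputer.fit_transform(features_missing),
    columns=features_missing.columns,
    index=features_missing.index,
)

# Compare column means and standard deviations before and after imputation.
comparison = pd.DataFrame({
    "mean_before": features_missing.mean(),
    "mean_after": imputed.mean(),
    "std_before": features_missing.std(),
    "std_after": imputed.std(),
})
print(comparison.round(3))
```

Small differences between the `*_before` and `*_after` columns are the signal described above; a large shift in either statistic would suggest the imputation is distorting that column.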
## Modeling<a class="anchor" id="modeling"></a>

Which model will be used? Why? What parameters?

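The notebook's imports (`KMeans`, `MinMaxScaler`) and its references to the elbow method suggest K-means with an elbow-based choice of cluster count. The sketch below shows that technique on synthetic data only. The data, the range of `k`, and the use of `inertia_` as the distortion measure are assumptions for illustration, not the project's actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic data with three well-separated groups, standing in for planet features.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 8.0]])
X = np.vstack([rng.normal(c, 0.5, size=(60, 2)) for c in centers])
X_scaled = MinMaxScaler().fit_transform(X)

# Fit K-means for a range of k and record the inertia
# (sum of squared distances from each sample to its nearest centroid).
ks = range(1, 9)
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
    for k in ks
]

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.3f}")
```

The "elbow" is the value of `k` after which the inertia stops dropping sharply; plotting `inertias` against `ks` with `matplotlib` makes it easier to spot.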
## Evaluation<a class="anchor" id="evaluation"></a>

Evaluate your model. Provide results, tables, charts, etc.

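One common way to evaluate a clustering without labels is the silhouette score, which is available through the `metrics` module the notebook already imports. The sketch below is a hedged illustration on synthetic data, not the project's evaluation; the data and `n_clusters=2` are assumptions.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Two compact synthetic groups standing in for clustered planet features.
X = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(50, 2)),
    rng.normal([4.0, 4.0], 0.3, size=(50, 2)),
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The silhouette score lies in [-1, 1]; values near 1 indicate compact,
# well-separated clusters.
score = metrics.silhouette_score(X, labels)

# Cluster sizes reveal whether one group swallows nearly all samples.
sizes = np.bincount(labels)
print(f"silhouette={score:.3f}, cluster sizes={sizes.tolist()}")
```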
## References<a class="anchor" id="references"></a>

<a href="https://urc.ucdavis.edu/how-write-abstract"> Undergraduate Research Center, How to Write an Abstract?</a> <br>
<a href="https://app.datacamp.com/learn/courses/unsupervised-learning-in-python"> Datacamp, Unsupervised Machine Learning in Python </a> <br>
<a href="https://educationalresearchtechniques.com/2018/10/17/kmeans-clustering-in-python/"> How to utilize the Elbow Method for Finding Optimal Cluster Size </a> <br>
<a href="https://chartio.com/learn/charts/box-plot-complete-guide/"> A Complete Guide to Box Charts </a> <br>
<a href="http://astroweb.case.edu/ssm/ASTR620/mags.html"> Astronomical Magnitude Systems </a> <br>
<a href="https://en.wikipedia.org/wiki/Metallicity"> Metallicity, Wikipedia </a> <br>


**Disclaimer!** <font color='grey'>This notebook was prepared by Hikmet Güner and Deniz Erkin Kasaplı as a term project for the *BBM467 - Data Intensive Applications* class. The notebook is available for educational purposes only. There is no guarantee on the correctness of the content provided as it is a student work.

If you think there is any copyright violation, please let us [know](https://forms.gle/BNNRB2kR8ZHVEREq8). 
</font>