# Kepler Exoplanet Search Results (NASA)

<img src="kepler.jpg">

### Context
The Kepler Space Observatory is a NASA-built satellite that was launched in 2009. The telescope is dedicated to searching for exoplanets in star systems other than our own, with the ultimate goal of possibly finding other habitable planets. The original mission ended in 2013 due to mechanical failures, but the telescope has nevertheless been functional since 2014 on a "K2" extended mission.

Kepler had verified 1284 new exoplanets as of May 2016. As of October 2017 there are over 3000 confirmed exoplanets total (using all detection methods, including ground-based ones). The telescope is still active and continues to collect new data on its extended mission.

### Content
This dataset is a cumulative record of all observed Kepler "objects of interest" — basically, all of the approximately 10,000 exoplanet candidates Kepler has taken observations on.

This dataset has an extensive data dictionary, which can be accessed here. Notable columns are:

**kepoi_name:** A KOI is a target identified by the Kepler Project that displays at least one transit-like sequence within Kepler time-series photometry that appears to be of astrophysical origin and initially consistent with a planetary transit hypothesis.


**kepler_name:** [These names] are intended to clearly indicate a class of objects that have been confirmed or validated as planets—a step up from the planet candidate designation.


**koi_disposition:** The disposition in the literature towards this exoplanet candidate. One of CANDIDATE, FALSE POSITIVE, NOT DISPOSITIONED or CONFIRMED.


**koi_pdisposition:** The disposition Kepler data analysis has towards this exoplanet candidate. One of FALSE POSITIVE, NOT DISPOSITIONED, and CANDIDATE.


**koi_score:** A value between 0 and 1 that indicates the confidence in the KOI disposition. For CANDIDATEs, a higher value indicates more confidence in its disposition, while for FALSE POSITIVEs, a higher value indicates less confidence in that disposition.

This dataset was published as-is by NASA.



### Inspiration
**How often are exoplanets confirmed in the existing literature disconfirmed by measurements from Kepler? How about the other way round?**


**What general characteristics about exoplanets (that we can find) can you derive from this dataset?**


**What exoplanets get assigned names in the literature? What is the distribution of confidence scores?**
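
The first question above can be approached with a cross-tabulation of the two disposition columns. The sketch below uses a handful of made-up rows (not taken from `cumulative.csv`) purely to show the shape of the computation; on the real data the same `pd.crosstab` call would be applied to `df` before any columns are dropped or re-encoded.

```python
import pandas as pd

# Toy rows, invented for illustration; the real values live in cumulative.csv.
toy = pd.DataFrame({
    "koi_disposition":  ["CONFIRMED", "CONFIRMED", "FALSE POSITIVE", "CANDIDATE", "CONFIRMED"],
    "koi_pdisposition": ["CANDIDATE", "FALSE POSITIVE", "FALSE POSITIVE", "CANDIDATE", "CANDIDATE"],
})

# Rows: disposition in the literature; columns: disposition from Kepler data analysis.
agreement = pd.crosstab(toy["koi_disposition"], toy["koi_pdisposition"])
print(agreement)

# Confirmed in the literature but flagged as a false positive by the Kepler analysis.
disconfirmed = agreement.loc["CONFIRMED", "FALSE POSITIVE"]
print(disconfirmed)
```

Reading the off-diagonal cells of the table gives the disagreement counts in both directions.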

In [28]:
import pandas as pd
pd.set_option("display.max_columns",100)



         # Visualization
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt


         # Classification Models
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, ExtraTreeRegressor

          # Modelling
from sklearn.model_selection import train_test_split

          # Testing
from sklearn.metrics import accuracy_score, confusion_matrix,classification_report


          # Deep Learning
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


import warnings 
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("cumulative.csv")

## EDA - Exploratory Data Analysis

- It is the process of understanding, exploring, and visualizing data. In this process, we clean the data, compute summary statistics, visualize the data, and interpret the results.

In [3]:
df.shape

(9564, 50)

In [4]:
#df = df.sample(4000)

In [5]:
df.head()

Unnamed: 0,rowid,kepid,kepoi_name,kepler_name,koi_disposition,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,koi_time0bk_err2,koi_impact,koi_impact_err1,koi_impact_err2,koi_duration,koi_duration_err1,koi_duration_err2,koi_depth,koi_depth_err1,koi_depth_err2,koi_prad,koi_prad_err1,koi_prad_err2,koi_teq,koi_teq_err1,koi_teq_err2,koi_insol,koi_insol_err1,koi_insol_err2,koi_model_snr,koi_tce_plnt_num,koi_tce_delivname,koi_steff,koi_steff_err1,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,1,10797460,K00752.01,Kepler-227 b,CONFIRMED,CANDIDATE,1.0,0,0,0,0,9.488036,2.775e-05,-2.775e-05,170.53875,0.00216,-0.00216,0.146,0.318,-0.146,2.9575,0.0819,-0.0819,615.8,19.5,-19.5,2.26,0.26,-0.15,793.0,,,93.59,29.45,-16.65,35.8,1.0,q1_q17_dr25_tce,5455.0,81.0,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
1,2,10797460,K00752.02,Kepler-227 c,CONFIRMED,CANDIDATE,0.969,0,0,0,0,54.418383,0.0002479,-0.0002479,162.51384,0.00352,-0.00352,0.586,0.059,-0.443,4.507,0.116,-0.116,874.8,35.5,-35.5,2.83,0.32,-0.19,443.0,,,9.11,2.87,-1.62,25.8,2.0,q1_q17_dr25_tce,5455.0,81.0,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
2,3,10811496,K00753.01,,FALSE POSITIVE,FALSE POSITIVE,0.0,0,1,0,0,19.89914,1.494e-05,-1.494e-05,175.850252,0.000581,-0.000581,0.969,5.126,-0.077,1.7822,0.0341,-0.0341,10829.0,171.0,-171.0,14.6,3.92,-1.31,638.0,,,39.3,31.04,-10.49,76.3,1.0,q1_q17_dr25_tce,5853.0,158.0,-176.0,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436
3,4,10848459,K00754.01,,FALSE POSITIVE,FALSE POSITIVE,0.0,0,1,0,0,1.736952,2.63e-07,-2.63e-07,170.307565,0.000115,-0.000115,1.276,0.115,-0.092,2.40641,0.00537,-0.00537,8079.2,12.8,-12.8,33.46,8.5,-2.83,1395.0,,,891.96,668.95,-230.35,505.6,1.0,q1_q17_dr25_tce,5805.0,157.0,-174.0,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.28521,15.597
4,5,10854555,K00755.01,Kepler-664 b,CONFIRMED,CANDIDATE,1.0,0,0,0,0,2.525592,3.761e-06,-3.761e-06,171.59555,0.00113,-0.00113,0.701,0.235,-0.478,1.6545,0.042,-0.042,603.3,16.9,-16.9,2.75,0.88,-0.35,1406.0,,,926.16,874.33,-314.24,40.9,1.0,q1_q17_dr25_tce,6031.0,169.0,-211.0,4.438,0.07,-0.21,1.046,0.334,-0.133,288.75488,48.2262,15.509


In [6]:
df.columns

Index(['rowid', 'kepid', 'kepoi_name', 'kepler_name', 'koi_disposition',
       'koi_pdisposition', 'koi_score', 'koi_fpflag_nt', 'koi_fpflag_ss',
       'koi_fpflag_co', 'koi_fpflag_ec', 'koi_period', 'koi_period_err1',
       'koi_period_err2', 'koi_time0bk', 'koi_time0bk_err1',
       'koi_time0bk_err2', 'koi_impact', 'koi_impact_err1', 'koi_impact_err2',
       'koi_duration', 'koi_duration_err1', 'koi_duration_err2', 'koi_depth',
       'koi_depth_err1', 'koi_depth_err2', 'koi_prad', 'koi_prad_err1',
       'koi_prad_err2', 'koi_teq', 'koi_teq_err1', 'koi_teq_err2', 'koi_insol',
       'koi_insol_err1', 'koi_insol_err2', 'koi_model_snr', 'koi_tce_plnt_num',
       'koi_tce_delivname', 'koi_steff', 'koi_steff_err1', 'koi_steff_err2',
       'koi_slogg', 'koi_slogg_err1', 'koi_slogg_err2', 'koi_srad',
       'koi_srad_err1', 'koi_srad_err2', 'ra', 'dec', 'koi_kepmag'],
      dtype='object')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9564 entries, 0 to 9563
Data columns (total 50 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rowid              9564 non-null   int64  
 1   kepid              9564 non-null   int64  
 2   kepoi_name         9564 non-null   object 
 3   kepler_name        2294 non-null   object 
 4   koi_disposition    9564 non-null   object 
 5   koi_pdisposition   9564 non-null   object 
 6   koi_score          8054 non-null   float64
 7   koi_fpflag_nt      9564 non-null   int64  
 8   koi_fpflag_ss      9564 non-null   int64  
 9   koi_fpflag_co      9564 non-null   int64  
 10  koi_fpflag_ec      9564 non-null   int64  
 11  koi_period         9564 non-null   float64
 12  koi_period_err1    9110 non-null   float64
 13  koi_period_err2    9110 non-null   float64
 14  koi_time0bk        9564 non-null   float64
 15  koi_time0bk_err1   9110 non-null   float64
 16  koi_time0bk_err2   9110 

In [8]:
# Some columns appear to be empty; count the missing values per column

df.isnull().sum()

rowid                   0
kepid                   0
kepoi_name              0
kepler_name          7270
koi_disposition         0
koi_pdisposition        0
koi_score            1510
koi_fpflag_nt           0
koi_fpflag_ss           0
koi_fpflag_co           0
koi_fpflag_ec           0
koi_period              0
koi_period_err1       454
koi_period_err2       454
koi_time0bk             0
koi_time0bk_err1      454
koi_time0bk_err2      454
koi_impact            363
koi_impact_err1       454
koi_impact_err2       454
koi_duration            0
koi_duration_err1     454
koi_duration_err2     454
koi_depth             363
koi_depth_err1        454
koi_depth_err2        454
koi_prad              363
koi_prad_err1         363
koi_prad_err2         363
koi_teq               363
koi_teq_err1         9564
koi_teq_err2         9564
koi_insol             321
koi_insol_err1        321
koi_insol_err2        321
koi_model_snr         363
koi_tce_plnt_num      346
koi_tce_delivname     346
koi_steff   
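
The raw null counts above are easier to act on as fractions. The following is a minimal sketch on a made-up three-column frame (column names mirror the dataset, values are invented) that ranks columns by missing fraction and lists those that are entirely empty, as `koi_teq_err1` and `koi_teq_err2` are in the output above.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration; only the column names come from the dataset.
toy = pd.DataFrame({
    "koi_period": [9.49, 54.42, 19.90, 1.74],
    "koi_score": [1.0, np.nan, 0.0, 0.0],
    "koi_teq_err1": [np.nan, np.nan, np.nan, np.nan],
})

# The mean of a boolean mask is the fraction of missing values per column.
missing_frac = toy.isnull().mean()
print(missing_frac.sort_values(ascending=False))

# Columns with no usable values at all are candidates for dropping.
fully_empty = missing_frac[missing_frac == 1.0].index.tolist()
print(fully_empty)
```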

## Feature Engineering

- It is the process of creating new features from existing features in the data set. Existing features are transformed or combined to make the data more meaningful and to improve model performance.

In [9]:
df["koi_disposition"].value_counts()

FALSE POSITIVE    5023
CONFIRMED         2293
CANDIDATE         2248
Name: koi_disposition, dtype: int64

In [10]:
df["koi_pdisposition"].value_counts()

FALSE POSITIVE    5068
CANDIDATE         4496
Name: koi_pdisposition, dtype: int64

 - The two disposition columns carry closely related information. `koi_pdisposition` has only two values, so it is mapped to numbers: FALSE POSITIVE becomes 0 and CANDIDATE becomes 1.

In [11]:
d = {"FALSE POSITIVE":0,"CANDIDATE":1}
df["koi_pdisposition"] = df["koi_pdisposition"].map(d)

In [12]:
df["koi_pdisposition"]

0       1
1       1
2       0
3       0
4       1
       ..
9559    0
9560    0
9561    1
9562    0
9563    0
Name: koi_pdisposition, Length: 9564, dtype: int64
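
One thing to watch with `Series.map` and a dictionary: any label missing from the dictionary silently becomes `NaN`. The data dictionary lists NOT DISPOSITIONED as a possible value, although it does not appear in the counts above. A small sketch on an invented series shows how unmapped labels could be detected:

```python
import pandas as pd

# Invented labels; NOT DISPOSITIONED is included only to trigger the unmapped case.
labels = pd.Series(["CANDIDATE", "FALSE POSITIVE", "NOT DISPOSITIONED"])
mapping = {"FALSE POSITIVE": 0, "CANDIDATE": 1}

encoded = labels.map(mapping)  # labels absent from the dict become NaN

# Collect the original labels that had no entry in the mapping.
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

If `unmapped` is non-empty, the mapping needs another entry or those rows need separate handling.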

In [13]:
df = df.drop(["koi_teq_err1","koi_teq_err2","kepler_name","koi_disposition"],axis=1)

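- Dropped `koi_teq_err1` and `koi_teq_err2` (entirely null), `kepler_name` (mostly null), and `koi_disposition` (closely related to the target column).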

In [14]:
df["koi_score"] = df["koi_score"].fillna(df["koi_score"].mean())

In [15]:
float_ch = ['koi_score','koi_period', 'koi_period_err1', 'koi_period_err2','koi_time0bk', 'koi_time0bk_err1', 'koi_time0bk_err2',
'koi_impact','koi_impact_err1', 'koi_impact_err2', 'koi_duration','koi_duration_err1', 'koi_duration_err2', 'koi_depth',
'koi_depth_err1','koi_depth_err2', 'koi_prad', 'koi_prad_err1', 'koi_prad_err2','koi_teq', 'koi_insol', 'koi_insol_err1',
'koi_insol_err2','koi_model_snr', 'koi_tce_plnt_num','koi_steff','koi_steff_err1', 'koi_steff_err2', 'koi_slogg',
'koi_slogg_err1','koi_slogg_err2', 'koi_srad', 'koi_srad_err1', 'koi_srad_err2', 'ra','koi_kepmag']

int_ch = ['koi_fpflag_nt', 'koi_fpflag_ss', 'koi_fpflag_co','koi_fpflag_ec']

In [16]:
for i in float_ch:
    df[i] = df[i].fillna(df[i].mean())
for i in int_ch:
    df[i] = df[i].fillna(df[i].mean())
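
The two loops above can also be expressed as a single call, since `DataFrame.fillna` accepts a Series of per-column fill values. The sketch below uses an invented frame to show this, and also why a `dropna()` is still needed afterwards: non-numeric columns such as `koi_tce_delivname` have no mean and stay unfilled.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration.
toy = pd.DataFrame({
    "koi_impact": [0.2, np.nan, 0.4],
    "koi_depth": [100.0, 300.0, np.nan],
    "koi_tce_delivname": ["q1_q17_dr25_tce", None, "q1_q17_dr25_tce"],
})

# Column means over numeric columns only; used as per-column fill values.
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled)

# The string column still has a gap, so that row is removed here.
cleaned = filled.dropna()
print(len(cleaned))
```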

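- Filled missing values in the numeric columns with the column mean. Rows that still contain missing values are dropped in the next step.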

In [17]:
df = df.dropna()

In [18]:
df.isnull().sum()

rowid                0
kepid                0
kepoi_name           0
koi_pdisposition     0
koi_score            0
koi_fpflag_nt        0
koi_fpflag_ss        0
koi_fpflag_co        0
koi_fpflag_ec        0
koi_period           0
koi_period_err1      0
koi_period_err2      0
koi_time0bk          0
koi_time0bk_err1     0
koi_time0bk_err2     0
koi_impact           0
koi_impact_err1      0
koi_impact_err2      0
koi_duration         0
koi_duration_err1    0
koi_duration_err2    0
koi_depth            0
koi_depth_err1       0
koi_depth_err2       0
koi_prad             0
koi_prad_err1        0
koi_prad_err2        0
koi_teq              0
koi_insol            0
koi_insol_err1       0
koi_insol_err2       0
koi_model_snr        0
koi_tce_plnt_num     0
koi_tce_delivname    0
koi_steff            0
koi_steff_err1       0
koi_steff_err2       0
koi_slogg            0
koi_slogg_err1       0
koi_slogg_err2       0
koi_srad             0
koi_srad_err1        0
koi_srad_err2        0
ra         

## Modelling With Sklearn

 - The `koi_pdisposition` column is used as the target.

 - An important preprocessing step is the **get_dummies()** method. It converts columns of type object into indicator (one-hot) columns, so no ordering is implied between categories.

In [19]:
x = df.drop("koi_pdisposition",axis=1)
y = df[["koi_pdisposition"]]
x = pd.get_dummies(x,drop_first=True) 
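
To make the effect of `get_dummies` concrete, here is a small sketch on an invented frame. It also illustrates a side effect worth knowing: an identifier-like column such as `kepoi_name`, which appears to be unique per row, produces roughly one indicator column per row. Dropping it before encoding is shown as one possible option, not as what this notebook does.

```python
import pandas as pd

# Toy frame, invented for illustration; values are not taken from cumulative.csv.
toy = pd.DataFrame({
    "kepoi_name": ["K00752.01", "K00752.02", "K00753.01", "K00754.01"],
    "koi_tce_delivname": ["delivery_a", "delivery_a", "delivery_b", "delivery_b"],
    "koi_period": [9.49, 54.42, 19.90, 1.74],
})

# drop_first=True drops one level per categorical column to avoid redundancy.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.shape)  # 1 numeric + (4 - 1) name columns + (2 - 1) delivery column

# Assumption for illustration: treat the identifier as non-predictive and drop it.
slim = pd.get_dummies(toy.drop(columns=["kepoi_name"]), drop_first=True)
print(slim.shape)
```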

In [23]:
def algo_test(x,y):
    gauss = GaussianNB()
    kneClas = KNeighborsClassifier()
    svc = SVC()
    bernoulli = BernoulliNB()
    randForestClas= RandomForestClassifier()
    gradBoodClas = GradientBoostingClassifier()
    logReg = LogisticRegression()
    decTreeClas = DecisionTreeClassifier()
    
    x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)
    
    algos = [gauss,kneClas,svc,bernoulli,randForestClas,gradBoodClas,logReg,decTreeClas]
    algo_names = ["GaussianNB","KNeighborsClassifier","SVC","BernoulliNB","RandomForestClassifier","GradientBoostingClassifier","LogisticRegression","DecisionTreeClassifier"]
    ac_sc = []
    con_mat = []
    clas_rep = []
    
    result = pd.DataFrame(columns = ["Accuracy_Score","Confusion_Matrix","Classification_Report"],index = algo_names)
    
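    for algo in algos:
        algo.fit(x_train,y_train)
        # Predict once and reuse the result for all three metrics.
        y_pred = algo.predict(x_test)
        # Note: scikit-learn metrics expect (y_true, y_pred). Predictions are passed first
        # here, so confusion-matrix rows are predictions and precision/recall are swapped.
        ac_sc.append(accuracy_score(y_pred,y_test))
        con_mat.append(confusion_matrix(y_pred,y_test))
        clas_rep.append(classification_report(y_pred,y_test))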
        
    result.Accuracy_Score =ac_sc
    result.Confusion_Matrix = con_mat
    result.Classification_Report = clas_rep
    
    return result.sort_values("Accuracy_Score", ascending=False)

In [24]:
# Testing classification models

algo_test(x,y)

Unnamed: 0,Accuracy_Score,Confusion_Matrix,Classification_Report
DecisionTreeClassifier,0.996746,"[[961, 5], [1, 877]]",precision recall f1-score ...
BernoulliNB,0.994577,"[[961, 9], [1, 873]]",precision recall f1-score ...
GradientBoostingClassifier,0.991866,"[[952, 5], [10, 877]]",precision recall f1-score ...
RandomForestClassifier,0.97397,"[[956, 42], [6, 840]]",precision recall f1-score ...
LogisticRegression,0.731562,"[[707, 240], [255, 642]]",precision recall f1-score ...
KNeighborsClassifier,0.687093,"[[606, 221], [356, 661]]",precision recall f1-score ...
GaussianNB,0.590564,"[[214, 7], [748, 875]]",precision recall f1-score ...
SVC,0.565618,"[[592, 431], [370, 451]]",precision recall f1-score ...
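
The lower scores for SVC, KNeighborsClassifier, and LogisticRegression may be partly due to unscaled features, since these models are generally sensitive to feature scale while tree-based models are not. The following is a minimal sketch on synthetic data (generated with `make_classification`, not the Kepler data) showing how a scaler could be combined with a classifier in a pipeline so that scaling is fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is inflated to mimic mixed scales.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X[:, 0] *= 1e4

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fitted inside the pipeline, using the training data only.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_train, y_train)

acc = scaled_svc.score(X_test, y_test)  # mean accuracy on the held-out split
print(acc)
```

Whether scaling changes the ranking in the table above would need to be checked on the actual data.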


## With Deep Learning - Tensorflow/Keras

In [25]:
model = Sequential()
model.add(Dense(1024, activation = "relu"))
model.add(Dense(512, activation = "relu"))
model.add(Dense(256, activation = "relu"))
model.add(Dense(128, activation = "relu"))
model.add(Dense(64, activation = "relu"))
model.add(Dense(32, activation = "relu"))
model.add(Dense(16, activation = "relu"))
model.add(Dense(8, activation = "relu"))
model.add(Dense(1, activation = "sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(x,y, epochs=100, batch_size=128,verbose=1)

Epoch 1/100
...
Epoch 78

<keras.callbacks.History at 0x198979498a0>

In [26]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_9 (Dense)             (None, 1024)              9485312   
                                                                 
 dense_10 (Dense)            (None, 512)               524800    
                                                                 
 dense_11 (Dense)            (None, 256)               131328    
                                                                 
 dense_12 (Dense)            (None, 128)               32896     
                                                                 
 dense_13 (Dense)            (None, 64)                8256      
                                                                 
 dense_14 (Dense)            (None, 32)                2080      
                                                                 
 dense_15 (Dense)            (None, 16)               
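
The parameter counts in the summary can be checked by hand: a fully connected layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases. The sketch below reproduces the counts for the hidden layers and works backwards from the first layer to the input width. The helper `dense_params` is a name introduced here for illustration.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

# Units per layer, as defined in the model above.
units = [1024, 512, 256, 128, 64, 32, 16, 8, 1]

# Each layer after the first takes the previous layer's width as its input size.
hidden = [dense_params(a, b) for a, b in zip(units, units[1:])]
print(hidden)

# The first layer's count in the summary implies the number of input features.
first_layer_params = 9485312
n_features = first_layer_params // units[0] - 1
print(n_features)  # 9262
```

An input width of 9262 is close to the number of rows, which suggests that most input columns come from one-hot encoding `kepoi_name`.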

In [27]:
model.evaluate(x,y)



[0.6923328042030334, 0.5202863812446594]
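
The reported loss is close to ln 2 ≈ 0.693, which is the binary cross-entropy of a model that outputs 0.5 for every sample. This is consistent with (though not proof of) the network producing near-constant predictions. Note also that the model is evaluated on the same data it was trained on. The arithmetic can be checked with a short sketch; `binary_crossentropy` below is a plain-Python helper written for illustration, not the Keras function.

```python
import math

def binary_crossentropy(y_true: int, p: float) -> float:
    """Per-sample binary cross-entropy for a predicted probability p."""
    eps = 1e-7  # clip to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# A prediction of 0.5 costs ln 2 regardless of the true label.
loss_at_half = binary_crossentropy(1, 0.5)
print(loss_at_half, math.log(2))
```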