In [1]:
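# Update scikit-learn to prevent version mismatches.
# The PyPI package is named `scikit-learn`; `sklearn` is a deprecated placeholder
# package, which is why the output below reports version 0.0.
!pip install scikit-learn --upgrade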

Requirement already up-to-date: sklearn in c:\users\heain\appdata\local\continuum\anaconda3\envs\pythondata\lib\site-packages (0.0)


In [2]:
# install joblib. This will be used to save your model. 
# Restart your kernel after installing 
!pip install joblib



In [1]:
import warnings
warnings.simplefilter('ignore')

# %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import os

from sklearn.ensemble import RandomForestClassifier

# Read the CSV and Perform Basic Data Cleaning

In [2]:
df = pd.read_csv("data/exoplanet_data.csv")
# Drop the null columns where all values are null
df = df.dropna(axis='columns', how='all')
# Drop the null rows
df = df.dropna()
df.head()

,koi_disposition,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,CONFIRMED,0,0,0,0,54.418383,0.0002479,-0.0002479,162.51384,0.00352,...,-81,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
1,FALSE POSITIVE,0,1,0,0,19.89914,1.49e-05,-1.49e-05,175.850252,0.000581,...,-176,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436
2,FALSE POSITIVE,0,1,0,0,1.736952,2.63e-07,-2.63e-07,170.307565,0.000115,...,-174,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.28521,15.597
3,CONFIRMED,0,0,0,0,2.525592,3.76e-06,-3.76e-06,171.59555,0.00113,...,-211,4.438,0.07,-0.21,1.046,0.334,-0.133,288.75488,48.2262,15.509
4,CONFIRMED,0,0,0,0,4.134435,1.05e-05,-1.05e-05,172.97937,0.0019,...,-232,4.486,0.054,-0.229,0.972,0.315,-0.105,296.28613,48.22467,15.714


In [3]:
# drop the error columns
df = df.loc[:, ~df.columns.str.contains('_err')]
df.columns

Index(['koi_disposition', 'koi_fpflag_nt', 'koi_fpflag_ss', 'koi_fpflag_co',
       'koi_fpflag_ec', 'koi_period', 'koi_time0bk', 'koi_impact',
       'koi_duration', 'koi_depth', 'koi_prad', 'koi_teq', 'koi_insol',
       'koi_model_snr', 'koi_tce_plnt_num', 'koi_steff', 'koi_slogg',
       'koi_srad', 'ra', 'dec', 'koi_kepmag'],
      dtype='object')

### Exoplanet Archive Information
* 'koi_disposition': The pipeline flag that designates the most probable physical explanation of the KOI

### Project Disposition Columns
* 'koi_fpflag_nt': A KOI whose light curve is not consistent with that of a transiting planet.
* 'koi_fpflag_ss': A KOI that is observed to have a significant secondary event, transit shape, or out-of-eclipse variability, which indicates that the transit-like event is most likely caused by an eclipsing binary.
* 'koi_fpflag_co': The source of the signal is from a nearby star, as inferred by measuring the centroid location of the image both in and out of transit, or by the strength of the transit signal in the target's outer (halo) pixels as compared to the transit signal from the pixels in the optimal (or core) aperture.
* 'koi_fpflag_ec': The KOI shares the same period and epoch as another object and is judged to be the result of flux contamination in the aperture or electronic crosstalk.

### Transit Properties
Transit parameters delivered by the Kepler Project are typically best-fit parameters produced by a Mandel-Agol (2002) fit to a multi-quarter Kepler light curve, assuming a linear orbital ephemeris. Some of the parameters listed below are fit directly; others are derived from the best-fit parameters. Limb-darkening coefficients are fixed and pre-calculated from host star properties. Orbital Period, Transit Epoch, Planet-Star Radius Ratio, Planet-Star Distance over Star Radius and Impact Parameter are the free parameters in the fit. Matrix covariances are adopted as errors to the fit parameters; they therefore ignore the effects of correlation between the fit parameters and are likely to be underestimates.

See the links in the Purpose of KOI document for each activity table for more details on the fits for each delivery.

Scaled planetary parameters combine the dimensionless fit parameters with physical stellar parameters to produce planet characteristics in physical units.

Best-fit planetary transit parameters are typically normalized to the size of the host star. Physical planet parameters may be derived by scaling to the star's size and temperature. Transit parameters also depend weakly upon the limb-darkening coefficients which are derived from the stellar parameters (e.g., Claret and Bloemen 2011).

* 'koi_period': The interval between consecutive planetary transits.
* 'koi_time0bk': The time corresponding to the center of the first detected transit in Barycentric Julian Day (BJD) minus a constant offset of 2,454,833.0 days. The offset corresponds to 12:00 on Jan 1, 2009 UTC.
* 'koi_impact': The sky-projected distance between the center of the stellar disc and the center of the planet disc at conjunction, normalized by the stellar radius.
* 'koi_duration': The duration of the observed transits. Duration is measured from first contact between the planet and star until last contact. Contact times are typically computed from a best-fit model produced by a Mandel-Agol (2002) model fit to a multi-quarter Kepler light curve, assuming a linear orbital ephemeris.
* 'koi_depth': The fraction of stellar flux lost at the minimum of the planetary transit. Transit depths are typically computed from a best-fit model produced by a Mandel-Agol (2002) model fit to a multi-quarter Kepler light curve, assuming a linear orbital ephemeris.
* 'koi_prad': The radius of the planet. Planetary radius is the product of the planet star radius ratio and the stellar radius.
* 'koi_teq': Approximation for the temperature of the planet. 
* 'koi_insol': Insolation flux is another way to give the equilibrium temperature.
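As a small illustration of the `koi_time0bk` definition in the list above, the constant offset can be added back to recover a full Barycentric Julian Day. This is only a sketch: the helper name `bkjd_to_bjd` and the constant name are hypothetical, and the only fact taken from the description is the 2,454,833.0-day offset.

```python
# Offset (in days) that the column description says is subtracted from BJD.
BKJD_OFFSET_DAYS = 2454833.0


def bkjd_to_bjd(koi_time0bk: float) -> float:
    """Convert a koi_time0bk value back to a full Barycentric Julian Day."""
    return koi_time0bk + BKJD_OFFSET_DAYS


# koi_time0bk of the first row shown by df.head() earlier in the notebook.
first_transit_bjd = bkjd_to_bjd(162.51384)
print(f"{first_transit_bjd:.5f}")
```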

### Threshold-Crossing Event (TCE) Information
The Transiting Planet Search (TPS) module of the Kepler data analysis pipeline performs a detection test for planet transits in the multi-quarter, gap-filled flux time series. The TPS module detrends each quarterly PDC light curve to remove edge effects around data gaps and then combines the data segments together, filling gaps with interpolated data so as to condition the flux time series for a matched filter. The module applies an adaptive, wavelet-based matched filter (Jenkins 2002; Jenkins et al. 2010; Tenenbaum et al. 2012) to perform a joint characterization of observation noise and detection of transit-like features in the light curve.

The TPS module estimates the Power Spectral Density of the flux time series as a function of time. This provides coefficients for a whitening filter to accommodate non-stationary, non-white noise and yields Single Event Statistic (SES) time series components. These can be interpreted as measurements of the statistical significance of the presence of a transit of trial duration at each point in the time series.

Single Event Statistics are folded at each trial orbital period and the maximum Multiple Event Statistic (MES) is obtained over all trial periods and phases. The MES estimates the signal to noise ratio of the putative transit-like sequence against the measurement noise. The MES threshold for defining the sample of Threshold Crossing Events (TCEs) is provided within the Release Notes. For reference, a lower MES threshold of 7.1σ limits the number of false positives in the TCE sample due to statistical random noise to less than 1 over the primary mission (Jenkins, Caldwell and Borucki 2002).

* 'koi_model_snr': Transit depth normalized by the mean uncertainty in the flux during the transits.
* 'koi_tce_plnt_num': TCE Planet Number federated to the KOI.

### Stellar Parameters
Stellar effective temperature, surface gravity, metallicity, radius, mass, and age should comprise a consistent set. Associated error estimates are 1-σ uncertainties.

* 'koi_steff': The photospheric temperature of the star.
* 'koi_slogg': The base-10 logarithm of the acceleration due to gravity at the surface of the star.
* 'koi_srad': The photospheric radius of the star.
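Because `koi_slogg` is a base-10 logarithm, the surface gravity itself is recovered by exponentiating. The sketch below assumes the logarithm is taken of the gravity in cgs units (cm/s²); that unit is an assumption, since the description above does not state it, and the helper name is hypothetical.

```python
def surface_gravity_cgs(koi_slogg: float) -> float:
    """Invert the base-10 logarithm; result assumed to be in cm/s^2."""
    return 10.0 ** koi_slogg


# koi_slogg of the first row shown by df.head() earlier in the notebook.
g_first_row = surface_gravity_cgs(4.467)
print(f"{g_first_row:.0f} cm/s^2 (assuming cgs units)")
```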

### KIC Parameters
* 'ra': KIC Right Ascension
* 'dec': KIC Declination
* 'koi_kepmag': Kepler-band (mag)

# Select your features (columns)

In [4]:
# y_values = df[["koi_disposition"]]

# label_encoder = LabelEncoder()
# label_encoder.fit(y_values)
# encoded_y = label_encoder.transform(y_values)
# one_hot_y = to_categorical(encoded_y)
# one_hot_y

In [5]:
y = df[['koi_disposition']]

In [6]:
# Set features. This will also be used as your x values.
selected_features = df[['koi_fpflag_nt', 'koi_prad', 'koi_teq', 'koi_srad']]
feature_names = selected_features.columns
selected_features.head()

,koi_fpflag_nt,koi_prad,koi_teq,koi_srad
0,0,2.83,443,0.927
1,0,14.6,638,0.868
2,0,33.46,1395,0.791
3,0,2.75,1406,1.046
4,0,2.77,1160,0.972


# Create a Train Test Split

Use `koi_disposition` for the y values

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(selected_features, y, random_state=42)

In [8]:
X_train.head()

,koi_fpflag_nt,koi_prad,koi_teq,koi_srad
6122,0,1.24,1017,1.125
6370,0,0.86,1867,0.797
2879,1,3.21,989,0.963
107,0,2.25,696,0.779
29,0,12.21,1103,1.082
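The split above relies on the defaults of `train_test_split`, which hold out 25% of the rows for testing. The sketch below shows that default on synthetic data and also passes `stratify`, an option the notebook does not use, to keep class proportions equal in both subsets. The arrays and class labels are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features, imbalanced labels (60 "A", 40 "B").
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array(["A"] * 60 + ["B"] * 40)

# Default test_size is 0.25; stratify keeps the 60/40 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(dict(zip(*np.unique(y_te, return_counts=True))))
```

Stratifying may be worth trying here because the three dispositions are not equally frequent in the classification reports.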


# Pre-processing

Scale the selected features using `StandardScaler`. The scaler is fit on the training set only and then applied to both the training and test sets.

In [9]:
# Scale your data
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler().fit(X_train)
# y_scaler = StandardScaler().fit(y_train)

X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
# y_train_scaled = y_scaler.transform(y_train)
# y_test_scaled = y_scaler.transform(y_test)
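The pattern above (fit the scaler on the training data, then transform both subsets) can be checked on synthetic data. This is a minimal sketch with made-up arrays: after scaling, the training columns have mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train_demo)
X_train_demo_scaled = scaler.transform(X_train_demo)
X_test_demo_scaled = scaler.transform(X_test_demo)

print(X_train_demo_scaled.mean(axis=0).round(6))  # training means after scaling
print(X_train_demo_scaled.std(axis=0).round(6))   # training standard deviations
print(X_test_demo_scaled.mean(axis=0).round(3))   # test means, close to but not exactly 0
```

Tree-based models such as random forests split on thresholds and are generally insensitive to this kind of per-feature rescaling, so the step matters more for models like SVMs or neural networks.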

# Train the Model



In [10]:
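# Fit a random forest with default hyperparameters.
# y_train is a one-column DataFrame, so flatten it to the 1-D array scikit-learn expects.
rf = RandomForestClassifier()
rf = rf.fit(X_train_scaled, y_train.values.ravel())
rf.score(X_test_scaled, y_test)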

0.7070938215102975

In [11]:
print(f"Training Data Score: {rf.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {rf.score(X_test_scaled, y_test)}")

Training Data Score: 1.0
Testing Data Score: 0.7070938215102975
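A training score of 1.0 next to a testing score of about 0.71 suggests the default forest is memorizing the training set. One common way to get a steadier estimate than a single split is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` rather than the exoplanet table, so its numbers say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class problem with 4 features, standing in for the real data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Five held-out accuracy scores, one per fold.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5)

# Accuracy on the data the model was trained on, for comparison.
train_score = model.fit(X_demo, y_demo).score(X_demo, y_demo)

print(f"Train accuracy: {train_score:.3f}")
print(f"CV accuracy:    {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```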


In [12]:
importances = rf.feature_importances_
importances

array([0.12186254, 0.3682925 , 0.28580208, 0.22404288])

In [13]:
sorted(zip(rf.feature_importances_, feature_names), reverse=True)

[(0.3682925029925251, 'koi_prad'),
 (0.2858020794988387, 'koi_teq'),
 (0.2240428824840806, 'koi_srad'),
 (0.12186253502455567, 'koi_fpflag_nt')]
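The values in `feature_importances_` are impurity-based and computed from the training data, so they can favor continuous, high-cardinality features. A complementary check is permutation importance on held-out data, available as `sklearn.inspection.permutation_importance`. The sketch below runs both on synthetic data; it is an illustration of the technique, not a re-analysis of the exoplanet features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances (normalized to sum to 1).
impurity_importances = forest.feature_importances_

# Mean drop in held-out accuracy when each feature is shuffled.
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

for i in range(X_demo.shape[1]):
    print(f"feature {i}: impurity={impurity_importances[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f}")
```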

In [14]:
# Calculate classification report
from sklearn.metrics import classification_report
predictions = rf.predict(X_test_scaled)
print(classification_report(y_test, predictions))

                precision    recall  f1-score   support

     CANDIDATE       0.53      0.52      0.53       411
     CONFIRMED       0.62      0.62      0.62       484
FALSE POSITIVE       0.84      0.85      0.84       853

      accuracy                           0.71      1748
     macro avg       0.66      0.66      0.66      1748
  weighted avg       0.71      0.71      0.71      1748
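The report shows that CANDIDATE and CONFIRMED are the hardest classes, but not which classes they are confused with. A confusion matrix makes that visible. The sketch below uses a handful of made-up labels to show how rows (true class) and columns (predicted class) line up; in the notebook the same call could be applied to `y_test` and `predictions`.

```python
from sklearn.metrics import confusion_matrix

labels = ["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"]

# Made-up labels for illustration only.
y_true_demo = ["CANDIDATE", "CONFIRMED", "FALSE POSITIVE",
               "CANDIDATE", "CONFIRMED", "FALSE POSITIVE"]
y_pred_demo = ["CONFIRMED", "CONFIRMED", "FALSE POSITIVE",
               "CANDIDATE", "CANDIDATE", "FALSE POSITIVE"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)
```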



# Hyperparameter Tuning

Use `GridSearchCV` to tune the model's parameters

In [23]:
# Create the GridSearchCV model
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [200, 500],
    "min_samples_split": [2, 5, 10, 15, 100],
    "max_depth": [5, 8, 15, 25, 30],
    "min_samples_leaf": [1, 2, 5, 10],
}

rfc = RandomForestClassifier()
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5, verbose=2)

In [24]:
# Train the model with GridSearch
best_rfc_model = CV_rfc.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits
[CV] max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=200 
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=200, total=   0.5s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.4s remaining:    0.0s
[... remaining per-fit [CV] lines omitted ...]
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed: 21.5min finished
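The "200 candidates, totalling 1000 fits" in the log follows directly from the grid: 2 × 5 × 5 × 4 = 200 parameter combinations, each fit once per fold with `cv=5`. The sketch below confirms the count with `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` evaluates.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV cell above.
param_grid = {
    "n_estimators": [200, 500],
    "min_samples_split": [2, 5, 10, 15, 100],
    "max_depth": [5, 8, 15, 25, 30],
    "min_samples_leaf": [1, 2, 5, 10],
}

n_candidates = len(ParameterGrid(param_grid))  # every combination of the four lists
n_folds = 5                                    # cv=5 in the GridSearchCV call
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)
```

Since the search ran sequentially (`n_jobs=1`) for about 21.5 minutes, passing `n_jobs=-1` to `GridSearchCV` or sampling the grid with `RandomizedSearchCV` are options that might shorten the run.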


In [None]:
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
print(CV_rfc.best_estimator_)

# Test Model

In [26]:
predictions = CV_rfc.predict(X_test_scaled)
predicted = predictions[:20]
actual = y_test["koi_disposition"][:20].tolist()
pd.DataFrame({"Predictions": predicted, "Actual": actual}).reset_index(drop=True)

,Predictions,Actual
0,FALSE POSITIVE,FALSE POSITIVE
1,CONFIRMED,CANDIDATE
2,FALSE POSITIVE,FALSE POSITIVE
3,FALSE POSITIVE,FALSE POSITIVE
4,FALSE POSITIVE,FALSE POSITIVE
5,FALSE POSITIVE,CONFIRMED
6,FALSE POSITIVE,CANDIDATE
7,CONFIRMED,CANDIDATE
8,CONFIRMED,CANDIDATE
9,FALSE POSITIVE,FALSE POSITIVE


In [27]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

                precision    recall  f1-score   support

     CANDIDATE       0.70      0.04      0.07       411
     CONFIRMED       0.41      0.68      0.51       484
FALSE POSITIVE       0.69      0.75      0.72       853

      accuracy                           0.56      1748
     macro avg       0.60      0.49      0.43      1748
  weighted avg       0.61      0.56      0.51      1748



# Save the Model

In [29]:
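# save your model by updating "your_name" with your name
# and "your_model" with your model variable
# be sure to turn this in to BCS
# if joblib fails to import, try running the command to install in terminal/git-bash
import joblib
# The fitted grid search from this notebook is `best_rfc_model` (a random forest, not an SVC).
# joblib writes a pickle, so the .h5 extension is only a file name, not the HDF5 format.
filename = 'SVC_model.h5'
joblib.dump(best_rfc_model, filename)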

['SVC_model.h5']

In [30]:
loaded_model = joblib.load("SVC_model.h5")
loaded_model.score(X_train_scaled, y_train)

0.5668510394812131
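A quick way to confirm that a saved model survives the round trip is to compare predictions before and after loading. The sketch below does this with a small forest on synthetic data and a temporary file; the file name `demo_model.joblib` is a placeholder and is not the file saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory, then load the model back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

For reporting how well a reloaded model generalizes, scoring it on the held-out test set is usually more informative than scoring it on the training set.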

In [None]:
# helpful link: https://towardsdatascience.com/svm-hyper-parameter-tuning-using-gridsearchcv-49c0bc55ce29