# Introduction

In [None]:
"""
What? Classification with XGBoost on the exoplanets dataset

Reference: Corey Wade, “Hands-On Gradient Boosting with XGBoost and scikit-learn”
https://github.com/PacktPublishing/Hands-On-Gradient-Boosting-with-XGBoost-and-Scikit-learn/blob/master/Chapter04/exoplanets.csv
https://github.com/PacktPublishing/Hands-On-Gradient-Boosting-with-XGBoost-and-Scikit-learn
"""

# Import modules

In [23]:
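import time
import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')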

# Load dataset

In [None]:
"""
In this section, we look for exoplanets in light measured over time. The dataset has 5,087 rows and 3,198 columns
(one label column plus 3,197 light-flux readings taken at different times). Multiplying rows by columns gives
roughly 16.3 million data points. With a baseline of 100 trees, that is on the order of 1.6 billion data points to
process when building a model, which makes this a very good example of how valuable XGBoost is.

The dataset contains information about the light of stars. Each row is an individual star, and the flux columns
record its light pattern over time. In addition, the LABEL column is 2 if the star hosts an exoplanet and 1
otherwise.
"""

In [25]:
df = pd.read_csv('../DATASETS/exoplanets.csv')
df.head(3)

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,93.85,83.81,20.1,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,...,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,2,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,...,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.7,6.46,16.0,19.93
2,2,532.64,535.92,513.73,496.92,456.45,466.0,464.5,486.39,436.56,...,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.8,-28.91,-70.02,-96.67


# Data exploration, cleaning and engineering

In [None]:
"""
As the output below shows, 3,197 columns are floats and 1 column is an int, so every column is numerical.
"""

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5087 entries, 0 to 5086
Columns: 3198 entries, LABEL to FLUX.3197
dtypes: float64(3197), int64(1)
memory usage: 124.1 MB
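
The shape reported by `df.info()` makes the dataset's size easy to check by hand. The sketch below hard-codes that shape (5,087 rows, 3,198 columns including `LABEL`) and treats "values processed" as rows × columns × trees. That product is a rough illustration and an assumption, not an exact cost model of either library.

```python
# Shape taken from the df.info() output above.
n_rows, n_cols = 5087, 3198
n_trees = 100  # baseline number of trees used later in this notebook

# Total number of cells in the table.
values_in_table = n_rows * n_cols

# Rough estimate (an assumption): every tree looks at every value once.
values_over_all_trees = values_in_table * n_trees

print(f"Values in the table: {values_in_table:,}")
print(f"Values across {n_trees} trees: {values_over_all_trees:,}")
```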


In [27]:
# Count missing values across the whole DataFrame
df.isnull().sum().sum()

0

In [None]:
"""
Exoplanets are rare. The target column, which records whether a star hosts an exoplanet, has very few
positive cases, so the dataset is imbalanced. Imbalanced datasets require extra precautions.
"""

In [28]:
# Split data into X and y
X = df.iloc[:,1:]
y = df.iloc[:,0]
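
The labels are kept as 1 and 2 here, exactly as stored in the CSV. Newer XGBoost releases may reject class labels that do not start at 0, so if `fit` raises an error about invalid classes, remapping the labels is a likely fix. The sketch below shows the mapping on a small made-up list; `label_map` and `sample_labels` are hypothetical names, and with pandas the equivalent would be `y.map({1: 0, 2: 1})`.

```python
# Map the dataset's labels (1 = no exoplanet, 2 = exoplanet) to 0/1.
label_map = {1: 0, 2: 1}

# Hypothetical stand-in for the LABEL column.
sample_labels = [2, 2, 1, 1, 1]

remapped = [label_map[label] for label in sample_labels]
print(remapped)
```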

In [29]:
print("Stars WITHOUT an exoplanet:", (y == 1).sum())
print("Stars WITH an exoplanet:   ", (y == 2).sum())

Stars WITHOUT an exoplanet: 5050
Stars WITH an exoplanet:    37
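
These counts quantify the imbalance. As a rough sketch, the snippet below derives the positive rate and the negative-to-positive ratio from the counts printed above. That ratio is the value commonly suggested for XGBoost's `scale_pos_weight` parameter, which is one possible precaution and not something this notebook applies.

```python
# Class counts copied from the output above.
n_negative = 5050  # stars without an exoplanet
n_positive = 37    # stars with an exoplanet

total = n_negative + n_positive
positive_rate = n_positive / total

# Candidate value for scale_pos_weight (negatives per positive).
neg_to_pos_ratio = n_negative / n_positive

print(f"Positive rate: {positive_rate:.2%}")
print(f"Negatives per positive: {neg_to_pos_ratio:.1f}")
```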


In [30]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Gradient boosting classifier from scikit-learn

In [None]:
"""
We set max_depth=2 and n_estimators=100 to limit the size of the model.
"""

In [31]:
start = time.time()

# Fit scikit-learn's gradient boosting classifier and score it on the test set
gbc = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=2)
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)
score = accuracy_score(y_test, y_pred)
print('Score: ' + str(score))

end = time.time()
elapsed = end - start

print('Run Time: ' + str(elapsed) + ' seconds')

Score: 0.9874213836477987
Run Time: 228.50020694732666 seconds
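
An accuracy near 99% needs context on a dataset this imbalanced. Consider a hypothetical classifier that always predicts "no exoplanet". The sketch below uses the class counts of the full dataset (not the actual test split, whose composition is not shown) and computes accuracy and recall by hand; `accuracy_and_recall` is a made-up helper for this illustration.

```python
def accuracy_and_recall(y_true, y_pred, positive=2):
    """Return (accuracy, recall for the positive class) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Labels with the same class counts as the dataset: 1 = no exoplanet, 2 = exoplanet.
y_true = [1] * 5050 + [2] * 37

# Baseline that always predicts the majority class.
y_pred = [1] * len(y_true)

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(f"Recall:   {recall:.4f}")
```

The baseline is highly accurate yet finds no exoplanets at all, which is why a metric such as recall is more informative than accuracy here.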


In [None]:
"""
While an accuracy of 98.7% is usually outstanding, that is not the case with imbalanced datasets, where a model
can score highly just by predicting the majority class.
"""

# XGBClassifier from XGBoost

In [33]:
start = time.time()

# Instantiate the XGBClassifier, xgb_clf
xgb_clf = XGBClassifier(n_estimators=100, max_depth=2, random_state=2)

# Fit xgb_clf to the training set
xgb_clf.fit(X_train, y_train)

# Predict labels of the test set, y_pred
y_pred = xgb_clf.predict(X_test)

score = accuracy_score(y_test, y_pred)

print('Score: ' + str(score))

end = time.time()
elapsed = end - start

print('Run Time: ' + str(elapsed) + ' seconds')

Score: 0.9913522012578616
Run Time: 10.212843179702759 seconds
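
Both timed cells above repeat the same start/stop bookkeeping. As an optional sketch, a small context manager can wrap that pattern. `timed` is a hypothetical helper name, and `time.perf_counter` is swapped in for `time.time` because it is intended for measuring intervals. The last lines compute the speedup from the two run times reported above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Time the enclosed block and print the elapsed seconds."""
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label} run time: {result['seconds']:.3f} seconds")


# Demo workload standing in for model.fit(...) and model.predict(...).
with timed("Demo") as timing:
    total = sum(i * i for i in range(100_000))

# Run times copied from the two outputs above.
sklearn_seconds = 228.50020694732666
xgboost_seconds = 10.212843179702759
speedup = sklearn_seconds / xgboost_seconds
print(f"XGBoost was about {speedup:.1f}x faster in this run")
```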


In [None]:
"""
Judge the speed for yourself: XGBoost finished in about 10 seconds, versus about 228 seconds for scikit-learn's
GradientBoostingClassifier. Not bad for a dataset this large!
"""