# **SpaceX Falcon 9 First Stage Landing Prediction**


by Andrew Hagan
kappapb@gmail.com

SpaceX advertises Falcon 9 rocket launches on its website at a cost of 62 million dollars; other providers charge upward of 165 million dollars per launch. Much of the savings comes from SpaceX's ability to reuse the first stage. Therefore, if I can determine whether the first stage will land, I can estimate the cost of a launch. This information can be useful to a competing company that wants to bid against SpaceX for a rocket launch. Most unsuccessful landings are planned. In this project I will apply machine learning classifier models to the Falcon 9 data and check their accuracy.

## Objectives


Perform exploratory data analysis and determine training labels:

*   Create a column for the class
*   Standardize the data
*   Split into training data and test data

Find the best hyperparameters for SVM, Classification Trees, Logistic Regression and K Nearest Neighbors:

*   Find the method that performs best using test data
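As a rough illustration of the first objective, a class column can be derived from the landing outcome text. The sketch below uses a small made-up table and assumes that an outcome beginning with "True" means the stage landed; the rows and that rule are illustrative assumptions, not necessarily how the `Class` column in the dataset was built.

```python
import pandas as pd

# Toy rows mimicking the 'Outcome' column; the values are illustrative only.
toy = pd.DataFrame(
    {"Outcome": ["None None", "False Ocean", "True ASDS", "True RTLS"]}
)

# Assumption: an outcome string starting with "True" means the stage landed.
toy["Class"] = toy["Outcome"].str.startswith("True").astype(int)

print(toy)
```

Under this rule, `None None` and `False Ocean` map to class 0, which matches the first rows of the data shown further down.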

This data was scraped from the SpaceX website using Beautiful Soup.


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.metrics import jaccard_score

Load the data


In [2]:
data = pd.read_csv("https://raw.githubusercontent.com/kappapb/Case_Study-SpaceX/main/falcon9launchdata1.csv")
data.head()

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude,Class
0,1,2010-06-04,Falcon 9,6104.959412,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0003,-80.577366,28.561857,0
1,2,2012-05-22,Falcon 9,525.0,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0005,-80.577366,28.561857,0
2,3,2013-03-01,Falcon 9,677.0,ISS,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0007,-80.577366,28.561857,0
3,4,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1,False,False,False,,1.0,0,B1003,-120.610829,34.632093,0
4,5,2013-12-03,Falcon 9,3170.0,GTO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B1004,-80.577366,28.561857,0


In [None]:
data['LaunchSite'].value_counts()

In [None]:
X = pd.read_csv('https://raw.githubusercontent.com/kappapb/Case_Study-SpaceX/main/falcon9launchdata2.csv')
X.head(100)

Here we create a NumPy array out of the 'Class' column in our data.

In [None]:
Y = data['Class'].to_numpy()
Y

Here I standardize the data so that every feature has zero mean and unit variance, which keeps features measured on large scales from dominating the classifiers.

In [None]:
X = preprocessing.StandardScaler().fit_transform(X)
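To make the standardization step concrete, here is a minimal sketch on a made-up feature matrix (the numbers are arbitrary). It compares `StandardScaler` with the same computation written by hand: subtract each column's mean and divide by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrix: two columns on very different scales.
toy_X = np.array([[500.0, 1.0],
                  [3170.0, 2.0],
                  [6100.0, 3.0]])

scaled = StandardScaler().fit_transform(toy_X)

# Same result by hand: column mean removed, divided by the population std (ddof=0).
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```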

Now I will split the data into training and test sets so the models can be evaluated on data they were not trained on.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
print('Train set:', X_train.shape, Y_train.shape)
print('Test set:', X_test.shape, Y_test.shape)
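One variation worth considering is to split first and fit the scaler on the training rows only, so that no test-set statistics influence preprocessing. The sketch below shows that ordering on synthetic data; `make_classification`, the 90-row size and the `stratify` argument are assumptions made for the example, not part of this notebook's workflow.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the launch features (sizes chosen arbitrarily).
X_demo, y_demo = make_classification(n_samples=90, n_features=8, random_state=0)

# stratify keeps the class ratio similar in both splits.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

scaler = StandardScaler().fit(Xd_train)   # statistics come from training rows only
Xd_train_s = scaler.transform(Xd_train)
Xd_test_s = scaler.transform(Xd_test)     # test rows reuse the training statistics

print(Xd_train_s.shape, Xd_test_s.shape)
```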

Here's my Logistic Regression model, tuned with the GridSearchCV function.


In [None]:
parameters ={'C':[0.01,0.1,1],
             'penalty':['l2'],
             'solver':['lbfgs']}

In [None]:
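# Only the L2 (ridge) penalty is searched; the lbfgs solver does not support L1 (lasso)
parameters = {'C': [0.01, 0.1, 1], 'penalty': ['l2'], 'solver': ['lbfgs']}
lr = LogisticRegression()  # no need to fit here: GridSearchCV clones and fits the estimator
logreg_cv = GridSearchCV(lr, parameters, cv=10)
logreg_cv.fit(X_train, Y_train)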

In [None]:
print("tuned hyperparameters (best parameters):", logreg_cv.best_params_)
print("cross-validation accuracy:", logreg_cv.best_score_)

In [None]:
logreg_cv.score(X_test, Y_test)
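To show what the grid search is doing under the hood, the following sketch repeats it by hand on synthetic data: for each candidate `C`, compute the mean 10-fold cross-validation accuracy, then keep the best. The data from `make_classification` is a stand-in, so the numbers it prints say nothing about the Falcon 9 results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_classification(n_samples=90, n_features=8, random_state=0)

# Manual search: mean 10-fold accuracy for each candidate C.
manual_scores = {}
for C in [0.01, 0.1, 1]:
    model = LogisticRegression(C=C)
    manual_scores[C] = cross_val_score(model, X_demo, y_demo, cv=10).mean()

# The same search delegated to GridSearchCV.
search = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1]}, cv=10)
search.fit(X_demo, y_demo)

print(manual_scores)
print(search.best_params_, search.best_score_)
```

The `best_score_` attribute is therefore a cross-validation average over the training data, while `score(X_test, Y_test)` measures accuracy on the held-out test set.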

Now I will try the Support Vector Machine model, with hyperparameters selected by the GridSearchCV function.

In [None]:
# Each kernel is listed once; a repeated entry would only repeat the same fits
parameters = {'kernel': ('linear', 'rbf', 'poly', 'sigmoid'),
              'C': np.logspace(-3, 3, 5),
              'gamma': np.logspace(-3, 3, 5)}
svm = SVC()

In [None]:
svm_cv = GridSearchCV(svm, parameters, cv=10)
svm_cv.fit(X_train, Y_train)

In [None]:
print("tuned hyperparameters (best parameters):", svm_cv.best_params_)
print("cross-validation accuracy:", svm_cv.best_score_)

In [None]:
svm_cv.score(X_test, Y_test)
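The size of an SVM grid grows quickly, and the linear kernel does not use `gamma`, so crossing it with every `gamma` value repeats identical fits. As a sketch of one way to trim the search, `ParameterGrid` can count the combinations in a single-dict grid versus a list of per-kernel dicts. The grid names below are hypothetical, and the four-kernel grid is an assumption for the example.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = np.logspace(-3, 3, 5)
gamma_values = np.logspace(-3, 3, 5)

# Single dict: every kernel is crossed with every gamma, including 'linear'.
flat_grid = {"kernel": ["linear", "rbf", "poly", "sigmoid"],
             "C": C_values,
             "gamma": gamma_values}

# List of dicts: gamma is only searched for the kernels that use it.
split_grid = [
    {"kernel": ["linear"], "C": C_values},
    {"kernel": ["rbf", "poly", "sigmoid"], "C": C_values, "gamma": gamma_values},
]

print(len(ParameterGrid(flat_grid)), len(ParameterGrid(split_grid)))  # 100 80
```

With `cv=10`, each combination is fitted ten times, so the flat grid costs 1000 fits and the split grid 800. `GridSearchCV` accepts either form as its parameter grid.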

This is the Decision Tree Classifier with GridSearchCV


In [None]:
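parameters = {'criterion': ['gini', 'entropy'],
              'splitter': ['best', 'random'],
              'max_depth': [2*n for n in range(1, 10)],
              # 'auto' was an alias of 'sqrt' for classifiers and is rejected by
              # recent scikit-learn releases, so only 'sqrt' is listed
              'max_features': ['sqrt'],
              'min_samples_leaf': [1, 2, 4],
              'min_samples_split': [2, 5, 10]}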

tree = DecisionTreeClassifier()

In [None]:
tree_cv = GridSearchCV(tree, parameters, cv=10)
tree_cv.fit(X_train, Y_train)

In [None]:
print("tuned hyperparameters (best parameters):", tree_cv.best_params_)
print("cross-validation accuracy:", tree_cv.best_score_)

In [None]:
tree_cv.score(X_test, Y_test)
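Because the tree grid includes a random splitter and random feature subsets, repeated runs can pick different best parameters. A minimal sketch of making the search repeatable by fixing `random_state` follows; the synthetic data, the reduced grid and the `run_search` helper are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_classification(n_samples=90, n_features=8, random_state=0)

# Deliberately small grid so the sketch runs quickly.
small_grid = {"max_depth": [2, 4, 6], "splitter": ["best", "random"]}

def run_search(seed):
    """Grid-search a decision tree whose randomness is pinned by `seed`."""
    estimator = DecisionTreeClassifier(random_state=seed)
    return GridSearchCV(estimator, small_grid, cv=5).fit(X_demo, y_demo)

first, second = run_search(0), run_search(0)

print(first.best_params_ == second.best_params_)
print(first.best_score_ == second.best_score_)
```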

Lastly, I will set up the parameter grid for the K Nearest Neighbors model and search it with GridSearchCV.

In [None]:
parameters = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'p': [1,2]}

KNN = KNeighborsClassifier()

In [None]:
knn_cv = GridSearchCV(KNN, parameters, cv=10)
knn_cv.fit(X_train, Y_train)

In [None]:
print("tuned hyperparameters (best parameters):", knn_cv.best_params_)
print("cross-validation accuracy:", knn_cv.best_score_)

In [None]:
knn_cv.score(X_test, Y_test)

Now I will start a more thorough score evaluation to find the best model.

In [None]:
yhat_train_svm = svm_cv.predict(X_train)
yhat_test_svm = svm_cv.predict(X_test)

SVM_f1_train = f1_score(Y_train, yhat_train_svm, average='weighted') 
SVM_jaccard_train = jaccard_score(Y_train, yhat_train_svm,pos_label=1)
print("SVM F1 Train score: ", SVM_f1_train)
print("SVM Jaccard Train score: ", SVM_jaccard_train)
SVM_f1_test = f1_score(Y_test, yhat_test_svm, average='weighted') 
SVM_jaccard_test = jaccard_score(Y_test, yhat_test_svm,pos_label=1)
print("SVM F1 Test score: ", SVM_f1_test)
print("SVM Jaccard Test score: ", SVM_jaccard_test)
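To make these metrics less opaque, here is a small worked sketch that computes accuracy, Jaccard and F1 by hand from true/false positive counts and checks them against scikit-learn. The label arrays are made up. Note that `average='weighted'` averages the per-class F1 values weighted by class size, so it differs from the F1 of the positive class alone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Made-up labels, only for illustration (1 = landed, 0 = did not land).
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

accuracy = (tp + tn) / len(y_true)
jaccard = tp / (tp + fp + fn)              # intersection over union for class 1
f1_pos = 2 * tp / (2 * tp + fp + fn)       # F1 for class 1
f1_neg = 2 * tn / (2 * tn + fn + fp)       # F1 for class 0 (roles swapped)

n_pos, n_neg = int(np.sum(y_true == 1)), int(np.sum(y_true == 0))
f1_weighted = (n_pos * f1_pos + n_neg * f1_neg) / len(y_true)

print(accuracy, accuracy_score(y_true, y_pred))
print(jaccard, jaccard_score(y_true, y_pred, pos_label=1))
print(f1_weighted, f1_score(y_true, y_pred, average="weighted"))
```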

Here I calculate three different scores (accuracy, F1, and Jaccard) for both the training and test data, and average them to get a total score for each model.

In [None]:
yhat_train_svm = svm_cv.predict(X_train)
yhat_test_svm = svm_cv.predict(X_test)
SVM_f1_train = f1_score(Y_train, yhat_train_svm, average='weighted') 
SVM_jaccard_train = jaccard_score(Y_train, yhat_train_svm,pos_label=1)
SVM_f1_test = f1_score(Y_test, yhat_test_svm, average='weighted') 
SVM_jaccard_test = jaccard_score(Y_test, yhat_test_svm,pos_label=1)

yhat_train_tree = tree_cv.predict(X_train)
yhat_test_tree = tree_cv.predict(X_test)
tree_f1_train = f1_score(Y_train, yhat_train_tree, average='weighted') 
tree_jaccard_train = jaccard_score(Y_train, yhat_train_tree,pos_label=1)
tree_f1_test = f1_score(Y_test, yhat_test_tree, average='weighted') 
tree_jaccard_test = jaccard_score(Y_test, yhat_test_tree,pos_label=1)

yhat_train_knn = knn_cv.predict(X_train)
yhat_test_knn = knn_cv.predict(X_test)
knn_f1_train = f1_score(Y_train, yhat_train_knn, average='weighted') 
knn_jaccard_train = jaccard_score(Y_train, yhat_train_knn,pos_label=1)
knn_f1_test = f1_score(Y_test, yhat_test_knn, average='weighted') 
knn_jaccard_test = jaccard_score(Y_test, yhat_test_knn,pos_label=1)

yhat_train_logreg = logreg_cv.predict(X_train)
yhat_test_logreg = logreg_cv.predict(X_test)
logreg_f1_train = f1_score(Y_train, yhat_train_logreg, average='weighted') 
logreg_jaccard_train = jaccard_score(Y_train, yhat_train_logreg,pos_label=1)
# Use the logistic regression predictions here, not the SVM ones
logreg_f1_test = f1_score(Y_test, yhat_test_logreg, average='weighted')
logreg_jaccard_test = jaccard_score(Y_test, yhat_test_logreg, pos_label=1)

svm_total_score = ((metrics.accuracy_score(Y_train, yhat_train_svm))+ (metrics.accuracy_score(Y_test, yhat_test_svm))+SVM_f1_train+SVM_jaccard_train+SVM_f1_test+SVM_jaccard_test)/6
tree_total_score = ((metrics.accuracy_score(Y_train, yhat_train_tree))+ (metrics.accuracy_score(Y_test, yhat_test_tree))+tree_f1_train+tree_jaccard_train+tree_f1_test+tree_jaccard_test)/6
KNN_total_score = ((metrics.accuracy_score(Y_train, yhat_train_knn))+ (metrics.accuracy_score(Y_test, yhat_test_knn))+knn_f1_train+knn_jaccard_train+knn_f1_test+knn_jaccard_test)/6
logreg_total_score = ((metrics.accuracy_score(Y_train, yhat_train_logreg))+ (metrics.accuracy_score(Y_test, yhat_test_logreg))+logreg_f1_train+logreg_jaccard_train+logreg_f1_test+logreg_jaccard_test)/6

print("SVM Total score: ", svm_total_score)
print("Logistic Regression Total score: ", logreg_total_score)
print("KNN Total score: ", KNN_total_score)
print("Decision Tree Total score: ", tree_total_score)
d = {'SVM': [svm_total_score], 'LogReg': [logreg_total_score], 'KNN': [KNN_total_score],'Decision Tree': [tree_total_score]}
df = pd.DataFrame(data=d)
sns.barplot(data=(df)).set(title='Total Accuracy Score by Model')
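Since the same six metric lines are repeated for every model, one option is to fold them into a helper so that each model is scored by identical code. The sketch below does this on synthetic data with two default classifiers; `total_score`, the stand-in data and the choice of models are all hypothetical, not the fitted searches from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, jaccard_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def total_score(model, X_tr, y_tr, X_te, y_te):
    """Mean of accuracy, weighted F1 and Jaccard over the train and test splits."""
    parts = []
    for X_part, y_part in ((X_tr, y_tr), (X_te, y_te)):
        y_hat = model.predict(X_part)
        parts += [accuracy_score(y_part, y_hat),
                  f1_score(y_part, y_hat, average="weighted"),
                  jaccard_score(y_part, y_hat, pos_label=1)]
    return sum(parts) / len(parts)

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_classification(n_samples=90, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

models = {"LogReg": LogisticRegression().fit(X_tr, y_tr),
          "KNN": KNeighborsClassifier().fit(X_tr, y_tr)}
scores = {name: total_score(m, X_tr, y_tr, X_te, y_te) for name, m in models.items()}

print(scores)
```

A fitted `GridSearchCV` object also exposes `predict`, so the same helper could be called with `svm_cv`, `tree_cv`, `knn_cv` and `logreg_cv`.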



In conclusion, it was a close race, but the Support Vector Machine was the winner. The Decision Tree Classifier came in last place.
