# SpaceX Falcon 9 First Stage Landing Prediction
### **Machine Learning Prediction**
![Falcon 9](../images/falcon9.webp)
### Objectives:
* Perform exploratory data analysis and determine the training labels
* Find the best hyperparameters for logistic regression, SVM, classification trees, and k-nearest neighbors
* Find the method that performs best on the test data

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing_1.gif)


Most unsuccessful landings are planned: SpaceX performs a controlled landing in the ocean.

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)


***


In [81]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler


In [82]:
# Plotting the confusion matrix

def plot_confusion_matrix(y, y_predict):
    """Plot the confusion matrix of true labels y against predictions y_predict."""
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y, y_predict)
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, ax=ax)  # annot=True writes the count in each cell
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix')
    ax.xaxis.set_ticklabels(['did not land', 'landed'])
    ax.yaxis.set_ticklabels(['did not land', 'landed'])
    plt.show()

In [59]:
data = pd.read_csv('../csv/dataset_part_2.csv')
print(data.dtypes)
print(data.shape)

FlightNumber      float64
Date               object
BoosterVersion     object
PayloadMass       float64
Orbit              object
LaunchSite         object
Outcome            object
Flights           float64
GridFins          float64
Reused            float64
Legs              float64
LandingPad         object
Block             float64
ReusedCount       float64
Serial             object
Longitude         float64
Latitude          float64
Class             float64
dtype: object
(91, 18)


In [60]:
print(data.describe())

       FlightNumber   PayloadMass    Flights   GridFins     Reused       Legs  \
count     90.000000     91.000000  90.000000  90.000000  90.000000  90.000000   
mean      45.500000   6123.547647   1.788889   0.777778   0.411111   0.788889   
std       26.124701   4705.752327   1.213172   0.418069   0.494792   0.410383   
min        1.000000    350.000000   1.000000   0.000000   0.000000   0.000000   
25%       23.250000   2531.500000   1.000000   1.000000   0.000000   1.000000   
50%       45.500000   4707.000000   1.000000   1.000000   0.000000   1.000000   
75%       67.750000   8300.500000   2.000000   1.000000   1.000000   1.000000   
max       90.000000  15600.000000   6.000000   1.000000   1.000000   1.000000   

           Block  ReusedCount   Longitude   Latitude      Class  
count  90.000000    90.000000   90.000000  90.000000  91.000000  
mean    3.500000     3.188889  -86.366477  29.449963   0.670330  
std     1.595288     4.194417   14.149518   2.141306   0.472698  
min   

In [61]:
data.head()

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude,Class
0,1.0,2010-06-04,Falcon 9,6123.547647,LEO,CCSFS SLC 40,None None,1.0,0.0,0.0,0.0,,1.0,0.0,B0003,-80.577366,28.561857,0.0
1,2.0,2012-05-22,Falcon 9,525.0,LEO,CCSFS SLC 40,None None,1.0,0.0,0.0,0.0,,1.0,0.0,B0005,-80.577366,28.561857,0.0
2,3.0,2013-03-01,Falcon 9,677.0,ISS,CCSFS SLC 40,None None,1.0,0.0,0.0,0.0,,1.0,0.0,B0007,-80.577366,28.561857,0.0
3,4.0,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1.0,0.0,0.0,0.0,,1.0,0.0,B1003,-120.610829,34.632093,0.0
4,5.0,2013-12-03,Falcon 9,3170.0,GTO,CCSFS SLC 40,None None,1.0,0.0,0.0,0.0,,1.0,0.0,B1004,-80.577366,28.561857,0.0


In [62]:
X = pd.read_csv('../csv/dataset_part_3.csv')
print(X.dtypes)

FlightNumber      float64
Date               object
BoosterVersion     object
PayloadMass         int64
Outcome            object
                   ...   
Serial_B1056        int64
Serial_B1058        int64
Serial_B1059        int64
Serial_B1060        int64
Serial_B1062        int64
Length: 86, dtype: object


In [63]:
print(X.columns)

Index(['FlightNumber', 'Date', 'BoosterVersion', 'PayloadMass', 'Outcome',
       'Flights', 'GridFins', 'Reused', 'Legs', 'Block', 'ReusedCount',
       'Longitude', 'Latitude', 'Class', 'Orbit_ES-L1', 'Orbit_GEO',
       'Orbit_GTO', 'Orbit_HEO', 'Orbit_ISS', 'Orbit_LEO', 'Orbit_MEO',
       'Orbit_PO', 'Orbit_SO', 'Orbit_SSO', 'Orbit_VLEO',
       'LaunchSite_CCSFS SLC 40', 'LaunchSite_KSC LC 39A',
       'LaunchSite_VAFB SLC 4E', 'LandingPad_5e9e3032383ecb267a34e7c7',
       'LandingPad_5e9e3032383ecb554034e7c9',
       'LandingPad_5e9e3032383ecb6bb234e7ca',
       'LandingPad_5e9e3032383ecb761634e7cb',
       'LandingPad_5e9e3033383ecbb9e534e7cc', 'Serial_B0003', 'Serial_B0005',
       'Serial_B0007', 'Serial_B1003', 'Serial_B1004', 'Serial_B1005',
       'Serial_B1006', 'Serial_B1007', 'Serial_B1008', 'Serial_B1010',
       'Serial_B1011', 'Serial_B1012', 'Serial_B1013', 'Serial_B1015',
       'Serial_B1016', 'Serial_B1017', 'Serial_B1018', 'Serial_B1019',
       'Serial_B1020', 

In [64]:
X.head()

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Outcome,Flights,GridFins,Reused,Legs,Block,...,Serial_B1048,Serial_B1049,Serial_B1050,Serial_B1051,Serial_B1054,Serial_B1056,Serial_B1058,Serial_B1059,Serial_B1060,Serial_B1062
0,1.0,6/4/2010,Falcon 9,6124,None None,1.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
1,2.0,5/22/2012,Falcon 9,525,None None,1.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
2,3.0,3/1/2013,Falcon 9,677,None None,1.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
3,4.0,9/29/2013,Falcon 9,500,False Ocean,1.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
4,5.0,12/3/2013,Falcon 9,3170,None None,1.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0


In [65]:
# numpy array from the column class in data
Y = data['Class'].to_numpy()
type(Y)

numpy.ndarray

In [66]:
non_numeric_columns = X.select_dtypes(exclude=['float', 'int']).columns
numeric_columns = X.select_dtypes(include=['float', 'int']).columns

In [67]:
X_numeric = X[numeric_columns]
scaler = preprocessing.StandardScaler()
X_numeric_standardized = scaler.fit_transform(X_numeric)

In [68]:
X_standardized = pd.concat([pd.DataFrame(X_numeric_standardized, columns=numeric_columns), X[non_numeric_columns]], axis=1)
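
The three cells above split the columns by dtype, standardize the numeric ones, and reattach the rest. The following is a minimal sketch of the same idea on a hypothetical toy frame (the column names are borrowed from the dataset, the values are invented). It checks that each standardized column ends up with mean 0 and population standard deviation 1, and it passes the original index to the rebuilt frame because `pd.concat` aligns rows on the index.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "PayloadMass": [500.0, 3170.0, 6124.0, 15600.0],
    "Flights": [1, 1, 2, 6],
    "Orbit": ["LEO", "GTO", "ISS", "VLEO"],
})

# Split the columns into numeric and non-numeric groups.
numeric_cols = toy.select_dtypes(include="number").columns
other_cols = toy.columns.difference(numeric_cols)

# Standardize the numeric part, then reattach the untouched columns.
scaled = StandardScaler().fit_transform(toy[numeric_cols])
toy_std = pd.concat(
    [pd.DataFrame(scaled, columns=numeric_cols, index=toy.index), toy[other_cols]],
    axis=1,
)

col_means = toy_std[numeric_cols].mean().to_numpy()
col_stds = toy_std[numeric_cols].std(ddof=0).to_numpy()
print(col_means, col_stds)
```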

In [69]:
X[0:5]

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Outcome,Flights,GridFins,Reused,Legs,Block,...,Serial_B1048,Serial_B1049,Serial_B1050,Serial_B1051,Serial_B1054,Serial_B1056,Serial_B1058,Serial_B1059,Serial_B1060,Serial_B1062
0,1.0,6/4/2010,Falcon 9,6124,None None,1.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
1,2.0,5/22/2012,Falcon 9,525,None None,1.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
2,3.0,3/1/2013,Falcon 9,677,None None,1.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
3,4.0,9/29/2013,Falcon 9,500,False Ocean,1.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0
4,5.0,12/3/2013,Falcon 9,3170,None None,1.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,0


In [70]:
X_standardized_array = X_standardized.values
print(X_standardized_array)

[[-1.7129115371395962 8.92304039043479e-05 -0.6539128396553676 ...
  '6/4/2010' 'Falcon 9' 'None None']
 [-1.6744191430465716 -1.1963237659873809 -0.6539128396553676 ...
  '5/22/2012' 'Falcon 9' 'None None']
 [-1.635926748953547 -1.1638438989662208 -0.6539128396553676 ...
  '3/1/2013' 'Falcon 9' 'None None']
 ...
 [1.6744191430465716 2.0249525191704376 1.0038943594709162 ...
  '10/24/2020' 'Falcon 9' 'True ASDS']
 [1.7129115371395962 -0.5219391586269779 -0.6539128396553676 ...
  '11/5/2020' 'Falcon 9' 'True ASDS']
 [nan 8.92304039043479e-05 nan ... nan nan nan]]


### Training and Testing


* The data is split into training and test sets using the function <code>train_test_split</code>.
* During cross-validation, the training data is further divided into a training part and a validation part used to compare hyperparameter settings.
* The models are then trained and their hyperparameters selected using the function <code>GridSearchCV</code>.
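
The workflow described above can be sketched end to end on synthetic data. The snippet below is only an illustration: `make_classification` stands in for the launch features, and the grid is a hypothetical subset of what the notebook searches. The test split is held out once, and `GridSearchCV` handles the train/validation folds internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the launch features; not the notebook's data.
X_demo, y_demo = make_classification(n_samples=90, n_features=8, random_state=0)

# Hold out 20% as a test set that the grid search never sees.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

# cv=10 re-splits the training portion into 10 train/validation folds.
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1]}, cv=10)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("validation accuracy:", search.best_score_)
print("test accuracy:", search.score(X_te, y_te))
```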


In [71]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

In [72]:
Y_test.shape

(19,)
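
The test set has 19 rows because `train_test_split` rounds the test share up: with the 91 rows reported by `data.shape`, 0.2 × 91 = 18.2, which becomes 19. A small check on a placeholder array of the same length (not the actual features):

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 91  # row count reported by data.shape earlier in the notebook
placeholder = np.arange(n_rows)  # stand-in for the real feature rows

train_part, test_part = train_test_split(placeholder, test_size=0.2, random_state=2)
print(len(train_part), len(test_part))
```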


* Create a logistic regression object.
* Create a <code>GridSearchCV</code> object <code>logreg_cv</code> with cv = 10.
* Fit the object to find the best parameters from the dictionary <code>parameters</code>.


In [73]:
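# Integer-encode every column so string columns (e.g. Date) pass through the scaler.
# The encoder is refit per column, so X_test is left unencoded.
label_encoder = LabelEncoder()
X_train_encoded = X_train.apply(label_encoder.fit_transform)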

In [74]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

In [75]:
parameters = {
    'classifier__C': [0.01, 0.1, 1],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs']
}


In [76]:
logreg_cv = GridSearchCV(estimator=pipeline, cv=10, param_grid=parameters)

In [77]:
logreg_cv.fit(X_train_encoded, Y_train)

* Output of the <code>GridSearchCV</code> object for logistic regression.

In [78]:
# Display the best parameters and the accuracy on the validation data.
print("tuned hyperparameters :(best parameters) ",logreg_cv.best_params_)
print("accuracy :",logreg_cv.best_score_)

tuned hyperparameters :(best parameters)  {'classifier__C': 1, 'classifier__penalty': 'l2', 'classifier__solver': 'lbfgs'}
accuracy : 0.9857142857142858
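
The accuracy printed here is `best_score_`, which is the mean accuracy over the validation folds for the winning parameter combination, not an accuracy on the held-out test set. The sketch below illustrates this on synthetic data (an assumption, not the launch dataset) by averaging the per-fold scores stored in `cv_results_` and comparing the result with `best_score_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data used purely for illustration.
X_demo, y_demo = make_classification(n_samples=80, n_features=6, random_state=1)

pipe = Pipeline([("scaler", StandardScaler()), ("classifier", LogisticRegression())])
grid = {"classifier__C": [0.01, 0.1, 1]}
search = GridSearchCV(pipe, grid, cv=10).fit(X_demo, y_demo)

# Per-fold validation scores of the winning candidate.
fold_scores = np.array([
    search.cv_results_[f"split{i}_test_score"][search.best_index_]
    for i in range(10)
])
print(fold_scores.mean(), search.best_score_)
```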


### Accuracy on the test data using the method <code>score</code>:

In [79]:
yhat=logreg_cv.predict(X_test)

ValueError: could not convert string to float: '10/8/2018'

In [80]:
# confusion matrix
plot_confusion_matrix(Y_test,yhat)

NameError: name 'yhat' is not defined
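
The `ValueError` arises because `X_test` still contains string columns such as `Date`, while only `X_train` was integer-encoded. Note also that `X`, as loaded, lists `Class` and `Outcome` among its columns, and both carry the label. One possible way around both issues, sketched here on a hypothetical toy frame with invented values, is to keep only the numeric columns and drop `Class` before splitting, so the same pipeline can handle training and test rows alike. This is a sketch under those assumptions, not the notebook's original method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 60
# Hypothetical frame mimicking the mix of column types; values are invented.
frame = pd.DataFrame({
    "Date": ["10/8/2018"] * n,        # string column that breaks predict()
    "Outcome": ["True ASDS"] * n,     # string column derived from the label
    "PayloadMass": rng.uniform(350, 15600, n),
    "Flights": rng.integers(1, 7, n),
    "Class": np.arange(n) % 2,        # target, alternating 0/1
})

y = frame["Class"].to_numpy()
# Keep numeric features only and remove the target from the inputs.
features = frame.select_dtypes(include="number").drop(columns=["Class"])

X_tr, X_te, y_tr, y_te = train_test_split(features, y, test_size=0.2, random_state=2)

model = Pipeline([("scaler", StandardScaler()), ("classifier", LogisticRegression())])
model.fit(X_tr, y_tr)
yhat_demo = model.predict(X_te)  # numeric input, so no string-to-float error
print(yhat_demo.shape)
```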

In [None]:
yhat=logreg_cv.predict(X_test)
plot_confusion_matrix(Y_test,yhat)

* Logistic regression can distinguish between the different classes. 
* The major problem is false positives.
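
A false positive here is a launch predicted to land that did not. The counts can be read directly from the confusion matrix. The sketch below uses made-up labels (not results from this notebook) to show how `confusion_matrix(...).ravel()` unpacks into true negatives, false positives, false negatives, and true positives for a binary problem.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = landed, 0 = did not land.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```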


### Creating Support vector machine object
* Create a <code>GridSearchCV</code> object <code>svm_cv</code> with cv = 10.
* Fit the object to find the best parameters from the dictionary <code>parameters</code>.


In [None]:
parameters = {'kernel': ('linear', 'rbf', 'poly', 'sigmoid'),  # each kernel listed once
              'C': np.logspace(-3, 3, 5),
              'gamma': np.logspace(-3, 3, 5)}
svm = SVC()
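
To get a feel for the cost of this search, the grid size can be counted before fitting. The sketch below assumes four distinct kernels and uses `ParameterGrid` to enumerate the combinations; with cv = 10, every combination is fitted ten times during cross-validation. It also prints the five values that `np.logspace(-3, 3, 5)` produces.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same shape of grid as above, assuming each kernel appears once.
svm_grid = {
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "C": np.logspace(-3, 3, 5),
    "gamma": np.logspace(-3, 3, 5),
}

n_candidates = len(ParameterGrid(svm_grid))
n_fits = n_candidates * 10  # cv=10 folds per candidate

print(np.logspace(-3, 3, 5))
print(n_candidates, n_fits)
```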

In [None]:
svm_cv = GridSearchCV(estimator=svm, cv=10, param_grid=parameters)
svm_cv.fit(X_train, Y_train)

In [None]:
print("tuned hyperparameters :(best parameters) ",svm_cv.best_params_)
print("accuracy :",svm_cv.best_score_)

### Calculating the accuracy on the test data using the method <code>score</code>:


In [None]:
print("accuracy :", svm_cv.score(X_test, Y_test))

In [None]:
yhat=svm_cv.predict(X_test)
plot_confusion_matrix(Y_test,yhat)

### Creating a decision tree classifier object
* Create a <code>GridSearchCV</code> object <code>tree_cv</code> with cv = 10.
* Fit the object to find the best parameters from the dictionary <code>parameters</code>.


In [None]:
parameters = {'criterion': ['gini', 'entropy'],
     'splitter': ['best', 'random'],
     'max_depth': [2*n for n in range(1,10)],
     'max_features': ['sqrt'],  # 'auto' was an alias of 'sqrt'; recent scikit-learn rejects it
     'min_samples_leaf': [1, 2, 4],
     'min_samples_split': [2, 5, 10]}

tree = DecisionTreeClassifier()

In [None]:
tree_cv = GridSearchCV(estimator=tree, cv=10, param_grid=parameters)
tree_cv.fit(X_train, Y_train)

In [None]:
print("tuned hyperparameters :(best parameters) ",tree_cv.best_params_)
print("accuracy :",tree_cv.best_score_)

### Calculating the accuracy of tree_cv on the test data using the method <code>score</code>:


In [None]:
print("accuracy :", tree_cv.score(X_test, Y_test))

We can plot the confusion matrix


In [None]:
yhat = tree_cv.predict(X_test)
plot_confusion_matrix(Y_test,yhat)

### Creating a k-nearest neighbors object
* Create a <code>GridSearchCV</code> object <code>knn_cv</code> with cv = 10.
* Fit the object to find the best parameters from the dictionary <code>parameters</code>.


In [None]:
parameters = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'p': [1,2]}

KNN = KNeighborsClassifier()
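
The `p` entry in this grid selects the order of the Minkowski distance used to find neighbors (with the default metric): `p = 1` is the Manhattan distance and `p = 2` the Euclidean distance. The helper `minkowski` below is a hypothetical illustration of the formula, not scikit-learn's implementation.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

u, v = [0.0, 0.0], [3.0, 4.0]
d1 = minkowski(u, v, p=1)  # Manhattan: |3| + |4|
d2 = minkowski(u, v, p=2)  # Euclidean: sqrt(3^2 + 4^2)
print(d1, d2)
```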

In [None]:
knn_cv = GridSearchCV(estimator=KNN, cv=10, param_grid=parameters)
knn_cv.fit(X_train, Y_train)

In [None]:
print("tuned hyperparameters :(best parameters) ",knn_cv.best_params_)
print("accuracy :",knn_cv.best_score_)

### Calculating the accuracy of knn_cv on the test data using the method <code>score</code>:


In [None]:
print("accuracy :", knn_cv.score(X_test, Y_test))

In [None]:
yhat = knn_cv.predict(X_test)
plot_confusion_matrix(Y_test,yhat)

### Finding the method that performs best:


In [None]:
# Collect the cross-validation accuracy and the test accuracy of each tuned model.
models = {'LogReg': logreg_cv, 'SVM': svm_cv, 'Tree': tree_cv, 'KNN': knn_cv}

comparison = {}
for name, model in models.items():
    # round() works for both NumPy and built-in floats
    comparison[name] = {'Accuracy': round(float(model.best_score_), 5),
                        'TestAccuracy': round(float(model.score(X_test, Y_test)), 5)}

print("Model\t\tAccuracy\tTestAccuracy")
for name, scores in comparison.items():
    print("{}\t\t{}\t\t{}".format(name, scores['Accuracy'], scores['TestAccuracy']))
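
A nested dictionary of this shape converts directly into a table, which makes it easy to rank the methods. The sketch below uses placeholder scores that are invented for illustration (they are not results from this notebook) and picks the method with the highest test accuracy.

```python
import pandas as pd

# Placeholder scores, invented for illustration; not results from this notebook.
demo_comparison = {
    "LogReg": {"Accuracy": 0.80, "TestAccuracy": 0.75},
    "SVM": {"Accuracy": 0.82, "TestAccuracy": 0.78},
    "Tree": {"Accuracy": 0.85, "TestAccuracy": 0.70},
    "KNN": {"Accuracy": 0.81, "TestAccuracy": 0.76},
}

# One row per method, one column per score.
table = pd.DataFrame.from_dict(demo_comparison, orient="index")
best_on_test = table["TestAccuracy"].idxmax()

print(table.sort_values("TestAccuracy", ascending=False))
print("best on test data:", best_on_test)
```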


In [None]:
x = []
y1 = []
y2 = []
for meth in comparison.keys():
    x.append(meth)    
    y1.append(comparison[meth]['Accuracy'])
    y2.append(comparison[meth]['TestAccuracy'])
    

x_axis = np.arange(len(x))

plt.bar(x_axis - 0.2, y1, 0.4, label = 'Accuracy')
plt.bar(x_axis + 0.2, y2, 0.4, label = 'Test Accuracy')

plt.ylim([0,1])
plt.xticks(x_axis, x)

plt.xlabel("Methods")
plt.ylabel("Accuracy")
plt.title("Accuracy of Each Method")
plt.legend(loc='lower left')
plt.show()
    

## Author


[Helena Pedro](https://www.linkedin.com/in/helena-mbeua-pedro/) is a Data Scientist at Millennium Atlantic Bank in Angola. She is a creative big thinker, passionate about using data and optimization tools to guide decision-making and solve complex, large-scale challenges.
- **Email:** mbeua94@gmail.com

Copyright © 2024