# **SpaceX Falcon 9 First Stage Landing Prediction**


## Machine Learning Prediction

SpaceX advertises Falcon 9 rocket launches on its website at a cost of 62 million dollars; other providers charge upward of 165 million dollars per launch. Much of the savings comes from SpaceX's ability to reuse the first stage. Therefore, if we can predict whether the first stage will land, we can estimate the cost of a launch. This information is useful to a competing company that wants to bid against SpaceX for a rocket launch. In this lab, you will create a machine learning pipeline that predicts whether the first stage will land, using the data from the preceding labs.


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing_1.gif)


Several examples of an unsuccessful landing are shown here:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)


Most unsuccessful landings are planned: SpaceX performs a controlled landing in the ocean.


## Objectives


Perform exploratory data analysis and determine training labels:

*   Create a column for the class
*   Standardize the data
*   Split into training data and test data

Find the best hyperparameters for SVM, classification trees, logistic regression, and k-nearest neighbors:

*   Find the method that performs best using test data


## Import Libraries and Define Auxiliary Functions


In [2]:
#import piplite
#await piplite.install(['numpy'])
#await piplite.install(['pandas'])
#await piplite.install(['seaborn'])

In [3]:
# Pandas is a software library written for the Python programming language for data manipulation and analysis.
import pandas as pd
# NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
import numpy as np
# Matplotlib is a plotting library for Python, and pyplot gives us a MATLAB-like plotting framework. We will use this in our plotter function to plot data.
import matplotlib.pyplot as plt
# Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
import seaborn as sns
# Preprocessing allows us to standardize our data
from sklearn import preprocessing
# Allows us to split our data into training and testing data
from sklearn.model_selection import train_test_split
# Allows us to test parameters of classification algorithms and find the best one
from sklearn.model_selection import GridSearchCV
# Logistic Regression classification algorithm
from sklearn.linear_model import LogisticRegression
# Support Vector Machine classification algorithm
from sklearn.svm import SVC
# Decision Tree classification algorithm
from sklearn.tree import DecisionTreeClassifier
# K Nearest Neighbors classification algorithm
from sklearn.neighbors import KNeighborsClassifier

In [4]:
def plot_confusion_matrix(y, y_predict):
    """Plot the confusion matrix of true labels y against predictions y_predict."""
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y, y_predict)
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, ax=ax)  # annot=True writes the count in each cell
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix')
    # Use the same class names on both axes (class 0 first, then class 1)
    ax.xaxis.set_ticklabels(['did not land', 'landed'])
    ax.yaxis.set_ticklabels(['did not land', 'landed'])
    plt.show()

## Load the dataframe


In [5]:
data = pd.read_csv("dataset_part_2.csv")

In [6]:
data.head()

In [7]:
X = pd.read_csv("dataset_part_3.csv")

In [8]:
X.head(100)

In [9]:
Y = data['Class'].to_numpy()
Y

Standardize the data in <code>X</code>, then reassign the result to the variable <code>X</code> using the transform below.


In [10]:
# Fit the scaler on X, then replace X with its standardized version
X = preprocessing.StandardScaler().fit(X).transform(X)
X.dtype
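
As a quick illustration of what standardization does, the sketch below applies `StandardScaler` to a small made-up array (the values are hypothetical, not taken from the lab dataset) and checks that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: two columns on very different scales
toy = np.array([[1.0, 1000.0],
                [2.0, 2000.0],
                [3.0, 6000.0]])

toy_scaled = StandardScaler().fit(toy).transform(toy)

# Each column is shifted to mean 0 and rescaled to unit (population) standard deviation
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
print(col_means, col_stds)
```

After scaling, features measured in very different units contribute on a comparable scale, which matters for distance-based and margin-based models such as KNN and SVM.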

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

In [12]:
Y_test.shape
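
With `test_size=0.2`, one fifth of the rows are held out for testing. A minimal sketch on 100 hypothetical rows (not the lab data) shows the resulting shapes; `random_state` only fixes which rows are picked, not how many.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features, binary labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

print(X_tr.shape, X_te.shape)  # 80 training rows and 20 test rows
```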

Create a logistic regression object, then create a GridSearchCV object <code>logreg_cv</code> with cv = 10. Fit the object to find the best parameters from the dictionary <code>parameters</code>.


In [13]:
parameters ={'C':[0.01,0.1,1],
             'penalty':['l2'],
             'solver':['lbfgs']}

In [14]:
# Same grid as above; 'l2' is the ridge penalty ('l1' would be lasso)
parameters = {'C': [0.01, 0.1, 1], 'penalty': ['l2'], 'solver': ['lbfgs']}
lr = LogisticRegression()
lr

In [15]:
logreg_cv = GridSearchCV(estimator=lr, param_grid=parameters, scoring='accuracy', cv=10)
logreg_cv.fit(X_train, Y_train)

In [16]:
print("tuned hyperparameters (best parameters):", logreg_cv.best_params_)
print("accuracy :",logreg_cv.best_score_)
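
To make `best_score_` less of a black box, the sketch below reproduces it by hand. It assumes a synthetic dataset from `make_classification` rather than the lab data. For each candidate `C` it averages the accuracy over 10 cross-validation folds; the largest average should match what `GridSearchCV` reports.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data (assumption: the lab data is not used here)
X_demo, y_demo = make_classification(n_samples=120, n_features=5, random_state=0)

c_values = [0.01, 0.1, 1]

# Manual search: mean 10-fold accuracy for each value of C
manual_scores = [
    cross_val_score(LogisticRegression(C=c), X_demo, y_demo, cv=10, scoring='accuracy').mean()
    for c in c_values
]

# The same search through GridSearchCV
grid = GridSearchCV(LogisticRegression(), {'C': c_values}, scoring='accuracy', cv=10)
grid.fit(X_demo, y_demo)

print("manual mean accuracies:", dict(zip(c_values, manual_scores)))
print("GridSearchCV best:", grid.best_params_, grid.best_score_)
```

Note that `best_score_` is computed on the training folds only, so it is not the same quantity as the test accuracy returned by `score`.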

Calculate the accuracy on the test data using the method <code>score</code>:


In [17]:
accuracy = logreg_cv.score(X_test, Y_test)
accuracy

In [18]:
yhat=logreg_cv.predict(X_test)
plot_confusion_matrix(Y_test,yhat)

Examining the confusion matrix, we see that logistic regression can distinguish between the two classes. The main problem is false positives: launches that did not land but were predicted to land.

Overview:

True Positive - 12 (true label is landed, predicted label is also landed)

False Positive - 3 (true label is not landed, predicted label is landed)
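
The counts above can be read directly from the matrix returned by `confusion_matrix`. The sketch below uses hypothetical label vectors built to reproduce 12 true positives and 3 false positives; the remaining counts (3 true negatives, 0 false negatives) are assumptions made for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = landed, 0 = did not land
y_true = np.array([1] * 12 + [0] * 6)
y_pred = np.array([1] * 12 + [1] * 3 + [0] * 3)  # 3 failed landings predicted as "landed"

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted landings that actually landed
recall = tp / (tp + fn)     # share of actual landings that were predicted
print(tn, fp, fn, tp, precision, recall)
```

Under these assumed counts, false positives lower precision while recall stays at 1, which is one way to quantify the problem described above.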


In [19]:
parameters = {'kernel': ('linear', 'rbf', 'poly', 'sigmoid'),
              'C': np.logspace(-3, 3, 5),
              'gamma':np.logspace(-3, 3, 5)}
svm = SVC()
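
It helps to know how large this search is before running it. The sketch below counts candidates with `ParameterGrid`, assuming four distinct kernels; listing a kernel twice would not change the result but would add 25 redundant candidates. With `cv=10`, every candidate is fitted ten times, plus one final refit of the best setting.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

grid_values = np.logspace(-3, 3, 5)  # 5 values evenly spaced on a log scale from 1e-3 to 1e3

svm_grid = {'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
            'C': grid_values,
            'gamma': grid_values}

n_candidates = len(ParameterGrid(svm_grid))  # 4 kernels * 5 C values * 5 gamma values
n_fits = n_candidates * 10                   # one fit per candidate per fold
print(grid_values)
print(n_candidates, n_fits)
```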

In [20]:
svm_cv = GridSearchCV(estimator=svm, param_grid=parameters, scoring='accuracy', cv=10, n_jobs=-1, verbose=2)

In [21]:
svm_cv.fit(X_train, Y_train)

In [22]:
print("tuned hyperparameters (best parameters):", svm_cv.best_params_)
print("accuracy :",svm_cv.best_score_)

Calculate the accuracy on the test data using the method <code>score</code>:


In [23]:
svm_cv.score(X_test, Y_test)

We can plot the confusion matrix for the SVM predictions:


In [24]:
yhat=svm_cv.predict(X_test)
plot_confusion_matrix(Y_test,yhat)

In [25]:
parameters = {'criterion': ['gini', 'entropy'],
     'splitter': ['best', 'random'],
     'max_depth': [2*n for n in range(1,10)],
     'max_features': ['sqrt'],  # 'auto' was an alias of 'sqrt' here and is rejected by recent scikit-learn releases
     'min_samples_leaf': [1, 2, 4],
     'min_samples_split': [2, 5, 10]}

tree = DecisionTreeClassifier()
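
Because `max_features` and `splitter='random'` involve random choices, repeated runs of this search can report different best parameters. One option, shown here as a sketch on synthetic data and not part of the original lab, is to fix `random_state` on the estimator so the search becomes repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the lab dataset)
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

# A reduced grid, chosen only to keep the example fast
small_grid = {'splitter': ['best', 'random'],
              'max_depth': [2, 4, 6],
              'max_features': ['sqrt']}

def run_search(seed):
    """Run the grid search with a tree whose random choices are seeded."""
    search = GridSearchCV(DecisionTreeClassifier(random_state=seed),
                          small_grid, scoring='accuracy', cv=5)
    return search.fit(X_demo, y_demo)

first = run_search(seed=0)
second = run_search(seed=0)
print(first.best_params_, first.best_score_)
```

Two searches with the same seed should agree on both the chosen parameters and the score; without a seed, that is not guaranteed.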

In [26]:
tree_cv = GridSearchCV(estimator=tree, param_grid=parameters, scoring='accuracy', cv=10, n_jobs=-1)

In [27]:
tree_cv.fit(X_train, Y_train)

In [28]:
print("tuned hyperparameters (best parameters):", tree_cv.best_params_)
print("accuracy :",tree_cv.best_score_)

Calculate the accuracy of tree_cv on the test data using the method <code>score</code>:


In [29]:
tree_cv.score(X_test, Y_test)

We can plot the confusion matrix for the decision tree predictions:


In [30]:
yhat = tree_cv.predict(X_test)
plot_confusion_matrix(Y_test,yhat)

In [32]:
parameters = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'p': [1,2]}

KNN = KNeighborsClassifier()
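
In this grid, `p` selects the Minkowski distance: `p=1` is Manhattan distance and `p=2` is Euclidean distance. The two can disagree about which neighbor is closest, as the sketch below shows on two hypothetical points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two hypothetical training points with different labels
points = np.array([[3.0, 3.0],   # label 0
                   [0.0, 5.0]])  # label 1
labels = np.array([0, 1])
query = np.array([[0.0, 0.0]])

nearest = {}
for p in (1, 2):
    knn = KNeighborsClassifier(n_neighbors=1, p=p).fit(points, labels)
    dist, idx = knn.kneighbors(query)
    nearest[p] = (int(idx[0, 0]), float(dist[0, 0]))

# p=1: |3|+|3| = 6 vs |0|+|5| = 5, so the second point is nearest
# p=2: sqrt(18) ≈ 4.24 vs 5, so the first point is nearest
print(nearest)
```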

In [33]:
knn_cv = GridSearchCV(estimator=KNN, param_grid=parameters, scoring='accuracy', cv=10, n_jobs=-1)

In [34]:
knn_cv.fit(X_train, Y_train)

In [35]:
print("tuned hyperparameters (best parameters):", knn_cv.best_params_)
print("accuracy :",knn_cv.best_score_)

In [36]:
knn_cv.score(X_test, Y_test)

We can plot the confusion matrix for the k-nearest neighbors predictions:


In [38]:
yhat = knn_cv.predict(X_test)
plot_confusion_matrix(Y_test,yhat)

In [39]:
x_data = ['LogisticRegression', 'SVC', 'DecisionTreeClassifier', 'KNeighborsClassifier']
# best_score_ is the mean cross-validated accuracy of the best parameter setting (training data only)
y_data = [logreg_cv.best_score_, svm_cv.best_score_, tree_cv.best_score_, knn_cv.best_score_]
plt.figure(figsize=(10, 5))
bars = plt.bar(x=x_data, height=y_data, color='grey')
for bar in bars:
    height = bar.get_height()  # Get the height of the bar
    plt.text(
        bar.get_x() + bar.get_width() / 2,  # X position
        height + 0.01,                      # Y position (slightly above the bar)
        f"{height:.2f}",                    # Text value (formatted to 2 decimals)
        ha='center', va='bottom',           # Text alignment
        fontsize=10, color='black'          # Font and color
    )

plt.xlabel("Model")
plt.ylabel("Mean cross-validation accuracy")
plt.title("Cross-Validation Accuracy by Model")
plt.tight_layout()
plt.show()

In [40]:
x_data = ['LogisticRegression', 'SVC', 'DecisionTreeClassifier', 'KNeighborsClassifier']
# score() evaluates the refitted best estimator on the held-out test set
y_data = [logreg_cv.score(X_test, Y_test), svm_cv.score(X_test, Y_test), tree_cv.score(X_test, Y_test), knn_cv.score(X_test, Y_test)]
plt.figure(figsize=(10, 5))
bars = plt.bar(x=x_data, height=y_data, color='grey')
for bar in bars:
    height = bar.get_height()  # Get the height of the bar
    plt.text(
        bar.get_x() + bar.get_width() / 2,  # X position
        height + 0.01,                      # Y position (slightly above the bar)
        f"{height:.2f}",                    # Text value (formatted to 2 decimals)
        ha='center', va='bottom',           # Text alignment
        fontsize=10, color='black'          # Font and color
    )

plt.xlabel("Model")
plt.ylabel("Test accuracy")
plt.title("Test Accuracy by Model")
plt.tight_layout()
plt.show()
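
The two charts can also be summarized in a single table. The sketch below assumes fitted `GridSearchCV` objects and, to stay self-contained, builds two of them on synthetic data; the column names `cv_accuracy` and `test_accuracy` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the lab dataset)
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

searches = {
    'LogisticRegression': GridSearchCV(LogisticRegression(), {'C': [0.01, 0.1, 1]}, cv=5),
    'KNeighborsClassifier': GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7]}, cv=5),
}

rows = []
for name, search in searches.items():
    search.fit(X_tr, y_tr)
    rows.append({'model': name,
                 'cv_accuracy': search.best_score_,           # mean accuracy over the CV folds
                 'test_accuracy': search.score(X_te, y_te)})  # accuracy on held-out data

summary = pd.DataFrame(rows).set_index('model')
best_model = summary['test_accuracy'].idxmax()
print(summary)
print("best on test data:", best_model)
```

On a small test set, a difference of one or two samples can change the ranking, so the cross-validated score is worth reading alongside the test accuracy.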