# 🚀 SpaceX Falcon 9 First Stage Landing Prediction

## 1. Introduction

The goal of this project is to **predict whether the Falcon 9 first stage will land successfully**.

Falcon 9 is a **reusable rocket** developed by the company SpaceX. It consists of two main parts (stages):

- **First stage**: handles liftoff and carries the rocket through the initial ascent
- **Second stage**: ignites after the first stage separates and places the payload into orbit

The key part is that the **first stage must land back on Earth** so it can be reused. 
SpaceX advertises a single Falcon 9 rocket launch at a cost of \$62M, while other providers can charge upwards of \$165M. A big part of the cost savings comes from being able to reuse the first stage.

If we can identify which factors influence a successful landing, we can build a model to predict it. Such predictions help with mission planning, cost estimation, and improving future missions.

As part of the IBM course, I completed labs on collecting data from the [SpaceX API](https://api.spacexdata.com/v4/launches/past) and a [Wikipedia article](https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches) on Falcon 9 launches. This notebook skips that step and uses the prepared `space-data.csv` file as its main data source.

## 2. Importing Libraries

We begin by importing the necessary libraries that will help us analyze the data and build the prediction model:
- `pandas` for data manipulation and cleaning
- `numpy` for numerical operations
- `matplotlib` and `seaborn` for exploratory data analysis (EDA) and visualization
- `scikit-learn` for machine learning algorithms and model-selection tools

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn modules:
# Preprocessing allows us to standardize our data
from sklearn import preprocessing
# Allows us to split our data into training and testing data
from sklearn.model_selection import train_test_split
# Allows us to test parameters of classification algorithms and find the best one
from sklearn.model_selection import GridSearchCV
# Logistic Regression classification algorithm
from sklearn.linear_model import LogisticRegression
# Support Vector Machine classification algorithm
from sklearn.svm import SVC
# Decision Tree classification algorithm
from sklearn.tree import DecisionTreeClassifier
# K Nearest Neighbors classification algorithm
from sklearn.neighbors import KNeighborsClassifier

## 3. Loading and Exploring the Dataset

We load the dataset from `space-data.csv` into a pandas `DataFrame`.

In [9]:
df = pd.read_csv("space-data.csv")

### 3.1 Preview of the dataset
Let's take a quick look at the first few rows to understand the structure of the data.


In [10]:
df.head()

|   | FlightNumber | Date | BoosterVersion | PayloadMass | Orbit | LaunchSite | Outcome | Flights | GridFins | Reused | Legs | LandingPad | Block | ReusedCount | Serial | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2010-06-04 | Falcon 9 | 6104.959412 | LEO | CCAFS SLC 40 | None None | 1 | False | False | False | NaN | 1.0 | 0 | B0003 | -80.577366 | 28.561857 |
| 1 | 2 | 2012-05-22 | Falcon 9 | 525.0 | LEO | CCAFS SLC 40 | None None | 1 | False | False | False | NaN | 1.0 | 0 | B0005 | -80.577366 | 28.561857 |
| 2 | 3 | 2013-03-01 | Falcon 9 | 677.0 | ISS | CCAFS SLC 40 | None None | 1 | False | False | False | NaN | 1.0 | 0 | B0007 | -80.577366 | 28.561857 |
| 3 | 4 | 2013-09-29 | Falcon 9 | 500.0 | PO | VAFB SLC 4E | False Ocean | 1 | False | False | False | NaN | 1.0 | 0 | B1003 | -120.610829 | 34.632093 |
| 4 | 5 | 2013-12-03 | Falcon 9 | 3170.0 | GTO | CCAFS SLC 40 | None None | 1 | False | False | False | NaN | 1.0 | 0 | B1004 | -80.577366 | 28.561857 |


In [19]:
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")

Dataset contains 90 rows and 17 columns.


#### Column names
Here are all the columns present in the dataset.

In [11]:
df.columns

Index(['FlightNumber', 'Date', 'BoosterVersion', 'PayloadMass', 'Orbit',
       'LaunchSite', 'Outcome', 'Flights', 'GridFins', 'Reused', 'Legs',
       'LandingPad', 'Block', 'ReusedCount', 'Serial', 'Longitude',
       'Latitude'],
      dtype='object')

#### Feature Description (Data Dictionary)

| Column Name      | Description |
|------------------|-------------|
| `FlightNumber`   | Number of the flight |
| `Date`           | Date of the launch |
| `BoosterVersion` | Falcon 9 booster model/version |
| `PayloadMass`    | Mass of the payload in kilograms |
| `Orbit`          | Destination orbit (e.g., LEO, GTO, etc.) |
| `LaunchSite`     | Location of the launch site |
| `Outcome`        | Landing outcome: result and landing type (e.g., `True ASDS`) |
| `Flights`        | Number of flights of this booster, including the current one |
| `GridFins`       | Whether grid fins were used (`True`/`False`) |
| `Reused`         | Whether the booster was reused (`True`/`False`) |
| `Legs`           | Whether the booster had landing legs |
| `LandingPad`     | Identifier of the landing pad used |
| `Block`          | Falcon 9 block number (sub-version) |
| `ReusedCount`    | How many times the booster has been reused |
| `Serial`         | Unique serial number of the booster |
| `Longitude`      | Longitude of the launch site |
| `Latitude`       | Latitude of the launch site |

#### Data types
We now check the data types of each column to see how the data is represented.

In [12]:
df.dtypes

FlightNumber        int64
Date               object
BoosterVersion     object
PayloadMass       float64
Orbit              object
LaunchSite         object
Outcome            object
Flights             int64
GridFins             bool
Reused               bool
Legs                 bool
LandingPad         object
Block             float64
ReusedCount         int64
Serial             object
Longitude         float64
Latitude          float64
dtype: object
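
The `Date` column is stored as `object`, meaning plain strings. A minimal sketch of converting it to a datetime type, using three dates from the preview above in a small stand-in frame instead of the real `df` (the `Year` column is a hypothetical addition for illustration):

```python
import pandas as pd

# Stand-in frame with three dates taken from the dataset preview.
sample = pd.DataFrame({"Date": ["2010-06-04", "2012-05-22", "2013-03-01"]})

# Parse the ISO-formatted strings into datetime values.
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

# Datetime columns expose the .dt accessor, e.g. to extract the launch year.
# "Year" is a hypothetical column name, not part of the original dataset.
sample["Year"] = sample["Date"].dt.year

print(sample.dtypes)
print(sample["Year"].tolist())
```

Whether the conversion is needed depends on the later analysis; it mainly helps when grouping or plotting launches over time.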

#### Missing values
Let's check if any columns contain missing values.

In [14]:
df.isnull().sum()

FlightNumber       0
Date               0
BoosterVersion     0
PayloadMass        0
Orbit              0
LaunchSite         0
Outcome            0
Flights            0
GridFins           0
Reused             0
Legs               0
LandingPad        26
Block              0
ReusedCount        0
Serial             0
Longitude          0
Latitude           0
dtype: int64

Only `LandingPad` contains missing values (26 of 90 rows). They need no imputation: a missing entry (`NaN`) simply means that no landing pad was used, so the column is kept as-is.
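
A minimal sketch of how the missing entries could be inspected further. It uses a hypothetical toy frame rather than the real data, and the pad identifiers `pad-A` and `pad-B` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: four launches, two of them without a landing pad.
toy = pd.DataFrame({
    "Outcome": ["None None", "True ASDS", "False Ocean", "True RTLS"],
    "LandingPad": [np.nan, "pad-A", np.nan, "pad-B"],  # made-up pad identifiers
})

# Share of missing values per column (0.0 = complete, 1.0 = entirely missing).
missing_share = toy.isnull().mean()

# Outcomes of the rows where LandingPad is missing.
no_pad = toy.loc[toy["LandingPad"].isnull(), "Outcome"]

print(missing_share)
print(no_pad.tolist())
```

Running the same two steps on `df` would show which landing outcomes the 26 rows without a pad belong to.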

### 3.2 Basic data exploration
We start by answering a few simple questions about the dataset to get a better understanding of launch frequencies and outcomes.

#### Launch activity per site
The data includes several SpaceX launch facilities:
- Cape Canaveral Space Launch Complex 40 (CCAFS SLC-40)
- Vandenberg Air Force Base Space Launch Complex 4E (VAFB SLC-4E)
- Kennedy Space Center Launch Complex 39A (KSC LC-39A)

To get an overview of the dataset, we first check how many missions were conducted at each launch site.

In [15]:
df['LaunchSite'].value_counts()

LaunchSite
CCAFS SLC 40    55
KSC LC 39A      22
VAFB SLC 4E     13
Name: count, dtype: int64
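
Raw counts are easier to compare as shares. The sketch below rebuilds the counts from the output above as a standalone `Series` (so it runs without the CSV file) and converts them to percentages; `site_counts` and `site_share` are illustrative names:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
site_counts = pd.Series({"CCAFS SLC 40": 55, "KSC LC 39A": 22, "VAFB SLC 4E": 13})

# Convert counts to percentages of all launches, rounded to one decimal.
site_share = (site_counts / site_counts.sum() * 100).round(1)

print(site_share)
```

On the real frame, `df['LaunchSite'].value_counts(normalize=True)` should give the same proportions as fractions between 0 and 1.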

#### Occurrence of each orbit
Each launch targets a specific orbit. Let's determine how often each one occurs in the dataset.

In [16]:
df['Orbit'].value_counts()

Orbit
GTO      27
ISS      21
VLEO     14
PO        9
LEO       7
SSO       5
MEO       3
ES-L1     1
HEO       1
SO        1
GEO       1
Name: count, dtype: int64

There are many types of orbits, but here I’ll briefly explain the most common ones found in the dataset:
- GTO (Geosynchronous Transfer Orbit): A highly elliptical orbit used to move satellites into geostationary orbit.
- ISS (International Space Station): A habitable modular space station in low Earth orbit operated by multiple space agencies.
- VLEO (Very Low Earth Orbit): Orbits below 450 km, closer to Earth, good for detailed observation.
- PO (Polar Orbit): Orbit passing near Earth’s poles, allowing satellites to cover the entire planet.
- LEO (Low Earth Orbit): Orbit up to 2,000 km above Earth. Most satellites and the ISS orbit here.
- SSO (Sun-Synchronous Orbit): A near-polar orbit that passes over the same Earth spots at the same local solar time daily, ideal for consistent lighting conditions in imaging.
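
If readable labels are wanted in tables or plots, the abbreviations above can be mapped to their full names. This is a small sketch, not part of the original notebook; `ORBIT_NAMES` and `orbit_labels` are hypothetical names, and codes that are not explained above are left unchanged:

```python
import pandas as pd

# Full names for the six orbit codes described above.
ORBIT_NAMES = {
    "GTO": "Geosynchronous Transfer Orbit",
    "ISS": "International Space Station",
    "VLEO": "Very Low Earth Orbit",
    "PO": "Polar Orbit",
    "LEO": "Low Earth Orbit",
    "SSO": "Sun-Synchronous Orbit",
}

# Small example series; "HEO" is deliberately absent from the dictionary.
orbits = pd.Series(["GTO", "ISS", "VLEO", "HEO"], name="Orbit")

# map() yields NaN for unknown codes, so fall back to the original code there.
orbit_labels = orbits.map(ORBIT_NAMES).fillna(orbits)

print(orbit_labels.tolist())
```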

#### Landing outcomes

Each landing attempt is labeled by location and success:
- RTLS (Return to Launch Site) – landing on a ground pad near the launch site
- ASDS (Autonomous Spaceport Drone Ship) – landing on a drone ship in the ocean
- Ocean – soft landing in the ocean without a pad

Each can be marked as **True** (successful), **False** (failed), or **None** (no landing attempt).

Here is how often each of them appears in the dataset.

In [21]:
landing_outcomes = df['Outcome'].value_counts()
landing_outcomes

Outcome
True ASDS      41
None None      19
True RTLS      14
False ASDS      6
True Ocean      5
False Ocean     2
None ASDS       2
False RTLS      1
Name: count, dtype: int64
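
Each label combines two pieces of information separated by a space: the result and the landing type. A minimal sketch of separating them, run on a few example labels rather than the full column (the column names `Landed` and `LandingType` are hypothetical):

```python
import pandas as pd

# A few labels taken from the value_counts() output above.
outcomes = pd.Series(
    ["True ASDS", "None None", "False Ocean", "True RTLS"], name="Outcome"
)

# Split each label on the space into two separate columns.
parts = outcomes.str.split(" ", expand=True)
parts.columns = ["Landed", "LandingType"]  # hypothetical column names

print(parts)
```

Note that both resulting columns still contain strings such as `"True"` and `"None"`, not booleans.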

The `Outcome` column holds the full landing information, but for a cleaner analysis we need a new column with binary values only: success or failure.

To build it, we first list the outcome labels with their positions and then collect into a set those where the first stage did not land successfully.

In [23]:
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)

0 True ASDS
1 None None
2 True RTLS
3 False ASDS
4 True Ocean
5 False Ocean
6 None ASDS
7 False RTLS


In [22]:
# Positions 1, 3, 5, 6 and 7 in the listing above are the failed or unattempted landings
bad_outcomes = set(landing_outcomes.keys()[[1, 3, 5, 6, 7]])
bad_outcomes

{'False ASDS', 'False Ocean', 'False RTLS', 'None ASDS', 'None None'}
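
Selecting by position depends on the order in which `value_counts()` returns the labels, which would change if the counts changed. As an alternative sketch (not the approach used in this notebook), the same set can be derived from the label text itself, assuming every successful landing starts with `True`; `bad_outcomes_by_prefix` is an illustrative name:

```python
import pandas as pd

# Outcome labels copied from the value_counts() output above.
labels = pd.Index([
    "True ASDS", "None None", "True RTLS", "False ASDS",
    "True Ocean", "False Ocean", "None ASDS", "False RTLS",
])

# Any label that does not start with "True" is a failed or unattempted landing.
bad_outcomes_by_prefix = {label for label in labels if not label.startswith("True")}

print(sorted(bad_outcomes_by_prefix))
```

For the eight labels in this dataset, the result matches the set built from positions.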

Based on the `Outcome` column, we build a list of labels: 0 if the outcome is in `bad_outcomes`, 1 otherwise. The list is stored in a new `Class` column.

In [24]:
landing_class = [0 if x in bad_outcomes else 1 for x in df['Outcome']]
df['Class'] = landing_class
df.head()

|   | FlightNumber | Date | BoosterVersion | PayloadMass | Orbit | LaunchSite | Outcome | Flights | GridFins | Reused | Legs | LandingPad | Block | ReusedCount | Serial | Longitude | Latitude | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2010-06-04 | Falcon 9 | 6104.959412 | LEO | CCAFS SLC 40 | None None | 1 | False | False | False | NaN | 1.0 | 0 | B0003 | -80.577366 | 28.561857 | 0 |
| 1 | 2 | 2012-05-22 | Falcon 9 | 525.0 | LEO | CCAFS SLC 40 | None None | 1 | False | False | False | NaN | 1.0 | 0 | B0005 | -80.577366 | 28.561857 | 0 |
| 2 | 3 | 2013-03-01 | Falcon 9 | 677.0 | ISS | CCAFS SLC 40 | None None | 1 | False | False | False | NaN | 1.0 | 0 | B0007 | -80.577366 | 28.561857 | 0 |
| 3 | 4 | 2013-09-29 | Falcon 9 | 500.0 | PO | VAFB SLC 4E | False Ocean | 1 | False | False | False | NaN | 1.0 | 0 | B1003 | -120.610829 | 34.632093 | 0 |
| 4 | 5 | 2013-12-03 | Falcon 9 | 3170.0 | GTO | CCAFS SLC 40 | None None | 1 | False | False | False | NaN | 1.0 | 0 | B1004 | -80.577366 | 28.561857 | 0 |
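
Before modeling, it helps to know how balanced the two classes are. The sketch below works from the outcome counts reported earlier instead of the CSV file, so it runs on its own; the variable names are illustrative:

```python
import pandas as pd

# Outcome counts copied from the value_counts() output earlier in this section.
outcome_counts = pd.Series({
    "True ASDS": 41, "None None": 19, "True RTLS": 14, "False ASDS": 6,
    "True Ocean": 5, "False Ocean": 2, "None ASDS": 2, "False RTLS": 1,
})

# Class 1 corresponds to labels starting with "True"; all others are class 0.
is_success = outcome_counts.index.str.startswith("True")
n_success = int(outcome_counts[is_success].sum())
n_total = int(outcome_counts.sum())
success_rate = n_success / n_total

print(f"{n_success} of {n_total} launches landed successfully ({success_rate:.1%})")
```

With these counts, 60 of the 90 launches fall into class 1, so about two thirds of the rows are successes. On the real frame, `df['Class'].mean()` should report the same rate.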


## 4. Exploratory Data Analysis