# SpaceX Falcon 9 Data Wrangling and Preparation

## 1. Introduction

This notebook covers data wrangling and feature engineering for the SpaceX Falcon 9 first-stage landing dataset. It builds on the raw dataset collected in `SpaceX Falcon 9 Data Collection and Enrichment` and cleans and prepares that data for modeling.


We will:
- Load and explore the dataset.
- Identify missing values and data types.
- Engineer features for modeling, including landing success, weather categories, and payload mass categories.

**Objective:** Prepare a clean dataset for supervised learning, where the target variable is the success of the first stage landing (`Class`: 1 = success, 0 = failure).

**Note:** Explanations of feature choices (e.g., weather thresholds, orbit types, payload categories) are included in the relevant sections.


## 2. Library Imports

Import the necessary Python libraries for data manipulation and numerical operations.


In [1]:
import pandas as pd
import numpy as np

## 3. Load Dataset

We load `dataset_part_1.csv`, which contains the enriched data collected previously, including launch, payload, and weather information. We take a quick look at its structure and inspect the data types.



In [2]:
df = pd.read_csv("dataset_part_1.csv")
df.head()

,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude,TemperatureAvg,WindSpeed
0,1,2010-06-04,Falcon 9,6123.547647,LEO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0003,-80.577366,28.561857,25.7,9.7
1,2,2012-05-22,Falcon 9,525.0,LEO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0005,-80.577366,28.561857,24.7,13.5
2,3,2013-03-01,Falcon 9,677.0,ISS,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0007,-80.577366,28.561857,14.8,15.3
3,4,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1,False,False,False,,1.0,0,B1003,-120.610829,34.632093,15.1,10.8
4,5,2013-12-03,Falcon 9,3170.0,GTO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B1004,-80.577366,28.561857,18.1,5.7


In [3]:
df.dtypes

Column,dtype
FlightNumber,int64
Date,object
BoosterVersion,object
PayloadMass,float64
Orbit,object
LaunchSite,object
Outcome,object
Flights,int64
GridFins,bool
Reused,bool
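
The `Date` column is read as `object` (plain strings), so date arithmetic and sorting by time are not available until it is parsed. A minimal sketch of that conversion follows, assuming the ISO `YYYY-MM-DD` format seen in the preview rows. The `sample` frame and the `Year` column are hypothetical illustrations, not part of the original notebook.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the first preview rows.
sample = pd.DataFrame({
    "FlightNumber": [1, 2],
    "Date": ["2010-06-04", "2012-05-22"],
})

# Parse ISO-formatted strings into datetime64 values.
# errors="coerce" turns unparseable entries into NaT instead of raising.
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d", errors="coerce")

# Datetime accessors are now available, e.g. the launch year
# (an illustrative derived column, not used elsewhere in this notebook).
sample["Year"] = sample["Date"].dt.year

print(sample.dtypes)
```

The same call could be applied to `df["Date"]` directly, or `parse_dates=["Date"]` could be passed to `pd.read_csv`.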


## 4. Data Cleaning & Missing Values

Identify missing values in the dataset and calculate their percentage. This helps decide if imputation or removal is needed.


In [4]:
missing = df.isnull().mean() * 100
missing[missing > 0].sort_values(ascending=False)

Column,Missing (%)
LandingPad,28.888889
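
`LandingPad` is the only column with missing values. If the dataset has 90 rows, as the category counts later in this notebook suggest, 28.89% corresponds to 26 launches. The sketch below shows one way to report counts alongside percentages and to keep the missingness as an explicit flag rather than imputing an identifier. The `sample` frame, its pad names, and the `HasLandingPad` column are hypothetical; the notebook itself exports `LandingPad` unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: LandingPad is missing for two of four rows.
sample = pd.DataFrame({
    "FlightNumber": [1, 2, 3, 4],
    "LandingPad": [np.nan, "pad_a", np.nan, "pad_b"],
})

# Report absolute counts next to percentages for every affected column.
is_missing = sample.isnull()
summary = pd.DataFrame({
    "missing_count": is_missing.sum(),
    "missing_pct": is_missing.mean() * 100,
})
summary = summary[summary["missing_count"] > 0]
print(summary)

# Keep the information as a boolean flag instead of imputing a pad ID
# (illustrative column name, not present in the original dataset).
sample["HasLandingPad"] = sample["LandingPad"].notna()
```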


## 5. Exploratory Feature Analysis

Examine the key features in the dataset:

### Launch Site
The `LaunchSite` feature indicates the SpaceX launch facility for each mission. Different sites may have varying infrastructure and environmental conditions, which could affect landing success.

### Orbit
The `Orbit` feature describes the target orbital type for each launch. Orbit type affects mission complexity and landing difficulty. Examples:
- **LEO (Low Earth Orbit):** Relatively low altitude, simpler re-entry profile.
- **VLEO (Very Low Earth Orbit):** Very low altitude, more atmospheric drag.
- **GTO (Geostationary Transfer Orbit):** Higher altitude, requires additional maneuvers to reach final orbit.
- **SSO (Sun-Synchronous Orbit):** Near-polar orbit that passes over a given location at the same local solar time, often used for Earth observation satellites.
- **Other values** in this dataset include ISS, PO, MEO, HEO, ES-L1, and SO (see the counts below).

Launch site and orbit type may affect the difficulty of landing. For example, launches to higher or more distant orbits require more complex trajectories and fuel usage, which could influence first-stage recovery success.

In [5]:
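# Only the last expression in a cell is displayed, so the LaunchSite counts
# are computed here but not shown; wrap both calls in print() to see both.
df['LaunchSite'].value_counts()
df['Orbit'].value_counts()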

Orbit,count
GTO,27
ISS,21
VLEO,14
PO,9
LEO,7
SSO,5
MEO,3
HEO,1
ES-L1,1
SO,1
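
Several orbit types occur only once or a few times, which is easier to judge when counts are shown with their share of all launches. The helper below is a small sketch of such a summary for any categorical column. The function name `summarize_category` and the four-row `sample` are hypothetical.

```python
import pandas as pd


def summarize_category(series: pd.Series) -> pd.DataFrame:
    """Return counts and percentage shares for each value of a column."""
    counts = series.value_counts()
    shares = series.value_counts(normalize=True).mul(100).round(1)
    # Both Series are indexed by category value, so they align row by row.
    return pd.DataFrame({"count": counts, "share_pct": shares})


# Hypothetical sample using site and orbit names from the preview rows.
sample = pd.DataFrame({
    "LaunchSite": ["CCSFS SLC 40", "CCSFS SLC 40", "VAFB SLC 4E", "CCSFS SLC 40"],
    "Orbit": ["LEO", "GTO", "PO", "GTO"],
})

# Printing inside the loop shows every summary, not only the last one.
for column in ["LaunchSite", "Orbit"]:
    print(summarize_category(sample[column]))
```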


## 6. Feature Engineering

### 6.1 Landing Success Label (`Class`)

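The raw `Outcome` column combines the landing result with the landing type (e.g., `True ASDS`, `False Ocean`, `None None`), where ASDS is a drone ship, RTLS is a return to the launch site, and Ocean is a water landing. For modeling, we create a binary classification variable: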

- **1 = successful landing** (`True ASDS`, `True RTLS`, `True Ocean`)
- **0 = unsuccessful landing** (`None None`, `False ASDS`, `False RTLS`, `False Ocean`, `None ASDS`)

Simplifying the outcome to a binary variable allows us to train supervised classification models to predict landing success.

In [6]:
bad_outcomes = ['None None', 'False ASDS', 'False Ocean', 'None ASDS', 'False RTLS']
df['Class'] = (~df['Outcome'].isin(bad_outcomes)).astype(int)
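
Because the label is defined as "not in `bad_outcomes`", any value missing from that list, including a missing `Outcome` or a misspelled one, is counted as a success. The sketch below shows a stricter variant that lists the successful outcomes explicitly and fails loudly on anything unmapped. The function name `label_outcomes` and the `sample` series are hypothetical; the outcome strings are the ones listed above.

```python
import pandas as pd

GOOD_OUTCOMES = {"True ASDS", "True RTLS", "True Ocean"}
BAD_OUTCOMES = {"None None", "False ASDS", "False RTLS", "False Ocean", "None ASDS"}


def label_outcomes(outcome: pd.Series) -> pd.Series:
    """Map Outcome strings to 1 (landed) or 0, rejecting unknown values."""
    known = GOOD_OUTCOMES | BAD_OUTCOMES
    unexpected = set(outcome.dropna().unique()) - known
    n_missing = int(outcome.isna().sum())
    if unexpected or n_missing:
        raise ValueError(
            f"Unmapped Outcome values: {sorted(unexpected)}, missing: {n_missing}"
        )
    return outcome.isin(GOOD_OUTCOMES).astype(int)


# Hypothetical sample covering both classes.
sample = pd.Series(["None None", "False Ocean", "True ASDS", "True RTLS"], name="Outcome")
labels = label_outcomes(sample)
print(labels.tolist())
```

On data that contains only the eight listed outcomes, this produces the same labels as the cell above.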

### 6.2 Weather Features

Weather conditions such as temperature and wind speed can affect rocket launch and landing performance. We categorize these continuous variables to simplify analysis and help predictive modeling:

- **Temperature (`TemperatureAvg`):**
    - Cold: < 10°C
    - Moderate: 10°C to < 25°C
    - Hot: ≥ 25°C
- **Wind Speed (`WindSpeed`):**
    - Low: < 10 km/h
    - Moderate: 10 km/h to < 20 km/h
    - High: ≥ 20 km/h

Thresholds are based on typical operational limits for rocket launches and landings.

In [7]:
df['TempCategory'] = df['TemperatureAvg'].apply(lambda t: 'Cold' if t < 10 else 'Moderate' if t < 25 else 'Hot')
df['WindCategory'] = df['WindSpeed'].apply(lambda w: 'Low' if w < 10 else 'Moderate' if w < 20 else 'High')
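
The lambda-based binning works here because neither weather column has missing values. On data with gaps, every comparison against `NaN` is false, so a missing reading would fall through to `'Hot'` or `'High'`. As a sketch of an alternative, `pd.cut` with `right=False` reproduces the same half-open intervals and leaves missing readings as `NaN`. The `sample` values are hypothetical boundary cases.

```python
import numpy as np
import pandas as pd

# Hypothetical readings placed on and around the thresholds, plus one gap.
sample = pd.DataFrame({
    "TemperatureAvg": [9.9, 10.0, 24.9, 25.0, np.nan],
    "WindSpeed": [9.7, 10.0, 19.9, 20.0, np.nan],
})

# right=False makes each bin closed on the left: [-inf, 10), [10, 25), [25, inf).
sample["TempCategory"] = pd.cut(
    sample["TemperatureAvg"],
    bins=[-np.inf, 10, 25, np.inf],
    labels=["Cold", "Moderate", "Hot"],
    right=False,
)

# Same pattern for wind: [-inf, 10), [10, 20), [20, inf).
sample["WindCategory"] = pd.cut(
    sample["WindSpeed"],
    bins=[-np.inf, 10, 20, np.inf],
    labels=["Low", "Moderate", "High"],
    right=False,
)

print(sample)
```

The result is an ordered categorical column, which also keeps the labels in a meaningful order when tabulated.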

### 6.3 Payload Mass Categories

Payload mass can influence the rocket's fuel consumption and landing difficulty. We divide `PayloadMass` into three categories using quantiles:

- Light: lower third
- Medium: middle third
- Heavy: upper third

This helps assess whether payload weight correlates with landing success.

In [8]:
df['PayloadCategory'] = pd.qcut(df['PayloadMass'], q=3, labels=['Light','Medium','Heavy'], duplicates='drop')
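
Quantile bins depend on the data, so it helps to see where the cut points actually fall. The sketch below uses `retbins=True` to return the bin edges together with the categories. The `payload` values are a hypothetical sample loosely based on the preview rows. One caveat about the cell above: if repeated payload masses made two quantile edges coincide, `duplicates='drop'` would leave fewer bins than the three labels supplied, and pandas would be expected to raise a `ValueError`.

```python
import pandas as pd

# Hypothetical payload masses in kg.
payload = pd.Series([500.0, 525.0, 677.0, 3170.0, 4000.0, 6123.5], name="PayloadMass")

# retbins=True additionally returns the array of bin edges (4 edges for 3 bins).
categories, edges = pd.qcut(
    payload,
    q=3,
    labels=["Light", "Medium", "Heavy"],
    retbins=True,
)

print(edges)                      # data-driven cut points
print(categories.value_counts())  # launches per payload category
```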

### 6.4 Summary Statistics

The overall first-stage landing success rate is approximately 67%.  

We also summarize the distributions of our engineered features (`TempCategory`, `WindCategory`, `PayloadCategory`) to check for balance and coverage.

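Note that only 2 of the 90 launches fall in the "Cold" temperature category and only 5 in the "High" wind category. Success rates estimated for such small groups are unreliable, so these categories should be treated with care when modeling.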

In [9]:
success_rate = df['Class'].mean()
print(f"Landing success rate: {success_rate:.2%}")

print(df['TempCategory'].value_counts())
print(df['WindCategory'].value_counts())
print(df['PayloadCategory'].value_counts())

Landing success rate: 66.67%
TempCategory
Moderate    54
Hot         34
Cold         2
Name: count, dtype: int64
WindCategory
Moderate    53
Low         32
High         5
Name: count, dtype: int64
PayloadCategory
Medium    32
Light     30
Heavy     28
Name: count, dtype: int64
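
To relate the engineered categories to the target, the success rate can be computed per category together with the number of launches behind it. The sketch below uses a hypothetical six-row `sample`; its numbers are illustrative only and are not results from the dataset.

```python
import pandas as pd

# Hypothetical sample: wind category and landing label per launch.
sample = pd.DataFrame({
    "WindCategory": ["Low", "Low", "Moderate", "Moderate", "Moderate", "High"],
    "Class": [1, 0, 1, 1, 0, 0],
})

# Named aggregation: group size and mean of the 0/1 label (the success rate).
by_wind = (
    sample.groupby("WindCategory")["Class"]
    .agg(launches="count", success_rate="mean")
    .sort_values("launches", ascending=False)
)

print(by_wind)
```

Reading `launches` next to `success_rate` makes it obvious when a rate rests on only a handful of observations.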


## 7. Export Prepared Dataset

The cleaned and feature-engineered dataset is exported as `dataset_part_2.csv` for use in subsequent notebooks.

In [11]:
df.to_csv("dataset_part_2.csv", index=False)
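
CSV stores every column as plain text, so the ordered categorical produced by `pd.qcut` comes back as ordinary strings when the file is read again. The sketch below illustrates the round trip with an in-memory buffer and restores the ordering after loading. The `sample` frame is hypothetical, and the buffer stands in for `dataset_part_2.csv`.

```python
import io

import pandas as pd

# Hypothetical sample with an ordered categorical column.
order = pd.CategoricalDtype(["Light", "Medium", "Heavy"], ordered=True)
sample = pd.DataFrame({
    "Class": [1, 0, 1],
    "PayloadCategory": pd.Series(["Light", "Heavy", "Medium"]).astype(order),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# After reloading, the category information is gone: values are plain strings.
reloaded = pd.read_csv(buffer)
print(reloaded.dtypes)

# Restore the Light < Medium < Heavy ordering explicitly.
reloaded["PayloadCategory"] = reloaded["PayloadCategory"].astype(order)
```

Using `index=False`, as in the export cell above, also avoids writing the row index as an extra unnamed column.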

## 8. Conclusion

This notebook focused on **data wrangling and feature engineering** to prepare the Falcon 9 dataset for predictive modeling:  

- **Data cleaning**: Checked and quantified missing values.  
- **Outcome simplification**: Engineered the binary `Class` target variable (1 = successful landing, 0 = failure).  
- **Feature engineering**:  
  - Categorized weather variables (`TempCategory`, `WindCategory`) based on operational thresholds.  
  - Derived `PayloadCategory` using quantiles.  
- **Exploratory analysis**: Assessed the balance of the engineered features and summarized the overall landing success rate.  
- **Export**: Produced `dataset_part_2.csv` as a clean, feature-rich dataset ready for SQL analysis and machine learning models.  

✅ **Skills demonstrated**: data wrangling with pandas, handling missing values, feature engineering, categorical binning, and dataset preparation for supervised learning.  

This notebook builds directly on the **Data Collection and Enrichment** notebook and provides the foundation for the **SQL EDA** and **predictive modeling** notebooks in this portfolio.
