# SpaceX Falcon 9 First Stage Landing Prediction

## Module 1: Data Collection and Preparation

### Project Overview
In this project, we predict whether the Falcon 9 first stage will land successfully. SpaceX advertises Falcon 9 launches on its website at 62 million dollars each, while other providers charge upward of 165 million dollars. Much of the difference comes from SpaceX being able to reuse the first stage.

### Module Objective
This notebook focuses on collecting and preparing SpaceX launch data for analysis. We will:
1. Attempt to collect data from the SpaceX API
2. Process the backup dataset provided by IBM
3. Clean and transform the data for future modeling stages

### Data Challenges
The dataset is stored in an unusual format with stringified Python lists that need careful parsing and cleaning.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import requests
import json
import re
import ast
import matplotlib.pyplot as plt
import seaborn as sns
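try:
    from google.colab import files  # only available inside Google Colab
except ImportError:
    files = None  # running outside Colab; upload/download helpers are skipped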
import warnings
warnings.filterwarnings('ignore')

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.width', 1000)
pd.set_option('display.precision', 3)

print("🚀 Libraries successfully imported and environment configured.")

🚀 Libraries successfully imported and environment configured.


## 1. Data Collection

### 1.1 Accessing SpaceX API
SpaceX provides a public API with information about all launches. We'll attempt to fetch this data first, though the API may have access limitations.
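
A minimal sketch of that first attempt, using only the standard library. The endpoint URL, the `fetch_launches` helper, and its `opener` parameter are assumptions for illustration and are not part of the original notebook. The function returns `None` on any network or decoding failure so the caller can fall back to the backup file.

```python
import json
import urllib.request

# Assumed endpoint; the public API may be rate-limited or unavailable.
SPACEX_URL = "https://api.spacexdata.com/v4/launches/past"

def fetch_launches(url=SPACEX_URL, opener=urllib.request.urlopen, timeout=10):
    """Return a list of launch records, or None if the request fails."""
    try:
        with opener(url, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError) as exc:
        # OSError covers URLError and timeouts; ValueError covers bad JSON.
        print(f"API request failed: {exc}")
        return None
    # The endpoint is expected to return a JSON array of launch objects.
    return payload if isinstance(payload, list) else None
```

Passing the opener as a parameter keeps the function testable without network access. In the notebook it would be called as `fetch_launches()` and followed by the backup load when the result is `None`.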

### 1.2 Backup Dataset
IBM has provided a backup dataset (`dataset_part_1.csv`) as a workaround for live API issues. It holds every launch in a single row, with each column stored as one stringified list, so it requires significant cleaning before use.

In [3]:
# Function to load the dataset with basic validation
def load_backup_dataset(file_path='dataset_part_1.csv'):
    """
    Load the IBM-provided backup dataset.

    Args:
        file_path (str): Path to the CSV file

    Returns:
        pd.DataFrame: Raw dataset if loading is successful
    """
    try:
        df = pd.read_csv(file_path)
        print(f"✅ Dataset loaded successfully with shape: {df.shape}")
        return df
    except FileNotFoundError:
        print(f"❌ File {file_path} not found. Please upload the file.")
        return None
    except Exception as e:
        print(f"❌ Error loading file: {e}")
        return None

# First try to load the file directly
df_cleaned_raw = None  # stays None if neither the file nor an upload is available
try:
    df_cleaned_raw = pd.read_csv('dataset_part_1.csv')
    print(f"✅ Static dataset loaded successfully with shape: {df_cleaned_raw.shape}")
except FileNotFoundError:
    # If file not found, prompt for upload
    print("❌ Static dataset not found. Please upload the dataset file:")
    from google.colab import files
    uploaded = files.upload()

    # Try to load the uploaded file
    if uploaded:
        file_name = next(iter(uploaded))
        df_cleaned_raw = pd.read_csv(file_name)
        print(f"✅ Uploaded dataset loaded successfully with shape: {df_cleaned_raw.shape}")

# Display basic information about the dataset
if df_cleaned_raw is not None:
    print("\n📊 Dataset Overview:")
    print(f"Rows: {df_cleaned_raw.shape[0]}, Columns: {df_cleaned_raw.shape[1]}")
    print("\n📋 Column names:")
    for col in df_cleaned_raw.columns:
        print(f"- {col}")

    # Display the first row to understand the structure
    print("\n🔍 Example of first row data structure:")
    for col in df_cleaned_raw.columns[:5]:  # Show just first 5 columns for brevity
        print(f"{col}: {type(df_cleaned_raw[col][0])} - {str(df_cleaned_raw[col][0])[:100]}...")
else:
    print("❌ Failed to load the dataset. Please check file availability.")

✅ Static dataset loaded successfully with shape: (1, 17)

📊 Dataset Overview:
Rows: 1, Columns: 17

📋 Column names:
- FlightNumber
- Date
- BoosterVersion
- PayloadMass
- Orbit
- LaunchSite
- Outcome
- Flights
- GridFins
- Reused
- Legs
- LandingPad
- Block
- ReusedCount
- Serial
- Longitude
- Latitude

🔍 Example of first row data structure:
FlightNumber: <class 'numpy.int64'> - 1...
Date: <class 'str'> - [datetime.date(2006, 3, 24), datetime.date(2007, 3, 21), datetime.date(2008, 9, 28), datetime.date(2...
BoosterVersion: <class 'str'> - ['Falcon 1', 'Falcon 1', 'Falcon 1', 'Falcon 1', 'Falcon 9', 'Falcon 9', 'Falcon 9', 'Falcon 9', 'Fa...
PayloadMass: <class 'str'> - [[   20.        ]
 [ 5919.16534091]
 [  165.        ]
 [  200.        ]
 [ 5919.16534091]
 [  525.  ...
Orbit: <class 'str'> - ['LEO', 'LEO', 'LEO', 'LEO', 'LEO', 'LEO', 'ISS', 'PO', 'GTO', 'GTO', 'ISS', 'LEO', 'GTO', 'GTO', 'I...


## 2. Data Cleaning Approach

### Cleaning Challenges
The dataset presents several challenges:
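- Most columns hold a single string that represents a Python list covering all launches
- `PayloadMass` is stored as the printed form of a NumPy array and `Date` as a list of `datetime.date(...)` calls, so neither can be parsed with `ast.literal_eval`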
- Different columns have different data types within these lists (strings, booleans, numbers, None values)
- Some columns need to be split into multiple columns for analysis
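
The difference between the two storage formats can be sketched as follows. `try_literal_eval` and `extract_floats` are hypothetical helpers, and the regular expression is an assumption that differs slightly from the one used later in the notebook (it also accepts exponents, which NumPy may print for very large or small values).

```python
import ast
import re

list_repr = "['LEO', 'ISS', None, True, 3]"
array_repr = "[[   20.        ]\n [ 5919.16534091]\n [  165.        ]]"

def try_literal_eval(text):
    """Return the parsed literal, or None when the text is not a valid literal."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None

def extract_floats(text):
    """Pull every decimal number out of free-form text."""
    pattern = r"[-+]?\d+\.?\d*(?:[eE][-+]?\d+)?"
    return [float(token) for token in re.findall(pattern, text)]

parsed_list = try_literal_eval(list_repr)    # a real Python list
parsed_array = try_literal_eval(array_repr)  # None: rows lack separating commas
payloads = extract_floats(array_repr)        # regex fallback recovers the numbers
```

The array text fails because the rows are separated by whitespace rather than commas, which is why the payload column is handled with pattern matching instead of `ast`.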

### Cleaning Strategy
We'll implement a systematic approach:
1. Create helper functions for common cleaning tasks
2. Process each column individually based on its content
3. Validate the cleaned data before combining into the final dataset

In [4]:
# Helper function to safely parse Python list strings
def parse_python_list_str(raw_str):
    """
    Safely parse a string representation of a Python list.

    Args:
        raw_str (str): String representation of a Python list

    Returns:
        list: The parsed list or None if parsing fails
    """
    if not isinstance(raw_str, str):
        return None

    try:
        return ast.literal_eval(raw_str.strip())
    except Exception as e:
        print(f"❌ Error parsing string: {e}")
        return None

# Helper function to clean numeric columns
def clean_numeric_column(raw_column):
    """
    Convert a list of mixed values to numeric, replacing non-numeric with NaN.

    Args:
        raw_column (list): List of values to clean

    Returns:
        list: List with non-numeric values replaced by np.nan
    """
    return [float(x) if isinstance(x, (float, int)) else np.nan for x in raw_column]

# Function to preview column content
def preview_column(df, column_name, max_chars=300):
    """
    Preview a column's content and data type.

    Args:
        df (pd.DataFrame): DataFrame containing the column
        column_name (str): Name of the column to preview
        max_chars (int): Maximum characters to display

    Returns:
        str: The raw value of the first row for that column
    """
    raw_value = df[column_name][0]
    print(f"🔍 Column: {column_name}")
    print(f"🔍 Data type: {type(raw_value)}")
    print(f"🔍 First {max_chars} characters:\n{str(raw_value)[:max_chars]}")

    return raw_value

print("✅ Utility functions defined for data cleaning")

✅ Utility functions defined for data cleaning


## 3. Column-by-Column Data Cleaning

We'll now process each column individually using our utility functions. For each column, we'll:
1. Inspect the raw data structure
2. Parse the stringified list
3. Apply appropriate transformations based on the data type
4. Validate the cleaned data

This approach allows us to handle the specific requirements of each column and ensures high-quality data for our analysis.
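
The four steps can be sketched as one function. `clean_column` and its parameters are hypothetical names used only for illustration; the notebook itself performs these steps inline in each cell.

```python
import ast

def clean_column(raw_text, transform=lambda value: value, expected_len=None):
    """Parse a stringified list, transform each element, and validate the length."""
    values = ast.literal_eval(raw_text.strip())            # step 2: parse
    if not isinstance(values, list):
        raise TypeError(f"expected a list, got {type(values).__name__}")
    cleaned = [transform(value) for value in values]        # step 3: transform
    if expected_len is not None and len(cleaned) != expected_len:
        # step 4: validate against the known number of launches
        raise ValueError(f"expected {expected_len} values, got {len(cleaned)}")
    return cleaned

# Example: turn missing block numbers into NaN and the rest into floats.
blocks = clean_column(
    "[None, 1, 2]",
    transform=lambda b: float("nan") if b is None else float(b),
    expected_len=3,
)
```

Step 1 (inspecting the raw text) stays manual, since it decides which `transform` is appropriate for the column.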

In [5]:
# 3.1 PayloadMass Column Cleaning
print("\n🧹 Cleaning PayloadMass Column...")
safe_column = 'PayloadMass'
raw_value = preview_column(df_cleaned_raw, safe_column)

# Extract all numbers using regex for PayloadMass
print("\nExtracting numeric values using regex pattern matching...")
numbers = re.findall(r"[-+]?\d*\.\d+|\d+", raw_value)
flat_payloads = [float(num) for num in numbers]

# Preview the extracted payload mass values
print("\n✅ Flattened PayloadMass values (first 5):", flat_payloads[:5])
print("🧪 Type of first value:", type(flat_payloads[0]))
print(f"📊 Total extracted values: {len(flat_payloads)}")

# Basic statistics for payload mass
payloads_array = np.array(flat_payloads)
print(f"\n📊 Payload Mass Statistics:")
print(f"Min: {payloads_array.min():.2f} kg")
print(f"Max: {payloads_array.max():.2f} kg")
print(f"Mean: {payloads_array.mean():.2f} kg")
print(f"Median: {np.median(payloads_array):.2f} kg")


🧹 Cleaning PayloadMass Column...
🔍 Column: PayloadMass
🔍 Data type: <class 'str'>
🔍 First 300 characters:
[[   20.        ]
 [ 5919.16534091]
 [  165.        ]
 [  200.        ]
 [ 5919.16534091]
 [  525.        ]
 [  677.        ]
 [  500.        ]
 [ 3170.        ]
 [ 3325.        ]
 [ 2296.        ]
 [ 1316.        ]
 [ 4535.        ]
 [ 4428.        ]
 [ 2216.        ]
 [ 2395.        ]
 [  570.    

Extracting numeric values using regex pattern matching...

✅ Flattened PayloadMass values (first 5): [20.0, 5919.16534091, 165.0, 200.0, 5919.16534091]
🧪 Type of first value: <class 'float'>
📊 Total extracted values: 94

📊 Payload Mass Statistics:
Min: 20.00 kg
Max: 15600.00 kg
Mean: 5919.17 kg
Median: 4648.00 kg


In [6]:
# 3.2 Orbit Column Cleaning
print("\n🧹 Cleaning Orbit Column...")
safe_column = 'Orbit'
raw_value = preview_column(df_cleaned_raw, safe_column)

# Parse the orbit list using the ast module
try:
    parsed_orbit = parse_python_list_str(raw_value)
    print("\n✅ Parsed Orbit values (first 5):", parsed_orbit[:5])
    print("🧪 Type of first value:", type(parsed_orbit[0]))

    # Count the frequency of each orbit type
    orbit_counts = {}
    for orbit in parsed_orbit:
        orbit_counts[orbit] = orbit_counts.get(orbit, 0) + 1

    print("\n📊 Orbit Type Distribution:")
    for orbit, count in sorted(orbit_counts.items(), key=lambda x: x[1], reverse=True):
        print(f"- {orbit}: {count} launches")

except Exception as e:
    print("\n❌ Error during orbit parsing:", e)


🧹 Cleaning Orbit Column...
🔍 Column: Orbit
🔍 Data type: <class 'str'>
🔍 First 300 characters:
['LEO', 'LEO', 'LEO', 'LEO', 'LEO', 'LEO', 'ISS', 'PO', 'GTO', 'GTO', 'ISS', 'LEO', 'GTO', 'GTO', 'ISS', 'ISS', 'ES-L1', 'ISS', 'GTO', 'ISS', 'LEO', 'PO', 'GTO', 'ISS', 'GTO', 'GTO', 'ISS', 'GTO', 'GTO', 'PO', 'ISS', 'GTO', 'GTO', 'LEO', 'GTO', 'ISS', 'GTO', 'PO', 'GTO', 'ISS', 'SSO', 'LEO', 'PO', '

✅ Parsed Orbit values (first 5): ['LEO', 'LEO', 'LEO', 'LEO', 'LEO']
🧪 Type of first value: <class 'str'>

📊 Orbit Type Distribution:
- GTO: 27 launches
- ISS: 21 launches
- VLEO: 14 launches
- LEO: 11 launches
- PO: 9 launches
- SSO: 5 launches
- MEO: 3 launches
- ES-L1: 1 launches
- HEO: 1 launches
- SO: 1 launches
- GEO: 1 launches


### Understanding Orbit Types

The orbit types in our dataset reflect the destination of the payload. Five of the eleven types found above are described here:

1. **LEO (Low Earth Orbit)**:
   - Altitude: 180-2,000 km
   - Used for: Earth observation, military, some communication satellites
   
2. **GTO (Geostationary Transfer Orbit)**:
   - Elliptical orbit used to reach GEO
   - Used for: Communication satellites being positioned to geostationary orbit
   
3. **ISS (International Space Station)**:
   - Specific LEO where the ISS is located
   - Altitude: ~400 km
   
4. **PO (Polar Orbit)**:
   - Passes over Earth's poles
   - Used for: Earth observation, weather satellites
   
5. **ES-L1 (Earth-Sun Lagrangian Point 1)**:
   - Special point between Earth and Sun
   - Used for: Solar observation satellites

The orbit type affects the mission's energy requirements and may influence landing success rates.
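
One way to examine that relationship is to compute the landing success rate per orbit. The sketch below uses toy rows rather than the notebook's real values, and `success_rate_by_group` is a hypothetical helper.

```python
from collections import defaultdict

# Toy rows for illustration only; not the notebook's real values.
orbits = ["LEO", "GTO", "GTO", "ISS", "ISS", "GTO"]
landed = [True, False, True, True, None, False]   # None = no recorded attempt

def success_rate_by_group(groups, outcomes):
    """Return {group: fraction of attempted landings that succeeded}."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for group, outcome in zip(groups, outcomes):
        if outcome is None:          # skip rows without a landing result
            continue
        attempts[group] += 1
        successes[group] += int(outcome)
    return {group: successes[group] / attempts[group] for group in attempts}

rates = success_rate_by_group(orbits, landed)
```

With only a handful of launches for some orbits, such rates are noisy and are better read alongside the launch counts.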

In [7]:
# 3.3 LaunchSite Column Cleaning
print("\n🧹 Cleaning LaunchSite Column...")
safe_column = 'LaunchSite'
raw_value = preview_column(df_cleaned_raw, safe_column)

# Parse the launch site list
try:
    parsed_site = parse_python_list_str(raw_value)
    print("\n✅ Parsed LaunchSite values (first 5):", parsed_site[:5])
    print("✅ Type of first value:", type(parsed_site[0]))

    # Count the frequency of each launch site
    site_counts = {}
    for site in parsed_site:
        site_counts[site] = site_counts.get(site, 0) + 1

    print("\n📊 Launch Site Distribution:")
    for site, count in sorted(site_counts.items(), key=lambda x: x[1], reverse=True):
        print(f"- {site}: {count} launches")

except Exception as e:
    print("\n❌ Error during LaunchSite parsing:", e)


🧹 Cleaning LaunchSite Column...
🔍 Column: LaunchSite
🔍 Data type: <class 'str'>
🔍 First 300 characters:
['Kwajalein Atoll', 'Kwajalein Atoll', 'Kwajalein Atoll', 'Kwajalein Atoll', 'CCSFS SLC 40', 'CCSFS SLC 40', 'CCSFS SLC 40', 'VAFB SLC 4E', 'CCSFS SLC 40', 'CCSFS SLC 40', 'CCSFS SLC 40', 'CCSFS SLC 40', 'CCSFS SLC 40', 'CCSFS SLC 40', 'CCSFS SLC 40', 'CCSFS SLC 40', 'CCSFS SLC 40', 'CCSFS SLC 40', 

✅ Parsed LaunchSite values (first 5): ['Kwajalein Atoll', 'Kwajalein Atoll', 'Kwajalein Atoll', 'Kwajalein Atoll', 'CCSFS SLC 40']
✅ Type of first value: <class 'str'>

📊 Launch Site Distribution:
- CCSFS SLC 40: 55 launches
- KSC LC 39A: 22 launches
- VAFB SLC 4E: 13 launches
- Kwajalein Atoll: 4 launches


### Understanding Launch Sites

The dataset covers four launch sites. Three are used for Falcon 9 missions, and Kwajalein Atoll hosted the earlier Falcon 1 launches:

1. **CCSFS SLC 40** (Cape Canaveral Space Force Station Space Launch Complex 40):
   - Location: Florida, East Coast
   - Used for: Most LEO, GTO, and ISS missions
   
2. **KSC LC 39A** (Kennedy Space Center Launch Complex 39A):
   - Historic NASA launch pad used for Apollo and Shuttle missions
   - Now leased by SpaceX for Falcon 9 and Falcon Heavy
   
3. **VAFB SLC 4E** (Vandenberg Air Force Base, since renamed Vandenberg Space Force Base, Space Launch Complex 4E):
   - Location: California, West Coast
   - Used for: Polar orbits and sun-synchronous orbits
   
4. **Kwajalein Atoll**:
   - Remote location in the Marshall Islands, Pacific Ocean
   - Used for: Early Falcon 1 launches

The launch site has significant implications for available landing options and may affect landing success probability.

In [8]:
# 3.4 LandingPad Column Cleaning
print("\n🧹 Cleaning LandingPad Column...")
safe_column = 'LandingPad'
raw_value = preview_column(df_cleaned_raw, safe_column)

# Parse landing pad data
try:
    parsed_pad = parse_python_list_str(raw_value)
    print("\n✅ Parsed LandingPad values (first 5):", parsed_pad[:5])
    print("📘 Type of first value:", type(parsed_pad[0]))

    # Count None entries
    none_count = sum(1 for x in parsed_pad if x is None)
    print(f"\n🧮 Number of None entries: {none_count} ({none_count/len(parsed_pad)*100:.1f}%)")

    # Count unique landing pads (excluding None)
    pad_values = [x for x in parsed_pad if x is not None]
    unique_pads = set(pad_values)
    print(f"🧮 Number of unique landing pads: {len(unique_pads)}")

    # Preview hash IDs (these are SpaceX's internal identifiers)
    hash_ids = [x for x in parsed_pad if isinstance(x, str)]
    print("\n🔑 Sample of landing pad IDs:", hash_ids[:5])
    print(f"🧮 Total landing pad IDs: {len(hash_ids)}")

except Exception as e:
    print("\n❌ Error during parsing:", e)


🧹 Cleaning LandingPad Column...
🔍 Column: LandingPad
🔍 Data type: <class 'str'>
🔍 First 300 characters:
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, '5e9e3032383ecb761634e7cb', None, '5e9e3032383ecb761634e7cb', None, '5e9e3032383ecb6bb234e7ca', '5e9e3032383ecb267a34e7c7', '5e9e3033383ecbb9e534e7cc', '5e9e3032383ecb6bb234e7ca', '5e9e3032383ecb6bb234e7ca', '

✅ Parsed LandingPad values (first 5): [None, None, None, None, None]
📘 Type of first value: <class 'NoneType'>

🧮 Number of None entries: 30 (31.9%)
🧮 Number of unique landing pads: 5

🔑 Sample of landing pad IDs: ['5e9e3032383ecb761634e7cb', '5e9e3032383ecb761634e7cb', '5e9e3032383ecb6bb234e7ca', '5e9e3032383ecb267a34e7c7', '5e9e3033383ecbb9e534e7cc']
🧮 Total landing pad IDs: 64


### Understanding Landing Pads

The landing pad values in our dataset are represented by SpaceX's internal identifiers. These cryptic strings (like `5e9e3032383ecb761634e7cb`) correspond to specific landing locations:

1. **ASDS (Autonomous Spaceport Drone Ship)**:
   - Floating landing platforms in the ocean
   - Examples: "Of Course I Still Love You" and "Just Read the Instructions"
   - Used for: High-velocity missions that can't return to launch site
   
2. **RTLS (Return to Launch Site)**:
   - Landing pads located at or near the launch site
   - Examples: Landing Zone 1 and Landing Zone 2 at Cape Canaveral
   - Used for: Lower-energy missions with sufficient fuel for boost-back
   
3. **None values**:
   - Early missions without landing attempts
   - Missions where landing was not planned
   - Failed missions where landing wasn't attempted

In later analysis, we'll need to map these IDs to actual landing pad types for more meaningful insights.
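
Without an official lookup table, one possible approach is to infer each pad's type from the landing types it co-occurs with. In the sketch below the pad IDs are copied from the preview above, but the landing types paired with them are made up for illustration; `pad_type_table` and `dominant_type` are hypothetical helpers.

```python
from collections import Counter, defaultdict

# Pad IDs come from the preview above; the paired landing types are invented.
pad_ids = [None, "5e9e3032383ecb761634e7cb", "5e9e3032383ecb761634e7cb",
           "5e9e3032383ecb6bb234e7ca", "5e9e3032383ecb267a34e7c7"]
landing_types = [None, "ASDS", "ASDS", "ASDS", "RTLS"]

def pad_type_table(ids, types):
    """Count which landing types co-occur with each pad ID."""
    table = defaultdict(Counter)
    for pad_id, landing_type in zip(ids, types):
        if pad_id is not None and landing_type is not None:
            table[pad_id][landing_type] += 1
    return table

def dominant_type(table):
    """Map each pad ID to its most frequent landing type."""
    return {pad_id: counts.most_common(1)[0][0] for pad_id, counts in table.items()}

pad_to_type = dominant_type(pad_type_table(pad_ids, landing_types))
```

A pad that shows more than one landing type in the table would signal a data problem worth inspecting before the mapping is trusted.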

In [9]:
# 3.5 Outcome Column Cleaning
print("\n🧹 Cleaning Outcome Column...")
safe_column = 'Outcome'
raw_value = preview_column(df_cleaned_raw, safe_column)

# Parse outcome data
try:
    parsed_outcome = parse_python_list_str(raw_value)
    print("\n✅ Parsed Outcome values (first 5):", parsed_outcome[:5])
    print("🧬 Type of first value:", type(parsed_outcome[0]))

    # Split outcomes into success and type
    # Format is typically like "True ASDS" or "False Ocean"
    outcome_split = [entry.split() if isinstance(entry, str) else [None, None] for entry in parsed_outcome]
    landing_success = [item[0] if item else None for item in outcome_split]
    landing_type = [item[1] if len(item) > 1 else None for item in outcome_split]

    print("\n🚀 First 5 Landing Success values:", landing_success[:5])
    print("🛰️ First 5 Landing Type values:", landing_type[:5])

    # Clean landing success values to boolean
    landing_success_clean = [x.strip() if isinstance(x, str) else x for x in landing_success]
    landing_success_clean = [None if x in ['None', 'None None'] else x for x in landing_success_clean]
    landing_success_bool = [True if x == 'True' else False if x == 'False' else None for x in landing_success_clean]

    print("\n🧹 First 5 cleaned landing success (boolean):", landing_success_bool[:5])

    # Count success, failure, and None values
    success_count = landing_success_bool.count(True)
    failure_count = landing_success_bool.count(False)
    none_count = landing_success_bool.count(None)

    print(f"\n📊 Landing Outcome Distribution:")
    print(f"- Successful landings: {success_count} ({success_count/len(landing_success_bool)*100:.1f}%)")
    print(f"- Failed landings: {failure_count} ({failure_count/len(landing_success_bool)*100:.1f}%)")
    print(f"- No landing attempt: {none_count} ({none_count/len(landing_success_bool)*100:.1f}%)")

    # Count landing types
    landing_type_counts = {}
    for lt in landing_type:
        if lt is not None:
            landing_type_counts[lt] = landing_type_counts.get(lt, 0) + 1

    print(f"\n📊 Landing Type Distribution:")
    for lt, count in sorted(landing_type_counts.items(), key=lambda x: x[1], reverse=True):
        print(f"- {lt}: {count} attempts")

except Exception as e:
    print("\n❌ Error during parsing:", e)


🧹 Cleaning Outcome Column...
🔍 Column: Outcome
🔍 Data type: <class 'str'>
🔍 First 300 characters:
['None None', 'None None', 'None None', 'None None', 'None None', 'None None', 'None None', 'False Ocean', 'None None', 'None None', 'True Ocean', 'True Ocean', 'None None', 'None None', 'False Ocean', 'False ASDS', 'True Ocean', 'False ASDS', 'None None', 'None ASDS', 'True RTLS', 'False ASDS', 'Fa

✅ Parsed Outcome values (first 5): ['None None', 'None None', 'None None', 'None None', 'None None']
🧬 Type of first value: <class 'str'>

🚀 First 5 Landing Success values: ['None', 'None', 'None', 'None', 'None']
🛰️ First 5 Landing Type values: ['None', 'None', 'None', 'None', 'None']

🧹 First 5 cleaned landing success (boolean): [None, None, None, None, None]

📊 Landing Outcome Distribution:
- Successful landings: 60 (63.8%)
- Failed landings: 9 (9.6%)
- No landing attempt: 25 (26.6%)

📊 Landing Type Distribution:
- ASDS: 49 attempts
- None: 23 attempts
- RTLS: 15 attempts
- Ocean: 7 attempts

### Understanding Landing Outcomes

The Outcome column combines two critical pieces of information:

1. **Landing Success** (True/False/None):
   - **True**: Successful landing and recovery of the first stage
   - **False**: Attempted landing that failed
   - **None**: No landing attempt was made, or no result was recorded (as in `None ASDS`)
   
2. **Landing Type** (ASDS/RTLS/Ocean):
   - **ASDS** (Autonomous Spaceport Drone Ship): Landing on a floating platform at sea
   - **RTLS** (Return to Launch Site): Landing at a ground pad near the launch site
   - **Ocean**: Controlled descent into ocean (early technology demonstrations)
   - **None**: No landing attempt

For our prediction model, we'll separate these into two distinct features:
- `LandingSuccess`: Boolean indicating whether the landing succeeded
- `LandingType`: Categorical variable indicating the landing method

This separation allows us to analyze how landing type affects success probability.
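
A minimal sketch of the split, written as a standalone function. Unlike the cell above, which keeps the literal string `'None'` in the landing type (hence the "None: 23 attempts" line and zero missing values for `LandingType`), this hypothetical `split_outcome` helper converts the token to a real `None` so pandas would count it as missing.

```python
def split_outcome(entry):
    """Split 'True ASDS' into (True, 'ASDS'); the token 'None' becomes None."""
    if not isinstance(entry, str):
        return None, None
    tokens = entry.split()
    success_token = tokens[0] if tokens else "None"
    type_token = tokens[1] if len(tokens) > 1 else "None"
    # Any token other than 'True' or 'False' maps to None.
    success = {"True": True, "False": False}.get(success_token)
    landing_type = None if type_token == "None" else type_token
    return success, landing_type

samples = ["None None", "False Ocean", "True RTLS", "None ASDS"]
pairs = [split_outcome(sample) for sample in samples]
# pairs -> [(None, None), (False, 'Ocean'), (True, 'RTLS'), (None, 'ASDS')]
```

The `None ASDS` case shows why the two features need separate handling: the landing type is known even though the success flag is not.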

In [10]:
# 3.6 Boolean Columns Cleaning (GridFins, Legs, Reused)
boolean_columns = ['GridFins', 'Legs', 'Reused']
parsed_booleans = {}

print("\n🧹 Cleaning Boolean Feature Columns...")
for col in boolean_columns:
    print(f"\n----- {col} Column -----")
    raw_value = preview_column(df_cleaned_raw, col)

    try:
        parsed_value = parse_python_list_str(raw_value)
        parsed_booleans[col] = parsed_value
        print(f"\n✅ Parsed {col} values (first 5): {parsed_value[:5]}")
        print(f"📘 Type of first value: {type(parsed_value[0])}")

        # Count True and False values
        true_count = parsed_value.count(True)
        false_count = parsed_value.count(False)

        print(f"\n📊 {col} Distribution:")
        print(f"- True: {true_count} ({true_count/len(parsed_value)*100:.1f}%)")
        print(f"- False: {false_count} ({false_count/len(parsed_value)*100:.1f}%)")

    except Exception as e:
        print(f"\n❌ Error parsing {col}: {e}")


🧹 Cleaning Boolean Feature Columns...

----- GridFins Column -----
🔍 Column: GridFins
🔍 Data type: <class 'str'>
🔍 First 300 characters:
[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, False, True, True, True, False, True, True, True, True, True, True, True, True

✅ Parsed GridFins values (first 5): [False, False, False, False, False]
📘 Type of first value: <class 'bool'>

📊 GridFins Distribution:
- True: 70 (74.5%)
- False: 24 (25.5%)

----- Legs Column -----
🔍 Column: Legs
🔍 Data type: <class 'str'>
🔍 First 300 characters:
[False, False, False, False, False, False, False, False, False, False, True, True, False, False, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, False, True, True, True, False, True, True, True, True, True, True, True, 

### Understanding Landing Technologies

The boolean columns represent key technologies used in SpaceX's landing system:

1. **GridFins**:
   - Lattice-like control surfaces deployed during descent
   - Function: Provide aerodynamic stability and steering during reentry
   - Impact: Critical for precision landing, especially for ASDS landings
   
2. **Legs**:
   - Deployable landing legs that extend before touchdown
   - Function: Support the booster upon landing and absorb impact
   - Impact: Required for any landing attempt (RTLS or ASDS)
   
3. **Reused**:
   - Indicates whether the booster has flown before
   - Function: Marker for SpaceX's reusability program
   - Impact: May affect reliability and performance characteristics

These technologies evolved over time, with early Falcon 9 missions not equipped with grid fins or landing legs. Understanding when these technologies were implemented helps track SpaceX's iterative development approach.
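
To see when each technology first appears, one option is to find the index of the first `True` flag in each parsed list. The flags below are toy values; in the notebook they would come from `parsed_booleans['GridFins']` and `parsed_booleans['Legs']`. `first_true_index` is a hypothetical helper.

```python
def first_true_index(flags):
    """Return the index of the first True flag, or None if there is none."""
    return next((i for i, flag in enumerate(flags) if flag is True), None)

# Toy flags for illustration only.
grid_fins = [False, False, False, True, True]
legs = [False, True, False, True, True]

first_fins = first_true_index(grid_fins)   # 3
first_legs = first_true_index(legs)        # 1
```

Because the rows are in flight order, the returned index identifies the first launch in the dataset that carried the hardware.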

In [11]:
# 3.7 Block Column Cleaning
print("\n🧹 Cleaning Block Column...")
safe_column = 'Block'
raw_value = preview_column(df_cleaned_raw, safe_column)

try:
    parsed_block = parse_python_list_str(raw_value)
    print("\n✅ Parsed Block values (first 5):", parsed_block[:5])
    print("📘 Type of first value:", type(parsed_block[0]))

    # Replace None with NaN
    cleaned_block = [int(x) if isinstance(x, int) else np.nan for x in parsed_block]
    print("\n✅ Cleaned Block values (first 10):", cleaned_block[:10])

    # Count frequency of each block version
    block_counts = {}
    for block in parsed_block:
        if block is not None:
            block_counts[block] = block_counts.get(block, 0) + 1

    print("\n📊 Block Version Distribution:")
    for block, count in sorted(block_counts.items()):
        print(f"- Block {block}: {count} launches")

    # Count None values
    none_count = parsed_block.count(None)
    print(f"- None/Unknown: {none_count} launches")

except Exception as e:
    print("\n❌ Error parsing Block:", e)


🧹 Cleaning Block Column...
🔍 Column: Block
🔍 Data type: <class 'str'>
🔍 First 300 characters:
[None, None, None, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 4, 3, 4, 4, 4, 4, 5, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

✅ Parsed Block values (first 5): [None, None, None, None, 1]
📘 Type of first value: <class 'NoneType'>

✅ Cleaned Block values (first 10): [nan, nan, nan, nan, 1, 1, 1, 1, 1, 1]

📊 Block Version Distribution:
- Block 1: 19 launches
- Block 2: 6 launches
- Block 3: 15 launches
- Block 4: 11 launches
- Block 5: 39 launches
- None/Unknown: 4 launches


### Understanding Falcon 9 Block Versions

The Block number represents major iterations in the Falcon 9 design. The date ranges and descriptions below are approximate:

1. **Block 1** (2010-2013):
   - Original Falcon 9 design
   - No landing capability
   
2. **Block 2** (2013-2014):
   - Improved engines and structure
   - Early landing experiments
   
3. **Block 3** (2015-2016):
   - First with consistent landing attempts
   - Improved thrust and fuel capacity
   
4. **Block 4** (2017):
   - Transitional design
   - Improved reusability features
   
5. **Block 5** (2018-Present):
   - Designed for rapid reuse (10+ flights)
   - Maximum thrust and reliability
   - Optimized for human spaceflight

The Block version is a critical factor for landing success prediction, as later versions incorporated significant improvements to the landing system.

In [12]:
# 3.8 Serial Column Cleaning
print("\n🧹 Cleaning Serial Column...")
safe_column = 'Serial'
raw_value = preview_column(df_cleaned_raw, safe_column)

try:
    parsed_serial = parse_python_list_str(raw_value)
    print("\n✅ Parsed Serial values (first 5):", parsed_serial[:5])
    print("🔢 Type of first value:", type(parsed_serial[0]))

    # Count frequency of early vs. later serial numbers
    early_serials = [s for s in parsed_serial if s.startswith('Merlin')]
    b_serials = [s for s in parsed_serial if s.startswith('B')]

    print(f"\n📊 Serial Number Distribution:")
    print(f"- Early Merlin series: {len(early_serials)} boosters")
    print(f"- B-series boosters: {len(b_serials)} boosters")

    # Show some examples of each type
    print(f"\n🔍 Example early serials: {early_serials}")
    print(f"🔍 Example B-series serials (first 5): {b_serials[:5]}")

except Exception as e:
    print("\n❌ Error parsing Serial:", e)


🧹 Cleaning Serial Column...
🔍 Column: Serial
🔍 Data type: <class 'str'>
🔍 First 300 characters:
['Merlin1A', 'Merlin2A', 'Merlin2C', 'Merlin3C', 'B0003', 'B0005', 'B0007', 'B1003', 'B1004', 'B1005', 'B1006', 'B1007', 'B1008', 'B1011', 'B1010', 'B1012', 'B1013', 'B1015', 'B1016', 'B1018', 'B1019', 'B1017', 'B1020', 'B1021', 'B1022', 'B1023', 'B1025', 'B1026', 'B1028', 'B1029', 'B1031', 'B1030',

✅ Parsed Serial values (first 5): ['Merlin1A', 'Merlin2A', 'Merlin2C', 'Merlin3C', 'B0003']
🔢 Type of first value: <class 'str'>

📊 Serial Number Distribution:
- Early Merlin series: 4 boosters
- B-series boosters: 90 boosters

🔍 Example early serials: ['Merlin1A', 'Merlin2A', 'Merlin2C', 'Merlin3C']
🔍 Example B-series serials (first 5): ['B0003', 'B0005', 'B0007', 'B1003', 'B1004']


### Understanding Booster Serial Numbers

The Serial column contains the identifier of the booster used on each launch:

1. **Merlin Series** (e.g., "Merlin1A", "Merlin2C"):
   - Used for the four Falcon 1 launches at the start of the dataset
   - Named after the Merlin engines that power the rocket
   - These boosters were not designed for recovery
   
2. **B-Series** (e.g., "B1015", "B1029"):
   - Modern Falcon 9 booster naming convention
   - Format: "B" + sequential number
   - The number after "B" increases with newer boosters
   
3. **Numbering Significance**:
   - Lower numbers (B1001-B1020): Early reusability test boosters
   - Mid-range (B1021-B1040): Transitional boosters with improved landing systems
   - Higher numbers (B1041+): Modern boosters with mature landing technology

Tracking boosters by serial number allows us to analyze the performance of specific boosters across multiple launches.
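
As a sketch of that kind of tracking, repeated serials can be counted to find boosters that flew more than once. The serial list below is a toy sample, and `flights_per_booster` and `reflown` are hypothetical helpers.

```python
from collections import Counter

# Toy serials for illustration; a repeated serial means the booster flew again.
serials = ["B1019", "B1021", "B1023", "B1021", "B1029", "B1029", "B1021"]

def flights_per_booster(serial_list):
    """Count how many launches each serial appears in."""
    return Counter(serial_list)

def reflown(serial_list):
    """Return (serial, flights) pairs for boosters with more than one flight."""
    counts = flights_per_booster(serial_list)
    return [(serial, n) for serial, n in counts.most_common() if n > 1]

reflown_boosters = reflown(serials)
# reflown_boosters -> [('B1021', 3), ('B1029', 2)]
```

Joining these counts with the landing outcomes would show whether success changes as a booster accumulates flights.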

In [13]:
# 3.9 Coordinates (Longitude, Latitude) Cleaning
coordinate_columns = ['Longitude', 'Latitude']
parsed_coordinates = {}

print("\n🧹 Cleaning Coordinate Columns...")
for col in coordinate_columns:
    print(f"\n----- {col} Column -----")
    raw_value = preview_column(df_cleaned_raw, col)

    try:
        parsed_value = parse_python_list_str(raw_value)
        print(f"\n✅ Parsed {col} values (first 5): {parsed_value[:5]}")
        print(f"📘 Type of first value: {type(parsed_value[0]) if parsed_value and len(parsed_value) > 0 else None}")

        # Clean coordinates (replace non-numeric with NaN)
        cleaned_coords = clean_numeric_column(parsed_value)
        parsed_coordinates[col] = cleaned_coords

        print(f"\n✅ Cleaned {col} values (first 10): {cleaned_coords[:10]}")

        # Basic statistics
        valid_coords = [x for x in cleaned_coords if not np.isnan(x)]
        print(f"\n📊 {col} Statistics:")
        print(f"- Valid values: {len(valid_coords)} ({len(valid_coords)/len(cleaned_coords)*100:.1f}%)")
        print(f"- Min: {min(valid_coords):.6f}")
        print(f"- Max: {max(valid_coords):.6f}")
        print(f"- Mean: {sum(valid_coords)/len(valid_coords):.6f}")

    except Exception as e:
        print(f"\n❌ Error parsing {col}: {e}")


🧹 Cleaning Coordinate Columns...

----- Longitude Column -----
🔍 Column: Longitude
🔍 Data type: <class 'str'>
🔍 First 300 characters:
[167.7431292, 167.7431292, 167.7431292, 167.7431292, -80.577366, -80.577366, -80.577366, -120.610829, -80.577366, -80.577366, -80.577366, -80.577366, -80.577366, -80.577366, -80.577366, -80.577366, -80.577366, -80.577366, -80.577366, -80.577366, -80.577366, -120.610829, -80.577366, -80.577366, -80.5

✅ Parsed Longitude values (first 5): [167.7431292, 167.7431292, 167.7431292, 167.7431292, -80.577366]
📘 Type of first value: <class 'float'>

✅ Cleaned Longitude values (first 10): [167.7431292, 167.7431292, 167.7431292, 167.7431292, -80.577366, -80.577366, -80.577366, -120.610829, -80.577366, -80.577366]

📊 Longitude Statistics:
- Valid values: 94 (100.0%)
- Min: -120.610829
- Max: 167.743129
- Mean: -75.553302

----- Latitude Column -----
🔍 Column: Latitude
🔍 Data type: <class 'str'>
🔍 First 300 characters:
[9.0477206, 9.0477206, 9.0477206, 9.0477206, 28.

### Understanding Launch and Landing Coordinates

The coordinate columns hold one longitude and one latitude per launch:

1. **Longitude and Latitude**:
   - Geographic coordinates in decimal degrees
   - Format: Longitude (-180 to 180), Latitude (-90 to 90)
   
2. **Coordinate Clusters**:
   - In the rows previewed above, the values repeat per launch site: 167.7431292 for Kwajalein Atoll, -80.577366 for CCSFS SLC 40, and -120.610829 for VAFB SLC 4E
   - The values therefore appear to describe launch pads; landing pads and drone ship positions are not stored in these columns
   
3. **Significance for Prediction**:
   - The coordinates largely duplicate the information in `LaunchSite`
   - Different launch sites may have different landing success rates

These coordinates will be particularly useful for the visualization module, where we'll plot launches and landings on interactive maps.
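
A quick way to check what the coordinates describe is to pair each row's launch site with its longitude and count the distinct values per site. The lists below are the first eight entries from the `LaunchSite` and `Longitude` previews above; `values_per_key` is a hypothetical helper.

```python
from collections import defaultdict

# First eight values from the LaunchSite and Longitude previews.
sites = ["Kwajalein Atoll"] * 4 + ["CCSFS SLC 40"] * 3 + ["VAFB SLC 4E"]
longitudes = [167.7431292] * 4 + [-80.577366] * 3 + [-120.610829]

def values_per_key(keys, values):
    """Collect the distinct values observed for each key."""
    distinct = defaultdict(set)
    for key, value in zip(keys, values):
        distinct[key].add(value)
    return dict(distinct)

site_longitudes = values_per_key(sites, longitudes)
one_value_per_site = all(len(v) == 1 for v in site_longitudes.values())
```

If `one_value_per_site` also holds for the full lists, the coordinates identify the launch pad rather than a landing position.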

In [15]:
# 4.1 Create Final Cleaned DataFrame
print("\n🔧 Creating final cleaned DataFrame...")

df_cleaned_final = pd.DataFrame({
    # Core landing predictors
    'GridFins': parsed_booleans['GridFins'],
    'Legs': parsed_booleans['Legs'],
    'Reused': parsed_booleans['Reused'],
    'Block': cleaned_block,

    # Location data
    'Longitude': parsed_coordinates['Longitude'],
    'Latitude': parsed_coordinates['Latitude'],

    # Mission characteristics
    'PayloadMass': flat_payloads,
    'Orbit': parsed_orbit,
    'LaunchSite': parsed_site,

    # Landing information
    'LandingPad': parsed_pad,
    'LandingSuccess': landing_success_bool,
    'LandingType': landing_type,

    # Booster information
    'Serial': parsed_serial
})

# Display basic information about the cleaned DataFrame
print(f"\n✅ Final DataFrame created with shape: {df_cleaned_final.shape}")
print("\n📊 Column data types:")
print(df_cleaned_final.dtypes)

# Check for missing values
missing_values = df_cleaned_final.isnull().sum()
print("\n📊 Missing Values by Column:")
print(missing_values)

# Display sample of cleaned data
print("\n📋 Sample of cleaned data:")
print(df_cleaned_final.head())


🔧 Creating final cleaned DataFrame...

✅ Final DataFrame created with shape: (94, 13)

📊 Column data types:
GridFins             bool
Legs                 bool
Reused               bool
Block             float64
Longitude         float64
Latitude          float64
PayloadMass       float64
Orbit              object
LaunchSite         object
LandingPad         object
LandingSuccess     object
LandingType        object
Serial             object
dtype: object

📊 Missing Values by Column:
GridFins           0
Legs               0
Reused             0
Block              4
Longitude          0
Latitude           0
PayloadMass        0
Orbit              0
LaunchSite         0
LandingPad        30
LandingSuccess    25
LandingType        0
Serial             0
dtype: int64

📋 Sample of cleaned data:
   GridFins   Legs  Reused  Block  Longitude  Latitude  PayloadMass Orbit       LaunchSite LandingPad LandingSuccess LandingType    Serial
0     False  False   False    NaN    167.743     9.048    

## 4. Creating the Final Cleaned Dataset

After cleaning each column individually, we now:
1. Combine all cleaned columns into a single DataFrame
2. Perform final validation checks
3. Save the cleaned dataset for future modules
4. Create basic visualizations of the cleaned data

The final dataset will contain properly typed and formatted data ready for exploratory analysis and modeling.
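
One possible validation check is to confirm that every cleaned list has the same length before building the DataFrame. pandas raises an error on unequal lengths anyway, but a check beforehand can report which column is off. `check_equal_lengths` is a hypothetical helper and the columns below are toy data.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists, or raise listing the lengths found."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)

# Toy columns for illustration only.
cleaned = {
    "Orbit": ["LEO", "GTO", "ISS"],
    "Block": [float("nan"), 1.0, 5.0],
    "GridFins": [False, True, True],
}
n_rows = check_equal_lengths(cleaned)
```

In the notebook, the same dictionary could then be passed straight to `pd.DataFrame`.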

In [16]:
# 4.2 Save Cleaned Data to CSV
# Set the output filename
output_file = 'spacex_cleaned_data.csv'

# Save to CSV
df_cleaned_final.to_csv(output_file, index=False)
print(f"\n✅ Cleaned CSV saved as '{output_file}'")

# For Google Colab: Enable download of the cleaned file
try:
    from google.colab import files
    print("\n📥 Downloading file...")
    files.download(output_file)
except ImportError:
    print("\n💾 File saved locally. Download not needed.")


✅ Cleaned CSV saved as 'spacex_cleaned_data.csv'

📥 Downloading file...

