# Project Insight: Building a Falcon 9 Reusability Prediction Model

This notebook focuses on the crucial first step of a data science project: **Data Collection and Preprocessing**. Our goal is to build a model that can predict whether a Falcon 9 first stage booster will land successfully, enabling reusability.

**Insight:** Reliable prediction requires a comprehensive and clean dataset of past launches. However, this data is not available in a single, readily usable source. It is fragmented across different platforms, primarily the SpaceX API and historical records like Wikipedia. Therefore, a robust data acquisition and integration process is essential before any analysis or modeling can begin. This notebook addresses this challenge by combining data from these two sources.

In [44]:
# 1. Setup and Imports
# Import necessary libraries for data handling, API interaction, and web scraping.

import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import re
import unicodedata # To handle Unicode characters in web scraping

# Set pandas display options for better readability of DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

## 🛰️ Phase 1: Data Extraction via SpaceX REST API

This phase focuses on acquiring detailed launch data directly from the SpaceX API. The API provides granular information, but much of it is linked via IDs, requiring additional lookups.
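
Many launches share the same rocket and launchpad IDs, so a naive enrichment loop repeats identical lookups. The sketch below shows one way to memoize them. `make_cached_lookup` and the `fetch` callable are hypothetical names introduced here for illustration, and a stand-in fetcher replaces the real HTTP call so the example runs offline.

```python
from functools import lru_cache

def make_cached_lookup(fetch):
    """Wrap a fetch(endpoint, item_id) callable so each (endpoint, ID) pair is fetched once."""
    @lru_cache(maxsize=None)
    def lookup(endpoint, item_id):
        if not item_id:
            return None  # Mirror the "missing ID" guard of the API helper
        return fetch(endpoint, item_id)
    return lookup

# Stand-in for a real HTTP request (assumption: illustrative only, no network involved)
calls = []
def fake_fetch(endpoint, item_id):
    calls.append((endpoint, item_id))
    return {"name": f"{endpoint}:{item_id}"}

lookup = make_cached_lookup(fake_fetch)
for rocket_id in ["r1", "r1", "r1", "r2"]:
    lookup("rockets", rocket_id)

print(len(calls))  # 2: one call per distinct ID
```

In a real run, `fetch` could be a thin wrapper around a helper such as `get_launch_details`. Note that `lru_cache` also caches a `None` result, so a lookup that failed once would not be retried within the same session.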

In [45]:
# Define a helper function to fetch details using IDs from the API
def get_launch_details(item_id, endpoint):
    """
    Fetches detailed information for a given item ID from a specified SpaceX API endpoint.

    Args:
        item_id (str): The unique identifier for the item (e.g., rocket ID, launchpad ID).
        endpoint (str): The API endpoint (e.g., 'rockets', 'launchpads', 'payloads', 'cores').

    Returns:
        dict or None: A dictionary containing the item's details, or None if fetching fails or ID is missing.
    """
    if not item_id:
        return None
    api_url = f"https://api.spacexdata.com/v4/{endpoint}/{item_id}"
    try:
        response = requests.get(api_url, timeout=10) # Without a timeout, a stalled connection can hang indefinitely
        response.raise_for_status() # Raise an HTTPError for bad responses (4xx or 5xx)
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data for {endpoint} ID {item_id}: {e}")
        return None

# Use the main /v4/launches/past endpoint to fetch all historical launches
spacex_api_url = "https://api.spacexdata.com/v4/launches/past"
try:
    response = requests.get(spacex_api_url, timeout=30) # Without a timeout, a stalled connection can hang indefinitely
    response.raise_for_status() # Check for HTTP errors
    api_data = response.json()
    print(f"Successfully fetched {len(api_data)} launch records from the API.")
except requests.exceptions.RequestException as e:
    print(f"Error fetching initial launch data from API: {e}")
    api_data = [] # Initialize as empty list if fetch fails

# Flatten the initial JSON response using pandas.json_normalize
if api_data:
    api_df = pd.json_normalize(api_data)

    # Initialize lists to store enriched data
    launch_sites = []
    payload_masses = []
    orbits = []
    booster_versions = []
    core_serials = []
    core_flights = []
    core_gridfins = []
    core_legs = []
    core_reused = []
    core_landing_attempts = []
    core_landing_successes = []
    core_landing_types = []
    core_landpads = []

    # Enrich each launch with details from the linked endpoints.
    # Note: this issues up to four extra requests per launch (rocket, launchpad, payload, core).
    print("Enriching data with details from nested API endpoints...")
    for index, row in api_df.iterrows():
        # Get Rocket Version using the 'rocket' ID
        rocket_details = get_launch_details(row.get('rocket'), 'rockets') # Use .get() for safe access
        booster_versions.append(rocket_details['name'] if rocket_details and 'name' in rocket_details else None)

        # Get Launch Site details using the 'launchpad' ID
        launchpad_details = get_launch_details(row.get('launchpad'), 'launchpads') # Use .get()
        launch_sites.append(launchpad_details['name'] if launchpad_details and 'name' in launchpad_details else None)

        # Get Payload details (assuming a single payload for simplicity as per previous steps)
        # Check if 'payloads' list exists and is not empty before accessing the first element
        payload_id = row['payloads'][0] if 'payloads' in row and row['payloads'] else None
        payload_details = get_launch_details(payload_id, 'payloads')
        payload_masses.append(payload_details['mass_kg'] if payload_details and 'mass_kg' in payload_details else None)
        orbits.append(payload_details['orbit'] if payload_details and 'orbit' in payload_details else None)

        # Get Core details (assuming a single core for simplicity as per previous steps)
        # Check if 'cores' list exists and is not empty before accessing the first element
        core_data = row['cores'][0] if 'cores' in row and row['cores'] else {} # Use empty dict if no core data
        core_id = core_data.get('core')

        # The serial lives on the /cores endpoint; get_launch_details returns None when core_id is missing
        core_details = get_launch_details(core_id, 'cores')
        core_serials.append(core_details.get('serial') if core_details else None)

        # Flight-specific core fields come straight from the launch record.
        # Appending on every iteration (even with an empty core_data) keeps the lists aligned.
        core_flights.append(core_data.get('flight'))
        core_gridfins.append(core_data.get('gridfins'))
        core_legs.append(core_data.get('legs'))
        core_reused.append(core_data.get('reused'))
        core_landing_attempts.append(core_data.get('landing_attempt'))
        core_landing_successes.append(core_data.get('landing_success'))
        core_landing_types.append(core_data.get('landing_type'))
        core_landpads.append(core_data.get('landpad')) # This is the landing pad ID, not its name

    # Add the enriched data as new columns to the DataFrame
    api_df['LaunchSite'] = launch_sites
    api_df['PayloadMass'] = payload_masses
    api_df['Orbit'] = orbits
    api_df['BoosterVersion'] = booster_versions
    api_df['CoreSerial'] = core_serials
    api_df['CoreFlights'] = core_flights
    api_df['CoreGridFins'] = core_gridfins
    api_df['CoreLegs'] = core_legs
    api_df['CoreReused'] = core_reused
    api_df['CoreLandingAttempt'] = core_landing_attempts
    api_df['CoreLandingSuccess'] = core_landing_successes
    api_df['CoreLandingType'] = core_landing_types
    api_df['CoreLandingPad'] = core_landpads # Keep the ID for now

    # Select and rename relevant columns from the API data
    # Ensure all selected columns exist before proceeding
    api_cols_to_keep = [
        'flight_number', 'date_utc', 'LaunchSite', 'PayloadMass', 'Orbit',
        'BoosterVersion', 'CoreSerial', 'CoreFlights', 'CoreGridFins',
        'CoreLegs', 'CoreReused', 'CoreLandingAttempt', 'CoreLandingSuccess',
        'CoreLandingType', 'CoreLandingPad'
    ]
    # Filter for columns that actually exist in api_df
    existing_api_cols = [col for col in api_cols_to_keep if col in api_df.columns]
    api_cleaned_df = api_df[existing_api_cols].copy() # Use .copy() to avoid SettingWithCopyWarning

    # Rename columns for consistency
    api_cleaned_df.rename(columns={
        'flight_number': 'FlightNumber',
        'date_utc': 'DateUTC',
    }, inplace=True)

    print("API data processing complete.")
    display(api_cleaned_df.head())
else:
    api_cleaned_df = pd.DataFrame() # Create empty DataFrame if no data was fetched
    print("No API data to process.")

Successfully fetched 187 launch records from the API.
Enriching data with details from nested API endpoints...
API data processing complete.


| | FlightNumber | DateUTC | LaunchSite | PayloadMass | Orbit | BoosterVersion | CoreSerial | CoreFlights | CoreGridFins | CoreLegs | CoreReused | CoreLandingAttempt | CoreLandingSuccess | CoreLandingType | CoreLandingPad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2006-03-24T22:30:00.000Z | Kwajalein Atoll | 20.0 | LEO | Falcon 1 | Merlin1A | 1 | False | False | False | False | | | |
| 1 | 2 | 2007-03-21T01:10:00.000Z | Kwajalein Atoll | | LEO | Falcon 1 | Merlin2A | 1 | False | False | False | False | | | |
| 2 | 3 | 2008-08-03T03:34:00.000Z | Kwajalein Atoll | | LEO | Falcon 1 | Merlin1C | 1 | False | False | False | False | | | |
| 3 | 4 | 2008-09-28T23:15:00.000Z | Kwajalein Atoll | 165.0 | LEO | Falcon 1 | Merlin2C | 1 | False | False | False | False | | | |
| 4 | 5 | 2009-07-13T03:35:00.000Z | Kwajalein Atoll | 200.0 | LEO | Falcon 1 | Merlin3C | 1 | False | False | False | False | | | |


### Results of API Data Processing

The output above shows the first few rows of the `api_cleaned_df` DataFrame. It contains historical SpaceX launch data fetched from the v4 API and enriched with details from the related endpoints.

Key columns in this DataFrame include:

*   **FlightNumber:** The sequential flight number of the mission.
*   **DateUTC:** The date and time of the launch in UTC.
*   **LaunchSite:** The name of the launch site (e.g., 'CCSFS SLC 40', 'KSC LC 39A').
*   **PayloadMass:** The mass of the payload in kilograms.
*   **Orbit:** The target orbit of the mission.
*   **BoosterVersion:** The rocket name returned by the `rockets` endpoint (e.g., 'Falcon 9', 'Falcon 1').
*   **CoreSerial:** The serial number of the first stage core.
*   **CoreFlights:** The flight count of this core as of this launch, taken from the launch record (1 for a core's first flight).
*   **CoreGridFins, CoreLegs, CoreReused, CoreLandingAttempt, CoreLandingSuccess:** Boolean flags and indicators related to the core's reusability features and landing attempts/outcomes.
*   **CoreLandingType:** The type of landing attempted (e.g., 'RTLS', 'ASDS').
*   **CoreLandingPad:** The ID of the landing pad used (needs further lookup for name if required).

This DataFrame serves as the foundation of our dataset, providing detailed information about each launch that is crucial for predicting reusability. It is now ready to be combined with supplementary data from web scraping and further cleaned in subsequent steps.
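
`DateUTC` arrives as an ISO 8601 string, so later steps will likely need it as a real timestamp. The following is a small sketch, assuming pandas is available and using three rows copied from the output above. The `LaunchYear` column name is an illustrative choice, not part of the notebook.

```python
import pandas as pd

# Three rows copied from the preview above, reduced to three columns
sample = pd.DataFrame({
    "FlightNumber": [1, 2, 3],
    "DateUTC": [
        "2006-03-24T22:30:00.000Z",
        "2007-03-21T01:10:00.000Z",
        "2008-08-03T03:34:00.000Z",
    ],
    "PayloadMass": [20.0, None, None],
})

# Parse the ISO 8601 strings into timezone-aware timestamps
sample["DateUTC"] = pd.to_datetime(sample["DateUTC"], utc=True)
sample["LaunchYear"] = sample["DateUTC"].dt.year  # Hypothetical derived column

# Count launches whose payload mass is missing
missing_mass = int(sample["PayloadMass"].isna().sum())
print(sample["LaunchYear"].tolist(), missing_mass)
```

Parsing once at this stage keeps later date arithmetic and sorting from depending on string comparisons.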

## 📖 Phase 2: Web Scraping for Supplemental Data

This phase involves scraping historical launch outcome data from Wikipedia, which can supplement the API data, particularly for older missions.

In [46]:
# Use requests and BeautifulSoup to scrape the table data from Wikipedia
# Pin a specific revision (oldid) so the page layout does not change between runs
wikipedia_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

# Define headers to mimic a web browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/91.0.4472.124 Safari/537.36"}

try:
    response = requests.get(wikipedia_url, headers=headers, timeout=30) # Without a timeout, a stalled connection can hang indefinitely
    response.raise_for_status() # Check for HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    print("Successfully scraped Wikipedia page.")
except requests.exceptions.RequestException as e:
    print(f"Error scraping Wikipedia page: {e}")
    soup = None

# Define a text cleaning function
def clean_text(text):
    """
    Removes bracketed text (like citations or annotations) and normalizes Unicode.
    Also handles None input gracefully.
    """
    if text is None:
        return ''
    text = str(text) # Ensure text is a string
    text = unicodedata.normalize('NFKD', text) # Normalize unicode
    text = re.sub(r'\[.*?\]', '', text) # Remove [citations]
    text = re.sub(r'\(.*?\)', '', text) # Remove (annotations)
    return text.strip() # Remove leading/trailing whitespace

# Find the relevant table(s) - usually the wikitable plainrowheaders collapsible ones
if soup:
    html_tables = soup.find_all('table', class_='wikitable plainrowheaders collapsible')

    # Process each table found with the specified class
    wikipedia_launches = []
    for table in html_tables:
        rows = table.find_all('tr')
        # Get header row - usually the first row
        header_row = rows[0] if rows else None
        if header_row:
            # Extract and clean header text
            header_cells = header_row.find_all(['th', 'td'])
            column_names = []
            for cell in header_cells:
                # Get all text strings within the cell, join them, and clean
                cell_text = ''.join(cell.strings)
                column_names.append(clean_text(cell_text))

            # Map useful column names to standardized names.
            # Keys must match the header text *after* clean_text(), which strips
            # bracketed and parenthesized parts such as "(UTC)".
            col_map = {
                'Flight No.': 'FlightNumber',
                'Date and time': 'DateAndTimeWiki',
                'Version,Booster': 'BoosterVersionWiki',
                'Launch site': 'LaunchSiteWiki',
                'Payload': 'PayloadWiki',
                'Payload mass': 'PayloadMassWiki',
                'Orbit': 'OrbitWiki',
                'Customer': 'CustomerWiki',
                'Launch outcome': 'LaunchOutcomeWiki',
                'Booster landing': 'BoosterLandingWiki'
            }
            # Translate the extracted header names into our standardized names
            standardized_col_names = [col_map.get(name, name) for name in column_names]


        # Process data rows (skip the header row)
        for row in rows[1:]:
            cols = row.find_all(['th', 'td']) # Get all cells (header and data) in the row
            if not cols:
                continue # Skip empty rows or rows that are not data rows

            row_data = {}
            # Iterate through columns and extract data based on header mapping,
            # ignoring any cells beyond the number of known column names
            for j, cell in enumerate(cols):
                if j < len(standardized_col_names):
                    col_name = standardized_col_names[j]
                    # Cells can contain nested markup (e.g. date/time or landing outcome),
                    # so join all strings within the cell before cleaning.
                    cell_text = ''.join(cell.strings)
                    row_data[col_name] = clean_text(cell_text)

            # Find the cell corresponding to FlightNumber, which is usually the first <th> in a data row
            # This is a more reliable way to get FlightNumber for each row
            flight_no_cell = row.find('th')
            if flight_no_cell:
                flight_number_str = clean_text(flight_no_cell.get_text())
                # Check if the cleaned text is a digit and add it to row_data
                if flight_number_str.isdigit():
                    row_data['FlightNumber'] = int(flight_number_str) # Convert to integer
                    # Append the row data only if FlightNumber is successfully extracted and is a digit
                    wikipedia_launches.append(row_data)


    # Convert the list of dictionaries into a pandas DataFrame
    if wikipedia_launches:
        # Create DataFrame from the list of dictionaries
        wiki_df = pd.DataFrame(wikipedia_launches)

        # Select the essential columns for merging.
        # FlightNumber is the merge key, BoosterLandingWiki (the name assigned in col_map)
        # supplements the API data, and DateAndTimeWiki is kept for verification.
        wiki_cols_to_keep = ['FlightNumber', 'BoosterLandingWiki', 'DateAndTimeWiki']
        # Filter for columns that actually exist in wiki_df
        existing_wiki_cols = [col for col in wiki_cols_to_keep if col in wiki_df.columns]
        wiki_cleaned_df = wiki_df[existing_wiki_cols].copy()


        print(f"Extracted {len(wiki_cleaned_df)} launch records from Wikipedia.")
        display(wiki_cleaned_df.head())
    else:
        wiki_cleaned_df = pd.DataFrame()
        print("No Wikipedia data extracted.")

else:
    wiki_cleaned_df = pd.DataFrame()
    print("Could not create BeautifulSoup object.")

Successfully scraped Wikipedia page.
Extracted 121 launch records from Wikipedia.


| | FlightNumber |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |


### Results of Web Scraping

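The output above shows the first few rows of the `wiki_cleaned_df` DataFrame, which holds supplementary launch data scraped from Wikipedia. In the recorded run only `FlightNumber` appears, because the column selection keeps only those requested columns that the header mapping actually produced.

The columns this DataFrame is intended to carry are: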

*   **FlightNumber:** The sequential flight number of the mission, used as a key for merging with the API data.
*   **BoosterLandingWiki:** This column contains the landing outcome information scraped from Wikipedia. It provides valuable historical context for core reusability and will be used to help determine the target variable for our prediction model.
*   **DateAndTimeWiki:** The date and time of the launch as recorded on Wikipedia. This can be used for verification or as an alternative key if needed.

This DataFrame supplements the API data with crucial historical landing outcome information, which is vital for building a comprehensive dataset for reusability prediction. It is now ready to be merged with the API data in the next phase.
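
Header matching depends on what `clean_text` leaves behind: it removes bracketed and parenthesized text, so a header must be looked up by its cleaned form. The sketch below repeats the same cleaning steps on a few header strings. The strings are illustrative assumptions rather than copies of the page's markup.

```python
import re
import unicodedata

def clean_text(text):
    """Same cleaning steps as the scraper's clean_text helper."""
    if text is None:
        return ''
    text = unicodedata.normalize('NFKD', str(text))  # Normalize Unicode
    text = re.sub(r'\[.*?\]', '', text)  # Remove [citations]
    text = re.sub(r'\(.*?\)', '', text)  # Remove (annotations)
    return text.strip()

# Illustrative header strings (assumption: not copied verbatim from Wikipedia)
raw_headers = ['Flight No.', 'Date and time (UTC)', 'Payload[c]', 'Booster landing']
cleaned = [clean_text(h) for h in raw_headers]
print(cleaned)  # ['Flight No.', 'Date and time', 'Payload', 'Booster landing']
```

A mapping key that still contains "(UTC)" would therefore never match, and the corresponding column would keep its raw header name.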

## 🧩 Phase 3: Merging, Filtering, and Checkpoint

This phase merges the data obtained from the SpaceX API and Wikipedia, filters for Falcon 9 launches, and prepares the final raw dataset.
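
A left merge silently leaves gaps wherever the right-hand table has no matching key, so it helps to count those gaps explicitly. The following is a minimal sketch on toy data, assuming pandas is available. The frame names and the outcome values are made up for illustration.

```python
import pandas as pd

# Toy frames (assumption: values are made up for illustration)
api_toy = pd.DataFrame({
    "FlightNumber": [6, 7, 8],
    "BoosterVersion": ["Falcon 9"] * 3,
})
wiki_toy = pd.DataFrame({
    "FlightNumber": [6, 7],
    "BoosterLandingWiki": ["outcome A", "outcome B"],
})

merged_toy = pd.merge(
    api_toy, wiki_toy,
    on="FlightNumber", how="left",
    validate="one_to_one",  # Raise if either side has duplicate keys
    indicator=True,         # Add a _merge column: 'both' or 'left_only'
)

# Rows that found no Wikipedia counterpart
unmatched = int((merged_toy["_merge"] == "left_only").sum())
print(len(merged_toy), unmatched)  # 3 1
```

The `validate` argument guards against accidental row duplication, and the `_merge` column makes unmatched launches easy to list before the column is dropped again.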

In [47]:
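# Merge the API DataFrame and the Web-Scraped DataFrame
# Use 'FlightNumber' as the common key. A left merge is appropriate
# to keep all API data and add Wikipedia outcomes where available.
# Caution: the API numbers every SpaceX launch (Falcon 1 occupies flights 1-5 in the output above),
# while the Wikipedia list covers only Falcon 9 and Falcon Heavy, so equal flight numbers
# need not refer to the same mission. Verify matches (e.g., by launch date) before relying on them.
# Ensure 'FlightNumber' is of the same data type in both DataFrames before merging.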

if not api_cleaned_df.empty and not wiki_cleaned_df.empty:
    # Ensure 'FlightNumber' column exists in both DataFrames
    if 'FlightNumber' in api_cleaned_df.columns and 'FlightNumber' in wiki_cleaned_df.columns:
        # Ensure FlightNumber is integer type for merging in both DataFrames
        # Use errors='coerce' to turn non-numeric values into NaN, then drop NaNs
        api_cleaned_df['FlightNumber'] = pd.to_numeric(api_cleaned_df['FlightNumber'], errors='coerce')
        api_cleaned_df.dropna(subset=['FlightNumber'], inplace=True)
        api_cleaned_df['FlightNumber'] = api_cleaned_df['FlightNumber'].astype(int)

        wiki_cleaned_df['FlightNumber'] = pd.to_numeric(wiki_cleaned_df['FlightNumber'], errors='coerce')
        wiki_cleaned_df.dropna(subset=['FlightNumber'], inplace=True)
        wiki_cleaned_df['FlightNumber'] = wiki_cleaned_df['FlightNumber'].astype(int)


        # Perform the merge, taking only the Wikipedia columns that are actually present
        wiki_merge_cols = [col for col in ['FlightNumber', 'BoosterLandingWiki'] if col in wiki_cleaned_df.columns]
        merged_df = pd.merge(api_cleaned_df, wiki_cleaned_df[wiki_merge_cols], on='FlightNumber', how='left')
        print("Successfully merged API and Wikipedia data.")

        # CRITICAL: keep only rows whose BoosterVersion is exactly 'Falcon 9'.
        # This drops the Falcon 1 launches (and any other non-Falcon 9 rocket).
        # The BoosterVersion column from the API is used for this filtering,
        # so make sure it exists first.
        if 'BoosterVersion' in merged_df.columns:
            falcon9_raw_df = merged_df[merged_df['BoosterVersion'] == 'Falcon 9'].copy() # Use .copy() after filtering
            print(f"Filtered data to include only Falcon 9 launches (excluding Falcon 1). {len(falcon9_raw_df)} records remaining.")
        else:
            print("Warning: 'BoosterVersion' column not found in merged data. Skipping Falcon 1 filtering.")
            falcon9_raw_df = merged_df.copy()

    else:
        merged_df = pd.DataFrame()
        falcon9_raw_df = pd.DataFrame()
        print("Merging skipped: 'FlightNumber' column not found in one or both DataFrames.")

else:
    merged_df = pd.DataFrame()
    falcon9_raw_df = pd.DataFrame()
    print("Merging skipped due to empty DataFrames.")

Successfully merged API and Wikipedia data.
Filtered data to include only Falcon 9 launches (excluding Falcon 1). 182 records remaining.


## Final Output and Checkpoint

The merging and filtering process results in the `falcon9_raw_df` DataFrame. This DataFrame holds the combined raw data for Falcon 9 launches only, with Falcon 1 missions excluded. It contains the detailed SpaceX API fields and, when the scrape yields it, the landing outcome column from Wikipedia.

The `.info()` output provides a summary of the DataFrame, showing the number of rows (representing Falcon 9 launches), the columns, their data types, and the count of non-null values for each column. This is useful for identifying columns with missing data that may require further handling in subsequent data cleaning steps.
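
As a worked example of reading those counts, the snippet below turns the non-null counts from the recorded `.info()` output further down into missing-value counts and percentages. The dictionary names are illustrative; the numbers are copied from that output.

```python
# Non-null counts copied from the recorded .info() output (182 rows in total)
total_rows = 182
non_null = {
    "PayloadMass": 159,
    "Orbit": 181,
    "CoreLandingSuccess": 156,
    "CoreLandingType": 158,
    "CoreLandingPad": 151,
}

# Missing = total rows minus non-null entries
missing = {col: total_rows - n for col, n in non_null.items()}
# Share of missing values in percent, rounded to one decimal place
missing_share = {col: round(100 * m / total_rows, 1) for col, m in missing.items()}

print(missing)
# {'PayloadMass': 23, 'Orbit': 1, 'CoreLandingSuccess': 26, 'CoreLandingType': 24, 'CoreLandingPad': 31}
```

On a live DataFrame, `falcon9_raw_df.isna().sum()` yields the same missing counts directly.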

The `.head()` output displays the first few rows of the `falcon9_raw_df`, giving a preview of the structure and content of the final raw dataset.

This `falcon9_raw_df` is then saved to a CSV file named `spacex_raw_data.csv`. This CSV file serves as a crucial checkpoint, providing a persistent copy of the raw, integrated, and filtered dataset before any further feature engineering or transformation. This allows for reproducibility and provides a stable base for the next stages of the project.

Display the structure and first few rows of the final raw DataFrame and save it to a CSV file. This file serves as the checkpoint for the raw data collected.

In [48]:
# Final Output: Display .info() and .head() of the final DataFrame
if not falcon9_raw_df.empty:
    print("\nFinal Raw DataFrame Info:")
    falcon9_raw_df.info()

    print("\nFinal Raw DataFrame Head:")
    display(falcon9_raw_df.head())

    # Save the final, merged, and filtered raw DataFrame to a CSV file
    try:
        falcon9_raw_df.to_csv('spacex_raw_data.csv', index=False)
        print("\nSuccessfully saved 'spacex_raw_data.csv'")
        # Optional: Trigger download in Colab
        # from google.colab import files
        # files.download('spacex_raw_data.csv')
    except Exception as e:
        print(f"\nError saving CSV file: {e}")
else:
    print("\nFinal DataFrame is empty. Cannot display info/head or save CSV.")


Final Raw DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 182 entries, 5 to 186
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   FlightNumber        182 non-null    int64  
 1   DateUTC             182 non-null    object 
 2   LaunchSite          182 non-null    object 
 3   PayloadMass         159 non-null    float64
 4   Orbit               181 non-null    object 
 5   BoosterVersion      182 non-null    object 
 6   CoreSerial          182 non-null    object 
 7   CoreFlights         182 non-null    int64  
 8   CoreGridFins        182 non-null    bool   
 9   CoreLegs            182 non-null    bool   
 10  CoreReused          182 non-null    bool   
 11  CoreLandingAttempt  182 non-null    bool   
 12  CoreLandingSuccess  156 non-null    object 
 13  CoreLandingType     158 non-null    object 
 14  CoreLandingPad      151 non-null    object 
dtypes: bool(4), float64(1), int64(2), object(8)

| | FlightNumber | DateUTC | LaunchSite | PayloadMass | Orbit | BoosterVersion | CoreSerial | CoreFlights | CoreGridFins | CoreLegs | CoreReused | CoreLandingAttempt | CoreLandingSuccess | CoreLandingType | CoreLandingPad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 6 | 2010-06-04T18:45:00.000Z | CCSFS SLC 40 | | LEO | Falcon 9 | B0003 | 1 | False | False | False | False | | | |
| 6 | 7 | 2010-12-08T15:43:00.000Z | CCSFS SLC 40 | | LEO | Falcon 9 | B0004 | 1 | False | False | False | False | | | |
| 7 | 8 | 2012-05-22T07:44:00.000Z | CCSFS SLC 40 | 525.0 | LEO | Falcon 9 | B0005 | 1 | False | False | False | False | | | |
| 8 | 9 | 2012-10-08T00:35:00.000Z | CCSFS SLC 40 | 400.0 | ISS | Falcon 9 | B0006 | 1 | False | False | False | False | | | |
| 9 | 10 | 2013-03-01T19:10:00.000Z | CCSFS SLC 40 | 677.0 | ISS | Falcon 9 | B0007 | 1 | False | False | False | False | | | |



Successfully saved 'spacex_raw_data.csv'
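
A quick way to confirm that a checkpoint is usable is to read it back and compare it with what was written. The sketch below does this for a reduced frame built from two rows of the preview above, writing to a temporary directory so no real file is touched. The variable names are illustrative.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the preview above, reduced to three columns
checkpoint = pd.DataFrame({
    "FlightNumber": [6, 8],
    "PayloadMass": [None, 525.0],
    "CoreSerial": ["B0003", "B0005"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "spacex_raw_data.csv")
    checkpoint.to_csv(path, index=False)  # Same call as in the notebook
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same shape and column order
same_shape = reloaded.shape == checkpoint.shape
same_columns = list(reloaded.columns) == list(checkpoint.columns)
print(same_shape, same_columns)  # True True
```

CSV does not store dtypes, so columns such as `DateUTC` come back as plain strings unless they are parsed again on load (for example with the `parse_dates` argument of `pd.read_csv`).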
