<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# **SpaceX Falcon 9 First Stage Landing Prediction**


## Web Scraping Falcon 9 and Falcon Heavy Launch Records from Wikipedia


Date completed: 16 September 2024


This notebook collects Falcon 9 historical launch records from the Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`:

https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches


The launch records are stored in the HTML table shown below:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/falcon9-launches-wiki.png)


## Objectives
Web scrape Falcon 9 launch records with `BeautifulSoup`:
- Scrape the data from a snapshot of the `List of Falcon 9 and Falcon Heavy launches` Wikipedia page, last updated on 9 June 2021
- Parse the table and convert it into a Pandas data frame


----

## Import Libraries

In [2]:
# Requests sends HTTP requests to fetch the web page
import requests

# BeautifulSoup will be used for web scraping
from bs4 import BeautifulSoup

# Regular expressions for pattern matching
import re

# Unicode normalisation
import unicodedata

# Pandas for data manipulation
import pandas as pd

print("Libraries successfully imported")

Libraries successfully imported


## Define Auxiliary Functions


Helper functions for processing the web-scraped HTML table


In [4]:
def date_time(table_cells):
    """
    Return the date and time from an HTML table cell.
    Input: a table data cell (<td>) element
    """
    return [text.strip() for text in list(table_cells.strings)][0:2]

def booster_version(table_cells):
    """
    Return the booster version from an HTML table cell.
    Input: a table data cell (<td>) element
    """
    # Keep every other string in the cell, then drop the last one kept
    out = ''.join([text for i, text in enumerate(table_cells.strings) if i % 2 == 0][0:-1])
    return out

def landing_status(table_cells):
    """
    Return the landing status from an HTML table cell.
    Input: a table data cell (<td>) element
    """
    # The first string in the cell holds the status text
    out = [i for i in table_cells.strings][0]
    return out


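def get_mass(table_cells):
    """
    Return the payload mass (e.g. "4,700 kg") from an HTML table cell,
    or 0 when the cell contains no mass in kg.
    """
    mass = unicodedata.normalize("NFKD", table_cells.text).strip()
    kg_index = mass.find("kg")
    if kg_index != -1:
        # Keep everything up to and including the "kg" unit
        new_mass = mass[0:kg_index + 2]
    else:
        new_mass = 0
    return new_mass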


def extract_column_from_header(row):
    """
    Return the column name from an HTML table header cell.
    Input: a table header cell (<th>) element
    """
    # Remove line breaks, links, and footnote markers before reading the text
    if row.br:
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()

    column_name = ' '.join(row.contents)

    # Skip names that consist only of digits
    if not column_name.strip().isdigit():
        column_name = column_name.strip()
        return column_name
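
The `get_mass` helper normalises the cell text because numbers and units on Wikipedia are often separated by a non-breaking space rather than an ordinary one. The following is a minimal standalone sketch of that step; the function name `normalize_mass` and the sample string are illustrative assumptions, not part of the lab.

```python
import unicodedata

def normalize_mass(raw_text):
    """Hypothetical helper: return the leading '<number> kg' part, or None."""
    # NFKD maps the non-breaking space (U+00A0) to an ordinary space
    text = unicodedata.normalize("NFKD", raw_text).strip()
    kg_index = text.find("kg")
    if kg_index == -1:
        return None  # no mass in kilograms present
    # Keep everything up to and including the "kg" unit
    return text[:kg_index + 2]

# Made-up cell text in the style of the payload mass column
sample = "4,700\u00a0kg (10,400\u00a0lb)"
print(repr(sample))
print(repr(normalize_mass(sample)))
```

Checking the result of `find` explicitly matters here: a return value of `-1` would otherwise be used as a slice bound and silently keep only the first character of the text.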

### TASK 1: Request the Falcon9 Launch Wiki page from its URL


First, let's send an HTTP GET request for the Falcon 9 launch page and receive its HTML as an HTTP response.


In [7]:
# Use requests.get() with the provided static_url and keep the HTML text of the response

# URL of the Wikipedia page
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

# Send an HTTP GET request and get the HTML content
response = requests.get(static_url).text
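
The `oldid` query parameter in `static_url` pins the request to one specific revision of the page, which is what makes the scrape reproducible. As a rough sketch of how such a URL is put together and taken apart again (the helper name `build_snapshot_url` is an assumption for illustration, and no network request is made):

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://en.wikipedia.org/w/index.php"

def build_snapshot_url(title, oldid):
    """Hypothetical helper: build a link to one fixed revision of a page."""
    return f"{BASE}?{urlencode({'title': title, 'oldid': oldid})}"

url = build_snapshot_url("List_of_Falcon_9_and_Falcon_Heavy_launches", 1027686922)
print(url)

# Read the query parameters back out of the URL
query = parse_qs(urlsplit(url).query)
print(query["oldid"])
```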


Create a `BeautifulSoup` object from the HTML `response`


In [8]:
# Create a BeautifulSoup object from the response text
soup = BeautifulSoup(response, "html.parser")  # create a soup object

Print the page title to verify that the `BeautifulSoup` object was created properly


In [9]:
# Extract the title tag
tag_object = soup.title

# Print the title tag
print("tag object:", tag_object)

tag object: <title>List of Falcon 9 and Falcon Heavy launches - Wikipedia</title>


### TASK 2: Extract all column/variable names from the HTML table header


Next, we want to collect all relevant column names from the HTML table header


Let's try to find all tables on the wiki page first. If you need to refresh your memory about `BeautifulSoup`, please check the external reference link towards the end of this lab


In [11]:
# Use the find_all function in the BeautifulSoup object, with element type `table`
# Assign the result to a list called `html_tables`
html_tables = soup.find_all('table')


The third table on the page is the first of our target tables, which contain the actual launch records.


In [12]:
# Let's print the third table and check its content
first_launch_table = html_tables[2]
print(first_launch_table)

<table class="wikitable plainrowheaders collapsible" style="width: 100%;">
<tbody><tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_11-0"><a href="#cite_note-booster-11"><span class="cite-bracket">[</span>b<span class="cite-bracket">]</span></a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-0"><a href="#cite_note-Dragon-12"><span class="cite-bracket">[</span>c<span class="cite-bracket">]</span></a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 

Next, we just need to iterate through the `<th>` elements and apply the provided `extract_column_from_header()` to extract the column names one by one

In [13]:
# Empty list to hold the column names
column_names = []

# Iterate through each 'th' element in the table header
for th in first_launch_table.find_all('th'):
    # Extract column name by passing the 'th' element to the extract_column_from_header function
    col_name = extract_column_from_header(th)

    # Append the non-empty column name to the list
    if col_name is not None and len(col_name) > 0:
        column_names.append(col_name)


Check the extracted column names


In [14]:
# Display the extracted column names
print(column_names)

['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome']
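
The list above has no `Version, Booster` or `Booster landing` entry. This appears to be because the helper removes the first `<a>` element of each header cell, and in those two cells the link holds the whole label. As a point of comparison, here is a small sketch that collects header text while skipping only the `<sup>` footnote markers. It uses the standard library's `html.parser` instead of `BeautifulSoup`, and the class name and sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser

class HeaderTextParser(HTMLParser):
    """Collect the visible text of each <th>, ignoring <sup> footnotes."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._in_th = False
        self._skip_depth = 0  # greater than 0 while inside a <sup>
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_th = True
            self._parts = []
        elif tag == "sup" and self._in_th:
            self._skip_depth += 1
        elif tag == "br" and self._in_th:
            self._parts.append(" ")  # treat a line break as a space

    def handle_endtag(self, tag):
        if tag == "sup" and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "th" and self._in_th:
            self._in_th = False
            # Collapse runs of whitespace into single spaces
            name = " ".join("".join(self._parts).split())
            if name and not name.isdigit():
                self.headers.append(name)

    def handle_data(self, data):
        if self._in_th and not self._skip_depth:
            self._parts.append(data)

# Shortened, hand-written header row in the style of the launch table
sample_html = (
    '<table><tr>'
    '<th scope="col">Flight No.\n</th>'
    '<th scope="col">Date and<br/>time (<a href="/wiki/UTC">UTC</a>)\n</th>'
    '<th scope="col"><a href="/wiki/Boosters">Version,<br/>Booster</a> '
    '<sup class="reference"><a href="#note">[b]</a></sup>\n</th>'
    '<th scope="col">Launch<br/>outcome\n</th>'
    '</tr></table>'
)

parser = HeaderTextParser()
parser.feed(sample_html)
parser.close()
print(parser.headers)
```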


### TASK 3: Create a data frame by parsing the launch HTML tables


We will create an empty dictionary with keys from the extracted column names in the previous task. Later, this dictionary will be converted into a Pandas dataframe


In [17]:
launch_dict = dict.fromkeys(column_names)

# Remove the irrelevant column
del launch_dict['Date and time ( )']

# Initialise the launch_dict with empty lists
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []

# Added new columns
launch_dict['Version Booster'] = []
launch_dict['Booster landing'] = []
launch_dict['Date'] = []
launch_dict['Time'] = []
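
Each key is assigned its own empty list because `dict.fromkeys` fills every value with `None` by default, and passing a list as the default would make all keys share a single list. A short sketch with toy column names shows the difference, along with a dict comprehension as one possible shorthand:

```python
column_names = ["Flight No.", "Launch site", "Payload"]

# dict.fromkeys(keys) sets every value to None
placeholder = dict.fromkeys(column_names)

# Passing a list as the default shares ONE list between all keys
shared = dict.fromkeys(column_names, [])
shared["Flight No."].append("1")

# A dict comprehension creates a separate list for each key
independent = {name: [] for name in column_names}
independent["Flight No."].append("1")

print(placeholder)
print(shared)
print(independent)
```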

Next, we just need to fill up the `launch_dict` with launch records extracted from table rows.


HTML tables in Wiki pages often contain unexpected annotations and other noise, such as reference links (`B0004.1[8]`), placeholders for missing values (`N/A [e]`), and inconsistent formatting.
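
One way to deal with bracketed annotations like these is a regular expression that drops anything in square brackets. The snippet below is only a sketch under that assumption; `strip_footnotes` is a hypothetical name and is not one of the lab's helper functions.

```python
import re

# Matches bracketed footnote markers such as "[8]" or " [e]"
FOOTNOTE = re.compile(r"\s*\[[^\]]*\]")

def strip_footnotes(text):
    """Hypothetical helper: remove bracketed annotations and outer whitespace."""
    return FOOTNOTE.sub("", text).strip()

for raw in ["B0004.1[8]", "N/A [e]", "Success\n"]:
    print(repr(raw), "->", repr(strip_footnotes(raw)))
```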


In [18]:
# Fill up the launch_dict

# Counter for the number of launch rows extracted
extracted_row = 0

# Extract each table
for table_number, table in enumerate(soup.find_all('table', "wikitable plainrowheaders collapsible")):
    # Get table rows
    for rows in table.find_all("tr"):
        # Check if the first table heading is a number corresponding to launch number
        if rows.th and rows.th.string:
            flight_number = rows.th.string.strip()
            flag = flight_number.isdigit()  # Ensure it's a number
        else:
            flag = False

        # Get table elements
        row = rows.find_all('td')

        # If valid, store in the dictionary
        if flag:
            extracted_row += 1

            # Flight Number
            launch_dict['Flight No.'].append(flight_number)
            print(f"Flight No.: {flight_number}")

            # Date and Time
            datatimelist = date_time(row[0])
            launch_dict['Date'].append(datatimelist[0].strip(','))  # Date
            launch_dict['Time'].append(datatimelist[1])             # Time
            print(f"Date: {datatimelist[0].strip(',')}, Time: {datatimelist[1]}")

            # Booster Version
            bv = booster_version(row[1]) or (row[1].a.string if row[1].a else None)
            launch_dict['Version Booster'].append(bv)
            print(f"Booster Version: {bv}")

            # Launch Site
            launch_site = row[2].a.string if row[2].a else row[2].get_text(strip=True)
            launch_dict['Launch site'].append(launch_site)
            print(f"Launch Site: {launch_site}")

            # Payload
            payload = row[3].a.string if row[3].a else row[3].get_text(strip=True)
            launch_dict['Payload'].append(payload)
            print(f"Payload: {payload}")

            # Payload Mass
            payload_mass = get_mass(row[4])
            launch_dict['Payload mass'].append(payload_mass)
            print(f"Payload Mass: {payload_mass}")

            # Orbit
            orbit = row[5].a.string if row[5].a else row[5].get_text(strip=True)
            launch_dict['Orbit'].append(orbit)
            print(f"Orbit: {orbit}")

            # Customer
            try:
                customer = row[6].a.string if row[6].a else row[6].get_text(strip=True)
            except IndexError:
                customer = None  # Handle cases where customer data is missing
            launch_dict['Customer'].append(customer)
            print(f"Customer: {customer}")

            # Launch Outcome
            launch_outcome = list(row[7].strings)[0] if row[7] else None
            launch_dict['Launch outcome'].append(launch_outcome)
            print(f"Launch Outcome: {launch_outcome}")

            # Booster Landing
            booster_landing = landing_status(row[8]) if len(row) > 8 else None
            launch_dict['Booster landing'].append(booster_landing)
            print(f"Booster Landing: {booster_landing}")


Flight No.: 1
Date: 4 June 2010, Time: 18:45
Booster Version: F9 v1.07B0003.18
Launch Site: CCAFS
Payload: Dragon Spacecraft Qualification Unit
Payload Mass: 0
Orbit: LEO
Customer: SpaceX
Launch Outcome: Success

Booster Landing: Failure
Flight No.: 2
Date: 8 December 2010, Time: 15:43
Booster Version: F9 v1.07B0004.18
Launch Site: CCAFS
Payload: Dragon
Payload Mass: 0
Orbit: LEO
Customer: NASA
Launch Outcome: Success
Booster Landing: Failure
Flight No.: 3
Date: 22 May 2012, Time: 07:44
Booster Version: F9 v1.07B0005.18
Launch Site: CCAFS
Payload: Dragon
Payload Mass: 525 kg
Orbit: LEO
Customer: NASA
Launch Outcome: Success
Booster Landing: No attempt

Flight No.: 4
Date: 8 October 2012, Time: 00:35
Booster Version: F9 v1.07B0006.18
Launch Site: CCAFS
Payload: SpaceX CRS-1
Payload Mass: 4,700 kg
Orbit: LEO
Customer: NASA
Launch Outcome: Success

Booster Landing: No attempt
Flight No.: 5
Date: 1 March 2013, Time: 15:10
Booster Version: F9 v1.07B0007.18
Launch Site: CCAFS
Payload: SpaceX
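
The records above keep the date and time as separate strings, and the blank lines after some entries suggest that a few values carry a trailing newline. If a later analysis needs real timestamps, one possible approach (a sketch, assuming English month names and the `day month year` layout seen above) is to combine the two strings with `datetime.strptime`. The helper name is hypothetical.

```python
from datetime import datetime

def parse_launch_datetime(date_text, time_text):
    """Hypothetical helper: combine scraped date and time strings (UTC)."""
    # %B expects a full English month name under the default locale
    return datetime.strptime(f"{date_text} {time_text}", "%d %B %Y %H:%M")

# Values copied from the first printed record
launched = parse_launch_datetime("4 June 2010", "18:45")
print(launched.isoformat())

# str.strip() removes a trailing newline from a scraped value
print(repr("Success\n".strip()))
```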

In [21]:
# Count the number of data points in each column as a check for any errors
for column_name, values in launch_dict.items():
    print(f'{column_name}: {len(values)}')

Flight No.: 121
Launch site: 121
Payload: 121
Payload mass: 121
Orbit: 121
Customer: 121
Launch outcome: 121
Version Booster: 121
Booster landing: 121
Date: 121
Time: 121
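
All eleven columns hold 121 values, which is what we want. The check is worth automating: the next step wraps each list in `pd.Series`, so columns of unequal length would be padded with missing values instead of raising an error. A minimal sketch with toy dictionaries follows; the function names are assumptions for illustration.

```python
def column_lengths(columns):
    """Return the number of collected values per column."""
    return {name: len(values) for name, values in columns.items()}

def is_rectangular(columns):
    """True when every column holds the same number of values."""
    return len(set(column_lengths(columns).values())) <= 1

# Toy dictionaries, not the scraped data
complete = {"Flight No.": ["1", "2"], "Orbit": ["LEO", "LEO"]}
ragged = {"Flight No.": ["1", "2"], "Orbit": ["LEO"]}

print(column_lengths(ragged), is_rectangular(ragged))
print(is_rectangular(complete))
```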


Creating a DataFrame from `launch_dict`

In [22]:
# Convert dictionary to DataFrame
df = pd.DataFrame({ key:pd.Series(value) for key, value in launch_dict.items() })
df

|     | Flight No. | Launch site | Payload | Payload mass | Orbit | Customer | Launch outcome | Version Booster | Booster landing | Date | Time |
| --- | ---------- | ----------- | ------- | ------------ | ----- | -------- | -------------- | --------------- | --------------- | ---- | ---- |
| 0 | 1 | CCAFS | Dragon Spacecraft Qualification Unit | 0 | LEO | SpaceX | Success\n | F9 v1.07B0003.18 | Failure | 4 June 2010 | 18:45 |
| 1 | 2 | CCAFS | Dragon | 0 | LEO | NASA | Success | F9 v1.07B0004.18 | Failure | 8 December 2010 | 15:43 |
| 2 | 3 | CCAFS | Dragon | 525 kg | LEO | NASA | Success | F9 v1.07B0005.18 | No attempt\n | 22 May 2012 | 07:44 |
| 3 | 4 | CCAFS | SpaceX CRS-1 | 4,700 kg | LEO | NASA | Success\n | F9 v1.07B0006.18 | No attempt | 8 October 2012 | 00:35 |
| 4 | 5 | CCAFS | SpaceX CRS-2 | 4,877 kg | LEO | NASA | Success\n | F9 v1.07B0007.18 | No attempt\n | 1 March 2013 | 15:10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 116 | 117 | CCSFS | Starlink | 15,600 kg | LEO | SpaceX | Success\n | F9 B5B1051.10657 | Success | 9 May 2021 | 06:42 |
| 117 | 118 | KSC | Starlink | ~14,000 kg | LEO | SpaceX | Success\n | F9 B5B1058.8660 | Success | 15 May 2021 | 22:56 |
| 118 | 119 | CCSFS | Starlink | 15,600 kg | LEO | SpaceX | Success\n | F9 B5B1063.2665 | Success | 26 May 2021 | 18:59 |
| 119 | 120 | KSC | SpaceX CRS-22 | 3,328 kg | LEO | NASA | Success\n | F9 B5B1067.1668 | Success | 3 June 2021 | 17:29 |
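
The `Payload mass` column still holds strings such as `4,700 kg` and `~14,000 kg`, which cannot be used directly in numeric analysis. The conversion could look something like the sketch below; `mass_to_kg` is a hypothetical helper, and treating cells without a `kg` value as `None` is an assumption rather than the lab's convention.

```python
import re

def mass_to_kg(value):
    """Hypothetical helper: convert '4,700 kg' or '~14,000 kg' to a float."""
    # Capture digits with optional thousands separators and decimals before "kg"
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*kg", str(value))
    if match is None:
        return None  # e.g. the placeholder 0 or a cell without a mass
    return float(match.group(1).replace(",", ""))

# Sample values in the formats seen in the table above
for raw in ["4,700 kg", "~14,000 kg", "525 kg", 0]:
    print(repr(raw), "->", mass_to_kg(raw))
```

With the data frame above, such a function could be applied to the whole column with `df['Payload mass'].map(mass_to_kg)`.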


In [23]:
# Export DataFrame as .csv
df.to_csv('spacex_web_scraped.csv', index=False)

## Authors


<a href="https://www.linkedin.com/in/yan-luo-96288783/">Yan Luo</a>


<a href="https://www.linkedin.com/in/nayefaboutayoun/">Nayef Abou Tayoun</a>


<!--
## Change Log
-->


<!--
| Date (YYYY-MM-DD) | Version | Changed By | Change Description      |
| ----------------- | ------- | ---------- | ----------------------- |
| 2021-06-09        | 1.0     | Yan Luo    | Tasks updates           |
| 2020-11-10        | 1.0     | Nayef      | Created the initial version |
-->


Copyright © 2021 IBM Corporation. All rights reserved.
