# Part 2: Web Scraping Falcon 9 and Falcon Heavy Launch Records from Wikipedia


This notebook scrapes historical Falcon 9 launch records from the Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`:

https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/Falcon9_rocket_family.svg)


A successful Falcon 9 first-stage landing looks like this:

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing_1.gif)


Several examples of an unsuccessful landing are shown here:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)


More specifically, the launch records are stored in an HTML table, shown below:

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/falcon9-launches-wiki.png)


## Objectives
Web scrape Falcon 9 launch records with `BeautifulSoup`:
- Extract a Falcon 9 launch records HTML table from Wikipedia
- Parse the table and convert it into a Pandas data frame


Importing required packages

In [5]:
!pip3 install beautifulsoup4
!pip3 install requests



In [6]:
import sys

import requests
from bs4 import BeautifulSoup
import re
import unicodedata
import pandas as pd

Helper functions for web scraping

In [34]:
def date_time(table_cells):
    """
    Return the date and time from an HTML table cell.
    Input: a table data cell (<td>) element
    """
    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]

def booster_version(table_cells):
    """
    Return the booster version from an HTML table cell.
    Input: a table data cell (<td>) element
    """
    # Join every other text node, dropping the last one
    out = ''.join([text for i, text in enumerate(table_cells.strings) if i % 2 == 0][0:-1])
    return out

def landing_status(table_cells):
    """
    Return the landing status from an HTML table cell.
    Input: a table data cell (<td>) element
    """
    out = [i for i in table_cells.strings][0]
    return out


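def get_mass(table_cells):
    """
    Return the payload mass text up to and including "kg", or 0 for an empty cell.
    Input: a table data cell (<td>) element
    """
    mass = unicodedata.normalize("NFKD", table_cells.text).strip()
    if mass:
        # Keep everything up to and including the "kg" unit
        new_mass = mass[0:mass.find("kg") + 2]
    else:
        new_mass = 0
    return new_mass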


def extract_column_from_header(row):
    """
    Return the column name from an HTML table header cell.
    Input: a table header cell (<th>) element
    """
    # Remove the first line break, link, and footnote marker before reading the text
    if row.br:
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()

    # Join the remaining text fragments, dropping empty strings
    column_name = ' '.join(filter(None, (item.strip() for item in row.stripped_strings)))
    # Return the name unless it is purely numeric (None is returned implicitly otherwise)
    if not column_name.isdigit():
        return column_name
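
The cell helpers above all rely on the `.strings` generator of a `BeautifulSoup` tag, which yields each text node in document order. As a minimal sketch, assuming a hand-written cell shaped like a date/time cell in the launch table (the markup below is illustrative, not copied from the page), the date and time can be separated like this:

```python
from bs4 import BeautifulSoup

# Hypothetical cell markup, shaped like a date/time cell in the launch table
html = "<table><tr><td>4 June 2010,<br/>18:45</td></tr></table>"
cell = BeautifulSoup(html, "html.parser").td

# .strings yields each text node in document order; <br/> contributes no text
parts = [s.strip() for s in cell.strings]
print(parts)  # ['4 June 2010,', '18:45']

# Same post-processing the parsing loop applies: drop the trailing comma from the date
launch_date = parts[0].strip(',')
launch_time = parts[1]
print(launch_date, launch_time)
```

Cells with a different internal layout (extra links or footnotes) produce more text nodes, which is why each helper picks specific positions from the list.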
      

Scraping the data from a snapshot of the `List of Falcon 9 and Falcon Heavy launches` Wikipedia page, last updated on `9th June 2021`



In [10]:
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

### 1 Requesting the Falcon 9 Launch Wiki page from its URL

Performing an HTTP GET request to retrieve the Falcon 9 launch HTML page as an HTTP response.

In [11]:
# Performing an HTTP GET request to retrieve the HTML content
response = requests.get(static_url)

# Check the response status code
if response.status_code == 200:
    print("HTTP GET request successful.")
else:
    print(f"HTTP GET request failed with status code: {response.status_code}")

HTTP GET request successful.
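
The status check above can also be expressed with `raise_for_status()`, which raises an exception on 4xx/5xx responses instead of printing a message. The following is a sketch under assumptions, not the notebook's method: `fetch_html` is a hypothetical helper, the `User-Agent` string is a placeholder (Wikimedia's User-Agent policy asks clients to identify themselves), and the 10-second timeout is an arbitrary choice.

```python
import requests

# Placeholder identification string; replace with your own project and contact details
DEFAULT_HEADERS = {"User-Agent": "falcon9-scraping-notebook/0.1 (contact: you@example.org)"}

def fetch_html(url, timeout=10.0):
    """Hypothetical helper: return the raw page bytes or raise on an HTTP error."""
    response = requests.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(static_url)` would then return the bytes that are passed to `BeautifulSoup` in the next step.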


Creating a `BeautifulSoup` object from the HTML `response`

In [12]:
# Creating a BeautifulSoup object from the HTML content
soup = BeautifulSoup(response.content, 'html.parser')
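
A quick way to confirm that the parsed object is the expected page is to read its `<title>` element. The sketch below uses a small stand-in document rather than the real response, so the title text is an assumption made for illustration:

```python
from bs4 import BeautifulSoup

# Stand-in for response.content; the real page is far larger
html = b"<html><head><title>List of Falcon 9 and Falcon Heavy launches - Wikipedia</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# soup.title is the first <title> tag; .string is its text
print(soup.title.string)
```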

### 2 Extracting all column/variable names from the HTML table header

In [14]:
# Using find_all function to find all tables in the HTML and assigning the result to a list called `html_tables`
html_tables = soup.find_all('table')

Checking contents of `html_tables`

In [None]:
# Iterate through each table and print its contents
for index, table in enumerate(html_tables):
    print(f"Table {index + 1}:")
    print(table.prettify())  # Printing the table's contents with proper formatting
    print("=" * 40)  # Printing a separator line

The target tables containing the actual launch records start from the third table

In [None]:
# Print the third table and check its content
first_launch_table = html_tables[2]
print(first_launch_table)
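
Selecting a table by position depends on the page layout staying the same. An alternative is to select by CSS class. The sketch below uses made-up markup to show how the matching rules differ: a multi-class string passed to `find_all` is compared against the exact `class` attribute value, whereas a CSS selector matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one unrelated table and two launch-style tables
html = """
<table class="infobox"><tr><td>Summary box</td></tr></table>
<table class="wikitable plainrowheaders collapsible"><tr><th>Flight No.</th></tr></table>
<table class="collapsible wikitable plainrowheaders"><tr><th>Flight No.</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match on the class attribute string (class order matters)
exact = soup.find_all("table", "wikitable plainrowheaders collapsible")
# CSS selector: all three classes required, in any order
any_order = soup.select("table.wikitable.plainrowheaders.collapsible")
# Single class: matches any table that carries it
single = soup.find_all("table", class_="wikitable")

print(len(exact), len(any_order), len(single))  # 1 2 2
```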

Column names are embedded in `<th>` elements as follows:

```
<tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_11-0"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-0"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
```

Iterating through the `<th>` elements and applying the previously defined `extract_column_from_header()` to extract the column names one by one

In [35]:
column_names = []

# Start iterating from the third table (index 2)
for table in html_tables[2:]:
    # Iterate through each th element in the current table
    for th in table.find_all('th'):
        # Apply the extract_column_from_header() helper defined above to get a column name
        name = extract_column_from_header(th)

        # Keep the name only if it is not None and not empty
        if name is not None and len(name) > 0:
            column_names.append(name)

print("Extracted Column Names:")
print(column_names)

Extracted Column Names:
['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome', 'Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome', 'Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome', 'Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome', 'N/A', 'Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome', 'Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome', 'Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome', 'FH 2', 'FH 3', 'Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome', 'Flight No.', 'Date and 
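
The output contains `'Date and time ( )'` and no `'Version, Booster'` or `'Booster landing'` entries because `extract_column_from_header()` removes the first `<a>` element of each header cell before reading its text. The sketch below repeats the helper so that it runs on its own and applies it to two of the header cells shown earlier (attributes shortened for readability):

```python
from bs4 import BeautifulSoup

def extract_column_from_header(row):
    # Same logic as the helper defined earlier, repeated so the snippet is self-contained
    if row.br:
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
    column_name = ' '.join(filter(None, (item.strip() for item in row.stripped_strings)))
    if not column_name.isdigit():
        return column_name

html = """<table><tr>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters">Version,<br/>Booster</a> <sup class="reference"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
</tr></table>"""

headers = BeautifulSoup(html, "html.parser").find_all("th")
names = [extract_column_from_header(th) for th in headers]

# The UTC link is removed from the first cell; the second cell's whole label was a link
print(names)  # ['Date and time ( )', '']
```

The empty string is then discarded by the `len(name) > 0` check, which is why the booster columns are added to the dictionary by hand in the next section.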

### 3 Creating a data frame by parsing the launch HTML tables

Creating an empty dictionary with keys from the extracted column names

In [36]:
launch_dict = dict.fromkeys(column_names)

# Removing an irrelevant column
del launch_dict['Date and time ( )']

# Initialize each launch_dict value as an empty list
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []
# Adding some new columns
launch_dict['Version Booster']=[]
launch_dict['Booster landing']=[]
launch_dict['Date']=[]
launch_dict['Time']=[]
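
Each key is assigned its own list above rather than being given a default through `dict.fromkeys(column_names, [])`. That matters because `dict.fromkeys` reuses one default object for every key. A short illustration, using a made-up subset of column names:

```python
columns = ["Flight No.", "Launch site", "Payload"]  # illustrative subset

# dict.fromkeys reuses the same list object for every key
shared = dict.fromkeys(columns, [])
shared["Flight No."].append("1")
print(shared["Payload"])    # ['1'], because all keys point at one list

# A dict comprehension creates an independent list per key
separate = {name: [] for name in columns}
separate["Flight No."].append("1")
print(separate["Payload"])  # []
```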

In [38]:
extracted_row = 0

# Extract each table
for table_number, table in enumerate(soup.find_all('table', "wikitable plainrowheaders collapsible")):
    # Get each table row
    for rows in table.find_all("tr"):
        # Check whether the row's first header cell holds a flight number
        flag = False
        if rows.th and rows.th.string:
            flight_number = rows.th.string.strip()
            flag = flight_number.isdigit()
        # Get the row's data cells
        row = rows.find_all('td')
        # If the row is a launch record, save its cells in the dictionary
        if flag:
            extracted_row += 1
            # Flight Number value
            launch_dict['Flight No.'].append(flight_number)

            datatimelist = date_time(row[0])

            # Date value
            launch_dict['Date'].append(datatimelist[0].strip(','))

            # Time value
            launch_dict['Time'].append(datatimelist[1])

            # Booster version
            bv = booster_version(row[1])
            if not bv:
                bv = row[1].a.string
            launch_dict['Version Booster'].append(bv)

            # Launch Site
            launch_dict['Launch site'].append(row[2].a.string)

            # Payload
            launch_dict['Payload'].append(row[3].a.string)

            # Payload Mass
            launch_dict['Payload mass'].append(get_mass(row[4]))

            # Orbit
            launch_dict['Orbit'].append(row[5].a.string)

            # Customer
            customer_element = row[6].a
            customer = customer_element.string if customer_element else row[6].get_text()
            launch_dict['Customer'].append(customer)

            # Launch outcome
            launch_dict['Launch outcome'].append(list(row[7].strings)[0])

            # Booster landing
            launch_dict['Booster landing'].append(landing_status(row[8]))
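
The row filter used in this loop can be seen in isolation on a small hand-written table. The sketch below is an alternative formulation, not the notebook's code: it collects one dictionary per launch row, so every record carries all of its fields together. The markup and the two columns are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative table: a header row, two launch rows, and one remarks row
html = """
<table class="wikitable plainrowheaders collapsible">
<tr><th scope="col">Flight No.</th><th scope="col">Launch site</th></tr>
<tr><th scope="row">1</th><td><a href="#">CCAFS</a></td></tr>
<tr><td colspan="2">Free-text remarks about the flight</td></tr>
<tr><th scope="row">2</th><td><a href="#">CCAFS</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for tr in soup.find("table").find_all("tr"):
    # A launch row starts with a <th> whose text is a flight number
    header = tr.th
    label = header.get_text(strip=True) if header else ""
    if not label.isdigit():
        continue
    cells = tr.find_all("td")
    records.append({"Flight No.": label, "Launch site": cells[0].get_text(strip=True)})

print(records)
```

Because each record is appended only once all of its fields have been read, a row that fails partway through cannot leave some columns longer than others.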

Checking the length of each list in the dictionary before converting it to a pandas data frame

In [40]:
for key, value in launch_dict.items():
    # Check if the value is not None before calculating the length
    if value is not None:
        print(f"Length of {key}: {len(value)}")
    else:
        print(f"Length of {key}: 0")

Length of Flight No.: 227
Length of Launch site: 227
Length of Payload: 227
Length of Payload mass: 227
Length of Orbit: 227
Length of Customer: 226
Length of Launch outcome: 226
Length of N/A: 0
Length of FH 2: 0
Length of FH 3: 0
Length of t e SpaceX missions and payloads: 0
Length of Demo flights: 0
Length of logistics: 0
Length of Crewed missions: 0
Length of Commercial satellites: 0
Length of Scientific satellites: 0
Length of Military satellites: 0
Length of Rideshares: 0
Length of Transporter missions: 0
Length of t e SpaceX: 0
Length of Current: 0
Length of In development: 0
Length of Retired: 0
Length of Cancelled: 0
Length of Spacecraft: 0
Length of Cargo: 0
Length of Crewed: 0
Length of Test vehicles: 0
Length of Unflown: 0
Length of Orbital: 0
Length of Atmospheric: 0
Length of Landing sites: 0
Length of Other facilities: 0
Length of Support: 0
Length of Contracts: 0
Length of R&D programs: 0
Length of Key people: 0
Length of Related: 0
Length of t e Spaceflight lists and t

Dropping keys whose values are `None` or have length 0

In [41]:
# Create a new dictionary to store non-empty lists
non_empty_launch_dict = {}

# Iterate through the dictionary items
for key, value in launch_dict.items():
    # Check if the value list has a length greater than 0
    if value is not None and len(value) > 0:
        non_empty_launch_dict[key] = value

# Replace the original launch_dict with the non-empty dictionary
launch_dict = non_empty_launch_dict
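
The same filtering can be written as a single dictionary comprehension, since both `None` and an empty list are falsy in Python. A minimal sketch on a stand-in dictionary (keys and values are illustrative):

```python
# Stand-in dictionary: one populated list, two unused keys, one empty list
launch_dict = {"Flight No.": ["1", "2"], "N/A": None, "FH 2": None, "Date": []}

# Keep only the keys whose value is a non-empty list
launch_dict = {key: value for key, value in launch_dict.items() if value}

print(launch_dict)  # {'Flight No.': ['1', '2']}
```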

In [42]:
# Iterate through the dictionary items
for key, value in launch_dict.items():
    # Calculate and print the length of each list
    print(f"Length of {key}: {len(value)}")

Length of Flight No.: 227
Length of Launch site: 227
Length of Payload: 227
Length of Payload mass: 227
Length of Orbit: 227
Length of Customer: 226
Length of Launch outcome: 226
Length of Version Booster: 227
Length of Booster landing: 226
Length of Date: 227
Length of Time: 227


Trimming all lists to 226 entries, since `pd.DataFrame` requires every column to have the same length

In [43]:
desired_length = 226

# Iterate through the dictionary items
for key, value in launch_dict.items():
    # Truncate the list in place if it's longer than the desired length
    if len(value) > desired_length:
        del value[desired_length:]

In [44]:
# Iterate through the dictionary items
for key, value in launch_dict.items():
    # Calculate and print the length of each list
    print(f"Length of {key}: {len(value)}")

Length of Flight No.: 226
Length of Launch site: 226
Length of Payload: 226
Length of Payload mass: 226
Length of Orbit: 226
Length of Customer: 226
Length of Launch outcome: 226
Length of Version Booster: 226
Length of Booster landing: 226
Length of Date: 226
Length of Time: 226
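
Rather than hard-coding 226, the target length can be derived from the shortest list. The sketch below uses a small stand-in dictionary. It assumes, as the trimming above does, that any surplus entries sit at the end of the longer lists; if they were elsewhere, truncation would misalign the rows.

```python
# Stand-in dictionary with uneven list lengths (values are illustrative)
launch_dict = {
    "Flight No.": ["1", "2", "3"],
    "Customer": ["SpaceX", "NASA"],
    "Date": ["4 June 2010", "8 December 2010", "22 May 2012"],
}

# Use the shortest list as the common length
target_length = min(len(values) for values in launch_dict.values())

# Slice every list to that length; shorter lists are left unchanged
trimmed = {key: values[:target_length] for key, values in launch_dict.items()}

print({key: len(values) for key, values in trimmed.items()})
# {'Flight No.': 2, 'Customer': 2, 'Date': 2}
```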


Converting to Pandas dataframe

In [46]:
df=pd.DataFrame(launch_dict)
df.head()

| | Flight No. | Launch site | Payload | Payload mass | Orbit | Customer | Launch outcome | Version Booster | Booster landing | Date | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CCAFS | Dragon Spacecraft Qualification Unit | 0 | LEO | SpaceX | Success\n | F9 v1.0B0003.1 | Failure | 4 June 2010 | 18:45 |
| 1 | 2 | CCAFS | Dragon | 0 | LEO | NASA | Success | F9 v1.0B0004.1 | Failure | 8 December 2010 | 15:43 |
| 2 | 3 | CCAFS | Dragon | 525 kg | LEO | NASA | Success | F9 v1.0B0005.1 | No attempt\n | 22 May 2012 | 07:44 |
| 3 | 4 | CCAFS | SpaceX CRS-1 | 4,700 kg | LEO | NASA | Success\n | F9 v1.0B0006.1 | No attempt | 8 October 2012 | 00:35 |
| 4 | 5 | CCAFS | SpaceX CRS-2 | 4,877 kg | LEO | NASA | Success\n | F9 v1.0B0007.1 | No attempt\n | 1 March 2013 | 15:10 |
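
Some `Launch outcome` and `Booster landing` values in the preview end with a newline character, so `Success` and `Success\n` would count as different categories. One possible cleanup, sketched here on a few stand-in rows copied from the preview rather than on the full data frame, is to strip surrounding whitespace with the pandas `.str.strip()` accessor:

```python
import pandas as pd

# Stand-in rows copied from the preview; the real df is built from launch_dict
df = pd.DataFrame({
    "Launch outcome": ["Success\n", "Success", "Success\n"],
    "Booster landing": ["Failure", "No attempt\n", "No attempt"],
})

# Strip leading and trailing whitespace (including newlines) from the text columns
for column in ["Launch outcome", "Booster landing"]:
    df[column] = df[column].str.strip()

print(df["Launch outcome"].tolist())   # ['Success', 'Success', 'Success']
print(df["Booster landing"].tolist())  # ['Failure', 'No attempt', 'No attempt']
```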


Exporting df to CSV

In [47]:
df.to_csv('spacex_web_scraped.csv', index=False)