![Skills Network Logo](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/SN_web_light

# Space X Falcon 9

## Web scraping Falcon 9 an

In this lab, you will be performing web scraping to collect Falcon 9 historical launch records from a Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`

https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches

![Falcon 9 image](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/Falcon9.png)

Falcon 9 first stage will land successfully

![Landing success gif](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing_1.gif)

Several examples of an unsuccessful landing are shown here:

![Crash landing gif](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)

More specifically, the launch records are stored in a HTML table shown below:

![Wikipedia table screenshot](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/falcon9-launches-wiki.png)

## Objectives
Web scrap Falcon 9 launch records with `BeautifulSoup`: 
- Extract a Falcon 9 launch records HTML table from Wikipedia
- Parse the table and convert it into a Pandas data

First let's import required packages for this lab

In [56]:
!pip3 install beautifulsoup4



In [57]:
import sys
import requests
from bs4 import BeautifulSoup
import re
import unicodedata
import pandas as pd

and we will provide some helper functions for you

In [58]:
def date_time(table_cells):
    """
    This function returns the data and time from the HTML table cell
    Input: the element of a table data cell extracts extra row
    """
    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]

def booster_version(table_cells):
    """
    This function returns the booster version from the HTML table cell 
    Input: the element of a table data cell extracts extra row
    """
    out=''.join([booster_version for i,booster_version in enumerate(table_cells.strings) if i%2==0][0:-1])
    return out

def landing_status(table_cells):
    """
    This function returns the landing status from the HTML table cell 
    Input: the element of a table data cell extracts extra row
    """
    out = [i for i in table_cells.strings][0]
    return out

def get_mass(table_cells):
    mass = unicodedata.normalize("NFKD", table_cells.text).strip()
    if mass:
        mass.find("kg")
        new_mass = mass[0:mass.find("kg")+2]
    else:
        new_mass = 0
    return new_mass

def extract_column_from_header(row):
    """
    This function returns the landing status from the HTML table cell 
    Input: the element of a table data cell extracts extra row
    """
    if (row.br):
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
        
    colunm_name = ' '.join(row.contents)
    
    # Filter the digit and empty names
    if not(colunm_name.strip().isdigit()):
        colunm_name = colunm_name.strip()
    return colunm_name
        

To keep the lab tasks consistent, you will be asked to scrape the data from a snapshot of the  `List of Falcon 9 and Falcon Heavy launches` *Wikipedia page updated on* `

In [59]:
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

In [60]:
# TASK 1: Request the Falcon9 Launch Wiki page from its URL
# Get the response from the static URL
response = requests.get(static_url)
print("Response status code:", response.status_code)


Response status code: 200


In [61]:
# Create BeautifulSoup object from the HTML response
soup = BeautifulSoup(response.content, 'html.parser')


In [62]:
# Print the page title to verify if the BeautifulSoup object was created properly
print(soup.title)


<title>List of Falcon 9 and Falcon Heavy launches - Wikipedia</title>


In [63]:
# TASK 2: Extract all column/variable names from the HTML table header
# Find all tables on the wiki page
html_tables = soup.find_all('table')
print("Number of tables found:", len(html_tables))


Number of tables found: 25


In [64]:
# Get the third table which contains the actual launch records
first_launch_table = html_tables[2]
print("First launch table found")
print(first_launch_table)


First launch table found
<table class="wikitable plainrowheaders collapsible" style="width: 100%;">
<tbody><tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_11-0"><a href="#cite_note-booster-11"><span class="cite-bracket">[</span>b<span class="cite-bracket">]</span></a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-0"><a href="#cite_note-Dragon-12"><span class="cite-bracket">[</span>c<span class="cite-bracket">]</span></a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_land

In [65]:
# Extract column names from table headers
column_names = []

# Find all th elements in the first launch table
th_elements = first_launch_table.find_all('th')

# Extract column names using the provided function
for th in th_elements:
    name = extract_column_from_header(th)
    if name is not None and len(name) > 0:
        column_names.append(name)

print("Column names:", column_names)


Column names: ['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome', '1\n', '2\n', '3\n', '4\n', '5\n', '6\n', '7\n']


In [66]:
column_names = []

# Apply find_all() function with `th` element on first_launch_table
# Iterate each th element and apply the provided extract_column_from_header() to get a column name
# Append the Non-empty column name (`if name is not None and len(name) > 0`) into a list called column_names
for i in first_launch_table.find_all('th'):
    if extract_column_from_header(i) != None and len(extract_column_from_header(i)) > 0:
        column_names.append(extract_column_from_header(i))

print(column_names)



launch_dict= dict.fromkeys(column_names)
del launch_dict['Date and time ( )']

launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []
# Added some new columns
launch_dict['Version Booster']=[]
launch_dict['Booster landing']=[]
launch_dict['Date']=[]
launch_dict['Time']=[]

['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome', '1\n', '2\n', '3\n', '4\n', '5\n', '6\n', '7\n']


In [67]:
extracted_row = 0
#Extract each table
for table_number,table in enumerate(soup.find_all('table',"wikitable plainrowheaders collapsible")):
   # get table row
    for rows in table.find_all("tr"):
        #check to see if first table heading is as number corresponding to launch a number
        if rows.th:
            if rows.th.string:
                flight_number=rows.th.string.strip()
                flag=flight_number.isdigit()
        else:
            flag=False
        #get table element
        row=rows.find_all('td')
        #if it is number save cells in a dictonary
        if flag:
            extracted_row += 1
            # Flight Number value
            # TODO: Append the flight_number into launch_dict with key `Flight No.`
            #print(flight_number)
            datatimelist=date_time(row[0])
            launch_dict['Flight No.'].append(flight_number)


            # Date value
            # TODO: Append the date into launch_dict with key `Date`
            date = datatimelist[0].strip(',')
            launch_dict['Date'].append(date)
            #print(date)

            # Time value
            # TODO: Append the time into launch_dict with key `Time`
            time = datatimelist[1]
            launch_dict['Time'].append(time)
            #print(time)

            # Booster version
            # TODO: Append the bv into launch_dict with key `Version Booster`
            bv=booster_version(row[1])
            if not(bv):
                bv=row[1].a.string
            launch_dict['Version Booster'].append(bv)
            #print(bv)

            # Launch Site
            # TODO: Append the bv into launch_dict with key `Launch Site`
            launch_site = row[2].a.string
            launch_dict['Launch site'].append(launch_site)
            #print(launch_site)

            # Payload
            # TODO: Append the payload into launch_dict with key `Payload`
            payload = row[3].a.string
            launch_dict['Payload'].append(payload)
            #print(payload)

            # Payload Mass
            # TODO: Append the payload_mass into launch_dict with key `Payload mass`
            payload_mass = get_mass(row[4])
            launch_dict['Payload mass'].append(payload_mass)
            #print(payload)

            # Orbit
            # TODO: Append the orbit into launch_dict with key `Orbit`
            orbit = row[5].a.string
            launch_dict['Orbit'].append(orbit)
            #print(orbit)

            # Customer
            # TODO: Append the customer into launch_dict with key `Customer`
            if row[6].a is not None:
                customer = row[6].a.string
            else:
                customer = row[6].get_text(strip=True)

            print(customer)
            launch_dict['Customer'].append(customer)

            launch_dict['Customer'].append(customer)
            #print(customer)

            # Launch outcome
            # TODO: Append the launch_outcome into launch_dict with key `Launch outcome`
            launch_outcome = list(row[7].strings)[0]
            launch_dict['Launch outcome'].append(launch_outcome)
            #print(launch_outcome)

            # Booster landing
            # TODO: Append the launch_outcome into launch_dict with key `Booster landing`
            booster_landing = landing_status(row[8])
            launch_dict['Booster landing'].append(booster_landing)
            #print(booster_landing)

SpaceX
NASA
NASA
NASA
NASA
MDA
SES
Thaicom
NASA
Orbcomm
AsiaSat
AsiaSat
NASA
NASA
USAF
ABS
NASA
None
NASA
Orbcomm
NASA
SES
NASA
SKY Perfect JSAT Group
Thaicom
ABS
NASA
SKY Perfect JSAT Group
Iridium Communications
NASA
EchoStar
SES
NRO
Inmarsat
NASA
Bulsatcom
Iridium Communications
Intelsat
NASA
NSPO
USAF
Iridium Communications
SES S.A.
KT Corporation
NASA
Iridium Communications
Northrop Grumman
SES
Hisdesat
Hispasat
Iridium Communications
NASA
NASA
Thales-Alenia
Iridium Communications
SES
NASA
Telesat
Iridium Communications
Telkom Indonesia
Telesat
CONAE
Es'hailSat
Spaceflight Industries
NASA
USAF
Iridium Communications
PSN
NASA
NASA
SpaceX
Canadian Space Agency
NASA
Spacecom
SpaceX
NASA
Sky Perfect JSAT
SpaceX
NASA
SpaceX
SpaceX
NASA
SpaceX
SpaceX
NASA
SpaceX
SpaceX
U.S. Space Force
Republic of Korea Army
SpaceX
SpaceX
CONAE
SpaceX
SpaceX
SpaceX
SpaceX
USSF
NASA
NASA
SpaceX
NASA
Sirius XM
NRO
Türksat
SpaceX
Various
SpaceX
SpaceX
SpaceX
SpaceX
SpaceX
SpaceX
SpaceX
NASA
SpaceX
SpaceX
S

In [68]:
# Create DataFrame from launch_dict
df = pd.DataFrame({key: pd.Series(value) for key, value in launch_dict.items()})
print("DataFrame created successfully!")
print("DataFrame shape:", df.shape)
print("\nFirst few rows:")
df.head()


DataFrame created successfully!
DataFrame shape: (242, 18)

First few rows:


Unnamed: 0,Flight No.,Launch site,Payload,Payload mass,Orbit,Customer,Launch outcome,1\n,2\n,3\n,4\n,5\n,6\n,7\n,Version Booster,Booster landing,Date,Time
0,1,CCAFS,Dragon Spacecraft Qualification Unit,0,LEO,SpaceX,Success\n,,,,,,,,F9 v1.07B0003.18,Failure,4 June 2010,18:45
1,2,CCAFS,Dragon,0,LEO,SpaceX,Success,,,,,,,,F9 v1.07B0004.18,Failure,8 December 2010,15:43
2,3,CCAFS,Dragon,525 kg,LEO,NASA,Success,,,,,,,,F9 v1.07B0005.18,No attempt\n,22 May 2012,07:44
3,4,CCAFS,SpaceX CRS-1,"4,700 kg",LEO,NASA,Success\n,,,,,,,,F9 v1.07B0006.18,No attempt,8 October 2012,00:35
4,5,CCAFS,SpaceX CRS-2,"4,877 kg",LEO,NASA,Success\n,,,,,,,,F9 v1.07B0007.18,No attempt\n,1 March 2013,15:10


In [73]:
df

Unnamed: 0,Flight No.,Launch site,Payload,Payload mass,Orbit,Customer,Launch outcome,1\n,2\n,3\n,4\n,5\n,6\n,7\n,Version Booster,Booster landing,Date,Time
0,1,CCAFS,Dragon Spacecraft Qualification Unit,0,LEO,SpaceX,Success\n,,,,,,,,F9 v1.07B0003.18,Failure,4 June 2010,18:45
1,2,CCAFS,Dragon,0,LEO,SpaceX,Success,,,,,,,,F9 v1.07B0004.18,Failure,8 December 2010,15:43
2,3,CCAFS,Dragon,525 kg,LEO,NASA,Success,,,,,,,,F9 v1.07B0005.18,No attempt\n,22 May 2012,07:44
3,4,CCAFS,SpaceX CRS-1,"4,700 kg",LEO,NASA,Success\n,,,,,,,,F9 v1.07B0006.18,No attempt,8 October 2012,00:35
4,5,CCAFS,SpaceX CRS-2,"4,877 kg",LEO,NASA,Success\n,,,,,,,,F9 v1.07B0007.18,No attempt\n,1 March 2013,15:10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
237,,,,,,SpaceX,,,,,,,,,,,,
238,,,,,,NASA,,,,,,,,,,,,
239,,,,,,NASA,,,,,,,,,,,,
240,,,,,,Sirius XM,,,,,,,,,,,,


In [74]:
# Answer the questions using SpaceX API data collection
import json

# SpaceX API data collection
spacex_url = "https://api.spacexdata.com/v4/launches/past"
response_api = requests.get(spacex_url)

if response_api.status_code == 200:
    data = response_api.json()
    df_api = pd.json_normalize(data)
    print("API DataFrame created successfully!")
    print("API DataFrame shape:", df_api.shape)
    
    # Question 1: First row static_fire_date_utc year
    if 'static_fire_date_utc' in df_api.columns:
        first_static_fire = df_api['static_fire_date_utc'].iloc[0]
        if first_static_fire:
            year = first_static_fire[:4]
            print(f"\nQuestion 1 Answer: Year in first row static_fire_date_utc: {year}")
        else:
            print(f"\nQuestion 1 Answer: First row static_fire_date_utc is null")
    
    # Question 2: Falcon 9 launches after removing Falcon 1
    # Get rocket details to filter Falcon 9
    rocket_url = "https://api.spacexdata.com/v4/rockets"
    rocket_response = requests.get(rocket_url)
    
    if rocket_response.status_code == 200:
        rocket_data = rocket_response.json()
        falcon9_id = None
        for rocket in rocket_data:
            if rocket['name'] == 'Falcon 9':
                falcon9_id = rocket['id']
                break
        
        if falcon9_id:
            falcon9_launches = df_api[df_api['rocket'] == falcon9_id]
            print(f"\nQuestion 2 Answer: Number of Falcon 9 launches: {len(falcon9_launches)}")
    
    # Question 3: Missing values in landingPad column
    # Extract landing pad information from cores
    landing_pad_count = 0
    missing_landing_pad = 0
    
    for cores_list in df_api['cores']:
        if cores_list:
            for core in cores_list:
                landing_pad_count += 1
                if not core.get('landing_pad'):
                    missing_landing_pad += 1
        else:
            landing_pad_count += 1
            missing_landing_pad += 1
    
    print(f"\nQuestion 3 Answer: Missing values in landingPad: {missing_landing_pad}")

# Question 4: BeautifulSoup output from Falcon9 Launch Wiki
print(f"\nQuestion 4 Answer: BeautifulSoup object title: {soup.title}")


API DataFrame created successfully!
API DataFrame shape: (187, 43)

Question 1 Answer: Year in first row static_fire_date_utc: 2006

Question 2 Answer: Number of Falcon 9 launches: 179

Question 3 Answer: Missing values in landingPad: 193

Question 4 Answer: BeautifulSoup object title: <title>List of Falcon 9 and Falcon Heavy launches - Wikipedia</title>


In [70]:
# Export to CSV
df.to_csv('spacex_web_scraped.csv', index=False)
print("Data exported to spacex_web_scraped.csv")

Data exported to spacex_web_scraped.csv


In [75]:
# Answer the questions using the scraped Wikipedia data (from static_url)
print("Answering questions using the scraped Wikipedia data:")

# Question 1: This question refers to API data, but we'll work with what we have
# Since we don't have static_fire_date_utc in our scraped data, we'll note this
print("\nQuestion 1: The scraped Wikipedia data doesn't contain static_fire_date_utc column.")
print("This question requires SpaceX API data, not Wikipedia scraped data.")

# Question 2: Count Falcon 9 launches after removing Falcon 1 launches
# Filter out Falcon 1 launches (they would have different version patterns)
falcon9_df = df[df['Version Booster'].str.contains('F1', na=False)]
falcon9_count = len(df) - len(falcon9_df)
print(f"\nQuestion 2 Answer: Number of Falcon 9 launches from Wikipedia data: {falcon9_count}")

# Question 3: Missing values in landingPad column using isnull() method
# In our scraped data, the equivalent would be 'Booster landing' column
if 'Booster landing' in df.columns:
    missing_landing_values = df['Booster landing'].isnull().sum()
    print(f"\nQuestion 3 Answer: Missing values in 'Booster landing' column: {missing_landing_values}")
    
    # Let's also check for empty strings or specific values that indicate missing data
    # Count entries that are null, empty, or contain "No attempt"
    no_landing_data = df['Booster landing'].isnull().sum() + \
                     (df['Booster landing'] == '').sum() + \
                     df['Booster landing'].str.contains('No attempt', na=False).sum()
    print(f"Total missing/no attempt landing data: {no_landing_data}")

# Question 4: BeautifulSoup output from Falcon9 Launch Wiki
print(f"\nQuestion 4 Answer: BeautifulSoup object title: {soup.title}")

# Additional analysis of the scraped data
print(f"\nAdditional information from scraped data:")
print(f"Total launches scraped: {len(df)}")
print(f"Date range: {df['Date'].min()} to {df['Date'].max()}")
print(f"Unique launch sites: {df['Launch site'].unique()}")

Answering questions using the scraped Wikipedia data:

Question 1: The scraped Wikipedia data doesn't contain static_fire_date_utc column.
This question requires SpaceX API data, not Wikipedia scraped data.

Question 2 Answer: Number of Falcon 9 launches from Wikipedia data: 242

Question 3 Answer: Missing values in 'Booster landing' column: 121
Total missing/no attempt landing data: 143

Question 4 Answer: BeautifulSoup object title: <title>List of Falcon 9 and Falcon Heavy launches - Wikipedia</title>

Additional information from scraped data:
Total launches scraped: 242


TypeError: '<=' not supported between instances of 'str' and 'float'