# **Space X  Falcon 9 First Stage Landing Prediction**


## Web scraping Falcon 9 and Falcon Heavy Launches Records from Wikipedia


In this section, I'm diving into the process of web scraping to gather historical launch records for Falcon 9 from a Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`. The goal is to get a clear picture of the launch history and outcomes, setting the stage for further analysis.

Link to the Wikipedia page for reference:

https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/Falcon9_rocket_family.svg)


Here's an interesting tidbit: the first stage of the Falcon 9 rocket is designed to land back on Earth successfully, though, as you'll see, not every attempt goes as planned.

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing_1.gif)


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)


Now, let's look at the actual launch records, which are conveniently stored in a well-organized HTML table on the Wikipedia page. 

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/falcon9-launches-wiki.png)


  ## Objectives
For this part, my main objectives are:
- **Web scraping the Falcon 9 launch records using BeautifulSoup**: This involves extracting the HTML table with launch records from Wikipedia and converting it into a pandas DataFrame for easy data manipulation and analysis.

First let's import required packages for this lab


In [1]:
import sys
import requests
from bs4 import BeautifulSoup
import re
import unicodedata
import pandas as pd

I've also included a few helper functions to make processing the web-scraped HTML table a breeze. These functions will handle everything from extracting the date and time to getting the booster version and landing status.

Let's kick things off by scraping data from a snapshot of the Wikipedia page, specifically the version updated on June 9, 2021.

In [3]:
def date_time(table_cells):
    """
    This function returns the data and time from the HTML  table cell
    Input: the  element of a table data cell extracts extra row
    """
    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]

def booster_version(table_cells):
    """
    This function returns the booster version from the HTML  table cell 
    Input: the  element of a table data cell extracts extra row
    """
    out=''.join([booster_version for i,booster_version in enumerate( table_cells.strings) if i%2==0][0:-1])
    return out

def landing_status(table_cells):
    """
    This function returns the landing status from the HTML table cell 
    Input: the  element of a table data cell extracts extra row
    """
    out=[i for i in table_cells.strings][0]
    return out


def get_mass(table_cells):
    mass=unicodedata.normalize("NFKD", table_cells.text).strip()
    if mass:
        mass.find("kg")
        new_mass=mass[0:mass.find("kg")+2]
    else:
        new_mass=0
    return new_mass


def extract_column_from_header(row):
    """
    This function returns the landing status from the HTML table cell 
    Input: the  element of a table data cell extracts extra row
    """
    if (row.br):
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
        
    colunm_name = ' '.join(row.contents)
    
    if not(colunm_name.strip().isdigit()):
        colunm_name = colunm_name.strip()
        return colunm_name    


In [4]:
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

### Request the Falcon9 Launch Wiki page from its URL


**Creating a BeautifulSoup object**
To parse the HTML content, I create a BeautifulSoup object. This step is crucial as it allows us to navigate and search through the HTML structure easily.

In [7]:
response = requests.get(static_url)

soup = BeautifulSoup(response.content, 'html.parser')

print(soup.title)

<title>List of Falcon 9 and Falcon Heavy launches - Wikipedia</title>


### Extracting all column/variable names from the HTML table header


Next, I focus on extracting all relevant column names from the HTML table header. This part involves finding all tables on the page and isolating the one that contains the launch records.

In [10]:
html_tables = soup.find_all('table', class_='wikitable plainrowheaders collapsible')

Starting from the third table is our target table contains the actual launch records.

In [12]:
first_launch_table = html_tables[2]
print(first_launch_table)

<table class="wikitable plainrowheaders collapsible" style="width: 100%;">
<tbody><tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a><sup class="reference" id="cite_ref-booster_11-2"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-2"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
<tr>
<th rowspan="2" scope="row" style="text-align:center;">14
</th>
<td>

Next, we just need to iterate through the `<th>` elements and apply the provided `extract_column_from_header()` to extract column name one by one


In [14]:
column_names = []

for th in first_launch_table.find_all('th'):
    column_names.append(th.text.strip())

print(column_names)

['Flight No.', 'Date andtime (UTC)', 'Version,Booster[b]', 'Launch site', 'Payload[c]', 'Payload mass', 'Orbit', 'Customer', 'Launchoutcome', 'Boosterlanding', '14', '15', '16', '17', '18', '19', '20']


Checking the extracted column names.


In [16]:
print(column_names)

['Flight No.', 'Date andtime (UTC)', 'Version,Booster[b]', 'Launch site', 'Payload[c]', 'Payload mass', 'Orbit', 'Customer', 'Launchoutcome', 'Boosterlanding', '14', '15', '16', '17', '18', '19', '20']


## Creating a data frame by parsing the launch HTML tables


With the columns set, the next step is parsing the table rows into a dictionary. This dictionary will later be converted into a pandas DataFrame for easy handling.

In [19]:
launch_dict = {
    'Flight No.': [],
    'Date': [],
    'Time': [],
    'Version Booster': [],
    'Launch site': [],
    'Payload': [],
    'Payload mass': [],
    'Orbit': [],
    'Customer': [],
    'Launch outcome': [],
    'Booster landing': []
}

Next, we just need to fill up the `launch_dict` with launch records extracted from table rows.


Usually, HTML tables in Wiki pages are likely to contain unexpected annotations and other types of noises, such as reference links `B0004.1[8]`, missing values `N/A [e]`, inconsistent formatting, etc.


I iterate through the rows, extract data, and populate the dictionary. After filling the dictionary with all the launch details, I convert it into a DataFrame.

In [23]:
import numpy as np
for table_number, table in enumerate(soup.find_all('table', class_="wikitable plainrowheaders collapsible")):
    for rows in table.find_all("tr"):
        # Check if the row is a header row with a flight number
        if rows.th:
            if rows.th.string:
                flight_number = rows.th.string.strip()
                flag = flight_number.isdigit()
        else:
            flag = False
        
        row = rows.find_all('td')
        
        # If it's a valid flight number row, parse the data
        if flag:
            # Flight Number
            launch_dict['Flight No.'].append(flight_number)
            
            datatimelist = [i.strip() for i in row[0].text.split()]
            date = datatimelist[0].strip(',')
            time = datatimelist[1] if len(datatimelist) > 1 else None
            launch_dict['Date'].append(date)
            launch_dict['Time'].append(time)
            
            # Version Booster
            bv = row[1].text.strip()
            launch_dict['Version Booster'].append(bv)
            
            # Launch Site
            launch_site = row[2].text.strip()
            launch_dict['Launch site'].append(launch_site)
            
            # Payload
            payload = row[3].text.strip()
            launch_dict['Payload'].append(payload)
            
            # Payload Mass
            payload_mass = row[4].text.strip().split()[0] if row[4].text.strip() else None
            launch_dict['Payload mass'].append(payload_mass)
            
            # Orbit
            orbit = row[5].text.strip()
            launch_dict['Orbit'].append(orbit)
            
            # Customer
            customer = row[6].text.strip() if row[6].text.strip() else None
            launch_dict['Customer'].append(customer)
            
            # Launch outcome
            launch_outcome = row[7].text.strip() if row[7].text.strip() else None
            launch_dict['Launch outcome'].append(launch_outcome)
            
            # Booster landing
            booster_landing = row[8].text.strip() if len(row) > 8 else None
            launch_dict['Booster landing'].append(booster_landing)

# Converting the dictionary to a Pandas DataFrame
df_launches = pd.DataFrame(launch_dict)

# Display the DataFrame's head to verify the data
print(df_launches.head())

# Replace NaN with empty strings in case of any missing values
df_launches = df_launches.replace(np.nan, '', regex=True)

# Display the information about the DataFrame
print(df_launches.info())

  Flight No. Date      Time       Version Booster   Launch site  \
0          1    4      June  F9 v1.0[7]B0003.1[8]  CCAFS,SLC-40   
1          2    8  December  F9 v1.0[7]B0004.1[8]  CCAFS,SLC-40   
2          3   22       May  F9 v1.0[7]B0005.1[8]  CCAFS,SLC-40   
3          4    8   October  F9 v1.0[7]B0006.1[8]  CCAFS,SLC-40   
4          5    1     March  F9 v1.0[7]B0007.1[8]  CCAFS,SLC-40   

                                   Payload Payload mass      Orbit  \
0     Dragon Spacecraft Qualification Unit         None        LEO   
1       Dragon demo flight C1(Dragon C101)         None  LEO (ISS)   
2  Dragon demo flight C2+[18](Dragon C102)          525  LEO (ISS)   
3            SpaceX CRS-1[22](Dragon C103)        4,700  LEO (ISS)   
4            SpaceX CRS-2[22](Dragon C104)        4,877  LEO (ISS)   

           Customer Launch outcome            Booster landing  
0            SpaceX        Success  Failure[9][10](parachute)  
1  NASA (COTS)\nNRO     Success[9]  Failure[9][1

In [24]:
df= pd.DataFrame({ key:pd.Series(value) for key, value in launch_dict.items() })

In [25]:
df.to_csv('spacex_web_scraped.csv', index=False)

## Authors


<a href="https://www.linkedin.com/in/kristinacinova/">Kristina Cinova</a>


<!--
## Change Log
-->


<!--
| Date (YYYY-MM-DD) | Version | Changed By | Change Description      |
| ----------------- | ------- | ---------- | ----------------------- |
| 2021-06-09        | 1.0     | Yan Luo    | Tasks updates           |
| 2020-11-10        | 1.0     | Nayef      | Created the initial version |
-->
