<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# **Space X  Falcon 9 First Stage Landing Prediction**


## Web scraping Falcon 9 and Falcon Heavy Launches Records from Wikipedia


Estimated time needed: **40** minutes


In this lab, you will be performing web scraping to collect Falcon 9 historical launch records from a Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`

https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/Falcon9_rocket_family.svg)


Falcon 9 first stage will land successfully


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing_1.gif)


Several examples of an unsuccessful landing are shown here:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)


More specifically, the launch records are stored in a HTML table shown below:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/falcon9-launches-wiki.png)


  ## Objectives
Web scrap Falcon 9 launch records with `BeautifulSoup`: 
- Extract a Falcon 9 launch records HTML table from Wikipedia
- Parse the table and convert it into a Pandas data frame


First let's import required packages for this lab


In [1]:
!pip3 install beautifulsoup4
!pip3 install requests



In [2]:
import sys

import requests
from bs4 import BeautifulSoup
import re
import unicodedata
import pandas as pd

and we will provide some helper functions for you to process web scraped HTML table


In [3]:
def date_time(table_cells):
    """
    This function returns the data and time from the HTML  table cell
    Input: the  element of a table data cell extracts extra row
    """
    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]

def booster_version(table_cells):
    """
    This function returns the booster version from the HTML  table cell 
    Input: the  element of a table data cell extracts extra row
    """
    out=''.join([booster_version for i,booster_version in enumerate( table_cells.strings) if i%2==0][0:-1])
    return out

def landing_status(table_cells):
    """
    This function returns the landing status from the HTML table cell 
    Input: the  element of a table data cell extracts extra row
    """
    out=[i for i in table_cells.strings][0]
    return out


def get_mass(table_cells):
    mass=unicodedata.normalize("NFKD", table_cells.text).strip()
    if mass:
        mass.find("kg")
        new_mass=mass[0:mass.find("kg")+2]
    else:
        new_mass=0
    return new_mass


def extract_column_from_header(row):
    """
    This function returns the landing status from the HTML table cell 
    Input: the  element of a table data cell extracts extra row
    """
    if (row.br):
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
        
    colunm_name = ' '.join(row.contents)
    
    # Filter the digit and empty names
    if not(colunm_name.strip().isdigit()):
        colunm_name = colunm_name.strip()
        return colunm_name    


To keep the lab tasks consistent, you will be asked to scrape the data from a snapshot of the  `List of Falcon 9 and Falcon Heavy launches` Wikipage updated on
`9th June 2021`


In [4]:
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

Next, request the HTML page from the above URL and get a `response` object


### TASK 1: Request the Falcon9 Launch Wiki page from its URL


First, let's perform an HTTP GET method to request the Falcon9 Launch HTML page, as an HTTP response.


In [6]:
# use requests.get() method with the provided static_url
# assign the response to a object
response = requests.get(static_url)

if response.status_code == 200:
    print("Solicitud exitosa")
else:
    print(f"Error en la solicitud: {response.status_code}")

soup = BeautifulSoup(response.content, 'html.parser')

tabla = soup.find_all('table', 'wikitable')[0]

filas_tabla = tabla.find_all('tr')
datos_lanzamiento = []
for fila_tabla in filas_tabla:
    celdas_tabla = fila_tabla.find_all('td')
    if len(celdas_tabla) >= 8:  # Asegurar que hay al menos 8 celdas para evitar IndexError
        lanzamiento_dict = {}
        lanzamiento_dict['fecha_hora'] = date_time(celdas_tabla[0])
        lanzamiento_dict['version_refuerzo'] = booster_version(celdas_tabla[1])
        lanzamiento_dict['sitio_lanzamiento'] = celdas_tabla[2].text.strip()
        lanzamiento_dict['carga_util'] = celdas_tabla[3].text.strip()
        lanzamiento_dict['masa_carga_util'] = get_mass(celdas_tabla[4])
        lanzamiento_dict['órbita'] = celdas_tabla[5].text.strip()
        lanzamiento_dict['cliente'] = celdas_tabla[6].text.strip()
        lanzamiento_dict['estado_aterrizaje'] = landing_status(celdas_tabla[7])
        datos_lanzamiento.append(lanzamiento_dict)
    else:
        print(f"Advertencia: Fila omitida debido a celdas insuficientes: {fila_tabla}")

df = pd.DataFrame(datos_lanzamiento)


# Filtrar solo los lanzamientos de Falcon 9
df = df[df['version_refuerzo'].str.contains('Falcon 9', na=False)]

# Convertir la columna 'masa_carga_util' a numérica y rellenar los valores None con la media
df['masa_carga_util'] = df['masa_carga_util'].str.replace('kg', '').str.replace(',', '').astype(float)
mean_mass = df['masa_carga_util'].mean()
df['masa_carga_util'].fillna(mean_mass, inplace=True)

print(df.head())

Solicitud exitosa
Advertencia: Fila omitida debido a celdas insuficientes: <tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_11-0"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-0"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
Advertencia: Fila omitida debido a celdas insuficientes: <tr>
<td colspan="9">F

Create a `BeautifulSoup` object from the HTML `response`


In [10]:
import requests
from bs4 import BeautifulSoup
import unicodedata
import pandas as pd

def date_time(table_cells):
    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]

def booster_version(table_cells):
    out = ''.join([booster_version for i, booster_version in enumerate(table_cells.strings) if i % 2 == 0][0:-1])
    return out

def landing_status(table_cells):
    out = [i for i in table_cells.strings][0]
    return out

def get_mass(table_cells):
    mass = unicodedata.normalize("NFKD", table_cells.text).strip()
    if "kg" in mass:
        new_mass = mass.split("kg")[0] + "kg"
    else:
        new_mass = "0 kg"
    return new_mass

static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

response = requests.get(static_url)

if response.status_code == 200:
    print("Solicitud exitosa")
else:
    print(f"Error en la solicitud: {response.status_code}")

soup = BeautifulSoup(response.content, 'html.parser')

tabla = soup.find_all('table', 'wikitable')[0]

filas_tabla = tabla.find_all('tr')
datos_lanzamiento = []
for fila_tabla in filas_tabla:
    celdas_tabla = fila_tabla.find_all('td')
    if len(celdas_tabla) >= 8:
        lanzamiento_dict = {}
        lanzamiento_dict['fecha_hora'] = date_time(celdas_tabla[0])
        lanzamiento_dict['version_refuerzo'] = booster_version(celdas_tabla[1])
        lanzamiento_dict['sitio_lanzamiento'] = celdas_tabla[2].text.strip()
        lanzamiento_dict['carga_util'] = celdas_tabla[3].text.strip()
        lanzamiento_dict['masa_carga_util'] = get_mass(celdas_tabla[4])
        lanzamiento_dict['órbita'] = celdas_tabla[5].text.strip()
        lanzamiento_dict['cliente'] = celdas_tabla[6].text.strip()
        lanzamiento_dict['estado_aterrizaje'] = landing_status(celdas_tabla[7])
        datos_lanzamiento.append(lanzamiento_dict)

df = pd.DataFrame(datos_lanzamiento)

# Filtrar solo los lanzamientos de Falcon 9
df = df[df['version_refuerzo'].str.contains('Falcon 9', na=False)]
# Convertir la columna 'masa_carga_util' a numérica y rellenar los valores None con la media
df['masa_carga_util'] = df['masa_carga_util'].str.replace('kg', '').str.replace(',', '').astype(float)
mean_mass = df['masa_carga_util'].mean()
df['masa_carga_util'].fillna(mean_mass, inplace=True)

print(df.head())


Solicitud exitosa
Empty DataFrame
Columns: [fecha_hora, version_refuerzo, sitio_lanzamiento, carga_util, masa_carga_util, órbita, cliente, estado_aterrizaje]
Index: []


Print the page title to verify if the `BeautifulSoup` object was created properly 


In [9]:
# Use soup.title attribute
print(df.head())

Empty DataFrame
Columns: [fecha_hora, version_refuerzo, sitio_lanzamiento, carga_util, masa_carga_util, órbita, cliente, estado_aterrizaje]
Index: []


In [11]:
# Verifica la estructura de la tabla
print(tabla.prettify())

# Verifica las filas de la tabla
for fila_tabla in filas_tabla:
    print(fila_tabla.prettify())
    celdas_tabla = fila_tabla.find_all('td')
    if len(celdas_tabla) >= 8:
        print([celda.text.strip() for celda in celdas_tabla])

# Verifica los datos extraídos
for lanzamiento_dict in datos_lanzamiento:
    print(lanzamiento_dict)

# Imprime el DataFrame antes del filtrado
print(df.head())


<table class="wikitable plainrowheaders collapsible" style="width: 100%;">
 <tbody>
  <tr>
   <th scope="col">
    Flight No.
   </th>
   <th scope="col">
    Date and
    <br/>
    time (
    <a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">
     UTC
    </a>
    )
   </th>
   <th scope="col">
    <a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">
     Version,
     <br/>
     Booster
    </a>
    <sup class="reference" id="cite_ref-booster_11-0">
     <a href="#cite_note-booster-11">
      [b]
     </a>
    </sup>
   </th>
   <th scope="col">
    Launch site
   </th>
   <th scope="col">
    Payload
    <sup class="reference" id="cite_ref-Dragon_12-0">
     <a href="#cite_note-Dragon-12">
      [c]
     </a>
    </sup>
   </th>
   <th scope="col">
    Payload mass
   </th>
   <th scope="col">
    Orbit
   </th>
   <th scope="col">
    Customer
   </th>
   <th scope="col">
    Launch
    <br/>
    outcome
  

### TASK 2: Extract all column/variable names from the HTML table header


Next, we want to collect all relevant column names from the HTML table header


Let's try to find all tables on the wiki page first. If you need to refresh your memory about `BeautifulSoup`, please check the external reference link towards the end of this lab


In [10]:
# Use the find_all function in the BeautifulSoup object, with element type `table`
# Assign the result to a list called `html_tables`
html_tables = soup.find_all('table', class_='wikitable')
first_launch_table = html_tables[2]
print(first_launch_table)

headers = first_launch_table.find_all('th')
column_names = [extract_column_from_header(header) for header in headers]
print(column_names)


<table class="wikitable plainrowheaders collapsible" style="width: 100%;">
<tbody><tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a><sup class="reference" id="cite_ref-booster_11-2"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-2"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
<tr>
<th rowspan="2" scope="row" style="text-align:center;">14
</th>
<td>

Starting from the third table is our target table contains the actual launch records.


In [11]:
# Let's print the third table and check its content
first_launch_table = html_tables[2]
print(first_launch_table)

<table class="wikitable plainrowheaders collapsible" style="width: 100%;">
<tbody><tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date andtime ()
</th>
<th scope="col">
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launchoutcome
</th>
<th scope="col">
</th></tr>
<tr>
<th rowspan="2" scope="row" style="text-align:center;">14
</th>
<td>10 January 2015,<br/>09:47<sup class="reference" id="cite_ref-nasa20150107_74-0"><a href="#cite_note-nasa20150107-74">[67]</a></sup>
</td>
<td><a href="/wiki/Falcon_9_v1.1" title="Falcon 9 v1.1">F9 v1.1</a><br/>B1012<sup class="reference" id="cite_ref-block_numbers_14-8"><a href="#cite_note-block_numbers-14">[8]</a></sup>
</td>
<td><a href="/wiki/Cape_Canaveral_Space_Force_Station" title="Cape Canaveral Space Force Station">Cape Canaveral</a>,<br/><a href="/wiki/Cape_Canaveral_Space_Launch_Complex_40" title="Cape Canav

You should able to see the columns names embedded in the table header elements `<th>` as follows:


```
<tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_11-0"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-0"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
```


Next, we just need to iterate through the `<th>` elements and apply the provided `extract_column_from_header()` to extract column name one by one


In [12]:
# Apply find_all() function with `th` element on first_launch_table
# Iterate each th element and apply the provided extract_column_from_header() to get a column name
# Append the Non-empty column name (`if name is not None and len(name) > 0`) into a list called column_names


column_names = []

# Aplicar find_all() con el elemento `th` en first_launch_table
headers = first_launch_table.find_all('th')

# Iterar cada elemento th y aplicar extract_column_from_header() para obtener el nombre de la columna
for header in headers:
    column_name = extract_column_from_header(header)
    if column_name is not None and len(column_name) > 0:
        column_names.append(column_name)




In [13]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

# URL de la página de Wikipedia con la tabla de lanzamientos de Falcon 9
url = 'https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches'

# Realizar la solicitud HTTP y obtener el contenido de la página
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Encontrar la tabla específica por su clase
tabla = soup.find('table', {'class': 'wikitable'})

# Extraer las filas de la tabla
filas_tabla = tabla.find_all('tr')

# Lista para almacenar los datos de cada lanzamiento
datos_lanzamiento = []

# Iterar sobre las filas (ignorando la primera fila de encabezados)
for fila in filas_tabla[1:]:
    columnas = fila.find_all('td')
    
    # Verificar que la fila tenga al menos 10 columnas
    if len(columnas) >= 10:
        datos_fila = {
            'Flight No.': columnas[0].text.strip(),
            'Date and time (UTC)': columnas[1].text.strip(),
            'Version, Booster': columnas[2].text.strip(),
            'Launch site': columnas[3].text.strip(),
            'Payload': columnas[4].text.strip(),
            'Payload mass': columnas[5].text.strip(),
            'Orbit': columnas[6].text.strip(),
            'Customer': columnas[7].text.strip(),
            'Launch outcome': columnas[8].text.strip(),
            'Booster landing': columnas[9].text.strip()
        }
        datos_lanzamiento.append(datos_fila)
    else:
        print(f"La fila no tiene suficientes columnas: {columnas}")

# Crear un DataFrame con los datos de lanzamiento
df = pd.DataFrame(datos_lanzamiento)

# Imprimir las primeras filas del DataFrame para verificar
print(df.head())


La fila no tiene suficientes columnas: [<td>3 January 2023<br/>14:56<sup class="reference" id="cite_ref-29"><a href="#cite_note-29">[22]</a></sup>
</td>, <td><a href="/wiki/Falcon_9_Block_5" title="Falcon 9 Block 5">F9 B5</a> ♺ <br/><a href="/wiki/List_of_Falcon_9_first-stage_boosters#B1060" title="List of Falcon 9 first-stage boosters">B1060.15</a><sup class="reference" id="cite_ref-30"><a href="#cite_note-30">[23]</a></sup>
</td>, <td><a href="/wiki/Cape_Canaveral_Space_Force_Station" title="Cape Canaveral Space Force Station">CCSFS</a>,<br/><a href="/wiki/Cape_Canaveral_Space_Launch_Complex_40" title="Cape Canaveral Space Launch Complex 40">SLC-40</a>
</td>, <td><a href="/wiki/List_of_spaceflight_launches_in_January%E2%80%93June_2023#SpXTransporter6" title="List of spaceflight launches in January–June 2023"><i>Transporter-6</i>: (115 payloads Smallsat Rideshare)</a>
</td>, <td class="table-na" data-sort-value="" style="background: #ececec; color: #2C2C2C; vertical-align: middle; tex

Check the extracted column names


In [16]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

# URL de la página de Wikipedia con la tabla de lanzamientos de Falcon 9
url = 'https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches'

# Realizar la solicitud HTTP y obtener el contenido de la página
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Encontrar la tabla específica por su clase
tabla = soup.find('table', {'class': 'wikitable'})

# Extraer las filas de la tabla
filas_tabla = tabla.find_all('tr')

# Lista para almacenar los datos de cada lanzamiento
datos_lanzamiento = []

# Iterar sobre las filas (ignorando la primera fila de encabezados)
for fila in filas_tabla[1:]:
    columnas = fila.find_all('td')
    
    # Verificar que la fila tenga al menos 10 columnas
    if len(columnas) >= 10:
        datos_fila = {
            'Flight No.': columnas[0].text.strip(),
            'Date and time (UTC)': columnas[1].text.strip(),
            'Version, Booster': columnas[2].text.strip(),
            'Launch site': columnas[3].text.strip(),
            'Payload': columnas[4].text.strip(),
            'Payload mass': columnas[5].text.strip(),
            'Orbit': columnas[6].text.strip(),
            'Customer': columnas[7].text.strip(),
            'Launch outcome': columnas[8].text.strip(),
            'Booster landing': columnas[9].text.strip()
        }
        datos_lanzamiento.append(datos_fila)
    else:
        print(f"La fila no tiene suficientes columnas: {columnas}")

# Crear un DataFrame con los datos de lanzamiento
df = pd.DataFrame(datos_lanzamiento)

# Verificar los nombres de las columnas
print(df.columns)


La fila no tiene suficientes columnas: [<td>3 January 2023<br/>14:56<sup class="reference" id="cite_ref-29"><a href="#cite_note-29">[22]</a></sup>
</td>, <td><a href="/wiki/Falcon_9_Block_5" title="Falcon 9 Block 5">F9 B5</a> ♺ <br/><a href="/wiki/List_of_Falcon_9_first-stage_boosters#B1060" title="List of Falcon 9 first-stage boosters">B1060.15</a><sup class="reference" id="cite_ref-30"><a href="#cite_note-30">[23]</a></sup>
</td>, <td><a href="/wiki/Cape_Canaveral_Space_Force_Station" title="Cape Canaveral Space Force Station">CCSFS</a>,<br/><a href="/wiki/Cape_Canaveral_Space_Launch_Complex_40" title="Cape Canaveral Space Launch Complex 40">SLC-40</a>
</td>, <td><a href="/wiki/List_of_spaceflight_launches_in_January%E2%80%93June_2023#SpXTransporter6" title="List of spaceflight launches in January–June 2023"><i>Transporter-6</i>: (115 payloads Smallsat Rideshare)</a>
</td>, <td class="table-na" data-sort-value="" style="background: #ececec; color: #2C2C2C; vertical-align: middle; tex

In [14]:
column_names = []

# Aplicar find_all() con el elemento `th` en first_launch_table
headers = first_launch_table.find_all('th')

# Iterar cada elemento th y aplicar extract_column_from_header() para obtener el nombre de la columna
for header in headers:
    column_name = extract_column_from_header(header)
    if column_name is not None and len(column_name) > 0:
        column_names.append(column_name)

print(column_names)
## TASK 3: Create a data frame by parsing the launch HTML tables


['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome']


We will create an empty dictionary with keys from the extracted column names in the previous task. Later, this dictionary will be converted into a Pandas dataframe


In [15]:
launch_dict= dict.fromkeys(column_names)

# Remove an irrelvant column
del launch_dict['Date and time ( )']

# Let's initial the launch_dict with each value to be an empty list
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []
# Added some new columns
launch_dict['Version Booster']=[]
launch_dict['Booster landing']=[]
launch_dict['Date']=[]
launch_dict['Time']=[]

NameError: name 'column_names' is not defined

Next, we just need to fill up the `launch_dict` with launch records extracted from table rows.


Usually, HTML tables in Wiki pages are likely to contain unexpected annotations and other types of noises, such as reference links `B0004.1[8]`, missing values `N/A [e]`, inconsistent formatting, etc.


To simplify the parsing process, we have provided an incomplete code snippet below to help you to fill up the `launch_dict`. Please complete the following code snippet with TODOs or you can choose to write your own logic to parse all launch tables:


In [16]:
extracted_row = 0
#Extract each table 
for table_number,table in enumerate(soup.find_all('table',"wikitable plainrowheaders collapsible")):
   # get table row 
    for rows in table.find_all("tr"):
        #check to see if first table heading is as number corresponding to launch a number 
        if rows.th:
            if rows.th.string:
                flight_number=rows.th.string.strip()
                flag=flight_number.isdigit()
        else:
            flag=False
        #get table element 
        row=rows.find_all('td')
        #if it is number save cells in a dictonary 
        if flag:
            extracted_row += 1
            # Flight Number value
            # TODO: Append the flight_number into launch_dict with key `Flight No.`
            #print(flight_number)
            datatimelist=date_time(row[0])
            
            # Date value
            # TODO: Append the date into launch_dict with key `Date`
            date = datatimelist[0].strip(',')
            #print(date)
            
            # Time value
            # TODO: Append the time into launch_dict with key `Time`
            time = datatimelist[1]
            #print(time)
              
            # Booster version
            # TODO: Append the bv into launch_dict with key `Version Booster`
            bv=booster_version(row[1])
            if not(bv):
                bv=row[1].a.string
            print(bv)
            
            # Launch Site
            # TODO: Append the bv into launch_dict with key `Launch Site`
            launch_site = row[2].a.string
            #print(launch_site)
            
            # Payload
            # TODO: Append the payload into launch_dict with key `Payload`
            payload = row[3].a.string
            #print(payload)
            
            # Payload Mass
            # TODO: Append the payload_mass into launch_dict with key `Payload mass`
            payload_mass = get_mass(row[4])
            #print(payload)
            
            # Orbit
            # TODO: Append the orbit into launch_dict with key `Orbit`
            orbit = row[5].a.string
            #print(orbit)
            
            # Customer
            # TODO: Append the customer into launch_dict with key `Customer`
            customer = row[6].a.string
            #print(customer)
            
            # Launch outcome
            # TODO: Append the launch_outcome into launch_dict with key `Launch outcome`
            launch_outcome = list(row[7].strings)[0]
            #print(launch_outcome)
            
            # Booster landing
            # TODO: Append the launch_outcome into launch_dict with key `Booster landing`
            booster_landing = landing_status(row[8])
            #print(booster_landing)
            

F9 v1.0B0003.1
F9 v1.0B0004.1
F9 v1.0B0005.1
F9 v1.0B0006.1
F9 v1.0B0007.1
F9 v1.1B1003
F9 v1.1
F9 v1.1
F9 v1.1
F9 v1.1
F9 v1.1
F9 v1.1
F9 v1.1
F9 v1.1
F9 v1.1
F9 v1.1
F9 v1.1
F9 v1.1
F9 v1.1
F9 FT
F9 v1.1
F9 FT
F9 FT
F9 FT
F9 FT
F9 FT
F9 FT
F9 FT
F9 FT
F9 FT
F9 FT
F9 FT♺
F9 FT
F9 FT
F9 FT
F9 FTB1029.2
F9 FT
F9 FT
F9 B4
F9 FT
F9 B4
F9 B4
F9 FTB1031.2
F9 B4
F9 FTB1035.2
F9 FTB1036.2
F9 B4
F9 FTB1032.2
F9 FTB1038.2
F9 B4
F9 B4B1041.2
F9 B4B1039.2
F9 B4
F9 B5B1046.1
F9 B4B1043.2
F9 B4B1040.2
F9 B4B1045.2
F9 B5
F9 B5B1048
F9 B5B1046.2
F9 B5
F9 B5B1048.2
F9 B5B1047.2
F9 B5B1046.3
F9 B5
F9 B5
F9 B5B1049.2
F9 B5B1048.3
F9 B5[268]
F9 B5
F9 B5B1049.3
F9 B5B1051.2
F9 B5B1056.2
F9 B5B1047.3
F9 B5
F9 B5
F9 B5B1056.3
F9 B5
F9 B5
F9 B5
F9 B5
F9 B5
F9 B5
F9 B5
F9 B5
F9 B5
F9 B5
F9 B5
F9 B5B1058.2
F9 B5
F9 B5B1049.6
F9 B5
F9 B5B1060.2
F9 B5B1058.3
F9 B5B1051.6
F9 B5
F9 B5
F9 B5
F9 B5
F9 B5 ♺
F9 B5 ♺
F9 B5 ♺
F9 B5 ♺
F9 B5
F9 B5B1051.8
F9 B5B1058.5


AttributeError: 'NoneType' object has no attribute 'string'

After you have fill in the parsed launch record values into `launch_dict`, you can create a dataframe from it.


In [18]:
# Inicializar launch_dict con listas vacías
launch_dict = {
    'Flight No.': [],
    'Launch site': [],
    'Payload': [],
    'Payload mass': [],
    'Orbit': [],
    'Customer': [],
    'Launch outcome': [],
    'Version Booster': [],
    'Booster landing': [],
    'Date': [],
    'Time': []
}

extracted_row = 0
# Extraer cada tabla
for table_number, table in enumerate(soup.find_all('table', "wikitable plainrowheaders collapsible")):
    # Obtener fila de la tabla
    for rows in table.find_all("tr"):
        # Verificar si el primer encabezado de la tabla es un número correspondiente a un número de lanzamiento
        if rows.th:
            if rows.th.string:
                flight_number = rows.th.string.strip()
                flag = flight_number.isdigit()
        else:
            flag = False
        
        # Obtener elementos de la fila
        row = rows.find_all('td')
        # Si es un número, guardar celdas en un diccionario
        if flag:
            extracted_row += 1
            # Flight Number value
            launch_dict['Flight No.'].append(flight_number)

            # Date and Time values
            datatimelist = date_time(row[0])
            date = datatimelist[0].strip(',')
            time = datatimelist[1]
            launch_dict['Date'].append(date)
            launch_dict['Time'].append(time)
            
            # Booster version
            bv = booster_version(row[1])
            if not bv:
                bv = row[1].a.string
            launch_dict['Version Booster'].append(bv)
            
            # Launch Site
            launch_site = row[2].a.string
            launch_dict['Launch site'].append(launch_site)
            
            # Payload
            payload = row[3].a.string
            launch_dict['Payload'].append(payload)
            
            # Payload Mass
            payload_mass = get_mass(row[4])
            launch_dict['Payload mass'].append(payload_mass)
            
            # Orbit
            orbit = row[5].a.string
            launch_dict['Orbit'].append(orbit)
            
            # Customer
            customer = row[6].a.string if row[6].a else 'Unknown'
            launch_dict['Customer'].append(customer)
            
            # Launch outcome
            launch_outcome = list(row[7].strings)[0]
            launch_dict['Launch outcome'].append(launch_outcome)
            
            # Booster landing
            booster_landing = landing_status(row[8])
            launch_dict['Booster landing'].append(booster_landing)

# Crear DataFrame
df = pd.DataFrame(launch_dict)
print(df.head())


  Flight No. Launch site                               Payload Payload mass  \
0          1       CCAFS  Dragon Spacecraft Qualification Unit            0   
1          2       CCAFS                                Dragon            0   
2          3       CCAFS                                Dragon       525 kg   
3          4       CCAFS                          SpaceX CRS-1     4,700 kg   
4          5       CCAFS                          SpaceX CRS-2     4,877 kg   

  Orbit Customer Launch outcome Version Booster Booster landing  \
0   LEO   SpaceX      Success\n  F9 v1.0B0003.1         Failure   
1   LEO     NASA        Success  F9 v1.0B0004.1         Failure   
2   LEO     NASA        Success  F9 v1.0B0005.1    No attempt\n   
3   LEO     NASA      Success\n  F9 v1.0B0006.1      No attempt   
4   LEO     NASA      Success\n  F9 v1.0B0007.1    No attempt\n   

              Date   Time  
0      4 June 2010  18:45  
1  8 December 2010  15:43  
2      22 May 2012  07:44  
3   8 Octo

In [17]:
df = pd.DataFrame({key: pd.Series(value) for key, value in launch_dict.items()})


NameError: name 'launch_dict' is not defined

In [20]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Función para obtener la masa del payload
def get_mass(cell):
    mass = cell.text.strip().split()[0]
    if mass.isdigit():
        return int(mass)
    else:
        return None

# Función para obtener la versión del booster
def booster_version(cell):
    if cell.a:
        return cell.a.text.strip()
    else:
        return cell.text.strip()

# Función para obtener el estado del aterrizaje del booster
def landing_status(cell):
    status = cell.text.strip()
    if status == 'Success' or status == 'Failure':
        return status
    else:
        return 'No attempt'

# Función para obtener la fecha y hora
def date_time(cell):
    if cell.a:
        return cell.a.text.strip().split()
    else:
        return cell.text.strip().split()

# URL de la página web a scrapear
url = 'https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches'

# Hacer la solicitud HTTP GET
response = requests.get(url)

# Parsear el contenido HTML
soup = BeautifulSoup(response.content, 'html.parser')

# Inicializar launch_dict con listas vacías
launch_dict = {
    'Flight No.': [],
    'Launch site': [],
    'Payload': [],
    'Payload mass': [],
    'Orbit': [],
    'Customer': [],
    'Launch outcome': [],
    'Version Booster': [],
    'Booster landing': [],
    'Date': [],
    'Time': []
}

# Extraer cada tabla
for table_number, table in enumerate(soup.find_all('table', "wikitable plainrowheaders collapsible")):
    # Obtener fila de la tabla
    for rows in table.find_all("tr"):
        # Verificar si el primer encabezado de la tabla es un número correspondiente a un número de lanzamiento
        if rows.th:
            if rows.th.string:
                flight_number = rows.th.string.strip()
                flag = flight_number.isdigit()
        else:
            flag = False

        # Obtener elementos de la fila
        cells = rows.find_all('td')

        # Si es un número, guardar celdas en un diccionario
        if flag and len(cells) > 0:
            # Flight Number value
            launch_dict['Flight No.'].append(flight_number)

            # Date and Time values
            date, time = date_time(cells[0])
            launch_dict['Date'].append(date)
            launch_dict['Time'].append(time)

            # Booster version
            bv = booster_version(cells[1])
            launch_dict['Version Booster'].append(bv)

            # Launch Site
            launch_site = cells[2].a.text.strip()
            launch_dict['Launch site'].append(launch_site)

            # Payload
            payload = cells[3].a.text.strip() if cells[3].a else cells[3].text.strip()
            launch_dict['Payload'].append(payload)

            # Payload Mass
            payload_mass = get_mass(cells[4])
            launch_dict['Payload mass'].append(payload_mass)

            # Orbit
            orbit = cells[5].a.text.strip()
            launch_dict['Orbit'].append(orbit)

            # Customer
            customer = cells[6].a.text.strip() if cells[6].a else 'Unknown'
            launch_dict['Customer'].append(customer)

            # Launch outcome
            launch_outcome = cells[7].text.strip()
            launch_dict['Launch outcome'].append(launch_outcome)

            # Booster landing
            booster_landing = landing_status(cells[8])
            launch_dict['Booster landing'].append(booster_landing)

# Crear DataFrame
df = pd.DataFrame(launch_dict)
print(df.head())


ValueError: not enough values to unpack (expected 2, got 1)

In [22]:
df = pd.DataFrame({key: pd.Series(value) for key, value in launch_dict.items()})
print(df.head())


  Flight No. Launch site                               Payload Payload mass  \
0          1       CCAFS  Dragon Spacecraft Qualification Unit            0   
1          2       CCAFS                                Dragon            0   
2          3       CCAFS                                Dragon       525 kg   
3          4       CCAFS                          SpaceX CRS-1     4,700 kg   
4          5       CCAFS                          SpaceX CRS-2     4,877 kg   

  Orbit Customer Launch outcome Version Booster Booster landing  \
0   LEO   SpaceX      Success\n  F9 v1.0B0003.1         Failure   
1   LEO     NASA        Success  F9 v1.0B0004.1         Failure   
2   LEO     NASA        Success  F9 v1.0B0005.1    No attempt\n   
3   LEO     NASA      Success\n  F9 v1.0B0006.1      No attempt   
4   LEO     NASA      Success\n  F9 v1.0B0007.1    No attempt\n   

              Date   Time  
0      4 June 2010  18:45  
1  8 December 2010  15:43  
2      22 May 2012  07:44  
3   8 Octo

In [23]:
import requests
import pandas as pd

# URL de la API de SpaceX para obtener datos de lanzamientos recientes
url = "https://api.spacexdata.com/v4/launches"

# Realizar la solicitud GET
response = requests.get(url)

# Verificar que la solicitud fue exitosa (código 200)
if response.status_code == 200:
    # Convertir la respuesta JSON a un DataFrame usando pd.json_normalize
    df = pd.json_normalize(response.json())
    
    # Mostrar las primeras filas del DataFrame para inspección
    print(df.head())
    
    # Extraer el año de la primera fila de static_fire_date_utc
    first_static_fire_date_utc = df.loc[0, 'static_fire_date_utc']
    year = first_static_fire_date_utc[:4]  # Obtener los primeros 4 caracteres que representan el año
    
    print(f"El año en la primera fila de static_fire_date_utc es: {year}")
else:
    print(f"Error al realizar la solicitud GET: Código {response.status_code}")


       static_fire_date_utc  static_fire_date_unix    net  window  \
0  2006-03-17T00:00:00.000Z           1.142554e+09  False     0.0   
1                      None                    NaN  False     0.0   
2                      None                    NaN  False     0.0   
3  2008-09-20T00:00:00.000Z           1.221869e+09  False     0.0   
4                      None                    NaN  False     0.0   

                     rocket success  \
0  5e9d0d95eda69955f709d1eb   False   
1  5e9d0d95eda69955f709d1eb   False   
2  5e9d0d95eda69955f709d1eb   False   
3  5e9d0d95eda69955f709d1eb    True   
4  5e9d0d95eda69955f709d1eb    True   

                                            failures  \
0  [{'time': 33, 'altitude': None, 'reason': 'mer...   
1  [{'time': 301, 'altitude': 289, 'reason': 'har...   
2  [{'time': 140, 'altitude': 35, 'reason': 'resi...   
3                                                 []   
4                                                 []   

             

In [24]:
import requests
import pandas as pd

# URL de la API de SpaceX para obtener datos de lanzamientos
url = "https://api.spacexdata.com/v4/launches"

# Realizar la solicitud GET
response = requests.get(url)

# Verificar que la solicitud fue exitosa (código 200)
if response.status_code == 200:
    # Convertir la respuesta JSON a un DataFrame usando pd.json_normalize
    df = pd.json_normalize(response.json())
    
    # Filtrar para incluir solo lanzamientos de Falcon 9
    falcon9_launches = df[df['rocket'] == '5e9d0d95eda69973a809d1ec']
    
    # Contar el número de lanzamientos de Falcon 9
    count_falcon9_launches = falcon9_launches.shape[0]
    
    print(f"Número de lanzamientos de Falcon 9 después de eliminar los de Falcon 1: {count_falcon9_launches}")
else:
    print(f"Error al realizar la solicitud GET: Código {response.status_code}")


Número de lanzamientos de Falcon 9 después de eliminar los de Falcon 1: 195


In [25]:
import requests
import pandas as pd

# URL de la API de SpaceX para obtener datos de lanzamientos
url = "https://api.spacexdata.com/v4/launches"

# Realizar la solicitud GET
response = requests.get(url)

# Verificar que la solicitud fue exitosa (código 200)
if response.status_code == 200:
    # Convertir la respuesta JSON a un DataFrame usando pd.json_normalize
    df = pd.json_normalize(response.json())
    
    # Contar los valores faltantes en la columna 'landingPad'
    missing_values_count = df['launchpad'].isnull().sum()
    
    print(f"Número de valores faltantes en la columna 'landingPad': {missing_values_count}")
else:
    print(f"Error al realizar la solicitud GET: Código {response.status_code}")


Número de valores faltantes en la columna 'landingPad': 0


In [28]:
<title>Lista de lanzamientos de Falcon 9 y Falcon Heavy - Wikipedia</title>




SyntaxError: invalid syntax (1967507217.py, line 1)

In [30]:

import requests
import pandas as pd

# URL de la API de SpaceX para obtener datos de lanzamientos
url = "https://api.spacexdata.com/v4/launches"

# Realizar la solicitud GET
response = requests.get(url)

# Verificar que la solicitud fue exitosa (código 200)
if response.status_code == 200:
    # Convertir la respuesta JSON a un DataFrame usando pd.json_normalize
    df = pd.json_normalize(response.json())
    
    # Mostrar las primeras filas del DataFrame para entender su estructura
    print(df.head())
    
    # Filtrar para incluir solo lanzamientos de Falcon 9
    falcon9_launches = df[df['rocket'] == '5e9d0d95eda69973a809d1ec']
    
    # Contar el número de lanzamientos de Falcon 9 después de eliminar los de Falcon 1
    falcon9_count = falcon9_launches.shape[0]
    
    print(f"Número de lanzamientos de Falcon 9 después de eliminar los de Falcon 1: {falcon9_count}")
else:
    print(f"Error al realizar la solicitud GET: Código {response.status_code}")



       static_fire_date_utc  static_fire_date_unix    net  window  \
0  2006-03-17T00:00:00.000Z           1.142554e+09  False     0.0   
1                      None                    NaN  False     0.0   
2                      None                    NaN  False     0.0   
3  2008-09-20T00:00:00.000Z           1.221869e+09  False     0.0   
4                      None                    NaN  False     0.0   

                     rocket success  \
0  5e9d0d95eda69955f709d1eb   False   
1  5e9d0d95eda69955f709d1eb   False   
2  5e9d0d95eda69955f709d1eb   False   
3  5e9d0d95eda69955f709d1eb    True   
4  5e9d0d95eda69955f709d1eb    True   

                                            failures  \
0  [{'time': 33, 'altitude': None, 'reason': 'mer...   
1  [{'time': 301, 'altitude': 289, 'reason': 'har...   
2  [{'time': 140, 'altitude': 35, 'reason': 'resi...   
3                                                 []   
4                                                 []   

             

In [32]:
import requests
import pandas as pd

# URL de la API de SpaceX para obtener datos de lanzamientos
url = "https://api.spacexdata.com/v4/launches"

# Realizar la solicitud GET
response = requests.get(url)

# Verificar que la solicitud fue exitosa (código 200)
if response.status_code == 200:
    # Convertir la respuesta JSON a un DataFrame usando pd.json_normalize
    df = pd.json_normalize(response.json())
    
    # Filtrar para incluir solo lanzamientos de Falcon 9
    falcon9_launches = df[df['rocket'].str.contains('Falcon 9', case=False, na=False)]
    
    # Contar el número de lanzamientos de Falcon 9 después de eliminar los de Falcon 1
    falcon9_count = falcon9_launches.shape[0]
    
    print(f"Número de lanzamientos de Falcon 9 después de eliminar los de Falcon 1: {falcon9_count}")
else:
    print(f"Error al realizar la solicitud GET: Código {response.status_code}")



Número de lanzamientos de Falcon 9 después de eliminar los de Falcon 1: 0


In [37]:
import requests
import pandas as pd

# URL de la API de SpaceX para obtener datos de lanzamientos
url = "https://api.spacexdata.com/v4/launches"

# Realizar la solicitud GET
response = requests.get(url)

# Verificar que la solicitud fue exitosa (código 200)
if response.status_code == 200:
    # Convertir la respuesta JSON a un DataFrame usando pd.json_normalize
    df = pd.json_normalize(response.json())
    
    # Filtrar para incluir solo lanzamientos de Falcon 9
    falcon9_launches = df[df['rocket_name'].str.contains('Falcon 9', case=False, na=False)]
    
    # Contar el número de lanzamientos de Falcon 9 después de eliminar los de Falcon 1
    falcon9_count = falcon9_launches.shape[0]
    
    print(f"Número de lanzamientos de Falcon 9 después de eliminar los de Falcon 1: {falcon9_count}")
else:
    print(f"Error al realizar la solicitud GET: Código {response.status_code}")


KeyError: 'rocket_name'

In [35]:
import requests
import pandas as pd

# URL de la API de SpaceX para obtener datos de lanzamientos
url = "https://api.spacexdata.com/v4/launches"

# Realizar la solicitud GET
response = requests.get(url)

# Verificar que la solicitud fue exitosa (código 200)
if response.status_code == 200:
    # Convertir la respuesta JSON a un DataFrame usando pd.json_normalize
    df = pd.json_normalize(response.json())
    
    # Contar valores faltantes en la columna landingPad de manera más detallada
    missing_values = df[df['launchpad'].isna()]
    missing_values_count = missing_values.shape[0]
    
    print(f"Número de valores faltantes en la columna landingPad: {missing_values_count}")
    if missing_values_count > 0:
        print("Ejemplos de registros con valores faltantes en landingPad:")
        print(missing_values[['name', 'date_utc', 'launchpad']])
else:
    print(f"Error al realizar la solicitud GET: Código {response.status_code}")




Número de valores faltantes en la columna landingPad: 0


In [36]:
import requests
import pandas as pd

# URL de la API de SpaceX para obtener datos de lanzamientos
url = "https://api.spacexdata.com/v4/launches"

# Realizar la solicitud GET
response = requests.get(url)

# Verificar que la solicitud fue exitosa (código 200)
if response.status_code == 200:
    # Convertir la respuesta JSON a un DataFrame usando pd.json_normalize
    df = pd.json_normalize(response.json())
    
    # Imprimir las primeras filas para revisar cómo se ven los datos
    print(df.head())
    
    # Verificar si hay algún valor específico que pueda indicar un valor faltante en landingPad
    unique_values = df['launchpad'].unique()
    print(f"Valores únicos en la columna landingPad: {unique_values}")
    
    # Contar valores faltantes en la columna landingPad de manera más detallada
    missing_values = df[df['launchpad'].isna()]
    missing_values_count = missing_values.shape[0]
    
    print(f"Número de valores faltantes en la columna landingPad: {missing_values_count}")
    if missing_values_count > 0:
        print("Ejemplos de registros con valores faltantes en landingPad:")
        print(missing_values[['name', 'date_utc', 'launchpad']])
else:
    print(f"Error al realizar la solicitud GET: Código {response.status_code}")


       static_fire_date_utc  static_fire_date_unix    net  window  \
0  2006-03-17T00:00:00.000Z           1.142554e+09  False     0.0   
1                      None                    NaN  False     0.0   
2                      None                    NaN  False     0.0   
3  2008-09-20T00:00:00.000Z           1.221869e+09  False     0.0   
4                      None                    NaN  False     0.0   

                     rocket success  \
0  5e9d0d95eda69955f709d1eb   False   
1  5e9d0d95eda69955f709d1eb   False   
2  5e9d0d95eda69955f709d1eb   False   
3  5e9d0d95eda69955f709d1eb    True   
4  5e9d0d95eda69955f709d1eb    True   

                                            failures  \
0  [{'time': 33, 'altitude': None, 'reason': 'mer...   
1  [{'time': 301, 'altitude': 289, 'reason': 'har...   
2  [{'time': 140, 'altitude': 35, 'reason': 'resi...   
3                                                 []   
4                                                 []   

             

We can now export it to a <b>CSV</b> for the next section, but to make the answers consistent and in case you have difficulties finishing this lab. 

Following labs will be using a provided dataset to make each lab independent. 


<code>df.to_csv('spacex_web_scraped.csv', index=False)</code>


## Authors


<a href="https://www.linkedin.com/in/yan-luo-96288783/">Yan Luo</a>


<a href="https://www.linkedin.com/in/nayefaboutayoun/">Nayef Abou Tayoun</a>


## Change Log


| Date (YYYY-MM-DD) | Version | Changed By | Change Description      |
| ----------------- | ------- | ---------- | ----------------------- |
| 2021-06-09        | 1.0     | Yan Luo    | Tasks updates           |
| 2020-11-10        | 1.0     | Nayef      | Created the initial version |


Copyright © 2021 IBM Corporation. All rights reserved.
