# SpaceX Falcon 9 First Stage Landing Prediction

### Project Description

In this project, the goal is to predict the probabiblity of success of a Falcon 9 landing event using data from previous launches. 

Example of a succesful landing:

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing_1.gif)

And a a few unsuccesful ones:

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)


In doing this project, we will execute some of the major steps in a Data Science pipeline, including:

1. Data Extraction using RESTful APIs
2. Data Extraction using Web Scrapping
3. Data Transformation 
4. Data Visualization
5. Training and comparing different Machine Learning models
6. Building an interactive Dashboard with the results of the previous steps


This project was done as the Final Project in IBM's Data Science Professional Certificate (2023).

## Part I: Extracting data from SpaceX webpage

The main source for the launch data comes from SpaceX API in the form of JSON files. We will retrieve that data using requests calls from Python.

Importing regular libraries

In [1]:
import requests
import pandas as pd
import numpy as np
import datetime

Get URL of SpaceX's API

In [2]:
spacex_url="https://api.spacexdata.com/v4/launches/past"

In [3]:
response = requests.get(spacex_url)

Prepare data into a JSON structure

In [4]:
response_json = response.json()


Read JSON into Pandas dataframe

In [5]:
data = pd.json_normalize(response_json)

In [6]:
data.head(2)

Unnamed: 0,static_fire_date_utc,static_fire_date_unix,net,window,rocket,success,failures,details,crew,ships,...,links.reddit.media,links.reddit.recovery,links.flickr.small,links.flickr.original,links.presskit,links.webcast,links.youtube_id,links.article,links.wikipedia,fairings
0,2006-03-17T00:00:00.000Z,1142554000.0,False,0.0,5e9d0d95eda69955f709d1eb,False,"[{'time': 33, 'altitude': None, 'reason': 'mer...",Engine failure at 33 seconds and loss of vehicle,[],[],...,,,[],[],,https://www.youtube.com/watch?v=0a_00nJ_Y88,0a_00nJ_Y88,https://www.space.com/2196-spacex-inaugural-fa...,https://en.wikipedia.org/wiki/DemoSat,
1,,,False,0.0,5e9d0d95eda69955f709d1eb,False,"[{'time': 301, 'altitude': 289, 'reason': 'har...",Successful first stage burn and transition to ...,[],[],...,,,[],[],,https://www.youtube.com/watch?v=Lk4zQ2wP-Nc,Lk4zQ2wP-Nc,https://www.space.com/3590-spacex-falcon-1-roc...,https://en.wikipedia.org/wiki/DemoSat,


Now that we have the full data on a single dataframe, let's clean it a little more.

First, we will only keep the relevant column for the ML model, i.e., we should drops columns such as links or crew.

Our model will only take into account details about the rocket, the payload, the orbit and the station.

In [7]:
# Take a subset of our dataframe keeping only the features we want and the flight number, and date_utc.
data = data[['rocket', 'payloads', 'launchpad', 'cores', 'flight_number', 'date_utc']]

# We will remove rows with multiple cores because those are falcon rockets with 2 extra rocket boosters and rows that have multiple payloads in a single rocket.
data = data[data['cores'].map(len)==1]
data = data[data['payloads'].map(len)==1]

# Since payloads and cores are lists of size 1 we will also extract the single value in the list and replace the feature.
data['cores'] = data['cores'].map(lambda x : x[0])
data['payloads'] = data['payloads'].map(lambda x : x[0])

# We also want to convert the date_utc to a datetime datatype and then extracting the date leaving the time
data['date'] = pd.to_datetime(data['date_utc']).dt.date

# Using the date we will restrict the dates of the launches
data = data[data['date'] <= datetime.date(2020, 11, 13)]

Using the rows of the table above, we will fetch data from SpaceX which are more specific and are contained on separate URL's.

Each new piece of data will be stored in a separate list which will later be incorporated into a main table.

**Extract Booster Version**

Contains the name of the version of the booster used in the operation

In [8]:
BoosterVersion = []

for x in data['rocket']:
    if x:
        response = requests.get("https://api.spacexdata.com/v4/rockets/"+str(x)).json()
        BoosterVersion.append(response['name'])
        

**Extract Launch Site Name, Latitute and Longitude**

Contains the name and location of the site used for take-off and landing

In [9]:
LaunchSite = []
Longitude = []
Latitude = []

for x in data['launchpad']:
    if x:
        response = requests.get("https://api.spacexdata.com/v4/launchpads/"+str(x)).json()
        Longitude.append(response['longitude'])
        Latitude.append(response['latitude'])
        LaunchSite.append(response['name'])


**Extract Payload mass and Orbit name**

Contains the payload mass and the name of the orbit that the rocket went to

In [10]:
PayloadMass = []
Orbit = []

for load in data['payloads']:
    if load:
        response = requests.get("https://api.spacexdata.com/v4/payloads/"+load).json()
        PayloadMass.append(response['mass_kg'])
        Orbit.append(response['orbit'])


**Extract Data about the Core**

Various specifications about the rocket's core, including how many times it was used before and other technical data

In [11]:
Block = []
ReusedCount = []
Serial = []
Outcome = []
Flights = []
GridFins = []
Reused = []
Legs = []
LandingPad = []


for core in data['cores']:
        if core['core'] != None:
            response = requests.get("https://api.spacexdata.com/v4/cores/"+core['core']).json()
            Block.append(response['block'])
            ReusedCount.append(response['reuse_count'])
            Serial.append(response['serial'])
        else:
            Block.append(None)
            ReusedCount.append(None)
            Serial.append(None)
        Outcome.append(str(core['landing_success'])+' '+str(core['landing_type']))
        Flights.append(core['flight'])
        GridFins.append(core['gridfins'])
        Reused.append(core['reused'])
        Legs.append(core['legs'])
        LandingPad.append(core['landpad'])
            


Finally, we incorporate the previous lists into a single table

In [12]:
launch_dict = {'FlightNumber': list(data['flight_number']),
'Date': list(data['date']),
'BoosterVersion':BoosterVersion,
'PayloadMass':PayloadMass,
'Orbit':Orbit,
'LaunchSite':LaunchSite,
'Outcome':Outcome,
'Flights':Flights,
'GridFins':GridFins,
'Reused':Reused,
'Legs':Legs,
'LandingPad':LandingPad,
'Block':Block,
'ReusedCount':ReusedCount,
'Serial':Serial,
'Longitude': Longitude,
'Latitude': Latitude}

In [13]:
launch_df = pd.DataFrame(launch_dict)

In [14]:
launch_df.head(2)

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude
0,1,2006-03-24,Falcon 1,20.0,LEO,Kwajalein Atoll,None None,1,False,False,False,,,0,Merlin1A,167.743129,9.047721
1,2,2007-03-21,Falcon 1,,LEO,Kwajalein Atoll,None None,1,False,False,False,,,0,Merlin2A,167.743129,9.047721


Let's filter out the rows with Falcon 1 booster

In [15]:
data_falcon9 = launch_df[launch_df['BoosterVersion']!='Falcon 1']

Now that we removed a group of rows, let's re-assign the flight number to make sure it goes in increments of 1

In [16]:
data_falcon9.loc[:,'FlightNumber'] = list(range(1, data_falcon9.shape[0]+1))
data_falcon9.tail(2)

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude
92,89,2020-10-24,Falcon 9,15600.0,VLEO,CCSFS SLC 40,True ASDS,3,True,True,True,5e9e3033383ecbb9e534e7cc,5.0,12,B1060,-80.577366,28.561857
93,90,2020-11-05,Falcon 9,3681.0,MEO,CCSFS SLC 40,True ASDS,1,True,False,True,5e9e3032383ecb6bb234e7ca,5.0,8,B1062,-80.577366,28.561857


Noticing the occurrences of NaN values in the payload massses, we fill those out with the average value

In [17]:
# Calculate the mean value of PayloadMass column
plmean = data_falcon9['PayloadMass'].mean()

# Replace the np.nan values with its mean value
data_falcon9.loc[:, 'PayloadMass'] = data_falcon9.loc[:, 'PayloadMass'].replace(np.nan, plmean)

Finally, we want to contruct a "response variable" which is 1 in the case of a succesful landing and 0 otherwise. This info is contained in the "Outcome" column.

So, the first thing to do is to look at the possible values of the variable Outcome:

In [18]:
pd.unique(data_falcon9['Outcome'])

array(['None None', 'False Ocean', 'True Ocean', 'False ASDS',
       'None ASDS', 'True RTLS', 'True ASDS', 'False RTLS'], dtype=object)

Here, if the Outcome string contains True it means success, whereas if it contains None or False it means failure.

So, we will define a new column Class mapping the values of Outcome to 0s or 1s.

In [19]:
def get_class(x):
    if x in {'True Ocean', 'True RTLS', 'True ASDS'}: return 1
    else: return 0

data_falcon9.loc[:,'Class'] = data_falcon9.loc[:,'Outcome'].map(get_class)
data_falcon9.tail(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_falcon9.loc[:,'Class'] = data_falcon9.loc[:,'Outcome'].map(get_class)


Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude,Class
92,89,2020-10-24,Falcon 9,15600.0,VLEO,CCSFS SLC 40,True ASDS,3,True,True,True,5e9e3033383ecbb9e534e7cc,5.0,12,B1060,-80.577366,28.561857,1
93,90,2020-11-05,Falcon 9,3681.0,MEO,CCSFS SLC 40,True ASDS,1,True,False,True,5e9e3032383ecb6bb234e7ca,5.0,8,B1062,-80.577366,28.561857,1


And we are done! Let's export the dataframe to a CSV file that will be utilized later for data visualization

In [20]:
data_falcon9.to_csv('dataset_api.csv', index=False)

To use the columns which contain categorical data for machine learning, we will use one-hot encoding and store the values in a separate dataframe

In [21]:
features = data_falcon9[['FlightNumber', 'PayloadMass', 'Orbit', 'LaunchSite', 'Flights', 'GridFins', 'Reused', 'Legs', 'LandingPad', 'Block', 'ReusedCount', 'Serial','Class']]
features_one_hot = pd.get_dummies(data=features,columns=['Orbit', 'LaunchSite', 'LandingPad','Serial'])
features_one_hot.head(2)

Unnamed: 0,FlightNumber,PayloadMass,Flights,GridFins,Reused,Legs,Block,ReusedCount,Class,Orbit_ES-L1,...,Serial_B1048,Serial_B1049,Serial_B1050,Serial_B1051,Serial_B1054,Serial_B1056,Serial_B1058,Serial_B1059,Serial_B1060,Serial_B1062
4,1,6123.547647,1,False,False,False,1.0,0,0,False,...,False,False,False,False,False,False,False,False,False,False
5,2,525.0,1,False,False,False,1.0,0,0,False,...,False,False,False,False,False,False,False,False,False,False


Export the one-hot encoded dataframe as CSV file to be used for ML later

In [22]:
features_one_hot.to_csv('one_hot_df.csv', index=False)

---


## Part II: Web scrapping from Wikipedia using BeautifulSoup

Even though we fetched data from the SpaceX API using requests calls, sometimes it is not as convenient to gather data in this way for a project. For example, sometimes the data you are looking for is spread over the internet in random websites.

So, to simulate that, we will scrape data from thr Wikipedia's webpage of Falcon 9 launches. Since the data on Wikipedia is not as comprehensive as the data in the SpaceX database, it will not be used in the future steps of the project. The purpose here is to illustrate how web scraping could be implemented in a data pipeline.

In [23]:
import sys
import requests
from bs4 import BeautifulSoup
import unicodedata2
import pandas as pd

This is Wikipedia's webpage

In [24]:
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

We fetch the data as a raw html text file

In [25]:
ob = requests.get(static_url)
soup = BeautifulSoup(ob.text,'html')
soup.title

<title>List of Falcon 9 and Falcon Heavy launches - Wikipedia</title>

Now we need to parse the html text using Beautiful Soup. The goal is to consolidate the data spread among various tables of the website into a single one.

In order to define the column names of the consolidated table, we go to a specific table of the website by parsing all html "table" tags

In [26]:
html_tables = soup.find_all('table')
first_launch_table = html_tables[5]


The following code extracts the column names from the header of the table "first_launch_table" and store them in a list

In [27]:
column_names = []

def extract_column_from_header(row):
    """
    This function returns the landing status from the HTML table cell 
    Input: the  element of a table data cell extracts extra row
    """
    if (row.br):
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
        
    colunm_name = ' '.join(row.contents)
    
    # Filter the digit and empty names
    if not(colunm_name.strip().isdigit()):
        colunm_name = colunm_name.strip()
        return colunm_name 
    

for column in first_launch_table.find_all('th'):
    name = extract_column_from_header(column)
    if name is not None and len(name) > 0:
        column_names.append(extract_column_from_header(column))
        

This is utils code to parse the date, time, landing status and mass from the table rows. It will be used in the next cell

In [28]:
def date_time(table_cells):
    """
    This function returns the data and time from the HTML  table cell
    Input: the  element of a table data cell extracts extra row
    """
    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]

def landing_status(table_cells):
    """
    This function returns the landing status from the HTML table cell 
    Input: the  element of a table data cell extracts extra row
    """
    out=[i for i in table_cells.strings][0]
    return out

def get_mass(table_cells):
    mass=unicodedata2.normalize("NFKD", table_cells.text).strip()
    if mass:
        mass.find("kg")
        new_mass=mass[0:mass.find("kg")+2]
    else:
        new_mass=0
    return new_mass

This is the code to extract the data in the column fields of the various tables and put them in a dictionary

In [29]:
extracted_row = 0

launch_dict= dict.fromkeys(column_names)

# Remove an irrelvant column
del launch_dict['Date and time ( )']

# Let's initiate the launch_dict with each value to be an empty list
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []
launch_dict['Serial'] = []
launch_dict['Booster landing'] = []
launch_dict['Date'] = []

for table_number, table in enumerate(soup.find_all('table',"wikitable plainrowheaders collapsible")):

    for rows in table.find_all("tr"):

        if rows.th:
            if rows.th.string:
                flight_number=rows.th.string.strip()
                flag=flight_number.isdigit()
        else:
            flag=False
        
        row=rows.find_all('td')

        if flag:
            extracted_row += 1

            launch_dict['Flight No.'].append(flight_number)
            datatimelist=date_time(row[0])
            date = datatimelist[0].strip(',')
            launch_dict['Date'].append(date)

            bv=row[1].find_all(string=True)
            boost_version = [s for s in bv if s[0]=='B']
            if len(boost_version) ==1:
                launch_dict['Serial'].append(boost_version[0][0:5])
            else:
                launch_dict['Serial'].append("NS")

            launch_site = row[2].a.string
            launch_dict['Launch site'].append(launch_site)

            payload = row[3].a.string
            launch_dict['Payload'].append(payload)

            payload_mass = str(get_mass(row[4]))
            payload_mass=payload_mass.replace("~","")
            payload_mass=payload_mass.replace("kg","")
            payload_mass=payload_mass.replace(" ","")
            payload_mass=payload_mass.replace(",","")
            payload_mass=payload_mass.replace("C","0")
            payload_mass=payload_mass.replace("5000–6000","5500")
            launch_dict['Payload mass'].append(int(payload_mass))

            orbit = row[5].a.string
            launch_dict['Orbit'].append(orbit)
            
            if row[6].a is not None:
                customer = row[6].a.string
                launch_dict['Customer'].append(customer)

            launch_outcome = list(row[7].strings)[0]
            launch_dict['Launch outcome'].append(launch_outcome.replace("\n",""))
            
            booster_landing = landing_status(row[8])
            launch_dict['Booster landing'].append(booster_landing)



Finally we transform the dictionary to a dataframe and take a look at it to make sure it looks ok

In [30]:
df = pd.DataFrame({ key:pd.Series(value) for key, value in launch_dict.items() })
df['Date'] = pd.to_datetime(df['Date']).dt.date
df.tail(2)


Unnamed: 0,Flight No.,Launch site,Payload,Payload mass,Orbit,Customer,Launch outcome,N/A,Serial,Booster landing,Date
119,120,KSC,SpaceX CRS-22,3328,LEO,Sirius XM,Success,,B1067,Success,2021-06-03
120,121,CCSFS,SXM-8,7000,GTO,,Success,,B1061,Success,2021-06-06


We can then export the dataframe to a CSV file

In [31]:
df.to_csv('dataset_wiki.csv', index=False)