# **SpaceX Falcon 9 Landing Prediction**


## Data wrangling


In this notebook, our primary goal is to conduct Exploratory Data Analysis (EDA) to uncover patterns in the dataset and determine appropriate labels for training supervised machine learning models.

The dataset includes various scenarios where rocket boosters either succeeded or failed in landing attempts. For instance:

*   <code>True Ocean</code> indicates that the booster successfully landed in a designated ocean region.
*   <code>False Ocean</code> denotes a failed landing attempt in a designated ocean region.
*   <code>True RTLS</code> signifies a successful landing on a ground pad, while <code>False RTLS</code> means the attempt to land on a ground pad was unsuccessful.
*   <code>True ASDS</code> represents a successful landing on a drone ship, whereas <code>False ASDS</code> indicates a failed landing attempt on a drone ship.

In this lab, we will primarily focus on converting these mission outcomes into binary training labels, where a value of 1 indicates a successful landing and 0 represents an unsuccessful attempt. This labeling will be crucial for training our supervised learning models effectively.
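As a minimal sketch of how these outcome strings can be read, the example below splits each string into its result part and its landing-type part and derives a 0/1 label from the result. The sample values and the column names `Result`, `LandingType`, and `Label` are assumptions made for illustration; they are not part of the dataset.

```python
import pandas as pd

# Hypothetical sample of outcome strings in the "<result> <landing type>" format
outcomes = pd.Series(["True ASDS", "False Ocean", "None None", "True RTLS"], name="Outcome")

# Split each string once on the space into a result part and a landing-type part
parts = outcomes.str.split(" ", n=1, expand=True)
parts.columns = ["Result", "LandingType"]

# The label is 1 only when the result part is exactly "True", otherwise 0
parts["Label"] = (parts["Result"] == "True").astype(int)
print(parts)
```

Splitting first keeps the landing type available for later analysis, while the label depends only on the result part.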

A successful Falcon 9 first-stage landing is shown here:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing\_1.gif)


Several examples of an unsuccessful landing are shown here:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)


## Objectives

Perform Exploratory Data Analysis and determine training labels:

*   Exploratory Data Analysis
*   Determine Training Labels


***


## Import Libraries


We will import the following libraries.


In [1]:
# pandas is a Python library for data manipulation and analysis.
import pandas as pd
# NumPy adds support for large, multi-dimensional arrays and matrices, along with high-level mathematical functions that operate on them.
import numpy as np

### Data Analysis


Load the SpaceX dataset from the previous section.


In [2]:
df=pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_1.csv")
df.head(10)

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude
0,1,2010-06-04,Falcon 9,6104.959412,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0003,-80.577366,28.561857
1,2,2012-05-22,Falcon 9,525.0,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0005,-80.577366,28.561857
2,3,2013-03-01,Falcon 9,677.0,ISS,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0007,-80.577366,28.561857
3,4,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1,False,False,False,,1.0,0,B1003,-120.610829,34.632093
4,5,2013-12-03,Falcon 9,3170.0,GTO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B1004,-80.577366,28.561857
5,6,2014-01-06,Falcon 9,3325.0,GTO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B1005,-80.577366,28.561857
6,7,2014-04-18,Falcon 9,2296.0,ISS,CCAFS SLC 40,True Ocean,1,False,False,True,,1.0,0,B1006,-80.577366,28.561857
7,8,2014-07-14,Falcon 9,1316.0,LEO,CCAFS SLC 40,True Ocean,1,False,False,True,,1.0,0,B1007,-80.577366,28.561857
8,9,2014-08-05,Falcon 9,4535.0,GTO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B1008,-80.577366,28.561857
9,10,2014-09-07,Falcon 9,4428.0,GTO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B1011,-80.577366,28.561857


Now identify and calculate the percentage of missing values in each attribute.

**NOTE:**

Identifying and calculating the percentage of missing values in each attribute is crucial for several reasons:

1. **Data Quality Assessment**: Understanding the extent of missing data helps assess the quality and reliability of the dataset. High percentages of missing values in certain attributes might indicate issues in data collection, which could affect the overall analysis.

2. **Informed Decision-Making**: Knowing which attributes have missing values allows you to make informed decisions on how to handle them. You can choose to impute missing data, exclude those records, or even remove the attribute entirely, depending on the percentage of missing values and their impact on the analysis.

3. **Model Performance**: Missing values can negatively impact the performance of machine learning models. Identifying these gaps early enables you to apply appropriate techniques to address them, ensuring that the models trained on the data are as accurate and robust as possible.

4. **Pattern Detection**: The distribution of missing values can sometimes reveal patterns or biases in the data collection process. For example, if missing values are concentrated in certain attributes or correlated with specific variables, it could indicate underlying trends that are important for the analysis.

5. **Data Imputation Strategies**: The percentage of missing values influences the choice of imputation methods. For example, a small percentage of missing values might be handled with simple techniques like mean or median imputation, while larger gaps might require more sophisticated methods like predictive modeling or data augmentation.

6. **Regulatory Compliance**: In some fields, understanding and documenting missing data is essential for meeting regulatory standards. Accurately identifying and reporting the percentage of missing values ensures transparency and compliance with these requirements.

By identifying and calculating the percentage of missing values, you lay the groundwork for cleaner, more accurate data analysis, ultimately leading to more reliable and actionable insights.



In [3]:
# Percentage of missing values per column: null count divided by the total number of rows
df.isnull().sum()/len(df)*100

FlightNumber       0.000000
Date               0.000000
BoosterVersion     0.000000
PayloadMass        0.000000
Orbit              0.000000
LaunchSite         0.000000
Outcome            0.000000
Flights            0.000000
GridFins           0.000000
Reused             0.000000
Legs               0.000000
LandingPad        28.888889
Block              0.000000
ReusedCount        0.000000
Serial             0.000000
Longitude          0.000000
Latitude           0.000000
dtype: float64
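A minimal sketch of reporting the missing count next to the missing percentage, where the percentage is taken relative to the total number of rows. The toy frame and the names `toy` and `summary` are hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration
toy = pd.DataFrame({
    "LandingPad": ["pad_a", np.nan, "pad_b", np.nan],
    "PayloadMass": [500.0, 525.0, np.nan, 677.0],
    "Orbit": ["LEO", "ISS", "PO", "GTO"],
})

# isna().mean() is the fraction of missing rows per column, because True counts as 1
summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": toy.isna().mean() * 100,
})
print(summary.sort_values("missing_pct", ascending=False))
```

Keeping the raw count beside the percentage makes it easier to judge whether a column with gaps still has enough observed values to be useful.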

Identify which columns are numerical and categorical:


Identifying which columns are numerical and categorical is an important step in data preprocessing for several reasons:

1. **Appropriate Analysis Techniques**: Numerical and categorical data require different statistical and analytical techniques. Numerical data can be analyzed using measures like mean, median, and standard deviation, and can be used in regression models, while categorical data is analyzed using frequency counts, mode, and is suitable for classification models. Knowing the type of each column ensures that you apply the correct methods and tools for analysis.

2. **Data Transformation and Encoding**: Categorical variables often need to be transformed into a format suitable for machine learning algorithms. Techniques like one-hot encoding, label encoding, or ordinal encoding are used for this purpose. Identifying categorical columns allows you to choose the appropriate encoding method and prepare the data correctly for model training.

3. **Feature Scaling**: Numerical data often requires scaling or normalization to ensure that features contribute equally to the model. Identifying numerical columns helps in applying scaling techniques like Min-Max scaling or Standardization. Proper scaling improves model performance by ensuring that numerical features are on a similar scale.

4. **Model Selection and Performance**: Different machine learning models have varying requirements for input data types. For instance, linear regression and support vector machines work well with numerical data, while decision trees and random forests can handle both numerical and categorical data. Identifying the type of each column helps in selecting the appropriate model and tuning it effectively.

5. **Exploratory Data Analysis (EDA)**: The type of data (numerical or categorical) influences how you visualize and interpret the data. Numerical data might be visualized using histograms or scatter plots, while categorical data can be visualized with bar charts or pie charts. Identifying the type of each column ensures that you use the right visualization techniques to uncover patterns and insights.

6. **Handling Missing Values**: The strategy for handling missing values can differ based on the type of column. Numerical columns might use imputation techniques such as mean or median substitution, while categorical columns might use the mode or a placeholder category. Knowing the data types allows you to apply appropriate missing value treatment methods.

7. **Feature Engineering**: Knowing which columns are numerical and categorical helps in feature engineering, such as creating interaction features or polynomial features for numerical data and grouping or binning categorical data. This enhances the ability to derive new insights and improve model performance.

8. **Data Quality and Integrity**: Identifying the types of columns helps ensure data quality and integrity. For example, ensuring that numerical columns do not contain non-numeric values or that categorical columns do not have invalid categories.

By clearly identifying numerical and categorical columns, you can tailor your data preprocessing, analysis, and modeling strategies to fit the nature of your data, leading to more accurate and meaningful results.


In [5]:
df.dtypes

FlightNumber        int64
Date               object
BoosterVersion     object
PayloadMass       float64
Orbit              object
LaunchSite         object
Outcome            object
Flights             int64
GridFins             bool
Reused               bool
Legs                 bool
LandingPad         object
Block             float64
ReusedCount         int64
Serial             object
Longitude         float64
Latitude          float64
dtype: object
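One way to turn the dtype listing into explicit column groups is `select_dtypes`. The sketch below uses a hypothetical toy frame; the grouping into numeric, boolean, and everything else is an assumption about how the columns will be treated later, not a rule from the lab.

```python
import pandas as pd

# Hypothetical toy frame with one column of each kind
toy = pd.DataFrame({
    "FlightNumber": [1, 2, 3],
    "PayloadMass": [500.0, 525.0, 677.0],
    "Orbit": ["LEO", "ISS", "PO"],
    "GridFins": [False, True, True],
})

# Integer and float columns
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Boolean flags are selected separately; they are not part of "number"
boolean_cols = toy.select_dtypes(include="bool").columns.tolist()
# Everything that is neither numeric nor boolean is treated as categorical here
categorical_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(numeric_cols, boolean_cols, categorical_cols)
```

Note that a column such as `Date` would land in the categorical group under this rule until it is parsed as a datetime.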

### Challenge 1: Calculate the Number of Launches at Each Site

In this challenge, you are tasked with calculating the number of launches that occurred at each SpaceX launch facility. The dataset includes information about several launch sites, each representing a distinct facility where SpaceX rockets are launched. Here’s a breakdown of the task:

1. **Identify Launch Sites**:
   The dataset specifies three primary SpaceX launch sites, each with its unique identifier. These are:
   - [Cape Canaveral Space Launch Complex 40 (CCAFS SLC 40)](https://en.wikipedia.org/wiki/List_of_Cape_Canaveral_and_Merritt_Island_launch_sites?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01): Located at Cape Canaveral, Florida, this site is recorded in the dataset as "CCAFS SLC 40".
   - **Vandenberg Air Force Base Space Launch Complex 4E (VAFB SLC 4E)**: Situated at Vandenberg Air Force Base in California, this site is recorded in the dataset as "VAFB SLC 4E".
   - **Kennedy Space Center Launch Complex 39A (KSC LC 39A)**: Located at the Kennedy Space Center in Florida, this site is known as "KSC LC 39A".

2. **Data Column**:
   The location of each launch is recorded in the dataset under the column named `LaunchSite`. This column indicates where each launch took place and is used to aggregate the data for analysis.

3. **Calculate Launch Counts**:
   To calculate the number of launches at each site, you will need to perform the following steps:
   - **Data Aggregation**: Count the occurrences of each launch site in the `LaunchSite` column. This will provide the total number of launches associated with each facility.
   - **Summarize Results**: Present the results in a clear format, such as a table or a bar chart, showing the number of launches for each launch site.

4. **Analysis and Interpretation**:
   - **Analyze Patterns**: Understanding the distribution of launches across different sites can offer insights into operational preferences or site usage patterns.
   - **Operational Insights**: Insights gained from this analysis can help in evaluating the efficiency of each launch site and making data-driven decisions regarding future launch planning.

This challenge is an essential step in understanding launch site utilization and can provide valuable information for operational analysis and strategic planning within the space industry.



Next, let's see the number of launches for each site.

Use the method <code>value_counts()</code> on the column <code>LaunchSite</code> to determine the number of launches at each site.

Then complete the remaining steps of the challenge.


In [5]:
# Apply value_counts() on column LaunchSite
df['LaunchSite'].value_counts()

CCAFS SLC 40    55
KSC LC 39A      22
VAFB SLC 4E     13
Name: LaunchSite, dtype: int64
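To summarize the result, the counts can also be expressed as a share of all launches. The sketch below rebuilds the counts shown above by hand so that it runs on its own; the names `launch_counts` and `launch_share` are illustrative.

```python
import pandas as pd

# Launch counts copied from the value_counts() output above
launch_counts = pd.Series(
    {"CCAFS SLC 40": 55, "KSC LC 39A": 22, "VAFB SLC 4E": 13},
    name="launches",
)

# Total number of launches across the three sites
total = launch_counts.sum()

# Percentage of launches per site, rounded to one decimal place
launch_share = (launch_counts / total * 100).round(1)
print(launch_share)
```

On the full DataFrame, `df['LaunchSite'].value_counts(normalize=True)` returns the same proportions as fractions without building the counts by hand.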

Each launch targets a dedicated orbit. Here are some common orbit types:


### Orbit Types and Their Characteristics

* **LEO (Low Earth Orbit)**: Low Earth Orbit is an Earth-centered orbit with an altitude of 2,000 kilometers (1,200 miles) or less, which is approximately one-third of the Earth's radius. Satellites in LEO complete at least 11.25 orbits per day, with each orbit lasting 128 minutes or less, and have an eccentricity less than 0.25. Most human-made objects in outer space are in LEO, making it a key region for satellite operations and research. [Learn more about LEO](https://en.wikipedia.org/wiki/Low_Earth_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01).

* **VLEO (Very Low Earth Orbit)**: Very Low Earth Orbits are characterized by a mean altitude below 450 kilometers. Operating in VLEO allows spacecraft to be closer to the Earth's surface, providing enhanced resolution for Earth observation missions. This proximity enables better data collection for imaging and environmental monitoring. [Explore VLEO benefits and challenges](https://www.researchgate.net/publication/271499606_Very_Low_Earth_Orbit_mission_concepts_for_Earth_Observation_Benefits_and_challenges?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01).

* **GTO (Geostationary Transfer Orbit)**: A Geostationary Transfer Orbit is a highly elliptical intermediate orbit used to carry a satellite toward geosynchronous or geostationary orbit. Its apogee lies near the geosynchronous altitude of 35,786 kilometers (22,236 miles) above the equator. Once a satellite reaches geosynchronous orbit, which is crucial for applications like weather monitoring, communication, and surveillance, it orbits at the same rate as Earth's rotation and appears to stay over a single longitude, although it may drift slightly north or south. [Read more about geosynchronous orbits](https://www.space.com/29222-geosynchronous-orbit.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01).

* **SSO (Sun-Synchronous Orbit)**: Also known as a heliosynchronous orbit, a Sun-synchronous orbit is a nearly polar orbit that allows a satellite to pass over the same point on the Earth's surface at the same local mean solar time. This orbit is ideal for Earth observation missions requiring consistent lighting conditions. [Learn about SSO](https://en.wikipedia.org/wiki/Sun-synchronous_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01).

* **ES-L1 (Earth-Sun Lagrange Point 1)**: At the Lagrange points, the gravitational forces of two large bodies, such as the Earth and the Sun, create equilibrium points where a small object can maintain a stable position relative to these bodies. L1 is one of these points located between the Earth and the Sun, useful for solar observation missions. [Explore Lagrange Points](https://en.wikipedia.org/wiki/Lagrange_point?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01#L1_point).

* **HEO (Highly Elliptical Orbit)**: Highly Elliptical Orbits are characterized by their high eccentricity, resulting in an orbit that is highly elongated. These orbits are typically used to achieve extended observations over specific regions of the Earth, making them valuable for missions requiring long-duration coverage of particular areas. [Understand HEO](https://en.wikipedia.org/wiki/Highly_elliptical_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01).

* **ISS (International Space Station)**: The International Space Station is a modular space station in low Earth orbit, serving as a habitat for astronauts and a platform for scientific research. It represents a collaborative effort among five major space agencies: NASA (United States), Roscosmos (Russia), JAXA (Japan), ESA (Europe), and CSA (Canada). [Learn more about the ISS](https://en.wikipedia.org/wiki/International_Space_Station?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01).

* **MEO (Medium Earth Orbit)**: Medium Earth Orbits range from 2,000 kilometers (1,200 miles) to just below geosynchronous orbit at 35,786 kilometers (22,236 miles). These orbits, often referred to as intermediate circular orbits, are typically positioned around 20,200 kilometers (12,600 miles) or 20,650 kilometers (12,830 miles) with an orbital period of approximately 12 hours. [Discover MEO](https://en.wikipedia.org/wiki/List_of_orbits?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01).

* **HEO (High Earth Orbit)**: The abbreviation HEO is also used for High Earth Orbit, which covers geocentric orbits above the altitude of geosynchronous orbit (35,786 kilometers or 22,236 miles). Unlike a Highly Elliptical Orbit, a High Earth Orbit is defined by its altitude rather than its eccentricity. [Learn more about HEO](https://en.wikipedia.org/wiki/List_of_orbits?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01).

* **GEO (Geostationary Orbit)**: A circular geosynchronous orbit located 35,786 kilometers (22,236 miles) above the Earth's equator. Satellites in GEO move in sync with the Earth's rotation, allowing them to remain fixed relative to a specific point on the Earth's surface. This characteristic is critical for applications that need continuous coverage of one area, such as communications and weather monitoring. [Explore GEO](https://en.wikipedia.org/wiki/Geostationary_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01).

* **PO (Polar Orbit)**: A type of orbit where a satellite passes over or nearly over both poles of the Earth. This orbit allows the satellite to view almost every part of the Earth's surface as the planet rotates beneath it. Polar orbits are commonly used for Earth observation and remote sensing missions. [Learn about Polar Orbits](https://en.wikipedia.org/wiki/Polar_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01).

Some of these orbits are illustrated in the following plot:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/Orbits.png)


### Challenge 2: Calculate the Number and Occurrence of Each Orbit

To calculate the number and occurrence of each orbit in the `Orbit` column using the `.value_counts()` method, follow these steps:

1. Ensure your data is loaded into a DataFrame.
2. Apply the `.value_counts()` method to the `Orbit` column.


In [6]:
# Apply value_counts on Orbit column
orbit_counts = df['Orbit'].value_counts()
# Display the result
print(orbit_counts)

GTO      27
ISS      21
VLEO     14
PO        9
LEO       7
SSO       5
MEO       3
GEO       1
ES-L1     1
SO        1
HEO       1
Name: Orbit, dtype: int64

### Challenge 3: Calculate the Number and Occurrence of Mission Outcomes per Orbit Type

To calculate the number and occurrence of each mission outcome per orbit type, follow these steps:

1. Use the `.value_counts()` method to determine the number and occurrence of each mission outcome in the `Outcome` column.
2. Assign the result to the variable `landing_outcomes`.

Here's how you can do it:




In [6]:
landing_outcomes = df['Outcome'].value_counts()

# Display the result
print(landing_outcomes)

Outcome
True ASDS      41
None None      19
True RTLS      14
False ASDS      6
True Ocean      5
False Ocean     2
None ASDS       2
False RTLS      1
Name: count, dtype: int64


<code>True Ocean</code> means the booster landed successfully in a specific region of the ocean, while <code>False Ocean</code> means the landing attempt in a specific region of the ocean failed. <code>True RTLS</code> means the booster landed successfully on a ground pad, while <code>False RTLS</code> means the ground pad landing attempt failed. <code>True ASDS</code> means the booster landed successfully on a drone ship, while <code>False ASDS</code> means the drone ship landing attempt failed. <code>None ASDS</code> and <code>None None</code> represent a failure to land.
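Given these meanings, the outcome counts already show how many launches ended in a successful landing. The sketch below copies the counts printed above into a Series and sums the entries that start with `True`; the names `outcome_counts`, `successes`, and `success_share` are illustrative.

```python
import pandas as pd

# Outcome counts copied from the value_counts() output above
outcome_counts = pd.Series({
    "True ASDS": 41,
    "None None": 19,
    "True RTLS": 14,
    "False ASDS": 6,
    "True Ocean": 5,
    "False Ocean": 2,
    "None ASDS": 2,
    "False RTLS": 1,
})

# Keep only the outcomes whose label starts with "True" and add them up
successes = outcome_counts[outcome_counts.index.str.startswith("True")].sum()

# Share of all launches that ended in a successful landing
success_share = successes / outcome_counts.sum()
print(successes, outcome_counts.sum(), round(success_share, 3))
```

This works out to 60 successful landings out of 90 launches, or roughly two thirds.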


In [8]:
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)

0 True ASDS
1 None None
2 True RTLS
3 False ASDS
4 True Ocean
5 None ASDS
6 False Ocean
7 False RTLS


We create a set of outcomes where the first stage did not land successfully:


In [10]:
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes

{'False ASDS', 'False Ocean', 'False RTLS', 'None ASDS', 'None None'}
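Selecting by position depends on the order returned by `value_counts()`, and that order is not guaranteed for outcomes with tied counts. A sketch of an alternative that selects by name, assuming every successful outcome label starts with `True`; the names `outcome_labels` and `bad_outcomes_by_prefix` are hypothetical.

```python
import pandas as pd

# Outcome labels copied from the enumerated output above
outcome_labels = pd.Index([
    "True ASDS", "None None", "True RTLS", "False ASDS",
    "True Ocean", "None ASDS", "False Ocean", "False RTLS",
])

# Any label that does not start with "True" is treated as an unsuccessful landing
bad_outcomes_by_prefix = {label for label in outcome_labels if not label.startswith("True")}
print(sorted(bad_outcomes_by_prefix))
```

The resulting set is the same as the one built from positions, but it does not change if the order of the counts does.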

### Extra Challenge: Analyze the Success Rate of Landings Per Orbit Type

Now that you've calculated the number and occurrence of mission outcomes per orbit type, take it a step further:

1. **Calculate the Success Rate**: 
   - Calculate the success rate of landings for each orbit type by determining the ratio of successful landings (`True`) to the total number of landings for that orbit type.
   - You can define a "successful landing" as any outcome that starts with `True`.

2. **Visualize the Success Rates**:
   - Create a bar chart or any other suitable visualization that shows the success rate of landings for each orbit type. 
   - Make sure to label your axes and provide a title for your chart.

3. **Interpret the Results**:
   - Write a brief summary explaining which orbit types have the highest and lowest success rates and any insights you can draw from this analysis.

**Hint**: You may need to filter the outcomes and perform some additional calculations to get the success rates.
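One possible approach to the first step is sketched below on a hypothetical toy frame, so the numbers are not results from the real dataset. It marks an outcome as successful when it starts with `True` and then averages that flag within each orbit type; the `Success` column and the name `success_rate_by_orbit` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; the real DataFrame has the same two column names
toy = pd.DataFrame({
    "Orbit": ["LEO", "LEO", "GTO", "GTO", "GTO", "ISS"],
    "Outcome": ["True Ocean", "None None", "True ASDS", "False ASDS", "None None", "True RTLS"],
})

# Flag successful landings: any outcome that starts with "True"
toy["Success"] = toy["Outcome"].str.startswith("True")

# The mean of a boolean flag within each group is that group's success rate
success_rate_by_orbit = (
    toy.groupby("Orbit")["Success"].mean().sort_values(ascending=False)
)
print(success_rate_by_orbit)
```

For the visualization step, `success_rate_by_orbit.plot(kind="bar")` draws a bar chart when matplotlib is installed. Orbit types with very few launches give unstable rates, so it helps to report the launch count next to each rate.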


### Challenge 4: Create a Landing Outcome Label from the Outcome Column

In this challenge, you'll create a binary classification label based on the `Outcome` column. Follow these steps:

1. **Define the `bad_outcome` Set**:
   - Create a set called `bad_outcome` that includes all the outcomes you want to classify as a failure. The values must match the strings that actually appear in the `Outcome` column. For example:
     ```python
     bad_outcome = {"False ASDS", "False Ocean", "False RTLS", "None ASDS", "None None"}
     ```

2. **Generate the `landing_class` List**:
   - Iterate over the `Outcome` column, and for each entry:
     - Assign `0` if the outcome is in the `bad_outcome` set.
     - Assign `1` otherwise.
   - Store this list in a variable called `landing_class`. For example:
     ```python
     landing_class = [0 if outcome in bad_outcome else 1 for outcome in df["Outcome"]]
     ```

3. **Check the Result**:
   - Print the first few elements of `landing_class` to verify the labels have been correctly assigned.

4. **Bonus**: 
   - Add `landing_class` as a new column to your DataFrame to keep track of the labels alongside the original data.

By completing this challenge, you'll have successfully created a binary label that distinguishes between successful and unsuccessful landing outcomes.
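The steps above can also be written without an explicit loop. The sketch below, on a hypothetical toy frame, uses `isin` to test membership in the failure set and stores the result directly as a `landing_class` column.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real DataFrame
toy = pd.DataFrame({"Outcome": ["True ASDS", "None None", "False RTLS", "True Ocean"]})

# Outcomes treated as unsuccessful landings
bad_outcomes = {"False ASDS", "False Ocean", "False RTLS", "None ASDS", "None None"}

# isin() marks the bad outcomes; negating and casting gives 1 for success, 0 for failure
toy["landing_class"] = (~toy["Outcome"].isin(bad_outcomes)).astype(int)
print(toy)
```

Storing the label as a column keeps it aligned with the other features when the data is exported.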



The `landing_class` variable is the classification label for the outcome of each launch. A value of 0 means the first stage did not land successfully; a value of 1 means the first stage landed successfully.


We can use the following line of code to determine the success rate:

```python
success_rate = df["landing_class"].mean()
```

### Explanation:
- The `landing_class` column contains `1` for successful outcomes and `0` for failures.
- The `.mean()` function calculates the average of this binary column, which effectively gives the proportion of successful outcomes, i.e., the success rate.
  
For example, if `landing_class` contains 70 ones and 30 zeros, the success rate will be `0.7` or 70%.



We can now export the data to a CSV file for the next section.

<code>df.to_csv("preprocessedSpaceX\_2.csv", index=False)</code>
