# Data Wrangling

### Performing Exploratory Data Analysis (EDA) to find patterns and determine the target label for training supervised models

The dataset records several landing outcomes, including cases where the booster did not land successfully:

1. <code>True Ocean</code> => successfully landed in a specific ocean region
2. <code>False Ocean</code> => failed to land in a specific ocean region
3. <code>True RTLS</code> => successfully landed on a ground pad
4. <code>False RTLS</code> => failed to land on a ground pad
5. <code>True ASDS</code> => successfully landed on a drone ship
6. <code>False ASDS</code> => failed to land on a drone ship
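
Each outcome pairs a result with a landing type, so the two parts can be separated when they need to be analysed independently. The following is a minimal sketch, assuming the values are space-separated strings like those in the `Outcome` column of this dataset; the column names `Landed` and `LandingType` are hypothetical and not part of the original data.

```python
import pandas as pd

# Sample values as they appear in the `Outcome` column of this dataset.
outcomes = pd.Series(
    ["None None", "False Ocean", "True Ocean", "True ASDS", "False RTLS"]
)

# Split on the first space: the first token is the result,
# the second token is the landing type.
parts = outcomes.str.split(" ", n=1, expand=True)
parts.columns = ["Landed", "LandingType"]  # hypothetical column names

print(parts)
```

Note that both parts remain strings after the split, so `"True"` and `"None"` are text, not Python booleans or `None`.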

⚠️ The outcomes are converted into a training label:

1. <code>1</code> => successfully landed
2. <code>0</code> => did not land successfully

In [8]:
import pandas as pd
import numpy as np 

## Data Analysis

In [9]:
path = './dataset1.csv'

df = pd.read_csv(path)
df.head(10)

Unnamed: 0,FlightNumber,Date,Outcome,Serial,BoosterName,Flights,Reused,ReusedCount,Block,Legs,GridFins,PayloadMass,FinalOrbit,LaunchPad,LandPad,Latitude,Longitude,Failures
0,1,2010-06-04,None None,B0003,Falcon 9,1,False,0,1.0,False,False,6123.547647,LEO,CCSFS SLC 40,,28.561857,-80.577366,Stage Expended
1,2,2012-05-22,None None,B0005,Falcon 9,1,False,0,1.0,False,False,525.0,LEO,CCSFS SLC 40,,28.561857,-80.577366,
2,3,2013-03-01,None None,B0007,Falcon 9,1,False,0,1.0,False,False,677.0,ISS,CCSFS SLC 40,,28.561857,-80.577366,
3,4,2013-09-29,False Ocean,B1003,Falcon 9,1,False,0,1.0,False,False,500.0,PO,VAFB SLC 4E,,34.632093,-120.610829,"First flight of Falcon 9 v1.1 upgrade, first S..."
4,5,2013-12-03,None None,B1004,Falcon 9,1,False,0,1.0,False,False,3170.0,GTO,CCSFS SLC 40,,28.561857,-80.577366,
5,6,2014-01-06,None None,B1005,Falcon 9,1,False,0,1.0,False,False,3325.0,GTO,CCSFS SLC 40,,28.561857,-80.577366,
6,7,2014-04-18,True Ocean,B1006,Falcon 9,1,False,0,1.0,True,False,2296.0,ISS,CCSFS SLC 40,,28.561857,-80.577366,Broke up after sucessful water landing
7,8,2014-07-14,True Ocean,B1007,Falcon 9,1,False,0,1.0,True,False,1316.0,LEO,CCSFS SLC 40,,28.561857,-80.577366,Broke up after sucessful water landing
8,9,2014-08-05,None None,B1008,Falcon 9,1,False,0,1.0,False,False,4535.0,GTO,CCSFS SLC 40,,28.561857,-80.577366,
9,10,2014-09-07,None None,B1011,Falcon 9,1,False,0,1.0,False,False,4428.0,GTO,CCSFS SLC 40,,28.561857,-80.577366,


Check missing values 

In [10]:
df.isnull().sum()

# express the missing values as a percentage of all rows
df.isnull().sum()/df.shape[0]*100

FlightNumber     0.000000
Date             0.000000
Outcome          0.000000
Serial           0.000000
BoosterName      0.000000
Flights          0.000000
Reused           0.000000
ReusedCount      0.000000
Block            0.000000
Legs             0.000000
GridFins         0.000000
PayloadMass      0.000000
FinalOrbit       0.000000
LaunchPad        0.000000
LandPad         28.888889
Latitude         0.000000
Longitude        0.000000
Failures         7.777778
dtype: float64
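
When most columns are complete, it can be easier to read only the columns that actually contain gaps. The sketch below assumes a small toy frame built from the first three rows shown earlier (only three of the columns are reproduced); `toy` and `with_gaps` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame based on the first three rows of the dataset preview.
toy = pd.DataFrame({
    "FlightNumber": [1, 2, 3],
    "LandPad": [np.nan, np.nan, np.nan],
    "Failures": ["Stage Expended", np.nan, np.nan],
})

# isnull().mean() is the fraction of missing rows per column,
# equivalent to isnull().sum() / number_of_rows.
missing_pct = toy.isnull().mean() * 100

# Keep only columns with at least one missing value, largest share first.
with_gaps = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(with_gaps)
```

On the full dataset the same filter would leave only `LandPad` and `Failures`, the two columns with non-zero percentages in the output above.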

Check data types

In [11]:
df.dtypes

FlightNumber      int64
Date             object
Outcome          object
Serial           object
BoosterName      object
Flights           int64
Reused             bool
ReusedCount       int64
Block           float64
Legs               bool
GridFins           bool
PayloadMass     float64
FinalOrbit       object
LaunchPad        object
LandPad          object
Latitude        float64
Longitude       float64
Failures         object
dtype: object
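
The `Date` column is stored as `object`, meaning the dates are plain strings. If date arithmetic or grouping by year is needed later, the column can be parsed first. This is a sketch under the assumption that every date follows the `YYYY-MM-DD` pattern seen in the preview; it is not a step performed in this notebook.

```python
import pandas as pd

# The first three dates from the dataset preview.
dates = pd.Series(["2010-06-04", "2012-05-22", "2013-03-01"], name="Date")

# Parse the strings into datetime values using an explicit format.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# The .dt accessor exposes date components such as the year.
print(parsed.dt.year.tolist())
```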

Use the <code>LaunchPad</code> column to identify each launch location, then count the number of launches per site.

In [12]:
# count how many launches took place at each launch pad
df['LaunchPad'].value_counts()

CCSFS SLC 40    55
KSC LC 39A      22
VAFB SLC 4E     13
Name: LaunchPad, dtype: int64
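
The raw counts can also be expressed as shares of all launches. The sketch below rebuilds the counts from the output above rather than reading the CSV; on the real frame, `df['LaunchPad'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Launch counts copied from the value_counts() output above.
counts = pd.Series(
    {"CCSFS SLC 40": 55, "KSC LC 39A": 22, "VAFB SLC 4E": 13},
    name="LaunchPad",
)

# Share of all launches per pad, as a percentage rounded to one decimal.
share = (counts / counts.sum() * 100).round(1)
print(share)
```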

## Find the number of occurrences of each landing outcome

In [13]:
# count the values in the 'Outcome' column to find the landing outcomes
# landing_outcomes holds one count per distinct outcome
landing_outcomes = df['Outcome'].value_counts()
landing_outcomes

True ASDS      41
None None      19
True RTLS      14
False ASDS      6
True Ocean      5
False Ocean     2
None ASDS       2
False RTLS      1
Name: Outcome, dtype: int64

Enumerate the outcomes to see the positional index of each key

In [14]:
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)

0 True ASDS
1 None None
2 True RTLS
3 False ASDS
4 True Ocean
5 False Ocean
6 None ASDS
7 False RTLS


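Create a set, <code>ko_outcomes</code>, containing the outcomes that count as unsuccessful landings (positions 1, 3, 5, 6 and 7 above)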

In [15]:
ko_outcomes = set(landing_outcomes.keys()[[1, 3, 5, 6, 7]])
ko_outcomes

{'False ASDS', 'False Ocean', 'False RTLS', 'None ASDS', 'None None'}
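
Selecting by position depends on the order that `value_counts()` returns, which may differ if the counts change or tie (here `False Ocean` and `None ASDS` both occur twice). As an alternative sketch, the same set can be built by name, assuming every successful outcome starts with `True`; `ko_by_name` is an illustrative variable name.

```python
import pandas as pd

# Outcome counts copied from the value_counts() output above.
landing_outcomes = pd.Series({
    "True ASDS": 41, "None None": 19, "True RTLS": 14, "False ASDS": 6,
    "True Ocean": 5, "False Ocean": 2, "None ASDS": 2, "False RTLS": 1,
})

# Every outcome that does not start with "True" is treated as unsuccessful.
ko_by_name = {o for o in landing_outcomes.index if not o.startswith("True")}
print(sorted(ko_by_name))
```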

## Create landing outcome label from the 'Outcome' column

From <code>Outcome</code>, create a list:

1. An element is 0 if the corresponding row in <code>Outcome</code> is in <code>ko_outcomes</code>; otherwise the element is 1
2. Assign the list to the variable <code>land_status</code>

In [16]:
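# 0 = unsuccessful landing, 1 = successful landing
land_status = []

for item in df['Outcome']:
    # ko_outcomes is already a set, so membership can be tested directly
    if item in ko_outcomes:
        land_status.append(0)
    else:
        land_status.append(1)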

In [17]:
df['Status'] = land_status
df[['Status']].head(10)

Unnamed: 0,Status
0,0
1,0
2,0
3,0
4,0
5,0
6,1
7,1
8,0
9,0
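
The same label can be produced without an explicit loop. The following sketch uses `Series.isin` on the first ten `Outcome` values from the dataset preview; it is an alternative formulation, not the approach used in the cells above.

```python
import pandas as pd

# The set of unsuccessful outcomes defined earlier in the notebook.
ko_outcomes = {"False ASDS", "False Ocean", "False RTLS", "None ASDS", "None None"}

# First ten `Outcome` values from the dataset preview.
outcome = pd.Series([
    "None None", "None None", "None None", "False Ocean", "None None",
    "None None", "True Ocean", "True Ocean", "None None", "None None",
])

# isin() flags unsuccessful rows; negating and casting gives 1 for success.
status = (~outcome.isin(ko_outcomes)).astype(int)
print(status.tolist())
```

The result matches the first ten rows of the `Status` column shown above.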


## Determine the mean landing success rate

In [18]:
# the mean of a 0/1 label is the fraction of successful landings
df['Status'].mean()

0.6666666666666666
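
Because `Status` only holds 0 and 1, its mean equals the number of successful landings divided by the number of launches. As a quick cross-check, the sketch below recomputes the rate from the outcome counts listed earlier, assuming successful outcomes are exactly those starting with `True`.

```python
import pandas as pd

# Outcome counts copied from the value_counts() output earlier.
landing_outcomes = pd.Series({
    "True ASDS": 41, "None None": 19, "True RTLS": 14, "False ASDS": 6,
    "True Ocean": 5, "False Ocean": 2, "None ASDS": 2, "False RTLS": 1,
})

# Sum the counts of all outcomes that start with "True".
successes = landing_outcomes[
    [o for o in landing_outcomes.index if o.startswith("True")]
].sum()
total = landing_outcomes.sum()

success_rate = successes / total
print(successes, total, success_rate)
```

This gives 60 successes out of 90 launches, which agrees with the mean reported above.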

Export the dataset

In [19]:
df.to_csv('dataSet_part_2.csv', index=False)