# IBM Applied Data Science Capstone
## Part 2: Data Wrangling

**Objective:** Clean, transform, and prepare data for analysis

**Author:** Son Nguyen

---


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style("whitegrid")
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✓ Libraries imported successfully!")
print("=" * 60)


✓ Libraries imported successfully!


In [2]:
# Load SpaceX dataset
df = pd.read_csv('../data/spacex_launches.csv')

print("✓ Dataset loaded successfully!")
print(f"✓ Shape: {df.shape}")
print(f"✓ Columns: {len(df.columns)}")
print("\nColumn names:")
for i, col in enumerate(df.columns, 1):
    print(f"  {i:2d}. {col}")
print("\nFirst 5 rows:")
df.head()


✓ Dataset loaded successfully!
✓ Shape: (187, 22)
✓ Columns: 22

Column names:
   1. Flight_Number
   2. Launch_Name
   3. Date_UTC
   4. Year
   5. Month
   6. Quarter
   7. Success
   8. Success_Rate
   9. Rocket_Name
  10. Rocket_Type
  11. Cost_Per_Launch
  12. Launchpad_Name
  13. Location
  14. Region
  15. Latitude
  16. Longitude
  17. Payload_Count
  18. Payload_Mass_kg
  19. Payload_Type
  20. Core_Landing
  21. Core_Reused
  22. Year_Category

First 5 rows:


,Flight_Number,Launch_Name,Date_UTC,Year,Month,Quarter,Success,Success_Rate,Rocket_Name,Rocket_Type,...,Location,Region,Latitude,Longitude,Payload_Count,Payload_Mass_kg,Payload_Type,Core_Landing,Core_Reused,Year_Category
0,1,FalconSat,2006-03-24T22:30:00.000Z,2006,3,1,False,0.0,Unknown,Unknown,...,Omelek Island,Marshall Islands,9.047721,167.743129,1,20.0,Satellite,No Attempt,False,Early
1,2,DemoSat,2007-03-21T01:10:00.000Z,2007,3,1,False,0.0,Unknown,Unknown,...,Omelek Island,Marshall Islands,9.047721,167.743129,1,0.0,Satellite,No Attempt,False,Early
2,3,Trailblazer,2008-08-03T03:34:00.000Z,2008,8,3,False,0.0,Unknown,Unknown,...,Omelek Island,Marshall Islands,9.047721,167.743129,2,0.0,Satellite,No Attempt,False,Early
3,4,RatSat,2008-09-28T23:15:00.000Z,2008,9,3,True,1.0,Unknown,Unknown,...,Omelek Island,Marshall Islands,9.047721,167.743129,1,165.0,Satellite,No Attempt,False,Early
4,5,RazakSat,2009-07-13T03:35:00.000Z,2009,7,3,True,1.0,Unknown,Unknown,...,Omelek Island,Marshall Islands,9.047721,167.743129,1,200.0,Satellite,No Attempt,False,Early


## 2. Data Cleaning and Validation

### 2.1 Check for Missing Values

This step counts the missing values in each column, reports them as counts and percentages, and fills any gaps in the `Success` column.


In [3]:
# Check for missing values
print("MISSING VALUES ANALYSIS")
print("-" * 60)

missing_values = df.isnull().sum()
missing_pct = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percentage': missing_pct.round(2)
}).sort_values('Missing Count', ascending=False)

print("\nMissing Values Summary:")
if missing_df['Missing Count'].sum() > 0:
    print(missing_df[missing_df['Missing Count'] > 0])
    
    # Handle missing values
    # Fill Success with its most common value instead of a hard-coded constant
    if df['Success'].isnull().sum() > 0:
        most_common = df['Success'].mode()[0]
        df['Success'] = df['Success'].fillna(most_common).astype(bool)
    
    print("\n✓ Missing values handled")
else:
    print("✓ No missing values found!")

print(f"\nDataset shape after cleaning: {df.shape}")


MISSING VALUES ANALYSIS
------------------------------------------------------------

Missing Values Summary:
         Missing Count  Missing Percentage
Success              1                0.53

✓ Missing values handled

Dataset shape after cleaning: (187, 22)
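
The section title mentions validation, but the cell above only checks for nulls. A few additional sanity checks could look like the sketch below. It is a minimal illustration, not part of the original notebook: the helper name `validate_launches` is hypothetical, the first three rows are copied from the preview above, and the fourth row is deliberately broken so that each check has something to catch.

```python
import pandas as pd

# Three rows copied from the preview above, plus one deliberately broken row
# (hypothetical) so that each check has something to report.
sample = pd.DataFrame({
    "Flight_Number": [1, 2, 3, 3],                     # last value is a duplicate
    "Date_UTC": [
        "2006-03-24T22:30:00.000Z",
        "2007-03-21T01:10:00.000Z",
        "2008-08-03T03:34:00.000Z",
        "2008-08-03T03:34:00.000Z",
    ],
    "Year": [2006, 2007, 2008, 2009],                  # last value disagrees with Date_UTC
    "Latitude": [9.047721, 9.047721, 9.047721, 95.0],  # 95.0 is outside [-90, 90]
    "Longitude": [167.743129] * 4,
})


def validate_launches(frame: pd.DataFrame) -> dict:
    """Return the number of rows that fail each basic sanity check."""
    parsed = pd.to_datetime(frame["Date_UTC"], utc=True, errors="coerce")
    return {
        "duplicate_flight_numbers": int(frame["Flight_Number"].duplicated().sum()),
        "unparseable_dates": int(parsed.isna().sum()),
        "year_mismatch": int((parsed.dt.year != frame["Year"]).sum()),
        "bad_latitude": int((~frame["Latitude"].between(-90, 90)).sum()),
        "bad_longitude": int((~frame["Longitude"].between(-180, 180)).sum()),
    }


report = validate_launches(sample)
print(report)
```

Running the same helper on the full `df` would show whether the derived `Year` column and the coordinates are consistent before any features are built on top of them.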


### 2.2 Feature Engineering

Create additional features that will be useful for analysis and predictive modeling.


In [4]:
# Create Date column from Date_UTC
df['Date'] = pd.to_datetime(df['Date_UTC'])

# Create target variable for classification: First Stage Landing Success
# This is the main prediction target - whether SpaceX successfully landed the first stage
df['Landing_Success'] = (df['Core_Landing'] == 'Success').astype(int)
df['Landing_Success_Label'] = df['Landing_Success'].map({0: 'Failed/No Attempt', 1: 'Success'})

# Create Launch Success binary
df['Launch_Success_Binary'] = df['Success'].astype(int)

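# Create Cost Category
# include_lowest=True makes the first bin [0, 1], so a cost of 0 is labeled
# 'Free/Unknown' instead of falling outside the bins and becoming NaN
if df['Cost_Per_Launch'].max() > 0:
    df['Cost_Category'] = pd.cut(
        df['Cost_Per_Launch'],
        bins=[0, 1, 50_000_000, 70_000_000, 100_000_000],
        labels=['Free/Unknown', 'Low Cost', 'Mid Cost', 'High Cost'],
        include_lowest=True
    )
else:
    df['Cost_Category'] = 'Unknown'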

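# Create Payload Mass Category
# Note: pd.cut bins are right-closed by default, so the first bin is (0, 1000]
# and a Payload_Mass_kg of exactly 0 is left as NaN (blank in the sample output)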
df['Payload_Mass_Category'] = pd.cut(
    df['Payload_Mass_kg'],
    bins=[0, 1000, 5000, 10000, float('inf')],
    labels=['Light', 'Medium', 'Heavy', 'Very Heavy']
)

# Create Launch Period (Early SpaceX vs Recent)
df['Launch_Period'] = np.where(df['Year'] < 2015, 'Early', 'Recent')

# Create Landing Attempt binary
df['Landing_Attempt'] = (df['Core_Landing'] != 'No Attempt').astype(int)

print("✓ Feature engineering completed!")
print("\nNew features created:")
new_features = ['Date', 'Landing_Success', 'Landing_Success_Label', 'Launch_Success_Binary',
                'Cost_Category', 'Payload_Mass_Category', 'Launch_Period', 'Landing_Attempt']
print(f"  - {', '.join(new_features)}")

print("\nSample of new features:")
df[['Launch_Name', 'Year', 'Rocket_Name', 'Landing_Success_Label', 'Payload_Mass_Category', 'Launch_Period']].head(10)


✓ Feature engineering completed!

New features created:
  - Date, Landing_Success, Landing_Success_Label, Launch_Success_Binary, Cost_Category, Payload_Mass_Category, Launch_Period, Landing_Attempt

Sample of new features:


,Launch_Name,Year,Rocket_Name,Landing_Success_Label,Payload_Mass_Category,Launch_Period
0,FalconSat,2006,Unknown,Failed/No Attempt,Light,Early
1,DemoSat,2007,Unknown,Failed/No Attempt,,Early
2,Trailblazer,2008,Unknown,Failed/No Attempt,,Early
3,RatSat,2008,Unknown,Failed/No Attempt,Light,Early
4,RazakSat,2009,Unknown,Failed/No Attempt,Light,Early
5,Falcon 9 Test Flight,2010,Unknown,Failed/No Attempt,,Early
6,COTS 1,2010,Unknown,Failed/No Attempt,,Early
7,COTS 2,2012,Unknown,Failed/No Attempt,Light,Early
8,CRS-1,2012,Unknown,Failed/No Attempt,Light,Early
9,CRS-2,2013,Unknown,Failed/No Attempt,Light,Early
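
Several rows in the sample above have an empty `Payload_Mass_Category`. These are the launches whose `Payload_Mass_kg` is `0.0`: `pd.cut` builds right-closed intervals by default, so the first bin is `(0, 1000]` and excludes zero. The sketch below reproduces this with the five payload masses from the first preview and shows how `include_lowest=True` changes the result. Whether a mass of zero should count as "Light" or stay missing is a modeling decision this sketch does not settle, since `0.0` may be a placeholder for an unknown mass.

```python
import pandas as pd

# Payload masses taken from the first five rows shown earlier.
masses = pd.Series([20.0, 0.0, 0.0, 165.0, 200.0], name="Payload_Mass_kg")
bins = [0, 1000, 5000, 10000, float("inf")]
labels = ["Light", "Medium", "Heavy", "Very Heavy"]

# Default behaviour: intervals are (0, 1000], (1000, 5000], ...
# so a value of exactly 0.0 falls outside every bin and becomes NaN.
default_cut = pd.cut(masses, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 1000].
inclusive_cut = pd.cut(masses, bins=bins, labels=labels, include_lowest=True)

print("NaN count with default bins:", default_cut.isna().sum())
print("Labels with include_lowest=True:", inclusive_cut.tolist())
```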


In [5]:
# Save cleaned dataset
output_path = '../data/spacex_launches_cleaned.csv'
df.to_csv(output_path, index=False)

print("=" * 60)
print("DATA WRANGLING COMPLETE")
print("=" * 60)
print(f"✓ Cleaned dataset saved to: {output_path}")
print(f"✓ Final shape: {df.shape}")
print(f"✓ Total columns: {len(df.columns)}")
print("\n✓ Dataset is ready for exploratory data analysis!")
print("✓ Ready for predictive modeling (target: Landing_Success)")

# Summary statistics
print("\nDataset Summary:")
print(f"  - Total launches: {len(df)}")
print(f"  - Time period: {df['Year'].min()} - {df['Year'].max()}")
print(f"  - Successful launches: {df['Launch_Success_Binary'].sum()} ({df['Launch_Success_Binary'].mean()*100:.1f}%)")
print(f"  - Successful landings: {df['Landing_Success'].sum()} ({df['Landing_Success'].mean()*100:.1f}%)")
print(f"  - Reused cores: {df['Core_Reused'].sum()} ({df['Core_Reused'].mean()*100:.1f}%)")


DATA WRANGLING COMPLETE
✓ Cleaned dataset saved to: ../data/spacex_launches_cleaned.csv
✓ Final shape: (187, 30)
✓ Total columns: 30

✓ Dataset is ready for exploratory data analysis!
✓ Ready for predictive modeling (target: Landing_Success)

Dataset Summary:
  - Total launches: 187
  - Time period: 2006 - 2022
  - Successful launches: 182 (97.3%)
  - Successful landings: 143 (76.5%)
  - Reused cores: 115 (61.5%)
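
One thing to keep in mind when the cleaned file is loaded in a later notebook: CSV stores plain text, so the parsed `Date` column and the categorical columns produced by `pd.cut` come back as strings unless they are converted again. The sketch below illustrates this on a two-row stand-in frame written to an in-memory buffer instead of `../data/spacex_launches_cleaned.csv`. The stand-in data and the restore steps are assumptions for illustration, not part of the original pipeline.

```python
import io

import pandas as pd

labels = ["Light", "Medium", "Heavy", "Very Heavy"]

# Small stand-in for the cleaned DataFrame (two launches from the preview).
original = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2006-03-24T22:30:00.000Z", "2008-09-28T23:15:00.000Z"], utc=True
    ),
    "Payload_Mass_Category": pd.Categorical(
        ["Light", "Light"], categories=labels, ordered=True
    ),
    "Landing_Success": [0, 0],
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# A plain read loses the datetime and categorical dtypes.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Restoring them takes an explicit conversion after reading.
buffer.seek(0)
restored = pd.read_csv(buffer)
restored["Date"] = pd.to_datetime(restored["Date"], utc=True)
restored["Payload_Mass_Category"] = pd.Categorical(
    restored["Payload_Mass_Category"], categories=labels, ordered=True
)

print(naive.dtypes)
print(restored.dtypes)
```

A binary format such as Parquet would preserve these dtypes, at the cost of an extra dependency.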
