<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 20px; border-radius: 10px; color: white;">
    <h1 style="color: white; text-align: center;">🚢 Titanic Survival Prediction Analysis</h1>
    <p style="text-align: center; font-size: 18px;"><strong>From EDA to Machine Learning - A Professional Data Science Workflow</strong></p>
</div>

## 📋 Project Overview
This analysis explores the Titanic dataset through a comprehensive EDA-to-ML workflow, serving as a bridge between exploratory analysis and predictive modeling in my data science portfolio.

**🎯 Key Objectives:**
- Perform comprehensive exploratory data analysis (EDA)
- Engineer meaningful features from raw data
- Build and evaluate multiple machine learning models
- Demonstrate professional workflow and documentation

## 1. 📥 Import & Setup

We begin by importing all necessary libraries and configuring our environment. This foundational step ensures we have the right tools for data manipulation, visualization, and machine learning tasks throughout the project.

### 1a. Importing Essential Libraries

We import core data science libraries that form the foundation of our analysis. Each library serves a specific purpose in the data science workflow, from data manipulation to machine learning implementation.

In [1]:
# Data manipulation core libraries
import pandas as pd  # Primary data structure (DataFrame) and analysis tools
import numpy as np   # Numerical computing and array operations

# Data visualization libraries  
import matplotlib.pyplot as plt  # Foundation for all plotting in Python
import seaborn as sns            # Enhanced statistical visualizations

# Scikit-learn preprocessing modules
from sklearn.preprocessing import StandardScaler  # Standardizes numeric features (mean=0, std=1)
from sklearn.preprocessing import LabelEncoder    # Converts categorical text to numerical labels
from sklearn.impute import SimpleImputer          # Systematically fills missing values
from sklearn.model_selection import train_test_split  # Creates training/test splits for ML

# System and utility libraries
import warnings  # Manages warning messages during execution
from datetime import datetime  # Handles date/time for analysis timestamping

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


### 1b. Configuration & Settings

We configure our environment with global settings for visualizations and data display. Professional configuration ensures consistent, publication-quality plots and prevents common issues like truncated outputs.

In [2]:
# Configure matplotlib for professional visualizations
plt.style.use('seaborn-v0_8-whitegrid')  # Use seaborn's whitegrid theme for clean background
sns.set_palette("husl")  # Set color palette to "husl" for distinct, accessible colors

# Set default figure size for all plots
plt.rcParams['figure.figsize'] = (10, 6)  # Width: 10 inches, Height: 6 inches
plt.rcParams['font.size'] = 12  # Base font size for all text elements in plots

# Configure pandas display options for better data inspection
pd.set_option('display.max_columns', 50)  # Show up to 50 columns when displaying DataFrames
pd.set_option('display.max_rows', 100)    # Show up to 100 rows when displaying DataFrames
pd.set_option('display.float_format', '{:.2f}'.format)  # Format floats to 2 decimal places

# Suppress warnings for cleaner output (use with caution)
warnings.filterwarnings('ignore')  # Ignore warning messages that don't affect analysis

print("✅ Environment configured successfully!")
print(f"📅 Analysis timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ Environment configured successfully!
📅 Analysis timestamp: 2025-11-25 15:43:18
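
The global `warnings.filterwarnings('ignore')` call in the configuration cell hides every warning for the rest of the session, including ones that may matter later. As a hedged alternative, the sketch below limits suppression to a single block with the standard-library `warnings.catch_warnings()` context manager. The `noisy_step` function is a hypothetical stand-in for any call that emits a warning; it is not part of this notebook.

```python
import warnings


def noisy_step():
    """Hypothetical helper that emits a warning, standing in for a library call."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Suppress warnings only inside this block; the previous filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()

# Outside the block warnings are reported again; record them here to show that.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_step()

print(result, len(caught))
```

Scoping the filter this way keeps noisy cells quiet while leaving unexpected warnings visible elsewhere in the analysis.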


## 2. 📊 Data Loading & Overview

In this section, we load the Titanic dataset and perform an initial exploration. Our goals are to:

- Load the dataset from a reliable source and understand its structure
- Examine the dataset's dimensions (rows and columns) and data types  
- Identify missing values and data quality issues
- Generate statistical summaries to spot outliers and patterns
- Validate column names for consistency throughout our analysis

### 2.1 Load Data

We load the Titanic dataset with seaborn's `load_dataset` helper, which downloads a tidied copy of the data (consistent lowercase column names and a few derived columns such as `who` and `alone`) and caches it locally. The data is not fully cleaned: several columns still contain missing values, which the following cells quantify.

In [3]:
# Load the Titanic example dataset via seaborn (tidied, but still contains missing values)
df = sns.load_dataset('titanic')  # Downloaded on first use, then read from the local cache

print("✅ Dataset loaded successfully!")
print(f"📊 Dataset shape: {df.shape[0]} passengers, {df.shape[1]} features")

# Display the first few rows to understand the data structure
print("\nFirst 5 rows of the dataset:")
display(df.head())

✅ Dataset loaded successfully!
📊 Dataset shape: 891 passengers, 15 features

First 5 rows of the dataset:

| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.28 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.92 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |


In [4]:
# Examine dataset dimensions and basic information
print(f"📐 Dataset Shape: {df.shape[0]} rows, {df.shape[1]} columns")

print("\n" + "="*29)
print("📋 DATA TYPES & MEMORY USAGE")
print("="*29)
df.info()  # Comprehensive overview of data types, non-null counts, and memory usage

print("\n" + "="*27)
print("🔍 MISSING VALUES ANALYSIS")
print("="*27)
missing_data = df.isnull().sum()  # Count null values for each column
missing_percent = (df.isnull().sum() / len(df)) * 100  # Calculate percentage missing

# Create a clean missing values summary
missing_summary = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percent
})
display(missing_summary[missing_summary['Missing Count'] > 0])  # Show only columns with missing values

print(f"\n✅ Total missing values in dataset: {df.isnull().sum().sum()}")

📐 Dataset Shape: 891 rows, 15 columns

📋 DATA TYPES & MEMORY USAGE
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB

🔍 MISSING VALUES ANALYSIS

| | Missing Count | Missing Percentage |
|---|---|---|
| age | 177 | 19.87 |
| embarked | 2 | 0.22 |
| deck | 688 | 77.22 |
| embark_town | 2 | 0.22 |



‚úÖ Total missing values in dataset: 869
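
The missing-value summary built in the cell above can be wrapped in a small helper so it is easy to re-run after each cleaning step. This is only a minimal sketch: `missing_summary` is a hypothetical name, and the toy frame merely imitates the shape of the Titanic data with illustrative values.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "Missing Count": counts,
        "Missing Percentage": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps, sorted by severity.
    affected = summary[summary["Missing Count"] > 0]
    return affected.sort_values("Missing Count", ascending=False)


# Toy data standing in for the Titanic frame (values are illustrative only).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})
print(missing_summary(toy))
```

Sorting by count puts the most affected column first, which makes it easier to decide where imputation or removal deserves attention.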


### 2.3 Statistical Summary

We generate summary statistics for both numerical and categorical features. This helps us identify:

- Outliers in numerical data (e.g., extreme fares or ages)
- Data distribution patterns and central tendencies  
- Top categories in categorical data (e.g., most common passenger class or embarkation point)
- Potential data quality issues requiring attention

In [5]:
print(" " * 10 + "=" * 30)
print(" " * 10 +"📈 NUMERICAL FEATURES SUMMARY")
print(" " * 10 +"=" * 30)
# Describe numerical columns with detailed statistics
numerical_summary = df.describe()  # Generates count, mean, std, min, percentiles, max
display(numerical_summary)

print(" " * 20 +"=" * 32)
print(" " * 20 +"📊 CATEGORICAL FEATURES SUMMARY") 
print(" " * 20 +"=" * 32)
# Describe categorical columns with frequency analysis
categorical_summary = df.describe(include=['object', 'category', 'bool'])  # Includes object, category and boolean columns
display(categorical_summary)

          📈 NUMERICAL FEATURES SUMMARY

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.38 | 2.31 | 29.7 | 0.52 | 0.38 | 32.2 |
| std | 0.49 | 0.84 | 14.53 | 1.1 | 0.81 | 49.69 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.12 | 0.0 | 0.0 | 7.91 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.45 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.33 |

                    📊 CATEGORICAL FEATURES SUMMARY

| | sex | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|
| count | 891 | 889 | 891 | 891 | 891 | 203 | 889 | 891 | 891 |
| unique | 2 | 3 | 3 | 3 | 2 | 7 | 3 | 2 | 2 |
| top | male | S | Third | man | True | C | Southampton | no | True |
| freq | 577 | 644 | 491 | 537 | 537 | 59 | 644 | 549 | 537 |
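
One common way to make the outlier check concrete is the 1.5 × IQR rule, which flags values lying far outside the interquartile range. The sketch below applies it to a handful of illustrative fares rather than the full dataset. `iqr_outliers` is a hypothetical helper, and the 1.5 multiplier is a convention, not something this analysis prescribes.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR beyond the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Illustrative fares only; the last value mimics the extreme maximum in the summary.
toy_fares = pd.Series([7.25, 7.92, 8.05, 14.45, 31.0, 53.1, 71.28, 512.33])
flagged = iqr_outliers(toy_fares)
print(flagged)
```

On a right-skewed column such as `fare`, this rule tends to flag many legitimate high values, so flagged rows are better treated as candidates for inspection than as errors.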


### 2.4 Column Name Validation

We verify that all column names are consistent (no spaces or special characters) to ensure easy programmatic access throughout our analysis. Clean column names prevent errors during data manipulation and analysis.

In [6]:
print("📝 COLUMN NAMES VALIDATION")
print("=" * 27)
print("Column Names List:")
print(list(df.columns))

print("\n🔍 Checking for naming inconsistencies:")
issues_found = False
for col in df.columns:
    # Check for spaces, special characters, or inconsistent formatting
    if " " in col or "-" in col or col != col.lower():
        print(f"⚠️  Column '{col}' contains inconsistencies!")
        issues_found = True

if not issues_found:
    print("✅ All column names are clean and consistent (snake_case format)")
    print("✅ No spaces, special characters, or uppercase letters detected")

📝 COLUMN NAMES VALIDATION
Column Names List:
['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone']

🔍 Checking for naming inconsistencies:
✅ All column names are clean and consistent (snake_case format)
✅ No spaces, special characters, or uppercase letters detected
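
The check above only looks for spaces, hyphens, and uppercase letters, so a name such as `fare($)` would pass. A stricter variant, sketched here as one possible approach, matches each name against a snake_case pattern. `SNAKE_CASE` and `invalid_columns` are hypothetical names, and the sample column lists are made up for illustration.

```python
import re

# Lowercase letter first, then lowercase letters, digits, or underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")


def invalid_columns(columns):
    """Return the column names that are not lowercase snake_case."""
    return [name for name in columns if not SNAKE_CASE.fullmatch(str(name))]


good = ["survived", "pclass", "adult_male", "embark_town"]
bad = ["Passenger Id", "fare($)", "embark-town", "2nd_class"]
print(invalid_columns(good + bad))
```

Calling `invalid_columns(df.columns)` on the loaded frame would return an empty list if every name follows the pattern.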


### 2.5 Observations from Data Overview

Based on our comprehensive data exploration, here are the key findings and insights that will guide our data cleaning and analysis strategy:

In [7]:
print(" " * 30 + "=" * 40)
print(" " * 35 + "🎯 KEY OBSERVATIONS & INSIGHTS")
print(" " * 30 + "=" * 40)

observations = [
    "📊 **Dataset Structure**: 891 passengers with 15 features including survival status, demographics, and travel details",
    "⚠️ **Missing Data**: Age (20%), Deck (77%) have significant missing values requiring careful handling",
    "🎫 **Passenger Class**: Majority are 3rd class (491), indicating socioeconomic distribution",
    "👥 **Demographics**: 577 males vs 314 females, with age range from 0.42 to 80 years",
    "💰 **Fare Analysis**: Wide range (0 to 512) with median 14.45, suggesting economic diversity",
    "🚢 **Embarkation**: Southampton (644) was the most common departure point",
    "🎯 **Target Variable**: 38% survival rate (342 survived, 549 did not)",
    "🔍 **Data Quality**: Clean column names, appropriate data types, no major structural issues"
]

for i, observation in enumerate(observations, 1):
    print(f"{i}. {observation}")

                                   🎯 KEY OBSERVATIONS & INSIGHTS
1. 📊 **Dataset Structure**: 891 passengers with 15 features including survival status, demographics, and travel details
2. ⚠️ **Missing Data**: Age (20%), Deck (77%) have significant missing values requiring careful handling
3. 🎫 **Passenger Class**: Majority are 3rd class (491), indicating socioeconomic distribution
4. 👥 **Demographics**: 577 males vs 314 females, with age range from 0.42 to 80 years
5. 💰 **Fare Analysis**: Wide range (0 to 512) with median 14.45, suggesting economic diversity
6. 🚢 **Embarkation**: Southampton (644) was the most common departure point
7. 🎯 **Target Variable**: 38% survival rate (342 survived, 549 did not)
8. 🔍 **Data Quality**: Clean column names, appropriate data types, no major structural issues


## 3. 🧹 Data Cleaning & Feature Engineering

In this section, we systematically address data quality issues and create new features to enhance our analysis. Our cleaning strategy follows professional standards:

- Handle missing values using appropriate imputation methods
- Fix data type inconsistencies and formatting issues  
- Engineer new features that capture meaningful patterns
- Remove or transform outliers and problematic values
- Ensure data consistency for machine learning readiness

### 3.1 Handling Missing Values

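The right treatment for missing data depends on the type of each column and on how much of it is missing. As a first step, we list the columns that still contain missing values. Common guidelines are:
- **Low missingness (<5%)**, e.g. `embarked`: simple imputation (mean or median for numerical columns, mode for categorical ones)
- **Moderate missingness (5-50%)**, e.g. `age` at about 20%: median or group-wise imputation
- **High missingness (>50%)**, e.g. `deck` at about 77%: consider column removal or advanced imputation
- **Categorical vs Numerical**: Different strategies for different data types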

In [20]:
print("🔄 HANDLING MISSING VALUES")
print("=" * 27)

# Check current missing values
print("Missing values to handle:")
missing_data = df.isnull().sum()
print(missing_data[missing_data > 0])

🔄 HANDLING MISSING VALUES
Missing values to handle:
age            177
embarked         2
deck           688
embark_town      2
age_group      177
dtype: int64
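
The notebook only lists the missing values at this point. To illustrate how the guidelines above might translate into code, here is a hedged sketch of one possible strategy: median for `age`, mode for the two embarkation columns, and dropping `deck`. `impute_titanic` is a hypothetical helper, the toy frame uses made-up rows, and these choices are assumptions rather than the method used later in this analysis.

```python
import numpy as np
import pandas as pd


def impute_titanic(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the frame with one possible imputation strategy applied."""
    out = frame.copy()
    # Numerical, moderate missingness: the median is robust to skewed values.
    out["age"] = out["age"].fillna(out["age"].median())
    # Categorical, low missingness: fill with the most frequent category.
    for col in ["embarked", "embark_town"]:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    # High missingness (>50%): drop the column instead of guessing values.
    return out.drop(columns=["deck"])


# Made-up rows that imitate the columns with gaps.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "embarked": ["S", "C", None, "S"],
    "embark_town": ["Southampton", "Cherbourg", None, "Southampton"],
    "deck": [None, "C", None, None],
})
clean = impute_titanic(toy)
print(clean)
```

Because the helper works on a copy, the original frame stays untouched. Any feature derived from `age`, such as `age_group`, would need to be recomputed after imputation.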


### 3.2 Feature Engineering

We'll create new features that can provide better insights for our analysis. Feature engineering transforms raw data into meaningful attributes that help machine learning models identify patterns more effectively.

In [15]:
print("🔧 CREATING NEW FEATURES")
print("=" * 25)

# Create family size feature
df['family_size'] = df['sibsp'] + df['parch'] + 1  # +1 for the passenger themselves
print("✅ Created 'family_size': Combines siblings/spouses + parents/children + self")

# Create age groups for better analysis (missing ages stay missing)
df['age_group'] = pd.cut(df['age'], 
                        bins=[0, 12, 18, 35, 60, 100], 
                        labels=['Child', 'Teen', 'Adult', 'Middle', 'Senior'])
print("✅ Created 'age_group': Categorical age ranges for demographic analysis")

# Create fare per person
df['fare_per_person'] = df['fare'] / df['family_size']
print("✅ Created 'fare_per_person': Individual fare cost for economic analysis")

print(f"\n📊 New dataset shape: {df.shape} i.e. {df.shape[0]} rows and {df.shape[1]} columns")
print("New features added:", list(df.columns[-3:]))  # The three columns appended above

🔧 CREATING NEW FEATURES
✅ Created 'family_size': Combines siblings/spouses + parents/children + self
✅ Created 'age_group': Categorical age ranges for demographic analysis
✅ Created 'fare_per_person': Individual fare cost for economic analysis

📊 New dataset shape: (891, 18) i.e. 891 rows and 18 columns
New features added: ['family_size', 'age_group', 'fare_per_person']
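
The bin edges used for `age_group` are easy to misread, so the following sketch shows how `pd.cut` treats boundary ages with the same bins and labels. By default the intervals are closed on the right, so an age of exactly 12 falls in `Child` and 18 in `Teen`, while a missing age produces a missing group. The sample ages are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Boundary values plus one missing age.
ages = pd.Series([0.42, 12, 12.5, 18, 35, 60, 80, np.nan])
groups = pd.cut(
    ages,
    bins=[0, 12, 18, 35, 60, 100],
    labels=["Child", "Teen", "Adult", "Middle", "Senior"],
)  # right=True by default, so each bin is (left, right]
print(pd.DataFrame({"age": ages, "age_group": groups}))
```

This behaviour also explains why `age_group` has exactly as many missing values as `age`: rows without an age cannot be assigned to any bin.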


### 3.3 Data Type Optimization

We optimize data types to improve memory efficiency and ensure proper data representation. This is especially important for text columns with few distinct values, which can be converted to the more efficient `category` dtype.

In [18]:
print("🔄 OPTIMIZING DATA TYPES")
print("=" * 25)

# Check memory usage before optimization
memory_before = df.memory_usage(deep=True).sum() / 1024**2  # Convert to MB

# Convert appropriate columns to category dtype
categorical_columns = ['sex', 'embarked', 'class', 'who', 'embark_town', 'alive', 'age_group']
for col in categorical_columns:
    if col in df.columns:
        df[col] = df[col].astype('category')
        print(f"✅ Converted '{col}' to category dtype")

# Check memory usage after optimization  
memory_after = df.memory_usage(deep=True).sum() / 1024**2
memory_saved = memory_before - memory_after

print(f"\n💾 Memory usage: {memory_before:.2f}MB → {memory_after:.2f}MB")
print(f"📉 Memory saved: {memory_saved:.2f}MB ({memory_saved/memory_before*100:.1f}% reduction)")

print(f"\n📊 Final dataset shape: {df.shape}")
print("🔍 Data types after optimization:")
print(df.dtypes.value_counts())

🔄 OPTIMIZING DATA TYPES
✅ Converted 'sex' to category dtype
✅ Converted 'embarked' to category dtype
✅ Converted 'class' to category dtype
✅ Converted 'who' to category dtype
✅ Converted 'embark_town' to category dtype
✅ Converted 'alive' to category dtype
✅ Converted 'age_group' to category dtype

💾 Memory usage: 0.07MB → 0.07MB
📉 Memory saved: 0.00MB (0.0% reduction)

📊 Final dataset shape: (891, 18)
🔍 Data types after optimization:
int64       5
float64     3
bool        2
category    1
category    1
category    1
category    1
category    1
category    1
category    1
category    1
Name: count, dtype: int64
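
The dtype listing above shows `category` several times, most likely because each categorical column carries its own set of categories and is therefore counted as a distinct dtype. A hedged sketch of how to group them under one label, and how to measure the memory effect on a fresh frame in a single pass, follows. The toy frame and its values are assumptions chosen only to make the effect visible.

```python
import pandas as pd

# Toy frame with repetitive text columns (illustrative values only).
toy = pd.DataFrame({
    "sex": ["male", "female"] * 500,
    "embarked": ["S", "C", "Q", "S"] * 250,
    "fare": [7.25, 71.28] * 500,
})
before = toy.memory_usage(deep=True).sum()

# Convert the text columns in one step and measure again.
converted = toy.astype({"sex": "category", "embarked": "category"})
after = converted.memory_usage(deep=True).sum()

# Casting the dtypes to str groups all categorical columns under "category".
dtype_counts = converted.dtypes.astype(str).value_counts()
print(dtype_counts)
print(f"{before} bytes -> {after} bytes")
```

Measuring before and after within the same run avoids a misleading zero saving, which can appear when the conversion cell is executed again on columns that are already categorical.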


### 3.4 Data Cleaning Summary

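Our data preparation phase is complete for now. The seaborn Titanic dataset required little structural cleaning because it ships in a tidied form. So far we have:

- Identified the columns with missing values (`age`, `deck`, `embarked`, `embark_town`); none have been imputed yet, so they must still be handled before modeling
- Engineered three new features (`family_size`, `age_group`, `fare_per_person`) for enhanced analysis
- Converted the text columns to the `category` dtype for efficient processing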

The dataset is now ready for comprehensive exploratory data analysis.

In [21]:
print("✅ DATA CLEANING & FEATURE ENGINEERING COMPLETE")
print("=" * 50)
print(f"📊 Final dataset: {df.shape[0]} passengers, {df.shape[1]} features")
print("🎯 Target variable: 'survived' (Binary classification)")
print("🔧 New features created: 3")
print("🚀 Ready for Exploratory Data Analysis")

✅ DATA CLEANING & FEATURE ENGINEERING COMPLETE
📊 Final dataset: 891 passengers, 18 features
🎯 Target variable: 'survived' (Binary classification)
🔧 New features created: 3
🚀 Ready for Exploratory Data Analysis
