# Titanic Dataset K-means Clustering Analysis

This notebook applies K-means clustering to the Titanic dataset to group passengers by their characteristics and look for patterns across those groups. It walks through each step of the pipeline, from data preprocessing to interpreting the resulting clusters.

**What is K-means Clustering?**
K-means is an unsupervised machine learning algorithm that groups data points into k clusters based on feature similarity. It works by finding cluster centers (centroids) and assigning each data point to the nearest centroid.

**Why use clustering on Titanic data?**
- Discover passenger segments with similar characteristics
- Understand survival patterns across different groups
- Identify risk factors and safety insights
- Segment passengers for targeted analysis

---

## 1. Import Required Libraries

Before we start our analysis, we need to import essential Python libraries. Each library serves a specific purpose in our machine learning pipeline.

**Libraries explanation:**
- **pandas**: Data manipulation and analysis (think Excel but more powerful)
- **numpy**: Numerical computations and array operations
- **matplotlib & seaborn**: Data visualization and plotting
- **sklearn**: Machine learning algorithms and tools
- **warnings**: Suppress unnecessary warning messages

```python
# Data manipulation and analysis
import pandas as pd  # For dataframes, data loading, and manipulation
import numpy as np   # For numerical operations and array handling

# Data visualization libraries
import matplotlib.pyplot as plt  # Basic plotting functionality
import seaborn as sns           # Statistical data visualization (prettier plots)

# Machine learning components
from sklearn.cluster import KMeans              # K-means clustering algorithm
from sklearn.preprocessing import StandardScaler, LabelEncoder  # Data preprocessing tools
from sklearn.decomposition import PCA          # Principal Component Analysis for dimensionality reduction
from sklearn.metrics import silhouette_score   # Clustering quality measurement

# Utility imports
import warnings
warnings.filterwarnings('ignore')  # Hide warning messages for cleaner output

# Set random seed for reproducible results
np.random.seed(42)
```

## 2. Load and Explore the Dataset

Data exploration is the foundation of any successful machine learning project. We need to understand our data structure, types, missing values, and basic statistics before proceeding.

**Key concepts:**
- **Dataset shape**: Number of rows (samples) and columns (features)
- **Data types**: Numerical vs categorical variables
- **Missing values**: Gaps in data that need handling
- **Statistical summary**: Mean, median, standard deviation, etc.

```python
# Load the Titanic dataset
# Method 1: If you have a CSV file uploaded to Colab
# df = pd.read_csv('titanic.csv')

# Method 2: Use seaborn's built-in dataset (recommended for this tutorial)
df = sns.load_dataset('titanic')  # Loads the famous Titanic dataset

# Basic dataset information
print("Dataset shape:", df.shape)  # (rows, columns) - tells us dataset size
print("\nColumn names and types:")
print(df.dtypes)  # Shows data type of each column (int64, float64, object, etc.)

print("\nFirst few rows:")
print(df.head())  # Display first 5 rows to see data structure

print("\nDataset statistical summary:")
print(df.describe())  # Statistical summary for numerical columns

print("\nDetailed dataset information:")
df.info()  # Prints memory usage, non-null counts, data types (returns None, so no print())

print("\nMissing values count:")
missing_values = df.isnull().sum()  # Count missing values in each column
print(missing_values[missing_values > 0])  # Show only columns with missing values

print("\nUnique values in categorical columns:")
# seaborn stores 'class' and 'deck' with the pandas 'category' dtype, so include it
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_columns:
    print(f"{col}: {df[col].nunique()} unique values -> {list(df[col].unique())}")
```





---

## 3. Data Preprocessing and Feature Engineering

Data preprocessing is crucial for clustering success. Raw data often contains missing values, inconsistent formats, and needs transformation for machine learning algorithms.

**Key preprocessing steps:**
1. **Missing value imputation**: Fill gaps in data with appropriate values
2. **Feature engineering**: Create new meaningful features from existing ones
3. **Categorical encoding**: Convert text categories to numbers
4. **Feature scaling**: Normalize numerical ranges

**Why preprocessing matters:**
- K-means uses distance calculations, so all features need similar scales
- Missing values can break algorithms
- Well-engineered features improve clustering quality
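
One caveat on the categorical encoding step: integer codes place categories on a number line, so a distance-based algorithm treats code 0 and code 2 as farther apart than 0 and 1, even for an unordered feature such as embarkation port. One-hot encoding is a common alternative. The following is a minimal sketch on a hypothetical `toy` frame (not the notebook's data), using `pd.get_dummies`:

```python
import pandas as pd

# Toy stand-in for the 'embarked' column (values are illustrative only)
toy = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S', 'C']})

# One-hot encoding: one 0/1 column per port, so no artificial ordering is implied
one_hot = pd.get_dummies(toy['embarked'], prefix='embarked', dtype=int)
print(one_hot)
```

Each row has exactly one `1`, so every pair of distinct ports is equally far apart. The trade-off is more columns, which matters when a category has many levels.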

```python
def preprocess_data(df):
    """
    Comprehensive data preprocessing function for Titanic dataset.
    
    This function handles:
    - Missing value imputation using statistical methods
    - Feature engineering to create meaningful new variables
    - Categorical variable encoding for machine learning compatibility
    
    Args:
        df (pandas.DataFrame): Raw Titanic dataset
    
    Returns:
        pandas.DataFrame: Preprocessed dataset ready for clustering
    """
    # Create a copy to avoid modifying the original dataset
    data = df.copy()
    print("Starting data preprocessing...")
    
    # === MISSING VALUE IMPUTATION ===
    print("\n1. Handling missing values:")
    
    # Age: Use median (robust to outliers, represents typical passenger age)
    median_age = data['age'].median()
    data['age'] = data['age'].fillna(median_age)
    print(f"   • Filled {df['age'].isnull().sum()} missing ages with median: {median_age:.1f}")
    
    # Fare: Use median (some passengers had free tickets or unknown fares)
    median_fare = data['fare'].median()
    data['fare'] = data['fare'].fillna(median_fare)
    print(f"   • Filled {df['fare'].isnull().sum()} missing fares with median: ${median_fare:.2f}")
    
    # Embarked: Use mode (most common port)
    mode_embarked = data['embarked'].mode()[0]  # mode() returns a Series, [0] gets the value
    data['embarked'] = data['embarked'].fillna(mode_embarked)
    print(f"   • Filled {df['embarked'].isnull().sum()} missing embarkation ports with mode: {mode_embarked}")
    
    # Deck: Fill with 'Unknown' (many passengers had unknown deck assignments)
    deck_missing = data['deck'].isnull().sum()
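    # seaborn loads 'deck' as a categorical column; cast to object first so that
    # 'Unknown' (not an existing category) can be used as the fill value
    data['deck'] = data['deck'].astype(object).fillna('Unknown')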
    print(f"   • Filled {deck_missing} missing deck assignments with 'Unknown'")
    
    # === FEATURE ENGINEERING ===
    print("\n2. Creating new features:")
    
    # Family size: Total family members aboard (including passenger)
    data['family_size'] = data['sibsp'] + data['parch'] + 1
    print(f"   • Created 'family_size': sibsp + parch + 1 (range: {data['family_size'].min()}-{data['family_size'].max()})")
    
    # Is alone: Boolean indicator for solo travelers
    data['is_alone'] = (data['family_size'] == 1).astype(int)  # Convert boolean to 0/1
    alone_count = data['is_alone'].sum()
    print(f"   • Created 'is_alone': {alone_count} passengers ({alone_count/len(data)*100:.1f}%) traveled alone")
    
    # Age groups: Categorize ages into life stages
    data['age_group'] = pd.cut(data['age'], 
                              bins=[0, 18, 35, 55, 100],  # Age boundaries
                              labels=['Child', 'Young', 'Middle', 'Senior'],  # Category names
                              include_lowest=True)  # Include boundary values
    print(f"   • Created 'age_group': {data['age_group'].value_counts().to_dict()}")
    
    # Fare groups: Economic status indicator based on ticket price
    data['fare_group'] = pd.cut(data['fare'], 
                               bins=[0, 10, 50, 100, 1000],  # Price boundaries
                               labels=['Low', 'Medium', 'High', 'Very High'],  # Economic levels
                               include_lowest=True)
    print(f"   • Created 'fare_group': {data['fare_group'].value_counts().to_dict()}")
    
    # === CATEGORICAL ENCODING ===
    print("\n3. Encoding categorical variables:")
    
    # LabelEncoder converts categories to numbers (0, 1, 2, ...)
    le = LabelEncoder()
    categorical_cols = ['sex', 'embarked', 'class', 'who', 'deck', 'age_group', 'fare_group']
    
    for col in categorical_cols:
        if col in data.columns:
            # Create new encoded column (keeps original for reference)
            encoded_col = col + '_encoded'
            data[encoded_col] = le.fit_transform(data[col].astype(str))
            
            # Show encoding mapping (LabelEncoder numbers the sorted classes 0, 1, 2, ...)
            mapping = {str(label): code for code, label in enumerate(le.classes_)}
            print(f"   • Encoded '{col}': {mapping}")
    
    print(f"\nPreprocessing completed! Dataset shape: {data.shape}")
    return data

# Apply preprocessing to our dataset
processed_df = preprocess_data(df)

# Show the results
print("\nPreprocessed dataset preview:")
print(processed_df.head()[['age', 'fare', 'family_size', 'is_alone', 'sex_encoded', 'embarked_encoded']])
```

---

## 4. Feature Selection and Preparation

Feature selection is critical for clustering success. We need to choose features that are meaningful, non-redundant, and appropriate for distance-based algorithms.

**Feature selection principles:**
- **Relevance**: Features should relate to passenger characteristics
- **Non-redundancy**: Avoid highly correlated features
- **Measurability**: All features should be quantifiable
- **Clustering-appropriate**: Suitable for distance calculations

**Our feature categories:**
1. **Demographics**: Age, gender
2. **Socioeconomic**: Class, fare
3. **Family structure**: Family size, traveling alone
4. **Journey details**: Embarkation port
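
The non-redundancy principle can be checked numerically. The sketch below uses a small hypothetical `toy` frame (illustrative values, not the real dataset) and an assumed correlation threshold of 0.8 to flag feature pairs that carry overlapping information. Because `family_size` is built from `sibsp` and `parch`, it tends to correlate strongly with them.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature matrix (values are illustrative only)
toy = pd.DataFrame({
    'sibsp': [1, 0, 0, 3, 1, 0],
    'parch': [0, 0, 2, 1, 1, 0],
    'age':   [22, 38, 26, 35, 4, 54],
})
toy['family_size'] = toy['sibsp'] + toy['parch'] + 1

# Absolute pairwise Pearson correlations
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag pairs above a hypothetical threshold of 0.8
threshold = 0.8
high_pairs = [
    (row, col, round(float(upper.loc[row, col]), 2))
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > threshold
]
print(high_pairs)
```

Keeping a derived feature alongside its inputs effectively gives that information extra weight in the distance calculation, so it is worth deciding deliberately whether that is wanted.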

```python
# Select features for clustering analysis
print("=== FEATURE SELECTION FOR CLUSTERING ===\n")

# Define our feature set with clear rationale
feature_columns = [
    'pclass',           # Passenger class (1st=1, 2nd=2, 3rd=3) - Socioeconomic status
    'sex_encoded',      # Gender encoded (0=female, 1=male) - Demographic factor
    'age',              # Age in years - Demographic factor
    'sibsp',            # Siblings/spouses aboard - Family structure
    'parch',            # Parents/children aboard - Family structure  
    'fare',             # Ticket fare in pounds - Economic indicator
    'embarked_encoded', # Port of embarkation encoded - Journey characteristic
    'family_size',      # Total family size (engineered feature)
    'is_alone'          # Traveling alone indicator (engineered feature)
]

print("Selected features and their business meaning:")
feature_descriptions = {
    'pclass': 'Passenger class (1st, 2nd, 3rd) - Social/economic status',
    'sex_encoded': 'Gender (0=female, 1=male) - Demographic characteristic',
    'age': 'Age in years - Life stage and vulnerability',
    'sibsp': 'Number of siblings/spouses - Family support system',
    'parch': 'Number of parents/children - Family responsibilities',
    'fare': 'Ticket fare - Economic capability and cabin quality',
    'embarked_encoded': 'Embarkation port - Journey origin and class',
    'family_size': 'Total family size - Social support network',
    'is_alone': 'Solo traveler indicator - Social isolation factor'
}

for i, (feature, description) in enumerate(feature_descriptions.items(), 1):
    print(f"{i:2d}. {feature:<17} -> {description}")

# Create the feature matrix X
X = processed_df[feature_columns].copy()
print(f"\nFeature matrix created with shape: {X.shape}")
print("(Rows = passengers, Columns = features)")

# Display feature statistics
print("\nFeature statistics:")
print(X.describe().round(2))

# Check for any remaining missing values
print(f"\nMissing values check:")
missing_check = X.isnull().sum()
if missing_check.sum() == 0:
    print("✅ No missing values found!")
else:
    print(f"⚠️  Found missing values: {missing_check[missing_check > 0]}")
    # Handle any remaining missing values with median imputation
    X = X.fillna(X.median())
    print("   Fixed with median imputation")

print(f"\nFinal feature matrix shape: {X.shape}")
print("Ready for clustering algorithm!")
```

---

## 5. Feature Standardization

K-means clustering is sensitive to feature scales because it uses Euclidean distance. Features with large numeric ranges (like fare, which runs into the hundreds) will dominate features with small ranges (like number of siblings, which stays in single digits).
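
To make the scale problem concrete, here is a small sketch with two hypothetical passengers described only by `(fare, sibsp)`. It shows how much of the squared Euclidean distance comes from each feature when no scaling is applied; the values are made up for illustration.

```python
import numpy as np

# Two hypothetical passengers described by (fare, sibsp)
a = np.array([7.25, 0.0])
b = np.array([71.28, 4.0])

# Per-feature contribution to the squared Euclidean distance
squared_diffs = (a - b) ** 2
share = squared_diffs / squared_diffs.sum()

print(f"Distance: {np.sqrt(squared_diffs.sum()):.2f}")
print(f"Share from fare:  {share[0]:.1%}")
print(f"Share from sibsp: {share[1]:.1%}")
```

Even though a difference of four siblings is large for that feature, the fare difference accounts for nearly all of the distance.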

**Why standardization is crucial:**
- **Equal influence**: All features contribute equally to distance calculations
- **Algorithm stability**: Prevents numerical issues and improves convergence
- **Interpretability**: Standardized cluster centers are easier to interpret

**Standardization process:**
- Transform each feature to have mean = 0 and standard deviation = 1
- Formula: (value - mean) / standard_deviation
- Result: All features have similar scales
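
The formula above can be verified by hand. This sketch, on a few illustrative fare values, computes z-scores with NumPy and compares them with `StandardScaler`. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample version (`ddof=1`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy fare values (illustrative only)
fares = np.array([[7.25], [71.28], [8.05], [53.10], [13.00]])

# Manual z-score: (value - mean) / standard deviation.
# np.std defaults to ddof=0, which matches StandardScaler.
manual = (fares - fares.mean(axis=0)) / fares.std(axis=0)

scaled = StandardScaler().fit_transform(fares)

print(np.allclose(manual, scaled))
```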

```python
print("=== FEATURE STANDARDIZATION ===\n")

# Display original feature scales for comparison
print("Original feature scales (before standardization):")
scale_comparison = pd.DataFrame({
    'Feature': X.columns,
    'Min': X.min().round(2),
    'Max': X.max().round(2),
    'Mean': X.mean().round(2),
    'Std': X.std().round(2),
    'Range': (X.max() - X.min()).round(2)
})
print(scale_comparison)

print(f"\nProblem: Features have vastly different scales!")
print(f"• Fare ranges from ${X['fare'].min():.0f} to ${X['fare'].max():.0f} (range: {X['fare'].max() - X['fare'].min():.0f})")
print(f"• Age ranges from {X['age'].min():.0f} to {X['age'].max():.0f} years (range: {X['age'].max() - X['age'].min():.0f})")
print(f"• Sex encoded is only 0 or 1 (range: {X['sex_encoded'].max() - X['sex_encoded'].min():.0f})")

# Initialize the StandardScaler
print(f"\nApplying StandardScaler transformation...")
scaler = StandardScaler()

# Fit the scaler to our data and transform it
# fit() calculates mean and std for each feature
# transform() applies the standardization formula
X_scaled = scaler.fit_transform(X)

print("StandardScaler process:")
print("1. fit() - Calculate mean and standard deviation for each feature")
print("2. transform() - Apply formula: (value - mean) / std")

# Convert back to DataFrame for easier analysis
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

# Verify standardization worked correctly
print(f"\nAfter standardization verification:")
print("✅ All features should have mean ≈ 0 and std ≈ 1")
standardized_stats = pd.DataFrame({
    'Feature': X_scaled_df.columns,
    'Mean': X_scaled_df.mean().round(6),      # Should be ~0
    'Std': X_scaled_df.std(ddof=0).round(6),  # Should be ~1 (ddof=0 matches StandardScaler)
    'Min': X_scaled_df.min().round(2),
    'Max': X_scaled_df.max().round(2)
})
print(standardized_stats)

# Show the transformation effect with examples
print(f"\nStandardization example for 'fare' feature:")
original_fares = X['fare'].head(3).values
standardized_fares = X_scaled_df['fare'].head(3).values
for i in range(3):
    print(f"  Passenger {i+1}: ${original_fares[i]:.2f} → {standardized_fares[i]:.3f}")

print(f"\nStandardized feature matrix ready!")
print(f"Shape: {X_scaled.shape} (same as before, just transformed values)")
print(f"Now all features contribute equally to distance calculations! 🎯")
```

---

## 6. Determine Optimal Number of Clusters

Choosing the right number of clusters is crucial for meaningful results. We'll use two complementary methods to find the optimal k.

**Methods for finding optimal k:**
1. **Elbow Method**: Look for the "elbow" point where adding more clusters doesn't significantly reduce within-cluster variance
2. **Silhouette Analysis**: Measures how well-separated clusters are (range: -1 to 1, higher is better)

**Key concepts:**
- **Inertia**: Sum of squared distances from points to their cluster centers (lower is better)
- **Silhouette Score**: Average silhouette coefficient across all points (higher is better)
- **Trade-off**: Adding clusters keeps lowering inertia, but too many clusters split natural groups into fragments that are hard to interpret
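
To show what inertia actually measures, the sketch below recomputes it by hand on synthetic data (two well-separated blobs, not the Titanic features) and compares it with the value scikit-learn reports. The names `X_toy` and `model` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs (illustrative data only)
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

model = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_toy)

# Inertia by hand: squared distance from each point to its assigned centroid
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((X_toy - assigned_centroids) ** 2).sum()

print(np.isclose(manual_inertia, model.inertia_))
print(f"Silhouette score: {silhouette_score(X_toy, model.labels_):.3f}")
```

On clean, separated blobs like these the silhouette score is high; real passenger data is much less tidy, so lower scores are to be expected.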

```python
def plot_elbow_method(X_scaled, max_k=10):
    """
    Determine optimal number of clusters using Elbow Method and Silhouette Analysis.
    
    The Elbow Method plots inertia (within-cluster sum of squares) vs number of clusters.
    The "elbow" point indicates optimal k where adding more clusters gives diminishing returns.
    
    Silhouette Score measures cluster separation quality. The rule-of-thumb
    thresholds used in this function are:
    - Score > 0.7: Excellent clustering
    - Score > 0.5: Good clustering
    - Score > 0.3: Fair clustering
    - Otherwise: Poor clustering
    
    Args:
        X_scaled (numpy.ndarray): Standardized feature matrix
        max_k (int): Maximum number of clusters to test
    
    Returns:
        tuple: (K_range, inertias, silhouette_scores) for further analysis
    """
    print("=== FINDING OPTIMAL NUMBER OF CLUSTERS ===\n")
    
    # Initialize storage for results
    inertias = []           # Within-cluster sum of squares
    silhouette_scores = []  # Cluster separation quality
    K_range = range(2, max_k + 1)  # Start from 2 (can't have 1 cluster for silhouette)
    
    print("Testing different numbers of clusters:")
    print("k | Inertia  | Silhouette | Interpretation")
    print("-" * 45)
    
    # Test each possible number of clusters
    for k in K_range:
        # Initialize K-means with specific parameters
        kmeans = KMeans(
            n_clusters=k,      # Number of clusters to form
            random_state=42,   # For reproducible results
            n_init=10,         # Number of random initializations (chooses best)
            max_iter=300       # Maximum iterations for convergence
        )
        
        # Fit the model and predict cluster labels
        cluster_labels = kmeans.fit_predict(X_scaled)
        
        # Calculate metrics
        inertia = kmeans.inertia_  # Within-cluster sum of squared distances
        sil_score = silhouette_score(X_scaled, cluster_labels)  # Cluster quality
        
        # Store results
        inertias.append(inertia)
        silhouette_scores.append(sil_score)
        
        # Interpret silhouette score
        if sil_score > 0.7:
            interpretation = "Excellent"
        elif sil_score > 0.5:
            interpretation = "Good"
        elif sil_score > 0.3:
            interpretation = "Fair"
        else:
            interpretation = "Poor"
            
        print(f"{k} | {inertia:8.2f} | {sil_score:10.3f} | {interpretation}")
    
    # Create visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # === ELBOW METHOD PLOT ===
    # Use 'o-' (markers + line) and set the color once via the keyword argument
    ax1.plot(K_range, inertias, 'o-', markersize=8, linewidth=2, color='steelblue')
    ax1.set_xlabel('Number of Clusters (k)', fontsize=12)
    ax1.set_ylabel('Inertia (Within-cluster sum of squares)', fontsize=12)
    ax1.set_title('Elbow Method for Optimal k\n(Look for the "elbow" point)', fontsize=14)
    ax1.grid(True, alpha=0.3)
    
    # Add annotations for key points
    for i, (k, inertia) in enumerate(zip(K_range, inertias)):
        if i % 2 == 0:  # Annotate every other point to avoid crowding
            ax1.annotate(f'k={k}\n{inertia:.0f}', 
                        (k, inertia), 
                        textcoords="offset points", 
                        xytext=(0,10), 
                        ha='center', fontsize=9)
    
    # === SILHOUETTE SCORE PLOT ===  
    colors = ['red' if score < 0.3 else 'orange' if score < 0.5 else 'green' for score in silhouette_scores]
    bars = ax2.bar(K_range, silhouette_scores, color=colors, alpha=0.7)
    ax2.set_xlabel('Number of Clusters (k)', fontsize=12)
    ax2.set_ylabel('Silhouette Score', fontsize=12)
    ax2.set_title('Silhouette Score for Different k\n(Higher is better, >0.5 is good)', fontsize=14)
    ax2.grid(True, alpha=0.3, axis='y')
    ax2.set_ylim(0, 1)
    
    # Add value labels on bars
    for bar, score in zip(bars, silhouette_scores):
        height = bar.get_height()
        ax2.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                f'{score:.3f}', ha='center', va='bottom', fontsize=9)
    
    # Add color legend
    from matplotlib.patches import Patch
    legend_elements = [
        Patch(facecolor='green', alpha=0.7, label='Good (>0.5)'),
        Patch(facecolor='orange', alpha=0.7, label='Fair (0.3-0.5)'),
        Patch(facecolor='red', alpha=0.7, label='Poor (<0.3)')
    ]
    ax2.legend(handles=legend_elements, loc='upper right')
    
    plt.tight_layout()
    plt.show()
    
    return K_range, inertias, silhouette_scores

# Execute the analysis
print("Analyzing cluster quality for k=2 to k=10...")
K_range, inertias, silhouette_scores = plot_elbow_method(X_scaled, max_k=10)

# Provide recommendations
print(f"\n=== RECOMMENDATIONS ===")
best_silhouette_k = K_range[np.argmax(silhouette_scores)]
best_silhouette_score = max(silhouette_scores)

print(f"📊 Best silhouette score: k={best_silhouette_k} (score: {best_silhouette_score:.3f})")

# Estimate the elbow point (simplified heuristic)
# The largest second difference marks where the inertia curve bends most sharply
if len(inertias) > 2:
    diffs = np.diff(inertias)  # First differences
    diff_diffs = np.diff(diffs)  # Second differences (change in slope)
    elbow_idx = int(np.argmax(diff_diffs)) + 1  # +1: second difference i is centered on point i+1
    elbow_k = K_range[elbow_idx]
    print(f"📈 Elbow method suggests: k={elbow_k}")

print(f"\n💡 Recommendation: Use k={best_silhouette_k} for best cluster separation")
print(f"   This yields {best_silhouette_k} passenger groups; check the silhouette score above before treating them as well separated")
```

---

## 7. Choose Optimal K and Perform Clustering

Now we'll implement the actual K-means clustering using our chosen optimal number of clusters.

**K-means Algorithm Steps:**
1. **Initialize**: Choose k starting centroids in feature space (scikit-learn seeds them with k-means++ by default rather than uniformly at random)
2. **Assign**: Assign each point to the nearest centroid
3. **Update**: Move centroids to the center of their assigned points
4. **Repeat**: Steps 2-3 until centroids stop moving (convergence)
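
The four steps above can be sketched directly in NumPy. This is a bare-bones illustration under simplifying assumptions (random data points as starting centroids, a hypothetical helper named `lloyd_kmeans`, synthetic data); it is not scikit-learn's implementation, which adds k-means++ seeding, multiple restarts, and optimized distance computations.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, tol=1e-4, seed=42):
    """Bare-bones K-means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign: index of the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its points
        #    (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids move less than `tol`
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Synthetic data: two blobs centered near (0, 0) and (5, 5)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = lloyd_kmeans(X_toy, k=2)
print(np.round(centroids, 2))
```

The result depends on the starting centroids, which is why the `n_init` parameter described next reruns the algorithm from several starts and keeps the best one.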

**Key parameters explained:**
- **n_clusters**: Number of clusters (k)
- **random_state**: Ensures reproducible results
- **n_init**: Number of different random initializations (algorithm picks the best)
- **max_iter**: Maximum iterations before stopping

```python
print("=== PERFORMING K-MEANS CLUSTERING ===\n")

# Determine optimal k from our previous analysis
optimal_k = K_range[np.argmax(silhouette_scores)]
print(f"Using optimal k = {optimal_k} (based on highest silhouette score)")

# Alternative: You can manually override if you prefer a different k
# optimal_k = 3  # Uncomment this line to manually set k

print(f"\nInitializing K-means algorithm with parameters:")
print(f"• n_clusters = {optimal_k} (number of clusters)")
print(f"• random_state = 42 (for reproducible results)")
print(f"• n_init = 10 (try 10 different random starts, pick best)")
print(f"• max_iter = 300 (maximum iterations before stopping)")

# Initialize and fit the K-means model
kmeans = KMeans(
    n_clusters=optimal_k,    # Our optimal number of clusters
    random_state=42,         # Seed for reproducible results
    n_init=10,              # Number of different centroid initializations
    max_iter=300,           # Maximum number of iterations
    tol=1e-4                # Tolerance for convergence (when to stop)
)

print(f"\nFitting K-means model...")
print("Algorithm process:")
print("1. Initialize cluster centroids (k-means++ seeding by default)")
print("2. Assign each passenger to nearest centroid")
print("3. Move centroids to center of assigned passengers")
print("4. Repeat steps 2-3 until centroids stabilize")

# Fit the model and get cluster predictions
cluster_labels = kmeans.fit_predict(X_scaled)

print(f"\n✅ Clustering completed!")
print(f"• Converged in {kmeans.n_iter_} iterations")
print(f"• Final inertia (within-cluster sum of squares): {kmeans.inertia_:.2f}")

# Add cluster labels to our original dataframe
processed_df['cluster'] = cluster_labels

# Calculate final clustering quality metrics
final_silhouette = silhouette_score(X_scaled, cluster_labels)
print(f"• Final silhouette score: {final_silhouette:.3f}")

# Interpret silhouette score
if final_silhouette > 0.7:
    quality = "Excellent clustering quality! 🏆"
elif final_silhouette > 0.5:
    quality = "Good clustering quality! 👍"
elif final_silhouette > 0.3:
    quality = "Fair clustering quality 👌"
else:
    quality = "Poor clustering quality - consider different approach ⚠️"

print(f"• Quality assessment: {quality}")

print(f"\nCluster centers (centroids) in standardized space:")
centroids_df = pd.DataFrame(
    kmeans.cluster_centers_,
    columns=feature_columns,
    index=[f'Cluster {i}' for i in range(optimal_k)]
)
print(centroids_df.round(3))

print(f"\nNote: These are standardized values (mean=0, std=1)")
print(f"Positive values = above average, Negative values = below average")
```

---

## 8. Analyze Cluster Distribution and Basic Statistics

Understanding the size and basic characteristics of each cluster helps us interpret what each group represents.

**Key metrics to analyze:**
- **Cluster sizes**: How many passengers in each group?
- **Feature averages**: What are the typical characteristics of each cluster?
- **Survival rates**: How did different clusters fare during the disaster?

```python
print("=== CLUSTER DISTRIBUTION ANALYSIS ===\n")

# === CLUSTER SIZES ===
print("1. CLUSTER SIZE DISTRIBUTION:")
cluster_distribution = processed_df['cluster'].value_counts().sort_index()
total_passengers = len(processed_df)

print("Cluster | Count | Percentage")
print("-" * 30)
for cluster_id, count in cluster_distribution.items():
    percentage = (count / total_passengers) * 100
    bar = "█" * int(percentage / 2)  # Visual bar representation
    print(f"{cluster_id:7d} | {count:5d} | {percentage:5.1f}% {bar}")

print(f"\nTotal passengers analyzed: {total_passengers}")

# Check for balanced clusters
max_cluster_size = cluster_distribution.max()
min_cluster_size = cluster_distribution.min()
balance_ratio = max_cluster_size / min_cluster_size

print(f"Cluster balance ratio: {balance_ratio:.2f}")
if balance_ratio < 3:
    print("✅ Well-balanced clusters (no cluster dominates)")
elif balance_ratio < 5:
    print("⚠️  Somewhat imbalanced clusters")
else:
    print("❌ Highly imbalanced clusters - consider different k")

# === DETAILED CLUSTER STATISTICS ===
print(f"\n2. DETAILED CLUSTER STATISTICS:")
print("(Average values for each feature by cluster)")

# Mean and standard deviation of each feature by cluster (kept for reference)
cluster_stats = processed_df.groupby('cluster')[feature_columns + ['survived']].agg(['mean', 'std']).round(3)

# Means only, which gives a compact table for printing
cluster_means = processed_df.groupby('cluster')[feature_columns + ['survived']].mean().round(3)

print("\nCluster feature averages:")
print(cluster_means)

# === CLUSTER INTERPRETATION HELPER ===
print(f"\n3. QUICK CLUSTER INTERPRETATION:")
print("(Comparing each cluster to overall averages)")

# Calculate overall averages for comparison
overall_means = processed_df[feature_columns + ['survived']].mean()

print(f"\nOverall dataset averages (for comparison):")
for feature, avg_val in overall_means.items():
    if feature in ['pclass', 'sex_encoded', 'is_alone']:
        print(f"• {feature}: {avg_val:.2f}")
    elif feature == 'age':
        print(f"• {feature}: {avg_val:.1f} years")  
    elif feature == 'fare':
        print(f"• {feature}: ${avg_val:.2f}")
    else:
        print(f"• {feature}: {avg_val:.2f}")

print(f"\nCluster deviations from average:")
for cluster_id in sorted(cluster_distribution.index):
    print(f"\n--- CLUSTER {cluster_id} ---")
    cluster_data = cluster_means.loc[cluster_id]
    
    notable_features = []
    
    # Compare each feature to overall average
    for feature in feature_columns:
        cluster_val = cluster_data[feature]
        overall_val = overall_means[feature]
        
        # Calculate relative difference
        if overall_val != 0:
            rel_diff = (cluster_val - overall_val) / overall_val * 100
            
            if abs(rel_diff) > 20:  # Only show significant differences
                direction = "higher" if rel_diff