In [None]:
# ============================================================================
# 🩺 BREAST CANCER PREDICTION: MALIGNANT VS BENIGN CLASSIFICATION 🧬
# ============================================================================
#
# PROJECT OVERVIEW:
# -----------------
# This comprehensive notebook demonstrates a complete machine learning workflow
# for predicting whether a breast tumor is malignant (cancerous) or benign
# (non-cancerous) based on physical characteristics extracted from biopsy images.
#
# DATASET: Breast Cancer Wisconsin (Diagnostic) Dataset
# - Source: UCI Machine Learning Repository
# - Samples: 569 breast tumor cases
# - Features: 30 numeric measurements of tumor characteristics
# - Target: Binary classification (Malignant vs Benign)
#
# WORKFLOW STAGES:
# 1. Data Loading & Exploration
# 2. Exploratory Data Analysis (EDA)
# 3. Data Preprocessing & Feature Engineering
# 4. Multiple ML Model Training
# 5. Model Evaluation & Comparison
# 6. Final Recommendations
#
# CLINICAL SIGNIFICANCE:
# Early and accurate detection of breast cancer is crucial for successful
# treatment. This machine learning approach can assist medical professionals
# in making more informed diagnostic decisions.
#
# ============================================================================

# ============================================================================
# CELL 1: IMPORT NECESSARY LIBRARIES
# ============================================================================
#
# SECTION PURPOSE:
# ----------------
# In this cell, we import all the essential Python libraries needed for our
# complete machine learning pipeline. Each library serves a specific purpose
# in our analysis workflow.
#
# LIBRARIES BREAKDOWN:
# --------------------
#
# 📊 DATA MANIPULATION:
# - numpy: Numerical computing, array operations, mathematical functions
# - pandas: Data manipulation, DataFrame operations, data analysis
#
# 📈 DATA VISUALIZATION:
# - matplotlib.pyplot: Creating static, animated, and interactive visualizations
# - seaborn: Statistical data visualization built on matplotlib
#
# 🤖 MACHINE LEARNING - CORE:
# - sklearn.datasets: Access to built-in datasets including breast cancer data
# - sklearn.model_selection: Tools for splitting data and cross-validation
# - sklearn.preprocessing: Data preprocessing tools like scaling and encoding
#
# 🎯 MACHINE LEARNING - ALGORITHMS:
# - LogisticRegression: Linear model for binary classification
# - DecisionTreeClassifier: Tree-based model with interpretable rules
# - RandomForestClassifier: Ensemble of decision trees for robust predictions
# - SVC: Support Vector Classifier for complex decision boundaries
# - KNeighborsClassifier: Instance-based learning using nearest neighbors
#
# 📏 MACHINE LEARNING - EVALUATION:
# - Various metrics: accuracy, precision, recall, F1-score for model assessment
# - confusion_matrix: Table of predicted vs. actual class counts
# - ROC curves & AUC: Evaluate model performance across different thresholds
#
# WHY EACH METRIC MATTERS:
# - Accuracy: Overall correctness of predictions
# - Precision: Of all positive predictions, how many were actually positive?
# - Recall: Of all actual positives, how many did we correctly identify?
# - F1-Score: Harmonic mean of precision and recall (balanced metric)
# - ROC-AUC: Model's ability to distinguish between classes
#
# ============================================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report,
                             roc_curve, roc_auc_score)
import warnings
warnings.filterwarnings('ignore')

# Set visualization style for better-looking plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("=" * 80)
print("✅ ALL LIBRARIES IMPORTED SUCCESSFULLY!")
print("=" * 80)
print("\n📦 Libraries loaded:")
print("   ✓ NumPy - Numerical computing")
print("   ✓ Pandas - Data manipulation")
print("   ✓ Matplotlib & Seaborn - Data visualization")
print("   ✓ Scikit-learn - Machine learning algorithms and tools")
print("\n🚀 Ready to begin breast cancer prediction analysis!")
print("=" * 80)
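
# ----------------------------------------------------------------------------
# WORKED EXAMPLE (illustrative sketch, not part of the modeling pipeline):
# The metric definitions in the Cell 1 header can be checked by hand from the
# four cells of a confusion matrix. The counts used here are invented for
# illustration and are NOT results from this dataset; the helper name
# `metrics_from_counts` is hypothetical.
# ----------------------------------------------------------------------------

def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

# Invented counts: 40 true positives, 5 false positives,
# 10 false negatives, 45 true negatives
example_metrics = metrics_from_counts(tp=40, fp=5, fn=10, tn=45)
for metric_name, metric_value in example_metrics.items():
    print(f"   {metric_name:9s}: {metric_value:.3f}")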

# ============================================================================
# CELL 2: LOAD AND PREPARE THE BREAST CANCER DATASET
# ============================================================================
#
# SECTION PURPOSE:
# ----------------
# This cell loads the Breast Cancer Wisconsin dataset from scikit-learn's
# built-in datasets and prepares it for analysis by converting it into a
# pandas DataFrame with proper column names and labels.
#
# ABOUT THE DATASET:
# ------------------
# The Breast Cancer Wisconsin (Diagnostic) Dataset contains features computed
# from digitized images of fine needle aspirate (FNA) of breast masses. These
# features describe characteristics of cell nuclei present in the images.
#
# DATA COLLECTION METHOD:
# 1. A fine needle aspirate (FNA) is taken from a breast mass
# 2. The sample is digitized and processed
# 3. Computer vision algorithms extract features from cell nuclei
# 4. Medical experts provide the diagnosis (malignant or benign)
#
# DATASET STRUCTURE:
# - 569 instances (patient cases)
# - 30 real-valued features (measurements)
# - 2 classes: Malignant (0) and Benign (1)
# - No missing values (complete dataset)
#
# TARGET VARIABLE ENCODING:
# - 0 = Malignant (M) → Cancer is present, requires treatment
# - 1 = Benign (B) → No cancer, but monitoring may be needed
#
# WHY THIS MATTERS:
# In medical diagnosis, correctly identifying malignant tumors (high recall)
# is critical to ensure patients receive timely treatment. However, we also
# want to avoid false positives (high precision) to prevent unnecessary
# stress and medical procedures.
#
# ============================================================================

print("\n" + "=" * 80)
print("📊 LOADING BREAST CANCER WISCONSIN DATASET")
print("=" * 80)

# Load the dataset from scikit-learn's built-in datasets
data = load_breast_cancer()

# Create a pandas DataFrame for easier manipulation and analysis
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the target variable (diagnosis) to our DataFrame
df['diagnosis'] = data.target

# Create human-readable labels for better interpretation
# Map 0 → 'Malignant' (cancerous), 1 → 'Benign' (non-cancerous)
df['diagnosis_label'] = df['diagnosis'].map({0: 'Malignant', 1: 'Benign'})

print("\n✅ Dataset loaded successfully!")
print("\n" + "-" * 80)
print("📋 DATASET OVERVIEW:")
print("-" * 80)
print(f"   • Total number of samples: {len(df)}")
print(f"   • Number of features: {len(data.feature_names)}")
print(f"   • Dataset dimensions: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"   • Number of malignant cases: {(df['diagnosis'] == 0).sum()}")
print(f"   • Number of benign cases: {(df['diagnosis'] == 1).sum()}")
print(f"   • Class balance ratio: {(df['diagnosis'] == 1).sum() / len(df) * 100:.1f}% benign")

print("\n" + "-" * 80)
print("📝 DATASET DESCRIPTION:")
print("-" * 80)
print(data.DESCR[:500] + "...")  # Print first 500 characters of description

print("\n" + "=" * 80)
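
# ----------------------------------------------------------------------------
# NOTE ON THE 0/1 ENCODING (illustrative sketch with made-up labels):
# Because malignant is encoded as 0, "recall for cancer cases" means recall
# with label 0 treated as the positive class. scikit-learn's recall_score
# treats label 1 as positive by default (its pos_label parameter), which in
# this encoding is the benign class. The pure-Python helper below
# (hypothetical name `recall_for_label`) shows how far the two views can
# diverge on the same predictions.
# ----------------------------------------------------------------------------

def recall_for_label(y_true, y_pred, positive_label):
    """Fraction of samples with `positive_label` that were predicted as such."""
    actual = sum(1 for t in y_true if t == positive_label)
    hits = sum(1 for t, p in zip(y_true, y_pred)
               if t == positive_label and p == positive_label)
    return hits / actual if actual else 0.0

toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # 4 malignant, 6 benign (invented)
toy_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # two malignant cases are missed

toy_recall_malignant = recall_for_label(toy_true, toy_pred, positive_label=0)
toy_recall_benign = recall_for_label(toy_true, toy_pred, positive_label=1)
print(f"   Recall with malignant (0) as positive: {toy_recall_malignant:.2f}")
print(f"   Recall with benign (1) as positive:    {toy_recall_benign:.2f}")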

# ============================================================================
# CELL 3: INITIAL DATA EXPLORATION - VIEWING THE DATA
# ============================================================================
#
# SECTION PURPOSE:
# ----------------
# This cell performs initial exploration of our dataset to understand its
# structure, data types, and basic characteristics. This is a critical first
# step in any data science project.
#
# WHY DATA EXPLORATION MATTERS:
# -----------------------------
# Before building any machine learning model, we need to:
# 1. Understand what data we're working with
# 2. Identify data types and potential issues
# 3. Get familiar with feature names and values
# 4. Check for any obvious problems or anomalies
#
# WHAT WE'RE EXAMINING:
# ---------------------
# • First few rows: Get a feel for the actual data values
# • Data types: Ensure all features are numeric (required for ML)
# • Memory usage: Understand dataset size
# • Feature names: Familiarize ourselves with what we're measuring
#
# THINGS TO LOOK FOR:
# -------------------
# ✓ Are all features numeric? (Yes, required for our ML models)
# ✓ Do values seem reasonable? (No obvious errors)
# ✓ Are there any unexpected patterns?
# ✓ Do feature names make clinical sense?
#
# UNDERSTANDING THE OUTPUT:
# -------------------------
# - .head(): Shows first 5 rows of data
# - .info(): Provides data types, non-null counts, memory usage
# - .describe(): Statistical summary (mean, std, min, max, quartiles)
#
# ============================================================================

print("\n" + "=" * 80)
print("🔍 INITIAL DATA EXPLORATION")
print("=" * 80)

print("\n" + "-" * 80)
print("📋 FIRST 5 ROWS OF THE DATASET")
print("-" * 80)
print("\nThis gives us a glimpse of the actual data values:")
print(df.head())

print("\n" + "-" * 80)
print("📊 DATASET INFORMATION (DATA TYPES & STRUCTURE)")
print("-" * 80)
print("\nDetailed information about each column:")
df.info()  # info() prints its report directly and returns None

print("\n" + "-" * 80)
print("📈 STATISTICAL SUMMARY OF ALL FEATURES")
print("-" * 80)
print("\nDescriptive statistics for each numeric feature:")
print(df.describe().round(2))

print("\n" + "-" * 80)
print("📝 INTERPRETATION GUIDE:")
print("-" * 80)
print("""
• count: Number of non-null values (should be 569 for all)
• mean: Average value - center of the distribution
• std: Standard deviation - measure of spread/variability
• min: Minimum value observed
• 25%: First quartile (25% of data is below this value)
• 50%: Median (middle value when sorted)
• 75%: Third quartile (75% of data is below this value)
• max: Maximum value observed

KEY OBSERVATIONS:
✓ All features have 569 non-null values → No missing data!
✓ All 30 measurement features are numeric (float64) → Ready for machine learning
✓ Different features have vastly different scales → Will need scaling
✓ Some features show high variability (large std) → Normal for medical data
""")

print("=" * 80)

# ============================================================================
# CELL 4: DATA QUALITY CHECK - MISSING VALUES ANALYSIS
# ============================================================================
#
# SECTION PURPOSE:
# ----------------
# This cell thoroughly checks for missing values in our dataset. Missing data
# is one of the most common problems in real-world datasets and can significantly
# impact model performance if not handled properly.
#
# WHY MISSING VALUES MATTER:
# --------------------------
# Missing values can occur due to:
# • Data collection errors
# • Equipment malfunction
# • Human error during data entry
# • Privacy concerns (data intentionally omitted)
# • Technical issues during data transfer
#
# IMPACT ON MACHINE LEARNING:
# ----------------------------
# Most ML algorithms cannot handle missing values and will either:
# 1. Throw an error and refuse to run
# 2. Produce incorrect results
# 3. Automatically drop rows/columns with missing values
#
# COMMON STRATEGIES FOR HANDLING MISSING DATA:
# ---------------------------------------------
# If we find missing values, we can:
# 1. DELETE: Remove rows or columns with missing values
#    - Use when: Very few missing values (<5% of data)
#    - Pros: Simple, no assumptions made
#    - Cons: Lose potentially valuable data
#
# 2. IMPUTE - MEAN/MEDIAN/MODE:
#    - Use when: Data is missing at random
#    - Pros: Retains all samples
#    - Cons: Can distort distributions
#
# 3. IMPUTE - ADVANCED METHODS:
#    - Use algorithms like KNN or regression to predict missing values
#    - Pros: More accurate than simple imputation
#    - Cons: More complex, computationally expensive
#
# 4. CREATE INDICATOR VARIABLES:
#    - Add binary column indicating if value was missing
#    - Use when: Missingness itself is informative
#
# GOOD NEWS FOR OUR DATASET:
# ---------------------------
# The Wisconsin Breast Cancer dataset is a well-curated research dataset
# with NO missing values! This is rare in real-world scenarios but makes
# our analysis cleaner and more straightforward.
#
# ============================================================================

print("\n" + "=" * 80)
print("🔍 DATA QUALITY CHECK: MISSING VALUES ANALYSIS")
print("=" * 80)

# Check for missing values in each column
missing_values = df.isnull().sum()
total_cells = df.size  # rows × columns (np.product was removed in NumPy 2.0)
total_missing = missing_values.sum()

print("\n" + "-" * 80)
print("📊 MISSING VALUES SUMMARY:")
print("-" * 80)
print(f"\n   • Total cells in dataset: {total_cells:,}")
print(f"   • Total missing values: {total_missing}")
print(f"   • Percentage of missing data: {(total_missing / total_cells) * 100:.2f}%")

if total_missing == 0:
    print("\n   ✅ EXCELLENT! No missing values found in any column!")
    print("   ✅ Dataset is complete and ready for machine learning!")
    print("\n   This is a high-quality dataset - no imputation needed!")
else:
    print("\n   ⚠️ WARNING: Missing values detected!")
    print("\n   Columns with missing values:")
    print("-" * 80)
    missing_df = pd.DataFrame({
        'Column': missing_values[missing_values > 0].index,
        'Missing Count': missing_values[missing_values > 0].values,
        'Percentage': (missing_values[missing_values > 0].values / len(df) * 100).round(2)
    })
    print(missing_df.to_string(index=False))

    print("\n   📝 RECOMMENDED ACTIONS:")
    print("   " + "-" * 76)
    for col, missing_pct in zip(missing_df['Column'], missing_df['Percentage']):
        if missing_pct < 5:
            print(f"   • {col}: {missing_pct}% missing → Consider removing rows")
        elif missing_pct < 30:
            print(f"   • {col}: {missing_pct}% missing → Consider imputation")
        else:
            print(f"   • {col}: {missing_pct}% missing → Consider removing column")

print("\n" + "=" * 80)
print("💡 DATA QUALITY ASSESSMENT:")
print("=" * 80)
print("""
✓ All 569 samples are complete
✓ All 30 features have valid measurements
✓ No data cleaning required for missing values
✓ Ready to proceed with exploratory data analysis

This clean dataset allows us to focus on feature engineering and
model building without worrying about data imputation strategies!
""")
print("=" * 80)
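
# ----------------------------------------------------------------------------
# ILLUSTRATIVE SKETCH (not needed for this dataset, which has no gaps):
# Strategies 2 and 4 from the Cell 4 header, median imputation plus a
# "was missing" indicator, can be combined as shown here. The toy values are
# invented and the helper name `impute_median_with_indicator` is
# hypothetical. In a real pipeline the median should be learned from the
# training split only.
# ----------------------------------------------------------------------------
from statistics import median

def impute_median_with_indicator(values):
    """Replace None with the median of observed values; also flag the gaps."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    imputed = [fill_value if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]
    return imputed, was_missing

toy_radius = [12.5, None, 14.0, 20.1, None, 13.2]  # invented measurements
toy_imputed, toy_was_missing = impute_median_with_indicator(toy_radius)
print(f"   Imputed values:    {toy_imputed}")
print(f"   Missing indicator: {toy_was_missing}")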

# ============================================================================
# CELL 5: TARGET VARIABLE DISTRIBUTION ANALYSIS
# ============================================================================
#
# SECTION PURPOSE:
# ----------------
# This cell analyzes the distribution of our target variable (diagnosis).
# Understanding class distribution is crucial because it affects:
# 1. Model training and performance
# 2. Evaluation metric selection
# 3. Potential need for balancing techniques
#
# WHY CLASS DISTRIBUTION MATTERS:
# -------------------------------
# BALANCED DATASET (50:50 ratio):
# • Models learn both classes equally well
# • Standard metrics (accuracy) work well
# • No special techniques needed
#
# IMBALANCED DATASET (e.g., 90:10 ratio):
# • Models may become biased toward majority class
# • Accuracy can be misleading (e.g., 90% by always predicting majority)
# • May need: oversampling, undersampling, SMOTE, or class weights
# • Should focus on: precision, recall, F1-score, not just accuracy
#
# CLASS IMBALANCE IN MEDICAL DIAGNOSIS:
# --------------------------------------
# In medical datasets, imbalance is common because:
# • Diseases are often rare in the general population
# • More benign cases than malignant in screening programs
# • Cost of false negatives (missing cancer) is very high
#
# WHAT TO LOOK FOR:
# -----------------
# • Ratio between classes (is one significantly larger?)
# • Absolute numbers (do we have enough samples of minority class?)
# • Consider if imbalance reflects real-world prevalence
#
# EVALUATION METRIC IMPLICATIONS:
# --------------------------------
# • Balanced dataset → Accuracy is fine
# • Imbalanced dataset → Focus on precision/recall/F1-score
# • Medical context → Prioritize RECALL (don't miss cancer cases)
#
# VISUALIZATIONS INCLUDED:
# ------------------------
# 1. Count plot: Shows absolute numbers of each class
# 2. Pie chart: Shows proportional distribution
#
# ============================================================================

print("\n" + "=" * 80)
print("🎯 TARGET VARIABLE DISTRIBUTION ANALYSIS")
print("=" * 80)

# Calculate class distribution
diagnosis_counts = df['diagnosis_label'].value_counts()
diagnosis_percentages = df['diagnosis_label'].value_counts(normalize=True) * 100

print("\n" + "-" * 80)
print("📊 CLASS DISTRIBUTION (ABSOLUTE COUNTS):")
print("-" * 80)
for label, count in diagnosis_counts.items():
    print(f"   • {label:12s}: {count:3d} cases")

print("\n" + "-" * 80)
print("📊 CLASS DISTRIBUTION (PERCENTAGES):")
print("-" * 80)
for label, pct in diagnosis_percentages.items():
    print(f"   • {label:12s}: {pct:5.2f}%")

# Calculate imbalance ratio
majority_count = diagnosis_counts.values[0]
minority_count = diagnosis_counts.values[1]
imbalance_ratio = majority_count / minority_count

print("\n" + "-" * 80)
print("⚖️ CLASS BALANCE ANALYSIS:")
print("-" * 80)
print(f"   • Majority class: {diagnosis_counts.index[0]} ({majority_count} samples)")
print(f"   • Minority class: {diagnosis_counts.index[1]} ({minority_count} samples)")
print(f"   • Imbalance ratio: {imbalance_ratio:.2f}:1")

if imbalance_ratio < 1.5:
    balance_status = "✅ WELL BALANCED"
    recommendation = "Standard ML algorithms will work well without modifications."
elif imbalance_ratio < 3:
    balance_status = "⚠️ SLIGHT IMBALANCE"
    recommendation = "Consider monitoring precision and recall separately. May use class weights."
else:
    balance_status = "❌ SIGNIFICANT IMBALANCE"
    recommendation = "Should use: class weights, SMOTE, or focus on F1-score/AUC metrics."

print(f"\n   Status: {balance_status}")
print(f"   Recommendation: {recommendation}")

# Visualization
print("\n" + "-" * 80)
print("📊 GENERATING VISUALIZATIONS...")
print("-" * 80)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Count plot with detailed annotations
bars = axes[0].bar(range(len(diagnosis_counts)), diagnosis_counts.values,
                   color=['#FF6B6B', '#4ECDC4'], alpha=0.8, edgecolor='black', linewidth=2)
axes[0].set_xticks(range(len(diagnosis_counts)))
axes[0].set_xticklabels(diagnosis_counts.index, fontsize=12, fontweight='bold')
axes[0].set_ylabel('Number of Cases', fontsize=13, fontweight='bold')
axes[0].set_title('Distribution of Breast Cancer Diagnosis\n(Absolute Counts)',
                  fontsize=14, fontweight='bold', pad=20)
axes[0].grid(axis='y', alpha=0.3, linestyle='--')

# Add value labels on bars
for i, (bar, count, pct) in enumerate(zip(bars, diagnosis_counts.values, diagnosis_percentages.values)):
    height = bar.get_height()
    axes[0].text(bar.get_x() + bar.get_width()/2., height + 5,
                f'{count}\n({pct:.1f}%)',
                ha='center', va='bottom', fontsize=12, fontweight='bold')

# Pie chart with enhanced styling
colors = ['#FF6B6B', '#4ECDC4']
explode = (0.05, 0.05)  # Slightly separate both slices
wedges, texts, autotexts = axes[1].pie(diagnosis_counts.values,
                                        labels=diagnosis_counts.index,
                                        autopct='%1.1f%%',
                                        colors=colors,
                                        explode=explode,
                                        startangle=90,
                                        textprops={'fontsize': 12, 'fontweight': 'bold'},
                                        shadow=True)
axes[1].set_title('Proportion of Malignant vs Benign Cases\n(Percentage Distribution)',
                  fontsize=14, fontweight='bold', pad=20)

# Enhance autotext
for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_fontsize(13)
    autotext.set_weight('bold')

plt.tight_layout()
plt.show()

print("\n" + "=" * 80)
print("💡 KEY INSIGHTS FROM CLASS DISTRIBUTION:")
print("=" * 80)
print(f"""
1. Dataset Balance: The dataset has a {imbalance_ratio:.2f}:1 ratio of
   {diagnosis_counts.index[0]} to {diagnosis_counts.index[1]} cases.

2. Clinical Relevance: Benign findings outnumber malignant ones here, as
   they do in practice. The samples come from biopsied breast masses,
   however, so the malignant share is much higher than it would be in
   general population screening.

3. Modeling Implications:
   • The imbalance is mild ({diagnosis_percentages.values[1]:.1f}% minority class)
   • Accuracy is usable, but we should also monitor:
     - Recall (sensitivity): To minimize false negatives
     - Precision: To minimize false positives
     - F1-Score: Balanced measure of both

4. Medical Context: In cancer detection, missing a malignant case
   (false negative) is typically considered worse than a false positive,
   so we'll pay special attention to RECALL scores.
""")
print("=" * 80)
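
# ----------------------------------------------------------------------------
# ILLUSTRATIVE SKETCH: how "class weights" compensate for imbalance.
# The formula below, n_samples / (n_classes * class_count), is the one
# scikit-learn documents for class_weight='balanced'. The counts are the
# 357 benign / 212 malignant split of this dataset; the helper name
# `balanced_class_weights` is hypothetical and nothing here changes the
# models trained later.
# ----------------------------------------------------------------------------

def balanced_class_weights(class_counts):
    """Weight each class inversely to its frequency."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

toy_class_counts = {'Benign': 357, 'Malignant': 212}
toy_class_weights = balanced_class_weights(toy_class_counts)
for class_name, class_weight in toy_class_weights.items():
    print(f"   {class_name:10s}: weight {class_weight:.3f}")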

# ============================================================================
# CELL 6: FEATURE ANALYSIS - UNDERSTANDING THE MEASUREMENTS
# ============================================================================
#
# SECTION PURPOSE:
# ----------------
# This cell provides a comprehensive analysis of the 30 features in our dataset.
# Each feature represents a different measurement of tumor characteristics
# derived from cell nuclei in biopsy images.
#
# FEATURE ORGANIZATION:
# ---------------------
# The 30 features are organized into 3 groups of 10 measurements each:
#
# 1. MEAN VALUES (10 features):
#    - Average of measurements across all cell nuclei in the image
#    - Naming: prefix "mean " (scikit-learn column names)
#    - Example: "mean radius", "mean texture"
#
# 2. STANDARD ERROR (10 features):
#    - Standard error of measurements (variability measure)
#    - Naming: suffix " error"
#    - Example: "radius error", "texture error"
#
# 3. WORST VALUES (10 features):
#    - Mean of the three largest values
#    - Naming: prefix "worst "
#    - Example: "worst radius", "worst texture"
#
# THE 10 CORE MEASUREMENTS:
# --------------------------
# Each group contains these 10 measurements:
#
# 1. RADIUS:
#    - Mean distance from center to points on perimeter
#    - Larger radius → Larger nucleus
#    - Medical significance: Size is a key diagnostic indicator
#
# 2. TEXTURE:
#    - Standard deviation of gray-scale values
#    - Higher texture → More irregular cell appearance
#    - Medical significance: Malignant cells often more irregular
#
# 3. PERIMETER:
#    - Total boundary length of the nucleus
#    - Related to radius but captures shape complexity
#    - Medical significance: Irregular perimeters suggest malignancy
#
# 4. AREA:
#    - Total area enclosed by the nucleus perimeter
#    - Directly related to nucleus size
#    - Medical significance: Larger nuclei more concerning
#
# 5. SMOOTHNESS:
#    - Local variation in radius lengths
#    - Smoother → More uniform, regular shape
#    - Medical significance: Benign nuclei typically smoother
#
# 6. COMPACTNESS:
#    - Perimeter² / Area - 1.0
#    - Measures how compact vs. spread out the nucleus is
#    - Medical significance: Malignant cells less compact
#
# 7. CONCAVITY:
#    - Severity of concave portions of contour
#    - Higher values → More indentations in the nucleus boundary
#    - Medical significance: Malignant nuclei more irregular
#
# 8. CONCAVE POINTS:
#    - Number of concave portions of contour
#    - Counts distinct indentations
#    - Medical significance: Strong indicator of malignancy
#
# 9. SYMMETRY:
#    - Measures symmetry of the nucleus
#    - Higher values → More asymmetric
#    - Medical significance: Malignant nuclei less symmetric
#
# 10. FRACTAL DIMENSION:
#     - "Coastline approximation" - 1
#     - Measures complexity of the boundary
#     - Medical significance: Complex boundaries suggest malignancy
#
# WHY THREE VERSIONS OF EACH MEASUREMENT?
# ----------------------------------------
# • MEAN: Overall average characteristic
# • ERROR: Variability within the sample (uncertainty measure)
# • WORST: Most severe measurements (often most diagnostic)
#
# CLINICAL INTERPRETATION:
# ------------------------
# Medical professionals look for:
# ✓ Large radius/perimeter/area → Concerning
# ✓ High texture/concavity → Irregular cells → Concerning
# ✓ Low smoothness/symmetry → Irregular shape → Concerning
# ✓ High "worst" values → Most concerning areas → Diagnostic
#
# ============================================================================

print("\n" + "=" * 80)
print("🔬 COMPREHENSIVE FEATURE ANALYSIS")
print("=" * 80)

# Categorize features by type. scikit-learn names the columns
# "mean radius", "radius error", "worst radius", and so on, so match on
# the exact prefix or suffix rather than on loose substrings.
mean_features = [col for col in df.columns if col.startswith('mean ')]
se_features = [col for col in df.columns if col.endswith(' error')]
worst_features = [col for col in df.columns if col.startswith('worst ')]

print("\n" + "-" * 80)
print("📋 FEATURE ORGANIZATION:")
print("-" * 80)
print(f"\n1. MEAN FEATURES ({len(mean_features)} features):")
print("   These represent the average measurements across all cells:")
for i, feature in enumerate(mean_features, 1):
    print(f"   {i:2d}. {feature}")

print(f"\n2. STANDARD ERROR FEATURES ({len(se_features)} features):")
print("   These represent the variability/uncertainty in measurements:")
for i, feature in enumerate(se_features, 1):
    print(f"   {i:2d}. {feature}")

print(f"\n3. WORST FEATURES ({len(worst_features)} features):")
print("   These represent the most extreme measurements:")
for i, feature in enumerate(worst_features, 1):
    print(f"   {i:2d}. {feature}")

print("\n" + "-" * 80)
print("📊 FEATURE VALUE RANGES:")
print("-" * 80)
print("\nUnderstanding the scale of measurements:")
print(df[mean_features].describe().loc[['min', 'max']].round(2))

print("\n" + "-" * 80)
print("💡 KEY OBSERVATIONS:")
print("-" * 80)
print("""
1. SCALE DIFFERENCES:
   • Features have vastly different scales
   • Example: 'mean area' ranges from ~143 to ~2501
   • Example: 'mean smoothness' ranges from ~0.05 to ~0.16
   • Implication: Scale features before fitting distance- or
     gradient-based models (tree-based models are unaffected by scale)

2. PHYSICAL MEANING:
   • Size features (radius, perimeter, area) are correlated
   • Shape features (smoothness, symmetry) describe regularity
   • Texture features describe cell appearance variability

3. DIAGNOSTIC RELEVANCE:
   • "Worst" features often most important for diagnosis
   • They capture the most abnormal regions of the tumor
   • Medical professionals focus on worst-case characteristics
""")
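
# ----------------------------------------------------------------------------
# ILLUSTRATIVE SKETCH: the 3 x 10 naming scheme described in the Cell 6
# header. The column names are rebuilt here from the ten core measurements
# so the example runs on its own; the `demo_` names are hypothetical and do
# not touch the DataFrame used above.
# ----------------------------------------------------------------------------
demo_measurements = ['radius', 'texture', 'perimeter', 'area', 'smoothness',
                     'compactness', 'concavity', 'concave points',
                     'symmetry', 'fractal dimension']

demo_columns = ([f'mean {m}' for m in demo_measurements]
                + [f'{m} error' for m in demo_measurements]
                + [f'worst {m}' for m in demo_measurements]
                + ['diagnosis', 'diagnosis_label'])

# Group by exact prefix/suffix so the two target columns are never picked up
demo_groups = {
    'mean': [c for c in demo_columns if c.startswith('mean ')],
    'error': [c for c in demo_columns if c.endswith(' error')],
    'worst': [c for c in demo_columns if c.startswith('worst ')],
}
for group_name, group_columns in demo_groups.items():
    print(f"   {group_name:5s}: {len(group_columns)} features")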

print("\n" + "=" * 80)
print("📊 GENERATING FEATURE DISTRIBUTION VISUALIZATIONS...")
print("=" * 80)

# Visualization of mean features distribution by diagnosis
fig, axes = plt.subplots(2, 5, figsize=(22, 10))
axes = axes.ravel()

for idx, feature in enumerate(mean_features):
    # Separate data by diagnosis
    malignant_data = df[df['diagnosis'] == 0][feature]
    benign_data = df[df['diagnosis'] == 1][feature]

    # Create overlapping histograms
    axes[idx].hist(malignant_data, bins=25, alpha=0.6, label='Malignant',
                   color='#FF6B6B', edgecolor='black', linewidth=0.5)
    axes[idx].hist(benign_data, bins=25, alpha=0.6, label='Benign',
                   color='#4ECDC4', edgecolor='black', linewidth=0.5)

    # Add dashed vertical lines at each class mean
    mal_mean = malignant_data.mean()
    ben_mean = benign_data.mean()
    axes[idx].axvline(mal_mean, color='#FF6B6B', linestyle='--',
                      linewidth=2, alpha=0.7, label=f'M μ={mal_mean:.3g}')
    axes[idx].axvline(ben_mean, color='#4ECDC4', linestyle='--',
                      linewidth=2, alpha=0.7, label=f'B μ={ben_mean:.3g}')

    # Styling (the legend is built last so it includes the mean-line labels)
    axes[idx].set_title(feature.replace('mean ', '').title(),
                        fontsize=11, fontweight='bold')
    axes[idx].set_xlabel('Value', fontsize=9)
    axes[idx].set_ylabel('Frequency', fontsize=9)
    axes[idx].legend(fontsize=8, loc='upper right')
    axes[idx].grid(alpha=0.3, linestyle='--')

plt.suptitle('Distribution of Mean Features by Diagnosis\n' +
             'Red = Malignant | Teal = Benign | Dashed lines = Mean values',
             fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

print

In [None]:
print("\n" + "="*80)
print("FEATURE SELECTION FOR MODELING")
print("="*80)

# Get all encoded features and numerical features
encoded_features = [col for col in df_encoded.columns if col.endswith('_encoded')]
original_numerical = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak', 'FastingBS']
engineered_numerical = ['HR_Achievement_Pct', 'Simple_Risk_Score']

# Combine features
feature_cols = [col for col in original_numerical if col in df_encoded.columns]
feature_cols.extend([col for col in encoded_features if col in df_encoded.columns])
feature_cols.extend([col for col in engineered_numerical if col in df_encoded.columns])

# Remove target if accidentally included
if target_col in feature_cols:
    feature_cols.remove(target_col)

print(f"Selected {len(feature_cols)} features for modeling:")
for i, col in enumerate(feature_cols, 1):
    print(f"  {i:2d}. {col}")

# Create feature matrix and target vector
X = df_encoded[feature_cols].fillna(0)  # Fill any remaining NaN with 0
y = df_encoded[target_col]

print(f"\n✓ Feature matrix (X): {X.shape[0]} patients × {X.shape[1]} features")
print(f"✓ Target vector (y): {y.shape[0]} patients")
print(f"✓ Class distribution:")
print(f"   - Class 0 (No Disease): {(y == 0).sum()} ({(y == 0).sum()/len(y)*100:.1f}%)")
print(f"   - Class 1 (Disease):    {(y == 1).sum()} ({(y == 1).sum()/len(y)*100:.1f}%)")

# ### 5.2 Feature Importance Analysis (Preliminary)

"""
Use Random Forest to identify most important features before modeling
"""

print("\n" + "="*80)
print("PRELIMINARY FEATURE IMPORTANCE ANALYSIS")
print("="*80)

from sklearn.ensemble import RandomForestClassifier

# Train Random Forest for feature importance
rf_temp = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_temp.fit(X, y)

# Get feature importances
feature_importance_df = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': rf_temp.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nTop 15 Most Important Features:")
print(feature_importance_df.head(15).to_string(index=False))

# Visualize
plt.figure(figsize=(12, 8))
top_15 = feature_importance_df.head(15)
plt.barh(range(len(top_15)), top_15['Importance'], color='steelblue',
        edgecolor='black', alpha=0.8)
plt.yticks(range(len(top_15)), top_15['Feature'])
plt.xlabel('Importance Score')
plt.title('Top 15 Most Important Features (Random Forest)', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

# ---
# ## 6. Data Splitting and Scaling
# ### 6.1 Train-Test Split

"""
Split data into training and testing sets with stratification
to maintain class distribution
"""

print("\n" + "="*80)
print("TRAIN-TEST SPLIT")
print("="*80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"✓ Training set: {len(X_train)} patients ({len(X_train)/len(X)*100:.1f}%)")
print(f"✓ Test set: {len(X_test)} patients ({len(X_test)/len(X)*100:.1f}%)")
print(f"✓ Features: {X_train.shape[1]}")

print(f"\nClass distribution in training set:")
print(f"   - Class 0: {(y_train == 0).sum()} ({(y_train == 0).sum()/len(y_train)*100:.1f}%)")
print(f"   - Class 1: {(y_train == 1).sum()} ({(y_train == 1).sum()/len(y_train)*100:.1f}%)")

print(f"\nClass distribution in test set:")
print(f"   - Class 0: {(y_test == 0).sum()} ({(y_test == 0).sum()/len(y_test)*100:.1f}%)")
print(f"   - Class 1: {(y_test == 1).sum()} ({(y_test == 1).sum()/len(y_test)*100:.1f}%)")
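
"""
Illustrative sketch: how stratification preserves the class ratio.
This is a minimal standard-library version of the idea, not scikit-learn's
implementation. The helper name `stratified_split_demo` and the toy labels
are assumptions made for illustration only.
"""

import random
from collections import defaultdict


def stratified_split_demo(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


_strat_demo_labels = [0] * 40 + [1] * 60   # toy data: 40% class 0, 60% class 1
_strat_demo_train, _strat_demo_test = stratified_split_demo(_strat_demo_labels)
_strat_demo_share = sum(_strat_demo_labels[i] for i in _strat_demo_test) / len(_strat_demo_test)
print(f"Demo split: {len(_strat_demo_train)} train / {len(_strat_demo_test)} test, "
      f"class 1 share in test = {_strat_demo_share:.2f}")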

# ### 6.2 Feature Scaling

"""
Standardize features to have zero mean and unit variance.
Critical for distance-based algorithms like SVM and KNN.
"""

print("\n" + "="*80)
print("FEATURE SCALING")
print("="*80)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✓ Features scaled using StandardScaler")
print(f"  Training set: Mean ≈ 0, Std ≈ 1 for all features")

# Convert back to DataFrame for easier handling
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=feature_cols, index=X_train.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=feature_cols, index=X_test.index)
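
"""
Illustrative sketch: what standardization does to a single feature.
Each value becomes z = (x - mean) / std, where mean and std come from the
training data only and are then reused for the test data. The toy values and
the `_scale_demo_*` names are assumptions made for illustration only.
"""

import statistics

_scale_demo_train = [120.0, 130.0, 140.0, 150.0, 160.0]
_scale_demo_test = [140.0, 170.0]

# Statistics are learned from the training values only
_scale_demo_mean = statistics.fmean(_scale_demo_train)
_scale_demo_std = statistics.pstdev(_scale_demo_train)  # population std (ddof=0), as StandardScaler uses

_scale_demo_train_z = [(v - _scale_demo_mean) / _scale_demo_std for v in _scale_demo_train]
# The test values are transformed with the training mean and std
_scale_demo_test_z = [(v - _scale_demo_mean) / _scale_demo_std for v in _scale_demo_test]

print(f"Demo scaled train values: {[round(z, 3) for z in _scale_demo_train_z]}")
print(f"Demo scaled test values:  {[round(z, 3) for z in _scale_demo_test_z]}")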

# ---
# ## 7. Model Training and Evaluation
# ### 7.1 Model 1: Logistic Regression

"""
Baseline Model: Logistic Regression
- Simple, interpretable, fast
- Good for linear relationships
- Provides probability estimates
"""

print("\n" + "="*80)
print("MODEL 1: LOGISTIC REGRESSION")
print("="*80)

lr_model = LogisticRegression(random_state=42, max_iter=1000, solver='lbfgs')
lr_model.fit(X_train_scaled_df, y_train)

y_pred_lr = lr_model.predict(X_test_scaled_df)
y_pred_proba_lr = lr_model.predict_proba(X_test_scaled_df)[:, 1]

# Evaluate
lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr)
lr_recall = recall_score(y_test, y_pred_lr)
lr_f1 = f1_score(y_test, y_pred_lr)
lr_auc = roc_auc_score(y_test, y_pred_proba_lr)

print(f"\nPerformance Metrics:")
print(f"  • Accuracy:  {lr_accuracy:.4f} ({lr_accuracy*100:.2f}%)")
print(f"  • Precision: {lr_precision:.4f} (of predicted positives, {lr_precision*100:.1f}% are correct)")
print(f"  • Recall:    {lr_recall:.4f} (detected {lr_recall*100:.1f}% of actual disease cases)")
print(f"  • F1-Score:  {lr_f1:.4f} (harmonic mean of precision and recall)")
print(f"  • ROC-AUC:   {lr_auc:.4f} (area under ROC curve)")
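
"""
Illustrative sketch: how the reported metrics follow from confusion-matrix counts.
The helper `metrics_from_counts_demo` and the counts passed to it are
hypothetical; they are not results from this dataset.
"""


def metrics_from_counts_demo(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from raw counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}


_metrics_demo = metrics_from_counts_demo(tp=90, fp=10, fn=12, tn=72)
for _metrics_demo_name, _metrics_demo_value in _metrics_demo.items():
    print(f"Demo {_metrics_demo_name}: {_metrics_demo_value:.4f}")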

# ### 7.2 Model 2: Random Forest Classifier

"""
Ensemble Model: Random Forest
- Handles non-linear relationships
- Robust to outliers
- Provides feature importance
- Less prone to overfitting than a single decision tree
"""

print("\n" + "="*80)
print("MODEL 2: RANDOM FOREST CLASSIFIER")
print("="*80)

rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=4,
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]

# Evaluate
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf)
rf_recall = recall_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf)
rf_auc = roc_auc_score(y_test, y_pred_proba_rf)

print(f"\nPerformance Metrics:")
print(f"  • Accuracy:  {rf_accuracy:.4f} ({rf_accuracy*100:.2f}%)")
print(f"  • Precision: {rf_precision:.4f} (of predicted positives, {rf_precision*100:.1f}% are correct)")
print(f"  • Recall:    {rf_recall:.4f} (detected {rf_recall*100:.1f}% of actual disease cases)")
print(f"  • F1-Score:  {rf_f1:.4f}")
print(f"  • ROC-AUC:   {rf_auc:.4f}")

# ### 7.3 Model 3: Support Vector Machine (SVM)

"""
Support Vector Machine with RBF Kernel
- Effective in high-dimensional spaces
- Good for non-linear classification
- Memory efficient
"""

print("\n" + "="*80)
print("MODEL 3: SUPPORT VECTOR MACHINE (SVM)")
print("="*80)

svm_model = SVC(kernel='rbf', probability=True, random_state=42, C=1.0, gamma='scale')
svm_model.fit(X_train_scaled_df, y_train)

y_pred_svm = svm_model.predict(X_test_scaled_df)
y_pred_proba_svm = svm_model.predict_proba(X_test_scaled_df)[:, 1]

# Evaluate
svm_accuracy = accuracy_score(y_test, y_pred_svm)
svm_precision = precision_score(y_test, y_pred_svm)
svm_recall = recall_score(y_test, y_pred_svm)
svm_f1 = f1_score(y_test, y_pred_svm)
svm_auc = roc_auc_score(y_test, y_pred_proba_svm)

print(f"\nPerformance Metrics:")
print(f"  • Accuracy:  {svm_accuracy:.4f} ({svm_accuracy*100:.2f}%)")
print(f"  • Precision: {svm_precision:.4f} (of predicted positives, {svm_precision*100:.1f}% are correct)")
print(f"  • Recall:    {svm_recall:.4f} (detected {svm_recall*100:.1f}% of actual disease cases)")
print(f"  • F1-Score:  {svm_f1:.4f}")
print(f"  • ROC-AUC:   {svm_auc:.4f}")

# ### 7.4 Model 4: Gradient Boosting Classifier

"""
Gradient Boosting: Advanced Ensemble Method
- Sequential learning
- Often achieves high accuracy
- Handles complex patterns
"""

print("\n" + "="*80)
print("MODEL 4: GRADIENT BOOSTING CLASSIFIER")
print("="*80)

gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=10,
    random_state=42
)
gb_model.fit(X_train, y_train)

y_pred_gb = gb_model.predict(X_test)
y_pred_proba_gb = gb_model.predict_proba(X_test)[:, 1]

# Evaluate
gb_accuracy = accuracy_score(y_test, y_pred_gb)
gb_precision = precision_score(y_test, y_pred_gb)
gb_recall = recall_score(y_test, y_pred_gb)
gb_f1 = f1_score(y_test, y_pred_gb)
gb_auc = roc_auc_score(y_test, y_pred_proba_gb)

print(f"\nPerformance Metrics:")
print(f"  • Accuracy:  {gb_accuracy:.4f} ({gb_accuracy*100:.2f}%)")
print(f"  • Precision: {gb_precision:.4f} (of predicted positives, {gb_precision*100:.1f}% are correct)")
print(f"  • Recall:    {gb_recall:.4f} (detected {gb_recall*100:.1f}% of actual disease cases)")
print(f"  • F1-Score:  {gb_f1:.4f}")
print(f"  • ROC-AUC:   {gb_auc:.4f}")

# ### 7.5 Model 5: K-Nearest Neighbors (KNN)

"""
K-Nearest Neighbors
- Non-parametric, instance-based learning
- Simple intuition: similar patients have similar outcomes
- No training phase (lazy learning)
"""

print("\n" + "="*80)
print("MODEL 5: K-NEAREST NEIGHBORS (KNN)")
print("="*80)

knn_model = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='euclidean')
knn_model.fit(X_train_scaled_df, y_train)

y_pred_knn = knn_model.predict(X_test_scaled_df)
y_pred_proba_knn = knn_model.predict_proba(X_test_scaled_df)[:, 1]

# Evaluate
knn_accuracy = accuracy_score(y_test, y_pred_knn)
knn_precision = precision_score(y_test, y_pred_knn)
knn_recall = recall_score(y_test, y_pred_knn)
knn_f1 = f1_score(y_test, y_pred_knn)
knn_auc = roc_auc_score(y_test, y_pred_proba_knn)

print(f"\nPerformance Metrics:")
print(f"  • Accuracy:  {knn_accuracy:.4f} ({knn_accuracy*100:.2f}%)")
print(f"  • Precision: {knn_precision:.4f} (of predicted positives, {knn_precision*100:.1f}% are correct)")
print(f"  • Recall:    {knn_recall:.4f} (detected {knn_recall*100:.1f}% of actual disease cases)")
print(f"  • F1-Score:  {knn_f1:.4f}")
print(f"  • ROC-AUC:   {knn_auc:.4f}")

# ### 7.6 Model 6: Naive Bayes

"""
Gaussian Naive Bayes
- Probabilistic classifier
- Fast and efficient
- Works well with small datasets
- Assumes feature independence
"""

print("\n" + "="*80)
print("MODEL 6: NAIVE BAYES (GAUSSIAN)")
print("="*80)

nb_model = GaussianNB()
nb_model.fit(X_train_scaled_df, y_train)

y_pred_nb = nb_model.predict(X_test_scaled_df)
y_pred_proba_nb = nb_model.predict_proba(X_test_scaled_df)[:, 1]

# Evaluate
nb_accuracy = accuracy_score(y_test, y_pred_nb)
nb_precision = precision_score(y_test, y_pred_nb)
nb_recall = recall_score(y_test, y_pred_nb)
nb_f1 = f1_score(y_test, y_pred_nb)
nb_auc = roc_auc_score(y_test, y_pred_proba_nb)

print(f"\nPerformance Metrics:")
print(f"  • Accuracy:  {nb_accuracy:.4f} ({nb_accuracy*100:.2f}%)")
print(f"  • Precision: {nb_precision:.4f} (of predicted positives, {nb_precision*100:.1f}% are correct)")
print(f"  • Recall:    {nb_recall:.4f} (detected {nb_recall*100:.1f}% of actual disease cases)")
print(f"  • F1-Score:  {nb_f1:.4f}")
print(f"  • ROC-AUC:   {nb_auc:.4f}")

# ---
# ## 8. Model Comparison and Analysis
# ### 8.1 Comprehensive Model Comparison

"""
Compare all models across multiple metrics
"""

print("\n" + "="*80)
print("COMPREHENSIVE MODEL PERFORMANCE COMPARISON")
print("="*80)

comparison_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'SVM',
              'Gradient Boosting', 'KNN', 'Naive Bayes'],
    'Accuracy': [lr_accuracy, rf_accuracy, svm_accuracy, gb_accuracy, knn_accuracy, nb_accuracy],
    'Precision': [lr_precision, rf_precision, svm_precision, gb_precision, knn_precision, nb_precision],
    'Recall': [lr_recall, rf_recall, svm_recall, gb_recall, knn_recall, nb_recall],
    'F1-Score': [lr_f1, rf_f1, svm_f1, gb_f1, knn_f1, nb_f1],
    'ROC-AUC': [lr_auc, rf_auc, svm_auc, gb_auc, knn_auc, nb_auc]
})

# Sort by F1-Score
comparison_df = comparison_df.sort_values('F1-Score', ascending=False).reset_index(drop=True)

print("\nModel Performance Summary (Ranked by F1-Score):")
print(comparison_df.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12', '#9b59b6', '#1abc9c']

for idx, metric in enumerate(metrics):
    axes[idx].bar(range(len(comparison_df)), comparison_df[metric],
                 color=colors, edgecolor='black', alpha=0.8)
    axes[idx].set_xticks(range(len(comparison_df)))
    axes[idx].set_xticklabels(comparison_df['Model'], rotation=45, ha='right')
    axes[idx].set_title(f'{metric} Comparison', fontweight='bold', fontsize=12)
    axes[idx].set_ylabel(metric)
    axes[idx].set_ylim([0, 1.1])
    axes[idx].grid(axis='y', alpha=0.3)

    # Add value labels
    for i, v in enumerate(comparison_df[metric]):
        axes[idx].text(i, v + 0.02, f'{v:.3f}', ha='center', fontweight='bold', fontsize=9)

# Overall performance radar chart
axes[5].remove()
ax_radar = fig.add_subplot(2, 3, 6, projection='polar')

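# Get best model for radar chart (idxmax returns an index label, so use .loc)
best_model_idx = comparison_df['F1-Score'].idxmax()
best_model_data = comparison_df.loc[best_model_idx]

categories = metrics
# Cast to float: selecting from a mixed-type row yields an object array
values = best_model_data[metrics].values.astype(float)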
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
values = np.concatenate((values, [values[0]]))
angles += angles[:1]

ax_radar.plot(angles, values, 'o-', linewidth=2, color='red', label=best_model_data['Model'])
ax_radar.fill(angles, values, alpha=0.25, color='red')
ax_radar.set_xticks(angles[:-1])
ax_radar.set_xticklabels(categories)
ax_radar.set_ylim(0, 1)
ax_radar.set_title('Best Model Performance\nRadar Chart', fontweight='bold', pad=20)
ax_radar.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
ax_radar.grid(True)

plt.tight_layout()
plt.show()

# Select best model
best_model_name = comparison_df.iloc[0]['Model']
best_f1 = comparison_df.iloc[0]['F1-Score']
best_auc = comparison_df.iloc[0]['ROC-AUC']

print(f"\n🏆 BEST PERFORMING MODEL: {best_model_name}")
print(f"   • F1-Score: {best_f1:.4f}")
print(f"   • ROC-AUC:  {best_auc:.4f}")
print(f"   • Accuracy: {comparison_df.iloc[0]['Accuracy']:.4f}")

# Map to actual model and predictions
model_map = {
    'Logistic Regression': (lr_model, y_pred_lr, y_pred_proba_lr),
    'Random Forest': (rf_model, y_pred_rf, y_pred_proba_rf),
    'SVM': (svm_model, y_pred_svm, y_pred_proba_svm),
    'Gradient Boosting': (gb_model, y_pred_gb, y_pred_proba_gb),
    'KNN': (knn_model, y_pred_knn, y_pred_proba_knn),
    'Naive Bayes': (nb_model, y_pred_nb, y_pred_proba_nb)
}
best_model, y_pred_best, y_pred_proba_best = model_map[best_model_name]

# ### 8.2 Confusion Matrix Analysis

"""
Detailed breakdown of predictions for all models
"""

print("\n" + "="*80)
print("CONFUSION MATRIX ANALYSIS - ALL MODELS")
print("="*80)

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

predictions = [
    ('Logistic Regression', y_pred_lr),
    ('Random Forest', y_pred_rf),
    ('SVM', y_pred_svm),
    ('Gradient Boosting', y_pred_gb),
    ('KNN', y_pred_knn),
    ('Naive Bayes', y_pred_nb)
]

for idx, (model_name, y_pred) in enumerate(predictions):
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False, ax=axes[idx],
               xticklabels=['No Disease', 'Disease'],
               yticklabels=['No Disease', 'Disease'])
    axes[idx].set_title(f'{model_name}\nConfusion Matrix', fontweight='bold')
    axes[idx].set_ylabel('Actual')
    axes[idx].set_xlabel('Predicted')

plt.tight_layout()
plt.show()

# Detailed metrics for best model
cm_best = confusion_matrix(y_test, y_pred_best)
tn, fp, fn, tp = cm_best.ravel()

print(f"\nDetailed Analysis - {best_model_name}:")
print(f"  • True Positives (TP):  {tp} - Correctly identified disease cases")
print(f"  • True Negatives (TN):  {tn} - Correctly identified healthy patients")
print(f"  • False Positives (FP): {fp} - Healthy patients incorrectly flagged")
print(f"  • False Negatives (FN): {fn} - Disease cases missed")
print(f"\n  • Sensitivity (Recall): {tp/(tp+fn):.4f} - {tp/(tp+fn)*100:.1f}% of disease cases detected")
print(f"  • Specificity:          {tn/(tn+fp):.4f} - {tn/(tn+fp)*100:.1f}% of healthy correctly identified")

# ### 8.3 ROC Curve Comparison

"""
ROC curves show trade-off between true positive and false positive rates
"""

print("\n" + "="*80)
print("ROC CURVE ANALYSIS - ALL MODELS")
print("="*80)

plt.figure(figsize=(12, 8))

# Plot ROC curves
roc_data = [
    ('Logistic Regression', y_pred_proba_lr, lr_auc),
    ('Random Forest', y_pred_proba_rf, rf_auc),
    ('SVM', y_pred_proba_svm, svm_auc),
    ('Gradient Boosting', y_pred_proba_gb, gb_auc),
    ('KNN', y_pred_proba_knn, knn_auc),
    ('Naive Bayes', y_pred_proba_nb, nb_auc)
]

for model_name, y_proba, auc_score in roc_data:
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    plt.plot(fpr, tpr, linewidth=2, label=f'{model_name} (AUC = {auc_score:.4f})')

plt.plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Classifier (AUC = 0.5000)')
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate (Sensitivity)', fontsize=12)
plt.title('ROC Curves - All Models', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"📊 ROC-AUC Interpretation:")
print(f"   • AUC = 1.0: Perfect classifier")
print(f"   • AUC = 0.9-1.0: Excellent")
print(f"   • AUC = 0.8-0.9: Good")
print(f"   • AUC = 0.7-0.8: Fair")
print(f"   • AUC = 0.5: Random guess")
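
"""
Illustrative sketch: ROC-AUC as a ranking probability.
AUC equals the probability that a randomly chosen positive case receives a
higher score than a randomly chosen negative case (ties count as half).
The helper `auc_by_pairs_demo` and the toy scores are assumptions made for
illustration; the pairwise loop is only practical for small samples.
"""


def auc_by_pairs_demo(y_true, scores):
    """Compute ROC-AUC by comparing every positive/negative pair."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


_auc_demo_value = auc_by_pairs_demo([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"Demo pairwise AUC: {_auc_demo_value:.2f}")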

# ### 8.4 Precision-Recall Curve

"""
Precision-Recall curves are useful for imbalanced datasets
"""

print("\n" + "="*80)
print("PRECISION-RECALL CURVE ANALYSIS")
print("="*80)

plt.figure(figsize=(12, 8))

for model_name, y_proba, _ in roc_data:
    precision, recall, _ = precision_recall_curve(y_test, y_proba)
    pr_auc = auc(recall, precision)
    plt.plot(recall, precision, linewidth=2, label=f'{model_name} (PR-AUC = {pr_auc:.4f})')

plt.xlabel('Recall (Sensitivity)', fontsize=12)
plt.ylabel('Precision', fontsize=12)
plt.title('Precision-Recall Curves - All Models', fontsize=14, fontweight='bold')
plt.legend(loc='best', fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
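
"""
Illustrative sketch: average precision (AP) versus the trapezoidal curve area.
The area from auc(recall, precision) interpolates linearly between points,
whereas AP sums precision at each step where recall increases. The helper
`average_precision_demo` is a hypothetical name, and this simplified version
assumes there are no tied scores.
"""


def average_precision_demo(y_true, scores):
    """Step-wise AP: mean of the precision values at each true positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


_ap_demo_value = average_precision_demo([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"Demo average precision: {_ap_demo_value:.4f}")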

# ### 8.5 Classification Report - Best Model

"""
Detailed classification metrics for the best model
"""

print("\n" + "="*80)
print(f"CLASSIFICATION REPORT - {best_model_name.upper()}")
print("="*80)

print(classification_report(y_test, y_pred_best,
                          target_names=['No Disease', 'Heart Disease'],
                          digits=4))

# ---
# ## 9. Cross-Validation
# ### 9.1 K-Fold Cross-Validation for All Models

"""
Perform k-fold cross-validation to assess model stability
"""

print("\n" + "="*80)
print("5-FOLD CROSS-VALIDATION ANALYSIS")
print("="*80)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}

# Models to cross-validate
models_cv = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'SVM': SVC(kernel='rbf', probability=True, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Naive Bayes': GaussianNB()
}

for model_name, model in models_cv.items():
    # Use scaled data for the models that were trained on scaled features
    if model_name in ['Logistic Regression', 'SVM', 'KNN', 'Naive Bayes']:
        X_cv = X_train_scaled_df
    else:
        X_cv = X_train

    cv_scores = cross_val_score(model, X_cv, y_train, cv=cv, scoring='f1')
    cv_results[model_name] = cv_scores

    print(f"\n{model_name}:")
    print(f"  CV Scores: {cv_scores}")
    print(f"  Mean F1:   {cv_scores.mean():.4f}")
    print(f"  Std Dev:   {cv_scores.std():.4f}")

# Visualize CV results
plt.figure(figsize=(14, 6))
cv_df = pd.DataFrame(cv_results)

bp = plt.boxplot([cv_df[col] for col in cv_df.columns],
                labels=cv_df.columns,
                patch_artist=True,
                showmeans=True)

# Color the boxes
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

plt.xlabel('Model', fontsize=12)
plt.ylabel('F1-Score', fontsize=12)
plt.title('5-Fold Cross-Validation Results', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
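
"""
Illustrative sketch: scaling inside cross-validation.
When features are scaled once before cross-validation, each validation fold
has already influenced the scaler's mean and std. Wrapping the scaler and the
model in a Pipeline refits the scaler on the training folds only. The synthetic
data from make_classification and the `_cv_demo_*` names are assumptions made
for illustration; this is not the notebook's dataset.
"""

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

_cv_demo_X, _cv_demo_y = make_classification(n_samples=200, n_features=8, random_state=42)

# The scaler is refit within every fold as part of the pipeline
_cv_demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
_cv_demo_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
_cv_demo_scores = cross_val_score(_cv_demo_pipe, _cv_demo_X, _cv_demo_y,
                                  cv=_cv_demo_folds, scoring='f1')

print(f"Demo pipeline CV F1: {_cv_demo_scores.mean():.4f} (std {_cv_demo_scores.std():.4f})")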

# ---
# ## 10. Key Insights and Clinical Findings
# ### 10.1 Feature Importance - Clinical Insights

"""
Analyze which clinical factors are most predictive of heart disease
"""

print("\n" + "="*80)

# Heart Failure Prediction Using Machine Learning
# ================================================
# Data Exploration, Model Comparison, and Insights
#
# Objective: Develop a machine learning pipeline to predict heart failure
# using clinical data, compare multiple models, and derive actionable insights

# ## 1. Project Setup and Environment Configuration
# ### 1.1 Import Required Libraries

"""
This notebook implements a complete machine learning pipeline for heart failure prediction.
We'll use:
- Data manipulation: pandas, numpy
- Visualization: matplotlib, seaborn
- Machine Learning: scikit-learn
- Statistical analysis: scipy
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime

# Machine Learning Libraries
from sklearn.model_selection import (train_test_split, cross_val_score,
                                    StratifiedKFold, GridSearchCV, learning_curve)
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# ML Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                            f1_score, roc_auc_score, roc_curve, confusion_matrix,
                            classification_report, precision_recall_curve, auc)

# Feature Selection
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("Set2")
np.random.seed(42)

print("="*80)
print("HEART FAILURE PREDICTION USING MACHINE LEARNING")
print("Data Exploration, Model Comparison, and Clinical Insights")
print("="*80)
print("✓ All libraries imported successfully")
print(f"✓ Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

# ### 1.2 Configure Display Settings

"""
Set up display options for better readability and professional visualizations
"""

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.4f' % x)
pd.set_option('display.width', 1000)

plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 100
plt.rcParams['font.size'] = 10
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 11

print("✓ Display and visualization settings configured")

# ---
# ## 2. Data Loading and Initial Exploration
# ### 2.1 Load the Heart Failure Dataset

"""
Load the clinical heart failure dataset.
This dataset contains various medical and demographic features
used to predict the presence of heart disease.
"""

# Load the dataset
df = pd.read_csv('heart_failure_data.csv')

print("\n" + "="*80)
print("DATASET LOADED SUCCESSFULLY")
print("="*80)
print(f"Total Patients: {len(df):,}")
print(f"Total Features: {len(df.columns)}")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# ### 2.2 Display Sample Data

"""
Examine the first and last few rows to understand data structure
"""

print("\n" + "="*80)
print("FIRST 10 PATIENT RECORDS")
print("="*80)
print(df.head(10))

print("\n" + "="*80)
print("LAST 5 PATIENT RECORDS")
print("="*80)
print(df.tail())

# ### 2.3 Dataset Structure and Information

"""
Comprehensive overview of dataset structure, data types, and completeness
"""

print("\n" + "="*80)
print("DATASET STRUCTURE AND INFORMATION")
print("="*80)
df.info()

# ### 2.4 Feature Descriptions

"""
Understanding each clinical feature in the dataset:
"""

feature_descriptions = {
    'Age': 'Age of the patient (years)',
    'Sex': 'Sex of the patient (M: Male, F: Female)',
    'ChestPainType': 'Type of chest pain (TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic)',
    'RestingBP': 'Resting blood pressure (mm Hg)',
    'Cholesterol': 'Serum cholesterol (mg/dl)',
    'FastingBS': 'Fasting blood sugar > 120 mg/dl (1: Yes, 0: No)',
    'RestingECG': 'Resting electrocardiogram results (Normal, ST, LVH)',
    'MaxHR': 'Maximum heart rate achieved (60-202)',
    'ExerciseAngina': 'Exercise-induced angina (Y: Yes, N: No)',
    'Oldpeak': 'ST depression induced by exercise relative to rest',
    'ST_Slope': 'Slope of peak exercise ST segment (Up, Flat, Down)',
    'HeartDisease': 'Target variable - Presence of heart disease (1: Yes, 0: No)'
}

print("\n" + "="*80)
print("CLINICAL FEATURE DESCRIPTIONS")
print("="*80)
for feature, description in feature_descriptions.items():
    # Check for similar column names (case-insensitive, space handling)
    matching_cols = [col for col in df.columns if feature.lower().replace('_', '').replace(' ', '')
                    == col.lower().replace('_', '').replace(' ', '')]
    if matching_cols:
        print(f"  • {matching_cols[0]:20s}: {description}")

# ### 2.5 Statistical Summary

"""
Statistical overview of numerical features:
- Central tendency (mean, median)
- Spread (std, min, max)
- Distribution characteristics
"""

print("\n" + "="*80)
print("STATISTICAL SUMMARY OF NUMERICAL FEATURES")
print("="*80)
print(df.describe())

# ### 2.6 Target Variable Analysis

"""
Analyze the distribution of heart disease (target variable).
Understanding class balance is critical for model development.
"""

print("\n" + "="*80)
print("TARGET VARIABLE ANALYSIS: HEART DISEASE")
print("="*80)

# Find the target column (might be named differently)
target_col = None
for col in df.columns:
    if 'heart' in col.lower() and 'disease' in col.lower():
        target_col = col
        break
    elif col.lower() in ['target', 'output', 'class']:
        target_col = col
        break

if target_col:
    target_counts = df[target_col].value_counts().sort_index()
    target_pct = (target_counts / len(df) * 100).round(2)

    print(f"Target Variable: '{target_col}'")
    print("\nClass Distribution:")
    print(f"  • No Heart Disease (0): {target_counts.get(0, 0):,} patients ({target_pct.get(0, 0)}%)")
    print(f"  • Heart Disease (1):    {target_counts.get(1, 0):,} patients ({target_pct.get(1, 0)}%)")

    # Check for class imbalance
    if len(target_counts) == 2:
        imbalance_ratio = max(target_counts) / min(target_counts)
        print(f"\n  Class Balance Ratio: {imbalance_ratio:.2f}:1")

        # Check the larger threshold first so both branches are reachable
        if imbalance_ratio > 2:
            print(f"  ⚠ Significant class imbalance detected")
        elif imbalance_ratio > 1.5:
            print(f"  ⚠ Moderate class imbalance detected")
        else:
            print(f"  ✓ Classes are well balanced")

    # Visualize target distribution
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Bar chart
    axes[0].bar(['No Disease', 'Heart Disease'], target_counts.values,
               color=['#2ecc71', '#e74c3c'], edgecolor='black', alpha=0.8)
    axes[0].set_title('Heart Disease Distribution', fontsize=14, fontweight='bold')
    axes[0].set_ylabel('Number of Patients')
    axes[0].grid(axis='y', alpha=0.3)

    # Add value labels
    for i, v in enumerate(target_counts.values):
        axes[0].text(i, v + 10, f'{v:,}\n({target_pct.values[i]}%)',
                    ha='center', fontweight='bold')

    # Pie chart
    axes[1].pie(target_counts.values, labels=['No Disease', 'Heart Disease'],
               autopct='%1.1f%%', colors=['#2ecc71', '#e74c3c'],
               startangle=90, explode=[0, 0.1])
    axes[1].set_title('Class Distribution', fontsize=14, fontweight='bold')

    plt.tight_layout()
    plt.show()
else:
    print("⚠ Warning: Target variable not found. Please specify the correct column name.")

# ---
# ## 3. Data Preprocessing
# ### 3.1 Missing Values Analysis

"""
Identify and quantify missing data across all features.
Missing data can significantly impact model performance.
"""

print("\n" + "="*80)
print("MISSING VALUES ANALYSIS")
print("="*80)

missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df) * 100).round(2),
    'Data_Type': df.dtypes
})

missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values(
    'Missing_Percentage', ascending=False)

if len(missing_data) > 0:
    print("Columns with Missing Values:")
    print(missing_data.to_string(index=False))

    # Visualize missing data
    plt.figure(figsize=(12, 6))
    plt.barh(missing_data['Column'], missing_data['Missing_Percentage'],
            color='coral', edgecolor='black', alpha=0.8)
    plt.xlabel('Missing Percentage (%)')
    plt.title('Missing Data by Feature', fontsize=14, fontweight='bold')
    plt.gca().invert_yaxis()
    plt.grid(axis='x', alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("✓ Excellent! No missing values found in the dataset")

# ### 3.2 Handle Missing Values

"""
Strategy for handling missing values:
1. Drop columns with >50% missing data
2. Impute numerical features with median
3. Impute categorical features with mode
4. Remove rows with missing target variable
"""

df_clean = df.copy()
initial_rows = len(df_clean)
initial_cols = len(df_clean.columns)

print("\n" + "="*80)
print("HANDLING MISSING VALUES")
print("="*80)

# Remove columns with >50% missing
if len(missing_data) > 0:
    high_missing = missing_data[missing_data['Missing_Percentage'] > 50]['Column'].tolist()
    if len(high_missing) > 0:
        df_clean = df_clean.drop(columns=high_missing)
        print(f"✓ Dropped {len(high_missing)} columns with >50% missing: {high_missing}")

# Remove rows with missing target
if target_col and df_clean[target_col].isnull().sum() > 0:
    df_clean = df_clean.dropna(subset=[target_col])
    print(f"✓ Removed {initial_rows - len(df_clean)} rows with missing target variable")

# Impute numerical features with median
numerical_cols = df_clean.select_dtypes(include=[np.number]).columns.tolist()
if target_col in numerical_cols:
    numerical_cols.remove(target_col)

for col in numerical_cols:
    if df_clean[col].isnull().sum() > 0:
        median_val = df_clean[col].median()
        # Assign back instead of using inplace=True on a column selection,
        # which newer pandas versions may not propagate to the DataFrame
        df_clean[col] = df_clean[col].fillna(median_val)
        print(f"✓ Imputed '{col}' missing values with median: {median_val:.2f}")

# Impute categorical features with mode
categorical_cols = df_clean.select_dtypes(include=['object']).columns.tolist()
for col in categorical_cols:
    if df_clean[col].isnull().sum() > 0:
        mode_val = df_clean[col].mode()[0]
        df_clean[col] = df_clean[col].fillna(mode_val)
        print(f"✓ Imputed '{col}' missing values with mode: {mode_val}")

print(f"\n✓ Final dataset: {len(df_clean):,} patients × {len(df_clean.columns)} features")
print(f"✓ Data retention: {len(df_clean)/initial_rows*100:.1f}% of original rows")
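
"""
Illustrative sketch: median imputation on a single feature.
The median is computed from the observed values only and then substituted for
each missing entry. The helper `impute_median_demo` and the toy readings are
assumptions made for illustration; None stands in for a missing value.
"""

import statistics


def impute_median_demo(values):
    """Return (filled_values, median) with None entries replaced by the median."""
    observed = [v for v in values if v is not None]
    median_value = statistics.median(observed)
    filled = [median_value if v is None else v for v in values]
    return filled, median_value


_impute_demo_filled, _impute_demo_median = impute_median_demo([130, None, 120, 150, None])
print(f"Demo median: {_impute_demo_median}, filled values: {_impute_demo_filled}")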

# ### 3.3 Check for Duplicates

"""
Identify and remove duplicate patient records
"""

print("\n" + "="*80)
print("DUPLICATE RECORDS CHECK")
print("="*80)

duplicates = df_clean.duplicated().sum()
print(f"Duplicate rows found: {duplicates}")

if duplicates > 0:
    df_clean = df_clean.drop_duplicates()
    print(f"✓ Removed {duplicates} duplicate records")
    print(f"✓ Dataset now has {len(df_clean):,} unique patient records")
else:
    print("✓ No duplicate records found")

# ### 3.4 Data Type Corrections

"""
Ensure all features have appropriate data types
"""

print("\n" + "="*80)
print("DATA TYPE VALIDATION AND CORRECTION")
print("="*80)

# Binary features should be 0/1 integers
binary_features = ['FastingBS', target_col] if target_col else ['FastingBS']
for col in binary_features:
    if col in df_clean.columns:
        df_clean[col] = df_clean[col].astype(int)
        print(f"✓ Converted '{col}' to integer type")

# Ensure numerical features are numeric
for col in numerical_cols:
    if col in df_clean.columns and df_clean[col].dtype == 'object':
        try:
            # errors='coerce' turns unparseable entries into NaN
            df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')
            print(f"✓ Converted '{col}' to numeric type")
        except (ValueError, TypeError):
            print(f"⚠ Could not convert '{col}' to numeric")

print(f"\n✓ Data type validation completed")

# ### 3.5 Encode Categorical Variables

"""
Convert categorical text features to numerical format for machine learning.
Label Encoding is applied to every categorical column. For multi-class features this
imposes an arbitrary integer order, so dummy variables may suit linear models better.
"""

print("\n" + "="*80)
print("ENCODING CATEGORICAL VARIABLES")
print("="*80)

df_encoded = df_clean.copy()
label_encoders = {}

# Get categorical columns
categorical_columns = df_encoded.select_dtypes(include=['object']).columns.tolist()

print(f"Found {len(categorical_columns)} categorical columns: {categorical_columns}")

for col in categorical_columns:
    unique_values = df_encoded[col].nunique()
    print(f"\n  • {col}: {unique_values} unique values")
    print(f"    Values: {df_encoded[col].unique().tolist()}")

    # Use Label Encoding
    le = LabelEncoder()
    df_encoded[col + '_encoded'] = le.fit_transform(df_encoded[col].astype(str))
    label_encoders[col] = le

    # Create mapping
    mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    print(f"    Mapping: {mapping}")

print(f"\n✓ Categorical encoding completed")
print(f"✓ Total features: {len(df_encoded.columns)}")
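
"""
Illustrative sketch: one-hot (dummy) encoding as an alternative to label encoding.
Label encoding maps categories to integers in sorted order, which suggests an
ordering that chest pain types do not have. One-hot encoding gives each category
its own 0/1 column instead. The helper `one_hot_demo` is a hypothetical name;
in pandas the same result is usually obtained with pd.get_dummies.
"""


def one_hot_demo(values):
    """Return (categories, rows) where each row has a single 1."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


_onehot_demo_cats, _onehot_demo_rows = one_hot_demo(['ATA', 'NAP', 'ASY', 'NAP'])
print(f"Demo one-hot columns: {_onehot_demo_cats}")
for _onehot_demo_row in _onehot_demo_rows:
    print(f"  {_onehot_demo_row}")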

# ### 3.6 Feature Engineering

"""
Create additional features that might improve model performance:
1. Age groups
2. Blood pressure categories
3. Cholesterol categories
4. Heart rate zones
5. Risk scores
"""

print("\n" + "="*80)
print("FEATURE ENGINEERING")
print("="*80)

# 1. Age groups
if 'Age' in df_encoded.columns:
    df_encoded['Age_Group'] = pd.cut(df_encoded['Age'],
                                     bins=[0, 40, 55, 70, 100],
                                     labels=['Young', 'Middle', 'Senior', 'Elderly'])
    df_encoded['Age_Group_encoded'] = LabelEncoder().fit_transform(df_encoded['Age_Group'].astype(str))
    # pd.cut uses right-closed bins by default, so 40 falls in 'Young' and 55 in 'Middle'
    print("✓ Created Age_Group: Young (<=40), Middle (41-55), Senior (56-70), Elderly (71-100)")
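
# Sketch: how pd.cut assigns bin edges, and an order-preserving encoding.
# With the default right-closed intervals, an age of exactly 40 lands in
# 'Young' and 55 in 'Middle'. LabelEncoder sorts labels alphabetically
# (Elderly=0, Middle=1, Senior=2, Young=3 when all four groups are present),
# which drops the age ordering, while the categorical codes keep it.
# sketch_age_codes is a hypothetical helper and the ages are made up.
import pandas as pd


def sketch_age_codes(ages):
    """Bin ages with the same edges as above and return (labels, ordinal codes)."""
    groups = pd.cut(pd.Series(ages), bins=[0, 40, 55, 70, 100],
                    labels=['Young', 'Middle', 'Senior', 'Elderly'])
    # .cat.codes follows the label order passed to pd.cut; -1 marks out-of-range ages
    return groups.tolist(), groups.cat.codes.tolist()


print(sketch_age_codes([40, 41, 55, 71, 105]))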

# 2. Blood Pressure categories
if 'RestingBP' in df_encoded.columns:
    def categorize_bp(bp):
        if bp < 120:
            return 'Normal'
        elif bp < 130:
            return 'Elevated'
        elif bp < 140:
            return 'High_Stage1'
        else:
            return 'High_Stage2'

    df_encoded['BP_Category'] = df_encoded['RestingBP'].apply(categorize_bp)
    df_encoded['BP_Category_encoded'] = LabelEncoder().fit_transform(df_encoded['BP_Category'])
    print("✓ Created BP_Category: Normal, Elevated, High_Stage1, High_Stage2")
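
# Sketch: missing readings in threshold-based categorisation.
# Every comparison with NaN evaluates to False, so an if/elif chain of "<"
# tests sends a missing reading to its final else branch. The thresholds in
# categorize_bp therefore assume RestingBP has no missing values. The
# hypothetical variant below keeps the same thresholds and adds an explicit
# 'Unknown' bucket; it is an illustration, not part of the original pipeline.
import math


def sketch_categorize_bp(bp):
    """Same thresholds as categorize_bp, plus an 'Unknown' bucket for NaN/None."""
    if bp is None or (isinstance(bp, float) and math.isnan(bp)):
        return 'Unknown'
    if bp < 120:
        return 'Normal'
    if bp < 130:
        return 'Elevated'
    if bp < 140:
        return 'High_Stage1'
    return 'High_Stage2'


print([sketch_categorize_bp(v) for v in (118, 120, 135, 140, float('nan'))])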

# 3. Cholesterol categories
if 'Cholesterol' in df_encoded.columns:
    # Treat zero cholesterol as missing (zeros often encode missing data)
    df_encoded['Cholesterol_Valid'] = df_encoded['Cholesterol'].replace(0, np.nan)

    def categorize_cholesterol(chol):
        if pd.isna(chol):
            return 'Unknown'
        elif chol < 200:
            return 'Desirable'
        elif chol < 240:
            return 'Borderline'
        else:
            return 'High'

    df_encoded['Chol_Category'] = df_encoded['Cholesterol_Valid'].apply(categorize_cholesterol)
    df_encoded['Chol_Category_encoded'] = LabelEncoder().fit_transform(df_encoded['Chol_Category'])
    print("✓ Created Chol_Category: Desirable (<200), Borderline (200-239), High (>=240), Unknown (missing or 0)")

# 4. Heart Rate zones
if 'MaxHR' in df_encoded.columns and 'Age' in df_encoded.columns:
    df_encoded['Max_HR_Expected'] = 220 - df_encoded['Age']
    df_encoded['HR_Achievement_Pct'] = (df_encoded['MaxHR'] / df_encoded['Max_HR_Expected']) * 100
    print("✓ Created HR_Achievement_Pct: Percentage of maximum heart rate achieved")
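
# Sketch: worked example of the HR_Achievement_Pct arithmetic.
# The feature divides the observed maximum heart rate by the rule-of-thumb
# estimate 220 - age and scales the result to a percentage.
# sketch_hr_achievement_pct is a hypothetical helper and the inputs are made
# up; it mirrors the two column operations above for a single patient.
def sketch_hr_achievement_pct(max_hr, age):
    """Percentage of the age-predicted maximum heart rate (220 - age) reached."""
    expected_max = 220 - age
    return max_hr / expected_max * 100


# A 50-year-old has an expected maximum of 170 bpm, so reaching 153 bpm is
# about 90% of the estimate and reaching 170 bpm is 100%.
print(round(sketch_hr_achievement_pct(153, 50), 1))
print(round(sketch_hr_achievement_pct(170, 50), 1))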

# 5. Risk score (simple composite)
if all(col in df_encoded.columns for col in ['Age', 'RestingBP', 'Cholesterol', 'FastingBS']):
    df_encoded['Simple_Risk_Score'] = (
        (df_encoded['Age'] > 55).astype(int) +
        (df_encoded['RestingBP'] > 130).astype(int) +
        (df_encoded['Cholesterol'] > 200).astype(int) +
        df_encoded['FastingBS']
    )
    print("✓ Created Simple_Risk_Score: Composite cardiovascular risk indicator (0-4)")

print("\n✓ Feature engineering completed")
print(f"✓ Total features now: {len(df_encoded.columns)}")

# ---
# ## 4. Exploratory Data Analysis (EDA)
# ### 4.1 Univariate Analysis - Numerical Features

"""
Analyze the distribution of numerical features
"""

print("\n" + "="*80)
print("EXPLORATORY DATA ANALYSIS: NUMERICAL FEATURES")
print("="*80)

numerical_features = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']
available_num_features = [f for f in numerical_features if f in df_encoded.columns]

if len(available_num_features) > 0:
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    axes = axes.flatten()

    for idx, feature in enumerate(available_num_features[:6]):
        axes[idx].hist(df_encoded[feature], bins=30, color='steelblue',
                      edgecolor='black', alpha=0.7)
        axes[idx].set_title(f'Distribution of {feature}', fontweight='bold')
        axes[idx].set_xlabel(feature)
        axes[idx].set_ylabel('Frequency')
        axes[idx].axvline(df_encoded[feature].mean(), color='red', linestyle='--',
                         linewidth=2, label=f'Mean: {df_encoded[feature].mean():.1f}')
        axes[idx].axvline(df_encoded[feature].median(), color='green', linestyle='--',
                         linewidth=2, label=f'Median: {df_encoded[feature].median():.1f}')
        axes[idx].legend(fontsize=8)
        axes[idx].grid(alpha=0.3)

    # Hide extra subplots
    for idx in range(len(available_num_features), 6):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.show()

# ### 4.2 Univariate Analysis - Categorical Features

"""
Analyze the distribution of categorical features
"""

print("\n" + "="*80)
print("EXPLORATORY DATA ANALYSIS: CATEGORICAL FEATURES")
print("="*80)

categorical_features = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
# Find matching columns (handle different naming conventions)
available_cat_features = []
for feat in categorical_features:
    matching = [col for col in df_clean.columns if feat.lower().replace('_', '') in col.lower().replace('_', '')]
    if matching:
        available_cat_features.append(matching[0])

if len(available_cat_features) > 0:
    n_features = len(available_cat_features)
    n_cols = min(3, n_features)
    n_rows = (n_features + n_cols - 1) // n_cols

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(6*n_cols, 5*n_rows))
    if n_features == 1:
        axes = [axes]
    else:
        axes = axes.flatten()

    for idx, feature in enumerate(available_cat_features):
        feature_counts = df_clean[feature].value_counts()
        axes[idx].bar(range(len(feature_counts)), feature_counts.values,
                     color='coral', edgecolor='black', alpha=0.8)
        axes[idx].set_xticks(range(len(feature_counts)))
        axes[idx].set_xticklabels(feature_counts.index, rotation=45, ha='right')
        axes[idx].set_title(f'Distribution of {feature}', fontweight='bold')
        axes[idx].set_ylabel('Count')
        axes[idx].grid(axis='y', alpha=0.3)

        # Add value labels
        for i, v in enumerate(feature_counts.values):
            axes[idx].text(i, v + 5, str(v), ha='center', fontweight='bold')

    # Hide extra subplots
    for idx in range(n_features, len(axes)):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.show()
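
# Sketch: the subplot grid arithmetic used above.
# (n + n_cols - 1) // n_cols is integer ceiling division, giving the smallest
# number of rows that fits n plots into n_cols columns. sketch_grid_shape is a
# hypothetical helper for illustration and expects at least one item.
def sketch_grid_shape(n_items, max_cols=3):
    """Return (n_rows, n_cols) for laying out n_items plots, n_items >= 1."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    n_cols = min(max_cols, n_items)
    n_rows = (n_items + n_cols - 1) // n_cols
    return n_rows, n_cols


print([sketch_grid_shape(n) for n in (1, 3, 5, 7)])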

# ### 4.3 Bivariate Analysis - Features vs Heart Disease

"""
Analyze how each feature relates to heart disease outcome
"""

print("\n" + "="*80)
print("BIVARIATE ANALYSIS: FEATURES VS HEART DISEASE")
print("="*80)

if target_col:
    # Numerical features vs target
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    axes = axes.flatten()

    for idx, feature in enumerate(available_num_features[:6]):
        df_encoded.boxplot(column=feature, by=target_col, ax=axes[idx])
        axes[idx].set_title(f'{feature} by Heart Disease Status', fontweight='bold')
        axes[idx].set_xlabel('Heart Disease (0=No, 1=Yes)')
        axes[idx].set_ylabel(feature)
        plt.suptitle('')
        axes[idx].grid(alpha=0.3)

    for idx in range(len(available_num_features), 6):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.show()

    # Statistical comparison
    print("\nMean Values by Heart Disease Status:")
    for feature in available_num_features:
        no_disease_mean = df_encoded[df_encoded[target_col] == 0][feature].mean()
        disease_mean = df_encoded[df_encoded[target_col] == 1][feature].mean()
        difference = disease_mean - no_disease_mean
        print(f"  • {feature:15s}: No Disease={no_disease_mean:6.1f}, Disease={disease_mean:6.1f}, Diff={difference:+6.1f}")

# ### 4.4 Categorical Features vs Heart Disease

"""
Analyze relationship between categorical features and heart disease
"""

print("\n" + "="*80)
print("CATEGORICAL FEATURES VS HEART DISEASE")
print("="*80)

if target_col and len(available_cat_features) > 0:
    n_features = min(4, len(available_cat_features))
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    axes = axes.flatten()

    for idx, feature in enumerate(available_cat_features[:4]):
        crosstab = pd.crosstab(df_clean[feature], df_clean[target_col])
        crosstab_pct = pd.crosstab(df_clean[feature], df_clean[target_col], normalize='index') * 100

        crosstab.plot(kind='bar', ax=axes[idx], color=['#2ecc71', '#e74c3c'],
                     edgecolor='black', alpha=0.8)
        axes[idx].set_title(f'{feature} vs Heart Disease', fontweight='bold')
        axes[idx].set_xlabel(feature)
        axes[idx].set_ylabel('Count')
        axes[idx].legend(['No Disease', 'Disease'])
        axes[idx].tick_params(axis='x', rotation=45)
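        axes[idx].grid(axis='y', alpha=0.3)

    # Hide unused subplots when fewer than four categorical features exist
    for idx in range(n_features, len(axes)):
        axes[idx].set_visible(False)
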
    plt.tight_layout()
    plt.show()

# ### 4.5 Correlation Analysis

"""
Analyze correlations between all numerical features
"""

print("\n" + "="*80)
print("CORRELATION ANALYSIS")
print("="*80)

# Select numerical columns including encoded categoricals
numerical_for_corr = df_encoded.select_dtypes(include=[np.number]).columns.tolist()

# Remove non-predictive columns
exclude_cols = [col for col in numerical_for_corr if 'Group' in col and '_encoded' not in col]
numerical_for_corr = [col for col in numerical_for_corr if col not in exclude_cols]

if len(numerical_for_corr) > 2:
    # Calculate correlation matrix
    corr_matrix = df_encoded[numerical_for_corr].corr()

    # Visualize
    plt.figure(figsize=(16, 14))
    sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='RdYlGn', center=0,
                square=True, linewidths=0.5, cbar_kws={"shrink": 0.8},
                annot_kws={'size': 8})
    plt.title('Correlation Matrix - All Numerical Features', fontsize=16, fontweight='bold', pad=20)
    plt.tight_layout()
    plt.show()

    # Features most correlated with target
    if target_col in corr_matrix.columns:
        target_corr = corr_matrix[target_col].drop(target_col).sort_values(ascending=False)

        print("\nFeatures Most Correlated with Heart Disease (up to 10 per direction):")
        print("\nPositive Correlations (associated with higher disease prevalence):")
        for feature, corr in target_corr[target_corr > 0].head(10).items():
            print(f"   • {feature:30s}: {corr:.4f}")

        print("\nNegative Correlations (associated with lower disease prevalence):")
        # target_corr is sorted descending, so re-sort ascending to list the strongest negatives first
        for feature, corr in target_corr[target_corr < 0].sort_values().head(10).items():
            print(f"   • {feature:30s}: {corr:.4f}")

        # Visualize top correlations
        plt.figure(figsize=(12, 8))
        top_15_corr = pd.concat([target_corr.head(8), target_corr.tail(7)])
        colors = ['green' if x > 0 else 'red' for x in top_15_corr.values]
        plt.barh(range(len(top_15_corr)), top_15_corr.values, color=colors,
                edgecolor='black', alpha=0.8)
        plt.yticks(range(len(top_15_corr)), top_15_corr.index)
        plt.xlabel('Correlation Coefficient')
        plt.title('Top 15 Features Correlated with Heart Disease', fontsize=14, fontweight='bold')
        plt.axvline(x=0, color='black', linestyle='-', linewidth=0.8)
        plt.gca().invert_yaxis()
        plt.grid(axis='x', alpha=0.3)
        plt.tight_layout()
        plt.show()
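
# Sketch: ranking correlations by strength rather than by sign.
# A descending sort puts strong positive values first and strong negative
# values last, so the weakest correlations sit in the middle. Sorting on the
# absolute value ranks features by strength while keeping the sign.
# sketch_strongest_correlations is a hypothetical helper and the toy values
# are made up; it is an illustration, not part of the original pipeline.
import pandas as pd


def sketch_strongest_correlations(corr_with_target, n=3):
    """Return the n entries with the largest absolute correlation, signs kept."""
    order = corr_with_target.abs().sort_values(ascending=False).index
    return corr_with_target.reindex(order).head(n)


_toy_corr = pd.Series({'A': 0.50, 'B': -0.05, 'C': -0.60, 'D': 0.10})
print(sketch_strongest_correlations(_toy_corr, n=2))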

# ---
# ## 5. Feature Selection and Preparation
# ### 5.1 Prepare Features for Modeling

"""
Select final features for machine learning models:
- Use encoded versions of categorical features
- Remove original categorical columns
- Remove intermediate feature engineering columns
- Separate features (X) and target (y)
"""

print("\n" + "