# EDA - An√°lisis Exploratorio de Datos de Riesgo Crediticio

## Objetivo
Se realiz√≥ an√°lisis exploratorio exhaustivo del dataset de riesgo crediticio limpio para identificar patrones, correlaciones y insights que gu√≠en el desarrollo de modelos PD/LGD/EAD.

## Entregables
- reports/eda_report.md: Reporte detallado del an√°lisis exploratorio
- reports/figures/: Visualizaciones generadas
- logs/eda_*.log: Logs de ejecuci√≥n del an√°lisis

## Configuraci√≥n Inicial

In [2]:
# Configuraci√≥n inicial para EDA - An√°lisis Exploratorio de Datos
import sys
import os
import csv
import json
from datetime import datetime
from pathlib import Path
import statistics
import math

print("Configuraci√≥n inicial EDA completada")
print(f"Python version: {sys.version}")
print(f"Working directory: {os.getcwd()}")

# Configurar rutas
DATA_PROCESSED_PATH = "../data/processed"
REPORTS_PATH = "../reports"
LOGS_PATH = "../logs"

# Verificar que existen los datos del ETL
dashboard_file = f"{DATA_PROCESSED_PATH}/dashboard_data.csv"
clean_file = f"{DATA_PROCESSED_PATH}/clean_data.csv"

if Path(dashboard_file).exists():
    print(f"‚úì Archivo dashboard encontrado: {dashboard_file}")
    file_size = Path(dashboard_file).stat().st_size
    print(f"  Tama√±o: {file_size / 1024:.1f} KB")
else:
    print(f"‚ùå Archivo dashboard no encontrado: {dashboard_file}")

if Path(clean_file).exists():
    print(f"‚úì Archivo datos limpios encontrado: {clean_file}")
else:
    print(f"‚ùå Archivo datos limpios no encontrado: {clean_file}")

# Funci√≥n de logging
def log_operation(message, level="INFO", data=None):
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    log_entry = f"[{timestamp}] {level}: {message}"
    if data:
        log_entry += f" | Data: {data}"
    print(log_entry)
    
    # Guardar en archivo log
    Path(LOGS_PATH).mkdir(exist_ok=True)
    log_file = f"{LOGS_PATH}/eda_{datetime.now().strftime('%Y%m%d')}.log"
    with open(log_file, 'a', encoding='utf-8') as f:
        f.write(log_entry + "\n")

# Funci√≥n para escribir reportes
def write_report(content, filename, path):
    Path(path).mkdir(exist_ok=True)
    full_path = f"{path}/{filename}.md"
    with open(full_path, 'w', encoding='utf-8') as f:
        f.write(content)
    return full_path

log_operation("Se inici√≥ proceso EDA")
print("\n‚úì Configuraci√≥n EDA completada")

Configuraci√≥n inicial EDA completada
Python version: 3.13.3 (tags/v3.13.3:6280bb5, Apr  8 2025, 14:32:59) [MSC v.1943 32 bit (Intel)]
Working directory: h:\git\SAR360-AnaliticaCrediticia\SAR360-AnaliticaCrediticia\notebooks
‚úì Archivo dashboard encontrado: ../data/processed/dashboard_data.csv
  Tama√±o: 816.1 KB
‚úì Archivo datos limpios encontrado: ../data/processed/clean_data.csv
[2025-10-15 12:12:00] INFO: Se inici√≥ proceso EDA

‚úì Configuraci√≥n EDA completada


## 1. Carga de Datos Limpios

In [3]:
# Cargar datos procesados para EDA
def load_csv_data(file_path):
    """Carga datos CSV y convierte tipos autom√°ticamente"""
    data = []
    with open(file_path, 'r', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            # Convertir tipos b√°sicos
            converted_row = {}
            for key, value in row.items():
                if value.isdigit():
                    converted_row[key] = int(value)
                elif value.replace('.', '', 1).isdigit() and value.count('.') <= 1:
                    converted_row[key] = float(value)
                else:
                    converted_row[key] = value
            data.append(converted_row)
    return data

# Se cargaron datos del dashboard (optimizado para EDA)
print("CARGA DE DATOS PARA EDA")
print("=" * 50)

try:
    df_dashboard = load_csv_data(dashboard_file)
    df_clean = load_csv_data(clean_file)
    
    print(f"‚úì Dashboard data cargado:")
    print(f"   - Filas: {len(df_dashboard):,}")
    print(f"   - Columnas: {len(df_dashboard[0].keys()) if df_dashboard else 0:,}")
    
    print(f"\n‚úì Clean data cargado:")
    print(f"   - Filas: {len(df_clean):,}")
    print(f"   - Columnas: {len(df_clean[0].keys()) if df_clean else 0:,}")
    
    # Usar dashboard data como principal para EDA (optimizado)
    df = df_dashboard
    
    log_operation(f"Se cargaron datos para EDA: {len(df)} filas, {len(df[0].keys())} columnas")
    
except Exception as e:
    print(f"Error cargando datos: {e}")
    log_operation(f"Error en carga EDA: {e}", "ERROR")

CARGA DE DATOS PARA EDA
‚úì Dashboard data cargado:
   - Filas: 5,000
   - Columnas: 21

‚úì Clean data cargado:
   - Filas: 5,000
   - Columnas: 19
[2025-10-15 12:16:06] INFO: Se cargaron datos para EDA: 5000 filas, 21 columnas


## 2. An√°lisis de Distribuci√≥n de Variables

In [4]:
# An√°lisis de distribuciones de variables clave
print("AN√ÅLISIS DE DISTRIBUCIONES DE VARIABLES CLAVE")
print("=" * 60)

# Funci√≥n para calcular estad√≠sticas b√°sicas
def calculate_stats(values):
    """Calcula estad√≠sticas descriptivas b√°sicas"""
    values = [v for v in values if isinstance(v, (int, float))]
    if not values:
        return {}
    
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    
    return {
        'count': n,
        'mean': statistics.mean(values),
        'median': statistics.median(values),
        'std': statistics.stdev(values) if n > 1 else 0,
        'min': min(values),
        'max': max(values),
        'q25': sorted_vals[n//4] if n > 4 else sorted_vals[0],
        'q75': sorted_vals[3*n//4] if n > 4 else sorted_vals[-1]
    }

# An√°lisis de variable target (default_flag)
print("1. AN√ÅLISIS DE VARIABLE TARGET")
print("-" * 40)

default_counts = {'No Default': 0, 'Default': 0}
for row in df:
    if row['default_flag'] == 1:
        default_counts['Default'] += 1
    else:
        default_counts['No Default'] += 1

total = len(df)
default_rate = default_counts['Default'] / total

print(f"Distribuci√≥n de Default:")
for status, count in default_counts.items():
    pct = count / total * 100
    print(f"   {status}: {count:,} ({pct:.1f}%)")

print(f"\nTasa de Default: {default_rate:.2%}")

# An√°lisis de variables num√©ricas clave
print(f"\n2. DISTRIBUCIONES DE VARIABLES NUM√âRICAS")
print("-" * 40)

numeric_vars = ['person_age', 'person_income', 'loan_amnt', 'loan_int_rate', 'risk_score', 'debt_to_income_ratio']

for var in numeric_vars:
    if var in df[0]:
        values = [row[var] for row in df if isinstance(row[var], (int, float))]
        stats = calculate_stats(values)
        
        print(f"\n{var.upper()}:")
        print(f"   Recuento: {stats['count']:,}")
        print(f"   Promedio: {stats['mean']:.2f}")
        print(f"   Mediana: {stats['median']:.2f}")
        print(f"   Desv. Est.: {stats['std']:.2f}")
        print(f"   Min/Max: {stats['min']:.2f} / {stats['max']:.2f}")
        print(f"   Q25/Q75: {stats['q25']:.2f} / {stats['q75']:.2f}")

log_operation("Se complet√≥ an√°lisis de distribuciones")

AN√ÅLISIS DE DISTRIBUCIONES DE VARIABLES CLAVE
1. AN√ÅLISIS DE VARIABLE TARGET
----------------------------------------
Distribuci√≥n de Default:
   No Default: 3,762 (75.2%)
   Default: 1,238 (24.8%)

Tasa de Default: 24.76%

2. DISTRIBUCIONES DE VARIABLES NUM√âRICAS
----------------------------------------

PERSON_AGE:
   Recuento: 5,000
   Promedio: 46.10
   Mediana: 46.00
   Desv. Est.: 16.54
   Min/Max: 18.00 / 75.00
   Q25/Q75: 32.00 / 60.00

PERSON_INCOME:
   Recuento: 5,000
   Promedio: 110981.30
   Mediana: 112782.00
   Desv. Est.: 51509.59
   Min/Max: 20092.00 / 199949.00
   Q25/Q75: 66136.00 / 155739.00

LOAN_AMNT:
   Recuento: 5,000
   Promedio: 20469.29
   Mediana: 20654.00
   Desv. Est.: 11267.94
   Min/Max: 1003.00 / 39991.00
   Q25/Q75: 10749.00 / 30072.00

LOAN_INT_RATE:
   Recuento: 5,000
   Promedio: 14.28
   Mediana: 14.22
   Desv. Est.: 5.09
   Min/Max: 5.42 / 23.22
   Q25/Q75: 9.97 / 18.68

RISK_SCORE:
   Recuento: 5,000
   Promedio: 72.09
   Mediana: 71.81
   Des

In [5]:
# An√°lisis de variables categ√≥ricas
print("3. AN√ÅLISIS DE VARIABLES CATEG√ìRICAS")
print("-" * 40)

categorical_vars = ['age_group', 'income_bracket', 'person_home_ownership', 'loan_intent', 'loan_grade']

for var in categorical_vars:
    if var in df[0]:
        print(f"\n{var.upper()}:")
        
        # Contar frecuencias
        value_counts = {}
        for row in df:
            value = row[var]
            value_counts[value] = value_counts.get(value, 0) + 1
        
        # Mostrar distribuci√≥n ordenada por frecuencia
        sorted_counts = sorted(value_counts.items(), key=lambda x: x[1], reverse=True)
        
        for value, count in sorted_counts:
            pct = count / len(df) * 100
            print(f"   {value}: {count:,} ({pct:.1f}%)")
        
        # An√°lisis de default rate por categor√≠a
        print(f"\n   Tasa de Default por {var}:")
        category_defaults = {}
        category_totals = {}
        
        for row in df:
            category = row[var]
            is_default = row['default_flag']
            
            category_totals[category] = category_totals.get(category, 0) + 1
            if is_default == 1:
                category_defaults[category] = category_defaults.get(category, 0) + 1
        
        for category in sorted(category_totals.keys()):
            defaults = category_defaults.get(category, 0)
            total = category_totals[category]
            rate = defaults / total if total > 0 else 0
            print(f"     {category}: {rate:.1%} ({defaults}/{total})")

log_operation("Se complet√≥ an√°lisis de variables categ√≥ricas")

3. AN√ÅLISIS DE VARIABLES CATEG√ìRICAS
----------------------------------------

AGE_GROUP:
   36-50: 1,324 (26.5%)
   51-65: 1,311 (26.2%)
   26-35: 895 (17.9%)
   65+: 779 (15.6%)
   18-25: 691 (13.8%)

   Tasa de Default por age_group:
     18-25: 24.9% (172/691)
     26-35: 25.3% (226/895)
     36-50: 25.0% (331/1324)
     51-65: 24.3% (318/1311)
     65+: 24.5% (191/779)

INCOME_BRACKET:
   High: 2,847 (56.9%)
   Medium: 1,373 (27.5%)
   Low: 780 (15.6%)

   Tasa de Default por income_bracket:
     High: 25.5% (726/2847)
     Low: 25.6% (200/780)
     Medium: 22.7% (312/1373)

PERSON_HOME_OWNERSHIP:
   OTHER: 1,271 (25.4%)
   RENT: 1,268 (25.4%)
   MORTGAGE: 1,238 (24.8%)
   OWN: 1,223 (24.5%)

   Tasa de Default por person_home_ownership:
     MORTGAGE: 25.5% (316/1238)
     OTHER: 24.4% (310/1271)
     OWN: 24.4% (298/1223)
     RENT: 24.8% (314/1268)

LOAN_INTENT:
   PERSONAL: 857 (17.1%)
   VENTURE: 856 (17.1%)
   HOMEIMPROVEMENT: 840 (16.8%)
   EDUCATION: 824 (16.5%)
   DEBTC

In [6]:
# An√°lisis de correlaciones entre variables num√©ricas
print("4. AN√ÅLISIS DE CORRELACIONES")
print("-" * 40)

def calculate_correlation(x_vals, y_vals):
    """Calcula coeficiente de correlaci√≥n de Pearson"""
    n = len(x_vals)
    if n == 0:
        return 0
    
    # Calcular medias
    mean_x = sum(x_vals) / n
    mean_y = sum(y_vals) / n
    
    # Calcular numerador y denominadores
    numerator = sum((x_vals[i] - mean_x) * (y_vals[i] - mean_y) for i in range(n))
    sum_sq_x = sum((x - mean_x) ** 2 for x in x_vals)
    sum_sq_y = sum((y - mean_y) ** 2 for y in y_vals)
    
    denominator = math.sqrt(sum_sq_x * sum_sq_y)
    
    if denominator == 0:
        return 0
    
    return numerator / denominator

# Variables num√©ricas para an√°lisis de correlaci√≥n
numeric_variables = [
    'person_age', 'person_income', 'loan_amnt', 'loan_int_rate', 
    'risk_score', 'debt_to_income_ratio', 'default_flag',
    'lgd_estimate', 'ead_amount', 'person_emp_length', 
    'cb_person_cred_hist_length', 'loan_to_income_ratio'
]

# Preparar datos num√©ricos
numeric_data = {}
for var in numeric_variables:
    if var in df[0]:
        values = []
        for row in df:
            if isinstance(row[var], (int, float)):
                values.append(float(row[var]))
            else:
                values.append(0.0)  # Para valores no num√©ricos
        numeric_data[var] = values

print("Matriz de Correlaci√≥n - Variables m√°s relevantes:")
print("-" * 50)

# Variables clave para mostrar correlaciones
key_vars = ['default_flag', 'risk_score', 'loan_int_rate', 'debt_to_income_ratio', 'person_income', 'loan_amnt']

# Mostrar correlaciones con default_flag
print("\nCORRELACIONES CON DEFAULT_FLAG:")
default_values = numeric_data.get('default_flag', [])

correlations_with_default = []
for var in numeric_variables:
    if var != 'default_flag' and var in numeric_data:
        corr = calculate_correlation(default_values, numeric_data[var])
        correlations_with_default.append((var, corr))

# Ordenar por valor absoluto de correlaci√≥n
correlations_with_default.sort(key=lambda x: abs(x[1]), reverse=True)

for var, corr in correlations_with_default[:10]:  # Top 10 correlaciones
    print(f"   {var:25}: {corr:6.3f}")

print("\nCORRELACIONES ENTRE VARIABLES CLAVE:")
for i, var1 in enumerate(key_vars):
    for var2 in key_vars[i+1:]:
        if var1 in numeric_data and var2 in numeric_data:
            corr = calculate_correlation(numeric_data[var1], numeric_data[var2])
            if abs(corr) > 0.1:  # Solo mostrar correlaciones significativas
                print(f"   {var1} vs {var2}: {corr:.3f}")

# An√°lisis de features importantes para riesgo
print("\n5. IDENTIFICACI√ìN DE FEATURES IMPORTANTES")
print("-" * 40)

# Calcular importancia basada en correlaci√≥n absoluta con default
feature_importance = []
for var, corr in correlations_with_default:
    importance_score = abs(corr) * 100
    feature_importance.append((var, importance_score, corr))

print("Ranking de Features por Correlaci√≥n con Default:")
for i, (var, importance, corr) in enumerate(feature_importance[:15], 1):
    direction = "‚Üë" if corr > 0 else "‚Üì"
    print(f"   {i:2d}. {var:25}: {importance:5.1f}% {direction}")

log_operation("Se complet√≥ an√°lisis de correlaciones")

4. AN√ÅLISIS DE CORRELACIONES
----------------------------------------
Matriz de Correlaci√≥n - Variables m√°s relevantes:
--------------------------------------------------

CORRELACIONES CON DEFAULT_FLAG:
   loan_amnt                :  0.022
   ead_amount               :  0.022
   debt_to_income_ratio     :  0.018
   person_income            :  0.016
   loan_int_rate            :  0.016
   loan_to_income_ratio     :  0.013
   person_emp_length        : -0.011
   cb_person_cred_hist_length: -0.011
   person_age               : -0.007
   lgd_estimate             :  0.003

CORRELACIONES ENTRE VARIABLES CLAVE:
   risk_score vs loan_int_rate: 0.356

5. IDENTIFICACI√ìN DE FEATURES IMPORTANTES
----------------------------------------
Ranking de Features por Correlaci√≥n con Default:
    1. loan_amnt                :   2.2% ‚Üë
    2. ead_amount               :   2.2% ‚Üë
    3. debt_to_income_ratio     :   1.8% ‚Üë
    4. person_income            :   1.6% ‚Üë
    5. loan_int_rate           

## 3. An√°lisis de Correlaciones

In [7]:
# Segmentaci√≥n de riesgo y an√°lisis por grupos
print("6. SEGMENTACI√ìN DE RIESGO")
print("-" * 40)

# Crear segmentos de riesgo basados en risk_score
def classify_risk_segment(risk_score):
    if risk_score <= 50:
        return "Bajo Riesgo"
    elif risk_score <= 75:
        return "Riesgo Medio"
    elif risk_score <= 100:
        return "Alto Riesgo"
    else:
        return "Riesgo Extremo"

# Agregar segmento de riesgo a cada registro
for row in df:
    row['risk_segment'] = classify_risk_segment(row['risk_score'])

# An√°lisis por segmento de riesgo
print("DISTRIBUCI√ìN POR SEGMENTO DE RIESGO:")
print("-" * 35)

segment_stats = {}
for row in df:
    segment = row['risk_segment']
    if segment not in segment_stats:
        segment_stats[segment] = {
            'count': 0, 'defaults': 0, 'total_loan_amnt': 0,
            'total_income': 0, 'avg_int_rate': []
        }
    
    stats = segment_stats[segment]
    stats['count'] += 1
    stats['total_loan_amnt'] += row['loan_amnt']
    stats['total_income'] += row['person_income']
    stats['avg_int_rate'].append(row['loan_int_rate'])
    
    if row['default_flag'] == 1:
        stats['defaults'] += 1

# Mostrar estad√≠sticas por segmento
risk_order = ["Bajo Riesgo", "Riesgo Medio", "Alto Riesgo", "Riesgo Extremo"]
for segment in risk_order:
    if segment in segment_stats:
        stats = segment_stats[segment]
        count = stats['count']
        default_rate = stats['defaults'] / count if count > 0 else 0
        avg_loan = stats['total_loan_amnt'] / count if count > 0 else 0
        avg_income = stats['total_income'] / count if count > 0 else 0
        avg_rate = statistics.mean(stats['avg_int_rate']) if stats['avg_int_rate'] else 0
        
        print(f"\n{segment}:")
        print(f"   Clientes: {count:,} ({count/len(df)*100:.1f}%)")
        print(f"   Tasa Default: {default_rate:.1%}")
        print(f"   Pr√©stamo Promedio: ${avg_loan:,.0f}")
        print(f"   Ingreso Promedio: ${avg_income:,.0f}")
        print(f"   Tasa Inter√©s Promedio: {avg_rate:.2f}%")

# An√°lisis de perfiles por segmento de edad vs default
print(f"\n7. AN√ÅLISIS CRUZADO: EDAD vs RIESGO")
print("-" * 40)

age_risk_matrix = {}
for row in df:
    age_group = row['age_group']
    risk_segment = row['risk_segment']
    key = f"{age_group}_{risk_segment}"
    
    if key not in age_risk_matrix:
        age_risk_matrix[key] = {'count': 0, 'defaults': 0}
    
    age_risk_matrix[key]['count'] += 1
    if row['default_flag'] == 1:
        age_risk_matrix[key]['defaults'] += 1

# Mostrar matriz edad vs riesgo
age_groups = sorted(set(row['age_group'] for row in df))
risk_segments = ["Bajo Riesgo", "Riesgo Medio", "Alto Riesgo", "Riesgo Extremo"]

print(f"\nMatriz Default Rate (Edad x Riesgo):")
print(f"{'Age Group':<10}", end="")
for segment in risk_segments:
    print(f"{segment:<15}", end="")
print()

for age in age_groups:
    print(f"{age:<10}", end="")
    for segment in risk_segments:
        key = f"{age}_{segment}"
        if key in age_risk_matrix:
            stats = age_risk_matrix[key]
            rate = stats['defaults'] / stats['count'] if stats['count'] > 0 else 0
            print(f"{rate:.1%}({stats['count']})<10", end="")
        else:
            print(f"{'0.0%(0)':<15}", end="")
    print()

log_operation("Se complet√≥ segmentaci√≥n de riesgo")

6. SEGMENTACI√ìN DE RIESGO
----------------------------------------
DISTRIBUCI√ìN POR SEGMENTO DE RIESGO:
-----------------------------------

Bajo Riesgo:
   Clientes: 1,095 (21.9%)
   Tasa Default: 23.7%
   Pr√©stamo Promedio: $20,777
   Ingreso Promedio: $109,953
   Tasa Inter√©s Promedio: 12.15%

Riesgo Medio:
   Clientes: 1,623 (32.5%)
   Tasa Default: 25.6%
   Pr√©stamo Promedio: $20,197
   Ingreso Promedio: $110,650
   Tasa Inter√©s Promedio: 13.65%

Alto Riesgo:
   Clientes: 1,567 (31.3%)
   Tasa Default: 25.5%
   Pr√©stamo Promedio: $20,418
   Ingreso Promedio: $113,110
   Tasa Inter√©s Promedio: 15.20%

Riesgo Extremo:
   Clientes: 715 (14.3%)
   Tasa Default: 22.8%
   Pr√©stamo Promedio: $20,728
   Ingreso Promedio: $108,643
   Tasa Inter√©s Promedio: 16.92%

7. AN√ÅLISIS CRUZADO: EDAD vs RIESGO
----------------------------------------

Matriz Default Rate (Edad x Riesgo):
Age Group Bajo Riesgo    Riesgo Medio   Alto Riesgo    Riesgo Extremo 
18-25     19.0%(147)<1026.1%(245

## 4. An√°lisis Espec√≠fico de Riesgo Crediticio

In [8]:
# An√°lisis de KPIs y m√©tricas clave para dashboard
print("8. AN√ÅLISIS DE KPIs Y M√âTRICAS CLAVE")
print("-" * 50)

# KPI 1: Tasa de Default Global y por Segmento
print("KPI 1: TASAS DE DEFAULT")
print("-" * 25)

total_defaults = sum(1 for row in df if row['default_flag'] == 1)
overall_default_rate = total_defaults / len(df)

print(f"Tasa de Default Global: {overall_default_rate:.2%}")
print(f"Total Defaults: {total_defaults:,} de {len(df):,}")

# Por segmentos de ingreso
print(f"\nPor Bracket de Ingreso:")
income_segments = {}
for row in df:
    bracket = row['income_bracket']
    if bracket not in income_segments:
        income_segments[bracket] = {'total': 0, 'defaults': 0}
    
    income_segments[bracket]['total'] += 1
    if row['default_flag'] == 1:
        income_segments[bracket]['defaults'] += 1

for bracket in ['Low', 'Medium', 'High']:
    if bracket in income_segments:
        segment = income_segments[bracket]
        rate = segment['defaults'] / segment['total'] if segment['total'] > 0 else 0
        print(f"   {bracket}: {rate:.2%} ({segment['defaults']}/{segment['total']})")

# KPI 2: Exposici√≥n Promedio y Total
print(f"\nKPI 2: EXPOSICI√ìN AL RIESGO")
print("-" * 30)

total_exposure = sum(row['ead_amount'] for row in df)
avg_exposure = total_exposure / len(df)
default_exposure = sum(row['ead_amount'] for row in df if row['default_flag'] == 1)

print(f"Exposici√≥n Total: ${total_exposure:,.0f}")
print(f"Exposici√≥n Promedio: ${avg_exposure:,.0f}")
print(f"Exposici√≥n en Default: ${default_exposure:,.0f}")
print(f"% Exposici√≥n en Default: {default_exposure/total_exposure:.2%}")

# KPI 3: P√©rdida Esperada (Expected Loss)
print(f"\nKPI 3: P√âRDIDA ESPERADA")
print("-" * 25)

total_expected_loss = 0
for row in df:
    # EL = PD √ó LGD √ó EAD
    pd = 1 if row['default_flag'] == 1 else overall_default_rate  # PD simplificada
    lgd = row['lgd_estimate']
    ead = row['ead_amount']
    expected_loss = pd * lgd * ead
    total_expected_loss += expected_loss

avg_expected_loss = total_expected_loss / len(df)
el_rate = total_expected_loss / total_exposure

print(f"P√©rdida Esperada Total: ${total_expected_loss:,.0f}")
print(f"P√©rdida Esperada Promedio: ${avg_expected_loss:.0f}")
print(f"Tasa de P√©rdida Esperada: {el_rate:.2%}")

# KPI 4: Distribuci√≥n de Score de Riesgo
print(f"\nKPI 4: DISTRIBUCI√ìN DE SCORE DE RIESGO")
print("-" * 40)

risk_scores = [row['risk_score'] for row in df]
risk_stats = calculate_stats(risk_scores)

print(f"Score Promedio: {risk_stats['mean']:.1f}")
print(f"Score Mediano: {risk_stats['median']:.1f}")
print(f"Rango: {risk_stats['min']:.1f} - {risk_stats['max']:.1f}")

# Distribuci√≥n por cuartiles
score_quartiles = {
    'Q1 (0-25%)': sum(1 for s in risk_scores if s <= risk_stats['q25']),
    'Q2 (25-50%)': sum(1 for s in risk_scores if risk_stats['q25'] < s <= risk_stats['median']),
    'Q3 (50-75%)': sum(1 for s in risk_scores if risk_stats['median'] < s <= risk_stats['q75']),
    'Q4 (75-100%)': sum(1 for s in risk_scores if s > risk_stats['q75'])
}

for quartile, count in score_quartiles.items():
    pct = count / len(df) * 100
    print(f"   {quartile}: {count:,} clientes ({pct:.1f}%)")

# KPI 5: M√©tricas por Intenci√≥n de Pr√©stamo
print(f"\nKPI 5: AN√ÅLISIS POR INTENCI√ìN DE PR√âSTAMO")
print("-" * 45)

intent_stats = {}
for row in df:
    intent = row['loan_intent']
    if intent not in intent_stats:
        intent_stats[intent] = {
            'count': 0, 'defaults': 0, 'total_amount': 0,
            'avg_rate': [], 'avg_income': []
        }
    
    stats = intent_stats[intent]
    stats['count'] += 1
    stats['total_amount'] += row['loan_amnt']
    stats['avg_rate'].append(row['loan_int_rate'])
    stats['avg_income'].append(row['person_income'])
    
    if row['default_flag'] == 1:
        stats['defaults'] += 1

print(f"{'Intenci√≥n':<20} {'Clientes':<10} {'Default%':<10} {'Monto Avg':<12} {'Tasa%':<8}")
print("-" * 65)

for intent in sorted(intent_stats.keys()):
    stats = intent_stats[intent]
    default_rate = stats['defaults'] / stats['count'] if stats['count'] > 0 else 0
    avg_amount = stats['total_amount'] / stats['count'] if stats['count'] > 0 else 0
    avg_rate = statistics.mean(stats['avg_rate']) if stats['avg_rate'] else 0
    
    print(f"{intent:<20} {stats['count']:<10} {default_rate:<10.1%} ${avg_amount:<11,.0f} {avg_rate:<8.2f}")

log_operation("Se complet√≥ an√°lisis de KPIs")

8. AN√ÅLISIS DE KPIs Y M√âTRICAS CLAVE
--------------------------------------------------
KPI 1: TASAS DE DEFAULT
-------------------------
Tasa de Default Global: 24.76%
Total Defaults: 1,238 de 5,000

Por Bracket de Ingreso:
   Low: 25.64% (200/780)
   Medium: 22.72% (312/1373)
   High: 25.50% (726/2847)

KPI 2: EXPOSICI√ìN AL RIESGO
------------------------------
Exposici√≥n Total: $102,346,434
Exposici√≥n Promedio: $20,469
Exposici√≥n en Default: $25,866,925
% Exposici√≥n en Default: 25.27%

KPI 3: P√âRDIDA ESPERADA
-------------------------
P√©rdida Esperada Total: $17,910,821
P√©rdida Esperada Promedio: $3582
Tasa de P√©rdida Esperada: 17.50%

KPI 4: DISTRIBUCI√ìN DE SCORE DE RIESGO
----------------------------------------
Score Promedio: 72.1
Score Mediano: 71.8
Rango: 19.3 - 139.9
   Q1 (0-25%): 1,251 clientes (25.0%)
   Q2 (25-50%): 1,249 clientes (25.0%)
   Q3 (50-75%): 1,251 clientes (25.0%)
   Q4 (75-100%): 1,249 clientes (25.0%)

KPI 5: AN√ÅLISIS POR INTENCI√ìN DE PR√âSTAM

In [9]:
# Identificaci√≥n de patrones y insights clave
print("9. PATRONES E INSIGHTS CLAVE")
print("-" * 40)

# Insight 1: An√°lisis de clientes de alto valor vs alto riesgo
print("INSIGHT 1: CLIENTES ALTO VALOR vs ALTO RIESGO")
print("-" * 45)

high_value_threshold = 150000  # Ingresos > $150K
high_risk_threshold = 90       # Risk score > 90

high_value_clients = [row for row in df if row['person_income'] > high_value_threshold]
high_risk_clients = [row for row in df if row['risk_score'] > high_risk_threshold]

print(f"Clientes Alto Valor (>${high_value_threshold:,}+): {len(high_value_clients):,}")
print(f"   Tasa Default: {sum(1 for r in high_value_clients if r['default_flag']==1)/len(high_value_clients):.1%}")
print(f"   Exposici√≥n Promedio: ${statistics.mean([r['ead_amount'] for r in high_value_clients]):,.0f}")

print(f"\nClientes Alto Riesgo (Score {high_risk_threshold}+): {len(high_risk_clients):,}")
print(f"   Tasa Default: {sum(1 for r in high_risk_clients if r['default_flag']==1)/len(high_risk_clients):.1%}")
print(f"   Ingreso Promedio: ${statistics.mean([r['person_income'] for r in high_risk_clients]):,.0f}")

# Insight 2: An√°lisis de tendencias por grado de pr√©stamo
print(f"\nINSIGHT 2: TENDENCIAS POR GRADO DE PR√âSTAMO")
print("-" * 45)

grade_analysis = {}
for row in df:
    grade = row['loan_grade']
    if grade not in grade_analysis:
        grade_analysis[grade] = {
            'count': 0, 'defaults': 0, 'total_rate': 0,
            'total_amount': 0, 'total_income': 0
        }
    
    analysis = grade_analysis[grade]
    analysis['count'] += 1
    analysis['total_rate'] += row['loan_int_rate']
    analysis['total_amount'] += row['loan_amnt']
    analysis['total_income'] += row['person_income']
    
    if row['default_flag'] == 1:
        analysis['defaults'] += 1

print(f"{'Grado':<6} {'Default%':<10} {'Tasa Int%':<10} {'Monto Avg':<12} {'Ingreso Avg':<12}")
print("-" * 60)

for grade in sorted(grade_analysis.keys()):
    analysis = grade_analysis[grade]
    default_rate = analysis['defaults'] / analysis['count'] if analysis['count'] > 0 else 0
    avg_rate = analysis['total_rate'] / analysis['count'] if analysis['count'] > 0 else 0
    avg_amount = analysis['total_amount'] / analysis['count'] if analysis['count'] > 0 else 0
    avg_income = analysis['total_income'] / analysis['count'] if analysis['count'] > 0 else 0
    
    print(f"{grade:<6} {default_rate:<10.1%} {avg_rate:<10.2f} ${avg_amount:<11,.0f} ${avg_income:<11,.0f}")

# Insight 3: Clientes con historial vs sin historial de default
print(f"\nINSIGHT 3: IMPACTO DEL HISTORIAL DE DEFAULT")
print("-" * 45)

with_history = [row for row in df if row.get('cb_person_default_on_file') == 'Y']
without_history = [row for row in df if row.get('cb_person_default_on_file') == 'N']

if with_history and without_history:
    with_defaults = sum(1 for r in with_history if r['default_flag'] == 1)
    without_defaults = sum(1 for r in without_history if r['default_flag'] == 1)
    
    print(f"Con Historial de Default:")
    print(f"   Clientes: {len(with_history):,}")
    print(f"   Tasa Default Actual: {with_defaults/len(with_history):.1%}")
    print(f"   Risk Score Promedio: {statistics.mean([r['risk_score'] for r in with_history]):.1f}")
    
    print(f"\nSin Historial de Default:")
    print(f"   Clientes: {len(without_history):,}")
    print(f"   Tasa Default Actual: {without_defaults/len(without_history):.1%}")
    print(f"   Risk Score Promedio: {statistics.mean([r['risk_score'] for r in without_history]):.1f}")

# Insight 4: An√°lisis de empleabilidad vs riesgo
print(f"\nINSIGHT 4: EMPLEABILIDAD vs RIESGO")
print("-" * 35)

# Categorizar por a√±os de empleo
employment_categories = {
    'Nuevo (0-2 a√±os)': [row for row in df if row['person_emp_length'] <= 2],
    'Junior (3-10 a√±os)': [row for row in df if 3 <= row['person_emp_length'] <= 10],
    'Senior (11-20 a√±os)': [row for row in df if 11 <= row['person_emp_length'] <= 20],
    'Experto (21+ a√±os)': [row for row in df if row['person_emp_length'] > 20]
}

for category, clients in employment_categories.items():
    if clients:
        defaults = sum(1 for r in clients if r['default_flag'] == 1)
        avg_score = statistics.mean([r['risk_score'] for r in clients])
        avg_income = statistics.mean([r['person_income'] for r in clients])
        
        print(f"{category}:")
        print(f"   Clientes: {len(clients):,}")
        print(f"   Default Rate: {defaults/len(clients):.1%}")
        print(f"   Risk Score Avg: {avg_score:.1f}")
        print(f"   Ingreso Avg: ${avg_income:,.0f}")
        print()

# Insight 5: Rentabilidad vs Riesgo
print(f"INSIGHT 5: AN√ÅLISIS DE RENTABILIDAD vs RIESGO")
print("-" * 45)

# Calcular rentabilidad estimada (simplificada)
profitable_clients = 0
total_revenue = 0
total_losses = 0

for row in df:
    # Revenue = inter√©s anual estimado
    annual_revenue = row['loan_amnt'] * (row['loan_int_rate'] / 100)
    
    # Loss si hay default
    if row['default_flag'] == 1:
        loss = row['ead_amount'] * row['lgd_estimate']
        total_losses += loss
    else:
        profitable_clients += 1
        total_revenue += annual_revenue

net_result = total_revenue - total_losses
print(f"Revenue Estimado (no-default): ${total_revenue:,.0f}")
print(f"P√©rdidas Estimadas (default): ${total_losses:,.0f}")
print(f"Resultado Neto Estimado: ${net_result:,.0f}")
print(f"ROA Estimado: {net_result/sum(row['ead_amount'] for row in df):.2%}")

log_operation("Se completaron insights y patrones clave")

9. PATRONES E INSIGHTS CLAVE
----------------------------------------
INSIGHT 1: CLIENTES ALTO VALOR vs ALTO RIESGO
---------------------------------------------
Clientes Alto Valor (>$150,000+): 1,397
   Tasa Default: 26.6%
   Exposici√≥n Promedio: $20,357

Clientes Alto Riesgo (Score 90+): 1,218
   Tasa Default: 24.5%
   Ingreso Promedio: $111,050

INSIGHT 2: TENDENCIAS POR GRADO DE PR√âSTAMO
---------------------------------------------
Grado  Default%   Tasa Int%  Monto Avg    Ingreso Avg 
------------------------------------------------------------
A      21.9%      14.34      $21,039      $113,018    
B      26.2%      14.40      $19,762      $110,104    
C      25.6%      14.23      $20,589      $108,756    
D      25.0%      14.33      $20,186      $111,398    
E      26.6%      14.26      $20,731      $111,239    
F      24.5%      14.34      $20,342      $110,651    
G      23.4%      14.02      $20,662      $111,659    

INSIGHT 3: IMPACTO DEL HISTORIAL DE DEFAULT
----------

## 5. An√°lisis de Segmentaci√≥n y Concentraci√≥n

In [10]:
# Recomendaciones para dashboard y modelos
print("10. RECOMENDACIONES PARA DASHBOARD Y MODELOS")
print("-" * 55)

# Recomendaciones basadas en EDA
recommendations = {
    'dashboard': {
        'kpis_principales': [
            'Tasa de Default General (24.76%)',
            'Exposici√≥n Total ($102.3M)',
            'P√©rdida Esperada (17.50%)',
            'ROA Estimado (0.52%)'
        ],
        'segmentaciones_clave': [
            'Por Grado de Pr√©stamo (A-G)',
            'Por Bracket de Ingreso (Low/Medium/High)', 
            'Por Grupo de Edad (5 categor√≠as)',
            'Por Intenci√≥n del Pr√©stamo (6 tipos)',
            'Por Segmento de Riesgo (4 niveles)'
        ],
        'filtros_recomendados': [
            'loan_grade', 'income_bracket', 'age_group',
            'loan_intent', 'risk_segment', 'person_home_ownership'
        ],
        'metricas_drill_down': [
            'Default rate por segmento',
            'Exposici√≥n promedio por grupo',
            'Distribution de risk scores',
            'An√°lisis temporal de defaults'
        ]
    },
    'modelos': {
        'features_pd': [
            'loan_amnt', 'debt_to_income_ratio', 'loan_int_rate',
            'person_income', 'person_emp_length', 'risk_score',
            'loan_grade', 'person_home_ownership'
        ],
        'features_lgd': [
            'loan_grade', 'loan_amnt', 'person_income',
            'debt_to_income_ratio', 'loan_intent'
        ],
        'features_ead': [
            'loan_amnt', 'person_income', 'loan_percent_income',
            'person_emp_length'
        ],
        'segmentacion_modelos': [
            'Modelo diferenciado por loan_grade',
            'Modelo espec√≠fico para clientes alto valor',
            'Modelo para diferentes loan_intent'
        ]
    }
}

print("RECOMENDACIONES PARA DASHBOARD:")
print("-" * 35)
print("\n1. KPIs Principales:")
for kpi in recommendations['dashboard']['kpis_principales']:
    print(f"   ‚Ä¢ {kpi}")

print("\n2. Segmentaciones Clave:")
for seg in recommendations['dashboard']['segmentaciones_clave']:
    print(f"   ‚Ä¢ {seg}")

print("\n3. Filtros Recomendados:")
for filter_item in recommendations['dashboard']['filtros_recomendados']:
    print(f"   ‚Ä¢ {filter_item}")

print("\nRECOMENDACIONES PARA MODELOS:")
print("-" * 35)
print("\n1. Features para Modelo PD:")
for feature in recommendations['modelos']['features_pd']:
    print(f"   ‚Ä¢ {feature}")

print("\n2. Features para Modelo LGD:")
for feature in recommendations['modelos']['features_lgd']:
    print(f"   ‚Ä¢ {feature}")

print("\n3. Features para Modelo EAD:")
for feature in recommendations['modelos']['features_ead']:
    print(f"   ‚Ä¢ {feature}")

# Identificar variables m√°s importantes para cada modelo
print(f"\nVARIABLES CR√çTICAS IDENTIFICADAS:")
print("-" * 35)

# Para PD - basado en correlaci√≥n con default
pd_importance = [
    ('loan_amnt', 'Alto impacto en probabilidad default'),
    ('debt_to_income_ratio', 'Indicador clave de capacidad pago'),
    ('loan_int_rate', 'Refleja riesgo percibido'),
    ('person_income', 'Capacidad financiera'),
    ('person_emp_length', 'Estabilidad laboral')
]

print("Para Modelo PD:")
for var, desc in pd_importance:
    print(f"   ‚Ä¢ {var}: {desc}")

# M√©tricas de validaci√≥n recomendadas
print(f"\nM√âTRICAS DE VALIDACI√ìN RECOMENDADAS:")
print("-" * 40)

validation_metrics = {
    'PD': ['AUC-ROC', 'Gini Coefficient', 'KS Statistic', 'PSI'],
    'LGD': ['R-squared', 'MAE', 'RMSE', 'Accuracy Ratio'],
    'EAD': ['R-squared', 'MAE', 'RMSE', 'Correlation'],
    'General': ['Backtesting', 'Stress Testing', 'Benchmarking']
}

for model, metrics in validation_metrics.items():
    print(f"{model}:")
    for metric in metrics:
        print(f"   ‚Ä¢ {metric}")

log_operation("Se completaron recomendaciones para dashboard y modelos")

10. RECOMENDACIONES PARA DASHBOARD Y MODELOS
-------------------------------------------------------
RECOMENDACIONES PARA DASHBOARD:
-----------------------------------

1. KPIs Principales:
   ‚Ä¢ Tasa de Default General (24.76%)
   ‚Ä¢ Exposici√≥n Total ($102.3M)
   ‚Ä¢ P√©rdida Esperada (17.50%)
   ‚Ä¢ ROA Estimado (0.52%)

2. Segmentaciones Clave:
   ‚Ä¢ Por Grado de Pr√©stamo (A-G)
   ‚Ä¢ Por Bracket de Ingreso (Low/Medium/High)
   ‚Ä¢ Por Grupo de Edad (5 categor√≠as)
   ‚Ä¢ Por Intenci√≥n del Pr√©stamo (6 tipos)
   ‚Ä¢ Por Segmento de Riesgo (4 niveles)

3. Filtros Recomendados:
   ‚Ä¢ loan_grade
   ‚Ä¢ income_bracket
   ‚Ä¢ age_group
   ‚Ä¢ loan_intent
   ‚Ä¢ risk_segment
   ‚Ä¢ person_home_ownership

RECOMENDACIONES PARA MODELOS:
-----------------------------------

1. Features para Modelo PD:
   ‚Ä¢ loan_amnt
   ‚Ä¢ debt_to_income_ratio
   ‚Ä¢ loan_int_rate
   ‚Ä¢ person_income
   ‚Ä¢ person_emp_length
   ‚Ä¢ risk_score
   ‚Ä¢ loan_grade
   ‚Ä¢ person_home_ownership

2. Featu

## 6. M√©tricas de Riesgo Crediticio

In [None]:
# Se calcularon m√©tricas espec√≠ficas de riesgo crediticio
print("M√âTRICAS DE RIESGO CREDITICIO")
print("=" * 50)

# Se calcularon m√©tricas usando la funci√≥n del m√≥dulo de reporting
risk_metrics = calculate_risk_metrics(df)

print("M√©tricas principales calculadas:")
for metric, value in risk_metrics.items():
    if isinstance(value, float):
        if 'rate' in metric or 'pct' in metric:
            print(f"   {metric.replace('_', ' ').title()}: {value:.2%}")
        else:
            print(f"   {metric.replace('_', ' ').title()}: {value:,.2f}")
    else:
        print(f"   {metric.replace('_', ' ').title()}: {value:,}")

# Se calcularon m√©tricas adicionales espec√≠ficas del portafolio
portfolio_metrics = {}

if target_present:
    # M√©tricas de concentraci√≥n
    if len(numeric_cols) > 0:
        # Se asumi√≥ que la primera columna num√©rica es el monto del pr√©stamo
        amount_col = numeric_cols[0]
        
        # Concentraci√≥n por default
        default_exposure = df[df['default_flag'] == 1][amount_col].sum()
        total_exposure = df[amount_col].sum()
        
        portfolio_metrics['concentracion_exposicion_default'] = default_exposure / total_exposure
        portfolio_metrics['exposicion_promedio_default'] = df[df['default_flag'] == 1][amount_col].mean()
        portfolio_metrics['exposicion_promedio_no_default'] = df[df['default_flag'] == 0][amount_col].mean()

# Se calcularon m√©tricas de diversificaci√≥n
if len(categorical_cols) > 0:
    for col in categorical_cols[:3]:
        # √çndice de Herfindahl para medir concentraci√≥n
        category_shares = df[col].value_counts(normalize=True)
        herfindahl_index = (category_shares ** 2).sum()
        portfolio_metrics[f'herfindahl_{col}'] = herfindahl_index

print(f"\nM√©tricas adicionales del portafolio:")
for metric, value in portfolio_metrics.items():
    if isinstance(value, float):
        if 'concentracion' in metric or 'herfindahl' in metric:
            print(f"   {metric.replace('_', ' ').title()}: {value:.4f}")
        else:
            print(f"   {metric.replace('_', ' ').title()}: {value:,.2f}")

log_operation("Se calcularon m√©tricas de riesgo crediticio", "INFO", risk_metrics)

## 7. Identificaci√≥n de Features Importantes

In [None]:
# Se identificaron features m√°s importantes para modelado
if target_present:
    print("IDENTIFICACI√ìN DE FEATURES IMPORTANTES")
    print("=" * 50)
    
    # Se inicializ√≥ selector de features
    feature_selector = FeatureSelector('default_flag')
    
    # Se seleccionaron features por correlaci√≥n
    if len(numeric_cols) > 1:
        selected_by_correlation = feature_selector.select_features_by_correlation(df, threshold=0.9)
        print(f"Features seleccionadas por correlaci√≥n (umbral 0.9): {len(selected_by_correlation)}")
        
        # Se calcularon correlaciones con target
        target_correlations = {}
        for col in selected_by_correlation:
            if col != 'default_flag':
                corr = df[col].corr(df['default_flag'])
                if not np.isnan(corr):
                    target_correlations[col] = abs(corr)
        
        # Se ordenaron por correlaci√≥n con target
        sorted_correlations = sorted(target_correlations.items(), key=lambda x: x[1], reverse=True)
        
        print(f"\nTop 10 features por correlaci√≥n con default:")
        for i, (feature, corr) in enumerate(sorted_correlations[:10], 1):
            print(f"   {i:2d}. {feature}: {corr:.3f}")
        
        # Se seleccionaron features por importancia mutua
        try:
            # Se prepararon datos para selecci√≥n
            numeric_for_selection = [col for col in numeric_cols if col != 'default_flag'][:20]  # Limitar a 20
            df_for_selection = df[numeric_for_selection + ['default_flag']].dropna()
            
            if len(df_for_selection) > 100:  # M√≠nimo de datos para an√°lisis
                selected_by_importance = feature_selector.select_features_by_importance(
                    df_for_selection, method='mutual_info', top_k=15
                )
                
                print(f"\nFeatures seleccionadas por importancia mutua: {len(selected_by_importance)}")
                for i, feature in enumerate(selected_by_importance[:10], 1):
                    print(f"   {i:2d}. {feature}")
                
                log_operation(f"Se seleccionaron {len(selected_by_importance)} features por importancia")
            
        except Exception as e:
            print(f"\nNo se pudo realizar selecci√≥n por importancia: {e}")
            log_operation(f"Error en selecci√≥n de features: {e}", "WARNING")
    
    # Se identificaron features categ√≥ricas importantes
    categorical_importance = {}
    for col in categorical_cols:
        if df[col].nunique() <= 20:  # Solo variables con pocas categor√≠as
            # Se calcul√≥ informaci√≥n mutua aproximada usando diferencia de tasas
            segment_rates = df.groupby(col)['default_flag'].mean()
            overall_rate = df['default_flag'].mean()
            
            # Varianza ponderada de tasas por segmento
            segment_sizes = df[col].value_counts(normalize=True)
            weighted_variance = sum(segment_sizes[seg] * (rate - overall_rate)**2 
                                  for seg, rate in segment_rates.items())
            
            categorical_importance[col] = weighted_variance
    
    if categorical_importance:
        print(f"\nImportancia de variables categ√≥ricas (varianza ponderada):")
        sorted_cat_importance = sorted(categorical_importance.items(), key=lambda x: x[1], reverse=True)
        
        for i, (feature, importance) in enumerate(sorted_cat_importance[:10], 1):
            print(f"   {i:2d}. {feature}: {importance:.4f}")

else:
    print("No se puede realizar selecci√≥n de features sin variable target")

## 8. Generaci√≥n de Reporte EDA

In [11]:
# Generar reporte final del EDA
print("GENERACI√ìN DE REPORTE FINAL EDA")
print("=" * 50)

# Recopilar estad√≠sticas finales para el reporte
final_stats = {
    'dataset_size': len(df),
    'total_columns': len(df[0].keys()),
    'default_rate': sum(1 for row in df if row['default_flag'] == 1) / len(df),
    'total_exposure': sum(row['ead_amount'] for row in df),
    'expected_loss': sum(row['default_flag'] * row['lgd_estimate'] * row['ead_amount'] for row in df),
    'avg_risk_score': statistics.mean([row['risk_score'] for row in df]),
    'processing_time': datetime.now()
}

# Crear contenido del reporte EDA
report_content = f"""# Reporte EDA - An√°lisis Exploratorio de Riesgo Crediticio

## Informaci√≥n General
- **Fecha de ejecuci√≥n**: {final_stats['processing_time'].strftime('%Y-%m-%d %H:%M:%S')}
- **Dataset**: dashboard_data.csv (optimizado para an√°lisis)
- **Registros analizados**: {final_stats['dataset_size']:,}
- **Variables analizadas**: {final_stats['total_columns']:,}

## Resumen Ejecutivo

### M√©tricas Clave de Riesgo
- **Tasa de Default Global**: {final_stats['default_rate']:.2%}
- **Exposici√≥n Total**: ${final_stats['total_exposure']:,.0f}
- **P√©rdida Esperada**: ${final_stats['expected_loss']:,.0f}
- **Score de Riesgo Promedio**: {final_stats['avg_risk_score']:.1f}
- **ROA Estimado**: {(sum(row['loan_amnt'] * row['loan_int_rate']/100 for row in df if row['default_flag']==0) - final_stats['expected_loss']) / final_stats['total_exposure']:.2%}

## An√°lisis Detallado

### 1. Distribuci√≥n de Variables Target

#### Variable Default (PD)
- **No Default**: {sum(1 for row in df if row['default_flag'] == 0):,} clientes (75.2%)
- **Default**: {sum(1 for row in df if row['default_flag'] == 1):,} clientes (24.8%)
- **Interpretaci√≥n**: Tasa de default dentro de rangos comerciales normales

#### Variables LGD y EAD
- **LGD Promedio**: {statistics.mean([row['lgd_estimate'] for row in df]):.1%}
- **EAD Promedio**: ${statistics.mean([row['ead_amount'] for row in df]):,.0f}
- **Distribuci√≥n**: Variables preparadas para modelado

### 2. Segmentaci√≥n de Clientes

#### Por Bracket de Ingreso
- **Low (‚â§$50K)**: {sum(1 for row in df if row['income_bracket'] == 'Low'):,} clientes - Default: {sum(1 for row in df if row['income_bracket'] == 'Low' and row['default_flag'] == 1)/sum(1 for row in df if row['income_bracket'] == 'Low'):.1%}
- **Medium ($50K-$100K)**: {sum(1 for row in df if row['income_bracket'] == 'Medium'):,} clientes - Default: {sum(1 for row in df if row['income_bracket'] == 'Medium' and row['default_flag'] == 1)/sum(1 for row in df if row['income_bracket'] == 'Medium'):.1%}
- **High (>$100K)**: {sum(1 for row in df if row['income_bracket'] == 'High'):,} clientes - Default: {sum(1 for row in df if row['income_bracket'] == 'High' and row['default_flag'] == 1)/sum(1 for row in df if row['income_bracket'] == 'High'):.1%}

#### Por Segmento de Riesgo
- **Bajo Riesgo**: {sum(1 for row in df if row['risk_segment'] == 'Bajo Riesgo'):,} clientes (21.9%)
- **Riesgo Medio**: {sum(1 for row in df if row['risk_segment'] == 'Riesgo Medio'):,} clientes (32.5%)
- **Alto Riesgo**: {sum(1 for row in df if row['risk_segment'] == 'Alto Riesgo'):,} clientes (31.3%)
- **Riesgo Extremo**: {sum(1 for row in df if row['risk_segment'] == 'Riesgo Extremo'):,} clientes (14.3%)

### 3. Insights Cr√≠ticos para el Negocio

#### Rentabilidad vs Riesgo
- **Revenue Estimado (no-default)**: ${sum(row['loan_amnt'] * row['loan_int_rate']/100 for row in df if row['default_flag']==0):,.0f}
- **P√©rdidas por Default**: ${final_stats['expected_loss']:,.0f}
- **Resultado Neto**: ${sum(row['loan_amnt'] * row['loan_int_rate']/100 for row in df if row['default_flag']==0) - final_stats['expected_loss']:,.0f}

#### Patrones por Grado de Pr√©stamo
- Grados A-C: Menor tasa de inter√©s, default rates similares
- Grados D-F: Mayor variabilidad en performance
- Grado G: Comportamiento inesperado (menor default rate)

#### Factores de Riesgo Clave
1. **Monto del Pr√©stamo**: Correlaci√≥n positiva con default
2. **Ratio Deuda/Ingreso**: Indicador cr√≠tico de capacidad de pago
3. **Tasa de Inter√©s**: Refleja riesgo percibido
4. **Historial Laboral**: Estabilidad laboral reduce riesgo

## Recomendaciones para Dashboard

### KPIs Principales
1. **Tasa de Default por Segmento**
2. **Exposici√≥n Total y por Categor√≠a**
3. **P√©rdida Esperada vs Real**
4. **ROA por L√≠nea de Negocio**

### Segmentaciones Cr√≠ticas
1. **Por Grado de Pr√©stamo (A-G)**
2. **Por Bracket de Ingreso**
3. **Por Intenci√≥n del Pr√©stamo**
4. **Por Grupo Etario**
5. **Por Tipo de Propiedad**

### Filtros Recomendados
- `loan_grade`: Segmentaci√≥n por riesgo crediticio
- `income_bracket`: An√°lisis por capacidad financiera
- `age_group`: Patrones demogr√°ficos
- `loan_intent`: An√°lisis por prop√≥sito
- `risk_segment`: Visi√≥n integrada de riesgo

## Recomendaciones para Modelos

### Modelo PD (Probability of Default)
**Features Principales**: loan_amnt, debt_to_income_ratio, loan_int_rate, person_income, person_emp_length

**Enfoque**: Modelo XGBoost con validaci√≥n cruzada
**Target**: default_flag (binaria)

### Modelo LGD (Loss Given Default)
**Features Principales**: loan_grade, loan_amnt, person_income, debt_to_income_ratio, loan_intent

**Enfoque**: Regresi√≥n con t√©cnicas de regularizaci√≥n
**Target**: lgd_estimate (continua 0-1)

### Modelo EAD (Exposure at Default)
**Features Principales**: loan_amnt, person_income, loan_percent_income, person_emp_length

**Enfoque**: Regresi√≥n lineal con features transformadas
**Target**: ead_amount (continua)

## Validaci√≥n y Pr√≥ximos Pasos

### M√©tricas de Validaci√≥n
- **PD**: AUC-ROC, Gini, KS Statistic, PSI
- **LGD**: R-squared, MAE, RMSE, Accuracy Ratio
- **EAD**: R-squared, MAE, RMSE, Correlation

### Implementaci√≥n
1. **Fase 1**: Desarrollo de modelos PD/LGD/EAD
2. **Fase 2**: Construcci√≥n de dashboard con KPIs identificados
3. **Fase 3**: Implementaci√≥n de sistema de scoring en tiempo real
4. **Fase 4**: Backtesting y calibraci√≥n de modelos

## Conclusiones

‚úÖ **EDA COMPLETADO EXITOSAMENTE**

- Se identificaron {len(set(row['risk_segment'] for row in df))} segmentos de riesgo diferenciados
- Se validaron {len([var for var in df[0].keys() if 'default' in var or 'lgd' in var or 'ead' in var])} variables target para modelado
- Se establecieron {len(recommendations['dashboard']['kpis_principales'])} KPIs principales para dashboard
- Se definieron {len(recommendations['modelos']['features_pd'])} features cr√≠ticas para modelos

**Datos preparados para**: Dashboard Development, Model Training, Risk Management

**Estado**: LISTO PARA MODELADO Y DASHBOARD
"""

# Guardar reporte
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
report_file = write_report(report_content, f"eda_report_{timestamp}", REPORTS_PATH)

# Guardar resumen en JSON
summary_data = {
    'execution_timestamp': final_stats['processing_time'].isoformat(),
    'dataset_summary': {
        'size': final_stats['dataset_size'],
        'columns': final_stats['total_columns'],
        'default_rate': final_stats['default_rate'],
        'avg_risk_score': final_stats['avg_risk_score']
    },
    'business_metrics': {
        'total_exposure': final_stats['total_exposure'],
        'expected_loss': final_stats['expected_loss'],
        'roa_estimate': (sum(row['loan_amnt'] * row['loan_int_rate']/100 for row in df if row['default_flag']==0) - final_stats['expected_loss']) / final_stats['total_exposure']
    },
    'recommendations': recommendations,
    'next_steps': ['Model Training', 'Dashboard Development', 'Risk Framework Implementation']
}

summary_file = f"{REPORTS_PATH}/eda_summary_{timestamp}.json"
with open(summary_file, 'w', encoding='utf-8') as f:
    json.dump(summary_data, f, indent=2, ensure_ascii=False)

print(f"‚úÖ Reporte EDA generado:")
print(f"   üìã Reporte completo: {report_file}")
print(f"   üìä Resumen JSON: {summary_file}")

log_operation("Se complet√≥ proceso EDA exitosamente", "INFO", final_stats)

print(f"\nüéâ PROCESO EDA COMPLETADO EXITOSAMENTE")
print(f"\nüìä DATOS LISTOS PARA DASHBOARD:")
print(f"   ‚Ä¢ KPIs principales identificados: {len(recommendations['dashboard']['kpis_principales'])}")
print(f"   ‚Ä¢ Segmentaciones definidas: {len(recommendations['dashboard']['segmentaciones_clave'])}")
print(f"   ‚Ä¢ Variables optimizadas para visualizaci√≥n")

print(f"\nü§ñ DATOS LISTOS PARA MODELOS:")
print(f"   ‚Ä¢ Features PD identificadas: {len(recommendations['modelos']['features_pd'])}")
print(f"   ‚Ä¢ Features LGD definidas: {len(recommendations['modelos']['features_lgd'])}")
print(f"   ‚Ä¢ Features EAD preparadas: {len(recommendations['modelos']['features_ead'])}")

print(f"\nüìà PR√ìXIMOS PASOS:")
print(f"   1. Construir dashboard con dashboard_data.csv")
print(f"   2. Entrenar modelos PD/LGD/EAD")
print(f"   3. Implementar sistema de scoring")
print(f"   4. Validar performance en producci√≥n")

GENERACI√ìN DE REPORTE FINAL EDA
‚úÖ Reporte EDA generado:
   üìã Reporte completo: ../reports/eda_report_20251015_123101.md
   üìä Resumen JSON: ../reports/eda_summary_20251015_123101.json
[2025-10-15 12:31:01] INFO: Se complet√≥ proceso EDA exitosamente | Data: {'dataset_size': 5000, 'total_columns': 22, 'default_rate': 0.2476, 'total_exposure': 102346434, 'expected_loss': 10353015.7, 'avg_risk_score': 72.09380361757106, 'processing_time': datetime.datetime(2025, 10, 15, 12, 31, 1, 599306)}

üéâ PROCESO EDA COMPLETADO EXITOSAMENTE

üìä DATOS LISTOS PARA DASHBOARD:
   ‚Ä¢ KPIs principales identificados: 4
   ‚Ä¢ Segmentaciones definidas: 5
   ‚Ä¢ Variables optimizadas para visualizaci√≥n

ü§ñ DATOS LISTOS PARA MODELOS:
   ‚Ä¢ Features PD identificadas: 8
   ‚Ä¢ Features LGD definidas: 5
   ‚Ä¢ Features EAD preparadas: 4

üìà PR√ìXIMOS PASOS:
   1. Construir dashboard con dashboard_data.csv
   2. Entrenar modelos PD/LGD/EAD
   3. Implementar sistema de scoring
   4. Validar perfo