
------

### ---- EXPLORATION OF 20 HEALTHY RATS: FILTERING, AVERAGING AND SELECTION ---- Case th = 0.2
#### ALEJANDRO'S RAT DATA (1-20) 13.10.25

------

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pickle
import os
from pathlib import Path
from scipy import stats
from scipy.signal import find_peaks
from scipy.stats import entropy, wasserstein_distance
from sklearn.cluster import KMeans
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
import seaborn as sns
from sklearn.metrics import silhouette_score, davies_bouldin_score
from collections import defaultdict
from collections import Counter

# HELPER FUNCTIONS

# --- Robust construction of name_map from the real IDs ---
def make_name_map_from_ids(roi_names, left_ids, right_ids):
    """
    roi_names: list of the 78 base names, in atlas order.
    left_ids, right_ids: lists with the REAL IDs that correspond to those 78 names.
                         Each must satisfy len() == len(roi_names).
    Returns dict {roi_id: "L-<name>" / "R-<name>"}.
    """
    if len(left_ids) != len(roi_names) or len(right_ids) != len(roi_names):
        raise ValueError(
            f"left_ids and right_ids must have the same length as roi_names ({len(roi_names)})."
        )

    name_map = {}
    for k, rid in enumerate(left_ids):
        name_map[int(rid)] = f"L-{roi_names[k]}"
    for k, rid in enumerate(right_ids):
        name_map[int(rid)] = f"R-{roi_names[k]}"
    return name_map

# --- Labeler that does NOT assume contiguous IDs ---
def roi_label(idx, name_map):
    """
    idx: real (sparse) ROI id. name_map: dict {id: 'L-Name'/'R-Name'}.
    """
    try:
        return name_map[int(idx)]
    except KeyError:
        return f"ID{int(idx)}"  # visible fallback

## 2. Key differences from the single-rat notebook

| Aspect | Current notebook | New multi-rat approach |
|--------|------------------|------------------------|
| Input | `pickle` with raw delays | `.dat` with fits (means/stds) |
| Structure | `dict[(i,j)] → array(N×6)` | Probably a matrix or a list of parameters |
| n_fibers | Per streamline | Aggregated in the fit |
| CV/dispersion | Computed from the delays | Already summarized, or recomputed from the parameters |
| Multi-subject | Not applicable | Aggregate/average across 18 rats |

## 3. Proposed workflow (adapted)

### Phase A: Loading and consolidation
1. **Read all the `.dat` files** → one unified table per rat
2. **Target structure**: `DataFrame` with columns:
   - `rat_id`, `roi_i`, `roi_j`, `n_fibers`, `tau_mean_ms`, `tau_std_ms`, ...
3. **Filter connections**: `n_fibers ≥ threshold` (50-100)
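
The Phase A steps can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual loader: `consolidate` is a hypothetical helper, the toy input uses plain lists in place of the real arrays, and it assumes column 0 holds τ in seconds. The resulting records could be passed to `pd.DataFrame(records)` to obtain the target table.

```python
from statistics import mean, pstdev

def consolidate(all_rats, min_fibers=50):
    """Flatten {rat_id: {(i, j): rows}} into one record per (rat, connection).

    Hypothetical helper: assumes each row is [tau_s, D, V, ...] with tau in seconds.
    """
    records = []
    for rat_id, conns in sorted(all_rats.items()):
        for (i, j), rows in sorted(conns.items()):
            taus_ms = [row[0] * 1e3 for row in rows]  # seconds -> milliseconds
            if len(taus_ms) < min_fibers:             # Phase A, step 3: n_fibers filter
                continue
            records.append({
                "rat_id": rat_id, "roi_i": i, "roi_j": j,
                "n_fibers": len(taus_ms),
                "tau_mean_ms": mean(taus_ms),
                "tau_std_ms": pstdev(taus_ms),        # population std (ddof=0)
            })
    return records

# Toy data (made up for illustration only)
toy = {
    "R01": {(0, 1): [[0.001, 0.01, 10.0], [0.003, 0.03, 10.0]],
            (0, 2): [[0.002, 0.02, 10.0]]},
    "R02": {(0, 1): [[0.002, 0.02, 10.0], [0.004, 0.04, 10.0]]},
}
table = consolidate(toy, min_fibers=2)
for rec in table:
    print(rec)
```

With `min_fibers=2`, the single-fiber connection `(0, 2)` is dropped and one record per rat remains.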

### Phase B: Per-rat and aggregate analysis
4. **Per-rat metrics**:
   - τ distributions per connection
   - τ~D relationship (if D is in the fits)
5. **Aggregation across rats**:
   - Mean/median τ per connection (i,j) across rats
   - Inter-subject variability (CV across rats)
6. **Robust selection**:
   - Connections present in ≥ N/2 rats (e.g., ≥10/18)
   - Low inter-rat CV
   - Good mean n_fibers
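
As a rough sketch of the robust-selection step, the presence and CV criteria could be combined like this. `select_robust` and the 0.5 CV cutoff are illustrative assumptions, and the toy medians are invented; the population standard deviation is used, which matches NumPy's default `std()`.

```python
from statistics import mean, pstdev

def select_robust(per_rat_tau, n_rats_total, max_cv=0.5):
    """Keep connections seen in at least half of the rats with low inter-rat CV.

    per_rat_tau: {(i, j): {rat_id: tau_median_ms}} (hypothetical input layout).
    """
    min_rats = n_rats_total / 2
    selected = {}
    for pair, by_rat in per_rat_tau.items():
        vals = list(by_rat.values())
        if len(vals) < min_rats:
            continue                      # not present in enough rats
        cv = pstdev(vals) / mean(vals)    # inter-rat coefficient of variation
        if cv <= max_cv:
            selected[pair] = {"n_rats": len(vals), "cv_inter": cv}
    return selected

# Toy medians in ms (made up for illustration only)
toy = {
    (0, 1): {"R01": 2.0, "R02": 2.2, "R03": 1.8, "R04": 2.0},  # stable, 4/4 rats
    (0, 2): {"R01": 1.0},                                      # only 1/4 rats
    (1, 2): {"R01": 1.0, "R02": 5.0, "R03": 9.0},              # very variable
}
robust = select_robust(toy, n_rats_total=4, max_cv=0.5)
print(robust)
```

Only `(0, 1)` survives here: `(0, 2)` fails the presence criterion and `(1, 2)` fails the CV criterion.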



### Phase C: Categorization and clustering
7. Apply the same logic as the notebook:
   - Intra/inter-hemispheric
   - Hippocampus-PFC, thalamo-cortical
   - Clustering by distribution shape (if there are enough fit parameters)
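
A possible way to tag connections for Phase C is sketched below. It assumes IDs 0-77 belong to the left hemisphere and 78-155 to the right. The keyword lists in `GROUPS` are hypothetical and would need to be adapted to the atlas names actually in use. Keywords are matched at the start of a word, so that "Hypothalamic" or "Subthalamic" are not mistaken for thalamic regions.

```python
N_PER_HEMI = 78  # assumption: IDs 0-77 are left, 78-155 are right

# Hypothetical keyword lists; adjust to the atlas naming actually in use
GROUPS = {
    "hippocampal": ("hippocamp", "subiculum"),
    "pfc": ("prelimbic", "infralimbic", "orbital"),
    "thalamic": ("thalam",),
}

def hemi_type(i, j):
    """'intra' if both ROIs sit in the same hemisphere, else 'inter'."""
    same_side = (i < N_PER_HEMI) == (j < N_PER_HEMI)
    return "intra" if same_side else "inter"

def region_group(label):
    """Map a label such as 'L-Hippocampus' to a coarse anatomical group."""
    words = label.lower().replace("-", " ").split()
    for group, keywords in GROUPS.items():
        if any(word.startswith(k) for word in words for k in keywords):
            return group
    return "other"

def categorize(i, j, label_i, label_j):
    return {"hemi": hemi_type(i, j),
            "groups": (region_group(label_i), region_group(label_j))}

print(categorize(28, 76, "L-Hippocampus", "L-Medial orbital area"))
print(categorize(28, 108, "L-Hippocampus", "R-Subthalamic nucleus"))
```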

### 1. Consolidated loading - Names + Data

In [None]:
os.chdir("../..")

path = './data/raw/rat_delays_fibers_0.2/th-0.2/'

names = [f for f in os.listdir(path) if 'name' in f]

# Open the .txt file with the ROI names (the .dat files hold the data as dictionaries)
with open(path+names[0], 'r') as f:
    roi_names = [line.strip() for line in f.readlines()][1:]  # skip the first line
    
print(roi_names)
print(len(roi_names))

In [None]:
def load_all_rats(data_dir, threshold='0.0'):
    """Loads the 18 rats → dict {rat_id: data_dict}"""
    rats = {}
    path = Path(data_dir) / f'rat_delays_fibers_{threshold}' / f'th-{threshold}'
    
    for f in path.glob(f'th-{threshold}_R*_b20_r_Fit_Histogram_Tau_all_fibers.dat'):
        rat_id = f.stem.split('_')[1]  # 'R01', 'R02', etc.
        with open(f, 'rb') as fh:
            rats[rat_id] = pickle.load(fh)
    
    return rats

# Usage
data_dir = './data/raw/'
all_rats = load_all_rats(data_dir, threshold='0.2')
print(f"Rats loaded: {sorted(all_rats.keys())}")  # R01-R19 (without R11)
print(f"Example structure R01: {len(all_rats['R01'])} connections")

### Cell 4: Name map and initial exploration

In [None]:
left_ids = range(0, 78)     # IDs 0 to 77 for the left hemisphere
right_ids = range(78, 156)  # IDs 78 to 155 for the right hemisphere

name_map = make_name_map_from_ids(roi_names, left_ids, right_ids)

# Exploration: connections shared across rats
all_pairs = Counter()
for rat_data in all_rats.values():
    all_pairs.update(rat_data.keys())

print(f"Total unique connections: {len(all_pairs)}")
print(f"Connections in ≥9 rats: {sum(1 for c in all_pairs.values() if c >= 9)}")
print("\nTop 10 most frequent connections:")
for (i,j), count in all_pairs.most_common(10):
    print(f"  {roi_label(i, name_map)} → {roi_label(j, name_map)}: {count} rats")

The first steps check out:

## Cells 3-4: Loading and Exploration ✅

**Loading th=0.0**:
- 18 rats loaded (R01-R19, without R11)
- R01 example: 3937 raw connections

**Initial exploration**:
- **7259 unique connections** across all rats
- **3409 robust connections** (≥9 rats) - ~47% of the total
- Top 10: **all with 18/18 rats** (maximum consistency)

**Dominant pattern**: L-Olfactory bulb as the main hub (appears in 8/10 of the top connections)

**Important observation**: The connections shown are **left intra-hemispheric** (L→L), which is expected given that the olfactory bulb has extensive ipsilateral connectivity.


### Cell 5: Per-rat cleaning
    - Min fibers: 25

In [None]:
# Columns: τ, D, V (D and V are only checked for finite/positive values; there is no τ≈D/V check)
COL_TAU = 0
COL_D   = 1
COL_V   = 2

def clean_data(
    data: dict,
    *,
    min_n_fibers: int = 50,
    enforce_positive: bool = True,
    tau_quantiles: tuple[float, float] | None = (0.005, 0.995),
) -> tuple[dict, pd.DataFrame, dict]:
    """
    Cleans the measurements per pair (i,j), without a τ≈D/V check.
    - Filters: finite values, (optional) τ,D,V > 0, per-pair τ quantiles.
    - min_n_fibers threshold applied before and after cleaning.
    Returns:

      cleaned_data: dict[(i,j)] -> ndarray float32 (m, >=3)
      pair_summary: DF with n_raw, n_clean and medians (τ,D,V)
      stats: pair and row counters
    """
    cleaned_data = {}

    pair_stats = {
        "pairs_original": len(data),
        "pairs_empty_raw": 0,
        "pairs_raw_lt_min": 0,
        "pairs_all_invalid": 0,
        "pairs_after_lt_min": 0,
        "pairs_kept": 0,
    }
    row_stats = {
        "rows_total": 0,
        "rows_kept": 0,
        "rows_drop_nan_inf": 0,
        "rows_drop_nonpositive": 0,
        "rows_drop_outlier_tau": 0,
    }

    rows_summary = []

    for (i, j), measurements in data.items():
        if measurements is None or len(measurements) == 0:
            pair_stats["pairs_empty_raw"] += 1
            continue

        arr = np.asarray(measurements)
        # require at least τ,D,V (>=3 columns)
        if arr.ndim != 2 or arr.shape[1] < 3:
            pair_stats["pairs_all_invalid"] += 1
            continue

        n_raw = arr.shape[0]
        row_stats["rows_total"] += n_raw
        if n_raw < min_n_fibers:
            pair_stats["pairs_raw_lt_min"] += 1
            continue

        # Finite values in τ,D,V
        finite = np.isfinite(arr[:, [COL_TAU, COL_D, COL_V]]).all(axis=1)
        row_stats["rows_drop_nan_inf"] += int((~finite).sum())
        arr = arr[finite]
        if arr.size == 0:
            pair_stats["pairs_all_invalid"] += 1
            continue

        # Positive values (optional)
        if enforce_positive:
            pos = (arr[:, COL_TAU] > 0) & (arr[:, COL_D] > 0) & (arr[:, COL_V] > 0)
            row_stats["rows_drop_nonpositive"] += int((~pos).sum())
            arr = arr[pos]
            if arr.size == 0:
                pair_stats["pairs_all_invalid"] += 1
                continue

        # τ outliers by quantiles (per pair)
        if tau_quantiles is not None and arr.shape[0] >= 5:
            qlo, qhi = tau_quantiles
            tau_vals = arr[:, COL_TAU]
            lo = np.nanquantile(tau_vals, qlo)
            hi = np.nanquantile(tau_vals, qhi)
            in_rng = (tau_vals >= lo) & (tau_vals <= hi)
            row_stats["rows_drop_outlier_tau"] += int((~in_rng).sum())
            arr = arr[in_rng]
            if arr.size == 0:
                pair_stats["pairs_all_invalid"] += 1
                continue

        n_clean = arr.shape[0]
        if n_clean < min_n_fibers:
            pair_stats["pairs_after_lt_min"] += 1
            continue

        cleaned = arr.astype(np.float32, copy=False)
        cleaned_data[(int(i), int(j))] = cleaned
        pair_stats["pairs_kept"] += 1
        row_stats["rows_kept"] += n_clean

        # Per-pair summary
        med_tau = float(np.median(cleaned[:, COL_TAU]))
        med_D   = float(np.median(cleaned[:, COL_D]))
        med_V   = float(np.median(cleaned[:, COL_V]))

        rows_summary.append({
            "roi_i": int(i), "roi_j": int(j),
            "roi_name1": roi_label(i, name_map), "roi_name2": roi_label(j, name_map),
            "n_raw": int(n_raw), "n_clean": int(n_clean),
            "tau_med_s": med_tau, "tau_med_ms": med_tau*1e3,
            "D_med_m": med_D, "D_med_mm": med_D*1e3,
            "V_med_mps": med_V,
        })

    pair_summary = pd.DataFrame(rows_summary).sort_values(["roi_i", "roi_j"]).reset_index(drop=True)
    stats = {"pairs": pair_stats, "rows": row_stats}
    return cleaned_data, pair_summary, stats


cleaned_rats = {}
summaries = {}
stats = {}
for rat_id, data in all_rats.items():
    cleaned_rats[rat_id], summaries[rat_id], stats[rat_id] = clean_data(
        data, min_n_fibers=50, enforce_positive=True, tau_quantiles=(0.0, 1.0)
    )
cleaned_rats.keys()

### - Results for one rat: R02

    - ROI pair keys (i,j)

In [None]:
cleaned_rats['R02'].keys()

### - Descriptive statistics summary: pairs, names, n_fibers_raw vs n_fibers_clean, and medians of tau, distance, velocity

In [None]:
summaries['R02']

- ### Original vs kept pairs, n_rows...
  
    - Configuration used:

        - min_n_fibers=25 (more permissive than the typical 50)
        - tau_quantiles=(0.0, 1.0) → no τ outlier filtering
        - Keeps positive and finite values

    - R02 result (example):

        - 3283 → 1392 pairs (42% retained)
        - 468K → 459K rows (98% of streamlines OK)
        - Main losses: pairs with n<25 (1282) and empty pairs (605)
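
The retention percentages quoted above follow directly from the counters returned by `clean_data`. A minimal sketch, assuming the `stats` layout shown in that function; `retention` is a hypothetical helper, and the row counts are the rounded 468K/459K figures from the text rather than exact values:

```python
def retention(stats):
    """Percentage of pairs and rows that survive cleaning."""
    pairs, rows = stats["pairs"], stats["rows"]
    return {
        "pairs_pct": 100 * pairs["pairs_kept"] / pairs["pairs_original"],
        "rows_pct": 100 * rows["rows_kept"] / rows["rows_total"],
    }

# R02 figures quoted in the text (row counts are rounded to thousands)
r02 = {
    "pairs": {"pairs_original": 3283, "pairs_kept": 1392},
    "rows": {"rows_total": 468_000, "rows_kept": 459_000},
}
r = retention(r02)
print(f"pairs kept: {r['pairs_pct']:.1f}%  rows kept: {r['rows_pct']:.1f}%")
```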

In [None]:
stats['R02']

- ### Inter-rat aggregation with key metrics
  - Sorted by tau_range_mean (prioritizes temporal diversity) + n_rats.

In [None]:
def aggregate_multi_rat(cleaned_rats, min_rats=10):
    """
    Consolidates the connections present in ≥ min_rats rats.
    Returns a DataFrame with inter-rat statistics.
    """
    
    conn_data = defaultdict(lambda: {
        'rats': [], 'tau_med_ms': [], 'tau_range_ms': [], 
        'n_fibers': [], 'D_med_mm': []
    })
    
    for rat_id, data in cleaned_rats.items():
        for (i,j), arr in data.items():
            tau_ms = arr[:, COL_TAU] * 1e3
            D_mm = arr[:, COL_D] * 1e3
            
            conn_data[(i,j)]['rats'].append(rat_id)
            conn_data[(i,j)]['tau_med_ms'].append(np.median(tau_ms))
            conn_data[(i,j)]['tau_range_ms'].append(np.ptp(tau_ms))  # max-min
            conn_data[(i,j)]['n_fibers'].append(len(tau_ms))
            conn_data[(i,j)]['D_med_mm'].append(np.median(D_mm))
    
    rows = []
    for (i,j), stats in conn_data.items():
        n_rats = len(stats['rats'])
        if n_rats < min_rats:
            continue
        
        tau_vals = np.array(stats['tau_med_ms'])
        rows.append({
            'roi_i': int(i), 'roi_j': int(j),
            'pair_label': f"{roi_label(i, name_map)} → {roi_label(j, name_map)}",
            'n_rats': n_rats,
            'tau_mean_ms': tau_vals.mean(),
            'tau_std_inter': tau_vals.std(),           # variability across rats
            'cv_inter': tau_vals.std() / tau_vals.mean(),
            'tau_range_mean': np.mean(stats['tau_range_ms']),  # mean range
            'n_fibers_mean': np.mean(stats['n_fibers']),
            'D_mean_mm': np.mean(stats['D_med_mm']),
            'hemi': 'intra' if (i < 78 and j < 78) or (i >= 78 and j >= 78) else 'inter',
        })
    
    df = pd.DataFrame(rows)
    return df.sort_values(['tau_range_mean', 'n_rats'], ascending=[False, False])

df_multi = aggregate_multi_rat(cleaned_rats, min_rats=9)
print(f"Connections with ≥9 rats: {len(df_multi)}")
df_multi.head(20)

## Aggregation Results ✅

**1635 robust connections** (≥9 rats) - excellent coverage.

### Top 20 analysis:

**Anatomical patterns**:
- 100% **intra-hemispheric** (expected for long delays)
- 70% **right** hemisphere (R→R)
- **Subthalamic nucleus** = hub (6/20 connections)

**Temporal quality**:
- τ_range: **4.3-5.0 ms** - optimal diversity
- Inter-rat CV: **0.11-0.52** (most <0.35) - good robustness
- n_fibers: **122-5436** - sufficient

**Anatomical connections of interest**:
- #8: **Parietal → Subiculum** (3233 fibers, CV=0.28) ← Cortex-limbic
- #10: **Accumbens → Motor** (707 fibers, CV=0.20) ← Limbic-motor
- #4: **Endopiriform → Prelimbic** (1060 fibers, CV=0.19) ← Limbic-PFC

**Observation**: Inter-hemispheric diversity is missing. It should be sought among candidates with τ_range >3 ms but <4.3 ms.
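
One way to look for those inter-hemispheric candidates is sketched below. It works on plain records with the `hemi` and `tau_range_mean` fields produced by the aggregation; with the real table, `df_multi.to_dict("records")` could supply them. `interhemispheric_candidates` is a hypothetical helper and the toy rows are invented.

```python
def interhemispheric_candidates(rows, lo_ms=3.0, hi_ms=4.3):
    """Inter-hemispheric connections with lo_ms < tau_range_mean < hi_ms, widest first."""
    picked = [r for r in rows
              if r["hemi"] == "inter" and lo_ms < r["tau_range_mean"] < hi_ms]
    return sorted(picked, key=lambda r: r["tau_range_mean"], reverse=True)

# Toy rows (made up for illustration only)
toy_rows = [
    {"pair_label": "A → B", "hemi": "intra", "tau_range_mean": 4.8},
    {"pair_label": "C → D", "hemi": "inter", "tau_range_mean": 3.4},
    {"pair_label": "E → F", "hemi": "inter", "tau_range_mean": 4.1},
    {"pair_label": "G → H", "hemi": "inter", "tau_range_mean": 2.2},
]
candidates = interhemispheric_candidates(toy_rows)
for r in candidates:
    print(r["pair_label"], r["tau_range_mean"])
```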


In [None]:
def advanced_outlier_removal(cleaned_rats, min_rats=8):
    """
    Removes outlier connections using consensus across rats.

    For each connection (i,j):
    1. Computes the P5-P95 percentiles of the per-rat tau median
    2. Drops rats whose tau median falls outside that inter-rat range
    3. Keeps the fibers of the remaining rats unchanged (no intra-rat MAD
       filter is applied, which preserves long tails)
    4. Discards the connection if CV_inter > 0.5 or fewer than min_rats rats remain
    """
    
    refined_rats = {}
    removal_log = []
    
    # Step 1: Identify shared connections
    conn_inventory = defaultdict(list)
    for rat_id, data in cleaned_rats.items():
        for (i,j) in data.keys():
            conn_inventory[(i,j)].append(rat_id)
    
    # Keep only connections with enough rats
    valid_conns = {k: v for k, v in conn_inventory.items() if len(v) >= min_rats}
    print(f"Connections with ≥{min_rats} rats: {len(valid_conns)}")
    
    # Step 2: Per-connection cleaning
    for (i,j), rat_list in valid_conns.items():
        # Collect the per-rat medians
        tau_medians = []
        for rat_id in rat_list:
            tau_ms = cleaned_rats[rat_id][(i,j)][:, COL_TAU] * 1e3
            tau_medians.append(np.median(tau_ms))
        
        tau_medians = np.array(tau_medians)
        
        # Detect outlier rats (outside the inter-rat P5-P95 range)
        p5, p95 = np.percentile(tau_medians, [5, 95])
        valid_rats = [rat_list[k] for k in range(len(rat_list)) 
                      if p5 <= tau_medians[k] <= p95]
        
        if len(valid_rats) < min_rats:
            removal_log.append({
                'pair': (i,j), 'reason': 'insufficient_rats_after_outlier',
                'n_rats_before': len(rat_list), 'n_rats_after': len(valid_rats)
            })
            continue
        
        # Inter-rat CV (valid rats only)
        valid_medians = [tau_medians[rat_list.index(r)] for r in valid_rats]
        cv_inter = np.std(valid_medians) / np.mean(valid_medians)
        
        if cv_inter > 0.5:
            removal_log.append({
                'pair': (i,j), 'reason': 'high_cv_inter',
                'cv': cv_inter, 'n_rats': len(valid_rats)
            })
            continue
        
        # Step 3: Store the valid connections (no intra-rat MAD)
        # Only inter-rat filters are applied, preserving long tails
        for rat_id in valid_rats:
            if rat_id not in refined_rats:
                refined_rats[rat_id] = {}
            
            # Keep the original data (already cleaned in clean_data)
            refined_rats[rat_id][(i,j)] = cleaned_rats[rat_id][(i,j)]
    
    print(f"\nRefinement completed:")
    print(f"  Rats processed: {len(refined_rats)}")
    print(f"  Connections removed: {len(removal_log)}")
    
    # Summary of removal reasons
    reasons = pd.DataFrame(removal_log)
    if len(reasons) > 0:
        print("\nRemoval reasons:")
        print(reasons['reason'].value_counts())
    
    return refined_rats, removal_log

In [None]:
# Run the advanced cleaning
print("="*70)
print("ADVANCED CLEANING BY INTER-RAT CONSENSUS")
print("="*70)

refined_rats, log = advanced_outlier_removal(cleaned_rats, min_rats=9)

# Re-aggregate with the refined data
df_refined = aggregate_multi_rat(refined_rats, min_rats=9)

print(f"\n{'COMPARISON':=^70}")
print(f"Before:  {len(df_multi)} robust connections")
print(f"After:   {len(df_refined)} refined connections")
print(f"Loss:    {len(df_multi) - len(df_refined)} ({100*(len(df_multi)-len(df_refined))/len(df_multi):.1f}%)")

# Top 20 after refinement
print(f"\n{'TOP 20 REFINED':=^70}")
print(df_refined.head(20)[['pair_label', 'n_rats', 'tau_range_mean', 
                            'cv_inter', 'n_fibers_mean']].to_string(index=False))

# # Save
# df_refined.to_csv('../../results/data_analysis/refined_th_0.0.csv', index=False)
# print("\n✓ Saved: refined_th_0.0.csv")

## Refinement ✅

**Removed**: 354 connections (13.6%)
- 230 due to outlier rats
- 124 due to CV>0.5

**Top 20 refined**: τ_range 4.2-5.0 ms, CV <0.35 (except #5 and #8)

**Improvements**:
- ↓ mean CV: 0.28→0.21
- ↑ robustness: most at 16/16 rats
- Preserves anatomical diversity

In [None]:
# Load both
df_02 = pd.read_csv('./results/data_analysis/results_th_0.2.csv')
df_00 = df_refined.copy()

# Common pairs
pairs_02 = set(zip(df_02['roi_i'], df_02['roi_j']))
pairs_00 = set(zip(df_00['roi_i'], df_00['roi_j']))

common = pairs_02 & pairs_00
print(f"Common: {len(common)}")
print(f"Only in 0.0: {len(pairs_00 - pairs_02)}")
print(f"Only in 0.2: {len(pairs_02 - pairs_00)}")

# Matching top 20
top20_00 = set(zip(df_00.head(20)['roi_i'], df_00.head(20)['roi_j']))
top20_02 = set(zip(df_02.nlargest(20, 'tau_range_mean')['roi_i'], 
                   df_02.nlargest(20, 'tau_range_mean')['roi_j']))
print(f"\nTop 20 overlap: {len(top20_00 & top20_02)}/20")

### Top 20 with the widest temporal range (2.4-2.8 ms):
- Key patterns:

  - All intra-hemispheric, subcortical/limbic
  - 18/18 rats in all of them (maximum robustness)
  - CV_inter: 0.12-0.55 (some very stable, others more variable across subjects)
  - Main players: Hypothalamus, Subiculum, PAG, Zona incerta

In [None]:
# Configurable criteria
MIN_RATS_FILTER = 9
MAX_CV_INTER = 0.15
MIN_FIBERS_FILTER = 100
MIN_TAU_RANGE = 2.5  # ms

# Filtering
df_stable_diverse = df_refined[
    (df_refined['n_rats'] >= MIN_RATS_FILTER) &
    (df_refined['cv_inter'] < MAX_CV_INTER) &
    (df_refined['n_fibers_mean'] >= MIN_FIBERS_FILTER) &
    (df_refined['tau_range_mean'] > MIN_TAU_RANGE)
].sort_values('tau_range_mean', ascending=False)

print(f"Candidates: {len(df_stable_diverse)} (τ_range>{MIN_TAU_RANGE}ms, n≥{MIN_FIBERS_FILTER}, CV<{MAX_CV_INTER})")

# Explorations
print("\nTop 20 by n_fibers:")
display(df_refined.nlargest(20, 'n_fibers_mean')[['pair_label', 'n_fibers_mean', 'tau_range_mean']])

print("\nTop 20 by tau_range:")
display(df_refined.nlargest(20, 'tau_range_mean')[['pair_label', 'tau_range_mean', 'n_rats', 'cv_inter']])

df_stable_diverse.head(15)

## Final Filtering ✅

**319 candidates** with τ>2.5ms, n≥100, CV<0.5

### Analysis by priority:

**By τ_range** (top 15):
- 4.2-5.0 ms, CV<0.35
- **Subthalamic nucleus** still dominates
- #3: **Endopiriform→Prelimbic** (1058f, CV=0.16) ← PFC
- #6: **Parietal→Subiculum** (3207f, CV=0.21) ← Cortex-limbic
- #10: **Accumbens→Motor** (701f, CV=0.16) ← Limbic-motor

**By n_fibers** (top 20):
- Thousands of fibers but **lower τ_range** (1.9-3.8ms)
- Dominated by: **Subiculum, Ventral striatal, PAG**
- Quality/quantity balance

### Problem: **0% inter-hemispheric**

For anatomical diversity, you need to:
1. Lower the threshold to τ>2.0ms or inspect hemi=='inter'
2. Run clustering on the 319 to identify morphologies

In [None]:
def plot_tau_distributions_multirat(rats_data, pair, name_map, bins=50):
    """τ histograms per rat (3×6 grid)"""
    fig, axes = plt.subplots(3, 6, figsize=(18, 9))
    axes = axes.ravel()
    
    for idx, (rat_id, data) in enumerate(sorted(rats_data.items())):
        if pair not in data:
            axes[idx].text(0.5, 0.5, 'N/A', ha='center', va='center')
            axes[idx].set_title(rat_id)
            axes[idx].axis('off')
            continue
        
        tau_ms = data[pair][:, COL_TAU] * 1e3
        axes[idx].hist(tau_ms, bins=bins, alpha=0.75, edgecolor='k', lw=0.5)
        axes[idx].axvline(np.median(tau_ms), color='r', ls='--', lw=1.5)
        axes[idx].set_title(f"{rat_id} (n={len(tau_ms)})", fontsize=9)
        axes[idx].set_xlabel('τ (ms)', fontsize=8)
    
    i, j = pair
    fig.suptitle(f"{roi_label(i, name_map)} → {roi_label(j, name_map)}", fontsize=13, y=0.995)
    plt.tight_layout()
    return fig

# Visualize the top 6
for idx in range(6):
    row = df_stable_diverse.iloc[idx]
    pair = (row['roi_i'], row['roi_j'])
    plot_tau_distributions_multirat(refined_rats, pair, name_map, bins=50)
    print(pair)
    plt.show()

In [None]:
# # ==============================================================
# # 🧩 MANUAL EXPLORATION OF DISTRIBUTIONS - VISUAL SELECTION
# # ==============================================================

# import matplotlib.pyplot as plt
# import pandas as pd

# # --------------------------------------------------------------
# # 1️⃣ Selection of candidate connections
# # --------------------------------------------------------------

# # Filter the stable, fiber-rich connections
# df_candidates = df_multi[
#     (df_multi['n_rats'] >= 4) &
#     (df_multi['cv_inter'] < 0.1) &
#     (df_multi['n_fibers_mean'] >= 50) &
#     (df_multi['tau_range_mean'] > 2.0)
# ].copy()

# # Sort by temporal range and robustness
# df_candidates = df_candidates.sort_values(
#     ['tau_range_mean', 'n_fibers_mean'],
#     ascending=[False, False]
# ).reset_index(drop=True)

# print(f"✅ {len(df_candidates)} connections meet the criteria")
# display(df_candidates.head(15)[[
#     'pair_label', 'tau_range_mean', 'cv_inter', 'n_fibers_mean',
#     'n_rats', 'hemi'
# ]])

# # --------------------------------------------------------------
# # 2️⃣ Visualization of the top N
# # --------------------------------------------------------------

# def plot_multi_histogram(pair, cleaned_rats, name_map, bins=50):
#     """Plots the τ (ms) distributions per rat in a grid."""
#     i, j = pair
#     fig, axes = plt.subplots(3, 6, figsize=(18, 9))
#     axes = axes.ravel()

#     for idx, (rat_id, data) in enumerate(sorted(cleaned_rats.items())):
#         ax = axes[idx]
#         if pair not in data:
#             ax.text(0.5, 0.5, 'N/A', ha='center', va='center')
#             ax.set_title(rat_id)
#             ax.axis('off')
#             continue

#         tau_ms = data[pair][:, COL_TAU] * 1e3
#         ax.hist(tau_ms, bins=bins, alpha=0.75, edgecolor='k', linewidth=0.5)
#         ax.axvline(np.median(tau_ms), color='r', ls='--', lw=1)
#         ax.set_title(f"{rat_id} (n={len(tau_ms)})", fontsize=8)
#         ax.set_xlabel('τ (ms)', fontsize=7)
#         ax.set_ylabel('freq', fontsize=7)

#     fig.suptitle(f"{roi_label(i, name_map)} → {roi_label(j, name_map)}", fontsize=13)
#     fig.tight_layout()
#     plt.show()

# # --------------------------------------------------------------
# # 3️⃣ Iterate visually over the top
# # --------------------------------------------------------------

# TOP_N = 10  # number of connections to explore

# for idx in range(TOP_N):
#     row = df_candidates.iloc[idx]
#     pair = (int(row.roi_i), int(row.roi_j))

#     print("="*80)
#     print(f"#{idx+1}  {row.pair_label}")
#     print(f"  τ_range_mean = {row.tau_range_mean:.2f} ms")
#     print(f"  n_fibers_mean = {row.n_fibers_mean:.0f}")
#     print(f"  n_rats = {row.n_rats}, CV_inter = {row.cv_inter:.2f}, hemi = {row.hemi}")
#     plot_multi_histogram(pair, cleaned_rats, name_map, bins=40)


In [None]:
df_stable_diverse['pair_label'].to_list()

In [None]:
cleaned_rats['R01'][(29, 70)]

In [None]:
roi_names[28], roi_names[69]    

In [None]:
pair = (29, 70)
plot_tau_distributions_multirat(cleaned_rats, pair, name_map, bins=40)
print(pair)
plt.show()

In [None]:
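pair = (28, 69)
# plot_avg_distribution is defined in the next cell; run that cell first
plot_avg_distribution(pair, refined_rats, name_map, 50, False)
print(pair)
print(f"{roi_label(pair[0], name_map)} → {roi_label(pair[1], name_map)}")
plt.show()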

In [None]:
# ==============================================================
# 🧩 AVERAGED MULTI-RAT DISTRIBUTIONS + OUTLIER REJECTION (sigma_thresh)
# ==============================================================

import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

def plot_avg_distribution(pair, cleaned_rats, name_map, bins=75, save=False, sigma_thresh=1.0):
    """
    Computes and plots the multi-rat average histogram, excluding outlier rats
    whose mean per-bin |z-score| is >= sigma_thresh (default 1.0; pass 3.0 for a 3σ cut).
    Returns (centers, mean_hist, kept_rats)
    """
    i, j = pair
    taus, valid_rats = [], []

    # --- Collect τ (ms) per rat ---
    for rat_id, data in cleaned_rats.items():
        if pair not in data:
            continue
        taus.append(data[pair][:, COL_TAU] * 1e3)
        valid_rats.append(rat_id)

    if not taus:
        print(f"⚠️ Not enough data for {roi_label(i, name_map)} → {roi_label(j, name_map)}")
        return None, None, []

    # --- Shared bin edges: per-rat histograms are only comparable bin by bin
    #     if every rat uses the same edges (bins must be an integer here) ---
    edges = np.linspace(min(t.min() for t in taus), max(t.max() for t in taus), bins + 1)
    all_hists = np.array([np.histogram(t, bins=edges, density=True)[0] for t in taus])

    # --- Compute the initial mean and standard deviation ---
    mean_init = all_hists.mean(axis=0)
    std_init = all_hists.std(axis=0)

    # --- Mean per-bin z-score distance for each rat ---
    z_scores = []
    for h in all_hists:
        z = np.abs(h - mean_init) / (std_init + 1e-8)
        z_mean = np.nanmean(z)
        z_scores.append(z_mean)
    z_scores = np.array(z_scores)

    # --- Keep the rats below the threshold (sigma_thresh) ---
    keep_mask = z_scores < sigma_thresh
    kept_hists = all_hists[keep_mask]
    kept_rats = np.array(valid_rats)[keep_mask]

    # --- Recompute the final mean and standard deviation ---
    mean_hist = kept_hists.mean(axis=0)
    std_hist = kept_hists.std(axis=0)
    centers = (edges[:-1] + edges[1:]) / 2

    # --- Plot ---
    plt.figure(figsize=(8, 5))
    plt.plot(centers, mean_hist, color='royalblue', lw=2, label='Mean (post-filtering)')
    plt.fill_between(centers, mean_hist - std_hist, mean_hist + std_hist,
                     color='lightblue', alpha=0.4, label='±1σ inter-rat')
    plt.xlabel('Delay τ (ms)')
    plt.ylabel('Normalized density')
    plt.title(f"{roi_label(i, name_map)} → {roi_label(j, name_map)}\n"
              f"Valid rats: {len(kept_rats)}/{len(valid_rats)} (outliers: {len(valid_rats)-len(kept_rats)})")
    plt.legend()
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()

    # --- Optionally export the averaged values ---
    if save:
        export_dir = Path('./data/exports/avg_distributions_filtered')
        export_dir.mkdir(parents=True, exist_ok=True)
        np.savez(
            export_dir / f"avgdist_{roi_label(i, name_map)}_to_{roi_label(j, name_map)}.npz",
            centers=centers,
            mean_hist=mean_hist,
            std_hist=std_hist,
            kept_rats=kept_rats
        )
        print(f"✅ Exported multi-rat average (filtered at {sigma_thresh}σ): "
              f"{roi_label(i, name_map)} → {roi_label(j, name_map)}")

    return centers, mean_hist, kept_rats


for idx in range(len(df_stable_diverse)):
    
    label = df_stable_diverse.iloc[idx]['pair_label']
    
    if "Hippo" in label:

        row = df_stable_diverse.iloc[idx]
        pair = (row['roi_i'], row['roi_j'])
        plot_avg_distribution(pair, refined_rats, name_map, 50, False)
        print(pair)
        print(label)
        plt.show()


In [None]:
# Systematic analysis of the top connections (up to 5)
results = []

for idx in range(min(5, len(df_stable_diverse))):
    row = df_stable_diverse.iloc[idx]
    pair = (row['roi_i'], row['roi_j'])
    
    centers, mean_hist, kept_rats = plot_avg_distribution(
        pair, refined_rats, name_map, bins=50, save=False, sigma_thresh=3.0
    )
    
    if centers is not None and len(centers) > 0:
        # Detect modality (find_peaks is already imported at the top of the notebook)
        peaks, _ = find_peaks(mean_hist, height=mean_hist.max()*0.3, distance=5)
        n_modes = len(peaks)
        
        # Statistics from the averaged histogram (weights normalized to sum to 1)
        w = mean_hist / mean_hist.sum()
        tau_mean = np.sum(w * centers)
        tau_peak = centers[np.argmax(mean_hist)]  # mode of the averaged histogram
        var = np.sum(w * (centers - tau_mean)**2)
        skewness = np.sum(w * (centers - tau_mean)**3) / var**1.5
        
        results.append({
            'pair_label': row['pair_label'],
            'n_modes': n_modes,
            'tau_peak_ms': tau_peak,
            'tau_weighted_mean_ms': tau_mean,
            'skewness': skewness,
            'n_rats_kept': len(kept_rats),
            'n_rats_total': row['n_rats'],
            'outliers': row['n_rats'] - len(kept_rats)
        })

df_shapes = pd.DataFrame(results)
print(f"\n🔍 DISTRIBUTION SHAPE ANALYSIS (top {len(df_shapes)})")
print("="*80)
display(df_shapes)

# Classification by modality
print("\n📊 CLASSIFICATION BY MODALITY:")
print(df_shapes.groupby('n_modes').size())

print("\n📈 BIMODAL OR MULTIMODAL (n_modes ≥ 2):")
display(df_shapes[df_shapes['n_modes'] >= 2][['pair_label', 'tau_peak_ms', 'skewness']])

print("\n⚖️ SYMMETRY (skewness near 0 = symmetric, >1 = long right tail):")
display(df_shapes[['pair_label', 'skewness', 'n_modes']].sort_values('skewness'))

In [None]:
for idx in range(min(10, len(df_stable_diverse))):
    
    
    row = df_stable_diverse.iloc[idx]
    pair = (row['roi_i'], row['roi_j'])
    plot_avg_distribution(pair, cleaned_rats, name_map, 50, False)
    print(pair)
    print(df_stable_diverse.iloc[idx]['pair_label'])
    # plt.show()

In [None]:
# ==============================================================
# 🧩 FINAL AVERAGED DISTRIBUTIONS (outlier filtering with the default sigma_thresh)
# ==============================================================

selected_pairs = [
    (np.int64(28), np.int64(35)),   # L-Hippocampus → L-Hypothalamic region
    (np.int64(108), np.int64(132)), # R-Subthalamic nucleus → R-Retrosplenial dysgranular area
    (np.int64(108), np.int64(137)), # R-Subthalamic nucleus → R-Retrosplenial granular area
    (np.int64(45), np.int64(61)),   # L-Perirhinal area 35 → L-Endopiriform nucleus
    (np.int64(35), np.int64(76))    # L-Hypothalamic region → L-Medial orbital area
]

export_dir = Path('./results/data_analysis/distros/final_avg_distributions_manual')
export_dir.mkdir(parents=True, exist_ok=True)

summary_records = []

for pair in selected_pairs:
    centers, mean_hist, kept_rats = plot_avg_distribution(
        pair, cleaned_rats, name_map, bins=50, save=True
    )

    if centers is None:
        continue

    label = f"{roi_label(pair[0], name_map)} → {roi_label(pair[1], name_map)}"
    summary_records.append({
        'pair': pair,
        'label': label,
        'n_rats_valid': len(kept_rats),
        'kept_rats': ','.join(kept_rats),
        'file': f"avgdist_{roi_label(pair[0], name_map)}_to_{roi_label(pair[1], name_map)}.npz"
    })

# Build the tabular summary
df_summary = pd.DataFrame(summary_records)
df_summary.to_csv(export_dir / 'summary_avg_distributions.csv', index=False)

print("\n✅ Export complete:")
display(df_summary[['label', 'n_rats_valid', 'file']])


In [None]:
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from scipy import stats

def characterize_distributions_enhanced(cleaned_rats, pairs_list):
    """Shape features per (rat, connection) plus divergences from a pooled reference.

    Relies on the notebook-level names COL_TAU, name_map and roi_label.
    """
    rows = []
    
    # Step 1: build the pooled reference distribution (delays in ms)
    all_tau = []
    for rat_id, data in cleaned_rats.items():
        for (i, j) in pairs_list:
            if (i, j) in data:
                all_tau.append(data[(i,j)][:, COL_TAU] * 1e3)
    ref_tau = np.concatenate(all_tau)
    ref_hist, ref_edges = np.histogram(ref_tau, bins=50, density=True)
    ref_cdf = np.cumsum(ref_hist) / ref_hist.sum()  # not used further below
    
    # Step 2: characterize each distribution
    for rat_id, data in cleaned_rats.items():
        for (i, j) in pairs_list:
            if (i, j) not in data:
                continue
            
            tau_ms = data[(i,j)][:, COL_TAU] * 1e3
            
            # Histogram and peak count (prominence >= 10% of the tallest bin)
            hist, edges = np.histogram(tau_ms, bins='auto', density=True)
            peaks, _ = find_peaks(hist, prominence=hist.max()*0.1)
            
            # Shape metrics
            g1 = stats.skew(tau_ms)
            g2 = stats.kurtosis(tau_ms, fisher=True)  # excess kurtosis
            # Sarle's bimodality coefficient, large-sample form
            bimodality_coef = (g1**2 + 1) / (g2 + 3)
            
            hist_prob = hist / hist.sum()
            shannon_entropy = entropy(hist_prob[hist_prob > 0])
            
            # Robust dispersion: MAD-based coefficient of variation
            med = np.median(tau_ms)
            mad = np.median(np.abs(tau_ms - med))
            cv_robust = 1.4826 * mad / med if med > 0 else np.nan
            
            # Divergences from the reference
            wassers_dist = wasserstein_distance(tau_ms, ref_tau)
            ks_stat, _ = stats.ks_2samp(tau_ms, ref_tau)
            
            # KL divergence (discretized on the reference bin edges)
            hist_sample, _ = np.histogram(tau_ms, bins=ref_edges, density=True)
            hist_sample = hist_sample / hist_sample.sum()
            # Add a small epsilon to avoid log(0); entropy() renormalizes both inputs
            eps = 1e-10
            kl_div = entropy(hist_sample + eps, ref_hist + eps)
            
            rows.append({
                'rat_id': rat_id, 'roi_i': i, 'roi_j': j,
                'pair_label': f"{roi_label(i, name_map)} → {roi_label(j, name_map)}",
                'n': len(tau_ms),
                'mean': tau_ms.mean(),
                'median': med,
                'cv_robust': cv_robust,
                'skew': g1,
                'kurt': g2,
                'bimodality_coef': bimodality_coef,
                'n_peaks': len(peaks),
                'entropy': shannon_entropy,
                'range_norm': np.ptp(tau_ms) / tau_ms.mean(),
                'iqr_norm': stats.iqr(tau_ms) / tau_ms.mean(),
                'wasserstein': wassers_dist,
                'ks_stat': ks_stat,
                'kl_div': kl_div
            })
    
    return pd.DataFrame(rows)
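
To make the "divergence from a pooled reference" idea concrete, here is a minimal sketch on synthetic data. The two samples, the helper `kl_to_reference` and the bin count are assumptions chosen for illustration; only `np.histogram`, `scipy.stats.entropy` and `scipy.stats.wasserstein_distance` are used, in the same way as in the function above.

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance

rng = np.random.default_rng(1)

# Synthetic stand-ins for two connections' delays in ms (not real data)
sample_a = rng.normal(1.0, 0.2, size=5_000)
sample_b = rng.normal(3.0, 0.5, size=5_000)
reference = np.concatenate([sample_a, sample_b])  # pooled reference

# Bin edges come from the reference so every histogram shares the same support
ref_counts, edges = np.histogram(reference, bins=50)
eps = 1e-10  # avoids log(0) in empty bins

def kl_to_reference(sample):
    """Discretized KL(sample || reference) on the shared bin edges (hypothetical helper)."""
    counts, _ = np.histogram(sample, bins=edges)
    # scipy.stats.entropy normalizes both arguments to sum to 1
    return entropy(counts + eps, ref_counts + eps)

kl_a = kl_to_reference(sample_a)
kl_ref = kl_to_reference(reference)  # the reference against itself
w_a = wasserstein_distance(sample_a, reference)

print(f"KL(a || ref) = {kl_a:.3f}, KL(ref || ref) = {kl_ref:.3g}, W1(a, ref) = {w_a:.3f}")
```

Because `sample_a` is one half of the pool, its KL divergence is clearly positive, while the reference compared with itself gives zero. The Wasserstein distance is expressed in the same units as the data (ms here), which makes it easier to interpret than KL.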

## Multi-Rat Visualizations ✅

**Morphological analysis**:

**#1 Subthalamic→Amygdala** (τ_range = 4.96 ms):
- Heterogeneous: R02/R08 are narrow and unimodal, whereas R07/R15 are dispersed
- Stable medians (~1.2 ms) but long tails (~8 ms)

**#2 Subthalamic→Retrosplenial** (τ_range = 4.94 ms):
- **Extremely consistent**: single peak at ~0.7 ms
- 5000+ fibers, nearly identical distribution across rats
- Stable "delta-like" candidate

**#3 Endopiriform→Prelimbic** (τ_range = 4.87 ms):
- **Bimodal in several rats** (R01, R03, R07, R13)
- Peaks at ~1 ms and ~2-3 ms
- Substantial morphological variability

**Visual conclusion**: There are at least **2-3 distinct morphologies** (narrow unimodal, dispersed, bimodal).
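
As a rough illustration of how the unimodal and bimodal morphologies can be told apart numerically, the sketch below counts histogram peaks with `find_peaks`, using the same 10% prominence rule as the feature extraction. The samples, the helper `count_modes` and its default bin count are assumptions made for this example only.

```python
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(4)

# Synthetic delays in ms (illustrative only, loosely mimicking the cases above)
unimodal = rng.normal(0.7, 0.1, size=50_000)
bimodal = np.concatenate([
    rng.normal(1.0, 0.2, size=25_000),
    rng.normal(2.5, 0.3, size=25_000),
])

def count_modes(tau_ms, bins=30, rel_prominence=0.1):
    """Count histogram peaks whose prominence exceeds a fraction of the tallest bin."""
    hist, _ = np.histogram(tau_ms, bins=bins, density=True)
    peaks, _ = find_peaks(hist, prominence=rel_prominence * hist.max())
    return len(peaks)

n_modes_unimodal = count_modes(unimodal)
n_modes_bimodal = count_modes(bimodal)
print(n_modes_unimodal, n_modes_bimodal)
```

With small samples the count becomes sensitive to the bin width, so in practice it is worth checking a few bin settings before trusting `n_peaks`.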

In [None]:
# ==============================================================
# 🧩 OPTIMIZED MORPHOLOGICAL CLUSTERING (multi-metric + grid search)
# ==============================================================

from sklearn.cluster import KMeans
from sklearn.preprocessing import RobustScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score
)
from sklearn.model_selection import ParameterGrid
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# --------------------------------------------------------------
# 1️⃣ Feature selection and preprocessing
# --------------------------------------------------------------
feature_cols = [
    'cv_robust', 'skew', 'kurt', 'bimodality_coef',
    'n_peaks', 'entropy', 'wasserstein', 
]

# Characterize
top_pairs = [(r['roi_i'], r['roi_j']) for _, r in df_stable_diverse.head(50).iterrows()]
df_feat = characterize_distributions_enhanced(refined_rats, top_pairs)

# Aggregate per connection (mean across rats)
df_feat_conn = (
    df_feat.groupby(['roi_i', 'roi_j', 'pair_label'])[feature_cols].mean()
    .reset_index()
    .dropna()
)

X = df_feat_conn[feature_cols].values
X_scaled = RobustScaler().fit_transform(X)

print(f"📊 {len(df_feat_conn)} connections analyzed with {len(feature_cols)} features.")

# --------------------------------------------------------------
# 2️⃣ Grid search: KMeans + PCA
# --------------------------------------------------------------
param_grid = {
    'pca_variance': [0.7, 0.8, 0.9, 0.95],
    'n_clusters': [3, 4, 5, 6],
    'n_init': [50, 100]
}

results = []
configs = list(ParameterGrid(param_grid))
print(f"Evaluating {len(configs)} configurations...\n")

for i, params in enumerate(configs):
    # PCA keeping enough components to reach the requested cumulative variance
    pca = PCA(n_components=params['pca_variance'])
    X_pca = pca.fit_transform(X_scaled)

    # K-means
    kmeans = KMeans(
        n_clusters=params['n_clusters'],
        n_init=params['n_init'],
        random_state=42
    )
    labels = kmeans.fit_predict(X_pca)

    # Quality metrics
    sil = silhouette_score(X_pca, labels)
    db = davies_bouldin_score(X_pca, labels)
    ch = calinski_harabasz_score(X_pca, labels)

    results.append({
        **params,
        'n_pcs': pca.n_components_,
        'silhouette': sil,
        'davies_bouldin': db,
        'calinski_harabasz': ch,
        'inertia': kmeans.inertia_
    })

    if (i + 1) % 10 == 0 or (i + 1) == len(configs):
        print(f"  {i + 1}/{len(configs)} completed")

df_grid = pd.DataFrame(results)

# --------------------------------------------------------------
# 3️⃣ Normalized multi-metric weighting
# --------------------------------------------------------------
scaler = MinMaxScaler()
scaled_metrics = scaler.fit_transform(df_grid[['silhouette', 'davies_bouldin', 'calinski_harabasz']]) # TODO CUMSUM DIFF
sil, db, ch = scaled_metrics.T
# Silhouette and Calinski-Harabasz: higher is better; Davies-Bouldin: lower is better
df_grid['score'] = sil - db + ch

df_grid = df_grid.sort_values('score', ascending=False)

print(f"\n✓ Grid search completed: {len(df_grid)} configurations")
print("\nTop 5 configurations:")
display(df_grid.head(5)[['n_clusters', 'pca_variance', 'n_pcs', 'silhouette', 'davies_bouldin', 'calinski_harabasz']])
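
The composite score used above can be traced by hand on a tiny table. The sketch below uses made-up metric values for three hypothetical configurations (`A`, `B`, `C`). Each metric is min-max scaled to [0, 1] across configurations and then combined with a plus sign where higher is better and a minus sign where lower is better.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical grid-search results (made-up numbers for illustration)
df_demo = pd.DataFrame({
    'config': ['A', 'B', 'C'],
    'silhouette': [0.30, 0.45, 0.40],          # higher is better
    'davies_bouldin': [1.20, 0.80, 1.00],      # lower is better
    'calinski_harabasz': [50.0, 90.0, 120.0],  # higher is better
})

metric_cols = ['silhouette', 'davies_bouldin', 'calinski_harabasz']
scaled = MinMaxScaler().fit_transform(df_demo[metric_cols])
sil_s, db_s, ch_s = scaled.T  # one scaled column per metric

# Same combination rule as the grid search: sil - db + ch
df_demo['score'] = sil_s - db_s + ch_s
best_config = df_demo.sort_values('score', ascending=False).iloc[0]['config']
print(df_demo[['config', 'score']])
print("best:", best_config)
```

Because the scaling is relative to the configurations in the table, the score only ranks configurations within one grid search. Scores from different grids (for example, different thresholds) are not directly comparable.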


In [None]:
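# --------------------------------------------------------------
# 4️⃣ Final clustering
# --------------------------------------------------------------
# NOTE: iloc[3] selects the fourth-ranked configuration by score;
# use iloc[0] to take the top-ranked one instead
best = df_grid.iloc[3]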
pca_final = PCA(n_components=best['pca_variance'])
X_pca_final = pca_final.fit_transform(X_scaled)

kmeans_final = KMeans(
    n_clusters=int(best['n_clusters']),
    n_init=int(best['n_init']),
    random_state=42
)
df_feat_conn['cluster'] = kmeans_final.fit_predict(X_pca_final)

print(f"\n🎯 Final clustering: k={int(best['n_clusters'])}, "
      f"PCs={pca_final.n_components_}, "
      f"var_PCA={best['pca_variance']:.2f}")

# --------------------------------------------------------------
# 5️⃣ Summary visualizations
# --------------------------------------------------------------

fig1 = plt.figure(figsize=(16, 14))
gs = fig1.add_gridspec(4, 4, hspace=0.35, wspace=0.35)

# (1) Silhouette vs k
ax1 = fig1.add_subplot(gs[0, :2])
for pca_var in sorted(df_grid['pca_variance'].unique()):
    subset = df_grid[(df_grid['pca_variance'] == pca_var) & (df_grid['n_init'] == 100)]
    ax1.plot(subset['n_clusters'], subset['silhouette'], 'o-', label=f'{pca_var:.2f}', lw=2)
ax1.set_xlabel('k (#clusters)')
ax1.set_ylabel('Silhouette')
ax1.legend(title='PCA var.')
ax1.grid(alpha=0.3)

# (2) Davies-Bouldin vs k
ax2 = fig1.add_subplot(gs[0, 2:])
for pca_var in sorted(df_grid['pca_variance'].unique()):
    subset = df_grid[(df_grid['pca_variance'] == pca_var) & (df_grid['n_init'] == 100)]
    ax2.plot(subset['n_clusters'], subset['davies_bouldin'], 'o-', label=f'{pca_var:.2f}', lw=2)
ax2.set_xlabel('k (#clusters)')
ax2.set_ylabel('Davies-Bouldin ↓')
ax2.legend(title='PCA var.')
ax2.grid(alpha=0.3)

# (3) Mean Calinski-Harabasz per PCA variance
ax3 = fig1.add_subplot(gs[1, :2])
grouped = df_grid.groupby('pca_variance')['calinski_harabasz'].mean()
ax3.bar(grouped.index, grouped.values, alpha=0.7, edgecolor='k')
ax3.set_xlabel('PCA variance')
ax3.set_ylabel('Calinski-Harabasz')
ax3.grid(alpha=0.3, axis='y')

# (4) PCA final scatter
ax4 = fig1.add_subplot(gs[1, 2:])
sc = ax4.scatter(
    X_pca_final[:, 0], X_pca_final[:, 1],
    c=df_feat_conn['cluster'], cmap='tab10',
    s=60, alpha=0.8, edgecolors='k', lw=0.4
)
ax4.set_xlabel(f'PC1 ({pca_final.explained_variance_ratio_[0]:.1%})')
ax4.set_ylabel(f'PC2 ({pca_final.explained_variance_ratio_[1]:.1%})')
ax4.set_title(f"PCA (k={int(best['n_clusters'])}, var={best['pca_variance']:.2f})")
plt.colorbar(sc, ax=ax4, label='Cluster')
plt.suptitle('K-means + PCA (Optimized)', fontsize=14)
plt.show()

# --------------------------------------------------------------
# 6️⃣ Correlation and profile heatmaps
# --------------------------------------------------------------
fig2, ax = plt.subplots(figsize=(10, 8))
corr_cols = ['n_clusters', 'pca_variance', 'n_init', 'silhouette', 'davies_bouldin', 'calinski_harabasz', 'score']
sns.heatmap(df_grid[corr_cols].corr(), annot=True, fmt='.2f', cmap='RdBu_r', center=0, ax=ax, square=True)
ax.set_title('Correlations between parameters and metrics')
plt.tight_layout()
plt.show()

# Morphological profile per cluster
fig3, ax = plt.subplots(figsize=(10, 6))
cluster_profiles = df_feat_conn.groupby('cluster')[feature_cols].median()
sns.heatmap(cluster_profiles.T, annot=True, fmt='.2f', cmap='coolwarm', center=0, ax=ax)
ax.set_title('Morphological profile per cluster')
plt.tight_layout()
plt.show()

# --------------------------------------------------------------
# 7️⃣ Final statistics
# --------------------------------------------------------------
print("\n" + "="*70)
print("FINAL CLUSTERING: SUMMARY")
print("="*70)
print("Selected parameters:")
print(f"  k = {int(best['n_clusters'])}")
print(f"  PCA var = {best['pca_variance']:.2f} ({pca_final.n_components_} components)")
print("\nMetrics:")
print(f"  Silhouette         : {best['silhouette']:.3f}")
print(f"  Davies-Bouldin     : {best['davies_bouldin']:.3f}")
print(f"  Calinski-Harabasz  : {best['calinski_harabasz']:.1f}")
print("\nCluster sizes:")
print(df_feat_conn['cluster'].value_counts().sort_index())
print("="*70)

# --------------------------------------------------------------
# 8️⃣ Saving results
# --------------------------------------------------------------
threshold = 0.0  # set to match the dataset's threshold (e.g. 0.0 / 0.4)
df_grid.to_csv(f'gridsearch_kmeans_th_{threshold}.csv', index=False)
df_feat_conn.to_csv(f'feature_clusters_th_{threshold}.csv', index=False)
print(f"\n✅ Results saved for threshold={threshold}")
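
The grid search and the final fit both pass a fraction such as 0.7 or 0.9 as `n_components`. With a float in (0, 1), scikit-learn's `PCA` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. A minimal sketch on a synthetic 7-feature matrix (the data and names are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Synthetic 7-feature matrix driven by 2 latent directions plus small noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 7))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 7))

pca_demo = PCA(n_components=0.9).fit(X_demo)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

print("components kept:", pca_demo.n_components_)
print("cumulative variance:", np.round(cum_var, 3))
```

If the requested fraction is already reached by a single component, the transformed matrix has only one column. Code that indexes a second component, such as the PC1/PC2 scatter above, would then fail, so it is worth checking `n_components_` before plotting.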

In [None]:
df_grid

In [None]:
df_feat_conn

In [None]:
# Examples per cluster
np.random.seed(42)

for c in sorted(df_feat_conn['cluster'].unique()):
    # Take the first 2 connections of the cluster
    cluster_conns = df_feat_conn[df_feat_conn['cluster'] == c].head(2)
    
    if len(cluster_conns) == 0:
        continue
    
    n_plots = len(cluster_conns)
    fig, axes = plt.subplots(1, n_plots, figsize=(6*n_plots, 4))
    if n_plots == 1:
        axes = [axes]
    
    cluster_profile = df_feat_conn[df_feat_conn['cluster'] == c][feature_cols].median()
    
    for idx, (_, row) in enumerate(cluster_conns.iterrows()):
        i, j = int(row['roi_i']), int(row['roi_j'])
        
        # Take the first rat that has this connection
        rat_id = None
        for rid, data in cleaned_rats.items():
            if (i, j) in data:
                rat_id = rid
                break
        
        if rat_id is None:
            continue
            
        tau_ms = cleaned_rats[rat_id][(i,j)][:, COL_TAU] * 1e3
        
        axes[idx].hist(tau_ms, bins=40, alpha=0.75, edgecolor='k', linewidth=0.5)
        axes[idx].axvline(np.median(tau_ms), color='r', ls='--', lw=2)
        axes[idx].set_title(f"{row['pair_label'][:50]}\n({rat_id}, n={len(tau_ms)})", fontsize=9)
        axes[idx].set_xlabel('τ (ms)')
        axes[idx].set_ylabel('Count')
    
    fig.suptitle(
        f'Cluster {c} | CV={cluster_profile["cv_robust"]:.2f}, '
        f'skew={cluster_profile["skew"]:.2f}, kurt={cluster_profile["kurt"]:.2f}',
        fontsize=12
    )
    plt.tight_layout()
    plt.show()

In [None]:

# Characterize
top_pairs = [(r['roi_i'], r['roi_j']) for _, r in df_stable_diverse.head(50).iterrows()]
df_feat = characterize_distributions_enhanced(cleaned_rats, top_pairs)

# Clustering features (robust to outliers)
feature_cols = ['cv_robust', 'skew', 'kurt', 'bimodality_coef', 
                'n_peaks', 'entropy', 'wasserstein']

corr_cols = ['n_neighbors', 'min_dist', 'min_cluster_size', 'min_samples',
             'silhouette', 'davies_bouldin', 'n_clusters', 'noise_pct']

# Drop rows with missing features from df_feat itself, so the cluster labels
# assigned later line up with its rows
df_feat = df_feat.dropna(subset=feature_cols).reset_index(drop=True)
X = df_feat[feature_cols].values
X_scaled = RobustScaler().fit_transform(X)

# ===== GRID SEARCH =====
param_grid = {
    'n_neighbors': [10, 20],
    'min_dist': [0.0, 0.2],
    'min_cluster_size': [10, 100],  # larger → fewer clusters
    'min_samples': [5, 10]
}

results = []
print(f"Evaluating {len(ParameterGrid(param_grid))} configurations...")

for i, params in enumerate(ParameterGrid(param_grid)):
    # UMAP
    # Fixed seed so the embedding scored here matches the final fit below
    reducer = UMAP(n_neighbors=params['n_neighbors'], 
                   min_dist=params['min_dist'],
                   n_components=2, random_state=42)
    X_umap = reducer.fit_transform(X_scaled)
    
    # HDBSCAN
    clusterer = HDBSCAN(min_cluster_size=params['min_cluster_size'],
                        min_samples=params['min_samples'], 
                        core_dist_n_jobs=-1)
    labels = clusterer.fit_predict(X_umap)
    
    # Metrics (non-noise points only)
    mask = labels != -1
    n_noise = (~mask).sum()
    n_clustered = mask.sum()
    n_clusters = len(set(labels[mask])) if n_clustered > 0 else 0
    
    if n_clustered > 20 and n_clusters > 1:
        sil = silhouette_score(X_umap[mask], labels[mask])
        db = davies_bouldin_score(X_umap[mask], labels[mask])
        ch = calinski_harabasz_score(X_umap[mask], labels[mask])
        noise_pct = 100 * n_noise / len(labels)
        
        results.append({
            **params, 
            'silhouette': sil, 
            'davies_bouldin': db,
            'calinski_harabasz': ch,
            'n_clusters': n_clusters, 
            'noise_pct': noise_pct,
            'n_clustered': n_clustered
        })
    
    if (i+1) % 20 == 0 or (i+1) == len(ParameterGrid(param_grid)):
        print(f"  {i+1}/{len(ParameterGrid(param_grid))} completed")

df_grid = pd.DataFrame(results)

# Best configuration (multi-objective)
# Change the score function here
df_grid['score'] = (
    df_grid['silhouette'] / df_grid['silhouette'].max() - 
    df_grid['davies_bouldin'] / df_grid['davies_bouldin'].max() +
    df_grid['calinski_harabasz'] / df_grid['calinski_harabasz'].max() -
    df_grid['noise_pct'] / 100 -
    abs(df_grid['n_clusters'] - 4) / 10  # penalizes moving away from k=4
)
df_grid = df_grid.sort_values('score', ascending=False)

print(f"\n✓ Grid search complete: {len(df_grid)} valid configurations")
print("\nTop 5 configurations:")
print(df_grid.head(5)[['n_neighbors', 'min_dist', 'min_cluster_size', 'min_samples', 
                        'n_clusters', 'silhouette', 'noise_pct']].to_string(index=False))

# ===== CLUSTERING FINAL =====
best = df_grid.iloc[0]
reducer_final = UMAP(n_neighbors=int(best['n_neighbors']), 
                     min_dist=best['min_dist'],
                     n_components=2, random_state=42)
X_umap_final = reducer_final.fit_transform(X_scaled)

clusterer_final = HDBSCAN(min_cluster_size=int(best['min_cluster_size']),
                          min_samples=int(best['min_samples']))
df_feat['cluster'] = clusterer_final.fit_predict(X_umap_final)

mask_final = df_feat['cluster'] != -1
n_final_clusters = df_feat.loc[mask_final, 'cluster'].nunique()  # excludes the noise label (-1)
print(f"\nFinal clustering: {n_final_clusters} clusters + {(~mask_final).sum()} noise")

# ===== MAIN PLOTS (4x4 grid) =====
fig1 = plt.figure(figsize=(16, 14))
gs = fig1.add_gridspec(4, 4, hspace=0.35, wspace=0.35)

# Row 1: metrics vs. parameters
ax1 = fig1.add_subplot(gs[0, :2])
for metric in ['silhouette', 'davies_bouldin']:
    grouped = df_grid.groupby('n_neighbors')[metric].mean()
    ax1.plot(grouped.index, grouped.values, 'o-', label=metric, lw=2)
ax1.set_xlabel('n_neighbors'); ax1.legend(); ax1.grid(alpha=0.3)

ax2 = fig1.add_subplot(gs[0, 2:])
grouped = df_grid.groupby('min_cluster_size')['noise_pct'].mean()
ax2.bar(grouped.index, grouped.values, alpha=0.7)
ax2.set_xlabel('min_cluster_size'); ax2.set_ylabel('% Noise')

# Row 2: scatter of n_clusters
ax3 = fig1.add_subplot(gs[1, :])
for md in sorted(df_grid['min_dist'].unique()):
    subset = df_grid[df_grid['min_dist'] == md]
    ax3.scatter(subset['n_neighbors'], subset['n_clusters'], 
               label=f'min_dist={md}', s=40)
ax3.legend(); ax3.grid(alpha=0.3)

# Rows 3-4: large UMAP panel
ax4 = fig1.add_subplot(gs[2:, :])
scatter = ax4.scatter(X_umap_final[:, 0], X_umap_final[:, 1], 
                     c=df_feat['cluster'], cmap='tab10', s=60, alpha=0.7)
ax4.set_xlabel('UMAP 1'); ax4.set_ylabel('UMAP 2')
plt.colorbar(scatter, ax=ax4)

plt.suptitle(f'UMAP+HDBSCAN Optimization (th=0.0)', fontsize=14)
plt.savefig(f'clustering_main_th_0.0.png', dpi=150, bbox_inches='tight')
plt.show()

# ===== SEPARATE HEATMAPS =====
# Correlations
fig2, ax = plt.subplots(1, 1, figsize=(10, 8))
corr_matrix = df_grid[corr_cols].corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='RdBu_r', 
            center=0, ax=ax, square=True)
ax.set_title('Correlations')
plt.tight_layout()
plt.savefig(f'correlations_th_0.0.png', dpi=150)
plt.show()

# Cluster profiles
fig3, ax = plt.subplots(1, 1, figsize=(8, 6))
cluster_profiles = df_feat[mask_final].groupby('cluster')[feature_cols].median()
sns.heatmap(cluster_profiles.T, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, ax=ax)
ax.set_title('Morphological profile')
plt.tight_layout()
plt.savefig(f'profiles_th_0.0.png', dpi=150)
plt.show()

# ===== FINAL STATISTICS =====
print("\n" + "="*70)
print("FINAL CLUSTERING STATISTICS")
print("="*70)
print(f"Optimal parameters:")
print(f"  n_neighbors={int(best['n_neighbors'])}, min_dist={best['min_dist']}")
print(f"  min_cluster_size={int(best['min_cluster_size'])}, min_samples={int(best['min_samples'])}")
print(f"\nMetrics:")
print(f"  Silhouette: {best['silhouette']:.3f}")
print(f"  Davies-Bouldin: {best['davies_bouldin']:.3f}")
print(f"  Calinski-Harabasz: {best['calinski_harabasz']:.1f}")
print(f"  Noise: {best['noise_pct']:.1f}%")
print(f"\nPoints per cluster:")
print(df_feat['cluster'].value_counts().sort_index())
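
HDBSCAN labels unassigned points as `-1`, which is why the grid search above scores only the non-noise points and reports the noise share separately. The sketch below reproduces that masking logic on synthetic 2-D points with hand-written labels standing in for HDBSCAN output, so it runs without `umap` or `hdbscan` installed.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)

# Two well-separated synthetic blobs plus a few scattered points
blob_a = rng.normal([0, 0], 0.2, size=(40, 2))
blob_b = rng.normal([5, 5], 0.2, size=(40, 2))
outliers = rng.uniform(-3, 8, size=(5, 2))
X_demo = np.vstack([blob_a, blob_b, outliers])

# Hand-written labels mimicking HDBSCAN's convention: -1 marks noise
labels_demo = np.array([0] * 40 + [1] * 40 + [-1] * 5)

mask = labels_demo != -1                      # keep clustered points only
n_clusters = len(set(labels_demo[mask]))      # noise label is not a cluster
noise_pct = 100 * (~mask).sum() / len(labels_demo)
sil_clean = silhouette_score(X_demo[mask], labels_demo[mask])

print(f"clusters={n_clusters}, noise={noise_pct:.1f}%, silhouette={sil_clean:.3f}")
```

Scoring only the clustered points means a configuration can look excellent simply by discarding most points as noise. The `noise_pct` penalty in the score is what keeps that in check.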

In [None]:
# 2 examples per cluster
np.random.seed(42)
for c in sorted(df_feat['cluster'].unique()):
    # Take 2 distinct pairs from the cluster
    cluster_pairs = df_feat[df_feat['cluster'] == c][['roi_i', 'roi_j', 'rat_id']].drop_duplicates(['roi_i', 'roi_j']).head(2)
    
    if len(cluster_pairs) == 0:
        continue
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    cluster_profile = df_feat[df_feat['cluster'] == c][['cv_robust', 'skew', 'n_peaks']].median()
    
    for idx, (_, row) in enumerate(cluster_pairs.iterrows()):
        i, j, rat = int(row['roi_i']), int(row['roi_j']), row['rat_id']
        tau_ms = cleaned_rats[rat][(i,j)][:, COL_TAU] * 1e3
        
        axes[idx].hist(tau_ms, bins=40, alpha=0.75, edgecolor='k', linewidth=0.5)
        axes[idx].axvline(np.median(tau_ms), color='r', ls='--', lw=2)
        axes[idx].set_title(f"{roi_label(i, name_map)[:25]}\n→ {roi_label(j, name_map)[:25]}", fontsize=9)
        axes[idx].set_xlabel('τ (ms)')
    
    fig.suptitle(f'Cluster {c} | CV={cluster_profile["cv_robust"]:.2f}, skew={cluster_profile["skew"]:.2f}, peaks={cluster_profile["n_peaks"]:.0f}', fontsize=12)
    plt.tight_layout()
    plt.show()