# **Original model: SalPred-Trans, a quantile tabular Transformer with Conformal Prediction**

**SalPred-Trans** is a **Transformer-based tabular regression model** for predicting salaries (USD, 2025). It turns categorical and numeric variables into *tokens* and learns **complex interactions** with little manual feature engineering. Its hyperparameters are tuned with **GridSearchCV**, optimizing **RMSE**, and metrics are computed on both the original and the logarithmic scale. It also produces **calibrated prediction intervals** with *Conformal Prediction* to quantify uncertainty.
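The quantile head itself lives in `salpred_components` and is not shown in this notebook. As a rough illustration of what a quantile regressor typically minimizes, here is a minimal sketch of the pinball (quantile) loss. The function name `pinball_loss`, the toy data, and the quantile levels are assumptions for illustration, not the module's actual implementation.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile level q in (0, 1)."""
    diff = np.asarray(y_true, float) - np.asarray(y_pred, float)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

# Toy data (hypothetical): a constant prediction of 150 against three targets.
y = np.array([100.0, 200.0, 300.0])
pred = np.array([150.0, 150.0, 150.0])

loss_low = pinball_loss(y, pred, q=0.05)   # a low quantile punishes over-prediction more
loss_high = pinball_loss(y, pred, q=0.95)  # a high quantile punishes under-prediction more
print(loss_low, loss_high)
```

Because most targets sit above the constant prediction, the loss is larger for the upper quantile than for the lower one, which is what pushes an upper-quantile output upward during training.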

## 1. Imports and initial setup
Here we load all the required libraries:

In [2]:

# 0) Setup
import os, sys, time, warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

sys.path.insert(0, os.getcwd())
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

try:
    import torch
    torch.set_num_threads(1)
    torch.manual_seed(RANDOM_STATE)
except Exception:
    pass

print("Setup OK.")


Setup OK.


In [3]:
import importlib, salpred_components
importlib.reload(salpred_components)
from salpred_components import SalPredPreprocessor, SalPredTransRegressor

In [4]:
from joblib import Memory
Memory("./_salpred_cache").clear(warn=False)

In [5]:
# Components defined in the local module salpred_components.py:
from salpred_components import SalPredPreprocessor, SalPredTransRegressor

from sklearn.model_selection import train_test_split, KFold, GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline
from joblib import Memory
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score



## 2. Carga de datos y selección de columnas

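1. We read the salaries.csv file and keep only the rows with work_year == 2025.
2. We define the categorical variables (e.g., job_title, company_size), the numeric variables (e.g., work_year, remote_ratio), and the target (salary_in_usd).

Here we clean and organize the dataset, and build the TRN/CAL/VAL/TEST splits.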

In [7]:
# 2) Data + job-title normalization + splits
DATA_PATH = "salaries.csv"
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError("salaries.csv was not found next to the notebook.")

df = pd.read_csv(DATA_PATH)
df = df[df["work_year"] == 2025].copy()

def normalize_title(s: str) -> str:
    s = str(s).lower().replace('-', ' ').replace('/', ' ')
    if 'data scientist' in s: return 'data scientist'
    if 'data engineer' in s: return 'data engineer'
    if 'data analyst'  in s: return 'data analyst'
    if 'machine learning' in s or 'ml engineer' in s: return 'ml engineer'
    if 'ai engineer' in s or 'artificial intelligence' in s: return 'ai engineer'
    if 'research' in s and ('ml' in s or 'ai' in s): return 'ml researcher'
    return s

if 'job_title' in df.columns:
    df['job_title'] = df['job_title'].map(normalize_title)

target_col = "salary_in_usd"
cat_cols = ["experience_level","employment_type","job_title",
            "employee_residence","company_location","company_size"]
num_cols = ["work_year","remote_ratio"]

X = df[cat_cols + num_cols].copy()
y = df[target_col].astype(float).values

strata = df["experience_level"] if "experience_level" in df.columns else None
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=RANDOM_STATE, stratify=strata)
strata_temp = X_temp["experience_level"] if "experience_level" in X_temp.columns else None
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=RANDOM_STATE, stratify=strata_temp)
strata_train = X_train["experience_level"] if "experience_level" in X_train.columns else None
X_trn, X_cal, y_trn, y_cal = train_test_split(X_train, y_train, test_size=0.20, random_state=RANDOM_STATE, stratify=strata_train)
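The three nested `train_test_split` calls above each take a fraction of what the previous call left over, so the final share of each split is not obvious at a glance. The short sketch below works out the nominal fractions; actual row counts can differ by a row or two because of rounding and stratification.

```python
# Nominal fractions implied by the three nested splits (test_size = 0.20, 0.25, 0.20).
test_frac = 0.20
temp_frac = 1.0 - test_frac          # what remains after removing TEST
val_frac = temp_frac * 0.25          # VAL is 25% of the remainder
train_frac = temp_frac - val_frac    # TRN + CAL together
cal_frac = train_frac * 0.20         # CAL is 20% of that
trn_frac = train_frac - cal_frac     # data actually seen by GridSearchCV

fractions = {"TRN": trn_frac, "CAL": cal_frac, "VAL": val_frac, "TEST": test_frac}
for name, frac in fractions.items():
    print(f"{name}: {frac:.0%}")
```

In other words, roughly 48% of the 2025 rows are used for fitting, 12% for conformal calibration, 20% for validation and 20% for the final test.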



In [None]:
# === 2.5) Two-stage cleaning: WIN(S1) -> TRIM(S2) ===
import numpy as np, pandas as pd

GROUPS = ["company_location","experience_level"]
CLEAN_CAL = True   # also applied to CAL; VAL/TEST are left untouched

def _keys(Xs): 
    return list(zip(Xs["company_location"], Xs["experience_level"]))

def _compute_bounds(X_trn_cur, y_trn_cur, qlo, qhi):
    tmp = pd.DataFrame({"key": _keys(X_trn_cur), "y_log": np.log1p(y_trn_cur)})
    g = tmp.groupby("key")["y_log"]
    qlo_g = g.quantile(qlo); qhi_g = g.quantile(qhi)
    glo_lo = float(tmp["y_log"].quantile(qlo)); glo_hi = float(tmp["y_log"].quantile(qhi))
    def _bounds(Xs):
        ks = _keys(Xs)
        lo = np.array([np.expm1(qlo_g.get(k, glo_lo)) for k in ks], float)
        hi = np.array([np.expm1(qhi_g.get(k, glo_hi)) for k in ks], float)
        return lo, hi
    return _bounds

def _winsorize(Xs, ys, bounds_fn):
    lo, hi = bounds_fn(Xs)
    return Xs, np.minimum(np.maximum(ys, lo), hi)

def _trim(Xs, ys, bounds_fn):
    lo, hi = bounds_fn(Xs)
    keep = (ys >= lo) & (ys <= hi)
    removed = int((~keep).sum())
    return Xs[keep], ys[keep], removed

# --- Stage 1: WIN(S1) with a mild threshold (e.g. 0.5%–99.5%) ---
bounds_s1 = _compute_bounds(X_trn, y_trn, qlo=0.005, qhi=0.995)
X_trn, y_trn = _winsorize(X_trn, y_trn, bounds_s1)
if CLEAN_CAL: X_cal, y_cal = _winsorize(X_cal, y_cal, bounds_s1)

# --- Stage 2: TRIM(S2), more aggressive (e.g. 1%–99%) ---
bounds_s2 = _compute_bounds(X_trn, y_trn, qlo=0.01, qhi=0.99)  # recomputed on the already-winsorized TRN
X_trn, y_trn, rem_trn = _trim(X_trn, y_trn, bounds_s2)
if CLEAN_CAL:
    X_cal, y_cal, rem_cal = _trim(X_cal, y_cal, bounds_s2)
else:
    rem_cal = 0
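The two stages above differ in what they do to out-of-range targets: winsorizing clamps them to the bounds and keeps every row, while trimming drops the rows. A toy sketch of the difference follows; the values and the fixed bounds `lo`/`hi` are hypothetical, whereas the notebook derives its bounds per (company_location, experience_level) group from log-scale quantiles.

```python
import numpy as np

y = np.array([10.0, 50.0, 60.0, 70.0, 500.0])
lo, hi = 20.0, 100.0  # hypothetical bounds, fixed here for illustration only

# Winsorizing: clamp to [lo, hi]; the number of rows is unchanged.
y_win = np.clip(y, lo, hi)

# Trimming: keep only the rows already inside [lo, hi].
keep = (y >= lo) & (y <= hi)
y_trim = y[keep]
removed = int((~keep).sum())

print(y_win, y_trim, removed)
```

Note that only the target values of TRN (and optionally CAL) are altered this way; VAL and TEST keep their raw salaries, so reported test errors are measured against unmodified data.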


## 3. Pipeline and GridSearch - Original model
This section defines the SalPredTrans pipeline and searches the hyperparameter grid for its best configuration.

In [None]:
# 3.1) Pipeline + GridSearch (compact)
memory = Memory("./_salpred_cache", verbose=0)

prep  = SalPredPreprocessor(cat_cols=cat_cols, num_cols=num_cols, min_freq_rare=80, clip_val=5.0)
model = SalPredTransRegressor(
    d_model=128, n_heads=8, n_layers=2, dropout=0.1,
    lr=1e-3, weight_decay=1e-4,
    batch_size=512, warmup_epochs=1, max_epochs=8, patience=2,
    lambda_nc=10.0, seed=RANDOM_STATE, verbose=False
)
pipe = Pipeline([("prep", prep), ("model", model)], memory=memory)

param_grid = {
    "prep__min_freq_rare": [80, 120],
    "model__d_model": [128, 190],
    "model__n_heads": [8],
    "model__n_layers": [2, 3],
    "model__lr": [1e-3, 3e-4],
    "model__dropout": [0.1, 0.2, 0.3],
}
KFOLDS = 5
N_JOBS = 2
cv = KFold(n_splits=KFOLDS, shuffle=True, random_state=RANDOM_STATE)

gs = GridSearchCV(
    pipe,
    param_grid,
    scoring="neg_root_mean_squared_error",   # optimize for RMSE
    cv=cv,
    n_jobs=N_JOBS,
    verbose=0,
    refit=True
)
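The cost of this search is set by the size of the grid. The sketch below counts the candidates and fits with plain `itertools`, so it runs without scikit-learn; the grid is copied from the cell above, and `n_splits` and the single refit mirror `KFOLDS = 5` and `refit=True`.

```python
from itertools import product

# Same grid as in cell 3.1, repeated so the snippet is self-contained.
param_grid = {
    "prep__min_freq_rare": [80, 120],
    "model__d_model": [128, 190],
    "model__n_heads": [8],
    "model__n_layers": [2, 3],
    "model__lr": [1e-3, 3e-4],
    "model__dropout": [0.1, 0.2, 0.3],
}
n_splits = 5

# Each candidate is one combination of values; each is fit once per fold.
n_candidates = len(list(product(*param_grid.values())))
n_cv_fits = n_candidates * n_splits
n_total_fits = n_cv_fits + 1  # plus one refit of the best candidate on all of TRN

print(n_candidates, n_cv_fits, n_total_fits)
```

This gives 48 candidates and 240 cross-validation fits, plus one refit, which matches the total of 241 used for the progress bar in the next cell.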



In [None]:
# 3.2) GridSearch with a progress bar (if tqdm is available), otherwise verbose=3
import time, numpy as np

use_tqdm = False
try:
    from tqdm.auto import tqdm
    import joblib
    from contextlib import contextmanager
    @contextmanager
    def tqdm_joblib(tqdm_object):
        class TqdmBatchCompletionCallBack(joblib.parallel.BatchCompletionCallBack):
            def __call__(self, *args, **kwargs):
                tqdm_object.update(n=self.batch_size)
                return super().__call__(*args, **kwargs)
        old_cb = joblib.parallel.BatchCompletionCallBack
        joblib.parallel.BatchCompletionCallBack = TqdmBatchCompletionCallBack
        try:
            yield tqdm_object
        finally:
            joblib.parallel.BatchCompletionCallBack = old_cb
            tqdm_object.close()
    use_tqdm = True
except Exception:
    pass

t0 = time.perf_counter()
if use_tqdm:
    n_candidates = len(list(ParameterGrid(gs.param_grid)))
    # also count the refit of the best model, if enabled
    n_fits = KFOLDS * n_candidates + (1 if getattr(gs, "refit", True) else 0)
    with tqdm_joblib(tqdm(total=n_fits, desc="GridSearch fits")):
        gs.fit(X_trn, y_trn)
else:
    gs.set_params(verbose=3)
    gs.fit(X_trn, y_trn)
elapsed = time.perf_counter() - t0

# ---- Report adapted to the scoring used ----
sc = getattr(gs, "scoring", None)
if isinstance(sc, str):
    is_neg = sc.startswith("neg_")
    key = sc.replace("neg_", "")
else:
    is_neg, key = False, "score"

# Best mean CV score, sign-flipped for "neg_*" scorers (here: CV RMSE in USD)
best = (-gs.best_score_) if is_neg else gs.best_score_

print("Best params:", gs.best_params_)
print(f"GridSearch time (s): {elapsed:.1f}")


GridSearch fits: 100%|█████████▉| 240/241 [3:30:29<00:52, 52.62s/it]   

Best params: {'model__d_model': 128, 'model__dropout': 0.1, 'model__lr': 0.001, 'model__n_heads': 8, 'model__n_layers': 3, 'prep__min_freq_rare': 80}
GridSearch time (s): 12629.2





In [None]:

# 3.3) Validation (VAL) + refit on TRN+VAL + log-scale metrics
def rmse(y_true, y_pred):
    try:
        return mean_squared_error(y_true, y_pred, squared=False)
    except TypeError:
        return np.sqrt(mean_squared_error(y_true, y_pred))

best = gs.best_estimator_
best.named_steps["model"].batch_size = 2048

y_val_pred = best.predict(X_val)
MAE_val  = mean_absolute_error(y_val, y_val_pred)
RMSE_val = rmse(y_val, y_val_pred)
MAPE_val = np.mean(np.abs((y_val - y_val_pred)/np.maximum(1e-9, np.abs(y_val))))*100
R2_val   = r2_score(y_val, y_val_pred)

Xd_val = best.named_steps["prep"].transform(X_val)
_, qM_val, _ = best.named_steps["model"].predict_quantiles(Xd_val)
yl = np.log1p(y_val); yhatl = np.log1p(qM_val)
rmse_log = np.sqrt(np.mean((yl - yhatl)**2))
mae_log  = np.mean(np.abs(yl - yhatl))
r2_log   = 1 - np.sum((yl - yhatl)**2) / np.sum((yl - yl.mean())**2)

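# Final pipeline, refit on TRN+VAL with the best grid-searched parameters.
# Note: only the parameters in param_grid are carried over; everything else
# (e.g. max_epochs, patience, seed, clip_val) falls back to the class defaults,
# which may differ from the values set explicitly in cell 3.1.
pipe_final = Pipeline([("prep", SalPredPreprocessor(cat_cols=cat_cols, num_cols=num_cols)),
                       ("model", SalPredTransRegressor())], memory=memory)
pipe_final.set_params(**gs.best_params_)
_ = pipe_final.fit(pd.concat([X_trn, X_val]), np.concatenate([y_trn, y_val]))
pipe_final.named_steps["model"].batch_size = 4096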


## 4. Test
The SalPredTrans model is evaluated on the test set in both the USD and the logarithmic scale. The results are reported in the original USD scale so that they can be compared fairly with the traditional models developed earlier:

In [None]:
# 4.1) TEST: point metrics + log-scale metrics
y_tst_pred = pipe_final.predict(X_test)
MAE  = mean_absolute_error(y_test, y_tst_pred)
RMSE = rmse(y_test, y_tst_pred)
MAPE = np.mean(np.abs((y_test - y_tst_pred)/np.maximum(1e-9, np.abs(y_test))))*100
print(f"TEST | MAE={MAE:,.0f}  RMSE={RMSE:,.0f}  MAPE={MAPE:.2f}%")  # USD scale

Xd_test = pipe_final.named_steps["prep"].transform(X_test)
_, qM_test, _ = pipe_final.named_steps["model"].predict_quantiles(Xd_test)
ylt = np.log1p(y_test); yhatlt = np.log1p(qM_test)
rmse_log_t = np.sqrt(np.mean((ylt - yhatlt)**2))
mae_log_t  = np.mean(np.abs(ylt - yhatlt))
r2_log_t   = 1 - np.sum((ylt - yhatlt)**2) / np.sum((ylt - ylt.mean())**2)


TEST | MAE=411  RMSE=757  MAPE=0.39%
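The cell above computes the log-scale metrics but prints only the USD ones. As a minimal sketch, the three log-scale formulas it uses can be wrapped in one helper; the name `log_scale_metrics` and the toy values are assumptions and are not results from this notebook.

```python
import numpy as np

def log_scale_metrics(y_true, y_pred):
    """RMSE, MAE and R^2 computed on log1p-transformed values."""
    yl = np.log1p(np.asarray(y_true, float))
    pl = np.log1p(np.asarray(y_pred, float))
    resid = yl - pl
    rmse_log = float(np.sqrt(np.mean(resid ** 2)))
    mae_log = float(np.mean(np.abs(resid)))
    r2_log = float(1.0 - np.sum(resid ** 2) / np.sum((yl - yl.mean()) ** 2))
    return {"rmse_log": rmse_log, "mae_log": mae_log, "r2_log": r2_log}

# Toy values only, chosen for illustration.
y_true = np.array([50_000.0, 100_000.0, 200_000.0])
y_pred = np.array([55_000.0, 95_000.0, 210_000.0])
metrics = log_scale_metrics(y_true, y_pred)
print({k: round(v, 4) for k, v in metrics.items()})
```

In the notebook this would be called as `log_scale_metrics(y_test, qM_test)`, since the median quantile is the point prediction used for the log-scale comparison.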


## 5. Conformal Prediction 

In [None]:

# 5.1) Normalized conformal CQR
alpha = 0.10

Xd_cal = pipe_final.named_steps["prep"].transform(X_cal)
qL_cal, qM_cal, qU_cal = pipe_final.named_steps["model"].predict_quantiles(Xd_cal)

scale_cal = np.maximum(qM_cal - qL_cal, qU_cal - qM_cal)
scale_cal = np.maximum(scale_cal, 1e-6)

scores = np.maximum((qL_cal - y_cal)/scale_cal, (y_cal - qU_cal)/scale_cal)
scores = np.sort(scores)
m = len(scores)
k = int(np.ceil((1 - alpha) * (m + 1))) - 1
k = int(np.clip(k, 0, m-1))
qhat = float(scores[k])

Xd_test = pipe_final.named_steps["prep"].transform(X_test)
qL_t, qM_t, qU_t = pipe_final.named_steps["model"].predict_quantiles(Xd_test)
s_t = np.maximum(qM_t - qL_t, qU_t - qM_t); s_t = np.maximum(s_t, 1e-6)
L = qL_t - qhat * s_t
U = qU_t + qhat * s_t

coverage  = np.mean((y_test >= L) & (y_test <= U))
avg_width = np.mean(U - L)
def interval_score(y, L, U, alpha=0.10):
    return (U-L) + (2/alpha)*((L - y)*(y < L)) + (2/alpha)*((y - U)*(y > U))
IS = np.mean(interval_score(y_test, L, U, alpha))
print(f"Normalized CQR | Coverage={coverage:.3f} | Avg width={avg_width:,.0f} | Interval Score={IS:,.0f}")

Normalized CQR | Coverage=0.900 | Avg width=2,489 | Interval Score=2,511
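The cell above follows the usual split-conformal recipe for quantile regression (CQR) with a per-row scale: score each calibration point by how far it falls outside its predicted quantile band, in units of the band's half-width; take the finite-sample \((1-\alpha)\) quantile of those scores; then widen every test band by that many half-widths. The sketch below reproduces the same steps on synthetic data. `fake_quantiles` is a hypothetical stand-in for `predict_quantiles` and deliberately returns bands that are too narrow.

```python
import numpy as np

def conformal_qhat(scores, alpha):
    """Finite-sample (1 - alpha) quantile of the calibration scores."""
    s = np.sort(np.asarray(scores, float))
    m = len(s)
    k = int(np.ceil((1 - alpha) * (m + 1))) - 1
    return float(s[min(max(k, 0), m - 1)])

def half_width(qL, qM, qU):
    """Per-row scale: the larger of the two distances from the median, floored at 1e-6."""
    return np.maximum(np.maximum(qM - qL, qU - qM), 1e-6)

rng = np.random.default_rng(0)
alpha = 0.10

def fake_quantiles(n):
    # Synthetic salaries with noise proportional to the mean (assumption).
    mu = rng.uniform(50_000, 150_000, n)
    sigma = 0.05 * mu
    y = mu + sigma * rng.standard_normal(n)
    # Bands of +/- 0.5 sigma are too narrow for 90% coverage on purpose.
    return y, mu - 0.5 * sigma, mu, mu + 0.5 * sigma

# Calibration: normalized CQR scores and their conformal quantile.
y_cal, qL_cal, qM_cal, qU_cal = fake_quantiles(2000)
s_cal = half_width(qL_cal, qM_cal, qU_cal)
scores = np.maximum((qL_cal - y_cal) / s_cal, (y_cal - qU_cal) / s_cal)
qhat = conformal_qhat(scores, alpha)

# Test: widen each band by qhat times its own half-width.
y_te, qL_te, qM_te, qU_te = fake_quantiles(2000)
s_te = half_width(qL_te, qM_te, qU_te)
lower = qL_te - qhat * s_te
upper = qU_te + qhat * s_te
coverage = float(np.mean((y_te >= lower) & (y_te <= upper)))
print(f"qhat={qhat:.3f}  coverage={coverage:.3f}")
```

Because the correction is multiplied by each row's own half-width, rows the model is less sure about receive a larger absolute widening. The coverage guarantee relies on calibration and test data being exchangeable, which is worth keeping in mind when the calibration targets are cleaned (`CLEAN_CAL = True`) but the test targets are not.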


## 6. Evaluation and metrics

In [None]:
from IPython.display import display

In [None]:
#6.1) Build df_usd (TRAIN/TEST) in USD
import numpy as np, pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error

def rmse(y, yhat):
    try:
        return mean_squared_error(y, yhat, squared=False)
    except TypeError:
        return np.sqrt(mean_squared_error(y, yhat))

def mape_pct(y, yhat):
    y = np.asarray(y, float); yhat = np.asarray(yhat, float)
    return np.mean(np.abs((y - yhat) / np.maximum(1e-9, np.abs(y)))) * 100

# Predictions for TRAIN and TEST
est_train  = gs.best_estimator_           # refit on X_trn by GridSearchCV
y_trn_pred = est_train.predict(X_trn)

est_test   = pipe_final                   # trained on TRN+VAL
y_tst_pred = est_test.predict(X_test)

# DataFrame for the table
df_usd = pd.DataFrame([{
    "Modelo": "SalPred-Trans",
    "Train_RMSE": rmse(y_trn, y_trn_pred),
    "Train_MAE":  mean_absolute_error(y_trn, y_trn_pred),
    "Train_MSE":  float(mean_squared_error(y_trn, y_trn_pred)),
    "Train_MAPE": mape_pct(y_trn, y_trn_pred),

    "Test_RMSE": rmse(y_test, y_tst_pred),
    "Test_MAE":  mean_absolute_error(y_test, y_tst_pred),
    "Test_MSE":  float(mean_squared_error(y_test, y_tst_pred)),
    "Test_MAPE": mape_pct(y_test, y_tst_pred),
}]).set_index("Modelo")

In [None]:
#6.2) USD table (TRAIN/TEST only), "Reds" style
import pandas as pd
from IPython.display import display

HEADER = "#b22222"
CAPTION_USD = "Model comparison: Train vs Test (MSE, RMSE, MAE, MAPE in the original USD scale)"

def _apply_header_theme(styler, caption, header_color=HEADER):
    return (styler
            .set_caption(caption)
            .set_table_styles([
                {"selector":"thead th",
                 "props":[("background-color", header_color), ("color","white"),
                          ("font-weight","bold"), ("border", f"1px solid {header_color}")]},
                {"selector":"caption",
                 "props":[("caption-side","top"), ("font-size","16px"),
                          ("font-weight","bold")]}
            ])
            .set_properties(**{"text-align":"center","white-space":"nowrap"}))

if "df_usd" in globals() and isinstance(df_usd, pd.DataFrame) and not df_usd.empty:
    df_usd_disp = df_usd.copy()

    err_cols_usd_all = [
        "Train_RMSE","Train_MAE","Train_MSE","Train_MAPE",
        "Test_RMSE","Test_MAE","Test_MSE","Test_MAPE"
    ]
    err_cols_usd = [c for c in err_cols_usd_all if c in df_usd_disp.columns]

    # Formats (thousands separators and %)
    fmt_usd = {}
    for c in df_usd_disp.columns:
        if any(k in c for k in ["RMSE","MAE","MSE"]):
            fmt_usd[c] = "{:,.0f}"
    if "Train_MAPE" in df_usd_disp.columns: fmt_usd["Train_MAPE"] = "{:.2f}%"
    if "Test_MAPE"  in df_usd_disp.columns: fmt_usd["Test_MAPE"]  = "{:.2f}%"

    styled_usd = (df_usd_disp.style
                  .format(fmt_usd)
                  .background_gradient(cmap="Reds", subset=err_cols_usd, axis=0)
                  .set_table_attributes('id="T_usd"'))
    styled_usd = _apply_header_theme(styled_usd, CAPTION_USD)
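    # Styler.hide_index() no longer exists in pandas >= 2.0; there the index (model name) stays visible.
    try: styled_usd = styled_usd.hide_index()
    except AttributeError: pass
    display(styled_usd)
else:
    print("[warning] df_usd does not exist or is empty, so the USD table will not be shown.")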


| Modelo | Train_RMSE | Train_MAE | Train_MSE | Train_MAPE | Test_RMSE | Test_MAE | Test_MSE | Test_MAPE |
|---|---|---|---|---|---|---|---|---|
| SalPred-Trans | 868 | 420 | 753,737 | 0.31% | 757 | 411 | 572,793 | 0.39% |


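In USD, SalPred-Trans generalizes well, with no sign of overfitting: the test RMSE and MAE are even slightly lower than on train. Note that the train figures come from gs.best_estimator_ (fit on TRN) and the test figures from pipe_final (fit on TRN+VAL), so the two sets of columns describe different fitted models. The consistency \(\text{RMSE}^2 \approx \text{MSE}\) confirms that the reported metrics are internally consistent. The higher MAPE on test (0.39% vs. 0.31%) indicates a larger relative error (probably because of more low-salary cases), while in absolute terms the model keeps a mean error close to 411 USD and a typical error (RMSE) of around 757 USD on test.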
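The \(\text{RMSE}^2 \approx \text{MSE}\) check can be reproduced directly from the rounded figures in the table. The small gap that remains comes from rounding the RMSE to whole dollars before squaring.

```python
# Rounded figures copied from the USD table above.
reported = {
    "train": {"rmse": 868, "mse": 753_737},
    "test": {"rmse": 757, "mse": 572_793},
}

rel_gaps = {}
for split, m in reported.items():
    squared = m["rmse"] ** 2
    # Relative difference between the squared (rounded) RMSE and the reported MSE.
    rel_gaps[split] = abs(squared - m["mse"]) / m["mse"]
    print(f"{split}: RMSE^2={squared:,}  MSE={m['mse']:,}  gap={rel_gaps[split]:.3%}")
```

Both gaps are well below 0.1%, which is what rounding alone would produce.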

In [None]:
import numpy as np, pandas as pd
from IPython.display import display

def _interval_score_vec(y, L, U, alpha):
    under = np.maximum(L - y, 0.0)
    over  = np.maximum(y - U, 0.0)
    return (U - L) + (2.0/alpha) * (under + over)

coverage  = float(np.mean((y_test >= L) & (y_test <= U)))
avg_width = float(np.mean(U - L))
IS        = float(np.mean(_interval_score_vec(y_test, L, U, alpha)))

df_conf = pd.DataFrame([{
    "Split": "TEST (conformal CQR norm)",
    "Coverage": coverage,
    "Avg_width": avg_width,
    "Interval_Score": IS
}]).set_index("Split")

# --- "Reds" style with header #b22222 ---
HEADER = "#b22222"
fmt = {"Coverage":"{:.3f}", "Avg_width":"{:,.0f}", "Interval_Score":"{:,.0f}"}

styled_conf = (df_conf.style
    .format(fmt)
    .set_caption("Conformal intervals: TEST")
    .set_table_styles([
        {"selector":"thead th",
         "props":[("background-color", HEADER), ("color","white"), ("font-weight","bold")]},
        {"selector":"caption",
         "props":[("caption-side","top"), ("font-size","16px"), ("font-weight","bold")]}
    ])
    .background_gradient(cmap="Reds", subset=["Avg_width","Interval_Score"], axis=0)
    .apply(lambda s: ['background-color: white'] * len(s), subset=["Coverage"])
)
display(styled_conf)


| Split | Coverage | Avg_width | Interval_Score |
|---|---|---|---|
| TEST (conformal CQR norm) | 0.900 | 2,489 | 2,511 |


**Intervals (TEST):** With **coverage = 0.900**, the intervals match the target \(1-\alpha = 0.90\). The **average width = 2,489** USD is consistent with the model's performance on test, and the **Interval Score = 2,511** sits only slightly above the width, which means the penalty term is small: the roughly 10% of test points that fall outside their interval miss it by little. Overall, these are **well-calibrated intervals of reasonable width**.
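The gap between the interval score and the average width can be converted into an average miss distance, because the score is the width plus \((2/\alpha)\) times the distance by which the true value falls outside the interval. The sketch below does that arithmetic with the rounded figures from the table, so the results are approximate.

```python
alpha = 0.10
avg_width = 2489.0       # USD, from the table above
interval_score = 2511.0  # USD, from the table above
coverage = 0.900

# Mean penalty term: (2 / alpha) * miss distance, averaged over all test rows.
avg_penalty = interval_score - avg_width
# Mean miss distance over all rows (zero for covered rows).
avg_miss_all = avg_penalty * alpha / 2
# Mean miss distance among the rows that actually fall outside their interval.
avg_miss_per_violation = avg_miss_all / (1 - coverage)

print(avg_penalty, round(avg_miss_all, 2), round(avg_miss_per_violation, 1))
```

With these inputs the uncovered points miss their interval by about 11 USD on average, a small amount next to an average width of 2,489 USD.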


## Original model vs. traditional models - Conclusion

SalPred-Trans leads the comparison: it achieves the best test RMSE (757) and the best MSE (572,793), with MAE 411 and MAPE 0.39%, giving the most solid balance between average error and control of outliers.
Behind it, the MLP stands out in MAE (348, the lowest) and shares the top MAPE (~0.30%) with XGB (which nevertheless trails well behind in RMSE, at 2,557, with MAE 447). Further back come KNN (RMSE 8,698, MAE 2,720), SVR (RMSE 10,908, MAE 8,987), DecisionTree (RMSE 52,958, MAE 35,319), and, far behind in squared error, Ridge (RMSE 89,435) and Lasso (RMSE 111,077).
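To make the two rankings explicit, the test figures quoted in this conclusion can be sorted by RMSE and by MAE. The sketch uses only the numbers stated above; `None` marks values the text does not give (the MLP's RMSE, and the MAE of Ridge and Lasso).

```python
# (model, test RMSE, test MAE) in USD, as quoted in the conclusion.
results = [
    ("SalPred-Trans", 757, 411),
    ("MLP", None, 348),
    ("XGB", 2_557, 447),
    ("KNN", 8_698, 2_720),
    ("SVR", 10_908, 8_987),
    ("DecisionTree", 52_958, 35_319),
    ("Ridge", 89_435, None),
    ("Lasso", 111_077, None),
]

# Rank only the models for which the metric is reported.
by_rmse = sorted((r for r in results if r[1] is not None), key=lambda r: r[1])
by_mae = sorted((r for r in results if r[2] is not None), key=lambda r: r[2])

print("By RMSE:", [name for name, _, _ in by_rmse])
print("By MAE: ", [name for name, _, _ in by_mae])
```

The two orderings differ at the top: SalPred-Trans is first among the reported RMSE values, while the MLP is first by MAE.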

