# Workflow 618 - ONLY TREND Features (FIXED VERSION)

Feature engineering that adds ONLY trend features (`trend_3`, `trend_6`) on top of the base variables.

**FIXED:** This version corrects the differences with the local R script:
- Catastrophe Analysis: 13 variables (includes `ctarjeta_master_transacciones`)
- Trend calculation: uses `frollapply()` with `calc_slope_fast` (no `lm()`, optimized)
- `campos_buenos`: excludes `numero_de_cliente`, `foto_mes`, `clase_ternaria`, `clase01`, `azar`
- BO iterations: 100 (configurable)

The logic is identical to the z610 baseline and to the local R script.

#### Setting up the environment in Google Colab

This part must be run with the Python 3 runtime.
<br>In the menu, go to Runtime -> Change Runtime Type -> Runtime type -> **Python 3**

Mount Google Drive on the virtual machine running Google Colab so that files persist across sessions.

In [None]:
# first set the runtime to Python 3
from google.colab import drive
drive.mount('/content/.drive')

On a cold start, the next cell only works if the kaggle.json file has already been copied to Google Drive, into the folder indicated in the instructions.

<br>The following commands are a Linux shell script that will:
* Create the folders in Google Drive
* "Install" the kaggle.json file from Google Drive onto the virtual machine so that the Python kaggle library can use it
* Download **dataset_pequeno** to Google Drive and also to the local disk of the virtual machine running Google Colab
* Download **dataset_historico** (the cell below fetches `gerencial_competencia_2025.csv.gz`) to Google Drive and also to the local disk of the virtual machine running Google Colab

In [None]:
%%shell

mkdir -p "/content/.drive/My Drive/labo1"
mkdir -p "/content/buckets"
ln -s "/content/.drive/My Drive/labo1" /content/buckets/b1

mkdir -p ~/.kaggle
cp /content/buckets/b1/kaggle/kaggle.json ~/.kaggle
chmod 600 ~/.kaggle/kaggle.json


mkdir -p /content/buckets/b1/exp
mkdir -p /content/buckets/b1/datasets
mkdir -p /content/datasets


webfiles="https://storage.googleapis.com/open-courses/austral2025-af91/"
destino_local="/content/datasets"
destino_bucket="/content/buckets/b1/datasets"


archivo="dataset_pequeno.csv"

if ! test -f $destino_bucket/$archivo; then
  wget $webfiles/$archivo -O $destino_bucket/$archivo
fi


if ! test -f $destino_local/$archivo; then
  cp $destino_bucket/$archivo $destino_local/$archivo
fi

#-------

archivo="gerencial_competencia_2025.csv.gz"

if ! test -f $destino_bucket/$archivo; then
  wget $webfiles/$archivo -O $destino_bucket/$archivo
fi


if ! test -f $destino_local/$archivo; then
  cp $destino_bucket/$archivo $destino_local/$archivo
fi

## Workflow 618 - Loop over 5 Seeds (FIXED)

## Initialization

This part must be run with the **R** runtime. In the menu, go to Runtime -> Change Runtime Type -> Runtime type -> R

Clean the R environment

In [None]:
format(Sys.time(), "%a %b %d %X %Y")

In [None]:
# free memory
rm(list=ls(all.names=TRUE)) # remove all objects
gc(full=TRUE, verbose=FALSE) # garbage collection

In [None]:
require("data.table")

if( !require("R.utils")) install.packages("R.utils")
require("R.utils")

#### Global Parameters
If you are a manager, do not change anything
<br>If you are an analyst, change the dataset name

In [None]:
PARAM_GLOBAL <- list()
PARAM_GLOBAL$experimento_base <- 6180  # Experimento 618 ONLY TREND FIXED
PARAM_GLOBAL$dataset <- "gerencial_competencia_2025.csv.gz"

# Vector of 5 different seeds
PARAM_GLOBAL$semillas <- c(153929, 838969, 922081, 795581, 194609)

# BO iterations - FIXED: 100 iterations (same as the local R script)
PARAM_GLOBAL$bo_iterations <- 100

# List that stores the results of all seeds
resultados_totales <- list()

## calc_slope_fast function (OPTIMIZED)

Computes the slope of a linear regression from the closed-form formula.
**MUCH faster** than calling lm(), and identical to the local R script.
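
For reference, the closed form can be written out as the ordinary least-squares slope of $y$ against the window positions $x_i$, where the sums run over the non-missing points only and $n$ is their count. This restates the formula from the code comments and adds nothing to the method:

$$
\hat{\beta} = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2}
$$

As a worked case, a complete window of length 3 has $x = (1, 2, 3)$, so $\sum_i x_i = 6$ and $\sum_i x_i^2 = 14$. The denominator is $3 \cdot 14 - 6^2 = 6$, the numerator is $3(y_1 + 2y_2 + 3y_3) - 6(y_1 + y_2 + y_3) = 3(y_3 - y_1)$, and the slope reduces to $(y_3 - y_1)/2$.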

In [None]:
# ============================================================================
# OPTIMIZED FUNCTION TO COMPUTE THE TREND (NO LM)
# ============================================================================
#
# Instead of calling lm() (very slow), the slope is computed analytically:
#   slope = (n * sum(x*y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
#
# This function is MUCH faster than the lm() version
#
# ============================================================================

calc_slope_fast <- function(y) {
  n <- length(y)
  valid <- !is.na(y)
  n_valid <- sum(valid)
  if (n_valid < 2) return(NA_real_)

  x <- 1:n
  x_valid <- x[valid]
  y_valid <- y[valid]

  sum_x <- sum(x_valid)
  sum_y <- sum(y_valid)
  sum_xy <- sum(x_valid * y_valid)
  sum_x2 <- sum(x_valid^2)

  denom <- n_valid * sum_x2 - sum_x^2
  if (denom == 0) return(NA_real_)

  (n_valid * sum_xy - sum_x * sum_y) / denom
}
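
As a sanity check on the closed form, the sketch below compares it with `lm()` on a small series and then applies it over a rolling window. It is an illustration under stated assumptions: `slope_closed_form` is a hypothetical standalone copy of the same formula (so the block runs on its own), and the input values are made up.

```r
library(data.table)

# Hypothetical standalone copy of the closed-form slope (illustration only)
slope_closed_form <- function(y) {
  x  <- seq_along(y)
  ok <- !is.na(y)
  if (sum(ok) < 2) return(NA_real_)
  x <- x[ok]
  y <- y[ok]
  n <- length(y)
  (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
}

# Made-up series with one missing value
y <- c(10, 12, NA, 19, 25, 24)
x <- seq_along(y)

# lm() drops the NA row under R's default na.action, so both slopes
# are fitted on the same five points
slope_lm <- unname(coef(lm(y ~ x))[2])
print(all.equal(slope_closed_form(y), slope_lm))

# Rolling use: positions before the first complete window are filled with NA
rolling <- frollapply(c(1, 3, 5, 7), 3, slope_closed_form, align = "right")
print(rolling)
```

The two complete windows in the rolling call, (1, 3, 5) and (3, 5, 7), both rise by 2 per step, so the last two values are 2 and the first two are `NA`.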

## Main Loop - Automatic Iteration over the 5 Seeds

This loop runs the complete workflow for each of the 5 seeds automatically.
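
Each seed gets its own experiment number and output folder. The sketch below reproduces only that naming rule, using the base number and seeds from the global parameters; the `plan` table is a hypothetical helper for illustration and is not part of the workflow.

```r
# Values copied from the global parameters cell
experimento_base <- 6180
semillas <- c(153929, 838969, 922081, 795581, 194609)

# Hypothetical summary table: one row per seed
plan <- data.frame(
  seed_idx = seq_along(semillas),
  semilla  = semillas
)

# Same rule as in the loop: base + seed_idx - 1, then the folder name
plan$experimento <- experimento_base + plan$seed_idx - 1
plan$carpeta <- paste0(
  "WF", plan$experimento, "_seed", plan$seed_idx, "_ONLY_TREND_FIXED"
)

print(plan)
```

Because every seed writes into a separate folder, the `HT.RDATA` checkpoint of one seed is never picked up by another.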

In [None]:
# ============================================================================
# AUTOMATIC LOOP OVER ALL SEEDS
# ============================================================================

for (seed_idx in seq_along(PARAM_GLOBAL$semillas)) {

  cat("\n\n========================================\n")
  cat("PROCESSING SEED ", seed_idx, " of ", length(PARAM_GLOBAL$semillas), "\n")
  cat("Seed: ", PARAM_GLOBAL$semillas[seed_idx], "\n")
  cat("========================================\n\n")

  inicio_seed <- Sys.time()

  # Initialize PARAM for this seed
  PARAM <- list()
  PARAM$semilla_primigenia <- PARAM_GLOBAL$semillas[seed_idx]
  PARAM$experimento <- PARAM_GLOBAL$experimento_base + seed_idx - 1
  PARAM$dataset <- PARAM_GLOBAL$dataset
  PARAM$out <- list()
  PARAM$out$lgbm <- list()

  # ==========================================================================
  # Experiment folder
  # ==========================================================================
  
  # Make sure the base directory exists
  if (!dir.exists("/content/buckets/b1/exp")) {
    dir.create("/content/buckets/b1/exp", showWarnings = FALSE, recursive = TRUE)
  }
  
  setwd("/content/buckets/b1/exp")
  experimento_folder <- paste0("WF", PARAM$experimento, "_seed", seed_idx, "_ONLY_TREND_FIXED")
  dir.create(experimento_folder, showWarnings=FALSE)
  setwd( paste0("/content/buckets/b1/exp/", experimento_folder ))
  
  cat("Working folder: ", experimento_folder, "\n\n")

  # ==========================================================================
  # Dataset preprocessing
  # ==========================================================================
  
  cat("Loading dataset...\n")
  dataset <- fread(paste0("/content/datasets/", PARAM$dataset))
  cat("Dataset loaded:", nrow(dataset), "rows x", ncol(dataset), "cols\n\n")

  # ==========================================================================
  # Catastrophe Analysis - 13 VARIABLES (FIXED)
  # ==========================================================================
  
  cat("Applying Catastrophe Analysis (13 variables)...\n")
  dataset[ foto_mes==202006, internet:=NA]
  dataset[ foto_mes==202006, mrentabilidad:=NA]
  dataset[ foto_mes==202006, mrentabilidad_annual:=NA]
  dataset[ foto_mes==202006, mcomisiones:=NA]
  dataset[ foto_mes==202006, mactivos_margen:=NA]
  dataset[ foto_mes==202006, mpasivos_margen:=NA]
  dataset[ foto_mes==202006, mcuentas_saldo:=NA]
  dataset[ foto_mes==202006, ctarjeta_visa_transacciones:=NA]
  dataset[ foto_mes==202006, mtarjeta_visa_consumo:=NA]
  dataset[ foto_mes==202006, mtarjeta_master_consumo:=NA]
  dataset[ foto_mes==202006, ccallcenter_transacciones:=NA]
  dataset[ foto_mes==202006, chomebanking_transacciones:=NA]
  dataset[ foto_mes==202006, ctarjeta_master_transacciones:=NA]  # FIXED: included
  cat("13 variables in 202006 -> NA\n\n")

  # ==========================================================================
  # Feature Engineering - ROLLING TRENDS (OPTIMIZED)
  # ==========================================================================
  
  cat("Feature Engineering - ROLLING TRENDS (frollapply + calc_slope_fast)...\n")
  cat("Using the closed-form formula (no lm) - MUCH faster...\n\n")
  
  inicio_fe <- Sys.time()
  
  # Base variables (exclude ID, date, class)
  cols_lagueables <- setdiff(
    colnames(dataset),
    c("numero_de_cliente", "foto_mes", "clase_ternaria")
  )
  
  cat("Base variables:", length(cols_lagueables), "\n")
  
  # Sort by customer and month
  setorder(dataset, numero_de_cliente, foto_mes)
  
  # GENERATE ROLLING TRENDS (windows 3 and 6) - OPTIMIZED with calc_slope_fast
  cat("Generating rolling trends...\n")
  cols_antes_trends <- ncol(dataset)
  
  for (ventana in c(3, 6)) {
    cat(paste("  Window", ventana, "..."))
    
    for (col in cols_lagueables) {
      trend_col <- paste0(col, "_trend_", ventana)
      
      dataset[, (trend_col) := frollapply(
        x = get(col),
        n = ventana,
        FUN = calc_slope_fast,
        align = "right"
      ), by = numero_de_cliente]
    }
    
    cat(" OK\n")
  }
  
  fin_fe <- Sys.time()
  cols_trends <- ncol(dataset) - cols_antes_trends
  tiempo_fe <- as.numeric(difftime(fin_fe, inicio_fe, units = "mins"))
  
  cat(paste("Rolling trends generated:", cols_trends, "variables in",
            round(tiempo_fe, 1), "min\n"))
  cat(paste("Final dataset:", nrow(dataset), "rows x", ncol(dataset), "cols\n\n"))

  # ==========================================================================
  # Training Strategy
  # ==========================================================================
  
  cat("Configuring Training Strategy...\n")
  
  PARAM$trainingstrategy <- list()
  PARAM$trainingstrategy$validate <- c(202105)
  
  PARAM$trainingstrategy$training <- c(
    202104, 202103, 202102, 202101,
    202012, 202011, 202010, 202009, 202008, 202007,
    202006, 202005
  )
  
  PARAM$trainingstrategy$training_pct <- 1.0
  PARAM$trainingstrategy$positivos <- c( "BAJA+1", "BAJA+2")

  dataset[, clase01 := ifelse( clase_ternaria %in% PARAM$trainingstrategy$positivos, 1, 0 )]

  # FIXED: campos_buenos excludes numero_de_cliente, foto_mes, clase_ternaria, clase01, azar
  set.seed(PARAM$semilla_primigenia, kind = "L'Ecuyer-CMRG")
  dataset[, azar:=runif(nrow(dataset))]

  campos_buenos <- setdiff(
      colnames(dataset),
      c("numero_de_cliente", "foto_mes", "clase_ternaria", "clase01", "azar")
  )
  
  dataset[, fold_train := foto_mes %in% PARAM$trainingstrategy$training &
      (clase_ternaria %in% c("BAJA+1", "BAJA+2") |
       azar < PARAM$trainingstrategy$training_pct ) ]
  
  cat(paste("Features for the model:", length(campos_buenos), "\n\n"))
  
  if( !require("lightgbm")) install.packages("lightgbm")
  require("lightgbm")
  
  dtrain <- lgb.Dataset(
    data= data.matrix(dataset[fold_train == TRUE, campos_buenos, with = FALSE]),
    label= dataset[fold_train == TRUE, clase01],
    free_raw_data= TRUE
  )

  dvalidate <- lgb.Dataset(
    data= data.matrix(dataset[foto_mes %in% PARAM$trainingstrategy$validate, campos_buenos, with = FALSE]),
    label= dataset[foto_mes %in% PARAM$trainingstrategy$validate, clase01],
    free_raw_data= TRUE
  )

  # ==========================================================================
  # Hyperparameter Tuning - FIXED: 100 iterations
  # ==========================================================================
  
  cat(paste("Bayesian Optimization (", PARAM_GLOBAL$bo_iterations, " iterations)...\n"))
  
  if(!require("DiceKriging")) install.packages("DiceKriging")
  require("DiceKriging")
  
  if(!require("mlrMBO")) install.packages("mlrMBO")
  require("mlrMBO")

  inicio_bo <- Sys.time()

  PARAM$hipeparametertuning <- list()
  PARAM$hipeparametertuning$num_iterations <- PARAM_GLOBAL$bo_iterations  # FIXED: 100
  PARAM$lgbm <- list()
  
  PARAM$lgbm$param_fijos <- list(
    objective= "binary",
    metric= "auc",
    first_metric_only= TRUE,
    boost_from_average= TRUE,
    feature_pre_filter= FALSE,
    verbosity= -100,
    force_row_wise= TRUE,
    seed= PARAM$semilla_primigenia,
    max_bin= 31,
    learning_rate= 0.03,
    feature_fraction= 0.5,
    num_iterations= 2048,
    early_stopping_rounds= 200
  )
  
  PARAM$hipeparametertuning$hs <- makeParamSet(
    makeIntegerParam("num_leaves", lower = 2L, upper = 256L),
    makeIntegerParam("min_data_in_leaf", lower = 2L, upper = 8192L)
  )

  EstimarGanancia_AUC_lightgbm <- function(x) {
  
    param_completo <- modifyList(PARAM$lgbm$param_fijos, x)
  
    modelo_train <- lgb.train(
      data= dtrain,
      valids= list(valid = dvalidate),
      eval= "auc",
      param= param_completo,
      verbose= -100
    )
  
    AUC <- modelo_train$record_evals$valid$auc$eval[[modelo_train$best_iter]]
    attr(AUC, "extras") <- list("num_iterations"= modelo_train$best_iter)
  
    rm(modelo_train)
    gc(full= TRUE, verbose= FALSE)
  
    return(AUC)
  }

  configureMlr(show.learner.output = FALSE)
  
  obj.fun <- makeSingleObjectiveFunction(
      fn= EstimarGanancia_AUC_lightgbm,
      minimize= FALSE,
      noisy= FALSE,
      par.set= PARAM$hipeparametertuning$hs,
      has.simple.signature= FALSE
  )
  
  ctrl <- makeMBOControl(
      save.on.disk.at.time= 600,
      save.file.path= "HT.RDATA"
  )
  
  ctrl <- setMBOControlTermination(
      ctrl,
      iters= PARAM$hipeparametertuning$num_iterations  # FIXED: 100
  )
  
  ctrl <- setMBOControlInfill(ctrl, crit = makeMBOInfillCritEI())
  
  surr.km <- makeLearner(
      "regr.km",
      predict.type= "se",
      covtype= "matern3_2",
      control= list(trace = FALSE)
  )

  if (!file.exists("HT.RDATA")) {
    bayesiana_salida <- mbo(obj.fun, learner= surr.km, control= ctrl)
  } else {
    bayesiana_salida <- mboContinue("HT.RDATA")
  }

  fin_bo <- Sys.time()
  tiempo_bo <- as.numeric(difftime(fin_bo, inicio_bo, units = "mins"))

  tb_bayesiana <- as.data.table(bayesiana_salida$opt.path)
  setorder(tb_bayesiana, -y, -num_iterations)
  
  fwrite( tb_bayesiana,
    file="BO_log.txt",
    sep="\t"
  )
  
  PARAM$out$lgbm$mejores_hiperparametros <- tb_bayesiana[
    1,
    setdiff(colnames(tb_bayesiana),
      c("y","dob","eol","error.message","exec.time","ei","error.model",
        "train.time","prop.type","propose.time","se","mean","iter")),
    with= FALSE
  ]
  
  mejor_auc <- tb_bayesiana[1, y]

  cat(paste("BO completed in", round(tiempo_bo, 1), "min\n"))
  cat(paste("  Best AUC:", round(mejor_auc, 6), "\n"))
  cat(paste("  num_leaves:", PARAM$out$lgbm$mejores_hiperparametros$num_leaves, "\n"))
  cat(paste("  min_data_in_leaf:", PARAM$out$lgbm$mejores_hiperparametros$min_data_in_leaf, "\n"))
  cat(paste("  num_iterations:", PARAM$out$lgbm$mejores_hiperparametros$num_iterations, "\n\n"))

  # ==========================================================================
  # Production
  # ==========================================================================
  
  cat("Training final model...\n")
  
  PARAM$trainingstrategy$final_train <- c(
    202105, 202104, 202103, 202102, 202101,
    202012, 202011, 202010, 202009, 202008, 202007,
    202006, 202005
  )
  
  dataset[, fold_final_train := foto_mes %in% PARAM$trainingstrategy$final_train ]
  
  dfinal_train <- lgb.Dataset(
    data= data.matrix(dataset[fold_final_train == TRUE, campos_buenos, with= FALSE]),
    label= dataset[fold_final_train == TRUE, clase01],
    free_raw_data= TRUE
  )

  fijos <- copy(PARAM$lgbm$param_fijos)
  fijos$num_iterations <- NULL
  fijos$early_stopping_rounds <- NULL
  
  param_final <- c(fijos, PARAM$out$lgbm$mejores_hiperparametros)

  inicio_train_final <- Sys.time()

  final_model <- lgb.train(
    data= dfinal_train,
    param= param_final,
    verbose= -100
  )

  fin_train_final <- Sys.time()
  tiempo_train_final <- as.numeric(difftime(fin_train_final, inicio_train_final, units = "mins"))

  cat(paste("Final model trained in", round(tiempo_train_final, 1), "min\n\n"))

  lgb.save(final_model, "modelo.txt")

  tb_importancia <- as.data.table(lgb.importance(final_model))
  fwrite( tb_importancia,
    file= "impo.txt",
    sep= "\t"
  )

  # ==========================================================================
  # Scoring
  # ==========================================================================
  
  cat("Generating predictions...\n")
  
  PARAM$trainingstrategy$future <- c(202107)
  dfuture <- dataset[ foto_mes %in% PARAM$trainingstrategy$future ]

  prediccion <- predict(
    final_model,
    data.matrix(dfuture[, campos_buenos, with= FALSE])
  )

  tb_prediccion <- dfuture[, list(numero_de_cliente)]
  tb_prediccion[, prob := prediccion]
  
  fwrite(tb_prediccion,
    file= "prediccion.txt",
    sep= "\t"
  )

  # ==========================================================================
  # Gain Curve
  # ==========================================================================
  
  tb_prediccion[, clase_ternaria := dfuture$clase_ternaria ]
  tb_prediccion[, ganancia := -3000.0 ]
  tb_prediccion[clase_ternaria=="BAJA+2", ganancia := 117000.0 ]
  
  setorder( tb_prediccion, -prob )
  tb_prediccion[, gan_acum := cumsum(ganancia)]
  
  tb_prediccion[,
    gan_suavizada := frollmean(
      x= gan_acum,
      n= 400,
      align= "center",
      na.rm= TRUE,
      hasNA= TRUE
    )
  ]

  resultado <- list()
  resultado$ganancia_suavizada_max <- max( tb_prediccion$gan_suavizada, na.rm=TRUE )
  options(digits= 8)
  resultado$envios <- which.max( tb_prediccion$gan_suavizada)
  resultado$semilla <- PARAM$semilla_primigenia
  resultado$seed_idx <- seed_idx
  resultado$mejor_auc <- mejor_auc
  
  fwrite( tb_prediccion,
    file= "ganancias.txt",
    sep= "\t"
  )

  tb_prediccion[, envios:= .I]
  
  pdf("curva_de_ganancia.pdf")
  
  plot(
    x= tb_prediccion$envios,
    y= tb_prediccion$gan_acum,
    type= "l",
    col= "gray",
    xlim= c(0, 6000),
    ylim= c(0, 8000000),
    main= paste0("Seed ", seed_idx, " (ONLY TREND FIXED) - Gain= ", as.integer(resultado$ganancia_suavizada_max), " sends= ", resultado$envios),
    xlab= "Sends",
    ylab= "Gain",
    panel.first= grid()
  )
  
  dev.off()

  if( !require("yaml")) install.packages("yaml")
  require("yaml")
  
  PARAM$resultado <- resultado
  
  write_yaml( PARAM, file="PARAM.yml")

  # ==========================================================================
  # Save the result and clean up for the next iteration
  # ==========================================================================
  
  if(!exists("resultados_totales")) resultados_totales <- list()
  resultados_totales[[seed_idx]] <- resultado

  fin_seed <- Sys.time()
  duracion_total <- as.numeric(difftime(fin_seed, inicio_seed, units = "mins"))

  # dfuture and prediccion are also per-seed objects, so release them too
  rm(dataset, dtrain, dvalidate, dfinal_train, dfuture, prediccion, final_model, tb_prediccion)
  gc(full=TRUE, verbose=FALSE)

  cat("\n========================================\n")
  cat("Seed ", seed_idx, " completed in ", round(duracion_total, 1), " min\n")
  cat("   Gain: ", formatC(resultado$ganancia_suavizada_max, format="f", big.mark=",", digits=0), "\n")
  cat("   Sends: ", resultado$envios, "\n")
  cat("   AUC: ", round(resultado$mejor_auc, 6), "\n")
  cat("========================================\n\n")

} # End of the loop over the seeds

cat("\n\n***************************************\n")
cat("ALL SEEDS PROCESSED\n")
cat("***************************************\n")
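
The Gain Curve step inside the loop can be followed on a toy table. This is a minimal sketch: the scores and classes are made up, the payoffs (117000 for a `BAJA+2`, -3000 otherwise) are the ones used in the loop, and the 400-row smoothing with `frollmean` is left out because it needs far more rows than a toy example has.

```r
library(data.table)

# Made-up scores and classes; only BAJA+2 earns the positive payoff
tb <- data.table(
  prob  = c(0.9, 0.8, 0.6, 0.4, 0.2),
  clase = c("BAJA+2", "CONTINUA", "BAJA+2", "BAJA+1", "CONTINUA")
)

# Payoff per contacted customer, as in the loop
tb[, ganancia := ifelse(clase == "BAJA+2", 117000, -3000)]

# Contact customers from highest to lowest score and accumulate the payoff
setorder(tb, -prob)
tb[, gan_acum := cumsum(ganancia)]
tb[, envios := .I]

# Cutoff with the highest cumulative gain
print(tb[which.max(gan_acum)])
```

Here the cumulative gain peaks at 231000 after the third customer; contacting the remaining two only subtracts 3000 each.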

## Summary of Results for All Seeds

In [None]:
# Build the summary table
setwd("/content/buckets/b1/exp")

tb_resumen <- data.table(
  seed_idx = sapply(resultados_totales, function(x) x$seed_idx),
  semilla = sapply(resultados_totales, function(x) x$semilla),
  ganancia = sapply(resultados_totales, function(x) x$ganancia_suavizada_max),
  envios = sapply(resultados_totales, function(x) x$envios),
  mejor_auc = sapply(resultados_totales, function(x) x$mejor_auc)
)

# Add statistics
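# ties.method = "first" guarantees exactly one row with rank 1 even if two seeds tie
tb_resumen[, rank := rank(-ganancia, ties.method = "first")]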

cat("\n\n========================================\n")
cat("FINAL SUMMARY OF THE 5 SEEDS\n")
cat("(TREND FEATURES ONLY - FIXED)\n")
cat("========================================\n\n")
print(tb_resumen)

cat("\nSTATISTICS:\n")
cat("Mean gain: ", formatC(mean(tb_resumen$ganancia), format="f", big.mark=",", digits=0), "\n")
cat("Max gain: ", formatC(max(tb_resumen$ganancia), format="f", big.mark=",", digits=0), "\n")
cat("Min gain: ", formatC(min(tb_resumen$ganancia), format="f", big.mark=",", digits=0), "\n")
cat("Standard deviation: ", formatC(sd(tb_resumen$ganancia), format="f", big.mark=",", digits=0), "\n")
cat("Mean AUC: ", round(mean(tb_resumen$mejor_auc), 6), "\n")
cat("Best seed: ", tb_resumen[rank==1, semilla], " (seed_idx ", tb_resumen[rank==1, seed_idx], ")\n")

# Save the summary
fwrite(tb_resumen, 
  file=paste0("resumen_5_seeds_exp", PARAM_GLOBAL$experimento_base, "_ONLY_TREND_FIXED.txt"),
  sep="\t"
)

# Save the complete object
saveRDS(resultados_totales, 
  file=paste0("resultados_completos_exp", PARAM_GLOBAL$experimento_base, "_ONLY_TREND_FIXED.rds")
)

cat("\nWORKFLOW COMPLETED\n")