# 610 WorkFlow Gerencial - Nicolas Horn (IMPROVED)

Based on z610_WorkFlow_01_gerencial_julio.ipynb, with improvements taken from z910_WorkFlow_01_junior.ipynb

**Improvements implemented:**
- **Data Drifting:** IPC (consumer price index) correction, i.e. deflation, of the monetary variables
- **Feature Engineering:** lags (1, 2) + deltas (1, 2) + **trends (3, 6)**
- **Hyperparameters:** expanded search space (feature_fraction, learning_rate)
- **5 seeds:** 153929, 838969, 922081, 795581, 194609
- **Automatic loop** over the 5 seeds
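
To make the lag and delta features concrete, here is a minimal standalone sketch in Python (the notebook itself builds them in R with data.table, grouped by `numero_de_cliente`). The function name `lags_and_deltas` and the toy `balance` series are hypothetical and used only for illustration.

```python
def lags_and_deltas(values, orders=(1, 2)):
    """Build lag and delta features for one client's monthly series.

    values: list of numbers ordered from oldest to newest month.
    Returns one dict per month with the value, its lags and its deltas.
    """
    rows = []
    for t, current in enumerate(values):
        row = {"value": current}
        for k in orders:
            # the lag is missing (None) for the first k months
            lagged = values[t - k] if t - k >= 0 else None
            row[f"lag{k}"] = lagged
            # delta k = current value minus the value k months ago
            row[f"delta{k}"] = None if lagged is None else current - lagged
        rows.append(row)
    return rows


# hypothetical series for a single client
balance = [100.0, 120.0, 90.0, 150.0]
features = lags_and_deltas(balance)
for row in features:
    print(row)
```

The first months have missing lags because there is no earlier history for that client, which is the same situation the R code represents with `NA`.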

#### Setting up the environment in Google Colab

This part must be run with the Python 3 runtime
<br>In the menu, go to Runtime -> Change Runtime Type -> Runtime type -> **Python 3**

Connect the virtual machine where Google Colab is running to Google Drive, so that files persist between sessions

In [None]:
# first set the runtime to Python 3
from google.colab import drive
drive.mount('/content/.drive')

Download the dataset

In [None]:
%%shell

mkdir -p "/content/.drive/My Drive/labo1"
mkdir -p "/content/buckets"
ln -sfn "/content/.drive/My Drive/labo1" /content/buckets/b1

mkdir -p ~/.kaggle
cp /content/buckets/b1/kaggle/kaggle.json ~/.kaggle
chmod 600 ~/.kaggle/kaggle.json


mkdir -p /content/buckets/b1/exp
mkdir -p /content/buckets/b1/datasets
mkdir -p /content/datasets


webfiles="https://storage.googleapis.com/open-courses/austral2025-af91/"
destino_local="/content/datasets"
destino_bucket="/content/buckets/b1/datasets"

archivo="gerencial_competencia_2025.csv.gz"

if ! test -f $destino_bucket/$archivo; then
  wget $webfiles/$archivo -O $destino_bucket/$archivo
fi

if ! test -f $destino_local/$archivo; then
  cp $destino_bucket/$archivo $destino_local/$archivo
fi

## Workflow 610 IMPROVED - Loop over 5 Seeds

## Initialization

This part must be run with the **R** runtime. In the menu, go to Runtime -> Change Runtime Type -> Runtime type -> R

In [None]:
format(Sys.time(), "%a %b %d %X %Y")

In [None]:
# clear the memory
rm(list=ls(all.names=TRUE)) # remove all objects
gc(full=TRUE, verbose=FALSE) # garbage collection

In [None]:
require("data.table")

if( !require("R.utils")) install.packages("R.utils")
require("R.utils")

#### Global Parameters

In [None]:
PARAM_GLOBAL <- list()
PARAM_GLOBAL$experimento_base <- 6100
PARAM_GLOBAL$dataset <- "gerencial_competencia_2025.csv.gz"

# Vector of 5 seeds - Nicolas Horn
PARAM_GLOBAL$semillas <- c(153929, 838969, 922081, 795581, 194609)

# List that collects the results of all the seeds
resultados_totales <- list()

## Indices for Data Drifting (IPC)

IPC values computed by students; the reference point 1.0 corresponds to 31-Dec-2020
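
As a quick illustration of how these indices are used, the following standalone Python sketch (the notebook applies the same multiplication in R through a data.table join) converts a nominal amount to December-2020 money. The three index values are copied from the table in the next cell; the function name `deflate` and the amount 1000 are hypothetical.

```python
# Index values copied from the notebook's vIPC table (1.0 = 31-Dec-2020).
ipc_index = {
    202005: 1.2118694724,
    202012: 1.0000000000,
    202107: 0.7763107219,
}


def deflate(amount, foto_mes, index=ipc_index):
    """Express a nominal amount of month foto_mes in December-2020 money."""
    return amount * index[foto_mes]


nominal = 1000.0  # hypothetical nominal amount observed in each month
for month in sorted(ipc_index):
    print(month, round(deflate(nominal, month), 2))
```

Months before the reference date have an index above 1 and later months an index below 1, so the same nominal amount counts for more in May 2020 than in July 2021.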

In [None]:
# Months available in the gerencial dataset
vfoto_mes <- c(
  202005, 202006, 202007, 202008, 202009, 202010, 202011, 202012,
  202101, 202102, 202103, 202104, 202105, 202106, 202107
)

# IPC value for each of those months (reference point 1.0 = 31-Dec-2020)
vIPC <- c(
  1.2118694724, 1.1881073259,  # 202005, 202006
  1.1693969743, 1.1375456949, 1.1065619600,  # 202007, 202008, 202009
  1.0681100000, 1.0370000000, 1.0000000000,  # 202010, 202011, 202012
  0.9680542110, 0.9344152616, 0.8882274350,  # 202101, 202102, 202103
  0.8532444140, 0.8251880213, 0.8003763543,  # 202104, 202105, 202106
  0.7763107219  # 202107
)

tb_indices <- data.table(
  foto_mes = vfoto_mes,
  IPC = vIPC
)

print(tb_indices)

## Function to compute the trend (optimized)

Computes the slope of a linear regression with the closed-form formula (without `lm`)
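
The closed-form least-squares slope is `(n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)`, where `x` is the position inside the window and the sums run only over non-missing points. The standalone Python sketch below mirrors that logic so the formula can be checked by hand; the function name `slope` is hypothetical, and `None` plays the role of R's `NA`.

```python
def slope(y):
    """Least-squares slope of y against positions 1..n, skipping None entries.

    Returns None when fewer than two points are available.
    """
    points = [(x, v) for x, v in enumerate(y, start=1) if v is not None]
    n = len(points)
    if n < 2:
        return None

    sum_x = sum(x for x, _ in points)
    sum_y = sum(v for _, v in points)
    sum_xy = sum(x * v for x, v in points)
    sum_x2 = sum(x * x for x, _ in points)

    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        return None
    return (n * sum_xy - sum_x * sum_y) / denom


print(slope([1.0, 2.0, 3.0]))   # rises by 1 per step
print(slope([2.0, None, 6.0]))  # missing middle value, rises by 2 per step
print(slope([5.0]))             # not enough points
```

Because missing values keep their original position, a gap in the window does not distort the slope of the remaining points.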

In [None]:
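calc_slope_fast <- function(y) {
  # y: values of one rolling window, oldest first; may contain NA
  n <- length(y)
  valid <- !is.na(y)
  n_valid <- sum(valid)
  # a slope needs at least two observed points
  if (n_valid < 2) return(NA_real_)

  # x is the position inside the window; positions with NA are skipped
  x <- seq_len(n)
  x_valid <- x[valid]
  y_valid <- y[valid]

  sum_x <- sum(x_valid)
  sum_y <- sum(y_valid)
  sum_xy <- sum(x_valid * y_valid)
  sum_x2 <- sum(x_valid^2)

  denom <- n_valid * sum_x2 - sum_x^2
  if (denom == 0) return(NA_real_)

  # least-squares slope: (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
  (n_valid * sum_xy - sum_x * sum_y) / denom
}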

## Main Loop - Automatic Iteration over the 5 Seeds

This loop runs the complete workflow automatically for each of the 5 seeds.

In [None]:
# ============================================================================
# AUTOMATIC LOOP OVER ALL THE SEEDS
# ============================================================================

for (seed_idx in seq_along(PARAM_GLOBAL$semillas)) {

  cat("\n\n========================================\n")
  cat("PROCESSING SEED ", seed_idx, " of ", length(PARAM_GLOBAL$semillas), "\n")
  cat("Seed: ", PARAM_GLOBAL$semillas[seed_idx], "\n")
  cat("========================================\n\n")

  inicio_seed <- Sys.time()

  # Initialize PARAM for this seed
  PARAM <- list()
  PARAM$semilla_primigenia <- PARAM_GLOBAL$semillas[seed_idx]
  PARAM$experimento <- PARAM_GLOBAL$experimento_base + seed_idx - 1
  PARAM$dataset <- PARAM_GLOBAL$dataset
  PARAM$out <- list()
  PARAM$out$lgbm <- list()

  # ==========================================================================
  # Experiment folder
  # ==========================================================================

  if (!dir.exists("/content/buckets/b1/exp")) {
    dir.create("/content/buckets/b1/exp", showWarnings = FALSE, recursive = TRUE)
  }

  setwd("/content/buckets/b1/exp")
  experimento_folder <- paste0("WF", PARAM$experimento, "_seed", seed_idx, "_nicolas_horn_v2")
  dir.create(experimento_folder, showWarnings=FALSE)
  setwd( paste0("/content/buckets/b1/exp/", experimento_folder ))

  cat("Working folder: ", experimento_folder, "\n\n")

  # ==========================================================================
  # Load the dataset
  # ==========================================================================

  cat("Loading dataset...\n")
  dataset <- fread(paste0("/content/datasets/", PARAM$dataset))
  cat("Dataset loaded:", nrow(dataset), "rows x", ncol(dataset), "cols\n\n")

  # ==========================================================================
  # Catastrophe Analysis (13 variables)
  # ==========================================================================
  
  cat("Applying Catastrophe Analysis (13 variables)...\n")
  dataset[ foto_mes==202006, internet:=NA]
  dataset[ foto_mes==202006, mrentabilidad:=NA]
  dataset[ foto_mes==202006, mrentabilidad_annual:=NA]
  dataset[ foto_mes==202006, mcomisiones:=NA]
  dataset[ foto_mes==202006, mactivos_margen:=NA]
  dataset[ foto_mes==202006, mpasivos_margen:=NA]
  dataset[ foto_mes==202006, mcuentas_saldo:=NA]
  dataset[ foto_mes==202006, ctarjeta_visa_transacciones:=NA]
  dataset[ foto_mes==202006, mtarjeta_visa_consumo:=NA]
  dataset[ foto_mes==202006, mtarjeta_master_consumo:=NA]
  dataset[ foto_mes==202006, ccallcenter_transacciones:=NA]
  dataset[ foto_mes==202006, chomebanking_transacciones:=NA]
  dataset[ foto_mes==202006, ctarjeta_master_transacciones:=NA]  # Variable 13
  cat("Catastrophe Analysis completed\n\n")

  # ==========================================================================
  # DATA DRIFTING - IPC correction (deflation)
  # ==========================================================================

  cat("Applying Data Drifting correction (IPC)...\n")

  # Identify the monetary fields (their names start with 'm')
  campos_monetarios <- colnames(dataset)
  campos_monetarios <- campos_monetarios[campos_monetarios %like% "^m"]

  cat("  Monetary variables to correct:", length(campos_monetarios), "\n")

  # Apply deflation: multiply by the IPC index to express everything in Dec-2020 values
  dataset[tb_indices,
    on = c("foto_mes"),
    (campos_monetarios) := .SD * i.IPC,
    .SDcols = campos_monetarios
  ]

  cat("  Data Drifting corrected by IPC\n\n")

  # ==========================================================================
  # Feature Engineering within the month
  # ==========================================================================

  cat("Feature Engineering within the month...\n")
  
  atributos_presentes <- function( patributos ) {
    atributos <- unique( patributos )
    comun <- intersect( atributos, colnames(dataset) )
    return( length( atributos ) == length( comun ) )
  }

  if( atributos_presentes( c("foto_mes") ))
    dataset[, kmes := foto_mes %% 100]

  if( atributos_presentes( c("mpayroll", "cliente_edad") ))
    dataset[, mpayroll_sobre_edad := mpayroll / cliente_edad]
  
  cat("Within-month FE completed\n\n")

  # ==========================================================================
  # Historical Feature Engineering - LAGS + DELTAS + TRENDS
  # ==========================================================================

  cat("Historical Feature Engineering (lags + deltas + trends)...\n")
  inicio_fe <- Sys.time()

  # Sort by client and month
  setorder(dataset, numero_de_cliente, foto_mes)

  cols_lagueables <- copy( setdiff(
      colnames(dataset),
      c("numero_de_cliente", "foto_mes", "clase_ternaria")
  ))

  cat("  Base variables:", length(cols_lagueables), "\n")

  # --- LAGS ---
  cat("  Generating lags of order 1...\n")
  dataset[,
      paste0(cols_lagueables, "_lag1") := shift(.SD, 1, NA, "lag"),
      by = numero_de_cliente,
      .SDcols = cols_lagueables
  ]

  cat("  Generating lags of order 2...\n")
  dataset[,
      paste0(cols_lagueables, "_lag2") := shift(.SD, 2, NA, "lag"),
      by = numero_de_cliente,
      .SDcols = cols_lagueables
  ]

  # --- DELTAS ---
  cat("  Generating deltas...\n")
  for (vcol in cols_lagueables) {
      dataset[, paste0(vcol, "_delta1") := get(vcol) - get(paste0(vcol, "_lag1"))]
      dataset[, paste0(vcol, "_delta2") := get(vcol) - get(paste0(vcol, "_lag2"))]
  }
  
  # --- TRENDS (windows 3 and 6) ---
  cat("  Generating trends (window 3)...\n")
  for (col in cols_lagueables) {
    dataset[, paste0(col, "_trend_3") := frollapply(
      x = get(col),
      n = 3,
      FUN = calc_slope_fast,
      align = "right"
    ), by = numero_de_cliente]
  }
  
  cat("  Generating trends (window 6)...\n")
  for (col in cols_lagueables) {
    dataset[, paste0(col, "_trend_6") := frollapply(
      x = get(col),
      n = 6,
      FUN = calc_slope_fast,
      align = "right"
    ), by = numero_de_cliente]
  }
  
  fin_fe <- Sys.time()
  tiempo_fe <- as.numeric(difftime(fin_fe, inicio_fe, units = "mins"))
  
  cat(paste("Historical FE completed in", round(tiempo_fe, 1), "min\n"))
  cat(paste("Final dataset:", nrow(dataset), "rows x", ncol(dataset), "cols\n\n"))

  # ==========================================================================
  # Training Strategy
  # ==========================================================================
  
  cat("Configuring Training Strategy...\n")
  
  PARAM$trainingstrategy <- list()
  PARAM$trainingstrategy$validate <- c(202105)
  
  PARAM$trainingstrategy$training <- c(
    202104, 202103, 202102, 202101,
    202012, 202011, 202010, 202009, 202008, 202007,
    202006, 202005
  )
  
  PARAM$trainingstrategy$training_pct <- 1.0
  PARAM$trainingstrategy$positivos <- c( "BAJA+1", "BAJA+2")

  dataset[, clase01 := ifelse( clase_ternaria %in% PARAM$trainingstrategy$positivos, 1, 0 )]

  campos_buenos <- copy( setdiff(
      colnames(dataset), c("clase_ternaria","clase01","azar"))
  )

  set.seed(PARAM$semilla_primigenia, kind = "L'Ecuyer-CMRG")
  dataset[, azar:=runif(nrow(dataset))]

  dataset[, fold_train := foto_mes %in% PARAM$trainingstrategy$training &
      (clase_ternaria %in% c("BAJA+1", "BAJA+2") |
       azar < PARAM$trainingstrategy$training_pct ) ]
  
  cat(paste("Features for the model:", length(campos_buenos), "\n\n"))
  
  if( !require("lightgbm")) install.packages("lightgbm")
  require("lightgbm")

  # ==========================================================================
  # Hyperparameters - PRE-COMPUTED (EXPANDED)
  # ==========================================================================

  cat("Using pre-computed hyperparameters (expanded)...\n")

  PARAM$lgbm <- list()

  # Improved fixed parameters (based on z910)
  PARAM$lgbm$param_fijos <- list(
    objective= "binary",
    metric= "auc",
    first_metric_only= TRUE,
    boost_from_average= TRUE,
    feature_pre_filter= FALSE,
    verbosity= -100,
    force_row_wise= TRUE,
    seed= PARAM$semilla_primigenia,
    max_bin= 31
  )
  
  # Best pre-computed hyperparameters (expanded with feature_fraction and learning_rate)
  PARAM$out$lgbm$mejores_hiperparametros <- list(
    num_leaves = 25,
    min_data_in_leaf = 2764,
    num_iterations = 2009,
    feature_fraction = 0.5,
    learning_rate = 0.03
  )
  
  mejor_auc <- 0.99512706276939
  
  cat("  num_leaves: 25\n")
  cat("  min_data_in_leaf: 2764\n")
  cat("  num_iterations: 2009\n")
  cat("  feature_fraction: 0.5\n")
  cat("  learning_rate: 0.03\n\n")

  # ==========================================================================
  # Final Training
  # ==========================================================================
  
  cat("Training the final model...\n")
  
  PARAM$trainingstrategy$final_train <- c(
    202105, 202104, 202103, 202102, 202101,
    202012, 202011, 202010, 202009, 202008, 202007,
    202006, 202005
  )
  
  dataset[, fold_final_train := foto_mes %in% PARAM$trainingstrategy$final_train ]
  
  dfinal_train <- lgb.Dataset(
    data= data.matrix(dataset[fold_final_train == TRUE, campos_buenos, with= FALSE]),
    label= dataset[fold_final_train == TRUE, clase01],
    free_raw_data= TRUE
  )

  fijos <- copy(PARAM$lgbm$param_fijos)
  param_final <- c(fijos, PARAM$out$lgbm$mejores_hiperparametros)

  inicio_train_final <- Sys.time()

  # use the full argument name `params` instead of relying on partial matching of `param`
  final_model <- lgb.train(
    data= dfinal_train,
    params= param_final,
    verbose= -100
  )

  fin_train_final <- Sys.time()
  tiempo_train_final <- as.numeric(difftime(fin_train_final, inicio_train_final, units = "mins"))

  cat(paste("Final model trained in", round(tiempo_train_final, 1), "min\n\n"))

  lgb.save(final_model, "modelo.txt")

  tb_importancia <- as.data.table(lgb.importance(final_model))
  fwrite( tb_importancia,
    file= "impo.txt",
    sep= "\t"
  )

  # ==========================================================================
  # Scoring
  # ==========================================================================
  
  cat("Generating predictions...\n")
  
  PARAM$trainingstrategy$future <- c(202107)
  dfuture <- dataset[ foto_mes %in% PARAM$trainingstrategy$future ]

  prediccion <- predict(
    final_model,
    data.matrix(dfuture[, campos_buenos, with= FALSE])
  )

  tb_prediccion <- dfuture[, list(numero_de_cliente)]
  tb_prediccion[, prob := prediccion]
  
  fwrite(tb_prediccion,
    file= "prediccion.txt",
    sep= "\t"
  )

  # ==========================================================================
  # Gain Curve
  # ==========================================================================
  
  tb_prediccion[, clase_ternaria := dfuture$clase_ternaria ]
  tb_prediccion[, ganancia := -3000.0 ]
  tb_prediccion[clase_ternaria=="BAJA+2", ganancia := 117000.0 ]
  
  setorder( tb_prediccion, -prob )
  tb_prediccion[, gan_acum := cumsum(ganancia)]
  
  tb_prediccion[,
    gan_suavizada := frollmean(
      x= gan_acum,
      n= 400,
      align= "center",
      na.rm= TRUE,
      hasNA= TRUE
    )
  ]

  resultado <- list()
  resultado$ganancia_suavizada_max <- max( tb_prediccion$gan_suavizada, na.rm=TRUE )
  options(digits= 8)
  resultado$envios <- which.max( tb_prediccion$gan_suavizada)
  resultado$semilla <- PARAM$semilla_primigenia
  resultado$seed_idx <- seed_idx
  resultado$mejor_auc <- mejor_auc
  
  fwrite( tb_prediccion,
    file= "ganancias.txt",
    sep= "\t"
  )

  tb_prediccion[, envios:= .I]
  
  pdf("curva_de_ganancia.pdf")
  
  plot(
    x= tb_prediccion$envios,
    y= tb_prediccion$gan_acum,
    type= "l",
    col= "gray",
    xlim= c(0, 6000),
    ylim= c(0, 8000000),
    main= paste0("Seed ", seed_idx, " (IMPROVED) - Gain= ", as.integer(resultado$ganancia_suavizada_max), " mailings= ", resultado$envios),
    xlab= "Mailings",
    ylab= "Gain",
    panel.first= grid()
  )
  
  dev.off()

  if( !require("yaml")) install.packages("yaml")
  require("yaml")
  
  PARAM$resultado <- resultado
  
  write_yaml( PARAM, file="PARAM.yml")

  # ==========================================================================
  # Save the result and clean up
  # ==========================================================================
  
  if(!exists("resultados_totales")) resultados_totales <- list()
  resultados_totales[[seed_idx]] <- resultado

  fin_seed <- Sys.time()
  duracion_total <- as.numeric(difftime(fin_seed, inicio_seed, units = "mins"))

  rm(dataset, dfuture, dfinal_train, final_model, tb_prediccion)
  gc(full=TRUE, verbose=FALSE)

  cat("\n========================================\n")
  cat("Seed ", seed_idx, " completed in ", round(duracion_total, 1), " min\n")
  cat("   Gain: ", formatC(resultado$ganancia_suavizada_max, format="f", big.mark=",", digits=0), "\n")
  cat("   Mailings: ", resultado$envios, "\n")
  cat("========================================\n\n")

} # End of the loop over the seeds

cat("\n\n***************************************\n")
cat("ALL SEEDS PROCESSED\n")
cat("***************************************\n")

## Summary of Results for All Seeds
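
Each seed contributes two figures to the summary: the best gain on the gain curve and the number of mailings (`envios`) at which it is reached. The standalone Python sketch below shows the idea on a toy example: sort by predicted probability, add 117000 for every "BAJA+2" and subtract 3000 for every other client, then take the maximum of the cumulative sum. It deliberately omits the 400-row centered smoothing that the notebook applies before taking the maximum. The function name `best_cutoff`, the toy scores and the class label "OTHER" are hypothetical.

```python
from itertools import accumulate


def best_cutoff(scored, gain_hit=117000.0, cost_miss=-3000.0):
    """Find the best point of the (unsmoothed) gain curve.

    scored: list of (probability, true_class) pairs.
    Returns (maximum cumulative gain, number of clients contacted).
    """
    # contact clients from the highest to the lowest predicted probability
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    gains = [gain_hit if cls == "BAJA+2" else cost_miss for _, cls in ranked]
    cumulative = list(accumulate(gains))
    best = max(range(len(cumulative)), key=cumulative.__getitem__)
    return cumulative[best], best + 1


# hypothetical scored clients
toy = [
    (0.9, "BAJA+2"),
    (0.8, "OTHER"),
    (0.7, "BAJA+2"),
    (0.2, "OTHER"),
    (0.1, "BAJA+1"),
]
max_gain, n_sent = best_cutoff(toy)
print(max_gain, n_sent)
```

In this toy case the curve peaks after the third client, because every additional contact beyond that point only subtracts 3000.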

In [None]:
# Build the summary table
setwd("/content/buckets/b1/exp")

tb_resumen <- data.table(
  seed_idx = sapply(resultados_totales, function(x) x$seed_idx),
  semilla = sapply(resultados_totales, function(x) x$semilla),
  ganancia = sapply(resultados_totales, function(x) x$ganancia_suavizada_max),
  envios = sapply(resultados_totales, function(x) x$envios)
)

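# ties.method = "min" guarantees that at least one row has rank 1, even with tied gains
tb_resumen[, rank := rank(-ganancia, ties.method = "min")]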

cat("\n\n========================================\n")
cat("FINAL SUMMARY OF THE 5 SEEDS\n")
cat("(Nicolas Horn - IMPROVED with IPC + Trends)\n")
cat("========================================\n\n")
print(tb_resumen)

cat("\nSTATISTICS:\n")
cat("Mean gain: ", formatC(mean(tb_resumen$ganancia), format="f", big.mark=",", digits=0), "\n")
cat("Maximum gain: ", formatC(max(tb_resumen$ganancia), format="f", big.mark=",", digits=0), "\n")
cat("Minimum gain: ", formatC(min(tb_resumen$ganancia), format="f", big.mark=",", digits=0), "\n")
cat("Standard deviation: ", formatC(sd(tb_resumen$ganancia), format="f", big.mark=",", digits=0), "\n")
cat("Best seed: ", tb_resumen[rank==1, semilla], " (seed_idx ", tb_resumen[rank==1, seed_idx], ")\n")

# Save the summary
fwrite(tb_resumen, 
  file=paste0("resumen_5_seeds_exp", PARAM_GLOBAL$experimento_base, "_nicolas_horn_v2.txt"),
  sep="\t"
)

saveRDS(resultados_totales, 
  file=paste0("resultados_completos_exp", PARAM_GLOBAL$experimento_base, "_nicolas_horn_v2.rds")
)

cat("\nWORKFLOW COMPLETED\n")

In [None]:
format(Sys.time(), "%a %b %d %X %Y")