<img src="../img/mCIDaeNnb.png" alt="CiDAEN logo" align="right">

<br><br><br>
<h2><font color="#00586D" size=4>Master's Thesis</font></h2>

<h1><font color="#00586D" size=5>Analysis and Prediction of Clash Royale Match Results:<br><b>4. Data Preprocessing</b></font></h1>
<br><br><br>


<div align="right">
<font color="#00586D" size=3>Master's Degree in Data Science and Cloud Data Engineering</font><br>
<font color="#00586D" size=3>Universidad de Castilla-La Mancha</font><br>
</div>

<font color="#00586D" size=3>Iván Fernández García</font><br>
<font color="#00586D" size=3>Academic year 2024/2025</font><br>

---

<a id="indice"></a>
<h2><font color="#00586D" size=5>Contents</font></h2>


* [1. Introduction](#section1)
* [2. Preprocessing and feature engineering](#section2)
    * [2.1. Missing-value imputation](#section2_1)
    * [2.2. Feature creation](#section2_2)
    * [2.3. Variable selection](#section2_3)
    * [2.4. Discretization](#section2_4)
    * [2.5. Encoding](#section2_5)
    * [2.6. Scaling](#section2_6)
* [3. Preprocessing pipelines](#section3)
    * [3.1. First pipeline (Binary + Numeric)](#section3_1)
    * [3.2. Second pipeline (Binary + Differences)](#section3_2)
    * [3.3. Third pipeline (Binary + Numeric + Differences)](#section3_3)
    * [3.4. Fourth pipeline (Differences)](#section3_4)
    * [3.5. Fifth pipeline (Selected differences)](#section3_5)
* [4. Conclusions](#section4)

---

In [71]:
import pandas as pd
import numpy as np
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from joblib import dump

In [72]:
set_config(transform_output="pandas")

---

<a id="section1"></a>
## <font color="#00586D"> 1. Introduction</font>

After exploring our training set, it is time to preprocess the data. In this phase we will use the information and conclusions obtained during the analysis to implement the preprocessing steps that allow the algorithms to learn correctly and can improve model performance (feature creation and selection, missing-value imputation, categorical encoding, and so on).

To do this properly, we will encapsulate all these transformations in preprocessing *pipelines* that apply them in a fixed sequence. This avoids data leakage, because every fitted statistic (imputation means, scaling parameters) is learned only from the data passed to `fit`, and it means the changes are applied automatically instead of by hand.

First, we load the training data without the target variable, which makes it easier to obtain the column names and to check that the transformations behave correctly.

In [73]:
X = pd.read_csv("../data/final/train.csv", parse_dates=["battleTime"]).drop(columns="winner")

**Important:** The training set is never modified; these transformations are applied automatically when the models are trained and used (hence their usefulness). In this phase we simulate that behaviour by explicitly calling the `fit_transform` method to verify that the process works correctly in every case.

---

<a id="section2"></a>
## <font color="#00586D"> 2. Preprocessing and feature engineering</font>

Before making any decisions, we can group the predictor variables as follows:

* Match-level variables: `battleTime` and `arena`
* Player-level variables
    * Player information: `tag` (categorical), `name` (categorical) and `startingTrophies` (numeric)
    * Deck cards (binary): `hasKnight`, `hasArchers`...
    * Deck properties (numeric): `meanCardLevel`, `numCounters`...
    * Tower troop information: `supportCardName` (categorical), `supportCardRarity` (categorical) and `supportCardLevel` (numeric)

We have decided to create 5 preprocessing *pipelines*, intended for different learning algorithms. They differ mainly in feature selection, since the remaining steps (imputation, creation, discretization, encoding and scaling) are very similar.

* The first *pipeline* uses both the binary and the numeric variables of each player, so its output has nearly the same columns as the original dataset.

* The second *pipeline* also keeps the binary variables, but replaces each player's numeric variables with the differences between the two players. New variables, such as the difference in the scores, will also be added.

* The third *pipeline* combines the previous two: it keeps the binary variables and each player's numeric variables, and adds the differences as well. This lets us benefit from the added value of the differences without losing the information carried by the individual columns. New variables will also be added, both per player and as differences.

* The fourth *pipeline* keeps only the differences and discards all binary variables. Dimensionality is therefore reduced considerably, and it may work better than the previous ones for simpler models.

* The fifth *pipeline* keeps only the differences between numeric variables, and selects just a small number of them based on expert knowledge and on what was observed during the analysis (importance, correlations...).

This lets us try different options to find a balance between model performance and model complexity. These preprocessing *pipelines* will be saved for later use in the modelling phase, where each model will be a new *pipeline* with only two steps: `[[PREPROCESSING PIPELINE] => ESTIMATOR]`

<a id="section2_1"></a>
### <font color="#00586D"> 2.1. Missing-value imputation</font>

Although our training set contains no missing values, we define how to handle them to avoid problems with new examples that might lack some information for any reason.

Imputation is the first step because individual per-player variables are used when creating the new features (second step). The differences cannot be computed if the original variables contain missing values, so imputation must happen before they are created, not after. We therefore need to impute:

* The numeric variables needed to create the differences.
* Among the remaining variables, those that will not be removed in the third step (this depends on the *pipeline*).


Specifically, imputation is done as follows:
* Numeric variables: imputation with the mean.
* Binary variables: imputation with 0 (the card is assumed not to be in the deck).
* Categorical variables: imputation with the mode.

The rest do not need to be imputed, since we will discard them and the learning algorithms will never use them.
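
The reason for imputing before creating the differences can be seen in a small sketch. The values below are made up and the snippet is only an illustration of the ordering argument, not part of the pipelines.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing value for player 1 (values are made up).
toy = pd.DataFrame({
    "player1_meanCardLevel": [12.0, np.nan, 14.0],
    "player2_meanCardLevel": [11.0, 13.0, 13.0],
})

# Without imputation, the missing value propagates into the difference.
raw_diff = toy["player1_meanCardLevel"] - toy["player2_meanCardLevel"]

# Imputing first (mean strategy) makes the difference defined for every row.
imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)
imputed_diff = imputed["player1_meanCardLevel"] - imputed["player2_meanCardLevel"]
```

In the second row the missing level is replaced by the column mean (13.0), so the difference becomes 0.0 instead of `NaN`.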

<a id="section2_2"></a>
### <font color="#00586D"> 2.2. Feature creation</font>

Creating new features is a fundamental step. After noticing a symmetry in much of the data during the exploratory analysis, we decided to compute the differences between the numeric variables of the two players. We also studied other variables, such as rarity and balance scores for the decks, together with their differences.

We decided to create one new variable for each of these pairs, except for `airAdvantage`, which showed extremely poor importance and very low variance. To do so, we define the preprocessing functions needed to transform the columns. We also save them in a `preprocessing_functions.py` file so they can be imported in later phases when needed. This applies to all *pipelines*; the corresponding variables are discarded in the next step.

For `winConditionAdvantage`, `rarityScore` and `balanceScore` the function is more complex and takes an additional parameter, which lets us create the variable for each player, only the difference, or all three columns.

*Difference in the number of trophies before the match*

In [74]:
def create_diff_starting_trophies(df):
    return (df["player1_startingTrophies"] - df["player2_startingTrophies"]).to_frame(name="diff_startingTrophies")

*Difference in the mean level of the deck's cards*

In [75]:
def create_diff_mean_card_level(df):
    return (df["player1_meanCardLevel"] - df["player2_meanCardLevel"]).to_frame(name="diff_meanCardLevel")

*Difference in the minimum level of the deck's cards*

In [76]:
def create_diff_min_card_level(df):
    return (df["player1_minCardLevel"] - df["player2_minCardLevel"]).to_frame(name="diff_minCardLevel")

*Difference in the maximum level of the deck's cards*

In [77]:
def create_diff_max_card_level(df):
    return (df["player1_maxCardLevel"] - df["player2_maxCardLevel"]).to_frame(name="diff_maxCardLevel")

*Difference in the level of the deck's support card (crown tower troop)*

In [78]:
def create_diff_support_card_level(df):
    return (df["player1_supportCardLevel"] - df["player2_supportCardLevel"]).to_frame(name="diff_supportCardLevel")

*Difference in the total star points of the deck's cards*

In [79]:
def create_diff_total_star_level(df):
    return (df["player1_totalStarLevel"] - df["player2_totalStarLevel"]).to_frame(name="diff_totalStarLevel")

*Difference in the deck's mean elixir cost*

In [80]:
def create_diff_mean_elixir_cost(df):
    return (df["player1_meanElixirCost"] - df["player2_meanElixirCost"]).to_frame(name="diff_meanElixirCost")

*Difference in the number of evolved cards in the deck*

In [81]:
def create_diff_num_evolution_cards(df):
    return (df["player1_numEvolutionCards"] - df["player2_numEvolutionCards"]).to_frame(name="diff_numEvolutionCards")

*Difference in the number of "Win Condition" cards in the deck*

In [82]:
def create_diff_num_win_condition_cards(df):
    return (df["player1_numWinConditionCards"] - df["player2_numWinConditionCards"]).to_frame(name="diff_numWinConditionCards")

*Difference in the number of melee cards in the deck*

In [83]:
def create_diff_num_melee_cards(df):
    return (df["player1_numMeleeCards"] - df["player2_numMeleeCards"]).to_frame(name="diff_numMeleeCards")

*Difference in the number of ranged cards in the deck*

In [84]:
def create_diff_num_ranged_cards(df):
    return (df["player1_numRangedCards"] - df["player2_numRangedCards"]).to_frame(name="diff_numRangedCards")

*Difference in the number of air units in the deck*

In [85]:
def create_diff_num_air_cards(df):
    return (df["player1_numAirCards"] - df["player2_numAirCards"]).to_frame(name="diff_numAirCards")

*Difference in the number of anti-air cards in the deck*

In [86]:
def create_diff_num_anti_air_cards(df):
    return (df["player1_numAntiAirCards"] - df["player2_numAntiAirCards"]).to_frame(name="diff_numAntiAirCards")

*Difference in the number of cards with direct tower damage in the deck*

In [87]:
def create_diff_num_direct_damage_cards(df):
    return (df["player1_numDirectDamageCards"] - df["player2_numDirectDamageCards"]).to_frame(name="diff_numDirectDamageCards")

*Difference in the number of cards with splash damage in the deck*

In [88]:
def create_diff_num_splash_damage_cards(df):
    return (df["player1_numSplashDamageCards"] - df["player2_numSplashDamageCards"]).to_frame(name="diff_numSplashDamageCards")

*Difference in the number of cards that reset the opponent's attack in the deck*

In [89]:
def create_diff_num_reset_attack_cards(df):
    return (df["player1_numResetAttackCards"] - df["player2_numResetAttackCards"]).to_frame(name="diff_numResetAttackCards")

*Difference in the number of common cards in the deck*

In [90]:
def create_diff_num_common_cards(df):
    return (df["player1_numCommonCards"] - df["player2_numCommonCards"]).to_frame(name="diff_numCommonCards")

*Difference in the number of rare cards in the deck*

In [91]:
def create_diff_num_rare_cards(df):
    return (df["player1_numRareCards"] - df["player2_numRareCards"]).to_frame(name="diff_numRareCards")

*Difference in the number of epic cards in the deck*

In [92]:
def create_diff_num_epic_cards(df):
    return (df["player1_numEpicCards"] - df["player2_numEpicCards"]).to_frame(name="diff_numEpicCards")

*Difference in the number of legendary cards in the deck*

In [93]:
def create_diff_num_legendary_cards(df):
    return (df["player1_numLegendaryCards"] - df["player2_numLegendaryCards"]).to_frame(name="diff_numLegendaryCards")

*Difference in the number of champions in the deck*

In [94]:
def create_diff_num_champion_cards(df):
    return (df["player1_numChampionCards"] - df["player2_numChampionCards"]).to_frame(name="diff_numChampionCards")

*Difference in the number of troops in the deck*

In [95]:
def create_diff_num_troop_cards(df):
    return (df["player1_numTroopCards"] - df["player2_numTroopCards"]).to_frame(name="diff_numTroopCards")

*Difference in the number of buildings in the deck*

In [96]:
def create_diff_num_building_cards(df):
    return (df["player1_numBuildingCards"] - df["player2_numBuildingCards"]).to_frame(name="diff_numBuildingCards")

*Difference in the number of spells in the deck*

In [97]:
def create_diff_num_spell_cards(df):
    return (df["player1_numSpellCards"] - df["player2_numSpellCards"]).to_frame(name="diff_numSpellCards")

*Difference in the number of "counters" in the deck*

In [98]:
def create_diff_num_counters(df):
    return (df["player1_numCounters"] - df["player2_numCounters"]).to_frame(name="diff_numCounters")

*Difference in the number of uncountered cards in the deck*

In [99]:
def create_diff_num_uncountered_cards(df):
    return (df["player1_numUncounteredCards"] - df["player2_numUncounteredCards"]).to_frame(name="diff_numUncounteredCards")
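
The per-variable functions above all share one shape, so they could be collapsed into a single parameterized function. The sketch below is a hypothetical alternative, not what the notebook uses: `create_diff` and its `feature` parameter are names introduced here, and the toy values are made up. It relies on the same `kw_args` mechanism of `FunctionTransformer` that the notebook uses for the `output` parameter.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def create_diff(df, feature):
    """Return player1_<feature> - player2_<feature> as a one-column DataFrame."""
    diff = df[f"player1_{feature}"] - df[f"player2_{feature}"]
    return diff.to_frame(name=f"diff_{feature}")


# Toy data (values are made up).
toy = pd.DataFrame({
    "player1_numCounters": [3, 5],
    "player2_numCounters": [4, 2],
})

# The feature name is passed through kw_args, so one function serves every pair.
diff_num_counters = FunctionTransformer(create_diff, kw_args={"feature": "numCounters"})
result = diff_num_counters.fit_transform(toy)
```

Keeping `create_diff` as a module-level function (rather than a lambda or a closure) matters if the pipeline is later serialized, since pickling stores functions by reference.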

*"Win Condition" vs. "Buildings" advantage (per player, difference, or both)*

In [100]:
def create_win_condition_advantage(df, output):
    player1_win_condition_advantage = np.maximum(0, df["player1_numWinConditionCards"] - df["player2_numBuildingCards"])
    player2_win_condition_advantage = np.maximum(0, df["player2_numWinConditionCards"] - df["player1_numBuildingCards"])
    if output == "players":
        return pd.DataFrame({"player1_winConditionAdvantage": player1_win_condition_advantage, "player2_winConditionAdvantage": player2_win_condition_advantage})
    elif output == "diff":
        return (player1_win_condition_advantage - player2_win_condition_advantage).to_frame(name="diff_winConditionAdvantage")
    elif output == "all":
        return pd.DataFrame({
            "player1_winConditionAdvantage": player1_win_condition_advantage,
            "player2_winConditionAdvantage": player2_win_condition_advantage,
            "diff_winConditionAdvantage": (player1_win_condition_advantage - player2_win_condition_advantage)
        })
    else:
        raise ValueError("Invalid output type. Choose from 'players', 'diff', or 'all'.")
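
To make the definition concrete, the snippet below recomputes the same quantities by hand on two made-up matches. It restates the arithmetic of the function above inline so that it runs on its own; the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Two made-up matches.
toy = pd.DataFrame({
    "player1_numWinConditionCards": [2, 1],
    "player1_numBuildingCards": [3, 0],
    "player2_numWinConditionCards": [1, 3],
    "player2_numBuildingCards": [1, 2],
})

# Each player's advantage: own win conditions minus the rival's buildings, floored at 0.
player1_advantage = np.maximum(0, toy["player1_numWinConditionCards"] - toy["player2_numBuildingCards"])
player2_advantage = np.maximum(0, toy["player2_numWinConditionCards"] - toy["player1_numBuildingCards"])

# The difference is what output="diff" returns as diff_winConditionAdvantage.
diff_advantage = player1_advantage - player2_advantage
```

In the first match player 1 has 2 win conditions against 1 rival building (advantage 1), while player 2 has 1 win condition against 3 buildings (advantage floored at 0), so the difference is 1. With `output="players"` the function returns the first two series, with `output="diff"` only the last one, and with `output="all"` the three of them.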

*Deck rarity score (per player, difference, or both)*

In [101]:
def create_rarity_score(df, output):
    rarity_points = {"numCommonCards": 1, "numRareCards": 3, "numEpicCards": 5, "numLegendaryCards": 10, "numChampionCards": 20}
    player1_rarity_score = sum(df["player1_" + col] * points for col, points in rarity_points.items())
    player2_rarity_score = sum(df["player2_" + col] * points for col, points in rarity_points.items())
    if output == "players":
        return pd.DataFrame({"player1_rarityScore": player1_rarity_score, "player2_rarityScore": player2_rarity_score})
    elif output == "diff":
        return pd.DataFrame({"diff_rarityScore": player1_rarity_score - player2_rarity_score})
    elif output == "all":
        return pd.DataFrame({
            "player1_rarityScore": player1_rarity_score,
            "player2_rarityScore": player2_rarity_score,
            "diff_rarityScore": player1_rarity_score - player2_rarity_score
        })
    else:
        raise ValueError("Invalid output type. Choose 'players', 'diff', or 'all'.")
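
As a worked example of the weighting, the snippet below applies the same `rarity_points` to one made-up match with two 8-card decks. The deck compositions are assumptions chosen only to illustrate the arithmetic.

```python
import pandas as pd

# Same weights as in the function above.
rarity_points = {"numCommonCards": 1, "numRareCards": 3, "numEpicCards": 5, "numLegendaryCards": 10, "numChampionCards": 20}

# One made-up match: each deck has 8 cards.
toy = pd.DataFrame({
    "player1_numCommonCards": [3], "player1_numRareCards": [2], "player1_numEpicCards": [2],
    "player1_numLegendaryCards": [1], "player1_numChampionCards": [0],
    "player2_numCommonCards": [2], "player2_numRareCards": [2], "player2_numEpicCards": [2],
    "player2_numLegendaryCards": [1], "player2_numChampionCards": [1],
})

# Weighted sum of the card counts per rarity, for each player.
player1_score = sum(toy["player1_" + col] * points for col, points in rarity_points.items())
player2_score = sum(toy["player2_" + col] * points for col, points in rarity_points.items())
diff_score = player1_score - player2_score
```

Player 1 scores 3·1 + 2·3 + 2·5 + 1·10 + 0·20 = 29 and player 2 scores 2·1 + 2·3 + 2·5 + 1·10 + 1·20 = 48, so swapping a single common card for a champion moves the difference to -19.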

*Deck balance score (per player, difference, or both)*

In [102]:
def create_balance_score(df, output):
    def compute_balance_score(df, prefix):
        score = pd.Series(0.0, index=df.index)
        score += (df[f"{prefix}numWinConditionCards"] >= 1).astype(float) * 3
        score += (df[f"{prefix}numDirectDamageCards"] >= 1).astype(float) * 2
        score += (df[f"{prefix}numAntiAirCards"] >= 1).astype(float) * 1
        score += (df[f"{prefix}numSplashDamageCards"] >= 1).astype(float) * 1
        score += (df[f"{prefix}numResetAttackCards"] >= 1).astype(float) * 1
        score += (df[f"{prefix}numBuildingCards"] >= 1).astype(float) * 1
        score += ((df[f"{prefix}numMeleeCards"] >= 1) & (df[f"{prefix}numRangedCards"] >= 1)).astype(float) * 1
        score += (df[f"{prefix}numTroopCards"] >= 4).astype(float) * 1
        score += (df[f"{prefix}numAirCards"] >= 1).astype(float) * 1
        score += ((df[f"{prefix}meanElixirCost"] >= 2.5) & (df[f"{prefix}meanElixirCost"] <= 4.5)).astype(float) * 1
        score += (
            (df[f"{prefix}numTroopCards"] >= 1) &
            (df[f"{prefix}numSpellCards"] >= 1) &
            (df[f"{prefix}numBuildingCards"] >= 1)
        ).astype(float) * 1
        return score
    player1_balance_score = compute_balance_score(df, "player1_")
    player2_balance_score = compute_balance_score(df, "player2_")
    if output == "players":
        return pd.DataFrame({"player1_balanceScore": player1_balance_score, "player2_balanceScore": player2_balance_score})
    elif output == "diff":
        return pd.DataFrame({"diff_balanceScore": player1_balance_score - player2_balance_score})
    elif output == "all":
        return pd.DataFrame({
            "player1_balanceScore": player1_balance_score,
            "player2_balanceScore": player2_balance_score,
            "diff_balanceScore": player1_balance_score - player2_balance_score
        })
    else:
        raise ValueError("Invalid output type. Choose 'players', 'diff', or 'all'.")

In this first approach we have decided to keep both players' binary card variables as they are.

Another option is to take their differences, but then we could not tell apart the case where neither player uses a card from the case where both do. In future iterations, we could study combining them so that they take four possible values:
* `0` would mean that neither player uses the card.
* `1` would mean that only the first player uses the card.
* `2` would mean that only the second player uses the card.
* `3` would mean that both players use the card.
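
A minimal sketch of how such a four-value encoding could be computed is shown here. It is hypothetical and not applied in this notebook: the column name `combined_hasKnight` and the toy rows are assumptions made for the illustration.

```python
import pandas as pd

# The four possible combinations for one card (made-up rows).
toy = pd.DataFrame({
    "player1_hasKnight": [0, 1, 0, 1],
    "player2_hasKnight": [0, 0, 1, 1],
})

# player1 contributes 1 and player2 contributes 2, so the sum is unique per case:
# 0 = neither, 1 = only player 1, 2 = only player 2, 3 = both.
combined_hasKnight = toy["player1_hasKnight"] + 2 * toy["player2_hasKnight"]
```

Unlike a plain difference, which maps both "neither" and "both" to 0, this keeps the four cases distinct. The result is a nominal code rather than a magnitude, so it would probably need its own encoding step before being fed to most models.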

As mentioned, none of these transformations is applied in this first approach; only the differences between numeric variables are created.

<a id="section2_3"></a>
### <font color="#00586D"> 2.3. Variable selection</font>

The following variables are removed **in all *pipelines***:

* The tags and names of both players: They are completely irrelevant to the problem we want to solve.

* `battleTime`: All matches belong to the period after the April 9 balance changes, so gameplay does not change and we can drop it.

* `arena` and `startingTrophies` (per player or as a difference, depending on the *pipeline*): The dataset contains matches between two players paired by the game, with a minimal trophy difference and in the same arena (unless one of them is right at the boundary, in which case the match is played in the arena of the player with more trophies). However, we also want to use the model to predict "friendly" matches between any two players, who need not strictly meet the conditions required to be matched by the game. Although these variables can affect the result, keeping them could make the model behave worse: in a newly created record, the trophy difference could be far larger than anything seen during training. We have therefore chosen to discard features that might improve performance on our data but degrade it in a use case we expect to be common after deployment, in exchange for learning patterns from other variables that are not affected by this. To keep them, we would ideally also have friendly matches (without level restrictions or special game modes) in the data acquired in the first phase of the project. Since that is not the case, we must adapt to the problem and decide accordingly.

* `numSpellCards` (per player or as a difference, depending on the *pipeline*): Removed because of its high correlation with the number of direct-tower-damage cards and with the number of troops.

* `supportCardRarity` (for both players): With only four support cards, we consider that knowing their rarity in addition to their name adds too little value (although two of them do share the same rarity). Moreover, these are categorical variables that we would have to encode, so dimensionality would grow by six columns rather than two.

In addition, the second *pipeline* also drops each player's numeric variables, since they are replaced by the differences. The same applies to the fourth *pipeline*, which additionally drops all binary variables and `supportCardName` for both players.

Finally, the fifth *pipeline* also drops:
* The binary variables of both players and `supportCardName` for both players.
* Some differences, based on expert knowledge or on what was observed during the analysis, so that only 20 features remain:
    * `diff_numDirectDamageCards`: Low importance and high correlation with the difference in the number of troops.
    * `diff_numBuildingCards`: Low importance.
    * `diff_numChampionCards`: Low importance.
    * `diff_minCardLevel`: We already have other variables that measure card level.
    * `diff_maxCardLevel`: We already have other variables that measure card level.
    * `diff_numUncounteredCards`: High correlation with the difference in the number of *counters*.
    * `diff_winConditionAdvantage`: Low importance and high correlation with the difference in the number of *Win Condition* cards.
<a id="section2_4"></a>
### <font color="#00586D"> 2.4. Discretization</font>

In this first approach, no numeric variable is discretized in any preprocessing *pipeline*.

<a id="section2_5"></a>
### <font color="#00586D"> 2.5. Encoding</font>

The categorical variables must be encoded. We apply *One-Hot* encoding to:
* `player1_supportCardName`
* `player2_supportCardName`

This step is not applied in the last two *pipelines*, since these variables are discarded beforehand.

<a id="section2_6"></a>
### <font color="#00586D"> 2.6. Scaling</font>

Finally, we standardize the corresponding numeric variables. The idea is for every model to try each of the *pipelines*. Since we want to try some algorithms that work better with scaling, and scaling does not hurt the performance of those for which it is less necessary, we add it in all cases.

---

<a id="section3"></a>
## <font color="#00586D"> 3. Preprocessing pipelines</font>

We now implement the *pipelines* using the transformers and tools provided by `Scikit-learn`.

The approach is to use one `ColumnTransformer` per preprocessing step, applying the required transformations to the corresponding columns sequentially.

<a id="section3_1"></a>
### <font color="#00586D"> 3.1. First pipeline (Binary + Numeric)</font>

We will call it `preprocessing_bin_num`, and it will contain the steps defined above.

##### *Imputation*

We must impute all binary variables, as well as the numeric and categorical variables that are kept.

In [103]:
original_features_to_drop = [
    "battleTime", "arena", "player1_tag", "player2_tag", "player1_name", "player2_name", "player1_supportCardRarity", "player2_supportCardRarity",
    "player1_startingTrophies", "player2_startingTrophies", "player1_numSpellCards", "player2_numSpellCards",
]

categorical_features_to_impute = ["player1_supportCardName", "player2_supportCardName"]
binary_features_to_impute = [col for col in X.columns if col.startswith("player1_has") or col.startswith("player2_has")]
numeric_features_to_impute = [col for col in X.columns if col not in binary_features_to_impute + original_features_to_drop + categorical_features_to_impute]

imputer = ColumnTransformer(
    transformers=[
        ("numeric_imputer", SimpleImputer(strategy="mean"), numeric_features_to_impute),
        ("binary_imputer", SimpleImputer(strategy="constant", fill_value=0), binary_features_to_impute),
        ("categorical_imputer", SimpleImputer(strategy="most_frequent"), categorical_features_to_impute)
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

imputer

##### *Creation*

We create the three new variables for both players using the function's additional parameter. A `ColumnTransformer` outputs only what each transformer returns for the columns it receives, so we keep the original columns through an explicitly declared `"passthrough"` transformer.

In [104]:
win_condition_advantage_features = [
    "player1_numWinConditionCards", "player1_numBuildingCards",
    "player2_numWinConditionCards", "player2_numBuildingCards"
]

rarity_score_features = [
    "player1_numCommonCards", "player1_numRareCards", "player1_numEpicCards", "player1_numLegendaryCards", "player1_numChampionCards",
    "player2_numCommonCards", "player2_numRareCards", "player2_numEpicCards", "player2_numLegendaryCards", "player2_numChampionCards"
]

balance_score_features = [
    "player1_numWinConditionCards", "player1_numDirectDamageCards", "player1_numAntiAirCards", "player1_numSplashDamageCards",
    "player1_numResetAttackCards", "player1_numBuildingCards", "player1_numMeleeCards", "player1_numRangedCards", "player1_numTroopCards",
    "player1_numAirCards", "player1_meanElixirCost", "player1_numSpellCards",
    "player2_numWinConditionCards", "player2_numDirectDamageCards", "player2_numAntiAirCards", "player2_numSplashDamageCards",
    "player2_numResetAttackCards", "player2_numBuildingCards", "player2_numMeleeCards", "player2_numRangedCards", "player2_numTroopCards",
    "player2_numAirCards", "player2_meanElixirCost", "player2_numSpellCards"
]

creator = ColumnTransformer(
    transformers=[
        ("players_winConditionAdvantage", FunctionTransformer(create_win_condition_advantage, kw_args={"output": "players"}), win_condition_advantage_features),
        ("players_rarityScore", FunctionTransformer(create_rarity_score, kw_args={"output": "players"}), rarity_score_features),
        ("players_balanceScore", FunctionTransformer(create_balance_score, kw_args={"output": "players"}), balance_score_features),
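        # dict.fromkeys removes duplicates while keeping a deterministic column order (a set does not).
        ("passthrough", "passthrough", list(dict.fromkeys(win_condition_advantage_features + rarity_score_features + balance_score_features))),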
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

creator

##### *Removal*

We drop only the variables common to all *pipelines*, which in this case are exactly the ones we did not impute.

In [105]:
features_to_drop = original_features_to_drop

drop = ColumnTransformer(
    transformers=[("drop", "drop", features_to_drop)],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

drop

##### *Encoding*

We apply *One-Hot* encoding to the remaining categorical variables.

In [106]:
features_to_encode = ["player1_supportCardName", "player2_supportCardName"]

encoder = ColumnTransformer(
    transformers=[
        ("encoder", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), features_to_encode),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

encoder

##### *Scaling*

We standardize the numeric variables kept up to this point, which here are the imputed numeric variables plus the new attributes:

In [107]:
features_to_scale = numeric_features_to_impute + ["player1_winConditionAdvantage", "player2_winConditionAdvantage", "player1_rarityScore", "player2_rarityScore", "player1_balanceScore", "player2_balanceScore"]

scaler = ColumnTransformer(
    transformers=[("scaler", StandardScaler(), features_to_scale)],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

scaler

##### *Building the pipeline*

We define our `preprocessing_bin_num` *pipeline* and display it:

In [108]:
preprocessing_bin_num = Pipeline([
    ("imputer", imputer),
    ("creator", creator),
    ("drop", drop),
    ("encoder", encoder),
    ("scaler", scaler)
])

preprocessing_bin_num

We check that the transformations are correct **(the dataset is never modified)**:

In [109]:
preprocessing_bin_num.fit_transform(X).describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| player1_meanCardLevel | 50377.0 | 1.060659e-16 | 1.000010 | -4.319144 | -0.549881 | 0.246442 | 0.777324 | 1.255118 |
| player1_minCardLevel | 50377.0 | 5.867475e-17 | 1.000010 | -3.715671 | -0.702904 | 0.050288 | 0.803480 | 1.556672 |
| player1_maxCardLevel | 50377.0 | -1.236965e-16 | 1.000010 | -4.611621 | -0.525579 | 0.382430 | 0.836435 | 0.836435 |
| player1_totalStarLevel | 50377.0 | 5.585385e-17 | 1.000010 | -1.388289 | -0.846722 | -0.124632 | 0.777981 | 2.583206 |
| player1_meanElixirCost | 50377.0 | 1.158544e-15 | 1.000010 | -4.266617 | -0.739172 | -0.005743 | 0.727686 | 6.595119 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| player2_hasHealSpirit | 50377.0 | 3.076801e-03 | 0.055384 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| player2_hasGiantSnowball | 50377.0 | 1.927467e-02 | 0.137490 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| player2_hasRoyalDelivery | 50377.0 | 1.298211e-02 | 0.113198 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| player2_hasVoid | 50377.0 | 3.791413e-03 | 0.061458 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |


Finally, we save it:

In [110]:
dump(preprocessing_bin_num, "../pipelines/preprocessing/preprocessing_bin_num.joblib");
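
A saved pipeline can later be restored with `joblib.load`. The round trip is sketched below on a toy pipeline written to a temporary directory; the file name and the toy data are assumptions, and the real pipelines would be loaded from the path used above.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy fitted pipeline (values are made up).
toy = pd.DataFrame({"diff_meanCardLevel": [-1.0, 0.0, 1.0]})
toy_pipeline = Pipeline([("scaler", StandardScaler())]).fit(toy)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    dump(toy_pipeline, path)
    restored = load(path)

# The restored pipeline produces the same output as the original one.
same_output = np.allclose(
    np.asarray(restored.transform(toy)),
    np.asarray(toy_pipeline.transform(toy)),
)
```

One point to keep in mind for the real pipelines: functions wrapped in a `FunctionTransformer` are pickled by reference, so the module that defines them (here, `preprocessing_functions.py`) has to be importable in the session that loads the file.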

<a id="section3_2"></a>
### <font color="#00586D"> 3.2. Second pipeline (Binary + Differences)</font>

We will call it `preprocessing_bin_diff`, and it will contain the steps defined above.

##### *Imputation*

We impute as in the previous case, except that now we do include the per-player trophies and spells: these columns are consumed by the transformation, and it is the resulting differences that we will discard.

In [111]:
original_features_to_drop = ["battleTime", "arena", "player1_tag", "player2_tag", "player1_name", "player2_name", "player1_supportCardRarity", "player2_supportCardRarity"]

categorical_features_to_impute = ["player1_supportCardName", "player2_supportCardName"]
binary_features_to_impute = [col for col in X.columns if col.startswith("player1_has") or col.startswith("player2_has")]
numeric_features_to_impute = [col for col in X.columns if col not in binary_features_to_impute + original_features_to_drop + categorical_features_to_impute]

imputer = ColumnTransformer(
    transformers=[
        ("numeric_imputer", SimpleImputer(strategy="mean"), numeric_features_to_impute),
        ("binary_imputer", SimpleImputer(strategy="constant", fill_value=0), binary_features_to_impute),
        ("categorical_imputer", SimpleImputer(strategy="most_frequent"), categorical_features_to_impute)
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

imputer

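##### *Feature creation*

In this case we do create every difference. The original per-player variables are discarded automatically after the transformation, since the columns passed to each transformer are replaced by its output and no longer belong to the remainder.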

In [112]:
win_condition_advantage_features = [
    "player1_numWinConditionCards", "player1_numBuildingCards",
    "player2_numWinConditionCards", "player2_numBuildingCards"
]

rarity_score_features = [
    "player1_numCommonCards", "player1_numRareCards", "player1_numEpicCards", "player1_numLegendaryCards", "player1_numChampionCards",
    "player2_numCommonCards", "player2_numRareCards", "player2_numEpicCards", "player2_numLegendaryCards", "player2_numChampionCards"
]

balance_score_features = [
    "player1_numWinConditionCards", "player1_numDirectDamageCards", "player1_numAntiAirCards", "player1_numSplashDamageCards",
    "player1_numResetAttackCards", "player1_numBuildingCards", "player1_numMeleeCards", "player1_numRangedCards", "player1_numTroopCards",
    "player1_numAirCards", "player1_meanElixirCost", "player1_numSpellCards",
    "player2_numWinConditionCards", "player2_numDirectDamageCards", "player2_numAntiAirCards", "player2_numSplashDamageCards",
    "player2_numResetAttackCards", "player2_numBuildingCards", "player2_numMeleeCards", "player2_numRangedCards", "player2_numTroopCards",
    "player2_numAirCards", "player2_meanElixirCost", "player2_numSpellCards"
]

creator = ColumnTransformer(
    transformers=[
        ("diff_startingTrophies", FunctionTransformer(create_diff_starting_trophies), ["player1_startingTrophies", "player2_startingTrophies"]),
        ("diff_meanCardLevel", FunctionTransformer(create_diff_mean_card_level), ["player1_meanCardLevel", "player2_meanCardLevel"]),
        ("diff_minCardLevel", FunctionTransformer(create_diff_min_card_level), ["player1_minCardLevel", "player2_minCardLevel"]),
        ("diff_maxCardLevel", FunctionTransformer(create_diff_max_card_level), ["player1_maxCardLevel", "player2_maxCardLevel"]),
        ("diff_supportCardLevel", FunctionTransformer(create_diff_support_card_level), ["player1_supportCardLevel", "player2_supportCardLevel"]),
        ("diff_totalStarLevel", FunctionTransformer(create_diff_total_star_level), ["player1_totalStarLevel", "player2_totalStarLevel"]),
        ("diff_meanElixirCost", FunctionTransformer(create_diff_mean_elixir_cost), ["player1_meanElixirCost", "player2_meanElixirCost"]),
        ("diff_numEvolutionCards", FunctionTransformer(create_diff_num_evolution_cards), ["player1_numEvolutionCards", "player2_numEvolutionCards"]),
        ("diff_numWinConditionCards", FunctionTransformer(create_diff_num_win_condition_cards), ["player1_numWinConditionCards", "player2_numWinConditionCards"]),
        ("diff_numMeleeCards", FunctionTransformer(create_diff_num_melee_cards), ["player1_numMeleeCards", "player2_numMeleeCards"]),
        ("diff_numRangedCards", FunctionTransformer(create_diff_num_ranged_cards), ["player1_numRangedCards", "player2_numRangedCards"]),
        ("diff_numAirCards", FunctionTransformer(create_diff_num_air_cards), ["player1_numAirCards", "player2_numAirCards"]),
        ("diff_numAntiAirCards", FunctionTransformer(create_diff_num_anti_air_cards), ["player1_numAntiAirCards", "player2_numAntiAirCards"]),
        ("diff_numDirectDamageCards", FunctionTransformer(create_diff_num_direct_damage_cards), ["player1_numDirectDamageCards", "player2_numDirectDamageCards"]),
        ("diff_numSplashDamageCards", FunctionTransformer(create_diff_num_splash_damage_cards), ["player1_numSplashDamageCards", "player2_numSplashDamageCards"]),
        ("diff_numResetAttackCards", FunctionTransformer(create_diff_num_reset_attack_cards), ["player1_numResetAttackCards", "player2_numResetAttackCards"]),
        ("diff_numCommonCards", FunctionTransformer(create_diff_num_common_cards), ["player1_numCommonCards", "player2_numCommonCards"]),
        ("diff_numRareCards", FunctionTransformer(create_diff_num_rare_cards), ["player1_numRareCards", "player2_numRareCards"]),
        ("diff_numEpicCards", FunctionTransformer(create_diff_num_epic_cards), ["player1_numEpicCards", "player2_numEpicCards"]),
        ("diff_numLegendaryCards", FunctionTransformer(create_diff_num_legendary_cards), ["player1_numLegendaryCards", "player2_numLegendaryCards"]),
        ("diff_numChampionCards", FunctionTransformer(create_diff_num_champion_cards), ["player1_numChampionCards", "player2_numChampionCards"]),
        ("diff_numTroopCards", FunctionTransformer(create_diff_num_troop_cards), ["player1_numTroopCards", "player2_numTroopCards"]),
        ("diff_numBuildingCards", FunctionTransformer(create_diff_num_building_cards), ["player1_numBuildingCards", "player2_numBuildingCards"]),
        ("diff_numSpellCards", FunctionTransformer(create_diff_num_spell_cards), ["player1_numSpellCards", "player2_numSpellCards"]),
        ("diff_numCounters", FunctionTransformer(create_diff_num_counters), ["player1_numCounters", "player2_numCounters"]),
        ("diff_numUncounteredCards", FunctionTransformer(create_diff_num_uncountered_cards), ["player1_numUncounteredCards", "player2_numUncounteredCards"]),
        ("diff_winConditionAdvantage", FunctionTransformer(create_win_condition_advantage, kw_args={"output": "diff"}), win_condition_advantage_features),
        ("diff_rarityScore", FunctionTransformer(create_rarity_score, kw_args={"output": "diff"}), rarity_score_features),
        ("diff_balanceScore", FunctionTransformer(create_balance_score, kw_args={"output": "diff"}), balance_score_features)
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

creator

##### *Removal*

Besides the per-player variables already removed by creating the differences, we must drop the remaining unwanted features (this time including the trophy and spell differences):

In [113]:
features_to_drop = [
    "battleTime", "arena", "diff_startingTrophies", "diff_numSpellCards",
    "player1_tag", "player1_name", "player1_supportCardRarity",
    "player2_tag", "player2_name", "player2_supportCardRarity"
]

drop = ColumnTransformer(
    transformers=[("drop", "drop", features_to_drop)],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

drop

##### *Encoding*

We encode exactly as in the previous case.

In [114]:
features_to_encode = ["player1_supportCardName", "player2_supportCardName"]

encoder = ColumnTransformer(
    transformers=[
        ("encoder", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), features_to_encode),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

encoder

##### *Scaling*

We standardize the data, but now the numeric variables are the differences retained up to this step.

In [115]:
features_to_scale = [
    "diff_meanCardLevel", "diff_minCardLevel", "diff_maxCardLevel", "diff_supportCardLevel", "diff_totalStarLevel", "diff_meanElixirCost",
    "diff_numEvolutionCards", "diff_numWinConditionCards", "diff_numMeleeCards", "diff_numRangedCards", "diff_numAirCards", "diff_numAntiAirCards",
    "diff_numDirectDamageCards", "diff_numSplashDamageCards", "diff_numResetAttackCards", "diff_numCommonCards", "diff_numRareCards",
    "diff_numEpicCards", "diff_numLegendaryCards", "diff_numChampionCards", "diff_numTroopCards", "diff_numBuildingCards", "diff_numCounters",
    "diff_numUncounteredCards", "diff_winConditionAdvantage", "diff_rarityScore", "diff_balanceScore"
]

scaler = ColumnTransformer(
    transformers=[("scaler", StandardScaler(), features_to_scale)],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

scaler

##### *Building the pipeline*

We define our `preprocessing_bin_diff` *pipeline* and display it:

In [116]:
preprocessing_bin_diff = Pipeline([
    ("imputer", imputer), 
    ("creator", creator),
    ("drop", drop),
    ("encoder", encoder),
    ("scaler", scaler)
])

preprocessing_bin_diff

We check that the transformations are correct **(the dataset itself is never modified)**:

In [117]:
preprocessing_bin_diff.fit_transform(X).describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
diff_meanCardLevel,50377.0,-1.082521e-17,1.000010,-7.112444,-0.461603,0.037210,0.536023,12.008724
diff_minCardLevel,50377.0,3.529653e-17,1.000010,-9.319226,-0.676481,0.043748,0.763977,9.406722
diff_maxCardLevel,50377.0,-2.179146e-17,1.000010,-6.558934,-0.242069,-0.242069,0.810742,10.286039
diff_supportCardLevel,50377.0,-3.547283e-17,1.000010,-7.562536,-0.055942,-0.055942,1.016429,12.812504
diff_totalStarLevel,50377.0,4.330084e-17,1.000010,-4.128092,-0.429829,-0.040539,0.543397,3.852369
...,...,...,...,...,...,...,...,...
player2_hasHealSpirit,50377.0,3.076801e-03,0.055384,0.000000,0.000000,0.000000,0.000000,1.000000
player2_hasGiantSnowball,50377.0,1.927467e-02,0.137490,0.000000,0.000000,0.000000,0.000000,1.000000
player2_hasRoyalDelivery,50377.0,1.298211e-02,0.113198,0.000000,0.000000,0.000000,0.000000,1.000000
player2_hasVoid,50377.0,3.791413e-03,0.061458,0.000000,0.000000,0.000000,0.000000,1.000000
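
The scaled columns report a standard deviation of 1.000010 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `describe()` reports the sample standard deviation (`ddof=1`), so the two differ by a factor of $\sqrt{n/(n-1)}$. The sketch below reproduces the effect on synthetic data with the same number of rows.

```python
import numpy as np

n = 50377  # number of rows, as reported in the `count` column
rng = np.random.default_rng(0)
x = rng.normal(size=n)  # synthetic values, not the real data

# Standardize the way StandardScaler does: population standard deviation (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

sample_std = z.std(ddof=1)       # the estimator used by DataFrame.describe()
expected = np.sqrt(n / (n - 1))  # ratio between the two estimators
print(round(float(sample_std), 6), round(float(expected), 6))
```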


Finally, we save it:

In [118]:
dump(preprocessing_bin_diff, "../pipelines/preprocessing/preprocessing_bin_diff.joblib");
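
A saved pipeline can later be restored with `joblib.load`. The sketch below shows the round trip on a toy pipeline written to a temporary directory; the file name is hypothetical. One caveat to keep in mind: functions wrapped in a `FunctionTransformer` are pickled by reference, so the `create_*` functions must be importable (or defined again) in the session that loads the file.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({"diff_meanCardLevel": [-1.0, 0.0, 1.0]})
toy_pipeline = Pipeline([("scaler", StandardScaler())]).fit(toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_pipeline.joblib"  # hypothetical file name
    dump(toy_pipeline, path)
    restored = load(path)

# The restored pipeline keeps its fitted statistics, so transform() is enough.
same = np.allclose(restored.transform(toy), toy_pipeline.transform(toy))
print(same)
```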

<a id="section3_3"></a>
### <font color="#00586D"> 3.3. Third pipeline (Binary + Numeric + Differences)</font>

We will call it `preprocessing_bin_num_diff`; it follows the steps defined earlier.

##### *Imputation*

We can impute in the same way, although we leave out trophies and spells because their differences will not even be created.

In [119]:
original_features_to_drop = [
    "battleTime", "arena", "player1_tag", "player2_tag", "player1_name", "player2_name", "player1_supportCardRarity", "player2_supportCardRarity",
    "player1_startingTrophies", "player2_startingTrophies", "player1_numSpellCards", "player2_numSpellCards"
]

categorical_features_to_impute = ["player1_supportCardName", "player2_supportCardName"]
binary_features_to_impute = [col for col in X.columns if col.startswith("player1_has") or col.startswith("player2_has")]
numeric_features_to_impute = [col for col in X.columns if col not in binary_features_to_impute + original_features_to_drop + categorical_features_to_impute]

imputer = ColumnTransformer(
    transformers=[
        ("numeric_imputer", SimpleImputer(strategy="mean"), numeric_features_to_impute),
        ("binary_imputer", SimpleImputer(strategy="constant", fill_value=0), binary_features_to_impute),
        ("categorical_imputer", SimpleImputer(strategy="most_frequent"), categorical_features_to_impute)
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

imputer

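##### *Feature creation*

We create all the differences (except those for trophies and spells), but in this case the original variables must be kept after the transformation, which is why they are also listed in an explicit `passthrough` transformer. We also create the new per-player variables.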

In [120]:
win_condition_advantage_features = [
    "player1_numWinConditionCards", "player1_numBuildingCards",
    "player2_numWinConditionCards", "player2_numBuildingCards"
]

rarity_score_features = [
    "player1_numCommonCards", "player1_numRareCards", "player1_numEpicCards", "player1_numLegendaryCards", "player1_numChampionCards",
    "player2_numCommonCards", "player2_numRareCards", "player2_numEpicCards", "player2_numLegendaryCards", "player2_numChampionCards"
]

balance_score_features = [
    "player1_numWinConditionCards", "player1_numDirectDamageCards", "player1_numAntiAirCards", "player1_numSplashDamageCards",
    "player1_numResetAttackCards", "player1_numBuildingCards", "player1_numMeleeCards", "player1_numRangedCards", "player1_numTroopCards",
    "player1_numAirCards", "player1_meanElixirCost", "player1_numSpellCards",
    "player2_numWinConditionCards", "player2_numDirectDamageCards", "player2_numAntiAirCards", "player2_numSplashDamageCards",
    "player2_numResetAttackCards", "player2_numBuildingCards", "player2_numMeleeCards", "player2_numRangedCards", "player2_numTroopCards",
    "player2_numAirCards", "player2_meanElixirCost", "player2_numSpellCards"
]

numeric_features_to_keep = [
    "player1_meanCardLevel", "player2_meanCardLevel", "player1_minCardLevel", "player2_minCardLevel",
    "player1_maxCardLevel", "player2_maxCardLevel", "player1_supportCardLevel", "player2_supportCardLevel",
    "player1_totalStarLevel", "player2_totalStarLevel", "player1_meanElixirCost", "player2_meanElixirCost",
    "player1_numEvolutionCards", "player2_numEvolutionCards", "player1_numWinConditionCards", "player2_numWinConditionCards",
    "player1_numMeleeCards", "player2_numMeleeCards", "player1_numRangedCards", "player2_numRangedCards",
    "player1_numAirCards", "player2_numAirCards", "player1_numAntiAirCards", "player2_numAntiAirCards",
    "player1_numDirectDamageCards", "player2_numDirectDamageCards", "player1_numSplashDamageCards", "player2_numSplashDamageCards",
    "player1_numResetAttackCards", "player2_numResetAttackCards", "player1_numCommonCards", "player2_numCommonCards",
    "player1_numRareCards", "player2_numRareCards", "player1_numEpicCards", "player2_numEpicCards",
    "player1_numLegendaryCards", "player2_numLegendaryCards", "player1_numChampionCards", "player2_numChampionCards",
    "player1_numTroopCards", "player2_numTroopCards", "player1_numBuildingCards", "player2_numBuildingCards",
    "player1_numCounters", "player2_numCounters", "player1_numUncounteredCards", "player2_numUncounteredCards"
]

creator = ColumnTransformer(
    transformers=[
        ("diff_meanCardLevel", FunctionTransformer(create_diff_mean_card_level), ["player1_meanCardLevel", "player2_meanCardLevel"]),
        ("diff_minCardLevel", FunctionTransformer(create_diff_min_card_level), ["player1_minCardLevel", "player2_minCardLevel"]),
        ("diff_maxCardLevel", FunctionTransformer(create_diff_max_card_level), ["player1_maxCardLevel", "player2_maxCardLevel"]),
        ("diff_supportCardLevel", FunctionTransformer(create_diff_support_card_level), ["player1_supportCardLevel", "player2_supportCardLevel"]),
        ("diff_totalStarLevel", FunctionTransformer(create_diff_total_star_level), ["player1_totalStarLevel", "player2_totalStarLevel"]),
        ("diff_meanElixirCost", FunctionTransformer(create_diff_mean_elixir_cost), ["player1_meanElixirCost", "player2_meanElixirCost"]),
        ("diff_numEvolutionCards", FunctionTransformer(create_diff_num_evolution_cards), ["player1_numEvolutionCards", "player2_numEvolutionCards"]),
        ("diff_numWinConditionCards", FunctionTransformer(create_diff_num_win_condition_cards), ["player1_numWinConditionCards", "player2_numWinConditionCards"]),
        ("diff_numMeleeCards", FunctionTransformer(create_diff_num_melee_cards), ["player1_numMeleeCards", "player2_numMeleeCards"]),
        ("diff_numRangedCards", FunctionTransformer(create_diff_num_ranged_cards), ["player1_numRangedCards", "player2_numRangedCards"]),
        ("diff_numAirCards", FunctionTransformer(create_diff_num_air_cards), ["player1_numAirCards", "player2_numAirCards"]),
        ("diff_numAntiAirCards", FunctionTransformer(create_diff_num_anti_air_cards), ["player1_numAntiAirCards", "player2_numAntiAirCards"]),
        ("diff_numDirectDamageCards", FunctionTransformer(create_diff_num_direct_damage_cards), ["player1_numDirectDamageCards", "player2_numDirectDamageCards"]),
        ("diff_numSplashDamageCards", FunctionTransformer(create_diff_num_splash_damage_cards), ["player1_numSplashDamageCards", "player2_numSplashDamageCards"]),
        ("diff_numResetAttackCards", FunctionTransformer(create_diff_num_reset_attack_cards), ["player1_numResetAttackCards", "player2_numResetAttackCards"]),
        ("diff_numCommonCards", FunctionTransformer(create_diff_num_common_cards), ["player1_numCommonCards", "player2_numCommonCards"]),
        ("diff_numRareCards", FunctionTransformer(create_diff_num_rare_cards), ["player1_numRareCards", "player2_numRareCards"]),
        ("diff_numEpicCards", FunctionTransformer(create_diff_num_epic_cards), ["player1_numEpicCards", "player2_numEpicCards"]),
        ("diff_numLegendaryCards", FunctionTransformer(create_diff_num_legendary_cards), ["player1_numLegendaryCards", "player2_numLegendaryCards"]),
        ("diff_numChampionCards", FunctionTransformer(create_diff_num_champion_cards), ["player1_numChampionCards", "player2_numChampionCards"]),
        ("diff_numTroopCards", FunctionTransformer(create_diff_num_troop_cards), ["player1_numTroopCards", "player2_numTroopCards"]),
        ("diff_numBuildingCards", FunctionTransformer(create_diff_num_building_cards), ["player1_numBuildingCards", "player2_numBuildingCards"]),
        ("diff_numCounters", FunctionTransformer(create_diff_num_counters), ["player1_numCounters", "player2_numCounters"]),
        ("diff_numUncounteredCards", FunctionTransformer(create_diff_num_uncountered_cards), ["player1_numUncounteredCards", "player2_numUncounteredCards"]),
        ("diff_winConditionAdvantage", FunctionTransformer(create_win_condition_advantage, kw_args={"output": "all"}), win_condition_advantage_features),
        ("diff_rarityScore", FunctionTransformer(create_rarity_score, kw_args={"output": "all"}), rarity_score_features),
        ("diff_balanceScore", FunctionTransformer(create_balance_score, kw_args={"output": "all"}), balance_score_features),
        ("passthrough", "passthrough", numeric_features_to_keep)
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

creator
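
The `kw_args={"output": ...}` argument is what switches a score function between returning only the difference and returning the per-player scores as well. The function below is a hypothetical, simplified stand-in for `create_rarity_score` (the real scoring rule is defined earlier in the notebook and is not reproduced here); it only illustrates how `FunctionTransformer` forwards the keyword.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def toy_rarity_score(X, output="diff"):
    """Hypothetical score: legendary plus champion cards per player."""
    p1 = X["player1_numLegendaryCards"] + X["player1_numChampionCards"]
    p2 = X["player2_numLegendaryCards"] + X["player2_numChampionCards"]
    scores = pd.DataFrame({
        "player1_rarityScore": p1,
        "player2_rarityScore": p2,
        "diff_rarityScore": p1 - p2,
    })
    # "diff" keeps only the difference; any other value keeps all three columns.
    return scores[["diff_rarityScore"]] if output == "diff" else scores


toy = pd.DataFrame({
    "player1_numLegendaryCards": [2, 0],
    "player1_numChampionCards": [1, 0],
    "player2_numLegendaryCards": [1, 1],
    "player2_numChampionCards": [0, 1],
})

only_diff = FunctionTransformer(toy_rarity_score, kw_args={"output": "diff"}).fit_transform(toy)
all_scores = FunctionTransformer(toy_rarity_score, kw_args={"output": "all"}).fit_transform(toy)
print(list(only_diff.columns), list(all_scores.columns))
```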

##### *Removal*

We drop the common variables, this time including the per-player trophies. Their differences were never created, and the spell counts were already discarded when a score was created from them, because we did not ask to keep them:

In [121]:
features_to_drop = [
    "battleTime", "arena",
    "player1_tag", "player1_name", "player1_supportCardRarity",
    "player2_tag", "player2_name", "player2_supportCardRarity",
    "player1_startingTrophies", "player2_startingTrophies"
]

drop = ColumnTransformer(
    transformers=[("drop", "drop", features_to_drop)],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

drop

##### *Encoding*

Here we also encode in the same way.

In [122]:
features_to_encode = ["player1_supportCardName", "player2_supportCardName"]

encoder = ColumnTransformer(
    transformers=[
        ("encoder", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), features_to_encode),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

encoder

##### *Scaling*

Finally, we standardize the numeric variables. We must include both the per-player ones (including the new attributes) and the retained differences.

In [123]:
features_to_scale = [
    "diff_meanCardLevel", "diff_minCardLevel", "diff_maxCardLevel", "diff_supportCardLevel", "diff_totalStarLevel", "diff_meanElixirCost",
    "diff_numEvolutionCards", "diff_numWinConditionCards", "diff_numMeleeCards", "diff_numRangedCards", "diff_numAirCards", "diff_numAntiAirCards",
    "diff_numDirectDamageCards", "diff_numSplashDamageCards", "diff_numResetAttackCards", "diff_numCommonCards", "diff_numRareCards",
    "diff_numEpicCards", "diff_numLegendaryCards", "diff_numChampionCards", "diff_numTroopCards", "diff_numBuildingCards", "diff_numCounters",
    "diff_numUncounteredCards", "diff_winConditionAdvantage", "diff_rarityScore", "diff_balanceScore"
] + numeric_features_to_keep + ["player1_winConditionAdvantage", "player2_winConditionAdvantage", "player1_rarityScore", "player2_rarityScore", "player1_balanceScore", "player2_balanceScore"]

scaler = ColumnTransformer(
    transformers=[("scaler", StandardScaler(), features_to_scale)],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

scaler

##### *Building the pipeline*

We define our `preprocessing_bin_num_diff` *pipeline* and display it:

In [124]:
preprocessing_bin_num_diff = Pipeline([
    ("imputer", imputer), 
    ("creator", creator),
    ("drop", drop),
    ("encoder", encoder),
    ("scaler", scaler)
])

preprocessing_bin_num_diff

We check that the transformations are correct **(the dataset itself is never modified)**:

In [125]:
preprocessing_bin_num_diff.fit_transform(X).describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
diff_meanCardLevel,50377.0,-1.082521e-17,1.000010,-7.112444,-0.461603,0.037210,0.536023,12.008724
diff_minCardLevel,50377.0,3.529653e-17,1.000010,-9.319226,-0.676481,0.043748,0.763977,9.406722
diff_maxCardLevel,50377.0,-2.179146e-17,1.000010,-6.558934,-0.242069,-0.242069,0.810742,10.286039
diff_supportCardLevel,50377.0,-3.547283e-17,1.000010,-7.562536,-0.055942,-0.055942,1.016429,12.812504
diff_totalStarLevel,50377.0,4.330084e-17,1.000010,-4.128092,-0.429829,-0.040539,0.543397,3.852369
...,...,...,...,...,...,...,...,...
player2_hasHealSpirit,50377.0,3.076801e-03,0.055384,0.000000,0.000000,0.000000,0.000000,1.000000
player2_hasGiantSnowball,50377.0,1.927467e-02,0.137490,0.000000,0.000000,0.000000,0.000000,1.000000
player2_hasRoyalDelivery,50377.0,1.298211e-02,0.113198,0.000000,0.000000,0.000000,0.000000,1.000000
player2_hasVoid,50377.0,3.791413e-03,0.061458,0.000000,0.000000,0.000000,0.000000,1.000000


Finally, we save it:

In [126]:
dump(preprocessing_bin_num_diff, "../pipelines/preprocessing/preprocessing_bin_num_diff.joblib");

<a id="section3_4"></a>
### <font color="#00586D"> 3.4. Fourth pipeline (Differences)</font>

We will call it `preprocessing_only_diff`; it follows the steps defined earlier.

##### *Imputation*

We follow the same approach, although we now leave out the binary variables because they will be discarded anyway. We could reuse the previous imputer, but it would impute the binary variables needlessly.

In [127]:
original_features_to_drop = ["battleTime", "arena", "player1_tag", "player2_tag", "player1_name", "player2_name", "player1_supportCardRarity", "player2_supportCardRarity"]
binary_features = [col for col in X.columns if col.startswith("player1_has") or col.startswith("player2_has")]

categorical_features_to_impute = ["player1_supportCardName", "player2_supportCardName"]
numeric_features_to_impute = [col for col in X.columns if col not in binary_features + original_features_to_drop + categorical_features_to_impute]

imputer = ColumnTransformer(
    transformers=[
        ("numeric_imputer", SimpleImputer(strategy="mean"), numeric_features_to_impute),
        ("categorical_imputer", SimpleImputer(strategy="most_frequent"), categorical_features_to_impute)
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

imputer

##### *Feature creation*

We use the transformer from the second *pipeline*, since we want to turn the per-player variables into differences without keeping the former.

In [128]:
win_condition_advantage_features = [
    "player1_numWinConditionCards", "player1_numBuildingCards",
    "player2_numWinConditionCards", "player2_numBuildingCards"
]

rarity_score_features = [
    "player1_numCommonCards", "player1_numRareCards", "player1_numEpicCards", "player1_numLegendaryCards", "player1_numChampionCards",
    "player2_numCommonCards", "player2_numRareCards", "player2_numEpicCards", "player2_numLegendaryCards", "player2_numChampionCards"
]

balance_score_features = [
    "player1_numWinConditionCards", "player1_numDirectDamageCards", "player1_numAntiAirCards", "player1_numSplashDamageCards",
    "player1_numResetAttackCards", "player1_numBuildingCards", "player1_numMeleeCards", "player1_numRangedCards", "player1_numTroopCards",
    "player1_numAirCards", "player1_meanElixirCost", "player1_numSpellCards",
    "player2_numWinConditionCards", "player2_numDirectDamageCards", "player2_numAntiAirCards", "player2_numSplashDamageCards",
    "player2_numResetAttackCards", "player2_numBuildingCards", "player2_numMeleeCards", "player2_numRangedCards", "player2_numTroopCards",
    "player2_numAirCards", "player2_meanElixirCost", "player2_numSpellCards"
]

creator = ColumnTransformer(
    transformers=[
        ("diff_startingTrophies", FunctionTransformer(create_diff_starting_trophies), ["player1_startingTrophies", "player2_startingTrophies"]),
        ("diff_meanCardLevel", FunctionTransformer(create_diff_mean_card_level), ["player1_meanCardLevel", "player2_meanCardLevel"]),
        ("diff_minCardLevel", FunctionTransformer(create_diff_min_card_level), ["player1_minCardLevel", "player2_minCardLevel"]),
        ("diff_maxCardLevel", FunctionTransformer(create_diff_max_card_level), ["player1_maxCardLevel", "player2_maxCardLevel"]),
        ("diff_supportCardLevel", FunctionTransformer(create_diff_support_card_level), ["player1_supportCardLevel", "player2_supportCardLevel"]),
        ("diff_totalStarLevel", FunctionTransformer(create_diff_total_star_level), ["player1_totalStarLevel", "player2_totalStarLevel"]),
        ("diff_meanElixirCost", FunctionTransformer(create_diff_mean_elixir_cost), ["player1_meanElixirCost", "player2_meanElixirCost"]),
        ("diff_numEvolutionCards", FunctionTransformer(create_diff_num_evolution_cards), ["player1_numEvolutionCards", "player2_numEvolutionCards"]),
        ("diff_numWinConditionCards", FunctionTransformer(create_diff_num_win_condition_cards), ["player1_numWinConditionCards", "player2_numWinConditionCards"]),
        ("diff_numMeleeCards", FunctionTransformer(create_diff_num_melee_cards), ["player1_numMeleeCards", "player2_numMeleeCards"]),
        ("diff_numRangedCards", FunctionTransformer(create_diff_num_ranged_cards), ["player1_numRangedCards", "player2_numRangedCards"]),
        ("diff_numAirCards", FunctionTransformer(create_diff_num_air_cards), ["player1_numAirCards", "player2_numAirCards"]),
        ("diff_numAntiAirCards", FunctionTransformer(create_diff_num_anti_air_cards), ["player1_numAntiAirCards", "player2_numAntiAirCards"]),
        ("diff_numDirectDamageCards", FunctionTransformer(create_diff_num_direct_damage_cards), ["player1_numDirectDamageCards", "player2_numDirectDamageCards"]),
        ("diff_numSplashDamageCards", FunctionTransformer(create_diff_num_splash_damage_cards), ["player1_numSplashDamageCards", "player2_numSplashDamageCards"]),
        ("diff_numResetAttackCards", FunctionTransformer(create_diff_num_reset_attack_cards), ["player1_numResetAttackCards", "player2_numResetAttackCards"]),
        ("diff_numCommonCards", FunctionTransformer(create_diff_num_common_cards), ["player1_numCommonCards", "player2_numCommonCards"]),
        ("diff_numRareCards", FunctionTransformer(create_diff_num_rare_cards), ["player1_numRareCards", "player2_numRareCards"]),
        ("diff_numEpicCards", FunctionTransformer(create_diff_num_epic_cards), ["player1_numEpicCards", "player2_numEpicCards"]),
        ("diff_numLegendaryCards", FunctionTransformer(create_diff_num_legendary_cards), ["player1_numLegendaryCards", "player2_numLegendaryCards"]),
        ("diff_numChampionCards", FunctionTransformer(create_diff_num_champion_cards), ["player1_numChampionCards", "player2_numChampionCards"]),
        ("diff_numTroopCards", FunctionTransformer(create_diff_num_troop_cards), ["player1_numTroopCards", "player2_numTroopCards"]),
        ("diff_numBuildingCards", FunctionTransformer(create_diff_num_building_cards), ["player1_numBuildingCards", "player2_numBuildingCards"]),
        ("diff_numSpellCards", FunctionTransformer(create_diff_num_spell_cards), ["player1_numSpellCards", "player2_numSpellCards"]),
        ("diff_numCounters", FunctionTransformer(create_diff_num_counters), ["player1_numCounters", "player2_numCounters"]),
        ("diff_numUncounteredCards", FunctionTransformer(create_diff_num_uncountered_cards), ["player1_numUncounteredCards", "player2_numUncounteredCards"]),
        ("diff_winConditionAdvantage", FunctionTransformer(create_win_condition_advantage, kw_args={"output": "diff"}), win_condition_advantage_features),
        ("diff_rarityScore", FunctionTransformer(create_rarity_score, kw_args={"output": "diff"}), rarity_score_features),
        ("diff_balanceScore", FunctionTransformer(create_balance_score, kw_args={"output": "diff"}), balance_score_features)
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

creator

##### *Removal*

We also drop all the binary variables for the deck cards, as well as the variables for the tower troop cards:

In [129]:
features_to_drop = [
    "battleTime", "arena", "diff_startingTrophies", "diff_numSpellCards",
    "player1_tag", "player1_name", "player1_supportCardRarity", "player1_supportCardName",
    "player2_tag", "player2_name", "player2_supportCardRarity", "player2_supportCardName",
] + [col for col in X.columns if col.startswith("player1_has") or col.startswith("player2_has")]

drop = ColumnTransformer(
    transformers=[("drop", "drop", features_to_drop)],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

drop

##### *Encoding*

All remaining variables are numeric, so there are no categorical variables to encode.

##### *Scaling*

We standardize the retained numeric variables, which are all the differences except those for trophies and spells.

In [130]:
features_to_scale = [
    "diff_meanCardLevel", "diff_minCardLevel", "diff_maxCardLevel", "diff_supportCardLevel", "diff_totalStarLevel", "diff_meanElixirCost",
    "diff_numEvolutionCards", "diff_numWinConditionCards", "diff_numMeleeCards", "diff_numRangedCards", "diff_numAirCards", "diff_numAntiAirCards",
    "diff_numDirectDamageCards", "diff_numSplashDamageCards", "diff_numResetAttackCards", "diff_numCommonCards", "diff_numRareCards",
    "diff_numEpicCards", "diff_numLegendaryCards", "diff_numChampionCards", "diff_numTroopCards", "diff_numBuildingCards", "diff_numCounters",
    "diff_numUncounteredCards", "diff_winConditionAdvantage", "diff_rarityScore", "diff_balanceScore"
]

scaler = ColumnTransformer(
    transformers=[("scaler", StandardScaler(), features_to_scale)],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

scaler

##### *Building the pipeline*

We define our `preprocessing_only_diff` *pipeline* and display it:

In [131]:
preprocessing_only_diff = Pipeline([
    ("imputer", imputer), 
    ("creator", creator),
    ("drop", drop),
    ("scaler", scaler)
])

preprocessing_only_diff

We check that the transformations are correct **(the dataset itself is never modified)**:

In [132]:
preprocessing_only_diff.fit_transform(X).describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
diff_meanCardLevel,50377.0,-1.082521e-17,1.000010,-7.112444,-0.461603,0.037210,0.536023,12.008724
diff_minCardLevel,50377.0,3.529653e-17,1.000010,-9.319226,-0.676481,0.043748,0.763977,9.406722
diff_maxCardLevel,50377.0,-2.179146e-17,1.000010,-6.558934,-0.242069,-0.242069,0.810742,10.286039
diff_supportCardLevel,50377.0,-3.547283e-17,1.000010,-7.562536,-0.055942,-0.055942,1.016429,12.812504
diff_totalStarLevel,50377.0,4.330084e-17,1.000010,-4.128092,-0.429829,-0.040539,0.543397,3.852369
diff_meanElixirCost,50377.0,3.568440e-17,1.000010,-4.556927,-0.620136,0.064524,0.578018,4.685975
diff_numEvolutionCards,50377.0,-4.640383e-17,1.000010,-2.220944,-0.173802,-0.173802,0.849769,1.873340
diff_numWinConditionCards,50377.0,1.398109e-17,1.000010,-4.253623,-1.059649,0.005009,1.069667,4.263641
diff_numMeleeCards,50377.0,-1.622018e-17,1.000010,-3.947573,-0.709311,-0.061658,0.585995,3.824258
diff_numRangedCards,50377.0,-3.723590e-17,1.000010,-4.439500,-0.803220,-0.075964,0.651292,3.560317


Finally, we save it:

In [133]:
dump(preprocessing_only_diff, "../pipelines/preprocessing/preprocessing_only_diff.joblib");

<a id="section3_5"></a>
### <font color="#00586D"> 3.5. Fifth pipeline (Selected differences)</font>

We will call it `preprocessing_selected_diff`, and it will include the steps defined earlier.

##### *Imputation*

In this case we will discard both the binary and the categorical variables. Some of the differences will be discarded too, but we still need to impute all the individual numeric variables so that every difference can be created first and the unwanted ones removed afterwards.

In [134]:
# Original columns that are dropped later in the pipeline
original_features_to_drop = ["battleTime", "arena", "player1_tag", "player2_tag", "player1_name", "player2_name", "player1_supportCardRarity", "player2_supportCardRarity"]
# Binary columns (names starting with player1_has / player2_has)
binary_features = [col for col in X.columns if col.startswith("player1_has") or col.startswith("player2_has")]
# Categorical columns
categorical_features = ["player1_supportCardName", "player2_supportCardName"]

# Every remaining column is an individual numeric variable and is mean-imputed
features_to_impute = [col for col in X.columns if col not in binary_features + original_features_to_drop + categorical_features]

imputer = ColumnTransformer(
    transformers=[("imputer", SimpleImputer(strategy="mean"), features_to_impute)],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

imputer

##### *Creation*

We will reuse the transformer from the previous *pipeline*.

Although we are going to discard a few more differences, we create all of them here so that the individual per-player variables are dropped in this step (the `ColumnTransformer` only outputs the newly created columns plus the passthrough remainder). In the next step we will remove whatever else needs to go.

In [135]:
win_condition_advantage_features = [
    "player1_numWinConditionCards", "player1_numBuildingCards",
    "player2_numWinConditionCards", "player2_numBuildingCards"
]

rarity_score_features = [
    "player1_numCommonCards", "player1_numRareCards", "player1_numEpicCards", "player1_numLegendaryCards", "player1_numChampionCards",
    "player2_numCommonCards", "player2_numRareCards", "player2_numEpicCards", "player2_numLegendaryCards", "player2_numChampionCards"
]

balance_score_features = [
    "player1_numWinConditionCards", "player1_numDirectDamageCards", "player1_numAntiAirCards", "player1_numSplashDamageCards",
    "player1_numResetAttackCards", "player1_numBuildingCards", "player1_numMeleeCards", "player1_numRangedCards", "player1_numTroopCards",
    "player1_numAirCards", "player1_meanElixirCost", "player1_numSpellCards",
    "player2_numWinConditionCards", "player2_numDirectDamageCards", "player2_numAntiAirCards", "player2_numSplashDamageCards",
    "player2_numResetAttackCards", "player2_numBuildingCards", "player2_numMeleeCards", "player2_numRangedCards", "player2_numTroopCards",
    "player2_numAirCards", "player2_meanElixirCost", "player2_numSpellCards"
]

creator = ColumnTransformer(
    transformers=[
        ("diff_startingTrophies", FunctionTransformer(create_diff_starting_trophies), ["player1_startingTrophies", "player2_startingTrophies"]),
        ("diff_meanCardLevel", FunctionTransformer(create_diff_mean_card_level), ["player1_meanCardLevel", "player2_meanCardLevel"]),
        ("diff_minCardLevel", FunctionTransformer(create_diff_min_card_level), ["player1_minCardLevel", "player2_minCardLevel"]),
        ("diff_maxCardLevel", FunctionTransformer(create_diff_max_card_level), ["player1_maxCardLevel", "player2_maxCardLevel"]),
        ("diff_supportCardLevel", FunctionTransformer(create_diff_support_card_level), ["player1_supportCardLevel", "player2_supportCardLevel"]),
        ("diff_totalStarLevel", FunctionTransformer(create_diff_total_star_level), ["player1_totalStarLevel", "player2_totalStarLevel"]),
        ("diff_meanElixirCost", FunctionTransformer(create_diff_mean_elixir_cost), ["player1_meanElixirCost", "player2_meanElixirCost"]),
        ("diff_numEvolutionCards", FunctionTransformer(create_diff_num_evolution_cards), ["player1_numEvolutionCards", "player2_numEvolutionCards"]),
        ("diff_numWinConditionCards", FunctionTransformer(create_diff_num_win_condition_cards), ["player1_numWinConditionCards", "player2_numWinConditionCards"]),
        ("diff_numMeleeCards", FunctionTransformer(create_diff_num_melee_cards), ["player1_numMeleeCards", "player2_numMeleeCards"]),
        ("diff_numRangedCards", FunctionTransformer(create_diff_num_ranged_cards), ["player1_numRangedCards", "player2_numRangedCards"]),
        ("diff_numAirCards", FunctionTransformer(create_diff_num_air_cards), ["player1_numAirCards", "player2_numAirCards"]),
        ("diff_numAntiAirCards", FunctionTransformer(create_diff_num_anti_air_cards), ["player1_numAntiAirCards", "player2_numAntiAirCards"]),
        ("diff_numDirectDamageCards", FunctionTransformer(create_diff_num_direct_damage_cards), ["player1_numDirectDamageCards", "player2_numDirectDamageCards"]),
        ("diff_numSplashDamageCards", FunctionTransformer(create_diff_num_splash_damage_cards), ["player1_numSplashDamageCards", "player2_numSplashDamageCards"]),
        ("diff_numResetAttackCards", FunctionTransformer(create_diff_num_reset_attack_cards), ["player1_numResetAttackCards", "player2_numResetAttackCards"]),
        ("diff_numCommonCards", FunctionTransformer(create_diff_num_common_cards), ["player1_numCommonCards", "player2_numCommonCards"]),
        ("diff_numRareCards", FunctionTransformer(create_diff_num_rare_cards), ["player1_numRareCards", "player2_numRareCards"]),
        ("diff_numEpicCards", FunctionTransformer(create_diff_num_epic_cards), ["player1_numEpicCards", "player2_numEpicCards"]),
        ("diff_numLegendaryCards", FunctionTransformer(create_diff_num_legendary_cards), ["player1_numLegendaryCards", "player2_numLegendaryCards"]),
        ("diff_numChampionCards", FunctionTransformer(create_diff_num_champion_cards), ["player1_numChampionCards", "player2_numChampionCards"]),
        ("diff_numTroopCards", FunctionTransformer(create_diff_num_troop_cards), ["player1_numTroopCards", "player2_numTroopCards"]),
        ("diff_numBuildingCards", FunctionTransformer(create_diff_num_building_cards), ["player1_numBuildingCards", "player2_numBuildingCards"]),
        ("diff_numSpellCards", FunctionTransformer(create_diff_num_spell_cards), ["player1_numSpellCards", "player2_numSpellCards"]),
        ("diff_numCounters", FunctionTransformer(create_diff_num_counters), ["player1_numCounters", "player2_numCounters"]),
        ("diff_numUncounteredCards", FunctionTransformer(create_diff_num_uncountered_cards), ["player1_numUncounteredCards", "player2_numUncounteredCards"]),
        ("diff_winConditionAdvantage", FunctionTransformer(create_win_condition_advantage, kw_args={"output": "diff"}), win_condition_advantage_features),
        ("diff_rarityScore", FunctionTransformer(create_rarity_score, kw_args={"output": "diff"}), rarity_score_features),
        ("diff_balanceScore", FunctionTransformer(create_balance_score, kw_args={"output": "diff"}), balance_score_features)
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

creator

##### *Removal*

We remove all the original variables and some of the differences, keeping only the 20 selected features (all of them numeric differences):

In [136]:
features_to_drop = [
    "battleTime", "arena", "diff_startingTrophies", "diff_numSpellCards",
    "player1_tag", "player1_name", "player1_supportCardRarity", "player1_supportCardName",
    "player2_tag", "player2_name", "player2_supportCardRarity", "player2_supportCardName",
    "diff_numDirectDamageCards", "diff_numBuildingCards", "diff_numChampionCards", "diff_minCardLevel",
    "diff_maxCardLevel", "diff_numUncounteredCards", "diff_winConditionAdvantage"
] + [col for col in X.columns if col.startswith("player1_has") or col.startswith("player2_has")]

drop = ColumnTransformer(
    transformers=[("drop", "drop", features_to_drop)],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

drop

##### *Encoding*

All the variables are numeric, so there are no categorical variables to encode.

##### *Scaling*

We will standardize the numeric variables, which now include fewer differences:

In [137]:
features_to_scale = [
    "diff_meanCardLevel", "diff_rarityScore", "diff_totalStarLevel", "diff_meanElixirCost", "diff_numCounters",
    "diff_balanceScore", "diff_numRareCards", "diff_numSplashDamageCards", "diff_numEpicCards", "diff_numMeleeCards",
    "diff_numAntiAirCards", "diff_numCommonCards", "diff_numRangedCards", "diff_numLegendaryCards", "diff_numAirCards",
    "diff_numTroopCards", "diff_numEvolutionCards", "diff_numResetAttackCards", "diff_supportCardLevel", "diff_numWinConditionCards"
]

scaler = ColumnTransformer(
    transformers=[("scaler", StandardScaler(), features_to_scale)],
    remainder="passthrough",
    verbose_feature_names_out=False,
    force_int_remainder_cols=False
)

scaler

##### *Pipeline creation*

We define our *pipeline* `preprocessing_selected_diff` and display it:

In [138]:
preprocessing_selected_diff = Pipeline([
    ("imputer", imputer), 
    ("creator", creator),
    ("drop", drop),
    ("scaler", scaler)
])

preprocessing_selected_diff

We check that the transformations are correct **(the original dataset is never modified at any point)**:

In [139]:
preprocessing_selected_diff.fit_transform(X).describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| diff_meanCardLevel | 50377.0 | -1.0825210000000001e-17 | 1.00001 | -7.112444 | -0.461603 | 0.03721 | 0.536023 | 12.008724 |
| diff_rarityScore | 50377.0 | -1.8194810000000002e-17 | 1.00001 | -4.130145 | -0.634726 | -0.02975 | 0.642446 | 4.205083 |
| diff_totalStarLevel | 50377.0 | 4.3300840000000005e-17 | 1.00001 | -4.128092 | -0.429829 | -0.040539 | 0.543397 | 3.852369 |
| diff_meanElixirCost | 50377.0 | 3.5684400000000004e-17 | 1.00001 | -4.556927 | -0.620136 | 0.064524 | 0.578018 | 4.685975 |
| diff_numCounters | 50377.0 | -1.5656e-17 | 1.00001 | -6.258982 | -0.609998 | -0.00475 | 0.600499 | 4.231988 |
| diff_balanceScore | 50377.0 | -5.0494130000000004e-17 | 1.00001 | -3.671952 | -0.659309 | 0.093852 | 0.470432 | 3.859655 |
| diff_numRareCards | 50377.0 | 1.3117190000000002e-17 | 1.00001 | -3.856456 | -0.643147 | -0.000485 | 0.642177 | 4.498149 |
| diff_numSplashDamageCards | 50377.0 | 9.73211e-18 | 1.00001 | -4.073183 | -0.634576 | 0.053145 | 0.740867 | 4.179473 |
| diff_numEpicCards | 50377.0 | -1.128361e-17 | 1.00001 | -4.233399 | -0.639564 | -0.040592 | 0.558381 | 3.553243 |
| diff_numMeleeCards | 50377.0 | -1.622018e-17 | 1.00001 | -3.947573 | -0.709311 | -0.061658 | 0.585995 | 3.824258 |


Finally, we save it:

In [140]:
dump(preprocessing_selected_diff, "../pipelines/preprocessing/preprocessing_selected_diff.joblib");

---

<a id="section4"></a>
## <font color="#00586D"> 4. Conclusions</font>

Previously, we analyzed our training dataset, which gave us a better understanding of the distribution of the variables and the relationships between them. In this phase we have implemented several transformations, or preprocessing steps, that must precede the use of learning algorithms, either because the algorithms require a particular format (for example, encoding) or because the earlier analysis led to decisions that we expect to improve model performance (for example, feature creation or selection).

Because we have a large number of features, mainly as a result of how decks are encoded so that they can be interpreted as sets of cards in which order does not matter, we had the opportunity to create several preprocessing *pipelines* whose main difference lies in feature selection (binary + numeric, binary + differences, differences only...).

For the implementation we used the `Pipeline` class from *Scikit-learn* with one `ColumnTransformer` per preprocessing step. This is not the only way to do it, but we considered it the most suitable. In addition, these steps changed depending on the variables to be selected, and we tried to handle those variants as efficiently as possible. All of this gave us a better command of this kind of transformer and the chance to explore several options for solving the problem.

This methodology lets us encapsulate all the necessary steps, so that they do not have to be applied manually and we avoid accidentally introducing data leakage. Creating models that use these *pipelines* is as simple as building one *pipeline* per model with two steps: one of these preprocessing *pipelines* followed by a classifier.
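
The two-step structure could look like the sketch below. Everything in it is illustrative: the data are synthetic, `toy_preprocessing` stands in for one of the saved preprocessing *pipelines*, and `LogisticRegression` is an arbitrary choice of classifier, not necessarily one used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic toy data (hypothetical): the label depends mostly on the first feature
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "diff_meanCardLevel": rng.normal(size=200),
    "diff_numCounters": rng.normal(size=200),
})
y_toy = (X_toy["diff_meanCardLevel"] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Stand-in for one of the preprocessing pipelines built in this notebook
toy_preprocessing = Pipeline([("scaler", StandardScaler())])

# One pipeline per model: preprocessing followed by a classifier
model = Pipeline([
    ("preprocessing", toy_preprocessing),
    ("classifier", LogisticRegression()),
])

# During cross-validation the preprocessing is re-fitted on each training fold only,
# which is what protects against leakage from the validation fold
scores = cross_val_score(model, X_toy, y_toy, cv=5)
print(scores.mean())
```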

This is a long phase that concentrates much of the technical work. It is common to come back to it even during modeling, and the decisions taken here are only a first approximation. As future improvements, many more options could be tried (discretization, discarding other variables, combining the binary variables before encoding them, etc.). The number of possibilities is immense.

---