<a href="https://colab.research.google.com/github/maxiuboldi/test_ml/blob/main/test_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Digitas - DS Exam**

## Dependencies

In [None]:
%pip install \
    category-encoders==2.6.2 \
    cloudpickle==3.0.0 \
    colorama==0.4.6 \
    contourpy==1.1.1 \
    cycler==0.12.1 \
    feature-engine==1.6.2 \
    fonttools==4.43.1 \
    jinja2==3.1.2 \
    joblib==1.3.2 \
    kiwisolver==1.4.5 \
    lightgbm==4.1.0 \
    llvmlite==0.41.1 \
    markupsafe==2.1.3 \
    matplotlib==3.8.0 \
    numba==0.58.1 \
    numpy==1.26.1 \
    packaging==23.2 \
    pandas==2.1.1 \
    patsy==0.5.3 \
    pillow==10.1.0 \
    pyparsing==3.1.1 \
    python-dateutil==2.8.2 \
    pytz==2023.3.post1 \
    scikit-learn==1.3.2 \
    scikit-plot==0.3.7 \
    scipy==1.11.3 \
    seaborn==0.13.0 \
    setuptools-scm==8.0.4 \
    setuptools==68.2.2 \
    shap==0.43.0 \
    six==1.16.0 \
    slicer==0.0.7 \
    statsmodels==0.14.0 \
    threadpoolctl==3.2.0 \
    tomli==2.0.1 \
    tqdm==4.66.1 \
    typing-extensions==4.8.0 \
    tzdata==2023.3 \
    xgboost==2.0.1

In [None]:
from typing import Dict, List
import time
import warnings

from tqdm import tqdm
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import seaborn as sns

from sklearn.metrics import (
    average_precision_score,
    roc_auc_score,
    roc_curve,
    ConfusionMatrixDisplay
)
from sklearn.calibration import CalibrationDisplay
from scikitplot.helpers import binary_ks_curve
from scikitplot.metrics import plot_ks_statistic
from sklearn.model_selection import train_test_split
import shap

from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.exceptions import ConvergenceWarning
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import StratifiedKFold

from category_encoders import CatBoostEncoder
from feature_engine.imputation import EndTailImputer
from feature_engine.wrappers import SklearnTransformerWrapper

In [None]:
dataset = pd.read_csv('FlightDelays_Data_3.0[67].csv')

In [None]:
dataset

In [None]:
dataset.describe(include='all').T

In [None]:
dataset.dtypes

A few variables have missing values, but only a handful. Since the same gaps could plausibly appear once the model is deployed, these rows are kept and the values are imputed later.

One record, however, is also missing the target variable. Since this is the target, that record is excluded.
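
As a minimal sketch of that check, the snippet below counts missing values per column and drops only the rows without a target. The `toy` frame and its values are made up for illustration; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flight table; the values are made up.
toy = pd.DataFrame({
    'Distance': [500.0, np.nan, 1200.0, 800.0],
    'UniqueCarrier': ['AA', 'UA', None, 'DL'],
    'Canceled': [0.0, 1.0, 0.0, np.nan],
})

# Missing values per column, most affected first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Rows without a target cannot be used for training or evaluation.
labelled = toy.dropna(subset=['Canceled'])

# Feature gaps are kept on purpose: they are imputed later.
remaining_feature_gaps = int(
    labelled.drop(columns='Canceled').isna().sum().sum()
)
```

Keeping the feature gaps while dropping the unlabeled row mirrors the decision described above.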

In [None]:
dataset = dataset[~dataset['Canceled'].isna()]

The classification problem is imbalanced.
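
Imbalance changes how ranking metrics should be read. The sketch below uses synthetic labels (the 10% positive rate is an assumption for illustration, not the rate in this dataset) to show that an uninformative model scores an average precision equal to the positive rate, while its ROC AUC stays at 0.5.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic labels: 10 positives among 100 rows (illustrative rate only).
y_true = np.array([1] * 10 + [0] * 90)
prevalence = y_true.mean()

# An uninformative model that assigns the same score to every row.
constant_scores = np.full(len(y_true), 0.5)

baseline_ap = average_precision_score(y_true, constant_scores)
baseline_auc = roc_auc_score(y_true, constant_scores)

print(f'prevalence={prevalence:.2f}  AP={baseline_ap:.2f}  AUC={baseline_auc:.2f}')
```

Under this reading, an average precision is only meaningful relative to the positive rate of the data it was computed on.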

In [None]:
ax = sns.countplot(
    data=dataset,
    x='Canceled'
)
ax.set_xlabel('Target')
ax.set_ylabel('Count')
ax.set_title('Distribution of Canceled Flights')
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))

for p in ax.patches:
    height = p.get_height()
    percentage = (height / len(dataset)) * 100
    ax.annotate(
        text=f'{percentage:.1f}%',
        xy=(p.get_x() + p.get_width() / 2., height),
        ha='center',
        va='center',
        xytext=(0, 5),
        textcoords='offset points'
    )

ax.set_xticklabels(['Not canceled', 'Canceled'])
plt.show()

At first glance, airline AA appears to have the highest proportion of canceled flights.

In [None]:
plt.figure(figsize=(12, 6))
ax = sns.countplot(
    data=dataset,
    x='UniqueCarrier',
    hue='Canceled'
)

plt.title('Distribution of Canceled Flights by Airline')
plt.xlabel('Airline')
plt.ylabel('Number of flights')

# get_legend_handles_labels() returns (handles, labels); only the handles are reused.
handles, _ = ax.get_legend_handles_labels()
ax.legend(handles, ['Not canceled', 'Canceled'])

plt.show()

In [None]:
# The mean of a 0/1 target is the cancellation rate. Unlike dividing two
# value_counts() results, carriers without cancellations get 0, not NaN.
proportion_canceled = (
    dataset.groupby('UniqueCarrier')['Canceled']
    .mean()
    .rename('ProportionCanceled')
    .reset_index()
)

plt.figure(figsize=(12, 6))
sns.barplot(
    data=proportion_canceled,
    x='UniqueCarrier',
    y='ProportionCanceled'
)
plt.title('Proportion of Canceled Flights by Airline')
plt.xlabel('Airline')
plt.ylabel('Proportion of canceled flights')
plt.show()

In [None]:
num_cols = list(dataset.select_dtypes(exclude=['object']).columns)

In [None]:
num_cols

In [None]:
dataset[[x for x in num_cols if x not in ['Canceled']]].hist(
    figsize=(10, 10),
    grid=False
)
plt.tight_layout()
plt.show()

In [None]:
# Size the grid from the columns actually plotted (the target is excluded).
plot_cols = [x for x in num_cols if x != 'Canceled']
plt.figure(figsize=(10, 20))
for index, column in enumerate(plot_cols):
    plt.subplot(math.ceil(len(plot_cols) / 2), 2, index + 1)
    sns.boxplot(y=column, x='Canceled', hue='Canceled', data=dataset,
                palette='dark', showfliers=False, legend=False)
    plt.title('Distribution of "{}" by "Canceled"'.format(column))
    plt.xticks(ticks=[0, 1], labels=['Not canceled', 'Canceled'])
    plt.grid()
# Adjust the layout once, after all subplots exist, then render.
plt.tight_layout()
plt.show()


Some distributions show outliers that call for a closer look at how the dataset was built. For example, why would the scheduled travel time (SchedElapsedTime) or the distance (Distance) take negative values?
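
A quick way to size that problem is to count the offending rows before deciding how to treat them. This is a sketch on made-up rows; `non_negative_cols` is a hypothetical list based on the two columns named above.

```python
import pandas as pd

# Made-up rows; column names follow the ones mentioned above.
toy = pd.DataFrame({
    'SchedElapsedTime': [95.0, -10.0, 140.0],
    'Distance': [500.0, 730.0, -1.0],
})

# Quantities that cannot be negative in reality.
non_negative_cols = ['SchedElapsedTime', 'Distance']

# Count offending rows per column and keep them for manual inspection.
negative_counts = (toy[non_negative_cols] < 0).sum()
suspicious_rows = toy[(toy[non_negative_cols] < 0).any(axis=1)]

print(negative_counts)
print(suspicious_rows)
```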

## Modeling

In [None]:
# Rename the column so the target variable is clearly identified
dataset = dataset.rename({'Canceled': 'Target'}, axis=1)

In [None]:
dataset

In [None]:
'''Utility functions'''

def compute_metrics(
        y_true: List[float],
        y_score: List[float]
    ) -> Dict[str, float]:
    '''
    Computes various evaluation metrics based on true labels and scores.
    '''
    return {
        'roc_score': roc_auc_score(y_true, y_score),
        'ks': binary_ks_curve(y_true, y_score)[3],
        'pr_score': average_precision_score(y_true, y_score)
    }

def compute_roc_optimal_cutoff(
        y_true: List[float],
        y_score: List[float]
    ) -> float:
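    '''
    Compute a cutoff threshold from the Receiver Operating Characteristic
    (ROC) curve: the threshold at which sensitivity (TPR) and specificity
    (1 - FPR) are closest to each other. Note that this is the balanced
    point of the curve, not the maximum of Youden's index (TPR - FPR).
    '''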

    fpr, tpr, thresh = roc_curve(y_true, y_score)
    idx = np.arange(len(tpr))
    roc = pd.DataFrame(
        {
            'tf': pd.Series(tpr - (1 - fpr), index=idx),
            'threshold': pd.Series(thresh, index=idx)
        }
    )
    roc_t = roc.iloc[(roc.tf - 0).abs().argsort()[:1]]
    return float(roc_t['threshold'].values[0])

def compute_ks_optimal_cutoff(
        y_true: List[float],
        y_score: List[float]
    ) -> float:
    '''
    Compute the optimal cutoff threshold based on the Kolmogorov-Smirnov
    (KS) statistic.
    '''
    return binary_ks_curve(y_true, y_score)[4]

def compute_optimal_cutoff(
        y_true: List[float],
        y_score: List[float],
        method: str = 'ks'
    ) -> float:
    '''
    Compute the optimal cutoff threshold with the specified method
    ('roc' or 'ks').
    '''
    if method == 'roc':
        return compute_roc_optimal_cutoff(y_true, y_score)
    elif method == 'ks':
        return compute_ks_optimal_cutoff(y_true, y_score)
    else:
        raise ValueError(f'Unsupported method: {method}')

def plot_ks(
        y_true: List[float],
        y_score: List[float],
        figsize: tuple = (10, 6),
        test_date: str = 'N/D'
    ):
    '''
    Plot the Kolmogorov-Smirnov (KS) statistic.
    '''
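    # Accept lists as well as arrays; plot_ks_statistic expects one column
    # of probabilities per class.
    y_score = np.asarray(y_score)
    pred_scores = np.column_stack(
        (1 - y_score, y_score))

    plot_ks_statistic(
        y_true,
        pred_scores,
        title=f'KS Statistic - {test_date}',
        figsize=figsize
    )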
    plt.show()

def plot_feature_importance(
        model,
        top_n: int = 10,
        figsize: tuple = (10, 6)
    ):
    '''
    Plot the top feature importances.
    '''

    try:
        if hasattr(model, 'feature_names_in_'):
            feature_importances = pd.Series(
                model.feature_importances_ /
                model.feature_importances_.sum(),
                index=model.feature_names_in_
            )
        elif hasattr(model, 'feature_name_'):
            feature_importances = pd.Series(
                model.feature_importances_ /
                model.feature_importances_.sum(),
                index=model.feature_name_
            )
        elif hasattr(model, 'feature_names_'):
            feature_importances = pd.Series(
                model.feature_importances_ /
                model.feature_importances_.sum(),
                index=model.feature_names_
            )
        else:
            feature_importances = pd.Series(
                model.feature_importances_ /
                model.feature_importances_.sum(),
                index=[str(i) for i in range(
                    len(model.feature_importances_))]
            )
    except AttributeError:
        print(
            (
                'Cannot calculate feature importance. '
                'Is your model a decision tree object?'
            )
        )
        # Nothing to plot without importances; returning here avoids using
        # an undefined variable below.
        return
    feature_importances = feature_importances.sort_values(ascending=False)
    top_features = feature_importances[:top_n]

    plt.figure(figsize=figsize)
    sns.barplot(x=top_features.values, y=top_features.index)
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.title(f'Top {top_n} - Feature Importance')
    plt.show()


def plot_confusion_matrix(
        y_true: List[float],
        y_score: List[float],
        test_date='N/D',
        threshold=0.5,
        display_labels=None,
        figsize=(10, 6),
        normalize=None
    ):
    '''
    Plot the confusion matrix.
    '''

    test_target = y_true
    pred_target = np.where(y_score > threshold, 1, 0)

    disp = ConfusionMatrixDisplay.from_predictions(
        test_target,
        pred_target,
        display_labels=display_labels,
        cmap='Blues',
        values_format='.0f' if normalize is None else None,
        normalize=normalize,
        colorbar=False
    )
    disp.ax_.set_title(f"Confusion Matrix - {test_date}")
    disp.figure_.set_size_inches(figsize)
    plt.show()

def plot_calibration_curve(
        y_true: List[float],
        y_score: List[float],
        test_date='N/D',
        figsize=(10, 6)
    ):
    '''
    Plot the calibration curve.
    '''

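    # Draw on an explicitly sized axes; without `ax`, from_predictions opens
    # its own figure and the requested figsize would be ignored.
    _, ax = plt.subplots(figsize=figsize)
    disp = CalibrationDisplay.from_predictions(
        y_true,
        y_score,
        ax=ax
    )
    disp.ax_.set_title(f"Calibration Curve - {test_date}")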

    handles, labels = disp.ax_.get_legend_handles_labels()
    disp.ax_.legend(handles, labels, loc='best')

    plt.show()

def plot_shap_importance(
        model,
        train_data,
        top_n: int = 10,
        figsize: tuple = (10, 6)
    ):
    '''
    Plot a SHAP summary (violin) plot for the top features.
    '''

    explainer = shap.TreeExplainer(
        model=model,
        feature_perturbation='tree_path_dependent',
        model_output='raw'
    )
    shap_values = explainer.shap_values(train_data)
    shap.summary_plot(
        # shap_values[1],
        shap_values,
        train_data,
        plot_type='violin',
        max_display=top_n,
        plot_size=figsize,
        show=False
    )
    plt.title(f'SHAP Importance - Top {top_n}', fontsize=16, y=1.05)
    plt.show()
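
The ROC-based helper above returns the threshold where `tpr - (1 - fpr)` is closest to zero, that is, where sensitivity and specificity meet. Youden's index proper is the maximum of `tpr - fpr`, and the two criteria can pick different thresholds. A minimal sketch on made-up labels and scores (none of these values come from the dataset):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and scores for illustration.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.95, 0.90, 0.80, 0.70, 0.60,
                    0.55, 0.40, 0.30, 0.20, 0.10])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J: the threshold that maximises TPR - FPR.
youden_cutoff = thresholds[np.argmax(tpr - fpr)]

# Balanced point: sensitivity (TPR) closest to specificity (1 - FPR).
balanced_cutoff = thresholds[np.argmin(np.abs(tpr - (1 - fpr)))]

print(youden_cutoff, balanced_cutoff)
```

On this toy data the two rules disagree, so the choice between them is worth making explicit when a cutoff is reported.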

In [None]:
SEED = 8888

In [None]:
dataset_model = dataset.copy()

In [None]:
train_data, test_data, train_target, test_target = train_test_split(
    dataset_model.drop('Target', axis=1),
    dataset_model['Target'],
    test_size=0.1,
    random_state=SEED,
    stratify=dataset_model['Target']
)
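
The split above passes `stratify`, which should keep the positive rate roughly equal in the train and test partitions. A small sketch on synthetic labels (sizes and rate are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 10% positives (illustrative only).
y = np.array([1] * 20 + [0] * 180)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)

# With stratification both partitions keep the original positive rate.
train_rate = y_train.mean()
test_rate = y_test.mean()
print(train_rate, test_rate)
```

Without stratification, a 10% test split of a rare class could end up with very few positives, which makes the test metrics noisy.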

In [None]:
cat_cols = list(train_data.select_dtypes(['object']).columns)
num_cols = list(train_data.select_dtypes(exclude=['object']).columns)
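
The pipelines defined next impute numeric gaps with `EndTailImputer(imputation_method='iqr', tail='left', fold=5)`. As I understand the feature-engine documentation, this replaces missing values with `Q1 - fold * IQR`, a value placed far in the left tail so that missingness stays visible to the model. The NumPy sketch below re-implements that rule for illustration only; `left_tail_iqr_value` is a hypothetical helper, not the library's code.

```python
import numpy as np

def left_tail_iqr_value(values, fold=5.0):
    """Return Q1 - fold * IQR, ignoring missing entries."""
    q1, q3 = np.nanpercentile(values, [25, 75])
    return q1 - fold * (q3 - q1)

# Made-up column with one missing entry.
column = np.array([1.0, 2.0, np.nan, 3.0, 4.0, 5.0])

fill_value = left_tail_iqr_value(column)
imputed = np.where(np.isnan(column), fill_value, column)

print(fill_value)  # -8.0
```

Here Q1 is 2 and Q3 is 4, so the fill value is 2 - 5 * 2 = -8, well outside the observed range.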

In [None]:
models = [
        ('Dummy', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('model', DummyClassifier()
                )
            ]
        )
    ),
        ('LogisticRegression', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('scaler', SklearnTransformerWrapper(StandardScaler())
                ),
                ('model', LogisticRegression(
                        penalty=None,
                        solver='lbfgs',
                        random_state=SEED
                    )
                )
            ]
        )
    ),
        ('LassoLogisticRegression', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('scaler', SklearnTransformerWrapper(StandardScaler())
                ),
                ('model', LogisticRegression(
                        penalty='l1',
                        solver='saga',
                        random_state=SEED
                    )
                )
            ]
        )
    ),
        ('RidgeLogisticRegression', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('scaler', SklearnTransformerWrapper(StandardScaler())
                ),
                ('model', LogisticRegression(
                        penalty='l2',
                        solver='lbfgs',
                        random_state=SEED
                    )
                )
            ]
        )
    ),
        ('ElasticNetLogisticRegression', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('scaler', SklearnTransformerWrapper(StandardScaler())
                ),
                ('model', LogisticRegression(
                        penalty='elasticnet',
                        solver='saga',
                        l1_ratio=0.5,
                        random_state=SEED
                    )
                )
            ]
        )
    ),
        ('KNeighbors', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('scaler', SklearnTransformerWrapper(MinMaxScaler())
                ),
                ('model', KNeighborsClassifier(n_jobs=-1),
                )
            ]
        )
    ),
        ('LightGBM', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('model', LGBMClassifier(
                        random_state=SEED,
                        n_jobs=-1,
                        verbose=-1
                    ),
                )
            ]
        )
    ),
        ('XGBoost', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('model', XGBClassifier(random_state=SEED, n_jobs=-1),
                )
            ]
        )
    ),
        ('ExtraTrees', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('model', ExtraTreesClassifier(
                        random_state=SEED,
                        n_jobs=-1
                    ),
                )
            ]
        )
    ),
        ('RandomForest', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('model', RandomForestClassifier(
                        random_state=SEED,
                        n_jobs=-1
                    ),
                )
            ]
        )
    ),
        ('SGDSVM', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('scaler', SklearnTransformerWrapper(StandardScaler())
                ),
                ('model', CalibratedClassifierCV(
                        SGDClassifier(
                            loss='hinge', # SVM
                            n_jobs=-1,
                            random_state=SEED
                        ),
                        method='sigmoid',
                        cv=None,
                        n_jobs=-1,
                        ensemble=True
                    )
                )
            ]
        )
    )
]
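
The pipelines above differ only in the optional scaler and the final estimator. One way to cut the repetition would be a small factory function. The sketch below assumes a hypothetical `build_pipeline` helper and uses scikit-learn's `SimpleImputer` as a stand-in for the shared encoder and imputer steps, so it illustrates the pattern rather than reproducing the notebook's preprocessing.

```python
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_pipeline(model, scaler=None):
    """Hypothetical factory: shared preprocessing, optional scaler, model."""
    steps = [
        # Stand-in for the shared encoder and imputer steps used above.
        ('num_imputer', SimpleImputer(strategy='median')),
    ]
    if scaler is not None:
        steps.append(('scaler', scaler))
    steps.append(('model', model))
    return Pipeline(steps=steps)

models_sketch = [
    ('Dummy', build_pipeline(DummyClassifier())),
    ('LogisticRegression', build_pipeline(
        LogisticRegression(), scaler=StandardScaler())),
]

# Each call builds fresh step instances, so pipelines share no fitted state.
step_names = {name: list(p.named_steps) for name, p in models_sketch}
print(step_names)
```

A factory like this would also make it easy to rebuild the list after the feature set changes, since the column lists are read at construction time.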

### Algorithm selection

In [None]:
warnings.filterwarnings('ignore', 'is_categorical_dtype')
model_metrics_list = []

for model, learner in tqdm(models, desc=r'Running Model Selection'):

    metrics_list = []
    for train_index, valid_index in StratifiedKFold(
            n_splits=5, random_state=SEED, shuffle=True
        ).split(
        train_data, train_target):

        start_time = time.time()

        train_data_fold = train_data.iloc[train_index].copy()
        train_target_fold = train_target.iloc[train_index].copy()

        val_data_fold = train_data.iloc[valid_index].copy()
        val_target_fold = train_target.iloc[valid_index].copy()

        # No manual preprocessing here: every pipeline in `models` already
        # encodes, imputes and (where needed) scales. Transforming the folds
        # beforehand would encode the categorical columns twice, and fitting
        # the pipeline on each training fold keeps validation data unseen.
        with warnings.catch_warnings():
            warnings.filterwarnings('ignore', category=ConvergenceWarning)
            learner.fit(train_data_fold, train_target_fold)

        pred_target_fold = learner.predict_proba(val_data_fold)[:, -1]

        metrics_fold = compute_metrics(val_target_fold, pred_target_fold)
        metrics_fold.update({'TT_secs': time.time() - start_time})
        metrics_list.append(metrics_fold)

    metrics = pd.DataFrame(metrics_list)
    metrics.columns = metrics.columns.str.upper()
    metrics = metrics.mean().to_frame().T
    metrics.index = [model]

    model_metrics_list.append(metrics)

metrics = pd.concat(model_metrics_list).round(4)
metrics.index.name = 'MODEL'
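
The loop above collects per-fold metrics by hand. For comparison, scikit-learn's `cross_validate` can run the same kind of stratified evaluation, refitting the whole pipeline (preprocessing included) on each training fold. This is a sketch on synthetic data; the estimator and the data are placeholders, not the notebook's models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (placeholder for the flight features).
X, y = make_classification(
    n_samples=300, n_features=6, weights=[0.85, 0.15], random_state=0
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each fold refits the full pipeline, so the scaler never sees validation rows.
scores = cross_validate(
    pipe, X, y, cv=cv, scoring=['roc_auc', 'average_precision']
)
summary = {key: float(np.mean(values)) for key, values in scores.items()}
print(summary)
```

The returned dictionary also includes `fit_time` and `score_time`, which play a role similar to the `TT_secs` column computed in the loop.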

In [None]:
metrics.sort_values(
    'PR_SCORE',
    ascending=False
).style.highlight_max(
        subset=[col for col in metrics.columns if col != 'TT_SECS'],
        color='green'
    ).map(lambda x: 'background-color: gray', subset=['TT_SECS'])

In [None]:
pipeline = Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('model', XGBClassifier(random_state=SEED, n_jobs=-1),
                )
            ]
        )

In [None]:
pipeline.fit(
    train_data,
    train_target
)
scores = pipeline.predict_proba(test_data)[:, -1]

In [None]:
metrics = compute_metrics(
    y_true=test_target,
    y_score=scores
)
for m, s in metrics.items():
    print(f'{m}: {s:.4f}')
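
The `ks` value printed above comes from `binary_ks_curve`. As I understand it, this is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the cumulative score distributions of the two classes. The NumPy sketch below computes it on made-up scores; `ks_statistic` is a hypothetical helper written for illustration.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Largest gap between the empirical score CDFs of the two classes."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = np.sort(y_score[y_true == 1])
    neg = np.sort(y_score[y_true == 0])
    # Evaluate both CDFs at every observed score.
    grid = np.unique(y_score)
    cdf_pos = np.searchsorted(pos, grid, side='right') / len(pos)
    cdf_neg = np.searchsorted(neg, grid, side='right') / len(neg)
    gaps = np.abs(cdf_neg - cdf_pos)
    best = int(np.argmax(gaps))
    return float(gaps[best]), float(grid[best])

# Made-up labels and scores.
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

ks, at_score = ks_statistic(y_true, y_score)
print(ks, at_score)
```

At a score of 0.4, four of the five negatives and none of the positives have been passed, which gives the maximum gap of 0.8 on this toy data.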

In [None]:
plot_feature_importance(
    model=pipeline['model'],
    top_n=7,
    figsize=(8, 6)
)

In [None]:
train_data_shap = pipeline['cat_encoder'].transform(train_data)
train_data_shap = pipeline['num_imputer'].transform(train_data_shap)
plot_shap_importance(
    model=pipeline['model'],
    train_data=train_data_shap,
    top_n=7,
    figsize=(8, 6)
)

In [None]:
# These variables are not useful for prediction: as far as I understand,
# they would not be available at inference time.
dataset_model = dataset.copy()
dataset_model = dataset_model.drop(['ArrDelay', 'DepDelay'], axis=1)
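
Dropping the delay columns guards against target leakage, since they are only known once the flight has (or has not) taken place. One way to make that rule reusable is sketched below; `POST_OUTCOME_COLUMNS` and `drop_post_outcome` are hypothetical names, and the toy rows are made up.

```python
import pandas as pd

# Columns only known after the flight has (or has not) taken place.
POST_OUTCOME_COLUMNS = ['ArrDelay', 'DepDelay']

def drop_post_outcome(df, columns=POST_OUTCOME_COLUMNS):
    """Remove columns that would not exist at prediction time."""
    # errors='ignore' keeps the call safe if a column was already dropped.
    return df.drop(columns=columns, errors='ignore')

toy = pd.DataFrame({
    'Distance': [500, 730],
    'ArrDelay': [12, 0],
    'DepDelay': [5, 0],
    'Target': [0, 1],
})

features = drop_post_outcome(toy)
print(list(features.columns))
```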

In [None]:
train_data, test_data, train_target, test_target = train_test_split(
    dataset_model.drop('Target', axis=1),
    dataset_model['Target'],
    test_size=0.1,
    random_state=SEED,
    stratify=dataset_model['Target']
)

In [None]:
cat_cols = list(train_data.select_dtypes(['object']).columns)
num_cols = list(train_data.select_dtypes(exclude=['object']).columns)

In [None]:
models = [
        ('Dummy', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('model', DummyClassifier()
                )
            ]
        )
    ),
        ('LogisticRegression', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('scaler', SklearnTransformerWrapper(StandardScaler())
                ),
                ('model', LogisticRegression(
                        penalty=None,
                        solver='lbfgs',
                        random_state=SEED
                    )
                )
            ]
        )
    ),
        ('LassoLogisticRegression', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('scaler', SklearnTransformerWrapper(StandardScaler())
                ),
                ('model', LogisticRegression(
                        penalty='l1',
                        solver='saga',
                        random_state=SEED
                    )
                )
            ]
        )
    ),
        ('RidgeLogisticRegression', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('scaler', SklearnTransformerWrapper(StandardScaler())
                ),
                ('model', LogisticRegression(
                        penalty='l2',
                        solver='lbfgs',
                        random_state=SEED
                    )
                )
            ]
        )
    ),
        ('ElasticNetLogisticRegression', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('scaler', SklearnTransformerWrapper(StandardScaler())
                ),
                ('model', LogisticRegression(
                        penalty='elasticnet',
                        solver='saga',
                        l1_ratio=0.5,
                        random_state=SEED
                    )
                )
            ]
        )
    ),
        ('KNeighbors', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('scaler', SklearnTransformerWrapper(MinMaxScaler())
                ),
                ('model', KNeighborsClassifier(n_jobs=-1),
                )
            ]
        )
    ),
        ('LightGBM', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('model', LGBMClassifier(
                        random_state=SEED,
                        n_jobs=-1,
                        verbose=-1
                    ),
                )
            ]
        )
    ),
        ('XGBoost', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('model', XGBClassifier(random_state=SEED, n_jobs=-1),
                )
            ]
        )
    ),
        ('ExtraTrees', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('model', ExtraTreesClassifier(
                        random_state=SEED,
                        n_jobs=-1
                    ),
                )
            ]
        )
    ),
        ('RandomForest', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('model', RandomForestClassifier(
                        random_state=SEED,
                        n_jobs=-1
                    ),
                )
            ]
        )
    ),
        ('SGDSVM', Pipeline(
            steps=[
                ('cat_encoder', CatBoostEncoder(
                        drop_invariant=True,
                        random_state=SEED,
                        cols=cat_cols,
                    )
                ),
                ('num_imputer', EndTailImputer(
                        imputation_method='iqr',
                        tail='left',
                        fold=5,
                        variables=num_cols
                    )
                ),
                ('scaler', SklearnTransformerWrapper(StandardScaler())
                ),
                ('model', CalibratedClassifierCV(
                        SGDClassifier(
                            loss='hinge', # SVM
                            n_jobs=-1,
                            random_state=SEED
                        ),
                        method='sigmoid',
                        cv=None,
                        n_jobs=-1,
                        ensemble=True
                    )
                )
            ]
        )
    )
]

In [None]:
warnings.filterwarnings('ignore', 'is_categorical_dtype')
model_metrics_list = []

for model, learner in tqdm(models, desc=r'Running Model Selection'):

    metrics_list = []
    for train_index, valid_index in StratifiedKFold(
            n_splits=5, random_state=SEED, shuffle=True
        ).split(
        train_data, train_target):

        start_time = time.time()

        train_data_fold = train_data.iloc[train_index].copy()
        train_target_fold = train_target.iloc[train_index].copy()

        val_data_fold = train_data.iloc[valid_index].copy()
        val_target_fold = train_target.iloc[valid_index].copy()

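        # NOTE: the pipelines defined above already contain `num_imputer` and
        # `cat_encoder` steps, so the manual preprocessing below is applied on
        # top of them. Consider keeping only one of the two.

        # Fill missing numeric values with a left-tail value (Q1 - fold * IQR),
        # fitted on the training fold only.
        num_imputer = EndTailImputer(
            imputation_method='iqr',
            tail='left',
            fold=5,
            variables=num_cols
        )
        train_data_fold = num_imputer.fit_transform(train_data_fold)
        val_data_fold = num_imputer.transform(val_data_fold)

        # Target-encode categorical columns, fitted on the training fold only.
        cat_encoder = CatBoostEncoder(
            cols=cat_cols,
            drop_invariant=True
        )
        train_data_fold = cat_encoder.fit_transform(train_data_fold,
                                                    train_target_fold)
        val_data_fold = cat_encoder.transform(val_data_fold)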
        with warnings.catch_warnings():
            warnings.filterwarnings('ignore', category=ConvergenceWarning)
            learner.fit(train_data_fold, train_target_fold)

        pred_target_fold = learner.predict_proba(val_data_fold)[:, -1]

        metrics_fold = compute_metrics(val_target_fold, pred_target_fold)
        metrics_fold.update({'TT_secs': time.time() - start_time})
        metrics_list.append(metrics_fold)

    metrics = pd.DataFrame(metrics_list)
    metrics.columns = metrics.columns.str.upper()
    metrics = metrics.mean().to_frame().T
    metrics.index = [model]

    model_metrics_list.append(metrics)

metrics = pd.concat(model_metrics_list).round(4)
metrics.index.name = 'MODEL'

In [None]:
metrics.sort_values(
    'PR_SCORE',
    ascending=False
).style.highlight_max(
        subset=[col for col in metrics.columns if col != 'TT_SECS'],
        color='green'
    ).map(lambda x: 'background-color: gray', subset=['TT_SECS'])

## Conclusions

The problem is driven mainly by two variables: ArrDelay (arrival delay) and DepDelay (departure delay). On their own, these two variables are directly related to the target variable (cancellations). According to the preliminary analysis, low values of these variables appear to be more strongly associated with the target. Once they are excluded, the remaining variables do not appear to explain the target, so modeling is not advisable.
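
One possible reading of the "low values" observation, offered as an assumption rather than a finding of this notebook: the pipelines fill missing numeric values with `EndTailImputer(imputation_method='iqr', tail='left', fold=5)`, which places them at `Q1 - 5 * IQR`, well below the observed range. If ArrDelay and DepDelay happen to be missing for cancelled flights, those rows would receive very low imputed delays. The sketch below reproduces that arithmetic in plain Python on made-up numbers; the helper `left_tail_iqr_value` and the sample delays are hypothetical.

```python
from statistics import quantiles


def left_tail_iqr_value(values, fold=5.0):
    """Return Q1 - fold * IQR, the fill value of a left-tail IQR imputation.

    Quartiles use linear interpolation (the 'inclusive' method), which is
    the default interpolation in pandas and NumPy.
    """
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    return q1 - fold * (q3 - q1)


# Hypothetical delays in minutes; None marks a missing value.
delays = [-5, 0, 3, 10, 15, 20, 45, None, None]

fill_value = left_tail_iqr_value(delays, fold=5.0)
imputed = [fill_value if v is None else v for v in delays]

print(fill_value)  # -78.5, far below the smallest observed delay (-5)
```

With this toy data the two missing entries become the lowest values in the column, which a tree-based model can separate with a single split.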

The main question for a possible production model is whether these variables would be available at prediction time, or whether they are collected afterwards and therefore could not be used in practice.
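
A first check for the availability question, sketched under stated assumptions: if a delay column is missing almost exactly when a flight is cancelled, the column is effectively recorded after the outcome and could not feed a prediction made beforehand. The example uses invented records and plain Python; the field name `Cancelled` and the helper `target_rate_by_missingness` are hypothetical. On the notebook's DataFrame the same comparison could be written as a group-by on `isna()`.

```python
def target_rate_by_missingness(rows, column, target):
    """Return the mean of `target` for rows where `column` is missing vs present."""
    groups = {"missing": [], "present": []}
    for row in rows:
        key = "missing" if row[column] is None else "present"
        groups[key].append(row[target])
    return {
        key: (sum(vals) / len(vals) if vals else None)
        for key, vals in groups.items()
    }


# Invented records; 'Cancelled' is a hypothetical name for the target column.
flights = [
    {"ArrDelay": 12, "DepDelay": 5, "Cancelled": 0},
    {"ArrDelay": -3, "DepDelay": 0, "Cancelled": 0},
    {"ArrDelay": None, "DepDelay": None, "Cancelled": 1},
    {"ArrDelay": 40, "DepDelay": 35, "Cancelled": 0},
    {"ArrDelay": None, "DepDelay": 20, "Cancelled": 1},
    {"ArrDelay": None, "DepDelay": None, "Cancelled": 1},
]

for col in ("ArrDelay", "DepDelay"):
    print(col, target_rate_by_missingness(flights, col, "Cancelled"))
```

A cancellation rate near 1.0 among missing rows and near 0.0 among present rows, as in this toy data for ArrDelay, would suggest that the column is only populated once the outcome is known.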