### Steps
1. Feature creation
2. Outlier treatment
3. Data normalization
4. Preprocessing pipeline
5. Train/test split
6. Data balancing
7. Model training

In [32]:
# @title Import of the libraries used in the program

import pandas as pd
import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin

# Dataset loading
import requests
from pathlib import Path

# Feature creation
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Data normalization
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector

# Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Model training
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import cross_validate
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier, AdaBoostClassifier, HistGradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import RidgeClassifier
from xgboost import XGBClassifier
#from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

In [33]:
# @title Dataset loading

# Google Colab version
github_link = "https://github.com/mfigueireddo/ciencia-de-dados/blob/ba579573c5b8a9246ca04f7da29bc2c74c8b362c/datasets/pre-pipeline_wildfires.parquet"
url = github_link.replace("/blob/", "/raw/")

local_file_path = Path("/content/raw_wildfires.parquet")

# Send an HTTP GET request to GitHub and stream the file to disk
with requests.get(url, stream=True) as request:
    request.raise_for_status() # Raise an error if the request failed

    with open(local_file_path, "wb") as file:
        for chunk in request.iter_content(chunk_size=1024*1024):
            if chunk:
                file.write(chunk)

wildfires = pd.read_parquet(local_file_path, engine="pyarrow") # Read using the pyarrow engine


# VSCode version
#file_path = "../datasets/pre-pipeline_wildfires.parquet"

#wildfires = pd.read_parquet(
    #file_path,
    #engine="pyarrow",
    #use_nullable_dtypes=False
#)

In [34]:
# @title Global variables

data_column_name = 'data'
id_column_name = 'fire_id'
latitude_column_name = 'latitude'
longitude_column_name = 'longitude'
precipitation_column_name = 'precipitacao'
max_temperature_column_name = 'temperatura_max'
precipitation_sum_window_column_name = 'soma_precipitacao_90_dias'
max_temperature_mean_column_name = 'media_temp_max_90_dias'
season_column_name = 'estacao_ano_id'
region_column_name = 'regiao_incendio'
target_column_name = 'houve_incendio'

### 1. Feature creation
<u>**Features created**</u>
- Season of the year
- Fire region
- Precipitation sum over the last 90 days
- Mean maximum temperature over the last 90 days

**Note 1**: the season and fire region features are currently not used to train the model<br>
**Note 2**: if these two features are later used to train the model, they will have to be encoded, because they are categorical and there is no numeric relationship between their values.
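The two rolling features can be illustrated on a small scale. The sketch below is only an illustration: the toy data and the `rolling_sum` column are hypothetical, and the window is shortened from 90 to 3 rows so the result is easy to follow. It uses the same `groupby(...).rolling(...)` pattern as the class in the next cell, which counts rows and therefore assumes one row per fire per day.

```python
import pandas as pd

# Hypothetical toy data: two fires, one row per day
toy = pd.DataFrame({
    "fire_id": ["a", "a", "a", "a", "b", "b"],
    "precipitacao": [1.0, 0.0, 2.0, 4.0, 5.0, 5.0],
})

window = 3  # the notebook uses 90; 3 keeps the example readable

toy["rolling_sum"] = (
    toy.groupby("fire_id")["precipitacao"]
       .rolling(window, min_periods=1)   # min_periods=1: early rows use a partial window
       .sum()
       .reset_index(level=0, drop=True)  # drop the group level so values align by row index
)

print(toy)
```

The window never crosses from one fire to the next, so the first row of fire `b` starts a fresh sum.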

In [35]:
# @title FeaturesCreation class (fixed to remove the date)

class FeaturesCreation(BaseEstimator, TransformerMixin):

    def __init__(self):
        # Columns used
        self.m_date_column = data_column_name
        self.m_group_column = id_column_name
        self.m_latitude_column = latitude_column_name
        self.m_longitude_column = longitude_column_name
        self.m_precipitation_column = precipitation_column_name
        self.m_max_temperature_column = max_temperature_column_name

        # Custom parameters for feature creation
        self.m_precipitation_window_days = 90 # Default value
        self.m_max_temperature_window_days = 90 # Default value

        # DBSCAN parameters
        self.mm_max_radiuskm = 1.0
        self.m_min_samples = 5

        # Objects learned in fit
        self.m_dbscan = None
        self.m_nearest_neighbors_core = None
        self.m_core_labels = None
        self.m_max_radius = None

    # Conversion required by DBSCAN: the haversine metric expects radians
    @staticmethod
    def convert_to_radians(latitude_or_longitude):
        return np.radians(latitude_or_longitude.astype(float))

    @staticmethod
    def convert_month_to_season(month):
        if month in (12, 1, 2): return 0  # Winter
        if month in (3, 4, 5): return 1  # Spring
        if month in (6, 7, 8): return 2  # Summer
        return 3  # 9, 10, 11 -> Autumn

    def fit(self, dataframe, target=None):
        dataframe = dataframe.copy()

        # Disable DBSCAN when latitude or longitude is missing
        missing_cols = [column for column in [self.m_latitude_column, self.m_longitude_column] if column not in dataframe.columns]
        if missing_cols:
            self.m_dbscan = None
            self.m_nearest_neighbors_core = None
            self.m_core_labels = None
            self.m_max_radius = None
            return self

        latitude_or_longitude = dataframe[[self.m_latitude_column, self.m_longitude_column]].to_numpy()
        latitude_or_longitude_radians = self.convert_to_radians(latitude_or_longitude)

        earth_radius = 6371.0
        self.m_max_radius = self.mm_max_radiuskm / earth_radius

        dbscan = DBSCAN(eps=self.m_max_radius, min_samples=self.m_min_samples, metric='haversine')
        dbscan.fit(latitude_or_longitude_radians)
        self.m_dbscan = dbscan

        # Fit a NearestNeighbors model on the core points only
        core_mask = np.zeros_like(dbscan.labels_, dtype=bool)
        if hasattr(dbscan, 'core_sample_indices_') and len(dbscan.core_sample_indices_) > 0:
            core_mask[dbscan.core_sample_indices_] = True
            core_points = latitude_or_longitude_radians[core_mask]
            core_labels = dbscan.labels_[core_mask]

            if len(core_points) > 0:
                nearest_neighbors = NearestNeighbors(n_neighbors=1, metric='haversine')
                nearest_neighbors.fit(core_points)
                self.m_nearest_neighbors_core = nearest_neighbors
                self.m_core_labels = core_labels
            else:
                self.m_nearest_neighbors_core = None
                self.m_core_labels = None
        else:
            self.m_nearest_neighbors_core = None
            self.m_core_labels = None

        return self

    # Label new points with the cluster of their nearest core point (-1 when it is farther than the radius)
    def assign_dbscanlabels(self, dataframe):
        if self.m_nearest_neighbors_core is None or self.m_core_labels is None or self.m_max_radius is None:
            return pd.Series([-1] * len(dataframe), index=dataframe.index, dtype='int64')

        latitude_or_longitude = dataframe[[self.m_latitude_column, self.m_longitude_column]].to_numpy()
        latitude_or_longitude_radians = self.convert_to_radians(latitude_or_longitude)

        distances, indices = self.m_nearest_neighbors_core.kneighbors(latitude_or_longitude_radians, n_neighbors=1, return_distance=True)
        distances = distances.reshape(-1)
        indices = indices.reshape(-1)

        labels = np.full(len(dataframe), -1, dtype='int64')
        within = distances <= self.m_max_radius
        labels[within] = self.m_core_labels[indices[within]]

        return pd.Series(labels, index=dataframe.index, dtype='int64')

    # Add the rolling precipitation sum and rolling max-temperature mean per fire group
    def add_temporal_rollings(self, dataframe):
        # Sort a copy by group and time. The results are assigned back by index, so the
        # caller's row order (and its alignment with the target) is preserved.
        # This assumes the dataframe index is unique.
        ordered = dataframe
        if self.m_group_column in dataframe.columns and self.m_date_column in dataframe.columns:
            ordered = dataframe.sort_values([self.m_group_column, self.m_date_column])
        elif self.m_date_column in dataframe.columns:
            ordered = dataframe.sort_values(self.m_date_column)

        # Rolling precipitation sum (the window counts rows, so it assumes one row per day)
        if self.m_precipitation_column in dataframe.columns:
            dataframe[precipitation_sum_window_column_name] = (
                ordered.groupby(self.m_group_column, dropna=False)[self.m_precipitation_column]
                  .rolling(self.m_precipitation_window_days, min_periods=1)
                  .sum()
                  .reset_index(level=0, drop=True)
            )
        else:
            dataframe[precipitation_sum_window_column_name] = np.nan

        # Rolling mean of the maximum temperature
        if self.m_max_temperature_column in dataframe.columns:
            dataframe[max_temperature_mean_column_name] = (
                ordered.groupby(self.m_group_column, dropna=False)[self.m_max_temperature_column]
                  .rolling(self.m_max_temperature_window_days, min_periods=1)
                  .mean()
                  .reset_index(level=0, drop=True)
            )
        else:
            dataframe[max_temperature_mean_column_name] = np.nan

        return dataframe

    # Apply the transformations and create the new features
    def transform(self, dataframe, target=None):
        dataframe = pd.DataFrame(dataframe).copy()

        # Season of the year
        if self.m_date_column in dataframe.columns:
            dataframe[self.m_date_column] = pd.to_datetime(dataframe[self.m_date_column], errors='coerce')
            # na_action='ignore' keeps invalid dates as missing instead of mapping them to season 3
            estacao = dataframe[self.m_date_column].dt.month.map(self.convert_month_to_season, na_action='ignore').astype('Int64')
            dataframe[season_column_name] = estacao.astype('float')
        else:
            dataframe[season_column_name] = np.nan

        # Region
        if all(c in dataframe.columns for c in [self.m_latitude_column, self.m_longitude_column]):
            dataframe[region_column_name] = self.assign_dbscanlabels(dataframe).astype('int64')
        else:
            dataframe[region_column_name] = -1

        # Temporal rolling features
        dataframe = self.add_temporal_rollings(dataframe)

        # --- FIX: REMOVE THE DATE AND ID COLUMNS BEFORE RETURNING ---
        # The model cannot handle dates or IDs (strings), so they are dropped here.
        cols_to_drop = [self.m_date_column, self.m_group_column]
        # Drop them only if they exist in the dataframe
        dataframe = dataframe.drop(columns=[c for c in cols_to_drop if c in dataframe.columns], errors='ignore')
        # ----------------------------------------------------------------

        return dataframe

### 2. Outlier treatment

<u>**Logarithmic transformation**</u>
- Applies log(x) or log(x + constant) to positive values
- Very effective for data with a positively skewed distribution
- Compresses large values and spreads out small values
- Formula: X_log = log(X + c), where c avoids log(0)

<u>**Square root transformation**</u>
- Less drastic than the logarithmic transformation
- Useful for count data and positively skewed variables
- Formula: X_sqrt = sqrt(X)
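To make the difference between the two transformations concrete, here is a minimal sketch in plain Python. The sample values and the `skewness` helper are hypothetical (population skewness, the third standardized moment); it is meant as an illustration, not as the routine used to produce the results reported in this notebook.

```python
import math

def skewness(values):
    """Population skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

# Hypothetical right-skewed sample (for instance, daily precipitation)
sample = [0.0, 0.0, 0.5, 1.0, 2.0, 3.0, 50.0]

log_sample = [math.log(v + 1.0) for v in sample]  # c = 1 avoids log(0)
sqrt_sample = [math.sqrt(v) for v in sample]

print("raw :", round(skewness(sample), 2))
print("sqrt:", round(skewness(sqrt_sample), 2))
print("log :", round(skewness(log_sample), 2))
```

On this sample both transformations reduce the skewness, and the logarithm reduces it more than the square root, which matches the description of the square root as the less drastic option.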

<u>**Winsorization (Capping/Clipping)**</u>

**Winsorization** is an outlier treatment technique that **limits extreme values** without removing them. Instead of dropping the outliers, the extreme values are replaced by the values of specific percentiles.

**How it works:**
- Limits are defined from percentiles (e.g. the 5th and 95th percentiles)
- Values below the lower limit are replaced by the lower limit
- Values above the upper limit are replaced by the upper limit
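The clipping step can be sketched with NumPy as follows. The sample values are hypothetical, and the limits are computed and applied on the same array for brevity; inside a pipeline the limits would be learned from the training data only and then reused on new data.

```python
import numpy as np

# Hypothetical sample with one very low and one very high value
values = np.array([-40.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 95.0])

# Limits taken from the 5th and 95th percentiles
lower, upper = np.percentile(values, [5, 95])

# Values outside the limits are replaced by the limits; no row is removed
winsorized = np.clip(values, lower, upper)

print(lower, upper)
print(winsorized)
```

Only the two extreme entries change; every value already inside the limits is left untouched, and the array keeps its original length.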

| Variable                                 | Best method           | Result obtained                                    |
| :--------------------------------------- | :-------------------- | :------------------------------------------------- |
| **precipitacao**                         | **Log(x + 1)**        | Skewness (7.87 → 2.59)                             |
| **umidade_relativa_max**                 | **Winsorization**     | Outliers (23 → 0)                                  |
| **umidade_relativa_min**                 | **Winsorization**     | Outliers (1155 → 0)                                |
| **umidade_especifica**                   | **Sqrt**              | Skewness (0.89 → 0.16) / Outliers (6967 → 2102)    |
| **radiacao_solar**                       | **No transformation** |                                                    |
| **temperatura_min**                      | **Winsorization**     | Outliers (5066 → 0)                                |
| **temperatura_max**                      | **Winsorization**     | Outliers (5066 → 0)                                |
| **velocidade_vento**                     | **Log(x + 1)**        | Skewness (1.23 → 0.19) / Outliers (8723 → 1128)    |
| **indice_queima**                        | **Winsorization**     |                                                    |
| **umidade_combustivel_morto_100_horas**  | **Sqrt**              | Outliers (44 → 9)                                  |
| **umidade_combustivel_morto_1000_horas** | **Sqrt**              | Outliers (373 → 4)                                 |
| **componente_energia_lancada**           | **No transformation** |                                                    |
| **evapotranspiracao_real**               | **Sqrt**              | Skewness (0.71 → -0.00) / Outliers (3292 → 153)    |
| **evapotranspiracao_potencial**          | **Log(x + 1)**        | Outliers (679 → 0)                                 |
| **deficit_pressao_vapor**                | **Log(x + 1)**        | Outliers (14758 → 672)                             |


In [36]:
# @title OutliersTreatment class

class OutliersTreatment(BaseEstimator, TransformerMixin):

    def __init__(self):
        self.m_log_columns = [
            "precipitacao",
            "velocidade_vento",
            "evapotranspiracao_potencial",
            "deficit_pressao_vapor",
        ]
        self.m_sqrt_columns = [
            "umidade_especifica",
            "umidade_combustivel_morto_100_horas",
            "umidade_combustivel_morto_1000_horas",
            "evapotranspiracao_real",
        ]
        self.m_winsor_columns = [
            "umidade_relativa_max",
            "umidade_relativa_min",
            "temperatura_min",
            "temperatura_max",
            "indice_queima",
        ]
        self.m_winsor_limits = (0.05, 0.05)

    # Compute the parameters needed to apply the transformations correctly
    def fit(self, dataframe, target=None):

        # Make sure the input is a DataFrame
        dataframe = dataframe if isinstance(dataframe, pd.DataFrame) else pd.DataFrame(dataframe)

        # Compute offsets so the log never receives zero or negative values
        # and the square root never receives negative values.
        # "coerce" converts invalid values to NaN

        self.m_log_offset = {}
        for column in self.m_log_columns:
            if column in dataframe.columns:
                series = pd.to_numeric(dataframe[column], errors="coerce")
                min_value = series.min()
                self.m_log_offset[column] = (abs(min_value) + 1) if pd.notna(min_value) and min_value <= 0 else 1.0

        self.m_sqrt_offset = {}
        for column in self.m_sqrt_columns:
            if column in dataframe.columns:
                series = pd.to_numeric(dataframe[column], errors="coerce")
                min_value = series.min()
                self.m_sqrt_offset[column] = (abs(min_value) + 0.01) if pd.notna(min_value) and min_value < 0 else 0.0

        # Winsorize only the columns that are actually present in the dataframe
        actual_columns_to_winsor = [column for column in self.m_winsor_columns if column in dataframe.columns]
        low_quantile, high_quantile = self.m_winsor_limits

        if actual_columns_to_winsor:
            # Convert each column to numeric (invalid values -> NaN) and compute the quantiles per column
            winsor_dataframe = dataframe[actual_columns_to_winsor].apply(pd.to_numeric, errors="coerce")
            self.m_low_quantile  = winsor_dataframe.quantile(low_quantile)
            self.m_high_quantile = winsor_dataframe.quantile(1 - high_quantile)
        else:
            # Keep empty attributes so transform() does not break
            self.m_low_quantile  = pd.Series(dtype=float)
            self.m_high_quantile = pd.Series(dtype=float)

        return self

    # Apply the transformations
    def transform(self, dataframe):

        # Make sure the input is a DataFrame and work on a copy
        dataframe = dataframe.copy() if isinstance(dataframe, pd.DataFrame) else pd.DataFrame(dataframe).copy()

        # LOG
        for column, offset in self.m_log_offset.items():
            if column in dataframe.columns:
                series = pd.to_numeric(dataframe[column], errors="coerce")
                dataframe[column] = np.log(series + offset)

        # SQRT
        for column, offset in self.m_sqrt_offset.items():
            if column in dataframe.columns:
                series = pd.to_numeric(dataframe[column], errors="coerce")
                dataframe[column] = np.sqrt(series + offset)

        # Winsorization
        for column in self.m_low_quantile.index:
            if column in dataframe.columns:
                series = pd.to_numeric(dataframe[column], errors="coerce")
                dataframe[column] = series.clip(lower=self.m_low_quantile[column], upper=self.m_high_quantile[column])

        return dataframe

### 3. Data normalization

The following are used to normalize the numeric data
- SimpleImputer -> Fills NaN with the median of the feature
- MinMaxScaler -> Rescales all values to the interval between 0 and 1

At the moment, no categorical features are used to train the model. Nevertheless, a provisional normalization is implemented for them using
- SimpleImputer -> Fills NaN with the most frequent value of the feature
- OneHotEncoder
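As a rough illustration of what these two numeric steps compute, the sketch below reimplements them in plain Python. The helper names and the sample column are hypothetical, and the imputation uses the median, which is the strategy configured in the `numeric_columns_pipeline` code of this notebook. It is not the scikit-learn implementation.

```python
import math
import statistics

def impute_median(column):
    """Replace NaN entries with the median of the observed values."""
    observed = [v for v in column if not math.isnan(v)]
    median = statistics.median(observed)
    return [median if math.isnan(v) else v for v in column]

def min_max_scale(column):
    """Rescale values linearly to [0, 1]: (x - min) / (max - min)."""
    low, high = min(column), max(column)
    return [(v - low) / (high - low) for v in column]

column = [10.0, float("nan"), 20.0, 40.0]  # hypothetical feature column
scaled = min_max_scale(impute_median(column))
print(scaled)
```

The smallest value maps to 0 and the largest to 1. In the real pipeline the minimum and maximum come from the training fold, so values in a test fold can fall slightly outside this interval.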
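One-hot encoding turns each category into its own 0/1 column, so no artificial ordering is implied between category IDs. A minimal sketch in plain Python, with a hypothetical `one_hot` helper and hypothetical season IDs:

```python
def one_hot(values, categories):
    """Return one row per value, with a 1.0 in the column of its category."""
    return [[1.0 if value == category else 0.0 for category in categories] for value in values]

seasons = [0, 2, 3, 2]              # hypothetical estacao_ano_id values
categories = sorted(set(seasons))   # columns: one per distinct category
encoded = one_hot(seasons, categories)

for season, row in zip(seasons, encoded):
    print(season, row)
```

Each row contains exactly one 1.0, and seasons 2 and 3 end up as far apart as seasons 0 and 3, which is the desired behavior for categories without numeric meaning.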

In [37]:
# @title DataNormalization class (fixed and adjusted)

# 1. Define which columns are purely categorical
categoric_cols = [season_column_name, region_column_name]

# 2. Select the numeric columns, EXCLUDING the categorical ones above
def numeric_columns_selector(dataframe):
    # Take everything that is numeric
    numeric_cols = dataframe.select_dtypes(include='number').columns.tolist()
    # Remove the known categories (even when they are stored as int/float)
    return [col for col in numeric_cols if col not in categoric_cols]

# Select only the categorical columns still present: ColumnDropper may have removed them,
# and ColumnTransformer raises an error when a listed column name is missing
def categoric_columns_selector(dataframe):
    return [col for col in categoric_cols if col in dataframe.columns]

# 3. Pipeline for the categorical columns (OneHotEncoder)
categoric_columns_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('one_hot_encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

# 4. Pipeline for the numeric columns (MinMaxScaler)
numeric_columns_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler()),
])

# 5. The final ColumnTransformer
DataNormalization = ColumnTransformer(
    transformers=[
        # Applied to the categorical columns that are present
        ('categoric', categoric_columns_pipeline, categoric_columns_selector),
        # Applied to the numeric columns (filtered by the selector function)
        ('numerical', numeric_columns_pipeline, numeric_columns_selector)
    ],
    remainder='passthrough'
)

### 4. Preprocessing pipeline

It combines 5 steps
1. Feature creation
2. Outlier treatment
3. Removal of features that are not wanted when training the models
4. Data normalization
5. Data sanitization


For the model to be trained correctly, some columns besides the target should be left out of its scope:
- Date
- ID
- Latitude
- Longitude
- Season of the year*
- Region*

In [38]:
# @title ColumnDropper class

# Drops columns that are not wanted when training the model
class ColumnDropper(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.columns = [data_column_name, id_column_name, latitude_column_name, longitude_column_name, season_column_name, region_column_name]
    def fit(self, dataframe, target=None):
        return self
    def transform(self, dataframe):
        return dataframe.drop(columns=self.columns, errors="ignore")

Simple function that ensures the data is in the format expected for training the models (numpy.ndarray)

In [39]:
# @title Function sanitizer

# Keeps only the numeric columns and converts the data to the format expected by the classifiers
def sanitizer(features):
    # Drop any non-numeric column when a DataFrame is received
    if isinstance(features, pd.DataFrame):
        features = features.select_dtypes(include='number')
    # Convert to numpy.ndarray (a no-op when the input already is one)
    return np.asarray(features)

In [40]:
# @title preprocess pipeline

preprocess = Pipeline(steps=[
    ("features_creation", FeaturesCreation()),
    ("outliers_treatment", OutliersTreatment()),
    ("drop_unused_columns", ColumnDropper()),
    ("data_normalization", DataNormalization),
    ("sanitize", FunctionTransformer(sanitizer, validate=False, feature_names_out='one-to-one'))
])

### 5. Train/test split

<u>**Time Series Cross Validation**</u>

Time Series Cross Validation is a technique for validating models when the data has a chronological order. Unlike traditional techniques such as shuffled k-fold, it respects the temporal structure of the data: each fold trains on earlier observations and tests on later ones.

A custom function was implemented so that all rows belonging to the same fire stay in the same group, and therefore on the same side of each split
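The idea of an expanding training window over chronologically ordered groups can be sketched in a few lines of plain Python. The function name, its parameters, and the fire IDs below are hypothetical; the sketch works on group labels only, whereas the notebook's function goes one step further and maps the groups to row indices.

```python
def expanding_group_splits(ordered_groups, n_folds, test_size=1, gap=0, min_train=1):
    """Yield (train_groups, test_groups) pairs over chronologically ordered groups."""
    anchor = min_train  # number of groups in the training window
    produced = 0
    # Stop when there is no room left for the gap plus a full test block
    while anchor + gap + test_size <= len(ordered_groups) and produced < n_folds:
        test_start = anchor + gap
        yield ordered_groups[:anchor], ordered_groups[test_start:test_start + test_size]
        produced += 1
        anchor += test_size  # grow the training window by one test block

# Hypothetical fire IDs, already sorted by the date of their first record
groups = ["f1", "f2", "f3", "f4"]
splits = list(expanding_group_splits(groups, n_folds=5))

for train, test in splits:
    print(train, "->", test)
```

Every test group comes strictly after all training groups, and no group appears on both sides of a split, which is what prevents rows of the same fire from leaking between train and test.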

In [41]:
# @title Function group_time_series_cross_validation

# Yields folds (train_idx, test_idx) that respect the temporal order of the groups, apply a group-level
# embargo (gap) and keep the train and test groups mutually exclusive.
def group_time_series_cross_validation():
    dataframe = wildfires
    time_column = data_column_name
    group_column = id_column_name
    folds_amount = 5
    fold_groups_size = 1
    gap_between_groups_amount = 0

    # Order the groups by their first timestamp (date of the first sample of the group)
    first_time = (
        dataframe[[group_column, time_column]]
        .dropna(subset=[time_column])
        .groupby(group_column)[time_column]
        .min()
        .sort_values()
    )
    ordered_groups = first_time.index.to_numpy()
    ordered_groups_len = len(ordered_groups)

    groups_amount_by_step = fold_groups_size

    min_train_groups = max(1, fold_groups_size) # at least 1

    # Anchor: number of groups in the training window (the test groups start after it, plus the gap)
    # There must be room ahead for the gap + the test groups
    max_anchor = ordered_groups_len - gap_between_groups_amount - fold_groups_size
    if max_anchor < min_train_groups:
        return  # no split is possible

    splits = 0
    anchor = min_train_groups
    while anchor <= max_anchor and splits < folds_amount:
        train_groups = ordered_groups[:anchor]

        test_start = anchor + gap_between_groups_amount
        test_end = test_start + fold_groups_size
        test_groups = ordered_groups[test_start:test_end]

        # Positional indices, as expected by scikit-learn CV iterables (safe even without a RangeIndex)
        train_idx = np.flatnonzero(dataframe[group_column].isin(train_groups).to_numpy())
        test_idx  = np.flatnonzero(dataframe[group_column].isin(test_groups).to_numpy())

        if train_idx.size and test_idx.size:
            yield (train_idx, test_idx) # Generated on demand (lazy evaluation)
            splits += 1

        anchor += groups_amount_by_step

cross_validation = cross_validation_splits = list(group_time_series_cross_validation())


### 6. Data balancing

After running a routine that counted the samples of each class and computed the Imbalance Ratio (IR), we obtained:
- Class 0 (no fire): **2025** samples accumulated across all folds (90%)
- Class 1 (fire): **225** samples accumulated across all folds (10%)
- Accumulated imbalance ratio (IR): **9.00x**

**Note**: the dataset has more than 300,000 rows, but only 2,500 distinct groups (one per fire)

The dataset has a mild imbalance ratio (IR < 10), which does not call for robust balancing algorithms.

With that in mind, tests were run on some models that accept class weights as a parameter (class_weight = 'balanced'), with the following results:

**LogisticRegression** ✅
| Metric    | Before | After     | Difference |
| --------- | ------ | --------- | ---------- |
| Accuracy  | 0.921  | **0.932** | 🔼 +0.011  |
| Precision | 0.529  | **0.765** | 🔼 +0.236  |
| Recall    | 0.293  | **0.787** | 🔼 +0.494  |
| F1-score  | 0.335  | **0.715** | 🔼 +0.380  |
| ROC AUC   | 0.975  | **0.979** | 🔼 +0.004  |

**DecisionTreeClassifier** ✅
| Metric    | Before | After     | Difference |
| --------- | ------ | --------- | ---------- |
| Accuracy  | 0.909  | **0.940** | 🔼 +0.031  |
| Precision | 0.506  | **0.826** | 🔼 +0.320  |
| Recall    | 0.413  | **0.680** | 🔼 +0.267  |
| F1-score  | 0.391  | **0.662** | 🔼 +0.271  |
| ROC AUC   | 0.727  | **0.824** | 🔼 +0.097  |


**RandomForestClassifier** ✅
| Metric    | Before | After     | Difference   |
| --------- | ------ | --------- | ------------ |
| Accuracy  | 0.925  | **0.933** | 🔼 +0.008    |
| Precision | 0.859  | **0.876** | 🔼 +0.017    |
| Recall    | 0.507  | **0.560** | 🔼 +0.053    |
| F1-score  | 0.498  | **0.566** | 🔼 +0.068    |
| ROC AUC   | 0.947  | 0.947     | ⚪ no change |

**XGBoost (scale_pos_weight=9.0)** 🟨
| Metric    | Before    | After     | Difference |
| --------- | --------- | --------- | ---------- |
| Accuracy  | **0.935** | 0.927     | 🔻 -0.008  |
| Precision | **0.830** | 0.797     | 🔻 -0.033  |
| Recall    | 0.653     | **0.693** | 🔼 +0.040  |
| F1-score  | 0.635     | **0.638** | 🔼 +0.003  |
| ROC AUC   | **0.950** | 0.946     | 🔻 -0.004  |
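The imbalance ratio and the weights behind these experiments follow directly from the class counts reported in this section. The sketch below reproduces them in plain Python; the variable names are illustrative. The "balanced" weights use the heuristic documented by scikit-learn, `n_samples / (n_classes * count)`, and the `scale_pos_weight` value is the usual negatives-to-positives ratio.

```python
# Class counts accumulated across all folds, as reported in this section
counts = {0: 2025, 1: 225}

n_samples = sum(counts.values())
n_classes = len(counts)

# Imbalance ratio: majority class size divided by minority class size
imbalance_ratio = max(counts.values()) / min(counts.values())

# class_weight='balanced' heuristic: n_samples / (n_classes * count)
balanced_weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Common choice for XGBoost's scale_pos_weight: negatives / positives
scale_pos_weight = counts[0] / counts[1]

print(imbalance_ratio, balanced_weights, scale_pos_weight)
```

With these counts each fire sample weighs nine times as much as a non-fire sample, so both classes contribute the same total weight to the loss.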


In [42]:
# @title BalancingCount class

# Counts the samples of each class per fold and reports the imbalance ratio (it does not resample the data)
class BalancingCount(BaseEstimator, TransformerMixin):
    # Class variables (shared by all instances)
    total_counts = None
    total_calls = 0

    def fit(self, dataframe, target=None):
        if target is None:
            raise ValueError("The target is required to check the class balance.")

        # Count the occurrences of each class in this fold
        counts = pd.Series(target).value_counts().sort_index()

        # Update the global running total
        if BalancingCount.total_counts is None:
            BalancingCount.total_counts = counts.copy()
        else:
            # Add the values per class
            BalancingCount.total_counts = BalancingCount.total_counts.add(counts, fill_value=0)

        BalancingCount.total_calls += 1

        # --- Report for the current fold ---
        print(f"\n📘 Fold {BalancingCount.total_calls}")
        print("📊 Class counts (this fold):")
        for classes, count in counts.items():
            print(f"  Class {classes}: {count} samples")

        ratio = counts.max() / counts.min() if len(counts) > 1 else 1.0
        print(f"⚖️  Imbalance ratio (this fold): {ratio:.2f}x")

        # --- Accumulated report ---
        if BalancingCount.total_calls == 1:
            print("\n🔄 Starting global count...")
        else:
            total_ratio = BalancingCount.total_counts.max() / BalancingCount.total_counts.min()
            print("\n📈 Accumulated count so far:")
            for classes, total in BalancingCount.total_counts.items():
                print(f"  Class {int(classes)}: {int(total)} accumulated samples")
            print(f"⚖️  Accumulated imbalance ratio: {total_ratio:.2f}x")

        return self

    def transform(self, dataframe):
        return dataframe

    @classmethod
    def reset(cls):
        """Resets the global counters (for a new experiment)."""
        cls.total_counts = None
        cls.total_calls = 0

### 7. Model training

The models trained were
- Dummy (most frequent)
- Logistic Regression
- Decision Tree
- Random Forest
- Naive Bayes
- KNN
- Gradient Boosting **(NEW)**
- AdaBoost **(NEW)**
- HistGradientBoosting **(NEW)**
- RidgeClassifier **(NEW)**
- XGBoost **(NEW)**
- CatBoost **(NEW, currently commented out in the code)**
- LightGBM **(NEW - with problems)**

<u>**Metrics for model evaluation**</u>

| **Metric**               | **Description**                                                                                   | **Ideal interpretation**                                                      |
| ------------------------ | ------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| **Accuracy**             | Proportion of correct predictions overall (both positive and negative).                           | Good for balanced classes, but can mask poor performance on minority classes. |
| **Precision**            | Of the positive predictions, how many were actually positive.                                     | High precision = few false positives.                                         |
| **Recall (Sensitivity)** | Of the truly positive cases, how many were correctly identified.                                  | High recall = few false negatives.                                            |
| **F1-score**             | Harmonic mean of precision and recall, balancing both.                                            | Ideal when classes are imbalanced.                                            |
| **ROC AUC**              | Measures the model's overall ability to separate the classes (0.5 = random; 1.0 = perfect).       | Close to 1 indicates good separability.                                       |
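The first four metrics can be computed by hand from the confusion matrix counts. The sketch below uses a hypothetical helper and hypothetical counts for a fold with 10% positives; undefined ratios are returned as 0, mirroring the `zero_division=0` setting used in the scoring of this notebook.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical fold: 100 samples, 10 of them positive (fire)
model = classification_metrics(tp=6, fp=2, fn=4, tn=88)

# A baseline that always predicts the majority class (no fire)
baseline = classification_metrics(tp=0, fp=0, fn=10, tn=90)

print(model)
print(baseline)
```

The baseline reaches 90% accuracy while its recall and F1 are 0, which illustrates why accuracy alone can be misleading on imbalanced data.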


In [43]:
# @title Model and metric selection

modelos = {
    "Dummy (mais frequente)": DummyClassifier(strategy="most_frequent"), # Simple baseline

    # Without class balancing
    # "Regressão Logística": LogisticRegression(C=0.5, penalty='l2', solver='liblinear', max_iter=2000), # Light L2 regularization
    # "Árvore de Decisão": DecisionTreeClassifier(max_depth=8, min_samples_split=4, min_samples_leaf=2, random_state=42), # Limits depth and leaf size
    # "Random Forest": RandomForestClassifier(n_estimators=150, max_depth=10, min_samples_split=5, random_state=42, n_jobs=-1), # More trees and moderate depth

    # With class balancing
    "Regressão Logística": LogisticRegression(C=0.5, penalty='l2', solver='liblinear', max_iter=2000, class_weight='balanced'), # Light L2 regularization
    "Árvore de Decisão": DecisionTreeClassifier(max_depth=8, min_samples_split=4, min_samples_leaf=2, random_state=42, class_weight='balanced'), # Limits depth and leaf size
    "Random Forest": RandomForestClassifier(n_estimators=150, max_depth=10, min_samples_split=5, random_state=42, n_jobs=-1, class_weight='balanced'), # More trees and moderate depth

    "Naive Bayes": GaussianNB(var_smoothing=1e-8), # Light smoothing (more stable)
    "KNN": KNeighborsClassifier(n_neighbors=7, weights='distance'), # More neighbors, distance-weighted votes
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, learning_rate=0.05, max_depth=5, random_state=42), # Light parameters, as covered in class
    "AdaBoost": AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=42), # Lower learning rate
    "HistGradientBoosting": HistGradientBoostingClassifier(max_iter=150, learning_rate=0.05, max_depth=5, random_state=42), # More iterations and a lower learning rate
    "RidgeClassifier": RidgeClassifier(alpha=1.0), # Light regularization
    "XGBoost": XGBClassifier(eval_metric='logloss', n_estimators=150, learning_rate=0.05, max_depth=5, subsample=0.8, colsample_bytree=0.8, random_state=42), # Typical balanced parameters (lecture 15, part 4)
    # "CatBoost": CatBoostClassifier(iterations=150, learning_rate=0.05, depth=5, verbose=0, random_state=42), # Reduced learning rate and extra iterations
    # "LightGBM": LGBMClassifier(n_estimators=150, learning_rate=0.05, num_leaves=31, max_depth=6, random_state=42) # Controlled depth and moderate learning rate
}

# Evaluation metrics
scoring = {
    'accuracy': 'accuracy',
    'precision': make_scorer(precision_score, zero_division=0),
    'recall': make_scorer(recall_score, zero_division=0),
    'f1': make_scorer(f1_score, zero_division=0),
    'roc_auc': 'roc_auc'
}
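
Three of the models above are given `class_weight='balanced'`. In scikit-learn this setting weights each class by `n_samples / (n_classes * class_count)`, so errors on the rarer class weigh more during fitting. A minimal sketch with made-up labels (90 negatives and 10 positives, not the wildfire data) shows the resulting weights:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
classes = np.array([0, 1])

# Weights scikit-learn derives for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same formula written out by hand: n_samples / (n_classes * class_count)
manual = len(y) / (len(classes) * np.bincount(y))

class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)  # class 0 gets about 0.556, class 1 gets 5.0
```

With a 9:1 imbalance, each positive sample counts nine times as much as a negative one, which tends to raise recall at the expense of precision.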

In [44]:
# @title Training algorithm

# --- START OF INTEGRATION ---
# Not used in the loop below, which rebuilds the pipeline for each period.
full_preprocess_pipeline = Pipeline([
    ('feature_creation', FeaturesCreation()),  # Creates estacao, regiao, etc.
    ('normalization', DataNormalization)
])
# --- END OF INTEGRATION ---

# Split features and target from the original dataframe
# FeaturesCreation handles the creation of the extra columns internally
features = wildfires.drop(columns=target_column_name)
target = wildfires[target_column_name].astype(int)

periodos_testes = [30, 60, 90, 120, 150, 180, 210]
resultados_gerais = []

for periodo in periodos_testes:
    resultados = []

    # Update the day window used by the feature-creation class.
    # Note: because 'FeaturesCreation' sits inside the pipeline, the parameter has to be passed
    # through set_params on the main pipeline, or the pipeline has to be rebuilt on each iteration.
    # Rebuilding is safer and clearer in this loop.

    print(f"Período de dias: {periodo}")

    # Set the window lengths on a fresh FeaturesCreation instance.
    # Caveat: cross_validate clones the pipeline, and clone() keeps only __init__ parameters,
    # so these assignments take effect only if the class declares them as __init__ parameters.
    features_creator = FeaturesCreation()
    features_creator.m_precipitation_window_days = periodo
    features_creator.m_max_temperature_window_days = periodo

    # Preprocessing pipeline for this period
    current_preprocess = Pipeline([
        ('feature_creation', features_creator),
        ('normalization', DataNormalization)
    ])

    for nome, modelo in modelos.items():
        try:
            # Final pipeline: preprocessing -> NaN handling -> model
            pipeline = Pipeline([
                ("preprocess", current_preprocess), # Uses the pipeline built for this period
                ("nan_shield", SimpleImputer(strategy="constant", fill_value=0.0)),
                ("classificator", modelo)
            ])

            scores = cross_validate(
                pipeline,
                features,
                target,
                cv=cross_validation, # Make sure 'cross_validation' is defined (e.g. TimeSeriesSplit)
                scoring=scoring,
                n_jobs=1
            )

            resultados.append({
                "Modelo": nome,
                "Accuracy": np.mean(scores['test_accuracy']),
                "Precision": np.mean(scores['test_precision']),
                "Recall": np.mean(scores['test_recall']),
                "F1-score": np.mean(scores['test_f1']),
                "ROC AUC": np.mean(scores['test_roc_auc']),
            })

            resultados_gerais.append({
                "Modelo": nome,
                "F1-score": np.mean(scores['test_f1']),
                "Tempo": periodo
            })

        except Exception as e:
            print(f" Erro ao rodar o modelo {nome}: {e}\n")
            # No 'pass' here, so that any errors stay visible

    # Show the results for this period
    df_resultados = pd.DataFrame(resultados).sort_values("F1-score", ascending=False)
    display(df_resultados)

Período de dias: 30

|    | Modelo               | Accuracy | Precision | Recall   | F1-score | ROC AUC  |
| -- | -------------------- | -------- | --------- | -------- | -------- | -------- |
| 2  | Árvore de Decisão    | 0.946667 | 0.85641   | 0.613333 | 0.655736 | 0.797778 |
| 1  | Regressão Logística  | 0.758667 | 0.584066  | 0.893333 | 0.616724 | 0.978765 |
| 10 | XGBoost              | 0.932    | 0.786667  | 0.693333 | 0.616029 | 0.951901 |
| 3  | Random Forest        | 0.926667 | 0.827209  | 0.586667 | 0.56396  | 0.953086 |
| 6  | Gradient Boosting    | 0.929333 | 0.774552  | 0.586667 | 0.560593 | 0.827506 |
| 5  | KNN                  | 0.924    | 0.846465  | 0.506667 | 0.538725 | 0.876988 |
| 8  | HistGradientBoosting | 0.917333 | 0.551645  | 0.653333 | 0.521674 | 0.96884  |
| 4  | Naive Bayes          | 0.770667 | 0.229788  | 0.8      | 0.35467  | 0.862864 |
| 7  | AdaBoost             | 0.9      | 0.594444  | 0.44     | 0.332563 | 0.905728 |
| 9  | RidgeClassifier      | 0.92     | 0.32      | 0.333333 | 0.31     | 0.988148 |


Período de dias: 60

|    | Modelo               | Accuracy | Precision | Recall   | F1-score | ROC AUC  |
| -- | -------------------- | -------- | --------- | -------- | -------- | -------- |
| 2  | Árvore de Decisão    | 0.946667 | 0.85641   | 0.613333 | 0.655736 | 0.797778 |
| 1  | Regressão Logística  | 0.758667 | 0.584066  | 0.893333 | 0.616724 | 0.978765 |
| 10 | XGBoost              | 0.932    | 0.786667  | 0.693333 | 0.616029 | 0.951901 |
| 3  | Random Forest        | 0.926667 | 0.827209  | 0.586667 | 0.56396  | 0.953086 |
| 6  | Gradient Boosting    | 0.929333 | 0.774552  | 0.586667 | 0.560593 | 0.827506 |
| 5  | KNN                  | 0.924    | 0.846465  | 0.506667 | 0.538725 | 0.876988 |
| 8  | HistGradientBoosting | 0.917333 | 0.551645  | 0.653333 | 0.521674 | 0.96884  |
| 4  | Naive Bayes          | 0.770667 | 0.229788  | 0.8      | 0.35467  | 0.862864 |
| 7  | AdaBoost             | 0.9      | 0.594444  | 0.44     | 0.332563 | 0.905728 |
| 9  | RidgeClassifier      | 0.92     | 0.32      | 0.333333 | 0.31     | 0.988148 |


Período de dias: 90

|    | Modelo               | Accuracy | Precision | Recall   | F1-score | ROC AUC  |
| -- | -------------------- | -------- | --------- | -------- | -------- | -------- |
| 2  | Árvore de Decisão    | 0.946667 | 0.85641   | 0.613333 | 0.655736 | 0.797778 |
| 1  | Regressão Logística  | 0.758667 | 0.584066  | 0.893333 | 0.616724 | 0.978765 |
| 10 | XGBoost              | 0.932    | 0.786667  | 0.693333 | 0.616029 | 0.951901 |
| 3  | Random Forest        | 0.926667 | 0.827209  | 0.586667 | 0.56396  | 0.953086 |
| 6  | Gradient Boosting    | 0.929333 | 0.774552  | 0.586667 | 0.560593 | 0.827506 |
| 5  | KNN                  | 0.924    | 0.846465  | 0.506667 | 0.538725 | 0.876988 |
| 8  | HistGradientBoosting | 0.917333 | 0.551645  | 0.653333 | 0.521674 | 0.96884  |
| 4  | Naive Bayes          | 0.770667 | 0.229788  | 0.8      | 0.35467  | 0.862864 |
| 7  | AdaBoost             | 0.9      | 0.594444  | 0.44     | 0.332563 | 0.905728 |
| 9  | RidgeClassifier      | 0.92     | 0.32      | 0.333333 | 0.31     | 0.988148 |


Período de dias: 120

|    | Modelo               | Accuracy | Precision | Recall   | F1-score | ROC AUC  |
| -- | -------------------- | -------- | --------- | -------- | -------- | -------- |
| 2  | Árvore de Decisão    | 0.946667 | 0.85641   | 0.613333 | 0.655736 | 0.797778 |
| 1  | Regressão Logística  | 0.758667 | 0.584066  | 0.893333 | 0.616724 | 0.978765 |
| 10 | XGBoost              | 0.932    | 0.786667  | 0.693333 | 0.616029 | 0.951901 |
| 3  | Random Forest        | 0.926667 | 0.827209  | 0.586667 | 0.56396  | 0.953086 |
| 6  | Gradient Boosting    | 0.929333 | 0.774552  | 0.586667 | 0.560593 | 0.827506 |
| 5  | KNN                  | 0.924    | 0.846465  | 0.506667 | 0.538725 | 0.876988 |
| 8  | HistGradientBoosting | 0.917333 | 0.551645  | 0.653333 | 0.521674 | 0.96884  |
| 4  | Naive Bayes          | 0.770667 | 0.229788  | 0.8      | 0.35467  | 0.862864 |
| 7  | AdaBoost             | 0.9      | 0.594444  | 0.44     | 0.332563 | 0.905728 |
| 9  | RidgeClassifier      | 0.92     | 0.32      | 0.333333 | 0.31     | 0.988148 |


Período de dias: 150

|    | Modelo               | Accuracy | Precision | Recall   | F1-score | ROC AUC  |
| -- | -------------------- | -------- | --------- | -------- | -------- | -------- |
| 2  | Árvore de Decisão    | 0.946667 | 0.85641   | 0.613333 | 0.655736 | 0.797778 |
| 1  | Regressão Logística  | 0.758667 | 0.584066  | 0.893333 | 0.616724 | 0.978765 |
| 10 | XGBoost              | 0.932    | 0.786667  | 0.693333 | 0.616029 | 0.951901 |
| 3  | Random Forest        | 0.926667 | 0.827209  | 0.586667 | 0.56396  | 0.953086 |
| 6  | Gradient Boosting    | 0.929333 | 0.774552  | 0.586667 | 0.560593 | 0.827506 |
| 5  | KNN                  | 0.924    | 0.846465  | 0.506667 | 0.538725 | 0.876988 |
| 8  | HistGradientBoosting | 0.917333 | 0.551645  | 0.653333 | 0.521674 | 0.96884  |
| 4  | Naive Bayes          | 0.770667 | 0.229788  | 0.8      | 0.35467  | 0.862864 |
| 7  | AdaBoost             | 0.9      | 0.594444  | 0.44     | 0.332563 | 0.905728 |
| 9  | RidgeClassifier      | 0.92     | 0.32      | 0.333333 | 0.31     | 0.988148 |


Período de dias: 180

|    | Modelo               | Accuracy | Precision | Recall   | F1-score | ROC AUC  |
| -- | -------------------- | -------- | --------- | -------- | -------- | -------- |
| 2  | Árvore de Decisão    | 0.946667 | 0.85641   | 0.613333 | 0.655736 | 0.797778 |
| 1  | Regressão Logística  | 0.758667 | 0.584066  | 0.893333 | 0.616724 | 0.978765 |
| 10 | XGBoost              | 0.932    | 0.786667  | 0.693333 | 0.616029 | 0.951901 |
| 3  | Random Forest        | 0.926667 | 0.827209  | 0.586667 | 0.56396  | 0.953086 |
| 6  | Gradient Boosting    | 0.929333 | 0.774552  | 0.586667 | 0.560593 | 0.827506 |
| 5  | KNN                  | 0.924    | 0.846465  | 0.506667 | 0.538725 | 0.876988 |
| 8  | HistGradientBoosting | 0.917333 | 0.551645  | 0.653333 | 0.521674 | 0.96884  |
| 4  | Naive Bayes          | 0.770667 | 0.229788  | 0.8      | 0.35467  | 0.862864 |
| 7  | AdaBoost             | 0.9      | 0.594444  | 0.44     | 0.332563 | 0.905728 |
| 9  | RidgeClassifier      | 0.92     | 0.32      | 0.333333 | 0.31     | 0.988148 |


Período de dias: 210

|    | Modelo               | Accuracy | Precision | Recall   | F1-score | ROC AUC  |
| -- | -------------------- | -------- | --------- | -------- | -------- | -------- |
| 2  | Árvore de Decisão    | 0.946667 | 0.85641   | 0.613333 | 0.655736 | 0.797778 |
| 1  | Regressão Logística  | 0.758667 | 0.584066  | 0.893333 | 0.616724 | 0.978765 |
| 10 | XGBoost              | 0.932    | 0.786667  | 0.693333 | 0.616029 | 0.951901 |
| 3  | Random Forest        | 0.926667 | 0.827209  | 0.586667 | 0.56396  | 0.953086 |
| 6  | Gradient Boosting    | 0.929333 | 0.774552  | 0.586667 | 0.560593 | 0.827506 |
| 5  | KNN                  | 0.924    | 0.846465  | 0.506667 | 0.538725 | 0.876988 |
| 8  | HistGradientBoosting | 0.917333 | 0.551645  | 0.653333 | 0.521674 | 0.96884  |
| 4  | Naive Bayes          | 0.770667 | 0.229788  | 0.8      | 0.35467  | 0.862864 |
| 7  | AdaBoost             | 0.9      | 0.594444  | 0.44     | 0.332563 | 0.905728 |
| 9  | RidgeClassifier      | 0.92     | 0.32      | 0.333333 | 0.31     | 0.988148 |

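The seven result tables above are identical to the last digit, which suggests that the window setting never reaches the fitted estimators. One possible explanation, assuming `m_precipitation_window_days` and `m_max_temperature_window_days` are not `__init__` parameters of `FeaturesCreation` (its definition is not shown in this section): `cross_validate` clones the pipeline for every fold, and `sklearn.base.clone` rebuilds each estimator from `get_params()`, so attributes assigned after construction are discarded. The sketch below reproduces the effect with a hypothetical stand-in class, `WindowedFeatures`, whose only constructor parameter is `window_days`.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.pipeline import Pipeline


class WindowedFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for FeaturesCreation (not the notebook's class)."""

    def __init__(self, window_days=30):
        self.window_days = window_days

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


# Case 1: attribute assigned after construction, unknown to get_params()
patched = WindowedFeatures()
patched.m_window_days = 90
patched_clone = clone(patched)
attribute_survived = hasattr(patched_clone, "m_window_days")

# Case 2: value routed through a constructor parameter with set_params
pipeline = Pipeline([("feature_creation", WindowedFeatures())])
pipeline.set_params(feature_creation__window_days=90)
pipeline_clone = clone(pipeline)
cloned_window = pipeline_clone.named_steps["feature_creation"].window_days

print(attribute_survived)  # False: the ad-hoc attribute is lost on clone
print(cloned_window)       # 90: the constructor parameter is preserved
```

If this is what happens here, exposing the window lengths as constructor parameters and setting them through `FeaturesCreation(...)` or `set_params` would let each period produce its own features.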

In [45]:
# @title Comparison of results

resultados_por_modelo = {}

# Group the results by model
for entrada in resultados_gerais:
    nome_modelo = entrada["Modelo"]
    if nome_modelo not in resultados_por_modelo:
        resultados_por_modelo[nome_modelo] = []
    resultados_por_modelo[nome_modelo].append(entrada)

# Sort each model's results by F1-score, best first
for nome_modelo, lista_resultados in resultados_por_modelo.items():
    lista_ordenada = sorted(
        lista_resultados,
        key=lambda x: x["F1-score"],
        reverse=True
    )

    print(f"\n==== Modelo: {nome_modelo} ====")
    for posicao, item in enumerate(lista_ordenada, start=1):
        print(f"{posicao}º lugar | Tempo: {item['Tempo']:>3} dias | F1-score: {item['F1-score']:.4f}")

# Sort all results together, across models and periods
resultados_gerais_ordenados = sorted(
    resultados_gerais,
    key=lambda x: x["F1-score"],
    reverse=True
)

print("\n==== Ranking geral (todos os modelos e períodos) ====")
for posicao, item in enumerate(resultados_gerais_ordenados, start=1):
    print(f"{posicao}º lugar | Modelo: {item['Modelo']} | Tempo: {item['Tempo']:>3} dias | F1-score: {item['F1-score']:.4f}")


==== Modelo: Dummy (mais frequente) ====
1º lugar | Tempo:  30 dias | F1-score: 0.0000
2º lugar | Tempo:  60 dias | F1-score: 0.0000
3º lugar | Tempo:  90 dias | F1-score: 0.0000
4º lugar | Tempo: 120 dias | F1-score: 0.0000
5º lugar | Tempo: 150 dias | F1-score: 0.0000
6º lugar | Tempo: 180 dias | F1-score: 0.0000
7º lugar | Tempo: 210 dias | F1-score: 0.0000

==== Modelo: Regressão Logística ====
1º lugar | Tempo:  30 dias | F1-score: 0.6167
2º lugar | Tempo:  60 dias | F1-score: 0.6167
3º lugar | Tempo:  90 dias | F1-score: 0.6167
4º lugar | Tempo: 120 dias | F1-score: 0.6167
5º lugar | Tempo: 150 dias | F1-score: 0.6167
6º lugar | Tempo: 180 dias | F1-score: 0.6167
7º lugar | Tempo: 210 dias | F1-score: 0.6167

==== Modelo: Árvore de Decisão ====
1º lugar | Tempo:  30 dias | F1-score: 0.6557
2º lugar | Tempo:  60 dias | F1-score: 0.6557
3º lugar | Tempo:  90 dias | F1-score: 0.6557
4º lugar | Tempo: 120 dias | F1-score: 0.6557
5º lugar | Tempo: 150 dias | F1