# Full regression workflow (California Housing): EDA, 20+ models, PyTorch, and AutoML

**Summary:** This notebook loads the *California Housing* dataset (20,640 samples) through `sklearn.datasets`, runs EDA (with `ydata-profiling`), preprocesses the data, trains and compares 20+ models (scikit-learn, XGBoost, LightGBM, CatBoost, a PyTorch MLP, a PyTorch 1D-CNN), runs AutoML (FLAML), and analyzes the results in explanatory cells.

**Note:** run the cells in order. The first cell lists the required dependencies; its `pip install` line is commented out, so uncomment it if the packages are not installed yet.

In [1]:
# Install dependencies (run only if needed)
# Some packages are large; remove the ones you do not want to install.
#!pip install -q ydata-profiling==4.1.1 pandas matplotlib seaborn missingno scikit-learn xgboost lightgbm catboost flaml optuna torch torchvision

# Note: the right PyTorch build depends on your platform and CUDA setup; the plain pip command above installs whichever default build pip selects.


## 1) Imports and loading the dataset

In this section we import the libraries and load the dataset with scikit-learn's `fetch_california_housing` (20,640 samples).

In [2]:
# Basic imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Load the dataset
data = fetch_california_housing(as_frame=True)
df = data.frame.copy()
display(df.head())
print('Shape:', df.shape)


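|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |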


Shape: (20640, 9)


## 2) In-depth EDA

We use `ydata-profiling` to generate an exploratory report and also draw custom plots.

In [4]:
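# Generate a quick report with ydata-profiling (saved as HTML)
# If this cell fails with the numba AttributeError shown below, the pinned ydata-profiling
# release is probably incompatible with the installed numba version; upgrading
# ydata-profiling or installing an older numba is the usual workaround (not verified here).
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title='California Housing - EDA', explorative=True)
profile.to_file('california_eda_report.html')
print('Report saved: california_eda_report.html')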


AttributeError: module 'numba' has no attribute 'generated_jit'

In [None]:
# Custom visualizations
plt.figure(figsize=(10,6))
sns.histplot(df['MedHouseVal'], kde=True)
plt.title('Target distribution (MedHouseVal)')
plt.show()

plt.figure(figsize=(12,10))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm', square=True)
plt.title('Correlation heatmap')
plt.show()

# Missing values visualization
msno.matrix(df)
plt.show()


## 3) Preprocessing and feature engineering

We add a few optional engineered columns and define the train/test split.

In [None]:
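# Optional engineered features.
# AveRooms, AveBedrms and AveOccup are already per-household averages,
# so the ratios below combine them directly instead of dividing by HouseAge.
df['BedroomsPerRoom'] = df['AveBedrms'] / df['AveRooms']
df['RoomsPerPerson'] = df['AveRooms'] / df['AveOccup']
# Approximate number of households per block group: population / average occupancy.
df['Households'] = df['Population'] / df['AveOccup']

# Target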
y = df['MedHouseVal']
X = df.drop(columns=['MedHouseVal'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Shapes:', X_train.shape, X_test.shape)
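
As a quick sanity check on the split sizes: with 20,640 rows and `test_size=0.2`, 4,128 rows go to the test set and 16,512 stay in training. The sketch below reproduces this on a synthetic array; the random values and the `*_demo` names are illustrative stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as California Housing
# and 11 feature columns (8 original + 3 engineered).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20640, 11))
y_demo = rng.normal(size=20640)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed yields the same partition.
X_tr2, _, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = np.array_equal(X_tr, X_tr2)
print('Reproducible split:', same_split)
```

Fixing `random_state` is what makes the model comparison later in the notebook repeatable across runs.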


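## 4) Model list (20+)

We train a diverse set of regression models. The dictionary below holds 19 estimators; the two PyTorch networks and FLAML, added in later sections, bring the total above 20.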

In [None]:
# Model list
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, BayesianRidge, SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
# HistGradientBoostingRegressor is stable since scikit-learn 1.0; the experimental import is no longer needed.
from sklearn.ensemble import HistGradientBoostingRegressor

# Optional third-party
import xgboost as xgb
import lightgbm as lgb
import catboost as cb

models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(random_state=42),
    'Lasso': Lasso(random_state=42),
    'ElasticNet': ElasticNet(random_state=42),
    'BayesianRidge': BayesianRidge(),
    'SGDRegressor': SGDRegressor(max_iter=1000, tol=1e-3, random_state=42),
    'KNeighbors': KNeighborsRegressor(),
    'DecisionTree': DecisionTreeRegressor(random_state=42),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42),
    'ExtraTrees': ExtraTreesRegressor(n_estimators=100, random_state=42),
    'GradientBoosting': GradientBoostingRegressor(random_state=42),
    'HistGradientBoosting': HistGradientBoostingRegressor(random_state=42),
    'AdaBoost': AdaBoostRegressor(random_state=42),
    'Bagging': BaggingRegressor(random_state=42),
    'SVR': SVR(),
    'MLPRegressor': MLPRegressor(hidden_layer_sizes=(128,64), max_iter=500, random_state=42),
    'XGBoost': xgb.XGBRegressor(objective='reg:squarederror', n_estimators=200, random_state=42, verbosity=0),
    'LightGBM': lgb.LGBMRegressor(n_estimators=200, random_state=42),
    'CatBoost': cb.CatBoostRegressor(verbose=0, random_state=42),
}
len(models)


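## 5) Basic training (pipeline) and evaluation

Each model is wrapped in a Pipeline with a `StandardScaler`. Scaling matters for linear, distance-based, kernel, and neural models; tree-based models are essentially unaffected by it, so applying it everywhere is harmless and keeps the loop uniform. Metrics: RMSE, MAE, and R2, all computed on the held-out test set.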
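
If you prefer to scale only where it matters, one possible approach is a small helper that decides per model name. This is a sketch, not the notebook's method: the `TREE_BASED` set and the `build_pipeline` function are hypothetical names introduced here for illustration.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical set of model names that do not benefit from scaling;
# adjust it to match the keys of your own `models` dict.
TREE_BASED = {
    'DecisionTree', 'RandomForest', 'ExtraTrees', 'GradientBoosting',
    'HistGradientBoosting', 'AdaBoost', 'Bagging',
    'XGBoost', 'LightGBM', 'CatBoost',
}

def build_pipeline(name, model):
    """Return a Pipeline that scales inputs only for scale-sensitive models."""
    steps = []
    if name not in TREE_BASED:
        steps.append(('scaler', StandardScaler()))
    steps.append(('model', model))
    return Pipeline(steps)

ridge_pipe = build_pipeline('Ridge', Ridge())
forest_pipe = build_pipeline('RandomForest', RandomForestRegressor(n_estimators=10))
print([step_name for step_name, _ in ridge_pipe.steps])
print([step_name for step_name, _ in forest_pipe.steps])
```

Skipping the scaler for tree ensembles mainly saves a little compute; the predictions should be essentially unchanged.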

In [None]:
results = []
from time import time
for name, model in models.items():
    # pipeline: scaler + model (some tree-based models ignore scaling but it's okay)
    pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
    t0 = time()
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))  # the `squared=False` argument is deprecated since scikit-learn 1.4
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    elapsed = time() - t0
    results.append({'model': name, 'rmse': rmse, 'mae': mae, 'r2': r2, 'time_s': elapsed})
    print(f'{name}: RMSE={rmse:.4f}, MAE={mae:.4f}, R2={r2:.4f}, time={elapsed:.1f}s')

results_df = pd.DataFrame(results).sort_values('rmse')
display(results_df)
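
A single train/test split gives only one, somewhat noisy, estimate per model. A minimal sketch of k-fold cross-validation with the `cross_val_score` and `KFold` helpers imported in section 1 follows. The synthetic data and the `Ridge` model are stand-ins chosen for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=500)

demo_pipe = Pipeline([('scaler', StandardScaler()), ('model', Ridge())])
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so RMSE is exposed as a negative value.
neg_rmse = cross_val_score(
    demo_pipe, X_demo, y_demo, cv=cv, scoring='neg_root_mean_squared_error'
)
cv_rmse = -neg_rmse
print(f'CV RMSE: {cv_rmse.mean():.4f} +/- {cv_rmse.std():.4f}')
```

In the notebook itself you would pass `X_train` and `y_train` instead of the synthetic arrays and keep the test set untouched for the final comparison.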


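## 6) Neural networks (PyTorch): MLP and 1D-CNN

We build an MLP in PyTorch and a 1D-CNN that convolves over the features, treating them as a sequence. The column order of tabular features has no natural meaning, so the 1D-CNN is best read as an experiment rather than a principled architecture for this data.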

In [None]:
# PyTorch models
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device:', device)

# Prepare scaled data
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train).astype(np.float32)
X_test_s = scaler.transform(X_test).astype(np.float32)
y_train_s = y_train.to_numpy().astype(np.float32).reshape(-1,1)
y_test_s = y_test.to_numpy().astype(np.float32).reshape(-1,1)

# Create DataLoaders
train_ds = TensorDataset(torch.from_numpy(X_train_s), torch.from_numpy(y_train_s))
test_ds = TensorDataset(torch.from_numpy(X_test_s), torch.from_numpy(y_test_s))
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=256, shuffle=False)

# MLP definition
class MLPRegressorPyTorch(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )
    def forward(self, x):
        return self.net(x)

# Training function
import copy  # used to snapshot the best weights

def train_model(model, train_loader, val_loader=None, epochs=50, lr=1e-3):
    model.to(device)
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    best_val = float('inf')
    best_weights = None
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_loader:
            xb = xb.to(device)
            yb = yb.to(device)
            preds = model(xb)
            loss = criterion(preds, yb)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Optional validation pass: track the epoch with the lowest validation loss
        if val_loader is not None:
            model.eval()
            val_loss = 0.0
            with torch.no_grad():
                for xb, yb in val_loader:
                    xb = xb.to(device)
                    yb = yb.to(device)
                    preds = model(xb)
                    val_loss += criterion(preds, yb).item() * xb.size(0)
            val_loss = val_loss / len(val_loader.dataset)
            if val_loss < best_val:
                best_val = val_loss
                # state_dict() returns references to the live tensors, so deep-copy the snapshot
                best_weights = copy.deepcopy(model.state_dict())
    # Restore the best validation checkpoint, if one was recorded
    if best_weights is not None:
        model.load_state_dict(best_weights)
    return model

# Instantiate and train MLP
mlp_pt = MLPRegressorPyTorch(X_train_s.shape[1])
mlp_pt = train_model(mlp_pt, train_loader, val_loader=None, epochs=50)

# Predict helper
def predict_torch(model, X_np):
    model.eval()
    with torch.no_grad():
        xb = torch.from_numpy(X_np).to(device)
        preds = model(xb).cpu().numpy().ravel()
    return preds

preds_mlp_pt = predict_torch(mlp_pt, X_test_s)
print('MLP PyTorch RMSE:', np.sqrt(mean_squared_error(y_test, preds_mlp_pt)))

# 1D-CNN: reshape features as (batch, channels, seq_len) for Conv1d in PyTorch
X_train_c = X_train_s.reshape((X_train_s.shape[0], 1, X_train_s.shape[1]))
X_test_c = X_test_s.reshape((X_test_s.shape[0], 1, X_test_s.shape[1]))
train_ds_c = TensorDataset(torch.from_numpy(X_train_c), torch.from_numpy(y_train_s))
test_ds_c = TensorDataset(torch.from_numpy(X_test_c), torch.from_numpy(y_test_s))
train_loader_c = DataLoader(train_ds_c, batch_size=64, shuffle=True)
test_loader_c = DataLoader(test_ds_c, batch_size=256, shuffle=False)

class Conv1DRegressorPyTorch(nn.Module):
    def __init__(self, seq_len):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 32, kernel_size=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1)
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
    def forward(self, x):
        x = self.conv(x)
        x = self.fc(x)
        return x

cnn_pt = Conv1DRegressorPyTorch(X_train_c.shape[2])
cnn_pt = train_model(cnn_pt, train_loader_c, val_loader=None, epochs=50)
preds_cnn_pt = predict_torch(cnn_pt, X_test_c)
print('1D-CNN PyTorch RMSE:', np.sqrt(mean_squared_error(y_test, preds_cnn_pt)))
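
Because the convolution stack uses no padding, each layer shortens the feature "sequence". The sketch below traces the length through the three layers using the output-shape formula from the PyTorch `Conv1d` and `MaxPool1d` documentation. The helper names `conv1d_out_len` and `trace_lengths` are hypothetical and exist only for this illustration.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of Conv1d / MaxPool1d according to the PyTorch shape formula."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def trace_lengths(seq_len):
    """Sequence length after each layer of the conv stack defined above."""
    after_conv1 = conv1d_out_len(seq_len, kernel_size=3)               # Conv1d(1, 64, 3)
    after_pool = conv1d_out_len(after_conv1, kernel_size=2, stride=2)  # MaxPool1d(2)
    after_conv2 = conv1d_out_len(after_pool, kernel_size=3)            # Conv1d(64, 32, 3)
    return after_conv1, after_pool, after_conv2

print(trace_lengths(11))  # 11 input features -> (9, 4, 2)
print(trace_lengths(8))   # 8 input features  -> (6, 3, 1)
print(trace_lengths(7))   # -> (5, 2, 0): the second convolution has no room left
```

With 11 input columns the second convolution still has two positions to average over. Below 8 columns the architecture would need padding or smaller kernels.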


## 7) AutoML with FLAML

We use FLAML to search automatically for a good model within a fixed time budget.

In [None]:
from flaml import AutoML
automl = AutoML()
automl_settings = {
    'time_budget': 120,  # seconds
    'metric': 'rmse',
    'task': 'regression',
    'log_file_name': 'flaml.log',
    'verbose': 0
}
automl.fit(X_train, y_train, **automl_settings)
print('Best model:', automl.best_estimator)
preds_automl = automl.predict(X_test)
print('AutoML RMSE:', np.sqrt(mean_squared_error(y_test, preds_automl)))


## 8) Consolidating and comparing results

We gather all comparison metrics into a single table sorted by RMSE.

In [None]:
# Add neural nets and automl to results_df
extra = [
    {'model':'MLP_PyTorch', 'rmse': np.sqrt(mean_squared_error(y_test, preds_mlp_pt)), 'mae': mean_absolute_error(y_test, preds_mlp_pt), 'r2': r2_score(y_test, preds_mlp_pt), 'time_s': None},
    {'model':'CNN1D_PyTorch', 'rmse': np.sqrt(mean_squared_error(y_test, preds_cnn_pt)), 'mae': mean_absolute_error(y_test, preds_cnn_pt), 'r2': r2_score(y_test, preds_cnn_pt), 'time_s': None},
    {'model':'FLAML_AutoML', 'rmse': np.sqrt(mean_squared_error(y_test, preds_automl)), 'mae': mean_absolute_error(y_test, preds_automl), 'r2': r2_score(y_test, preds_automl), 'time_s': None},
]
results_df = pd.concat([results_df, pd.DataFrame(extra)], ignore_index=True).sort_values('rmse')
results_df.reset_index(drop=True, inplace=True)
display(results_df)


## 9) Interpretation: feature importance and partial dependence (example)

We show the RandomForest feature importances and a partial dependence plot (PDP) for the most important feature.

In [None]:
# Feature importance (RandomForest)
rf = models['RandomForest']
pipe_rf = Pipeline([('scaler', StandardScaler()), ('model', rf)])
pipe_rf.fit(X_train, y_train)
importances = pipe_rf.named_steps['model'].feature_importances_
feat_imp = pd.Series(importances, index=X.columns).sort_values(ascending=False)
display(feat_imp.head(10))


In [None]:
# Plot top 10
plt.figure(figsize=(8,6))
feat_imp.head(10).plot(kind='bar')
plt.title('Feature importance - RandomForest')
plt.show()
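
Impurity-based importances are computed on the training data and can favor high-cardinality features, so a cross-check is often useful. A minimal sketch using scikit-learn's `permutation_importance` on held-out data follows. The synthetic dataset, in which only the first two features drive the target, is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first two of four features influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 4))
y_demo = 3.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and record the drop in R2.
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
ranking = np.argsort(perm.importances_mean)[::-1]
for idx in ranking:
    print(f'feature {idx}: {perm.importances_mean[idx]:.3f} '
          f'+/- {perm.importances_std[idx]:.3f}')
```

In the notebook, the same call could be applied to `pipe_rf` with `X_test` and `y_test`.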


In [None]:
# Partial dependence (sklearn)
from sklearn.inspection import PartialDependenceDisplay
top_feat = feat_imp.index[0]
fig, ax = plt.subplots(figsize=(6,4))
# Pass the full pipeline and the raw test data so the x-axis stays in the feature's original units.
PartialDependenceDisplay.from_estimator(pipe_rf, X_test, [top_feat], ax=ax)
plt.show()
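
To make the idea behind the plot concrete: a one-dimensional partial dependence curve is the average prediction obtained when one feature is forced to each value of a grid while all other columns keep their observed values. The sketch below computes it by hand with NumPy. The `partial_dependence_1d` helper and the toy model are hypothetical, and scikit-learn's own implementation may differ in details such as grid selection.

```python
import numpy as np

def partial_dependence_1d(predict_fn, X, feature_idx, grid):
    """Average prediction when one feature is forced to each grid value."""
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # overwrite the feature for every row
        averages.append(predict_fn(X_mod).mean())
    return np.array(averages)

# Toy model (illustrative): prediction = 2 * x0 + x1
def toy_predict(X):
    return 2.0 * X[:, 0] + X[:, 1]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
grid = np.linspace(-1.0, 1.0, 5)
pd_curve = partial_dependence_1d(toy_predict, X_demo, feature_idx=0, grid=grid)
print(pd_curve)
```

For this linear toy model the curve is a straight line with slope 2, shifted by the mean of the second feature.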


## 10) Conclusions and next steps

- Summarize the best models from the table in section 8, then consider hyperparameter tuning, ensembles of the top models, AutoML runs with a larger budget (AutoGluon, H2O, auto-sklearn), and deployment of the chosen model with MLflow.

**Suggested next steps:**

- Increase FLAML's `time_budget` and try AutoGluon/H2O.
- Analyze errors by segment (e.g., geographic area, high vs. low house values).
- Tune hyperparameters with Optuna.
- Export the best model for production (ONNX/PyTorch) and version it with MLflow.
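
The error analysis by segment suggested above can start with something as simple as RMSE per quantile bin of the true target. A minimal NumPy sketch follows; the `rmse_by_quantile_bin` helper and the synthetic predictions, whose noise grows with the target, are assumptions made for illustration only.

```python
import numpy as np

def rmse_by_quantile_bin(y_true, y_pred, n_bins=4):
    """Return (bin index, row count, RMSE) for each quantile bin of the true target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Interior quantile edges, e.g. the three quartiles for n_bins=4.
    edges = np.quantile(y_true, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_ids = np.digitize(y_true, edges)  # values in 0 .. n_bins - 1
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rmse = float(np.sqrt(np.mean((y_true[mask] - y_pred[mask]) ** 2)))
        else:
            rmse = float('nan')
        rows.append((b, int(mask.sum()), rmse))
    return rows

# Synthetic example: prediction noise grows with the target value.
rng = np.random.default_rng(0)
y_true_demo = rng.uniform(0.5, 5.0, size=2000)
y_pred_demo = y_true_demo + rng.normal(scale=0.1 * y_true_demo)
bin_report = rmse_by_quantile_bin(y_true_demo, y_pred_demo)
for b, n, rmse in bin_report:
    print(f'bin {b}: n={n}, RMSE={rmse:.3f}')
```

In the notebook, calling `rmse_by_quantile_bin(y_test, preds_automl)` would show whether the best model struggles more on expensive or on cheap block groups.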