# Final Machine Learning Project
By: _Henrique Bucci_ and _Marcelo Alonso_

Data and information: https://www.kaggle.com/datasets/marcopale/housing/data

**Questions**  
- Can I remove the PID?
    - **A:** Yes
- Can I create columns computed from other columns before doing the feature selection?
    - **A:** Yes
- If I apply PolynomialFeatures to the data, do the generated columns also count toward the feature count?
    - **A:** Apply PolynomialFeatures after selecting the features
- Can I use correlation in the exploratory analysis?
    - **A:** Yes, but it is "useless"
- Can I use clustering methods in the pipeline to include the cluster assignment as a new feature?
    - **A:** Apply SoftMax to the KMeans result to exaggerate the closest class.
- Can I use a dimensionality reduction method (e.g. PCA) to help me choose the features?
    - **A:** Yes.
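
The clustering answer above (softmax over the KMeans result) could be sketched as a small custom transformer. This is only an illustration under assumptions: the class name `SoftKMeansFeatures` and its `temperature` parameter are hypothetical and are not part of scikit-learn or of the course material. The idea is to turn the distances to each centroid into weights that sum to 1, where a lower temperature exaggerates the closest cluster.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans


class SoftKMeansFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: softmax over negative KMeans distances."""

    def __init__(self, n_clusters=3, temperature=1.0, random_state=0):
        self.n_clusters = n_clusters
        self.temperature = temperature
        self.random_state = random_state

    def fit(self, X, y=None):
        self.kmeans_ = KMeans(
            n_clusters=self.n_clusters, n_init=10, random_state=self.random_state
        ).fit(X)
        return self

    def transform(self, X):
        # KMeans.transform returns the distance from each row to each centroid.
        distances = self.kmeans_.transform(X)
        logits = -distances / self.temperature
        # Subtract the row maximum for numerical stability before exponentiating.
        logits -= logits.max(axis=1, keepdims=True)
        weights = np.exp(logits)
        return weights / weights.sum(axis=1, keepdims=True)


# Toy data (invented): two well separated blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
soft = SoftKMeansFeatures(n_clusters=2, temperature=0.5).fit_transform(X_demo)
```

Each row of `soft` holds one weight per cluster, so it can be concatenated to the other features instead of a hard one-hot cluster label.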


Test stacking: train several models, then train a final model on the predictions of those models.
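
A minimal sketch of that stacking idea, assuming scikit-learn's `StackingRegressor` is an acceptable way to do it. The base models, the final estimator and the synthetic data are arbitrary choices for illustration, not the configuration used in this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the housing dataset.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    # The final model is trained on cross-validated predictions of the base models.
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
r2 = stack.score(X_te, y_te)  # R^2 on the held-out part
```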

#### NOTES
Use LASSO for feature selection.
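
One way this note could be put into practice is with `SelectFromModel` wrapped around a `Lasso`. The sketch below uses synthetic data and an arbitrary `alpha`; both are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 20 features, only 5 of which actually drive the target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

selector = make_pipeline(
    StandardScaler(),                   # the L1 penalty is scale sensitive
    SelectFromModel(Lasso(alpha=1.0)),  # keeps features with non-zero coefficients
)
selector.fit(X_demo, y_demo)

mask = selector[-1].get_support()       # boolean mask over the 20 input columns
X_selected = selector.transform(X_demo)
```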

Ordinary linear regression is sensitive to outliers, so use a variant that ignores them.

RANSAC -> linear regression that ignores outliers
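
A small sketch of the RANSAC note, assuming scikit-learn's `RANSACRegressor` with its default residual threshold. The data is synthetic: a line with slope 3 and intercept 2, plus 20 rows shifted far upward to act as outliers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + 2.0 + rng.normal(0, 0.5, size=200)
y_demo[:20] += 100.0  # inject gross outliers into the first 20 rows

# Plain least squares uses every row, outliers included.
ols = LinearRegression().fit(X_demo, y_demo)

# RANSAC fits on random subsets, keeps the consensus set, and refits on the inliers.
ransac = RANSACRegressor(LinearRegression(), random_state=0).fit(X_demo, y_demo)
slope_ransac = ransac.estimator_.coef_[0]
intercept_ransac = ransac.estimator_.intercept_
```

`ransac.inlier_mask_` marks which rows were treated as inliers, which is a quick way to inspect what the model chose to ignore.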

## Step 0

In this step, we will:
- Import libraries
- Load the data
- Check, looking only at the column descriptions, whether there are columns that should not go into the final dataset (such as an ID or some other arbitrary identifier).
- Split the dataset into train and test sets

### Libraries and Global Settings

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # used by the plt.rcParams / plt.style calls below
from datetime import datetime
from utils import *


from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.preprocessing  import FunctionTransformer, StandardScaler, MinMaxScaler, OneHotEncoder, PolynomialFeatures
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.cluster        import KMeans
import xgboost as xgb


In [2]:
plt.rcParams['figure.figsize'] = (12, 6)
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['font.size'] = 14
plt.rcParams['figure.autolayout'] = True

### Constants

In [3]:
SEED = 420

### Data Loading and Preprocessing

In [4]:
dataset = load_data()
dataset.head()

Unnamed: 0,Order,PID,MS.SubClass,MS.Zoning,Lot.Frontage,Lot.Area,Street,Alley,Lot.Shape,Land.Contour,...,Pool.Area,Pool.QC,Fence,Misc.Feature,Misc.Val,Mo.Sold,Yr.Sold,Sale.Type,Sale.Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In this dataset, duplicates would not be intentional, since each house should be unique.  
Therefore, we remove them.

In [5]:
print(f"Total rows before removing duplicates: {dataset.shape[0]}")
dataset.drop_duplicates(inplace=True)
print(f"Total rows after removing duplicates: {dataset.shape[0]}")

Total rows before removing duplicates: 2930
Total rows after removing duplicates: 2930


Since the first column is the observation ID and the second is an identifier, we can remove both, as these values are arbitrary.

In [6]:
dataset = dataset.iloc[:, 2:] # Drop the first two columns: the ID and the PID (Parcel identification number)

### Creating new features

While analyzing the features as described above, we saw room for new features that may prove useful when modeling the data:
- **Tot Lot Area** : `Lot Frontage + Lot Area`
- **Bsmt Tot Bath** : `Bsmt Full Bath + 0.5*Bsmt Half Bath`
- **Garage Area/Car** : `Garage Area / Garage Cars`
- **Tot Porch SF** : `Open Porch SF + Enclosed Porch + 3Ssn Porch + Screen Porch`
- **Date Sold** : `timestamp(Month Sold, Year Sold)`

Only **Bsmt Tot Bath** and **Tot Porch SF** are actually created in the cell below; the other three are left commented out.

In [7]:
# dataset.loc[:, 'Tot.Lot.Area'] = dataset.loc[:, 'Lot.Frontage'] + dataset.loc[:, 'Lot.Area']
dataset.loc[:, 'Bsmt.Tot.Bath'] = dataset.loc[:, 'Bsmt.Full.Bath'] + 0.5*dataset.loc[:, 'Bsmt.Half.Bath']
# dataset.loc[:, 'Garage.Area/Cars'] = dataset.loc[:, 'Garage.Area'] / dataset.loc[:, 'Garage.Cars']
dataset.loc[:, 'Tot.Porch.SF'] = dataset.loc[:, 'Open.Porch.SF'] + dataset.loc[:, 'X3Ssn.Porch'] + dataset.loc[:, 'Enclosed.Porch'] + dataset.loc[:, 'Screen.Porch']
# dataset.loc[:, 'Date.Sold'] = pd.to_datetime(dict(year=dataset['Yr.Sold'], month=dataset['Mo.Sold'], day=1)).apply(lambda x: x.timestamp())

### Train-Test Split

- From now on, we use only the training set; the test partition is treated as if it did not exist yet.
- The full dataset is split 80/20, since we have little data (2930 rows in total).
- Since this is not a time series, we can split the data at random.

In [8]:
X, y = dataset.drop('SalePrice', axis=1), dataset.loc[:, 'SalePrice']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

## Step 1

### Exploratory Analysis

In this part, we do a global analysis of the data, just to make sure it is sound.  
To that end, we identify the features and the target, check their types, and look for issues such as:
- Null values
- Duplicated rows
- Outliers
- Spikes
- Gross errors

In addition, we look at the distribution and overall "shape" of each variable.

#### Missing Values and Data Types

In [10]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2344 entries, 1157 to 1096
Data columns (total 81 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   MS.SubClass      2344 non-null   int64  
 1   MS.Zoning        2344 non-null   object 
 2   Lot.Frontage     1945 non-null   float64
 3   Lot.Area         2344 non-null   int64  
 4   Street           2344 non-null   object 
 5   Alley            159 non-null    object 
 6   Lot.Shape        2344 non-null   object 
 7   Land.Contour     2344 non-null   object 
 8   Utilities        2344 non-null   object 
 9   Lot.Config       2344 non-null   object 
 10  Land.Slope       2344 non-null   object 
 11  Neighborhood     2344 non-null   object 
 12  Condition.1      2344 non-null   object 
 13  Condition.2      2344 non-null   object 
 14  Bldg.Type        2344 non-null   object 
 15  House.Style      2344 non-null   object 
 16  Overall.Qual     2344 non-null   int64  
 17  Overall.Cond    

#### Data Distribution

In this part, we look specifically at the distribution of the data.  
The cells below show:
- Distribution of the numerical data, with `count`, `min`, `max`, `std`, `mean`, and the quartiles.
- Distribution of the categorical data, with `count`, `unique`, `top` (mode), and `freq` (number of occurrences of the mode).

In [11]:
X_train.describe()

Unnamed: 0,MS.SubClass,Lot.Frontage,Lot.Area,Overall.Qual,Overall.Cond,Year.Built,Year.Remod.Add,Mas.Vnr.Area,BsmtFin.SF.1,BsmtFin.SF.2,...,Open.Porch.SF,Enclosed.Porch,X3Ssn.Porch,Screen.Porch,Pool.Area,Misc.Val,Mo.Sold,Yr.Sold,Bsmt.Tot.Bath,Tot.Porch.SF
count,2344.0,1945.0,2344.0,2344.0,2344.0,2344.0,2344.0,2325.0,2343.0,2343.0,...,2344.0,2344.0,2344.0,2344.0,2344.0,2344.0,2344.0,2344.0,2342.0,2344.0
mean,57.487201,69.387147,10130.794795,6.101109,5.558447,1971.458191,1984.43686,103.673548,436.46863,51.261204,...,47.62756,22.955205,2.389078,15.336604,2.396331,41.335751,6.234215,2007.791809,0.459223,88.308447
std,42.697657,23.645307,7021.928686,1.413162,1.103673,30.424244,20.945233,181.229599,453.039342,170.845327,...,67.760644,61.518987,23.262509,54.365229,37.264067,378.295773,2.728457,1.305689,0.521169,104.923186
min,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,0.0,0.0
25%,20.0,59.0,7500.0,5.0,5.0,1954.0,1965.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0,0.0,0.0
50%,50.0,68.0,9504.0,6.0,5.0,1973.0,1993.0,0.0,362.0,0.0,...,28.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,0.0,50.0
75%,70.0,80.0,11618.25,7.0,6.0,2001.0,2004.0,166.0,729.0,0.0,...,70.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,1.0,135.0
max,190.0,313.0,164660.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1526.0,...,742.0,584.0,407.0,490.0,800.0,12500.0,12.0,2010.0,3.0,1027.0


In [12]:
X_train.describe(include=np.object_)

Unnamed: 0,MS.Zoning,Street,Alley,Lot.Shape,Land.Contour,Utilities,Lot.Config,Land.Slope,Neighborhood,Condition.1,...,Garage.Type,Garage.Finish,Garage.Qual,Garage.Cond,Paved.Drive,Pool.QC,Fence,Misc.Feature,Sale.Type,Sale.Condition
count,2344,2344,159,2344,2344,2344,2344,2344,2344,2344,...,2212,2211,2211,2211,2344,11,459,85,2344,2344
unique,7,2,2,4,4,3,5,3,28,9,...,6,3,5,5,3,4,4,4,10,6
top,RL,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,...,Attchd,Unf,TA,TA,Y,Gd,MnPrv,Shed,WD,Normal
freq,1815,2334,98,1476,2099,2341,1708,2231,349,2037,...,1387,977,2090,2125,2128,4,260,77,2018,1908


##### Plots

In [13]:
plot_distribution(X_train, 'x_train_original.png')

Gráfico salvo em ./graphs/x_train_original.png


We made this plot for a better view; it shows that several features are strictly positive and have a long right tail.  
  
In this case, the ideal is to transform them so that they are closer to a normal distribution.  

<img src="./graphs/x_train_original.png" alt="drawing" width="700"/>  
  
Therefore, we apply a log transform to the columns with a right tail and plot the result to visualize the difference.

In [14]:
"""
Get the names of the numerical, categorical, and right-skewed columns
so that we can apply the required transformations.
These variables are used throughout the notebook.
"""
right_skewed, numerical, categorical = get_column_subsets(X_train)
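
`get_column_subsets` lives in `utils` and is not shown in this notebook. As a rough idea of what such a helper might do, here is a hypothetical stand-in that is not the authors' implementation: the name `get_column_subsets_sketch` and the `skew_threshold` value are assumptions. It treats a numeric column as right-skewed when it is non-negative and its sample skewness exceeds the threshold, and it keeps the three subsets disjoint.

```python
import numpy as np
import pandas as pd


def get_column_subsets_sketch(df, skew_threshold=1.0):
    """Hypothetical stand-in for utils.get_column_subsets (not the authors' code)."""
    numeric_cols = df.select_dtypes(include="number").columns
    categorical = list(df.select_dtypes(exclude="number").columns)
    skewness = df[numeric_cols].skew()
    # Non-negative and skewed to the right, so np.log1p is well defined.
    right_skewed = [
        col for col in numeric_cols
        if df[col].min() >= 0 and skewness[col] > skew_threshold
    ]
    numerical = [col for col in numeric_cols if col not in right_skewed]
    return right_skewed, numerical, categorical


# Toy frame (invented values) with one skewed, one symmetric and one categorical column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "area": rng.lognormal(mean=9.0, sigma=1.0, size=500),
    "quality": rng.integers(1, 11, size=500),
    "zoning": rng.choice(["RL", "RM"], size=500),
})
right_skewed_demo, numerical_demo, categorical_demo = get_column_subsets_sketch(demo)
```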

In [15]:
X_train_log = X_train.copy()
X_train_log[right_skewed] = np.log1p(X_train_log[right_skewed])

plot_distribution(X_train_log.select_dtypes(include='number'), 'x_train_log.png')

Gráfico salvo em ./graphs/x_train_log.png


<img src="./graphs/x_train_log.png" alt="drawing" width="700"/>  

#### Target Distribution and Outlier Removal

In [16]:
# Now we just remove the outliers found in the dataset's target
print(f"Total rows before removing outliers: {X_train.shape[0]}")
X_train, y_train = remove_outliers(X_train, y_train)
print(f"Total rows after removing outliers: {X_train.shape[0]}")

Total rows before removing outliers: 2344
Total rows after removing outliers: 2298
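
The cell above drops 46 rows (2344 to 2298), but `remove_outliers` is defined in `utils` and the rule it applies is not shown here. One common choice is Tukey's IQR rule on the target; the sketch below assumes that rule purely for illustration, and the name `remove_outliers_sketch`, the factor `k=1.5`, and the toy values are all hypothetical.

```python
import numpy as np
import pandas as pd


def remove_outliers_sketch(X, y, k=1.5):
    """Hypothetical stand-in for utils.remove_outliers: IQR rule on the target."""
    q1, q3 = y.quantile([0.25, 0.75])
    iqr = q3 - q1
    # Keep rows whose target lies within [q1 - k*iqr, q3 + k*iqr].
    keep = y.between(q1 - k * iqr, q3 + k * iqr)
    return X.loc[keep], y.loc[keep]


# Toy data: 99 ordinary prices plus one extreme value.
rng = np.random.default_rng(0)
y_demo = pd.Series(np.append(rng.normal(180_000, 20_000, 99), 2_000_000.0))
X_demo = pd.DataFrame({"Gr.Liv.Area": rng.normal(1500, 300, 100)})
X_clean, y_clean = remove_outliers_sketch(X_demo, y_demo)
```

Filtering `X` and `y` with the same boolean mask keeps the two aligned, which matters because later cells index `X_train` and `y_train` together.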


In [17]:
y_hist(y_train, 'target distribution')

Gráfico salvo em ./graphs/target distribution


In [18]:
num_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

c_log_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('log', FunctionTransformer(np.log1p, validate=False, feature_names_out='one-to-one')),
])

cat_pipe = Pipeline(steps=[
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

preprocessing_pipeline = ColumnTransformer(
    transformers = [
        ('num', num_pipe, numerical),
        ('clog', c_log_pipe, right_skewed),
        ('cat', cat_pipe, categorical)
    ],
    remainder='passthrough'
)

preprocessing_pipeline

In [19]:
X_train_transformed = preprocessing_pipeline.fit_transform(X_train)

In [20]:
rnd_clf = RandomForestRegressor(
    n_estimators=700,
    max_leaf_nodes=16,
    random_state=SEED,
    n_jobs=1
)

In [21]:
rnd_clf.fit(X_train_transformed, y_train)

In [22]:
importances = rnd_clf.feature_importances_

feature_names = preprocessing_pipeline.get_feature_names_out()
feature_importances_df = pd.DataFrame(zip(feature_names, importances), columns=['Feature', 'Importance']) \
    .sort_values(by='Importance', ascending=False)

feature_importances_df = aggregate_categorical_importances(feature_importances_df.set_index('Feature'))

top15_features = list(feature_importances_df[:14].index)+['cat__MS.Zoning']
X_feats = [feat.split('__')[1] for feat in top15_features]
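
`aggregate_categorical_importances` also comes from `utils`. Judging from how its result is used, it appears to sum the importances of the one-hot columns back into one entry per original categorical column. The sketch below is a hypothetical version of that idea, not the authors' code: it works on a `Series` and assumes category values contain no underscore.

```python
import pandas as pd


def aggregate_importances_sketch(importances):
    """Hypothetical stand-in for utils.aggregate_categorical_importances."""

    def original_column(name):
        prefix, rest = name.split("__", 1)
        if prefix == "cat":
            # One-hot names look like 'cat__Kitchen.Qual_Gd'; drop the category suffix.
            # Assumption: the category value itself contains no underscore.
            rest = rest.rsplit("_", 1)[0]
        return f"{prefix}__{rest}"

    # groupby with a function applies it to each index label.
    grouped = importances.groupby(original_column).sum()
    return grouped.sort_values(ascending=False)


# Invented importances for illustration.
demo = pd.Series({
    "num__Overall.Qual": 0.50,
    "cat__Kitchen.Qual_Gd": 0.10,
    "cat__Kitchen.Qual_TA": 0.15,
    "clog__Gr.Liv.Area": 0.25,
})
aggregated = aggregate_importances_sketch(demo)
```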

In [23]:
X_train = X_train[X_feats]

## Part 3/4/5/6/...

1. Choose models (stacking methods included)
    - DummyRegressor
    - LinearRegression
    - Other basic models
        - Polynomial Features
        - Scalers
    - Advanced pipelines
        - Use KMeans as a source of new features in the pipeline
        - Ensemble methods

2. Set up GridSearchCV with hyperparameters

In [24]:
right_skewed, numerical, categorical = split_by_prefix(top15_features)

In [34]:
right_skewed, numerical, categorical

(['Gr.Liv.Area',
  'Total.Bsmt.SF',
  'X1st.Flr.SF',
  'Garage.Area',
  'BsmtFin.SF.1',
  'Lot.Area'],
 ['Overall.Qual', 'Garage.Cars', 'Year.Built', 'Full.Bath', 'Year.Remod.Add'],
 ['Bsmt.Qual', 'Garage.Type', 'Kitchen.Qual', 'MS.Zoning'])

In [26]:
# 3) build the inner ColumnTransformer
log_pipe = Pipeline([
    ("log1p",   FunctionTransformer(np.log1p, validate=False)),
    ("impute",  SimpleImputer(strategy="median")),
])
num_pipe = Pipeline([
    ("impute",  SimpleImputer(strategy="median")),
    # scaler will be overridden in grid
    ("scale",   StandardScaler()),
])
cat_pipe = Pipeline([
    ("impute",  SimpleImputer(strategy="constant", fill_value="MISSING")),
    ("ohe",     OneHotEncoder(handle_unknown="ignore", drop="first")),
])

base_preprocessor = ColumnTransformer([
    ("skewed",   log_pipe,  right_skewed),
    ("numeric",  num_pipe,  numerical),
    ("categorical", cat_pipe, categorical),
])

In [27]:
# 4) wrap in a FeatureUnion so we can add KMeans & Poly branches
full_features = FeatureUnion([
    ("base", base_preprocessor),
    ("kmeans", Pipeline([
        ("cluster", KMeans()),                 # will tune n_clusters
        ("onehot",  OneHotEncoder(handle_unknown="ignore", drop="first")),
    ])),
    ("poly", Pipeline([
        ("poly", PolynomialFeatures(include_bias=False)),  # will tune degree
    ])),
])

In [28]:
# 5) single master pipeline
pipe = Pipeline([
    ("features",  full_features),
    ("regressor", DummyRegressor()),  # placeholder
])

In [29]:
# 6) custom RMSE scorer
rmse = make_scorer(lambda y_true, y_pred: 
                   np.sqrt(mean_squared_error(y_true, y_pred)),
                   greater_is_better=False)

In [30]:
# 7) param_distributions as a list of dicts
param_distributions = [

    # ─────────── baseline regressors ───────────
    {
      "regressor": [DummyRegressor(), LinearRegression()],
      "features__kmeans__cluster__n_clusters": [1, 2, 3, 4, 5, 6],
      "features__poly__poly__degree": [1],   # no poly for baseline
      "features__base__numeric__scale": [StandardScaler(), MinMaxScaler()],
    },

    # ─────────── Ridge & Lasso ───────────
    {
      "regressor": [Lasso(), Ridge()],
      "regressor__alpha": [0.1, 1, 10, 100],
      "features__kmeans__cluster__n_clusters": [1, 2, 3, 4, 5, 6],
      "features__poly__poly__degree": [1, 2],
      "features__base__numeric__scale": [StandardScaler(), MinMaxScaler()],
    },

    # ────────── ElasticNet ──────────
    {
      "regressor": [ElasticNet()],
      "regressor__alpha": [0.1, 1, 10, 100],
      "regressor__l1_ratio": [0.1, 0.5, 0.9],
      "features__kmeans__cluster__n_clusters": [1, 2, 3, 4, 5, 6],
      "features__poly__poly__degree": [1, 2],
      "features__base__numeric__scale": [None, StandardScaler(), MinMaxScaler()],
    },

    # ───────── RandomForest ─────────
    {
      "regressor": [RandomForestRegressor(random_state=42)],
      "regressor__n_estimators": [500, 700, 1000],
      "regressor__max_depth": [None, 10, 20, 30],
      "regressor__min_samples_split": [2, 5, 10],
      "regressor__bootstrap": [True, False],
      "features__kmeans__cluster__n_clusters": [1, 2, 3, 4, 5, 6],
      "features__poly__poly__degree": [1, 2, 3],
      "features__base__numeric__scale": [None, StandardScaler(), MinMaxScaler()],
    },

    # ─────── GradientBoosting ───────
    {
      "regressor": [GradientBoostingRegressor(random_state=42)],
      "regressor__n_estimators": [500, 700, 1000],
      "regressor__learning_rate": [0.01, 0.05, 0.1],
      "regressor__max_depth": [3, 5, 7],
      "regressor__subsample": [0.6, 0.8, 1.0],
      "features__kmeans__cluster__n_clusters": [1, 2, 3, 4, 5, 6],
      "features__poly__poly__degree": [1, 2, 3],
      "features__base__numeric__scale": [None, StandardScaler(), MinMaxScaler()],
    },

    # ─────────── XGBoost ───────────
    {
      "regressor": [xgb.XGBRegressor(random_state=42, objective="reg:squarederror")],
      "regressor__n_estimators": [500, 700, 1000],
      "regressor__learning_rate": [0.01, 0.05, 0.1],
      "regressor__max_depth": [3, 5, 7, 10],
      "regressor__subsample": [0.6, 0.8, 1.0],
      "regressor__colsample_bytree": [0.6, 0.8, 1.0],
      "regressor__reg_alpha": [0, 0.1, 1, 10],
      "regressor__reg_lambda": [1, 10, 100],
      "features__kmeans__cluster__n_clusters": [1, 2, 3, 4, 5, 6],
      "features__poly__poly__degree": [1, 2, 3],
      "features__base__numeric__scale": [None, StandardScaler(), MinMaxScaler()],
    },
]

In [31]:
X_train.columns

Index(['Overall.Qual', 'Gr.Liv.Area', 'Garage.Cars', 'Total.Bsmt.SF',
       'X1st.Flr.SF', 'Garage.Area', 'BsmtFin.SF.1', 'Bsmt.Qual', 'Year.Built',
       'Lot.Area', 'Full.Bath', 'Garage.Type', 'Year.Remod.Add',
       'Kitchen.Qual', 'MS.Zoning'],
      dtype='object')

In [32]:
right_skewed, numerical, categorical

(['Gr.Liv.Area',
  'Total.Bsmt.SF',
  'X1st.Flr.SF',
  'Garage.Area',
  'BsmtFin.SF.1',
  'Lot.Area'],
 ['Overall.Qual', 'Garage.Cars', 'Year.Built', 'Full.Bath', 'Year.Remod.Add'],
 ['Bsmt.Qual', 'Garage.Type', 'Kitchen.Qual', 'MS.Zoning'])

In [33]:
# 8) wrap in RandomizedSearchCV
search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=50,                    # sample 50 of these combos
    scoring=rmse,
    cv=5,
    n_jobs=-1,
    random_state=42,
    verbose=2,
)

# 9) run it
search.fit(X_train, y_train)
print("Best RMSE:", -search.best_score_)
print("Best params:", search.best_params_)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


ValueError: 
All the 250 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
250 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\pipeline.py", line 654, in fit
    Xt = self._fit(X, y, routed_params, raw_params=params)
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\pipeline.py", line 588, in _fit
    X, fitted_transformer = fit_transform_one_cached(
                            ~~~~~~~~~~~~~~~~~~~~~~~~^
        cloned_transformer,
        ^^^^^^^^^^^^^^^^^^^
    ...<5 lines>...
        params=step_params,
        ^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\joblib\memory.py", line 326, in __call__
    return self.func(*args, **kwargs)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\pipeline.py", line 1551, in _fit_transform_one
    res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\utils\_set_output.py", line 319, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\pipeline.py", line 1974, in fit_transform
    results = self._parallel_func(X, y, _fit_transform_one, routed_params)
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\pipeline.py", line 1996, in _parallel_func
    return Parallel(n_jobs=self.n_jobs)(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        delayed(func)(
        ^^^^^^^^^^^^^^
    ...<8 lines>...
        for idx, (name, transformer, weight) in enumerate(transformers, 1)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\utils\parallel.py", line 77, in __call__
    return super().__call__(iterable_with_config)
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\joblib\parallel.py", line 1985, in __call__
    return output if self.return_generator else list(output)
                                                ~~~~^^^^^^^^
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\joblib\parallel.py", line 1913, in _get_sequential_output
    res = func(*args, **kwargs)
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\utils\parallel.py", line 139, in __call__
    return self.function(*args, **kwargs)
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\pipeline.py", line 1551, in _fit_transform_one
    res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\pipeline.py", line 718, in fit_transform
    Xt = self._fit(X, y, routed_params)
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\pipeline.py", line 588, in _fit
    X, fitted_transformer = fit_transform_one_cached(
                            ~~~~~~~~~~~~~~~~~~~~~~~~^
        cloned_transformer,
        ^^^^^^^^^^^^^^^^^^^
    ...<5 lines>...
        params=step_params,
        ^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\joblib\memory.py", line 326, in __call__
    return self.func(*args, **kwargs)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\pipeline.py", line 1551, in _fit_transform_one
    res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\utils\_set_output.py", line 319, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\cluster\_kmeans.py", line 1122, in fit_transform
    return self.fit(X, sample_weight=sample_weight)._transform(X)
           ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\cluster\_kmeans.py", line 1454, in fit
    X = validate_data(
        self,
    ...<5 lines>...
        accept_large_sparse=False,
    )
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\utils\validation.py", line 2944, in validate_data
    out = check_array(X, input_name="X", **check_params)
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\utils\validation.py", line 1055, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\sklearn\utils\_array_api.py", line 839, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
  File "c:\Users\henri\Documents\Insper\25.1\ML\projeto_ml_4sem\venv\Lib\site-packages\pandas\core\generic.py", line 2153, in __array__
    arr = np.asarray(values, dtype=dtype)
ValueError: could not convert string to float: 'Gd'
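
The traceback ends inside `KMeans.fit`, which received the string `'Gd'`. A likely cause is that `FeatureUnion` hands the same raw `X_train` to every branch, so the `kmeans` and `poly` branches see the unencoded categorical columns (for example `Kitchen.Qual`) instead of the output of `base_preprocessor`. One possible fix is to run the preprocessor first and apply the union to its numeric output. The sketch below shows that layout on toy data: the column values and step names are invented, and `KMeans` is used directly as a transformer, so it emits distances to the centroids rather than one-hot labels.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Toy data (invented) with numeric and categorical columns.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "Gr.Liv.Area": rng.uniform(500, 3000, 60),
    "Overall.Qual": rng.integers(1, 11, 60),
    "Kitchen.Qual": rng.choice(["Gd", "TA", "Ex"], 60),
})
toy_y = 100 * toy_X["Gr.Liv.Area"] + 5000 * toy_X["Overall.Qual"]

# Step 1: turn every column into numbers (dense output via sparse_threshold=0).
preprocessor = ColumnTransformer([
    ("numeric", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["Gr.Liv.Area", "Overall.Qual"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Kitchen.Qual"]),
], sparse_threshold=0)

# Step 2: both branches now receive the already numeric matrix.
extra_features = FeatureUnion([
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # includes degree-1 terms
])

fixed_pipe = Pipeline([
    ("preprocess", preprocessor),
    ("features", extra_features),
    ("regressor", LinearRegression()),
])
fixed_pipe.fit(toy_X, toy_y)
predictions = fixed_pipe.predict(toy_X)
```

With this layout the search keys would change accordingly, for example `features__kmeans__n_clusters`, `features__poly__degree` and `preprocess__numeric__scale`.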


## Part 5

Model selection with GridSearchCV

In [None]:
# Hyperparameters of a Pipeline step must be addressed as '<step name>__<parameter>'
param_grid = [{
    'regressor' : [LinearRegression(), DummyRegressor()],
}, {
    'regressor': [Lasso(), Ridge()],
    'regressor__alpha': [0.1, 1, 10, 100],
}, {
    'regressor': [ElasticNet()],
    'regressor__alpha': [0.1, 1, 10, 100],
    'regressor__l1_ratio': [0.1, 0.5, 0.9]
}, {
    'regressor': [RandomForestRegressor()],
    'regressor__n_estimators': [10, 50, 100],
    'regressor__max_depth': [None, 10, 20, 30],
    'regressor__min_samples_split': [2, 5, 10]
}, {
    'regressor': [GradientBoostingRegressor()],
    'regressor__n_estimators': [10, 50, 100],
    'regressor__learning_rate': [0.01, 0.1, 0.2],
    'regressor__max_depth': [3, 5, 7]
}]

grid = GridSearchCV(
    estimator=Pipeline(steps=[
        ('preprocessor', preprocessing_pipeline),
        ('regressor', RandomForestRegressor())
    ]),
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=5,
    verbose=1,
    n_jobs=-1
)