# Final Machine Learning Project
By: _Henrique Bucci_ and _Marcelo Alonso_

Data and information: https://www.kaggle.com/datasets/marcopale/housing/data

**Questions**  
- Can I remove the PID?
    - **A:** Yes
- Can I create columns computed from other columns before doing the feature selection?
    - **A:** Yes
- If I apply PolynomialFeatures to the data, do the generated columns also count toward the feature count?
    - **A:** Apply PolynomialFeatures after selecting the features
- Can I use correlation in the exploratory analysis?
    - **A:** You can, but it is "useless"
- Can I use clustering methods in the pipeline to include the cluster assignment as a new feature?
    - **A:** Apply SoftMax to the KMeans output to exaggerate the closest class.
- Can I use a dimensionality reduction method (e.g. PCA) to help choose the features?
    - **A:** Yes.


Test stacking: train several models, then train a final model on the predictions of those models.
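
The stacking idea in the note above can be sketched with scikit-learn's `StackingRegressor`. This is only an illustration under assumptions: the data is synthetic (from `make_regression`), and the base models and their settings are arbitrary picks, not the ones used in this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the housing dataset (assumption for illustration).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    # The final model is trained on out-of-fold predictions of the base models.
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X_tr, y_tr)
stack_r2 = stack.score(X_te, y_te)
print(f"Stacked R^2 on held-out data: {stack_r2:.3f}")
```

Using out-of-fold predictions (`cv=5`) keeps the final model from seeing predictions that the base models made on their own training rows.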

#### NOTES
Use LASSO for feature selection.
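
A minimal sketch of LASSO-based selection, assuming synthetic data and an arbitrary `alpha`: `SelectFromModel` keeps the columns whose Lasso coefficient is not shrunk to (near) zero. The features are standardized first so the L1 penalty treats them on the same scale.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which actually drive the target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

selector = make_pipeline(StandardScaler(), SelectFromModel(Lasso(alpha=1.0)))
selector.fit(X_demo, y_demo)

# Boolean mask over the original columns: True means the feature was kept.
kept_mask = selector[-1].get_support()
print("Features kept:", np.flatnonzero(kept_mask))
```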

Linear regression that ignores outliers: RANSAC.
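
To illustrate the note above, here is a small sketch on synthetic data (the line `y = 3x + 2` and the injected outliers are assumptions for the demo). `RANSACRegressor` fits a `LinearRegression` by default on random subsets and keeps the model with the most inliers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=200)
y[:20] += 50.0  # inject gross outliers into the first 20 rows

X_demo = x.reshape(-1, 1)
ols = LinearRegression().fit(X_demo, y)
ransac = RANSACRegressor(random_state=0).fit(X_demo, y)

# Ordinary least squares is pulled upward by the outliers; RANSAC is refit on inliers only.
print("OLS intercept:   ", ols.intercept_)
print("RANSAC intercept:", ransac.estimator_.intercept_)
print("Rows flagged as outliers:", int((~ransac.inlier_mask_).sum()))
```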

## Stage 0

In this stage, we will:
- Import libraries
- Load the data
- Check, based only on the column descriptions, for columns that should not be in the final dataset (such as an ID or another arbitrary identifier).
- Split the dataset into train and test sets

### Libraries and Global Settings

In [125]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed by the plt.rcParams / plt.style calls in the next cell
from datetime import datetime
from utils import *  # project helpers (load_data, plot_distribution, get_column_subsets, ...)


from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.preprocessing import FunctionTransformer, StandardScaler, MinMaxScaler, OneHotEncoder, PolynomialFeatures
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.cluster import KMeans
import xgboost as xgb


In [126]:
plt.rcParams['figure.figsize'] = (12, 6)
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['font.size'] = 14
plt.rcParams['figure.autolayout'] = True

### Constants

In [127]:
SEED = 420

### Loading and Preprocessing the Data

In [128]:
dataset = load_data()
dataset.head()

Unnamed: 0,Order,PID,MS.SubClass,MS.Zoning,Lot.Frontage,Lot.Area,Street,Alley,Lot.Shape,Land.Contour,...,Pool.Area,Pool.QC,Fence,Misc.Feature,Misc.Val,Mo.Sold,Yr.Sold,Sale.Type,Sale.Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In this dataset, duplicates would not be intentional, since each house should be unique.  
So we will remove them.

In [129]:
print(f"Total rows before removing duplicates: {dataset.shape[0]}")
dataset.drop_duplicates(inplace=True)
print(f"Total rows after removing duplicates: {dataset.shape[0]}")

Total rows before removing duplicates: 2930
Total rows after removing duplicates: 2930


Since the first column is the observation ID and the second is an identifier, we can drop both: their values are arbitrary.

In [130]:
dataset = dataset.iloc[:, 2:] # Drop the first two columns: the ID and the PID (Parcel identification number)

### Creating new features

Analyzing the features as described above, we saw room for new features that may prove useful when modeling the data:
- **Tot Lot Area** : `Lot Frontage + Lot Area`
- **Bsmt Tot Bath** : `Bsmt Full Bath + 0.5*Bsmt Half Bath`
- **Garage Area/Car** : `Garage Area / Garage Cars`
- **Tot Porch SF** : `Open Porch SF + Enclosed Porch + 3Ssn Porch + Screen Porch`
- **Date Sold** : `timestamp(Month Sold, Year Sold)`

Only `Bsmt.Tot.Bath` and `Tot.Porch.SF` are actually created in the cell below; the other three are left commented out.

In [131]:
# dataset.loc[:, 'Tot.Lot.Area'] = dataset.loc[:, 'Lot.Frontage'] + dataset.loc[:, 'Lot.Area']
dataset.loc[:, 'Bsmt.Tot.Bath'] = dataset.loc[:, 'Bsmt.Full.Bath'] + 0.5*dataset.loc[:, 'Bsmt.Half.Bath']
# dataset.loc[:, 'Garage.Area/Cars'] = dataset.loc[:, 'Garage.Area'] / dataset.loc[:, 'Garage.Cars']
dataset.loc[:, 'Tot.Porch.SF'] = dataset.loc[:, 'Open.Porch.SF'] + dataset.loc[:, 'X3Ssn.Porch'] + dataset.loc[:, 'Enclosed.Porch'] + dataset.loc[:, 'Screen.Porch']
# dataset.loc[:, 'Date.Sold'] = pd.to_datetime(dict(year=dataset['Yr.Sold'], month=dataset['Mo.Sold'], day=1)).apply(lambda x: x.timestamp())

### Train-Test Split

- From now on we use only the training set; the test partition is treated as if it did not exist yet.
- The full dataset is split 80/20, since we have little data (2930 rows in total).
- Since this is not a time series, we can split at random.

In [132]:
X, y = dataset.drop('SalePrice', axis=1), dataset.loc[:, 'SalePrice']

In [133]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

## Stage 1

### Exploratory Analysis

In this part we do a global analysis of the data, just to verify its integrity.  
To that end, we identify the features and the target, check their types, and look for other information such as:
- Missing values
- Duplicated rows
- Outliers
- Spikes
- Gross errors

We also look at the distribution and the "shape" of each variable.

#### Missing Values and Data Types

In [134]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2344 entries, 1157 to 1096
Data columns (total 81 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   MS.SubClass      2344 non-null   int64  
 1   MS.Zoning        2344 non-null   object 
 2   Lot.Frontage     1945 non-null   float64
 3   Lot.Area         2344 non-null   int64  
 4   Street           2344 non-null   object 
 5   Alley            159 non-null    object 
 6   Lot.Shape        2344 non-null   object 
 7   Land.Contour     2344 non-null   object 
 8   Utilities        2344 non-null   object 
 9   Lot.Config       2344 non-null   object 
 10  Land.Slope       2344 non-null   object 
 11  Neighborhood     2344 non-null   object 
 12  Condition.1      2344 non-null   object 
 13  Condition.2      2344 non-null   object 
 14  Bldg.Type        2344 non-null   object 
 15  House.Style      2344 non-null   object 
 16  Overall.Qual     2344 non-null   int64  
 17  Overall.Cond    

#### Data Distribution

In this part we look specifically at the distribution of the data.  
The cells below show:
- The distribution of the numerical data, with `count`, `min`, `max`, `std`, `mean`, and the quartiles.
- The distribution of the categorical data, with `count`, `unique`, `top` (the mode), and `freq` (number of occurrences of the mode).

In [135]:
X_train.describe()

Unnamed: 0,MS.SubClass,Lot.Frontage,Lot.Area,Overall.Qual,Overall.Cond,Year.Built,Year.Remod.Add,Mas.Vnr.Area,BsmtFin.SF.1,BsmtFin.SF.2,...,Open.Porch.SF,Enclosed.Porch,X3Ssn.Porch,Screen.Porch,Pool.Area,Misc.Val,Mo.Sold,Yr.Sold,Bsmt.Tot.Bath,Tot.Porch.SF
count,2344.0,1945.0,2344.0,2344.0,2344.0,2344.0,2344.0,2325.0,2343.0,2343.0,...,2344.0,2344.0,2344.0,2344.0,2344.0,2344.0,2344.0,2344.0,2342.0,2344.0
mean,57.487201,69.387147,10130.794795,6.101109,5.558447,1971.458191,1984.43686,103.673548,436.46863,51.261204,...,47.62756,22.955205,2.389078,15.336604,2.396331,41.335751,6.234215,2007.791809,0.459223,88.308447
std,42.697657,23.645307,7021.928686,1.413162,1.103673,30.424244,20.945233,181.229599,453.039342,170.845327,...,67.760644,61.518987,23.262509,54.365229,37.264067,378.295773,2.728457,1.305689,0.521169,104.923186
min,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,0.0,0.0
25%,20.0,59.0,7500.0,5.0,5.0,1954.0,1965.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0,0.0,0.0
50%,50.0,68.0,9504.0,6.0,5.0,1973.0,1993.0,0.0,362.0,0.0,...,28.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,0.0,50.0
75%,70.0,80.0,11618.25,7.0,6.0,2001.0,2004.0,166.0,729.0,0.0,...,70.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,1.0,135.0
max,190.0,313.0,164660.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1526.0,...,742.0,584.0,407.0,490.0,800.0,12500.0,12.0,2010.0,3.0,1027.0


In [136]:
X_train.describe(include=np.object_)

Unnamed: 0,MS.Zoning,Street,Alley,Lot.Shape,Land.Contour,Utilities,Lot.Config,Land.Slope,Neighborhood,Condition.1,...,Garage.Type,Garage.Finish,Garage.Qual,Garage.Cond,Paved.Drive,Pool.QC,Fence,Misc.Feature,Sale.Type,Sale.Condition
count,2344,2344,159,2344,2344,2344,2344,2344,2344,2344,...,2212,2211,2211,2211,2344,11,459,85,2344,2344
unique,7,2,2,4,4,3,5,3,28,9,...,6,3,5,5,3,4,4,4,10,6
top,RL,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,...,Attchd,Unf,TA,TA,Y,Gd,MnPrv,Shed,WD,Normal
freq,1815,2334,98,1476,2099,2341,1708,2231,349,2037,...,1387,977,2090,2125,2128,4,260,77,2018,1908


##### Plots

In [137]:
plot_distribution(X_train, 'x_train_original.png')

Gráfico salvo em ./graphs/x_train_original.png


To get a better view we made this plot; it shows that several features are non-negative and have a long right tail.  
  
In that case, ideally we transform them so that their distributions are closer to normal.  

<img src="./graphs/x_train_original.png" alt="drawing" width="700"/>  
  
So we apply a log transform to the right-tailed columns and plot the result to see the difference.

In [138]:
"""
Get the names of the numerical, categorical, and long-right-tailed columns
so we can apply the required transformations.
These variables are used throughout the notebook.
"""
right_skewed, numerical, categorical = get_column_subsets(X_train)
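
The `get_column_subsets` helper comes from the project's `utils` module and is not shown here. As a rough idea of what such a helper could do, the hypothetical sketch below flags non-negative numeric columns whose sample skewness exceeds a threshold. The function name, the threshold of 1.0, and the choice to keep the three lists disjoint are all assumptions, not the project's actual rule.

```python
import pandas as pd

def get_column_subsets_sketch(df: pd.DataFrame, skew_threshold: float = 1.0):
    """Hypothetical stand-in for utils.get_column_subsets (assumed behavior)."""
    numeric_cols = df.select_dtypes(include="number").columns
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    skew = df[numeric_cols].skew()
    # Right-skewed: non-negative values (so log1p is defined) and a long right tail.
    right_skewed = [
        c for c in numeric_cols if df[c].min() >= 0 and skew[c] > skew_threshold
    ]
    numerical = [c for c in numeric_cols if c not in right_skewed]
    return right_skewed, numerical, categorical

demo = pd.DataFrame({
    "area": [1, 2, 2, 3, 3, 4, 5, 200],     # one huge value gives a long right tail
    "quality": [1, 2, 3, 4, 5, 6, 7, 8],    # symmetric, skewness 0
    "zone": list("aabbccdd"),
})
skewed_demo, numerical_demo, categorical_demo = get_column_subsets_sketch(demo)
print(skewed_demo, numerical_demo, categorical_demo)
```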

In [139]:
X_train_log = X_train.copy()
X_train_log[right_skewed] = np.log1p(X_train_log[right_skewed])

plot_distribution(X_train_log.select_dtypes(include='number'), 'x_train_log.png')
plot_distribution(X_train_log[right_skewed], 'x_train_log_only.png')
plot_distribution(X_train[right_skewed], 'x_train_original_sk_only.png')

Gráfico salvo em ./graphs/x_train_log.png
Gráfico salvo em ./graphs/x_train_log_only.png
Gráfico salvo em ./graphs/x_train_original_sk_only.png


After applying the log to the following columns:

In [140]:
right_skewed

['Lot.Frontage',
 'Lot.Area',
 'Mas.Vnr.Area',
 'BsmtFin.SF.1',
 'BsmtFin.SF.2',
 'Bsmt.Unf.SF',
 'Total.Bsmt.SF',
 'X1st.Flr.SF',
 'X2nd.Flr.SF',
 'Gr.Liv.Area',
 'Garage.Area',
 'Wood.Deck.SF',
 'Open.Porch.SF',
 'Enclosed.Porch',
 'Screen.Porch',
 'X3Ssn.Porch',
 'Tot.Porch.SF']

With this transform, the feature distributions move closer to a normal distribution, which makes the data easier for the models to handle.

Once the data is on the log scale, features with very large values carry less weight, and differences are measured multiplicatively rather than additively, emphasizing proportional changes over absolute ones.
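
A small numeric illustration of the multiplicative effect, using made-up areas: doubling a value produces nearly the same step on the `log1p` scale regardless of the starting size, while the absolute steps differ by a factor of ten.

```python
import numpy as np

areas = np.array([1_000.0, 2_000.0, 10_000.0, 20_000.0])
log_areas = np.log1p(areas)

# Step from 1,000 to 2,000 and from 10,000 to 20,000 (both are a doubling).
raw_steps = areas[[1, 3]] - areas[[0, 2]]
log_steps = log_areas[[1, 3]] - log_areas[[0, 2]]

print("Absolute steps: ", raw_steps)   # differ by a factor of 10
print("log1p steps:    ", log_steps)   # both close to log(2)
```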

#### Before/After for the log-transformed features

<img src="./graphs/x_train_original_sk_only.png" alt="drawing" width="700"/> <img src="./graphs/x_train_log_only.png" alt="drawing" width="700"/>  

#### Final result for the features after the `log1p()` transform

<img src="./graphs/x_train_log.png" alt="drawing" width="700"/>

#### Target Distribution and Outlier Removal

First, to help find outliers in the training set, we look at the correlation matrix of the numerical features, including the **target**.

In [141]:
#plot the correlation matrix

X_train_y_train = pd.concat([X_train_log.select_dtypes(include='number'), np.log1p(y_train)], axis=1)

display(X_train_y_train \
    .corr() \
    .style \
    .background_gradient(cmap='coolwarm', axis=None) \
    .set_table_attributes('style="width: 50%;"') \
    .set_caption('Correlação entre variáveis numéricas'))

Unnamed: 0,MS.SubClass,Lot.Frontage,Lot.Area,Overall.Qual,Overall.Cond,Year.Built,Year.Remod.Add,Mas.Vnr.Area,BsmtFin.SF.1,BsmtFin.SF.2,Bsmt.Unf.SF,Total.Bsmt.SF,X1st.Flr.SF,X2nd.Flr.SF,Low.Qual.Fin.SF,Gr.Liv.Area,Bsmt.Full.Bath,Bsmt.Half.Bath,Full.Bath,Half.Bath,Bedroom.AbvGr,Kitchen.AbvGr,TotRms.AbvGrd,Fireplaces,Garage.Yr.Blt,Garage.Cars,Garage.Area,Wood.Deck.SF,Open.Porch.SF,Enclosed.Porch,X3Ssn.Porch,Screen.Porch,Pool.Area,Misc.Val,Mo.Sold,Yr.Sold,Bsmt.Tot.Bath,Tot.Porch.SF,SalePrice
MS.SubClass,1.0,-0.540551,-0.508866,0.038483,-0.064628,0.030041,0.044609,0.002414,-0.014155,-0.078304,-0.14244,-0.077176,-0.297293,0.364871,0.018203,0.074812,0.015699,0.005202,0.135442,0.164769,-0.028485,0.266373,0.032888,-0.04273,0.09588,-0.054884,-0.121999,0.018133,0.005684,-0.034618,-0.016667,-0.054915,-0.005785,-0.024822,-0.009216,-0.019455,0.016998,-0.047676,-0.076828
Lot.Frontage,-0.540551,1.0,0.770277,0.170414,-0.044181,0.080996,0.081486,0.144269,0.036585,0.023327,0.079725,0.084476,0.467955,-0.113332,-0.000711,0.328939,0.084132,-0.032718,0.159419,-0.008103,0.252203,0.024728,0.343612,0.218948,0.055432,0.276555,0.167338,0.082535,0.138237,-0.007282,0.003837,0.081878,0.102581,0.026508,0.026095,-0.001083,0.076881,0.1461,0.33822
Lot.Area,-0.508866,0.770277,1.0,0.1357,-0.030886,0.032455,0.047469,0.088327,0.0573,0.068778,0.032756,0.080259,0.470873,-0.063744,-0.001245,0.361038,0.11022,0.006729,0.160336,0.032786,0.25884,-0.006082,0.344066,0.268877,0.009335,0.266812,0.187591,0.119167,0.117159,0.020167,0.005156,0.080393,0.085319,0.047702,0.018991,-0.027093,0.112475,0.136922,0.35826
Overall.Qual,0.038483,0.170414,0.1357,1.0,-0.096968,0.59579,0.569967,0.425652,0.097365,-0.079877,0.251323,0.35434,0.454833,0.138739,-0.062337,0.584954,0.154944,-0.047893,0.515968,0.264894,0.04813,-0.158386,0.368456,0.385523,0.572939,0.59466,0.375883,0.300423,0.45728,-0.191446,0.014498,0.007924,0.0284,-0.033378,0.035587,-0.022358,0.144902,0.292401,0.825515
Overall.Cond,-0.064628,-0.044181,-0.030886,-0.096968,1.0,-0.352895,0.052825,-0.148801,0.063654,0.090202,-0.052374,-0.015676,-0.142035,0.035141,0.008242,-0.127942,-0.024445,0.075232,-0.201157,-0.086025,-0.002435,-0.074627,-0.084782,-0.008977,-0.325443,-0.166332,0.003189,-0.017333,-0.140513,0.113101,0.04421,0.05288,-0.009016,0.03478,-0.007028,0.049616,-0.00727,-0.031216,-0.039234
Year.Built,0.030041,0.080996,0.032455,0.59579,-0.352895,1.0,0.614972,0.394687,0.163894,-0.057995,0.078063,0.212915,0.290046,-0.082136,-0.146551,0.261029,0.192912,-0.017536,0.456741,0.279545,-0.069325,-0.145581,0.095308,0.155599,0.829,0.523575,0.323211,0.304225,0.395498,-0.470074,0.013229,-0.068275,0.010582,-0.034134,0.006105,-0.021235,0.190106,0.03377,0.614506
Year.Remod.Add,0.044609,0.081486,0.047469,0.569967,0.052825,0.614972,1.0,0.226717,0.011449,-0.101118,0.158493,0.200774,0.224731,0.083599,-0.062176,0.334994,0.114196,-0.044615,0.454941,0.22311,-0.027324,-0.134359,0.191048,0.12306,0.652659,0.425506,0.218101,0.284107,0.397174,-0.271064,0.037072,-0.058234,-0.015634,-0.035868,0.023736,0.030779,0.104649,0.165445,0.587773
Mas.Vnr.Area,0.002414,0.144269,0.088327,0.425652,-0.148801,0.394687,0.226717,1.0,0.204007,-0.029886,0.092235,0.19369,0.359613,-0.048691,-0.082889,0.320152,0.1445,0.03894,0.266851,0.141798,0.059647,-0.060071,0.220114,0.26386,0.309443,0.376303,0.233974,0.146135,0.23371,-0.173289,0.036032,0.049272,0.008842,-0.000775,0.012318,-0.005762,0.154436,0.107453,0.439504
BsmtFin.SF.1,-0.014155,0.036585,0.0573,0.097365,0.063654,0.163894,0.011449,0.204007,1.0,0.187383,-0.254174,0.298568,0.184635,-0.192376,-0.063544,-0.014691,0.590853,0.119288,-0.087497,-0.010868,-0.122297,-0.142733,-0.109812,0.198323,0.054792,0.088442,0.144954,0.127653,0.010182,-0.1197,0.067808,0.094883,0.019576,0.029558,-0.018038,0.038351,0.622113,-0.047253,0.231433
BsmtFin.SF.2,-0.078304,0.023327,0.068778,-0.079877,0.090202,-0.057995,-0.101118,-0.029886,0.187383,1.0,-0.275182,0.079551,0.076548,-0.142087,-0.011689,-0.068727,0.174972,0.131662,-0.118952,-0.072151,-0.043371,-0.044745,-0.081327,0.072131,-0.099755,-0.044756,0.044531,0.036051,-0.069157,0.041673,-0.013887,0.074468,0.061569,-0.006793,-0.000947,0.016133,0.206423,-0.041752,-0.019518


Since `Overall.Qual` is highly correlated with the target (about 0.83 with the log-transformed `SalePrice`), we look for anomalies to this pattern, that is, houses with a low `Overall.Qual` but a high price, which can hurt our model's learning.

Now we remove outliers based on the target alone, relying on the roughly normal distribution of the target under *log1p*.

In [142]:
y_hist(y_train, 'target distribution')

Gráfico salvo em ./graphs/target distribution


<img src="./graphs/target distribution.png" alt="drawing" width="70%"/>

In [143]:
# Now we just remove the outliers found in the target from the dataset
print(f"Total rows before removing outliers: {X_train.shape[0]}")
X_train, y_train = remove_outliers(X_train, y_train)
print(f"Total rows after removing outliers: {X_train.shape[0]}")

Total rows before removing outliers: 2344
Total rows after removing outliers: 2298
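
The `remove_outliers` helper is defined in `utils` and its rule is not shown here. One plausible version, given the reliance on the normal shape of the `log1p` target, is a z-score cut on the log scale. The sketch below is hypothetical: the function name, the 3-standard-deviation cutoff, and the synthetic prices are assumptions.

```python
import numpy as np
import pandas as pd

def remove_outliers_sketch(X: pd.DataFrame, y: pd.Series, z_max: float = 3.0):
    """Hypothetical stand-in for utils.remove_outliers (assumed z-score rule)."""
    log_y = np.log1p(y)
    z = (log_y - log_y.mean()) / log_y.std()
    keep = z.abs() <= z_max
    # Filter features and target with the same mask so rows stay aligned.
    return X.loc[keep], y.loc[keep]

rng = np.random.default_rng(0)
y_demo = pd.Series(np.exp(rng.normal(12.0, 0.4, size=500)))  # log-normal prices
y_demo.iloc[0] = 1e9                                          # one absurd price
X_demo = pd.DataFrame({"Gr.Liv.Area": rng.uniform(500, 3000, size=500)})

X_clean, y_clean = remove_outliers_sketch(X_demo, y_demo)
print(f"Rows before: {len(y_demo)}, rows after: {len(y_clean)}")
```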


## Selecting the features

To select features we use a `RandomForestRegressor`, because the fitted model exposes the `.feature_importances_` attribute: one value between 0 and 1 per feature, summing to 1, that is, each feature's share of influence on the model's final decision.

From the feature-importance dataframe we take the 11 most influential numerical features together with the 4 most influential categorical ones, and use them in the training that follows.

First, to run the `RandomForest`, we need to build a data preprocessing pipeline:

In [144]:
num_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

c_log_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('log', FunctionTransformer(np.log1p, validate=False, feature_names_out='one-to-one')),
])

cat_pipe = Pipeline(steps=[
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

preprocessing_pipeline = ColumnTransformer(
    transformers = [
        ('num', num_pipe, numerical),
        ('clog', c_log_pipe, right_skewed),
        ('cat', cat_pipe, categorical)
    ],
    remainder='passthrough'
)

preprocessing_pipeline

This pipeline is simple.

For the numerical data we only fill in missing values with `SimpleImputer`. *Random Forests*, being an ensemble of *DecisionTrees*, inherit their *nonparametric* nature: they do not need a predetermined number of parameters, which makes it easier to adapt to the training data. Tree splits also depend only on the ordering of each feature's values, so no scaling is required.

For the features classified as *right_skewed* we additionally apply *log1p* on top of the standard numerical handling. This step is not needed for *Random Forests*, but we include it because we intend to use the same transformation in the model pipelines later.

The categorical data goes through a `OneHotEncoder`, which turns each categorical feature into numeric 0/1 columns.

In [147]:
rnd_clf = RandomForestRegressor(
    n_estimators=700,
    max_leaf_nodes=16,
    random_state=SEED,
    n_jobs=1
)

feature_selector = Pipeline(steps=[
    ('preprocessing', preprocessing_pipeline),
    ('regressor', rnd_clf),
])

feature_selector.fit(X_train, np.log1p(y_train))

In [None]:
importances = feature_selector.named_steps['regressor'].feature_importances_

feature_names = preprocessing_pipeline.get_feature_names_out()
feature_importances_df = pd.DataFrame(zip(feature_names, importances), columns=['Feature', 'Importance']) \
    .sort_values(by='Importance', ascending=False)

feature_importances_df = aggregate_categorical_importances(feature_importances_df.set_index('Feature'))

top4_cat_features = [x for x in feature_importances_df.index if 'cat__' in x][:4]
top11_num_features = [x for x in feature_importances_df.index if 'clog__' in x or 'num__' in x][:11]

top15_features = top4_cat_features + top11_num_features

X_feats = [feat.split('__')[1] for feat in top15_features]

feature_importances_df


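
The `aggregate_categorical_importances` helper also lives in `utils`. Since `OneHotEncoder` spreads one categorical feature over several columns (for example `cat__Bsmt.Qual_Gd` and `cat__Bsmt.Qual_TA`), the importances have to be summed back per original feature before ranking. The sketch below is a hypothetical version of that step; the function name and the assumption that category values contain no underscore are mine.

```python
import pandas as pd

def aggregate_categorical_importances_sketch(
    importances: pd.DataFrame, prefix: str = "cat__"
) -> pd.DataFrame:
    """Hypothetical stand-in: sum one-hot importances per original categorical feature."""
    def base_name(name: str) -> str:
        # 'cat__Bsmt.Qual_Gd' -> 'cat__Bsmt.Qual' (assumes no '_' inside the category value).
        if name.startswith(prefix):
            return name.rsplit("_", 1)[0]
        return name

    grouped = importances.groupby(importances.index.map(base_name))["Importance"].sum()
    return grouped.sort_values(ascending=False).to_frame()

demo = pd.DataFrame(
    {"Importance": [0.5, 0.2, 0.2, 0.1]},
    index=pd.Index(
        ["num__Overall.Qual", "cat__Bsmt.Qual_Gd", "cat__Bsmt.Qual_TA", "clog__Lot.Area"],
        name="Feature",
    ),
)
agg = aggregate_categorical_importances_sketch(demo)
print(agg)
```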

In [117]:
X_train = X_train.loc[:, X_feats]

X_train

Unnamed: 0,Bsmt.Qual,Garage.Type,Kitchen.Qual,MS.Zoning,Overall.Qual,Gr.Liv.Area,Garage.Cars,Total.Bsmt.SF,X1st.Flr.SF,Garage.Area,BsmtFin.SF.1,Year.Built,Lot.Area,Full.Bath,Year.Remod.Add
0,Gd,Attchd,Gd,RL,6,2683,2.0,930.0,1364,473.0,704.0,1990,9205,2,1991
1,Ex,Attchd,Ex,RL,9,1800,2.0,1800.0,1800,702.0,0.0,2007,11923,2,2007
2,TA,Detchd,TA,RM,5,789,1.0,789.0,789,250.0,104.0,1948,6000,1,1950
3,TA,Detchd,TA,RM,4,1656,2.0,801.0,1095,440.0,0.0,1940,7628,2,1985
4,TA,Attchd,TA,RL,6,1458,1.0,912.0,912,330.0,0.0,1948,17503,1,1950
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2293,TA,Detchd,TA,RM,6,1176,2.0,816.0,816,528.0,0.0,1934,6240,1,1950
2294,Gd,Attchd,Gd,RL,7,1061,2.0,1054.0,1061,462.0,460.0,1990,10778,1,1991
2295,Gd,Attchd,Gd,RL,8,1226,2.0,1226.0,1226,484.0,960.0,1981,3782,1,1981
2296,Gd,Attchd,TA,RL,6,1350,2.0,1064.0,1350,478.0,0.0,1974,10140,2,1974


## Part 3/4/5/6/...

1. Choose models (stacking methods included)
    - DummyRegressor
    - LinearRegression
    - Other basic models
        - Polynomial Features
        - Scalers
    - Advanced pipelines
        - Use KMeans as a source of new features in the pipeline
        - Ensemble methods
        
2. Set up GridSearchCV with hyperparameters

In [119]:
right_skewed, numerical, categorical = split_by_prefix(top15_features)

right_skewed, numerical, categorical

(['Gr.Liv.Area',
  'Total.Bsmt.SF',
  'X1st.Flr.SF',
  'Garage.Area',
  'BsmtFin.SF.1',
  'Lot.Area'],
 ['Overall.Qual', 'Garage.Cars', 'Year.Built', 'Full.Bath', 'Year.Remod.Add'],
 ['Bsmt.Qual', 'Garage.Type', 'Kitchen.Qual', 'MS.Zoning'])

In [25]:
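# 3) build the inner ColumnTransformer
from scipy.special import softmax

log_pipe = Pipeline([
    ("log1p",   FunctionTransformer(np.log1p, validate=False)),
    ("impute",  SimpleImputer(strategy="median")),
])

num_pipe = Pipeline([
    ("impute",  SimpleImputer(strategy="median")),
    # scaler will be overridden in grid
    ("scale",   StandardScaler()),
])

cat_pipe = Pipeline([
    ("impute",  SimpleImputer(strategy="constant", fill_value="MISSING")),
    ("ohe",     OneHotEncoder(handle_unknown="ignore", drop="first")),
])

base_preprocessor = ColumnTransformer([
    ("skewed",   log_pipe,  right_skewed),
    ("numeric",  num_pipe,  numerical),
    ("categorical", cat_pipe, categorical),
])

def soft_cluster_membership(distances):
    # Inside a Pipeline, KMeans passes on transform(), i.e. distances to each centroid.
    # Softmax over the negated distances exaggerates the closest cluster.
    return softmax(-distances, axis=1)

kmeans_branch = Pipeline([
    ("pre", base_preprocessor),
    ("cluster", KMeans(random_state=SEED)),
    # One-hot encoding continuous distances would create one column per distinct value,
    # so the distances are turned into soft memberships instead.
    ("softmax", FunctionTransformer(soft_cluster_membership)),
])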

In [26]:
# 4) wrap in a FeatureUnion so we can add KMeans & Poly branches
full_features = FeatureUnion([
    ("base", base_preprocessor),
    ("kmeans", kmeans_branch),
    ("poly", Pipeline([
        ("select_num", ColumnTransformer([
            ("num", "passthrough", numerical)
        ], remainder='drop')),
        ("poly", PolynomialFeatures(include_bias=False)),
    ])),
])

In [27]:
# 5) single master pipeline
pipe = Pipeline([
    ("features",  full_features),
    ("regressor", DummyRegressor()),  # placeholder
])

In [28]:
# 6) custom RMSE scorer
rmse = make_scorer(lambda y_true, y_pred: 
                   np.sqrt(mean_squared_error(y_true, y_pred)),
                   greater_is_better=False)
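
Because the search below is fit on `log1p(y)`, this scorer reports RMSE in log units, not in dollars. The sketch below, with made-up prices and predictions, shows both views: the error on the log scale and the error after mapping predictions back with `np.expm1`. scikit-learn also ships a built-in `"neg_root_mean_squared_error"` scoring string that could replace the custom scorer.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up prices; predictions are off by +10%, -10% and 0% and live on the log1p scale.
y_true = np.array([100_000.0, 200_000.0, 300_000.0])
y_pred_log = np.log1p(y_true * np.array([1.10, 0.90, 1.00]))

# RMSE as the scorer sees it (log units, roughly a relative error).
rmse_log = np.sqrt(mean_squared_error(np.log1p(y_true), y_pred_log))

# RMSE in price units, after undoing the log1p transform.
rmse_price = np.sqrt(mean_squared_error(y_true, np.expm1(y_pred_log)))

print(f"RMSE (log scale):   {rmse_log:.4f}")
print(f"RMSE (price scale): {rmse_price:.2f}")
```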

In [29]:
# 7) param_distributions as a list of dicts
param_distributions = [

    # ─────────── baseline regressors ───────────
    {
      "regressor": [DummyRegressor(), LinearRegression()],
      "features__kmeans__cluster__n_clusters": [1, 2, 3, 4, 5, 6],
      "features__poly__poly__degree": [1],   # no poly for baseline
      "features__base__numeric__scale": [StandardScaler(), MinMaxScaler()],
      # "features__base__categorical__impute__strategy": ['constant'],
      # "features__base__categorical__impute__fill_value": ['MISSING'],
    },

    # ─────────── Ridge & Lasso ───────────
    {
      "regressor": [Lasso(), Ridge()],
      "regressor__alpha": [0.1, 1, 10, 100],
      "features__kmeans__cluster__n_clusters": [1, 2, 3, 4, 5, 6],
      "features__poly__poly__degree": [1, 2],
      "features__base__numeric__scale": [StandardScaler(), MinMaxScaler()],
    },

    # ────────── ElasticNet ──────────
    {
      "regressor": [ElasticNet()],
      "regressor__alpha": [0.1, 1, 10, 100],
      "regressor__l1_ratio": [0.1, 0.5, 0.9],
      "features__kmeans__cluster__n_clusters": [1, 2, 3, 4, 5, 6],
      "features__poly__poly__degree": [1, 2],
      "features__base__numeric__scale": [None, StandardScaler(), MinMaxScaler()],
    },

    # ───────── RandomForest ─────────
    {
      "regressor": [RandomForestRegressor(random_state=42)],
      "regressor__n_estimators": [500, 700, 1000],
      "regressor__max_depth": [None, 10, 20, 30],
      "regressor__min_samples_split": [2, 5, 10],
      "regressor__bootstrap": [True, False],
      "features__kmeans__cluster__n_clusters": [1, 2, 3, 4, 5, 6],
      "features__poly__poly__degree": [1, 2, 3],
      "features__base__numeric__scale": [None, StandardScaler(), MinMaxScaler()],
    },

    # ─────── GradientBoosting ───────
    {
      "regressor": [GradientBoostingRegressor(random_state=42)],
      "regressor__n_estimators": [500, 700, 1000],
      "regressor__learning_rate": [0.01, 0.05, 0.1],
      "regressor__max_depth": [3, 5, 7],
      "regressor__subsample": [0.6, 0.8, 1.0],
      "features__kmeans__cluster__n_clusters": [1, 2, 3, 4, 5, 6],
      "features__poly__poly__degree": [1, 2, 3],
      "features__base__numeric__scale": [None, StandardScaler(), MinMaxScaler()],
    },

    # ─────────── XGBoost ───────────
    {
      "regressor": [xgb.XGBRegressor(random_state=42, objective="reg:squarederror")],
      "regressor__n_estimators": [500, 700, 1000],
      "regressor__learning_rate": [0.01, 0.05, 0.1],
      "regressor__max_depth": [3, 5, 7, 10],
      "regressor__subsample": [0.6, 0.8, 1.0],
      "regressor__colsample_bytree": [0.6, 0.8, 1.0],
      "regressor__reg_alpha": [0, 0.1, 1, 10],
      "regressor__reg_lambda": [1, 10, 100],
      "features__kmeans__cluster__n_clusters": [1, 2, 3, 4, 5, 6],
      "features__poly__poly__degree": [1, 2, 3],
      "features__base__numeric__scale": [None, StandardScaler(), MinMaxScaler()],
    },
]

In [30]:
y_train_log = np.log1p(y_train.copy())

In [31]:
# 8) wrap in RandomizedSearchCV
search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=50,                    # sample 50 of these combos
    scoring=rmse,
    cv=5,
    n_jobs=-1,
    random_state=42,
    verbose=2,
)

# 9) run it
search.fit(X_train, y_train_log)
print("Best RMSE:", -search.best_score_)
print("Best params:", search.best_params_)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


KeyboardInterrupt: 

## Part 5

Model selection with GridSearchCV

In [None]:
# Hyperparameters of a pipeline step must be prefixed with the step name ('regressor__').
param_grid = [{
    'regressor' : [LinearRegression(), DummyRegressor()],
}, {
    'regressor': [Lasso(), Ridge()],
    'regressor__alpha': [0.1, 1, 10, 100],
}, {
    'regressor': [ElasticNet()],
    'regressor__alpha': [0.1, 1, 10, 100],
    'regressor__l1_ratio': [0.1, 0.5, 0.9]
}, {
    'regressor': [RandomForestRegressor()],
    'regressor__n_estimators': [10, 50, 100],
    'regressor__max_depth': [None, 10, 20, 30],
    'regressor__min_samples_split': [2, 5, 10]
}, {
    'regressor': [GradientBoostingRegressor()],
    'regressor__n_estimators': [10, 50, 100],
    'regressor__learning_rate': [0.01, 0.1, 0.2],
    'regressor__max_depth': [3, 5, 7]
}]

grid = GridSearchCV(
    estimator=Pipeline(steps=[
        ('preprocessor', preprocessing_pipeline),
        ('regressor', RandomForestRegressor())
    ]),
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=5,
    verbose=1,
    n_jobs=-1
)