# Final Machine Learning Project
By: _Henrique Bucci_ and _Marcelo Alonso_

Data and information: https://www.kaggle.com/datasets/marcopale/housing/data

**Questions**  
- Can I remove the PID?
    - **A:** Yes.
- Can I create new columns computed from other columns before doing feature selection?
    - **A:** Yes.
- If I apply PolynomialFeatures to the data, do the generated terms also count toward the feature count?
    - **A:** Apply PolynomialFeatures after selecting the features.
- Can I use correlation in the exploratory analysis?
    - **A:** Yes, but it is "useless".
- Can I use clustering methods in the pipeline to include the cluster assignment as a new feature?
    - **A:** Apply a softmax to the KMeans output to exaggerate the closest class.
- Can I use a dimensionality reduction method (e.g. PCA) to help me choose the features?
    - **A:** Yes.
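
The softmax-over-KMeans answer above can be sketched as a small scikit-learn transformer. This is only an illustration under our own assumptions, not something taken from the project code: the class name `KMeansSoftmaxFeatures`, the `temperature` parameter, and the synthetic `make_blobs` data are all hypothetical.

```python
import numpy as np
from scipy.special import softmax
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


class KMeansSoftmaxFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: soft cluster memberships from KMeans distances."""

    def __init__(self, n_clusters=3, temperature=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.temperature = temperature
        self.random_state = random_state

    def fit(self, X, y=None):
        self.kmeans_ = KMeans(
            n_clusters=self.n_clusters, n_init=10, random_state=self.random_state
        ).fit(X)
        return self

    def transform(self, X):
        # KMeans.transform returns the distance from each row to each centroid.
        distances = self.kmeans_.transform(X)
        # Negating the distances makes the closest centroid get the largest weight;
        # a smaller temperature exaggerates that weight further.
        return softmax(-distances / self.temperature, axis=1)


# Synthetic data, only to show the output shape (an assumption, not the housing data).
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
memberships = KMeansSoftmaxFeatures(n_clusters=3, random_state=0).fit_transform(X_demo)
print(memberships[:3].round(3))
```

Each row of `memberships` sums to 1, so the columns can be appended to the other features as soft cluster indicators.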


To try: stacking. Train several models, then train a final model on the predictions of those models.
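
A minimal sketch of the stacking idea, using scikit-learn's `StackingRegressor`. The choice of base models, the final estimator, and the synthetic `make_regression` data are assumptions made for illustration; the notebook does not say which models would be stacked.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumption), standing in for the housing data.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

stack = StackingRegressor(
    estimators=[
        ('ridge', Ridge(alpha=1.0)),
        ('forest', RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    # The final model is trained on out-of-fold predictions of the base models.
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
stack_r2 = stack.score(X_te, y_te)
print(f"Stacked R^2 on held-out data: {stack_r2:.3f}")
```

Because the final estimator only sees out-of-fold predictions, the base models do not leak their own training fit into the second level.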

#### NOTES
Use LASSO for feature selection.
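
One way this note could be put into practice is `SelectFromModel` wrapped around a `Lasso`. The sketch below is an assumption about how the selection might look, run on synthetic data; the `alpha` value is arbitrary and would need tuning on the real dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 20 synthetic features, only 5 of which carry signal (assumption for the demo).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=0
)

# Scale first: the L1 penalty treats all coefficients alike, so features should share a scale.
selector = make_pipeline(StandardScaler(), SelectFromModel(Lasso(alpha=1.0)))
X_selected = selector.fit_transform(X_demo, y_demo)

kept = selector[-1].get_support()  # boolean mask over the original columns
print(f"Kept {kept.sum()} of {kept.size} features")
```

Features whose Lasso coefficient is shrunk to (near) zero are dropped; a larger `alpha` drops more of them.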

For a linear regression that ignores outliers, use RANSAC: it fits the model on random subsets of the data and keeps the fit with the most inliers.
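
A small sketch of the RANSAC note, comparing it with ordinary least squares on data with injected outliers. The data is synthetic and the true line (`y = 3x + 2`) is our own assumption, chosen so the effect is easy to see.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=200)
y[:20] += 100.0  # inject gross outliers into 10% of the rows

X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)
# RANSAC repeatedly fits on small random subsets and keeps the largest consensus set.
ransac = RANSACRegressor(LinearRegression(), random_state=0).fit(X, y)

ransac_slope = ransac.estimator_.coef_[0]
print(f"OLS intercept:    {ols.intercept_:.2f}")
print(f"RANSAC intercept: {ransac.estimator_.intercept_:.2f}")
print(f"Rows kept as inliers: {ransac.inlier_mask_.sum()} of {len(y)}")
```

The final RANSAC model is refit on the inliers only, which is why its coefficients stay close to the line that generated the clean points.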

## Step 0

In this step, we will:
- Import the libraries
- Load the data
- Check, by reading only the column descriptions, whether any columns do not belong in the final dataset (such as an ID or another arbitrary identifier).
- Split the dataset into train and test sets

### Libraries

In [None]:
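import numpy as np  # needed below for np.log1p
import pandas as pd
from datetime import datetime
from utils import *  # project helpers used below (load_data, plot_Xtrain, remove_outliers, ...)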


from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import LinearSVR, SVR

### Constants

In [96]:
SEED = 420

### Loading and Preprocessing the Data

In [97]:
dataset = load_data()
dataset.head()

Unnamed: 0,Order,PID,MS.SubClass,MS.Zoning,Lot.Frontage,Lot.Area,Street,Alley,Lot.Shape,Land.Contour,...,Pool.Area,Pool.QC,Fence,Misc.Feature,Misc.Val,Mo.Sold,Yr.Sold,Sale.Type,Sale.Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [98]:
"""
Duplicates here would not be intentional, since each house should be unique.
So we remove them.
"""
print(f"Total rows before removing duplicates: {dataset.shape[0]}")
dataset.drop_duplicates(inplace=True)  # Remove duplicates
print(f"Total rows after removing duplicates: {dataset.shape[0]}")

Total rows before removing duplicates: 2930
Total rows after removing duplicates: 2930


In [99]:
"""
The first column is the observation ID (Order) and the second is an identifier (PID). Both are arbitrary values, so we can drop them.
"""
dataset = dataset.iloc[:, 2:]  # Drop the first two columns: Order and PID (parcel identification number)

### Creating new features

Analyzing the features as described above, we saw room for new features that may help when modeling the data:
- **Tot.Lot.Area** : `Lot.Frontage + Lot.Area`
- **Bsmt.Tot.Bath** : `Bsmt.Full.Bath + 0.5*Bsmt.Half.Bath`
- **Garage.Area/Cars** : `Garage.Area / Garage.Cars`
- **Tot.Porch.SF** : `Open.Porch.SF + Enclosed.Porch + X3Ssn.Porch + Screen.Porch`
- **Date.Sold** : `timestamp(Mo.Sold, Yr.Sold)`, taking the first day of the month

In [100]:
dataset.loc[:, 'Tot.Lot.Area'] = dataset.loc[:, 'Lot.Frontage'] + dataset.loc[:, 'Lot.Area']
dataset.loc[:, 'Bsmt.Tot.Bath'] = dataset.loc[:, 'Bsmt.Full.Bath'] + 0.5*dataset.loc[:, 'Bsmt.Half.Bath']
dataset.loc[:, 'Garage.Area/Cars'] = dataset.loc[:, 'Garage.Area'] / dataset.loc[:, 'Garage.Cars']
dataset.loc[:, 'Tot.Porch.SF'] = dataset.loc[:, 'Open.Porch.SF'] + dataset.loc[:, 'X3Ssn.Porch'] + dataset.loc[:, 'Enclosed.Porch'] + dataset.loc[:, 'Screen.Porch']
dataset.loc[:, 'Date.Sold'] = pd.to_datetime(dict(year=dataset['Yr.Sold'], month=dataset['Mo.Sold'], day=1)).apply(lambda x: x.timestamp())
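
A short aside on the `Garage.Area / Garage.Cars` ratio: in pandas, dividing by zero does not raise, it yields `NaN` (for `0/0`) or `inf` (for a non-zero numerator). The toy frame below is made up for illustration. Whether infinities actually occur in the real data is not something the notebook shows, so the `replace` step is a defensive assumption.

```python
import numpy as np
import pandas as pd

# Toy rows (assumption): a normal garage, no garage at all, and an inconsistent record.
demo = pd.DataFrame({'Garage.Area': [480, 0, 300], 'Garage.Cars': [2, 0, 0]})

ratio = demo['Garage.Area'] / demo['Garage.Cars']
print(ratio.tolist())

# Turn infinities into NaN so a later imputer can handle them like other missing values.
ratio_clean = ratio.replace([np.inf, -np.inf], np.nan)
print(ratio_clean.isna().sum())
```

Houses without a garage end up as missing values in the new column, which the imputation step later in the pipeline has to fill.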

### Train-Test Split

- From now on we use only the training set; the test partition is treated as if it did not exist yet.
- The full dataset is split 80/20, since we have little data (2930 rows in total).
- Because this is not a time series, the split can be random.

In [101]:
X, y = dataset.drop('SalePrice', axis=1), dataset.loc[:, 'SalePrice']

In [102]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

## Step 1

### Exploratory Analysis

In this part we take a global look at the data, only to check its integrity.  
We want to understand which columns are the features and which is the target, what their types are, and to look for:
- Null values
- Duplicated rows
- Outliers
- Spikes
- Gross errors

We also look at the distribution and overall "shape" of each variable.
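
Some of the checks listed above can be sketched with plain pandas. The toy frame and its values are invented for the example. Reading "spike" as "a single value that dominates a column" is our interpretation, not a definition given in the notebook.

```python
import pandas as pd

# Toy data (assumption), using column names from the dataset for readability.
demo = pd.DataFrame({
    'Lot.Frontage': [80.0, None, 60.0, 80.0, None, 75.0],
    'Pool.Area': [0, 0, 0, 0, 0, 512],
    'Lot.Area': [9000, 9500, 8000, 9000, 9700, 12000],
})

# Null values: fraction of missing entries per column.
null_fraction = demo.isna().mean()

# Duplicated rows: rows identical to an earlier row.
n_duplicates = int(demo.duplicated().sum())

# Spikes: share of the single most frequent value in a column.
top_share = demo['Pool.Area'].value_counts(normalize=True).iloc[0]

print(null_fraction)
print(f"Duplicated rows: {n_duplicates}")
print(f"Most frequent Pool.Area value covers {top_share:.0%} of rows")
```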

In [103]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 85 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   MS.SubClass       2930 non-null   int64  
 1   MS.Zoning         2930 non-null   object 
 2   Lot.Frontage      2440 non-null   float64
 3   Lot.Area          2930 non-null   int64  
 4   Street            2930 non-null   object 
 5   Alley             198 non-null    object 
 6   Lot.Shape         2930 non-null   object 
 7   Land.Contour      2930 non-null   object 
 8   Utilities         2930 non-null   object 
 9   Lot.Config        2930 non-null   object 
 10  Land.Slope        2930 non-null   object 
 11  Neighborhood      2930 non-null   object 
 12  Condition.1       2930 non-null   object 
 13  Condition.2       2930 non-null   object 
 14  Bldg.Type         2930 non-null   object 
 15  House.Style       2930 non-null   object 
 16  Overall.Qual      2930 non-null   int64  


In [104]:
dataset.describe(include='all')

Unnamed: 0,MS.SubClass,MS.Zoning,Lot.Frontage,Lot.Area,Street,Alley,Lot.Shape,Land.Contour,Utilities,Lot.Config,...,Mo.Sold,Yr.Sold,Sale.Type,Sale.Condition,SalePrice,Tot.Lot.Area,Bsmt.Tot.Bath,Garage.Area/Cars,Tot.Porch.SF,Date.Sold
count,2930.0,2930,2440.0,2930.0,2930,198,2930,2930,2930,2930,...,2930.0,2930.0,2930,2930,2930.0,2440.0,2928.0,2772.0,2930.0,2930.0
unique,,7,,,2,2,4,4,3,5,...,,,10,6,,,,,,
top,,RL,,,Pave,Grvl,Reg,Lvl,AllPub,Inside,...,,,WD,Normal,,,,,,
freq,,2273,,,2918,120,1859,2633,2927,2140,...,,,2536,2413,,,,,,
mean,57.387372,,69.22459,10147.921843,,,,,,,...,6.216041,2007.790444,,,180796.060068,9778.747541,0.461919,272.233574,89.13959,1206200000.0
std,42.638025,,23.365335,7880.017759,,,,,,,...,2.714492,1.316613,,,79886.692357,6435.636376,0.520835,61.659127,107.734138,41058350.0
min,20.0,,21.0,1300.0,,,,,,,...,1.0,2006.0,,,12789.0,1324.0,0.0,100.0,0.0,1136074000.0
25%,20.0,,58.0,7440.25,,,,,,,...,4.0,2007.0,,,129500.0,7283.75,0.0,235.5,0.0,1172707000.0
50%,50.0,,68.0,9436.5,,,,,,,...,6.0,2008.0,,,160000.0,9331.5,0.0,264.0,50.0,1207008000.0
75%,70.0,,80.0,11555.25,,,,,,,...,8.0,2009.0,,,213500.0,11287.25,1.0,296.0,136.0,1243814000.0


In [105]:
plot_Xtrain(X_train, 'x_train_original.png')

Plot saved to ./graphs/x_train_original.png


In [106]:
# Now we just remove the outliers found in the dataset's target
X_train, y_train = remove_outliers(X_train, y_train)
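
`remove_outliers` comes from the project's `utils` module and its rule is not shown in this notebook. Purely as an illustration of what a target-based filter could look like, here is a hypothetical helper that drops rows whose target falls outside the usual 1.5×IQR fences. The function name, the rule, and the toy data are all assumptions.

```python
import pandas as pd


def drop_target_outliers(X, y, k=1.5):
    """Hypothetical helper: keep rows whose target lies within the IQR fences."""
    q1, q3 = y.quantile(0.25), y.quantile(0.75)
    iqr = q3 - q1
    keep = y.between(q1 - k * iqr, q3 + k * iqr)
    # Filter X and y with the same mask so their indices stay aligned.
    return X.loc[keep], y.loc[keep]


# Toy data (assumption): the last sale price is far above the rest.
X_demo = pd.DataFrame({'Gr.Liv.Area': [900, 1100, 1300, 1500, 1700, 5000]})
y_demo = pd.Series([110000, 130000, 150000, 170000, 190000, 755000])

X_kept, y_kept = drop_target_outliers(X_demo, y_demo)
print(len(y_kept), "rows kept out of", len(y_demo))
```

Whatever rule is used, applying it only to the training partition keeps the test set untouched, in line with the split made earlier.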

In [107]:
# Right-skewed columns ("cauda direita" = right tail); these get a log1p transform
cauda_direita = ['Lot.Frontage', 'Lot.Area', 'Mas.Vnr.Area', 'BsmtFin.SF.1', 'BsmtFin.SF.2',
                 'Bsmt.Unf.SF', 'Total.Bsmt.SF', 'X1st.Flr.SF', 'X2nd.Flr.SF', 'Gr.Liv.Area', 
                 'Garage.Area', 'Wood.Deck.SF', 'Open.Porch.SF', 'Enclosed.Porch', 
                 'Screen.Porch', 'X3Ssn.Porch', 'Tot.Lot.Area', 'Garage.Area/Cars', 'Tot.Porch.SF']

categorical = X.select_dtypes(include='object').columns.tolist()
numerical = X.select_dtypes(include='number').columns.tolist()

In [108]:
X_train_log = X_train.copy()
X_train_log[cauda_direita] = np.log1p(X_train_log[cauda_direita])

# Plot the log-transformed copy, not the raw X_train
plot_Xtrain(X_train_log[numerical], 'x_train_log.png')


Plot saved to ./graphs/x_train_log.png
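
To illustrate why `np.log1p` is applied to the right-tailed columns, the sketch below measures skewness before and after the transform on a synthetic lognormal sample. The sample stands in for a column such as `Lot.Area`; its parameters are assumptions, not estimates from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed sample (assumption).
lot_area = pd.Series(rng.lognormal(mean=9.0, sigma=0.8, size=2000))

skew_before = lot_area.skew()
skew_after = np.log1p(lot_area).skew()
print(f"Skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")

# log1p(x) = log(1 + x) is defined at x = 0 and is undone by expm1.
roundtrip = np.expm1(np.log1p(lot_area))
```

`log1p` rather than `log` matters here because several of the listed columns (porch and basement areas, for example) contain zeros.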


In [109]:
num_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

c_log_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('log', FunctionTransformer(np.log1p, validate=False, feature_names_out='one-to-one')),
])

cat_pipe = Pipeline(steps=[
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

preprocessing_pipeline = ColumnTransformer(
    transformers = [
        ('num', num_pipe, [col for col in numerical if col not in cauda_direita]),
        ('clog', c_log_pipe, cauda_direita),
        ('cat', cat_pipe, categorical)
    ],
    remainder='passthrough'
)

preprocessing_pipeline

In [110]:
X_train_transformed = preprocessing_pipeline.fit_transform(X_train)

In [111]:
rnd_clf = RandomForestRegressor(
    n_estimators=700,
    max_leaf_nodes=16,
    random_state=SEED,
    n_jobs=1
)

In [112]:
rnd_clf.fit(X_train_transformed, y_train)

In [113]:
importances = rnd_clf.feature_importances_

feature_names = preprocessing_pipeline.get_feature_names_out()
feature_importances_df = pd.DataFrame(zip(feature_names, importances), columns=['Feature', 'Importance']) \
    .sort_values(by='Importance', ascending=False)

feature_importances_df = aggregate_categorical_importances(feature_importances_df.set_index('Feature'))


X_feats = [feat.split('__')[1] for feat in feature_importances_df[:15].index]

feature_importances_df[:15]

Unnamed: 0,Aggregated_Importance
num__Overall.Qual,0.69528
clog__Gr.Liv.Area,0.130847
clog__Total.Bsmt.SF,0.033933
num__Garage.Cars,0.033929
clog__Garage.Area,0.028159
clog__X1st.Flr.SF,0.014843
clog__BsmtFin.SF.1,0.011778
num__Year.Built,0.011039
clog__Lot.Area,0.006394
num__Year.Remod.Add,0.004653


In [114]:
X_train = X_train[X_feats]

## Part 3/4/5/6/...

1. Choose models (stacking methods included)
    - DummyRegressor
    - LinearRegression
    - Other basic models
        - Polynomial Features
        - Scalers
    - Advanced pipelines
        - Use KMeans as a source of new features in the pipeline
        - Ensemble methods
        
2. Set up GridSearchCV with hyperparameters

In [115]:
X_cat = X_train.select_dtypes(include='object').columns.tolist()
X_num = X_train.select_dtypes(include='number').columns.tolist()
X_log = [feat for feat in X_num if feat in cauda_direita]

In [116]:
log_transformer = FunctionTransformer(np.log1p, validate=False, feature_names_out='one-to-one')

In [None]:
cat_pipe = Pipeline(steps=[
    # Impute first, so the encoder never receives missing categories
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

num_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

log_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('log', log_transformer)
])

preprocessing_pipeline = ColumnTransformer(
    transformers = [
        ('num', num_pipe, X_num),
        # ('log', log_pipe, X_log ),
        ('cat', cat_pipe, X_cat)
    ],
    remainder='passthrough'
)

preprocessing_pipeline

In [127]:
def preprocess(reg, scaling=False, poly=False):
    """Wrap `reg` in the shared preprocessing, with optional scaling and degree-2 polynomial features."""
    return Pipeline(steps=[
        ('preprocessing', preprocessing_pipeline),
        ('scaler', StandardScaler() if scaling else 'passthrough'),
        ('poly', PolynomialFeatures(degree=2, include_bias=False) if poly else 'passthrough'),
        ('regressor', reg)
    ])

In [133]:
param_grid = [{
    'regressor' : [
        preprocess(LinearRegression(), scaling=True, poly=True), 
        DummyRegressor()
    ],

}, {
    'regressor': [
        preprocess(Lasso(), scaling=True, poly=True),
        preprocess(Ridge(), scaling=True, poly=True)
    ],

    'regressor__regressor__alpha': [0.1, 1, 10, 100],
}, {
    'regressor': [
        preprocess(ElasticNet(), scaling=True, poly=True)
    ],

    'regressor__regressor__alpha': [0.1, 1, 10, 100],
    'regressor__regressor__l1_ratio': [0.1, 0.5, 0.9]
}, {
    'regressor': [
        preprocess(RandomForestRegressor())
    ],

    'regressor__regressor__n_estimators': [10, 50, 100],
    'regressor__regressor__max_depth': [None, 10, 20, 30],
    'regressor__regressor__min_samples_split': [2, 5, 10]
}, {
    'regressor': [
        preprocess(GradientBoostingRegressor(), scaling=True, poly=True)
    ],

    'regressor__regressor__n_estimators': [10, 50, 100],
    'regressor__regressor__learning_rate': [0.01, 0.1, 0.2],
    'regressor__regressor__max_depth': [3, 5, 7]
}]

model = Pipeline(steps=[
    ('regressor', LinearRegression())
])

gd_cv = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=5,
    return_train_score=True,
    n_jobs=-1,
    verbose=1
)

gd_cv

In [None]:
gd_cv.fit(X_train, y_train)
best_model = gd_cv.best_estimator_


Fitting 5 folds for each of 85 candidates, totalling 425 fits
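
The "85 candidates" in the log can be checked by counting the combinations in the grid with `ParameterGrid`. In the sketch below the estimators are replaced by placeholder strings and the `regressor__regressor__` prefixes are dropped, since only the number of values per key matters for the count.

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as the grid above; the estimator entries are placeholder strings.
grid_shape = [
    {'regressor': ['linear', 'dummy']},
    {'regressor': ['lasso', 'ridge'], 'alpha': [0.1, 1, 10, 100]},
    {'regressor': ['elasticnet'], 'alpha': [0.1, 1, 10, 100], 'l1_ratio': [0.1, 0.5, 0.9]},
    {'regressor': ['forest'], 'n_estimators': [10, 50, 100],
     'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 5, 10]},
    {'regressor': ['boosting'], 'n_estimators': [10, 50, 100],
     'learning_rate': [0.01, 0.1, 0.2], 'max_depth': [3, 5, 7]},
]

n_candidates = len(ParameterGrid(grid_shape))  # 2 + 8 + 12 + 36 + 27
n_fits = n_candidates * 5                      # cv=5 folds per candidate
print(n_candidates, n_fits)
```

Counting candidates this way before calling `fit` gives a quick estimate of how long the search will take.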


## Part 5

In [153]:
import pandas as pd

df = pd.DataFrame(gd_cv.cv_results_) \
    .sort_values(by='rank_test_score') \
    .loc[:, ['params', 'mean_train_score', 'mean_test_score', 'std_test_score']] \
    .assign(
        # Scores are negated MSE; flip the sign so the table shows plain MSE
        mean_test_score=lambda d: -d['mean_test_score'],
        mean_train_score=lambda d: -d['mean_train_score'],
    )


In [156]:
df

Unnamed: 0,params,mean_train_score,mean_test_score,std_test_score
45,{'regressor': (ColumnTransformer(remainder='pa...,0.003922,0.019668,0.002586
24,{'regressor': (ColumnTransformer(remainder='pa...,0.002829,0.0197,0.002575
41,{'regressor': (ColumnTransformer(remainder='pa...,0.002943,0.019733,0.002504
72,{'regressor': (ColumnTransformer(remainder='pa...,0.002288,0.019744,0.003335
69,{'regressor': (ColumnTransformer(remainder='pa...,0.007631,0.019793,0.003087
27,{'regressor': (ColumnTransformer(remainder='pa...,0.003949,0.019796,0.002699
36,{'regressor': (ColumnTransformer(remainder='pa...,0.005592,0.019798,0.002364
71,{'regressor': (ColumnTransformer(remainder='pa...,0.004021,0.019816,0.00327
51,{'regressor': (ColumnTransformer(remainder='pa...,0.002802,0.019837,0.002556
54,{'regressor': (ColumnTransformer(remainder='pa...,0.003924,0.019951,0.002648
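
The values in the table are mean squared errors. To read them in the same units as the target the model was fit on, one can take the square root. The sketch below does that for three of the `mean_test_score` values shown above, written with the negative sign that `neg_mean_squared_error` produces; the small frame is a stand-in for `gd_cv.cv_results_`.

```python
import numpy as np
import pandas as pd

# Three scores copied from the table above, in GridSearchCV's negated form.
cv_results = pd.DataFrame({'mean_test_score': [-0.019668, -0.019700, -0.019951]})

# Negate to get MSE, then take the square root to get an RMSE-like value.
cv_results['rmse'] = np.sqrt(-cv_results['mean_test_score'])
print(cv_results)
```

This is the root of the mean MSE across folds, which is not exactly the mean of the per-fold RMSEs. If the latter is wanted, scikit-learn also accepts `scoring='neg_root_mean_squared_error'`.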


Model selection with GridSearchCV

In [20]:
# Hyperparameters of a pipeline step must be prefixed with the step name ('regressor__')
param_grid = [{
    'regressor' : [LinearRegression(), DummyRegressor()],
}, {
    'regressor': [Lasso(), Ridge()],
    'regressor__alpha': [0.1, 1, 10, 100],
}, {
    'regressor': [ElasticNet()],
    'regressor__alpha': [0.1, 1, 10, 100],
    'regressor__l1_ratio': [0.1, 0.5, 0.9]
}, {
    'regressor': [RandomForestRegressor()],
    'regressor__n_estimators': [10, 50, 100],
    'regressor__max_depth': [None, 10, 20, 30],
    'regressor__min_samples_split': [2, 5, 10]
}, {
    'regressor': [GradientBoostingRegressor()],
    'regressor__n_estimators': [10, 50, 100],
    'regressor__learning_rate': [0.01, 0.1, 0.2],
    'regressor__max_depth': [3, 5, 7]
}]