## Regression Project

Student: Júlia Ferreira de Paiva

In [74]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import ElasticNet, Lasso, Ridge, LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
import matplotlib.pyplot as plt
import numpy as np

## Model comparison, hyperparameter tuning, performance analysis, and feature importance

In [75]:
data = pd.read_csv('data/ames_reformado.csv')

data.head()

Unnamed: 0,MS.SubClass,MS.Zoning,Lot.Frontage,Lot.Area,Lot.Shape,Land.Contour,Lot.Config,Land.Slope,Neighborhood,Bldg.Type,...,SalePrice,Condition,HasShed,HasAlley,Exterior,Garage_Quality,Exterior_Quality,Garage.Age,Remod.Age,House.Age
0,20,RL,141.0,31770.0,IR1,Lvl,Corner,Gtl,NAmes,1Fam,...,5.332438,Norm,False,False,BrkFace,TA_TA,TA_TA,50.0,50.0,50.0
1,20,RH,80.0,11622.0,Reg,Lvl,Inside,Gtl,NAmes,1Fam,...,5.021189,Roads,False,False,VinylSd,TA_TA,TA_TA,49.0,49.0,49.0
2,20,RL,81.0,14267.0,IR1,Lvl,Corner,Gtl,NAmes,1Fam,...,5.235528,Norm,False,False,Wd Sdng,TA_TA,TA_TA,52.0,52.0,52.0
3,20,RL,93.0,11160.0,Reg,Lvl,Corner,Gtl,NAmes,1Fam,...,5.38739,Norm,False,False,BrkFace,TA_TA,Gd_TA,42.0,42.0,42.0
4,60,RL,74.0,13830.0,IR1,Lvl,Inside,Gtl,Gilbert,1Fam,...,5.278525,Norm,False,False,VinylSd,TA_TA,TA_TA,13.0,12.0,13.0


In [76]:
df = data.copy()

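# Split categorical columns into ordered (ordinal) and unordered (nominal).
# Note: pd.read_csv does not restore pandas 'category' dtypes, so unless the
# columns are cast beforehand both lists stay empty and every text column is
# one-hot encoded by pd.get_dummies in the next cell.
categorical_columns = []
ordinal_columns = []
for col in df.select_dtypes('category').columns:
    if df[col].cat.ordered:
        ordinal_columns.append(col)
    else:
        categorical_columns.append(col)

# Replace each ordinal column with integer codes.
for col in ordinal_columns:
    codes, _ = pd.factorize(data[col], sort=True)
    df[col] = codes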
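
The cell above relies on columns having the pandas `category` dtype. A CSV file stores only text, so that dtype has to be restored after loading. The following is a minimal sketch of one way to do it; the column names (`quality`, `zone`) and the category order are hypothetical and are not taken from `ames_reformado.csv`.

```python
import io

import pandas as pd

# Hypothetical frame with one ordered and one unordered categorical column.
raw = pd.DataFrame({
    "quality": pd.Categorical(["Gd", "TA", "Ex"],
                              categories=["TA", "Gd", "Ex"], ordered=True),
    "zone": pd.Categorical(["RL", "RH", "RL"]),
})

# A CSV round-trip keeps the values but drops the categorical dtype.
loaded = pd.read_csv(io.StringIO(raw.to_csv(index=False)))
n_category_after_load = len(loaded.select_dtypes("category").columns)

# Restore the dtypes explicitly (the category order is an assumption).
loaded["quality"] = pd.Categorical(loaded["quality"],
                                   categories=["TA", "Gd", "Ex"], ordered=True)
loaded["zone"] = loaded["zone"].astype("category")

category_cols = loaded.select_dtypes("category").columns
ordinal = [c for c in category_cols if loaded[c].cat.ordered]
nominal = [c for c in category_cols if not loaded[c].cat.ordered]

# Integer codes follow the declared category order: TA=0, Gd=1, Ex=2.
codes = loaded["quality"].cat.codes.tolist()
print(n_category_after_load, ordinal, nominal, codes)
```

With the dtypes restored, the ordinal loop would encode `quality` as integers and only `zone` would be one-hot encoded.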

In [77]:
df = pd.get_dummies(df, drop_first=True)

for cat in categorical_columns:
    dummies = []
    for col in df.columns:
        if col.startswith(cat + "_"):
            dummies.append(f'"{col}"')
    dummies_str = ', '.join(dummies)
    print(f'From column "{cat}" we made {dummies_str}\n')

In [78]:
X = df.drop(columns=['SalePrice']).copy().values
y = df['SalePrice'].copy().values

In [79]:
RANDOM_SEED = 42  

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=RANDOM_SEED,
)

In [129]:
param_grid = {
    'poly_features__degree': [1, 2],
}

pipe = Pipeline([
    ("poly_features", PolynomialFeatures(include_bias=False)),
    ("std_scaler", StandardScaler()),
    ("lin_reg", LinearRegression()),
])

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# GridSearchCV refits the best configuration on the full training set
# (refit=True by default), so best_estimator_ is ready to predict.
melhor_modelo = grid_search.best_estimator_

y_pred = melhor_modelo.predict(X_test)
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))

print('Best model without regularization:')
print(f'Polynomial degree: {grid_search.best_params_["poly_features__degree"]}')
print(f'RMSE: {RMSE}')

# Rank features by absolute coefficient (the inputs are standardized).
# Note: these names match the coefficients only when degree == 1; with
# degree == 2 the pipeline produces more columns than there are names.
coeficientes = melhor_modelo.named_steps['lin_reg'].coef_
nomes_features = df.drop(columns=['SalePrice']).columns
importancia_features = [abs(coeficiente) for coeficiente in coeficientes]
feature_importance = list(zip(nomes_features, importancia_features))
feature_importance.sort(key=lambda x: x[1], reverse=True)

print("Feature importance:")
for feature, importance in feature_importance:
    print(f"{feature}: {importance}")


Best model without regularization:
Polynomial degree: 1
RMSE: 2502763457.9405923
Feature importance:
Exterior_Quality_Gd_TA: 32589875473.354286
Exter.Cond_Gd: 27182624429.27018
Gr.Liv.Area: 26765314745.622307
X2nd.Flr.SF: 23393008407.10706
X1st.Flr.SF: 20850336398.229965
Exter.Cond_TA: 18953274388.633373
Exter.Qual_TA: 17243819507.321915
Exterior_Quality_Gd_Gd: 15416774840.507748
Exterior_Quality_TA_Gd: 14372067123.528137
Bldg.Type_Duplex: 13387585029.999744
MS.SubClass_90: 13387585029.992704
Exterior_Quality_Fa_TA: 12307578962.55938
Exter.Qual_Fa: 11880600507.457817
Exterior_Quality_TA_TA: 11109255433.48717
Exterior_Quality_Ex_TA: 10965376037.162666
Exterior_Quality_Fa_Fa: 8171648485.647028
Exter.Qual_Gd: 5730071279.317346
Exterior_Quality_Fa_Gd: 5707636971.302155
Exter.Cond_Fa: 5278310903.00754
Exterior_Quality_Ex_Gd: 3362756805.96896
BsmtFin.SF.1: 2538923799.3522816
Bsmt.Unf.SF: 2529858719.434396
Total.Bsmt.SF: 2466649673.2380867
Low.Qual.Fin.SF: 2145618800.5617077
...
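
With the default `refit=True`, `GridSearchCV` retrains the winning configuration on the whole training set, so `best_estimator_` can predict right away. `best_score_` also gives a cross-validated error estimate that does not touch the test set. The sketch below illustrates both points on synthetic data; `X_demo`, `y_demo`, and the small alpha grid are assumptions made for the example, not the notebook's data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

demo_pipe = Pipeline([
    ("std_scaler", StandardScaler()),
    ("lin_reg", Ridge()),
])
demo_search = GridSearchCV(
    demo_pipe,
    {"lin_reg__alpha": [0.1, 1, 10]},
    cv=5,
    scoring="neg_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

# The best estimator is already fitted: it has coefficients and can predict.
fitted_coefs = demo_search.best_estimator_.named_steps["lin_reg"].coef_
demo_pred = demo_search.best_estimator_.predict(X_demo)

# best_score_ is the mean negative MSE across folds; convert it to an RMSE.
cv_rmse = float(np.sqrt(-demo_search.best_score_))
print(demo_search.best_params_, cv_rmse)
```

Comparing `cv_rmse` across model families keeps the test split for a single final check.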

In [107]:

param_grid = {
    'poly_features__degree': [1, 2],
    'lin_reg__alpha': [0.1, 1, 10, 100],
}

pipe = Pipeline([
    ("poly_features", PolynomialFeatures(include_bias=False)),
    ("std_scaler", StandardScaler()),
    ("lin_reg", Ridge()),
])

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# best_estimator_ is already refit on the full training set (refit=True).
melhor_modelo = grid_search.best_estimator_

y_pred = melhor_modelo.predict(X_test)
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))

print('Best model with Ridge regularization:')
print(f'Polynomial degree: {grid_search.best_params_["poly_features__degree"]} with alpha: {grid_search.best_params_["lin_reg__alpha"]}')
print(f'RMSE: {RMSE}')

# Rank features by absolute coefficient (names are valid for degree == 1 only).
coeficientes = melhor_modelo.named_steps['lin_reg'].coef_
nomes_features = df.drop(columns=['SalePrice']).columns
importancia_features = [abs(coeficiente) for coeficiente in coeficientes]
feature_importance = list(zip(nomes_features, importancia_features))
feature_importance.sort(key=lambda x: x[1], reverse=True)

print("Feature importance:")
for feature, importance in feature_importance:
    print(f"{feature}: {importance}")


Best model with Ridge regularization:
Polynomial degree: 1 with alpha: 100
RMSE: 0.06477647878566266
Feature importance:
Overall.Qual: 0.027999076558499668
Gr.Liv.Area: 0.024045855075633237
House.Age: 0.016591876240077505
Overall.Cond: 0.01657694897259861
X2nd.Flr.SF: 0.015049867513693936
X1st.Flr.SF: 0.013957491968256124
Total.Bsmt.SF: 0.013942714699484047
BsmtFin.SF.1: 0.01116702764060306
Sale.Condition_Normal: 0.010339190420066688
Misc.Val: 0.00981599365436033
Full.Bath: 0.009635016336978608
Kitchen.Qual_TA: 0.009391340258405444
Neighborhood_Crawfor: 0.008373175137368166
Garage.Cars: 0.007942629683190079
Kitchen.Qual_Gd: 0.007796803731969478
MS.SubClass_160: 0.007351007702955967
Remod.Age: 0.007158651146624946
Lot.Area: 0.006934025320638954
MS.Zoning_RM: 0.006914959930279227
Neighborhood_Edwards: 0.0068885852461870924
Neighborhood_StoneBr: 0.006619639473957999
Exterior_BrkFace: 0.006542580813605187
Condition_Roads: 0.006527257203333368
Neighborhood_NridgHt: 0.006513483451253985
...

In [108]:
param_grid = {
    'poly_features__degree': [1, 2],
    'lin_reg__alpha': [0.1, 1, 10, 100],
}

pipe = Pipeline([
    ("poly_features", PolynomialFeatures(include_bias=False)),
    ("std_scaler", StandardScaler()),
    ("lin_reg", Lasso()),
])

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# best_estimator_ is already refit on the full training set (refit=True).
melhor_modelo = grid_search.best_estimator_

y_pred = melhor_modelo.predict(X_test)
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))

print('Best model with Lasso regularization:')
print(f'Polynomial degree: {grid_search.best_params_["poly_features__degree"]} with alpha: {grid_search.best_params_["lin_reg__alpha"]}')
print(f'RMSE: {RMSE}')

# Rank features by absolute coefficient (names are valid for degree == 1 only;
# with degree == 2, zip() pairs them with just the first coefficients).
coeficientes = melhor_modelo.named_steps['lin_reg'].coef_
nomes_features = df.drop(columns=['SalePrice']).columns
importancia_features = [abs(coeficiente) for coeficiente in coeficientes]
feature_importance = list(zip(nomes_features, importancia_features))
feature_importance.sort(key=lambda x: x[1], reverse=True)

print("Feature importance:")
for feature, importance in feature_importance:
    print(f"{feature}: {importance}")


Best model with Lasso regularization:
Polynomial degree: 2 with alpha: 0.1
RMSE: 0.13308900782103728
Feature importance:
Overall.Qual: 0.01091072271392108
Lot.Frontage: 0.0
Lot.Area: 0.0
Overall.Cond: 0.0
Mas.Vnr.Area: 0.0
BsmtFin.SF.1: 0.0
BsmtFin.SF.2: 0.0
Bsmt.Unf.SF: 0.0
Total.Bsmt.SF: 0.0
X1st.Flr.SF: 0.0
X2nd.Flr.SF: 0.0
Low.Qual.Fin.SF: 0.0
Gr.Liv.Area: 0.0
Bsmt.Full.Bath: 0.0
Bsmt.Half.Bath: 0.0
Full.Bath: 0.0
Half.Bath: 0.0
Bedroom.AbvGr: 0.0
Kitchen.AbvGr: 0.0
TotRms.AbvGrd: 0.0
Fireplaces: 0.0
Garage.Cars: 0.0
Garage.Area: 0.0
Wood.Deck.SF: 0.0
Open.Porch.SF: 0.0
Enclosed.Porch: 0.0
X3Ssn.Porch: 0.0
Screen.Porch: 0.0
Pool.Area: 0.0
Misc.Val: 0.0
Mo.Sold: 0.0
Yr.Sold: 0.0
HasShed: 0.0
HasAlley: 0.0
Garage.Age: 0.0
Remod.Age: 0.0
House.Age: 0.0
MS.SubClass_160: 0.0
MS.SubClass_190: 0.0
MS.SubClass_20: 0.0
MS.SubClass_30: 0.0
MS.SubClass_50: 0.0
MS.SubClass_60: 0.0
MS.SubClass_70: 0.0
MS.SubClass_80: 0.0
MS.SubClass_85: 0.0
MS.SubClass_90: 0.0
MS.SubClass_Other: 0.0
...

In [128]:
param_grid = {
    'poly_features__degree': [1, 2],
    'lin_reg__alpha': [0.1, 1, 10, 100],
    'lin_reg__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}

# ASSUMPTION: this cell did not define its own pipeline. An ElasticNet final
# step is assumed here because the grid tunes l1_ratio, which Ridge and Lasso
# do not accept.
pipe = Pipeline([
    ("poly_features", PolynomialFeatures(include_bias=False)),
    ("std_scaler", StandardScaler()),
    ("lin_reg", ElasticNet()),
])

grid_search = GridSearchCV(pipe, param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

# best_estimator_ is already refit on the full training set (refit=True).
melhor_modelo = grid_search.best_estimator_

y_pred = melhor_modelo.predict(X_test)
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))

print('Best model with Elastic Net regularization:')
print(f'Polynomial degree: {grid_search.best_params_["poly_features__degree"]} with alpha: {grid_search.best_params_["lin_reg__alpha"]} and l1_ratio: {grid_search.best_params_["lin_reg__l1_ratio"]}')
print(f'RMSE: {RMSE}')
print()

# Rank features by absolute coefficient (names are valid for degree == 1 only;
# with degree == 2, zip() pairs them with just the first coefficients).
coeficientes = melhor_modelo.named_steps['lin_reg'].coef_
nomes_features = df.drop(columns=['SalePrice']).columns
importancia_features = [abs(coeficiente) for coeficiente in coeficientes]
feature_importance = list(zip(nomes_features, importancia_features))
feature_importance.sort(key=lambda x: x[1], reverse=True)

print("Feature importance:")
for feature, importance in feature_importance:
    print(f"{feature}: {importance}")


Best model with Elastic Net regularization:
Polynomial degree: 2 with alpha: 0.1 and l1_ratio: 0.1
RMSE: 0.07488824130783071

Feature importance:
Overall.Qual: 0.011467491733725996
Gr.Liv.Area: 0.007639679711782924
House.Age: 0.0037306080656400818
BsmtFin.SF.1: 0.0007938866853125253
Remod.Age: 0.0007456710159984982
Lot.Frontage: 0.0
Lot.Area: 0.0
Overall.Cond: 0.0
Mas.Vnr.Area: 0.0
BsmtFin.SF.2: 0.0
Bsmt.Unf.SF: 0.0
Total.Bsmt.SF: 0.0
X1st.Flr.SF: 0.0
X2nd.Flr.SF: 0.0
Low.Qual.Fin.SF: 0.0
Bsmt.Full.Bath: 0.0
Bsmt.Half.Bath: 0.0
Full.Bath: 0.0
Half.Bath: 0.0
Bedroom.AbvGr: 0.0
Kitchen.AbvGr: 0.0
TotRms.AbvGrd: 0.0
Fireplaces: 0.0
Garage.Cars: 0.0
Garage.Area: 0.0
Wood.Deck.SF: 0.0
Open.Porch.SF: 0.0
Enclosed.Porch: 0.0
X3Ssn.Porch: 0.0
Screen.Porch: 0.0
Pool.Area: 0.0
Misc.Val: 0.0
Mo.Sold: 0.0
Yr.Sold: 0.0
HasShed: 0.0
HasAlley: 0.0
Garage.Age: 0.0
MS.SubClass_160: 0.0
MS.SubClass_190: 0.0
MS.SubClass_20: 0.0
MS.SubClass_30: 0.0
MS.SubClass_50: 0.0
MS.SubClass_60: 0.0
MS.SubClass_70: 0.0


In [127]:
param_grid = {
    'sgd_reg__penalty': ['l2', 'l1', 'elasticnet'],
    'sgd_reg__alpha': [0.1, 1, 10, 100],
    # l1_ratio only affects the fit when penalty == 'elasticnet'.
    'sgd_reg__l1_ratio': [0.1, 0.5, 0.9],
}

sgd_reg = Pipeline([
    ("std_scaler", StandardScaler()),
    ("sgd_reg", SGDRegressor(random_state=RANDOM_SEED)),
])

grid_search = GridSearchCV(sgd_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# best_estimator_ is already refit on the full training set (refit=True).
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))

print('Best SGDRegressor:')
print(f'SGDRegressor with alpha: {grid_search.best_params_["sgd_reg__alpha"]}, l1_ratio: {grid_search.best_params_["sgd_reg__l1_ratio"]} and penalty {grid_search.best_params_["sgd_reg__penalty"]}')
print(f'RMSE: {RMSE}')
print()

# Rank features by absolute coefficient (the inputs are standardized).
coeficientes = best_model.named_steps['sgd_reg'].coef_
nomes_features = df.drop(columns=['SalePrice']).columns

importance_features = [abs(coefficient) for coefficient in coeficientes]
feature_importance = list(zip(nomes_features, importance_features))
feature_importance.sort(key=lambda x: x[1], reverse=True)

print("Feature importance:")
for feature, importance in feature_importance:
    print(f"{feature}: {importance}")


Best SGDRegressor:
SGDRegressor with alpha: 1, l1_ratio: 0.1 and penalty l2
RMSE: 0.06833593669524526

Feature importance:
Overall.Qual: 0.015753537196480705
Gr.Liv.Area: 0.013079604162368759
Total.Bsmt.SF: 0.011432005066518518
X1st.Flr.SF: 0.011076316455682517
Exterior_Quality_Ex_TA: 0.00775177866790715
BsmtFin.SF.1: 0.007703306970827028
Full.Bath: 0.007670252220172625
Fireplaces: 0.0076420031364006396
Garage.Cars: 0.00754856312912837
Garage.Area: 0.007349727427080487
TotRms.AbvGrd: 0.007019496860785428
Functional_Sal: 0.006777643303065224
Remod.Age: 0.006617119994283315
Neighborhood_NridgHt: 0.006231629303009501
MS.SubClass_30: 0.005820805848263185
Overall.Cond: 0.005554094975765948
Kitchen.Qual_TA: 0.005469034549230094
Bsmt.Exposure_Gd: 0.005403132554891018
X2nd.Flr.SF: 0.005137444271201382
Neighborhood_StoneBr: 0.005103862548787427
House.Age: 0.005077206458037964
Lot.Frontage: 0.00504634896680569
Bsmt.Full.Bath: 0.005032030662998321
Neighborhood_Crawfor: 0.005014694672830834
...

## Analysis

1. Methods used

Linear regression models with polynomial degrees 1 and 2 were tested, both without regularization and with Lasso, ElasticNet, and Ridge regularization. SGDRegressor was also tested with the l1, l2, and elasticnet penalties. In every case, GridSearchCV was used to find the best hyperparameters of each model, and StandardScaler was used to standardize the features. Performance was measured with RMSE, the root mean squared error, computed on the held-out 25% test split.
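
As a reference for the metric, the sketch below computes RMSE by hand. It also converts the Ridge score into a multiplicative error under the assumption that `SalePrice` is stored as log10 of the price. That assumption is inferred from values near 5.3 in `data.head()` and is not stated in the notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy check: the only error is 2 on one of three points, so RMSE = sqrt(4/3).
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# ASSUMPTION: the target is log10(price). An RMSE of r in log10 units then
# corresponds to a typical multiplicative error of 10**r on the price itself.
ridge_rmse_log10 = 0.06477647878566266  # value reported for the Ridge model
typical_factor = 10 ** ridge_rmse_log10
print(example_rmse, typical_factor)
```

Under that assumption, the Ridge model's typical prediction would be off by a factor of roughly 1.16, that is, about 16% of the sale price.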

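2. Worst vs. best model

The best result came from a degree-1 fit with Ridge regularization and alpha = 100, with an RMSE of 0.06477647878566266. The worst was the degree-1 fit without regularization, with an RMSE of 2502763457.9405923. An error of that size, together with coefficients on the order of 1e10 (Bldg.Type_Duplex and MS.SubClass_90 receive almost identical magnitudes), suggests that near-collinear dummy columns make the unregularized solution numerically unstable, rather than that a linear model is inadequate. Other regularized models also performed well, most notably SGDRegressor with alpha = 1 and the l2 penalty (the selected l1_ratio = 0.1 has no effect under l2), with an RMSE of 0.06833593669524526.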
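
One plausible reason for the gap between the unregularized and the Ridge fits is collinearity among the one-hot columns. The sketch below uses synthetic data, not the Ames data: it duplicates a column to mimic two dummies that encode the same fact, shows that the design matrix becomes ill-conditioned, and shows that Ridge still returns a stable solution that splits the weight evenly between the duplicates.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))

# The third column is an exact copy of the first (perfect collinearity).
X_dup = np.column_stack([base[:, 0], base[:, 1], base[:, 0]])
y_dup = 2.0 * base[:, 0] - 1.0 * base[:, 1] + rng.normal(scale=0.05, size=200)

# A huge (or infinite) condition number means least squares has no unique,
# stable solution: many coefficient vectors fit the data equally well.
condition_number = np.linalg.cond(X_dup)

# The L2 penalty makes the problem well-posed; identical columns receive
# identical coefficients whose sum approximates the true effect (2.0).
ridge = Ridge(alpha=1.0).fit(X_dup, y_dup)
shared = ridge.coef_[[0, 2]]
print(condition_number, shared, shared.sum())
```

On real data, a similar check could look for dummy columns that are identical or nearly identical before fitting an unregularized model.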

3. Feature importance

The feature importances of each model are listed in detail in that model's output. The features that stand out overall, because they rank near the top in most models, are Overall.Qual (overall quality of the house), Gr.Liv.Area (above-ground living area), X1st.Flr.SF (first-floor area), and House.Age (age of the house).