Regression models used:
    
    Decision Tree Regressor
    Linear Regression
    Lasso Lars
    Ridge Regression
    SVR
    Random Forest Regression
    Logistic Regression
    Bayesian Ridge
    Gradient Boosting

#### Summary of the columns:

    SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.

    MSSubClass: The building class
    MSZoning: The general zoning classification
    LotFrontage: Linear feet of street connected to property
    LotArea: Lot size in square feet
    Street: Type of road access
    Alley: Type of alley access
    LotShape: General shape of property
    LandContour: Flatness of the property
    Utilities: Type of utilities available
    LotConfig: Lot configuration
    LandSlope: Slope of property
    Neighborhood: Physical locations within Ames city limits
    Condition1: Proximity to main road or railroad
    Condition2: Proximity to main road or railroad (if a second is present)
    BldgType: Type of dwelling
    HouseStyle: Style of dwelling
    OverallQual: Overall material and finish quality
    OverallCond: Overall condition rating
    YearBuilt: Original construction date
    YearRemodAdd: Remodel date
    RoofStyle: Type of roof
    RoofMatl: Roof material
    Exterior1st: Exterior covering on house
    Exterior2nd: Exterior covering on house (if more than one material)
    MasVnrType: Masonry veneer type
    MasVnrArea: Masonry veneer area in square feet
    ExterQual: Exterior material quality
    ExterCond: Present condition of the material on the exterior
    Foundation: Type of foundation
    BsmtQual: Height of the basement
    BsmtCond: General condition of the basement
    BsmtExposure: Walkout or garden level basement walls
    BsmtFinType1: Quality of basement finished area
    BsmtFinSF1: Type 1 finished square feet
    BsmtFinType2: Quality of second finished area (if present)
    BsmtFinSF2: Type 2 finished square feet
    BsmtUnfSF: Unfinished square feet of basement area
    TotalBsmtSF: Total square feet of basement area
    Heating: Type of heating
    HeatingQC: Heating quality and condition
    CentralAir: Central air conditioning
    Electrical: Electrical system
    1stFlrSF: First Floor square feet
    2ndFlrSF: Second floor square feet
    LowQualFinSF: Low quality finished square feet (all floors)
    GrLivArea: Above grade (ground) living area square feet
    BsmtFullBath: Basement full bathrooms
    BsmtHalfBath: Basement half bathrooms
    FullBath: Full bathrooms above grade
    HalfBath: Half baths above grade
    BedroomAbvGr: Number of bedrooms above basement level
    KitchenAbvGr: Number of kitchens
    KitchenQual: Kitchen quality
    TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
    Functional: Home functionality rating
    Fireplaces: Number of fireplaces
    FireplaceQu: Fireplace quality
    GarageType: Garage location
    GarageYrBlt: Year garage was built
    GarageFinish: Interior finish of the garage
    GarageCars: Size of garage in car capacity
    GarageArea: Size of garage in square feet
    GarageQual: Garage quality
    GarageCond: Garage condition
    PavedDrive: Paved driveway
    WoodDeckSF: Wood deck area in square feet
    OpenPorchSF: Open porch area in square feet
    EnclosedPorch: Enclosed porch area in square feet
    3SsnPorch: Three season porch area in square feet
    ScreenPorch: Screen porch area in square feet
    PoolArea: Pool area in square feet
    PoolQC: Pool quality
    Fence: Fence quality
    MiscFeature: Miscellaneous feature not covered in other categories
    MiscVal: $Value of miscellaneous feature
    MoSold: Month Sold
    YrSold: Year Sold
    SaleType: Type of sale
    SaleCondition: Condition of sale

In [1]:
# Importing the libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, LassoLars, SGDRegressor, Ridge, LogisticRegression
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score,mean_squared_log_error

In [2]:
# Loading the training dataset
ds_train = pd.read_csv('train.csv')

In [3]:
# Inspecting the training set
ds_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [4]:
# Checking for null/empty values
ds_train.isnull().sum()[ds_train.isnull().sum() != 0].sort_values(ascending=False)

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
BsmtExposure      38
BsmtFinType2      38
BsmtFinType1      37
BsmtCond          37
BsmtQual          37
MasVnrArea         8
MasVnrType         8
Electrical         1
dtype: int64

##### The output above shows that many columns have missing values. To handle them, we look for a reasonable substitute for each column so that the gaps can be replaced or filled in.
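As a minimal sketch of the filling strategy used in the next cell, the toy frame below (hypothetical values, not taken from `train.csv`) labels a categorical gap with the string `'None'` and fills a numeric gap with the column mean. It assigns the result back to the column instead of relying on `inplace=True`, which is an assumption about style rather than the notebook's own approach.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one categorical and one numeric gap
toy = pd.DataFrame({
    'PoolQC': ['Gd', np.nan, np.nan],
    'LotFrontage': [60.0, np.nan, 80.0],
})

# Categorical gap: label it explicitly instead of leaving NaN
toy['PoolQC'] = toy['PoolQC'].fillna('None')

# Numeric gap: replace it with the mean of the observed values
toy['LotFrontage'] = toy['LotFrontage'].fillna(toy['LotFrontage'].mean())

print(toy)
```

Whether the mean is a sensible fill depends on the column; for a year such as `GarageYrBlt` it is only a rough placeholder.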


In [5]:
# Replacing the missing values
# Categorical columns: fill missing values with the label 'None'
none_cols = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
             'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
             'BsmtExposure', 'BsmtFinType2', 'BsmtFinType1', 'BsmtCond',
             'BsmtQual', 'MasVnrType']
ds_train[none_cols] = ds_train[none_cols].fillna('None')
# Numeric columns: mean for LotFrontage and GarageYrBlt, 0 for MasVnrArea
ds_train['LotFrontage'] = ds_train['LotFrontage'].fillna(ds_train['LotFrontage'].mean())
ds_train['GarageYrBlt'] = ds_train['GarageYrBlt'].fillna(ds_train['GarageYrBlt'].mean())
ds_train['MasVnrArea'] = ds_train['MasVnrArea'].fillna(0)
# Dropping the single row with no Electrical value
ds_train = ds_train.dropna(subset=['Electrical'])

In [6]:
# Checking again whether any null/empty values remain

ds_train.isnull().sum()[ds_train.isnull().sum() != 0].sort_values(ascending=False)

Series([], dtype: int64)

In [7]:
categorical_features = ds_train.select_dtypes(include = ["object"]).columns
numerical_features = ds_train.select_dtypes(exclude = ["object"]).columns
print("Numerical features : " + str(len(numerical_features)))
print("Categorical features : " + str(len(categorical_features)))
train_num = ds_train[numerical_features]
train_cat = ds_train[categorical_features]
print(train_num.shape)
print(train_cat.shape)

Numerical features : 38
Categorical features : 43
(1459, 38)
(1459, 43)


In [8]:
train_cat = pd.get_dummies(train_cat)
train_cat.shape

(1459, 266)
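The jump from 43 categorical columns to 266 comes from one-hot encoding: `pd.get_dummies` creates one indicator column per distinct category. A small sketch on a toy frame (hypothetical values) illustrates the expansion.

```python
import pandas as pd

# Two categorical columns with two distinct values each (hypothetical data)
toy_cat = pd.DataFrame({
    'Street': ['Pave', 'Grvl', 'Pave'],
    'Alley': ['None', 'Grvl', 'None'],
})

# Each distinct category becomes its own indicator column
toy_dummies = pd.get_dummies(toy_cat)

print(list(toy_dummies.columns))
print(toy_dummies.shape)
```

Because the set of columns depends on which categories appear in the data, encoding two different files separately can produce different column sets.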

In [9]:
ds_train = pd.concat([train_cat,train_num],axis=1)
ds_train.shape

(1459, 304)

In [10]:
# Defining the target and the features
y = ds_train['SalePrice']
X = ds_train.drop('SalePrice', axis=1)

In [11]:
# Splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [12]:
def evaluate(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rsquare = r2_score(y_true, y_pred)
    # Taking the square root directly avoids depending on the `squared` argument
    rmse = np.sqrt(mse)
    try:
        rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
    except ValueError:
        # The log error is undefined when a prediction is negative
        rmsle = np.nan
    return mae, mse, rsquare, rmse, rmsle
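To make the two root metrics returned by `evaluate` concrete, the sketch below computes RMSE and RMSLE by hand with NumPy on three made-up prices and compares them with the scikit-learn functions. The arrays are illustrative assumptions, not values from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Hypothetical true and predicted sale prices
y_true_demo = np.array([100000.0, 200000.0, 300000.0])
y_pred_demo = np.array([110000.0, 190000.0, 330000.0])

# RMSE: square root of the mean squared difference
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# RMSLE: the same formula applied to log(1 + value)
rmsle_manual = np.sqrt(np.mean((np.log1p(y_true_demo) - np.log1p(y_pred_demo)) ** 2))

# Cross-check against scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
rmsle_sklearn = np.sqrt(mean_squared_log_error(y_true_demo, y_pred_demo))

print(rmse_manual, rmse_sklearn)
print(rmsle_manual, rmsle_sklearn)
```

RMSLE works on relative rather than absolute differences, so an error of 10,000 on a cheap house weighs more than the same error on an expensive one.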

In [13]:
# Regression models to be used
model_list = {
    'decision_tree_regression':DecisionTreeRegressor(random_state = 42),
    'linear_regression':LinearRegression(),
    'lasso_lars':LassoLars(alpha=41,eps=1.38,random_state = 42,normalize=False),
    'ridge_regression':Ridge(alpha=1778,random_state=42),
    'SVR': SVR(),
    'random_forest_regression':RandomForestRegressor(n_estimators=400,random_state = 42),
    # Note: LogisticRegression is a classifier, so each distinct SalePrice is treated as a class
    'logistic_regression':LogisticRegression(random_state = 42),
    'gradient_boosting':GradientBoostingRegressor(random_state = 42),
}

In [14]:
# Evaluating the models

score = dict()
for key in model_list.keys():
    model = model_list[key].fit(X_train, y_train)
    y_pred_test = model.predict(X_test)
    score[key] = evaluate(y_test ,y_pred_test)
score_df = pd.DataFrame(score).T.round(5)
score_df.columns=['MAE','MSE','R2 Square','RMSE','RMSLE']
score_df


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


                                  MAE            MSE  R2 Square          RMSE    RMSLE
decision_tree_regression   25975.5274   1536437000.0    0.74721   39197.40672  0.21235
linear_regression         30384.97292  18590270000.0    -2.0586  136346.12446  0.25012
lasso_lars                17654.27098    667493100.0    0.89018   25835.88819  0.14502
ridge_regression          20595.18782    971148800.0    0.84022    31163.2603  0.17251
SVR                       56416.68244   6270725000.0    -0.0317   79187.91136  0.40581
random_forest_regression  16818.57372    705050200.0      0.884     26552.781  0.14648
logistic_regression       38692.35274   3236084000.0    0.46758   56886.58803  0.28836
gradient_boosting         15965.81599    657911800.0    0.89176   25649.79232  0.13896
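The solver warning above suggests scaling the data. One common way to do that is to put a `StandardScaler` in front of the estimator with `make_pipeline`, so the scaler is fitted on the training split only. The sketch below uses synthetic data (an assumption for illustration, with two features on very different scales) and compares `SVR` with and without scaling. It does not claim the same gain on the house price data, where the scale of the target and the default `C` also matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: two informative features on very different scales
rng = np.random.default_rng(0)
n = 300
X_demo = np.column_stack([rng.normal(0, 1, n), rng.normal(0, 1000, n)])
y_demo = X_demo[:, 0] + 0.001 * X_demo[:, 1] + rng.normal(0, 0.1, n)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# SVR on raw features: the large-scale feature dominates the RBF kernel
unscaled_r2 = SVR().fit(Xtr, ytr).score(Xte, yte)

# SVR behind a scaler: both features contribute on comparable scales
scaled_model = make_pipeline(StandardScaler(), SVR())
scaled_r2 = scaled_model.fit(Xtr, ytr).score(Xte, yte)

print('R2 without scaling:', round(unscaled_r2, 3))
print('R2 with scaling:   ', round(scaled_r2, 3))
```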


In [15]:
# Repeating the same steps for the test set
ds_test = pd.read_csv('test.csv')

In [16]:
# Inspecting the test set
ds_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallC

In [17]:
# Checking for null/empty values
ds_test.isnull().sum()[ds_test.isnull().sum() != 0].sort_values(ascending=False)

PoolQC          1456
MiscFeature     1408
Alley           1352
Fence           1169
FireplaceQu      730
LotFrontage      227
GarageCond        78
GarageYrBlt       78
GarageQual        78
GarageFinish      78
GarageType        76
BsmtCond          45
BsmtExposure      44
BsmtQual          44
BsmtFinType1      42
BsmtFinType2      42
MasVnrType        16
MasVnrArea        15
MSZoning           4
BsmtFullBath       2
BsmtHalfBath       2
Functional         2
Utilities          2
GarageCars         1
GarageArea         1
TotalBsmtSF        1
KitchenQual        1
BsmtUnfSF          1
BsmtFinSF2         1
BsmtFinSF1         1
Exterior2nd        1
Exterior1st        1
SaleType           1
dtype: int64

In [18]:
# Replacing the missing values

# Categorical columns: fill missing values with the label 'None'
none_cols = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
             'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
             'BsmtExposure', 'BsmtFinType2', 'BsmtFinType1', 'BsmtCond',
             'BsmtQual', 'MasVnrType']
ds_test[none_cols] = ds_test[none_cols].fillna('None')
# Numeric columns: mean for LotFrontage and GarageYrBlt, 0 for the rest
ds_test['LotFrontage'] = ds_test['LotFrontage'].fillna(ds_test['LotFrontage'].mean())
ds_test['GarageYrBlt'] = ds_test['GarageYrBlt'].fillna(ds_test['GarageYrBlt'].mean())
zero_cols = ['GarageArea', 'GarageCars', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
             'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'MasVnrArea']
ds_test[zero_cols] = ds_test[zero_cols].fillna(0)
# Remaining categorical columns: fill each with a fixed category
ds_test = ds_test.fillna({'Functional': 'Typ', 'MSZoning': 'RL', 'KitchenQual': 'TA',
                          'Exterior1st': 'VinylSd', 'Exterior2nd': 'VinylSd',
                          'SaleType': 'WD'})
# Dropping the rows with no Utilities value
ds_test = ds_test.dropna(subset=['Utilities'])


In [19]:
# Checking again whether any null/empty values remain

ds_test.isnull().sum()[ds_test.isnull().sum() != 0].sort_values(ascending=False)

Series([], dtype: int64)

In [20]:
t_categorical_features = ds_test.select_dtypes(include = ["object"]).columns
t_numerical_features = ds_test.select_dtypes(exclude = ["object"]).columns
print("Numerical features : " + str(len(t_numerical_features)))
print("Categorical features : " + str(len(t_categorical_features)))
test_num = ds_test[t_numerical_features]
test_cat = ds_test[t_categorical_features]
print(test_num.shape)
print(test_cat.shape)

Numerical features : 37
Categorical features : 43
(1457, 37)
(1457, 43)


In [21]:
test_cat = pd.get_dummies(test_cat)
test_cat.shape

(1457, 248)

In [22]:
ds_test = pd.concat([test_cat,test_num],axis=1)
ds_test.shape

(1457, 285)

In [23]:


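# Fitting each model on the full training data and predicting on the test set
for key in model_list.keys():
    model = model_list[key].fit(X, y)
    # predict fails here while the test columns differ from the columns of X
    test_y_sr = model.predict(ds_test)
    df=pd.DataFrame({'Id':ds_test.Id,'SalePrice':test_y_sr})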


Feature names seen at fit time, yet now missing:
- Condition2_RRAe
- Condition2_RRAn
- Condition2_RRNn
- Electrical_Mix
- Exterior1st_ImStucc
- ...



ValueError: X has 285 features, but DecisionTreeRegressor is expecting 303 features as input.

In [24]:
print(X.shape)
print(y.shape)
print(ds_test.shape)

(1459, 303)
(1459,)
(1457, 285)
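The shapes confirm the mismatch: the training matrix has 303 columns and the encoded test set has 285, because some categories (for example `Electrical_Mix`) appear only in `train.csv`. One possible way to resolve this, sketched here on toy frames with hypothetical values, is to reindex the encoded test frame on the training columns and fill the indicator columns that are missing with 0.

```python
import pandas as pd

# Toy train/test frames; 'Mix' appears only in the training data
train_demo = pd.DataFrame({
    'Electrical': ['SBrkr', 'Mix', 'FuseA'],
    'LotArea': [8450, 9600, 11250],
})
test_demo = pd.DataFrame({
    'Electrical': ['SBrkr', 'FuseA'],
    'LotArea': [11622, 14267],
})

train_enc = pd.get_dummies(train_demo)
test_enc = pd.get_dummies(test_demo)

# Align the test columns to the training columns:
# missing indicator columns are added and filled with 0,
# and any column unknown to the training data is dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.shape, test_enc.shape, test_aligned.shape)
```

In the notebook this would correspond to something like `ds_test.reindex(columns=X.columns, fill_value=0)` before calling `predict`, which is a suggestion rather than a step the original performs.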
