## Case Study: Regression

### Regression Pipeline

**Problem Description**

Importing the packages:

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
# Note: load_boston was removed in scikit-learn 1.2; this import requires an earlier version
from sklearn.datasets import load_boston

Defining helper functions:

In [0]:
from sklearn.preprocessing import StandardScaler

# Scaling function: standardizes each feature to zero mean and unit variance
def feature_scaling(data):
    sc = StandardScaler()
    return sc.fit_transform(data)
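
To make the effect of `StandardScaler` concrete, here is a minimal sketch on a toy array. The values and the names `toy` and `scaled` are illustrative assumptions and are not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

scaled = StandardScaler().fit_transform(toy)

# After scaling, each column has mean close to 0 and standard deviation close to 1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Putting features on a common scale matters mostly for distance- or kernel-based models such as `SVR`; tree-based models are largely insensitive to it.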

Loading the data. The dataset contains general attributes and prices of houses in Boston. The goal is to predict house values.

In [0]:
boston = load_boston()

Converting the loaded data into a DataFrame:

In [0]:
df = pd.DataFrame(boston.data, columns = boston.feature_names)

Adding the house price (target) to the DataFrame:

In [0]:
df['PRICE'] = boston.target
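
On scikit-learn 1.2 or later, `load_boston` is no longer available. The same loading pattern can be tried with a dataset that still ships with the library. The sketch below uses `load_diabetes` purely as a stand-in; the names `diabetes`, `df_demo`, and `TARGET` are assumptions, and the rest of this notebook still refers to the Boston data.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Stand-in dataset (assumption): load_diabetes is bundled with scikit-learn
diabetes = load_diabetes()

# Same pattern as above: features into a DataFrame, then the target as an extra column
df_demo = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df_demo['TARGET'] = diabetes.target

print(df_demo.shape)
```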

Inspecting the dataset:

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
PRICE      506 non-null float64
dtypes: float64(14)
memory usage: 55.5 KB


In [45]:
df.head(5)

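|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |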


Summary statistics for the dataset:

In [46]:
df.describe()

|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


Defining the independent variables (the first 13 columns) and the dependent variable (the last column, `PRICE`):

In [0]:
X = df.iloc[:, :13].values
y = df.iloc[:, -1].values
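
Positional slicing works here because `PRICE` is the last column. A possible alternative is to select by column name, which keeps working if columns are reordered. The sketch below uses a tiny frame built from three rows of the `df.head()` output; `df_toy`, `X_toy`, and `y_toy` are hypothetical names.

```python
import pandas as pd

# Tiny frame with the same layout as df: features first, PRICE last
# (values copied from the first three rows shown by df.head())
df_toy = pd.DataFrame({'RM': [6.575, 6.421, 7.185],
                       'LSTAT': [4.98, 9.14, 4.03],
                       'PRICE': [24.0, 21.6, 34.7]})

# Select features and target by column name instead of position
X_toy = df_toy.drop(columns='PRICE').values
y_toy = df_toy['PRICE'].values

print(X_toy.shape, y_toy.shape)
```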


Splitting the dataset into training and test sets (80% / 20%):

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
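
As a quick check of what `test_size = 0.2` means for 506 rows, the sketch below runs the same call on placeholder arrays of the same shape. `X_demo`, `y_demo`, and the unpacked names are assumptions introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the Boston data (506 rows, 13 features)
X_demo = np.zeros((506, 13))
y_demo = np.arange(506)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The test size is rounded up, so 506 rows split into 404 for training and 102 for testing
print(Xtr.shape, Xte.shape)
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs evaluate the models on the same held-out rows.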


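Standardizing the features:

In [0]:
# Standardize the features (zero mean, unit variance).
# Note: feature_scaling fits a new StandardScaler on each call, so the test set
# is scaled with its own statistics rather than those of the training set.
X_train = feature_scaling(X_train)
X_test = feature_scaling(X_test)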
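
Calling `feature_scaling` separately on each split fits two independent scalers. A common alternative is to fit the scaler on the training set only and reuse it for the test set, so that no test-set statistics enter preprocessing. The sketch below shows this on hypothetical random arrays (`X_tr`, `X_te`). It is not the approach that produced the results reported in this notebook, and applying it would likely change those numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test arrays standing in for X_train and X_test
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=10.0, scale=2.0, size=(80, 3))
X_te = rng.normal(loc=10.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training data only
X_te_scaled = scaler.transform(X_te)      # reuse the training statistics on the test set

print(X_tr_scaled.mean(axis=0))
print(X_te_scaled.mean(axis=0))
```

With this approach the test-set means are generally close to, but not exactly, zero, because the test data is centered using the training mean.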


Creating the dictionary containing all the regressors:

In [0]:
regressors = {'Linear Regression': LinearRegression(),
              'Decision Tree Reg:': DecisionTreeRegressor(random_state = 0),
              'Random Forest Reg': RandomForestRegressor(n_estimators = 10, random_state = 0),
              'SVR:': SVR(kernel = 'rbf')}

Creating the DataFrame that will store the final results for each regressor:

In [0]:
df_results = pd.DataFrame(columns=['reg', 'r_2_score','rmse'])

Iterating over the dictionary, training and evaluating each model:

In [0]:
for name, reg in regressors.items():

    # Train the regressor on the training set
    reg.fit(X_train, y_train)

    # Predict on the test set
    y_pred = reg.predict(X_test)

    # Store the name, R^2 (reg.score) and MSE (mean_squared_error returns MSE, not RMSE)
    df_results.loc[len(df_results), :] = [name, reg.score(X_test, y_test),
                   mean_squared_error(y_test, y_pred)]
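
For scikit-learn regressors, `score` returns the coefficient of determination R², which is why it fills the `r_2_score` column. The sketch below checks this against `r2_score` on synthetic data; the data and the names `X_syn`, `y_syn`, and `model` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (assumption): a noisy linear relationship
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 2))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X_syn, y_syn)

# Both lines compute R^2 on the same data
r2_from_score = model.score(X_syn, y_syn)
r2_from_metric = r2_score(y_syn, model.predict(X_syn))

print(r2_from_score, r2_from_metric)
```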


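Displaying the results. Note that the `rmse` column holds the mean squared error as returned by `mean_squared_error`, not its square root: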

In [53]:
df_results

|   | reg | r_2_score | rmse |
|---|---|---|---|
| 0 | Linear Regression | 0.626273 | 27.4068 |
| 1 | Decision Tree Reg: | 0.567803 | 31.6946 |
| 2 | Random Forest Reg | 0.803 | 14.4468 |
| 3 | SVR: | 0.614223 | 28.2905 |
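
Since the values stored under `rmse` are mean squared errors, an actual RMSE can be obtained by taking their square root. The sketch below does this for the four values in the results table above; the dictionary names are hypothetical and the computation is plain arithmetic, not a rerun of the models.

```python
import numpy as np

# MSE values copied from the results table (stored there under the `rmse` label)
mse_by_model = {'Linear Regression': 27.4068,
                'Decision Tree Reg:': 31.6946,
                'Random Forest Reg': 14.4468,
                'SVR:': 28.2905}

# RMSE is the square root of MSE and is expressed in the same units as PRICE
rmse_by_model = {name: float(np.sqrt(mse)) for name, mse in mse_by_model.items()}

for name, rmse in rmse_by_model.items():
    print(f'{name}: {rmse:.2f}')
```

The ranking of the models is the same under either metric, with the random forest showing the lowest error and the highest R².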
