## Comparison of Regression Techniques

### Regression Pipeline

Importing the packages and functions:

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Scaling function: standardizes each feature to zero mean and unit variance
def feature_scaling(data):
    sc = StandardScaler()
    return sc.fit_transform(data)
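
`StandardScaler` standardizes each column to zero mean and unit variance, using the population standard deviation. The sketch below checks this against a manual computation; the array `data` is hypothetical and only serves as an illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three samples, two features
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Standardization done by scikit-learn
scaled = StandardScaler().fit_transform(data)

# Same computation by hand: subtract the column mean and divide by the
# population standard deviation (NumPy's default, ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(scaled, manual))
```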

Importing the dataset for our study. The task is to predict each car's fuel economy, given by the *mpg* column (miles per gallon of fuel), from attributes such as number of cylinders, weight, and horsepower.
Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG)

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/intelligentagents/aprendizagem-supervisionada/master/data/cars.csv', sep = ";")

Inspecting the dataset:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      392 non-null float64
weight          398 non-null int64
acceleration    398 non-null float64
model year      398 non-null int64
origin          398 non-null int64
dtypes: float64(4), int64(4)
memory usage: 25.0 KB


In [4]:
df.head(5)

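| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 |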


Summary statistics of the dataset:

In [5]:
df.describe()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin |
|---|---|---|---|---|---|---|---|---|
| count | 398.0 | 398.0 | 398.0 | 392.0 | 398.0 | 398.0 | 398.0 | 398.0 |
| mean | 23.514573 | 5.454774 | 193.425879 | 104.469388 | 2970.424623 | 15.56809 | 76.01005 | 1.572864 |
| std | 7.815984 | 1.701004 | 104.269838 | 38.49116 | 846.841774 | 2.757689 | 3.697627 | 0.802055 |
| min | 9.0 | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 | 1.0 |
| 25% | 17.5 | 4.0 | 104.25 | 75.0 | 2223.75 | 13.825 | 73.0 | 1.0 |
| 50% | 23.0 | 4.0 | 148.5 | 93.5 | 2803.5 | 15.5 | 76.0 | 1.0 |
| 75% | 29.0 | 8.0 | 262.0 | 126.0 | 3608.0 | 17.175 | 79.0 | 2.0 |
| max | 46.6 | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 82.0 | 3.0 |


Checking which rows contain null values (all of them fall in the *horsepower* attribute):

In [6]:
df[df.isnull().values.any(axis=1)]

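| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin |
|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | NaN | 2046 | 19.0 | 71 | 1 |
| 126 | 21.0 | 6 | 200.0 | NaN | 2875 | 17.0 | 74 | 1 |
| 330 | 40.9 | 4 | 85.0 | NaN | 1835 | 17.3 | 80 | 2 |
| 336 | 23.6 | 4 | 140.0 | NaN | 2905 | 14.3 | 80 | 1 |
| 354 | 34.5 | 4 | 100.0 | NaN | 2320 | 15.8 | 81 | 2 |
| 374 | 23.0 | 4 | 151.0 | NaN | 3035 | 20.5 | 82 | 1 |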
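
A per-column count of missing values is a quick complement to the row view. The sketch below runs on a small hypothetical frame (`toy`), not on the real dataset, and uses only standard pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "mpg": [25.0, 21.0, 18.0],
    "horsepower": [np.nan, np.nan, 130.0],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)
```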


Filling the null values with the median:

In [0]:
df = df.fillna(df.median())
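
To make the effect of the median fill concrete, here is a minimal sketch on hypothetical values (the frame `toy` is illustrative only). `DataFrame.median()` skips missing values by default, so the fill value comes from the observed entries alone.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry
toy = pd.DataFrame({"horsepower": [100.0, np.nan, 150.0, 90.0]})

# The median of the observed values (90, 100, 150) is 100
filled = toy.fillna(toy.median())
print(filled)
```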

Defining the independent and dependent variables:

In [0]:
X = df.iloc[:,1:8]
y = df.iloc[:,0]  
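
Positional slicing depends on the column order. As an alternative sketch, assuming the target column is named `mpg` as in the outputs shown earlier, the same separation can be written by name. The frame `toy` below is hypothetical.

```python
import pandas as pd

# Hypothetical frame with the target in the first column
toy = pd.DataFrame({
    "mpg": [18.0, 15.0],
    "cylinders": [8, 8],
    "weight": [3504, 3693],
})

# Features: every column except the target
X_toy = toy.drop(columns="mpg")
# Target: the mpg column
y_toy = toy["mpg"]

print(list(X_toy.columns), y_toy.name)
```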

Splitting the dataset into training and test sets:

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
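
With 398 rows and `test_size = 0.2`, the split leaves 318 rows for training and 80 for testing. A minimal sketch on synthetic arrays of the same length (placeholders, not the real features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholders with the same number of rows as the dataset
X_demo = np.arange(398 * 2).reshape(398, 2)
y_demo = np.arange(398)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
```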


Standardizing the features:

In [0]:
X_train = feature_scaling(X_train)
X_test = feature_scaling(X_test)
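
The cell above calls `fit_transform` separately on the training and test sets, so each set is standardized with its own mean and standard deviation. A common alternative is to fit the scaler on the training data only and reuse those statistics for the test data. The sketch below illustrates that on hypothetical arrays; the helper name `scale_train_test` is an assumption, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def scale_train_test(train, test):
    """Fit the scaler on the training data and apply it to both sets."""
    sc = StandardScaler()
    train_scaled = sc.fit_transform(train)  # learns mean and std from train
    test_scaled = sc.transform(test)        # reuses the training statistics
    return train_scaled, test_scaled

# Hypothetical single-feature data
train_demo = np.array([[1.0], [2.0], [3.0]])
test_demo = np.array([[2.0], [4.0]])

train_scaled, test_scaled = scale_train_test(train_demo, test_demo)
print(test_scaled.ravel())
```

With this approach the test values are expressed relative to the training distribution, which is what the fitted models have seen.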

Creating the dictionary containing all the regressors:

In [0]:
regressors = {'Linear Regression': LinearRegression(),
              'Decision Tree Reg:': DecisionTreeRegressor(random_state = 0),
              'SVR:': SVR(kernel = 'rbf')}

Creating the dataframe that will store the final results of the regressors:

In [0]:
df_results = pd.DataFrame(columns=['reg', 'r_2_score', 'mse'])

Looping over the dictionary, training and evaluating each model:

In [0]:
for name, reg in regressors.items():

    # Train the regressor on the training set
    reg.fit(X_train, y_train)

    # Predict on the test set
    y_pred = reg.predict(X_test)

    # Store the name, the R² score, and the mean squared error
    df_results.loc[len(df_results), :] = [name, reg.score(X_test, y_test),
                   mean_squared_error(y_test, y_pred)]

Displaying the results:

In [0]:
df_results

| | reg | r_2_score | mse |
|---|---|---|---|
| 0 | Linear Regression | 0.852799 | 7.91446 |
| 1 | Decision Tree Reg: | 0.76205 | 12.7937 |
| 2 | SVR: | 0.866287 | 7.18928 |
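
Called with its default arguments, `mean_squared_error` returns the mean squared error, so the root mean squared error is its square root. For regressors, `score` returns the coefficient of determination R², the same quantity computed by `r2_score`. A minimal sketch with hypothetical values (the arrays below are not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_hat)   # mean of squared residuals
rmse = np.sqrt(mse)                       # same units as the target
r2 = r2_score(y_true, y_hat)              # 1 - SS_res / SS_tot

print(mse, rmse, r2)
```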
