# Model Training
## Description

This is the first training code for a model. It is a simple pipeline: loading the dataset, splitting it into training and validation sets (there is no test set yet), training the model, and validating it.

## Imports

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import yaml
from itertools import chain, combinations
import datetime
import tempfile

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from sklearn.utils import shuffle

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from mlflow.utils.mlflow_tags import MLFLOW_PARENT_RUN_ID


## Data Extraction

In [7]:
df = pd.read_csv(os.path.abspath("../extracao/data.csv"))

In [8]:
df

| Unnamed: 0 | edge_simpl | hue_simpl | average_lum | contrast_ratio | hist_width | blur | score |
|---|---|---|---|---|---|---|---|
| 80184044.jpg | 0.988281 | 17 | 32.791357 | 0.033461 | 0.828125 | 0.506367 | 3.911765 |
| 3632417985.jpg | 0.984701 | 10 | 60.394967 | 0.040131 | 0.937500 | 0.509684 | 4.167411 |
| 10344921126.jpg | 0.900228 | 14 | 82.289109 | 0.008832 | 0.820312 | 0.499054 | 4.466387 |
| 33000255.jpg | 0.987467 | 16 | 14.709630 | 0.001770 | 0.992188 | 0.490172 | 4.705508 |
| 3091624267.jpg | 0.987630 | 17 | 20.625202 | 0.000000 | 0.800781 | 0.494483 | 4.948718 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10264804094.jpg | 0.986979 | 7 | 70.930692 | 0.096978 | 0.921875 | 0.499980 | 76.649554 |
| 6163705024.jpg | 0.983724 | 17 | 77.730217 | 0.089902 | 0.578125 | 0.496635 | 76.649554 |
| 7914531096.jpg | 0.987630 | 7 | 58.468774 | 0.018675 | 0.925781 | 0.502837 | 76.649554 |
| 4824776407.jpg | 0.988932 | 15 | 66.287623 | 0.002342 | 0.882812 | 0.498488 | 76.650943 |


## Data Formatting

The data is split into training and validation sets. The dataset is sorted by image quality (the regression target), so the rows must be shuffled before splitting to avoid bias.

In [11]:
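r_state = 15

# Shuffle features and target in a single call so the rows stay aligned.
# Features are every column except the last; the target is the last column (score).
X, y = shuffle(df.iloc[:, :-1], df.iloc[:, -1], random_state=r_state)

# Hold out 30% of the rows; X_test / y_test serve as the validation set here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=r_state
)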
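
To see why shuffling matters for a dataset sorted by its target, the sketch below uses a small synthetic target (not the real `data.csv`) in ascending order. Taking the last 30% as validation without shuffling leaves only the highest scores in that subset, while shuffling first gives both subsets a similar mean. The helper `split_tail` and the synthetic values are assumptions made for illustration only. Note that `train_test_split` already shuffles by default (`shuffle=True`), so the explicit `shuffle` call in the notebook mainly acts as a safeguard.

```python
import random
import statistics


def split_tail(values, test_size=0.3):
    """Split a list into (train, validation), using the last `test_size` fraction as validation."""
    cut = int(round(len(values) * (1 - test_size)))
    return values[:cut], values[cut:]


# Synthetic target sorted in ascending order (illustrative values only).
scores = [float(i) for i in range(100)]

# Without shuffling: the validation subset holds only the highest scores.
train_sorted, val_sorted = split_tail(scores)

# With shuffling: both subsets sample the whole score range.
rng = random.Random(15)
shuffled = scores[:]
rng.shuffle(shuffled)
train_shuf, val_shuf = split_tail(shuffled)

gap_sorted = statistics.mean(val_sorted) - statistics.mean(train_sorted)
gap_shuffled = statistics.mean(val_shuf) - statistics.mean(train_shuf)
print(f"mean gap without shuffle: {gap_sorted:.1f}")
print(f"mean gap with shuffle:    {gap_shuffled:.1f}")
```

A large gap between the validation and training means of the target indicates that the two subsets do not come from the same distribution, which would make the validation metrics misleading.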


## Training

A random forest regressor is used here as a first test model.

In [49]:
model = RandomForestRegressor(
    max_depth=8, max_features=5, min_samples_split=10, min_samples_leaf=4
)

In [50]:
model.fit(X_train, y_train)

## Validation

The metrics used are the root mean squared error (RMSE), the mean absolute error (MAE), and the coefficient of determination (R²), which indicates how well the model fits the data.

In [51]:

def eval_metrics(actual, pred):
    """Return (rmse, mae, r2) for the true values `actual` and predictions `pred`."""
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2
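
As a cross-check of what these three metrics measure, the sketch below computes them directly from their definitions on toy values, without scikit-learn. The function name `eval_metrics_manual` and the input values are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import math


def eval_metrics_manual(actual, pred):
    """Compute (rmse, mae, r2) from their definitions."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, pred)]
    # RMSE: square root of the mean squared residual.
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    # MAE: mean absolute residual.
    mae = sum(abs(r) for r in residuals) / n
    # R2: 1 minus the ratio of residual to total sum of squares.
    mean_actual = sum(actual) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2


# Toy values for illustration only.
actual = [3.0, 5.0, 7.0]
pred = [2.0, 5.0, 9.0]
rmse, mae, r2 = eval_metrics_manual(actual, pred)
print(rmse, mae, r2)
```

An R² of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of the target, and negative values mean it does worse than that baseline.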



In [52]:
# Metrics on the validation set: (rmse, mae, r2).
y_pred = model.predict(X_test)
print(eval_metrics(y_test, y_pred))

(12.235089365163445, 9.573091072047319, 0.29804186774091124)


In [53]:
# Metrics on the training set, for comparison with the validation metrics.
y_pred = model.predict(X_train)
print(eval_metrics(y_train, y_pred))

(11.167752078570073, 8.740376602121405, 0.44485997334744665)


In [15]:
client = MlflowClient(tracking_uri="http://localhost:5000")