## Introduction to Data Science

### Exploratory Data Analysis Project on ENEM 2022 Data - Prediction


This analysis is available on [GitHub](https://github.com/jonasrlg/Intro_a_Ciencia_de_Dados_Embraer), including both the *notebook* and the
*dataset* used.

In [1]:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import ElasticNet, SGDRegressor
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor as KNN
from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit

import mlflow

import gc

In [2]:
df = pd.read_csv('enem.csv', usecols=['NU_NOTA_CN', 'NU_NOTA_CH', 'NU_NOTA_LC', 'NU_NOTA_MT', 'NU_NOTA_REDACAO'])
df

Unnamed: 0,NU_NOTA_CN,NU_NOTA_CH,NU_NOTA_LC,NU_NOTA_MT,NU_NOTA_REDACAO
0,,,,,
1,,,,,
2,421.1,546.0,498.8,565.3,760.0
3,490.7,388.6,357.8,416.0,320.0
4,,,,,
...,...,...,...,...,...
3476100,,,,,
3476101,,,,,
3476102,527.9,627.0,583.3,637.1,660.0
3476103,,,,,


In [3]:
df.describe()

Unnamed: 0,NU_NOTA_CN,NU_NOTA_CH,NU_NOTA_LC,NU_NOTA_MT,NU_NOTA_REDACAO
count,2355395.0,2493442.0,2493442.0,2355395.0,2493442.0
mean,495.9305,526.9531,517.4389,542.5032,618.4797
std,72.00975,81.48446,77.55491,116.0225,212.2125
min,0.0,0.0,0.0,0.0,0.0
25%,440.5,477.0,468.4,449.0,520.0
50%,485.6,529.9,525.5,530.8,620.0
75%,543.3,581.9,573.2,622.4,760.0
max,875.3,839.2,801.0,985.7,1000.0


Next, we note that, among the rows that have at least one score, the number with some score missing is about 159 thousand (2,504,014 - 2,344,823 = 159,191, from the two shapes below). This is small relative to the roughly 2.5 million such rows, so we can drop every row containing a `NaN` without much harm to the final result.

In [4]:
df.dropna(how='all').shape

(2504014, 5)

In [5]:
df.dropna().shape

(2344823, 5)
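To make the difference between the two `dropna` calls concrete, here is a minimal sketch on a toy DataFrame. The values are hypothetical; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores: one complete row, one partial row, one empty row.
toy = pd.DataFrame({
    "NU_NOTA_CN": [500.0, np.nan, np.nan],
    "NU_NOTA_MT": [600.0, 450.0, np.nan],
})

# how='all' drops only rows where every column is NaN.
rows_with_any_score = toy.dropna(how="all").shape[0]
# The default (how='any') drops rows with at least one NaN.
rows_complete = toy.dropna().shape[0]
# Rows that have some, but not all, scores missing.
rows_partial = rows_with_any_score - rows_complete

print(rows_with_any_score, rows_complete, rows_partial)
```

The same subtraction applied to the two shapes above gives the number of partially filled rows in the real data.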

Now we describe our idea. First, we tried to compute the IRT score (Item Response Theory, TRI in Portuguese) from the candidates' correct answers, but this turned out not to be a good idea. According to the ENEM candidate's manual, the IRT computation involves several steps, combining the domain knowledge of the specialists who designed the exam with numerical optimization methods such as the EM algorithm, which seeks to maximize the likelihood function given the latent variables of the problem.
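As a rough illustration of why the IRT score is not a simple function of the number of correct answers, the sketch below evaluates the three-parameter logistic (3PL) item response function. This assumes the 3PL form commonly used in IRT, which the text above does not spell out. The function name and the item parameters are hypothetical, and in practice the parameters are estimated (for example with EM) rather than fixed by hand.

```python
import math

def prob_correct_3pl(theta, a, b, c):
    """Probability of a correct answer under the 3PL model.

    theta: candidate ability (latent variable)
    a: item discrimination, b: item difficulty, c: guessing probability
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item parameters, chosen only for illustration.
p_at_difficulty = prob_correct_3pl(theta=0.0, a=1.2, b=0.0, c=0.2)
p_high_ability = prob_correct_3pl(theta=3.0, a=1.2, b=0.0, c=0.2)

print(p_at_difficulty, p_high_ability)
```

Because each item has its own parameters, two candidates with the same number of correct answers can receive different scores.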

Given this difficulty, and since the effort would not lead to any interesting conclusion (any machine learning model would be worse at this than the IRT computation itself), we came up with a new idea: estimating the essay score from the scores in the other subject areas. This approach started in the previous analysis, where we checked the correlation between the exam's subject areas.

We therefore apply `dropna` and take the columns of the other exams as `features` and the essay column as `target`.

In [6]:
data = df.dropna()
data

Unnamed: 0,NU_NOTA_CN,NU_NOTA_CH,NU_NOTA_LC,NU_NOTA_MT,NU_NOTA_REDACAO
2,421.1,546.0,498.8,565.3,760.0
3,490.7,388.6,357.8,416.0,320.0
7,398.1,427.3,400.2,404.9,440.0
9,467.5,461.0,466.7,435.3,360.0
11,458.7,539.8,488.2,456.8,940.0
...,...,...,...,...,...
3476095,444.5,504.4,489.5,423.6,580.0
3476097,536.1,633.2,584.0,596.3,740.0
3476098,487.6,495.6,545.5,597.4,580.0
3476099,512.5,524.8,546.8,432.0,520.0


In [7]:
features = data.iloc[:,:-1].to_numpy()
target = data.iloc[:,-1].to_numpy()
print(f'Shape das features = {features.shape}')
print(f'Shape do target = {target.shape}')

features, target

Shape das features = (2344823, 4)
Shape do target = (2344823,)


(array([[421.1, 546. , 498.8, 565.3],
        [490.7, 388.6, 357.8, 416. ],
        [398.1, 427.3, 400.2, 404.9],
        ...,
        [487.6, 495.6, 545.5, 597.4],
        [512.5, 524.8, 546.8, 432. ],
        [527.9, 627. , 583.3, 637.1]]),
 array([760., 320., 440., ..., 580., 520., 660.]))

Next, we normalize both sets. Since the `features` and `target` variables share the same range, from 0 to 1,000 (domain knowledge, confirmed by the `describe` output), we apply the same transformation to every column: dividing by 1,000. This is close to what a MinMaxScaler would do, but MinMaxScaler computes the minimum and maximum from the dataset itself, whereas here we rely on our knowledge of how ENEM scores work.

In [8]:
features /= 1_000
target /= 1_000

features, target

(array([[0.4211, 0.546 , 0.4988, 0.5653],
        [0.4907, 0.3886, 0.3578, 0.416 ],
        [0.3981, 0.4273, 0.4002, 0.4049],
        ...,
        [0.4876, 0.4956, 0.5455, 0.5974],
        [0.5125, 0.5248, 0.5468, 0.432 ],
        [0.5279, 0.627 , 0.5833, 0.6371]]),
 array([0.76, 0.32, 0.44, ..., 0.58, 0.52, 0.66]))
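The contrast between the fixed-range scaling used here and a data-driven min-max scaling can be sketched with the first three feature rows printed above. The min-max formula is written out by hand and is meant to mirror what a default `MinMaxScaler` computes per column.

```python
import numpy as np

# First three feature rows shown in the notebook output above.
scores = np.array([
    [421.1, 546.0, 498.8, 565.3],
    [490.7, 388.6, 357.8, 416.0],
    [398.1, 427.3, 400.2, 404.9],
])

# Domain-knowledge scaling: every ENEM score lives in [0, 1000].
fixed = scores / 1_000

# Data-driven scaling: min and max are taken from each column of the sample.
col_min = scores.min(axis=0)
col_max = scores.max(axis=0)
minmax = (scores - col_min) / (col_max - col_min)

print(fixed)
print(minmax)
```

With the fixed divisor, a score of 500 always maps to 0.5, whatever sample is used. With min-max scaling the result depends on the extremes observed in the data.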

### Splitting the Dataset into Train, Validation and Test

We use scikit-learn's `train_test_split` function to split the data into train+validation and test, and then split train+validation into train and validation.

We do this to make the hyperparameter search easier when we later use `GridSearchCV`.

In [9]:
# Split Data to Train+Validation and Test
X, X_test, y, y_test = train_test_split(features, target, train_size = 0.85,random_state = 42)

size = X.shape[0]
index = np.arange(size)
np.random.shuffle(index)

train_size = int(size*0.7/0.85)
X_train, X_val = X[index[:train_size], :], X[index[train_size:], :]
y_train, y_val = y[index[:train_size]], y[index[train_size:]]

print(f'Shape do train: X_train - {X_train.shape} / y_train {y_train.shape}')
print(f'Shape do val: X_val - {X_val.shape} / y_val {y_val.shape}')
print(f'Shape do test: X_test - {X_test.shape} / y_test {y_test.shape}')

Shape do train: X_train - (1641375, 4) / y_train (1641375,)
Shape do val: X_val - (351724, 4) / y_val (351724,)
Shape do test: X_test - (351724, 4) / y_test (351724,)
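The sizes printed above follow from a 70/15/15 split. The arithmetic below is a sketch that assumes floor rounding for the 85% part, which reproduces the reported shapes.

```python
import math

n_rows = 2_344_823                        # rows left after dropna()

# Stage 1: 85% for train+validation, the remainder for test.
n_trainval = math.floor(0.85 * n_rows)    # assumption: floor rounding
n_test = n_rows - n_trainval

# Stage 2: 70% of the full data is 0.7 / 0.85 of the train+validation part.
n_train = int(n_trainval * 0.7 / 0.85)
n_val = n_trainval - n_train

print(n_train, n_val, n_test)
```

The factor `0.7 / 0.85` is needed because the second split is taken from the 85% subset, not from the full dataset.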


The `val_fold` array below marks which indices of `X` and `y` belong to training (value -1) and which to validation (value 0).

In [10]:
val_fold = np.zeros(size)
for i in range(train_size):
    val_fold[index[i]] = -1
val_fold

array([-1., -1., -1., ..., -1., -1., -1.])

We check that the number of occurrences matches what we expect.

In [11]:
unique, counts = np.unique(val_fold, return_counts=True)
dict(zip(unique, counts))

{-1.0: 1641375, 0.0: 351724}
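The loop that fills `val_fold` can also be written as a single indexed assignment. The sketch below uses toy sizes and a seeded generator. The seed is an assumption added for reproducibility and is not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)   # hypothetical seed, for reproducibility only
size, train_size = 10, 7          # toy sizes standing in for the real ones
index = rng.permutation(size)     # shuffled positions 0..size-1

# Start with every position marked as validation (0) ...
val_fold = np.zeros(size)
# ... then mark the first `train_size` shuffled positions as training (-1).
val_fold[index[:train_size]] = -1

unique, counts = np.unique(val_fold, return_counts=True)
counts_by_label = {float(u): int(c) for u, c in zip(unique, counts)}
print(counts_by_label)
```

On arrays with millions of entries, the vectorized assignment avoids a Python-level loop over every training index.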

### Learning Algorithms

Throughout the project, we use the following algorithms:

* ElasticNet 

* SGD (Stochastic Gradient Descent)

* Linear SVM (Support Vector Machine)

* KNN (K-Nearest Neighbors)

### Enabling MLflow

Since all the selected algorithms will be run with the `scikit-learn` library, we enable MLflow's `autolog` function.

In [12]:
mlflow.sklearn.autolog()

### Hyperparameter Search for Each Model

The `PredefinedSplit` cross-validator determines which part of `X` and `y` goes to training and which goes to validation. Entries of `val_fold` equal to `-1` go to training, and those equal to `0` go to validation.

In [13]:
ps = PredefinedSplit(val_fold)
ps.get_n_splits()

1

We check that there is only one resulting fold, as expected.

In [14]:
for i, (train_index, test_index) in enumerate(ps.split()):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

Fold 0:
  Train: index=[      0       1       2 ... 1993096 1993097 1993098]
  Test:  index=[     21      23      27 ... 1993084 1993087 1993093]


We verify that the training and validation indices do match the ones we created earlier, in step 2 of the project.

In [15]:
np.array_equal(train_index, np.sort(index[:train_size])), np.array_equal(test_index, np.sort(index[train_size:]))

(True, True)
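The behaviour of `PredefinedSplit` is easier to see on a tiny array. In this sketch the `test_fold` values are hypothetical: entries equal to `-1` are always kept in training, and entries equal to `0` form the single validation fold.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Positions 1 and 4 form validation fold 0; the rest always stay in training.
test_fold = np.array([-1, 0, -1, -1, 0])
ps_toy = PredefinedSplit(test_fold)

n_splits = ps_toy.get_n_splits()
splits = [(train.tolist(), test.tolist()) for train, test in ps_toy.split()]

print(n_splits, splits)
```

Passing such an object as `cv` makes `GridSearchCV` evaluate every candidate on one fixed validation set instead of running k-fold cross-validation.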

In [16]:
del df
gc.collect()

0

In [17]:
gc.collect()

0

During model selection, both across hyperparameters (intra-model) and across learning algorithms (inter-model), we use mean squared error as the metric.

#### Elastic Net

In [18]:
# define model
en = ElasticNet()
# define grid
grid = dict()
grid['alpha'] = [1e-3, 1e-2, 1e-1, 0.0, 1.0, 1e2, 1e3]
grid['l1_ratio'] = np.arange(0, 1, 0.1)
# define search
search = GridSearchCV(en, grid, scoring='neg_mean_squared_error', cv=ps, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('MSE: %.3f' % -results.best_score_)
print('Config: %s' % results.best_params_)

2023/07/14 13:52:00 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'f7d60a1d112345699252797e94304259', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
  estimator.fit(X_train, y_train, **fit_params)
  model = cd_fast.enet_coordinate_descent(
2023/07/14 14:01:12 INFO mlflow.sklearn.utils: Logging the 5 best runs, 65 runs will be omitted.


MSE: 0.028
Config: {'alpha': 0.0, 'l1_ratio': 0.0}


In [19]:
en = ElasticNet(alpha=results.best_params_['alpha'], l1_ratio=results.best_params_['l1_ratio'])
en.fit(X_train,y_train)
y_pred = en.predict(X_test)
mean_squared_error(y_pred, y_test)

2023/07/14 14:01:12 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '856db816a19a430494c5ccdab0964f11', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
  model = cd_fast.enet_coordinate_descent(


0.027966338554892677
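Because the scores were divided by 1,000, the MSE above is expressed in squared normalized units. A quick conversion, sketched below, turns it into a root mean squared error on the original 0 to 1,000 essay scale.

```python
import math

mse_normalized = 0.027966338554892677   # ElasticNet test MSE reported above
scale = 1_000                            # divisor applied to features and target

# RMSE has the same units as the target, so it rescales linearly.
rmse_points = math.sqrt(mse_normalized) * scale

print(round(rmse_points, 1))
```

A typical prediction error of this size, well over a hundred points on the essay scale, is a useful reference when comparing the models that follow.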

#### Stochastic Gradient Descent (SGD)

In [20]:
# define model
sgd = SGDRegressor()
# define grid
grid = dict()
grid['penalty'] = ['l1', 'l2']
grid['alpha'] = [1e3, 1e2, 1e1, 1.0, 1e-1, 1e-2, 1e-3, 1e-5, 1e-6]
# define search
search = GridSearchCV(sgd, grid, scoring='neg_mean_squared_error', cv=ps, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('MSE: %.3f' % -results.best_score_)
print('Config: %s' % results.best_params_)

2023/07/14 14:01:35 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'e335eb1b5599436196bdbdb70a491ecb', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2023/07/14 14:01:48 INFO mlflow.sklearn.utils: Logging the 5 best runs, 13 runs will be omitted.


MSE: 0.028
Config: {'alpha': 1e-06, 'penalty': 'l2'}


In [21]:
sgd = SGDRegressor(alpha=results.best_params_['alpha'], penalty=results.best_params_['penalty'])
sgd.fit(X_train,y_train)
y_pred = sgd.predict(X_test)
mean_squared_error(y_pred, y_test)

2023/07/14 14:01:48 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '58680528e5cd49929e91ecc7e3e9daec', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


0.02796676692448382

#### Linear Support Vector Machine (Linear SVM)

Here we use a linear support vector machine, because a traditional (kernel) SVM would take too much memory and time to compute the kernel matrix and solve its optimization problem.
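A back-of-the-envelope estimate shows the scale of the problem. The sketch assumes a dense float64 kernel (Gram) matrix over all training rows. Real solvers usually cache only part of it, so this is an upper-bound illustration rather than a measurement.

```python
n_train = 1_641_375          # training rows reported in the split above
bytes_per_float64 = 8

# A full kernel matrix has one entry per pair of training samples.
gram_entries = n_train ** 2
gram_bytes = gram_entries * bytes_per_float64
gram_terabytes = gram_bytes / 1e12

print(f"{gram_entries:,} entries, about {gram_terabytes:.2f} TB")
```

The linear model works directly on the four feature columns and avoids this pairwise structure.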

In [22]:
# define model
svr = LinearSVR()
# define grid
grid = dict()
grid['C'] = [1e-2, 1e-1, 1, 1e1, 1e2]
grid['max_iter'] = [10_000]
# define search
search = GridSearchCV(svr, grid, scoring='neg_mean_squared_error', cv=ps, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('MSE: %.3f' % -results.best_score_)
print('Config: %s' % results.best_params_)

2023/07/14 14:07:04 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '95874bdab45e464eb190d598add19ea9', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2023/07/14 15:03:18 INFO mlflow.sklearn.utils: Logging the 5 best runs, no runs will be omitted.


MSE: 0.028
Config: {'C': 100.0, 'max_iter': 10000}


In [23]:
svr = LinearSVR(C=results.best_params_['C'], max_iter=results.best_params_['max_iter'])
svr.fit(X_train,y_train)
y_pred = svr.predict(X_test)
mean_squared_error(y_pred, y_test)

2023/07/14 15:03:18 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '930f793d5c544995ba226a6e6d07f77d', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


0.028545304716330436

#### K-Nearest Neighbors (KNN)

In [18]:
# define model
knn = KNN()
# define grid
grid = dict()
grid['n_neighbors'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
grid['weights'] = ['uniform', 'distance']
grid['p'] = [1, 2]
# define search
search = GridSearchCV(knn, grid, scoring='neg_mean_squared_error', cv=ps, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('MSE: %.3f' % -results.best_score_)
print('Config: %s' % results.best_params_)

2023/07/14 16:50:37 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'b18ce426afbb4054b41dccc2233094a8', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2023/07/14 16:55:31 INFO mlflow.sklearn.utils: Logging the 5 best runs, 35 runs will be omitted.


MSE: 0.031
Config: {'n_neighbors': 10, 'p': 1, 'weights': 'uniform'}


In [19]:
knn = KNN(n_neighbors=results.best_params_['n_neighbors'], weights=results.best_params_['weights'], p=results.best_params_['p'])
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)
mean_squared_error(y_pred, y_test)

2023/07/14 16:55:31 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '05eae19f7a584e439e3c70587895779f', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


0.03038909802003844

### Improving the Best Model

Looking at the previous results, ElasticNet beats the other models. It nearly ties with the gradient-based approach (SGD), but comes out slightly ahead.
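The comparison can be made explicit by collecting the test MSE values reported in the sections above and sorting them. The dictionary keys are just labels chosen for this sketch.

```python
# Test-set MSE values copied from the outputs of the four models above.
test_mse = {
    "ElasticNet": 0.027966338554892677,
    "SGDRegressor": 0.02796676692448382,
    "LinearSVR": 0.028545304716330436,
    "KNeighborsRegressor": 0.03038909802003844,
}

# Lower MSE is better, so sort in ascending order.
ranking = sorted(test_mse.items(), key=lambda item: item[1])
best_name, best_mse = ranking[0]
gap_to_second = ranking[1][1] - best_mse

for name, mse in ranking:
    print(f"{name:<22} {mse:.6f}")
print(f"gap between first and second: {gap_to_second:.2e}")
```

The gap between the two linear models is well below the precision printed by the grid search (three decimal places), which is why they look tied there.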

We now try to improve its performance further, using a finer hyperparameter search combined with PCA.

First, we try a different transformation of the data, this time based on its mean and variance (rather than dividing directly by 1_000, as before).

In [20]:
scaler = StandardScaler()
dataset = data.to_numpy()
dataset_scaled = scaler.fit_transform(dataset)

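# NOTE: the slices below are taken from `dataset`, not `dataset_scaled`,
# so the StandardScaler output is never used in the cells that follow.
features_scaled = dataset[:, :-1]
target_scaled = dataset[:, -1]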

features_scaled.shape, target_scaled.shape

((2344823, 4), (2344823,))

In [21]:
# Split Data to Train+Validation and Test
X_scaled, X_test_scaled, y_scaled, y_test_scaled = train_test_split(features_scaled, target_scaled, train_size = 0.85,random_state = 42)

np.array_equal(X, X_scaled), np.array_equal(X_test, X_test_scaled), np.array_equal(y, y_scaled), np.array_equal(y_test, y_test_scaled)

(True, True, True, True)

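All four comparisons return `True`. Note, however, that `features_scaled` and `target_scaled` are sliced from `dataset` rather than from `dataset_scaled`, so this check compares the already-normalized data with itself and does not actually evaluate the `StandardScaler` transformation. We therefore keep the original train/validation/test datasets and move on to another approach: using PCA to try to engineer better features.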

In [22]:
# define grid
grid = dict()
grid['alpha'] = [1e-3, 1e-2, 1e-1, 0.0, 1.0, 1e2, 1e3]
grid['l1_ratio'] = np.arange(0, 1, 1e-1)
grid['selection'] = ['cyclic', 'random']

best_mse = float('inf')
best_n = 0

# iterate over number of PCA components
for n_components in range(1, 5):
    pca = PCA(n_components=n_components)
    features_pca = pca.fit_transform(features)
    
    # Split Data to Train+Validation and Test
    X_pca, X_test_pca, y_pca, y_test_pca = train_test_split(features_pca, target, train_size = 0.85,random_state = 42)

    X_train_pca, X_val_pca = X_pca[index[:train_size], :], X_pca[index[train_size:], :]
    y_train_pca, y_val_pca = y_pca[index[:train_size]], y_pca[index[train_size:]]

    print(f'Shape do train: X_train_pca - {X_train_pca.shape} / y_train_pca {y_train_pca.shape}')
    print(f'Shape do val: X_val_pca - {X_val_pca.shape} / y_val_pca {y_val_pca.shape}')
    print(f'Shape do test: X_test_pca - {X_test_pca.shape} / y_test_pca {y_test_pca.shape}')

    # define model
    en = ElasticNet()
    # define search
    search = GridSearchCV(en, grid, scoring='neg_mean_squared_error', cv=ps, n_jobs=-1)
    # perform the search
    results = search.fit(X_pca, y_pca)
    # summarize
    print(f'Number of components: {n_components}')
    print(f'                         MSE: {-results.best_score_}')
    print(f'                         Config: {results.best_params_}')

    if best_mse > -results.best_score_:
        best_n = n_components
        best_mse = -results.best_score_

2023/07/14 16:59:05 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'd17db74cee6747ba8a11e3eac9c4b66e', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2023/07/14 16:59:08 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '882049e1696d4033b17361c33d8e46f4', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Shape do train: X_train_pca - (1641375, 1) / y_train_pca (1641375,)
Shape do val: X_val_pca - (351724, 1) / y_val_pca (351724,)
Shape do test: X_test_pca - (351724, 1) / y_test_pca (351724,)


  estimator.fit(X_train, y_train, **fit_params)
  model = cd_fast.enet_coordinate_descent(
2023/07/14 17:04:37 INFO mlflow.sklearn.utils: Logging the 5 best runs, 135 runs will be omitted.
2023/07/14 17:04:37 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '01912d1e0d0c451fbdfc16e942183511', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Number of components: 1
                         MSE: 0.02862628619288026
                         Config: {'alpha': 0.0, 'l1_ratio': 0.0, 'selection': 'cyclic'}


2023/07/14 17:04:41 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'ecb915558988460ea5099c14db545cde', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Shape do train: X_train_pca - (1641375, 2) / y_train_pca (1641375,)
Shape do val: X_val_pca - (351724, 2) / y_val_pca (351724,)
Shape do test: X_test_pca - (351724, 2) / y_test_pca (351724,)


  estimator.fit(X_train, y_train, **fit_params)
  model = cd_fast.enet_coordinate_descent(
2023/07/14 17:14:00 INFO mlflow.sklearn.utils: Logging the 5 best runs, 135 runs will be omitted.
2023/07/14 17:14:00 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'c967f6e631954f60a514b8801dc24efd', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Number of components: 2
                         MSE: 0.0282497759155457
                         Config: {'alpha': 0.0, 'l1_ratio': 0.0, 'selection': 'cyclic'}


2023/07/14 17:14:04 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '3d572307c01349a09a7e88871e4aefea', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Shape do train: X_train_pca - (1641375, 3) / y_train_pca (1641375,)
Shape do val: X_val_pca - (351724, 3) / y_val_pca (351724,)
Shape do test: X_test_pca - (351724, 3) / y_test_pca (351724,)


  estimator.fit(X_train, y_train, **fit_params)
  model = cd_fast.enet_coordinate_descent(
2023/07/14 17:27:20 INFO mlflow.sklearn.utils: Logging the 5 best runs, 135 runs will be omitted.
2023/07/14 17:27:20 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '5e6c8921f14c4558b0fcb43738c20381', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Number of components: 3
                         MSE: 0.02822816864880631
                         Config: {'alpha': 0.0, 'l1_ratio': 0.0, 'selection': 'cyclic'}


2023/07/14 17:27:22 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'd15848c5e56f45bbbd0de6378b2f1e86', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Shape do train: X_train_pca - (1641375, 4) / y_train_pca (1641375,)
Shape do val: X_val_pca - (351724, 4) / y_val_pca (351724,)
Shape do test: X_test_pca - (351724, 4) / y_test_pca (351724,)


  estimator.fit(X_train, y_train, **fit_params)
  model = cd_fast.enet_coordinate_descent(
2023/07/14 17:44:51 INFO mlflow.sklearn.utils: Logging the 5 best runs, 135 runs will be omitted.


Number of components: 4
                         MSE: 0.028201761201563695
                         Config: {'alpha': 0.0, 'l1_ratio': 0.0, 'selection': 'cyclic'}


In [23]:
best_n

4

In [26]:
-results.best_score_

0.028278019118756353

In [24]:
results.best_params_

{'alpha': 0.0, 'l1_ratio': 0.0, 'selection': 'cyclic'}
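The selected configuration has `alpha = 0.0`, which switches off both the L1 and the L2 penalty. According to the scikit-learn documentation, this is equivalent to ordinary least squares, and `LinearRegression` is the recommended estimator for that case. The sketch below illustrates the equivalence on synthetic data (an assumption for illustration only, not the ENEM features); the names `max_coef_diff` and `intercept_diff` are hypothetical helpers.

```python
import warnings

import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

# Synthetic stand-in for the 4 PCA components (assumption, not the ENEM data)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = X @ np.array([0.4, -0.3, 0.2, 0.1]) + 0.5 + rng.normal(scale=0.1, size=2000)

# ElasticNet with alpha=0 warns about the missing regularization;
# the warnings are silenced here only to keep the output short.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    en = ElasticNet(alpha=0.0, l1_ratio=0.0, selection="cyclic").fit(X, y)

# Ordinary least squares on the same data
ols = LinearRegression().fit(X, y)

# Largest absolute difference between the two sets of fitted parameters
max_coef_diff = float(np.max(np.abs(en.coef_ - ols.coef_)))
intercept_diff = abs(float(en.intercept_) - float(ols.intercept_))
print(max_coef_diff, intercept_diff)
```

If both differences are negligible, refitting the final model with `LinearRegression` would be a reasonable way to avoid the coordinate-descent warnings, assuming the search keeps selecting `alpha = 0.0`.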

In [25]:
# Refit PCA with the best number of components found in the search
pca = PCA(n_components=best_n)
features_final = pca.fit_transform(features)

# Split data into train+validation (85%) and test (15%)
X_final, X_test_final, y_final, y_test_final = train_test_split(
    features_final, target, train_size=0.85, random_state=42
)

# Reuse the train/validation indices (`index`, `train_size`) defined earlier
X_train_final, X_val_final = X_final[index[:train_size], :], X_final[index[train_size:], :]
y_train_final, y_val_final = y_final[index[:train_size]], y_final[index[train_size:]]

print(f'Shape do train: X_train_final - {X_train_final.shape} / y_train_final {y_train_final.shape}')
print(f'Shape do val: X_val_final - {X_val_final.shape} / y_val_final {y_val_final.shape}')
print(f'Shape do test: X_test_final - {X_test_final.shape} / y_test_final {y_test_final.shape}')

# Fit ElasticNet with the best hyperparameters and evaluate on the test set
en = ElasticNet(**results.best_params_)
en.fit(X_train_final, y_train_final)
y_pred_final = en.predict(X_test_final)
mean_squared_error(y_test_final, y_pred_final)  # signature is (y_true, y_pred)

2023/07/13 12:44:13 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '7885fa31517f4cb6b328c96909549ef0', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2023/07/13 12:44:18 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'a9664ae5127a43d290502075bc8bec2a', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Shape do train: X_train_final - (1641375, 4) / y_train_final (1641375,)
Shape do val: X_val_final - (351724, 4) / y_val_final (351724,)
Shape do test: X_test_final - (351724, 4) / y_test_final (351724,)




0.027966480903399833

In [27]:
# MSE on the training set, for comparison with the test MSE above
y_pred_train = en.predict(X_train_final)
mean_squared_error(y_train_final, y_pred_train)  # signature is (y_true, y_pred)

0.02814950963957138

With PCA added, we observe a small improvement after hyperparameter tuning, but it is already becoming clear that this is the performance limit of the model. One indication is the model's MSE: the validation MSE (about 0.0283) is very close to the MSE on the training set (about 0.0281) and on the test set (about 0.0280). This suggests that the model generalizes well, since we see no sign of over-fitting.
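The comparison described above can be sketched as a single check that reports the MSE on each split and the gap between training and test error. This is a minimal illustration on synthetic data (an assumption, not the ENEM dataset), using the same 70/15/15 proportions as the shapes printed earlier; `LinearRegression` stands in for the tuned model, and the names `mse` and `gap` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PCA features and the target (assumption)
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.3]) + rng.normal(scale=0.1, size=5000)

# 85% train+validation / 15% test, then 70% / 15% of the total
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, train_size=0.85, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# MSE per split; mean_squared_error expects (y_true, y_pred)
splits = {
    "train": (X_train, y_train),
    "val": (X_val, y_val),
    "test": (X_test, y_test),
}
mse = {name: mean_squared_error(y_s, model.predict(X_s))
       for name, (X_s, y_s) in splits.items()}

# A test error far above the training error would point to over-fitting
gap = mse["test"] - mse["train"]
print(mse, gap)
```

A small gap alone does not rule out under-fitting: that would show up as a high error on all three splits at once, so the absolute MSE values are worth reading alongside the gap.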