# Class #867 - Machine Learning II Project

## Working group:
- Adriana Roberta Miceli de Souza <br/>
- Debora Kassem Buturi <br/>
- Helen Cristina de Acypreste Rocha <br/> 
- Marcus Fontes <br/>
- Richard Raphael Banak <br/>


 

## Deadline

10 Oct 2022

## Submission

Send a single email listing the names of the project participants to:

- rychard.guedes@ada.tech

## Context
PyCoders Ltda., increasingly specialized in Artificial Intelligence and Data Science, was approached by a fintech to develop a credit-granting project for real estate. The project is expected to produce a score that separates good payers from bad payers as sharply as possible. To that end, a database was provided with thousands of past loan cases and a range of customer characteristics. One model must be delivered. For contractual reasons, payment will be based on the model's performance (ROC AUC) over time.

## Highlights

- We will keep using the datasets from the previous module; all progress so far is preserved.
- For now, we will continue with only `application_train.csv` and `application_test_student.csv`.
- Let's start looking at performance! Remember to always look at **ROC AUC**.

## Database
Databases with registration information, credit history, and financial balances of various customers will be used. The dataset is split into training and test sets, all in CSV format. All modeling, validation, and evaluation must be done on the training set, subdividing it as the squad sees fit. There is also a description of the explanatory variables to help with the development of the project. Several joins will be needed, and you are free to use the data however you find most convenient.

[Download](https://drive.google.com/file/d/17fyteuN2MdGdbP5_Xq_sySN_yH91vTup/view?usp=sharing)

## Deliverables

- Two notebooks: (i) with the investigation and comparisons performed; (ii) with the clean flow of the chosen model.
- Create a pipeline for the model.
- Carry out hyperparameter optimization.
- Use at least one categorical variable and at least one numeric variable.
- To ensure robustness, add a missing-value imputation layer for all features. Consider which strategy is best for each explanatory variable.
- Do not limit yourselves to only 5 variables anymore; let's broaden our horizons and look for other features that may be useful. Note that, depending on the model, simply including all of them does not help, because that can cause overfitting, underfitting, or computational-efficiency problems, so some form of feature selection will be necessary.
- Only one model will be delivered. Base the choice on performance, but take computational cost into account. Describe the decision process and argue in favor of the model; this must appear inside the clean flow, at the beginning of the chosen-model notebook.
- To score the prediction, use the probability of the event (without binarizing).
- **The goal is to reach at least 0.70 AUC. If you cannot, that is not a problem, but talk to Rychard/Bruno in advance.**
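The requirement to score with the event probability rather than a binarized label can be illustrated with a small sketch. The labels and probabilities below are made up for illustration only; they are not taken from the project data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and hypothetical predicted probabilities of the event (class 1).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.30, 0.40, 0.45, 0.80])

# AUC computed on the raw probabilities uses the full ranking of the cases.
auc_prob = roc_auc_score(y_true, y_prob)

# Binarizing at 0.5 collapses the ranking into two groups and creates ties.
y_bin = (y_prob >= 0.5).astype(int)
auc_bin = roc_auc_score(y_true, y_bin)

print(auc_prob, auc_bin)
```

In this toy case the binarized score loses ranking information, so its AUC is lower than the AUC of the probabilities.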

## Initial team split:

Adri   - KNN <br/>
Banak  - LightGBM <br/>
Debora - Decision tree <br/>
Helen  - XGBoost <br/>
Marcus - AdaBoost <br/>

# Imports

In [75]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.metrics import roc_auc_score
from mlxtend.plotting import plot_confusion_matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn import svm
from sklearn.compose import make_column_transformer

In [76]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 500)

### 0. Reading the data

In [77]:
caminho = './projeto_ml2/'
arquivo_principal = 'application_train.csv'
arquivo_oculto = 'application_test_student.csv'
arquivo_metadados = 'HomeCredit_columns_description.csv'

0.1 Inputs

In [78]:
df = pd.read_csv(f'{caminho}/{arquivo_principal}')

df_oculto = pd.read_csv(f'{caminho}/{arquivo_oculto}')

df_metadados = pd.read_csv(f'{caminho}/{arquivo_metadados}', encoding = 'Windows-1252')

In [79]:
df.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,456162,0,Cash loans,F,N,N,0,112500.0,700830.0,22738.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,134978,0,Cash loans,F,N,N,0,90000.0,375322.5,14422.5,...,0,0,0,0,0.0,0.0,0.0,1.0,0.0,3.0
2,318952,0,Cash loans,M,Y,N,0,180000.0,544491.0,16047.0,...,0,0,0,0,0.0,0.0,0.0,1.0,1.0,3.0
3,361264,0,Cash loans,F,N,Y,0,270000.0,814041.0,28971.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
4,260639,0,Cash loans,F,N,Y,0,144000.0,675000.0,21906.0,...,0,0,0,0,0.0,0.0,0.0,10.0,0.0,0.0


- Evaluating the response variable:



In [81]:
df['TARGET'].value_counts() , df['TARGET'].value_counts(normalize=True) # Imbalanced data (default rate close to 8%)

(0    226038
 1     19970
 Name: TARGET, dtype: int64,
 0    0.918824
 1    0.081176
 Name: TARGET, dtype: float64)

- Inspecting the first rows

In [82]:
df_metadados.head()

Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
0,1,application_{train|test}.csv,SK_ID_CURR,ID of loan in our sample,
1,2,application_{train|test}.csv,TARGET,"Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases)",
2,5,application_{train|test}.csv,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving,
3,6,application_{train|test}.csv,CODE_GENDER,Gender of the client,
4,7,application_{train|test}.csv,FLAG_OWN_CAR,Flag if the client owns a car,


In [83]:
df_metadados_ = df_metadados[df_metadados['Table'] =='application_{train|test}.csv']
df_metadados_[["Row", "Description"]]

Unnamed: 0,Row,Description
0,SK_ID_CURR,ID of loan in our sample
1,TARGET,"Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases)"
2,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving
3,CODE_GENDER,Gender of the client
4,FLAG_OWN_CAR,Flag if the client owns a car
5,FLAG_OWN_REALTY,Flag if client owns a house or flat
6,CNT_CHILDREN,Number of children the client has
7,AMT_INCOME_TOTAL,Income of the client
8,AMT_CREDIT,Credit amount of the loan
9,AMT_ANNUITY,Loan annuity


### 1. Initial understanding of the application_train dataset

In [84]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246008 entries, 0 to 246007
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 229.0+ MB


In [85]:
df.describe()

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
count,246008.0,246008.0,246008.0,246008.0,246008.0,245998.0,245782.0,246008.0,246008.0,246008.0,...,246008.0,246008.0,246008.0,246008.0,212836.0,212836.0,212836.0,212836.0,212836.0,212836.0
mean,278280.072908,0.081176,0.415527,168912.2,599628.3,27129.162648,538928.9,0.020882,-16042.794393,63963.755699,...,0.007975,0.000589,0.000508,0.000289,0.006291,0.006944,0.034487,0.267403,0.264109,1.90004
std,102790.909988,0.273106,0.719922,260381.8,403067.2,14504.965232,369973.8,0.013852,4365.973763,141400.318322,...,0.088948,0.024271,0.022536,0.016986,0.083236,0.109538,0.204179,0.91664,0.611269,1.868217
min,100002.0,0.0,0.0,25650.0,45000.0,1615.5,40500.0,0.00029,-25229.0,-17912.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,189165.5,0.0,0.0,112500.0,270000.0,16561.125,238500.0,0.010006,-19691.0,-2758.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,278392.5,0.0,0.0,148500.0,514777.5,24930.0,450000.0,0.01885,-15763.0,-1215.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,367272.25,0.0,1.0,202500.0,808650.0,34599.375,679500.0,0.028663,-12418.0,-289.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
max,456255.0,1.0,19.0,117000000.0,4050000.0,258025.5,4050000.0,0.072508,-7489.0,365243.0,...,1.0,1.0,1.0,1.0,4.0,9.0,8.0,27.0,19.0,23.0


### 2. EDA - Initial assessment

2.0- General modifications to the original dataset

In [104]:
print(df['CODE_GENDER'].value_counts())
df = df[df['CODE_GENDER'] != 'XNA']

F    161867
M     84138
Name: CODE_GENDER, dtype: int64


2.1- Percentage of missing values per field


In [135]:
# percentage of missing values over a list of fields, in order to exclude them:

#df.iloc[:,0].name
df.iloc[:,0].dtype


dtype('int64')
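The cell above only inspects a dtype. As a minimal sketch of the missing-percentage check this subsection describes, the hypothetical helper `missing_report` below computes the share of missing values per column and lists the columns above a threshold. The 20% default is an assumption taken from the comment on `var_missing` in section 3.1, and the toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame, threshold: float = 0.20):
    """Return the missing share per column and the columns above the threshold."""
    # isnull().mean() gives the fraction of missing values in each column.
    share = frame.isnull().mean().sort_values(ascending=False)
    to_drop = share[share > threshold].index.tolist()
    return share, to_drop

# Toy frame standing in for df; values are illustrative only.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0, 5.0],  # 40% missing
    "b": [1.0, 2.0, 3.0, 4.0, np.nan],     # 20% missing
    "c": ["x", "y", "z", "x", "y"],        # 0% missing
})

share, to_drop = missing_report(toy)
print(share)
print(to_drop)
```

On the real data the same call would be `missing_report(df)`, and the returned list could feed `var_missing`.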

### 3. Train x Test Split

In [87]:
df_treino, df_teste = train_test_split(df, test_size = 0.3, random_state = 10)
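Because the target is imbalanced (about 8% defaults), one option worth considering is a stratified split, which keeps the default rate equal in both partitions. The sketch below uses synthetic data with an assumed 8% positive rate; it is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 8% positives.
rng = np.random.default_rng(10)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:80] = 1

# stratify=y preserves the class proportions in the train and test partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=10, stratify=y
)

print(y_tr.mean(), y_te.mean())
```

With the notebook's DataFrame, the equivalent call would pass `stratify=df['TARGET']` to `train_test_split`.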

* 3.1- Defining variables

In [96]:
var_num = []
var_string = []
var_missing = [] # Discarded because more than 20% of the values are missing
var_exp = var_num + var_string

# NOTE: 'DAYS_EMPLOYED' is listed twice, so that column appears twice in x_treino and x_teste
var_expl = ['AMT_INCOME_TOTAL', 'CNT_CHILDREN', 'DAYS_EMPLOYED', 'REGION_RATING_CLIENT','DAYS_EMPLOYED']
var_resp = 'TARGET'

In [97]:
x_treino = df_treino[var_expl].copy()
x_teste = df_teste[var_expl].copy()
y_treino = df_treino[var_resp].copy()
y_teste = df_teste[var_resp].copy()

### 4. Preprocessing

* 4.0- Checking for missing values

In [98]:
x_treino.isnull().sum()

AMT_INCOME_TOTAL        0
CNT_CHILDREN            0
DAYS_EMPLOYED           0
REGION_RATING_CLIENT    0
DAYS_EMPLOYED           0
dtype: int64

- 4.1- Ordinal Encoder

In [99]:
lista_ordenada = [
    'Lower secondary',
    'Secondary / secondary special', 
    'Incomplete higher',
    'Higher education', 
    'Academic degree', 
]

oe = OrdinalEncoder(categories = [lista_ordenada])
#oe.fit(x_treino[['NAME_EDUCATION_TYPE']])
#x_treino[['NAME_EDUCATION_TYPE']] = oe.transform(x_treino[['NAME_EDUCATION_TYPE']])

- 4.2- One Hot Encoder

In [93]:
#ohe = OneHotEncoder(drop='first').fit(x_treino[[]])
#ohe.transform(x_treino[[]])

#ohe_binarias = OneHotEncoder(drop='if_binary').fit(x_treino[[]])
#ohe_binarias.transform([[]]).toarray()

- 4.3- Missing-value imputation

In [100]:
imputer_num = SimpleImputer(missing_values = np.nan , strategy = 'mean')
imputer_str = SimpleImputer(missing_values = np.nan , strategy = 'most_frequent')
#x_treino[['','']] = imputer_num.fit_transform(x_treino[['','']])
#x_treino[['','']] = imputer_str.fit_transform(x_treino[['','']])

### 5. Models

In [101]:
modelo = AdaBoostClassifier(random_state = 1)
modelo.fit(x_treino, y_treino)

In [102]:
y_pred_treino = modelo.predict_proba(x_treino)[:, 1]
y_pred_teste = modelo.predict_proba(x_teste)[:, 1]

In [103]:
roc_auc_score(y_treino, y_pred_treino) , roc_auc_score(y_teste, y_pred_teste)

(0.6181645684601258, 0.6077913573318245)

### 6. Pipelines
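This section is still empty in the notebook. As a minimal sketch of what a pipeline for the deliverables could look like, the block below chains imputation, one-hot encoding, and an AdaBoost model with `ColumnTransformer` and `Pipeline`. The column choice and the synthetic values are assumptions for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed feature lists: two numeric and two categorical columns.
num_cols = ["AMT_INCOME_TOTAL", "AMT_ANNUITY"]
cat_cols = ["CODE_GENDER", "NAME_CONTRACT_TYPE"]

preprocess = ColumnTransformer([
    # Numeric columns: fill missing values with the column mean.
    ("num", SimpleImputer(strategy="mean"), num_cols),
    # Categorical columns: fill with the most frequent value, then one-hot encode.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("model", AdaBoostClassifier(random_state=1)),
])

# Synthetic data standing in for x_treino / y_treino.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": rng.normal(150000, 50000, n),
    "AMT_ANNUITY": rng.normal(25000, 8000, n),
    "CODE_GENDER": pd.Series(rng.choice(["F", "M"], n), dtype=object),
    "NAME_CONTRACT_TYPE": pd.Series(
        rng.choice(["Cash loans", "Revolving loans"], n), dtype=object
    ),
})
toy.loc[::10, "AMT_ANNUITY"] = np.nan   # inject numeric missing values
toy.loc[::15, "CODE_GENDER"] = np.nan   # inject categorical missing values
y = rng.integers(0, 2, n)

pipe.fit(toy, y)
proba = pipe.predict_proba(toy)[:, 1]
print(proba[:5])
```

Since the imputers and the encoder are fitted inside the pipeline, passing `pipe` as the `estimator` of `GridSearchCV` would refit them on each training fold, which avoids leaking validation data into the preprocessing.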

### 7. Hyperparameter optimization

7.1 - AdaBoost

In [None]:
%%time

parametros = {
    'n_estimators': [5, 25, 100, 250,500],
    'learning_rate': [0.001,0.01,0.1],
}

modelo = AdaBoostClassifier(
    random_state = 1    
)

gscv = GridSearchCV(
    estimator = modelo,
    param_grid = parametros,
    scoring = 'roc_auc',
    refit = True,
    cv = 3
)

gscv.fit(x_treino, y_treino)

y_pred_treino = gscv.predict_proba(x_treino)[:, 1]
y_pred_teste = gscv.predict_proba(x_teste)[:, 1]

roc_auc_score(y_treino, y_pred_treino) , roc_auc_score(y_teste, y_pred_teste)

7.1 - Decision tree (simple)

In [None]:
%%time

# Assumed starting grid for the tree; adjust as needed.
parametros = {
    'max_depth': [3, 5, 10, None],
    'min_samples_leaf': [1, 50, 500],
}

modelo = DecisionTreeClassifier(
    random_state = 1    
)

gscv = GridSearchCV(
    estimator = modelo,
    param_grid = parametros,
    scoring = 'roc_auc',
    refit = True,
    cv = 3
)

gscv.fit(x_treino, y_treino)

y_pred_treino = gscv.predict_proba(x_treino)[:, 1]
y_pred_teste = gscv.predict_proba(x_teste)[:, 1]

roc_auc_score(y_treino, y_pred_treino) , roc_auc_score(y_teste, y_pred_teste)

7.2 - Random Forest

In [None]:
%%time
from sklearn.ensemble import RandomForestClassifier

# Assumed starting grid for the forest; adjust as needed.
parametros = {
    'n_estimators': [100, 250],
    'max_depth': [5, 10, None],
}

modelo = RandomForestClassifier(
    random_state = 1    
)

gscv = GridSearchCV(
    estimator = modelo,
    param_grid = parametros,
    scoring = 'roc_auc',
    refit = True,
    cv = 3
)

gscv.fit(x_treino, y_treino)

y_pred_treino = gscv.predict_proba(x_treino)[:, 1]
y_pred_teste = gscv.predict_proba(x_teste)[:, 1]

roc_auc_score(y_treino, y_pred_treino) , roc_auc_score(y_teste, y_pred_teste)

7.3 - XGBoost

In [None]:
%%time
from xgboost import XGBClassifier # requires the xgboost package

parametros = {
    'n_estimators': [5, 25, 100, 250,500],
    'learning_rate': [0.001,0.01,0.1],
}

modelo = XGBClassifier(
    random_state = 1    
)

gscv = GridSearchCV(
    estimator = modelo,
    param_grid = parametros,
    scoring = 'roc_auc',
    refit = True,
    cv = 3
)

gscv.fit(x_treino, y_treino)

y_pred_treino = gscv.predict_proba(x_treino)[:, 1]
y_pred_teste = gscv.predict_proba(x_teste)[:, 1]

roc_auc_score(y_treino, y_pred_treino) , roc_auc_score(y_teste, y_pred_teste)

7.4 - LightGBM

In [None]:
%%time
from lightgbm import LGBMClassifier # requires the lightgbm package

parametros = {
    'n_estimators': [5, 25, 100, 250,500],
    'learning_rate': [0.001,0.01,0.1],
}

modelo = LGBMClassifier(
    random_state = 1    
)

gscv = GridSearchCV(
    estimator = modelo,
    param_grid = parametros,
    scoring = 'roc_auc',
    refit = True,
    cv = 3
)

gscv.fit(x_treino, y_treino)

y_pred_treino = gscv.predict_proba(x_treino)[:, 1]
y_pred_teste = gscv.predict_proba(x_teste)[:, 1]

roc_auc_score(y_treino, y_pred_treino) , roc_auc_score(y_teste, y_pred_teste)

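7.5 - SVM

In [None]:
%%time

# Assumed starting grid. SVC training scales poorly with the number of rows,
# so this search may be very slow on the full training set.
parametros = {
    'C': [0.1, 1, 10],
}

modelo = svm.SVC(
    probability = True, # needed for predict_proba
    random_state = 1    
)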

gscv = GridSearchCV(
    estimator = modelo,
    param_grid = parametros,
    scoring = 'roc_auc',
    refit = True,
    cv = 3
)

gscv.fit(x_treino, y_treino)

y_pred_treino = gscv.predict_proba(x_treino)[:, 1]
y_pred_teste = gscv.predict_proba(x_teste)[:, 1]

roc_auc_score(y_treino, y_pred_treino) , roc_auc_score(y_teste, y_pred_teste)

7.6 - KNN

In [None]:
%%time

# Assumed starting grid for KNN; adjust as needed.
parametros = {
    'n_neighbors': [5, 25, 100],
    'weights': ['uniform', 'distance'],
}

modelo = KNeighborsClassifier()

gscv = GridSearchCV(
    estimator = modelo,
    param_grid = parametros,
    scoring = 'roc_auc',
    refit = True,
    cv = 3
)

gscv.fit(x_treino, y_treino)

y_pred_treino = gscv.predict_proba(x_treino)[:, 1]
y_pred_teste = gscv.predict_proba(x_teste)[:, 1]

roc_auc_score(y_treino, y_pred_treino) , roc_auc_score(y_teste, y_pred_teste)

7.7 - Naive Bayes

In [None]:
%%time
from sklearn.naive_bayes import GaussianNB

# Assumed starting grid for Gaussian Naive Bayes; adjust as needed.
parametros = {
    'var_smoothing': [1e-9, 1e-7, 1e-5],
}

modelo = GaussianNB()

gscv = GridSearchCV(
    estimator = modelo,
    param_grid = parametros,
    scoring = 'roc_auc',
    refit = True,
    cv = 3
)

gscv.fit(x_treino, y_treino)

y_pred_treino = gscv.predict_proba(x_treino)[:, 1]
y_pred_teste = gscv.predict_proba(x_teste)[:, 1]

roc_auc_score(y_treino, y_pred_treino) , roc_auc_score(y_teste, y_pred_teste)

7.7 - Logistic Regression

In [None]:
%%time

# Assumed starting grid for the regularization strength; adjust as needed.
parametros = {
    'C': [0.01, 0.1, 1, 10],
}

modelo = LogisticRegression(
    max_iter = 1000,
    random_state = 1    
)

gscv = GridSearchCV(
    estimator = modelo,
    param_grid = parametros,
    scoring = 'roc_auc',
    refit = True,
    cv = 3
)

gscv.fit(x_treino, y_treino)

y_pred_treino = gscv.predict_proba(x_treino)[:, 1]
y_pred_teste = gscv.predict_proba(x_teste)[:, 1]

roc_auc_score(y_treino, y_pred_treino) , roc_auc_score(y_teste, y_pred_teste)

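7.8 - Bagging Tree

In [None]:
%%time
from sklearn.ensemble import BaggingClassifier

# Assumed starting grid; the default base estimator is a decision tree.
parametros = {
    'n_estimators': [10, 50, 100],
    'max_samples': [0.5, 1.0],
}

modelo = BaggingClassifier(
    random_state = 1    
)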

gscv = GridSearchCV(
    estimator = modelo,
    param_grid = parametros,
    scoring = 'roc_auc',
    refit = True,
    cv = 3
)

gscv.fit(x_treino, y_treino)

y_pred_treino = gscv.predict_proba(x_treino)[:, 1]
y_pred_teste = gscv.predict_proba(x_teste)[:, 1]

roc_auc_score(y_treino, y_pred_treino) , roc_auc_score(y_teste, y_pred_teste)

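7.9 - MLP

In [None]:
%%time
from sklearn.neural_network import MLPClassifier

# Assumed starting grid for the network; adjust as needed.
parametros = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50)],
    'alpha': [0.0001, 0.001, 0.01],
}

modelo = MLPClassifier(
    random_state = 1    
)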

gscv = GridSearchCV(
    estimator = modelo,
    param_grid = parametros,
    scoring = 'roc_auc',
    refit = True,
    cv = 3
)

gscv.fit(x_treino, y_treino)

y_pred_treino = gscv.predict_proba(x_treino)[:, 1]
y_pred_teste = gscv.predict_proba(x_teste)[:, 1]

roc_auc_score(y_treino, y_pred_treino) , roc_auc_score(y_teste, y_pred_teste)

7.10 - LQV

In [None]:
%%time

parametros = {
    'n_estimators': [5, 25, 100, 250,500],
    'learning_rate': [0.001,0.01,0.1],
}

modelo = AdaBoostClassifier(
    random_state = 1    
)

gscv = GridSearchCV(
    estimator = modelo,
    param_grid = parametros,
    scoring = 'roc_auc',
    refit = True,
    cv = 3
)

gscv.fit(x_treino, y_treino)

y_pred_treino = gscv.predict_proba(x_treino)[:, 1]
y_pred_teste = gscv.predict_proba(x_teste)[:, 1]

roc_auc_score(y_treino, y_pred_treino) , roc_auc_score(y_teste, y_pred_teste)

### 8. Choosing the best model

8.1- Analysis of results

In [None]:
# Chart or table with a comparative view of the best models in ROC AUC

In [None]:
# Confusion matrix (report) [TRAIN]

In [None]:
# ROC curve and ROC AUC [TRAIN]

In [None]:
# Confusion matrix (report) [TEST]

In [None]:
# ROC curve and ROC AUC [TEST]
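The placeholder cells above could be filled in along the following lines. This is a sketch on synthetic data, with a hypothetical helper `compare_auc` and two arbitrarily chosen models; it does not report results for this project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def compare_auc(models, X_tr, y_tr, X_te, y_te):
    """Fit each model and return a table of train/test ROC AUC, best test AUC first."""
    rows = []
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        rows.append({
            "model": name,
            "auc_train": roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]),
            "auc_test": roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
        })
    table = pd.DataFrame(rows).sort_values("auc_test", ascending=False)
    return table.reset_index(drop=True)

# Synthetic, imbalanced data standing in for the real train/test partitions.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.92, 0.08], random_state=10
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=10, stratify=y
)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "DecisionTree": DecisionTreeClassifier(max_depth=3, random_state=1),
}
table = compare_auc(models, X_tr, y_tr, X_te, y_te)
print(table)

# ROC curve points and confusion matrix on the test partition for the top model.
best = models[table.loc[0, "model"]]
proba = best.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, proba)
cm = confusion_matrix(y_te, (proba >= 0.5).astype(int))
print(cm)
```

The `fpr` and `tpr` arrays can be passed to `plt.plot` to draw the curve. The 0.5 cutoff for the confusion matrix is only an example, since the delivered prediction itself stays as a probability.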

### 9. Prediction and export

In [None]:
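# `modelo` must be the fitted final model (e.g. gscv.best_estimator_ from the chosen search)
x_oculto = df_oculto[var_expl].copy() # same explanatory variables used in training
y_pred_oculto = modelo.predict_proba(x_oculto)[:, 1]
df_oculto['Y_PRED'] = y_pred_oculto
#df_oculto[['SK_ID_CURR', 'Y_PRED']].head()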
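The export step itself is not shown in the notebook. A minimal sketch, assuming the deliverable is a CSV with the loan ID and the predicted probability, could look like this. The output file name is hypothetical and the three rows are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Stand-in for df_oculto after the prediction step; values are made up.
df_oculto = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100013],
    "Y_PRED": [0.07, 0.12, 0.03],
})

# Hypothetical output location; adjust to the agreed delivery format.
out_path = os.path.join(tempfile.gettempdir(), "predictions_example.csv")

# Keep only the ID and the predicted probability, without the pandas index.
df_oculto[["SK_ID_CURR", "Y_PRED"]].to_csv(out_path, index=False)

# Read the file back as a quick sanity check.
check = pd.read_csv(out_path)
print(check)
```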

### Conclusions

* conclusion