## Context

PyCoders Ltda., increasingly specialized in Artificial Intelligence and Data Science, was approached by a fintech to develop a credit-granting project for vehicle loans. The project is expected to create value by separating good payers from bad payers as sharply as possible. To that end, a database with more than 185 thousand past loans and a range of customer characteristics was made available. A model must be delivered. For contractual reasons, payment will be based on the model's performance (Gini) over time.
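The brief does not define the Gini it refers to. A minimal sketch, assuming the usual credit-scoring definition Gini = 2 × AUC − 1 (the helper name `gini` and the toy labels and scores are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def gini(y_true, y_score):
    """Gini of a scoring model, assumed here to be 2 * AUC - 1."""
    return 2.0 * roc_auc_score(y_true, y_score) - 1.0


# Toy labels (1 = default) and predicted default probabilities, illustration only
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Three of the four (non-default, default) pairs are ranked correctly: AUC = 0.75
print(gini(y_true, y_score))  # 0.5
```

Under this definition a model that ranks at random scores 0 and a perfect ranking scores 1.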

## Dataset

The project uses a database with registration data and credit history for Indian customers. The data is split into a training set and a test set without the response variable, both stored as gzip-compressed pickle files. To read a file, run `df = pd.read_pickle('nome_do_arquivo.pkl.gz')`. All modeling and validation must be done on the training set, which the squad may subdivide as it sees fit. Metadata describing the explanatory variables is also available to support the project.
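One common way to subdivide the training set is a stratified hold-out split, which keeps the default rate similar in both parts. The sketch below uses synthetic data; the split size, the seed, and the toy columns are assumptions, not choices made in this project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training set (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "id_pessoa": np.arange(1000),
    "score": rng.integers(0, 900, size=1000),
    "default": rng.binomial(1, 0.2, size=1000),
})

X = toy.drop(columns=["default", "id_pessoa"])
y = toy["default"]

# stratify=y keeps the share of defaults close to equal in both splits
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_tr), len(X_val))
print(round(y_tr.mean(), 3), round(y_val.mean(), 3))
```

Because payment depends on performance over time, a split ordered by contract date could be a reasonable alternative to a random one.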

## Definitions

### Grade

The squad's final grade is made up of:

- Model performance on a hidden-label dataset (50 points), measured by Gini:
  - The squad with the best performance on the hidden dataset receives 50 points;
  - The squad with the second-best performance on the hidden dataset receives 45 points;
  - The squad with the third-best performance on the hidden dataset receives 40 points;
  - The squad with the fourth-best performance on the hidden dataset receives 35 points.
- Workflow used to decide which model will actually be used (50 points):
  - The whole modeling workflow is evaluated, including (but not limited to) preprocessing, metrics, and model selection (25 points);
  - The facts that led the squad to choose its final model (15 points);
  - The reasons that led the squad to use or not use certain variables (we are simulating a lender, so think about ethics and the company's image, for example) (10 points).

### Delivery Rules

- A dataset with the predictions for the test set must be delivered.
  - This dataset must be a DataFrame with two columns: the first is the person's ID (variable `id_pessoa`) and the second is the probability of default.
  - :warning: Deliver the predictions as the probability that default occurs.
- A notebook with the exploratory analysis and the modeling analysis, showing how the variables were investigated, the hypotheses raised, and the reasons behind each decision.
- A video of up to 10 minutes walking through this notebook (do not worry about preparing slides or anything similar; just record the notebook on screen and explain each step).
- A summary table with every model tested, the variables used in each model, and the metric obtained on train and test. This table will also support the decision about which model was chosen.
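The prediction file described in the delivery rules could be assembled roughly as follows. This is a sketch on synthetic data: the model, the feature matrix, and the column name `prob_inadimplencia` are assumptions; only `id_pessoa` comes from the rules.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the prepared train and test matrices (made-up values)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + rng.normal(size=200) > 0).astype(int)
X_test = rng.normal(size=(50, 3))
ids_test = np.arange(1000, 1050)  # hypothetical id_pessoa values

model = LogisticRegression().fit(X_train, y_train)

# classes_ is sorted, so column 1 of predict_proba is P(default = 1)
proba = model.predict_proba(X_test)[:, 1]

entrega = pd.DataFrame({"id_pessoa": ids_test, "prob_inadimplencia": proba})
print(entrega.head())
```

Taking column 1 rather than column 0 matters here: the rules ask for the probability that default occurs, not the probability of repayment.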

### Tips

- Explore what each variable means: does using an age variable to determine credit risk pose an image risk for a company? Is it worth bringing the variable into the model?
- Create new variables from the ones already in the dataset: creativity should be plentiful.
- Talk to Rychard to clear up questions about the project.
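As an illustration of deriving new variables from existing ones, the sketch below builds two hypothetical features from columns present in the training data. The feature names `entrada` and `prop_emp_ativos`, and the reading of `custo_ativo` as asset cost and `valor_emprestimo` as loan amount, are assumptions; the three rows are copied from the `treino.head()` output in this notebook.

```python
import numpy as np
import pandas as pd

# First three rows of treino.head(), restricted to four columns
df = pd.DataFrame({
    "valor_emprestimo": [63418, 42494, 56909],
    "custo_ativo": [75571, 69042, 69407],
    "pri_qtd_tot_emp": [41, 1, 14],
    "pri_qtd_tot_emp_atv": [16, 0, 4],
})

# Hypothetical feature 1: amount not financed (asset cost minus loan amount)
df["entrada"] = df["custo_ativo"] - df["valor_emprestimo"]

# Hypothetical feature 2: share of primary loans still active,
# set to 0 when the customer has no primary loans (avoids division by zero)
total = df["pri_qtd_tot_emp"].replace(0, np.nan)
df["prop_emp_ativos"] = (df["pri_qtd_tot_emp_atv"] / total).fillna(0.0)

print(df[["entrada", "prop_emp_ativos"]])
```

Whether such features help would need to be checked on a validation split.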


In [1]:
from pandas_profiling import ProfileReport
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

pd.set_option('display.max_columns', None)

In [2]:
treino = pd.read_pickle('treino.pkl')
teste = pd.read_pickle('teste_aluno.pkl')

# Preprocessing

In [3]:
treino.columns

Index(['id_pessoa', 'valor_emprestimo', 'custo_ativo', 'emprestimo_custo',
       'agencia', 'revendedora', 'montadora', 'Current_pincode_ID',
       'nascimento', 'emprego', 'data_contrato', 'estado', 'funcionario',
       'flag_telefone', 'flag_aadhar', 'flag_pan', 'flag_eleitor',
       'flag_cmotorista', 'flag_passaporte', 'score', 'score_desc',
       'pri_qtd_tot_emp', 'pri_qtd_tot_emp_atv', 'pri_qtd_tot_def',
       'pri_emp_abt', 'pri_emp_san', 'pri_emp_tom', 'sec_qtd_tot_emp',
       'sec_qtd_tot_emp_atv', 'sec_qtd_tot_def', 'sec_emp_abt', 'sec_emp_san',
       'sec_emp_tom', 'par_pri_emp', 'par_seg_emp', 'nov_emp_6m', 'def_emp_6m',
       'tem_med_emp', 'tem_pri_emp', 'qtd_sol_emp', 'default'],
      dtype='object')

In [4]:
treino.head()

Unnamed: 0,id_pessoa,valor_emprestimo,custo_ativo,emprestimo_custo,agencia,revendedora,montadora,Current_pincode_ID,nascimento,emprego,data_contrato,estado,funcionario,flag_telefone,flag_aadhar,flag_pan,flag_eleitor,flag_cmotorista,flag_passaporte,score,score_desc,pri_qtd_tot_emp,pri_qtd_tot_emp_atv,pri_qtd_tot_def,pri_emp_abt,pri_emp_san,pri_emp_tom,sec_qtd_tot_emp,sec_qtd_tot_emp_atv,sec_qtd_tot_def,sec_emp_abt,sec_emp_san,sec_emp_tom,par_pri_emp,par_seg_emp,nov_emp_6m,def_emp_6m,tem_med_emp,tem_pri_emp,qtd_sol_emp,default
155653,487469,63418,75571,85.0,136,14189,86,3783,10-05-76,Salaried,03-09-18,8,2064,1,1,0,0,0,0,676,F-Low Risk,41,16,2,1365190,1454300,1454300,0,0,0,0,0,0,3746800,0,3,0,0yrs 8mon,5yrs 5mon,0,0
98628,627194,42494,69042,65.18,1,22056,45,4923,05-02-97,Self employed,26-10-18,3,1298,1,0,0,1,0,0,824,A-Very Low Risk,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1yrs 1mon,1yrs 1mon,0,0
132937,636647,56909,69407,84.86,5,17980,86,3396,06-05-83,Salaried,29-10-18,9,679,1,0,0,1,0,0,755,C-Very Low Risk,14,4,0,12162,97826,26664,0,0,0,0,0,0,0,0,4,0,0yrs 5mon,1yrs 4mon,0,0
29031,518430,69488,82782,85.0,19,14375,86,1838,07-01-83,Self employed,19-09-18,4,603,1,1,0,0,0,0,604,H-Medium Risk,5,2,0,68664,80000,80000,0,0,0,0,0,0,11840,0,2,0,1yrs 1mon,3yrs 1mon,7,0
67486,577759,54963,66783,84.9,74,22928,86,2578,01-01-94,Salaried,14-10-18,4,286,1,1,0,0,0,0,737,C-Very Low Risk,3,1,0,36654,57131,57131,0,0,0,0,0,0,2658,0,0,0,0yrs 10mon,1yrs 4mon,0,0


In [5]:
#prof = ProfileReport(treino)
#prof.to_file(output_file="output.html")

In [6]:
#Drop the columns that identify employees and the vehicle's origin, plus location and document flags
colunas_drop = ['agencia','funcionario','revendedora','montadora', "flag_telefone",'Current_pincode_ID',
                'flag_aadhar', 'flag_passaporte']
treino.drop(colunas_drop, axis=1, inplace=True)
teste.drop(colunas_drop, axis=1, inplace=True)

In [7]:
# Function to fix the years (61 would otherwise be read as 2061, for example)
def corrige_anos(arr):
    
    '''
    Takes an array of date strings in dd-mm-yy format, checks whether the trailing
    two-digit year is below 20 (a reference to 2020), and expands it to 20yy if so
    or to 19yy otherwise.
    '''
    
    lista = []
    for i in arr:
        if int(i[6:]) < 20:
            lista.append(i[:6]+'20'+i[6:])
        else:
            lista.append(i[:6]+'19'+i[6:])
    return lista
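The reason a helper like this is needed is that the `%y` directive follows the POSIX pivot: two-digit years 00 to 68 map to 20xx and 69 to 99 map to 19xx. The sketch below shows the pitfall and a vectorised variant of the same idea; the name `expand_year` and its `pivot` parameter are hypothetical and not part of the notebook.

```python
import pandas as pd

raw = pd.Series(["10-05-61", "05-02-97", "01-01-05"])

# Parsing directly with %y sends 61 to 2061
parsed_raw = pd.to_datetime(raw, format="%d-%m-%y")


def expand_year(s, pivot=20):
    """Two-digit years below `pivot` become 20yy, all others become 19yy."""
    yy = s.str[6:].astype(int)
    century = yy.lt(pivot).map({True: "20", False: "19"})
    return s.str[:6] + century + s.str[6:]


parsed_fixed = pd.to_datetime(expand_year(raw), format="%d-%m-%Y")

print(parsed_raw.dt.year.tolist())    # [2061, 1997, 2005]
print(parsed_fixed.dt.year.tolist())  # [1961, 1997, 2005]
```

A pivot of 20 suits birth dates of adult borrowers; it would misread anyone born before 1920.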

In [8]:
# Expand the years with the function defined above
treino.nascimento = corrige_anos(treino.nascimento)
teste.nascimento = corrige_anos(teste.nascimento)

In [9]:
#Convert the birth date to age in whole years (365.25-day years)
now = pd.Timestamp('now')
treino['idade'] = (now - pd.to_datetime(treino.nascimento,format='%d-%m-%Y')).dt.days // 365.25
teste['idade'] = (now - pd.to_datetime(teste.nascimento,format='%d-%m-%Y')).dt.days // 365.25
treino.drop('nascimento', axis =1, inplace=True)
teste.drop('nascimento', axis =1, inplace=True)

In [10]:
#Convert the contract date to the number of days elapsed since the contract

now = pd.Timestamp('now')
treino['dias_contrato'] = (now - pd.to_datetime(treino.data_contrato,format='%d-%m-%y')).dt.days
teste['dias_contrato'] = (now - pd.to_datetime(teste.data_contrato,format='%d-%m-%y')).dt.days
treino.drop('data_contrato', axis =1, inplace=True)
teste.drop('data_contrato', axis =1, inplace=True)

In [11]:
treino.dias_contrato

155653    760
98628     707
132937    704
29031     744
67486     719
         ... 
158885    718
144610    729
204677    702
68304     771
57003     767
Name: dias_contrato, Length: 186523, dtype: int64

In [12]:
def str_para_mes(arr):
    '''
    Takes a Series of durations written as strings such as '5yrs 5mon', extracts the
    year and month counts, and returns the total in months (years * 12 + months)
    as a NumPy array.
    '''
    # One capture group for the years and one for the months
    partes = arr.str.extract(r'(\d+)yrs (\d+)mon').astype(int)
    return (partes[0] * 12 + partes[1]).to_numpy()


In [13]:
#Convert to months
treino['tem_med_emp'] = str_para_mes(treino.tem_med_emp)
teste['tem_med_emp'] = str_para_mes(teste.tem_med_emp)

In [14]:
#Convert to months
treino['tem_pri_emp'] = str_para_mes(treino.tem_pri_emp)
teste['tem_pri_emp'] = str_para_mes(teste.tem_pri_emp)

In [15]:
treino.columns

Index(['id_pessoa', 'valor_emprestimo', 'custo_ativo', 'emprestimo_custo',
       'emprego', 'estado', 'flag_pan', 'flag_eleitor', 'flag_cmotorista',
       'score', 'score_desc', 'pri_qtd_tot_emp', 'pri_qtd_tot_emp_atv',
       'pri_qtd_tot_def', 'pri_emp_abt', 'pri_emp_san', 'pri_emp_tom',
       'sec_qtd_tot_emp', 'sec_qtd_tot_emp_atv', 'sec_qtd_tot_def',
       'sec_emp_abt', 'sec_emp_san', 'sec_emp_tom', 'par_pri_emp',
       'par_seg_emp', 'nov_emp_6m', 'def_emp_6m', 'tem_med_emp', 'tem_pri_emp',
       'qtd_sol_emp', 'default', 'idade', 'dias_contrato'],
      dtype='object')

In [16]:
#Remove the response variable and the id before preprocessing
treino_exp= treino.drop(['default', 'id_pessoa'], axis=1)

In [17]:
numericas = ['valor_emprestimo', 'custo_ativo', 'emprestimo_custo',
        'estado','score',  'pri_qtd_tot_emp', 'pri_qtd_tot_emp_atv',
       'pri_qtd_tot_def', 'pri_emp_abt', 'pri_emp_san', 'pri_emp_tom',
       'sec_qtd_tot_emp', 'sec_qtd_tot_emp_atv', 'sec_qtd_tot_def',
       'sec_emp_abt', 'sec_emp_san', 'sec_emp_tom', 'par_pri_emp',
       'par_seg_emp', 'nov_emp_6m', 'def_emp_6m', 'tem_med_emp', 'tem_pri_emp',
       'qtd_sol_emp', 'idade', 'dias_contrato']

In [18]:
treino_exp[numericas] = treino_exp[numericas].astype(float)

In [19]:
categoricas = ['emprego', 'flag_pan', 'flag_eleitor', 'flag_cmotorista',
       'score_desc']

In [20]:
#Preprocessing pipeline

pipe_num = Pipeline(steps = [
    ('impute', SimpleImputer(strategy = 'median' )),
    ('minmax', MinMaxScaler())
])

pipe_cat = Pipeline(steps = [
    ('impute', SimpleImputer(strategy = 'most_frequent')),
    ('encoder', OrdinalEncoder())
])

preproc = ColumnTransformer(transformers = [
    ('proc_cat', pipe_cat, categoricas),
    ('proc_num', pipe_num, numericas)
    
], n_jobs=-1)

pipe_final = Pipeline([
    ('proc', preproc),
])
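A preprocessing pipeline of this shape can be extended with an estimator and scored by cross-validation, so that imputation and scaling are refit inside each fold. The sketch below runs on synthetic data; the estimator, the fold count, and the toy columns are assumptions, and Gini is taken as 2 × AUC − 1.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Synthetic stand-in for the training data (values are made up)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "score": rng.integers(0, 900, size=n).astype(float),
    "idade": rng.integers(18, 70, size=n).astype(float),
    "emprego": rng.choice(["Salaried", "Self employed"], size=n),
})
# Default is made more likely for low scores so there is some signal to find
y = (rng.random(n) < 0.15 + 0.3 * (X["score"] < 300)).astype(int)

preproc = ColumnTransformer([
    ("proc_cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                           ("encoder", OrdinalEncoder())]), ["emprego"]),
    ("proc_num", Pipeline([("impute", SimpleImputer(strategy="median")),
                           ("minmax", MinMaxScaler())]), ["score", "idade"]),
])
modelo = Pipeline([("proc", preproc), ("clf", LogisticRegression(max_iter=1000))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc = cross_val_score(modelo, X, y, cv=cv, scoring="roc_auc")
gini = 2 * auc - 1  # one Gini value per fold
print(gini.round(3), gini.mean().round(3))
```

Keeping the estimator inside the pipeline also means the same object can later be applied to the test set with a single `predict_proba` call.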


In [21]:
treino_trans= pipe_final.fit_transform(treino_exp)

In [22]:
#Column names in the order the ColumnTransformer emits them: categorical block first, then numeric
col = categoricas + numericas

In [23]:
# Turn the array into a DataFrame
treino_trans = pd.DataFrame(treino_trans, columns = col)

In [24]:
treino_trans.head()

Unnamed: 0,emprego,flag_pan,flag_eleitor,flag_cmotorista,score_desc,valor_emprestimo,custo_ativo,emprestimo_custo,estado,score,pri_qtd_tot_emp,pri_qtd_tot_emp_atv,pri_qtd_tot_def,pri_emp_abt,pri_emp_san,pri_emp_tom,sec_qtd_tot_emp,sec_qtd_tot_emp_atv,sec_qtd_tot_def,sec_emp_abt,sec_emp_san,sec_emp_tom,par_pri_emp,par_seg_emp,nov_emp_6m,def_emp_6m,tem_med_emp,tem_pri_emp,qtd_sol_emp,idade,dias_contrato
0,0.0,0.0,0.0,0.0,5.0,0.051264,0.024228,0.876284,0.333333,0.759551,0.090508,0.111111,0.086957,0.034335,0.001454,0.001454,0.0,0.0,0.0,0.015698,0.0,0.0,0.146115,0.0,0.085714,0.0,0.02168,0.138889,0.0,0.480769,0.637363
1,1.0,0.0,1.0,0.0,0.0,0.029853,0.020127,0.631078,0.095238,0.925843,0.002208,0.0,0.0,0.020481,0.0,0.0,0.0,0.0,0.0,0.015698,0.0,0.0,0.0,0.0,0.0,0.0,0.03523,0.027778,0.0,0.076923,0.054945
2,0.0,0.0,1.0,0.0,2.0,0.044604,0.020356,0.874552,0.380952,0.848315,0.030905,0.027778,0.0,0.020605,9.8e-05,2.7e-05,0.0,0.0,0.0,0.015698,0.0,0.0,0.0,0.0,0.114286,0.0,0.01355,0.034188,0.0,0.346154,0.021978
3,1.0,0.0,0.0,0.0,7.0,0.057475,0.028758,0.876284,0.142857,0.678652,0.011038,0.013889,0.0,0.021178,8e-05,8e-05,0.0,0.0,0.0,0.015698,0.0,0.0,0.000462,0.0,0.057143,0.0,0.03523,0.07906,0.194444,0.346154,0.461538
4,0.0,0.0,0.0,0.0,2.0,0.042612,0.018708,0.875046,0.142857,0.82809,0.006623,0.006944,0.0,0.020853,5.7e-05,5.7e-05,0.0,0.0,0.0,0.015698,0.0,0.0,0.000104,0.0,0.0,0.0,0.0271,0.034188,0.0,0.134615,0.186813


In [25]:
#Reinsert the id and response columns
treino_trans.insert(0, column='id_pessoa' , value=treino.id_pessoa.values)


In [26]:
treino_trans.insert(1, column='default' , value=treino.default.values)

In [27]:
treino_trans.head()

Unnamed: 0,id_pessoa,default,emprego,flag_pan,flag_eleitor,flag_cmotorista,score_desc,valor_emprestimo,custo_ativo,emprestimo_custo,estado,score,pri_qtd_tot_emp,pri_qtd_tot_emp_atv,pri_qtd_tot_def,pri_emp_abt,pri_emp_san,pri_emp_tom,sec_qtd_tot_emp,sec_qtd_tot_emp_atv,sec_qtd_tot_def,sec_emp_abt,sec_emp_san,sec_emp_tom,par_pri_emp,par_seg_emp,nov_emp_6m,def_emp_6m,tem_med_emp,tem_pri_emp,qtd_sol_emp,idade,dias_contrato
0,487469,0,0.0,0.0,0.0,0.0,5.0,0.051264,0.024228,0.876284,0.333333,0.759551,0.090508,0.111111,0.086957,0.034335,0.001454,0.001454,0.0,0.0,0.0,0.015698,0.0,0.0,0.146115,0.0,0.085714,0.0,0.02168,0.138889,0.0,0.480769,0.637363
1,627194,0,1.0,0.0,1.0,0.0,0.0,0.029853,0.020127,0.631078,0.095238,0.925843,0.002208,0.0,0.0,0.020481,0.0,0.0,0.0,0.0,0.0,0.015698,0.0,0.0,0.0,0.0,0.0,0.0,0.03523,0.027778,0.0,0.076923,0.054945
2,636647,0,0.0,0.0,1.0,0.0,2.0,0.044604,0.020356,0.874552,0.380952,0.848315,0.030905,0.027778,0.0,0.020605,9.8e-05,2.7e-05,0.0,0.0,0.0,0.015698,0.0,0.0,0.0,0.0,0.114286,0.0,0.01355,0.034188,0.0,0.346154,0.021978
3,518430,0,1.0,0.0,0.0,0.0,7.0,0.057475,0.028758,0.876284,0.142857,0.678652,0.011038,0.013889,0.0,0.021178,8e-05,8e-05,0.0,0.0,0.0,0.015698,0.0,0.0,0.000462,0.0,0.057143,0.0,0.03523,0.07906,0.194444,0.346154,0.461538
4,577759,0,0.0,0.0,0.0,0.0,2.0,0.042612,0.018708,0.875046,0.142857,0.82809,0.006623,0.006944,0.0,0.020853,5.7e-05,5.7e-05,0.0,0.0,0.0,0.015698,0.0,0.0,0.000104,0.0,0.0,0.0,0.0271,0.034188,0.0,0.134615,0.186813
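A `ColumnTransformer` returns its outputs concatenated in the order the transformers are listed, not in the order of the input columns, so the labels given to the resulting array have to follow the transformer order. A small sketch on made-up data (the toy frame and its values are assumptions):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Toy frame whose column order differs from the transformer order
toy = pd.DataFrame({
    "valor": [10.0, 20.0, 30.0],
    "emprego": ["Salaried", "Self employed", "Salaried"],
    "score": [300.0, 600.0, 900.0],
})
categoricas = ["emprego"]
numericas = ["valor", "score"]

ct = ColumnTransformer([
    ("proc_cat", OrdinalEncoder(), categoricas),
    ("proc_num", MinMaxScaler(), numericas),
])

# The categorical block comes first in the output, then the numeric block
out = pd.DataFrame(ct.fit_transform(toy), columns=categoricas + numericas)
print(out)
```

Labelling the array with the input column order instead would silently attach the encoded `emprego` values to the `valor` column.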


### The same steps as for training, applied to the test DataFrame

In [28]:
teste_exp = teste.drop(['id_pessoa'], axis=1)

In [29]:
teste_exp = pipe_final.transform(teste_exp)

In [30]:
#Column names in the order the ColumnTransformer emits them: categorical block first, then numeric
col = categoricas + numericas

In [35]:
teste_trans = pd.DataFrame(teste_exp, columns = col)

In [38]:
teste_trans.insert(0, column='id_pessoa' , value=teste.id_pessoa.values)

In [40]:
teste_trans.head()

Unnamed: 0,id_pessoa,emprego,flag_pan,flag_eleitor,flag_cmotorista,score_desc,valor_emprestimo,custo_ativo,emprestimo_custo,estado,score,pri_qtd_tot_emp,pri_qtd_tot_emp_atv,pri_qtd_tot_def,pri_emp_abt,pri_emp_san,pri_emp_tom,sec_qtd_tot_emp,sec_qtd_tot_emp_atv,sec_qtd_tot_def,sec_emp_abt,sec_emp_san,sec_emp_tom,par_pri_emp,par_seg_emp,nov_emp_6m,def_emp_6m,tem_med_emp,tem_pri_emp,qtd_sol_emp,idade,dias_contrato
0,563819,1.0,0.0,1.0,0.0,13.0,0.035124,0.026447,0.606705,0.52381,0.0,0.0,0.0,0.0,0.020481,0.0,0.0,0.0,0.0,0.0,0.015698,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.252747
1,482578,0.0,0.0,0.0,0.0,17.0,0.029679,0.015876,0.738587,0.428571,0.019101,0.002208,0.006944,0.0,0.020948,4.6e-05,4.6e-05,0.0,0.0,0.0,0.015698,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.115385,0.67033
2,565578,1.0,0.0,0.0,0.0,1.0,0.076407,0.051797,0.746381,0.095238,0.857303,0.002208,0.006944,0.0,0.020481,2.6e-05,2.6e-05,0.0,0.0,0.0,0.015698,0.0,0.0,0.0,0.0,0.0,0.0,0.04336,0.034188,0.0,0.288462,0.241758
3,629237,1.0,0.0,0.0,0.0,13.0,0.045985,0.027261,0.74799,0.095238,0.0,0.0,0.0,0.0,0.020481,0.0,0.0,0.0,0.0,0.0,0.015698,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.043956
4,485542,1.0,0.0,0.0,0.0,13.0,0.054591,0.038926,0.674749,0.666667,0.0,0.0,0.0,0.0,0.020481,0.0,0.0,0.0,0.0,0.0,0.015698,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.153846,0.67033


In [1]:
teste_trans.shape


In [None]:
treino_explicativo.columns = ['id_pessoa', 'valor_emprestimo', 'custo_ativo', 'emprestimo_custo',
       'emprego', 'flag_pan', 'flag_eleitor', 'flag_cmotorista', 'score',
       'score_desc', 'pri_qtd_tot_emp', 'pri_qtd_tot_emp_atv',
       'pri_qtd_tot_def', 'pri_emp_abt', 'pri_emp_san', 'pri_emp_tom',
       'sec_qtd_tot_emp', 'sec_qtd_tot_emp_atv', 'sec_qtd_tot_def',
       'sec_emp_abt', 'sec_emp_san', 'sec_emp_tom', 'par_pri_emp',
       'par_seg_emp', 'nov_emp_6m', 'def_emp_6m', 'tem_med_emp', 'tem_pri_emp',
       'qtd_sol_emp', 'idade', 'dias_contrato']

In [None]:
reglog = 0.6290246390899086

In [None]:
RandomForestClassifier = 0.5533905198782701

In [2]:
DecisionTreeClassifier = 0.6315743808039911

In [None]:
DecisionTreeRegressor = 0.6003817688715436

In [None]:
SVC = 0.4655817728573386

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=4)

In [None]:
SVC = 0.5194567039611498

In [None]:
RandomForestClassifier= 0.5257018177853641

In [None]:
DecisionTreeClassifier = 0.5919920477778472

In [4]:
adaboost_tree = 0.7731051228498657

In [None]:
adaboost_rf = 0.5422109785139895

In [5]:
selecionados = treino_explicativo[["flag_eleitor", "flag_pan","score_desc","qtd_sol_emp",
                                  "tem_pri_emp","pri_qtd_tot_def","pri_qtd_tot_emp_atv",
                                  "emprestimo_custo","idade","def_emp_6m","pri_qtd_tot_emp",
                                  "id_pessoa","nov_emp_6m","valor_emprestimo","tem_med_emp",
                                  "pri_emp_abt","dias_contrato","flag_cmotorista","par_pri_emp"]]

NameError: name 'treino_explicativo' is not defined

In [None]:
RandomForestClassifier  = 0.5

In [None]:
DecisionTreeClassifier = 0.610088477027986

In [None]:
GradientBoostingClassifier = 0.7073408909506756