<h1 style='color: blue; font-size: 34px; font-weight: bold;'> Proposed Project 
</h1>
<p style='font-size: 18px; line-height: 2; margin: 0px 0px; text-align: justify; text-indent: 0px;'>    
<i> This project aims to study credit risk modeling techniques. </i>       
</p>  

# <font color='red' style='font-size: 40px;'> Libraries Used </font>
<hr style='border: 2px solid red;'>

In [19]:
import pandas as pd 
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.weightstats import ztest
from statsmodels.stats.diagnostic import lilliefors
from statsmodels.tsa import stattools
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf 
from statsmodels.tsa.seasonal import seasonal_decompose
from scipy.stats import bernoulli, binom, poisson, geom, norm, chi2, f, chi2_contingency, normaltest, ttest_ind, ttest_rel, wilcoxon, mannwhitneyu, kruskal
import matplotlib.pyplot as plt
import seaborn as sns 
import plotly.express as px 
import plotly.graph_objects as go
import importlib
import transition_matrix_estimator
from transition_matrix_estimator import TransitionMatrixLearner
import time
import datetime
import warnings
import category_encoders as ce 
from sklearn.preprocessing import LabelEncoder
from category_encoders import BinaryEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score, confusion_matrix

%matplotlib inline
sns.set(style="whitegrid", font_scale=1.2)
plt.rcParams['font.family'] = 'Arial'
plt.rcParams['font.size'] = '14'
plt.rcParams['figure.figsize'] = [10, 5]
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format', lambda x: '%.4f' % x) # Show floats in fixed-point form instead of scientific notation
np.set_printoptions(suppress=True) # Suppress scientific notation in NumPy arrays
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning) # Silence FutureWarnings

# <font color='orange' style='font-size: 40px;'> Example: Target Construction </font>
<hr style='border: 2px solid orange;'>

# <font color='green' style='font-size: 30px;'> 1.1) Test 1 </font>
<hr style='border: 2px solid green;'>

> 1. Continuity tests: check that each contract's data is consistent over time (no unexpected breaks or gaps).

> 2. Creating the flag_acordo column: identifies records with agreements or renegotiations.

> 3. Creating the over90 column: monthly flag for severe delinquency (more than 90 days past due).

> 4. Creating the mau column: identifies unrecoverable contracts based on business rules.

> 5. Building the Performing group: defines the sample eligible for PD modeling, based on performing status.

> 6. Creating the ever90m12 column: indicates whether the contract was more than 90 days past due in any month of the next 12 months (ignoring agreements).

> 7. Creating the over90m12 column: indicates whether the contract is more than 90 days past due at the end of the 12-month forward window (ignoring agreements).

> 8. Creating the target column: final flag of future default, considering both delinquency and agreements.

> 9. Freezing the contract snapshot: selects one row per contract to represent the evaluation moment (decision window).

> 10. Delinquency curve (over90) by month: analysis over time of the volume of delinquent contracts, month by month (panel and snapshot).

> 11. ever90m12 and over90m12 curves by month: monthly evolution of the forward-looking risk indicators (panel only).

> 12. Building the transition matrix: evaluates the migration of credit status between periods (e.g. performing → delinquent).

> 13. Building the cure curve: measures the recovery of delinquent contracts over time.
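
Steps 3, 6 and 7 above can be sketched on a toy panel. This is only an illustration under stated assumptions, not the project's actual implementation: the column names `over90`, `ever90m12` and `over90m12` come from the list, while the helper `add_forward_flags`, the toy data, and the choice to leave the flags missing when the forward window is incomplete are hypothetical. Agreements are ignored, as in the step descriptions.

```python
import pandas as pd

def add_forward_flags(panel: pd.DataFrame, horizon: int = 12) -> pd.DataFrame:
    """Add over90, ever90m{horizon} and over90m{horizon} to a monthly panel.

    Assumes one row per contract per month with no gaps (step 1 of the list).
    """
    out = panel.sort_values(['id_contrato', 'data_ref']).copy()
    out['over90'] = (out['dias_atraso'] > 90).astype(int)

    grouped = out.groupby('id_contrato')['over90']
    steps = list(range(1, horizon + 1))
    # Column k holds the over90 flag observed k months ahead of each row
    future = pd.concat([grouped.shift(-k) for k in steps], axis=1, keys=steps)

    # Assumption: only rows with a full forward window receive a value
    complete = future.notna().all(axis=1)
    out[f'ever90m{horizon}'] = future.max(axis=1).where(complete)
    out[f'over90m{horizon}'] = future[horizon]
    return out

# Hypothetical single-contract panel; a 3-month horizon keeps the example small
toy = pd.DataFrame({
    'id_contrato': [1] * 6,
    'data_ref': pd.date_range('2015-01-01', periods=6, freq='MS'),
    'dias_atraso': [0, 30, 95, 120, 60, 0],
})
flags = add_forward_flags(toy, horizon=3)
print(flags[['data_ref', 'dias_atraso', 'over90', 'ever90m3', 'over90m3']])
```

With `horizon=12` the same function would produce `ever90m12` and `over90m12`. Rows too close to the end of the history stay missing, which is one common way to avoid labeling contracts whose outcome window has not fully elapsed.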

### 1.1.1) Understanding the Concepts

> 1. data_inicio_contrato

- The date on which the customer and the bank signed the contract, for example 2014-09-30
- In many cases there is a grace period before the first due date, so the contract exists legally but its performance cannot be measured yet

> 2. safra:

- The month in which performance starts (M0)
- If the first installment is due in Jan/2015, the contract is assigned to safra 201501

> 3. data_ref:

- At each data_ref (monthly snapshot), the contract is monitored
- The safra never changes, because it is the "stamp" of the contract's origination
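
As a small illustration of the two ideas above, the sketch below derives the safra and the contract age in months on hypothetical data. It assumes, as the notebook code does, that the safra is the first `data_ref` observed for the contract; the toy frame and the column name `meses_desde_safra` are made up for this example.

```python
import pandas as pd

# Hypothetical monthly snapshots for two contracts
toy = pd.DataFrame({
    'id_contrato': [1, 1, 1, 2, 2],
    'data_ref': pd.to_datetime(
        ['2015-01-01', '2015-02-01', '2015-03-01', '2015-02-01', '2015-03-01']
    ),
})

# The safra is fixed per contract: the first month in which it is observed (M0)
toy['safra'] = toy.groupby('id_contrato')['data_ref'].transform('min')

# Age in months relative to the safra (0 at M0, 1 at M1, ...)
toy['meses_desde_safra'] = (
    (toy['data_ref'].dt.year - toy['safra'].dt.year) * 12
    + (toy['data_ref'].dt.month - toy['safra'].dt.month)
)
print(toy)
```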

In [57]:
historico = pd.read_parquet('./data/full_history.parquet')
# Parse the dates first, so that safra (first data_ref observed per contract) is a datetime
historico['data_ref'] = pd.to_datetime(historico['data_ref'])
historico['safra'] = historico.groupby('id_contrato')['data_ref'].transform('min')
historico = historico[['id_contrato','data_inicio_contrato', 'safra', 'data_ref','dias_atraso']]
historico = historico.sort_values(by=['id_contrato','data_ref'])


# Renegotiation trail: maps each old contract id to the new id created by the agreement
rastro_contratos = pd.read_parquet('./data/rastro_contratos.parquet')
rastro_contratos.sort_values(by=['id_antigo'], inplace=True)
rastro_contratos['data_evento'] = pd.to_datetime(rastro_contratos['data_evento'])
# Rename so that both frames share the same column names
rastro_contratos.rename(columns={'id_antigo': 'id_contrato', 'data_evento': 'data_inicio_contrato'}, inplace=True)

display(historico.head(15))
display(rastro_contratos.head(15))



Unnamed: 0,id_contrato,data_inicio_contrato,safra,data_ref,dias_atraso
0,10000000,2014-09-30,2015-01-01,2015-01-01,30
1,10000000,2014-09-30,2015-01-01,2015-02-01,15
2,10000000,2014-09-30,2015-01-01,2015-03-01,15
3,10000000,2014-09-30,2015-01-01,2015-04-01,30
4,10000000,2014-09-30,2015-01-01,2015-05-01,60
5,10000000,2014-09-30,2015-01-01,2015-06-01,90
6,10000000,2014-09-30,2015-01-01,2015-07-01,60
7,10000000,2014-09-30,2015-01-01,2015-08-01,90
8,10000000,2014-09-30,2015-01-01,2015-09-01,90
9,10000000,2014-09-30,2015-01-01,2015-10-01,60


Unnamed: 0,id_contrato,id_novo,data_inicio_contrato
1023,10000000,10004583,2015-12-01
368,10000001,10003338,2015-06-01
706,10000002,10003996,2015-09-01
0,10000003,10002500,2015-02-01
255,10000007,10003118,2015-05-01
1366,10000009,10005220,2016-03-01
815,10000010,10004198,2015-10-01
55,10000011,10002680,2015-03-01
816,10000013,10004199,2015-10-01
256,10000015,10003119,2015-05-01


### 1.1.2) Renegotiation Flag

> 1. Renegotiation happens when the customer can no longer honor the original contract and the bank makes a new agreement to try to recover the credit. Example:

- The customer was several installments behind
- Instead of booking an actual loss, the bank offers a renegotiation:
    - It may reschedule the outstanding balance under new terms
    - It may write off part of the debt
    - It may grant a grace period before payments resume
- In these cases, the contract ID usually changes and the new ID carries the renegotiated debt
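
One possible way to carry this information into the monthly history is sketched below: the row of the old contract in the month of the renegotiation event is flagged, and the new ID is attached to it. This is an alternative illustration, not the approach the notebook takes (the notebook stacks the two tables). The column names `id_antigo`, `id_novo` and `data_evento` are the ones in the trail file; the toy data and the assumption that `data_evento` always coincides with a monthly `data_ref` are hypothetical.

```python
import pandas as pd

# Hypothetical monthly history: contract 10 deteriorates, contract 11 does not
historico_toy = pd.DataFrame({
    'id_contrato': [10, 10, 10, 11, 11],
    'data_ref': pd.to_datetime(
        ['2015-10-01', '2015-11-01', '2015-12-01', '2015-11-01', '2015-12-01']
    ),
    'dias_atraso': [60, 90, 120, 0, 15],
})

# Hypothetical trail: contract 10 was renegotiated into contract 20 in Dec/2015
rastro_toy = pd.DataFrame({
    'id_antigo': [10],
    'id_novo': [20],
    'data_evento': pd.to_datetime(['2015-12-01']),
})

# Align the trail with the history keys, then left-join so every history row is kept
eventos = rastro_toy.rename(columns={'id_antigo': 'id_contrato', 'data_evento': 'data_ref'})
marcado = historico_toy.merge(
    eventos[['id_contrato', 'data_ref', 'id_novo']],
    on=['id_contrato', 'data_ref'],
    how='left',
)

# A row is a renegotiation month when a new contract id was matched
marcado['flag_acordo'] = marcado['id_novo'].notna().astype(int)
print(marcado)
```

Keeping `id_novo` on the flagged row makes it possible to follow the debt into the new contract later, if its delinquency history is available.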

In [58]:
historico['flag_acordo'] = 0  # history rows are not renegotiation events

# Renegotiation rows have no monthly snapshot: date columns get NaT, days past due gets NaN
rastro_contratos["safra"] = pd.NaT
rastro_contratos["data_ref"] = pd.NaT
rastro_contratos["dias_atraso"] = np.nan
rastro_contratos['flag_acordo'] = 1
rastro_contratos = rastro_contratos[['id_contrato', 'data_inicio_contrato', 'safra', 'data_ref', 'dias_atraso', 'flag_acordo']]

df = pd.concat([historico, rastro_contratos])
df.sort_values(by=['id_contrato','data_ref'], inplace=True)
df = df.reset_index(drop=True)
df.head(15)

Unnamed: 0,id_contrato,data_inicio_contrato,safra,data_ref,dias_atraso,flag_acordo
0,10000000,2014-09-30,2015-01-01,2015-01-01,30.0,0
1,10000000,2014-09-30,2015-01-01,2015-02-01,15.0,0
2,10000000,2014-09-30,2015-01-01,2015-03-01,15.0,0
3,10000000,2014-09-30,2015-01-01,2015-04-01,30.0,0
4,10000000,2014-09-30,2015-01-01,2015-05-01,60.0,0
5,10000000,2014-09-30,2015-01-01,2015-06-01,90.0,0
6,10000000,2014-09-30,2015-01-01,2015-07-01,60.0,0
7,10000000,2014-09-30,2015-01-01,2015-08-01,90.0,0
8,10000000,2014-09-30,2015-01-01,2015-09-01,90.0,0
9,10000000,2014-09-30,2015-01-01,2015-10-01,60.0,0


### 1.1.3) Creating Mau Origem, Over and Ever

> 1. We disregard the agreement flag here, because Mau Origem, Over and Ever could only be marked for those rows if we knew how many days past due the new contract is

> 2. In practical terms:

| Type     | Use         | Example                                                                             |
| -------- | ----------- | ----------------------------------------------------------------------------------- |
| Snapshot | Application | “Up to the current month, has this customer ever been 30 or more days past due?”    |
| Panel    | Behaviour   | “At any point in the contract's life, was this customer 30 or more days past due?”  |
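
The difference between the two readings in the table can be made concrete with a toy contract. The sketch below is illustrative only: the threshold of 30 or more days mirrors the notebook's code, while the toy data and the column names `hit_this_month`, `ever30_to_date` and `ever30_contract` are hypothetical.

```python
import pandas as pd

# Hypothetical contract that is late only in its second month
toy = pd.DataFrame({
    'id_contrato': [1, 1, 1, 1],
    'idade_meses_contrato': [0, 1, 2, 3],
    'dias_atraso': [0, 45, 10, 0],
})

# Row-level event: did this particular month reach the threshold?
hit = (toy['dias_atraso'] >= 30).astype(int)
toy['hit_this_month'] = hit

# "Up to the current month": running maximum within the contract
toy['ever30_to_date'] = hit.groupby(toy['id_contrato']).cummax()

# "At any point in the contract": same value repeated on every row
toy['ever30_contract'] = hit.groupby(toy['id_contrato']).transform('max')
print(toy)
```

The running maximum only uses information available at each `data_ref`, whereas the contract-level maximum looks at the whole history, including months that lie in the future relative to a given row.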


In [59]:
df = pd.concat([historico, rastro_contratos])
df = df.loc[df['flag_acordo'] == 0].copy()  # keep only the monthly history rows
df.sort_values(by=['id_contrato','data_ref'], inplace=True)

# 1) Contract age in months, counted from the safra (M0 = 0)
df['idade_meses_contrato'] = (
    (df['data_ref'].dt.year - df['safra'].dt.year) * 12
    + (df['data_ref'].dt.month - df['safra'].dt.month)
)

# ==========================
# MAU_ORIGEM (example: >= 30 days past due at M0)
# ==========================

df['flag_mau_origem'] = np.where((df['dias_atraso'] >= 30) & (df['idade_meses_contrato'] == 0), 1, 0)
df['mau_origem'] = df.groupby('id_contrato')['flag_mau_origem'].transform('max')
df.drop(columns=['flag_mau_origem'], inplace=True)


# --------------------------
# EVER30M6
# --------------------------

# 1) Row by row → 1 only on the reference months that hit the threshold (M0 to M6)
df['ever30m6_snapshot'] = np.where((df['dias_atraso'] >= 30) & (df['idade_meses_contrato'] <= 6),1, 0)

# 2) Whole contract → 1 if any month hit the threshold
df['ever30m6_painel'] = df.groupby('id_contrato')['ever30m6_snapshot'].transform('max')

# --------------------------
# EVER30M12
# --------------------------

# 1) Row by row → 1 only on the reference months that hit the threshold (M0 to M12)
df['ever30m12_snapshot'] = np.where((df['dias_atraso'] >= 30) & (df['idade_meses_contrato'] <= 12),1, 0)

# 2) Whole contract → 1 if any month hit the threshold
df['ever30m12_painel'] = df.groupby('id_contrato')['ever30m12_snapshot'].transform('max')

df.head(15)


Unnamed: 0,id_contrato,data_inicio_contrato,safra,data_ref,dias_atraso,flag_acordo,idade_meses_contrato,mau_origem,ever30m6_snapshot,ever30m6_painel,ever30m12_snapshot,ever30m12_painel
0,10000000,2014-09-30,2015-01-01,2015-01-01,30.0,0,0,1,1,1,1,1
1,10000000,2014-09-30,2015-01-01,2015-02-01,15.0,0,1,1,0,1,0,1
2,10000000,2014-09-30,2015-01-01,2015-03-01,15.0,0,2,1,0,1,0,1
3,10000000,2014-09-30,2015-01-01,2015-04-01,30.0,0,3,1,1,1,1,1
4,10000000,2014-09-30,2015-01-01,2015-05-01,60.0,0,4,1,1,1,1,1
5,10000000,2014-09-30,2015-01-01,2015-06-01,90.0,0,5,1,1,1,1,1
6,10000000,2014-09-30,2015-01-01,2015-07-01,60.0,0,6,1,1,1,1,1
7,10000000,2014-09-30,2015-01-01,2015-08-01,90.0,0,7,1,0,1,1,1
8,10000000,2014-09-30,2015-01-01,2015-09-01,90.0,0,8,1,0,1,1,1
9,10000000,2014-09-30,2015-01-01,2015-10-01,60.0,0,9,1,0,1,1,1


### 1.1.4) Roll Rates

In [66]:
# Performance curve by safra (cohort)
bins = [0, 30, 60, 90, 120, float('inf')]
labels = ['1–30 dias','31–60 dias','61–90 dias','91–120 dias','>120 dias']
# Intervals are right-closed, so rows with 0 days past due fall outside the bins (NaN)
df['faixa_atraso'] = pd.cut(df['dias_atraso'], bins=bins, labels=labels, right=True)

matriz_cohort = df.groupby(['safra', 'faixa_atraso'])['id_contrato'].count().unstack(fill_value=0)
matriz_cohort_pct = matriz_cohort.div(matriz_cohort.sum(axis=1), axis=0) * 100

matriz_cohort_pct = matriz_cohort_pct.round(2)
display(matriz_cohort_pct)


# Roll-rate matrix
bins = [0, 30, 60, 90, 120, float('inf')]
labels = ['1-30 dias','31-60 dias','61-90 dias','91-120 dias','>120 dias']
df['faixa_atraso'] = pd.cut(df['dias_atraso'], bins=bins, labels=labels, right=True)

# Sort by contract and date
df = df.sort_values(['id_contrato', 'data_ref'])

# Delinquency bucket of the previous month
df['faixa_atraso_prev'] = df.groupby('id_contrato')['faixa_atraso'].shift(1)

# Keep only valid pairs (rows that have a previous month)
transicoes = df.dropna(subset=['faixa_atraso_prev', 'faixa_atraso'])

# Build the transition matrix (row percentages)
matriz_transicao = pd.crosstab(
    transicoes['faixa_atraso_prev'],
    transicoes['faixa_atraso'],
    normalize='index'
) * 100

display(matriz_transicao)

faixa_atraso,1–30 dias,31–60 dias,61–90 dias,91–120 dias,>120 dias
safra,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2015-01-01,41.73,32.08,16.13,3.97,6.1
2015-02-01,50.68,29.33,14.38,1.78,3.83
2015-03-01,49.2,30.1,15.23,1.96,3.5
2015-04-01,48.64,32.24,14.62,2.08,2.42
2015-05-01,50.87,30.54,14.15,2.06,2.38
2015-06-01,48.28,32.05,14.95,2.15,2.56
2015-07-01,47.46,34.18,13.84,1.37,3.15
2015-08-01,48.44,32.23,13.82,2.0,3.52
2015-09-01,51.75,31.21,13.38,1.9,1.75
2015-10-01,53.88,29.7,12.7,1.22,2.5


faixa_atraso,1-30 dias,31-60 dias,61-90 dias,91-120 dias,>120 dias
faixa_atraso_prev,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1-30 dias,84.3684,15.6316,0.0,0.0,0.0
31-60 dias,9.0544,63.5413,27.4043,0.0,0.0
61-90 dias,0.0,51.6292,48.3708,0.0,0.0
91-120 dias,0.0,0.0,40.2734,59.7266,0.0
>120 dias,0.0,0.0,0.0,19.6751,80.3249
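
One way to read the roll-rate matrix above is to collapse each row into three numbers: the share of contracts that moved to a lighter bucket, stayed in the same bucket, or moved to a heavier one. The sketch below does this with the percentages printed above, retyped by hand. The column names `roll_back`, `stay` and `roll_forward` are hypothetical. Because the bins start above zero, months with 0 days past due are not in the matrix, so a full cure to current status is not visible here.

```python
import numpy as np
import pandas as pd

faixas = ['1-30 dias', '31-60 dias', '61-90 dias', '91-120 dias', '>120 dias']

# Row percentages copied from the roll-rate matrix printed above
matriz = pd.DataFrame(
    [
        [84.3684, 15.6316, 0.0, 0.0, 0.0],
        [9.0544, 63.5413, 27.4043, 0.0, 0.0],
        [0.0, 51.6292, 48.3708, 0.0, 0.0],
        [0.0, 0.0, 40.2734, 59.7266, 0.0],
        [0.0, 0.0, 0.0, 19.6751, 80.3249],
    ],
    index=faixas,
    columns=faixas,
)

valores = matriz.to_numpy()
resumo = pd.DataFrame(
    {
        'roll_back': np.tril(valores, k=-1).sum(axis=1),    # below the diagonal: lighter bucket
        'stay': np.diag(valores),                           # diagonal: same bucket
        'roll_forward': np.triu(valores, k=1).sum(axis=1),  # above the diagonal: heavier bucket
    },
    index=faixas,
)
print(resumo)
```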


# <font color='orange' style='font-size: 40px;'> Example: Reject Inference </font>
<hr style='border: 2px solid orange;'>

https://python.plainenglish.io/a-project-on-reject-inference-94a6858bc821

https://www.kaggle.com/code/shraddhacodes/credit-card-default-reject-inference-project

# <font color='green' style='font-size: 30px;'> 1.1) Test 1 </font>
<hr style='border: 2px solid green;'>

> 1. Cramér's V is a measure of association between two categorical variables, based on the chi-squared test.

- It indicates the strength of the association, ranging from 0 (no association) to 1 (perfect association).
- Cramér's V is computed from the chi-squared statistic of the contingency table, where:
    - chi2: the chi-squared test statistic
    - n: total number of observations
    - k, r: number of categories of each variable (rows and columns)
- Interpretation
    - Cramér's V ≈ 0: no association between the variables
    - Cramér's V ≈ 1: strong association
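
For reference, the quantities listed above combine in the usual textbook definition of Cramér's V, written here with the same symbols. It matches what the notebook code computes with `min_dim - 1`:

$$
V = \sqrt{\frac{\chi^2}{n \cdot \left(\min(k, r) - 1\right)}}
$$

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction by default on 2x2 tables, which slightly lowers the statistic, and therefore V, for pairs of binary variables.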

In [None]:
df_approved = pd.read_excel('./data/appbeh_approved.xlsx')
# Exploratory analysis
display(df_approved['TGT_VAR'].value_counts(normalize=True))
display(df_approved.describe())

# Fill missing values: column mean for numeric columns
for col in df_approved:
  if pd.api.types.is_numeric_dtype(df_approved[col]):
    df_approved[col] = df_approved[col].fillna(df_approved[col].mean())

# Fill missing values: most frequent value for text columns
for col in df_approved:
  if df_approved[col].dtype=='object':
    df_approved[col] = df_approved[col].fillna(df_approved[col].value_counts().index[0])

# Label Encoder
le=LabelEncoder()
for col in df_approved:
  if df_approved[col].dtype=='object':
    df_approved[col]=le.fit_transform(df_approved[col])

# Drop highly correlated variables
numeric_columns = df_approved.select_dtypes(include=['float', 'int'])
correlation_matrix = numeric_columns.corr()
correlation_threshold = 0.7
# Correlations above the threshold, kept for inspection; the drop list below is set by hand
high_correlation_values = correlation_matrix[abs(correlation_matrix) > correlation_threshold]
columns_to_drop = ['ACC_AMT', 'AMT', 'EMPLOYMENT_STATUS_CD','APPL_OUTCM_CD']
df_approved = df_approved.drop(columns_to_drop, axis=1)

# Select the categorical variables
categorical_columns = ['EDUCATION', 'GENDER', 'MARITAL_STATUS','RESIDENCE','APPL_PA_LEG_JUDG_FLG','APPL_PA_BNKR_STS_CD','APPL_PA_MNTS_FLG','TGT_VAR']
association_table = pd.DataFrame(index=categorical_columns, columns=categorical_columns)

# Calculate association measures for each pair of categorical variables
for i in range(len(categorical_columns)):
    for j in range(len(categorical_columns)):
        if i == j:
            association_table.iloc[i, j] = 1.0  # Diagonal elements are always 1
        else:
            # Contingency table for this pair of categorical variables
            contingency_table = pd.crosstab(df_approved[categorical_columns[i]], df_approved[categorical_columns[j]])
            # Chi-squared statistic; named chi2_stat so it does not shadow scipy.stats.chi2 imported above
            chi2_stat, _, _, _ = chi2_contingency(contingency_table)
            # Cramér's V: strength of the association, from 0 to 1
            min_dim = min(contingency_table.shape[0], contingency_table.shape[1])
            cramers_v = np.sqrt(chi2_stat / (df_approved.shape[0] * (min_dim - 1)))
            association_table.iloc[i, j] = cramers_v  # Fill the association matrix for this pair



# Display the association table
display(association_table)


TGT_VAR
0   0.8521
1   0.1479
Name: proportion, dtype: float64

Unnamed: 0,RK,TAX_CODE,AMT,ACC_AMT,ANNUAL_INCOME_AMT,EMP_YR_CNT,ACC_1_30DLQ_LST_3M_CNT,ACC_31_60DLQ_LST_3M_CNT,ACC_61_90DLQ_LST_3M_CNT,ACC_91_120DLQ_LST_3M_CNT,ACC_DLD_PAY_LST_3M_CNT,ACC_1_30DLQ_LST_3M_AMT,TOT_OUTSTANDING_31_60_DAY_AMT,ACC_61_90DLQ_LST_3M_AMT,OUTCOME_CD,SLN_DR_TRNS_LST_3M_CNT,ACC_91_120DLQ_LST_3M_AMT,APPL_SCR_NO,ACC_APPL_PCL_VAL_AMT,APPL_PCL_TYP_CD,APPL_PA_HHD_INC_AMT,APPL_PA_LQD_AST_AMT,APPL_PA_REST_AMT,APPL_PA_AST_OTH_AMT,APPL_PA_LBL_REST_AMT,APPL_APPT_MAX_AGE_NO,APPL_APPT_MAX_LBL_AMT,APPL_OUTCM_CD,APPL_PA_BUR1_BNKP_CNT,APPL_PA_BUR2_BNKP_CNT,APPL_PA_BUR1_CURR_LMT_AMT,TGT_VAR
count,5252.0,5252.0,5177.0,5177.0,5177.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5252.0,5177.0,5252.0,5252.0,5252.0,5252.0,5252.0
mean,7043.5914,4.0021,68732.6803,51549.5102,58232.645,8.0337,2.9979,1.9975,1.4962,3.9939,2.1746,1932.6076,50419.9657,49139.2688,0.2502,1.3562,99559.2346,501.3686,5057.508,0.1455,49856.3854,49198.853,49822.7388,50069.7091,50549.5468,26.6683,71283.7864,0.2502,1.5017,1.4998,4984.738,0.1479
std,4072.219,1.7711,37123.555,27842.6662,31923.1162,4.0714,1.2243,0.7185,0.5,1.7924,1.5711,4118.3898,28698.1591,28488.2168,0.4332,0.9551,40460.6275,232.3133,2893.1053,0.3526,28799.6816,28619.811,28988.4129,28889.2549,28649.1576,8.3476,39845.0376,0.4332,0.5,0.5,2875.8865,0.3551
min,2.0,1.0,6341.0,4755.75,5540.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,16.0,68.0,0.0,0.0,1028.0,100.0,0.0,0.0,0.0,0.0,0.0,8.0,4.0,18.0,6808.0,0.0,1.0,1.0,4.0,0.0
25%,3546.5,3.0,44896.0,33672.0,37716.0,4.0,2.0,1.0,1.0,2.75,1.0,299.0,25341.0,24244.0,0.0,1.0,70739.0,301.0,2588.0,0.0,24518.0,24602.0,24615.0,24991.0,25859.0,20.0,45603.0,0.0,1.0,1.0,2472.0,0.0
50%,6894.5,4.0,60699.0,45524.25,51300.0,8.0,3.0,2.0,1.0,4.0,2.0,582.0,50276.0,49124.0,0.0,1.0,99092.0,500.0,5068.0,0.0,49482.0,48466.0,50140.0,50794.0,50424.0,24.0,62636.0,0.0,2.0,1.0,5004.0,0.0
75%,10719.0,5.0,81315.0,60986.25,69100.0,12.0,4.0,3.0,2.0,6.0,3.0,874.0,74700.0,73198.0,1.0,2.0,128963.0,707.0,7581.0,0.0,74261.0,74486.0,75014.0,75081.0,75816.0,31.0,85321.0,1.0,2.0,2.0,7469.0,0.0
max,13997.0,7.0,586691.0,440018.25,465907.0,15.0,5.0,3.0,2.0,7.0,5.0,19975.0,99996.0,99992.0,1.0,3.0,199044.0,900.0,9996.0,1.0,99984.0,99960.0,99968.0,99984.0,100000.0,66.0,634141.0,1.0,2.0,2.0,9996.0,1.0


Unnamed: 0,EDUCATION,GENDER,MARITAL_STATUS,RESIDENCE,APPL_PA_LEG_JUDG_FLG,APPL_PA_BNKR_STS_CD,APPL_PA_MNTS_FLG,TGT_VAR
EDUCATION,1.0,0.0222,0.0081,0.1405,0.3391,0.154,0.0656,0.0261
GENDER,0.0222,1.0,0.0139,0.0143,0.0004,0.0052,0.2634,0.0131
MARITAL_STATUS,0.0081,0.0139,1.0,0.0145,0.0151,0.0006,0.0101,0.0021
RESIDENCE,0.1405,0.0143,0.0145,1.0,0.1978,0.0988,0.051,0.0297
APPL_PA_LEG_JUDG_FLG,0.3391,0.0004,0.0151,0.1978,1.0,0.0767,0.0486,0.0213
APPL_PA_BNKR_STS_CD,0.154,0.0052,0.0006,0.0988,0.0767,1.0,0.0266,0.0
APPL_PA_MNTS_FLG,0.0656,0.2634,0.0101,0.051,0.0486,0.0266,1.0,0.0199
TGT_VAR,0.0261,0.0131,0.0021,0.0297,0.0213,0.0,0.0199,1.0


In [None]:
df_rej = pd.read_excel('./data/appbeh_rej.xlsx')
df_rej.drop('TGT_VAR', axis = 1, inplace=True)

# Fill missing values: column mean for numeric columns
for col in df_rej:
  if pd.api.types.is_numeric_dtype(df_rej[col]):
    df_rej[col] = df_rej[col].fillna(df_rej[col].mean())

# Fill missing values: most frequent value for text columns
for col in df_rej:
  if df_rej[col].dtype=='object':
    df_rej[col] = df_rej[col].fillna(df_rej[col].value_counts().index[0])

# Label Encoder
le=LabelEncoder()
for col in df_rej:
  if df_rej[col].dtype=='object':
    df_rej[col]=le.fit_transform(df_rej[col])

# Drop highly correlated variables
numeric_columns = df_rej.select_dtypes(include=['float', 'int'])
correlation_matrix = numeric_columns.corr()
correlation_threshold = 0.7
# Correlations above the threshold, kept for inspection; the drop list below is set by hand
high_correlation_values = correlation_matrix[abs(correlation_matrix) > correlation_threshold]
columns_to_drop = ['ACC_AMT', 'AMT', 'EMPLOYMENT_STATUS_CD','APPL_OUTCM_CD']
df_rej = df_rej.drop(columns_to_drop, axis=1)

# Select the categorical variables
categorical_columns = ['EDUCATION', 'GENDER', 'MARITAL_STATUS','RESIDENCE','APPL_PA_LEG_JUDG_FLG','APPL_PA_BNKR_STS_CD','APPL_PA_MNTS_FLG']
association_table = pd.DataFrame(index=categorical_columns, columns=categorical_columns)

# Calculate association measures for each pair of categorical variables
for i in range(len(categorical_columns)):
    for j in range(len(categorical_columns)):
        if i == j:
            association_table.iloc[i, j] = 1.0  # Diagonal elements are always 1
        else:
            # Contingency table for this pair of categorical variables
            contingency_table = pd.crosstab(df_rej[categorical_columns[i]], df_rej[categorical_columns[j]])
            # Chi-squared statistic; named chi2_stat so it does not shadow scipy.stats.chi2 imported above
            chi2_stat, _, _, _ = chi2_contingency(contingency_table)
            # Cramér's V: strength of the association, from 0 to 1
            min_dim = min(contingency_table.shape[0], contingency_table.shape[1])
            cramers_v = np.sqrt(chi2_stat / (df_rej.shape[0] * (min_dim - 1)))
            association_table.iloc[i, j] = cramers_v  # Fill the association matrix for this pair



# Display the association table
display(association_table)


Unnamed: 0,EDUCATION,GENDER,MARITAL_STATUS,RESIDENCE,APPL_PA_LEG_JUDG_FLG,APPL_PA_BNKR_STS_CD,APPL_PA_MNTS_FLG
EDUCATION,1.0,0.0292,0.0176,0.1426,0.3108,0.1627,0.0712
GENDER,0.0292,1.0,0.0081,0.0305,0.0044,0.0108,0.2653
MARITAL_STATUS,0.0176,0.0081,1.0,0.0081,0.0063,0.0008,0.0
RESIDENCE,0.1426,0.0305,0.0081,1.0,0.2066,0.0921,0.0448
APPL_PA_LEG_JUDG_FLG,0.3108,0.0044,0.0063,0.2066,1.0,0.1333,0.046
APPL_PA_BNKR_STS_CD,0.1627,0.0108,0.0008,0.0921,0.1333,1.0,0.016
APPL_PA_MNTS_FLG,0.0712,0.2653,0.0,0.0448,0.046,0.016,1.0


# <font color='orange' style='font-size: 40px;'> Example: Survival Analysis in Credit Risk </font>
<hr style='border: 2px solid orange;'>

https://www.kaggle.com/code/jurk06/survival-analysis/notebook

# <font color='green' style='font-size: 30px;'> 1.1) Test 1 </font>
<hr style='border: 2px solid green;'>

In [None]:
df_train = pd.read_csv('./data/cs-training.csv')
df_test = pd.read_csv('./data/cs-test.csv')

# <font color='orange' style='font-size: 40px;'> Example: Approval Optimizer </font>
<hr style='border: 2px solid orange;'>

https://building.nubank.com/pt-br/ds-ml-meetup-n-o-82-do-nubank-imersao-pratica-nos-modelos-de-otimizacao/

# <font color='green' style='font-size: 30px;'> 1.1) Test 1 </font>
<hr style='border: 2px solid green;'>