# Objective

Given an Excel spreadsheet, insert its data into a PostgreSQL database.

Tip: you can use Python together with psycopg2 to connect to PostgreSQL.

Steps:

* Create a virtualenv
* Install psycopg2
* Read the data from the Excel spreadsheet
* Clean the data, if necessary
* Insert the data into the PostgreSQL database
* Push the project to GitLab or to your personal GitHub and send me the repository link
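
The insertion step above can be sketched as follows. This is only a minimal sketch under a few assumptions that are not part of the original task: the helper names (`sanitize_column`, `build_create_table`, `build_insert`, `insert_rows`) are hypothetical, every column is typed `INTEGER` because the spreadsheet shown later in this notebook holds only whole numbers, and the table name is trusted (it is interpolated directly into the SQL). The `psycopg2` import is kept inside `insert_rows` so that the SQL helpers can be tried without a database.

```python
import re


def sanitize_column(name):
    """Turn a spreadsheet header into a lowercase snake_case SQL identifier."""
    return re.sub(r'\W+', '_', str(name).strip()).strip('_').lower()


def build_create_table(table, columns):
    """Build a CREATE TABLE statement; INTEGER for every column is an assumption."""
    cols = ', '.join(f'{sanitize_column(c)} INTEGER' for c in columns)
    return f'CREATE TABLE IF NOT EXISTS {table} ({cols})'


def build_insert(table, columns):
    """Build an INSERT with the single %s placeholder that execute_values expects."""
    cols = ', '.join(sanitize_column(c) for c in columns)
    return f'INSERT INTO {table} ({cols}) VALUES %s'


def insert_rows(dsn, table, columns, rows):
    """Create the table if needed and bulk-insert the rows in one transaction."""
    # Imported here so the helpers above work even without psycopg2 installed.
    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect(dsn)
    try:
        with conn:  # commits on success, rolls back on error (does not close)
            with conn.cursor() as cur:
                cur.execute(build_create_table(table, columns))
                execute_values(cur, build_insert(table, columns), rows)
    finally:
        conn.close()


print(build_insert('credit_clients', ['ID', 'default payment next month']))
# INSERT INTO credit_clients (id, default_payment_next_month) VALUES %s
```

With a DataFrame loaded from the spreadsheet, a call could look like `insert_rows('dbname=credit user=postgres', 'credit_clients', df.columns, df.values.tolist())`, where the connection string and table name are placeholders. `values.tolist()` is used because it yields native Python numbers, which psycopg2 can adapt directly.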

# Dependencies

## Libraries

In [19]:
import psycopg2
import pandas as pd
import numpy as np
import seaborn as sns
#import missingno as msno
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
%matplotlib inline

## Data and Constants

In [17]:
df_raw = pd.read_excel('default_credit_card_clients.xls',header=[1])
df_raw.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [6]:
df_raw.shape

(30000, 25)

# Cleaning

In [9]:
df_train.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
23833,23834,80000,2,3,1,50,2,3,2,2,2,0,2684,2502,2321,4247,3914,3242,0,0,2001,4,1073,23076,0
27012,27013,500000,1,1,1,43,0,0,0,0,0,0,124362,126860,129576,129863,120781,98163,4567,4787,4745,4256,3321,3326,0
12496,12497,60000,2,1,2,31,0,-1,2,2,2,2,3610,855,111,411,261,6695,900,0,300,0,6500,0,0
25519,25520,280000,2,2,2,34,-1,-1,2,-1,-1,-1,38,34163,32800,131200,3950,0,34163,0,131200,3950,0,716,0
18880,18881,400000,2,1,2,30,-2,-2,-1,-1,0,0,2912,2668,3222,21755,15258,16002,2668,3222,21755,500,1000,475,0


In [9]:
# Inspect and remove duplicate rows and columns.
def remover_duplicatas(df):
    # Count duplicate rows before dropping them, so the report below is accurate
    n_linhas_duplicadas = df.duplicated().sum()
    print('Removing...')
    df = df.drop_duplicates()
    # Check for duplicate columns by looking at duplicate rows of the transpose
    df_T = df.T
    print(f'There are {df_T.duplicated().sum()} duplicate columns and {n_linhas_duplicadas} duplicate rows')
    list_duplicated_columns = df_T[df_T.duplicated(keep=False)].index.tolist()
    df_T = df_T.drop_duplicates()
    print('Duplicate columns:')
    print(list_duplicated_columns)
    return df_T.T, list_duplicated_columns
df_T, lista_removidas = remover_duplicatas(df_raw)

Removing...
There are 0 duplicate columns and 0 duplicate rows
Duplicate columns:
[]


In [11]:
# Listing all columns
df_raw.columns

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'default payment next month'], dtype='object')

In [12]:
# Convert missing-value markers to np.nan so the dataset has a single null representation.
df_raw.replace([None, 'Null', 'null','NULL', -np.inf, np.inf],np.nan,inplace=True)

In [15]:
# Handle inconsistent values in 'EDUCATION'.
lista_education = [0,5,6]
lista_subst1 = [4,4,4]
df_raw['EDUCATION'] = df_raw['EDUCATION'].replace(lista_education,lista_subst1)

In [13]:
# Handle inconsistent values in 'MARRIAGE'.
lista_marriage = [0]
lista_subst2 = [3]
df_raw['MARRIAGE'] = df_raw['MARRIAGE'].replace(lista_marriage,lista_subst2)

In [14]:
# Handle inconsistent values in the PAY columns (the first six matches are PAY_0 and PAY_2 to PAY_6).
lista_pay = df_raw.filter(regex='PAY').columns.tolist()[:6]
lista_pay1 = [0,-2]
lista_subst3 =[9,9] 
for i in lista_pay:
    df_raw[i]=df_raw[i].replace(lista_pay1,lista_subst3)

I am assuming that, in the features PAY_0, ..., PAY_6, the inconsistent values 0 and -2 represent customers whose payments are nine months or more late, so both are recoded as 9.
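
The recodings used in this section (EDUCATION, MARRIAGE, and the PAY columns) can also be kept in a single mapping, which makes the assumptions easier to review. The sketch below is pandas-free so the rules can be checked in isolation; `PAY_COLUMNS`, `RECODES`, and `recode_value` are hypothetical names, and the codes are the ones already used in the cells above.

```python
# Hypothetical consolidated mapping: column name -> {old code: new code}.
PAY_COLUMNS = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

RECODES = {
    'EDUCATION': {0: 4, 5: 4, 6: 4},  # undocumented codes folded into 4
    'MARRIAGE': {0: 3},               # undocumented code folded into 3
}
for col in PAY_COLUMNS:
    # Assumption stated above: 0 and -2 are treated as the code 9
    RECODES[col] = {0: 9, -2: 9}


def recode_value(column, value):
    """Return the recoded value, or the value unchanged if no rule applies."""
    return RECODES.get(column, {}).get(value, value)


print(recode_value('EDUCATION', 5))  # 4
print(recode_value('PAY_3', -2))     # 9
print(recode_value('PAY_0', -1))     # -1 (no rule, left unchanged)
```

Because `DataFrame.replace` accepts a nested dictionary of the form `{column: {old: new}}`, the same mapping could be applied in one call with `df_raw = df_raw.replace(RECODES)`.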


## Missing Values

In [16]:
# Percentage of missing values for each feature.
porcentagem_de_missings = round(df_raw.isna().sum()/df_raw.shape[0]*100,1)
porcentagem_de_missings.sort_values(ascending=False)

default payment next month    0.0
PAY_6                         0.0
LIMIT_BAL                     0.0
SEX                           0.0
EDUCATION                     0.0
MARRIAGE                      0.0
AGE                           0.0
PAY_0                         0.0
PAY_2                         0.0
PAY_3                         0.0
PAY_4                         0.0
PAY_5                         0.0
BILL_AMT1                     0.0
PAY_AMT6                      0.0
BILL_AMT2                     0.0
BILL_AMT3                     0.0
BILL_AMT4                     0.0
BILL_AMT5                     0.0
BILL_AMT6                     0.0
PAY_AMT1                      0.0
PAY_AMT2                      0.0
PAY_AMT3                      0.0
PAY_AMT4                      0.0
PAY_AMT5                      0.0
ID                            0.0
dtype: float64
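
No feature has missing values here, but if some appeared in another version of the spreadsheet, the `np.nan` markers would need to become `None` before insertion: psycopg2 sends `None` as SQL `NULL`, whereas a float NaN is sent as a floating-point value, which an integer column would not accept. A minimal sketch, with hypothetical helper names `to_db_value` and `to_db_rows`:

```python
import math


def to_db_value(value):
    """Convert one cell to a plain Python value, mapping NaN to None."""
    # NumPy scalars expose .item(), which returns the equivalent Python scalar.
    if hasattr(value, 'item'):
        value = value.item()
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def to_db_rows(rows):
    """Apply to_db_value to every cell and return a list of tuples."""
    return [tuple(to_db_value(v) for v in row) for row in rows]


rows = to_db_rows([(1, 20000, float('nan')), (2, 120000, 26)])
print(rows)  # [(1, 20000, None), (2, 120000, 26)]
```

The rows could come from `df_raw.values.tolist()`, and the result can be passed to whatever bulk-insert call is used.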