# Exercise: Data Selection
<img src='https://pandas.pydata.org/docs/_images/03_subset_rows.svg'>

- What will we do?
    - Read the data
    - Format the data
    - Select data by type
    - Reduce the DataFrame's memory usage

In [1]:
##### libraries used
import pandas as pd

# 1. Reading the Data

- [Original dataset: application_record.csv](https://www.kaggle.com/datasets/rikdifos/credit-card-approval-prediction?resource=download)
- Data for credit card approval, with the intended type of each column:
    - ID - (category)
    - CODE_GENDER - (category)
    - FLAG_OWN_CAR - (boolean)
    - FLAG_OWN_REALTY - (boolean)
    - CNT_CHILDREN - (int)
    - AMT_INCOME_TOTAL - (float)
    - NAME_INCOME_TYPE - (category)
    - NAME_EDUCATION_TYPE - (category)
    - NAME_FAMILY_STATUS - (category)
    - NAME_HOUSING_TYPE - (category)
    - DAYS_BIRTH - (int)
    - DAYS_EMPLOYED - (int)
    - FLAG_MOBIL - (boolean)
    - FLAG_WORK_PHONE - (boolean)
    - FLAG_PHONE - (boolean)
    - OCCUPATION_TYPE - (category)
    - CNT_FAM_MEMBERS - (int)

In [2]:
filename = 'data/application_record.csv'
df = pd.read_csv(filename, sep=',')
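The target types listed above can also be declared while reading, rather than converted afterwards. The following is a minimal sketch under stated assumptions: the three-row CSV is a made-up sample holding only a few of the real columns, and the choice of `int8` for `CNT_CHILDREN` is an illustrative assumption, not something the dataset prescribes.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the layout of application_record.csv
# (only a handful of columns; values chosen for illustration).
sample_csv = io.StringIO(
    "ID,CODE_GENDER,FLAG_OWN_CAR,CNT_CHILDREN,AMT_INCOME_TOTAL\n"
    "1,M,Y,0,427500.0\n"
    "2,F,N,1,270000.0\n"
    "3,F,N,0,112500.0\n"
)

# Declare the desired dtypes up front via the `dtype` parameter of read_csv.
dtypes = {
    "CODE_GENDER": "category",
    "CNT_CHILDREN": "int8",  # assumption: child counts fit in 8 bits
    "AMT_INCOME_TOTAL": "float64",
}
sample = pd.read_csv(sample_csv, dtype=dtypes)

# Y/N flags are text in the file; map them to real booleans after loading.
sample["FLAG_OWN_CAR"] = sample["FLAG_OWN_CAR"].map({"Y": True, "N": False}).astype(bool)

print(sample.dtypes)
```

Declaring dtypes at read time avoids holding the larger `object` representation in memory first, which may matter more on the full file than on a 5001-row sample.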

In [3]:
df.head()

| | ID | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5008804 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | -12005 | -4542 | 1 | 1 | 0 | 0 | NaN | 2.0 |
| 1 | 5008805 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | -12005 | -4542 | 1 | 1 | 0 | 0 | NaN | 2.0 |
| 2 | 5008806 | M | Y | Y | 0 | 112500.0 | Working | Secondary / secondary special | Married | House / apartment | -21474 | -1134 | 1 | 0 | 0 | 0 | Security staff | 2.0 |
| 3 | 5008808 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | -19110 | -3051 | 1 | 0 | 1 | 1 | Sales staff | 1.0 |
| 4 | 5008809 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | -19110 | -3051 | 1 | 0 | 1 | 1 | Sales staff | 1.0 |


In [4]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5001 entries, 0 to 5000
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   5001 non-null   int64  
 1   CODE_GENDER          5001 non-null   object 
 2   FLAG_OWN_CAR         5001 non-null   object 
 3   FLAG_OWN_REALTY      5001 non-null   object 
 4   CNT_CHILDREN         5001 non-null   int64  
 5   AMT_INCOME_TOTAL     5001 non-null   float64
 6   NAME_INCOME_TYPE     5001 non-null   object 
 7   NAME_EDUCATION_TYPE  5001 non-null   object 
 8   NAME_FAMILY_STATUS   5001 non-null   object 
 9   NAME_HOUSING_TYPE    5001 non-null   object 
 10  DAYS_BIRTH           5001 non-null   int64  
 11  DAYS_EMPLOYED        5001 non-null   int64  
 12  FLAG_MOBIL           5001 non-null   int64  
 13  FLAG_WORK_PHONE      5001 non-null   int64  
 14  FLAG_PHONE           5001 non-null   int64  
 15  FLAG_EMAIL           5001 non-null   i

Passing ``memory_usage="deep"`` to ``df.info`` (or ``deep=True`` to ``df.memory_usage``) produces a more accurate report that accounts for the full memory of the contained objects. With the default ``deep=False``, the report excludes memory consumed by elements that are not components of the array, such as the Python strings that ``object`` columns point to.
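To make the difference between the two modes concrete, here is a small sketch on a hypothetical two-column frame (the name `toy` and its values are illustrative). The text column is forced to `object` dtype so the example behaves the same regardless of the pandas default string dtype.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one text column.
toy = pd.DataFrame({
    "income": [427500.0, 270000.0, 112500.0],
    "education": pd.Series(
        ["Higher education", "Secondary / secondary special", "Higher education"],
        dtype=object,  # force object dtype so the strings live outside the array
    ),
})

shallow = toy.memory_usage(deep=False)  # counts only the arrays themselves
deep = toy.memory_usage(deep=True)      # also counts the Python string objects

# Side-by-side view, in bytes per column.
print(pd.DataFrame({"shallow": shallow, "deep": deep}))
```

For the numeric column both numbers agree; for the `object` column the deep figure is larger because it includes the strings themselves, not just the pointers to them.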

In [5]:
## Check memory usage (deep=True also counts the contents of object columns)
mem_before = df.memory_usage(deep=True).sum()
print('{} in KBs'.format(mem_before/1024))
print('{} in MBs'.format(mem_before/1024/1024))

2933.77734375 in KBs
2.8650169372558594 in MBs


# 2. Conditional Selection

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5001 entries, 0 to 5000
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   5001 non-null   int64  
 1   CODE_GENDER          5001 non-null   object 
 2   FLAG_OWN_CAR         5001 non-null   object 
 3   FLAG_OWN_REALTY      5001 non-null   object 
 4   CNT_CHILDREN         5001 non-null   int64  
 5   AMT_INCOME_TOTAL     5001 non-null   float64
 6   NAME_INCOME_TYPE     5001 non-null   object 
 7   NAME_EDUCATION_TYPE  5001 non-null   object 
 8   NAME_FAMILY_STATUS   5001 non-null   object 
 9   NAME_HOUSING_TYPE    5001 non-null   object 
 10  DAYS_BIRTH           5001 non-null   int64  
 11  DAYS_EMPLOYED        5001 non-null   int64  
 12  FLAG_MOBIL           5001 non-null   int64  
 13  FLAG_WORK_PHONE      5001 non-null   int64  
 14  FLAG_PHONE           5001 non-null   int64  
 15  FLAG_EMAIL           5001 non-null   i

In [7]:
# Collect every object column, to be converted to category
cols_cat = list(df.select_dtypes(include='object').columns)

In [8]:
# Convert all object columns to category
df[cols_cat] = df[cols_cat].astype('category')
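The select-then-convert pattern can be tried on a small frame to see its effect in isolation. This is a sketch, assuming a hypothetical frame `toy` with two repetitive text columns; it uses `astype` with a column-to-dtype mapping, which is one of several equivalent ways to write the conversion.

```python
import pandas as pd

# Hypothetical frame: two low-cardinality text columns and one integer column.
toy = pd.DataFrame({
    "CODE_GENDER": pd.Series(["M", "F", "F", "M"] * 250, dtype=object),
    "NAME_FAMILY_STATUS": pd.Series(["Married", "Single / not married"] * 500, dtype=object),
    "CNT_CHILDREN": [0, 1, 2, 0] * 250,
})

before = toy.memory_usage(deep=True).sum()

# Select columns by dtype, then convert only those columns.
text_cols = toy.select_dtypes(include="object").columns
toy = toy.astype({col: "category" for col in text_cols})

after = toy.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

A categorical column stores each distinct string once plus a small integer code per row, so columns with few distinct values shrink the most.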

In [9]:
# ID to categorical
df['ID'] = df['ID'].astype('category')
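One caveat worth checking for identifier columns: when nearly every value is distinct, a categorical has to store all the distinct values plus one code per row, so it can take more memory than the plain integers. The sketch below uses a hypothetical series of 5001 consecutive IDs (not the real ones) and also shows `int32` as a possible alternative, assuming the IDs fit in 32 bits.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ID column: 5001 distinct integers.
ids = pd.Series(np.arange(5_008_804, 5_008_804 + 5001), dtype="int64")

as_int64 = ids.memory_usage(index=False, deep=True)
as_category = ids.astype("category").memory_usage(index=False, deep=True)
as_int32 = ids.astype("int32").memory_usage(index=False, deep=True)

print(f"int64:    {as_int64} bytes")
print(f"category: {as_category} bytes")
print(f"int32:    {as_int32} bytes")
```

Treating the ID as a category still makes sense semantically (it is a label, not a quantity); the sketch only shows that the memory argument does not apply to high-cardinality columns.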

In [10]:
## CNT_FAM_MEMBERS to integer
df['CNT_FAM_MEMBERS'] = df['CNT_FAM_MEMBERS'].astype('int')
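The cast to `int` works here because the column has no missing values. As a hedged sketch of what to do otherwise, the example below uses a hypothetical float series with one missing entry: the plain cast fails, the nullable `Int64` dtype keeps the gap, and `pd.to_numeric` with `downcast="integer"` picks the smallest integer type for complete data.

```python
import pandas as pd

# Hypothetical family-size values stored as floats, one of them missing.
members = pd.Series([2.0, 2.0, 1.0, None])

# A plain integer cast cannot represent NaN and raises a ValueError.
try:
    members.astype("int")
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable integer dtype keeps the missing entry as <NA>.
members_nullable = members.astype("Int64")

# With complete data, downcast to the smallest signed integer that fits.
members_small = pd.to_numeric(members.dropna(), downcast="integer")

print(cast_failed, members_nullable.dtype, members_small.dtype)
```

Checking `isna().any()` before an `astype('int')` call is a cheap way to find out which of the two paths applies.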

In [11]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5001 entries, 0 to 5000
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   ID                   5001 non-null   category
 1   CODE_GENDER          5001 non-null   category
 2   FLAG_OWN_CAR         5001 non-null   category
 3   FLAG_OWN_REALTY      5001 non-null   category
 4   CNT_CHILDREN         5001 non-null   int64   
 5   AMT_INCOME_TOTAL     5001 non-null   float64 
 6   NAME_INCOME_TYPE     5001 non-null   category
 7   NAME_EDUCATION_TYPE  5001 non-null   category
 8   NAME_FAMILY_STATUS   5001 non-null   category
 9   NAME_HOUSING_TYPE    5001 non-null   category
 10  DAYS_BIRTH           5001 non-null   int64   
 11  DAYS_EMPLOYED        5001 non-null   int64   
 12  FLAG_MOBIL           5001 non-null   int64   
 13  FLAG_WORK_PHONE      5001 non-null   int64   
 14  FLAG_PHONE           5001 non-null   int64   
 15  FLAG_EMAIL           

In [12]:
## Check memory usage (shallow: deep=False is the default)
mem_after = df.memory_usage().sum()
mem_after

564278

In [13]:
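# mem_before holds the pre-conversion total measured with deep=True; do not overwrite it.
# Measure the converted frame with deep=True as well so the two values are comparable.
mem_after_deep = df.memory_usage(deep=True).sum()
mem_reduce = mem_before - mem_after_deep
print('{} in KBs'.format(mem_reduce/1024))
print('{} in MBs'.format(mem_reduce/1024/1024))

2380.0498046875 in KBs
2.3242673873901367 in MBs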
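A before/after comparison is only meaningful when both measurements use the same `deep` setting and the "before" value is taken prior to any conversion. One way to keep that consistent is a small helper; `deep_kb` and the frame `toy` below are hypothetical names introduced for this sketch.

```python
import pandas as pd


def deep_kb(frame: pd.DataFrame) -> float:
    """Total memory of `frame` in KB, including object contents (deep=True)."""
    return frame.memory_usage(deep=True).sum() / 1024


# Hypothetical frame with one repetitive text column.
toy = pd.DataFrame({
    "NAME_INCOME_TYPE": pd.Series(["Working", "Commercial associate"] * 1000, dtype=object),
})

kb_before = deep_kb(toy)  # measured before the conversion
toy["NAME_INCOME_TYPE"] = toy["NAME_INCOME_TYPE"].astype("category")
kb_after = deep_kb(toy)   # measured the same way afterwards

saved = kb_before - kb_after
print(f"saved {saved:.1f} KB ({saved / kb_before:.0%})")
```

Routing every measurement through one function makes it harder to mix a shallow number with a deep one by accident.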


In [14]:
## Check the data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5001 entries, 0 to 5000
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   ID                   5001 non-null   category
 1   CODE_GENDER          5001 non-null   category
 2   FLAG_OWN_CAR         5001 non-null   category
 3   FLAG_OWN_REALTY      5001 non-null   category
 4   CNT_CHILDREN         5001 non-null   int64   
 5   AMT_INCOME_TOTAL     5001 non-null   float64 
 6   NAME_INCOME_TYPE     5001 non-null   category
 7   NAME_EDUCATION_TYPE  5001 non-null   category
 8   NAME_FAMILY_STATUS   5001 non-null   category
 9   NAME_HOUSING_TYPE    5001 non-null   category
 10  DAYS_BIRTH           5001 non-null   int64   
 11  DAYS_EMPLOYED        5001 non-null   int64   
 12  FLAG_MOBIL           5001 non-null   int64   
 13  FLAG_WORK_PHONE      5001 non-null   int64   
 14  FLAG_PHONE           5001 non-null   int64   
 15  FLAG_EMAIL           

In [15]:
# Check for null values
df.isna().sum()

ID                        0
CODE_GENDER               0
FLAG_OWN_CAR              0
FLAG_OWN_REALTY           0
CNT_CHILDREN              0
AMT_INCOME_TOTAL          0
NAME_INCOME_TYPE          0
NAME_EDUCATION_TYPE       0
NAME_FAMILY_STATUS        0
NAME_HOUSING_TYPE         0
DAYS_BIRTH                0
DAYS_EMPLOYED             0
FLAG_MOBIL                0
FLAG_WORK_PHONE           0
FLAG_PHONE                0
FLAG_EMAIL                0
OCCUPATION_TYPE        1516
CNT_FAM_MEMBERS           0
dtype: int64
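The null count above points at `OCCUPATION_TYPE` as the only incomplete column. A boolean mask built from `isna()` selects exactly those rows. The sketch below reuses three ID and occupation pairs from the `df.head()` output; the frame name `toy` is illustrative.

```python
import pandas as pd

# Three rows taken from the head of the dataset shown earlier.
toy = pd.DataFrame({
    "ID": [5008804, 5008806, 5008808],
    "OCCUPATION_TYPE": [None, "Security staff", "Sales staff"],
})

# Boolean mask: True where the occupation is missing.
is_missing = toy["OCCUPATION_TYPE"].isna()

missing_rows = toy[is_missing]    # rows without an occupation
complete_rows = toy[~is_missing]  # rows with an occupation
missing_share = is_missing.mean() # fraction of rows that are missing

print(missing_rows)
print(f"{missing_share:.1%} of rows have no occupation")
```

The same mask works on the full `df`, where `is_missing.sum()` should match the count reported by `df.isna().sum()`.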