<a href="https://colab.research.google.com/github/heber-augusto/sus-kpis-analysis/blob/main/sia/indicadores_monitor_rosa_sia_pa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This notebook was developed as part of the Monitor Rosa project, a set of tools, software, and other artifacts whose main goal is to improve breast cancer diagnosis in Brazil, ideally by reducing the ratio of late diagnoses to early diagnoses.

The files used in this notebook were collected with the scripts in [this repository](https://github.com/heber-augusto/devops-pysus-get-files). The files are stored in a Google Storage bucket as gzip-compressed parquet files.

# Installing libraries and packages for reading files

## Initial setup for connecting to the Google Storage bucket
 - Google Colab authentication
 - Project name definition

 To run the commands in this section, the file gcp-leitura.json must be placed in the Colab root folder.

In [2]:
from google.colab import auth
auth.authenticate_user()

# project id
project_id = 'teak-ellipse-317015'
# id of the bucket inside the project
bucket_id = 'observatorio-oncologia'

# name of the local folder used as the mount point
local_folder_name = 'colab'

# name of the project folder
project_folder_name = 'monitor'

## Installation required to mount the bucket folder
Installs gcsfuse to map the bucket folder in Google Colab.

In [3]:
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100  2537  100  2537    0     0  59000      0 --:--:-- --:--:-- --:--:-- 59000
OK
25 packages can be upgraded. Run 'apt list --upgradable' to see them.
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  gcsfuse
0 upgraded, 1 newly installed, 0 to remove and 25 not upgraded.
Need to get 12.1 MB of archives.
After this operation, 27.5 MB of additional disk space will be used.
Selecting previously unselected package gcsfuse.
(Reading database ... 155680 files and directories currently installed.)
Preparing to unpack .../gcsfuse_0.41.5_amd64.deb ...
Unpacking gcsfuse (0.41.5) ...
Setting up gcsfuse (0.41.5) ...


## Name/path of the JSON file containing the Google Storage access credentials

In [4]:
serice_account_json = '/content/gcp-leitura.json'

## Mounting the bucket in a local Google Colab folder

In [5]:
!mkdir {local_folder_name}
!gcsfuse --key-file {serice_account_json} --implicit-dirs {bucket_id} {local_folder_name}

2022/08/06 14:38:31.351524 Start gcsfuse/0.41.5 (Go version go1.18.4) for app "" using mount point: /content/colab
2022/08/06 14:38:31.369845 Opening GCS connection...
2022/08/06 14:38:32.425876 Mounting file system "observatorio-oncologia"...
2022/08/06 14:38:32.462019 File system has been successfully mounted.


## Helper function for collecting files

In [6]:
import glob
import os

def get_files(state, year, month, file_type, file_group):
    initial_path = os.path.join(r'/content/',local_folder_name,project_folder_name)
    internal_folder = f"""{state}/{year}/{month}/{file_type}/{file_group}"""
    # print(f"{initial_path}/{internal_folder}/*.parquet.gzip")
    return glob.glob(f"{initial_path}/{internal_folder}/*.parquet.gzip")    
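As a rough illustration of the path pattern that `get_files` builds, the sketch below recreates the `state/year/month/file_type/file_group` layout in a temporary folder and globs it the same way. The temporary folder, the helper name `get_files_in`, and the file name `example.parquet.gzip` are made up for this example; only the folder layout and the `*.parquet.gzip` suffix come from the function above.

```python
import glob
import os
import tempfile

def get_files_in(root, state, year, month, file_type, file_group):
    """Same glob pattern as get_files, but rooted at an arbitrary folder."""
    internal_folder = f"{state}/{year}/{month}/{file_type}/{file_group}"
    return glob.glob(f"{root}/{internal_folder}/*.parquet.gzip")

# Hypothetical stand-in for /content/colab/monitor
root = tempfile.mkdtemp()
folder = os.path.join(root, "SP", "2008", "01", "SIA", "PA")
os.makedirs(folder)
# Empty placeholder file; the name is invented for this example
open(os.path.join(folder, "example.parquet.gzip"), "w").close()

found = get_files_in(root, "SP", "2008", "01", "SIA", "PA")
missing = get_files_in(root, "SP", "2008", "02", "SIA", "PA")
print(len(found), len(missing))
```

A month or file group with no matching files simply yields an empty list, so missing periods do not raise an error at this stage.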

# SIA PA, AQ and AR: reading, filtering, and initial transformation of the files

## Filters relevant to the breast cancer context:

### SIH: filter on the DIAG_PRINC variable (4 characters)
 * Filter: C500, C501, C502, C503, C504, C505, C506, C508 and C509

### RHC: filter on the LOCTUDET variable (3 characters)
 * Filter: C50

### SIA – Chemotherapy and Radiotherapy APAC (AQ and AR)
Filter on the AP_CIDPRI variable (4 characters)
 * Filter: C500, C501, C502, C503, C504, C505, C506, C508 and C509

### SIA – Outpatient procedures (PA)

The outpatient procedure files are slightly different, for one reason: the person may already have a diagnosis and be undergoing a procedure, OR the person may be undergoing an exam for diagnostic purposes (mammography, ultrasound, etc.). In this case, two filters are possible:

Filter on the PA_CIDPRI variable (4 characters)
 * Filter: C500, C501, C502, C503, C504, C505, C506, C508 and C509

Filter on the outpatient procedure code variable “PA_PROC_ID” (10 characters)
 * Filters:
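  * 0201010569 BIOPSIA/EXERESE DE NÓDULO DE MAMA (biopsy/excision of breast nodule)
  * 0201010585 PUNÇÃO ASPIRATIVA DE MAMA POR AGULHA FINA (fine-needle aspiration of the breast)
  * 0201010607 PUNÇÃO DE MAMA POR AGULHA GROSSA (large-needle puncture of the breast)
  * 0203010035 EXAME DE CITOLOGIA (EXCETO CERVICO-VAGINAL E DE MAMA) (cytology exam, except cervicovaginal and breast)
  * 0203010043 EXAME CITOPATOLOGICO DE MAMA (breast cytopathology exam)
  * 0203020065 EXAME ANATOMOPATOLOGICO DE MAMA - BIOPSIA (breast anatomic pathology exam, biopsy)
  * 0203020073 EXAME ANATOMOPATOLOGICO DE MAMA - PECA CIRURGICA (breast anatomic pathology exam, surgical specimen)
  * 0205020097 ULTRASSONOGRAFIA MAMARIA BILATERAL (bilateral breast ultrasound)
  * 0208090037 CINTILOGRAFIA DE MAMA (BILATERAL) (breast scintigraphy, bilateral)
  * 0204030030 MAMOGRAFIA (mammography)
  * 0204030188 MAMOGRAFIA BILATERAL PARA RASTREAMENTO (bilateral screening mammography)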
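
The two filters can be combined in more than one way, and the choice changes which records survive. The sketch below, on a made-up four-row dataframe, contrasts requiring both conditions with accepting either one. The toy rows and the values that fall outside the lists (`0000`, `X000`, `9999999999`) are assumptions for illustration only. The `filter_pa_content` function defined later in this notebook uses the `&` form.

```python
import pandas as pd

# Toy rows: only the C509 and 02040301xx codes come from the lists above
toy = pd.DataFrame({
    "PA_CIDPRI": ["C509", "C509", "0000", "X000"],
    "PA_PROC_ID": ["0204030030", "9999999999", "0204030188", "9999999999"],
})

cid_codes = ["C500", "C501", "C502", "C503", "C504", "C505", "C506", "C508", "C509"]
proc_codes = ["0204030030", "0204030188"]  # subset of the procedure list, for brevity

cid_mask = toy["PA_CIDPRI"].isin(cid_codes)
proc_mask = toy["PA_PROC_ID"].isin(proc_codes)

both = toy[cid_mask & proc_mask]    # breast cancer CID AND listed procedure
either = toy[cid_mask | proc_mask]  # breast cancer CID OR listed procedure
print(len(both), len(either))
```

With `&`, a screening mammography recorded without a breast cancer CID is dropped; with `|`, it is kept.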

## Libraries used in the exploratory analysis

In [7]:
import pandas as pd
import numpy as np 

## Filter variables

In [8]:
# filter by CID (ICD-10 code)
cid_filter = ['C500', 'C501', 'C502', 'C503', 'C504', 'C505', 'C506', 'C508', 'C509']

# dictionary of procedures
proc_id_dict = {
    '0201010569': 'BIOPSIA/EXERESE DE NÓDULO DE MAMA',
    '0201010585': 'PUNÇÃO ASPIRATIVA DE MAMA POR AGULHA FINA',
    '0201010607': 'PUNÇÃO DE MAMA POR AGULHA GROSSA',
    '0203010035': 'EXAME DE CITOLOGIA (EXCETO CERVICO-VAGINAL E DE MAMA)',
    '0203010043': 'EXAME CITOPATOLOGICO DE MAMA',
    '0203020065': 'EXAME ANATOMOPATOLOGICO DE MAMA - BIOPSIA',
    '0203020073': 'EXAME ANATOMOPATOLOGICO DE MAMA - PECA CIRURGICA',
    '0205020097': 'ULTRASSONOGRAFIA MAMARIA BILATERAL',
    '0208090037': 'CINTILOGRAFIA DE MAMA (BILATERAL)',
    '0204030030': 'MAMOGRAFIA',
    '0204030188': 'MAMOGRAFIA BILATERAL PARA RASTREAMENTO'
    }
proc_id_filter = list(proc_id_dict.keys())


## Filter functions for SIA PA, AQ and AR files

In [9]:
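def filter_pa_content(df):
    """
    Keep the PA rows whose primary CID (PA_CIDPRI) is in cid_filter
    and whose procedure code (PA_PROC_ID) is in proc_id_filter.
    """
    return df[df.PA_CIDPRI.isin(cid_filter) &
              df.PA_PROC_ID.isin(proc_id_filter)]

def filter_ar_content(df):
    """
    Keep the AR/AQ rows whose primary CID (AP_CIDPRI) is in cid_filter.
    """
    return df[df.AP_CIDPRI.isin(cid_filter)]

# AQ files use the same AP_CIDPRI filter as AR files
filter_aq_content = filter_ar_content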


## Function for combining several files into a single dataframe

In [10]:
def create_cancer_dataframe(file_paths, filter_function=filter_pa_content):
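    """
    Read each parquet file, apply filter_function to its content, and
    concatenate the filtered rows into a single dataframe.
    """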
    filtered_contents = [
      filter_function(pd.read_parquet(file_path))
      for file_path in file_paths
      ]

    return pd.concat(
        filtered_contents, 
        ignore_index=True)



## Function for returning the list of files (full paths)

In [11]:
def get_file_paths(states, years, months, file_type, file_group):
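    """
    Collect the paths of all files found for every combination of
    state, year, and month, for the given file type and file group.
    """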
    file_paths = []
    for state in states:
        for year in years:
            for month in months:
                file_paths.extend(
                    get_files(
                        state,
                        year,
                        month,
                        file_type,
                        file_group)
                )
    return file_paths
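
The three nested loops visit every (state, year, month) combination. One possible alternative, sketched below, expresses the same traversal with `itertools.product`. The names `get_file_paths_with` and `fake_get_files` are hypothetical and exist only so the example runs without the mounted bucket.

```python
from itertools import product

def get_file_paths_with(get_files_fn, states, years, months, file_type, file_group):
    """Same traversal as get_file_paths, written with itertools.product."""
    file_paths = []
    for state, year, month in product(states, years, months):
        file_paths.extend(get_files_fn(state, year, month, file_type, file_group))
    return file_paths

# Hypothetical stand-in for get_files: returns one fake path per combination
def fake_get_files(state, year, month, file_type, file_group):
    return [f"{state}/{year}/{month}/{file_type}/{file_group}/x.parquet.gzip"]

paths = get_file_paths_with(
    fake_get_files, ["SP"], ["2008", "2009"], ["01", "02"], "SIA", "PA")
print(paths[0], len(paths))
```

`product` iterates with the last argument varying fastest, so the visiting order matches the nested loops (months inside years inside states).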

## SIA PA: reading and combining data for the desired period

### State, years, and months to read and process

In [34]:
states = ['SP',]
start_year = 2008
end_year = 2022
years  = [f'{year}' for year in range(start_year, end_year + 1)]
months = [f'{month + 1:02d}' for month in range(12)]
file_type = 'SIA'

### Build the list of files to read

In [35]:
file_paths_by_type = {}

# Outpatient production files
file_paths_by_type['PA'] = get_file_paths(
    states,
    years,
    months,
    file_type,
    'PA'
)

# Radiotherapy files
file_paths_by_type['AR'] = get_file_paths(
    states,
    years,
    months,
    file_type,
    'AR'
)

# Chemotherapy files
file_paths_by_type['AQ'] = get_file_paths(
    states,
    years,
    months,
    file_type,
    'AQ'
)

In [36]:
print(f"""Found {len(file_paths_by_type['PA'])} outpatient production files""")
print(f"""Found {len(file_paths_by_type['AR'])} radiotherapy files""")
print(f"""Found {len(file_paths_by_type['AQ'])} chemotherapy files""")

Found 317 outpatient production files
Found 170 radiotherapy files
Found 170 chemotherapy files
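
A quick plausibility check on these counts is sketched below. It assumes one AQ file and one AR file per state and month, which the source does not state, and data available through February 2022, which is the last month visible in the AQ and AR tables in this notebook.

```python
# Assumption: one file per month for a single state (SP)
full_years = range(2008, 2022)   # 2008..2021, twelve months each
months_in_2022 = 2               # January and February

expected_files = len(full_years) * 12 + months_in_2022
print(expected_files)
```

Under these assumptions the result matches the 170 files reported for AR and AQ. The PA count (317) is higher, which would be consistent with some months being split across several files, although the notebook does not confirm this.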


## Create a single dataframe from the filtered contents

In [37]:
cancer_dataframe_pa = create_cancer_dataframe(file_paths_by_type['PA'], filter_function=filter_pa_content)
cancer_dataframe_aq = create_cancer_dataframe(file_paths_by_type['AQ'], filter_function=filter_aq_content)
cancer_dataframe_ar = create_cancer_dataframe(file_paths_by_type['AR'], filter_function=filter_ar_content)

## PA dataframe

In [38]:
cancer_dataframe_pa

Unnamed: 0,PA_CODUNI,PA_GESTAO,PA_CONDIC,PA_UFMUN,PA_REGCT,PA_INCOUT,PA_INCURG,PA_TPUPS,PA_TIPPRE,PA_MN_IND,...,PA_CODOCO,PA_FLQT,PA_FLER,PA_ETNIA,PA_VL_CF,PA_VL_CL,PA_VL_INC,PA_SRV_C,PA_INE,PA_NAT_JUR
0,2081377,350000,EP,355710,0000,0000,0000,05,61,I,...,1,K,0,,,,,,,
1,2081377,350000,EP,355710,0000,0000,0000,05,61,I,...,1,K,0,,,,,,,
2,2081377,350000,EP,355710,0000,0000,0000,05,61,I,...,1,K,0,,,,,,,
3,2081377,350000,EP,355710,0000,0000,0000,05,61,I,...,1,K,0,,,,,,,
4,2081377,350000,EP,355710,0000,0000,0000,05,61,I,...,1,K,0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
368989,2027240,355030,PG,355030,0000,0000,0000,05,00,M,...,1,K,0,,0.0,0.0,0.0,121002,,1031
368990,2027240,355030,PG,355030,0000,0000,0000,05,00,M,...,1,K,0,,0.0,0.0,0.0,121002,,1031
368991,2027240,355030,PG,355030,0000,0000,0000,05,00,M,...,1,K,0,,0.0,0.0,0.0,121002,,1031
368992,2027240,355030,PG,355030,0000,0000,0000,05,00,M,...,1,K,0,,0.0,0.0,0.0,121002,,1031


In [39]:
cancer_dataframe_pa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 368994 entries, 0 to 368993
Data columns (total 60 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   PA_CODUNI   368994 non-null  object
 1   PA_GESTAO   368994 non-null  object
 2   PA_CONDIC   368994 non-null  object
 3   PA_UFMUN    368994 non-null  object
 4   PA_REGCT    368994 non-null  object
 5   PA_INCOUT   368994 non-null  object
 6   PA_INCURG   368994 non-null  object
 7   PA_TPUPS    368994 non-null  object
 8   PA_TIPPRE   368994 non-null  object
 9   PA_MN_IND   368994 non-null  object
 10  PA_CNPJCPF  368994 non-null  object
 11  PA_CNPJMNT  368994 non-null  object
 12  PA_CNPJ_CC  368994 non-null  object
 13  PA_MVM      368994 non-null  object
 14  PA_CMP      368994 non-null  object
 15  PA_PROC_ID  368994 non-null  object
 16  PA_TPFIN    368994 non-null  object
 17  PA_SUBFIN   368994 non-null  object
 18  PA_NIVCPL   368994 non-null  object
 19  PA_DOCORIG  368994 non-

## AQ dataframe

In [40]:
cancer_dataframe_aq

Unnamed: 0,AP_MVM,AP_CONDIC,AP_GESTAO,AP_CODUNI,AP_AUTORIZ,AP_CMP,AP_PRIPAL,AP_VL_AP,AP_UFMUN,AP_TPUPS,...,AQ_DTINI2,AQ_CIDINI3,AQ_DTINI3,AQ_CONTTR,AQ_DTINTR,AQ_ESQU_P1,AQ_TOTMPL,AQ_TOTMAU,AQ_ESQU_P2,AP_NATJUR
0,200801,EP,350000,2025507,3508205574489,200801,0304050130,213.4,352900,05,...,,,,S,20070801,TAMOX,060,060,,
1,200801,EP,350000,2025507,3508205574500,200801,0304050040,79.75,352900,05,...,20061101,C500,20070101,S,20070606,TAMOX,060,060,,
2,200801,EP,350000,2025507,3508205574621,200801,0304020354,147.1,352900,05,...,19960301,C508,20040901,S,20070101,EXAME,060,060,,
3,200801,EP,350000,2773171,3508203923345,200801,0304020141,2828.4,353060,39,...,,,,N,20080111,ESQUE,008,000,,
4,200801,EP,350000,2773171,3508204190227,200801,0304020338,301.5,353060,39,...,,,,N,20070213,ESQUE,012,010,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5415650,202202,EP,350000,2077396,3522222582505,202201,0304020435,2149.5,354980,05,...,,,,S,20220128,CARBO,005,000,"PLATINA, D",3069
5415651,202202,EP,350000,2688689,3521272063815,202111,0304020435,0.0,355030,05,...,20190903,C504,20200205,S,20211123,PERTU,005,000,/TRASTU,3999
5415652,202202,EP,350000,2705982,3522218907922,202202,0304020435,1700.0,351620,05,...,,,,N,20220101,Docet,010,000,axelHercPe,3069
5415653,202202,EP,350000,2079798,3522217105429,202202,0304020443,34.1,350950,05,...,,,,S,20210318,HERCE,017,003,PTIN,1112


In [41]:
cancer_dataframe_aq.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5415655 entries, 0 to 5415654
Data columns (total 64 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   AP_MVM      object
 1   AP_CONDIC   object
 2   AP_GESTAO   object
 3   AP_CODUNI   object
 4   AP_AUTORIZ  object
 5   AP_CMP      object
 6   AP_PRIPAL   object
 7   AP_VL_AP    object
 8   AP_UFMUN    object
 9   AP_TPUPS    object
 10  AP_TIPPRE   object
 11  AP_MN_IND   object
 12  AP_CNPJCPF  object
 13  AP_CNPJMNT  object
 14  AP_CNSPCN   object
 15  AP_COIDADE  object
 16  AP_NUIDADE  object
 17  AP_SEXO     object
 18  AP_RACACOR  object
 19  AP_MUNPCN   object
 20  AP_UFNACIO  object
 21  AP_CEPPCN   object
 22  AP_UFDIF    object
 23  AP_MNDIF    object
 24  AP_DTINIC   object
 25  AP_DTFIM    object
 26  AP_TPATEN   object
 27  AP_TPAPAC   object
 28  AP_MOTSAI   object
 29  AP_OBITO    object
 30  AP_ENCERR   object
 31  AP_PERMAN   object
 32  AP_ALTA     object
 33  AP_TRANSF   object
 34  AP_DTOCOR   object

## AR dataframe

In [42]:
cancer_dataframe_ar

Unnamed: 0,AP_MVM,AP_CONDIC,AP_GESTAO,AP_CODUNI,AP_AUTORIZ,AP_CMP,AP_PRIPAL,AP_VL_AP,AP_UFMUN,AP_TPUPS,...,AR_NUMC1,AR_INIAR1,AR_INIAR2,AR_INIAR3,AR_FIMAR1,AR_FIMAR2,AR_FIMAR3,AR_NUMC2,AR_NUMC3,AP_NATJUR
0,200801,EP,350000,2790556,3508205036259,200801,0304010090,940.68,350600,05,...,84,20080102,,,20080331,,,,,
1,200801,EP,350000,2748223,3508205171460,200801,0304010294,247.92,350750,05,...,60,20080128,,,20080331,,,,,
2,200801,EP,350000,2083086,3508205087112,200801,0304010090,335.24,352530,07,...,002,20080122,,,20080331,,,000,000,
3,200801,EP,350000,2090236,3508211775740,200801,0304010286,1057.72,350550,07,...,003,20071220,,,20080529,,,,,
4,200801,PG,350320,2082527,3507229450164,200801,0304010286,2179.92,350320,05,...,04,20080101,,,20080331,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190814,202202,PG,354340,2080400,3522226642946,202202,0304010413,5904.0,354340,05,...,,20220112,,,20220331,,,,,3999
190815,202202,PG,354870,2025361,3522230090764,202202,0304010413,5904.0,354870,05,...,,20220210,,,20220430,,,,,1244
190816,202202,PG,355030,2077590,3522221932493,202202,0304010413,5904.0,355030,07,...,,20220201,,,20220430,,,,,3999
190817,202202,PG,355030,2077590,3522221932559,202202,0304010413,5904.0,355030,07,...,,20220201,,,20220430,,,,,3999


In [43]:
cancer_dataframe_ar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190819 entries, 0 to 190818
Data columns (total 74 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   AP_MVM      190819 non-null  object
 1   AP_CONDIC   190819 non-null  object
 2   AP_GESTAO   190819 non-null  object
 3   AP_CODUNI   190819 non-null  object
 4   AP_AUTORIZ  190819 non-null  object
 5   AP_CMP      190819 non-null  object
 6   AP_PRIPAL   190819 non-null  object
 7   AP_VL_AP    190819 non-null  object
 8   AP_UFMUN    190819 non-null  object
 9   AP_TPUPS    190819 non-null  object
 10  AP_TIPPRE   190819 non-null  object
 11  AP_MN_IND   190819 non-null  object
 12  AP_CNPJCPF  190819 non-null  object
 13  AP_CNPJMNT  190819 non-null  object
 14  AP_CNSPCN   190819 non-null  object
 15  AP_COIDADE  190819 non-null  object
 16  AP_NUIDADE  190819 non-null  object
 17  AP_SEXO     190819 non-null  object
 18  AP_RACACOR  190819 non-null  object
 19  AP_MUNPCN   190819 non-

# Building the patient exam dataset (1 row per patient)

Columns:

 - Patient key (cns_encrypted)
 - Total treatment cost
 - Initial staging
 - Final staging
 - Death indication
 - Place of residence

## Proposed solution:

 - File types to be used: AQ and AR
 - Patient key:
  - AQ: column AP_CNSPCN
  - AR: column AP_CNSPCN
 - Total treatment cost: an estimate that considers only radiotherapy and chemotherapy, computed by summing the procedure values (present in AR and AQ) for each patient key, where:
   - costs in AQ: sum of AP_VL_AP;
   - costs in AR: sum of AP_VL_AP.
 - Initial staging: the staging value (AQ: AQ_ESTADI, AR: AR_ESTADI) from the oldest radiotherapy or chemotherapy record of a given patient;
 - Final staging: the staging value (AQ: AQ_ESTADI, AR: AR_ESTADI) from the most recent radiotherapy or chemotherapy record of a given patient;
 - Place of residence: use column AP_MUNPCN (present in AR and AQ). Possibly create two fields: AP_MUNPCN from the oldest record and AP_MUNPCN from the most recent record;
 - Death indication: maximum value of the AP_OBITO field, present in AQ and AR (0: no indication, 1: death indicated), so a value of 1 indicates death.
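
The proposal above can be checked by hand on a tiny example. The sketch below applies it to four made-up records (two patients, `A` and `B`); every value is invented for illustration. The `kind="stable"` argument is an addition of this sketch: it keeps records from the same month in their original relative order, which the default sort algorithm does not guarantee.

```python
import pandas as pd

# Made-up records: 'data' holds the YYYYMM month of each record
toy = pd.DataFrame({
    "data":         ["200903", "200801", "200905", "201001"],
    "paciente":     ["A", "A", "A", "B"],
    "estadiamento": [3, 2, 2, 1],
    "custo":        [100.0, 50.0, 25.0, 10.0],
    "municipio":    ["355030", "352900", "355030", "350950"],
    "obito":        [0, 0, 1, 0],
})

per_patient = (
    toy.sort_values(by="data", kind="stable")  # oldest record first
       .groupby("paciente")
       .agg(
           primeiro_estadiamento=("estadiamento", "first"),
           ultimo_estadiamento=("estadiamento", "last"),
           custo_total=("custo", "sum"),
           primeiro_municipio=("municipio", "first"),
           ultimo_municipio=("municipio", "last"),
           indicacao_obito=("obito", "max"),
       )
       .reset_index()
)
print(per_patient)
```

For patient `A`, the oldest record (200801) supplies the initial staging and municipality, the most recent one (200905) supplies the final values, the costs add up to 175.0, and the single record with `obito` equal to 1 is enough to set the death indication.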

In [44]:
cancer_dataframe_ar.head()

Unnamed: 0,AP_MVM,AP_CONDIC,AP_GESTAO,AP_CODUNI,AP_AUTORIZ,AP_CMP,AP_PRIPAL,AP_VL_AP,AP_UFMUN,AP_TPUPS,...,AR_NUMC1,AR_INIAR1,AR_INIAR2,AR_INIAR3,AR_FIMAR1,AR_FIMAR2,AR_FIMAR3,AR_NUMC2,AR_NUMC3,AP_NATJUR
0,200801,EP,350000,2790556,3508205036259,200801,304010090,940.68,350600,5,...,84,20080102,,,20080331,,,,,
1,200801,EP,350000,2748223,3508205171460,200801,304010294,247.92,350750,5,...,60,20080128,,,20080331,,,,,
2,200801,EP,350000,2083086,3508205087112,200801,304010090,335.24,352530,7,...,2,20080122,,,20080331,,,0.0,0.0,
3,200801,EP,350000,2090236,3508211775740,200801,304010286,1057.72,350550,7,...,3,20071220,,,20080529,,,,,
4,200801,PG,350320,2082527,3507229450164,200801,304010286,2179.92,350320,5,...,4,20080101,,,20080331,,,,,


## Contents of the date columns

In [45]:
cancer_dataframe_ar[['AP_CMP','AP_DTINIC', 'AP_DTFIM', 'AP_DTOCOR']]

Unnamed: 0,AP_CMP,AP_DTINIC,AP_DTFIM,AP_DTOCOR
0,200801,20080102,20080331,20080128
1,200801,20080128,20080331,
2,200801,20080122,20080331,
3,200801,20080102,20080331,
4,200801,20080101,20080331,
...,...,...,...,...
190814,202202,20220101,20220331,20220221
190815,202202,20220210,20220430,20220225
190816,202202,20220201,20220430,20220211
190817,202202,20220201,20220430,20220211
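
The date columns are stored as strings: `AP_CMP` looks like `YYYYMM` and the others like `YYYYMMDD`, with some `AP_DTOCOR` values missing. If real date values are needed, one option is `pd.to_datetime` with an explicit format, as sketched below on made-up values. Whether the missing entries are empty strings or nulls in the real files is an assumption here, so the sketch covers both.

```python
import pandas as pd

# Values in the style of AP_CMP (YYYYMM) and AP_DTOCOR (YYYYMMDD, sometimes missing)
ap_cmp = pd.Series(["200801", "202202"])
ap_dtocor = pd.Series(["20080128", "", None])

cmp_dates = pd.to_datetime(ap_cmp, format="%Y%m")
# errors="coerce" turns values that cannot be parsed into NaT
ocor_dates = pd.to_datetime(ap_dtocor, format="%Y%m%d", errors="coerce")

print(cmp_dates.dt.year.tolist())
print(ocor_dates.isna().tolist())
```

Because the values are fixed-width and zero-padded, sorting them as plain strings already gives chronological order, which is what the later `sort_values(by=['data'])` step relies on.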


## Converting column types
 - cost (AP_VL_AP) to double
 - death indication (AP_OBITO) to integer.

In [46]:
cancer_dataframe_aq['custo'] = cancer_dataframe_aq['AP_VL_AP'].astype(np.double)
cancer_dataframe_ar['custo'] = cancer_dataframe_ar['AP_VL_AP'].astype(np.double)
cancer_dataframe_aq['obito'] = cancer_dataframe_aq['AP_OBITO'].astype(np.int64)
cancer_dataframe_ar['obito'] = cancer_dataframe_ar['AP_OBITO'].astype(np.int64)
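
`astype` raises an error if any value cannot be converted. Should the source files contain empty or malformed values, a more tolerant variant could use `pd.to_numeric`, as in the sketch below. The sample values are made up, and treating a missing death flag as 0 is an assumption of this sketch, not something the notebook defines.

```python
import pandas as pd

# Made-up string values in the style of AP_VL_AP and AP_OBITO
raw_cost = pd.Series(["213.4", "79.75", "", None])
raw_death = pd.Series(["0", "1", None])

# Unparseable or missing values become NaN instead of raising
cost = pd.to_numeric(raw_cost, errors="coerce")
# Missing death flags are treated as 0 here; that choice is an assumption
death = pd.to_numeric(raw_death, errors="coerce").fillna(0).astype("int64")

print(cost.sum(), death.tolist())
```

`sum` skips NaN by default, so records with an unreadable cost do not affect the totals, although it may be worth counting them before aggregating.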

## Union of AQ and AR

In [47]:
columns_aq = ['AP_CMP', 'AP_CNSPCN', 'AQ_ESTADI', 'custo', 'AP_MUNPCN', 'obito']
columns_ar = ['AP_CMP', 'AP_CNSPCN', 'AR_ESTADI', 'custo', 'AP_MUNPCN', 'obito']

normalized_columns = ['data','paciente','estadiamento', 'custo', 'municipio', 'obito']

renamed_aq = cancer_dataframe_aq[columns_aq]
renamed_aq.columns = normalized_columns

renamed_ar = cancer_dataframe_ar[columns_ar]
renamed_ar.columns = normalized_columns

cancer_dataframe = pd.concat(
    [
      renamed_aq, 
      renamed_ar
    ], 
    ignore_index=True)
cancer_dataframe

Unnamed: 0,data,paciente,estadiamento,custo,municipio,obito
0,200801,{{||}|~,1,213.40,352900,0
1,200801,{{{{{{,1,79.75,352900,0
2,200801,{{{{~~~,4,147.10,351730,0
3,200801,{{|{~||~,4,2828.40,353060,0
4,200801,{{{~{|},1,301.50,353060,0
...,...,...,...,...,...,...
5606469,202202,{{{{~~}{,2,5904.00,354340,0
5606470,202202,{{~|,2,5904.00,354870,0
5606471,202202,{{{||||,1,5904.00,355030,0
5606472,202202,{{|{|~},0,5904.00,355030,0


## Computing the values considering only AQ and AR

In [48]:
df_paciente = cancer_dataframe\
    .sort_values(by=['data'], ascending=True)\
    .groupby(['paciente'])\
    .agg(
        data_primeiro_estadiamento=('data', 'first'),         
        data_ultimo_estadiamento=('data', 'last'),                  
        primeiro_estadiamento=('estadiamento', 'first'), 
        maior_estadiamento=('estadiamento', 'max'),
        ultimo_estadiamento=('estadiamento', 'last'),
        custo_total=('custo', 'sum'),
        primeiro_municipio=('municipio', 'first'), 
        ultimo_municipio=('municipio', 'last'),
        indicacao_obito=('obito', 'max'),
        )\
    .reset_index()
   
df_paciente

Unnamed: 0,paciente,data_primeiro_estadiamento,data_ultimo_estadiamento,primeiro_estadiamento,maior_estadiamento,ultimo_estadiamento,custo_total,primeiro_municipio,ultimo_municipio,indicacao_obito
0,898 0000 8725 6,201107,201108,2,2,2,3928.00,355220,355220,0
1,8980500671835 8,200903,200905,2,2,2,239.25,354890,354890,0
2,{{}{{{{{|,201004,201006,3,3,3,239.25,352900,352900,0
3,{}{|}{{|,200910,201105,1,1,1,4283.69,353660,353660,0
4,{}{~~~||{{{~,201102,201104,3,3,3,2400.00,510030,510030,0
...,...,...,...,...,...,...,...,...,...,...
180789,{{|}~},201709,202202,2,2,2,5430.50,355030,355030,0
180790,{{}{{~|,201305,201805,1,1,1,4785.00,350950,350950,0
180791,||{{{{|~,201004,201006,2,2,2,2171.96,280300,280300,0
180792,{|{{{,201206,201207,0,0,0,3420.00,355030,355030,0


## Validating the calculations for one patient

In [49]:
cancer_dataframe[cancer_dataframe.paciente == df_paciente['paciente'][65794]][['data','paciente','estadiamento','municipio', 'obito']]

Unnamed: 0,data,paciente,estadiamento,municipio,obito
4506550,202004,{{}}}{|},4,352500,0
4564082,202005,{{}}}{|},4,352500,0
4590120,202006,{{}}}{|},4,352500,0
4643350,202007,{{}}}{|},4,352500,0
4674151,202008,{{}}}{|},4,352500,0
4733963,202009,{{}}}{|},4,352500,0
4787851,202011,{{}}}{|},2,352500,0
4798464,202010,{{}}}{|},2,352500,0
4841334,202012,{{}}}{|},2,352500,0
4886107,202101,{{}}}{|},2,352500,0


## Patients with a death indication

In [50]:
df_paciente[df_paciente.indicacao_obito == 1]

Unnamed: 0,paciente,data_primeiro_estadiamento,data_ultimo_estadiamento,primeiro_estadiamento,maior_estadiamento,ultimo_estadiamento,custo_total,primeiro_municipio,ultimo_municipio,indicacao_obito
63,|{{~~{{{,201808,202010,3,3,3,2153.25,355560,355560,1
76,|{{~{{{,201112,201603,2,3,2,60593.65,354990,354990,1
80,|{{|~|{{{,200908,202003,3,3,2,34069.50,352900,352900,1
88,|{{{{~{{{~,200808,201005,3,3,3,6145.70,351670,351670,1
155,|{||}{~{{{,201001,201405,3,4,4,47220.24,314810,314810,1
...,...,...,...,...,...,...,...,...,...,...
180608,{{|{{}{},201108,201202,0,3,3,5600.00,354780,355030,1
180611,{{|{{}}{~,201408,202006,2,4,4,18351.35,355030,355030,1
180615,{{|{{~|{,201303,201412,4,4,4,23182.60,355030,355030,1
180746,{{|}{|}|,201101,201509,3,4,4,32370.55,352500,352500,1


## Patients whose last staging is lower than the highest staging identified

In [51]:
df_paciente[df_paciente.maior_estadiamento > df_paciente.ultimo_estadiamento]

Unnamed: 0,paciente,data_primeiro_estadiamento,data_ultimo_estadiamento,primeiro_estadiamento,maior_estadiamento,ultimo_estadiamento,custo_total,primeiro_municipio,ultimo_municipio,indicacao_obito
15,|{{|{{{,201207,201712,0,3,0,6769.50,355030,355030,0
45,|{{~~|{{{,200912,201406,1,1,0,4147.00,354990,354990,0
47,|{{~~{{{{,201006,201501,2,2,1,16281.00,354750,353360,0
49,|{{~~}~{{{,201007,201601,2,3,2,7691.55,354150,354150,0
53,|{{~~|}|~{{{,200905,201401,3,3,0,14394.80,350330,350330,0
...,...,...,...,...,...,...,...,...,...,...
180756,{{|}}|,201302,201606,2,4,3,1116.50,355030,355030,0
180757,{{|}~~{|,201610,202002,1,3,2,7035.50,355030,355030,0
180762,{{|},201601,201611,4,4,0,13053.00,354530,354530,0
180775,{{|}~{|{,201507,202108,3,4,3,14648.00,355030,355030,0


## Create the patients file

In [52]:
df_paciente.to_parquet(
    'pacientes.parquet.gzip', 
    compression='gzip')