<a href="https://colab.research.google.com/github/heber-augusto/sus-kpis-analysis/blob/main/sia/indicadores_monitor_rosa_sia_pa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This notebook was developed as part of the Monitor Rosa project, a set of tools, software, and other artifacts whose main goal is to improve breast cancer diagnosis in Brazil, ideally by reducing the ratio of late diagnoses to early diagnoses.

The files used in this notebook were collected with the scripts available in [this repository](https://github.com/heber-augusto/devops-pysus-get-files). The files were stored in a Google Storage bucket as gzip-compressed parquet files.

# Installing libraries and packages for reading files

## Initial setup for connecting to the Google Storage bucket
 - Google Colab authentication
 - Project name definition

 To run the commands in this section, the file gcp-leitura.json must be placed in the Colab root folder.

In [2]:
from google.colab import auth
auth.authenticate_user()

# project id
project_id = 'teak-ellipse-317015'
# id of the bucket inside the project
bucket_id = 'observatorio-oncologia'

# name of the local folder to map
local_folder_name = 'colab'

# name of the project folder
project_folder_name = 'monitor'

!gcloud config set project {project_id}

Updated property [core/project].





## Installation required to mount the bucket folder
Install gcsfuse to map the bucket folder in Google Colab.

In [3]:
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

OK
73 packages can be upgraded. Run 'apt list --upgradable' to see them.
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  gcsfuse
0 upgraded, 1 newly installed, 0 to remove and 73 not upgraded.
Need to get 11.6 MB of archives.
After this operation, 27.4 MB of additional disk space will be used.
Selecting previously unselected package gcsfuse.
(Reading database ... 155653 files and directories currently installed.)
Preparing to unpack .../gcsfuse_0.41.4_amd64.deb ...
Unpacking gcsfuse (0.41.4) ...
Setting up gcsfuse (0.41.4) ...


## Name/path of the JSON file with the Google storage access credentials

In [4]:
serice_account_json = '/content/gcp-leitura.json'

## Mounting the bucket in a local Google Colab folder

In [12]:
!mkdir {local_folder_name}
!gcsfuse --key-file {serice_account_json} --implicit-dirs {bucket_id} {local_folder_name}

mkdir: cannot create directory ‘colab’: File exists
2022/07/28 16:45:43.768994 Start gcsfuse/0.41.4 (Go version go1.17.6) for app "" using mount point: /content/colab
2022/07/28 16:45:43.784244 Opening GCS connection...
2022/07/28 16:45:44.228398 Mounting file system "observatorio-oncologia"...
2022/07/28 16:45:44.262358 File system has been successfully mounted.
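
Re-running the cell above prints `mkdir: cannot create directory ‘colab’: File exists` because the folder is already there. A minimal sketch of an idempotent alternative using only the Python standard library; `ensure_mount_point` is a hypothetical helper name and is not part of the original notebook:

```python
import os

def ensure_mount_point(folder_name):
    """Create the local folder if it is missing and report whether it is already a mount point."""
    os.makedirs(folder_name, exist_ok=True)  # no error when the folder already exists
    return os.path.ismount(folder_name)      # True only if something is mounted there
```

Calling `ensure_mount_point(local_folder_name)` before `gcsfuse` would make it possible to skip the mount step when the bucket is already mounted.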


## Helper function for collecting files

In [6]:
import glob
import os

def get_files(state, year, month, file_type, file_group):
    """Return the parquet files for one state/year/month/file type/file group."""
    initial_path = os.path.join('/content/', local_folder_name, project_folder_name)
    internal_folder = f"{state}/{year}/{month}/{file_type}/{file_group}"
    return glob.glob(f"{initial_path}/{internal_folder}/*.parquet.gzip")
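
The folder layout that `get_files` expects can be illustrated without access to the bucket. The sketch below is a variant under stated assumptions: `get_files_from` is a hypothetical name that takes the base folder as a parameter so it can point at a temporary directory, and the file name `PASP2001.parquet.gzip` is made up for the example.

```python
import glob
import os
import tempfile

def get_files_from(base_path, state, year, month, file_type, file_group):
    """Hypothetical variant of get_files that takes the base folder as an argument."""
    pattern = os.path.join(base_path, state, year, month, file_type, file_group, '*.parquet.gzip')
    return glob.glob(pattern)

# Build a throwaway tree that mimics <base>/<state>/<year>/<month>/<file_type>/<file_group>/
base = tempfile.mkdtemp()
folder = os.path.join(base, 'SP', '2020', '01', 'SIA', 'PA')
os.makedirs(folder)
open(os.path.join(folder, 'PASP2001.parquet.gzip'), 'w').close()  # made-up file name

found = get_files_from(base, 'SP', '2020', '01', 'SIA', 'PA')
missing = get_files_from(base, 'SP', '2020', '02', 'SIA', 'PA')
print(len(found), len(missing))
```

A month with no folder simply yields an empty list, so missing partitions do not raise an error at this stage.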

# SIA PA: Reading, filtering, and initial transformation of the files

## Filters relevant to the breast cancer context:

### SIH: filter on the DIAG_PRINC variable (4 characters)
 * Filter: C500, C501, C502, C503, C504, C505, C506, C508, and C509

### RHC: filter on the LOCTUDET variable (3 characters)
 * Filter: C50

### SIA – Chemotherapy and Radiotherapy APAC (AQ and AR)
Filter on the AP_CIDPRI variable (4 characters)
 * Filter: C500, C501, C502, C503, C504, C505, C506, C508, and C509
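
The SIH, RHC, and APAC filters above all reduce to matching a diagnosis column against breast cancer codes. A minimal sketch on made-up rows, assuming the columns hold strings; the toy values are illustrative only and do not come from the real files:

```python
import pandas as pd

breast_cid_codes = ['C500', 'C501', 'C502', 'C503', 'C504', 'C505', 'C506', 'C508', 'C509']

# Made-up rows with 4-character codes, as in DIAG_PRINC or AP_CIDPRI
apac = pd.DataFrame({'AP_CIDPRI': ['C509', 'C180', 'C500', 'C61 ']})
apac_breast = apac[apac['AP_CIDPRI'].isin(breast_cid_codes)]

# Made-up rows with 3-character codes, as in LOCTUDET
rhc = pd.DataFrame({'LOCTUDET': ['C50', 'C18', 'C50', 'C61']})
rhc_breast = rhc[rhc['LOCTUDET'] == 'C50']

print(len(apac_breast), len(rhc_breast))
```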

### SIA – Outpatient procedures (PA)

The outpatient procedure files are slightly different for one reason: the person may already have a diagnosis and be undergoing a procedure, OR the person may be undergoing an exam for diagnostic purposes (mammography, ultrasound, etc.). In this case, we can therefore consider two filters:

Filter on the PA_CIDPRI variable (4 characters)
 * Filter: C500, C501, C502, C503, C504, C505, C506, C508, and C509

Filter on the outpatient procedure code variable “PA_PROC_ID” (10 characters)
 * Filters:
  * 0201010569 BIOPSIA/EXERESE DE NÓDULO DE MAMA
  * 0201010585 PUNÇÃO ASPIRATIVA DE MAMA POR AGULHA FINA
  * 0201010607 PUNÇÃO DE MAMA POR AGULHA GROSSA
  * 0203010035 EXAME DE CITOLOGIA (EXCETO CERVICO-VAGINAL E DE MAMA)
  * 0203010043 EXAME CITOPATOLOGICO DE MAMA
  * 0203020065 EXAME ANATOMOPATOLOGICO DE MAMA - BIOPSIA
  * 0203020073 EXAME ANATOMOPATOLOGICO DE MAMA - PECA CIRURGICA
  * 0205020097 ULTRASSONOGRAFIA MAMARIA BILATERAL
  * 0208090037 CINTILOGRAFIA DE MAMA (BILATERAL)
  * 0204030030 MAMOGRAFIA
  * 0204030188 MAMOGRAFIA BILATERAL PARA RASTREAMENTO
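
How the two PA filters are combined changes which rows survive. The sketch below uses a made-up four-row dataframe and shortened code lists to contrast requiring both conditions (`&`) with accepting either one (`|`); the values `'0000'`, `'I10 '`, and `'0301010072'` are placeholders chosen only so that some rows fail each condition.

```python
import pandas as pd

cid_codes = ['C500', 'C509']                    # shortened list for the example
procedure_codes = ['0204030030', '0204030188']  # shortened list for the example

toy = pd.DataFrame({
    'PA_CIDPRI':  ['C509', 'C509', '0000', 'I10 '],
    'PA_PROC_ID': ['0204030030', '0301010072', '0204030188', '0301010072'],
})

has_cid = toy['PA_CIDPRI'].isin(cid_codes)
has_proc = toy['PA_PROC_ID'].isin(procedure_codes)

both = toy[has_cid & has_proc]    # diagnosis AND listed procedure
either = toy[has_cid | has_proc]  # diagnosis OR listed procedure
print(len(both), len(either))
```

With `&`, a diagnostic exam recorded without a breast cancer CID is dropped; with `|`, it is kept. Which behaviour is wanted depends on whether exams for people without a confirmed diagnosis should be counted.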

## Libraries used in the exploratory analysis

In [7]:
import pandas as pd
import numpy as np 

## Filter variables

In [13]:
# filter by CID code
cid_filter = ['C500', 'C501', 'C502', 'C503', 'C504', 'C505', 'C506', 'C508', 'C509']

# dictionary of procedures (code -> description)
proc_id_dict = {
    '0201010569': 'BIOPSIA/EXERESE DE NÓDULO DE MAMA',
    '0201010585': 'PUNÇÃO ASPIRATIVA DE MAMA POR AGULHA FINA',
    '0201010607': 'PUNÇÃO DE MAMA POR AGULHA GROSSA',
    '0203010035': 'EXAME DE CITOLOGIA (EXCETO CERVICO-VAGINAL E DE MAMA)',
    '0203010043': 'EXAME CITOPATOLOGICO DE MAMA',
    '0203020065': 'EXAME ANATOMOPATOLOGICO DE MAMA - BIOPSIA',
    '0203020073': 'EXAME ANATOMOPATOLOGICO DE MAMA - PECA CIRURGICA',
    '0205020097': 'ULTRASSONOGRAFIA MAMARIA BILATERAL',
    '0208090037': 'CINTILOGRAFIA DE MAMA (BILATERAL)',
    '0204030030': 'MAMOGRAFIA',
    '0204030188': 'MAMOGRAFIA BILATERAL PARA RASTREAMENTO'
    }
proc_id_filter = list(proc_id_dict.keys())
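
Procedure codes are compared as 10-character strings, so a code that lost its leading zero (for example, after being read as an integer) would silently fail to match. A small sketch of normalizing codes before filtering; `normalize_proc_id` is a hypothetical helper and is not part of the notebook:

```python
def normalize_proc_id(code):
    """Return the procedure code as a 10-character, zero-padded string."""
    return str(code).strip().zfill(10)

print(normalize_proc_id(201010569))     # 0201010569
print(normalize_proc_id('0204030030'))  # 0204030030
```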


## Filter functions for SIA PA, AQ, and AR files

In [36]:
def filter_pa_content(df):
    """
    Keep PA rows whose primary CID is in cid_filter
    and whose procedure code is in proc_id_filter.
    """
    return df[df.PA_CIDPRI.isin(cid_filter) & \
              df.PA_PROC_ID.isin(proc_id_filter)]

def filter_ar_content(df):
    """
    Keep AR (or AQ) rows whose primary CID is in cid_filter.
    """
    return df[df.AP_CIDPRI.isin(cid_filter)]

filter_aq_content = filter_ar_content


## Function to combine multiple files into a single dataframe

In [27]:
def create_cancer_dataframe(file_paths, filter_function=filter_pa_content):
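    """
    Read each parquet file, apply filter_function to its contents,
    and concatenate the filtered results into a single dataframe.
    """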
    filtered_contents = [
      filter_function(pd.read_parquet(file_path))
      for file_path in file_paths
      ]

    return pd.concat(
        filtered_contents, 
        ignore_index=True)
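
`pd.concat` raises a `ValueError` when it receives an empty list, which would happen here if no files matched the requested period. A sketch of a guarded variant that works on already-loaded dataframes; `concat_filtered`, its `columns` parameter, and `keep_breast` are hypothetical names introduced for illustration:

```python
import pandas as pd

def concat_filtered(frames, filter_function, columns=None):
    """Filter each dataframe and concatenate; return an empty dataframe if there is nothing to join."""
    filtered = [filter_function(frame) for frame in frames]
    if not filtered:
        return pd.DataFrame(columns=columns)
    return pd.concat(filtered, ignore_index=True)

def keep_breast(df):
    """Toy filter with a shortened code list."""
    return df[df['AP_CIDPRI'].isin(['C500', 'C509'])]

frames = [
    pd.DataFrame({'AP_CIDPRI': ['C509', 'C180']}),
    pd.DataFrame({'AP_CIDPRI': ['C500']}),
]
combined = concat_filtered(frames, keep_breast)
empty = concat_filtered([], keep_breast, columns=['AP_CIDPRI'])
print(len(combined), len(empty))
```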



## Function to return the list of files (full paths)

In [22]:
def get_file_paths(states, years, months, file_type, file_group):
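    """
    Collect the file paths for every combination of state, year,
    and month, for the given file type and file group.
    """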
    file_paths = []
    for state in states:
        for year in years:
            for month in months:
                file_paths.extend(
                    get_files(
                        state,
                        year,
                        month,
                        file_type,
                        file_group)
                )
    return file_paths
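
The three nested loops enumerate every (state, year, month) combination. As a sketch of an equivalent formulation, `itertools.product` yields the same combinations in the same order; `list_partitions` is a hypothetical helper that only lists the combinations and does not touch the file system:

```python
from itertools import product

def list_partitions(states, years, months):
    """Return every (state, year, month) combination in the same order as the nested loops."""
    return list(product(states, years, months))

partitions = list_partitions(['SP'], ['2020', '2021'], ['01', '02'])
print(partitions)
```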

## SIA PA: Reading and combining data for the desired period




### State, years, and months to be read and processed

In [29]:
states = ['SP',]
years  = ['2020','2021']
months = [f'{month + 1:02d}' for month in range(12)]
file_type = 'SIA'

### Build the list of files to be read

In [30]:
file_paths_by_type = {}

# Outpatient production files
file_paths_by_type['PA'] = get_file_paths(
    states,
    years,
    months,
    file_type,
    'PA'
)

# Radiotherapy files
file_paths_by_type['AR'] = get_file_paths(
    states,
    years,
    months,
    file_type,
    'AR'
)

# Chemotherapy files
file_paths_by_type['AQ'] = get_file_paths(
    states,
    years,
    months,
    file_type,
    'AQ'
)

In [31]:
print(f"""Found {len(file_paths_by_type['PA'])} outpatient production files""")
print(f"""Found {len(file_paths_by_type['AR'])} radiotherapy files""")
print(f"""Found {len(file_paths_by_type['AQ'])} chemotherapy files""")

Found 12 outpatient production files
Found 4 radiotherapy files
Found 4 chemotherapy files


## Create a single dataframe from the filtered contents

In [38]:
cancer_dataframe_pa = create_cancer_dataframe(file_paths_by_type['PA'], filter_function=filter_pa_content)
cancer_dataframe_aq = create_cancer_dataframe(file_paths_by_type['AQ'], filter_function=filter_aq_content)
cancer_dataframe_ar = create_cancer_dataframe(file_paths_by_type['AR'], filter_function=filter_ar_content)

## PA dataframe

In [41]:
cancer_dataframe_pa

Unnamed: 0,PA_CODUNI,PA_GESTAO,PA_CONDIC,PA_UFMUN,PA_REGCT,PA_INCOUT,PA_INCURG,PA_TPUPS,PA_TIPPRE,PA_MN_IND,...,PA_CODOCO,PA_FLQT,PA_FLER,PA_ETNIA,PA_VL_CF,PA_VL_CL,PA_VL_INC,PA_SRV_C,PA_INE,PA_NAT_JUR
0,6479200,350000,EP,355030,7101,0000,0000,62,00,M,...,1,K,0,,0.0,0.0,0.0,121002,,1023
1,2792176,350000,EP,352220,7101,0000,0000,05,00,M,...,1,K,0,,0.0,0.0,0.0,121002,,1023
2,6123740,350000,EP,355030,7101,0000,0000,07,00,M,...,1,K,0,,0.0,0.0,0.0,121002,,1023
3,6123740,350000,EP,355030,7101,0000,0000,07,00,M,...,1,K,0,,0.0,0.0,0.0,120001,,1023
4,2090236,350000,EP,350550,7101,0000,0000,07,00,I,...,1,S,0,,0.0,0.0,0.0,120001,,3069
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8932,6998194,355030,PG,355030,0000,0000,0000,62,00,M,...,1,K,0,,0.0,0.0,0.0,121002,,1031
8933,2752026,355030,PG,355030,0000,0000,0000,36,00,M,...,1,K,0,,0.0,0.0,0.0,121002,,1031
8934,2752026,355030,PG,355030,0000,0000,0000,36,00,M,...,1,K,0,,0.0,0.0,0.0,121002,,1031
8935,6998194,355030,PG,355030,0000,0000,0000,62,00,M,...,1,K,0,,0.0,0.0,0.0,121002,,1031


In [45]:
cancer_dataframe_pa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8937 entries, 0 to 8936
Data columns (total 60 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   PA_CODUNI   8937 non-null   object
 1   PA_GESTAO   8937 non-null   object
 2   PA_CONDIC   8937 non-null   object
 3   PA_UFMUN    8937 non-null   object
 4   PA_REGCT    8937 non-null   object
 5   PA_INCOUT   8937 non-null   object
 6   PA_INCURG   8937 non-null   object
 7   PA_TPUPS    8937 non-null   object
 8   PA_TIPPRE   8937 non-null   object
 9   PA_MN_IND   8937 non-null   object
 10  PA_CNPJCPF  8937 non-null   object
 11  PA_CNPJMNT  8937 non-null   object
 12  PA_CNPJ_CC  8937 non-null   object
 13  PA_MVM      8937 non-null   object
 14  PA_CMP      8937 non-null   object
 15  PA_PROC_ID  8937 non-null   object
 16  PA_TPFIN    8937 non-null   object
 17  PA_SUBFIN   8937 non-null   object
 18  PA_NIVCPL   8937 non-null   object
 19  PA_DOCORIG  8937 non-null   object
 20  PA_AUTOR

## AQ dataframe

In [42]:
cancer_dataframe_aq

Unnamed: 0,AP_MVM,AP_CONDIC,AP_GESTAO,AP_CODUNI,AP_AUTORIZ,AP_CMP,AP_PRIPAL,AP_VL_AP,AP_UFMUN,AP_TPUPS,...,AQ_DTINI2,AQ_CIDINI3,AQ_DTINI3,AQ_CONTTR,AQ_DTINTR,AQ_ESQU_P1,AQ_TOTMPL,AQ_TOTMAU,AQ_ESQU_P2,AP_NATJUR
0,202001,EP,350000,2079798,3519258857675,202001,0304040185,1400.0,350950,05,...,,,,S,20190213,H4PA,017,003,,1112
1,202001,EP,350000,6123740,3519265808740,202001,0304040185,1400.0,355030,07,...,,,,S,20191125,PACLI,006,000,TAXEL+TRAS,1023
2,202001,EP,350000,2078287,3519266065194,201911,0304040185,1400.0,355030,07,...,,,,N,20191002,TRAS/,006,000,CARBO/DOCE,1023
3,202001,EP,350000,2079798,3519263839432,201912,0304040185,1400.0,350950,05,...,,,,S,20190522,H4AC,017,003,,1112
4,202001,EP,350000,2078287,3519265739352,202001,0304040185,1400.0,355030,07,...,,,,S,20190509,CARBO,006,003,/TAXOL/TRA,1023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159027,202102,PG,350000,2078287,3521224651813,202101,0304020427,34.0,355030,07,...,20130326,C509,20130401,N,20210107,TRAST,012,000,UZUMABE,1023
159028,202102,PG,350000,2082187,3520269556969,202011,0304020427,34.0,354340,05,...,,,,N,20201105,TZB,003,000,,3069
159029,202102,PG,350000,2079798,3521216585513,202102,0304020427,34.0,350950,05,...,,,,N,20210114,HERCE,017,003,PTIN,1112
159030,202102,PG,350000,2705982,3521217440455,202102,0304020427,34.0,351620,05,...,,,,N,20210115,Trast,012,000,uzumabe568,3069


In [44]:
cancer_dataframe_aq.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159032 entries, 0 to 159031
Data columns (total 64 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   AP_MVM      159032 non-null  object
 1   AP_CONDIC   159032 non-null  object
 2   AP_GESTAO   159032 non-null  object
 3   AP_CODUNI   159032 non-null  object
 4   AP_AUTORIZ  159032 non-null  object
 5   AP_CMP      159032 non-null  object
 6   AP_PRIPAL   159032 non-null  object
 7   AP_VL_AP    159032 non-null  object
 8   AP_UFMUN    159032 non-null  object
 9   AP_TPUPS    159032 non-null  object
 10  AP_TIPPRE   159032 non-null  object
 11  AP_MN_IND   159032 non-null  object
 12  AP_CNPJCPF  159032 non-null  object
 13  AP_CNPJMNT  159032 non-null  object
 14  AP_CNSPCN   159032 non-null  object
 15  AP_COIDADE  159032 non-null  object
 16  AP_NUIDADE  159032 non-null  object
 17  AP_SEXO     159032 non-null  object
 18  AP_RACACOR  159032 non-null  object
 19  AP_MUNPCN   159032 non-

## AR dataframe

In [40]:
cancer_dataframe_ar

Unnamed: 0,AP_MVM,AP_CONDIC,AP_GESTAO,AP_CODUNI,AP_AUTORIZ,AP_CMP,AP_PRIPAL,AP_VL_AP,AP_UFMUN,AP_TPUPS,...,AR_NUMC1,AR_INIAR1,AR_INIAR2,AR_INIAR3,AR_FIMAR1,AR_FIMAR2,AR_FIMAR3,AR_NUMC2,AR_NUMC3,AP_NATJUR
0,202001,EP,350000,6123740,3519260657803,201912,0304010413,5904.0,355030,07,...,,20191216,,,20200107,,,,,1023
1,202001,EP,350000,2079798,3519259257998,202001,0304010413,5904.0,350950,05,...,,20191121,,,20200107,,,,,1112
2,202001,EP,350000,2030705,3519213407842,202001,0304010413,5904.0,354140,39,...,,20191118,,,20200101,,,,,2062
3,202001,EP,350000,3126838,3520209556589,202001,0304010413,5904.0,355410,05,...,,20200130,,,20200331,,,,,3999
4,202001,PG,350950,2081490,3519238281900,201911,0304010413,5904.0,350950,05,...,,20191104,,,20191218,,,,,1120
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2307,202102,PG,350000,6123740,3521224639427,202101,0304010413,5904.0,355030,07,...,,20210111,,,20210204,,,,,1023
2308,202102,PG,350000,6123740,3521224750406,202101,0304010413,5904.0,355030,07,...,,20210122,,,20210212,,,,,1023
2309,202102,PG,353470,4049020,3521225378781,202102,0304010413,5904.0,353470,05,...,,20210201,,,20210228,,,,,3999
2310,202102,PG,355030,2080125,3520265094984,202102,0304010413,5904.0,355030,07,...,,20201214,,,20210228,,,,,3999


In [43]:
cancer_dataframe_ar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2312 entries, 0 to 2311
Data columns (total 74 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   AP_MVM      2312 non-null   object
 1   AP_CONDIC   2312 non-null   object
 2   AP_GESTAO   2312 non-null   object
 3   AP_CODUNI   2312 non-null   object
 4   AP_AUTORIZ  2312 non-null   object
 5   AP_CMP      2312 non-null   object
 6   AP_PRIPAL   2312 non-null   object
 7   AP_VL_AP    2312 non-null   object
 8   AP_UFMUN    2312 non-null   object
 9   AP_TPUPS    2312 non-null   object
 10  AP_TIPPRE   2312 non-null   object
 11  AP_MN_IND   2312 non-null   object
 12  AP_CNPJCPF  2312 non-null   object
 13  AP_CNPJMNT  2312 non-null   object
 14  AP_CNSPCN   2312 non-null   object
 15  AP_COIDADE  2312 non-null   object
 16  AP_NUIDADE  2312 non-null   object
 17  AP_SEXO     2312 non-null   object
 18  AP_RACACOR  2312 non-null   object
 19  AP_MUNPCN   2312 non-null   object
 20  AP_UFNAC

# Building the Patient Exams dataset (one row per patient)

Columns:

 - Patient key (cns_encrypted)
 - Total treatment cost
 - Initial staging
 - Final staging
 - Death indicator
 - Place of residence

## Proposed solution:

 - File types to be used: AQ and AR
 - Patient key:
  - AQ: column AP_CNSPCN
  - AR: column AP_CNSPCN
 - Total treatment cost: an estimate that considers only radiotherapy and chemotherapy, calculated by summing the procedure values (present in AR and AQ) for each patient key, where:
   - costs in AQ: sum of AP_VL_AP;
   - costs in AR: sum of AP_VL_AP.
 - Initial staging: taken from the staging value (AQ:AQ_ESTADI and AR:AR_ESTADI) of the patient's oldest radiotherapy or chemotherapy record;
 - Final staging: taken from the staging value (AQ:AQ_ESTADI and AR:AR_ESTADI) of the patient's most recent radiotherapy or chemotherapy record;
 - Place of residence: use the AP_MUNPCN column (present in AR and AQ). Possibly create two fields: AP_MUNPCN from the oldest record and AP_MUNPCN from the most recent record.
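
The proposal above can be sketched on made-up records. This is only an illustration under assumptions: the patient keys, values, and municipalities are invented; `AP_CMP` (YYYYMM) is assumed to be the column that orders records from oldest to newest, which the proposal does not specify; and the output column names (`staging`, `total_cost`, and so on) are hypothetical. The death indicator is not covered.

```python
import pandas as pd

# Made-up records; the real AQ/AR columns are strings, so values are quoted here too
aq = pd.DataFrame({
    'AP_CNSPCN': ['A', 'A', 'B'],
    'AP_CMP':    ['202001', '202003', '202002'],
    'AP_VL_AP':  ['1400.0', '1400.0', '34.0'],
    'AQ_ESTADI': ['2', '3', '1'],
    'AP_MUNPCN': ['355030', '355030', '350950'],
})
ar = pd.DataFrame({
    'AP_CNSPCN': ['A'],
    'AP_CMP':    ['202002'],
    'AP_VL_AP':  ['5904.0'],
    'AR_ESTADI': ['2'],
    'AP_MUNPCN': ['350950'],
})

# Bring both sources to a common staging column name and stack them
records = pd.concat([
    aq.rename(columns={'AQ_ESTADI': 'staging'}),
    ar.rename(columns={'AR_ESTADI': 'staging'}),
], ignore_index=True)
records['AP_VL_AP'] = pd.to_numeric(records['AP_VL_AP'])

# Assumption: AP_CMP orders the records from oldest to newest
records = records.sort_values(['AP_CNSPCN', 'AP_CMP'])

patients = records.groupby('AP_CNSPCN').agg(
    total_cost=('AP_VL_AP', 'sum'),
    first_staging=('staging', 'first'),
    last_staging=('staging', 'last'),
    first_municipality=('AP_MUNPCN', 'first'),
    last_municipality=('AP_MUNPCN', 'last'),
).reset_index()
print(patients)
```

Converting `AP_VL_AP` with `pd.to_numeric` matters because the `info()` output lists it with dtype `object`; summing the raw strings would concatenate them instead of adding the values.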