<img src="https://github.com/bigdata-icict/ETL-Dataiku-DSS/raw/master/tutoriais/pcdas_1.5.png">

# Neonatal mortality rates

Notebook for calculating neonatal mortality rates. The rates are broken down into early, late and total neonatal mortality. They are calculated with yearly and monthly periodicity for different regions, states and cities of Brazil.

To enable the calculation of neonatal mortality rates, we demonstrate how to access, through R, the SIM and SINASC datasets indexed by the Data Science Platform applied to Health (PCDaS).

The SIM and SINASC datasets are available in Elasticsearch (ES) indexes, which contain all the individual records of deaths and births, respectively, and are updated yearly.

## Required packages

First we define an auxiliary function that loads the packages required to run this notebook and installs any package that is not yet available.

In [2]:
loadlibrary <- function(x){
  if (!require(x,character.only = TRUE)) {
    install.packages(x, repos='http://cran.fiocruz.br/', dep=TRUE)
    if(!require(x,character.only = TRUE)) stop("Package not found")
  }
}

The indexes in ES are accessed through the package [`elasticsearchr`](https://cran.r-project.org/web/packages/elasticsearchr/elasticsearchr.pdf).

In [3]:
loadlibrary("elasticsearchr")

Loading required package: elasticsearchr


We will also use other R libraries to make it easier to manipulate the data obtained.

In [4]:
packages <- c("dplyr","curl","jsonlite","ggplot2")
lapply(packages, loadlibrary)

Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: curl
Loading required package: jsonlite
Loading required package: ggplot2


## Access to ElasticSearch

The first step is to give R the connection parameters for the indexes in ES.

In the parameters `es_user` and `es_pwd`, enter the same user name and password that you use to access the PCDaS platform.

In [5]:
es_host <- "dados-pcdas.icict.fiocruz.br"
es_port <- 443
es_transport_schema <- "https"
es_user <- ""
es_pwd <- ""

# ES connection URL
es_url <- paste(es_transport_schema,"://",es_user,":",es_pwd,"@",es_host,":",es_port,sep="")
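Because the user name and password are embedded in the URL, reserved characters such as `@` or `:` in a password can break it. The following is a minimal sketch, not part of the original notebook, that reads the credentials from environment variables and percent-encodes them. The variable names `PCDAS_USER` and `PCDAS_PWD`, the helper `build_es_url` and the sample credentials are assumptions made for illustration.

```r
# Read credentials from environment variables (hypothetical names) instead of
# typing them into the notebook; empty string if they are not set.
es_user <- Sys.getenv("PCDAS_USER", unset = "")
es_pwd  <- Sys.getenv("PCDAS_PWD", unset = "")

# Hypothetical helper: percent-encode reserved characters in the credentials
# so that they cannot be confused with the URL separators.
build_es_url <- function(schema, user, pwd, host, port) {
  paste0(schema, "://",
         utils::URLencode(user, reserved = TRUE), ":",
         utils::URLencode(pwd, reserved = TRUE), "@",
         host, ":", port)
}

# Made-up credentials, used only to show the encoding
example_url <- build_es_url("https", "maria", "p@ss:word",
                            "dados-pcdas.icict.fiocruz.br", 443)
print(example_url)
```

With real credentials, the result of `build_es_url(es_transport_schema, es_user, es_pwd, es_host, es_port)` could be used in place of `es_url`.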

Next, we create one object for each ES index, giving access to the SINASC and SIM datasets.

In [6]:
es_sinasc <- elasticsearchr::elastic(es_url, "datasus-sinasc")
es_sim <- elasticsearchr::elastic(es_url, "datasus-sim")

## Querying data

We can query the data and view the documents (records) of the indexes (SIM and SINASC databases) with the command `query` and the operator `%search%`.

With the command `query` we can define any kind of query that ES allows, using its native JSON syntax.
The operator `%search%` executes the defined query, passing its definition to ES and returning the results as a table (`data.frame`).

For example, a search for all documents and all fields in an index can be defined as follows.

In [197]:
all <- query('{
                 "match_all": {}
               }')

This search could then be executed with the operator `%search%` applied to our connection objects for the SIM (`es_sim`) and SINASC (`es_sinasc`) indexes:

`es_sim %search% all`

`es_sinasc %search% all`

However, since the number of documents in these indexes is very large (up to 62 million), a search like this is not recommended: it is computationally very costly and often unnecessary.

It is generally more useful to specify only the fields, filters and, above all, aggregations that are relevant to the desired result.

We will explore this possibility in the next steps.

## Building denominators

### Births per city of residence

If we want to generate more complex tables of counts, we can use aggregations.

For instance, we can build the denominators of the mortality rates by aggregating (counting) the birth data over time (months) and for each city of __residence__ in Brazil.

An aggregation for ES must be written in a standard form. See below:

In [9]:
agg_sinasc_mun <- aggs('{
    "mes": {
      "date_histogram": {
        "field": "data_nasc",
        "interval": "1M",
        "time_zone": "UTC",
        "min_doc_count": 1
      },
      "aggs": {
        "mun": {
          "terms": {
            "field": "res_codigo_adotado",
            "size": 6000
          }
        }
      }
    }
  }')

We are creating an R object called `agg_sinasc_mun`, which will be used in a query to ES. What does each line of this object mean?
* `aggs`: this command tells ES that you are requesting an aggregation;
* `mes` and `mun`: aggregation names. You can change these names (the name of the inner aggregation, `mun`, is later passed to `create_df_agg2`);
* `date_histogram`: tells ES to aggregate on a date field, returning document counts per time interval. Do not modify this line;
* `terms`: tells ES to aggregate on a categorical field, returning document counts per category. Do not modify this line;
* `field`: identifies the field to aggregate on; in our case, the date of birth (`data_nasc`) with monthly periodicity (`"interval": "1M"`) and the city of residence (`res_codigo_adotado`). You can replace these with other fields;
* `size`: the maximum number of buckets (here, cities) returned by the `terms` aggregation.

This aggregation is executed with the following code:

In [23]:
data_sinasc_mun <- es_sinasc %search% agg_sinasc_mun

Because this is a nested aggregation with 2 levels, we need the function `create_df_agg2` to produce a dataframe (table) from the data contained in `data_sinasc_mun`. The final result is stored in the dataframe `df_sinasc_mun` and later saved in the file "df_sinasc_res_mun.csv".

In [11]:
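create_df_agg2 <- function(data, bucket_name, names){

    # Combine a first-level key with each of its second-level buckets (key and count)
    join_df_agg2 <- function(key, agg2){
       list(cbind(x1 = key, x2 = agg2$key, x3 = agg2$doc_count))
    }

    # Name of the list column that holds the second-level buckets
    bucket_var <- paste(bucket_name, ".buckets", sep = "")

    # Stack the per-key matrices and convert the result to a data.frame,
    # so that its columns can be accessed with `$` in the following cells
    df <- do.call("rbind", mapply(join_df_agg2, data$key, data[[bucket_var]], SIMPLIFY = TRUE))
    df <- as.data.frame(df, stringsAsFactors = FALSE)
    colnames(df) <- names
    df[[3]] <- as.numeric(df[[3]])  # cbind may have coerced the counts to character

    return(df)
}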
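To make the flattening step concrete, here is a small self-contained sketch that does not need a connection to ES. It assumes the two-level result has one row per first-level bucket and a list column named `mun.buckets` holding the second-level buckets, which is the shape `create_df_agg2` expects. The mock values and the helper name `flatten_agg2` are made up for illustration.

```r
# Mock of a two-level aggregation result: one row per month, plus a list
# column with the per-city buckets of that month (all values are made up).
mock <- data.frame(key = c("2015-01", "2015-02"), stringsAsFactors = FALSE)
mock$mun.buckets <- list(
  data.frame(key = c("330455", "355030"), doc_count = c(10, 20),
             stringsAsFactors = FALSE),
  data.frame(key = "330455", doc_count = 7, stringsAsFactors = FALSE)
)

# Hypothetical helper: one output row per (first-level key, second-level bucket)
flatten_agg2 <- function(data, bucket_name, names) {
  buckets <- data[[paste0(bucket_name, ".buckets")]]
  rows <- Map(function(key, b) {
    data.frame(x1 = key, x2 = b$key, x3 = b$doc_count, stringsAsFactors = FALSE)
  }, data$key, buckets)
  df <- do.call(rbind, rows)
  rownames(df) <- NULL
  colnames(df) <- names
  df
}

flat <- flatten_agg2(mock, "mun", c("Mes", "Municipio", "Nascimentos"))
print(flat)
```

Each month contributes as many rows as it has city buckets, so the two mock months produce three rows in total.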

In [44]:
df_sinasc_mun <- create_df_agg2( data_sinasc_mun, "mun", c("Mes","Municipio","Nascimentos") )

In [45]:
length(unique(df_sinasc_mun$Municipio))

The next step is to remove inconsistent or unknown ("ignored") city codes, keeping only those listed by IBGE.

In [46]:
# List of IBGE city codes (source: https://www.ibge.gov.br/explica/codigos-dos-municipios.php)
cod_mun <- read.csv("CODIGOS_MUNICIPIO_IBGE.csv")
# First 6 digits of each code (base R equivalent of stringr::str_sub(x, end = 6))
cod_mun <- substr(cod_mun$Cod, 1, 6)

# Keep only the cities that are listed by IBGE
df_sinasc_mun <- df_sinasc_mun[df_sinasc_mun$Municipio %in% cod_mun,]
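The cleaning step can be illustrated with a toy example. The sketch below assumes, as the code above suggests, that the IBGE file has a `Cod` column with 7-digit codes and that the indexes store only the first 6 digits. The two IBGE codes are those of São Paulo and Rio de Janeiro; the counts and the code `350000` are made up.

```r
# Two real 7-digit IBGE codes: São Paulo (3550308) and Rio de Janeiro (3304557)
ibge <- data.frame(Cod = c(3550308, 3304557))

# Truncate to 6 digits so that the codes match the ones stored in the index
valid <- substr(as.character(ibge$Cod), 1, 6)

# Toy counts; "350000" is a made-up code that is not in the IBGE list
counts <- data.frame(Municipio = c("355030", "330455", "350000"),
                     Nascimentos = c(5, 3, 1),
                     stringsAsFactors = FALSE)

# Keep only rows whose city code appears in the IBGE list
kept <- counts[counts$Municipio %in% valid, ]
print(kept)
```

Comparing `length(unique(...))` before and after the filter, as the notebook does, shows how many codes were discarded.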

In [48]:
length(unique(df_sinasc_mun$Municipio))

Saving the final result in the file "df_sinasc_res_mun.csv".

In [49]:
write.csv(df_sinasc_mun, file = "df_sinasc_res_mun.csv", row.names = FALSE)
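The introduction mentions yearly as well as monthly rates. Below is a minimal sketch of rolling monthly counts up to years in R. It assumes that the `Mes` column holds the `date_histogram` bucket key as epoch milliseconds, which is how ES reports bucket keys; please check this against your own result. The counts are made up.

```r
# Toy monthly counts. The Mes values are epoch milliseconds (UTC) for
# 2015-01-01, 2015-02-01 and 2016-01-01; the counts are made up.
monthly <- data.frame(
  Mes = c("1420070400000", "1422748800000", "1451606400000"),
  Municipio = c("330455", "330455", "330455"),
  Nascimentos = c(10, 20, 5),
  stringsAsFactors = FALSE
)

# Convert the bucket key to a date-time and extract the calendar year
dates <- as.POSIXct(as.numeric(monthly$Mes) / 1000,
                    origin = "1970-01-01", tz = "UTC")
monthly$Ano <- format(dates, "%Y")

# Sum the monthly counts within each year and city
yearly <- aggregate(Nascimentos ~ Ano + Municipio, data = monthly, FUN = sum)
print(yearly)
```

The same pattern would apply to the death counts, replacing `Nascimentos` with `Obitos`.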

### Births per city of birth

We can also build the denominators of the mortality rates by aggregating (counting) the birth data over time (months) and for each city of __birth__ in Brazil.

To do so, we adapt the aggregation defined in the object `agg_sinasc_mun` by changing the `field` line to the variable for the city where the birth occurred (`nasc_codigo_adotado`).

In [58]:
agg_sinasc_mun <- aggs('{
    "mes": {
      "date_histogram": {
        "field": "data_nasc",
        "interval": "1M",
        "time_zone": "UTC",
        "min_doc_count": 1
      },
      "aggs": {
        "mun": {
          "terms": {
            "field": "nasc_codigo_adotado",
            "size": 6000
          }
        }
      }
    }
  }')

In addition, we are only interested in births that occur in hospitals or other healthcare facilities. In order to restrict the searches according to our interest, we can define a filter as shown below:

In [59]:
filter_hospital <- query('{
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "def_loc_nasc:Hospital OR def_loc_nasc:(Outro Estab. Saúde)"
          }
        }
      ]
    }
}')

With the code above, we are creating an object called `filter_hospital` in R, which will be used when consulting ES. What does each line of this object mean?
* `query`: as seen earlier, this command tells ES that you are requesting a search;
* `bool`: this clause allows the construction of filters that have multiple fields. Do not modify this line;
* `query_string`: this declares that the search (in this case, a filter) will be given as a query string. Do not modify this line;
* `query`: the value of this clause represents the filter itself, which defines the fields you want to filter based on their respective values. You can modify this filter as needed.

This filter and aggregation are executed with the following code:

In [60]:
data_sinasc_mun <- es_sinasc %search% (filter_hospital + agg_sinasc_mun)

In [61]:
df_sinasc_mun <- create_df_agg2( data_sinasc_mun, "mun", c("Mes","Municipio","Nascimentos") )

In [62]:
length(unique(df_sinasc_mun$Municipio))

The next step is to remove inconsistent or unknown ("ignored") city codes, keeping only those listed by IBGE.

In [46]:
# List of IBGE city codes (source: https://www.ibge.gov.br/explica/codigos-dos-municipios.php)
cod_mun <- read.csv("CODIGOS_MUNICIPIO_IBGE.csv")
# First 6 digits of each code (base R equivalent of stringr::str_sub(x, end = 6))
cod_mun <- substr(cod_mun$Cod, 1, 6)

# Keep only the cities that are listed by IBGE
df_sinasc_mun <- df_sinasc_mun[df_sinasc_mun$Municipio %in% cod_mun,]

In [69]:
length(unique(df_sinasc_mun$Municipio))

Saving the final result in the file "df_sinasc_nasc_mun.csv".

In [65]:
write.csv(df_sinasc_mun, file = "df_sinasc_nasc_mun.csv", row.names = FALSE)

## Building numerators

### Deaths per city of residence

The process of building the numerators for the neonatal mortality rates is analogous to the one used for the denominators. The difference is that we are now interested in calculating the number of deaths.

More specifically, we restrict ourselves to neonatal deaths, that is, deaths occurring at ages 0 to 6 days (early neonatal), 7 to 27 days (late neonatal), and 0 to 27 days (total neonatal). We should, therefore, filter the data accordingly:

In [12]:
filter_neonatal <- query('{
            "bool": {
              "must": [
                {
                  "query_string": {
                    "query": "idade_obito_dias: [0 TO 27]"
                  }
                }
              ]
            }
        }')

filter_neonatal_precoce <- query('{
            "bool": {
              "must": [
                {
                  "query_string": {
                    "query": "idade_obito_dias: [0 TO 6]"
                  }
                }
              ]
            }
        }')

filter_neonatal_tardio <- query('{
            "bool": {
              "must": [
                {
                  "query_string": {
                    "query": "idade_obito_dias: [7 TO 27]"
                  }
                }
              ]
            }
        }')
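The three filters differ only in the age range. As a sketch of how the repetition could be reduced, the hypothetical helper `make_age_filter` below builds the same JSON with `sprintf`. It is not part of `elasticsearchr`, and the `extra` argument is an assumption added to allow further conditions; it must not contain double quotes.

```r
# Hypothetical helper: build the JSON of a query_string filter on
# idade_obito_dias for a given age range in days.
make_age_filter <- function(min_days, max_days, extra = NULL) {
  q <- sprintf("idade_obito_dias: [%d TO %d]",
               as.integer(min_days), as.integer(max_days))
  # Optionally append another query_string condition
  if (!is.null(extra)) q <- paste0(q, " AND (", extra, ")")
  sprintf('{"bool": {"must": [{"query_string": {"query": "%s"}}]}}', q)
}

json_total <- make_age_filter(0, 27)
json_early <- make_age_filter(0, 6)
json_late  <- make_age_filter(7, 27)
cat(json_early, "\n")

# With elasticsearchr loaded, the strings can be passed to query(), e.g.:
# filter_neonatal_precoce <- query(json_early)
```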

We define an aggregation of death data by month and by city of __residence__, analogous to the one defined for building the denominators:

In [13]:
agg_sim_mun <- aggs('{
    "mes": {
      "date_histogram": {
        "field": "data_nasc",
        "interval": "1M",
        "time_zone": "UTC",
        "min_doc_count": 1
      },
      "aggs": {
        "mun": {
          "terms": {
            "field": "res_codigo_adotado",
            "size": 6000
          }
        }
      }
    }
  }')

We now execute the aggregation of neonatal deaths (__totals__), saving the results in the file "df_sim_res_mun_neo.csv":

In [12]:
data_sim_mun_neo <- es_sim %search% (filter_neonatal + agg_sim_mun)

In [10]:
df_sim_mun_neo <- create_df_agg2( data_sim_mun_neo, "mun", c("Mes","Municipio","Obitos") )

In [11]:
length(unique(df_sim_mun_neo$Municipio))

The next step is to remove inconsistent or unknown ("ignored") city codes, keeping only those listed by IBGE.

In [46]:
# List of IBGE city codes (source: https://www.ibge.gov.br/explica/codigos-dos-municipios.php)
cod_mun <- read.csv("CODIGOS_MUNICIPIO_IBGE.csv")
# First 6 digits of each code (base R equivalent of stringr::str_sub(x, end = 6))
cod_mun <- substr(cod_mun$Cod, 1, 6)

# Keep only the cities that are listed by IBGE
df_sim_mun_neo <- df_sim_mun_neo[df_sim_mun_neo$Municipio %in% cod_mun,]

In [13]:
length(unique(df_sim_mun_neo$Municipio))

In [14]:
write.csv(df_sim_mun_neo, file = "df_sim_res_mun_neo.csv", row.names = FALSE)

The aggregation of __early__ neonatal deaths is saved in the file "df_sim_res_mun_neo_pre.csv":

In [15]:
data_sim_mun_neo_pre <- es_sim %search% (filter_neonatal_precoce + agg_sim_mun)

In [16]:
df_sim_mun_neo_pre <- create_df_agg2( data_sim_mun_neo_pre, "mun", c("Mes","Municipio","Obitos") )

In [17]:
length(unique(df_sim_mun_neo_pre$Municipio))

The next step is to remove inconsistent or unknown ("ignored") city codes, keeping only those listed by IBGE.

In [18]:
# List of IBGE city codes (source: https://www.ibge.gov.br/explica/codigos-dos-municipios.php)
cod_mun <- read.csv("CODIGOS_MUNICIPIO_IBGE.csv")
# First 6 digits of each code (base R equivalent of stringr::str_sub(x, end = 6))
cod_mun <- substr(cod_mun$Cod, 1, 6)

# Keep only the cities that are listed by IBGE
df_sim_mun_neo_pre <- df_sim_mun_neo_pre[df_sim_mun_neo_pre$Municipio %in% cod_mun,]

In [19]:
length(unique(df_sim_mun_neo_pre$Municipio))

In [20]:
write.csv(df_sim_mun_neo_pre, file = "df_sim_res_mun_neo_pre.csv", row.names = FALSE)

The aggregation of __late__ neonatal deaths is saved in the file "df_sim_res_mun_neo_tar.csv":

In [21]:
data_sim_mun_neo_tar <- es_sim %search% (filter_neonatal_tardio + agg_sim_mun)

In [22]:
df_sim_mun_neo_tar <- create_df_agg2( data_sim_mun_neo_tar, "mun", c("Mes","Municipio","Obitos") )

In [23]:
length(unique(df_sim_mun_neo_tar$Municipio))

The next step is to remove inconsistent or unknown ("ignored") city codes, keeping only those listed by IBGE.

In [24]:
# List of IBGE city codes (source: https://www.ibge.gov.br/explica/codigos-dos-municipios.php)
cod_mun <- read.csv("CODIGOS_MUNICIPIO_IBGE.csv")
# First 6 digits of each code (base R equivalent of stringr::str_sub(x, end = 6))
cod_mun <- substr(cod_mun$Cod, 1, 6)

# Keep only the cities that are listed by IBGE
df_sim_mun_neo_tar <- df_sim_mun_neo_tar[df_sim_mun_neo_tar$Municipio %in% cod_mun,]

In [25]:
length(unique(df_sim_mun_neo_tar$Municipio))

In [26]:
write.csv(df_sim_mun_neo_tar, file = "df_sim_res_mun_neo_tar.csv", row.names = FALSE)

### Deaths per city of occurrence

As with the denominators by city of birth, we are only interested in deaths occurring in hospitals or other health facilities. For this, we extend the filters with a condition on the __place of occurrence__ of death:

In [27]:
filter_neonatal <- query('{
            "bool": {
              "must": [
                {
                  "query_string": {
                    "query": "idade_obito_dias: [0 TO 27] AND (def_loc_ocor:Hospital OR def_loc_ocor:(Outro Estab. Saúde))"
                  }
                }
              ]
            }
        }')

filter_neonatal_precoce <- query('{
            "bool": {
              "must": [
                {
                  "query_string": {
                    "query": "idade_obito_dias: [0 TO 6] AND (def_loc_ocor:Hospital OR def_loc_ocor:(Outro Estab. Saúde))"
                  }
                }
              ]
            }
        }')

filter_neonatal_tardio <- query('{
            "bool": {
              "must": [
                {
                  "query_string": {
                    "query": "idade_obito_dias: [7 TO 27] AND (def_loc_ocor:Hospital OR def_loc_ocor:(Outro Estab. Saúde))"
                  }
                }
              ]
            }
        }')

We define an aggregation of death data by month and by city of __occurrence__ similar to the one defined for the construction of the denominators:

In [28]:
agg_sim_mun <- aggs('{
    "mes": {
      "date_histogram": {
        "field": "data_nasc",
        "interval": "1M",
        "time_zone": "UTC",
        "min_doc_count": 1
      },
      "aggs": {
        "mun": {
          "terms": {
            "field": "ocor_codigo_adotado",
            "size": 6000
          }
        }
      }
    }
  }')

We now execute the aggregation of neonatal deaths (__totals__), saving the results in the file "df_sim_ocor_mun_neo.csv":

In [29]:
data_sim_mun_neo <- es_sim %search% (filter_neonatal + agg_sim_mun)

In [30]:
df_sim_mun_neo <- create_df_agg2( data_sim_mun_neo, "mun", c("Mes","Municipio","Obitos") )

In [31]:
length(unique(df_sim_mun_neo$Municipio))

The next step is to remove inconsistent or unknown ("ignored") city codes, keeping only those listed by IBGE.

In [46]:
# List of IBGE city codes (source: https://www.ibge.gov.br/explica/codigos-dos-municipios.php)
cod_mun <- read.csv("CODIGOS_MUNICIPIO_IBGE.csv")
# First 6 digits of each code (base R equivalent of stringr::str_sub(x, end = 6))
cod_mun <- substr(cod_mun$Cod, 1, 6)

# Keep only the cities that are listed by IBGE
df_sim_mun_neo <- df_sim_mun_neo[df_sim_mun_neo$Municipio %in% cod_mun,]

In [33]:
length(unique(df_sim_mun_neo$Municipio))

In [34]:
write.csv(df_sim_mun_neo, file = "df_sim_ocor_mun_neo.csv", row.names = FALSE)

The aggregation of __early__ neonatal deaths is saved in the file "df_sim_ocor_mun_neo_pre.csv":

In [35]:
data_sim_mun_neo_pre <- es_sim %search% (filter_neonatal_precoce + agg_sim_mun)

In [36]:
df_sim_mun_neo_pre <- create_df_agg2( data_sim_mun_neo_pre, "mun", c("Mes","Municipio","Obitos") )

In [37]:
length(unique(df_sim_mun_neo_pre$Municipio))

The next step is to remove inconsistent or unknown ("ignored") city codes, keeping only those listed by IBGE.

In [18]:
# List of IBGE city codes (source: https://www.ibge.gov.br/explica/codigos-dos-municipios.php)
cod_mun <- read.csv("CODIGOS_MUNICIPIO_IBGE.csv")
# First 6 digits of each code (base R equivalent of stringr::str_sub(x, end = 6))
cod_mun <- substr(cod_mun$Cod, 1, 6)

# Keep only the cities that are listed by IBGE
df_sim_mun_neo_pre <- df_sim_mun_neo_pre[df_sim_mun_neo_pre$Municipio %in% cod_mun,]

In [39]:
length(unique(df_sim_mun_neo_pre$Municipio))

In [40]:
write.csv(df_sim_mun_neo_pre, file = "df_sim_ocor_mun_neo_pre.csv", row.names = FALSE)

The aggregation of __late__ neonatal deaths is saved in the file "df_sim_ocor_mun_neo_tar.csv":

In [41]:
data_sim_mun_neo_tar <- es_sim %search% (filter_neonatal_tardio + agg_sim_mun)

In [42]:
df_sim_mun_neo_tar <- create_df_agg2( data_sim_mun_neo_tar, "mun", c("Mes","Municipio","Obitos") )

In [43]:
length(unique(df_sim_mun_neo_tar$Municipio))

The next step is to remove inconsistent or unknown ("ignored") city codes, keeping only those listed by IBGE.

In [24]:
# List of IBGE city codes (source: https://www.ibge.gov.br/explica/codigos-dos-municipios.php)
cod_mun <- read.csv("CODIGOS_MUNICIPIO_IBGE.csv")
# First 6 digits of each code (base R equivalent of stringr::str_sub(x, end = 6))
cod_mun <- substr(cod_mun$Cod, 1, 6)

# Keep only the cities that are listed by IBGE
df_sim_mun_neo_tar <- df_sim_mun_neo_tar[df_sim_mun_neo_tar$Municipio %in% cod_mun,]

In [45]:
length(unique(df_sim_mun_neo_tar$Municipio))

In [46]:
write.csv(df_sim_mun_neo_tar, file = "df_sim_ocor_mun_neo_tar.csv", row.names = FALSE)
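With numerators and denominators in hand, the rates themselves can be obtained by joining the two tables. The sketch below is an illustration, not the authors' procedure. It assumes the usual definition of the neonatal mortality rate as deaths per 1,000 live births and uses made-up counts in place of the CSV files written above. Because the aggregations use `"min_doc_count": 1`, city-months without deaths are absent from the numerators, so the join keeps every row of births and fills missing deaths with zero.

```r
# Toy denominators (births) and numerators (deaths); in practice these would
# come from files such as "df_sinasc_res_mun.csv" and "df_sim_res_mun_neo.csv".
births <- data.frame(Mes = c("2015-01", "2015-01", "2015-02"),
                     Municipio = c("330455", "355030", "330455"),
                     Nascimentos = c(400, 1000, 500),
                     stringsAsFactors = FALSE)
deaths <- data.frame(Mes = c("2015-01", "2015-02"),
                     Municipio = c("330455", "330455"),
                     Obitos = c(2, 1),
                     stringsAsFactors = FALSE)

# Left join on month and city: keep every city-month that has births
rates <- merge(births, deaths, by = c("Mes", "Municipio"), all.x = TRUE)

# City-months with no matching death record had zero neonatal deaths
rates$Obitos[is.na(rates$Obitos)] <- 0

# Rate per 1,000 live births (assumed definition)
rates$Taxa <- 1000 * rates$Obitos / rates$Nascimentos
print(rates)
```

Numerators and denominators should be paired consistently, for example deaths by city of residence with births by city of residence.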