# Demo Application with DataSUS death records and Streamlit

![datasus_app](../assets/datasus_app_pt.png)

In Brazil, [more than 70% of the population depends exclusively on the medical assistance provided by the government](http://bvsms.saude.gov.br/bvs/pacsaude/diretrizes.php). The Brazilian public healthcare system is called SUS (Sistema Único de Saúde). 

Fortunately for us, there is a public SUS data repository available online (DataSUS). Although the data is not always clean and complete, we can derive many insights from DataSUS. 

In this post we are going to build and deploy a Streamlit application inspired by the [Uber pickups example](https://github.com/streamlit/demo-uber-nyc-pickups), but using DataSUS death records (2006-2017) and the geographic coordinates of health facilities.

## Downloading the data from DataSUS

### SIM

From the [DataSUS website](http://www2.datasus.gov.br/DATASUS/index.php?area=060701) we have the definition of SIM:
> The Mortality Information System (SIM) was created by DATASUS for the regular collection of mortality data in the country. From the creation of the SIM it was possible to comprehensively capture mortality data to subsidize the various management spheres in public health. Based on this information it is possible to perform situation analysis, planning and evaluation of actions and programs in the area.

Let's download the SIM data for the state of São Paulo. The file names start with the prefix "DOSP".

#### Downloading the data from the ftp

In [2]:
from ftplib import FTP

ftp = FTP("ftp.datasus.gov.br")
ftp.login()
ftp.cwd("dissemin/publicos/SIM/CID10/DORES/")
all_files = ftp.nlst(".")
state_prefix = "DOSP"
# Sort the list and keep only the last 12 files (2006-2017),
# which share the same layout as the current data
files = sorted([file for file in all_files if state_prefix in file])[-12:]

for file in files:
    print("Downloading {}...".format(file))
    with open("../data/SIM/" + file, "wb") as fp:
        ftp.retrbinary("RETR {}".format(file), fp.write)

Downloading DOSP2006.dbc...
Downloading DOSP2007.DBC...
Downloading DOSP2008.dbc...
Downloading DOSP2009.dbc...
Downloading DOSP2010.DBC...
Downloading DOSP2011.DBC...
Downloading DOSP2012.DBC...
Downloading DOSP2013.dbc...
Downloading DOSP2014.dbc...
Downloading DOSP2015.dbc...
Downloading DOSP2016.dbc...
Downloading DOSP2017.dbc...
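
The `[-12:]` slice above depends on how many files the server lists on the day you run it. As a hedged alternative, the sketch below picks files by the year in their name. `select_sim_files` is a hypothetical helper, not part of any library, and it assumes names follow the `DOSP2006.dbc` pattern seen in the output above.

```python
import re


def select_sim_files(all_files, state_prefix="DOSP", first_year=2006, last_year=2017):
    """Return the state's files whose name carries a year in the given range."""
    # Names look like DOSP2006.dbc or DOSP2007.DBC (the extension case varies)
    pattern = re.compile(
        r"^{}(\d{{4}})\.dbc$".format(re.escape(state_prefix)), re.IGNORECASE
    )
    selected = []
    for name in all_files:
        match = pattern.match(name)
        if match and first_year <= int(match.group(1)) <= last_year:
            selected.append(name)
    return sorted(selected)


# A made-up listing, only to show the filtering
listing = ["DOSP2005.dbc", "DOSP2006.dbc", "DOSP2007.DBC", "DOSP2018.dbc", "DORJ2006.dbc"]
selected = select_sim_files(listing)
print(selected)
```

With a real connection you would pass `ftp.nlst(".")` instead of the made-up listing.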


#### Renaming the files whose extension is capitalized

In [4]:
import os

files = [file for file in os.listdir("../data/SIM/") if "DOSP" in file and ".DBC" in file]

for file in files:
    os.rename("../data/SIM/" + file, "../data/SIM/" + file[:-4] + ".dbc")

#### Converting from .dbc to .csv
As you may have noticed, the files are in a `.dbc` format. This is a proprietary format of the SUS Department of Informatics (DATASUS).

A kind developer provided a [tool](https://github.com/greatjapa/dbc2csv) to convert files from `.dbc` to `.csv`. To use this tool we will need to have `git` and `docker` installed.

##### Build the docker image
```bash
git clone https://github.com/greatjapa/dbc2csv.git
cd dbc2csv
docker build -t dbc2csv .
```

##### Convert the files
1. Navigate to the folder where you downloaded the `.dbc` files and copy the full path to the directory. You can get this path by running:
```bash
pwd
```
2. Run:
```bash
docker run -it -v <full_path_to_the_directory>:/usr/src/app/data dbc2csv make
```
3. A `/csv` folder will be populated with the converted files.

![dbc2csv](../assets/dbc2csv.png)
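
Before moving on, it can be worth checking that every year was converted. The sketch below is a minimal check, assuming the converted files are named like `DOSP2006.csv` (an assumption about the tool's output naming). `missing_years` is a hypothetical helper, and the demo runs in a temporary directory so it does not touch your data.

```python
import os
import tempfile


def missing_years(csv_dir, prefix="DOSP", years=range(2006, 2018)):
    """List the years for which no <prefix><year>.csv file exists in csv_dir."""
    # Compare in lowercase because the extension case is not consistent
    found = {name.lower() for name in os.listdir(csv_dir)}
    return [y for y in years if "{}{}.csv".format(prefix, y).lower() not in found]


# Demo with two empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for year in (2006, 2007):
        open(os.path.join(tmp, "DOSP{}.csv".format(year)), "w").close()
    gaps = missing_years(tmp, years=range(2006, 2009))

print(gaps)
```

On the real data you would call `missing_years("../data/SIM/csv/")` and expect an empty list.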

### CNES
From their [website](http://cnes.datasus.gov.br/):
> The National Register of Health Facilities (CNES) is a public document and official information system for registering information about all health facilities in the country, regardless of their legal nature or integration with the Unified Health System (SUS).

The process to download the data is simpler this time. The files are already packed in a `.zip` archive that you can download from this link:

ftp://ftp.datasus.gov.br/cnes/BASE_DE_DADOS_CNES_201910.ZIP

We are going to use only one `.csv` file from this data: `tbEstabelecimento201910.csv`
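
Since only one file is needed, you can extract just that member instead of unpacking the whole archive. This is a sketch using the standard library's `zipfile`. `extract_member` is a hypothetical helper, and the paths in the usage comment are assumptions about where you saved the download.

```python
import os
import zipfile


def extract_member(zip_path, member_suffix, dest_dir):
    """Extract the first archive member whose name ends with member_suffix."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(member_suffix):
                # ZipFile.extract returns the path of the extracted file
                return archive.extract(name, path=dest_dir)
    raise FileNotFoundError(member_suffix)


# Usage (assumed paths, adjust to your setup):
# extract_member(
#     "../data/CNES/BASE_DE_DADOS_CNES_201910.ZIP",
#     "tbEstabelecimento201910.csv",
#     "../data/CNES/",
# )
```

Matching on the suffix keeps the helper working even if the archive stores the file inside a subfolder.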

## Processing the data

### Reading the facilities table with pandas

To keep memory usage low, we will load only the columns that matter to our application.

In [5]:
import pandas as pd

cnes = pd.read_csv(
    "../data/CNES/tbEstabelecimento201910.csv",
    sep=";",
    usecols=[
        "CO_CNES",
        "CO_CEP",
        "NO_FANTASIA",
        "NO_LOGRADOURO",
        "CO_ESTADO_GESTOR",
        "NU_LATITUDE",
        "NU_LONGITUDE",
    ],
)

### Filtering the data for the São Paulo state
From the dictionary available on the DataSUS website we know that '35' is the code for São Paulo. For this application we are only going to keep the data for this state.

In [6]:
cnes = cnes[cnes["CO_ESTADO_GESTOR"]==35]

### Merging with the death records
My converted `.csv` SIM files are in the `../data/SIM/csv/` folder; make sure you modify the path accordingly.

In [8]:
files = sorted(os.listdir("../data/SIM/csv/"))

dfs = [
    pd.read_csv(
        "../data/SIM/csv/" + file,
        usecols=["NUMERODO", "DTOBITO", "HORAOBITO", "CODESTAB"],
        low_memory=False
    )
    for file in files
]
df = pd.concat(dfs)

# We will drop the null CODESTABs (data without CNES code)
df = df.dropna()

Before proceeding to fill the missing coordinates, join the CODESTAB with the CO_CNES, so we have fewer facilities to fill.

In [12]:
cnes = cnes.rename(columns={"CO_CNES": "CODESTAB"})
merged = df.merge(cnes, on="CODESTAB")

# Since we merged with the death records file, we have many duplicates.
# Let's drop them to see which facilities have missing coordinates
unique_merged = merged[
    ["CODESTAB", "CO_CEP", "NU_LATITUDE", "NU_LONGITUDE"]
].drop_duplicates()
# Keep only the records where the coordinates are missing.
# .copy() gives us an independent DataFrame, so the assignments below
# do not trigger pandas' SettingWithCopyWarning
missing_coords = unique_merged[unique_merged["NU_LATITUDE"].isnull()].copy()
# The CEP was automatically converted to int and we lost the leading zero.
# This line converts it to a string and pads it with zeros so we have a valid CEP
missing_coords["CO_CEP"] = missing_coords["CO_CEP"].astype(str).str.zfill(8)


We have 697 CEPs without coordinates, let's try to fill them up.
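
The leading-zero problem mentioned in the code comments can be reproduced in isolation. The sketch below uses a tiny made-up CSV (the values are illustrative, not taken from CNES) to compare padding after the fact with reading the column as a string from the start.

```python
import io

import pandas as pd

# Two made-up rows; the first CEP starts with a zero
raw = "CO_CNES;CO_CEP\n1;01310100\n2;13083970\n"

# Default parsing infers an integer column and drops the leading zero
as_int = pd.read_csv(io.StringIO(raw), sep=";")
padded = as_int["CO_CEP"].astype(str).str.zfill(8)

# Asking for a string column keeps the CEP exactly as written
as_str = pd.read_csv(io.StringIO(raw), sep=";", dtype={"CO_CEP": str})

print(as_int["CO_CEP"].tolist())
print(padded.tolist())
print(as_str["CO_CEP"].tolist())
```

Reading the column as a string also sidesteps a subtler case: if the column contains missing values, pandas parses it as float, and `astype(str)` would then yield values such as `1310100.0`, which `zfill` cannot repair.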

## Enriching the data from DataSUS with latitude and longitude (cep_to_coords)

The data we downloaded from DataSUS is not complete. Geographic coordinates of various health facilities are missing. While latitude and longitude are not present in all cases, we do have the Brazilian zip code (CEP) for some. 

A quick Google search for converting a CEP to latitude and longitude turned up some scripts that mixed R and Python to achieve this task. 

Investigating the scripts further, it became clear that it would be simple and valuable to implement this in Python alone. So I removed the dependency on R and achieved the same result with just Python (https://github.com/millengustavo/cep_to_coords). 

### Install cep_to_coords from source
```bash
git clone https://github.com/millengustavo/cep_to_coords.git
cd cep_to_coords
git checkout master
pip install -e .
```

The package is simple to use. Call the `cep_to_coords` function with a valid CEP string: it searches the Correios API for an address, concatenates it with the city and country, and queries a geocoding [API](http://photon.komoot.de/) to get the coordinates.
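
To make the last step concrete, here is a hedged sketch of what a geocoding query and its response handling can look like. It is not the package's actual implementation. The endpoint in `PHOTON_API`, the helper names, and the sample payload are assumptions for illustration, and the code only builds the URL and parses a GeoJSON-style response without making a network call.

```python
from urllib.parse import urlencode

# Assumed public Photon endpoint; check the service's documentation before use
PHOTON_API = "https://photon.komoot.io/api/"


def build_query_url(street, city, country="Brasil"):
    """Build a search URL from the address parts, skipping empty ones."""
    address = " ".join(part for part in (street, city, country) if part)
    return PHOTON_API + "?" + urlencode({"q": address, "limit": 1})


def parse_coords(geojson):
    """Return [lat, lon] from a GeoJSON FeatureCollection, or [NaN, NaN]."""
    features = geojson.get("features") or []
    if not features:
        return [float("nan"), float("nan")]
    # GeoJSON stores points as [longitude, latitude]
    lon, lat = features[0]["geometry"]["coordinates"][:2]
    return [lat, lon]


# Made-up payload shaped like a GeoJSON response
sample = {"features": [{"geometry": {"type": "Point", "coordinates": [-46.6544, -23.5632]}}]}
url = build_query_url("Avenida Paulista", "São Paulo")
coords = parse_coords(sample)
print(url)
print(coords)
```

Note the coordinate order: GeoJSON puts longitude first, so the values are swapped before returning `[lat, lon]`.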

If you find it useful, please leave a star on [GitHub](https://github.com/millengustavo/cep_to_coords). The project is still in its infancy, so it is a great opportunity to [contribute to your first open source project](https://medium.com/@austintackaberry/why-you-should-contribute-to-open-source-software-right-now-bec8bd83cfc0) by adding features or refactoring the code!

### Fill the coordinates

In [13]:
from cep_to_coords.geocode import cep_to_coords

cep_column = "CO_CEP"

unique_ceps = missing_coords[cep_column].unique()
# cep_to_coords returns a [lat, lon] list if it finds the coordinates
# else it returns [NaN, NaN]
missing_coords["lat"] = float("nan")
missing_coords["lon"] = float("nan")
for ind, elem in enumerate(unique_ceps):
    try:
        coords = cep_to_coords(elem)
        # Write the result to every facility that shares this CEP
        # (select rows by CEP value, not by the loop counter)
        same_cep = missing_coords[cep_column] == elem
        missing_coords.loc[same_cep, "lat"] = coords[0]
        missing_coords.loc[same_cep, "lon"] = coords[1]
    except Exception as e:
        print(elem, e)
    print("{}%...".format(ind * 100 / len(unique_ceps)))

Requesting coordinates from Estrada da Riviera São Paulo Brasil
0.0%...


Requesting coordinates from Avenida Alcântara Machado São Paulo Brasil
0.1594896331738437%...


Requesting coordinates from Rua dos Economiários Barretos Brasil
0.3189792663476874%...
Requesting coordinates from Rua Melvin Jones Arujá Brasil
0.4784688995215311%...
Requesting coordinates from Rua Braz Sanches Arriaga Birigüi Brasil
0.6379585326953748%...
Requesting coordinates from Rua Estevão Fernandes São Paulo Brasil
0.7974481658692185%...
Requesting coordinates from Avenida Santo Amaro São Paulo Brasil
0.9569377990430622%...
1.1164274322169059%...
Requesting coordinates from Avenida Nossa Senhora do Sabará São Paulo Brasil
1.2759170653907497%...
Requesting coordinates from Avenida Washington Luiz Presidente Prudente Brasil
1.4354066985645932%...
Requesting coordinates from Avenida Santo Amaro São Paulo Brasil
1.594896331738437%...
Requesting coordinates from Praça Alexandre Fleming Botucatu Brasil
1.7543859649122806%...
Requesting coordinates from Rodovia Regis Bittencourt Taboão da Serra Brasil
1.9138755980861244%...
...

> Using the cep_to_coords function we were able to fill **78%** of the missing coordinates!
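
A variation on the fill loop above is to build a CEP-to-coordinates lookup once and then map it onto the table, which also makes the fill rate easy to compute. The sketch below is illustrative: `geocode_ceps` and `fake_geocoder` are hypothetical names, the coordinates are made up, and the stand-in geocoder replaces `cep_to_coords` so the example runs offline.

```python
import pandas as pd


def geocode_ceps(ceps, geocoder):
    """Call geocoder once per distinct CEP and return {cep: (lat, lon)}."""
    lookup = {}
    for cep in pd.Series(ceps).unique():
        try:
            lat, lon = geocoder(cep)
        except Exception as err:  # keep going if a single lookup fails
            print(cep, err)
            lat, lon = float("nan"), float("nan")
        lookup[cep] = (lat, lon)
    return lookup


def fake_geocoder(cep):
    """Offline stand-in for cep_to_coords with one made-up known CEP."""
    known = {"01310100": [-23.56, -46.65]}
    return known.get(cep, [float("nan"), float("nan")])


facilities = pd.DataFrame({"CO_CEP": ["01310100", "01310100", "99999999"]})
lookup = geocode_ceps(facilities["CO_CEP"], fake_geocoder)
facilities["lat"] = facilities["CO_CEP"].map(lambda cep: lookup[cep][0])
facilities["lon"] = facilities["CO_CEP"].map(lambda cep: lookup[cep][1])

# Share of facilities that ended up with coordinates
fill_rate = facilities["lat"].notna().mean()
print(fill_rate)
```

Because the lookup is keyed by CEP, facilities that share a CEP get the same coordinates without a second request.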

### Compiling the final CEP table

To complete the data preparation, we need to take our filled coordinates and replace the NaNs on the death records table.

In [14]:
unique_merged["CO_CEP"] = (
    unique_merged["CO_CEP"].astype(str).apply(lambda x: x.zfill(8))
)
# unfortunately we didn't fill all coordinates, let's drop them
missing_coords = missing_coords.drop(columns=["NU_LATITUDE", "NU_LONGITUDE"]).dropna()
# joining the datasets
full_table = unique_merged.merge(missing_coords, on="CO_CEP", how="left")
# filling the missing data
full_table["lat"] = full_table.apply(
    lambda x: x["lat"] if pd.isnull(x["NU_LATITUDE"]) else x["NU_LATITUDE"], axis=1
)
full_table["lon"] = full_table.apply(
    lambda x: x["lon"] if pd.isnull(x["NU_LONGITUDE"]) else x["NU_LONGITUDE"], axis=1
)
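# compiling the final CEP table; the merge on CO_CEP can repeat a facility
# when several facilities share a CEP, so we keep one row per CODESTAB
full_table = (
    full_table.drop(columns=["NU_LATITUDE", "NU_LONGITUDE", "CODESTAB_y"])
    .dropna()
    .rename(columns={"CODESTAB_x": "CODESTAB"})
    .drop_duplicates(subset="CODESTAB")
    .reset_index(drop=True)
)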
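
The two `apply` calls above pick the official coordinate when it exists and fall back to the geocoded one otherwise. The same rule can be written with `Series.fillna`, which avoids the row-by-row lambda. The sketch uses a tiny made-up table with only the latitude columns.

```python
import numpy as np
import pandas as pd

# Made-up rows: official value present, only geocoded value present, neither present
table = pd.DataFrame(
    {
        "NU_LATITUDE": [-23.5, np.nan, np.nan],
        "lat": [np.nan, -22.9, np.nan],
    }
)

# Prefer NU_LATITUDE and fall back to the geocoded "lat" where it is missing
table["lat"] = table["NU_LATITUDE"].fillna(table["lat"])
print(table["lat"].tolist())
```

The longitude column would be handled the same way with `NU_LONGITUDE` and `lon`.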

### Merging the facilities back to the death records dataframe and cleaning the data

In [15]:
df_enriched = df.merge(full_table, on="CODESTAB")
df_enriched["HORAOBITO"] = pd.to_numeric(
    df_enriched["HORAOBITO"], downcast="integer", errors="coerce"
)
df_enriched = df_enriched.dropna()
df_enriched["DTOBITO"] = df_enriched["DTOBITO"].astype(str).apply(lambda x: x.zfill(8))
df_enriched["HORAOBITO"] = (
    df_enriched["HORAOBITO"].astype(int).astype(str).apply(lambda x: x.zfill(4))
)
# Creating a timestamp column with both date and hour of death
df_enriched["DATA"] = df_enriched["DTOBITO"] + " " + df_enriched["HORAOBITO"]
df_enriched["DATA"] = pd.to_datetime(
    df_enriched["DATA"], format="%d%m%Y %H%M", errors="coerce"
)
df_enriched = df_enriched.dropna()
df_enriched["NUMERODO"] = df_enriched["NUMERODO"].astype(str)

df_enriched["lat"] = (
    df_enriched["lat"].astype(str).str.replace(",", ".", regex=False).astype(float)
)

df_enriched["lon"] = (
    df_enriched["lon"].astype(str).str.replace(",", ".", regex=False).astype(float)
)
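
To see what the padding and parsing steps do, here is the same logic on three made-up records (the values are illustrative, not taken from SIM). The third date is deliberately invalid to show how `errors="coerce"` turns it into `NaT`, which the later `dropna()` removes.

```python
import pandas as pd

# DTOBITO is stored as DDMMYYYY and HORAOBITO as HHMM, both without leading zeros
sample = pd.DataFrame(
    {
        "DTOBITO": [1012006, 25122017, 31022010],
        "HORAOBITO": [930.0, 5.0, 1200.0],
    }
)

date = sample["DTOBITO"].astype(str).str.zfill(8)
hour = sample["HORAOBITO"].astype(int).astype(str).str.zfill(4)
sample["DATA"] = pd.to_datetime(
    date + " " + hour, format="%d%m%Y %H%M", errors="coerce"
)
print(sample["DATA"])
```

The first row becomes 1 January 2006 at 09:30, the second 25 December 2017 at 00:05, and the third (31 February) cannot be parsed.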


### Saving to a .parquet file

In [30]:
df_enriched.to_parquet("../data/clean/dataset.parquet.gzip", compression="gzip", index=False)

## Creating the app using the Uber pickups example

Streamlit, according to its website, is 
> “The fastest way to build custom ML tools”. 

It is indeed a bold statement, but what sold me on it was the sentence in the subtitle:
> “So you can stop spending time on frontend development and get back to what you do best.”

For this experiment, we are going to spend even less time on frontend development by using an example kindly shared by the Streamlit team (https://github.com/streamlit/demo-uber-nyc-pickups). The demo presents the Uber pickups in New York City by hour. Our goal here is to replace pickups with deaths registered in SIM and New York City with the state of São Paulo. 

There are only a few things we need to change in the code to adapt the application to our use. 

### Clone the repository
```bash
git clone https://github.com/streamlit/demo-uber-nyc-pickups.git
cd demo-uber-nyc-pickups
```

### Open app.py in your favorite text editor and change the following lines (commented here)


In [31]:
# OLD -> DATE_TIME = "date/time"
# NEW -> DATE_TIME = "data"

# OLD -> data = pd.read_csv(DATA_URL, nrows=nrows)
# NEW -> data = pd.read_parquet("../data/clean/dataset.parquet.gzip")
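
Note that `DATE_TIME` is lowercase while our column is called `DATA`. This works under the assumption that the demo's loader lowercases all column names after reading the data; verify this in your copy of `app.py`. The sketch below mimics that behaviour with plain pandas so it can run without Streamlit. `prepare` and `filter_by_hour` are hypothetical names, and the records are made up.

```python
import pandas as pd

DATE_TIME = "data"


def prepare(data):
    """Lowercase the column names and make sure DATE_TIME is a datetime column."""
    data = data.rename(columns=str.lower)
    data[DATE_TIME] = pd.to_datetime(data[DATE_TIME])
    return data


def filter_by_hour(data, hour):
    """Keep only the records whose timestamp falls within the given hour."""
    return data[data[DATE_TIME].dt.hour == hour]


# Made-up records with the same column names as our parquet file
records = pd.DataFrame(
    {
        "DATA": pd.to_datetime(["2017-01-01 09:30", "2017-01-02 14:00"]),
        "lat": [-23.5, -22.9],
        "lon": [-46.6, -47.0],
    }
)
prepared = prepare(records)
morning = filter_by_hour(prepared, 9)
print(list(prepared.columns), len(morning))
```

If your copy of the demo does not lowercase the columns, set `DATE_TIME = "DATA"` instead.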

For cosmetic purposes, you may also change the title and other references that are specific to the Uber example.

### Install streamlit
```bash
pip install streamlit
```

### Run the app
```bash
streamlit run app.py
```

![terminal_streamlit](../assets/terminal_streamlit.png)

Voilà! This command will automatically open the app in your browser (port 8501 by default).

# Conclusion
This is a very simple project that shows off some of the amazing libraries that have recently been developed for Machine Learning applications. Although we didn't use any complex techniques here, we covered an interesting part of what a data scientist does: data ingestion, cleaning, enrichment and, finally, visualization.

I hope you learned something from this, and I encourage you to play around with Streamlit. It's definitely amazing!