<img src="https://raw.githubusercontent.com/harmonize-tools/socio4health/main/docs/source/_static/Harmonize_LogoH.png" alt="image info" height="100" width="200"/> <img src="https://raw.githubusercontent.com/harmonize-tools/socio4health/main/docs/source/_static/image.png" alt="image info" height="100" width="100"/>

# Hands-on with socio4health: effects of hydrometeorological hazards and urbanization on dengue risk in Brazil



**Run the tutorial via free cloud platforms:** [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/harmonize-tools/socio4health/HEAD?urlpath=%2Fdoc%2Ftree%2Fdocs%2Fsource%2Fnotebooks%2Fexample_brazil.ipynb) <a target="_blank" href="https://colab.research.google.com/github/harmonize-tools/socio4health/blob/main/docs/source/notebooks/example_brazil.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook provides a real-world example of how to use **socio4health** to **retrieve**, **harmonize** and **analyze** **socioeconomic and demographic** variables in Brazil, such as the level of urbanization and access to water supply. The goal is to recreate the dataset used in the publication *Combined effects of hydrometeorological hazards and urbanisation on dengue risk in Brazil: a spatiotemporal modelling study* by Lowe et al., published in *The Lancet Planetary Health* in 2021 ([DOI](https://doi.org/10.1016/S2542-5196(20)30292-8)). The study evaluated how the association between hydrometeorological events and **dengue** risk varies with these variables. This tutorial assumes an **intermediate** or **advanced** understanding of **Python** and data manipulation.

## Setting up the environment

To run this notebook, you need to have the following prerequisites:

- **Python 3.10+**

Additionally, you need to install the `socio4health` and `pandas` packages, which can be done using `pip`:



In [None]:
!pip install socio4health pandas -q

In [1]:
import sys
import os

custom_path = "../../../src"
if custom_path not in sys.path:
    sys.path.insert(0, custom_path)

## Import Libraries

The `socio4health` library provides the `Extractor` class for data extraction and the `Harmonizer` class for harmonizing the retrieved data. `pandas` will be used for data manipulation. Additionally, we will use some utility functions from the `socio4health.utils.harmonizer_utils` module to standardize and translate the dictionary.


In [None]:
import datetime
import tqdm as notebook_tqdm
import pandas as pd
from socio4health import Extractor
from socio4health.harmonizer import Harmonizer
from socio4health.utils import harmonizer_utils, extractor_utils



## 1. Load and standardize the dictionary
To harmonize the data, provide a dictionary that describes the variables in the dataset. The study retrieved data from the 2010 census, from DATASUS [here](https://microdatos.dane.gov.co/index.php/catalog/643/download/14045). Follow the steps in the tutorial ["How to Create a Raw Dictionary for Data Harmonization"](https://harmonize-tools.github.io/socio4health/dictionary.html) to create a raw dictionary in Excel format, which we will then standardize and translate to English.



In [3]:
# Load the raw dictionary, standardize it, and derive the fixed-width layout
raw_dic = pd.read_excel("../../../../input/Diccionario Crudo Censo2.xlsx")
dic = harmonizer_utils.standardize_dict(raw_dic)
colnames, colspecs = extractor_utils.parse_fwf_dict(dic)
dic



| Unnamed: 0 | variable_name | question | description | value | initial_position | final_position | size | dec | type | possible_answers |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | V0402 | a responsabilidade pelo domicílio é de: | | 1.0; 2.0; 9.0 | 107.0 | 107.0 | 1.0 | | C | apenas um morador; mais de um morador; ignorado |
| 1 | V0209 | abastecimento de água, canalização: | | 1.0; 2.0; 3.0 | 90.0 | 90.0 | 1.0 | | C | sim, em pelo menos um cômodo; sim, só na propr... |
| 2 | V0208 | abastecimento de água, forma: | | 1.0; 2.0; 3.0; 4.0; 5.0; 6.0; 7.0; 8.0; 9.0; 10.0 | 88.0 | 89.0 | 2.0 | | C | rede geral de distribuição; poço ou nascente n... |
| 3 | V6210 | adequação da moradia | | 1.0; 2.0; 3.0 | 144.0 | 144.0 | 1.0 | | C | adequada; semi-adequada; inadequada |
| 4 | V0301 | alguma pessoa que morava com você(s) estava mo... | | 1.0; 2.0 | 104.0 | 104.0 | 1.0 | | C | sim; não |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | V0214 | televisão, existência: | | 1.0; 2.0 | 95.0 | 95.0 | 1.0 | | C | sim; não |
| 72 | V4002 | tipo de espécie: | | 11.0; 12.0; 13.0; 14.0; 15.0; 51.0; 52.0; 53.0... | 56.0 | 57.0 | 2.0 | | C\n | casa; casa de vila ou em condomínio; apartamen... |
| 73 | V0001 | unidade da federação: | | 11.0; 12.0; 13.0; 14.0; 15.0; 16.0; 17.0; 21.0... | 1.0 | 2.0 | 2.0 | | A | rondônia; acre; amazonas; roraima; pará; amapá... |
| 74 | V2011 | valor do aluguel (em reais) | | | 59.0 | 64.0 | 6.0 | | N | |
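
The `initial_position` and `final_position` columns are what make a fixed-width file readable: each variable occupies a known character range on every line. The sketch below shows one way such positions could be turned into column names and slice boundaries. It assumes the positions are 1-based and inclusive (consistent with the `size` column above); `positions_to_colspecs` is a hypothetical helper written for illustration and is not the implementation of `extractor_utils.parse_fwf_dict`.

```python
def positions_to_colspecs(rows):
    """Hypothetical helper: convert 1-based inclusive positions
    into 0-based half-open (start, end) slices, ordered by position."""
    names, specs = [], []
    for row in sorted(rows, key=lambda r: r["initial_position"]):
        start = int(row["initial_position"]) - 1  # 1-based -> 0-based
        end = int(row["final_position"])          # inclusive -> exclusive
        names.append(row["variable_name"])
        specs.append((start, end))
    return names, specs


# Three rows taken from the dictionary output above
rows = [
    {"variable_name": "V0208", "initial_position": 88.0, "final_position": 89.0},
    {"variable_name": "V0001", "initial_position": 1.0, "final_position": 2.0},
    {"variable_name": "V2011", "initial_position": 59.0, "final_position": 64.0},
]

example_names, example_specs = positions_to_colspecs(rows)
print(example_names)
print(example_specs)
```

Each resulting slice width (`end - start`) matches the `size` value reported for that variable in the dictionary.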


In [10]:
# Raw string so that `\.` reaches the regex engine unchanged (avoids an invalid-escape warning)
bra_online_extractor = Extractor(input_path="https://www.ibge.gov.br/estatisticas/sociais/saude/9662-censo-demografico-2010.html?=&t=microdados",
                                 down_ext=['.txt','.zip'],
                                 output_path="../../../../input/IBGE_2010",
                                 key_words=[r"^[A-Z]+\.zip$"],
                                 depth=0, is_fwf=True, colnames=colnames, colspecs=colspecs)
bra_Censo_2010 = bra_online_extractor.extract()

2025-09-01 11:09:03,919 - INFO - ----------------------
2025-09-01 11:09:03,919 - INFO - Starting data extraction...
2025-09-01 11:09:03,920 - INFO - Extracting data in online mode...
2025-09-01 11:09:03,920 - INFO - Scraping URL: https://www.ibge.gov.br/estatisticas/sociais/saude/9662-censo-demografico-2010.html?=&t=microdados with depth 0
2025-09-01 11:09:08,167 - INFO - Spider completed successfully for URL: https://www.ibge.gov.br/estatisticas/sociais/saude/9662-censo-demografico-2010.html?=&t=microdados
2025-09-01 11:09:08,169 - INFO - Downloading files to: ../../../../input/IBGE_2010
Downloading files:   0%|          | 0/27 [00:00<?, ?it/s]2025-09-01 11:09:11,202 - INFO - Successfully downloaded: RO.zip
Downloading files:   4%|▎         | 1/27 [00:03<01:18,  3.03s/it]2025-09-01 11:09:13,246 - INFO - Successfully downloaded: AC.zip
Downloading files:   7%|▋         | 2/27 [00:05<01:01,  2.45s/it]2025-09-01 11:09:17,043 - INFO - Successfully downloade
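
The `key_words` pattern decides which links on the page are downloaded. As a rough illustration of what `^[A-Z]+\.zip$` accepts, the sketch below applies it with Python's `re` module to a few file names. Only `RO.zip` and `AC.zip` come from the log above; the other names are made up for the example, and how `Extractor` applies the pattern internally is an assumption here.

```python
import re

# Same pattern as in key_words, written as a raw string
pattern = re.compile(r"^[A-Z]+\.zip$")

# RO.zip and AC.zip appear in the log; the rest are hypothetical names
candidates = ["RO.zip", "AC.zip", "Documentacao.zip", "ro.zip", "notes.txt"]

# Keep only names made of uppercase letters followed by ".zip"
selected = [name for name in candidates if pattern.search(name)]
print(selected)
```

Names containing lowercase letters, digits, or another extension are rejected, which is how the pattern narrows the download to the per-state archives.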

In [9]:
bra_extractor = Extractor(input_path="../../../../input/IBGE_2010",
                          down_ext=['.txt'], is_fwf=True,
                          output_path="../../../../output/IBGE_2010",
                          colnames=colnames, colspecs=colspecs)
bra_Censo_2010 = bra_extractor.extract()

2025-09-01 12:12:31,416 - INFO - ----------------------
2025-09-01 12:12:31,416 - INFO - Starting data extraction...
2025-09-01 12:12:31,416 - INFO - Extracting data in local mode...
Processing files: 100%|██████████| 84/84 [05:56<00:00,  4.24s/it]
2025-09-01 12:18:27,565 - INFO - Successfully processed 84/84 files
2025-09-01 12:18:27,565 - INFO - Extraction completed successfully.
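
With `is_fwf=True`, each line of the `.txt` files is interpreted by character position rather than by a delimiter. A minimal sketch of that idea in plain Python follows; the record line is invented for illustration, `parse_fixed_width_line` is a hypothetical helper, and the two slices correspond to `V0001` (positions 1 to 2) and `V2011` (positions 59 to 64) from the dictionary.

```python
def parse_fixed_width_line(line, names, specs):
    """Hypothetical helper: cut one fixed-width line into named fields."""
    return {name: line[start:end].strip()
            for name, (start, end) in zip(names, specs)}


names = ["V0001", "V2011"]
specs = [(0, 2), (58, 64)]  # 0-based, half-open slices

# Invented record: state code "11", padding, then a 6-character rent value
line = "11" + " " * 56 + "000350"

record = parse_fixed_width_line(line, names, specs)
print(record)
```

Values come out as strings, so numeric variables (type `N` in the dictionary) still need to be converted after extraction.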


In [10]:
har = Harmonizer()
har.similarity_threshold = 0.9
dfs = har.vertical_merge(bra_Censo_2010)

Grouping DataFrames: 100%|██████████| 84/84 [00:00<00:00, 396.94it/s]
Merging groups: 100%|██████████| 1/1 [00:00<00:00,  1.65it/s]
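
The log shows the 84 tables collapsing into a single group before being merged. One plausible way to think about `similarity_threshold` is as a minimum overlap between the column sets of two tables. The sketch below uses Jaccard similarity for that purpose. This is an assumption made for illustration: the metric and grouping strategy that `vertical_merge` actually uses are not stated in this tutorial, and `jaccard` and `group_by_columns` are hypothetical helpers.

```python
def jaccard(a, b):
    """Share of column names two tables have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def group_by_columns(tables, threshold=0.9):
    """Hypothetical grouping: put a table in the first group whose
    reference table is similar enough, else start a new group."""
    groups = []
    for idx, cols in enumerate(tables):
        for group in groups:
            if jaccard(cols, tables[group[0]]) >= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups


# Column lists of three illustrative tables
tables = [
    ["V0001", "V0208", "V0209"],
    ["V0001", "V0208", "V0209"],
    ["V0001", "V2011"],
]

groups = group_by_columns(tables)
print(groups)
```

Under this reading, tables that share the same layout end up in one group and can be stacked row-wise, while a table with a different layout stays separate.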


In [11]:
dic = harmonizer_utils.translate_column(dic, "question", language="en")
dic = harmonizer_utils.translate_column(dic, "description", language="en")
dic = harmonizer_utils.translate_column(dic, "possible_answers", language="en")
dic = harmonizer_utils.classify_rows(dic, "question_en", "description_en", "possible_answers_en",
                                        new_column_name="category",
                                        MODEL_PATH="../../../../input/bert_finetuned_classifier")
dic

question translated
description translated
possible_answers translated


Device set to use cpu


| Unnamed: 0 | variable_name | question | description | value | initial_position | final_position | size | dec | type | possible_answers | question_en | description_en | possible_answers_en | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | V0402 | a responsabilidade pelo domicílio é de: | | 1.0; 2.0; 9.0 | 107.0 | 107.0 | 1.0 | | C | apenas um morador; mais de um morador; ignorado | The responsibility for the home is: | | just a resident; more than one resident; ignored | Housing |
| 1 | V0209 | abastecimento de água, canalização: | | 1.0; 2.0; 3.0 | 90.0 | 90.0 | 1.0 | | C | sim, em pelo menos um cômodo; sim, só na propr... | water supply, channeling: | | Yes, in at least one room; Yes, only on the pr... | Housing |
| 2 | V0208 | abastecimento de água, forma: | | 1.0; 2.0; 3.0; 4.0; 5.0; 6.0; 7.0; 8.0; 9.0; 10.0 | 88.0 | 89.0 | 2.0 | | C | rede geral de distribuição; poço ou nascente n... | water supply, form: | | General Distribution Network; well or source o... | Business |
| 3 | V6210 | adequação da moradia | | 1.0; 2.0; 3.0 | 144.0 | 144.0 | 1.0 | | C | adequada; semi-adequada; inadequada | Housing Adequacy | | adequate; semi-adherence; inadequate | Housing |
| 4 | V0301 | alguma pessoa que morava com você(s) estava mo... | | 1.0; 2.0 | 104.0 | 104.0 | 1.0 | | C | sim; não | Someone who lived with you (s) was living in a... | | Yes; no | Business |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | V0214 | televisão, existência: | | 1.0; 2.0 | 95.0 | 95.0 | 1.0 | | C | sim; não | television, existence: | | Yes; no | Identification |
| 72 | V4002 | tipo de espécie: | | 11.0; 12.0; 13.0; 14.0; 15.0; 51.0; 52.0; 53.0... | 56.0 | 57.0 | 2.0 | | C\n | casa; casa de vila ou em condomínio; apartamen... | Type of species: | | home; village house or condominium; apartment;... | Housing |
| 73 | V0001 | unidade da federação: | | 11.0; 12.0; 13.0; 14.0; 15.0; 16.0; 17.0; 21.0... | 1.0 | 2.0 | 2.0 | | A | rondônia; acre; amazonas; roraima; pará; amapá... | Federation unit: | | Rondônia; acre; Amazonas; Roraima; to; Amapá; ... | Business |
| 74 | V2011 | valor do aluguel (em reais) | | | 59.0 | 64.0 | 6.0 | | N | | Rental value (in reais) | | | Business |


In [12]:
har.dict_df = dic
har.categories = ["Housing"]
filtered_ddfs = har.data_selector(dfs)
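
Setting `har.categories = ["Housing"]` restricts the selection to variables whose `category` in the dictionary is `Housing`. The sketch below mimics that filter on plain Python structures so the effect is easy to inspect. The category labels are copied from the classified dictionary above; the record values are invented, and the logic is only an assumed approximation of what `data_selector` does.

```python
# Variable names and categories as they appear in the classified dictionary
dictionary = [
    {"variable_name": "V0402", "category": "Housing"},
    {"variable_name": "V0209", "category": "Housing"},
    {"variable_name": "V0208", "category": "Business"},
    {"variable_name": "V6210", "category": "Housing"},
]
wanted = {"Housing"}

# Variables that survive the category filter
keep = [row["variable_name"] for row in dictionary if row["category"] in wanted]

# Invented household record, reduced to the selected variables
record = {"V0402": "1", "V0209": "1", "V0208": "01", "V6210": "2"}
filtered = {name: value for name, value in record.items() if name in keep}

print(keep)
print(filtered)
```

In the classified dictionary above, `V0208` (form of water supply) was labelled `Business`, so a `Housing`-only filter leaves it out. It is worth reviewing the predicted categories before filtering when a specific variable is needed.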



In [21]:
filtered_ddfs

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [20]:
filtered_ddfs[0].compute()

KeyboardInterrupt: 