<img src="https://raw.githubusercontent.com/harmonize-tools/socio4health/main/docs/source/_static/image.png" alt="image info" height="100" width="100"/>




# Extraction of Colombia, Brazil and Peru online data

**Run the tutorial via free cloud platforms:** [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/harmonize-tools/socio4health/HEAD?urlpath=%2Fdoc%2Ftree%2Fdocs%2Fsource%2Fnotebooks%2Fextractor.ipynb) <a target="_blank" href="https://colab.research.google.com/github/harmonize-tools/socio4health/blob/main/docs/source/notebooks/extractor.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



This notebook introduces how to retrieve data from online sources through **web scraping**, as well as from **local files**, for **Colombia**, **Brazil**, **Peru**, and the **Dominican Republic**. The worked examples below cover online extraction for Colombia, Brazil, and Peru. This tutorial assumes you have an **intermediate** or **advanced** understanding of **Python** and data manipulation.

## Setting up the environment

To run this notebook, you need to have the following prerequisites:

- **Python 3.10+**

Additionally, you need to install the `socio4health` and `pandas` packages, which can be done using `pip`:



In [1]:
!pip install socio4health pandas -q




If you run this notebook in **Google Colab**, you also need to run the following command to access your files stored in **Google Drive**:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Import Libraries

The `socio4health` library provides the `Extractor` class for data extraction and the `Harmonizer` class for harmonizing the retrieved data. We will also use `pandas` for data manipulation. The `BraColnamesEnum` and `BraColspecsEnum` enums imported below are used in the Brazil use case.


In [2]:

from socio4health import Extractor
from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum


## Use case 1: Extracting data from Colombia

To extract data from Colombia, we will use the `Extractor` class from the `socio4health` library. The `Extractor` class provides methods to retrieve data from various sources, including online databases and local files. In this example, we will extract the Large Integrated Household Survey (Gran Encuesta Integrada de Hogares, **GEIH**) 2022 dataset from the website of the Colombian National Administrative Department of Statistics (**DANE**).

In this example, the `Extractor` class is configured with the following parameters:
- `input_path`: The `URL` or local path to the data source.
- `down_ext`: A list of file extensions to download. This can include `.CSV`, `.csv`, `.zip`, etc.
- `sep`: The separator used in the data files (e.g., `;` for semicolon-separated values).
- `output_path`: The local path where the extracted data will be saved.
- `depth`: The depth of the directory structure to traverse when downloading files. A depth of `0` means only the files in the specified directory will be downloaded.
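
To illustrate why `sep` has to match the delimiter used in the data files, the following minimal sketch parses a toy semicolon-separated string with `pandas.read_csv`. The sample values are made up for illustration and the snippet does not call `socio4health`; how `Extractor` applies `sep` internally is not shown here.

```python
import io

import pandas as pd

# Toy semicolon-separated content standing in for a downloaded survey file.
raw = "PERIODO;HOGAR;FEX_C18\n20220104;1;1432.46\n20220104;2;1088.79\n"

# With the matching separator, each field becomes its own column.
parsed = pd.read_csv(io.StringIO(raw), sep=";")

# With the default comma separator, every row collapses into a single column.
collapsed = pd.read_csv(io.StringIO(raw))

print(parsed.shape)     # (2, 3)
print(collapsed.shape)  # (2, 1)
```

A single-column result like the second one is a common sign that the separator does not match the file.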


In [3]:
col_online_extractor = Extractor(
    input_path="https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata",
    down_ext=['.CSV', '.csv', '.zip'],
    sep=';',
    output_path="../data",
    depth=0,
)

After the instance is set up, we can call the `s4h_extract` method to download and extract the data. The method returns a list of `pandas` DataFrames containing the extracted data.

In [4]:
col_dfs = col_online_extractor.s4h_extract()

2025-09-24 12:24:17,077 - INFO - ----------------------
2025-09-24 12:24:17,078 - INFO - Starting data extraction...
2025-09-24 12:24:17,079 - INFO - Extracting data in online mode...
2025-09-24 12:24:17,080 - INFO - Scraping URL: https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata with depth 0
2025-09-24 12:24:21,753 - INFO - Spider completed successfully for URL: https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata
2025-09-24 12:24:21,755 - INFO - Downloading files to: ../data
Downloading files:   0%|          | 0/12 [00:00<?, ?it/s]2025-09-24 12:24:25,589 - INFO - Successfully downloaded: GEIH_Enero_2022_Marco_2018.zip
Downloading files:   8%|▊         | 1/12 [00:03<00:41,  3.81s/it]2025-09-24 12:24:29,405 - INFO - Successfully downloaded: GEIH_Febrero_2022_Marco_2018.zip
Downloading files:  17%|█▋        | 2/12 [00:07<00:38,  3.81s/it]2025-09-24 12:24:32,439 - INFO - Successfully downloaded: GEIH_Marzo_2022_Marco_2018.zip
Downloading files:  25%|██▌    

In [5]:
col_dfs[0].head()

|  | PERIODO | DIRECTORIO | SECUENCIA_P | ORDEN | HOGAR | P7495 | P7500S1 | P7500S1A1 | P7500S2 | P7500S2A1 | ... | P3371S1 | P3371S2 | P3371S3 | P3371S4 | P3372 | P3372S1 | FEX_C18 | PER | REGIS | filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20220104 | 5000000 | 1 | 1 | 1 | 2 |  |  |  |  | ... |  |  |  |  | 2 |  | 1432.4633227 | 2022 | 90 | c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ... |
| 1 | 20220104 | 5000000 | 1 | 2 | 1 | 2 |  |  |  |  | ... |  |  |  |  | 2 |  | 1432.4633227 | 2022 | 90 | c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ... |
| 2 | 20220104 | 5000000 | 1 | 6 | 1 | 2 |  |  |  |  | ... |  |  |  |  | 2 |  | 1432.4633227 | 2022 | 90 | c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ... |
| 3 | 20220104 | 5000001 | 1 | 1 | 1 | 2 |  |  |  |  | ... |  |  |  |  | 2 |  | 1088.7962663 | 2022 | 90 | c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ... |
| 4 | 20220104 | 5000001 | 1 | 2 | 1 | 2 |  |  |  |  | ... |  |  |  |  | 2 |  | 1088.7962663 | 2022 | 90 | c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ... |
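
Because `s4h_extract` returns a list, it can help to get an overview of every table before picking one. The helper below is a minimal sketch, assuming the returned objects are in-memory `pandas` DataFrames as described above and that each one carries the `filename` column visible in the preview. `summarize_frames` and the stand-in data are hypothetical and not part of `socio4health`.

```python
import pandas as pd


def summarize_frames(dfs):
    """Return one row per DataFrame: position, size, and source file."""
    rows = []
    for i, df in enumerate(dfs):
        # Take the source file from the `filename` column when it is available.
        has_name = "filename" in df.columns and len(df) > 0
        source = df["filename"].iloc[0] if has_name else None
        rows.append(
            {"index": i, "n_rows": len(df), "n_cols": df.shape[1], "filename": source}
        )
    return pd.DataFrame(rows)


# Stand-in data; in the notebook you would pass `col_dfs` instead.
demo_dfs = [
    pd.DataFrame({"HOGAR": [1, 1, 2], "filename": ["a.csv"] * 3}),
    pd.DataFrame({"HOGAR": [1], "P3372": [2], "filename": ["b.csv"]}),
]
summary = summarize_frames(demo_dfs)
print(summary)
```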


## Use case 2: Extracting data from Brazil

We download the Brazilian data from the Brazilian Institute of Geography and Statistics (**IBGE**) website, again using the `Extractor` class. In this case, we extract the quarterly microdata of the Continuous National Household Sample Survey (**PNADC**) for the year 2024.



<div style="border-left: 4px solid #e74c3c; background: #fdecea; color: #222; padding: 0.5em 1em; margin: 1em 0; display: flex; align-items: center;">
  <span style="font-size: 20px; margin-right: 10px;">⚠️</span>
  <div>
    <strong>Important:</strong> The <code>is_fwf</code> parameter is set to <code>True</code>, which indicates that the data files are in fixed-width format. In that case, the <code>colnames</code> and <code>colspecs</code> parameters must be provided. In this example, they are set to the corresponding available enums for <strong>PNADC</strong> data, which define the column names and specifications for the dataset.
    See more details in
    <a href="https://harmonize-tools.github.io/socio4health/socio4health.enums.html#module-socio4health.enums.data_info_enum" target="_blank">
      socio4health.enums.data_info_enum documentation
    </a>.
  </div>
</div>
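
To make the idea of column names and column specifications concrete, here is a minimal sketch that parses a toy fixed-width string with `pandas.read_fwf`. The three-column layout is hypothetical and much smaller than the real PNADC specification, and whether the `socio4health` enums store their values in exactly this tuple format is an assumption.

```python
import io

import pandas as pd

# Toy fixed-width records: 4 characters for year, 1 for quarter, 2 for state code.
raw = "2024111\n2024112\n"

# Hypothetical layout for this toy file (not the real PNADC specification).
colnames = ["Ano", "Trimestre", "UF"]
colspecs = [(0, 4), (4, 5), (5, 7)]  # half-open [start, end) character ranges

# header=None because the file has no header row; names supplies the labels.
df = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=colnames, header=None)
print(df)
```

Fixed-width files have no delimiter, so the character ranges are the only way to tell where one field ends and the next begins.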

In [6]:
bra_online_extractor = Extractor(
    input_path="https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/",
    down_ext=['.txt', '.zip'],
    is_fwf=True,
    colnames=BraColnamesEnum.PNADC.value,
    colspecs=BraColspecsEnum.PNADC.value,
    output_path="../data",
    depth=0,
)

bra_dfs = bra_online_extractor.s4h_extract()


2025-09-24 12:30:08,711 - INFO - ----------------------
2025-09-24 12:30:08,713 - INFO - Starting data extraction...
2025-09-24 12:30:08,713 - INFO - Extracting data in online mode...
2025-09-24 12:30:08,715 - INFO - Scraping URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/ with depth 0
2025-09-24 12:30:13,261 - INFO - Spider completed successfully for URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/
2025-09-24 12:30:13,264 - INFO - Downloading files to: ../data
Downloading files:   0%|          | 0/4 [00:00<?, ?it/s]2025-09-24 12:30:28,678 - INFO - Successfully downloaded: PNADC_012024_20250815.zip
Downloading files:  25%|██▌       | 1/4 [00:15<00:46, 15.41s/it]2025-09-24 12:41:31,509 - INFO - Successfully downloaded: PNADC_022024_20250815.zip
Downloading files:  50%|█████     | 2/4 [11:18<13:12, 396.25s/it]2025-09-24 12:

In [7]:
bra_dfs[0].head()

|  | Ano | Trimestre | UF | Capital | RM_RIDE | UPA | Estrato | V1008 | V1014 | V1016 | ... | V1028192 | V1028193 | V1028194 | V1028195 | V1028196 | V1028197 | V1028198 | V1028199 | V1028200 | filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | 1 | 11 | 11 |  | 110000016 | 1110011 | 3 | 11 | 1 | ... | 242.37393247 | 0.0 | 0.0 | 132.86482247 | 252.85458864 | 271.03799675 | 122.61081652 | 125.78602243 | 113.09511303 | a7db871d_PNADC_012024.txt |
| 1 | 2024 | 1 | 11 | 11 |  | 110000016 | 1110011 | 6 | 11 | 1 | ... | 405.66107457 | 0.0 | 0.0 | 205.06572241 | 410.23613176 | 437.83686366 | 190.08927267 | 200.15696949 | 182.15329508 | a7db871d_PNADC_012024.txt |
| 2 | 2024 | 1 | 11 | 11 |  | 110000016 | 1110011 | 6 | 11 | 1 | ... | 405.66107457 | 0.0 | 0.0 | 205.06572241 | 410.23613176 | 437.83686366 | 190.08927267 | 200.15696949 | 182.15329508 | a7db871d_PNADC_012024.txt |
| 3 | 2024 | 1 | 11 | 11 |  | 110000016 | 1110011 | 8 | 11 | 1 | ... | 485.38386591 | 0.0 | 0.0 | 242.53160028 | 474.75504741 | 520.88948037 | 223.17316781 | 229.81795045 | 213.22589782 | a7db871d_PNADC_012024.txt |
| 4 | 2024 | 1 | 11 | 11 |  | 110000016 | 1110011 | 8 | 11 | 1 | ... | 485.38386591 | 0.0 | 0.0 | 242.53160028 | 474.75504741 | 520.88948037 | 223.17316781 | 229.81795045 | 213.22589782 | a7db871d_PNADC_012024.txt |
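
The extraction log above reports four downloaded files, so the returned list may contain more than one table. If you want a single table for analysis, one option is to stack them with `pandas.concat`. This is a minimal sketch on stand-in data, assuming the tables are in-memory `pandas` DataFrames that share the same columns; the toy values are made up.

```python
import pandas as pd

# Stand-in quarterly tables; in the notebook you would use the items of `bra_dfs`.
q1 = pd.DataFrame({"Ano": [2024, 2024], "Trimestre": [1, 1], "UF": [11, 11]})
q2 = pd.DataFrame({"Ano": [2024], "Trimestre": [2], "UF": [11]})

# ignore_index=True rebuilds a continuous row index across the stacked tables.
combined = pd.concat([q1, q2], ignore_index=True)

# Count rows per quarter to confirm that every input table is represented.
print(combined["Trimestre"].value_counts().sort_index())
```

With the real PNADC tables this can require a lot of memory, so it may be worth selecting only the columns you need first.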



## Use case 3: Extracting data from Peru

Peruvian data is extracted from the National Institute of Statistics and Informatics (**INEI**) website. In this case, we are extracting the National Household Survey (**ENAHO**) for the year 2022. Here `input_path` points directly to a `.zip` file, so the extractor skips scraping and downloads the archive, as the log below shows. The `down_ext` parameter is set to `.csv` and `.zip`. Unlike the Colombian example, no `sep` argument is passed, so the `Extractor` default is used.

In [8]:
per_online_extractor = Extractor(
    input_path="https://www.inei.gob.pe/media/DATOS_ABIERTOS/ENAHO/DATA/2022.zip",
    down_ext=['.csv', '.zip'],
    output_path="../data",
    depth=0,
)

per_dfs = per_online_extractor.s4h_extract()

2025-09-24 12:50:18,463 - INFO - ----------------------
2025-09-24 12:50:18,464 - INFO - Starting data extraction...
2025-09-24 12:50:18,465 - INFO - Extracting data in online mode...
2025-09-24 12:50:18,466 - INFO - Detected direct file download URL - skipping scraping
2025-09-24 12:50:18,467 - INFO - Downloading large file (2022.zip)...
0.00B [00:00, ?B/s]2025-09-24 12:57:51,195 - INFO - Successfully downloaded: 2022.zip
0.00B [07:32, ?B/s]
2025-09-24 12:57:51,201 - INFO - Processing (depth 0): 2022.zip
2025-09-24 12:58:01,073 - INFO - Extracted: a0643d91_784-Modulo01_Enaho01-2022-100.csv
2025-09-24 12:58:01,083 - INFO - Extracted: a0643d91_784-Modulo02_ENAHO-TABLA-CIUO-88.csv
2025-09-24 12:58:01,090 - INFO - Extracted: a0643d91_784-Modulo02_ENAHO-TABLA-CNO-2015.csv
2025-09-24 12:58:01,112 - INFO - Extracted: a0643d91_784-Modulo02_Enaho01-2022-200.csv
2025-09-24 12:58:01,261 - INFO - Extracted: a0643d91_784-Modulo03_Enaho01a-2022-300.csv
2025-09-24 12:58:01,265 - INFO - Extracted: a0

In [9]:
per_dfs[0].head()

|  | AÑO | MES | CONGLOME | VIVIENDA | HOGAR | UBIGEO | DOMINIO | ESTRATO | P612N | P612 | ... | P612H | TICUEST01 | D612G | D612H | I612G | I612H | FACTOR07 | NCONGLOME | SUB_CONGLOME | filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | 1 | 5030 | 8 | 11 | 10201 | 7 | 4 | 1 | 2 | ... |  | 2 |  |  |  |  | 106.890243530273 | 6618 | 0 | a0643d91_784-Modulo18_Enaho01-2022-612.csv |
| 1 | 2022 | 1 | 5030 | 8 | 11 | 10201 | 7 | 4 | 2 | 1 | ... | 1502.0 | 2 |  | 1526.17028808594 |  | 152.617034912109 | 106.890243530273 | 6618 | 0 | a0643d91_784-Modulo18_Enaho01-2022-612.csv |
| 2 | 2022 | 1 | 5030 | 8 | 11 | 10201 | 7 | 4 | 3 | 2 | ... |  | 2 |  |  |  |  | 106.890243530273 | 6618 | 0 | a0643d91_784-Modulo18_Enaho01-2022-612.csv |
| 3 | 2022 | 1 | 5030 | 8 | 11 | 10201 | 7 | 4 | 4 | 1 | ... | 120.0 | 2 |  | 121.931053161621 |  | 10.1609210968018 | 106.890243530273 | 6618 | 0 | a0643d91_784-Modulo18_Enaho01-2022-612.csv |
| 4 | 2022 | 1 | 5030 | 8 | 11 | 10201 | 7 | 4 | 5 | 2 | ... |  | 2 |  |  |  |  | 106.890243530273 | 6618 | 0 | a0643d91_784-Modulo18_Enaho01-2022-612.csv |
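
Note that the first table in the list comes from the `Modulo18` file, while the log lists `Modulo01` first, so list position is not a reliable way to find a given module. The helper below is a minimal sketch that selects tables by the `filename` column instead, assuming in-memory `pandas` DataFrames. `select_by_filename` and the stand-in data (including the `demo_` file names) are hypothetical and not part of `socio4health`.

```python
import pandas as pd


def select_by_filename(dfs, keyword):
    """Return the DataFrames whose `filename` column contains `keyword`."""
    selected = []
    for df in dfs:
        # Skip tables that cannot be matched on a source file name.
        if "filename" not in df.columns or df.empty:
            continue
        # regex=False treats the keyword as plain text, not as a pattern.
        if df["filename"].astype(str).str.contains(keyword, regex=False).any():
            selected.append(df)
    return selected


# Stand-in data; in the notebook you would pass `per_dfs` instead.
demo_dfs = [
    pd.DataFrame({"P612N": [1, 2], "filename": ["demo_Modulo18_Enaho01-2022-612.csv"] * 2}),
    pd.DataFrame({"MES": [1], "filename": ["demo_Modulo01_Enaho01-2022-100.csv"]}),
]
matches = select_by_filename(demo_dfs, "Modulo18")
print(len(matches))
```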



## Further steps
* Harmonize the extracted data using the `Harmonizer` class from the `socio4health` library. You can follow the [Harmonization tutorial](https://harmonize-tools.github.io/socio4health/notebooks/harmonization.html) for more details.