# Getting Data About Dengue Across Brazil

The data collection starts from the **SINAN Portal**, chosen as the most reliable source of information for **epidemiological surveillance in general**.

---

### Quick Description of SINAN

The **Notifiable Diseases Information System (SINAN)** aims to collect, transmit, and disseminate data routinely generated by epidemiological surveillance at all three levels of government through a computerized network. This system supports investigation processes and provides input for the analysis of information on diseases and health problems subject to mandatory reporting.  

Currently, the system operates in two active versions: **SINAN Online** and **SINAN Net**.

**Access to the 'Dengue' dataset on the openDataSUS Portal:**  
👉 [https://opendatasus.saude.gov.br/dataset/arboviroses-dengue](https://opendatasus.saude.gov.br/dataset/arboviroses-dengue)

---

### Importing Required Libraries


In [1]:
import os              # For file and directory operations
import requests        # For downloading files from the internet
import zipfile         # For handling ZIP files
import shutil          # For high-level file operations

### Defining Headers and Paths

To avoid being blocked by the server, we define a HEADERS dictionary that mimics a standard browser request.

Then, we set the base URL for the SINAN dengue dataset and specify where the raw data will be stored locally.

In [2]:
# HTTP headers that mimic a browser visit, to avoid potential blocking by the server
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}

BASE_URL = 'https://s3.sa-east-1.amazonaws.com/ckan.saude.gov.br/SINAN/Dengue/csv/'
RAW_DIR = '../data/raw/dengue'

RAW_FILES = [
    'DENGBR24.csv',
    'DENGBR23.csv',
    'DENGBR22.csv',
    'DENGBR21.csv',
    'DENGBR20.csv',
    'DENGBR19.csv',
    'DENGBR18.csv',
    'DENGBR17.csv',
    'DENGBR16.csv',
    'DENGBR15.csv'
]
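
The file names above follow a regular pattern: `DENGBR`, a two-digit year, and the `.csv` extension. As a minimal sketch, the same list could be generated from a year range instead of being typed by hand. The helper name `build_raw_files` is hypothetical and not part of the original notebook.

```python
def build_raw_files(first_year: int, last_year: int) -> list:
    """Return SINAN dengue file names, newest year first (hypothetical helper)."""
    return [
        f"DENGBR{year % 100:02d}.csv"  # e.g. 2024 -> 'DENGBR24.csv'
        for year in range(last_year, first_year - 1, -1)
    ]

generated_files = build_raw_files(2015, 2024)
print(generated_files[0], generated_files[-1])  # DENGBR24.csv DENGBR15.csv
```

Generating the list this way keeps the year range in one place, which may help when a new yearly file is published.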

---
### Automating the Data Collection

The approach used here automates the process of downloading, extracting, and organizing dengue case datasets from the official repository.

Each `.zip` file is downloaded and uncompressed, and its `.csv` file is moved to the raw data directory for further processing.
Temporary files and directories are removed automatically to keep the workspace clean.

In [None]:
def downloadFile(url: str, local_path: str):
    r = requests.get(url, headers=HEADERS, stream=True)  # stream=True to download large files
    try:
        r.raise_for_status()
        with open(local_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
    finally:
        r.close()

def extractFile(zip_path: str, extract_dir: str):
    try: 
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(extract_dir) 
    except zipfile.BadZipFile:
        print("  -> ERROR: The downloaded file is corrupted or is not a valid ZIP.")
    except Exception as e:
        print(f"  -> ERROR during extraction: {e}")

def moveFileCSV(base_path: str, final_filename: str):
    extracted_csv_name = None
    for item in os.listdir(base_path):
        if item.endswith('.csv'):
            extracted_csv_name = item
            break
            
    if extracted_csv_name:
        source = os.path.join(base_path, extracted_csv_name)
        destination = os.path.join(RAW_DIR, final_filename) 
        
        shutil.move(source, destination)
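    else:
        # Report the problem instead of failing silently
        print(f"  -> WARNING: no .csv file found in {base_path}")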
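
The extract-then-move steps above could also be collapsed into one, by copying the `.csv` member straight out of the archive. The sketch below is an alternative, not the notebook's method. The function name `extract_csv_member` is hypothetical, and it assumes each archive holds a single relevant `.csv` file.

```python
import shutil
import zipfile

def extract_csv_member(zip_path: str, destination: str) -> bool:
    """Copy the first .csv member of a ZIP archive to `destination`.

    Returns True if a CSV member was found and written, False otherwise.
    (Hypothetical helper, not part of the original notebook.)
    """
    with zipfile.ZipFile(zip_path, "r") as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                # Stream the member to disk without extracting the whole archive
                with archive.open(name) as src, open(destination, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return True
    return False
```

With this approach no temporary extraction directory is needed, at the cost of handling the archive's internal layout explicitly.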

Now we loop through all the files we need, from 2015 to 2024, and apply the functions defined above:

In [None]:

os.makedirs(RAW_DIR, exist_ok=True)  # make sure the raw data directory exists

for file in RAW_FILES:
    DOWNLOAD_FILE_NAME = BASE_URL + file + '.zip'
    ZIP_FILE_PATH = os.path.join(RAW_DIR, file + '.zip')
    TEMP_EXTRACT_PATH = os.path.join(RAW_DIR, file.replace('.csv', ''))

    try:
        downloadFile(DOWNLOAD_FILE_NAME, local_path=ZIP_FILE_PATH)
        extractFile(ZIP_FILE_PATH, extract_dir=TEMP_EXTRACT_PATH)
        moveFileCSV(base_path=TEMP_EXTRACT_PATH, final_filename=file)

        shutil.rmtree(TEMP_EXTRACT_PATH) 
        os.remove(ZIP_FILE_PATH)

    except requests.exceptions.HTTPError as e:
        print(f"HTTP ERROR while downloading {file}: check the URL or its availability. Error: {e}")
    except Exception as e:
        print(f"Unexpected ERROR while processing {file}: {e}")
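
In the loop above, cleanup runs only when every step succeeds, so a failed download or extraction can leave a partial `.zip` or a temporary folder behind. One possible way to handle this is a `finally` clause, sketched below. The names `cleanup_temp` and `run_with_cleanup` are hypothetical and not part of the original notebook.

```python
import os
import shutil

def cleanup_temp(zip_path: str, extract_dir: str) -> None:
    """Remove the temporary ZIP and extraction folder, if present (hypothetical helper)."""
    shutil.rmtree(extract_dir, ignore_errors=True)  # no error if the folder is missing
    if os.path.exists(zip_path):
        os.remove(zip_path)

def run_with_cleanup(step, zip_path: str, extract_dir: str) -> None:
    """Run `step()` and always clean up afterwards, even if it raises."""
    try:
        step()
    finally:
        cleanup_temp(zip_path, extract_dir)
```

Here `step` could be a small function that performs the download, extraction, and move for one file. Any exception still propagates to the caller, so the existing error messages would keep working.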

---
### Directory Structure

After a successful run, every `.csv` file for the last 10 years (2015 to 2024) is available under `../data/raw/dengue`, one file per year, named `DENGBR15.csv` through `DENGBR24.csv`.
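
A quick check can confirm that the expected files actually landed in the raw data directory. The sketch below uses a hypothetical helper, `missing_files`, which is not part of the original notebook. It only compares file names and does not validate file contents.

```python
import os

def missing_files(directory: str, expected: list) -> list:
    """Return the expected file names that are absent from `directory`."""
    present = set(os.listdir(directory)) if os.path.isdir(directory) else set()
    return [name for name in expected if name not in present]
```

Calling `missing_files(RAW_DIR, RAW_FILES)` after the loop should return an empty list when every download succeeded. Any names it returns are the years to retry.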