# Selecting Only CABO FRIO (City) Data

The raw public dengue data from the **SINAN Portal** includes information for **all states and municipalities** of Brazil, along with their specific statistics.  
However, to focus solely on **CABO FRIO**, a **data filtering step** is required.

This notebook presents a step-by-step process showing how to extract only the CABO FRIO records and save them into a separate file.  
This makes the data easier to understand and significantly lighter for further processing.

---
### Understanding the Geographic Identifiers

The state and city identifiers in the raw dataset (`SG_UF` and `ID_MN_RESI`) are based on the codes assigned by IBGE (Brazilian Institute of Geography and Statistics).

- IBGE uses 7-digit identifiers for cities.
- SINAN, however, uses only the first 6 digits of those codes, dropping the final check digit.
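
The relationship between the two code formats can be sketched as a small helper. This is only an illustration: `ibge_to_sinan` is a hypothetical name that is not part of this notebook, and the 7-digit value `3300704` is assumed here to be the full IBGE code for Cabo Frio.

```python
def ibge_to_sinan(ibge_code: str) -> str:
    """Return the 6-digit SINAN form of a 7-digit IBGE municipality code."""
    code = str(ibge_code).strip()
    if len(code) != 7 or not code.isdigit():
        raise ValueError(f"expected a 7-digit IBGE code, got {ibge_code!r}")
    # SINAN keeps only the first six digits
    return code[:6]


# Assumption: 3300704 is the full IBGE code for Cabo Frio
sinan_code = ibge_to_sinan('3300704')
print(sinan_code)
```

Keeping the code as a string, rather than an integer, matches how `ID_MN_RESI` is compared later in this notebook.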


| Region                 | Field Name   | Code       |
| :--------------------- | :----------- | :--------- |
| State (Rio de Janeiro) | `SG_UF`      | **33**     |
| City (Cabo Frio)       | `ID_MN_RESI` | **330070** |

In [10]:
CABO_FRIO_ID = '330070'

---
### Listing the Available Files

In the previous notebook, raw dengue data for the entire country (2015–2025) was collected and saved under `data/raw/dengue/`.
Each file follows the naming convention `DENGBRxx.csv`, where `xx` is the last two digits of the year.

In [11]:
RAW_FILES = [
    'DENGBR24.csv',
    'DENGBR23.csv',
    'DENGBR22.csv',
    'DENGBR21.csv',
    'DENGBR20.csv',
    'DENGBR19.csv',
    'DENGBR18.csv',
    'DENGBR17.csv',
    'DENGBR16.csv',
    'DENGBR15.csv'
]
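
As an alternative to typing the list by hand, the same names can be derived from the naming convention. The sketch below is not part of the original notebook; `build_raw_file_names` and `GENERATED_FILES` are hypothetical names, and the year range 2015 to 2024 is taken from the list above.

```python
def build_raw_file_names(first_year: int, last_year: int) -> list[str]:
    """Build DENGBRxx.csv names, newest year first, from a range of years."""
    return [
        f'DENGBR{year % 100:02d}.csv'          # xx = last two digits of the year
        for year in range(last_year, first_year - 1, -1)
    ]


# Assumption: the years covered match the hand-written list (2015 to 2024)
GENERATED_FILES = build_raw_file_names(2015, 2024)
print(GENERATED_FILES)
```

Generating the names this way makes it harder to skip a year by accident, but it only works while every file follows the `DENGBRxx.csv` pattern.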

---

### Selecting Only the Relevant Columns

For this analysis, only a few columns are relevant.
These are specified in the `COLUMNS` list below, and `DATA_TYPES` fixes the type pandas uses for the code-like columns.

In [12]:
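COLUMNS = [
    'ID_MN_RESI',  # municipality of residence (6-digit code)
    'DT_SIN_PRI',  # date of first symptoms
    'NU_IDADE_N',  # age, in SINAN's coded format
    'CS_SEXO',     # sex
    'CS_RACA',     # race/color
    'CLASSI_FIN',  # final case classification
    'EVOLUCAO',    # case outcome
    'DT_OBITO'     # date of death
]

# Read code-like columns as strings so pandas does not infer numeric types
# (ID_MN_RESI is later compared against the string CABO_FRIO_ID)
DATA_TYPES = {
    'ID_MN_RESI': str,
    'CLASSI_FIN': str,
    'EVOLUCAO': str
}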
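
To illustrate why the explicit types matter, the sketch below reads the same tiny CSV twice, once with inferred types and once with `ID_MN_RESI` forced to `str`. The sample rows are made up for the example and do not come from the SINAN files.

```python
import io

import pandas as pd

# Made-up sample rows, for illustration only
SAMPLE_CSV = "ID_MN_RESI,CLASSI_FIN,EVOLUCAO\n330070,10,1\n330455,5,1\n"
CABO_FRIO_ID = '330070'

# Without dtype, pandas infers a numeric type for the city code
inferred = pd.read_csv(io.StringIO(SAMPLE_CSV))
# With dtype, the city code stays text and can be compared to a string
as_text = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype={'ID_MN_RESI': str})

matches_inferred = int((inferred['ID_MN_RESI'] == CABO_FRIO_ID).sum())
matches_as_str = int((as_text['ID_MN_RESI'] == CABO_FRIO_ID).sum())

print(inferred['ID_MN_RESI'].dtype, matches_inferred)
print(as_text['ID_MN_RESI'].dtype, matches_as_str)
```

With inferred types the string comparison finds no rows, so the filter would silently return an empty result. Reading the column as `str` avoids that.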

---
### Filtering and Saving CABO FRIO Data

Now, for each raw dengue dataset, we will:

1. Load the data using pandas.

2. Select only the relevant columns.

3. Filter the rows where the city corresponds to CABO FRIO (`ID_MN_RESI == '330070'`, compared as a string).

4. Save the filtered data to a new .csv file for future analysis.

In [13]:
# Repeated here so the cell runs on its own
import pandas as pd

for file in RAW_FILES:

    INPUT_FILE_NAME = '../data/raw/dengue/' + file

    # Build the OUTPUT file name (e.g. DENGBR24.csv -> DENGBR24_processed.csv)
    # Note: the ../data/processed/dengue/ directory must already exist
    OUTPUT_FILE_NAME = file.replace('.csv', '_processed.csv')
    OUTPUT_FILE_NAME = '../data/processed/dengue/' + OUTPUT_FILE_NAME

    # Loading only the relevant columns into a pandas DataFrame
    df = pd.read_csv(
        INPUT_FILE_NAME,
        sep=',',                     # Original separator
        usecols=COLUMNS,
        dtype=DATA_TYPES,
        low_memory=False             # To avoid dtype warning
    )

    # Filtering for Cabo Frio only (string comparison, see DATA_TYPES)
    df_filtered = df[df['ID_MN_RESI'] == CABO_FRIO_ID].copy()

    # Saving the filtered DataFrame to a new CSV file
    # (the output uses ';' as separator, unlike the comma-separated input)
    df_filtered.to_csv(OUTPUT_FILE_NAME, index=False, sep=';')
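
If the nationwide files are too large to load comfortably at once, the same filter can be applied in chunks. The following is a minimal sketch of that variation, not the approach used in this notebook: `filter_city_in_chunks` is a hypothetical helper, the chunk size is arbitrary, and the sample rows are made up for the demonstration.

```python
import io

import pandas as pd


def filter_city_in_chunks(source, city_id, columns, dtypes, chunksize=100_000):
    """Read a CSV in chunks and keep only the rows for one municipality."""
    kept = []
    reader = pd.read_csv(
        source, sep=',', usecols=columns, dtype=dtypes, chunksize=chunksize
    )
    for chunk in reader:
        # Same string comparison as in the loop above, applied per chunk
        kept.append(chunk[chunk['ID_MN_RESI'] == city_id])
    if not kept:
        # No data rows at all: return an empty frame with the expected columns
        return pd.DataFrame(columns=columns)
    return pd.concat(kept, ignore_index=True)


# Made-up sample rows, for illustration only
sample = io.StringIO(
    "ID_MN_RESI,DT_SIN_PRI,CS_SEXO\n"
    "330070,2024-01-05,F\n"
    "330455,2024-01-06,M\n"
    "330070,2024-02-01,M\n"
)
result = filter_city_in_chunks(
    sample,
    city_id='330070',
    columns=['ID_MN_RESI', 'DT_SIN_PRI', 'CS_SEXO'],
    dtypes={'ID_MN_RESI': str},
    chunksize=2,
)
print(result)
```

The returned DataFrame can then be written with `to_csv(..., index=False, sep=';')` exactly as in the loop above. Only the filtered rows are held in memory across chunks, which is the point of the variation.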

---
### Result

After running this notebook, each dengue dataset will have a filtered version containing only CABO FRIO records, saved as: