# NaN Values Treatment Strategy

We fill NaN values with 0 because of the skip logic built into the dietary survey:

## Reasoning
- Some questions were skipped when an earlier question was answered "No"
- For example:
    - If a child did not drink water (e02_agua = 0)
    - Then the question about sugar in water (e04_agua_com_acucar) was not asked
    - These skipped questions appear as NaN values
    - A child who did not drink water also did not drink water with sugar, so 0 is the correct value
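
Before filling, it may be worth checking that the NaN values really come from skipped questions. The sketch below is only an illustration on made-up data: the column names come from the survey, but the toy values and the assumption that the columns are already coded as 0/1 are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), already coded as 1 = yes, 0 = no.
toy = pd.DataFrame({
    "e02_agua":            [1, 0, 1, 0],
    "e04_agua_com_acucar": [0, np.nan, np.nan, np.nan],
})

missing = toy["e04_agua_com_acucar"].isna()

# NaN explained by the skip: the parent question was answered "no".
skipped = int((missing & (toy["e02_agua"] == 0)).sum())

# NaN not explained by the skip: the child drank water, yet the follow-up is empty.
unexplained = int((missing & (toy["e02_agua"] == 1)).sum())

print(f"skip-induced NaN: {skipped}, unexplained NaN: {unexplained}")
```

If the unexplained count is not zero on the real data, filling those rows with 0 treats a genuinely missing answer as a "No", which is a stronger assumption than the skip logic alone supports.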

## Implementation
In the cell below, we:
1. Replace 'Sim' with 1 (binary positive)
2. Replace 'Não' with 0 (binary negative)
3. Fill the remaining NaN values with 0

This approach:
- Puts every response column on the same 0/1 coding
- Treats explicit 'Não' responses and skip-induced NaN values identically
- Leaves no missing values in the dataset
- Makes the dataset ready for binary classification analysis
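
The three steps can be sketched on a small example. The toy values are hypothetical, and the final `astype(int)` is an optional extra that is not part of the notebook cell: in the notebook output, `e04_agua_com_acucar` is listed as `float64` and prints `0.0`, and an integer cast is one way to give all columns the same dtype.

```python
import pandas as pd

# Toy data (hypothetical values); dtype=object keeps the raw strings
# and the missing value exactly as given.
toy = pd.DataFrame(
    {
        "e02_agua": ["Sim", "Não", "Sim"],
        "e04_agua_com_acucar": ["Não", None, "Sim"],
    },
    dtype=object,
)

binary = toy.replace({"Sim": 1})     # step 1: 'Sim' -> 1
binary = binary.replace({"Não": 0})  # step 2: 'Não' -> 0
binary = binary.fillna(0)            # step 3: remaining NaN -> 0

# Optional extra step (an assumption, not in the original cell):
# cast to int so that every column shares one integer dtype.
binary = binary.astype(int)

print(binary)
print(binary.dtypes)
```

The cast only works if every remaining value is numeric, so on the real dataset it would apply to the response columns after all text answers have been replaced.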


In [1]:
import pandas as pd

# Read the CSV file
df = pd.read_csv('/Users/marcelosilva/Desktop/clustering(0-4)/3-E-Aval/DSWOUTNS.csv')

# Replace 'Sim' with 1 first
df = df.replace({'Sim': 1})

# Replace 'Não' with 0
df = df.replace({'Não': 0})

# Fill all remaining NaN values with 0
df = df.fillna(0)

# Save the modified dataset
df.to_csv('/Users/marcelosilva/Desktop/clustering(0-4)/3-E-Aval/DSBIV.CSV', index=False)

# Print the first few rows to verify the changes
print(df.head())

       id_anon  e01_leite_peito  e02_agua  e04_agua_com_acucar  e05_cha  \
0  10951000402                0         1                  0.0        0   
1  10951000403                0         1                  0.0        0   
2  10951003402                0         1                  0.0        0   
3  10951003403                0         1                  0.0        0   
4  10951009202                1         1                  0.0        0   

   e06_leite_vaca_po  e07_leite_vaca_liquido  e08_leite_soja_po  \
0                  0                       1                  0   
1                  0                       0                  0   
2                  0                       1                  0   
3                  0                       0                  0   
4                  0                       0                  0   

   e09_leite_soja_liquido  e10_formula_infantil  ...  e31_salgadinhos  \
0                       0                     0  ...                0   


In [2]:
df_bi = pd.read_csv('/Users/marcelosilva/Desktop/clustering(0-4)/3-E-Aval/DSBIV.CSV')

import pandas_utils as pdu

pdu.custom_info(df_bi)

DataFrame Info with Completeness Analysis:
---------------------------------------------------------------------------
Total Rows: 14558
Total Columns: 49

Column Details:
---------------------------------------------------------------------------
id_anon                 14558 non-null int64      (100.0% complete)
e01_leite_peito         14558 non-null int64      (100.0% complete)
e02_agua                14558 non-null int64      (100.0% complete)
e04_agua_com_acucar     14558 non-null float64    (100.0% complete)
e05_cha                 14558 non-null int64      (100.0% complete)
e06_leite_vaca_po       14558 non-null int64      (100.0% complete)
e07_leite_vaca_liquido    14558 non-null int64      (100.0% complete)
e08_leite_soja_po       14558 non-null int64      (100.0% complete)
e09_leite_soja_liquido    14558 non-null int64      (100.0% complete)
e10_formula_infantil    14558 non-null int64      (100.0% complete)
e11_suco                14558 non-null int64      (100.0% complete)
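
The `pandas_utils` module is not shown in this notebook, so `custom_info` cannot be reproduced from it directly. As a rough stand-in, the sketch below computes the same kind of summary (non-null count, dtype, and percentage complete per column) with plain pandas. The function name `completeness_report` and the toy data are hypothetical, and the real helper may work differently.

```python
import numpy as np
import pandas as pd


def completeness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the non-null count, dtype, and percentage of non-null values per column."""
    non_null = df.notna().sum()
    return pd.DataFrame({
        "non_null": non_null,
        "dtype": df.dtypes.astype(str),
        "pct_complete": (100 * non_null / len(df)).round(1),
    })


# Toy data (hypothetical values): column "b" is half missing.
toy = pd.DataFrame({"a": [1, 0, 1, 1], "b": [1.0, np.nan, 0.0, np.nan]})

report = completeness_report(toy)
print(f"Total Rows: {len(toy)}")
print(f"Total Columns: {toy.shape[1]}")
print(report)
```

Running a report like this before and after the fill step shows which columns contained NaN values in the first place.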
