# Rationale for Variable Exclusion from Dietary Pattern Analysis

## Overview
In our analysis of dietary patterns among Brazilian children aged 0-4 years (n=14,558), two variables were excluded from the clustering input on methodological and conceptual grounds: neither one reflects the child's nutritional development.

---

## Excluded Variables

### `e03_filtrada_fervida` (Water filtration/boiling status)

**Primary rationale:** Infrastructure confounding rather than dietary behavior

This variable captures whether consumed water was filtered or boiled, which fundamentally reflects:
- **Household infrastructure** and access to water treatment systems
- **Socioeconomic status** rather than child-specific dietary patterns
- **Regional water quality** and sanitation infrastructure
- **Parental education** about water safety practices

**Impact on clustering:** Including this variable would produce clusters that reflect living conditions rather than stages of alimentary development, which would undermine our primary research objective of mapping nutritional maturity patterns.

---

### `e14_manga` (Specific tropical fruit consumption)

**Primary rationale:** Seasonal/regional specificity versus developmental relevance

This variable represents consumption of specific fruits (mango, papaya, guava), which introduces:
- **Seasonal bias** due to harvest periods varying across data collection timeframes
- **Geographic clustering** based on regional fruit availability rather than developmental patterns
- **Redundancy** with the broader `e12_fruta_inteira` variable, which captures the developmentally relevant pattern of whole fruit consumption and motor skill progression
- **Market availability** dependencies unrelated to child nutritional development

**Methodological consideration:** The variable `e13_fruta_vezes` (frequency of fruit consumption) was retained because it captures how often fruit is offered, a feeding frequency pattern that is associated with motor development and acceptance of solid foods.

---

## Analytical Benefits of Exclusion

| **Aspect** | **Benefit** |
|------------|-------------|
| **Conceptual clarity** | Maintains focus on alimentary development rather than environmental factors |
| **Temporal validity** | Reduces seasonal confounding across different data collection periods |
| **Geographic generalizability** | Minimizes regional clustering unrelated to nutritional maturity |
| **Developmental relevance** | Preserves variables that directly reflect feeding skills and dietary transitions |

---

## Dataset Integrity Considerations

### Missing Data Pattern Analysis
Both excluded variables are follow-up questions, so their missing values are structural (the question was skipped) rather than random:
- `e03_filtrada_fervida`: 11.96% missing (conditional on water consumption)
- `e14_manga`: 52.57% missing (conditional on fruit consumption)

The exclusion of these variables eliminates the need for imputation strategies that could introduce additional bias, while retaining the core developmental information captured by related variables (`e02_agua` and `e12_fruta_inteira` respectively).
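
The claim that the gaps are conditional on the parent question can be checked directly. The following is a minimal sketch on hypothetical toy data, assuming the follow-up is skipped exactly when the parent answer is `Não`; the helper name `missing_only_when_parent_is_no` is an illustration and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy rows; the real columns would come from DatasetE.csv.
toy = pd.DataFrame({
    "e02_agua":             ["Sim", "Sim", "Não", "Sim", "Não"],
    "e03_filtrada_fervida": ["Sim", "Não", None,  "Sim", None],
})

def missing_only_when_parent_is_no(df, parent, child, no_value="Não"):
    """True if `child` is missing exactly on the rows where `parent` equals `no_value`."""
    child_missing = df[child].isna()
    parent_no = df[parent].eq(no_value)
    return bool((child_missing == parent_no).all())

structural = missing_only_when_parent_is_no(toy, "e02_agua", "e03_filtrada_fervida")
print("Missingness is structural:", structural)
print(f"{toy['e03_filtrada_fervida'].isna().mean():.2%} missing")
```

In the full dataset the same check is consistent with the counts reported in this notebook: 1,741 children have `e02_agua` equal to `Não`, and 1,741 values of `e03_filtrada_fervida` are missing.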

### Preserved Developmental Indicators
The final dataset maintains comprehensive coverage of key developmental transitions:
- **0-6 months:** Breastfeeding and formula-feeding patterns (`e01_leite_peito`, `e10_formula_infantil`)
- **6-12 months:** Introduction of complementary foods (`e19_mingau`, `e38_farinhas`)
- **12-24 months:** Texture progression and family food integration (`e181-e185` consistency variables)
- **24-48 months:** Dietary diversity and processed food introduction (`e30-e37` ultra-processed categories)
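
Some of the groups above are given as code ranges (for example `e30-e37`). A small helper can pull the matching columns out of a column list. This is only a sketch: it assumes every dietary column starts with `e` followed by its numeric code, read as a whole integer, and `columns_in_code_range` is a hypothetical name that does not appear in the notebook.

```python
import re

def columns_in_code_range(columns, start, end):
    """Return column names whose leading 'e<number>' code lies in [start, end]."""
    selected = []
    for name in columns:
        match = re.match(r"e(\d+)", name)
        if match and start <= int(match.group(1)) <= end:
            selected.append(name)
    return selected

# A few column names taken from the dataset preview in this notebook.
columns = [
    "id_anon", "e01_leite_peito", "e10_formula_infantil",
    "e31_salgadinhos", "e33_refrigerante", "e37_tempero", "e38_farinhas",
]

ultra_processed = columns_in_code_range(columns, 30, 37)
print(ultra_processed)
# ['e31_salgadinhos', 'e33_refrigerante', 'e37_tempero']
```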

---

## Final Dataset Composition

After exclusion, the analysis proceeded with **49 dietary variables** that comprehensively capture:
- **Feeding methods** (breastfeeding, bottle-feeding, cup drinking)
- **Food textures** (liquid, pureed, solid) indicating motor development
- **Nutritional complexity** (single ingredients to mixed family foods)
- **Processing levels** (fresh foods to ultra-processed products)

This curated variable set enables identification of genuine nutritional development clusters while minimizing confounding from non-dietary factors.
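
The responses are stored as the strings `Sim` and `Não`, while most clustering algorithms expect numeric input. One possible preparation step is sketched below on hypothetical toy data. The helper `encode_yes_no` and the 1/0 coding are assumptions for illustration, not the encoding actually used in this analysis.

```python
import pandas as pd

# Hypothetical toy rows using column names from the dataset preview.
toy = pd.DataFrame({
    "id_anon": [1, 2, 3],
    "e01_leite_peito": ["Sim", "Não", "Sim"],
    "e05_cha": ["Não", "Não", "Sim"],
})

def encode_yes_no(df, id_column="id_anon"):
    """Map 'Sim'/'Não' to 1/0 for every column except the identifier.

    Any other value (including missing) becomes <NA> in the nullable Int64 result.
    """
    feature_columns = [c for c in df.columns if c != id_column]
    mapping = {"Sim": 1, "Não": 0}
    encoded = df[feature_columns].apply(lambda col: col.map(mapping))
    return encoded.astype("Int64")

encoded = encode_yes_no(toy)
print(encoded)
```

Keeping unmapped answers as `<NA>` rather than silently coding them as 0 makes leftover categories visible before clustering.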

---

## Methodological Transparency

This exclusion strategy aligns with best practices in developmental nutrition research, where:
1. **Biological relevance** takes precedence over statistical completeness
2. **Developmental appropriateness** guides variable selection
3. **Confounding minimization** improves cluster interpretability
4. **Temporal stability** improves the generalizability of findings

The resulting dataset provides a robust foundation for identifying alimentary maturity patterns across the critical 0-4 year developmental window.

In [7]:
import pandas as pd

# Read the original dataset
df = pd.read_csv('/Users/marcelosilva/Desktop/clustering(0-4)/2-E-Choice/DatasetE.csv')

# Remove the specified columns
columns_to_remove = ['e14_manga', 'e03_filtrada_fervida']
DSWOUTMW = df.drop(columns=columns_to_remove)

# Save the new dataset
DSWOUTMW.to_csv('/Users/marcelosilva/Desktop/clustering(0-4)/3-E-Aval/DSWOUTMW.csv', index=False)

# Display the first few rows and shape of the new dataset
print("Shape of the new dataset:", DSWOUTMW.shape)
DSWOUTMW.head()

Shape of the new dataset: (14558, 50)


Unnamed: 0,id_anon,e01_leite_peito,e02_agua,e04_agua_com_acucar,e05_cha,e06_leite_vaca_po,e07_leite_vaca_liquido,e08_leite_soja_po,e09_leite_soja_liquido,e10_formula_infantil,...,e31_salgadinhos,e32_suco_industrializado,e33_refrigerante,e34_macarrao,e35_biscoito,e36_bala,e37_tempero,e38_farinhas,e39_mamadeira,e40_adocado
0,10951000402,Não,Sim,Não,Não,Não,Sim,Não,Não,Não,...,Não,Não,Sim,Não,Sim,Sim,Não,Não,Sim,Sim
1,10951000403,Não,Sim,Não,Não,Não,Não,Não,Não,Não,...,Não,Não,Sim,Não,Não,Sim,Não,Não,Não,Não
2,10951003402,Não,Sim,Não,Não,Não,Sim,Não,Não,Não,...,Sim,Sim,Sim,Não,Sim,Não,Não,Sim,Não,Não
3,10951003403,Não,Sim,Não,Não,Não,Não,Não,Não,Não,...,Não,Sim,Não,Não,Sim,Não,Não,Não,Não,Sim
4,10951009202,Sim,Sim,Não,Não,Não,Não,Não,Não,Não,...,Sim,Sim,Não,Não,Sim,Não,Sim,Não,Não,Não


In [8]:
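# pandas_utils is assumed to be a project-local helper module (it is not part of pandas)
import pandas_utils as pdu

# Print per-column non-null counts, dtypes, and completeness for the original dataset
pdu.custom_info(df)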

DataFrame Info with Completeness Analysis:
---------------------------------------------------------------------------
Total Rows: 14558
Total Columns: 52

Column Details:
---------------------------------------------------------------------------
id_anon                 14558 non-null int64      (100.0% complete)
e01_leite_peito         14558 non-null object     (100.0% complete)
e02_agua                14558 non-null object     (100.0% complete)
e03_filtrada_fervida    12817 non-null object     (88.04% complete) •
e04_agua_com_acucar     12817 non-null object     (88.04% complete) •
e05_cha                 14558 non-null object     (100.0% complete)
e06_leite_vaca_po       14558 non-null object     (100.0% complete)
e07_leite_vaca_liquido    14558 non-null object     (100.0% complete)
e08_leite_soja_po       14558 non-null object     (100.0% complete)
e09_leite_soja_liquido    14558 non-null object     (100.0% complete)
e10_formula_infantil    14558 non-null object     (100.0% comple
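
The source of `pandas_utils.custom_info` is not shown in this notebook. A comparable completeness report can be built from standard pandas calls. The sketch below uses hypothetical toy data, and `completeness_report` is an illustrative name rather than the helper used above.

```python
import pandas as pd

def completeness_report(df):
    """Summarise non-null counts, dtypes, and percent completeness per column."""
    return pd.DataFrame({
        "non_null": df.notna().sum(),
        "dtype": df.dtypes.astype(str),
        "pct_complete": (df.notna().mean() * 100).round(2),
    })

# Hypothetical toy rows; the real input would be the original DataFrame `df`.
toy = pd.DataFrame({
    "e02_agua": ["Sim", "Sim", "Não", "Sim"],
    "e03_filtrada_fervida": ["Sim", "Não", None, "Sim"],
})

report = completeness_report(toy)
print(report)
```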

In [9]:
# Analysis of the unique responses in each variable
print("🔍 ANALYSIS OF UNIQUE RESPONSES PER VARIABLE")
print("=" * 60)

# Variables to analyze (excluding id_anon)
variables_to_analyze = [col for col in df.columns if col != 'id_anon']

for var in variables_to_analyze:
    print(f"\n📊 {var}:")

    # Unique values and their counts (NaN included)
    value_counts = df[var].value_counts(dropna=False)

    # Show at most the 10 most frequent values
    for value, count in value_counts.head(10).items():
        percentage = (count / len(df)) * 100
        print(f"   '{value}': {count:,} ({percentage:.1f}%)")

    # Flag variables whose listing was truncated
    if len(value_counts) > 10:
        print(f"   ... (showing only the 10 most frequent of {len(value_counts)} unique values)")

    print(f"   Total unique values: {len(value_counts)}")
    print("-" * 40)

print("\n📋 OVERALL SUMMARY:")
print(f"Total rows in the dataset: {len(df):,}")
print(f"Total variables analyzed: {len(variables_to_analyze)}")

🔍 ANALYSIS OF UNIQUE RESPONSES PER VARIABLE

📊 e01_leite_peito:
   'Não': 9,695 (66.6%)
   'Sim': 4,863 (33.4%)
   Total unique values: 2
----------------------------------------

📊 e02_agua:
   'Sim': 12,817 (88.0%)
   'Não': 1,741 (12.0%)
   Total unique values: 2
----------------------------------------

📊 e03_filtrada_fervida:
   'Sim': 8,733 (60.0%)
   'Não': 4,084 (28.1%)
   'nan': 1,741 (12.0%)
   Total unique values: 3
----------------------------------------

📊 e04_agua_com_acucar:
   'Não': 12,697 (87.2%)
   'nan': 1,741 (12.0%)
   'Sim': 120 (0.8%)
   Total unique values: 3
----------------------------------------

📊 e05_cha:
   'Não': 13,851 (95.1%)
   'Sim': 707 (4.9%)
   Total unique values: 2
----------------------------------------

📊 e06_leite_vaca_po:
   'Não': 10,677 (73.3%)
   'Sim': 3,881 (26.7%)
   Total unique values: 2
----------------------------------------

📊 e07_leite_vaca_liquido:
   'Não': 9,073 (62.3%)
   'Sim': 5,485 (37.7%)
   To

```markdown
# Variable Exclusion: `e219a_nao_sabe`

## Rationale

The variable `e219a_nao_sabe` ("Does not know" response option for bread consumption) will be eliminated from subsequent analyses for the following reasons:

### Methodological Considerations

- Represents missing/uncertain information rather than an actual dietary pattern
- Redundant with proper "No" responses in the bread consumption variable
- Low frequency of occurrence (<1% of responses)
- Does not contribute meaningful information to developmental feeding patterns

### Statistical Impact

- Removal will improve cluster interpretability
- Reduces noise in the clustering algorithm
- Maintains all valid bread consumption information through other related variables

This exclusion aligns with best practices in nutritional epidemiology where uncertain responses are handled separately from valid dietary information.
```
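
The low-frequency argument can be verified before the column is dropped. The sketch below uses hypothetical toy data and assumes the column is coded `Sim`/`Não` like the other variables. The helper `share_of_value` and the 1% threshold check are illustrations, not code from the original analysis.

```python
import pandas as pd

def share_of_value(series, value="Sim"):
    """Fraction of all rows (missing included) where `series` equals `value`."""
    return float(series.eq(value).mean())

# Hypothetical toy column: 1 "don't know" answer among 200 children.
toy = pd.Series(["Não"] * 199 + ["Sim"], name="e219a_nao_sabe")

share = share_of_value(toy)
is_rare = share < 0.01  # threshold taken from the "<1% of responses" criterion
print(f"{share:.2%} of responses are 'Sim'; rare: {is_rare}")
```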

In [10]:
df1 = pd.read_csv('/Users/marcelosilva/Desktop/clustering(0-4)/3-E-Aval/DSWOUTMW.csv')
# Remove the e219a_nao_sabe column
DSWOUTNS = df1.drop(columns=['e219a_nao_sabe'])

# Save the new dataset
DSWOUTNS.to_csv('/Users/marcelosilva/Desktop/clustering(0-4)/3-E-Aval/DSWOUTNS.csv', index=False)

# Display the first few rows and shape of the new dataset 
print("Shape of the new dataset:", DSWOUTNS.shape)
DSWOUTNS.head()

Shape of the new dataset: (14558, 49)


Unnamed: 0,id_anon,e01_leite_peito,e02_agua,e04_agua_com_acucar,e05_cha,e06_leite_vaca_po,e07_leite_vaca_liquido,e08_leite_soja_po,e09_leite_soja_liquido,e10_formula_infantil,...,e31_salgadinhos,e32_suco_industrializado,e33_refrigerante,e34_macarrao,e35_biscoito,e36_bala,e37_tempero,e38_farinhas,e39_mamadeira,e40_adocado
0,10951000402,Não,Sim,Não,Não,Não,Sim,Não,Não,Não,...,Não,Não,Sim,Não,Sim,Sim,Não,Não,Sim,Sim
1,10951000403,Não,Sim,Não,Não,Não,Não,Não,Não,Não,...,Não,Não,Sim,Não,Não,Sim,Não,Não,Não,Não
2,10951003402,Não,Sim,Não,Não,Não,Sim,Não,Não,Não,...,Sim,Sim,Sim,Não,Sim,Não,Não,Sim,Não,Não
3,10951003403,Não,Sim,Não,Não,Não,Não,Não,Não,Não,...,Não,Sim,Não,Não,Sim,Não,Não,Não,Não,Sim
4,10951009202,Sim,Sim,Não,Não,Não,Não,Não,Não,Não,...,Sim,Sim,Não,Não,Sim,Não,Sim,Não,Não,Não
