In [1]:
import pandas as pd

df = pd.read_csv('/Users/marcelosilva/Desktop/projectOne/4/D-Target Placing/DatasetTarget.csv')

In [2]:
categorical_columns = df.select_dtypes(include=['object']).columns

for col in categorical_columns:
    print(f"\nUnique values in {col}:")
    print(df[col].unique())
    print(f"Count:\n{df[col].value_counts()}")
    print("-" * 50)


Unique values in b02_sexo:
['Feminino' 'Masculino']
Count:
b02_sexo
Masculino    2180
Feminino     2107
Name: count, dtype: int64
--------------------------------------------------

Unique values in d01_cor:
['Branca' 'Parda (mulata, cabocla, cafuza, mameluca ou mestiça)'
 'Amarela (origem japonesa, chinesa, coreana etc.)' 'Preta' 'Indígena']
Count:
d01_cor
Parda (mulata, cabocla, cafuza, mameluca ou mestiça)    2237
Branca                                                  1734
Preta                                                    289
Amarela (origem japonesa, chinesa, coreana etc.)          19
Indígena                                                   8
Name: count, dtype: int64
--------------------------------------------------

Unique values in h04_parto:
['Normal' 'Cesariana de urgência (Não agendada)'
 'Cesariana agendada (eletiva)']
Count:
h04_parto
Normal                                  2158
Cesariana de urgência (Não agendada)    1124
Cesariana agendada (eletiva)           

# Label Encoding Transformation Documentation

## Overview
This document describes the label encoding process applied to categorical variables with natural ordering in the maternal nutrition dataset. The transformation converts ordinal categorical variables into numerical representations while preserving their inherent order.

## Dataset Information
- **Original Dataset**: `DatasetTarget.csv` (4,287 rows × 39 columns)
- **Transformed Dataset**: `DataSetLabel.csv` (4,287 rows × 39 columns)
- **Target Variable**: `status_nutricional_who` (WHO nutritional status)

## Label Encoding Applied

### 1. Gestational Age Category (`def_idade_gest`)
**Original Categories** → **Encoded Values**
- `prematuro` (premature) → `0`
- `adequado` (adequate) → `1` 
- `pos_termo` (post-term) → `2`

**Rationale**: Natural gestational progression from premature to adequate to post-term delivery.

**Distribution**:
- Adequate: 3,275 cases (76.4%)
- Premature: 725 cases (16.9%)
- Post-term: 287 cases (6.7%)
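
The counts and percentages reported in these distribution lists can be reproduced with a few lines of pandas. The following is a minimal sketch: the toy column is rebuilt from the counts listed above rather than loaded from `DatasetTarget.csv`, and the helper name `distribution_table` is hypothetical.

```python
import pandas as pd


def distribution_table(series: pd.Series) -> pd.DataFrame:
    """Return the number of cases and the percentage (one decimal) per category."""
    counts = series.value_counts()
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"cases": counts, "percent": percent})


# Toy column rebuilt from the counts reported above (assumption: not the real file).
def_idade_gest = pd.Series(
    ["adequado"] * 3275 + ["prematuro"] * 725 + ["pos_termo"] * 287,
    name="def_idade_gest",
)

table = distribution_table(def_idade_gest)
print(table)
```

Rounding to one decimal hides very small classes, so a category with only a handful of rows is better reported as a raw count.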

### 2. Prenatal Care Adequacy (`adequacao_prenatal`)
**Original Categories** → **Encoded Values**
- `ausente` (absent) → `0`
- `insuficiente` (insufficient) → `1`
- `adequado` (adequate) → `2`

**Rationale**: Quality gradient from no prenatal care to adequate prenatal care.

**Distribution**:
- Adequate: 3,838 cases (89.5%)
- Insufficient: 447 cases (10.4%)
- Absent: 2 cases (<0.1%)

### 3. Maternal Age Category (`idade_mae_cat`)
**Original Categories** → **Encoded Values**
- `jovem` (young) → `0`
- `adulta` (adult) → `1`
- `madura` (mature) → `2`

**Rationale**: Natural age progression from young to adult to mature maternal age.

**Distribution**:
- Adult: 3,115 cases (72.7%)
- Young: 625 cases (14.6%)
- Mature: 547 cases (12.8%)

### 4. Birth Weight Category (`peso_cat`)
**Original Categories** → **Encoded Values**
- `baixo` (low) → `0`
- `normal` (normal) → `1`
- `alto` (high) → `2`

**Rationale**: Weight progression from low to normal to high birth weight.

**Distribution**:
- Normal: 3,654 cases (85.2%)
- Low: 397 cases (9.3%)
- High: 236 cases (5.5%)

### 5. Birth Weight Classification (`classificacao_peso`)
**Original Categories** → **Encoded Values**
- `PIG` (Small for Gestational Age) → `0`
- `AIG` (Appropriate for Gestational Age) → `1`
- `GIG` (Large for Gestational Age) → `2`

**Rationale**: Clinical classification from small to appropriate to large for gestational age.

**Distribution**:
- AIG: 3,282 cases (76.6%)
- PIG: 620 cases (14.5%)
- GIG: 385 cases (9.0%)
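
The five mappings above follow one pattern, so they can be applied from a single dictionary. The sketch below assumes a hypothetical helper `encode_ordinal` that raises an error when a column contains a category missing from its mapping, because `Series.map` on its own turns unmapped values into `NaN` without warning. The toy DataFrame is illustrative and is not the real dataset.

```python
import pandas as pd

# Mappings copied from the five sections above.
ORDINAL_MAPS = {
    "def_idade_gest": {"prematuro": 0, "adequado": 1, "pos_termo": 2},
    "adequacao_prenatal": {"ausente": 0, "insuficiente": 1, "adequado": 2},
    "idade_mae_cat": {"jovem": 0, "adulta": 1, "madura": 2},
    "peso_cat": {"baixo": 0, "normal": 1, "alto": 2},
    "classificacao_peso": {"PIG": 0, "AIG": 1, "GIG": 2},
}


def encode_ordinal(df: pd.DataFrame, maps: dict) -> pd.DataFrame:
    """Return a copy of df with each listed column replaced by its integer code."""
    out = df.copy()
    for col, mapping in maps.items():
        # Fail early on categories the mapping does not cover.
        unknown = set(out[col].dropna().unique()) - set(mapping)
        if unknown:
            raise ValueError(f"{col}: unmapped categories {sorted(unknown)}")
        out[col] = out[col].map(mapping)
    return out


# Toy rows for illustration only.
toy = pd.DataFrame({
    "def_idade_gest": ["adequado", "prematuro", "pos_termo"],
    "adequacao_prenatal": ["adequado", "insuficiente", "ausente"],
    "idade_mae_cat": ["adulta", "jovem", "madura"],
    "peso_cat": ["normal", "baixo", "alto"],
    "classificacao_peso": ["AIG", "PIG", "GIG"],
})

encoded = encode_ordinal(toy, ORDINAL_MAPS)
print(encoded)
```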

## Variables Left for One-Hot Encoding

The following categorical variables were **NOT** label encoded as they lack natural ordering and will be processed by PyCaret using one-hot encoding:

- `b02_sexo` (Sex): Male, Female
- `d01_cor` (Maternal Race/Color): White, Mixed, Black, Asian, Indigenous
- `j03_cor` (Newborn Race/Color): White, Mixed, Black, Asian, Indigenous
- `h04_parto` (Delivery Type): Normal, Emergency C-section, Elective C-section
- `k15_recebeu` (Received Support): Yes, No
- `k16_liquido` (Liquid Support): Yes, No
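
To illustrate what one-hot encoding does to nominal columns like these, here is a small sketch using `pandas.get_dummies` as a stand-in. It is not PyCaret's internal implementation, and the toy rows are made up for the example.

```python
import pandas as pd

# Toy rows for illustration only; category labels are taken from the dataset output.
toy = pd.DataFrame({
    "b02_sexo": ["Feminino", "Masculino", "Masculino"],
    "h04_parto": ["Normal", "Cesariana agendada (eletiva)", "Normal"],
})

# Each category becomes its own 0/1 column, so no order is implied between categories.
one_hot = pd.get_dummies(toy, columns=["b02_sexo", "h04_parto"], dtype=int)
print(one_hot.columns.tolist())
```

Two source columns with two observed categories each expand into four indicator columns, which is the dimensionality cost that the ordinal columns avoid.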

## Technical Implementation

### Encoding Strategy
- **Ordinal Variables**: Manual label encoding to preserve order
- **Nominal Variables**: Automatic one-hot encoding by PyCaret
- **Target Variable**: Unchanged (4 classes: Adequate Weight, Overweight, Obesity, Malnourished)

### Data Integrity
- ✅ No missing values introduced during encoding
- ✅ All original observations preserved (4,287 rows)
- ✅ All original features maintained (39 columns)
- ✅ Ordinal relationships preserved in encoded variables
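
Checks like the ones listed above can be automated. The sketch below assumes a hypothetical helper `integrity_report` that compares the original and encoded frames and returns a list of problems; the toy data and the deliberately incomplete mapping are illustrative only.

```python
import pandas as pd


def integrity_report(original: pd.DataFrame, encoded: pd.DataFrame, ordinal_maps: dict) -> list:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if encoded.shape != original.shape:
        problems.append(f"shape changed: {original.shape} -> {encoded.shape}")
    for col, mapping in ordinal_maps.items():
        # NaN values that were not present before the encoding.
        introduced = int(encoded[col].isna().sum()) - int(original[col].isna().sum())
        if introduced > 0:
            problems.append(f"{col}: {introduced} NaN value(s) introduced")
        # Every category must keep its row count under its integer code.
        for category, code in mapping.items():
            before = int((original[col] == category).sum())
            after = int((encoded[col] == code).sum())
            if before != after:
                problems.append(f"{col}: '{category}' had {before} rows, code {code} has {after}")
    return problems


maps = {"peso_cat": {"baixo": 0, "normal": 1, "alto": 2}}
toy = pd.DataFrame({"peso_cat": ["normal", "baixo", "alto", "normal"]})

good = toy.assign(peso_cat=toy["peso_cat"].map(maps["peso_cat"]))
bad = toy.assign(peso_cat=toy["peso_cat"].map({"baixo": 0, "normal": 1}))  # 'alto' is missing

print(integrity_report(toy, good, maps))
print(integrity_report(toy, bad, maps))
```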

## Machine Learning Implications

### Benefits of Label Encoding for Ordinal Variables
1. **Preserves Order**: Maintains meaningful relationships between categories
2. **Reduces Dimensionality**: Avoids sparse one-hot encoded matrices
3. **Can Help Tree-Based Models**: A single ordered column lets a tree split on a threshold (for example `peso_cat <= 0`), which keeps adjacent categories together
4. **Memory Efficiency**: Single column vs. multiple binary columns

### PyCaret Configuration
The transformed dataset is ready for PyCaret with the following configuration:
- **Ignore Features**: `id_anon`, `vd_zimc`
- **Ordinal Features**: All 5 label-encoded variables
- **Target**: `status_nutricional_who`
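
A hedged sketch of how these settings might be passed to PyCaret's classification `setup()`. `target`, `ignore_features`, `numeric_features`, and `session_id` are existing `setup()` parameters, but the seed value is arbitrary. Listing the five encoded columns under `numeric_features` is an assumption of this sketch: PyCaret's `ordinal_features` expects a dict from column name to the ordered list of original labels, and these columns are already integer-coded. The `setup()` call itself is left as a comment so the block runs without PyCaret or the CSV file.

```python
# Settings listed above, collected as keyword arguments for setup().
setup_kwargs = {
    "target": "status_nutricional_who",
    "ignore_features": ["id_anon", "vd_zimc"],
    # Assumption: keep the already-encoded ordinal columns as numeric features.
    "numeric_features": [
        "def_idade_gest",
        "adequacao_prenatal",
        "idade_mae_cat",
        "peso_cat",
        "classificacao_peso",
    ],
    "session_id": 42,  # assumption: any fixed seed, for reproducibility
}

# With PyCaret installed and DataSetLabel.csv available, the call would look like:
#   import pandas as pd
#   from pycaret.classification import setup
#   experiment = setup(data=pd.read_csv("DataSetLabel.csv"), **setup_kwargs)
print(setup_kwargs)
```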

## Data Quality Assessment

### Class Distribution (Target Variable)
- **Peso adequado** (Adequate Weight): 3,123 cases (72.8%)
- **Sobrepeso** (Overweight): 716 cases (16.7%)
- **Obesidade** (Obesity): 345 cases (8.0%)
- **Desnutrido** (Malnourished): 103 cases (2.4%) ⚠️ **Imbalanced**

### Recommendations
1. **Class Imbalance**: Use SMOTE or similar techniques for the minority class (Malnourished: 2.4%)
2. **Feature Selection**: Monitor multicollinearity among weight-related variables (e.g. `peso_cat` and `classificacao_peso`)
3. **Validation Strategy**: Use stratified sampling to maintain class proportions
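
The stratified sampling recommendation can be illustrated with pandas alone. This is a minimal sketch, not the validation setup used in the project: the target column is rebuilt from the class counts reported above, and the 20% hold-out fraction and the seed are arbitrary assumptions.

```python
import pandas as pd

# Toy target column rebuilt from the class counts reported above.
toy = pd.DataFrame({
    "status_nutricional_who": (
        ["Peso adequado"] * 3123
        + ["Sobrepeso"] * 716
        + ["Obesidade"] * 345
        + ["Desnutrido"] * 103
    )
})

# Stratified hold-out: draw the same fraction of rows inside every class.
test_set = toy.groupby("status_nutricional_who").sample(frac=0.2, random_state=0)
train_set = toy.drop(test_set.index)

full_share = toy["status_nutricional_who"].value_counts(normalize=True)
test_share = test_set["status_nutricional_who"].value_counts(normalize=True)
print(pd.DataFrame({"full": full_share, "test": test_share}).round(3))
```

Sampling within each class keeps the rare `Desnutrido` class represented in the hold-out set at roughly its original share, which a purely random split does not guarantee.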

## File Structure
```
projectOne/4/
├── D-Target Placing/
│   └── DatasetTarget.csv          # Original dataset
└── E-LabelEncoding/
    └── DataSetLabel.csv           # Transformed dataset (ready for ML)
```

## Next Steps
1. Load `DataSetLabel.csv` into PyCaret
2. Configure ordinal features in setup()
3. Apply class balancing techniques
4. Proceed with model comparison and optimization

---

**Dataset Status**: ✅ Ready for Machine Learning Pipeline  
**Encoding Quality**: ✅ Verified and Complete  
**Documentation**: ✅ Complete

In [3]:

import os

import pandas as pd

# ===== SETTINGS =====
# Path to the original dataset
input_path = '/Users/marcelosilva/Desktop/projectOne/4/D-Target Placing/DatasetTarget.csv'

# Output path
output_dir = '/Users/marcelosilva/Desktop/projectOne/4/E-LabelEnconding'
output_file = 'DataSetLabel.csv'
output_path = os.path.join(output_dir, output_file)

print("🚀 STARTING LABEL ENCODING")
print(f"📁 Loading dataset from: {input_path}")

# ===== LOAD ORIGINAL DATASET =====
try:
    df_original = pd.read_csv(input_path)
    print("✅ Dataset loaded successfully!")
    print(f"📊 Shape: {df_original.shape}")
except FileNotFoundError:
    print(f"❌ ERROR: File not found at {input_path}")
    print("🔍 Check that the path is correct")
    raise  # re-raise to stop the cell instead of calling exit() inside a notebook

# ===== CREATE A COPY FOR LABEL ENCODING =====
df_label = df_original.copy()
print("📋 Copy created for Label Encoding")

# ===== CHECK UNIQUE VALUES BEFORE ENCODING =====
print("\n🔍 CHECKING UNIQUE VALUES OF THE ORDINAL VARIABLES:")
print("=" * 60)

ordinal_vars = ['def_idade_gest', 'adequacao_prenatal', 'idade_mae_cat', 'peso_cat', 'classificacao_peso']

for var in ordinal_vars:
    print(f"\n{var}:")
    counts = df_label[var].value_counts()
    for value, count in counts.items():
        print(f"  '{value}': {count}")

# ===== LABEL ENCODING =====
print("\n🔄 APPLYING LABEL ENCODING:")
print("=" * 60)

# 1. def_idade_gest (order: prematuro < adequado < pos_termo)
print("\n1️⃣ def_idade_gest:")
idade_gest_map = {'prematuro': 0, 'adequado': 1, 'pos_termo': 2}
df_label['def_idade_gest'] = df_label['def_idade_gest'].map(idade_gest_map)
print("   prematuro -> 0, adequado -> 1, pos_termo -> 2")

# 2. adequacao_prenatal (order: ausente < insuficiente < adequado)
print("\n2️⃣ adequacao_prenatal:")
prenatal_map = {'ausente': 0, 'insuficiente': 1, 'adequado': 2}
df_label['adequacao_prenatal'] = df_label['adequacao_prenatal'].map(prenatal_map)
print("   ausente -> 0, insuficiente -> 1, adequado -> 2")

# 3. idade_mae_cat (order: jovem < adulta < madura)
print("\n3️⃣ idade_mae_cat:")
idade_cat_map = {'jovem': 0, 'adulta': 1, 'madura': 2}
df_label['idade_mae_cat'] = df_label['idade_mae_cat'].map(idade_cat_map)
print("   jovem -> 0, adulta -> 1, madura -> 2")

# 4. peso_cat (order: baixo < normal < alto)
print("\n4️⃣ peso_cat:")
peso_cat_map = {'baixo': 0, 'normal': 1, 'alto': 2}
df_label['peso_cat'] = df_label['peso_cat'].map(peso_cat_map)
print("   baixo -> 0, normal -> 1, alto -> 2")

# 5. classificacao_peso (order: PIG < AIG < GIG)
print("\n5️⃣ classificacao_peso:")
classif_peso_map = {'PIG': 0, 'AIG': 1, 'GIG': 2}
df_label['classificacao_peso'] = df_label['classificacao_peso'].map(classif_peso_map)
print("   PIG -> 0, AIG -> 1, GIG -> 2")

print("\n✅ Label Encoding complete!")

# ===== CHECK THE ENCODING RESULT =====
print("\n🔍 CHECKING RESULT OF THE ENCODING:")
print("=" * 60)

for var in ordinal_vars:
    print(f"\n{var} (after encoding):")
    counts = df_label[var].value_counts().sort_index()
    for value, count in counts.items():
        print(f"  {value}: {count}")

# ===== CHECK FOR NaN VALUES (a sign of a faulty encoding) =====
# Series.map() returns NaN for any category missing from the mapping dict.
print("\n⚠️  CHECKING FOR NaN VALUES (possible encoding errors):")
for var in ordinal_vars:
    nan_count = df_label[var].isnull().sum()
    if nan_count > 0:
        print(f"❌ {var}: {nan_count} NaN values found!")
    else:
        print(f"✅ {var}: OK")

# ===== CREATE OUTPUT DIRECTORY IF IT DOES NOT EXIST =====
print(f"\n📁 Creating output directory: {output_dir}")
os.makedirs(output_dir, exist_ok=True)

# ===== SAVE LABEL-ENCODED DATASET =====
try:
    df_label.to_csv(output_path, index=False)
    print(f"✅ Dataset saved successfully to: {output_path}")
    print(f"📊 Final shape: {df_label.shape}")
except Exception as e:
    print(f"❌ ERROR while saving: {e}")

# ===== FINAL SUMMARY =====
print("\n" + "=" * 60)
print("📋 LABEL ENCODING SUMMARY")
print("=" * 60)
print(f"📁 Original dataset: {input_path}")
print(f"📁 Label-encoded dataset: {output_path}")
print(f"📊 Shape: {df_label.shape}")
print("\n🔢 Variables that received Label Encoding:")
for i, var in enumerate(ordinal_vars, 1):
    print(f"   {i}. {var}")

print("\n🎯 Variables left unchanged (One-Hot in PyCaret):")
categorical_vars = ['d01_cor', 'j03_cor', 'b02_sexo', 'h04_parto', 'k15_recebeu', 'k16_liquido']
for i, var in enumerate(categorical_vars, 1):
    print(f"   {i}. {var}")

print("\n🎉 PROCESS COMPLETED SUCCESSFULLY!")
print("📝 Next step: Use 'DataSetLabel.csv' in PyCaret")

🚀 STARTING LABEL ENCODING
📁 Loading dataset from: /Users/marcelosilva/Desktop/projectOne/4/D-Target Placing/DatasetTarget.csv
✅ Dataset loaded successfully!
📊 Shape: (4287, 39)
📋 Copy created for Label Encoding

🔍 CHECKING UNIQUE VALUES OF THE ORDINAL VARIABLES:

def_idade_gest:
  'adequado': 3275
  'prematuro': 725
  'pos_termo': 287

adequacao_prenatal:
  'adequado': 3838
  'insuficiente': 447
  'ausente': 2

idade_mae_cat:
  'adulta': 3115
  'jovem': 625
  'madura': 547

peso_cat:
  'normal': 3654
  'baixo': 397
  'alto': 236

classificacao_peso:
  'AIG': 3282
  'PIG': 620
  'GIG': 385

🔄 APPLYING LABEL ENCODING:

1️⃣ def_idade_gest:
   prematuro -> 0, adequado -> 1, pos_termo -> 2

2️⃣ adequacao_prenatal:
   ausente -> 0, insuficiente -> 1, adequado -> 2

3️⃣ idade_mae_cat:
   jovem -> 0, adulta -> 1, madura -> 2

4️⃣ peso_cat:
   baixo -> 0, normal -> 1, alto -> 2

5️⃣ classificacao_peso:
   PIG -> 0, AIG -> 1, GIG -> 2

✅ Label Encoding complete!

🔍 CHECKING RESULT