After analyzing the data, we opted for this strategy:

**Proposed Strategy:**
The test set will have exactly 2 records from the "ISLAND" category. The remaining test data (to complete ≈20% of the dataset) will be obtained through stratified sampling based on income_cat (median_income bins). This ensures 1) income representativeness – the most explanatory variable – and 2) at least some ISLAND examples to verify how the model extrapolates.

**How to implement (conceptually):**
1. Separate the 5 ISLAND rows
   - Define `is_island = df.ocean_proximity == "ISLAND"`
   - Keep 3 rows in training, 2 in the test "pool" (choose via fixed seed for reproducibility)

2. Create income_cat in the NON-ISLAND subset
   - Suggested bins from the book: [0–1.5, 1.5–3, 3–4.5, 4.5–6, > 6]
   - This generates 5 relatively balanced strata

3. Apply StratifiedShuffleSplit (test_size = 0.20) only to this non-ISLAND block
   - Use fixed random_state
   - Obtain train_idx, test_idx

4. Combine
   - Final test = test_idx rows + the 2 ISLAND rows
   - Final training = train_idx rows + the remaining 3 ISLAND rows

5. Verify proportions
   - Calculate income_cat percentages in full dataset vs. new test (Δ ≤ 3 p.p.)
   - Confirm that all 5 ocean_proximity categories now appear: <1H OCEAN, INLAND, NEAR OCEAN, NEAR BAY, ISLAND (2 rows)

**Documentation:**
Write in README: "Hold-out contains 2/5 ISLAND rows (40%) due to sample scarcity; remaining 98% of hold-out generated via stratification by income_cat."

**Benefits:**
- Test set reflects income distribution, which explains ~70% of price variation
- At least two "ISLAND" examples allow checking error in this class without unbalancing global metrics
- Training still has 3 ISLAND rows, reducing extrapolation risk without reference

**Limitation:**
- ISLAND-specific metrics remain volatile (N=2), but already provide an indication
- If ISLAND records increase in the future, simply reapply the same rule (40% in test; stratification in the rest)

If this meets your requirements, simply follow this procedure when building the hold-out.


In [4]:
import pandas as pd
import hashlib
import numpy as np

def create_stratified_test_set_by_uid(df, test_ratio=0.2, stratify_col='income_cat'):
    """
    Split determinístico por UID + estratificado por income_cat
    """
    test_indices = []
    
    for stratum in df[stratify_col].unique():
        stratum_df = df[df[stratify_col] == stratum]
        
        # Hash determinístico dentro do estrato
        def is_in_test_set(uid):
            hash_val = int(hashlib.md5(str(uid).encode()).hexdigest()[-2:], 16)
            return hash_val < (test_ratio * 256)  # ~20% dos hash values
        
        stratum_test = stratum_df[stratum_df['uid'].apply(is_in_test_set)]
        test_indices.extend(stratum_test.index.tolist())
    
    test_df = df.loc[test_indices]
    train_df = df.drop(test_indices)
    
    return test_df, train_df

def island_train_test_split(island_df):
    """
    Força exatamente 2 registros ISLAND para teste (determinístico por UID)
    """
    sorted_uids = sorted(island_df['uid'].values)
    test_uids = sorted_uids[:2]  # Sempre os mesmos 2 UIDs menores
    
    island_test = island_df[island_df['uid'].isin(test_uids)]
    island_train = island_df[~island_df['uid'].isin(test_uids)]
    
    return island_test, island_train

def create_income_categories(df):
    """
    Cria categorias de renda como no livro do Géron
    Bins: [0, 1.5, 3.0, 4.5, 6.0, inf]
    """
    df = df.copy()
    
    # Verificar se a coluna de renda existe
    income_col = None
    if 'median_income' in df.columns:
        income_col = 'median_income'
    elif 'medianIncome' in df.columns:
        income_col = 'medianIncome'
    else:
        raise ValueError("Coluna de renda não encontrada. Esperado: 'median_income' ou 'medianIncome'")
    
    print(f"🔧 Usando coluna: {income_col}")
    
    # Criar bins de renda (como no livro)
    df['income_cat'] = pd.cut(df[income_col],
                              bins=[0., 1.5, 3.0, 4.5, 6.0, np.inf],
                              labels=[1, 2, 3, 4, 5])
    
    print("💰 Categorias de renda criadas:")
    print(df['income_cat'].value_counts().sort_index())
    
    return df

def create_final_datasets(df, test_ratio=0.2):
    """
    Cria datasets train/test com:
    - ISLAND: força 2 para teste (40%)
    - Resto: estratificação por income_cat + determinístico por UID
    """
    print(f"📊 Dataset original: {len(df)} registros")
    print(f"🏝️  Registros ISLAND: {len(df[df['ocean_proximity'] == 'ISLAND'])}")
    
    # 0. Criar categorias de renda se não existir
    if 'income_cat' not in df.columns:
        print("\n🔧 Criando categorias de renda...")
        df = create_income_categories(df)
    
    # 1. Separar ISLAND do resto
    island_df = df[df['ocean_proximity'] == 'ISLAND']
    non_island_df = df[df['ocean_proximity'] != 'ISLAND']
    
    print(f"\n🔄 Processando splits...")
    
    # 2. ISLAND: sempre 2 para teste
    island_test, island_train = island_train_test_split(island_df)
    
    # 3. NÃO-ISLAND: estratificação por income_cat + UID
    non_island_test, non_island_train = create_stratified_test_set_by_uid(
        non_island_df, 
        test_ratio=test_ratio, 
        stratify_col='income_cat'
    )
    
    # 4. Combinar datasets finais
    final_test = pd.concat([non_island_test, island_test], ignore_index=True)
    final_train = pd.concat([non_island_train, island_train], ignore_index=True)
    
    return final_train, final_test

def validate_split(original_df, train_df, test_df):
    """
    Valida se o split manteve as proporções corretas
    """
    print("\n🔍 VALIDAÇÃO DO SPLIT:")
    print("=" * 50)
    
    # Tamanhos
    print(f"Dataset original: {len(original_df):,}")
    print(f"Train set: {len(train_df):,} ({len(train_df)/len(original_df)*100:.1f}%)")
    print(f"Test set: {len(test_df):,} ({len(test_df)/len(original_df)*100:.1f}%)")
    
    # ISLAND específico
    island_original = len(original_df[original_df['ocean_proximity'] == 'ISLAND'])
    island_train = len(train_df[train_df['ocean_proximity'] == 'ISLAND'])
    island_test = len(test_df[test_df['ocean_proximity'] == 'ISLAND'])
    
    print(f"\n🏝️  ISLAND:")
    print(f"Original: {island_original}")
    print(f"Train: {island_train}")
    print(f"Test: {island_test} ✓" if island_test == 2 else f"Test: {island_test} ❌")
    
    # Estratificação income_cat (apenas no subconjunto não-ISLAND)
    print(f"\n💰 ESTRATIFICAÇÃO INCOME_CAT (não-ISLAND):")
    
    try:
        original_non_island = original_df[original_df['ocean_proximity'] != 'ISLAND']
        test_non_island = test_df[test_df['ocean_proximity'] != 'ISLAND']
        
        if 'income_cat' in original_non_island.columns and 'income_cat' in test_non_island.columns:
            orig_props = original_non_island['income_cat'].value_counts(normalize=True).sort_index()
            test_props = test_non_island['income_cat'].value_counts(normalize=True).sort_index()
            
            for cat in orig_props.index:
                orig_pct = orig_props[cat] * 100
                test_pct = test_props.get(cat, 0) * 100 if cat in test_props.index else 0
                diff = abs(orig_pct - test_pct)
                status = "✓" if diff <= 3.0 else "⚠️"
                print(f"Categoria {cat}: Original {orig_pct:.1f}% → Test {test_pct:.1f}% (Δ{diff:.1f}p.p.) {status}")
        else:
            print("❌ Coluna income_cat não encontrada nos datasets")
            
    except Exception as e:
        print(f"❌ Erro na validação income_cat: {e}")
        print("🔧 Verificando colunas disponíveis...")
        print(f"Original: {original_df.columns.tolist()}")
        print(f"Test: {test_df.columns.tolist()}")
    
    # Ocean proximity geral
    print(f"\n🌊 OCEAN_PROXIMITY (todas as categorias presentes?):")
    test_categories = set(test_df['ocean_proximity'].unique())
    original_categories = set(original_df['ocean_proximity'].unique())
    missing = original_categories - test_categories
    
    if len(missing) == 0:
        print("✅ Todas as categorias presentes no test set")
    else:
        print(f"❌ Categorias ausentes no test: {missing}")

# ==================================================
# EXECUÇÃO PRINCIPAL
# ==================================================

if __name__ == "__main__":
    # Carregar dataset
    file_path = "/Users/marcelosilva/Desktop/Hands-on Machine Learning/data/processed/housing/housing_with_uid.csv"
    
    print("🚀 Iniciando divisão do dataset Housing...")
    print(f"📁 Carregando: {file_path}")
    
    try:
        df = pd.read_csv(file_path)
        print(f"✅ Dataset carregado: {len(df)} registros")
        
        # Verificar se colunas básicas existem
        income_col = 'median_income' if 'median_income' in df.columns else 'medianIncome'
        required_cols = ['uid', income_col, 'ocean_proximity']
        missing_cols = [col for col in required_cols if col not in df.columns]
        
        if missing_cols:
            print(f"❌ Colunas ausentes: {missing_cols}")
            print(f"📋 Colunas disponíveis: {list(df.columns)}")
        else:
            print(f"✅ Colunas básicas presentes")
            print(f"🔧 income_cat será criada a partir de {income_col}")
            
            # Criar splits
            train_dataset, test_dataset = create_final_datasets(df, test_ratio=0.2)
            
            # Validar splits
            validate_split(df, train_dataset, test_dataset)
            
            # Salvar datasets
            output_dir = "/Users/marcelosilva/Desktop/Hands-on Machine Learning/data/processed/housing/"
            train_path = output_dir + "housing_train.csv"
            test_path = output_dir + "housing_test.csv"
            
            train_dataset.to_csv(train_path, index=False)
            test_dataset.to_csv(test_path, index=False)
            
            print(f"\n💾 DATASETS SALVOS:")
            print(f"📁 Train: {train_path}")
            print(f"📁 Test: {test_path}")
            print(f"\n🎯 PRÓXIMOS PASSOS:")
            print(f"1. Use APENAS housing_train.csv para desenvolvimento")
            print(f"2. Mantenha housing_test.csv INTOCÁVEL até avaliação final")
            print(f"3. Para validação durante desenvolvimento, faça split do train set")
            
    except FileNotFoundError:
        print(f"❌ Arquivo não encontrado: {file_path}")
        print("Verifique se o caminho está correto.")
    except Exception as e:
        print(f"❌ Erro: {e}")

🚀 Iniciando divisão do dataset Housing...
📁 Carregando: /Users/marcelosilva/Desktop/Hands-on Machine Learning/data/processed/housing/housing_with_uid.csv
✅ Dataset carregado: 20640 registros
✅ Colunas básicas presentes
🔧 income_cat será criada a partir de median_income
📊 Dataset original: 20640 registros
🏝️  Registros ISLAND: 5

🔧 Criando categorias de renda...
🔧 Usando coluna: median_income
💰 Categorias de renda criadas:
income_cat
1     822
2    6581
3    7236
4    3639
5    2362
Name: count, dtype: int64

🔄 Processando splits...

🔍 VALIDAÇÃO DO SPLIT:
Dataset original: 20,640
Train set: 16,418 (79.5%)
Test set: 4,222 (20.5%)

🏝️  ISLAND:
Original: 5
Train: 3
Test: 2 ✓

💰 ESTRATIFICAÇÃO INCOME_CAT (não-ISLAND):
❌ Coluna income_cat não encontrada nos datasets

🌊 OCEAN_PROXIMITY (todas as categorias presentes?):
✅ Todas as categorias presentes no test set

💾 DATASETS SALVOS:
📁 Train: /Users/marcelosilva/Desktop/Hands-on Machine Learning/data/processed/housing/housing_train.csv
📁 Test: 