# BitSize ML · Ep04 — Random vs Stable Splits

### Goal
Compare two approaches to splitting data into training and test sets:
- **Random Split** — quick, but test-set membership changes whenever rows are added, removed, or reordered, even with a fixed seed.
- **Fixed (Stable) Split by ID** — deterministic using a CRC32 hash of an ID.

In [1]:
!tree ~/Projects/BitSize_ML/

/Users/nejat/Projects/BitSize_ML/
└── end_to_end_ml
    ├── data
    │   └── housing.csv
    ├── images
    │   └── ENV checking.png
    ├── models
    └── notebooks
        ├── 1_setup.ipynb
        ├── 2_fetch_data.ipynb
        ├── 3_overview_EDA.ipynb
        └── 4_data_split.ipynb

6 directories, 6 files


In [2]:
import numpy as np
import pandas as pd
from pathlib import Path

# The notebook lives in notebooks/, so the data folder is one level up
data_path = Path.cwd().parent / "data" / "housing.csv"
housing = pd.read_csv(data_path)

## 1️⃣ Random Split

In [3]:
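def split_train_test(df, test_ratio, seed):
    # Seeded generator: the same seed and the same row count give the same permutation
    rng = np.random.default_rng(seed)
    shuffled_idx = rng.permutation(len(df))
    # The first test_ratio share of the shuffled positions becomes the test set
    test_size = int(test_ratio * len(df))
    test_idx = shuffled_idx[:test_size]
    train_idx = shuffled_idx[test_size:]
    return df.iloc[train_idx], df.iloc[test_idx]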

In [4]:
train_man, test_man = split_train_test(housing, 0.2, 27)
print(f"[Manual Random], train_size={len(train_man)}, test_size={len(test_man)}")

[Manual Random], train_size=16512, test_size=4128


In [5]:
from sklearn.model_selection import train_test_split
train_sk, test_sk = train_test_split(housing, test_size=0.2, random_state=27)
print(f"[Sklearn Random], train_size={len(train_sk)}, test_size={len(test_sk)}")


[Sklearn Random], train_size=16512, test_size=4128
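
A fixed seed makes the random split reproducible only while the data stays exactly the same. The sketch below illustrates this on a hypothetical toy DataFrame (`toy`, standing in for `housing`): after 10 rows are appended, the same seed produces a different permutation, so some rows that used to be in the test set end up in the training set.

```python
import numpy as np
import pandas as pd

def split_train_test(df, test_ratio, seed):
    rng = np.random.default_rng(seed)
    shuffled_idx = rng.permutation(len(df))
    test_size = int(test_ratio * len(df))
    return df.iloc[shuffled_idx[test_size:]], df.iloc[shuffled_idx[:test_size]]

# Hypothetical toy data, not the housing dataset
toy = pd.DataFrame({"value": np.arange(1000)})
_, test_before = split_train_test(toy, 0.2, seed=27)

# Same data, same seed: identical test set
_, test_repeat = split_train_test(toy, 0.2, seed=27)
print("reproducible on unchanged data:", test_before.index.equals(test_repeat.index))

# Append 10 new rows and split again with the same seed
extra = pd.DataFrame({"value": np.arange(1000, 1010)})
toy_grown = pd.concat([toy, extra], ignore_index=True)
_, test_after = split_train_test(toy_grown, 0.2, seed=27)

# Rows that were in the old test set but are now in the training set
old_test = set(test_before.index)
moved_to_train = old_test - set(test_after.index)
print(f"{len(moved_to_train)} of {len(old_test)} former test rows moved to train")
```

Any former test row that moves into the training set is a leak: a model retrained on the grown dataset has now seen data it was previously evaluated on.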


## 2️⃣ Fixed (Stable) Split by ID

* identifier → 8 bytes (int64) → CRC32 hash, an unsigned 32-bit integer in [0, 2³² − 1]
* if hash < test_ratio × 2³² → test set ✅
* otherwise → train set


In [6]:
from zlib import crc32

def is_test(identifier, test_ratio):
    # Hash the ID's int64 bytes; the mask is a no-op on Python 3, where crc32 is already unsigned
    return (crc32(np.int64(identifier).tobytes()) & 0xFFFFFFFF) < test_ratio * 2**32
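
A quick way to get a feel for the hash rule is to run it over a range of made-up IDs. The sketch below assumes hypothetical integer IDs `0..9999` and checks three things: the answer for a given ID never changes, the test share is only approximately `test_ratio` (the hash decides, not a counter), and raising `test_ratio` only adds IDs to the test set.

```python
from zlib import crc32
import numpy as np

def is_test(identifier, test_ratio):
    # crc32 returns an unsigned 32-bit integer on Python 3
    return crc32(np.int64(identifier).tobytes()) < test_ratio * 2**32

ids = range(10_000)  # hypothetical IDs, for illustration only

# Same ID, same answer on every call
flags = [is_test(i, 0.2) for i in ids]
repeat = [is_test(i, 0.2) for i in ids]
print("deterministic:", flags == repeat)

# The share of IDs below the threshold is close to, but not exactly, test_ratio
fraction = sum(flags) / len(flags)
print(f"test fraction: {fraction:.3f}")

# A larger test_ratio raises the threshold, so every ID in the 20% set is also in the 30% set
nested = all(is_test(i, 0.3) for i, f in zip(ids, flags) if f)
print("20% test set is inside the 30% test set:", nested)
```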

In [7]:
def split_train_test_by_id(df, test_ratio, column):
    id_column = df[column]
    # Boolean mask: True where the row's ID hashes below the test threshold
    in_test = id_column.apply(lambda id_: is_test(id_, test_ratio))
    return df.loc[~in_test], df.loc[in_test]

In [8]:
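# Use the row index as the ID; this is stable only if new rows are appended at the end and no rows are deleted
housing_id = housing.reset_index()
train_fixed, test_fixed = split_train_test_by_id(housing_id, 0.2, "index")
print(f"[Fixed], train_size={len(train_fixed)}, test_size={len(test_fixed)}")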

[Fixed], train_size=16512, test_size=4128
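
To see why the ID-based split is called stable, the sketch below repeats the split after new rows are appended. It uses a hypothetical toy DataFrame (`toy`) rather than `housing`, with the row index as the ID. Every ID that was in the test set stays there, and any additions to the test set come only from the new rows.

```python
from zlib import crc32
import numpy as np
import pandas as pd

def is_test(identifier, test_ratio):
    return crc32(np.int64(identifier).tobytes()) < test_ratio * 2**32

def split_train_test_by_id(df, test_ratio, column):
    in_test = df[column].apply(lambda id_: is_test(id_, test_ratio))
    return df.loc[~in_test], df.loc[in_test]

# Hypothetical toy data with the row index as the ID column
toy = pd.DataFrame({"value": np.arange(1000)}).reset_index()
_, test_before = split_train_test_by_id(toy, 0.2, "index")

# Append 10 new rows at the end, so existing rows keep their index
extra = pd.DataFrame({"value": np.arange(1000, 1010)})
toy_grown = pd.concat([toy[["value"]], extra], ignore_index=True).reset_index()
_, test_after = split_train_test_by_id(toy_grown, 0.2, "index")

old_ids = set(test_before["index"])
new_ids = set(test_after["index"])

# Every old test ID is still a test ID
stable = old_ids <= new_ids
# IDs that joined the test set can only come from the appended rows
added = new_ids - old_ids
print("old test rows kept:", stable)
print("newly added test IDs:", sorted(added))
```

The guarantee depends on IDs never changing. With the row index as the ID, inserting or deleting rows in the middle of the file shifts the index and reshuffles the split, so a column that identifies each record independently of file order is a safer choice when one exists.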
