# Data Split Plan

This notebook splits the dataset into three parts before performing any detailed exploratory data analysis (EDA). The goal is to set up a clear and reliable starting point for building and testing the model.

---

## Initial Check

Before splitting, the dataset was checked briefly:

* The dataset contains 15,627 rows, which is small enough to load and work with directly in memory.
* A light inspection of the date and time column showed that timestamps are available and appear to be sequential.

Together these points justify a time-aware split. Fixing the split before deeper analysis also means that nothing observed during EDA on the later periods can leak into modeling decisions.
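
The kind of light inspection described above can be sketched as follows. This is a minimal illustration, not the check that was actually run: the helper name `check_timestamps` and the toy rows are assumptions made for the example, and only the column name `datetime` comes from this notebook.

```python
import pandas as pd

def check_timestamps(frame: pd.DataFrame, column: str = "datetime") -> dict:
    """Summarise whether a timestamp column parses cleanly and is already in time order."""
    # errors="coerce" turns unparseable values into NaT instead of raising
    parsed = pd.to_datetime(frame[column], errors="coerce")
    return {
        "rows": len(frame),
        "unparseable": int(parsed.isna().sum()),
        "already_sorted": bool(parsed.is_monotonic_increasing),
        "first": parsed.min(),
        "last": parsed.max(),
    }

# Toy rows for illustration only (not taken from the real dataset);
# the third row is deliberately out of order
toy = pd.DataFrame({
    "datetime": ["2024-01-01 00:00:00", "2024-01-01 00:05:00", "2024-01-01 00:03:00"]
})
report = check_timestamps(toy)
print(report)
```

If `already_sorted` comes back false, as it does for the toy rows, the data needs an explicit sort on the timestamp before any positional split.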

---

## Split Overview

The data is divided as follows:

* **Train (75%)**
  Used for model training and feature building. This set reflects typical user behavior.

* **Validation (15%)**
  Used to tune model settings (like contamination rate or dimensionality reduction) and check model stability.

* **Holdout Test (10%)**
  Used for final evaluation. This data is kept separate to assess model performance on truly unseen data.

---

## Time-Aware Consideration

Since the dataset includes a timestamp column (`datetime`, the event date and time) and the entries appear sequential, a time-based split is used to preserve the natural order of events:

* Train on the earliest 75%
* Validate on the next 15%
* Test on the most recent 10%

---

## Why Time-Aware Splitting Matters

Bot behavior can change over time. A random split scatters old and new patterns across all three sets, so the model is evaluated on periods it has effectively already seen during training. That can make the model appear more accurate than it truly is.

A time-based split simulates how the model would perform in a real-world scenario:

* It learns from past behavior
* Then it is tested on future, unseen data

This helps evaluate how well the model generalizes to new patterns or changes in traffic.
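
The difference between the two strategies can be made concrete with a small sketch. Everything here is illustrative: the synthetic, evenly spaced events, the 90/10 cut, and the helper `leaks_future` are assumptions for the example and are not part of this notebook's pipeline.

```python
import pandas as pd

# Synthetic, evenly spaced events (illustrative only)
events = pd.DataFrame({"datetime": pd.date_range("2024-01-01", periods=1000, freq="min")})
cut = int(0.9 * len(events))

# Chronological: everything before the cut is "past", everything after is "future"
chrono_train, chrono_test = events.iloc[:cut], events.iloc[cut:]

# Random: shuffle first, so past and future rows land on both sides of the cut
shuffled = events.sample(frac=1.0, random_state=0)
rand_train, rand_test = shuffled.iloc[:cut], shuffled.iloc[cut:]

def leaks_future(train: pd.DataFrame, test: pd.DataFrame) -> bool:
    """True if any training row is later than the earliest test row."""
    return bool(train["datetime"].max() > test["datetime"].min())

chrono_leak = leaks_future(chrono_train, chrono_test)
rand_leak = leaks_future(rand_train, rand_test)
print(f"chronological split leaks future rows: {chrono_leak}")
print(f"random split leaks future rows: {rand_leak}")
```

With the chronological split, the latest training event precedes the earliest test event. With the shuffled split, training rows come from the same period as the test rows, which is the situation that inflates measured accuracy.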

---

## Summary

A chronological 75-15-10 split is used to:

* Avoid mixing past and future data
* Maintain time order
* Prepare the model for real-world use cases


In [3]:
import pandas as pd

df = pd.read_csv(
    '../data/bot-hunter-dataset.tsv',
    sep='\t',
    header=None,
    names=['datetime', 'region', 'browser', 'device', 'url_params'],
    index_col=False,
)

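# parse the datetime column, then sort chronologically
# (mergesort is stable, so rows with identical timestamps keep their file order)
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.sort_values('datetime', kind='mergesort').reset_index(drop=True)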

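# Naming note: `test` here is the validation set (15%) described above,
# and `holdout` is the final holdout test set (10%).
n_total = len(df)
n_train = int(0.75 * n_total)    # earliest 75% for training
n_test = int(0.15 * n_total)     # next 15% for validation
n_holdout = n_total - n_train - n_test  # most recent ~10% for final evaluation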

print(f"Dataset size: {n_total:,} records")
print(f"Train size: {n_train:,} records ({n_train/n_total:.1%})")
print(f"Test size: {n_test:,} records ({n_test/n_total:.1%})")
print(f"Holdout size: {n_holdout:,} records ({n_holdout/n_total:.1%})")

# split by row position on the time-sorted frame
train_df = df.iloc[:n_train].copy()
test_df = df.iloc[n_train:n_train + n_test].copy()
holdout_df = df.iloc[n_train + n_test:].copy()

assert len(train_df) + len(test_df) + len(holdout_df) == len(df), "Split sizes don't match original dataset"

Dataset size: 15,627 records
Train size: 11,720 records (75.0%)
Test size: 2,344 records (15.0%)
Holdout size: 1,563 records (10.0%)
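
One thing a positional split does not guarantee is that rows sharing the same timestamp stay in the same set. The sketch below shows one way to inspect the two cut points for such ties. The helper `boundary_report` and the toy timestamps are hypothetical; on the real data the cuts would be `n_train` and `n_train + n_test` applied to the sorted `datetime` column.

```python
import pandas as pd

def boundary_report(sorted_times: pd.Series, cuts: list) -> list:
    """For each positional cut, compare the last timestamp before it with the first one after it."""
    rows = []
    for cut in cuts:
        before = sorted_times.iloc[cut - 1]
        after = sorted_times.iloc[cut]
        rows.append({"cut": cut, "before": before, "after": after, "tie": bool(before == after)})
    return rows

# Toy timestamps (illustrative only); positions 2 and 3 share a timestamp
times = pd.Series(pd.to_datetime([
    "2024-01-01 00:00", "2024-01-01 00:01", "2024-01-01 00:02",
    "2024-01-01 00:02", "2024-01-01 00:03", "2024-01-01 00:04",
]))
report = boundary_report(times, cuts=[3, 5])
for row in report:
    print(row)
```

A tie at a boundary means simultaneous events end up in two different sets. Whether that matters depends on the data; if it does, one option is to move the cut to the nearest position where the timestamp changes.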


In [5]:
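# note: the files are tab-separated despite the .csv extension,
# so they must be read back with sep='\t'
train_df.to_csv('../data/train.csv', sep='\t', index=False)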
test_df.to_csv('../data/test.csv', sep='\t', index=False)
holdout_df.to_csv('../data/holdout.csv', sep='\t', index=False)
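
Because the saved files are tab-separated and store timestamps as text, a later notebook has to pass the same separator and re-parse the `datetime` column when loading them. The following is a minimal round-trip sketch under those assumptions; it uses a temporary directory and toy rows rather than the real `../data/` files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame for illustration only (not the real split)
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2024-01-01 00:00:00", "2024-01-01 00:01:00"]),
    "region": ["a", "b"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.csv"
    toy.to_csv(path, sep="\t", index=False)
    # Same separator as on write; parse_dates restores the datetime dtype
    restored = pd.read_csv(path, sep="\t", parse_dates=["datetime"])

is_datetime = pd.api.types.is_datetime64_any_dtype(restored["datetime"])
same_times = restored["datetime"].tolist() == toy["datetime"].tolist()
print(f"datetime dtype restored: {is_datetime}, values match: {same_times}")
```

Without `parse_dates`, the column would come back as plain strings, and any later time-based filtering or sorting would operate on text instead of timestamps.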