# FI-2010 Dataset Inspection and Validation

This notebook performs a systematic inspection and validation of the FI-2010 dataset
used in the thesis.

The goal is to verify the structural integrity, feature composition, label definitions,
normalization properties, and train/test split consistency of the dataset, without making
any assumptions beyond what can be empirically confirmed from the data itself.

All findings reported here directly inform Chapter 4 (Problem Formulation and Data) of
the thesis. The notebook is not used for model training.

## 1. Imports, Setup, and Loading

In [26]:
import pandas as pd
import numpy as np
import sys, platform
from pathlib import Path

PROJECT_ROOT = Path().resolve().parents[2]
sys.path.insert(0, str(PROJECT_ROOT))

print("Project root:", PROJECT_ROOT)

# Versioning
print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("pandas:", pd.__version__)
print("numpy:", np.__version__)

print("-" * 30)

train_path = "../data/FI-2010/train.csv"
test_path  = "../data/FI-2010/test.csv"

train_df = pd.read_csv(train_path)
test_df  = pd.read_csv(test_path)

# Shapes
print("Train dataset shape:", train_df.shape)
print("Test dataset shape:", test_df.shape)

Project root: /Users/luisreindlmeier
Python: 3.11.14
Platform: macOS-26.1-arm64-arm-64bit
pandas: 2.3.3
numpy: 2.4.0
------------------------------
Train dataset shape: (362400, 150)
Test dataset shape: (31937, 150)


## 2. Structural Integrity Checks

### 2.1 Column Count & Index Column

In [27]:
# Show first/last columns
print("First 10 columns (raw):", list(train_df.columns[:10]))
print("Last 10 columns  (raw):", list(train_df.columns[-10:]))

# Drop index column if present
if "Unnamed: 0" in train_df.columns:
    train_df = train_df.drop(columns=["Unnamed: 0"])
    test_df  = test_df.drop(columns=["Unnamed: 0"])

print("Train shape (clean):", train_df.shape)
print("Test shape  (clean):", test_df.shape)

First 10 columns (raw): ['Unnamed: 0', '0', '1', '2', '3', '4', '5', '6', '7', '8']
Last 10 columns  (raw): ['139', '140', '141', '142', '143', '144', '145', '146', '147', '148']
Train shape (clean): (362400, 149)
Test shape  (clean): (31937, 149)
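
The shapes above show that both splits have 149 columns after cleaning, but equal column counts alone do not guarantee identical column names, order, or dtypes. A minimal sketch of a stricter check follows; the helper `check_same_schema` and the toy frames are hypothetical stand-ins for `train_df` / `test_df` and are not part of the original notebook.

```python
import pandas as pd


def check_same_schema(train: pd.DataFrame, test: pd.DataFrame) -> list[str]:
    """Return a list of schema problems (empty if the two frames agree)."""
    problems = []
    if list(train.columns) != list(test.columns):
        problems.append("column names or order differ")
    elif not (train.dtypes == test.dtypes).all():
        # Only compared when the column labels already match
        problems.append("column dtypes differ")
    return problems


# Toy frames standing in for train_df / test_df (assumption for illustration)
toy_train = pd.DataFrame({"0": [0.1, 0.2], "1": [1.0, 2.0]})
toy_test = pd.DataFrame({"0": [0.3], "1": [3.0]})

schema_problems = check_same_schema(toy_train, toy_test)
print("Schema problems:", schema_problems)
```

In the notebook itself, the same call could be applied as `check_same_schema(train_df, test_df)` after the index column has been dropped.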


### 2.2 Feature / Label Separation

In [28]:
# Create feature / label separation
X_train = train_df.iloc[:, :-5]
y_train = train_df.iloc[:, -5:]

X_test  = test_df.iloc[:, :-5]
y_test  = test_df.iloc[:, -5:]

feature_col = X_train.columns[0]

# Manual raw values inspection (validate the split)
print("\nFirst 5 rows × first 5 feature columns:")
print(X_train.iloc[:5, :5])

print("\nFirst 5 rows × last 5 feature columns:")
print(X_train.iloc[:5, -5:])

print("\nFirst 5 rows × all label columns:")
print(y_train.head(5))


First 5 rows × first 5 feature columns:
          0         1         2         3         4
0  0.318116 -0.564619  0.313539 -0.551889  0.319726
1  0.318116 -0.662079  0.313539 -0.551889  0.320706
2  0.317136 -0.723163  0.313539 -0.551889  0.316787
3  0.317136 -0.585895  0.313539 -0.551889  0.318747
4  0.317136 -0.585895  0.313539 -0.551889  0.318747

First 5 rows × last 5 feature columns:
        139       140  141  142  143
0 -0.816832 -0.825238  0.0  0.0  0.0
1  0.464300  0.452887  0.0  0.0  0.0
2 -0.798788 -0.807237  0.0  0.0  0.0
3  0.465974  0.454558  0.0  0.0  0.0
4 -0.410306 -0.419666  0.0  0.0  0.0

First 5 rows × all label columns:
   144  145  146  147  148
0  2.0  2.0  2.0  2.0  2.0
1  2.0  2.0  2.0  2.0  2.0
2  3.0  3.0  2.0  2.0  2.0
3  2.0  2.0  3.0  2.0  2.0
4  1.0  1.0  1.0  2.0  2.0


## 3. Feature Space Validation

### 3.1 Feature Statistics

In [29]:
print("Feature columns:", X_train.shape[1])

print("-" * 30)

# Expect features to be z-score normalized (mean≈0, std≈1)
print("Feature column statistics:")
print(X_train[feature_col].describe())

Feature columns: 144
------------------------------
Feature column statistics:
count    3.624000e+05
mean    -1.591887e-10
std      1.000000e+00
min     -1.065622e+00
25%     -9.833034e-01
50%     -5.266306e-01
75%      1.156003e+00
max      1.355919e+00
Name: 0, dtype: float64


### 3.2 Normalization Sanity Check (z-score)

In [30]:
# Normalization sanity check (z-score)
stats = X_train.describe().loc[["mean", "std"]]
stats.iloc[:, :10]

mean_max = stats.loc["mean"].abs().max()
std_max_deviation = stats.loc["std"].sub(1).abs().max()

print(f"Max |feature mean|: {mean_max:.2e}")
print(f"Max |feature std - 1|: {std_max_deviation:.2e}")

Max |feature mean|: 7.92e-09
Max |feature std - 1|: 1.00e+00
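
A maximum deviation of exactly 1.00 is what one would observe if at least one feature column has a standard deviation of 0, i.e. is constant. The preview in Section 2.2 shows columns 141–143 as 0.0 in the first five rows, which makes them candidates, although five rows do not prove it. The sketch below shows one way to list such columns; `find_constant_columns` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def find_constant_columns(features: pd.DataFrame) -> list[str]:
    """Return the names of columns that contain at most one distinct value."""
    return [
        col for col in features.columns
        if features[col].nunique(dropna=False) <= 1
    ]


# Toy frame standing in for X_train (assumption for illustration)
toy = pd.DataFrame({
    "a": [-1.0, 0.0, 1.0],  # varying column with sample std = 1
    "b": [0.0, 0.0, 0.0],   # constant column with std = 0
})

constant_cols = find_constant_columns(toy)
std_dev_from_one = toy.std().sub(1).abs()

print("Constant columns:", constant_cols)
print(std_dev_from_one)
```

If `find_constant_columns(X_train)` returns a non-empty list, the z-score check in this section could be repeated on the remaining columns only, so that the reported maximum reflects the non-constant features.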


### 3.3 Missing / Invalid Values

In [31]:
n_nan = np.isnan(X_train.values).sum()
n_inf = np.isinf(X_train.values).sum()

print(f"NaN values in features: {n_nan}")
print(f"Infinite values: {n_inf}")

NaN values in features: 0
Infinite values: 0


## 4. Label Definition & Semantics

### 4.1 Label Horizons

In [32]:
# According to the FI-2010 specification, columns 144–148 correspond to prediction horizons h ∈ {10, 20, 30, 50, 100}
print("Label columns:", list(y_train.columns))

print("-" * 30)

# Expect labels in {1, 2, 3} with mean≈2 (classes 1 and 3 occur with similar frequency), min=1, and max=3
print("Label column statistics:")
print(y_train.iloc[:, 0].describe())

Label columns: ['144', '145', '146', '147', '148']
------------------------------
Label column statistics:
count    362400.000000
mean          1.997599
std           0.600984
min           1.000000
25%           2.000000
50%           2.000000
75%           2.000000
max           3.000000
Name: 144, dtype: float64


### 4.2 Raw Label Encoding

In [33]:
print(np.unique(y_train.iloc[:, 0]))

[1. 2. 3.]


### 4.3 Label Distribution

In [34]:
train_dist = y_train.iloc[:, 0].value_counts(normalize=True).sort_index()
test_dist  = y_test.iloc[:, 0].value_counts(normalize=True).sort_index()

print("Train Label Distribution for column", train_dist)

print("-" * 30)

print("Test Label Distribution for column", test_dist)

Train Label Distribution for column 144
1.0    0.181794
2.0    0.638813
3.0    0.179393
Name: proportion, dtype: float64
------------------------------
Test Label Distribution for column 144
1.0    0.173686
2.0    0.668159
3.0    0.158155
Name: proportion, dtype: float64
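
To put a single number on how far the train and test label distributions are apart, one option is the total variation distance (half the sum of absolute differences between class proportions). The sketch below uses the proportions printed above for column 144; choosing this particular distance is an assumption made here for illustration, not a method taken from the thesis.

```python
import numpy as np

# Class proportions for column 144 as printed above (classes 1, 2, 3)
train_props = np.array([0.181794, 0.638813, 0.179393])
test_props = np.array([0.173686, 0.668159, 0.158155])

# Per-class absolute difference and total variation distance
abs_diff = np.abs(train_props - test_props)
tv_distance = 0.5 * abs_diff.sum()

for cls, diff in zip((1, 2, 3), abs_diff):
    print(f"class {cls}: |train - test| = {diff:.4f}")
print(f"Total variation distance: {tv_distance:.4f}")
```

The largest per-class gap is in the stationary class (2), which is about three percentage points more frequent in the test split than in the training split.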


### 4.4 Mapping to {-1, 0, +1}

In [35]:
# Select horizon h = 10 (first label column)
h10 = y_train.iloc[:, 0]

# FI-2010 labels use encoding {1: down, 2: stationary, 3: up}
label_mapping = {
    1: -1,  # downward
    2:  0,  # stationary
    3:  1   # upward
}

y_train_h10_mapped = h10.map(label_mapping)

print("Mapping for column", y_train_h10_mapped.value_counts().sort_index())

Mapping for column 144
-1     65882
 0    231506
 1     65012
Name: count, dtype: int64
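
Section 6 evaluates predictions with class indices 0/1/2 rather than {-1, 0, +1}. The internals of `FI2010Dataset` are not shown in this notebook, so the sketch below only illustrates one mapping that is consistent with the class names used later (`down`, `stationary`, `up`): raw label minus one. `RAW_TO_INDEX` and `CLASS_NAMES` are hypothetical names introduced for this example.

```python
import pandas as pd

# Assumed mapping from raw FI-2010 labels to 0-based class indices
RAW_TO_INDEX = {1: 0, 2: 1, 3: 2}
CLASS_NAMES = ["down", "stationary", "up"]

# First five raw labels of column 144, as shown in Section 2.2
raw = pd.Series([2.0, 2.0, 3.0, 2.0, 1.0], name="144")

class_idx = raw.map(RAW_TO_INDEX)
if class_idx.isna().any():
    # Unmapped values would otherwise silently become NaN
    raise ValueError("Encountered a raw label outside {1, 2, 3}")
class_idx = class_idx.astype("int64")

# The signed encoding {-1, 0, +1} is the class index shifted by one
signed = class_idx - 1

print(pd.DataFrame({
    "raw": raw,
    "class_idx": class_idx,
    "signed": signed,
    "name": [CLASS_NAMES[i] for i in class_idx],
}))
```

Keeping both encodings derived from a single mapping avoids the two representations drifting apart between the inspection notebook and the training code.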


## 5. Dataset Summary (for Thesis Reference)

In [36]:
# Calculate dataset summary metrics
n_train = X_train.shape[0]
n_test = X_test.shape[0]
train_ratio = n_train / (n_train + n_test)

print("=" * 50)
print("DATASET SUMMARY")
print("=" * 50)

print("\nSample sizes")
print(f"{'Train samples':<30}: {n_train}")
print(f"{'Test samples':<30}: {n_test}")
print(f"{'Train/Test split ratio':<30}: {round(train_ratio, 4)}")

print("-" * 50)

print("\nFeature space")
print(f"{'LOB feature dimension':<30}: {X_train.shape[1]}")

print("-" * 50)

print("\nLabels")
print(f"{'Number of label horizons':<30}: {y_train.shape[1]}")
print(f"{'Label horizons (columns)':<30}: {list(y_train.columns)}")
print(f"{'Raw label values':<30}: {sorted(y_train.iloc[:, 0].unique().astype(int))}")
print(f"{'Label encoding (raw)':<30}: {{1: 'downward', 2: 'stationary', 3: 'upward'}}")

print("-" * 50)

print("\nNormalization checks")
print(f"{'Feature normalization':<30}: z-score (mean≈0, std≈1)")
print(f"{'Max |feature mean|':<30}: {stats.loc['mean'].abs().max():.2e}")
print(f"{'Max |feature std − 1|':<30}: {stats.loc['std'].sub(1).abs().max():.2e}")

print("-" * 50)

print("\nData integrity")
print(f"{'Missing values (features)':<30}: {int(np.isnan(X_train.values).sum())}")
print(f"{'Infinite values (features)':<30}: {int(np.isinf(X_train.values).sum())}")
print("=" * 50)

DATASET SUMMARY

Sample sizes
Train samples                 : 362400
Test samples                  : 31937
Train/Test split ratio        : 0.919
--------------------------------------------------

Feature space
LOB feature dimension         : 144
--------------------------------------------------

Labels
Number of label horizons      : 5
Label horizons (columns)      : ['144', '145', '146', '147', '148']
Raw label values              : [np.int64(1), np.int64(2), np.int64(3)]
Label encoding (raw)          : {1: 'downward', 2: 'stationary', 3: 'upward'}
--------------------------------------------------

Normalization checks
Feature normalization         : z-score (mean≈0, std≈1)
Max |feature mean|            : 7.92e-09
Max |feature std − 1|         : 1.00e+00
--------------------------------------------------

Data integrity
Missing values (features)     : 0
Infinite values (features)    : 0


## 6. Important Checks

### 6.1 Check 1 – Class Distribution (Train vs Test)

In [37]:
print(y_train.iloc[:, 0].value_counts().sort_index())
print(y_test.iloc[:, 0].value_counts().sort_index())

144
1.0     65882
2.0    231506
3.0     65012
Name: count, dtype: int64
144
1.0     5547
2.0    21339
3.0     5051
Name: count, dtype: int64


### 6.2 Check 2 – Confusion Matrix (After Short Training)

In [38]:
import torch
import numpy as np
import pandas as pd

from datasets.fi2010 import FI2010Dataset
from models.model import LOBTransformer

DEVICE = "mps" if torch.backends.mps.is_available() else "cpu"

# 1) Dataset + Loader (h=10 => first label column 144)
test_ds = FI2010Dataset(csv_path="../data/FI-2010/test.csv", horizon_idx=0)  
test_loader = torch.utils.data.DataLoader(test_ds, batch_size=1024, shuffle=False)

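# 2) Instantiate model
# NOTE: no checkpoint is loaded in this cell, so unless LOBTransformer restores
# weights internally, the predictions below come from an untrained model.
model = LOBTransformer().to(DEVICE)
model.eval()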

# 3) Create preds
all_preds = []
all_true = []

with torch.no_grad():
    for x, y in test_loader:
        x = x.to(DEVICE)
        y = y.to(DEVICE)

        logits = model(x)
        pred = torch.argmax(logits, dim=1)

        all_preds.append(pred.cpu().numpy())
        all_true.append(y.cpu().numpy())

y_pred = np.concatenate(all_preds)
y_true = np.concatenate(all_true)

# 4) Confusion matrix (Labels 0/1/2)
cm = pd.crosstab(
    pd.Series(y_true, name="true"),
    pd.Series(y_pred, name="pred"),
    rownames=["true"], colnames=["pred"],
    dropna=False
)

cm

pred      0     1     2
true
0      2847  1440  1260
1     12168  4378  4793
2      2581  1266  1204
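
The counts in the confusion matrix are enough to recompute accuracy and per-class recall and precision by hand, and to compare them with a majority-class baseline that always predicts the most frequent true class. The sketch below does this with NumPy, using the counts shown above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix counts from above (rows: true class, columns: predicted class)
cm_counts = np.array([
    [2847, 1440, 1260],
    [12168, 4378, 4793],
    [2581, 1266, 1204],
])

total = cm_counts.sum()
support = cm_counts.sum(axis=1)          # number of true samples per class
predicted = cm_counts.sum(axis=0)        # number of predictions per class

accuracy = np.trace(cm_counts) / total
recall = np.diag(cm_counts) / support
precision = np.diag(cm_counts) / predicted

# Baseline: always predict the most frequent true class
majority_accuracy = support.max() / total

print(f"Accuracy:          {accuracy:.4f}")
print(f"Majority baseline: {majority_accuracy:.4f}")
print("Recall per class:   ", recall.round(4))
print("Precision per class:", precision.round(4))
```

With these counts the model's accuracy is well below the majority-class baseline, and most predictions fall into class 0 regardless of the true class.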


### 6.3 Check 3 – Shuffle Test

In [39]:
# ============================================================
# FINAL EVALUATION CHECKS (h = 10)
# ============================================================

from sklearn.metrics import (
    accuracy_score,
    f1_score,
    classification_report
)

print("=" * 60)
print("FINAL MODEL EVALUATION (h = 10)")
print("=" * 60)

# ------------------------------------------------------------
# 1) Sanity check: label ranges
# ------------------------------------------------------------
print("\nLabel value sanity check")
print("-" * 30)
print("Unique true labels :", np.unique(y_true))
print("Unique pred labels :", np.unique(y_pred))

# ------------------------------------------------------------
# 2) Accuracy (for reference only)
# ------------------------------------------------------------
acc = accuracy_score(y_true, y_pred)

print("\nAccuracy")
print("-" * 30)
print(f"Accuracy: {acc:.4f}")

# ------------------------------------------------------------
# 3) Macro F1 (MAIN METRIC)
# ------------------------------------------------------------
macro_f1 = f1_score(y_true, y_pred, average="macro")

print("\nMacro F1-score")
print("-" * 30)
print(f"Macro F1: {macro_f1:.4f}")

# ------------------------------------------------------------
# 4) Detailed per-class metrics
# ------------------------------------------------------------
print("\nPer-class precision / recall / F1")
print("-" * 30)

print(
    classification_report(
        y_true,
        y_pred,
        target_names=["down", "stationary", "up"],
        digits=4
    )
)

# ------------------------------------------------------------
# 5) Normalized confusion matrix (optional but useful)
# ------------------------------------------------------------
print("\nNormalized confusion matrix (row-wise)")
print("-" * 30)

cm_norm = cm.div(cm.sum(axis=1), axis=0)
cm_norm

FINAL MODEL EVALUATION (h = 10)

Label value sanity check
------------------------------
Unique true labels : [0 1 2]
Unique pred labels : [0 1 2]

Accuracy
------------------------------
Accuracy: 0.2639

Macro F1-score
------------------------------
Macro F1: 0.2499

Per-class precision / recall / F1
------------------------------
              precision    recall  f1-score   support

        down     0.1618    0.5133    0.2460      5547
  stationary     0.6180    0.2052    0.3081     21339
          up     0.1659    0.2384    0.1956      5051

    accuracy                         0.2639     31937
   macro avg     0.3152    0.3189    0.2499     31937
weighted avg     0.4673    0.2639    0.2795     31937


Normalized confusion matrix (row-wise)
------------------------------


pred         0         1         2
true
0      0.51325    0.2596   0.22715
1     0.570224  0.205164  0.224612
2     0.510988  0.250643  0.238369
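
The heading of this check refers to a shuffle test, while the cell above reports standard metrics without shuffling anything. One common reading of a shuffle test is a label-permutation baseline: the true labels are randomly permuted many times, the metric is recomputed against the unchanged predictions, and the observed score is compared with that null distribution. The sketch below illustrates this on synthetic data; `macro_f1`, `permutation_baseline`, and the toy arrays are assumptions made for this example and do not reproduce the thesis pipeline.

```python
import numpy as np


def macro_f1(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int = 3) -> float:
    """Unweighted mean of per-class F1 scores (0.0 for classes with no support and no predictions)."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))


def permutation_baseline(y_true, y_pred, n_permutations=200, seed=0):
    """Compare the observed macro F1 with scores obtained on shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = macro_f1(y_true, y_pred)
    shuffled = np.array([
        macro_f1(rng.permutation(y_true), y_pred)
        for _ in range(n_permutations)
    ])
    # Share of shuffles scoring at least as well as the real labels (+1 correction)
    p_value = (1 + np.sum(shuffled >= observed)) / (1 + n_permutations)
    return observed, shuffled, p_value


# Synthetic example: predictions that agree with the labels most of the time
rng = np.random.default_rng(42)
toy_true = rng.integers(0, 3, size=2000)
is_noise = rng.random(2000) < 0.3
toy_pred = np.where(is_noise, rng.integers(0, 3, size=2000), toy_true)

observed_f1, shuffled_f1, p_val = permutation_baseline(toy_true, toy_pred)
print(f"Observed macro F1:      {observed_f1:.4f}")
print(f"Shuffled macro F1 mean: {shuffled_f1.mean():.4f}")
print(f"Permutation p-value:    {p_val:.4f}")
```

Applied to the notebook's `y_true` and `y_pred`, an observed macro F1 that is not clearly above the shuffled scores would indicate that the predictions carry no usable information about the labels.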
