# CMS2016G — Data Cleaning & Preprocessing (DCP) Audit Notebook

# Install Required Libraries

We install:
- `uproot` → to read CMS ROOT files without the ROOT framework
- `awkward` → to handle jagged arrays from NanoAOD/MiniAOD

These are lightweight and optimized for HEP-scale datasets.


In [1]:
!pip install uproot awkward --quiet



In [2]:
import uproot
import awkward as ak
import numpy as np
import pandas as pd
import os
from glob import glob
import math


# Load CMS ROOT Files

We scan the dataset directory and load all ROOT files.

Expected:
- 85 ROOT files
- Each containing a CMS event tree

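Comparing the count against the expected 85 catches:
- Missing files
- Files that do not match the `*.root` naming pattern

It does not validate file contents; the audit cells further down do that.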


In [3]:
BASE_PATH = "/kaggle/input/datasets/hiteshrs/cms2016g29-5785"

root_files = sorted(glob(os.path.join(BASE_PATH, "*.root")))

print("Total ROOT files found:", len(root_files))


Total ROOT files found: 85
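
The count check above can be extended with a few cheap inventory checks before any file is opened. The sketch below is illustrative only: `audit_file_inventory` is a hypothetical helper, not part of this notebook, and it runs on temporary toy files rather than the real dataset.

```python
import os
import tempfile

def audit_file_inventory(paths, expected_count):
    """Report count mismatches, zero-byte files, and duplicate basenames."""
    seen, duplicates = set(), set()
    for name in (os.path.basename(p) for p in paths):
        if name in seen:
            duplicates.add(name)
        seen.add(name)
    empty = [p for p in paths if os.path.getsize(p) == 0]
    return {
        "count_ok": len(paths) == expected_count,
        "found": len(paths),
        "empty_files": empty,
        "duplicate_names": sorted(duplicates),
    }

# Toy demonstration: three temporary files, one of them empty
with tempfile.TemporaryDirectory() as tmp:
    toy_paths = []
    for i, payload in enumerate([b"data", b"", b"data"]):
        path = os.path.join(tmp, f"part_{i}.root")
        with open(path, "wb") as fh:
            fh.write(payload)
        toy_paths.append(path)
    report = audit_file_inventory(toy_paths, expected_count=3)
    # Keep only basenames so the report does not depend on the temp directory
    report["empty_files"] = [os.path.basename(p) for p in report["empty_files"]]

print(report)
```

In this notebook the call would presumably be `audit_file_inventory(root_files, expected_count=85)`.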


# Inspect ROOT Tree Structure

CMS NanoAOD typically stores event-level observables in:

- "Events"

However, MiniAOD or derived samples may differ.

We confirm the correct tree name before loading features.


In [4]:
with uproot.open(root_files[0]) as file:
    print(file.keys())


['Features;9', 'Features;8']
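
The tree here is named `Features`, not `Events`. The `;9` and `;8` suffixes are ROOT cycle numbers: both keys refer to the same tree name, and opening `file["Features"]` without a cycle resolves to the highest one. A minimal sketch of how such keys can be parsed, using a hypothetical helper `latest_cycles` that works on plain strings:

```python
def latest_cycles(keys):
    """Map each object name to its highest cycle number, given keys like 'Features;9'."""
    latest = {}
    for key in keys:
        name, _, cycle = key.rpartition(";")
        if not name:
            # Key carried no ';cycle' suffix; treat it as cycle 0
            name, cycle = key, "0"
        latest[name] = max(latest.get(name, 0), int(cycle))
    return latest

print(latest_cycles(['Features;9', 'Features;8']))
```

Only one distinct tree name should come back; more than one would suggest the file holds additional objects worth inspecting.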


# Define Extended Feature Set (50 Observables)

These features are:

- Event-level
- Object-level (leading/subleading)
- Derived kinematics
- Angular correlations
- Ratio features

All are inference-safe (no generator-level information).


In [5]:
FEATURES = [
    'nMuon','nElectron','nJet','MET_pt','MET_phi','MET_sumEt',
    'Muon_pt_0','Muon_eta_0','Muon_phi_0',
    'Muon_pt_1','Muon_eta_1','Muon_phi_1',
    'Electron_pt_0','Electron_eta_0','Electron_phi_0',
    'Electron_pt_1','Electron_eta_1','Electron_phi_1',
    'Jet_pt_0','Jet_eta_0','Jet_phi_0',
    'Jet_pt_1','Jet_eta_1','Jet_phi_1',
    'Jet_pt_2','Jet_eta_2','Jet_phi_2',
    'Jet_pt_3','Jet_eta_3','Jet_phi_3',
    'HT','ST',
    'M_ll','M_jj_01','M_jj_12',
    'delta_phi_MET_j0','delta_phi_MET_j1','min_delta_phi_MET_jets',
    'delta_R_j0_j1','delta_phi_ll','delta_R_ll',
    'Jet_btagDeepB_0','Jet_btagDeepB_1',
    'MT_lep_MET','HT_ratio','MET_pt_HT_ratio',
    'nJet_pt30','Jet_mass_0','LeadLepton_pt','sum_pt_leptons'
]


# Streaming CMS Data Audit (Memory Safe)

We avoid concatenating all 85 ROOT files.

Instead:
- Process one file at a time
- Iterate in chunks (50k events)
- Accumulate summary statistics
- Free memory immediately

This keeps peak memory bounded by a single chunk and avoids exhausting Kaggle RAM.

Streaming summary statistics is a common way to validate HEP datasets that do not fit in memory.


In [6]:
from tqdm import tqdm

summary_stats = {
    "total_events": 0,
    "missing": 0,
    "infinite": 0,
    "negative_physics": 0,
    "angular_violation": 0
}

physical_positive = [
    'MET_pt','HT','ST','M_ll','M_jj_01','M_jj_12',
    'Jet_pt_0','Jet_pt_1','Jet_pt_2','Jet_pt_3',
    'Muon_pt_0','Muon_pt_1',
    'Electron_pt_0','Electron_pt_1',
    'Jet_mass_0'
]

for file in tqdm(root_files, desc="Processing ROOT files"):
    
    with uproot.open(file)["Features"] as tree:
        
        for chunk in tree.iterate(FEATURES, step_size=50000, library="pd"):
            
            summary_stats["total_events"] += len(chunk)
            summary_stats["missing"] += chunk.isnull().sum().sum()
            summary_stats["infinite"] += np.isinf(chunk).sum().sum()
            
            for col in physical_positive:
                if col in chunk.columns:
                    summary_stats["negative_physics"] += (chunk[col] < 0).sum()
            
            phi_cols = [col for col in chunk.columns if "phi" in col]
            for col in phi_cols:
                summary_stats["angular_violation"] += (
                    ((chunk[col] < -math.pi) | (chunk[col] > math.pi)).sum()
                )
            
            del chunk

summary_stats


Processing ROOT files: 100%|██████████| 85/85 [04:48<00:00,  3.39s/it]


{'total_events': 120786365,
 'missing': np.int64(0),
 'infinite': np.int64(0),
 'negative_physics': np.int64(3258),
 'angular_violation': np.int64(55314)}

# Missing Value Audit (Streaming Mode)

Since the dataset is too large to fit in memory,
we compute missing values per feature while streaming chunks.

This ensures:
- Constant memory usage
- Accurate global statistics
- Scalability to >100M events
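
The same pattern extends beyond missing-value counts: any statistic that can be updated from one chunk at a time fits in constant memory. The sketch below assumes a hypothetical accumulator class, `StreamingColumnStats`, and feeds it two toy chunks in place of real ROOT data.

```python
import numpy as np

class StreamingColumnStats:
    """Accumulate count, missing count, min, max, and mean one chunk at a time."""

    def __init__(self):
        self.n = 0            # all values seen
        self.n_missing = 0    # NaN values seen
        self.n_finite = 0     # values that are neither NaN nor infinite
        self.total = 0.0      # running sum of finite values
        self.min = np.inf
        self.max = -np.inf

    def update(self, values):
        values = np.asarray(values, dtype=float)
        finite = values[np.isfinite(values)]
        self.n += values.size
        self.n_missing += int(np.isnan(values).sum())
        self.n_finite += finite.size
        if finite.size:
            self.total += float(finite.sum())
            self.min = min(self.min, float(finite.min()))
            self.max = max(self.max, float(finite.max()))

    @property
    def mean(self):
        return self.total / self.n_finite if self.n_finite else float("nan")

# Two toy chunks standing in for successive tree.iterate(...) batches
stats = StreamingColumnStats()
for chunk in ([1.0, 2.0, np.nan], [4.0, 5.0]):
    stats.update(chunk)

print(stats.n, stats.n_missing, stats.min, stats.max, stats.mean)
```

Only a handful of scalars survive between chunks, so memory use does not grow with the number of events.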


In [None]:
missing_per_feature = {f: 0 for f in FEATURES}

for file in root_files:
    with uproot.open(file)["Features"] as tree:
        for chunk in tree.iterate(FEATURES, step_size=50000, library="pd"):
            
            null_counts = chunk.isnull().sum()
            
            for col in FEATURES:
                missing_per_feature[col] += int(null_counts[col])
            
            del chunk

# Show only features with missing values
missing_filtered = {k: v for k, v in missing_per_feature.items() if v > 0}

print("Features with missing values:")
missing_filtered


# Infinite Value Audit

Infinite values often arise from:

- Division by zero (HT_ratio, MET_pt_HT_ratio)
- Log transforms
- Improper normalization

These break:

- Neural Spline Flow likelihood estimation
- Standard scaling
- Model stability

Any infinite value should be treated as a data-quality defect and traced back to the computation that produced it.
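
A minimal sketch of how a ratio feature can be computed without producing infinities. `safe_ratio` is a hypothetical helper, and the fill value of 0.0 for a zero denominator is an assumption; the right convention depends on how the ratio features in this dataset were originally defined.

```python
import numpy as np

def safe_ratio(numerator, denominator, fill=0.0):
    """Element-wise numerator / denominator, using `fill` where the denominator is zero."""
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    out = np.full(np.broadcast(num, den).shape, fill, dtype=float)
    # Divide only where the denominator is non-zero; other entries keep `fill`
    np.divide(num, den, out=out, where=den != 0)
    return out

met = np.array([50.0, 20.0, 0.0])
ht = np.array([100.0, 0.0, 0.0])

# Naive division yields inf (x / 0) and NaN (0 / 0)
with np.errstate(divide="ignore", invalid="ignore"):
    naive = met / ht

ratio = safe_ratio(met, ht)
print("naive:", naive)
print("safe :", ratio)
```

The alternative of adding a small epsilon to the denominator also avoids infinities, but it produces very large finite ratios that may need an upper bound of their own.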


In [None]:
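# Stream per-feature counts; the full dataset is never held in one DataFrame
inf_summary = pd.Series(0, index=FEATURES, dtype="int64")
files = [f"{f}:Features" for f in root_files]
for chunk in uproot.iterate(files, FEATURES, step_size=50000, library="pd"):
    inf_summary += np.isinf(chunk).sum()
inf_summary = inf_summary[inf_summary > 0]
print("Features with infinite values:")
print(inf_summary)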


# Physics Validity Check — Non-Negativity

The following must never be negative:

- Transverse momenta
- Invariant masses
- Scalar sums
- MET

Negative values imply:

- Reconstruction bug
- Feature engineering error
- File corruption

Such events must be removed.


In [None]:
physical_positive = [
    'MET_pt','HT','ST','M_ll','M_jj_01','M_jj_12',
    'Jet_pt_0','Jet_pt_1','Jet_pt_2','Jet_pt_3',
    'Muon_pt_0','Muon_pt_1',
    'Electron_pt_0','Electron_pt_1',
    'Jet_mass_0'
]

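# Accumulate per-column counts chunk by chunk (no in-memory `df` exists)
negative_counts = {col: 0 for col in physical_positive}
files = [f"{f}:Features" for f in root_files]

for chunk in uproot.iterate(files, physical_positive, step_size=50000, library="pd"):
    for col in physical_positive:
        negative_counts[col] += int((chunk[col] < 0).sum())

for col, negatives in negative_counts.items():
    if negatives > 0:
        print(f"{col} has {negatives} negative values")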


# Angular Boundary Validation

Physics constraints:

- φ ∈ [-π, π]
- Δφ ∈ [0, π]
- ΔR ≥ 0

Violations indicate:

- Improper wrapping
- Incorrect Δφ calculation
- Mixing degrees & radians

A mis-wrapped angle places physically adjacent directions far apart in feature space, which distorts any distance- or density-based model.
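
For reference, a minimal sketch of angle wrapping that satisfies the constraints listed above. The function names `wrap_phi`, `delta_phi`, and `delta_r` are illustrative; they are not taken from the code that produced this dataset.

```python
import numpy as np

def wrap_phi(phi):
    """Wrap angles in radians into [-pi, pi)."""
    return (np.asarray(phi, dtype=float) + np.pi) % (2 * np.pi) - np.pi

def delta_phi(phi1, phi2):
    """Absolute azimuthal separation, always in [0, pi]."""
    return np.abs(wrap_phi(np.asarray(phi1, dtype=float) - np.asarray(phi2, dtype=float)))

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance sqrt(d_eta^2 + d_phi^2), always >= 0."""
    d_eta = np.asarray(eta1, dtype=float) - np.asarray(eta2, dtype=float)
    return np.hypot(d_eta, delta_phi(phi1, phi2))

# Two directions close to the +/- pi seam are near each other, not ~6 rad apart
print(delta_phi(3.0, -3.0))
print(delta_r(0.3, 0.4, 0.0, 0.0))
```

A plain subtraction `phi1 - phi2` would give 6.0 for the first case, which is the typical signature of a missing wrap.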


In [None]:
import math

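phi_cols = [col for col in FEATURES if "phi" in col]

# Accumulate per-column counts chunk by chunk (no in-memory `df` exists)
outside_counts = {col: 0 for col in phi_cols}
files = [f"{f}:Features" for f in root_files]
for chunk in uproot.iterate(files, phi_cols, step_size=50000, library="pd"):
    for col in phi_cols:
        out_of_range = (chunk[col] < -math.pi) | (chunk[col] > math.pi)
        outside_counts[col] += int(out_of_range.sum())

for col, outside in outside_counts.items():
    if outside > 0:
        print(f"{col} has {outside} values outside [-π, π]")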


# Object Multiplicity Consistency

Logical constraints:

- If nMuon = 0 → Muon_pt_0 must be 0
- If nJet < 4 → Jet_pt_3 must be 0

Violations indicate:

- Improper zero-padding
- Misaligned indexing
- Incorrect object extraction

This introduces fake correlations into anomaly detection.
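
The two rules above are instances of one general rule: slot `k` must be empty whenever the object count is at most `k`. The sketch below illustrates this on toy data. `pad_leading` is a hypothetical helper, and zero as the padding value is an assumption taken from the constraints stated in this section.

```python
import numpy as np

def pad_leading(pt_per_event, n_slots, pad_value=0.0):
    """Return (counts, padded), where padded[i, k] is the k-th highest pT in event i."""
    counts = np.array([len(pts) for pts in pt_per_event], dtype=int)
    padded = np.full((len(pt_per_event), n_slots), pad_value, dtype=float)
    for i, pts in enumerate(pt_per_event):
        ordered = sorted(pts, reverse=True)[:n_slots]
        padded[i, :len(ordered)] = ordered
    return counts, padded

# Toy events with 2, 0, and 5 jets
toy_events = [[45.0, 120.0], [], [30.0, 80.0, 60.0, 25.0, 10.0]]
n_jet, jet_pt = pad_leading(toy_events, n_slots=4)

# Consistency rule: a slot may hold a positive pT only if the count exceeds its index
violations = sum(
    int(((n_jet <= k) & (jet_pt[:, k] > 0)).sum()) for k in range(jet_pt.shape[1])
)

print(n_jet)
print(jet_pt)
print("violations:", violations)
```

Applied to the real columns, the same loop would cover `Jet_pt_0` through `Jet_pt_3` and both muon and electron slots instead of only the two spot checks.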


In [None]:
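# Count violations chunk by chunk (no in-memory `df` exists)
muon_bad = jet_bad = 0
cols = ['nMuon', 'Muon_pt_0', 'nJet', 'Jet_pt_3']
files = [f"{f}:Features" for f in root_files]
for chunk in uproot.iterate(files, cols, step_size=50000, library="pd"):
    muon_bad += int(((chunk['nMuon'] == 0) & (chunk['Muon_pt_0'] > 0)).sum())
    jet_bad += int(((chunk['nJet'] < 4) & (chunk['Jet_pt_3'] > 0)).sum())
print("Muon inconsistency count:", muon_bad)
print("Jet inconsistency count:", jet_bad)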


# Derived Feature Recalculation Audit

We recompute HT_ratio and compare with stored value.

Large mismatches indicate:

- Inconsistent computation across files
- Mixing NanoAOD and MiniAOD definitions
- Precision drift

Derived variables should always be recomputed during DCP.


In [None]:
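mismatch = 0
files = [f"{f}:Features" for f in root_files]
for chunk in uproot.iterate(files, ['HT', 'MET_pt', 'HT_ratio'], step_size=50000, library="pd"):
    recalc_ratio = chunk['HT'] / (chunk['HT'] + chunk['MET_pt'] + 1e-9)
    mismatch += int((np.abs(recalc_ratio - chunk['HT_ratio']) > 1e-3).sum())
print("HT_ratio mismatch count:", mismatch)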


In [None]:
import os
import glob
import numpy as np
import uproot
import awkward as ak
from tqdm import tqdm

# ======================================================
# CONFIGURATION
# ======================================================

input_dir = "/kaggle/input/datasets/hiteshrs/cms2016g29-5785"  # same directory as BASE_PATH, which yielded 85 files
output_dir = "/kaggle/working/processed_events"
tree_name = "Features"
batch_size = "100 MB"

os.makedirs(output_dir, exist_ok=True)

# ======================================================
# CLEANING CONTROL VARIABLES
# (Used only to decide which events to remove)
# ======================================================

protected_fields = [
    "Muon_pt_0",
    "LeadLepton_pt",
    "sum_pt_leptons",
    "ST",
    "HT",
    "MET_pt",
    "MT_lep_MET",
    "M_ll",
    "Jet_pt_0",
]

# ======================================================
# CLEANING FUNCTION
# ======================================================

def build_clean_mask(events):
    """Return a boolean mask keeping events whose protected fields are finite and below their upper bound."""

    # Upper bounds per protected field; values at or above the bound are rejected
    upper_limits = {
        "Muon_pt_0": 5000,
        "LeadLepton_pt": 5000,
        "sum_pt_leptons": 10000,
        "ST": 20000,
        "HT": 20000,
        "MET_pt": 5000,
        "MT_lep_MET": 10000,
        "M_ll": 20000,
        "Jet_pt_0": 10000,
    }

    # Start with an all-True mask, one entry per event
    first_field = events.fields[0]
    mask = ak.ones_like(events[first_field], dtype=bool)

    for field in protected_fields:

        # Skip protected fields that this file does not contain
        if field not in events.fields:
            continue

        arr = events[field]

        # Keep only finite values below the upper bound
        mask = mask & np.isfinite(arr)
        mask = mask & (arr < upper_limits[field])

    return mask


# ======================================================
# MAIN PROCESSING LOOP
# ======================================================

root_files = sorted(glob.glob(os.path.join(input_dir, "*.root")))

for filepath in tqdm(root_files, desc="Processing ROOT files"):

    filename = os.path.basename(filepath)
    output_path = os.path.join(output_dir, filename)

    total_before = 0
    total_after = 0

    with uproot.recreate(
        output_path,
        compression=uproot.LZ4(4)
    ) as outfile:

        first_batch = True

        for events in uproot.iterate(
            f"{filepath}:{tree_name}",
            library="ak",          # ✅ Read ALL branches
            step_size=batch_size
        ):

            total_before += len(events)

            # Build mask using only protected physics variables
            mask = build_clean_mask(events)

            cleaned = events[mask]
            total_after += len(cleaned)

            if len(cleaned) == 0:
                continue

            # Write ALL branches (full schema preserved)
            cleaned_dict = {field: cleaned[field] for field in cleaned.fields}

            if first_batch:
                outfile[tree_name] = cleaned_dict
                first_batch = False
            else:
                outfile[tree_name].extend(cleaned_dict)

    print(f"\nFile: {filename}")
    print(f"Original events : {total_before}")
    print(f"Cleaned events  : {total_after}")
    if total_before > 0:
        removed_pct = 100 * (1 - total_after / total_before)
        print(f"Removed         : {removed_pct:.2f}%")

print("\nAll ROOT files cleaned successfully.")


# Data Cleaning Summary (DCP Audit)

This notebook checks:

✔ Missing values  
✔ Infinite values  
✔ Physical validity  
✔ Angular constraints  
✔ Object multiplicity consistency  
✔ Derived feature correctness   

A dataset is considered clean only if:

- All checks return zero violations
- Derived features are consistent
- No schema drift across files

Next steps:
- Remove invalid events
- Recompute derived features
- Apply scaling only after cleaning
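
The ordering in the last step matters because scaling statistics are themselves sensitive to invalid values. A small illustration on toy numbers: the array `raw` and the bound `upper_limit` are made up for this sketch and do not come from the dataset.

```python
import numpy as np

# Toy feature column with one infinite value and one extreme outlier
raw = np.array([10.0, 12.0, 11.0, 9.0, np.inf, 1e9])
upper_limit = 5000.0  # hypothetical bound, in the spirit of the cleaning cell

# Statistics fitted before cleaning are unusable: the mean is inf, the std is NaN
with np.errstate(invalid="ignore"):
    naive_mean, naive_std = raw.mean(), raw.std()

# Clean first, then fit the scaling statistics on the surviving events only
keep = np.isfinite(raw) & (raw < upper_limit)
cleaned = raw[keep]
mean, std = cleaned.mean(), cleaned.std()
scaled = (cleaned - mean) / std

print("before cleaning:", naive_mean, naive_std)
print("after cleaning :", mean, std)
print("scaled         :", scaled)
```

At full dataset scale the mean and standard deviation would be accumulated per chunk rather than computed on one array.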
