# 🧹 EMBER 2024 Data Preprocessing Pipeline

**Run this ONCE to clean and prepare EMBER 2024 data**

### What this does:
1. ✅ Auto-discovers all EMBER 2024 JSONL files (Win32, Win64, Dot_Net, APK, ELF, PDF)
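2. ✅ Loads up to ~5 million samples (targets: 4M train + 1M test) **in 5 batches** (RAM-friendly!); the run logged below, limited to 15 train and 10 test files, loaded 274,000 train and 290,000 test samples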
3. ✅ Cleans data (removes NaN, outliers, unlabeled)
4. ✅ Balances classes (50/50 malware/benign)
5. ✅ Validates quality
6. ✅ Saves cleaned data back to Drive

### Output files:
- `ember2024_train_cleaned.npz` - Clean training data (target: up to 4M samples; 239,092 in the run logged below)
- `ember2024_test_cleaned.npz` - Clean test data (target: up to 1M samples; 227,648 in the run logged below)
- `ember2024_quality_report.json` - Quality metrics

### 💡 RAM Management:
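- Loads files in **5 batches** to keep peak memory down
- Frees per-batch temporaries after each batch (the loaded batches themselves stay in RAM until they are combined)
- The run logged below (274,000 × 2381 `float32` values, about 2.4 GiB) fits the Colab free tier (12GB RAM); a full 4M-row matrix would need about 35 GiB and will not fit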

**Time to run: ~30-60 minutes**  
**Run frequency: Once (or when you get new data)**
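
The RAM budget above can be estimated before loading anything. The following is a minimal sketch, assuming a dense `float32` matrix with 2381 columns (the `feature_size` used in this notebook); the helper name `dense_matrix_gib` is hypothetical and not part of the pipeline.

```python
def dense_matrix_gib(n_samples: int, n_features: int, bytes_per_value: int = 4) -> float:
    """Size in GiB of a dense n_samples x n_features array (float32 by default)."""
    return n_samples * n_features * bytes_per_value / 2**30


# Configured training target vs. the raw row count reported in the logged run
full_train_gib = dense_matrix_gib(4_000_000, 2381)
logged_train_gib = dense_matrix_gib(274_000, 2381)

print(f"4M-row target: {full_train_gib:.1f} GiB")
print(f"274k-row run:  {logged_train_gib:.1f} GiB")
```

Steps that copy the matrix (stacking batches, shuffling by fancy indexing) temporarily need roughly twice this amount, so it is worth leaving headroom.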

## 📦 Setup

In [1]:
# Install required packages
!pip install numpy pandas scikit-learn tqdm -q
print("✅ Packages installed!")

✅ Packages installed!


In [2]:
import json
import numpy as np
import pandas as pd
from pathlib import Path
from collections import Counter, defaultdict
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm
import warnings
import random
warnings.filterwarnings('ignore')

print("✅ Imports successful!")

✅ Imports successful!


## 📁 Mount Google Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')

print("✅ Google Drive mounted!")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Google Drive mounted!


## ⚙️ Configuration

In [4]:
# 🔧 MODIFY THESE PATHS for your setup

# Where your EMBER 2024 JSONL files are
RAW_DATA_PATH = '/content/drive/MyDrive/DIC'

# Where to save cleaned data
CLEANED_DATA_PATH = '/content/drive/MyDrive/DIC_cleaned'

# Processing configuration
CONFIG = {
    # Target sample counts
    'target_train_samples': 4_000_000,  # 4M training samples
    'target_test_samples': 1_000_000,   # 1M test samples

    # File selection (SPEED OPTIMIZATION)
    'max_train_files': 15,  # Use only 15 of the discovered train files (312 in the run below); much faster
    'max_test_files': 10,   # Use only 10 of the discovered test files (25 in the run below)
    # Set either value to None to use ALL files (slower but more diverse)

    # Batch processing (to avoid RAM crashes)
    'n_batches': 5,  # Process in 5 batches

    # Feature extraction
    'feature_size': 2381,  # Standard EMBER feature size

    # Cleaning thresholds
    'remove_outliers': True,
    'z_threshold': 3.0,

    # Class balancing
    'balance_classes': True,
    'balance_method': 'undersample',
    'balance_ratio': 1.0,  # 1.0 = 50/50 split

    # File types to process
    'file_types': ['Win32', 'Win64', 'Dot_Net', 'APK', 'ELF', 'PDF'],

    # Output
    'save_format': 'npz',
    'random_seed': 42,
}

# Create output directory
Path(CLEANED_DATA_PATH).mkdir(exist_ok=True, parents=True)

print(f"📂 Raw data: {RAW_DATA_PATH}")
print(f"📂 Clean data: {CLEANED_DATA_PATH}")
print(f"\n⚙️ Config:")
for k, v in CONFIG.items():
    if k != 'file_types':
        print(f"   {k}: {v}")
print(f"   file_types: {', '.join(CONFIG['file_types'])}")
print(f"\n⚡ Speed mode: Using {CONFIG['max_train_files']} train files & {CONFIG['max_test_files']} test files")
print(f"   (Set to None to use ALL files - slower but more diverse)")

📂 Raw data: /content/drive/MyDrive/DIC
📂 Clean data: /content/drive/MyDrive/DIC_cleaned

⚙️ Config:
   target_train_samples: 4000000
   target_test_samples: 1000000
   max_train_files: 15
   max_test_files: 10
   n_batches: 5
   feature_size: 2381
   remove_outliers: True
   z_threshold: 3.0
   balance_classes: True
   balance_method: undersample
   balance_ratio: 1.0
   save_format: npz
   random_seed: 42
   file_types: Win32, Win64, Dot_Net, APK, ELF, PDF

⚡ Speed mode: Using 15 train files & 10 test files
   (Set to None to use ALL files - slower but more diverse)


## 🔍 Step 1: Discover Files

In [5]:
def discover_ember_files(data_dir, file_types):
    """Auto-discover all EMBER 2024 JSONL files"""
    data_path = Path(data_dir)

    files_by_type = defaultdict(lambda: {'train': [], 'test': []})

    # Pattern: YYYY-MM-DD_YYYY-MM-DD_FileType_subset.jsonl
    for jsonl_file in data_path.glob('*.jsonl'):
        filename = jsonl_file.name

        # Determine file type
        file_type = None
        for ft in file_types:
            if ft in filename:
                file_type = ft
                break

        if not file_type:
            continue

        # Determine subset (train/test)
        if '_train.jsonl' in filename:
            files_by_type[file_type]['train'].append(jsonl_file)
        elif '_test.jsonl' in filename:
            files_by_type[file_type]['test'].append(jsonl_file)

    # Sort files for consistent processing
    for file_type in files_by_type:
        files_by_type[file_type]['train'].sort()
        files_by_type[file_type]['test'].sort()

    return dict(files_by_type)

# Discover all files
print("🔍 Discovering EMBER 2024 files...\n")
discovered_files = discover_ember_files(RAW_DATA_PATH, CONFIG['file_types'])

# Print summary
total_train = 0
total_test = 0
for file_type, files in discovered_files.items():
    n_train = len(files['train'])
    n_test = len(files['test'])
    total_train += n_train
    total_test += n_test
    print(f"📊 {file_type}:")
    print(f"   Train files: {n_train}")
    print(f"   Test files:  {n_test}")

print(f"\n✅ Total: {total_train} train files, {total_test} test files")

🔍 Discovering EMBER 2024 files...

📊 Win32:
   Train files: 52
   Test files:  12
📊 Win64:
   Train files: 52
   Test files:  9
📊 Dot_Net:
   Train files: 52
   Test files:  4
📊 APK:
   Train files: 52
   Test files:  0
📊 ELF:
   Train files: 52
   Test files:  0
📊 PDF:
   Train files: 52
   Test files:  0

✅ Total: 312 train files, 25 test files
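
The discovery step relies on substring checks against the `YYYY-MM-DD_YYYY-MM-DD_FileType_subset.jsonl` pattern. As an alternative, the pattern can be parsed explicitly. This is a sketch only: the regular expression, the `EmberFileName` tuple, and the example filenames are assumptions for illustration, not names taken from the dataset.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: <start date>_<end date>_<file type>_<train|test>.jsonl
FILENAME_RE = re.compile(
    r"^(?P<start>\d{4}-\d{2}-\d{2})_(?P<end>\d{4}-\d{2}-\d{2})"
    r"_(?P<file_type>.+)_(?P<subset>train|test)\.jsonl$"
)


class EmberFileName(NamedTuple):
    start: str
    end: str
    file_type: str
    subset: str


def parse_ember_filename(name: str) -> Optional[EmberFileName]:
    """Return the parsed parts of a filename, or None if it does not match."""
    match = FILENAME_RE.match(name)
    return EmberFileName(**match.groupdict()) if match else None


# Hypothetical filenames; file types containing underscores (Dot_Net) still parse
print(parse_ember_filename("2024-01-07_2024-01-13_Dot_Net_test.jsonl"))
print(parse_ember_filename("notes.jsonl"))
```

Parsing the whole name avoids accidental matches when a file type string appears elsewhere in the filename.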


## 📥 Step 2: Load and Extract Features

In [6]:
def extract_ember2024_features(data):
    """Extract 2381 features from EMBER 2024 data

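    EMBER standard feature groups:
    - ByteHistogram: 256
    - ByteEntropyHistogram: 256
    - StringExtractor: 104
    - GeneralFileInfo: 10
    - HeaderFileInfo: 62
    - SectionInfo: 255
    - ImportsInfo: 1280
    - ExportsInfo: 128
    - DataDirectories: 30
    Total: 2381 features

    Note: this is a simplified extractor. Only a few fields per group are
    read; the remaining positions are zero-padded to reach 2381.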
    """
    features = []

    # 1. Byte histogram (256)
    histogram = data.get('histogram', [0]*256)
    features.extend(histogram[:256])

    # 2. Byte entropy histogram (256)
    byteentropy = data.get('byteentropy', [0]*256)
    features.extend(byteentropy[:256])

    # 3. String features (~104)
    strings_dict = data.get('strings', {})
    if isinstance(strings_dict, dict):
        features.extend([
            strings_dict.get('numstrings', 0),
            strings_dict.get('avlength', 0),
            strings_dict.get('printables', 0),
            strings_dict.get('entropy', 0),
        ])
        # Add histogram of string lengths (100 bins)
        hist = strings_dict.get('hist', [0]*100)
        features.extend(hist[:100] if isinstance(hist, list) else [0]*100)
    else:
        features.extend([0]*104)

    # 4. General file info (~10)
    general = data.get('general', {})
    if isinstance(general, dict):
        features.extend([
            general.get('size', 0),
            general.get('vsize', 0),
            general.get('has_debug', 0),
            general.get('exports', 0),
            general.get('imports', 0),
            general.get('has_relocations', 0),
            general.get('has_resources', 0),
            general.get('has_signature', 0),
            general.get('has_tls', 0),
            general.get('symbols', 0),
        ])
    else:
        features.extend([0]*10)

    # 5. Header info (~62)
    header = data.get('header', {})
    if isinstance(header, dict):
        # Simplified - add key header fields
        features.extend([
            header.get('timestamp', 0),
            header.get('machine', 0),
            header.get('characteristics', 0),
        ])
        # Pad remaining
        features.extend([0]*59)
    else:
        features.extend([0]*62)

    # 6-9. Sections, Imports, Exports, DataDirectories
    # Simplified: flatten whatever numeric values are available;
    # everything else is zero-padded below to reach 2381

    # Add any list/dict features as flattened values
    for key in ['section', 'imports', 'exports']:
        obj = data.get(key, {})
        if isinstance(obj, dict):
            for v in list(obj.values())[:50]:  # Limit per section
                if isinstance(v, (int, float)):
                    features.append(v)
                elif isinstance(v, list):
                    features.extend([x for x in v if isinstance(x, (int, float))][:20])

    # Pad to exactly 2381 features
    target_size = 2381
    if len(features) < target_size:
        features.extend([0] * (target_size - len(features)))

    return features[:target_size]


def load_ember2024_jsonl(files, target_samples=None, subset='train', seed=42):
    """Load EMBER 2024 JSONL files with reservoir sampling

    Args:
        files: List of file paths
        target_samples: Target number of samples (None = all)
        subset: 'train' or 'test'
        seed: Random seed
    """
    rng = random.Random(seed)
    np.random.seed(seed)

    X_chunks = []
    y_chunks = []

    # Calculate samples per file if target is set
    if target_samples and len(files) > 0:
        samples_per_file = max(1, target_samples // len(files))
    else:
        samples_per_file = None

    total_loaded = 0

    for file_path in tqdm(files, desc=f"Loading {subset} files"):
        file_X = []
        file_y = []
        kept = 0
        seen = 0

        with open(file_path, 'r') as f:
            for line in f:
                try:
                    data = json.loads(line)

                    # Get label
                    label = data.get('label', -1)

                    # For training, skip unlabeled
                    if subset == 'train' and label == -1:
                        continue

                    seen += 1

                    # Extract features
                    features = extract_ember2024_features(data)

                    # Reservoir sampling if needed
                    if samples_per_file is None:
                        file_X.append(features)
                        file_y.append(label)
                        kept += 1
                    else:
                        if kept < samples_per_file:
                            file_X.append(features)
                            file_y.append(label)
                            kept += 1
                        else:
                            j = rng.randrange(seen)
                            if j < samples_per_file:
                                file_X[j] = features
                                file_y[j] = label

                # Skip malformed records; TypeError covers fields that are null or of an unexpected type
                except (json.JSONDecodeError, KeyError, ValueError, TypeError):
                    continue

        if file_X:
            X_chunks.append(np.array(file_X, dtype=np.float32))
            y_chunks.append(np.array(file_y, dtype=np.int32))
            total_loaded += len(file_X)

    # Combine chunks
    if X_chunks:
        X = np.concatenate(X_chunks)
        y = np.concatenate(y_chunks)

        # Shuffle
        idx = np.arange(len(X))
        np.random.shuffle(idx)
        X = X[idx]
        y = y[idx]

        # Trim to target if over
        if target_samples and len(X) > target_samples:
            X = X[:target_samples]
            y = y[:target_samples]
    else:
        X = np.array([], dtype=np.float32).reshape(0, 2381)
        y = np.array([], dtype=np.int32)

    del X_chunks, y_chunks

    return X, y


def select_subset_of_files(all_files, max_files, seed=42):
    """Select evenly distributed subset of files

    Args:
        all_files: List of all file paths
        max_files: Maximum number of files to select (None = use all)
        seed: Random seed

    Returns:
        Selected subset of files, evenly distributed
    """
    if max_files is None or max_files >= len(all_files):
        return all_files

    # Select evenly spaced indices
    total_files = len(all_files)
    indices = np.linspace(0, total_files - 1, max_files, dtype=int)

    # Add some randomness while keeping distribution
    np.random.seed(seed)
    jitter = np.random.randint(-1, 2, size=len(indices))  # -1, 0, or 1
    indices = np.clip(indices + jitter, 0, total_files - 1)
    indices = np.unique(indices)  # Remove duplicates

    # If we lost files due to deduplication, add random ones
    if len(indices) < max_files:
        remaining = set(range(total_files)) - set(indices)
        additional = np.random.choice(list(remaining), max_files - len(indices), replace=False)
        indices = np.concatenate([indices, additional])

    selected_files = [all_files[i] for i in sorted(indices)]

    return selected_files


print("✅ Feature extraction functions ready!")

✅ Feature extraction functions ready!
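
The loader keeps at most `samples_per_file` records per file using reservoir sampling. The idea is easier to see in isolation. Below is a minimal sketch of the same scheme (Algorithm R) on a generic stream; the function name `reservoir_sample` is hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for seen, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            # Fill phase: the first k items are always kept
            reservoir.append(item)
        else:
            # Replace phase: item number `seen` survives with probability k / seen
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), k=10)
print(len(sample))
```

When a stream holds fewer than `k` items, every item is kept, which is why a file with fewer labelled records than its quota contributes all of them.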


In [7]:
# Load training data in 5 batches to avoid RAM crashes
print("\n📥 Loading training data in 5 batches...\n")

# Collect all train files
all_train_files = []
for file_type, files in discovered_files.items():
    all_train_files.extend(files['train'])

print(f"Total train files available: {len(all_train_files)}")

# Select subset of files (SPEED BOOST!)
selected_train_files = select_subset_of_files(
    all_train_files,
    CONFIG['max_train_files'],
    seed=CONFIG['random_seed']
)

print(f"Using {len(selected_train_files)} files for training (evenly distributed)")
print(f"⚡ Speed boost: {len(all_train_files) - len(selected_train_files)} fewer files to process!\n")

# Split into batches (count taken from CONFIG instead of being hardcoded)
n_batches = CONFIG['n_batches']
batch_size = len(selected_train_files) // n_batches
samples_per_batch = CONFIG['target_train_samples'] // n_batches

train_batches = []
for i in range(n_batches):
    start_idx = i * batch_size
    end_idx = start_idx + batch_size if i < n_batches - 1 else len(selected_train_files)
    train_batches.append(selected_train_files[start_idx:end_idx])

# Process each batch
X_train_chunks = []
y_train_chunks = []

for batch_num, batch_files in enumerate(train_batches, 1):
    print(f"\n🔄 Processing batch {batch_num}/{n_batches} ({len(batch_files)} files)...")

    X_batch, y_batch = load_ember2024_jsonl(
        batch_files,
        target_samples=samples_per_batch,
        subset='train',
        seed=CONFIG['random_seed'] + batch_num
    )

    X_train_chunks.append(X_batch)
    y_train_chunks.append(y_batch)

    print(f"   ✅ Batch {batch_num}: {X_batch.shape[0]:,} samples")
    print(f"   💾 RAM: Batch stored, clearing variables...")

    # Clear batch variables to free RAM
    del X_batch, y_batch
    import gc
    gc.collect()

# Combine all batches
print(f"\n🔗 Combining {len(X_train_chunks)} batches...")
X_train_raw = np.vstack(X_train_chunks)
y_train_raw = np.concatenate(y_train_chunks)

# Clear chunks
del X_train_chunks, y_train_chunks
import gc
gc.collect()

# Shuffle combined data
print("🔀 Shuffling combined data...")
idx = np.arange(len(X_train_raw))
np.random.shuffle(idx)
X_train_raw = X_train_raw[idx]
y_train_raw = y_train_raw[idx]

print(f"\n✅ Total training data: {X_train_raw.shape[0]:,} samples, {X_train_raw.shape[1]} features")
print(f"   Label distribution: {Counter(y_train_raw)}")


📥 Loading training data in 5 batches...

Total train files available: 312
Using 15 files for training (evenly distributed)
⚡ Speed boost: 297 fewer files to process!


🔄 Processing batch 1/5 (3 files)...


Loading train files:   0%|          | 0/3 [00:00<?, ?it/s]

   ✅ Batch 1: 180,000 samples
   💾 RAM: Batch stored, clearing variables...

🔄 Processing batch 2/5 (3 files)...


Loading train files:   0%|          | 0/3 [00:00<?, ?it/s]

   ✅ Batch 2: 50,000 samples
   💾 RAM: Batch stored, clearing variables...

🔄 Processing batch 3/5 (3 files)...


Loading train files:   0%|          | 0/3 [00:00<?, ?it/s]

   ✅ Batch 3: 28,000 samples
   💾 RAM: Batch stored, clearing variables...

🔄 Processing batch 4/5 (3 files)...


Loading train files:   0%|          | 0/3 [00:00<?, ?it/s]

   ✅ Batch 4: 10,000 samples
   💾 RAM: Batch stored, clearing variables...

🔄 Processing batch 5/5 (3 files)...


Loading train files:   0%|          | 0/3 [00:00<?, ?it/s]

   ✅ Batch 5: 6,000 samples
   💾 RAM: Batch stored, clearing variables...

🔗 Combining 5 batches...
🔀 Shuffling combined data...

✅ Total training data: 274,000 samples, 2381 features
   Label distribution: Counter({np.int32(0): 137000, np.int32(1): 137000})
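
The gap between the 4M target and the 274,000 rows actually loaded follows from how the quota is divided. The arithmetic below is a sketch using the values from the configuration and the log above; it only restates numbers already shown.

```python
target_train_samples = 4_000_000
n_batches = 5
files_per_batch = 3  # 15 selected files split into 5 batches, as in the log

# Same integer divisions the loading cell and loader perform
samples_per_batch = target_train_samples // n_batches
samples_per_file = max(1, samples_per_batch // files_per_batch)

# Batch sizes reported in the output above
logged_batch_sizes = [180_000, 50_000, 28_000, 10_000, 6_000]
total_loaded = sum(logged_batch_sizes)

print(f"Quota per batch: {samples_per_batch:,}")
print(f"Quota per file:  {samples_per_file:,}")
print(f"Loaded in total: {total_loaded:,}")
```

Every batch came in below its quota, so the per-file reservoirs never filled: the selected files simply hold fewer labelled samples than requested. Reaching the 4M target would require more files (a larger `max_train_files`, or `None`).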


In [8]:
# Load test data in 5 batches to avoid RAM crashes
print("\n📥 Loading test data in 5 batches...\n")

# Collect all test files
all_test_files = []
for file_type, files in discovered_files.items():
    all_test_files.extend(files['test'])

print(f"Total test files available: {len(all_test_files)}")

# Select subset of files (SPEED BOOST!)
selected_test_files = select_subset_of_files(
    all_test_files,
    CONFIG['max_test_files'],
    seed=CONFIG['random_seed']
)

print(f"Using {len(selected_test_files)} files for testing (evenly distributed)")
print(f"⚡ Speed boost: {len(all_test_files) - len(selected_test_files)} fewer files to process!\n")

# Split into batches (count taken from CONFIG instead of being hardcoded)
n_batches = CONFIG['n_batches']
batch_size = len(selected_test_files) // n_batches
samples_per_batch = CONFIG['target_test_samples'] // n_batches

test_batches = []
for i in range(n_batches):
    start_idx = i * batch_size
    end_idx = start_idx + batch_size if i < n_batches - 1 else len(selected_test_files)
    test_batches.append(selected_test_files[start_idx:end_idx])

# Process each batch
X_test_chunks = []
y_test_chunks = []

for batch_num, batch_files in enumerate(test_batches, 1):
    print(f"\n🔄 Processing batch {batch_num}/{n_batches} ({len(batch_files)} files)...")

    X_batch, y_batch = load_ember2024_jsonl(
        batch_files,
        target_samples=samples_per_batch,
        subset='test',
        seed=CONFIG['random_seed'] + batch_num
    )

    X_test_chunks.append(X_batch)
    y_test_chunks.append(y_batch)

    print(f"   ✅ Batch {batch_num}: {X_batch.shape[0]:,} samples")
    print(f"   💾 RAM: Batch stored, clearing variables...")

    # Clear batch variables to free RAM
    del X_batch, y_batch
    import gc
    gc.collect()

# Combine all batches
print(f"\n🔗 Combining {len(X_test_chunks)} batches...")
X_test_raw = np.vstack(X_test_chunks)
y_test_raw = np.concatenate(y_test_chunks)

# Clear chunks
del X_test_chunks, y_test_chunks
import gc
gc.collect()

# Shuffle combined data
print("🔀 Shuffling combined data...")
idx = np.arange(len(X_test_raw))
np.random.shuffle(idx)
X_test_raw = X_test_raw[idx]
y_test_raw = y_test_raw[idx]

print(f"\n✅ Total test data: {X_test_raw.shape[0]:,} samples, {X_test_raw.shape[1]} features")
print(f"   Label distribution: {Counter(y_test_raw)}")


📥 Loading test data in 5 batches...

Total test files available: 25
Using 10 files for testing (evenly distributed)
⚡ Speed boost: 15 fewer files to process!


🔄 Processing batch 1/5 (2 files)...


Loading test files:   0%|          | 0/2 [00:00<?, ?it/s]

   ✅ Batch 1: 120,000 samples
   💾 RAM: Batch stored, clearing variables...

🔄 Processing batch 2/5 (2 files)...


Loading test files:   0%|          | 0/2 [00:00<?, ?it/s]

   ✅ Batch 2: 80,000 samples
   💾 RAM: Batch stored, clearing variables...

🔄 Processing batch 3/5 (2 files)...


Loading test files:   0%|          | 0/2 [00:00<?, ?it/s]

   ✅ Batch 3: 40,000 samples
   💾 RAM: Batch stored, clearing variables...

🔄 Processing batch 4/5 (2 files)...


Loading test files:   0%|          | 0/2 [00:00<?, ?it/s]

   ✅ Batch 4: 30,000 samples
   💾 RAM: Batch stored, clearing variables...

🔄 Processing batch 5/5 (2 files)...


Loading test files:   0%|          | 0/2 [00:00<?, ?it/s]

   ✅ Batch 5: 20,000 samples
   💾 RAM: Batch stored, clearing variables...

🔗 Combining 5 batches...
🔀 Shuffling combined data...

✅ Total test data: 290,000 samples, 2381 features
   Label distribution: Counter({np.int32(1): 145000, np.int32(0): 145000})
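
Both loading cells choose their files with `select_subset_of_files`, whose core idea is evenly spaced indices over the sorted file list. The sketch below shows only that core and deliberately omits the ±1 jitter the notebook adds; `evenly_spaced_subset` and the placeholder file names are hypothetical.

```python
import numpy as np


def evenly_spaced_subset(items, max_items):
    """Pick max_items entries spread evenly across items (all of them if max_items is None)."""
    if max_items is None or max_items >= len(items):
        return list(items)
    # Evenly spaced positions from the first to the last index, truncated to integers
    indices = np.linspace(0, len(items) - 1, max_items, dtype=int)
    return [items[i] for i in indices]


# Placeholder names standing in for the 25 discovered test files
files = [f"file_{i:02d}" for i in range(25)]
picked = evenly_spaced_subset(files, 10)
print(picked)
```

Because the list is ordered by file type and then by date, even spacing samples across both, which matters here since some file types have no test files at all.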


## 🧹 Step 3: Clean Data

In [9]:
class DataCleaner:
    """Clean and validate malware dataset"""

    def __init__(self, verbose=True):
        self.verbose = verbose
        self.stats = {
            'initial_samples': 0,
            'removed_nan': 0,
            'removed_outliers': 0,
            'removed_unlabeled': 0,
            'final_samples': 0
        }

    def remove_nan(self, X, y):
        """Remove rows with NaN values"""
        mask = ~np.isnan(X).any(axis=1)
        self.stats['removed_nan'] = (~mask).sum()
        return X[mask], y[mask]

    def remove_outliers(self, X, y, z_threshold=3.0, batch_size=10000):
        """Remove outliers using z-score - memory efficient"""
        # Calculate stats once
        mean = X.mean(axis=0)
        std = X.std(axis=0) + 1e-10

        # Process in batches to avoid RAM spike
        n_samples = X.shape[0]
        mask = np.ones(n_samples, dtype=bool)

        for start_idx in range(0, n_samples, batch_size):
            end_idx = min(start_idx + batch_size, n_samples)
            batch = X[start_idx:end_idx]

            z_scores_batch = np.abs((batch - mean) / std)
            mask[start_idx:end_idx] = (z_scores_batch < z_threshold).all(axis=1)

        self.stats['removed_outliers'] = (~mask).sum()
        return X[mask], y[mask]

    def remove_unlabeled(self, X, y):
        """Remove samples with label = -1"""
        mask = y != -1
        self.stats['removed_unlabeled'] = (~mask).sum()
        return X[mask], y[mask]

    def balance_classes(self, X, y, method='undersample', ratio=1.0):
        """Balance class distribution

        Args:
            method: 'undersample' or 'oversample'
            ratio: target ratio of minority/majority (1.0 = equal)
        """
        unique, counts = np.unique(y, return_counts=True)

        if len(unique) != 2:
            if self.verbose:
                print(f"⚠️  Expected 2 classes, found {len(unique)}. Skipping balancing.")
            return X, y

        minority_class = unique[np.argmin(counts)]
        majority_class = unique[np.argmax(counts)]
        n_minority = counts.min()
        n_majority = counts.max()

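        # Majority size that gives minority/majority == ratio, capped at what is available
        target_majority = min(n_majority, int(n_minority / ratio))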

        if method == 'undersample':
            # Undersample majority class
            idx_minority = np.where(y == minority_class)[0]
            idx_majority = np.where(y == majority_class)[0]

            idx_majority_sampled = np.random.choice(
                idx_majority, size=target_majority, replace=False
            )

            idx_balanced = np.concatenate([idx_minority, idx_majority_sampled])
            np.random.shuffle(idx_balanced)

            return X[idx_balanced], y[idx_balanced]

        elif method == 'oversample':
            # Oversample minority class
            idx_minority = np.where(y == minority_class)[0]
            idx_majority = np.where(y == majority_class)[0]

            # Grow the minority class until minority/majority == ratio
            n_oversample = max(0, int(n_majority * ratio) - n_minority)
            idx_minority_oversampled = np.random.choice(
                idx_minority, size=n_oversample, replace=True
            )

            X_balanced = np.vstack([X, X[idx_minority_oversampled]])
            y_balanced = np.concatenate([y, y[idx_minority_oversampled]])

            # Shuffle
            idx_shuffle = np.arange(len(y_balanced))
            np.random.shuffle(idx_shuffle)

            return X_balanced[idx_shuffle], y_balanced[idx_shuffle]

        return X, y

    def clean(self, X, y, config):
        """Run full cleaning pipeline"""
        self.stats['initial_samples'] = len(X)

        if self.verbose:
            print(f"\n🧹 Cleaning {len(X):,} samples...")

        # Remove NaN
        X, y = self.remove_nan(X, y)
        if self.verbose and self.stats['removed_nan'] > 0:
            print(f"   Removed {self.stats['removed_nan']:,} NaN samples")

        # Remove unlabeled
        X, y = self.remove_unlabeled(X, y)
        if self.verbose and self.stats['removed_unlabeled'] > 0:
            print(f"   Removed {self.stats['removed_unlabeled']:,} unlabeled samples")

        # Remove outliers
        if config.get('remove_outliers', True):
            X, y = self.remove_outliers(X, y, config.get('z_threshold', 3.0))
            if self.verbose and self.stats['removed_outliers'] > 0:
                print(f"   Removed {self.stats['removed_outliers']:,} outlier samples")

        # Balance classes
        if config.get('balance_classes', True):
            before_balance = len(X)
            X, y = self.balance_classes(
                X, y,
                method=config.get('balance_method', 'undersample'),
                ratio=config.get('balance_ratio', 1.0)
            )
            if self.verbose:
                print(f"   Balanced classes: {before_balance:,} → {len(X):,} samples")

        self.stats['final_samples'] = len(X)

        if self.verbose:
            print(f"\n✅ Cleaning complete!")
            print(f"   Final: {len(X):,} samples")
            print(f"   Class distribution: {Counter(y)}")

        return X, y

print("✅ DataCleaner ready!")

✅ DataCleaner ready!
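
The outlier rule in `remove_outliers` drops a row as soon as any single feature lies `z_threshold` or more standard deviations from its column mean. A toy example makes the behaviour concrete. This is a sketch on synthetic data; `zscore_keep_mask` is a hypothetical helper, not part of `DataCleaner`.

```python
import numpy as np


def zscore_keep_mask(X, z_threshold=3.0, eps=1e-10):
    """Boolean mask of rows whose features all lie within z_threshold standard deviations."""
    mean = X.mean(axis=0)
    std = X.std(axis=0) + eps  # eps avoids division by zero for constant columns
    z_scores = np.abs((X - mean) / std)
    return (z_scores < z_threshold).all(axis=1)


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
X[0, 1] = 50.0  # plant one extreme value in the first row

mask = zscore_keep_mask(X)
print(f"Kept {mask.sum()} of {len(mask)} rows; first row kept: {mask[0]}")
```

Constant columns (such as the zero-padded positions) have a z-score of 0 and never trigger removal. With 2381 features, though, one extreme value anywhere in a row is enough to drop it, so the rule can be strict on real data.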


In [10]:
# Clean training data
cleaner_train = DataCleaner(verbose=True)
X_train_clean, y_train_clean = cleaner_train.clean(X_train_raw, y_train_raw, CONFIG)

# Free up memory from raw training data
print("\n💾 Clearing raw training data from RAM...")
del X_train_raw, y_train_raw
import gc
gc.collect()

# Clean test data
cleaner_test = DataCleaner(verbose=True)
X_test_clean, y_test_clean = cleaner_test.clean(X_test_raw, y_test_raw, CONFIG)

# Free up memory from raw test data
print("\n💾 Clearing raw test data from RAM...")
del X_test_raw, y_test_raw
gc.collect()

print("\n✅ All cleaning complete, raw data cleared from memory!")


🧹 Cleaning 274,000 samples...
   Removed 34,124 outlier samples
   Balanced classes: 239,876 → 239,092 samples

✅ Cleaning complete!
   Final: 239,092 samples
   Class distribution: Counter({np.int32(1): 119546, np.int32(0): 119546})

💾 Clearing raw training data from RAM...

🧹 Cleaning 290,000 samples...
   Removed 42,848 outlier samples
   Balanced classes: 247,152 → 227,648 samples

✅ Cleaning complete!
   Final: 227,648 samples
   Class distribution: Counter({np.int32(1): 113824, np.int32(0): 113824})

💾 Clearing raw test data from RAM...

✅ All cleaning complete, raw data cleared from memory!
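
In the log above, balancing reduces the training set from 239,876 to 239,092 rows, that is 2 × 119,546: the majority class is cut down to the minority size. The sketch below reproduces that undersampling step on a toy label vector. It assumes a 1:1 target ratio and uses NumPy's `Generator` API rather than the global `np.random` state used in the notebook; `undersample_to_balance` is a hypothetical name.

```python
import numpy as np


def undersample_to_balance(y, seed=42):
    """Return shuffled row indices that keep the same number of samples per class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # Draw n_keep indices per class without replacement
    kept = [
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ]
    idx = np.concatenate(kept)
    rng.shuffle(idx)
    return idx


y = np.array([0] * 70 + [1] * 30)
idx = undersample_to_balance(y)
print(np.unique(y[idx], return_counts=True))
```

The returned indices can be applied to both `X` and `y`, which keeps features and labels aligned.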


## 💾 Step 4: Save Cleaned Data

In [11]:
# Save as NPZ
train_path = Path(CLEANED_DATA_PATH) / 'ember2024_train_cleaned.npz'
test_path = Path(CLEANED_DATA_PATH) / 'ember2024_test_cleaned.npz'

print("\n💾 Saving cleaned data...")

np.savez_compressed(
    train_path,
    X=X_train_clean,
    y=y_train_clean,
    feature_size=CONFIG['feature_size']
)

np.savez_compressed(
    test_path,
    X=X_test_clean,
    y=y_test_clean,
    feature_size=CONFIG['feature_size']
)

print(f"✅ Saved train data: {train_path}")
print(f"✅ Saved test data: {test_path}")


💾 Saving cleaned data...
✅ Saved train data: /content/drive/MyDrive/DIC_cleaned/ember2024_train_cleaned.npz
✅ Saved test data: /content/drive/MyDrive/DIC_cleaned/ember2024_test_cleaned.npz
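
The saved archives hold three arrays under the keys `X`, `y`, and `feature_size`. The following sketch shows a round trip with a small stand-in array written to a temporary directory, so it runs without Google Drive; the file name `demo_cleaned.npz` is a placeholder.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_cleaned.npz"

    # Stand-in data with the same keys and dtypes the notebook saves
    X_demo = np.zeros((4, 2381), dtype=np.float32)
    y_demo = np.array([0, 1, 0, 1], dtype=np.int32)
    np.savez_compressed(path, X=X_demo, y=y_demo, feature_size=2381)

    # np.load returns a lazy archive; read the arrays before the file is closed
    with np.load(path) as data:
        X_loaded = data["X"]
        y_loaded = data["y"]
        feature_size = int(data["feature_size"])

print(X_loaded.shape, y_loaded.shape, feature_size)
```

To read the real output, point `path` at `ember2024_train_cleaned.npz` or `ember2024_test_cleaned.npz` in the cleaned-data folder.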


## 📊 Step 5: Quality Report

In [13]:
# Generate quality report
def convert_to_native(obj):
    """Convert numpy types to native Python types"""
    if isinstance(obj, dict):
        return {convert_to_native(k): convert_to_native(v) for k, v in obj.items()}
    elif isinstance(obj, (list, tuple)):
        return [convert_to_native(item) for item in obj]
    elif isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    return obj

quality_report = {
    'dataset': 'EMBER 2024',
    'config': convert_to_native(CONFIG),
    'train': {
        'shape': [int(x) for x in X_train_clean.shape],
        'samples': int(X_train_clean.shape[0]),
        'features': int(X_train_clean.shape[1]),
        'class_distribution': {int(k): int(v) for k, v in Counter(y_train_clean).items()},
        'cleaning_stats': convert_to_native(cleaner_train.stats)
    },
    'test': {
        'shape': [int(x) for x in X_test_clean.shape],
        'samples': int(X_test_clean.shape[0]),
        'features': int(X_test_clean.shape[1]),
        'class_distribution': {int(k): int(v) for k, v in Counter(y_test_clean).items()},
        'cleaning_stats': convert_to_native(cleaner_test.stats)
    }
}

# Save report
report_path = Path(CLEANED_DATA_PATH) / 'ember2024_quality_report.json'
with open(report_path, 'w') as f:
    json.dump(quality_report, f, indent=2)

print("\n📊 Quality Report:")
print(json.dumps(quality_report, indent=2))
print(f"\n✅ Report saved: {report_path}")


📊 Quality Report:
{
  "dataset": "EMBER 2024",
  "config": {
    "target_train_samples": 4000000,
    "target_test_samples": 1000000,
    "max_train_files": 15,
    "max_test_files": 10,
    "n_batches": 5,
    "feature_size": 2381,
    "remove_outliers": true,
    "z_threshold": 3.0,
    "balance_classes": true,
    "balance_method": "undersample",
    "balance_ratio": 1.0,
    "file_types": [
      "Win32",
      "Win64",
      "Dot_Net",
      "APK",
      "ELF",
      "PDF"
    ],
    "save_format": "npz",
    "random_seed": 42
  },
  "train": {
    "shape": [
      239092,
      2381
    ],
    "samples": 239092,
    "features": 2381,
    "class_distribution": {
      "1": 119546,
      "0": 119546
    },
    "cleaning_stats": {
      "initial_samples": 274000,
      "removed_nan": 0,
      "removed_outliers": 34124,
      "removed_unlabeled": 0,
      "final_samples": 239092
    }
  },
  "test": {
    "shape": [
      227648,
      2381
    ],
    "samples": 227648,
    "feature

## ✅ Complete!

Your EMBER 2024 data is now cleaned and ready to use!

### Next steps:
1. Use the cleaned NPZ files in your ML models
2. Check the quality report for data statistics
3. If you need to reprocess, just rerun this notebook
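
Before using the data, the quality report can be checked for internal consistency. The sketch below applies a few such checks to the `train` section printed above. The `check_split` helper is hypothetical, and the last check assumes undersampling, where balancing can only remove rows.

```python
import json

# The "train" section of the quality report, copied from the output above
train_report = json.loads("""
{
  "shape": [239092, 2381],
  "samples": 239092,
  "features": 2381,
  "class_distribution": {"1": 119546, "0": 119546},
  "cleaning_stats": {
    "initial_samples": 274000,
    "removed_nan": 0,
    "removed_outliers": 34124,
    "removed_unlabeled": 0,
    "final_samples": 239092
  }
}
""")


def check_split(split: dict) -> list:
    """Return a list of consistency problems found in one report section."""
    problems = []
    stats = split["cleaning_stats"]
    if split["shape"] != [split["samples"], split["features"]]:
        problems.append("shape does not match samples/features")
    if sum(split["class_distribution"].values()) != split["samples"]:
        problems.append("class counts do not add up to samples")
    if stats["final_samples"] != split["samples"]:
        problems.append("final_samples does not match samples")
    after_filters = (
        stats["initial_samples"]
        - stats["removed_nan"]
        - stats["removed_unlabeled"]
        - stats["removed_outliers"]
    )
    if stats["final_samples"] > after_filters:
        problems.append("more rows than survived the filters")
    return problems


problems = check_split(train_report)
print(problems or "report section is consistent")
```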