# LSTM Architecture: RGB Hilbert 32×32 – 6-Class Training (Few-Shot Holdout)

## Sequential Learning Setup
**Objective**: Train an LSTM on the same 6-class RGB Hilbert dataset used for ViT/CNN, holding out 3 classes for few-shot experiments.

**Dataset**: RGB Hilbert 32×32 (3 channels)
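- Training on 6 classes (12,000 samples each, 72,000 total): Benign_Final, DDoS-SYN_Flood, DictionaryBruteForce, DoS-TCP_Flood, Mirai-udpplain, SqlInjection
- Held out: DDoS-HTTP_Flood, DoS-UDP_Flood, Recon-PortScan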

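**Input**: Treat each 32×32 RGB image as a sequence of 32 timesteps (one per image row) with 96 features per step (32 columns × 3 channels)  
**Architecture**: 2-layer unidirectional Long Short-Term Memory (LSTM) network, followed by 8-head self-attention, mean pooling over timesteps, and an MLP classifier  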

**Questions**:
1. Does sequential modeling improve over CNN/ViT on Hilbert-encoded data?
2. How do the row-by-row sequential features the LSTM extracts from RGB Hilbert images compare with the spatial features learned by the CNN and ViT?
3. Does the LSTM generalize similarly to held-out classes?
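
The row-as-timestep framing described under **Input** can be illustrated with a short NumPy sketch. It assumes each sample is stored as a flat 3,072-value vector in row-major height × width × channel order. The notebook does not state the layout of `image_data`, so that ordering and the names `img`, `flat`, and `seq` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical 32x32 RGB image in (height, width, channel) order.
img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)

# Flatten the way a stored sample is assumed to look: one 3,072-value vector.
flat = img.reshape(-1)

# Reshape into the LSTM input: 32 timesteps x 96 features.
seq = flat.reshape(32, 96)

# Under the assumed layout, timestep t holds image row t:
# 32 pixels x 3 channels = 96 interleaved values.
row = 5
print(seq.shape)
print(np.array_equal(seq[row], img[row].reshape(-1)))
```

If the pixels follow a Hilbert curve, one image row does not correspond to a contiguous span of the original byte stream, so "timestep" here means image row rather than packet time.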


In [1]:
# Environment Setup and Configuration
import os
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import json
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Configuration  
CONFIG = {
    'data_path': '/home/ubuntu/Cyber_AI/ai-cyber/notebooks/ViT-experiment/pcap-dataset-samples/parquet/rgb_hilbert_32x32/',
    'test_size': 0.2,
    'val_size': 0.2,
    'random_state': 42,
    'batch_size': 64,
    'learning_rate': 0.001,  # Higher LR for LSTM
    'epochs': 50,
    'patience': 7,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',
    'num_workers': 4,
    # LSTM-specific config for RGB Hilbert
    'sequence_length': 32,     # 32 time steps (rows)
    'input_features': 96,      # 32 columns × 3 channels
    'hidden_size': 128,        # LSTM hidden dimension
    'num_layers': 2,           # Stacked LSTM layers
    'dropout': 0.3,            # Dropout for regularization
    'num_classes': 6
}

# 6-class training with 3 held-out classes (few-shot)
HELD_OUT_CLASSES = ['DDoS-HTTP_Flood', 'DoS-UDP_Flood', 'Recon-PortScan']

print("🔄 LSTM SEQUENTIAL ARCHITECTURE INITIALIZED (RGB Hilbert 6-class)")
print("📋 Notebook: LSTM_Prototype_rgb_hilbert_32x32_6_class.ipynb")
print(f"📊 Device: {CONFIG['device']}")
print(f"📊 Dataset: RGB Hilbert 32×32 (6 training classes; held out: {HELD_OUT_CLASSES})")
print(f"📊 Sequence format: {CONFIG['sequence_length']} timesteps × {CONFIG['input_features']} features")


🔄 LSTM SEQUENTIAL ARCHITECTURE INITIALIZED (RGB Hilbert 6-class)
📋 Notebook: LSTM_Prototype_rgb_hilbert_32x32_6_class.ipynb
📊 Device: cpu
📊 Dataset: RGB Hilbert 32×32 (6 training classes; held out: ['DDoS-HTTP_Flood', 'DoS-UDP_Flood', 'Recon-PortScan'])
📊 Sequence format: 32 timesteps × 96 features


In [2]:
# Multi-Layer LSTM Architecture for RGB Hilbert Sequential Analysis
class MultiLayerLSTM(nn.Module):
    def __init__(self, input_size=96, hidden_size=128, num_layers=2, num_classes=6, dropout=0.3):
        super().__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.num_classes = num_classes
        
        # LSTM layers
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,  # (batch, seq, feature)
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=False
        )
        
        # Attention mechanism for focusing on important timesteps
        self.attention = nn.MultiheadAttention(
            embed_dim=hidden_size,
            num_heads=8,
            dropout=dropout,
            batch_first=True
        )
        
        # Classification head
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size // 2, num_classes)
        )
        
        self._initialize_weights()
    
    def forward(self, x):
        # x shape: (batch_size, sequence_length, input_features)
        # Expected: (batch_size, 32, 96)
        
        # LSTM forward pass. nn.LSTM zero-initializes the hidden and cell states
        # when none are passed, so explicit h0/c0 tensors are unnecessary.
        lstm_out, _ = self.lstm(x)
        # lstm_out shape: (batch_size, sequence_length, hidden_size)
        
        # Apply attention to focus on important timesteps
        attended_out, attention_weights = self.attention(lstm_out, lstm_out, lstm_out)
        # attended_out shape: (batch_size, sequence_length, hidden_size)
        
        # Global average pooling over sequence dimension
        pooled = torch.mean(attended_out, dim=1)  # (batch_size, hidden_size)
        
        # Classification
        output = self.classifier(pooled)  # (batch_size, num_classes)
        
        return output
    
    def _initialize_weights(self):
        for name, param in self.named_parameters():
            if 'weight_ih' in name:
                # Input-to-hidden weights
                nn.init.xavier_uniform_(param.data)
            elif 'weight_hh' in name:
                # Hidden-to-hidden weights
                nn.init.orthogonal_(param.data)
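            elif 'bias' in name:
                # Zero every bias (LSTM, attention, LayerNorm, and classifier)
                param.data.fill_(0.)
                # Set the LSTM forget-gate bias to 1. PyTorch packs the gates as
                # (input, forget, cell, output), so the forget slice is [n/4, n/2).
                # Only bias_ih is set; bias_hh stays 0, so the effective bias is 1.
                if 'bias_ih' in name:
                    n = param.size(0)
                    param.data[n//4:n//2].fill_(1.)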

# Initialize LSTM model
model = MultiLayerLSTM(
    input_size=CONFIG['input_features'],
    hidden_size=CONFIG['hidden_size'],
    num_layers=CONFIG['num_layers'],
    num_classes=CONFIG['num_classes'],
    dropout=CONFIG['dropout']
)
model = model.to(CONFIG['device'])

# Count parameters for comparison
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("\n🔄 LSTM ARCHITECTURE SUMMARY:")
print(f"📊 Total parameters: {total_params:,}")
print(f"📊 Trainable parameters: {trainable_params:,}")
print(f"📊 Model size: ~{total_params * 4 / 1024 / 1024:.1f} MB")
print("\n🔍 Architecture Details:")
print(f"   • {CONFIG['num_layers']}-layer LSTM with attention")
print(f"   • Hidden size: {CONFIG['hidden_size']} per layer")
print(f"   • Multi-head attention mechanism (8 heads)")
print(f"   • Layer normalization and dropout ({CONFIG['dropout']})")
print(f"   • Sequential input: {CONFIG['sequence_length']} × {CONFIG['input_features']}")



🔄 LSTM ARCHITECTURE SUMMARY:
📊 Total parameters: 322,758
📊 Trainable parameters: 322,758
📊 Model size: ~1.2 MB

🔍 Architecture Details:
   • 2-layer LSTM with attention
   • Hidden size: 128 per layer
   • Multi-head attention mechanism (8 heads)
   • Layer normalization and dropout (0.3)
   • Sequential input: 32 × 96
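
The reported total of 322,758 parameters can be reproduced by hand. The sketch below uses the standard per-layer counts for PyTorch's `nn.LSTM`, `nn.MultiheadAttention`, `nn.LayerNorm`, and `nn.Linear`; the helper names `lstm_layer_params` and `linear_params` are hypothetical and exist only for this check.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    # Four gates, each with input weights and recurrent weights, plus two
    # bias vectors per gate (PyTorch keeps separate bias_ih and bias_hh).
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size


def linear_params(in_features: int, out_features: int) -> int:
    # Weight matrix plus bias vector.
    return in_features * out_features + out_features


input_size, hidden, num_classes = 96, 128, 6

# Layer 1 reads the 96 input features; layer 2 reads layer 1's 128 outputs.
lstm = lstm_layer_params(input_size, hidden) + lstm_layer_params(hidden, hidden)

# MultiheadAttention: packed Q/K/V projection plus the output projection.
attention = linear_params(hidden, 3 * hidden) + linear_params(hidden, hidden)

# Classifier: LayerNorm (weight + bias), then two Linear layers.
classifier = (2 * hidden
              + linear_params(hidden, hidden // 2)
              + linear_params(hidden // 2, num_classes))

total = lstm + attention + classifier
print(f"LSTM: {lstm:,}  attention: {attention:,}  "
      f"classifier: {classifier:,}  total: {total:,}")
```

The head count (8) does not enter the calculation, because the heads split the same 128-dimensional projections. About three quarters of the parameters sit in the two LSTM layers.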


In [3]:
# Data Loading for RGB Hilbert 6-Class (Few-Shot Holdout)
import glob

def load_rgb_hilbert_6class(base_path, held_out_classes):
    print(f"📂 Loading RGB Hilbert 6-class dataset from: {base_path}")
    print(f"🔒 Excluding held-out classes: {held_out_classes}")
    all_image_data = []
    all_labels = []
    splits = ['train', 'val', 'test']

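    # Discover class directories. Match the directory name itself against the split
    # names, so a split-like substring elsewhere in the path cannot drop classes.
    class_dirs = sorted(d for d in glob.glob(f"{base_path}*/")
                        if os.path.basename(os.path.normpath(d)) not in splits)
    class_names = [os.path.basename(os.path.normpath(d)) for d in class_dirs]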

    training_classes = [c for c in class_names if c not in held_out_classes]
    print(f"✓ Training classes ({len(training_classes)}): {training_classes}")

    for class_dir in class_dirs:
        class_name = class_dir.split('/')[-2]
        if class_name in held_out_classes:
            continue
        print(f"  📂 Loading {class_name}...")
        for split in splits:
            parquet_files = sorted(glob.glob(f"{class_dir}{split}/*.parquet"))
            for file_path in parquet_files:
                try:
                    df = pd.read_parquet(file_path)
                    if 'image_data' in df.columns:
                        # Convert the column directly instead of iterating with iterrows()
                        images = [np.asarray(img, dtype=np.float32) for img in df['image_data']]
                        all_image_data.extend(images)
                        all_labels.extend([class_name] * len(images))
                except Exception as e:
                    print(f"    ⚠️ Error loading {file_path}: {e}")

    X = np.array(all_image_data, dtype=np.float32)
    y = np.array(all_labels)

    print(f"\n✓ Loaded training data: {len(X):,} samples")
    print(f"✓ Shape: {X.shape}")
    print(f"✓ Unique classes: {np.unique(y)}")
    return X, y

# Load dataset (excluding held-out classes)
X, y = load_rgb_hilbert_6class(CONFIG['data_path'], HELD_OUT_CLASSES)

# Reshape for LSTM: each image → sequence of 32 timesteps × 96 features
print(f"\n🔄 Reshaping data for LSTM sequential input...")
print(f"   Original shape: {X.shape} (flattened: samples × 3072 features)")
expected_features = CONFIG['sequence_length'] * CONFIG['input_features']  # 32 × 96 = 3072
if X.shape[1] == expected_features:
    X = X.reshape(-1, CONFIG['sequence_length'], CONFIG['input_features'])
else:
    if X.shape[1] > expected_features:
        X = X[:, :expected_features].reshape(-1, CONFIG['sequence_length'], CONFIG['input_features'])
    else:
        pad = np.zeros((X.shape[0], expected_features - X.shape[1]), dtype=np.float32)
        X = np.concatenate([X, pad], axis=1).reshape(-1, CONFIG['sequence_length'], CONFIG['input_features'])
print(f"   LSTM shape: {X.shape} (samples × timesteps × features)")

# Scale 8-bit pixel values to [0, 1] unless the data is already normalized
if X.max() > 1.0:
    X = X / 255.0

# Encode labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print(f"\n🏷️ 6-class label distribution (training classes):")
for i, label in enumerate(label_encoder.classes_):
    count = np.sum(y == label)
    print(f"   {i}: {label} ({count:,} samples)")

print(f"\n📈 LSTM data ready: range=[{X.min():.3f}, {X.max():.3f}], shape={X.shape}")


📂 Loading RGB Hilbert 6-class dataset from: /home/ubuntu/Cyber_AI/ai-cyber/notebooks/ViT-experiment/pcap-dataset-samples/parquet/rgb_hilbert_32x32/
🔒 Excluding held-out classes: ['DDoS-HTTP_Flood', 'DoS-UDP_Flood', 'Recon-PortScan']
✓ Training classes (6): ['Benign_Final', 'DDoS-SYN_Flood', 'DictionaryBruteForce', 'DoS-TCP_Flood', 'Mirai-udpplain', 'SqlInjection']
  📂 Loading Benign_Final...
  📂 Loading DDoS-SYN_Flood...
  📂 Loading DictionaryBruteForce...
  📂 Loading DoS-TCP_Flood...
  📂 Loading Mirai-udpplain...
  📂 Loading SqlInjection...

✓ Loaded training data: 72,000 samples
✓ Shape: (72000, 3072)
✓ Unique classes: ['Benign_Final' 'DDoS-SYN_Flood' 'DictionaryBruteForce' 'DoS-TCP_Flood'
 'Mirai-udpplain' 'SqlInjection']

🔄 Reshaping data for LSTM sequential input...
   Original shape: (72000, 3072) (flattened: samples × 3072 features)
   LSTM shape: (72000, 32, 96) (samples × timesteps × features)

🏷️ 6-class label distribution (training classes):
   0: Benign_Final (12,000 sample
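
The truncate-or-pad fallback in the reshaping step can be isolated into a small helper, sketched below with NumPy on toy data. The function name `fit_to_length` and the toy array are hypothetical; the logic mirrors the `if`/`else` branches in the cell above.

```python
import numpy as np


def fit_to_length(X: np.ndarray, expected: int) -> np.ndarray:
    """Truncate or zero-pad the feature axis of a 2-D array to `expected` columns."""
    n_features = X.shape[1]
    if n_features >= expected:
        return X[:, :expected]
    pad = np.zeros((X.shape[0], expected - n_features), dtype=X.dtype)
    return np.concatenate([X, pad], axis=1)


# Toy data: 4 samples that are 10 values short of 32 x 96 = 3072 features.
X_short = np.ones((4, 3062), dtype=np.float32)
X_seq = fit_to_length(X_short, 32 * 96).reshape(-1, 32, 96)

print(X_seq.shape)
print(X_seq[0, -1, -10:])  # the zero-padded tail of the last timestep
```

Silent padding or truncation can hide a data-format problem. When the expected shape is known in advance, as it is here (3,072 features), raising an error on mismatch may be the safer choice.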

In [4]:
# Data Splitting and Training Setup
from torch.utils.data import DataLoader, TensorDataset

# Split data into train/val/test (same random_state as CNN/ViT for fairness)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y_encoded, test_size=CONFIG['test_size'], 
    random_state=CONFIG['random_state'], 
    stratify=y_encoded
)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=CONFIG['val_size']/(1-CONFIG['test_size']), 
    random_state=CONFIG['random_state'], 
    stratify=y_temp
)

print("📊 Data Split Summary:")
print(f"   Training: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"   Validation: {X_val.shape[0]:,} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"   Test: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"   Sequential format: timesteps={X_train.shape[1]}, features={X_train.shape[2]}")

# Convert to PyTorch tensors
X_train_tensor = torch.from_numpy(X_train).float()
y_train_tensor = torch.from_numpy(y_train).long()
X_val_tensor = torch.from_numpy(X_val).float()
y_val_tensor = torch.from_numpy(y_val).long()
X_test_tensor = torch.from_numpy(X_test).float()
y_test_tensor = torch.from_numpy(y_test).long()

# Create data loaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=CONFIG['batch_size'], shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=CONFIG['batch_size'], shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=CONFIG['batch_size'], shuffle=False)

# Compute class weights for balanced training
class_weights = compute_class_weight(
    'balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_tensor = torch.FloatTensor(class_weights).to(CONFIG['device'])

print(f"\n⚖️  Class weights: {dict(zip(label_encoder.classes_, class_weights))}")

# Initialize optimizer and loss function
# Higher learning rate for LSTM compared to CNN/ViT
optimizer = optim.Adam(model.parameters(), lr=CONFIG['learning_rate'], weight_decay=1e-4)
criterion = nn.CrossEntropyLoss(weight=class_weights_tensor)

# Learning rate scheduler
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', patience=3, factor=0.5
)

print(f"\n🎯 LSTM Training Configuration:")
print(f"   📊 Optimizer: Adam (lr={CONFIG['learning_rate']}, weight_decay=1e-4)")
print(f"   📊 Loss: Weighted CrossEntropyLoss")
print(f"   📊 Scheduler: ReduceLROnPlateau (patience=3)")
print(f"   📊 Batch size: {CONFIG['batch_size']}")
print(f"   📊 Max epochs: {CONFIG['epochs']}")
print(f"   📊 Early stopping patience: {CONFIG['patience']}")
print(f"   🔄 Sequential processing: {CONFIG['sequence_length']} timesteps per sample")

print(f"\n🎯 Architecture Comparison Setup:")
print(f"   🤖 ViT: Global attention, 96.94% target")
print(f"   🏗️  CNN: Local convolution, ~93%+ (training)")
print(f"   🔄 LSTM: Sequential memory, starting training next...")


📊 Data Split Summary:
   Training: 43,200 samples (60.0%)
   Validation: 14,400 samples (20.0%)
   Test: 14,400 samples (20.0%)
   Sequential format: timesteps=32, features=96

⚖️  Class weights: {np.str_('Benign_Final'): np.float64(1.0), np.str_('DDoS-SYN_Flood'): np.float64(1.0), np.str_('DictionaryBruteForce'): np.float64(1.0), np.str_('DoS-TCP_Flood'): np.float64(1.0), np.str_('Mirai-udpplain'): np.float64(1.0), np.str_('SqlInjection'): np.float64(1.0)}

🎯 LSTM Training Configuration:
   📊 Optimizer: Adam (lr=0.001, weight_decay=1e-4)
   📊 Loss: Weighted CrossEntropyLoss
   📊 Scheduler: ReduceLROnPlateau (patience=3)
   📊 Batch size: 64
   📊 Max epochs: 50
   📊 Early stopping patience: 7
   🔄 Sequential processing: 32 timesteps per sample

🎯 Architecture Comparison Setup:
   🤖 ViT: Global attention, 96.94% target
   🏗️  CNN: Local convolution, ~93%+ (training)
   🔄 LSTM: Sequential memory, starting training next...
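
The 60/20/20 split and the uniform class weights in the output above both follow from simple arithmetic, sketched here in plain Python. The variable names are illustrative, and `round` stands in for scikit-learn's own rounding, which may differ by one sample when the counts do not divide evenly.

```python
n_total = 72_000
test_size, val_size = 0.2, 0.2

# Stage 1: carve the test set off the full dataset.
n_test = round(n_total * test_size)
n_temp = n_total - n_test

# Stage 2: the validation fraction is rescaled because it is applied to the
# remaining 80%, not to the full dataset: 0.2 / (1 - 0.2) = 0.25.
val_fraction_of_temp = val_size / (1 - test_size)
n_val = round(n_temp * val_fraction_of_temp)
n_train = n_temp - n_val

print(n_train, n_val, n_test)

# 'balanced' class weights follow n_samples / (n_classes * class_count).
# With a stratified split of a balanced dataset, every class gets the same count.
n_classes = 6
per_class_train = n_train // n_classes
weight = n_train / (n_classes * per_class_train)
print(weight)
```

Because every weight is 1.0, the weighted `CrossEntropyLoss` behaves like the unweighted loss on this dataset. The weighting would only matter if the class counts were uneven.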


In [5]:
# LSTM Training Pipeline with Sequential Processing
def train_epoch(model, train_loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for batch_idx, (data, targets) in enumerate(train_loader):
        data, targets = data.to(device), targets.to(device)
        # data shape: (batch_size, sequence_length, input_features)
        
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, targets)
        loss.backward()
        
        # Gradient clipping for LSTM stability
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        optimizer.step()
        
        total_loss += loss.item()
        predicted = outputs.argmax(dim=1)
        total += targets.size(0)
        correct += (predicted == targets).sum().item()
    
    return total_loss / len(train_loader), correct / total

def validate_epoch(model, val_loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for data, targets in val_loader:
            data, targets = data.to(device), targets.to(device)
            outputs = model(data)
            loss = criterion(outputs, targets)
            
            total_loss += loss.item()
            predicted = outputs.argmax(dim=1)
            total += targets.size(0)
            correct += (predicted == targets).sum().item()
    
    return total_loss / len(val_loader), correct / total

# Training loop with early stopping
print("🔄 Starting LSTM sequential learning...")
print("🎯 Target: Match ViT 6-class performance on RGB Hilbert")
print("💡 Hypothesis: Sequential patterns > Spatial patterns for IoT security\n")

best_val_acc = 0
patience_counter = 0
train_losses = []
val_losses = []
train_accs = []
val_accs = []

start_time = datetime.now()

for epoch in range(CONFIG['epochs']):
    # Train for one epoch
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, CONFIG['device'])
    
    # Validate
    val_loss, val_acc = validate_epoch(model, val_loader, criterion, CONFIG['device'])
    
    # Update learning rate
    scheduler.step(val_acc)
    
    # Store metrics
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    train_accs.append(train_acc)
    val_accs.append(val_acc)
    
    # Print progress with architecture comparison
    print(f"Epoch {epoch+1:2d}/{CONFIG['epochs']} | "
          f"Train: {train_acc:.4f} ({train_loss:.4f}) | "
          f"Val: {val_acc:.4f} ({val_loss:.4f}) | "
          f"LR: {optimizer.param_groups[0]['lr']:.6f}")
    
    # Architecture comparison updates
    if val_acc > 0.93:
        print(f"🔄 LSTM exceeds the CNN estimate! Current: {val_acc:.4f} vs CNN: ~0.93+")
    if val_acc > 0.96:
        print(f"🎯 LSTM approaching ViT performance! Current: {val_acc:.4f} vs ViT: 0.9694")
    
    # Save best model
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        patience_counter = 0
        torch.save(model.state_dict(), 'best_lstm_rgb_hilbert_6class_model.pth')
        print(f"✅ New best validation accuracy: {val_acc:.4f}")
        
        # Check performance milestones (validation accuracy vs. the ViT baseline figure)
        if val_acc > 0.9694:
            print(f"🏆 LSTM validation accuracy exceeds the ViT baseline: {val_acc:.4f} > 0.9694")
            print(f"🔄 Confirm on the test set before drawing conclusions")
    else:
        patience_counter += 1
        
        if patience_counter >= CONFIG['patience']:
            print(f"\n⏰ Early stopping triggered after {epoch+1} epochs")
            print(f"   Best validation accuracy: {best_val_acc:.4f}")
            break

training_time = datetime.now() - start_time
print(f"\n🎯 LSTM Training Complete!")
print(f"   ⏱️  Total time: {training_time}")
print(f"   🏆 Best validation accuracy: {best_val_acc:.4f}")
print(f"   📊 Total epochs: {epoch+1}")

# Compare with baselines
vit_accuracy = 0.9694
cnn_accuracy = 0.93  # Conservative estimate

print(f"\n📊 ARCHITECTURE COMPARISON:")
print(f"   🤖 ViT (Global Attention): {vit_accuracy:.4f}")
print(f"   🏗️  CNN (Local Features): ~{cnn_accuracy:.2f}+ (training)")
print(f"   🔄 LSTM (Sequential): {best_val_acc:.4f}")

if best_val_acc > vit_accuracy:
    improvement = (best_val_acc - vit_accuracy) * 100
    print(f"\n🎉 LSTM validation accuracy exceeds both baselines")
    print(f"   LSTM beats ViT by +{improvement:.2f} percentage points")
    print(f"   Validation figure only; confirm on the test set")
elif best_val_acc > cnn_accuracy:
    deficit = (vit_accuracy - best_val_acc) * 100
    print(f"\n🥈 LSTM beats the CNN estimate but trails ViT")
    print(f"   ViT leads by {deficit:.2f} percentage points")
else:
    deficit = (vit_accuracy - best_val_acc) * 100
    print(f"\n📊 LSTM trails both baselines")
    print(f"   ViT leads by {deficit:.2f} percentage points")


🔄 Starting LSTM sequential learning...
🎯 Target: Match ViT 6-class performance on RGB Hilbert
💡 Hypothesis: Sequential patterns > Spatial patterns for IoT security

Epoch  1/50 | Train: 0.5981 (1.0164) | Val: 0.6453 (0.8947) | LR: 0.001000
✅ New best validation accuracy: 0.6453
Epoch  2/50 | Train: 0.6795 (0.8289) | Val: 0.7156 (0.7516) | LR: 0.001000
✅ New best validation accuracy: 0.7156
Epoch  3/50 | Train: 0.7147 (0.7699) | Val: 0.7389 (0.7036) | LR: 0.001000
✅ New best validation accuracy: 0.7389
Epoch  4/50 | Train: 0.7385 (0.7251) | Val: 0.7635 (0.6588) | LR: 0.001000
✅ New best validation accuracy: 0.7635
Epoch  5/50 | Train: 0.7546 (0.6769) | Val: 0.7678 (0.6313) | LR: 0.001000
✅ New best validation accuracy: 0.7678
Epoch  6/50 | Train: 0.7680 (0.6243) | Val: 0.7771 (0.5707) | LR: 0.001000
✅ New best validation accuracy: 0.7771
Epoch  7/50 | Train: 0.7795 (0.5861) | Val: 0.7914 (0.5607) | LR: 0.001000
✅ New best validation accuracy: 0.7914
Epoch  8/50 | Train: 0.7886 (0.5602) 
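
The early-stopping rule in the training loop (reset a counter on every new best validation accuracy, stop once the counter reaches `patience`) can be sketched on its own. The `EarlyStopping` class and the accuracy values below are hypothetical and exist only to make the rule easy to trace. The notebook implements the same logic inline with `best_val_acc` and `patience_counter`.

```python
class EarlyStopping:
    """Track the best validation score and signal when to stop."""

    def __init__(self, patience: int = 7):
        self.patience = patience
        self.best = float("-inf")
        self.counter = 0

    def step(self, score: float) -> bool:
        """Record one epoch's score; return True when training should stop."""
        if score > self.best:  # strict improvement, as in the training loop
            self.best = score
            self.counter = 0
            return False
        self.counter += 1
        return self.counter >= self.patience


# Made-up validation accuracies: improvement stalls after the third epoch.
history = [0.64, 0.71, 0.74, 0.73, 0.74, 0.72, 0.73]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, acc in enumerate(history, start=1):
    if stopper.step(acc):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

Because the comparison is strict, an epoch that only ties the best score (epoch 5 in the toy history) still counts against the patience budget.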

In [6]:
# LSTM Evaluation
# Load best model
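# map_location lets a checkpoint saved on GPU load on a CPU-only machine
model.load_state_dict(torch.load('best_lstm_rgb_hilbert_6class_model.pth', map_location=CONFIG['device']))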
model.eval()

# Test set evaluation
def evaluate_model(model, test_loader, device, label_encoder):
    model.eval()
    all_preds = []
    all_targets = []
    all_probs = []
    
    with torch.no_grad():
        for data, targets in test_loader:
            data, targets = data.to(device), targets.to(device)
            outputs = model(data)
            probs = torch.softmax(outputs, dim=1)
            predicted = outputs.argmax(dim=1)
            
            all_preds.extend(predicted.cpu().numpy())
            all_targets.extend(targets.cpu().numpy())
            all_probs.extend(probs.cpu().numpy())
    
    return np.array(all_preds), np.array(all_targets), np.array(all_probs)

print("🧪 Final LSTM Test Set Evaluation...")
test_preds, test_targets, test_probs = evaluate_model(model, test_loader, CONFIG['device'], label_encoder)
test_accuracy = accuracy_score(test_targets, test_preds)

print(f"\n🎯 FINAL LSTM RESULTS:")
print(f"   📊 Test Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")

# Baseline comparison (optional)
vit_test_accuracy = 0.9694
print(f"\n🏆 Baseline Comparison:")
print(f"   ViT (Global Attention): {vit_test_accuracy:.4f}")
print(f"   LSTM (Sequential): {test_accuracy:.4f}")

# Performance tier classification
if test_accuracy > vit_test_accuracy:
    improvement = (test_accuracy - vit_test_accuracy) * 100
    print(f"\n🎉 LSTM ACHIEVES NEW STATE-OF-THE-ART!")
    print(f"   LSTM: {test_accuracy:.4f} vs ViT: {vit_test_accuracy:.4f}")
    print(f"   Improvement: +{improvement:.2f} percentage points")
    print(f"   🔄 Sequential modeling proves superior for IoT cybersecurity!")
    tier = "🥇 STATE-OF-THE-ART"
elif test_accuracy > 0.95:
    deficit = (vit_test_accuracy - test_accuracy) * 100
    print(f"\n🥈 LSTM EXCELLENT Performance!")
    print(f"   LSTM: {test_accuracy:.4f} vs ViT: {vit_test_accuracy:.4f}")
    print(f"   Deficit: -{deficit:.2f} percentage points")
    tier = "🥈 EXCELLENT"
elif test_accuracy > 0.90:
    deficit = (vit_test_accuracy - test_accuracy) * 100
    print(f"\n🥉 LSTM VERY GOOD Performance!")
    print(f"   LSTM: {test_accuracy:.4f} vs ViT: {vit_test_accuracy:.4f}")
    print(f"   Deficit: -{deficit:.2f} percentage points")
    tier = "🥉 VERY GOOD"
else:
    deficit = (vit_test_accuracy - test_accuracy) * 100
    print(f"\n📊 LSTM Performance Analysis:")
    print(f"   LSTM: {test_accuracy:.4f} vs ViT: {vit_test_accuracy:.4f}")
    print(f"   Deficit: -{deficit:.2f} percentage points")
    tier = "📊 BASELINE"

# Detailed classification report
print(f"\n📋 Detailed LSTM Classification Report:")
class_names = label_encoder.classes_
report = classification_report(test_targets, test_preds, target_names=class_names, digits=4)
print(report)

# Confidence analysis
confidence_scores = np.max(test_probs, axis=1)
mean_confidence = np.mean(confidence_scores)
print(f"\n🎯 LSTM Prediction Confidence Analysis:")
print(f"   Mean confidence: {mean_confidence:.4f}")
print(f"   High confidence (>0.9): {np.sum(confidence_scores > 0.9):,} samples")
print(f"   Low confidence (<0.7): {np.sum(confidence_scores < 0.7):,} samples")

# Save comprehensive results
results = {
    'experiment': 'LSTM_RGB_Hilbert_6class',
    'timestamp': datetime.now().isoformat(),
    'dataset': 'RGB_Hilbert_32x32',
    'approach': '6-class training with few-shot holdout',
    'architecture': 'LSTM_Sequential',
    'total_samples': len(X),
    'test_accuracy': float(test_accuracy),
    'validation_accuracy': float(best_val_acc),
    'training_epochs': epoch + 1,
    'training_time': str(training_time),
    'parameters': trainable_params,
    'sequence_config': {
        'sequence_length': CONFIG['sequence_length'],
        'input_features': CONFIG['input_features'],
        'hidden_size': CONFIG['hidden_size'],
        'num_layers': CONFIG['num_layers']
    },
    'baselines': {
        'vit_baseline': vit_test_accuracy
    },
    'confidence_analysis': {
        'mean_confidence': float(mean_confidence),
        'high_confidence_samples': int(np.sum(confidence_scores > 0.9)),
        'low_confidence_samples': int(np.sum(confidence_scores < 0.7))
    },
    'performance_tier': tier,
    'classification_report': report,
    'class_names': list(class_names)
}

with open('results_lstm_rgb_hilbert_6class.json', 'w') as f:
    json.dump(results, f, indent=2)

print(f"\n💾 Results saved to: results_lstm_rgb_hilbert_6class.json")
print(f"\n🎯 SUMMARY:")
print(f"   🤖 ViT (Attention) baseline: {vit_test_accuracy:.4f}")
print(f"   🔄 LSTM (Sequential): {test_accuracy:.4f} ({trainable_params:,} params)")
print(f"   📊 Performance Tier: {tier}")


🧪 Final LSTM Test Set Evaluation...

🎯 FINAL LSTM RESULTS:
   📊 Test Accuracy: 0.9059 (90.59%)

🏆 Baseline Comparison:
   ViT (Global Attention): 0.9694
   LSTM (Sequential): 0.9059

🥉 LSTM VERY GOOD Performance!
   LSTM: 0.9059 vs ViT: 0.9694
   Deficit: -6.35 percentage points

📋 Detailed LSTM Classification Report:
                      precision    recall  f1-score   support

        Benign_Final     0.9142    0.8658    0.8894      2400
      DDoS-SYN_Flood     0.9535    0.9225    0.9377      2400
DictionaryBruteForce     0.8658    0.8279    0.8464      2400
       DoS-TCP_Flood     0.9957    0.9542    0.9745      2400
      Mirai-udpplain     0.9948    0.9529    0.9734      2400
        SqlInjection     0.7520    0.9121    0.8243      2400

            accuracy                         0.9059     14400
           macro avg     0.9127    0.9059    0.9076     14400
        weighted avg     0.9127    0.9059    0.9076     14400


🎯 LSTM Prediction Confidence Analysis:
   Mean confidenc
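
The F1 column of the classification report can be cross-checked from the precision and recall columns with F1 = 2PR / (P + R). The sketch below copies the reported values into a hypothetical `report` dictionary. Since the inputs are already rounded to four decimals, the recomputed scores may differ from the report in the last digit.

```python
# Precision and recall copied from the classification report above.
report = {
    "Benign_Final":         (0.9142, 0.8658),
    "DDoS-SYN_Flood":       (0.9535, 0.9225),
    "DictionaryBruteForce": (0.8658, 0.8279),
    "DoS-TCP_Flood":        (0.9957, 0.9542),
    "Mirai-udpplain":       (0.9948, 0.9529),
    "SqlInjection":         (0.7520, 0.9121),
}


def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {name: f1(p, r) for name, (p, r) in report.items()}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for name, score in per_class_f1.items():
    print(f"{name:<22}{score:.4f}")
print(f"{'macro avg':<22}{macro_f1:.4f}")
```

SqlInjection combines the lowest precision (0.7520) with high recall (0.9121), which indicates that samples from other classes are being predicted as SqlInjection. With equal support of 2,400 per class, the macro and weighted averages coincide.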