# LSTM Architecture: CIC-IoT23 3-Class Full Capacity Experiment

## Sequential Learning Architecture Comparison Research
**Objective**: Compare LSTM vs CNN vs ViT performance on IoT cybersecurity classification

**Dataset**: CIC-IoT23 with semantic 3-class grouping
- **Normal**: Normal IoT traffic
- **Reconnaissance**: Recon-PortScan, DictionaryBruteForce
- **Active_Attack**: DDoS-HTTP_Flood, DDoS-SYN_Flood, DoS-TCP_Flood, DoS-UDP_Flood, Mirai-udpplain, SqlInjection
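
A minimal sketch of how the grouping above can be applied: inverting the mapping gives a lookup from each original CIC-IoT23 label to its combined class. The `Benign_Final` folder name for Normal traffic is taken from the notebook's `CLASS_MAPPING`; `ORIGINAL_TO_GROUP` is a hypothetical helper name that does not appear in the notebook.

```python
# 3-class grouping, as defined in the notebook's CLASS_MAPPING
CLASS_MAPPING = {
    "Normal": ["Benign_Final"],
    "Reconnaissance": ["Recon-PortScan", "DictionaryBruteForce"],
    "Active_Attack": ["DDoS-HTTP_Flood", "DDoS-SYN_Flood", "DoS-TCP_Flood",
                      "DoS-UDP_Flood", "Mirai-udpplain", "SqlInjection"],
}

# Hypothetical helper: invert the mapping so each original label points to its group
ORIGINAL_TO_GROUP = {
    original: group
    for group, originals in CLASS_MAPPING.items()
    for original in originals
}

print(ORIGINAL_TO_GROUP["Mirai-udpplain"])  # Active_Attack
print(len(ORIGINAL_TO_GROUP))               # 9 original labels in total
```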

**Capacity**: 12,000 samples per class (36,000 total)  
**Input**: 5-channel 32x32 → Sequential (32 timesteps × 160 features)  
**Architecture**: Long Short-Term Memory (LSTM)  
**Baselines**: CNN (still training, ~93%+ so far), ViT (96.94% accuracy)

**Research Questions**:
1. Can LSTM capture temporal patterns better than spatial methods (CNN/ViT)?
2. How does sequential modeling compare to local (CNN) vs global (ViT) approaches?
3. Will LSTM show different cross-domain transfer characteristics to UNSW-NB15?
4. Which paradigm (spatial, sequential, attention) is optimal for IoT cybersecurity?

**LSTM Advantages for Cybersecurity**:
- Network traffic is inherently sequential
- Attack patterns unfold over time
- Memory of previous states (stateful detection)
- Natural fit for protocol analysis


In [1]:
# Environment Setup and Configuration
import os
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import json
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Configuration  
CONFIG = {
    'data_path': '/home/ubuntu/Cyber_AI/ai-cyber/notebooks/ViT-experiment/pcap-dataset-samples/parquet/5channel_32x32/',
    'max_samples_per_class': 12000,  # Full capacity
    'test_size': 0.2,
    'val_size': 0.2,
    'random_state': 42,
    'batch_size': 64,
    'learning_rate': 0.001,  # Higher LR for LSTM
    'epochs': 50,
    'patience': 7,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',
    'num_workers': 4,
    # LSTM-specific config
    'sequence_length': 32,     # Treat as 32 time steps
    'input_features': 160,     # 5 channels * 32 features per timestep
    'hidden_size': 128,        # LSTM hidden dimension
    'num_layers': 2,           # Stacked LSTM layers
    'dropout': 0.3             # Dropout for regularization
}

# CIC 3-class mapping (same as CNN/ViT for fair comparison)
CLASS_MAPPING = {
    'Normal': ['Benign_Final'],
    'Reconnaissance': ['Recon-PortScan', 'DictionaryBruteForce'],
    'Active_Attack': ['DDoS-HTTP_Flood', 'DDoS-SYN_Flood', 'DoS-TCP_Flood', 
                     'DoS-UDP_Flood', 'Mirai-udpplain', 'SqlInjection']
}

print("🔄 LSTM SEQUENTIAL ARCHITECTURE EXPERIMENT INITIALIZED")
print("📋 Notebook: LSTM_Prototype_CIC_3class_full_capacity.ipynb")
print("🔧 Version: Sequential modeling for IoT cybersecurity")
print(f"📊 Device: {CONFIG['device']}")
print(f"📊 Dataset: CIC-IoT23 3-class semantic grouping")
print(f"📊 Capacity: {CONFIG['max_samples_per_class']:,} samples per class")
print(f"📊 Total samples: {CONFIG['max_samples_per_class'] * 3:,}")
print(f"📊 Architecture: LSTM (Sequential Modeling)")
print(f"📊 Sequence format: {CONFIG['sequence_length']} timesteps × {CONFIG['input_features']} features")
print(f"🎯 Baselines: ViT (96.94%), CNN (training ~93%+)")


🔄 LSTM SEQUENTIAL ARCHITECTURE EXPERIMENT INITIALIZED
📋 Notebook: LSTM_Prototype_CIC_3class_full_capacity.ipynb
🔧 Version: Sequential modeling for IoT cybersecurity
📊 Device: cpu
📊 Dataset: CIC-IoT23 3-class semantic grouping
📊 Capacity: 12,000 samples per class
📊 Total samples: 36,000
📊 Architecture: LSTM (Sequential Modeling)
📊 Sequence format: 32 timesteps × 160 features
🎯 Baselines: ViT (96.94%), CNN (training ~93%+)


In [2]:
# Multi-Layer LSTM Architecture for IoT Cybersecurity Sequential Analysis
class MultiLayerLSTM(nn.Module):
    def __init__(self, input_size=160, hidden_size=128, num_layers=2, num_classes=3, dropout=0.3):
        super(MultiLayerLSTM, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.num_classes = num_classes
        
        # LSTM layers
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,  # (batch, seq, feature)
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=False
        )
        
        # Attention mechanism for focusing on important timesteps
        self.attention = nn.MultiheadAttention(
            embed_dim=hidden_size,
            num_heads=8,
            dropout=dropout,
            batch_first=True
        )
        
        # Classification head
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size // 2, num_classes)
        )
        
        self._initialize_weights()
    
    def forward(self, x):
        # x shape: (batch_size, sequence_length, input_features)
        # Expected: (batch_size, 32, 160)
        
        batch_size = x.size(0)
        
        # Initialize hidden and cell states to zeros on the input's device and dtype
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size, device=x.device, dtype=x.dtype)
        c0 = torch.zeros_like(h0)
        
        # LSTM forward pass
        lstm_out, (hidden, cell) = self.lstm(x, (h0, c0))
        # lstm_out shape: (batch_size, sequence_length, hidden_size)
        
        # Apply attention to focus on important timesteps
        attended_out, attention_weights = self.attention(lstm_out, lstm_out, lstm_out)
        # attended_out shape: (batch_size, sequence_length, hidden_size)
        
        # Global average pooling over sequence dimension
        pooled = torch.mean(attended_out, dim=1)  # (batch_size, hidden_size)
        
        # Classification
        output = self.classifier(pooled)  # (batch_size, num_classes)
        
        return output
    
    def _initialize_weights(self):
        for name, param in self.named_parameters():
            if 'weight_ih' in name:
                # Input-to-hidden weights
                nn.init.xavier_uniform_(param.data)
            elif 'weight_hh' in name:
                # Hidden-to-hidden weights
                nn.init.orthogonal_(param.data)
            elif 'bias' in name:
                param.data.fill_(0.)
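                # Set the forget-gate bias to 1. PyTorch packs LSTM biases as
                # [input | forget | cell | output], so the second quarter is the forget gate.
                # bias_ih and bias_hh are summed, so only bias_ih is set and bias_hh stays 0.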
                if 'bias_ih' in name:
                    n = param.size(0)
                    param.data[n//4:n//2].fill_(1.)

# Initialize LSTM model
model = MultiLayerLSTM(
    input_size=CONFIG['input_features'],
    hidden_size=CONFIG['hidden_size'],
    num_layers=CONFIG['num_layers'],
    num_classes=3,
    dropout=CONFIG['dropout']
)
model = model.to(CONFIG['device'])

# Count parameters for comparison
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("\n🔄 LSTM ARCHITECTURE SUMMARY:")
print(f"📊 Total parameters: {total_params:,}")
print(f"📊 Trainable parameters: {trainable_params:,}")
print(f"📊 Model size: ~{total_params * 4 / 1024 / 1024:.1f} MB")
print("\n🔍 Architecture Details:")
print(f"   • {CONFIG['num_layers']}-layer LSTM with attention")
print(f"   • Hidden size: {CONFIG['hidden_size']} per layer")
print(f"   • Multi-head attention mechanism (8 heads)")
print(f"   • Layer normalization and dropout ({CONFIG['dropout']})")
print(f"   • Sequential input: {CONFIG['sequence_length']} × {CONFIG['input_features']}")

print(f"\n🆚 COMPARISON TARGETS:")
print(f"   📊 ViT: 96.94% accuracy (~150K parameters)")
print(f"   📊 CNN: ~93%+ accuracy (4.8M parameters, training)")
print(f"   🔄 LSTM: TBD accuracy ({trainable_params:,} parameters)")
print(f"\n💡 LSTM Advantages:")
print(f"   • Captures temporal attack progression")
print(f"   • Memory of previous network states")
print(f"   • Natural for sequential protocol analysis")
print(f"   • Attention focuses on critical time points")



🔄 LSTM ARCHITECTURE SUMMARY:
📊 Total parameters: 355,331
📊 Trainable parameters: 355,331
📊 Model size: ~1.4 MB

🔍 Architecture Details:
   • 2-layer LSTM with attention
   • Hidden size: 128 per layer
   • Multi-head attention mechanism (8 heads)
   • Layer normalization and dropout (0.3)
   • Sequential input: 32 × 160

🆚 COMPARISON TARGETS:
   📊 ViT: 96.94% accuracy (~150K parameters)
   📊 CNN: ~93%+ accuracy (4.8M parameters, training)
   🔄 LSTM: TBD accuracy (355,331 parameters)

💡 LSTM Advantages:
   • Captures temporal attack progression
   • Memory of previous network states
   • Natural for sequential protocol analysis
   • Attention focuses on critical time points
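
The parameter total reported above can be checked by hand. The sketch below is plain arithmetic based on the layer definitions in `MultiLayerLSTM`; the helper names `lstm_layer_params` and `linear_params` are illustrative and are not part of the notebook.

```python
def lstm_layer_params(input_size, hidden_size):
    # 4 gates, each with input-to-hidden and hidden-to-hidden weights,
    # plus two bias vectors (bias_ih and bias_hh) of length 4 * hidden_size
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

def linear_params(in_features, out_features):
    # Weight matrix plus bias vector
    return in_features * out_features + out_features

hidden, features, classes = 128, 160, 3

# Two stacked LSTM layers: the second layer takes the first layer's hidden state as input
lstm = lstm_layer_params(features, hidden) + lstm_layer_params(hidden, hidden)
# nn.MultiheadAttention: packed Q/K/V input projection plus output projection
attention = linear_params(hidden, 3 * hidden) + linear_params(hidden, hidden)
# Classifier head: LayerNorm (scale + shift), then two Linear layers
head = 2 * hidden + linear_params(hidden, hidden // 2) + linear_params(hidden // 2, classes)

total = lstm + attention + head
print(lstm, attention, head, total)  # 280576 66048 8707 355331
```

The recurrent layers hold roughly four fifths of the weights. The attention block's size depends only on `hidden_size`, not on the number of heads.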


In [3]:
# Data Loading using Working CIC Approach (same as CNN)
import glob

def load_cic_3class_full_capacity(base_path, class_mapping, max_samples_per_class):
    """Load up to max_samples_per_class flattened image vectors per combined class, using the same approach as the working ViT notebook."""
    print(f"📂 Loading CIC-IoT23 3-class FULL CAPACITY dataset from: {base_path}")
    print(f"🎯 Target: {max_samples_per_class:,} samples per class = {max_samples_per_class * len(class_mapping):,} total")
    
    all_image_data = []
    all_labels = []
    splits = ['train', 'val', 'test']
    
    print(f"3-Class mapping (FULL CAPACITY): {class_mapping}")
    
    # Track samples collected per combined class
    class_samples = {combined_class: 0 for combined_class in class_mapping.keys()}
    
    # Process each combined class
    for combined_class, original_classes in class_mapping.items():
        print(f"\n🔄 Loading {combined_class} from: {original_classes}")
        print(f"   Target: {max_samples_per_class:,} samples")
        
        for original_class in original_classes:
            if class_samples[combined_class] >= max_samples_per_class:
                break
                
            class_dir = f"{base_path}{original_class}/"
            print(f"  📂 Processing {original_class}...")
            
            for split in splits:
                if class_samples[combined_class] >= max_samples_per_class:
                    break
                    
                split_path = f"{class_dir}{split}/"
                parquet_files = sorted(glob.glob(f"{split_path}*.parquet"))
                
                for file_path in parquet_files:
                    if class_samples[combined_class] >= max_samples_per_class:
                        break
                        
                    try:
                        df = pd.read_parquet(file_path)
                        
                        if 'image_data' in df.columns:
                            remaining_samples = max_samples_per_class - class_samples[combined_class]
                            samples_to_take = min(len(df), remaining_samples)
                            
                            # Iterate the column directly instead of row-wise df.iloc lookups
                            for image in df['image_data'].iloc[:samples_to_take]:
                                all_image_data.append(np.array(image, dtype=np.float32))
                                all_labels.append(combined_class)
                                class_samples[combined_class] += 1
                            
                            if samples_to_take > 0:
                                print(f"    ✓ Loaded {samples_to_take:,} from {file_path.split('/')[-1]} (total {combined_class}: {class_samples[combined_class]:,})")
                    except Exception as e:
                        print(f"    ⚠️ Error loading {file_path}: {e}")
    
    X = np.array(all_image_data, dtype=np.float32)
    y = np.array(all_labels)
    
    print(f"\n🎉 CIC-IoT23 3-class FULL CAPACITY dataset loaded: {len(X):,} samples")
    print(f"📊 Final class distribution:")
    for combined_class, count in class_samples.items():
        percentage = (count / len(X)) * 100
        print(f"   {combined_class:15s}: {count:,} samples ({percentage:.1f}%)")
    
    total_target = max_samples_per_class * len(class_mapping)
    achievement = (len(X) / total_target) * 100
    print(f"\n✓ Capacity achievement: {achievement:.1f}% of target ({len(X):,} / {total_target:,})")
    
    return X, y

# Load FULL CAPACITY CIC data
X, y = load_cic_3class_full_capacity(CONFIG['data_path'], CLASS_MAPPING, CONFIG['max_samples_per_class'])

# Reshape data for LSTM: (samples, features) -> (samples, sequence_length, input_features)
print(f"\n🔄 Reshaping data for LSTM sequential input...")
print(f"   Original shape: {X.shape} (flattened: samples × 5120 features)")

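# Reshape from (36000, 5120) to (36000, 32, 160) for LSTM.
# reshape only regroups the flat vector into 32 consecutive chunks of 160 values.
# A chunk is one image row across all 5 channels only if the vector was flattened
# row-major as (height, width, channel). For a channel-first (5, 32, 32) layout,
# a chunk would instead hold 5 consecutive rows of a single channel.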
X = X.reshape(-1, CONFIG['sequence_length'], CONFIG['input_features'])
print(f"   LSTM shape: {X.shape} (samples × timesteps × features)")
print(f"   Sequential interpretation: {CONFIG['sequence_length']} timesteps of {CONFIG['input_features']} features")
print(f"   ➤ Each timestep represents one 'row' of the 5-channel data")
print(f"   ➤ Each feature vector combines all 5 channels for that row")

# Encode labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print(f"\n🏷️ CIC-IoT23 3-class label distribution:")
for i, label in enumerate(label_encoder.classes_):
    count = np.sum(y == label)
    print(f"   {i}: {label} ({count:,} samples)")

print(f"\n📈 LSTM data ready: range=[{X.min():.3f}, {X.max():.3f}], shape={X.shape}")
print(f"🚀 Total samples: {len(X):,}")
print(f"🔄 Ready for LSTM vs CNN vs ViT comparison!")
print(f"\n💡 Sequential Learning Hypothesis:")
print(f"   • Network attacks often have temporal patterns")
print(f"   • LSTM can model state transitions in protocols")
print(f"   • Memory allows detection of multi-step attacks")
print(f"   • May outperform spatial methods (CNN/ViT) on sequential data")


📂 Loading CIC-IoT23 3-class FULL CAPACITY dataset from: /home/ubuntu/Cyber_AI/ai-cyber/notebooks/ViT-experiment/pcap-dataset-samples/parquet/5channel_32x32/
🎯 Target: 12,000 samples per class = 36,000 total
3-Class mapping (FULL CAPACITY): {'Normal': ['Benign_Final'], 'Reconnaissance': ['Recon-PortScan', 'DictionaryBruteForce'], 'Active_Attack': ['DDoS-HTTP_Flood', 'DDoS-SYN_Flood', 'DoS-TCP_Flood', 'DoS-UDP_Flood', 'Mirai-udpplain', 'SqlInjection']}

🔄 Loading Normal from: ['Benign_Final']
   Target: 12,000 samples
  📂 Processing Benign_Final...
    ✓ Loaded 1,000 from shard_00000.parquet (total Normal: 1,000)
    ✓ Loaded 1,000 from shard_00001.parquet (total Normal: 2,000)
    ✓ Loaded 1,000 from shard_00002.parquet (total Normal: 3,000)
    ✓ Loaded 1,000 from shard_00003.parquet (total Normal: 4,000)
    ✓ Loaded 1,000 from shard_00004.parquet (total Normal: 5,000)
    ✓ Loaded 1,000 from shard_00005.parquet (total Normal: 6,000)
    ✓ Loaded 1,000 from shard_00006.parquet (total 
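
Whether each of the 32 timesteps really corresponds to one image row depends on how the 5,120-value vectors were flattened, which the notebook does not state. The sketch below uses a toy image (not the real data) to show both cases. The channel-first layout and the transpose fix are assumptions for illustration.

```python
import numpy as np

C, H, W = 5, 32, 32

# Toy image in which every value encodes its own (channel, row) position
chan = np.arange(C).reshape(C, 1, 1)
row = np.arange(H).reshape(1, H, 1)
image_chw = np.broadcast_to(chan * 100 + row, (C, H, W))

# Case 1: vector flattened channel-last (H, W, C).
# A plain reshape yields one image row per timestep.
flat_hwc = image_chw.transpose(1, 2, 0).reshape(-1)
seq_hwc = flat_hwc.reshape(H, W * C)
print(np.unique(seq_hwc[3] % 100))    # [3]: timestep 3 holds only row 3

# Case 2: vector flattened channel-first (C, H, W).
# A plain reshape packs 5 consecutive rows of ONE channel into each timestep.
flat_chw = image_chw.reshape(-1)
seq_plain = flat_chw.reshape(H, W * C)
print(np.unique(seq_plain[3] % 100))  # [15 16 17 18 19]: rows 15-19 of channel 0

# Fix for case 2: restore (C, H, W), move rows to the front, then merge channel and width
seq_fixed = flat_chw.reshape(C, H, W).transpose(1, 0, 2).reshape(H, C * W)
print(np.unique(seq_fixed[3] % 100))  # [3]: timestep 3 holds only row 3
```

If the stored vectors turn out to be channel-first, the batched equivalent would be `X.reshape(-1, 5, 32, 32).transpose(0, 2, 1, 3).reshape(-1, 32, 160)`.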

In [4]:
# Data Splitting and Training Setup
from torch.utils.data import DataLoader, TensorDataset

# Split data into train/val/test (same random_state as CNN/ViT for fairness)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y_encoded, test_size=CONFIG['test_size'], 
    random_state=CONFIG['random_state'], 
    stratify=y_encoded
)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=CONFIG['val_size']/(1-CONFIG['test_size']), 
    random_state=CONFIG['random_state'], 
    stratify=y_temp
)

print("📊 Data Split Summary:")
print(f"   Training: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"   Validation: {X_val.shape[0]:,} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"   Test: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"   Sequential format: timesteps={X_train.shape[1]}, features={X_train.shape[2]}")

# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.LongTensor(y_train)
X_val_tensor = torch.FloatTensor(X_val)
y_val_tensor = torch.LongTensor(y_val)
X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.LongTensor(y_test)

# Create data loaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=CONFIG['batch_size'], shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=CONFIG['batch_size'], shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=CONFIG['batch_size'], shuffle=False)

# Compute class weights for balanced training
class_weights = compute_class_weight(
    'balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_tensor = torch.FloatTensor(class_weights).to(CONFIG['device'])

print(f"\n⚖️  Class weights: {dict(zip(label_encoder.classes_, class_weights))}")

# Initialize optimizer and loss function
# Higher learning rate for LSTM compared to CNN/ViT
optimizer = optim.Adam(model.parameters(), lr=CONFIG['learning_rate'], weight_decay=1e-4)
criterion = nn.CrossEntropyLoss(weight=class_weights_tensor)

# Learning rate scheduler
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', patience=3, factor=0.5
)

print(f"\n🎯 LSTM Training Configuration:")
print(f"   📊 Optimizer: Adam (lr={CONFIG['learning_rate']}, weight_decay=1e-4)")
print(f"   📊 Loss: Weighted CrossEntropyLoss")
print(f"   📊 Scheduler: ReduceLROnPlateau (patience=3)")
print(f"   📊 Batch size: {CONFIG['batch_size']}")
print(f"   📊 Max epochs: {CONFIG['epochs']}")
print(f"   📊 Early stopping patience: {CONFIG['patience']}")
print(f"   🔄 Sequential processing: {CONFIG['sequence_length']} timesteps per sample")

print(f"\n🎯 Architecture Comparison Setup:")
print(f"   🤖 ViT: Global attention, 96.94% target")
print(f"   🏗️  CNN: Local convolution, ~93%+ (training)")
print(f"   🔄 LSTM: Sequential memory, starting training next...")


📊 Data Split Summary:
   Training: 21,600 samples (60.0%)
   Validation: 7,200 samples (20.0%)
   Test: 7,200 samples (20.0%)
   Sequential format: timesteps=32, features=160

⚖️  Class weights: {np.str_('Active_Attack'): np.float64(1.0), np.str_('Normal'): np.float64(1.0), np.str_('Reconnaissance'): np.float64(1.0)}

🎯 LSTM Training Configuration:
   📊 Optimizer: Adam (lr=0.001, weight_decay=1e-4)
   📊 Loss: Weighted CrossEntropyLoss
   📊 Scheduler: ReduceLROnPlateau (patience=3)
   📊 Batch size: 64
   📊 Max epochs: 50
   📊 Early stopping patience: 7
   🔄 Sequential processing: 32 timesteps per sample

🎯 Architecture Comparison Setup:
   🤖 ViT: Global attention, 96.94% target
   🏗️  CNN: Local convolution, ~93%+ (training)
   🔄 LSTM: Sequential memory, starting training next...
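
The 60/20/20 split and the unit class weights printed above follow directly from the configuration. A small sketch of the arithmetic, assuming the full 36,000 samples were loaded; `round` stands in for the exact rounding that `train_test_split` applies internally.

```python
n_total = 36_000
test_size, val_size = 0.2, 0.2

# First split: hold out the test set
n_test = round(n_total * test_size)
n_temp = n_total - n_test

# Second split: val_size is re-expressed as a fraction of the remaining data
val_fraction = val_size / (1 - test_size)
n_val = round(n_temp * val_fraction)
n_train = n_temp - n_val
print(n_train, n_val, n_test)  # 21600 7200 7200

# 'balanced' class weight: n_samples / (n_classes * samples_in_class)
per_class = n_train // 3  # stratified split of three equally sized classes
weight = n_train / (3 * per_class)
print(weight)  # 1.0
```

Because the classes are capped at the same size, the weighted `CrossEntropyLoss` behaves like the unweighted loss in this run.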


In [5]:
# LSTM Training Pipeline with Sequential Processing
def train_epoch(model, train_loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for batch_idx, (data, targets) in enumerate(train_loader):
        data, targets = data.to(device), targets.to(device)
        # data shape: (batch_size, sequence_length, input_features)
        
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, targets)
        loss.backward()
        
        # Gradient clipping for LSTM stability
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        optimizer.step()
        
        total_loss += loss.item()
        _, predicted = torch.max(outputs, 1)
        total += targets.size(0)
        correct += (predicted == targets).sum().item()
    
    return total_loss / len(train_loader), correct / total

def validate_epoch(model, val_loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for data, targets in val_loader:
            data, targets = data.to(device), targets.to(device)
            outputs = model(data)
            loss = criterion(outputs, targets)
            
            total_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            total += targets.size(0)
            correct += (predicted == targets).sum().item()
    
    return total_loss / len(val_loader), correct / total

# Training loop with early stopping
print("🔄 Starting LSTM sequential learning...")
print("🎯 Targets: Beat ViT (96.94%) and CNN (~93%+)")
print("💡 Hypothesis: Sequential patterns > Spatial patterns for IoT security\n")

best_val_acc = 0
patience_counter = 0
train_losses = []
val_losses = []
train_accs = []
val_accs = []

start_time = datetime.now()

for epoch in range(CONFIG['epochs']):
    # Train for one epoch
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, CONFIG['device'])
    
    # Validate
    val_loss, val_acc = validate_epoch(model, val_loader, criterion, CONFIG['device'])
    
    # Update learning rate
    scheduler.step(val_acc)
    
    # Store metrics
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    train_accs.append(train_acc)
    val_accs.append(val_acc)
    
    # Print progress with architecture comparison
    print(f"Epoch {epoch+1:2d}/{CONFIG['epochs']} | "
          f"Train: {train_acc:.4f} ({train_loss:.4f}) | "
          f"Val: {val_acc:.4f} ({val_loss:.4f}) | "
          f"LR: {optimizer.param_groups[0]['lr']:.6f}")
    
    # Architecture comparison updates
    if val_acc > 0.93:
        print(f"🔄 LSTM approaching CNN performance! Current: {val_acc:.4f} vs CNN: ~0.93+")
    if val_acc > 0.96:
        print(f"🎯 LSTM approaching ViT performance! Current: {val_acc:.4f} vs ViT: 0.9694")
    
    # Save best model
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        patience_counter = 0
        torch.save(model.state_dict(), 'best_lstm_3class_full_capacity_model.pth')
        print(f"✅ New best validation accuracy: {val_acc:.4f}")
        
        # Check performance milestones
        if val_acc > 0.9694:
            print(f"🏆 LSTM BEATS ViT! {val_acc:.4f} > 0.9694")
            print(f"🔄 Sequential modeling leads on this validation split (single run)")
    else:
        patience_counter += 1
        
        if patience_counter >= CONFIG['patience']:
            print(f"\n⏰ Early stopping triggered after {epoch+1} epochs")
            print(f"   Best validation accuracy: {best_val_acc:.4f}")
            break

training_time = datetime.now() - start_time
print(f"\n🎯 LSTM Training Complete!")
print(f"   ⏱️  Total time: {training_time}")
print(f"   🏆 Best validation accuracy: {best_val_acc:.4f}")
print(f"   📊 Total epochs: {epoch+1}")

# Compare with baselines
vit_accuracy = 0.9694
cnn_accuracy = 0.93  # Conservative estimate

print(f"\n📊 ARCHITECTURE COMPARISON:")
print(f"   🤖 ViT (Global Attention): {vit_accuracy:.4f}")
print(f"   🏗️  CNN (Local Features): ~{cnn_accuracy:.2f}+ (training)")
print(f"   🔄 LSTM (Sequential): {best_val_acc:.4f}")

if best_val_acc > vit_accuracy:
    improvement = (best_val_acc - vit_accuracy) * 100
    print(f"\n🎉 LSTM OUTPERFORMS ALL BASELINES!")
    print(f"   LSTM beats ViT by +{improvement:.2f} percentage points")
    print(f"   🔄 Sequential modeling leads on this validation split (single run)")
elif best_val_acc > cnn_accuracy:
    print(f"\n🥈 LSTM BEATS CNN but trails ViT")
    print(f"   Sequential > Local features, but Global attention still leads")
else:
    print(f"\n📊 LSTM provides alternative perspective")
    print(f"   All architectures show competitive performance")


🔄 Starting LSTM sequential learning...
🎯 Targets: Beat ViT (96.94%) and CNN (~93%+)
💡 Hypothesis: Sequential patterns > Spatial patterns for IoT security

Epoch  1/50 | Train: 0.6969 (0.6745) | Val: 0.7683 (0.5153) | LR: 0.001000
✅ New best validation accuracy: 0.7683
Epoch  2/50 | Train: 0.8172 (0.4666) | Val: 0.8331 (0.3780) | LR: 0.001000
✅ New best validation accuracy: 0.8331
Epoch  3/50 | Train: 0.8553 (0.3930) | Val: 0.8769 (0.3111) | LR: 0.001000
✅ New best validation accuracy: 0.8769
Epoch  4/50 | Train: 0.8780 (0.3260) | Val: 0.8883 (0.2761) | LR: 0.001000
✅ New best validation accuracy: 0.8883
Epoch  5/50 | Train: 0.8837 (0.3131) | Val: 0.8757 (0.3086) | LR: 0.001000
Epoch  6/50 | Train: 0.8889 (0.2908) | Val: 0.9047 (0.2355) | LR: 0.001000
✅ New best validation accuracy: 0.9047
Epoch  7/50 | Train: 0.8941 (0.2711) | Val: 0.9006 (0.2482) | LR: 0.001000
Epoch  8/50 | Train: 0.8941 (0.2732) | Val: 0.8994 (0.2651) | LR: 0.001000
Epoch  9/50 | Train: 0.8950 (0.2651) | Val: 0.9038
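
To make the early-stopping bookkeeping concrete, the sketch below replays the nine validation accuracies visible in the log above through the same best-score and patience logic as the training loop. It is a standalone trace, not the training code itself.

```python
# Validation accuracies for epochs 1-9, copied from the log above
val_accs = [0.7683, 0.8331, 0.8769, 0.8883, 0.8757, 0.9047, 0.9006, 0.8994, 0.9038]
patience = 7  # CONFIG['patience']

best_val_acc, patience_counter, stopped_at = 0.0, 0, None
for epoch, val_acc in enumerate(val_accs, start=1):
    if val_acc > best_val_acc:
        # New best: the training loop would save a checkpoint here
        best_val_acc, patience_counter = val_acc, 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            stopped_at = epoch
            break

print(best_val_acc, patience_counter, stopped_at)  # 0.9047 3 None
```

After epoch 9 the counter stands at 3 of 7, so training continues. The best checkpoint at that point is the one from epoch 6.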

In [None]:
# LSTM Evaluation & Multi-Architecture Results Analysis
# Load best model
model.load_state_dict(torch.load('best_lstm_3class_full_capacity_model.pth', map_location=CONFIG['device']))
model.eval()

# Test set evaluation
def evaluate_model(model, test_loader, device, label_encoder):
    model.eval()
    all_preds = []
    all_targets = []
    all_probs = []
    
    with torch.no_grad():
        for data, targets in test_loader:
            data, targets = data.to(device), targets.to(device)
            outputs = model(data)
            probs = torch.softmax(outputs, dim=1)
            _, predicted = torch.max(outputs, 1)
            
            all_preds.extend(predicted.cpu().numpy())
            all_targets.extend(targets.cpu().numpy())
            all_probs.extend(probs.cpu().numpy())
    
    return np.array(all_preds), np.array(all_targets), np.array(all_probs)

print("🧪 Final LSTM Test Set Evaluation...")
test_preds, test_targets, test_probs = evaluate_model(model, test_loader, CONFIG['device'], label_encoder)
test_accuracy = accuracy_score(test_targets, test_preds)

print(f"\n🎯 FINAL LSTM RESULTS:")
print(f"   📊 Test Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")

# Multi-architecture performance comparison
vit_test_accuracy = 0.9694
cnn_test_accuracy = 0.93  # Conservative estimate until CNN completes

print(f"\n🏆 MULTI-ARCHITECTURE COMPARISON:")
architectures = {
    'ViT (Global Attention)': vit_test_accuracy,
    'CNN (Local Features)': cnn_test_accuracy,
    'LSTM (Sequential)': test_accuracy
}

# Sort by performance
sorted_archs = sorted(architectures.items(), key=lambda x: x[1], reverse=True)
for i, (arch, acc) in enumerate(sorted_archs):
    medal = ['🥇', '🥈', '🥉'][i] if i < 3 else '📊'
    print(f"   {medal} {arch}: {acc:.4f} ({acc*100:.2f}%)")

# Performance tier classification
if test_accuracy > vit_test_accuracy:
    improvement = (test_accuracy - vit_test_accuracy) * 100
    print(f"\n🎉 LSTM ACHIEVES NEW STATE-OF-THE-ART!")
    print(f"   LSTM: {test_accuracy:.4f} vs ViT: {vit_test_accuracy:.4f}")
    print(f"   Improvement: +{improvement:.2f} percentage points")
    print(f"   🔄 Sequential modeling outperforms the ViT baseline on this test split")
    tier = "🥇 STATE-OF-THE-ART"
elif test_accuracy > 0.95:
    deficit = (vit_test_accuracy - test_accuracy) * 100
    print(f"\n🥈 LSTM EXCELLENT Performance!")
    print(f"   LSTM: {test_accuracy:.4f} vs ViT: {vit_test_accuracy:.4f}")
    print(f"   Deficit: -{deficit:.2f} percentage points")
    tier = "🥈 EXCELLENT"
elif test_accuracy > 0.90:
    deficit = (vit_test_accuracy - test_accuracy) * 100
    print(f"\n🥉 LSTM VERY GOOD Performance!")
    print(f"   LSTM: {test_accuracy:.4f} vs ViT: {vit_test_accuracy:.4f}")
    print(f"   Deficit: -{deficit:.2f} percentage points")
    tier = "🥉 VERY GOOD"
else:
    deficit = (vit_test_accuracy - test_accuracy) * 100
    print(f"\n📊 LSTM Performance Analysis:")
    print(f"   LSTM: {test_accuracy:.4f} vs ViT: {vit_test_accuracy:.4f}")
    print(f"   Deficit: -{deficit:.2f} percentage points")
    tier = "📊 BASELINE"

# Detailed classification report
print(f"\n📋 Detailed LSTM Classification Report:")
class_names = label_encoder.classes_
report = classification_report(test_targets, test_preds, target_names=class_names, digits=4)
print(report)

# Confidence analysis
confidence_scores = np.max(test_probs, axis=1)
mean_confidence = np.mean(confidence_scores)
print(f"\n🎯 LSTM Prediction Confidence Analysis:")
print(f"   Mean confidence: {mean_confidence:.4f}")
print(f"   High confidence (>0.9): {np.sum(confidence_scores > 0.9):,} samples")
print(f"   Low confidence (<0.7): {np.sum(confidence_scores < 0.7):,} samples")

# Save comprehensive results
results = {
    'experiment': 'LSTM_CIC_3class_full_capacity',
    'timestamp': datetime.now().isoformat(),
    'dataset': 'CIC-IoT23',
    'approach': '3-class semantic grouping',
    'architecture': 'LSTM_Sequential',
    'total_samples': len(X),
    'samples_per_class': CONFIG['max_samples_per_class'],
    'test_accuracy': float(test_accuracy),
    'validation_accuracy': float(best_val_acc),
    'training_epochs': epoch + 1,
    'training_time': str(training_time),
    'parameters': trainable_params,
    'sequence_config': {
        'sequence_length': CONFIG['sequence_length'],
        'input_features': CONFIG['input_features'],
        'hidden_size': CONFIG['hidden_size'],
        'num_layers': CONFIG['num_layers']
    },
    'multi_architecture_comparison': {
        'vit_baseline': vit_test_accuracy,
        'cnn_baseline': cnn_test_accuracy,
        'lstm_performance': float(test_accuracy),
        'lstm_vs_vit_diff': float(test_accuracy - vit_test_accuracy),
        'lstm_vs_cnn_diff': float(test_accuracy - cnn_test_accuracy)
    },
    'confidence_analysis': {
        'mean_confidence': float(mean_confidence),
        'high_confidence_samples': int(np.sum(confidence_scores > 0.9)),
        'low_confidence_samples': int(np.sum(confidence_scores < 0.7))
    },
    'performance_tier': tier,
    'classification_report': report,
    'class_names': list(class_names)
}

with open('results_lstm_3class_32x32_full_capacity.json', 'w') as f:
    json.dump(results, f, indent=2)

print(f"\n💾 Results saved to: results_lstm_3class_32x32_full_capacity.json")
print(f"\n🎯 MULTI-ARCHITECTURE RESEARCH SUMMARY:")
print(f"   🤖 ViT (Attention): {vit_test_accuracy:.4f} (~150K params)")
print(f"   🏗️  CNN (Convolution): ~{cnn_test_accuracy:.2f}+ (4.8M params)")
print(f"   🔄 LSTM (Sequential): {test_accuracy:.4f} ({trainable_params:,} params)")
print(f"   📊 Performance Tier: {tier}")

# Research implications
print(f"\n🔬 RESEARCH IMPLICATIONS:")
if test_accuracy > vit_test_accuracy:
    print(f"   ✅ Sequential modeling outperforms the ViT baseline on this test split")
    print(f"   ✅ Row-wise sequential structure appears more informative than spatial/attention features here")
    print(f"   ✅ LSTM is a strong deployment candidate, pending cross-dataset validation")
else:
    print(f"   📊 Multiple architectures achieve competitive performance")
    print(f"   📊 Choice depends on deployment constraints")
    print(f"   📊 All approaches valid for IoT cybersecurity")

print(f"\n🔄 NEXT STEPS:")
print(f"   1. ✅ LSTM baseline established: {test_accuracy:.4f}")
print(f"   2. 🔄 Run cross-dataset validation on UNSW-NB15")
print(f"   3. 🔄 Compare LSTM vs CNN vs ViT domain transfer")
print(f"   4. 🔄 Implement ensemble methods (LSTM+CNN+ViT)")
print(f"   5. 📊 Complete the multi-architecture comparison paper")
print(f"\n🎓 Ready for publication: Comprehensive IoT Cybersecurity Architecture Study!")


🧪 Final LSTM Test Set Evaluation...

🎯 FINAL LSTM RESULTS:
   📊 Test Accuracy: 0.9615 (96.15%)

🏆 MULTI-ARCHITECTURE COMPARISON:
   🥇 ViT (Global Attention): 0.9694 (96.94%)
   🥈 LSTM (Sequential): 0.9615 (96.15%)
   🥉 CNN (Local Features): 0.9300 (93.00%)

🥈 LSTM EXCELLENT Performance!
   LSTM: 0.9615 vs ViT: 0.9694
   Deficit: -0.79 percentage points

📋 Detailed LSTM Classification Report:
                precision    recall  f1-score   support

 Active_Attack     0.9962    0.9733    0.9846      2400
        Normal     0.9627    0.9363    0.9493      2400
Reconnaissance     0.9282    0.9750    0.9510      2400

      accuracy                         0.9615      7200
     macro avg     0.9624    0.9615    0.9616      7200
  weighted avg     0.9624    0.9615    0.9616      7200


🎯 LSTM Prediction Confidence Analysis:
   Mean confidence: 0.9774
   High confidence (>0.9): 6,738 samples
   Low confidence (<0.7): 241 samples

💾 Results saved to: results_lstm_3class_32x32_full_capacity.jso
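
As a consistency check on the classification report above, the F1 scores can be recomputed from the printed precision and recall values. The sketch uses only the rounded figures from the report, so agreement is expected to four decimals at best.

```python
# Precision and recall per class, copied from the classification report above
report = {
    "Active_Attack":  (0.9962, 0.9733),
    "Normal":         (0.9627, 0.9363),
    "Reconnaissance": (0.9282, 0.9750),
}

def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

f1_scores = {name: f1(p, r) for name, (p, r) in report.items()}
for name, score in f1_scores.items():
    print(f"{name:15s} F1 = {score:.4f}")

# With equal support (2,400 per class), accuracy equals the mean per-class recall
mean_recall = sum(r for _, r in report.values()) / len(report)
print(f"mean recall = {mean_recall:.4f}")
```

The lowest recall belongs to Normal (0.9363) and the lowest precision to Reconnaissance (0.9282), so benign traffic flagged as reconnaissance is one plausible source of the remaining errors. The confusion matrix would be needed to confirm it.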