# Cybersecurity Threat Detection with XGBoost

This notebook demonstrates a complete machine learning pipeline for detecting cybersecurity threats using XGBoost.

## Workflow:
1. **Load and Explore Data**
2. **Preprocess Data** (Clean, Encode, Scale)
3. **Train XGBoost Model**
4. **Evaluate Model Performance**
5. **Analyze False Positives/Negatives**
6. **Feature Importance Analysis**
7. **Export Metrics for Grafana Dashboard**

## 1. Setup and Imports

In [None]:
# Add src to path
import sys
sys.path.append('../src')

# Import custom modules
from preprocess import ThreatDataPreprocessor
from model import ThreatDetectionModel
from visualize import ThreatVisualization

# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("✓ All imports successful!")

## 2. Load and Explore Raw Data

In [None]:
# Load raw data
data_path = '../data/raw_data.csv'
df_raw = pd.read_csv(data_path)

print(f"Dataset shape: {df_raw.shape}")
print(f"\nColumns: {list(df_raw.columns)}")
print(f"\nFirst few rows:")
df_raw.head()

In [None]:
# Data info
df_raw.info()

In [None]:
# Check for missing values
missing_data = df_raw.isnull().sum()
missing_pct = (missing_data / len(df_raw)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Percentage': missing_pct
})
print(missing_df[missing_df['Missing Count'] > 0])

In [None]:
# Check target distribution - Using 'Label' column (BENIGN vs DDoS)
target_col = ' Label'  # Note: Column has leading space

if target_col in df_raw.columns:
    print("Target Distribution:")
    print(df_raw[target_col].value_counts())
    print(f"\nClass Balance:")
    print(df_raw[target_col].value_counts(normalize=True))
    
    # Visualize
    plt.figure(figsize=(8, 5))
    df_raw[target_col].value_counts().plot(kind='bar', color=['green', 'red'])
    plt.title('Target Class Distribution (BENIGN vs DDoS)', fontsize=14, fontweight='bold')
    plt.xlabel('Class')
    plt.ylabel('Count')
    plt.xticks(rotation=0)
    plt.show()
else:
    print(f"Target column '{target_col}' not found. Available columns: {list(df_raw.columns)}")

## 3. Data Preprocessing

This step includes:
- Handling missing values
- Removing duplicates
- Identifying categorical vs numerical features
- Label encoding categorical features
- Scaling numerical features using RobustScaler
- Train-test split with stratification
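
The steps above can be sketched with plain pandas and NumPy. This is a minimal illustration under stated assumptions, not the code inside `ThreatDataPreprocessor`: the toy columns and the helper names `stratified_split` and `robust_scale` are hypothetical. The scaling mirrors the default behaviour of scikit-learn's `RobustScaler` (subtract the median, divide by the interquartile range), and the sketch fits those statistics on the training rows only so that test-set values do not influence the scaler.

```python
import numpy as np
import pandas as pd


def stratified_split(y, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class represented proportionally."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    test_idx = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        n_test = max(1, int(round(len(idx) * test_size)))
        test_idx.extend(idx[:n_test])
    test_idx = np.sort(np.array(test_idx))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx


def robust_scale(train, test):
    """Center by the training median and scale by the training IQR."""
    median = train.median()
    iqr = train.quantile(0.75) - train.quantile(0.25)
    iqr = iqr.replace(0, 1.0)  # constant columns: avoid division by zero
    return (train - median) / iqr, (test - median) / iqr


# Toy data standing in for the real CSV (hypothetical columns)
df = pd.DataFrame({
    "flow_duration": [10, 12, 11, 500, 13, 9, 480, 14, 10, 510],
    "protocol": ["tcp", "udp", "tcp", "tcp", "udp", "tcp", "tcp", "udp", "tcp", "tcp"],
    "label": ["BENIGN", "BENIGN", "BENIGN", "DDoS", "BENIGN",
              "BENIGN", "DDoS", "BENIGN", "BENIGN", "DDoS"],
})

# Handle missing values and duplicates
df = df.dropna().drop_duplicates().reset_index(drop=True)

# Label-encode the categorical feature and the target
X = df.drop(columns="label")
X["protocol"] = X["protocol"].astype("category").cat.codes
y = (df["label"] == "DDoS").astype(int)

# Stratified split first, then scale numeric columns with training statistics
train_idx, test_idx = stratified_split(y, test_size=0.2, seed=42)
X_train, X_test = X.iloc[train_idx].copy(), X.iloc[test_idx].copy()
num_cols = ["flow_duration"]
train_scaled, test_scaled = robust_scale(X_train[num_cols], X_test[num_cols])
X_train[num_cols] = train_scaled
X_test[num_cols] = test_scaled

print(X_train.shape, X_test.shape)
```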

In [None]:
# Initialize preprocessor
preprocessor = ThreatDataPreprocessor()

# Run complete preprocessing pipeline
X_train, X_test, y_train, y_test, feature_names = preprocessor.preprocess_pipeline(
    filepath=data_path,
    target_col=target_col,  # Adjust based on your dataset
    test_size=0.2,
    random_state=RANDOM_STATE
)

print(f"\n{'='*60}")
print("PREPROCESSING COMPLETE")
print(f"{'='*60}")
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Total features: {len(feature_names)}")

In [None]:
# Save preprocessor for later use
preprocessor.save_preprocessor('../src/preprocessor.pkl')
print("✓ Preprocessor saved!")

In [None]:
# Visualize class distribution in train/test sets
viz = ThreatVisualization()
viz.plot_class_distribution(y_train, y_test, figsize=(12, 5))

## 4. Feature Selection (Optional)

After exploring the data, you can restrict the model to the features most relevant for threat detection.
Decide which features to keep based on domain knowledge and the exploratory analysis above.
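
One hedged way to build a shortlist before applying domain knowledge is to drop columns that carry no signal (a single unique value) or that nearly duplicate another column (very high pairwise correlation). The helper name `drop_uninformative`, the 0.95 cutoff, and the toy columns below are assumptions for illustration; they are not part of this project's modules.

```python
import numpy as np
import pandas as pd


def drop_uninformative(X, corr_threshold=0.95):
    """Drop constant columns, then one column from each highly correlated pair."""
    X = X.loc[:, X.nunique() > 1]
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return X.drop(columns=to_drop)


X_demo = pd.DataFrame({
    "pkt_count": [1, 2, 3, 4, 5, 6],
    "byte_count": [10, 20, 30, 40, 50, 61],  # nearly a multiple of pkt_count
    "flag": [0, 0, 0, 0, 0, 0],              # constant
    "duration": [5, 1, 4, 2, 6, 3],
})
selected = drop_uninformative(X_demo)
print(list(selected.columns))
```

The resulting column list could then be passed as `selected_features` in Option 2 below.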

In [None]:
# Option 1: Use all features (default)
print(f"Using all {len(feature_names)} features")
print(f"\nFeature list:")
for i, feat in enumerate(feature_names, 1):
    print(f"{i}. {feat}")

# Option 2: Select specific features (uncomment and modify as needed)
# selected_features = ['feature1', 'feature2', 'feature3']  # Replace with your choices
# X_train = X_train[selected_features]
# X_test = X_test[selected_features]
# feature_names = selected_features
# print(f"\nSelected {len(selected_features)} features for modeling")

## 5. Train XGBoost Model

In [None]:
# Initialize model
model = ThreatDetectionModel(random_state=RANDOM_STATE)

# Train with custom hyperparameters (edit the values below to tune the model)
custom_params = {
    'max_depth': 6,
    'learning_rate': 0.1,
    'n_estimators': 200,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'gamma': 0.1,
    'min_child_weight': 1
}

model.train_model(X_train, y_train, params=custom_params)
print("\n✓ Model training complete!")

## 6. Model Evaluation

In [None]:
# Evaluate on test set
metrics = model.evaluate_model(X_test, y_test, dataset_name="Test")

In [None]:
# Visualize confusion matrix
model.plot_confusion_matrix(metrics['confusion_matrix'], figsize=(8, 6))

In [None]:
# Plot ROC curve
model.plot_roc_curve(y_test, metrics['y_pred_proba'], figsize=(8, 6))

In [None]:
# Visualize all metrics
viz.plot_metrics_comparison(metrics, figsize=(10, 6))
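
For reference, the headline metrics can be derived directly from the four confusion-matrix counts. The sketch below uses made-up counts and a hypothetical helper name, `metrics_from_counts`; it is not the body of `evaluate_model`, which may compute these values differently.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Invented counts, for illustration only
demo = metrics_from_counts(tn=90, fp=10, fn=5, tp=95)
for name, value in demo.items():
    print(f"{name}: {value:.4f}")
```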

## 7. Analyze False Positives and False Negatives

Understanding where the model makes mistakes is crucial for improving threat detection.
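
A minimal sketch of how misclassified rows can be isolated, assuming binary labels with 1 = threat and a 0.5 probability cutoff. The function name `split_errors` and the toy arrays are hypothetical; `analyze_false_positives_negatives` may work differently.

```python
import numpy as np
import pandas as pd


def split_errors(X, y_true, y_proba, threshold=0.5):
    """Return the rows of X that are false positives and false negatives."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)
    fp_mask = (y_pred == 1) & (y_true == 0)  # benign traffic flagged as a threat
    fn_mask = (y_pred == 0) & (y_true == 1)  # threat that was missed
    return X[fp_mask], X[fn_mask]


X_demo = pd.DataFrame({"flow_duration": [0.1, 2.3, 0.4, 1.9, 0.2]})
y_demo = [0, 1, 0, 1, 0]
proba_demo = [0.7, 0.9, 0.2, 0.3, 0.1]

fp_rows, fn_rows = split_errors(X_demo, y_demo, proba_demo)
print("False positives:", list(fp_rows.index))
print("False negatives:", list(fn_rows.index))
```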

In [None]:
# Analyze errors
false_positives, false_negatives = model.analyze_false_positives_negatives(
    X_test, y_test, feature_names
)

In [None]:
# Visualize prediction distributions
viz.plot_prediction_distribution(
    y_test, 
    metrics['y_pred_proba'], 
    figsize=(12, 5)
)

In [None]:
# Analyze characteristics of false predictions
viz.plot_false_positive_negative_analysis(
    false_positives, 
    false_negatives, 
    top_features=5,
    figsize=(14, 6)
)

## 8. Feature Importance Analysis

Identify which features are most important for threat detection.
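
The two importance types used in this section measure different things. In XGBoost, `weight` counts how many times a feature is used to split, while `gain` is the average loss reduction over the splits that use it. The sketch below recomputes both from a hand-written list of splits; the split records and feature names are invented for illustration and do not come from a trained model.

```python
from collections import defaultdict

# (feature used in the split, loss reduction achieved by that split)
splits = [
    ("flow_duration", 12.0),
    ("flow_duration", 8.0),
    ("flow_duration", 4.0),
    ("packet_rate", 30.0),
]


def importance(splits):
    """Return (weight, gain) dictionaries computed from split records."""
    counts = defaultdict(int)
    total_gain = defaultdict(float)
    for feature, split_gain in splits:
        counts[feature] += 1
        total_gain[feature] += split_gain
    weight = dict(counts)
    gain = {f: total_gain[f] / counts[f] for f in counts}
    return weight, gain


weight, gain = importance(splits)
print("weight:", weight)  # flow_duration is used most often
print("gain:", gain)      # packet_rate has the larger average gain
```

A feature can rank first by `weight` and last by `gain`, which is why comparing both rankings is useful.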

In [None]:
# Get feature importance
importance_df = model.get_feature_importance(top_n=20, importance_type='weight')

In [None]:
# Plot feature importance
model.plot_feature_importance(top_n=20, figsize=(10, 8))

In [None]:
# Compare with a different importance type
print("\nFeature Importance by Gain (average loss reduction per split):")
importance_gain = model.get_feature_importance(top_n=10, importance_type='gain')

## 9. Comprehensive Dashboard Visualization

In [None]:
# Create comprehensive dashboard
viz.plot_comprehensive_dashboard(
    metrics=metrics,
    feature_importance=importance_df,
    y_true=y_test,
    y_pred=metrics['y_pred'],
    y_pred_proba=metrics['y_pred_proba'],
    figsize=(16, 12),
    save_path='../dashboard/comprehensive_dashboard.png'
)

## 10. Export Metrics for Grafana Dashboard

In [None]:
# Create JSON summary for Grafana
dashboard_summary = viz.create_dashboard_summary(
    metrics=metrics,
    feature_importance=importance_df,
    fp_count=len(false_positives),
    fn_count=len(false_negatives),
    output_path='../dashboard/metrics_summary.json'
)

print("\n✓ Dashboard metrics exported!")
print("\nSummary:")
import json
print(json.dumps(dashboard_summary, indent=2))
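
When writing a metrics file by hand, note that the standard `json` module rejects NumPy integers and arrays. A small sketch, assuming a hypothetical `to_native` helper and made-up metric values (this is not the body of `create_dashboard_summary`):

```python
import json

import numpy as np


def to_native(obj):
    """Fallback for json.dumps: convert NumPy scalars and arrays to built-ins."""
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


# Invented values, for illustration only
summary = {
    "accuracy": np.float32(0.5),
    "false_positives": np.int64(10),
    "confusion_matrix": np.array([[90, 10], [5, 95]]),
}
payload = json.dumps(summary, default=to_native, indent=2)
print(payload)
```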

## 11. Save Trained Model

In [None]:
# Save model for deployment
model.save_model('../src/threat_detection_model.pkl')
print("✓ Model saved successfully!")

## 12. Key Insights and Conclusions

In [None]:
print("="*70)
print("THREAT DETECTION MODEL - KEY INSIGHTS")
print("="*70)

print(f"\n1. MODEL PERFORMANCE:")
print(f"   - Accuracy:  {metrics['accuracy']:.4f}")
print(f"   - Precision: {metrics['precision']:.4f} (How many predicted threats are actual threats)")
print(f"   - Recall:    {metrics['recall']:.4f} (How many actual threats were detected)")
print(f"   - F1-Score:  {metrics['f1_score']:.4f}")
if metrics.get('roc_auc') is not None:
    print(f"   - ROC AUC:   {metrics['roc_auc']:.4f}")

print(f"\n2. ERROR ANALYSIS:")
tn, fp, fn, tp = metrics['confusion_matrix'].ravel()
total = tn + fp + fn + tp
print(f"   - False Positives: {fp} ({fp/total*100:.2f}%) - Normal traffic flagged as threats")
print(f"   - False Negatives: {fn} ({fn/total*100:.2f}%) - Threats missed by the model")
print(f"   - True Positives:  {tp} ({tp/total*100:.2f}%) - Correctly identified threats")
print(f"   - True Negatives:  {tn} ({tn/total*100:.2f}%) - Correctly identified normal traffic")

print(f"\n3. TOP 5 MOST IMPORTANT FEATURES:")
# enumerate() gives ranks 1-5 even if the DataFrame index is not 0..n-1
for rank, (_, row) in enumerate(importance_df.head(5).iterrows(), 1):
    print(f"   {rank}. {row['feature']}: {row['importance']:.2f}")

print(f"\n4. DEPLOYMENT CONSIDERATIONS:")
if metrics['precision'] > 0.9:
    print("   ✓ High precision - Few false alarms")
else:
    print("   ⚠ Consider tuning threshold to reduce false positives")

if metrics['recall'] > 0.9:
    print("   ✓ High recall - Most threats are detected")
else:
    print("   ⚠ Some threats may slip through - consider lowering the decision threshold or reweighting classes")

print(f"\n5. NEXT STEPS:")
print("   - Deploy model to production environment")
print("   - Set up Grafana dashboard for real-time monitoring")
print("   - Implement automated retraining pipeline")
print("   - Configure alerting based on prediction confidence")
print("   - Monitor for concept drift and model degradation")

print("\n" + "="*70)
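
The threshold-tuning advice in the deployment considerations can be made concrete with a small sweep. This sketch assumes binary labels with 1 = threat, uses invented scores, and relies on a hypothetical helper, `precision_recall_at`; in the notebook, `y_test` and `metrics['y_pred_proba']` would take the place of the toy arrays.

```python
import numpy as np


def precision_recall_at(y_true, y_proba, threshold):
    """Precision and recall when scores >= threshold are flagged as threats."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_proba) >= threshold
    tp = int(np.sum(y_pred & (y_true == 1)))
    fp = int(np.sum(y_pred & (y_true == 0)))
    fn = int(np.sum(~y_pred & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented labels and scores, for illustration only
y_demo = [0, 0, 0, 1, 1, 1]
proba_demo = [0.1, 0.4, 0.6, 0.45, 0.8, 0.9]

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(y_demo, proba_demo, t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold trades recall for precision, so the right value depends on whether false alarms or missed threats are more costly in deployment.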

## 13. Load and Use Saved Model (Example)

In [None]:
# Example: Loading and using the saved model
# loaded_model = ThreatDetectionModel()
# loaded_model.load_model('../src/threat_detection_model.pkl')

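# Make predictions on new data
# (X_new must first go through the same preprocessing as the training data,
#  using the preprocessor saved in section 3)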
# predictions = loaded_model.predict(X_new)
# probabilities = loaded_model.predict_proba(X_new)