# 🤖 BugPredict AI - Complete Tutorial

This notebook demonstrates the complete BugPredict AI workflow:

1. **Data Collection** - Gather vulnerability data from HackerOne and NVD (a Bugcrowd collector is imported but not used in this walkthrough)
2. **Data Preprocessing** - Clean and normalize data
3. **Feature Engineering** - Extract 100+ features
4. **Model Training** - Train ensemble models
5. **Inference** - Predict vulnerabilities for new targets
6. **Analysis** - Generate actionable reports

---

## 🔧 Setup

First, let's import all necessary libraries and set up our environment.

In [None]:
# Standard imports
import sys
import warnings
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

# Add parent directory to path
sys.path.insert(0, str(Path.cwd().parent))

# BugPredict AI imports
from src.collectors.hackerone_scraper import HackerOneCollector
from src.collectors.bugcrowd_scraper import BugcrowdCollector
from src.collectors.cve_collector import CVECollector
from src.preprocessing.normalizer import DataNormalizer
from src.preprocessing.deduplicator import Deduplicator
from src.preprocessing.enricher import DataEnricher
from src.features.feature_engineer import FeatureEngineer
from src.models.vulnerability_classifier import VulnerabilityPredictor
from src.models.severity_predictor import SeverityPredictor
from src.models.chain_detector import ChainDetector
from src.inference.predictor import ThreatPredictor
from src.training.pipeline import TrainingPipeline

# Visualization settings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ All imports successful!")

---

## 📊 Part 1: Data Collection

Let's collect vulnerability data from multiple sources.
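
The cells below read the same handful of attributes from every report regardless of its source (`report_id`, `vulnerability_type`, `severity`, `platform`, and so on), which suggests the collectors return a common record type. A minimal sketch of such a record, assuming a hypothetical `Report` dataclass (the real class in `src.collectors` may differ in name and fields):

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Report:
    """Hypothetical stand-in for the objects the collectors return."""
    report_id: str
    platform: str
    vulnerability_type: str
    severity: str
    target_domain: Optional[str] = None
    bounty_amount: float = 0.0
    cvss_score: Optional[float] = None
    description: str = ""


# Made-up records, used only to illustrate the shape of the data.
sample_reports = [
    Report("H1-0001", "hackerone", "XSS", "medium", "example.com", 500.0),
    Report("H1-0002", "hackerone", "SQL Injection", "high", "example.org", 2000.0),
    Report("CVE-0000-0001", "nvd", "XSS", "medium", cvss_score=6.1),
]

# Count reports per source, as section 1.3 does with pandas value_counts().
platform_counts = Counter(r.platform for r in sample_reports)
print(platform_counts.most_common())
```

Fields that only one source provides (bounty amounts for HackerOne, CVSS scores for NVD) are optional here so that both kinds of report fit one type.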

### 1.1 HackerOne Data Collection

In [None]:
print("Collecting data from HackerOne...\n")

# Initialize collector
h1_collector = HackerOneCollector()

# Collect reports (uses cache if available)
h1_reports = h1_collector.collect(limit=1000, use_cache=True)

print(f"\n✓ Collected {len(h1_reports)} reports from HackerOne")

# Show sample report
if h1_reports:
    sample = h1_reports[0]
    print(f"\nSample Report:")
    print(f"  ID: {sample.report_id}")
    print(f"  Type: {sample.vulnerability_type}")
    print(f"  Severity: {sample.severity}")
    print(f"  Target: {sample.target_domain}")
    print(f"  Bounty: ${sample.bounty_amount}")

### 1.2 CVE/NVD Data Collection

In [None]:
print("Collecting CVEs from NVD...\n")

# Initialize collector
cve_collector = CVECollector()

# Date range: last 180 days
end_date = datetime.now()
start_date = end_date - timedelta(days=180)

# Collect CVEs
cve_reports = cve_collector.collect(
    start_date=start_date,
    end_date=end_date,
    keywords=['web', 'application'],
    limit=500,
    use_cache=True
)

print(f"\n✓ Collected {len(cve_reports)} CVEs")

# Show sample
if cve_reports:
    sample = cve_reports[0]
    print(f"\nSample CVE:")
    print(f"  ID: {sample.report_id}")
    print(f"  Type: {sample.vulnerability_type}")
    print(f"  CVSS: {sample.cvss_score}")
    print(f"  Description: {sample.description[:100]}...")

### 1.3 Combine All Data

In [None]:
# Combine all reports
all_reports = h1_reports + cve_reports

print(f"Total reports collected: {len(all_reports)}")

# Data distribution
platforms = pd.Series([r.platform for r in all_reports])
print(f"\nData sources:")
print(platforms.value_counts())

### 1.4 Visualize Data Distribution

In [None]:
# Vulnerability type distribution
vuln_types = pd.Series([r.vulnerability_type for r in all_reports])
top_vulns = vuln_types.value_counts().head(10)

plt.figure(figsize=(12, 6))
top_vulns.plot(kind='barh', color='steelblue')
plt.title('Top 10 Vulnerability Types', fontsize=14, fontweight='bold')
plt.xlabel('Count')
plt.ylabel('Vulnerability Type')
plt.tight_layout()
plt.show()

# Severity distribution
severities = pd.Series([r.severity for r in all_reports])

plt.figure(figsize=(10, 6))
severities.value_counts().plot(kind='pie', autopct='%1.1f%%', colors=sns.color_palette('RdYlGn_r'))
plt.title('Severity Distribution', fontsize=14, fontweight='bold')
plt.ylabel('')
plt.tight_layout()
plt.show()

---

## 🧹 Part 2: Data Preprocessing

Clean and normalize the collected data.
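
Different sources label the same thing differently, so normalization usually means mapping free-form labels to one canonical vocabulary. A rough sketch of that idea, assuming hypothetical alias tables and function names (this is not `DataNormalizer`'s actual logic):

```python
# Hypothetical alias tables; a real normalizer would cover far more variants.
SEVERITY_ALIASES = {
    "crit": "critical",
    "p1": "critical",
    "moderate": "medium",
    "informational": "low",
}
VULN_TYPE_ALIASES = {
    "cross-site scripting": "XSS",
    "cross site scripting (xss)": "XSS",
    "sqli": "SQL Injection",
}


def normalize_severity(raw: str) -> str:
    """Lower-case, trim, and map a severity label to a canonical value."""
    key = raw.strip().lower()
    return SEVERITY_ALIASES.get(key, key)


def normalize_vuln_type(raw: str) -> str:
    """Map known aliases to a canonical name; leave unknown labels trimmed."""
    cleaned = raw.strip()
    return VULN_TYPE_ALIASES.get(cleaned.lower(), cleaned)


print(normalize_severity("  Moderate "))
print(normalize_vuln_type("Cross-Site Scripting"))
```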

### 2.1 Normalize Data

In [None]:
print("Normalizing data...")

normalizer = DataNormalizer()
normalized_reports = normalizer.normalize(all_reports)

print(f"✓ Normalized {len(normalized_reports)} reports")

### 2.2 Remove Duplicates
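
One common way to deduplicate is to build a key from the fields that identify a finding and keep the first report seen for each key. The sketch below assumes plain dicts and a hypothetical choice of key fields; `Deduplicator` may well use a different strategy, such as fuzzy matching on descriptions:

```python
def dedup_key(report: dict) -> tuple:
    """Build a case-insensitive identity key (hypothetical field choice)."""
    return (
        report["target_domain"].strip().lower(),
        report["vulnerability_type"].strip().lower(),
        report["title"].strip().lower(),
    )


def deduplicate(reports: list[dict]) -> list[dict]:
    """Keep the first report for each key, preserving input order."""
    seen = set()
    unique = []
    for report in reports:
        key = dedup_key(report)
        if key not in seen:
            seen.add(key)
            unique.append(report)
    return unique


reports = [
    {"target_domain": "example.com", "vulnerability_type": "XSS", "title": "Stored XSS in profile"},
    {"target_domain": "Example.com", "vulnerability_type": "xss", "title": "stored XSS in profile "},
    {"target_domain": "example.com", "vulnerability_type": "IDOR", "title": "IDOR on /api/users"},
]
unique_reports = deduplicate(reports)
print(f"Removed {len(reports) - len(unique_reports)} duplicate(s)")
```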

In [None]:
print("Removing duplicates...")

deduplicator = Deduplicator()
deduplicated_reports = deduplicator.deduplicate(normalized_reports)

removed = len(normalized_reports) - len(deduplicated_reports)
print(f"✓ Removed {removed} duplicates")
print(f"✓ {len(deduplicated_reports)} unique reports remaining")

### 2.3 Enrich Data

In [None]:
print("Enriching data...")

enricher = DataEnricher()
enriched_reports = enricher.enrich(deduplicated_reports)

print(f"✓ Enriched {len(enriched_reports)} reports")

---

## ⚙️ Part 3: Feature Engineering

Extract 100+ features from vulnerability reports.
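
To make "features" concrete, here is a small sketch that turns one target description into a flat numeric vector: binary flags for known technologies plus a few counts. The vocabulary and feature names are assumptions for illustration, not the features `FeatureEngineer` actually produces:

```python
# Hypothetical technology vocabulary; each entry becomes one binary feature.
KNOWN_TECH = ["react", "node.js", "postgresql", "redis", "aws"]


def extract_features(target: dict) -> dict:
    """Return a flat mapping of feature name to numeric value."""
    stack = {tech.lower() for tech in target.get("technology_stack", [])}
    endpoints = target.get("endpoints", [])

    features = {f"tech_{tech}": int(tech in stack) for tech in KNOWN_TECH}
    features["tech_count"] = len(stack)
    features["endpoint_count"] = len(endpoints)
    features["has_auth_endpoint"] = int(any("auth" in e for e in endpoints))
    features["auth_required"] = int(bool(target.get("auth_required", False)))
    features["has_api"] = int(bool(target.get("has_api", False)))
    return features


example_features = extract_features({
    "technology_stack": ["React", "Node.js", "MySQL"],
    "endpoints": ["/api/users", "/api/auth/login"],
    "auth_required": True,
    "has_api": True,
})
print(example_features)
```

Keeping every value numeric matters because the training cells later select only numeric columns from `features_df`.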

In [None]:
print("Engineering features...\n")

feature_engineer = FeatureEngineer()
features_df = feature_engineer.fit_transform(enriched_reports)

print(f"\n✓ Generated {features_df.shape[1]} features")
print(f"✓ Dataset shape: {features_df.shape}")

# Show sample features
print("\nSample features:")
print(features_df.head())

### 3.1 Feature Analysis

In [None]:
# Numeric features summary
numeric_features = features_df.select_dtypes(include=[np.number])

print(f"Numeric features: {len(numeric_features.columns)}")
print("\nFeature statistics:")
print(numeric_features.describe())

### 3.2 Feature Correlations

In [None]:
# Correlation heatmap (top features)
top_features = numeric_features.columns[:20]
corr_matrix = numeric_features[top_features].corr()

plt.figure(figsize=(14, 10))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix (Top 20)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

---

## 🤖 Part 4: Model Training

Train ensemble ML models.
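
The evaluation cells later pass `method='averaging'`, which in ensemble learning usually refers to soft voting: average each model's class probabilities, then pick the class with the highest mean. A minimal sketch of that technique with made-up probabilities (how `VulnerabilityPredictor` combines its models internally is an assumption here):

```python
def average_probabilities(model_probs: list[dict]) -> dict:
    """Average per-class probabilities across models (soft voting)."""
    classes = model_probs[0].keys()
    return {
        cls: sum(probs[cls] for probs in model_probs) / len(model_probs)
        for cls in classes
    }


def ensemble_predict(model_probs: list[dict]) -> str:
    """Return the class with the highest averaged probability."""
    averaged = average_probabilities(model_probs)
    return max(averaged, key=averaged.get)


# Made-up outputs from three models for a single sample.
per_model = [
    {"XSS": 0.6, "SQL Injection": 0.3, "IDOR": 0.1},
    {"XSS": 0.2, "SQL Injection": 0.5, "IDOR": 0.3},
    {"XSS": 0.5, "SQL Injection": 0.4, "IDOR": 0.1},
]
print(ensemble_predict(per_model))
```

Averaging tends to smooth out a single model's overconfident mistake, which is one reason ensembles are often preferred over any individual member.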

### 4.1 Prepare Training Data
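
A stratified split needs at least two samples of every class; scikit-learn's `train_test_split` raises a `ValueError` when `stratify` is given a class with a single member. With many vulnerability types and a small dataset this is easy to hit, so one option is to drop (or merge) rare classes first. A sketch, where `min_count` and the helper name are assumptions:

```python
from collections import Counter


def keep_frequent_classes(labels: list, min_count: int = 2) -> list[int]:
    """Return the indices of samples whose class appears at least min_count times."""
    counts = Counter(labels)
    return [i for i, label in enumerate(labels) if counts[label] >= min_count]


labels = ["XSS", "XSS", "SQL Injection", "IDOR", "SQL Injection", "XSS"]
kept = keep_frequent_classes(labels)
print(f"Keeping {len(kept)} of {len(labels)} samples")
```

In this notebook the returned positions could be applied with `features_df.iloc[kept]` before the targets and `X` are extracted.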

In [None]:
from sklearn.model_selection import train_test_split

# Extract targets
y_vuln = features_df['vuln_type'].values
y_severity = features_df['severity'].values
y_cvss = features_df['cvss_score'].values

# Extract features (only numeric)
X = features_df.drop(['vuln_type', 'severity', 'cvss_score'], axis=1, errors='ignore')
numeric_cols = X.select_dtypes(include=[np.number]).columns
X = X[numeric_cols]

print(f"Features shape: {X.shape}")
print(f"Target classes: {len(np.unique(y_vuln))}")

# Split data
X_train, X_test, y_train, y_test, sev_train, sev_test, cvss_train, cvss_test = train_test_split(
    X, y_vuln, y_severity, y_cvss,
    test_size=0.2,
    random_state=42,
    stratify=y_vuln
)

print(f"\nTrain set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

### 4.2 Train Vulnerability Classifier

In [None]:
print("Training Vulnerability Classifier...\n")

# Initialize and build
vuln_predictor = VulnerabilityPredictor(random_state=42)
vuln_predictor.build_models()

# Train
vuln_results = vuln_predictor.train(
    pd.DataFrame(X_train, columns=X.columns),
    pd.Series(y_train),
    test_size=0.0,  # Already split
    validation_size=0.1,
    perform_cv=True
)

print("\n✓ Vulnerability Classifier trained!")

### 4.3 Train Severity Predictor

In [None]:
print("Training Severity Predictor...\n")

# Initialize
severity_predictor = SeverityPredictor(random_state=42)
severity_predictor.build_model()

# Train
severity_results = severity_predictor.train(
    pd.DataFrame(X_train, columns=X.columns),
    pd.Series(sev_train),
    y_cvss=pd.Series(cvss_train),
    test_size=0.0,
    perform_cv=True
)

print("\n✓ Severity Predictor trained!")

### 4.4 Model Evaluation
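
For reference, the two headline metrics can be computed by hand. The sketch below uses macro-averaged F1 (the unweighted mean of per-class F1); whether `evaluate` reports macro, micro, or weighted F1 is not stated in this notebook, so treat the averaging choice as an assumption:

```python
def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["XSS", "XSS", "SQLi", "IDOR"]
y_pred = ["XSS", "SQLi", "SQLi", "IDOR"]
print(f"accuracy={accuracy(y_true, y_pred):.2f}, macro F1={macro_f1(y_true, y_pred):.2f}")
```

Macro F1 weights rare vulnerability types as heavily as common ones, so it can sit well below accuracy on imbalanced data.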

In [None]:
# Evaluate on test set
eval_results = vuln_predictor.evaluate(
    pd.DataFrame(X_test, columns=X.columns),
    pd.Series(y_test),
    method='averaging'
)

print(f"Ensemble Test Accuracy: {eval_results['accuracy']:.4f}")
print(f"Ensemble Test F1 Score: {eval_results['f1_score']:.4f}")

# Confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

y_pred, _ = vuln_predictor.ensemble_predict(
    pd.DataFrame(X_test, columns=X.columns),
    method='averaging'
)

fig, ax = plt.subplots(figsize=(12, 10))
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    ax=ax,
    cmap='Blues',
    colorbar=True
)
plt.title('Confusion Matrix - Vulnerability Classifier', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

### 4.5 Feature Importance

In [None]:
# Get feature importance
importance_df = vuln_predictor.get_feature_importance(top_n=20, model_name='random_forest')

# Plot
plt.figure(figsize=(12, 8))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='steelblue')
plt.xlabel('Importance')
plt.title('Top 20 Most Important Features', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nTop 10 Features:")
print(importance_df.head(10))

### 4.6 Save Models

In [None]:
# Create models directory
models_dir = Path('../data/models')
models_dir.mkdir(parents=True, exist_ok=True)

# Save models
vuln_predictor.save(str(models_dir / 'vulnerability_predictor.pkl'))
severity_predictor.save(str(models_dir / 'severity_predictor.pkl'))
feature_engineer.save(str(models_dir / 'feature_engineer.pkl'))

# Save chain detector
import pickle

chain_detector = ChainDetector()
with open(models_dir / 'chain_detector.pkl', 'wb') as f:
    pickle.dump(chain_detector, f)

print("✓ All models saved!")

---

## 🎯 Part 5: Inference & Prediction

Use trained models to predict vulnerabilities for new targets.
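
The analysis cells describe each target as a plain dict. The batch example in Part 6 omits `endpoints` and `description`, so those appear to be optional. A small sketch that validates a target and fills defaults before analysis; the required/optional split and the helper name are assumptions, not part of `ThreatPredictor`:

```python
REQUIRED_KEYS = ("domain", "technology_stack")  # assumed minimum
DEFAULTS = {
    "company_name": "",
    "endpoints": [],
    "auth_required": False,
    "has_api": False,
    "description": "",
}


def prepare_target(target: dict) -> dict:
    """Check required keys and return a copy with defaults filled in."""
    missing = [key for key in REQUIRED_KEYS if key not in target]
    if missing:
        raise ValueError(f"target is missing required keys: {missing}")
    if not isinstance(target["technology_stack"], (list, tuple)):
        raise TypeError("technology_stack must be a list of technology names")
    # Copy list defaults so separate targets never share one list object.
    filled = {k: (list(v) if isinstance(v, list) else v) for k, v in DEFAULTS.items()}
    filled.update(target)
    return filled


prepared = prepare_target({"domain": "site1.com", "technology_stack": ["React", "Node.js"]})
print(prepared["endpoints"], prepared["has_api"])
```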

### 5.1 Load Trained Models

In [None]:
print("Loading trained models...\n")

predictor = ThreatPredictor(models_dir='../data/models')

print("\n✓ Models loaded and ready!")

### 5.2 Analyze a Target

In [None]:
# Define target
target_info = {
    'domain': 'example.com',
    'company_name': 'Example Corp',
    'technology_stack': ['React', 'Node.js', 'PostgreSQL', 'Redis', 'AWS'],
    'endpoints': ['/api/users', '/api/posts', '/api/auth/login', '/api/payments'],
    'auth_required': True,
    'has_api': True,
    'description': 'Social media platform with payment integration'
}

# Analyze
results = predictor.analyze_target(target_info)

print(f"\n{'='*70}")
print(f"ANALYSIS COMPLETE")
print(f"{'='*70}")

### 5.3 View Results

In [None]:
# Risk score
print(f"\nRisk Score: {results['risk_score']}/10 ({results['risk_level'].upper()})")

# Top vulnerabilities
print(f"\nTop 10 Vulnerability Predictions:")
print(f"{'='*70}")

vuln_df = pd.DataFrame(results['vulnerability_predictions'][:10])
print(vuln_df.to_string(index=False))

# Chains
if results['chain_predictions']:
    print(f"\nDetected Attack Chains:")
    print(f"{'='*70}")
    
    for chain in results['chain_predictions'][:3]:
        print(f"\n{chain['name']} (Score: {chain['exploitability_score']}/10)")
        print(f"  Vulnerabilities: {', '.join(chain['vulns'])}")
        print(f"  Description: {chain['description']}")

### 5.4 Visualize Predictions

In [None]:
# Vulnerability probability chart
top_vulns = results['vulnerability_predictions'][:10]
vuln_names = [v['vulnerability_type'] for v in top_vulns]
vuln_probs = [v['probability'] for v in top_vulns]

plt.figure(figsize=(12, 6))
bars = plt.barh(vuln_names, vuln_probs, color='steelblue')

# Color code by probability (bars not matched below keep their steelblue default)
for bar, prob in zip(bars, vuln_probs):
    if prob > 0.7:
        bar.set_color('red')
    elif prob > 0.5:
        bar.set_color('orange')

plt.xlabel('Probability', fontsize=12)
plt.title(f"Vulnerability Predictions for {target_info['domain']}", fontsize=14, fontweight='bold')
plt.xlim(0, 1)
plt.gca().invert_yaxis()

# Add probability labels
for i, prob in enumerate(vuln_probs):
    plt.text(prob + 0.02, i, f'{prob:.1%}', va='center')

plt.tight_layout()
plt.show()

### 5.5 Test Strategy

In [None]:
# Display test strategy
strategy = results['test_strategy']

print(f"\nRecommended Test Strategy")
print(f"{'='*70}\n")

for target in strategy['priority_targets'][:5]:
    print(f"\n{target['vulnerability']} ({target['time_allocation']})")
    print(f"  Priority: {'⭐' * target['priority']}")
    print(f"  Tools: {', '.join(target['tools'][:3])}")
    print(f"  Test Cases:")
    for i, test_case in enumerate(target['test_cases'][:3], 1):
        print(f"    {i}. {test_case}")

# Time allocation pie chart
if strategy['time_allocation']:
    plt.figure(figsize=(10, 8))
    
    labels = list(strategy['time_allocation'].keys())[:5]
    sizes = list(strategy['time_allocation'].values())[:5]
    
    plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
    plt.title('Recommended Time Allocation', fontsize=14, fontweight='bold')
    plt.axis('equal')
    plt.tight_layout()
    plt.show()

### 5.6 Recommendations

In [None]:
print(f"\nActionable Recommendations")
print(f"{'='*70}\n")

for i, rec in enumerate(results['recommendations'], 1):
    print(f"{i}. {rec}")

---

## 📦 Part 6: Batch Analysis

Analyze multiple targets at once.
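
The comparison cell below skips entries that contain an `'error'` key, which suggests a failed target yields an error record instead of aborting the whole batch. A minimal sketch of that pattern, assuming a hypothetical `analyze_all` helper and a stand-in analysis function (`batch_analyze` itself may behave differently):

```python
from typing import Callable


def analyze_all(targets: list[dict], analyze: Callable[[dict], dict]) -> list[dict]:
    """Run analyze on every target, recording failures instead of raising."""
    results = []
    for target in targets:
        name = target.get("domain", "<unknown>")
        try:
            result = analyze(target)
            results.append({"target": name, **result})
        except Exception as exc:  # keep going; one bad target should not stop the batch
            results.append({"target": name, "error": str(exc)})
    return results


def fake_analyze(target: dict) -> dict:
    """Stand-in for a real analysis call; fails when no stack is given."""
    if not target.get("technology_stack"):
        raise ValueError("technology_stack is empty")
    return {"risk_score": 5.0}


batch = analyze_all(
    [{"domain": "site1.com", "technology_stack": ["React"]}, {"domain": "site2.com"}],
    fake_analyze,
)
succeeded = [r for r in batch if "error" not in r]
print(f"{len(succeeded)} of {len(batch)} targets analyzed successfully")
```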

In [None]:
# Define multiple targets
targets = [
    {
        'domain': 'site1.com',
        'company_name': 'Site 1',
        'technology_stack': ['React', 'Node.js'],
        'auth_required': True,
        'has_api': True
    },
    {
        'domain': 'site2.com',
        'company_name': 'Site 2',
        'technology_stack': ['Angular', 'Java', 'MySQL'],
        'auth_required': True,
        'has_api': True
    },
    {
        'domain': 'site3.com',
        'company_name': 'Site 3',
        'technology_stack': ['Vue.js', 'Python', 'PostgreSQL'],
        'auth_required': False,
        'has_api': True
    }
]

# Batch analyze
print("Analyzing multiple targets...\n")
batch_results = predictor.batch_analyze(targets)

print(f"\n✓ Analyzed {len(batch_results)} targets")

### 6.1 Compare Results
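
The chart in this section draws its "high" and "critical" reference lines at 6 and 8 on the 0-10 scale. A sketch of a score-to-level mapping consistent with those two lines; the cutoff of 4 for "medium", the inclusive boundaries, and the function name are assumptions, and the library's own thresholds may differ:

```python
def risk_level(score: float) -> str:
    """Map a 0-10 risk score to a level (medium cutoff of 4 is assumed)."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score >= 8:
        return "critical"
    if score >= 6:
        return "high"
    if score >= 4:
        return "medium"
    return "low"


for score in (2.5, 4.0, 6.0, 9.1):
    print(score, risk_level(score))
```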

In [None]:
# Compare risk scores
comparison_df = pd.DataFrame([
    {
        'Target': r['target'],
        'Risk Score': r['risk_score'],
        'Risk Level': r['risk_level'],
        'Top Vulnerability': r['vulnerability_predictions'][0]['vulnerability_type'],
        'Top Probability': f"{r['vulnerability_predictions'][0]['probability']:.1%}",
        'Chains': len(r['chain_predictions'])
    }
    for r in batch_results if 'error' not in r
])

print("\nComparison:")
print(comparison_df.to_string(index=False))

# Risk score comparison chart
plt.figure(figsize=(10, 6))
# Map each risk level to a bar color; anything unrecognized falls back to green
level_colors = {'critical': 'red', 'high': 'orange', 'medium': 'yellow'}
colors = [level_colors.get(level, 'green') for level in comparison_df['Risk Level']]

plt.bar(comparison_df['Target'], comparison_df['Risk Score'], color=colors)
plt.xlabel('Target')
plt.ylabel('Risk Score')
plt.title('Risk Score Comparison', fontsize=14, fontweight='bold')
plt.ylim(0, 10)
plt.axhline(y=6, color='orange', linestyle='--', alpha=0.5, label='High Risk Threshold')
plt.axhline(y=8, color='red', linestyle='--', alpha=0.5, label='Critical Risk Threshold')
plt.legend()
plt.tight_layout()
plt.show()

---

## 📝 Part 7: Export Results

Save results for reporting.
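
`json.dump` only accepts built-in types, and analysis results often carry values such as `datetime` objects, `Path`s, or sets. A sketch of a `default` hook that converts the common cases; the helper name is hypothetical, and NumPy scalars (likely in model output) would need a similar branch calling `.item()`:

```python
import json
from datetime import datetime
from pathlib import Path


def to_jsonable(value):
    """Fallback for json.dumps: convert types the encoder does not know."""
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, Path):
        return str(value)
    if isinstance(value, (set, frozenset)):
        return sorted(value)
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


record = {
    "target": "example.com",
    "analyzed_at": datetime(2024, 1, 15, 12, 30),
    "tags": {"api", "auth"},
    "model_dir": Path("data") / "models",
}
encoded = json.dumps(record, default=to_jsonable, indent=2)
print(encoded)
```

Raising `TypeError` for anything unrecognized keeps the failure visible instead of silently writing a misleading string.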

In [None]:
import json

# Save individual analysis
output_dir = Path('../data/results')
output_dir.mkdir(parents=True, exist_ok=True)

with open(output_dir / 'example_analysis.json', 'w') as f:
    # default=str stringifies any value json cannot encode natively (e.g. datetimes)
    json.dump(results, f, indent=2, default=str)

print("✓ Saved to data/results/example_analysis.json")

# Save batch results
with open(output_dir / 'batch_analysis.json', 'w') as f:
    # default=str stringifies any value json cannot encode natively (e.g. datetimes)
    json.dump(batch_results, f, indent=2, default=str)

print("✓ Saved to data/results/batch_analysis.json")

# Save comparison CSV
comparison_df.to_csv(output_dir / 'comparison.csv', index=False)

print("✓ Saved to data/results/comparison.csv")

---

## 🎓 Summary

In this notebook, we:

1. ✅ **Collected** vulnerability data from HackerOne and NVD
2. ✅ **Preprocessed** data (normalization, deduplication, enrichment)
3. ✅ **Engineered** 100+ features from reports
4. ✅ **Trained** ensemble ML models (Random Forest, XGBoost, LightGBM, etc.)
5. ✅ **Evaluated** model performance (accuracy, F1 score, confusion matrix)
6. ✅ **Predicted** vulnerabilities for new targets
7. ✅ **Generated** actionable test strategies
8. ✅ **Detected** attack chains
9. ✅ **Analyzed** multiple targets in batch
10. ✅ **Exported** results for reporting

### Next Steps:

- Use `scripts/analyze_target.py` for CLI analysis
- Use `scripts/batch_analyze.py` for batch processing
- Use `scripts/generate_nuclei_templates.py` to create Nuclei templates
- Integrate with your bug bounty workflow

### Resources:

- [Documentation](../docs/)
- [GitHub Repository](https://github.com/yourusername/bugpredict-ai)
- [API Documentation](../docs/api.md)

Happy Hunting! 🎯