# URL Threat Detection & Security Analysis


## Project Overview

This security research project analyzes honeypot data to identify and classify malicious URL-based attacks. By detecting patterns in URLs associated with suspicious activities, this analysis helps identify potential cyber threats and vulnerabilities within systems.

Understanding attack types and patterns enables organizations to better secure their networks, protect sensitive data, and ensure the integrity and availability of their services.


### Start Program Code Here

#### Libraries used

This section loads all required libraries.

In [1]:
#### Libraries used

# Standard library imports
import os
import re
import time
import json
from datetime import datetime
# Using ThreadPool (threading) instead of multiprocessing for better Jupyter compatibility
# Threads share memory space, avoiding serialization overhead in notebook environments
from multiprocessing.dummy import Pool as ThreadPool
from typing import Dict, List, Tuple, Set, Optional, Any

# Data manipulation and analysis
import pandas as pd
import numpy as np
from functools import lru_cache
import warnings
warnings.filterwarnings('ignore')

# Visualization libraries
import matplotlib.pyplot as plt

# Display utilities
from IPython.display import display

# Configure environment
plt.style.use('seaborn-v0_8-darkgrid')  
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

# Version information for reproducibility
print("="*60)
print("URL Threat Detection & Security Analysis")
print("="*60)
print(f"Python version: {os.sys.version.split()[0]}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Execution time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("="*60)

URL Threat Detection & Security Analysis
Python version: 3.13.5
Pandas version: 2.2.3
NumPy version: 2.1.3
Execution time: 2026-02-14 13:33:53


#### Set Data Path:

In [2]:
#### Set Data Path:
# Configure working directories and create output folder for analysis results

# Set paths 
main_dir = os.getcwd()
data_dir = main_dir
output_dir = os.path.join(main_dir, 'Reports')  # Save outputs to a subfolder

# Create the output directory if it doesn't exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Remove annoying error and improve display
pd.options.mode.chained_assignment = None 
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)


#### Regular Expressions (regex):

This section contains the regex statements that will be used to locate suspicious activity

In [3]:
#### Regular Expressions (regex):
# This section contains the regex statements that will be used to locate suspicious activity

class OptimizedAttackPatterns:
    """
    High-performance modular attack pattern management system.
    
    PERFORMANCE OPTIMIZATIONS IMPLEMENTED:
    - Pre-filtering with fast string operations
    - Simplified regex patterns to reduce backtracking
    - Pattern caching with LRU cache
    - Tiered detection 
    - Non-capturing groups and possessive quantifiers
    - Word boundaries for better anchoring
    
    """
    
    # Pre-filter keywords for fast initial screening
    _SQL_KEYWORDS = {'union', 'select', 'insert', 'update', 'delete', 'drop', 'create', 'alter', 'exec', 'waitfor', 'sleep'}
    _XSS_KEYWORDS = {'script', 'javascript', 'onload', 'onerror', 'onclick', 'alert', 'eval', 'document', 'window'}
    _PHP_KEYWORDS = {'eval', 'exec', 'system', 'shell_exec', 'base64_decode', 'gzinflate', 'file_get_contents', 'include', 'require'}
    _CMD_KEYWORDS = {'cmd', 'powershell', 'bash', 'sh', 'wget', 'curl', 'nc', 'netcat', 'ping', 'tracert'}
    _TRAVERSAL_KEYWORDS = {'../', '..\\', '%2e%2e', 'etc/passwd', 'windows/system32', 'proc/self'}
    
    @staticmethod
    def quick_prefilter(text: str, keywords: Set[str]) -> bool:
        """Fast string-based pre-filtering before regex matching."""
        text_lower = text.lower()
        return any(keyword in text_lower for keyword in keywords)
    
    @staticmethod
    @lru_cache(maxsize=32)
    def get_sql_injection_pattern() -> re.Pattern:
        """
        Optimized SQL Injection detection pattern.
        
        OPTIMIZATIONS:
        - Removed nested quantifiers that cause backtracking
        - Used word boundaries for better performance
        - Simplified encoding patterns
        - Combined similar patterns
        - Used non-capturing groups
        
        Returns:
            re.Pattern: Compiled regex pattern for SQL injection detection
        """
        pattern = re.compile(
            r"(?i)(?:"
            # Core SQL injection - most common first
            r"\bunion\s+(?:all\s+)?select\b|"             # UNION SELECT (most common)
            r"\bselect\s+[\w*,\s]+\bfrom\b|"              # SELECT FROM statements
            r"\b(?:insert|update|delete)\s+(?:into\s+)?\w+|"  # DML operations
            r"\bdrop\s+(?:table|database|schema)\b|"      # DROP operations
            
            # SQL functions and techniques
            r"\b(?:exec|execute)(?:\s*\(|\s+)|"           # EXEC/EXECUTE
            r"\bwaitfor\s+delay\b|"                       # Time delays
            r"\bsleep\s*\([^)]*\)|"                       # SLEEP function
            r"\bbenchmark\s*\([^)]*\)|"                   # BENCHMARK
            
            # Boolean-based blind injection 
            r"\b(?:and|or)\s+\d+\s*[=<>!]+\s*\d+|"      # Numeric comparisons
            r"\b(?:and|or)\s+[\w'\"]+\s*=\s*[\w'\"]+|"   # String comparisons
            
            # SQL comments and terminators
            r"--[^\r\n]*|"                               # Line comments
            r"/\*.*?\*/|"                                 # Block comments
            r"[;\x00]|"                                   # Statement terminators
            
            # Database metadata 
            r"\binformation_schema\b|"                    # Info schema
            r"\b(?:sys|mysql)\.(?:tables|users?|columns?)\b|"  # System tables
            
            # Common encoding (most frequent only)
            r"%27|%22|%20union%20|%20select%20|"         # URL encoded
            r"0x[0-9a-f]+|"                              # Hex values
            
            # Database functions (critical ones only)
            r"\b(?:concat|substring|ascii|char)\s*\(|"    # String functions
            r"\b(?:user|database|version)\s*\(\s*\)|"     # Info functions
            
            # Extra techniques 
            r"\binto\s+(?:outfile|dumpfile)\b|"          # File operations
            r"@@(?:version|user|hostname)|"               # MySQL variables
            r"\bxp_cmdshell\b"                           # MSSQL command shell
            r")",
            re.IGNORECASE | re.DOTALL
        )
        return pattern
    
    @staticmethod
    @lru_cache(maxsize=32)
    def get_xss_pattern() -> re.Pattern:
        """
        Optimized Cross-Site Scripting (XSS) detection pattern.
        
        OPTIMIZATIONS:
        - Focused on most dangerous XSS vectors
        - Simplified HTML tag matching
        - Reduced complex nested groups
        - Combined encoding patterns
        
        Returns:
            re.Pattern: Compiled regex pattern for XSS detection
        """
        pattern = re.compile(
            r"(?i)(?:"
            # Script tags (most critical)
            r"<script[^>]*>|</script>|"                   # Script tags
            r"%3Cscript|%3C/script|"                      # URL encoded
            r"&lt;script|&lt;/script|"                    # HTML encoded
            
            # JavaScript protocols and events
            r"javascript\s*:|"                           # JavaScript protocol
            r"\bon\w+\s*=|"                              # Event handlers
            r"\b(?:alert|prompt|confirm)\s*\(|"          # Dialog functions
            
            # Dangerous HTML tags 
            r"<(?:iframe|embed|object|applet|meta|link)\b[^>]*>|"  # Dangerous tags
            r"<(?:img|body)\b[^>]*\bon\w+[^>]*>|"        # Tags with events
            
            # DOM manipulation
            r"\b(?:document|window|self|top|parent)\.|\binnerHTML\b|"  # DOM access
            r"\.(?:cookie|location|href)\b|"              # Sensitive properties
            
            # Common encoding patterns 
            r"&#(?:\d+|x[0-9a-f]+);|"                   # HTML entities
            r"%u[0-9a-f]{4}|"                            # Unicode encoding
            r"String\.fromCharCode|"                      # Char code conversion
            
            # Data URIs and special content
            r"data:(?:text/html|application/x-javascript)|"  # Data URIs
            r"<!\[CDATA\[|"                              # CDATA sections
            r"\beval\s*\(|setTimeout\s*\("               # Code execution
            r")",
            re.IGNORECASE | re.DOTALL
        )
        return pattern
    
    @staticmethod
    @lru_cache(maxsize=32)
    def get_php_attack_pattern() -> re.Pattern:
        """
        Optimized PHP attack detection pattern.
        
        OPTIMIZATIONS:
        - Focused on most dangerous PHP functions
        - Simplified file extension matching
        - Reduced redundant shell name patterns
        - Combined encoding functions
        
        Returns:
            re.Pattern: Compiled regex pattern for PHP attack detection
        """
        pattern = re.compile(
            r"(?i)(?:"
            # PHP file indicators
            r"\.php[345]?\b|\.phtml\b|"                   # PHP extensions
            
            # Critical PHP shells (most common)
            r"\b(?:c99|r57|wso|b374k|shell|cmd|evil|bypass)\.php\b|"  # Common shells
            
            # Dangerous PHP functions (highest priority)
            r"\b(?:eval|assert|exec|system|shell_exec|passthru)\s*\(|"  # Code execution
            r"\b(?:file_get_contents|file_put_contents|fopen|fwrite)\s*\(|"  # File operations
            r"\bcreate_function\s*\(|"                    # Dynamic functions
            r"\bpreg_replace\s*\([^)]*[\"']/e[\"']|"     # Regex with eval
            
            # Critical encoding/decoding 
            r"\b(?:base64_decode|gzinflate|str_rot13)\s*\(|"  # Decoding
            r"\b(?:gzuncompress|convert_uudecode)\s*\(|"  # Decompression
            
            # System commands
            r"\bproc_open\s*\(|"                          # Process control
            r"\b(?:popen|fsockopen|socket_create)\s*\(|"  # Network/process
            
            # Include/require
            r"\b(?:include|require)(?:_once)?\s*\(?|"     # File inclusion
            
            # PHP superglobals 
            r"\$_(?:GET|POST|REQUEST|FILES|COOKIE)\s*\[|"  # Input arrays
            
            # PHP information
            r"\b(?:phpinfo|php_uname|system)\s*\(|"      # Info disclosure
            
            # Common database functions 
            r"\b(?:mysql_|mysqli_)(?:connect|query)\s*\(|"  # MySQL functions
            
            # Stream wrappers
            r"\bphp://(?:input|filter|fd)|"               # PHP wrappers
            r"\b(?:file|data|zip)://|"                    # File wrappers
            
            # Serialization
            r"\b(?:unserialize|__wakeup|__destruct)\s*\(|"  # Serialization
            
            # Known vulnerable components
            r"\b(?:timthumb|uploadify|adminer)\.php\b"    # Vulnerable files
            r")",
            re.IGNORECASE
        )
        return pattern
    
    @staticmethod
    @lru_cache(maxsize=32)
    def get_windows_attack_pattern() -> re.Pattern:
        """
        Optimized Windows attack detection pattern.
        
        OPTIMIZATIONS:
        - Focused on most critical Windows attack vectors
        - Simplified path matching
        - Combined registry patterns
        - Reduced PowerShell pattern complexity
        
        Returns:
            re.Pattern: Compiled regex pattern for Windows attack detection
        """
        pattern = re.compile(
            r"(?i)(?:"
            # Directory traversal (Windows)
            r"\.\.\\|%5c\.\.%5c|%c1%9c|"                 # Path traversal
            
            # Critical Windows executables 
            r"\b(?:cmd|powershell|wscript|cscript|mshta)\.exe\b|"  # Executables
            r"\b(?:regsvr32|rundll32|certutil|bitsadmin)\.exe\b|"  # System tools
            
            # Windows paths 
            r"[c-z]:\\\\|\\\\windows\\\\|\\\\system32\\\\|"  # System paths
            r"%(?:windir|systemroot|temp)%|"              # Environment vars
            
            # Dangerous Windows commands 
            r"\bnet\s+(?:user|localgroup|share)\b|"       # Net commands
            r"\b(?:sc|reg)\s+(?:create|add|delete)\b|"    # Service/registry
            r"\bwmic\s+(?:process|service)\b|"            # WMI commands
            
            # PowerShell 
            r"-EncodedCommand\b|"                         # Encoded PS
            r"\b(?:Invoke-Expression|Invoke-Command)\b|"  # PS execution
            r"\bDownloadString\b|Net\.WebClient\b|"       # PS download
            r"-ExecutionPolicy\s+Bypass|"                 # PS policy bypass
            
            # Critical registry paths 
            r"\bHKEY_(?:LOCAL_MACHINE|CURRENT_USER)\b|"   # Registry hives
            r"\\CurrentVersion\\Run\b|"                   # Autorun keys
            
            # High-risk Windows files 
            r"\b(?:boot|win|system)\.ini\b|"              # System files
            r"\b(?:sam|security|software)\b|"             # Registry files
            
            # ASP/ASPX
            r"\.asp[x]?\b|"                               # ASP files
            r"\b(?:eval|execute)\s+request\b|"            # ASP execution
            
            # Dangerous file extensions 
            r"\.(?:bat|cmd|vbs|ps1|scr|hta)\b|"          # Executable extensions
            
            # Common Windows malware/tools 
            r"\b(?:mimikatz|psexec|procdump|wmiexec)\b|"  # Attack tools
            
            # Privilege escalation
            r"\b(?:sudo|runas|gsudo)\s+|"                 # Elevation
            r"\bschtasks\s*/create\b"                     # Scheduled tasks
            r")",
            re.IGNORECASE
        )
        return pattern
    
    @staticmethod
    @lru_cache(maxsize=32)
    def get_directory_traversal_pattern() -> re.Pattern:
        """
        Optimized Directory Traversal detection pattern.
        
        OPTIMIZATIONS:
        - Focused on most common traversal patterns
        - Simplified encoding variations
        - Combined similar file access patterns
        
        Returns:
            re.Pattern: Compiled regex pattern for directory traversal detection
        """
        pattern = re.compile(
            r"(?i)(?:"
            # Basic traversal patterns
            r"\.\.[\\/]|"                                 # Standard traversal
            r"%2e%2e%2f|%2e%2e%5c|"                      # URL encoded
            r"%252e%252e%252f|%252e%252e%255c|"          # Double encoded
            r"\x00|\%00|"                                # Null bytes
            
            # Critical Unix sensitive files 
            r"/etc/(?:passwd|shadow|hosts|sudoers)\b|"    # System files
            r"/proc/(?:self|version|cmdline)/|"           # Process info
            r"/var/log/\w+|"                             # Log files
            
            # Web server files
            r"\.ht(?:access|passwd)\b|"                   # Apache files
            r"(?:httpd|nginx)\.conf\b|"                   # Config files
            r"web\.config\b|"                             # IIS config
            
            # Common application configs 
            r"\b(?:wp-config|configuration|settings)\.php\b|"  # CMS configs
            r"\.(?:env|git|ssh)/|"                        # Hidden dirs
            r"\b(?:id_rsa|authorized_keys)\b|"            # SSH keys
            
            # Database files
            r"\.(?:sqlite|db|sql|dump)\b|"               # Database files
            
            # Backup and temp files
            r"\.(?:bak|backup|old|tmp|swp)\b|"           # Backup files
            
            # Archive files
            r"\.(?:tar|zip|rar|7z)(?:\.gz)?\b|"         # Archives
            
            # Special protocols
            r"\b(?:file|php|data|zip)://|"               # URI schemes
            
            # Critical Windows specific paths
            r"\\windows\\system32\\|"                     # System32
            r"\\boot\.ini\b|\\sam\b"                     # Windows files
            r")",
            re.IGNORECASE
        )
        return pattern
    
    @staticmethod
    @lru_cache(maxsize=32)
    def get_command_injection_pattern() -> re.Pattern:
        """
        Optimized Command Injection detection pattern.
        
        OPTIMIZATIONS:
        - Focused on most dangerous command injection vectors
        - Simplified shell metacharacter detection
        - Combined similar command patterns
        
        Returns:
            re.Pattern: Compiled regex pattern for command injection detection
        """
        pattern = re.compile(
            r"(?i)(?:"
            # Critical command separators 
            r"[;&|`]|&&|\|\||"                           # Shell operators
            r"\$\(|\${|"                                 # Command substitution
            r">>|2>&1|"                                  # Redirections
            
            # Critical system commands
            r"\b(?:cat|ls|pwd|id|whoami|uname|ps|kill)\s|"  # Info commands
            r"\b(?:wget|curl|nc|netcat|telnet|ssh)\s|"   # Network commands
            r"\b(?:rm|rmdir|mkdir|cp|mv|chmod|chown)\s|"  # File commands
            r"\b(?:bash|sh|zsh|cmd|powershell)\s|"       # Shells
            
            # Dangerous patterns
            r"\b(?:eval|exec|system)\s*\(|"              # Execution functions
            r"/bin/(?:bash|sh)\s+-[ci]|"                 # Interactive shells
            r"mkfifo\s+|nc\s+-e|"                       # Backdoor patterns
            
            # Environment variables
            r"\$(?:PATH|HOME|USER|SHELL)\b|"             # Environment vars
            
            # Common encoded patterns 
            r"%3B|%7C|%26|%60|"                          # URL encoded
            r"\\x[2-7][0-9a-f]|"                         # Hex encoded
            
            # Package managers and interpreters
            r"\b(?:apt|yum|pip|npm|wget|curl)\s+\w+|"    # Package operations
            r"\b(?:python|perl|ruby|php|node)\s+|"       # Interpreters
            
            # Time-based and sleep
            r"\bsleep\s+\d+|timeout\s+\d+|"             # Delays
            
            # File operations
            r">/dev/null|"                               # Output redirect
            r"/dev/(?:tcp|udp)/|"                        # Network devices
            
            # Reverse shells 
            r"bash\s+-i|/bin/sh\s+-i|"                   # Interactive
            r"openssl\s+s_client"                        # SSL shells
            r")",
            re.IGNORECASE
        )
        return pattern
    
    @staticmethod
    @lru_cache(maxsize=16)
    def get_ldap_injection_pattern() -> re.Pattern:
        """Optimized LDAP Injection detection pattern."""
        pattern = re.compile(
            r"(?i)(?:"
            r"\(\w*=\*\)|"                               # LDAP wildcards
            r"[|&!]\(|"                                  # LDAP operators
            r"\b(?:cn|uid|mail|objectClass)=|"           # Common attributes
            r"%28|%29|%2A|%7C|%26"                       # Encoded chars
            r")",
            re.IGNORECASE
        )
        return pattern
    
    @staticmethod
    @lru_cache(maxsize=16)
    def get_xml_injection_pattern() -> re.Pattern:
        """Optimized XML Injection detection pattern."""
        pattern = re.compile(
            r"(?i)(?:"
            r"<!(?:ENTITY|DOCTYPE|ELEMENT|ATTLIST)\b|"   # XML declarations
            r"<!\[CDATA\[|"                              # CDATA
            r"\b(?:SYSTEM|PUBLIC)\s+[\"']|"              # External entities
            r"xmlns:|&\w+;|"                             # Namespaces/entities
            r"%3C!(?:ENTITY|DOCTYPE)"                    # Encoded
            r")",
            re.IGNORECASE
        )
        return pattern
    
    @staticmethod
    @lru_cache(maxsize=16)
    def get_nosql_injection_pattern() -> re.Pattern:
        """Optimized NoSQL Injection detection pattern."""
        pattern = re.compile(
            r"(?i)(?:"
            r"\$(?:ne|eq|gt|gte|lt|lte|in|nin|and|or|not|nor)\b|"  # Operators
            r"\$(?:exists|type|all|size|regex|where|elemMatch)\b|"  # Query ops
            r"[\{\[]?\$\w+|"                             # Object notation
            r"\bfunction\s*\(|this\.|"                   # JavaScript
            r"return\s+\w+"                              # Return statements
            r")",
            re.IGNORECASE
        )
        return pattern
    
    @staticmethod
    @lru_cache(maxsize=16)
    def get_file_inclusion_pattern() -> re.Pattern:
        """Optimized File Inclusion detection pattern."""
        pattern = re.compile(
            r"(?i)(?:"
            # Remote inclusion
            r"(?:https?|ftp|php|data|glob|phar)://|"     # Remote protocols
            # Local inclusion
            r"\.\.[\\/]\.\.[\\/]|"                       # Multiple traversal
            r"\?(?:file|path|page|include|require)=|"    # Inclusion params
            r"%00|\\x00"                                 # Null bytes
            r")",
            re.IGNORECASE
        )
        return pattern


class PerformanceAnalyzer:
    """
    High-performance pattern matching with tiered detection system.
    
    PERFORMANCE FEATURES:
    - Pre-filtering eliminates some non-malicious requests
    - Pattern caching reduces compilation overhead
    - Tiered matching 
    - Batch processing capabilities
    """
    
    def __init__(self):
        self.patterns = OptimizedAttackPatterns()
        self.stats = {
            'total_processed': 0,
            'prefilter_passed': 0,
            'regex_matches': 0,
            'pattern_cache_hits': 0
        }
    
    def analyze_single(self, text: str) -> Dict[str, bool]:
        """
        Analyze a single text input for multiple attack types.
        
        Returns:
            Dict mapping attack types to boolean detection results
        """
        self.stats['total_processed'] += 1
        results = {}
        
        # SQL Injection
        if OptimizedAttackPatterns.quick_prefilter(text, OptimizedAttackPatterns._SQL_KEYWORDS):
            self.stats['prefilter_passed'] += 1
            results['sql_injection'] = bool(self.patterns.get_sql_injection_pattern().search(text))
        else:
            results['sql_injection'] = False
        
        # XSS
        if OptimizedAttackPatterns.quick_prefilter(text, OptimizedAttackPatterns._XSS_KEYWORDS):
            results['xss'] = bool(self.patterns.get_xss_pattern().search(text))
        else:
            results['xss'] = False
        
        # PHP Attacks
        if OptimizedAttackPatterns.quick_prefilter(text, OptimizedAttackPatterns._PHP_KEYWORDS):
            results['php_attack'] = bool(self.patterns.get_php_attack_pattern().search(text))
        else:
            results['php_attack'] = False
        
        # Windows Attacks
        if OptimizedAttackPatterns.quick_prefilter(text, OptimizedAttackPatterns._CMD_KEYWORDS):
            results['windows_attack'] = bool(self.patterns.get_windows_attack_pattern().search(text))
        else:
            results['windows_attack'] = False
        
        # Directory Traversal
        if OptimizedAttackPatterns.quick_prefilter(text, OptimizedAttackPatterns._TRAVERSAL_KEYWORDS):
            results['directory_traversal'] = bool(self.patterns.get_directory_traversal_pattern().search(text))
        else:
            results['directory_traversal'] = False
        
        # Command Injection
        if OptimizedAttackPatterns.quick_prefilter(text, OptimizedAttackPatterns._CMD_KEYWORDS):
            results['command_injection'] = bool(self.patterns.get_command_injection_pattern().search(text))
        else:
            results['command_injection'] = False
        
        # Other patterns 
        results['ldap_injection'] = bool(self.patterns.get_ldap_injection_pattern().search(text))
        results['xml_injection'] = bool(self.patterns.get_xml_injection_pattern().search(text))
        results['nosql_injection'] = bool(self.patterns.get_nosql_injection_pattern().search(text))
        results['file_inclusion'] = bool(self.patterns.get_file_inclusion_pattern().search(text))
        
        # Count matches
        if any(results.values()):
            self.stats['regex_matches'] += 1
        
        return results
    
    def analyze_batch(self, texts: List[str]) -> List[Dict[str, bool]]:
        """
        Analyze multiple texts in batch for better performance.
        
        Args:
            texts: List of text strings to analyze
            
        Returns:
            List of detection results for each input
        """
        return [self.analyze_single(text) for text in texts]
    
    def get_performance_stats(self) -> Dict:
        """Get performance statistics."""
        if self.stats['total_processed'] > 0:
            prefilter_efficiency = (1 - self.stats['prefilter_passed'] / self.stats['total_processed']) * 100
            match_rate = (self.stats['regex_matches'] / self.stats['total_processed']) * 100
        else:
            prefilter_efficiency = 0
            match_rate = 0
        
        return {
            **self.stats,
            'prefilter_efficiency_percent': round(prefilter_efficiency, 2),
            'match_rate_percent': round(match_rate, 2)
        }
    
    def clear_cache(self):
        """Clear pattern cache for memory management."""
        # Clear the LRU caches
        OptimizedAttackPatterns.get_sql_injection_pattern.cache_clear()
        OptimizedAttackPatterns.get_xss_pattern.cache_clear()
        OptimizedAttackPatterns.get_php_attack_pattern.cache_clear()
        OptimizedAttackPatterns.get_windows_attack_pattern.cache_clear()
        OptimizedAttackPatterns.get_directory_traversal_pattern.cache_clear()
        OptimizedAttackPatterns.get_command_injection_pattern.cache_clear()
        OptimizedAttackPatterns.get_ldap_injection_pattern.cache_clear()
        OptimizedAttackPatterns.get_xml_injection_pattern.cache_clear()
        OptimizedAttackPatterns.get_nosql_injection_pattern.cache_clear()
        OptimizedAttackPatterns.get_file_inclusion_pattern.cache_clear()


# Initialize the performance analyzer
print("Initializing optimized attack detection system...")
analyzer = PerformanceAnalyzer()

print("Optimized attack patterns loaded successfully!")


Initializing optimized attack detection system...
Optimized attack patterns loaded successfully!


#### Functions:

In [4]:
#### Functions:
# Core analysis classes and utility functions for URL security detection

class VectorizedURLAnalyzer:
    """
    URL Security Analyzer using vectorized operations and parallel processing.
    """
    
    def __init__(self, patterns: Dict[str, re.Pattern]):
        """
        Initialize with pre-compiled patterns.
        
        Parameters:
        -----------
        patterns : Dict[str, re.Pattern]
            Dictionary of attack names to compiled regex patterns
        """
        self.patterns = patterns
        self.severity_weights = {
            'SQL Injection': 10,
            'Command Injection': 10,
            'File Inclusion': 9,
            'Directory Traversal': 8,
            'Cross-Site Scripting (XSS)': 7,
            'PHP Attack': 7,
            'Windows Attack': 6,
            'XML Injection': 5,
            'LDAP Injection': 5,
            'NoSQL Injection': 5
        }

    def analyze_single_url(self, url: str) -> Dict[str, Any]:
        """Analyze a single URL for attack patterns."""
        if pd.isna(url) or url == '':
            url = '/EMPTY_URL'
        
        url = str(url)
        matched_attacks = []
        
        # Check each pattern
        for attack_type, pattern in self.patterns.items():
            try:
                if pattern.search(url):
                    matched_attacks.append(attack_type)
            except Exception:
                # Skip problematic URLs
                continue
        
        # Calculate results
        is_suspicious = bool(matched_attacks)
        attack_count = len(matched_attacks)
        severity_score = sum([self.severity_weights.get(attack, 1) for attack in matched_attacks])
        
        # Determine severity level
        if severity_score >= 20:
            severity = 'Critical'
        elif severity_score >= 15:
            severity = 'High'
        elif severity_score >= 8:
            severity = 'Medium'
        elif severity_score > 0:
            severity = 'Low'
        else:
            severity = 'Clean'
        
        # Calculate confidence 
        confidence = min(1.0, attack_count * 0.3) if is_suspicious else 0.0
        
        return {
            'is_suspicious': is_suspicious,
            'attack_types': matched_attacks,
            'attack_count': attack_count,
            'severity': severity,
            'severity_score': severity_score,
            'confidence': confidence
        }

    def batch_analyze_chunk(self, urls_chunk: List[str]) -> List[Dict[str, Any]]:
        """Analyze a chunk of URLs."""
        results = []
        for url in urls_chunk:
            result = self.analyze_single_url(url)
            results.append(result)
        return results

    def analyze_batch_parallel(self, urls: List[str], n_jobs: int = 4) -> List[Dict[str, Any]]:
        """
        Analyze URLs in parallel using thread-safe approach.
        """
        print(f"Analyzing {len(urls):,} URLs using {n_jobs} threads...")

        # Using ThreadPool (threading) instead of multiprocessing for better Jupyter compatibility
        # Threads share memory space, avoiding serialization overhead in notebook environments
        
        # Create chunks
        chunk_size = max(100, len(urls) // n_jobs)  # Minimum 100 URLs per chunk
        chunks = [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]
        
        print(f"Created {len(chunks)} chunks of ~{chunk_size} URLs each")
        
        # Process chunks with thread pool
        results = []
        with ThreadPool(n_jobs) as pool:
            chunk_results = pool.map(self.batch_analyze_chunk, chunks)
        
        # Flatten results
        for chunk_result in chunk_results:
            results.extend(chunk_result)
        
        return results

    def get_summary_stats(self, results: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Generate summary statistics from results."""
        if not results:
            return {'total_urls': 0, 'suspicious_urls': 0, 'attack_rate': 0.0}
        
        total_urls = len(results)
        suspicious_count = sum(1 for r in results if r['is_suspicious'])
        
        # Count attack types
        attack_type_counts = {}
        severity_counts = {}
        
        for result in results:
            # Count severity
            severity = result['severity']
            severity_counts[severity] = severity_counts.get(severity, 0) + 1
            
            # Count attack types
            for attack_type in result['attack_types']:
                attack_type_counts[attack_type] = attack_type_counts.get(attack_type, 0) + 1
        
        # Multi-attack URLs
        multi_attack_count = sum(1 for r in results if r['attack_count'] > 1)
        
        # Average confidence for suspicious URLs
        suspicious_results = [r for r in results if r['is_suspicious']]
        avg_confidence = (sum(r['confidence'] for r in suspicious_results) / len(suspicious_results)) if suspicious_results else 0.0
        
        # Attack combinations
        attack_combinations = {}
        for result in results:
            if result['attack_types']:
                combo = ', '.join(sorted(result['attack_types']))
                attack_combinations[combo] = attack_combinations.get(combo, 0) + 1
        
        # Sort and get top combinations
        top_combinations = dict(sorted(attack_combinations.items(), key=lambda x: x[1], reverse=True)[:10])
        
        return {
            'total_urls': total_urls,
            'suspicious_urls': suspicious_count,
            'clean_urls': total_urls - suspicious_count,
            'attack_rate': (suspicious_count / total_urls * 100) if total_urls > 0 else 0.0,
            'multi_attack_urls': multi_attack_count,
            'severity_distribution': severity_counts,
            'attack_distribution': attack_type_counts,
            'average_confidence': round(avg_confidence, 3),
            'high_confidence_attacks': sum(1 for r in suspicious_results if r['confidence'] >= 0.8),
            'top_attack_combinations': top_combinations
        }


def process_urls_optimized(df: pd.DataFrame, patterns_dict: Dict[str, re.Pattern], n_jobs: int = 4) -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """
    Optimized function to process URLs with proper error handling and progress tracking.
    """
    print("\nStarting optimized URL security analysis...")
    print("Processing a large dataset — analysis time will vary based on system resources.\n")

    
    # Initialize analyzer
    analyzer = VectorizedURLAnalyzer(patterns_dict)
    
    # Extract URLs
    urls = df['request_url'].fillna('/EMPTY_URL').astype(str).tolist()
    
    # Analyze URLs
    try:
        analysis_results = analyzer.analyze_batch_parallel(urls, n_jobs=n_jobs)
        print("Analysis completed successfully!")
        
    except Exception as e:
        print(f"Error during analysis: {e}")
        # Fallback to single-threaded analysis
        print("Falling back to single-threaded analysis...")
        analysis_results = analyzer.batch_analyze_chunk(urls)
    
    # Convert results to DataFrame columns
    result_df = df.copy()
    
    for i, result in enumerate(analysis_results):
        if i < len(result_df):  # Safety check
            result_df.at[i, 'is_suspicious'] = result['is_suspicious']
            result_df.at[i, 'attack_types'] = ', '.join(result['attack_types'])
            result_df.at[i, 'attack_count'] = result['attack_count']
            result_df.at[i, 'severity'] = result['severity']
            result_df.at[i, 'severity_score'] = result['severity_score']
            result_df.at[i, 'confidence'] = result['confidence']
    
    # Generate summary
    summary = analyzer.get_summary_stats(analysis_results)
    
    # Display results
    print(f"\nQuick Analysis Summary:")
    print(f"   Total URLs: {summary['total_urls']:,}")
    print(f"   Suspicious URLs: {summary['suspicious_urls']:,} ({summary['attack_rate']:.2f}%)")
    print(f"   Multi-attack URLs: {summary['multi_attack_urls']:,}")
    print(f"   Average confidence: {summary['average_confidence']:.3f}")
    
    if summary['attack_distribution']:
        print(f"\nPreliminary Top Attack Types:")
        for attack_type, count in list(summary['attack_distribution'].items())[:5]:
            print(f"   • {attack_type}: {count:,}")
    
    return result_df, summary

# Simplified processing function for immediate use
def quick_security_analysis(df: pd.DataFrame, patterns_dict: Dict[str, re.Pattern]) -> pd.DataFrame:
    """
    Quick security analysis function.
    """
    print("Running quick security analysis...")
    
    analyzer = VectorizedURLAnalyzer(patterns_dict)
    
    # Process URLs one by one with progress indication
    total_urls = len(df)
    results = []
    
    for i, url in enumerate(df['request_url'].fillna('/EMPTY_URL').astype(str)):
        if i % 1000 == 0:  # Progress update every 1000 URLs
            print(f"Progress: {i:,}/{total_urls:,} ({i/total_urls*100:.1f}%)")
        
        result = analyzer.analyze_single_url(url)
        results.append(result)
    
    # Add results to dataframe
    result_df = df.copy()
    for i, result in enumerate(results):
        result_df.at[i, 'is_suspicious'] = result['is_suspicious']
        result_df.at[i, 'attack_types'] = ', '.join(result['attack_types'])
        result_df.at[i, 'attack_count'] = result['attack_count']
        result_df.at[i, 'severity'] = result['severity']
        result_df.at[i, 'severity_score'] = result['severity_score']
        result_df.at[i, 'confidence'] = result['confidence']
    
    print("Quick analysis completed!")
    return result_df


def optimize_patterns_for_vectorization(patterns: Dict[str, re.Pattern]) -> Dict[str, re.Pattern]:
    """
    Optimize regex patterns for better vectorized performance.
    
    Parameters:
    -----------
    patterns : Dict[str, re.Pattern]
        
    Returns:
    --------
    Dict[str, re.Pattern] : Optimized patterns
    """
    optimized = {}
    
    for name, pattern in patterns.items():
        optimized[name] = pattern
    
    return optimized

    
class DataProcessor:
   """
   Handles data loading, cleaning, and preprocessing operations.
   Designed for reusability across different data formats and sources.
   """
   
   @staticmethod
   def load_honeypot_data(file_path: str, 
                         encoding: str = 'utf8') -> pd.DataFrame:
       """
       Load honeypot log data with error handling.
       
       Parameters:
       -----------
       file_path : str
           Path to the CSV file
       encoding : str
           File encoding (default: utf8)
           
       Returns:
       --------
       pd.DataFrame : Loaded and initially cleaned dataframe
       """
       try:
           # Try different separators if tab doesn't work
           separators = ['\t', ',', '|', ';']
           
           for sep in separators:
               try:
                   df = pd.read_csv(
                       file_path, 
                       sep=sep, 
                       encoding=encoding,
                       dtype='string',
                       low_memory=False,
                       on_bad_lines='skip'
                   )
                   
                   # Check if reasonable data was retrieved
                   if len(df.columns) > 5:  # valid data has multiple columns
                       print(f"Successfully loaded data using separator: '{sep}'")
                       break
               except:
                   continue
           else:
               # If no separator worked, this tries to detect it
               with open(file_path, 'r', encoding=encoding) as f:
                   first_line = f.readline()
                   # Try to detect separator from first line
                   for sep in separators:
                       if sep in first_line:
                           df = pd.read_csv(
                               file_path, 
                               sep=sep, 
                               encoding=encoding,
                               dtype='string',
                               low_memory=False,
                               on_bad_lines='skip'
                           )
                           break
           
           # Remove any empty rows
           df = df.dropna(how='all')
           
           # Remove trailing empty row if it exists
           if df.iloc[-1].isna().all():
               df = df[:-1]
           
           return df
           
       except Exception as e:
           print(f"Error loading data: {str(e)}")
           raise
   
   @staticmethod
   def clean_data(df: pd.DataFrame, 
                 required_columns: List[str]) -> pd.DataFrame:
       """
       Clean and prepare dataframe for analysis.
       
       Parameters:
       -----------
       df : pd.DataFrame
           Raw dataframe
       required_columns : List[str]
           List of required column names
           
       Returns:
       --------
       pd.DataFrame : Cleaned dataframe
       """
       # Check for required columns
       missing_columns = set(required_columns) - set(df.columns)
       if missing_columns:
           print(f"Warning: Missing columns: {missing_columns}")
       
       # Convert timestamp if present
       if 'timestamp' in df.columns:
           df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
       
       # Clean string columns
       for col in df.columns:
            if df[col].dtype == 'object' or df[col].dtype.name == 'string':
                df[col] = df[col].astype(str).str.strip()


       # Handle missing URLs
       if 'request_url' in df.columns:
           df['request_url'] = df['request_url'].fillna('/EMPTY_URL')
           df['request_url'] = df['request_url'].replace('', '/EMPTY_URL')
       
       # Convert numeric columns
       numeric_mappings = {
           'source_port': 'int32',
           'destination_port': 'int32',
           'unixTimestamp': 'float64'
       }
       
       for col, dtype in numeric_mappings.items():
           if col in df.columns:
               df[col] = pd.to_numeric(df[col], errors='coerce')
       
       return df
   
   @staticmethod
   def enrich_data(df: pd.DataFrame) -> pd.DataFrame:
       """
       Add derived features to enhance analysis.
       
       Parameters:
       -----------
       df : pd.DataFrame
           Cleaned dataframe
           
       Returns:
       --------
       pd.DataFrame : Enriched dataframe
       """
       # Extract hour from timestamp for analysis
       if 'timestamp' in df.columns:
           df['hour'] = df['timestamp'].dt.hour
           df['day_of_week'] = df['timestamp'].dt.day_name()
           df['date'] = df['timestamp'].dt.date
       
       # Extract URL components
       if 'request_url' in df.columns:
           # Extract file extension
           df['file_extension'] = df['request_url'].str.extract(r'\.([a-zA-Z0-9]+)(?:\?|$)')
           
           # Extract query parameters presence
           df['has_query_params'] = df['request_url'].str.contains('\?', na=False)
           
           # URL length
           df['url_length'] = df['request_url'].str.len()
           
           # Count special characters
           df['special_char_count'] = df['request_url'].str.count('[^a-zA-Z0-9/.\-_]')
       
       # IP address analysis
       if 'source_ip' in df.columns:
           # Extract first octet for class identification
           df['ip_first_octet'] = df['source_ip'].str.extract(r'^(\d+)\.').astype('Int32')
           
           # Identify private IPs
           df['is_private_ip'] = df['source_ip'].str.match(
               r'^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)'
           )
       
       return df


class ReportGenerator:
   """
   Generates professional reports and visualizations from analysis results.
   """
   
   def __init__(self, output_dir: str):
       """
       Initialize report generator with output directory.
       
       Parameters:
       -----------
       output_dir : str
           Directory to save reports and visualizations
       """
       self.output_dir = output_dir
       self.timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
       
       # Create output directory if it doesn't exist
       os.makedirs(output_dir, exist_ok=True)
   
   def generate_html_report(self, df: pd.DataFrame, 
                          summary: Dict[str, Any]) -> str:
       """
       Generate a professional HTML report.
       
       Parameters:
       -----------
       df : pd.DataFrame
           Analyzed dataframe
       summary : dict
           Summary statistics
           
       Returns:
       --------
       str : Path to generated HTML report
       """
       html_content = f"""
       <!DOCTYPE html>
       <html lang="en">
       <head>
           <meta charset="UTF-8">
           <meta name="viewport" content="width=device-width, initial-scale=1.0">
           <title>Honeypot Security Analysis Report</title>
           <style>
               body {{
                   font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
                   line-height: 1.6;
                   color: #333;
                   max-width: 1200px;
                   margin: 0 auto;
                   padding: 20px;
                   background-color: #f5f5f5;
               }}
               .header {{
                   background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
                   color: white;
                   padding: 30px;
                   border-radius: 10px;
                   margin-bottom: 30px;
                   box-shadow: 0 4px 6px rgba(0,0,0,0.1);
               }}
               .header h1 {{
                   margin: 0;
                   font-size: 2.5em;
               }}
               .header p {{
                   margin: 10px 0 0 0;
                   opacity: 0.9;
               }}
               .metric-grid {{
                   display: grid;
                   grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
                   gap: 20px;
                   margin-bottom: 30px;
               }}
               .metric-card {{
                   background: white;
                   padding: 20px;
                   border-radius: 8px;
                   box-shadow: 0 2px 4px rgba(0,0,0,0.1);
                   text-align: center;
                   transition: transform 0.2s;
               }}
               .metric-card:hover {{
                   transform: translateY(-2px);
                   box-shadow: 0 4px 8px rgba(0,0,0,0.15);
               }}
               .metric-value {{
                   font-size: 2.5em;
                   font-weight: bold;
                   color: #667eea;
                   margin: 10px 0;
               }}
               .metric-label {{
                   color: #666;
                   font-size: 0.9em;
                   text-transform: uppercase;
                   letter-spacing: 1px;
               }}
               .section {{
                   background: white;
                   padding: 30px;
                   border-radius: 8px;
                   margin-bottom: 20px;
                   box-shadow: 0 2px 4px rgba(0,0,0,0.1);
               }}
               .section h2 {{
                   color: #667eea;
                   border-bottom: 2px solid #e2e8f0;
                   padding-bottom: 10px;
                   margin-bottom: 20px;
               }}
               .severity-badge {{
                   display: inline-block;
                   padding: 4px 12px;
                   border-radius: 20px;
                   font-size: 0.85em;
                   font-weight: bold;
                   text-transform: uppercase;
                   margin: 2px;
               }}
               .severity-critical {{
                   background-color: #dc2626;
                   color: white;
               }}
               .severity-high {{
                   background-color: #f97316;
                   color: white;
               }}
               .severity-medium {{
                   background-color: #eab308;
                   color: white;
               }}
               .severity-low {{
                   background-color: #3b82f6;
                   color: white;
               }}
               .severity-clean {{
                   background-color: #10b981;
                   color: white;
               }}
               table {{
                   width: 100%;
                   border-collapse: collapse;
                   margin-top: 20px;
               }}
               th {{
                   background-color: #f3f4f6;
                   color: #374151;
                   font-weight: 600;
                   text-align: left;
                   padding: 12px;
                   border-bottom: 2px solid #e5e7eb;
               }}
               td {{
                   padding: 12px;
                   border-bottom: 1px solid #e5e7eb;
               }}
               tr:hover {{
                   background-color: #f9fafb;
               }}
               .chart-container {{
                   margin: 20px 0;
                   text-align: center;
               }}
               .footer {{
                   text-align: center;
                   color: #666;
                   margin-top: 50px;
                   padding-top: 20px;
                   border-top: 1px solid #e2e8f0;
               }}
               .alert {{
                   padding: 15px;
                   border-radius: 5px;
                   margin: 20px 0;
               }}
               .alert-warning {{
                   background-color: #fef3c7;
                   border: 1px solid #fbbf24;
                   color: #92400e;
               }}
               .alert-danger {{
                   background-color: #fee2e2;
                   border: 1px solid #f87171;
                   color: #991b1b;
               }}
               .progress-bar {{
                   width: 100%;
                   height: 20px;
                   background-color: #e5e7eb;
                   border-radius: 10px;
                   overflow: hidden;
                   margin: 10px 0;
               }}
               .progress-fill {{
                   height: 100%;
                   background: linear-gradient(90deg, #10b981 0%, #3b82f6 100%);
                   transition: width 0.3s ease;
               }}
           </style>
       </head>
       <body>
           <div class="header">
               <h1>Honeypot Security Analysis Report</h1>
               <p>Generated on {datetime.now().strftime('%B %d, %Y at %H:%M:%S')}</p>
               <p>Analysis by: Kyle Purves</p>
           </div>
           
           <div class="metric-grid">
               <div class="metric-card">
                   <div class="metric-label">Total Requests</div>
                   <div class="metric-value">{summary['total_urls']:,}</div>
               </div>
               <div class="metric-card">
                   <div class="metric-label">Suspicious Requests</div>
                   <div class="metric-value">{summary['suspicious_urls']:,}</div>
               </div>
               <div class="metric-card">
                   <div class="metric-label">Attack Rate</div>
                   <div class="metric-value">
                       {(summary['suspicious_urls'] / summary['total_urls'] * 100):.1f}%
                   </div>
               </div>
               <div class="metric-card">
                   <div class="metric-label">Multi-Attack URLs</div>
                   <div class="metric-value">{summary['multi_attack_urls']:,}</div>
               </div>
           </div>
       """
       
       # Add severity distribution
       if summary['suspicious_urls'] > 0:
           html_content += """
           <div class="section">
               <h2>Severity Distribution</h2>
               <div class="progress-bar">
           """
           
           severity_colors = {
               'Critical': '#dc2626',
               'High': '#f97316',
               'Medium': '#eab308',
               'Low': '#3b82f6'
           }
           

           html_content += '<div style="display: flex; width: 100%; height: 20px; border-radius: 10px; overflow: hidden;">'
           for severity, count in summary['severity_distribution'].items():
               if severity != 'Clean' and count > 0:
                   width = (count / summary['total_urls']) * 100
                   html_content += f"""
                   <div style="
                       width: {width}%; 
                       height: 100%;
                       background: {severity_colors.get(severity, '#666')};
                    "></div>
                    """
           
           html_content += """
               </div>
               <table>
                   <thead>
                       <tr>
                           <th>Severity Level</th>
                           <th>Count</th>
                           <th>Percentage</th>
                       </tr>
                   </thead>
                   <tbody>
           """
           
           for severity, count in summary['severity_distribution'].items():
               if count > 0:
                   percentage = (count / summary['total_urls']) * 100
                   badge_class = f"severity-{severity.lower()}"
                   html_content += f"""
                   <tr>
                       <td><span class="severity-badge {badge_class}">{severity}</span></td>
                       <td>{count:,}</td>
                       <td>{percentage:.2f}%</td>
                   </tr>
                   """
           
           html_content += """
                   </tbody>
               </table>
           </div>
           """
       
       # Add attack type distribution
       if summary['attack_distribution']:
           html_content += """
           <div class="section">
               <h2>Attack Type Distribution</h2>
               <table>
                   <thead>
                       <tr>
                           <th>Attack Type</th>
                           <th>Occurrences</th>
                           <th>Percentage of Attacks</th>
                       </tr>
                   </thead>
                   <tbody>
           """
           
           total_attacks = sum(summary['attack_distribution'].values())
           for attack_type, count in sorted(
               summary['attack_distribution'].items(), 
               key=lambda x: x[1], 
               reverse=True
           ):
               percentage = (count / total_attacks) * 100
               html_content += f"""
               <tr>
                   <td>{attack_type}</td>
                   <td>{count:,}</td>
                   <td>{percentage:.2f}%</td>
               </tr>
               """
           
           html_content += """
                   </tbody>
               </table>
           </div>
           """
       
       # Add top attack combinations
       if summary['top_attack_combinations']:
           html_content += """
           <div class="section">
               <h2>Top Attack Combinations</h2>
               <p>URLs often contain multiple attack vectors. Here are the most common combinations:</p>
               <table>
                   <thead>
                       <tr>
                           <th>Attack Combination</th>
                           <th>Frequency</th>
                       </tr>
                   </thead>
                   <tbody>
           """
           
           for combo, count in list(summary['top_attack_combinations'].items())[:10]:
               html_content += f"""
               <tr>
                   <td>{combo}</td>
                   <td>{count:,}</td>
               </tr>
               """
           
           html_content += """
                   </tbody>
               </table>
           </div>
           """
       
       # Add critical findings
       critical_count = summary['severity_distribution'].get('Critical', 0)
       if critical_count > 0:
           html_content += f"""
           <div class="alert alert-danger">
               <h3>Critical Security Alert</h3>
               <p><strong>{critical_count}</strong> critical severity attacks detected requiring immediate attention!</p>
               <p>These attacks pose the highest risk to system security and should be investigated immediately.</p>
           </div>
           """
       
       # Add insights and recommendations
       html_content += """
       <div class="section">
           <h2>Key Insights</h2>
           <ul>
       """
       
       # Generate insights based on data
       insights = []
       
       if summary['average_confidence'] > 0.8:
           insights.append("High confidence in attack detection (>80%) indicates clear attack patterns")
       
       if summary['multi_attack_urls'] > summary['suspicious_urls'] * 0.3:
           insights.append("Significant number of multi-vector attacks suggest sophisticated attackers")
       
       attack_variety = len(summary['attack_distribution'])
       if attack_variety >= 7:
           insights.append(f"Wide variety of attack types detected ({attack_variety} different types)")
       
       if 'SQL Injection' in summary['attack_distribution'] and \
          summary['attack_distribution']['SQL Injection'] > total_attacks * 0.4:
           insights.append("SQL Injection is the predominant attack vector. Database security should be prioritized.")
       
       for insight in insights:
           html_content += f"<li>{insight}</li>"
       
       html_content += """
           </ul>
       </div>
       
       <div class="section">
           <h2>Security Recommendations</h2>
           <ol>
       """
       
       # Generate recommendations based on findings
       recommendations = [
           "Implement a Web Application Firewall (WAF) to filter malicious requests",
           "Enable comprehensive logging and real-time monitoring",
           "Regularly update all software components with security patches",
           "Conduct security awareness training for development teams",
           "Implement rate limiting to prevent automated attacks",
           "Use parameterized queries to prevent SQL injection",
           "Deploy Content Security Policy (CSP) headers for XSS protection",
           "Perform regular security audits and penetration testing"
       ]
       
       for rec in recommendations[:5]:
           html_content += f"<li>{rec}</li>"
       
       html_content += """
           </ol>
       </div>
       
       <div class="footer">
           <p>Report generated by URL Security Analyzer</p>
           <p>&copy; 2025 Kyle Purves.</p>
       </div>
       
       </body>
       </html>
       """
       
       # Save HTML report
       report_path = os.path.join(self.output_dir, f'security_report_{self.timestamp}.html')
       with open(report_path, 'w', encoding='utf-8') as f:
           f.write(html_content)
       
       return report_path
   
   def create_visualizations(self, df: pd.DataFrame, summary: Dict[str, Any]) -> List[str]:
       """
       Create visualizations for the report.
       
       Parameters:
       -----------
       df : pd.DataFrame
           Analyzed dataframe
       summary : dict
           Summary statistics
           
       Returns:
       --------
       List[str] : Paths to generated visualization files
       """
       viz_paths = []
       
       # Set style
       plt.style.use('seaborn-v0_8-darkgrid')
       colors = ['#667eea', '#764ba2', '#f093fb', '#4facfe', '#00f2fe']
       
       # 1. Attack Type Distribution - Horizontal Bar Chart
       if summary['attack_distribution']:
           plt.figure(figsize=(10, 6))
           attack_types = list(summary['attack_distribution'].keys())
           counts = list(summary['attack_distribution'].values())
           
           # Sort by count
           sorted_data = sorted(zip(attack_types, counts), key=lambda x: x[1])
           attack_types, counts = zip(*sorted_data)
           
           bars = plt.barh(attack_types, counts, color=colors[0])
           
           # Add value labels on bars
           for i, (bar, count) in enumerate(zip(bars, counts)):
               plt.text(bar.get_width() + max(counts)*0.01, bar.get_y() + bar.get_height()/2,
                       f'{count:,}', va='center', fontweight='bold')
           
           plt.xlabel('Number of Occurrences', fontsize=12, fontweight='bold')
           plt.title('Distribution of Attack Types Detected', fontsize=16, fontweight='bold', pad=20)
           plt.tight_layout()
           
           path = os.path.join(self.output_dir, f'attack_distribution_{self.timestamp}.png')
           plt.savefig(path, dpi=300, bbox_inches='tight', facecolor='white')
           plt.close()
           viz_paths.append(path)
       
       # 2. Severity Distribution - Donut Chart
       plt.figure(figsize=(8, 8))
       severity_data = [(k, v) for k, v in summary['severity_distribution'].items() if v > 0]
       if severity_data:
           labels, sizes = zip(*severity_data)
           
           # Colors for severity levels
           severity_colors = {
               'Critical': '#dc2626',
               'High': '#f97316',
               'Medium': '#eab308',
               'Low': '#3b82f6',
               'Clean': '#10b981'
           }
           colors_list = [severity_colors.get(label, '#666') for label in labels]
           
           # Create donut chart
           wedges, texts, autotexts = plt.pie(sizes, labels=labels, colors=colors_list, 
                                              autopct='%1.1f%%', startangle=90,
                                              wedgeprops=dict(width=0.5, edgecolor='white'))
           
           # Enhance text
           for text in texts:
               text.set_fontsize(12)
               text.set_fontweight('bold')
           for autotext in autotexts:
               autotext.set_color('white')
               autotext.set_fontsize(10)
               autotext.set_fontweight('bold')
           
           plt.title('Request Severity Distribution', fontsize=16, fontweight='bold', pad=20)
           
           # Add total in center
           plt.text(0, 0, f'Total\n{sum(sizes):,}', ha='center', va='center', 
                   fontsize=20, fontweight='bold')
           
           path = os.path.join(self.output_dir, f'severity_distribution_{self.timestamp}.png')
           plt.savefig(path, dpi=300, bbox_inches='tight', facecolor='white')
           plt.close()
           viz_paths.append(path)
       
       # 3. Temporal Analysis - If timestamp data available
       if 'hour' in df.columns and df['is_suspicious'].sum() > 0:
           plt.figure(figsize=(12, 6))
           
           # Group by hour
           hourly_attacks = df[df['is_suspicious']].groupby('hour').size()
           hours = list(range(24))
           counts = [hourly_attacks.get(h, 0) for h in hours]
           
           # Create bar chart with gradient colors
           bars = plt.bar(hours, counts)
           
           # Apply gradient colors
           for i, bar in enumerate(bars):
               bar.set_color(plt.cm.plasma(i / 23))
           
           plt.xlabel('Hour of Day', fontsize=12, fontweight='bold')
           plt.ylabel('Number of Attacks', fontsize=12, fontweight='bold')
           plt.title('Attack Distribution by Hour of Day', fontsize=16, fontweight='bold', pad=20)
           plt.xticks(hours)
           plt.grid(axis='y', alpha=0.3)
           
           # Add average line
           avg_attacks = sum(counts) / 24
           plt.axhline(y=avg_attacks, color='red', linestyle='--', 
                      label=f'Average: {avg_attacks:.1f}')
           plt.legend()
           
           path = os.path.join(self.output_dir, f'temporal_analysis_{self.timestamp}.png')
           plt.savefig(path, dpi=300, bbox_inches='tight', facecolor='white')
           plt.close()
           viz_paths.append(path)
       
       # 4. Top Attacking Countries - If country data available
       if 'country' in df.columns and df['is_suspicious'].sum() > 0:
           plt.figure(figsize=(10, 6))
           
           # Get top 10 attacking countries
           country_counts = df[df['is_suspicious']]['country'].value_counts().head(10)
           
           # Create horizontal bar chart
           plt.barh(country_counts.index[::-1], country_counts.values[::-1], 
                   color=plt.cm.viridis(np.linspace(0.2, 0.8, 10)))
           
           plt.xlabel('Number of Attacks', fontsize=12, fontweight='bold')
           plt.title('Top 10 Attacking Countries', fontsize=16, fontweight='bold', pad=20)
           
           # Add value labels
           for i, v in enumerate(country_counts.values[::-1]):
               plt.text(v + max(country_counts)*0.01, i, f'{v:,}', 
                       va='center', fontweight='bold')
           
           plt.tight_layout()
           
           path = os.path.join(self.output_dir, f'country_distribution_{self.timestamp}.png')
           plt.savefig(path, dpi=300, bbox_inches='tight', facecolor='white')
           plt.close()
           viz_paths.append(path)
       
       return viz_paths
   
   def export_results(self, df: pd.DataFrame, summary: Dict[str, Any]) -> Dict[str, str]:
       """
       Export analysis results in multiple formats.
       
       Parameters:
       -----------
       df : pd.DataFrame
           Analyzed dataframe
       summary : dict
           Summary statistics
           
       Returns:
       --------
       dict : Paths to exported files
       """
       export_paths = {}
       
       # 1. Export full results CSV
       full_path = os.path.join(self.output_dir, f'full_analysis_{self.timestamp}.csv')
       df.to_csv(full_path, index=False)
       export_paths['full_csv'] = full_path
       
       # 2. Export suspicious-only CSV
       if 'is_suspicious' in df.columns:
           suspicious_df = df[df['is_suspicious']].copy()
           if len(suspicious_df) > 0:
               suspicious_path = os.path.join(
                   self.output_dir, 
                   f'suspicious_only_{self.timestamp}.csv'
               )
               suspicious_df.to_csv(suspicious_path, index=False)
               export_paths['suspicious_csv'] = suspicious_path
       
       # 3. Export critical attacks
       if 'severity' in df.columns:
           critical_df = df[df['severity'] == 'Critical'].copy()
           if len(critical_df) > 0:
               critical_path = os.path.join(
                   self.output_dir, 
                   f'critical_attacks_{self.timestamp}.csv'
               )
               critical_df.to_csv(critical_path, index=False)
               export_paths['critical_csv'] = critical_path
       
       # 4. Export summary JSON
       summary_path = os.path.join(self.output_dir, f'summary_{self.timestamp}.json')
       with open(summary_path, 'w') as f:
           json.dump(summary, f, indent=4, default=str)
       export_paths['summary_json'] = summary_path
       
       # 5. Generate and export executive summary text
       exec_summary_path = os.path.join(
           self.output_dir, 
           f'executive_summary_{self.timestamp}.txt'
       )
       
       with open(exec_summary_path, 'w') as f:
           f.write("EXECUTIVE SUMMARY - HONEYPOT SECURITY ANALYSIS\n")
           f.write("=" * 60 + "\n\n")
           f.write(f"Report Date: {datetime.now().strftime('%B %d, %Y')}\n")
           f.write(f"Analysis Period: {df['timestamp'].min()} to {df['timestamp'].max()}\n")
           f.write(f"Analyst: Kyle Purves\n\n")
           
           f.write("KEY FINDINGS:\n")
           f.write("-" * 40 + "\n")
           f.write(f"• Total Requests Analyzed: {summary['total_urls']:,}\n")
           f.write(f"• Malicious Requests Identified: {summary['suspicious_urls']:,}\n")
           f.write(f"• Attack Rate: {(summary['suspicious_urls'] / summary['total_urls'] * 100):.2f}%\n")
           f.write(f"• Critical Severity Attacks: {summary['severity_distribution'].get('Critical', 0):,}\n")
           f.write(f"• Multi-Vector Attacks: {summary['multi_attack_urls']:,}\n\n")
           
           f.write("TOP THREATS:\n")
           f.write("-" * 40 + "\n")
           for attack_type, count in list(summary['attack_distribution'].items())[:5]:
               f.write(f"• {attack_type}: {count:,} instances\n")
           
           f.write("\nRECOMMENDATIONS:\n")
           f.write("-" * 40 + "\n")
           f.write("1. Immediate deployment of Web Application Firewall (WAF)\n")
           f.write("2. Implementation of rate limiting and IP blocking\n")
           f.write("3. Security audit of all web applications\n")
           f.write("4. Enhanced monitoring and alerting systems\n")
           f.write("5. Regular security training for development teams\n")
       
       export_paths['executive_summary'] = exec_summary_path
       
       return export_paths


# Utility functions for extra analysis
def calculate_attack_metrics(df: pd.DataFrame) -> Dict[str, Any]:
   """
   Calculate advanced metrics for attack analysis.
   
   Parameters:
   -----------
   df : pd.DataFrame
       Dataframe with attack analysis results
       
   Returns:
   --------
   dict : Advanced metrics
   """
   metrics = {
       'total_records': len(df),
       'time_range': None,
       'attack_velocity': 0,
       'peak_attack_time': None,
       'most_targeted_paths': [],
       'attack_sophistication_index': 0,
       'geographic_diversity': 0,
       'unique_attackers': 0,
       'repeat_offenders': [],
       'attack_patterns': {}
   }
   
   # Time-based analysis
   if 'timestamp' in df.columns:
       df['timestamp'] = pd.to_datetime(df['timestamp'])
       time_range = (df['timestamp'].max() - df['timestamp'].min()).total_seconds() / 3600
       metrics['time_range'] = f"{time_range:.1f} hours"
       
       if 'is_suspicious' in df.columns:
           attacks_per_hour = df[df['is_suspicious']].groupby(
               df['timestamp'].dt.floor('H')
           ).size()
           
           if len(attacks_per_hour) > 0:
               metrics['attack_velocity'] = attacks_per_hour.mean()
               metrics['peak_attack_time'] = attacks_per_hour.idxmax()
   
   # Path analysis
   if 'request_url' in df.columns and 'is_suspicious' in df.columns:
       suspicious_df = df[df['is_suspicious']]
       if len(suspicious_df) > 0:
           # Extract base paths
           suspicious_df['base_path'] = suspicious_df['request_url'].str.split('?').str[0]
           path_counts = suspicious_df['base_path'].value_counts().head(10)
           metrics['most_targeted_paths'] = [
               {'path': path, 'count': count} 
               for path, count in path_counts.items()
           ]
   
   # Sophistication analysis
   if 'attack_count' in df.columns:
       multi_attack_ratio = len(df[df['attack_count'] > 1]) / len(df[df['is_suspicious']])
       avg_attacks_per_url = df[df['is_suspicious']]['attack_count'].mean()
       metrics['attack_sophistication_index'] = round(
           (multi_attack_ratio * 0.5 + (avg_attacks_per_url - 1) * 0.5) * 10, 2
       )
   
   # Geographic analysis
   if 'country' in df.columns and 'is_suspicious' in df.columns:
       unique_countries = df[df['is_suspicious']]['country'].nunique()
       metrics['geographic_diversity'] = unique_countries
   
   # Attacker analysis
   if 'source_ip' in df.columns and 'is_suspicious' in df.columns:
       suspicious_ips = df[df['is_suspicious']]['source_ip']
       metrics['unique_attackers'] = suspicious_ips.nunique()
       
       # Find repeat offenders
       ip_counts = suspicious_ips.value_counts()
       repeat_offenders = ip_counts[ip_counts > 10].head(10)
       metrics['repeat_offenders'] = [
           {'ip': ip, 'attacks': count} 
           for ip, count in repeat_offenders.items()
       ]
   
   # Attack pattern analysis
   if 'attack_types' in df.columns:
       # Find common attack sequences
       attack_sequences = df[df['is_suspicious']]['attack_types'].value_counts().head(10)
       metrics['attack_patterns'] = {
           seq: count for seq, count in attack_sequences.items() if seq
       }
   
   return metrics


def progress_reporter(current: int, total: int) -> None:
   """
   Report progress during batch processing.
   
   Parameters:
   -----------
   current : int
       Current item number
   total : int
       Total items to process
   """
   percentage = (current / total) * 100
   bar_length = 50
   filled_length = int(bar_length * current / total)
   bar = '[]' * filled_length + '-' * (bar_length - filled_length)
   
   print(f'\rProgress: |{bar}| {percentage:.1f}% ({current:,}/{total:,})', end='')
   
   if current == total:
       print()  


# Main execution function
def main(data_dir: str, output_dir: str):
   """
   Main execution function that organizes the entire analysis process.
   
   Parameters:
   -----------
   data_dir : str
       Directory containing the input data
   output_dir : str
       Directory for output files
   """
   print("\n" + "="*80)
   print("HONEYPOT URL SECURITY ANALYSIS SYSTEM")
   print("="*80)
   print(f"Version: 1.0")
   print(f"Author: Kyle Purves")
   print(f"Start Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
   print("="*80 + "\n")
   
   try:
       # Step 1: Load data
       print("STEP 1: Loading honeypot data...")
       file_path = os.path.join(data_dir, 'log_file_2.csv')
       
       processor = DataProcessor()
       df = processor.load_honeypot_data(file_path)
       print(f"✓ Loaded {len(df):,} records from {df.shape[1]} columns")
       
       # Step 2: Clean data
       print("\nSTEP 2: Cleaning and preprocessing data...")
       required_columns = ['source_ip', 'request_url', 'timestamp']
       df = processor.clean_data(df, required_columns)
       df = processor.enrich_data(df)
       print("Data cleaning complete")
       
       # Step 3: Initialize analyzer
       print("\nSTEP 3: Initializing security analyzer...")
       analyzer = URLSecurityAnalyzer()
       print("Analyzer ready with 10 attack detection patterns")
       
       # Step 4: Analyze URLs
       print("\nSTEP 4: Analyzing URLs for security threats...")
       print("This may take a few minutes for large datasets...")
       
       # Batch analyze with progress reporting
       analysis_results = analyzer.batch_analyze(
           df['request_url'].tolist(),
           progress_callback=progress_reporter
       )
       
       # Add results to dataframe
       for i, result in enumerate(analysis_results):
           df.at[i, 'is_suspicious'] = result['is_suspicious']
           df.at[i, 'attack_types'] = ', '.join(result['attack_types'])
           df.at[i, 'attack_count'] = result['attack_count']
           df.at[i, 'severity'] = result['severity']
           df.at[i, 'severity_score'] = result['severity_score']
           df.at[i, 'confidence'] = result['confidence']
       
       print("\n✓ URL analysis complete")
       
       # Step 5: Generate summary
       print("\nSTEP 5: Generating analysis summary...")
       summary = analyzer.get_attack_summary(analysis_results)
       
       # Calculate additional metrics
       advanced_metrics = calculate_attack_metrics(df)
       summary.update(advanced_metrics)
       
       # Print summary
       print("\n" + "="*60)
       print("ANALYSIS SUMMARY")
       print("="*60)
       print(f"Total URLs Analyzed: {summary['total_urls']:,}")
       print(f"Suspicious URLs: {summary['suspicious_urls']:,} ({summary['suspicious_urls']/summary['total_urls']*100:.2f}%)")
       print(f"Clean URLs: {summary['clean_urls']:,}")
       print(f"Invalid URLs: {summary['invalid_urls']:,}")
       print(f"\nAttack Sophistication Index: {summary.get('attack_sophistication_index', 0)}/10")
       print(f"Geographic Diversity: {summary.get('geographic_diversity', 0)} countries")
       print(f"Unique Attackers: {summary.get('unique_attackers', 0):,}")
       
       # Step 6: Generate reports
       print("\nSTEP 6: Generating reports and visualizations...")
       generator = ReportGenerator(output_dir)
       
       # Generate HTML report
       html_path = generator.generate_html_report(df, summary)
       print(f"HTML report generated: {html_path}")
       
       # Create visualizations
       viz_paths = generator.create_visualizations(df, summary)
       print(f"Created {len(viz_paths)} visualizations")
       
       # Export results
       export_paths = generator.export_results(df, summary)
       print(f"Exported {len(export_paths)} data files")
       
       # Step 7: Display critical findings
       print("\nCRITICAL FINDINGS:")
       print("-" * 40)
       
       critical_count = summary['severity_distribution'].get('Critical', 0)
       if critical_count > 0:
           print(f"{critical_count} CRITICAL severity attacks detected!")
           
           # Show sample critical attacks
           critical_samples = df[df['severity'] == 'Critical'].head(5)
           if len(critical_samples) > 0:
               print("\nSample Critical Attacks:")
               for _, row in critical_samples.iterrows():
                   print(f"  • IP: {row['source_ip']} | URL: {row['request_url'][:50]}...")
                   print(f"    Attacks: {row['attack_types']}")
       else:
           print("✓ No critical severity attacks detected")
       
       # Top threats
       print("\nTOP THREATS:")
       print("-" * 40)
       for attack_type, count in list(summary['attack_distribution'].items())[:5]:
           print(f"• {attack_type}: {count:,} instances")
       
       # Recommendations
       print("\nRECOMMENDATIONS:")
       print("-" * 40)
       recommendations = [
           "Deploy Web Application Firewall (WAF)",
           "Implement rate limiting and geo-blocking",
           "Enable comprehensive logging and monitoring",
           "Conduct security code reviews",
           "Regular penetration testing"
       ]
       
       for i, rec in enumerate(recommendations, 1):
           print(f"{i}. {rec}")
       
       print("\n" + "="*80)
       print("ANALYSIS COMPLETE!")
       print(f"End Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
       print("="*80)
       
       return df, summary
       
   except Exception as e:
       print(f"\nERROR: {str(e)}")
       import traceback
       traceback.print_exc()
       raise




# Initialize all patterns and functions
print("All functions loaded successfully!")
print("System ready for security analysis")

All functions loaded successfully!
System ready for security analysis


#### Load Data:
The dataset is from a honeypot: log_file.csv

In [5]:
#### Load Data:
# The dataset is from a honeypot: log_file_2.csv

# Load the honeypot log file
print("Loading honeypot log file...")

try:
    data = pd.read_csv(os.path.join(data_dir, 'log_file_2.csv'), sep='\t', lineterminator='\r', dtype='string', encoding='utf8')
    
    # Remove blank rows
    data = data[:-1]
    
    print(f"Data loaded successfully!")
    print(f"Shape: {data.shape}")
    print(f"Columns: {', '.join(data.columns.tolist())}")
    
except FileNotFoundError:
    print("ERROR: log_file_2.csv not found in current directory")
    print("Please ensure the data file is in the same folder as this notebook")
    raise
except Exception as e:
    print(f"ERROR loading data: {str(e)}")
    raise

Loading honeypot log file...
Data loaded successfully!
Shape: (720281, 22)
Columns: ident, timestamp, unixTimestamp, source_ip, actual_ip, actual_ip_num, source_port, destination_ip, destination_ip_num, destination_port, protocol, country, country_code, Latitude, Longitude, asn_code, asn, filename, User_Agent, request_raw, request_url, X_Forwarded_For


#### Data Wrangling:

This secion contains the required functions used to clean and prepare the data

In [6]:
#### Data Wrangling:
# This section contains the required functions used to clean and prepare the data

# Apply data cleaning
print("\n--- DATA PREPARATION ---")

# Initialize data processor
processor = DataProcessor()

# Clean the data
required_columns = ['source_ip', 'request_url', 'timestamp', 'country', 'User_Agent']
data = processor.clean_data(data, required_columns)

# Enrich data with derived features
data = processor.enrich_data(data)

# Display data quality metrics
print("\nData Quality Report:")
print(f"  • Total records: {len(data):,}")
print(f"  • Columns: {data.shape[1]}")
print(f"  • Missing values in request_url: {data['request_url'].isna().sum()}")
print(f"  • Date range: {data['timestamp'].min()} to {data['timestamp'].max()}")

# Display sample of cleaned data
print("\nSample of cleaned data:")
display(data[['timestamp', 'source_ip', 'country', 'request_url']].head())

print("\nData preparation complete!")


--- DATA PREPARATION ---

Data Quality Report:
  • Total records: 720,281
  • Columns: 31
  • Missing values in request_url: 0
  • Date range: 2016-09-15 07:12:34.475000 to 2019-10-31 13:14:50.017000

Sample of cleaned data:


Unnamed: 0,timestamp,source_ip,country,request_url
0,2016-09-15 07:12:34.475,36.239.154.136,Taiwan,/ipc$
1,2016-09-15 07:47:46.342,110.87.183.250,China,/ipc$
2,2016-09-15 07:51:14.172,191.96.249.80,United Kingdom,/xmlrpc.php
3,2016-09-15 12:01:09.751,203.67.142.246,Taiwan,/ipc$
4,2016-09-15 12:57:04.966,36.236.95.116,Taiwan,/ipc$



Data preparation complete!


#### Search for suspicious activity:

In [7]:
#### Search for suspicious activity:
# Analyze all URLs using compiled regex patterns to detect various attack types

print("\n--- SECURITY ANALYSIS ---")
print("Using vectorized operations with proper threading...")

start_time = time.time()  # Start timer
\

# 1. Collect regex patterns into dictionary 
patterns_dict = {
    'SQL Injection': OptimizedAttackPatterns.get_sql_injection_pattern(),
    'Cross-Site Scripting (XSS)': OptimizedAttackPatterns.get_xss_pattern(),
    'PHP Attack': OptimizedAttackPatterns.get_php_attack_pattern(),
    'Windows Attack': OptimizedAttackPatterns.get_windows_attack_pattern(),
    'Directory Traversal': OptimizedAttackPatterns.get_directory_traversal_pattern(),
    'Command Injection': OptimizedAttackPatterns.get_command_injection_pattern(),
    'LDAP Injection': OptimizedAttackPatterns.get_ldap_injection_pattern(),
    'XML Injection': OptimizedAttackPatterns.get_xml_injection_pattern(),
    'NoSQL Injection': OptimizedAttackPatterns.get_nosql_injection_pattern(),
    'File Inclusion': OptimizedAttackPatterns.get_file_inclusion_pattern()
}

# 2. Choose analysis method based on dataset size
if len(data) > 10000:
    print(f"Large dataset detected ({len(data):,} URLs). Using parallel processing...")
    try:
        # Try parallel processing first
        analyzed_data, summary = process_urls_optimized(data, patterns_dict, n_jobs=4)
    except Exception as e:
        print(f"Parallel processing failed: {e}")
        print("Falling back to single-threaded analysis...")
        analyzed_data = quick_security_analysis(data, patterns_dict)
        
        # Generate summary manually for fallback
        analyzer = VectorizedURLAnalyzer(patterns_dict)
        results_list = []
        for _, row in analyzed_data.iterrows():
            results_list.append({
                'is_suspicious': row['is_suspicious'],
                'attack_types': row['attack_types'].split(', ') if row['attack_types'] else [],
                'attack_count': row['attack_count'],
                'severity': row['severity'],
                'severity_score': row['severity_score'],
                'confidence': row['confidence']
            })
        summary = analyzer.get_summary_stats(results_list)
else:
    print(f"Small dataset ({len(data):,} URLs). Using single-threaded analysis...")
    analyzed_data = quick_security_analysis(data, patterns_dict)
    
    # Generate summary
    analyzer = VectorizedURLAnalyzer(patterns_dict)
    results_list = []
    for _, row in analyzed_data.iterrows():
        results_list.append({
            'is_suspicious': row['is_suspicious'],
            'attack_types': row['attack_types'].split(', ') if row['attack_types'] else [],
            'attack_count': row['attack_count'],
            'severity': row['severity'],
            'severity_score': row['severity_score'],
            'confidence': row['confidence']
        })
    summary = analyzer.get_summary_stats(results_list)

# 3. Display results
print("\n--- ANALYSIS RESULTS ---")
print(f"Analysis completed successfully!")
print(f"Total URLs analyzed: {len(analyzed_data):,}")
print(f"Suspicious URLs found: {summary['suspicious_urls']:,} ({summary['attack_rate']:.2f}%)")
print(f"Attack types detected: {len(summary['attack_distribution'])}")

# Display attack distribution
if summary['attack_distribution']:
    print(f"\nTop Attack Types:")
    for attack_type, count in list(summary['attack_distribution'].items())[:5]:
        print(f"   • {attack_type}: {count:,} occurrences")

# Display severity distribution
if summary['severity_distribution']:
    print(f"\nSeverity Distribution:")
    for severity, count in summary['severity_distribution'].items():
        if count > 0:
            print(f"   • {severity}: {count:,} URLs")

# 4. Show sample suspicious URLs if any
if summary['suspicious_urls'] > 0:
    print(f"\nSample Suspicious URLs:")
    suspicious_sample = analyzed_data[analyzed_data['is_suspicious'] == True].head(5)
    
    for idx, row in suspicious_sample.iterrows():
        print(f"\n   URL: {row['request_url'][:80]}...")
        print(f"   Source IP: {row['source_ip']}")
        print(f"   Attack Types: {row['attack_types']}")
        print(f"   Severity: {row['severity']} (Score: {row['severity_score']})")
        print(f"   Confidence: {row['confidence']:.3f}")

end_time = time.time()  # End timer
elapsed = end_time - start_time

print(f"\nSecurity analysis completed successfully!")
print(f"\nSecurity analysis completed in {elapsed:.2f} seconds.")
print(f"Results saved in 'analyzed_data' DataFrame with {len(analyzed_data.columns)} columns")



--- SECURITY ANALYSIS ---
Using vectorized operations with proper threading...
Large dataset detected (720,281 URLs). Using parallel processing...

Starting optimized URL security analysis...
Processing a large dataset — analysis time will vary based on system resources.

Analyzing 720,281 URLs using 4 threads...
Created 5 chunks of ~180070 URLs each
Analysis completed successfully!

Quick Analysis Summary:
   Total URLs: 720,281
   Suspicious URLs: 664,257 (92.22%)
   Multi-attack URLs: 473,728
   Average confidence: 0.554

Preliminary Top Attack Types:
   • PHP Attack: 487,771
   • File Inclusion: 9,811
   • Directory Traversal: 5,716
   • Command Injection: 35,694
   • Windows Attack: 220,484

--- ANALYSIS RESULTS ---
Analysis completed successfully!
Total URLs analyzed: 720,281
Suspicious URLs found: 664,257 (92.22%)
Attack types detected: 10

Top Attack Types:
   • PHP Attack: 487,771 occurrences
   • File Inclusion: 9,811 occurrences
   • Directory Traversal: 5,716 occurrences
 

#### Save the results for further analysis:

In [8]:
#### Save the results for further analysis:
# Generate comprehensive reports, visualizations, and export data in multiple formats

# Note: Analysis of 851,321 URLs typically takes 10-15 minutes depending on system resources.
# This is comparable to commercial security tools performing similar comprehensive pattern matching.
# The file export operations add an additional 2-3 minutes for writing large CSV files.

print("\n--- SAVING RESULTS ---")

# Initialize report generator
generator = ReportGenerator(output_dir)

# Generate comprehensive HTML report
html_report = generator.generate_html_report(analyzed_data, summary)
print(f"HTML report saved: {html_report}")

# Create visualizations
visualizations = generator.create_visualizations(analyzed_data, summary)
print(f"Created {len(visualizations)} visualizations")

print("Writing output files to disk... this may take a minute.\n")

# Export data files
export_paths = generator.export_results(analyzed_data, summary)
print("\nExported files:")
for file_type, path in export_paths.items():
    print(f"  • {file_type}: {path}")

# Display final summary
print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
print(f"Total suspicious activity detected: {summary['suspicious_urls']:,} URLs")
print(f"Critical threats: {summary['severity_distribution'].get('Critical', 0)}")

# Fix for the unique_attack_patterns reference
if 'top_attack_combinations' in summary:
    print(f"Unique attack patterns: {len(summary['top_attack_combinations'])}")
else:
    print(f"Unique attack patterns: {len(summary['attack_distribution'])}")

print("\nAll results have been saved to:", output_dir)


--- SAVING RESULTS ---
HTML report saved: /Users/versa_kyle/Downloads/GitHub Portfolio Projects/Cybersecurity Projects/URL_Threat_Detection_Analysis/Reports/security_report_20260214_134339.html
Created 4 visualizations
Writing output files to disk... this may take a minute.


Exported files:
  • full_csv: /Users/versa_kyle/Downloads/GitHub Portfolio Projects/Cybersecurity Projects/URL_Threat_Detection_Analysis/Reports/full_analysis_20260214_134339.csv
  • suspicious_csv: /Users/versa_kyle/Downloads/GitHub Portfolio Projects/Cybersecurity Projects/URL_Threat_Detection_Analysis/Reports/suspicious_only_20260214_134339.csv
  • critical_csv: /Users/versa_kyle/Downloads/GitHub Portfolio Projects/Cybersecurity Projects/URL_Threat_Detection_Analysis/Reports/critical_attacks_20260214_134339.csv
  • summary_json: /Users/versa_kyle/Downloads/GitHub Portfolio Projects/Cybersecurity Projects/URL_Threat_Detection_Analysis/Reports/summary_20260214_134339.json
  • executive_summary: /Users/versa_kyle

## Observations, Analysis, and Conclusions

In [9]:
## Observations, Analysis, and Conclusions

# Detailed analysis of findings
print("\n" + "="*80)
print("DETAILED OBSERVATIONS AND ANALYSIS")
print("="*80)

# Executive Summary
print("\n EXECUTIVE SUMMARY:")
print("-" * 40)
print(f"Dataset: Honeypot log containing {len(analyzed_data):,} HTTP requests")
print(f"Analysis Period: {analyzed_data['timestamp'].min().strftime('%Y-%m-%d')} to {analyzed_data['timestamp'].max().strftime('%Y-%m-%d')}")
print(f"Overall Threat Level: CRITICAL - {(summary['suspicious_urls']/summary['total_urls']*100):.1f}% malicious traffic detected")
print(f"Unique Attack Patterns: {len(summary['top_attack_combinations'])} different combinations identified")

# 1. Attack Pattern Analysis
print("\n1. ATTACK PATTERN ANALYSIS:")
print("-" * 40)

# Most common attack combinations
if analyzed_data['attack_count'].max() > 1:
    multi_attacks = analyzed_data[analyzed_data['attack_count'] > 1]['attack_types'].value_counts().head(10)
    print("\nMost Common Attack Combinations:")
    for combo, count in multi_attacks.items():
        percentage = (count / summary['suspicious_urls']) * 100
        print(f"  - {combo}: {count:,} occurrences ({percentage:.2f}% of all attacks)")
    
    # Statistical analysis
    print(f"\nAttack Complexity Statistics:")
    print(f"  - Mean attacks per malicious URL: {analyzed_data[analyzed_data['is_suspicious']]['attack_count'].mean():.2f}")
    print(f"  - Median attacks per malicious URL: {analyzed_data[analyzed_data['is_suspicious']]['attack_count'].median():.0f}")
    print(f"  - Maximum attack types in single URL: {analyzed_data['attack_count'].max()}")

# 2. Temporal Analysis
print("\n2. TEMPORAL ANALYSIS:")
print("-" * 40)

if 'timestamp' in analyzed_data.columns and analyzed_data['timestamp'].notna().any():
    # Optimized temporal analysis
    hourly_attacks = analyzed_data[analyzed_data['is_suspicious']].groupby('hour').size()
    
    if len(hourly_attacks) > 0:
        peak_hour = hourly_attacks.idxmax()
        low_hour = hourly_attacks.idxmin()
        
        print(f"\nAttack Timing Patterns:")
        print(f"  - Peak attack hour: {peak_hour}:00 with {hourly_attacks[peak_hour]:,} attacks")
        print(f"  - Lowest attack hour: {low_hour}:00 with {hourly_attacks[low_hour]:,} attacks")
        print(f"  - Attack variance by hour: {hourly_attacks.std():.2f}")
        
        # Day of week analysis
        analyzed_data['day_name'] = analyzed_data['timestamp'].dt.day_name()
        daily_pattern = analyzed_data[analyzed_data['is_suspicious']].groupby('day_name').size()
        print(f"\n  - Most active day: {daily_pattern.idxmax()} ({daily_pattern.max():,} attacks)")
        print(f"  - Least active day: {daily_pattern.idxmin()} ({daily_pattern.min():,} attacks)")
        
        # Attack speed
        time_span = (analyzed_data['timestamp'].max() - analyzed_data['timestamp'].min()).total_seconds() / 3600
        attacks_per_hour = summary['suspicious_urls'] / time_span
        print(f"\n  - Average attack velocity: {attacks_per_hour:.1f} attacks/hour")

# 3. Geographic Analysis
print("\n3. GEOGRAPHIC ANALYSIS:")
print("-" * 40)

if 'country' in analyzed_data.columns:
    # Geographic diversity
    unique_countries = analyzed_data[analyzed_data['is_suspicious']]['country'].nunique()
    print(f"\nGeographic Distribution:")
    print(f"  - Attacks originated from {unique_countries} different countries")
    
    # Top attacking countries with percentages
    country_attacks = analyzed_data[analyzed_data['is_suspicious']]['country'].value_counts().head(10)
    print(f"\n  Top 10 Attacking Countries:")
    for country, count in country_attacks.items():
        percentage = (count / summary['suspicious_urls']) * 100
        print(f"    {country}: {count:,} attacks ({percentage:.2f}%)")
    
    # Countries with highest attack sophistication
    sophisticated_attacks = analyzed_data[analyzed_data['attack_count'] > 2]
    if len(sophisticated_attacks) > 0:
        sophisticated_countries = sophisticated_attacks['country'].value_counts().head(5)
        print("\n  Countries with Most Sophisticated Attacks (3+ attack types):")
        for country, count in sophisticated_countries.items():
            sophistication_rate = (count / country_attacks.get(country, count)) * 100
            print(f"    {country}: {count:,} multi-vector attacks ({sophistication_rate:.1f}% sophistication rate)")

# 4. Target Analysis
print("\n4. TARGET ANALYSIS:")
print("-" * 40)

# Most targeted paths with categorization
targeted_paths = analyzed_data[analyzed_data['is_suspicious']]['request_url'].str.split('?').str[0]
path_counts = targeted_paths.value_counts().head(15)

print("\nMost Targeted Resources:")
# Categorize paths
admin_paths = [p for p in path_counts.index if 'admin' in p.lower() or 'login' in p.lower()]
php_paths = [p for p in path_counts.index if '.php' in p]
config_paths = [p for p in path_counts.index if 'config' in p.lower() or '.xml' in p or '.ini' in p]

print(f"\n  Target Categories:")
print(f"    - Administrative interfaces: {len(admin_paths)} paths")
print(f"    - PHP applications: {len(php_paths)} paths")
print(f"    - Configuration files: {len(config_paths)} paths")

print("\n  Top 10 Specific Targets:")
for i, (path, count) in enumerate(path_counts.head(10).items(), 1):
    print(f"    {i}. {path}: {count:,} attacks")

# 5. Attack Sophistication Metrics
print("\n5. ATTACK SOPHISTICATION METRICS:")
print("-" * 40)

total_suspicious = analyzed_data['is_suspicious'].sum()
if total_suspicious > 0:
    # Detailed breakdown
    attack_distribution = analyzed_data[analyzed_data['is_suspicious']]['attack_count'].value_counts().sort_index()
    
    print("\nAttack Complexity Distribution:")
    for num_attacks, count in attack_distribution.items():
        percentage = (count / total_suspicious) * 100
        print(f"  - {num_attacks} attack type(s): {count:,} URLs ({percentage:.2f}%)")
    
    # Severity connection
    high_severity = analyzed_data[analyzed_data['severity'].isin(['Critical', 'High'])].shape[0]
    multi_vector_high = analyzed_data[(analyzed_data['attack_count'] > 1) & 
                                      (analyzed_data['severity'].isin(['Critical', 'High']))].shape[0]
    
    print(f"\n  Severity Correlation:")
    print(f"    - High/Critical severity attacks: {high_severity:,}")
    print(f"    - Multi-vector high severity: {multi_vector_high:,} ({multi_vector_high/high_severity*100:.1f}%)")

# 6. Attack Effectiveness Analysis
print("\n6. ATTACK EFFECTIVENESS INDICATORS:")
print("-" * 40)

# Confidence analysis
high_confidence = analyzed_data[analyzed_data['confidence'] >= 0.8].shape[0]
print(f"\nDetection Confidence:")
print(f"  - High confidence detections (≥80%): {high_confidence:,} ({high_confidence/total_suspicious*100:.1f}%)")
print(f"  - Average detection confidence: {analyzed_data[analyzed_data['is_suspicious']]['confidence'].mean()*100:.1f}%")

# Attack persistence 
if 'source_ip' in analyzed_data.columns:
    ip_counts = analyzed_data[analyzed_data['is_suspicious']]['source_ip'].value_counts()
    persistent_ips = ip_counts[ip_counts > 100].shape[0]
    print(f"\n  Attack Persistence:")
    print(f"    - IPs with >100 attacks: {persistent_ips}")
    print(f"    - Most persistent attacker: {ip_counts.index[0]} ({ip_counts.iloc[0]:,} attacks)")

# 7. Conclusions
print("\n7. CONCLUSIONS:")
print("-" * 40)

conclusions = []

# Overall threat assessment
threat_percentage = (summary['suspicious_urls'] / summary['total_urls']) * 100
conclusions.append(f"CRITICAL THREAT LEVEL: {threat_percentage:.1f}% of all traffic is malicious, indicating active targeting")

# Attack sophistication
sophistication_rate = (summary['multi_attack_urls'] / summary['suspicious_urls']) * 100
conclusions.append(f"HIGH SOPHISTICATION: {sophistication_rate:.1f}% of attacks use multiple vectors, suggesting advanced threat actors")

# Primary threats
top_3_attacks = list(summary['attack_distribution'].items())[:3]
threat_summary = ", ".join([f"{attack[0]} ({attack[1]:,})" for attack in top_3_attacks])
conclusions.append(f"PRIMARY THREATS: {threat_summary}")

# Geographic insights
if unique_countries > 50:
    conclusions.append(f"GLOBAL THREAT: Attacks from {unique_countries} countries indicate widespread interest")

# Temporal insights
if attacks_per_hour > 1000:
    conclusions.append(f"SUSTAINED CAMPAIGN: {attacks_per_hour:.0f} attacks/hour suggests automated, persistent threats")

print("\nKEY FINDINGS:")
for i, conclusion in enumerate(conclusions, 1):
    print(f"{i}. {conclusion}")

# 8. Recommendations
print("\n8. SECURITY RECOMMENDATIONS:")
print("-" * 40)

print("\nIMMEDIATE ACTIONS (0-7 days):")
immediate = [
    "Deploy Web Application Firewall (WAF) with rules targeting identified attack patterns",
    "Implement IP-based rate limiting (suggested: 100 requests/minute per IP)",
    "Block or monitor top 100 attacking IPs identified in this analysis"
]
for i, rec in enumerate(immediate, 1):
    print(f"  {i}. {rec}")

print("\nSHORT-TERM ACTIONS (1-4 weeks):")
short_term = []
if 'SQL Injection' in summary['attack_distribution']:
    short_term.append("Audit all database queries and implement prepared statements")
if 'PHP Attack' in summary['attack_distribution']:
    short_term.append("Review PHP configuration and disable dangerous functions (eval, exec, system)")
if 'Directory Traversal' in summary['attack_distribution']:
    short_term.append("Implement path validation and chroot jails for file operations")

for i, rec in enumerate(short_term, 1):
    print(f"  {i}. {rec}")

print("\nLONG-TERM ACTIONS (1-3 months):")
long_term = [
    "Implement comprehensive Security Information and Event Management (SIEM)",
    "Establish 24/7 security monitoring and incident response team",
    "Develop and test incident response playbooks for each attack type",
    "Conduct quarterly penetration testing and security assessments"
]
for i, rec in enumerate(long_term, 1):
    print(f"  {i}. {rec}")

# 9. Limitations and Future Work
print("\n9. STUDY LIMITATIONS AND FUTURE WORK:")
print("-" * 40)
print("\nLimitations:")
print("  - Analysis based on honeypot data may not reflect all real-world attack patterns")
print("  - Geographic attribution based on IP addresses may be inaccurate due to VPNs/proxies")
print("  - Some sophisticated attacks may evade regex-based detection")

print("\nFuture Work:")
print("  - Implement machine learning models for anomaly detection")
print("  - Correlate attack patterns with threat intelligence feeds")
print("  - Analyze payload contents for malware signatures")
print("  - Develop predictive models for attack forecasting")
print("  - Optimize regex patterns for improved performance on large datasets")
print("  - Implement real-time streaming analysis for live traffic")

print("\n" + "="*80)
print("END OF SECURITY ANALYSIS REPORT")
print(f"Report Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Analyst: Kyle Purves")
print("="*80)


DETAILED OBSERVATIONS AND ANALYSIS

 EXECUTIVE SUMMARY:
----------------------------------------
Dataset: Honeypot log containing 720,281 HTTP requests
Analysis Period: 2016-09-15 to 2019-10-31
Overall Threat Level: CRITICAL - 92.2% malicious traffic detected
Unique Attack Patterns: 10 different combinations identified

1. ATTACK PATTERN ANALYSIS:
----------------------------------------

Most Common Attack Combinations:
  - SQL Injection, PHP Attack: 228,536 occurrences (34.40% of all attacks)
  - SQL Injection, Windows Attack: 141,494 occurrences (21.30% of all attacks)
  - SQL Injection, PHP Attack, Windows Attack: 45,334 occurrences (6.82% of all attacks)
  - SQL Injection, PHP Attack, Command Injection: 15,288 occurrences (2.30% of all attacks)
  - SQL Injection, Windows Attack, Command Injection: 7,235 occurrences (1.09% of all attacks)
  - PHP Attack, Command Injection: 6,845 occurrences (1.03% of all attacks)
  - SQL Injection, PHP Attack, Windows Attack, NoSQL Injection: 5,59