# LSB Steganography and Computational Complexity
**Name:** Lucía Cantos Burgos  
**Course:** Cryptography - Theme 2  
**Lab:** LSB Steganography Lab


In [3]:
# ============================================================
# PART 1 — LSB DATASET GENERATOR
# Generates a dataset of noise images and hides ONE secret
# message using Least Significant Bit (LSB) steganography.
# ============================================================

# -----------------------------
# Imports
# -----------------------------
from PIL import Image
import numpy as np
import os
import random


# -----------------------------
# Global configuration
# -----------------------------
FOLDER = "dataset_images"     # Output folder for generated images
NUM_IMAGES = 100              # Number of images in the dataset
W, H = 200, 200               # Image dimensions (width x height)

# Secret message (customize with your name)
FLAG = "FLAG{LUCIACANTOSBURGOS}"

# End-of-message delimiter (16 bits, not valid ASCII)
END_DELIMITER = "1111111111111110"


# -----------------------------
# Image generation
# -----------------------------
def create_noise_image(filename: str, width: int = 200, height: int = 200):
    """
    Creates a PNG image with random RGB noise.

    Each pixel channel (R, G, B) is a random value in [0, 255].
    The image is saved using lossless PNG compression.
    """
    # Random RGB array: shape (H, W, 3)
    noise = np.random.randint(
        0, 256, size=(height, width, 3), dtype=np.uint8
    )

    img = Image.fromarray(noise, mode="RGB")
    img.save(filename, format="PNG")


# -----------------------------
# Text to binary conversion
# -----------------------------
def text_to_binary(message: str) -> str:
    """
    Converts a text string into its ASCII binary representation.

    Example:
        'A' -> '01000001'
    """
    return "".join(format(ord(char), "08b") for char in message)


# -----------------------------
# LSB steganography (hide data)
# -----------------------------
def hide_lsb(image_path: str, message: str):
    """
    Hides a secret message inside an image using LSB steganography.

    Steps:
    1. Convert message to binary and append end delimiter.
    2. Flatten the RGB image into a 1D array of channels.
    3. Replace the LSB of each channel with message bits.
    4. Save the modified image (overwrites original file).
    """
    img = Image.open(image_path).convert("RGB")
    data = np.array(img, dtype=np.uint8)

    # Message to binary + delimiter
    bits = text_to_binary(message) + END_DELIMITER

    # Flatten all RGB channels: [R, G, B, R, G, B, ...]
    flat = data.reshape(-1)

    # Capacity check
    if len(bits) > len(flat):
        raise ValueError("Message is too long for the image capacity.")

    # Insert bits into the LSB of each channel
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & 0xFE) | int(bit)

    # Reshape and save modified image
    stego_image = flat.reshape(data.shape)
    Image.fromarray(stego_image, mode="RGB").save(image_path, format="PNG")


# -----------------------------
# Dataset generation
# -----------------------------
def generate_dataset(
    folder: str,
    num_images: int,
    secret_message: str,
    width: int = 200,
    height: int = 200
):
    """
    Generates a dataset of noise images and hides the secret message
    in ONE randomly selected image.

    Important:
    - The index of the secret image is NOT revealed.
    - This simulates a real forensic scenario.
    """
    os.makedirs(folder, exist_ok=True)

    filenames = []

    # Generate noise images
    for i in range(num_images):
        filename = os.path.join(folder, f"img_{i:03d}.png")
        create_noise_image(filename, width=width, height=height)
        filenames.append(filename)

    # Randomly select one image to hide the message
    secret_image = random.choice(filenames)
    hide_lsb(secret_image, secret_message)


# -----------------------------
# Execute generator
# -----------------------------
generate_dataset(FOLDER, NUM_IMAGES, FLAG, width=W, height=H)

print(f"Dataset created: {NUM_IMAGES} images in '{FOLDER}/'")
print("One of them contains a hidden message. Good luck!")


Dataset created: 100 images in 'dataset_images/'
One of them contains a hidden message. Good luck!


## Generator Explanation

The generator creates a dataset of **N PNG images** containing random RGB noise, each with dimensions **W×H**.  
Then, **one image is randomly selected** and a secret message of the form `FLAG{...}` is hidden inside it using **LSB steganography**.

The hiding process works as follows:

- Each pixel has three color channels (R, G, B), and the **least significant bit (LSB)** of each channel is used to embed information.
- The secret message is converted to its **ASCII binary representation** (8 bits per character).
- A special **end-of-message delimiter** (`1111111111111110`) is appended to mark where the hidden message ends.
- For each bit of the message:
  - the LSB of the current color channel is cleared using `value & 0xFE`,
  - the message bit is then inserted using `value | bit`.
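
The clear-then-set step can be tried on a handful of values without touching any image file. The following is a minimal sketch, assuming an in-memory `uint8` array stands in for the flattened RGB channels; the helper names `embed_bits` and `read_bits` are illustrative and are not part of the lab code.

```python
import numpy as np


def embed_bits(channels: np.ndarray, bits: str) -> np.ndarray:
    """Return a copy of `channels` whose first len(bits) LSBs carry `bits`."""
    if len(bits) > channels.size:
        raise ValueError("Message is too long for the carrier.")
    out = channels.copy()
    payload = np.array([int(b) for b in bits], dtype=np.uint8)
    # Clear the LSB with & 0xFE, then set it to the message bit with |.
    out[:len(bits)] = (out[:len(bits)] & 0xFE) | payload
    return out


def read_bits(channels: np.ndarray, count: int) -> str:
    """Read the LSB of the first `count` channel values as a bitstring."""
    return "".join(str(v & 1) for v in channels[:count].tolist())


# Eight hypothetical channel values carrying the letter 'A' (01000001).
carrier = np.array([200, 13, 255, 0, 77, 128, 9, 64], dtype=np.uint8)
message_bits = format(ord("A"), "08b")
stego = embed_bits(carrier, message_bits)

print(stego.tolist())        # [200, 13, 254, 0, 76, 128, 8, 65]
print(read_bits(stego, 8))   # 01000001
```

No channel value changes by more than 1, which is why the modification is invisible in the rendered image.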

The index of the image containing the hidden message is **not revealed**, simulating a realistic forensic scenario where the analyst has no prior knowledge of which file contains the secret.


In [None]:
# ============================================================
# PART 2 — LSB FORENSIC DETECTOR
# Searches a dataset of images to detect hidden messages
# using Least Significant Bit (LSB) extraction.
# ============================================================

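# -----------------------------
# Constants
# -----------------------------
# Known message prefix. It must be defined before the search functions
# below, because default argument values are evaluated at definition time.
HEADER = "FLAG{"


# -----------------------------
# LSB extraction
# -----------------------------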
def extract_lsb(image_path: str, num_bits: int | None = None) -> str:
    """
    Extracts LSB bits from an RGB image.

    Args:
        image_path: Path to the PNG image.
        num_bits: Number of bits to extract.
                  If None, all available LSB bits are extracted.

    Returns:
        A string containing the extracted bits, e.g. "010101..."
    """
    img = Image.open(image_path).convert("RGB")
    data = np.array(img, dtype=np.uint8)

    # Flatten RGB channels: [R, G, B, R, G, B, ...]
    flat = data.reshape(-1)

    # Select how many bits to extract
    if num_bits is None:
        relevant = flat
    else:
        relevant = flat[:num_bits]

    # Extract the LSB of each channel (0 or 1)
    bits = (relevant & 1).astype(np.uint8)

    # Convert to bitstring
    return "".join(bits.astype(str))


# -----------------------------
# Binary to text conversion
# -----------------------------
def binary_to_text(bits: str, end_delimiter: str = END_DELIMITER) -> str:
    """
    Converts a binary string into ASCII text.

    The conversion stops when the end-of-message delimiter is found.

    Args:
        bits: Binary string extracted from LSBs.
        end_delimiter: Bit pattern indicating end of message.

    Returns:
        Decoded ASCII text.
    """
    # Stop at delimiter if present
    end_index = bits.find(end_delimiter)
    if end_index != -1:
        bits = bits[:end_index]

    characters = []

    # Convert each group of 8 bits to one ASCII character
    for i in range(0, len(bits) - 7, 8):
        byte = bits[i:i + 8]
        characters.append(chr(int(byte, 2)))

    return "".join(characters)


# -----------------------------
# Brute force search
# -----------------------------
def search_brute_force(folder: str, header: str = HEADER):
    """
    Searches for a hidden message using a brute force approach.

    For each image:
    - extracts ALL LSB bits,
    - reconstructs the full message,
    - checks if the header appears.

    Returns:
        (filename, message) if found,
        (None, None) otherwise.
    """
    files = sorted(
        f for f in os.listdir(folder) if f.lower().endswith(".png")
    )

    for filename in files:
        path = os.path.join(folder, filename)

        # Extract all bits and reconstruct full text
        bits = extract_lsb(path, num_bits=None)
        text = binary_to_text(bits)

        if header in text:
            return filename, text

    return None, None


# -----------------------------
# Optimized search (early termination)
# -----------------------------
def search_optimized(folder: str, header: str = HEADER):
    """
    Searches for a hidden message using an optimized approach.

    Steps:
    1. Extract only the bits required to check the header.
    2. If the header matches, extract the full message.
    3. Otherwise, discard the image and continue.

    Returns:
        (filename, message) if found,
        (None, None) otherwise.
    """
    files = sorted(
        f for f in os.listdir(folder) if f.lower().endswith(".png")
    )

    # Number of bits required to reconstruct the header
    header_bits_len = len(header) * 8  # e.g. "FLAG{" -> 5 * 8 = 40 bits

    for filename in files:
        path = os.path.join(folder, filename)

        # Step 1: extract only the prefix needed to check the header
        bits_prefix = extract_lsb(
            path,
            num_bits=header_bits_len + len(END_DELIMITER)
        )

        text_prefix = binary_to_text(
            bits_prefix,
            end_delimiter=END_DELIMITER
        )

        # Step 2: if header matches, extract full message
        if text_prefix.startswith(header):
            bits_all = extract_lsb(path, num_bits=None)
            full_text = binary_to_text(
                bits_all,
                end_delimiter=END_DELIMITER
            )
            return filename, full_text

    return None, None



## Detector Explanation (Part 2)

The detector analyzes all images in the dataset and extracts their **Least Significant Bits (LSB)** in order to reconstruct any hidden message.  
Since the image containing the secret is unknown, the detector must examine every file and determine whether it contains a valid message.

Two different search strategies are implemented:

---

### 1) Brute Force Approach

In the brute force strategy, the detector performs a complete extraction for every image:

- All LSB bits from the image are extracted.
- The full binary sequence is reconstructed into ASCII text.
- The reconstructed text is searched for the header pattern `"FLAG{"`.

This approach guarantees that the hidden message will be found if it exists, but it is computationally expensive because it processes the entire image regardless of whether it contains a message.

**Approximate complexity:**  
$$
O(N \cdot W \cdot H)
$$
where \(N\) is the number of images and \(W \times H\) is the image resolution.

---

### 2) Optimized Approach (Early Termination)

The optimized strategy reduces the amount of data processed by applying **early termination**:

- For each image, only the minimum number of bits required to reconstruct the header `"FLAG{"` (k bytes) is extracted.
- If the reconstructed prefix matches the header, the detector then extracts the full message from that image.
- If the header does not match, the image is immediately discarded and the detector moves on to the next file.

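Because the header (40 bits for `"FLAG{"`) is far smaller than the \(3 \cdot W \cdot H\) LSBs of an image, the bit-extraction work per image becomes constant instead of growing with the resolution. Note that `extract_lsb` still opens and fully decodes each PNG before slicing, so the bound below covers only the extraction and text-decoding steps.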

**Approximate complexity:**  
$$
O(N \cdot k)
\quad \text{with } k \ll W \cdot H
$$
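
The header check itself can be written so that it touches only the first \(8k\) channel values. This is a sketch under the assumption that the image has already been decoded into a flat `uint8` array; `header_matches` is a hypothetical helper, not one of the lab functions, and it uses `np.packbits` instead of building a bitstring.

```python
import numpy as np


def header_matches(channels: np.ndarray, header: str = "FLAG{") -> bool:
    """Compare only the first 8 * len(header) LSBs against the header bytes."""
    needed = 8 * len(header)
    if channels.size < needed:
        return False
    # Work proportional to the header length, not to the image size.
    prefix_bits = (channels[:needed] & 1).astype(np.uint8)
    # packbits is MSB-first by default, matching format(x, "08b").
    decoded = np.packbits(prefix_bits).tobytes()
    return decoded == header.encode("ascii")


# Build a small synthetic carrier and embed a demo message at the start.
rng = np.random.default_rng(0)
stego = rng.integers(0, 256, size=3 * 20 * 20, dtype=np.uint8)
bits = np.unpackbits(np.frombuffer(b"FLAG{demo}", dtype=np.uint8))
stego[:bits.size] = (stego[:bits.size] & 0xFE) | bits

blank = np.zeros(3 * 20 * 20, dtype=np.uint8)

print(header_matches(stego))   # True
print(header_matches(blank))   # False
```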

---

This comparison highlights how a simple algorithmic optimization can drastically improve performance without sacrificing correctness, which is the basis for the experimental and complexity analysis carried out in Part 3.


In [None]:
# ============================================================
# PART 3 — EXPERIMENTS: TIMING MEASUREMENTS (Brute vs Optimized)
# This cell generates multiple datasets and benchmarks two
# forensic search strategies:
#   1) Brute force: extract ALL LSBs for every image
#   2) Optimized (early termination): check only the header first
# Then it records real execution times and computes speedup.
# ============================================================

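# -----------------------------
# Imports
# -----------------------------
# PIL, numpy, os and random are imported in the Part 1 cell;
# time is needed here by time_once().
import time

# -----------------------------
# Constants / signatures
# -----------------------------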
END_DELIMITER = "1111111111111110"  # 16-bit end marker used by the generator
HEADER = "FLAG{"                    # Known prefix used to quickly identify a valid message


# ============================================================
# Dataset generation (noise images + 1 hidden message)
# ============================================================
def create_noise_image(filename: str, width: int, height: int) -> None:
    """
    Create a random RGB noise image and save it as PNG (lossless).

    Args:
        filename: output path
        width: image width in pixels
        height: image height in pixels
    """
    noise = np.random.randint(0, 256, size=(height, width, 3), dtype=np.uint8)
    Image.fromarray(noise, mode="RGB").save(filename, format="PNG")


def text_to_binary(message: str) -> str:
    """
    Convert a text string into an ASCII bitstring (8 bits per character).
    Example: 'A' -> '01000001'
    """
    return "".join(format(ord(ch), "08b") for ch in message)


def hide_lsb(image_path: str, message: str) -> None:
    """
    Hide a message into an image by replacing the LSB of RGB channels.

    Steps:
    - Convert message to bits and append END_DELIMITER
    - Flatten RGB channels into a 1D vector [R,G,B,R,G,B,...]
    - For each message bit:
        clear LSB with & 0xFE
        insert bit with | bit
    - Save image (overwrites file)

    Args:
        image_path: path to PNG image
        message: plaintext message to hide
    """
    img = Image.open(image_path).convert("RGB")
    data = np.array(img, dtype=np.uint8)
    flat = data.reshape(-1)

    bits = text_to_binary(message) + END_DELIMITER

    # Capacity check: one bit per channel value
    if len(bits) > len(flat):
        raise ValueError("Message is too long for the image capacity.")

    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & 0xFE) | int(bit)

    Image.fromarray(flat.reshape(data.shape), mode="RGB").save(image_path, format="PNG")


def generate_dataset(folder: str, N: int, W: int, H: int, flag: str, clean_folder: bool = True) -> None:
    """
    Generate a dataset of N noise images and hide the flag in exactly ONE image.

    Args:
        folder: output folder
        N: number of images
        W, H: resolution
        flag: secret message to hide
        clean_folder: if True, delete existing PNGs first (ensures exactly N files)
    """
    os.makedirs(folder, exist_ok=True)

    # Optional cleanup to make experiments reproducible and avoid mixing datasets
    if clean_folder:
        for f in os.listdir(folder):
            if f.lower().endswith(".png"):
                os.remove(os.path.join(folder, f))

    files = []
    for i in range(N):
        fname = os.path.join(folder, f"img_{i:03d}.png")
        create_noise_image(fname, W, H)
        files.append(fname)

    # Hide the message in ONE randomly chosen image (do not reveal which)
    secret = random.choice(files)
    hide_lsb(secret, flag)


# ============================================================
# Forensic detector (LSB extraction + two search strategies)
# ============================================================
def extract_lsb(image_path: str, num_bits: int | None = None) -> str:
    """
    Extract LSB bits from an image.

    Args:
        image_path: path to PNG image
        num_bits: number of bits to extract (None = all available bits)

    Returns:
        bitstring like "010101..."
    """
    img = Image.open(image_path).convert("RGB")
    data = np.array(img, dtype=np.uint8)
    flat = data.reshape(-1)

    relevant = flat if num_bits is None else flat[:num_bits]
    bits = (relevant & 1).astype(np.uint8)
    return "".join(bits.astype(str))


def binary_to_text(bits: str, end_delimiter: str = END_DELIMITER) -> str:
    """
    Convert a bitstring to ASCII text and stop at the end delimiter (if present).

    Args:
        bits: extracted bitstring
        end_delimiter: delimiter pattern; use "" to disable delimiter cutting

    Returns:
        decoded ASCII text
    """
    if end_delimiter:
        end_idx = bits.find(end_delimiter)
        if end_idx != -1:
            bits = bits[:end_idx]

    chars = []
    for i in range(0, len(bits) - 7, 8):
        chars.append(chr(int(bits[i:i + 8], 2)))
    return "".join(chars)


def search_brute_force(folder: str, header: str = HEADER):
    """
    Brute force search:
    - For each image: extract ALL bits, decode full text, search for header.

    Returns:
        (filename, message) or (None, None)
    """
    files = sorted(f for f in os.listdir(folder) if f.lower().endswith(".png"))

    for fname in files:
        path = os.path.join(folder, fname)
        bits = extract_lsb(path, num_bits=None)
        text = binary_to_text(bits, end_delimiter=END_DELIMITER)

        if header in text:
            return fname, text

    return None, None


def search_optimized(folder: str, header: str = HEADER):
    """
    Optimized search (early termination):
    - For each image: extract only enough bits for the header, decode prefix.
    - If header matches: extract full message from that image only.

    Returns:
        (filename, message) or (None, None)
    """
    files = sorted(f for f in os.listdir(folder) if f.lower().endswith(".png"))
    header_bits = len(header) * 8  # e.g., "FLAG{" -> 5 * 8 = 40 bits

    for fname in files:
        path = os.path.join(folder, fname)

        # Step 1: only decode the prefix needed to check the header
        bits_prefix = extract_lsb(path, num_bits=header_bits)
        text_prefix = binary_to_text(bits_prefix, end_delimiter="")  # no delimiter for prefix

        if text_prefix.startswith(header):
            # Step 2: decode full message only for the matching image
            bits_all = extract_lsb(path, num_bits=None)
            text_full = binary_to_text(bits_all, end_delimiter=END_DELIMITER)
            return fname, text_full

    return None, None


# ============================================================
# Benchmarking utilities
# ============================================================
def time_once(fn, *args, **kwargs):
    """
    Run a function once and return (elapsed_seconds, result).
    Uses time.perf_counter() for higher-resolution timing.
    """
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    t1 = time.perf_counter()
    return (t1 - t0), result


def bench_detector(folder: str, reps: int = 3):
    """
    Benchmark both brute force and optimized search.
    We repeat 'reps' times and take the minimum time to reduce noise.

    Returns:
        brute_best, opt_best, brute_result, opt_result
    """
    brute_times, opt_times = [], []
    brute_res, opt_res = (None, None), (None, None)

    for _ in range(reps):
        dt, res = time_once(search_brute_force, folder)
        brute_times.append(dt)
        brute_res = res
    brute_best = min(brute_times)

    for _ in range(reps):
        dt, res = time_once(search_optimized, folder)
        opt_times.append(dt)
        opt_res = res
    opt_best = min(opt_times)

    return brute_best, opt_best, brute_res, opt_res


# ============================================================
# Run experiments (as required by the lab table)
# ============================================================
FLAG = "FLAG{LUCIACANTOSBURGOS}"  # customize if needed
BASE = "bench_datasets"

configs = [
    (10, 200, 200),
    (50, 200, 200),
    (100, 200, 200),
    (100, 500, 500),
    (100, 1000, 1000),
]

results = []

for (N, W, H) in configs:
    folder = os.path.join(BASE, f"N{N}_W{W}_H{H}")

    print(f"\n=== Generating dataset: N={N}, {W}x{H} ===")
    generate_dataset(folder, N, W, H, FLAG, clean_folder=True)

    print("    Benchmarking brute force vs optimized...")
    brute_t, opt_t, brute_res, opt_res = bench_detector(folder, reps=3)

    results.append({
        "N": N,
        "WxH": f"{W}x{H}",
        "BruteForce_sec": brute_t,
        "Optimized_sec": opt_t,
        "Speedup": (brute_t / opt_t) if opt_t > 0 else None
    })

# Print raw results list (useful for debugging / reporting).
# A bare `results` expression only displays when it is the last line of a cell.
print(results)


# ============================================================
# Pretty-print summary (easy to copy into the Markdown table)
# ============================================================
print("\nSummary:")
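for r in results:
    # Speedup is None when the optimized time was measured as 0
    speedup = f"x{r['Speedup']:.1f}" if r["Speedup"] is not None else "n/a"
    print(
        f"{r['N']:>4} | {r['WxH']:<9} | "
        f"brute={r['BruteForce_sec']:.4f}s | "
        f"opt={r['Optimized_sec']:.4f}s | "
        f"{speedup}"
    )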


## Timing Results

| N (images) | W x H | Brute Force (sec) | Optimized (sec) | Speedup |
|------------|-------|-------------------|-----------------|---------|
| 10  | 200x200   | 0.1077 | 0.0038 | ×28.6 |
| 50  | 200x200   | 1.0544 | 0.0209 | ×50.3 |
| 100 | 200x200   | 0.6987 | 0.0392 | ×17.8 |
| 100 | 500x500   | 12.2824 | 0.4382 | ×28.0 |
| 100 | 1000x1000 | 13.9528 | 1.7922 | ×7.8 |

Both searches return at the first match, so each timing depends on where the randomly chosen stego image falls in the sorted file list. This explains why the brute-force time for N = 50 (1.0544 s) is higher than the one for N = 100 at the same resolution (0.6987 s).


## 4.2.1 Question 1: Theoretical Complexity

Let:
- \(N\) be the number of images,
- \(W, H\) the image dimensions,
- \(k\) the size of the header in bytes (e.g., \(k = 5\) for `"FLAG{"`).

---

### (a) Brute Force Complexity

In the brute force approach, the detector processes **every pixel of every image**, extracting the LSB of each color channel (R, G, and B) and reconstructing the complete message regardless of whether the image contains hidden data.

The time complexity is therefore:

$$
O(N \cdot W \cdot H)
$$

More precisely, the detector processes \(3 \cdot W \cdot H\) channels per image, but the constant factor 3 does not affect the asymptotic complexity.

---

### (b) Optimized Complexity (Early Termination)

In the optimized approach, only the **minimum number of bits required to reconstruct the header** is extracted from each image. Since the header has size \(k\) bytes, this corresponds to \(8k\) bits per image.

The resulting complexity is:

$$
O(N \cdot k)
$$

If the header is detected, the algorithm performs **one additional full extraction** for the matching image only, yielding a total cost of:

$$
O(N \cdot k + W \cdot H)
$$

The second term applies to a single image and does not change the overall linear behavior with respect to \(N\).

---

### (c) Theoretical Speedup for \(W = H = 1000\) and \(k = 5\)

For the brute force approach, the number of processed bits per image is approximately:

$$
W \cdot H \cdot 3 = 1000 \cdot 1000 \cdot 3 = 3{,}000{,}000
$$

For the optimized approach, only the header bits are processed:

$$
8k = 8 \cdot 5 = 40
$$

The theoretical speedup factor is therefore:

$$
\frac{W \cdot H \cdot 3}{8k}
= \frac{3{,}000{,}000}{40}
= 75{,}000
$$

This means that, **in terms of bit extraction work**, the optimized approach can theoretically be up to **75,000× faster** than brute force.

In practice, the observed speedup is lower due to fixed costs such as image loading and PNG decoding, but the asymptotic advantage of the optimized algorithm remains clear.
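
The gap between the predicted and the observed speedup can be made explicit with a few lines of arithmetic. The sketch below takes the measured values from the 100-image, 1000x1000 row of the timing table; the function name `theoretical_speedup` is illustrative.

```python
def theoretical_speedup(width: int, height: int, header_bytes: int,
                        channels: int = 3) -> float:
    """Ratio of LSBs read by brute force to LSBs read by the header check."""
    full_bits = channels * width * height
    header_bits = 8 * header_bytes
    return full_bits / header_bits


predicted = theoretical_speedup(1000, 1000, 5)

# Measured times (seconds) from the timing table, N = 100, 1000x1000.
brute_seconds = 13.9528
optimized_seconds = 1.7922
measured = brute_seconds / optimized_seconds

print(predicted)            # 75000.0
print(round(measured, 1))   # 7.8
```

The predicted figure counts only bits read, while the measured one also includes opening and decoding every PNG, which both strategies pay in full.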


## 4.2.2 Question 2: Bottlenecks

### (a) Which operation takes more time: I/O or CPU?

The experimental results show that the **main bottleneck is image I/O**, specifically the cost of reading and decoding PNG images.  
For large images, this cost is approximately **15 times higher** than the cost of extracting LSB bits once the image data is already loaded into memory.

This indicates that the dominant factor is not the bit manipulation itself, but the overhead associated with file access and image decoding.

---

### (b) How can this be verified empirically?

To verify this, the total execution time was decomposed into two independent measurements:

1. **I/O and decoding time**: opening a PNG image and converting it into a NumPy array.
2. **CPU processing time**: extracting the LSB bits from an image array already resident in memory.

By measuring these two operations separately and comparing their execution times, it becomes clear that the I/O and decoding stage dominates the overall runtime, especially for high-resolution images.
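
One way to take the two measurements is sketched below. It is an assumption-laden illustration, not the measurement code used for this report: the PNG is kept in memory (`io.BytesIO`), so the first figure captures decoding cost only and excludes disk reads, and the helper names `make_png_bytes` and `time_stages` are hypothetical.

```python
import io
import time

import numpy as np
from PIL import Image


def make_png_bytes(width: int, height: int, seed: int = 0) -> bytes:
    """Encode a random RGB noise image as PNG and return the raw bytes."""
    rng = np.random.default_rng(seed)
    noise = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)
    buf = io.BytesIO()
    Image.fromarray(noise).save(buf, format="PNG")
    return buf.getvalue()


def time_stages(png_bytes: bytes, reps: int = 5):
    """Return (best decode time, best LSB-mask time, number of LSBs)."""
    decode_times, lsb_times = [], []
    count = 0
    for _ in range(reps):
        t0 = time.perf_counter()
        # Stage 1: PNG decoding into a NumPy array.
        img = Image.open(io.BytesIO(png_bytes)).convert("RGB")
        arr = np.array(img, dtype=np.uint8)
        t1 = time.perf_counter()
        # Stage 2: LSB extraction on data already in memory.
        lsbs = arr.reshape(-1) & 1
        t2 = time.perf_counter()
        decode_times.append(t1 - t0)
        lsb_times.append(t2 - t1)
        count = lsbs.size
    return min(decode_times), min(lsb_times), count


decode_t, lsb_t, n = time_stages(make_png_bytes(500, 500))
print(f"decode: {decode_t * 1e3:.2f} ms | LSB mask over {n} values: {lsb_t * 1e3:.3f} ms")
```

Timing a read from disk in a third stage would separate storage latency from decoding, at the cost of results that depend on the operating system's file cache.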

---

### (c) If the bottleneck is I/O, would a faster processor help?

Not significantly.  
Since the bottleneck lies in I/O operations and image decoding rather than pure computation, increasing CPU speed has a limited impact on performance.

More effective optimizations would include:
- using faster storage (e.g., SSDs),
- reducing the amount of data read from disk,
- or parallelizing image loading and decoding.

---

Although the theoretical analysis predicts speedups of several orders of magnitude, the experimental results show that the **actual performance gain is constrained by I/O and decoding costs**. This highlights the important distinction between asymptotic algorithmic complexity and real-world performance in practical systems.


## 4.2.3 Question 3: Scalability

### (a) If you had 1 million images, how long would the optimized algorithm take?

To estimate the execution time, we use the **measured average time per image** of the optimized approach.

If the optimized search takes \(T_{100}\) seconds for \(N = 100\) images, the average time per image is approximately:

$$
t_{\text{img}} \approx \frac{T_{100}}{100}
$$

Assuming linear scalability with respect to \(N\), which is expected for this algorithm, the estimated time for one million images is:

$$
T_{1{,}000{,}000} \approx 1{,}000{,}000 \cdot t_{\text{img}}
$$

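Using the optimized timings measured in this lab for \(N = 100\), the estimate depends strongly on resolution: \(T_{100} = 0.0392\) s at 200x200 gives \(t_{\text{img}} \approx 0.39\) ms and about **392 s (roughly 6.5 minutes)** for one million images, while \(T_{100} = 1.7922\) s at 1000x1000 gives about **17,900 s (roughly 5 hours)**. Both figures likely underestimate a full scan, because the measured searches stopped at the first match instead of processing all 100 files.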
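
The same linear extrapolation can be applied to every 100-image row of the timing table. This is a small sketch that assumes perfectly linear scaling in \(N\); the dictionary simply copies the optimized times from the table, and `extrapolate` is an illustrative name.

```python
# Optimized search times (seconds) for N = 100, copied from the timing table.
OPTIMIZED_T100 = {
    "200x200": 0.0392,
    "500x500": 0.4382,
    "1000x1000": 1.7922,
}


def extrapolate(t_measured: float, measured_images: int = 100,
                target_images: int = 1_000_000) -> float:
    """Scale a measured time linearly to a larger number of images."""
    per_image = t_measured / measured_images
    return per_image * target_images


estimates = {res: extrapolate(t) for res, t in OPTIMIZED_T100.items()}

for res, seconds in estimates.items():
    print(f"{res:>9}: {seconds:8.0f} s = {seconds / 60:6.1f} min")
```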

---

### (b) How could multiprocessing be used to speed up the search?

The workload can be parallelized by dividing the list of images into \(P\) independent blocks and assigning each block to a separate process.

Each process runs the optimized detector on its subset of images.  
When one process finds the hidden message, the remaining processes can be stopped using synchronization mechanisms such as shared flags, queues, or events.

This strategy is well suited to the problem because each image can be analyzed independently.
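
A minimal sketch of this block-wise scheme is shown below, using `concurrent.futures` from the standard library. It makes several assumptions: the names `header_in_file`, `scan_block` and `parallel_search` are hypothetical, blocks are interleaved rather than contiguous, and pending blocks are cancelled once a match is found instead of signalling running workers through a shared flag or event.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

import numpy as np
from PIL import Image

HEADER = b"FLAG{"


def header_in_file(path: str) -> bool:
    """Return True if the first LSBs of the image at `path` spell HEADER."""
    img = Image.open(path).convert("RGB")
    flat = np.array(img, dtype=np.uint8).reshape(-1)
    needed = 8 * len(HEADER)
    if flat.size < needed:
        return False
    return np.packbits(flat[:needed] & 1).tobytes() == HEADER


def scan_block(paths):
    """Worker: scan one block sequentially; return the first match or None."""
    for path in paths:
        if header_in_file(path):
            return path
    return None


def parallel_search(paths, workers: int = 4, executor_cls=ProcessPoolExecutor):
    """Split `paths` into interleaved blocks and scan them concurrently."""
    blocks = [paths[i::workers] for i in range(workers)]
    blocks = [block for block in blocks if block]
    with executor_cls(max_workers=workers) as pool:
        futures = [pool.submit(scan_block, block) for block in blocks]
        for future in as_completed(futures):
            hit = future.result()
            if hit is not None:
                # Cancel blocks that have not started yet; blocks that are
                # already running finish on their own.
                for other in futures:
                    other.cancel()
                return hit
    return None
```

On platforms that start worker processes with the spawn method, the worker functions must be importable, so they should live in a module rather than a notebook cell, and `parallel_search` should be called under `if __name__ == "__main__":`.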

---

### (c) What would be the complexity with \(P\) processors?

In an ideal scenario with perfect load balancing and no overhead, the time complexity would be reduced to:

$$
O\left(\frac{N \cdot k}{P}\right)
$$

However, in practice, this ideal speedup is limited by:
- process creation and synchronization overhead,
- contention for disk I/O,
- and image decoding costs.

As a result, the actual speedup is sublinear, especially when the storage subsystem becomes the dominant bottleneck.
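
One common way to reason about this sublinear behavior is Amdahl's law, which is not part of the lab statement and is brought in here only as an illustration. The sketch assumes that a fraction of the runtime parallelizes perfectly while the rest (for example, contended disk access) stays serial; the value 0.9 is an arbitrary example, not a measurement.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Hypothetical case: 90% of the work parallelizes, 10% stays serial.
for p in (1, 2, 4, 8, 16):
    print(f"P={p:>2}: speedup = {amdahl_speedup(0.9, p):.2f}")
```

Under this assumption, 16 processors give a speedup of about 6.4 rather than 16, and no number of processors can exceed a factor of 10.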


## 4.2.4 Question 4: Steganography Security

### (a) If the attacker does NOT use a predictable header such as "FLAG{", can the search be automated?
It becomes much harder.
With a header, detection is a fast pattern-matching problem.
Without one, there is no easy way to tell apart:
- random LSB bits (natural or synthetic noise)
- bits of an encrypted message (which also look random)

The search can still be automated, but it no longer relies on a signature ("FLAG{"); it relies on **statistical detection** (steganalysis), which produces false positives and false negatives.

### (b) How can "random noise" be distinguished from an "encrypted message"?
A well-encrypted message produces approximately uniform bits (50/50), just like noise.
A simple count of 0s and 1s may therefore fail to separate them.

Strategies:
- look for structure (delimiters, length fields, redundancy), if any exists
- apply finer statistical tests (chi-square, RS steganalysis, local correlations)
- compare against the expected model of a "natural" image (not pure noise)

In this lab the images are *random noise*, so statistical detection is even harder; this is why the header is key.

### (c) What can the attacker do to make detection harder?
- Embed bits at pseudo-random positions derived from a key (not sequentially)
- Encrypt and compress the message before embedding (fewer patterns)
- Use LSB matching (randomly adjust the value by ±1 instead of overwriting the bit)
- Spread the payload over fewer pixels or over specific channels
- Switch domain (e.g. JPEG/DCT) instead of direct LSB embedding in PNG
- Use adaptive techniques (embed more in textured regions, where changes are less noticeable)
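
The first countermeasure, keyed pseudo-random positions, can be sketched as follows. This is an illustration under stated assumptions: the integer `key` seeds NumPy's `default_rng`, which is not a cryptographically secure generator, so a real tool would derive positions from a proper keyed CSPRNG. The function names are hypothetical.

```python
import numpy as np


def keyed_positions(key: int, carrier_size: int, count: int) -> np.ndarray:
    """Derive `count` distinct channel indices from an integer key."""
    rng = np.random.default_rng(key)  # NOT cryptographically secure
    return rng.permutation(carrier_size)[:count]


def embed_keyed(channels: np.ndarray, bits: np.ndarray, key: int) -> np.ndarray:
    """Write `bits` into the LSBs at key-dependent positions."""
    if bits.size > channels.size:
        raise ValueError("Message is too long for the carrier.")
    out = channels.copy()
    pos = keyed_positions(key, channels.size, bits.size)
    out[pos] = (out[pos] & 0xFE) | bits
    return out


def extract_keyed(channels: np.ndarray, count: int, key: int) -> np.ndarray:
    """Read `count` LSBs back from the same key-dependent positions."""
    pos = keyed_positions(key, channels.size, count)
    return channels[pos] & 1


rng = np.random.default_rng(1)
carrier = rng.integers(0, 256, size=600, dtype=np.uint8)
message = b"FLAG{x}"
bits = np.unpackbits(np.frombuffer(message, dtype=np.uint8))

stego = embed_keyed(carrier, bits, key=1234)
recovered = np.packbits(extract_keyed(stego, bits.size, key=1234)).tobytes()
print(recovered)   # b'FLAG{x}'
```

With this layout the sequential header check used by the detector in this lab no longer applies, because the header bits are not stored in the first 40 channel values.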


In [None]:
FOLDER = "dataset_images"

def load_rgb_array(path: str) -> np.ndarray:
    img = Image.open(path).convert("RGB")
    return np.array(img, dtype=np.uint8)

def flat_channels(arr: np.ndarray) -> np.ndarray:
    # returns a vector [R,G,B,R,G,B,...]
    return arr.reshape(-1)

def lsb_bits(flat: np.ndarray) -> np.ndarray:
    return (flat & 1).astype(np.uint8)

def chi_square_lsb_01(flat: np.ndarray) -> float:
    bits = lsb_bits(flat)
    n = bits.size
    o1 = int(bits.sum())
    o0 = n - o1
    e = n / 2.0
    # no division by zero: n > 0 for any non-empty image
    chi2 = ((o0 - e)**2)/e + ((o1 - e)**2)/e
    return float(chi2)

def lsb_bias(flat: np.ndarray) -> float:
    bits = lsb_bits(flat)
    return float(bits.mean() - 0.5)  # deviation from 0.5

def pov_score_channel(values_1d: np.ndarray) -> float:
    hist = np.bincount(values_1d, minlength=256).astype(np.int64)
    diff_sum = 0
    total = 0
    for k in range(128):
        a = hist[2*k]
        b = hist[2*k + 1]
        diff_sum += abs(int(a - b))
        total += int(a + b)
    if total == 0:
        return 0.0
    # 0 -> very unequal pair counts, 1 -> nearly equal pair counts
    return float(1.0 - (diff_sum / total))

def pov_score_rgb(arr: np.ndarray) -> float:
    r = arr[:,:,0].reshape(-1)
    g = arr[:,:,1].reshape(-1)
    b = arr[:,:,2].reshape(-1)
    return float((pov_score_channel(r) + pov_score_channel(g) + pov_score_channel(b)) / 3.0)


def block_smoothness(block: np.ndarray) -> int:
    # block shape: (bh, bw) uint8
    # sum of absolute horizontal + vertical differences
    block = block.astype(np.int16)
    dh = np.abs(block[:, 1:] - block[:, :-1]).sum()
    dv = np.abs(block[1:, :] - block[:-1, :]).sum()
    return int(dh + dv)

def flip_lsb(block: np.ndarray) -> np.ndarray:
    return (block ^ 1).astype(np.uint8)  # toggle the LSB

def rs_proxy_score(arr: np.ndarray, block_size: int = 8) -> float:
    """
    Score: fraction of blocks where flipping the LSB increases the value
    returned by block_smoothness() (sum of absolute neighbor differences).
    Higher values -> more suspicious.
    """
    # work on a simple luminance (channel average) to keep things simple
    Y = arr.mean(axis=2).astype(np.uint8)
    H, W = Y.shape
    bs = block_size
    h_blocks = H // bs
    w_blocks = W // bs

    if h_blocks == 0 or w_blocks == 0:
        return 0.0

    inc = 0
    total = 0

    for i in range(h_blocks):
        for j in range(w_blocks):
            block = Y[i*bs:(i+1)*bs, j*bs:(j+1)*bs]
            s0 = block_smoothness(block)
            s1 = block_smoothness(flip_lsb(block))
            if s1 > s0:
                inc += 1
            total += 1

    return float(inc / total)


def zscore(x: np.ndarray) -> np.ndarray:
    mu = x.mean()
    sigma = x.std()
    if sigma == 0:
        return np.zeros_like(x)
    return (x - mu) / sigma

def analyze_folder(folder: str):
    files = sorted([f for f in os.listdir(folder) if f.lower().endswith(".png")])

    chi2_list = []
    bias_list = []
    pov_list = []
    rs_list = []

    for fname in files:
        path = os.path.join(folder, fname)
        arr = load_rgb_array(path)
        flat = flat_channels(arr)

        chi2_list.append(chi_square_lsb_01(flat))
        bias_list.append(abs(lsb_bias(flat)))
        pov_list.append(pov_score_rgb(arr))
        rs_list.append(rs_proxy_score(arr, block_size=8))

    chi2 = np.array(chi2_list, dtype=float)
    bias = np.array(bias_list, dtype=float)
    pov  = np.array(pov_list, dtype=float)
    rs   = np.array(rs_list, dtype=float)

    # Normalize with z-scores so the metrics can be combined
    z_chi2 = zscore(chi2)
    z_bias = zscore(bias)
    z_pov  = zscore(pov)
    z_rs   = zscore(rs)

    # Combined score (weights are adjustable; PoV and RS get more weight here)
    combined = 0.2*z_chi2 + 0.1*z_bias + 0.4*z_pov + 0.3*z_rs

    # Ranking (higher = more suspicious)
    order = np.argsort(-combined)

    ranked = []
    for idx in order:
        ranked.append({
            "file": files[idx],
            "combined": float(combined[idx]),
            "chi2": float(chi2[idx]),
            "bias": float(bias[idx]),
            "pov": float(pov[idx]),
            "rs": float(rs[idx]),
        })
    return ranked

ranked = analyze_folder(FOLDER)

print("Top 10 suspects (highest combined score):\n")
for r in ranked[:10]:
    print(f"{r['file']:<12} combined={r['combined']:+.3f} | chi2={r['chi2']:.2f} bias={r['bias']:.5f} pov={r['pov']:.5f} rs={r['rs']:.5f}")



## CONCLUSIONS

This section analyzes the results of applying several statistical steganalysis techniques to the image set, with no prior knowledge of the message format and no identifiable header. The goal is to assess whether **suspicious images can be prioritized** using only statistical anomalies in the LSBs.

---

### 1. Overview of the ranking

A ranking of suspicious images was built by combining **four separate statistical metrics**:

- **χ²** test on the distribution of LSB values (0/1)
- LSB **bias** (deviation from 50%)
- **Pair-of-Values (PoV)** on the pairs (2k, 2k+1)
- **RS proxy** based on block-level analysis

No single metric dominates the ranking, yet some images still stand out clearly. This indicates that the combined approach is **robust and does not depend on a single test**.

The most suspicious image is:

- **img_042.png**, with a combined score of **+1.866**

The ranking also shows a **gradual decline in score**, which suggests a coherent statistical structure rather than random noise.
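
As a minimal sketch of the combination step, the snippet below z-scores four metric vectors and mixes them with the same weights used in `analyze_folder` (0.2, 0.1, 0.4, 0.3). The metric values are made up for illustration and are not results from this lab.

```python
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    """Standardize a vector; return zeros if it has no spread."""
    sigma = x.std()
    if sigma == 0:
        return np.zeros_like(x)
    return (x - x.mean()) / sigma

# Hypothetical metric values for four images (illustrative only)
chi2 = np.array([0.4, 5.9, 1.2, 2.0])
bias = np.array([0.0010, 0.0035, 0.0015, 0.0020])
pov = np.array([0.955, 0.959, 0.956, 0.957])
rs = np.array([0.49, 0.55, 0.50, 0.52])

# Same weights as in analyze_folder
combined = (0.2 * zscore(chi2) + 0.1 * zscore(bias)
            + 0.4 * zscore(pov) + 0.3 * zscore(rs))

# Indices sorted from most to least suspicious
order = np.argsort(-combined)
print(order, combined.round(3))
```

Because every z-score has zero mean over the dataset, the combined score is relative: it measures how far an image sits from the rest of the folder, not an absolute probability that it carries a message.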

---

### 2. Analysis by metric

#### 2.1 Chi-Square (LSB 0/1)

Observations:
- The χ² values range from roughly **0.36 to 5.96**.
- Some images (such as *img_042.png* and *img_076.png*) show noticeably higher χ² values.

Interpretation:
- Since the base images are random noise, extreme deviations are not expected.
- However, some images deviate more than the average, which may indicate a systematic alteration of the LSBs.

Conclusion:
> The χ² test is not conclusive on its own, but it provides a useful signal when combined with other metrics.
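
To put the χ² values in context, the sketch below converts a statistic into a p-value. It assumes that `chi_square_lsb_01` is a standard Pearson test on the two LSB counts, which gives 1 degree of freedom; that helper is defined earlier in the notebook, so this is an assumption about its form. Under that assumption the upper-tail probability has the closed form erfc(sqrt(x/2)).

```python
import math

def chi2_pvalue_1dof(stat: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Smallest and largest values reported above, plus the usual 5% critical value
for stat in (0.36, 3.84, 5.96):
    print(f"chi2={stat:.2f} -> p={chi2_pvalue_1dof(stat):.4f}")

# Expected number of clean images above the 5% threshold, assuming
# NUM_IMAGES = 100 as in the Part 1 configuration
n_images = 100
expected_false_alarms = n_images * chi2_pvalue_1dof(3.84)
print(f"expected false alarms: {expected_false_alarms:.1f}")
```

Under the same assumption, about 5 of the 100 noise images would be expected to exceed 3.84 purely by chance, which is one reason the test is not conclusive in isolation.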

---

#### 2.2 LSB bias (deviation from 50%)

Observations:
- The bias values are small (≈ 0.001 – 0.0035), as expected.
- The top-ranked images tend to show a slightly higher bias.

Interpretation:
- The hidden message is small compared with the total number of pixels, so the global deviation is small.
- Even so, the bias makes it possible to detect **systematic deviations**, not purely random ones.

Conclusion:
> Global bias is a weak detector in isolation, but it is consistent when analyzed comparatively.
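
A rough calculation helps to judge how small these bias values are. The sketch below assumes that `lsb_bias` returns the fraction of 1-bits minus 0.5 and that the analyzed images are 200×200 RGB, as in the Part 1 configuration; both are assumptions about code outside this section. It compares the sampling noise of the bias in a clean noise image with the shift the payload could introduce.

```python
import math

# Values taken from the Part 1 configuration
W, H, CHANNELS = 200, 200, 3
FLAG = "FLAG{LUCIACANTOSBURGOS}"
END_DELIMITER = "1111111111111110"

n_lsb = W * H * CHANNELS                          # LSBs available per image
payload_bits = 8 * len(FLAG) + len(END_DELIMITER)

# Standard deviation of (fraction of ones - 0.5) for n_lsb fair random bits
sigma = 0.5 / math.sqrt(n_lsb)
# Mean of |bias| for a zero-mean normal variable with that standard deviation
expected_abs_bias = sigma * math.sqrt(2.0 / math.pi)
# Expected shift in the extreme case where every payload bit has the same value
max_expected_shift = 0.5 * payload_bits / n_lsb

print(f"LSBs per image:      {n_lsb}")
print(f"payload bits:        {payload_bits}")
print(f"noise sigma:         {sigma:.5f}")
print(f"expected |bias|:     {expected_abs_bias:.5f}")
print(f"max expected shift:  {max_expected_shift:.5f}")
```

With these numbers the payload is 200 bits out of 120000 LSBs, the noise standard deviation is about 0.00144, and the extreme-case shift is about 0.00083, so the shift is smaller than the noise. The reported range of 0.001 to 0.0035 is therefore of the same order as the noise floor, consistent with bias being a weak detector here.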

---

#### 2.3 Pair-of-Values (PoV)

Observations:
- The PoV values are concentrated in a high range (~0.955–0.959).
- The images with the highest combined score tend to show slightly higher PoV values.

Interpretation:
- LSB replacement tends to equalize the frequencies of each even value (2k) and its odd neighbour (2k+1).
- Even in noise images, sequential embedding introduces additional, detectable equalization.

Conclusion:
> PoV turns out to be one of the most discriminative metrics of the set, and it is especially well suited to detecting LSB replacement.
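
The equalization effect can be illustrated with one common formulation of the PoV test: a χ² statistic comparing each even-value count with the mean of its pair. This is a sketch and not necessarily what `pov_score_rgb` computes; the function name `pov_chi2` and the artificial all-even cover are assumptions made for the example. Note the orientation: here a small statistic means the pairs are equalized.

```python
import numpy as np

def pov_chi2(values: np.ndarray) -> float:
    """Chi-square distance between even-value counts and their pair means."""
    hist = np.bincount(values.ravel(), minlength=256).astype(float)
    even, odd = hist[0::2], hist[1::2]
    expected = (even + odd) / 2.0      # count expected if the pair is equalized
    valid = expected > 0               # skip pairs that never occur
    return float((((even - expected) ** 2) / expected)[valid].sum())

rng = np.random.default_rng(0)

# Artificial cover with strongly unbalanced pairs: only even values occur
cover = rng.integers(0, 256, size=(200, 200, 3), dtype=np.uint8) & 0xFE

# Full LSB replacement with random message bits
bits = rng.integers(0, 2, size=cover.shape, dtype=np.uint8)
stego = (cover & 0xFE) | bits

print(f"cover: {pov_chi2(cover):.1f}")
print(f"stego: {pov_chi2(stego):.1f}")
```

In a cover that is already uniform noise the pairs are balanced before embedding, so the drop seen in this artificial example largely disappears.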

---

#### 2.4 RS proxy (block-level analysis)

Observations:
- The RS values lie roughly between **0.49 and 0.55**.
- Suspicious images tend towards slightly higher values (>0.52).

Interpretation:
- LSB flipping introduces local perturbations that affect the regularity of the blocks.
- This analysis captures **local structural anomalies**, complementing the global metrics.

Conclusion:
> The RS proxy contributes relevant local information that global tests such as χ² or bias do not capture.
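
For comparison, classical RS analysis flips only the pixels selected by a mask and counts regular groups (variation increases) and singular groups (variation decreases), for both the normal flip (2k ↔ 2k+1) and the shifted flip (2k−1 ↔ 2k). The sketch below is a simplified version of that counting step, not the proxy used in this notebook; the function name `rs_counts`, the mask (0, 1, 1, 0) and the row-wise grouping are assumptions chosen for the example.

```python
import numpy as np

def variation(groups: np.ndarray) -> np.ndarray:
    """Sum of absolute differences between neighbours, per group."""
    return np.abs(np.diff(groups, axis=1)).sum(axis=1)

def rs_counts(channel: np.ndarray, mask=(0, 1, 1, 0)) -> dict:
    """Fractions of regular/singular groups for the positive and negative flip."""
    m = np.array(mask, dtype=bool)
    n = m.size
    flat = channel.astype(np.int16).ravel()
    groups = flat[: flat.size // n * n].reshape(-1, n)
    f0 = variation(groups)

    pos = groups.copy()
    pos[:, m] = pos[:, m] ^ 1                # F1: 2k <-> 2k+1
    neg = groups.copy()
    neg[:, m] = ((neg[:, m] + 1) ^ 1) - 1    # F-1: 2k-1 <-> 2k

    fp, fn = variation(pos), variation(neg)
    return {
        "R_M": float((fp > f0).mean()), "S_M": float((fp < f0).mean()),
        "R_-M": float((fn > f0).mean()), "S_-M": float((fn < f0).mean()),
    }

rng = np.random.default_rng(0)
noise = rng.integers(0, 256, size=(200, 200), dtype=np.uint8)
print(rs_counts(noise))
```

In RS analysis the signal comes from the gap between the positive-mask and negative-mask counts, which is expected to be small for a clean image and to widen as more LSBs are replaced.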

---

### 3. Notable case: img_042.png

The image **img_042.png** stands out consistently across all metrics:

- Highest χ² value
- Highest LSB bias
- High PoV
- RS proxy consistent with LSB embedding

This indicates that the image does not stand out because of a single isolated test, but because of a **coherent accumulation of statistical evidence**, which is characteristic of a sound steganalysis approach.

---

### 4. Relation to the “Detection Without Header” scenario

These results reflect a realistic forensic analysis scenario:

- Without a known header, the message cannot be identified deterministically.
- Detection becomes a **probabilistic classification** problem, not a decoding problem.
- The goal is to **prioritize suspicious images**, reducing the search space.

> In the absence of a known signature, automatic detection becomes a statistical classification problem, where several weak tests combined produce a robust detector.

---

### 5. Limitations of the approach

- The analyzed images are **random noise**, not natural images.
- In pure noise, the LSB distributions are already close to uniform.
- This reduces the statistical separability between images with and without a message.
- In real scenarios with natural images, these techniques are usually more effective.

---

### 6. Final conclusion

The results show that, even without knowing the message format or a predictable header, suspicious images can be identified through statistical steganalysis.  
Although no single metric is sufficient on its own, combining global tests (χ², bias), histogram-based tests (PoV) and local tests (RS proxy) yields a coherent ranking that prioritizes the images most likely to contain hidden information.  

This approach closely mirrors real forensic analysis, where detection is probabilistic and relies on aggregating multiple pieces of statistical evidence.