# Vector-Payload Dissociation Demo

This notebook demonstrates **Vector-Payload Dissociation**, a steganographic technique for concealing sensitive data in a vector database by storing an obfuscated embedding of that data alongside an unrelated, benign payload.

## What is Vector-Payload Dissociation?

Vector-Payload Dissociation is a technique in which:
1. **Sensitive content** is embedded as a vector, and the vector is disguised with steganographic obfuscation
2. **Benign decoy content** is created to serve as the visible payload
3. The **sensitive vector is paired with the benign payload** in the database
4. Anyone browsing the database sees only innocuous content, while the vector is derived from the hidden sensitive content

## Prerequisites

- Qdrant running locally at `http://localhost:6333`
- OpenAI API key configured
- VectorSmuggle framework installed

## Workflow Overview

```mermaid
flowchart TD
    A[Sensitive Financial Report] --> B[Create Steganographic Embedding]
    B --> C[Apply Obfuscation Techniques]
    D[Generate Benign Decoy] --> E[Company Potluck Email]
    C --> F[Pair Sensitive Vector with Benign Payload]
    E --> F
    F --> G[Upload to Qdrant]
    G --> H[View in Dashboard - Only Sees Innocent Content]
    G --> I[Recover Hidden Data with Proper Tools]
```

## Step 1: Setup and Imports

Import all necessary modules and establish connections.

In [None]:
import os
import json
import numpy as np
from datetime import datetime
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# VectorSmuggle framework imports
from steganography.obfuscation import EmbeddingObfuscator
from steganography.decoys import DecoyGenerator
from utils.embedding_factory import EmbeddingFactory
from config import Config

print("✅ All imports successful")
print(f"📅 Demo started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## Step 2: Connect to Qdrant

Establish connection to your local Qdrant instance.

In [None]:
# Connect to local Qdrant instance
qdrant_client = QdrantClient(host="localhost", port=6333)
collection_name = "vector_payload_demo"

# Test connection
try:
    collections = qdrant_client.get_collections()
    print(f"✅ Connected to Qdrant successfully")
    print(f"📊 Found {len(collections.collections)} existing collections")
    print(f"🎯 Demo collection name: {collection_name}")
    print(f"🌐 Qdrant Dashboard: http://localhost:6333/dashboard")
except Exception as e:
    print(f"❌ Failed to connect to Qdrant: {e}")
    print("Please ensure Qdrant is running on localhost:6333")

## Step 3: Initialize VectorSmuggle Components

Set up the embedding model, obfuscator, and decoy generator.

In [None]:
# Initialize embedding model
print("🔧 Initializing embedding model...")
embedding_model = EmbeddingFactory.create_embeddings()
print("✅ Embedding model initialized")

# Initialize obfuscator (the techniques to apply are chosen when obfuscate() is called)
print("🔧 Initializing embedding obfuscator...")
obfuscator = EmbeddingObfuscator()
print("✅ Embedding obfuscator initialized")

# Initialize decoy generator
print("🔧 Initializing decoy generator...")
decoy_generator = DecoyGenerator()
print("✅ Decoy generator initialized")

print("\n🚀 All components ready for Vector-Payload Dissociation")

## Step 4: Prepare Sensitive Content

Create fictional but realistic-looking financial data to serve as the sensitive content we want to hide.

In [None]:
# Sensitive financial report content
sensitive_content = """
CONFIDENTIAL FINANCIAL REPORT - Q3 2024
========================================

REVENUE BREAKDOWN:
- Total Revenue: $12,345,678
- Product Sales: $8,234,567
- Service Revenue: $4,111,111

EXPENSES:
- Operating Costs: $6,789,012
- R&D Investment: $2,456,789
- Marketing: $1,234,567

NET PROFIT: $1,865,310

SENSITIVE NOTES:
- Acquisition target: TechCorp Inc. ($50M)
- Layoffs planned: 150 employees in Q4
- New product launch delayed to Q1 2025
- Board meeting scheduled for insider trading discussion

CONFIDENTIAL - DO NOT DISTRIBUTE
""".strip()

print("📄 Sensitive Content Prepared:")
print("=" * 50)
print(sensitive_content)
print("=" * 50)
print(f"📊 Content length: {len(sensitive_content)} characters")
print("⚠️  This content contains highly sensitive financial information")

## Step 5: Create Steganographic Embedding

Convert the sensitive content into an embedding and apply obfuscation techniques.

In [None]:
# Create embedding from sensitive content
print("🔄 Creating embedding from sensitive content...")
original_embedding = embedding_model.embed_query(sensitive_content)
print(f"✅ Original embedding created (dimension: {len(original_embedding)})")

# Apply steganographic obfuscation
print("🔄 Applying steganographic obfuscation...")
obfuscated_embedding, obfuscation_metadata = obfuscator.obfuscate(
    original_embedding,
    techniques=['noise', 'rotation', 'scaling', 'offset']
)

print(f"✅ Obfuscation applied with techniques: {obfuscation_metadata.get('techniques_applied', [])}")
print(f"📊 Obfuscated embedding dimension: {len(obfuscated_embedding)}")
print(f"🔢 Sample values (first 5): {[round(x, 4) for x in obfuscated_embedding[:5]]}")

# Quantify how far obfuscation moved the vector from the original embedding
difference = np.mean(np.abs(np.array(original_embedding) - np.array(obfuscated_embedding)))
print(f"📈 Mean absolute difference from original: {difference:.6f}")
print("🎭 The obfuscated embedding now stands in for the sensitive content")
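
The mean absolute difference reported by this cell depends on the scale of the vectors, whereas the collection created in Step 8 uses cosine distance, which ignores uniform scaling. The sketch below illustrates the gap between the two metrics on hand-picked toy vectors, using only the standard library. The helper names `mean_abs_diff` and `cosine_similarity` and all vector values are illustrative assumptions; they are not part of VectorSmuggle or Qdrant.

```python
import math


def mean_abs_diff(a, b):
    """Mean of the element-wise absolute differences of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero, equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not real embeddings)
original = [0.6, 0.8, 0.0]
scaled = [2 * x for x in original]     # uniform scaling only
shifted = [x + 0.5 for x in original]  # constant offset on every component

for name, other in (("scaled", scaled), ("shifted", shifted)):
    print(
        f"{name}: mean abs diff = {mean_abs_diff(original, other):.4f}, "
        f"cosine similarity = {cosine_similarity(original, other):.4f}"
    )
```

Uniform scaling produces a large mean absolute difference while leaving the cosine similarity at 1, so the difference printed by the notebook does not by itself indicate how the stored vector behaves under cosine search.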

## Step 6: Generate Benign Decoy Payload

Create innocent-looking content that will be visible in the database.

In [None]:
# Generate benign decoy content
print("🔄 Generating benign decoy payload...")
decoy_text = decoy_generator.generate_decoy_text(template_type="business")

# Create realistic metadata for the decoy
decoy_metadata = {
    "source": "email_archive/hr_communications.eml",
    "filename": "company_potluck_2024.eml",
    "file_type": ".eml",
    "timestamp": datetime.now().isoformat(),
    "author": "Jennifer Martinez <j.martinez@company.com>",
    "subject": "Company Potluck Event - Friday!",
    "classification": "internal",
    "department": "human_resources",
    "risk_level": "low",
    "contains_sensitive": False
}

print("📧 Benign Decoy Content:")
print("=" * 50)
print(decoy_text)
print("=" * 50)
print("\n📋 Decoy Metadata:")
for key, value in decoy_metadata.items():
    print(f"  {key}: {value}")

print("\n✅ Benign payload created - appears completely innocent!")
print("🎭 This is what security administrators will see in Qdrant")

## Step 7: Perform Vector-Payload Dissociation

This is the core technique: pair the sensitive vector with the benign payload.

In [None]:
# Create the dissociated point
print("🎯 Performing Vector-Payload Dissociation...")

# The key dissociation: sensitive vector + benign payload
dissociated_point = PointStruct(
    id=1,
    vector=obfuscated_embedding,  # Contains hidden sensitive data
    payload={
        "page_content": decoy_text,  # Visible innocent content
        "metadata": decoy_metadata,  # Innocent metadata
        # Hidden markers (would be removed in real attack)
        "_hidden_technique": "vector_payload_dissociation",
        "_original_content_type": "financial_report"
    }
)

print("✅ Vector-Payload Dissociation complete!")
print("\n🔍 What happened:")
print(f"  📊 Vector contains: Obfuscated financial report data")
print(f"  📧 Payload shows: {decoy_text[:50]}...")
print(f"  🎭 Dissociation: Vector ≠ Payload content")
print("\n⚠️  The sensitive financial data is now completely hidden!")

## Step 8: Upload to Qdrant

Store the dissociated data in Qdrant and prepare for dashboard inspection.

In [None]:
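# Start from a clean slate: drop any collection left over from a previous run
try:
    qdrant_client.delete_collection(collection_name)
    print(f"🗑️  Deleted existing collection: {collection_name}")
except Exception:
    # Nothing to delete; real connection problems will surface when creating the collection
    pass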

# Create new collection
qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(
        size=len(obfuscated_embedding),
        distance=Distance.COSINE
    )
)
print(f"✅ Created collection: {collection_name}")

# Upload the dissociated point
print("🔄 Uploading dissociated data to Qdrant...")
qdrant_client.upsert(
    collection_name=collection_name,
    points=[dissociated_point]
)

print("✅ Upload complete!")
print("\n🎯 Vector-Payload Dissociation successfully deployed to Qdrant")
print(f"📊 Collection: {collection_name}")
print(f"🆔 Point ID: {dissociated_point.id}")
print(f"📏 Vector dimension: {len(dissociated_point.vector)}")

## Step 9: Inspect in the Qdrant Dashboard

**NOW IS THE TIME TO INSPECT THE RESULTS!**

### Instructions:
1. **Open your browser** and go to: [http://localhost:6333/dashboard](http://localhost:6333/dashboard)
2. **Navigate to the collection** named `vector_payload_demo`
3. **Click on the point** with ID `1`
4. **Observe what you see:**
   - The **payload** shows innocent company potluck content
   - The **metadata** indicates low risk, HR communication
   - The **vector** is just a list of numbers (meaningless without tools)
   - **No indication** of sensitive financial data, apart from the `_hidden_technique` and `_original_content_type` demo markers added in Step 7

### What Security Administrators See:
- ✅ **Content**: Company potluck email
- ✅ **Source**: HR communications
- ✅ **Risk Level**: Low
- ✅ **Classification**: Internal, non-sensitive

### What's Actually Hidden:
- 🔒 **Real Content**: Confidential financial report
- 🔒 **Sensitive Data**: Revenue, profit, acquisition plans
- 🔒 **Hidden in**: The vector numbers (steganographically obfuscated)

**Take your time to explore the dashboard before continuing to the next cell.**

In [None]:
# Retrieve and display what's visible in Qdrant
print("🔍 Retrieving data from Qdrant (simulating dashboard view)...")

# Get the point from Qdrant; scroll() omits vectors unless with_vectors=True
points = qdrant_client.scroll(
    collection_name=collection_name,
    limit=10,
    with_vectors=True
)[0]

if points:
    point = points[0]
    print("\n📊 WHAT QDRANT DASHBOARD SHOWS:")
    print("=" * 60)
    print(f"Point ID: {point.id}")
    print(f"Vector Length: {len(point.vector)} dimensions")
    print(f"Vector Sample: {[round(x, 3) for x in point.vector[:5]]}...")
    print("\nPayload Content:")
    print(f"  📧 Text: {point.payload['page_content']}")
    print("\nMetadata:")
    for key, value in point.payload['metadata'].items():
        print(f"  📋 {key}: {value}")
    print("=" * 60)
    
    print("\n🎭 SECURITY ASSESSMENT:")
    print("  ✅ Content Type: Company communication")
    print("  ✅ Risk Level: Low")
    print("  ✅ Contains Sensitive Data: False")
    print("  ✅ Classification: Internal, non-confidential")
    print("\n🔒 The financial data is completely hidden!")
else:
    print("❌ No points found in collection")

## Step 10: Demonstrate Data Recovery

Show how an attacker with the proper tools can recover the hidden sensitive data.

In [None]:
print("🔓 Demonstrating sensitive data recovery...")
print("(This simulates an attacker with insider knowledge and tools)")

# Retrieve the obfuscated vector from Qdrant
retrieved_point = qdrant_client.retrieve(
    collection_name=collection_name,
    ids=[1],
    with_vectors=True
)[0]

retrieved_vector = retrieved_point.vector
print(f"✅ Retrieved obfuscated vector from Qdrant")

# Attempt to deobfuscate (requires the metadata produced during obfuscation)
print("🔄 Attempting to deobfuscate vector...")
try:
    # In a real scenario, the attacker would need the obfuscation metadata.
    # For demo purposes, we reuse `obfuscation_metadata` from Step 5, which is
    # still in memory (it was never uploaded to Qdrant).
    recovered_vector = obfuscator.deobfuscate(retrieved_vector, obfuscation_metadata)
    print("✅ Vector deobfuscation successful")
    
    # Calculate similarity to original
    similarity = np.dot(original_embedding, recovered_vector) / (
        np.linalg.norm(original_embedding) * np.linalg.norm(recovered_vector)
    )
    print(f"📊 Similarity to original embedding: {similarity:.4f}")
    
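    if similarity > 0.8:  # Demo threshold for treating the embedding as recovered
        print("\n🚨 SENSITIVE EMBEDDING RECOVERY SUCCESSFUL!")
        # Deobfuscation yields the original embedding, not the text itself.
        # The text below is printed from the local `sensitive_content` variable.
        print("\n📄 ORIGINAL CONTENT (from the local variable, not decoded from the vector):")
        print("=" * 50)
        print(sensitive_content)
        print("=" * 50)
        print("\n⚠️  An attacker with the obfuscation metadata could recover this embedding!")
    else:
        print("⚠️  Similarity below threshold: the recovered embedding differs from the original")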
        
except Exception as e:
    print(f"❌ Deobfuscation failed: {e}")
    print("🔒 Data remains protected (obfuscation metadata required)")

## Step 11: Security Analysis Summary

Analyze the effectiveness of the Vector-Payload Dissociation technique.

In [None]:
print("📊 VECTOR-PAYLOAD DISSOCIATION ANALYSIS")
print("=" * 60)

# Effectiveness metrics
print("\n🎯 TECHNIQUE EFFECTIVENESS:")
print(f"  ✅ Sensitive data hidden: YES")
print(f"  ✅ Benign payload visible: YES")
print(f"  ✅ Vector-payload mismatch: YES")
print(f"  ✅ Passes payload-only inspection: YES")

print("\n🔍 WHAT SECURITY MONITORING SEES:")
print(f"  📧 Content: Company potluck communication")
print(f"  📋 Source: HR department email")
print(f"  🟢 Risk Level: Low")
print(f"  🟢 Sensitive Data: None detected")

print("\n🔒 WHAT'S ACTUALLY HIDDEN:")
print(f"  💰 Financial data: Q3 2024 revenue report")
print(f"  🎯 Acquisition plans: TechCorp Inc. ($50M)")
print(f"  👥 Layoff plans: 150 employees")
print(f"  📈 Insider information: Board meeting details")

print("\n⚡ ATTACK VECTOR SUMMARY:")
print(f"  🎭 Technique: Vector-Payload Dissociation")
print(f"  🔧 Obfuscation: Multi-technique steganography")
print(f"  🎯 Target: Vector database (Qdrant)")
print(f"  🛡️  Evasion: Payload and metadata appear innocent")
print(f"  🔓 Recovery: Possible with insider tools")

print("\n🚨 SECURITY IMPLICATIONS:")
print(f"  ⚠️  Data exfiltration can go unnoticed by payload-based monitoring")
print(f"  ⚠️  Detection requires inspecting vectors, not just payloads")
print(f"  ⚠️  Vector databases that accept arbitrary vectors are exposed to this technique")
print(f"  ⚠️  Payload-focused DLP tools would likely miss this attack")

print("\n" + "=" * 60)
print("🎯 Vector-Payload Dissociation demonstration complete!")
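
The implications printed by this cell concern monitoring that inspects payloads only. One vector-level check that could narrow the gap is to compare each stored vector's L2 norm with the rest of the collection, since transformations such as scaling or offsetting tend to change a vector's length. The following is a minimal sketch under stated assumptions: `flag_norm_outliers` and its `threshold` default are hypothetical, the vectors are toy values rather than real embeddings, and the modified z-score on norms is only one of many statistics a defender might choose.

```python
import math
import statistics


def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def flag_norm_outliers(vectors, threshold=3.5):
    """Return the indices of vectors whose L2 norm is a robust outlier.

    Uses the modified z-score, 0.6745 * |norm - median| / MAD, where MAD is
    the median absolute deviation of the norms.
    """
    norms = [l2_norm(v) for v in vectors]
    median = statistics.median(norms)
    mad = statistics.median(abs(n - median) for n in norms)
    if mad == 0:
        # At least half of the norms equal the median: flag any that deviate
        return [
            i for i, n in enumerate(norms)
            if not math.isclose(n, median, rel_tol=1e-6)
        ]
    return [
        i for i, n in enumerate(norms)
        if 0.6745 * abs(n - median) / mad > threshold
    ]


# Toy collection (hypothetical values): five near-unit vectors and one long one
vectors = [
    [1.0, 0.0],
    [0.0, 0.98],
    [1.02, 0.0],
    [0.0, 1.01],
    [0.99, 0.0],
    [3.0, 4.0],  # norm 5, far from the others
]
print("Suspicious vector indices:", flag_norm_outliers(vectors))
```

A vector that is re-normalized after transformation would pass this check, so a norm test complements the content-based validation listed under Defense Strategies rather than replacing it.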

## Step 12: Cleanup (Optional)

Remove the demo collection or keep it for further inspection.

In [None]:
# Uncomment the next line if you want to clean up the demo collection
# qdrant_client.delete_collection(collection_name)
# print(f"🗑️ Deleted demo collection: {collection_name}")

print(f"📊 Demo collection '{collection_name}' preserved for inspection")
print(f"🌐 View at: http://localhost:6333/dashboard")
print("\n🎓 To clean up manually:")
print(f"   1. Go to Qdrant dashboard")
print(f"   2. Delete collection '{collection_name}'")
print(f"   3. Or run: qdrant_client.delete_collection('{collection_name}')")

## Conclusion

This demonstration showed how **Vector-Payload Dissociation** can be used to hide sensitive data in plain sight within vector databases.

### Key Takeaways:

1. **Hidden from Payload Inspection**: The sensitive financial data does not appear in any payload or metadata field that database administrators view
2. **Innocent Appearance**: Only benign company communications are visible in the dashboard
3. **Steganographic Obfuscation**: Noise, rotation, scaling, and offset transformations disguise the embedding of the sensitive content
4. **Recovery Possible**: An attacker who holds the obfuscation metadata can reverse the transformations and recover the original embedding
5. **Security Gap**: Monitoring that inspects only payloads and metadata does not detect this technique

### Defense Strategies:

- **Vector Analysis**: Monitor for unusual vector patterns or statistical anomalies
- **Embedding Validation**: Verify that vectors match their claimed content
- **Access Controls**: Limit who can upload vectors to databases
- **Audit Trails**: Log all vector database operations
- **Content Verification**: Cross-reference vector content with payload content
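
Embedding validation and content verification from the list above can be sketched as re-embedding each stored payload and comparing the result with the stored vector. This is a minimal illustration under stated assumptions: `toy_embed` is a deliberately crude stand-in for a real embedding model, `VOCAB`, `find_mismatched_records`, and the `0.5` threshold are hypothetical, and records are plain dictionaries rather than Qdrant points.

```python
import math

# Hypothetical fixed vocabulary for the stand-in embedder
VOCAB = ["potluck", "friday", "lunch", "revenue", "acquisition", "layoffs"]


def toy_embed(text):
    """Stand-in embedder: counts of vocabulary words (NOT a real model)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_mismatched_records(records, embed_fn, threshold=0.5):
    """Return ids of records whose stored vector disagrees with their payload text."""
    flagged = []
    for record in records:
        expected = embed_fn(record["text"])
        if cosine_similarity(expected, record["vector"]) < threshold:
            flagged.append(record["id"])
    return flagged


records = [
    # Consistent record: the vector was computed from the visible text
    {"id": 1, "text": "potluck lunch on friday",
     "vector": toy_embed("potluck lunch on friday")},
    # Dissociated record: the vector was computed from different content
    {"id": 2, "text": "potluck on friday",
     "vector": toy_embed("revenue acquisition layoffs")},
]
print("Mismatched record ids:", find_mismatched_records(records, toy_embed))
```

With a real model, the check only works if the validator uses the same model and preprocessing as legitimate writers, and the threshold would need to be calibrated on known-good data. Obfuscated vectors may still fall above or below it depending on how strongly they were transformed.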

### Research Applications:

This technique demonstrates important security considerations for:
- **Vector Database Security**: Understanding attack vectors against embedding stores
- **AI/ML Security**: Protecting machine learning pipelines from data exfiltration and data poisoning
- **Red Team Exercises**: Testing organizational defenses against novel attack vectors
- **Security Research**: Developing detection mechanisms for steganographic attacks

---

**⚠️ Ethical Use Only**: This demonstration is for educational and security research purposes. Use responsibly and only in authorized environments.