# Latent Vandalism: The Joy of Productive Damage to Text-to-Image Synthesis Pipelines

**Workshop by Laura Wagner**

🔗 [laurajul.github.io](https://laurajul.github.io/)  
📦 [Workshop Repository](https://github.com/laurajul/latent-vandalism-workshop)

---

## Abstract

Text-to-image models have evolved into sophisticated engines of **template culture** (Grund and Scherffig): systems trained to reproduce standardized aesthetics. Fatigued by the constant flood of polished results and the arms race for images benchmarked on visual coherence, commercial value, and consumer-friendliness, this workshop revisits the charm of **AI weirdness** (Shane), the failures of generative AI, and the epistemic value of **productive damage**.

Drawing inspiration from **glitch studies** (Menkman), we embrace glitches and artifacts as revelatory moments. Through gently violating the consumer-friendly, polished norms meant to please, we surface the model's implicit assumptions about how things are *supposed* to look. Participants will work directly with **Diffusion Transformers (DiT)**, focusing on the role of embeddings in image-text correlation from embedding space to latent space back into pixel space. Through hands-on meddling with the pipeline, we'll systematically damage and reconfigure the semantic substrate that guides image generation, deliberately perturbing inputs to understand this system's sensitivity and dynamics.

This **counterfactual, gently adversarial approach** positions productive damage as a research method. Through **iatrogenic techniques** performed on text-to-image models, we probe the layers of **technological inscription** (Latour) embedded in these systems. The values, design choices, and visual norms inscribed in them become legible where the system breaks down. By deliberately coaxing the models into failure, we trace the contours of what has been encoded into them.

---

## Introduction: Workshop Overview

### The Problem with Perfect

Modern text-to-image models have become exceptionally good at producing exactly what they're supposed to: coherent, aesthetically pleasing, commercially viable images. But this polish comes at a cost. These models have been trained and tuned to reproduce **template culture**: standardized visual languages that feel increasingly homogeneous.

When every generated image is optimized for visual coherence and consumer appeal, we lose something valuable: the **epistemic weirdness** that reveals how these systems actually work.

### Productive Damage as Method

This workshop takes a different approach. Instead of trying to get the "best" results, we're interested in **productive damage**: deliberate interventions that make the system fail in interesting ways. By breaking things carefully, we can:

- **Surface hidden assumptions** about what images "should" look like
- **Reveal the training data's bias** toward certain visual patterns
- **Understand the semantic structure** of embedding space
- **Trace technological inscription**: the values and design choices baked into these systems

As Rosa Menkman argues in glitch studies, glitches aren't just errors; they're **revelatory moments** that expose the normally invisible structures underlying digital systems.

### Our Approach: Iatrogenic Techniques

We adopt what we call **iatrogenic techniques**, a term borrowed from medicine meaning "caused by treatment itself." Rather than trying to optimize the pipeline, we deliberately introduce perturbations:

- **Scaling embeddings** beyond reasonable ranges
- **Inverting semantic directions** to explore negative space
- **Mixing incompatible concepts** that text prompts can't express
- **Zeroing out encoders** to see what each contributes
- **Creating impossible combinations** that violate training distribution

These interventions are **gently adversarial**: we're not trying to break the system maliciously, but to understand it through carefully designed failures.
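
The perturbations listed above are simple array operations. The sketch below is a minimal NumPy illustration, not the workshop's actual code: the function names (`scale`, `invert`, `mix`, `zero_out`) are hypothetical, and the random arrays only stand in for real encoder outputs of shape 77 × 768.

```python
import numpy as np

def scale(emb: np.ndarray, factor: float) -> np.ndarray:
    """Multiply every value by `factor` (e.g. 5.0 to amplify, 0.1 to mute)."""
    return emb * factor

def invert(emb: np.ndarray) -> np.ndarray:
    """Flip the sign of every value, i.e. scale by -1."""
    return -emb

def mix(emb_a: np.ndarray, emb_b: np.ndarray, weight: float = 0.5) -> np.ndarray:
    """Linear blend: weight=0 returns emb_a, weight=1 returns emb_b."""
    if emb_a.shape != emb_b.shape:
        raise ValueError("embeddings must have the same shape to be mixed")
    return (1.0 - weight) * emb_a + weight * emb_b

def zero_out(emb: np.ndarray) -> np.ndarray:
    """Replace an encoder's output with zeros while keeping its shape."""
    return np.zeros_like(emb)

# Toy stand-ins for the CLIP-L embeddings of two prompts (assumption:
# real embeddings would be loaded from the saved JSON files instead).
rng = np.random.default_rng(0)
cat = rng.standard_normal((77, 768)).astype(np.float32)
theorem = rng.standard_normal((77, 768)).astype(np.float32)

damaged = {
    "scaled": scale(cat, 5.0),
    "inverted": invert(cat),
    "mixed": mix(cat, theorem, 0.5),
    "ablated": zero_out(cat),
}
for name, emb in damaged.items():
    print(name, emb.shape, float(np.abs(emb).mean()))
```

Every operation preserves the tensor shape, which is what lets a damaged embedding be injected wherever the original one would have gone.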

### What We're Doing Today

In this workshop, we're going to peek under the hood of modern text-to-image diffusion models. Specifically, we'll explore how text prompts are transformed into numerical embeddings, and how those embeddings guide the image generation process.

**The Key Insight:** Most users interact with these models through text prompts, but the models themselves don't "see" text. They work with **numerical embeddings**: high-dimensional vectors that capture semantic meaning. By directly manipulating these embeddings, we can:

- Create images that are **impossible to generate from text alone**
- Understand how **different text encoders** contribute distinct semantic information
- Discover **emergent behaviors** in embedding space that text can't access
- Learn exactly **where and how embeddings influence** generation
- **Make visible the invisible**: the assumptions and inscriptions built into these systems

### Technical Approach

We've modified both the **SD 3.5** and **FLUX-Schnell** pipelines to accept raw embeddings directly, bypassing the text encoding step. Here's our workflow:

1. **Generate embeddings** from text prompts using T5-XXL, CLIP-L, and CLIP-G encoders
2. **Save embeddings** as JSON files so we can inspect and modify the raw numbers
3. **Vandalize embeddings** by scaling, inverting, mixing, or zeroing them
4. **Inject damaged embeddings** directly into the diffusion pipeline
5. **Observe the productive failures** and understand what they reveal
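
Steps 2 and 3 of the workflow hinge on a JSON round trip. The sketch below shows one way this could look; the JSON layout (`encoder`, `prompt`, `shape`, `values`) and the helper names are assumptions made for illustration and may not match the files in the workshop repository.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def save_embedding(path: Path, emb: np.ndarray, encoder: str, prompt: str) -> None:
    """Write one embedding tensor plus minimal metadata to a JSON file."""
    payload = {
        "encoder": encoder,
        "prompt": prompt,
        "shape": list(emb.shape),
        "values": emb.tolist(),  # nested lists of plain Python floats
    }
    path.write_text(json.dumps(payload))

def load_embedding(path: Path) -> np.ndarray:
    """Read the JSON file back into a float32 array and check its shape."""
    payload = json.loads(path.read_text())
    emb = np.asarray(payload["values"], dtype=np.float32)
    if list(emb.shape) != payload["shape"]:
        raise ValueError(f"shape mismatch in {path}")
    return emb

# Toy CLIP-L pooled vector (768 dims) standing in for a real encoder output.
original = np.random.default_rng(0).standard_normal(768).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "clip_l_pooled.json"
    save_embedding(path, original, encoder="clip_l_pooled", prompt="a cat")
    restored = load_embedding(path)

print(restored.shape, np.array_equal(original, restored))
```

JSON is far larger on disk than a binary format, but it keeps the raw numbers readable and editable by hand, which is the point of this step.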

### Why This Matters

By working directly with embeddings, we gain:

- **Transparency**: See exactly what numerical representations drive image generation
- **Control**: Manipulate semantic space in ways impossible through text
- **Understanding**: Learn how different encoders contribute and interact
- **Creativity**: Discover novel visual territory outside the training distribution
- **Critical Insight**: Expose the **technological inscription** (Latour), meaning the values, norms, and aesthetic preferences encoded into these systems

When a system breaks down, it reveals its construction. By systematically damaging these pipelines, we make visible the **layers of inscription** that are normally hidden behind polished outputs.

### Models We're Vandalizing

- **Stable Diffusion 3.5**: Uses three text encoders (T5-XXL, CLIP-L, CLIP-G) with an MMDiT transformer architecture
- **FLUX-Schnell**: Uses two text encoders (T5-XXL, CLIP-L) with a different architectural approach

Both models follow a similar high-level pattern: **Text → Embeddings → Latent Diffusion → Image**, but they differ in how embeddings are processed and used. These differences become visible when we start breaking things.

## The Big Picture: From Text to Pixels

Before diving into the details, let's understand the fundamental flow in modern diffusion models:

```mermaid
graph LR
    A[Text Prompt] --> B[Embedding Space]
    B --> C[Latent Space]
    C --> D[Pixel Space]
    
    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#ffe1f5
    style D fill:#d4edda
```

### Three Key Spaces:

1. **Embedding Space** (High-dimensional semantic vectors)
   - Where text meaning is encoded numerically
   - Typically thousands of dimensions (768-4096 per token)
   - **This is our focus today**

2. **Latent Space** (Compressed image representation)
   - Where diffusion actually happens
   - Much smaller than pixel space (e.g., 128×128×16 instead of 1024×1024×3)
   - Embeddings guide the denoising process here

3. **Pixel Space** (Final RGB image)
   - The actual image you see
   - Decoded from latent space by VAE
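
To make the size difference between latent space and pixel space concrete, here is a small arithmetic sketch. It assumes a VAE that downsamples each spatial side by 8 and uses 16 latent channels; the helper names are hypothetical.

```python
def latent_shape(height: int, width: int, vae_scale: int = 8, channels: int = 16):
    """Latent tensor shape for an image, assuming 8x downsampling and 16 channels."""
    return (height // vae_scale, width // vae_scale, channels)

def num_values(shape) -> int:
    """Total number of scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

pixel_1024 = (1024, 1024, 3)
latent_1024 = latent_shape(1024, 1024)   # (128, 128, 16)
latent_512 = latent_shape(512, 512)      # (64, 64, 16)

ratio = num_values(pixel_1024) / num_values(latent_1024)
print(latent_1024, latent_512, ratio)    # (128, 128, 16) (64, 64, 16) 12.0
```

Under these assumptions a 1024×1024 image is denoised as a tensor with 12 times fewer values than its RGB pixels, and a 64×64 latent corresponds to a 512×512 image.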

### Our Workshop Focus:

```mermaid
graph TD
    A[Text Prompt] --> B[Text Encoders]
    B --> C[Embedding Space]
    
    C -.->|We manipulate here| D[Modified Embeddings]
    
    D --> E[Diffusion in Latent Space]
    E --> F[VAE Decoder]
    F --> G[Pixel Space / Image]
    
    H[Random Noise] --> E
    
    style C fill:#fff4e1
    style D fill:#ffcccc
    style E fill:#ffe1f5
    style G fill:#d4edda
```

**Today's Goal**: Understand how changes in embedding space propagate through latent space to affect the final image.

## Comparing SD 3.5 and FLUX-Schnell Architectures

### SD 3.5: Complete Pipeline Flow

```mermaid
graph TD
    subgraph "EMBEDDING SPACE"
        A[Text Prompt] --> B1[T5-XXL Encoder]
        A --> B2[CLIP-L Encoder]
        A --> B3[CLIP-G Encoder]
        
        B1 --> C1[T5 Embeddings<br/>77 × 4096]
        B2 --> C2[CLIP-L Embeddings<br/>77 × 768]
        B2 --> C3[CLIP-L Pooled<br/>768]
        B3 --> C4[CLIP-G Embeddings<br/>77 × 1280]
        B3 --> C5[CLIP-G Pooled<br/>1280]
        
        C2 --> D1[Concatenate CLIP Features<br/>77 × 2048, zero-padded to 4096]
        C4 --> D1
        
        C3 --> D2[Concatenate]
        C5 --> D2
        
        D1 --> E1[Context Embeddings<br/>154 × 4096]
        C1 --> E1
        D2 --> E2[Pooled Embeddings<br/>2048]
    end
    
    subgraph "LATENT SPACE"
        F[Random Noise<br/>Latent Tensor] --> G[MMDiT Transformer]
        E1 --> |Joint Attention| G
        E2 --> |AdaLN Modulation| G
        
        G --> H[Denoising Steps<br/>Iterative Refinement]
        H --> I[Clean Latent<br/>Representation]
    end
    
    subgraph "PIXEL SPACE"
        I --> J[VAE Decoder]
        J --> K[RGB Image<br/>1024×1024×3]
    end
    
    style E1 fill:#fff4e1
    style E2 fill:#fff4e1
    style G fill:#ffe1f5
    style H fill:#ffe1f5
    style K fill:#d4edda
```
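
How the per-encoder tensors become one context sequence and one pooled vector is easiest to see as shape arithmetic. The sketch below follows the way the Hugging Face `diffusers` SD3 pipeline assembles its prompt embeddings, as we understand it: CLIP-L and CLIP-G are joined along the feature axis, zero-padded to the T5 width, then stacked with the T5 tokens along the sequence axis. The function name is hypothetical, the T5 length of 77 tokens is taken from the diagram above, and the workshop's modified pipeline may differ in detail.

```python
import numpy as np

def build_sd3_conditioning(t5, clip_l, clip_g, clip_l_pooled, clip_g_pooled):
    """Combine per-encoder outputs into (context, pooled) arrays."""
    # 1. Join CLIP-L and CLIP-G token embeddings along the feature axis.
    clip = np.concatenate([clip_l, clip_g], axis=-1)            # (77, 2048)
    # 2. Zero-pad the CLIP features up to the T5 feature width.
    pad_width = t5.shape[-1] - clip.shape[-1]
    clip = np.pad(clip, ((0, 0), (0, pad_width)))               # (77, 4096)
    # 3. Stack CLIP tokens and T5 tokens along the sequence axis.
    context = np.concatenate([clip, t5], axis=0)                # (154, 4096)
    # Pooled vectors are simply joined along the feature axis.
    pooled = np.concatenate([clip_l_pooled, clip_g_pooled], axis=-1)
    return context, pooled

# Random stand-ins for real encoder outputs.
rng = np.random.default_rng(0)
t5 = rng.standard_normal((77, 4096))
clip_l = rng.standard_normal((77, 768))
clip_g = rng.standard_normal((77, 1280))
clip_l_pooled = rng.standard_normal(768)
clip_g_pooled = rng.standard_normal(1280)

context, pooled = build_sd3_conditioning(t5, clip_l, clip_g, clip_l_pooled, clip_g_pooled)
print(context.shape, pooled.shape)  # (154, 4096) (2048,)

# Encoder ablation: zeroing T5 blanks the second half of the context sequence.
ablated_context, _ = build_sd3_conditioning(
    np.zeros_like(t5), clip_l, clip_g, clip_l_pooled, clip_g_pooled
)
```

Because each encoder occupies its own slice of the combined tensors, scaling or zeroing one encoder before this step damages exactly that slice and nothing else.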

### FLUX-Schnell: Complete Pipeline Flow

```mermaid
graph TD
    subgraph "EMBEDDING SPACE"
        A[Text Prompt] --> B1[T5-XXL Encoder]
        A --> B2[CLIP-L Encoder]
        
        B1 --> C1[T5 Embeddings<br/>512 × 4096]
        B2 --> C3[CLIP-L Pooled<br/>768]
        
        C1 --> E1[Context Embeddings<br/>512 × 4096]
        C3 --> E2[Pooled Conditioning Vector<br/>768]
    end
    
    subgraph "LATENT SPACE"
        F[Random Noise<br/>Latent Tensor] --> G[FLUX Transformer]
        E1 --> |Joint Attention| G
        E2 --> |Modulation| G
        
        G --> H[Flow Matching<br/>4 Steps Only]
        H --> I[Clean Latent<br/>Representation]
    end
    
    subgraph "PIXEL SPACE"
        I --> J[VAE Decoder]
        J --> K[RGB Image<br/>1024×1024×3]
    end
    
    style E1 fill:#fff4e1
    style E2 fill:#fff4e1
    style G fill:#ffe1f5
    style H fill:#ffe1f5
    style K fill:#d4edda
```

## Key Differences: SD 3.5 vs FLUX-Schnell

| Aspect | SD 3.5 | FLUX-Schnell |
|--------|--------|-------------|
| **Text Encoders** | T5-XXL, CLIP-L, CLIP-G | T5-XXL, CLIP-L |
| **T5 Sequence Length** | 77 tokens | 512 tokens (longer context!) |
| **Pooled Embeddings** | CLIP-L + CLIP-G (2048 dims) | CLIP-L only (768 dims) |
| **Architecture** | MMDiT (Multimodal Diffusion Transformer) | FLUX Transformer |
| **Denoising Process** | Flow matching (typically 20-50 steps) | Flow matching, step-distilled (4 steps) |
| **Embedding Usage** | Joint attention + AdaLN modulation | Joint attention + pooled-vector modulation |
| **Speed** | Slower (more steps) | Faster (fewer steps) |

### Why These Differences Matter:

- **FLUX's longer T5 context** (512 vs 77 tokens) allows more detailed prompts
- **SD 3.5's third encoder** (CLIP-G) provides additional style/aesthetic control
- **Different architectures** mean embeddings influence generation in different ways
- **Few-step distillation vs full sampling** changes how many denoising steps the embeddings have to steer the image (4 vs 20-50)

## How Embeddings Control Latent Diffusion (And How We'll Break It)

```mermaid
graph TD
    subgraph "Latent Vandalism: Embedding Manipulation"
        A[Original Prompt] --> B[Generate Embeddings]
        B --> C[Save to JSON]
        C --> D{Vandalism Techniques}
        
        D --> E[Scaling Beyond Reason<br/>× 5.0 or × 0.1]
        D --> F[Semantic Inversion<br/>× -1 negative space]
        D --> G[Impossible Mixing<br/>Blend incompatible prompts]
        D --> H[Encoder Ablation<br/>Zero out components]
        D --> I[Nonsense Interpolation<br/>Invalid semantic paths]
        
        E --> J[Damaged Embeddings]
        F --> J
        G --> J
        H --> J
        I --> J
    end
    
    subgraph "Effect on Latent Diffusion: Productive Failures"
        J --> K[Inject into Pipeline]
        
        K --> L[Transformer Attention<br/>with corrupted guidance]
        L --> M{How Damage Manifests}
        
        M --> N[Extreme influence<br/>Oversaturated semantics]
        M --> O[Weak/absent influence<br/>Lost coherence]
        M --> P[Inverted semantics<br/>Opposite concepts]
        M --> Q[Chimeric concepts<br/>Impossible combinations]
        
        N --> R1[Glitches & Artifacts]
        O --> R2[Template breakdown]
        P --> R3[Semantic inversions]
        Q --> R4[AI weirdness emerges]
    end
    
    R1 --> S[Corrupted Latent]
    R2 --> S
    R3 --> S
    R4 --> S
    S --> T[VAE Decode]
    T --> U[Revelatory Failure Image]
    
    U --> V{What Does This Reveal?}
    V --> W[Training data biases]
    V --> X[Encoder dependencies]
    V --> Y[Inscribed assumptions]
    V --> Z[Aesthetic boundaries]
    
    style J fill:#ffcccc
    style L fill:#ffe1f5
    style R1 fill:#fff4e1
    style R2 fill:#fff4e1
    style R3 fill:#fff4e1
    style R4 fill:#fff4e1
    style U fill:#d4edda
    style V fill:#e1f5ff
```

### The Epistemology of Productive Damage

When we damage embeddings systematically, we're not just making "bad" images; we're conducting **counterfactual experiments**:

- **"What if this encoder didn't exist?"** → Zero it out
- **"What if the semantic direction reversed?"** → Invert it
- **"What if two incompatible concepts merged?"** → Mix embeddings
- **"What if the signal was too strong/weak?"** → Scale it

Each failure mode reveals **technological inscription**: the assumptions, preferences, and constraints built into the system that are normally invisible in successful generations.

As Latour argues, technology embeds social and design choices into material form. When we make these systems fail, we **make the inscription visible**.

## Workshop Roadmap: From Understanding to Vandalism

### Part 1: Understanding the Architecture
**Before we break it, we need to know what we're breaking**
- Detailed SD 3.5 pipeline breakdown
- Detailed FLUX-Schnell pipeline breakdown
- Where each embedding type is used and how
- The "normal" path from text to pixels

### Part 2: Generating and Saving Embeddings
**Extracting the semantic substrate**
- Extract embeddings from prompts
- Save as JSON for inspection and manipulation
- Understand the numerical structure of semantic space
- Observe what "normal" embeddings look like

### Part 3: Techniques of Productive Damage
**Systematic vandalism experiments**

- **Experiment 1: Scaling Beyond Reason**  
  Push encoders to extremes (10× T5, 0.1× CLIP). What breaks first?
  
- **Experiment 2: Negative Space Exploration**  
  Invert embeddings to explore semantic opposites. What does negative "cat" look like?
  
- **Experiment 3: Impossible Mixtures**  
  Combine embeddings from incompatible prompts. Create chimeras text can't express.
  
- **Experiment 4: Encoder Ablation**  
  Zero out specific encoders. What does each one actually contribute?
  
- **Experiment 5: Interpolation Through Nonsense**  
  Morph between concepts via paths that violate semantic coherence.
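
Experiment 5 can be sketched as a path that leaves the straight line between two prompts and takes a detour through an arbitrary point. The code below is one possible reading of "interpolation through nonsense", not the workshop's implementation: the helper names are hypothetical, and scaled Gaussian noise stands in for the nonsense waypoint.

```python
import numpy as np

def lerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Linear interpolation: t=0 returns a, t=1 returns b."""
    return (1.0 - t) * a + t * b

def detour_path(a, b, waypoint, steps: int):
    """Piecewise-linear path a -> waypoint -> b with `steps` frames per leg."""
    ts = np.linspace(0.0, 1.0, steps)
    first_leg = [lerp(a, waypoint, t) for t in ts]
    second_leg = [lerp(waypoint, b, t) for t in ts[1:]]  # skip the duplicate waypoint
    return first_leg + second_leg

# Random stand-ins for the embeddings of two prompts.
rng = np.random.default_rng(0)
start = rng.standard_normal((77, 768))
end = rng.standard_normal((77, 768))
nonsense = 3.0 * rng.standard_normal((77, 768))  # waypoint with no prompt behind it

frames = detour_path(start, end, nonsense, steps=5)
print(len(frames))  # 9 embeddings, one image each
```

Generating one image per frame, with the noise seed held fixed, turns the detour into a visible sequence that can be compared against the direct `lerp(start, end, t)` path.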

### Part 4: Reading the Glitches
**What do failures reveal?**
- Same damage, different models: comparing SD 3.5 vs FLUX responses
- Identifying inscribed assumptions (what the model "wants" to generate)
- Finding aesthetic boundaries (where template culture stops)
- Understanding model-specific sensitivities
- Documenting **AI weirdness** as epistemic resource

### Part 5: Discussion
**Productive damage as critical method**
- What did breaking things teach us?
- How do glitches expose technological inscription?
- Can we create a taxonomy of revealing failures?
- What are the ethics of adversarial exploration?

---

**Let's begin by understanding what we're about to vandalize!**

---

# Part 1: SD 3.5 Pipeline Deep Dive

## 1. Standard SD 3.5 Pipeline (Normal Text-to-Image)

```mermaid
graph TD
    A[Text Prompt] --> B[T5-XXL Encoder]
    A --> C[CLIP-L Encoder]
    A --> D[CLIP-G Encoder]
    
    B --> E[T5 Text Embeddings<br/>77 tokens × 4096 dims]
    C --> F[CLIP-L Text Embeddings<br/>77 tokens × 768 dims]
    C --> G[CLIP-L Pooled<br/>Single vector: 768 dims]
    D --> H[CLIP-G Text Embeddings<br/>77 tokens × 1280 dims]
    D --> I[CLIP-G Pooled<br/>Single vector: 1280 dims]
    
    F --> J[Concatenate CLIP Features<br/>77 tokens × 2048 dims<br/>zero-padded to 4096]
    H --> J
    
    J --> K[Combined Context<br/>154 tokens × 4096 dims<br/>CLIP tokens + T5 tokens]
    E --> K
    
    G --> L[Concatenate Pooled]
    I --> L
    
    L --> M[Pooled Embeddings<br/>2048 dims<br/>768+1280]
    
    K --> N[MMDiT Transformer]
    M --> N
    
    O[Random Latent Noise] --> N
    
    N --> P[Denoising Process<br/>Multiple Steps]
    P --> Q[Final Latent]
    Q --> R[VAE Decoder]
    R --> S[Output Image]
    
    style K fill:#e1f5ff
    style M fill:#ffe1f5
    style N fill:#fff4e1
```

## 2. Our Modified Pipeline (Direct Embedding Injection)

```mermaid
graph TD
    A[T5 Embeddings JSON<br/>77 × 4096] --> B[Load & Parse]
    C[CLIP-L Embeddings JSON<br/>77 × 768] --> D[Load & Parse]
    E[CLIP-L Pooled JSON<br/>768 dims] --> F[Load & Parse]
    G[CLIP-G Embeddings JSON<br/>77 × 1280] --> H[Load & Parse]
    I[CLIP-G Pooled JSON<br/>1280 dims] --> J[Load & Parse]
    
    B --> K[Optional: Scale/Modify<br/>e.g., × 1.5, invert, etc.]
    D --> L[Optional: Scale/Modify]
    F --> M[Optional: Scale/Modify]
    H --> N[Optional: Scale/Modify]
    J --> O[Optional: Scale/Modify]
    
    L --> P[Concatenate CLIP Features<br/>zero-padded to 4096]
    N --> P
    
    P --> Q[Combined Context<br/>154 × 4096 dims]
    K --> Q
    
    M --> R[Concatenate Pooled]
    O --> R
    
    R --> S[Pooled Embeddings<br/>2048 dims]
    
    Q --> T[MMDiT Transformer]
    S --> T
    
    U[Random Latent Noise] --> T
    
    T --> V[Denoising Process]
    V --> W[Final Latent]
    W --> X[VAE Decoder]
    X --> Y[Output Image]
    
    style K fill:#ffcccc
    style L fill:#ffcccc
    style M fill:#ffcccc
    style N fill:#ffcccc
    style O fill:#ffcccc
    style Q fill:#e1f5ff
    style S fill:#ffe1f5
    style T fill:#fff4e1
```

## 3. Detailed View: How Embeddings Flow Into MMDiT

```mermaid
graph LR
    subgraph "Text Embeddings (Context)"
        A[T5-XXL<br/>77 × 4096]
        B[CLIP-L<br/>77 × 768]
        C[CLIP-G<br/>77 × 1280]
    end
    
    subgraph "Pooled Embeddings (Global Conditioning)"
        D[CLIP-L Pooled<br/>768]
        E[CLIP-G Pooled<br/>1280]
    end
    
    B --> F[Concat Features<br/>zero-pad to 4096]
    C --> F
    
    F --> G[Context Sequence<br/>154 × 4096]
    A --> G
    
    D --> H[Concat]
    E --> H
    
    H --> I[Pooled Vector<br/>2048]
    
    subgraph "MMDiT Transformer Block"
        G --> J[Text Stream<br/>Q, K, V projections]
        K[Noisy Latent] --> L[Image Stream<br/>Q, K, V projections]
        L --> M[Joint Attention<br/>over text + image tokens]
        J --> M
        M --> N[Feed Forward]
        
        I --> O[AdaLN Modulation<br/>Scale & Shift]
        O --> L
        O --> N
    end
    
    N --> P[Denoised Output]
    
    style G fill:#e1f5ff
    style I fill:#ffe1f5
    style M fill:#d4edda
    style O fill:#fff3cd
```
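
The "Scale & Shift" step in the diagram can be illustrated with a stripped-down adaptive layer norm. This is a simplified sketch under stated assumptions, not the MMDiT implementation: the real block also mixes in the timestep embedding and passes the conditioning through a small learned network, whereas here two random matrices (`w_shift`, `w_scale`, both hypothetical) project the pooled vector directly.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each token to zero mean and unit variance (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln_modulate(x, cond, w_shift, w_scale):
    """Shift and scale normalized tokens using projections of the conditioning vector."""
    shift = cond @ w_shift   # (hidden,)
    scale = cond @ w_scale   # (hidden,)
    return layer_norm(x) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
hidden, pooled_dim = 64, 2048                 # toy hidden size, pooled size from the diagram
tokens = rng.standard_normal((16, hidden))    # stand-in for latent tokens
pooled = rng.standard_normal(pooled_dim)      # stand-in for the pooled vector
w_shift = rng.standard_normal((pooled_dim, hidden)) * 0.01
w_scale = rng.standard_normal((pooled_dim, hidden)) * 0.01

baseline = layer_norm(tokens)

def deviation(cond: np.ndarray) -> float:
    """How far modulation pushes the tokens away from plain layer norm."""
    return float(np.linalg.norm(adaln_modulate(tokens, cond, w_shift, w_scale) - baseline))

dev_zero = deviation(np.zeros(pooled_dim))    # ablated pooled vector
dev_normal = deviation(pooled)
dev_scaled = deviation(10.0 * pooled)         # pooled vector scaled x10
print(dev_zero, dev_normal, dev_scaled)
```

In this linear toy version, zeroing the pooled vector removes the modulation entirely and a 10× scale pushes every token 10× further from its normalized value. The real block is nonlinear, so the effect there will not be exactly proportional, but the pooled vector still acts on all tokens at once rather than on individual positions.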

## 4. Key Roles of Each Embedding Type

```mermaid
graph TD
    subgraph "T5-XXL Embeddings"
        A[Rich Semantic Understanding<br/>4096 dimensions per token]
        A --> A1[Detailed descriptions]
        A --> A2[Complex relationships]
        A --> A3[Nuanced concepts]
    end
    
    subgraph "CLIP-L Embeddings"
        B[Visual-Text Alignment<br/>768 dimensions per token]
        B --> B1[Object recognition]
        B --> B2[Style understanding]
        B --> B3[Composition hints]
    end
    
    subgraph "CLIP-G Embeddings"
        C[High-Level Visual Concepts<br/>1280 dimensions per token]
        C --> C1[Artistic style]
        C --> C2[Overall aesthetics]
        C --> C3[Global structure]
    end
    
    subgraph "Pooled Embeddings"
        D[Global Image Characteristics]
        D --> D1[Overall style modulation]
        D --> D2[Quality/aesthetic level]
        D --> D3[Conditioning strength]
    end
    
    A1 --> E[Joint Attention]
    A2 --> E
    A3 --> E
    B1 --> E
    B2 --> E
    B3 --> E
    C1 --> E
    C2 --> E
    C3 --> E
    
    D1 --> F[AdaLN Modulation]
    D2 --> F
    D3 --> F
    
    E --> G[Image Generation]
    F --> G
    
    style E fill:#e1f5ff
    style F fill:#ffe1f5
    style G fill:#d4edda
```

## 5. Vandalism Techniques: From Manipulation to Revelation

```mermaid
graph TD
    A[Saved Embeddings JSON] --> B{Vandalism Technique}
    
    B --> C[Extreme Scaling]
    B --> D[Semantic Inversion]
    B --> E[Impossible Mixing]
    B --> F[Encoder Ablation]
    B --> G[Nonsense Interpolation]
    
    C --> C1["T5 × 10, CLIP × 0.01<br/>Push encoders to extremes"]
    D --> D1["× -1 all embeddings<br/>Explore negative semantic space"]
    E --> E1["Blend 'sunset' + 'theorem'<br/>Force incompatible concepts"]
    F --> F1["Zero T5, keep CLIP<br/>Isolate encoder roles"]
    G --> G1["Interpolate via invalid paths<br/>Break semantic continuity"]
    
    C1 --> H[Damaged Embeddings]
    D1 --> H
    E1 --> H
    F1 --> H
    G1 --> H
    
    H --> I[Inject into Pipeline]
    I --> J[Generate Image]
    J --> K{Productive Failures}
    
    K --> L[Glitch Aesthetics<br/>Visual artifacts reveal structure]
    K --> M[Template Breakdown<br/>Where polished norms collapse]
    K --> N[Inscribed Assumptions<br/>What the model 'wants' to do]
    K --> O[AI Weirdness<br/>Epistemic value of failure]
    
    L --> P[Research Insights]
    M --> P
    N --> P
    O --> P
    
    P --> Q["Understanding through damage:<br/>What becomes visible when systems break?"]
    
    style H fill:#ffcccc
    style J fill:#ffe1f5
    style K fill:#fff4e1
    style L fill:#e1f5ff
    style M fill:#e1f5ff
    style N fill:#e1f5ff
    style O fill:#e1f5ff
    style Q fill:#d4edda
```

### Reading Glitches as Data

Each type of damage produces characteristic failures:

- **Extreme scaling** → Oversaturation or complete loss of semantic guidance  
  *Reveals: Sensitivity thresholds, encoder balance requirements*

- **Semantic inversion** → Opposite concepts, anti-patterns  
  *Reveals: Whether semantic directions are truly bidirectional*

- **Impossible mixing** → Chimeric forms, visual contradictions  
  *Reveals: How models handle semantic conflicts, training data gaps*

- **Encoder ablation** → Missing styles, lost coherence, altered aesthetics  
  *Reveals: Individual encoder contributions, dependencies*

- **Nonsense interpolation** → Unexpected transitional states  
  *Reveals: Non-linear structure of embedding space*
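
When comparing failures it helps to record how far each damaged embedding actually is from the original. The sketch below proposes two simple measures, cosine similarity for direction and a norm ratio for magnitude. Both helpers are hypothetical additions for illustration, and random arrays stand in for real embeddings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two flattened embeddings (0.0 if either is all zeros)."""
    a, b = a.ravel(), b.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def norm_ratio(damaged: np.ndarray, original: np.ndarray) -> float:
    """Magnitude of the damaged embedding relative to the original."""
    return float(np.linalg.norm(damaged) / np.linalg.norm(original))

rng = np.random.default_rng(0)
original = rng.standard_normal((77, 768))
other = rng.standard_normal((77, 768))

variants = {
    "scaled x5": 5.0 * original,
    "inverted": -original,
    "mixed 50/50": 0.5 * original + 0.5 * other,
    "ablated": np.zeros_like(original),
}
report = {
    name: (cosine_similarity(original, emb), norm_ratio(emb, original))
    for name, emb in variants.items()
}
for name, (cos, ratio) in report.items():
    print(f"{name:12s} cosine={cos:+.3f} norm_ratio={ratio:.3f}")
```

The numbers separate the techniques cleanly: scaling changes only magnitude (cosine stays at 1), inversion changes only direction (cosine of -1, norm ratio of 1), and mixing changes both. Logging them next to each generated image makes it easier to relate a visual failure to the kind of damage that caused it.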

### The Charm of AI Weirdness

As Janelle Shane observes, when AI systems fail, they often fail in **informative ways**. The strange, the broken, the glitched: these aren't just mistakes. They're **windows into the system's implicit knowledge**.

A perfectly coherent image tells you the system works. A **revelatory failure** tells you *how* it works.

## Summary

### Embedding Dimensions:
- **T5-XXL**: 77 tokens × 4096 dimensions (SD 3.5) / 512 tokens × 4096 (FLUX)
- **CLIP-L**: 77 tokens × 768 dimensions + 768 pooled
- **CLIP-G**: 77 tokens × 1280 dimensions + 1280 pooled (SD 3.5 only)

### How They're Used:
1. **Text embeddings** (combined into one context sequence) → Joint attention in the transformer
2. **Pooled embeddings** (concatenated) → Global conditioning via AdaLN-style modulation

### Workshop Method:
- **Productive damage** as epistemological tool
- **Iatrogenic techniques** to probe system boundaries
- **Glitch aesthetics** as revelatory moments
- **Counterfactual experiments** to understand inscription

### What Vandalism Reveals:
- Direct manipulation bypasses text encoding limitations
- Systematic damage exposes training data biases
- Failures make visible the inscribed norms and assumptions
- AI weirdness provides epistemic value beyond polish
- Each encoder contributes distinct, separable semantic information
- Template culture's boundaries become legible where it breaks

---

**Remember:** The goal isn't to make "better" images. It's to understand what "better" means to these systems, and to find creative freedom in the spaces where that definition breaks down.

## Theoretical Framework: References

This workshop draws on several theoretical traditions:

### Glitch Studies
- **Menkman, Rosa.** *The Glitch Moment(um)*. Network Notebooks, 2011.
  - Glitches as revelatory moments that expose normally invisible structures
  - Productive failures as aesthetic and epistemic resources

### Template Culture
- **Grund, Katja and Scherffig, Lasse.** Work on template culture and standardized aesthetics in generative AI
  - How models reproduce homogeneous visual languages
  - The political economy of aesthetic standardization

### AI Weirdness
- **Shane, Janelle.** Research on AI failures and unexpected behaviors
  - The epistemic value of AI mistakes
  - How failures reveal system structure

### Science and Technology Studies
- **Latour, Bruno.** "Technology is society made durable." *Sociological Review*, 1990.
  - Technological inscription: How values and choices become embedded in systems
  - Making visible the social and political dimensions of technical artifacts

### Iatrogenic Methods
- Medical concept of harm caused by treatment itself, repurposed as deliberate intervention
  - Systematic damage as research methodology
  - Counterfactual reasoning through controlled failures

---

### About This Workshop

**Workshop by Laura Wagner**

🔗 Website: [laurajul.github.io](https://laurajul.github.io/)  
📦 Repository: [github.com/laurajul/latent-vandalism-workshop](https://github.com/laurajul/latent-vandalism-workshop)

For questions, feedback, or collaborations on productive damage to generative AI systems, please reach out via the website or repository.

---

*"The charm of AI weirdness is not just in the strange outputs, but in what those outputs reveal about the system that produced them."*