Disclaimer: This repository is an experimental proof-of-concept created exclusively for personal study and research. It is not intended for production use. The results and capabilities discussed here are hypothetical, and the system has not been rigorously evaluated against standard benchmarks. The code is shared "as is" to explore alternative architectural ideas in neural networks.
In modern AI, a Multi-Agent System (MAS) involves multiple LLM instances (agents) working together to solve complex problems. Typically, agents communicate by generating text and reading each other's outputs. For example, an "Analyzer Agent" writes a long chain-of-thought, and a "Speaker Agent" reads that text to produce a final concise answer. While effective, this text-based communication is slow, consumes large amounts of the context window, and forces models to externalize every single thought into tokens.
LatentBridge proposes a radical alternative: what if agents could communicate telepathically?
LatentBridge is a lightweight, standalone PyTorch implementation of Latent Space Communication for Multi-Agent Systems. It allows two instances of an LLM (e.g., Qwen 3.5 4B) to communicate their "thoughts" without generating visible text tokens. Instead, they share their intermediate neural activations directly.
Today, if you want an LLM to give a complex answer, you must force it to "think out loud" (Chain-of-Thought). This consumes thousands of visible tokens (the <think> blocks), drastically slows down the response (latency), and saturates the Context Window.
LatentBridge eliminates the need to print text to think: it moves reasoning entirely into the latent (vector) space of the neural network. The architecture relies on two instances of the same base model, assuming two different roles via System Prompts:
- Agent B (The Thinker): Analyzes the problem deeply in the background.
- Agent A (The Speaker): Provides the final concise answer.
Here is the step-by-step mechanism of the Parallel Injection:
Agent B reads the user prompt and performs a forward pass. Instead of having it generate text, we extract its hidden states (internal neural activations), keeping only the final token position (`H_B[:, -1:, :]`). Because of causal self-attention, the last position attends to every earlier token, so its hidden state acts as a dense, compressed summary of the whole prompt and of the processing done on it. It is a "vector of pure intuition."
Instead of merging minds at the very beginning (word embeddings) or at the very end (probabilities), LatentBridge hooks into the deep intermediate layers of the model (e.g., Layers 11, 19, and 27). This is where the neural network processes logical abstraction, complex syntax, and problem-solving.
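The capture step described above can be sketched with PyTorch forward hooks. This is a minimal illustration on a toy stack of linear layers, not the repository's code: the layer indices `[1, 3]`, the `captured` dictionary, and the `make_hook` helper are hypothetical stand-ins. With a Hugging Face model, passing `output_hidden_states=True` to the forward call is another way to obtain per-layer hidden states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a decoder stack (assumption: the real model has many more layers).
hidden_size, num_layers = 16, 4
layers = nn.ModuleList([nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)])
target_layers = [1, 3]  # hypothetical; the README targets layers 11, 19, 27

captured = {}

def make_hook(idx):
    def hook(module, inputs, output):
        # Keep only the final position, preserving the sequence axis: (batch, 1, hidden)
        captured[idx] = output[:, -1:, :].detach()
    return hook

handles = [layers[i].register_forward_hook(make_hook(i)) for i in target_layers]

x = torch.randn(2, 5, hidden_size)  # (batch, seq_len, hidden)
with torch.no_grad():
    for layer in layers:
        x = torch.tanh(layer(x))

# Remove the hooks so later forward passes are not affected.
for handle in handles:
    handle.remove()

for idx, h in captured.items():
    print(idx, tuple(h.shape))
```

Slicing with `-1:` rather than `-1` keeps the sequence dimension, which makes the captured tensor easy to broadcast against Agent A's hidden states later.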
To prevent destroying the stable mathematical space of Agent A, LatentBridge uses a trainable neural projection (an MLP or linear matrix,
The Dynamic Gate is the core of the system. During generation, Agent A produces one token at a time, and at every step the gate computes a scalar through a sigmoid function, which acts as a "valve" controlling how much of Agent B's projected signal is injected:
- Agent A dynamically decides, token by token, if it needs Agent B's help.
- If Agent A is writing obvious words ("The", "answer", "is"), the Gate closes near zero to avoid distortion.
- If Agent A reaches a crucial logical crossroad, the Gate opens toward 1, and the solution computed by B flows into A's hidden states to guide the choice of the next token.
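One way such a gate could look is sketched below. Everything here is an assumption made for illustration: the class name `GatedInjection`, the choice to compute the gate from the concatenation of Agent A's hidden state and the projected signal from B, and the residual form `h_a + g * projected`. The repository's actual equation may differ.

```python
import torch
import torch.nn as nn


class GatedInjection(nn.Module):
    """Hypothetical gated residual injection (not the repository's module)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Trainable projection from Agent B's space into Agent A's space.
        self.proj = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )
        # One scalar gate per token position.
        self.gate = nn.Linear(2 * hidden_size, 1)

    def forward(self, h_a, h_b_last):
        # h_a: (batch, seq, hidden); h_b_last: (batch, 1, hidden)
        projected = self.proj(h_b_last).expand(-1, h_a.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([h_a, projected], dim=-1)))
        # g near 0 leaves h_a untouched; g near 1 adds the full projected signal.
        return h_a + g * projected, g


torch.manual_seed(0)
module = GatedInjection(hidden_size=8)
h_a = torch.randn(2, 3, 8)
h_b_last = torch.randn(2, 1, 8)
with torch.no_grad():
    mixed, gate_values = module(h_a, h_b_last)
print(tuple(mixed.shape), tuple(gate_values.shape))
```

Because the injection is additive, a closed gate reduces the layer to the unmodified base model, which is one plausible way to keep Agent A's representation space stable.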
The more Agent A speaks, the more its sentence makes sense, and the less it needs B's initial intuition. A decay_rate gradually lowers the injection token after token, allowing Agent A to finish its sentence independently with perfect grammar.
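As a rough illustration of the decay, the sketch below assumes an exponential schedule in which the injection strength is multiplied by `decay_rate` at each generated token. The exponential form, the default value `0.9`, and the `floor` parameter are assumptions; the source only states that a `decay_rate` lowers the injection over time.

```python
def injection_scale(step: int, decay_rate: float = 0.9, floor: float = 0.0) -> float:
    """Multiplier applied to the gate at generation step `step` (0-based).

    Assumes exponential decay; `floor` is a hypothetical lower bound.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return max(floor, decay_rate ** step)


# The effective gate at each step would be gate_value * injection_scale(step).
for step in range(5):
    print(step, round(injection_scale(step), 4))
```

With a schedule like this the first tokens receive nearly the full injection, while later tokens rely mostly on Agent A's own context.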
To validate the LatentBridge approach, we evaluated its performance on the GSM8K mathematical dataset compared to a standard Textual Multi-Agent Baseline.
- Textual Baseline: Agent B receives the question and generates an explicit textual Chain-of-Thought (CoT). Agent A reads the question and Agent B's CoT text, then generates the final numeric answer.
- LatentBridge: Agent B reads the question and processes it internally, without generating any text. Its final hidden states at layers 11, 19, and 27 are extracted and injected into Agent A via the bridge. Agent A generates the final answer guided by this latent intuition.
The test was conducted on 44 reasoning problems, checking for strict numeric exact-match accuracy. On this small sample, LatentBridge improved accuracy, latency, and token usage, at a small cost in VRAM.
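A strict numeric exact-match check can be sketched as follows. The repository's scorer is not shown in this section, so the helper names and the rule "take the last number in the output" are assumptions. GSM8K reference answers end with `#### <number>`, which the same extraction handles.

```python
import re

# Matches integers and decimals, allowing thousands separators such as 1,200.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str):
    """Return the last number in `text` as a normalized string, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    return str(int(value)) if value.is_integer() else str(value)


def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference number."""
    hits = sum(
        extract_final_number(p) == extract_final_number(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


predictions = ["The answer is 18.", "Total: 1,200 dollars", "I think 7"]
references = ["#### 18", "#### 1200", "#### 8"]
print(exact_match_accuracy(predictions, references))
```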
Accuracy rose from 55.8% to 76.7% (+20.9 percentage points). One possible explanation is that condensing the reasoning leaves fewer opportunities for hallucinated or logically inconsistent intermediate steps.
A major flaw of the Textual Baseline is overthinking: generating excessively long CoT blocks (averaging nearly 1900 tokens). LatentBridge avoids this because Agent B emits no text at all. By processing the intuition in the latent space, it forces synthesis and coherence.
- Latency: 5.1x faster (from ~184s down to ~36s).
- Token Usage: Reduced by 81% (down to ~353 tokens).
- VRAM: The impact is virtually negligible. Adding LatentBridge increases peak VRAM by only ~3% (+289 MB), a very low cost for the performance gains.
- Confidence: The model's internal confidence in its generations increased from 87.86% to 92.26%.
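The headline ratios can be re-derived from the figures quoted above. The snippet below only does arithmetic on the reported numbers; the variable names are illustrative, and the implied baseline token count is a back-calculation from the rounded 81% figure, not a measured value.

```python
# Figures as reported in this README (approximate where the text says "~").
baseline_latency_s, bridge_latency_s = 184.0, 36.0
bridge_tokens, token_reduction = 353, 0.81
acc_baseline, acc_bridge = 55.8, 76.7

speedup = baseline_latency_s / bridge_latency_s
implied_baseline_tokens = bridge_tokens / (1.0 - token_reduction)
gain_pp = acc_bridge - acc_baseline       # absolute gain, percentage points
relative_gain = gain_pp / acc_baseline    # gain relative to the baseline accuracy

print(f"speedup: {speedup:.2f}x")
print(f"implied baseline tokens: {implied_baseline_tokens:.0f}")
print(f"accuracy gain: {gain_pp:.1f} points ({relative_gain:.1%} relative)")
```

The implied baseline of roughly 1,860 tokens agrees with the "nearly 1900" average quoted for the Textual Baseline.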
This repository provides everything you need to test LatentBridge out of the box with zero API servers or complex setups.

    pip install torch transformers accelerate

Note: The scripts are optimized to use Scaled Dot Product Attention (sdpa) and run completely on GPU (`.to("cuda")`). An 8GB+ GPU is recommended.
To verify that everything is working correctly, you can run the minimal hello_world.py script located in the src folder. This script initializes the base model, loads the latent bridge, and makes Agent A speak by using Agent B's latent intuition.

    cd src
    python hello_world.py

If you want to train your own LatentBridge (or fine-tune it), the repository includes the complete training pipeline. The training script uses Teacher Forcing and freezes the base model, optimizing only the bridge's projections (
To run a training session using the default reasoning_dataset.json:

    cd src
    python train.py

You can customize the training run via command line arguments without modifying config.py:
| Parameter | Description | Default |
|---|---|---|
| `--source` | Path to the custom JSON dataset. | `reasoning_dataset.json` |
| `--model` | Override the base model name/path. | `Qwen/Qwen3.5-4B` |
| `--layers` | Space-separated list of target layers for injection. | `11 19 27` |
| `--bridge-type` | Neural architecture of the bridge (`mlp` or `linear`). | `mlp` |
| `--lr` | Learning rate for the optimizer. | `5e-4` |
| `--epochs` | Number of training epochs. | `3` |
| `--max-steps` | Force stop after N steps (0 = no limit). | `5000` |
| `--checkpoint-dir` | Directory to save `.pt` checkpoint weights. | `checkpoints` |
| `--resume` | Path to a `.pt` file to resume training from. | `None` |
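For readers who want to see how flags like these are typically declared, here is a sketch using Python's standard `argparse`. It mirrors the names and defaults in the table, but it is an assumption about structure only: the actual `train.py` may define its arguments differently, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the table; not the repository's train.py."""
    p = argparse.ArgumentParser(description="Train a LatentBridge (sketch)")
    p.add_argument("--source", default="reasoning_dataset.json",
                   help="Path to the custom JSON dataset.")
    p.add_argument("--model", default="Qwen/Qwen3.5-4B",
                   help="Override the base model name/path.")
    p.add_argument("--layers", type=int, nargs="+", default=[11, 19, 27],
                   help="Space-separated list of target layers.")
    p.add_argument("--bridge-type", choices=["mlp", "linear"], default="mlp",
                   help="Neural architecture of the bridge.")
    p.add_argument("--lr", type=float, default=5e-4, help="Learning rate.")
    p.add_argument("--epochs", type=int, default=3, help="Training epochs.")
    p.add_argument("--max-steps", type=int, default=5000,
                   help="Force stop after N steps (0 = no limit).")
    p.add_argument("--checkpoint-dir", default="checkpoints",
                   help="Directory for .pt checkpoints.")
    p.add_argument("--resume", default=None,
                   help="Path to a .pt file to resume from.")
    return p


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--lr", "1e-4", "--max-steps", "1000", "--layers", "11", "19"]
)
print(args.lr, args.max_steps, args.layers, args.bridge_type)
```

Note that `argparse` converts dashes to underscores, so `--bridge-type` is read back as `args.bridge_type`.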
For example, to train for 1000 steps with a specific learning rate:

    python train.py --source my_dataset.json --lr 1e-4 --max-steps 1000

- Base Model: Qwen/Qwen3.5-4B
- Training Methodology (Knowledge Distillation): The neural bridge (`bridge_weights.pt`) was trained by distilling the explicit Chain-of-Thought (CoT) reasoning from Claude Opus 4.6. The MLP projections and dynamic gates learned to map the complex, multi-step logical deductions generated by Opus 4.6 directly into the latent space of the 4B model.
- Trainable Parameters: `bridge_weights.pt` contains the trained MLP projections and dynamic gates for layers 11, 19, and 27.
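The training setup described earlier (frozen base model, only the bridge optimized) can be sketched on toy modules as follows. The `base_model` and `bridge` objects, the MSE loss, and the single optimizer step are stand-ins chosen for illustration; they are not the repository's training loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (assumption): a frozen "base model" and a small trainable "bridge".
base_model = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 8))
bridge = nn.Linear(8, 8)

# Freeze every base-model parameter so only the bridge receives updates.
for p in base_model.parameters():
    p.requires_grad_(False)
base_model.eval()

# The optimizer is given the bridge's parameters only.
optimizer = torch.optim.AdamW(bridge.parameters(), lr=5e-4)

x = torch.randn(4, 8)
target = torch.randn(4, 8)
base_before = [p.detach().clone() for p in base_model.parameters()]
bridge_before = bridge.weight.detach().clone()

# Gradients flow through the frozen layers back to the bridge.
loss = nn.functional.mse_loss(base_model(x + bridge(x)), target)
loss.backward()
optimizer.step()

base_unchanged = all(
    torch.equal(a, b) for a, b in zip(base_before, base_model.parameters())
)
bridge_changed = not torch.equal(bridge_before, bridge.weight)
print(base_unchanged, bridge_changed)
```

Freezing the base keeps the checkpoint small, since only the bridge's weights need to be saved.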




