A research project investigating whether modern large language models can autonomously execute a cryptographic key exchange protocol and securely transmit confidential information over a public channel.
This project was developed as part of a thesis written in Polish, available in thesis.pdf.
Two LLM agents are tasked with performing the SIGMA-I authenticated key exchange protocol, after which Agent A encrypts a secret message using AES-GCM and sends it to Agent B. A third LLM acts as an independent judge and evaluates whether the protocol was carried out correctly and securely.
The key exchange uses a strengthened Diffie-Hellman scheme where ephemeral keys are bound to each agent's long-term private key, making the shared secret depend on both parties' long-term secrets rather than just the ephemeral exponents. Sensitive cryptographic operations are performed by a deterministic tool layer rather than by the models themselves — the LLMs act as protocol coordinators, deciding which tools to call and in what order, while the actual math is delegated to verified cryptographic libraries. Private keys are never exposed to the models.
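The separation between coordinator and tool layer can be illustrated with a small sketch. It is not the project's implementation: the `KeyVault` class, its method names, and the toy group parameters are hypothetical, and it uses plain Diffie-Hellman rather than the strengthened variant described above. The point is only that callers (the LLMs) receive public values and opaque handles, never private exponents or derived keys.

```python
import hashlib
import secrets

# Toy group parameters, for illustration only (far too small for real use).
P = 2**127 - 1  # a Mersenne prime
G = 3


class KeyVault:
    """Hypothetical tool layer: private material never leaves this object."""

    def __init__(self):
        self._private = {}  # handle -> private exponent
        self._shared = {}   # key handle -> derived key bytes

    def generate_keypair(self):
        # The caller gets a handle and the public value, not the exponent.
        handle = secrets.token_hex(4)
        exponent = secrets.randbelow(P - 3) + 2
        self._private[handle] = exponent
        return {"handle": handle, "public": pow(G, exponent, P)}

    def derive_shared_key(self, handle, peer_public):
        # The derived key is stored internally; only a handle is returned.
        secret = pow(peer_public, self._private[handle], P)
        key_handle = "key-" + secrets.token_hex(4)
        self._shared[key_handle] = hashlib.sha256(str(secret).encode()).digest()
        return {"key_handle": key_handle}


vault_a, vault_b = KeyVault(), KeyVault()
a = vault_a.generate_keypair()
b = vault_b.generate_keypair()
ka = vault_a.derive_shared_key(a["handle"], b["public"])
kb = vault_b.derive_shared_key(b["handle"], a["public"])
```

Both vaults end up holding the same key internally, while the values visible to a caller are limited to `a["public"]`, `b["public"]`, and the handles.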
The experiment compares three flagship LLMs — gpt-5.2, claude-opus-4-5, and gemini-3-pro-preview — in both homogeneous pairs (same model on both sides) and mixed pairs (different models), measuring success rate, number of conversation turns, and number of tool calls. Results show that gpt-5.2 and claude-opus-4-5 reliably complete the protocol, while gemini-3-pro-preview struggles with accurate transmission of large integers, which breaks signature and MAC verification.
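To see why inaccurate transmission of a large integer is fatal, consider the following sketch. The key and the integer are made-up values, and the `tag` and `verify` helpers are hypothetical; the example only shows that a MAC computed over the original value no longer verifies once a single digit changes.

```python
import hashlib
import hmac

key = b"toy session key"  # illustrative key, not from the project
public_value = 1234567890123456789012345678901234567890


def tag(value):
    """HMAC-SHA256 over the decimal representation of an integer."""
    return hmac.new(key, str(value).encode(), hashlib.sha256).hexdigest()


def verify(value, expected_tag):
    return hmac.compare_digest(tag(value), expected_tag)


sent_tag = tag(public_value)

# Simulate a model that copies the number with one wrong digit.
garbled_value = public_value + 10**20

ok_original = verify(public_value, sent_tag)
ok_garbled = verify(garbled_value, sent_tag)
```

Here `ok_original` is true and `ok_garbled` is false, so a single miscopied digit is enough to abort the protocol.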
- Agent A (Initiator) and Agent B (Responder) talk in turns.
- They perform a SIGMA-I authenticated key exchange (three messages: M1 → M2 → M3) with Schnorr signatures + HMAC for mutual authentication.
- After mutual authentication, Agent A encrypts a secret message using AES-GCM and sends the ciphertext to Agent B.
- Agent B decrypts it and proves success by returning a verification code.
- A Judge LLM reviews the full conversation history and verifies correctness and secrecy.
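The three-message flow above can be sketched with the standard library alone. This is a simplified illustration, not the project's code: it uses a tiny toy group (p = 2039, q = 1019), separate signing keys instead of the long-term DH keys, a hypothetical key derivation, and it omits both the encryption of the identity payloads and the final AES-GCM step (which would need the cryptography package).

```python
import hashlib
import hmac
import secrets

# Toy Schnorr group: P = 2*Q + 1, G generates the subgroup of order Q.
# Illustration only; these sizes provide no security.
P, Q, G = 2039, 1019, 4


def keygen():
    secret = secrets.randbelow(Q - 1) + 1
    return secret, pow(G, secret, P)


def _challenge(commitment, message):
    digest = hashlib.sha256(f"{commitment}|{message}".encode()).digest()
    return int.from_bytes(digest, "big") % Q


def sign(secret, message):
    """Schnorr signature (e, s) with s = k + secret * e mod Q."""
    k = secrets.randbelow(Q - 1) + 1
    e = _challenge(pow(G, k, P), message)
    return e, (k + secret * e) % Q


def verify(public, message, signature):
    e, s = signature
    commitment = (pow(G, s, P) * pow(public, -e, P)) % P  # equals g^k
    return _challenge(commitment, message) == e


def mac_key(shared_secret):
    return hashlib.sha256(f"mac|{shared_secret}".encode()).digest()


def mac(key, identity):
    return hmac.new(key, identity.encode(), hashlib.sha256).hexdigest()


# Long-term signing keys; each side knows the other's public key.
a_sk, a_pk = keygen()
b_sk, b_pk = keygen()

# M1: A -> B, ephemeral public value g^x.
x, gx = keygen()

# M2: B -> A, g^y plus a signature over (g^x, g^y) and a MAC over B's identity.
y, gy = keygen()
kb = mac_key(pow(gx, y, P))
m2 = {"gy": gy, "sig": sign(b_sk, f"{gx}|{gy}"), "mac": mac(kb, "B")}

# A derives the same key and checks M2.
ka = mac_key(pow(m2["gy"], x, P))
a_accepts = verify(b_pk, f"{gx}|{gy}", m2["sig"]) and hmac.compare_digest(
    m2["mac"], mac(ka, "B")
)

# M3: A -> B, signature over (g^y, g^x) and a MAC over A's identity.
m3 = {"sig": sign(a_sk, f"{gy}|{gx}"), "mac": mac(ka, "A")}
b_accepts = verify(a_pk, f"{gy}|{gx}", m3["sig"]) and hmac.compare_digest(
    m3["mac"], mac(kb, "A")
)
```

If both `a_accepts` and `b_accepts` are true, the parties share a key and have authenticated each other; only then would Agent A encrypt the secret message.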
- Python 3.10+
- API keys for the LLM providers you intend to use (OpenAI, Anthropic, and/or Google)
git clone https://github.com/kallazz/engineering-thesis
cd engineering-thesis/
pip install -r requirements.txt

The project uses: langchain, langchain-openai, langchain-anthropic, langchain-google-genai, cryptography, python-dotenv.
Each agent needs a long-term Diffie-Hellman key pair. Run the key generation script once:
python generate_long_term_keys.py

This prints four environment variable assignments. Copy them; you'll need them in the next step.
Create a .env file in the project root with the following content:
# Long-term DH keys (copy output from generate_long_term_keys.py)
AGENT_A_LONG_TERM_PRIVATE_KEY=<value>
AGENT_A_LONG_TERM_PUBLIC_KEY=<value>
AGENT_B_LONG_TERM_PRIVATE_KEY=<value>
AGENT_B_LONG_TERM_PUBLIC_KEY=<value>
# LLM provider API keys (add only the ones you need)
OPENAI_API_KEY=<your-openai-key>
ANTHROPIC_API_KEY=<your-anthropic-key>
GOOGLE_API_KEY=<your-google-key>

python main.py

This runs Agent A and Agent B using the default models (gpt-4.1 for agents, gemini-2.5-pro for the judge). When the conversation finishes successfully, the judge's verdict is printed automatically.
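Before the first run it can be worth checking that the .env file is complete. The following is a minimal sketch using only the standard library; the `parse_env` and `missing_keys` helpers are hypothetical and not part of the project, which loads the file through python-dotenv. Provider API keys are left out of the check because only the ones you use are needed.

```python
REQUIRED_KEYS = [
    "AGENT_A_LONG_TERM_PRIVATE_KEY",
    "AGENT_A_LONG_TERM_PUBLIC_KEY",
    "AGENT_B_LONG_TERM_PRIVATE_KEY",
    "AGENT_B_LONG_TERM_PUBLIC_KEY",
]


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip()
    return values


def missing_keys(text):
    """Return required names that are absent, empty, or still a <placeholder>."""
    values = parse_env(text)
    missing = []
    for name in REQUIRED_KEYS:
        value = values.get(name, "")
        if not value or value.startswith("<"):
            missing.append(name)
    return missing


sample = """# Long-term DH keys
AGENT_A_LONG_TERM_PRIVATE_KEY=123
AGENT_A_LONG_TERM_PUBLIC_KEY=<value>
"""
problems = missing_keys(sample)
```

For a real check, pass the file contents instead of `sample`, for example `missing_keys(open(".env").read())`.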
| Flag | Default | Description |
|---|---|---|
| --agent-a-model | gpt-4.1 | Model for Agent A |
| --agent-b-model | gpt-4.1 | Model for Agent B |
| --judge-model | gemini-2.5-pro | Model for the judge |
| --print-tool-calls | off | Print detailed tool call logs |
| --max-turns | 16 | Maximum conversation turns |
| --max-tool-calls | 12 | Maximum tool calls per turn |
| --streaming-delay | 0.05 | Seconds between streamed chunks |
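For readers who want to script around these options, the table maps naturally onto argparse. The sketch below is an assumption about how such flags could be declared, with the `build_parser` function being hypothetical; the actual main.py may define them differently.

```python
import argparse


def build_parser():
    """Declare the documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(description="Run the two-agent key exchange.")
    parser.add_argument("--agent-a-model", default="gpt-4.1", help="Model for Agent A")
    parser.add_argument("--agent-b-model", default="gpt-4.1", help="Model for Agent B")
    parser.add_argument("--judge-model", default="gemini-2.5-pro", help="Model for the judge")
    parser.add_argument("--print-tool-calls", action="store_true",
                        help="Print detailed tool call logs")
    parser.add_argument("--max-turns", type=int, default=16,
                        help="Maximum conversation turns")
    parser.add_argument("--max-tool-calls", type=int, default=12,
                        help="Maximum tool calls per turn")
    parser.add_argument("--streaming-delay", type=float, default=0.05,
                        help="Seconds between streamed chunks")
    return parser


defaults = build_parser().parse_args([])
custom = build_parser().parse_args(["--max-turns", "8", "--print-tool-calls"])
```

argparse converts dashes to underscores, so the values are read as `custom.max_turns`, `custom.print_tool_calls`, and so on.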
python main.py \
--agent-a-model claude-opus-4-5-20251101 \
--agent-b-model claude-opus-4-5-20251101 \
--print-tool-calls

| Model string | Provider |
|---|---|
| gpt-4.1 | OpenAI |
| gpt-5.1 | OpenAI |
| gpt-5.2 | OpenAI |
| claude-sonnet-4-5-20250929 | Anthropic |
| claude-opus-4-5-20251101 | Anthropic |
| gemini-2.5-pro | Google |
| gemini-2.5-flash | Google |
| gemini-3-pro-preview | Google |
run_experiment.py automates running multiple model pairs in parallel and saving results to log files.
python run_experiment.py

You'll be prompted to choose:
1. Same Models — 3 pairs (GPT vs GPT, Claude vs Claude, Gemini vs Gemini)
2. Mixed Models — 3 pairs (GPT vs Claude, Claude vs Gemini, Gemini vs GPT)
3. Run Both Batches
0. Exit
Results are saved under experiment_results/same_models/ or experiment_results/mixed_models/, one log file per run.
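The batch structure can be sketched as follows. This is not the contents of run_experiment.py: the `plan_runs` and `run_all` helpers, the log file naming, the exact model strings, and the choice of which model plays Agent A in a mixed pair are all assumptions made for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumed model strings for the three compared models.
GPT, CLAUDE, GEMINI = "gpt-5.2", "claude-opus-4-5-20251101", "gemini-3-pro-preview"

BATCHES = {
    "same_models": [(GPT, GPT), (CLAUDE, CLAUDE), (GEMINI, GEMINI)],
    "mixed_models": [(GPT, CLAUDE), (CLAUDE, GEMINI), (GEMINI, GPT)],
}


def plan_runs(batch, results_root="experiment_results"):
    """Build one (command, log_path) pair per model pair in the batch."""
    runs = []
    for agent_a, agent_b in BATCHES[batch]:
        command = [sys.executable, "main.py",
                   "--agent-a-model", agent_a,
                   "--agent-b-model", agent_b]
        log_path = Path(results_root) / batch / f"{agent_a}_vs_{agent_b}.log"
        runs.append((command, log_path))
    return runs


def run_all(batch):
    """Run every pair in parallel, writing each run's output to its log file."""
    def run_one(item):
        command, log_path = item
        log_path.parent.mkdir(parents=True, exist_ok=True)
        with log_path.open("w") as log:
            return subprocess.run(command, stdout=log, stderr=subprocess.STDOUT).returncode

    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_one, plan_runs(batch)))


planned = plan_runs("mixed_models")
```

Threads are sufficient here because each run spends its time waiting on a subprocess; calling `run_all("same_models")` would start the three homogeneous pairs at once.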
- The agents converse in Polish (configurable in config.py).
- The secret message and verification code are defined in config.py (_SECRET_MESSAGE, _VERIFICATION_CODE).
- To add a new model, register it in model_registry.py and ensure the relevant provider library and API key are in place.
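As a rough picture of what registering a model involves, a registry can be as simple as a mapping from model string to provider. The names below (`MODEL_PROVIDERS`, `API_KEY_ENV`, `required_api_key`) are hypothetical and the real model_registry.py may be organized differently.

```python
# Hypothetical registry: model string -> provider.
MODEL_PROVIDERS = {
    "gpt-4.1": "openai",
    "gpt-5.1": "openai",
    "gpt-5.2": "openai",
    "claude-sonnet-4-5-20250929": "anthropic",
    "claude-opus-4-5-20251101": "anthropic",
    "gemini-2.5-pro": "google",
    "gemini-2.5-flash": "google",
    "gemini-3-pro-preview": "google",
}

# Environment variable holding each provider's API key.
API_KEY_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def required_api_key(model):
    """Return the name of the environment variable a model needs."""
    try:
        provider = MODEL_PROVIDERS[model]
    except KeyError:
        raise ValueError(f"Unknown model {model!r}; register it first") from None
    return API_KEY_ENV[provider]


key_name = required_api_key("claude-opus-4-5-20251101")
```

Under this layout, adding a model is one new entry in `MODEL_PROVIDERS`, plus a new provider entry if the model comes from a provider not yet listed.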