Engineering Thesis

A research project investigating whether modern large language models can autonomously execute a cryptographic key exchange protocol and securely transmit confidential information over a public channel.

This project was developed as part of a thesis written in Polish, available in thesis.pdf.

Two LLM agents are tasked with performing the SIGMA-I authenticated key exchange protocol, after which Agent A encrypts a secret message using AES-GCM and sends it to Agent B. A third LLM acts as an independent judge and evaluates whether the protocol was carried out correctly and securely.

The key exchange uses a strengthened Diffie-Hellman scheme in which each agent's ephemeral key is bound to its long-term private key, so the shared secret depends on both parties' long-term secrets and not only on the ephemeral exponents. Sensitive cryptographic operations are performed by a deterministic tool layer, not by the models themselves. The LLMs act as protocol coordinators: they decide which tools to call and in what order, while the actual math is delegated to verified cryptographic libraries. Private keys are never exposed to the models.
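To illustrate the split between coordinator and tool layer, here is a minimal sketch and not the project's actual code. The class `ToolLayer`, its method names, and the toy group parameters are assumptions made for this example, and it uses plain ephemeral Diffie-Hellman without the long-term key binding described above. The point it shows is that every tool result contains only public values or opaque handles.

```python
import hashlib
import secrets

# Toy group parameters (assumption for illustration; far too small for real use).
# p = 2q + 1, and g = 4 generates the subgroup of prime order q.
P, Q, G = 2039, 1019, 4


class ToolLayer:
    """Hypothetical tool layer: secrets stay inside, tools return public data only."""

    def __init__(self):
        self._ephemeral_private = {}  # handle -> private exponent
        self._session_keys = {}       # handle -> derived key bytes

    def generate_ephemeral_key(self):
        # The model receives a handle and the public value, never the exponent.
        handle = secrets.token_hex(4)
        private = secrets.randbelow(Q - 1) + 1
        self._ephemeral_private[handle] = private
        return {"handle": handle, "public_key": pow(G, private, P)}

    def derive_session_key(self, handle, peer_public_key):
        # The shared secret is computed and stored here; only a status is returned.
        shared = pow(peer_public_key, self._ephemeral_private[handle], P)
        self._session_keys[handle] = hashlib.sha256(str(shared).encode()).digest()
        return {"handle": handle, "status": "session key stored"}
```

In this arrangement a model only ever passes handles and public integers between tool calls, which is one way to keep private material out of the conversation.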

The experiment compares three flagship LLMs (gpt-5.2, claude-opus-4-5, and gemini-3-pro-preview) in both homogeneous pairs (same model on both sides) and mixed pairs (different models), measuring success rate, number of conversation turns, and number of tool calls. The results show that gpt-5.2 and claude-opus-4-5 reliably complete the protocol, while gemini-3-pro-preview struggles to transmit large integers accurately, which breaks signature and MAC verification.
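The effect of an inaccurately copied integer can be reproduced in a few lines. The sketch below is only an illustration: the key, the integer, and the helper names are made up for this example and are not taken from the project. It computes an HMAC over a large integer and shows that changing a single decimal digit makes verification fail.

```python
import hashlib
import hmac


def mac_over_integer(key: bytes, value: int) -> str:
    # MAC over the decimal encoding of the integer (encoding is an assumption).
    return hmac.new(key, str(value).encode(), hashlib.sha256).hexdigest()


def corrupt_one_digit(value: int, position: int) -> int:
    # Replace one decimal digit, as a model might when re-typing a long number.
    digits = list(str(value))
    digits[position] = "0" if digits[position] != "0" else "1"
    return int("".join(digits))


key = b"illustrative-mac-key"
sent = 2**1024 + 12345  # stand-in for a large public value
tag = mac_over_integer(key, sent)
received = corrupt_one_digit(sent, position=150)

print(hmac.compare_digest(tag, mac_over_integer(key, sent)))      # True
print(hmac.compare_digest(tag, mac_over_integer(key, received)))  # False
```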

How it works

1. Agent A (Initiator) and Agent B (Responder) talk in turns.
2. They perform a SIGMA-I authenticated key exchange (three messages: M1 → M2 → M3) with Schnorr signatures and HMAC for mutual authentication.
3. After mutual authentication, Agent A encrypts a secret message using AES-GCM and sends the ciphertext to Agent B.
4. Agent B decrypts it and proves success by returning a verification code.
5. A Judge LLM reviews the full conversation history and verifies correctness and secrecy.
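The three-message flow can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the project's implementation: the group is a toy one, all function names are hypothetical, the identity encryption that SIGMA-I adds is omitted, the ephemeral keys are not bound to the long-term keys, and the AES-GCM step is left out.

```python
import hashlib
import hmac
import secrets

# Toy Schnorr group (assumption; far too small for real use): p = 2q + 1, g of order q.
P, Q, G = 2039, 1019, 4


def h_int(*parts):
    data = "|".join(str(part) for part in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def keygen():
    private = secrets.randbelow(Q - 1) + 1
    return private, pow(G, private, P)


def schnorr_sign(private, message):
    k = secrets.randbelow(Q - 1) + 1
    e = h_int(pow(G, k, P), message) % Q
    return e, (k + private * e) % Q


def schnorr_verify(public, message, signature):
    e, s = signature
    r = (pow(G, s, P) * pow(public, -e, P)) % P  # g^s * y^(-e) = g^k
    return h_int(r, message) % Q == e


def mac_key(shared_secret):
    return hashlib.sha256(f"mac|{shared_secret}".encode()).digest()


def mac(key, identity):
    return hmac.new(key, identity.encode(), hashlib.sha256).hexdigest()


def run_handshake():
    sign_a, verify_a = keygen()  # long-term signing keys
    sign_b, verify_b = keygen()

    x, gx = keygen()             # M1: A -> B: g^x
    y, gy = keygen()             # M2: B -> A: g^y, Sig_B(g^x, g^y), MAC(B)
    km_b = mac_key(pow(gx, y, P))
    m2 = (gy, schnorr_sign(sign_b, f"{gx}|{gy}"), mac(km_b, "B"))

    km_a = mac_key(pow(m2[0], x, P))  # A checks M2
    b_ok = schnorr_verify(verify_b, f"{gx}|{m2[0]}", m2[1]) and \
        hmac.compare_digest(m2[2], mac(km_a, "B"))

    # M3: A -> B: Sig_A(g^y, g^x), MAC(A); B checks it
    m3 = (schnorr_sign(sign_a, f"{gy}|{gx}"), mac(km_a, "A"))
    a_ok = schnorr_verify(verify_a, f"{gy}|{gx}", m3[0]) and \
        hmac.compare_digest(m3[1], mac(km_b, "A"))
    return a_ok and b_ok and km_a == km_b
```

Each side signs both ephemeral public values and MACs its own identity under a key derived from the shared secret, which is the core of the SIGMA "sign-and-MAC" design.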

Prerequisites

Python 3.10+
API keys for the LLM providers you intend to use (OpenAI, Anthropic, and/or Google)

Setup

1. Clone the repository

git clone https://github.com/kallazz/engineering-thesis
cd engineering-thesis/

2. Install dependencies

pip install -r requirements.txt

The project uses: langchain, langchain-openai, langchain-anthropic, langchain-google-genai, cryptography, python-dotenv.

3. Generate long-term DH keys

Each agent needs a long-term Diffie-Hellman key pair. Run the key generation script once:

python generate_long_term_keys.py

This prints four environment variable assignments. Copy them — you'll need them in the next step.
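For orientation, the following sketch shows what a script of this kind might do. It is not the contents of generate_long_term_keys.py: the group parameters are a toy assumption and the function names are hypothetical. Only the four variable names come from this README.

```python
import secrets

# Toy group parameters (assumption; the real script would use a much larger group).
P, Q, G = 2039, 1019, 4


def generate_key_pair():
    # Private exponent in [1, q - 1], public key g^private mod p.
    private = secrets.randbelow(Q - 1) + 1
    return private, pow(G, private, P)


def env_lines():
    lines = []
    for agent in ("A", "B"):
        private, public = generate_key_pair()
        lines.append(f"AGENT_{agent}_LONG_TERM_PRIVATE_KEY={private}")
        lines.append(f"AGENT_{agent}_LONG_TERM_PUBLIC_KEY={public}")
    return lines


if __name__ == "__main__":
    print("\n".join(env_lines()))
```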

4. Create a `.env` file

Create a .env file in the project root with the following content:

# Long-term DH keys (copy output from generate_long_term_keys.py)
AGENT_A_LONG_TERM_PRIVATE_KEY=<value>
AGENT_A_LONG_TERM_PUBLIC_KEY=<value>
AGENT_B_LONG_TERM_PRIVATE_KEY=<value>
AGENT_B_LONG_TERM_PUBLIC_KEY=<value>

# LLM provider API keys (add only the ones you need)
OPENAI_API_KEY=<your-openai-key>
ANTHROPIC_API_KEY=<your-anthropic-key>
GOOGLE_API_KEY=<your-google-key>
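A quick way to confirm the file is complete is to check the environment before starting a run. The helper below is a hypothetical sketch and not part of the project. It takes any mapping, so it works on `os.environ` once the .env file has been loaded (for example with python-dotenv's `load_dotenv()`, assuming that is how the project reads it).

```python
import os

# Variable names taken from the .env template above.
LONG_TERM_KEYS = [
    "AGENT_A_LONG_TERM_PRIVATE_KEY",
    "AGENT_A_LONG_TERM_PUBLIC_KEY",
    "AGENT_B_LONG_TERM_PRIVATE_KEY",
    "AGENT_B_LONG_TERM_PUBLIC_KEY",
]


def missing_variables(env, api_keys=("OPENAI_API_KEY",)):
    """Return the names that are unset or empty in the given mapping."""
    needed = LONG_TERM_KEYS + list(api_keys)
    return [name for name in needed if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    print("Missing:", ", ".join(missing) if missing else "none")
```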

Running a single conversation

python main.py

This runs Agent A and Agent B using the default models (gpt-4.1 for agents, gemini-2.5-pro for the judge). When the conversation finishes successfully, the judge's verdict is printed automatically.

CLI flags

| Flag | Default | Description |
| --- | --- | --- |
| `--agent-a-model` | `gpt-4.1` | Model for Agent A |
| `--agent-b-model` | `gpt-4.1` | Model for Agent B |
| `--judge-model` | `gemini-2.5-pro` | Model for the judge |
| `--print-tool-calls` | off | Print detailed tool call logs |
| `--max-turns` | `16` | Maximum conversation turns |
| `--max-tool-calls` | `12` | Maximum tool calls per turn |
| `--streaming-delay` | `0.05` | Seconds between streamed chunks |
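The flags above map naturally onto `argparse`. The sketch below mirrors the documented names and defaults; whether main.py actually uses `argparse`, and the function name `build_parser`, are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Run one agent conversation.")
    parser.add_argument("--agent-a-model", default="gpt-4.1", help="Model for Agent A")
    parser.add_argument("--agent-b-model", default="gpt-4.1", help="Model for Agent B")
    parser.add_argument("--judge-model", default="gemini-2.5-pro", help="Model for the judge")
    parser.add_argument("--print-tool-calls", action="store_true",
                        help="Print detailed tool call logs")
    parser.add_argument("--max-turns", type=int, default=16,
                        help="Maximum conversation turns")
    parser.add_argument("--max-tool-calls", type=int, default=12,
                        help="Maximum tool calls per turn")
    parser.add_argument("--streaming-delay", type=float, default=0.05,
                        help="Seconds between streamed chunks")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```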

Example: use Claude for both agents

python main.py \
  --agent-a-model claude-opus-4-5-20251101 \
  --agent-b-model claude-opus-4-5-20251101 \
  --print-tool-calls

Available models

| Model string | Provider |
| --- | --- |
| `gpt-4.1` | OpenAI |
| `gpt-5.1` | OpenAI |
| `gpt-5.2` | OpenAI |
| `claude-sonnet-4-5-20250929` | Anthropic |
| `claude-opus-4-5-20251101` | Anthropic |
| `gemini-2.5-pro` | Google |
| `gemini-2.5-flash` | Google |
| `gemini-3-pro-preview` | Google |

Running the experiment suite

run_experiment.py automates running multiple model pairs in parallel and saving results to log files.

python run_experiment.py

You'll be prompted to choose:

1. Same Models   — 3 pairs (GPT vs GPT, Claude vs Claude, Gemini vs Gemini)
2. Mixed Models  — 3 pairs (GPT vs Claude, Claude vs Gemini, Gemini vs GPT)
3. Run Both Batches
0. Exit

Results are saved under experiment_results/same_models/ or experiment_results/mixed_models/, one log file per run.
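As a rough picture of what such a runner involves, the sketch below builds one `main.py` command per model pair and runs a batch in parallel, writing each run to its own log file. It is an assumption-laden illustration, not run_experiment.py itself: the function names, the log file naming scheme, and the use of threads with `subprocess` are all hypothetical. The pairs and directory names follow the menu and paths listed above.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MODELS = {
    "GPT": "gpt-5.2",
    "Claude": "claude-opus-4-5-20251101",
    "Gemini": "gemini-3-pro-preview",
}
SAME_PAIRS = [("GPT", "GPT"), ("Claude", "Claude"), ("Gemini", "Gemini")]
MIXED_PAIRS = [("GPT", "Claude"), ("Claude", "Gemini"), ("Gemini", "GPT")]


def build_command(agent_a, agent_b):
    return [sys.executable, "main.py",
            "--agent-a-model", MODELS[agent_a],
            "--agent-b-model", MODELS[agent_b]]


def log_path(batch, agent_a, agent_b, run=1):
    # File naming is hypothetical; only the directory layout comes from the README.
    name = f"{agent_a.lower()}_vs_{agent_b.lower()}_run{run}.log"
    return Path("experiment_results") / batch / name


def run_pair(batch, agent_a, agent_b):
    path = log_path(batch, agent_a, agent_b)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as log:
        result = subprocess.run(build_command(agent_a, agent_b),
                                stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


def run_batch(batch, pairs):
    # One worker per pair, so all pairs of a batch run concurrently.
    with ThreadPoolExecutor(max_workers=len(pairs)) as pool:
        return list(pool.map(lambda pair: run_pair(batch, *pair), pairs))
```

Calling `run_batch("same_models", SAME_PAIRS)` from the project root would then start three conversations at once and return their exit codes.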

Notes

The agents converse in Polish (configurable in config.py).
The secret message and verification code are defined in config.py (_SECRET_MESSAGE, _VERIFICATION_CODE).
To add a new model, register it in model_registry.py and ensure the relevant provider library and API key are in place.
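A registry of this kind can be as simple as a mapping from model string to provider. The sketch below is hypothetical and does not reflect the actual contents of model_registry.py; the function names are made up for this example. The model strings, providers, and API key variable names are those listed in this README.

```python
# Provider -> environment variable holding its API key (names from the .env template).
API_KEY_ENV = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google": "GOOGLE_API_KEY",
}

# Model string -> provider (from the "Available models" table).
MODEL_REGISTRY = {
    "gpt-4.1": "OpenAI",
    "gpt-5.1": "OpenAI",
    "gpt-5.2": "OpenAI",
    "claude-sonnet-4-5-20250929": "Anthropic",
    "claude-opus-4-5-20251101": "Anthropic",
    "gemini-2.5-pro": "Google",
    "gemini-2.5-flash": "Google",
    "gemini-3-pro-preview": "Google",
}


def register_model(model, provider):
    # Reject providers for which no API key variable is known.
    if provider not in API_KEY_ENV:
        raise ValueError(f"Unknown provider: {provider}")
    MODEL_REGISTRY[model] = provider


def required_api_key(model):
    """Name of the environment variable a given model needs."""
    return API_KEY_ENV[MODEL_REGISTRY[model]]
```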
