Engineering Thesis

A research project investigating whether modern large language models can autonomously execute a cryptographic key exchange protocol and securely transmit confidential information over a public channel.

This project was developed as part of a thesis written in Polish, available in thesis.pdf.

Two LLM agents are tasked with performing the SIGMA-I authenticated key exchange protocol, after which Agent A encrypts a secret message using AES-GCM and sends it to Agent B. A third LLM acts as an independent judge and evaluates whether the protocol was carried out correctly and securely.
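As a rough illustration of the final encryption step, the sketch below encrypts and decrypts a message with AES-GCM using the cryptography package listed in the dependencies. The key is generated locally purely for demonstration (in the project it would come out of the SIGMA-I session), and the helper names encrypt_message and decrypt_message are hypothetical, not taken from the repository.

```python
import os

from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def encrypt_message(key: bytes, plaintext: bytes, aad: bytes = b"") -> tuple[bytes, bytes]:
    """Encrypt with AES-GCM and return (nonce, ciphertext_with_tag)."""
    nonce = os.urandom(12)  # 96-bit nonce; must never repeat under the same key
    return nonce, AESGCM(key).encrypt(nonce, plaintext, aad)


def decrypt_message(key: bytes, nonce: bytes, ciphertext: bytes, aad: bytes = b"") -> bytes:
    """Decrypt and authenticate; raises InvalidTag if anything was altered."""
    return AESGCM(key).decrypt(nonce, ciphertext, aad)


# Demo key only: a stand-in for the session key agreed during the handshake.
key = AESGCM.generate_key(bit_length=256)
nonce, ciphertext = encrypt_message(key, b"example secret", aad=b"A->B")
recovered = decrypt_message(key, nonce, ciphertext, aad=b"A->B")

# Flipping a single bit of the ciphertext must make authentication fail.
tampered = bytes([ciphertext[0] ^ 1]) + ciphertext[1:]
try:
    decrypt_message(key, nonce, tampered, aad=b"A->B")
    tamper_detected = False
except InvalidTag:
    tamper_detected = True
```

Because GCM authenticates as well as encrypts, a ciphertext that was altered in transit is rejected outright rather than decrypted to garbage.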

The key exchange uses a strengthened Diffie-Hellman scheme where ephemeral keys are bound to each agent's long-term private key, making the shared secret depend on both parties' long-term secrets rather than just the ephemeral exponents. Sensitive cryptographic operations are performed by a deterministic tool layer rather than by the models themselves — the LLMs act as protocol coordinators, deciding which tools to call and in what order, while the actual math is delegated to verified cryptographic libraries. Private keys are never exposed to the models.
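The separation between the models and the tool layer can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's code: the tool names, the dispatch function, and the toy group parameters are all hypothetical, and the binding of ephemeral keys to long-term keys is not modelled. The point is only that the private key lives inside a closure, and the model sees nothing but tool names, public values, and an opaque fingerprint.

```python
import hashlib
import secrets

# Toy group parameters for illustration only; far too small to be secure.
P = 2**127 - 1
G = 3


def make_tools(private_key: int) -> dict:
    """Build tool functions that close over the private key without exposing it."""

    def get_public_key() -> int:
        return pow(G, private_key, P)

    def derive_session_fingerprint(peer_public: int) -> str:
        # The shared secret stays inside the tool; only a short hash is returned.
        shared = pow(peer_public, private_key, P)
        return hashlib.sha256(f"fingerprint|{shared}".encode()).hexdigest()[:16]

    return {
        "get_public_key": get_public_key,
        "derive_session_fingerprint": derive_session_fingerprint,
    }


def dispatch(tools: dict, call: dict):
    """Execute a tool call of the form {'tool': name, 'args': {...}}."""
    return tools[call["tool"]](**call.get("args", {}))


tools_a = make_tools(secrets.randbelow(P - 2) + 1)
tools_b = make_tools(secrets.randbelow(P - 2) + 1)

pub_a = dispatch(tools_a, {"tool": "get_public_key"})
pub_b = dispatch(tools_b, {"tool": "get_public_key"})

fp_a = dispatch(tools_a, {"tool": "derive_session_fingerprint", "args": {"peer_public": pub_b}})
fp_b = dispatch(tools_b, {"tool": "derive_session_fingerprint", "args": {"peer_public": pub_a}})
```

Both sides arrive at the same fingerprint even though neither private key ever appears in a tool result.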

The experiment compares three flagship LLMs — gpt-5.2, claude-opus-4-5, and gemini-3-pro-preview — in both homogeneous pairs (same model on both sides) and mixed pairs (different models), measuring success rate, number of conversation turns, and number of tool calls. Results show that gpt-5.2 and claude-opus-4-5 reliably complete the protocol, while gemini-3-pro-preview struggles with accurate transmission of large integers, which breaks signature and MAC verification.
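To see why a single mistyped digit is enough to break verification, consider the small sketch below. It computes an HMAC over a large integer and then checks the tag against a copy of the integer whose last digit was changed. The key, the integer, and the helper name are made up for the example.

```python
import hashlib
import hmac


def mac_over_integer(key: bytes, value: int) -> str:
    """HMAC-SHA256 over the decimal representation of an integer."""
    return hmac.new(key, str(value).encode(), hashlib.sha256).hexdigest()


key = b"demo-mac-key"  # placeholder key for the example
sent = 123456789012345678901234567890123456789
tag = mac_over_integer(key, sent)

# Simulate a transcription error: the last digit changes from 9 to 0.
received = int(str(sent)[:-1] + "0")

verifies_exact = hmac.compare_digest(tag, mac_over_integer(key, sent))
verifies_corrupted = hmac.compare_digest(tag, mac_over_integer(key, received))
```

The exact copy verifies and the corrupted one does not, so any model that cannot relay long integers verbatim will fail the protocol regardless of how well it follows the message order.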


How it works

  1. Agent A (Initiator) and Agent B (Responder) talk in turns.
  2. They perform a SIGMA-I authenticated key exchange (three messages: M1 → M2 → M3), using Schnorr signatures and HMAC for mutual authentication.
  3. After mutual authentication, Agent A encrypts a secret message using AES-GCM and sends the ciphertext to Agent B.
  4. Agent B decrypts it and proves success by returning a verification code.
  5. A Judge LLM reviews the full conversation history and verifies correctness and secrecy.
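The three-message handshake from step 2 can be sketched in a few dozen lines. This is a simplified illustration, not the project's implementation: it uses toy group parameters, omits the encryption that SIGMA-I applies to the identity, signature, and MAC in M2 and M3, and does not model the binding of ephemeral keys to long-term keys described earlier. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets

# Toy parameters for illustration only; NOT secure.
P = 2**127 - 1  # a Mersenne prime
N = P - 1       # a multiple of the order of G, used to reduce exponents
G = 3


def h_int(*parts) -> int:
    data = "|".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def keygen() -> tuple[int, int]:
    x = secrets.randbelow(N - 1) + 1
    return x, pow(G, x, P)


def sign(x: int, msg: str) -> tuple[int, int]:
    """Schnorr signature (e, s) with s = k + e*x mod N."""
    k = secrets.randbelow(N - 1) + 1
    e = h_int(pow(G, k, P), msg) % N
    return e, (k + e * x) % N


def verify(y: int, msg: str, sig: tuple[int, int]) -> bool:
    e, s = sig
    r = pow(G, s, P) * pow(y, N - e, P) % P  # g^s * y^(-e) recovers g^k
    return h_int(r, msg) % N == e


def mac_key(shared: int) -> bytes:
    return hashlib.sha256(f"mac|{shared}".encode()).digest()


def mac(key: bytes, identity: str) -> str:
    return hmac.new(key, identity.encode(), hashlib.sha256).hexdigest()


def run_handshake() -> bool:
    lt_a, lt_pub_a = keygen()  # long-term keys of A
    lt_b, lt_pub_b = keygen()  # long-term keys of B

    # M1: A -> B, ephemeral public value g^x
    x, gx = keygen()

    # M2: B -> A, g^y plus B's identity, signature, and MAC
    y, gy = keygen()
    km_b = mac_key(pow(gx, y, P))
    m2 = {"gy": gy, "id": "B", "sig": sign(lt_b, f"{gx}|{gy}"), "mac": mac(km_b, "B")}

    # A derives the same MAC key and checks M2
    km_a = mac_key(pow(m2["gy"], x, P))
    ok_b = verify(lt_pub_b, f"{gx}|{m2['gy']}", m2["sig"]) and hmac.compare_digest(
        m2["mac"], mac(km_a, m2["id"])
    )

    # M3: A -> B, A's identity, signature, and MAC; B checks it
    m3 = {"id": "A", "sig": sign(lt_a, f"{gy}|{gx}"), "mac": mac(km_a, "A")}
    ok_a = verify(lt_pub_a, f"{gy}|{gx}", m3["sig"]) and hmac.compare_digest(
        m3["mac"], mac(km_b, m3["id"])
    )
    return ok_a and ok_b


handshake_ok = run_handshake()
```

Each side signs both ephemeral values, which ties the signature to this particular session, while the MAC under the derived key proves that the signer also knows the shared secret.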

Prerequisites

  • Python 3.10+
  • API keys for the LLM providers you intend to use (OpenAI, Anthropic, and/or Google)

Setup

1. Clone the repository

git clone https://github.com/kallazz/engineering-thesis
cd engineering-thesis/

2. Install dependencies

pip install -r requirements.txt

The project uses: langchain, langchain-openai, langchain-anthropic, langchain-google-genai, cryptography, python-dotenv.

3. Generate long-term DH keys

Each agent needs a long-term Diffie-Hellman key pair. Run the key generation script once:

python generate_long_term_keys.py

This prints four environment variable assignments. Copy them — you'll need them in the next step.

4. Create a .env file

Create a .env file in the project root with the following content:

# Long-term DH keys (copy output from generate_long_term_keys.py)
AGENT_A_LONG_TERM_PRIVATE_KEY=<value>
AGENT_A_LONG_TERM_PUBLIC_KEY=<value>
AGENT_B_LONG_TERM_PRIVATE_KEY=<value>
AGENT_B_LONG_TERM_PUBLIC_KEY=<value>

# LLM provider API keys (add only the ones you need)
OPENAI_API_KEY=<your-openai-key>
ANTHROPIC_API_KEY=<your-anthropic-key>
GOOGLE_API_KEY=<your-google-key>
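Before starting a run it can help to confirm that the variables above are actually set. The snippet below is a small hypothetical helper, not part of the repository; it assumes the .env values have already been loaded into the process environment, for example with python-dotenv's load_dotenv().

```python
import os

REQUIRED_KEYS = [
    "AGENT_A_LONG_TERM_PRIVATE_KEY",
    "AGENT_A_LONG_TERM_PUBLIC_KEY",
    "AGENT_B_LONG_TERM_PRIVATE_KEY",
    "AGENT_B_LONG_TERM_PUBLIC_KEY",
]

PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google": "GOOGLE_API_KEY",
}


def check_environment(env) -> tuple[list[str], list[str]]:
    """Return (missing long-term key variables, providers with an API key set)."""
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    providers = [provider for provider, var in PROVIDER_KEYS.items() if env.get(var)]
    return missing, providers


missing, providers = check_environment(os.environ)
if missing:
    print("Missing long-term keys:", ", ".join(missing))
print("Providers with API keys:", ", ".join(providers) or "none")
```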

Running a single conversation

python main.py

This runs Agent A and Agent B using the default models (gpt-4.1 for agents, gemini-2.5-pro for the judge). When the conversation finishes successfully, the judge's verdict is printed automatically.

CLI flags

| Flag | Default | Description |
|---|---|---|
| --agent-a-model | gpt-4.1 | Model for Agent A |
| --agent-b-model | gpt-4.1 | Model for Agent B |
| --judge-model | gemini-2.5-pro | Model for the judge |
| --print-tool-calls | off | Print detailed tool call logs |
| --max-turns | 16 | Maximum conversation turns |
| --max-tool-calls | 12 | Maximum tool calls per turn |
| --streaming-delay | 0.05 | Seconds between streamed chunks |
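For readers who want to see how these flags and defaults fit together, here is one way they could be declared with argparse. The parser in main.py is not shown in this README, so treat this as a sketch that mirrors the flag list rather than the actual source; the function name build_parser is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run one agent-to-agent conversation.")
    parser.add_argument("--agent-a-model", default="gpt-4.1", help="Model for Agent A")
    parser.add_argument("--agent-b-model", default="gpt-4.1", help="Model for Agent B")
    parser.add_argument("--judge-model", default="gemini-2.5-pro", help="Model for the judge")
    parser.add_argument(
        "--print-tool-calls", action="store_true", help="Print detailed tool call logs"
    )
    parser.add_argument("--max-turns", type=int, default=16, help="Maximum conversation turns")
    parser.add_argument(
        "--max-tool-calls", type=int, default=12, help="Maximum tool calls per turn"
    )
    parser.add_argument(
        "--streaming-delay", type=float, default=0.05, help="Seconds between streamed chunks"
    )
    return parser


# With no arguments, every option falls back to its documented default.
defaults = build_parser().parse_args([])

# Overriding a few options, as one would on the command line.
custom = build_parser().parse_args(["--agent-a-model", "gpt-5.2", "--max-turns", "20"])
```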

Example: use Claude for both agents

python main.py \
  --agent-a-model claude-opus-4-5-20251101 \
  --agent-b-model claude-opus-4-5-20251101 \
  --print-tool-calls

Available models

| Model string | Provider |
|---|---|
| gpt-4.1 | OpenAI |
| gpt-5.1 | OpenAI |
| gpt-5.2 | OpenAI |
| claude-sonnet-4-5-20250929 | Anthropic |
| claude-opus-4-5-20251101 | Anthropic |
| gemini-2.5-pro | Google |
| gemini-2.5-flash | Google |
| gemini-3-pro-preview | Google |

Running the experiment suite

run_experiment.py runs multiple model pairs in parallel and saves the results to log files.

python run_experiment.py

You'll be prompted to choose:

1. Same Models   — 3 pairs (GPT vs GPT, Claude vs Claude, Gemini vs Gemini)
2. Mixed Models  — 3 pairs (GPT vs Claude, Claude vs Gemini, Gemini vs GPT)
3. Run Both Batches
0. Exit

Results are saved under experiment_results/same_models/ or experiment_results/mixed_models/, one log file per run.
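The batch logic described above might look roughly like the sketch below. It is an assumption-laden outline, not the contents of run_experiment.py: the exact model strings, the log file naming, and the use of a thread pool around subprocess calls are all illustrative choices.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumed model strings for the three compared models.
MODELS = ["gpt-5.2", "claude-opus-4-5-20251101", "gemini-3-pro-preview"]


def same_pairs(models: list[str]) -> list[tuple[str, str]]:
    """Each model paired with itself."""
    return [(m, m) for m in models]


def mixed_pairs(models: list[str]) -> list[tuple[str, str]]:
    """Each model paired with the next one, wrapping around at the end."""
    return [(models[i], models[(i + 1) % len(models)]) for i in range(len(models))]


def build_command(agent_a: str, agent_b: str) -> list[str]:
    return ["python", "main.py", "--agent-a-model", agent_a, "--agent-b-model", agent_b]


def run_pair(pair: tuple[str, str], out_dir: Path) -> int:
    """Run one conversation and capture its output in a log file (hypothetical naming)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    log_path = out_dir / f"{pair[0]}__{pair[1]}.log"
    with log_path.open("w") as log:
        result = subprocess.run(build_command(*pair), stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


def run_batch(pairs: list[tuple[str, str]], out_dir: Path, workers: int = 3) -> list[int]:
    """Run all pairs concurrently and return their exit codes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pair: run_pair(pair, out_dir), pairs))
```

Under these assumptions, the first menu option would correspond to something like run_batch(same_pairs(MODELS), Path("experiment_results/same_models")).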


Notes

  • The agents converse in Polish (configurable in config.py).
  • The secret message and verification code are defined in config.py (_SECRET_MESSAGE, _VERIFICATION_CODE).
  • To add a new model, register it in model_registry.py and ensure the relevant provider library and API key are in place.
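The contents of model_registry.py are not reproduced in this README, so the following is only a guess at the general shape such a registry could take: a mapping from model string to provider and the environment variable holding its API key. The names ModelInfo, register_model, and resolve are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    provider: str
    api_key_env: str


# Hypothetical registry mirroring part of the "Available models" table.
MODEL_REGISTRY: dict[str, ModelInfo] = {
    "gpt-4.1": ModelInfo("OpenAI", "OPENAI_API_KEY"),
    "gpt-5.2": ModelInfo("OpenAI", "OPENAI_API_KEY"),
    "claude-opus-4-5-20251101": ModelInfo("Anthropic", "ANTHROPIC_API_KEY"),
    "gemini-3-pro-preview": ModelInfo("Google", "GOOGLE_API_KEY"),
}


def register_model(name: str, provider: str, api_key_env: str) -> None:
    """Add a new model string; refuse to silently overwrite an existing entry."""
    if name in MODEL_REGISTRY:
        raise ValueError(f"Model already registered: {name}")
    MODEL_REGISTRY[name] = ModelInfo(provider, api_key_env)


def resolve(name: str) -> ModelInfo:
    """Look up a model string, failing loudly on typos."""
    try:
        return MODEL_REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown model: {name}") from None
```

Failing on unknown names at lookup time means a mistyped --agent-a-model value is caught before any API call is made.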
