<a href="https://colab.research.google.com/github/hush-cz/ML_exp_tr_v1/blob/main/Prompt_Consistency_Experiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prompt Consistency Experiment

**Author:** Tomáš Vodáček  
**Goal:** Check how consistently a language model (LLM) answers factual questions when the prompt is rephrased.  
**Method:** Use an open-source model from Hugging Face (DistilGPT-2).  
**Expected outcome:** An assessment of whether the model answers consistently or shows variability.  
**Instrument:** iPhone and poor Malta Wi-Fi :)

In [None]:
!pip install transformers

In [None]:
from transformers import pipeline, set_seed

# Initialize the text generator
generator = pipeline("text-generation", model="distilgpt2")
set_seed(42)

In [None]:
prompts = [
    "What is the capital of France?",
    "France's capital city is ...",
    "Name the capital of France.",
    "Which city is the capital of France?",
    "Give me the capital of France."
]

In [None]:
results = {}

for prompt in prompts:
    # Sample 3 completions per prompt; max_length counts the prompt tokens too.
    outputs = generator(prompt, max_length=20, do_sample=True, num_return_sequences=3)
    results[prompt] = [out['generated_text'] for out in outputs]

results

In [None]:
for prompt, outputs in results.items():
    print(f"\nPrompt: {prompt}")
    for o in outputs:
        print("  ->", o.replace("\n"," "))

In [None]:
# Count how many generated texts mention the expected answer.
correct_count = 0
total_count = 0

for prompt, outputs in results.items():
    for o in outputs:
        total_count += 1
        # Plain substring match: any mention of "Paris" counts as correct.
        if "Paris" in o:
            correct_count += 1

accuracy = correct_count / total_count * 100
print(f"Consistency accuracy: {accuracy:.2f}% ({correct_count}/{total_count} answers contained 'Paris')")



## Conclusion

This simple experiment showed that a small language model (DistilGPT-2)
produced the correct answer ("Paris") in only 2 of 15 completions (~13%)
when asked the same factual question in five different ways, with three
sampled completions per prompt.

This highlights two important points:
1. Smaller open-source models often lack factual robustness.
2. Evaluating consistency across prompt variations is crucial for measuring reliability.
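
One way to look at consistency across prompt variations is to score each prompt separately instead of pooling all 15 outputs. The sketch below is a minimal illustration, not part of the original notebook: the helper `per_prompt_hit_rate` and the `example_results` texts are hypothetical and made up for illustration, and the input is assumed to have the same `{prompt: [generated_text, ...]}` shape as `results`. Unlike the pooled count, it strips the echoed prompt and matches case-insensitively.

```python
def per_prompt_hit_rate(results, answer="Paris"):
    """Return {prompt: fraction of completions that mention `answer`}."""
    rates = {}
    for prompt, outputs in results.items():
        hits = 0
        for text in outputs:
            # Drop the echoed prompt so only the model's continuation is scored.
            continuation = text[len(prompt):] if text.startswith(prompt) else text
            if answer.lower() in continuation.lower():
                hits += 1
        # Guard against prompts with no outputs.
        rates[prompt] = hits / len(outputs) if outputs else 0.0
    return rates


# Hypothetical outputs, invented for illustration (not the notebook's results).
example_results = {
    "What is the capital of France?": [
        "What is the capital of France? Paris is the capital.",
        "What is the capital of France? The French government",
    ],
    "Name the capital of France.": [
        "Name the capital of France. The city of Lyon",
        "Name the capital of France. It has been",
    ],
}

rates = per_prompt_hit_rate(example_results)
for prompt, rate in rates.items():
    print(f"{rate:.0%}  {prompt}")
```

In the notebook, passing the real `results` dictionary to the helper would show whether the two correct answers came from one phrasing or were spread across several.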

A natural next step would be to repeat this experiment with larger models
(e.g., GPT-Neo, LLaMA) and compare the consistency rates.
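
To make that comparison repeatable, the scoring loop could be wrapped in a function that accepts any text generator. This is a sketch under assumptions: `consistency_rate` and `make_stub` are hypothetical names, and the stub generators stand in for real models so the example runs without downloading anything. The commented lines show how a real `transformers` pipeline might be plugged in; the model ID there is an assumption to verify on the Hugging Face Hub.

```python
def consistency_rate(generate, prompts, answer="Paris", samples=3):
    """Fraction of all completions that contain `answer`.

    `generate(prompt, n)` must return a list of n generated strings.
    """
    hits = 0
    total = 0
    for prompt in prompts:
        for text in generate(prompt, samples):
            total += 1
            hits += int(answer in text)
    return hits / total if total else 0.0


def make_stub(reply):
    """Stand-in generator that always appends `reply` to the prompt."""
    def generate(prompt, n):
        return [f"{prompt} {reply}" for _ in range(n)]
    return generate


prompts = [
    "What is the capital of France?",
    "Name the capital of France.",
]

# Stubs only; swap in real models to run the actual comparison.
models = {
    "stub-always-paris": make_stub("Paris."),
    "stub-never-paris": make_stub("I do not know."),
}

# A real model could be wrapped like this (assumed model ID, not run here):
# from transformers import pipeline
# pipe = pipeline("text-generation", model="EleutherAI/gpt-neo-125m")
# models["gpt-neo-125m"] = lambda p, n: [
#     o["generated_text"]
#     for o in pipe(p, max_length=20, do_sample=True, num_return_sequences=n)
# ]

rates = {name: consistency_rate(gen, prompts) for name, gen in models.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

Keeping the generator behind a plain callable means the same prompts and scoring rule apply to every model, so differences in the rates reflect the models rather than the evaluation code.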

