In [17]:
old_prompt = """You are given a Czech court decision (“rozhodnutí Nejvyššího správního soudu”):

[DOCUMENT CONTENTS START]
{DOCUMENT_CONTENTS}
[DOCUMENT CONTENTS END]

Your task is to analyze the provided document and generate multiple question–answer pairs about it. The document text is in Czech, but the instructions here are in English.

**Requirements for your output:**
1. **Language:** The questions and answers you produce must be **in Czech**.
2. **Number of QA pairs:** Please produce 5–10 question–answer pairs (depending on the document’s length and complexity).
3. **Question categories:** Each question must correspond to exactly one of these categories (use whichever are relevant, but try to diversify the final set):
   - **Factual (factual)**: Basic factual details, e.g. key dates, numbers, direct quotations.
   - **Reasoning about legal argumentation (reasoning)**: How the court reasoned, references to specific paragraphs, justifications for a particular conclusion.
   - **Interpretation of law (legal_interpretation)**: Questions focusing on how the court applied or interpreted specific statutes, directives, or legal principles.
   - **Credibility / witness or party consistency (credibility)**: Questions about contradictory statements, analysis of witness credibility, or factual inconsistencies identified by the court.
   - **Process aspects (procedure)**: Chronology of steps, who decided what, whether it was an appeal, cassation complaint, etc.
   - **Summary or context (summary)**: A broader summary question about the core issues or overarching legal context.
   - **Practical consequences (practical)**: Questions focusing on the real-world effects of the ruling or instructions for further proceedings.
4. **Structure:** Output a **JSON object** with a top-level key `"qa_pairs"`, containing an **array** of objects. Each object must have these fields:
   - `"question"`: A single question in Czech.
   - `"golden_answer"`: A concise and precise answer in Czech, reflecting the correct or ideal response based on the text.
   - `"questions_type"`: One of the categories above (factual, reasoning, legal_interpretation, credibility, procedure, summary, practical).

**Example JSON output format** (do not produce ellipses "…" in the final output; this is only illustrative):
```
{{
  "qa_pairs": [
    {{
      "question": "V jakém roce podal stěžovatel první žalobu k Městskému soudu v Praze?",
      "golden_answer": "Podal ji v roce 2021.",
      "questions_type": "factual"
    }},
    {{
      "question": "Proč soud odmítl prolomit § 75 odst. 1 s. ř. s. v tomto řízení?",
      "golden_answer": "Protože stěžovatel stále mohl podat novou žádost o azyl, a tím byla zajištěna ochrana non-refoulement.",
      "questions_type": "reasoning"
    }}
  ]
}}
```

**Important**:
- Stay strictly within this JSON structure.
- **Do not** include any additional commentary, markdown formatting, or explanations outside of the JSON.
- Ensure each `"golden_answer"` is factually correct and concise, pulled from or supported by the text.

**Now**:
1. Read carefully the entire Czech court decision above (between [DOCUMENT CONTENTS START] and [DOCUMENT CONTENTS END]).
2. Produce a set of **5–10 question–answer pairs** in Czech.
3. Output only the JSON with the specified structure and fields.

Begin.
"""

In [49]:
prompt = """
You are given a Czech court decision (“rozhodnutí Nejvyššího správního soudu”):

[DOCUMENT CONTENTS START]
{DOCUMENT_CONTENTS}
[DOCUMENT CONTENTS END]

Your task is to analyze the provided document and generate multiple question–answer pairs about it. The document text is in Czech, but the instructions here are in English.

**Important context:**
- We have thousands of such decisions in a large corpus.
- Each question must reference at least one **unique or specific detail** from the text (such as the spisová značka, date of the decision, or distinctive phrases or facts) so that we can retrieve the correct document among many.

**Requirements for your output:**
1. **Language:** The questions and answers you produce must be **in Czech**.
2. **Number of QA pairs:** Please produce 5–10 question–answer pairs (depending on the document’s length and complexity).
3. **Question categories:** Each question must correspond to exactly one of these categories (use whichever are relevant, but try to diversify the final set):
   - **Factual (factual)**: Basic factual details, e.g. key dates, numbers, direct quotations.
   - **Reasoning about legal argumentation (reasoning)**: How the court reasoned, references to specific paragraphs, justifications for a particular conclusion.
   - **Interpretation of law (legal_interpretation)**: Questions focusing on how the court applied or interpreted specific statutes, directives, or legal principles.
   - **Credibility / witness or party consistency (credibility)**: Questions about contradictory statements, analysis of witness credibility, or factual inconsistencies identified by the court.
   - **Process aspects (procedure)**: Chronology of steps, who decided what, whether it was an appeal, cassation complaint, etc.
   - **Summary or context (summary)**: A broader summary question about the core issues or overarching legal context.
   - **Practical consequences (practical)**: Questions focusing on the real-world effects of the ruling or instructions for further proceedings.
4. **Structure:** Output a **JSON object** with a top-level key `"qa_pairs"`, containing an **array** of objects. Each object must have these fields:
   - `"question"`: A single question in Czech. **Incorporate at least one unique detail from the decision** (e.g., the specific date, spisová značka, or other distinctive text) so that it is clear you are referencing *this* document.
   - `"golden_answer"`: A concise and precise answer in Czech, reflecting the correct or ideal response based on the text.
   - `"questions_type"`: One of the categories above (factual, reasoning, legal_interpretation, credibility, procedure, summary, practical).

**Example JSON output format** (for illustration only):

{{
  "qa_pairs": [
    {{
      "question": "V jakém roce podal stěžovatel první žalobu k Městskému soudu v Praze v rozhodnutí ze dne 10. 5. 2022, sp. zn. 1 Azs 123/2022-45?",
      "golden_answer": "Podal ji v roce 2021.",
      "questions_type": "factual"
    }},
    {{
      "question": "Proč soud ve spisu 1 Azs 123/2022-45 z 10. 5. 2022 odmítl prolomit § 75 odst. 1 s. ř. s. v tomto řízení?",
      "golden_answer": "Protože stěžovatel stále mohl podat novou žádost o azyl, a tím byla zajištěna ochrana non-refoulement.",
      "questions_type": "reasoning"
    }}
  ]
}}

**Important**:
- Stay strictly within this JSON structure.
- **Do not** include any additional commentary, markdown formatting, or explanations outside of the JSON.
- Each question must clearly reference a **unique detail** from the text.
- Ensure each `"golden_answer"` is factually correct and concise, pulled from or supported by the text.

**Now**:
1. Read carefully the entire Czech court decision above (between [DOCUMENT CONTENTS START] and [DOCUMENT CONTENTS END]).
2. Produce a set of **5–10 question–answer pairs** in Czech, referencing unique details so we can identify the correct decision among many.
3. Output only the JSON with the specified structure and fields.

Begin.
"""
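
The doubled braces in the example JSON matter because the template is later filled with `str.format`. A minimal sketch of that behaviour, using a toy template (the names `template` and `rendered` are illustrative and not part of the notebook):

```python
# Toy template: one real placeholder plus literal JSON braces escaped as {{ }}.
template = 'Doc: {DOCUMENT_CONTENTS}\nFormat: {{ "qa_pairs": [] }}'

rendered = template.format(DOCUMENT_CONTENTS="text rozhodnutí")
print(rendered)

# Braces inside the substituted value are not re-interpreted by format().
with_braces = template.format(DOCUMENT_CONTENTS="{not_a_field}")
print(with_braces)
```

Single braces around anything other than `DOCUMENT_CONTENTS` would raise a `KeyError` or `ValueError` at format time, which is why the example JSON uses `{{` and `}}`.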

In [15]:
# ! pip install langchain-google-genai

In [3]:
import getpass
import os

if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google AI API key: ")

In [51]:
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    # model="gemini-2.5-pro-exp-03-25",
    temperature=0.5,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # other params...
)

In [44]:
documents = os.listdir("files")

In [45]:
documents = [doc for doc in documents if doc.endswith(".txt")]

In [46]:
questions = []

In [47]:
# Draw a random sample of 100 documents
# (random.sample raises ValueError if fewer than 100 are available)
import random
documents_subset = random.sample(documents, 100)
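
If the same subset needs to be reproduced later, one option is to sample with a seeded generator over a sorted file list, since `os.listdir` returns names in arbitrary order. This is only a sketch; `sample_documents`, the seed value, and the file names are hypothetical:

```python
import random

# Hypothetical file names standing in for the contents of "files".
file_names = [f"decision_{i}.txt" for i in range(1000)]

def sample_documents(names, k, seed):
    """Return k names chosen reproducibly for a given seed."""
    rng = random.Random(seed)          # private generator, global state untouched
    return rng.sample(sorted(names), k)  # sort first so input order does not matter

first = sample_documents(file_names, 100, seed=42)
second = sample_documents(list(reversed(file_names)), 100, seed=42)
print(first == second)
```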

In [54]:
from tqdm import tqdm

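# Generate QA pairs for each sampled document.
# Note: the [1:] slice skips the first sampled document, so 99 of the 100 are processed.
for doc in tqdm(documents_subset[1:]):
    with open(os.path.join("files", doc), "r", encoding="utf-8") as f:
        text = f.read()
    # Fill the document text into the prompt template
    prompt_text = prompt.format(DOCUMENT_CONTENTS=text)
    response = llm.invoke(prompt_text)
    # Keep the raw response; JSON parsing happens in a later cell
    questions.append(response)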

 66%|██████▌   | 65/99 [08:03<04:13,  7.44s/it]


KeyboardInterrupt: 
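
Because the run above was interrupted partway through, a loop that remembers which documents are already done can be restarted without repeating API calls. A minimal sketch, assuming results are kept in a dict keyed by file name; `process_documents` and `fake_generate` are hypothetical stand-ins, and `fake_generate` replaces the real `llm.invoke` call:

```python
def process_documents(doc_names, generate, results):
    """Call generate(name) for every document that has no result yet."""
    for name in doc_names:
        if name in results:
            continue  # already processed in an earlier (interrupted) run
        results[name] = generate(name)
    return results

calls = []

def fake_generate(name):
    # Stand-in for reading the file and invoking the model.
    calls.append(name)
    return f"response for {name}"

# "a.txt" was finished before the interruption, so only the rest is generated.
results = {"a.txt": "cached response"}
process_documents(["a.txt", "b.txt", "c.txt"], fake_generate, results)
print(calls)
```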

In [53]:
import json

# Parse the most recent response: bare JSON parses directly,
# otherwise fall back to the contents of a ```json fence.
txt = str(response.content)
try:
    json_response = json.loads(txt)
except json.decoder.JSONDecodeError:
    if "```json" in txt:
        txt = txt.split("```json")[1].split("```")[0]
        json_response = json.loads(txt)
    else:
        raise ValueError(f"Could not parse JSON response: {txt}")
json_response

{'qa_pairs': [{'question': 'Jaké dvě platební výměry vyměřil Finanční úřad v Šumperku žalobci v souvislosti s nadměrným odpočtem DPH?',
   'golden_answer': 'Dvě platební výměry ze dne 20. 5. 2003 a 5. 8. 2003.',
   'questions_type': 'factual'},
  {'question': 'Na základě jaké smlouvy umožňoval žalobce v rozhodném období užívání hokejového stadionu občanskému sdružení H.?',
   'golden_answer': 'Na základě smlouvy o bezplatné výpůjčce.',
   'questions_type': 'factual'},
  {'question': 'Proč Krajský soud v Ostravě zamítl žalobu Josefa P. proti rozhodnutí Finančního ředitelství v Ostravě?',
   'golden_answer': 'Protože dospěl k názoru, že žalobce zastřel charakter skutečně poskytnutého zdanitelného plnění smlouvou o výpůjčce.',
   'questions_type': 'reasoning'},
  {'question': 'Jaké ustanovení občanského zákoníku definuje výpůjčku, na kterou se odvolával Krajský soud v Ostravě?',
   'golden_answer': 'Ustanovení § 659 občanského zákoníku.',
   'questions_type': 'legal_interpretation'},
  {'

In [57]:
extracted_questions = []
for response in questions:
    txt = str(response.content)
    try:
        # Bare JSON parses directly
        json_response = json.loads(txt)
    except json.decoder.JSONDecodeError:
        # Otherwise fall back to the contents of a ```json fence
        if "```json" in txt:
            txt = txt.split("```json")[1].split("```")[0]
            json_response = json.loads(txt)
        else:
            raise ValueError(f"Could not parse JSON response: {txt}")
    extracted_questions.append(json_response)
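
The same parsing logic appears in two cells, so it could be factored into one helper. The sketch below is an assumption about how that might look, not the notebook's code: `parse_json_response` and `FENCE_RE` are hypothetical names, and the regex also accepts a fence without the `json` label, which the cells above do not.

````python
import json
import re

# Matches ```json ... ``` or a plain ``` ... ``` fence and captures the body.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL)

def parse_json_response(text):
    """Parse bare JSON, or JSON wrapped in a Markdown code fence."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = FENCE_RE.search(text)
        if match is None:
            raise ValueError(f"Could not parse JSON response: {text}")
        return json.loads(match.group(1))

bare = parse_json_response('{"qa_pairs": []}')
fenced = parse_json_response('Here it is:\n```json\n{"qa_pairs": [1]}\n```')
unlabeled = parse_json_response('```\n{"a": 1}\n```')
print(bare, fenced, unlabeled)
````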

In [37]:
extracted_questions

[{'qa_pairs': [{'question': 'Jakého přestupku se měl žalobce dopustit podle rozhodnutí Magistrátu města Plzně?',
    'golden_answer': 'Přestupku proti bezpečnosti a plynulosti provozu na pozemních komunikacích dle § 22 odst. 1 písm. d) zákona č. 200/1990 Sb., o přestupcích, v souvislosti s porušením § 5 odst. 1 písm. f) zákona o silničním provozu.',
    'questions_type': 'factual'},
   {'question': 'Jaká povinnost je uložena řidiči podle § 5 odst. 1 písm. f) zákona o silničním provozu?',
    'golden_answer': 'Podrobit se na výzvu policisty, příslušníka Vojenské policie, zaměstnavatele, ošetřujícího lékaře nebo strážníka obecní policie vyšetření podle zvláštního právního předpisu ke zjištění, zda není ovlivněn alkoholem.',
    'questions_type': 'legal_interpretation'},
   {'question': 'Co tvrdil žalobce ve své žalobě ohledně výkladu práva ze strany žalovaného?',
    'golden_answer': 'Žalobce tvrdil, že žalovaný provedl extenzivní výklad práva v jeho neprospěch.',
    'questions_type': '

In [58]:
formatted_questions = []
global_counter = 0
for response in extracted_questions:
    for qa_pair in response["qa_pairs"]:
        formatted_questions.append({
            "id": global_counter,
            "question": qa_pair["question"],
            "golden_answers": [qa_pair["golden_answer"]],
            "metadata": {
                "questions_type": qa_pair["questions_type"],
            }
        })
        global_counter += 1
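
The prompt restricts `questions_type` to seven categories, but nothing above checks that the model complied. A small sketch of such a check; `ALLOWED_TYPES` is copied from the prompt, while `split_by_validity` and the sample records are hypothetical:

```python
from collections import Counter

# The seven category labels listed in the prompt.
ALLOWED_TYPES = {
    "factual", "reasoning", "legal_interpretation", "credibility",
    "procedure", "summary", "practical",
}

def split_by_validity(items):
    """Separate records whose questions_type is one of the allowed labels."""
    valid, invalid = [], []
    for item in items:
        qtype = item["metadata"]["questions_type"]
        (valid if qtype in ALLOWED_TYPES else invalid).append(item)
    return valid, invalid

# Made-up records in the same shape as formatted_questions.
sample = [
    {"id": 0, "metadata": {"questions_type": "factual"}},
    {"id": 1, "metadata": {"questions_type": "reasoning"}},
    {"id": 2, "metadata": {"questions_type": "Factual (factual)"}},
]
valid, invalid = split_by_validity(sample)
type_counts = Counter(item["metadata"]["questions_type"] for item in valid)
print(type_counts, [item["id"] for item in invalid])
```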


In [59]:
len(formatted_questions)


597

In [61]:
with open("nss_questions_full.jsonl", "w", encoding="utf-8") as f:
    for question in formatted_questions:
        # ensure_ascii=False keeps Czech characters readable instead of \uXXXX escapes
        f.write(json.dumps(question, ensure_ascii=False) + "\n")

In [62]:
questions_subset = random.sample(formatted_questions, 100)

In [63]:
with open("nss_questions_100_subset.jsonl", "w", encoding="utf-8") as f:
    for question in questions_subset:
        # ensure_ascii=False keeps Czech characters readable instead of \uXXXX escapes
        f.write(json.dumps(question, ensure_ascii=False) + "\n")
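
To check that a written file loads back as expected, the JSONL can be read line by line. A minimal sketch using a temporary file and a made-up record; `read_jsonl` is a hypothetical helper. It writes with `ensure_ascii=False` so the Czech text stays readable on disk, although the default escaped form round-trips to the same data:

```python
import json
import os
import tempfile

# Made-up record in the same shape as the entries written above.
record = {
    "id": 0,
    "question": "Jaký přestupek?",
    "golden_answers": ["Přestupek."],
    "metadata": {"questions_type": "factual"},
}

def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    loaded = read_jsonl(path)

print(loaded[0]["question"])
```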