# Legal Q&A Synthetic Data Generation

This notebook generates synthetic legal question-answer pairs using the Claude 3.5 Sonnet model on Amazon Bedrock. The generated data is intended for fine-tuning legal assistant models.

## Overview

The notebook takes a set of short legal texts and prompts Claude to generate Q&A pairs that are based solely on each text. The responses are parsed, converted to prompt/completion records, saved in JSONL format, and uploaded to S3.

## 1. Import Required Libraries

First, we import the Python packages needed for AWS access, JSON handling, random sampling, and progress tracking.

In [2]:
import os
import json
import boto3
import random
from tqdm import tqdm

## 2. AWS Configuration

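Create a boto3 session from a named AWS profile and initialize clients for the Bedrock runtime and S3. The profile must exist in your local AWS configuration and have permission to invoke the Bedrock model and write to the target bucket.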

In [4]:
BUCKET_NAME = "khalil-adib-bucket"
PROFILE_NAME = "dev"
REGION_NAME = "us-east-1"

session = boto3.Session(profile_name=PROFILE_NAME, region_name=REGION_NAME)
bedrock_runtime = session.client('bedrock-runtime')
s3 = session.client('s3')
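
The settings above are hard-coded, which ties the notebook to one account. As a minimal sketch, the same values could be read from environment variables with the notebook's values as fallbacks. `AWS_PROFILE` and `AWS_DEFAULT_REGION` are the standard AWS variable names; `LEGAL_QA_BUCKET` and the `load_config` helper are hypothetical names introduced here for illustration.

```python
import os


def load_config(env=None):
    """Read settings from environment variables, falling back to the notebook's defaults."""
    env = os.environ if env is None else env
    return {
        # LEGAL_QA_BUCKET is a hypothetical variable name, not an AWS convention.
        "bucket_name": env.get("LEGAL_QA_BUCKET", "khalil-adib-bucket"),
        "profile_name": env.get("AWS_PROFILE", "dev"),
        "region_name": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }


# Passing a plain dict makes the override behavior easy to see without touching the real environment.
config = load_config({"LEGAL_QA_BUCKET": "my-bucket"})
print(config)
```

The resulting `profile_name` and `region_name` values could then be passed to `boto3.Session` in place of the constants.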

## 3. Model and Prompt Configuration

Define the Bedrock model ID and create a prompt template for generating Q&A pairs. The template instructs Claude to focus on technical legal concepts and statutory interpretation.

The `LEGAL_TEXTS` list contains 10 short passages, each summarizing a legal doctrine, rule of interpretation, or landmark case.

In [3]:
MODEL_ID = "us.anthropic.claude-3-5-sonnet-20240620-v1:0"

NUMBER_OF_PAIRS = 10

def get_prompt(text):
    return f"""
You are an expert in legal drafting and education. Your task is to generate high-quality question-answer pairs for fine-tuning a legal assistant AI.

Instructions:
- Focus strictly on technical legal concepts, statutory interpretation, and precedent-based reasoning.
- Do NOT include speculative, emotional, or overly general questions.
- Base your questions and answers solely on the text provided.
- Answers must be precise, well-reasoned, and legally grounded.

Input text: {text}

Generate exactly {NUMBER_OF_PAIRS} question-answer pairs.

Return the result as a JSON array with the following format:

[
  {{
    "question": "What is the legal implication of ...?",
    "answer": "Under [doctrine/statute], the legal interpretation is ..."
  }},
  ...
]

Return only the JSON array, with no additional text.
"""


LEGAL_TEXTS = [
    {
        "id": 1,
        "text": "The plain meaning rule directs that if the language of a statute is clear and unambiguous, the courts must apply it as written, without resorting to extrinsic aids of interpretation. However, where ambiguity exists, legislative history, purpose, and canons of construction may be invoked."
    },
    {
        "id": 2,
        "text": "Under the doctrine of stare decisis, courts are generally bound to follow precedents established by higher courts within the same jurisdiction. Exceptions are made when prior rulings are shown to be clearly erroneous or outdated due to legal or social developments."
    },
    {
        "id": 3,
        "text": "In Chevron U.S.A. Inc. v. Natural Resources Defense Council, the Supreme Court held that when a statute is ambiguous, courts must defer to an agency’s reasonable interpretation of that statute, provided Congress has not spoken directly to the issue."
    },
    {
        "id": 4,
        "text": "Mens rea, or the 'guilty mind', is a fundamental component of criminal liability. Most serious crimes require both a prohibited act (actus reus) and a culpable mental state, which can range from negligence to purposefulness."
    },
    {
        "id": 5,
        "text": "The doctrine of res judicata bars parties from re-litigating a claim that has already been finally adjudicated on the merits by a competent court. It promotes judicial efficiency and finality in litigation."
    },
    {
        "id": 6,
        "text": "For a contract to be enforceable, it must be supported by consideration, which is defined as a bargained-for exchange of something of legal value. Past consideration or moral obligations generally do not satisfy this requirement."
    },
    {
        "id": 7,
        "text": "The Commerce Clause grants Congress the authority to regulate commerce among the states. Judicial interpretations have evolved, permitting regulation of intrastate activities that substantially affect interstate commerce."
    },
    {
        "id": 8,
        "text": "When a law implicates a fundamental right or targets a suspect classification, courts apply strict scrutiny, requiring the law to be narrowly tailored to serve a compelling governmental interest."
    },
    {
        "id": 9,
        "text": "The Rule Against Perpetuities prevents future interests in property from vesting too remotely. At common law, a contingent interest must vest, if at all, no later than 21 years after the death of a relevant life in being at the time of the creation of the interest."
    },
    {
        "id": 10,
        "text": "In Miranda v. Arizona, the Supreme Court held that statements made during custodial interrogation are inadmissible unless the suspect was informed of their rights to remain silent and to legal counsel, and voluntarily waived those rights."
    }
]
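
The template relies on f-string brace escaping: single braces mark substitution fields, while doubled braces are emitted as literal `{` and `}` in the JSON example. A shortened sketch of the same pattern follows; `build_prompt` and its `number_of_pairs` parameter are illustrative names, not part of the notebook.

```python
def build_prompt(text, number_of_pairs=10):
    # {text} and {number_of_pairs} are substituted; {{ and }} render as literal braces.
    return f"""Input text: {text}

Generate exactly {number_of_pairs} question-answer pairs.

[
  {{
    "question": "...",
    "answer": "..."
  }}
]
"""


rendered = build_prompt("Sample statute text.", number_of_pairs=2)
print(rendered)
```

Printing one rendered prompt before calling the model is a cheap way to confirm that the JSON example reaches Claude with single braces.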


## 4. Generate Q&A Pairs

For each legal text, we:
1. Create a prompt using the template
2. Call the Bedrock Converse API with temperature and top-p values drawn uniformly at random from [0, 1), so that sampling settings vary between calls
3. Store the raw response text for later parsing

The progress bar shows the generation status for all 10 legal texts.

In [4]:
dataset = []
for item in tqdm(LEGAL_TEXTS):
    try:
        prompt = get_prompt(text=item['text'])
        response = bedrock_runtime.converse(
            modelId=MODEL_ID,
            messages=[
                {
                    'role': 'user',
                    'content': [
                        {
                            'text': prompt,
                        }
                    ]
                }
            ],
            inferenceConfig={
                'maxTokens': 8192,
                'temperature': random.random(),
                'topP': random.random(),
            }
        )
        output = response['output']['message']['content'][0]['text']
        dataset.append(output)
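    except Exception as e:
        # Catch Exception rather than using a bare except, so KeyboardInterrupt still stops the loop.
        print(f"Error at text id {item['id']}: {e}")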

100%|██████████████████████████████| 10/10 [03:19<00:00, 19.98s/it]
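
Drawing both values from the full [0, 1) range can occasionally produce a top-p close to 0, which restricts sampling to the most likely tokens and works against the goal of diversity. A minimal sketch of bounded, seeded sampling is shown below. The ranges and the `sample_inference_params` helper are illustrative assumptions, not values used by the notebook.

```python
import random


def sample_inference_params(rng, temp_range=(0.3, 0.9), top_p_range=(0.8, 1.0)):
    """Draw temperature and topP from bounded ranges (the bounds are illustrative assumptions)."""
    return {
        "temperature": round(rng.uniform(*temp_range), 2),
        "topP": round(rng.uniform(*top_p_range), 2),
    }


rng = random.Random(42)  # seeded so the same sequence of settings can be reproduced
params = [sample_inference_params(rng) for _ in range(3)]
for p in params:
    print(p)
```

Each returned dict could be merged into `inferenceConfig` alongside `maxTokens`. Logging the sampled values next to each response also makes it possible to trace a poor output back to its settings.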


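## 5. Process Generated Data

Parse each raw response string into Python objects, then flatten the resulting list of lists into a single list of Q&A pairs. This step assumes every response is a bare JSON array; `json.loads` raises `json.JSONDecodeError` on any response that is not valid JSON.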

In [8]:
dataset = [json.loads(d) for d in dataset]

In [10]:
flattened = [item for sublist in dataset for item in sublist]
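
Models sometimes wrap JSON in a code fence or add a sentence before it, even when told not to. The sketch below shows one tolerant way to parse and flatten in a single pass, assuming the array is the outermost bracketed span in the response. `parse_pairs` is a hypothetical helper, and the sample responses are made up for illustration.

```python
import json


def parse_pairs(raw):
    """Extract a JSON array from a model response; return [] if none can be parsed."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


fence = "`" * 3  # a Markdown code fence, built here to keep this example readable
raw_outputs = [
    '[{"question": "Q1", "answer": "A1"}]',
    "Here are the pairs:\n" + fence + 'json\n[{"question": "Q2", "answer": "A2"}]\n' + fence,
    "Sorry, I cannot help with that.",
]

# Parse and flatten together; unparseable responses contribute nothing.
flattened_pairs = [pair for raw in raw_outputs for pair in parse_pairs(raw)]
print(flattened_pairs)
```

Silently dropping bad responses keeps the pipeline running, but counting them is worthwhile, since a high failure rate usually points to a prompt problem.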

## 6. Format Dataset

Transform the Q&A pairs into the required format with 'prompt' and 'completion' fields.

In [17]:
processed_dataset = [
    {'prompt': pair['question'], 'completion': pair['answer']}
    for pair in flattened
]
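
Indexing `'question'` and `'answer'` directly raises `KeyError` if the model omits a field. A minimal sketch of a more defensive conversion follows, which skips malformed or empty items and drops repeated questions. `to_records` is a hypothetical helper, and case-insensitive matching on the question text is one assumed definition of a duplicate.

```python
def to_records(pairs):
    """Convert question/answer dicts to prompt/completion records, skipping bad or duplicate items."""
    records, seen = [], set()
    for pair in pairs:
        if not isinstance(pair, dict):
            continue
        question = str(pair.get("question") or "").strip()
        answer = str(pair.get("answer") or "").strip()
        if not question or not answer:
            continue  # a missing or empty field makes the pair unusable
        key = question.lower()
        if key in seen:
            continue  # keep only the first occurrence of each question
        seen.add(key)
        records.append({"prompt": question, "completion": answer})
    return records


sample = [
    {"question": "What is consideration?", "answer": "A bargained-for exchange of legal value."},
    {"question": "what is consideration?", "answer": "A repeated question."},
    {"question": "Missing answer"},
    "not a dict",
]
records = to_records(sample)
print(records)
```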

## 7. Save Dataset

Save the processed dataset to a local JSONL file, one JSON object per line, then upload the file to the S3 bucket.

In [18]:
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for item in processed_dataset:
        f.write(json.dumps(item) + "\n")

In [5]:
s3.upload_file('dataset.jsonl', BUCKET_NAME, 'webinar/dataset/webinar_dataset.jsonl')
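
Before uploading, it can help to read the file back and confirm that every line parses and that nothing was lost. The sketch below round-trips a small sample through a temporary directory. `write_jsonl` and `read_jsonl` are hypothetical helpers, and the sample record is made up for illustration.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


sample_records = [
    {"prompt": "What does res judicata bar?", "completion": "Re-litigation of a finally adjudicated claim."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.jsonl")
    write_jsonl(path, sample_records)
    restored = read_jsonl(path)

print(restored == sample_records)
```

Passing `ensure_ascii=False` keeps characters such as typographic apostrophes readable in the file; the default escaped form is equally valid JSON.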