# 90 · Batch Translation Demo — Health-Claim Mini-Workflow
_Last updated 2025-05-04_

Goal: show **end-to-end** use of the OpenAI **Batch** API on a pocket-sized
dataset (3 claims × 3 target languages).

We will:

1. create a strict JSON schema (`response_format`) once;
2. write one JSONL line per `(claim, language)` call;
3. (optionally) split large files into ≤ 50 MB chunks;
4. submit a **pilot batch** with just **one** line so everyone can
   see the dashboard UI and the cost before the full run;
5. provide commented code for the full upload, retrieval,
   and parsing steps.



## API key

* Reads `OPENAI_API_KEY` from the environment, or  
* Falls back to `key/openai_key.txt` (one-line file).  
* Raises an error if neither is present.


In [None]:
# %pip -q install --upgrade openai python-dotenv pandas tqdm
import os, pathlib, json, random, time, textwrap, pandas as pd
from openai import OpenAI

# Key fallback: key/openai_key.txt
key_file = pathlib.Path("key/openai_key.txt")
if os.getenv("OPENAI_API_KEY") is None and key_file.exists():
    os.environ["OPENAI_API_KEY"] = key_file.read_text().strip()

if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("Set OPENAI_API_KEY or put your key in key/openai_key.txt")

client = OpenAI()


# 1 · Mini demo dataset (3 English claims)
In production `df_claims` would be loaded from a CSV; here it is hard-coded.


In [2]:
df_claims = pd.DataFrame({
    "claim_id":   [1, 2, 3],
    "Claim": [
        "Calcium contributes to normal bone health.",
        "DHA intake contributes to the normal function of the brain.",
        "Plant sterols have been shown to lower blood cholesterol."
    ],
    "Nutrient": [
        "Calcium", "Docosahexaenoic acid (DHA)", "Plant sterols"
    ],
    "Relationship": [
        "Bone health", "Brain function", "Blood cholesterol"
    ]
})
df_claims


| | claim_id | Claim | Nutrient | Relationship |
|---|---|---|---|---|
| 0 | 1 | Calcium contributes to normal bone health. | Calcium | Bone health |
| 1 | 2 | DHA intake contributes to the normal function ... | Docosahexaenoic acid (DHA) | Brain function |
| 2 | 3 | Plant sterols have been shown to lower blood c... | Plant sterols | Blood cholesterol |


# 2 · Schema & system prompt
One strict JSON schema is reused for every translation call.


In [3]:
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "claim_translation_v1",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "translation_claim": {"type": "string"},
                "translation_nutrient_substance": {"type": "string"},
                "translation_health_relationship": {"type": "string"}
            },
            "required": [
                "translation_claim",
                "translation_nutrient_substance",
                "translation_health_relationship"
            ],
            "additionalProperties": False
        }
    }
}

SYSTEM_PROMPT = textwrap.dedent("""
    You are a certified medical-translation specialist.
    Translate the ENGLISH text below into the target language.
    Preserve technical terms if no local equivalent exists.
    Return JSON strictly matching the schema keys provided.
""").strip()


# 3 · Helper -> one JSONL line


In [4]:
def make_jsonl_line(row, target_lang):
    """
    Convert one (claim row, language) into a batch-ready JSON dict.
    """
    custom_id = f"claim{row.claim_id}_{target_lang}"
    user_prompt = (
        f"Target language: {target_lang}\n"
        f"Claim: {row.Claim}\n"
        f"Nutrient substance: {row.Nutrient}\n"
        f"Health relationship: {row.Relationship}"
    )

    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",       # cheap + strong
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user",   "content": user_prompt}
            ],
            "temperature": 0,
            "max_tokens": 512,
            "response_format": response_format
        }
    }
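
Before writing the file it can help to sanity-check the request dicts, since a malformed line may only surface after upload. The following is a minimal sketch under the assumption that every line needs a unique `custom_id`, the `POST` method, a `url` matching the batch endpoint, and a `model` in its body; `validate_requests` is a hypothetical helper, not part of the OpenAI SDK.

```python
import json

def validate_requests(requests, endpoint="/v1/chat/completions"):
    """Return a list of human-readable problems found in batch request dicts."""
    problems, seen = [], set()
    for n, req in enumerate(requests, start=1):
        cid = req.get("custom_id")
        if not cid:
            problems.append(f"line {n}: missing custom_id")
        elif cid in seen:
            problems.append(f"line {n}: duplicate custom_id {cid!r}")
        else:
            seen.add(cid)
        if req.get("method") != "POST":
            problems.append(f"line {n}: method should be POST")
        if req.get("url") != endpoint:
            problems.append(f"line {n}: url differs from {endpoint}")
        if "model" not in req.get("body", {}):
            problems.append(f"line {n}: body has no model")
        try:
            json.dumps(req)          # every line must be JSON-serialisable
        except TypeError as exc:
            problems.append(f"line {n}: not serialisable ({exc})")
    return problems
```

With the helper above, something like `validate_requests([make_jsonl_line(r, "Spanish") for r in df_claims.itertuples()])` would be expected to return an empty list.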


# 4 · Build JSONL (3 claims × 3 languages = 9 lines)


In [6]:
TARGET_LANGS = ["Spanish", "German", "Japanese"]
jsonl_path = pathlib.Path("int_data/batch_input_claims_demo.jsonl")
jsonl_path.parent.mkdir(parents=True, exist_ok=True)   # create int_data/ if missing

n_lines = 0
with jsonl_path.open("w", encoding="utf-8") as f:
    for _, row in df_claims.iterrows():
        for lang in TARGET_LANGS:
            line = make_jsonl_line(row, lang)
            f.write(json.dumps(line) + "\n")
            n_lines += 1

print(f"Wrote {jsonl_path} with {n_lines} lines.")


Wrote int_data\batch_input_claims_demo.jsonl with 9 lines.


# 5 · (Optional) split into ≤50 MB parts
The demo file is only a few kilobytes, so we keep `NUM_PARTS = 1`.


In [7]:
NUM_PARTS = 1

part_dir = pathlib.Path("int_data/batch_parts_demo")
part_dir.mkdir(parents=True, exist_ok=True)
if NUM_PARTS == 1:
    part_paths = [jsonl_path]
else:
    lines = jsonl_path.read_text(encoding="utf-8").splitlines()
    chunk = -(-len(lines) // NUM_PARTS)      # ceiling division: lines per part
    part_paths = []
    for i in range(NUM_PARTS):
        part_lines = lines[i*chunk:(i+1)*chunk]
        if not part_lines:                   # fewer lines than parts: skip empty files
            continue
        p = part_dir / f"claims_part{i+1}.jsonl"
        p.write_text("\n".join(part_lines) + "\n", encoding="utf-8")
        part_paths.append(p)
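
The cell above splits by line count, which only approximates a size limit. If the limit is expressed in bytes, a size-aware grouping may be a better fit. The sketch below is one way to do it; `split_by_size` is a hypothetical helper and `max_bytes` is whatever ceiling you choose (for example `50 * 1024 * 1024` for the 50 MB target mentioned in the heading).

```python
def split_by_size(lines, max_bytes):
    """Group JSONL lines so each chunk's UTF-8 size (with newlines) is <= max_bytes."""
    chunks, current, current_size = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1      # +1 for the trailing newline
        if size > max_bytes:
            raise ValueError("a single line is larger than max_bytes")
        if current and current_size + size > max_bytes:
            chunks.append(current)                # close the chunk that is full
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be written with `"\n".join(chunk) + "\n"` to its own part file.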


# 6 · Create a **pilot** batch (1 random line) – optional


In [8]:
pilot_line = random.choice(jsonl_path.read_text(encoding="utf-8").splitlines())
pilot_path  = pathlib.Path("int_data/pilot_line.jsonl")   # same folder as the full input file
pilot_path.write_text(pilot_line + "\n", encoding="utf-8")

pilot_file = client.files.create(file=pilot_path.open("rb"), purpose="batch")
pilot_batch = client.batches.create(
    input_file_id   = pilot_file.id,
    endpoint        = "/v1/chat/completions",
    completion_window = "24h",
    metadata        = {"description": "Pilot translation batch (1 line)"}
)
print("Pilot batch ID:", pilot_batch.id)


Pilot batch ID: batch_681744dbc2d08190a7bbbbeb03e30f71


# 7 · Template – upload *all* parts  (incurs cost; run only when ready)


In [9]:

batch_ids = []
for p in part_paths:
    file_obj = client.files.create(file=p.open("rb"), purpose="batch")
    batch   = client.batches.create(
        input_file_id=file_obj.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
        metadata={"description": f"Full translation – {p.name}"}
    )
    batch_ids.append(batch.id)
    time.sleep(1)   # avoid 429
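
The retrieval section later looks for a `batch_ids_full.json` file, but nothing in the notebook writes it. A minimal sketch of saving and reloading the ID list is shown below; `save_batch_ids` and `load_batch_ids` are hypothetical helper names, and the demo uses a temporary directory and made-up IDs so that it does not touch the project folder.

```python
import json
import pathlib
import tempfile

def save_batch_ids(batch_ids, path="batch_ids_full.json"):
    """Write the batch IDs as a JSON list and return the file path."""
    path = pathlib.Path(path)
    path.write_text(json.dumps(list(batch_ids), indent=2), encoding="utf-8")
    return path

def load_batch_ids(path="batch_ids_full.json"):
    """Read the JSON list written by save_batch_ids."""
    return json.loads(pathlib.Path(path).read_text(encoding="utf-8"))

# Round trip in a throwaway directory; the IDs are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_batch_ids(["batch_a", "batch_b"],
                           pathlib.Path(tmp) / "batch_ids_full.json")
    restored = load_batch_ids(saved)
```

In the notebook itself, calling `save_batch_ids(batch_ids)` right after the upload loop would let a later session pick the IDs up again.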



# 8 · Download & normalise output
Only runs if you have completed batches.
Go to https://platform.openai.com/usage to check if your batch is complete.


In [None]:
def download_output(batch_id, out_dir="int_data/batch_out_demo"):
    out_dir = pathlib.Path(out_dir); out_dir.mkdir(parents=True, exist_ok=True)
    status  = client.batches.retrieve(batch_id)
    if status.status != "completed":
        print(batch_id, "not ready:", status.status); return None
    if status.output_file_id is None:      # no successful requests: check error_file_id
        print(batch_id, "has no output file"); return None
    out_path = out_dir / f"{batch_id}.jsonl"
    out_path.write_text(client.files.content(status.output_file_id).text,
                        encoding="utf-8")
    return out_path

# --- parse helper -------------------------------------------------
def parse_batch_file(path):
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        blob = json.loads(line)
        cid  = blob["custom_id"]
        payload = json.loads(blob["response"]["body"]["choices"][0]["message"]["content"])
        claim_id, lang = cid.replace("claim", "", 1).split("_", 1)   # "claim2_Spanish" -> ("2", "Spanish")
        rows.append({"claim_id": int(claim_id),
                     "language": lang,
                     **payload})
    return pd.DataFrame(rows)

# parse pilot:
pilot_out = download_output(pilot_batch.id)
if pilot_out: display(parse_batch_file(pilot_out))


| | claim_id | language | translation_claim | translation_nutrient_substance | translation_health_relationship |
|---|---|---|---|---|---|
| 0 | 2 | Spanish | La ingesta de DHA contribuye al funcionamiento... | Ácido docosahexaenoico (DHA) | Función cerebral |
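
`parse_batch_file` assumes every request succeeded and returned valid JSON. For larger runs it may be safer to separate good rows from failures. The sketch below assumes each output line carries `custom_id`, a `response` object with `status_code` and `body`, and an `error` field; `split_results` is a hypothetical helper and the failure labels are illustrative.

```python
import json

def split_results(jsonl_text):
    """Return (rows, failures) from the text of a batch output file."""
    rows, failures = [], []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue                                   # tolerate blank lines
        blob = json.loads(line)
        cid = blob.get("custom_id")
        response = blob.get("response") or {}
        if blob.get("error") or response.get("status_code") != 200:
            failures.append((cid, "request error"))
            continue
        try:
            content = response["body"]["choices"][0]["message"]["content"]
            payload = json.loads(content)
        except (KeyError, IndexError, TypeError, json.JSONDecodeError):
            failures.append((cid, "unparseable content"))
            continue
        rows.append({"custom_id": cid, **payload})
    return rows, failures
```

The `failures` list gives the `custom_id` values that could be resubmitted in a follow-up batch.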


# 9 · Retrieve *all* batch outputs
- Works whether you created the batches in the current session or reopened the notebook later.
- If you saved the IDs to disk, load them; otherwise just pass the in-memory list `batch_ids`.


In [14]:
# -----------------------------------------------------------------
# (A)  Fetch the batch-ID list
# -----------------------------------------------------------------
BATCH_ID_FILE = pathlib.Path("batch_ids_full.json")   # JSON list of batch IDs
if BATCH_ID_FILE.exists():
    batch_ids = json.loads(BATCH_ID_FILE.read_text())
    print("Loaded", len(batch_ids), "batch IDs from file.")
else:
    # fall back to whatever variable is in memory
    try:
        batch_ids
    except NameError:
        raise ValueError("No batch_ids variable and no batch-id file found.")

# -----------------------------------------------------------------
# (B)  Poll until every batch is done (or failed)
# -----------------------------------------------------------------
PENDING  = set(batch_ids)
COMPLETED, FAILED = set(), set()

print("Polling", len(PENDING), "batches …  (ctrl-C to stop)")
while PENDING:
    for bid in list(PENDING):
        st = client.batches.retrieve(bid).status
        if st == "completed":
            COMPLETED.add(bid); PENDING.remove(bid)
        elif st == "failed":
            FAILED.add(bid); PENDING.remove(bid)
    print(f" done: {len(COMPLETED)}  failed: {len(FAILED)}  pending: {len(PENDING)}",
          end="\r")
    if PENDING:
        time.sleep(60)       # check once per minute

print("\nAll batches resolved.")


Polling 1 batches …  (ctrl-C to stop)
 done: 1  failed: 0  pending: 0
All batches resolved.
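
The polling logic can also be wrapped in a function with an upper bound on waiting, which makes it easier to reuse and to test without network calls. This is a sketch only: `poll_batches` and its parameters are hypothetical, and it assumes `expired` and `cancelled` are terminal statuses alongside `failed`.

```python
import time

TERMINAL_OK = {"completed"}
TERMINAL_BAD = {"failed", "expired", "cancelled"}   # assumed terminal statuses

def poll_batches(batch_ids, get_status, interval=60, max_wait=24 * 3600,
                 sleep=time.sleep):
    """Poll until all batches are terminal or max_wait seconds have elapsed.

    get_status is any callable mapping a batch ID to its status string.
    Returns (done, failed, still_pending) as sets.
    """
    pending = set(batch_ids)
    done, failed = set(), set()
    waited = 0
    while pending:
        for bid in list(pending):
            status = get_status(bid)
            if status in TERMINAL_OK:
                done.add(bid); pending.discard(bid)
            elif status in TERMINAL_BAD:
                failed.add(bid); pending.discard(bid)
        if not pending or waited >= max_wait:
            break
        sleep(interval)
        waited += interval
    return done, failed, pending
```

In the notebook this could be called as `poll_batches(batch_ids, lambda bid: client.batches.retrieve(bid).status)`.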


In [15]:

# -----------------------------------------------------------------
# (C)  Download + parse every completed batch
# -----------------------------------------------------------------
all_tables = []
for bid in COMPLETED:
    out_path = download_output(bid, out_dir="int_data/batch_out_demo")
    if out_path:
        all_tables.append(parse_batch_file(out_path))

if not all_tables:
    raise ValueError("No completed batch files parsed!")

df_translated = pd.concat(all_tables, ignore_index=True)
print("Rows collected:", len(df_translated))
display(df_translated.head())


Rows collected: 9


| | claim_id | language | translation_claim | translation_nutrient_substance | translation_health_relationship |
|---|---|---|---|---|---|
| 0 | 1 | Spanish | El calcio contribuye a la salud ósea normal. | Calcio | Salud ósea |
| 1 | 1 | German | Kalzium trägt zur normalen Knochengesundheit bei. | Kalzium | Knochengesundheit |
| 2 | 1 | Japanese | カルシウムは正常な骨の健康に寄与します。 | カルシウム | 骨の健康 |
| 3 | 2 | Spanish | La ingesta de DHA contribuye al funcionamiento... | Ácido docosahexaenoico (DHA) | Función cerebral |
| 4 | 2 | German | Die Aufnahme von DHA trägt zur normalen Funkti... | Docosahexaensäure (DHA) | Gehirnfunktion |


In [16]:

# -----------------------------------------------------------------
# (D)  Join back to English claims (optional)
# -----------------------------------------------------------------
df_final = (
    df_translated
    .merge(df_claims[["claim_id", "Claim"]], on="claim_id", how="left")
    .sort_values(["claim_id", "language"])
)
display(df_final.head())

| | claim_id | language | translation_claim | translation_nutrient_substance | translation_health_relationship | Claim |
|---|---|---|---|---|---|---|
| 1 | 1 | German | Kalzium trägt zur normalen Knochengesundheit bei. | Kalzium | Knochengesundheit | Calcium contributes to normal bone health. |
| 2 | 1 | Japanese | カルシウムは正常な骨の健康に寄与します。 | カルシウム | 骨の健康 | Calcium contributes to normal bone health. |
| 0 | 1 | Spanish | El calcio contribuye a la salud ósea normal. | Calcio | Salud ósea | Calcium contributes to normal bone health. |
| 4 | 2 | German | Die Aufnahme von DHA trägt zur normalen Funkti... | Docosahexaensäure (DHA) | Gehirnfunktion | DHA intake contributes to the normal function ... |
| 5 | 2 | Japanese | DHAの摂取は脳の正常な機能に寄与します。 | ドコサヘキサエン酸 (DHA) | 脳機能 | DHA intake contributes to the normal function ... |
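
After the join it is worth confirming that every `(claim, language)` pair actually came back, since a left merge will not flag requests that never produced a row. A small sketch of such a coverage check follows; `missing_pairs` is a hypothetical helper that works on plain iterables, so it does not depend on pandas.

```python
from itertools import product

def missing_pairs(claim_ids, languages, translated_pairs):
    """Return the (claim_id, language) pairs that have no translation yet."""
    expected = set(product(claim_ids, languages))
    return sorted(expected - set(translated_pairs))
```

With the notebook's variables this might be called as `missing_pairs(df_claims.claim_id, TARGET_LANGS, zip(df_translated.claim_id, df_translated.language))`; an empty list means the run is complete.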
