# Speech → ELN Generator (OpenAI)

**How to use**
1. Put your transcript text into a file (e.g., `Transcript.txt`, matching `INPUT_TXT_PATH` in the config cell).
2. Edit the variables in the cell below (paste your API key, choose model, input/output paths, title, timezone).
3. Run the second cell to create the `.eln` file.


In [None]:
# === CONFIG — edit me ===
import os
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "YOUR_API_KEY_HERE")   # set the env var, or replace the placeholder with your key
MODEL_NAME      = "gpt-4o-mini"        # e.g., gpt-4o-mini, gpt-4.1-mini, etc.
INPUT_TXT_PATH  = "./Transcript.txt"   # path to your transcript
OUTPUT_ELN_PATH = "./name_model_output.eln"   # desired output .eln filename
RECORD_TITLE    = "Record Name"
TIMEZONE        = "America/New_York"   # used in prompt normalization

# Optional: install the SDK if needed (pick one)
#!pip install --upgrade openai   # modern SDK (>=1.0)
#!pip install openai==0.28       # legacy SDK (ChatCompletion interface)

In [9]:
import os, json, zipfile
from datetime import datetime
from pathlib import Path

# --- Prompt (system + user) ---
SYSTEM_PROMPT = """You are an expert Lab Notebook Compiler that converts messy, speech-style transcripts into a structured experiment record in clean HTML suitable for eLabFTW export. Do not invent or infer data; extract only what is said. Preserve sample IDs exactly (e.g., E8T1M2). Normalize number words to numerals (e.g., "forty-two point five" → 42.5). Write dates as YYYY-MM-DD and times as h:mm am/pm, appending timezone only if explicitly mentioned. Keep units exactly as stated; do not convert unless the transcript states a conversion. If a value is missing, ambiguous, or contradictory, leave the corresponding table cell blank. Maintain chronological order for events and work-ups; group repeated measurements by sample ID. Output ONLY raw HTML (no Markdown, no explanations). Use 18pt section headers and HTML tables with the exact formatting rules provided in HTML_INSTRUCTIONS.
"""

HTML_INSTRUCTIONS = """Build an HTML document containing ONLY the sections below, in this order. Include a section only if the transcript provides at least one row for it.

Formatting rules:
• Section headers must be: <p><span style="font-size:18pt;">{Title}</span></p>
• Tables must be: <table style="min-width:25%;width:60%;border-width:1px;margin-left:0px;margin-right:auto;" border="1">
• The first row of every table is a centered header row using: <td style="text-align:center;">
• Each subsequent row is one record extracted from the transcript.
• Use plain <p> for single-line notes and <ul><li><p>…</p></li>…</ul> for bullet lists.
• Leave cells blank for unknowns; do not write "N/A" or placeholders.
• Preserve sample/replicate ordering as spoken (natural numeric sort within groups).

1) Experiment Overview
   Table columns: [Field, Value]
   Suggested rows (only if present): Experiment title, Experiment start date, Operator, Project/Batch, Location/Lab.

2) Materials / Reagents
   Table columns: [Name/ID, Lot/Source, Amount, Units, Role/Notes]

3) Samples / Vials / IDs
   Table columns: [Sample/Vial ID, Group/Batch, Description/Label, Notes]

4) Weigh / Volumes
   Table columns: [Sample/Vial ID, Mass (g), Volume (mL), Notes]

5) Solutions / Stocks (Composition)
   Table columns: [Solution/ID, Component, Amount, Units, Final volume (mL), Notes]

6) Setup / Conditions
   Prefer a parameter table.
   Table columns: [Parameter, Value, Units]
   Examples: temperature setpoint, actual temperature, stir rate (rpm), atmosphere (N2/Ar), vessel, equipment.

7) Procedure Timeline / Additions
   Table columns: [Date, Time (local), Event]
   Examples: reaction start, addition 1/2/3, quench, heat on/off, taken off plate.

8) Work-up / Purification
   Table columns: [Sample/ID, Date, Start (local), End (local), Step]
   Steps could include filtration, wash, dry, transfer, etc.; leave "Step" blank if not spoken.

9) Measurements — Mass Results (wide)
   Table columns: [Sample/ID, Mass A (mg), Mass B (mg), Mass C (mg)]
   If the transcript clearly names the mass fields (e.g., "Vial mass (mg)", "Crude vial mass (mg)"), use those as the headers instead of A/B/C.

10) Measurements / Results (generic long format)
    Table columns: [Sample/ID, Measurement, Value, Units, Method/Instrument]

11) Characterization
    Table columns: [Sample/ID, Technique, Date, Amount (mg), Notes]
    Techniques may include NMR, SEC/GPC, UV-Vis, FTIR, MS, etc.

12) Observations / Notes
    Use a bullet list (<ul>…) with succinct sentences capturing qualitative observations, issues, and remarks.

13) Next Steps / To-Do
    Bullet list of stated follow-ups or planned changes.

General parsing rules:
• Convert spoken numbers to numerals (e.g., "one thousand and ten" → 1010; "five point oh two" → 5.02).
• Keep precision as spoken; do not round or pad.
• Times spoken without a date belong to the most recently referenced date in context.
• If inconsistent data for the same field appears, prefer the most specific/latest mention; if still ambiguous, leave blank.
• Do not add any narrative text outside of headers, tables, and lists.
"""


USER_PREFIX = (
    "Parse the transcript below into the specified HTML sections. Do not add sections beyond the list. "
    "Leave cells blank when the transcript doesn't supply a value. Assume timezone '{tz}'.\n\n"
)

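def _random_hex(n=8):
    """Return n random lowercase hex characters (keeps the dataset ID unique)."""
    import random
    # string.hexdigits.lower() would repeat a-f and bias the draw, so list the digits explicitly.
    return ''.join(random.choices("0123456789abcdef", k=n))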

def build_rocrate_with_html(html_body: str, title: str) -> dict:
    """Wrap the HTML body in a minimal RO-Crate metadata dict (root dataset plus one record)."""
    ds_id = f"./{title} - {_random_hex()}/"
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                "@id": "./",
                "@type": "Dataset",
                "name": "eLabFTW export",
                "hasPart": [{"@id": ds_id}],
            },
            {
                "@id": ds_id,
                "@type": "Dataset",
                "name": title,
                "encodingFormat": "text/html",
                "dateCreated": datetime.now().astimezone().isoformat(),
                "text": html_body,
            },
        ],
    }

def write_eln_zip(meta: dict, out_path: Path) -> Path:
    out_path = Path(out_path)
    export_folder = datetime.now().strftime("%Y-%m-%d-%H%M%S") + "-export"
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
        z.writestr(f"{export_folder}/ro-crate-metadata.json", json.dumps(meta, ensure_ascii=False, indent=2))
    return out_path

def call_openai(model_name: str, system_prompt: str, html_instructions: str, transcript: str, api_key: str) -> str:
    """Call OpenAI and return the HTML body string.

    Uses the modern SDK (openai>=1.0) when it is installed; otherwise falls back
    to the legacy openai.ChatCompletion interface (e.g., openai==0.28).
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": html_instructions},
        {"role": "user",   "content": transcript},
    ]
    # Only a missing client class triggers the fallback; real API errors
    # (bad key, unknown model, rate limit) propagate unchanged.
    try:
        from openai import OpenAI
    except ImportError:
        OpenAI = None

    if OpenAI is not None:
        client = OpenAI(api_key=api_key)
        resp = client.chat.completions.create(model=model_name, messages=messages, temperature=0)
        content = resp.choices[0].message.content
    else:
        import openai
        openai.api_key = api_key
        resp = openai.ChatCompletion.create(model=model_name, messages=messages, temperature=0)
        content = resp["choices"][0]["message"]["content"]
    # content can be None for an empty completion; treat that as an empty body.
    return (content or "").strip()

# --- Run wrapper ---
if not OPENAI_API_KEY or OPENAI_API_KEY in ("YOUR_API_HERE", "YOUR_API_KEY_HERE"):
    raise SystemExit("Please set OPENAI_API_KEY in the config cell above before running.")

txt_path = Path(INPUT_TXT_PATH)
if not txt_path.exists():
    raise SystemExit(f"Input TXT not found: {txt_path}")

raw = txt_path.read_text(encoding="utf-8")
user_prompt = USER_PREFIX.format(tz=TIMEZONE)
full_user = user_prompt + "\nTRANSCRIPT:\n<<<\n" + raw + "\n>>>\n"

html_body = call_openai(MODEL_NAME, SYSTEM_PROMPT, HTML_INSTRUCTIONS, full_user, OPENAI_API_KEY)
meta = build_rocrate_with_html(html_body, RECORD_TITLE)
out_path = write_eln_zip(meta, Path(OUTPUT_ELN_PATH))
print(f"Saved ELN: {out_path}")


Saved ELN: Vinegar1_reprompt_4o_speech_run.eln
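
**Checking the output (optional)**

The `.eln` file written above is a zip archive holding a single `ro-crate-metadata.json`. The sketch below reopens it and reports what was stored. The helper names `read_eln_metadata` and `summarize_records` are hypothetical and not part of the notebook, and the check only confirms what this script wrote; it does not verify that eLabFTW will accept the archive on import.

```python
import json
import zipfile

def read_eln_metadata(eln_path):
    """Return the parsed ro-crate-metadata.json found inside an .eln (zip) archive."""
    with zipfile.ZipFile(eln_path) as z:
        # The generator writes the file under a timestamped export folder,
        # so match on the file name rather than on a fixed path.
        matches = [n for n in z.namelist() if n.endswith("ro-crate-metadata.json")]
        if not matches:
            raise ValueError(f"No ro-crate-metadata.json in {eln_path}")
        return json.loads(z.read(matches[0]).decode("utf-8"))

def summarize_records(meta):
    """Return (name, HTML length) for every @graph entry that carries a 'text' body."""
    return [(e.get("name"), len(e["text"])) for e in meta.get("@graph", []) if "text" in e]
```

In the notebook this could be run as `summarize_records(read_eln_metadata(OUTPUT_ELN_PATH))`; an empty list or a very short HTML length suggests the model returned little or nothing for the transcript.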
