<h1 style="font-size:36px; line-height:1.1;">Appendix: Temporal Agents with Knowledge Graphs</h1>

This notebook contains an appendix to the **Temporal Agents with Knowledge Graphs** cookbook. 

Within this appendix, you'll find a more in-depth *Prototype to Production* section. 

# A. Prototype to Production
---

## A.1. Storing and Retrieving High-Volume Graph Data

### A.1.1. Data Volume & Schema Complexity

As your dataset scales to millions or even billions of nodes and edges, managing performance and maintainability becomes critical. This requires thoughtful approaches to both schema design and data partitioning:

<ol style="margin-left: 1em; line-height: 1.6; padding-left: 0.5em;">
  <li style="margin-bottom: 1.2em;">
    <strong>Schema design for growth and change</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Clearly define core entity types (e.g., <code>Person</code>, <code>Organization</code>, <code>Event</code>) and relationships. Design the schema with versioning and flexibility in mind, enabling future schema evolution with minimal downtime.
    </p>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Sharding &amp; partitioning</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Use high-cardinality fields (such as timestamps or unique entity IDs) for partitioning to preserve query performance as data volume grows. This is particularly important for temporally-aware data. For example:
    </p>
  
  ```sql  
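  CREATE TABLE statements (
    statement_id UUID NOT NULL,
    entity_id UUID NOT NULL,
    text TEXT NOT NULL,
    valid_from TIMESTAMP NOT NULL,
    valid_to TIMESTAMP,
    status VARCHAR(16) DEFAULT 'active',
    embedding VECTOR(1536),
    -- On a partitioned table, PostgreSQL requires the primary key to include the partition key
    PRIMARY KEY (statement_id, valid_from),
    ...
  ) PARTITION BY RANGE (valid_from);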
  ```
  </li>
</ol>
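
A range-partitioned table also needs its child partitions created ahead of time. The helper below is a minimal sketch of how a scheduled job might generate the DDL for one monthly partition. The function name and the monthly granularity are assumptions for illustration, and the table name is interpolated directly, so it should only ever come from trusted configuration.

```python
from datetime import date


def monthly_partition_ddl(parent: str, year: int, month: int) -> str:
    """Build the DDL for one monthly range partition of `parent`."""
    start = date(year, month, 1)
    # Roll over to January of the following year after December.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )
```

The upper bound of a range partition is exclusive, so consecutive months do not overlap.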

### A.1.2. Temporal Validity & Versioning

In our temporal knowledge graph, each statement includes temporal markers (e.g., `valid_from`, `valid_to`). Two practices help keep this history accurate and queryable:

<ol style="margin-left: 1em; line-height: 1.6; padding-left: 0.5em;">
  <li style="margin-bottom: 1.2em;">
    <strong>Preserve history non-destructively</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Avoid deleting or overwriting records. Instead, mark outdated facts as inactive by setting a <code>status</code> (e.g., <code>inactive</code>).
    </p>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Optimize for temporal access</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Index temporal fields (<code>valid_from</code>, <code>valid_to</code>) to support efficient querying of both current and historical states.
    </p>
  </li>
</ol>


#### Example: Non-Destructive Updates

Rather than removing or overwriting a record, update its status and close its validity window:

```sql
UPDATE statements
SET status = 'inactive', valid_to = '2025-03-15T00:00:00Z'
WHERE statement_id = '...' AND entity_id = '...';
```
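
To illustrate why closing the validity window matters, here is a small self-contained sketch that uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL. The helper names (`supersede`, `facts_as_of`, `demo`) and the sample facts are hypothetical. Timestamps are stored as ISO-8601 strings in a single consistent format, which is what makes plain string comparison safe here.

```python
import sqlite3


def supersede(conn, statement_id, closed_at):
    """Close a statement's validity window instead of deleting it."""
    conn.execute(
        "UPDATE statements SET status = 'inactive', valid_to = ? "
        "WHERE statement_id = ? AND status = 'active'",
        (closed_at, statement_id),
    )


def facts_as_of(conn, entity_id, as_of):
    """Return the statements that were valid for an entity at `as_of`."""
    rows = conn.execute(
        "SELECT text FROM statements "
        "WHERE entity_id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY valid_from",
        (entity_id, as_of, as_of),
    )
    return [text for (text,) in rows]


def demo():
    """Build an in-memory table with one superseded and one current fact."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE statements (statement_id TEXT PRIMARY KEY, entity_id TEXT, "
        "text TEXT, valid_from TEXT, valid_to TEXT, status TEXT DEFAULT 'active')"
    )
    conn.executemany(
        "INSERT INTO statements (statement_id, entity_id, text, valid_from) "
        "VALUES (?, ?, ?, ?)",
        [
            ("s1", "acme", "CEO is Alice", "2020-01-01T00:00:00Z"),
            ("s2", "acme", "CEO is Bob", "2025-03-15T00:00:00Z"),
        ],
    )
    supersede(conn, "s1", "2025-03-15T00:00:00Z")
    return conn
```

Because the old row is kept, a query for mid-2024 still returns the earlier fact, while a query for mid-2025 returns only the current one.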

### A.1.3. Indexing & Semantic Search

##### Temporal Indexes
To support efficient temporal queries, create B-tree indexes on `valid_from` and `valid_to`. A B-tree index is a balanced tree structure that keeps keys sorted, enabling lookups in logarithmic time as well as efficient range queries and ordered scans. It is the default index type in many relational databases, including PostgreSQL.

```sql
CREATE INDEX ON statements (valid_from);
CREATE INDEX ON statements (valid_to);
```
##### Semantic search with pgvector
Storing vector embeddings in PostgreSQL (via the `pgvector` extension) enables similarity-based retrieval via semantic search. This follows a two-step process:
1. Store high-dimensional vectors that represent the semantic meaning of the text. These can be created with embedding models such as OpenAI's `text-embedding-3-small` and `text-embedding-3-large`.
2. Use Approximate Nearest-Neighbour (ANN) search for efficient similarity matching at scale.

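pgvector itself provides two approximate index types, `ivfflat` and `hnsw`; without an index it performs an exact (flat) scan. The table below also lists `ivfpq` for comparison. It is not part of pgvector, but is available in other systems such as Faiss and Milvus. The pgvector options are described in more detail, along with in-depth implementation steps, in the [README on the GitHub repository for pgvector](https://github.com/pgvector/pgvector/blob/master/README.md).
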
| <div align="center">Index Type</div> | <div align="center">Build Time</div> | <div align="center">Query Speed</div> | <div align="center">Memory Usage</div> | <div align="center">Accuracy</div> | <div align="center">Recommended Scale</div> | Notes |
|-------------------------------------|--------------------------------------|----------------------------------------|-----------------------------------------|-----------------------------------|----------------------------------------------|-------|
| <div align="center">**flat**</div> | <div align="center">Minimal</div> | <div align="center">Slow<br>(linear scan)</div> | <div align="center">Low</div> | <div align="center">100%<br>(exact)</div> | <div align="center">Very small<br>(&lt; 100 K vectors)</div> | No approximate indexing—scans all vectors. Best for exact recall on small collections |
| <div align="center">**ivfflat**</div> | <div align="center">Moderate</div> | <div align="center">Fast when tuned</div> | <div align="center">Moderate</div> | <div align="center">High<br>(tunable)</div> | <div align="center">Small to Medium<br>(100 K–200 M)</div> | Uses inverted file indexing. Query-time parameters control trade-offs |
| <div align="center">**ivfpq**</div> | <div align="center">High</div> | <div align="center">Very fast</div> | <div align="center">Low<br>(quantized)</div> | <div align="center">Slightly lower<br>than ivfflat</div> | <div align="center">Medium to Large<br>(1 M–500 M)</div> | Combines inverted files with product quantization for lower memory use |
| <div align="center">**hnsw**</div> | <div align="center">Highest</div> | <div align="center">Fastest<br>(esp. at scale)</div> | <div align="center">High<br>(in-memory)</div> | <div align="center">Very high</div> | <div align="center">Large to Very Large<br>(100 M–Billions+)</div> | Builds a hierarchical navigable graph. Ideal for latency-sensitive, high-scale systems |


##### Tuning parameters for vector indexing

`ivfflat`
* `lists`: Number of partitions (e.g., 100)
* `probes`: Number of partitions to scan at query time (e.g., 10-20), controls recall vs. latency

`ivfpq`
* `subvectors`: Number of blocks to quantize (e.g., 16)
* `bits`: Number of bits per block (e.g., 8)
* `probes`: Same as in `ivfflat`

`hnsw`
* `m`: Max connections per node (e.g., 16)
* `ef_construction`: Build-time dynamic candidate list size (e.g., 200)
* `ef_search`: Query-time candidate pool (e.g., 64-128)
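
For the two index types that pgvector ships, the parameters above map onto a `WITH (...)` clause at build time and a session setting at query time. The sketch below generates those statements as strings. The helper names are hypothetical, cosine distance (`vector_cosine_ops`) is an assumed choice, and identifiers are interpolated directly, so they should come from trusted configuration only.

```python
def pgvector_index_ddl(table, column, method, **params):
    """Build a pgvector CREATE INDEX statement for `ivfflat` or `hnsw`."""
    allowed = {"ivfflat": {"lists"}, "hnsw": {"m", "ef_construction"}}
    if method not in allowed:
        raise ValueError(f"unsupported index method: {method}")
    unknown = set(params) - allowed[method]
    if unknown:
        raise ValueError(f"unknown parameters for {method}: {sorted(unknown)}")
    # Build-time options go in the WITH clause, in a stable (sorted) order.
    options = ", ".join(f"{key} = {int(value)}" for key, value in sorted(params.items()))
    with_clause = f" WITH ({options})" if options else ""
    return f"CREATE INDEX ON {table} USING {method} ({column} vector_cosine_ops){with_clause};"


def query_time_setting(method, value):
    """Build the session-level setting that trades recall against latency."""
    setting = {"ivfflat": "ivfflat.probes", "hnsw": "hnsw.ef_search"}[method]
    return f"SET {setting} = {int(value)};"
```

For example, `pgvector_index_ddl("statements", "embedding", "ivfflat", lists=100)` followed by `query_time_setting("ivfflat", 10)` corresponds to the `ivfflat` values suggested above.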

##### Best practices
- `flat` for debugging or small datasets
- `ivfflat` when you want tunable accuracy with good speed
- `ivfpq` when memory efficiency is critical
- `hnsw` when optimizing for lowest latency on massive collections

##### Other vector database options in the ecosystem

| Vector DB    | Key Features                                                 | Pros                                        | Cons                                                            |
| ------------ | ------------------------------------------------------------ | ------------------------------------------- | --------------------------------------------------------------- |
| **Pinecone** | Fully managed, serverless; supports HNSW and SPANN           | Auto-scaling, SLA-backed, easy to integrate | Vendor lock-in; cost escalates at scale                         |
| **Weaviate** | GraphQL API, built-in modules for encoding and vectorization | Hybrid queries (metadata + vector), modular | Production deployment requires Kubernetes                       |
| **Milvus**   | Supports GPU indexing; IVF, HNSW, ANNOY                      | High performance at scale, dynamic indexing | Operational complexity; separate system                         |
| **Qdrant**   | Lightweight, real-time updates, payload filtering            | Simple setup, good hybrid query support     | Lacks native relational joins; eventual consistency in clusters |
| **Vectara**  | Managed with semantic ranking and re-ranking                 | Strong relevance features; easy integration | Proprietary; limited index control                              |

##### Choosing the Right Vector Store

| <div align="center">Scale</div> | <div align="center">Recommendation</div> | Details |
|--------------------------------|------------------------------------------|---------|
| <div align="center">**Small to Medium Scale**<br>(less than 100M vectors)</div> | <div align="center">PostgreSQL + pgvector<br>with `ivfflat` index</div> | Often sufficient for moderate workloads. Recommended settings: `lists = 100–200`, `probes = 10–20`. |
| <div align="center">**Large Scale**<br>(100M – 1B+ vectors)</div> | <div align="center">Milvus or Qdrant</div> | Suitable for high-throughput workloads, especially when GPU-accelerated indexing or sub-millisecond latency is needed. |
| <div align="center">**Hybrid Scenarios**</div> | <div align="center">PostgreSQL for metadata<br>+ dedicated vector DB</div> | Use PostgreSQL for entity metadata storage and a vector DB (e.g., Milvus, Qdrant) for similarity search. Synchronize embeddings using CDC pipelines (e.g., Debezium). |

For more detailed information, check out the [OpenAI cookbook on vector databases](https://cookbook.openai.com/examples/vector_databases/readme).

##### Durable disk storage and backup
For some cases, especially those requiring high availability or state recovery across restarts, it may be worth persisting state to reliable disk storage and implementing a backup strategy. 

If durability is a concern, consider using persistent disks with regular backups or syncing state to external storage. While not necessary for all deployments, it can provide a valuable safeguard against data loss or operational disruption in environments where consistency and fault tolerance matter.

## A.2. Managing and Pruning Datasets

### A.2.1. TTL (Time-to-Live) and Archival Policies

Establish clear policies to determine which facts should be retained indefinitely (e.g., legally required records for regulators) and which can be archived after a defined period (e.g., statements sourced from social media more than one year old).

Key practices to include:
<ol style="margin-left: 1em; line-height: 1.6; padding-left: 0.5em;">
  <li style="margin-bottom: 1.2em;">
    <strong>Automated Archival Jobs</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Set up a background task that periodically queries for expired records (e.g., <code>valid_to &lt; NOW() - INTERVAL 'X days'</code>) and moves them to an archival table for long-term storage.
    </p>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Source-Specific Retention Policies</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Tailor retention durations by data source or entity type. For example, high-authority sources like government publications may warrant longer retention than less reliable data such as scraped news headlines or user-generated content.
    </p>
  </li>
</ol>
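
An archival job of this kind can be sketched in a few lines. The example below uses `sqlite3` as a stand-in for PostgreSQL and assumes a `statements_archive` table with the same columns as `statements`; both that table name and the `archive_expired` helper are hypothetical. It also assumes `valid_to` is stored as an ISO-8601 string in the same format that `datetime.isoformat()` produces for `now`.

```python
import sqlite3
from datetime import timedelta


def archive_expired(conn, now, retention_days):
    """Move statements whose validity ended before the cutoff into the archive table."""
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    with conn:  # commit both statements together, or roll both back on error
        conn.execute(
            "INSERT INTO statements_archive SELECT * FROM statements "
            "WHERE valid_to IS NOT NULL AND valid_to < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM statements WHERE valid_to IS NOT NULL AND valid_to < ?",
            (cutoff,),
        ).rowcount
    return moved
```

Running the copy and the delete in one transaction means a failure part-way through cannot lose or duplicate rows. Source-specific retention could be layered on by passing a different `retention_days` per source.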

### A.2.2. Relevance Scoring and Intelligent Pruning

As your knowledge graph grows, the utility of many facts will decline. To keep the graph focused and maximise performance: 
<ol style="margin-left: 1em; line-height: 1.6; padding-left: 0.5em;">
  <li style="margin-bottom: 1.2em;">
    <strong>Index a Relevance Score</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Introduce a numeric <code>relevance_score</code> column (or columns) that incorporates metrics such as recency, source trustworthiness, and production query frequency.
    </p>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Automated Pruning Logic</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Schedule a routine job to prune or archive facts falling below a predefined relevance threshold.
    </p>
  </li>
</ol>


#### Advanced Relevance-Based Graph Reduction

Efficiently reducing the size of a knowledge graph is important when scaling. [A 2024 survey](https://arxiv.org/pdf/2402.03358) categorizes techniques into **sparsification**, **coarsening**, and **condensation**—all aimed at shrinking the graph while preserving task-critical semantics. These methods offer substantial runtime and memory gains on large-scale KGs.

Example implementation pattern:
<ol style="margin-left: 1em; line-height: 1.6; padding-left: 0.5em;">
  <li style="margin-bottom: 1.2em;">
    <strong>Score Each Triple</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Compute a composite <code>relevance_score</code>, for example:
    </p>
    <pre style="margin-top: 0.5em; margin-bottom: 0.5em; background-color: #f5f5f5; padding: 0.75em; border-radius: 5px;"><code>relevance_score = β1 * recency_score + β2 * source_trust_score + β3 * retrieval_count</code></pre>
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Where:
    </p>
    <ul style="margin-top: 0.5em; margin-bottom: 0.5em; padding-left: 1.2em;">
      <li><code>recency_score</code>: exponential decay from <code>valid_from</code></li>
      <li><code>source_trust_score</code>: source-domain trust value</li>
      <li><code>retrieval_count</code>: production query frequency</li>
    </ul>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Apply a Reduction Strategy</strong><br />
    <ul style="margin-top: 0.5em; margin-bottom: 0.5em; padding-left: 1.2em;">
      <li><strong>Sparsify</strong>: Select and retain only the most relevant edges or nodes based on criteria like centrality, spectral similarity, or embedding preservation</li>
      <li><strong>Coarsen</strong>: Group low-importance or semantically similar nodes into super-nodes and aggregate their features and connections</li>
      <li><strong>Condense</strong>: Construct a task-optimized mini-graph from scratch</li>
    </ul>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Validate in Shadow Mode</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Log and compare outputs from the pruned vs. original graph before routing production traffic.
    </p>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Re-Score Regularly</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Recompute relevance (e.g., nightly) to ensure new or frequently accessed facts surface back to the top.
    </p>
  </li>
</ol>
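
The scoring step can be sketched as follows. The weights, the 180-day half-life, and the log-damped, capped treatment of `retrieval_count` are all assumptions chosen for illustration, not values from the cookbook. Damping the raw count keeps a handful of very popular facts from dominating the other two terms.

```python
import math


def recency_score(valid_from, now, half_life_days=180.0):
    """Exponential decay: 1.0 for a brand-new fact, 0.5 after one half-life."""
    age_days = max((now - valid_from).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def relevance_score(valid_from, source_trust, retrieval_count, now,
                    weights=(0.5, 0.3, 0.2)):
    """Weighted sum of recency, source trust, and damped retrieval frequency."""
    b1, b2, b3 = weights
    # Assumption: usage saturates at 1000 retrievals; log1p dampens heavy hitters.
    usage = min(math.log1p(retrieval_count) / math.log1p(1000), 1.0)
    return b1 * recency_score(valid_from, now) + b2 * source_trust + b3 * usage


def below_threshold(facts, now, threshold=0.2):
    """Return the IDs of facts that are candidates for pruning or archival."""
    return [
        fact["id"]
        for fact in facts
        if relevance_score(fact["valid_from"], fact["trust"], fact["retrievals"], now) < threshold
    ]
```

With `source_trust` in the range 0 to 1, every term is bounded by its weight, so the composite score stays between 0 and 1 and a single threshold is easy to reason about.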

## A.3. Implementing Concurrency in the Ingestion Pipeline

Moving from prototype to production often requires you to transform your linear processing pipeline into a concurrent, scalable pipeline. Instead of processing documents sequentially (document → chunking → statement extraction → entity extraction → statement invalidation → entity resolution), implement a staged pipeline where each phase can scale independently.

Design your pipeline with a series of specialized stages, each with its own queue and worker pool. This allows you to scale bottlenecks independently and maintain system reliability under varying loads. 

<ol style="margin-left: 1em; line-height: 1.6; padding-left: 0.5em;">
  <li style="margin-bottom: 1.2em;">
    <strong>Batch Chunking</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Begin by collecting documents in batches (e.g., 100–500) using a job queue such as Redis or Amazon SQS. Process these documents in parallel, splitting each into its chunks. Because document reading is often the bottleneck, optimize this stage for I/O parallelism. Then store the chunks and their metadata in your <code>chunk_store</code> table, using bulk insert operations to minimize overhead.
    </p>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Statement and Entity Extraction</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Pull chunks in batches (e.g., 50–100) and send them to your chosen LLM (e.g., GPT-4.1-mini) using parallel API requests. Implement rate limiting with semaphores or other methods to stay safely within OpenAI's API limits whilst maximizing your throughput. We've covered rate limiting in more detail in our cookbook on <a href="https://cookbook.openai.com/examples/how_to_handle_rate_limits">How to handle rate limits</a>. Once extracted, you can write the statements to the relevant table in your database.
    </p>
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Then group the extracted statements into batches and run entity extraction in the same way before storing the results.
    </p>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Statement Invalidation</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Group extracted statement IDs by their associated entity clusters (e.g., all statements related to a specific entity like “Acme Corp.”). Send each cluster to your LLM (e.g., GPT-4.1-mini) in parallel to assess which statements are outdated or superseded. Use the model’s output to update the <code>status</code> field in your <code>statements</code> table—e.g., setting <code>status = 'inactive'</code>. Parallelize invalidation jobs for performance and consider scheduling periodic sweeps for consistency.
    </p>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Entity Resolution</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Take batches of newly extracted entity mentions and compute embeddings using your model’s embedding endpoint. Insert these into your <code>entity_registry</code> table, assigning each a provisional or canonical <code>entity_id</code>. Perform approximate nearest-neighbor (ANN) searches using <code>pgvector</code> to identify near-duplicates or aliases. You can then update the <code>entities</code> table with resolved canonical IDs, ensuring downstream tasks reference unified representations.
    </p>
  </li>
</ol>
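
The staged design can be sketched in-process with `asyncio` queues before moving to an external broker. This is a simplified two-stage version (chunking, then extraction) under stated assumptions: `chunk` and `extract` are hypothetical async handlers that each return a list of outputs, and the queue sizes and worker counts are placeholders. Error handling and retries are left out.

```python
import asyncio


def start_stage(handler, inbox, outbox, workers, limiter):
    """Start `workers` tasks that pull from `inbox` and push results to `outbox`."""

    async def worker():
        while True:
            item = await inbox.get()
            try:
                async with limiter:  # cap in-flight calls for this stage
                    outputs = await handler(item)
                for output in outputs:
                    await outbox.put(output)  # blocks while the next queue is full
            finally:
                inbox.task_done()

    return [asyncio.create_task(worker()) for _ in range(workers)]


async def run_pipeline(documents, chunk, extract, workers=4, max_in_flight=8):
    """Two-stage pipeline: documents -> chunks -> statements."""
    docs_q = asyncio.Queue(maxsize=100)    # bounded queues provide backpressure
    chunks_q = asyncio.Queue(maxsize=100)
    statements_q = asyncio.Queue()         # unbounded sink, drained at the end
    tasks = start_stage(chunk, docs_q, chunks_q, workers, asyncio.Semaphore(workers))
    tasks += start_stage(extract, chunks_q, statements_q, workers,
                         asyncio.Semaphore(max_in_flight))
    for document in documents:
        await docs_q.put(document)
    await docs_q.join()    # every document has been chunked
    await chunks_q.join()  # every chunk has been through extraction
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return [statements_q.get_nowait() for _ in range(statements_q.qsize())]
```

Each stage has its own worker count and semaphore, so a slow LLM-bound stage can be scaled or throttled without touching the others. Invalidation and entity resolution would be added as further `start_stage` calls with their own queues.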


### Advantages of Batch Processing
* Throughput – Batching reduces the overhead of individual API calls and database transactions.

* Parallelism – Each stage can horizontally scale: you can run multiple worker processes for chunking, extraction, invalidation, etc., each reading from a queue.

* Backpressure & Reliability – If one stage becomes slow (e.g., statement invalidation during a sudden data surge), upstream stages can buffer more items in the queue until capacity frees up.


## A.4. Minimizing Token Cost

### A.4.1. Prompt Caching

Avoid redundant API calls by memoizing responses to sub-prompts that are sent repeatedly with identical inputs.

Implementation Strategy:
- **Cache Frequent Queries**: For example, repeated prompts like "Extract entities from this statement" on identical statements
- **Use Hash Keys**: Generate a unique cache key using the MD5 hash of the statement text: `md5(statement_text)`
- **Storage Options**: Redis for scalable persistence or in-memory LRU cache for simplicity and speed
- **Bypass API Calls**: If a statement is found in cache, skip the API call
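
A minimal in-memory version of this strategy might look like the sketch below. `call_model` is a hypothetical callable that wraps the LLM request. Including a prompt version in the key and normalizing whitespace are assumptions added here so that a changed prompt does not return stale results; neither is part of the strategy as described above.

```python
import hashlib


class ExtractionCache:
    """In-memory memoization keyed by a hash of the prompt version and statement."""

    def __init__(self, call_model):
        self._call_model = call_model  # hypothetical callable wrapping the LLM request
        self._store = {}
        self.hits = 0

    @staticmethod
    def key(prompt_version, statement_text):
        # Normalize whitespace so trivially different copies share a key.
        normalized = " ".join(statement_text.split())
        return hashlib.md5(f"{prompt_version}:{normalized}".encode("utf-8")).hexdigest()

    def extract(self, prompt_version, statement_text):
        cache_key = self.key(prompt_version, statement_text)
        if cache_key in self._store:
            self.hits += 1
            return self._store[cache_key]  # cache hit: skip the API call
        result = self._call_model(statement_text)
        self._store[cache_key] = result
        return result
```

Swapping the dictionary for Redis would keep the same key scheme while sharing the cache across workers.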

### A.4.2. Service Tier: Flex

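Set the `service_tier` parameter to `flex` in OpenAI Responses API requests to reduce costs in exchange for slower response times and occasional resource unavailability. This suits work that is not latency-sensitive, such as bulk extraction.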

API Configuration:
```json
{
  "model": "o4-mini",
  "input": "<your prompt>",
  "service_tier": "flex"
}
```

Cost Benefits:
- Input and output tokens are both billed at the lower Batch API rates (the discount is not limited to generated tokens)
- Can reduce costs by up to 40% for short extractions (e.g., single-sentence entity lists)

You can learn more about the power of Flex processing and how to utilise it in the [API documentation for Flex processing](https://platform.openai.com/docs/guides/flex-processing?api-mode=responses).
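
Because Flex requests can be slow or temporarily unavailable, callers usually need a retry and fallback path. The sketch below shows one possible shape. `send` is a hypothetical callable that performs the request (for example, a thin wrapper around the OpenAI SDK's `client.responses.create(**payload)`), and the retry counts and delays are placeholder assumptions.

```python
import time


def build_request(prompt, tier="flex"):
    """Build a Responses API payload; the prompt goes in the `input` field."""
    return {"model": "o4-mini", "input": prompt, "service_tier": tier}


def create_with_flex_fallback(send, prompt, retries=2, base_delay=1.0, sleep=time.sleep):
    """Try Flex with exponential backoff, then fall back to the default tier."""
    for attempt in range(retries + 1):
        try:
            return send(build_request(prompt, tier="flex"))
        except Exception:  # in practice, narrow this to timeout/unavailable errors
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)
    # Flex capacity was not available; pay standard rates rather than fail.
    return send(build_request(prompt, tier="default"))
```

Whether to fall back or simply re-queue the job depends on how time-sensitive the extraction is.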

### A.4.3. Minimize "Chattiness"

Replace expensive text-generation calls with more efficient alternatives where possible.

Alternative approach:
- Use embeddings endpoint (cheaper per token) combined with pgvector nearest-neighbor search
- Instead of asking the model "Which existing statement is most similar?", compute embeddings once and query directly in Postgres
- This approach is particularly effective for semantic similarity tasks

**Benefits:**
- Lower cost per operation
- Faster query response times
- Reduced API dependency for similarity searches
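
As a sketch of what replaces the text-generation call, the snippet below does an exact cosine nearest-neighbour lookup in plain Python and shows the corresponding pgvector query as a string. The function names are hypothetical, and the SQL assumes the `statements` table and `embedding` column used earlier, with a psycopg-style `%s` placeholder.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, matching what pgvector's `<=>` operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def most_similar(query_embedding, stored):
    """Exact nearest neighbour over (statement_id, embedding) pairs."""
    return min(stored, key=lambda row: cosine_distance(query_embedding, row[1]))[0]


# The same lookup pushed down into Postgres with pgvector:
NEAREST_STATEMENT_SQL = (
    "SELECT statement_id FROM statements "
    "ORDER BY embedding <=> %s::vector LIMIT 1"
)
```

The embedding for each statement is computed once at ingestion time, so each similarity lookup costs one embedding call for the query plus a database query.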

## A.5. Scaling and Productionizing our Retrieval Agent

Once your graph is populated, you need a mechanism to answer multi-hop queries at scale. This requires:

<ol style="margin-left: 1em; line-height: 1.6; padding-left: 0.5em;">
  <li style="margin-bottom: 1.2em;">
    <strong>Agent Architecture</strong><br />
    <ul style="margin-top: 0.5em; margin-bottom: 0.5em; padding-left: 1.2em;">
      <li><strong>Controller Agent (Frontend)</strong>: Receives a user question (e.g., “What events led to Acme Corp.’s IPO?”), then decomposes it into sub-questions or traversal steps.</li>
      <li><strong>Traversal Worker Agents</strong>: Each worker can perform a local graph traversal (e.g., “Find all facts where Acme Corp. has EventType = Acquisition between 2020–2025”), possibly in parallel on different partitions of the graph.</li>
    </ul>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Parallel Subgraph Extraction</strong><br />
    <ul style="margin-top: 0.5em; margin-bottom: 0.5em; padding-left: 1.2em;">
      <li>Partition the graph by entity ID hash (e.g., modulo 16). For a given query, identify which partitions are likely to contain relevant edges, then dispatch traversal tasks in parallel to each worker.</li>
      <li>Workers return partial subgraphs (nodes + edges), and the Controller Agent merges them.</li>
    </ul>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Chained LLM Reasoning</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      For multi-hop questions, the Controller can prompt a model (e.g., GPT-4.1) with the partial subgraph and ask “Which next edge should I traverse?” This allows dynamic, context-aware traversal rather than blind breadth-first search.
    </p>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Caching and Memoization</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      For frequently asked queries or subgraph patterns, cache the results (e.g., in Redis or a Postgres materialized view) and expire each entry no later than the earliest <code>valid_to</code> among the facts it contains, so that subsequent requests hit the cache instead of re-traversing.
    </p>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Load Balancing &amp; Autoscaling</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Deploy the Traversal Worker Agents in a Kubernetes cluster with Horizontal Pod Autoscalers. Use CPU and memory metrics (and average queue length) to scale out during peak usage.
    </p>
  </li>
</ol>
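
The partition-and-merge step can be sketched with a thread pool standing in for the worker fleet. `traverse(partition, entity_ids)` is a hypothetical worker call that returns a `(nodes, edges)` pair of sets; in production it would be a request to a Traversal Worker Agent. A stable hash is used because Python's built-in `hash()` for strings is salted per process and would route the same entity to different partitions on different machines.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 16


def partition_for(entity_id):
    """Map an entity ID to a partition using a process-independent hash."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def extract_subgraph(entity_ids, traverse):
    """Fan traversal out by partition and merge the partial subgraphs."""
    by_partition = defaultdict(list)
    for entity_id in entity_ids:
        by_partition[partition_for(entity_id)].append(entity_id)
    nodes, edges = set(), set()
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        futures = [pool.submit(traverse, p, ids) for p, ids in by_partition.items()]
        for future in futures:
            part_nodes, part_edges = future.result()
            nodes |= part_nodes   # set union de-duplicates shared nodes
            edges |= part_edges
    return nodes, edges
```

Only partitions that actually hold one of the requested entities receive a task, which keeps fan-out proportional to the query rather than to the cluster size.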


## A.6. Safeguards

### A.6.1. Multi-Layered Output Verification

Run a lightweight validation pipeline to check that outputs meet expectations. Examples of checks to include:
* Check that dates conform to `ISO-8601`
* Verify that entity types match your controlled vocabulary (e.g., if the model outputs an unexpected label, flag for manual review)
* Deploy a "sanity-check" function call to a smaller, cheaper model to verify the consistency of outputs (for example, “Does this statement parse correctly as a Fact? Yes/No.”)
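
The first two checks are cheap enough to run on every record. The sketch below is one way to express them; the record field names and the `ALLOWED_ENTITY_TYPES` vocabulary are assumptions for illustration. Note that `datetime.fromisoformat` accepts a common subset of ISO-8601 rather than the full standard, which is usually sufficient for model-generated timestamps.

```python
from datetime import datetime

ALLOWED_ENTITY_TYPES = {"Person", "Organization", "Event"}


def validate_statement(record):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field in ("valid_from", "valid_to"):
        value = record.get(field)
        if value is None:
            if field == "valid_from":
                problems.append("valid_from is required")
            continue  # an open-ended valid_to is allowed
        try:
            # fromisoformat() only accepts a trailing "Z" from Python 3.11 onwards.
            datetime.fromisoformat(value.replace("Z", "+00:00"))
        except (ValueError, AttributeError):
            problems.append(f"{field} is not ISO-8601: {value!r}")
    if record.get("entity_type") not in ALLOWED_ENTITY_TYPES:
        problems.append(f"unknown entity_type: {record.get('entity_type')!r}")
    return problems
```

Records with a non-empty problem list can be routed to manual review instead of being written to the graph.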

### A.6.2. Audit Logging & Monitoring

- Implement structured logging with configurable verbosity levels (e.g., debug, info, warn, error)
- Store input pre-processing steps, intermediate outputs, and final results with full tracing, such as that offered via [OpenAI's tracing](https://platform.openai.com/traces)
- Track token throughput, latency, and error rates
- Monitor data quality metrics where possible, such as document or statement coverage, temporal resolution rates, and more
- Measure business-related metrics such as user numbers, average message volume, and user satisfaction
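
Structured logging with configurable verbosity can be set up with the standard library alone. The sketch below emits one JSON object per line; the `metrics` key passed through `extra` and the logger name are conventions assumed here, not part of the cookbook.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so logs can be queried downstream."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra={"metrics": {...}}` are attached to the record.
        payload.update(getattr(record, "metrics", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name="ingestion", level=logging.INFO, stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers when re-run in a notebook
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as `logger.info("extraction complete", extra={"metrics": {"tokens": 120, "latency_ms": 340}})` then produces a single line that carries both the message and the throughput and latency figures.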

## A.7. Prompt Optimization

<ol style="margin-left: 1em; line-height: 1.6; padding-left: 0.5em;">
  <li style="margin-bottom: 1.2em;">
    <strong>Personas</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Introducing a persona to the model is an effective way to drive performance. Once you have narrowed down the specialism of the component you are developing the prompt for, you can create a persona in the system prompt that helps to shape the model's behaviour. We used this in our planner model to create a system prompt like this:
    </p>
    <pre style="margin-top: 0.5em; margin-bottom: 0.5em; background-color: #f5f5f5; padding: 0.75em; border-radius: 5px;"><code>initial_planner_system_prompt = (
    "You work for the leading financial firm, ABC Incorporated, one of the largest financial firms in the world. "
    "Due to your long and esteemed tenure at the firm, various equity research teams will often come to you "
    "for guidance on research tasks they are performing. Your expertise is particularly strong in the area of "
    "ABC Incorporated's proprietary knowledge base of earnings call transcripts. This contains details that have been "
    "extracted from the earnings call transcripts of various companies with labelling for when these statements are, or "
    "were, valid. You are an expert at providing instructions to teams on how to use this knowledge graph to answer "
    "their research queries. \n"
)</code></pre>
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Persona prompts can become much more developed and specific than this, but this should provide an insight into what this looks like in practice.
    </p>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Few-Shot Prompting and Chain-of-Thought</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      For extraction-related tasks, such as statement extraction, a concise few-shot prompt (2–5 examples) will typically deliver higher precision than a zero-shot prompt at a marginal increase in cost.
    </p>
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      For tasks such as temporal reconciliation, chain-of-thought methods, where you guide the model through the comparison logic, are more appropriate. This can look like:
    </p>
    <pre style="margin-top: 0.5em; margin-bottom: 0.5em; background-color: #f5f5f5; padding: 0.75em; border-radius: 5px;"><code>Example 1: [Old fact], [New fact] → Invalidate
Example 2: [Old fact], [New fact] → Coexist
Now: [Old fact], [New fact] →</code></pre>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Dynamic Prompting &amp; Context Management</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      You can also lean on other LLMs or more structured methods to prune and prepare material that will be dynamically passed to prompts. We saw an example of this when building the tools for our retriever in the main cookbook, where the <code>timeline_generation</code> tool sorts the retrieved material before passing it back to the central orchestrator.
    </p>
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Steps to clean up the context or compress it mid-run can also be highly effective for longer-running queries.
    </p>
  </li>

  <li style="margin-bottom: 1.2em;">
    <strong>Template Library &amp; A/B Testing</strong><br />
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Maintain a set of prompt templates in a version-controlled directory (e.g., <code>prompts/statement_extraction.json</code>, <code>prompts/entity_extraction.json</code>) so that you can audit past changes and revert if necessary. You can also use OpenAI's <a href="https://platform.openai.com/docs/guides/text#reusable-prompts">reusable prompts</a>, which are developed in the OpenAI dashboard and referenced in API requests. This lets you build and evaluate your prompts, deploying updated and improved versions without changing the code.
    </p>
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Automate A/B testing by periodically sampling extracted facts from the pipeline, re-running them through alternative prompts, and comparing performance scores (you can track this in a separate evaluation harness).
    </p>
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      Track key performance indicators (KPIs) such as extraction latency, error rates, and invalidation accuracy.
    </p>
    <p style="margin-top: 0.5em; margin-bottom: 0.5em;">
      If any metric drifts beyond a threshold (e.g., invalidation accuracy drops below 90%), trigger an alert and roll back to a previous prompt version.
    </p>
  </li>
</ol>
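
The alert-and-roll-back rule can be sketched as a small registry that tracks deployed prompt versions. The class name, the version labels, and the default 90% threshold are illustrative assumptions; a production setup would persist versions and raise an alert alongside the rollback.

```python
class PromptRegistry:
    """Keeps ordered prompt versions and rolls back when a metric degrades."""

    def __init__(self, threshold=0.90):
        self.threshold = threshold
        self.versions = []  # list of (version, template), oldest first
        self.active = None

    def deploy(self, version, template):
        """Register a new prompt version and make it the active one."""
        self.versions.append((version, template))
        self.active = version

    def record_metric(self, accuracy):
        """Roll back to the previous version if accuracy falls below the threshold."""
        if accuracy >= self.threshold or len(self.versions) < 2:
            return self.active
        self.versions.pop()  # discard the underperforming version
        self.active = self.versions[-1][0]
        return self.active
```

For example, after deploying `v1` and then `v2`, a call to `record_metric(0.85)` makes `v1` active again, while `record_metric(0.95)` leaves the active version unchanged.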
