[pull] main from microsoft:main #149

Merged: pull[bot] merged 1 commit into graphrag:main from microsoft:main on May 14, 2026

pull[bot] commented May 14, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

* feat: native CosmosTableProvider with namespace partitioning

Replace the parquet-decomposition approach in AzureCosmosStorage with a
native CosmosTableProvider that implements TableProvider directly:

- CosmosTableProvider: stores DataFrame rows as Cosmos documents with
  /namespace partition key. All queries are single-partition (no fan-out).
- CosmosTable: streaming Table impl with async SDK and server-side pagination.
- AzureCosmosStorage: simplified to key-value only (context.json, stats.json,
  cache). child() now works via ':'-separated namespace prefixes.
- TableProvider.child(): new non-abstract method for namespace isolation.
  ParquetTableProvider/CSVTableProvider delegate to Storage.child().
- Pipeline wiring: run_pipeline.py and utils.py use table_provider.child()
  for update-run delta/previous isolation.
- Legacy fallback: CosmosTableProvider reads from old containers when
  legacy_container is configured, enabling transparent migration.
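
The namespace scheme in the list above can be sketched roughly as follows. This is an illustrative in-memory model, not the code from the commit: `NamespacedProvider`, its `write`/`read` methods, the `"root"` default namespace, and the `table` field are hypothetical names chosen for the example. Only the `namespace` partition field and the ':'-separated child prefixes are taken from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class NamespacedProvider:
    """Hypothetical in-memory stand-in for a namespace-partitioned provider."""

    namespace: str = "root"  # assumed default, not from the commit
    # Shared store keyed by (namespace, document id); stands in for one container.
    _store: dict = field(default_factory=dict)

    def child(self, name: str) -> "NamespacedProvider":
        # Children share the same store but get a ':'-separated namespace prefix.
        return NamespacedProvider(f"{self.namespace}:{name}", self._store)

    def write(self, table: str, rows: list) -> None:
        for i, row in enumerate(rows):
            doc = {**row, "namespace": self.namespace, "table": table}
            self._store[(self.namespace, f"{table}:{i}")] = doc

    def read(self, table: str) -> list:
        # Filter to this provider's namespace, mirroring a single-partition query.
        return [
            {k: v for k, v in doc.items() if k not in ("namespace", "table")}
            for (ns, _), doc in self._store.items()
            if ns == self.namespace and doc["table"] == table
        ]
```

In this model a delta run would call `provider.child("delta")` and write to the same container without touching the parent's rows, which is the isolation the pipeline wiring relies on.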

Tested against Cosmos DB Linux emulator (vNext, ARM64).
302 unit tests + 15 verb tests pass (no regressions).

* fix: remove enable_cross_partition_query from async SDK calls

The async azure-cosmos SDK (v4.9) leaks this kwarg through to
aiohttp.ClientSession, causing TypeError. Omitting partition_key
achieves the same cross-partition behavior automatically.

Also documents the caveat in the design doc.

Verified: migration test passes all 5 phases against Cosmos emulator.
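
A minimal sketch of the idea, assuming a small helper that assembles the keyword arguments passed to the async client's `query_items`. The helper name `build_query_kwargs` and its parameters are hypothetical. The point is only that `partition_key` is included when a namespace is known and `enable_cross_partition_query` is never passed.

```python
def build_query_kwargs(query, parameters=None, namespace=None):
    """Build keyword arguments for a query_items call (hypothetical helper)."""
    kwargs = {"query": query, "parameters": parameters or []}
    if namespace is not None:
        # Single-partition query scoped to one namespace.
        kwargs["partition_key"] = namespace
    # enable_cross_partition_query is deliberately never set: per the commit
    # message, omitting partition_key already yields cross-partition behavior
    # with the async client.
    return kwargs
```

A caller would then write something like `container.query_items(**build_query_kwargs(sql, params, namespace))`.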

* feat: transactional batch writes with configurable batch_size

Add batch_size parameter (default 50, max 100) to CosmosTableProvider
and CosmosTable. Documents are written using Cosmos transactional
batch (execute_item_batch), cutting network round-trips by up to ~50×
at the default batch size.

If a batch fails (e.g. payload too large), falls back to individual
upserts for that chunk so partial progress is never lost.

Config: table_provider.batch_size in settings.yaml
Propagates through child() and open() to streaming writes.

Tested: 120 rows at batch_size=50, 25 rows at batch_size=10,
75 streamed rows, clamping to max 100, child inheritance.
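
The chunk-then-fall-back flow described above might look roughly like this. It is a sketch under assumptions, not the committed implementation: `BatchError` stands in for the SDK's batch failure exception, `clamp_batch_size`, `chunked`, and `batch_upsert` are hypothetical names, the lower bound of 1 is assumed, and `container` is any object exposing async `execute_item_batch` and `upsert_item` methods.

```python
import asyncio


class BatchError(Exception):
    """Hypothetical stand-in for the SDK's batch failure exception."""


def clamp_batch_size(batch_size: int, maximum: int = 100) -> int:
    # Keep the configured size within 1..maximum (100 is the stated cap;
    # the lower bound of 1 is an assumption).
    return max(1, min(batch_size, maximum))


def chunked(items: list, size: int):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def batch_upsert(container, docs: list, namespace: str, batch_size: int = 50) -> None:
    for chunk in chunked(docs, clamp_batch_size(batch_size)):
        operations = [("upsert", (doc,)) for doc in chunk]
        try:
            # One round-trip per chunk; all documents share one partition key.
            await container.execute_item_batch(
                batch_operations=operations, partition_key=namespace
            )
        except BatchError:
            # Fall back to one upsert per document, for this chunk only.
            for doc in chunk:
                await container.upsert_item(doc)
```

Because the fallback is scoped to the failing chunk, chunks that already succeeded are not rewritten.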

* chore: lint cleanup and dead code removal

- Remove unused _INTERNAL_FIELDS constant (duplicated _COSMOS_SYSTEM_KEYS)
- Fix TRY300: move returns to else blocks in AzureCosmosStorage
- Fix SIM105: use contextlib.suppress for CosmosResourceNotFoundError
- Fix SLF001: replace __new__ + private attr copy with __init__ in child()
- Fix RUF002: replace en-dash with hyphen in docstrings
- Fix D105: add __aiter__ docstring
- Add noqa: PERF401 for async iteration (false positive: no async listcomp)
- All ruff checks pass, pyright 0 errors, 317 tests pass

* fix: address code review findings

Critical fixes:
- Fix ID round-trip corruption: _strip_cosmos_metadata now restores
  original id from row_id field. Previously, read_dataframe returned
  '{table_name}:{key}' instead of the pipeline's original id value.
- Always store row_id on write (consistent between provider and table).
- has() now catches CosmosResourceNotFoundError specifically instead of
  bare Exception — auth/network errors propagate correctly.
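
The ID round-trip fix can be illustrated with a small sketch. The function names `to_document` and `strip_cosmos_metadata`, the `key` argument, and the exact set of stripped fields are assumptions for the example. The `'{table_name}:{key}'` document id and the `row_id` field come from the commit message.

```python
# System properties that Cosmos DB adds to stored documents.
COSMOS_SYSTEM_KEYS = {"_rid", "_self", "_etag", "_attachments", "_ts"}


def to_document(table_name: str, row: dict, key: str, namespace: str) -> dict:
    doc = dict(row)
    # Always store the pipeline's original id (None when the table has no id).
    doc["row_id"] = row.get("id")
    # The document id is namespaced by table so ids cannot collide across tables.
    doc["id"] = f"{table_name}:{key}"
    doc["namespace"] = namespace
    return doc


def strip_cosmos_metadata(doc: dict) -> dict:
    hidden = COSMOS_SYSTEM_KEYS | {"namespace", "id", "row_id"}
    row = {k: v for k, v in doc.items() if k not in hidden}
    # Restore the original id instead of leaking '{table_name}:{key}'.
    if doc.get("row_id") is not None:
        row["id"] = doc["row_id"]
    return row
```

With this shape, a row written and read back compares equal to the original, including for tables that never had an `id` column.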

Medium fixes:
- Add asyncio.Lock to _ensure_container() for concurrent-task safety.
- _batch_upsert catches only CosmosBatchOperationError for fallback;
  other exceptions (auth, network) now propagate instead of silently
  falling back to individual upserts.
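
For the lock on `_ensure_container()`, a double-checked pattern along these lines is one plausible shape. `LazyContainer` and the `create` factory are hypothetical; the commit only states that an `asyncio.Lock` was added.

```python
import asyncio


class LazyContainer:
    """Hypothetical wrapper that creates a container client at most once."""

    def __init__(self, create):
        self._create = create  # async factory supplied by the caller
        self._container = None
        self._lock = asyncio.Lock()

    async def ensure(self):
        if self._container is None:
            async with self._lock:
                # Re-check inside the lock: another task may have finished first.
                if self._container is None:
                    self._container = await self._create()
        return self._container
```

Without the lock, several tasks awaiting the first call could each see `None` and create the container concurrently.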

Verified: ID round-trip, streaming write, no-id tables all pass
against Cosmos emulator. 317 unit/verb tests pass.

* chore: fix spellcheck and add semversioner change

- Add dictionary words: aiohttp, aiter, colls, serde, upserts, vnext
- Fix British spellings: serialisation→serialization, initialisation→initialization, behaviour→behavior
- Replace 'Unparameterized' with 'Non-parameterized'
- Add semversioner minor change file

* fix: update test_clear assertion for new clear() behavior

clear() now drops and recreates the container instead of deleting the
entire database. The container and database clients remain valid after
clear() — only the data is removed.

* refactor: extract Cosmos connection from Storage, not TableProviderConfig

Connection fields (connection_string, account_url, database_name) removed
from TableProviderConfig. The factory extracts them from the affiliated
AzureCosmosStorage instance when table_provider.type is cosmosdb.

This eliminates config duplication — credentials are defined once on
output_storage, and table_provider only carries table-specific fields
(container_name, batch_size, legacy_container).

Config example:
  output_storage:
    type: cosmosdb
    account_url: https://...
    database_name: graphrag
    container_name: graphrag-kv
  table_provider:
    type: cosmosdb
    container_name: graphrag-tables
    batch_size: 50
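
How a factory might combine the two config sections is sketched below. The dataclasses and `resolve_table_provider_args` are hypothetical illustrations, not the project's actual classes. The field names mirror the config example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CosmosStorageSettings:
    """Hypothetical view of the connection fields held by output_storage."""

    database_name: str
    account_url: Optional[str] = None
    connection_string: Optional[str] = None


@dataclass
class TableProviderSettings:
    """Hypothetical view of the table-specific fields."""

    type: str
    container_name: str
    batch_size: int = 50
    legacy_container: Optional[str] = None


def resolve_table_provider_args(storage, table_provider) -> dict:
    if table_provider.type != "cosmosdb":
        raise ValueError(f"unsupported table provider type: {table_provider.type}")
    # Connection details come from storage; only table fields come from the provider.
    return {
        "account_url": storage.account_url,
        "connection_string": storage.connection_string,
        "database_name": storage.database_name,
        "container_name": table_provider.container_name,
        "batch_size": table_provider.batch_size,
        "legacy_container": table_provider.legacy_container,
    }
```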

* perf: batch deletes in _delete_table to match write batching

Use transactional batches for delete operations instead of
one-at-a-time delete_item calls, mirroring the _batch_upsert pattern.
Falls back to individual deletes on CosmosBatchOperationError.

pull[bot] locked and limited conversation to collaborators May 14, 2026
pull[bot] added the ⤵️ pull label May 14, 2026
pull[bot] merged commit de531f0 into graphrag:main May 14, 2026