[pull] main from microsoft:main by pull[bot] · Pull Request #149 · graphrag/ms-graphrag

pull · 2026-05-14T00:13:05Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)


* feat: native CosmosTableProvider with namespace partitioning

  Replace the parquet-decomposition approach in AzureCosmosStorage with a native CosmosTableProvider that implements TableProvider directly:

  - CosmosTableProvider: stores DataFrame rows as Cosmos documents with /namespace partition key. All queries are single-partition (no fan-out).
  - CosmosTable: streaming Table impl with async SDK and server-side pagination.
  - AzureCosmosStorage: simplified to key-value only (context.json, stats.json, cache). child() now works via ':'-separated namespace prefixes.
  - TableProvider.child(): new non-abstract method for namespace isolation. ParquetTableProvider/CSVTableProvider delegate to Storage.child().
  - Pipeline wiring: run_pipeline.py and utils.py use table_provider.child() for update-run delta/previous isolation.
  - Legacy fallback: CosmosTableProvider reads from old containers when legacy_container is configured, enabling transparent migration.

  Tested against Cosmos DB Linux emulator (vNext, ARM64). 302 unit tests + 15 verb tests pass (no regressions).

* fix: remove enable_cross_partition_query from async SDK calls

  The async azure-cosmos SDK (v4.9) leaks this kwarg through to aiohttp.ClientSession, causing TypeError. Omitting partition_key achieves the same cross-partition behavior automatically. Also documents the caveat in the design doc.

  Verified: migration test passes all 5 phases against Cosmos emulator.

* feat: transactional batch writes with configurable batch_size

  Add batch_size parameter (default 50, max 100) to CosmosTableProvider and CosmosTable. Documents are written using Cosmos transactional batch (execute_item_batch) for ~50× fewer network round-trips. If a batch fails (e.g. payload too large), falls back to individual upserts for that chunk so partial progress is never lost.

  Config: table_provider.batch_size in settings.yaml. Propagates through child() and open() to streaming writes.

  Tested: 120 rows at batch_size=50, 25 rows at batch_size=10, 75 streamed rows, clamping to max 100, child inheritance.

* chore: lint cleanup and dead code removal

  - Remove unused _INTERNAL_FIELDS constant (duplicated _COSMOS_SYSTEM_KEYS)
  - Fix TRY300: move returns to else blocks in AzureCosmosStorage
  - Fix SIM105: use contextlib.suppress for CosmosResourceNotFoundError
  - Fix SLF001: replace __new__ + private attr copy with __init__ in child()
  - Fix RUF002: replace en-dash with hyphen in docstrings
  - Fix D105: add __aiter__ docstring
  - Add noqa: PERF401 for async iteration (false positive: no async listcomp)
  - All ruff checks pass, pyright 0 errors, 317 tests pass

* fix: address code review findings

  Critical fixes:

  - Fix ID round-trip corruption: _strip_cosmos_metadata now restores original id from row_id field. Previously, read_dataframe returned '{table_name}:{key}' instead of the pipeline's original id value.
  - Always store row_id on write (consistent between provider and table).
  - has() now catches CosmosResourceNotFoundError specifically instead of bare Exception, so auth/network errors propagate correctly.

  Medium fixes:

  - Add asyncio.Lock to _ensure_container() for concurrent-task safety.
  - _batch_upsert catches only CosmosBatchOperationError for fallback; other exceptions (auth, network) now propagate instead of silently falling back to individual upserts.

  Verified: ID round-trip, streaming write, no-id tables all pass against Cosmos emulator. 317 unit/verb tests pass.

* chore: fix spellcheck and add semversioner change

  - Add dictionary words: aiohttp, aiter, colls, serde, upserts, vnext
  - Fix British spellings: serialisation→serialization, initialisation→initialization, behaviour→behavior
  - Replace 'Unparameterized' with 'Non-parameterized'
  - Add semversioner minor change file

* fix: update test_clear assertion for new clear() behavior

  clear() now drops and recreates the container instead of deleting the entire database. The container and database clients remain valid after clear(); only the data is removed.

* refactor: extract Cosmos connection from Storage, not TableProviderConfig

  Connection fields (connection_string, account_url, database_name) removed from TableProviderConfig. The factory extracts them from the affiliated AzureCosmosStorage instance when table_provider.type is cosmosdb.

  This eliminates config duplication: credentials are defined once on output_storage, and table_provider only carries table-specific fields (container_name, batch_size, legacy_container).

  Config example:

      output_storage:
        type: cosmosdb
        account_url: https://...
        database_name: graphrag
        container_name: graphrag-kv
      table_provider:
        type: cosmosdb
        container_name: graphrag-tables
        batch_size: 50

* perf: batch deletes in _delete_table to match write batching

  Use transactional batches for delete operations instead of one-at-a-time delete_item calls, mirroring the _batch_upsert pattern. Falls back to individual deletes on CosmosBatchOperationError.
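The batch-with-fallback pattern described in the batch write and batch delete commits can be sketched roughly as below. This is an illustration only, not the code from the commit: `FakeContainer`, `BatchError`, `clamp_batch_size`, `chunked`, and `batch_upsert` are hypothetical names, `FakeContainer.execute_batch` is a simplified in-memory stand-in for the SDK's transactional batch call, and `BatchError` stands in for `CosmosBatchOperationError`. The lower bound of 1 in the clamp is also an assumption; the commit message only states the default of 50 and the maximum of 100.

```python
import asyncio
from collections.abc import Iterator

MAX_BATCH_SIZE = 100  # upper limit stated in the commit message


class BatchError(Exception):
    """Hypothetical stand-in for CosmosBatchOperationError."""


def clamp_batch_size(batch_size: int) -> int:
    """Clamp to 1..MAX_BATCH_SIZE (the lower bound is an assumption)."""
    return max(1, min(batch_size, MAX_BATCH_SIZE))


def chunked(docs: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


class FakeContainer:
    """In-memory stand-in for a Cosmos container (hypothetical)."""

    def __init__(self, max_batch_docs: int = 100) -> None:
        self.items: dict[str, dict] = {}
        self.batch_calls = 0
        self.single_calls = 0
        self.max_batch_docs = max_batch_docs

    async def execute_batch(self, docs: list[dict], partition_key: str) -> None:
        # All-or-nothing: reject the whole batch before writing anything.
        self.batch_calls += 1
        if len(docs) > self.max_batch_docs:
            raise BatchError("batch rejected")
        for doc in docs:
            self.items[doc["id"]] = doc

    async def upsert_item(self, doc: dict) -> None:
        self.single_calls += 1
        self.items[doc["id"]] = doc


async def batch_upsert(
    container: FakeContainer,
    docs: list[dict],
    namespace: str,
    batch_size: int = 50,
) -> int:
    """Write docs in batches; on a batch failure, retry that chunk one by one."""
    written = 0
    for chunk in chunked(docs, clamp_batch_size(batch_size)):
        try:
            await container.execute_batch(chunk, partition_key=namespace)
        except BatchError:
            # Only the batch-specific error triggers the fallback;
            # any other exception propagates to the caller.
            for doc in chunk:
                await container.upsert_item(doc)
        written += len(chunk)
    return written
```

With this sketch, 120 documents at `batch_size=50` result in three batch calls (50, 50, and 20 documents) and no individual upserts; a rejected batch costs one failed batch call plus one upsert per document in that chunk. Catching only the batch-specific exception mirrors the review fix above, where auth and network errors are meant to surface rather than silently trigger the slow path.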

pull Bot locked and limited conversation to collaborators May 14, 2026

pull Bot added the ⤵️ pull label May 14, 2026

pull Bot merged commit de531f0 into graphrag:main May 14, 2026

pull Bot had a problem deploying to pypi May 14, 2026 00:13 Failure

pull[bot] merged 1 commit into graphrag:main from microsoft:main
