[pull] main from microsoft:main #149
Merged
* feat: native CosmosTableProvider with namespace partitioning
Replace the parquet-decomposition approach in AzureCosmosStorage with a
native CosmosTableProvider that implements TableProvider directly:
- CosmosTableProvider: stores DataFrame rows as Cosmos documents with
  /namespace partition key. All queries are single-partition (no fan-out).
- CosmosTable: streaming Table impl with async SDK and server-side pagination.
- AzureCosmosStorage: simplified to key-value only (context.json, stats.json,
  cache). child() now works via ':'-separated namespace prefixes.
- TableProvider.child(): new non-abstract method for namespace isolation.
  ParquetTableProvider/CSVTableProvider delegate to Storage.child().
- Pipeline wiring: run_pipeline.py and utils.py use table_provider.child()
  for update-run delta/previous isolation.
- Legacy fallback: CosmosTableProvider reads from old containers when
  legacy_container is configured, enabling transparent migration.
Tested against Cosmos DB Linux emulator (vNext, ARM64).
302 unit tests + 15 verb tests pass (no regressions).
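The namespace isolation described above can be illustrated with a small in-memory sketch. This is not the PR's implementation: `NamespacedProvider`, its `write`/`read` methods, and the shared `_rows` dictionary are hypothetical stand-ins. Only the ':'-separated prefix and the idea that a read touches a single namespace come from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class NamespacedProvider:
    """Hypothetical stand-in for a provider that partitions rows by namespace."""

    namespace: str = ""
    # Shared across children: maps (namespace, table_name) -> list of rows.
    _rows: dict = field(default_factory=dict)

    def child(self, name: str) -> "NamespacedProvider":
        # Join with ':' so nested children get prefixes such as "update:delta".
        prefix = f"{self.namespace}:{name}" if self.namespace else name
        return NamespacedProvider(namespace=prefix, _rows=self._rows)

    def write(self, table: str, rows: list) -> None:
        self._rows[(self.namespace, table)] = list(rows)

    def read(self, table: str) -> list:
        # Only this namespace is consulted, mirroring a single-partition query.
        return list(self._rows.get((self.namespace, table), []))
```

A child created with `root.child("update").child("delta")` writes under the `update:delta` namespace, so its tables never collide with tables of the same name in the root.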
* fix: remove enable_cross_partition_query from async SDK calls
The async azure-cosmos SDK (v4.9) leaks this kwarg through to
aiohttp.ClientSession, causing TypeError. Omitting partition_key
achieves the same cross-partition behavior automatically.
Also documents the caveat in the design doc.
Verified: migration test passes all 5 phases against Cosmos emulator.
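One way to picture the fix is a helper that builds the keyword arguments for a query and never includes `enable_cross_partition_query`. The helper name `build_query_kwargs` is hypothetical. The sketch assumes the result is later passed to the async container's `query_items(**kwargs)`, and it makes no SDK calls itself.

```python
from typing import Any, Dict, Optional


def build_query_kwargs(query: str, partition_key: Optional[str] = None) -> Dict[str, Any]:
    """Build kwargs for a query call (hypothetical helper, not from the PR)."""
    kwargs: Dict[str, Any] = {"query": query}
    if partition_key is not None:
        # Scoped query: restrict to one logical partition.
        kwargs["partition_key"] = partition_key
    # Deliberately no "enable_cross_partition_query" key: per the commit
    # message, omitting partition_key already yields cross-partition behavior.
    return kwargs
```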
* feat: transactional batch writes with configurable batch_size
Add batch_size parameter (default 50, max 100) to CosmosTableProvider
and CosmosTable. Documents are written using Cosmos transactional
batches (execute_item_batch), for up to ~50× fewer network round-trips
at the default batch size.
If a batch fails (e.g. payload too large), falls back to individual
upserts for that chunk so partial progress is never lost.
Config: table_provider.batch_size in settings.yaml
Propagates through child() and open() to streaming writes.
Tested: 120 rows at batch_size=50, 25 rows at batch_size=10,
75 streamed rows, clamping to max 100, child inheritance.
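The chunk-then-fall-back behavior can be sketched without the SDK. In the sketch below, `BatchError` stands in for the SDK's batch failure exception, and `write_batch` and `write_one` are hypothetical callables supplied by the caller. The lower bound of 1 in `clamp_batch_size` is an assumption, since the commit message only states the maximum of 100.

```python
MAX_BATCH_SIZE = 100  # upper limit stated in the commit message


class BatchError(Exception):
    """Hypothetical stand-in for the SDK's batch failure exception."""


def clamp_batch_size(requested: int) -> int:
    # Lower bound of 1 is an assumption; only the max of 100 is documented.
    return max(1, min(requested, MAX_BATCH_SIZE))


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_upsert(docs, batch_size, write_batch, write_one):
    """Write docs in chunks; retry a failed chunk one document at a time."""
    for chunk in chunked(docs, clamp_batch_size(batch_size)):
        try:
            write_batch(chunk)
        except BatchError:
            # Only the failed chunk is retried; earlier chunks stay written.
            for doc in chunk:
                write_one(doc)
```

Any exception other than `BatchError` propagates to the caller, so unrelated failures are not masked by the fallback.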
* chore: lint cleanup and dead code removal
- Remove unused _INTERNAL_FIELDS constant (duplicated _COSMOS_SYSTEM_KEYS)
- Fix TRY300: move returns to else blocks in AzureCosmosStorage
- Fix SIM105: use contextlib.suppress for CosmosResourceNotFoundError
- Fix SLF001: replace __new__ + private attr copy with __init__ in child()
- Fix RUF002: replace en-dash with hyphen in docstrings
- Fix D105: add __aiter__ docstring
- Add noqa: PERF401 for async iteration (false positive: no async listcomp)
- All ruff checks pass, pyright 0 errors, 317 tests pass
* fix: address code review findings
Critical fixes:
- Fix ID round-trip corruption: _strip_cosmos_metadata now restores the
  original id from the row_id field. Previously, read_dataframe returned
  '{table_name}:{key}' instead of the pipeline's original id value.
- Always store row_id on write (consistent between provider and table).
- has() now catches CosmosResourceNotFoundError specifically instead of a
  bare Exception, so auth/network errors propagate correctly.
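The ID round-trip can be sketched as a pair of functions. This is an illustration, not the PR's code. `to_document` and `strip_cosmos_metadata` are hypothetical names modeled on the description, and dropping `id` when `row_id` is empty is an assumption about how tables without an id are handled. The underscore-prefixed keys are the standard Cosmos system properties.

```python
# System properties that Cosmos DB adds to every stored document.
COSMOS_SYSTEM_KEYS = {"_rid", "_self", "_etag", "_attachments", "_ts"}


def to_document(table_name, namespace, key, row):
    """Wrap a row for storage, preserving the pipeline's own id in row_id."""
    doc = dict(row)
    doc["row_id"] = row.get("id")      # always stored, None for no-id tables
    doc["id"] = f"{table_name}:{key}"  # storage-level document id
    doc["namespace"] = namespace       # partition key value
    return doc


def strip_cosmos_metadata(doc):
    """Undo to_document: drop storage fields and restore the original id."""
    hidden = COSMOS_SYSTEM_KEYS | {"namespace", "row_id"}
    row = {k: v for k, v in doc.items() if k not in hidden}
    original_id = doc.get("row_id")
    if original_id is None:
        # Assumption: a table without an id column gets no id back.
        row.pop("id", None)
    else:
        row["id"] = original_id
    return row
```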
Medium fixes:
- Add asyncio.Lock to _ensure_container() for concurrent-task safety.
- _batch_upsert catches only CosmosBatchOperationError for fallback;
  other exceptions (auth, network) now propagate instead of silently
  falling back to individual upserts.
Verified: ID round-trip, streaming write, no-id tables all pass
against Cosmos emulator. 317 unit/verb tests pass.
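The asyncio.Lock guard listed under the medium fixes follows a common lazy-initialization pattern, sketched here. `LazyContainer` and its `_create` method are hypothetical, with `asyncio.sleep(0)` standing in for the network call that creates the container.

```python
import asyncio


class LazyContainer:
    """Hypothetical holder that creates its container once, on first use."""

    def __init__(self):
        self._container = None
        self._lock = asyncio.Lock()
        self.create_calls = 0

    async def _create(self):
        self.create_calls += 1
        await asyncio.sleep(0)  # yield control, as a real network call would
        return object()

    async def ensure_container(self):
        if self._container is None:
            async with self._lock:
                # Re-check inside the lock: another task may have finished first.
                if self._container is None:
                    self._container = await self._create()
        return self._container


async def demo():
    holder = LazyContainer()
    results = await asyncio.gather(*(holder.ensure_container() for _ in range(5)))
    # Number of creations, and number of distinct objects handed out.
    return holder.create_calls, len({id(r) for r in results})
```

Without the lock, every task that reaches `_create` before the first one finishes would create its own container. With it, concurrent callers wait and then reuse the single instance.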
* chore: fix spellcheck and add semversioner change
- Add dictionary words: aiohttp, aiter, colls, serde, upserts, vnext
- Fix British spellings: serialisation→serialization, initialisation→initialization, behaviour→behavior
- Replace 'Unparameterized' with 'Non-parameterized'
- Add semversioner minor change file
* fix: update test_clear assertion for new clear() behavior
clear() now drops and recreates the container instead of deleting the
entire database. The container and database clients remain valid after
clear() — only the data is removed.
* refactor: extract Cosmos connection from Storage, not TableProviderConfig
Connection fields (connection_string, account_url, database_name) removed
from TableProviderConfig. The factory extracts them from the affiliated
AzureCosmosStorage instance when table_provider.type is cosmosdb.
This eliminates config duplication — credentials are defined once on
output_storage, and table_provider only carries table-specific fields
(container_name, batch_size, legacy_container).
Config example:
  output_storage:
    type: cosmosdb
    account_url: https://...
    database_name: graphrag
    container_name: graphrag-kv
  table_provider:
    type: cosmosdb
    container_name: graphrag-tables
    batch_size: 50
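A rough sketch of how a factory might combine the two config sections is shown below. `StorageStub`, `TableProviderSettings`, and `resolve_cosmos_settings` are hypothetical names; only the field names and the rule that connection details come from the storage side are taken from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageStub:
    """Hypothetical stand-in for the storage object that owns the connection."""
    database_name: str
    connection_string: Optional[str] = None
    account_url: Optional[str] = None


@dataclass
class TableProviderSettings:
    """Table-specific fields only; no credentials live here."""
    type: str
    container_name: str
    batch_size: int = 50
    legacy_container: Optional[str] = None


def resolve_cosmos_settings(storage: StorageStub, cfg: TableProviderSettings) -> dict:
    """Merge connection fields from storage with table-specific settings."""
    if cfg.type != "cosmosdb":
        raise ValueError(f"unsupported table provider type: {cfg.type}")
    return {
        # Connection details are read from the storage side only.
        "connection_string": storage.connection_string,
        "account_url": storage.account_url,
        "database_name": storage.database_name,
        # Table-specific details come from the table provider config.
        "container_name": cfg.container_name,
        "batch_size": cfg.batch_size,
        "legacy_container": cfg.legacy_container,
    }
```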
* perf: batch deletes in _delete_table to match write batching
Use transactional batches for delete operations instead of
one-at-a-time delete_item calls, mirroring the _batch_upsert pattern.
Falls back to individual deletes on CosmosBatchOperationError.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)