Bug Description (Claude Opus 4..6)
The function _filter_under_community_level in graphrag/query/indexer_adapters.py silently discards all entities that were not assigned to any community during the Leiden community detection phase. The filter df[df.level <= community_level] drops every row whose level is NaN, because NaN <= x evaluates to False, and NaN is exactly the value unassigned entities receive after the left join with community data.
This causes the CLI query commands (graphrag query --method global/local/drift) to operate on a drastically reduced entity set — in our case, 10 out of 151 entities (6.6%) — with no warning or error message. Depending on the search method, this produces empty results or "no data tables" errors.
Root Cause
In indexer_adapters.py, entity DataFrames acquire a level column via a left join with community membership data:
community_join = community_df.explode("entity_ids").loc[:, ["community", "level", "entity_ids"]]
nodes_df = entity_df.merge(community_join, left_on="id", right_on="entity_ids", how="left")
Entities that were not placed into any community by Leiden get level = NaN after the left join. The subsequent filter:
def _filter_under_community_level(df, community_level):
    return df[df.level <= community_level]
drops them, because in NumPy/pandas:
>>> import numpy as np
>>> np.nan <= 2
False
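The same thing happens inside a DataFrame filter: boolean indexing keeps only rows where the mask is True, so every NaN row is dropped. A minimal standalone illustration (toy data, not code from the graphrag repository):

```python
import numpy as np
import pandas as pd

# Two entities were never assigned to a community, so their level is NaN.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "level": [0, 1, np.nan, np.nan],
})

kept = toy[toy.level <= 2]
print(len(toy), len(kept))  # 4 2 -- the two unassigned entities silently disappear
```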
Impact
Silent data loss: No warning or log message is emitted. The user has no indication that entities were dropped.
Very common trigger: any graph where Leiden community detection leaves some entities without a community assignment is affected. Small, sparse, or domain-specific datasets are especially vulnerable, since isolated nodes and weakly connected components are routinely left unassigned by Leiden.
CLI-only: Users who build search engines manually and pass community_level=None are unaffected, since the filter is skipped. This makes the bug difficult to diagnose — the same indexed data works perfectly via manual scripts but fails through the CLI.
Reproduction
Environment
graphrag version: 2.2.0 (the bug is also present on current main)
Python 3.12
OS: Ubuntu 24.04
Steps
Index a small/domain-specific corpus:
graphrag index --root ./my_project
Verify entities exist but most lack community assignments:
import pandas as pd
entities = pd.read_parquet("output/entities.parquet")
communities = pd.read_parquet("output/communities.parquet")
community_join = communities.explode("entity_ids")[["community", "level", "entity_ids"]]
merged = entities.merge(community_join, left_on="id", right_on="entity_ids", how="left")
total = len(merged)
assigned = merged["level"].notna().sum()
orphaned = merged["level"].isna().sum()
print(f"Total: {total}, Assigned: {assigned}, Orphaned: {orphaned}")
Example output: Total: 151, Assigned: 10, Orphaned: 141
Run a CLI query:
graphrag query --root ./my_project --method local --query "your question"
Observe empty or degraded results. No warning about dropped entities.
Run the same query via manual script with community_level=None:
search_engine = get_local_search_engine(
    ...,
    community_level=None,  # bypasses the filter entirely
)
result = await search_engine.asearch("your question")
Returns full, correct results
Observed Data (test dataset, 3 source documents)
| Metric | Value |
| --- | --- |
| Total entities | 151 |
| Entities with community assignment | 10 (6.6%) |
| Entities with level = NaN | 141 (93.4%) |
| Communities | 3 (all at level 0) |
| Community reports | 3 |
| CLI filter (level <= 2) passes | 10 entities |
| Manual script (level=None) passes | 151 entities |
Proposed Fix
Minimal (filter only)
def _filter_under_community_level(df, community_level):
    return df[(df.level <= community_level) | df.level.isna()]
This preserves the original intent (hierarchical filtering) while retaining entities that have no community assignment — which is the semantically correct behavior, since "not in any community" is not the same as "in a community above the requested level."
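A quick side-by-side on toy data (an illustration under the same assumptions as the sketch above, not output from the test dataset) shows that the NaN-safe filter still enforces the hierarchical cut-off while keeping unassigned entities:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"level": [0, 1, 3, np.nan, np.nan]})

old = toy[toy.level <= 2]                       # 2 rows: both NaN rows dropped
new = toy[(toy.level <= 2) | toy.level.isna()]  # 4 rows: level-3 row dropped, NaN rows kept
print(len(old), len(new))  # 2 4
```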
Recommended (additional safeguards)
Emit a warning when a significant percentage of entities lack community assignments:
orphan_pct = df.level.isna().sum() / len(df) * 100
if orphan_pct > 10:
    logger.warning(
        f"{orphan_pct:.0f}% of entities have no community assignment "
        "and would be excluded without NaN-safe filtering."
    )
Warn when community_level exceeds the max available level:
max_level = df.level.dropna().max()
if community_level > max_level:
    logger.warning(
        f"Requested community_level={community_level} but max available is {max_level}."
    )
Consider changing the CLI default for --community-level from 2 to None (or auto-detect from the dataset), since small datasets frequently have only level 0.
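One possible shape for the auto-detect behavior is sketched below. The helper name and its wiring into the CLI are assumptions for illustration, not existing graphrag API:

```python
def resolve_community_level(requested, community_df):
    # Hypothetical helper (name and call site are assumptions, not graphrag API).
    # Clamp the requested level to what the dataset actually contains, so a small
    # corpus with only level-0 communities is not over-filtered by a default of 2.
    available = community_df["level"].dropna()
    if available.empty:
        return None  # no communities at all: skip level filtering entirely
    max_level = int(available.max())
    if requested is None or requested <= max_level:
        return requested
    return max_level
```

The resolved value could then be passed to the existing read_indexer_* adapters unchanged.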
Related Code Locations
graphrag/query/indexer_adapters.py — _filter_under_community_level()
graphrag/cli/query.py — CLI defaults community_level=2
Affects: read_indexer_entities(), read_indexer_reports(), and all downstream search engine constructors when called via CLI