Bug Description (Claude Opus 4..6)
The function _filter_under_community_level in graphrag/query/indexer_adapters.py silently discards all entities that were not assigned to any community during the Leiden community detection phase. The filter df[df.level <= community_level] drops every row whose level is NaN, because NaN <= x evaluates to False, and NaN is exactly the value unassigned entities receive after the left join with community data.
This causes the CLI query commands (graphrag query --method global/local/drift) to operate on a drastically reduced entity set — in our case, 10 out of 151 entities (6.6%) — with no warning or error message. Depending on the search method, this produces empty results or "no data tables" errors.
Root Cause
In indexer_adapters.py, entity DataFrames acquire a level column via a left join with community membership data:
community_join = community_df.explode("entity_ids").loc[:, ["community", "level", "entity_ids"]]
nodes_df = entity_df.merge(community_join, left_on="id", right_on="entity_ids", how="left")
Entities that were not placed into any community by Leiden get level = NaN after the left join. The subsequent filter:
def _filter_under_community_level(df, community_level):
    return df[df.level <= community_level]
drops them, because in NumPy/pandas:
>>> import numpy as np
>>> np.nan <= 2
False
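The same thing happens inside a DataFrame filter: boolean indexing keeps only rows where the mask is True, so every NaN row is dropped. A minimal standalone illustration (toy data, not code from the graphrag repository):

```python
import numpy as np
import pandas as pd

# Two entities were never assigned to a community, so their level is NaN.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "level": [0, 1, np.nan, np.nan],
})

kept = toy[toy.level <= 2]
print(len(toy), len(kept))  # 4 2 -- the two unassigned entities silently disappear
```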
Impact
Silent data loss: No warning or log message is emitted. The user has no indication that entities were dropped.
Very common trigger: any graph where Leiden community detection leaves some entities without a community assignment is affected. Small, sparse, or domain-specific datasets are especially vulnerable, since isolated nodes and weakly connected components are routinely left unassigned by Leiden.
CLI-only: Users who build search engines manually and pass community_level=None are unaffected, since the filter is skipped. This makes the bug difficult to diagnose — the same indexed data works perfectly via manual scripts but fails through the CLI.
Reproduction
Environment
graphrag version: 2.2.0 (the bug is also present on current main)
Python 3.12
OS: Ubuntu 24.04
Steps
Index a small/domain-specific corpus:
graphrag index --root ./my_project
Verify entities exist but most lack community assignments:
import pandas as pd
entities = pd.read_parquet("output/entities.parquet")
communities = pd.read_parquet("output/communities.parquet")
community_join = communities.explode("entity_ids")[["community", "level", "entity_ids"]]
merged = entities.merge(community_join, left_on="id", right_on="entity_ids", how="left")
total = len(merged)
assigned = merged["level"].notna().sum()
orphaned = merged["level"].isna().sum()
print(f"Total: {total}, Assigned: {assigned}, Orphaned: {orphaned}")
Example output: Total: 151, Assigned: 10, Orphaned: 141
Run a CLI query:
graphrag query --root ./my_project --method local --query "your question"
Observe empty or degraded results. No warning about dropped entities.
Run the same query via manual script with community_level=None:
search_engine = get_local_search_engine(
    ...,
    community_level=None,  # bypasses the filter entirely
)
result = await search_engine.asearch("your question")
Returns full, correct results
Observed Data (test dataset, 3 source documents)
| Metric | Value |
| --- | --- |
| Total entities | 151 |
| Entities with community assignment | 10 (6.6%) |
| Entities with level = NaN | 141 (93.4%) |
| Communities | 3 (all at level 0) |
| Community reports | 3 |
| CLI filter (level <= 2) passes | 10 entities |
| Manual script (level=None) passes | 151 entities |
Proposed Fix
Minimal (filter only)
def _filter_under_community_level(df, community_level):
    return df[(df.level <= community_level) | df.level.isna()]
This preserves the original intent (hierarchical filtering) while retaining entities that have no community assignment — which is the semantically correct behavior, since "not in any community" is not the same as "in a community above the requested level."
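A quick side-by-side on toy data (an illustration under the same assumptions as the sketch above, not output from the test dataset) shows that the NaN-safe filter still enforces the hierarchical cut-off while keeping unassigned entities:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"level": [0, 1, 3, np.nan, np.nan]})

old = toy[toy.level <= 2]                       # 2 rows: both NaN rows dropped
new = toy[(toy.level <= 2) | toy.level.isna()]  # 4 rows: level-3 row dropped, NaN rows kept
print(len(old), len(new))  # 2 4
```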
Recommended (additional safeguards)
Emit a warning when a significant percentage of entities lack community assignments:
orphan_pct = df.level.isna().sum() / len(df) * 100
if orphan_pct > 10:
    logger.warning(
        f"{orphan_pct:.0f}% of entities have no community assignment "
        "and would be excluded without NaN-safe filtering."
    )
Warn when community_level exceeds the max available level:
max_level = df.level.dropna().max()
if community_level > max_level:
    logger.warning(
        f"Requested community_level={community_level} but max available is {max_level}."
    )
Consider changing the CLI default for --community-level from 2 to None (or auto-detect from the dataset), since small datasets frequently have only level 0.
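One possible shape for the auto-detect behavior is sketched below. The helper name and its wiring into the CLI are assumptions for illustration, not existing graphrag API:

```python
def resolve_community_level(requested, community_df):
    # Hypothetical helper (name and call site are assumptions, not graphrag API).
    # Clamp the requested level to what the dataset actually contains, so a small
    # corpus with only level-0 communities is not over-filtered by a default of 2.
    available = community_df["level"].dropna()
    if available.empty:
        return None  # no communities at all: skip level filtering entirely
    max_level = int(available.max())
    if requested is None or requested <= max_level:
        return requested
    return max_level
```

The resolved value could then be passed to the existing read_indexer_* adapters unchanged.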
Related Code Locations
graphrag/query/indexer_adapters.py — _filter_under_community_level()
graphrag/cli/query.py — CLI defaults community_level=2
Affects: read_indexer_entities(), read_indexer_reports(), and all downstream search engine constructors when called via CLI