
Error in Leiden Algorithm in create_base_entity_graph #515

Closed
as1078 opened this issue Jul 11, 2024 · 15 comments
Labels
autoresolved · awaiting_response (Maintainers or community have suggested solutions or requested info, awaiting filer response) · stale (Used by auto-resolve bot to flag inactive issues)

Comments

@as1078

as1078 commented Jul 11, 2024

Describe the issue

I got an empty network when doing the Leiden clustering algorithm as follows:
{"type": "error", "data": "Error executing verb "cluster_graph" in create_base_entity_graph: EmptyNetworkError", "stack": ... leiden.EmptyNetworkError: EmptyNetworkError\n", "source": "EmptyNetworkError", "details": null}

When opening my parquet files for each step in pandas, there is only an entity_graph column with an incomplete graphml URL. I saw on other posts that there should also be a clustered_graph column, but there is none for me. When I look in the cache directory however, both entity_extraction and summarize_descriptions have valid JSON results, so I'm not sure how exactly the graph became empty. My data is a set of .txt files of US Congressional hearings, and I previously used the prompt autotune feature to customize prompts to my data.
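The symptom can be confirmed directly from the artifact. A minimal self-contained sketch (the GraphML string below is illustrative, not taken from an actual run, and `count_graphml_nodes` is a hypothetical helper): GraphRAG stores the extracted graph as a GraphML string in the `entity_graph` column, and if entity extraction produced nothing, that string describes a zero-node graph, which is what makes `hierarchical_leiden` raise `EmptyNetworkError`:

```python
import xml.etree.ElementTree as ET

# GraphML documents live in this XML namespace.
NS = "{http://graphml.graphdrawing.org/xmlns}"

def count_graphml_nodes(graphml_text: str) -> int:
    """Count <node> elements in a GraphML string."""
    root = ET.fromstring(graphml_text)
    return len(root.findall(f"./{NS}graph/{NS}node"))

# What an empty entity_graph cell would look like: a graph with no nodes.
empty_graphml = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<graphml xmlns="http://graphml.graphdrawing.org/xmlns">'
    '<graph edgedefault="undirected"/></graphml>'
)

print(count_graphml_nodes(empty_graphml))  # prints 0: Leiden has nothing to cluster
```

Running the same check against the `entity_graph` column of your own `create_base_entity_graph` parquet should tell you immediately whether extraction yielded any entities at all.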

Steps to reproduce

joint-20240710T193325Z-001.zip
To generate results, I simply ran the init command followed by !python -m graphrag.prompt_tune --root ./ragtest --domain "US congress hearings" and then !python -m graphrag.index --verbose --root ./ragtest

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gpt-4-turbo-preview
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  # parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
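One thing worth sanity-checking with a config like this: the `api_key` fields use `${GRAPHRAG_API_KEY}`-style environment substitution, and an unset variable can also produce failed extraction calls upstream of the clustering error. A rough sketch of that expansion, for confirming the variable is actually set before indexing (the `expand_env` helper is hypothetical, not GraphRAG's own code):

```python
import os
import re

def expand_env(value: str) -> str:
    """Expand ${VAR} references, failing loudly if a variable is unset."""
    def repl(m: re.Match) -> str:
        var = m.group(1)
        if var not in os.environ:
            raise KeyError(f"environment variable {var} is not set")
        return os.environ[var]
    return re.sub(r"\$\{(\w+)\}", repl, value)

os.environ["GRAPHRAG_API_KEY"] = "sk-demo"  # for illustration only
print(expand_env("${GRAPHRAG_API_KEY}"))  # prints sk-demo
```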

Logs and screenshots

Logs.json
{"type": "error", "data": "Error executing verb "cluster_graph" in create_base_entity_graph: EmptyNetworkError", "stack": "Traceback (most recent call last):\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb\n result = node.verb.func(**verb_args)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 61, in cluster_graph\n results = output_df[column].apply(lambda graph: run_layout(strategy, graph))\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/pandas/core/series.py", line 4924, in apply\n ).apply()\n ^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/pandas/core/apply.py", line 1427, in apply\n return self.apply_standard()\n ^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/pandas/core/apply.py", line 1507, in apply_standard\n mapped = obj._map_values(\n ^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/pandas/core/base.py", line 921, in _map_values\n return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/pandas/core/algorithms.py", line 1743, in map_array\n return lib.map_infer(values, mapper, convert=convert)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "lib.pyx", line 2972, in pandas._libs.lib.map_infer\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 61, in \n results = output_df[column].apply(lambda graph: run_layout(strategy, graph))\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File 
"/Users/amansingh/anaconda3/lib/python3.11/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 167, in run_layout\n clusters = run_leiden(graph, strategy)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/graphrag/index/verbs/graph/clustering/strategies/leiden.py", line 26, in run\n node_id_to_community_map = _compute_leiden_communities(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/graphrag/index/verbs/graph/clustering/strategies/leiden.py", line 61, in _compute_leiden_communities\n community_mapping = hierarchical_leiden(\n ^^^^^^^^^^^^^^^^^^^^\n File "<@beartype(graspologic.partition.leiden.hierarchical_leiden) at 0x32b439d00>", line 304, in hierarchical_leiden\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/graspologic/partition/leiden.py", line 588, in hierarchical_leiden\n hierarchical_clusters_native = gn.hierarchical_leiden(\n ^^^^^^^^^^^^^^^^^^^^^^^\nleiden.EmptyNetworkError: EmptyNetworkError\n", "source": "EmptyNetworkError", "details": null}
{"type": "error", "data": "Error running pipeline!", "stack": "Traceback (most recent call last):\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/graphrag/index/run.py", line 323, in run_pipeline\n result = await workflow.run(context, callbacks)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 369, in run\n timing = await self._execute_verb(node, context, callbacks)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb\n result = node.verb.func(**verb_args)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 61, in cluster_graph\n results = output_df[column].apply(lambda graph: run_layout(strategy, graph))\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/pandas/core/series.py", line 4924, in apply\n ).apply()\n ^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/pandas/core/apply.py", line 1427, in apply\n return self.apply_standard()\n ^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/pandas/core/apply.py", line 1507, in apply_standard\n mapped = obj._map_values(\n ^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/pandas/core/base.py", line 921, in _map_values\n return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/pandas/core/algorithms.py", line 1743, in map_array\n return lib.map_infer(values, mapper, convert=convert)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "lib.pyx", line 2972, in 
pandas._libs.lib.map_infer\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 61, in \n results = output_df[column].apply(lambda graph: run_layout(strategy, graph))\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 167, in run_layout\n clusters = run_leiden(graph, strategy)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/graphrag/index/verbs/graph/clustering/strategies/leiden.py", line 26, in run\n node_id_to_community_map = _compute_leiden_communities(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/graphrag/index/verbs/graph/clustering/strategies/leiden.py", line 61, in _compute_leiden_communities\n community_mapping = hierarchical_leiden(\n ^^^^^^^^^^^^^^^^^^^^\n File "<@beartype(graspologic.partition.leiden.hierarchical_leiden) at 0x32b439d00>", line 304, in hierarchical_leiden\n File "/Users/amansingh/anaconda3/lib/python3.11/site-packages/graspologic/partition/leiden.py", line 588, in hierarchical_leiden\n hierarchical_clusters_native = gn.hierarchical_leiden(\n ^^^^^^^^^^^^^^^^^^^^^^^\nleiden.EmptyNetworkError: EmptyNetworkError\n", "source": "EmptyNetworkError", "details": null}

Additional Information

  • GraphRAG Version: 0.1.1
  • Operating System: macOS Sonoma 14.4.1
  • Python Version: 3.11.9
  • Related Issues:
@as1078 as1078 added the triage Default label assignment, indicates new issue needs reviewed by a maintainer label Jul 11, 2024
@AlonsoGuevara AlonsoGuevara added extraction_error and removed triage Default label assignment, indicates new issue needs reviewed by a maintainer labels Jul 11, 2024
@AlonsoGuevara
Contributor

Hi @as1078

It seems the entity extraction process failed and yielded an empty graph.
Could you please share your log file?

@as1078
Author

as1078 commented Jul 12, 2024

Sure. Since my log file is too large to upload, I have uploaded a portion of it here. It seems that all errors besides the clustering one were rate-limit errors, which I thought GraphRAG handled by waiting before submitting another API request. I excluded the clustering errors already pasted above.
logs.json

@fire

fire commented Jul 12, 2024

I have this error too. I noticed that my generated prompts were missing a ) in the entity extraction.

@as1078
Author

as1078 commented Jul 12, 2024

Just noticed I had the same issue. Thanks!

@fire

fire commented Jul 12, 2024

On less performant models like phi-3, #503 was able to repair the JSON. I did not test with prompt rewriting.

@jiangjingzhi2003

jiangjingzhi2003 commented Jul 20, 2024

I have this error too. I noticed that my generated prompts were missing a ) in the entity extraction.

I still have the same error after fixing my generated prompts for entity extraction. Does anyone know what might be the cause?


This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.

@github-actions github-actions bot added the stale Used by auto-resolve bot to flag inactive issues label Jul 28, 2024
@natoverse natoverse removed stale Used by auto-resolve bot to flag inactive issues extraction_error labels Jul 30, 2024
@fantom845
Contributor

Faced the same issue and found a potential fix. Putting it here in case it is useful for somebody in the future, or in case someone can identify why this causes the error.
In the file graphrag/prompt_tune/prompt/entity_relationship.py, line 25:
3. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use **{{record_delimiter}}** as the list delimiter.

Removing the asterisks (*) on either side of {{record_delimiter}} fixes the prompt generation and the error during indexing for me:
3. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use {{record_delimiter}} as the list delimiter.
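For anyone who wants to apply this cleanup to already-generated prompt files rather than to the package source, a throwaway script along these lines should work (the `prompts/` directory matches the default GraphRAG layout, and the `fix_prompt` helper is an assumption, not part of GraphRAG):

```python
import re
from pathlib import Path

def fix_prompt(text: str) -> str:
    """Strip bold markers that prompt tuning sometimes emits around the
    record delimiter, e.g. **{{record_delimiter}}** -> {{record_delimiter}}.
    Handles both the template form {{...}} and the rendered form {...}."""
    return re.sub(r"\*\*(\{\{?record_delimiter\}\}?)\*\*", r"\1", text)

# Rewrite every generated prompt file in place.
prompts_dir = Path("prompts")
if prompts_dir.is_dir():
    for path in prompts_dir.glob("*.txt"):
        path.write_text(fix_prompt(path.read_text(encoding="utf-8")),
                        encoding="utf-8")
```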

@shengkui

shengkui commented Aug 8, 2024

I hit this error on graphrag v0.2.1.

@natoverse natoverse added the awaiting_response Maintainers or community have suggested solutions or requested info, awaiting filer response label Aug 9, 2024
@rincon-santi

Same in graphrag v0.2.0

@fantom845
Contributor

Can anyone who is still facing the error try this change:
#515 (comment)
and confirm whether the generated prompts now work fine, or whether it was just a one-time case for me?

@as1078
Author

as1078 commented Aug 13, 2024

Faced the same issue. Found a potential fix. Putting it here just in case it is useful for somebody in the future, or if someone can identify why is this causing the error. in the file graphrag/prompt_tune/prompt/entity_relationship.py line no 25 3. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use **{{record_delimiter}}** as the list delimiter.

removing the asterisks(*) on either side of the {{record_delimiter}} fixes the prompt generation and the error during indexing for me: 3. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use {{record_delimiter}} as the list delimiter.

I didn't modify entity_relationship.py itself, but this worked for me in the auto-generated prompts (the .txt files).

@fantom845 fantom845 mentioned this issue Aug 14, 2024
4 tasks

This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.

@github-actions github-actions bot added the stale Used by auto-resolve bot to flag inactive issues label Aug 21, 2024

This issue has been closed after being marked as stale for five days. Please reopen if needed.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 26, 2024
@Sere1nz

Sere1nz commented Oct 26, 2024

Can anyone who is still facing the error try this change: #515 (comment) and confirm if the prompts generated now work fine or was it just a one time case for me

It doesn't work for me. In my graphrag version, the ** has already been removed from the prompt, and I'm still having this issue.


9 participants