Do you need to file an issue?
Describe the bug
Hello,
During auto prompt tuning, GraphRAG generates a knowledge graph extraction prompt (extract_graph.txt) that is not valid.
Bug: the generated prompt contains more } than { characters, so it cannot be used as a format string.
Steps to reproduce
- Initialize GraphRAG
- Provide some paragraphs from this PDF as input: https://kpmg.com/kpmg-us/content/dam/kpmg/frv/pdf/2024/handbook-revenue-recognition-1224.pdf
- Run prompt tuning
You will see this error:
Traceback (most recent call last):
  File ".../pypoetry/virtualenvs/service-vector-embedding-6NKDQ0ig-py3.11/lib/python3.11/site-packages/graphrag/index/operations/extract_graph/graph_extractor.py", line 127, in __call__
    result = await self._process_document(text, prompt_variables)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../pypoetry/virtualenvs/service-vector-embedding-6NKDQ0ig-py3.11/lib/python3.11/site-packages/graphrag/index/operations/extract_graph/graph_extractor.py", line 156, in _process_document
    self._extraction_prompt.format(**{
ValueError: Single '}' encountered in format string
When I look at the generated extract_graph.txt, I can see the problem. In the excerpt below there are 15 { but 19 }; note, for example, the extra } in advance}).
extract_graph.txt
("entity"{tuple_delimiter}HOSTING SERVICE FEES{tuple_delimiter}cost types{tuple_delimiter}Fees for hosting services, charged at $100 per month, paid in advance})
{record_delimiter}
("entity"{tuple_delimiter}REMAINING TERM OF THE HOSTING ARRANGEMENT{tuple_delimiter}lease arrangements{tuple_delimiter}The duration left on the hosting arrangement from the go-live date, which is 5 years})
{record_delimiter}
("entity"{tuple_delimiter}GO-LIVE DATE{tuple_delimiter}implementation details{tuple_delimiter}The date when the cloud-based solution became operational, which is January 1, Year 3})
{record_delimiter}
("entity"{tuple_delimiter}CAPITALIZED IMPLEMENTATION COSTS – PAYROLL MODULE{tuple_delimiter}cost types{tuple_delimiter}The costs incurred to implement the payroll processing module, amounting to $400, which are capitalized})
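The failure can be reproduced outside GraphRAG, since the traceback shows the prompt being passed to Python's str.format. The following is a minimal sketch, not GraphRAG code: find_bad_format_lines is a hypothetical helper that uses the standard library's string.Formatter to report which lines of a prompt cannot be parsed as a format string. It assumes no replacement field spans more than one line.

```python
from string import Formatter


def find_bad_format_lines(prompt_text: str) -> list[tuple[int, str]]:
    """Return (line_number, error_message) for each line str.format cannot parse.

    Hypothetical helper for illustration; it is not part of GraphRAG.
    """
    problems = []
    for number, line in enumerate(prompt_text.splitlines(), start=1):
        try:
            # Formatter.parse is lazy, so consume it to trigger validation.
            list(Formatter().parse(line))
        except ValueError as exc:
            problems.append((number, str(exc)))
    return problems


# First two lines of the excerpt above.
sample = "\n".join([
    '("entity"{tuple_delimiter}HOSTING SERVICE FEES{tuple_delimiter}cost types'
    "{tuple_delimiter}Fees for hosting services, charged at $100 per month, "
    "paid in advance})",
    "{record_delimiter}",
])

for number, message in find_bad_format_lines(sample):
    print(f"line {number}: {message}")
```

Running the same helper over the full contents of prompts/extract_graph.txt should list every line with a stray brace.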
Expected Behavior
The generated extract_graph.txt should contain an equal number of { and } and be free of errors.
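Until prompt tuning produces balanced output, one possible workaround is to post-process the generated prompt before indexing. The sketch below is only an illustration under stated assumptions: drop_unmatched_closing_braces is a hypothetical helper, not a GraphRAG function, and it assumes every } without a preceding open { is spurious. It would also drop a deliberately escaped }} that has no matching {{, so the result should be reviewed by hand.

```python
def drop_unmatched_closing_braces(text: str) -> str:
    """Remove every '}' that has no earlier unmatched '{'.

    Hypothetical workaround for illustration; it is not part of GraphRAG.
    """
    kept = []
    open_braces = 0
    for char in text:
        if char == "{":
            open_braces += 1
        elif char == "}":
            if open_braces == 0:
                # Stray closing brace: skip it instead of copying it.
                continue
            open_braces -= 1
        kept.append(char)
    return "".join(kept)


broken = (
    '("entity"{tuple_delimiter}GO-LIVE DATE{tuple_delimiter}implementation details'
    "{tuple_delimiter}The date when the cloud-based solution became operational, "
    "which is January 1, Year 3})"
)
repaired = drop_unmatched_closing_braces(broken)

# The repaired line can now be formatted; "<|>" is just an example delimiter.
print(repaired.format(tuple_delimiter="<|>"))
```

After the repair, the line has the same number of { and }, and str.format no longer raises ValueError for it.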
GraphRAG Config Used
models:
  default_chat_model:
    type: openai_chat
    auth_type: api_key
    api_key: ${GRAPHRAG_API_KEY}
    model: gpt-4-turbo-preview
    model_supports_json: true
    concurrent_requests: 25
    async_mode: threaded
    retry_strategy: native
    max_retries: -1
    tokens_per_minute: 0
    requests_per_minute: 0
  default_embedding_model:
    type: openai_embedding
    auth_type: api_key
    api_key: ${GRAPHRAG_API_KEY}
    model: text-embedding-3-small
    model_supports_json: true
    concurrent_requests: 25
    async_mode: threaded
    retry_strategy: native
    max_retries: -1
    tokens_per_minute: 0
    requests_per_minute: 0
vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output/lancedb
    container_name: default
    overwrite: true
embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store
input:
  type: file
  file_type: json
  base_dir: input
  text_column: page_content
  title_column: title
  metadata:
    - page
    - data_type
    - figures
chunks:
  size: 1200
  overlap: 100
  group_by_columns:
    - id
cache:
  type: file
  base_dir: cache
reporting:
  type: file
  base_dir: logs
output:
  type: file
  base_dir: output
extract_graph:
  model_id: default_chat_model
  prompt: prompts/extract_graph.txt
  entity_types:
    - organization
    - trademark
    - publication
    - standard
  max_gleanings: 1
summarize_descriptions:
  model_id: default_chat_model
  prompt: prompts/summarize_descriptions.txt
  max_length: 500
extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english
extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: prompts/extract_claims.txt
  description: Any claims or facts that could be relevant to information discovery.
  max_gleanings: 1
community_reports:
  model_id: default_chat_model
  graph_prompt: prompts/community_report_graph.txt
  text_prompt: prompts/community_report_text.txt
  max_length: 2000
  max_input_length: 8000
cluster_graph:
  max_cluster_size: 10
embed_graph:
  enabled: false
umap:
  enabled: false
snapshots:
  graphml: false
  embeddings: false
local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: prompts/local_search_system_prompt.txt
global_search:
  chat_model_id: default_chat_model
  map_prompt: prompts/global_search_map_system_prompt.txt
  reduce_prompt: prompts/global_search_reduce_system_prompt.txt
  knowledge_prompt: prompts/global_search_knowledge_system_prompt.txt
drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: prompts/drift_search_system_prompt.txt
  reduce_prompt: prompts/drift_search_reduce_prompt.txt
basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: prompts/basic_search_system_prompt.txt
Logs and screenshots
Additional Information
- GraphRAG Version: 2.1.0
- Operating System: Linux
- Python Version: 3.11.2
- Related Issues: