Do you need to file an issue?
Describe the bug
I have no problem uploading small files, but when I upload this book, which is quite large (1,088 KB), the upload reports success and GraphRAG can find the book, yet it fails to load it.
Steps to reproduce
Upload a larger book. For testing, I'm using a 1,088 KB file containing 585,696 characters, encoded in UTF-8.
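The book itself isn't attached, so as a stand-in for reproduction, a synthetic UTF-8 text file with the same character count can be generated (the filename and content are illustrative, not from the original report):

```python
# Generate a UTF-8 text file matching the reported size (585,696 characters)
# as a stand-in input for the GraphRAG indexing pipeline.
from pathlib import Path

line_template = "GraphRAG large-input reproduction text, line {n}.\n"
target_chars = 585_696  # character count reported in this issue

chunks = []
total = 0
n = 0
while total < target_chars:
    s = line_template.format(n=n)
    chunks.append(s)
    total += len(s)
    n += 1

out = Path("input/large_test.txt")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text("".join(chunks), encoding="utf-8")
print(total)
```

Dropping the generated file into the configured `input` directory and running the indexer should exercise the same code path.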
Expected Behavior
GraphRAG should support indexing large books.
GraphRAG Config Used
### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini
  model_supports_json: true # recommended if this is available for your model.
  api_base: https://.openai.azure.com
  api_version: 2024-02-15-preview
  organization: <organization_id>
  deployment_name: <azure_model_deployment_name>

parallelization:
  stagger: 0.3
  num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store:
    type: lancedb
    db_uri: 'output/lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\.txt$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # or blob
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "logs"

storage:
  type: file # or blob
  base_dir: "output"

## only turn this on if running graphrag index with custom settings
## we normally use graphrag update with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "vv"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  embeddings: false
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
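One sanity check on the config above: the configured file_pattern does match the filename that appears in the logs, which is consistent with the file being discovered but then failing at the loading step rather than at discovery.

```python
import re

# file_pattern from the config, checked against the filename from the logs
pattern = re.compile(r".*\.txt$")
print(bool(pattern.search("xizang_history.txt")))  # True
```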
Logs and screenshots
10:26:28,71 graphrag.index.create_pipeline_config INFO skipping workflows
10:26:28,71 graphrag.index.run.run INFO Running pipeline
10:26:28,72 graphrag.storage.file_pipeline_storage INFO Creating file storage at /home/emotionalrag/graphrag/incremental_ragtest/output
10:26:28,72 graphrag.index.input.factory INFO loading input from root_dir=input
10:26:28,72 graphrag.index.input.factory INFO using file storage for input
10:26:28,73 graphrag.storage.file_pipeline_storage INFO search /home/emotionalrag/graphrag/incremental_ragtest/input for files matching .*.txt$
10:26:28,74 graphrag.index.input.text INFO found text files from input, found [('xizang_history.txt', {})]
10:26:28,80 graphrag.index.input.text WARNING Warning! Error loading file xizang_history.txt. Skipping...
10:26:28,80 graphrag.index.input.text INFO Found 1 files, loading 0
10:26:28,82 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_final_documents', 'extract_graph', 'compute_communities', 'create_final_entities', 'create_final_relationships', 'create_final_communities', 'create_final_nodes', 'create_final_text_units', 'create_final_community_reports', 'generate_text_embeddings']
10:26:28,82 graphrag.index.run.run INFO Final # of rows loaded: 0
10:26:28,238 graphrag.index.run.workflow INFO dependencies for create_base_text_units: []
10:26:28,243 datashaper.workflow.workflow INFO executing verb create_base_text_units
10:26:28,243 datashaper.workflow.workflow ERROR Error executing verb "create_base_text_units" in create_base_text_units: 'id'
Traceback (most recent call last):
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb
result = await result
^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/workflows/v1/create_base_text_units.py", line 68, in workflow
output = await create_base_text_units(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/flows/create_base_text_units.py", line 32, in create_base_text_units
sort = documents.sort_values(by=["id"], ascending=[True])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/frame.py", line 7189, in sort_values
k = self._get_label_or_level_values(by[0], axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/generic.py", line 1911, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'id'
10:26:28,248 graphrag.callbacks.file_workflow_callbacks INFO Error executing verb "create_base_text_units" in create_base_text_units: 'id' details=None
10:26:28,248 graphrag.index.run.run ERROR error running workflow create_base_text_units
Traceback (most recent call last):
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/run/run.py", line 262, in run_pipeline
result = await _process_workflow(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/run/workflow.py", line 103, in _process_workflow
result = await workflow.run(context, callbacks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 369, in run
timing = await self._execute_verb(node, context, callbacks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb
result = await result
^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/workflows/v1/create_base_text_units.py", line 68, in workflow
output = await create_base_text_units(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/graphrag/index/flows/create_base_text_units.py", line 32, in create_base_text_units
sort = documents.sort_values(by=["id"], ascending=[True])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/frame.py", line 7189, in sort_values
k = self._get_label_or_level_values(by[0], axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/emotionalrag/anaconda3/envs/graphrag/lib/python3.12/site-packages/pandas/core/generic.py", line 1911, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'id'
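The KeyError looks like a downstream symptom of the earlier "Found 1 files, loading 0" line: if the loader skips the only file, create_base_text_units receives an empty documents table with no 'id' column, and sort_values raises exactly this error. A minimal sketch of that failure mode (the empty DataFrame stands in for the loader's output; this is my reading of the traceback, not a confirmed diagnosis):

```python
import pandas as pd

# When the input loader skips every file, the documents table is empty
# and has no 'id' column, so the sort in create_base_text_units fails.
documents = pd.DataFrame()  # 0 rows loaded

try:
    documents.sort_values(by=["id"], ascending=[True])
except KeyError as err:
    print(err)  # 'id'
```

If that reading is right, the root cause is whatever makes the text loader skip the file (the "Error loading file xizang_history.txt" warning), not the file's size per se.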
Additional Information
- GraphRAG Version: 1.0.0
- Operating System: Ubuntu
- Python Version: 3.12
- Related Issues: