[Bug]: KeyError for dataframe while batch embedding in generate_text_embeddings workflow #1351

@malimora

Description

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

In _text_embed_with_vector_store, title is initially used as the column name:

title = title_column or embed_column
if title not in input.columns:
    msg = (
        f"Column {title} not found in input dataframe with columns {input.columns}"
    )
    raise ValueError(msg)

but it is later overwritten by DataFrame row values inside the batch loop:

for id, text, title, vector in zip(ids, texts, titles, vectors, strict=True):

Because title still holds a row value when the next batch is processed, using it as a DataFrame column key raises a KeyError. A minimal repro of the pitfall is sketched below.
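Standalone sketch of the shadowing pitfall (illustrative data only, not GraphRAG internals):

import pandas as pd

df = pd.DataFrame({"title": ["doc1", "doc2"], "text": ["t1", "t2"]})
title = "title"  # meant to stay a column name

for batch in (df.iloc[:1], df.iloc[1:]):
    # Works for the first batch; raises KeyError: 'doc1' on the second,
    # because the inner loop rebound `title` to a row value.
    titles = batch[title].to_numpy().tolist()
    for id, text, title in zip(batch.index, batch["text"], titles):
        pass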

Steps to reproduce

~$ poetry run graphrag index --root .\index_root --verbose --output .\output

Expected Behavior

The function processes all batches using the correct DataFrame column name.
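One way to get there (a sketch of a possible fix, not the maintainers' actual patch; doc_id, doc_text, and doc_title are illustrative names) is to rename the loop variables so they no longer shadow the column name:

titles: list[str] = batch[title].to_numpy().tolist()
# Renamed loop variables: `title` keeps referring to the column across batches.
for doc_id, doc_text, doc_title, vector in zip(ids, texts, titles, vectors, strict=True):
    ...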

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  max_retries: 20
  max_retry_wait: 30.0
  sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  concurrent_requests: 10 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 20
  num_threads: 20 # the number of threads to use for parallel processing

async_mode: asyncio # or threaded

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: asyncio # or threaded
  target: required # or all
  # batch_size: 16 # the number of documents to send in a single request
  # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
  vector_store:
    type: lancedb
    db_uri: 'output\lancedb'
    collection_name: default
    overwrite: true
  # vector_store: # configuration for AI Search
    # type: azure_ai_search
    # url: <ai_search_endpoint>
    # api_key: <api_key> # if not set, will attempt to use managed identity. Expects the `Search Index Data Contributor` RBAC role in this case.
    # audience: <optional> # if using managed identity, the audience to use for the token
    # overwrite: true # or false. Only applicable at index creation time
    # collection_name: <collection_name> # the name of the collection to use. Default: 'default'
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-large
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    max_retries: 20
    max_retry_wait: 30.0
    sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    concurrent_requests: 10 # the number of parallel inflight requests that may be made

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.md$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

update_index_storage: # Storage to save an updated index (for incremental indexing). Enabling this performs an incremental index run
  # type: file # or blob
  # base_dir: "update_output"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: console # or file, blob
  base_dir: "logs"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## strategy: fully override the entity extraction strategy.
  ##   type: one of graph_intelligence, graph_intelligence_json and nltk
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [Library, Paper, Pattern, technique, Agent, Algorithm, Animal, Api, Application, Conference, architecture, Architecture, Benchmark, Blog, Book, BootLoader, BuildSystem, Camping, Clause, Code, Command, Component, Concept, Conference, Configuration, Country, Course, Database, DataFormat, Dataset, DataStructure, DataType, Event, Exercise, Extension, Feature, File, FileFormat, Format, Framework, Function, Game, Journal, Language, License, LinuxDistro, Matrix, Method, Metric, Model, Module, Newsletter, OperatingSystem, Optimizer, Organization, Orm, Package, Person, Platform, Podcast, Program, ProgrammingLanguage, Project, Prompt, Protocol, Repository, Resource, Server, Service, Software, Standard, Storage, Task, Technology, Tool, Version, Website]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 25

embed_graph:
  enabled: true # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: true # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: true
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

Traceback (most recent call last):
  File "C:\projects\graphrag\graphrag\index\run\run.py", line 267, in run_pipeline
    result = await _process_workflow(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\graphrag\graphrag\index\run\workflow.py", line 105, in _process_workflow
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\pypoetry\Cache\virtualenvs\graphrag-LFDedd5Q-py3.12\Lib\site-packages\datashaper\workflow\workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\pypoetry\Cache\virtualenvs\graphrag-LFDedd5Q-py3.12\Lib\site-packages\datashaper\workflow\workflow.py", line 415, in _execute_verb
    result = await result
             ^^^^^^^^^^^^
  File "C:\projects\graphrag\graphrag\index\workflows\v1\subflows\generate_text_embeddings.py", line 56, in generate_text_embeddings
    await generate_text_embeddings_flow(
  File "C:\projects\graphrag\graphrag\index\flows\generate_text_embeddings.py", line 106, in generate_text_embeddings
    await _run_and_snapshot_embeddings(
  File "C:\projects\graphrag\graphrag\index\flows\generate_text_embeddings.py", line 129, in _run_and_snapshot_embeddings
    data["embedding"] = await embed_text(
                        ^^^^^^^^^^^^^^^^^
  File "C:\projects\graphrag\graphrag\index\operations\embed_text\embed_text.py", line 92, in embed_text
    return await _text_embed_with_vector_store(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\graphrag\graphrag\index\operations\embed_text\embed_text.py", line 180, in _text_embed_with_vector_store
    titles: list[str] = batch[title].to_numpy().tolist()
                        ~~~~~^^^^^^^
  File "C:\Users\user\AppData\Local\pypoetry\Cache\virtualenvs\graphrag-LFDedd5Q-py3.12\Lib\site-packages\pandas\core\frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\pypoetry\Cache\virtualenvs\graphrag-LFDedd5Q-py3.12\Lib\site-packages\pandas\core\indexes\base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'DATA SCIENTIST:A Data Scientist is a professional who utilizes statistical methods, algorithms, and machine learning techniques to analyze and interpret complex data. This role involves the application of scientific methods and systems to extract knowledge and insights from both structured and unstructured data. A Data Scientist combines expertise in programming, statistics, and domain knowledge to effectively derive insights from data, making them essential in transforming raw data into actionable information.'
None
❌ generate_text_embeddings
None
⠋ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1713 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
└── generate_text_embeddings
Errors occurred during the pipeline run, see logs for more details.
❌ Errors occurred during the pipeline run, see logs for more details.

Additional Information

  • GraphRAG Version:
  • Operating System: Win11
  • Python Version: 3.12.7
  • Related Issues:

Metadata

Labels

awaiting_response (Maintainers or community have suggested solutions or requested info, awaiting filer response); bug (Something isn't working)
