
IndexError: list index out of range when using VertexAIEmbeddings with SemanticChunker #353

Closed
jsconan opened this issue Jul 4, 2024 · 2 comments

jsconan commented Jul 4, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Description

I'm splitting documents using SemanticChunker with VertexAIEmbeddings.
When the number of chunks is high enough (more than ~120), I'm getting IndexError: list index out of range.

Please note that the issue does not occur when using the previous implementation from langchain.embeddings.VertexAIEmbeddings, although that one triggers a deprecation warning.

The problem seems to come from the batch-size calculation in langchain_google_vertexai/embeddings.py, which produces arbitrarily low values for the batch size despite the total number of texts being higher. More precisely, the first batch is fine, but the second batch is smaller, even though the remaining number of chunks should produce more batches. As a result, embed_documents returns fewer embeddings than the number of texts passed in.
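To illustrate the failure mode without the Vertex AI dependency, here is a minimal sketch (hypothetical embedder, not the actual library code): if embed_documents silently drops the texts past the first batch, indexing embeddings[i] inside SemanticChunker eventually goes out of range, exactly as in the traceback below.

texts = [f"sentence {i}" for i in range(200)]

def buggy_embed_documents(texts, batch_size=120):
    # Simulates the suspected bug: only the first batch is embedded,
    # the remaining texts are silently dropped.
    return [[0.0] for _ in texts[:batch_size]]

embeddings = buggy_embed_documents(texts)
for i, _ in enumerate(texts):
    vector = embeddings[i]  # IndexError once i reaches 120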

Example Code

import os
import getpass
import itertools
import lorem
from dotenv import load_dotenv
from google.cloud import aiplatform
# from langchain.embeddings import VertexAIEmbeddings  # this one works
from langchain_google_vertexai import VertexAIEmbeddings # this one fails
from langchain_experimental.text_splitter import SemanticChunker

load_dotenv()
# .env file must look like this:
#
# GOOGLE_APPLICATION_CREDENTIALS=
# PROJECT_ID=
# LOCATION=
#

PROJECT_ID = os.environ.get("PROJECT_ID")
LOCATION = os.environ.get("LOCATION", "europe-west1")

if PROJECT_ID is None:
    PROJECT_ID = getpass.getpass("Project ID")

aiplatform.init(project=PROJECT_ID, location=LOCATION)

embedding_model = VertexAIEmbeddings("text-embedding-004")

text_splitter = SemanticChunker(embedding_model)


NB_SENTENCES = 200 # up to 120 it is ok

document_chunks = text_splitter.split_text(" ".join(itertools.islice(lorem.sentence(word_range=(8, 16)), NB_SENTENCES)))

Error Message and Stack Trace (if applicable)

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[1], line 34
     29 text_splitter = SemanticChunker(embedding_model)
     32 NB_SENTENCES = 200 # up to 120 it is ok
---> 34 document_chunks = text_splitter.split_text(" ".join(itertools.islice(lorem.sentence(word_range=(8, 16)), NB_SENTENCES)))

File ~/Projects/ml/ml-research/vertex-rag/.venv/lib/python3.12/site-packages/langchain_experimental/text_splitter.py:215, in SemanticChunker.split_text(self, text)
    213 if len(single_sentences_list) == 1:
    214     return single_sentences_list
--> 215 distances, sentences = self._calculate_sentence_distances(single_sentences_list)
    216 if self.number_of_chunks is not None:
    217     breakpoint_distance_threshold = self._threshold_from_clusters(distances)

File ~/Projects/ml/ml-research/vertex-rag/.venv/lib/python3.12/site-packages/langchain_experimental/text_splitter.py:200, in SemanticChunker._calculate_sentence_distances(self, single_sentences_list)
    196 embeddings = self.embeddings.embed_documents(
    197     [x["combined_sentence"] for x in sentences]
    198 )
    199 for i, sentence in enumerate(sentences):
--> 200     sentence["combined_sentence_embedding"] = embeddings[i]
    202 return calculate_cosine_distances(sentences)

IndexError: list index out of range
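
For reference, the embedder can also be checked directly, bypassing SemanticChunker; a quick sanity check, reusing embedding_model from the example above, should show fewer vectors coming back than texts going in:

texts = [f"sentence {i}" for i in range(200)]
vectors = embedding_model.embed_documents(texts)
print(len(texts), len(vectors))  # the counts differ on the affected version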

System Info

langchain==0.2.6
langchain-community==0.2.6
langchain-core==0.2.11
langchain-experimental==0.0.62
langchain-google-community==1.0.6
langchain-google-vertexai==1.0.6
langchain-text-splitters==0.2.1

Mac M3 Pro (macOS 14.5)
Python 3.12

lkuligin (Collaborator) commented Jul 5, 2024

Can you please re-install langchain-google-vertexai from GitHub and try again? I believe there was a bug that was fixed last week.
P.S. Don't forget to uninstall the existing version before installing from GitHub.

jsconan (Author) commented Jul 5, 2024

Thank you @lkuligin
Indeed, the GitHub version does not have the issue. I'm now waiting for its release.

In the meantime, here is a way to make it work, with pip:

pip uninstall langchain-google-vertexai
pip install git+ssh://git@github.com/langchain-ai/langchain-google.git#subdirectory=libs/vertexai
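
If SSH access to GitHub is not configured, the same install should also work over HTTPS (same repository and subdirectory, only the transport differs):

pip install "git+https://github.com/langchain-ai/langchain-google.git#subdirectory=libs/vertexai"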

jsconan closed this as completed Jul 5, 2024