Checked other resources
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Description
I'm splitting documents using SemanticChunker with VertexAIEmbeddings.
When the number of chunks is high enough (more than ~120), I get IndexError: list index out of range.
Please note that the issue does not occur with the previous implementation from langchain.embeddings.VertexAIEmbeddings, although that one, of course, triggers a deprecation warning.
The problem seems to come from the batch-size calculation in langchain_google_vertexai/embeddings.py, which produces arbitrarily low values for the batch size even though the total number of texts is higher. More precisely, the first batch has the expected size, but the second one is smaller than it should be, even though the number of remaining chunks should yield further full batches. The net effect is that embed_documents returns fewer embeddings than it was given texts, as sketched below.
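To make the failure mode concrete, here is a minimal, self-contained sketch (separate from the real reproduction under Example Code below). BuggyBatchEmbeddings is hypothetical; it only mimics an embed_documents that silently returns fewer vectors than it was given, which is exactly what SemanticChunker's indexing loop cannot tolerate:

from typing import List

class BuggyBatchEmbeddings:
    # Hypothetical stand-in: mimics an embed_documents that, due to a
    # batch-size miscalculation, returns fewer vectors than input texts.
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Simulate the truncated batching: anything past 120 is dropped.
        return [[0.0, 1.0] for _ in texts[:120]]

sentences = [{"combined_sentence": f"sentence {i}"} for i in range(200)]
embeddings = BuggyBatchEmbeddings().embed_documents(
    [s["combined_sentence"] for s in sentences]
)

# Mirrors the loop in SemanticChunker._calculate_sentence_distances:
# raises IndexError: list index out of range once i reaches 120.
for i, sentence in enumerate(sentences):
    sentence["combined_sentence_embedding"] = embeddings[i]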
Example Code
import os
import getpass
import itertools

import lorem
from dotenv import load_dotenv
from google.cloud import aiplatform

# from langchain.embeddings import VertexAIEmbeddings  # this one works
from langchain_google_vertexai import VertexAIEmbeddings  # this one fails
from langchain_experimental.text_splitter import SemanticChunker

load_dotenv()

# .env file must look like this:
#
# GOOGLE_APPLICATION_CREDENTIALS=
# PROJECT_ID=
# LOCATION=
#

PROJECT_ID = os.environ.get("PROJECT_ID")
LOCATION = os.environ.get("LOCATION", "europe-west1")

if PROJECT_ID is None:
    PROJECT_ID = getpass.getpass("Project ID")

aiplatform.init(project=PROJECT_ID, location=LOCATION)

embedding_model = VertexAIEmbeddings("text-embedding-004")
text_splitter = SemanticChunker(embedding_model)

NB_SENTENCES = 200  # up to 120 it is ok

document_chunks = text_splitter.split_text(
    " ".join(itertools.islice(lorem.sentence(word_range=(8, 16)), NB_SENTENCES))
)
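As a quick diagnostic, wrapping the embeddings object in a small logging shim confirms that the single embed_documents call issued by SemanticChunker comes back short. LoggingEmbeddings is a hypothetical helper written for this report, not a library class:

from typing import List

class LoggingEmbeddings:
    # Hypothetical shim: reports how many texts each embed_documents
    # call receives and how many embeddings come back.
    def __init__(self, inner):
        self.inner = inner

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        result = self.inner.embed_documents(texts)
        print(f"embed_documents: sent {len(texts)} texts, got {len(result)} back")
        return result

    def embed_query(self, text: str) -> List[float]:
        return self.inner.embed_query(text)

# Re-running the snippet above with
# text_splitter = SemanticChunker(LoggingEmbeddings(embedding_model))
# prints a 'got' count lower than the 'sent' count once the input grows
# beyond ~120 sentences, matching the IndexError below.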
Error Message and Stack Trace (if applicable)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[1], line 34
29 text_splitter = SemanticChunker(embedding_model)
32 NB_SENTENCES = 200 # up to 120 it is ok
---> 34 document_chunks = text_splitter.split_text(" ".join(itertools.islice(lorem.sentence(word_range=(8, 16)), NB_SENTENCES)))
File ~/Projects/ml/ml-research/vertex-rag/.venv/lib/python3.12/site-packages/langchain_experimental/text_splitter.py:215, in SemanticChunker.split_text(self, text)
213 if len(single_sentences_list) == 1:
214 return single_sentences_list
--> 215 distances, sentences = self._calculate_sentence_distances(single_sentences_list)
216 if self.number_of_chunks is not None:
217 breakpoint_distance_threshold = self._threshold_from_clusters(distances)
File ~/Projects/ml/ml-research/vertex-rag/.venv/lib/python3.12/site-packages/langchain_experimental/text_splitter.py:200, in SemanticChunker._calculate_sentence_distances(self, single_sentences_list)
196 embeddings = self.embeddings.embed_documents(
197 [x["combined_sentence"] for x in sentences]
198 )
199 for i, sentence in enumerate(sentences):
--> 200 sentence["combined_sentence_embedding"] = embeddings[i]
202 return calculate_cosine_distances(sentences)
IndexError: list index out of range
System Info
langchain==0.2.6
langchain-community==0.2.6
langchain-core==0.2.11
langchain-experimental==0.0.62
langchain-google-community==1.0.6
langchain-google-vertexai==1.0.6
langchain-text-splitters==0.2.1
Mac M3 Pro (macOS 14.5)
Python 3.12

Can you please re-install langchain-google-vertexai from GitHub and try again? I believe there was a bug that was fixed last week.
P.S. Don't forget that you need to uninstall the existing version before installing from GitHub.
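For reference, reinstalling from source would look something like pip uninstall langchain-google-vertexai followed by pip install "git+https://github.com/langchain-ai/langchain-google.git#subdirectory=libs/vertexai" (assuming the package still lives under libs/vertexai in the langchain-ai/langchain-google monorepo).

Until a fixed release ships, a possible stopgap is to bypass the dynamic batch sizing entirely and send fixed-size slices small enough to stay below the ~120-text threshold. This is a sketch under the assumption that the bug lives in the batch-size calculation rather than in the embedding calls themselves; FixedBatchEmbeddings is a hypothetical wrapper, not part of the library:

from typing import List

from langchain_core.embeddings import Embeddings

class FixedBatchEmbeddings(Embeddings):
    # Hypothetical wrapper: delegates to another Embeddings object but
    # always sends fixed-size batches, sidestepping the suspected
    # dynamic batch-size calculation.
    def __init__(self, inner: Embeddings, batch_size: int = 100) -> None:
        self.inner = inner
        self.batch_size = batch_size

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        out: List[List[float]] = []
        for start in range(0, len(texts), self.batch_size):
            out.extend(self.inner.embed_documents(texts[start:start + self.batch_size]))
        return out

    def embed_query(self, text: str) -> List[float]:
        return self.inner.embed_query(text)

# Usage with the reproduction above:
# text_splitter = SemanticChunker(FixedBatchEmbeddings(embedding_model))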