
IndexError: list index out of range when using VertexAIEmbeddings with SemanticChunker #353

Closed
jsconan opened this issue Jul 4, 2024 · 2 comments

jsconan commented Jul 4, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Description

I'm splitting documents using SemanticChunker with VertexAIEmbeddings.
When the number of chunks is high enough (more than ~120), I'm getting IndexError: list index out of range.

Please note that the issue does not occur when using the previous implementation from langchain.embeddings.VertexAIEmbeddings, although that one triggers a deprecation warning.

The problem seems to come from the batch-size calculation in langchain_google_vertexai/embeddings.py, which produces arbitrarily low values for the batch size despite the total number of texts being higher. More precisely, the first batch is fine, but the second batch is smaller, even though the remaining number of chunks should produce more batches. As a result, embed_documents returns fewer embeddings than the number of texts passed in.
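To illustrate the failure mode without the Vertex AI dependency, here is a minimal sketch (hypothetical embedder, not the actual library code): if embed_documents silently drops the texts past the first batch, indexing embeddings[i] inside SemanticChunker eventually goes out of range, exactly as in the traceback below.

texts = [f"sentence {i}" for i in range(200)]

def buggy_embed_documents(texts, batch_size=120):
    # Simulates the suspected bug: only the first batch is embedded,
    # the remaining texts are silently dropped.
    return [[0.0] for _ in texts[:batch_size]]

embeddings = buggy_embed_documents(texts)
for i, _ in enumerate(texts):
    vector = embeddings[i]  # IndexError once i reaches 120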

Example Code

import os
import getpass
import itertools
import lorem
from dotenv import load_dotenv
from google.cloud import aiplatform
# from langchain.embeddings import VertexAIEmbeddings  # this one works
from langchain_google_vertexai import VertexAIEmbeddings # this one fails
from langchain_experimental.text_splitter import SemanticChunker

load_dotenv()
# .env file must look like this:
#
# GOOGLE_APPLICATION_CREDENTIALS=
# PROJECT_ID=
# LOCATION=
#

PROJECT_ID = os.environ.get("PROJECT_ID")
LOCATION = os.environ.get("LOCATION", "europe-west1")

if PROJECT_ID is None:
    PROJECT_ID = getpass.getpass("Project ID")

aiplatform.init(project=PROJECT_ID, location=LOCATION)

embedding_model = VertexAIEmbeddings("text-embedding-004")

text_splitter = SemanticChunker(embedding_model)


NB_SENTENCES = 200 # up to 120 it is ok

document_chunks = text_splitter.split_text(" ".join(itertools.islice(lorem.sentence(word_range=(8, 16)), NB_SENTENCES)))

Error Message and Stack Trace (if applicable)

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[1], line 34
     29 text_splitter = SemanticChunker(embedding_model)
     32 NB_SENTENCES = 200 # up to 120 it is ok
---> 34 document_chunks = text_splitter.split_text(" ".join(itertools.islice(lorem.sentence(word_range=(8, 16)), NB_SENTENCES)))

File ~/Projects/ml/ml-research/vertex-rag/.venv/lib/python3.12/site-packages/langchain_experimental/text_splitter.py:215, in SemanticChunker.split_text(self, text)
    213 if len(single_sentences_list) == 1:
    214     return single_sentences_list
--> 215 distances, sentences = self._calculate_sentence_distances(single_sentences_list)
    216 if self.number_of_chunks is not None:
    217     breakpoint_distance_threshold = self._threshold_from_clusters(distances)

File ~/Projects/ml/ml-research/vertex-rag/.venv/lib/python3.12/site-packages/langchain_experimental/text_splitter.py:200, in SemanticChunker._calculate_sentence_distances(self, single_sentences_list)
    196 embeddings = self.embeddings.embed_documents(
    197     [x["combined_sentence"] for x in sentences]
    198 )
    199 for i, sentence in enumerate(sentences):
--> 200     sentence["combined_sentence_embedding"] = embeddings[i]
    202 return calculate_cosine_distances(sentences)

IndexError: list index out of range
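
For reference, the embedder can also be checked directly, bypassing SemanticChunker; a quick sanity check, reusing embedding_model from the example above, should show fewer vectors coming back than texts going in:

texts = [f"sentence {i}" for i in range(200)]
vectors = embedding_model.embed_documents(texts)
print(len(texts), len(vectors))  # the counts differ on the affected version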

System Info

langchain==0.2.6
langchain-community==0.2.6
langchain-core==0.2.11
langchain-experimental==0.0.62
langchain-google-community==1.0.6
langchain-google-vertexai==1.0.6
langchain-text-splitters==0.2.1

Mac M3 Pro (macOS 14.5)
Python 3.12

lkuligin (Collaborator) commented Jul 5, 2024

Can you please re-install langchain-google-vertexai from GitHub and try again? I believe there was a bug that was fixed last week.
P.S. Don't forget to uninstall the existing version before installing from GitHub.

jsconan (Author) commented Jul 5, 2024

Thank you @lkuligin
Indeed, the GitHub version does not have the issue. I'm now waiting for its release.

In the meantime, here is a way to make it work, with pip:

pip uninstall langchain-google-vertexai
pip install git+ssh://git@github.com/langchain-ai/langchain-google.git#subdirectory=libs/vertexai
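
If SSH access to GitHub is not configured, the same install should also work over HTTPS (same repository and subdirectory, only the transport differs):

pip install "git+https://github.com/langchain-ai/langchain-google.git#subdirectory=libs/vertexai"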

jsconan closed this as completed Jul 5, 2024