Investigate chunking strategies #74

Open
1 of 4 tasks
orpiske opened this issue May 10, 2024 · 5 comments

Comments

@orpiske
Contributor

orpiske commented May 10, 2024

We need to investigate chunking strategies that can help the assistant provide better answers:

  • Investigate whether chunking libraries already exist
  • Investigate whether chunking features are available on some of the libraries/frameworks that we use
  • Create a Java library/project that implements typical strategies for chunking
  • Use chunking features from a different language (e.g., Python) and externalize chunking to a separate component.
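
As a point of reference for the third task, one of the simplest "typical" strategies is fixed-size chunking with overlap. The sketch below is only an illustration in plain Java: the class name `FixedSizeChunker` and its parameters are hypothetical and do not come from langchain4j, Camel, or any existing library. It measures size in characters rather than tokens, which is an assumption made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any existing library.
class FixedSizeChunker {

    /**
     * Splits text into chunks of at most maxChars characters, where each
     * chunk repeats the last `overlap` characters of the previous one.
     */
    static List<String> chunk(String text, int maxChars, int overlap) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        if (overlap < 0 || overlap >= maxChars) {
            throw new IllegalArgumentException("overlap must be in [0, maxChars)");
        }

        List<String> chunks = new ArrayList<>();
        int step = maxChars - overlap; // how far the window advances each time

        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window already reached the end of the text
            }
        }
        return chunks;
    }
}
```

Calling `FixedSizeChunker.chunk(text, 4, 1)` slides a 4-character window forward 3 characters at a time. Splitting on sentence or paragraph boundaries would likely preserve context better, at the cost of a more involved implementation.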
@lburgazzoli
Contributor

IMHO, this work should end up being part of langchain4j, and we can eventually use it as one of the tokenization strategies in Apache Camel.

@oscerd
Contributor

oscerd commented May 10, 2024

I don't think it's something that should go in Camel. Camel is an integration framework; tokenizing is a feature that belongs elsewhere.

@orpiske
Contributor Author

orpiske commented May 10, 2024

IMHO, this work should end up being part of langchain4j, and we can eventually use it as one of the tokenization strategies in Apache Camel.

Yeah.

I also don't see it as being part of Camel, as rightly pointed out by @oscerd. It could be used by it, though.

So, I think a reasonable approach would be to create a Java library and then work to include support for it in langchain4j and Quarkus.

@lburgazzoli
Contributor

I would then move this discussion to the langchain4j issue tracker, so they can provide additional info and suggestions, as they may already have had the chance to think about it.

@orpiske
Contributor Author

orpiske commented May 10, 2024

For reference, here's a discussion with the Langchain4j project. Their suggestion is to look at the DocumentSplitter interface and work on top of that.

langchain4j/langchain4j#1081
