Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

langchain[patch]: Add Possibility to use Contextual chunk headers in Parent Document Retriever #4651

Merged

Conversation

karol-f
Copy link
Contributor

@karol-f karol-f commented Mar 6, 2024

Add possibility to use Contextual chunk headers in Parent Document Retriever.

This is particularly important if you have several fine-grained child chunks that need to be correctly retrieved from the vector store.

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Mar 6, 2024
Copy link

vercel bot commented Mar 6, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchainjs-api-refs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Mar 12, 2024 8:19pm
langchainjs-docs ✅ Ready (Inspect) Visit Preview Mar 12, 2024 8:19pm

@jacoblee93
Copy link
Collaborator

Nice one, thank you!

@jacoblee93
Copy link
Collaborator

Looks like you might need to run yarn lint?

@jacoblee93 jacoblee93 changed the title Add Possibility to use Contextual chunk headers in Parent Document Retriever langchain[patch]: Add Possibility to use Contextual chunk headers in Parent Document Retriever Mar 8, 2024
@karol-f
Copy link
Contributor Author

karol-f commented Mar 8, 2024

Looks like you might need to run yarn lint?

You are right, sorry. Fixed.

@@ -159,7 +162,11 @@ export class ParentDocumentRetriever extends MultiVectorRetriever {
for (let i = 0; i < parentDocs.length; i += 1) {
const parentDoc = parentDocs[i];
const parentDocId = parentDocIds[i];
const subDocs = await this.childSplitter.splitDocuments([parentDoc]);
const chunkHeaderOptions = chunkHeader && parentDoc?.metadata?.chunkHeader ? {
chunkHeader: parentDoc.metadata.chunkHeader,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on this to be in the metadata field of the document is a bit weird - could we make this a property on the class instead? Like this.childSplitterOptions?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore the above, I see the problem. Let me think about it a little longer - would be nice to not have a dependency on an untyped metadata key like this.

Copy link
Contributor Author

@karol-f karol-f Mar 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just have to the try the same way of passing chunkHeader as this is currently implemented in LangChainJs. It will be better solution, I will see if this can be done the same way. Thanks for spotting.

@jacoblee93
Copy link
Collaborator

See comment, also please run yarn format from root!

@jacoblee93 jacoblee93 added the close PRs that need one or two touch-ups to be ready label Mar 9, 2024
@karol-f karol-f force-pushed the chunk-header-ParentDocumentRetriever branch from 97a739d to c28f039 Compare March 11, 2024 17:26
@karol-f
Copy link
Contributor Author

karol-f commented Mar 11, 2024

Thank you again for the suggestions. I've pushed the improved version that looks cleaner and doesn't use metadata. Regards!

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Mar 12, 2024
@jacoblee93 jacoblee93 self-assigned this Mar 12, 2024
@jacoblee93 jacoblee93 added lgtm PRs that are ready to be merged as-is and removed close PRs that need one or two touch-ups to be ready labels Mar 12, 2024
@jacoblee93
Copy link
Collaborator

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto:improvement Medium size change to existing code to handle new use-cases lgtm PRs that are ready to be merged as-is size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants