
Feature/redbox 411 chunk refactor in api #673
Merged (10 commits, Jun 27, 2024)

Conversation

@jamesrichards4 (Contributor) commented Jun 26, 2024

Context

There is an assumption throughout the code that there is a file and a set of chunks. This will not be true long term. This refactor removes the strong link between files and chunks and tries to use Langchain retrievers to interact with chunks where possible.

Changes proposed in this pull request

  • Use retrievers for all chunk interactions (see below for why this is a lie)
  • Avoid use of the Chunk class in 'AI' code; chains and runnables handle Documents, which lines up with Langchain conventions
  • Added additional mapping functions for document management vs chunks/source documents
  • Explicitly set the file processing status on the file in Elasticsearch. This means we can update the file status without the API inferring it from chunks, which will help if/when we have multiple sets of chunks for additional processing beyond a single index of chunks. We now return processing and complete statuses to Django (see the sketch after this list).
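To make the explicit status write concrete, here is a minimal sketch, assuming an elasticsearch-py client and hypothetical index/field names (redbox-data-file, ingest_status); the PR's actual implementation may differ:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # connection details are illustrative

def set_file_status(file_uuid: str, status: str) -> None:
    """Write the processing status directly onto the file record so the
    API no longer has to infer it from the file's chunks."""
    es.update(
        index="redbox-data-file",       # assumed index name
        id=file_uuid,
        doc={"ingest_status": status},  # assumed field name
    )

# e.g. set_file_status(file_uuid, "processing") at ingest start,
# then set_file_status(file_uuid, "complete") once embedding finishes.
```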

Guidance to review

The StorageHandler class is still needed for file interactions; this is clean and makes sense (particularly because we may want to make more use of this class). It is also needed to manage returning chunks and statuses in the API. File status provides a list of chunk statuses; this will need revisiting, and removing it could allow further clean-up.

The chunk clustering code is still present in case we want to restore that functionality (it is now only a code skeleton as it hasn't been mapped to the new Document setup). We could remove all of redbox.parsing if we decide not to do this clustering. Dropping this would be my preference!

There is a lot of messy document mapping happening. Tracing it through, it is caused by the rechunking in the summarisation chains. If we created summarisation chunks at ingest time (a second worker) we could avoid all of this, and then only the retrievers would need to understand the document structure.

The new retrievers need tests; I'll add these on this PR, but I wanted to raise it now to get thoughts on the structure.

Things to check

There's a lot of messing around in the retrievers to handle old-style documents. This should allow us to maintain compatibility with documents ingested by the old worker (chunk style). Worth noting, and also worth confirming that this works in dev!

Can Django receive the new file status of 'processing'? I imagine not?

  • I have added any new ENV vars in all deployed environments
  • I have tested any code added or changed
  • I have run integration tests

chunking = "chunking"
embedding = "embedding"
processing = "processing"
failed = "failed"
Collaborator:

This is going to have implications for the frontend, cc @brunns

Contributor Author:

I suspected so; there's an enum in the Django app, but I wanted to check with those who actually understand that app!

Collaborator:

We'd need to do this with 3 migrations - one to add the new statuses, one to update files using the old ones to the new ones, and one to remove the old statuses. Plus we'd need to fix the document and chat views.

Shall I jump on to this branch and make those changes?
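For illustration, the middle of those three migrations might look like this minimal sketch (app, model, and status names are assumptions, not the real schema):

```python
from django.db import migrations

# Assumed mapping from the old statuses to the new ones.
OLD_TO_NEW = {
    "chunking": "processing",
    "embedding": "processing",
}

def remap_statuses(apps, schema_editor):
    File = apps.get_model("redbox_core", "File")  # app/model names assumed
    for old, new in OLD_TO_NEW.items():
        File.objects.filter(status=old).update(status=new)

class Migration(migrations.Migration):
    dependencies = [("redbox_core", "0002_add_new_statuses")]  # placeholder
    operations = [
        migrations.RunPython(remap_statuses, migrations.RunPython.noop),
    ]
```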

@gecBurton (Collaborator), Jun 27, 2024:

For the time being, could we just make this a string in Django, @brunns? This significant refactor isn't over yet, and:

  1. there could be more changes
  2. I'd like to decouple the front and back ends

@brunns (Collaborator), Jun 27, 2024:

We could do, but we base a bit of behaviour on this value; we don't just display it. Values for this that Django doesn't know about will probably cause problems, and I think the cause will be more obvious if we're using an enum.

We can certainly decouple core-api and django-app a bit, though, by agreeing on the message. Only 3 values are really useful here: processing, ready, and it's-gone-pete-tong (i.e. failed). It feels like that will be pretty future-proof. If we align on that, what core-api and django-app do internally can be their own business.

Contributor Author:

The behaviour is presumably when to enable chat on it? It has to be 'complete' for that, I guess? It's only the processing status that's new, so can't we leave the various statuses in place? Aren't parsing, chunking, embedding, and now processing all the same to the frontend?

Collaborator:

Basically, yes. So if we could not send core's internal idea of the status, but just send one of three values, that would isolate the two services.

If we continue to expose core's internal status, we'll keep needing to change core and django at the same time. Let's say you added, I don't know, munching as a status: we'd need a change on the Django side to know what to do with it.
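To make the proposed decoupling concrete, here is a sketch of the translation layer being suggested; the internal status names come from the enum above, while the three external values follow @brunns's list (exact naming is an assumption):

```python
# core-api keeps its internal statuses; Django only ever sees one of three
# agreed values, so adding "munching" internally needs no frontend change.
INTERNAL_TO_EXTERNAL = {
    "chunking":   "processing",
    "embedding":  "processing",
    "processing": "processing",
    "complete":   "ready",
    "failed":     "failed",
}

def external_status(internal: str) -> str:
    # Unknown internal statuses degrade safely rather than breaking Django.
    return INTERNAL_TO_EXTERNAL.get(internal, "processing")
```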

last_chunk = chunks[-1]
# Backwards compatible with worker which didn't store token_count
# Everything is Optional (may be None) so we have to check everything
# This will all be rolled up into a SummarisationChunkRetriever or similar
# in future work, removing this ugliness
Collaborator:

thx for explanation 👍
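For context, the defensive check being discussed might look roughly like this (a sketch only; field names are assumptions rather than the PR's exact code):

```python
from langchain_core.documents import Document

def chunk_token_count(chunk: Document) -> int:
    """Old-worker chunks may lack metadata, or have metadata without
    token_count, so every level of the lookup must tolerate None."""
    metadata = chunk.metadata or {}
    count = metadata.get("token_count")
    return count if count is not None else 0
```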

file_uuid=file_uuid,
# We need to break the link between file status and a specific set of chunks
# to enable future work with many chunks or other indices etc
chunk_statuses=[],
Collaborator:

I'm wondering if we shouldn't just get rid of chunk status altogether?

Contributor Author:

That's my feeling too: file-level status for the whole ingest, or maybe for individual indices, feels nicer if we add more stages. For now this feels like the safest change?

@gecBurton (Collaborator) previously approved these changes Jun 27, 2024 and left a comment:

TL;DR: I like the approach.

I've added some nitpicking, but this can be ignored; better to press ahead to the next stage and tidy up later. I see there are some empty files added too.

Very keen to have @brunns's thoughts on this to spot breaking changes to the front end.

Additional thoughts:

> This should allow us to maintain compatibility with documents ingested by the old worker (chunk style).

All data gets deleted after 30 days; we can just wait that amount of time and then remove this code.

> remove all of redbox.parsing

Yes.

@gecBurton dismissed their stale review June 27, 2024 07:14 ("too early, still in draft!")

@wpfl-dbt (Collaborator) left a comment:

I'm always impressed by your code James, it's really something. I don't think I've thrown any radical spanners in the works, but hopefully given some challenge that galvanises your overall direction.

)


# This should be unnecessary and indicates we're not chunking correctly
Collaborator:

I don't know Langchain/Elastic well enough to understand why. When you test this you show a created date (correctly) merging to the earlier of the two. How is that done implicitly where other metadata fields need rules? Why wasn't it turned into a list of values, or the later of the two used?

Contributor Author:

I don't love this, but as the new document's metadata is copied from the first doc passed in ('a'), all values are implicitly the same. Values which want a combined result are then explicitly mapped. This creates a weird 'no mapping' for some values in that function, which is harder to parse than being completely explicit.
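A minimal sketch of that merge behaviour (field names are hypothetical): metadata starts as a copy of the first document's, so everything is implicitly "the same", and only fields wanting a combined value get explicit rules.

```python
def merge_metadata(a: dict, b: dict) -> dict:
    merged = dict(a)  # implicit rule: every field defaults to a's value
    # explicit rules only where a combined value is wanted
    merged["created_datetime"] = min(a["created_datetime"], b["created_datetime"])
    merged["token_count"] = (a.get("token_count") or 0) + (b.get("token_count") or 0)
    return merged
```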

document_loader(s3_client=s3_client, env=env)
| RunnableLambda(list)
| RunnableLambda(vectorstore.add_documents)
).invoke(file)
Collaborator:

As this is an async function, should/could this be one of the async calls, like .ainvoke()?

Contributor Author:

Yes, you're right, and the vectorstore supports async too (aadd_documents), so we could probably improve this by going that way. I'll bump that to another ticket though, unless you have strong feelings?

NOTE: in general I'm not sure where to record stuff like this; product seem to manage our tickets, and this is internal so not really on them. Do we just slap stuff on the backlog and make sure it gets pulled?
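If we did pick that up, the async shape might be as small as this sketch (reusing document_loader, s3_client, env, and vectorstore from the snippet above; aadd_documents and ainvoke are Langchain's standard async counterparts):

```python
from langchain_core.runnables import RunnableLambda

async def ingest(file):
    # Same pipeline stages, but writing to the vectorstore with its native
    # async method and invoking the chain asynchronously.
    return await (
        document_loader(s3_client=s3_client, env=env)
        | RunnableLambda(list)
        | RunnableLambda(vectorstore.aadd_documents)
    ).ainvoke(file)
```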

worker/src/app.py (resolved)
@@ -0,0 +1,90 @@
from functools import partial
from typing import Any, TypedDict, Dict
Collaborator:

This is such a minor point in the scheme of things, but from Python 3.9 dict and list are functionally the same as their counterparts in typing.

Contributor Author:

oh that is good to know!

Collaborator:

I believe that make format will fix this for you automatically?
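For reference, the 3.9+ form being suggested looks like this (an illustrative snippet, not code from this PR):

```python
from typing import Any, TypedDict  # Dict import no longer needed

class ESParams(TypedDict):  # hypothetical, named after the dicts discussed below
    size: int
    filters: dict[str, Any]   # builtin dict as a generic, valid since 3.9
    file_uuids: list[str]     # builtin list likewise
```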

Comment on lines +1 to +2
from .all_chunks import get_all_chunks_query
from .parameterised import ParameterisedElasticsearchRetriever
Collaborator:

How would you feel about reorganising the retriever folder into:

  • queries.py, with both query functions (get_all_chunks_query, parameterised_body_func) and the typed dicts that define them (ESQuery, ESParams)
  • retrievers.py, which imports the queries and defines the parameterised subclass

Thinking about it this way helped me understand how this all fitted together a bit better, up to you though!

Contributor Author:

I don't like the layout I've ended up with, for sure, and your suggestion is much cleaner. I'd suggest we think about this when we add new ones, and think more about the summarisation/large-chunk retriever, as there might be significant work here again?

metadata=hit['_source'].get('metadata', {})
)

class ParameterisedElasticsearchRetriever(ElasticsearchRetriever):
Collaborator:

Here's my nagging feeling: now we work with Documents, do we need to parameterise at this low level? Can we use the higher level Langchain classes? After consideration I think I'm still a yes.

The thing that sways me is that if you read the docs for the Elasticsearch store class and the Elasticsearch retriever class, the simple fact is that both of them require you to write Elastic query objects to get anything nuanced done. Going high level would still need us to write stuff like this.

The only thing I've seen that going high level would buy is that dropping subscription-only features locally and turning them on remotely (hybrid) is a LOT easier. But, big picture, I think it's quicker and easier for us to bolt that onto what we have than refactor the entire thing to go high level.

Contributor Author:

I agree that we should keep this. I see the advantage in two main parts: we can start using the high-level 'combination' retrievers easily now [citation needed] (ensemble, contextual compression), and we can start adding in others where appropriate and see what happens. This low-level one lets us flex our Elastic knowledge though, and we'll likely always want that kind of control, I think.
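As an illustration of the control this low-level approach buys, here is a hedged sketch of a parameterised Elastic query body in the style of Langchain's ElasticsearchRetriever body_func (all field names are assumptions, not the PR's real query):

```python
def parameterised_body_func(
    query: str, *, user_uuid: str, file_uuids: list[str], size: int = 5
) -> dict:
    # Full control over the query body: filters, sizes, and any nuance the
    # high-level store/retriever classes won't express for us.
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"text": query}}],
                "filter": [
                    {"term": {"creator_user_uuid.keyword": user_uuid}},
                    {"terms": {"parent_file_uuid.keyword": file_uuids}},
                ],
            }
        },
    }
```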

"sources": retriever,
}
| {
"response": prompt | llm | StrOutputParser(),
"sources": itemgetter("sources"),
}
)


def resize_documents(max_tokens: int | None = None) -> list[Document]:
Collaborator:

This is the other place where I wonder about the high/low level difference. I've never used the ParentDocumentRetriever but it seems to do the same thing. Are we reinventing the wheel, or avoiding a part of Langchain that's more trouble than it's worth?

Contributor Author:

We're 100% reinventing the wheel here, and I want to think about a new ingest pipeline which creates these larger chunks. I think Langchain can help here with the ParentDocumentRetriever, as you say, so that'd be good to look at. We can remove so much faffing around with docs if we get that working.
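For the record, the wheel being reinvented is roughly this (a sketch assuming a token_count metadata field; the PR's actual resize_documents differs):

```python
from langchain_core.documents import Document

def resize(docs: list[Document], max_tokens: int) -> list[Document]:
    """Greedily merge consecutive chunks into larger documents of up to
    max_tokens, keeping the first chunk's metadata for each group."""
    resized: list[Document] = []
    buffer: list[Document] = []
    tokens = 0
    for doc in docs:
        doc_tokens = doc.metadata.get("token_count", 0)
        if buffer and tokens + doc_tokens > max_tokens:
            resized.append(Document(
                page_content=" ".join(d.page_content for d in buffer),
                metadata=buffer[0].metadata,
            ))
            buffer, tokens = [], 0
        buffer.append(doc)
        tokens += doc_tokens
    if buffer:
        resized.append(Document(
            page_content=" ".join(d.page_content for d in buffer),
            metadata=buffer[0].metadata,
        ))
    return resized
```

This is exactly the territory ParentDocumentRetriever covers, which is why creating the larger chunks at ingest time looks attractive.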

"index": 12,
"page_number": 3,
"languages": [],
# "link_texts": [],
Collaborator:

If I'm understanding this correctly, this is one of the most important parts of this test. It says: your metadata remains as fluid as Langchain intended, and we can still work with it, combine it, etc.

I think this is extremely important. I've been thinking about returning Elastic's scores as part of the query and doing something with them on the Python side, and this would make testing whether that's worth doing super easy and lightweight, rather than needing code that touches a million different places.

Contributor Author:

It's also kind of ugly, but I think we can add lightweight wrappers as needed here (transform functions etc.).
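Building on the hit-mapping line quoted above, such a wrapper could be as small as this hypothetical mapper (keeping the score in metadata is the assumption being tested):

```python
from langchain_core.documents import Document

def hit_to_document(hit: dict) -> Document:
    # Keep Elastic's relevance score alongside the stored metadata so
    # Python-side experiments can use it without touching the query layer.
    return Document(
        page_content=hit["_source"]["text"],
        metadata={**hit["_source"].get("metadata", {}), "score": hit.get("_score")},
    )
```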

@jamesrichards4 jamesrichards4 marked this pull request as ready for review June 27, 2024 10:31
@jamesrichards4 jamesrichards4 merged commit 5081437 into main Jun 27, 2024
6 checks passed