Feature/redbox 411 documents refactor #644
…ver to allow retrieving both structures
It might help to know that we do not need the same level of granularity that we currently have. Users have told us that they aren't interested in the uploaded/chunking/embedding/complete status; they only want to know whether the file is ready or not ready. cc @KevinEtchells
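The simplification suggested in this comment could be sketched as a single mapping from the granular states to a two-value one. This is only a minimal sketch, not the project's code: the enum below is a stand-in whose members follow the states named in the comment, and `Readiness` and `to_readiness` are hypothetical names introduced for illustration.

```python
from enum import Enum


class ProcessingStatusEnum(str, Enum):
    # Stand-in for the project's enum; members follow the states named in the comment.
    uploaded = "uploaded"
    chunking = "chunking"
    embedding = "embedding"
    complete = "complete"


class Readiness(str, Enum):
    # Hypothetical two-state status that would be shown to users.
    ready = "ready"
    not_ready = "not-ready"


def to_readiness(status: ProcessingStatusEnum) -> Readiness:
    """Collapse the granular pipeline status into ready / not-ready."""
    if status is ProcessingStatusEnum.complete:
        return Readiness.ready
    # Every state before completion is reported the same way.
    return Readiness.not_ready
```

Keeping the granular enum internally and collapsing it only at the API boundary would let the worker keep its detailed states while users see just the two they asked for.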
redbox/storage/elasticsearch.py
@@ -210,3 +232,21 @@ def get_file_status(self, file_uuid: UUID, user_uuid: UUID) -> FileStatus:
            chunk_statuses=chunk_statuses,
            processing_status=ProcessingStatusEnum.complete if is_complete else ProcessingStatusEnum.embedding,
        )


def hit_to_chunk(hit: Dict[str, Any]) -> Chunk:
nice. this "inclusive" approach looks like the best way to start this major refactor.
I have understood the approach as:
1. all new text-chunks will be encoded as Documents
2. on reading, text-chunks will be cast to Chunks regardless of how they were encoded
In which case it all looks sensible to me.
I presume the next stage is to change step (2) to return Documents? Are we thinking that:
- we should maintain backwards compatibility for text-chunks encoded as Chunks?
- we should write a one-off migration script
- we should accept a breaking change for July
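The inclusive read path described in this comment might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's actual `hit_to_chunk`: the simplified `Chunk` dataclass and the field names `text`, `parent_file_uuid` and `index` are hypothetical, and the Document-style layout is assumed to keep its text in `page_content` with everything else under `metadata`.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Chunk:
    # Simplified stand-in for the project's Chunk model (fields are assumptions).
    uuid: str
    parent_file_uuid: str
    index: int
    text: str


def hit_to_chunk(hit: Dict[str, Any]) -> Chunk:
    """Cast a search hit to a Chunk, whichever way it was encoded."""
    source = hit["_source"]
    if "page_content" in source:
        # Document-style record: text in page_content, the rest under metadata.
        meta = source.get("metadata", {})
        return Chunk(
            uuid=hit["_id"],
            parent_file_uuid=meta["parent_file_uuid"],
            index=meta["index"],
            text=source["page_content"],
        )
    # Legacy Chunk-style record: fields stored at the top level.
    return Chunk(
        uuid=hit["_id"],
        parent_file_uuid=source["parent_file_uuid"],
        index=source["index"],
        text=source["text"],
    )
```

Branching on the presence of one key keeps both encodings readable without a migration, at the cost of carrying the legacy branch until the old records are gone.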
I'll handle the conflicts after #630 goes in. We should merge that first.
Context
This PR starts using Documents rather than custom chunks. The intention is to replace all use of chunks in the 'AI' code in a second PR following this one.
Changes proposed in this pull request
Update the retriever and storage handler so they can read both chunk-style and document-style stored chunks. This means documents ingested either way are fine.
Rewrite the worker to be LCEL-based. This should make moving to Azure embeddings easier and is in line with the rest of the codebase.
Guidance to review
The storage handler currently handles File objects and can track status. To drop the storage handler's chunk access we need a way to update the API about status. Making the storage handler a file handler only and having the worker update the status makes this quite clean, but other ideas are welcome on how to make the most of these changes.
Logic is duplicated in several places. This has been left as is because the storage handler version of this logic will be dropped once we can drop chunk storage (and use documents only). That will happen when the stored chunks expire in all environments.
This will be followed by several PRs:
Relevant links
Things to check