Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/redbox 411 documents refactor #644

Merged
merged 13 commits into from
Jun 25, 2024

Conversation

jamesrichards4
Copy link
Contributor

@jamesrichards4 jamesrichards4 commented Jun 24, 2024

Context

To start using Documents rather than custom chunks. The intention is to replace all use of chunks in the 'AI' code in a second PR following this one

Changes proposed in this pull request

Update the retriever and storagehandler to be able to use chunk and document style stored chunks. This means documents ingested either way are fine.

Worker rewritten to be LCEL based. This should allow moving to Azure embeddings more easily and is in line with the rest of the codebase

Guidance to review

The storage handler currently handles File objects and can track status. To drop the storage handler chunk access we need to find a way to update the API about status. Moving the storage handler to being a file handler only and having the worker update the status makes this quite clean but other ideas welcome on how we make the most of these changes

There are several places logic is duplicated, this has been left as it will be replaced by dropping the storagehandler version of this logic once we can drop chunk storage (and use documents only). This will come when they expire in all environments.

This will be followed by several PRs:

  • One to complete the use of documents and make the retriever use documents from Elastic and pass them through to SourceDocuments to the frontend, this will remove all references to Chunks. The storagehandler logic will also be updated here.
  • Move to Azure Embeddings in ingest and API to remove the need for torch, sentencetransformers etc.
  • Dropping duplicate logic for fields on stored chunks and use just the document structure in the 'AI' side code. These can be mapped to SourceDocuments and other API objects in defined runnables

Relevant links

Things to check

  • I have added any new ENV vars in all deployed environments
  • I have tested any code added or changed
  • I have run integration tests

@gecBurton
Copy link
Collaborator

gecBurton commented Jun 25, 2024

To drop the storage handler chunk access we need to find a way to update the API about status.

It might help to know that we do not need the same level of granularity that we currently have. Users have told us that they arent interested in the uploaded/chunking/embedding/complete status but only want to know if the file is ready/not-ready cc @KevinEtchells

@@ -210,3 +232,21 @@ def get_file_status(self, file_uuid: UUID, user_uuid: UUID) -> FileStatus:
chunk_statuses=chunk_statuses,
processing_status=ProcessingStatusEnum.complete if is_complete else ProcessingStatusEnum.embedding,
)


def hit_to_chunk(hit: Dict[str, Any]) -> Chunk:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice. this "inclusive" approach looks like the best way to start this major refactor.

Copy link
Collaborator

@gecBurton gecBurton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have understood the approach as:

  1. all new text-chunks will be encoded as Documents
  2. on reading text-chunks they will be cast to Chunks regardless of how they we encoded

In which case it all looks sensible to me.

I presume the next stage is to change step (2) to return Documents? Are we thinking that:

  1. we should maintain backwards compatibility for text-chunks encoded as Chunks?
  2. we should write one off migration script
  3. we should accept a breaking change for July

@jamesrichards4
Copy link
Contributor Author

I'll handle the conflicts after #630 goes in. We should merge that first

redbox/models/settings.py Show resolved Hide resolved
@jamesrichards4 jamesrichards4 merged commit afb6254 into main Jun 25, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants