Feature/redbox 411 documents refactor #644

jamesrichards4 · 2024-06-24T17:05:11Z

Context

This PR starts using Documents rather than custom chunks. The intention is to replace all use of chunks in the 'AI' code in a second PR following this one.

Changes proposed in this pull request

Update the retriever and storage handler so they can read both chunk-style and document-style stored chunks. Documents ingested either way continue to work.

The worker is rewritten to be LCEL-based. This should make the move to Azure embeddings easier and is in line with the rest of the codebase.

Guidance to review

The storage handler currently handles File objects and can track status. To drop the storage handler's chunk access, we need to find a way to update the API about status. Moving the storage handler to being a file handler only and having the worker update the status makes this quite clean, but other ideas are welcome on how we make the most of these changes.

There are several places where logic is duplicated. This has been left as is, because the storagehandler version of this logic will be dropped once we can drop chunk storage (and use documents only). That will happen once the existing chunks expire in all environments.

This will be followed by several PRs:

One to complete the use of documents: make the retriever use documents from Elastic and pass them through as SourceDocuments to the frontend. This will remove all references to Chunks. The storagehandler logic will also be updated here.
One to move to Azure embeddings in ingest and the API, removing the need for torch, sentencetransformers etc.
One to drop duplicate logic for fields on stored chunks and use just the document structure in the 'AI' side code. These can be mapped to SourceDocuments and other API objects in defined runnables.

Relevant links

Things to check

I have added any new ENV vars in all deployed environments
I have tested any code added or changed
I have run integration tests

gecBurton · 2024-06-25T06:22:36Z

To drop the storage handler chunk access we need to find a way to update the API about status.

It might help to know that we do not need the same level of granularity that we currently have. Users have told us that they aren't interested in the uploaded/chunking/embedding/complete status; they only want to know whether the file is ready or not ready. cc @KevinEtchells
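A minimal sketch of what that coarser status could look like, for illustration only. The enum members are taken from the statuses listed above, and `is_ready` is a hypothetical helper that is not part of this PR; the real `ProcessingStatusEnum` in redbox may be defined differently.

```python
from enum import Enum


class ProcessingStatusEnum(str, Enum):
    # Member names mirror the statuses named in the comment above;
    # this is a local stand-in, not the project's actual enum.
    uploaded = "uploaded"
    chunking = "chunking"
    embedding = "embedding"
    complete = "complete"


def is_ready(status: ProcessingStatusEnum) -> bool:
    """Collapse the fine-grained status into ready / not ready (hypothetical helper)."""
    return status is ProcessingStatusEnum.complete
```

With this shape, the worker would only need to report one transition (to `complete`) for the frontend to show a file as ready.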

gecBurton · 2024-06-25T06:26:41Z

redbox/storage/elasticsearch.py

@@ -210,3 +232,21 @@ def get_file_status(self, file_uuid: UUID, user_uuid: UUID) -> FileStatus:
            chunk_statuses=chunk_statuses,
            processing_status=ProcessingStatusEnum.complete if is_complete else ProcessingStatusEnum.embedding,
        )
+
+
+def hit_to_chunk(hit: Dict[str, Any]) -> Chunk:


nice. this "inclusive" approach looks like the best way to start this major refactor.
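For readers without the diff open, the "inclusive" read path can be sketched roughly as below. This is an illustration under stated assumptions, not the function merged in this PR: the `Chunk` dataclass is a simplified stand-in, and both stored layouts (top-level fields for chunk-style records, fields nested under `metadata` for document-style records) are assumed rather than taken from the codebase.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Chunk:
    """Simplified stand-in for the project's Chunk model (fields are assumed)."""

    text: str
    parent_file_uuid: str
    index: int


def hit_to_chunk(hit: Dict[str, Any]) -> Chunk:
    """Cast an Elasticsearch hit of either stored shape to a Chunk."""
    source = hit["_source"]
    metadata = source.get("metadata")
    if isinstance(metadata, dict) and "parent_file_uuid" in metadata:
        # Document-style record: fields live under "metadata" (assumed layout).
        return Chunk(
            text=source["text"],
            parent_file_uuid=metadata["parent_file_uuid"],
            index=metadata["index"],
        )
    # Chunk-style record: fields live at the top level (assumed layout).
    return Chunk(
        text=source["text"],
        parent_file_uuid=source["parent_file_uuid"],
        index=source["index"],
    )
```

Because both shapes are normalised at read time, callers downstream of the storage handler do not need to know which way a record was ingested.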

worker/src/loader/file_loader.py

gecBurton

I have understood the approach as:

1. all new text-chunks will be encoded as Documents
2. on reading, text-chunks will be cast to Chunks regardless of how they were encoded

In which case it all looks sensible to me.

I presume the next stage is to change step (2) to return Documents? Are we thinking that:

we should maintain backwards compatibility for text-chunks encoded as Chunks?
we should write a one-off migration script?
we should accept a breaking change for July?

jamesrichards4 · 2024-06-25T08:39:11Z

I'll handle the conflicts after #630 goes in. We should merge that first

redbox/models/settings.py

jamesrichards4 added 7 commits June 24, 2024 14:33

[REDBOX-411] Refactored worker to use langchain components and retriever to allow retrieving both structures

d4f078e

Added langchain community dependency to worker to allow use of langchain embeddings

028ad2e

Ruff

0244625

Removing redundant function from worker

69867d3

Setting chunk sizes for ingest pipeline

00b68ee

Correcting settings breaking tests

695f9aa

Patching new document structured chunks in elasticstoragehandler

a5ec2c7

gecBurton reviewed Jun 25, 2024

worker/src/loader/file_loader.py (resolved)

gecBurton approved these changes Jun 25, 2024

jamesrichards4 added 2 commits June 25, 2024 08:04

Merged main to fix poetry conflicts

d870c7a

Updated poetry lock

e41a0f5

jamesrichards4 temporarily deployed to release June 25, 2024 08:27 — with GitHub Actions Inactive

jamesrichards4 added 2 commits June 25, 2024 08:29

Ruff

c2b0d06

Correcting file loader not returning a generator

ddb37c6

Ruff

11fdfc9

lmwilkigov approved these changes Jun 25, 2024

redbox/models/settings.py (resolved)

Merged latest main

1f22653

jamesrichards4 temporarily deployed to release June 25, 2024 09:30 — with GitHub Actions Inactive

jamesrichards4 merged commit afb6254 into main Jun 25, 2024
7 checks passed
