
Feature/redbox 411 chunk refactor in api #673
Merged (10 commits, Jun 27, 2024)

Conversation

@jamesrichards4 (Contributor) commented Jun 26, 2024

Context

There is an assumption throughout the code that there is a file and a set of chunks. This will not be true long term. This refactor removes the strong link between files and chunks and tries to use Langchain retrievers to interact with chunks where possible.

Changes proposed in this pull request

  • Use retrievers for all chunk interactions (see below for why this is a lie)
  • Avoid use of the Chunk class in 'AI' code; chains and runnables handle Documents, which lines up with Langchain conventions
  • Added additional mapping functions for document management vs chunks/source documents
  • Explicitly set the file processing status on the file in Elasticsearch. This means we can update the file status without the API inferring it from chunks, which will help if/when we have multiple sets of chunks for additional processing beyond a single index of chunks. We now return processing and complete statuses to Django (see the sketch after this list).
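To make the explicit status write concrete, here is a minimal sketch, assuming an elasticsearch-py client and hypothetical index/field names (redbox-data-file, ingest_status); the PR's actual implementation may differ:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # connection details are illustrative

def set_file_status(file_uuid: str, status: str) -> None:
    """Write the processing status directly onto the file record so the
    API no longer has to infer it from the file's chunks."""
    es.update(
        index="redbox-data-file",       # assumed index name
        id=file_uuid,
        doc={"ingest_status": status},  # assumed field name
    )

# e.g. set_file_status(file_uuid, "processing") at ingest start,
# then set_file_status(file_uuid, "complete") once embedding finishes.
```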

Guidance to review

The StorageHandler class is still needed for file interactions; this is clean and makes sense (particularly because we may want to make more use of this class). It is also needed to manage returning chunks and statuses in the API. File status provides a list of chunk statuses; this will need revisiting, and removing it could allow further clean-up.

The chunk clustering code is still present in case we want to restore that functionality (it is now only a code skeleton as it hasn't been mapped to the new Document setup). We could remove all of redbox.parsing if we decide not to do this clustering. Dropping this would be my preference!

There is a lot of messy document mapping happening. Tracing it through, it is caused by the rechunking in the summarisation chains. If we created summarisation chunks at ingest time (a second worker) we could avoid all of this, and then only the retrievers would need to understand the document structure.

The new retrievers need tests; I'll add these on this PR, but I wanted to raise it now to get thoughts on the structure.

Things to check

There's a lot of messing around in the retrievers to handle old-style documents. This should allow us to maintain compatibility with documents ingested by the old worker (chunk style). Worth noting, and also worth confirming that this works in dev!

Can Django receive the new file status of 'processing'? I imagine not?

  • I have added any new ENV vars in all deployed environments
  • I have tested any code added or changed
  • I have run integration tests

chunking = "chunking"
embedding = "embedding"
processing = "processing"
failed = "failed"
Collaborator:

This is going to have implications for the frontend, cc @brunns

Contributor Author:

I suspected so; there's an enum in the Django app, but I wanted to check with those who actually understand that app!

Collaborator:

We'd need to do this with 3 migrations - one to add the new statuses, one to update files using the old ones to the new ones, and one to remove the old statuses. Plus we'd need to fix the document and chat views.

Shall I jump on to this branch and make those changes?
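For illustration, the middle of those three migrations might look like this minimal sketch (app, model, and status names are assumptions, not the real schema):

```python
from django.db import migrations

# Assumed mapping from the old statuses to the new ones.
OLD_TO_NEW = {
    "chunking": "processing",
    "embedding": "processing",
}

def remap_statuses(apps, schema_editor):
    File = apps.get_model("redbox_core", "File")  # app/model names assumed
    for old, new in OLD_TO_NEW.items():
        File.objects.filter(status=old).update(status=new)

class Migration(migrations.Migration):
    dependencies = [("redbox_core", "0002_add_new_statuses")]  # placeholder
    operations = [
        migrations.RunPython(remap_statuses, migrations.RunPython.noop),
    ]
```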

@gecBurton (Collaborator), Jun 27, 2024:

For the time being, could we just make this a string in Django, @brunns? This significant refactor isn't over yet, and:

  1. there could be more changes
  2. I'd like to decouple the front and back ends

@brunns (Collaborator), Jun 27, 2024:

We could do, but we base a bit of behaviour on this value; we don't just display it. Values for this that Django doesn't know about will probably cause problems, and I think the cause will be more obvious if we're using an enum.

We can certainly decouple core-api and django-app a bit, though, by agreeing on the message. Only 3 values are really useful here: processing, ready, and it's-gone-pete-tong (i.e. failed). It feels like that will be pretty future-proof. If we align on that, what core-api and django-app do internally can be their own business.

Contributor Author:

The behaviour is presumably when to enable chat on it? It has to be 'complete' for that, I guess? It's only the processing status that's new, so can't we leave the various statuses in place? Aren't parsing, chunking, embedding, and now processing all the same to the frontend?

Collaborator:

Basically, yes. So if we could not send core's internal idea of the status, but just send one of three values, that would isolate the two services.

If we continue to expose core's internal status, we'll keep needing to change core and django at the same time. Let's say you added, I don't know, munching as a status: we'd need a change on the Django side to know what to do with it.
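To make the proposed decoupling concrete, here is a sketch of the translation layer being suggested; the internal status names come from the enum above, while the three external values follow @brunns's list (exact naming is an assumption):

```python
# core-api keeps its internal statuses; Django only ever sees one of three
# agreed values, so adding "munching" internally needs no frontend change.
INTERNAL_TO_EXTERNAL = {
    "chunking":   "processing",
    "embedding":  "processing",
    "processing": "processing",
    "complete":   "ready",
    "failed":     "failed",
}

def external_status(internal: str) -> str:
    # Unknown internal statuses degrade safely rather than breaking Django.
    return INTERNAL_TO_EXTERNAL.get(internal, "processing")
```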

last_chunk = chunks[-1]
# Backwards compatible with worker which didn't store token_count
# Everything is Optional (may be None) so we have to check everything
# This will all be rolled up into a SummarisationChunkRetriever or similar
# in future work, removing this ugliness
Collaborator:

thx for explanation 👍
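For context, the defensive check being discussed might look roughly like this (a sketch only; field names are assumptions rather than the PR's exact code):

```python
from langchain_core.documents import Document

def chunk_token_count(chunk: Document) -> int:
    """Old-worker chunks may lack metadata, or have metadata without
    token_count, so every level of the lookup must tolerate None."""
    metadata = chunk.metadata or {}
    count = metadata.get("token_count")
    return count if count is not None else 0
```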

file_uuid=file_uuid,
# We need to break the link between file status and a specific set of chunks
# to enable future work with many chunks or other indices etc
chunk_statuses=[],
Collaborator:

I'm wondering if we shouldn't just get rid of chunk status altogether?

Contributor Author:

That's my feeling too: file-level status for the whole ingest, or maybe for individual indices, feels nicer if we add more stages. For now this feels like the safest change?

@gecBurton (Collaborator) previously approved these changes Jun 27, 2024 and left a comment:

TL;DR: I like the approach.

I've added some nitpicking, but this can be ignored; better to press ahead to the next stage and tidy up later. I see there are some empty files added too.

Very keen to have @brunns's thoughts on this to spot breaking changes to the front end.

Additional thoughts:

> This should allow us to maintain compatibility with documents ingested by the old worker (chunk style).

All data gets deleted after 30 days; we can just wait that amount of time and then remove this code.

> remove all of redbox.parsing

Yes.

@gecBurton dismissed their stale review June 27, 2024 07:14 ("too early, still in draft!")

@wpfl-dbt (Collaborator) left a comment:

I'm always impressed by your code James, it's really something. I don't think I've thrown any radical spanners in the works, but hopefully given some challenge that galvanises your overall direction.

)


# This should be unnecessary and indicates we're not chunking correctly
Collaborator:

I don't know Langchain/Elastic well enough to understand why. When you test this you show a created date (correctly) merging to the earlier of the two. How is that done implicitly where other metadata fields need rules? Why wasn't it turned into a list of values, or the later of the two used?

Contributor Author:

I don't love this, but as the new document's metadata is copied from the first doc passed in ('a'), all values are implicitly the same. Values which want a combined result are then explicitly mapped. This creates a weird 'no mapping' for some values in that function, which is harder to parse than being completely explicit.
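A minimal sketch of that merge behaviour (field names are hypothetical): metadata starts as a copy of the first document's, so everything is implicitly "the same", and only fields wanting a combined value get explicit rules.

```python
def merge_metadata(a: dict, b: dict) -> dict:
    merged = dict(a)  # implicit rule: every field defaults to a's value
    # explicit rules only where a combined value is wanted
    merged["created_datetime"] = min(a["created_datetime"], b["created_datetime"])
    merged["token_count"] = (a.get("token_count") or 0) + (b.get("token_count") or 0)
    return merged
```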

document_loader(s3_client=s3_client, env=env)
| RunnableLambda(list)
| RunnableLambda(vectorstore.add_documents)
).invoke(file)
Collaborator:

As this is an async function, should/could this be one of the async calls, like .ainvoke()?

Contributor Author:

Yes, you're right, and the vectorstore supports async too (aadd_documents), so we could probably improve this by going that way. I'll bump that to another ticket though, unless you have strong feelings?

NOTE: in general I'm not sure where to record stuff like this; product seem to manage our tickets, and this is internal so not really on them. Do we just slap stuff on the backlog and make sure it gets pulled?
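If we did pick that up, the async shape might be as small as this sketch (reusing document_loader, s3_client, env, and vectorstore from the snippet above; aadd_documents and ainvoke are Langchain's standard async counterparts):

```python
from langchain_core.runnables import RunnableLambda

async def ingest(file):
    # Same pipeline stages, but writing to the vectorstore with its native
    # async method and invoking the chain asynchronously.
    return await (
        document_loader(s3_client=s3_client, env=env)
        | RunnableLambda(list)
        | RunnableLambda(vectorstore.aadd_documents)
    ).ainvoke(file)
```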

worker/src/app.py (resolved)
@@ -0,0 +1,90 @@
from functools import partial
from typing import Any, TypedDict, Dict
Collaborator:

This is such a minor point in the scheme of things, but from Python 3.9 dict and list are functionally the same as their counterparts in typing.

Contributor Author:

oh that is good to know!

Collaborator:

I believe that make format will fix this for you automatically?
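For reference, the 3.9+ form being suggested looks like this (an illustrative snippet, not code from this PR):

```python
from typing import Any, TypedDict  # Dict import no longer needed

class ESParams(TypedDict):  # hypothetical, named after the dicts discussed below
    size: int
    filters: dict[str, Any]   # builtin dict as a generic, valid since 3.9
    file_uuids: list[str]     # builtin list likewise
```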

Comment on lines +1 to +2
from .all_chunks import get_all_chunks_query
from .parameterised import ParameterisedElasticsearchRetriever
Collaborator:

How would you feel about reorganising the retriever folder into:

  • queries.py, with both query functions (get_all_chunks_query, parameterised_body_func) and the typed dicts that define them (ESQuery, ESParams)
  • retrievers.py, which imports the queries and defines the parameterised subclass

Thinking about it this way helped me understand how this all fitted together a bit better, up to you though!

Contributor Author:

I don't like the layout I've ended up with, for sure, and your suggestion is much cleaner. I'd suggest we think about this when we add new ones, and think more about the summarisation/large-chunk retriever, as there might be significant work here again?

metadata=hit['_source'].get('metadata', {})
)

class ParameterisedElasticsearchRetriever(ElasticsearchRetriever):
Collaborator:

Here's my nagging feeling: now we work with Documents, do we need to parameterise at this low level? Can we use the higher level Langchain classes? After consideration I think I'm still a yes.

The thing that sways me is that if you read the docs for the Elasticsearch store class and the Elasticsearch retriever class, the simple fact is that both of them require you to write Elastic query objects to get anything nuanced done. Going high level would still need us to write stuff like this.

The only thing I've seen that going high level would buy is that dropping subscription-only features locally and turning them on remotely (hybrid) is a LOT easier. But, big picture, I think it's quicker and easier for us to bolt that onto what we have than refactor the entire thing to go high level.

Contributor Author:

I agree that we should keep this. I see the advantage in two main parts: we can start using the high-level 'combination' retrievers easily now [citation needed] (ensemble, contextual compression), and we can start adding in others where appropriate and see what happens. This low-level one lets us flex our Elastic knowledge though, and we'll likely always want that kind of control, I think.
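As an illustration of the control this low-level approach buys, here is a hedged sketch of a parameterised Elastic query body in the style of Langchain's ElasticsearchRetriever body_func (all field names are assumptions, not the PR's real query):

```python
def parameterised_body_func(
    query: str, *, user_uuid: str, file_uuids: list[str], size: int = 5
) -> dict:
    # Full control over the query body: filters, sizes, and any nuance the
    # high-level store/retriever classes won't express for us.
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"text": query}}],
                "filter": [
                    {"term": {"creator_user_uuid.keyword": user_uuid}},
                    {"terms": {"parent_file_uuid.keyword": file_uuids}},
                ],
            }
        },
    }
```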

"sources": retriever,
}
| {
"response": prompt | llm | StrOutputParser(),
"sources": itemgetter("sources"),
}
)


def resize_documents(max_tokens: int | None = None) -> list[Document]:
Collaborator:

This is the other place where I wonder about the high/low level difference. I've never used the ParentDocumentRetriever but it seems to do the same thing. Are we reinventing the wheel, or avoiding a part of Langchain that's more trouble than it's worth?

Contributor Author:

We're 100% reinventing the wheel here, and I want to think about a new ingest pipeline which creates these larger chunks. I think Langchain can help here with the ParentDocumentRetriever, as you say, so that'd be good to look at. We can remove so much faffing around with docs if we get that working.
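For the record, the wheel being reinvented is roughly this (a sketch assuming a token_count metadata field; the PR's actual resize_documents differs):

```python
from langchain_core.documents import Document

def resize(docs: list[Document], max_tokens: int) -> list[Document]:
    """Greedily merge consecutive chunks into larger documents of up to
    max_tokens, keeping the first chunk's metadata for each group."""
    resized: list[Document] = []
    buffer: list[Document] = []
    tokens = 0
    for doc in docs:
        doc_tokens = doc.metadata.get("token_count", 0)
        if buffer and tokens + doc_tokens > max_tokens:
            resized.append(Document(
                page_content=" ".join(d.page_content for d in buffer),
                metadata=buffer[0].metadata,
            ))
            buffer, tokens = [], 0
        buffer.append(doc)
        tokens += doc_tokens
    if buffer:
        resized.append(Document(
            page_content=" ".join(d.page_content for d in buffer),
            metadata=buffer[0].metadata,
        ))
    return resized
```

This is exactly the territory ParentDocumentRetriever covers, which is why creating the larger chunks at ingest time looks attractive.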

"index": 12,
"page_number": 3,
"languages": [],
# "link_texts": [],
Collaborator:

If I'm understanding this correctly, this is one of the most important parts of this test. It says: your metadata remains as fluid as Langchain intended, and we can still work with it, combine it, etc.

I think this is extremely important. I've been thinking about returning Elastic's scores as part of the query and doing something with them on the Python side, and this would make testing whether that's worth doing super easy and lightweight, rather than needing code that touches a million different places.

Contributor Author:

It's also kind of ugly, but I think we can add lightweight wrappers as needed here (transform functions etc.).
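Building on the hit-mapping line quoted above, such a wrapper could be as small as this hypothetical mapper (keeping the score in metadata is the assumption being tested):

```python
from langchain_core.documents import Document

def hit_to_document(hit: dict) -> Document:
    # Keep Elastic's relevance score alongside the stored metadata so
    # Python-side experiments can use it without touching the query layer.
    return Document(
        page_content=hit["_source"]["text"],
        metadata={**hit["_source"].get("metadata", {}), "score": hit.get("_score")},
    )
```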

@jamesrichards4 jamesrichards4 marked this pull request as ready for review June 27, 2024 10:31
@jamesrichards4 jamesrichards4 merged commit 5081437 into main Jun 27, 2024
6 checks passed