Remove S3 dependency from chunking process #695

Merged
5 commits merged from feature/chunker_des3ed into main on Jul 1, 2024

Conversation

@wpfl-dbt (Collaborator) commented on Jul 1, 2024

Context

In order to regularly test the AI in the system, we need:

  • To import and use the chunking functions at the highest level we can
  • To avoid mocking or simulating unnecessary dependencies

For these reasons, this PR removes S3 from the chunking process, passing in a bytestream instead. In production this just means downloading from S3 higher up the process; for testing, it means we can use the chunking code directly without needing MinIO/S3 in the picture.
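
Concretely, the new shape is something like this sketch (BytesIO and download_fileobj appear in the actual diff below; chunk_file and the fixture path are hypothetical stand-ins for the real chunking entrypoint):

    from io import BytesIO

    def load_and_chunk(s3_client, file):
        # Production path: the caller fetches the object; the chunker only sees bytes
        file_raw = BytesIO()
        s3_client.download_fileobj(Bucket=file.bucket, Key=file.key, Fileobj=file_raw)
        file_raw.seek(0)  # rewind before handing the stream on
        return chunk_file(file_raw)  # hypothetical chunking entrypoint

    # Test path: no S3/MinIO needed, just a local fixture
    with open("tests/data/example.pdf", "rb") as f:  # hypothetical fixture path
        chunks = chunk_file(BytesIO(f.read()))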

Changes proposed in this pull request

  • Refactors chunking so the S3 download happens outside it and the opened file is passed in
  • Removes dead redbox.parsing code and tests
  • Updates quickupload.ipynb so it better reflects code in the worker

Guidance to review

  • Check you're happy with the refactor
  • Confirm all unit tests pass

@jamesrichards4 (Contributor) left a comment


Love the idea, just one comment on whether we can avoid downloading the file. We should think about cleaning the files up if we've pulled them to the worker.

@@ -66,7 +67,10 @@ async def lifespan(context: ContextRepo):
 def document_loader(s3_client: S3Client, env: Settings):
     @chain
     def wrapped(file: File):
-        return UnstructuredDocumentLoader(file, s3_client, env).lazy_load()
+        file_raw = BytesIO()
+        s3_client.download_fileobj(Bucket=file.bucket, Key=file.key, Fileobj=file_raw)
@jamesrichards4 (Contributor) commented:

Can we pass the 'Body' field of get_object to the document loader to avoid this download? It's also a byte stream, so that should work?
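
For reference, the suggestion amounts to something like this (a sketch; get_object and its 'Body' field are standard boto3, the rest is illustrative):

    obj = s3_client.get_object(Bucket=file.bucket, Key=file.key)
    body = obj["Body"]  # botocore StreamingBody: streams bytes straight from S3
    # idea: hand `body` to the document loader directly, skipping the BytesIO buffering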

@wpfl-dbt (Collaborator, Author) replied:

I think I'm maybe using the incorrect terminology somewhere -- it lacks a .seek() method and throws an error. For unstructured, I guess it specifically needs to be an io.BytesIO stream, not just any stream of bytes?
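
A minimal illustration of the failure mode (assuming current boto3/botocore behaviour, where the StreamingBody returned by get_object reads sequentially but is not seekable):

    from io import BytesIO

    body = s3_client.get_object(Bucket=file.bucket, Key=file.key)["Body"]
    body.read(4)   # fine: sequential reads work
    body.seek(0)   # errors: StreamingBody has no usable seek, which unstructured needs

    # Buffering the whole object into memory yields a seekable stream
    file_raw = BytesIO(s3_client.get_object(Bucket=file.bucket, Key=file.key)["Body"].read())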

@wpfl-dbt (Collaborator, Author) commented on Jul 1, 2024:

Yeah, I've read a few bits and bobs. There are ways to make what comes back from boto3 seekable, but they look like they'd introduce fragility for not much gain, imo.

@jamesrichards4 (Contributor) replied:

Ah ok, yeah, in which case let's pull it. We should think about cleaning these files up at some point, but maybe we don't need to worry in this release. Restarting/scaling the workers will clear out all files anyway, so it should be ok for a bit.

@jamesrichards4 merged commit 64862c8 into main on Jul 1, 2024. 3 checks passed.
@jamesrichards4 deleted the feature/chunker_des3ed branch on Jul 1, 2024 at 09:02.