Remove S3 dependency from chunking process #695

Merged
5 commits merged from feature/chunker_des3ed into main on Jul 1, 2024

Conversation

@wpfl-dbt (Collaborator) commented on Jul 1, 2024

Context

In order to regularly test the AI in the system, we need:

  • To import and use the chunking functions at the highest level we can
  • To avoid mocking or simulating unnecessary dependencies

For these reasons, this PR removes S3 from the chunking process, passing in a bytestream instead. In production this just means downloading from S3 higher up the process; for testing, it means we can use the chunking code directly without needing MinIO/S3 in the picture.
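
Concretely, the new shape is something like this sketch (BytesIO and download_fileobj appear in the actual diff below; chunk_file and the fixture path are hypothetical stand-ins for the real chunking entrypoint):

    from io import BytesIO

    def load_and_chunk(s3_client, file):
        # Production path: the caller fetches the object; the chunker only sees bytes
        file_raw = BytesIO()
        s3_client.download_fileobj(Bucket=file.bucket, Key=file.key, Fileobj=file_raw)
        file_raw.seek(0)  # rewind before handing the stream on
        return chunk_file(file_raw)  # hypothetical chunking entrypoint

    # Test path: no S3/MinIO needed, just a local fixture
    with open("tests/data/example.pdf", "rb") as f:  # hypothetical fixture path
        chunks = chunk_file(BytesIO(f.read()))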

Changes proposed in this pull request

  • Refactors chunking so the S3 download happens outside it and the opened file is passed in
  • Removes dead redbox.parsing code and tests
  • Updates quickupload.ipynb so it better reflects code in the worker

Guidance to review

  • Check you're happy with the refactor
  • Confirm all unit tests pass

@jamesrichards4 (Contributor) left a comment


Love the idea, just one comment on whether we can avoid downloading the file. We should think about cleaning the files up if we've pulled them to the worker.

@@ -66,7 +67,10 @@ async def lifespan(context: ContextRepo):
 def document_loader(s3_client: S3Client, env: Settings):
     @chain
     def wrapped(file: File):
-        return UnstructuredDocumentLoader(file, s3_client, env).lazy_load()
+        file_raw = BytesIO()
+        s3_client.download_fileobj(Bucket=file.bucket, Key=file.key, Fileobj=file_raw)
@jamesrichards4 (Contributor) commented:

Can we pass the 'Body' field of get_object to the document loader to avoid this download? It's also a byte stream, so that should work?
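
For reference, the suggestion amounts to something like this (a sketch; get_object and its 'Body' field are standard boto3, the rest is illustrative):

    obj = s3_client.get_object(Bucket=file.bucket, Key=file.key)
    body = obj["Body"]  # botocore StreamingBody: streams bytes straight from S3
    # idea: hand `body` to the document loader directly, skipping the BytesIO buffering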

@wpfl-dbt (Collaborator, Author) replied:

I think I'm maybe using the incorrect terminology somewhere -- it lacks a .seek() method and throws an error. For unstructured, I guess it specifically needs to be an io.BytesIO stream, not just any stream of bytes?
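
A minimal illustration of the failure mode (assuming current boto3/botocore behaviour, where the StreamingBody returned by get_object reads sequentially but is not seekable):

    from io import BytesIO

    body = s3_client.get_object(Bucket=file.bucket, Key=file.key)["Body"]
    body.read(4)   # fine: sequential reads work
    body.seek(0)   # errors: StreamingBody has no usable seek, which unstructured needs

    # Buffering the whole object into memory yields a seekable stream
    file_raw = BytesIO(s3_client.get_object(Bucket=file.bucket, Key=file.key)["Body"].read())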

@wpfl-dbt (Collaborator, Author) commented on Jul 1, 2024:

Yeah, I've read a few bits and bobs. There are ways to make what comes back from boto3 seekable, but they look like they'd introduce fragility for not much gain, imo.

@jamesrichards4 (Contributor) replied:

Ah ok, yeah, in which case let's pull it. We should think about cleaning these files up at some point, but maybe we don't need to worry in this release. Restarting/scaling the workers will clear out all files anyway, so it should be ok for a bit.

@jamesrichards4 merged commit 64862c8 into main on Jul 1, 2024. 3 checks passed.
@jamesrichards4 deleted the feature/chunker_des3ed branch on Jul 1, 2024 at 09:02.