Remove S3 dependency from chunking process #695
Conversation
Love the idea, just one comment on whether we can avoid downloading the file. We should think about cleaning them up if we've pulled them to the worker
@@ -66,7 +67,10 @@ async def lifespan(context: ContextRepo):
 def document_loader(s3_client: S3Client, env: Settings):
     @chain
     def wrapped(file: File):
-        return UnstructuredDocumentLoader(file, s3_client, env).lazy_load()
+        file_raw = BytesIO()
+        s3_client.download_fileobj(Bucket=file.bucket, Key=file.key, Fileobj=file_raw)
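For orientation, here is a rough sketch of what the full wrapped loader might look like after this change. The `seek(0)` call, the final `return wrapped`, and the `UnstructuredDocumentLoader(file_raw, env)` signature are assumptions, since the hunk above only shows the added download lines.

```python
from io import BytesIO

from langchain_core.runnables import chain


def document_loader(s3_client, env):
    @chain
    def wrapped(file):
        # Pull the whole object into an in-memory buffer so the loader
        # no longer needs its own S3 client.
        file_raw = BytesIO()
        s3_client.download_fileobj(Bucket=file.bucket, Key=file.key, Fileobj=file_raw)
        file_raw.seek(0)  # assumption: rewind so the loader reads from the start
        # assumption: the loader now takes the byte stream instead of (file, s3_client)
        return UnstructuredDocumentLoader(file_raw, env).lazy_load()

    return wrapped
```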
can we pass the 'Body' field of get_object to the document loader to avoid this download? It's also a byte stream, so that should work?
I think I'm maybe using the incorrect terminology somewhere -- it lacks a .seek() method and throws an error. For unstructured I guess it specifically needs to be an io.BytesIO stream, not just any stream of bytes?
Yeah I've read a few bits and bobs. There's ways to make what comes back from boto3 seekable but they look like they'd introduce fragility for not much gain, imo.
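A minimal sketch of the two approaches being weighed here, assuming standard boto3 calls and placeholder bucket/key names:

```python
from io import BytesIO

import boto3

s3_client = boto3.client("s3")

# Option A: get_object returns a StreamingBody. It can be read once,
# but it is not seekable, which is what unstructured trips over.
body = s3_client.get_object(Bucket="my-bucket", Key="example.pdf")["Body"]
data = body.read()
# body.seek(0)  # fails: the streaming body does not support seeking

# Option B: download into an in-memory BytesIO, which supports random access.
buffer = BytesIO()
s3_client.download_fileobj(Bucket="my-bucket", Key="example.pdf", Fileobj=buffer)
buffer.seek(0)  # safe: BytesIO is fully seekable
```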
ah ok yeah, in which case let's pull it. We should think about cleaning these files up at some point, but maybe don't need to worry in this release. Restarting/scaling the workers will clear out all files anyway, so should be ok for a bit
Context
In order to regularly test the AI in the system, we need:
For these reasons, this PR removes S3 from the chunking process and instead passes in a bytestream. In production this just means downloading from S3 higher up in the process. For testing, it means we can use the chunking code directly without needing MinIO/S3 in the picture.
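As an illustration of what this enables for tests, something along these lines becomes possible. The loader signature and fixture path are assumptions, mirroring the sketch above rather than the actual test code in this PR.

```python
from io import BytesIO


def test_chunking_without_s3(env):
    # No MinIO/S3 fixture needed: read a local file into a byte stream
    # and hand it straight to the chunking code.
    with open("tests/data/example.pdf", "rb") as f:
        file_raw = BytesIO(f.read())

    chunks = list(UnstructuredDocumentLoader(file_raw, env).lazy_load())
    assert chunks  # the loader yields at least one chunk
```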
Changes proposed in this pull request
- redbox.parsing code and tests
- quickupload.ipynb, so it better reflects code in the worker

Guidance to review
Things to check