Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add file utils #745

Merged
merged 6 commits into from
Apr 29, 2024
Merged

Add file utils #745

merged 6 commits into from
Apr 29, 2024

Conversation

collindutter
Copy link
Member

Adds file utils to fill void left by removal of file loading logic in Loaders.

Specifically relevant in doc examples that use load_collection.

sources = ["tests/resources/test.txt"]
files = utils.load_files(sources)
loader = TextLoader(max_tokens=MAX_TOKENS, embedding_driver=MockEmbeddingDriver())
collection = loader.load_collection(list(files.values()))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if load_files should just return a list so it's easier to pass into load_collection? Don't think we can load the files concurrently if we change that though.

Copy link
Contributor

@dylanholmes dylanholmes Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could have two functions, something like load_file_collection and load_file_list. Then you can identify individual files in the response if you need to, otherwise the load_file_list will be more convenient. The implementation of load_file_list could just call load_file_collection (obviously).

Another more heavy handed alternative could be to allow passing a dict[str, bytes] into load_collection in addition to list[bytes] for each of the relevant loaders. A pro to this option is that you could even use the original key passed in to the input, so the client needed to map from a file name to the file contents, they wouldn't need to do two lookups.

Currently:

sources = [ 'foo.txt' ]
content_by_filename_hash = utils.load_files(sources)
artifacts_by_content_hash = loader.load_collection(list(files_by_filename.values()))

foo_content = content_by_filename_hash[hash_from_str('foo.txt')]
foo_artifact = artifacts_by_content_hash[loader.to_key(foo_content)]

With suggestion:

sources = [ 'foo.txt' ]
content_by_filename_hash = utils.load_files(sources)
artifacts_by_filename_hash = loader.load_collection(content_by_filename_hash)

# One less lookup if `load_collection` reuses keys when passed a dict
foo_artifact = artifacts_by_filename_hash[hash_from_str('foo.txt')]

If the input is the same shape as the output of load_collection, then it'd make it easier to chain together loaders in general, however I can't think of a use case besides this one.


Another alternative that I'm not necessarily advocating for due to increase in scope, but we could make loaders composable and re-introduce file loader.

dylanholmes
dylanholmes previously approved these changes Apr 16, 2024
Copy link
Contributor

@dylanholmes dylanholmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

Approving since the core use case is solved, but added some alternative suggestions for you to consider at your discretion.

sources = ["tests/resources/test.txt"]
files = utils.load_files(sources)
loader = TextLoader(max_tokens=MAX_TOKENS, embedding_driver=MockEmbeddingDriver())
collection = loader.load_collection(list(files.values()))
Copy link
Contributor

@dylanholmes dylanholmes Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could have two functions, something like load_file_collection and load_file_list. Then you can identify individual files in the response if you need to, otherwise the load_file_list will be more convenient. The implementation of load_file_list could just call load_file_collection (obviously).

Another more heavy handed alternative could be to allow passing a dict[str, bytes] into load_collection in addition to list[bytes] for each of the relevant loaders. A pro to this option is that you could even use the original key passed in to the input, so the client needed to map from a file name to the file contents, they wouldn't need to do two lookups.

Currently:

sources = [ 'foo.txt' ]
content_by_filename_hash = utils.load_files(sources)
artifacts_by_content_hash = loader.load_collection(list(files_by_filename.values()))

foo_content = content_by_filename_hash[hash_from_str('foo.txt')]
foo_artifact = artifacts_by_content_hash[loader.to_key(foo_content)]

With suggestion:

sources = [ 'foo.txt' ]
content_by_filename_hash = utils.load_files(sources)
artifacts_by_filename_hash = loader.load_collection(content_by_filename_hash)

# One less lookup if `load_collection` reuses keys when passed a dict
foo_artifact = artifacts_by_filename_hash[hash_from_str('foo.txt')]

If the input is the same shape as the output of load_collection, then it'd make it easier to chain together loaders in general, however I can't think of a use case besides this one.


Another alternative that I'm not necessarily advocating for due to increase in scope, but we could make loaders composable and re-introduce file loader.

@collindutter
Copy link
Member Author

Also just learned a new python syntax, maybe these changes aren't needed?

with open("tests/resources/cities.csv", "r") as cities, open("tests/resources/addresses.csv", "r") as addresses:
    CsvLoader().load_collection([cities.read(), addresses.read()])

A dictionary where the keys are a hash of the path and the values are the content of the files.
"""

futures_executor = futures.ThreadPoolExecutor()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this into a method parameter with a default value?

Copy link
Member Author

@collindutter collindutter Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can, but do we still want this PR even with the discussed syntax above? You can see it live here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha! Does it parallelize open?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to find a first-party resource that discusses the concurrency. But all mentions of the syntax online seem to suggest it is.

Even still, it might be nice to provide these utils as a slightly less painful migration path for users.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up adding file utils to docs too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously used test.txt is generated by other tests and is not a reliable source.

andrewfrench
andrewfrench previously approved these changes Apr 16, 2024
@collindutter
Copy link
Member Author

Bump on this question

vasinov
vasinov previously approved these changes Apr 16, 2024
"""Load multiple files concurrently and return a dictionary of their content.

Args:
paths: The paths to the files to load.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arg doc for futures_executor is missing.

vasinov
vasinov previously approved these changes Apr 23, 2024
@collindutter collindutter force-pushed the feature/file-utils branch 2 times, most recently from c712b18 to 68405dd Compare April 23, 2024 21:12
dylanholmes
dylanholmes previously approved these changes Apr 24, 2024
@collindutter
Copy link
Member Author

I can't figure out why tests are failing on 3.9, and 3.11. Some sort of file race condition? The file should be there...

@collindutter collindutter merged commit 5c93b4e into dev Apr 29, 2024
8 checks passed
@collindutter collindutter deleted the feature/file-utils branch April 29, 2024 17:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants