Add file utils #745

collindutter · 2024-04-15T23:56:41Z

Adds file utils to fill the void left by the removal of file loading logic in Loaders.

Specifically relevant in doc examples that use load_collection.

collindutter · 2024-04-15T23:57:24Z

tests/unit/utils/test_file_utils.py

+        sources = ["tests/resources/test.txt"]
+        files = utils.load_files(sources)
+        loader = TextLoader(max_tokens=MAX_TOKENS, embedding_driver=MockEmbeddingDriver())
+        collection = loader.load_collection(list(files.values()))


Wondering if load_files should just return a list so it's easier to pass into load_collection? Don't think we can load the files concurrently if we change that though.

dylanholmes

Nice work!

Approving since the core use case is solved, but added some alternative suggestions for you to consider at your discretion.

dylanholmes · 2024-04-16T07:30:41Z

tests/unit/utils/test_file_utils.py

+        sources = ["tests/resources/test.txt"]
+        files = utils.load_files(sources)
+        loader = TextLoader(max_tokens=MAX_TOKENS, embedding_driver=MockEmbeddingDriver())
+        collection = loader.load_collection(list(files.values()))


You could have two functions, something like load_file_collection and load_file_list. Then you can identify individual files in the response if you need to, otherwise the load_file_list will be more convenient. The implementation of load_file_list could just call load_file_collection (obviously).

Another, more heavy-handed alternative could be to allow passing a dict[str, bytes] into load_collection in addition to list[bytes] for each of the relevant loaders. A pro of this option is that load_collection could reuse the original keys passed in with the input, so if the client needed to map from a file name to the file contents, they wouldn't need to do two lookups.

Currently:

sources = ['foo.txt']
content_by_filename_hash = utils.load_files(sources)
artifacts_by_content_hash = loader.load_collection(list(content_by_filename_hash.values()))
foo_content = content_by_filename_hash[hash_from_str('foo.txt')]
foo_artifact = artifacts_by_content_hash[loader.to_key(foo_content)]

With suggestion:

sources = ['foo.txt']
content_by_filename_hash = utils.load_files(sources)
artifacts_by_filename_hash = loader.load_collection(content_by_filename_hash)
# One less lookup if `load_collection` reuses keys when passed a dict
foo_artifact = artifacts_by_filename_hash[hash_from_str('foo.txt')]

If the input is the same shape as the output of load_collection, it would be easier to chain loaders together in general. However, I can't think of a use case besides this one.

Another alternative, which I'm not necessarily advocating for because it increases the scope: we could make loaders composable and re-introduce a file loader.
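To make the lookup difference concrete, here is a minimal standalone sketch of the dict-accepting variant. Everything in it is hypothetical: `ToyLoader`, its `to_key` and `load`, and this `hash_from_str` are stand-ins written for illustration (SHA-256 is an assumption), not the real loader API.

```python
from __future__ import annotations

import hashlib
from typing import Union


def hash_from_str(text: str) -> str:
    # Stand-in hash helper; SHA-256 is an assumption made for this sketch.
    return hashlib.sha256(text.encode()).hexdigest()


class ToyLoader:
    """Hypothetical loader used only to illustrate the two input shapes."""

    def to_key(self, content: bytes) -> str:
        # Key derived from the content, as in the list-based flow.
        return hashlib.sha256(content).hexdigest()

    def load(self, content: bytes) -> str:
        # Pretend "artifact": the decoded text.
        return content.decode()

    def load_collection(self, sources: Union[list[bytes], dict[str, bytes]]) -> dict[str, str]:
        if isinstance(sources, dict):
            # Dict input: reuse the caller's keys in the output.
            return {key: self.load(content) for key, content in sources.items()}
        # List input: fall back to content-derived keys.
        return {self.to_key(content): self.load(content) for content in sources}


loader = ToyLoader()
content_by_filename_hash = {hash_from_str("foo.txt"): b"hello"}

# List input: two lookups to get from file name to artifact.
by_content = loader.load_collection(list(content_by_filename_hash.values()))
foo_content = content_by_filename_hash[hash_from_str("foo.txt")]
foo_artifact_two_step = by_content[loader.to_key(foo_content)]

# Dict input: one lookup, because the keys are carried through.
by_filename = loader.load_collection(content_by_filename_hash)
foo_artifact_one_step = by_filename[hash_from_str("foo.txt")]
```

Both paths end at the same artifact. The dict form simply keeps the file-name key attached to it, which is also what would make chaining loaders straightforward.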

collindutter · 2024-04-16T18:20:07Z

Also, I just learned a new Python syntax. Maybe these changes aren't needed?

with open("tests/resources/cities.csv", "r") as cities, open("tests/resources/addresses.csv", "r") as addresses:
    CsvLoader().load_collection([cities.read(), addresses.read()])

vasinov · 2024-04-16T20:48:33Z

griptape/utils/file_utils.py

+        A dictionary where the keys are a hash of the path and the values are the content of the files.
+    """
+
+    futures_executor = futures.ThreadPoolExecutor()


Can we move this into a method parameter with a default value?

I can, but do we still want this PR even with the discussed syntax above? You can see it live here

Ha! Does it parallelize open?

I'm struggling to find a first-party resource that discusses the concurrency. But all mentions of the syntax online seem to suggest it is.

Even still, it might be nice to provide these utils as a slightly less painful migration path for users.

Ended up adding file utils to docs too
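On the parallelism question in this thread: as the Python language reference describes it, a `with` statement with several items behaves like nested `with` statements, so the files are opened one after another rather than concurrently. A small sketch that records the order of entry and exit (the `Tracked` class and the `events` list are made up for this illustration):

```python
events = []


class Tracked:
    """Toy context manager that logs when it is entered and exited."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __enter__(self) -> "Tracked":
        events.append(f"enter {self.name}")
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        events.append(f"exit {self.name}")
        return False  # do not swallow exceptions


# Multiple items in one `with` are entered left to right, sequentially,
# and exited in reverse order, exactly like nested `with` blocks.
with Tracked("cities") as cities, Tracked("addresses") as addresses:
    events.append("body")

print(events)
```

So the syntax is a convenience for nesting, and something like a thread pool is still what would give concurrent reads.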

collindutter · 2024-04-16T21:45:11Z

tests/resources/foobar.txt

Previously used test.txt is generated by other tests and is not a reliable source.

collindutter · 2024-04-16T22:49:21Z

Bump on this question

vasinov · 2024-04-16T22:48:42Z

griptape/utils/file_utils.py

+    """Load multiple files concurrently and return a dictionary of their content.
+
+    Args:
+        paths: The paths to the files to load.


Arg doc for futures_executor is missing.
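For reference, a sketch of what the two review requests on this function could look like together: the executor as a parameter with a default value, and an `Args` entry documenting it. This is a standalone illustration, not the merged implementation; the SHA-256 key, the `None` default, and the `_read` helper are assumptions.

```python
from __future__ import annotations

import concurrent.futures as futures
import hashlib
from typing import Optional


def _read(path: str) -> bytes:
    # Hypothetical helper: read one file as raw bytes.
    with open(path, "rb") as f:
        return f.read()


def load_files(paths: list[str], futures_executor: Optional[futures.Executor] = None) -> dict[str, bytes]:
    """Load multiple files concurrently and return a dictionary of their content.

    Args:
        paths: The paths to the files to load.
        futures_executor: The executor used to load the files concurrently.
            Defaults to a new ThreadPoolExecutor (an assumption of this sketch).

    Returns:
        A dictionary where the keys are a hash of the path and the values are the content of the files.
    """
    owns_executor = futures_executor is None
    executor = futures_executor if futures_executor is not None else futures.ThreadPoolExecutor()
    try:
        # Submit every read first so they can run concurrently, then collect.
        submitted = {
            hashlib.sha256(path.encode()).hexdigest(): executor.submit(_read, path)
            for path in paths
        }
        return {key: future.result() for key, future in submitted.items()}
    finally:
        if owns_executor:
            # Only shut down an executor that this call created.
            executor.shutdown()
```

Taking the executor as a parameter lets callers share one pool across calls or swap in a different executor in tests, while the default keeps the simple call `load_files(paths)` working.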

collindutter · 2024-04-24T14:12:27Z

I can't figure out why tests are failing on 3.9 and 3.11. Some sort of file race condition? The file should be there...

collindutter requested review from vasinov and dylanholmes April 15, 2024 23:56

dylanholmes previously approved these changes Apr 16, 2024

vasinov requested changes Apr 16, 2024

collindutter force-pushed the feature/file-utils branch from b4d5547 to cfe5306 Compare April 16, 2024 21:36

collindutter dismissed dylanholmes’s stale review via 1adc429 April 16, 2024 21:44

collindutter force-pushed the feature/file-utils branch from a4a8ad9 to 1adc429 Compare April 16, 2024 21:44

collindutter force-pushed the feature/file-utils branch from cfe5306 to a4a8ad9 Compare April 16, 2024 21:45

andrewfrench previously approved these changes Apr 16, 2024

vasinov previously approved these changes Apr 16, 2024

collindutter dismissed stale reviews from vasinov and andrewfrench via 90d89e8 April 17, 2024 19:36

collindutter force-pushed the feature/file-utils branch from 1adc429 to 90d89e8 Compare April 17, 2024 19:36

collindutter requested review from vasinov, dylanholmes and andrewfrench April 17, 2024 19:53

collindutter force-pushed the feature/file-utils branch from 10e9e2d to bdab643 Compare April 22, 2024 16:25

vasinov previously approved these changes Apr 23, 2024

collindutter force-pushed the feature/file-utils branch 2 times, most recently from c712b18 to 68405dd Compare April 23, 2024 21:12

dylanholmes previously approved these changes Apr 24, 2024

collindutter force-pushed the feature/file-utils branch from 68405dd to fe5007e Compare April 29, 2024 16:19

collindutter added 3 commits April 29, 2024 09:31

Add file utils

6d452f4

Use static foobar test file instead of ephemeral test file

398b9a6

Add missing arg doc

182a2da

collindutter added 2 commits April 29, 2024 09:31

Update docs to use file util

149413e

Rename test file

86661a4

collindutter force-pushed the feature/file-utils branch from fe5007e to 86661a4 Compare April 29, 2024 16:31

collindutter dismissed stale reviews from dylanholmes and vasinov via 9d58a70 April 29, 2024 17:25

Load relative files

bf808e1

collindutter force-pushed the feature/file-utils branch from 9d58a70 to bf808e1 Compare April 29, 2024 17:36

cjkindel approved these changes Apr 29, 2024

collindutter merged commit 5c93b4e into dev Apr 29, 2024
8 checks passed

collindutter deleted the feature/file-utils branch April 29, 2024 17:55
