s3Directory Loader with prefix error #1510

Closed
kevalshah90 opened this issue Mar 7, 2023 · 2 comments · Fixed by #1517

Comments

@kevalshah90

I am running into an error when attempting to read a set of CSV files from a folder in an S3 bucket.

from langchain.document_loaders import S3FileLoader, S3DirectoryLoader
loader = S3DirectoryLoader("s3-bucket", prefix="folder1")
loader.load()

Traceback:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_27645/1860550440.py in <cell line: 18>()
     16 from langchain.document_loaders import S3FileLoader, S3DirectoryLoader
     17 loader = S3DirectoryLoader("stroom-data", prefix="dbrs")
---> 18 loader.load()

~/anaconda3/envs/python3/lib/python3.10/site-packages/langchain/document_loaders/s3_directory.py in load(self)
     29         for obj in bucket.objects.filter(Prefix=self.prefix):
     30             loader = S3FileLoader(self.bucket, obj.key)
---> 31             docs.extend(loader.load())
     32         return docs

~/anaconda3/envs/python3/lib/python3.10/site-packages/langchain/document_loaders/s3_file.py in load(self)
     28         with tempfile.TemporaryDirectory() as temp_dir:
     29             file_path = f"{temp_dir}/{self.key}"
---> 30             s3.download_file(self.bucket, self.key, file_path)
     31             loader = UnstructuredFileLoader(file_path)
     32             return loader.load()

~/anaconda3/envs/python3/lib/python3.10/site-packages/boto3/s3/inject.py in download_file(self, Bucket, Key, Filename, ExtraArgs, Callback, Config)
    188     """
    189     with S3Transfer(self, Config) as transfer:
--> 190         return transfer.download_file(
    191             bucket=Bucket,
    192             key=Key,

~/anaconda3/envs/python3/lib/python3.10/site-packages/boto3/s3/transfer.py in download_file(self, bucket, key, filename, extra_args, callback)
    324         )
    325         try:
--> 326             future.result()
    327         # This is for backwards compatibility where when retries are
    328         # exceeded we need to throw the same error from boto3 instead of

~/anaconda3/envs/python3/lib/python3.10/site-packages/s3transfer/futures.py in result(self)
    101             # however if a KeyboardInterrupt is raised we want want to exit
    102             # out of this and propagate the exception.
--> 103             return self._coordinator.result()
    104         except KeyboardInterrupt as e:
    105             self.cancel()

~/anaconda3/envs/python3/lib/python3.10/site-packages/s3transfer/futures.py in result(self)
    264         # final result.
    265         if self._exception:
--> 266             raise self._exception
    267         return self._result
    268 

~/anaconda3/envs/python3/lib/python3.10/site-packages/s3transfer/tasks.py in __call__(self)
    137             # main() method.
    138             if not self._transfer_coordinator.done():
--> 139                 return self._execute_main(kwargs)
    140         except Exception as e:
    141             self._log_and_set_exception(e)

~/anaconda3/envs/python3/lib/python3.10/site-packages/s3transfer/tasks.py in _execute_main(self, kwargs)
    160         logger.debug(f"Executing task {self} with kwargs {kwargs_to_display}")
    161 
--> 162         return_value = self._main(**kwargs)
    163         # If the task is the final task, then set the TransferFuture's
    164         # value to the return value from main().

~/anaconda3/envs/python3/lib/python3.10/site-packages/s3transfer/download.py in _main(self, fileobj, data, offset)
    640         :param offset: The offset to write the data to.
    641         """
--> 642         fileobj.seek(offset)
    643         fileobj.write(data)
    644 

~/anaconda3/envs/python3/lib/python3.10/site-packages/s3transfer/utils.py in seek(self, where, whence)
    376 
    377     def seek(self, where, whence=0):
--> 378         self._open_if_needed()
    379         self._fileobj.seek(where, whence)
    380 

~/anaconda3/envs/python3/lib/python3.10/site-packages/s3transfer/utils.py in _open_if_needed(self)
    359     def _open_if_needed(self):
    360         if self._fileobj is None:
--> 361             self._fileobj = self._open_function(self._filename, self._mode)
    362             if self._start_byte != 0:
    363                 self._fileobj.seek(self._start_byte)

~/anaconda3/envs/python3/lib/python3.10/site-packages/s3transfer/utils.py in open(self, filename, mode)
    270 
    271     def open(self, filename, mode):
--> 272         return open(filename, mode)
    273 
    274     def remove_file(self, filename):

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpx99uzbme/dbrs/.792d2ba4'



@AlpriElse
Contributor

I'm able to reproduce on langchain==0.0.102

I created test objects in S3 using the following code snippet:

import boto3

s3 = boto3.resource('s3')

for i in range(10):
    s3.Bucket('langchain-debugging-test').put_object(Key=f'folder/hello-{i}.txt', Body=f'Hello World {i}!')

I then get the same error when running the following code:

from langchain.document_loaders import S3DirectoryLoader

loader = S3DirectoryLoader("langchain-debugging-test", prefix="folder")
loader.load()

Full stack trace

Traceback (most recent call last):
  File "/Users/alprielse/src/sandboxed/2023-03-07_langchain/recreate-s3-error.py", line 11, in <module>
    loader.load()
  File "/Users/alprielse/src/sandboxed/2023-03-07_langchain/.venv/lib/python3.9/site-packages/langchain/document_loaders/s3_directory.py", line 31, in load
    docs.extend(loader.load())
  File "/Users/alprielse/src/sandboxed/2023-03-07_langchain/.venv/lib/python3.9/site-packages/langchain/document_loaders/s3_file.py", line 30, in load
    s3.download_file(self.bucket, self.key, file_path)
  File "/Users/alprielse/src/sandboxed/2023-03-07_langchain/.venv/lib/python3.9/site-packages/boto3/s3/inject.py", line 190, in download_file
    return transfer.download_file(
  File "/Users/alprielse/src/sandboxed/2023-03-07_langchain/.venv/lib/python3.9/site-packages/boto3/s3/transfer.py", line 326, in download_file
    future.result()
  File "/Users/alprielse/src/sandboxed/2023-03-07_langchain/.venv/lib/python3.9/site-packages/s3transfer/futures.py", line 103, in result
    return self._coordinator.result()
  File "/Users/alprielse/src/sandboxed/2023-03-07_langchain/.venv/lib/python3.9/site-packages/s3transfer/futures.py", line 266, in result
    raise self._exception
  File "/Users/alprielse/src/sandboxed/2023-03-07_langchain/.venv/lib/python3.9/site-packages/s3transfer/tasks.py", line 139, in __call__
    return self._execute_main(kwargs)
  File "/Users/alprielse/src/sandboxed/2023-03-07_langchain/.venv/lib/python3.9/site-packages/s3transfer/tasks.py", line 162, in _execute_main
    return_value = self._main(**kwargs)
  File "/Users/alprielse/src/sandboxed/2023-03-07_langchain/.venv/lib/python3.9/site-packages/s3transfer/download.py", line 642, in _main
    fileobj.seek(offset)
  File "/Users/alprielse/src/sandboxed/2023-03-07_langchain/.venv/lib/python3.9/site-packages/s3transfer/utils.py", line 378, in seek
    self._open_if_needed()
  File "/Users/alprielse/src/sandboxed/2023-03-07_langchain/.venv/lib/python3.9/site-packages/s3transfer/utils.py", line 361, in _open_if_needed
    self._fileobj = self._open_function(self._filename, self._mode)
  File "/Users/alprielse/src/sandboxed/2023-03-07_langchain/.venv/lib/python3.9/site-packages/s3transfer/utils.py", line 272, in open
    return open(filename, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/kt/7gdyyk6d75lbw1xwdqkrbrjw0000gn/T/tmpgypdhma7/folder/hello-0.txt.ddBa004f'

@AlpriElse
Contributor

Working on a fix

hwchase17 pushed a commit that referenced this issue Mar 9, 2023
Resolves #1510

### Problem
When loading S3 objects with `/` in the object key (e.g.
`folder/some-document.txt`) using `S3FileLoader`, the objects are
downloaded into a temporary directory and saved as a file.

This errors out when the parent directory does not exist within the
temporary directory.

See
#1510 (comment)
for how to reproduce this bug.
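
The failure can probably be reproduced locally without S3. The following is a minimal sketch, not code from the library: the helper name `try_write_without_parents` is hypothetical, and it only mimics the way `S3FileLoader` builds `file_path` from the temporary directory and the object key, as seen in the traceback.

```python
import tempfile


def try_write_without_parents(key: str) -> bool:
    """Hypothetical helper: write to temp_dir/key without creating parent dirs.

    Returns True if the write succeeds, False if it raises FileNotFoundError.
    """
    with tempfile.TemporaryDirectory() as temp_dir:
        # Same path construction as in the traceback: f"{temp_dir}/{self.key}"
        file_path = f"{temp_dir}/{key}"
        try:
            with open(file_path, "wb") as f:
                f.write(b"Hello World!")
        except FileNotFoundError:
            # "folder/" does not exist inside the fresh temporary directory
            return False
        return True


print(try_write_without_parents("some-document.txt"))         # flat key
print(try_write_without_parents("folder/some-document.txt"))  # key containing "/"
```

A flat key writes fine, while a key containing `/` fails because the intermediate directory was never created inside the temporary directory.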

### What this PR does
Creates parent directories based on the object key.

This also works with deeply nested keys:
`folder/subfolder/some-document.txt`
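
A minimal sketch of the idea, assuming the fix amounts to calling `os.makedirs` on the parent of the download path before writing. The helper name `local_path_for_key` is hypothetical and this is not necessarily the exact code merged in the PR.

```python
import os
import tempfile


def local_path_for_key(temp_dir: str, key: str) -> str:
    """Hypothetical helper: build the local path for an object key and
    make sure its parent directories exist."""
    file_path = f"{temp_dir}/{key}"
    # exist_ok=True makes this a no-op for flat keys whose parent is temp_dir
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    return file_path


with tempfile.TemporaryDirectory() as temp_dir:
    path = local_path_for_key(temp_dir, "folder/subfolder/some-document.txt")
    # In the loader, the S3 download into `path` would happen at this point;
    # a plain write stands in for it here.
    with open(path, "wb") as f:
        f.write(b"Hello World!")
    print(os.path.exists(path))
```

Creating the directories just before the download keeps the change local to the file loader, so the directory loader needs no changes.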