Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Glob-related error appeared on 0.3.3 #97

Closed
avibrazil opened this issue Jan 5, 2022 · 13 comments
Closed

Glob-related error appeared on 0.3.3 #97

avibrazil opened this issue Jan 5, 2022 · 13 comments

Comments

@avibrazil
Copy link

avibrazil commented Jan 5, 2022

0.3.2 was working fine until I upgraded to 0.3.3 and then the following error started to happen. I don't know why and where is the problem but going back to 0.3.2 fixes the problem.

  File "/home/sagemaker-user/Kroft/my_avm/avm/avm.py", line 1063, in load
    available=[f for f in list(resolved_path.glob(filename)) if '.dvc' not in str(f)]
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 706, in glob
    yield from super().glob(pattern)
  File "/home/sagemaker-user/.pyenv/versions/3.9.9/lib/python3.9/pathlib.py", line 1177, in glob
    for p in selector.select_from(self):
  File "/home/sagemaker-user/.pyenv/versions/3.9.9/lib/python3.9/pathlib.py", line 523, in select_from
    if not is_dir(parent_path):
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 678, in is_dir
    return self._accessor.is_dir(self)
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 169, in is_dir
    resource, _ = self.configuration_map.get_configuration(path)
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 90, in get_configuration
    if resources is None and path in self.resources:
TypeError: argument of type 'NoneType' is not iterable

filename variable contains something like random • * • *.pkl*. And I don't think this is relevant information but this code is a static method of a class and it was running twice, in 2 parallel threads, where first thread succeeded and second failed with above error.

@liormizr
Copy link
Owner

liormizr commented Jan 6, 2022

@avibrazil thanks for the issue

How can I reproduce it?
can you please give me a line to run or a small script?

@sbrandtb
Copy link
Contributor

Same issue here, also threading-related. Relevant stack trace:

  File "/usr/local/lib/python3.9/pathlib.py", line 1264, in write_bytes
    with self.open(mode='wb') as f: 
  File "/usr/local/lib/python3.9/site-packages/s3path.py", line 721, in open
    return self._accessor.open(         
  File "/usr/local/lib/python3.9/site-packages/s3path.py", line 205, in open
    resource, config = self.configuration_map.get_configuration(path)
  File "/usr/local/lib/python3.9/site-packages/s3path.py", line 90, in get_configuration
    if resources is None and path in self.resources:                                                      
TypeError: argument of type 'NoneType' is not iterable   

The simplified code goes like this:

src = Path('some-local')
dst = S3Path.from_uri('s3://something') / 'foo.csv'
dst.write_bytes(src.read_bytes())

And run this massively parallel. Today I found out that creating boto3 clients is not thread safe and changed my own code like this. s3path is creating its own boto3 resource in class _S3Accessor, maybe that is the culprit.

sbrandtb added a commit to FRI-DAY/s3path that referenced this issue Mar 1, 2022
The issue was there all the time, but became critical in 9950dd0 with version
0.3.3.

`class _S3ConfigurationMap` is not thread safe because both
`set/get_configuration()` ask for `self.arguments` and `self.resources`
being `None`. However, since the configuration map is a singleton, when
using the library in a multithreaded environment, it can happen that
half the part of `_initial_setup` has run and `self.arguments` is indeed
not `None`, but `self.resources` is still `None` because in another
thread, creation of the boto3 resource is still ongoing. Then, our
thread would *not* call `_initial_setup()`, but still die on
`self.resources` being `None` when trying to access it, leading to the
error described in liormizr#97.

I could not make the bug appear without introducing the IO delay by
creating the resource in `_initial_setup()`. Maybe it is a CPython
implementation detail that makes `_initial_setup()` atomic of incredibly
rare to not be atomic without the creation of the boto resource.

Anyways, the setup is now synchronised with a lock, which is required,
again, given that `_S3ConfigurationMap` is a singleton.

Here is the code that reliably triggers the bug without the change:

```py
from multiprocessing.pool import ThreadPool

from s3path import S3Path

def do(i):
    dst = S3Path.from_uri('s3://something') / f'hello_{i}.txt'
    dst.write_bytes(b'hello')

ThreadPool(processes=20).map(do, range(100))
```
sbrandtb added a commit to FRI-DAY/s3path that referenced this issue Mar 1, 2022
The issue was there all the time, but became critical in 9950dd0 with version
0.3.3.

`class _S3ConfigurationMap` is not thread safe because both
`set/get_configuration()` ask for `self.arguments` and `self.resources`
being `None`. However, since the configuration map is a singleton, when
using the library in a multithreaded environment, it can happen that
half the part of `_initial_setup` has run and `self.arguments` is indeed
not `None`, but `self.resources` is still `None` because in another
thread, creation of the boto3 resource is still ongoing. Then, our
thread would *not* call `_initial_setup()`, but still die on
`self.resources` being `None` when trying to access it, leading to the
error described in liormizr#97.

I could not make the bug appear without introducing the IO delay by
creating the resource in `_initial_setup()`. Maybe it is a CPython
implementation detail that makes `_initial_setup()` atomic of incredibly
rare to not be atomic without the creation of the boto resource.

Anyways, the setup is now synchronised with a lock, which is required,
again, given that `_S3ConfigurationMap` is a singleton.

Here is the code that reliably triggers the bug without the change:

```py
from multiprocessing.pool import ThreadPool

from s3path import S3Path

def do(i):
    dst = S3Path.from_uri('s3://something') / f'hello_{i}.txt'
    dst.write_bytes(b'hello')

ThreadPool(processes=20).map(do, range(100))
```
@sbrandtb
Copy link
Contributor

sbrandtb commented Mar 1, 2022

I submitted a fix. Since I understand the issue better now, here is a workaround that should fix the issue until a new version is released: Basically, use any feature of the library without threads at the beginning of your application. It will make the library initialise without the concurrency issue that is the cause of the exception. I.e.:

S3Path.from_uri('s3://some-bucket-you-have-access-to/').exists()

There are probably other methods that are even shorter, but this one works.

@avibrazil
Copy link
Author

I’m not sure I got it.

You mean, call the lib somehow before any threading and then the lib can be used later in threading contexts?

Is this the temporary fix?

@sbrandtb
Copy link
Contributor

sbrandtb commented Mar 2, 2022

@avibrazil Sorry, I realise I was not clear 😞

Take the example from the pull request:

from multiprocessing.pool import ThreadPool

from s3path import S3Path

def do(i):
    dst = S3Path.from_uri('s3://something') / f'hello_{i}.txt'
    dst.write_bytes(b'hello')

# This line will "fix" the issue because it does ultimately call
# _S3ConfigurationMap.get_configuration()
S3Path.from_uri('s3://some-bucket-you-have-access-to/').exists()

# Now, do your threaded stuff.
ThreadPool(processes=20).map(do, range(100))

The thing is, it needs to call _S3ConfigurationMap.get_configuration(). Pardon me that I do not compile a list of possible functions that do this, because my time is limited 🙂 With the code above you can experiment and find one that also works, if the provided one is not good for you, i.e. because it requires some permission you do not have or so.

EDIT: The important part is calling exists(). How you construct S3Path does not matter.

liormizr added a commit that referenced this issue Mar 21, 2022
@liormizr liormizr reopened this Mar 21, 2022
@liormizr
Copy link
Owner

Hi all,

I merged the fix
I'll update in this issue when I'll create a new release

Thanks!

@sbrandtb
Copy link
Contributor

@liormizr Anything we can help with? This issue is blocking my upgrade to Python 3.10, so I am eager to see the new release 🙂

@liormizr
Copy link
Owner

@sbrandtb will it help if I'll deploy a new version now only with this fix?

@sbrandtb
Copy link
Contributor

@liormizr yes 🙂

@liormizr
Copy link
Owner

Version 0.3.4 deployed to PyPi
With only this issue fix
Sorry for the delay

I'll update tomorrow when I'll deploy to Conda

@liormizr
Copy link
Owner

Version deployed to Conda

@avibrazil
Copy link
Author

Oh, that's great!
I'll test it later.
But having it published, maybe it is time to close this bug.

@sbrandtb
Copy link
Contributor

sbrandtb commented Mar 24, 2022

Update: Unrelated. This issue was related to this workaround

After updating and running my project in our CI, I am seemingly randomly getting

ValueError: the bucket 'xxx does not exist, or is forbidden for access (ClientError('An error occurred (InvalidAccessKeyId) when calling the CreateMultipartUpload operation: The AWS Access Key Id you provided does not exist in our records.'))

I did not observe this while testing the change locally and I am not sure if if is not connected to another change when migrating to Python 3.10 and updating packages, but want to drop this message here so if anyone else encounters it, he can +1. After all, it looks like some other non-deterministic thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants