ValueError: NOT FOUND when trying to save train state in docker container #446
I was able to run your code without errors from a simple virtual env without docker. I think it has something to do with the file permissions of your
I just tried running it in ubuntu without docker and it worked, so it seems docker is the problem. Creating the checkpoints folder with python doesn't seem to change anything. I'll keep experimenting.
So the problem was the ownership of my mounted volume. Because I mounted
Here's the updated dockerfile:
So the root directory of my container now looks like this:
And
Whereas
Finally, here's the updated python file with the copying technique that I mentioned:
import flax.linen as nn
from flax.training import train_state
import optax
import orbax.checkpoint as ocp
import jax
import jax.numpy as jnp
import os
import shutil


def create_train_state(module, rng):
    x = jnp.ones([1, 256, 256, 1])
    variables = module.init(rng, x)
    params = variables['params']
    tx = optax.adam(1e-3)
    ts = train_state.TrainState.create(
        apply_fn=module.apply, params=params, tx=tx
    )
    return ts


class TestModel(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = nn.Conv(4, kernel_size=(3, 3))(x)
        return x


if __name__ == '__main__':
    init_rng = jax.random.PRNGKey(0)
    model = TestModel()
    state = create_train_state(model, init_rng)
    del init_rng
    checkpointer = ocp.Checkpointer(ocp.PyTreeCheckpointHandler(use_ocdbt=True))
    # Save to root owned checkpoints dir.
    checkpointer.save(os.path.abspath('../checkpoints/checkpoint1'), state)
    # Copy from root owned checkpoints dir, to checkpoints dir in mounted volume.
    shutil.copytree('../checkpoints/checkpoint1', 'checkpoints/checkpoint1')
    # Restore from checkpoints dir in mounted volume.
    state = checkpointer.restore(os.path.abspath('checkpoints/checkpoint1'))

@ChromeHearts thanks for pointing me in the right direction.
Bug was caused by incorrect directory ownership of volume mounted in docker container. See here for more details: google/orbax#446
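For anyone landing here with the same symptom, a quick way to check the ownership of a directory from inside the container is sketched below. This is only a diagnostic sketch, not something from the original report: `describe_ownership` is a hypothetical helper name, it uses only the Python standard library, and it assumes a POSIX system (`os.geteuid` is not available on Windows).

```python
import os
import tempfile


def describe_ownership(path):
    """Report who owns `path` and whether the process may write to it.

    Hypothetical helper for illustration; POSIX only.
    """
    info = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "owner_uid": info.st_uid,          # numeric owner of the directory
        "owner_gid": info.st_gid,          # numeric group of the directory
        "process_uid": os.geteuid(),       # effective uid of this process
        # os.access checks permissions using the real uid/gid of the process.
        "writable": os.access(path, os.W_OK),
    }


# Demo on a fresh temporary directory. In a setup like the one in this
# thread you would pass the mounted project directory and the
# checkpoints directory instead.
with tempfile.TemporaryDirectory() as demo_dir:
    print(describe_ownership(demo_dir))
```

If `owner_uid` differs from `process_uid` for the mounted directory (as with the `1003`-owned files next to a `root` process in the listing in this thread), ownership is a reasonable first thing to look at.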
This is certainly very weird. Your container was actually running as root, so it shouldn't have had issues saving checkpoints directly to the
sudo docker-compose exec test bash
root@e8a981978d40:/project# ls -l
total 16
-rw-r--r-- 1 1003 1003 268 Aug 3 02:26 Dockerfile
-rw-r--r-- 1 1003 1003 105 Aug 3 02:27 docker-compose.yaml
-rw-r--r-- 1 1003 1003 940 Aug 3 02:07 main.py
drwxr-xr-x 5 1003 1003 4096 Aug 3 02:08 py39
root@e8a981978d40:/project# mkdir checkpoints
root@e8a981978d40:/project# python main.py
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
save_path='/project/checkpoints/checkpoint1'
root@e8a981978d40:/project# ls -l
total 20
-rw-r--r-- 1 1003 1003 268 Aug 3 02:26 Dockerfile
drwxr-xr-x 3 root root 4096 Aug 3 02:35 checkpoints
-rw-r--r-- 1 1003 1003 105 Aug 3 02:27 docker-compose.yaml
-rw-r--r-- 1 1003 1003 940 Aug 3 02:07 main.py
drwxr-xr-x 5 1003 1003 4096 Aug 3 02:08 py39
root@e8a981978d40:/project# find checkpoints
checkpoints
checkpoints/checkpoint1
checkpoints/checkpoint1/checkpoint
checkpoints/checkpoint1/d
checkpoints/checkpoint1/d/ce942794f70ea11a64fa0742f009b653
checkpoints/checkpoint1/d/2b248e926ce267f2604fe6215090a51b
checkpoints/checkpoint1/d/7d4d3dd57291b20fd8866db6035f8025
checkpoints/checkpoint1/manifest.ocdbt
root@e8a981978d40:/project#
The main.py is simply your Python script (the first version, without the copy). I managed to save the checkpoint without issues. I suggest avoiding the copy from a temp folder to the mounted volume. Docker temp folders are not meant for storing large datasets; they are slow and have limited storage.
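If you want to save directly into the mounted volume, it may help to first confirm that the process can actually create, write, and rename entries there. The sketch below is a generic probe using only the standard library. `can_write_and_rename` is a hypothetical name, and the assumption that a failing save corresponds to one of these filesystem operations is mine, not something established in this thread.

```python
import os
import shutil
import tempfile


def can_write_and_rename(directory):
    """Try to create, fill, and rename a scratch subdirectory in `directory`.

    Hypothetical probe for illustration. Returns True if every step
    succeeds and False if any step raises OSError.
    """
    scratch = None
    renamed = None
    try:
        # Create a uniquely named scratch directory inside the target.
        scratch = tempfile.mkdtemp(prefix="probe_", dir=directory)
        # Write a small file into it.
        with open(os.path.join(scratch, "data"), "wb") as f:
            f.write(b"x")
        # Rename the scratch directory within the same parent.
        renamed = scratch + "_final"
        os.rename(scratch, renamed)
        return True
    except OSError:
        return False
    finally:
        # Clean up whichever of the two directories still exists.
        for leftover in (scratch, renamed):
            if leftover and os.path.isdir(leftover):
                shutil.rmtree(leftover, ignore_errors=True)
```

Calling `can_write_and_rename('checkpoints')` from the project directory before saving gives a quick yes/no without involving the checkpointing library at all.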
It seems that we have at least one solution, so closing this issue.
I'm getting the following error when I try to save my train state from within a docker container:
Here's the code to reproduce:
And here's my docker setup:
Dockerfile:
docker-compose.yaml:
I use the following commands to build and enter my docker container:
Then I create the checkpoint directory:
From here you can run the reproduction code.
I've been able to reproduce this error in a couple of different docker environments, but this one is the simplest. For some reason it does not reproduce in Colab.