Problem with Openfl Gramine "error: PAL failed at ../../Pal/src/db_main.c:pal_main:513 (exitcode = 4, reason=No 'loader.entrypoint' is specified in the manifest)" #503

Closed
gagandeep987123 opened this issue Sep 13, 2022 · 20 comments

@gagandeep987123

Describe the bug
I am attempting to run the FL example as given in the manual and am getting this error on the aggregator.

Gramine is starting. Parsing TOML manifest file, this may take some time...
Detected a huge manifest, preallocating 64MB of internal memory.
-----------------------------------------------------------------------------------------------------------------------
Gramine detected the following insecure configurations:

  - loader.insecure__use_cmdline_argv = true   (forwarding command-line args from untrusted host to the app)
  - sgx.allowed_files = [ ... ]                (some files are passed through from untrusted host without verification)

Gramine will continue application execution, but this configuration must not be used in production!
-----------------------------------------------------------------------------------------------------------------------

error: PAL failed at ../../Pal/src/db_main.c:pal_main:513 (exitcode = 4, reason=No 'loader.entrypoint' is specified in the manifest)

Also, step 7 in the manual is presented in a somewhat vague manner for a first-time user.
I used the setup given here as a workspace and template, but doing so produced the above error when I tried to start the federation on the aggregator machine.

@Einse57 Einse57 added this to the Backlog milestone Sep 15, 2022
@igor-davidyuk
Contributor

igor-davidyuk commented Sep 15, 2022

Hey! Thanks for reporting this. This issue was addressed, and I believe it should be gone in the next OpenFL release.
For now, you can try installing openfl from the develop branch:
git clone https://github.com/intel/openfl.git && cd openfl && pip install -e .
You will also need to rebuild your docker images: either remove the old ones, or pass --rebuild to the graminize command.
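A rough sketch of that rebuild step (the image name is a placeholder, not something taken from this thread; test_graminize.sh shows the actual flow):

docker images                      # find the stale workspace image built with the old openfl
docker image rm <workspace_image>  # remove it and re-run the graminize step, or
# keep the image and force a rebuild by passing --rebuild to the graminize command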

@igor-davidyuk
Contributor

I agree with your note regarding step 7 in the manual, but that would also be the wrong place to explain the certification process.
To make your life easier, try using the automatically generated certificates from this script, as the manual suggests.

@gagandeep987123
Author

I tried using the latest build ("openfl 1.4") but got a new error. To run openfl-gramine I had to install openfl 1.4 inside the docker image as well, which required changing the file "*venv/lib/python3.8/site-packages/openfl-gramine/Dockerfile.gramine"; after that I got a new error:

Traceback (most recent call last):
  File "/usr/local/bin/fx", line 8, in <module>
    sys.exit(entry())
  File "/usr/local/lib/python3.8/site-packages/openfl/interface/cli.py", line 207, in entry
    command_group = import_module(module, package)
  File "/usr/local/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.8/site-packages/openfl/interface/director.py", line 17, in <module>
    from openfl.component.director import Director
  File "/usr/local/lib/python3.8/site-packages/openfl/component/director/__init__.py", line 6, in <module>
    from .director import Director
  File "/usr/local/lib/python3.8/site-packages/openfl/component/director/director.py", line 15, in <module>
    from .experiment import Experiment
  File "/usr/local/lib/python3.8/site-packages/openfl/component/director/experiment.py", line 14, in <module>
    from openfl.federated import Plan
  File "/usr/local/lib/python3.8/site-packages/openfl/federated/__init__.py", line 8, in <module>
    from .task import TaskRunner  # NOQA
  File "/usr/local/lib/python3.8/site-packages/openfl/federated/task/__init__.py", line 14, in <module>
    import tensorflow  # NOQA
  File "/usr/local/lib/python3.8/site-packages/tensorflow/__init__.py", line 41, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/__init__.py", line 108, in <module>
    from tensorflow.python.platform import test
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/platform/test.py", line 24, in <module>
    from tensorflow.python.framework import test_util as _test_util
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 37, in <module>
    from absl.testing import parameterized
  File "/usr/local/lib/python3.8/site-packages/absl/testing/parameterized.py", line 215, in <module>
    from absl.testing import absltest
  File "/usr/local/lib/python3.8/site-packages/absl/testing/absltest.py", line 225, in <module>
    get_default_test_tmpdir(),
  File "/usr/local/lib/python3.8/site-packages/absl/testing/absltest.py", line 163, in get_default_test_tmpdir
    tmpdir = os.path.join(tempfile.gettempdir(), 'absl_testing')
  File "/usr/local/lib/python3.8/tempfile.py", line 286, in gettempdir
    tempdir = _get_default_tempdir()
  File "/usr/local/lib/python3.8/tempfile.py", line 218, in _get_default_tempdir
    raise FileNotFoundError(_errno.ENOENT,
FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/workspace']

I have tried it with multiple templates and the error persists there as well.

@mansishr
Collaborator

Hi @gagandeep987123, could you specify the command which led to the error above? Just looking through the error, it looks like either you don't have root permissions or there is no space left on the device (df -kh .).
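A quick way to run the checks suggested above (a sketch; run on the host, in the workspace directory):

df -kh .      # free space on the filesystem backing the workspace
ls -ld /tmp   # permissions on /tmp, the directory the traceback complains about
id            # confirm which user and groups you are running as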

@gagandeep987123
Author

Hi @mansishr
I am using a modified version of the test_graminize.sh script (attached below; please change .pdf to .sh):
test_graminize.pdf

gsingh@sgx03:~$ df -kh
Filesystem                         Size  Used Avail Use% Mounted on
udev                               189G     0  189G   0% /dev
tmpfs                               38G  2.7M   38G   1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv  117G   92G   19G  84% /
tmpfs                              189G     0  189G   0% /dev/shm
tmpfs                              5.0M     0  5.0M   0% /run/lock
tmpfs                              189G     0  189G   0% /sys/fs/cgroup
/dev/nvme0n1p2                     974M  372M  535M  42% /boot
/dev/nvme0n1p1                     511M  5.3M  506M   2% /boot/efi
/dev/sda1                          880G   28K  835G   1% /mnt/storage
/dev/loop0                          56M   56M     0 100% /snap/core18/2560
/dev/loop2                          62M   62M     0 100% /snap/core20/1611
/dev/loop1                          56M   56M     0 100% /snap/core18/2566
/dev/loop3                          64M   64M     0 100% /snap/core20/1623
/dev/loop4                          68M   68M     0 100% /snap/lxd/22526
/dev/loop5                          68M   68M     0 100% /snap/lxd/22753
/dev/loop6                          47M   47M     0 100% /snap/snapd/16292
/dev/loop7                          48M   48M     0 100% /snap/snapd/16778
tmpfs                               38G     0   38G   0% /run/user/1171

Also, I made further changes to "*venv/lib/python3.8/site-packages/openfl-gramine/Dockerfile.gramine":

ARG BASE_IMAGE=python:3.8
FROM ${BASE_IMAGE}

SHELL ["/bin/bash", "-o", "pipefail", "-c"]
RUN pwd
WORKDIR /openfl

COPY openfl .
RUN pwd
RUN --mount=type=cache,target=/root/.cache/ \
    pip install --upgrade pip && \
    pip install .

WORKDIR /
RUN pwd
# install gramine
RUN curl -fsSLo /usr/share/keyrings/gramine-keyring.gpg https://packages.gramineproject.io/gramine-keyring.gpg && \
    echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/gramine-keyring.gpg] https://packages.gramineproject.io/ stable main' | \
    tee /etc/apt/sources.list.d/gramine.list
RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt \
    apt-get update && \
    apt-get install -y --no-install-recommends \
    gramine libprotobuf-c-dev \
    && rm -rf /var/lib/apt/lists/*
    # there is an issue for libprotobuf-c in gramine repo, install from apt for now

# graminelibos is under this dir
ENV PYTHONPATH=/usr/local/lib/python3.8/site-packages/:/usr/lib/python3/dist-packages/:



# install linux headers
# WORKDIR /tmp/
# RUN wget -c https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.11/amd64/linux-headers-5.11.0-051100_5.11.0-051100.202102142330_all.deb
# RUN dpkg -i *.deb
# RUN mv /usr/src/linux-headers-5.11.0-051100/ /usr/src/linux-headers-5.11.0-051100rc5-generic/
# WORKDIR /

# ENV LC_ALL=C.UTF-8
# ENV LANG=C.UTF-8

I am using the latest release from GitHub for the openfl installation.

Also attaching the output from running the script (please remove .pdf from output.pdf):
output.pdf

@igor-davidyuk
Contributor

igor-davidyuk commented Sep 19, 2022

No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/workspace']

I also got this. It looks like TensorFlow tries to create a temporary directory somewhere inside the enclave, which is not a good idea in the first place.
What makes it strange is that the Gramine manifest allows using /tmp for exactly this purpose, and we still see this error.
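One way to confirm which candidate directory Python actually picks up inside the graminized container (a sketch, assuming you can start a shell inside the running container):

python3 -c "import tempfile; print(tempfile.gettempdir())"
# prints the first usable candidate, or raises the same FileNotFoundError as above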

@mansishr
Collaborator

In order for TF to create a directory at runtime inside an enclave, we would need to mount that directory from the host area to the enclave, something like this. For the syntax that we currently support with OpenFL, the lines below can be added to the manifest here:

fs.mount.etc.type = "chroot"
fs.mount.etc.path = "/tmp"
fs.mount.etc.uri = "file:/tmp"
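For reference, newer Gramine releases express the same mount with the fs.mounts array syntax (a sketch based on Gramine's manifest documentation; check the Gramine version inside your image before relying on it):

fs.mounts = [
  { type = "chroot", path = "/tmp", uri = "file:/tmp" },
]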

@igor-davidyuk
Contributor

In order for TF to create a directory at runtime inside an enclave, we would need to mount that directory from the host area to the enclave, something like this. For the syntax that we currently support with OpenFL, the lines below can be added to the manifest here:

fs.mount.etc.type = "chroot"
fs.mount.etc.path = "/tmp"
fs.mount.etc.uri = "file:/tmp"

It is strange that we need to mount the temp folder rather than just allow using it inside the enclave. I am positive it worked before 😅😅

@gagandeep987123
Author

In order for TF to create a directory at runtime inside an enclave, we would need to mount that directory from the host area to the enclave, something like this. For the syntax that we currently support with OpenFL, the lines below can be added to the manifest here:

fs.mount.etc.type = "chroot"
fs.mount.etc.path = "/tmp"
fs.mount.etc.uri = "file:/tmp"

So I made the following changes:

  1. In test_graminize.sh I added a mkdir tmpfs line in ${FED_DIRECTORY} and then added the option --volume=${FED_DIRECTORY}/tmpfs:/tmp \ to docker run
  2. Added the lines above to the openfl.manifest.template file.

But I am still getting the same error.
Is it working for you, @igor-davidyuk? Or am I missing something?

@igor-davidyuk
Contributor

  • In test_graminize.sh I added a mkdir tmpfs line in ${FED_DIRECTORY} and then added the option --volume=${FED_DIRECTORY}/tmpfs:/tmp \ to docker run
  • Added the lines above to the openfl.manifest.template file.

It should be done in a different way: we need to add that mount line to the gramine manifest template.
Then we should keep in mind that the enclave is built within a docker image, so the updated manifest has to end up in the docker image. That means we should not install openfl from pip in the docker base image, but copy the local repository and install from source.

Will try to do this in a separate branch. I am still not sure whether it is safe to mount /tmp into the enclave; I will ask the Gramine guys.
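As a possible middle ground (an assumption based on Gramine's documented mount types, not something verified in this thread), a tmpfs mount keeps /tmp entirely inside the enclave's memory instead of passing it through from the host:

# hypothetical alternative, sketched in the newer fs.mounts syntax
fs.mounts = [
  { type = "tmpfs", path = "/tmp" },
]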

@igor-davidyuk
Contributor

Try this branch, worked for me!
https://github.com/igor-davidyuk/openfl/tree/manifest-gramine-update

@mansishr
Collaborator

Thanks for the working branch, @igor-davidyuk. Ideally it is unsafe even to allow the /tmp directory, but since we already list it as an allowed file in the manifest for this example, it should be okay to mount it as well. We should definitely consult the Gramine team too.

@gagandeep987123
Author

Yes, it is no longer stopping at that step 😄 but it stops a bit further on. With the new branch the aggregator starts, but the example itself gives an error:

[09:20:29] INFO     Using TaskRunner subclassing API                                                                                         collaborator.py:253
[09:20:29] INFO     Using TaskRunner subclassing API                                                                                         collaborator.py:253
[09:20:37] METRIC   Round 0, collaborator one is sending metric for task aggregated_model_validation: acc   0.140000                         collaborator.py:415
[09:20:37] INFO     Collaborator one is sending task results for aggregated_model_validation, round 0                                          aggregator.py:515
           ERROR    Exception calling application: [Errno 2] No such file or directory                                                            _server.py:445
                    Traceback (most recent call last):                                                                                                          
                      File "/usr/local/lib/python3.8/site-packages/grpc/_server.py", line 435, in _call_behavior                                                
                        response_or_iterator = behavior(argument, context)                                                                                      
                      File "/usr/local/lib/python3.8/site-packages/openfl/transport/grpc/aggregator_server.py", line 222, in SendLocalTaskResults               
                        self.aggregator.send_local_task_results(                                                                                                
                      File "/usr/local/lib/python3.8/site-packages/openfl/component/aggregator/aggregator.py", line 552, in                                     
                    send_local_task_results                                                                                                                     
                        self.log_metric(tensor_key.tags[-1], task_name,                                                                                         
                      File "/usr/local/lib/python3.8/site-packages/openfl/utilities/logs.py", line 24, in write_metric                                          
                        get_writer()                                                                                                                            
                      File "/usr/local/lib/python3.8/site-packages/openfl/utilities/logs.py", line 19, in get_writer                                            
                        writer = SummaryWriter('./logs/tensorboard', flush_secs=5)                                                                              
                      File "/usr/local/lib/python3.8/site-packages/tensorboardX/writer.py", line 301, in __init__                                               
                        self._get_file_writer()                                                                                                                 
                      File "/usr/local/lib/python3.8/site-packages/tensorboardX/writer.py", line 349, in _get_file_writer                                       
                        self.file_writer = FileWriter(logdir=self.logdir,                                                                                       
                      File "/usr/local/lib/python3.8/site-packages/tensorboardX/writer.py", line 105, in __init__                                               
                        self.event_writer = EventFileWriter(                                                                                                    
                      File "/usr/local/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 105, in __init__                                    
                        self._event_queue = multiprocessing.Queue(max_queue_size)                                                                               
                      File "/usr/local/lib/python3.8/multiprocessing/context.py", line 103, in Queue                                                            
                        return Queue(maxsize, ctx=self.get_context())                                                                                           
                      File "/usr/local/lib/python3.8/multiprocessing/queues.py", line 42, in __init__                                                           
                        self._rlock = ctx.Lock()                                                                                                                
                      File "/usr/local/lib/python3.8/multiprocessing/context.py", line 68, in Lock                                                              
                        return Lock(ctx=self.get_context())                                                                                                     
                      File "/usr/local/lib/python3.8/multiprocessing/synchronize.py", line 162, in __init__                                                     
                        SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)                                                                                        
                      File "/usr/local/lib/python3.8/multiprocessing/synchronize.py", line 57, in __init__                                                      
                        sl = self._semlock = _multiprocessing.SemLock(                                                                                          
                    FileNotFoundError: [Errno 2] No such file or directory                                                                                      
           INFO     Response code: StatusCode.UNKNOWN                                                                                    aggregator_client.py:59
           INFO     Attempting to resend data request to aggregator at localhost:58034                                                   aggregator_client.py:98
           [the same FileNotFoundError traceback is printed again on each resend attempt]

The above error keeps happening in a loop.
I am currently using TEMPLATE=${3:- 'torch_cnn_histology_gramine_ready'} in test_graminize.sh.
Also, when I use "torch_unet_kvasir_gramine_ready", I get the error below:

SystemError: ZIP File hash doesn't match expected file hash.

@Einse57 Einse57 added bug Something isn't working MediumBug Some impact to use and removed bug Something isn't working labels Sep 22, 2022
@mansishr
Collaborator

Hi @gagandeep987123, can you try out the example with "torch_unet_kvasir_gramine_ready"? We are aware of the issue with the hash not being valid (it appeared quite recently). We'll resolve it soon, but in the meantime please comment out this line and proceed.

Regarding the multiprocessing issue you see: it is a known limitation that Python's multiprocessing package requires POSIX semaphores, which Gramine does not support or implement. So we'll need to disable multiprocessing (num_workers=0) to make it work.
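For the PyTorch-based templates this amounts to keeping the data loader single-process. A generic, self-contained sketch (not OpenFL-specific code; where exactly num_workers is set depends on the template's data loader class):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8, 3), torch.randint(0, 2, (8,)))
# num_workers=0 keeps data loading in the main process, so no POSIX
# semaphores or other multiprocessing primitives are needed inside the enclave
loader = DataLoader(dataset, batch_size=4, num_workers=0)
for x, y in loader:
    pass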

@gagandeep987123
Author

Hi @mansishr, I am getting the same multiprocessing error while running "torch_unet_kvasir_gramine_ready".

@mansishr
Collaborator

Hi @gagandeep987123, sorry for the late response. Multiprocessing is getting triggered through the use of tensorboard's summary writer. Please disable write_logs in the plan settings:

aggregator :
  defaults : plan/defaults/aggregator.yaml
  template : openfl.component.Aggregator
  settings :
    init_state_path : save/torch_cnn_histology_init.pbuf
    best_state_path : save/torch_cnn_histology_best.pbuf
    last_state_path : save/torch_cnn_histology_last.pbuf
    rounds_to_train : 20
    write_logs : false

@gagandeep987123
Author

@mansishr Is it working for you? I am still getting the error. I changed the file as you suggested, via a change in Dockerfile.gramine.

@mansishr
Collaborator

mansishr commented Oct 3, 2022

Hi @gagandeep987123, could you attach the files where you made the changes? Also, you would need to run through all the steps again and rebuild the image after any change to the plan.

@mansishr
Collaborator

mansishr commented Oct 26, 2022

Hi @gagandeep987123, let us know if the issue got resolved.

@gagandeep987123
Author

It is working. Thanks for the help
