This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Par executables can't find data #43

Open
hwright opened this issue Sep 23, 2017 · 12 comments

@hwright
hwright commented Sep 23, 2017

I'm using subpar to generate an independent python executable with Bazel. Unfortunately, it doesn't appear that the resulting par file can find data files referenced by the par_binary() rule.

For example, consider this trivial workspace: foo.tar.gz

The par file is built and runs fine:

$ tar -xvzf foo.tar.gz
x foo/
x foo/BUILD
x foo/dir/
x foo/foo.py
x foo/WORKSPACE
x foo/dir/file.txt
$ cd foo
$ bazel build :foo.par
INFO: Found 1 target...
Target //:foo.par up-to-date:
  bazel-bin/foo.par
INFO: Elapsed time: 1.103s, Critical Path: 0.00s
$ bazel-bin/foo.par
Hello, World!

$

However, it appears that the executable is picking up the data files from the local filesystem, not the par file. Switching directories and trying to run the par file from its location produces an error:

$ cd ..
$ foo/bazel-bin/foo.par
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "foo/bazel-bin/foo.par/__main__.py", line 6, in <module>
IOError: [Errno 2] No such file or directory: 'dir/file.txt'
$

The expected behavior would be that the execution of the par file from outside the workspace succeeds just as it does within the workspace.
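
The traceback suggests that foo.py opens 'dir/file.txt' with a path relative to the current working directory. A minimal sketch of that failure mode follows, using temporary directories as stand-ins for the workspace and for "somewhere else" (the layout and file contents are assumptions, not the actual contents of foo.tar.gz):

```python
import os
import tempfile


def read_relative(path):
    # A relative path is resolved against the current working directory,
    # not against the location of the script or archive being run.
    with open(path) as f:
        return f.read()


original_cwd = os.getcwd()
workspace = tempfile.mkdtemp()   # hypothetical stand-in for the foo/ workspace
elsewhere = tempfile.mkdtemp()   # hypothetical directory outside the workspace
os.makedirs(os.path.join(workspace, "dir"))
with open(os.path.join(workspace, "dir", "file.txt"), "w") as f:
    f.write("Hello, World!\n")

try:
    os.chdir(workspace)
    inside = read_relative("dir/file.txt")   # works: cwd is the workspace
    os.chdir(elsewhere)
    try:
        read_relative("dir/file.txt")
        outside_failed = False
    except FileNotFoundError:                # surfaces as IOError on Python 2
        outside_failed = True
finally:
    os.chdir(original_cwd)
```

The same call succeeds or fails depending only on where the process was started, which matches the behavior reported above.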

@mattmoor
Contributor

Can you use pkg_resources or pkgutil to access the file?

Here's an example where we use pkgutil to access a file within the PAR and extract it onto the filesystem so that things work as the bundled library expects.

@mattmoor
Contributor

Actually maybe it's pkg_resources that doesn't work well with PAR, so try pkgutil first :)

@hwright
Author

hwright commented Sep 24, 2017

pkgutil works, thanks.

Is this in the documentation somewhere?

@mattmoor
Contributor

@duggelz is the authority on this repo. If not, I think we should track adding it with this issue.

@duggelz
Contributor

duggelz commented Sep 26, 2017

.par files are not intended (by me, at least) to extract all of their files to disk; that kind of defeats the point.

However, the waters are seriously muddied by the Bazel .zip for Windows which does always extract by default, and the various ways to create .par files inside Google that use magic command lines or environment variables to autoextract.

So, point 1:

  1. Document how to access data files

Yes, I should do this. For reference it's like:

import pkgutil
dat = pkgutil.get_data('my.package.name', 'filename.ext')

This provides a file-like object, which is often good enough. When you really need an actual file, you need an API to materialize that file to disk. The internal Google API is terrible (I can say that because I wrote it) so we don't plan to open-source it. The pkg_resources module should be the preferred API; at least it's better. However, there are some issues with proper metadata handling at present, and there are also some logistical issues with pkg_resources being part of setuptools rather than part of the Python standard library, the way pkgutil is.

  2. "Feature Request: .par files should autoextract when you run them".

This is a valid feature request, but I'm biased by the fact that we're actively trying to move away from this inside Google, because the performance and disk usage implications have become quite severe. It's a balance between programmer ease of use and performance/resource usage, and Google's position on that line is probably quite different from almost everyone else's.

@hwright
Author

hwright commented Sep 26, 2017

Slight correction: pkgutil.get_data returns a string (on Python 3 I believe it's actually bytes), not a file-like object.

I personally have less interest in 2, but see how others might.
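
A minimal sketch of the correction above: pkgutil.get_data returns the raw contents (bytes on Python 3), which can be wrapped in io.BytesIO when a caller expects a file-like object. The archive here is a plain zip built on the fly rather than a subpar-built .par, and "demo_pkg" and "greeting.txt" are hypothetical names:

```python
import io
import os
import pkgutil
import sys
import tempfile
import zipfile

# Build a throwaway zip archive containing a package with one data file.
# "demo_pkg" and "greeting.txt" are hypothetical names for illustration.
archive = os.path.join(tempfile.mkdtemp(), "demo.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demo_pkg/__init__.py", "")
    zf.writestr("demo_pkg/greeting.txt", "Hello, World!\n")

sys.path.insert(0, archive)  # zipimport lets Python import from the archive

# get_data returns the file contents, not a file object.
data = pkgutil.get_data("demo_pkg", "greeting.txt")

# Wrap the bytes when a caller expects a file-like object.
stream = io.BytesIO(data)
first_line = stream.readline()
```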

@duggelz
Contributor

duggelz commented Dec 8, 2017

A resource API is finally coming to the standard library in Python 3.7, and will be backported to 2.7 and 3.4-3.6. Hallelujah!

https://gitlab.com/python-devs/importlib_resources

Also, I'm leaning toward a "just extract everything to disk all the time" strategy for this tool, instead of the much more complicated heuristics used inside Google for their performance benefits. At the same time, we're investigating open-sourcing the real PAR file implementation used inside Google.
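
For reference, a sketch of what the standard-library resource API looks like in later Python versions. It assumes Python 3.9 or newer, where importlib.resources provides files() and as_file(). The archive is a plain zip and "res_pkg" and "model.bin" are hypothetical names; whether a subpar-built .par exposes its contents through this API is not confirmed in this thread:

```python
import importlib.resources
import os
import sys
import tempfile
import zipfile

# Throwaway zip archive with a hypothetical package and binary data file.
archive = os.path.join(tempfile.mkdtemp(), "resources_demo.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("res_pkg/__init__.py", "")
    zf.writestr("res_pkg/model.bin", b"\x00\x01\x02")

sys.path.insert(0, archive)

# Read the resource straight out of the archive.
resource = importlib.resources.files("res_pkg").joinpath("model.bin")
payload = resource.read_bytes()

# as_file() yields a real filesystem path, extracting to a temporary file
# when the resource lives inside an archive.
with importlib.resources.as_file(resource) as path:
    was_real_file = os.path.isfile(path)
    with open(path, "rb") as f:
        on_disk = f.read()
```

The as_file() step covers the "you really need an actual file" case discussed earlier without extracting the whole archive.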

@hwright
Author

hwright commented Dec 11, 2017

@duggelz If it's coming in Python 3.7, that means we'll only have to wait 3-4 years before it makes it into the distroless base images which rules_docker uses. :)

@depthwise

depthwise commented May 2, 2019

Could someone suggest how to deal with data deps provided by WORKSPACE? Basically, I'd like to embed a deep learning model and then read it with TFLite from inside. TFLite needs either a file, or byte representation of the model. The model does get embedded into PAR, but it's at the root level (if I unzip it), and therefore, it seems, pkgutil can't get to it.

The layout of the unpacked par is as follows:

tflite_models  __main__.py  subpar  <namespace name>

The models are inside tflite_models.

@depthwise

Answering my own question after digging through the code some more:

pkgutil.get_data("__main__", "tflite_models/detect_float.tflite")

gets the data
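
A small reproduction of this approach, offered as a sketch: it uses a plain zip with a __main__.py and a data directory at the archive root as a stand-in for a .par. The file names and contents are hypothetical, and a subpar-built .par may differ in details:

```python
import os
import subprocess
import sys
import tempfile
import zipfile

# Source of the archive's entry point. It reads a file that sits at the
# archive root by addressing it relative to the "__main__" module.
main_source = (
    "import pkgutil, sys\n"
    "data = pkgutil.get_data('__main__', 'tflite_models/model.bin')\n"
    "sys.stdout.write(data.decode('ascii'))\n"
)

archive = os.path.join(tempfile.mkdtemp(), "app.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("__main__.py", main_source)
    zf.writestr("tflite_models/model.bin", "fake-model-bytes")  # hypothetical

# Running the archive makes its __main__.py the "__main__" module, so the
# resource path is resolved relative to the archive root.
output = subprocess.check_output([sys.executable, archive])
```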

@dsculptor

If you have a build rule like:

par_binary(
    name = "alpha",
    data = ["data.txt"],
    ...
)

# Then the following commands are available:
bazel run //path/to/alpha        # This is the same as py_binary.
bazel run //path/to/alpha.par    # This is subpar in action.

However, it turns out only one of them can work:

  • Bazel recommends using its own runfiles library to access data files.
  • subpar recommends using pkgutil.

Can we have an ultimate example of a python project which uses a simple data file - and it works for both par as well as py_binary?
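
One possible direction, not confirmed by the subpar or Bazel maintainers in this thread: pkgutil.get_data goes through whichever loader imported the package, so the same call can read a data file whether the package came from a directory or from a zip archive. The sketch below only demonstrates that property with hypothetical packages ("disk_pkg", "zip_pkg") and a hypothetical helper (load_data). Whether a given py_binary places its data files next to the importing package is an assumption that would need to be verified:

```python
import os
import pkgutil
import sys
import tempfile
import zipfile


def load_data(package, resource):
    """Read a data file through the loader that imported the package."""
    data = pkgutil.get_data(package, resource)
    if data is None:
        # None means the package could not be located or its loader
        # does not support get_data.
        raise FileNotFoundError("cannot load %s from %s" % (resource, package))
    return data


root = tempfile.mkdtemp()

# Case 1: package imported from a plain directory.
os.makedirs(os.path.join(root, "disk_pkg"))
open(os.path.join(root, "disk_pkg", "__init__.py"), "w").close()
with open(os.path.join(root, "disk_pkg", "data.txt"), "w") as f:
    f.write("from disk")

# Case 2: package imported from a zip archive.
archive = os.path.join(root, "bundle.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("zip_pkg/__init__.py", "")
    zf.writestr("zip_pkg/data.txt", "from zip")

sys.path[:0] = [root, archive]

from_disk = load_data("disk_pkg", "data.txt")
from_zip = load_data("zip_pkg", "data.txt")
```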

@jackhumphries

jackhumphries commented Apr 1, 2021

Hi,

I'm having a similar issue. I have this project tree:

project/BUILD
project/experiments/scripts/BUILD

In project/BUILD, I have a cc_binary called agent. In project/experiments/scripts/BUILD, I have a py_library with a data dependency (data = ["//:agent"]) and a par_binary that depends on that py_library.

I've been trying for two hours and I can't figure out how to access the agent binary from my Python code. Does anyone know what I should put in place of the question mark below? Also, is there a constraint that a par_binary can only have a data dependency on a target that is in its own directory? I was having issues creating a par_binary from Python files in a subdirectory, so maybe that is part of the issue here.

pkgutil.get_data("?", "agent")

Attempts:

  1. pkgutil.get_data("", "agent") # Returns None
  2. pkgutil.get_data("experiments", "agent") # Returns None
  3. pkgutil.get_data("experiments.scripts", "agent") # Returns None
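
Since a .par can be unzipped (as noted earlier in this thread), one way to investigate is to list the archive's members and see where agent actually landed. A minimal diagnostic sketch follows; find_members is a hypothetical helper, and the stand-in archive and its member paths are assumptions, not the confirmed layout that subpar produces:

```python
import os
import tempfile
import zipfile


def find_members(archive_path, basename):
    """Return archive member paths whose final component equals basename."""
    with zipfile.ZipFile(archive_path) as zf:
        return [n for n in zf.namelist() if n.rsplit("/", 1)[-1] == basename]


# Stand-in archive with a hypothetical layout.
archive = os.path.join(tempfile.mkdtemp(), "demo.par")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("__main__.py", "")
    zf.writestr("some_workspace/agent", "binary-bytes")  # hypothetical location
    zf.writestr("some_workspace/experiments/scripts/run.py", "")

matches = find_members(archive, "agent")
```

If the entry turns up, the approach shown earlier of addressing a path relative to "__main__" may apply, though that is untested for this case.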


8 participants