Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not assume any code-tables on strings #1073

Closed
wants to merge 4 commits into from

Conversation

torgil
Copy link
Contributor

@torgil torgil commented Mar 27, 2017

Scripts should never assume what codetable an external user has
and try to decode. Doing that has huge potential to cause pain and
misery for users (a common python3 sickness nowadays). A better
option is for python3-users to live with b'' syntax or provide an
option --decode=xxx to each script.

Revert "Python 3 compatibility fixes around string handling (#986)"
This reverts commit 78948e4.

The offending patch produces the following test failure on Jessie
with backports 4.9 kernel:

Traceback (most recent call last):
File "../../tools/slabratetop.py", line 123, in
print("%-32s %6d %10d" % (k.name.decode(), v.count, v.size))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 2: ordinal not in range(128)

@r4f4
Copy link
Contributor

r4f4 commented Mar 29, 2017

How do you reproduce that error?

@goldshtn
Copy link
Collaborator

[buildbot, ok to test]

1 similar comment
@drzaeus77
Copy link
Collaborator

[buildbot, ok to test]

@torgil
Copy link
Contributor Author

torgil commented Mar 29, 2017

I may have two issues, one that gives me odd characters in the string, and (this one) that makes the script crash.

To get this error I just cloned latest bcc, followed the "jessie" instructions in the INSTALL.md and "make test" failed after building.

To reproduce any crash related to any decode('ascii') is easy, just have one char in the string with a value above 127.

My guess is that all "decode" is there to not show the "b''" in the output in python3. That could be fixed in numerous ways but the current method "decode("ascii")" is a hack, is crash-prone and imposes un-necessary restrictions. Also, this is completely meaningless for python2 users so at least make it python3 only to limit the damage.

@drzaeus77
Copy link
Collaborator

My guess is that all "decode" is there to not show the "b''" in the output in python3

That is an oversimplification of the issue. We have gone through many iterations of dealing with (un)encoded strings in bcc, while trying to be cross-compatible with python2/3. There are several interdependent conventions, which all need to be addressed in a common way, for instance:

  1. Passing a python string to ctypes lib requires bytes encoding
  2. Fetching output from subprocess (used in usdt) comes out bytes encoded, and parsing any such output into a native python string requires decoding.
  3. Things like regex need to have matching argument types
  4. Items stored in a dict() need to be fetched with a matching string/bytes type, otherwise silent lookup failures will result
  5. others I've forgotten over time

So, while the liberal use of encode/decode may be naive and ill-advised, it is an approach that we have learned how to code reliably, with unit tests to back up the implementation.

@torgil
Copy link
Contributor Author

torgil commented Mar 30, 2017

Well. This is a real PITA for me and according to your comment you're pretty much trapped in the new Python3 issues. I deal with numerous py2/py3-issues in my day-to-day work so for me to make a patch that keep current behavior in py3 and remove the issues in p2 is pretty straight forward.

Would you accept such a patch?

@drzaeus77
Copy link
Collaborator

Yes, we don't want to leave you stuck not being able to use these tools either, that's why we made them! My response was worded strongly in order to convey the work that we've put into this. Of your proposals, I'm curious to see more details about the --decode-xxx option, if that is one that is still a likely alternative in your mind.

@r4f4
Copy link
Contributor

r4f4 commented Mar 31, 2017

The --decode-xxx option can work for most of the tools, but some of them will still have to keep the coding/decoding when interfacing with C. For example in uthreads.py:

-    if event.type == "pthread":
+    etype = event.type.decode()
+    if etype == "pthread":

without the decode, the comparison is always false on python3. This is more than just a case of "not to show b'' ".

@palmtenor
Copy link
Member

How would the --decode option gets pass down to the library (like in __init__.py and table.py)? They all need to take an additional argument?

@torgil
Copy link
Contributor Author

torgil commented Apr 1, 2017

With decode, I meant that python3 script.py --decode=GB2312 results in decode('GB2312'). Having thought further I conclude that it just ads to the fire and doesn't really solve anything. Could we add an encoding that does nothing?

I have no good solution for the python3 issue, the problem seems distributed all over the code (grep gives 131 instances) and having to deal with it feels like a nonconstructive use of time. Is python (at least python3) the right tool for the job?

Debug is a "quick and dirty" business and python community is working hard to remove the "dirty" from the equation with "quick" taking large collateral damage. If we have to polish every embedded project under development out there to fit a shiny, dust-free, gold-plated environment that breaks down to pieces each time i little dirt enters then pretty soon polishing is all we do.

The codebase obviously needs a bigger face-lift to deal with this problem. In the meantime, I'm currently leaning towards something around this theme (tested):
for f in $(find . -type f -name "*.py"); do sed -b -i"" -e"s/[\.]\(en\|de\)code[\(][^\)]*[\)]//g" $f; done

Maybe provide two builds (and debian packages), one for "python2" and one for "python3". If we don't come up with something clever this is the way to go or a fork is likely to happen (no one wants to work with "non-problems" or "random crashes").

@torgil
Copy link
Contributor Author

torgil commented Apr 1, 2017

Here's a patch to keep python3 behavior but remove coding issues on python2: https://github.com/torgil/bcc/tree/no-decode-on-py2

The drawback with this and why I lean towards two builds is that there is two behaviors with dynamic selection. This could for instance be problems if someone wants the "python2" behavior but has "python" set to "python3". I assume this can be a common case in the future.

It may be better to explicitly say "apt-get install bcc-python2" and get "#!/usr/bin/python2" in the head of all scripts. This would be the normal case. Those who wants to live dangerous or only wants one python due to storage restrictions and it has to be python3 could run the "bcc-python3" version.

The patch dynamically choose behavior depending on python2 or python3 using different functions in 'src/python/bcc/init.py'

if sys.version_info[0] == 2:
    to_bytes = lambda string, encoding='ascii': string
    to_string = lambda string, encoding='ascii': string
elif sys.version_info[0] == 3:
    to_bytes = lambda string, encoding='ascii': string.decode(encoding)
    to_string = lambda bytes_in, encoding='ascii': bytes_in.encode(encoding)

The rest of the changes is based on these lines with a handful of fixes where the regexes didn't made it:

for f in $(find . -type f -name "*.py"); do sed -i"" -rbe"s/(([a-zA-Z0-9_\.]|\([^)]*\))+)\.decode\([^\)]*\)/to_string\(\1\)/g" $f; done
for f in $(git ls-files -m); do sed -i"" -rbe"s/from bcc import.*/\0, to_string/g" $f; done
git add .
for f in $(find . -type f -name "*.py"); do sed -i"" -rbe"s/(([a-zA-Z0-9_\.]|\([^)]*\))+)\.encode\([^\)]*\)/to_bytes\(\1\)/g" $f; done
for f in $(git ls-files -m); do sed -i"" -rbe"s/from bcc import.*/\0, to_bytes/g" $f; done
git add .

@torgil
Copy link
Contributor Author

torgil commented Apr 1, 2017

Patch is no longer a revert, that was a short-cut. Now it just removes all offending code.

@torgil
Copy link
Contributor Author

torgil commented Apr 3, 2017

Patch to remove offending code rebased. New patch added to replace "/usr/bin/python" with "/usr/bin/python2" made by

for f in $(find . -type f -name "*.py"); do sed -i"" -rbe"s%(^.*/usr/bin/pytho)n(\W.*$|$)%\1n2\2%g" $f; done

@drzaeus77
Copy link
Collaborator

[buildbot, test this please]

@drzaeus77
Copy link
Collaborator

Failures are due to buildbot maintenance, please ignore for now.

@torgil
Copy link
Contributor Author

torgil commented Apr 13, 2017

ping...

What is needed for this patch to go through?

@goldshtn
Copy link
Collaborator

@torgil Well, just to clarify the status: the end result is that we're just forcing python2 in all the tools? That sounds OK to me. @drzaeus77
Also, could you please rebase and take a look at the build failure, so we could merge if no one objects?

@4ast
Copy link
Member

4ast commented Apr 15, 2017

hmm. I don't think we can force everyone to use python2 only with bcc.
cc @palmtenor @markdrayton

@torgil
Copy link
Contributor Author

torgil commented Apr 18, 2017

@goldshtn rebased. another decode-fix added

@4ast As current implementation is broken i'd argue that we stop the bleeding and go from there. Any language we chose are "forced" on our users, our task is to chose the best language possible for the task. Do you have a better solution? Do you prefer the "to_bytes"/"to_string" -solution (see above)?

Scripts should never assume what codetable an external user has
and try to decode. Doing that has huge potential to cause pain and
misery for users (a common python3 sickness nowadays).

Traceback (most recent call last):
  File "../../tools/slabratetop.py", line 123, in <module>
    print("%-32s %6d %10d" % (k.name.decode(), v.count, v.size))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 2: ordinal not in range(128)

This patch removes all instances of decode and encode. This means it will have regressions on
python3 which will no longer work.
Python3 will not work du to the missing encode()/decode() calls needed
to interoperate to for instance ctypes.

Only support python2 until a proper solution to the encoding mess has
been found
@torgil
Copy link
Contributor Author

torgil commented Apr 21, 2017

Ping...

This patch applies cleanly and solves a real problem. What's missing for it to get merged?

@drzaeus77
Copy link
Collaborator

I still have a somewhat allergic reaction to this, and now that I'm caught up after a vacation I am going to sit down this weekend and try out the various failure modes just to make sure I fully understand the problem and that the fix is fullproof. Sorry for the delay, please bear with me.

@palmtenor
Copy link
Member

Do we need to change the strings passed from Kernel space (like those event.xxx)? Shouldn't they always be ASCII?

debian/control Outdated
@@ -6,7 +6,8 @@ Standards-Version: 3.9.5
Build-Depends: debhelper (>= 9), cmake, libllvm3.7 | libllvm3.8,
llvm-3.7-dev | llvm-3.8-dev, libclang-3.7-dev | libclang-3.8-dev,
libelf-dev, bison, flex, libedit-dev,
clang-format | clang-format-3.7 | clang-format-3.8, python (>= 2.7),
clang-format | clang-format-3.7 | clang-format-3.8,
python (>= 2.7), python (< 3.0),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

< 3.0 is no go

@torgil
Copy link
Contributor Author

torgil commented Apr 22, 2017

@4ast fixed. Pushed without that change

@drzaeus77 No problem. We want to get this well thought through. I hate to say this, I've used Python extensively for debug (HW/SW) since 2003, especially interactive mode towards c-components, but python3 has really taken the "seam" out of "seamless" with regards to productivity and interoperability with external components partly due to these issues.

@palmtenor assuming ASCII is exactly the problem I want to address, debug tools should make as few assumptions as possible (keeping data "as is" as long as possible). We could assume that the string is null-terminated or capped att field size. Apart from specific tools, the python module itself could possible be used on every embedded driver out there and/or user-space-tools (at least I'm planning to). I don't want developers to put any time on "academic" (?) decode/encode-issues, It's bad enough we need to have this discussion.

@torgil
Copy link
Contributor Author

torgil commented Apr 24, 2017

I have another suggestion. Since Python2 development is basically frozen (see PEP404) and I want to make the python/c-interaction even more seamless we could keep python scripts as is (with decode/encode and other workarounds) and I could add a Tauthon wrapper to bcc. If it's very clear what's "Tauthon" and what's "Python" there shouldn't be any confusions as to what to expect from the scripts. My problems will be solved (with some non-trivial amount of work - but that's ok as long as I get a lean and mean debugging machine) and you'll get your python3-compatibility.

https://github.com/naftaliharris/tauthon

@torgil
Copy link
Contributor Author

torgil commented Apr 24, 2017

Related issue in Tauthon
naftaliharris/tauthon#79

@ambv
Copy link

ambv commented Apr 24, 2017

Making gcc depend on Tauthon would be irresponsible. It's an experimental unsupported project that doesn't get us in any way compatible with Python 3. If you're already considering forcing everybody to have to install a Tauthon runtime on their machines to be able to run bcc, then I'll have a better suggestion: let's just migrate everything to Python 3 and be done with it.

Python 3 enables you to do bytes-only processing if you need it, and will refuse to assume "ascii", instead raising exceptions on improper string/bytes usage.

@drzaeus77
Copy link
Collaborator

I think Tauthon is indeed a dangerous road to go down, let's avoid it.

I also think that the effort to adopt a safe but version-independent approach is tangible. There are 3 different and competing goals that I'm keeping in mind:

  1. BCC is partly a debugging tool, hence it should be reliable in the face of unexpected scenarios which it's users are attempting to troubleshoot. Handling of unexpected or malformed data inputs coming from the units being probed/traced should be predictable. Currently (as @torgil has pointed out) it is not.
  2. BCC is a new tool in the community, and one of our goals as developers is to see its adoption spread. This means being flexible when Linux distributions bundle our code alongside other packages, including whichever python version they may have. We can't afford to be opinionated here. Also, since we have already heard many requests for bcc to support arcane distributions/architectures/form factors, we should strive to include fewer non-standard dependencies (e.g. Tauthon). LLVM is already a large burden here.
  3. We're still actively developing BCC, which means we can't adopt coding styles that make it harder for the developers to make correct decisions when writing their tools. Forcing a bytes() representation is contrary to that, or at least that has been the thinking so far. However, looking at point 1 above, I think we're probably masking a bunch of bugs with the way we use decode() liberally in our code. Let's try to fix this, and engineer ergonomic solutions when we can.

I spent a bit of time with execsnoop, trying to come up with a not-too-kludgy approach that we can replicate (it was surprisingly difficult). So far this doesn't at all address uses of encode/decode inside the bcc library, but for that I am intending to go for a bytes()-everywhere approach.

Please see the branch at https://github.com/iovisor/bcc/commits/python_bytes for details.

In summary, changes that I'm proposing:
Create ergonomic wrappers for some common bytes/str codec operations
Move internal code to b"" syntax where possible.
Get rid of encode/decode usage, use wrappers instead. Wrappers should be only for ingesting data from pseudo-trusted sources (e.g. command line), not from the probes/tracing data sources.

@ambv
Copy link

ambv commented Apr 24, 2017

In summary, changes that I'm proposing:
Create ergonomic wrappers for some common bytes/str codec operations
Move internal code to b"" syntax where possible.
Get rid of encode/decode usage, use wrappers instead. Wrappers should be only for ingesting data from pseudo-trusted sources (e.g. command line), not from the probes/tracing data sources.

Which common bytes/str operations do you mean? The operations are already clear and ergonomic, the problem is that the encoding value that should be used with them is unknown, as @torgil is pointing out. If there's anything non-ASCII to transmit, we need to know the encoding. The question is how common that situation is in practice. It might just be enough to default to UTF-8 everywhere instead and move on.

@torgil, what kind of bytes are you trying to pass through that fail on ASCII?

@drzaeus77
Copy link
Collaborator

Which common bytes/str operations do you mean?

Taking an argument from the command line via argparse, for instance. If interpreter is python2, we should call decode(sys.getfilesystemencoding(). If interpreter is python3, we should do nothing. When converting into bytes, we should do encode(sys.getfilesystemencoding()). See http://legacy.python.org/dev/peps/pep-0383/ for background.

Here is a test for execsnoop:

# note that the target is ρ (lowercase greek rho) not ascii p
sudo ln -s /usr/bin/printf /usr/local/bin/ρrintf
sudo ./execsnoop.py -n "ρrintf"
# elsewhere try:
ρrintf "%s\n" √

Another bag of worms could be getting the USDT infra to work with non-ascii characters in .so or binary names.

@torgil
Copy link
Contributor Author

torgil commented May 16, 2017

@ambv "Making gcc depend on Tauthon would be irresponsible."
That is not irresponsible, that is smart and already happening here. For me productivity (read get things done) is key and for this numpy & company is still a big boost (especially interactively). Python3 is not an option for me for the same reasons (read productivity regression). Python2 is probably sufficient but is dead regarding development so we won't get the next level productivity from there. Until a new killer environment for productivity appears, Tauthon is a pretty good deal.

@torgil
Copy link
Contributor Author

torgil commented May 16, 2017

@drzaeus77 "including whichever python version they may have"
What made Python the choice for BCC to begin with? You didn't see this problems coming?

"If interpreter is python2, we should call decode(sys.getfilesystemencoding()"
I'd rather se a no-op for python2 (see my no-decode-on-py2 branch), make the kludgy stuff on Python3 if possible such that people able to avoid it.

@ambv "what kind of bytes are you trying to pass through that fail on ASCII?"
This is like asking "why would anyone need more than 640k memory". In my case I had a crash just by running the tests and going forward given the evolution there could be endless possibilities for ASCII above 127 showing up. The trick is to be lean and remove all sources of errors where possible and have tools with the least restrictions, especially when this restriction is just a Python3 software construct it seems easily avoidable.

@4ast
Copy link
Member

4ast commented Oct 26, 2017

have been pending for too long. Please resubmit if it's still relevant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants