New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not assume any code-tables on strings #1073
Conversation
How do you reproduce that error? |
[buildbot, ok to test] |
1 similar comment
[buildbot, ok to test] |
I may have two issues, one that gives me odd characters in the string, and (this one) that makes the script crash. To get this error I just cloned latest bcc, followed the "jessie" instructions in the INSTALL.md and "make test" failed after building. To reproduce any crash related to any decode('ascii') is easy, just have one char in the string with a value above 127. My guess is that all "decode" is there to not show the "b''" in the output in python3. That could be fixed in numerous ways but the current method "decode("ascii")" is a hack, is crash-prone and imposes un-necessary restrictions. Also, this is completely meaningless for python2 users so at least make it python3 only to limit the damage. |
That is an oversimplification of the issue. We have gone through many iterations of dealing with (un)encoded strings in bcc, while trying to be cross-compatible with python2/3. There are several interdependent conventions, which all need to be addressed in a common way, for instance:
So, while the liberal use of encode/decode may be naive and ill-advised, it is an approach that we have learned how to code reliably, with unit tests to back up the implementation. |
Well. This is a real PITA for me and according to your comment you're pretty much trapped in the new Python3 issues. I deal with numerous py2/py3-issues in my day-to-day work so for me to make a patch that keep current behavior in py3 and remove the issues in p2 is pretty straight forward. Would you accept such a patch? |
Yes, we don't want to leave you stuck not being able to use these tools either, that's why we made them! My response was worded strongly in order to convey the work that we've put into this. Of your proposals, I'm curious to see more details about the |
The --decode-xxx option can work for most of the tools, but some of them will still have to keep the coding/decoding when interfacing with C. For example in uthreads.py:
without the decode, the comparison is always false on python3. This is more than just a case of "not to show b'' ". |
How would the |
With decode, I meant that I have no good solution for the python3 issue, the problem seems distributed all over the code (grep gives 131 instances) and having to deal with it feels like a nonconstructive use of time. Is python (at least python3) the right tool for the job? Debug is a "quick and dirty" business and python community is working hard to remove the "dirty" from the equation with "quick" taking large collateral damage. If we have to polish every embedded project under development out there to fit a shiny, dust-free, gold-plated environment that breaks down to pieces each time i little dirt enters then pretty soon polishing is all we do. The codebase obviously needs a bigger face-lift to deal with this problem. In the meantime, I'm currently leaning towards something around this theme (tested): Maybe provide two builds (and debian packages), one for "python2" and one for "python3". If we don't come up with something clever this is the way to go or a fork is likely to happen (no one wants to work with "non-problems" or "random crashes"). |
Here's a patch to keep python3 behavior but remove coding issues on python2: https://github.com/torgil/bcc/tree/no-decode-on-py2 The drawback with this and why I lean towards two builds is that there is two behaviors with dynamic selection. This could for instance be problems if someone wants the "python2" behavior but has "python" set to "python3". I assume this can be a common case in the future. It may be better to explicitly say "apt-get install bcc-python2" and get "#!/usr/bin/python2" in the head of all scripts. This would be the normal case. Those who wants to live dangerous or only wants one python due to storage restrictions and it has to be python3 could run the "bcc-python3" version. The patch dynamically choose behavior depending on python2 or python3 using different functions in 'src/python/bcc/init.py'
The rest of the changes is based on these lines with a handful of fixes where the regexes didn't made it:
|
Patch is no longer a revert, that was a short-cut. Now it just removes all offending code. |
Patch to remove offending code rebased. New patch added to replace "/usr/bin/python" with "/usr/bin/python2" made by
|
[buildbot, test this please] |
Failures are due to buildbot maintenance, please ignore for now. |
ping... What is needed for this patch to go through? |
@torgil Well, just to clarify the status: the end result is that we're just forcing python2 in all the tools? That sounds OK to me. @drzaeus77 |
hmm. I don't think we can force everyone to use python2 only with bcc. |
@goldshtn rebased. another decode-fix added @4ast As current implementation is broken i'd argue that we stop the bleeding and go from there. Any language we chose are "forced" on our users, our task is to chose the best language possible for the task. Do you have a better solution? Do you prefer the "to_bytes"/"to_string" -solution (see above)? |
Scripts should never assume what codetable an external user has and try to decode. Doing that has huge potential to cause pain and misery for users (a common python3 sickness nowadays). Traceback (most recent call last): File "../../tools/slabratetop.py", line 123, in <module> print("%-32s %6d %10d" % (k.name.decode(), v.count, v.size)) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 2: ordinal not in range(128) This patch removes all instances of decode and encode. This means it will have regressions on python3 which will no longer work.
Python3 will not work du to the missing encode()/decode() calls needed to interoperate to for instance ctypes. Only support python2 until a proper solution to the encoding mess has been found
Ping... This patch applies cleanly and solves a real problem. What's missing for it to get merged? |
I still have a somewhat allergic reaction to this, and now that I'm caught up after a vacation I am going to sit down this weekend and try out the various failure modes just to make sure I fully understand the problem and that the fix is fullproof. Sorry for the delay, please bear with me. |
Do we need to change the strings passed from Kernel space (like those |
debian/control
Outdated
@@ -6,7 +6,8 @@ Standards-Version: 3.9.5 | |||
Build-Depends: debhelper (>= 9), cmake, libllvm3.7 | libllvm3.8, | |||
llvm-3.7-dev | llvm-3.8-dev, libclang-3.7-dev | libclang-3.8-dev, | |||
libelf-dev, bison, flex, libedit-dev, | |||
clang-format | clang-format-3.7 | clang-format-3.8, python (>= 2.7), | |||
clang-format | clang-format-3.7 | clang-format-3.8, | |||
python (>= 2.7), python (< 3.0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
< 3.0 is no go
@4ast fixed. Pushed without that change @drzaeus77 No problem. We want to get this well thought through. I hate to say this, I've used Python extensively for debug (HW/SW) since 2003, especially interactive mode towards c-components, but python3 has really taken the "seam" out of "seamless" with regards to productivity and interoperability with external components partly due to these issues. @palmtenor assuming ASCII is exactly the problem I want to address, debug tools should make as few assumptions as possible (keeping data "as is" as long as possible). We could assume that the string is null-terminated or capped att field size. Apart from specific tools, the python module itself could possible be used on every embedded driver out there and/or user-space-tools (at least I'm planning to). I don't want developers to put any time on "academic" (?) decode/encode-issues, It's bad enough we need to have this discussion. |
I have another suggestion. Since Python2 development is basically frozen (see PEP404) and I want to make the python/c-interaction even more seamless we could keep python scripts as is (with decode/encode and other workarounds) and I could add a Tauthon wrapper to bcc. If it's very clear what's "Tauthon" and what's "Python" there shouldn't be any confusions as to what to expect from the scripts. My problems will be solved (with some non-trivial amount of work - but that's ok as long as I get a lean and mean debugging machine) and you'll get your python3-compatibility. |
Related issue in Tauthon |
Making gcc depend on Tauthon would be irresponsible. It's an experimental unsupported project that doesn't get us in any way compatible with Python 3. If you're already considering forcing everybody to have to install a Tauthon runtime on their machines to be able to run bcc, then I'll have a better suggestion: let's just migrate everything to Python 3 and be done with it. Python 3 enables you to do bytes-only processing if you need it, and will refuse to assume "ascii", instead raising exceptions on improper string/bytes usage. |
I think Tauthon is indeed a dangerous road to go down, let's avoid it. I also think that the effort to adopt a safe but version-independent approach is tangible. There are 3 different and competing goals that I'm keeping in mind:
I spent a bit of time with execsnoop, trying to come up with a not-too-kludgy approach that we can replicate (it was surprisingly difficult). So far this doesn't at all address uses of encode/decode inside the bcc library, but for that I am intending to go for a bytes()-everywhere approach. Please see the branch at https://github.com/iovisor/bcc/commits/python_bytes for details. In summary, changes that I'm proposing: |
Which common bytes/str operations do you mean? The operations are already clear and ergonomic, the problem is that the encoding value that should be used with them is unknown, as @torgil is pointing out. If there's anything non-ASCII to transmit, we need to know the encoding. The question is how common that situation is in practice. It might just be enough to default to UTF-8 everywhere instead and move on. @torgil, what kind of bytes are you trying to pass through that fail on ASCII? |
Taking an argument from the command line via argparse, for instance. If interpreter is python2, we should call decode(sys.getfilesystemencoding(). If interpreter is python3, we should do nothing. When converting into bytes, we should do encode(sys.getfilesystemencoding()). See http://legacy.python.org/dev/peps/pep-0383/ for background. Here is a test for execsnoop:
Another bag of worms could be getting the USDT infra to work with non-ascii characters in .so or binary names. |
@ambv "Making gcc depend on Tauthon would be irresponsible." |
@drzaeus77 "including whichever python version they may have" "If interpreter is python2, we should call decode(sys.getfilesystemencoding()" @ambv "what kind of bytes are you trying to pass through that fail on ASCII?" |
have been pending for too long. Please resubmit if it's still relevant. |
Scripts should never assume what codetable an external user has
and try to decode. Doing that has huge potential to cause pain and
misery for users (a common python3 sickness nowadays). A better
option is for python3-users to live with b'' syntax or provide an
option --decode=xxx to each script.
Revert "Python 3 compatibility fixes around string handling (#986)"
This reverts commit 78948e4.
The offending patch produces the following test failure on Jessie
with backports 4.9 kernel:
Traceback (most recent call last):
File "../../tools/slabratetop.py", line 123, in
print("%-32s %6d %10d" % (k.name.decode(), v.count, v.size))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 2: ordinal not in range(128)