Do not assume any code-tables on strings #1073

torgil · 2017-03-27T19:39:11Z

Scripts should never assume what codetable an external user has
and try to decode. Doing that has huge potential to cause pain and
misery for users (a common python3 sickness nowadays). A better
option is for python3-users to live with b'' syntax or provide an
option --decode=xxx to each script.

Revert "Python 3 compatibility fixes around string handling (#986)"
This reverts commit 78948e4.

The offending patch produces the following test failure on Jessie
with backports 4.9 kernel:

Traceback (most recent call last):
File "../../tools/slabratetop.py", line 123, in
print("%-32s %6d %10d" % (k.name.decode(), v.count, v.size))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 2: ordinal not in range(128)

r4f4 · 2017-03-29T08:58:13Z

How do you reproduce that error?

goldshtn · 2017-03-29T10:25:53Z

[buildbot, ok to test]

drzaeus77 · 2017-03-29T17:31:58Z

[buildbot, ok to test]

torgil · 2017-03-29T18:31:52Z

I may have two issues, one that gives me odd characters in the string, and (this one) that makes the script crash.

To get this error I just cloned latest bcc, followed the "jessie" instructions in the INSTALL.md and "make test" failed after building.

To reproduce any crash related to any decode('ascii') is easy, just have one char in the string with a value above 127.

My guess is that all "decode" is there to not show the "b''" in the output in python3. That could be fixed in numerous ways but the current method "decode("ascii")" is a hack, is crash-prone and imposes un-necessary restrictions. Also, this is completely meaningless for python2 users so at least make it python3 only to limit the damage.

drzaeus77 · 2017-03-29T18:55:03Z

My guess is that all "decode" is there to not show the "b''" in the output in python3

That is an oversimplification of the issue. We have gone through many iterations of dealing with (un)encoded strings in bcc, while trying to be cross-compatible with python2/3. There are several interdependent conventions, which all need to be addressed in a common way, for instance:

Passing a python string to ctypes lib requires bytes encoding
Fetching output from subprocess (used in usdt) comes out bytes encoded, and parsing any such output into a native python string requires decoding.
Things like regex need to have matching argument types
Items stored in a dict() need to be fetched with a matching string/bytes type, otherwise silent lookup failures will result
others I've forgotten over time

So, while the liberal use of encode/decode may be naive and ill-advised, it is an approach that we have learned how to code reliably, with unit tests to back up the implementation.

torgil · 2017-03-30T20:04:16Z

Well. This is a real PITA for me and according to your comment you're pretty much trapped in the new Python3 issues. I deal with numerous py2/py3-issues in my day-to-day work so for me to make a patch that keep current behavior in py3 and remove the issues in p2 is pretty straight forward.

Would you accept such a patch?

drzaeus77 · 2017-03-30T20:10:19Z

Yes, we don't want to leave you stuck not being able to use these tools either, that's why we made them! My response was worded strongly in order to convey the work that we've put into this. Of your proposals, I'm curious to see more details about the --decode-xxx option, if that is one that is still a likely alternative in your mind.

r4f4 · 2017-03-31T14:55:10Z

The --decode-xxx option can work for most of the tools, but some of them will still have to keep the coding/decoding when interfacing with C. For example in uthreads.py:

-    if event.type == "pthread":
+    etype = event.type.decode()
+    if etype == "pthread":

without the decode, the comparison is always false on python3. This is more than just a case of "not to show b'' ".

palmtenor · 2017-03-31T20:50:17Z

How would the --decode option gets pass down to the library (like in __init__.py and table.py)? They all need to take an additional argument?

torgil · 2017-04-01T10:35:03Z

With decode, I meant that python3 script.py --decode=GB2312 results in decode('GB2312'). Having thought further I conclude that it just ads to the fire and doesn't really solve anything. Could we add an encoding that does nothing?

I have no good solution for the python3 issue, the problem seems distributed all over the code (grep gives 131 instances) and having to deal with it feels like a nonconstructive use of time. Is python (at least python3) the right tool for the job?

Debug is a "quick and dirty" business and python community is working hard to remove the "dirty" from the equation with "quick" taking large collateral damage. If we have to polish every embedded project under development out there to fit a shiny, dust-free, gold-plated environment that breaks down to pieces each time i little dirt enters then pretty soon polishing is all we do.

The codebase obviously needs a bigger face-lift to deal with this problem. In the meantime, I'm currently leaning towards something around this theme (tested):
for f in $(find . -type f -name "*.py"); do sed -b -i"" -e"s/[\.]$en\|de$code[$][^$]*[\)]//g" $f; done

Maybe provide two builds (and debian packages), one for "python2" and one for "python3". If we don't come up with something clever this is the way to go or a fork is likely to happen (no one wants to work with "non-problems" or "random crashes").

torgil · 2017-04-01T14:12:01Z

Here's a patch to keep python3 behavior but remove coding issues on python2: https://github.com/torgil/bcc/tree/no-decode-on-py2

The drawback with this and why I lean towards two builds is that there is two behaviors with dynamic selection. This could for instance be problems if someone wants the "python2" behavior but has "python" set to "python3". I assume this can be a common case in the future.

It may be better to explicitly say "apt-get install bcc-python2" and get "#!/usr/bin/python2" in the head of all scripts. This would be the normal case. Those who wants to live dangerous or only wants one python due to storage restrictions and it has to be python3 could run the "bcc-python3" version.

The patch dynamically choose behavior depending on python2 or python3 using different functions in 'src/python/bcc/init.py'

if sys.version_info[0] == 2:
    to_bytes = lambda string, encoding='ascii': string
    to_string = lambda string, encoding='ascii': string
elif sys.version_info[0] == 3:
    to_bytes = lambda string, encoding='ascii': string.decode(encoding)
    to_string = lambda bytes_in, encoding='ascii': bytes_in.encode(encoding)

The rest of the changes is based on these lines with a handful of fixes where the regexes didn't made it:

for f in $(find . -type f -name "*.py"); do sed -i"" -rbe"s/(([a-zA-Z0-9_\.]|\([^)]*\))+)\.decode\([^\)]*\)/to_string\(\1\)/g" $f; done
for f in $(git ls-files -m); do sed -i"" -rbe"s/from bcc import.*/\0, to_string/g" $f; done
git add .
for f in $(find . -type f -name "*.py"); do sed -i"" -rbe"s/(([a-zA-Z0-9_\.]|\([^)]*\))+)\.encode\([^\)]*\)/to_bytes\(\1\)/g" $f; done
for f in $(git ls-files -m); do sed -i"" -rbe"s/from bcc import.*/\0, to_bytes/g" $f; done
git add .

torgil · 2017-04-01T14:32:33Z

Patch is no longer a revert, that was a short-cut. Now it just removes all offending code.

torgil · 2017-04-03T22:24:05Z

Patch to remove offending code rebased. New patch added to replace "/usr/bin/python" with "/usr/bin/python2" made by

for f in $(find . -type f -name "*.py"); do sed -i"" -rbe"s%(^.*/usr/bin/pytho)n(\W.*$|$)%\1n2\2%g" $f; done

drzaeus77 · 2017-04-03T23:29:34Z

[buildbot, test this please]

drzaeus77 · 2017-04-03T23:45:12Z

Failures are due to buildbot maintenance, please ignore for now.

torgil · 2017-04-13T20:33:28Z

ping...

What is needed for this patch to go through?

goldshtn · 2017-04-14T07:30:14Z

@torgil Well, just to clarify the status: the end result is that we're just forcing python2 in all the tools? That sounds OK to me. @drzaeus77
Also, could you please rebase and take a look at the build failure, so we could merge if no one objects?

4ast · 2017-04-15T01:02:29Z

hmm. I don't think we can force everyone to use python2 only with bcc.
cc @palmtenor @markdrayton

torgil · 2017-04-18T18:10:22Z

@goldshtn rebased. another decode-fix added

@4ast As current implementation is broken i'd argue that we stop the bleeding and go from there. Any language we chose are "forced" on our users, our task is to chose the best language possible for the task. Do you have a better solution? Do you prefer the "to_bytes"/"to_string" -solution (see above)?

Scripts should never assume what codetable an external user has and try to decode. Doing that has huge potential to cause pain and misery for users (a common python3 sickness nowadays). Traceback (most recent call last): File "../../tools/slabratetop.py", line 123, in <module> print("%-32s %6d %10d" % (k.name.decode(), v.count, v.size)) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 2: ordinal not in range(128) This patch removes all instances of decode and encode. This means it will have regressions on python3 which will no longer work.

Python3 will not work du to the missing encode()/decode() calls needed to interoperate to for instance ctypes. Only support python2 until a proper solution to the encoding mess has been found

torgil · 2017-04-21T21:58:18Z

Ping...

This patch applies cleanly and solves a real problem. What's missing for it to get merged?

drzaeus77 · 2017-04-21T22:41:09Z

I still have a somewhat allergic reaction to this, and now that I'm caught up after a vacation I am going to sit down this weekend and try out the various failure modes just to make sure I fully understand the problem and that the fix is fullproof. Sorry for the delay, please bear with me.

palmtenor · 2017-04-21T23:11:53Z

Do we need to change the strings passed from Kernel space (like those event.xxx)? Shouldn't they always be ASCII?

4ast · 2017-04-22T00:32:39Z

debian/control

@@ -6,7 +6,8 @@ Standards-Version: 3.9.5
 Build-Depends: debhelper (>= 9), cmake, libllvm3.7 | libllvm3.8,
    llvm-3.7-dev | llvm-3.8-dev, libclang-3.7-dev | libclang-3.8-dev,
    libelf-dev, bison, flex, libedit-dev,
-    clang-format | clang-format-3.7 | clang-format-3.8, python (>= 2.7),
+    clang-format | clang-format-3.7 | clang-format-3.8,
+    python (>= 2.7), python (< 3.0),


< 3.0 is no go

torgil · 2017-04-22T16:25:40Z

@4ast fixed. Pushed without that change

@drzaeus77 No problem. We want to get this well thought through. I hate to say this, I've used Python extensively for debug (HW/SW) since 2003, especially interactive mode towards c-components, but python3 has really taken the "seam" out of "seamless" with regards to productivity and interoperability with external components partly due to these issues.

@palmtenor assuming ASCII is exactly the problem I want to address, debug tools should make as few assumptions as possible (keeping data "as is" as long as possible). We could assume that the string is null-terminated or capped att field size. Apart from specific tools, the python module itself could possible be used on every embedded driver out there and/or user-space-tools (at least I'm planning to). I don't want developers to put any time on "academic" (?) decode/encode-issues, It's bad enough we need to have this discussion.

torgil · 2017-04-24T17:40:04Z

I have another suggestion. Since Python2 development is basically frozen (see PEP404) and I want to make the python/c-interaction even more seamless we could keep python scripts as is (with decode/encode and other workarounds) and I could add a Tauthon wrapper to bcc. If it's very clear what's "Tauthon" and what's "Python" there shouldn't be any confusions as to what to expect from the scripts. My problems will be solved (with some non-trivial amount of work - but that's ok as long as I get a lean and mean debugging machine) and you'll get your python3-compatibility.

https://github.com/naftaliharris/tauthon

torgil · 2017-04-24T19:41:29Z

Related issue in Tauthon
naftaliharris/tauthon#79

ambv · 2017-04-24T19:44:21Z

Making gcc depend on Tauthon would be irresponsible. It's an experimental unsupported project that doesn't get us in any way compatible with Python 3. If you're already considering forcing everybody to have to install a Tauthon runtime on their machines to be able to run bcc, then I'll have a better suggestion: let's just migrate everything to Python 3 and be done with it.

Python 3 enables you to do bytes-only processing if you need it, and will refuse to assume "ascii", instead raising exceptions on improper string/bytes usage.

drzaeus77 · 2017-04-24T22:22:22Z

I think Tauthon is indeed a dangerous road to go down, let's avoid it.

I also think that the effort to adopt a safe but version-independent approach is tangible. There are 3 different and competing goals that I'm keeping in mind:

BCC is partly a debugging tool, hence it should be reliable in the face of unexpected scenarios which it's users are attempting to troubleshoot. Handling of unexpected or malformed data inputs coming from the units being probed/traced should be predictable. Currently (as @torgil has pointed out) it is not.
BCC is a new tool in the community, and one of our goals as developers is to see its adoption spread. This means being flexible when Linux distributions bundle our code alongside other packages, including whichever python version they may have. We can't afford to be opinionated here. Also, since we have already heard many requests for bcc to support arcane distributions/architectures/form factors, we should strive to include fewer non-standard dependencies (e.g. Tauthon). LLVM is already a large burden here.
We're still actively developing BCC, which means we can't adopt coding styles that make it harder for the developers to make correct decisions when writing their tools. Forcing a bytes() representation is contrary to that, or at least that has been the thinking so far. However, looking at point 1 above, I think we're probably masking a bunch of bugs with the way we use decode() liberally in our code. Let's try to fix this, and engineer ergonomic solutions when we can.

I spent a bit of time with execsnoop, trying to come up with a not-too-kludgy approach that we can replicate (it was surprisingly difficult). So far this doesn't at all address uses of encode/decode inside the bcc library, but for that I am intending to go for a bytes()-everywhere approach.

Please see the branch at https://github.com/iovisor/bcc/commits/python_bytes for details.

In summary, changes that I'm proposing:
Create ergonomic wrappers for some common bytes/str codec operations
Move internal code to b"" syntax where possible.
Get rid of encode/decode usage, use wrappers instead. Wrappers should be only for ingesting data from pseudo-trusted sources (e.g. command line), not from the probes/tracing data sources.

ambv · 2017-04-24T23:15:58Z

In summary, changes that I'm proposing:
Create ergonomic wrappers for some common bytes/str codec operations
Move internal code to b"" syntax where possible.
Get rid of encode/decode usage, use wrappers instead. Wrappers should be only for ingesting data from pseudo-trusted sources (e.g. command line), not from the probes/tracing data sources.

Which common bytes/str operations do you mean? The operations are already clear and ergonomic, the problem is that the encoding value that should be used with them is unknown, as @torgil is pointing out. If there's anything non-ASCII to transmit, we need to know the encoding. The question is how common that situation is in practice. It might just be enough to default to UTF-8 everywhere instead and move on.

@torgil, what kind of bytes are you trying to pass through that fail on ASCII?

drzaeus77 · 2017-04-25T01:32:47Z

Which common bytes/str operations do you mean?

Taking an argument from the command line via argparse, for instance. If interpreter is python2, we should call decode(sys.getfilesystemencoding(). If interpreter is python3, we should do nothing. When converting into bytes, we should do encode(sys.getfilesystemencoding()). See http://legacy.python.org/dev/peps/pep-0383/ for background.

Here is a test for execsnoop:

# note that the target is ρ (lowercase greek rho) not ascii p
sudo ln -s /usr/bin/printf /usr/local/bin/ρrintf
sudo ./execsnoop.py -n "ρrintf"
# elsewhere try:
ρrintf "%s\n" √

Another bag of worms could be getting the USDT infra to work with non-ascii characters in .so or binary names.

torgil · 2017-05-16T19:26:41Z

@ambv "Making gcc depend on Tauthon would be irresponsible."
That is not irresponsible, that is smart and already happening here. For me productivity (read get things done) is key and for this numpy & company is still a big boost (especially interactively). Python3 is not an option for me for the same reasons (read productivity regression). Python2 is probably sufficient but is dead regarding development so we won't get the next level productivity from there. Until a new killer environment for productivity appears, Tauthon is a pretty good deal.

torgil · 2017-05-16T19:40:50Z

@drzaeus77 "including whichever python version they may have"
What made Python the choice for BCC to begin with? You didn't see this problems coming?

"If interpreter is python2, we should call decode(sys.getfilesystemencoding()"
I'd rather se a no-op for python2 (see my no-decode-on-py2 branch), make the kludgy stuff on Python3 if possible such that people able to avoid it.

@ambv "what kind of bytes are you trying to pass through that fail on ASCII?"
This is like asking "why would anyone need more than 640k memory". In my case I had a crash just by running the tests and going forward given the evolution there could be endless possibilities for ASCII above 127 showing up. The trick is to be lean and remove all sources of errors where possible and have tools with the least restrictions, especially when this restriction is just a Python3 software construct it seems easily avoidable.

4ast · 2017-10-26T03:34:20Z

have been pending for too long. Please resubmit if it's still relevant.

torgil closed this Apr 1, 2017

torgil force-pushed the no_decode branch from dcaf0e1 to 26383f2 Compare April 1, 2017 14:14

torgil reopened this Apr 1, 2017

torgil force-pushed the no_decode branch from 6a7fe17 to 1b3501e Compare April 3, 2017 22:22

torgil force-pushed the no_decode branch from 4367663 to 76688b8 Compare April 18, 2017 18:03

torgil added 2 commits April 20, 2017 02:59

Change all /usr/bin/python to /usr/bin/python2

8acfff4

Python3 will not work du to the missing encode()/decode() calls needed to interoperate to for instance ctypes. Only support python2 until a proper solution to the encoding mess has been found

torgil force-pushed the no_decode branch from 76688b8 to 1add2e6 Compare April 20, 2017 01:00

4ast reviewed Apr 22, 2017

View reviewed changes

decode fix: bytes.decode removed from section_print_fn

db31bf0

torgil force-pushed the no_decode branch from 1add2e6 to db31bf0 Compare April 22, 2017 15:32

Merge branch 'master' into no_decode

108440b

4ast closed this Oct 26, 2017

palmtenor mentioned this pull request Feb 7, 2018

python2/3 compatibility issue in tools regarding byte/str encoding #1579

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not assume any code-tables on strings #1073

Do not assume any code-tables on strings #1073

torgil commented Mar 27, 2017

r4f4 commented Mar 29, 2017

goldshtn commented Mar 29, 2017

drzaeus77 commented Mar 29, 2017

torgil commented Mar 29, 2017

drzaeus77 commented Mar 29, 2017

torgil commented Mar 30, 2017 •

edited

drzaeus77 commented Mar 30, 2017

r4f4 commented Mar 31, 2017 •

edited

palmtenor commented Mar 31, 2017

torgil commented Apr 1, 2017

torgil commented Apr 1, 2017

torgil commented Apr 1, 2017

torgil commented Apr 3, 2017

drzaeus77 commented Apr 3, 2017

drzaeus77 commented Apr 3, 2017

torgil commented Apr 13, 2017

goldshtn commented Apr 14, 2017

4ast commented Apr 15, 2017

torgil commented Apr 18, 2017

torgil commented Apr 21, 2017

drzaeus77 commented Apr 21, 2017

palmtenor commented Apr 21, 2017

4ast Apr 22, 2017

torgil commented Apr 22, 2017

torgil commented Apr 24, 2017

torgil commented Apr 24, 2017

ambv commented Apr 24, 2017

drzaeus77 commented Apr 24, 2017

ambv commented Apr 24, 2017

drzaeus77 commented Apr 25, 2017

torgil commented May 16, 2017

torgil commented May 16, 2017

4ast commented Oct 26, 2017

Do not assume any code-tables on strings #1073

Do not assume any code-tables on strings #1073

Conversation

torgil commented Mar 27, 2017

r4f4 commented Mar 29, 2017

goldshtn commented Mar 29, 2017

drzaeus77 commented Mar 29, 2017

torgil commented Mar 29, 2017

drzaeus77 commented Mar 29, 2017

torgil commented Mar 30, 2017 • edited

drzaeus77 commented Mar 30, 2017

r4f4 commented Mar 31, 2017 • edited

palmtenor commented Mar 31, 2017

torgil commented Apr 1, 2017

torgil commented Apr 1, 2017

torgil commented Apr 1, 2017

torgil commented Apr 3, 2017

drzaeus77 commented Apr 3, 2017

drzaeus77 commented Apr 3, 2017

torgil commented Apr 13, 2017

goldshtn commented Apr 14, 2017

4ast commented Apr 15, 2017

torgil commented Apr 18, 2017

torgil commented Apr 21, 2017

drzaeus77 commented Apr 21, 2017

palmtenor commented Apr 21, 2017

4ast Apr 22, 2017

Choose a reason for hiding this comment

torgil commented Apr 22, 2017

torgil commented Apr 24, 2017

torgil commented Apr 24, 2017

ambv commented Apr 24, 2017

drzaeus77 commented Apr 24, 2017

ambv commented Apr 24, 2017

drzaeus77 commented Apr 25, 2017

torgil commented May 16, 2017

torgil commented May 16, 2017

4ast commented Oct 26, 2017

torgil commented Mar 30, 2017 •

edited

r4f4 commented Mar 31, 2017 •

edited