-
-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make make_str() always return unicode, no matter the Python version. #990
Conversation
As far as I remember the whole point of the make_str function was because we just wanted to return str(x) unless x was already str or unicode - then leave x alone. That is, we put that code in there to allow either str or unicode types to be used as NetworkX nodes. Of course if you want to output to other data streams you have to think about what you are doing (and hopefully choose unicode). So I'm not sure we want to demand that all strings be unicode in Python2. This nonsense all goes away next year when we all are using Python3. |
Gotcha. So my concern is that when we write code that needs to manipulate strings or string-like objects, subtle bugs can be introduced because we are forced to handle different types of output (encoded bytes or unicode strings). I did a quick grep and the only places we use make_str() are when we are reading/writing to files. So this seems exactly the time when we'd want to make sure we are dealing with unicode. Even if we opt to keep make_str as it is, then all of the places were we are currently using make_str() should be using something like what is proposed here instead of the existing make_str(). As to allowing str or unicode types to be used as NetworkX nodes or edge attributes, this is still possible since we don't force make_str() during node construction or while adding edges. If the strings are ascii, then this will mostly be a nonop...but it makes coding easier because we can be sure that the strings are unicode. If the strings are utf-8, then this will be a problem (when/if they try to write Graphs to files or other such exports)...but that is already a problem right now (doing so can cause an exception to raise). |
Thanks for looking at that. If we are only using make_str when we output data to files or strings meant for other programs then it does make sense to me to convert to unicode (then encode properly for writing/sending). And everybody let's all use Python3 from now on OK? |
Haha yes, Python 3 please. Just for our own notes, let me provide a example of why this is difficult. In
where For 2.x, if your nodes are ascii-encoded, then line 80 is combining an ascii-encoded string with a utf-encoded delimiter. For the default delimiter, there is no issue...but its a bit dangerous. If your nodes are unicode, then line 80 is combining a unicode string with an utf-8 encoded delimiter. This isn't even allowed in Python 3.x (but an exception slyly avoided since the delimiter is unicode in Py3). Although 2.x will allow this operation, it will not work every time. When adding unicode and str, Python first attempts to decode the string using sys.getdefaultencoding() (which is ususally ascii). So if I passed delimiter="é" to this function, we would get a UnicodeDecodeError. So there are legitimate bugs in there: generate_adjlist() does not work if your nodes are unicode and the delimiter is not coercible to your default encoding:
The big point is that its hard, even for people familiar with the code, to know what line 80 is doing. And this is because make_str() is sometimes a str, sometimes unicode. Since we are (essentially) not supporting non-ascii encodings, the safer way to do this is for make_str() to always return unicode and for delimiter to be a unicode string. Then, whether you use unicode or ascii-encoded nodes, the above will work. And its very clear how/why it works, because at the high-level line 80 is always working with unicode strings. [But of course, specifying the delimiter as unicode in a way that works for 2.x and 3.x is painful.] To support non-ascii encodings, we'd have to add another parameter which specifies the input-encoding for encoded strings in the graph. Then, we could decode to unicode and then re-encode to the desired output-encoding. |
I think we can merge this, but we should talk about the example I provided (perhaps in another pull request). Mainly, there a subtle issue in many of these files where we specify string literals. In 2.x they are encoded strings, while in 3.x they are unicode. If those string literals not decodable (via the default encoding) back to unicode, then there will be errors. Fortunately, I think every time we use a string literal in files like these, the string is coercible to unicode. If desired, I can make a pull request for adjlist.py and we can discuss it there separately. |
I don't want to revisit all of those graph format writers to fix Python2 warts. Some of these issues are certainly bugs but they seem like corner cases (using a non-encodable delimiter). We are all using Python3 next week anyway.... |
Sounds reasonable. It's be a lot of tediousness without much benefit (esp since no one has really complained in the past---so those corner cases are not being visited by users). |
Make make_str() always return unicode, no matter the Python version.
For python3 can't we just return str(x)? (no ugly type checking) |
Opps, I think so. |
And could we just use a try/except instead of checking for python version? try:
#python 2
if isinstance(x, unicode):
....
except NameError:
# python 3
return str(x) |
I merged too soon. Hah. We can do that, but then every call to |
Can't you still push more to the same branch? On Tue, Oct 22, 2013 at 6:15 PM, chebee7i notifications@github.com wrote:
|
I pushed up to it, but it didn't update here. Perhaps because the pull request was merged? Anyway, I pushed the simplification directly to master (and then deleted the branch for this PR). Haven't taken any action on the try/except yet though. |
I just noticed something about
make_str
that bothered me. This pull request is built off @hagberg's branch for #989. So make sure to merge #989 first, but its not strictly necessary, as merging this PR will merge both.Consider the following example:
For Python 2,
y
is of typestr
, but as an encoded string. For Python 3,y
is of typestr
, but as a unicode string. This is consistent output type, but the interpretation is very different. Now, consider:For Python 2,
y
is of typeunicode
, whereas in Python 3,y
is of typestr
(which is unicode). Now, the types are different, but the interpretation is the same.Overall, for Python 2,
make_str
returns either astr
or aunicode
object. On one hand, this makes sense since the input was different in each case.On the other hand, its more confusing because it means that our code is possibly working with encoded strings or unicode strings. Always working with unicode would make things easier to maintain, I think. Additionally, this would be similar in spirit to how we handle
map
,range
andzip
---we always treat them as iterators so that our code treatment is uniform across Python versions. Right now, internal NetworkX code handles "strings" differently depending on the Python version.Below is a new proposal for
make_str
.The important difference is that the
str
in namemake_str
is now interpreted to mean 'unicode', always. Its forward looking to Python 3 just as map, range, zip, etc. The benefit now is that all NetworkX code can be confident that make_str returns a unicode string. Note: the type will be different depending on the Python version, (e.g. unicode vs str) but we shouldn't care about the type as much as we care about what the type represents (e.g. encoded string vs unicode string).In Python 3, examples 1 and 2 from above are the same. unicode strings in and unicode strings out.
In Python 2, examples 1 and 2 differ but
y
represents a unicode string in both cases. In example 1,y
is a unicode object of length 8. In example 2, 'y' is a unicode object of length 7. The length 8 unicode string is probably not what the use wanted, but the user should also not expect us (NetworkX) to know how the string is encoded. So the user should first decode to unicode and then pass it intomake_str
---in which case, the output would be the same as in example 2. Ifx
were an ascii encoded string, then examples 1 and 2 would be the same in Python 2 andy
would be unicode strings in both cases.