Make make_str() always return unicode, no matter the Python version. #990

chebee7i · 2013-10-22T19:25:04Z

I just noticed something about make_str that bothered me. This pull request is built off @hagberg's branch for #989. So make sure to merge #989 first, but its not strictly necessary, as merging this PR will merge both.

Consider the following example:

#  -*- coding: utf-8 -*-
# Example 1
from networkx.utils import make_str
x = "qualité"
y = make_str(x)

For Python 2, y is of type str, but as an encoded string. For Python 3, y is of type str, but as a unicode string. This is consistent output type, but the interpretation is very different. Now, consider:

#  -*- coding: utf-8 -*-
# Example 2
from networkx.utils import make_str
try:
    #2.x
    x = unicode("qualité")
except NameError:
    #3.x
    x = "qualité"
y = make_str(x)

For Python 2, y is of type unicode, whereas in Python 3, y is of type str (which is unicode). Now, the types are different, but the interpretation is the same.

Overall, for Python 2, make_str returns either a str or a unicode object. On one hand, this makes sense since the input was different in each case.

On the other hand, its more confusing because it means that our code is possibly working with encoded strings or unicode strings. Always working with unicode would make things easier to maintain, I think. Additionally, this would be similar in spirit to how we handle map, range and zip---we always treat them as iterators so that our code treatment is uniform across Python versions. Right now, internal NetworkX code handles "strings" differently depending on the Python version.

Below is a new proposal for make_str.

PY2 = sys.version_info[0] == 2
if PY2:
    def make_str(x):
        if isinstance(x, unicode):
            return x
        else:
            # Note, this will not work unless x is ascii-encoded.
            # That is good, since we should be working with unicode anyway.
            # Essentially, unless we are reading a file, we demand that users convert
            # any encoded strings to unicode before using the library.
            # 
            # Also, the str() is necessary to convert integers, etc.
            # unicode(3) works, but unicode(3, 'unicode-escape') demands a buffer.
            #
            return unicode(str(x), 'unicode-escape')
else:
    def make_str(x):
        if isinstance(x, str):
             return x
        else:
             return str(x)

The important difference is that the str in name make_str is now interpreted to mean 'unicode', always. Its forward looking to Python 3 just as map, range, zip, etc. The benefit now is that all NetworkX code can be confident that make_str returns a unicode string. Note: the type will be different depending on the Python version, (e.g. unicode vs str) but we shouldn't care about the type as much as we care about what the type represents (e.g. encoded string vs unicode string).

In Python 3, examples 1 and 2 from above are the same. unicode strings in and unicode strings out.

In Python 2, examples 1 and 2 differ but y represents a unicode string in both cases. In example 1, y is a unicode object of length 8. In example 2, 'y' is a unicode object of length 7. The length 8 unicode string is probably not what the use wanted, but the user should also not expect us (NetworkX) to know how the string is encoded. So the user should first decode to unicode and then pass it into make_str---in which case, the output would be the same as in example 2. If x were an ascii encoded string, then examples 1 and 2 would be the same in Python 2 and y would be unicode strings in both cases.

hagberg · 2013-10-22T20:46:22Z

As far as I remember the whole point of the make_str function was because we just wanted to return str(x) unless x was already str or unicode - then leave x alone. That is, we put that code in there to allow either str or unicode types to be used as NetworkX nodes. Of course if you want to output to other data streams you have to think about what you are doing (and hopefully choose unicode).

So I'm not sure we want to demand that all strings be unicode in Python2.

This nonsense all goes away next year when we all are using Python3.

chebee7i · 2013-10-22T21:11:13Z

Gotcha. So my concern is that when we write code that needs to manipulate strings or string-like objects, subtle bugs can be introduced because we are forced to handle different types of output (encoded bytes or unicode strings).

I did a quick grep and the only places we use make_str() are when we are reading/writing to files. So this seems exactly the time when we'd want to make sure we are dealing with unicode. Even if we opt to keep make_str as it is, then all of the places were we are currently using make_str() should be using something like what is proposed here instead of the existing make_str(). As to allowing str or unicode types to be used as NetworkX nodes or edge attributes, this is still possible since we don't force make_str() during node construction or while adding edges.

If the strings are ascii, then this will mostly be a nonop...but it makes coding easier because we can be sure that the strings are unicode. If the strings are utf-8, then this will be a problem (when/if they try to write Graphs to files or other such exports)...but that is already a problem right now (doing so can cause an exception to raise).

hagberg · 2013-10-22T21:50:40Z

Thanks for looking at that. If we are only using make_str when we output data to files or strings meant for other programs then it does make sense to me to convert to unicode (then encode properly for writing/sending).
So I am for this change.

And everybody let's all use Python3 from now on OK?

chebee7i · 2013-10-22T22:18:18Z

Haha yes, Python 3 please. Just for our own notes, let me provide a example of why this is difficult.

In networkx/readwrite/adjlist.py line 80 reads:

line = make_str(s)+delimiter

where delimiter=' ' and is utf-8 encoded (due to the encoding declaration at the top of the module). Now, the question is: What is line 80 doing? The problem for me is that it is very hard to know what line 80 is doing because it depends on what version of Python you are running and also what type s happens to be.

For 2.x, if your nodes are ascii-encoded, then line 80 is combining an ascii-encoded string with a utf-encoded delimiter. For the default delimiter, there is no issue...but its a bit dangerous.

If your nodes are unicode, then line 80 is combining a unicode string with an utf-8 encoded delimiter. This isn't even allowed in Python 3.x (but an exception slyly avoided since the delimiter is unicode in Py3). Although 2.x will allow this operation, it will not work every time. When adding unicode and str, Python first attempts to decode the string using sys.getdefaultencoding() (which is ususally ascii). So if I passed delimiter="é" to this function, we would get a UnicodeDecodeError. So there are legitimate bugs in there: generate_adjlist() does not work if your nodes are unicode and the delimiter is not coercible to your default encoding:

G = nx.Graph()
G.add_edge(u"A", u"B")
adjlist = list(nx.adjlist.generate_adjlist(G, delimiter="é"))    # error
adjlist = list(nx.adjlist.generate_adjlist(G, delimiter=u"é"))   # success

The big point is that its hard, even for people familiar with the code, to know what line 80 is doing. And this is because make_str() is sometimes a str, sometimes unicode.

Since we are (essentially) not supporting non-ascii encodings, the safer way to do this is for make_str() to always return unicode and for delimiter to be a unicode string. Then, whether you use unicode or ascii-encoded nodes, the above will work. And its very clear how/why it works, because at the high-level line 80 is always working with unicode strings. [But of course, specifying the delimiter as unicode in a way that works for 2.x and 3.x is painful.]

To support non-ascii encodings, we'd have to add another parameter which specifies the input-encoding for encoded strings in the graph. Then, we could decode to unicode and then re-encode to the desired output-encoding.

chebee7i · 2013-10-22T22:25:32Z

I think we can merge this, but we should talk about the example I provided (perhaps in another pull request). Mainly, there a subtle issue in many of these files where we specify string literals. In 2.x they are encoded strings, while in 3.x they are unicode. If those string literals not decodable (via the default encoding) back to unicode, then there will be errors. Fortunately, I think every time we use a string literal in files like these, the string is coercible to unicode. If desired, I can make a pull request for adjlist.py and we can discuss it there separately.

hagberg · 2013-10-22T23:39:53Z

I don't want to revisit all of those graph format writers to fix Python2 warts. Some of these issues are certainly bugs but they seem like corner cases (using a non-encodable delimiter).

We are all using Python3 next week anyway....

chebee7i · 2013-10-22T23:53:09Z

Sounds reasonable. It's be a lot of tediousness without much benefit (esp since no one has really complained in the past---so those corner cases are not being visited by users).

Make make_str() always return unicode, no matter the Python version.

hagberg · 2013-10-22T23:59:05Z

For python3 can't we just return str(x)? (no ugly type checking)

chebee7i · 2013-10-23T00:02:17Z

Opps, I think so.

hagberg · 2013-10-23T00:02:29Z

And could we just use a try/except instead of checking for python version?

try:
    #python 2
    if isinstance(x, unicode):
   ....
except NameError:
   # python 3
   return str(x)

chebee7i · 2013-10-23T00:14:59Z

I merged too soon. Hah.

We can do that, but then every call to make_str will have to first raise an exception in Python 3. So it will be a bit slower, rather than specialized just for Py3. What should we do?

hagberg · 2013-10-23T00:41:11Z

Can't you still push more to the same branch?

On Tue, Oct 22, 2013 at 6:15 PM, chebee7i notifications@github.com wrote:

I merged too soon. Hah.

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/990#issuecomment-26868708
.

chebee7i · 2013-10-23T00:49:27Z

I pushed up to it, but it didn't update here. Perhaps because the pull request was merged? Anyway, I pushed the simplification directly to master (and then deleted the branch for this PR). Haven't taken any action on the try/except yet though.

Make make_str() always return unicode, no matter the Python version.

801a2e0

chebee7i added a commit that referenced this pull request Oct 22, 2013

Merge pull request #990 from chebee7i/hagberg/json-str

744bd89

Make make_str() always return unicode, no matter the Python version.

chebee7i merged commit 744bd89 into networkx:master Oct 22, 2013

hagberg added this to the networkx-1.9 milestone May 13, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make make_str() always return unicode, no matter the Python version. #990

Make make_str() always return unicode, no matter the Python version. #990

chebee7i commented Oct 22, 2013

hagberg commented Oct 22, 2013

chebee7i commented Oct 22, 2013

hagberg commented Oct 22, 2013

chebee7i commented Oct 22, 2013

chebee7i commented Oct 22, 2013

hagberg commented Oct 22, 2013

chebee7i commented Oct 22, 2013

hagberg commented Oct 22, 2013

chebee7i commented Oct 23, 2013

hagberg commented Oct 23, 2013

chebee7i commented Oct 23, 2013

hagberg commented Oct 23, 2013

chebee7i commented Oct 23, 2013

Make make_str() always return unicode, no matter the Python version. #990

Make make_str() always return unicode, no matter the Python version. #990

Conversation

chebee7i commented Oct 22, 2013

hagberg commented Oct 22, 2013

chebee7i commented Oct 22, 2013

hagberg commented Oct 22, 2013

chebee7i commented Oct 22, 2013

chebee7i commented Oct 22, 2013

hagberg commented Oct 22, 2013

chebee7i commented Oct 22, 2013

hagberg commented Oct 22, 2013

chebee7i commented Oct 23, 2013

hagberg commented Oct 23, 2013

chebee7i commented Oct 23, 2013

hagberg commented Oct 23, 2013

chebee7i commented Oct 23, 2013