Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undeterministic output order with python3 #1181

Closed
josch opened this issue May 29, 2014 · 37 comments
Closed

Undeterministic output order with python3 #1181

josch opened this issue May 29, 2014 · 37 comments

Comments

@josch
Copy link
Contributor

josch commented May 29, 2014

Hi,
with Python3, hashes are salted and thus, functions like edges_iter return their results in a different order between subsequent python3 invocations. While this might be acceptable, this should be documented.

Unfortunately, a side effect of this is, that writing to a file does not produce deterministic output either. I'd argue that if the user saves the same graph twice with different invocations of python3, the generated graphs should not only be semantically the same but also byte-by-byte the same files.

Right now, this is only the case in python2.x but not anymore in python3 where a random seed is used by default.

Please document that the order of the various iterators in networkx depends on pythons internal order and is just not deterministic with python3.

Please make it such that when writing graphs to files, the output is deterministically the same between different invocations.

@chebee7i
Copy link
Member

When writing to files, we can attempt to sort the nodes/edges, but NetworkX doesn't require that nodes be sortable. Maybe a sort=True parameter?

I wonder if we could just move to OrderedDict without causing too many issues. 2.7 and 3.x have it. For 2.6, we make ordereddict a dependency.

@josch
Copy link
Contributor Author

josch commented May 29, 2014

Thanks, having the option of deterministic output with a sort=True parameter in write_graphml and friends would fix my problem. Having determinism anywhere else might not be desirable due to possible performance penalties.

@ysitu
Copy link
Contributor

ysitu commented May 29, 2014

I wonder why deterministic output is needed.

@josch
Copy link
Contributor Author

josch commented May 29, 2014

One reason: if I want to check whether two graphs are equal, it is easier to run md5sum or diff than having to parse the graph first and then comparing whether both graphs have the same vertices and edges.
Given that it is very simple to add deterministic output (for example through a sort=True option) I think it is worth to have it.

@hagberg
Copy link
Member

hagberg commented May 29, 2014

How about turning off the hash randomization?

$ PYTHONHASHSEED=random python3.4 -c 'print(hash("a"))'
-1769101628
$ PYTHONHASHSEED=random python3.4 -c 'print(hash("a"))'
-661212464
$ PYTHONHASHSEED=0 python3.4 -c 'print(hash("a"))'
-845962679
$ PYTHONHASHSEED=0 python3.4 -c 'print(hash("a"))'
-845962679

@josch
Copy link
Contributor Author

josch commented May 29, 2014

This will affect all hashes and not only those used to generate the graphml or dot output. So this is not a solution because in general, hash randomization is useful.

@ysitu
Copy link
Contributor

ysitu commented May 29, 2014

Can you prepare a PR that implements the needed functionality?

@josch
Copy link
Contributor Author

josch commented May 29, 2014

Yes, if you tell me how you'd like it to be implemented. Should the write_graphml and friends functions get a sort=True or similar parameter which would then make the output sorted? Or should the output be sorted by default?

@hagberg
Copy link
Member

hagberg commented May 29, 2014

Example please where you need hash randomization and determinstic output. Otherwise it seems like a pretty special case?

@josch
Copy link
Contributor Author

josch commented May 29, 2014

Hash randomization seems like a good thing to have https://mail.python.org/pipermail/python-announce-list/2012-March/009394.html so I do not think it would be wise to drop it for everything.

I gave one example why deterministic output makes sense above. I do not think that hash randomization should be given up for that.

My biggest argument would actually be that if a user runs an algorithm that writes the graph somewhere, then given the same input, they'd expect the exact same output. At least that was my assumption. I was surprised when I saw that each time I ran my code, the output file was different (though representing the same graph).

If you do not want deterministic graph output in networkx, please mention in the documentation that the vertex and edge order in the graph writers is random when using python3 and deterministic for a given version of python 2.x. I think this will avoid surprises as my code started failing its testcases when I upgraded to python3 because I was comparing the md5sum of the created graphs.

@hagberg
Copy link
Member

hagberg commented May 29, 2014

Yes, I can understand the surprise and frustration that output varies from run to run. But I still think the solution is to disable hash randomization (or use the same randomization) if you want more deterministic behavior (as used to be in Python2). I can't evaluate the security risk in disabling the randomization. It seems to be basically a denial of service attack so if you are using NetworkX in an uncontrolled situation with malicious users it could make sense to randomize.

In general graphs are sets of nodes and edges with no implied order. We should certainly document this fact as it relates to input and output of data.

@josch
Copy link
Contributor Author

josch commented May 29, 2014

Would you accept a patch which adds an additional sorted (or similar name) argument to functions like write_graphml? That way, the original behaviour could be retained, while those who want deterministic output could set sorted=True.

@hagberg
Copy link
Member

hagberg commented May 29, 2014

Sure we could do that. We need to be careful that the error produced when nodes are not sortable makes sense to the user. I don't think many people use nodes that can't be sorted but we allow it.

@josch
Copy link
Contributor Author

josch commented May 29, 2014

Thanks, I'll try to prepare one.

@chebee7i
Copy link
Member

Is it the edges or nodes or both that get out of order? If its only the nodes, you could also make it accept an ordering on the nodes. node_order=None being the default...which then just grabs G.nodes(). But if edges are getting out of order as well, then this won't work.

@ysitu
Copy link
Contributor

ysitu commented May 30, 2014

All textual output functions, not just write_graphml, should receive the same treatment so that the user will not be surprised to find out that some supports sorting and some do not.

@josch
Copy link
Contributor Author

josch commented May 30, 2014

@ysitu as you can see in the conversation above I was always saying "functions like write_graphml" or "write_graphml and friends". So I do not plan to limit this to graphml output. On the other hand I did not yet manage to call write_dot in python3. It would exit with TypeError: 'str' does not support the buffer interface.

@chebee7i it is both. Consider the following example:

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
    <key id="foo" for="node" attr.name="foo" attr.type="string"></key>
    <key id="bar" for="node" attr.name="bar" attr.type="string"></key>
    <graph id="G" edgedefault="directed">
        <node id="0">
            <data key="foo">node0foo</data>
            <data key="bar">node0bar</data>
        </node>
        <node id="1">
            <data key="foo">node1foo</data>
            <data key="bar">node1bar</data>
        </node>
        <edge id="0" source="0" target="1"></edge>
        <edge id="0" source="1" target="0"></edge>
    </graph>
</graphml>

And then execute this a couple of times:

python3 -c "import networkx, sys; g = networkx.read_graphml('test.xml'); networkx.write_graphml(g, sys.stdout.buffer)"

You will observe that the edge order changes. Looking at the code for graphml output this seems because edges_iter is not deterministic. Execute this a couple of times to convince yourself:

python3 -c "import networkx, sys; g = networkx.read_graphml('test.xml'); print(list(g.edges_iter()))"

This might also mean that some of the algorithms in networkx behave undeterministic in python3 because their output relies on the order in which they traverse the nodes or edges. But I did not investigate this claim.

edit: there are algorithms which are affected. For example try running the following multiple times:

python3 -c "import networkx, sys; g = networkx.read_graphml('test.xml'); print(list(networkx.algorithms.cycles.simple_cycles(g)))"

Because Johnson's algorithm will start searching at a different node every other time, the output is different as well. But this just as a sidenode. I dont want to make this the topic of this bugreport.

@ysitu
Copy link
Contributor

ysitu commented May 30, 2014

The interface can be something like write_graphml(G, path, encoding='utf-8', prettyprint=True, node_cmp=None, key_cmp=None). The user should supply two comparison functions, one for the nodes and one for the edge keys of multigraphs. They should follow the convention used by sorted. Then the edges can be sorted as tuples (u, v, key). Undirected edges should have node_cmp(u, v) <= 0 in (u, v, key).

@josch
Copy link
Contributor Author

josch commented May 30, 2014

@ysitu sorry I read your comment only after I already created the pull request. It implements a simple additional sorted argument to a variety of output functions.

If you feel stronly about allowing how to compare nodes and edges, then I can create another pull request.

Also, in addition to comparing nodes and edges it must also be possible to compare attributes, no? At least in some graph types there is not only the node and edge order but also the attribute order.

@ysitu
Copy link
Contributor

ysitu commented May 30, 2014

In general, the idea is that you should not rely on the nodes/keys/attributes being sortable by themselves. Letting the user supply comparison functions will enable the greatest flexibility. If that is much too aggressive, you can start with sorting them by their textual representations.

@josch
Copy link
Contributor Author

josch commented May 30, 2014

Okay, I can prepare another pull request.

@chebee7i
Copy link
Member

Just wanted to point out that in Python 3, the cmp option to sorted no longer exists. So to be forward looking, maybe it would be better to make the user provide key functions instead.

http://python3porting.com/preparing.html#keycmp-section

So for each node, node_key just returns something that is sortable. For edges without attribute data, it seems that those can be sorted using the node key information: (u_key, v_key) or (u_key, v_key, edge_key).

@josch
Copy link
Contributor Author

josch commented May 30, 2014

Having the dropping of the cmp parameter in favor of key in python3 in mind, should this not be handled differently? Instead of supplying the comparison functions node_cmp and key_cmp which can only be supplied to sorted by using cmp_to_key from the functools module, it seems to me that the python3 way of doing sorting would be that the user adds the functions __lt__, __le__ , __eq__, __ne__, __gt__ and __ge__ to the node and edge objects?
The key parameter to sorted would then govern what to sort while the comparison functions govern how to sort.

@ysitu
Copy link
Contributor

ysitu commented May 30, 2014

The interface can be changed accordingly. Instead of node_cmp, key_cmp and attr_cmp, you ask for node_key, key_key, attr_key.

@hagberg
Copy link
Member

hagberg commented May 30, 2014

I still feel that the best solution here is to turn of the hash randomization. I haven't yet been convinced that this isn't a valid solution that will work for @josch.

If we want to support ordering of nodes and edges in output files that networkx generates (I'm not in favor myself) I suggest we allow direct specification of a node list and edge list instead of sorting.

@josch
Copy link
Contributor Author

josch commented May 30, 2014

@hagberg and attribute lists per node/edge.

@hagberg
Copy link
Member

hagberg commented May 30, 2014

Yes, attributes too, ugh. Now I'm really -1 on adding this as a feature.

@josch
Copy link
Contributor Author

josch commented May 30, 2014

@hagberg have a look at the pull request that I referenced further up. The most minimal solution which just takes a sort parameter works, is simple and doesnt look intrusive to me. This would already solve this bugreport. If I understand it correctly (but I didnt try) this solution would even already allow custom sorting by the user attaching __lt__, __le__ , __eq__, __ne__, __gt__ and __ge__ functions to the nodes, edges and attributes for which he wants custom sorting.

Does the pull request that I created look unacceptable to you?

@bjedwards
Copy link
Member

I'm sort of with @hagberg on this one. This seems like a lot of cruft to add in for what seems like a single use case. NetworkX has tried to be as general as possible and not introduce constraints(like orderable node types), and I think trying to shoehorn that into a writer is a bad idea. I think there are also another option, subclass or rewrite the GraphMLWrite class as a custom writer.

@josch, if you know a priori the type of your nodes, rewriting GraphMLWriter.add_nodes and GraphMLWriter.add_edges, seems easy enough. It's the general case that is a pain in the ass.

Also can you turn of hash seed randomization programmatically? Or is it read at startup?

@ysitu ysitu removed this from the networkx-2.0 milestone May 30, 2014
@chebee7i
Copy link
Member

I must be the odd one here :) I'm still partial to building your graph (or converting your graph to) a class that uses ordered dicts.

@josch
Copy link
Contributor Author

josch commented May 30, 2014

@bjedwards sure, if you guys decide that having sorted or otherwise deterministic output in the graph writers is a bad idea for networkx, then of course I will just put my own graphml writer code into my python project. As far as I can see from the documentation, there is no way to switch hash randomization on and off from within python. It is done via the -R argument on the commandline or the PYTHONHASHSEED environment variable. Both then affect all hashes in the Python process and I don't see a way to only let the networkx dictionaries used to store nodes, edges and their attributes be affected.

@dschult
Copy link
Member

dschult commented May 30, 2014

One way to reduce the impact of hash order on only the networkx dictionaries would be to follow @chebee7i and use a class which has OrderedDict dictionaries. See #980. But that's not quite ready to go.

@josch
Copy link
Contributor Author

josch commented Jun 2, 2014

@dschult if I understand it correctly, then OrderedDict dictionaries could even maintain the same order of the graph as it was originally parsed, no? That would be even more great because it would mean that networkx would not even modify the input file at all if it didnt make any changes to it.

@harlowja
Copy link
Contributor

harlowja commented Jun 8, 2014

I'd like the ordereddict dictionaries usage (even if graphs really should be unordered in the ideal world). #980 and #1047 (and maybe others?) started along this path and maybe we can get those to be finished up?

Btw hash randomization is usually a really good idea (see https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-1150 and https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-5371 and others...), and asking for it to be disabled for networkx would severely hamper its usage in servers (and other applications) that use networkx internally (and those same applications/servers may output graph files for users to download, for example).

@ysitu ysitu added this to the networkx-future milestone Jun 8, 2014
@ysitu
Copy link
Contributor

ysitu commented Jun 11, 2014

Regarding enabling OrderedDict, one possibility is to create a function

def make_graph_class(dict):
    class Graph(object):
        def __init__(self):
            self.graph = dict()
            # ...
    return Graph

Then the default Graph class is simply defined as

Graph = make_graph_class(dict)

If one wishes to use OrderedDict, he/she can do

OrderedGraph = make_graph_class(OrderedDict)

I believe that this model can be extended to support other backend types as well.

@harlowja
Copy link
Contributor

The problem that I hit was that its not just the construction that needs to use an OrderedDict, its also any clone methods, any copy methods and so-on, it extends into edge addition/ordering (should that also use an OrderedDict?).

Some of the issues I see (others may have a better idea here):

  • def to_undirected (creates a new undirected graph, should this retain ordering?)
  • def subgraph(self, nbunch) (same question)
  • def to_directed(self) (same question)
  • Any of the copy functions in networkx/classes/function.py and the generation functions that exist in networkx (how will these behave?)
  • Others?

My guess is none of this is insurmountable, just needs to be analyzed to figure out the scope that should be tackled to get from here to there.

@hagberg
Copy link
Member

hagberg commented Jun 12, 2014

We have previous discussed ideas along the line @ysitu suggested in #980 (and earlier). I am for designing a way to use alternate 'backends' to store graph and attribute data such as an SQL database, OrderedDict, sparse matrix, etc. So far we have not converged on a engineering solution to do that. NetworkX 2.0 is an opportunity to adjust the basic API/interface some if necessary to make this happen. It seems do-able but there are issues and trade-offs that should be discussed in a different issue.

That approach may solve @harlowja's issue in other ways (e.g. OrderedDict as a backend) than turning off the hash randomization. So I suggest we not implement a half-way approach here to provide ordered storage or reading and writing of nodes, edges, and attributes.

@ysitu ysitu closed this as completed Aug 8, 2014
@ysitu ysitu removed this from the networkx-future milestone Oct 12, 2014
dschult added a commit to dschult/networkx that referenced this issue Jan 2, 2015
This supercedes networkx#1268 as a proposed structure for OrderedGraph and
ThinGraph class structures. It is based on discussion from networkx#980 and
has also been discussed in networkx#1181, networkx#1244, networkx#1267, networkx#1267

This implementation includes examples (in docs and in tests) for 1)
ordered nodes, 2) ordered nodes and edges and 3) thin (low RAM) graphs
that don't need edge attributes [not related to ordered nodes].

In this proposal, the dict-like arguments are assigned in a new class
structure rather than assigning them to an instance of the SpecialGraph
class. Something similar to:

class OrderedGraph(SpecialGraph):
    node_dict_factory=OrderedDict
G=OrderedGraph()

If all goes well with this version (including reasonable speed)
it might be reasonable to include it in the base classes Graph, etc.
I think @ysitu prefers a more flexible base class than adding yet
more graph classes, and all else equal (or almost equal) I agree.

This pull request may be useful for people who wish OrderedGraph
or ThinGraph and don't want to wait for 2.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

7 participants