Undeterministic output order with python3 #1181

josch · 2014-05-29T15:39:14Z

Hi,
with Python3, hashes are salted and thus, functions like edges_iter return their results in a different order between subsequent python3 invocations. While this might be acceptable, this should be documented.

Unfortunately, a side effect of this is, that writing to a file does not produce deterministic output either. I'd argue that if the user saves the same graph twice with different invocations of python3, the generated graphs should not only be semantically the same but also byte-by-byte the same files.

Right now, this is only the case in python2.x but not anymore in python3 where a random seed is used by default.

Please document that the order of the various iterators in networkx depends on pythons internal order and is just not deterministic with python3.

Please make it such that when writing graphs to files, the output is deterministically the same between different invocations.

chebee7i · 2014-05-29T18:07:39Z

When writing to files, we can attempt to sort the nodes/edges, but NetworkX doesn't require that nodes be sortable. Maybe a sort=True parameter?

I wonder if we could just move to OrderedDict without causing too many issues. 2.7 and 3.x have it. For 2.6, we make ordereddict a dependency.

josch · 2014-05-29T18:12:03Z

Thanks, having the option of deterministic output with a sort=True parameter in write_graphml and friends would fix my problem. Having determinism anywhere else might not be desirable due to possible performance penalties.

ysitu · 2014-05-29T18:13:53Z

I wonder why deterministic output is needed.

josch · 2014-05-29T18:17:40Z

One reason: if I want to check whether two graphs are equal, it is easier to run md5sum or diff than having to parse the graph first and then comparing whether both graphs have the same vertices and edges.
Given that it is very simple to add deterministic output (for example through a sort=True option) I think it is worth to have it.

hagberg · 2014-05-29T20:36:40Z

How about turning off the hash randomization?

$ PYTHONHASHSEED=random python3.4 -c 'print(hash("a"))'
-1769101628
$ PYTHONHASHSEED=random python3.4 -c 'print(hash("a"))'
-661212464
$ PYTHONHASHSEED=0 python3.4 -c 'print(hash("a"))'
-845962679
$ PYTHONHASHSEED=0 python3.4 -c 'print(hash("a"))'
-845962679

josch · 2014-05-29T20:39:06Z

This will affect all hashes and not only those used to generate the graphml or dot output. So this is not a solution because in general, hash randomization is useful.

ysitu · 2014-05-29T20:41:03Z

Can you prepare a PR that implements the needed functionality?

josch · 2014-05-29T20:46:09Z

Yes, if you tell me how you'd like it to be implemented. Should the write_graphml and friends functions get a sort=True or similar parameter which would then make the output sorted? Or should the output be sorted by default?

hagberg · 2014-05-29T22:58:55Z

Example please where you need hash randomization and determinstic output. Otherwise it seems like a pretty special case?

josch · 2014-05-29T23:09:06Z

Hash randomization seems like a good thing to have https://mail.python.org/pipermail/python-announce-list/2012-March/009394.html so I do not think it would be wise to drop it for everything.

I gave one example why deterministic output makes sense above. I do not think that hash randomization should be given up for that.

My biggest argument would actually be that if a user runs an algorithm that writes the graph somewhere, then given the same input, they'd expect the exact same output. At least that was my assumption. I was surprised when I saw that each time I ran my code, the output file was different (though representing the same graph).

If you do not want deterministic graph output in networkx, please mention in the documentation that the vertex and edge order in the graph writers is random when using python3 and deterministic for a given version of python 2.x. I think this will avoid surprises as my code started failing its testcases when I upgraded to python3 because I was comparing the md5sum of the created graphs.

hagberg · 2014-05-29T23:33:37Z

Yes, I can understand the surprise and frustration that output varies from run to run. But I still think the solution is to disable hash randomization (or use the same randomization) if you want more deterministic behavior (as used to be in Python2). I can't evaluate the security risk in disabling the randomization. It seems to be basically a denial of service attack so if you are using NetworkX in an uncontrolled situation with malicious users it could make sense to randomize.

In general graphs are sets of nodes and edges with no implied order. We should certainly document this fact as it relates to input and output of data.

josch · 2014-05-29T23:37:16Z

Would you accept a patch which adds an additional sorted (or similar name) argument to functions like write_graphml? That way, the original behaviour could be retained, while those who want deterministic output could set sorted=True.

hagberg · 2014-05-29T23:39:13Z

Sure we could do that. We need to be careful that the error produced when nodes are not sortable makes sense to the user. I don't think many people use nodes that can't be sorted but we allow it.

josch · 2014-05-29T23:44:36Z

Thanks, I'll try to prepare one.

chebee7i · 2014-05-30T00:34:03Z

Is it the edges or nodes or both that get out of order? If its only the nodes, you could also make it accept an ordering on the nodes. node_order=None being the default...which then just grabs G.nodes(). But if edges are getting out of order as well, then this won't work.

ysitu · 2014-05-30T05:57:55Z

All textual output functions, not just write_graphml, should receive the same treatment so that the user will not be surprised to find out that some supports sorting and some do not.

josch · 2014-05-30T06:52:35Z

@ysitu as you can see in the conversation above I was always saying "functions like write_graphml" or "write_graphml and friends". So I do not plan to limit this to graphml output. On the other hand I did not yet manage to call write_dot in python3. It would exit with TypeError: 'str' does not support the buffer interface.

@chebee7i it is both. Consider the following example:

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
    <key id="foo" for="node" attr.name="foo" attr.type="string"></key>
    <key id="bar" for="node" attr.name="bar" attr.type="string"></key>
    <graph id="G" edgedefault="directed">
        <node id="0">
            <data key="foo">node0foo</data>
            <data key="bar">node0bar</data>
        </node>
        <node id="1">
            <data key="foo">node1foo</data>
            <data key="bar">node1bar</data>
        </node>
        <edge id="0" source="0" target="1"></edge>
        <edge id="0" source="1" target="0"></edge>
    </graph>
</graphml>

And then execute this a couple of times:

python3 -c "import networkx, sys; g = networkx.read_graphml('test.xml'); networkx.write_graphml(g, sys.stdout.buffer)"

You will observe that the edge order changes. Looking at the code for graphml output this seems because edges_iter is not deterministic. Execute this a couple of times to convince yourself:

python3 -c "import networkx, sys; g = networkx.read_graphml('test.xml'); print(list(g.edges_iter()))"

This might also mean that some of the algorithms in networkx behave undeterministic in python3 because their output relies on the order in which they traverse the nodes or edges. But I did not investigate this claim.

edit: there are algorithms which are affected. For example try running the following multiple times:

python3 -c "import networkx, sys; g = networkx.read_graphml('test.xml'); print(list(networkx.algorithms.cycles.simple_cycles(g)))"

Because Johnson's algorithm will start searching at a different node every other time, the output is different as well. But this just as a sidenode. I dont want to make this the topic of this bugreport.

ysitu · 2014-05-30T08:44:36Z

The interface can be something like write_graphml(G, path, encoding='utf-8', prettyprint=True, node_cmp=None, key_cmp=None). The user should supply two comparison functions, one for the nodes and one for the edge keys of multigraphs. They should follow the convention used by sorted. Then the edges can be sorted as tuples (u, v, key). Undirected edges should have node_cmp(u, v) <= 0 in (u, v, key).

josch · 2014-05-30T09:58:13Z

@ysitu sorry I read your comment only after I already created the pull request. It implements a simple additional sorted argument to a variety of output functions.

If you feel stronly about allowing how to compare nodes and edges, then I can create another pull request.

Also, in addition to comparing nodes and edges it must also be possible to compare attributes, no? At least in some graph types there is not only the node and edge order but also the attribute order.

ysitu · 2014-05-30T10:08:01Z

In general, the idea is that you should not rely on the nodes/keys/attributes being sortable by themselves. Letting the user supply comparison functions will enable the greatest flexibility. If that is much too aggressive, you can start with sorting them by their textual representations.

josch · 2014-05-30T10:09:20Z

Okay, I can prepare another pull request.

chebee7i · 2014-05-30T14:11:00Z

Just wanted to point out that in Python 3, the cmp option to sorted no longer exists. So to be forward looking, maybe it would be better to make the user provide key functions instead.

http://python3porting.com/preparing.html#keycmp-section

So for each node, node_key just returns something that is sortable. For edges without attribute data, it seems that those can be sorted using the node key information: (u_key, v_key) or (u_key, v_key, edge_key).

josch · 2014-05-30T14:24:18Z

Having the dropping of the cmp parameter in favor of key in python3 in mind, should this not be handled differently? Instead of supplying the comparison functions node_cmp and key_cmp which can only be supplied to sorted by using cmp_to_key from the functools module, it seems to me that the python3 way of doing sorting would be that the user adds the functions __lt__, __le__ , __eq__, __ne__, __gt__ and __ge__ to the node and edge objects?
The key parameter to sorted would then govern what to sort while the comparison functions govern how to sort.

ysitu · 2014-05-30T14:30:42Z

The interface can be changed accordingly. Instead of node_cmp, key_cmp and attr_cmp, you ask for node_key, key_key, attr_key.

hagberg · 2014-05-30T14:46:58Z

I still feel that the best solution here is to turn of the hash randomization. I haven't yet been convinced that this isn't a valid solution that will work for @josch.

If we want to support ordering of nodes and edges in output files that networkx generates (I'm not in favor myself) I suggest we allow direct specification of a node list and edge list instead of sorting.

josch · 2014-05-30T15:01:30Z

@hagberg and attribute lists per node/edge.

hagberg · 2014-05-30T15:09:56Z

Yes, attributes too, ugh. Now I'm really -1 on adding this as a feature.

josch · 2014-05-30T15:14:46Z

@hagberg have a look at the pull request that I referenced further up. The most minimal solution which just takes a sort parameter works, is simple and doesnt look intrusive to me. This would already solve this bugreport. If I understand it correctly (but I didnt try) this solution would even already allow custom sorting by the user attaching __lt__, __le__ , __eq__, __ne__, __gt__ and __ge__ functions to the nodes, edges and attributes for which he wants custom sorting.

Does the pull request that I created look unacceptable to you?

bjedwards · 2014-05-30T16:11:58Z

I'm sort of with @hagberg on this one. This seems like a lot of cruft to add in for what seems like a single use case. NetworkX has tried to be as general as possible and not introduce constraints(like orderable node types), and I think trying to shoehorn that into a writer is a bad idea. I think there are also another option, subclass or rewrite the GraphMLWrite class as a custom writer.

@josch, if you know a priori the type of your nodes, rewriting GraphMLWriter.add_nodes and GraphMLWriter.add_edges, seems easy enough. It's the general case that is a pain in the ass.

Also can you turn of hash seed randomization programmatically? Or is it read at startup?

chebee7i · 2014-05-30T17:59:44Z

I must be the odd one here :) I'm still partial to building your graph (or converting your graph to) a class that uses ordered dicts.

josch · 2014-05-30T20:45:28Z

@bjedwards sure, if you guys decide that having sorted or otherwise deterministic output in the graph writers is a bad idea for networkx, then of course I will just put my own graphml writer code into my python project. As far as I can see from the documentation, there is no way to switch hash randomization on and off from within python. It is done via the -R argument on the commandline or the PYTHONHASHSEED environment variable. Both then affect all hashes in the Python process and I don't see a way to only let the networkx dictionaries used to store nodes, edges and their attributes be affected.

dschult · 2014-05-30T21:04:19Z

One way to reduce the impact of hash order on only the networkx dictionaries would be to follow @chebee7i and use a class which has OrderedDict dictionaries. See #980. But that's not quite ready to go.

josch · 2014-06-02T06:18:52Z

@dschult if I understand it correctly, then OrderedDict dictionaries could even maintain the same order of the graph as it was originally parsed, no? That would be even more great because it would mean that networkx would not even modify the input file at all if it didnt make any changes to it.

harlowja · 2014-06-08T02:30:23Z

I'd like the ordereddict dictionaries usage (even if graphs really should be unordered in the ideal world). #980 and #1047 (and maybe others?) started along this path and maybe we can get those to be finished up?

Btw hash randomization is usually a really good idea (see https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-1150 and https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-5371 and others...), and asking for it to be disabled for networkx would severely hamper its usage in servers (and other applications) that use networkx internally (and those same applications/servers may output graph files for users to download, for example).

ysitu · 2014-06-11T17:48:20Z

Regarding enabling OrderedDict, one possibility is to create a function

def make_graph_class(dict):
    class Graph(object):
        def __init__(self):
            self.graph = dict()
            # ...
    return Graph

Then the default Graph class is simply defined as

Graph = make_graph_class(dict)

If one wishes to use OrderedDict, he/she can do

OrderedGraph = make_graph_class(OrderedDict)

I believe that this model can be extended to support other backend types as well.

harlowja · 2014-06-12T04:55:12Z

The problem that I hit was that its not just the construction that needs to use an OrderedDict, its also any clone methods, any copy methods and so-on, it extends into edge addition/ordering (should that also use an OrderedDict?).

Some of the issues I see (others may have a better idea here):

def to_undirected (creates a new undirected graph, should this retain ordering?)
def subgraph(self, nbunch) (same question)
def to_directed(self) (same question)
Any of the copy functions in networkx/classes/function.py and the generation functions that exist in networkx (how will these behave?)
Others?

My guess is none of this is insurmountable, just needs to be analyzed to figure out the scope that should be tackled to get from here to there.

hagberg · 2014-06-12T19:01:09Z

We have previous discussed ideas along the line @ysitu suggested in #980 (and earlier). I am for designing a way to use alternate 'backends' to store graph and attribute data such as an SQL database, OrderedDict, sparse matrix, etc. So far we have not converged on a engineering solution to do that. NetworkX 2.0 is an opportunity to adjust the basic API/interface some if necessary to make this happen. It seems do-able but there are issues and trade-offs that should be discussed in a different issue.

That approach may solve @harlowja's issue in other ways (e.g. OrderedDict as a backend) than turning off the hash randomization. So I suggest we not implement a half-way approach here to provide ordered storage or reading and writing of nodes, edges, and attributes.

@ysitu

This supercedes networkx#1268 as a proposed structure for OrderedGraph and ThinGraph class structures. It is based on discussion from networkx#980 and has also been discussed in networkx#1181, networkx#1244, networkx#1267, networkx#1267 This implementation includes examples (in docs and in tests) for 1) ordered nodes, 2) ordered nodes and edges and 3) thin (low RAM) graphs that don't need edge attributes [not related to ordered nodes]. In this proposal, the dict-like arguments are assigned in a new class structure rather than assigning them to an instance of the SpecialGraph class. Something similar to: class OrderedGraph(SpecialGraph): node_dict_factory=OrderedDict G=OrderedGraph() If all goes well with this version (including reasonable speed) it might be reasonable to include it in the base classes Graph, etc. I think @ysitu prefers a more flexible base class than adding yet more graph classes, and all else equal (or almost equal) I agree. This pull request may be useful for people who wish OrderedGraph or ThinGraph and don't want to wait for 2.0

ysitu modified the milestones: networkx-future, networkx-2.0 May 30, 2014

josch mentioned this issue May 30, 2014

Allow to write graphs in sorted order #1183

Closed

ysitu removed this from the networkx-2.0 milestone May 30, 2014

ysitu added this to the networkx-future milestone Jun 8, 2014

ysitu closed this as completed Aug 8, 2014

chebee7i mentioned this issue Aug 15, 2014

there should be stable_topological_sort() #1244

Closed

ysitu added Declined labels Aug 15, 2014

ysitu removed this from the networkx-future milestone Oct 12, 2014

hagberg mentioned this issue Oct 12, 2014

Undeterministic output between different architectures #1267

Closed

dschult mentioned this issue Oct 12, 2014

OrderedGraph and friends proposed structure #1268

Closed

tirkarthi mentioned this issue May 11, 2022

dag: fix dot file rendering order. iterative/dvc#7725

Merged

2 tasks

LilDojd mentioned this issue Apr 27, 2024

ProtocolDAG._iterate_dag_order is unstable OpenFreeEnergy/gufe#314

Open

Undeterministic output order with python3 #1181

Undeterministic output order with python3 #1181

Comments

josch commented May 29, 2014

chebee7i commented May 29, 2014

josch commented May 29, 2014

ysitu commented May 29, 2014

josch commented May 29, 2014

hagberg commented May 29, 2014

josch commented May 29, 2014

ysitu commented May 29, 2014

josch commented May 29, 2014

hagberg commented May 29, 2014

josch commented May 29, 2014

hagberg commented May 29, 2014

josch commented May 29, 2014

hagberg commented May 29, 2014

josch commented May 29, 2014

chebee7i commented May 30, 2014

ysitu commented May 30, 2014

josch commented May 30, 2014

ysitu commented May 30, 2014

josch commented May 30, 2014

ysitu commented May 30, 2014

josch commented May 30, 2014

chebee7i commented May 30, 2014

josch commented May 30, 2014

ysitu commented May 30, 2014

hagberg commented May 30, 2014

josch commented May 30, 2014

hagberg commented May 30, 2014

josch commented May 30, 2014

bjedwards commented May 30, 2014

chebee7i commented May 30, 2014

josch commented May 30, 2014

dschult commented May 30, 2014

josch commented Jun 2, 2014

harlowja commented Jun 8, 2014

ysitu commented Jun 11, 2014

harlowja commented Jun 12, 2014

hagberg commented Jun 12, 2014