New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Undeterministic output order with python3 #1181
Comments
When writing to files, we can attempt to sort the nodes/edges, but NetworkX doesn't require that nodes be sortable. Maybe a sort=True parameter? I wonder if we could just move to OrderedDict without causing too many issues. 2.7 and 3.x have it. For 2.6, we make ordereddict a dependency. |
Thanks, having the option of deterministic output with a sort=True parameter in write_graphml and friends would fix my problem. Having determinism anywhere else might not be desirable due to possible performance penalties. |
I wonder why deterministic output is needed. |
One reason: if I want to check whether two graphs are equal, it is easier to run |
How about turning off the hash randomization?
|
This will affect all hashes and not only those used to generate the graphml or dot output. So this is not a solution because in general, hash randomization is useful. |
Can you prepare a PR that implements the needed functionality? |
Yes, if you tell me how you'd like it to be implemented. Should the write_graphml and friends functions get a |
Example please where you need hash randomization and determinstic output. Otherwise it seems like a pretty special case? |
Hash randomization seems like a good thing to have https://mail.python.org/pipermail/python-announce-list/2012-March/009394.html so I do not think it would be wise to drop it for everything. I gave one example why deterministic output makes sense above. I do not think that hash randomization should be given up for that. My biggest argument would actually be that if a user runs an algorithm that writes the graph somewhere, then given the same input, they'd expect the exact same output. At least that was my assumption. I was surprised when I saw that each time I ran my code, the output file was different (though representing the same graph). If you do not want deterministic graph output in networkx, please mention in the documentation that the vertex and edge order in the graph writers is random when using python3 and deterministic for a given version of python 2.x. I think this will avoid surprises as my code started failing its testcases when I upgraded to python3 because I was comparing the md5sum of the created graphs. |
Yes, I can understand the surprise and frustration that output varies from run to run. But I still think the solution is to disable hash randomization (or use the same randomization) if you want more deterministic behavior (as used to be in Python2). I can't evaluate the security risk in disabling the randomization. It seems to be basically a denial of service attack so if you are using NetworkX in an uncontrolled situation with malicious users it could make sense to randomize. In general graphs are sets of nodes and edges with no implied order. We should certainly document this fact as it relates to input and output of data. |
Would you accept a patch which adds an additional |
Sure we could do that. We need to be careful that the error produced when nodes are not sortable makes sense to the user. I don't think many people use nodes that can't be sorted but we allow it. |
Thanks, I'll try to prepare one. |
Is it the edges or nodes or both that get out of order? If its only the nodes, you could also make it accept an ordering on the nodes. |
All textual output functions, not just |
@ysitu as you can see in the conversation above I was always saying "functions like @chebee7i it is both. Consider the following example:
And then execute this a couple of times:
You will observe that the edge order changes. Looking at the code for graphml output this seems because
This might also mean that some of the algorithms in networkx behave undeterministic in python3 because their output relies on the order in which they traverse the nodes or edges. But I did not investigate this claim. edit: there are algorithms which are affected. For example try running the following multiple times:
Because Johnson's algorithm will start searching at a different node every other time, the output is different as well. But this just as a sidenode. I dont want to make this the topic of this bugreport. |
The interface can be something like |
@ysitu sorry I read your comment only after I already created the pull request. It implements a simple additional If you feel stronly about allowing how to compare nodes and edges, then I can create another pull request. Also, in addition to comparing nodes and edges it must also be possible to compare attributes, no? At least in some graph types there is not only the node and edge order but also the attribute order. |
In general, the idea is that you should not rely on the nodes/keys/attributes being sortable by themselves. Letting the user supply comparison functions will enable the greatest flexibility. If that is much too aggressive, you can start with sorting them by their textual representations. |
Okay, I can prepare another pull request. |
Just wanted to point out that in Python 3, the http://python3porting.com/preparing.html#keycmp-section So for each node, node_key just returns something that is sortable. For edges without attribute data, it seems that those can be sorted using the node key information: (u_key, v_key) or (u_key, v_key, edge_key). |
Having the dropping of the |
The interface can be changed accordingly. Instead of |
I still feel that the best solution here is to turn of the hash randomization. I haven't yet been convinced that this isn't a valid solution that will work for @josch. If we want to support ordering of nodes and edges in output files that networkx generates (I'm not in favor myself) I suggest we allow direct specification of a node list and edge list instead of sorting. |
@hagberg and attribute lists per node/edge. |
Yes, attributes too, ugh. Now I'm really -1 on adding this as a feature. |
@hagberg have a look at the pull request that I referenced further up. The most minimal solution which just takes a Does the pull request that I created look unacceptable to you? |
I'm sort of with @hagberg on this one. This seems like a lot of cruft to add in for what seems like a single use case. NetworkX has tried to be as general as possible and not introduce constraints(like orderable node types), and I think trying to shoehorn that into a writer is a bad idea. I think there are also another option, subclass or rewrite the @josch, if you know a priori the type of your nodes, rewriting Also can you turn of hash seed randomization programmatically? Or is it read at startup? |
I must be the odd one here :) I'm still partial to building your graph (or converting your graph to) a class that uses ordered dicts. |
@bjedwards sure, if you guys decide that having sorted or otherwise deterministic output in the graph writers is a bad idea for networkx, then of course I will just put my own graphml writer code into my python project. As far as I can see from the documentation, there is no way to switch hash randomization on and off from within python. It is done via the |
@dschult if I understand it correctly, then OrderedDict dictionaries could even maintain the same order of the graph as it was originally parsed, no? That would be even more great because it would mean that networkx would not even modify the input file at all if it didnt make any changes to it. |
I'd like the ordereddict dictionaries usage (even if graphs really should be unordered in the ideal world). #980 and #1047 (and maybe others?) started along this path and maybe we can get those to be finished up? Btw hash randomization is usually a really good idea (see https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-1150 and https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-5371 and others...), and asking for it to be disabled for networkx would severely hamper its usage in servers (and other applications) that use networkx internally (and those same applications/servers may output graph files for users to download, for example). |
Regarding enabling def make_graph_class(dict):
class Graph(object):
def __init__(self):
self.graph = dict()
# ...
return Graph Then the default Graph = make_graph_class(dict) If one wishes to use OrderedGraph = make_graph_class(OrderedDict) I believe that this model can be extended to support other backend types as well. |
The problem that I hit was that its not just the construction that needs to use an OrderedDict, its also any clone methods, any copy methods and so-on, it extends into edge addition/ordering (should that also use an OrderedDict?). Some of the issues I see (others may have a better idea here):
My guess is none of this is insurmountable, just needs to be analyzed to figure out the scope that should be tackled to get from here to there. |
We have previous discussed ideas along the line @ysitu suggested in #980 (and earlier). I am for designing a way to use alternate 'backends' to store graph and attribute data such as an SQL database, OrderedDict, sparse matrix, etc. So far we have not converged on a engineering solution to do that. NetworkX 2.0 is an opportunity to adjust the basic API/interface some if necessary to make this happen. It seems do-able but there are issues and trade-offs that should be discussed in a different issue. That approach may solve @harlowja's issue in other ways (e.g. OrderedDict as a backend) than turning off the hash randomization. So I suggest we not implement a half-way approach here to provide ordered storage or reading and writing of nodes, edges, and attributes. |
This supercedes networkx#1268 as a proposed structure for OrderedGraph and ThinGraph class structures. It is based on discussion from networkx#980 and has also been discussed in networkx#1181, networkx#1244, networkx#1267, networkx#1267 This implementation includes examples (in docs and in tests) for 1) ordered nodes, 2) ordered nodes and edges and 3) thin (low RAM) graphs that don't need edge attributes [not related to ordered nodes]. In this proposal, the dict-like arguments are assigned in a new class structure rather than assigning them to an instance of the SpecialGraph class. Something similar to: class OrderedGraph(SpecialGraph): node_dict_factory=OrderedDict G=OrderedGraph() If all goes well with this version (including reasonable speed) it might be reasonable to include it in the base classes Graph, etc. I think @ysitu prefers a more flexible base class than adding yet more graph classes, and all else equal (or almost equal) I agree. This pull request may be useful for people who wish OrderedGraph or ThinGraph and don't want to wait for 2.0
Hi,
with Python3, hashes are salted and thus, functions like edges_iter return their results in a different order between subsequent python3 invocations. While this might be acceptable, this should be documented.
Unfortunately, a side effect of this is, that writing to a file does not produce deterministic output either. I'd argue that if the user saves the same graph twice with different invocations of python3, the generated graphs should not only be semantically the same but also byte-by-byte the same files.
Right now, this is only the case in python2.x but not anymore in python3 where a random seed is used by default.
Please document that the order of the various iterators in networkx depends on pythons internal order and is just not deterministic with python3.
Please make it such that when writing graphs to files, the output is deterministically the same between different invocations.
The text was updated successfully, but these errors were encountered: