O(n^2) -> O(n) implementation for scale_free_graph #4727
Conversation
Very nice!!
This generally LGTM - I find it easier to read and it should definitely be more performant with the degree distribution stored, rather than the existing iterative computation over the degree distribution every time a node selection is made.

Reading through this I just wanted to confirm one thing. IIUC, in the limit of very large delta (`delta >> len(degree_distribution)`) the selection essentially boils down to uniform sampling from the nodes, correct? I'm not familiar with the cited algorithm so I just wanted to double check that this is the expected behavior.
networkx/generators/directed.py (outdated):

    bias_sum = n_nodes * delta
    p_delta = bias_sum / (bias_sum + len(candidates))
    if seed.random() < p_delta:
        return seed.randint(0, n_nodes)
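To make the limiting behavior concrete, here is a small self-contained sketch (the helper `choice_prob` is hypothetical, not part of the PR) that computes the exact selection probability of this kind of two-step scheme and confirms it is proportional to degree + delta; as delta grows, `p_delta` tends to 1 and the draw tends to uniform `1/n_nodes`:

```python
def choice_prob(node, n_nodes, candidates, delta):
    # Probability that `node` is returned by the two-step scheme:
    # with probability p_delta pick uniformly from the n_nodes nodes,
    # otherwise pick uniformly from the degree-repeated `candidates` list.
    bias_sum = n_nodes * delta
    p_delta = bias_sum / (bias_sum + len(candidates))
    uniform_part = p_delta / n_nodes
    weighted_part = (1 - p_delta) * candidates.count(node) / len(candidates)
    return uniform_part + weighted_part

# Nodes 0, 1, 2 with degrees 2, 1, 3, stored as a degree-repeated list:
candidates = [0, 0, 1, 2, 2, 2]
delta = 1.5
total = sum(candidates.count(v) + delta for v in range(3))  # = 10.5
for v in range(3):
    target = (candidates.count(v) + delta) / total  # ∝ degree + delta
    assert abs(choice_prob(v, 3, candidates, delta) - target) < 1e-12
```

This also answers the question above analytically: the overall probability works out to (degree + delta) / (n_nodes * delta + len(candidates)), so for delta much larger than the degrees it approaches 1 / n_nodes.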
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implicit here is an assumption that the nodes are essentially drawn from range(0, n_nodes). For cases where this isn't true, the returned nodes might seem strange to users:

>>> G = nx.scale_free_graph(10)
>>> H = nx.relabel_nodes(G, {i: 2**i for i in range(10)})
>>> H.nodes()
NodeView((1, 2, 4, 8, 16, 32, 64, 128, 256, 512))
>>> K = nx.scale_free_graph(20, create_using=H, delta_in=1e6, delta_out=1e6)
>>> K.nodes()
NodeView((1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 7, 11, 12, 5, 14, 0, 9, 13, 18, 17))
AFAICT this doesn't affect the algorithm behavior so it isn't wrong. However it might be worth just adding a note to the docstring about how new nodes are selected so that users starting with an initial graph that has non-integer nodes or unevenly spaced integer nodes know what to expect.
Perhaps we can use that random number as an index to the candidates list so it returns a node instead of an integer? But I might be misunderstanding the intent....

    return candidates[seed.randint(0, n_nodes)]
Great catch! I've added a commit that properly supports working with existing graphs by introducing a node state (a simple list of nodes) and a cursor for new nodes that starts at the highest already occupied int + 1.

The code would be simpler if we sampled from list(G.nodes()), but that would make the code orders of magnitude slower once again for large n. The current approach does not sacrifice any of the performance gains; timing results are still similar to those posted above.
@rossbar your snippet now produces
NodeView((1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522))
But it now also supports e.g. strings:
>>> G = nx.scale_free_graph(10)
>>> H = nx.relabel_nodes(G, {i: str(2**i) for i in range(10)})
>>> H.nodes()
NodeView(('1', '2', '4', '8', '16', '32', '64', '128', '256', '512'))
>>> K = nx.scale_free_graph(20, create_using=H, delta_in=1e6, delta_out=1e6)
>>> K.nodes()
NodeView(('1', '2', '4', '8', '16', '32', '64', '128', '256', '512', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
@dschult your approach unfortunately wouldn't work because len(candidates) > len(nodes), as candidates contains the nodes repeated d times where d is their degree. Also, sampling from candidates brings us back to sampling with weights determined by the degrees, and in that piece of the code we want to sample uniformly.
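As background for why the candidates list is built this way: a degree-repeated list turns weighted sampling into a plain uniform pick. A minimal self-contained sketch (the names `edges` and `in_candidates` are illustrative, not from the PR):

```python
import random

# For a directed edge list, repeat each node once per in-edge; a uniform
# pick from the repeated list is then an in-degree-weighted pick.
edges = [("a", "b"), ("c", "b"), ("a", "c"), ("b", "c")]
in_candidates = [v for _, v in edges]

assert in_candidates == ["b", "b", "c", "c"]
assert in_candidates.count("b") == 2  # in-degree of 'b'
assert in_candidates.count("c") == 2  # in-degree of 'c'

seed = random.Random(0)
node = seed.choice(in_candidates)  # O(1) degree-weighted draw
assert node in ("b", "c")
```

Appending the endpoints of each newly added edge keeps the list in sync with the degrees, which is what makes each subsequent draw O(1).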
Almost @rossbar! Better formulated would be that we approach uniform sampling when n_nodes * delta >> len(candidates). This should make sense because, going back to the algorithm, the probability for a node to be sampled scales with its in- or out-degree plus the bias, i.e. P(v) ∝ d(v) + delta.
Hmm, so looking at this again, it looks to me like there may have been a very subtle bug that would persist in this implementation as well. For example, look at the following (original) code chunk: networkx/generators/directed.py, lines 276 to 279 in f977054.

The comment indicates that ...

>>> G = nx.MultiDiGraph([(2, 3), (3, 4), (2, 4)])
>>> H = nx.scale_free_graph(10, create_using=G)

In this case, if ...
@rossbar I added a commit (9bc7d31) that addresses precisely this! I explained it further in one of the threads above since I responded to some of the content of your comment there, but I guess you missed that; it's tricky to keep these conversations linear. Copy of #4727 (comment):

Great catch! I've added a commit that properly supports working with existing graphs by introducing a node state (a simple list of nodes) and a cursor for new nodes that starts at the highest already occupied int + 1.

The code would be simpler if we would sample from list(G.nodes()), but it would make the code orders of magnitude slower once again for large n. The current approach does not sacrifice any of the performance gains; timing results are still similar to those posted above.

@rossbar your snippet now produces

NodeView((1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522))

But it now also supports e.g. strings:

>>> G = nx.scale_free_graph(10)
>>> H = nx.relabel_nodes(G, {i: str(2**i) for i in range(10)})
>>> H.nodes()
NodeView(('1', '2', '4', '8', '16', '32', '64', '128', '256', '512'))
>>> K = nx.scale_free_graph(20, create_using=H, delta_in=1e6, delta_out=1e6)
>>> K.nodes()
NodeView(('1', '2', '4', '8', '16', '32', '64', '128', '256', '512', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
That's a nice approach to finding new nodes. But it is fragile in the sense that 1.0 equates to 1 but isn't an integer. We could use ...
@dschult interesting, at first I did not understand why in your example the cursor cannot be zero, as I figured the graph would allow for ...

I'll prepare a fix to properly determine the starting point for the cursor. A treatment that would work for complex numbers as well could be selecting the max + 1 from ...

Open to suggestions.
Hmmm.... This is getting annoying and obtuse. :}

    if new_nodes is None:
        ...current code to find max...
        cursor = itertools.count(max + 1)
    else:
        try:
            assert len(new_nodes) == n
            cursor = iter(new_nodes)
        # catch objects with no `len` or `iter` capability
        except (TypeError, AssertionError):
            raise nx.NetworkXError("new_nodes should be a list of `n` node objects")

By making `new_nodes` ...
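For what it's worth, a runnable version of this sketch could look as follows (hedged: `make_cursor`, its `current_max` parameter, and the local `NetworkXError` stand-in are illustrative assumptions, not the PR's actual code):

```python
import itertools


class NetworkXError(Exception):
    # stand-in for nx.NetworkXError so the sketch is self-contained
    pass


def make_cursor(n, new_nodes=None, current_max=-1):
    """Return an iterator yielding labels for the n new nodes."""
    if new_nodes is None:
        # default: keep counting up from the highest occupied integer
        return itertools.count(current_max + 1)
    try:
        # catch objects with no `len` or `iter` capability
        assert len(new_nodes) == n
        return iter(new_nodes)
    except (TypeError, AssertionError):
        raise NetworkXError("new_nodes should be a list of `n` node objects")


cursor = make_cursor(3, current_max=512)
assert [next(cursor) for _ in range(3)] == [513, 514, 515]
assert list(make_cursor(2, new_nodes=["x", "y"])) == ["x", "y"]

raised = False
try:
    make_cursor(2, new_nodes=object())  # no len/iter -> error
except NetworkXError:
    raised = True
assert raised
```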
Agreed - my fault, sorry for sending us down the rabbit hole! FWIW I think even ec9bfee was fine! I just thought it'd be good to add a note to the docstring about the non-sequential node behavior. I'd just as soon get the original implementation in and worry about the gremlins associated with passing in a non-empty MultiDiGraph instance in another issue.
…ing graph with number-based nodes
Thanks for the input @dschult and @rossbar. I think it would be better indeed for this pull request to leave the contract of the generator unchanged and just optimize. In the meantime, I did go ahead and implemented the ...

The following now yields:

>>> G = nx.scale_free_graph(5)
>>> new_labels = {
...     0: 4.4,
...     1: 5,
...     2: 12.1+9.2j,
...     3: "m",
...     4: "16.3",
... }
>>> H = nx.relabel_nodes(G, new_labels)
>>> H.nodes()
NodeView((4.4, 5, (12.1+9.2j), 'm', '16.3'))
>>> K = nx.scale_free_graph(10, create_using=H)
>>> K.nodes()
NodeView((4.4, 5, (12.1+9.2j), 'm', '16.3', 13, 14, 15, 16, 17))

To be completely safe, one could require that nodes of an existing graph fed to the algorithm adhere to a type whitelist for which we can ensure correct behavior, or indeed have the user supply the new nodes as an iterable / iterator of sorts. Since we would then be in the territory of modifying the contract anyway, I have some additional improvements in mind, e.g. DiGraph support (the non-multi kind) and an option to make self-loops illegal.

So, ready to merge? Cheers
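One way to reproduce the cursor start shown above (13 = floor(12.1) + 1) is sketched below; `next_int_label`, the real-part comparison for complex nodes, and the skipping of string nodes are my assumptions for illustration, not necessarily the PR's exact rule:

```python
import math
import numbers


def next_int_label(nodes):
    # Start new integer labels just above the largest number-like node.
    # Complex nodes are compared by real part; strings such as "16.3"
    # and booleans are ignored (assumptions, see lead-in).
    numeric = []
    for v in nodes:
        if isinstance(v, bool):
            continue
        if isinstance(v, complex):
            numeric.append(v.real)
        elif isinstance(v, numbers.Real):
            numeric.append(v)
    return math.floor(max(numeric)) + 1 if numeric else 0


# Matches the example above: max number-like node is 12.1 (+9.2j ignored)
assert next_int_label([4.4, 5, 12.1 + 9.2j, "m", "16.3"]) == 13
# No number-like nodes at all: fall back to counting from 0
assert next_int_label(["m", "x"]) == 0
```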
Thanks for this!
Thanks @florishermsen !
* O(N) implementation for scale_free_graph
* proper support for generating a scale-free graph on top of an existing graph
* scale free graph generator: better support for starting with an existing graph with number-based nodes
I was trying to use scale_free_graph from generators.directed and noticed it drastically grew slower with the number of nodes n, rendering it effectively useless for large numbers of nodes (e.g. 1M). Since the gist of the algorithm involves preferential attachment, and other networkx methods like the ones based on Barabási–Albert are in fact fast, I took a look at the implementation. Parts of the current node sampling have O(n) time complexity, effectively rendering the generator as a whole O(n^2). I rewrote that sampling to be O(1), so now the generator itself is O(n).

Changes
Like e.g. the implementations for barabasi_albert_graph and dual_barabasi_albert_graph, the new implementation keeps simple lists of repeated nodes representing the in- and out-degrees, so we can perform O(1) sampling with probabilities linear in those degrees. This suffices for the case that the biases are zero.

Because the algorithm also needs to support nonzero biases, it would seem full weights have to be computed before each sample is drawn (based on current node degrees plus the biases), which would make the sampling O(n) again. I've solved this probabilistically with a two-step sampling mechanism that is once again O(1). Note that we're not doing any approximations here; the new implementation of the algorithm behaves exactly the same as the old one.
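The two-step mechanism can be sketched as a stand-alone function (a simplified illustration of my reading of the approach, not the PR's verbatim code; `choose_node` and its signature are assumptions):

```python
import random


def choose_node(seed, candidates, node_list, delta):
    # One O(1) draw with probability proportional to degree + delta:
    # with probability n*delta / (n*delta + len(candidates)) return a
    # uniformly random existing node (the bias term), otherwise a
    # degree-weighted one via the repeated-node `candidates` list.
    if delta > 0:
        bias_sum = len(node_list) * delta
        p_delta = bias_sum / (bias_sum + len(candidates))
        if seed.random() < p_delta:
            return seed.choice(node_list)
    return seed.choice(candidates)


seed = random.Random(42)
node_list = [0, 1, 2]
candidates = [0, 0, 1, 2, 2, 2]  # degrees 2, 1, 3
# delta == 0: pure degree-weighted sampling from the repeated list
assert choose_node(seed, candidates, node_list, 0) in candidates
# delta > 0: either branch still returns a valid node
assert choose_node(seed, candidates, node_list, 1.5) in node_list
```

Both branches are O(1), so the per-edge cost no longer depends on n.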
I've also added checks that the biases must be >= 0. This does not remove functionality, because the current implementation would have been incorrect for biases < 0 anyway (if in doubt, review the original cursor-based sampling function and imagine a negative bias < -1 to conclude the same). Furthermore, the referenced original paper (see https://www.microsoft.com/en-us/research/publication/directed-scale-free-graphs/) also sets biases >= 0 as a condition, so it probably should have been checked anyway (why that is a sensible condition is another story which I think we should not get into currently; I would refer to the paper for now).
Timing results
scale_free_graph(n, alpha=0.2, beta=0.6, gamma=0.2, delta_in=1, delta_out=1)
Couldn't find any related issues, apart from this SO post similarly complaining about days of runtime:
https://stackoverflow.com/questions/56119399/quickly-generating-large-scale-free-graphs-with-networkx