
O(n^2) -> O(n) implementation for scale_free_graph #4727

Merged
merged 3 commits into networkx:main on Apr 19, 2021

Conversation

florishermsen
Contributor

@florishermsen florishermsen commented Apr 5, 2021

I was trying to use scale_free_graph from generators.directed and noticed that it slowed down drastically as the number of nodes n grew, making it effectively useless for large n (e.g. 1M). Since the gist of the algorithm involves preferential attachment, and other networkx generators, like the ones based on Barabási–Albert, are in fact fast, I took a look at the implementation. Parts of the current node sampling have O(n) time complexity, rendering the generator as a whole O(n^2). I rewrote that sampling to be O(1), so the generator itself is now O(n).

Changes
Like e.g. the implementation for barabasi_albert_graph or dual_barabasi_albert_graph, the new implementation keeps simple lists of repeated nodes representing the in- and out-degrees so we can perform O(1) sampling with probabilities linear in those degrees. This suffices for the case that biases are zero.
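The repeated-node-list idea can be sketched roughly as follows (a minimal illustration with made-up names, not the PR's actual code):

```python
import random

# Sketch of degree-proportional sampling via a repeated-node list, the
# same trick used by barabasi_albert_graph. Each node appears in
# `candidates` once per unit of degree, so one uniform draw from the
# list picks node v with probability deg(v) / sum(deg).
candidates = []

def add_edge(u, v, candidates):
    # Appending both endpoints bumps each node's sampling weight by one.
    candidates.append(u)
    candidates.append(v)

def sample_by_degree(candidates, rng=random):
    # O(1): a single uniform index into the list.
    return candidates[rng.randrange(len(candidates))]

add_edge(0, 1, candidates)
add_edge(0, 2, candidates)
# Node 0 now has degree 2 and appears twice in `candidates`,
# so it is drawn with probability 2/4.
```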

Because the algorithm also needs to support nonzero biases, it would seem full weights have to be computed before each sample is drawn (based on current node degrees plus the biases), which would make the sampling O(n) again. I've solved this probabilistically with a two-step sampling mechanism that is once again O(1). Note that we're not doing any approximations here, the new implementation of the algorithm behaves exactly the same as the old one.
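The two-step mechanism described above can be sketched as follows (an illustrative simplification with assumed names, not the merged code): first decide whether the draw comes from the uniform "bias" part or from the degree-repeated list.

```python
import random

def biased_choice(candidates, nodes, delta, rng=random):
    """Return node v with probability proportional to deg(v) + delta,
    in O(1) and without building a per-node weight array.

    `candidates` holds each node repeated once per unit of degree;
    `nodes` is the plain node list. (Illustrative sketch only.)
    """
    bias_sum = len(nodes) * delta
    # Mixture decomposition, with D = len(candidates) = sum of degrees
    # and p_delta = n*delta / (n*delta + D):
    #   P(v) = p_delta * 1/n + (1 - p_delta) * deg(v)/D
    #        = (deg(v) + delta) / (D + n*delta)
    if rng.random() < bias_sum / (bias_sum + len(candidates)):
        return nodes[rng.randrange(len(nodes))]        # uniform: the +delta term
    return candidates[rng.randrange(len(candidates))]  # proportional to degree
```

Because the mixture reproduces the target distribution exactly, no approximation is involved, which matches the PR's claim that the new implementation behaves the same as the old one.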

I've also added checks that the biases must be >= 0. This does not remove functionality, because the current implementation would have been incorrect for biases < 0 anyway (if in doubt, review the original cursor-based sampling function and imagine a bias < -1 to reach the same conclusion). Furthermore, the referenced original paper (see https://www.microsoft.com/en-us/research/publication/directed-scale-free-graphs/) also sets biases >= 0 as a condition, so it probably should have been checked all along. (Why that is a sensible condition is another story, which I think we should not get into here; I would refer to the paper for now.)

Timing results
scale_free_graph(n, alpha=0.2, beta=0.6, gamma=0.2, delta_in=1, delta_out=1)

| n | old | new | speedup |
|------|--------------|---------|---------|
| 1e+1 | 7.19 ms | 168 µs | x50 |
| 1e+2 | 8.91 ms | 1.15 ms | x10 |
| 1e+3 | 582 ms | 11.2 ms | x50 |
| 1e+4 | 59.1 s | 185 ms | x300 |
| 1e+5 | 100 m (est.) | 1.94 s | x3000 |
| 1e+6 | 7 d (est.) | 21.6 s | x30000 |

I couldn't find any related issues, apart from this Stack Overflow post similarly complaining about days of runtime:
https://stackoverflow.com/questions/56119399/quickly-generating-large-scale-free-graphs-with-networkx

@dschult
Member

dschult commented Apr 6, 2021

Very Nice!!
Thanks for hunting this down and fixing it! :}

Contributor

@rossbar rossbar left a comment


This generally LGTM. I find it easier to read, and it should definitely be more performant with the degree distribution stored rather than recomputed iteratively every time a node is selected.

Reading through this I just wanted to confirm one thing. IIUC, in the limit of very large delta (delta >> len(degree_distribution)) the selection essentially boils down to uniform sampling from the nodes, correct? I'm not familiar with the cited algorithm so I just wanted to double check that this is the expected behavior.

On networkx/generators/directed.py:
bias_sum = n_nodes * delta
p_delta = bias_sum / (bias_sum + len(candidates))
if seed.random() < p_delta:
return seed.randint(0, n_nodes)
Contributor

Implicit here is an assumption that the nodes are essentially drawn from range(0, n_nodes). For cases where this isn't true, the returned nodes might seem strange to users:

>>> G = nx.scale_free_graph(10)
>>> H = nx.relabel_nodes(G, {i: 2**i for i in range(10)})
>>> H.nodes()
NodeView((1, 2, 4, 8, 16, 32, 64, 128, 256, 512))
>>> K = nx.scale_free_graph(20, create_using=H, delta_in=1e6, delta_out=1e6)
>>> K.nodes()
NodeView((1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 7, 11, 12, 5, 14, 0, 9, 13, 18, 17))

AFAICT this doesn't affect the algorithm behavior so it isn't wrong. However it might be worth just adding a note to the docstring about how new nodes are selected so that users starting with an initial graph that has non-integer nodes or unevenly spaced integer nodes know what to expect.

Member

Perhaps we can use that random number as an index to the candidates list so it returns a node instead of an integer? But I might be misunderstanding the intent....
return candidates[seed.randint(0, n_nodes)]

Contributor Author

Great catch! I've added a commit that properly supports working with existing graphs by introducing a node state (simple list of nodes) and a cursor for new nodes that starts at the highest already occupied int + 1.

The code would be simpler if we sampled from list(G.nodes()), but that would make the code orders of magnitude slower again for large n. The current approach does not sacrifice any of the performance gains; timing results are still similar to those posted above.

@rossbar your snippet now produces

NodeView((1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522))

But it now also supports e.g. strings:

>>> G = nx.scale_free_graph(10)
>>> H = nx.relabel_nodes(G, {i: str(2**i) for i in range(10)})
>>> H.nodes()
NodeView(('1', '2', '4', '8', '16', '32', '64', '128', '256', '512'))
>>> K = nx.scale_free_graph(20, create_using=H, delta_in=1e6, delta_out=1e6)
>>> K.nodes()
NodeView(('1', '2', '4', '8', '16', '32', '64', '128', '256', '512', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9))

@dschult your approach unfortunately wouldn't work, because len(candidates) > len(nodes): candidates contains each node repeated d times, where d is its degree. Also, sampling from candidates brings us back to sampling with weights determined by the degrees, and in that piece of the code we want to sample uniformly.

@florishermsen
Contributor Author

florishermsen commented Apr 7, 2021

Almost, @rossbar! Better formulated: we approach uniform sampling when len(nodes) * delta >> len(degree_distribution), where degree_distribution (sampling candidates in the code) is the "raveled" in- or out-degree distribution (i.e. a node with degree 7 appears 7 times in that list), so its length is equal to len(nodes) * avg(degree). As you can see, the inequality then simplifies to delta >> avg(degree).

This should make sense because going back to the algorithm, the probability for a node to be sampled scales with its in or out-degree plus the bias, i.e. ~(d+b). In case b >> d, this effectively becomes ~b, and since the bias is the same for all nodes this indeed effectively is uniform sampling.
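The limit is easy to check numerically: with P(v) proportional to deg(v) + delta, the ratio between the most and least probable nodes collapses toward 1 once delta dwarfs the typical degree (toy degree sequence below, not taken from the PR):

```python
# With P(v) = (deg(v) + delta) / (sum(degrees) + n * delta), the spread
# between the most and least likely nodes shrinks as delta grows.
degrees = [1, 2, 7, 90]  # toy degrees; the degree-90 node dominates at small delta

def prob(d, degrees, delta):
    return (d + delta) / (sum(degrees) + len(degrees) * delta)

for delta in (0, 1, 1e6):
    ratio = prob(max(degrees), degrees, delta) / prob(min(degrees), degrees, delta)
    print(f"delta={delta}: max/min probability ratio = {ratio:.4f}")
# The ratio falls from 90 (delta=0) toward 1 (delta >> avg degree),
# i.e. sampling becomes effectively uniform.
```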

@rossbar
Contributor

rossbar commented Apr 10, 2021

Hmm so looking at this again, it looks to me like there may have been a very subtle bug that would persist in this implementation as well. For example, look at the following (original) code chunk:

if r < alpha:
# alpha
# add new node v
v = len(G)

The comment indicates that v should be a node that is not already present in the Graph. It seems this relies on assumptions about the nodes in G; most likely based on the default value of create_using, which creates a graph starting with node 0 and incrementing upwards by one. However, it's easy to violate this assumption e.g.

>>> G = nx.MultiDiGraph([(2, 3), (3, 4), (2, 4)])
>>> H = nx.scale_free_graph(10, create_using=G)

In this case, if scale_free_graph hit the snippet above, the "new" node would be 3, which is clearly already in the Graph. I'm not sure what impact this has on the algorithm as a whole (maybe this is fine?) - it's just a detail where the implementation seems to diverge from the desired behavior (gathered from the comments) for certain graph inputs. Is this a problem for the algorithm? If it's not, at the very least the comments should then be updated to reflect what can actually happen in the general case.

@florishermsen
Contributor Author

florishermsen commented Apr 10, 2021

@rossbar I added a commit (9bc7d31) that addresses precisely this!

I explained it further in one of the threads above, since I responded to some of the content of your comment there, but I guess you missed that; it's tricky to keep these conversations linear.


Copy of #4727 (comment) :

Great catch! I've added a commit that properly supports working with existing graphs by introducing a node state (simple list of nodes) and a cursor for new nodes that starts at the highest already occupied int + 1.

The code would be simpler if we sampled from list(G.nodes()), but that would make the code orders of magnitude slower again for large n. The current approach does not sacrifice any of the performance gains; timing results are still similar to those posted above.

@rossbar your snippet now produces

NodeView((1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522))

But it now also supports e.g. strings:

>>> G = nx.scale_free_graph(10)
>>> H = nx.relabel_nodes(G, {i: str(2**i) for i in range(10)})
>>> H.nodes()
NodeView(('1', '2', '4', '8', '16', '32', '64', '128', '256', '512'))
>>> K = nx.scale_free_graph(20, create_using=H, delta_in=1e6, delta_out=1e6)
>>> K.nodes()
NodeView(('1', '2', '4', '8', '16', '32', '64', '128', '256', '512', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9))

@dschult
Member

dschult commented Apr 10, 2021

That's a nice approach to finding new nodes. But it is fragile in the sense that 1.0 compares equal to 1 but isn't an integer.
So a graph with nodes 0.0, 1.0, 2.0, 3.0 would have the cursor be 0 even though it should be 4.

We could use isinstance(node, numbers.Number) in place of type(i)==int. Or even type(node) in (int, float, complex). But we might have to keep adding types to that list...
However we do it, we should make sure that counter is an integer -- perhaps when we add the 1: counter = int(max(...)) + 1.

@florishermsen
Contributor Author

@dschult interesting! At first I did not understand why in your example the cursor cannot be zero, as I figured the graph would allow int 0 and float 0.0 to be distinct nodes. Apparently it doesn't, per normal Python dictionary key behavior (I didn't know nodes were stored that way).
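For anyone surprised by this, the collapse follows from plain dict key semantics, since equal numbers are required to hash equally in Python (a small standalone check, independent of networkx):

```python
# Dict keys are matched by hash + equality. Python guarantees that numbers
# comparing equal hash equally, so int 0 and float 0.0 are the same key.
assert 0 == 0.0 and hash(0) == hash(0.0)

d = {0: "int zero"}
d[0.0] = "float zero"  # replaces the value; no second entry is created
assert len(d) == 1
assert d[0] == "float zero"
# networkx keeps nodes in dict-based structures, hence 0 and 0.0
# cannot coexist as distinct nodes.
```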

I'll prepare a fix to properly determine the starting point for the cursor. A treatment that would also work for complex numbers could be taking max + 1 over int(n.real) for all number-valued nodes n.

Open to suggestions.
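That strategy could be sketched like this (hypothetical helper name; the actual commit may differ):

```python
import numbers

def first_free_int_label(nodes):
    # Best-effort starting point for the new-node cursor: consider only
    # number-like labels, project each onto int(n.real), and start one
    # past the maximum. Non-numeric labels (strings etc.) are ignored.
    numeric = [int(n.real) for n in nodes if isinstance(n, numbers.Number)]
    return max(numeric) + 1 if numeric else 0

# e.g. max over int(4.4), 5, int((12.1+9.2j).real) is 12, so new
# integer labels would start at 13.
print(first_free_int_label([4.4, 5, 12.1 + 9.2j, "m", "16.3"]))  # -> 13
```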

@dschult
Member

dschult commented Apr 11, 2021

Hmmm... this is getting annoying and obtuse. :}
Perhaps the best way is to implement an easy, imperfect method for finding new nodes and then let the user override it with a parameter, e.g. new_nodes. We could force new_nodes to be a list of n nodes. Or we could be fancy and let it be either a starting value for new nodes or a list of new nodes. But the easy/simple interface might look something like:

if new_nodes is None:
    ...current code to find max...
    cursor = itertools.count(max + 1)
else:
    try:
        assert len(new_nodes) == n
        cursor = iter(new_nodes)
    # catch objects with no `len` or `iter` capability
    except (TypeError, AssertionError):
        raise nx.NetworkXError("new_nodes should be a list of `n` node objects")

By making cursor an iterator we can use/update it in one step with next(cursor)
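The iterator-based cursor gives both branches the same shape; a small standalone illustration:

```python
import itertools

# Whichever branch runs, `cursor` ends up as an iterator, and the caller
# just asks for the next fresh label with next(cursor).
cursor = itertools.count(43)    # "find max" branch: max existing node was 42
assert next(cursor) == 43
assert next(cursor) == 44

cursor = iter(["a", "b", "c"])  # user-supplied new_nodes branch
assert next(cursor) == "a"
assert next(cursor) == "b"
```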

@rossbar
Contributor

rossbar commented Apr 12, 2021

Hmmm.... This is getting annoying and obtuse. :}

Agreed - my fault, sorry for sending us down the rabbit hole! FWIW I think even ec9bfee was fine! I just thought it'd be good to add a note to the docstring about the non-sequential node behavior.

I'd just as soon get the original implementation in and worry about the gremlins associated with passing in a non-empty MultiDiGraph instance in another issue.

@florishermsen
Contributor Author

florishermsen commented Apr 17, 2021

Thanks for the input @dschult and @rossbar. I agree it would be better for this pull request to leave the contract of the generator unchanged and just optimize. In the meantime, I went ahead and implemented the isinstance(n, numbers.Number) plus int(n.real) strategy, so we now have best-effort (and improved) support for existing graphs with number-based nodes.

The following now yields:

>>> G = nx.scale_free_graph(5)
>>> new_labels = {
...     0: 4.4,
...     1: 5,
...     2: 12.1+9.2j,
...     3: "m",
...     4: "16.3",
... }
>>> H = nx.relabel_nodes(G, new_labels)
>>> H.nodes()
NodeView((4.4, 5, (12.1+9.2j), 'm', '16.3'))
>>> K = nx.scale_free_graph(10, create_using=H)
>>> K.nodes()
NodeView((4.4, 5, (12.1+9.2j), 'm', '16.3', 13, 14, 15, 16, 17))

To be completely safe, one could require that the nodes of an existing graph fed to the algorithm adhere to a type whitelist for which we can ensure correct behavior, or indeed have the user supply the new nodes as an iterable/iterator of sorts. Once we are in the territory of modifying the contract anyway, I have some additional improvements in mind, e.g. DiGraph support (the non-multi kind) and an option to disallow self-loops.

So, ready to merge? Cheers

Member

@dschult dschult left a comment


Thanks for this!

@dschult dschult added this to the networkx-2.6 milestone Apr 18, 2021
Contributor

@rossbar rossbar left a comment


Thanks @florishermsen !

@rossbar rossbar merged commit 643f457 into networkx:main Apr 19, 2021
MridulS pushed a commit to MridulS/networkx that referenced this pull request Feb 4, 2023
* O(N) implementation for scale_free_graph

* proper support for generating a scale-free graph on top of an existing graph

* scale free graph generator: better support for starting with an existing graph with number-based nodes
cvanelteren pushed a commit to cvanelteren/networkx that referenced this pull request Apr 22, 2024
* O(N) implementation for scale_free_graph

* proper support for generating a scale-free graph on top of an existing graph

* scale free graph generator: better support for starting with an existing graph with number-based nodes