
O(n^2) -> O(n) implementation for scale_free_graph #4727

Merged
merged 3 commits into networkx:main on Apr 19, 2021

Conversation

florishermsen
Contributor

@florishermsen florishermsen commented Apr 5, 2021

I was trying to use scale_free_graph from generators.directed and noticed that it slowed down drastically as the number of nodes n grew, making it effectively useless for large n (e.g. 1M). Since the gist of the algorithm involves preferential attachment, and other networkx generators, like the ones based on Barabási–Albert, are in fact fast, I took a look at the implementation. Parts of the current node sampling have O(n) time complexity, rendering the generator as a whole O(n^2). I rewrote that sampling to be O(1), so the generator itself is now O(n).

Changes
Like e.g. the implementation for barabasi_albert_graph or dual_barabasi_albert_graph, the new implementation keeps simple lists of repeated nodes representing the in- and out-degrees so we can perform O(1) sampling with probabilities linear in those degrees. This suffices for the case that biases are zero.
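The repeated-node-list idea can be sketched roughly as follows (a minimal illustration with made-up names, not the PR's actual code):

```python
import random

# Sketch of degree-proportional sampling via a repeated-node list, the
# same trick used by barabasi_albert_graph. Each node appears in
# `candidates` once per unit of degree, so one uniform draw from the
# list picks node v with probability deg(v) / sum(deg).
candidates = []

def add_edge(u, v, candidates):
    # Appending both endpoints bumps each node's sampling weight by one.
    candidates.append(u)
    candidates.append(v)

def sample_by_degree(candidates, rng=random):
    # O(1): a single uniform index into the list.
    return candidates[rng.randrange(len(candidates))]

add_edge(0, 1, candidates)
add_edge(0, 2, candidates)
# Node 0 now has degree 2 and appears twice in `candidates`,
# so it is drawn with probability 2/4.
```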

Because the algorithm also needs to support nonzero biases, it would seem full weights have to be computed before each sample is drawn (based on current node degrees plus the biases), which would make the sampling O(n) again. I've solved this probabilistically with a two-step sampling mechanism that is once again O(1). Note that we're not doing any approximations here, the new implementation of the algorithm behaves exactly the same as the old one.
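The two-step mechanism described above can be sketched as follows (an illustrative simplification with assumed names, not the merged code): first decide whether the draw comes from the uniform "bias" part or from the degree-repeated list.

```python
import random

def biased_choice(candidates, nodes, delta, rng=random):
    """Return node v with probability proportional to deg(v) + delta,
    in O(1) and without building a per-node weight array.

    `candidates` holds each node repeated once per unit of degree;
    `nodes` is the plain node list. (Illustrative sketch only.)
    """
    bias_sum = len(nodes) * delta
    # Mixture decomposition, with D = len(candidates) = sum of degrees
    # and p_delta = n*delta / (n*delta + D):
    #   P(v) = p_delta * 1/n + (1 - p_delta) * deg(v)/D
    #        = (deg(v) + delta) / (D + n*delta)
    if rng.random() < bias_sum / (bias_sum + len(candidates)):
        return nodes[rng.randrange(len(nodes))]        # uniform: the +delta term
    return candidates[rng.randrange(len(candidates))]  # proportional to degree
```

Because the mixture reproduces the target distribution exactly, no approximation is involved, which matches the PR's claim that the new implementation behaves the same as the old one.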

I've also added checks that the biases must be >= 0. This does not remove functionality, because the current implementation would have been incorrect for biases < 0 anyway (if in doubt, review the original cursor-based sampling function and imagine a bias < -1 to reach the same conclusion). Furthermore, the referenced original paper (see https://www.microsoft.com/en-us/research/publication/directed-scale-free-graphs/) also sets biases >= 0 as a condition, so it probably should have been checked all along. (Why that is a sensible condition is another story, which I think we should not get into here; I would refer to the paper for now.)

Timing results
scale_free_graph(n, alpha=0.2, beta=0.6, gamma=0.2, delta_in=1, delta_out=1)

| n | old | new | speedup |
|------|--------------|---------|---------|
| 1e+1 | 7.19 ms | 168 µs | x50 |
| 1e+2 | 8.91 ms | 1.15 ms | x10 |
| 1e+3 | 582 ms | 11.2 ms | x50 |
| 1e+4 | 59.1 s | 185 ms | x300 |
| 1e+5 | 100 m (est.) | 1.94 s | x3000 |
| 1e+6 | 7 d (est.) | 21.6 s | x30000 |

I couldn't find any related issues, apart from this Stack Overflow post similarly complaining about days of runtime:
https://stackoverflow.com/questions/56119399/quickly-generating-large-scale-free-graphs-with-networkx

@dschult
Member

dschult commented Apr 6, 2021

Very Nice!!
Thanks for hunting this down and fixing it! :}

Contributor

@rossbar rossbar left a comment


This generally LGTM. I find it easier to read, and it should definitely be more performant with the degree distribution stored rather than recomputed iteratively every time a node is selected.

Reading through this I just wanted to confirm one thing. IIUC, in the limit of very large delta (delta >> len(degree_distribution)) the selection essentially boils down to uniform sampling from the nodes, correct? I'm not familiar with the cited algorithm so I just wanted to double check that this is the expected behavior.

On networkx/generators/directed.py:
bias_sum = n_nodes * delta
p_delta = bias_sum / (bias_sum + len(candidates))
if seed.random() < p_delta:
return seed.randint(0, n_nodes)
Contributor

Implicit here is an assumption that the nodes are essentially drawn from range(0, n_nodes). For cases where this isn't true, the returned nodes might seem strange to users:

>>> G = nx.scale_free_graph(10)
>>> H = nx.relabel_nodes(G, {i: 2**i for i in range(10)})
>>> H.nodes()
NodeView((1, 2, 4, 8, 16, 32, 64, 128, 256, 512))
>>> K = nx.scale_free_graph(20, create_using=H, delta_in=1e6, delta_out=1e6)
>>> K.nodes()
NodeView((1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 7, 11, 12, 5, 14, 0, 9, 13, 18, 17))

AFAICT this doesn't affect the algorithm behavior so it isn't wrong. However it might be worth just adding a note to the docstring about how new nodes are selected so that users starting with an initial graph that has non-integer nodes or unevenly spaced integer nodes know what to expect.

Member

Perhaps we can use that random number as an index to the candidates list so it returns a node instead of an integer? But I might be misunderstanding the intent....
return candidates[seed.randint(0, n_nodes)]

Contributor Author

Great catch! I've added a commit that properly supports working with existing graphs by introducing a node state (simple list of nodes) and a cursor for new nodes that starts at the highest already occupied int + 1.

The code would be simpler if we sampled from list(G.nodes()), but that would make the code orders of magnitude slower again for large n. The current approach does not sacrifice any of the performance gains; timing results are still similar to those posted above.

@rossbar your snippet now produces

NodeView((1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522))

But it now also supports e.g. strings:

>>> G = nx.scale_free_graph(10)
>>> H = nx.relabel_nodes(G, {i: str(2**i) for i in range(10)})
>>> H.nodes()
NodeView(('1', '2', '4', '8', '16', '32', '64', '128', '256', '512'))
>>> K = nx.scale_free_graph(20, create_using=H, delta_in=1e6, delta_out=1e6)
>>> K.nodes()
NodeView(('1', '2', '4', '8', '16', '32', '64', '128', '256', '512', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9))

@dschult your approach unfortunately wouldn't work, because len(candidates) > len(nodes): candidates contains each node repeated d times, where d is its degree. Also, sampling from candidates brings us back to sampling with weights determined by the degrees, and in that piece of the code we want to sample uniformly.

@florishermsen
Contributor Author

florishermsen commented Apr 7, 2021

Almost, @rossbar! Better formulated: we approach uniform sampling when len(nodes) * delta >> len(degree_distribution), where degree_distribution (sampling candidates in the code) is the "raveled" in- or out-degree distribution (i.e. a node with degree 7 appears 7 times in that list), so its length is equal to len(nodes) * avg(degree). As you can see, the inequality then simplifies to delta >> avg(degree).

This should make sense because going back to the algorithm, the probability for a node to be sampled scales with its in or out-degree plus the bias, i.e. ~(d+b). In case b >> d, this effectively becomes ~b, and since the bias is the same for all nodes this indeed effectively is uniform sampling.
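The limit is easy to check numerically: with P(v) proportional to deg(v) + delta, the ratio between the most and least probable nodes collapses toward 1 once delta dwarfs the typical degree (toy degree sequence below, not taken from the PR):

```python
# With P(v) = (deg(v) + delta) / (sum(degrees) + n * delta), the spread
# between the most and least likely nodes shrinks as delta grows.
degrees = [1, 2, 7, 90]  # toy degrees; the degree-90 node dominates at small delta

def prob(d, degrees, delta):
    return (d + delta) / (sum(degrees) + len(degrees) * delta)

for delta in (0, 1, 1e6):
    ratio = prob(max(degrees), degrees, delta) / prob(min(degrees), degrees, delta)
    print(f"delta={delta}: max/min probability ratio = {ratio:.4f}")
# The ratio falls from 90 (delta=0) toward 1 (delta >> avg degree),
# i.e. sampling becomes effectively uniform.
```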

@rossbar
Contributor

rossbar commented Apr 10, 2021

Hmm so looking at this again, it looks to me like there may have been a very subtle bug that would persist in this implementation as well. For example, look at the following (original) code chunk:

if r < alpha:
# alpha
# add new node v
v = len(G)

The comment indicates that v should be a node that is not already present in the Graph. It seems this relies on assumptions about the nodes in G; most likely based on the default value of create_using, which creates a graph starting with node 0 and incrementing upwards by one. However, it's easy to violate this assumption e.g.

>>> G = nx.MultiDiGraph([(2, 3), (3, 4), (2, 4)])
>>> H = nx.scale_free_graph(10, create_using=G)

In this case, if scale_free_graph hit the snippet above, the "new" node would be 3, which is clearly already in the Graph. I'm not sure what impact this has on the algorithm as a whole (maybe this is fine?) - it's just a detail where the implementation seems to diverge from the desired behavior (gathered from the comments) for certain graph inputs. Is this a problem for the algorithm? If it's not, at the very least the comments should then be updated to reflect what can actually happen in the general case.

@florishermsen
Contributor Author

florishermsen commented Apr 10, 2021

@rossbar I added a commit (9bc7d31) that addresses precisely this!

I explained it further in one of the threads above, since I responded to some of the content of your comment there, but I guess you missed that; it's tricky to keep these conversations linear.


Copy of #4727 (comment) :

Great catch! I've added a commit that properly supports working with existing graphs by introducing a node state (simple list of nodes) and a cursor for new nodes that starts at the highest already occupied int + 1.

The code would be simpler if we sampled from list(G.nodes()), but that would make the code orders of magnitude slower again for large n. The current approach does not sacrifice any of the performance gains; timing results are still similar to those posted above.

@rossbar your snippet now produces

NodeView((1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522))

But it now also supports e.g. strings:

>>> G = nx.scale_free_graph(10)
>>> H = nx.relabel_nodes(G, {i: str(2**i) for i in range(10)})
>>> H.nodes()
NodeView(('1', '2', '4', '8', '16', '32', '64', '128', '256', '512'))
>>> K = nx.scale_free_graph(20, create_using=H, delta_in=1e6, delta_out=1e6)
>>> K.nodes()
NodeView(('1', '2', '4', '8', '16', '32', '64', '128', '256', '512', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9))

@dschult
Member

dschult commented Apr 10, 2021

That's a nice approach to finding new nodes. But it is fragile in the sense that 1.0 compares equal to 1 but isn't an integer.
So a graph with nodes 0.0, 1.0, 2.0, 3.0 would have the cursor be 0 even though it should be 4.

We could use isinstance(node, numbers.Number) in place of type(i)==int. Or even type(node) in (int, float, complex). But we might have to keep adding types to that list...
However we do it, we should make sure that counter is an integer -- perhaps when we add the 1: counter = int(max(...)) + 1.

@florishermsen
Contributor Author

@dschult interesting! At first I did not understand why in your example the cursor cannot be zero, as I figured the graph would allow int 0 and float 0.0 to be distinct nodes. Apparently it doesn't, per normal Python dictionary key behavior (I didn't know nodes were stored that way).
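For anyone surprised by this, the collapse follows from plain dict key semantics, since equal numbers are required to hash equally in Python (a small standalone check, independent of networkx):

```python
# Dict keys are matched by hash + equality. Python guarantees that numbers
# comparing equal hash equally, so int 0 and float 0.0 are the same key.
assert 0 == 0.0 and hash(0) == hash(0.0)

d = {0: "int zero"}
d[0.0] = "float zero"  # replaces the value; no second entry is created
assert len(d) == 1
assert d[0] == "float zero"
# networkx keeps nodes in dict-based structures, hence 0 and 0.0
# cannot coexist as distinct nodes.
```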

I'll prepare a fix to properly determine the starting point for the cursor. A treatment that would also work for complex numbers could be taking max + 1 over int(n.real) for all number-valued nodes n.

Open to suggestions.
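That strategy could be sketched like this (hypothetical helper name; the actual commit may differ):

```python
import numbers

def first_free_int_label(nodes):
    # Best-effort starting point for the new-node cursor: consider only
    # number-like labels, project each onto int(n.real), and start one
    # past the maximum. Non-numeric labels (strings etc.) are ignored.
    numeric = [int(n.real) for n in nodes if isinstance(n, numbers.Number)]
    return max(numeric) + 1 if numeric else 0

# e.g. max over int(4.4), 5, int((12.1+9.2j).real) is 12, so new
# integer labels would start at 13.
print(first_free_int_label([4.4, 5, 12.1 + 9.2j, "m", "16.3"]))  # -> 13
```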

@dschult
Member

dschult commented Apr 11, 2021

Hmmm... this is getting annoying and obtuse. :}
Perhaps the best way is to implement an easy, imperfect method for finding new nodes and then let the user override it with a parameter, e.g. new_nodes. We could force new_nodes to be a list of n nodes. Or we could be fancy and let it be either a starting value for new nodes or a list of new nodes. But the easy/simple interface might look something like:

if new_nodes is None:
    ...current code to find max...
    cursor = itertools.count(max + 1)
else:
    try:
        assert len(new_nodes) == n
        cursor = iter(new_nodes)
    # catch objects with no `len` or `iter` capability
    except (TypeError, AssertionError):
        raise nx.NetworkXError("new_nodes should be a list of `n` node objects")

By making cursor an iterator we can use/update it in one step with next(cursor)
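The iterator-based cursor gives both branches the same shape; a small standalone illustration:

```python
import itertools

# Whichever branch runs, `cursor` ends up as an iterator, and the caller
# just asks for the next fresh label with next(cursor).
cursor = itertools.count(43)    # "find max" branch: max existing node was 42
assert next(cursor) == 43
assert next(cursor) == 44

cursor = iter(["a", "b", "c"])  # user-supplied new_nodes branch
assert next(cursor) == "a"
assert next(cursor) == "b"
```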

@rossbar
Contributor

rossbar commented Apr 12, 2021

Hmmm.... This is getting annoying and obtuse. :}

Agreed - my fault, sorry for sending us down the rabbit hole! FWIW I think even ec9bfee was fine! I just thought it'd be good to add a note to the docstring about the non-sequential node behavior.

I'd just as soon get the original implementation in and worry about the gremlins associated with passing in a non-empty MultiDiGraph instance in another issue.

@florishermsen
Contributor Author

florishermsen commented Apr 17, 2021

Thanks for the input @dschult and @rossbar. I agree it would be better for this pull request to leave the contract of the generator unchanged and just optimize. In the meantime, I went ahead and implemented the isinstance(n, numbers.Number) plus int(n.real) strategy, so we now have best-effort (and improved) support for existing graphs with number-based nodes.

The following now yields:

>>> G = nx.scale_free_graph(5)
>>> new_labels = {
...     0: 4.4,
...     1: 5,
...     2: 12.1+9.2j,
...     3: "m",
...     4: "16.3",
... }
>>> H = nx.relabel_nodes(G, new_labels)
>>> H.nodes()
NodeView((4.4, 5, (12.1+9.2j), 'm', '16.3'))
>>> K = nx.scale_free_graph(10, create_using=H)
>>> K.nodes()
NodeView((4.4, 5, (12.1+9.2j), 'm', '16.3', 13, 14, 15, 16, 17))

To be completely safe, one could require that the nodes of an existing graph fed to the algorithm adhere to a type whitelist for which we can ensure correct behavior, or indeed have the user supply the new nodes as an iterable/iterator of sorts. Once we are in the territory of modifying the contract anyway, I have some additional improvements in mind, e.g. DiGraph support (the non-multi kind) and an option to disallow self-loops.

So, ready to merge? Cheers

Member

@dschult dschult left a comment


Thanks for this!

@dschult dschult added this to the networkx-2.6 milestone Apr 18, 2021
Contributor

@rossbar rossbar left a comment


Thanks @florishermsen !

@rossbar rossbar merged commit 643f457 into networkx:main Apr 19, 2021
MridulS pushed a commit to MridulS/networkx that referenced this pull request Feb 4, 2023
* O(N) implementation for scale_free_graph

* proper support for generating a scale-free graph on top of an existing graph

* scale free graph generator: better support for starting with an existing graph with number-based nodes
cvanelteren pushed a commit to cvanelteren/networkx that referenced this pull request Apr 22, 2024
* O(N) implementation for scale_free_graph

* proper support for generating a scale-free graph on top of an existing graph

* scale free graph generator: better support for starting with an existing graph with number-based nodes