decompose() too slow on large graph (that cluster can handle) #960

lucasmation opened this Issue Aug 23, 2016 · 10 comments


@lucasmation

I have a large graph of 96 million edges, which forms 40 million small connected components. On a large and powerful server, it takes 36 min to create the graph from the edge list and 2.6 min to compute the graph clusters. decompose(), however, has been running for 10 hours and still has not finished.

Is there a fast alternative to decompose()? Can I somehow use the output of clusters() to create a list of separate connected graphs?


library(readr)
library(igraph)

# read the edge list
t0 <- Sys.time()
d6 <- read_csv('cpf_pis_2004_2013_num.csv')
t1 <- Sys.time()
t1 - t0 # 10 min
print(object.size(d6), units = 'Gb') # 7.5 Gb

# build the undirected graph from the edge list
t0 <- Sys.time()
g6 <- graph_from_data_frame(d6, directed = FALSE)
t1 <- Sys.time()
t1 - t0 # 36 min
print(object.size(g6), units = 'Gb') # 10.4 Gb

# find the connected components
t0 <- Sys.time()
clu6 <- clusters(g6)
t1 <- Sys.time()
t1 - t0 # 2.6 min
print(object.size(clu6), units = 'Gb')

# split the graph into one subgraph per component (still running after 10 hours)
t0 <- Sys.time()
g6l <- decompose(g6, mode = "weak")
t1 <- Sys.time()
t1 - t0
print(object.size(g6l), units = 'Gb')
@ntamas
Member
ntamas commented Aug 23, 2016

decompose() effectively calls induced.subgraph() multiple times. induced.subgraph() has two implementations: copy_and_delete and create_from_scratch. copy_and_delete copies the entire graph and then deletes the unneeded vertices; it is used for "large" vertex sets. create_from_scratch builds the new graph from scratch. Judging from the sizes of your components, most of the calls to induced.subgraph() probably use the create_from_scratch method.

Now, looking into the source code of create_from_scratch, I think there is a potential bottleneck lurking in there; it seems like the time complexity of the create_from_scratch method depends on the size of the original graph, not the size of the subgraph, which could explain the poor performance.

Unfortunately there is no easy workaround; if you really need all 40 million graph objects, we need to fix the C implementation of induced.subgraph() so that its complexity depends only on the size of the subgraph. I'll look into this soon.
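
Roughly speaking, this is what decompose() does for each component at the R level (just a sketch, assuming g is your graph; the impl argument of induced_subgraph() selects between the two strategies):

memb <- clusters(g)$membership
# one call like this per component; "auto" picks copy_and_delete or create_from_scratch
sub1 <- induced_subgraph(g, which(memb == 1), impl = "auto")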

@ntamas ntamas self-assigned this Aug 23, 2016
@lucasmation
lucasmation commented Aug 23, 2016 edited

Thanks. A workaround is to extract the clusters(g)$membership data and merge it into the original edge list (the code below stops there), then split the edge list by the new "component" variable into a list of separate edge lists, and run graph_from_data_frame() on each of them.

library(dplyr)

cl <- clusters(g)
cl2 <- data.frame(idA = as.numeric(names(cl$membership)),
                  component = cl$membership)
edgeList %>% left_join(y = cl2, by = "idA") -> edgeList2
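
A sketch of the remaining steps (not run here; "idB" stands in for whatever the second endpoint column is actually called):

# split the merged edge list by component, then build one graph per piece
edgeLists <- split(edgeList2, edgeList2$component)
g_list <- lapply(edgeLists, function(el)
  graph_from_data_frame(el[, c("idA", "idB")], directed = FALSE))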
@ntamas
Member
ntamas commented Aug 23, 2016 edited

Yes, that could work, but I'm not too familiar with R so I cannot help you with that. On the upside, I'm working on some optimizations to igraph_induced_subgraph in the C layer that should speed things up.

@ntamas
Member
ntamas commented Aug 24, 2016

Status update: I managed to bring the execution time on a random graph with 500K vertices and 1M edges down from ~40 seconds to ~13 seconds on my machine. There is still a bottleneck in igraph_induced_subgraph that I am aware of, so I'll keep trying.

@lucasmation

That is great. However, I think it is still too slow. Isn't it possible to use the engine behind clusters() and use the components it generates in a smart way?

@gaborcsardi
Member

It uses the same code as clusters(). Creating 40 million graph objects is slow; igraph is optimized for largish graphs, not for millions of tiny ones.

@lucasmation

OK, thanks. What I find odd is that clusters() takes 2.6 min while decompose() takes 20+ hours. I understand there should be some difference, since clusters() does not reorganize the data, but the difference is huge.

@gaborcsardi
Member

You could try profiling it to see what is taking the time. (With a smaller graph, obviously.) I am pretty sure that creating the graphs takes the time; that's the extra thing that decompose() is doing. If creating 40 million graphs takes 20 hours, then creating one takes roughly 1.8 ms.
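
For example, something like this (a sketch; the sizes are arbitrary, and time spent inside igraph's C code shows up under the calling R function):

library(igraph)
g <- sample_gnm(n = 500000, m = 1000000)  # smaller test graph
Rprof("decompose.prof")                   # start the sampling profiler
parts <- decompose(g, mode = "weak")
Rprof(NULL)                               # stop profiling
summaryRprof("decompose.prof")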

@ntamas
Member
ntamas commented Aug 31, 2016

I can chime in here because I've done some profiling. Right now most of the time is spent in igraph_induced_subgraph, in a line that creates a vector whose size equals the number of vertices of the original graph (not the subgraph). This vector is used to quickly look up the new vertex index for any old vertex when the edges of the old graph are mapped into the new one. When the original graph is huge and the subgraph is tiny, the creation of this vector is a waste of time and space; we could do better with a hash map. Unfortunately we don't have a hash map type in igraph yet, which is why a vector was chosen and why it is slow. Creating an internal version of igraph_induced_subgraph that takes a pre-allocated area for this vector could help when we know there will be multiple repeated calls to igraph_induced_subgraph with the same graph.

@gaborcsardi
Member

Makes sense, thanks!
