Performance of common substructure/maximalcliques #50

timholy · 2020-12-10T14:30:30Z

I have some not-huge molecules (a little bigger than the ones in this demo), and mcismol/mcesmol are taking more runtime than I care to allow to finish. Here's a demo that illustrates the problem:

julia> mol1 = smilestomol("C1CCCC1");   # hexane

julia> mol2 = smilestomol("C1C2C(CCC1)CCC2");   # hexahydroindan, i.e., https://pubchem.ncbi.nlm.nih.gov/compound/10325

julia> mol3 = smilestomol("C1C3C(C2C(C1)CCCC2)CCC3"); # https://pubchem.ncbi.nlm.nih.gov/compound/12729210

julia> @time mcismol(mol2, mol3)
  0.604197 seconds (2.20 M allocations: 274.098 MiB, 16.81% gc time)
(Dict(5 => 5, 4 => 4, 6 => 6, 7 => 11, 2 => 2, 9 => 13, 8 => 12, 3 => 3, 1 => 1), :done)

even after first warmup. Profiling shows that essentially all the runtime is in

MolecularGraph.jl/src/graph/clique.jl

Lines 83 to 111 in be43e65

    
           function expand!(state::FindCliqueState, subg, cand) 
        
               (state.status == :timedout || state.status == :targetreached) && return 
        
               if isempty(subg) 
        
                   # Report max clique 
        
                   push!(state.cliques, Set(state.Q)) 
        
                   return 
        
               elseif state.expire !== nothing && time_ns() > state.expire 
        
                   state.status = :timedout 
        
                   return 
        
               elseif state.targetsize !== nothing && length(state.Q) >= state.targetsize 
        
                   state.status = :targetreached 
        
                   push!(state.cliques, Set(state.Q)) 
        
                   return 
        
               end 
        
               candnbrcnt(n) = length(intersect(cand, state.adjacencies[n])) 
        
               pivot = sortstablemax(subg, by=candnbrcnt) 
        
               copv = setdiff(cand, state.adjacencies[pivot]) 
        
               for q in copv 
        
                   push!(state.Q, q) 
        
                   qnbrs = state.adjacencies[q] 
        
                   subgq = intersect(subg, qnbrs) 
        
                   candq = intersect(cand, qnbrs) 
        
                   expand!(state, subgq, candq) 
        
                   pop!(cand, q) 
        
                   pop!(state.Q) 
        
               end 
        
               return 
        
           end

In contrast, if I convert this to a LightGraphs SimpleGraph:

function simplegraph(mol::UndirectedGraph)
    g = LightGraphs.SimpleGraph(nodecount(mol))
    for e in edgesiter(mol)
        LightGraphs.add_edge!(g, e...)
    end
    return g
end

then

julia> g1, g2, g3 = simplegraph(mol1), simplegraph(mol2), simplegraph(mol3);

julia> @time maximal_cliques(tensor_product(g2, g3));
  0.000712 seconds (7.20 k allocations: 838.570 KiB)

I'm far from certain that these are doing the same thing, but if they're even close it's a pretty big difference.

The text was updated successfully, but these errors were encountered:

mojaie · 2020-12-10T15:09:07Z

I'm not sure but ModularProduct may be expensive...

MolecularGraph.jl/src/graph/product.jl

Lines 24 to 30 in be43e65

    
           struct ModularProduct <: OrderedGraph 
        
               neighbormap::Vector{Dict{Int,Int}} 
        
               edges::Vector{Tuple{Int,Int}} 
        
               nodeattrs::Vector{ModularProductNode} 
        
               edgeattrs::Vector{ModularProductEdge} 
        
               cache::Dict{Symbol,Any} 
        
           end

timholy · 2020-12-10T19:08:38Z

It's not the modularproduct call itself that's expensive in and of itself (though of course it creates the expensive thing), it's the maximalcliques/maximalconncliques that costs the time. You can do a "poor-person's profiling" just by running mcismol and hitting Ctrl-C and see where it is in the code when it breaks. If you do this several times, you'll see it's always in the clique-analysis.

mojaie · 2020-12-13T07:49:27Z

I'm sorry, I didn't get your point.

maximalcliques and maximalconncliques are not optimized yet. If there is known implementation of maximal clique enumeration (like NetworkX but I'm not sure Julialang also have such one), we can do benchmarking and optimization. Or using well-established external graph mining library is an option.
Uniform molecule like your carbon ring example can't benefit from descriptor matching functions that can significantly reduce the modular product size and the execution time of maximalcliques.

Here is a minimum sample code that returns modular product graph for mcesmol

using MolecularGraph
using MolecularGraph.Graph

function modprod(G, H)
    lg = linegraph(G)
    lh = linegraph(H)
    am = MolecularGraph.atommatch(G, H)
    bm = MolecularGraph.bondmatch(G, H)
    ematch = MolecularGraph.Graph.lgedgematcher(lg, lh, am)
    nmatch = MolecularGraph.Graph.lgnodematcher(lg, lh, am, bm)
    eflt = MolecularGraph.Graph.modprodedgefilter(lg, lh, ematch)
    return modularproduct(lg, lh, nodematcher=nmatch, edgefilter=eflt)
end

# Tutorial example
ldopa = smilestomol("O=C(O)[C@@H](N)Cc1cc(O)c(O)cc1")
ac = smilestomol("C1=CC=C2C(=C1)C=C(C(=O)O2)N")
prod = modprod(ldopa, ac)
println(nodecount(prod))
println(edgecount(prod))

195
861

# Your example
mol2 = smilestomol("C1C2C(CCC1)CCC2")
mol3 = smilestomol("C1C3C(C2C(C1)CCCC2)CCC3")
prod = modprod(mol2, mol3)
println(nodecount(prod))
println(edgecount(prod))

150
5922

Descriptor matching functions MolecularGraph.atommatch recognize difference in atom symbols and hybridization (important in bond angle and 3D structure) by default.

Dealing with uniform molecule is a difficult problem. You can use connected=true and topological=true to reduce the cost if it is acceptable in your application. Even wIth these tricks, fullerene (C60) calculation may never finishes.

mojaie · 2023-05-17T09:24:01Z

At the version 0.14, MolecularGraph.maximal_cliques is now implemented with Graphs.SimpleGraph.

julia> mol2 = smilestomol("C1C2C(CCC1)CCC2");
julia> mol3 = smilestomol("C1C3C(C2C(C1)CCCC2)CCC3");

julia> @time Graphs.maximal_cliques(tensor_product(mol2.graph, mol3.graph));
  0.000526 seconds (7.14 k allocations: 703.336 KiB)

julia> @time MolecularGraph.maximal_cliques(tensor_product(mol2.graph, mol3.graph));
  0.000507 seconds (8.68 k allocations: 851.594 KiB)

julia> @time Graphs.maximal_cliques(smallgraph(:karate))
  0.000110 seconds (1.31 k allocations: 131.438 KiB)
36-element Vector{Vector{Int64}}:
 [5, 1, 7]
 [5, 1, 11]
 [12, 1]
 [8, 4, 2, 3, 1]
 [17, 6, 7]
 [1, 32]
 [1, 6, 7]
 [1, 6, 11]
 [1, 9, 3]
 [1, 13, 4]
 ⋮
 [34, 33, 21]
 [34, 33, 9, 31]
 [34, 33, 24, 30]
 [34, 33, 23]
 [34, 33, 19]
 [34, 10]
 [34, 27, 30]
 [2, 31]
 [26, 24]

julia> @time MolecularGraph.maximal_cliques(smallgraph(:karate))
  0.000138 seconds (1.96 k allocations: 206.531 KiB)
36-element Vector{Vector{Int64}}:
 [5, 1, 7]
 [5, 1, 11]
 [12, 1]
 [8, 4, 2, 3, 1]
 [17, 6, 7]
 [1, 32]
 [1, 6, 7]
 [1, 6, 11]
 [1, 9, 3]
 [1, 13, 4]
 ⋮
 [34, 33, 21]
 [34, 33, 9, 31]
 [34, 33, 24, 30]
 [34, 33, 23]
 [34, 33, 19]
 [34, 10]
 [34, 27, 30]
 [2, 31]
 [26, 24]

I have tried several times with some graphs and it seems that Graphs version performs slightly better.
On the other hand, MolecularGraphs.maximal_cliques ~~is based on a lazy Iterator and~~ has several features that are useful in practical molecular mining (e.g. abortion by time out or reaching target size) , so I will continue to use this implementation for now.

julia> @time disconnected_mcis(mol2, mol3)
  0.387954 seconds (2.06 M allocations: 219.295 MiB, 12.14% gc time)
MCSResult{Int64}(Dict(5 => 5, 4 => 4, 6 => 6, 7 => 11, 2 => 2, 9 => 13, 8 => 12, 3 => 3, 1 => 1), :done)

The computation time for MCS may have slightly improved. This mainly depends on the computational performance of the modular product generation mentioned above, and still there may be a room for optimization.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance of common substructure/maximalcliques #50

Performance of common substructure/maximalcliques #50

timholy commented Dec 10, 2020

mojaie commented Dec 10, 2020

timholy commented Dec 10, 2020

mojaie commented Dec 13, 2020

mojaie commented May 17, 2023 •

edited

Loading

Performance of common substructure/maximalcliques #50

Performance of common substructure/maximalcliques #50

Comments

timholy commented Dec 10, 2020

mojaie commented Dec 10, 2020

timholy commented Dec 10, 2020

mojaie commented Dec 13, 2020

mojaie commented May 17, 2023 • edited Loading

mojaie commented May 17, 2023 •

edited

Loading