Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance of common substructure/maximalcliques #50

Open
timholy opened this issue Dec 10, 2020 · 4 comments
Open

Performance of common substructure/maximalcliques #50

timholy opened this issue Dec 10, 2020 · 4 comments

Comments

@timholy
Copy link
Contributor

timholy commented Dec 10, 2020

I have some not-huge molecules (a little bigger than the ones in this demo), and mcismol/mcesmol are taking more runtime than I care to allow to finish. Here's a demo that illustrates the problem:

julia> mol1 = smilestomol("C1CCCC1");   # hexane

julia> mol2 = smilestomol("C1C2C(CCC1)CCC2");   # hexahydroindan, i.e., https://pubchem.ncbi.nlm.nih.gov/compound/10325

julia> mol3 = smilestomol("C1C3C(C2C(C1)CCCC2)CCC3"); # https://pubchem.ncbi.nlm.nih.gov/compound/12729210

julia> @time mcismol(mol2, mol3)
  0.604197 seconds (2.20 M allocations: 274.098 MiB, 16.81% gc time)
(Dict(5 => 5, 4 => 4, 6 => 6, 7 => 11, 2 => 2, 9 => 13, 8 => 12, 3 => 3, 1 => 1), :done)

even after first warmup. Profiling shows that essentially all the runtime is in

function expand!(state::FindCliqueState, subg, cand)
(state.status == :timedout || state.status == :targetreached) && return
if isempty(subg)
# Report max clique
push!(state.cliques, Set(state.Q))
return
elseif state.expire !== nothing && time_ns() > state.expire
state.status = :timedout
return
elseif state.targetsize !== nothing && length(state.Q) >= state.targetsize
state.status = :targetreached
push!(state.cliques, Set(state.Q))
return
end
candnbrcnt(n) = length(intersect(cand, state.adjacencies[n]))
pivot = sortstablemax(subg, by=candnbrcnt)
copv = setdiff(cand, state.adjacencies[pivot])
for q in copv
push!(state.Q, q)
qnbrs = state.adjacencies[q]
subgq = intersect(subg, qnbrs)
candq = intersect(cand, qnbrs)
expand!(state, subgq, candq)
pop!(cand, q)
pop!(state.Q)
end
return
end

In contrast, if I convert this to a LightGraphs SimpleGraph:

function simplegraph(mol::UndirectedGraph)
    g = LightGraphs.SimpleGraph(nodecount(mol))
    for e in edgesiter(mol)
        LightGraphs.add_edge!(g, e...)
    end
    return g
end

then

julia> g1, g2, g3 = simplegraph(mol1), simplegraph(mol2), simplegraph(mol3);

julia> @time maximal_cliques(tensor_product(g2, g3));
  0.000712 seconds (7.20 k allocations: 838.570 KiB)

I'm far from certain that these are doing the same thing, but if they're even close it's a pretty big difference.

@mojaie
Copy link
Owner

mojaie commented Dec 10, 2020

I'm not sure but ModularProduct may be expensive...

struct ModularProduct <: OrderedGraph
neighbormap::Vector{Dict{Int,Int}}
edges::Vector{Tuple{Int,Int}}
nodeattrs::Vector{ModularProductNode}
edgeattrs::Vector{ModularProductEdge}
cache::Dict{Symbol,Any}
end

@timholy
Copy link
Contributor Author

timholy commented Dec 10, 2020

It's not the modularproduct call itself that's expensive in and of itself (though of course it creates the expensive thing), it's the maximalcliques/maximalconncliques that costs the time. You can do a "poor-person's profiling" just by running mcismol and hitting Ctrl-C and see where it is in the code when it breaks. If you do this several times, you'll see it's always in the clique-analysis.

@mojaie
Copy link
Owner

mojaie commented Dec 13, 2020

I'm sorry, I didn't get your point.

  1. maximalcliques and maximalconncliques are not optimized yet. If there is known implementation of maximal clique enumeration (like NetworkX but I'm not sure Julialang also have such one), we can do benchmarking and optimization. Or using well-established external graph mining library is an option.

  2. Uniform molecule like your carbon ring example can't benefit from descriptor matching functions that can significantly reduce the modular product size and the execution time of maximalcliques.

Here is a minimum sample code that returns modular product graph for mcesmol

using MolecularGraph
using MolecularGraph.Graph

function modprod(G, H)
    lg = linegraph(G)
    lh = linegraph(H)
    am = MolecularGraph.atommatch(G, H)
    bm = MolecularGraph.bondmatch(G, H)
    ematch = MolecularGraph.Graph.lgedgematcher(lg, lh, am)
    nmatch = MolecularGraph.Graph.lgnodematcher(lg, lh, am, bm)
    eflt = MolecularGraph.Graph.modprodedgefilter(lg, lh, ematch)
    return modularproduct(lg, lh, nodematcher=nmatch, edgefilter=eflt)
end
# Tutorial example
ldopa = smilestomol("O=C(O)[C@@H](N)Cc1cc(O)c(O)cc1")
ac = smilestomol("C1=CC=C2C(=C1)C=C(C(=O)O2)N")
prod = modprod(ldopa, ac)
println(nodecount(prod))
println(edgecount(prod))
195
861
# Your example
mol2 = smilestomol("C1C2C(CCC1)CCC2")
mol3 = smilestomol("C1C3C(C2C(C1)CCCC2)CCC3")
prod = modprod(mol2, mol3)
println(nodecount(prod))
println(edgecount(prod))
150
5922

Descriptor matching functions MolecularGraph.atommatch recognize difference in atom symbols and hybridization (important in bond angle and 3D structure) by default.

Dealing with uniform molecule is a difficult problem. You can use connected=true and topological=true to reduce the cost if it is acceptable in your application. Even wIth these tricks, fullerene (C60) calculation may never finishes.

@mojaie
Copy link
Owner

mojaie commented May 17, 2023

At the version 0.14, MolecularGraph.maximal_cliques is now implemented with Graphs.SimpleGraph.

julia> mol2 = smilestomol("C1C2C(CCC1)CCC2");
julia> mol3 = smilestomol("C1C3C(C2C(C1)CCCC2)CCC3");

julia> @time Graphs.maximal_cliques(tensor_product(mol2.graph, mol3.graph));
  0.000526 seconds (7.14 k allocations: 703.336 KiB)
julia> @time MolecularGraph.maximal_cliques(tensor_product(mol2.graph, mol3.graph));
  0.000507 seconds (8.68 k allocations: 851.594 KiB)
julia> @time Graphs.maximal_cliques(smallgraph(:karate))
  0.000110 seconds (1.31 k allocations: 131.438 KiB)
36-element Vector{Vector{Int64}}:
 [5, 1, 7]
 [5, 1, 11]
 [12, 1]
 [8, 4, 2, 3, 1]
 [17, 6, 7]
 [1, 32]
 [1, 6, 7]
 [1, 6, 11]
 [1, 9, 3]
 [1, 13, 4]
 ⋮
 [34, 33, 21]
 [34, 33, 9, 31]
 [34, 33, 24, 30]
 [34, 33, 23]
 [34, 33, 19]
 [34, 10]
 [34, 27, 30]
 [2, 31]
 [26, 24]
julia> @time MolecularGraph.maximal_cliques(smallgraph(:karate))
  0.000138 seconds (1.96 k allocations: 206.531 KiB)
36-element Vector{Vector{Int64}}:
 [5, 1, 7]
 [5, 1, 11]
 [12, 1]
 [8, 4, 2, 3, 1]
 [17, 6, 7]
 [1, 32]
 [1, 6, 7]
 [1, 6, 11]
 [1, 9, 3]
 [1, 13, 4]
 ⋮
 [34, 33, 21]
 [34, 33, 9, 31]
 [34, 33, 24, 30]
 [34, 33, 23]
 [34, 33, 19]
 [34, 10]
 [34, 27, 30]
 [2, 31]
 [26, 24]

I have tried several times with some graphs and it seems that Graphs version performs slightly better.
On the other hand, MolecularGraphs.maximal_cliques is based on a lazy Iterator and has several features that are useful in practical molecular mining (e.g. abortion by time out or reaching target size) , so I will continue to use this implementation for now.

julia> @time disconnected_mcis(mol2, mol3)
  0.387954 seconds (2.06 M allocations: 219.295 MiB, 12.14% gc time)
MCSResult{Int64}(Dict(5 => 5, 4 => 4, 6 => 6, 7 => 11, 2 => 2, 9 => 13, 8 => 12, 3 => 3, 1 => 1), :done)

The computation time for MCS may have slightly improved. This mainly depends on the computational performance of the modular product generation mentioned above, and still there may be a room for optimization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants