Graph performance problems with OrientDB 1.7.x / 2.0.x when searching for edges #4105

wcraigtrader · 2015-05-07T17:02:05Z

I have built a model application that can be used to measure the performance of OrientDB when ingesting data that is similar to our actual application, but which is openly shareable, and much smaller than the actual application and datasets. You can find the model application on GitHub:

https://github.com/wcraigtrader/ogp

My analysis leads me to believe that out-of-the-box, OrientDB is not doing any type of indexing for edges (light or heavy). Our application data graph is prone to creating super-nodes (e.g.: nodes with hundreds or thousands of edges), and our performance is suffering when trying to ingest these super-nodes.

I believe that by forcing our application to use heavy edges, and by correctly indexing the edges, we should be able to resolve this problem, but my efforts here have not been successful, so I'm appealing to you and your engineers for assistance.

lvca · 2015-05-08T06:58:18Z

@wcraigtrader Are you using SQL for traversing?

wcraigtrader · 2015-05-08T09:56:15Z

No.

We've been using the graph APIs (mostly Java, some Gremlin). The only place where we use SQL commands is while creating the schema. We'd use the Java APIs there except that the Java APIs didn't have enough documentation and I wasn't prepared to dig through the SQL layer and see what was being called under the covers.

The model code I provided is almost exactly a line-by-line translation of the code for ingesting data into the graph; the differences are in how the ingest data is generated (random instead of parsed from files) and stored in memory. The package and class names have been changed, of course.

IMO, the primary offending method is findEdge:

OrientEdge findEdge(String type, OrientVertex src, OrientVertex tgt ) {
    for (Edge eraw : src.getEdges( tgt, Direction.BOTH, type ) ) {
        return (OrientEdge) eraw
    }
    return null
}

The underlying call to hasNext() is apparently doing linear searches under the covers; I am sure that there must be some way to index edges and use those indexes (similarly to how findNode uses vertex indexes) but if there is, it is currently undocumented.

lvca · 2015-05-08T13:40:46Z

This will be resolved by OrientDB 3.0 where a new index structure will be used to assure O(LogN) performance even in this case. But looking at the code you only need the first one edge, is this correct?

wcraigtrader · 2015-05-08T15:13:32Z

Our schema requires that any pair of nodes be only connected by one instance of any one type of valid edge, so the first matching edge should be the only matching edge, for any given type of edge.

Given that Orient 2.X is really only just out, waiting for Orient 3.X isn't really a solution for us.

lvca · 2015-05-08T19:41:11Z

@wcraigtrader If you only want the first edge of type X, you don't need indexing. OrientDB already keeps edges in separate collections to return them quickly. I'm interested about the result of profiling to see why browsing just one edge is so expensive.

However, if you have regular edges (no lightweight), you could index out, in with a composite index and retrieve the edge with a lookup:

select expand( in ) from Friend where out = :src and in = :tgt

passing src and tgt as arguments.

wcraigtrader · 2015-05-08T20:24:37Z

@lvca As noted above, our application is generating lots of nodes with hundreds or even thousands of edges, and the sample application was developed specifically for ease of profiling and performance measurement for the types of data we're using. Please run it and examine its performance for yourself.

I am looking for advice specific to the sample application, to best improve its performance -- any changes to the Database or GraphPerformance classes that you think best. (Those are the classes that create and manipulate the graph database; the other classes are not counted in the performance metrics.)

With your suggestions in hand, I will apply the results to our actual application.

lvca · 2015-05-11T21:12:12Z

@wcraigtrader Any news on results?

wcraigtrader · 2015-05-11T23:17:21Z

@lvca If you look at the indexes branch of my model application you can see the results for yourself. My original implementation of findEdge() was:

OrientEdge findEdgeUsingSourceGetEdges(String type, OrientVertex src, OrientVertex tgt ) {
    // NOTE: This is where the ingester spends 75+% of its time!!!
    for (Edge eraw : src.getEdges( tgt, Direction.BOTH, type ) ) {
        return (OrientEdge) eraw
    }
    return null
}

As previously noted, this did not use indexes for edges, and the performance was abysmal for nodes with large numbers of edges. I then added indexes for edges, as follows:

OrientGraphNoTx g = factory.getNoTx()
OCommandSQL cmd = new OCommandSQL()

cmd.setText("create index sees.unique on sees (out,in) unique" )
g.command(cmd).execute()

cmd.setText("create index hears.unique on hears (out,in) unique" )
g.command(cmd).execute()

cmd.setText("create index feels.unique on feels (out,in) unique" )
g.command(cmd).execute()

cmd.setText("create index smells.unique on smells (out,in) unique" )
g.command(cmd).execute()

cmd.setText("create index tastes.unique on tastes (out,in) unique" )
g.command(cmd).execute()

Then I tried using a SQL query to select for my edge, as follows:

OrientEdge findEdgeUsingQuery(String type, OrientVertex src, OrientVertex tgt ) {
    def cmd = new OCommandSQL("select from index:${type}.unique where key=?")
    def key = new OCompositeKey( [src.id, tgt.id] )
    for (Vertex result : graph.command( cmd ).execute( key )) {
        return (OrientEdge) result.getProperty( 'rid' )
    }
    return null
}

Finally I used a pure Java / Groovy implementation (no SQL):

OrientEdge findEdgeUsingGraphGetEdges(String type, OrientVertex src, OrientVertex tgt ) {
    for (Edge eraw : graph.getEdges( "${type}.unique", new OCompositeKey( [src.id, tgt.id] ) ) ) {
        return (OrientEdge) eraw
    }
    return null
}

If you look at this spreadsheet you can see the end results. The black line represents the number of edges per node (500 edges per node for the first 50 transactions (to give the JVM time to finish JITting), then varying from 2 to 900 edges per transaction (to gauge the time per edge). The dark blue line represents the non-indexed solution, the dark green line represents the SQL query solution, and the dark red line represents the pure-Java solution. Both new solutions are O(1), but the non-SQL solution is clearly faster (typically 30% faster). In this use case, the SQL interface adds more overhead than it is worth (not surprising since we're not doing joins or returning complicated results).

Lessons Learned / Further Questions

Lightweight edges are not statistically faster with OrientDB 2.0.X and can't be indexed -- don't use them unless you're only using at most a handful of edges for any given node.
Groovy 2.4.3 is either significantly faster, or as fast as Groovy 1.8.9, depending upon the use case.
- Are there any known reasons why we shouldn't use Groovy 2.4.X for our Gremlin queries instead of Groovy 1.8.9?
OrientBaseGraph::createIndex() is both awkward to use and under-documented.
- How do you use createIndex() to create an index with a composite key?
- There's almost no useful documentation of the Parameter class.
- What's the difference in performance / implementation of UNIQUE and UNIQUE_HASH_INDEX indexes?
Vertex::getEdges() checks for useful indexes, but doesn't make use of the edge index I defined.
- Why not?
- What indexes would be used by getEdges()?
- How would I define them?
Am I correct in surmising that edge data (eg: edge @Rid) is stored on the source and target node records, in addition to on the edge record?
- How do we tune this?
- Is this more helpful for Gremlin or SQL queries?
- What are the practical limits to scalability, for both SQL and Gremlin?
What types of indexes could be defined to improve the performance of Gremlin queries when working with (or through) super-nodes?

acmeguy · 2015-05-15T21:51:07Z

just adding this here so I get notified. (This is of great interest/importance to us too)

lvca · 2015-05-21T23:12:32Z

On creation and traversal lightweight edges are much faster then regular one. What kind of benchmark did you execute?
No particular reason: that groovy version was bundled by Blueprints. Did you test Groovy 2.4.3 in place of the version in bundle?
You're right, it's under documented. Look at: http://orientdb.com/docs/last/Schema.html, where you can find something.
This is still missing. Please could you create a new issue for that?
while edge's RID is not stored in the Edge, because it's the edge address itself, it's stored on both vertices. You can't tune it, unless you are using lightweight edges
Now Gremlin is still not able to use indexes transparently. The best in this case is using SQL using the index explicitely

wcraigtrader · 2015-05-22T03:14:27Z

See the beginning of this ticket -- I referenced all of the source and posted it all up on GitHub, including how to run the comparison benchmarks for different versions of Orient and Groovy. Test was specifically ingesting large sub-graphs characteristic of our application and data. There are two branches -- one without edge indexing, one with.
Yes, I did, but not with Gremlin -- yet.
I already did -- indexes are only mentioned as constraints on the schema, not for performance tuning. The indexes page (http://orientdb.com/docs/last/Indexes.html) is hardly better.
In side-by-side comparison of simple queries using indexes to select edges, the Java APIs are 30% faster than the SQL queries -- I'm not inclined to use SQL if either Java or Gremlin can be used to craft the search/traversal.

lvca · 2015-05-22T14:15:31Z

About (1) I see you wrote 75% of the time is in getEdges(). Can you see inside that method where the time is spent mostly?

wcraigtrader · 2015-05-22T15:57:40Z

@lvca I use Oracle's Java Mission Control and Java Flight Recorder (included with Oracle JDK 7u40 and above). I've updated my ogp project to include a script to make profiling the app easier (see the README for details). I've also included two profiling samples (one without indexed edges, one with) that you can use to see what I was seeing. The profiles were taken just now, using Orient 2.0.9 and Groovy 2.3.10.

If you run jmc and open the profile-orient-209-no-indexes.jfr recording and click on the code tab (on the left), then click on the Hot Methods tab at the bottom, you'll see that HashMap.getEntry is the hot method -- not exactly useful -- but if drill down a little, you'll find that actual culprits are OTransactionRealAbstract.getRecordEntry and ODocument.rawField. Personally I prefer to follow the CallTree tab.

buzzkillington · 2015-05-27T19:42:57Z

I'm also interested in this. I'm browsing adjacent vertices based on specific edge labels, but the performance of Vertex.getEdges(Direction direction, String... labels) and Vertex.getVertices(Direction direction, String... labels) is slow. With either option, it's taking an average of over 0.5ms per record to look up the adjacent RIDs in a test case of ~110k vertices (the db has ~400k vertices and ~1.03M edges total).

I tried setting up some indexes, but ran into #4232. I'd be glad to try any other suggestions for speeding this up. (Currently testing on 2.1-rc3.)

tglman · 2015-06-02T15:50:24Z

@wcraigtrader

I'm running your test for understand how tune the system to get better result, i got mostly your results, i know that we did some improvement for 2.1, and more will come for 2.2

In the mean while i fixed also #4232, that may help in some cases

tglman · 2015-06-03T15:45:32Z

Hi @wcraigtrader,

I did some improvements on the plain iteration of edges, starting from your benchmark, but anyway i think the final solution is using indexes as you have done, also consider to use also some multi threading, that will help you to speed up, we are thinking to introduce an automatic detection of the index from getEdges from 2.2, that will simplify a lot the usage.

anyway does the index based solution satisfy your needs?

wcraigtrader · 2015-06-04T21:01:49Z

@tglman For the purposes of ingesting new records, the edge indexing method that I've described is working as well as we can reasonably expect.

I certainly wish that we didn't have to wait until Orient 2.2 ships for all aspects of the database to implicitly use defined indexes (edge or node). In particular, we expect to be making heavy use of Gremlin in coming months, and everything I've learned so far indicates that the Gremlin implementation ignores indexes and simply works from the in_XXX and out_XXX fields stored with the vertex. Please correct me if I'm wrong.

lvca · 2015-06-05T15:16:25Z

You're right. We could support optimization in gremlin, but in 3.0 we'll provide pattern matching feature where this will be much easier and fast.

wcraigtrader · 2015-06-06T17:49:57Z

@lvca My problem is timelines ... when is Orient planning to release 2.1, 2.2, 3.0?

lvca · 2015-06-08T21:45:45Z

OrientDB 2.1 GA this week, 2.2 August 2015 and 3.0 October 2015.

wcraigtrader · 2015-06-08T22:45:02Z

@lvca, thanks for the dates. I doubt that we will move to 2.1 much before August (we go to internal GA at the end of June, and customer GA at the end of July).

Just out of curiosity, what is the largest number of edges connected to a single node that you're testing against? In our current dataset (roughly half the size of what we expect to have when we go live), we have a number of supernodes with massive numbers of edges. At the moment, one of those nodes currently has 495,376 edges. (Current dataset size: 5,075,444 nodes, 5,467,323 edges.)

lvca · 2015-06-09T07:04:22Z

We have clients with millions of edges as super nodes, but we always suggest to try avoiding it if possible. Could you share your design (in private) to see how we can improve it?

wcraigtrader · 2015-06-09T07:39:42Z

Thanks, but at the moment, our schema is fixed -- too close to release to be changing the fundamentals. I'll revisit that problem in August.

wcraigtrader · 2015-06-17T18:18:18Z

Comment to my other account.

lvca · 2015-06-18T14:26:48Z

Ok ;-)

Jotschi · 2015-10-16T11:38:15Z

I created my own little unit test using OrientDB 2.1.3 to better understand the performance implications and to investigate each option that was mentioned in this issue.

The graph.getEdges method also works very well for me. But i'm still a puzzled why the v1.getEdges(v2, direction, label) method is so slow.

@lvca You mentioned this method on SO: http://stackoverflow.com/questions/32953396/orientdb-edge-index-via-java

AFAIK the mentioned method is not using the index at all.

Creating {14000} items.
[graph.getEdges] Duration: 38
[graph.getEdges] Duration per lookup: 0.0095

[root.getEdges] Duration: 59439
[root.getEdges] Duration per lookup: 14.85975

[root.getEdges - iterating] Duration: 56710
[root.getEdges - iterating] Duration per lookup: 14.1775

[query] Duration: 817
[query] Duration per lookup: 0.20425

package de.jotschi.orientdb;

import static org.junit.Assert.assertTrue;

import java.util.ArrayList;
import java.util.List;

import org.junit.BeforeClass;
import org.junit.Test;

import com.orientechnologies.orient.core.index.OCompositeKey;
import com.orientechnologies.orient.core.metadata.schema.OType;
import com.orientechnologies.orient.core.sql.OCommandSQL;
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientEdgeType;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;
import com.tinkerpop.blueprints.impls.orient.OrientVertex;
import com.tinkerpop.blueprints.impls.orient.OrientVertexType;

public class EdgeIndexPerformanceTest {

    private static OrientGraphFactory factory = new OrientGraphFactory("memory:tinkerpop");

    private final static int nDocuments = 14000;
    private final static int nChecks = 4000;

    private static List<OrientVertex> items;
    private static OrientVertex root;

    @BeforeClass
    public static void setupDatabase() {
        setupTypesAndIndices(factory);

        root = createRoot(factory);
        items = createData(root, factory, nDocuments);
    }

    private static void setupTypesAndIndices(OrientGraphFactory factory2) {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            OCommandSQL cmd = new OCommandSQL();

            cmd.setText("alter database custom useLightweightEdges=false");
            g.command(cmd).execute();

            cmd.setText("alter database custom useVertexFieldsForEdgeLabels=false");
            g.command(cmd).execute();

            OrientEdgeType e = g.getEdgeType("E");
            e.createProperty("in", OType.LINK);
            e.createProperty("out", OType.LINK);

            OrientVertexType v = g.createVertexType("root", "V");
            v.createProperty("name", OType.STRING);

            v = g.createVertexType("item", "V");
            v.createProperty("name", OType.STRING);

            cmd.setText("create index edge.HAS_ITEM on E (out,in) unique");
            g.command(cmd).execute();

        } finally {
            g.shutdown();
        }

    }

    private static List<OrientVertex> createData(OrientVertex root, OrientGraphFactory factory, int count) {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            System.out.println("Creating {" + count + "} items.");
            List<OrientVertex> items = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                OrientVertex item = g.addVertex("class:item");
                item.setProperty("name", "item_" + i);
                items.add(item);
                root.addEdge("HAS_ITEM", item, "class:E");
            }
            return items;
        } finally {
            g.shutdown();
        }
    }

    private static OrientVertex createRoot(OrientGraphFactory factory) {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            OrientVertex root = g.addVertex("class:root");
            root.setProperty("name", "root vertex");
            return root;
        } finally {
            g.shutdown();
        }
    }

    @Test
    public void testEdgeIndexViaRootGetEdgesWithoutTarget() throws Exception {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            long start = System.currentTimeMillis();
            for (int i = 0; i < nChecks; i++) {
                OrientVertex randomDocument = items.get((int) (Math.random() * items.size()));
                Iterable<Edge> edges = root.getEdges(Direction.OUT, "HAS_ITEM");
                boolean found = false;
                for (Edge edge : edges) {
                    if (edge.getVertex(Direction.IN).equals(randomDocument)) {
                        found = true;
                        break;
                    }
                }
                assertTrue(found);
            }
            long dur = System.currentTimeMillis() - start;
            System.out.println("[root.getEdges - iterating] Duration: " + dur);
            System.out.println("[root.getEdges - iterating] Duration per lookup: " + ((double) dur / (double) nChecks));
        } finally {
            g.shutdown();
        }
    }

    @Test
    public void testEdgeIndexViaRootGetEdges() throws Exception {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            long start = System.currentTimeMillis();
            for (int i = 0; i < nChecks; i++) {
                OrientVertex randomDocument = items.get((int) (Math.random() * items.size()));
                Iterable<Edge> edges = root.getEdges(randomDocument, Direction.OUT, "HAS_ITEM");
                assertTrue(edges.iterator().hasNext());
            }
            long dur = System.currentTimeMillis() - start;
            System.out.println("[root.getEdges] Duration: " + dur);
            System.out.println("[root.getEdges] Duration per lookup: " + ((double) dur / (double) nChecks));
        } finally {
            g.shutdown();
        }
    }

    @Test
    public void testEdgeIndexViaGraphGetEdges() throws Exception {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            long start = System.currentTimeMillis();
            for (int i = 0; i < nChecks; i++) {
                OrientVertex randomDocument = items.get((int) (Math.random() * items.size()));
                Iterable<Edge> edges = g.getEdges("edge.has_item", new OCompositeKey(root.getId(), randomDocument.getId()));
                assertTrue(edges.iterator().hasNext());
            }
            long dur = System.currentTimeMillis() - start;
            System.out.println("[graph.getEdges] Duration: " + dur);
            System.out.println("[graph.getEdges] Duration per lookup: " + ((double) dur / (double) nChecks));
        } finally {
            g.shutdown();
        }
    }

    @Test
    public void testEdgeIndexViaQuery() throws Exception {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            System.out.println("Checking edge");
            long start = System.currentTimeMillis();
            for (int i = 0; i < nChecks; i++) {
                OrientVertex randomDocument = items.get((int) (Math.random() * items.size()));

                OCommandSQL cmd = new OCommandSQL("select from index:edge.has_item where key=?");
                OCompositeKey key = new OCompositeKey(root.getId(), randomDocument.getId());

                assertTrue(((Iterable<Vertex>) g.command(cmd).execute(key)).iterator().hasNext());
            }
            long dur = System.currentTimeMillis() - start;
            System.out.println("[query] Duration: " + dur);
            System.out.println("[query] Duration per lookup: " + ((double) dur / (double) nChecks));
        } finally {
            g.shutdown();
        }
    }

}

prjhub added the performance label May 7, 2015

lvca assigned tglman May 22, 2015

lvca added this to the 2.1 GA milestone May 22, 2015

tglman added a commit that referenced this issue Jun 3, 2015

improved performance on vertex/edge type check, issue #4105

4c520a4

tglman added the waiting reply label Jun 3, 2015

wcraigtrader closed this as completed Jun 4, 2015

lvca removed this from the 2.1 GA milestone Jun 16, 2015

lvca modified the milestones: 2.1-rc4, 2.1 GA Jun 16, 2015

prjhub removed the waiting reply label Jun 17, 2015

Jotschi mentioned this issue Oct 16, 2015

Documentation and APIs #5037

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Graph performance problems with OrientDB 1.7.x / 2.0.x when searching for edges #4105

Graph performance problems with OrientDB 1.7.x / 2.0.x when searching for edges #4105

wcraigtrader commented May 7, 2015

lvca commented May 8, 2015

wcraigtrader commented May 8, 2015

lvca commented May 8, 2015

wcraigtrader commented May 8, 2015

lvca commented May 8, 2015

wcraigtrader commented May 8, 2015

lvca commented May 11, 2015

wcraigtrader commented May 11, 2015

acmeguy commented May 15, 2015

lvca commented May 21, 2015

wcraigtrader commented May 22, 2015

lvca commented May 22, 2015

wcraigtrader commented May 22, 2015

buzzkillington commented May 27, 2015

tglman commented Jun 2, 2015

tglman commented Jun 3, 2015

wcraigtrader commented Jun 4, 2015

lvca commented Jun 5, 2015

wcraigtrader commented Jun 6, 2015

lvca commented Jun 8, 2015

wcraigtrader commented Jun 8, 2015

lvca commented Jun 9, 2015

wcraigtrader commented Jun 9, 2015

wcraigtrader commented Jun 17, 2015

lvca commented Jun 18, 2015

Jotschi commented Oct 16, 2015

Graph performance problems with OrientDB 1.7.x / 2.0.x when searching for edges #4105

Graph performance problems with OrientDB 1.7.x / 2.0.x when searching for edges #4105

Comments

wcraigtrader commented May 7, 2015

lvca commented May 8, 2015

wcraigtrader commented May 8, 2015

lvca commented May 8, 2015

wcraigtrader commented May 8, 2015

lvca commented May 8, 2015

wcraigtrader commented May 8, 2015

lvca commented May 11, 2015

wcraigtrader commented May 11, 2015

Lessons Learned / Further Questions

acmeguy commented May 15, 2015

lvca commented May 21, 2015

wcraigtrader commented May 22, 2015

lvca commented May 22, 2015

wcraigtrader commented May 22, 2015

buzzkillington commented May 27, 2015

tglman commented Jun 2, 2015

tglman commented Jun 3, 2015

wcraigtrader commented Jun 4, 2015

lvca commented Jun 5, 2015

wcraigtrader commented Jun 6, 2015

lvca commented Jun 8, 2015

wcraigtrader commented Jun 8, 2015

lvca commented Jun 9, 2015

wcraigtrader commented Jun 9, 2015

wcraigtrader commented Jun 17, 2015

lvca commented Jun 18, 2015

Jotschi commented Oct 16, 2015