Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graph performance problems with OrientDB 1.7.x / 2.0.x when searching for edges #4105

Closed
wcraigtrader opened this issue May 7, 2015 · 26 comments
Assignees
Milestone

Comments

@wcraigtrader
Copy link

I have built a model application that can be used to measure the performance of OrientDB when ingesting data that is similar to our actual application, but which is openly shareable, and much smaller than the actual application and datasets. You can find the model application on GitHub:

https://github.com/wcraigtrader/ogp

My analysis leads me to believe that out-of-the-box, OrientDB is not doing any type of indexing for edges (light or heavy). Our application data graph is prone to creating super-nodes (e.g.: nodes with hundreds or thousands of edges), and our performance is suffering when trying to ingest these super-nodes.

I believe that by forcing our application to use heavy edges, and by correctly indexing the edges, we should be able to resolve this problem, but my efforts here have not been successful, so I'm appealing to you and your engineers for assistance.

@lvca
Copy link
Member

lvca commented May 8, 2015

@wcraigtrader Are you using SQL for traversing?

@wcraigtrader
Copy link
Author

No.

We've been using the graph APIs (mostly Java, some Gremlin). The only place where we use SQL commands is while creating the schema. We'd use the Java APIs there except that the Java APIs didn't have enough documentation and I wasn't prepared to dig through the SQL layer and see what was being called under the covers.

The model code I provided is almost exactly a line-by-line translation of the code for ingesting data into the graph; the differences are in how the ingest data is generated (random instead of parsed from files) and stored in memory. The package and class names have been changed, of course.

IMO, the primary offending method is findEdge:

OrientEdge findEdge(String type, OrientVertex src, OrientVertex tgt ) {
    for (Edge eraw : src.getEdges( tgt, Direction.BOTH, type ) ) {
        return (OrientEdge) eraw
    }
    return null
}

The underlying call to hasNext() is apparently doing linear searches under the covers; I am sure that there must be some way to index edges and use those indexes (similarly to how findNode uses vertex indexes) but if there is, it is currently undocumented.

@lvca
Copy link
Member

lvca commented May 8, 2015

This will be resolved by OrientDB 3.0 where a new index structure will be used to assure O(LogN) performance even in this case. But looking at the code you only need the first one edge, is this correct?

@wcraigtrader
Copy link
Author

Our schema requires that any pair of nodes be only connected by one instance of any one type of valid edge, so the first matching edge should be the only matching edge, for any given type of edge.

Given that Orient 2.X is really only just out, waiting for Orient 3.X isn't really a solution for us.

@lvca
Copy link
Member

lvca commented May 8, 2015

@wcraigtrader If you only want the first edge of type X, you don't need indexing. OrientDB already keeps edges in separate collections to return them quickly. I'm interested about the result of profiling to see why browsing just one edge is so expensive.

However, if you have regular edges (no lightweight), you could index out, in with a composite index and retrieve the edge with a lookup:

select expand( in ) from Friend where out = :src and in = :tgt

passing src and tgt as arguments.

@wcraigtrader
Copy link
Author

@lvca As noted above, our application is generating lots of nodes with hundreds or even thousands of edges, and the sample application was developed specifically for ease of profiling and performance measurement for the types of data we're using. Please run it and examine its performance for yourself.

I am looking for advice specific to the sample application, to best improve its performance -- any changes to the Database or GraphPerformance classes that you think best. (Those are the classes that create and manipulate the graph database; the other classes are not counted in the performance metrics.)

With your suggestions in hand, I will apply the results to our actual application.

@lvca
Copy link
Member

lvca commented May 11, 2015

@wcraigtrader Any news on results?

@wcraigtrader
Copy link
Author

@lvca If you look at the indexes branch of my model application you can see the results for yourself. My original implementation of findEdge() was:

OrientEdge findEdgeUsingSourceGetEdges(String type, OrientVertex src, OrientVertex tgt ) {
    // NOTE: This is where the ingester spends 75+% of its time!!!
    for (Edge eraw : src.getEdges( tgt, Direction.BOTH, type ) ) {
        return (OrientEdge) eraw
    }
    return null
}

As previously noted, this did not use indexes for edges, and the performance was abysmal for nodes with large numbers of edges. I then added indexes for edges, as follows:

OrientGraphNoTx g = factory.getNoTx()
OCommandSQL cmd = new OCommandSQL()

cmd.setText("create index sees.unique on sees (out,in) unique" )
g.command(cmd).execute()

cmd.setText("create index hears.unique on hears (out,in) unique" )
g.command(cmd).execute()

cmd.setText("create index feels.unique on feels (out,in) unique" )
g.command(cmd).execute()

cmd.setText("create index smells.unique on smells (out,in) unique" )
g.command(cmd).execute()

cmd.setText("create index tastes.unique on tastes (out,in) unique" )
g.command(cmd).execute()

Then I tried using a SQL query to select for my edge, as follows:

OrientEdge findEdgeUsingQuery(String type, OrientVertex src, OrientVertex tgt ) {
    def cmd = new OCommandSQL("select from index:${type}.unique where key=?")
    def key = new OCompositeKey( [src.id, tgt.id] )
    for (Vertex result : graph.command( cmd ).execute( key )) {
        return (OrientEdge) result.getProperty( 'rid' )
    }
    return null
}

Finally I used a pure Java / Groovy implementation (no SQL):

OrientEdge findEdgeUsingGraphGetEdges(String type, OrientVertex src, OrientVertex tgt ) {
    for (Edge eraw : graph.getEdges( "${type}.unique", new OCompositeKey( [src.id, tgt.id] ) ) ) {
        return (OrientEdge) eraw
    }
    return null
}

If you look at this spreadsheet you can see the end results. The black line represents the number of edges per node (500 edges per node for the first 50 transactions (to give the JVM time to finish JITting), then varying from 2 to 900 edges per transaction (to gauge the time per edge). The dark blue line represents the non-indexed solution, the dark green line represents the SQL query solution, and the dark red line represents the pure-Java solution. Both new solutions are O(1), but the non-SQL solution is clearly faster (typically 30% faster). In this use case, the SQL interface adds more overhead than it is worth (not surprising since we're not doing joins or returning complicated results).

Lessons Learned / Further Questions

  1. Lightweight edges are not statistically faster with OrientDB 2.0.X and can't be indexed -- don't use them unless you're only using at most a handful of edges for any given node.
  2. Groovy 2.4.3 is either significantly faster, or as fast as Groovy 1.8.9, depending upon the use case.
    • Are there any known reasons why we shouldn't use Groovy 2.4.X for our Gremlin queries instead of Groovy 1.8.9?
  3. OrientBaseGraph::createIndex() is both awkward to use and under-documented.
    • How do you use createIndex() to create an index with a composite key?
    • There's almost no useful documentation of the Parameter class.
    • What's the difference in performance / implementation of UNIQUE and UNIQUE_HASH_INDEX indexes?
  4. Vertex::getEdges() checks for useful indexes, but doesn't make use of the edge index I defined.
    • Why not?
    • What indexes would be used by getEdges()?
    • How would I define them?
  5. Am I correct in surmising that edge data (eg: edge @Rid) is stored on the source and target node records, in addition to on the edge record?
    • How do we tune this?
    • Is this more helpful for Gremlin or SQL queries?
    • What are the practical limits to scalability, for both SQL and Gremlin?
  6. What types of indexes could be defined to improve the performance of Gremlin queries when working with (or through) super-nodes?

@acmeguy
Copy link

acmeguy commented May 15, 2015

just adding this here so I get notified. (This is of great interest/importance to us too)

@lvca
Copy link
Member

lvca commented May 21, 2015

  1. On creation and traversal lightweight edges are much faster then regular one. What kind of benchmark did you execute?
  2. No particular reason: that groovy version was bundled by Blueprints. Did you test Groovy 2.4.3 in place of the version in bundle?
  3. You're right, it's under documented. Look at: http://orientdb.com/docs/last/Schema.html, where you can find something.
  4. This is still missing. Please could you create a new issue for that?
  5. while edge's RID is not stored in the Edge, because it's the edge address itself, it's stored on both vertices. You can't tune it, unless you are using lightweight edges
  6. Now Gremlin is still not able to use indexes transparently. The best in this case is using SQL using the index explicitely

@wcraigtrader
Copy link
Author

  1. See the beginning of this ticket -- I referenced all of the source and posted it all up on GitHub, including how to run the comparison benchmarks for different versions of Orient and Groovy. Test was specifically ingesting large sub-graphs characteristic of our application and data. There are two branches -- one without edge indexing, one with.
  2. Yes, I did, but not with Gremlin -- yet.
  3. I already did -- indexes are only mentioned as constraints on the schema, not for performance tuning. The indexes page (http://orientdb.com/docs/last/Indexes.html) is hardly better.
  4. In side-by-side comparison of simple queries using indexes to select edges, the Java APIs are 30% faster than the SQL queries -- I'm not inclined to use SQL if either Java or Gremlin can be used to craft the search/traversal.

@lvca
Copy link
Member

lvca commented May 22, 2015

About (1) I see you wrote 75% of the time is in getEdges(). Can you see inside that method where the time is spent mostly?

@wcraigtrader
Copy link
Author

@lvca I use Oracle's Java Mission Control and Java Flight Recorder (included with Oracle JDK 7u40 and above). I've updated my ogp project to include a script to make profiling the app easier (see the README for details). I've also included two profiling samples (one without indexed edges, one with) that you can use to see what I was seeing. The profiles were taken just now, using Orient 2.0.9 and Groovy 2.3.10.

If you run jmc and open the profile-orient-209-no-indexes.jfr recording and click on the code tab (on the left), then click on the Hot Methods tab at the bottom, you'll see that HashMap.getEntry is the hot method -- not exactly useful -- but if drill down a little, you'll find that actual culprits are OTransactionRealAbstract.getRecordEntry and ODocument.rawField. Personally I prefer to follow the CallTree tab.

@lvca lvca added this to the 2.1 GA milestone May 22, 2015
@buzzkillington
Copy link

I'm also interested in this. I'm browsing adjacent vertices based on specific edge labels, but the performance of Vertex.getEdges(Direction direction, String... labels) and Vertex.getVertices(Direction direction, String... labels) is slow. With either option, it's taking an average of over 0.5ms per record to look up the adjacent RIDs in a test case of ~110k vertices (the db has ~400k vertices and ~1.03M edges total).

I tried setting up some indexes, but ran into #4232. I'd be glad to try any other suggestions for speeding this up. (Currently testing on 2.1-rc3.)

@tglman
Copy link
Member

tglman commented Jun 2, 2015

@wcraigtrader

I'm running your test for understand how tune the system to get better result, i got mostly your results, i know that we did some improvement for 2.1, and more will come for 2.2

In the mean while i fixed also #4232, that may help in some cases

@tglman
Copy link
Member

tglman commented Jun 3, 2015

Hi @wcraigtrader,

I did some improvements on the plain iteration of edges, starting from your benchmark, but anyway i think the final solution is using indexes as you have done, also consider to use also some multi threading, that will help you to speed up, we are thinking to introduce an automatic detection of the index from getEdges from 2.2, that will simplify a lot the usage.

anyway does the index based solution satisfy your needs?

@wcraigtrader
Copy link
Author

@tglman For the purposes of ingesting new records, the edge indexing method that I've described is working as well as we can reasonably expect.

I certainly wish that we didn't have to wait until Orient 2.2 ships for all aspects of the database to implicitly use defined indexes (edge or node). In particular, we expect to be making heavy use of Gremlin in coming months, and everything I've learned so far indicates that the Gremlin implementation ignores indexes and simply works from the in_XXX and out_XXX fields stored with the vertex. Please correct me if I'm wrong.

@lvca
Copy link
Member

lvca commented Jun 5, 2015

You're right. We could support optimization in gremlin, but in 3.0 we'll provide pattern matching feature where this will be much easier and fast.

@wcraigtrader
Copy link
Author

@lvca My problem is timelines ... when is Orient planning to release 2.1, 2.2, 3.0?

@lvca
Copy link
Member

lvca commented Jun 8, 2015

OrientDB 2.1 GA this week, 2.2 August 2015 and 3.0 October 2015.

@wcraigtrader
Copy link
Author

@lvca, thanks for the dates. I doubt that we will move to 2.1 much before August (we go to internal GA at the end of June, and customer GA at the end of July).

Just out of curiosity, what is the largest number of edges connected to a single node that you're testing against? In our current dataset (roughly half the size of what we expect to have when we go live), we have a number of supernodes with massive numbers of edges. At the moment, one of those nodes currently has 495,376 edges. (Current dataset size: 5,075,444 nodes, 5,467,323 edges.)

@lvca
Copy link
Member

lvca commented Jun 9, 2015

We have clients with millions of edges as super nodes, but we always suggest to try avoiding it if possible. Could you share your design (in private) to see how we can improve it?

@wcraigtrader
Copy link
Author

Thanks, but at the moment, our schema is fixed -- too close to release to be changing the fundamentals. I'll revisit that problem in August.

@lvca lvca removed this from the 2.1 GA milestone Jun 16, 2015
@lvca lvca modified the milestones: 2.1-rc4, 2.1 GA Jun 16, 2015
@wcraigtrader
Copy link
Author

Comment to my other account.

@lvca
Copy link
Member

lvca commented Jun 18, 2015

Ok ;-)

@Jotschi
Copy link

Jotschi commented Oct 16, 2015

I created my own little unit test using OrientDB 2.1.3 to better understand the performance implications and to investigate each option that was mentioned in this issue.

The graph.getEdges method also works very well for me. But i'm still a puzzled why the v1.getEdges(v2, direction, label) method is so slow.

@lvca You mentioned this method on SO: http://stackoverflow.com/questions/32953396/orientdb-edge-index-via-java

AFAIK the mentioned method is not using the index at all.

Creating {14000} items.
[graph.getEdges] Duration: 38
[graph.getEdges] Duration per lookup: 0.0095

[root.getEdges] Duration: 59439
[root.getEdges] Duration per lookup: 14.85975

[root.getEdges - iterating] Duration: 56710
[root.getEdges - iterating] Duration per lookup: 14.1775

[query] Duration: 817
[query] Duration per lookup: 0.20425
package de.jotschi.orientdb;

import static org.junit.Assert.assertTrue;

import java.util.ArrayList;
import java.util.List;

import org.junit.BeforeClass;
import org.junit.Test;

import com.orientechnologies.orient.core.index.OCompositeKey;
import com.orientechnologies.orient.core.metadata.schema.OType;
import com.orientechnologies.orient.core.sql.OCommandSQL;
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientEdgeType;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;
import com.tinkerpop.blueprints.impls.orient.OrientVertex;
import com.tinkerpop.blueprints.impls.orient.OrientVertexType;

public class EdgeIndexPerformanceTest {

    private static OrientGraphFactory factory = new OrientGraphFactory("memory:tinkerpop");

    private final static int nDocuments = 14000;
    private final static int nChecks = 4000;

    private static List<OrientVertex> items;
    private static OrientVertex root;

    @BeforeClass
    public static void setupDatabase() {
        setupTypesAndIndices(factory);

        root = createRoot(factory);
        items = createData(root, factory, nDocuments);
    }

    private static void setupTypesAndIndices(OrientGraphFactory factory2) {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            OCommandSQL cmd = new OCommandSQL();

            cmd.setText("alter database custom useLightweightEdges=false");
            g.command(cmd).execute();

            cmd.setText("alter database custom useVertexFieldsForEdgeLabels=false");
            g.command(cmd).execute();

            OrientEdgeType e = g.getEdgeType("E");
            e.createProperty("in", OType.LINK);
            e.createProperty("out", OType.LINK);

            OrientVertexType v = g.createVertexType("root", "V");
            v.createProperty("name", OType.STRING);

            v = g.createVertexType("item", "V");
            v.createProperty("name", OType.STRING);

            cmd.setText("create index edge.HAS_ITEM on E (out,in) unique");
            g.command(cmd).execute();

        } finally {
            g.shutdown();
        }

    }

    private static List<OrientVertex> createData(OrientVertex root, OrientGraphFactory factory, int count) {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            System.out.println("Creating {" + count + "} items.");
            List<OrientVertex> items = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                OrientVertex item = g.addVertex("class:item");
                item.setProperty("name", "item_" + i);
                items.add(item);
                root.addEdge("HAS_ITEM", item, "class:E");
            }
            return items;
        } finally {
            g.shutdown();
        }
    }

    private static OrientVertex createRoot(OrientGraphFactory factory) {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            OrientVertex root = g.addVertex("class:root");
            root.setProperty("name", "root vertex");
            return root;
        } finally {
            g.shutdown();
        }
    }

    @Test
    public void testEdgeIndexViaRootGetEdgesWithoutTarget() throws Exception {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            long start = System.currentTimeMillis();
            for (int i = 0; i < nChecks; i++) {
                OrientVertex randomDocument = items.get((int) (Math.random() * items.size()));
                Iterable<Edge> edges = root.getEdges(Direction.OUT, "HAS_ITEM");
                boolean found = false;
                for (Edge edge : edges) {
                    if (edge.getVertex(Direction.IN).equals(randomDocument)) {
                        found = true;
                        break;
                    }
                }
                assertTrue(found);
            }
            long dur = System.currentTimeMillis() - start;
            System.out.println("[root.getEdges - iterating] Duration: " + dur);
            System.out.println("[root.getEdges - iterating] Duration per lookup: " + ((double) dur / (double) nChecks));
        } finally {
            g.shutdown();
        }
    }

    @Test
    public void testEdgeIndexViaRootGetEdges() throws Exception {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            long start = System.currentTimeMillis();
            for (int i = 0; i < nChecks; i++) {
                OrientVertex randomDocument = items.get((int) (Math.random() * items.size()));
                Iterable<Edge> edges = root.getEdges(randomDocument, Direction.OUT, "HAS_ITEM");
                assertTrue(edges.iterator().hasNext());
            }
            long dur = System.currentTimeMillis() - start;
            System.out.println("[root.getEdges] Duration: " + dur);
            System.out.println("[root.getEdges] Duration per lookup: " + ((double) dur / (double) nChecks));
        } finally {
            g.shutdown();
        }
    }

    @Test
    public void testEdgeIndexViaGraphGetEdges() throws Exception {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            long start = System.currentTimeMillis();
            for (int i = 0; i < nChecks; i++) {
                OrientVertex randomDocument = items.get((int) (Math.random() * items.size()));
                Iterable<Edge> edges = g.getEdges("edge.has_item", new OCompositeKey(root.getId(), randomDocument.getId()));
                assertTrue(edges.iterator().hasNext());
            }
            long dur = System.currentTimeMillis() - start;
            System.out.println("[graph.getEdges] Duration: " + dur);
            System.out.println("[graph.getEdges] Duration per lookup: " + ((double) dur / (double) nChecks));
        } finally {
            g.shutdown();
        }
    }

    @Test
    public void testEdgeIndexViaQuery() throws Exception {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            System.out.println("Checking edge");
            long start = System.currentTimeMillis();
            for (int i = 0; i < nChecks; i++) {
                OrientVertex randomDocument = items.get((int) (Math.random() * items.size()));

                OCommandSQL cmd = new OCommandSQL("select from index:edge.has_item where key=?");
                OCompositeKey key = new OCompositeKey(root.getId(), randomDocument.getId());

                assertTrue(((Iterable<Vertex>) g.command(cmd).execute(key)).iterator().hasNext());
            }
            long dur = System.currentTimeMillis() - start;
            System.out.println("[query] Duration: " + dur);
            System.out.println("[query] Duration per lookup: " + ((double) dur / (double) nChecks));
        } finally {
            g.shutdown();
        }
    }

}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

7 participants