Graph performance problems with OrientDB 1.7.x / 2.0.x when searching for edges #4105
@wcraigtrader Are you using SQL for traversing? |
No. We've been using the graph APIs (mostly Java, some Gremlin). The only place where we use SQL commands is while creating the schema. We'd use the Java APIs there except that the Java APIs didn't have enough documentation and I wasn't prepared to dig through the SQL layer and see what was being called under the covers. The model code I provided is almost exactly a line-by-line translation of the code for ingesting data into the graph; the differences are in how the ingest data is generated (random instead of parsed from files) and stored in memory. The package and class names have been changed, of course. IMO, the primary offending method is findEdge:
The underlying call to |
This will be resolved in OrientDB 3.0, where a new index structure will ensure O(log N) performance even in this case. But looking at the code, you only need the first edge; is this correct? |
Our schema requires that any pair of nodes be only connected by one instance of any one type of valid edge, so the first matching edge should be the only matching edge, for any given type of edge. Given that Orient 2.X is really only just out, waiting for Orient 3.X isn't really a solution for us. |
@wcraigtrader If you only want the first edge of type X, you don't need indexing. OrientDB already keeps edges in separate collections to return them quickly. I'm interested in the profiling results, to see why browsing just one edge is so expensive. However, if you have regular (not lightweight) edges, you could create a composite index on out and in, and retrieve the edge with a lookup: select expand( in ) from Friend where out = :src and in = :tgt passing src and tgt as arguments. |
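To make the difference between the two lookup strategies concrete, here is a minimal plain-Java sketch. It does not use the OrientDB API: the `Edge` and `Key` classes, the `outgoing` adjacency map, and the `byEndpoints` map are hypothetical stand-ins, and a hash map only approximates what a unique composite index on (out, in) would do. It shows that a scan touches every edge on the source vertex, while the composite-key lookup goes straight to the one matching edge.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class CompositeEdgeIndexSketch {

    /** Hypothetical stand-in for a regular (non-lightweight) edge record. */
    static final class Edge {
        final String out;   // id of the source vertex
        final String in;    // id of the target vertex
        final String label;

        Edge(String out, String in, String label) {
            this.out = out;
            this.in = in;
            this.label = label;
        }
    }

    /** Composite key over (out, in), mirroring an index on both fields. */
    static final class Key {
        final String out;
        final String in;

        Key(String out, String in) {
            this.out = out;
            this.in = in;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Key)) {
                return false;
            }
            Key k = (Key) other;
            return out.equals(k.out) && in.equals(k.in);
        }

        @Override
        public int hashCode() {
            return Objects.hash(out, in);
        }
    }

    // Adjacency lists per source vertex (what a plain edge scan walks).
    private final Map<String, List<Edge>> outgoing = new HashMap<>();
    // Stand-in for the unique composite index on (out, in).
    private final Map<Key, Edge> byEndpoints = new HashMap<>();

    /** Adds an edge; returns false if the pair is already connected. */
    boolean addEdge(String out, String in, String label) {
        Key key = new Key(out, in);
        if (byEndpoints.containsKey(key)) {
            return false;
        }
        Edge edge = new Edge(out, in, label);
        outgoing.computeIfAbsent(out, k -> new ArrayList<>()).add(edge);
        byEndpoints.put(key, edge);
        return true;
    }

    /** Linear scan over all outgoing edges of the source vertex. */
    Edge findByScan(String out, String in) {
        for (Edge edge : outgoing.getOrDefault(out, Collections.<Edge>emptyList())) {
            if (edge.in.equals(in)) {
                return edge;
            }
        }
        return null;
    }

    /** Direct lookup through the composite key. */
    Edge findByIndex(String out, String in) {
        return byEndpoints.get(new Key(out, in));
    }

    public static void main(String[] args) {
        CompositeEdgeIndexSketch graph = new CompositeEdgeIndexSketch();
        for (int i = 0; i < 14000; i++) {
            graph.addEdge("root", "item_" + i, "HAS_ITEM");
        }
        Edge scanned = graph.findByScan("root", "item_13999");
        Edge indexed = graph.findByIndex("root", "item_13999");
        System.out.println("same edge found: " + (scanned == indexed));
    }
}
```

The sketch covers a single edge class; a schema with several edge types would need one such index per type (or the label added to the key).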
@lvca As noted above, our application is generating lots of nodes with hundreds or even thousands of edges, and the sample application was developed specifically for ease of profiling and performance measurement for the types of data we're using. Please run it and examine its performance for yourself. I am looking for advice specific to the sample application, to best improve its performance -- any changes to the With your suggestions in hand, I will apply the results to our actual application. |
@wcraigtrader Any news on results? |
@lvca If you look at the indexes branch of my model application you can see the results for yourself. My original implementation of
As previously noted, this did not use indexes for edges, and the performance was abysmal for nodes with large numbers of edges. I then added indexes for edges, as follows:
Then I tried using a SQL query to select for my edge, as follows:
Finally I used a pure Java / Groovy implementation (no SQL):
If you look at this spreadsheet you can see the end results. The black line represents the number of edges per node: 500 edges per node for the first 50 transactions (to give the JVM time to finish JITting), then varying from 2 to 900 edges per transaction (to gauge the time per edge). The dark blue line represents the non-indexed solution, the dark green line represents the SQL query solution, and the dark red line represents the pure-Java solution. Both new solutions are O(1), but the non-SQL solution is clearly faster (typically 30% faster). In this use case, the SQL interface adds more overhead than it is worth (not surprising, since we're not doing joins or returning complicated results).
Lessons Learned / Further Questions
|
just adding this here so I get notified. (This is of great interest/importance to us too) |
Regarding (1), I see you wrote that 75% of the time is spent in getEdges(). Can you see where inside that method most of the time is spent? |
@lvca I use Oracle's Java Mission Control and Java Flight Recorder (included with Oracle JDK 7u40 and above). I've updated my ogp project to include a script to make profiling the app easier (see the README for details). I've also included two profiling samples (one without indexed edges, one with) that you can use to see what I was seeing. The profiles were taken just now, using Orient 2.0.9 and Groovy 2.3.10. If you run |
I'm also interested in this. I'm browsing adjacent vertices based on specific edge labels, but the performance of Vertex.getEdges(Direction direction, String... labels) and Vertex.getVertices(Direction direction, String... labels) is slow. With either option, it's taking an average of over 0.5ms per record to look up the adjacent RIDs in a test case of ~110k vertices (the db has ~400k vertices and ~1.03M edges total). I tried setting up some indexes, but ran into #4232. I'd be glad to try any other suggestions for speeding this up. (Currently testing on 2.1-rc3.) |
I'm running your test to understand how to tune the system for better results, and I got mostly the same results as you. I know that we made some improvements for 2.1, and more will come in 2.2. In the meantime I also fixed #4232, which may help in some cases. |
Hi @wcraigtrader, I made some improvements to the plain iteration of edges, starting from your benchmark, but I think the final solution is to use indexes as you have done. Also consider using multithreading, which will help speed things up. We are thinking of introducing automatic detection of the index from getEdges in 2.2, which will simplify usage a lot. Does the index-based solution satisfy your needs? |
@tglman For the purposes of ingesting new records, the edge indexing method that I've described is working as well as we can reasonably expect. I certainly wish that we didn't have to wait until Orient 2.2 ships for all aspects of the database to implicitly use defined indexes (edge or node). In particular, we expect to be making heavy use of Gremlin in coming months, and everything I've learned so far indicates that the Gremlin implementation ignores indexes and simply works from the in_XXX and out_XXX fields stored with the vertex. Please correct me if I'm wrong. |
You're right. We could support this optimization in Gremlin, but in 3.0 we'll provide a pattern matching feature where this will be much easier and faster. |
@lvca My problem is timelines ... when is Orient planning to release 2.1, 2.2, 3.0? |
OrientDB 2.1 GA this week, 2.2 August 2015 and 3.0 October 2015. |
@lvca, thanks for the dates. I doubt that we will move to 2.1 much before August (we go to internal GA at the end of June, and customer GA at the end of July). Just out of curiosity, what is the largest number of edges connected to a single node that you're testing against? In our current dataset (roughly half the size of what we expect to have when we go live), we have a number of supernodes with massive numbers of edges. At the moment, one of those nodes currently has 495,376 edges. (Current dataset size: 5,075,444 nodes, 5,467,323 edges.) |
We have clients with super nodes that have millions of edges, but we always suggest avoiding them if possible. Could you share your design (in private) so we can see how to improve it? |
Thanks, but at the moment, our schema is fixed -- too close to release to be changing the fundamentals. I'll revisit that problem in August. |
Comment to my other account. |
Ok ;-) |
I created my own little unit test using OrientDB 2.1.3 to better understand the performance implications and to investigate each option that was mentioned in this issue. The @lvca You mentioned this method on SO: http://stackoverflow.com/questions/32953396/orientdb-edge-index-via-java AFAIK the mentioned method is not using the index at all.
package de.jotschi.orientdb;

import static org.junit.Assert.assertTrue;

import java.util.ArrayList;
import java.util.List;

import org.junit.BeforeClass;
import org.junit.Test;

import com.orientechnologies.orient.core.index.OCompositeKey;
import com.orientechnologies.orient.core.metadata.schema.OType;
import com.orientechnologies.orient.core.sql.OCommandSQL;
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientEdgeType;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;
import com.tinkerpop.blueprints.impls.orient.OrientVertex;
import com.tinkerpop.blueprints.impls.orient.OrientVertexType;

public class EdgeIndexPerformanceTest {

    private static OrientGraphFactory factory = new OrientGraphFactory("memory:tinkerpop");
    private final static int nDocuments = 14000;
    private final static int nChecks = 4000;
    private static List<OrientVertex> items;
    private static OrientVertex root;

    @BeforeClass
    public static void setupDatabase() {
        setupTypesAndIndices(factory);
        root = createRoot(factory);
        items = createData(root, factory, nDocuments);
    }

    private static void setupTypesAndIndices(OrientGraphFactory factory2) {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            OCommandSQL cmd = new OCommandSQL();
            cmd.setText("alter database custom useLightweightEdges=false");
            g.command(cmd).execute();
            cmd.setText("alter database custom useVertexFieldsForEdgeLabels=false");
            g.command(cmd).execute();
            OrientEdgeType e = g.getEdgeType("E");
            e.createProperty("in", OType.LINK);
            e.createProperty("out", OType.LINK);
            OrientVertexType v = g.createVertexType("root", "V");
            v.createProperty("name", OType.STRING);
            v = g.createVertexType("item", "V");
            v.createProperty("name", OType.STRING);
            cmd.setText("create index edge.HAS_ITEM on E (out,in) unique");
            g.command(cmd).execute();
        } finally {
            g.shutdown();
        }
    }

    private static List<OrientVertex> createData(OrientVertex root, OrientGraphFactory factory, int count) {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            System.out.println("Creating {" + count + "} items.");
            List<OrientVertex> items = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                OrientVertex item = g.addVertex("class:item");
                item.setProperty("name", "item_" + i);
                items.add(item);
                root.addEdge("HAS_ITEM", item, "class:E");
            }
            return items;
        } finally {
            g.shutdown();
        }
    }

    private static OrientVertex createRoot(OrientGraphFactory factory) {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            OrientVertex root = g.addVertex("class:root");
            root.setProperty("name", "root vertex");
            return root;
        } finally {
            g.shutdown();
        }
    }

    @Test
    public void testEdgeIndexViaRootGetEdgesWithoutTarget() throws Exception {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            long start = System.currentTimeMillis();
            for (int i = 0; i < nChecks; i++) {
                OrientVertex randomDocument = items.get((int) (Math.random() * items.size()));
                Iterable<Edge> edges = root.getEdges(Direction.OUT, "HAS_ITEM");
                boolean found = false;
                for (Edge edge : edges) {
                    if (edge.getVertex(Direction.IN).equals(randomDocument)) {
                        found = true;
                        break;
                    }
                }
                assertTrue(found);
            }
            long dur = System.currentTimeMillis() - start;
            System.out.println("[root.getEdges - iterating] Duration: " + dur);
            System.out.println("[root.getEdges - iterating] Duration per lookup: " + ((double) dur / (double) nChecks));
        } finally {
            g.shutdown();
        }
    }

    @Test
    public void testEdgeIndexViaRootGetEdges() throws Exception {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            long start = System.currentTimeMillis();
            for (int i = 0; i < nChecks; i++) {
                OrientVertex randomDocument = items.get((int) (Math.random() * items.size()));
                Iterable<Edge> edges = root.getEdges(randomDocument, Direction.OUT, "HAS_ITEM");
                assertTrue(edges.iterator().hasNext());
            }
            long dur = System.currentTimeMillis() - start;
            System.out.println("[root.getEdges] Duration: " + dur);
            System.out.println("[root.getEdges] Duration per lookup: " + ((double) dur / (double) nChecks));
        } finally {
            g.shutdown();
        }
    }

    @Test
    public void testEdgeIndexViaGraphGetEdges() throws Exception {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            long start = System.currentTimeMillis();
            for (int i = 0; i < nChecks; i++) {
                OrientVertex randomDocument = items.get((int) (Math.random() * items.size()));
                Iterable<Edge> edges = g.getEdges("edge.has_item", new OCompositeKey(root.getId(), randomDocument.getId()));
                assertTrue(edges.iterator().hasNext());
            }
            long dur = System.currentTimeMillis() - start;
            System.out.println("[graph.getEdges] Duration: " + dur);
            System.out.println("[graph.getEdges] Duration per lookup: " + ((double) dur / (double) nChecks));
        } finally {
            g.shutdown();
        }
    }

    @Test
    public void testEdgeIndexViaQuery() throws Exception {
        OrientGraphNoTx g = factory.getNoTx();
        try {
            System.out.println("Checking edge");
            long start = System.currentTimeMillis();
            for (int i = 0; i < nChecks; i++) {
                OrientVertex randomDocument = items.get((int) (Math.random() * items.size()));
                OCommandSQL cmd = new OCommandSQL("select from index:edge.has_item where key=?");
                OCompositeKey key = new OCompositeKey(root.getId(), randomDocument.getId());
                assertTrue(((Iterable<Vertex>) g.command(cmd).execute(key)).iterator().hasNext());
            }
            long dur = System.currentTimeMillis() - start;
            System.out.println("[query] Duration: " + dur);
            System.out.println("[query] Duration per lookup: " + ((double) dur / (double) nChecks));
        } finally {
            g.shutdown();
        }
    }
} |
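The tests above time each strategy with System.currentTimeMillis() and start measuring on the very first lookup, so JIT compilation time is included in the per-lookup figure. One possible refinement, sketched below under the assumption that a few untimed warm-up iterations are acceptable, is a small helper that runs the lookup some number of times before timing it with System.nanoTime(). The class name `LookupTimer`, the method `averageNanos`, and the hash-set lookup in `main` are all hypothetical and stand in for the real graph calls.

```java
import java.util.HashSet;
import java.util.Set;

public class LookupTimer {

    /**
     * Runs the lookup `warmup` times untimed, then `checks` times timed,
     * and returns the average duration of one timed run in nanoseconds.
     */
    static double averageNanos(Runnable lookup, int warmup, int checks) {
        for (int i = 0; i < warmup; i++) {
            lookup.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < checks; i++) {
            lookup.run();
        }
        return (double) (System.nanoTime() - start) / checks;
    }

    public static void main(String[] args) {
        // Stand-in workload: membership checks against an in-memory set.
        Set<String> ids = new HashSet<>();
        for (int i = 0; i < 14000; i++) {
            ids.add("item_" + i);
        }
        double avg = averageNanos(() -> ids.contains("item_13999"), 1000, 4000);
        System.out.println("Average nanoseconds per lookup: " + avg);
    }
}
```

In the test class, each loop body would be passed as the `lookup` argument. This is still a rough measurement; a dedicated benchmarking harness would give more trustworthy numbers.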
I have built a model application that can be used to measure the performance of OrientDB when ingesting data that is similar to our actual application, but which is openly shareable, and much smaller than the actual application and datasets. You can find the model application on GitHub:
https://github.com/wcraigtrader/ogp
My analysis leads me to believe that out-of-the-box, OrientDB is not doing any type of indexing for edges (light or heavy). Our application data graph is prone to creating super-nodes (e.g.: nodes with hundreds or thousands of edges), and our performance is suffering when trying to ingest these super-nodes.
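A back-of-the-envelope model shows why super-nodes hurt so much during ingest. Assume (this is an assumption of the sketch, not a measured property of OrientDB) that every new edge is preceded by a lookup that scans all edges already attached to the node and finds no match. Adding n edges to one node then visits 0 + 1 + ... + (n - 1) = n(n - 1)/2 edges in total, so the cost grows quadratically with the node's degree. The class and method names below are hypothetical.

```java
public class SupernodeIngestCost {

    /**
     * Total edges visited when each of n inserts first scans every edge
     * already on the node (worst case: the edge is never found).
     */
    static long scanVisits(int n) {
        long visits = 0;
        for (int existing = 0; existing < n; existing++) {
            visits += existing;
        }
        return visits;
    }

    public static void main(String[] args) {
        int[] degrees = {100, 1000, 10000};
        for (int n : degrees) {
            System.out.println(n + " edges -> " + scanVisits(n) + " edge visits");
        }
    }
}
```

With an index that answers each lookup without scanning, the same ingest needs only n lookups, which is the behaviour the heavy-edge indexing approach aims for.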
I believe that by forcing our application to use heavy edges, and by correctly indexing the edges, we should be able to resolve this problem, but my efforts here have not been successful, so I'm appealing to you and your engineers for assistance.