For each unique vertex and edge alias in a motif, GraphFrames currently wraps the entire vertex and edge dataframes, respectively, in aliased structs.
Spark's query planner treats a struct column as a single unit, so every field inside it is carried along. As a result, large column sets can be shuffled even when the columns the query finally selects do not need them. In some cases, dataframes are also shuffled multiple times when once would suffice.
Consider replacing the use of `struct` with dataframe aliases instead (e.g., `gf.vertices.alias(myFirstVertexAlias)`). This will allow the Spark query planner to push column selections down toward the source dataframes and reuse shuffles where possible.
The struct wrapping is almost certainly the cause of #322.