Cypher for Apache Spark brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.

CAPS: Cypher for Apache Spark

CAPS extends Apache Spark™ with Cypher, the industry's most widely used property graph query language defined and maintained by the openCypher project. It allows for the integration of many data sources and supports multiple graph querying. It enables you to use your Spark cluster to run analytical graph queries. Queries can also return graphs to create processing pipelines.

Intended audience

CAPS allows you to develop complex processing pipelines orchestrated by a powerful and expressive high-level language. In addition to developers and big data integration specialists, CAPS is also of practical use to data scientists, offering tools allowing for disparate data sources to be integrated into a single graph. From this graph, queries can extract subgraphs of interest into new result graphs, which can be conveniently exported for further processing.

CAPS builds on the Spark SQL DataFrame API, which gives it integration with standard Spark SQL processing, and it also allows integration with GraphX. To learn more about this, please see our examples.

Current status: Pre-release

The functionality and APIs are stabilizing, but surface changes (e.g. to the Cypher syntax and semantics for multiple graph processing and graph projections/construction) are still likely to occur. We invite you to try out the project, and we welcome feedback and contributions.

A 1.0 release of the project with a stable feature set and API/language surface is targeted for Q4 of 2018.

We expect continuing development of the project after 1.0. If you are interested in contributing to the project we would love to hear from you; email us at opencypher@neo4j.org or just raise a PR. Please note that this is an openCypher project and contributions can only be accepted if you’ve agreed to the openCypher Contributors Agreement (oCCA).

CAPS Features

CAPS is built on top of the Spark DataFrame API and uses features such as the Catalyst optimizer. The Spark representations are accessible and can be converted to representations that integrate with other Spark libraries.

CAPS supports a subset of Cypher and is the first implementation of multiple graphs and graph query compositionality.

CAPS currently supports importing graphs from Neo4j and from a custom CSV format stored in HDFS or on the local file system. CAPS has a data source API that allows you to plug in custom data importers for external graphs.

CAPS Roadmap

CAPS is under rapid development and we are planning to offer support for:

  • a large subset of the Cypher language
  • new Cypher Multiple Graph features
  • integration with Spark SQL
  • injection of custom graph data sources

Get started with CAPS

CAPS is currently easiest to use with Scala. Below we explain how you can import a simple graph and run a Cypher query on it.

Building CAPS

CAPS is built using Maven:

mvn clean install

Add the CAPS dependency to your project

To use CAPS, add the following dependency:

Maven:

<dependency>
  <groupId>org.opencypher</groupId>
  <artifactId>spark-cypher</artifactId>
  <version>0.1.6</version>
</dependency>

sbt:

libraryDependencies += "org.opencypher" % "spark-cypher" % "0.1.6"

Remember to add fork in run := true to your build.sbt for Scala projects. This is not CAPS-specific; it is a quirk of Spark execution, and forking the run helps prevent problems.
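As a rough illustration, the sbt dependency and the fork setting described above might sit together in a build.sbt like the following. This is only a sketch: the project name is hypothetical, and the Scala version shown is an assumption that you should replace with whichever version matches your Spark and CAPS build.

```scala
// build.sbt (minimal sketch, not an official CAPS build definition)

name := "caps-hello"       // hypothetical project name
scalaVersion := "2.11.12"  // assumption: use the Scala version your Spark/CAPS build targets

// The CAPS dependency, exactly as given above
libraryDependencies += "org.opencypher" % "spark-cypher" % "0.1.6"

// Run the application in a forked JVM rather than inside the sbt process
fork in run := true
```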

Generating API documentation

mvn scala:doc

Documentation will be generated and placed under [MODULE_DIRECTORY]/target/site/scaladocs/index.html

Hello CAPS

Cypher is based on the property graph data model, comprising labelled nodes and typed relationships, with a relationship either connecting two nodes, or forming a self-loop on a single node. Both nodes and relationships are uniquely identified by an ID (in CAPS this is of type Long), and contain a set of properties.
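To make the data model concrete, here is a minimal plain-Scala sketch of it. It does not use the CAPS API: the names GraphNode, GraphRelationship and PropertyGraphModelSketch are hypothetical and exist only for illustration, and the FOLLOWS relationship is invented to show a self-loop.

```scala
// Hypothetical, CAPS-independent model of a property graph.
// Nodes carry labels and properties; relationships carry a type and properties.
final case class GraphNode(id: Long, labels: Set[String], properties: Map[String, Any])

final case class GraphRelationship(
  id: Long,
  source: Long,   // ID of the start node
  target: Long,   // ID of the end node
  relType: String,
  properties: Map[String, Any]
) {
  // A relationship whose start and end node are the same is a self-loop.
  def isSelfLoop: Boolean = source == target
}

object PropertyGraphModelSketch {
  val alice = GraphNode(0L, Set("Person"), Map("name" -> "Alice"))
  val bob = GraphNode(1L, Set("Person"), Map("name" -> "Bob"))

  // A relationship connecting two distinct nodes
  val friendOf = GraphRelationship(0L, alice.id, bob.id, "FRIEND_OF", Map("since" -> "23/01/1987"))

  // A self-loop on a single node (illustrative data only)
  val follows = GraphRelationship(1L, alice.id, alice.id, "FOLLOWS", Map.empty)

  def main(args: Array[String]): Unit = {
    println(s"friendOf is a self-loop: ${friendOf.isSelfLoop}")
    println(s"follows is a self-loop: ${follows.isSelfLoop}")
  }
}
```

In CAPS itself, nodes and relationships are declared by extending the Node and Relationship traits, as the example below shows.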

The following example shows how to convert a social network represented as Scala case classes to a PropertyGraph representation. The PropertyGraph representation is internally transformed into Spark data frames. If you have existing data frames which you would like to treat as a graph, have a look at our DataFrameInputExample.

Once the property graph is constructed, it supports Cypher queries via its cypher method.

import org.opencypher.spark.api.CAPSSession
import org.opencypher.spark.api.io.{Node, Relationship, RelationshipType}
import org.opencypher.spark.util.ConsoleApp

/**
  * Demonstrates basic usage of the CAPS API by loading an example network via Scala case classes and running a Cypher
  * query on it.
  */
object CaseClassExample extends ConsoleApp {

  // 1) Create CAPS session
  implicit val session: CAPSSession = CAPSSession.local()

  // 2) Load social network data via case class instances
  val socialNetwork = session.readFrom(SocialNetworkData.persons, SocialNetworkData.friendships)

  // 3) Query graph with Cypher
  val results = socialNetwork.cypher(
    """|MATCH (a:Person)-[r:FRIEND_OF]->(b)
       |RETURN a.name, b.name, r.since
       |ORDER BY a.name""".stripMargin
  )

  // 4) Print result to console
  results.show
}

/**
  * Specify schema and data with case classes.
  */
object SocialNetworkData {

  case class Person(id: Long, name: String, age: Int) extends Node

  @RelationshipType("FRIEND_OF")
  case class Friend(id: Long, source: Long, target: Long, since: String) extends Relationship

  val alice = Person(0, "Alice", 10)
  val bob = Person(1, "Bob", 20)
  val carol = Person(2, "Carol", 15)

  val persons = List(alice, bob, carol)
  val friendships = List(Friend(0, alice.id, bob.id, "23/01/1987"), Friend(1, bob.id, carol.id, "12/12/2009"))
}

The above program prints:

╔═════════╤═════════╤══════════════╗
║ a.name  │ b.name  │ r.since      ║
╠═════════╪═════════╪══════════════╣
║ 'Alice' │ 'Bob'   │ '23/01/1987' ║
║ 'Bob'   │ 'Carol' │ '12/12/2009' ║
╚═════════╧═════════╧══════════════╝
(2 rows)
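To see why the query returns these two rows, it can help to evaluate the same pattern by hand over ordinary Scala collections. The sketch below is not how CAPS executes queries (CAPS plans them onto Spark DataFrames). It assumes plain case classes that mirror the example data without extending the CAPS traits, and the names PatternMatchSketch and friendRows are hypothetical.

```scala
object PatternMatchSketch {
  // Plain copies of the example data; these do not extend the CAPS Node/Relationship traits.
  final case class Person(id: Long, name: String, age: Int)
  final case class Friend(id: Long, source: Long, target: Long, since: String)

  val persons = List(Person(0, "Alice", 10), Person(1, "Bob", 20), Person(2, "Carol", 15))
  val friendships = List(Friend(0, 0, 1, "23/01/1987"), Friend(1, 1, 2, "12/12/2009"))

  // Hand-written equivalent of:
  //   MATCH (a:Person)-[r:FRIEND_OF]->(b)
  //   RETURN a.name, b.name, r.since
  //   ORDER BY a.name
  def friendRows: List[(String, String, String)] = {
    val byId = persons.map(p => p.id -> p).toMap
    val rows = for {
      r <- friendships
      a <- byId.get(r.source).toList // start node of the relationship
      b <- byId.get(r.target).toList // end node of the relationship
    } yield (a.name, b.name, r.since)
    rows.sortBy(_._1) // ORDER BY a.name
  }

  def main(args: Array[String]): Unit = friendRows.foreach(println)
}
```

Each FRIEND_OF relationship yields one row, built from its start node, its end node and its since property, which is why Carol never appears in the a.name column: no relationship starts at her node.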

More examples, including multiple graph features, can be found in the examples module.

Loading CSV Data

See the documentation in org.opencypher.spark.impl.io.hdfs.CsvGraphLoader, which specifies how to structure the CSV and the schema mappings that describe the graph structure for the underlying data.

Next steps

How to contribute

We would love to hear about any issues you encounter, and we are happy to accept contributions once you have signed a Contributors License Agreement (CLA), as per the process outlined in our contribution guidelines.

License

The project is licensed under the Apache Software License, Version 2.0, with an extended attribution notice as described in the license header.

Copyright

© Copyright 2016-2018 Neo4j, Inc.

Apache Spark™, Spark, and Apache are registered trademarks of the Apache Software Foundation.