Skip to content

Pixy Tutorial

lambdazen edited this page Sep 16, 2013 · 30 revisions

This page takes you through a series of example queries with a graph representing a portion of the Kennedy family tree. Every vertex in the graph is a person with properties 'name', 'sex', 'nickname' (optional), 'born' and 'died' (optional). The edges of the graph are labeled 'father', 'mother' or 'wife' representing the corresponding relationship.

For example, the vertex with the name "John Fitzgerald Kennedy, Sr." will have

  • an outgoing 'father' edge to the vertex with the name "Joseph Patrick Kennedy, Sr.",
  • an outgoing 'mother' edge to the vertex with the name "Rose Elizabeth Fitzgerald", and
  • an outgoing 'wife' edge from the vertex with the name "Jacqueline Lee Bouvier".

Getting started with the Gremlin shell

  • Install gremlin-groovy-2.4.0 from the Gremlin downloads page
  • (Optional) Move/rename/remove the JARs with SLF4J bindings, viz. slf4j-log4j12-1.7.2.jar, logback-classic-0.9.30.jar, logback-core-0.9.30.jar. This step is to prevent DEBUG messages from being printed on the Gremlin shell.
  • Explode the contents of this zip file to a folder such as C:/temp.
  • You can use one of these method to setup Pixy and Bitsy (for demo database):
    • Copy JARs: Explode the contents of this zip file to the lib/ folder in your Gremlin installation.
    • OR import using Grape: Run these commands in the shell (gremlin.bat or gremlin.sh from gremlin-groovy-2.4.0/bin).
Grape.addResolver([name: 'sonatype-snapshots', root: 'https://oss.sonatype.org/content/repositories/snapshots', m2Compatible: 'true'])
Grape.grab([group:'com.lambdazen.bitsy', module:'bitsy', version:'1.2-SNAPSHOT'])
Grape.grab([group:'com.lambdazen.pixy',module:'pixy',version:'1.0-SNAPSHOT'])

Setting up the environment

Once you start the Gremlin shell, you should see something like this:

C:\gremlin-groovy-2.4.0\bin>gremlin.bat

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin>

The first couple of commands will import classes from Pixy and Bitsy.

gremlin> import com.lambdazen.bitsy.BitsyGraph
...
gremlin> import com.lambdazen.pixy.*
...

Now, you can load the Bitsy database with the Kennedy family tree.

gremlin> g = new BitsyGraph(java.nio.file.Paths.get('C:/temp/kennedy'))
==>bitsygraph[C:\temp\kennedy]

That's it for the database. To set up Pixy, you run the following command:

gremlin> GremlinGroovyPipeline.class.metaClass.pixy { PixyTheory pt, String query, Object... args -> pt.makePipe(query, args).pixyStep(delegate) }
==>null

The above code adds a new step called 'pixy' to the standard Gremlin pipeline. This is the only point of integration between Pixy and Gremlin. In the future, this will be made simpler using the new Gremlin.use() feature.

At this point, we can start writing rules and queries. The reader will find it easier to follow the tutorial if he/she understands Gremlin and a bit of PROLOG. Advanced PROLOG features such as recursion and list comprehension aren't supported in Pixy -- so it might be easier to start with this tutorial and refer to PROLOG documentation when you have questions. The idea behind Pixy is to support a subset of PROLOG, while faithfully following its syntax and semantics for that subset of features. Pixy uses Igor Maznitza's excellent Java Prolog Parser to parse its rules.

Hello father, hello child

You can start with the first PROLOG rule in Pixy which represents the father relationship. Note that we are following PROLOG syntax which means that all variables must start with a capitalized letter and all relations/predicates must start with a lower-case letter. Dashes are not supported in relation names.

gremlin> pt = new PixyTheory('''father(Child, Father) :- out(Child, 'father', Father).''')

The above rule means that father(Child, Father) will be true if out(Child, 'father', Father) is true. We refer to this as out/3 to capture the arity of the relationship, i.e., the number of parameters, which is 3. out/3 is a predefined rule that corresponds to the Gremlin step 'out' and looks for a link with the label 'father' from the Child to the Father.

A PixyTheory can contain any number of rules. These rules form the domain model for the graph database. In this case, we are defining a simple domain model with the rule 'father'. As we go down the section, we will layer on more and more rules to define complex relationships.

Now, we can run our first query to find out "Who is the father of JFK Sr?".

gremlin> g.V('name', 'John Fitzgerald Kennedy, Sr.').pixy(pt, 'father($, &)').name
==>Joseph Patrick Kennedy, Sr.

The first part of the gremlin expression g.V('name', '...') creates a pipe that outputs the vertex for JFK Sr. The Pixy query is the second step named 'pixy' that takes two parameters, viz. the PixyTheory that we defined above, and a query 'father($, &)'. The term '$' refers to the input of the pipe and the term '&' refers to the output of the pipe. The query 'father($, &)' means "match the first parameter of the father relationship to the input and send the second parameter as the output". In other words, "find out who is the father". The last part of the query gets the name of the father, which is Joseph Patrick Kennedy, Sr.

The queries in Pixy can work in reverse as well. To find out the children of JFK Sr, you can run the following query:

gremlin> g.V('name', 'John Fitzgerald Kennedy, Sr.').pixy(pt, 'father(&, $)').name
==>John Fitzgerald Kennedy, Jr.
==>Caroline Bouvier Kennedy

The only thing that changed is the input and the output. In this query we are saying "match the input to the second parameter", which is the father, and send the second parameter as the output. This returns JFK Sr.'s children defined in the database as JFK Jr. and Caroline.

The execution plan

You can look at the execution plan for each query using the toString() method on the pipe. Here's the execution plan for the first 'Hello father' query.

gremlin> g.V('name', 'John Fitzgerald Kennedy, Sr.').pixy(pt, 'father($, &)').name.toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), IdentityPipe, 
    VertexQueryPipe(out,[father],vertex), PropertyPipe(name)]

As you can see the final pipeline is made up of the Gremlin and Pixy portions. The Pixy portion includes the "VertexQueryPipe(out,[father],vertex)" which looks for the out edge named 'father' from the vertex representing JFK Sr.

What about the 'Hello son' query? Its execution plan looks like this:

gremlin> g.V('name', 'John Fitzgerald Kennedy, Sr.').pixy(pt, 'father(&, $)').name.toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), IdentityPipe, 
    VertexQueryPipe(in,[father],vertex), PropertyPipe(name)]

As you can see, Pixy automatically converted the query to an 'in' step in Gremlin because the Father is 'bound' but Child is not bound in father(Child, Father). Throughout the tutorial, we will keep peeking into the execution plans generated by Pixy.

Extending the domain model

The rules in the Pixy Theory, aka domain model, define the general relationships between various entities (vertices, edges, properties, labels, etc). A rich domain model makes it easy to write queries. In that spirit, we are going to extend the domain model with a few more simple one-step rules:

gremlin> pt = pt.extend(''' \
      mother(Child, Mother) :- out(Child, 'mother', Mother). \
      wife(Husband, Wife) :- out(Husband, 'wife', Wife). \
      husband(Wife, Husband) :- wife(Husband, Wife). \
     ''')

The method PixyTheory.extend creates another theory with an additional set of rules. You can use this to extend your main domain model, or create domain models with rules that are specific to a particular query.

In the above case, we are adding 3 rules to the domain model 'pt' for mother, wife and husband. As you can see, the rule for husband depends on the rule for wife.

Now, we can ask "Who is Jacqueline Lee Bouvier's husband?"

gremlin> g.V('name', 'Jacqueline Lee Bouvier').pixy(pt, 'husband($, &)').name
==>John Fitzgerald Kennedy, Sr.

Now, you can check the execution plan for the above query.

gremlin> g.V('name', 'Jacqueline Lee Bouvier').pixy(pt, 'husband($, &)').name.toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), IdentityPipe, 
   VertexQueryPipe(in,[wife],vertex), PropertyPipe(name)]

As you can see, the indirect definition of husband/2 based on wife/2 has not affected the complexity of the pipe. The Pixy portion of the pipe is just 'VertexQueryPipe(in,[wife],vertex)'.

Modeling OR conditions for spouse and parent

We can now define spouse/2 and parent/2 as follows:

gremlin> pt = pt.extend(''' \
      spouse(Spouse1, Spouse2) :- wife(Spouse1, Spouse2). \
      spouse(Spouse1, Spouse2) :- husband(Spouse1, Spouse2). \
      parent(Child, Parent) :- father(Child, Parent). \
      parent(Child, Parent) :- mother(Child, Parent). \
    ''')

Horn clauses model OR conditions using separate rules for each OR clause. This means that spouse(X, Y) is true if wife(X, Y) is true, or husband(X, Y) is true.

Let's take this for a run: Who are JFK's parents?

gremlin> g.V('name', 'John Fitzgerald Kennedy, Sr.').pixy(pt, 'parent($, &)').name()
==>Joseph Patrick Kennedy, Sr.
==>Rose Elizabeth Fitzgerald

How does that query look?

gremlin> g.V('name', 'John Fitzgerald Kennedy, Sr.').pixy(pt, 'parent($, &)').toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), IdentityPipe, 
    PixySplitMergePipe([[IdentityPipe, AsPipe(__pixy_output_1,VertexQueryPipe(out,[father],vertex))],
      [IdentityPipe, AsPipe(__pixy_output_7,VertexQueryPipe(out,[mother],vertex))]]),
    CoalescePipe(__pixy_output_7,__pixy_output_1), PropertyPipe(name)]

Not so easy. This is the first query involving custom pipes in Pixy. The PixySplitMergePipe is responsible for evaluating the pipes representing each Horn clauses with the same input. The CoalescePipe is responsible for extracting and emitting the outputs of each pipe to the output of the main pipe. In the above plan, the vertices matching JFK are fed into the two pipes corresponding to the lookup of the father and the mother. The output of these pipes are extracted and emitted by the CoalescePipe.

Modeling AND conditions for grandparent

Now, we can combine the above OR conditions with AND conditions by defining a grandparent rule.

gremlin> pt = pt.extend('grandparent(Child, GrandParent) :- parent(Child, Father), parent(Father, GrandParent).')

Now, we can ask "Who are the grandparents of JFK Jr.?"

gremlin> g.V('name', 'John Fitzgerald Kennedy, Jr.').pixy(pt, 'grandparent($, &)').name.order()
==>Janet Lee Bouvier
==>John Vernou Bouvier III
==>Joseph Patrick Kennedy, Sr.
==>Rose Elizabeth Fitzgerald

We are ordering the names to avoid confusion. As you can see, we have all the 4 grandparents of JFK Jr. Let's look at the execution plan:

gremlin> g.V('name', 'John Fitzgerald Kennedy, Jr.').pixy(pt, 'grandparent($, &)').name.toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), IdentityPipe, 
    PixySplitMergePipe([[IdentityPipe, AsPipe(__pixy_Father_5,VertexQueryPipe(out,[father],vertex))],
       [IdentityPipe, AsPipe(__pixy_Father_11,VertexQueryPipe(out,[mother],vertex))]]), 
       AsPipe(__pixy_Father_4,CoalescePipe(__pixy_Father_11,__pixy_Father_5)), 
    PixySplitMergePipe([[IdentityPipe, AsPipe(__pixy_output_17,VertexQueryPipe(out,[father],vertex))], 
       [IdentityPipe, CoalescePipe(__pixy_Father_4), AsPipe(__pixy_output_23,VertexQueryPipe(out,mother],vertex))]]), 
    CoalescePipe(__pixy_output_23,__pixy_output_17), PropertyPipe(name)]

Pixy doesn't care how long the expression gets -- it just grinds through. In many ways, the underlying engine (Pipes) is similar in that 4 objects vs. 40 objects makes no difference. The number of elements visited by these pipes is the crucial factor in efficiency.

For simplicity however, we will re-define the spouse and parent relationships use a simpler execution plan:

gremlin> pt = pt.remove('spouse/2').extend('''spouse(Spouse1, Spouse2) :- both(Spouse1, 'wife', Spouse2).''')

The above statement removes the rule for spouse/2 and replaces it with a simpler rule that uses the both relationship, similar to the Gremlin step 'both'.

gremlin> pt = pt.remove('parent/2').extend('''parent(Child, Parent) :- out(Child, ['father', 'mother'], Parent).''')

The above statement passes a list of strings ['father', 'mother'] for the labels parameter which uses the appropriate list for the Gremlin in/out step.

With these changes, let's take a look at the execution plan for JFK Jr's grandparents:

gremlin> g.V('name', 'John Fitzgerald Kennedy, Jr.').pixy(pt, 'grandparent($, &)').name.toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), IdentityPipe, 
    VertexQueryPipe(out,[father, mother],vertex), VertexQueryPipe(out,[father, mother],vertex), 
    PropertyPipe(name)]

Much simpler. The Pixy portion of the pipeline is now just two out steps involving lookups of father/mother edges.

Let's do a quick check on the new grandparent by finding out the grandchildren of "Rose Elizabeth Fitzgerald":

gremlin> g.V('name', 'Rose Elizabeth Fitzgerald').pixy(pt, 'grandparent(&, $)').name.order()
==>Anthony Paul Kennedy Shriver
==>Caroline Bouvier Kennedy
...
==>William Kennedy Smith

The above query uses the same grandparent relation, but reverses the input and output. As you can see, the output includes the 16 grandchildren of Rose available in the database.

Selecting more than one output

So far, we have only seen queries that produce one output in '&'. With Pixy, it is possible to select multiple outputs using &[step], where the [step] is a named step in Gremlin.

Here is an example of a grandparent/3 relationship that maps the child to the parent to the grandparent.

gremlin> pt = pt.extend('grandparent(Child, Parent, GrandParent) :- \
                             parent(Child, Parent), parent(Parent, GrandParent).')

You can use the same relation name, i.e., grandparent, because the arity is different from grandfather/2.

Now, let's run a query to find the grand parents of JFK Jr., but this time we also get the name of the corresponding parent.

gremlin> g.V('name', 'John Fitzgerald Kennedy, Jr.').pixy(pt, 'grandparent($, &p, &gp)') \
         .select(['p', 'gp'], {it.name}, {it.name})
==>[p:John Fitzgerald Kennedy, Sr., gp:Joseph Patrick Kennedy, Sr.]
==>[p:John Fitzgerald Kennedy, Sr., gp:Rose Elizabeth Fitzgerald]
==>[p:Jacqueline Lee Bouvier, gp:John Vernou Bouvier III]
==>[p:Jacqueline Lee Bouvier, gp:Janet Lee Bouvier]

The query now reports both the parent and grandparent while looking up the grandparent of JFK Jr. The select, and the closures provided to the select do the job of emitting rows with columns 'p' and 'gp'.

Now, we can run the same query this time using JFK Sr. as the Parent, and look for matches for Child and GrandParent. Here's the query and its results:

gremlin> g.V('name', 'John Fitzgerald Kennedy, Sr.').pixy(pt, 'grandparent(&gc, $, &gp)') \
         .select(['gc', 'gp'], {it.name}, {it.name})
==>[gc:John Fitzgerald Kennedy, Jr., gp:Joseph Patrick Kennedy, Sr.]
==>[gc:John Fitzgerald Kennedy, Jr., gp:Rose Elizabeth Fitzgerald]
==>[gc:Caroline Bouvier Kennedy, gp:Joseph Patrick Kennedy, Sr.]
==>[gc:Caroline Bouvier Kennedy, gp:Rose Elizabeth Fitzgerald]

This query finds all grandchildren/grandparent combinations that flow through JFK Sr. Here's the pipeline for this query

gremlin> g.V('name', 'John Fitzgerald Kennedy, Sr.').pixy(pt, 'grandparent(&gc, $, &gp)') \
         .select(['gc', 'gp'], {it.name}, {it.name}).toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), AsPipe(__pixy_input,IdentityPipe), 
    AsPipe(gc,VertexQueryPipe(in,[father, mother],vertex)), 
    CoalescePipe(__pixy_input), AsPipe(gp,VertexQueryPipe(out,[father, mother],vertex)), 
    SelectPipe(gc,gp)]

The AsPipe step for gc and gp captures the values of the vertices for grandparent and grandchild.

Hello son, hello daughter

So far, we have only looked at edge relationships between vertices. We will now do some matches on the properties of vertices by defining son/2 and daughter/2 relationships.

gremlin> pt = pt.extend('''daughter(Parent, Child) :- parent(Child, Parent), property(Child, 'sex', 'female'). \
    son(Parent, Child) :- parent(Child, Parent), property(Child, 'sex', 'male').''')

The property/3 rule is a predefined rule that takes an element (vertex/edge) as the first parameter, looks up the key specified in the second parameter, and matches/loads it into the third parameter. In the above queries, the value of the property is directly matched against 'male' or 'female'. This process is called 'unification' in PROLOG speak.

Now, if we ask who are the sons of Teddy Kennedy:

gremlin> g.V('name', 'Edward Moore Kennedy, Sr.').pixy(pt, 'son($, &)').name.order()
==>Edward Moore Kennedy, Jr.
==>Patrick Joseph Kennedy II

The reverse query is possible too -- to whom is Teddy Kennedy a son:

gremlin> g.V('name', 'Edward Moore Kennedy, Sr.').pixy(pt, 'son(&, $)').name.order()
==>Joseph Patrick Kennedy, Sr.
==>Rose Elizabeth Fitzgerald

You can see that both his parents are found in the output of the query. Now if we ask, "to whom is Teddy Kennedy a daughter", we get:

gremlin> g.V('name', 'Edward Moore Kennedy, Sr.').pixy(pt, 'daughter(&, $)').name.order()
gremlin> 

The answer is no one, because Teddy is a man. We can now take a look at the execution plan for the above query to see how it works:

gremlin> g.V('name', 'Edward Moore Kennedy, Sr.').pixy(pt, 'daughter(&, $)').name.toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), AsPipe(__pixy_input,IdentityPipe), 
    AsPipe(__pixy_output,VertexQueryPipe(out,[father, mother],vertex)), 
    CoalescePipe(__pixy_input), PropertyPipe(sex), FilterFunctionPipe, 
    PixyEvalPipe($,'female',ifeq/2), CoalescePipe(__pixy_output), PropertyPipe(name)]

This Pixy part of this expression works by first marking the input as __pixy_input, then getting the parents and marking them as __pixy_output, loading the input again, looking up its property 'sex', filtering it against nulls, checking for equality 'female' (using a custom pipe called PixyEvalPipe), and finally emitting the output.

Sequencing in Pixy

Pixy finds the first sequence of operations that works and uses it to create the expression. So sometimes the expression may be longer than required. You can get a better match by playing with the order of the clauses. In this case, moving the property() before the parent() will do the trick. For this level of tuning, you need to define one relation per direction (forward/reverse).

In the above definition, the path from parent to son has fewer steps.

gremlin> g.V('name', 'Edward Moore Kennedy, Sr.').pixy(pt, 'son($, &)').name.toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), IdentityPipe, 
    AsPipe(__pixy_output,VertexQueryPipe(in,[father, mother],vertex)), PropertyPipe(sex), 
    FilterFunctionPipe, PixyEvalPipe($,'male',ifeq/2), CoalescePipe(__pixy_output), 
    PropertyPipe(name)]

As mentioned earlier, it is doubtful if the length of the expression will have any noticeable effect on performance.

Hello brother

We will now add three new rules to the domain model:

gremlin> pt = pt.extend(''' \
      sibling(Person1, Person2) :- father(Person1, Parent), father(Person2, Parent), Person1 \\\\= Person2.  \
      brother(Person1, Person2) :- sibling(Person1, Person2), property(Person2, 'sex', 'male'). \
      sister(Person1, Person2) :- sibling(Person1, Person2), property(Person2, 'sex', 'female'). \
    ''')

The first rule defines sibling as someone who shares a father, but isn't the same person. The operator = is a PROLOG operator for not-equals. The next two rules refine sibling as brother and sister.

Now, let's ask who are JFK Sr's brothers:

gremlin> g.V('name', 'John Fitzgerald Kennedy, Sr.').pixy(pt, 'brother($, &)').name.order()
==>Edward Moore Kennedy, Sr.
==>Joseph Patrick Kennedy, Jr.
==>Robert Francis Kennedy, Sr.

We can further refine brother as youngerBrother, and find out JFK Sr's younger brothers:

gremlin> pt = pt.extend('''youngerBrother(X, Y) :- brother(X, Y), property(X, 'born', BX), property(Y, 'born', BY), BY > BX.''')
...
gremlin> g.V('name', 'John Fitzgerald Kennedy, Sr.').pixy(pt, 'youngerBrother($, &)').name.order()
==>Edward Moore Kennedy, Sr.
==>Robert Francis Kennedy, Sr.

This query returns the vertices for the two younger brothers (Teddy and Bobby). Now, let's take a peek at the execution plan for this:

gremlin> g.V('name', 'John Fitzgerald Kennedy, Sr.').pixy(pt, 'youngerBrother($, &)').name.toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), AsPipe(__pixy_input,IdentityPi
pe), VertexQueryPipe(out,[father],vertex), AsPipe(__pixy_output,VertexQueryPipe(
in,[father],vertex)), PixyEvalPipe($__pixy_input,$__pixy_output,\=/2), PixyEvalP
ipe($,true,ifeq/2), CoalescePipe(__pixy_output), PropertyPipe(sex), FilterFuncti
onPipe, PixyEvalPipe($,'male',ifeq/2), CoalescePipe(__pixy_input), PropertyPipe(
born), AsPipe(__pixy_BX_4,FilterFunctionPipe), CoalescePipe(__pixy_output), Prop
ertyPipe(born), AsPipe(__pixy_BY_5,FilterFunctionPipe), PixyEvalPipe($__pixy_BY_
5,$__pixy_BX_4,>/2), PixyEvalPipe($,true,ifeq/2), CoalescePipe(__pixy_output), P
ropertyPipe(name)]

I haven't bothered to indent the output like before, but you can get the gist of what's happening.

Properties and defaults

The property/3 relationship will fail to find matches if the given property key is not defined in the vertex or edge. There is a property/4 relationship which takes a default value in case the property is missing. Here is an example:

gremlin> pt = pt.extend('''nickname(Person, Nickname) :- property(Person, 'nickname', Nickname, '').''')
...
gremlin> g.V().as('p').pixy(pt, 'nickname($, &n)').select(['p', 'n'], {it.name}, {it})
==>[p:Edwin Arthur Schlossberg, n:]
==>[p:Joseph Patrick Kennedy, Jr., n:Joe Jr.]
==>[p:William John Robert Cavendish, n:Billy]
...

This query will show the names and nicknames of all people in the database. Without the default value in property/4, the property relationship will filter out any vertices that don't have the property 'nickname'.

Note that the input to the above pixy pipe includes all the vertices defined in the graph. It should be noted that any of earlier queries can be executed on any number of input vertices (aka starts). If you mark the input using an 'as' step, you can run a 'select' to pick the input and the output of each query.

Complex AND clauses with aunts, uncles, nieces and nephews

Let's define them:

gremlin> pt = pt.extend(''' \
      aunt(Person, Aunt) :- parent(Person, Parent), sister(Parent, Aunt). \
      uncle(Person, Aunt) :- parent(Person, Parent), brother(Parent, Aunt). \
      niece(Person, Niece) :- sibling(Person, Sibling), daughter(Sibling, Niece). \
      nephew(Person, Nephew) :- sibling(Person, Sibling), son(Sibling, Nephew). \
    ''')

Once you understand PROLOG's syntax, these rules are very easy to write because they can use the logic built into other rules. For example, the definition of an aunt relies on the definition of sister, which relies on the definition of sibling. The semantics of these relationships are clear because of the domain model, which in this case is a model of family relations.

To make it interesting, we will create a temporary PixyTheory (ptTemp) with a relation deadNephew defined as:

gremlin> ptTemp = pt.extend('''deadNephew(Person, Nephew) :- nephew(Person, Nephew), property(Nephew, 'died', Year).''')

Now, we can find the dead nephews of Teddy Kennedy:

gremlin> g.V('name', 'Edward Moore Kennedy, Sr.').pixy(ptTemp, 'deadNephew($, &)').name.order()
==>John Fitzgerald Kennedy, Jr.

The answer is John Fitzgerald Kennedy, Jr. who lost his life in a plane crash.

Complex OR clauses with in-laws

We initially modeled spouse and parent using OR clauses, and later changed them to use the both relationship and multiple-labels (respectively). In this section, we will look at a case where multiple Horn clauses can not be avoided.

gremlin> pt = pt.extend(''' \
      brotherInLaw(Person, BIL) :- spouse(Person, Spouse), brother(Spouse, BIL). \
      brotherInLaw(Person, BIL) :- sibling(Person, Sister), husband(Sister, BIL). \
      sisterInLaw(Person, SIL) :- spouse(Person, Spouse), sister(Spouse, SIL). \
      sisterInLaw(Person, SIL) :- sibling(Person, Sister), wife(Sister, SIL). \
    ''')

As you can see each rule requires two Horn clauses that are OR-ed. The pattern matching for brotherInLaw finds a path either through the spouse or through the sibling. The PROLOG-style of encoding OR-conditions is more natural than SQL-like languages that rely on UNIONs and sub-queries. Furthermore, a rule such as brotherInLaw can use the rules in the existing theory which consists of spouse, brother, sibling and husband. For these reasons, a Pixy domain model is easier to maintain than SQL-style queries built from scratch, especially if the system needs to perform complex pattern matching and logical reasoning.

Here is a query that shows all sisterInLaw relationships in the database.

gremlin> g.V().as('person').pixy(pt, 'sisterInLaw($, &sil)').select(['person', 'sil'], {it.name}, {it.name})
...
==>[person:Joseph Patrick Kennedy, Jr., sil:Ethel Skakel Kennedy]
...
==>[person:William John Robert Cavendish, sil:Jean Ann Kennedy]
...

There are 94 such relationships and I have selected two showing the different paths (brother's wife and spouse's sister).

Nots and the living

Pixy supports a pre-defined not/1 relation that can operate on other relations. We will now define some rules using nots:

gremlin> pt = pt.extend(''' \
       childless(Person) :- not(parent(Child, Person)). \
       living(Person) :- not(property(Person, 'died', _)). \
     ''')

The first rule says that a person is childless if no match for Child in parent(Child, Person) can be found. The second rule says that a person is living if there is no property called 'died'. Note that the variable name in the second rule is ''. A variable named '' (underscore) in PROLOG represents a variable whose value can be anything. Each '_' automatically maps to a unique hidden variable.

We can now use a temporary rule-set with a new rule called 'childlessAunt' as follows:

gremlin> ptTemp = pt.extend('clAunt(X, Y) :- aunt(X, Y), childless(Y).')
...
gremlin> g.V('name', 'John Fitzgerald Kennedy, Jr.').pixy(ptTemp, 'clAunt($, &)').name.order()
==>Kathleen Agnes Kennedy
==>Rose Marie Kennedy

Kathleen has several children in real-life, but not in the database. The pipe for the query looks like this:

gremlin> g.V('name', 'John Fitzgerald Kennedy, Jr.').pixy(ptTemp, 'clAunt($, &)').name.toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), 
  IdentityPipe, AsPipe(__pixy_Parent_7,VertexQueryPipe(out,[father, mother],vertex)), 
  VertexQueryPipe(out,[father],vertex), 
  AsPipe(__pixy_output,VertexQueryPipe(in,[father],vertex)), 
  PixyEvalPipe($__pixy_Parent_7,$__pixy_output,\=/2), 
  PixyEvalPipe($,true,ifeq/2), CoalescePipe(__pixy_output), 
  PropertyPipe(sex), FilterFunctionPipe, PixyEvalPipe($,'female',ifeq/2), 
  PixySplitMergePipe(
    [[IdentityPipe, CoalescePipe(__pixy_output), 
      AsPipe(__pixy_Child_26,VertexQueryPipe(in,[father, mother],vertex)), 
      AsPipe(__pixy_cut__32,PixyCutPipe), FilterFunctionPipe], 
     [IdentityPipe]]), 
  CoalescePipe(__pixy_output), PropertyPipe(name)]

The first part of the pipes finds out the siblings, filters them by sex, and then feeds them into a PixySplitMergePipe. The split-merge task consists of two pipes. The first one matches finds out the children of the aunt. Any match passes through the PixyCutPipe which stops further child matches. The last FilterFunctionPipe always fails. If the cut encounters a match, the PixySplitMergePipe doesn't bother with matches from the second pipe. The second pipe always matches the aunt, which at this point is guaranteed to be childless.

Some of the details are lost in the toString() method of the GremlinPipeline. You can get a more readable plan from Pixy using the toString() method on the PixyPipe object. Here's the output for the above query:

gremlin> ptTemp.makePipe('clAunt($, &)').toString()
==>out([father, mother]) -> as('__pixy_Parent_7') 
   -> out([father]) -> in([father]) -> as('__pixy_output') 
   -> pixyEval([$__pixy_Parent_7, $__pixy_output, \=/2])
   -> pixyEval($ ifeq true) -> coalesce('__pixy_output') 
   -> property('sex') -> filter(this != null) -> pixyEval($ ifeq 'female') 
   -> merge([coalesce('__pixy_output') -> in([father, mother]) -> as('__pixy_Child_26') 
             -> pixyCut() -> as('__pixy_cut__32') -> fail, 
           ]) 
   -> coalesce([__pixy_Child_33, __pixy_Child_26]) -> coalesce('__pixy_output')

This is a more readable version of the pipeline, but doesn't cover the Gremlin steps before/after the Pixy-generated part of the pipeline.

Cuts

We looked at cuts in the previous section. Cuts are PROLOG directives, represented using ! (exclamation mark). The semantics of cuts in Pixy are the same as the semantics of cuts in PROLOG. Cuts are used to internally implement not(), but can also be used to improve performance of queries. Since this is an advanced topic, we will skip the discussion of cuts in this tutorial.

Survivors and parameters

The power of the Kennedy family is only matched by the tragedies faced by some of its notable members. In this section, we will develop a query that looks for the survivors of person who lost his/her life. The definition of a survivor is someone who is a son/daughter or son-in-law/daughter-in-law, or grandchild who was living at the time of death. Although the definition seems complex, we can break it down into smaller Pixy rules.

gremlin> pt = pt.extend(''' \
      livingAsOf(Person, Year) :- property(Person, 'born', Year1), Year1 =< Year, livingAsOfSub(Person, Year). \
      livingAsOfSub(Person, Year) :- property(Person, 'died', Year2), Year2 >= Year. \
      livingAsOfSub(Person, Year) :- not(property(Person, 'died', _)). \
    ''')

Here we define a relation livingAsOf/2 which takes a Person and a Year and checks to see if that person was born before the given year and is either living now, or died after the year. The second part of the query is handled by livingAsOfSub/2.

Let's test this query by finding out which Kennedys were living as of 1950.

gremlin> g.V().as('p').pixy(pt, 'livingAsOf($, ?)', 1950) \
        .select(['p'], {it.name + ', born ' + it.born + (it.died ? ', died ' + it.died : '') })
...
==>[p:Hilda Shriver, born 1883, died 1977]
...
==>[p:Arnold Schwarzenegger, born 1947]
...

This query outputs one row per person who was alive in 1950. You can see matches from both the clauses of livingAsOfSub/2 in Hilda Shriver and Arnold Schwarzenegger.

Note that in the above query we are passing 1950 as a parameter. A Pixy query can have any number of question marks which can be replaced with string, numeric or boolean parameters.

With the help of livingAsOf/2, we can define survivor/2 as follows:

gremlin> pt = pt.extend(''' \
      survivor(Person, Survivor) :- property(Person, 'died', Year), survivorSub(Person, Survivor), livingAsOf(Survivor, Year). \
      survivorSub(Person, Survivor) :- spouse(Person, Survivor). \
      survivorSub(Person, Survivor) :- parent(Survivor, Person). \
      survivorSub(Person, Survivor) :- parent(Child, Person), spouse(Child, Survivor). \
      survivorSub(Person, Survivor) :- grandparent(Survivor, Person). \
    ''')

The survivor rule relies on a sub-rule called survivorSub which selects spouses, children, children-in-law and grandchildren. The main survivor rule says that the matching members must be living as of the year of death. Now, if we try this on JFK Sr, we will see:

gremlin> g.V('name', 'John Fitzgerald Kennedy, Sr.').pixy(pt, 'survivor($, &)').name.order()
==>Caroline Bouvier Kennedy
==>Edwin Arthur Schlossberg
==>Jacqueline Lee Bouvier
==>John Fitzgerald Kennedy, Jr.

The query path as you can imagine is a monster (syn. Gremlin, Pixy).

gremlin> g.V('name', 'John Fitzgerald Kennedy, Sr.').pixy(pt, 'survivor($, &)').name.toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), AsPipe(__pixy_input,IdentityPi
pe), PropertyPipe(died), AsPipe(__pixy_Year_4,FilterFunctionPipe), PixySplitMerg
ePipe([[IdentityPipe, CoalescePipe(__pixy_input), AsPipe(__pixy_output_5,VertexQ
ueryPipe(both,[wife],vertex))], [IdentityPipe, CoalescePipe(__pixy_input), AsPip
e(__pixy_output_11,VertexQueryPipe(in,[father, mother],vertex))], [IdentityPipe,
 CoalescePipe(__pixy_input), VertexQueryPipe(in,[father, mother],vertex), AsPipe
(__pixy_output_17,VertexQueryPipe(both,[wife],vertex))], [IdentityPipe, Coalesce
Pipe(__pixy_input), VertexQueryPipe(in,[father, mother],vertex), AsPipe(__pixy_o
utput_27,VertexQueryPipe(in,[father, mother],vertex))]]), AsPipe(__pixy_output,C
oalescePipe(__pixy_output_27,__pixy_output_17,__pixy_output_11,__pixy_output_5))
, PropertyPipe(born), AsPipe(__pixy_Year1_46,FilterFunctionPipe), PixyEvalPipe($
__pixy_Year1_46,$__pixy_Year_4,=</2), PixyEvalPipe($,true,ifeq/2), PixySplitMerg
ePipe([[IdentityPipe, CoalescePipe(__pixy_output), PropertyPipe(died), AsPipe(__
pixy_Year2_49,FilterFunctionPipe), PixyEvalPipe($__pixy_Year2_49,$__pixy_Year_4,
>=/2), PixyEvalPipe($,true,ifeq/2)], [IdentityPipe, PixySplitMergePipe([[Identit
yPipe, CoalescePipe(__pixy_output), PropertyPipe(died), AsPipe(__pixy___53,Filte
rFunctionPipe), AsPipe(__pixy_cut__57,PixyCutPipe), FilterFunctionPipe], [Identi
tyPipe]])]]), CoalescePipe(__pixy_output), PropertyPipe(name)]

Now, we can ask another monster query: "In whose deaths was Maria Owings Shriver a survivor?"

gremlin> g.V('name', 'Maria Owings Shriver').pixy(pt, 'survivor(&, $)').name.order()
==>Eunice Mary Kennedy
==>Hilda Shriver
==>Joseph Patrick Kennedy, Sr.
==>Robert Sargent Shriver, Jr.
==>Rose Elizabeth Fitzgerald

As you can imagine, coming up with this query in pure-Gremin is head-scratching work.

Loops and descendants

The entire set of rules defined prior to this section rely on four pre-defined relations, viz. property, out, in, both. In this section we introduce three new pre-defined relations called outLoop, inLoop and bothLoop. These relationships are looping versions of out, in and both that optionally take labels and min/max loops.

In the following example, inLoop is used to establish the descendant relationship.

gremlin> pt = pt.extend('''descendant(Person, Descendant) :- inLoop(Person, ['father', 'mother'], Descendant).''')

Now you can query the descendants of Hilda Shriver as follows:

gremlin> g.V('name', 'Hilda Shriver').pixy(pt, 'descendant($, &)').name.order()
==>Anthony Paul Kennedy Shriver
==>Maria Owings Shriver
==>Mark Kennedy Shriver
==>Robert Sargent Shriver III
==>Robert Sargent Shriver, Jr.
==>Timothy Perry Shriver

You can limit the minimum/maximum length of the path by passing a list of two numbers in the last parameter of the loop relationship. If you pass one number, that number acts as the maximum length of the path selected by the loop relationship.

gremlin> pt = pt.extend(''' \
      motherOrGrandmother(Person, Mogm) :- outLoop(Person, 'mother', Mogm, 2). \
      secondOrderRelative(Person, R) :- bothLoop(Person, R, [2, 2]), not(both(Person, R)), Person \\\\= R. \
    ''')

The first rule establishes a maximum path length of two which limits the outLoop to mother and grandmother. The second rule traverses father/mother/wife relationships in either direction, but ensures that the person is not a first-order relative or self.

Let's try the two queries:

gremlin> g.V('name', 'Anthony Paul Kennedy Shriver').pixy(pt, 'motherOrGrandmother($, &)').name.order()
==>Eunice Mary Kennedy
==>Rose Elizabeth Fitzgerald

This picks up the mother and grandmother as expected. Here's the query for second order relatives of Anthony Paul Kennedy Shriver:

gremlin> g.V('name', 'Anthony Paul Kennedy Shriver').pixy(pt, 'secondOrderRelative($, &)').name.order().unique()
==>Hilda Shriver
==>Joseph Patrick Kennedy, Sr.
==>Maria Owings Shriver
==>Mark Kennedy Shriver
==>Robert Sargent Shriver III
==>Robert Sargent Shriver, Sr.
==>Rose Elizabeth Fitzgerald
==>Timothy Perry Shriver

We need to use unique() to filter out duplicate matches of the same people through different paths. Note that none of the second order relatives are directly related to Anthony.

Edges and labels

Pixy supports pre-defined relationships corresponding to Gremlin's outE, inE, bothE, outV, inV, bothV and label steps. Since this example doesn't have any properties tied to edges, we have not had the need to use these relationships. In this section, we will express some of the previously discussed pre-defined relationships using the newly introduced ones.

gremlin> ptTemp = pt.extend(''' \
    my_out(X, Y) :- outE(X, E), inV(E, Y). \
    my_in(X, Y) :- inE(X, E), outV(E, Y). \
    my_both(X, Y) :- bothE(X, E), bothV(E, Y), X \\\\= Y. \
    my_out(X, Labels, Y) :- outE(X, Labels, E), inV(E, Y). \
    my_in(X, Label, Y) :- inE(X, E), label(E, Label), outV(E, Y). \
    my_both(X, Labels, Y) :- bothE(X, Labels, E), bothV(E, Y), X \\\\= Y. \
  ''')

Note that my_in/3 matches a single label against the label of the edge. You can also inspect the properties of an edge using the property/3 and property/4 relationships.

Wrap up

This page took you through the creation of a Pixy domain model for a graph database holding a portion of the Kennedy family tree. We started with three types of edges, viz. father, mother and wife, and built a domain model with a whole lot of relationships. Here's the list of new relationships that we added

gremlin> pt.getRelationSignatures()
==>aunt/2
==>brother/2
==>brotherInLaw/2
==>childless/1
==>daughter/2
==>father/2
==>grandparent/2
==>grandparent/3
==>husband/2
==>living/1
==>livingAsOf/2
==>livingAsOfSub/2
==>mother/2
==>motherOrGrandmother/2
==>nephew/2
==>nickname/2
==>niece/2
==>parent/2
==>secondOrderRelative/2
==>sibling/2
==>sister/2
==>sisterInLaw/2
==>son/2
==>spouse/2
==>survivor/2
==>survivorSub/2
==>uncle/2
==>wife/2
==>youngerBrother/2

In the course of this tutorial, we have created a domain model with 29 rules that capture various family relations. Any of these rules can be turned into Gremlin pipes to query the underlying graph database. We showed over 20 examples of such queries and the execution plans for some of them. You can go through a similar exercise with your graph database to come up with a domain model that works for your database. A good domain model can greatly simplify the query-writing experience while maintaining the portability of your application.

Please report your feedback, comments and suggestions to pixy (at) lambdazen.com, or as issues in this project.

You can’t perform that action at this time.