Joins in cascalog

MaineC edited this page Jul 31, 2012 · 4 revisions

Cascalog joins datasets that share variable names; this turns out to be quite a concise way to express complex relationships between datasets. Let's start our discussion by defining age and gender, two datasets that link customer names to these attributes. We'll use these datasets to discuss inner and outer joins.

    (def age [["alice" 28]
              ["bob" 33]
              ["chris" 40]
              ["david" 25]
              ["emily" 25]
              ["george" 31]
              ["gary" 28]
              ["kumar" 27]
              ["luanne" 36]])

    (def gender [["alice" "f"]
                 ["bob" "m"]
                 ["chris" "m"]
                 ["david" "m"]
                 ["emily" "f"]
                 ["george" "m"]
                 ["gary" "m"]
                 ["harold" "m"]
                 ["luanne" "f"]])

Inner Join

An inner join between two datasets returns all tuples that satisfy the "join condition"; in Cascalog, join conditions are specified through shared variable names.

An example query that triggers an inner join between age and gender through a shared ?name variable would be

    (?<- (stdout)
         [?name ?age ?gender]
         (age ?name ?age)
         (gender ?name ?gender))

Executing this query produces these results:

  RESULTS
  -----------------------
  emily   25       f
  david   25       m
  alice   28       f
  gary    28       m
  george  31       m
  bob     33       m
  luanne  36       f
  chris   40       m
  -----------------------

The inner join drops Harold's and Kumar's records from the result set, as they each weren't present in both source datasets.

Here's a more interesting query that generates all female customers over the age of 25:

    (?<- (stdout)
         [?name ?age]
         (age ?name ?age)
         (gender ?name "f")
         (> ?age 25))

This query joins on the ?name field, filter ?age explicitly with clojure's > operator, and perform an implicit filtering constraint on gender by constraining the ?gender var to the static value of "f". (Powerful stuff!) This query produces

  RESULTS
  -----------------------
  emily   25
  alice   28
  luanne  36
  -----------------------

Outer Joins

The inner join discussed above will drop tuples that don't exist in all datasets. To keep all data, Cascalog allows you to perform outer joins by using "ungrounding variables", prefixed with !!. In the following query, !!age and !!gender are ungrounding variables:

    (?<- (stdout)
         [?person !!age !!gender]
         (age ?person !!age)
         (gender ?person !!gender))

Executing this query generates the following results:

  -----------------------
  kumar    27   null
  alice    28   f
  emily    25   f
  luanne    36  f
  bob      33   m
  chris    40   m
  david    25   m
  gary     28   m
  george    31  m
  harold    null    m
  -----------------------

A predicate that contains an ungrounding variable is called an "unground predicate", and a predicate that does not contain an ungrounding variable is called a "ground predicate". Joining together two unground predicates results in a full outer join, while joining a ground predicate to an unground predicate results in a left join.

Let's introduce person (our customer list) and follows (a dataset of folks our customers are interested in):

    (def person [["alice"]
                 ["bob"]
                 ["chris"]
                 ["david"]
                 ["emily"]
                 ["george"]
                 ["gary"]
                 ["harold"]
                 ["kumar"]
                 ["luanne"]])

    (def follows [["alice" "david"]
                  ["alice" "bob"]
                  ["alice" "emily"]
                  ["bob" "david"]
                  ["bob" "george"]
                  ["bob" "luanne"]
                  ["david" "alice"]
                  ["david" "luanne"]
                  ["emily" "alice"]
                  ["emily" "bob"]
                  ["emily" "george"]
                  ["emily" "gary"]
                  ["george" "gary"]
                  ["harold" "bob"]
                  ["luanne" "harold"]
                  ["luanne" "gary"]])

Here's an example of a left join. To get all the follow relationships for each person in our dataset, or null if the person has no follow relationships, we run:

    (?<- (stdout)
         [?person1 !!person2]
         (person ?person1)
         (follows ?person1 !!person2))

To get all the people who do not have a follows relationship, we can run:

    (?<- (stdout)
         [?person]
         (person ?person)
         (follows ?person !!p2)
         (nil? !!p2))

Notice that the (nil? !!p2) predicate gets applied after !!p2 gets joined to a ground predicate. This is an important part of the semantics of outer joins in Cascalog.

Now let's say we want the follows count for each person. A normal "count" aggregation won't work because it counts the number of tuples and doesn't distinguish between null and non-null follows. In this case, we want null follows to be counted as 0 and non-null follows to be counted as 1. Cascalog has an aggregator called !count that does exactly this:

    (?<- (stdout)
         [?person ?count]
         (person ?person)
         (follows ?person !!p2)
         (c/!count !!p2 :> ?count))

Cross Joins

Inner joins return all records satisfying the equality constraint; how do we join datasets that have nothing at all in common?

This is called a cross join, and produces behavior similar to clojure's for macro:

    (for [name ["wendy" "sam"], food ["cake" "burger"]]
      [name food])

    => (["wendy" "cake"] ["wendy" "burger"] ["sam" "cake"] ["sam" "burger"])

This is usually a mistake, and causes a massive tuple blowup, but if you must do this, it's possible to trigger a cross-join by including (cross-join) as a predicate in your query:

    (let [name [["wendy"] ["sam"]]
          food [["cake"] ["burger"]]]
      (??<- [?name ?food]
            (name ?name)
            (food ?food)
            (cross-join)))

    => (["sam" "burger"] ["sam" "cake"] ["wendy" "burger"] ["wendy" "cake"])

Note that ??<- defines a query, executes it, and returns the result in the form of a clojure sequence, as discussed here.