# PartiQL query language demonstrator

In our discussions about domain-specific languages for particle physics, we agreed that we want the language to be "declarative," to "specify *what* is to be computed, rather than *how* it is to be computed." However, we haven't made many proposals of language features that would achieve this end.

PartiQL is a toy language intended to demonstrate several features that would be a radical departure from general-purpose languages, addressing problems specific to particle physics. My intent is to inject ideas into the development of languages intended for physicists, to provide more "value added" with respect to conventional programming languages.

## State of the art

Two main classes of languages have been proposed so far: ADL, CutLang, and YAML as an ADL can collectively be called languages with a "block syntax," and RDataFrame, NAIL, [func-adl](https://iris-hep.org/projects/func-adl.html), and my long-defunct [Femtocode](https://github.com/diana-hep/femtocode) can be called "Spark-like functional languages."

Except for Femtocode, all of the above were discussed at the [Analysis Description Language Workshop](https://indico.cern.ch/event/769263/timetable/#day-2019-05-06) at Fermilab on May 6‒8, 2019.

### Block languages

Block languages have few recognizable programming constructs, providing a high-level feel to analysis code. There are no explicit loops, so a backend system can parallelize or distribute them however necessary. (Decisions about parallelization may still be in the user's control, but not mixed with analysis logic.) Here is an example:

```
define MR = fMR(megajets)
define METl = met + leptonsVeto[0]
define Rsql = sqrt(fMTR(megajets, METl) / MR)
define Rsq = sqrt(fMTR(megajets, met) / MR)

# Boost pre-selection cuts
region preselection
select AK4jets.size >= 3
select AK8jets.size >= 1
select MR > 800
select Rsq > 0.08
```

Block constructs like `define` and `region/select` are equivalent to assignment and `if` statements in a general-purpose language, but selections with meaningful physics content are named and are potentially searchable/programmatically accessible. Block languages are primarily intended to make analysis code more readable and shareable, not necessarily to make analysis code easier to write. Developing an analysis with `define` rather than assignment, or `region/select` rather than `if`, is not easier, nor is it much harder (apart from more typing).
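To make that equivalence concrete, here is a rough Python rendering of the `preselection` region above. The event layout (a dict of lists and numbers) and the bodies of `fMR` and `fMTR` are assumptions made only so the sketch runs; the real razor-variable functions are not defined in this document.

```python
import math

# Hypothetical stand-ins: the real fMR and fMTR are not defined in this
# document, so these toy formulas exist only to make the sketch runnable.
def fMR(megajets):
    return sum(megajets)

def fMTR(megajets, met):
    return 0.01 * met * sum(megajets)

def preselection(event):
    """Return True if the event passes the 'preselection' region."""
    # each 'select' becomes an 'if' that rejects the event
    if not len(event["AK4jets"]) >= 3:
        return False
    if not len(event["AK8jets"]) >= 1:
        return False

    # each 'define' becomes a plain assignment
    MR = fMR(event["megajets"])
    if not MR > 800:
        return False

    Rsq = math.sqrt(fMTR(event["megajets"], event["met"]) / MR)
    return Rsq > 0.08

# a made-up event, assumed layout: collections as lists, met as a number
event = {"AK4jets": [1, 2, 3], "AK8jets": [1],
         "megajets": [500.0, 400.0], "met": 100.0}
print(preselection(event))
```

The logic is the same as in the block version; what is lost is the named, searchable `region` and the guarantee that nothing else is hidden between the cuts.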

The disadvantage of block languages comes when complex processing is needed: when particles must be combined and looped over to search for solutions to various constraints (colloquially called "combinatorics" by physicists). Typical analysis tasks such as looping over particle candidates, selecting the best candidate, avoiding overlaps between candidate decays involving many particles, and applying constraints like "same flavor" without requiring a particular flavor in the result (all especially common when an analysis is in development) would significantly complicate the block structure. This kind of logic can be difficult to follow in a general programming language using `=`, `for`, and `if`, but it would get even more difficult if spread out into blocks, pushing more of the analysis description off-screen (due to length). One way out would be to allow analysis descriptions to call external code for the more complex manipulations, but too frequent use of such an "escape valve" would undermine the portability and preservation goals.
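For reference, this is roughly what one such task looks like with `=`, `for`, and `if` in Python: pick the best same-flavor lepton pair without naming the flavor. The dict-based particle layout, the opposite-charge requirement, and "closest to the Z mass" as the meaning of "best" are all assumptions chosen for illustration.

```python
import math
from itertools import combinations

Z_MASS = 91.2  # GeV; approximate value, used only to rank candidates

def pair_mass(a, b):
    """Invariant mass of two particles given as dicts with E, px, py, pz."""
    E = a["E"] + b["E"]
    px = a["px"] + b["px"]
    py = a["py"] + b["py"]
    pz = a["pz"] + b["pz"]
    return math.sqrt(max(E * E - px * px - py * py - pz * pz, 0.0))

def best_same_flavor_pair(leptons):
    """Return the same-flavor, opposite-charge pair closest to Z_MASS, or None."""
    best, best_distance = None, math.inf
    for a, b in combinations(leptons, 2):   # every unordered pair, no repeats
        if a["flavor"] != b["flavor"]:      # "same flavor", whichever it is
            continue
        if a["charge"] + b["charge"] != 0:  # assumed opposite-charge requirement
            continue
        distance = abs(pair_mass(a, b) - Z_MASS)
        if distance < best_distance:
            best, best_distance = (a, b), distance
    return best

# made-up leptons for illustration
mu_plus = {"flavor": "mu", "charge": +1, "E": 45.6, "px": 45.6, "py": 0.0, "pz": 0.0}
mu_minus = {"flavor": "mu", "charge": -1, "E": 45.6, "px": -45.6, "py": 0.0, "pz": 0.0}
electron = {"flavor": "e", "charge": -1, "E": 30.0, "px": 0.0, "py": 30.0, "pz": 0.0}

best = best_same_flavor_pair([mu_plus, electron, mu_minus])
```

Even this small case needs a loop, two guards, and running state for the best candidate, which is the kind of logic that is hard to spread across blocks.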

### Functional languages

Spark-like functional languages also abstract away loops over events, but in a way that doesn't give up general-purpose programming constructs. Replacing a `for` loop with a `map` or `filter` functional, with the body of the `for` loop becoming the function passed to the functional, moves the specification of which events are processed in what order from the `for` arguments into the implementation of the `map` or `filter` functional, which can be externally configured.
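A minimal Python sketch of that replacement, using the built-in `map` and `filter` and a standard-library thread pool as a stand-in for an externally configured backend. The event layout and the `passes` and `observable` functions are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def passes(event):
    """Selection: the body of what would be an 'if' inside the loop."""
    return event["met"] > 50.0

def observable(event):
    """Computation: the body of the loop itself."""
    return 2.0 * event["met"]

events = [{"met": m} for m in (10.0, 60.0, 80.0)]

# explicit loop: iteration order is fixed by the 'for' statement
results_loop = []
for event in events:
    if passes(event):
        results_loop.append(observable(event))

# functional form: the same bodies are handed to 'filter' and 'map'
results_functional = list(map(observable, filter(passes, events)))

# because 'map' owns the iteration, a parallel implementation can be
# substituted without touching 'passes' or 'observable'
with ThreadPoolExecutor(max_workers=2) as pool:
    results_parallel = list(pool.map(observable, filter(passes, events)))
```

All three lists hold the same values; only the last two leave the scheduling decision outside the analysis logic.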

In RDataFrame and NAIL, the functions passed to functionals are written in a general-purpose programming language: C++ (either as function pointers/references/lambda expressions or as C++ formatted strings that get wrapped as functions). In func-adl and Femtocode, on the other hand, the domain-specific language is self-embeddable in the sense that functions passed into functionals are in the same language as the functionals themselves.

The advantage of passing functions from a general-purpose language, such as in this example:

```c++
ROOT::RDataFrame d("myTree", "file.root");
// build a 3-vector from the px, py, pz branches, then cut on its magnitude
auto df = d.Define("p", "std::array<double, 3> p{px, py, pz}; return p;")
           .Filter("double p2 = 0.0; for (auto&& x : p) p2 += x*x; return sqrt(p2) < 10.0;");
```

is that there are no restrictions on what can be computed in the loop. The disadvantage is that these functions are opaque to the processing engine: they cannot be further optimized or specialized for particle physics—they can only be executed. Functionals that take functions in a general-purpose language solve a single problem: the large-scale organization of the analysis. As in Vegas, what happens in the loops stays in the loops.

I don't mean to minimize the importance of this one problem: it is recognized and important enough that frameworks built by physics collaborations have addressed it in the past. Usually, these object-oriented frameworks defined an abstract processor class with `begin`, `event`, and `end` methods for the physicist-user to override. The framework promised to call the user's `event` method on every event, just as RDataFrame calls the function passed to `Define` or `Filter` on every event. RDataFrame significantly expands the set of functionals (see [cheat sheet](https://root.cern.ch/doc/master/classROOT_1_1RDataFrame.html#cheatsheet)) well beyond `begin`, `event`, and `end`, and NAIL extends them further. However, like the object-oriented frameworks, both leave particle candidate manipulations to the general-purpose language.
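The processor pattern described above can be sketched in a few lines of Python. This is not the API of any particular collaboration framework: the names `Processor`, `run_framework`, and `CountHighMet`, and the dict-based event layout, are hypothetical.

```python
from abc import ABC, abstractmethod

class Processor(ABC):
    """Abstract base class that the physicist-user subclasses."""

    def begin(self):
        """Called once before the first event."""

    @abstractmethod
    def event(self, event):
        """Called once per event."""

    def end(self):
        """Called once after the last event."""

def run_framework(processor, events):
    """The framework's side of the contract: it owns the event loop."""
    processor.begin()
    for event in events:
        processor.event(event)
    processor.end()

class CountHighMet(Processor):
    """User code: counts events with missing energy above a threshold."""

    def begin(self):
        self.count = 0

    def event(self, event):
        if event["met"] > 50.0:
            self.count += 1

    def end(self):
        self.summary = f"{self.count} events passed"

processor = CountHighMet()
run_framework(processor, [{"met": 10.0}, {"met": 60.0}, {"met": 80.0}])
```

As with functions passed to `Define` or `Filter`, whatever happens inside `event` is ordinary general-purpose code that the framework can only call, not inspect.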

func-adl, on the other hand, lets you use the same functionals for subcollections as for collections of events, such as `Select` and `Where` in this example:

```python
jet_info = (events
    .Select(lambda e:
        (e.EventInfo('EventInfo'),
         e.Jets('AntiKt4EMTopoJets'),
         e.Tracks('InDetTrackParticles').Where(lambda t: t.pt() > 1000.0)))
    .SelectMany(lambda e1:
        e1[1].Select(lambda j:
            (e1[0],j,e1[2].Where(lambda t:
                DeltaR(t.eta(), t.phi(), j.eta(), j.phi()) < 0.2))))
```

These functionals, inspired by LINQ, do not make the business of particle combinations any easier than loops in a general-purpose programming language. (Femtocode, as originally conceived, wouldn't have been any better.) At best, the self-embeddable feature allows for inner loops to be optimized in conjunction with the outer loops (e.g. vectorized calculations across sets of particles per event).

## Combining block languages with functional

I have written three toy languages to try to influence the development of analysis description languages. The first of these, "Jim's ADL demo," ([GitHub](https://nbviewer.jupyter.org/github/jpivarski/analysis-description-language/blob/master/binder/demo.ipynb), [Binder](https://mybinder.org/v2/gh/jpivarski/analysis-description-language/master?filepath=binder%2Fdemo.ipynb), Oct 19, 2018) merged the block language approach with the functional one to show that we can combine their strengths.

This language used a curly bracket syntax to define cuts as visual blocks, but performed calculations using a functional language, like this:

```
region "two muons": muons.size >= 2
{

  zmass := muons.distincts                             # make pairs of distinct muons
                .map(pair => (pair[0] + pair[1]))      # add the Lorentz vectors
                .maxby(zcandidate => zcandidate.pt)    # find the maximum by pt
                .mass                                  # but compute and return mass

  count "zmass" by
    regular(60, 0, 120) <- zmass

}
```

These "regions" could be nested to express cut-flows and defined for multiple cuts to compute the same expression for several different cuts (a common physics task).

```
region "slices": true by
  regular(5, -5, 5) <- x
{
  region "one":   y > -3
         "two":   y >  0
         "three": y >  3
  {
    count "counter"
  }
}
```

Another basic primitive of particle physics analysis (and *not* general-purpose programming) is a systematic variation—running the same code with different input parameters. These can be blocks (nestable with `region`):

```
vary
  "central":    epsilon :=  0       # assignments use := for clarity
  "sigma up":   epsilon :=  0.5
  "sigma down": epsilon := -0.5
{
  count "histogram" by
    regular(100, -5, 5) <- x + epsilon
}
```

and this nested hierarchy of blocks can be used to define a directory hierarchy of histograms:

```python
run["two muons", "zmass"].plot()

run["slices", 2, "histogram"].plot()
run["slices", 3, "histogram"].plot()

run["central", "histogram"].plot()
run["sigma up", "histogram"].plot()
run["sigma down", "histogram"].plot()
```

The functional language implemented inside of the blocks didn't address the problem of combinatorics (any more than a general-purpose programming language does).

## Pattern matching