Skip to content

Latest commit

 

History

History
483 lines (387 loc) · 21.8 KB

notes.org

File metadata and controls

483 lines (387 loc) · 21.8 KB

ToDo

  • tests
    • benchmark datomic/datascript/yadat
    • integration
      • support pull comparisons: remove :db/ids from results
    • add test to ensure pull handles cycles
    • add test to ensure query default source is respected by nested clauses
    • tests will run into problems in cljs /clj
      • datomic does not support fns as input afaik (shared test cases in integration)
      • cljs does not support resolve
  • pull
    • recursion (… / limit)
    • PullMap -> Pattern, Recursion - split?
  • rules
    • group rules by ffirst, i.e. by their name
    • or-clause ~ invocation of anonymous rule whose predicates make up clause
  • query
  • dsl
    • improve error messages / validation
      • use of non-existant variable
  • api
    • datoms / index access
    • map datoms, i.e. allow flat view modification of entities as the nested map modification in clj is really annoying e.g. (core/map-datoms query f) -> (fn [e a v] [e (keyword (lowercase (str a))) v])
  • performance
    • check out alternative backends like data.avl, hitchhiker tree, datascript btset
    • magic set transformation
    • rethink querying relations currently contain rows rather than raw datoms. Turning a datom into a row is costly and takes the bulk of the time while querying. For comparison: Queries are ~5x slower than datascript while transact is ~1.5 faster

vector vs defrecord

defrecord with type hints is much faster than vector -> as we do a lot of comparisons in the sorted sets that makes quite a difference - improved overall speed by ~ 30%

thoughts on async

It would be nice to support async storage backends but without a way to convert async -> sync (clj has that, cljs doesn’t) this is infectious and requires changes pretty much everywhere. An alternative would be to not fully support async but take an approach similar to what datomic does with segments (batches of datoms). Set a sentinel value when a query cannot be fulfilled due to missing segments and re-execute it after fetching those segments. Afaict this would require less extensive changes throughout the codebase and mostly be limited to db & api. Will think about this some other day - for now basic features are more important.

blog post

rewriting query to use a parser

while trying to add :with i realized that i had not been removing unused variables from the result rows this meant the medusa head problem was working without with even though it shouldn’t - rows were not collapsed as they did not merely contain the :heads but also :name - which should have been removed not being part of the find to figure out what variables are used in find and should be kept in the rows i would have to parse everything again and distinguish spec types and element types… this lead me to consider using a parser. interestingly, a 80% workaroud is just (filter var? (flatten find-spec)) this only runs into problems with (pull) that contains (x :as ?variable-looking-thing) an ugly 100% would be just flattenging once!

(defn flatten-1 [x] (mapcat #(if (sequential? %) % [%]) x)) (filter util/var? (flatten-1 find-spec))

gets us all the way as far as i can see :)

make it work first - later make it pretty with a parser learn some more first i guess

parsing

  • use clojure spec conform as parser :o???
    (s/def ::config (s/*
                      (s/cat :prop string?
                             :val  (s/alt :s string? :b boolean?))))
    (s/conform ::config ["-server" "foo" "-verbose" true "-user" "joe"])
    ;;=> [{:prop "-server", :val [:s "foo"]}
    ;;    {:prop "-verbose", :val [:b true]}
    ;;    {:prop "-user", :val [:s "joe"]}]
        

-> not that pretty as conform does not leave much freedom regarding actual transforming

demo

web editor for datalog query, update results http://blog.teamtreehouse.com/building-a-syntax-highlighted-input-box-with-javascript https://microsoft.github.io/monaco-editor/index.html

how use

generic rss, i.e. rss OR scrape rule -> feed of updates of interest -> this generalizes the wg-gesucht scraper & rss feed and stuff -> hooks to act on new feed items, e.g. send telegram message, reply to ad, …

limitations

  • pull pattern from input must be a variable, i.e. a symbol starting with ? - cannot use just any symbol as in datascript/datomic
  • no fixed schema and no valueTypes. everything is dynamic. values are added as simple value, cardinality-1 by default only attributes for which something else should be done need to be defined in the schema
  • no schema migrations
  • sorted-set does not support non-comparable, non-reference many values same for datascript though
  • cljs does not have real resolve (stackoverflow q) - what do right now resolve in cljs just throws -> cannot use predicates/functions
  • only one database supported - no implicit $ / explicit multiple databases

documentation

querying

Each pattern creates a set that binds the variables of that pattern when a pattern shares variable names with another pattern, the sets are inner-joined on those variables

;; ?course-id is shared between the queries - inner join
[?major-id :major/courses ?course-id]
[?course-id :course/name ?course-name]

:in clause constants are resolved to relations as well {:symbols {?a 0}, :tuples [ [val] ]}

each where clause is resolved to a relation

{:symbols {var-name position} :tuples [[var-at-position-0 var-at-position-1 ...] ...]}

Build result set based on :find and :with vars list: do cartesian product on all relevant relations, leave just vars that matter from them, collect them into a set

If there’s some aggregation happening, do group-by and run aggregation functions

If pull() is used, call Pull API for each entity

If find specifications are used, do post-processing: unwrap inner tuples, take first element from a set, etc.

Explanation - defn docs that i’m still refining

inner-join relations with shared variables: This reflects narrowing down the result set in bottom up evaluation. datalog query consists of where and find each where clause is resolved against the database where clauses that share variables are inner joined on those (unification) - this allows us to express complex query requirements the resulting, independent, relations are then cross joined (independent means the bindings can not be joined on some shared variable. the only sensible thing to do is to join each binding of relation A with each binding of relation B)

the resulting relation contains all the variable bindings we need to answer the :find of the query #{{?var1 1 ?var2 2} {?var1 3 ?var2 4}}

We’re not wroking on datoms (raw facts from the store) anymore but already on extracted data. this is required to allow using variables at different positions, i.e. first as the value, then as the id to link parent and child

i want some documentation on what datalog actually is, what we do

notes

Closed World Assumption: what is not implied by the logic program is false (rather than unknown) graph: vertices: predicates edges:

  • positive (p, q) if there is a clause in P in which q appears in a positive atom in the body and p appears in the head
  • negative (p, q) if there is a clause in P in which q appears in a negative atom in the body and p appears in the head

stratified: No cycle in pdg(P) contains a negative edge.

datalog without not is monotonic, i.e. adding facts can not remove but only add to result of Q

What You Always Wanted to Know About Datalog (Ceri, 1989)

Lo :- L1, …, Ln Li, is a literal of the form pi ( tl, … , tk) p is a predicate symbol, t are terms terms are either constant or variable

left-hand-side (lhs) of datalog clause is head, right-hand-side (rhs) is body body may be empty - clause without body is a fact clause with at least one literal in the body is a rule

father(bob, john) represents a fact (John is the father of Bob) grandparent(Z, X) :- parent( Y, X), parent(Z, Y) represents a rule (If X is a parent of Y AND if Y is a parent of Z, then X is a grandparent of Z)

grandparent, parent & father are predicate symbols john and bob are constants X, Y and Z are variables

datalog programn P must satisfy the following safety conditions (to ensure the set of facts that can be derived is finite)

  • Each fact of P is ground
  • Each variable which occurs in the head of a rule of P must also occur in the body of the same rule

A literal, fact, rule, or clause which does not contain any variables is called ground. The set of ground facts forms the extensional database (EDB) the datalog program P (~ set of rules) forms the intensional database (IDB)

head predicate of each clause in P must be an IDB-predicate. EDB-predicates may only occur in clause bodies. each edb predicate corresponds to a relation (table) -> stored as a tuple

predicates of P are IDB-relations / derived relations - correspond to relational view

when interested in a subset of an idb relation -> specify goal using literal preceded by “?-“, e.g. ?-sgc(ann, X) -> goal ~ query against view (view being the idb relation)

evaluation

top-down: rule as problem-generator, each rule as a problem that must be solved initial goal is matched with lhs of rule and generates rhs of that rule as new problems but with this kind of evaluation more natural to produce answer one tuple at the time => not good

also: breadth vs depth first depth-first: order of literals affects performance breadth-first: result of computation not affected by order of predicates within rhs or order of rules!

bottom-up: rule as production => apply to all facts in edb. does not take into account constants in goal predicate => wasteful bottom-up: inefficiencies: 1. reproducing same facts in dependent sets (?) 2. ignores constants from queries -> produces unnecessary facts

magic set: rewrite program into larger one additional idb that require some additional conditions to be satisfied used in bottom-up

READ AGAIN (p10-11)

to ensure safety (i.e. finite result set of intensional): each variable argument to a fn (representing an infinite set) must also occur as an argument to a predicate (-> relation, finite set) in same rule body or be bound to a constant evaluation of builtin predicate must be deferred until all its arguments are bound to constants! excption equality predicate, execute as soon as one arg is bound

negative: for safety reasons each variable in negative literal of rule body must also be in positive literal of same body

stratified datalog

Logic Programming (History) Kowalski

Horn clauses A0 ← A1 ∧ … ∧ An where n ≥ 0. ← = if ∧ = and A0 = conclusion - an empty body evaluates to true and can be omitted. A0 is then called a fact If A0 is omitted it is false. Such clauses are goal clauses goal can be understood as denying A1 ∧ … ∧ An has a solution -> challenge to refute denial by finding solution Ai = p(t1, …, tm), with p = predicate and t = terms predicates are the relations (defined or computed) of a program functions are treated as special case of relation (computed) function can be translated to

each term is either a constant, variable or composite term fn(t1, …, tm) terms can contain variables. any expression x (horn clause, term, …) without variables is called ground x variables in terms are universally qualified (?) scoped to horn clause it occurs in

datalog is a logic program without function symbols -> decidable with functions it would be turing complete and undecidable

datalog enough for databases and a lot of other shit e.g. and or trees can be represented as horn clauses

pure datalog is monotonic (i.e. clauses cannot take away from results, only add) (?) negation makes datalog non-monotonic

negation requires horn clauses to be extended to A0 ← A1 ∧ … ∧ An ∧ not B1 ∧ … ∧ not Bm where n ≥ 0 and m ≥ 0. atomic formulas and their negations are called literals

sets of clauses in this form are called normal logic program horn clause program: horn clauses without negation normal logic program: horn clauses with negation

  • top-down: clauses in P as goal-oriented reduction procedures to derive G fits both declarative and procedural representation (?)
  • bottom-up: generate new conlusions from existing conlusions until the conclusions contain all information required to solve G in one step ~= generating a model in which G is true natural fit to declarative representation (?)

solving for G is hard in hornclauses, even harder in horn clause with negation

resolution mehtod refutation procedure (reductio ad adsurdum) convert P and negation of G into set of clauses, derive empty clause (representing falsity)

Clauses are

  • disjunctions of literals
  • represented as sets

it if there is any substitution that unifies K and L, then there is a most general such unifying substitution, which is unique up to renaming of variables.

set notation of clauses is not user friendly. more common to write as disjunctions {A1, … , An, ¬B1, … , ¬Bm} => A1∨…∨An ∨¬B1∨…∨¬Bm. H

e.g. to find capital of usa ∃Xcapital(X, usa) is negated and the answer literal added => ¬capital(X, usa) ∨ answer(X)

proof procedure: 1. inference system (space of all proofs) 2. search strategy (for solution to goal in proof space) proof procedure = proof space + search strategy A typical proof space has the structure of an and-or tree turned upside down resolution: breadth/depth first (or heuristic but meh)

  • hyper-resolution derives new clauses from the input clauses, without paying attention to the problem to be solved ignores goal until it resolves it

If the top clause C0 represents an initial goal, then the tree of all linear derivations is a goal tree, and generating the tree top-down is a form of goal-reduction. The development of various forms of linear resolution with set of support and ordering restrictions brought resolution systems closer to Planner-like theoremprovers.

unification: ground terms are equal if syntactically equal

read again pg 24-25 comparison to arithmetic

To make model generation relevant to the query, Datalog uses transformations such as Magic Sets [Bancilhon, et al 1985] to incorporate the query into the transformed database rules.

stratified negation The simplest example of a stratified logic program is that of a deductive database E ∪ I whose predicates are partitioned into extensional predicates, defined by facts E, and intensional predicates, defined in terms of the extensional predicates by facts and rules I. . Consider, for example, a network of nodes, some of whose links at any given time may be broken14. This can be represented by an extensional database, say: E: link(a, b) link(a, c) link(b, c) broken(a, c) Two nodes in the network are connected if there is a path of unbroken links. This can be represented intensionally by the clauses: I: connected(X, Y ) ← link(X, Y ) ∧ not broken(X, Y ) connected(X, Y ) ← connected(X, Z) ∧ connected(Z, Y ) The conditions of the first clause in I are completely defined by E. So they can be evaluated independently of I. The use of E to evaluate these conditions results in a set of Horn clauses I ′ , which intuitively has the same meaning as I in the context of E: I ′

connected(a, b) connected(b, c)

connected(X, Y ) ← connected(X, Z) ∧ connected(Z, Y )

The natural, intended model of the original deductive database E ∪ I is the minimal model M of the resulting set of Horn clauses E ∪ I ′ : M: link(a, b) link(a, c) link(b, c) broken(a, c) connected(a, b) connected(b, c) connected(a, c)

pg 27 explains stratification

Having recognised the problem, a number of authors proposed further refinements of stratification. However, it now seems to be generally agreed that these refinements are superseded by the well-founded semantics of [Van Gelder, Ross and Schlipf 1991]. In particular, [Denecker et al., 2001] argues that the well-founded semantics “provides a more general and more robust formalization of the principle of iterated inductive definition that applies beyond the stratified case.”

ASP most advanced, does not allow functions (but i don’t want those afaik - only filter)

next

https://iccl.inf.tu-dresden.de/w/images/1/1c/Vlog-datalog-materialization-aaai2016.pdf

https://mobisocial.stanford.edu/papers/icde13.pdf -> implementation

https://ac.els-cdn.com/S0004370212000562/1-s2.0-S0004370212000562-main.pdf?_tid=edfa5b15-57a0-47ff-89a0-e6ed307ede8d&acdnat=1525348061_dab4112845d58061aee422fe5b7703c0 magic set thing with pseudo code implementation!

2.4. Magic Sets for Datalog programs on 161

The goal of the original Magic Set method (defined for non-disjunctive Datalog programs) is to exploit the presence of constants in a query for restricting the possible search space by considering only a subset of a hypothetical program instantiation that is sufficient to answer the query in question. In order to do this, a top–down computation for answering the query is simulated in an abstract way.

reading & unsorted notes

https://github.com/djjolicoeur/datamaps/blob/master/src/datamaps/pull.clj http://users.informatik.uni-halle.de/~brass/lp07/c7_magic.pdf on the magic set transformation https://souffle-lang.org/docs/magicset/ https://semmle.com/download-files/sigmod08.pdf https://www.cs.cmu.edu/~fp/courses/15317-f17/lectures/18-datalog.pdf https://iccl.inf.tu-dresden.de/w/images/c/cc/DBT2016-Lecture-12.pdf datalog lectures http://pages.cs.wisc.edu/~paris/cs784-s17/lectures/lecture7.pdf (also 8.pdf & 9.pdf)

https://github.com/travitch/datalog/blob/master/src/Database/Datalog/MagicSets.hs http://webdam.inria.fr/Alice/pdfs/Chapter-13.pdf http://www.ifis.cs.tu-bs.de/webfm_send/176 -> good

http://www.cs.toronto.edu/~drosu/csc343-l7-handout6.pdf -> REALLY GOOD A rule is safe if each distinguished and nondistinguished variable appears in at least one nonnegated relational atom unsafe E(w) ← NOT Movies(t, y, l, c, s, p) Years(w) ← Movies(t, y, l, c, s, p) AND w < y in each case an infinity of w’s can satisfy the rule, even though Movies is a finite relation.

datalog program is recursive if dependency graph has a cycle!

naive solution for recursive (without negated) fixpoint search, i.e. eval rules on edb and idb until no change to idb negation and recursion makes no sense (?)

stratified recursion: forbid negation in recursion: max negations to idb must be finite -> labeled dependency graph

  • nodes: idb predicates
  • edges: from node1(predicate1) to node(predicate2) if

and only if there is a rule with predicate1 in the head and predicate2 in the body. If predicate2 appears negated, label the edge with “-”.

• The stratum of a node (predicate) is the maximum number of “-” labeled edges on a path leading from that node A Datalog program is stratified if al its IDB predicates have finite strata.

next: this http://infolab.stanford.edu/~ullman/fcdb/slides/slides14.pdf

https://www.kde.cs.uni-kassel.de/lehre/ss2006/datenbanken/folien/Kapitel15.pdf <- do this! very good edb: extensional db (facts, relational data basis) deduktionskomponente: menge aus herleitungsregeln idb: intensional db (hergeleitete relationen, ausprägungen). result of application of rules to facts

edb facts, idb rules (?)

regel formel: q(A1,…An), q being name of base relation, intensional relation or built in predicate

adorn = annotate bound / free magic set contains all possibly interesting constant values recursively calc using magic rules

reachable adorned system: i.e. incorporate the query as rule and replace all predicate by it’s respective adornment

we obtain multiple magic predicates for a single adorned predicate occurrence

Every rule using an adorned IDB predicate in its body is augmented with an additional literal containing the respective magic set

magic set:

  • query is part of program
  • reachable adorned system: which terms are distinguished and propagate the resulting adornments. Reachable adorned system contains separated adorned predicate occurrences
  • magic set for each adorned predicate occurrence

i should try first to use datascripts existing query engine -> use that in tests for validation -> build my own with that and the datomic docs

for starters i should focus on where

  • :find is only post-processing of results
  • :in is advanced customization
  • :with as well
  • :where
    :where [[?e :user/firstName ?fname]
            [?e :user/secondName ?sname]]
        

needs

  • query plan for each clause query plan is based on what is variable and what is constant
  • join plan for all clauses based on shared variables
  • that’s it

rule is safe (i.e. result is not infinite) when all variables in head are finite

  • variable must be in body inside at least one non built in predicate (i.e. one real relation. function predicates are infinite)
  • variable is assigned a constant or another finite set

evaluation is expanding

Resources