Backend filtering refactor #921

dmos62 · 2021-12-22T11:23:35Z

Fixes #385, fixes #846

Sets up the backend infrastructure for filtering.

Features

Notice that I might use the terms filter specification and predicate interchangeably.

Some of the things this new predicate data structure does is:

Declares (predicate) correctness to a high degree
- An illegal predicate cannot be instantiated; an instantiated predicate is always legal
  - Notable exception
    - A predicate referencing unexisting columns can be instantiated
      - Referenced column existance is checked right before applying the filter specification to a query
Declares what predicates the frontend can use under what circumstances and how
- It can tell if a predicate is compatible with some column type based on its properties (like comparability, compatibility with LIKE, compatibility with URI-type-specific SQL functions)
- It can tell the frontend how many parameters a predicate takes (e.g. empty, equal and in take different number of params) and it can tell it what options it accepts (e.g. should starts_with be case insensitive)
- It can tell the frontend how to compose predicates: leaf vs branch predicates: parameters on branches are other predicates
It supports SQL functions
- You can use SQL functions to, for example, destructure a URI and filter based on that
  - Though this extension is/will be in a newer PR
  - Was not possible with sqlalchemy-filters

What it does not (currently) do:

Does not support using other columns as parameters
- Can't do {equal: {column: x, parameter: {column: y}}}
- This is an oversight
- Does not seem difficult to implement if/when there's interest

Technical details

I organized the relevant pieces in the predicate data structure into a mixin hierarchy and also used this PR as a testbed for frozen dataclasses. Some of my objectives with the basic structure were:

Immutability
- Used properties where I could
  - Didn't use properties where an SQLAlchemy filter is returned, since its mutability is uncertain
Many small classes
- Use the mixin/type system to capture information
  - A class that is a predicate that takes one parameter and relies on LIKE will directly or indirectly mix in ReliesOnLike, SingleParameter, Leaf and Predicate
- Some logic I offloaded to and hence centralized in static methods: correctness is declared on db/filters/base::assert_predicate_correct: and, what the JSON filter specification is is declared in db/filters/operations/deserialize
  - Has the drawback that the control flow in these methods can seem daunting, since it walks itself through the mixin hierarchy
    - These methods are not complicated, but big: each branch is simple: it's just that there's many of them
    - Might fan this logic out into the mixin definitions
Used parameter where a singular parameter is expected and parameters where a sequence is expected
- I have reservations about this
  - It makes the specifics of single/multi-parameter requirements obvious
  - Procedural instantiation is more verbose (like when testing), since you have to change the argument name depending on circumstance
    - But that can be solved with an auxillary constructor that's only for utilities

How to extend with new predicate

Introduce the appropriate class; for example, Greater; this includes defining the new Predicate's properties through mixins: ReliesOnComparability, SingleParameter, Leaf in this case (note that mixin order matters, see Python docs);

@frozen_dataclass
class Greater(ReliesOnComparability, SingleParameter, Leaf):
    type: LeafPredicateType = static(LeafPredicateType.GREATER)
    name: str = static("Greater")

    def to_sa_filter(self):
        return column(self.column) > self.parameter

Introduce the predicate type enum: LeafPredicateType.GREATER in this case; it's what the JSON filter spec will use to identify a predicate;
Tell it how to generate an equivalent SQLAlchemy filter by implementing the abstract Predicate::to_sa_filter (as seen above);
Update correctness definition (db/filters/base::assert_predicate_correct), if needed; currently this involves finding the spot in the method's control flow tree that corresponds to this new predicate, adding a new type check, etc.
Add new predicate to the all_predicates set;
Update mathesar.database.types::is_ma_type_supported_by_predicate, if needed; this would involve declaring what types have the properties the new predicate depends on: in this PR it's:

def _is_ma_type_comparable(ma_type: MathesarTypeIdentifier) -> bool:
    return ma_type in comparable_mathesar_types

def is_ma_type_supported_by_predicate(ma_type: MathesarTypeIdentifier, predicate: Type[Predicate]):
    if relies_on_comparability(predicate):
        return _is_ma_type_comparable(ma_type)
    else:
        return True

Notice that relies_on_comparability is an auxiliary function returning true when a predicate is a subclass of ReliesOnComparability.

`dataclasses` and `typing`

As Kriti suggested, I'll start a dedicated discussion on the use of dataclasses and typing.

But, to summarize:

dataclasses eliminated custom boilerplate; I got immutability, _post_init and defaults without writing a single constructor; I think this is great since it's essentially standardized boilerplate;
typing got me partial typing, which is great:
- in combination with my LSP server/client (pyright): catches a lot of mistakes and conflicts
- in that I can annotate when I think it's useful and not when I don't (partial typing)
- for readability, for example the above method is_ma_type_supported_by_predicate operates on uninstantiated predicate classes, not instances, so to express that I can just say predicate: Type[Predicate]) instead of predicate_subclass
- doesn't have a cost; another developer can just omit type or write Any if he prefers

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the master branch of the repository
My commit messages follow best practices.
My code follows the established code style of the repository.
I added tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no
visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Also includes implementation for converting to SA filter spec.

Apparently dataclass defaults don't carry over from mixins.

I was wrong. Apparently I just got the order of mixin application wrong.

There's nothing faux about it.

…cally

dmos62 · 2022-01-14T16:50:00Z

Above discussed changes had some interesting knock-on effects and I ended up rethinking some things:

renamed Expression classes to DbFunctions. Seems more appropriate;
moved DbFunctions to the mathesar namespace and then back to the db namespace,
- because we do filtering in the db namespace;
  - that's where we apply a filter to a query and then execute it before returning it to a caller in the mathesar namespace);
  - so either, we execute a query by passing it a preprocessor callback, or we move the stuff relevant to filtering to the db namespace.

The DbFunction attributes that seem service-layery are the display name and the hints. Display name we can remove and use a mapping between DbFunctions and display names in the service layer. The hints are not as high-level as I previously thought: they are what will let us mix and match database types and DbFunctions when constructing something like a filter on the frontend, among other things.

dmos62 · 2022-01-17T12:04:33Z

Request for feedback:

Currently the DbFunction (previously known as Predicate and Expression) classes look like this:

db.functions.base:

[...]

class DbFunction(ABC):
    id = None
    name = None
    hints = None

    def __init__(self, parameters):
        if self.id is None:
            raise ValueError('DbFunction subclasses must define an ID.')
        if self.name is None:
            raise ValueError('DbFunction subclasses must define a name.')
        self.parameters = parameters

    @property
    def referenced_columns(self):
        """Walks the expression tree, collecting referenced columns.
        Useful when checking if all referenced columns are present in the queried relation."""
        columns = set([])
        for parameter in self.parameters:
            if isinstance(parameter, ColumnReference):
                columns.add(parameter.column)
            elif isinstance(parameter, DbFunction):
                columns.update(parameter.referenced_columns)
        return columns

    @staticmethod
    @abstractmethod
    def to_sa_expression():
        return None


class ColumnReference(DbFunction):
    id = 'column_reference'
    name = 'Column Reference'
    hints = tuple([
        hints.parameter_count(1),
        hints.parameter(1, hints.column),
    ])

    @property
    def column(self):
        return self.parameters[0]

    @staticmethod
    def to_sa_expression(p):
        return column(p)


class List(DbFunction):
    id = 'list'
    name = 'List'

    @staticmethod
    def to_sa_expression(*ps):
        return list(ps)


class Empty(DbFunction):
    id = 'empty'
    name = 'Empty'
    hints = tuple([
        hints.returns(hints.boolean),
        hints.parameter_count(1),
    ])

    @staticmethod
    def to_sa_expression(p):
        return p.is_(None)

[...]

DbFunction subclasses are exposed via the new /api/v0/functions/ endpoint:

[
    {
        "id": "and",
        "name": "And",
        "hints": [
            {
                "id": "returns",
                "hints": [
                    {
                        "id": "boolean"
                    }
                ]
            }
        ]
    },
    {
        "id": "column_reference",
        "name": "Column Reference",
        "hints": [
            {
                "id": "parameter_count",
                "count": 1
            },
            {
                "id": "parameter",
                "index": 1,
                "hints": [
                    {
                        "id": "column"
                    }
                ]
            }
        ]
    },
    {
        "id": "empty",
        "name": "Empty",
        "hints": [
            {
                "id": "returns",
                "hints": [
                    {
                        "id": "boolean"
                    }
                ]
            },
            {
                "id": "parameter_count",
                "count": 1
            }
        ]
    },
    {
        "id": "equal",
        "name": "Equal",
        "hints": [
            {
                "id": "returns",
                "hints": [
                    {
                        "id": "boolean"
                    }
                ]
            },
            {
                "id": "parameter_count",
                "count": 2
            }
        ]
    },
    {
        "id": "extract_uri_authority",
        "name": "Extract URI Authority",
        "hints": [
            {
                "id": "parameter_count",
                "count": 1
            },
            {
                "id": "parameter",
                "index": 1,
                "hints": [
                    {
                        "id": "uri"
                    }
                ]
            }
        ]
    },
    {
        "id": "in",
        "name": "In",
        "hints": [
            {
                "id": "returns",
                "hints": [
                    {
                        "id": "boolean"
                    }
                ]
            },
            {
                "id": "parameter_count",
                "count": 2
            },
            {
                "id": "parameter",
                "index": 2,
                "hints": [
                    {
                        "id": "array"
                    }
                ]
            }
        ]
    },
    {
        "id": "lesser",
        "name": "Lesser",
        "hints": [
            {
                "id": "returns",
                "hints": [
                    {
                        "id": "boolean"
                    }
                ]
            },
            {
                "id": "parameter_count",
                "count": 2
            },
            {
                "id": "all_parameters",
                "hints": [
                    {
                        "id": "comparable"
                    }
                ]
            }
        ]
    },
    {
        "id": "list",
        "name": "List",
        "hints": null
    },
    {
        "id": "not",
        "name": "Not",
        "hints": [
            {
                "id": "returns",
                "hints": [
                    {
                        "id": "boolean"
                    }
                ]
            },
            {
                "id": "parameter_count",
                "count": 1
            }
        ]
    },
    [...]
]

The frontend will look at the available functions via above endpoint; it will look at the hints defined on those functions and decide how to suggest/constrict user's choices when constructing a database expression.

Suppose the user wants to define a filter. The frontend might make sure that the top-level function used returns a boolean (a filter is an expression that resolves to a boolean), then that the nested expressions' return types are compatible with the nesting functions' parameter hints. Also, worth noting that this expression language is maximally simple: everything is a function, even column references, literals and lists.

An example filter expression in JSON might look like this:

{"in": [
  {"column_reference": ["column1"]}, 
  {"list": [
    {"column_reference": ["column2"]}, 
    {"to_lowercase": [
      {"column_reference": ["column3"]}, 
    ]},
    {"literal": ["foo"]},
    {"literal": ["bar"]},
  ]}, 
]}

This would be equivalent to:

In([
  ColumnReference(["column1"]),
  List([
    ColumnReference(["column2"]),
    ToLowercase([ColumnReference(["column3"])),
    Literal(["foo"]),
    Literal(["bar"]),
  ]),
])

Which would be equivalent to following SQL:

column1 IN (column2, LOWER(column3), "foo", "bar")

On the backend, database function expressions will receive a minimum of checks. The aim is to constrict the user as little as possible. The frontend will be able to guide the user's choices by using the hint system (hints were previously called suggestions; using hints because shorter). The hint system is very simple, but expressive. Here's the definition of hints on the backend:

db.functions.hints:

from frozendict import frozendict


def _make_hint(id, **rest):
    return frozendict({"id": id, **rest})


def parameter_count(count):
    return _make_hint("parameter_count", count=count)


def parameter(index, *hints):
    return _make_hint("parameter", index=index, hints=hints)


def all_parameters(*hints):
    return _make_hint("all_parameters", hints=hints)


def returns(*hints):
    return _make_hint("returns", hints=hints)


boolean = _make_hint("boolean")


comparable = _make_hint("comparable")


column = _make_hint("column")


array = _make_hint("array")


string_like = _make_hint("string_like")


uri = _make_hint("uri")

This allows the backend to specify the number of parameters expected (where appropriate), to define hints for all parameters and/or for individual parameters, and to define hints for the return type. New hints are trivial to add.

Previous major revision of this refactor didn't require the frontend to know anything about the filter constructs. This revision does not require knowing anything about the functions, but it does require knowing about the hint system. That is the price of the simplicity and flexibility offered. For example, it will have to be hardcoded in the frontend that a filter expression is a nesting of database functions where the top-level one has the hint {"returns": ["boolean"]}.

When it will come to mixing and matching database functions to data types, we'll define relevant hints on the data types. So the frontend, having looked up database functions and data types, will see that a BOOLEAN Postgres type has the {"id": "boolean"} hint (equivalent of db.functions.hints.boolean) and so will know that it should be a safe input to a function where the parameter has the same hint ({"id": "boolean"}).

@kgodey @mathemancer @silentninja does this make sense?

kgodey · 2022-01-18T20:10:30Z

I think this all makes sense to me, I like how extensible and simple it is.

The only things that I'd like to add is:

There should be some way for the frontend to know which filters we're supporting in the UI for each data type. I'd recommend putting a list of filters each type supports in the the /api/v0/databases/<id>/types/ endpoint (which should actually just be /api/v0/types/ per discussion in Restructure our URLs into a hierarchy #974, now that I think of it, but that can be a separate issue). This list can just be a pointer to the relevant function in /api/v0/functions/.
Please add documentation on how this system is structured and how to add new filters in the Architecture section of the wiki once this is merged.

mathemancer · 2022-01-19T08:30:06Z

Overall, I think this makes sense. Looking at this, I'm convinced that it does make sense to have the hints, etc. in with the to_sa_expression part. Since this lets users construct fairly general make sure to wrap any string (or data) that we're passing to to_sa_expression in a text blob. It would be good if that was somehow handled generally to avoid relying on every future implementer getting it right for every such class.

@kgodey I thought we agreed (in some other spot on GH) that filters aren't (in general) subordinate to types, but rather the other way around. For example, I could filter on whether a string in column A is longer than the integer in column B. The associated function would take a string-like and integer-like input, and return a boolean, so it wouldn't make sense to put it under either the TEXT or INTEGER types. Rather, it makes sense to decide you want to do some sort of function like that, and then choose the appropriate columns for arguments. Maybe a compromise would be to be able to filter the list of functions to restrict to those that involve a given type as at least one argument?

The same issues apply to groupings for similar reasons. I think it makes more sense to keep the filters and other functions under some /.../functions/ endpoint.

silentninja · 2022-01-19T12:07:23Z

I agree with @mathemancer suggestion to have it under /function endpoint and not in types.

@dmos62 Looks good, I got a few questions:

How would overloaded functions work?
what does all_parameter mean, is it used as a blanket to say any comparable or any string?

kgodey · 2022-01-19T16:41:21Z

@kgodey I thought we agreed (in some other spot on GH) that filters aren't (in general) subordinate to types, but rather the other way around. For example, I could filter on whether a string in column A is longer than the integer in column B. The associated function would take a string-like and integer-like input, and return a boolean, so it wouldn't make sense to put it under either the TEXT or INTEGER types. Rather, it makes sense to decide you want to do some sort of function like that, and then choose the appropriate columns for arguments. Maybe a compromise would be to be able to filter the list of functions to restrict to those that involve a given type as at least one argument?

@mathemancer I think this makes sense at the db module and API level, but we have specific filters and groups available for specific types in the Mathesar frontend - see the Global Data Types spec. The frontend needs some way to figure this out from the API.

I was suggesting leaving the functions endpoint alone and adding a separate endpoint that focuses only on what we've specced out for the UI. That way, someone can use the functions endpoint if they want to access all DB functions and the separate endpoint if they want to use our product abstractions. I guess we could create a filters endpoint instead of overloading the types endpoint for that.

This conversation is making me think that we should have separate API paths for our frontend-specific abstractions (Mathesar types + the filters they support + any other abstractions we come up with) and general DB functionality. I think that's out of scope for this issue, though.

dmos62 · 2022-01-20T12:11:33Z

@kgodey

There should be some way for the frontend to know which filters we're supporting in the UI for each data type.

Yeah, definitely. Hints fill that role. To say that a function can take strings as input, you give it a hint along the lines of hints.parameter(1, hints.string_like) (could be read "I'm hinting that the first parameter is a string"). Then we give the string-like DB types the same hint (that is not yet implemented, but it's straightforward, only a mapping). Imagine:

GET /api/v0/db_types:

[
    {
        "id": "character varying",
        "name": "Character varying",
        "hints": [
            {
                "id": "string_like"
            }
        ]
    },
    [...]
]

So, when frontend queries both the /functions and the /db_types endpoints, it has all the information necessary to tell which functions should work with which data types.

@silentninja

How would overloaded functions work?

This only supports a single signature per function. If there's a database function with multiple signatures and we'd like to expose them, we'd do best to wrap them in separate DbFunction subclasses with different hint sets each.

Note that this mechanism does not actually enforce following the hints. As far as db.functions is concerned you can attempt to call and compose DbFunction subclasses however you like and it will attempt to not fail or at least fail as late as possible (preferrably during actual evaluation by Postgres). This is in the spirit of allowing the user as much freedom as possible.

This is also a pain point of this concept, in my mind, since it might be difficult to fail gracefully or to show nice error messages. But, that should only affect advanced users, unless we mess up the hints.

what does all_parameter mean [..]?

hints.parameter(1, *some_set_of_hints) says parameter at index 1 has the hints in some_set_of_hints applied. And, hints.all_parameters(*some_set_of_hints) says all parameters have the hints in some_set_of_hints applied.

When both these hints are provided at the same time, I'd say that the more specific hints.parameter should take precedent for the specific parameter it is targeting, but we might want to look into this specific scenario some more if it comes up in a practical setting. For example, CSS does inheritance and then provides things like hints.unset(hints.some_hint_that_i_dont_want_to_inherit) in the more specific rule to specify when to not inherit.

kgodey · 2022-01-20T16:58:07Z

So, when frontend queries both the /functions and the /db_types endpoints, it has all the information necessary to tell which functions should work with which data types.

@dmos62 Okay, that makes sense. Please open a separate issue for the frontend to switch to the new APIs for filters, we can use it to discuss the API spec with the frontend team and make sure it works for them.

dmos62 · 2022-01-25T11:49:37Z

Closing this PR. Splitting it into two PRs. First, there will be a PR that introduces the new function API (borne out of iterating over this PR). Then, there will be a PR that replaces the old filtering API with the new function API.

dmos62 added 30 commits November 22, 2021 16:03

Intro dataclass hierarchy for modeling filtering

f714909

Also includes implementation for converting to SA filter spec.

Rename primitive to leaf and operator to branch

6bdb935

Rename subject to branch

90710c3

Introduce BranchType and ParameterType

b51cf6b

Refactor to not use mixins

1819c0b

Apparently dataclass defaults don't carry over from mixins.

Implement remaining predicates

639c14b

Revert to using mixins

5025f87

I was wrong. Apparently I just got the order of mixin application wrong.

Ammend fauxStatic comment

906cbd1

Rename fauxStatic to static

efd3dc2

There's nothing faux about it.

Rename parameterType to Count; other minor changes

aa9239e

Implement exposing filtering options through REST API

828814a

Dead code

58a9432

switch to snake_case

7705bae

Make name more specific

7379d8c

Impl. parsing our custom filter spec to Predicate

ee1f01a

Rename Leaf.field to Leaf.column

c18e0fe

Move serialization and deserialization routines

74c6626

Rename function to something more specific

78bf19b

Add basic serialization test

311456e

Implement Predicate parameter constraints

201837f

Change filter api nomenclature

54e143a

Fix SA spec serialization

086d1d3

Impl. testing deserialization

6424c1f

Make filters aware of MA type comparability

b1910b7

Reorder MA filter spec object keys

30ea5c4

Switch to using None as default filter

db73bda

Use snake_case for vars and funcs

8d16a0f

Remove duplicate_only (get_duplicates) filter

79172bc

Finish changing casing

28eb1b3

Remove get_duplicates tests

485184e

dmos62 added 6 commits January 13, 2022 15:38

Improve module doc

b7c72c2

Switch to single-quote

1cb877c

Rename Function to DbFunction; collect DbFunction subclasses automati…

689fa15

…cally

Remove or move last of db.filters

05b3cdc

Intro. endpoint for DbFunctions

5325098

Move DbFunctions from mathesar to db namespace

2e56359

dmos62 added 4 commits January 17, 2022 13:09

Revert to absolute imports

e6891d8

Force nested hints into lists

fe3b7f5

Make make_hint private

446fdf7

Linter fixes

7b737fd

dmos62 added 2 commits January 17, 2022 14:17

Introduce class Literal(DbFunction)

8b64508

Add "literal" hint

ce7cc7b

Intro. endpoint for hinted db types

4d2e5ff

Make sure literal params are always wrapped in Literal(...)

89ddf5c

dmos62 mentioned this pull request Jan 20, 2022

Switch frontend to new filtering API #1001

Closed

dmos62 added 3 commits January 21, 2022 17:32

Let ColumnReference take literal strings.

cbf9968

Redo some of the tests

b7b6fce

Fix whitespace

27c246e

dmos62 closed this Jan 25, 2022

dmos62 mentioned this pull request Jan 25, 2022

Introduce function API #1022

Merged

7 tasks

dmos62 deleted the backend-filtering-numbers branch March 1, 2022 17:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backend filtering refactor #921

Backend filtering refactor #921

dmos62 commented Dec 22, 2021 •

edited

Loading

dmos62 commented Jan 14, 2022

dmos62 commented Jan 17, 2022 •

edited

Loading

kgodey commented Jan 18, 2022

mathemancer commented Jan 19, 2022

silentninja commented Jan 19, 2022

kgodey commented Jan 19, 2022

dmos62 commented Jan 20, 2022 •

edited

Loading

kgodey commented Jan 20, 2022

dmos62 commented Jan 25, 2022

Backend filtering refactor #921

Backend filtering refactor #921

Conversation

dmos62 commented Dec 22, 2021 • edited Loading

Features

Technical details

How to extend with new predicate

dataclasses and typing

Checklist

Developer Certificate of Origin

dmos62 commented Jan 14, 2022

dmos62 commented Jan 17, 2022 • edited Loading

kgodey commented Jan 18, 2022

mathemancer commented Jan 19, 2022

silentninja commented Jan 19, 2022

kgodey commented Jan 19, 2022

dmos62 commented Jan 20, 2022 • edited Loading

kgodey commented Jan 20, 2022

dmos62 commented Jan 25, 2022

dmos62 commented Dec 22, 2021 •

edited

Loading

`dataclasses` and `typing`

dmos62 commented Jan 17, 2022 •

edited

Loading

dmos62 commented Jan 20, 2022 •

edited

Loading