Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backend filtering refactor #921

Closed
wants to merge 91 commits into from
Closed

Backend filtering refactor #921

wants to merge 91 commits into from

Conversation

dmos62
Copy link
Contributor

@dmos62 dmos62 commented Dec 22, 2021

Fixes #385, fixes #846

Sets up the backend infrastructure for filtering.

Features

Notice that I might use the terms filter specification and predicate interchangeably.

Some of the things this new predicate data structure does is:

  • Declares (predicate) correctness to a high degree
    • An illegal predicate cannot be instantiated; an instantiated predicate is always legal
      • Notable exception
        • A predicate referencing unexisting columns can be instantiated
          • Referenced column existance is checked right before applying the filter specification to a query
  • Declares what predicates the frontend can use under what circumstances and how
    • It can tell if a predicate is compatible with some column type based on its properties (like comparability, compatibility with LIKE, compatibility with URI-type-specific SQL functions)
    • It can tell the frontend how many parameters a predicate takes (e.g. empty, equal and in take different number of params) and it can tell it what options it accepts (e.g. should starts_with be case insensitive)
    • It can tell the frontend how to compose predicates: leaf vs branch predicates: parameters on branches are other predicates
  • It supports SQL functions
    • You can use SQL functions to, for example, destructure a URI and filter based on that
      • Though this extension is/will be in a newer PR
      • Was not possible with sqlalchemy-filters

What it does not (currently) do:

  • Does not support using other columns as parameters
    • Can't do {equal: {column: x, parameter: {column: y}}}
    • This is an oversight
    • Does not seem difficult to implement if/when there's interest

Technical details

I organized the relevant pieces in the predicate data structure into a mixin hierarchy and also used this PR as a testbed for frozen dataclasses. Some of my objectives with the basic structure were:

  • Immutability
    • Used properties where I could
      • Didn't use properties where an SQLAlchemy filter is returned, since its mutability is uncertain
  • Many small classes
    • Use the mixin/type system to capture information
      • A class that is a predicate that takes one parameter and relies on LIKE will directly or indirectly mix in ReliesOnLike, SingleParameter, Leaf and Predicate
    • Some logic I offloaded to and hence centralized in static methods: correctness is declared on db/filters/base::assert_predicate_correct: and, what the JSON filter specification is is declared in db/filters/operations/deserialize
      • Has the drawback that the control flow in these methods can seem daunting, since it walks itself through the mixin hierarchy
        • These methods are not complicated, but big: each branch is simple: it's just that there's many of them
        • Might fan this logic out into the mixin definitions
  • Used parameter where a singular parameter is expected and parameters where a sequence is expected
    • I have reservations about this
      • It makes the specifics of single/multi-parameter requirements obvious
      • Procedural instantiation is more verbose (like when testing), since you have to change the argument name depending on circumstance
        • But that can be solved with an auxillary constructor that's only for utilities

How to extend with new predicate

  1. Introduce the appropriate class; for example, Greater; this includes defining the new Predicate's properties through mixins: ReliesOnComparability, SingleParameter, Leaf in this case (note that mixin order matters, see Python docs);
@frozen_dataclass
class Greater(ReliesOnComparability, SingleParameter, Leaf):
    type: LeafPredicateType = static(LeafPredicateType.GREATER)
    name: str = static("Greater")

    def to_sa_filter(self):
        return column(self.column) > self.parameter
  1. Introduce the predicate type enum: LeafPredicateType.GREATER in this case; it's what the JSON filter spec will use to identify a predicate;

  2. Tell it how to generate an equivalent SQLAlchemy filter by implementing the abstract Predicate::to_sa_filter (as seen above);

  3. Update correctness definition (db/filters/base::assert_predicate_correct), if needed; currently this involves finding the spot in the method's control flow tree that corresponds to this new predicate, adding a new type check, etc.

  4. Add new predicate to the all_predicates set;

  5. Update mathesar.database.types::is_ma_type_supported_by_predicate, if needed; this would involve declaring what types have the properties the new predicate depends on: in this PR it's:

def _is_ma_type_comparable(ma_type: MathesarTypeIdentifier) -> bool:
    return ma_type in comparable_mathesar_types

def is_ma_type_supported_by_predicate(ma_type: MathesarTypeIdentifier, predicate: Type[Predicate]):
    if relies_on_comparability(predicate):
        return _is_ma_type_comparable(ma_type)
    else:
        return True

Notice that relies_on_comparability is an auxiliary function returning true when a predicate is a subclass of ReliesOnComparability.

dataclasses and typing

As Kriti suggested, I'll start a dedicated discussion on the use of dataclasses and typing.

But, to summarize:

  • dataclasses eliminated custom boilerplate; I got immutability, _post_init and defaults without writing a single constructor; I think this is great since it's essentially standardized boilerplate;
  • typing got me partial typing, which is great:
    • in combination with my LSP server/client (pyright): catches a lot of mistakes and conflicts
    • in that I can annotate when I think it's useful and not when I don't (partial typing)
    • for readability, for example the above method is_ma_type_supported_by_predicate operates on uninstantiated predicate classes, not instances, so to express that I can just say predicate: Type[Predicate]) instead of predicate_subclass
    • doesn't have a cost; another developer can just omit type or write Any if he prefers

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the master branch of the repository
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Also includes implementation for converting to SA filter spec.
Apparently dataclass defaults don't carry over from mixins.
I was wrong. Apparently I just got the order of mixin application wrong.
There's nothing faux about it.
@dmos62
Copy link
Contributor Author

dmos62 commented Jan 14, 2022

Above discussed changes had some interesting knock-on effects and I ended up rethinking some things:

  • renamed Expression classes to DbFunctions. Seems more appropriate;
  • moved DbFunctions to the mathesar namespace and then back to the db namespace,
    • because we do filtering in the db namespace;
      • that's where we apply a filter to a query and then execute it before returning it to a caller in the mathesar namespace);
      • so either, we execute a query by passing it a preprocessor callback, or we move the stuff relevant to filtering to the db namespace.

The DbFunction attributes that seem service-layery are the display name and the hints. Display name we can remove and use a mapping between DbFunctions and display names in the service layer. The hints are not as high-level as I previously thought: they are what will let us mix and match database types and DbFunctions when constructing something like a filter on the frontend, among other things.

@dmos62
Copy link
Contributor Author

dmos62 commented Jan 17, 2022

Request for feedback:

Currently the DbFunction (previously known as Predicate and Expression) classes look like this:

db.functions.base:

[...]

class DbFunction(ABC):
    id = None
    name = None
    hints = None

    def __init__(self, parameters):
        if self.id is None:
            raise ValueError('DbFunction subclasses must define an ID.')
        if self.name is None:
            raise ValueError('DbFunction subclasses must define a name.')
        self.parameters = parameters

    @property
    def referenced_columns(self):
        """Walks the expression tree, collecting referenced columns.
        Useful when checking if all referenced columns are present in the queried relation."""
        columns = set([])
        for parameter in self.parameters:
            if isinstance(parameter, ColumnReference):
                columns.add(parameter.column)
            elif isinstance(parameter, DbFunction):
                columns.update(parameter.referenced_columns)
        return columns

    @staticmethod
    @abstractmethod
    def to_sa_expression():
        return None


class ColumnReference(DbFunction):
    id = 'column_reference'
    name = 'Column Reference'
    hints = tuple([
        hints.parameter_count(1),
        hints.parameter(1, hints.column),
    ])

    @property
    def column(self):
        return self.parameters[0]

    @staticmethod
    def to_sa_expression(p):
        return column(p)


class List(DbFunction):
    id = 'list'
    name = 'List'

    @staticmethod
    def to_sa_expression(*ps):
        return list(ps)


class Empty(DbFunction):
    id = 'empty'
    name = 'Empty'
    hints = tuple([
        hints.returns(hints.boolean),
        hints.parameter_count(1),
    ])

    @staticmethod
    def to_sa_expression(p):
        return p.is_(None)

[...]

DbFunction subclasses are exposed via the new /api/v0/functions/ endpoint:

[
    {
        "id": "and",
        "name": "And",
        "hints": [
            {
                "id": "returns",
                "hints": [
                    {
                        "id": "boolean"
                    }
                ]
            }
        ]
    },
    {
        "id": "column_reference",
        "name": "Column Reference",
        "hints": [
            {
                "id": "parameter_count",
                "count": 1
            },
            {
                "id": "parameter",
                "index": 1,
                "hints": [
                    {
                        "id": "column"
                    }
                ]
            }
        ]
    },
    {
        "id": "empty",
        "name": "Empty",
        "hints": [
            {
                "id": "returns",
                "hints": [
                    {
                        "id": "boolean"
                    }
                ]
            },
            {
                "id": "parameter_count",
                "count": 1
            }
        ]
    },
    {
        "id": "equal",
        "name": "Equal",
        "hints": [
            {
                "id": "returns",
                "hints": [
                    {
                        "id": "boolean"
                    }
                ]
            },
            {
                "id": "parameter_count",
                "count": 2
            }
        ]
    },
    {
        "id": "extract_uri_authority",
        "name": "Extract URI Authority",
        "hints": [
            {
                "id": "parameter_count",
                "count": 1
            },
            {
                "id": "parameter",
                "index": 1,
                "hints": [
                    {
                        "id": "uri"
                    }
                ]
            }
        ]
    },
    {
        "id": "in",
        "name": "In",
        "hints": [
            {
                "id": "returns",
                "hints": [
                    {
                        "id": "boolean"
                    }
                ]
            },
            {
                "id": "parameter_count",
                "count": 2
            },
            {
                "id": "parameter",
                "index": 2,
                "hints": [
                    {
                        "id": "array"
                    }
                ]
            }
        ]
    },
    {
        "id": "lesser",
        "name": "Lesser",
        "hints": [
            {
                "id": "returns",
                "hints": [
                    {
                        "id": "boolean"
                    }
                ]
            },
            {
                "id": "parameter_count",
                "count": 2
            },
            {
                "id": "all_parameters",
                "hints": [
                    {
                        "id": "comparable"
                    }
                ]
            }
        ]
    },
    {
        "id": "list",
        "name": "List",
        "hints": null
    },
    {
        "id": "not",
        "name": "Not",
        "hints": [
            {
                "id": "returns",
                "hints": [
                    {
                        "id": "boolean"
                    }
                ]
            },
            {
                "id": "parameter_count",
                "count": 1
            }
        ]
    },
    [...]
]

The frontend will look at the available functions via above endpoint; it will look at the hints defined on those functions and decide how to suggest/constrict user's choices when constructing a database expression.

Suppose the user wants to define a filter. The frontend might make sure that the top-level function used returns a boolean (a filter is an expression that resolves to a boolean), then that the nested expressions' return types are compatible with the nesting functions' parameter hints. Also, worth noting that this expression language is maximally simple: everything is a function, even column references, literals and lists.

An example filter expression in JSON might look like this:

{"in": [
  {"column_reference": ["column1"]}, 
  {"list": [
    {"column_reference": ["column2"]}, 
    {"to_lowercase": [
      {"column_reference": ["column3"]}, 
    ]},
    {"literal": ["foo"]},
    {"literal": ["bar"]},
  ]}, 
]}

This would be equivalent to:

In([
  ColumnReference(["column1"]),
  List([
    ColumnReference(["column2"]),
    ToLowercase([ColumnReference(["column3"])),
    Literal(["foo"]),
    Literal(["bar"]),
  ]),
])

Which would be equivalent to following SQL:

column1 IN (column2, LOWER(column3), "foo", "bar")

On the backend, database function expressions will receive a minimum of checks. The aim is to constrict the user as little as possible. The frontend will be able to guide the user's choices by using the hint system (hints were previously called suggestions; using hints because shorter). The hint system is very simple, but expressive. Here's the definition of hints on the backend:

db.functions.hints:

from frozendict import frozendict


def _make_hint(id, **rest):
    return frozendict({"id": id, **rest})


def parameter_count(count):
    return _make_hint("parameter_count", count=count)


def parameter(index, *hints):
    return _make_hint("parameter", index=index, hints=hints)


def all_parameters(*hints):
    return _make_hint("all_parameters", hints=hints)


def returns(*hints):
    return _make_hint("returns", hints=hints)


boolean = _make_hint("boolean")


comparable = _make_hint("comparable")


column = _make_hint("column")


array = _make_hint("array")


string_like = _make_hint("string_like")


uri = _make_hint("uri")

This allows the backend to specify the number of parameters expected (where appropriate), to define hints for all parameters and/or for individual parameters, and to define hints for the return type. New hints are trivial to add.

Previous major revision of this refactor didn't require the frontend to know anything about the filter constructs. This revision does not require knowing anything about the functions, but it does require knowing about the hint system. That is the price of the simplicity and flexibility offered. For example, it will have to be hardcoded in the frontend that a filter expression is a nesting of database functions where the top-level one has the hint {"returns": ["boolean"]}.

When it will come to mixing and matching database functions to data types, we'll define relevant hints on the data types. So the frontend, having looked up database functions and data types, will see that a BOOLEAN Postgres type has the {"id": "boolean"} hint (equivalent of db.functions.hints.boolean) and so will know that it should be a safe input to a function where the parameter has the same hint ({"id": "boolean"}).

@kgodey @mathemancer @silentninja does this make sense?

@kgodey
Copy link
Contributor

kgodey commented Jan 18, 2022

I think this all makes sense to me, I like how extensible and simple it is.

The only things that I'd like to add is:

  • There should be some way for the frontend to know which filters we're supporting in the UI for each data type. I'd recommend putting a list of filters each type supports in the the /api/v0/databases/<id>/types/ endpoint (which should actually just be /api/v0/types/ per discussion in Restructure our URLs into a hierarchy #974, now that I think of it, but that can be a separate issue). This list can just be a pointer to the relevant function in /api/v0/functions/.
  • Please add documentation on how this system is structured and how to add new filters in the Architecture section of the wiki once this is merged.

@mathemancer
Copy link
Contributor

Overall, I think this makes sense. Looking at this, I'm convinced that it does make sense to have the hints, etc. in with the to_sa_expression part. Since this lets users construct fairly general make sure to wrap any string (or data) that we're passing to to_sa_expression in a text blob. It would be good if that was somehow handled generally to avoid relying on every future implementer getting it right for every such class.

@kgodey I thought we agreed (in some other spot on GH) that filters aren't (in general) subordinate to types, but rather the other way around. For example, I could filter on whether a string in column A is longer than the integer in column B. The associated function would take a string-like and integer-like input, and return a boolean, so it wouldn't make sense to put it under either the TEXT or INTEGER types. Rather, it makes sense to decide you want to do some sort of function like that, and then choose the appropriate columns for arguments. Maybe a compromise would be to be able to filter the list of functions to restrict to those that involve a given type as at least one argument?

The same issues apply to groupings for similar reasons. I think it makes more sense to keep the filters and other functions under some /.../functions/ endpoint.

@silentninja
Copy link
Contributor

I agree with @mathemancer suggestion to have it under /function endpoint and not in types.

@dmos62 Looks good, I got a few questions:

  1. How would overloaded functions work?
  2. what does all_parameter mean, is it used as a blanket to say any comparable or any string?

@kgodey
Copy link
Contributor

kgodey commented Jan 19, 2022

@kgodey I thought we agreed (in some other spot on GH) that filters aren't (in general) subordinate to types, but rather the other way around. For example, I could filter on whether a string in column A is longer than the integer in column B. The associated function would take a string-like and integer-like input, and return a boolean, so it wouldn't make sense to put it under either the TEXT or INTEGER types. Rather, it makes sense to decide you want to do some sort of function like that, and then choose the appropriate columns for arguments. Maybe a compromise would be to be able to filter the list of functions to restrict to those that involve a given type as at least one argument?

@mathemancer I think this makes sense at the db module and API level, but we have specific filters and groups available for specific types in the Mathesar frontend - see the Global Data Types spec. The frontend needs some way to figure this out from the API.

I was suggesting leaving the functions endpoint alone and adding a separate endpoint that focuses only on what we've specced out for the UI. That way, someone can use the functions endpoint if they want to access all DB functions and the separate endpoint if they want to use our product abstractions. I guess we could create a filters endpoint instead of overloading the types endpoint for that.

This conversation is making me think that we should have separate API paths for our frontend-specific abstractions (Mathesar types + the filters they support + any other abstractions we come up with) and general DB functionality. I think that's out of scope for this issue, though.

@dmos62
Copy link
Contributor Author

dmos62 commented Jan 20, 2022

@kgodey

There should be some way for the frontend to know which filters we're supporting in the UI for each data type.

Yeah, definitely. Hints fill that role. To say that a function can take strings as input, you give it a hint along the lines of hints.parameter(1, hints.string_like) (could be read "I'm hinting that the first parameter is a string"). Then we give the string-like DB types the same hint (that is not yet implemented, but it's straightforward, only a mapping). Imagine:

GET /api/v0/db_types:

[
    {
        "id": "character varying",
        "name": "Character varying",
        "hints": [
            {
                "id": "string_like"
            }
        ]
    },
    [...]
]

So, when frontend queries both the /functions and the /db_types endpoints, it has all the information necessary to tell which functions should work with which data types.

@silentninja

How would overloaded functions work?

This only supports a single signature per function. If there's a database function with multiple signatures and we'd like to expose them, we'd do best to wrap them in separate DbFunction subclasses with different hint sets each.

Note that this mechanism does not actually enforce following the hints. As far as db.functions is concerned you can attempt to call and compose DbFunction subclasses however you like and it will attempt to not fail or at least fail as late as possible (preferrably during actual evaluation by Postgres). This is in the spirit of allowing the user as much freedom as possible.

This is also a pain point of this concept, in my mind, since it might be difficult to fail gracefully or to show nice error messages. But, that should only affect advanced users, unless we mess up the hints.

what does all_parameter mean [..]?

hints.parameter(1, *some_set_of_hints) says parameter at index 1 has the hints in some_set_of_hints applied. And, hints.all_parameters(*some_set_of_hints) says all parameters have the hints in some_set_of_hints applied.

When both these hints are provided at the same time, I'd say that the more specific hints.parameter should take precedent for the specific parameter it is targeting, but we might want to look into this specific scenario some more if it comes up in a practical setting. For example, CSS does inheritance and then provides things like hints.unset(hints.some_hint_that_i_dont_want_to_inherit) in the more specific rule to specify when to not inherit.

@kgodey
Copy link
Contributor

kgodey commented Jan 20, 2022

So, when frontend queries both the /functions and the /db_types endpoints, it has all the information necessary to tell which functions should work with which data types.

@dmos62 Okay, that makes sense. Please open a separate issue for the frontend to switch to the new APIs for filters, we can use it to discuss the API spec with the frontend team and make sure it works for them.

@dmos62 dmos62 closed this Jan 25, 2022
@dmos62
Copy link
Contributor Author

dmos62 commented Jan 25, 2022

Closing this PR. Splitting it into two PRs. First, there will be a PR that introduces the new function API (borne out of iterating over this PR). Then, there will be a PR that replaces the old filtering API with the new function API.

@dmos62 dmos62 mentioned this pull request Jan 25, 2022
7 tasks
@dmos62 dmos62 deleted the backend-filtering-numbers branch March 1, 2022 17:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects: architecture Improvements or additions to architecture
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Duplicates filter should work with other filters Implement filtering options for Number types.
4 participants