Backend filtering refactor #837

dmos62
Contributor

@dmos62 dmos62 commented Nov 22, 2021

Fixes #385, fixes #846

Sets up the backend infrastructure for filtering.

Technical details

Uses our fork of sqlalchemy-filters.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the master branch of the repository
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

My TODO

  • Use Python classes to model filters
  • Have a way to convert Python filter representation to SQLAlchemy filter specification
  • Have a way to generate filter specification for the frontend
  • Find out if/how possible predicates should change depending on column's type
  • Expose possible predicates through the REST API
  • Start discussion about new filter API
  • Accept a filter specification through REST API
  • Apply the requested filter, via SQLAlchemy filters, to the table query
  • Integrate any input from the discussion
  • Resolve conflict with the duplicate-only filter
  • In-code documentation
  • Describe breaking-changes (as relevant to frontend)
  • Reintroduce duplicate-only filter as a dedicated parameter query
  • Port the recent grouping PR's grouping tests to this PR (this PR invalidates many grouping tests)

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Also includes implementation for converting to SA filter spec.
Apparently dataclass defaults don't carry over from mixins.
I was wrong. Apparently I just got the order of mixin application wrong.
There's nothing faux about it.
@kgodey
Contributor

kgodey commented Nov 23, 2021

@dmos62 I see that you're working off of a fork still, you have write access to the repo now so I recommend switching to a branch instead.

@dmos62
Contributor Author

dmos62 commented Nov 24, 2021

At the moment possible predicates are exposed in the REST API like this:

http://localhost:8000/api/v0/databases/1/types/:

[
    {
        "identifier": "boolean",
        "name": "Boolean",
        "db_types": [
            "BOOLEAN"
        ],
        "filters": [
            {
                "superType": "leaf",
                "type": "equal",
                "parameterCount": "single",
                "parameterMathesarType": "boolean"
            },
            {
                "superType": "leaf",
                "type": "greater",
                "parameterCount": "single",
                "parameterMathesarType": "boolean"
            },
            {
                "superType": "leaf",
                "type": "greater_or_equal",
                "parameterCount": "single",
                "parameterMathesarType": "boolean"
            },
            {
                "superType": "leaf",
                "type": "lesser",
                "parameterCount": "single",
                "parameterMathesarType": "boolean"
            },
            {
                "superType": "leaf",
                "type": "lesser_or_equal",
                "parameterCount": "single",
                "parameterMathesarType": "boolean"
            },
            {
                "superType": "leaf",
                "type": "empty",
                "parameterCount": "none"
            },
            {
                "superType": "leaf",
                "type": "in",
                "parameterCount": "multi",
                "parameterMathesarType": "boolean"
            },
            {
                "superType": "branch",
                "type": "not",
                "parameterCount": "single"
            },
            {
                "superType": "branch",
                "type": "and",
                "parameterCount": "multi"
            },
            {
                "superType": "branch",
                "type": "or",
                "parameterCount": "multi"
            }
        ]
    },
    ...
]
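As a rough illustration of how a client might consume this response, the sketch below groups one type's filters by superType. The payload is a trimmed copy of the response above, and the helper name filters_by_super_type is hypothetical, not something this PR defines.

```python
import json
from collections import defaultdict

# Trimmed copy of the response shown above (three of the boolean filters).
SAMPLE_RESPONSE = """
[
    {
        "identifier": "boolean",
        "name": "Boolean",
        "db_types": ["BOOLEAN"],
        "filters": [
            {"superType": "leaf", "type": "equal",
             "parameterCount": "single", "parameterMathesarType": "boolean"},
            {"superType": "leaf", "type": "empty", "parameterCount": "none"},
            {"superType": "branch", "type": "and", "parameterCount": "multi"}
        ]
    }
]
"""


def filters_by_super_type(ma_type):
    """Group the filter type names of one type entry by superType (hypothetical helper)."""
    grouped = defaultdict(list)
    for filter_spec in ma_type["filters"]:
        grouped[filter_spec["superType"]].append(filter_spec["type"])
    return dict(grouped)


types = json.loads(SAMPLE_RESPONSE)
boolean_filters = filters_by_super_type(types[0])
print(boolean_filters)
```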

@kgodey
Contributor

kgodey commented Nov 24, 2021

Couple of quick comments:

  • I don't think we should mix snake case and camel case in key names. I think we should stick with snake case since that's what the rest of the API uses. (This also applies to function names).
  • I'm not sure what superType means.

@dmos62
Contributor Author

dmos62 commented Nov 25, 2021

@kgodey thanks for noticing the casing conflict. I'm using the tree abstraction for predicates. The empty predicate's superType is leaf, because it's always a leaf node on the predicate tree (has height zero). branch predicates are never leaves (have height that's non-zero), like and, or or not. Example composite predicate:

and(not(empty(field1)), equal(field2, value))

and and not will always have other predicates within them (they're branches), while empty and equal never have predicates within them (they're leaves).
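To make the tree picture concrete, here is a minimal sketch of the composite predicate above as a tree of small immutable objects, with a function that computes node height. The class names (Empty, Equal, Not, And) and the height helper are stand-ins invented for this illustration; they are not the classes in this PR.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Empty:  # leaf: takes a column and no parameter
    column: str


@dataclass(frozen=True)
class Equal:  # leaf: takes a column and one value
    column: str
    parameter: Any


@dataclass(frozen=True)
class Not:  # branch: its single parameter is another predicate
    parameter: Any


@dataclass(frozen=True)
class And:  # branch: its parameters are other predicates
    parameters: Tuple[Any, ...]


LEAVES = (Empty, Equal)


def height(predicate):
    """Leaves have height zero; a branch is one higher than its tallest child."""
    if isinstance(predicate, LEAVES):
        return 0
    if isinstance(predicate, Not):
        return 1 + height(predicate.parameter)
    return 1 + max(height(child) for child in predicate.parameters)


# and(not(empty(field1)), equal(field2, value))
tree = And((Not(Empty("field1")), Equal("field2", "value")))
print(height(tree))
```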

@kgodey
Contributor

kgodey commented Nov 26, 2021

I'm using the tree abstraction for predicates. The empty predicate's superType is leaf, because it's always a leaf node on the predicate tree (has height zero). branch predicates are never leaves (have height that's non-zero), like and, or or not.

I figured that out, I meant specifically that the "super type" nomenclature is a little confusing, if I was just paying attention to the key names, seems like it's a superset of the "type" key somehow (which it's not). Is there a more obvious name for it? Or are you using some standard set of names derived from something else?


def not_empty(l): return len(l) > 0

def assertPredicateCorrect(predicate):
Contributor

@dmos62 we use snake case for Python functions and variables too. We only use camel case (capitalized) for class names.

Contributor Author

Fixed it. Thanks for pointing that out.

@dmos62
Contributor Author

dmos62 commented Nov 26, 2021

superType is a superset in that every Predicate subclass (every type in other words) is also a subclass of one of the superTypes. Or do you mean something else?

I'm open to suggestions. Calling it a parent type would have a similar meaning. That's pretty much talking about the underlying class/mixin hierarchy. We could also have nomenclature that talks about predicate names (e.g. empty, greater, and) and positions in predicate trees (leaf or branch).

@kgodey
Contributor

kgodey commented Nov 26, 2021

Calling it a parent type would have a similar meaning. That's pretty much talking about the underlying class/mixin hierarchy.

API users will probably not know or care about how it's implemented, I think the nomenclature should prioritize API readability.

We could also have nomenclature that talks about predicate names (e.g. empty, greater, and) and positions in predicate trees (leaf or branch).

I like this. How about name instead of type and position instead of super_type?
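For illustration only, this is roughly what one filter entry would look like if both suggestions were applied (snake case keys, plus name and position in place of type and super_type). The helper names and the rename table are hypothetical; nothing in this thread confirms the final key names.

```python
import re

# Hypothetical rename table for the proposal above.
KEY_RENAMES = {"type": "name", "super_type": "position"}


def to_snake_case(key):
    """Convert a camelCase key such as 'parameterCount' to 'parameter_count'."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", key).lower()


def rename_keys(filter_spec):
    """Return a copy of the entry with snake case keys and the proposed renames."""
    renamed = {}
    for key, value in filter_spec.items():
        snake_key = to_snake_case(key)
        renamed[KEY_RENAMES.get(snake_key, snake_key)] = value
    return renamed


current_entry = {
    "superType": "leaf",
    "type": "equal",
    "parameterCount": "single",
    "parameterMathesarType": "boolean",
}
proposed_entry = rename_keys(current_entry)
print(proposed_entry)
```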

@dmos62
Contributor Author

dmos62 commented Dec 9, 2021

Since @mathemancer is still making changes on the dependency PR (#862), I'll hold off on merging it into this one.

@dmos62 dmos62 mentioned this pull request Dec 10, 2021
7 tasks
@kgodey kgodey marked this pull request as draft December 10, 2021 22:13
Contributor

@kgodey kgodey left a comment

I'll review this in more detail later.

db/filters/base.py (resolved review thread)
@kgodey kgodey mentioned this pull request Dec 12, 2021
7 tasks
@silentninja
Contributor

Have we decided on using dataclasses and typing in our codebase?

Contributor

@mathemancer mathemancer left a comment

First pass review. I'll be more precise once this is more stabilized. However, I have some broader comments to make. Overall, I really like the tree concept. I do think there's a false dichotomy introduced; there are branches and leaves, but you can't assume that an "and" is a branch (since it could be an "and" between two boolean columns). I.e., the dichotomy isn't between types exactly.

My biggest concern is the specificity. I really think we need to take a try-and-see-what-happens approach to some of these things, rather than trying to check everything beforehand. Let the DB tell you if a given proposition makes no sense, and handle the error. This will be more flexible, and avoid having to define things in multiple places. In the long run, I really think it'll be easier to maintain that way. For example, avoiding specifying branch vs. leaf is more flexible for composition. I acknowledge that it would sometimes run into issues, but that can be handled by really good feedback and errors.
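A minimal sketch of that try-and-see approach, assuming a stand-in setup: it uses the standard library's sqlite3 rather than the project's PostgreSQL and SQLAlchemy stack, and the items table and apply_filter helper are hypothetical. The filter is sent to the database without pre-validation, and any database error is turned into feedback.

```python
import sqlite3


def apply_filter(connection, where_clause, parameters=()):
    """Run the filter as given and report the DB's verdict instead of pre-checking it.

    where_clause is interpolated directly, so in this sketch it must be a
    trusted string; only the values go through bound parameters.
    """
    query = f"SELECT id FROM items WHERE {where_clause}"
    try:
        rows = connection.execute(query, parameters).fetchall()
    except sqlite3.Error as error:
        # The database rejected the proposition; surface its message.
        return {"ok": False, "error": str(error)}
    return {"ok": True, "ids": [row[0] for row in rows]}


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, price REAL)")
connection.executemany("INSERT INTO items (price) VALUES (?)", [(1.5,), (20.0,)])

good = apply_filter(connection, "price > ?", (10,))
bad = apply_filter(connection, "no_such_column > ?", (10,))
print(good)
print(bad)
```

The trade-off is the one noted above: nothing is defined twice, but the quality of the feedback depends on how well the database's error is translated for the user.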

Final note for this round: we need to come to some kind of team-wide agreement about type hints (and by implication dataclasses). I'm ambivalent on these issues, but I suspect I'm the only one.

Resolved review threads: db/filters/operations/deserialize.py (1), db/filters/base.py (4)
@kgodey
Contributor

kgodey commented Dec 21, 2021

Have we decided on using dataclasses and typing in our codebase?

Final note for this round: we need to come to some kind of team-wide agreement about type hints (and by implication dataclasses). I'm ambivalent on these issues, but I suspect I'm the only one.

I think we should take the discussion about dataclasses and type hints to a GitHub discussion so as to not clutter up this PR. @dmos62, I think it would make sense for you to start this.

@kgodey
Contributor

kgodey commented Dec 21, 2021

@dmos62 This is a large PR. I think it would help me review the code if you could do a write up of the changes. Topics I think would be useful:

  • Explaining the code structure and why you chose it + any alternatives you considered.
  • Extensibility - how to add new filters and new data types
  • The benefits of dataclasses in this particular application

I'm having a hard time grokking the code because it doesn't seem Pythonic somehow. That's probably not useful but I can't articulate any more specifics, I'm hoping reading through your explanation will help me either understand the code or articulate why it's hard for me to understand.

@dmos62 dmos62 changed the base branch from master to range_grouping December 21, 2021 08:01
@dmos62
Contributor Author

dmos62 commented Dec 21, 2021

Writeup

I'm collapsing this write-up; please see its copy-pasted (and possibly updated) version on the initial post of the new thread for this PR.

Features

Note that I may use the terms filter specification and predicate interchangeably.

Some of the things this new predicate data structure does:

  • Declares (predicate) correctness to a high degree
    • An illegal predicate cannot be instantiated; an instantiated predicate is always legal
      • Notable exception
        • A predicate referencing nonexistent columns can be instantiated
          • Whether the referenced columns exist is checked right before applying the filter specification to a query
  • Declares what predicates the frontend can use under what circumstances and how
    • It can tell if a predicate is compatible with some column type based on its properties (like comparability, compatibility with LIKE, compatibility with URI-type-specific SQL functions)
    • It can tell the frontend how many parameters a predicate takes (e.g. empty, equal and in take different numbers of parameters) and what options it accepts (e.g. whether starts_with should be case insensitive)
    • It can tell the frontend how to compose predicates: there are leaf and branch predicates, and the parameters of branch predicates are other predicates
  • It supports SQL functions
    • You can use SQL functions to, for example, destructure a URI and filter based on that
      • Though this extension is/will be in a newer PR
      • Was not possible with sqlalchemy-filters

What it does not (currently) do:

  • Does not support using other columns as parameters
    • Can't do {equal: {column: x, parameter: {column: y}}}
    • This is an oversight
    • Does not seem difficult to implement if/when there's interest

Technical details

I organized the relevant pieces in the predicate data structure into a mixin hierarchy and also used this PR as a testbed for frozen dataclasses. Some of my objectives with the basic structure were:

  • Immutability
    • Used properties where I could
      • Didn't use properties where an SQLAlchemy filter is returned, since its mutability is uncertain
  • Many small classes
    • Use the mixin/type system to capture information
      • A class that is a predicate that takes one parameter and relies on LIKE will directly or indirectly mix in ReliesOnLike, SingleParameter, Leaf and Predicate
    • Some logic I offloaded to, and hence centralized in, static methods: correctness is declared in db/filters/base::assert_predicate_correct, and the shape of the JSON filter specification is declared in db/filters/operations/deserialize
      • Has the drawback that the control flow in these methods can seem daunting, since it walks through the mixin hierarchy
        • These methods are big but not complicated: each branch is simple; there are just many of them
        • Might fan this logic out into the mixin definitions
  • Used parameter where a singular parameter is expected and parameters where a sequence is expected
    • I have reservations about this
      • It makes the specifics of single/multi-parameter requirements obvious
      • Procedural instantiation is more verbose (like when testing), since you have to change the argument name depending on circumstance
        • But that can be solved with an auxiliary constructor that's only for utilities
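The structure described above can be sketched roughly as follows. This is a simplified stand-in, not the PR's code: StartsWith is a hypothetical predicate, the body of assert_predicate_correct is reduced to two illustrative checks, and the standard dataclass(frozen=True) decorator is used in place of the PR's own helpers.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Any


class Leaf:
    """Marker mixin: a predicate that never contains other predicates."""


class ReliesOnLike:
    """Marker mixin: a predicate implemented with SQL LIKE."""


@dataclass(frozen=True)
class Predicate:
    column: str

    def __post_init__(self):
        # Runs at the end of the generated __init__, so an incorrect
        # predicate never finishes construction.
        assert_predicate_correct(self)


@dataclass(frozen=True)
class SingleParameter:
    parameter: Any


@dataclass(frozen=True)
class StartsWith(ReliesOnLike, SingleParameter, Leaf, Predicate):
    """Hypothetical predicate; its properties come entirely from its mixins."""


def assert_predicate_correct(predicate):
    """Centralized correctness check that walks the mixin hierarchy (simplified)."""
    if isinstance(predicate, SingleParameter) and predicate.parameter is None:
        raise ValueError("a single-parameter predicate requires a parameter")
    if isinstance(predicate, ReliesOnLike) and not isinstance(predicate.parameter, str):
        raise ValueError("a LIKE-based predicate requires a string parameter")


valid = StartsWith(column="title", parameter="Mat")

try:
    StartsWith(column="title", parameter=42)
except ValueError as error:
    print("rejected:", error)

try:
    valid.parameter = "other"
except FrozenInstanceError:
    print("predicates are immutable")
```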

How to extend with new predicate

  1. Introduce the appropriate class; for example, Greater; this includes defining the new Predicate's properties through mixins: ReliesOnComparability, SingleParameter, Leaf in this case (note that mixin order matters, see Python docs);
@frozen_dataclass
class Greater(ReliesOnComparability, SingleParameter, Leaf):
    type: LeafPredicateType = static(LeafPredicateType.GREATER)
    name: str = static("Greater")

    def to_sa_filter(self):
        return column(self.column) > self.parameter
  2. Introduce the predicate type enum: LeafPredicateType.GREATER in this case; it's what the JSON filter spec will use to identify a predicate;

  3. Tell it how to generate an equivalent SQLAlchemy filter by implementing the abstract Predicate::to_sa_filter (as seen above);

  4. Update the correctness definition (db/filters/base::assert_predicate_correct), if needed; currently this involves finding the spot in the method's control flow tree that corresponds to this new predicate, adding a new type check, etc.;

  5. Add the new predicate to the all_predicates set;

  6. Update mathesar.database.types::is_ma_type_supported_by_predicate, if needed; this involves declaring which types have the properties the new predicate depends on; in this PR that looks like:

def _is_ma_type_comparable(ma_type: MathesarTypeIdentifier) -> bool:
    return ma_type in comparable_mathesar_types

def is_ma_type_supported_by_predicate(ma_type: MathesarTypeIdentifier, predicate: Type[Predicate]):
    if relies_on_comparability(predicate):
        return _is_ma_type_comparable(ma_type)
    else:
        return True

Notice that relies_on_comparability is an auxiliary function returning true when a predicate is a subclass of ReliesOnComparability.
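A self-contained sketch of how such a subclass check could look, under stated assumptions: the classes below are empty stand-ins for the PR's predicates, Mathesar type identifiers are plain strings here, and the contents of COMPARABLE_TYPES are placeholder values rather than the project's actual list.

```python
from typing import Type


class Predicate:
    """Stand-in for the PR's base predicate class."""


class ReliesOnComparability:
    """Marker mixin: the predicate needs a type whose values can be ordered."""


class Greater(ReliesOnComparability, Predicate):
    pass


class Empty(Predicate):
    pass


# Placeholder values for illustration only.
COMPARABLE_TYPES = {"number"}


def relies_on_comparability(predicate: Type[Predicate]) -> bool:
    """True when the predicate class (not an instance) mixes in ReliesOnComparability."""
    return issubclass(predicate, ReliesOnComparability)


def is_ma_type_supported_by_predicate(ma_type: str, predicate: Type[Predicate]) -> bool:
    if relies_on_comparability(predicate):
        return ma_type in COMPARABLE_TYPES
    return True


print(is_ma_type_supported_by_predicate("number", Greater))
print(is_ma_type_supported_by_predicate("some_other_type", Greater))
print(is_ma_type_supported_by_predicate("some_other_type", Empty))
```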

dataclasses and typing

As Kriti suggested, I'll start a dedicated discussion on the use of dataclasses and typing.

But, to summarize:

  • dataclasses eliminated custom boilerplate; I got immutability, __post_init__ and defaults without writing a single constructor; I think this is great since it's essentially standardized boilerplate;
  • typing got me partial typing, which is great:
    • in combination with my LSP server/client (pyright): catches a lot of mistakes and conflicts
    • in that I can annotate when I think it's useful and not when I don't (partial typing)
    • for readability, for example the above method is_ma_type_supported_by_predicate operates on uninstantiated predicate classes, not instances, so to express that I can just say predicate: Type[Predicate] instead of predicate_subclass
    • doesn't have a cost; another developer can just omit types or write Any if they prefer
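One dataclass behaviour worth keeping in mind with mixins, as mentioned earlier, is that the order of the bases decides which field default wins. The small sketch below shows this with two hypothetical mixins that declare the same field; the class names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseSensitive:
    case_sensitive: bool = True


@dataclass(frozen=True)
class CaseInsensitive:
    case_sensitive: bool = False


@dataclass(frozen=True)
class First(CaseInsensitive, CaseSensitive):
    """CaseInsensitive comes first in the MRO, so its default is used."""


@dataclass(frozen=True)
class Second(CaseSensitive, CaseInsensitive):
    """Same mixins in the opposite order: CaseSensitive's default is used."""


print(First().case_sensitive)
print(Second().case_sensitive)
# Equality is generated too, without writing any constructor or __eq__.
print(First() == First())
```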

@mathemancer mathemancer deleted the branch mathesar-foundation:range_grouping December 22, 2021 11:03
@dmos62 dmos62 mentioned this pull request Dec 22, 2021
7 tasks
@dmos62 dmos62 removed this from the [07] Initial Data Types milestone Dec 22, 2021
Labels
needs: unblocking (Blocked by other work); pr-status: review (A PR awaiting review); work: backend (Related to Python, Django, and simple SQL)
Development

Successfully merging this pull request may close these issues.

  • Duplicates filter should work with other filters
  • Implement filtering options for Number types.
5 participants