Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Pairwise array comparison levels #1337

Open
ADBond opened this issue Jun 14, 2023 · 3 comments
Open

[FEAT] Pairwise array comparison levels #1337

ADBond opened this issue Jun 14, 2023 · 3 comments
Assignees
Labels
comparison levels enhancement New feature or request

Comments

@ADBond
Copy link
Contributor

ADBond commented Jun 14, 2023

Is your proposal related to a problem?

We have the udf dualarrayexplode which gives us the Cartesian product of two arrays which we sometimes use for comparison levels. Currently we hand-roll the SQL, but it would be useful to have a shortcut for this.

Describe the solution you'd like

Would be nice to have a couple of ways to construct this - using string templates (or something similar), or from comparison levels.

For instance, pairwise levenshtein comparison level (minimum levenshtein distance between arrays less than 2)

# for each x in name_arr_l, y in name_arr_r, condition is min(levenshtein(x, y)) <= 2
cll.pairwise_array_level("name_arr", sql_snippet="levenshtein({col_l}, {col_r})", aggregate_fun="min", threshold=2)

or the same thing from a comparison level:

# for each x in name_arr_l, y in name_arr_r, condition is min(levenshtein(x, y)) <= 2
cll.pairwise_array_from_level(cll.levenshtein_level("name_arr", 2), aggregate_fun="min")

or for equality (if for instance we need to count instances so array_length_intersection is inappropriate):

cll.pairwise_array_level("name_arr", sql_snippet="{col_l} = {col_r}", aggregate_fun="any")
cll.pairwise_array_from_level(cll.exact_match_level("name_arr"))

Describe alternatives you've considered

Could have specialised versions - e.g. one for Levenshtein, one for equality, etc, but a suitably flexible option probably covers more use-cases without providing an overwhelming number of new levels.

Additional context

Might need some thought about range of use-cases + the best interface for this.

@RossKen
Copy link
Contributor

RossKen commented Sep 19, 2023

Explore functionality in DuckDB

@samkodes
Copy link
Contributor

@RossKen Since you mention DuckDB, have you considered something like my suggestion to use in iterated list_reduce in DuckDB in issue #1994? Is that approach viable?

@RobinL
Copy link
Member

RobinL commented Feb 23, 2024

Yep it should be viable, see:
#1994 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comparison levels enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants