Fix test failures in duckdb 0.10.0 #1999

Merged: RobinL merged 9 commits into master from duckdb_0_10_0_fixes on Mar 4, 2024


@RobinL (Member) commented Feb 26, 2024

The root cause of the error is that we've been relying on implicit typecasting, which is no longer possible in duckdb 0.10.0.

From duckdb 0.10.0 onwards, if dob is a date or timestamp, you cannot do this:
LEVENSHTEIN("dob_l", "dob_r") <= 1

This means that every column has to be explicitly cast to the type the function expects.

Error was: Binder Error: No function matches the given name and argument types 'levenshtein(TIMESTAMP_NS, TIMESTAMP_NS)'. You might need to add explicit type casts.
	Candidate functions:
	levenshtein(VARCHAR, VARCHAR) -> BIGINT
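
One way to satisfy the binder is to cast both sides to VARCHAR before calling the function. The sketch below only builds the SQL text; the helper names cast_to_varchar and levenshtein_condition are hypothetical and are not part of Splink or DuckDB.

```python
def cast_to_varchar(column: str) -> str:
    """Wrap a quoted column reference in an explicit VARCHAR cast."""
    return f'CAST("{column}" AS VARCHAR)'


def levenshtein_condition(column: str, threshold: int) -> str:
    """Build a levenshtein comparison over the _l and _r versions of a column.

    Both sides are cast explicitly, so the expression no longer depends on
    implicit typecasting when the column is a date or timestamp.
    """
    left = cast_to_varchar(f"{column}_l")
    right = cast_to_varchar(f"{column}_r")
    return f"LEVENSHTEIN({left}, {right}) <= {threshold}"


print(levenshtein_condition("dob", 1))
# LEVENSHTEIN(CAST("dob_l" AS VARCHAR), CAST("dob_r" AS VARCHAR)) <= 1
```

Note that casting a timestamp to VARCHAR includes the time component, so the edit distance is computed over the full rendered string rather than just the date part.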

Note that you can also no longer do try_strptime("dob_l", '<format string>') if dob is a datetime:

Error was: Binder Error: No function matches the given name and argument types 'try_strptime(TIMESTAMP_NS, STRING_LITERAL)'. You might need to add explicit type casts.
	Candidate functions:
	try_strptime(VARCHAR, VARCHAR) -> TIMESTAMP
	try_strptime(VARCHAR, VARCHAR[]) -> TIMESTAMP

My proposed 'solution' is for the function to work in duckdb 0.10.0 only if the input is a string (cast_strings_to_date: bool = True), but NOT to error out if the user sets cast_strings_to_date: bool = False, because that would break things in duckdb < 0.10.0.

I think this is reasonable because many of the function arguments only make sense if the input is a string, e.g.:

  • invalid_dates_as_null cannot be used if the input is a date
  • separate_1st_january

When users have problems, we need to tell them to:

  1. cast the date to a string before bringing it into Splink, OR
  2. downgrade back to duckdb 0.9.2

Example:
import pandas as pd

import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000
# convert dob column to datetime
df["dob"] = pd.to_datetime(df["dob"], format="%Y-%m-%d")

# get example value to check it's a datetime
print(df["dob"].iloc[0])

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),

    ],
    "comparisons": [
        cl.exact_match("first_name"),
        cl.exact_match("surname"),
        ctl.date_comparison("dob", levenshtein_thresholds=[1]),
        cl.exact_match("city"),
        cl.exact_match("email"),
    ],
}

linker = DuckDBLinker(df, settings)
linker.predict()
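
The first workaround, casting the date to a string before it reaches Splink, could look like the sketch below. The helper name dates_to_iso_strings is hypothetical, and the %Y-%m-%d format is an assumption taken from the pd.to_datetime call in the example above.

```python
import pandas as pd


def dates_to_iso_strings(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with a datetime column rendered as YYYY-MM-DD strings.

    Hypothetical helper: the input frame is left unchanged.
    """
    out = df.copy()
    out[column] = out[column].dt.strftime("%Y-%m-%d")
    return out


# Small stand-in for the fake_1000 dataset used in the example above.
example = pd.DataFrame({"dob": pd.to_datetime(["1990-01-31", "2001-12-05"])})
as_strings = dates_to_iso_strings(example, "dob")
print(as_strings["dob"].tolist())
```

Applying this to df before constructing DuckDBLinker(df, settings) means dob arrives as a string column, which is the input the date comparison expects when cast_strings_to_date is True.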

@RobinL RobinL changed the title (WIP) Fix test failures in duckdb 0.10.0 Fix test failures in duckdb 0.10.0 Feb 26, 2024
@RobinL RobinL requested a review from ADBond February 26, 2024 17:00
@@ -257,22 +257,6 @@ def test_duckdb_arrow_array():
assert len(df) == 2


@mark_with_dialects_including("duckdb")
@RobinL (Member Author) commented:

This test is no longer needed with later versions of pandas, which seem to deal with typing better.

@ADBond (Contributor) left a comment:

Cool, makes sense.
Not sure how the lockfile fits with the one in #1998 but presume you know which is more up-to-date

@RobinL RobinL merged commit 5d2e0d9 into master Mar 4, 2024
13 checks passed
@RobinL RobinL deleted the duckdb_0_10_0_fixes branch March 4, 2024 13:48