Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlglot.parse_one - use read keyword argument #1996

Merged
merged 2 commits into from
Feb 23, 2024

Conversation

ADBond
Copy link
Contributor

@ADBond ADBond commented Feb 23, 2024

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Fixes #1995.

For older versions (before 17.1.0) of sqlglot, SqlglotColumnTreeBuilder.from_raw_column_name_or_column_reference
would break.

Give a brief description for the solution you have provided

in 17.1.0 the dialect keyword argument was introduced as a synonym for read. But this breaks on earlier versions of sqlglot, so this changes to use the read parameter, which is compatible with 'all' (that we support) versions of sqlglot.

As a side-point, I wonder if it makes sense for us to wrap parse_one, as we use it in several places in the codebase? That way if in the future the api changes and we need to provide compatibility code, we only need it in one place. We also avoid these kinds of issues, as we wouldn't need to add further calls (directly) to parse_one, where it would be easy to accidentally use dialect= if the developer has a newer sqlglot installed (and as used in CI).

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

in 17.1.0 the `dialect` keyword argument was introduced as a synonym for `read`. But this breaks on earlier versions of `sqlglot`.
@ADBond ADBond requested a review from RobinL February 23, 2024 14:22
@ADBond
Copy link
Contributor Author

ADBond commented Feb 23, 2024

Test script (which breaks prior to this PR for sqlglot < 17.1.0:

expand
import pandas as pd

from splink.duckdb.linker import DuckDBLinker

df = pd.DataFrame(
    [
        {
            "unique_id": 1,
            "bothinitials": "BA",
            "sex": "M",
            "dob": "2020-01-01",
            "sector": "Q",
        },
        {
            "unique_id": 2,
            "bothinitials": "ZO",
            "sex": "M",
            "dob": "2020-03-01",
            "sector": "T",
        },
        {
            "unique_id": 3,
            "bothinitials": "BA",
            "sex": "M",
            "dob": "2020-01-01",
            "sector": "Q",
        },
        {
            "unique_id": 4,
            "bothinitials": "ZB",
            "sex": "F",
            "dob": "2020-05-01",
            "sector": "Q",
        },
        {
            "unique_id": 5,
            "bothinitials": "OK",
            "sex": "F",
            "dob": "2020-01-01",
            "sector": "Q",
        },
        {
            "unique_id": 6,
            "bothinitials": "NE",
            "sex": "F",
            "dob": "2020-06-01",
            "sector": "Q",
        },
        {
            "unique_id": 7,
            "bothinitials": "OK",
            "sex": "M",
            "dob": "2020-08-01",
            "sector": "T",
        },
        {
            "unique_id": 8,
            "bothinitials": "NE",
            "sex": "F",
            "dob": "2020-01-01",
            "sector": "Q",
        },
    ]
)

deterministic_brs = [
    # Non null match on both initials
    "l.bothinitials = r.bothinitials AND "
    # Non null match on date of birth
    "l.dob = r.dob AND "
    # Non null match on sex
    "l.sex = r.sex AND "
    # Non null match on postcode sector
    "l.sector = r.sector"
]

deterministic_settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": deterministic_brs,
}
deterministic_linker = DuckDBLinker(df, deterministic_settings)

@ADBond ADBond added the bug Something isn't working label Feb 23, 2024
@ADBond ADBond merged commit dd2df28 into master Feb 23, 2024
10 checks passed
@ADBond ADBond deleted the bug/older-sqlglot-compatibility branch February 23, 2024 14:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deduplication doesn't work TypeError: Parser.__init__() got an unexpected keyword argument 'dialect'
2 participants