[BUG/MAINT]. BlockingRule should always be dialected, but this should be taken from the dialect in the settings #1781

Open
RobinL opened this issue Dec 4, 2023 · 0 comments

RobinL commented Dec 4, 2023

I just ran into an issue whereby this

self._settings_obj._blocking_rules_to_generate_predictions = [
    BlockingRule(f"{uid_l} = {uid_r}")
]

failed when the settings object was deepcopied: the copy lost its dialect information, so parsing a string containing a Spark backtick failed.

Full traceback
>       linker.unlinkables_chart(source_dataset="Testing")

tests/test_full_example_spark.py:106: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
splink/linker.py:2970: in unlinkables_chart
    records = unlinkables_data(self)
splink/unlinkables.py:20: in unlinkables_data
    self_link = linker._self_link()
splink/linker.py:2013: in _self_link
    sqls = block_using_rules_sqls(self)
splink/blocking.py:361: in block_using_rules_sqls
    sql = br.create_blocked_pairs_sql(linker, where_condition, probability)
splink/blocking.py:98: in create_blocked_pairs_sql
    columns_to_select = linker._settings_obj._columns_to_select_for_blocking
splink/settings.py:222: in _columns_to_select_for_blocking
    cols.append(uid_col.l_name_as_l)
splink/input_column.py:239: in l_name_as_l
    alias = self.unquote().name_l
splink/input_column.py:202: in unquote
    self_copy = deepcopy(self)
/opt/hostedtoolcache/Python/3.9.10/x64/lib/python3.9/copy.py:172: in deepcopy
    y = _reconstruct(x, memo, *rv)
/opt/hostedtoolcache/Python/3.9.10/x64/lib/python3.9/copy.py:270: in _reconstruct
    state = deepcopy(state, memo)
/opt/hostedtoolcache/Python/3.9.10/x64/lib/python3.9/copy.py:146: in deepcopy
    y = copier(x, memo)
/opt/hostedtoolcache/Python/3.9.10/x64/lib/python3.9/copy.py:230: in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
/opt/hostedtoolcache/Python/3.9.10/x64/lib/python3.9/copy.py:153: in deepcopy
    y = copier(memo)
splink/settings.py:87: in __deepcopy__
    cc = Settings(self.as_dict())
splink/settings.py:80: in __init__
    self._get_additional_columns_to_retain()
splink/settings.py:129: in _get_additional_columns_to_retain
    get_columns_used_from_sql(br.blocking_rule_sql, br.sql_dialect)
splink/parse_sql.py:10: in get_columns_used_from_sql
    syntax_tree = sqlglot.parse_one(sql, read=dialect)
.venv/lib/python3.9/site-packages/sqlglot/__init__.py:125: in parse_one
    result = dialect.parse(sql, **opts)
.venv/lib/python3.9/site-packages/sqlglot/dialects/dialect.py:311: in parse
    return self.parser(**opts).parse(self.tokenize(sql), sql)
.venv/lib/python3.9/site-packages/sqlglot/parser.py:979: in parse
    return self._parse(
.venv/lib/python3.9/site-packages/sqlglot/parser.py:1048: in _parse
    self.raise_error("Invalid expression / Unexpected token")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <sqlglot.parser.Parser object at 0x7f9eaf27a7c0>
message = 'Invalid expression / Unexpected token'
token = <Token token_type: TokenType.IDENTIFIER, text: `, line: 1, col: 13, start: 12, end: 12, comments: []>

    def raise_error(self, message: str, token: t.Optional[Token] = None) -> None:
        """
        Appends an error in the list of recorded errors or raises it, depending on the chosen
        error level setting.
        """
        token = token or self._curr or self._prev or Token.string("")
        start = token.start
        end = token.end + 1
        start_context = self.sql[max(start - self.error_message_context, 0) : start]
        highlight = self.sql[start:end]
        end_context = self.sql[end : end + self.error_message_context]
    
        error = ParseError.new(
            f"{message}. Line {token.line}, Col: {token.col}.\n"
            f"  {start_context}\033[4m{highlight}\033[0m{end_context}",
            description=message,
            line=token.line,
            col=token.col,
            start_context=start_context,
            highlight=highlight,
            end_context=end_context,
        )
    
        if self.error_level == ErrorLevel.IMMEDIATE:
>           raise error
E           sqlglot.errors.ParseError: Invalid expression / Unexpected token. Line 1, Col: 13.
E             l.`unique_id` = r.`unique_id`

.venv/lib/python3.9/site-packages/sqlglot/parser.py:1089: ParseError

When we redo blocking rules for Splink 4, we need a consistent approach that ensures they are always dialected, and that the dialect is always 'taken from' the same place (which should be the dialect on the root settings object, rather than a blocking-rule-specific dialect).
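To make the 'taken from the same place' idea concrete, here is a minimal sketch. It is not Splink's actual API: `SettingsSketch` and `BlockingRuleSketch` are hypothetical names invented for illustration. The rule stores only its SQL and looks up the dialect on the settings object that owns it, so the dialect has a single source of truth.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BlockingRuleSketch:
    """Hypothetical blocking rule: holds SQL only, never its own dialect."""

    blocking_rule_sql: str
    # Back-reference to the owning settings object (hypothetical design).
    _settings: Optional["SettingsSketch"] = field(
        default=None, repr=False, compare=False
    )

    @property
    def sql_dialect(self) -> str:
        # The dialect is always read from the root settings object.
        if self._settings is None:
            raise ValueError("Blocking rule is not attached to a settings object")
        return self._settings.sql_dialect


class SettingsSketch:
    """Hypothetical root settings object that owns the dialect."""

    def __init__(self, sql_dialect: str, blocking_rules: List[str]):
        self.sql_dialect = sql_dialect
        # Rules are built from plain SQL strings and attached to this object.
        self.blocking_rules = [
            BlockingRuleSketch(sql, self) for sql in blocking_rules
        ]


settings = SettingsSketch("spark", ["l.`unique_id` = r.`unique_id`"])
rule = settings.blocking_rules[0]
print(rule.sql_dialect)  # prints "spark"
```

With this shape, a rule cannot end up with a dialect that disagrees with its settings, and a rule with no settings attached fails loudly instead of silently parsing with the wrong dialect.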

This example underscores a general principle around deepcopying: it should be straightforward to serialise settings to JSON and read them back in to get a copy of the settings.
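A rough sketch of that principle, under stated assumptions: `SerialisableSettings` is a hypothetical class, not Splink's `Settings`, and the dictionary keys are illustrative. The point is only that if the dialect lives in the serialised form, a JSON round trip yields a faithful copy, and `__deepcopy__` can be defined in terms of it.

```python
import json
from copy import deepcopy


class SerialisableSettings:
    """Hypothetical settings object whose dict form carries the dialect."""

    def __init__(self, settings_dict):
        self.sql_dialect = settings_dict["sql_dialect"]
        # Blocking rules are kept as plain SQL strings in this sketch.
        self.blocking_rules = list(settings_dict["blocking_rules"])

    def as_dict(self):
        # Everything needed to rebuild the object, including the dialect.
        return {
            "sql_dialect": self.sql_dialect,
            "blocking_rules": list(self.blocking_rules),
        }

    def __deepcopy__(self, memo):
        # Copy by serialising to JSON and reading the result back in.
        return SerialisableSettings(json.loads(json.dumps(self.as_dict())))


original = SerialisableSettings(
    {"sql_dialect": "spark", "blocking_rules": ["l.`unique_id` = r.`unique_id`"]}
)
copied = deepcopy(original)
print(copied.sql_dialect, copied.blocking_rules)
```

Because nothing outside `as_dict()` is needed to rebuild the object, the copy cannot lose state such as the dialect.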
