Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for custom case expressions which use arbitrary/multiple column(s) #90

Merged
merged 5 commits into from Mar 13, 2020

Conversation

RobinL
Copy link
Member

@RobinL RobinL commented Mar 7, 2020

Closes #65

A relatively advanced use case that is fairly common is for values in the comparison vectors to be derived from multiple columns.

For example, to deal with the possibility of name inversions, the user might want something like:

CASE WHEN
first_name_l = first_name_r AND surname_l = surname_r THEN 2
first_name_l = surname_r AND surname_l = first_name_r THEN 1
ELSE 0 
END

It's also a bit tricky to think about how this should be dealt with in comparison_columns because up until now, we've assumed that each value in the comparison vector corresponds to a single column, and each column is identified by the col_name key in the comparison_columns part of the settings dictionary.

What I've done is allowed two possible specs for an entry in comparison_columns.

Either col_name must be completed as before:

{
"col_name": "first_name"
}

or the user must explicitly specify a custom column:

  {
        "custom_name": "custom_name",
        "custom_columns_used": ["first_name", "surname],
        "case_expression": """
CASE WHEN
first_name_l = first_name_r AND surname_l = surname_r THEN 2
first_name_l = surname_r AND surname_l = first_name_r THEN 1
ELSE 0 
END
""",
        "num_levels": 2
        }

The later syntax isn't very intuitive, but this is advanced functionality and it would be expected that the user would modify example settings rather than write this from scratch

@@ -98,8 +99,8 @@ def manually_apply_fellegi_sunter_weights(self):

def get_scored_comparisons(self, num_iterations=None):

if (num_iterations is None):
num_iterations=self.settings["max_iterations"]
if not num_iterations:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not relevant to this PR, I'm just cleaning up code here. Specifically, the number of iterations is taken from the settings object unless explicitly set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I wasn't sure whether None could be coerced to a boolean, hence treating it like a value to be sure without having to look it up. Good to know.

@@ -34,7 +34,7 @@ def _add_null_treatment_to_case_statement(case_statement: str):

if "then -1" not in sl:
try:
variable_name = re.search(r"when ([\w_]{1,100})_l", case_statement)[1]
variable_name = re.search(r"when ([\w_]{1,100})_l", case_statement.lower())[1]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary for this PR, this was an error I identified whilst doing the PR that meant the regex was incorrect

if col["term_frequency_adjustments"]:
select_cols = _add_left_right(select_cols, col_name)
select_cols["gamma_" + col_name] = "gamma_" + col_name
if "col_name" in col:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these bits I'm having to fiddle around to make sure the order of the columns is sensible even when a custom_name is used

@@ -5,6 +5,7 @@
"type": "object",
"title": "Splink settings",
"required": ["comparison_columns", "link_type"],
"additionalProperties": false,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary to PR but useful - this means that if the user includes extraneous keys (e.g. a misspelt key) in the settings dictionary, they'll get a validation error

@@ -54,12 +55,13 @@
"$id": "#/properties/max_iterations",
"type": "number",
"title": "The maximum number of iterations to run even if convergence has not been reached",
"default": 10,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 10 is too low as a default num iterations

@@ -211,12 +237,6 @@
"$id": "#/properties/comparison_columns/items/properties/gamma_index",
"type": "integer",
"description": "Gamma values in the comparison vector will be put in columns called gamma_0, gamma_1 etc. This is the gamma index corresponding to this column"
},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a previous attempt to account for custom columns that was never properly implemented so acn be deleted

@samnlindsay
Copy link
Contributor

Using the test case_expression of:

case when jaro_winkler_sim(
array_join(array_sort(ARRAY [first_name_key_l, middle_name_key_l, last_name_key_l]), ''),
array_join(array_sort(ARRAY [first_name_key_r, middle_name_key_r, last_name_key_r]), '')
) > 0.94 then 1 else 0 end

has highlighted an issue with _add_null_treatment_to_case_statement(...).

While it works for your example above, it assumes a format of case when colname_l... to extract colname. If I understand correctly, this is where custom_column_names should come in so the null treatment can be added to these columns without searching the case expression?

@RobinL
Copy link
Member Author

RobinL commented Mar 11, 2020

Well spotted. I agree, the current implementation isn't very robust and could be improved

@RobinL
Copy link
Member Author

RobinL commented Mar 12, 2020

Thinking about this, I think the simplest solution may be just to drop this functionality entirely and put it in the users' hands.

The only time it makes sense is the case where the user is providing a custom case expression for a single column (i.e. they have specified a col_name not a custom_name). Most of the time when they do this, we'd recommend they parameterise one of the pre-built statements from case_statements.py rather than write their own (these pre-built ones include the -1 null treatment so don't need any additions.

So having 'surprising behaviour' (null treatment magically appears) for an unusual use case is probably more trouble than it's worth.

In the case where they've provided a custom_name, I don't think we'd want to try and 'guess' null treatment for the user. Instead, the docs need to clearly state the user should define how the treatment of nulls should work i.e. their case expression should include a case for case x then -1 .

@samnlindsay
Copy link
Contributor

OK good, because I was coming to the opinion that it was becoming more complicated than was worthwhile in order to do it "properly" because it was just the null treatment that was a problem.

That's partly because I had issues with the null treatment even for the original simple usage. For example, it isn't triggered when "then -1" is found in the expression, ignoring potential other legitimate conditions for gamma = -1 that don't check for nulls.

Will put this to bed in the morning, sorry for the delay

@samnlindsay
Copy link
Contributor

One last thought (not necessary for this PR but potential as an option to put in later):

It seems like the real difficulty is in trying to detect whether the user has already dealt with nulls in the case_expression.
If we assume the user doesn't include it, adding the null treatment is easy (for col_name or for a list of custom_column_names). If we're worried about it not being all that simple for custom columns, then the simple null treatment could be an option in the settings, so the only time the onus is on the user is when they have a custom column where they have specific reason to opt out of it.

Copy link
Contributor

@samnlindsay samnlindsay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think nulls may need to be revisited, but the core feature of allowing custom case expressions for multiple columns now works. ✅

Closes #65

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow a case expression which takes two columns as an input
2 participants