Add support for custom case expressions which use arbitrary/multiple column(s) #90

RobinL · 2020-03-07T18:05:32Z

Closes #65

A relatively advanced use case that is fairly common is for values in the comparison vectors to be derived from multiple columns.

For example, to deal with the possibility of name inversions, the user might want something like:

CASE WHEN
first_name_l = first_name_r AND surname_l = surname_r THEN 2
first_name_l = surname_r AND surname_l = first_name_r THEN 1
ELSE 0 
END

It's also a bit tricky to think about how this should be dealt with in comparison_columns because up until now, we've assumed that each value in the comparison vector corresponds to a single column, and each column is identified by the col_name key in the comparison_columns part of the settings dictionary.

What I've done is allowed two possible specs for an entry in comparison_columns.

Either col_name must be completed as before:

{
"col_name": "first_name"
}

or the user must explicitly specify a custom column:

  {
        "custom_name": "custom_name",
        "custom_columns_used": ["first_name", "surname],
        "case_expression": """
CASE WHEN
first_name_l = first_name_r AND surname_l = surname_r THEN 2
first_name_l = surname_r AND surname_l = first_name_r THEN 1
ELSE 0 
END
""",
        "num_levels": 2
        }

The later syntax isn't very intuitive, but this is advanced functionality and it would be expected that the user would modify example settings rather than write this from scratch

RobinL · 2020-03-08T14:15:59Z

splink/__init__.py

@@ -98,8 +99,8 @@ def manually_apply_fellegi_sunter_weights(self):

    def get_scored_comparisons(self, num_iterations=None):

-        if (num_iterations is None):
-            num_iterations=self.settings["max_iterations"]
+        if not num_iterations:


Not relevant to this PR, I'm just cleaning up code here. Specifically, the number of iterations is taken from the settings object unless explicitly set.

Ah, I wasn't sure whether None could be coerced to a boolean, hence treating it like a value to be sure without having to look it up. Good to know.

RobinL · 2020-03-08T14:16:27Z

splink/case_statements.py

@@ -34,7 +34,7 @@ def _add_null_treatment_to_case_statement(case_statement: str):

    if "then -1" not in sl:
        try:
-            variable_name = re.search(r"when ([\w_]{1,100})_l", case_statement)[1]
+            variable_name = re.search(r"when ([\w_]{1,100})_l", case_statement.lower())[1]


Not necessary for this PR, this was an error I identified whilst doing the PR that meant the regex was incorrect

RobinL · 2020-03-08T14:16:57Z

splink/expectation_step.py

-        if col["term_frequency_adjustments"]:
-            select_cols = _add_left_right(select_cols, col_name)
-        select_cols["gamma_" + col_name] = "gamma_" + col_name
+        if "col_name" in col:


In these bits I'm having to fiddle around to make sure the order of the columns is sensible even when a custom_name is used

RobinL · 2020-03-08T14:17:28Z

splink/files/settings_jsonschema.json

@@ -5,6 +5,7 @@
  "type": "object",
  "title": "Splink settings",
  "required": ["comparison_columns", "link_type"],
+  "additionalProperties": false,


Not necessary to PR but useful - this means that if the user includes extraneous keys (e.g. a misspelt key) in the settings dictionary, they'll get a validation error

RobinL · 2020-03-08T14:17:41Z

splink/files/settings_jsonschema.json

@@ -54,12 +55,13 @@
      "$id": "#/properties/max_iterations",
      "type": "number",
      "title": "The maximum number of iterations to run even if convergence has not been reached",
-      "default": 10,


I think 10 is too low as a default num iterations

RobinL · 2020-03-08T14:18:10Z

splink/files/settings_jsonschema.json

@@ -211,12 +237,6 @@
            "$id": "#/properties/comparison_columns/items/properties/gamma_index",
            "type": "integer",
            "description": "Gamma values in the comparison vector will be put in columns called gamma_0, gamma_1 etc.  This is the gamma index corresponding to this column"
-          },


This was a previous attempt to account for custom columns that was never properly implemented so acn be deleted

samnlindsay · 2020-03-10T17:06:23Z

Using the test case_expression of:

case when jaro_winkler_sim(
array_join(array_sort(ARRAY [first_name_key_l, middle_name_key_l, last_name_key_l]), ''),
array_join(array_sort(ARRAY [first_name_key_r, middle_name_key_r, last_name_key_r]), '')
) > 0.94 then 1 else 0 end

has highlighted an issue with _add_null_treatment_to_case_statement(...).

While it works for your example above, it assumes a format of case when colname_l... to extract colname. If I understand correctly, this is where custom_column_names should come in so the null treatment can be added to these columns without searching the case expression?

RobinL · 2020-03-11T09:59:18Z

Well spotted. I agree, the current implementation isn't very robust and could be improved

RobinL · 2020-03-12T19:34:57Z

Thinking about this, I think the simplest solution may be just to drop this functionality entirely and put it in the users' hands.

The only time it makes sense is the case where the user is providing a custom case expression for a single column (i.e. they have specified a col_name not a custom_name). Most of the time when they do this, we'd recommend they parameterise one of the pre-built statements from case_statements.py rather than write their own (these pre-built ones include the -1 null treatment so don't need any additions.

So having 'surprising behaviour' (null treatment magically appears) for an unusual use case is probably more trouble than it's worth.

In the case where they've provided a custom_name, I don't think we'd want to try and 'guess' null treatment for the user. Instead, the docs need to clearly state the user should define how the treatment of nulls should work i.e. their case expression should include a case for case x then -1 .

samnlindsay · 2020-03-12T20:30:02Z

OK good, because I was coming to the opinion that it was becoming more complicated than was worthwhile in order to do it "properly" because it was just the null treatment that was a problem.

That's partly because I had issues with the null treatment even for the original simple usage. For example, it isn't triggered when "then -1" is found in the expression, ignoring potential other legitimate conditions for gamma = -1 that don't check for nulls.

Will put this to bed in the morning, sorry for the delay

samnlindsay · 2020-03-13T09:47:45Z

One last thought (not necessary for this PR but potential as an option to put in later):

It seems like the real difficulty is in trying to detect whether the user has already dealt with nulls in the case_expression.
If we assume the user doesn't include it, adding the null treatment is easy (for col_name or for a list of custom_column_names). If we're worried about it not being all that simple for custom columns, then the simple null treatment could be an option in the settings, so the only time the onus is on the user is when they have a custom column where they have specific reason to opt out of it.

samnlindsay

Think nulls may need to be revisited, but the core feature of allowing custom case expressions for multiple columns now works. ✅

Closes #65

RobinL added 4 commits March 7, 2020 17:07

enable support for custom comparisons

053e095

basic example works

48f36f8

tests pass

80d17be

intuition report now works with custom col

17b8520

RobinL requested review from mamonu and samnlindsay March 7, 2020 18:09

RobinL commented Mar 8, 2020

View reviewed changes

remove the add nulls function

8a310ed

samnlindsay approved these changes Mar 13, 2020

View reviewed changes

RobinL merged commit 6ba3c04 into dev Mar 13, 2020

RobinL deleted the custom_column branch March 13, 2020 11:39

samnlindsay mentioned this pull request Mar 13, 2020

Include automatic null treatment (with ability to opt out) #92

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for custom case expressions which use arbitrary/multiple column(s) #90

Add support for custom case expressions which use arbitrary/multiple column(s) #90

RobinL commented Mar 7, 2020 •

edited

RobinL Mar 8, 2020

samnlindsay Mar 10, 2020

RobinL Mar 8, 2020

RobinL Mar 8, 2020

RobinL Mar 8, 2020

RobinL Mar 8, 2020

RobinL Mar 8, 2020

samnlindsay commented Mar 10, 2020

RobinL commented Mar 11, 2020

RobinL commented Mar 12, 2020 •

edited

samnlindsay commented Mar 12, 2020

samnlindsay commented Mar 13, 2020

samnlindsay left a comment

Add support for custom case expressions which use arbitrary/multiple column(s) #90

Add support for custom case expressions which use arbitrary/multiple column(s) #90

Conversation

RobinL commented Mar 7, 2020 • edited

RobinL Mar 8, 2020

Choose a reason for hiding this comment

samnlindsay Mar 10, 2020

Choose a reason for hiding this comment

RobinL Mar 8, 2020

Choose a reason for hiding this comment

RobinL Mar 8, 2020

Choose a reason for hiding this comment

RobinL Mar 8, 2020

Choose a reason for hiding this comment

RobinL Mar 8, 2020

Choose a reason for hiding this comment

RobinL Mar 8, 2020

Choose a reason for hiding this comment

samnlindsay commented Mar 10, 2020

RobinL commented Mar 11, 2020

RobinL commented Mar 12, 2020 • edited

samnlindsay commented Mar 12, 2020

samnlindsay commented Mar 13, 2020

samnlindsay left a comment

Choose a reason for hiding this comment

RobinL commented Mar 7, 2020 •

edited

RobinL commented Mar 12, 2020 •

edited