New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for custom case expressions which use arbitrary/multiple column(s) #90
Conversation
@@ -98,8 +99,8 @@ def manually_apply_fellegi_sunter_weights(self): | |||
|
|||
def get_scored_comparisons(self, num_iterations=None): | |||
|
|||
if (num_iterations is None): | |||
num_iterations=self.settings["max_iterations"] | |||
if not num_iterations: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not relevant to this PR, I'm just cleaning up code here. Specifically, the number of iterations is taken from the settings object unless explicitly set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I wasn't sure whether None
could be coerced to a boolean, hence treating it like a value to be sure without having to look it up. Good to know.
splink/case_statements.py
Outdated
@@ -34,7 +34,7 @@ def _add_null_treatment_to_case_statement(case_statement: str): | |||
|
|||
if "then -1" not in sl: | |||
try: | |||
variable_name = re.search(r"when ([\w_]{1,100})_l", case_statement)[1] | |||
variable_name = re.search(r"when ([\w_]{1,100})_l", case_statement.lower())[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary for this PR, this was an error I identified whilst doing the PR that meant the regex was incorrect
if col["term_frequency_adjustments"]: | ||
select_cols = _add_left_right(select_cols, col_name) | ||
select_cols["gamma_" + col_name] = "gamma_" + col_name | ||
if "col_name" in col: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In these bits I'm having to fiddle around to make sure the order of the columns is sensible even when a custom_name is used
@@ -5,6 +5,7 @@ | |||
"type": "object", | |||
"title": "Splink settings", | |||
"required": ["comparison_columns", "link_type"], | |||
"additionalProperties": false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary to PR but useful - this means that if the user includes extraneous keys (e.g. a misspelt key) in the settings dictionary, they'll get a validation error
@@ -54,12 +55,13 @@ | |||
"$id": "#/properties/max_iterations", | |||
"type": "number", | |||
"title": "The maximum number of iterations to run even if convergence has not been reached", | |||
"default": 10, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 10 is too low as a default num iterations
@@ -211,12 +237,6 @@ | |||
"$id": "#/properties/comparison_columns/items/properties/gamma_index", | |||
"type": "integer", | |||
"description": "Gamma values in the comparison vector will be put in columns called gamma_0, gamma_1 etc. This is the gamma index corresponding to this column" | |||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a previous attempt to account for custom columns that was never properly implemented so acn be deleted
Using the test
has highlighted an issue with While it works for your example above, it assumes a format of |
Well spotted. I agree, the current implementation isn't very robust and could be improved |
Thinking about this, I think the simplest solution may be just to drop this functionality entirely and put it in the users' hands. The only time it makes sense is the case where the user is providing a custom case expression for a single column (i.e. they have specified a So having 'surprising behaviour' (null treatment magically appears) for an unusual use case is probably more trouble than it's worth. In the case where they've provided a |
OK good, because I was coming to the opinion that it was becoming more complicated than was worthwhile in order to do it "properly" because it was just the null treatment that was a problem. That's partly because I had issues with the null treatment even for the original simple usage. For example, it isn't triggered when "then -1" is found in the expression, ignoring potential other legitimate conditions for gamma = -1 that don't check for nulls. Will put this to bed in the morning, sorry for the delay |
One last thought (not necessary for this PR but potential as an option to put in later): It seems like the real difficulty is in trying to detect whether the user has already dealt with nulls in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think nulls may need to be revisited, but the core feature of allowing custom case expressions for multiple columns now works. ✅
Closes #65
Closes #65
A relatively advanced use case that is fairly common is for values in the comparison vectors to be derived from multiple columns.
For example, to deal with the possibility of name inversions, the user might want something like:
It's also a bit tricky to think about how this should be dealt with in
comparison_columns
because up until now, we've assumed that each value in the comparison vector corresponds to a single column, and each column is identified by thecol_name
key in thecomparison_columns
part of the settings dictionary.What I've done is allowed two possible specs for an entry in
comparison_columns
.Either
col_name
must be completed as before:or the user must explicitly specify a custom column:
The later syntax isn't very intuitive, but this is advanced functionality and it would be expected that the user would modify example settings rather than write this from scratch