Proper settings syntax to compare only the first few characters of a column? #1621
Unanswered
jkginfinite asked this question in Q&A
Replies: 0 comments
Hi,
I am setting up the comparisons in my settings dictionary in PySpark. My goal is to use a column but reduce the cardinality of the comparison by using only the first few characters of that column;
e.g. instead of
cl.exact_match("first_name")
I'd like to do something like
cl.exact_match("substr(first_name,1,3)")
I have code to do this, but it keeps raising an error saying that it can't find an exact match level: "ValueError: exact match level for first_name"
when trying to run
linker.estimate_parameters_using_expectation_maximisation(rule)
The dictionary I used was this:
{
  "output_column_name": "first_name",
  "comparison_description": "first_name",
  "comparison_levels": [
    {
      "sql_condition": "substr(first_name_l,1,5) IS NULL OR substr(first_name_r,1,5) IS NULL",
      "label_for_charts": "Null",
      "is_null_level": true
    },
    {
      "sql_condition": "substr(first_name_l,1,5) = substr(first_name_r,1,5)",
      "label_for_charts": "Exact match",
      "tf_adjustment_column": "first_name",
      "tf_adjustment_weight": 1.0,
      "tf_minimum_u_value": 0.001
    },
    {
      "sql_condition": "ELSE",
      "label_for_charts": "All other comparisons"
    }
  ]
}
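For reference, the same dictionary could also be built programmatically. This is only a minimal sketch: the helper name `prefix_exact_match_comparison` and its parameters are hypothetical and not part of Splink, and the keys simply mirror the dictionary above. The term-frequency keys are off by default, on the untested guess that a term-frequency adjustment on the full `first_name` column may not sit well with a level that only compares a prefix.

```python
import json


def prefix_exact_match_comparison(col, n_chars, tf_adjust=False):
    """Build a comparison dictionary that compares only the first n_chars of col.

    Hypothetical helper (not a Splink function); the keys mirror the
    dictionary shown in the question.
    """
    left = f"substr({col}_l,1,{n_chars})"
    right = f"substr({col}_r,1,{n_chars})"

    exact_level = {
        "sql_condition": f"{left} = {right}",
        "label_for_charts": f"Exact match on first {n_chars} characters",
    }
    if tf_adjust:
        # Same term-frequency keys and values as in the question.
        exact_level.update(
            {
                "tf_adjustment_column": col,
                "tf_adjustment_weight": 1.0,
                "tf_minimum_u_value": 0.001,
            }
        )

    return {
        "output_column_name": col,
        "comparison_description": f"{col} (first {n_chars} characters)",
        "comparison_levels": [
            {
                "sql_condition": f"{left} IS NULL OR {right} IS NULL",
                "label_for_charts": "Null",
                "is_null_level": True,  # Python True, not JSON true
            },
            exact_level,
            {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
        ],
    }


first_name_comparison = prefix_exact_match_comparison("first_name", 5)
print(json.dumps(first_name_comparison, indent=2))
```

Note that a Python dictionary needs `True` where the JSON form has `true`; `json.dumps` converts it back when printing.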
I am also wondering how this interacts with the rules. If I use a comparison like this, do I have to modify the rules accordingly?
e.g. have rules such as:
l.first_name = r.first_name and levenshtein(substr(l.dob,1,4), substr(r.dob,1,4)) <= 1
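To make the difference between a full-column rule and a prefix-based rule concrete, here is a small sketch that runs both join conditions on toy data. It uses Python's built-in `sqlite3` instead of Spark purely so that it runs anywhere (both have a 1-indexed `substr`). The table, the names, and the prefix rule are made up for illustration, and the sketch says nothing about how Splink itself treats such a rule during training.

```python
import sqlite3

# Toy data, invented for illustration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (unique_id INTEGER, first_name TEXT)")
con.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "Jonathan"), (2, "Jonathon"), (3, "Jonas"), (4, "Maria")],
)


def candidate_pairs(rule):
    """Return the (left id, right id) pairs that a join rule would generate."""
    sql = f"""
        SELECT l.unique_id, r.unique_id
        FROM people AS l
        JOIN people AS r
          ON {rule}
        WHERE l.unique_id < r.unique_id   -- skip self-pairs and duplicates
        ORDER BY l.unique_id, r.unique_id
    """
    return con.execute(sql).fetchall()


exact_pairs = candidate_pairs("l.first_name = r.first_name")
prefix_pairs = candidate_pairs(
    "substr(l.first_name,1,5) = substr(r.first_name,1,5)"
)

print(exact_pairs)   # []
print(prefix_pairs)  # [(1, 2)]
```

Here the full-column rule pairs nothing, while the five-character prefix rule pairs "Jonathan" with "Jonathon".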
PS - GitHub seemed to remove the JSON formatting of the dictionary above