Proper settings syntax to compare only the first few characters of a column? #1621

jkginfinite · 2023-09-24T16:39:35Z


Hi,

I am setting up the comparisons in my settings dictionary in PySpark. My goal is to use a column but reduce the cardinality of the comparison by using only the first few characters of that column.

e.g. instead of cl.exact_match("first_name"), I'd like to do something like cl.exact_match("substr(first_name,1,3)").

I have code to do this, but it keeps failing with an error saying that it can't find the exact match level ("ValueError: exact match level for first_name") when I run linker.estimate_parameters_using_expectation_maximisation(rule).

The dictionary I used was this:

{
  "output_column_name": "first_name",
  "comparison_description": "first_name",
  "comparison_levels": [
    {
      "sql_condition": "substr(first_name_l,1,5) IS NULL OR substr(first_name_r,1,5) IS NULL",
      "label_for_charts": "Null",
      "is_null_level": true
    },
    {
      "sql_condition": "substr(first_name_l,1,5) = substr(first_name_r,1,5)",
      "label_for_charts": "Exact match",
      "tf_adjustment_column": "first_name",
      "tf_adjustment_weight": 1.0,
      "tf_minimum_u_value": 0.001
    },
    {
      "sql_condition": "ELSE",
      "label_for_charts": "All other comparisons"
    }
  ]
}
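For reference, here is a minimal sketch of how the same dictionary could be generated in Python for any column and prefix length. The helper name `prefix_exact_match_comparison` is hypothetical and not part of Splink; it only reproduces the keys and SQL conditions from the dictionary above, and it does not by itself address the ValueError.

```python
def prefix_exact_match_comparison(column: str, prefix_len: int) -> dict:
    """Build a comparison dictionary matching on the first `prefix_len` characters.

    Hypothetical helper, not a Splink function: it only mirrors the keys
    used in the hand-written dictionary from this post.
    """
    left = f"substr({column}_l,1,{prefix_len})"
    right = f"substr({column}_r,1,{prefix_len})"
    return {
        "output_column_name": column,
        "comparison_description": column,
        "comparison_levels": [
            {
                # Null level: either side has no value
                "sql_condition": f"{left} IS NULL OR {right} IS NULL",
                "label_for_charts": "Null",
                "is_null_level": True,
            },
            {
                # Prefix equality, with the same term-frequency keys as the post
                "sql_condition": f"{left} = {right}",
                "label_for_charts": "Exact match",
                "tf_adjustment_column": column,
                "tf_adjustment_weight": 1.0,
                "tf_minimum_u_value": 0.001,
            },
            {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
        ],
    }


first_name_comparison = prefix_exact_match_comparison("first_name", 5)
```

Building the conditions from one place keeps the prefix length identical in the null level and the match level.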

I am also wondering how that might work with the rules. If I use a comparison like this, do I have to modify the rules accordingly?

e.g. have rules such as:

l.first_name = r.first_name and levenshtein(substr(l.dob,1,4), substr(r.dob,1,4)) <= 1
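To make clear what this rule is meant to check, here is a plain-Python sketch that evaluates it for one pair of records. It is illustrative only: in practice Spark evaluates the SQL itself, the sample records are made up, and NULL handling is ignored. `levenshtein` and `rule_matches` here are local helper functions, not Spark or Splink APIs.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def rule_matches(left: dict, right: dict) -> bool:
    """Mirror of: l.first_name = r.first_name
    and levenshtein(substr(l.dob,1,4), substr(r.dob,1,4)) <= 1
    """
    same_name = left["first_name"] == right["first_name"]
    # substr(x, 1, 4) in SQL is 1-indexed, so it corresponds to x[:4] in Python
    close_year = levenshtein(left["dob"][:4], right["dob"][:4]) <= 1
    return same_name and close_year


# Made-up records for illustration
rec_a = {"first_name": "maria", "dob": "1987-03-14"}
rec_b = {"first_name": "maria", "dob": "1981-03-14"}  # year differs by one character
rec_c = {"first_name": "maria", "dob": "1991-03-14"}  # year differs by two characters

pair_ab = rule_matches(rec_a, rec_b)
pair_ac = rule_matches(rec_a, rec_c)
```

Note that this rule compares the full first_name, while the comparison dictionary above compares only the first five characters.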

PS: GitHub seemed to remove the JSON formatting of the dictionary above.
