New TF adjustment #201

samnlindsay · 2021-04-29T06:34:32Z

Changes so far

- Add term frequencies to df
  - e.g. tf_surname is the count of surname / count of all non-null surnames
- Add tf_adjustment_weights to settings
  - weight between 0 (no TF adjustment) and 1 (full TF adjustment) for each gamma level
  - intermediate weights can be applied to fuzzy match levels where TF adjustment is partially relevant (e.g. lindsay vs lindsey) -> tf_adjustment_weights = [0.0, 0.7, 1.0]
  - default: 0 except for max gamma level (e.g. [0.0, 0.0, 1.0])
- Convert match probability calculation to Bayes factors (rather than m and u)
- New TF component to the Bayes factor
  - TF adjustment effectively replaces u with term frequency (i.e. surname-specific BF = m / tf_surname)
  - Expressed as an additional Bayes factor (i.e. surname-specific BF = bf_gamma_surname * bf_tf_adj_surname)
  - bf_tf_adj_surname = u / tf_surname (raised to the power of tf_adjustment_weight)
  - Where matches are fuzzy, uses the larger of the two term frequencies (assumes the less frequent one is an error)
- EM algorithm uses TF adjusted match probability with each iteration
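
For illustration only, the arithmetic described above can be sketched in plain Python. This is not the splink implementation: the function names `term_frequencies`, `tf_adjustment_bf` and `match_probability`, the toy surname list, and the m and u values are all hypothetical, and the odds-based conversion to a match probability is the standard Fellegi-Sunter form rather than something taken from this PR's code.

```python
import math
from collections import Counter


def term_frequencies(values):
    """Relative frequency of each value among the non-null values."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {term: n / len(non_null) for term, n in counts.items()}


def tf_adjustment_bf(u, tf_left, tf_right, weight):
    """Additional Bayes factor (u / tf) ** weight.

    For fuzzy matches the two sides have different term frequencies;
    the larger one is used, on the assumption that the rarer term is an error.
    """
    tf = max(tf_left, tf_right)
    return (u / tf) ** weight


def match_probability(prior, bayes_factors):
    """Multiply the prior odds by each Bayes factor, then convert back to a probability."""
    odds = prior / (1 - prior) * math.prod(bayes_factors)
    return odds / (1 + odds)


# Hypothetical toy data and parameters (not from the PR)
surnames = ["smith", "smith", "smith", "lindsay", None]
tf = term_frequencies(surnames)

m, u = 0.9, 0.1
bf_gamma_surname = m / u
# Exact match on "lindsay", top gamma level, so weight = 1.0
bf_tf_adj_surname = tf_adjustment_bf(u, tf["lindsay"], tf["lindsay"], weight=1.0)
surname_bf = bf_gamma_surname * bf_tf_adj_surname  # equals m / tf["lindsay"]
```

With a weight of 1.0 the u terms cancel, so the product reduces to m / tf_surname as stated above; with a weight of 0.0 the adjustment factor is 1 and only bf_gamma_surname remains.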

TO DO / TEST / CONSIDER:

- Does it give the right answer??? (test on real data and QA clusters?)
- TF adjustments on custom comparison columns (do not allow 🚫)
- Does the new method work without first estimating u probabilities and fixing them? (Do the parameters still iterate properly to the same solution?)
- How this feeds into later diagnostic elements
- Visualising TF adjustments
- Other stuff I haven't thought of yet...


RobinL · 2021-10-25T09:10:38Z

splink/term_frequencies.py

+
+    sql = f"""
+    select
+    {column_name}, count(*) / sum(count(*)) over () as tf_{column_name}


It's not at all obvious, but this formulation of the SQL doesn't scale well. When you run it, the Spark logs (look at the terminal window you're running Jupyter from) show:

21/10/25 10:08:32 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

Only mentioning so you 'know how you could know' this causes a problem. I'll refactor to avoid this problem.
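
One possible way to avoid the unpartitioned window, sketched here as an assumption rather than as the refactor that was actually made, is to compute the total count once in a one-row subquery and cross join it onto the grouped counts. The helper name `tf_sql`, the table name `df`, the alias `totals` and the `where ... is not null` filter are hypothetical; the example runs against an in-memory SQLite table purely so the query shape can be checked without a Spark session.

```python
import sqlite3


def tf_sql(column_name, table_name="df"):
    """Build a term frequency query with no window function.

    The total is computed once in a one-row subquery and cross joined,
    so there is no `over ()` window to force all rows onto one partition.
    """
    return f"""
    select
        {column_name},
        count(*) * 1.0 / totals.n as tf_{column_name}
    from {table_name}
    cross join (
        select count({column_name}) as n from {table_name}
    ) as totals
    where {column_name} is not null
    group by {column_name}, totals.n
    """


# Hypothetical toy table, used only to exercise the query
conn = sqlite3.connect(":memory:")
conn.execute("create table df (surname text)")
conn.executemany(
    "insert into df values (?)",
    [("smith",), ("smith",), ("smith",), ("lindsay",), (None,)],
)
rows = dict(conn.execute(tf_sql("surname")).fetchall())
```

`count({column_name})` skips nulls, which matches the "count of all non-null surnames" definition in the PR description.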


samnlindsay added 6 commits April 28, 2021 15:21

Add term frequency cols to source dataset

042af73

New TF adjustment bayes factor

7b9f76f

Merge branch 'master' into new_tf_adjustment

d503f9a

General fixes

bfc84a2

Merge branch 'new_tf_adjustment' of github.com:moj-analytical-service…

ec173c7

…s/splink into new_tf_adjustment

Attempt at new test - doesnt work, maybe delete...

c19ad0c

RobinL reviewed Oct 25, 2021


RobinL added 9 commits October 26, 2021 08:14

all tests pass except tf adj

cca45df

formatting

276b675

rename new adj

88fc678

test_adj

88a37f6

tests pass

475ffaa

remove unnecessary zero test

05f3aca

update deps in dockerfile

0aa129c

add fixed jar

1cedf0d

Merge pull request #216 from moj-analytical-services/new_tf_rl

81406f0

Tidying up, deleting previous tf code

RobinL changed the title ~~DO NOT MERGE - New TF adjustment~~ New TF adjustment Oct 26, 2021

RobinL merged commit e0f1e59 into master Oct 26, 2021

RobinL deleted the new_tf_adjustment branch February 20, 2022 08:21

RobinL mentioned this pull request Jun 26, 2023

[FEAT] EM algorithm performance improvement #1363

Closed
