New TF adjustment #201

Merged
merged 15 commits into from
Oct 26, 2021

samnlindsay
Contributor

Changes so far

  • Add term frequencies to df
    • e.g. tf_surname is the count of a given surname divided by the count of all non-null surnames
  • Add tf_adjustment_weights to settings
    • weight between 0 (no TF adjustment) and 1 (full TF adjustment) for each gamma level
    • intermediate weights can be applied to fuzzy match levels where TF adjustment is partially relevant (e.g. lindsay vs lindsey) -> tf_adjustment_weights = [0.0, 0.7, 1.0]
    • default: 0 except for max gamma level (e.g. [0.0, 0.0, 1.0])
  • Convert match probability calculation to Bayes factors (rather than m and u)
  • New TF component to the Bayes factor
    • TF adjustment effectively replaces u with term frequency (i.e. surname-specific BF = m / tf_surname)
    • Expressed as an additional Bayes factor (i.e. surname-specific BF = bf_gamma_surname * bf_tf_adj_surname)
    • bf_tf_adj_surname = (u / tf_surname) ^ tf_adjustment_weight, so a weight of 1 recovers m / tf_surname and a weight of 0 leaves the Bayes factor unchanged
    • For fuzzy matches, uses the larger of the two term frequencies (this assumes the less frequent value is the erroneous one)
  • EM algorithm uses the TF-adjusted match probability in each iteration
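
The arithmetic in the list above can be sketched in plain Python. This is only an illustration of the formulas as described, not the code in this PR: the function names, the `prior` parameter, and the example numbers are all assumptions made for the sketch.

```python
from collections import Counter
import math


def term_frequencies(values):
    """Relative frequency of each non-null value (nulls are excluded from the total)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {value: n / len(non_null) for value, n in counts.items()}


def tf_adjusted_bayes_factor(m, u, tf_left, tf_right, weight):
    """Bayes factor for one comparison level, with a weighted TF adjustment.

    m, u             : match / non-match probabilities for this gamma level
    tf_left/tf_right : term frequencies of the two values being compared
    weight           : 0 (no TF adjustment) to 1 (full TF adjustment)
    """
    bf_gamma = m / u
    # For fuzzy matches the two values differ; take the larger frequency,
    # on the assumption that the rarer value is the erroneous one.
    tf = max(tf_left, tf_right)
    bf_tf_adj = (u / tf) ** weight
    return bf_gamma * bf_tf_adj


def match_probability(prior, bayes_factors):
    """Posterior match probability: prior odds times the product of Bayes factors."""
    odds = prior / (1 - prior)
    for bf in bayes_factors:
        odds *= bf
    return odds / (1 + odds)


# Hypothetical numbers, for illustration only.
tfs = term_frequencies(["smith", "smith", "smith", "lindsay", None])
full = tf_adjusted_bayes_factor(m=0.9, u=0.01, tf_left=0.001, tf_right=0.001, weight=1.0)
none = tf_adjusted_bayes_factor(m=0.9, u=0.01, tf_left=0.001, tf_right=0.001, weight=0.0)
print(tfs, full, none, match_probability(0.001, [full]))
```

With weight 1 the result equals m / tf (u cancels out), and with weight 0 it falls back to the unadjusted m / u, which is the behaviour the bullets describe.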

TO DO / TEST / CONSIDER:

  • Does it give the right answer??? (test on real data and QA clusters?)
  • TF adjustments on custom comparison columns (do not allow 🚫)
  • Does the new method work without first estimating u probabilities and fixing them? (Do the parameters still iterate properly to the same solution?)
  • How this feeds into later diagnostic elements
  • Visualising TF adjustments
  • Other stuff I haven't thought of yet...


sql = f"""
select
{column_name}, count(*) / sum(count(*)) over () as tf_{column_name}
Member

@RobinL, Oct 25, 2021

It's not at all obvious, but this formulation of the SQL doesn't scale well. When you run it, the Spark logs (in the terminal window you're running Jupyter from) show:

21/10/25 10:08:32 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

Only mentioning it so you 'know how you could know' that this causes a problem. I'll refactor to avoid it.
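
One possible way to avoid the unpartitioned window, sketched here as an assumption rather than as the refactor that was actually made: aggregate the counts per value first, compute the total from that much smaller result, and cross join the two. The helper name `tf_sql`, the table name `df`, and the null filter (taken from the "non-null surnames" description) are hypothetical. The sketch is checked against an in-memory SQLite database only, so behaviour on Spark is untested here.

```python
import sqlite3


def tf_sql(column_name: str, table_name: str = "df") -> str:
    """Build a term-frequency query that uses no window function."""
    return f"""
    with counts as (
        select {column_name}, count(*) as n
        from {table_name}
        where {column_name} is not null
        group by {column_name}
    ),
    total as (
        -- one-row table holding the number of non-null values
        select sum(n) as total_n from counts
    )
    select {column_name}, cast(n as double) / total_n as tf_{column_name}
    from counts cross join total
    """


# Tiny illustrative dataset (made up for this example).
conn = sqlite3.connect(":memory:")
conn.execute("create table df (surname text)")
conn.executemany(
    "insert into df values (?)",
    [("smith",), ("smith",), ("smith",), ("lindsay",), (None,)],
)
tf = dict(conn.execute(tf_sql("surname")).fetchall())
print(tf)
```

The total is derived from the grouped counts rather than from the full table, so only one row per distinct value needs to be brought together to compute it.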


2 participants