New TF adjustment #201
Conversation
…s/splink into new_tf_adjustment
splink/term_frequencies.py
sql = f"""
select
    {column_name}, count(*) / sum(count(*)) over () as tf_{column_name}
It's not at all obvious, but this formulation of the SQL doesn't scale well. When you run this, if you look at the Spark logs (in the terminal window you're running Jupyter from), you get:
21/10/25 10:08:32 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Only mentioning it so you 'know how you could know' this causes a problem. I'll refactor to avoid it.
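One way to see the refactor (this is an illustration of the general technique, not necessarily the exact fix used in splink): replace the empty `over ()` window, which forces Spark to move every row onto a single partition, with a scalar subquery that computes the non-null total separately. Demonstrated below with the standard-library sqlite3 engine for portability; the `* 1.0` is added so sqlite performs float rather than integer division.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table df (surname text)")
con.executemany(
    "insert into df values (?)",
    [("smith",), ("smith",), ("jones",), (None,)],
)

column_name = "surname"

# Formulation from the diff above: the empty OVER () window means
# "sum over all groups", which in Spark collapses to one partition.
windowed = f"""
    select {column_name},
           count(*) * 1.0 / sum(count(*)) over () as tf_{column_name}
    from df
    where {column_name} is not null
    group by {column_name}
"""

# Refactor sketch: get the denominator from a scalar subquery instead,
# so no global window is needed at all.
refactored = f"""
    select {column_name},
           count(*) * 1.0 / (select count(*) from df
                             where {column_name} is not null) as tf_{column_name}
    from df
    where {column_name} is not null
    group by {column_name}
"""

# Both formulations agree: tf_surname is count(value) / count(non-null rows),
# i.e. smith -> 2/3, jones -> 1/3 here.
assert dict(con.execute(windowed)) == dict(con.execute(refactored))
```

In Spark the same idea is usually expressed as a separate aggregate that is cross-joined (broadcast) back onto the grouped counts, which keeps the computation distributed.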
Tidying up, deleting previous tf code
Changes so far:
- Term frequency columns are added to df: tf_surname is the count of surname / count of all non-null surnames
- tf_adjustment_weights added to settings (e.g. lindsay vs lindsey) -> tf_adjustment_weights = [0.0, 0.7, 1.0] vs [0.0, 0.0, 1.0]
- u is replaced with term frequency (i.e. surname-specific BF = m / tf_surname)
- Overall BF: bf_gamma_surname * bf_tf_adj_surname
- bf_tf_adj_surname = u / tf_surname (raised to the power of tf_adjustment_weight)

TO DO / TEST / CONSIDER:
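The Bayes factor arithmetic described above can be sketched in a few lines of Python. The function name and signature here are illustrative only, not splink's API; m and u are assumed to be the match/non-match probabilities for the relevant gamma level, and tf the term frequency of the specific surname value.

```python
def bf_tf_adjusted(m: float, u: float, tf: float, weight: float) -> float:
    """Gamma-level Bayes factor combined with the tf adjustment:

        bf_gamma  = m / u
        bf_tf_adj = (u / tf) ** weight      # weight between 0 and 1
        result    = bf_gamma * bf_tf_adj
    """
    bf_gamma = m / u
    bf_tf_adj = (u / tf) ** weight
    return bf_gamma * bf_tf_adj

# With weight = 1.0 the u terms cancel and the generic BF of m / u
# becomes the surname-specific BF of m / tf_surname, as described above:
#   (m / u) * (u / tf) ** 1.0 == m / tf
# With weight = 0.0 the adjustment factor is 1 and the BF is unchanged.
```

Under this reading, tf_adjustment_weights = [0.0, 0.7, 1.0] means no adjustment at the lowest gamma level, a partial adjustment at the middle level (e.g. fuzzy matches like lindsay vs lindsey), and a full surname-specific BF at the exact-match level.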