M values aren't trained for a column #2154

Closed
2 tasks done
lamaeldo opened this issue Apr 28, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@lamaeldo

What happens?

Hello, I am using splink to link two datasets, using mostly custom comparisons. One of my columns, "sname", is used in a comparison but in neither of my blocking rules. However, when I use EM to calculate the m values, splink says the column is used in the blocking rules (it isn't). Yet when I print the match weights chart and the parameter estimate comparisons chart, they both show values for sname. What should I believe? Are my m values trained properly or not?
Am I missing something obvious?

To Reproduce

A notebook is attached (as a .txt to allow for upload), but I cannot share the data files
bugged_ipynb.txt

OS:

Debian

Splink version:

3.9.14

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@lamaeldo lamaeldo added the bug Something isn't working label Apr 28, 2024
@ADBond
Contributor

ADBond commented Apr 29, 2024

The condition that determines whether parameters are estimated for a comparison is whether any of the data columns used in the training blocking rule are also used in any of that comparison's levels.

In your case, the sname comparison makes reference to the columns sex and mar, which also appear in your training blocking rules, and so this comparison cannot be estimated. To train the parameters for the sname comparison you will need to use a blocking rule that does not use any of the columns sname, sex, or mar, as these are the columns that the sname comparison depends on.
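
The rule described above can be sketched in a few lines of plain Python. This is an illustration only, not Splink's internal code: the helper `trainable_comparisons` and the column names `fname` and `birth_year` are hypothetical, and the set of columns each comparison depends on is written out by hand rather than read from a Splink settings object.

```python
# Hypothetical illustration of the rule; this is NOT Splink's implementation.

def trainable_comparisons(comparison_columns, blocking_columns):
    """Return the comparisons that share no column with the blocking rule."""
    blocked = set(blocking_columns)
    return [
        name
        for name, cols in comparison_columns.items()
        if blocked.isdisjoint(cols)
    ]


# Columns referenced by each comparison's levels (assumed for illustration).
# The custom sname comparison also refers to sex and mar.
comparison_columns = {
    "sname": {"sname", "sex", "mar"},
    "fname": {"fname"},
    "birth_year": {"birth_year"},
}

# Blocking on sex: sname is skipped even though sname itself is not blocked on.
print(trainable_comparisons(comparison_columns, ["sex", "birth_year"]))
# → ['fname']

# Blocking on a column that sname does not depend on lets sname be trained.
print(trainable_comparisons(comparison_columns, ["fname"]))
# → ['sname', 'birth_year']
```

Under this reading, a comparison is trained only in sessions whose blocking rule avoids every column the comparison touches, which is why at least one training blocking rule free of sname, sex, and mar is needed.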

The match weight chart (and the m u parameters chart) will show the default m-values for any comparison that has no trained values associated to it, so those will probably be what you are seeing there.

The parameter estimates chart should not show default values, and should only be displaying values that are estimated from training sessions (expectation maximisation or estimate u from random sampling) - if you do have m-values appearing there for sname, would you be able to upload an image of it?

@RobinL
Member

RobinL commented Apr 29, 2024

I think possibly the distinction here is whether you're displaying from linker.match_weights_chart() (which iirc does display default values) or the charts returned by the training session:

training_session = linker.estimate_parameters_using_expectation_maximisation(block_on(["first_name"]))
training_session.match_weights_interactive_history_chart()

(which shouldn't)

I admit, it's a bit confusing that linker.match_weights_chart() shows default values, we should probably improve that somehow!

@lamaeldo
Author

Thanks both for the replies, this solves it. @ADBond apologies, there were indeed no values shown for sname in parameter_estimate_comparisons_chart()
