
Guidance on improving chances EM algorithm will converge? #61

Open
zross opened this issue May 24, 2022 · 9 comments

Comments

zross commented May 24, 2022

@tedenamorado and @kosukeimai thanks so much for making all of this hard work available in this R package! I'm wondering whether you have published any guidance or suggestions on which situations lead the EM algorithm to fail to converge.

Unfortunately, my data is not shareable, so I'm having trouble giving you a reprex. Broadly, I'm linking birth data with hospitalization data for many different years, and I'm having trouble pinpointing what is causing the failure to converge. Sometimes it converges, sometimes it doesn't.

It does seem that if I exclude any record with any NA value I get convergence more often. But I'd really like to keep these records, and the proportion of NA in the variables (at most 4.5%) does not "seem" too high. In any case, excluding NA values is not a solution that works reliably.

I'm running the linkage, in many cases, on a 200k subsample in my efforts to figure out where the issue is. Some facts:

  1. In most cases, I'm using DOB, last name, first name, race and municipal code
  2. None of these variables is more than 4.5% missing

Any guidance on what I might do to improve the chances the EM algorithm will converge?

lnk <- fastLink::fastLink(
  dfA = dfA,
  dfB = dfB,
  varnames =  c("lk_dob", "lk_last", "lk_first", "lk_race", "lk_muni_res"),
  # DOB as a string match with a cutoff of 0.95 still allows a match when one digit in the last few numbers differs
  stringdist.match = c("lk_dob", "lk_last", "lk_first"),
  cut.a = 0.95,
  dedupe.matches = FALSE,
  threshold.match =  0.975,
  verbose = TRUE
)
@aalexandersson

Disclaimer: I am a regular user, not a fastLink developer.

Does a simpler model without partial matching converge more reliably?

I would use age instead of date of birth (DOB). Does dropping the race variable lead to more frequent convergence? (I work for the Florida cancer registry and almost never link on race because it is not reliable enough as a linkage variable.) It would also help if you could add a linkage variable with more values, such as SSN (in the US) or street number + ZIP code.
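As a concrete starting point, a simpler all-exact run might look like the sketch below, reusing the variable names and threshold from your call. Leaving stringdist.match unset makes every field an exact comparison, which keeps the agreement patterns for the EM step as simple as possible (this is only a sketch, not tested on your data).

# A sketch of a simpler model: every field compared exactly, no string distances
lnk_exact <- fastLink::fastLink(
  dfA = dfA,
  dfB = dfB,
  varnames = c("lk_dob", "lk_last", "lk_first", "lk_race", "lk_muni_res"),
  dedupe.matches = FALSE,
  threshold.match = 0.975,
  verbose = TRUE
)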

zross commented May 24, 2022

I appreciate this. A few answers:

  1. I don't find partial vs non-partial matching changes things
  2. I can't use age since we might be, for example, looking at the mom in 1973 and then seeing a hospitalization in 1980, so age changes between datasets. I did compute year of birth from age, but since that is less precise I'd prefer to use DOB.
  3. Dropping race does not help consistently. I'll try looking again. Agreed it's not a reliable linkage variable but it can be a tiebreaker or add confidence. It's the variable with the most NA in many cases so I can do more experimentation.
  4. Yes! SSN, street etc would help. Unfortunately, especially in older births these variables don't exist. My kingdom for stronger linkage variables!

@aalexandersson

Exact matching is much faster and simpler to compute, so it should converge without problems. How many exact matches are there?

How much overlap is there between the two datasets? fastLink struggles if you have close to 0% or 100% overlap. Imbalance matters too -- how large are the two datasets?

Are you sure that all missing values are coded as missing? Administrative datasets often have hard-coded values such as 99 for missing, which need to be recoded to NA before using fastLink.

Is birth sex available as a linkage variable?
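A sketch of both checks, assuming the linkage fields are character vectors and reusing the variable names from your call:

vars <- c("lk_dob", "lk_last", "lk_first", "lk_race", "lk_muni_res")

# Rough count of record pairs that agree exactly on every linkage field
exact_pairs <- merge(dfA[, vars], dfB[, vars], by = vars)
nrow(exact_pairs)

# Recode hard-coded missing values to NA before running fastLink
# (the "99" and "" codes below are only examples)
for (v in vars) {
  dfA[[v]][dfA[[v]] %in% c("99", "")] <- NA
  dfB[[v]][dfB[[v]] %in% c("99", "")] <- NA
}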

zross commented May 25, 2022

I really appreciate the time you've put in here, thank you.

There is very little overlap in many cases. Only a few new moms from 1980 would show up in hospitalization data from, say, 1990 in the same state, and that could very well be a big part of the issue. In my initial testing I used data that was closer in time and assumed it would keep working as the gap got larger, which is why I had convergence in many of those early runs.

Answers to your questions:

  1. I'm not sure about exact matches in these datasets since I've been including partials. I will take a look and see.
  2. Yes, missing values are coded as missing. We actually have 16 different types of administrative data (live birth, fetal death, hospitalization, and mortality, each with different time slices and a different format), so it took a lot of time to recode the "99" values, empty strings, etc. It's not impossible that something slipped through, but I have a test/review in place for the linkage variables and I don't think this is an issue.
  3. I don't need birth sex at this point. I'm looking at moms, so the live birth data is all female, and I limit the hospitalization data to females.

I'll experiment with removing the string matches, but I suspect you're right that in some of these datasets there will be very few true matches and this will be an issue.

@aalexandersson2

  1. The easiest way is probably to look at $patterns.w, as described at https://github.com/kosukeimai/fastLink. I am interested in the rows with positive weights and gamma values of 2.
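For example, something along these lines (a sketch, assuming lnk is the fastLink output object from the call in your first post):

pw <- lnk$EM$patterns.w

# Agreement patterns the model favors: rows with positive Fellegi-Sunter weights
pw[pw[, "weights"] > 0, ]

# A gamma value of 2 on a field means (near-)exact agreement, so the counts in
# those rows give a sense of how many exact or near-exact matches exist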

I am concerned that you will not be able to get useful results without stronger linkage variables.

@bengoehring

Hi @zross -- just wanted to follow up to see if you gleaned any more tips for getting the EM algorithm to converge. Thanks!

zross commented Apr 26, 2023

Not really. Missing values definitely play a role sometimes, and it seems like more than 20% or 30% missing will be a problem, but not all of the non-convergence seemed to be related to this.

@aalexandersson

Closed issue #30 seems similar, and there Ted gave some additional advice not yet mentioned here, e.g., changing the tolerance criterion. However, to me, the basic issue here is still that we have no output to comment on.

Regarding the amount of missing data, my experience is the same: it causes a convergence issue only if it is over 30% or so.
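For what it's worth, a sketch of the original call with a looser EM tolerance, assuming the tol.em argument of the current fastLink() wrapper (the 1e-03 value is purely illustrative; the package default is tighter):

lnk <- fastLink::fastLink(
  dfA = dfA,
  dfB = dfB,
  varnames = c("lk_dob", "lk_last", "lk_first", "lk_race", "lk_muni_res"),
  stringdist.match = c("lk_dob", "lk_last", "lk_first"),
  cut.a = 0.95,
  tol.em = 1e-03,  # looser convergence tolerance for the EM algorithm
  dedupe.matches = FALSE,
  threshold.match = 0.975,
  verbose = TRUE
)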

@tedenamorado
Collaborator

Hi,

Having a large number of missing values in one field can affect the model's ability to converge, since it must rely on the available information. Another issue arises when merging on many fields that have only a few possible values, such as race or gender. In such cases, the model will rely on the fields that provide more discriminating power, like first and last names.

One suggestion is to use partial matching instead of binary comparison for string-valued fields. Another idea is to provide different starting values for the relevant parameters. Currently, our fastLink wrapper function does not have an argument for different starting values, but we are revising it and plan to add them to the new version we will release this summer.
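For example, a sketch of adding partial agreement categories for the name fields with the existing arguments (the cut.p value below is illustrative; partial.match must be a subset of stringdist.match):

lnk <- fastLink::fastLink(
  dfA = dfA,
  dfB = dfB,
  varnames = c("lk_dob", "lk_last", "lk_first", "lk_race", "lk_muni_res"),
  stringdist.match = c("lk_dob", "lk_last", "lk_first"),
  partial.match = c("lk_last", "lk_first"),  # adds a partial-agreement category
  cut.a = 0.95,
  cut.p = 0.88,  # lower cutoff for partial agreement
  dedupe.matches = FALSE,
  threshold.match = 0.975,
  verbose = TRUE
)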

If anything else comes up, do not hesitate to let us know.

All my best,

Ted
