Error in mclapply() on Windows #15

Closed
aalexandersson opened this issue Aug 30, 2017 · 13 comments

Comments

@aalexandersson

I am using fastLink on confidential data and get an error in mclapply(). I am using fastLink version 0.1.1 on Windows 7 with 4 cores.

This is the problematic R command:

library(fastLink)
fl.out <- fastLink(rpatient7, racs7,
    varnames = c("bstate", "sex", "nysf", "nysl", "ssn", "dob"),
    stringdist.match = c("nysf", "dob"), n.cores = 2)

This is the problematic R output:

==================== 
fastLink(): Fast Probabilistic Record Linkage
==================== 

Calculating matches for each variable.
Error in mclapply(matches.2, function(s) { : 
  'mc.cores' > 1 is not supported on Windows

Immediately after the error, I typed traceback() and this is the result:

> traceback()
4: stop("'mc.cores' > 1 is not supported on Windows")
3: mclapply(matches.2, function(s) {
       ht1 <- which(matrix.1 == s[1])
       ht2 <- which(matrix.2 == s[2])
       list(ht1, ht2)
   }, mc.cores = getOption("mc.cores", no_cores))
2: gammaCK2par(dfA[, varnames[i]], dfB[, varnames[i]], cut.a = cut.a, 
       method = stringdist.method, w = jw.weight, n.cores = n.cores)
1: fastLink(rpatient7, racs7, varnames = c("bstate", "sex", "nysf", 
       "nysl", "ssn", "dob"), stringdist.match = c("nysf", "dob"), 
       n.cores = 2)

If I change the syntax from n.cores = 2 to n.cores = 1 (or if I omit the option) then the R output is fine.

I could not reproduce the error on datasets dfA and dfB. The problem with mclapply() on Windows is discussed further at https://www.r-bloggers.com/implementing-mclapply-on-windows-a-primer-on-embarrassingly-parallel-computation-on-multicore-systems-with-r/

Please advise.
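
For readers hitting the same error: `mclapply()` relies on forking, which is not available on Windows, so any `mc.cores` value above 1 stops with the message shown above. The following is a minimal sketch of one common workaround, assuming a socket cluster is acceptable on Windows. The helper name `par_lapply` is hypothetical and is not part of fastLink, and this is not necessarily how the package itself resolved the issue.

```r
library(parallel)

# Hypothetical helper (not part of fastLink): apply FUN to each element of X.
# On Windows, forking is unavailable, so a socket cluster is used instead
# of mclapply().
par_lapply <- function(X, FUN, n.cores = 1) {
  if (n.cores <= 1) {
    return(lapply(X, FUN))
  }
  if (.Platform$OS.type == "windows") {
    cl <- makeCluster(n.cores)
    on.exit(stopCluster(cl), add = TRUE)  # always shut the workers down
    parLapply(cl, X, FUN)
  } else {
    mclapply(X, FUN, mc.cores = n.cores)
  }
}

squares <- par_lapply(1:4, function(i) i^2, n.cores = 2)
```

Unlike forked workers, socket workers start with an empty workspace, so any objects that `FUN` reads from the calling environment would need to be sent over first, for example with `clusterExport()`.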

@tedenamorado
Collaborator

Hi,

Thanks for pointing that out. To work around the issue, the best option is to install fastLink directly from GitHub, i.e.,

library(devtools)
install_github("kosukeimai/fastLink",dependencies=TRUE)

Since you are using a Windows computer you will need to install Rtools as well. The latest version of Rtools can be downloaded here:

http://mirror.fcaglp.unlp.edu.ar/CRAN/bin/windows/Rtools/

If the problem persists, please let us know.

Ted

@aalexandersson
Author

Hi Ted,

The development version of fastLink worked. Thank you.

I installed the development version from RStudio. RStudio conveniently asked whether I wanted to install Rtools, which I did. I have a minor follow-up question and a feature wish.

Follow-up question: Does the option n.cores perhaps refer to threads rather than to cores?

I have a 4-core, 8-thread CPU. When I specify "n.cores = 2", the output includes "(Using OpenMP to parallelize calculation. 2 threads out of 8 are used.)". When I instead specify "n.cores = 8", I get "[...] 8 threads out of 8 are used.)". This suggests that a name such as n.threads would be more accurate.
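
The difference between cores and threads can be checked from R itself. This is a small sketch using `detectCores()` from the base `parallel` package. The `logical` argument is documented as being honoured only on some platforms (Windows among them), and either call may return `NA`, so the numbers are a hint rather than a guarantee.

```r
library(parallel)

# Logical CPUs include hyper-threads; physical cores do not.
logical_cpus  <- detectCores(logical = TRUE)
physical_cpus <- detectCores(logical = FALSE)

cat("Logical CPUs (threads):", logical_cpus, "\n")
cat("Physical cores:        ", physical_cpus, "\n")
```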

Feature wish: Classification table, for example as in the package RecordLinkage.

A classification table (a.k.a. table of confusion or error matrix) is the traditional summary of linkage results. Beyond match count, match rate, FDR, and FNR, it can also be used to calculate other summary measures such as link count and F-measure.
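
To make the request concrete, here is a minimal sketch of such a table in base R. It assumes the true match status of every record pair is known, which is rarely the case in practice. The vectors `predicted` and `truth` are made-up illustration data, not fastLink output.

```r
# Hypothetical data: one entry per record pair.
# predicted = declared a link; truth = actually the same entity.
predicted <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
truth     <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)

confusion <- table(
  Predicted = factor(predicted, levels = c(TRUE, FALSE),
                     labels = c("Link", "Non-link")),
  Truth     = factor(truth, levels = c(TRUE, FALSE),
                     labels = c("Match", "Non-match"))
)
print(confusion)

tp <- as.numeric(confusion["Link", "Match"])       # true positives
fp <- as.numeric(confusion["Link", "Non-match"])   # false positives
fn <- as.numeric(confusion["Non-link", "Match"])   # false negatives

link_count <- tp + fp
precision  <- tp / (tp + fp)
recall     <- tp / (tp + fn)
fdr        <- fp / (tp + fp)   # false discovery rate = 1 - precision
fnr        <- fn / (tp + fn)   # false negative rate  = 1 - recall
f_measure  <- 2 * precision * recall / (precision + recall)
```

Once the four cells are available, every summary measure mentioned above follows from them, which is why the table is a convenient common output across linkage packages.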

Anders

@tedenamorado
Collaborator

Hi Anders,

We are glad your issue is solved now.

Regarding your questions/request:

  1. Yes, you are right: for most systems, "threads" would be a better way to describe that option. We will try to incorporate such a change in a future release.

  2. We are currently developing functions that will include detailed confusion tables and other ways to present the results graphically.

Note that if you use the wrapper function fastLink(), it is possible to obtain basic summary stats like match rate, FDR, and FNR. For the step-by-step implementation we do not have such functions yet, but as noted above, we are close to having them finished. We will keep you posted!

Thanks for using fastLink and please keep us posted on how your project goes. Again, any additional feedback would be greatly appreciated.

Ted

@aalexandersson
Author

aalexandersson commented Aug 31, 2017 via email

@aalexandersson
Author

aalexandersson commented Sep 1, 2017 via email

@aalexandersson
Author

aalexandersson commented Sep 1, 2017 via email

@kosukeimai kosukeimai reopened this Sep 1, 2017
@kosukeimai
Owner

Try uninstalling and then reinstalling the package to see if that fixes the problem.
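
A minimal sketch of that uninstall-and-reinstall step, using only base R functions. The helper name `reinstall_package` is hypothetical, and it assumes the package is available from the configured CRAN mirror.

```r
# Hypothetical helper: remove an installed package (if present), then
# install it again from the configured repository.
reinstall_package <- function(pkg) {
  if (pkg %in% rownames(installed.packages())) {
    remove.packages(pkg)
  }
  install.packages(pkg)
}

# Usage (best run in a fresh R session so the package is not loaded):
# reinstall_package("fastLink")
```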

@tedenamorado
Collaborator

Hi Anders,

I hope all is OK.

Are you still having these issues when using fastLink?

You asked: "Are inspectEM() and plot() what you meant by 'detailed confusion tables'?"

No, these functions are designed to make plots that present the agreement vectors in an easy-to-interpret fashion.

As for the confusion table, we are still working on such a function. I will let you know when we push it.

Thanks a lot for your patience and all your feedback! We hope fastLink helps with the record linkage problem you are dealing with.

Ted

@aalexandersson
Author

Hi Ted,

Sorry for being late in my reply. Everything is OK. I did not have a chance to reinstall the software at work yet because I am home taking the long weekend off. I do not expect the help file to remain a problem on my Windows 7 with a clean install. I am closing the issue. If the problem remains I will let you know.

Thank you so much for working on adding a confusion table! It would make the results more comparable to those of the R package RecordLinkage, and to traditional output. A confusion table is the main feature that the Florida Cancer Data System and I miss. It would enable me to switch record linkage software at work from RecordLinkage to fastLink.

Best wishes,
Anders

@aalexandersson
Author

A clean re-installation of version 0.2.0 from CRAN fixed the problem with the corrupted help file.

@aalexandersson
Author

aalexandersson commented Oct 3, 2017

Stata has a user-written command, classtabi, which gives another concrete example of how a confusion matrix can be displayed. Unfortunately, the program has two minor bugs, which are described on Statalist here:

https://www.statalist.org/forums/forum/general-stata-discussion/general/1321572-a-new-command-classtabi-now-available-for-download-from-ssc

Hope this helps,
Anders

@aalexandersson
Author

Ariel Linden has now updated his Stata program classtabi to fix the two bugs. Hopefully you will find it useful for developing a similar confusion matrix in fastLink.

https://www.statalist.org/forums/forum/general-stata-discussion/general/1321572-a-new-command-classtabi-now-available-for-download-from-ssc?p=1413865#post1413865

@tedenamorado
Collaborator

tedenamorado commented Oct 9, 2017

Thanks a lot for sharing this with us! We are close to releasing a new version of the package, and we promise that the new function with a confusion table will be released then.

In addition, we are adding two new functions that will allow the users to compare numeric variables based on the absolute difference between them.
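
As a rough illustration of comparing numeric variables by absolute difference, here is a sketch in base R. The function name `abs_diff_agreement`, the threshold `cut.p`, and the default cutoffs are assumptions made for this example, not the API of the upcoming fastLink functions.

```r
# Hypothetical helper: compare every value in a with every value in b.
# Returns a matrix coded 2 = agree, 1 = partially agree, 0 = disagree,
# based on the absolute difference between the two values.
abs_diff_agreement <- function(a, b, cut.a = 1, cut.p = 2) {
  d <- abs(outer(a, b, "-"))
  out <- matrix(0L, nrow = length(a), ncol = length(b))
  out[d <= cut.p] <- 1L   # within the looser threshold
  out[d <= cut.a] <- 2L   # within the tighter threshold
  out
}

# Made-up ages from two datasets
age_a <- c(30, 45, 62)
age_b <- c(31, 47, 80)
agreement <- abs_diff_agreement(age_a, age_b)
print(agreement)
```

Rows index records in the first dataset and columns index records in the second, so each cell is the agreement level for one candidate pair.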
