Matching Dates #52

Open
geneh0 opened this issue Jun 4, 2021 · 3 comments

@geneh0

geneh0 commented Jun 4, 2021

I have a column converted to date format (without time). I have tried as.Date(), lubridate::as_date(), and even as.Date.character().

> df$date_of_contact[1]
[1] "2020-11-15"
> class(df$date_of_contact)
[1] "Date"

When I try to use that column as a variable in fastLink, it throws the error:

Error in charToDate(x) : 
  character string is not in a standard unambiguous format

How do I include dates as part of the match? I'm trying to do an exact match to deduplicate a dataframe.

@aalexandersson

Disclaimer: I am a regular user, not a developer, of fastLink.

The problem is discussed on stackoverflow here. The easiest solution for you might be to use a character type, not a date type. Try this before running fastLink:

df$date_of_contact <- as.character(df$date_of_contact)

@geneh0
Author

geneh0 commented Jun 4, 2021

That's what I currently have in my script, but I'm not sure what the best approach would be for an inexact match, say up to a 13-day difference (updated my original post). String distance matching doesn't seem to make sense, since transposed digits don't have any real significance in a date.
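To illustrate the concern, here is a small base R sketch comparing the calendar gap with the edit distance of the same dates stored as strings. The dates are made up for illustration, and utils::adist() computes a generalized Levenshtein distance, which is not necessarily the string metric fastLink uses.

```r
# Hypothetical dates chosen only to illustrate the mismatch.
near_a <- as.Date("2020-01-31")
near_b <- as.Date("2020-02-01")
far_a  <- as.Date("2020-01-11")
far_b  <- as.Date("2020-01-19")

# Calendar gap in days.
days_near <- as.numeric(near_b - near_a)
days_far  <- as.numeric(far_b - far_a)

# Edit distance between the same dates written as strings.
edit_near <- utils::adist(as.character(near_a), as.character(near_b))[1, 1]
edit_far  <- utils::adist(as.character(far_a), as.character(far_b))[1, 1]

# The pair that is closer in time is further apart as strings.
print(c(days_near = days_near, edit_near = edit_near))
print(c(days_far = days_far, edit_far = edit_far))
```

So a string metric can rank an 8-day gap as a closer match than a 1-day gap, which is why a numeric comparison seems more natural here.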

@aalexandersson

I would convert dates to a numeric variable such as total_days as discussed on stackoverflow here. Then, you can use fastLink with range arguments for a numeric match such as

numeric.match = "total_days", cut.a.num = 0.4, cut.p.num = 6.5
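A minimal sketch of the conversion step in base R, assuming the column name date_of_contact from the original post. The extra dates and the origin are made up for illustration, and the 13-day check at the end is only a sanity check on the data; how cut.a.num and cut.p.num map onto day differences is not confirmed in this thread.

```r
# Toy data: the first date is from the original post, the others are invented.
df <- data.frame(
  date_of_contact = as.Date(c("2020-11-15", "2020-11-20", "2021-01-03"))
)

# Days since a fixed origin. Any origin works, as long as every data set
# being linked uses the same one.
origin <- as.Date("1970-01-01")
df$total_days <- as.numeric(df$date_of_contact - origin)

# Pairwise absolute gaps in days, to see which rows fall inside a 13-day window.
gap_days <- abs(outer(df$total_days, df$total_days, "-"))
within_window <- gap_days <= 13

print(df)
print(within_window)
```

The resulting total_days column is a plain numeric vector, so it can be listed in numeric.match instead of the Date column.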

In the special case where your date variable is a date of birth, you can instead calculate age using eeptools::age_calc().

The documentation for cut.a.num and cut.p.num says

cut.a.num Lower bound for full numeric match. Default is 1
cut.p.num Lower bound for partial numeric match. Default is 2.5

I have two concerns with that documentation:

  1. Why is cut.a.num = 0 not allowed? The resulting error is not user-friendly:
Error in { : 
  task 1 failed - "error in evaluating the argument 'x' in selecting a method for function 'which': subscript out of bounds" 
  2. What is the upper bound for cut.a.num and cut.p.num (that is, the opposite of the "Lower bound")?

I think you mention a relatively simple but practically important issue which deserves better documentation.
