Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checking base during edgelist_from_base() #10

Closed
tjibbed opened this issue Oct 14, 2019 · 15 comments
Closed

Checking base during edgelist_from_base() #10

tjibbed opened this issue Oct 14, 2019 · 15 comments
Assignees
Labels
help wanted Extra attention is needed invalid This doesn't seem right question Further information is requested

Comments

@tjibbed
Copy link
Collaborator

tjibbed commented Oct 14, 2019

There is a bit of code in NetworkBuilding.R that has some adverse effects:

## Checking base message("Checking base...") base = try({ checkBase(base) }) if (class(base)[[1]] == "try-error") { stop("Cannot compute the network: the database is not correctly formated or contains errors. The database must first be checked with the function 'checkBase()'. See the vignettes for more details on the workflow of the package.") }

I think our workflow is designed to allow checkBase() to be performed separately, to allow the users to think about what is wrong in or with the database. However, the above code forces the complete function to be run as a first step, even if the input data has already been checked.

To me there seem to be three options here:
1- Don't check the database in this function (which creates the possibility that the user will pass an unchecked database to the function, with all potential problems that it might create).
2- Create a input flag for checking the database. (No guarantee that the above will not happen, but by actively having to set it to FALSE the user at least has to think about it).
3- Leave it as it is. This can be rather slow, as the database checking does take quite a bit of time for larger databases. On the other hand, it does create a fully integrated function to move from raw data to network in one go... The question is whether this is what we want.

I'd prefer option 1 or 2, but like to hear you're thoughts before adjusting the code.

@tjibbed tjibbed added help wanted Extra attention is needed invalid This doesn't seem right question Further information is requested labels Oct 14, 2019
@PascalCrepey
Copy link
Owner

It's true that at some point, we need to assume that the base that is passed is good and has been checked. I'm ok with option 1 if we rename the functions so that it is clear for the user when the data is supposed to be clean or not. It would allow us to discriminate between high level functions (which call other functions to do the cleaning/checking and building) and low-level ones which only do what they are supposed to do. For example, we could say that all functions like *_from_base need to have checked and clean data, and functions like *_from_rawdata call functions to do the cleaning and potentially call *_from_base functions.
I think it may simplify the building blocks, what do you think ?

@ClementMassonnaud
Copy link
Collaborator

I agree that the best workflow is to perform checkBase() separately.
Here the call to checkBase() is only to see if the database was previously checked. If there is an issue with the base it will not be fixed because no arguments are passed to checkBase(). So it would raise an error and the user would have to use checkBase() separately to fix the database.

I agree that is we must worry about the running time, but I'm not sure that it would be an issue here. checkBase() is merely checking column names, column classes, missing values, which are not very resource-intensive tasks. Even the adjust_overlapping_stays() function is just comparing dates between rows, which is quite fast with data.table, I think. What could take time with the checkBase() function is actually changing the database, but in this case the function would simply raise an error.
I could run some benchmarks to see how the function behaves with larger data sets.

@ClementMassonnaud
Copy link
Collaborator

It's true that at some point, we need to assume that the base that is passed is good and has been checked. I'm ok with option 1 if we rename the functions so that it is clear for the user when the data is supposed to be clean or not.

I am always a bit concerned when we let the responsibility to the users when it can be quite easily handled by the program. Of course, the idea is not to assume the users won't understand the functions, but since we have a rather simple function at hand to double-check, we might as well use it.

As for adding various wrappers of functions, I am afraid it would add some unnecessary complexity to the package. I know that igraph is doing something similar, but I am not so fan. But this is a personal opinion really

@MikeLydeamore
Copy link
Contributor

On a large (~2 million admissions) database, the checkBase() function does take a noticeable amount of time.

Could you add an extra class to the output of the base from checkBase() and then assume if you have an object of that class there is no need to run checkBase() again? If this sounds sensible, happy to patch it in.

@ClementMassonnaud
Copy link
Collaborator

Yes, I thought about that too, but I don't know enough about R classes to implement that.
It seemed like the ideal solution, but I was wondering if creating a new class would make it a bit more difficult for the users to deal with the newly cleaned base (for instance if they want to write it as csv file, etc.)

P.S: could you also detail the benchmark you did on the function? I'm curious to see

@MikeLydeamore
Copy link
Contributor

I sourced all the package files so I could use profvis to have a look, and I also ran checkBase(base) on the base file I got out of a previous checkBase (with relevant arguments), as I think this is what is happening inside the function. Benchmark says that 80 of the 117 seconds are taken up by checkBase.
I'm running this on real data so I can't share, but if you're interested I'm happy to give the simulated data a go and see what happens.

Haven't yet tried not checking the base and comparing, but I'll try and look into that and report back.
image

@ClementMassonnaud
Copy link
Collaborator

Thank you very much for the details
It seems I was misleading, checkBase() takes indeed too much time to be called systematically within the function.
Looking again at the code, it might be due to checking of missing values, which, as I wrote it, involves a lot of string comparisons.

So, I would vote either for your solution of the extra class, or for option 2 proposed by @tjibbed

@tjibbed
Copy link
Collaborator Author

tjibbed commented Oct 15, 2019

Good to see @MikeLydeamore joining the discussion!

I'm tempted to agree that an extra class as output of checkBase(), and possible input of the edgelist_from_base() [and relating functions) might be the best way forward. We could carry the report on numbers of errors forward in that class as well, so that in the end we have a measure for input data quality, which can be exported.

@ClementMassonnaud 's point is important though:

but I was wondering if creating a new class would make it a bit more difficult for the users to deal with the newly cleaned base (for instance if they want to write it as csv file, etc.)

However, I'd imagine that the user would still be able to simply export the database element from that class (e.g. checkedData$base) to obtain a csv file with just the checked data. Right?

@PascalCrepey
Copy link
Owner

PascalCrepey commented Oct 15, 2019

Well, without creating a new object or anything like that, we can just use attributes.
At the end of checkBase(), instead of

if (returnReport) {
  return(report)
} else {
  return(report$base)
}

we could do :

#add the report as a named list attribute to the data.table
attr(report$base, "report") <- list(
  failedParse = report$failedParse,
  removedMissing = report$removedMissing, 
  missing = report$missing, 
  negativeLOS = report$negativeLOS,
  removedErrors = report$removedErrors, 
  removedDuplicates = report$removedDuplicates,
  neededIterations = report$neededIterations, 
  allIterations = report$allIterations, 
  addedAOS = report$addedAOS)

#add the class "checked.base" to the list of class so that we can easily 
#identify whether the base has been checked or not. 
if(!inherits(report$base, "checked.base") class(report$base) <- c("checked.base", class(report$base))

return(report$base)

Later on we can check whether the base has been checked by

if(inherits(my_potentially_checked_base, "checked.base")){
  ##do stuff on the checked base
}

and the report can be accessed via

attr(my_checked_base, "report")

but other than that, the object is still a data.table and behaves as it should...

Would it solve the issue at stake here ?

@MikeLydeamore
Copy link
Contributor

I think so - classes in R aren't exclusive so just tagging the checked.base on the end won't stop the data.table functionality.
Given that the report isn't a thing the user is likely to want frequently, it seems OK to set as an attribute in my opinion.

@tjibbed
Copy link
Collaborator Author

tjibbed commented Oct 15, 2019

Sounds like a good solution (this really pushes the boundaries of my R-knowledge... again, learning a lot here)

@ClementMassonnaud
Copy link
Collaborator

Sounds good to me

@PascalCrepey
Copy link
Owner

One additional thing:
I can't remember what we decided regarding the naming/renaming of the columns in the database. I know it's done in checkBase, but since we now have a way of checking if the base has been checked, I don't really see the point of still putting the alternative column names in argument to some of our functions... and anyway when checkbase() was called with no argument, it would have stopped the running without even using the alternative column names... 
I think it's still useful to allow alternative names for functions that one should be able to use on its own (not called by another function) but my guess is that it's a limited list... 
Which functions do you think the user may want to call directly ? which one do you call directly ?

@tjibbed
Copy link
Collaborator Author

tjibbed commented Oct 16, 2019

I would say we'd only need the possibility of alternative column names in

  • checkBase
  • hospinet_from_subject_database

If we rename the columns in checkBase(), the alternative names can be removed from the following functions:

  • checkFormat
  • checkMissingErrors
  • adjust_overlapping_stays
  • matrix_from_base [Actually, completely deprecated?]
  • edgelist_from_base

I would say we keep these functions internal, and have the users create a hospiNet first, and from there they can get the edgelist and matrix directly.
In that case, any reconstruction of the matrix is through the edgelist, and there would be no direct route from base to matrix. I'm fine with this, as generally the edgelist is much smaller than the matrix. matrix_from_base would therefore lose it's usefulness.

I'll have a go at it.

@tjibbed tjibbed closed this as completed Oct 16, 2019
@tjibbed
Copy link
Collaborator Author

tjibbed commented Oct 16, 2019

Oh by the way: I noticed @MikeLydeamore removed checkBase() from hospinet_from_subject_database() when inserting the check for hospinet.base.

I inserted it again, but within that check. It now reads

if (!inherits(base, "hospinet.base")) { message("Input database was not checked yet, which is required for network reconstruction. Running 'checkBase()' with default parameters. If this doesn't work, please run checkBase() separatelty with custom parameters first.") base=checkBase(base, verbose=verbose, ... ) }

So if an unchecked database is input, the script will first try to check the database with the available information. The user can input any parameters for checkBase() as parameters in hospinet_from_subject_database(), which passes them on.

This runs fine on my large real database.

If the database was already checked, it will skip it and go straight to calculating the edgelist and the rest of the hospinet object.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed invalid This doesn't seem right question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants