Checking base during edgelist_from_base() #10
It's true that at some point, we need to assume that the base that is passed is good and has been checked. I'm ok with option 1 if we rename the functions so that it is clear for the user when the data is supposed to be clean or not. It would allow us to discriminate between high level functions (which call other functions to do the cleaning/checking and building) and low-level ones which only do what they are supposed to do. For example, we could say that all functions like *_from_base need to have checked and clean data, and functions like *_from_rawdata call functions to do the cleaning and potentially call *_from_base functions. |
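The split between low-level and high-level functions proposed above could look roughly like the sketch below. It uses base R only; the function names, the column names (`sID`, `fID`, `Adate`) and the cleaning step are illustrative assumptions, not the package's actual code.

```r
# Hypothetical low-level builder: assumes `base` is already clean and
# only pairs consecutive stays of the same subject.
edgelist_from_base_sketch <- function(base) {
  base <- base[order(base$sID, base$Adate), ]
  n <- nrow(base)
  same_subject <- base$sID[-1] == base$sID[-n]
  data.frame(origin = base$fID[-n][same_subject],
             target = base$fID[-1][same_subject],
             stringsAsFactors = FALSE)
}

# Hypothetical high-level wrapper: cleans first, then delegates to the
# low-level builder. The cleaning here is a stand-in (drop incomplete rows).
edgelist_from_rawdata_sketch <- function(rawdata) {
  clean <- rawdata[stats::complete.cases(rawdata), ]
  edgelist_from_base_sketch(clean)
}

raw <- data.frame(
  sID = c("s1", "s1", "s2", NA),
  fID = c("A", "B", "A", "C"),
  Adate = as.Date(c("2020-01-01", "2020-01-10", "2020-02-01", "2020-03-01")),
  stringsAsFactors = FALSE
)
edges <- edgelist_from_rawdata_sketch(raw)
print(edges)
```

With this layout the name alone tells the user whether cleaning is included: `*_from_rawdata` cleans, `*_from_base` trusts its input.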
I agree that the best workflow is to perform checkBase() separately. I agree that we must worry about the running time, but I'm not sure that it would be an issue here. checkBase() is merely checking column names, column classes, and missing values, which are not very resource-intensive tasks. Even the adjust_overlapping_stays() function is just comparing dates between rows, which is quite fast with data.table, I think. What could take time with the checkBase() function is actually changing the database, but in this case the function would simply raise an error. |
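To make the "check only, never modify" idea concrete, here is a minimal sketch of a validator that inspects column names, column classes and missing values and raises an error instead of changing the data. The function name, the required columns (`sID`, `fID`, `Adate`, `Ddate`) and the error messages are assumptions for illustration; the real checkBase() does more than this.

```r
# Hypothetical read-only validator: stops on the first problem found,
# never alters `base`.
check_base_sketch <- function(base,
                              required = c("sID", "fID", "Adate", "Ddate")) {
  missing_cols <- setdiff(required, names(base))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  for (col in c("Adate", "Ddate")) {
    if (!inherits(base[[col]], "Date")) {
      stop("Column '", col, "' must be of class Date")
    }
  }
  has_na <- vapply(required, function(col) anyNA(base[[col]]), logical(1))
  if (any(has_na)) {
    stop("Missing values in: ", paste(required[has_na], collapse = ", "))
  }
  if (any(base$Ddate < base$Adate)) {
    stop("Some stays have a discharge date before the admission date")
  }
  invisible(TRUE)
}

ok <- data.frame(sID = "s1", fID = "A",
                 Adate = as.Date("2020-01-01"),
                 Ddate = as.Date("2020-01-05"),
                 stringsAsFactors = FALSE)
check_base_sketch(ok)
```

Each test is a single vectorised pass over a column, which is why this kind of check should stay cheap relative to actually repairing the data.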
I am always a bit concerned when we leave the responsibility to the users when it can be quite easily handled by the program. Of course, the idea is not to assume the users won't understand the functions, but since we have a rather simple function at hand to double-check, we might as well use it. As for adding various wrappers of functions, I am afraid it would add some unnecessary complexity to the package. I know that igraph is doing something similar, but I am not a fan of it. But this is a personal opinion really. |
On a large (~2 million admissions) database, the Could you add an extra class to the output of the base from |
Yes, I thought about that too, but I don't know enough about R classes to implement that. P.S: could you also detail the benchmark you did on the function? I'm curious to see |
Thank you very much for the details. So, I would vote either for your solution of the extra class, or for option 2 proposed by @tjibbed |
Good to see @MikeLydeamore joining the discussion! I'm tempted to agree that an extra class as output of checkBase(), and possible input of edgelist_from_base() (and related functions), might be the best way forward. We could carry the report on numbers of errors forward in that class as well, so that in the end we have a measure for input data quality, which can be exported. @ClementMassonnaud's point is important though:
However, I'd imagine that the user would still be able to simply export the database element from that class (e.g. checkedData$base) to obtain a csv file with just the checked data. Right? |
Well, without creating a new object or anything like that, we can just use attributes. Instead of
    if (returnReport) {
        return(report)
    } else {
        return(report$base)
    }
we could do:
    # add the report as a named list attribute to the data.table
    attr(report$base, "report") <- list(
        failedParse = report$failedParse,
        removedMissing = report$removedMissing,
        missing = report$missing,
        negativeLOS = report$negativeLOS,
        removedErrors = report$removedErrors,
        removedDuplicates = report$removedDuplicates,
        neededIterations = report$neededIterations,
        allIterations = report$allIterations,
        addedAOS = report$addedAOS)
    # add the class "checked.base" to the list of classes so that we can easily
    # identify whether the base has been checked or not
    if (!inherits(report$base, "checked.base")) class(report$base) <- c("checked.base", class(report$base))
    return(report$base)
Later on we can check whether the base has been checked with
    if (inherits(my_potentially_checked_base, "checked.base")) {
        ## do stuff on the checked base
    }
and the report can be accessed via attr(my_checked_base, "report"), but other than that, the object is still a data.table and behaves as it should... Would it solve the issue at stake here? |
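The tagging approach above can be tried in isolation with the small sketch below. It uses a plain data.frame rather than a data.table so that it runs with base R alone, and the `report` list holds made-up counts; the helper name `tag_checked` is hypothetical.

```r
# Toy stand-in for the list that the checking step is assumed to produce.
base <- data.frame(sID = c("s1", "s2"), fID = c("A", "B"),
                   stringsAsFactors = FALSE)
report <- list(base = base, removedMissing = 0L, removedDuplicates = 1L)

# Hypothetical helper: attach the report as an attribute and prepend the
# "checked.base" class, without replacing the existing classes.
tag_checked <- function(report) {
  out <- report$base
  attr(out, "report") <- report[setdiff(names(report), "base")]
  if (!inherits(out, "checked.base")) {
    class(out) <- c("checked.base", class(out))
  }
  out
}

checked <- tag_checked(report)

inherits(checked, "checked.base")   # the tag is present
inherits(checked, "data.frame")     # the original class is kept
attr(checked, "report")$removedDuplicates
nrow(checked)                       # ordinary data.frame functions still work
```

Because the new class is prepended rather than substituted, method dispatch falls through to the existing data.frame (or data.table) methods for anything the tag does not define.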
I think so - classes in R aren't exclusive so just tagging the |
Sounds like a good solution (this really pushes the boundaries of my R-knowledge... again, learning a lot here) |
Sounds good to me |
One additional thing: |
I would say we'd only need the possibility of alternative column names in
If we rename the columns in checkBase(), the alternative names can be removed from the following functions:
I would say we keep these functions internal, and have the users create a hospiNet first, and from there they can get the edgelist and matrix directly. I'll have a go at it. |
Oh by the way: I noticed @MikeLydeamore removed checkBase() from hospinet_from_subject_database() when inserting the check for hospinet.base. I inserted it again, but within that check. It now reads
So if an unchecked database is input, the script will first try to check the database with the available information. The user can input any parameters for checkBase() as parameters in hospinet_from_subject_database(), which passes them on. This runs fine on my large real database. If the database was already checked, it will skip it and go straight to calculating the edgelist and the rest of the hospinet object. |
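The behaviour described here (check only when the input is not yet tagged, and forward extra arguments to the checker) could be sketched as follows. Both functions are stand-ins written for this example: `check_stub` is not checkBase(), `drop_incomplete` is a hypothetical parameter, and the returned list is a placeholder for the real hospinet object.

```r
# Stand-in for the checking step: optionally drops incomplete rows,
# fails if missing values remain, then tags the result.
check_stub <- function(base, drop_incomplete = FALSE) {
  if (drop_incomplete) {
    base <- base[stats::complete.cases(base), , drop = FALSE]
  }
  if (any(is.na(base))) stop("base contains missing values")
  class(base) <- union("hospinet.base", class(base))
  base
}

# Stand-in for the network builder: runs the check only for untagged
# input and passes `...` on to the checker.
build_network_stub <- function(base, ...) {
  if (!inherits(base, "hospinet.base")) {
    message("Input not checked yet; running the check first...")
    base <- check_stub(base, ...)
  }
  list(n_stays = nrow(base), checked = inherits(base, "hospinet.base"))
}

raw <- data.frame(sID = c("s1", "s2", NA), fID = c("A", "B", "C"),
                  stringsAsFactors = FALSE)

# Unchecked input: the check runs with the forwarded argument.
res_raw <- build_network_stub(raw, drop_incomplete = TRUE)

# Already checked input: the check is skipped.
checked <- check_stub(raw, drop_incomplete = TRUE)
res_checked <- build_network_stub(checked)
```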
There is a bit of code in NetworkBuilding.R that has some adverse effects:
## Checking base
message("Checking base...")
base = try({
    checkBase(base)
})
if (class(base)[[1]] == "try-error") {
    stop("Cannot compute the network: the database is not correctly formated or contains errors. The database must first be checked with the function 'checkBase()'. See the vignettes for more details on the workflow of the package.")
}
I think our workflow is designed to allow checkBase() to be performed separately, to allow the users to think about what is wrong in or with the database. However, the above code forces the complete function to be run as a first step, even if the input data has already been checked.
To me there seem to be three options here:
1- Don't check the database in this function (which creates the possibility that the user will pass an unchecked database to the function, with all potential problems that it might create).
2- Create an input flag for checking the database. (No guarantee that the above will not happen, but by actively having to set it to FALSE the user at least has to think about it).
3- Leave it as it is. This can be rather slow, as the database checking does take quite a bit of time for larger databases. On the other hand, it does create a fully integrated function to move from raw data to network in one go... The question is whether this is what we want.
I'd prefer option 1 or 2, but I'd like to hear your thoughts before adjusting the code.
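For reference, option 2 could be sketched as below: a `check` argument that defaults to TRUE, so skipping the check is an explicit decision by the caller. The function names, the argument name and the placeholder return value are assumptions made for this example, not existing package code.

```r
# Stand-in validator: errors on missing values, otherwise returns the input.
validate_stub <- function(base) {
  if (any(is.na(base))) {
    stop("base contains missing values; run the check separately to inspect them")
  }
  base
}

# Hypothetical builder with an opt-out flag. The return value is a
# placeholder (number of stays used), not a real edgelist.
network_with_flag <- function(base, check = TRUE) {
  if (check) base <- validate_stub(base)
  nrow(base)
}

raw <- data.frame(sID = c("s1", "s2", NA), fID = c("A", "B", "C"),
                  stringsAsFactors = FALSE)

# Default: the problem is caught before any network is built.
default_result <- try(network_with_flag(raw), silent = TRUE)

# Explicit opt-out: fast, but the incomplete row goes through unnoticed.
unchecked_result <- network_with_flag(raw, check = FALSE)
```

The second call illustrates the trade-off named in the list: the flag saves the checking time on large databases, at the cost of letting unchecked rows into the result.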