Create a function to make synthetic raw data #23
Comments
I'm interested in helping with this one |
Just having a think about this and maybe we could use the following procedure to generate synthetic data:
We should be able to create synthetic data for waiting lists that are stable or unstable by varying the initial parameters. For example, if the removal rate is lower than arrival rate then we will generate a dataset that shows an increasing waitlist size. Just some thoughts for consideration... |
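As a rough illustration of the stable versus unstable idea, here is a minimal sketch. It assumes Poisson-distributed daily arrivals and removals, which nobody in the thread has specified, and `simulate_list_size()` is a hypothetical helper, not part of the package:

```r
# Hypothetical helper: track the waiting list size day by day.
# Assumption: arrivals and removals are Poisson with rates Ar and Rr.
simulate_list_size <- function(n_days, Ar, Rr) {
  size <- integer(n_days)
  current <- 0L
  for (day in seq_len(n_days)) {
    current <- current + rpois(1, Ar)        # new referrals join the list
    removals <- min(current, rpois(1, Rr))   # cannot remove more than are waiting
    current <- current - removals
    size[day] <- current
  }
  size
}

set.seed(42)
unstable <- simulate_list_size(365, Ar = 5, Rr = 4)  # removal rate below arrival rate
stable   <- simulate_list_size(365, Ar = 5, Rr = 6)  # removal rate above arrival rate

tail(unstable, 1)  # list size after a year when Rr < Ar
tail(stable, 1)    # list size after a year when Rr > Ar
```

With the removal rate below the arrival rate the list should keep growing, whereas with the rates reversed it should hover near zero.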
So, just to get this clear in my head, we would call a function like this: create_synthetic_data(start_date, end_date, Ar, Rr), where the variables in the function call would be as follows: start_date - date range as per 1 above. This would cause there to be a varying number of rows in the synthetic dataset. As an example with Ar=5, a helper check_rows_created(start_date, end_date, Ar) called as check_rows_created("2024/02/12","2024/03/10",5) would give 135 rows, but earlier we were suggesting 1000. To achieve 1000, I think we either need to set only the start date and Ar and then create 1000 rows at Ar per day from the start date, OR we choose not to fix on the idea of 1000 rows of synthetic data? |
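The body of check_rows_created() did not survive in this thread, so the following is only a guess at it: a sketch that assumes the row count is the number of days between the two dates multiplied by Ar, which reproduces the 135 figure quoted above.

```r
# Assumed reconstruction: rows = (days between the dates) * arrivals per day.
check_rows_created <- function(start_date, end_date, Ar) {
  n_days <- as.numeric(as.Date(end_date) - as.Date(start_date))
  n_days * Ar
}

check_rows_created("2024/02/12", "2024/03/10", 5)  # 27 days * 5 per day = 135

# The alternative: fix the row count and derive the number of days instead.
days_needed <- ceiling(1000 / 5)  # 200 days at Ar = 5 gives 1000 rows
```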
I have been playing about with a toy function and come up with this:
This should produce 50 referrals per day over the course of the year, each with a variable waiting time drawn from an exponential distribution with a mean wait of 21 days. Not sure if I am on the right track here though... |
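The toy function itself is not shown in this thread. A minimal sketch matching the description might look like the following; the function name, the column names and the rounding of waits to whole days are all assumptions:

```r
# Hypothetical sketch, not the original toy function.
toy_waiting_list <- function(n = 366, mean_arrival_rate = 50, mean_wait = 21,
                             start_date = as.Date("2024-01-01")) {
  dates <- seq(start_date, by = "day", length.out = n)
  # A fixed number of referrals on every day.
  addition_date <- rep(dates, each = mean_arrival_rate)
  # Exponentially distributed waits, in days, with the requested mean.
  wait_length <- rexp(length(addition_date), rate = 1 / mean_wait)
  data.frame(
    addition_date = addition_date,
    removal_date  = addition_date + round(wait_length)
  )
}

set.seed(1)
wl <- toy_waiting_list()
nrow(wl)  # n * mean_arrival_rate rows
mean(as.numeric(wl$removal_date - wl$addition_date))  # should be close to 21
```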
I think you are on the right track, but I still think my question above about 1000 rows or not needs to be answered in order to correctly write the function. @ThomUK suggested 1000 in the opening of this issue so perhaps he will clarify. |
I assumed the 1000 figure was an example? If a specific number of rows is needed, you could just ensure that the product of n and mean_arrival_rate is equal to the desired quantity (assuming no variation in the arrival rate). As the wait times are drawn randomly from the exponential distribution, the sample mean should tend closer to the specified mean as the sample size grows, I would have thought. Each time you call the function the output will be different anyway. |
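A quick way to check both points, using only base R. The values of n and mean_arrival_rate below are just one illustrative pair whose product is 1000:

```r
# Choosing n and mean_arrival_rate so that their product is the desired row count.
n <- 20
mean_arrival_rate <- 50
n * mean_arrival_rate  # 1000 rows

# Sample means of exponential waits for increasing sample sizes.
set.seed(10)
mean_wait <- 21
sample_sizes <- c(100, 1000, 100000)
sample_means <- vapply(
  sample_sizes,
  function(size) mean(rexp(size, rate = 1 / mean_wait)),
  numeric(1)
)
data.frame(sample_sizes, sample_means)
```

The largest sample should land very close to 21, while the smaller ones can be noticeably off.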
A slight variation on my toy function that includes a parameter for variation in daily arrival rate:
It assumes a normal distribution for the variation but prevents negative values. Maybe a different distribution, or perhaps using a vector of probabilities based on day of week, would be a better approach? I think we need to look at the characteristics of some real world data as a starting point. This will be especially important if we want to start generating OPCS codes as we will need to get a realistic case mix output. |
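The variation on the toy function is not shown here either. One way the daily variation could be sketched, assuming the counts are rounded to whole referrals and that negative draws are simply truncated at zero (`daily_arrivals()` and its `sd` parameter are hypothetical names):

```r
# Hypothetical sketch: normally distributed daily referral counts, floored at zero.
daily_arrivals <- function(n, mean_arrival_rate, sd) {
  pmax(0, round(rnorm(n, mean = mean_arrival_rate, sd = sd)))
}

set.seed(7)
counts <- daily_arrivals(366, mean_arrival_rate = 50, sd = 10)
dates <- seq(as.Date("2024-01-01"), by = "day", length.out = 366)

# Repeat each date once per referral arriving on that day.
addition_date <- rep(dates, times = counts)
length(addition_date) == sum(counts)
```

Truncating at zero slightly raises the mean when sd is large relative to mean_arrival_rate, which is one reason a different distribution might be preferable.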
I've been playing around with a function to wrap around yours, @kaituna, which would allow the creation of waiting lists that include variation for hospital site, specialty etc., each with their own mean_waits, start_dates etc., so starting with something like this to feed in: I'll paste that here when I think it might be ok, and then maybe one of us should create a pull request and put it all together for a review? |
Also, we need to not forget about including "Removal without treatment datetime" in any resulting dataframe. |
Yeah, great stuff! Maybe we could just set a parameter for proportion of ROTT and randomly flag rows as being removed for reasons other than treatment? |
I guess for the purposes of synthetic data we either let the user give us the ROTT as an input parameter or we just randomly flag at the same rate across the whole dataframe? |
Ok..... had a bit of help from a colleague learning the pmap function but here goes....
|
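The wrapper code is missing from this thread. The sketch below shows the general row-wise pattern using base R's Map() as a stand-in for purrr::pmap, so it is not the wrapper that was posted; `make_list()`, the parameter table and its values are all made up for illustration:

```r
# Hypothetical single-list generator (stand-in for the toy function).
make_list <- function(n, mean_arrival_rate, mean_wait, start_date) {
  dates <- seq(as.Date(start_date), by = "day", length.out = n)
  addition_date <- rep(dates, each = mean_arrival_rate)
  waits <- round(rexp(length(addition_date), rate = 1 / mean_wait))
  data.frame(addition_date = addition_date, removal_date = addition_date + waits)
}

# One row of parameters per waiting list (illustrative values only).
params <- data.frame(
  hospital_site     = c("Site A", "Site B"),
  specialty         = c("ENT", "Urology"),
  n                 = c(30, 60),
  mean_arrival_rate = c(10, 5),
  mean_wait         = c(21, 42),
  start_date        = c("2024-01-01", "2024-03-01")
)

# Map() walks the columns in parallel, one call per row of params.
lists <- Map(
  function(hospital_site, specialty, n, mean_arrival_rate, mean_wait, start_date) {
    wl <- make_list(n, mean_arrival_rate, mean_wait, start_date)
    data.frame(hospital_site = hospital_site, specialty = specialty, wl)
  },
  params$hospital_site, params$specialty, params$n,
  params$mean_arrival_rate, params$mean_wait, params$start_date
)

# Stack the per-list data frames into one synthetic dataset.
bulk <- do.call(rbind, unname(lists))
table(bulk$hospital_site)
```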
Still need to deal with ROTT |
How about this for a revised function that randomly flags a user defined proportion of rows as ROTT:
I've added a column with the raw wait length so we can look at the underlying distribution more easily. There is also a patient ID column, which is just an incrementing number at the moment. I've added it because I have used the test waitlist output in conjunction with the patientcounter library to look at the resulting waitlist size over time.
Should still work with your extended wrapper function hopefully. |
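The revised function is not reproduced in this thread. As a sketch of just the flagging step, assuming each row is flagged independently with probability equal to the user-supplied proportion (`add_rott_flag()` and the column names `pat_id` and `rott` are hypothetical):

```r
# Hypothetical sketch: add an incrementing patient ID and a random ROTT flag.
add_rott_flag <- function(waiting_list, rott = 0.1) {
  waiting_list$pat_id <- seq_len(nrow(waiting_list))
  # Each row is flagged independently with probability `rott`.
  waiting_list$rott <- runif(nrow(waiting_list)) < rott
  waiting_list
}

set.seed(99)
wl <- data.frame(wait_length = rexp(10000, rate = 1 / 21))
flagged <- add_rott_flag(wl, rott = 0.1)
mean(flagged$rott)  # should be close to 0.1
```

Flagging independently gives a proportion that is only approximately equal to the requested one; sampling an exact number of rows would be an alternative if an exact proportion matters.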
Yes it still works with the wrapper function. My only comment would be that the wait_length in the resulting dataframe is not the actual wait length that applies to that record of data. Should we add |
Actually, also wondering about |
@kaituna Do you want to make the pull request or shall I? |
It's just a default value if the user doesn't specify a start date.
Yes, I was going to suggest this; I was just using the raw values to check that everything made sense. I agree it should tally with the generated removal dates. Happy for you to do the pull request if you like. My laptop takes forever to do any git command, not sure why, I think it's something to do with mapped drives. We probably need some roxygen markup for the documentation, but I've not gone through that learning curve yet! :) |
Having pondered it for a while I'm wondering if we might need to adjust the approach we are taking. The current method will produce data for queues that eventually stabilize given sufficient time, so we can't model queues where the load factor is perpetually >1. I think we need to include some parameter that defines the system capacity, but we probably need to think how we implement this. I'm not sure if we want to start introducing dependencies into this function but might it be easier to look at some queue simulation packages, rather than re-invent the wheel? I guess we need to discuss what type of synthetic data we are hoping to get out of the function. |
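To illustrate the stabilising behaviour described above, here is a small sketch. It assumes 50 arrivals per day and exponential waits with a mean of 21 days, generated independently of any capacity, and counts how many patients are still waiting on each day:

```r
set.seed(1)
n <- 366
mean_arrival_rate <- 50
mean_wait <- 21

# Work in day numbers rather than dates to keep the counting simple.
addition_day <- rep(seq_len(n), each = mean_arrival_rate)
removal_day  <- addition_day + rexp(length(addition_day), rate = 1 / mean_wait)

# Patients on the list on day d: added on or before d, not yet removed.
list_size <- vapply(
  seq_len(n),
  function(d) sum(addition_day <= d & removal_day > d),
  integer(1)
)

list_size[c(7, 30, 100, 200, 366)]  # grows at first, then levels off
```

Because every wait is drawn independently of how busy the system is, the list size should level off at roughly mean_arrival_rate times mean_wait however long the simulation runs, which is why a perpetually growing queue cannot come out of this method.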
Hi both, just to say this is really good progress. I'm still experimenting with what you've written (I have some catching up to do), but I think the approach is neat:
I'll do some more experimenting, but I think the load factor depends on the capacity, which would be downstream of these functions. The same pattern of arrivals which would be handled easily in ED would stretch a small specialty past a load factor of 1. The only thing I can think of is that the generated waiting list needs to be non-zero in size, so some of the addition dates need to be unresolved, with a removal_date of |
Yes, I was thinking about this. I was considering adding a function parameter that defines if you want to suppress removal dates past the point at which additions are occurring. Basically, if you specify a start date of 2024-01-01 and an n (number of days) = 366 then you could generate the removal dates as currently happens but replace any values greater than 2024-01-01 plus 366 days with an I think this should essentially simulate a snapshot that was run at midnight on 2025-01-01, looking retrospectively at all referrals made between 2024-01-01 and 2024-12-31. |
Ok, just had a play and come up with this:
You can then call the function and end up with a non-zero waiting list at the end of the simulated period:
Is this what we want? |
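The function and the example call are missing from this thread. A sketch of the snapshot idea could look like the following, assuming unresolved removals are stored as NA and using a hypothetical `limit_removals` switch; neither detail is confirmed by the surviving text:

```r
# Hypothetical sketch of the snapshot behaviour, not the function that was posted.
snapshot_waiting_list <- function(n = 366, mean_arrival_rate = 50, mean_wait = 21,
                                  start_date = as.Date("2024-01-01"),
                                  limit_removals = TRUE) {
  dates <- seq(start_date, by = "day", length.out = n)
  addition_date <- rep(dates, each = mean_arrival_rate)
  waits <- round(rexp(length(addition_date), rate = 1 / mean_wait))
  removal_date <- addition_date + waits
  if (limit_removals) {
    # Removals after the snapshot date have not happened yet (assumed NA).
    cutoff <- start_date + n
    removal_date[removal_date > cutoff] <- NA
  }
  data.frame(addition_date = addition_date, removal_date = removal_date)
}

set.seed(2024)
wl <- snapshot_waiting_list()
sum(is.na(wl$removal_date))  # patients still waiting at the snapshot
```

With the defaults, the cutoff is 2025-01-01, so the rows with a missing removal_date form the non-zero waiting list at the end of the simulated period.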
Quickly looked and it seems good to me! Sorry I've gone quiet, but I've had some big work deadlines to focus on. This week is also a bit crazy. I'm on leave next week so I should be ok to work on it more. I made the pull request with the previous version and was just at the point of looking at roxygen documentation, so I will aim to finish that next week and to incorporate the changes above, if that's OK? |
I have submitted a pull request for this issue with the addition of two functions. NHSRWaitinglist::create_bulk_synthetic_data(demo_df) generates a single synthetic data set of 5 waiting lists, using the contents of the built-in demo dataframe demo_df as the input parameters for each waiting list. We can alter the code that creates demo_df to whatever we see fit before release. The user can equally use their own dataframe with details of their waiting lists or, if they only have one waiting list, they can generate it directly using create_waiting_list(). Linked to #51, we could consider whether we want to add another piece of code to create a fixed synthetic dataset, maybe using set.seed with demo_df to create a dataframe of fixed synthetic patient-level data? |
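On the set.seed idea, a minimal sketch of how a fixed dataset could be produced. `make_fixed_list()` is a hypothetical stand-in generator, not one of the package functions, whose signatures are not shown in this thread:

```r
# Hypothetical stand-in generator that seeds the RNG before drawing waits.
make_fixed_list <- function(seed, n = 10, mean_arrival_rate = 5, mean_wait = 21) {
  set.seed(seed)
  dates <- seq(as.Date("2024-01-01"), by = "day", length.out = n)
  addition_date <- rep(dates, each = mean_arrival_rate)
  waits <- round(rexp(length(addition_date), rate = 1 / mean_wait))
  data.frame(addition_date = addition_date, removal_date = addition_date + waits)
}

a <- make_fixed_list(123)
b <- make_fixed_list(123)
identical(a, b)  # same seed, same synthetic data
```

Note that calling set.seed() inside a function resets the caller's random number stream, so for a package it may be tidier to seed once in the script that builds the fixed dataset.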
@all-contributors please add @kaituna idea and code |
To be used in documentation and potentially during testing.
The function could be called something like create_waitinglist(n = 1000), which would return a dataframe of 1000 patients with columns for: