Create a function to make synthetic raw data #23

ThomUK · 2023-12-08T10:03:09Z

To be used in documentation and potentially during testing.

The function could be called something like create_waitinglist(n = 1000), which would return a dataframe of 1000 patients with columns for:

Hospital Site
Specialty
OPCS4 Code
Referral datetime
Treatment datetime
Removal without treatment datetime

The text was updated successfully, but these errors were encountered:

jacgrout · 2023-12-12T15:07:25Z

I'm interested in helping with this one

kaituna · 2024-02-09T12:04:42Z

Just having a think about this and maybe we could use the following procedure to generate synthetic data:

Specify a date range to generate data over.
Specify the mean daily arrival rate (A_r) onto waitlist (user definable with a realistic default).
Specify rate parameter R_r for exponential distribution that mimics removal from waitlist (user definable with a realistic default).
Create A_r entries per date (maybe with a bit of random variation added into the mix, or even using a vector of probabilities so we can model varying demand over day of week, for example). This represents Entry onto Waitlist
Draw a random waiting time in days from an exponential distribution using rate parameter R_r from Step 3.
Add this waiting time to the 'Entry onto Waitlist' date to compute the 'Removal from Waitlist' date. (N.B. if this removal date is greater than the date range specified in Step 1 then we say that the patient is still on the waitlist maybe? This way we have an outstanding balance that we can use to calculate relief capacity, etc)

We should be able to create synthetic data for waiting lists that are stable or unstable by varying the initial parameters. For example, if the removal rate is lower than arrival rate then we will generate a dataset that shows an increasing waitlist size.

Just some thoughts for consideration...

jacgrout · 2024-02-12T10:12:13Z

So, just to get this clear in my head, we would call a function like this:

create_synthetic_data(start_date, end_date, Ar, Rr)

where the variables in the function call would be as follows:

start_date -date range as per 1 above
end_date - date range as per 1 above
Ar - mean daily arrival rate - might need to allow a decimal as a mean could be decimal
Rr - rate parameter

This would cause there to be a varying number of rows in the synthetic dataset since as an example with Ar=5:

check_rows_created <- function(start_date, end_date, Ar){
days = as.double(difftime(as.Date(end_date),as.Date(start_date), units = "days"))
rows=days*Ar
return(rows)
}

check_rows_created ("2024/02/12","2024/03/10",5)

would give 135 rows, but earlier we were suggesting 1000. To achieve this then I think we either need to set only the start date and Ar and then create 1000 rows at the Ar per day from the start date OR we choose to not pre-set on the idea of 1000 rows of synthetic data?

kaituna · 2024-02-12T10:26:54Z

I have been playing about with a toy function and come up with this:

create_waiting_list <- function(n, mean_arrival_rate, mean_wait, start_date){
  
dates <- rep(seq.Date(from = as.Date(start_date,format="%Y-%m-%d"),length.out = n, by = "day"),mean_arrival_rate)

values <- rexp(length(dates),1/mean_wait)
test_df <- data.frame(addition_date = dates, wait_length = values)
test_df$removal_date <- test_df$addition_date + round(test_df$wait_length,0)
test_df <- test_df[order(test_df$addition_date),c(1,3)]

return(test_df)

}


create_waiting_list(n = 366,
                    mean_arrival_rate = 50,
                    mean_wait = 21,
                    start_date = "2024-01-01")

This should produce 50 referrals per day over the course of the year, each with a variable waiting time that matches an exponential distribution with a mean wait of 21 days. Not sure if I am the right track here though...

jacgrout · 2024-02-12T11:47:41Z

I think you are on the right track, but I still think my question above about 1000 rows or not needs to be answered in order to correctly write the function. @ThomUK suggested 1000 in the opening of this issue so perhaps he will clarify.
Also being ultra pedantic the resulting dataframe has a mean_wait of 21.15951, rather than the specified 21 in the function call. I don't know if this would ultimately matter or not?

kaituna · 2024-02-12T12:03:38Z

I assumed the 1000 figure was an example? If a specific number of rows are needed you could just ensure that the product of n and mean_arrival_rate is equal to the desired quantity (assuming no variation in the arrival rate).

As the wait times are randomly taken from the exponential distribution it should tend closer to the mean as the sample size grows I would have thought. Each time you call the function it will be different anyway.

kaituna · 2024-02-12T12:38:17Z

A slight variation on my toy function that includes a parameter for variation in daily arrival rate:

create_waiting_list_2 <- function(n, mean_arrival_rate, mean_wait, start_date, sd){
  
  dates <- seq.Date(from = as.Date(start_date,format="%Y-%m-%d"),length.out = n, by = "day")
  counts <- pmax(0,rnorm(n, mean = mean_arrival_rate, sd = sd))
  referrals <- rep(dates, times = counts)
  values <- rexp(length(referrals),1/mean_wait)
  test_df <- data.frame(addition_date = referrals, wait_length = values)
  test_df$removal_date <- test_df$addition_date + round(test_df$wait_length,0)
  test_df <- test_df[order(test_df$addition_date),c(1,3)]
  
  return(test_df)
  
}

test_wl <- create_waiting_list_2(mean_arrival_rate = 100,
                               mean_wait = 21,
                               start_date = "2024-01-01",
                               n = 366,
                               sd = 10)

It assumes a normal distribution for the variation but prevents negative values. Maybe a different distribution, or perhaps using a vector of probabilities based on day of week, would be a better approach? I think we need to look at the characteristics of some real world data as a starting point. This will be especially important if we want to start generating OPCS codes as we will need to get a realistic case mix output.

jacgrout · 2024-02-12T13:02:49Z

Ive been playing around with a function to wrap around yours @kaituna which would allow the creation of waiting lists that include variation for hospital site and specialty etc, each with their own mean_waits, start_dates etc so starting with something like this to feed in:

I'll paste that here when I think it might be ok and then maybe one of us should create a pull-request and put it all together for a review?

jacgrout · 2024-02-12T13:06:32Z

Also, we need to not forget about including "Removal without treatment datetime" in any resulting dataframe.

kaituna · 2024-02-12T13:25:09Z

Yeah, great stuff!

Maybe we could just set a parameter for proportion of ROTT and randomly flag rows as being removed for reasons other than treatment?

jacgrout · 2024-02-12T15:15:07Z

Yeah, great stuff!

Maybe we could just set a parameter for proportion of ROTT and randomly flag rows as being removed for reasons other than treatment?

I guess for the purposes of synthetic data we either let the user give us the ROTT as an input parameter or we just randomly flag at the same rate across the whole dataframe?

jacgrout · 2024-02-12T15:20:57Z

Ok..... had a bit of help from a colleague learning the pmap function but here goes....

slight tweak to your function

create_waiting_list_3 <- function(n, mean_arrival_rate, mean_wait, start_date, sd, ...){
  dots <- list(...)
  dates <- seq.Date(from = as.Date(start_date,format="%Y-%m-%d"),length.out = n, by = "day")
  counts <- pmax(0,rnorm(n, mean = mean_arrival_rate, sd = sd))
  referrals <- rep(dates, times = counts)
  values <- rexp(length(referrals),1/mean_wait)
  test_df <- data.frame(addition_date = referrals, wait_length = values)
  test_df$removal_date <- test_df$addition_date + round(test_df$wait_length,0)
  test_df <- test_df[order(test_df$addition_date),c(1,3)]
  
  return( dplyr::as_tibble(c(dots, test_df)) )
}

make a dataframe to specify the waiting lists we want to create:

bulk <- data.frame(hospital_sites=c("ABC001","DHR70","JRW20","RFW002","DHR70"),
                   specialties = c(100,110,120,130,100),
                   opcs4_code = c("A","B","C","D","A"),
                   n= 366,
                   mean_arrival_rate= c(50,25,20,40,50),
                   mean_wait = c(21,20,10,30,21),
                   start_date = c("2024-01-01","2023-04-01","2024-04-01","2023-01-01","2024-01-01"),
                   sd=10
)

run the create_waiting_list_3 function for each set of parameters, sites, etc in the bulk dataframe:

result <- bulk |>
  purrr::pmap(create_waiting_list_3) |>
  dplyr::bind_rows()

jacgrout · 2024-02-12T15:22:58Z

Still need to deal with ROTT

kaituna · 2024-02-12T16:50:55Z

How about this for a revised function that randomly flags a user defined proportion of rows as ROTT:

create_waiting_list <- function(n, mean_arrival_rate, mean_wait, start_date = Sys.Date(), sd = 0, rott = 0,  ...){
  
  dots <- list(...)
  
  #Generate date range and number of referrals for each date (with or without random variation around mean_arrival_rate)
  dates <- seq.Date(from = as.Date(start_date,format="%Y-%m-%d"),length.out = n, by = "day")
  counts <- pmax(0,rnorm(n, mean = mean_arrival_rate, sd = sd))
  referrals <- rep(dates, times = counts)
  
  #Set random waiting time in days for each referral received with exponential distribution rate 1/mean_waiting_time
  values <- rexp(length(referrals),1/mean_wait)
  
  #Create dataframe of referrals and calculate removal date
  test_df <- data.frame(addition_date = referrals, wait_length = values)
  test_df$removal_date <- test_df$addition_date + round(test_df$wait_length,0)
  
  #Randomly flag user defined proportion of referrals as ROTT
  test_df$rott <- FALSE
  sample_list <- sample(1:nrow(test_df),nrow(test_df)*rott, replace = FALSE)
  test_df$rott[sample_list] <- TRUE
  
  #Add a patient ID to each referral and prepare data for return
  test_df$pat_id <- 1:nrow(test_df)
  test_df <- test_df[order(test_df$addition_date),c("pat_id","addition_date","removal_date","wait_length","rott")]
  
  return( dplyr::as_tibble(c(dots, test_df)) )
}

I've added a column with the raw wait length so we can look at the underlying distribution more easily. There is also a patient ID column which is just a incrementing number at the moment. I've added it because I have used the test waitlist output in conjunction with the patientcounter library to look at the resulting waitlist size over time.

test_wl <- create_waiting_list(mean_arrival_rate = 100,
                               mean_wait = 21,
                               start_date = "2024-01-01",
                               n = 366,
                               sd = 10,
                               rott = 0.05)

Should still work with your extended wrapper function hopefully.

jacgrout · 2024-02-12T18:09:07Z

Yes it still works with the wrapper function. My only comment would be that the wait_length in the resulting dataframe is not the actual wait length that applies to that record of data. Should we add test_df$wait_length <- round(test_df$wait_length,0) to your function to reset the column to reflect the generated dates? I suppose it depends on the purpose of retaining the wait_length in the dataframe.

jacgrout · 2024-02-12T18:14:29Z

Actually, also wondering about start_date = Sys.Date() in your function call, unsure what that is for, if we are going to take that from the dataframe from the start_date column to specify the synthetic data?

jacgrout · 2024-02-12T18:23:49Z

@kaituna Do you want to make the pull request or shall I?

kaituna · 2024-02-12T19:23:01Z

Actually, also wondering about start_date = Sys.Date() in your function call, unsure what that is for, if we are going to take that from the dataframe from the start_date column to specify the synthetic data?

It's just a default value if the user doesn't specify a start date.

My only comment would be that the wait_length in the resulting dataframe is not the actual wait length that applies to that record of data. Should we add test_df$wait_length <- round(test_df$wait_length,0) to your function to reset the column to reflect the generated dates?

Yes, I was going to suggest this, I was just using the raw values to check that everything made sense. I agree it should tally with the generated removal dates.

Happy for you to do the pull request if you like, My laptop takes forever to do any git command, not sure why, think it's something to do with mapped drives.

We probably need some roxygen markup for the documentation, but I've not gone through that learning curve yet! :)

kaituna · 2024-02-13T09:48:29Z

Having pondered it for a while I'm wondering if we might need to adjust the approach we are taking. The current method will produce data for queues that eventually stabilize given sufficient time, so we can't model queues where the load factor is perpetually >1. I think we need to include some parameter that defines the system capacity, but we probably need to think how we implement this.

I'm not sure if we want to start introducing dependencies into this function but might it be easier to look at some queue simulation packages, rather than re-invent the wheel?

I guess we need to discuss what type of synthetic data we are hoping to get out of the function.

ThomUK · 2024-03-09T11:17:07Z

Hi both, just to say this is really good progress. I'm still experimenting with what you've written (I have some catching up to do), but I think the approach is neat:

A core function to generate addition and removal dates
A supporting function to make the output richer in terms of sites, specialities, etc for demo purposes

I'll do some more experimenting, but I think the load factor depends on the capacity, which would be downstream of these functions. The same pattern of arrivals which would be handled easily in ED would stretch a small specialty past a load factor of 1.

The only thing I can think of is that the generated waiting list needs to be non-zero in size, so some of the addition dates need to be unresolved, with a removal_date of NA. These being the patients that haven't been seen yet.

kaituna · 2024-03-13T15:55:39Z

The only thing I can think of is that the generated waiting list needs to be non-zero in size, so some of the addition dates need to be unresolved, with a removal_date of NA. These being the patients that haven't been seen yet.

Yes, I was thinking about this. I was considering adding a function parameter that defines if you want to suppress removal dates past the point at which additions are occurring.

Basically, if you specify a start date of 2024-01-01 and an n (number of days) = 366 then you could generate the removal dates as currently happens but replace any values greater than 2024-01-01 plus 366 days with an NA.

I think this should essentially simulate a snapshot that was run at midnight on 2025-01-01, looking retrospectively at all referrals made between 2024-01-01 and 2024-12-31.

kaituna · 2024-03-13T19:53:25Z

Ok, just had a play and come up with this:

create_waiting_list <- function(n, mean_arrival_rate, mean_wait, start_date = Sys.Date(), limit_removals = TRUE, sd = 0, rott = 0,  ...){
  
  dots <- list(...)
  
  #Generate date range and number of referrals for each date (with or without random variation around mean_arrival_rate)
  dates <- seq.Date(from = as.Date(start_date,format="%Y-%m-%d"),length.out = n, by = "day")
  counts <- pmax(0,rnorm(n, mean = mean_arrival_rate, sd = sd))
  referrals <- rep(dates, times = counts)
  
  #set random waiting time in days for each referral received with exponential distribution rate 1/mean_waiting_time
  values <- rexp(length(referrals),1/mean_wait)
  
  #Create dataframe of referrals and calculate removal date
  test_df <- data.frame(addition_date = referrals, wait_length = values)
  test_df$removal_date <- test_df$addition_date + round(test_df$wait_length,0)
  
  #Suppress removal dates that are greater than start date + 'n' days to simulate non-zero waitlist at end of simulated period.
  if (limit_removals) {
    test_df$removal_date[test_df$removal_date > (as.Date(start_date,format="%Y-%m-%d") + n)] <- NA
    test_df$wait_length[is.na(test_df$removal_date)] <- NA
    }
  
  #Randomly flag user defined proportion of referrals as ROTT
  test_df$rott <- FALSE
  sample_list <- sample(1:nrow(test_df),nrow(test_df)*rott, replace = FALSE)
  test_df$rott[sample_list] <- TRUE
  
  #Add a patient ID to each referral and prepare data for return
  test_df$pat_id <- 1:nrow(test_df)
  test_df <- test_df[order(test_df$addition_date),c("pat_id","addition_date","removal_date","wait_length","rott")]
  
  return( dplyr::as_tibble(c(dots, test_df)) )
}

You can then call the function and end up with a non-zero waiting list at the end of the simulated period:

test_wl <- create_waiting_list(mean_arrival_rate = 31,
                               mean_wait = 26.4,
                               start_date = "2024-01-01",
                               n = 366,
                               sd = 24,
                               rott = 0.1,
                               limit_removals = TRUE)

Is this what we want?

jacgrout · 2024-03-18T08:20:09Z

Quickly looked and it seem good to me! Sorry I've gone quiet but I've had some big work deadlines to focus on. This week is also a bit crazy. I'm on leave next week so I should be ok to work on it more. I made the pull request with the previous version and was just at the point of looking at roxygen documentation so I will aim for this next week and to incorporate these changes above if that's OK?

jacgrout · 2024-05-03T14:12:20Z

I have submitted a pull request for this issue with the addition of two functions:
create_waiting_list() and create_bulk_synthetic_data()
plus set up and creation of demo-data with data/ and data-raw/ folders
The demo_df dataframe is a dummy data set from which a synthetic set can be generated via

NHSRWaitinglist::create_bulk_synthetic_data(demo_df)

This generates a single synthetic data set of 5 waiting lists using the contents of the built in demo dataframe demo_df as the input parameters for each waiting list

We can alter the code that creates demo_df to whatever we see fit before release

The user can equally use their own dataframe with details of their waiting lists, or if they only have one waiting list then they can generate directly using create_waiting_list()

Linked to #51 we could consider whether we want to tweak this to add another piece of code to create a fixed synthetic dataset maybe using set.seed with the demo_df to create a dataframe of fixed synthetic patient level data?

Lextuga007 · 2024-05-13T21:44:07Z

@all-contributors please add @kaituna idea and code
@all-contributors please add @jacgrout idea and code

allcontributors · 2024-05-13T21:44:09Z

@Lextuga007

ThomUK mentioned this issue Dec 8, 2023

Create a cleaning function to shape raw data for use within the package #24

Closed

ThomUK added the help wanted Extra attention is needed label Dec 8, 2023

chrismainey assigned ThomUK, jacgrout and kaituna Feb 9, 2024

ThomUK added this to the v0.1 milestone Feb 9, 2024

jacgrout mentioned this issue Feb 12, 2024

Create a bulk calculation function for multiple sites and specialties #25

Closed

Lextuga007 mentioned this issue Apr 12, 2024

Fixed synthetic data set #51

Closed

Lextuga007 linked a pull request May 13, 2024 that will close this issue

23 create function to make synthetic data #63

Merged

ThomUK closed this as completed in #63 Jul 16, 2024

ThomUK closed this as completed in abf4014 Jul 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create a function to make synthetic raw data #23

Create a function to make synthetic raw data #23

ThomUK commented Dec 8, 2023

jacgrout commented Dec 12, 2023

kaituna commented Feb 9, 2024

jacgrout commented Feb 12, 2024

kaituna commented Feb 12, 2024 •

edited

Loading

jacgrout commented Feb 12, 2024 •

edited

Loading

kaituna commented Feb 12, 2024

kaituna commented Feb 12, 2024

jacgrout commented Feb 12, 2024

jacgrout commented Feb 12, 2024

kaituna commented Feb 12, 2024

jacgrout commented Feb 12, 2024

jacgrout commented Feb 12, 2024 •

edited

Loading

jacgrout commented Feb 12, 2024

kaituna commented Feb 12, 2024 •

edited

Loading

jacgrout commented Feb 12, 2024

jacgrout commented Feb 12, 2024

jacgrout commented Feb 12, 2024

kaituna commented Feb 12, 2024

kaituna commented Feb 13, 2024

ThomUK commented Mar 9, 2024 •

edited

Loading

kaituna commented Mar 13, 2024

kaituna commented Mar 13, 2024

jacgrout commented Mar 18, 2024

jacgrout commented May 3, 2024

Lextuga007 commented May 13, 2024

allcontributors bot commented May 13, 2024

Create a function to make synthetic raw data #23

Create a function to make synthetic raw data #23

Comments

ThomUK commented Dec 8, 2023

jacgrout commented Dec 12, 2023

kaituna commented Feb 9, 2024

jacgrout commented Feb 12, 2024

kaituna commented Feb 12, 2024 • edited Loading

jacgrout commented Feb 12, 2024 • edited Loading

kaituna commented Feb 12, 2024

kaituna commented Feb 12, 2024

jacgrout commented Feb 12, 2024

jacgrout commented Feb 12, 2024

kaituna commented Feb 12, 2024

jacgrout commented Feb 12, 2024

jacgrout commented Feb 12, 2024 • edited Loading

jacgrout commented Feb 12, 2024

kaituna commented Feb 12, 2024 • edited Loading

jacgrout commented Feb 12, 2024

jacgrout commented Feb 12, 2024

jacgrout commented Feb 12, 2024

kaituna commented Feb 12, 2024

kaituna commented Feb 13, 2024

ThomUK commented Mar 9, 2024 • edited Loading

kaituna commented Mar 13, 2024

kaituna commented Mar 13, 2024

jacgrout commented Mar 18, 2024

jacgrout commented May 3, 2024

Lextuga007 commented May 13, 2024

allcontributors bot commented May 13, 2024

kaituna commented Feb 12, 2024 •

edited

Loading

jacgrout commented Feb 12, 2024 •

edited

Loading

jacgrout commented Feb 12, 2024 •

edited

Loading

kaituna commented Feb 12, 2024 •

edited

Loading

ThomUK commented Mar 9, 2024 •

edited

Loading