Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a function to make synthetic raw data #23

Closed
ThomUK opened this issue Dec 8, 2023 · 26 comments · Fixed by #63
Closed

Create a function to make synthetic raw data #23

ThomUK opened this issue Dec 8, 2023 · 26 comments · Fixed by #63
Assignees
Labels
help wanted Extra attention is needed
Milestone

Comments

@ThomUK
Copy link
Collaborator

ThomUK commented Dec 8, 2023

To be used in documentation and potentially during testing.

The function could be called something like create_waitinglist(n = 1000), which would return a dataframe of 1000 patients with columns for:

  • Hospital Site
  • Specialty
  • OPCS4 Code
  • Referral datetime
  • Treatment datetime
  • Removal without treatment datetime
@jacgrout
Copy link
Collaborator

I'm interested in helping with this one

@ThomUK ThomUK added this to the v0.1 milestone Feb 9, 2024
@kaituna
Copy link
Collaborator

kaituna commented Feb 9, 2024

Just having a think about this and maybe we could use the following procedure to generate synthetic data:

  1. Specify a date range to generate data over.
  2. Specify the mean daily arrival rate (Ar) onto waitlist (user definable with a realistic default).
  3. Specify rate parameter Rr for exponential distribution that mimics removal from waitlist (user definable with a realistic default).
  4. Create Ar entries per date (maybe with a bit of random variation added into the mix, or even using a vector of probabilities so we can model varying demand over day of week, for example). This represents Entry onto Waitlist
  5. Draw a random waiting time in days from an exponential distribution using rate parameter Rr from Step 3.
  6. Add this waiting time to the 'Entry onto Waitlist' date to compute the 'Removal from Waitlist' date. (N.B. if this removal date is greater than the date range specified in Step 1 then we say that the patient is still on the waitlist maybe? This way we have an outstanding balance that we can use to calculate relief capacity, etc)

We should be able to create synthetic data for waiting lists that are stable or unstable by varying the initial parameters. For example, if the removal rate is lower than arrival rate then we will generate a dataset that shows an increasing waitlist size.

Just some thoughts for consideration...

@jacgrout
Copy link
Collaborator

So, just to get this clear in my head, we would call a function like this:

create_synthetic_data(start_date, end_date, Ar, Rr)

where the variables in the function call would be as follows:

start_date -date range as per 1 above
end_date - date range as per 1 above
Ar - mean daily arrival rate - might need to allow a decimal as a mean could be decimal
Rr - rate parameter

This would cause there to be a varying number of rows in the synthetic dataset since as an example with Ar=5:

check_rows_created <- function(start_date, end_date, Ar){
days = as.double(difftime(as.Date(end_date),as.Date(start_date), units = "days"))
rows=days*Ar
return(rows)
}

check_rows_created ("2024/02/12","2024/03/10",5)

would give 135 rows, but earlier we were suggesting 1000. To achieve this then I think we either need to set only the start date and Ar and then create 1000 rows at the Ar per day from the start date OR we choose to not pre-set on the idea of 1000 rows of synthetic data?

@kaituna
Copy link
Collaborator

kaituna commented Feb 12, 2024

I have been playing about with a toy function and come up with this:

create_waiting_list <- function(n, mean_arrival_rate, mean_wait, start_date){
  
dates <- rep(seq.Date(from = as.Date(start_date,format="%Y-%m-%d"),length.out = n, by = "day"),mean_arrival_rate)

values <- rexp(length(dates),1/mean_wait)
test_df <- data.frame(addition_date = dates, wait_length = values)
test_df$removal_date <- test_df$addition_date + round(test_df$wait_length,0)
test_df <- test_df[order(test_df$addition_date),c(1,3)]

return(test_df)

}


create_waiting_list(n = 366,
                    mean_arrival_rate = 50,
                    mean_wait = 21,
                    start_date = "2024-01-01")

This should produce 50 referrals per day over the course of the year, each with a variable waiting time that matches an exponential distribution with a mean wait of 21 days. Not sure if I am the right track here though...

@jacgrout
Copy link
Collaborator

jacgrout commented Feb 12, 2024

I think you are on the right track, but I still think my question above about 1000 rows or not needs to be answered in order to correctly write the function. @ThomUK suggested 1000 in the opening of this issue so perhaps he will clarify.
Also being ultra pedantic the resulting dataframe has a mean_wait of 21.15951, rather than the specified 21 in the function call. I don't know if this would ultimately matter or not?

@kaituna
Copy link
Collaborator

kaituna commented Feb 12, 2024

I assumed the 1000 figure was an example? If a specific number of rows are needed you could just ensure that the product of n and mean_arrival_rate is equal to the desired quantity (assuming no variation in the arrival rate).

As the wait times are randomly taken from the exponential distribution it should tend closer to the mean as the sample size grows I would have thought. Each time you call the function it will be different anyway.

@kaituna
Copy link
Collaborator

kaituna commented Feb 12, 2024

A slight variation on my toy function that includes a parameter for variation in daily arrival rate:

create_waiting_list_2 <- function(n, mean_arrival_rate, mean_wait, start_date, sd){
  
  dates <- seq.Date(from = as.Date(start_date,format="%Y-%m-%d"),length.out = n, by = "day")
  counts <- pmax(0,rnorm(n, mean = mean_arrival_rate, sd = sd))
  referrals <- rep(dates, times = counts)
  values <- rexp(length(referrals),1/mean_wait)
  test_df <- data.frame(addition_date = referrals, wait_length = values)
  test_df$removal_date <- test_df$addition_date + round(test_df$wait_length,0)
  test_df <- test_df[order(test_df$addition_date),c(1,3)]
  
  return(test_df)
  
}

test_wl <- create_waiting_list_2(mean_arrival_rate = 100,
                               mean_wait = 21,
                               start_date = "2024-01-01",
                               n = 366,
                               sd = 10)

It assumes a normal distribution for the variation but prevents negative values. Maybe a different distribution, or perhaps using a vector of probabilities based on day of week, would be a better approach? I think we need to look at the characteristics of some real world data as a starting point. This will be especially important if we want to start generating OPCS codes as we will need to get a realistic case mix output.

@jacgrout
Copy link
Collaborator

Ive been playing around with a function to wrap around yours @kaituna which would allow the creation of waiting lists that include variation for hospital site and specialty etc, each with their own mean_waits, start_dates etc so starting with something like this to feed in:

image

I'll paste that here when I think it might be ok and then maybe one of us should create a pull-request and put it all together for a review?

@jacgrout
Copy link
Collaborator

Also, we need to not forget about including "Removal without treatment datetime" in any resulting dataframe.

@kaituna
Copy link
Collaborator

kaituna commented Feb 12, 2024

Yeah, great stuff!

Maybe we could just set a parameter for proportion of ROTT and randomly flag rows as being removed for reasons other than treatment?

@jacgrout
Copy link
Collaborator

Yeah, great stuff!

Maybe we could just set a parameter for proportion of ROTT and randomly flag rows as being removed for reasons other than treatment?

I guess for the purposes of synthetic data we either let the user give us the ROTT as an input parameter or we just randomly flag at the same rate across the whole dataframe?

@jacgrout
Copy link
Collaborator

jacgrout commented Feb 12, 2024

Ok..... had a bit of help from a colleague learning the pmap function but here goes....

  1. slight tweak to your function
create_waiting_list_3 <- function(n, mean_arrival_rate, mean_wait, start_date, sd, ...){
  dots <- list(...)
  dates <- seq.Date(from = as.Date(start_date,format="%Y-%m-%d"),length.out = n, by = "day")
  counts <- pmax(0,rnorm(n, mean = mean_arrival_rate, sd = sd))
  referrals <- rep(dates, times = counts)
  values <- rexp(length(referrals),1/mean_wait)
  test_df <- data.frame(addition_date = referrals, wait_length = values)
  test_df$removal_date <- test_df$addition_date + round(test_df$wait_length,0)
  test_df <- test_df[order(test_df$addition_date),c(1,3)]
  
  return( dplyr::as_tibble(c(dots, test_df)) )
}
  1. make a dataframe to specify the waiting lists we want to create:
bulk <- data.frame(hospital_sites=c("ABC001","DHR70","JRW20","RFW002","DHR70"),
                   specialties = c(100,110,120,130,100),
                   opcs4_code = c("A","B","C","D","A"),
                   n= 366,
                   mean_arrival_rate= c(50,25,20,40,50),
                   mean_wait = c(21,20,10,30,21),
                   start_date = c("2024-01-01","2023-04-01","2024-04-01","2023-01-01","2024-01-01"),
                   sd=10
)
  1. run the create_waiting_list_3 function for each set of parameters, sites, etc in the bulk dataframe:
result <- bulk |>
  purrr::pmap(create_waiting_list_3) |>
  dplyr::bind_rows()

@jacgrout
Copy link
Collaborator

Still need to deal with ROTT

@kaituna
Copy link
Collaborator

kaituna commented Feb 12, 2024

How about this for a revised function that randomly flags a user defined proportion of rows as ROTT:

create_waiting_list <- function(n, mean_arrival_rate, mean_wait, start_date = Sys.Date(), sd = 0, rott = 0,  ...){
  
  dots <- list(...)
  
  #Generate date range and number of referrals for each date (with or without random variation around mean_arrival_rate)
  dates <- seq.Date(from = as.Date(start_date,format="%Y-%m-%d"),length.out = n, by = "day")
  counts <- pmax(0,rnorm(n, mean = mean_arrival_rate, sd = sd))
  referrals <- rep(dates, times = counts)
  
  #Set random waiting time in days for each referral received with exponential distribution rate 1/mean_waiting_time
  values <- rexp(length(referrals),1/mean_wait)
  
  #Create dataframe of referrals and calculate removal date
  test_df <- data.frame(addition_date = referrals, wait_length = values)
  test_df$removal_date <- test_df$addition_date + round(test_df$wait_length,0)
  
  #Randomly flag user defined proportion of referrals as ROTT
  test_df$rott <- FALSE
  sample_list <- sample(1:nrow(test_df),nrow(test_df)*rott, replace = FALSE)
  test_df$rott[sample_list] <- TRUE
  
  #Add a patient ID to each referral and prepare data for return
  test_df$pat_id <- 1:nrow(test_df)
  test_df <- test_df[order(test_df$addition_date),c("pat_id","addition_date","removal_date","wait_length","rott")]
  
  return( dplyr::as_tibble(c(dots, test_df)) )
}

I've added a column with the raw wait length so we can look at the underlying distribution more easily. There is also a patient ID column which is just a incrementing number at the moment. I've added it because I have used the test waitlist output in conjunction with the patientcounter library to look at the resulting waitlist size over time.

test_wl <- create_waiting_list(mean_arrival_rate = 100,
                               mean_wait = 21,
                               start_date = "2024-01-01",
                               n = 366,
                               sd = 10,
                               rott = 0.05)

Should still work with your extended wrapper function hopefully.

@jacgrout
Copy link
Collaborator

Yes it still works with the wrapper function. My only comment would be that the wait_length in the resulting dataframe is not the actual wait length that applies to that record of data. Should we add test_df$wait_length <- round(test_df$wait_length,0) to your function to reset the column to reflect the generated dates? I suppose it depends on the purpose of retaining the wait_length in the dataframe.

@jacgrout
Copy link
Collaborator

Actually, also wondering about start_date = Sys.Date() in your function call, unsure what that is for, if we are going to take that from the dataframe from the start_date column to specify the synthetic data?

@jacgrout
Copy link
Collaborator

@kaituna Do you want to make the pull request or shall I?

@kaituna
Copy link
Collaborator

kaituna commented Feb 12, 2024

Actually, also wondering about start_date = Sys.Date() in your function call, unsure what that is for, if we are going to take that from the dataframe from the start_date column to specify the synthetic data?

It's just a default value if the user doesn't specify a start date.

My only comment would be that the wait_length in the resulting dataframe is not the actual wait length that applies to that record of data. Should we add test_df$wait_length <- round(test_df$wait_length,0) to your function to reset the column to reflect the generated dates?

Yes, I was going to suggest this, I was just using the raw values to check that everything made sense. I agree it should tally with the generated removal dates.

Happy for you to do the pull request if you like, My laptop takes forever to do any git command, not sure why, think it's something to do with mapped drives.

We probably need some roxygen markup for the documentation, but I've not gone through that learning curve yet! :)

@kaituna
Copy link
Collaborator

kaituna commented Feb 13, 2024

Having pondered it for a while I'm wondering if we might need to adjust the approach we are taking. The current method will produce data for queues that eventually stabilize given sufficient time, so we can't model queues where the load factor is perpetually >1. I think we need to include some parameter that defines the system capacity, but we probably need to think how we implement this.

I'm not sure if we want to start introducing dependencies into this function but might it be easier to look at some queue simulation packages, rather than re-invent the wheel?

I guess we need to discuss what type of synthetic data we are hoping to get out of the function.

@ThomUK
Copy link
Collaborator Author

ThomUK commented Mar 9, 2024

Hi both, just to say this is really good progress. I'm still experimenting with what you've written (I have some catching up to do), but I think the approach is neat:

  1. A core function to generate addition and removal dates
  2. A supporting function to make the output richer in terms of sites, specialities, etc for demo purposes

I'll do some more experimenting, but I think the load factor depends on the capacity, which would be downstream of these functions. The same pattern of arrivals which would be handled easily in ED would stretch a small specialty past a load factor of 1.

The only thing I can think of is that the generated waiting list needs to be non-zero in size, so some of the addition dates need to be unresolved, with a removal_date of NA. These being the patients that haven't been seen yet.

@kaituna
Copy link
Collaborator

kaituna commented Mar 13, 2024

The only thing I can think of is that the generated waiting list needs to be non-zero in size, so some of the addition dates need to be unresolved, with a removal_date of NA. These being the patients that haven't been seen yet.

Yes, I was thinking about this. I was considering adding a function parameter that defines if you want to suppress removal dates past the point at which additions are occurring.

Basically, if you specify a start date of 2024-01-01 and an n (number of days) = 366 then you could generate the removal dates as currently happens but replace any values greater than 2024-01-01 plus 366 days with an NA.

I think this should essentially simulate a snapshot that was run at midnight on 2025-01-01, looking retrospectively at all referrals made between 2024-01-01 and 2024-12-31.

@kaituna
Copy link
Collaborator

kaituna commented Mar 13, 2024

Ok, just had a play and come up with this:

create_waiting_list <- function(n, mean_arrival_rate, mean_wait, start_date = Sys.Date(), limit_removals = TRUE, sd = 0, rott = 0,  ...){
  
  dots <- list(...)
  
  #Generate date range and number of referrals for each date (with or without random variation around mean_arrival_rate)
  dates <- seq.Date(from = as.Date(start_date,format="%Y-%m-%d"),length.out = n, by = "day")
  counts <- pmax(0,rnorm(n, mean = mean_arrival_rate, sd = sd))
  referrals <- rep(dates, times = counts)
  
  #set random waiting time in days for each referral received with exponential distribution rate 1/mean_waiting_time
  values <- rexp(length(referrals),1/mean_wait)
  
  #Create dataframe of referrals and calculate removal date
  test_df <- data.frame(addition_date = referrals, wait_length = values)
  test_df$removal_date <- test_df$addition_date + round(test_df$wait_length,0)
  
  #Suppress removal dates that are greater than start date + 'n' days to simulate non-zero waitlist at end of simulated period.
  if (limit_removals) {
    test_df$removal_date[test_df$removal_date > (as.Date(start_date,format="%Y-%m-%d") + n)] <- NA
    test_df$wait_length[is.na(test_df$removal_date)] <- NA
    }
  
  #Randomly flag user defined proportion of referrals as ROTT
  test_df$rott <- FALSE
  sample_list <- sample(1:nrow(test_df),nrow(test_df)*rott, replace = FALSE)
  test_df$rott[sample_list] <- TRUE
  
  #Add a patient ID to each referral and prepare data for return
  test_df$pat_id <- 1:nrow(test_df)
  test_df <- test_df[order(test_df$addition_date),c("pat_id","addition_date","removal_date","wait_length","rott")]
  
  return( dplyr::as_tibble(c(dots, test_df)) )
}

You can then call the function and end up with a non-zero waiting list at the end of the simulated period:

test_wl <- create_waiting_list(mean_arrival_rate = 31,
                               mean_wait = 26.4,
                               start_date = "2024-01-01",
                               n = 366,
                               sd = 24,
                               rott = 0.1,
                               limit_removals = TRUE)

Is this what we want?

@jacgrout
Copy link
Collaborator

Quickly looked and it seem good to me! Sorry I've gone quiet but I've had some big work deadlines to focus on. This week is also a bit crazy. I'm on leave next week so I should be ok to work on it more. I made the pull request with the previous version and was just at the point of looking at roxygen documentation so I will aim for this next week and to incorporate these changes above if that's OK?

@jacgrout
Copy link
Collaborator

jacgrout commented May 3, 2024

I have submitted a pull request for this issue with the addition of two functions:
create_waiting_list() and create_bulk_synthetic_data()
plus set up and creation of demo-data with data/ and data-raw/ folders
The demo_df dataframe is a dummy data set from which a synthetic set can be generated via

NHSRWaitinglist::create_bulk_synthetic_data(demo_df)

This generates a single synthetic data set of 5 waiting lists using the contents of the built in demo dataframe demo_df as the input parameters for each waiting list

We can alter the code that creates demo_df to whatever we see fit before release

The user can equally use their own dataframe with details of their waiting lists, or if they only have one waiting list then they can generate directly using create_waiting_list()

Linked to #51 we could consider whether we want to tweak this to add another piece of code to create a fixed synthetic dataset maybe using set.seed with the demo_df to create a dataframe of fixed synthetic patient level data?

@Lextuga007 Lextuga007 linked a pull request May 13, 2024 that will close this issue
@Lextuga007
Copy link
Member

@all-contributors please add @kaituna idea and code
@all-contributors please add @jacgrout idea and code

Copy link
Contributor

@Lextuga007

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants