
Suggestion: keep track of utility of assigning all units each treatment outside of "level_one_learning" #146

Open
beniaminogreen opened this issue Oct 28, 2022 · 32 comments
Labels: enhancement (New feature or request)

@beniaminogreen

Hi All,

Thanks for writing such a fantastic and well-documented package.

I've been working with the package for the past few weeks, and I think I've come across a potential speed improvement that could let users build even deeper trees than before.

Specifically, instead of creating an NxPxD array of cumulative treatment effects in the level_one_learning function, couldn't the algorithm keep a running total of the utility of assigning every unit to each treatment (a D-length vector) and use it to calculate the utility and best action on each side of a potential split?
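Here is a minimal sketch of the bookkeeping I have in mind (`SplitTotals` and its layout are purely illustrative, not code from either package):

```rust
/// gamma[i][a] = estimated reward of assigning treatment a to unit i.
/// Keep one D-length utility total per side of the candidate split; moving
/// a unit left is two D-length updates, and the best action on each side
/// is just an argmax over the running totals.
struct SplitTotals {
    left: Vec<f64>,
    right: Vec<f64>,
}

impl SplitTotals {
    /// Start with every unit on the right-hand side.
    fn new(gamma: &[Vec<f64>], d: usize) -> Self {
        let mut right = vec![0.0; d];
        for row in gamma {
            for (a, r) in row.iter().enumerate() {
                right[a] += r;
            }
        }
        SplitTotals { left: vec![0.0; d], right }
    }

    /// Move one unit from the right child to the left child: O(D).
    fn move_left(&mut self, gamma_row: &[f64]) {
        for (a, r) in gamma_row.iter().enumerate() {
            self.left[a] += r;
            self.right[a] -= r;
        }
    }

    /// Utility of this split: best action on the left + best on the right.
    fn utility(&self) -> f64 {
        let best = |v: &[f64]| v.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        best(&self.left) + best(&self.right)
    }
}

fn main() {
    // Two treatments (D = 2), three units.
    let gamma = vec![vec![1.0, 0.0], vec![0.0, 2.0], vec![0.5, 0.5]];
    let mut totals = SplitTotals::new(&gamma, 2);
    totals.move_left(&gamma[0]); // unit 0 crosses to the left child
    println!("split utility = {}", totals.utility());
}
```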

I've used a flamegraph to try to diagnose where the algorithm spends its time (see below), and it looks like a lot of it goes to allocating and de-allocating these arrays. I'm not a C++ developer, so I can't be sure I'm reading the code / charts correctly, but a decent amount of time also seems to be spent maintaining the binary trees. Those look like they could be converted to vectors once sorted, since they appear to be accessed only by index after creation (but I'm less sure about this).

[flamegraph: where the policytree search spends its time]

I don't know C++, so I can't file a PR to tweak this, but I do know Rust, so I've quickly put together both of these changes in a Rust-backed R package here. The package seems to perform well compared to the existing search algorithm as the number of data points increases (see plot below), and it also parallelizes the search across multiple cores, which increases the speed by another order of magnitude.

[benchmark plot: runtime vs. number of data points]

(I originally called the package "exhaustivetree," but I changed the name shortly after making the graph.)

As you can see, the change in scaling from these tweaks is quite dramatic, and it makes it feasible to run quick policytree searches on larger datasets with more covariates than before. I was delighted to find that I can now run a search for a depth-2 tree on a dataset with 250,000 observations and 55 (categorical) covariates in 15 seconds, a large improvement over the current policytree runtime (which took too long for me to record).

I hope you'll be able to consider these changes / look into them more, as it seems like they offer a large increase in speed. I'd be very happy to answer any questions you have, or to run any tests you would like on the code I have written to verify that the speedups are indeed due to the tweaks I made, rather than any Rust / C++ difference.

Best,
Ben

@erikcs (Member) commented Oct 29, 2022

Hi @beniaminogreen, thanks a lot, this looks very cool and interesting! I don't have time to look closely until a bit later next month, but in the meantime, could I please ask if you get the same result here if you plug in parallel_policy_tree?

@kanodiaayush (Member) commented Oct 30, 2022

Hi @beniaminogreen, thanks for looking into this; your addition seems very promising. In addition to what @erikcs requests above, could you also:

  1. add a comparison with your Rust implementation running on only one core, and
  2. explain your approach in pseudocode; I can't follow the utility tracking right now. As for the trees, we do add/delete in the main loop, so I don't understand what exactly you mean by "access by index". You could explain that in pseudocode too.

I think we should definitely incorporate changes which offer gains of this magnitude.

@beniaminogreen (Author)

Hi @erikcs and @kanodiaayush, thanks for getting back to me so quickly!

Please find below a screenshot of the tests @erikcs asked me to run. I'm not sure how to read these statistics, but the outputs look similar to the ones in the script you sent.

[screenshot of test output]

I should also add that I have been unit-testing the treatment assignments produced by parallel_policy_tree against those produced by the policy_tree function (see here) to make sure that the two methods always produce the same assignments. The tests show that the two methods coincide for tree depths up to 3 (beyond which I don't test, but I assume the assignments should still be identical).

@kanodiaayush, please find below a comparison of parallel_policy_tree and policytree running on a single core. The times likely differ between the two charts, as I lost the benchmarking code I used for the first graph (the code for this run can be found here).

[benchmark plot: single-core comparison of parallel_policy_tree and policytree]

I couldn't get to writing the algorithm in pseudocode this weekend, but I'll be able to do it in the next few days, and I'll post it here as soon as I do.

Best,
Ben

@erikcs (Member) commented Oct 31, 2022

Thanks Ben, that looks great. This is very impressive work. I'm trying to understand where the gains are coming from; could you please run that benchmark plot for dense Xj, i.e. line 11: X = matrix(rnorm(n*p), n, p)?

@beniaminogreen (Author)

Sure thing! Here it is. I tried to optimize performance for discrete / binned features as I think that's the modal case, but it looks like it performs very well even in the case of completely continuous features.

[benchmark plot: dense X]

@erikcs (Member) commented Oct 31, 2022

Sorry for getting you to run this, but I only have my phone at the moment. Could you please run that same script but overlay a third line with the runtime for just a depth-0 policytree? Thanks!

@beniaminogreen (Author) commented Oct 31, 2022

Sure, here it is:

[benchmark plot with depth-0 policytree overlay]

This isn't what I expected to see, so I ran another flamegraph to diagnose where the policytree algorithm spends its time when creating a depth-0 tree for a dataset of 50K units.

[flamegraph of the depth-0 policytree run]

It seems that for depth-0 trees on datasets of this size, most of the time is spent creating the sorted sets. Once that is done, the remaining search does not take much time at all. Maybe the differences in speed come from differences in the time it takes to create these initial data structures?

This might make sense - the first flamegraph that I ran was for a dataset with ~1000 observations, if I remember correctly.

@erikcs (Member) commented Nov 1, 2022

Thanks, yeah, that is why I wanted to see the depth-0 timing; I had a hunch that a big part of the speed difference arose from the sorted sets being created faster in Rust (I think your level-one learning should algorithmically have the same amortized runtime as policytree's; the amount of work it does should be the same).

What data structures are you using to create the sorted sets in Rust? I think (at least part of) your incredible speed improvement may be a system-level improvement in that particular data structure 🤔

@beniaminogreen (Author)

Definitely looks to be the case - good catch! I am using a BTreeMap to create the sorted sets (which are converted to vectors once they have been populated with all the observations).

For each axis, I create an empty BTreeMap with X values as keys and vectors of observation indices as values. So a dataset that looks like this:

| Observation ID | X1 |
|---|---|
| 1 | 0.0 |
| 2 | 2.0 |
| 3 | 0.0 |
| 4 | 2.5 |

would be stored as the following key/value pairs:

| Key | Value |
|---|---|
| 0.0 | vec![1, 3] |
| 2.0 | vec![2] |
| 2.5 | vec![4] |

I populate the BTreeMap by iterating over the observations, either inserting a new key-value pair if the key has not been seen before, or appending the observation's index to the vector associated with the key if it has. In the example above, it would look like:

  1. Instantiate the BTreeMap
  2. Check if (0.0) is a key in the Map. It isn't, so insert the key-value pair (0.0, vec![1]) into the map.
  3. Check if (2.0) is a key in the Map. It isn't, so insert the key-value pair (2.0, vec![2]) into the map.
  4. Check if (0.0) is a key in the Map. It is, so insert the index 3 into the vector associated with this key.
  5. Check if (2.5) is a key in the Map. It isn't, so insert the key-value pair (2.5, vec![4]) into the map.
  6. Convert the BTreeMap to a vector of ObservationBundles, which store the key and the vector of indices for each entry in one struct.

Some of the speed improvement may also come from the fact that I store indices of observations in my trees, while I believe you are storing entire observations, which would make the trees slower to populate / clone.
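For concreteness, here is a minimal std-only Rust sketch of that bundling step (the `Key` wrapper and function names are just illustrative, not my package's actual code; note it uses 0-based indices where the tables above use 1-based IDs):

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

/// f64 is not Ord, so wrap it; total_cmp gives a total order over floats.
#[derive(PartialEq, Clone, Copy)]
struct Key(f64);

impl Eq for Key {}
impl PartialOrd for Key {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Key {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.total_cmp(&other.0)
    }
}

/// One bundle per distinct feature value: the value plus the indices of
/// all observations sharing it.
struct ObservationBundle {
    value: f64,
    indices: Vec<usize>,
}

/// Build the sorted bundles for one axis, then flatten to a vector.
fn bundle_axis(x: &[f64]) -> Vec<ObservationBundle> {
    let mut map: BTreeMap<Key, Vec<usize>> = BTreeMap::new();
    for (i, &v) in x.iter().enumerate() {
        // entry() inserts an empty Vec for an unseen key, then we append.
        map.entry(Key(v)).or_default().push(i);
    }
    // BTreeMap iterates in key order, so the result is already sorted.
    map.into_iter()
        .map(|(k, indices)| ObservationBundle { value: k.0, indices })
        .collect()
}

fn main() {
    let x1 = [0.0, 2.0, 0.0, 2.5];
    for b in bundle_axis(&x1) {
        println!("{} -> {:?}", b.value, b.indices); // 0 -> [0, 2], etc.
    }
}
```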

@erikcs (Member) commented Nov 2, 2022

Great, thanks! Could you please run similar benchmarks for a depth-2 tree on three X-configurations: {all binary, a handful of discrete values, dense (i.e. runif(n*p))}? (policytree may be too slow for very large n on the last one.)

Also, if possible (I don't have my laptop at the moment, so I can't do this myself), could you repeat this + the depth-1 timings from above for a modified policytree where you replace L18 here with #include <set> and change the first part of L22 from boost::container::flat_set to std::set? (EDIT: and delete L85, set.reserve...)

Thanks again!

@beniaminogreen (Author)

Sure, here are the benchmarks for depth-2 trees on the three datasets you describe.

For the unchanged policytree:

[benchmark plots: unchanged policytree]

For the policytree with the changes you described:

[benchmark plots: modified policytree]

@erikcs (Member) commented Nov 2, 2022

Thanks! I'd like a larger number of observations to see what's going on; could you go up to n = 100k, at least for the top plot? It could take a few hours, but I guess you have an external server? (Also, p = 2 and d = 2 is fine.) Thanks again!

@beniaminogreen (Author)

I've been running these on my laptop, but I can put them on my server. I don't know how long they will take, but I'll set them off and report back when they have completed. The only server I can run benchmarks on has 4 cores, so it could be a few days.

@erikcs (Member) commented Nov 2, 2022

Actually, you could leave policytree out of the benchmarks; we have a rough idea of how long that takes, as long as you include an R file with the benchmark details. What would be interesting to see is the Rust policytree (1 core) timing as a function of n, up to 100k or more, for the different X settings.

How many categories are there in the 250k depth 2 tree you mention in your first post?

@beniaminogreen (Author)

Sounds good - I've set the code running again with just the Rust implementation. It will likely still take a few days.

The 250k-observation run was based on a dataset which actually had 45 columns, not 55 as I reported. The dimensions and the number of unique values in each column are below; the columns with 141 and 173 breaks were not included.

[table of column dimensions and unique-value counts]

If they are included, the training time goes up to about 50 seconds on our 40-core server. There is still a lot of room for improvement here, as I parallelize by having each core search only over trees that start by splitting on a specific variable. This means that cores assigned variables with fewer break points finish much earlier than the others and sit around waiting for the stragglers to finish. My bet is that a better parallelization scheme could bring this time down to ~25 seconds.

@beniaminogreen (Author)

[benchmark plot: Rust implementation up to n = 100k]

These came back a lot quicker than I anticipated! The binary and categorical runs each finished in about 15 ms for a dataset of n = 100,000, while the dense dataset took about an hour and 5 minutes. The benchmarking code I used can be found here.

@erikcs (Member) commented Nov 3, 2022

Interesting, policytree takes around 15 sec for the first two and 11 minutes for the last.

@beniaminogreen (Author)

That is interesting. This definitely reflects my focus on the categorical / binary case in coding my version.

I bundle all the observations with the same value of a predictor together so that they can be quickly added / removed from a node as a group. If no observations share the same value of a predictor, then all the bundling code will have a significant speed cost. My guess is that this explains the discrepancy in performance between the categorical and dense cases.

For the amortized sorting operation, I imagine I could limit some of the speed losses by assessing whether a variable is dense and then deciding on an appropriate representation. That might let me keep the advantages of grouping without losing performance in the dense case.

@erikcs (Member) commented Nov 3, 2022

Yes, we should consider handling repeated Xj in policytree; your package makes it clear that it's very worthwhile. I'm not sure when I will get time to take a closer look, though.

And I looked at your Rust code; for the sorted sets, it reminds me of an optional splitting rule GRF used to have. For N unique samples you are essentially maintaining an NxP array of the global sort order. When moving samples L/R you can update indexes in O(1) time since you have an N-length index array, but creating a copy/new one takes O(N) time, where N is the global number of samples. In policytree, creating a new one takes O(n) time, where n is the number of samples in the recursive call, and insertion takes O(log n).

Which of these terms dominates will depend on the number of distinct values (and it will likely have to be large; e.g. num.samples = 5k is too small to notice these differences). If you experiment in benchmarking (varying "denseness" and N), I think you will find a ~threshold where the two runtimes cross. For "fully" dense Xj, your runtime roughly increases 8-fold when doubling num.samples (last plot, from 50k to 100k); for policytree that same ratio will be ~4 instead of 8.
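To put that tradeoff in symbols (my shorthand only): for one recursive call touching $n$ of the $N$ global samples,

$$
T_{\text{index array}} = O(N) + n \cdot O(1), \qquad
T_{\text{sorted sets}} = O(n) + n \cdot O(\log n),
$$

i.e. the copy is $O(N)$ versus $O(n)$ while the per-sample move is $O(1)$ versus $O(\log n)$, so which side wins depends on how far the $n$ of the subcalls falls below the global $N$.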

@beniaminogreen (Author)

Thanks for spelling that out for me - I thought that the array approach would always dominate the sorted-sets implementation in policytree, but now I see that isn't the case. This will definitely be something for me to think about if I keep working on my package.

Hopefully there will be a way for me to get the best of both worlds: the speed that arrays give for small / medium data sizes, with the scaling that your sorted-sets implementation has for large datasets.

@erikcs (Member) commented Nov 8, 2022

Yes, you can probably work out empirical N/n heuristics to switch between implementations at runtime if you really want to keep tinkering! (In these things there are practically always places where you can squeeze out more perf by identifying special cases.) One small thing you could try is making your active array a hash table instead; that might make a small dent in speed if Rust has fast, ~collision-free integer hashing. Another thing to try could be making the most of SIMD. I don't know about Rust, but if you, for example, had the d rewards in an Eigen (C++) vector, it should be able to compute the sum in fewer instructions than a plain std::vector.

How about we add a link to your package in the README highlighting it for people who have large data with few distinct values?

@beniaminogreen (Author)

Thanks for the suggestion - I've just made a branch with the hash-set approach you describe, and I'll benchmark it to see which is faster. I went with the array approach originally so that the active array would be stored in one contiguous block of memory, but I doubt that contiguity will offset the cost of copying the large array each time I want to split along a new axis. I'll benchmark both ways and pick the quicker one.
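For reference, the two representations I'm comparing are roughly these (an illustrative sketch, not the branch's actual code):

```rust
use std::collections::HashSet;

// Dense flag array: contiguous, O(1) membership tests and updates, but
// cloning at each split costs O(N) no matter how many units are alive.
fn array_representation(n: usize) -> Vec<bool> {
    vec![true; n]
}

// Hash set of alive indices: cloning costs O(alive) instead of O(N),
// at the price of hashing and lost cache locality on every lookup.
fn hash_set_representation(n: usize) -> HashSet<usize> {
    (0..n).collect()
}

fn main() {
    let n = 8;
    let mut active = array_representation(n);
    active[3] = false; // deactivate unit 3
    let mut alive = hash_set_representation(n);
    alive.remove(&3);
    assert_eq!(active.iter().filter(|&&a| a).count(), alive.len());
}
```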

I'll also look into the SIMD approach, but from what I've been able to google, I don't think Rust has particularly mature support for SIMD.

An inclusion in the README sounds fantastic to me, thanks for the shout out! I can provide a little table like the one below for my algorithm if you'd like.

[example benchmark table image]

@erikcs (Member) commented Nov 9, 2022

Great, maybe you could put some timings in your GH README for reference? (Considering you'll presumably keep making changes, you'll always have the latest timings in your repo; we don't have the bandwidth to make any big changes to policytree in the near future.)

@beniaminogreen (Author)

Sounds good to me. I'll run them in the next few days and then put them in the README. I should also add some more documentation for how the parallel_policy_tree function is used.

Thanks for your help,
Ben

@erikcs added the enhancement (New feature or request) label on Nov 10, 2022
@erikcs (Member) commented Nov 10, 2022

a) When things get more polished, we could also add an option to policytree that dispatches to your package depending on the input type, i.e. something like:

```
if solver == "auto" {
  let v = compute some heuristic based on N and the number of unique values
  // (an empirical question that probably requires some trial and error!)
  if v > some threshold {
    if "parallel_policy_tree" is installed {
      tree = parallel_policy_tree::parallel_policy_tree(...)
    } else {
      print("Consider installing parallel_policy_tree at ...")
    }
  }
  // etc.
}
```

b) On algorithmic details, your ObservationBundles are the right idea, but I imagine you gathered from the timings in this thread that this loop is something you'd want to avoid. We could help you out if you want to improve this even further. You could probably write up what you did and send it to JOSS as a software paper; Ayush and I can help you along the way. If you are applying to PhD programs, maybe that would give you a boost!

So, for an optimal version of BundledObservations and SortedSets, I caught up with @kanodiaayush: an asymptotically unbeatable version of policytree for duplicated x-values would zap the O(N) term by maintaining an index data structure which tells you which ObservationBundle each sample belongs to for each dimension. Then you could use this to move bundles left to right like policytree does, and iterate over only the non-empty bundles in the recursive calls.

A short pseudo-example could be:

```
sets: P-vector of vectors of ObservationBundles.
SampleToBundle: N x P array telling you which bundle each sample belongs to.

for p in 1 to P:
    let right_sets = sets.copy()
    let left_sets = sets.new_empty()  // each bundle set to have no samples
    for each observation_bundle in right_sets[p]:
        sample_collection = right_sets[p](observation_bundle).pop_all()
        left_sets[p](observation_bundle).add(sample_collection)
        for each dimension pp != p:
            right_sets[pp].pop_from_bundle(sample_collection)  // (1)
            left_sets[pp].add_from_bundle(sample_collection)
    // do recursive call on left, right sets, etc.
```

(1): This can be done quickly since SampleToBundle tells us which bundles to remove samples from.
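For concreteness, in Rust the index structure might look something like this (all names hypothetical; just the data layout plus one move primitive, not the full algorithm):

```rust
/// One distinct value on one axis, plus the samples currently sitting in it.
struct ObservationBundle {
    value: f64,
    samples: Vec<usize>,
}

/// The structure sketched in the pseudocode above.
struct SortedSets {
    /// sets[p] = bundles for dimension p, sorted by value.
    sets: Vec<Vec<ObservationBundle>>,
    /// sample_to_bundle[p][i] = index into sets[p] of the bundle that
    /// sample i belongs to; this is the piece that zaps the O(N) scan.
    sample_to_bundle: Vec<Vec<usize>>,
}

impl SortedSets {
    /// Remove one sample from its bundle along dimension p. The bundle
    /// lookup is O(1) via sample_to_bundle; a production version would
    /// also track each sample's position inside its bundle so the inner
    /// removal is O(1) too, rather than this linear scan.
    fn pop_sample(&mut self, p: usize, sample: usize) {
        let b = self.sample_to_bundle[p][sample];
        let bundle = &mut self.sets[p][b];
        if let Some(pos) = bundle.samples.iter().position(|&s| s == sample) {
            bundle.samples.swap_remove(pos);
        }
    }
}

fn main() {
    let mut s = SortedSets {
        sets: vec![vec![ObservationBundle { value: 0.0, samples: vec![0, 2] }]],
        sample_to_bundle: vec![vec![0, 0, 0]],
    };
    s.pop_sample(0, 2);
    assert_eq!(s.sets[0][0].samples, vec![0]);
}
```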

(this will probably require some back and forth on details).

@beniaminogreen (Author)

Both those suggestions sound great to me - I would love to keep optimizing the package, and if there's a paper at the end of the process, that would be fantastic!

Once things get more stable / developed, I can definitely try to construct a heuristic to decide which is more efficient for a given problem. It would be slightly complicated by the fact that parallel_policy_tree is multicore, so it will depend on how much compute is available; I work on a 40-core server, so parallel_policy_tree can be up to 40x slower than policytree on a single thread and still return faster if I'm pressed for time and don't care about the resources used.

Great suggestion about the modification to the sorted sets. I had thought of doing something similar earlier, but I shelved it because it seemed like a lot of effort, and I had not made the connection that it would improve the asymptotic runtime. I'll start working on it this evening / over the weekend; it would be super to achieve better asymptotic scaling.

I tried implementing your suggestion to keep track of the alive units using hash sets (see here), but it didn't seem to be quicker for small dataset sizes. I'll schedule a more exhaustive test in the next few days, but it might be that the memory contiguity of the arrays offsets the cost of the large copies for most dataset sizes.

@erikcs (Member) commented Nov 11, 2022

Sounds good! Yes, we should add multithreading to policytree then as well, or else it would look weird. One practical consideration to keep in mind is guaranteeing the same output for the same input. Some settings may have many optimal trees, and policytree returns the one encountered first while exploring; users will be confused if threading causes a different optimal tree to be returned.

Similar considerations apply when very many trees have ~numerically the same reward. If Rust / C++ compiler-internal numerics cause differences, that will break user expectations. An implicit contract we should always try to stick to is: if I write a paper using policytree version a.b.c, then that paper should replicate exactly if I use policytree version a.x.z on an arbitrary computer in the future.

@beniaminogreen (Author)

I think that right now, because of how I implement the parallelization, I am guaranteeing the same output for the same input, but I'll keep an eye on this as I make changes. I parallelize the search by assigning each thread to find the best tree that starts by splitting on a single variable, and then select the best from these; any ties are resolved by looking at the initial variable the tree splits on. It will be important to make sure I maintain this pattern in future versions.
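Concretely, the tie-breaking reduction looks roughly like this (a simplified sketch, with `Tree` standing in for the real tree type):

```rust
/// Placeholder for the real fitted-tree type.
#[derive(Debug)]
struct Tree {
    root_variable: usize,
}

/// Each thread returns (root variable index, reward, tree); the winner is
/// the highest reward, with ties broken by the lower variable index, so
/// the outcome never depends on thread scheduling or completion order.
fn pick_best(results: Vec<(usize, f64, Tree)>) -> Option<Tree> {
    results
        .into_iter()
        .max_by(|a, b| {
            a.1.total_cmp(&b.1)      // prefer the higher reward...
                .then(b.0.cmp(&a.0)) // ...then the lower variable index
        })
        .map(|(_, _, tree)| tree)
}

fn main() {
    let results = vec![
        (2, 1.0, Tree { root_variable: 2 }),
        (0, 1.0, Tree { root_variable: 0 }), // tie: lower index wins
        (1, 0.5, Tree { root_variable: 1 }),
    ];
    println!("{:?}", pick_best(results)); // Some(Tree { root_variable: 0 })
}
```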

I also agree that Rust vs C++ numerics could be an issue, but I don't think there is much I could do to fix differences if they do arise. Empirically, I'm not sure they are much of a concern, however: I test parallel_policy_tree against policytree routinely and haven't found any cases in which they return different classifications.

Finally, I have experimented with implementing the tweaks you suggested, and the results seem mixed (see below). I'm not sure if this is an issue with my implementation (I'm looking for places to speed it up this afternoon) or an accurate representation of how the new strategy will perform.

[benchmark plot: new strategy vs. existing implementation]

Part of what I think we are seeing is that the new strategy involves copying n sets, where n is the number of breaks / cut points in the dataset. This means it performs best when the datasets are sparse, but that is also where the existing implementation is quickest, so there isn't much room for improvement. I'll keep plugging away and see if the performance in the sparse case can be improved at all. I think the best thing to do for the dense case might simply be to port the existing algorithm to Rust (or to just have users use the existing package).

@erikcs (Member) commented Nov 17, 2022

a) Sounds good! Different compilers on different OSes sometimes produce different results. Also, different OS users may face different installation hassles; I tried installing the package on Stanford's Linux HPC and it failed with a Rust linking error. Maybe this is something your GH Actions could try to catch with a larger build matrix, which would also give additional assurance that tests pass across different OSes and dependency versions (I am not up to date on Rust, but a big reason policytree/grf is stable is that they do not depend on a gazillion other libraries).

b1) Great work, but what we meant by "asymptotically optimal" would be to do exactly what policytree (algo 1) does, but with "sample_n" replaced by "ObservationBundle_{split_val_n}". Translating those BST insert/erase operations to work on ObservationBundles would require using "SampleToBundle". That doesn't look like what's being done here? (It would likely be quite a big tinkering project.)

b2) By "asymptotically optimal" we mean: consider algo A (yours) and B (ours). Both solve the same problem, but algo A has runtime O(A) and algo B has runtime O(B). Both depend on N, the total number of samples, and n_j, the number of distinct values of feature j. As you saw empirically, A is faster than B until you reach a point P where N and n_j get sufficiently large. What we mean is that the suggestion above to modify A into A' will likely push up the point P (to P') at which A stops being faster than B (not that it will make A' best for the dense case; that's B). Note 1: what P and P' look like, and whether the change is worth the hassle, I don't know. Note 2: A' might very well be slower than A in the sparsest cases!

@erikcs (Member) commented Nov 17, 2022

PS: I think your package in its current form already seems very useful. The algo suggestion above could be more of a longer-term project if you are interested in continuing to work on your package.

A more actionable near-term thing we could do is add the following option to policytree:

```r
policy_tree(
  X,
  Gamma,
  depth = 2,
  split.step = 1,
  min.node.size = 1,
  solver = c("policytree", "parallel_policy_tree"),
  num.threads = NULL,
  verbose = TRUE
)
```

where, if installed, solver = "parallel_policy_tree" would just call directly into a fit method in that package (one that bypasses input checks; we should then sync on and stick to an API and ensure future compatibility). That would be easier than an "auto" solver option, at least for now, and you could maintain some benchmarks on your package's GitHub page to help inform users (we also sent you a GH invite in case you feel any info would be useful on policytree's page).

Ideally we should add multithreading to policytree too, as having a num.threads argument that only works with one option looks strange (I am not sure when I will get around to that, though). On that note, maybe a more informative name would be something like "sparsepolicytree" instead of "parallel_policy_tree"?

@beniaminogreen (Author)

Gotcha, I see what you mean about "asymptotically optimal," and I think I understand what you mean about implementing the same strategy as policytree but with observation bundles. I did something similar in the original prototype of the package, and I think it might still be archived in the git history, so I'll try to pull it out / bring it back to life for some benchmarking.

Historically, the issue I had with that strategy was that I had to copy p sorted sets in each recursive call, and this ended up being more expensive than copying the vector of active units. I'll see if there's a way to get the best of both worlds in the next few days.

In the nearer term, I do like the idea of creating a fit method with a standard API so policytree can call out to sparsepolicytree (I agree that's a better name). What would be the next steps here, and do you have any preferences for how the internal API should look? I'll add a split.step option to sparse_policy_tree so it has all the features of the other solver, and then perhaps we can work on integrating the two.

@erikcs (Member) commented Nov 22, 2022

> In the nearer term, I do like the idea of creating a fit method with a standard API so policytree can call out to sparsepolicytree (I agree that's a better name). What would be the next steps here, and do you have any preferences for how the internal API should look? I'll add a split.step option to sparse_policy_tree so it has all the features of the other solver, and then perhaps we can work on integrating the two.

For now, I think it would already be helpful if sparse_policy_tree simply had the same signature as policy_tree; it doesn't matter if your min.node.size doesn't do anything as long as that's documented. Then users can try out your package as a drop-in replacement by search-and-replacing policy_tree with sparse_policy_tree. Having that workflow work would already go a long way!

Then internals + package integration via a solver argument is something we could come back to later. My plate is too full at the moment, so I can't give you a timeline beyond the vague, glacial academic answer of "sometime next year" :P. There are some other policytree projects in the tinkering phase, related to bandit problem estimation, that I'd want to finish first (I have no idea when that will be done). Also, as I mentioned, we should do multithreading as well. Conceptually the implementation is simple: we should execute each {left, right} recursive call asynchronously in a thread pool (with a fixed number of threads). Getting that to work involves fighting with the C++ compiler and is not something I have time for now. It could, however, be a different threading strategy for you to try out in Rust, if you haven't already.
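In Rust, that pattern should be easy to prototype with rayon's work-stealing pool (rayon is my assumption here, not something your package necessarily uses; the split state is elided):

```rust
use rayon::join;

/// Skeleton of the {left, right} async recursion: rayon::join runs both
/// closures, potentially on different threads from a fixed-size pool.
fn search(depth: usize) -> f64 {
    if depth == 0 {
        return 0.0; // leaf: evaluate the best single action here
    }
    // For each candidate split, explore both children in parallel.
    let (left, right) = join(|| search(depth - 1), || search(depth - 1));
    left + right // reward of the best tree rooted at this split
}

fn main() {
    // Fixed number of threads, as suggested above.
    rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build_global()
        .unwrap();
    println!("{}", search(2));
}
```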
