
Which approach to the on-policy training/inference synchronization is best? #53

Closed
Antymon opened this issue Dec 14, 2020 · 2 comments

Antymon commented Dec 14, 2020

Hi,
I am thinking about a synchronization technique suitable for introducing an on-policy algorithm such as PPO, which alternates between data collection and training. I came up with three basic ideas and would very much appreciate it if anyone could help me judge whether any of them is valid. I am leaning towards the last one, as it doesn't use busy waits (i.e., empty while loops on sync variables), although I am not entirely sure how (un)acceptable a busy loop is on a GPU. Although below I present simple pseudocode to demonstrate the ideas, I implemented them minimally outside of SEED, trying out synchronization via tf variables inside TF functions called through SEED's gRPC framework, to examine potential problems. One such problem is the default mode of AutoGraph, which, unaware of my synchronization efforts, removes or reorders operations and thus compromises my intent; I got around this with some artificial-dependency hackery (see the sketch right below; I guess the optimizations can be controlled, but ideally I would hope for something like the much-hated C++ volatile qualifier). In any case: thanks for any feedback on whether any of the approaches below is valid, and/or for ideas for alternatives.

# Model is a tf.Module with a constructor and two tf.functions, infer and train, called over gRPC as in SEED.
# TF variables live on the CPU of the single host associated with the learner.

#1 busy waits, many sync variables
class Model:

    def __init__(self):
        self.training = tf.Variable(False)
        self.inferring = tf.Variable([False] * NUM_ACTORS)

    def infer(self, id):
        # Wait out any ongoing training step.
        while self.training:
            pass

        self.inferring[id] = True
        ...  # forward pass
        self.inferring[id] = False

        return result

    def train(self):
        self.training = True
        # Drain in-flight inferences before touching the weights.
        while sum(self.inferring) > 0:
            pass

        ...  # gradient step
        self.training = False
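For what it's worth, a runnable TF2 sketch of #1 along these lines (the class name, NUM_ACTORS, and the stubbed forward pass / gradient step are illustrative placeholders, not SEED code):

import tensorflow as tf

NUM_ACTORS = 4  # illustrative

class ModelV1(tf.Module):
    def __init__(self):
        super().__init__()
        self.training = tf.Variable(False)
        self.inferring = tf.Variable([False] * NUM_ACTORS)

    @tf.function(input_signature=[tf.TensorSpec([], tf.int32)])
    def infer(self, actor_id):
        # Busy-wait until the learner clears the training flag
        # (dummy counter as in the sketch above).
        spins = tf.constant(0, tf.int64)
        while self.training.value():
            spins += 1
        # Mark this actor's inference as in flight.
        self.inferring.scatter_nd_update(tf.reshape(actor_id, [1, 1]), [True])
        result = tf.ones([1])  # placeholder for the real forward pass
        # Caveat: nothing here forces the graph to execute the two flag
        # updates strictly around the forward pass; that ordering is what
        # the artificial-dependency hackery has to enforce.
        self.inferring.scatter_nd_update(tf.reshape(actor_id, [1, 1]), [False])
        return result

    @tf.function
    def train(self):
        self.training.assign(True)
        # Drain in-flight inferences before touching the weights.
        spins = tf.constant(0, tf.int64)
        while tf.reduce_any(self.inferring.value()):
            spins += 1
        # ... gradient step would go here ...
        self.training.assign(False)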


#2 busy waits, a single sync flag, invalidating the inference result
class Model:

    def __init__(self):
        self.training = tf.Variable(False)

    # invariant: time(inference) < time(training)
    def infer(self, id):
        while True:
            # Wait out any ongoing training step.
            while self.training:
                pass

            ...  # forward pass

            if not self.training:
                break  # no training step began mid-inference: result is valid

        return result

    def train(self):
        self.training = True
        ...  # gradient step
        self.training = False
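#2 could be sketched the same way; I express the retry as a validity flag so the tensor loop has explicit loop variables (again TF2, with the forward pass stubbed out):

import tensorflow as tf

class ModelV2(tf.Module):
    def __init__(self):
        super().__init__()
        self.training = tf.Variable(False)

    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def infer(self, observation):
        result = tf.zeros_like(observation)
        valid = tf.constant(False)
        while tf.logical_not(valid):
            # Busy-wait out any ongoing training step.
            spins = tf.constant(0, tf.int64)
            while self.training.value():
                spins += 1
            result = observation * 2.0  # stand-in for the real forward pass
            # Keep the result only if no training step started meanwhile;
            # relies on the invariant time(inference) < time(training).
            valid = tf.logical_not(self.training.value())
        return result

    @tf.function
    def train(self):
        self.training.assign(True)
        # ... gradient step would go here ...
        self.training.assign(False)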

#3 no busy waits, a single sync variable; the client needs to balance
#   inference calls and reject some of the returns
class Model:

    def __init__(self):
        self.training = tf.Variable(False)

    # invariant: time(inference) < time(training)
    def infer(self, id):
        if self.training:
            return None  # rejected: training in progress

        ...  # forward pass

        if self.training:
            return None  # rejected: a training step began mid-inference

        return result

    def train(self):
        self.training = True
        ...  # gradient step
        self.training = False
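And a sketch of #3, the variant I lean towards. One wrinkle: a traced tf.function must return the same structure on every code path, so `return None` becomes a validity flag next to a (possibly garbage) payload:

import tensorflow as tf

class ModelV3(tf.Module):
    def __init__(self):
        super().__init__()
        self.training = tf.Variable(False)

    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def infer(self, observation):
        # Reject up front if a training step is in progress.
        if self.training.value():
            return tf.constant(False), tf.zeros_like(observation)
        result = observation * 2.0  # stand-in for the real forward pass
        # Invalidate if a training step began mid-inference
        # (relies on the invariant time(inference) < time(training)).
        if self.training.value():
            return tf.constant(False), tf.zeros_like(observation)
        return tf.constant(True), result

    @tf.function
    def train(self):
        self.training.assign(True)
        # ... gradient step would go here ...
        self.training.assign(False)

The client then balances its infer calls and simply drops the payload whenever the flag comes back False.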
lespeholt (Collaborator) commented

Hi,

A PPO version in SEED is soon going to be open-sourced, which could serve as an example for what you want to do.

Antymon (Author) commented Dec 24, 2020

Actually, a few weeks ago I asked Marcin Andrychowicz about a release of the code, given his work on on-policy RL. Although the answer was affirmative, he was unable to tell when that might happen, so I thought I would poke around a bit on my own and ask questions where needed. I guess you are referring to the very same line of work?

Antymon closed this as completed Feb 5, 2021