# Moore Machine Network Extraction from an RNN

Nicholas Renninger

**Attributions**:

* This work reproduces the [`MMN`](https://github.com/koulanurag/mmn) supplementary repo in a more broadly accessible form by recreating its results with the widely used [`stable-baselines`](https://stable-baselines.readthedocs.io/) library and by migrating all of its networks to TensorFlow. As such, I use many of the libraries from `stable-baselines` (albeit mostly heavily modified), and most of the code to extract the Moore Machine Network is heavily inspired by the `MMN` repo (although that is the point of replication).

* Some of this code is borrowed from my [GAIL repo](https://github.com/nicholasRenninger/GAIL-Formal_Methods), mainly the parts that interact with a `stable-baselines` simulation.

## Defining the environment to test on.

Here, we're going to use an Atari environment, in the spirit of the original paper.

In [1]:
%load_ext autoreload
%autoreload 2

import os
from pathlib import Path

import gym
from gym import spaces

import stable_baselines
from stable_baselines import PPO2, DQN
from stable_baselines.gail import ExpertDataset, generate_expert_traj
from stable_baselines.common.policies import CnnLstmPolicy
from stable_baselines.common.vec_env import VecFrameStack
from stable_baselines.common.evaluation import evaluate_policy
from stable_baselines.common.callbacks import EvalCallback, BaseCallback
from stable_baselines.results_plotter import load_results, ts2xy
from stable_baselines.bench import Monitor
from stable_baselines.common.cmd_util import make_atari_env

import tensorflow as tf

from rl_baselines_zoo.utils import make_env, linear_schedule

from util.stable_baseline_viz import show_videos, record_video
from util.custom_evaluate import EvalLSTMCallback
from custom_policies import CustomCNNLstmPolicy

import numpy as np
from functools import reduce
import operator
import random
import logging
import pickle


# the main environment we're testing on
ENV_ID = 'PongNoFrameskip-v4'

# file I/O configuration
EXPERIMENT_HOME = 'experiment_data'

# training logging
LOG_DIR_BASE = './logs/'

# these are for data associated with the expert
BASEPOLICY_NAME = 'ppo2'

BASEPOLICY_LOG_DIR = os.path.join(LOG_DIR_BASE, BASEPOLICY_NAME)
BASEPOLICY_DIR = os.path.join(EXPERIMENT_HOME, 'basepolicy_data')
BASEPOLICY_VIDEO_DIR = os.path.join(BASEPOLICY_DIR, 'videos/')

BASEPOLICY_RUN_ID = f'basepolicy_{BASEPOLICY_NAME}_{ENV_ID}'
BASEPOLICY_MODEL_PATH = os.path.join(BASEPOLICY_DIR,
                                     f'{BASEPOLICY_RUN_ID}_model.zip')
BASEPOLICY_TRACES_PATH = os.path.join(BASEPOLICY_DIR,
                                      f'{BASEPOLICY_RUN_ID}_traces.npz')
BASEPOLICY_BOTTLENECKDATA_PATH = os.path.join(BASEPOLICY_DIR,
                                              f'{BASEPOLICY_RUN_ID}_bottleneck_data.p')
BASEPOLICY_BEST_MODEL_PATH = os.path.join(BASEPOLICY_LOG_DIR,
                                          'best_model.zip')
BASEPOLICY_VEC_VIDEO_NAME = BASEPOLICY_RUN_ID + "_vec"
BASEPOLICY_SINGLE_VIDEO_NAME = BASEPOLICY_RUN_ID + "_single"

# need to ensure these directories always exist
Path(LOG_DIR_BASE).mkdir(parents=True, exist_ok=True)
Path(BASEPOLICY_DIR).mkdir(parents=True, exist_ok=True)
Path(BASEPOLICY_VIDEO_DIR).mkdir(parents=True, exist_ok=True)

# performance evaluation / visualization settings
MAX_VIDEO_LEN = 1000
NUM_EVAL_EPISODES = 10




## Choosing whether to use saved model for RNN_Policy

*possible formats: `'pre_trained_model', 'learn_the_model'`*

* `'learn_the_model'`: learning the model just requires choosing the desired RL algorithm and setting its hyperparameters (see the "Setting Hyperparams" section).

* `'pre_trained_model'`: a pre-trained expert must have a saved `stable_baselines` model file at `BASEPOLICY_MODEL_PATH`. See the [saving guide](https://stable-baselines.readthedocs.io/en/master/guide/save_format.html) for info on how to do that.

In [2]:
# decide whether you want to load in a pre-trained model for the ENV or if
# you need to learn a model using the BASEPOLICY_NAME algorithm
model_formats = ['pre_trained_model', 'learn_the_model']
model_format = model_formats[1]

## Setting Hyperparams

Learning Models' Hyperparameters:

In [36]:
hyperparams = {'basepolicy_ppo2':
                    {'cliprange': linear_schedule(0.1),
                     'ent_coef': 0.01,
                     'gamma': 0.99,
                     'lam': 0.95,
                     'learning_rate': linear_schedule(4e-4),
                     'n_steps': 128,
                     'n_timesteps': 7_000_000,
                     'nminibatches': 8, 
                     'noptepochs': 4,
                     'policy': CustomCNNLstmPolicy,
                     'eval_freq': 5_000}}

Using a vectorized environment is much faster, but only some algorithms support it (e.g. `PPO2`).

**IMPORTANT**: you must set `num_envs` > 1, or the training will not stabilize and other things will break; we assume a vectorized environment.

**IMPORTANT**: because the policy is recurrent, the number of parallel environments (`num_envs`) should be a multiple of the number of minibatches (`nminibatches`).

In [4]:
num_envs = 16
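
The divisibility requirement above can be checked before building the model. The following is a minimal sketch; the helper name `check_recurrent_batching` is hypothetical and is not part of `stable-baselines`.

```python
def check_recurrent_batching(num_envs, nminibatches):
    """Return the number of environments per minibatch, or raise if the
    environments cannot be split evenly across the minibatches."""
    if num_envs <= 1:
        raise ValueError("a vectorized environment (num_envs > 1) is assumed")
    if num_envs % nminibatches != 0:
        raise ValueError(
            f"num_envs={num_envs} is not a multiple of "
            f"nminibatches={nminibatches}")
    return num_envs // nminibatches


# values used in this notebook: 16 environments, 8 minibatches
envs_per_minibatch = check_recurrent_batching(num_envs=16, nminibatches=8)
print(envs_per_minibatch)  # 2
```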

## Environment Definition w.r.t. Hyperparameters

In [5]:
# There already exists an environment generator that will make and wrap atari 
# environments correctly.
env = make_atari_env(ENV_ID, num_env=num_envs, seed=0)

# only can have ONE environment for evaluation
eval_env = make_atari_env(ENV_ID, num_env=1, seed=0)




Here are the actions available to the agent in the environment:

In [6]:
gym.make(ENV_ID).get_action_meanings()

['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']

## Get an RNN Policy for the Environment

The end goal of this section is to provide the Moore machine extraction process with a good policy representation. Thus, we will use the training tools in `stable-baselines` to train the policy to "expert" level first.


In [37]:
if model_format == 'pre_trained_model':
    basepolicy_model = PPO2.load(BASEPOLICY_MODEL_PATH)
    basepolicy_model.set_env(env)

elif model_format == 'learn_the_model':
    basepolicy_hparam = hyperparams['basepolicy_ppo2']

    # always use deterministic actions for live evaluation
    eval_callback = EvalLSTMCallback(eval_env,
                                     best_model_save_path=BASEPOLICY_LOG_DIR,
                                     log_path=BASEPOLICY_LOG_DIR,
                                     eval_freq=basepolicy_hparam['eval_freq'],
                                     deterministic=True, render=False)

    basepolicy_model = PPO2(basepolicy_hparam['policy'], env,
                            cliprange=basepolicy_hparam['cliprange'],
                            ent_coef=basepolicy_hparam['ent_coef'],
                            gamma=basepolicy_hparam['gamma'],
                            lam=basepolicy_hparam['lam'],
                            learning_rate=basepolicy_hparam['learning_rate'],
                            n_steps=basepolicy_hparam['n_steps'],
                            nminibatches=basepolicy_hparam['nminibatches'],
                            noptepochs=basepolicy_hparam['noptepochs'],
                            verbose=0,
                            tensorboard_log=BASEPOLICY_LOG_DIR)

    # train the model, periodically evaluating it on a separate environment
    # and saving the best one found so far
    basepolicy_model.learn(total_timesteps=basepolicy_hparam['n_timesteps'],
                           callback=eval_callback)

Eval num_timesteps=8000, episode_reward=-21.00 +/- 0.00
Episode length: 755.20 +/- 2.40
New best mean reward!


KeyboardInterrupt: 

Here, you can see the learned model architecture:

In [38]:
basepolicy_model.get_parameter_list()

[<tf.Variable 'model/c1/w:0' shape=(8, 8, 1, 32) dtype=float32_ref>,
 <tf.Variable 'model/c1/b:0' shape=(1, 32, 1, 1) dtype=float32_ref>,
 <tf.Variable 'model/c2/w:0' shape=(4, 4, 32, 64) dtype=float32_ref>,
 <tf.Variable 'model/c2/b:0' shape=(1, 64, 1, 1) dtype=float32_ref>,
 <tf.Variable 'model/c3/w:0' shape=(3, 3, 64, 64) dtype=float32_ref>,
 <tf.Variable 'model/c3/b:0' shape=(1, 64, 1, 1) dtype=float32_ref>,
 <tf.Variable 'model/fc1/w:0' shape=(3136, 512) dtype=float32_ref>,
 <tf.Variable 'model/fc1/b:0' shape=(512,) dtype=float32_ref>,
 <tf.Variable 'model/lstm1/wx:0' shape=(512, 1024) dtype=float32_ref>,
 <tf.Variable 'model/lstm1/wh:0' shape=(256, 1024) dtype=float32_ref>,
 <tf.Variable 'model/lstm1/b:0' shape=(1024,) dtype=float32_ref>,
 <tf.Variable 'model/vf/w:0' shape=(256, 1) dtype=float32_ref>,
 <tf.Variable 'model/vf/b:0' shape=(1,) dtype=float32_ref>,
 <tf.Variable 'model/pi/w:0' shape=(256, 6) dtype=float32_ref>,
 <tf.Variable 'model/pi/b:0' shape=(6,) dtype=float32_ref>]

Now save the final model so that, if we like it, we don't need to re-learn it. The model will be saved at `BASEPOLICY_MODEL_PATH`.

In [39]:
# need to load and then save the best model found during training.
# `load` is a classmethod that returns a new model, so keep its result
basepolicy_model = PPO2.load(BASEPOLICY_BEST_MODEL_PATH)
basepolicy_model.set_env(env)

# now save the final model so if we like it, we don't need to re-learn it
# The model will be saved at BASEPOLICY_MODEL_PATH
basepolicy_model.save(BASEPOLICY_MODEL_PATH)

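Because we trained with a vectorized environment, the recurrent policy expects a batch of `num_envs` observations at prediction time. To use it with a single, unstacked environment, we wrap `predict` so that the single observation is placed in the first slot of a zero-filled batch and only the first action is returned.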

In [10]:
class PPO2_MultiEnv_to_SingleEnv(PPO2):
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
    
    def predict(self, eval_observation, state=None, mask=None,
                deterministic=False):
        
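        # pad the single observation with zeros so the batch matches the
        # number of environments the recurrent policy was trained with
        n_envs = self.env.num_envs
        zero_completed_obs = np.zeros((n_envs,) + self.observation_space.shape)
        zero_completed_obs[0, :] = eval_observation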
        
        action, state = super().predict(zero_completed_obs, state, mask,
                                        deterministic)
        
        return [action[0]], state

single_vec_basepolicy = PPO2_MultiEnv_to_SingleEnv.load(BASEPOLICY_MODEL_PATH)
single_vec_basepolicy.set_env(env)
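
The zero-padding trick used by `PPO2_MultiEnv_to_SingleEnv.predict` can be seen in isolation with plain `numpy`. This is only an illustrative sketch: the helper `pad_single_observation` is hypothetical, and the 84x84x1 frame shape is an assumption taken from the `c1` kernel shape listed earlier.

```python
import numpy as np


def pad_single_observation(single_obs, n_envs):
    """Place one observation in slot 0 of a zero batch of size n_envs."""
    single_obs = np.asarray(single_obs)
    # accept either (H, W, C) or an already-batched (1, H, W, C) array
    if single_obs.ndim == 4 and single_obs.shape[0] == 1:
        single_obs = single_obs[0]
    batch = np.zeros((n_envs,) + single_obs.shape, dtype=single_obs.dtype)
    batch[0] = single_obs
    return batch


# toy frame from a single (num_env=1) vectorized environment
frame = np.ones((1, 84, 84, 1), dtype=np.uint8)
batch = pad_single_observation(frame, n_envs=16)
print(batch.shape)  # (16, 84, 84, 1)
```

Only the first row of the batch carries real data, so only the first action and the first row of the recurrent state are meaningful afterwards.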

### Evaluating the performance of the learned RNN policy:

In [11]:
mean_reward, std_reward = evaluate_policy(single_vec_basepolicy, eval_env,
                                          n_eval_episodes=NUM_EVAL_EPISODES)
print(f"base RNN policy ({BASEPOLICY_NAME}) mean_reward: {mean_reward:.2f}" +
      f" +/- {std_reward:.2f}")

base RNN policy (ppo2) mean_reward: -21.00 +/- 0.00


### Visualizing Learning

#### Visualizing the Base RNN Policy on the Original, Vectorized Domain

In [25]:
record_video(basepolicy_model, eval_env=env,
             max_video_length=MAX_VIDEO_LEN,
             video_prefix=BASEPOLICY_VEC_VIDEO_NAME,
             video_folder=BASEPOLICY_VIDEO_DIR,
             is_recurrent=True)

Saving video to  /home/ferg/NeuralMooreMachine_Experiments/experiment_data/basepolicy_data/videos/basepolicy_ppo2_PongNoFrameskip-v4_vec-step-0-to-step-1000.mp4


In [26]:
show_videos(BASEPOLICY_VIDEO_DIR, prefix=BASEPOLICY_VEC_VIDEO_NAME)

#### Visualizing the Base RNN Policy on a single instance of the Original Domain

In [23]:
record_video(single_vec_basepolicy, eval_env=eval_env,
             max_video_length=MAX_VIDEO_LEN,
             video_prefix=BASEPOLICY_SINGLE_VIDEO_NAME,
             video_folder=BASEPOLICY_VIDEO_DIR,
             is_recurrent=True)

Saving video to  /home/ferg/NeuralMooreMachine_Experiments/experiment_data/basepolicy_data/videos/basepolicy_ppo2_PongNoFrameskip-v4_single-step-0-to-step-1000.mp4


In [24]:
show_videos(BASEPOLICY_VIDEO_DIR, prefix=BASEPOLICY_SINGLE_VIDEO_NAME)

## Generating Bottleneck Data

Next we generate "bottleneck data": we simulate many trajectories in the RL environment, recording the observation features and hidden states produced by the `RNN_Policy` along with the actions it takes. This data is used to train the "quantized bottleneck neural networks" (`QBNs`) in the next section.

*note: The following code is adapted from the [paper repo `generate_bottleneck_data()`](https://github.com/koulanurag/mmn) for use with `tf` instead of `torch` and with `stable-baselines`.*

In [None]:
tryin_it_out_model = PPO2(CustomCNNLstmPolicy, env)

In [None]:
observation = eval_env.reset()

(action, value,
 hidden_state, neglogp,
 act_prob,
 obs_feature,
 rnn_output) = step_all_vals_wrapper(tryin_it_out_model, observation)
print(len(action), len(value), len(hidden_state), len(neglogp),
      len(act_prob), len(obs_feature), len(rnn_output))

tryin_it_out_model.get_parameter_list()

In [None]:
# Scratch cell: the behavior-cloning example from the stable-baselines docs.
# It requires an existing 'expert_cartpole.npz' expert dataset.
# Using only one expert trajectory
# you can specify `traj_limitation=-1` for using the whole dataset
dataset = ExpertDataset(expert_path='expert_cartpole.npz',
                        traj_limitation=1, batch_size=128)

model = PPO2('MlpPolicy', 'CartPole-v1', verbose=1)
# Pretrain the PPO2 model
model.pretrain(dataset, n_epochs=1000)

# As an option, you can train the RL agent
# model.learn(int(1e5))

# Test the pre-trained model.
# Use a separate name so the Pong `env` defined earlier is not overwritten
cartpole_env = model.get_env()
obs = cartpole_env.reset()

reward_sum = 0.0
for _ in range(1000):
    action, _ = model.predict(obs)
    obs, reward, done, _ = cartpole_env.step(action)
    reward_sum += reward
    cartpole_env.render()
    if done:
        print(reward_sum)
        reward_sum = 0.0
        obs = cartpole_env.reset()

cartpole_env.close()

We need to wrap the custom TensorFlow variable inspector, `model.act_model.all_vals_step()`, so that we can use it in the same way as `model.predict()`. This lets us extract the output of the convolutional feature extractor and the hidden states from the trained model.

We also need to deal with the fact that the environment we train in is vectorized, while everything else here operates on a single environment.

Also of note: while `single_hidden_state` holds the RNN's hidden state, building the hidden-state quantized bottleneck network (`HSQBN`) actually requires the summary of the hidden state, because this is what feeds into the actor / critic (policy / value network) portion of the network, which is where the `HSQBN` must be inserted.

In [None]:
def step_all_vals_wrapper(model, observation, env=env, state=None,
                          mask=None, deterministic=False):
    if state is None:
        state = model.initial_state
    if mask is None:
        mask = [False for _ in range(model.n_envs)]
    observation = np.array(observation)
    vectorized_env = model._is_vectorized_observation(observation,
                                                      model.observation_space)

    observation = observation.reshape((-1,) + model.observation_space.shape)
    
    # need to remove observations from the other vectorized environments;
    # we just care about single-environment performance
    zero_completed_obs = np.zeros((model.n_envs,) +
                                  model.observation_space.shape)
    zero_completed_obs[0, :] = observation
    
    (actions, values,
     states, neglogp,
     act_prob, obs_features,
     rnn_output) = model.act_model.all_vals_step(zero_completed_obs,
                                                   state, mask)
    
    # need to return only the values for the first of the vectorized
    # environments
    single_action = [actions[0]]
    single_value = [values[0]]
    single_hidden_state = states[0]
    single_neglogp = [neglogp[0]]
    single_act_prob_dist = act_prob[0]
    single_obs_feature = obs_features[0]
    single_rnn_output = rnn_output[0]
    
    return (single_action, single_value, single_hidden_state,
            single_neglogp, single_act_prob_dist, single_obs_feature,
            single_rnn_output)

With this ability to extract data from the learned `RNN_Policy`, we can now sample from it with an epsilon-random strategy to increase data diversity:

In [None]:
def generate_bottleneck_data(policy_network, env, traces_output_path,
                             epsilon_rand_prob=0.15,
                             max_generation_steps=100,
                             num_episodes=20):

    bottleneck_data = {}
    hx_data, obs_data, action_data = [], [], []
    all_ep_rewards = []

    for ep in range(num_episodes):

        done = False
        obs = env.reset()
        # full recurrent state for every vectorized slot; only slot 0 is used
        # (assumes the returned state rows share the layout of initial_state)
        state = np.array(policy_network.initial_state)
        ep_reward = 0

        # set up the epsilon-random action generation: pick the step after
        # which random actions may be taken in this episode
        act_count = 0
        explore_stride = max(1, int(0.02 * max_generation_steps))
        all_steps_to_explore = range(0, max_generation_steps, explore_stride)
        step_to_start_exploring = random.choice(all_steps_to_explore)

        while not done:

            # here we will get an action prediction from the observation
            # and observe the hidden state and the feature extraction state
            (action, _, hx,
             _, _, obs_c, _) = step_all_vals_wrapper(policy_network, obs,
                                                     state=state)

            # carry the recurrent state of slot 0 into the next step
            state[0] = hx

            # use epsilon-random after burning in some compliant
            # steps to increase the diversity of the training data
            should_explore = (random.random() < epsilon_rand_prob)
            can_start_exploring = act_count >= step_to_start_exploring
            take_random_act = should_explore and can_start_exploring
            if take_random_act:
                action = [env.action_space.sample()]

            obs, reward, done, info = env.step(action)
            action = action[0]

            action_data.append(action)
            act_count += 1

            if act_count > max_generation_steps:
                done = True

            if action not in bottleneck_data:
                bottleneck_data[action] = {'hx_data': [], 'obs_data': []}

            bottleneck_data[action]['hx_data'].append(hx)
            bottleneck_data[action]['obs_data'].append(obs_c)

            ep_reward += reward

        print('episode:{} reward:{}'.format(ep, ep_reward[0]))
        all_ep_rewards.append(ep_reward[0])
    
    mean_reward = sum(all_ep_rewards) / len(all_ep_rewards)
    print('Average Performance:{}'.format(mean_reward))

    hx_train_data, hx_test_data, obs_train_data, obs_test_data = [], [], [], []
    for action in bottleneck_data.keys():
        hx_train_data += bottleneck_data[action]['hx_data']
        hx_test_data += bottleneck_data[action]['hx_data']
        obs_train_data += bottleneck_data[action]['obs_data']
        obs_test_data += bottleneck_data[action]['obs_data']

        print('Action: {} Hx Data: {} Obs Data: {}'.format(action,
            len(np.unique(bottleneck_data[action]['hx_data'], axis=0).tolist()),
            len(np.unique(bottleneck_data[action]['obs_data'], axis=0).tolist())))

    obs_test_data = np.unique(obs_test_data, axis=0).tolist()
    hx_test_data = np.unique(hx_test_data, axis=0).tolist()

    random.shuffle(hx_train_data)
    random.shuffle(obs_train_data)
    random.shuffle(hx_test_data)
    random.shuffle(obs_test_data)

    with open(traces_output_path, "wb") as traces_file:
        pickle.dump((hx_train_data, hx_test_data,
                     obs_train_data, obs_test_data), traces_file)

    print('Data Sizes:')
    log_str = 'Hx Train:{} Hx Test:{} Obs Train:{} Obs Test:{}'
    print(log_str.format(len(hx_train_data), len(hx_test_data), 
                                len(obs_train_data), len(obs_test_data)))

    return hx_train_data, hx_test_data, obs_train_data, obs_test_data

(hx_train_data,
 hx_test_data,
 obs_train_data,
 obs_test_data) = generate_bottleneck_data(tryin_it_out_model,
                         env=eval_env,
                         traces_output_path=BASEPOLICY_BOTTLENECKDATA_PATH)

## Learning QBNs
Learn `QBNs`, which are essentially autoencoders (AEs) with a quantized (discretized) latent layer, to quantize:

* the observation features produced by the environmental feature extractor (the `OX` QBN used below):
    * the feature extractor is a CNN here, as we are using an agent that observes video of the environment.

* the hidden state of the `RNN_Policy`. This is called `b_h` in the paper and `BHX` in the `MMN` code.
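
To make the quantization idea concrete, here is a minimal, untrained `numpy` sketch of a QBN forward pass. Everything in it is an assumption for illustration: the class `ToyQBN`, the single-layer encoder and decoder, the latent size, and in particular the ternary activation `1.5 * tanh(x) + 0.5 * tanh(-3x)` followed by rounding, which is my recollection of the form used in the original work rather than something confirmed here. Training such a network would also need a gradient workaround for the rounding step (e.g. a straight-through estimator), which is not shown.

```python
import numpy as np


def ternary_tanh(x):
    """Smooth activation with plateaus near -1, 0 and +1 (assumed form)."""
    return 1.5 * np.tanh(x) + 0.5 * np.tanh(-3.0 * x)


def quantize(x):
    """Round activations to the three levels {-1, 0, 1}."""
    return np.rint(ternary_tanh(x)).astype(np.int8)


class ToyQBN:
    """Single-layer autoencoder with a ternary latent code (random weights)."""

    def __init__(self, input_dim, latent_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_enc = rng.normal(size=(input_dim, latent_dim))
        self.w_dec = rng.normal(size=(latent_dim, input_dim))

    def encode(self, x):
        # continuous input -> discrete code
        return quantize(x @ self.w_enc)

    def decode(self, code):
        # discrete code -> continuous reconstruction
        return np.tanh(code @ self.w_dec)


# 256 matches the LSTM hidden size seen in the parameter list above
qbn = ToyQBN(input_dim=256, latent_dim=8)
hidden = np.random.default_rng(1).normal(size=(4, 256))
codes = qbn.encode(hidden)
reconstruction = qbn.decode(codes)
print(codes.shape, reconstruction.shape)  # (4, 8) (4, 256)
```

Because each latent unit takes one of three values, a latent layer of size `k` can represent at most `3**k` distinct codes, which is what bounds the number of states and observations of the extracted machine.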

## QBN Insertion into the RNN Policy Network

Insert the trained `OX` QBN *after* the feature extractor (between it and the RNN unit) and the trained `BHX` QBN *after* the RNN unit in the feature_extractor-rnn_policy network to create what is now called the Moore machine network (`MMN`) policy.
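
The data flow through the resulting network can be sketched as a single step function. This is a hedged illustration, not the notebook's implementation: `mmn_step` and all of the callables passed to it are hypothetical stand-ins, each QBN is modeled as one function that encodes and then decodes its input, and the sketch assumes the observation QBN sits on the feature extractor's output and that the quantized hidden state is what gets carried to the next step.

```python
import numpy as np


def mmn_step(obs, h_prev, feature_extractor, ox_qbn, rnn_cell, bhx_qbn,
             policy_head):
    """One forward step of a policy with both QBNs inserted."""
    features = feature_extractor(obs)
    q_features = ox_qbn(features)      # quantized observation features
    h = rnn_cell(q_features, h_prev)   # recurrent update on quantized input
    q_h = bhx_qbn(h)                   # quantized hidden state
    action_logits = policy_head(q_h)   # the policy only sees quantized state
    return action_logits, q_h


def round_qbn(x):
    """Toy stand-in for a QBN: clip and round to {-1, 0, 1}."""
    return np.rint(np.clip(x, -1.0, 1.0))


logits, h_next = mmn_step(
    obs=np.array([0.2, 0.9]),
    h_prev=np.zeros(2),
    feature_extractor=lambda o: 2.0 * o - 1.0,
    ox_qbn=round_qbn,
    rnn_cell=lambda f, h: np.tanh(f + h),
    bhx_qbn=round_qbn,
    policy_head=lambda h: h,
)
print(h_next)
```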

## MMN Policy Training / Fine-tuning

Fine-tune the `MMN` policy by re-running the RL algorithm using the `MMN` policy as a starting point for RL interactions. *Importantly, for training stability the `MMN` is fine-tuned to match the softmax action distribution of the original `RNN_Policy`, not its argmax; that is, we optimize a categorical cross-entropy loss between the output softmax layers of the RNN and the `MMN`.*
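
The fine-tuning objective can be written down in a few lines of `numpy`. This is a sketch under assumptions: the function names are hypothetical, the inputs are taken to be the pre-softmax logits of the two networks on the same batch of observations, and a real implementation would compute this inside the TensorFlow graph so that gradients flow into the `MMN`.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def soft_target_cross_entropy(rnn_logits, mmn_logits, eps=1e-12):
    """Cross-entropy of the MMN action distribution against the RNN's
    full softmax distribution (soft targets), averaged over the batch."""
    target = softmax(rnn_logits)
    predicted = softmax(mmn_logits)
    return float(-(target * np.log(predicted + eps)).sum(axis=-1).mean())


# six logits per row, one per Pong action
rnn_logits = np.array([[2.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
agreeing = rnn_logits.copy()
disagreeing = np.array([[0.0, 2.0, 0.0, 0.0, 0.0, 0.0]])

print(soft_target_cross_entropy(rnn_logits, agreeing))
print(soft_target_cross_entropy(rnn_logits, disagreeing))
```

The loss is smallest when the two distributions agree, and unlike an argmax target it still carries information about how the RNN ranks the non-chosen actions.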

## Classical Moore Machine Extraction

Extract a classical Moore machine from the `MMN` policy as follows:

1. Generate trajectories in the RL environment using rollout simulations of the `MMN` policy. For each rollout simulation timestep, we extract a tuple `(h_{MMN, t-1}, f_{MMN, t}, h_{MMN, t}, a_{MMN, t})`:
    * `h_{MMN, t-1}`: the quantized hidden state of the RNN QBN at the previous timestep.
    * `f_{MMN, t}`: the quantized observation state of the feature extractor QBN at the current timestep.
    * `h_{MMN, t}`: the quantized hidden state of the RNN QBN at the current timestep.
    * `a_{MMN, t}`: the action output by the `MMN` policy at the current timestep.
    
2. As you can see, we now have *most* of the elements needed to form a Moore machine:
    * `h_{MMN, t-1}` -> prior state of the Moore machine, `h_{MM, t-1}`.
    * `f_{MMN, t}` -> input label, `o_{MM, t}`, of the transition from Moore machine state `h_{MM, t-1}` to Moore machine state `h_{MM, t}`.
    * `h_{MMN, t}` -> current state of the Moore machine, `h_{MM, t}`.
    * `a_{MMN, t}` -> output label, `a_{MM, t}`, of the current Moore machine state `h_{MM, t}`.
    
3. What we are missing is a transition function `delta()` and an initial state of the Moore machine, `h_{MM, 0}`.

    * `delta()`: A Moore machine needs a transition function `delta(h_{MM, t - 1}, o_{MM, t}) -> h_{MM, t}` that maps the current state and observed feature to the next state. Here we will end up with a set of trajectories containing `p` distinct quantized states (`h_{MM}`) and `q` distinct quantized features (`o_{MM}`). These trajectories are then converted to a transition table representing `delta`, which maps any state-observation tuple `(h_{MM}, o_{MM})` to a new state `h_{MM}'`.

    * `h_{MM, 0}`: In practice, the initial state is obtained by encoding the start state of `RNN_Policy` using `BHX`: `h_{MM, 0} = BHX(h_{MMN, 0})`.
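
The conversion from rollout tuples to a transition table can be sketched with plain dictionaries. This is an illustrative sketch only: `build_moore_machine` is a hypothetical helper, quantized codes are assumed to be hashable (for example tuples of -1/0/1), and resolving any conflicting records by majority vote is my assumption rather than something the source prescribes.

```python
from collections import Counter, defaultdict


def build_moore_machine(transitions):
    """Build (delta, output) from (h_prev, obs, h_next, action) tuples."""
    next_state_votes = defaultdict(Counter)
    action_votes = defaultdict(Counter)
    for h_prev, obs, h_next, action in transitions:
        next_state_votes[(h_prev, obs)][h_next] += 1
        action_votes[h_next][action] += 1

    # keep the most frequently observed next state / action for each key
    delta = {key: votes.most_common(1)[0][0]
             for key, votes in next_state_votes.items()}
    output = {state: votes.most_common(1)[0][0]
              for state, votes in action_votes.items()}
    return delta, output


toy_transitions = [
    ((0, 1), (1,), (1, 1), 2),
    ((1, 1), (0,), (0, 1), 3),
    ((0, 1), (1,), (1, 1), 2),
]
delta, output = build_moore_machine(toy_transitions)
print(delta[((0, 1), (1,))])  # (1, 1)
print(output[(1, 1)])         # 2
```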

## Classical Moore Machine Minimization

Minimize the extracted Moore machine to get the smallest possible model. "In general, the number of states `p` will be larger than necessary in the sense that there is a much smaller, but equivalent, minimal machine". Thus, we use age-old Moore machine minimization techniques to minimize the extracted machine. **This process is exactly the process in Grammatical Inference, thus we can use my own [wombats](https://github.com/nicholasRenninger/wombats/tree/master) tool.**
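
For intuition, the textbook partition-refinement approach to Moore machine minimization fits in a few lines. This sketch does not use `wombats` and is not claimed to match the procedure of the original work; `minimize_moore_machine` is a hypothetical helper, and treating unseen transitions as a distinct "undefined" successor is an assumption.

```python
def minimize_moore_machine(states, inputs, delta, output):
    """Group equivalent states. `delta` maps (state, input) -> state and
    may be partial; `output` maps state -> action.
    Returns a dict mapping each state to the id of its equivalence block."""
    # initial partition: states with the same output label share a block
    labels = {}
    block_of = {}
    for s in states:
        block_of[s] = labels.setdefault(output[s], len(labels))

    while True:
        # split blocks whose members disagree on their successor blocks
        signatures = {}
        new_block_of = {}
        for s in states:
            sig = (block_of[s],) + tuple(
                block_of.get(delta.get((s, a))) for a in inputs)
            new_block_of[s] = signatures.setdefault(sig, len(signatures))
        if len(signatures) == len(set(block_of.values())):
            return new_block_of  # no block was split: partition is stable
        block_of = new_block_of


# toy machine: B and C emit the same action and behave identically
states = ['A', 'B', 'C']
inputs = [0, 1]
output = {'A': 'LEFT', 'B': 'RIGHT', 'C': 'RIGHT'}
delta = {('A', 0): 'B', ('A', 1): 'C',
         ('B', 0): 'A', ('B', 1): 'B',
         ('C', 0): 'A', ('C', 1): 'C'}
blocks = minimize_moore_machine(states, inputs, delta, output)
print(blocks['B'] == blocks['C'], blocks['A'] == blocks['B'])  # True False
```

Each block of the returned partition becomes one state of the minimized machine, with transitions and outputs inherited from any of its members.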

## Classical Moore Machine Policy Evaluation

You now have a Moore machine operating on the abstract, quantized data obtained from the `QBNs`. To use the Moore machine as an agent policy in the environment `env`:

1. Start by using `OX` and the feature extractor to take the initial environmental observation `f_{env, 0}` and get the Moore machine feature observation `o_{MM, 0} = OX.encode(F_ExtractNet(f_{env, 0}))`.

2. Use `delta` with `h_{MM, 0}` (part of the definition of the Moore machine) and `o_{MM, 0}` to get the next Moore machine state, `h_{MM, 1} = delta(h_{MM, 0}, o_{MM, 0})`, and read the action `a_{MM, 0}` from the output label of that state.

3. Take a step in the environment using `step(env, a_{MM, 0})` to produce a new observation `f_{env, 1}` and the environmental reward, `r_0`.

4. Repeat steps 1-3 for `t = 1` onwards:
    1. `o_{MM, t} = OX.encode(F_ExtractNet(f_{env, t}))`
    2. `h_{MM, t+1} = delta(h_{MM, t}, o_{MM, t})`, with `a_{MM, t}` the output label of `h_{MM, t+1}`
    3. `f_{env, t+1}, r_t = step(env, a_{MM, t})`
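
The evaluation loop can be sketched end to end on a toy problem. All names here are hypothetical (`run_moore_policy`, `ToyEnv`, `encode_obs`), and the sketch assumes, as in the extraction section, that `delta` maps a `(state, observation)` pair to the next state and that the action is the output label of that state. Stopping on an unseen `(state, observation)` pair is also an assumption; other fallbacks are possible.

```python
def run_moore_policy(env, encode_obs, delta, output, h0, max_steps=100):
    """Roll out a Moore machine policy and return the total reward."""
    h = h0
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        o = encode_obs(obs)          # stands in for OX.encode(F_ExtractNet(.))
        if (h, o) not in delta:
            break                    # unseen (state, observation) pair
        h = delta[(h, o)]            # transition to the next machine state
        action = output[h]           # Moore output: depends on the state only
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


class ToyEnv:
    """Observation alternates 0, 1, 0, ...; reward 1 if action == observation."""

    def __init__(self, episode_len=4):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t % 2

    def step(self, action):
        reward = 1.0 if action == self.t % 2 else 0.0
        self.t += 1
        return self.t % 2, reward, self.t >= self.episode_len


delta = {(s, o): ('s0' if o == 0 else 's1')
         for s in ('s0', 's1') for o in (0, 1)}
output = {'s0': 0, 's1': 1}
total = run_moore_policy(ToyEnv(), encode_obs=lambda obs: obs,
                         delta=delta, output=output, h0='s0')
print(total)  # 4.0
```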