# Moore Machine Network Extraction from an RNN

Nicholas Renninger

*note: Some of this code is borrowed from my [GAIL repo](https://github.com/nicholasRenninger/GAIL-Formal_Methods).*

## Defining the environment to test on.

Here, we're going to use an Atari environment, in the spirit of the original paper.

In [1]:
%load_ext autoreload
%autoreload 2

import os
from pathlib import Path

import gym
from gym import spaces

import stable_baselines
from stable_baselines import PPO2
from stable_baselines.gail import ExpertDataset, generate_expert_traj
from stable_baselines.common.policies import CnnLstmPolicy
from stable_baselines.common.vec_env import VecFrameStack
from stable_baselines.common.evaluation import evaluate_policy
from stable_baselines.common.callbacks import EvalCallback, BaseCallback
from stable_baselines.results_plotter import load_results, ts2xy
from stable_baselines.bench import Monitor
from stable_baselines.common.cmd_util import make_atari_env

from rl_baselines_zoo.utils import make_env, linear_schedule

from util.stable_baseline_viz import show_videos, record_video

import numpy as np
from functools import reduce
import operator


# the main environment we're testing on
ENV_ID = 'PongNoFrameskip-v4'

# file I/O configuration
EXPERIMENT_HOME = 'experiment_data'

# training logging
LOG_DIR_BASE = './logs/'

# these are for data associated with the expert
BASEPOLICY_NAME = 'ppo2'

BASEPOLICY_LOG_DIR = os.path.join(LOG_DIR_BASE, BASEPOLICY_NAME)
BASEPOLICY_DIR = os.path.join(EXPERIMENT_HOME, 'basepolicy_data')
BASEPOLICY_VIDEO_DIR = os.path.join(BASEPOLICY_DIR, 'videos/')

BASEPOLICY_RUN_ID = f'basepolicy_{BASEPOLICY_NAME}_{ENV_ID}'
BASEPOLICY_MODEL_PATH = os.path.join(BASEPOLICY_DIR,
                                     f'{BASEPOLICY_RUN_ID}_model.zip')
BASEPOLICY_TRACES_PATH = os.path.join(BASEPOLICY_DIR,
                                      f'{BASEPOLICY_RUN_ID}_traces.npz')
BASEPOLICY_BEST_MODEL_PATH = os.path.join(BASEPOLICY_LOG_DIR,
                                          'best_model.zip')

# need to ensure these directories always exist
Path(LOG_DIR_BASE).mkdir(parents=True, exist_ok=True)

Path(BASEPOLICY_DIR).mkdir(parents=True, exist_ok=True)
Path(BASEPOLICY_VIDEO_DIR).mkdir(parents=True, exist_ok=True)

# decide whether you want to load in a pre-trained model for the ENV or if
# you need to learn a model using the BASEPOLICY_NAME algorithm
model_formats = ['pre_trained_model', 'learn_the_model', 'traces_only']
model_format = model_formats[1]

# performance evaluation / visualization settings
MAX_VIDEO_LEN = 1000
NUM_EVAL_EPISODES = 10




## Learning Hyperparams

Learning Model Hyperparameters:

In [12]:
hyperparams = {'basepolicy_ppo2':
                    {'cliprange': linear_schedule(0.1),
                     'ent_coef': 0.01,
                     'gamma': 0.99,
                     'lam': 0.95,
                     'learning_rate': linear_schedule(3e-4),
                     'n_steps': 128,
                     'n_timesteps': 10_000_000,
                     'nminibatches': 8, # the number of parallel envs should
                                        # be a multiple of nminibatches
                     'noptepochs': 4,
                     'policy': CnnLstmPolicy,
                     'eval_freq': 500}}
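
As a rough sketch of what the two scheduled hyperparameters do: I am assuming here that `linear_schedule` scales its initial value by the fraction of training that remains (1.0 at the start, 0.0 at the end). `linear_schedule_sketch` below is a local stand-in written for illustration, not the `rl_baselines_zoo` helper itself. The last lines check the divisibility constraint mentioned in the `nminibatches` comment, using the `num_envs` value set further down.

```python
def linear_schedule_sketch(initial_value):
    """Hypothetical stand-in: value decays linearly with remaining progress."""
    def schedule(progress_remaining):
        # progress_remaining is assumed to go from 1.0 (start) to 0.0 (end)
        return progress_remaining * initial_value
    return schedule


learning_rate = linear_schedule_sketch(3e-4)
rates = [learning_rate(p) for p in (1.0, 0.5, 0.0)]

# the number of parallel envs should be a multiple of nminibatches
num_envs, nminibatches = 16, 8
envs_divisible = (num_envs % nminibatches == 0)
```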

Environment Hyperparameters:

In [3]:
# using vectorized env. is SOO much faster, but only some algs support
# it e.g. PPO2
# 64
num_envs = 16

## Environment Definition w.r.t. Hyperparameters

In [4]:
# There already exists an environment generator that will make and wrap atari 
# environments correctly.
env = make_atari_env(ENV_ID, num_env=num_envs, seed=0)

# only want ONE environment for evaluation
eval_env = make_atari_env(ENV_ID, num_env=1, seed=0)




## Get an RNN Policy for the Environment

The end goal of this section is to provide the moore machine extraction process with a good policy representation.

<br>

**Model Formats:**

*possible formats: `'pre_trained_model', 'learn_the_model', 'traces_only'`*

* `'learn_the_model'`: learning the model just requires choosing the desired RL algorithm and setting its hyperparameters (see above).

* `'pre_trained_model'`: a pre-trained expert must have a saved `stable_baselines` model file at `BASEPOLICY_MODEL_PATH`. See the [saving guide](https://stable-baselines.readthedocs.io/en/master/guide/save_format.html) for info on how to do that.

* `'traces_only'`: demonstration traces must reside in the `npz` archive at `BASEPOLICY_TRACES_PATH` and follow the format needed by `stable_baselines.gail.ExpertDataset()`.


<br>

--- 

<br>

**From the docs**:

*The expert dataset is a .npz archive. The data is saved in python dictionary format with keys: actions, episode_returns, rewards, obs, episode_starts.*

*In case of images, obs contains the relative path to the images.*

*obs, actions: shape (N * L, ) + S*

*where N = # episodes, L = episode length and S is the environment observation/action space.*

*S = (1, ) for discrete space*
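
To make the archive layout above concrete, here is a minimal sketch that writes and reloads a toy `.npz` file with the five keys listed in the docs. The sizes, the observation shape, and the file name `toy_traces.npz` are made up for illustration. The shapes I chose for `rewards`, `episode_returns`, and `episode_starts` are my assumption, since the quoted docs only spell out `obs` and `actions`.

```python
import os
import tempfile

import numpy as np

n_episodes, episode_len = 2, 3       # N and L from the docs (toy values)
n_steps = n_episodes * episode_len
obs_shape = (4,)                     # hypothetical observation space S

# mark the first step of every episode
episode_starts = np.zeros(n_steps, dtype=bool)
episode_starts[::episode_len] = True

traces = {
    'actions': np.zeros((n_steps, 1), dtype=np.int64),  # S = (1,) if discrete
    'obs': np.zeros((n_steps,) + obs_shape, dtype=np.float32),
    'rewards': np.zeros(n_steps, dtype=np.float32),           # assumed shape
    'episode_returns': np.zeros(n_episodes, dtype=np.float32),  # assumed shape
    'episode_starts': episode_starts,                         # assumed shape
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_traces.npz')
    np.savez(path, **traces)
    with np.load(path) as archive:
        loaded = {key: archive[key] for key in archive.files}
```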

<br>

In [5]:
if model_format == 'pre_trained_model':
    basepolicy_model = PPO2.load(BASEPOLICY_MODEL_PATH)
    basepolicy_model.set_env(env)

elif model_format == 'learn_the_model':
    basepolicy_hparam = hyperparams['basepolicy_ppo2']

    basepolicy_model = PPO2(basepolicy_hparam['policy'], env,
                            cliprange=basepolicy_hparam['cliprange'],
                            ent_coef=basepolicy_hparam['ent_coef'],
                            gamma=basepolicy_hparam['gamma'],
                            lam=basepolicy_hparam['lam'],
                            learning_rate=basepolicy_hparam['learning_rate'],
                            n_steps=basepolicy_hparam['n_steps'],
                            nminibatches=basepolicy_hparam['nminibatches'],
                            noptepochs=basepolicy_hparam['noptepochs'],
                            verbose=0,
                            tensorboard_log=BASEPOLICY_LOG_DIR)

    # train the policy; no evaluation callback is attached here, so the
    # model saved below is simply the final one
    basepolicy_model.learn(total_timesteps=basepolicy_hparam['n_timesteps'])
















In [13]:
# now save the final model so if we like it, we don't need to re-learn it
# The model will be saved at BASEPOLICY_MODEL_PATH
basepolicy_model.save(BASEPOLICY_MODEL_PATH)

If we now have a learned model, sample trajectories from it and save them for the quantization learning process. If you only provide traces, we do nothing here and assume the traces exist at `BASEPOLICY_TRACES_PATH`.

In [7]:
if model_format != 'traces_only':
    
    # vectorized environments do not work with generate_expert_traj, not sure
    # why. Seems to be that threads are probably not synced properly somehow
    #
    # The evaluation environment already must only have a singular env, so
    # use it for trace generation.
    #
    # NOTE: a recurrent policy trained on `num_envs` environments may refuse
    # a single-environment `set_env`; the single-env wrapper defined in the
    # next cell is one workaround.
    basepolicy_model.set_env(eval_env)
    
    # generate trajectories in the environment under the basepolicy_model
    data = generate_expert_traj(basepolicy_model, BASEPOLICY_TRACES_PATH,
                                n_episodes=500)
    
# Load the expert dataset
expert_dataset = ExpertDataset(expert_path=BASEPOLICY_TRACES_PATH,
                               verbose=1)

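Because we used a vectorized environment for training, the recurrent policy expects a batch of `num_envs` observations at prediction time. To use it with a single, unstacked environment, we override `predict` so that the single observation fills only the first slot of a zero-filled batch, and only the first environment's action is returned.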

In [14]:
class PPO2_MultiEnv_to_SingleEnv(PPO2):
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
    
    def predict(self, eval_observation, state=None, mask=None,
                deterministic=False):
        
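        # build a zero-filled batch sized for the training env and put the
        # single evaluation observation in the first slot
        n_envs = self.env.num_envs
        obs_shape = self.env.observation_space.shape
        zero_completed_obs = np.zeros((n_envs,) + obs_shape)
        zero_completed_obs[0, :] = eval_observation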
        
        action, state = super().predict(zero_completed_obs, state, mask,
                                        deterministic)
        
        return [action[0]], state

single_vec_basepolicy = PPO2_MultiEnv_to_SingleEnv.load(BASEPOLICY_MODEL_PATH)
single_vec_basepolicy.set_env(env)
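
The padding trick used in `predict` can be looked at in isolation. This is a small sketch with a made-up frame shape; `pad_to_batch` is a hypothetical helper name, not part of `stable_baselines`.

```python
import numpy as np


def pad_to_batch(single_obs, n_envs):
    """Place one observation in slot 0 of a zero-filled batch of n_envs."""
    batch = np.zeros((n_envs,) + single_obs.shape, dtype=single_obs.dtype)
    batch[0] = single_obs
    return batch


frame = np.ones((84, 84, 1), dtype=np.float32)   # hypothetical single frame
batch = pad_to_batch(frame, n_envs=16)
```

The cost of this approach is that the network still runs on all `n_envs` slots, so the remaining slots are wasted computation on all-zero inputs.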

### Evaluating the performance of the learned RNN policy:

In [None]:
mean_reward, std_reward = evaluate_policy(single_vec_basepolicy, eval_env,
                                          n_eval_episodes=NUM_EVAL_EPISODES)
print(f"base RNN policy ({BASEPOLICY_NAME})  mean_reward:{mean_reward:.2f}" +
      f" +/- {std_reward:.2f}")

### Visualizing Learning

#### Visualizing the Base RNN Policy on the Original, Vectorized Domain

In [15]:
vec_video_name = BASEPOLICY_RUN_ID + "_vec"
record_video(basepolicy_model, eval_env=env,
             max_video_length=MAX_VIDEO_LEN,
             video_prefix=vec_video_name,
             video_folder=BASEPOLICY_VIDEO_DIR,
             is_recurrent=True)

show_videos(BASEPOLICY_VIDEO_DIR, prefix=vec_video_name)

Saving video to  /home/ferg/NeuralMooreMachine_Experiments/experiment_data/basepolicy_data/videos/basepolicy_ppo2_PongNoFrameskip-v4_vec-step-0-to-step-1000.mp4


#### Visualizing the Base RNN Policy on a single instance of the Original Domain

In [16]:
single_video_name = BASEPOLICY_RUN_ID + "_single"
record_video(single_vec_basepolicy, eval_env=eval_env,
             max_video_length=MAX_VIDEO_LEN,
             video_prefix=single_video_name,
             video_folder=BASEPOLICY_VIDEO_DIR,
             is_recurrent=True)

show_videos(BASEPOLICY_VIDEO_DIR, prefix=single_video_name)

Saving video to  /home/ferg/NeuralMooreMachine_Experiments/experiment_data/basepolicy_data/videos/basepolicy_ppo2_PongNoFrameskip-v4_single-step-0-to-step-1000.mp4


## Generating Bottleneck Data

Generate "Bottleneck Data". This is where you simulate many trajectories in the RL environment, recording the observation features and the hidden states of the `RNN_Policy` at each step. This is for training the "quantized bottleneck neural networks" (`QBNs`) next.

In [None]:
def generate_bottleneck_data(RNN_Policy):
    
    (input_obs, hidden_states) = (None, None)
    
    
    
    return (input_obs, hidden_states)
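
The function above is still a stub. As a rough idea of what the collection loop could look like, here is a sketch on a toy problem: `ToyRNNPolicy` is a made-up tanh RNN with random weights standing in for `RNN_Policy`, and random vectors stand in for environment observations (the chosen action is not fed back to any environment). None of these names come from `stable_baselines`.

```python
import numpy as np


class ToyRNNPolicy:
    """Hypothetical stand-in for RNN_Policy: feature extractor + tanh RNN."""

    def __init__(self, obs_dim, feat_dim, hidden_dim, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.W_feat = rng.normal(size=(obs_dim, feat_dim))
        self.W_in = rng.normal(size=(feat_dim, hidden_dim))
        self.W_rec = rng.normal(size=(hidden_dim, hidden_dim))
        self.W_out = rng.normal(size=(hidden_dim, n_actions))
        self.hidden_dim = hidden_dim

    def initial_state(self):
        return np.zeros(self.hidden_dim)

    def step(self, obs, hidden):
        features = np.tanh(obs @ self.W_feat)
        hidden = np.tanh(features @ self.W_in + hidden @ self.W_rec)
        action = int(np.argmax(hidden @ self.W_out))
        return action, features, hidden


def generate_bottleneck_data_sketch(policy, sample_obs, n_episodes,
                                    episode_len):
    """Record feature-extractor outputs and hidden states at every step."""
    features_log, hidden_log = [], []
    for _ in range(n_episodes):
        hidden = policy.initial_state()
        for _ in range(episode_len):
            obs = sample_obs()  # stand-in for stepping a real environment
            _, features, hidden = policy.step(obs, hidden)
            features_log.append(features)
            hidden_log.append(hidden)
    return np.stack(features_log), np.stack(hidden_log)


rng = np.random.default_rng(1)
toy_policy = ToyRNNPolicy(obs_dim=8, feat_dim=4, hidden_dim=6, n_actions=3)
input_obs, hidden_states = generate_bottleneck_data_sketch(
    toy_policy, lambda: rng.normal(size=8), n_episodes=5, episode_len=10)
```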

## Learning QBNs
Learn `QBNs`, which are essentially applied autoencoders (AE), to quantize (discretize):

* the observation features produced by the environmental feature extractor (the `OX` QBN):
    * the feature extractor is a CNN here, as we are using an agent that observes video of the environment.
    
* the hidden state of the `RNN_Policy`. This is called `b_h` in the paper and `BHX` in the MMN code.
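
To give a feel for what a QBN computes, here is a sketch of the forward pass only, with random (untrained) weights. The 3-level rounding to `{-1, 0, 1}` and the single-layer encoder and decoder are my assumptions for illustration; `UntrainedQBN` and `ternary_quantize` are hypothetical names. Training is omitted: rounding has zero gradient almost everywhere, so a real implementation needs some way to pass gradients through the quantization step.

```python
import numpy as np


def ternary_quantize(x):
    """Round each value to the nearest of -1, 0, 1."""
    return np.rint(np.clip(x, -1.0, 1.0))


class UntrainedQBN:
    """Hypothetical autoencoder with a discrete (ternary) latent code."""

    def __init__(self, input_dim, latent_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.5, size=(input_dim, latent_dim))
        self.W_dec = rng.normal(scale=0.5, size=(latent_dim, input_dim))

    def encode(self, x):
        # continuous vector -> discrete code
        return ternary_quantize(np.tanh(x @ self.W_enc))

    def decode(self, code):
        # discrete code -> continuous reconstruction
        return np.tanh(code @ self.W_dec)

    def __call__(self, x):
        return self.decode(self.encode(x))


rng = np.random.default_rng(2)
toy_hidden_states = rng.uniform(-1.0, 1.0, size=(50, 6))
bhx_sketch = UntrainedQBN(input_dim=6, latent_dim=4)
codes = bhx_sketch.encode(toy_hidden_states)
reconstruction = bhx_sketch(toy_hidden_states)
```

With a latent size of 4, there are at most `3 ** 4 = 81` distinct codes, which is what bounds the number of discrete states later on.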

## QBN Insertion into the RNN Policy Network

Insert the trained `OX` QBN *after* the feature extractor, so that it quantizes the features fed to the RNN unit, and the trained `BHX` QBN *after* the RNN unit in the feature_extractor-rnn_policy network to create what is now called the Moore machine network (`MMN`) policy.

## MMN Policy Training / Fine-tuning

Fine-tune the `MMN` policy by re-running the RL algorithm using the `MMN` policy as a starting point for RL interactions. *Importantly, for training stability the `MMN` is fine-tuned to match the softmax action distribution of the original `RNN_Policy`, not its argmax action. That is, optimize a categorical cross-entropy loss between the softmax output layers of the RNN and the `MMN`.*
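
The loss described above can be written down in a few lines. This is a minimal NumPy sketch with made-up logits; the function names are hypothetical and the optimization step itself is not shown.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def soft_target_cross_entropy(teacher_logits, student_logits):
    """Cross-entropy of the student's softmax against the teacher's softmax."""
    teacher_probs = np.exp(log_softmax(teacher_logits))
    per_sample = -np.sum(teacher_probs * log_softmax(student_logits), axis=-1)
    return float(np.mean(per_sample))


# toy logits over 3 actions for a batch of 2 observations
rnn_logits = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
mmn_logits_close = rnn_logits.copy()          # MMN agrees with the RNN
mmn_logits_far = rnn_logits[:, ::-1].copy()   # MMN prefers other actions

loss_close = soft_target_cross_entropy(rnn_logits, mmn_logits_close)
loss_far = soft_target_cross_entropy(rnn_logits, mmn_logits_far)
```

The loss is smallest when the two distributions match, and it penalizes disagreement on every action's probability rather than only on the most likely action.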

## Classical Moore Machine Extraction

Extract a classical moore machine from the `MMN` policy by doing:

1. Generate trajectories in the RL environment using rollout simulations of `MMN` policy. For each rollout simulation timestep, we extract a tuple `(h_{MMN, t-1}, f_{MMN, t}, h_{MMN, t}, a_{MMN, t})`:
    * `h_{MMN, t-1}`: the quantized hidden state of the RNN QBN at the previous timestep
    * `f_{MMN, t}`: the quantized observation state of the feature extractor QBN at the current timestep.
    * `h_{MMN, t}`: the quantized hidden state of the RNN QBN at the current timestep.
    * `a_{MMN, t}`: the action output by the `MMN` policy at the current timestep.
    
2. As you can see, we now have *most* of the elements needed to form a Moore machine:
    * `h_{MMN, t-1}` -> prior state of the moore machine, `h_{MM, t-1}`
    * `f_{MMN, t}` -> input label of the transition from Moore machine state `h_{MM, t-1}` to Moore machine state `h_{MM, t}`, `o_{MM, t}`.
    * `h_{MMN, t}` -> current state of the moore machine, `h_{MM, t}`.
    * `a_{MMN, t}` -> output label of the current moore machine state `h_{MM, t}`, `a_{MM, t}`.
    
3. What we are missing is a transition function `delta()` and an initial state of the moore machine, `h_{MM, 0}`. 
     
    * `delta()`: A moore machine needs a transition function `delta(h_{MM, t - 1}, o_{MM, t}) -> h_{MM, t}` that maps the current state and observed feature to the next state. Here we will end up with a set of trajectories containing `p` distinct quantized states (`h_{MM}`) and `q` distinct quantized features (`o_{MM}`). These trajectories are then converted to a transition table representing `delta`, which maps any observation-state tuple `(h_{MM}, o_{MM})` to a new state `h_{MM}'`.

    * `h_{MM, 0}`: In practice, this is obtained by encoding the start state of `RNN_Policy` using `BHX`: `h_{MM, 0} = BHX(h_{MMN, 0})`.
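
The conversion from rollout tuples to a transition table can be sketched as follows. Quantized vectors are represented as plain Python tuples so they can be used as dictionary keys, and the toy rollout is made up. `build_moore_machine` is a hypothetical helper name.

```python
def build_moore_machine(rollout_tuples):
    """Turn (h_prev, obs, h_next, action) tuples into delta and output maps."""
    delta, output = {}, {}
    for h_prev, obs, h_next, action in rollout_tuples:
        # delta(h_prev, obs) -> h_next must be a function
        if delta.setdefault((h_prev, obs), h_next) != h_next:
            raise ValueError(f'conflicting transition for {(h_prev, obs)}')
        # the action is the output label of the state that was reached
        if output.setdefault(h_next, action) != action:
            raise ValueError(f'conflicting output label for state {h_next}')
    return delta, output


# toy rollout: 2-dim quantized hidden states, 1-dim quantized features
rollout = [
    ((0, 0), (1,), (1, 0), 2),
    ((1, 0), (-1,), (0, 0), 0),
    ((0, 0), (1,), (1, 0), 2),   # repeated transition, consistent with above
]
delta, output = build_moore_machine(rollout)
```

The consistency checks are there because a conflict would mean the recorded data cannot be described by a deterministic Moore machine.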

## Classical Moore Machine Minimization

Minimize the extracted Moore machine to get the smallest possible model. "In general, the number of states `p` will be larger than necessary in the sense that there is a much smaller, but equivalent, minimal machine". Thus, we use standard Moore machine minimization techniques to obtain the minimal machine. **This process is exactly the process in Grammatical Inference, thus we can use my own [wombats](https://github.com/nicholasRenninger/wombats/tree/master) tool.**
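
For intuition, here is a sketch of one textbook way to minimize a Moore machine: partition refinement. This is not the wombats tool and not necessarily what the paper does. It assumes every state has an output label, treats a missing transition as its own distinct outcome, and does not remove unreachable states. The toy machine is made up.

```python
def minimize_moore(states, inputs, delta, output):
    """Merge states that no input sequence can tell apart.

    Returns (block_of, min_delta, min_output), where block_of maps each
    original state to the integer id of its merged state.
    """
    # start by grouping states that share an output label
    block_of = {s: output[s] for s in states}
    while True:
        ids, refined = {}, {}
        for s in states:
            # blocks reached on each input (None if no transition is known)
            successors = tuple(block_of.get(delta.get((s, a)))
                               for a in inputs)
            refined[s] = ids.setdefault((block_of[s], successors), len(ids))
        # refinement only ever splits blocks, so an unchanged count
        # means the partition is stable
        stable = len(ids) == len(set(block_of.values()))
        block_of = refined
        if stable:
            break
    min_delta = {(block_of[s], a): block_of[t] for (s, a), t in delta.items()}
    min_output = {block_of[s]: output[s] for s in states}
    return block_of, min_delta, min_output


# toy machine: states 'A' and 'B' behave identically
toy_states = ['A', 'B', 'C']
toy_inputs = [0, 1]
toy_output = {'A': 'up', 'B': 'up', 'C': 'down'}
toy_delta = {('A', 0): 'B', ('A', 1): 'C',
             ('B', 0): 'A', ('B', 1): 'C',
             ('C', 0): 'C', ('C', 1): 'A'}
block_of, min_delta, min_output = minimize_moore(
    toy_states, toy_inputs, toy_delta, toy_output)
```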

## Classical Moore Machine Policy Evaluation

You now have a Moore machine operating on the abstract, quantized data obtained from the `QBNs`. To use the Moore machine as an agent policy in the environment `env`:

1. Start by using `OX` and the feature extractor to take the initial environmental observation `f_{env, 0}` and get the moore machine feature observation `o_{MM, 0} = OX.encode(F_ExtractNet(f_{env, 0}))`.

2. Use `delta` with `h_{MM, 0}` (part of the definition of the Moore machine) and `o_{MM, 0}` to transition to the next Moore machine state, `delta(h_{MM, 0}, o_{MM, 0})`. The action `a_{MM, 0}` is the output label of that state.

3. Take a step in the environment using `step(env, a_{MM, 0})` to produce a new observation `f_{env, 1}` and the environmental reward, `r_0`.
    
4.  Repeat steps 1-3 for `t = 1` onwards:
    1.  `o_{MM, t} = OX.encode(F_ExtractNet(f_{env, t}))`
    2.  `a_{MM, t}` = output label of the state reached via `delta` from the current state and `o_{MM, t}`
    3.  `f_{env, t+1}, r_t = step(env, a_{MM, t})`
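
The evaluation loop can be sketched on a toy problem. Everything here is made up for illustration: `ToyEnv` is a tiny environment whose `step` returns `(observation, reward, done)`, and `encode_obs` stands in for `OX.encode(F_ExtractNet(...))`. Following the transition function defined in the extraction section, `delta` maps a (state, observation) pair to the next state, and the action is read from that state's output label.

```python
class ToyEnv:
    """Hypothetical environment: rewards actions that match step parity."""

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        reward = 1.0 if action == self.t % 2 else 0.0
        self.t += 1
        done = self.t >= 4
        return self.t % 2, reward, done


def run_moore_policy(delta, output, h0, encode_obs, env, max_steps=1000):
    """Roll out a Moore machine policy and return the total reward."""
    state = h0
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        symbol = encode_obs(obs)          # quantized observation feature
        state = delta[(state, symbol)]    # KeyError on an unseen pair
        action = output[state]            # output label of the new state
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


# toy machine: move to 's0' on observation 0 and to 's1' on observation 1
toy_delta = {(h, o): f's{o}' for h in ('s0', 's1') for o in (0, 1)}
toy_output = {'s0': 0, 's1': 1}
total_reward = run_moore_policy(toy_delta, toy_output, h0='s0',
                                encode_obs=lambda obs: obs, env=ToyEnv())
```

A real rollout also needs a rule for (state, observation) pairs that never appeared in the extraction data; this sketch simply raises a `KeyError` in that case.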