<a href="https://colab.research.google.com/github/pankajr141/experiments/blob/master/Reasoning/Reinforcement/Reasoning%20%7BRL%7D%20-%205%3A%20Partially%20Observable%20MDP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## References 

https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-6-partial-observability-and-deep-recurrent-q-68463e9aeefc

https://web.stanford.edu/class/aa228/reports/2018/final150.pdf


**How to solve video games ?** we certainly cant use the same algorithm used previously as the previous algo require entire world knowledge to be loaded in order to work. This family is knows as MDP(Markov-Decision-Process) where current state is sufficient for making furture decision (eg. Grid ENV all the states and their Q values are known)


The env in which a single frame has only partial information is called as Partially Observable MDP. 

**How to solve these ?** 



*   <font color='red'>We can stack together some frames and pass to our n/w to learn - This is memory consumable and inefficient as their is a limit to the number of frames we can keep and also the experiance relay buffer will be huge. </font>
*   <font color='green'>We can use **RNN/LSTM/GRU** </font>. These network by default keep track of past observation through hidden states, so instead of stacking few frames, we can just pass 1 frame at a time.



### Setup

In [0]:
!pip install tensorflow==2.0.0 > /dev/null 2>&1
!pip install tensorflow-gpu==2.0.0 > /dev/null 2>&1

!apt-get install -y xvfb python-opengl > /dev/null 2>&1
!pip install pyvirtualdisplay ffmpeg > /dev/null 2>&1
!apt-get install -y x11-utils > /dev/null 2>&1
!pip install piglet pyglet > /dev/null 2>&1
!pip install gym==0.14.0 > /dev/null 2>&1

In [0]:
!git clone https://github.com/openai/mujoco-py.git
!cd mujoco-py;pip3 install -r requirements.txt;python3 setup.py install

!apt-get install swig > /dev/null 2>&1
!easy_install box2d > /dev/null 2>&1
# !easy_install mujoco-py
# !easy_install gym[all]==0.14.0
!pip3 install gym[all]==0.14.0 > /dev/null 2>&1

In [1]:
''' Some utility functions '''
import glob
import io
import base64
from gym.wrappers import Monitor
from IPython.display import HTML
from IPython import display as ipythondisplay


def show_video():  
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    
def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

from pyvirtualdisplay import Display
display = Display(visible=0, size=(400, 300))
display.start()

ModuleNotFoundError: ignored

### Environment - CarRacing-v0

CarRacing env is a perfect example of POMDP, here we are trying to run a car on a randomly generated racetrack and we can see only a small portion of racetrack at a time. 


According to http://gym.openai.com/envs/CarRacing-v0/  -- official documentation

`Easiest continuous control task to learn from pixels, a top-down racing environment. Discreet control is reasonable in this environment as well, on/off discretisation is fine. State consists of 96x96 pixels. Reward is -0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles in track. For example, if you have finished in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points. Episode finishes when all tiles are visited. Some indicators shown at the bottom of the window and the state RGB buffer. From left to right: true speed, four ABS sensors, steering wheel position, gyroscope.`

In [0]:
from IPython.display import HTML

HTML("""
    <video alt="test" controls>
        <source src="http://gym.openai.com/videos/2019-10-08--6QXvzzSWoV/CarRacing-v0/original.mp4" type="video/mp4">
    </video>
""")

`Action space is the set of triples (s, a, d) ∈ [−1, 1] × [0, 1] × [0, 1]`

<pre>
s -> steering coefficient ranges from hard left to hard right
a -> acceleration a ranges from none to full steam ahead
d -> deceleration d ranges from none to slamming the brakes.
</pre>

Each variable can take continous value <b>however we can change our solution to include only discrete values</b> which in turn will make the solution into a classification problem rather then regression.

<pre> A = {left, right, accelerate, decelerate, nothing} </pre> 

The above will accept 5 values in form of `[0, 1]` such that on or off. 




In [0]:
import gym
print("Action Space:", gym.make("CarRacing-v0").action_space.sample())

Action Space: [-0.7609207   0.22518346  0.19771656]


### Random Agent

Lets try a Random Agent in which all the actions will be decided at random.

**Continous Action Space**

In [0]:
!rm -rf video

import gym

env = wrap_env(gym.make("CarRacing-v0"))
env.reset()

steps = 1000
reward_sum = 0

for step in range(steps):
    env.render()
    observation, reward, done, _ = env.step(env.action_space.sample())
    reward_sum += reward

    if done:
        print("Reward for this episode was: {}, Dead at: {}".format(reward_sum, step))
        reward_sum = 0
        env.reset()
        break

env.close()
print("Reward for this episode was: {}, Dead at: {}".format(reward_sum, steps))

show_video()

Track generation: 1143..1436 -> 293-tiles track
retry to generate track (normal if there are not many of this messages)
Track generation: 1191..1493 -> 302-tiles track
Reward for this episode was: -36.87707641196071, Dead at: 999
Track generation: 1000..1258 -> 258-tiles track
Reward for this episode was: 0, Dead at: 1000


Below is a random agent with discrete action space

In [0]:
def convert_discreate_action_to_continous(left, right, accelerate, decelerate, nothing):
  if nothing == 1:
    return [0, 0, 0]
  elif left == 1:
    return [-1, 0, 0]
  elif right == 1:
    return [1, 0, 0]
  elif accelerate == 1:
    return [0, 1, 0]
  elif decelerate == 1:
    return [0, 0, 1]
  
def convert_continous_action_to_discreate(direction, speed, brake):
  if direction == -1:
    return [1, 0, 0, 0, 0]
  elif direction == 1:
    return [0, 1, 0, 0, 0]
  elif speed == 1:
    return [0, 0, 1, 0, 0]
  elif brake == 1:
    return [0, 0, 0, 1, 0]
  else:
    return [0, 0, 0, 0, 1]

def get_random_action():
  return convert_discreate_action_to_continous(*np.eye(5)[np.random.choice(5, 1)].tolist()[0])


**Discrete Action Space**

In [0]:
!rm -rf video

import gym
import numpy as np

env = wrap_env(gym.make("CarRacing-v0"))
env.reset()

steps = 1000
reward_sum = 0

for step in range(steps):
    env.render()
    action = get_random_action()
    observation, reward, done, _ = env.step(action)
    reward_sum += reward
  
    if done:
        print("Reward for this episode was: {}, Dead at: {}".format(reward_sum, step))
        reward_sum = 0
        env.reset()
        break

env.close()
show_video()

Track generation: 999..1260 -> 261-tiles track
Reward for this episode was: -53.84615384615463, Dead at: 999
Track generation: 1113..1396 -> 283-tiles track


### Deep-Q Learning - Recurrent Models on top of CNN

The Requirement of Recurrent models can be further strengthen on the fact that we are going to recieve positive reward in only a single frame, rest are all -ver reward. Hence model will not know what triggered the reward, however if we use past sequence of events during reward, then model will also consider the state action which led upto reward.


Our action space is continous as discussed above, for sake of simplicity lets try discreet space where we are going to use 5 actions only.


`A = {left, right, accelerate, decelerate, nothing}`

In [0]:
import tensorflow as tf

class Qnetwork(tf.keras.Model):
    def __init__(self, final_layer_size, lr=0.0001):
      super(Qnetwork, self).__init__()
      self.final_layer_size = final_layer_size
      self.conv1 = tf.keras.layers.Conv2D(filters=32, kernel_size=8, strides=4, bias_initializer=None, padding='valid', activation='relu')
      self.conv2 = tf.keras.layers.Conv2D(filters=64, kernel_size=4, strides=2, bias_initializer=None, padding='valid', activation='relu')
      self.conv3 = tf.keras.layers.Conv2D(filters=64, kernel_size=3, strides=1, bias_initializer=None, padding='valid', activation='relu')
      self.conv4 = tf.keras.layers.Conv2D(filters=final_layer_size, kernel_size=7, strides=1, bias_initializer=None, padding='valid', activation='relu')
      self.rnn = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(final_layer_size))
      
      self.flatten = tf.keras.layers.Flatten()
      self.Dense1 = tf.keras.layers.Dense(5, activation='softmax')
      self.Dense2 = tf.keras.layers.Dense(1, activation='softmax')

      self.optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
      self.loss_object = tf.keras.losses.MeanSquaredError()

    def predict(self, inputs, timestep):
      
      x = self.conv1(inputs)
      x = self.conv2(x)
      x = self.conv3(x)
      x = self.conv4(x)

      x = self.flatten(x)
      
      x = tf.reshape(x, [x.shape[0], timestep, self.final_layer_size * 4])  ## x.shape[0] is batch_size, timestep, final_layer
      x = self.rnn(x)

      # We then separate the final convolution layer into an advantage and value
      # stream, 1/2 weights attached to each.. The value function is how well off you are in a given state.
      # The advantage is the how much better off you are after making a particular
      # move. Q is the value function of a state after a given action.
      # Advantage(state, action) = Q(state, action) - Value(state)

      stream_AC, stream_VC = tf.split(x, 2, 1)

      ''' Now Value based model cannot be a good model as we will have to select a best action which gives maximum Q value,
      but we are not selecting actions as they have continous value, so lets try policy n/w '''

      # We define weights for our advantage and value layers. We will train these
      # layers so the matmul will match the expected value and advantage from play
      Advantage = self.Dense1(stream_AC) # Give the advantage value of each action ?
      Value = self.Dense2(stream_VC)     # Give Q value of current state ? 

      # To get the Q output, we need to add the value to the advantage.
      # The advantage to be evaluated will be based on how good the action is based on the average advantage of that state.
      
      self.Qout = Value + tf.subtract(Advantage, tf.reduce_mean(Advantage, axis=1, keepdims=True))
      return self.Qout
    
    def train(self, inputs, labels, timesteps):
      with tf.GradientTape() as tape:
        predictions = self.predict(inputs, timesteps)
        loss = self.loss_object(labels, predictions)

      gradients = tape.gradient(loss, self.trainable_variables)
      self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
      return np.array(loss)


In [0]:
class ExperienceReplay:
    def __init__(self,buffer_size=50000):
        """ Data structure used to hold game experiences """
        # Buffer will contain [state,action,reward,next_state,done]
        self.buffer = []
        self.buffer_size = buffer_size
    
    def add(self, experience):
        """ Adds list of experiences to the buffer """
        # Extend the stored experiences
        self.buffer.extend(experience)
        # Keep the last buffer_size number of experiences
        self.buffer = self.buffer[-self.buffer_size:]

    def sample(self, batch_size, trace_length=None):
        """ Returns a sample of experiences from the buffer """
        if not trace_length:
          sample_idxs = np.random.randint(len(self.buffer), size=batch_size)
          sample_output = [self.buffer[idx] for idx in sample_idxs]
          sample_output = np.reshape(sample_output, (batch_size, -1))
          return sample_output
        else:
          sampled_episodes = random.sample(self.buffer, batch_size)
          sampledTraces = []
          for episode in sampled_episodes:
              point = np.random.randint(0, len(episode) + 1 - trace_length)
              sampledTraces.append(episode[point: point + trace_length])
          sampledTraces = np.array(sampledTraces)
          return np.reshape(sampledTraces, [batch_size * trace_length, 5])

In [0]:
import os
import gym
import numpy as np
import tensorflow as tf

class DeepQModel():
  
  def __init__(self, lr=0.001):
    # Reset everything
    
    self.final_layer_size = 512 #Size of the final convolution layer before splitting into Advantage and Value streams

    self.env = gym.make("CarRacing-v0")
    self.env.env.verbose = 0
    self.env.reset()
    
    print(self.env.observation_space.shape)

    self.checkpoint_path = "training/cp.ckpt"

    # Setup our Q-networks
    self.main_qn = Qnetwork(self.final_layer_size)
    self.target_qn = Qnetwork(self.final_layer_size)

    # Make the networks equal
    self.update_target_graph()

    # Setup our experience replay
    self.experience_replay = ExperienceReplay()
    self.lr = lr
    self.load_model = False # Whether to load a saved model

    self.prob_random_start = 0.6 # Starting chance of random action
    self.prob_random_end = 0.1 # Ending chance of random action
    
    self.y = 0.99 # Discount factor
    self.tau = 1 # Rate to update target network toward primary network
    self.update_freq = 5 # How often you update the network

    self.num_steps = [] # Tracks number of steps per episode
    self.rewards = [] # Tracks rewards per episode
    self.total_steps = 0  # Tracks cumulative steps taken in training

  def update_target_graph(self):
      self.main_qn.save_weights(self.checkpoint_path)
      self.target_qn.load_weights(self.checkpoint_path)
    
  def build_experiance_buffer(self, num_episode, num_episodes, pre_train_episodes, steps, prob_random):
    
      # Create an experience replay for the current episode
      episode_buffer = ExperienceReplay()
      # Get the game state from the environment
      state = self.env.reset()
      done = False # Game is complete
      sum_rewards = 0 # Running sum of rewards in episode

      ''' Save all the interaction with Env '''
      for cntr in range(steps):
        state = state.astype(np.float32)
        if cntr >= steps or done:
          break

        self.total_steps += 1            
   
        if num_episode < pre_train_episodes or np.random.rand() < prob_random:        
            # Act randomly based on prob_random or if we have not accumulated enough pre_train episodes
            action = np.array(get_random_action())
        else:
            # Decide what action to take from the Q network
            qouts = np.array(self.main_qn.predict(np.array([state]), 1))
            action = np.array(convert_discreate_action_to_continous(*np.eye(5)[np.argmax(qouts[0])])).astype(np.int64)

        # Take the action and retrieve the next state, reward and done
        state_next, reward, done, _ = self.env.step(action)

        # Setup the episode to be stored in the episode buffer
        episode = np.array([[state], np.argmax(convert_continous_action_to_discreate(*action)), reward, [state_next], done])
        episode = episode.reshape(1,-1)

        # Store the experience in the episode buffer
        episode_buffer.add(episode)

        # Update the running rewards
        sum_rewards += reward

        # Update the state
        state = state_next

      self.experience_replay.add(episode_buffer.buffer)
      self.num_steps.append(cntr)
      self.rewards.append(sum_rewards)

  def print_status(self, num_episode, print_every, num_epochs, prob_random, goal):
    if not num_episode % print_every == 0:
      return False

    mean_loss = np.mean(self.losses[-(print_every * num_epochs):])
    print("Num episode: {} Mean reward: {:0.4f} Prob random: {:0.4f}, Loss: {:0.04f}".format(
        num_episode, np.mean(self.rewards[-print_every:]), prob_random, mean_loss))

    if np.mean(self.rewards[-print_every:]) >= goal:
        print("Training complete!")
        return True
    return False

  def execute_episodes(self, num_episodes=100, pre_train_episodes=100, batch_size=64, steps=1000, num_epochs=20, goal=5, evaluate_every=5, print_every=5):

    # We'll begin by acting complete randomly. As we gain experience and improve,
    # we will begin reducing the probability of acting randomly, and instead
    # take the actions that our Q network suggests

    annealing_steps = num_episodes
    
    prob_random = self.prob_random_start
    prob_random_drop = (self.prob_random_start - self.prob_random_end) / annealing_steps

    self.losses = [0]     # Tracking training losses

    for num_episode in range(num_episodes):
        self.build_experiance_buffer(num_episode, num_episodes, pre_train_episodes, steps, prob_random)

        if num_episode < pre_train_episodes:
          self.print_status(num_episode, print_every, num_epochs, prob_random, goal)
          continue

        if prob_random > self.prob_random_end:
            # Drop the probability of a random action
            prob_random -= prob_random_drop
  
        if not num_episode or not num_episode % evaluate_every == 0:
            continue

        for num_epoch in range(num_epochs):

            ''' Below Bellman equation will run we will use actual experiances from main model and
            will use actual reward + discounted rewards from target model to recalculate value

            target_q [action] = actual_reward (main_nw) + discounted_Qvalue (target_nw) 

            '''
            # Train batch is [[state,action,reward,next_state,done],...]
            train_batch = self.experience_replay.sample(batch_size)

            # Separate the batch into its components
            state_m, action_m, reward, state_next_m, done = train_batch.T

            # Convert the action array into an array of ints so they can be used for indexing
            action_m = action_m.astype(np.int)

            # Stack the states and train_next_state for learning
            state_m = np.vstack(state_m).astype(np.float32)
            state_next_m = np.vstack(state_next_m).astype(np.float32)

            ''' 
            In order to train n/w we need output from our target n/w modified with optimum rewards
            Based on Bellmon Equation we know that

            Q_out = rewards + discount_factor  * (future_rewards / advantage)

            We have rewards and duscount_factor, but advantage are unknown, advantage are just new state values from our target n/w.
            '''
            output_t = np.array(self.target_qn.predict(state_m, timestep=1))
            
            # The Q values from our target network from the next state
            output_state_next_m = np.array(self.main_qn.predict(state_next_m, timestep=1))

            action_state_next_m = np.argmax(output_state_next_m,axis=1).astype(np.int)

            # Q value of the next state based on action
            Qout_state_next_m = output_state_next_m[range(batch_size), action_state_next_m]

            ''''Tells us whether game over or not, We will multiply our rewards by this value to ensure we don't train on the last move '''
            done = done.astype(int)
            train_gameover = done == 0
            train_gameover = train_gameover.astype(int)

            # Reward from the action chosen in the train batch
            actual_reward = reward + (self.y * Qout_state_next_m * train_gameover)
            
            ''' Remember we are not overriding every value in output_t only the best action q value'''
            output_t[range(batch_size), action_m] = actual_reward
            
            # Train the main model
            '''
            Ideally we dont need target model, but here we can see that with each update weight will change
            which will contradict with target_q value we will be generating, hence we are keeping target constant but generating from 
            identical but slightly outdated same model copy.
            '''
            loss = self.main_qn.train(state_m, output_t, timesteps=1) 
            self.losses.append(loss)

        # Update the target model with values from the main model
        self.update_target_graph()

        # Print progress
        if self.print_status(num_episode, print_every, num_epochs, prob_random, goal):
           break

'''
batch_size - How many Experiences to use for each training step
steps = 1000 # Maximum allowed episode length
num_epochs = 20 # How many epochs to train from single episode experiances

save_every = 5   # How often to save
print_every = 5 # How often to print status
'''

import warnings
warnings.simplefilter('ignore')

obj = DeepQModel()
obj.execute_episodes(num_episodes=500, pre_train_episodes=50, batch_size=64, steps=1000, num_epochs=20, goal=900, evaluate_every=5, print_every=5)
# obj.execute_episodes(num_episodes=4, batch_size=64, steps=1000, num_epochs=20, goal=900, save_every=10, print_every=2)
# !rm -rf models_1
# obj.run_final_model()

(96, 96, 3)
Num episode: 0 Mean reward: -60.2888 Prob random: 0.6000, Loss: 0.0000
Num episode: 5 Mean reward: -55.2879 Prob random: 0.6000, Loss: 0.0000
Num episode: 10 Mean reward: -56.1606 Prob random: 0.6000, Loss: 0.0000
Num episode: 15 Mean reward: -56.1950 Prob random: 0.6000, Loss: 0.0000
Num episode: 20 Mean reward: -56.0693 Prob random: 0.6000, Loss: 0.0000
Num episode: 25 Mean reward: -57.4032 Prob random: 0.6000, Loss: 0.0000
Num episode: 30 Mean reward: -55.1875 Prob random: 0.6000, Loss: 0.0000
Num episode: 35 Mean reward: -56.9494 Prob random: 0.6000, Loss: 0.0000
Num episode: 40 Mean reward: -55.0546 Prob random: 0.6000, Loss: 0.0000
Num episode: 45 Mean reward: -55.3428 Prob random: 0.6000, Loss: 0.0000
Num episode: 50 Mean reward: -60.2545 Prob random: 0.5990, Loss: 0.0258
Num episode: 55 Mean reward: -54.1801 Prob random: 0.5940, Loss: 0.0395
Num episode: 60 Mean reward: -58.2100 Prob random: 0.5890, Loss: 0.0400
Num episode: 65 Mean reward: -57.8946 Prob random: 0.5

In [0]:
def run_final_model(obj):
  os.system("rm -rf models_1")
  env = wrap_env(gym.make("CarRacing-v0"))
  state = env.reset()

  steps = 1000
  reward_sum = 0
  for step in range(steps):
      env.render()
      
      state = state.astype(np.float32)
      qouts = np.array(obj.main_qn.predict(np.array([state]), 1))
      action = np.array(convert_discreate_action_to_continous(*np.eye(5)[np.argmax(qouts[0])])).astype(np.int64)
      print(action)
      state, reward, done, _ = env.step(action)
      reward_sum += reward

      if done:
          print("Reward for this episode was: {}, Dead at: {}".format(reward_sum, step))
          reward_sum = 0
#           env.reset()
          break
  env.close()
  show_video()

run_final_model(obj)

Track generation: 1204..1509 -> 305-tiles track
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
