### Debug and understand: two birds, same shot

### Background (almost a copy-paste of the Memory-based control paper):
<br> The agent only indirectly observes the underlying state of the MDP through the observations. An optimal agent may, in principle, require access to the entire history $h_t = (o_0, a_0, o_1, a_1, \dots, a_{t-1}, o_t)$ to choose the optimal action $a_t$. We set $o_0$ as a dummy observation: the agent begins with action $a_0$.
<br>
<br>
We aim to maximize the expected cumulative reward from time-step one (for us, this is the success probability):

<br>$J(\theta) := \sum_{s_0} p(s_0) \; v_{\pi_{\theta}}(s_0) = \mathbb{E}_\tau \sum_{t=1}^{\infty} \gamma^{t-1} r(s_t, a_t)$<br>

where $\tau$ is a trajectory over observation-action space. Notice that we slightly change the notation and point of view with respect to the paper Memory-based control with recurrent neural networks: for us the history constitutes the state, and we do not make a guess for the state given a history except at the last time-step.

Then, $\frac{\partial J}{\partial \theta} = \mathbb{E}_{\tau} \sum_{t=1}^{\infty} \; \gamma^{t-1} \frac{\partial Q(h_t, a)}{\partial a}|_{a = \pi_{\theta}(h_t)} \frac{\partial \pi_{\theta}(h_t)}{\partial \theta}$, which gives an update rule for the parameters of the policy.
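
The chain rule in the gradient above can be checked numerically on a toy problem. The sketch below is only an illustration: the linear policy, the quadratic critic and the constant `TARGET` are all made up, and the histories are held fixed (they do not depend on $\theta$), which is what the formula assumes when it differentiates only through the action.

```python
import numpy as np

TARGET = 0.3  # hypothetical optimum of the toy critic


def policy(theta, h):
    """Toy deterministic policy: linear in a scalar history feature."""
    return theta * h


def q_value(a):
    """Toy critic: largest when the action equals TARGET."""
    return -(a - TARGET) ** 2


def dq_da(a):
    """Analytic derivative of the toy critic with respect to the action."""
    return -2.0 * (a - TARGET)


def surrogate(theta, histories, gamma):
    """sum_t gamma^(t-1) Q(h_t, pi_theta(h_t)), with the histories held fixed."""
    discounts = gamma ** np.arange(len(histories))
    return float(np.sum(discounts * q_value(policy(theta, histories))))


def policy_gradient(theta, histories, gamma):
    """sum_t gamma^(t-1) * dQ/da at a = pi_theta(h_t) * dpi/dtheta."""
    discounts = gamma ** np.arange(len(histories))  # gamma^(t-1) for t = 1, 2, ...
    actions = policy(theta, histories)
    dpi_dtheta = histories  # derivative of theta * h with respect to theta
    return float(np.sum(discounts * dq_da(actions) * dpi_dtheta))


theta, gamma = 0.5, 0.9
histories = np.array([1.0, -0.5, 2.0])

analytic = policy_gradient(theta, histories, gamma)
eps = 1e-5
numeric = (surrogate(theta + eps, histories, gamma)
           - surrogate(theta - eps, histories, gamma)) / (2 * eps)
print(analytic, numeric)
```

The analytic and finite-difference values should agree up to rounding, which is a cheap sanity check before trusting the same chain rule inside an automatic-differentiation framework.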
<br>
<br>

<b>Double Q-learning</b>. The algorithm maintains two copies each of the policy $\pi$ and of the value function $Q$, with parameters $\theta$ and $\theta'$, and $\omega$ and $\omega'$ respectively. $\theta$ and $\omega$ are the parameters that are being updated by the algorithm; $\theta'$ and $\omega'$ track them with some delay and are used to compute the “target values” for the $Q$-function update.
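
One common way to make the primed parameters "track with some delay" is the soft update used in the second reference, $\theta' \leftarrow \tau\,\theta + (1-\tau)\,\theta'$. The sketch below assumes that this is what the `tau=0.005` hyperparameter of this notebook is for; the function name `soft_update` and the plain lists of arrays are illustrative, and the actual classes in `nets` may do it differently.

```python
import numpy as np


def soft_update(target_weights, primary_weights, tau=0.005):
    """Return new target weights: tau * primary + (1 - tau) * target, array by array."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(primary_weights, target_weights)]


# Toy weights standing in for the primary and target networks.
primary = [np.ones((2, 2)), np.array([1.0])]
target = [np.zeros((2, 2)), np.array([0.0])]

n_steps = 100
for _ in range(n_steps):
    target = soft_update(target, primary)

# With a fixed primary network the target approaches it geometrically.
print(target[1], 1.0 - (1.0 - 0.005) ** n_steps)
```

With a small `tau` the target moves slowly, which keeps the labels of the critic update from chasing the network that is being trained.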

Papers/bibliography:
<ol>
<li> http://rll.berkeley.edu/deeprlworkshop/papers/rdpg.pdf</li>
<li> https://arxiv.org/pdf/1509.02971.pdf</li>
</ol>

In [None]:
import tensorflow as tf
import pandas as pd
from tensorflow.keras.layers import Dense
import numpy as np
import matplotlib.pyplot as plt
import os
from tqdm import tqdm as tqdm
tf.keras.backend.set_floatx('float32')
from collections import deque
from datetime import datetime
import random
import matplotlib
from environment import Environment
from plots import just_plot
from misc import *
from nets import *
from buffer import ReplayBuffer


amplitude=0.4
dolinar_layers=2
number_phases=2
total_episodes = 10**3
buffer_size=500
batch_size=64
ep_guess=0.01
noise_displacement=0.5
lr_actor=0.01
lr_critic=0.001
tau=0.005

In [2]:
exper = np.load("example_buffer/2_sample.npy")
env = Environment(amplitude=amplitude, dolinar_layers = dolinar_layers, number_phases=number_phases)
# buffer = ReplayBuffer(buffer_size=buffer_size)

critic = Critic(nature="primary",valreg=0.01, dolinar_layers = dolinar_layers, number_phases=number_phases)
critic_target = Critic(nature="target", dolinar_layers = dolinar_layers, number_phases=number_phases)
actor = Actor(nature="primary", dolinar_layers = dolinar_layers)
actor_target = Actor(nature="target", dolinar_layers = dolinar_layers)

optimizer_critic = tf.keras.optimizers.Adam(learning_rate=lr_critic)
optimizer_actor = tf.keras.optimizers.Adam(learning_rate=lr_actor)

policy_evaluator = PolicyEvaluator(amplitude = amplitude, dolinar_layers=dolinar_layers, number_phases = number_phases)

Let's evaluate the success probability of the best strategy so far (with randomly initialized networks). We'll inspect this function later on.

In [None]:
policy_evaluator.greedy_strategy(actor = actor, critic = critic)

Now we move on to the optimization step

In [3]:
experiences = exper.astype(np.float32)
print(experiences.shape)
print(experiences)

(15, 6)
[[ 0.03453873  0.         -0.33991995  0.          0.          1.        ]
 [-0.25866753  0.         -0.19607863  0.          0.5         1.        ]
 [-0.24178177  0.          0.09374991  0.          0.5         1.        ]
 [-0.2672065   0.          0.04338924  0.          0.          0.        ]
 [-0.1888024   0.         -0.11296402  0.          0.5         1.        ]
 [-0.0119067   0.          0.10449888  0.          0.          0.        ]
 [ 0.0838516   0.         -0.223578    0.          0.          0.        ]
 [ 0.09396052  0.         -0.00805294  0.          0.5         0.        ]
 [-0.18203147  0.         -0.11425584  0.          0.          1.        ]
 [-0.2001634   0.          0.02064468  0.          0.          1.        ]
 [-0.27773562  0.          0.05043116  0.          0.          0.        ]
 [-0.06698474  0.         -0.2857382   0.          0.          0.        ]
 [-0.29122508  1.         -0.16833827  1.          0.5         1.        ]
 [-0.02650449  0.

From `targeted_experience` we obtain $\mu^{targ}(h_t) \; \forall t>0$ ($t=0$ is not used in the update, so it doesn't matter).

In [4]:
targeted_experience = actor_target.process_sequence_of_experiences(experiences)
print(targeted_experience)
print(targeted_experience.shape)


[[ 0.03453873  0.         -0.08791127  0.          0.          1.        ]
 [-0.25866753  0.         -0.08791127  0.          0.5         1.        ]
 [-0.24178177  0.         -0.08791127  0.          0.5         1.        ]
 [-0.2672065   0.         -0.08791127  0.          0.          0.        ]
 [-0.1888024   0.         -0.08791127  0.          0.5         1.        ]
 [-0.0119067   0.         -0.08791127  0.          0.          0.        ]
 [ 0.0838516   0.         -0.08791127  0.          0.          0.        ]
 [ 0.09396052  0.         -0.08791127  0.          0.5         0.        ]
 [-0.18203147  0.         -0.08791127  0.          0.          1.        ]
 [-0.2001634   0.         -0.08791127  0.          0.          1.        ]
 [-0.27773562  0.         -0.08791127  0.          0.          0.        ]
 [-0.06698474  0.         -0.08791127  0.          0.          0.        ]
 [-0.29122508  1.         -0.08966627  1.          0.5         1.        ]
 [-0.02650449  0.        

In [21]:
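def process_sequence_of_experiences(actor, experiences):
    """Replace the recorded actions by the ones `actor` takes after each outcome."""
    export = experiences.copy()
    actor.lstm.stateful=True  # keep the LSTM state between consecutive time-steps
    # Odd columns hold the outcomes: loop from the first outcome to the last one
    # that is followed by an action (the final outcome is followed by the guess).
    for index in range(1,2*actor.dolinar_layers-1,2):
        export[:,index+1] = np.squeeze(actor(np.reshape(np.array(export[:,index]),
                                                             (experiences.shape[0],1,1))))
    actor.lstm.stateful=False
    return export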


@tf.function
def process_sequence_of_experiences_tf(actor, experiences):
    unstacked_exp = tf.unstack(tf.convert_to_tensor(experiences), axis=1)
    to_stack = []
    for index in range(2*actor.dolinar_layers-1): # I consider from first outcome to last one (but guess)
        if (index==0):
            to_stack.append(unstacked_exp[index])
        if (index%2 == 1):
            to_stack.append(unstacked_exp[index])
            
            to_stack.append(tf.squeeze(actor(tf.reshape(unstacked_exp[index],(experiences.shape[0],1,1)))))
    for index in range(2*actor.dolinar_layers-1, 2*actor.dolinar_layers+2):
        to_stack.append(unstacked_exp[index])
        
    return tf.stack(to_stack, axis=1)

In [23]:
n = process_sequence_of_experiences_tf(actor_target,experiences)

In [25]:
n.numpy() - targeted_experience

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]], dtype=float32)

`process_sequence` is a function that separates the histories and the rewards into two different arrays.

In [26]:
def process_sequence(critic,sample_buffer):
    if isinstance(sample_buffer, tf.Tensor):
        sample_buffer=sample_buffer.numpy()
    rr = np.ones(sample_buffer.shape)*critic.pad_value
    rr[:,1:] = sample_buffer[:,:-1]
    rr = np.reshape(rr, (sample_buffer.shape[0],critic.dolinar_layers+1,2))
    rewards_obtained = np.zeros((sample_buffer.shape[0], critic.dolinar_layers+1))
    rewards_obtained[:,-1] = sample_buffer[:,-1]
    return rr, rewards_obtained
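
To see what `process_sequence` returns, it helps to run it on a single made-up row. The sketch below copies the NumPy part of the function and uses a stand-in object instead of the real critic; the value `pad_value=-7.0` is invented for readability, and reading the columns as (action, outcome, action, outcome, guess, reward) is an interpretation of the printed buffer, not something stated by the code.

```python
import numpy as np
from types import SimpleNamespace


def process_sequence(critic, sample_buffer):
    # Shift everything one column to the right and pad the first slot (dummy o_0).
    rr = np.ones(sample_buffer.shape) * critic.pad_value
    rr[:, 1:] = sample_buffer[:, :-1]
    # Group the columns in (observation, action) pairs, one pair per time-step.
    rr = np.reshape(rr, (sample_buffer.shape[0], critic.dolinar_layers + 1, 2))
    # Only the last time-step carries a reward.
    rewards_obtained = np.zeros((sample_buffer.shape[0], critic.dolinar_layers + 1))
    rewards_obtained[:, -1] = sample_buffer[:, -1]
    return rr, rewards_obtained


# Stand-in for the critic: only the two attributes the function reads.
toy_critic = SimpleNamespace(pad_value=-7.0, dolinar_layers=2)

row = np.array([[-0.25, 0.0, -0.19, 1.0, 0.5, 1.0]])
histories, rewards = process_sequence(toy_critic, row)
print(histories)  # pairs: (pad, -0.25), (0, -0.19), (1, 0.5)
print(rewards)    # zeros except for the final reward
```

Each time-step is therefore an (observation, action) pair, with the padding value playing the role of the dummy observation $o_0$ and the guess acting as the last action.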

In [58]:
@tf.function
def process_sequence_tf(critic, sample_buffer):
    exps = tf.convert_to_tensor(sample_buffer)
    onns = tf.multiply(critic.pad_value,tf.ones((sample_buffer.shape[0],1)))
    s1 = tf.concat([onns,exps[:,:-1]], axis=1)
    s1 = tf.reshape(s1, (sample_buffer.shape[0],critic.dolinar_layers+1,2))
    rr = tf.concat([tf.zeros((sample_buffer.shape[0], critic.dolinar_layers)), tf.expand_dims(exps[:,-1],axis=1)], axis=1)
##############################################################################################
#     exps = tf.convert_to_tensor(sample_buffer)
#     s1 = tf.reshape(tf.concat([tf.multiply(critic.pad_value,tf.ones((sample_buffer.shape[0],1))),exps[:,:-1]], axis=1), (sample_buffer.shape[0],critic.dolinar_layers+1,2))
#     rr = tf.concat([tf.zeros((sample_buffer.shape[0], critic.dolinar_layers)), tf.expand_dims(exps[:,-1],axis=1)], axis=1)
    return s1, rr
 

In [53]:
%timeit process_sequence(critic,experiences)

8.54 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [59]:
%timeit process_sequence_tf(critic, experiences)

201 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


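OK, the `tf.function` version is useless here: it is more than 20 times slower than the NumPy one (201 µs vs 8.54 µs per call).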

In [54]:
s0, r0 = process_sequence(critic,experiences)

In [55]:
s1, r1 = process_sequence_tf(critic, experiences)
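
Rather than inspecting the two results by eye, a small helper can compare them. This is a sketch with a hypothetical name, `max_abs_diff`; it relies on `np.asarray`, which is expected to accept eager TensorFlow tensors as well as NumPy arrays, but it is only exercised here on NumPy inputs.

```python
import numpy as np


def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two array-likes of equal shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return float(np.max(np.abs(a - b)))


# Toy check with two arrays that differ in a single entry.
x = np.zeros((2, 3))
y = np.zeros((2, 3))
y[1, 2] = 1e-3
print(max_abs_diff(x, x), max_abs_diff(x, y))
```

In the notebook this would be called as `max_abs_diff(s0, s1)` and `max_abs_diff(r0, r1)`; raising on a shape mismatch avoids a silent broadcast hiding a real difference.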

In [62]:
sequences, zeroed_rews = process_sequence(critic,experiences)

In [None]:
labels_critic = critic_target.give_td_error_Kennedy_guess( sequences, zeroed_rews)

In [None]:
def give_td_error_Kennedy_guess(critic,batched_input,sequential_rews_with_zeros):
    """Build the critic labels from the target network (must be called on the target)."""
    if critic.nature != "target":
        raise AttributeError("I'm not the target!")
    b = batched_input.copy()
    ll = sequential_rews_with_zeros.copy()
    # The label at step k is the target prediction at step k+1 (plus the reward, zero here).
    for k in range(critic.dolinar_layers):
        ll[:,k] = np.squeeze(critic(b))[:,k+1] + ll[:,k]

    # Evaluate the target critic for every possible guess at the last time-step.
    b[:,-1][:,-1] = 0.
    all_preds = critic(b)
    for phase in np.arange(1,critic.number_phases)/critic.number_phases:
        b[:,-1][:,-1] = phase
        all_preds = tf.concat([all_preds,critic(b)],2)

    maxs = np.squeeze(tf.math.reduce_max(all_preds,axis=2).numpy())
    ll[:,-2] = maxs[:,-1] # This is the last step before the guess, so the label is max_g Q(h_L, g)
    ll = np.expand_dims(ll,axis=1)
    return ll
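
The label construction can be followed by hand on one episode if the recurrent target critic is replaced by a simple stand-in. In the sketch below `toy_target_critic` is purely hypothetical (it is not recurrent and just returns observation + 2·action at every step), and `td_labels` mirrors the logic of `give_td_error_Kennedy_guess` without the final `expand_dims`.

```python
import numpy as np


def toy_target_critic(batch):
    """Hypothetical stand-in for the target critic: (batch, T, 2) -> (batch, T)."""
    return batch[:, :, 0] + 2.0 * batch[:, :, 1]


def td_labels(critic_fn, histories, rewards, number_phases):
    labels = rewards.astype(np.float64).copy()
    q = critic_fn(histories)
    # Label at step k is the target prediction at step k+1 (intermediate rewards are zero).
    labels[:, :-1] += q[:, 1:]
    # At the last step before the guess, maximise the target critic over all guesses.
    candidates = []
    for phase in np.arange(number_phases) / number_phases:
        h = histories.copy()
        h[:, -1, -1] = phase  # overwrite the guess with this candidate
        candidates.append(critic_fn(h)[:, -1])
    labels[:, -2] = np.max(np.stack(candidates, axis=1), axis=1)
    return labels


# One episode: (pad, a0), (o1, a1), (o2, guess), with reward 1 at the end.
histories = np.array([[[-7.0, -0.25], [0.0, -0.19], [1.0, 0.0]]])
rewards = np.array([[0.0, 0.0, 1.0]])

labels = td_labels(toy_target_critic, histories, rewards, number_phases=2)
print(labels)
```

For this episode the first label is the stand-in critic at the second step (0 + 2·(−0.19)), the second label is the maximum over the guesses 0 and 0.5 (1 + 2·0.5), and the last label is the reward itself.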

In [None]:
%timeit  critic_target.give_td_error_Kennedy_guess( sequences, zeroed_rews)

In [66]:
@tf.function  # NOTE: this version fails with an error when concatenating (see the traceback below)
def give_td_error_Kennedy_guess_tf(critic,batched_input,sequential_rews_with_zeros):
    if critic.nature != "target":
        raise AttributeError("I'm not the target!")
    sequential_rews_with_zeros = zeroed_rews

    final_rews = tf.reshape(sequential_rews_with_zeros[:,-1], (batched_input.shape[0],1,1))
    bellman_tds_noguess = critic(batched_input)[:,1:-1,:]

    phases = tf.range(critic.number_phases, dtype=np.float32)/critic.number_phases

    unstacked = tf.unstack(tf.convert_to_tensor(batched_input))
    phases_concs = {}
    for ph in range(critic.number_phases):
        phases_concs[str(ph)] = []
    stacked = {}

    for episode in unstacked:
        prefinal = episode[:-1]
        for ph in range(critic.number_phases):
            final = tf.expand_dims(tf.concat([tf.unstack(episode[-1])[0], phases[ph]], axis=0), axis=0)    
            phases_concs[str(ph)].append(tf.concat([prefinal, final], axis=0))
    #   
        for ph in range(critic.number_phases):
            stacked[str(ph)] = tf.stack(phases_concs[str(ph)], axis=0)

    all_preds = tf.concat([critic(stacked[str(ph)]) for ph in range(critic.number_phases)], axis=2) 
    maxs = tf.math.reduce_max(all_preds,axis=2)[:,-1]
    bellman_td = tf.concat([tf.reshape(bellman_tds_noguess,(batched_input.shape[0],critic.dolinar_layers-1)), tf.reshape(maxs,(batched_input.shape[0],1))], axis=1)
    return tf.concat([bellman_td, tf.reshape(sequential_rews_with_zeros[:,-1].astype(np.float32), (batched_input.shape[0],1))], axis=1)



In [67]:
%timeit give_td_error_Kennedy_guess_tf(critic_target,sequences.astype(np.float32), zeroed_rews.astype(np.float32))

Tensor("strided_slice_2:0", shape=(), dtype=float32)


ValueError: in user code:

    <ipython-input-66-c560455fa163>:22 give_td_error_Kennedy_guess_tf  *
        final = tf.expand_dims(tf.concat([tf.unstack(episode[-1])[0], phases[ph]], axis=0), axis=0)
    /home/cooper-cooper/.local/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py:180 wrapper  **
        return target(*args, **kwargs)
    /home/cooper-cooper/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py:1606 concat
        return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
    /home/cooper-cooper/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py:1189 concat_v2
        "ConcatV2", values=values, axis=axis, name=name)
    /home/cooper-cooper/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:744 _apply_op_helper
        attrs=attr_protos, op_def=op_def)
    /home/cooper-cooper/.local/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py:595 _create_op_internal
        compute_device)
    /home/cooper-cooper/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py:3327 _create_op_internal
        op_def=op_def)
    /home/cooper-cooper/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py:1817 __init__
        control_input_ops, op_def)
    /home/cooper-cooper/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py:1657 _create_c_op
        raise ValueError(str(e))

    ValueError: Can't concatenate scalars (use tf.stack instead) for '{{node concat}} = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32](unstack_1, strided_slice_4, concat/axis)' with input shapes: [], [], [].
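
The message points at the cause: both arguments of the inner `tf.concat` are scalars (shape `[]`), and concatenation needs an existing axis, while stacking creates one. In the TensorFlow version the failing line would presumably become `final = tf.expand_dims(tf.stack([tf.unstack(episode[-1])[0], phases[ph]], axis=0), axis=0)`, though that is untested here and other issues may remain in the function. NumPy follows the same rule, so the idea can be checked without building a graph; the helper name `with_guess` below is made up for the illustration.

```python
import numpy as np


def with_guess(episode, phase):
    """Copy of one episode (T, 2) whose last action is replaced by `phase`."""
    last_obs = episode[-1, 0]             # a scalar, like tf.unstack(episode[-1])[0]
    final = np.stack([last_obs, phase])   # stacking two scalars gives shape (2,)
    # Now both pieces are 2-D, so concatenating along the time axis is fine.
    return np.concatenate([episode[:-1], final[None, :]], axis=0)


episode = np.array([[-7.0, -0.25], [0.0, -0.19], [1.0, 0.5]])
replaced = with_guess(episode, 0.0)
print(replaced)

# Concatenating the same two scalars directly fails, as in the traceback.
try:
    np.concatenate([episode[-1, 0], 0.5])
    concat_failed = False
except ValueError:
    concat_failed = True
print(concat_failed)
```

The NumPy implementation above avoids the issue altogether by assigning the guess in place, which is another reason the eager version is simpler to keep.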


OK, optimization begins!

In [None]:
experiences = experiences.astype(np.float32)
targeted_experience = actor_target.process_sequence_of_experiences(experiences)
sequences, zeroed_rews = critic_target.process_sequence(targeted_experience)
labels_critic = critic_target.give_td_error_Kennedy_guess( sequences, zeroed_rews)

In [None]:
with tf.GradientTape() as tape:
    # Trainable variables are watched automatically, so no explicit tape.watch is needed.
    preds_critic = critic(sequences)
    loss_critic = tf.keras.losses.MSE(labels_critic, preds_critic)
    loss_critic = tf.reduce_mean(loss_critic)
# Compute and apply the gradients outside the tape context.
grads = tape.gradient(loss_critic, critic.trainable_variables)
optimizer_critic.apply_gradients(zip(grads, critic.trainable_variables))
loss_critic = np.squeeze(loss_critic.numpy())

In [None]:
preds_critic

In [None]:
labels_critic_1 = labels_critic.reshape((5,3,1))
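
Reshaping the labels matters because a mean squared error between arrays of different but broadcastable shapes does not fail, it silently compares every label with every prediction. The sketch below assumes, as the reshape to `(5,3,1)` suggests, that the critic returns predictions of shape `(batch, T, 1)`; the helper `checked_mse` is hypothetical and uses NumPy only.

```python
import numpy as np


def checked_mse(labels, preds):
    """Mean squared error that refuses to broadcast mismatched shapes."""
    labels = np.asarray(labels, dtype=np.float64)
    preds = np.asarray(preds, dtype=np.float64)
    if labels.shape != preds.shape:
        raise ValueError(f"shape mismatch: labels {labels.shape} vs preds {preds.shape}")
    return float(np.mean((labels - preds) ** 2))


batch, steps = 5, 3
preds = np.zeros((batch, steps, 1))   # assumed critic output shape
labels_flat = np.ones((batch, steps))

# Labels expanded on axis 1 broadcast against the predictions without any error.
broadcast_shape = (np.expand_dims(labels_flat, axis=1) - preds).shape
print(broadcast_shape)

# Labels expanded on the last axis line up element by element.
aligned_loss = checked_mse(np.expand_dims(labels_flat, axis=2), preds)
print(aligned_loss)
```

If the real critic output has a different layout, the same check still applies: compare `labels_critic.shape` with `preds_critic.shape` before computing the loss.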