<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Zadanie-1---Actor-Critic" data-toc-modified-id="Zadanie-1---Actor-Critic-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Task 1 - Actor-Critic</a></span></li></ul></div>

# Laboratory 7

The goal of the seventh laboratory is to become familiar with and implement a deep reinforcement learning algorithm, Actor-Critic. The implemented algorithm will be tested using the OpenAI Gym *CartPole* environment.


Import the standard libraries

In [1]:
from collections import deque
import gym
import numpy as np
import random

Import the neural network libraries

In [2]:
from keras import backend as K  # same package as the layers and models below
from keras.layers import Dense, Input
from keras.optimizers import Adam
from keras.models import Model

Using TensorFlow backend.


## Task 1 - Actor-Critic

<p style='text-align: justify;'>
The goal of this exercise is to implement the Actor-Critic algorithm. To do so, create two deep neural networks:
    1. *actor* - a network that learns the optimal policy (similar to the one from laboratory 6),
    2. *critic* - a network that learns the state-value function (similar to DQN).
The weights of the *actor* network are updated according to:
\begin{equation*}
    \theta \leftarrow \theta + \alpha \delta_t \nabla_\theta \log \pi_{\theta}(a_t \mid s_t).
\end{equation*}
The weights of the *critic* network are updated according to:
\begin{equation*}
    w \leftarrow w + \beta \delta_t \nabla_w \upsilon(s_t, w),
\end{equation*}
where:
\begin{equation*}
    \delta_t = r_t + \gamma \upsilon(s_{t + 1}, w) - \upsilon(s_t, w).
\end{equation*}
</p>
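
To make the TD error $\delta_t$ concrete, here is a minimal, dependency-free sketch that evaluates it for a single transition. The helper names `td_target` and `td_error` and the numeric values are hypothetical and chosen only for illustration; the discount `gamma=0.98` matches the value used by the agent in this notebook.

```python
def td_target(reward, next_value, done, gamma=0.98):
    """One-step TD target: r + gamma * v(s') for non-terminal steps, r otherwise."""
    return reward + (0.0 if done else gamma * next_value)


def td_error(reward, value, next_value, done, gamma=0.98):
    """TD error: delta = target - v(s)."""
    return td_target(reward, next_value, done, gamma) - value


# Non-terminal step: the critic (hypothetically) predicts 10.0 for s_t and 10.5 for s_{t+1}.
delta_mid = td_error(reward=1.0, value=10.0, next_value=10.5, done=False)

# Terminal step: the bootstrap term gamma * v(s_{t+1}) is dropped.
delta_end = td_error(reward=1.0, value=10.0, next_value=10.5, done=True)

print(delta_mid, delta_end)
```

A positive `delta` means the transition turned out better than the critic expected, so the actor update raises the probability of the chosen action; a negative `delta` lowers it.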

In [3]:
class ACagent:
    def __init__(self, action_size, actor, policy, critic):
        self.action_size = action_size
        self.gamma = 0.98    # discount rate
        self.actor = actor
        self.policy = policy
        self.critic = critic  # the critic network must have exactly one output
        self.possible_actions = list(range(action_size))
        
    def get_action(self, state):
        """
        Choose the action to take in the given state by sampling from the policy returned by the network.

        Note: the action is drawn at random according to the probabilities produced by the network.
        """
        action_probabilities = self.policy.predict(state[np.newaxis, :])[0]

        # Sample one of the possible actions according to the network's probabilities
        return np.random.choice(self.possible_actions, p=action_probabilities)


    def learn(self, state, action, reward, next_state, done):
        """
        Train both networks on a single transition (state, action, reward, next_state, done).
        First, the values of state and next_state are estimated from the output of the critic network.
        The critic network is trained on the target value:
        target = r + gamma * next_state_value   if not done
        target = r                              if done
        The actor network is trained on the delta value:
        delta = target - state_value
        """
        # Comply with batches
        state = state[np.newaxis, :]
        next_state = next_state[np.newaxis, :]
    
        # Calculate critic's opinion
        critic_val = self.critic.predict(state)
        critic_val_next = self.critic.predict(next_state)

        # TD target; the bootstrap term is dropped when the episode has ended
        target_val = reward + (not done) * self.gamma * critic_val_next
        
        # delta (TD error) - how much better the outcome was than the critic expected
        delta = target_val - critic_val

        # One-hot encode the selected action (a batch containing a single action)
        actions_one_hot = np.zeros([1, self.action_size])
        actions_one_hot[0, action] = 1
        
        # Train the actor to follow a good policy and the critic to estimate the target
        self.actor.fit([state, delta], actions_one_hot, verbose=0)  # verbose=0 silences per-step logs
        self.critic.fit(state, target_val, verbose=0)

Time to prepare the network model that will learn to act in the [*CartPole*](https://gym.openai.com/envs/CartPole-v0/) environment:

In [4]:
env = gym.make("CartPole-v0").env  # .env strips the TimeLimit wrapper, so episodes are not cut off at 200 steps
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
alpha_learning_rate = 0.00015  # actor learning rate
beta_learning_rate = 0.00055   # critic learning rate

input_layer = Input(shape=(state_size, ))  # state tensor
delta = Input(shape=[1])  # TD error, consumed by the actor's loss
# Hidden layers shared by the actor and critic heads
dense_1 = Dense(24, activation='relu')(input_layer)
dense_2 = Dense(24, activation='relu')(dense_1)
output_layer = Dense(action_size, activation='softmax')(dense_2)  # policy head
evaluation_layer = Dense(1, activation='linear')(dense_2)  # state-value head



In [5]:
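def gradient_ascend(true, pred):
    # `true` is the one-hot encoded action, `pred` is the policy output; `delta` is the Input tensor defined above
    log_lk = true * K.log(K.clip(pred, 1e-8, 1))  # clip to avoid log(0)
    # Minimizing -log_lk * delta is gradient ascent on the policy objective
    return K.mean(-log_lk * delta)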
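
As a rough illustration of what this loss evaluates to for one sample, the sketch below repeats the same arithmetic in plain Python. The function name `actor_loss` and the example probabilities are hypothetical; like the Keras version above, it averages over the action dimension, which only rescales the loss by a constant factor.

```python
import math


def actor_loss(action_probs, action_one_hot, delta, eps=1e-8):
    """Mean over actions of -one_hot * log(clipped prob) * delta, for one sample."""
    terms = []
    for taken, prob in zip(action_one_hot, action_probs):
        clipped = min(max(prob, eps), 1.0)  # same clipping idea as K.clip(pred, 1e-8, 1)
        terms.append(-taken * math.log(clipped) * delta)
    return sum(terms) / len(terms)


probs = [0.8, 0.2]   # hypothetical policy output for one state
one_hot = [1, 0]     # action 0 was taken

loss_pos = actor_loss(probs, one_hot, delta=2.0)    # outcome better than expected
loss_neg = actor_loss(probs, one_hot, delta=-2.0)   # outcome worse than expected
print(loss_pos, loss_neg)
```

With a positive `delta` the loss is positive and shrinks as the probability of the taken action grows, so minimizing it reinforces that action; with a negative `delta` the sign flips and minimizing the loss makes the action less likely.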

In [6]:
actor = Model(inputs=[input_layer, delta], outputs=[output_layer], name='Actor')
actor.compile(optimizer=Adam(lr=alpha_learning_rate), loss=gradient_ascend)

critic = Model(inputs=[input_layer], outputs=[evaluation_layer], name='Critic')
critic.compile(optimizer=Adam(lr=beta_learning_rate), loss='mean_squared_error')

policy = Model(inputs=[input_layer], outputs=[output_layer])

actor.summary()
critic.summary()

Model: "Actor"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 24)                120       
_________________________________________________________________
dense_2 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 50        
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
Model: "Critic"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 4)                 0         
_______________________________________________

Time to train the agent to play in the *CartPole* environment:

In [7]:
agent = ACagent(action_size=env.action_space.n, actor=actor, policy = policy, critic=critic)

In [8]:
for epoch in range(100):
    score_history = []

    # Play and learn from 100 episodes, then report their mean score
    for episode in range(100):
        done = False
        score = 0
        state = env.reset()
        while not done:
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            agent.learn(state, action, reward, next_state, done)
            state = next_state
            score += reward
        score_history.append(score)

    mean_score = np.mean(score_history)
    print("mean reward:%.3f" % mean_score)

    if mean_score > 300:
        print("You Win!")
        break

mean reward:13.380
mean reward:14.140
mean reward:27.000
mean reward:87.260
mean reward:174.890
mean reward:236.400
mean reward:435.390
You Win!


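To make training faster, I would definitely employ some form of transition buffer so that updates can be batched. As written, every environment step triggers a separate single-sample `fit` call on each network, so there is no real parallelization in the training process.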
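
One possible shape for such a buffer is sketched below, using only the standard library. It is an assumption about what a buffer could look like, not part of the notebook: the names `RolloutBuffer`, `batch_targets` and `constant_value_fn` are hypothetical, and the buffer is drained after every update so that only recently collected transitions are used (Actor-Critic is an on-policy method, so replaying old transitions would need extra correction).

```python
class RolloutBuffer:
    """Collects recent transitions and hands them over as one batch."""

    def __init__(self, batch_size=32):
        self.batch_size = batch_size
        self.items = []

    def add(self, state, action, reward, next_state, done):
        """Store one transition; return True once a full batch is available."""
        self.items.append((state, action, reward, next_state, done))
        return len(self.items) >= self.batch_size

    def drain(self):
        """Return all stored transitions and empty the buffer."""
        batch, self.items = self.items, []
        return batch


def batch_targets(batch, value_fn, gamma=0.98):
    """TD targets for a whole batch; value_fn maps a list of states to a list of values."""
    next_values = value_fn([transition[3] for transition in batch])
    return [
        reward + (0.0 if done else gamma * next_value)
        for (_, _, reward, _, done), next_value in zip(batch, next_values)
    ]


def constant_value_fn(states):
    """Stand-in for the critic: predicts 1.0 for every state."""
    return [1.0 for _ in states]


buffer = RolloutBuffer(batch_size=2)
buffer.add([0.0], 0, 1.0, [0.1], False)
ready = buffer.add([0.1], 1, 1.0, [0.2], True)
targets = batch_targets(buffer.drain(), constant_value_fn)
print(ready, targets)
```

In the notebook's setting, `value_fn` would presumably wrap one batched `critic.predict` call, and the drained batch would feed one `fit` call per network instead of one call per step.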