# Laboratory 7

The goal of the seventh laboratory is to study and implement a deep reinforcement learning algorithm, Actor-Critic. The implemented algorithm will be tested in the OpenAI Gym *CartPole* environment.


Importing the standard libraries

In [1]:
from collections import deque
import gym
import numpy as np
import random

Importing the neural network libraries

In [2]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Input, Activation
from tensorflow.keras.optimizers import Adam
from tensorflow.math import log
from tensorflow.math import reduce_sum
from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()

## Task 1 - Actor-Critic

<p style='text-align: justify;'>
The goal of this exercise is to implement the Actor-Critic algorithm. To do so, create two deep neural networks:
    1. *actor* - the network that learns the optimal policy (similar to the one from laboratory 6),
    2. *critic* - the network that learns the state-value function (similarly to DQN).
The weights of the *actor* network are updated according to:
\begin{equation*}
    \theta \leftarrow \theta + \alpha \delta_t \nabla_\theta \log \pi_{\theta}(a_t \mid s_t).
\end{equation*}
The weights of the *critic* network are updated according to:
\begin{equation*}
    w \leftarrow w + \beta \delta_t \nabla_w\upsilon(s_t, w),
\end{equation*}
where:
\begin{equation*}
    \delta_t \leftarrow r_t + \gamma \upsilon(s_{t + 1}, w) - \upsilon(s_t, w).
\end{equation*}
</p>
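
The temporal-difference error defined above can be checked on a single transition. The snippet below is a minimal sketch in plain NumPy-free Python; the helper name `td_target_and_delta` and the numeric values are hypothetical and chosen only for illustration, while `gamma = 0.99` matches the discount rate used by the agent in this notebook.

```python
def td_target_and_delta(reward, value_s, value_next, gamma=0.99, done=False):
    """Return (target, delta) for one transition (hypothetical helper)."""
    # On a terminal transition the bootstrap term gamma * v(s') is dropped.
    target = reward + gamma * value_next * (1.0 - float(done))
    # delta is the TD error: how much better the outcome was than predicted.
    delta = target - value_s
    return target, delta


# Illustrative values only: v(s) = 10.0, v(s') = 10.5, r = 1.0.
target, delta = td_target_and_delta(reward=1.0, value_s=10.0, value_next=10.5)
terminal_target, terminal_delta = td_target_and_delta(
    reward=1.0, value_s=10.0, value_next=10.5, done=True
)
```

A positive `delta` makes the sampled action more likely, and a negative one makes it less likely; the same `delta` also drives the critic towards `target`.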

In [3]:
class ActorCriticAgent:
    def __init__(self, action_size, policy_model, actor, critic):
        self.action_size = action_size
        self.gamma = 0.99    # discount rate
        self.action_space = [i for i in range(action_size)]
        self.policy_model = policy_model
        self.actor = actor
        self.critic = critic  # the critic network should have exactly one output

    def get_action(self, state):
        """
        Compute the action to take in the current state, basing on policy returned by the network.

        Note: To pick action according to the probability generated by the network
        """
        #
        # INSERT CODE HERE to get action in a given state
        #        
        state = state[np.newaxis, :]
        chosen_action = np.random.choice(self.action_space, p=self.policy_model.predict(state)[0])   

        return chosen_action

    def learn(self, state, action, reward, next_state, done):
        """
        Function learn networks using information about state, action, reward and next state. 
        First the values for state and next_state should be estimated based on output of critic network.
        Critic network should be trained based on target value:
        target = r + \gamma next_state_value if not done]
        target = r if done.
        Actor network shpuld be trained based on delta value:
        delta = target - state_value
        """
        #
        # INSERT CODE HERE to train network
        #
        state = state[np.newaxis, :]
        next_state = next_state[np.newaxis, :]
        critic_value_next_state = self.critic.predict(next_state)
        critic_value = self.critic.predict(state)

        target = reward + self.gamma * critic_value_next_state * (1 - int(done))
        delta = target - critic_value

        actions = np.zeros([1, self.action_size])
        actions[np.arange(1), action] = 1

        self.actor.fit([state, delta], actions, verbose=0)
        self.critic.fit(state, target, verbose=0)        
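
The two NumPy idioms used by the agent, sampling an action from the policy probabilities and one-hot encoding the chosen action, can be tried in isolation. This is a sketch under stated assumptions: the probability vector `probs` is a made-up policy output for two actions, and the seeded generator `rng` is used here only to make the example repeatable (the agent itself calls `np.random.choice`).

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded only for repeatability
probs = np.array([0.8, 0.2])            # hypothetical policy output, 2 actions

# Sampling many actions: empirical frequencies should approach probs.
samples = rng.choice(len(probs), size=10_000, p=probs)
freq = np.bincount(samples, minlength=len(probs)) / samples.size

# One-hot encoding of a chosen action, shaped (1, action_size) like a batch of one.
action = 1
one_hot = np.zeros((1, len(probs)))
one_hot[0, action] = 1.0
```

Sampling, rather than taking the arg-max, keeps the policy stochastic, which is what provides exploration in this algorithm.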

It is time to prepare the network model that will learn to act in the [*CartPole*](https://gym.openai.com/envs/CartPole-v0/) environment:

In [4]:
env = gym.make("CartPole-v0").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
alpha_learning_rate = 0.0001
beta_learning_rate = 0.0005

# Shared input and hidden layer; the actor and critic heads branch from x1.
state_input = Input(shape=(state_size,))
delta_input = Input(shape=[1])  # TD error, fed in only to weight the actor loss
x1 = Dense(64, activation='relu')(state_input)
probs = Dense(action_size, activation='softmax')(x1)
values = Dense(1, activation='linear')(x1)

def custom_loss(y_true, y_pred):
    # y_true is the one-hot chosen action, so this is -delta * log pi(a | s).
    log_lik = y_true * log(y_pred)
    return reduce_sum(-log_lik * delta_input)

# Prediction-only model used to sample actions.
policy_model = Model(inputs=[state_input], outputs=[probs])

#
# INSERT CODE HERE to build actor network, in the last layer use softmax activation function
#
actor_model = Model(inputs=[state_input, delta_input], outputs=[probs])
actor_model.compile(optimizer=Adam(learning_rate=alpha_learning_rate), loss=custom_loss)

#
# INSERT CODE HERE to build critic network, in the last layer use linear activation function, network should have single output
#
critic_model = Model(inputs=[state_input], outputs=[values])
critic_model.compile(optimizer=Adam(learning_rate=beta_learning_rate), loss='mean_squared_error')
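
To see what `custom_loss` computes, the same expression can be evaluated by hand with NumPy. The function `policy_gradient_loss` below is a hypothetical stand-in written for illustration, not the Keras loss itself; the clipping with `eps` is an added assumption that guards against `log(0)` and is not present in the notebook's loss.

```python
import numpy as np

def policy_gradient_loss(y_true, y_pred, delta, eps=1e-8):
    """NumPy mirror of the actor loss: sum(-y_true * log(y_pred) * delta)."""
    y_pred = np.clip(y_pred, eps, 1.0)  # assumption: avoid log(0)
    return float(np.sum(-y_true * np.log(y_pred) * delta))


probs = np.array([[0.25, 0.75]])   # hypothetical softmax output
one_hot = np.array([[0.0, 1.0]])   # action 1 was taken
delta = np.array([[2.0]])          # hypothetical TD error

loss = policy_gradient_loss(one_hot, probs, delta)
loss_negative = policy_gradient_loss(one_hot, probs, -delta)
```

Because the one-hot vector zeroes out every action except the one taken, the loss reduces to `-delta * log(pi(a | s))`; minimizing it raises the probability of the action when `delta` is positive and lowers it when `delta` is negative.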




It is time to train the agent to play in the *CartPole* environment:

In [5]:
agent = ActorCriticAgent(action_size, policy_model, actor_model, critic_model)


for epoch in range(100):
    score_history = []

    # Each epoch runs 100 episodes and reports their mean score.
    for episode in range(100):
        done = False
        score = 0
        state = env.reset()
        while not done:
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            agent.learn(state, action, reward, next_state, done)
            state = next_state
            score += reward
        score_history.append(score)

    mean_score = np.mean(score_history)
    print("mean reward:%.3f" % mean_score)

    if mean_score > 300:
        print("You Win!")
        break



mean reward:18.540
mean reward:20.160
mean reward:21.520
mean reward:31.130
mean reward:43.740
mean reward:51.200
mean reward:66.480
mean reward:135.130
mean reward:155.290
mean reward:207.310
mean reward:131.190
mean reward:138.320
mean reward:165.600
mean reward:119.780
mean reward:168.280
mean reward:104.000
mean reward:138.560
mean reward:124.490
mean reward:160.210
mean reward:157.270
mean reward:165.310
mean reward:534.780
You Win!
