# Laboratory 7

The goal of the seventh laboratory is to get familiar with and implement a deep reinforcement learning algorithm: Actor-Critic. The implemented algorithm is tested in the OpenAI Gym *CartPole* environment.


Importing the standard libraries

In [2]:
from collections import deque
import gym
import numpy as np
import random

In [3]:
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers as layers

In [4]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  0


The two cells above import the neural network libraries (TensorFlow and Keras) and report the number of available GPUs.

## Task 1 - Actor-Critic

<p style='text-align: justify;'>
The goal of this exercise is to implement the Actor-Critic algorithm. To do so, create two deep neural networks:
    1. *actor* - a network that learns the optimal policy (similar to the one from laboratory 6),
    2. *critic* - a network that learns the state-value function (trained similarly to a DQN).
The weights of the *actor* network are updated according to:
\begin{equation*}
    \theta \leftarrow \theta + \alpha \delta_t \nabla_\theta \log \pi_{\theta}(a_t | s_t).
\end{equation*}
The weights of the *critic* network are updated according to:
\begin{equation*}
    w \leftarrow w + \beta \delta_t \nabla_w\upsilon(s_t, w),
\end{equation*}
where $\delta_t$ is the temporal-difference error:
\begin{equation*}
    \delta_t \leftarrow r_t + \gamma \upsilon(s_{t + 1}, w) - \upsilon(s_t, w).
\end{equation*}
Here $\alpha$ and $\beta$ are the learning rates of the actor and the critic, and $\gamma$ is the discount factor.
</p>
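The update rules above can be traced by hand on a toy problem. The following is a minimal sketch in pure Python, assuming a linear critic $\upsilon(s, w) = w \cdot x(s)$ and a softmax actor with one weight vector per action; the helper names (`td_error`, `actor_critic_step`) and the feature vectors are hypothetical and are not part of the lab code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def td_error(reward, value, next_value, gamma, done):
    """delta_t = r_t + gamma * v(s_{t+1}) - v(s_t); no bootstrap at terminal states."""
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value

def actor_critic_step(theta, w, x, x_next, action, reward, done,
                      alpha=0.1, beta=0.1, gamma=0.99):
    """One actor-critic update for a linear critic and a linear-softmax actor."""
    delta = td_error(reward, dot(w, x), dot(w, x_next), gamma, done)
    pi = softmax([dot(row, x) for row in theta])
    # Critic: w <- w + beta * delta * grad_w v(s_t, w); for a linear critic the gradient is x.
    new_w = [wi + beta * delta * xi for wi, xi in zip(w, x)]
    # Actor: theta <- theta + alpha * delta * grad_theta log pi(a_t | s_t);
    # for a softmax over linear logits the gradient for row b is (1[a == b] - pi[b]) * x.
    new_theta = [
        [t + alpha * delta * ((1.0 if b == action else 0.0) - pi[b]) * xi
         for t, xi in zip(row, x)]
        for b, row in enumerate(theta)
    ]
    return new_theta, new_w, delta

# Toy transition: two features, two actions, all weights start at zero.
theta, w, delta = actor_critic_step(
    theta=[[0.0, 0.0], [0.0, 0.0]], w=[0.0, 0.0],
    x=[1.0, 0.0], x_next=[0.0, 1.0], action=0, reward=1.0, done=False,
)
print(delta, w, theta)
```

With a positive $\delta_t$ the critic raises its estimate for the visited state and the actor raises the probability of the action that was taken; a negative $\delta_t$ does the opposite.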

In [5]:
from tensorflow.keras.optimizers import Adam

class REINFORCEAgent:
    def __init__(self, state_size, action_size, actor, critic):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = 0.99    # discount rate
        self.actor = actor
        self.critic = critic #critic network should have only one output
        self.actor_optimizer = Adam(learning_rate=0.0001)
        self.critic_optimizer = Adam(learning_rate=0.0005)


    def get_action(self, state):
        """
        Choose the action to take in the given state, based on the policy returned by the actor network.

        Note: the action is sampled according to the probabilities produced by the network.
        """
        # Run the actor on a single-state batch and sample an action index,
        # using the predicted probabilities as weights.
        predictions = self.actor.predict_on_batch(np.array([state]))[0]
        return random.choices(range(len(predictions)), weights=predictions)[0]


    def learn(self, state, action, reward, next_state, done):
        """
        Train both networks using the state, action, reward, and next state.
        First, the values of state and next_state are estimated from the output of the critic network.
        The critic network is trained toward the target value:
        target = r + gamma * next_state_value if not done
        target = r if done.
        The actor network is trained using the delta value:
        delta = target - state_value
        """
        # Evaluate the critic on both states in one batch, then compute the TD error (delta).
        critic_response = self.critic.predict_on_batch(np.array([state, next_state]))
        delta = reward + self.gamma * (critic_response[1][0] if not done else 0) - critic_response[0][0]
        self.__update_critic(delta, state)
        self.__update_actor(delta, state, action)
        
    def __update_critic(self, delta, state):
        with tf.GradientTape() as tape:
            predictions = self.critic(np.array([state]))[0]
            # delta is a constant here, so minimizing this loss moves w along delta * grad_w v(s_t, w).
            loss = - tf.multiply(delta, predictions)

        gradients = tape.gradient(loss, self.critic.trainable_variables)
        self.critic_optimizer.apply_gradients(zip(gradients, self.critic.trainable_variables))

    def __update_actor(self, delta, state, action):
        with tf.GradientTape() as tape:
            policy = self.actor(np.array([state]))[0]
            # Minimizing -delta * log pi(a_t | s_t) moves theta along delta * grad_theta log pi(a_t | s_t).
            loss = - delta * tf.math.log(policy[action])

        gradients = tape.gradient(loss, self.actor.trainable_variables)
        self.actor_optimizer.apply_gradients(zip(gradients, self.actor.trainable_variables))
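The `get_action` method samples an action rather than taking the most probable one, which keeps the agent exploring. The sampling step can be checked in isolation with the standard library alone. This is a small sketch with hypothetical helper names (`sample_action`, `empirical_frequencies`) and a hand-written probability vector standing in for the actor's output.

```python
import random
from collections import Counter

def sample_action(probabilities, rng=random):
    """Sample an action index, using the given probabilities as weights."""
    return rng.choices(range(len(probabilities)), weights=probabilities)[0]

def empirical_frequencies(probabilities, n_samples, seed=0):
    """Estimate how often each action is chosen over n_samples draws."""
    rng = random.Random(seed)  # seeded generator so the check is repeatable
    counts = Counter(sample_action(probabilities, rng) for _ in range(n_samples))
    return [counts[a] / n_samples for a in range(len(probabilities))]

# Stand-in for a softmax output over two actions.
freqs = empirical_frequencies([0.2, 0.8], n_samples=10_000)
print(freqs)
```

Over many draws the observed frequencies should approach the probabilities passed in, which is the behaviour the policy-gradient update relies on.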

Time to prepare the network models that will learn to act in the [*CartPole*](https://gym.openai.com/envs/CartPole-v0/) environment:

In [6]:
env = gym.make("CartPole-v0").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

In [7]:
actor_model = keras.models.Sequential([
    layers.InputLayer((state_size,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(action_size, activation='softmax')
])

In [8]:
critic_model = keras.models.Sequential([
    layers.InputLayer((state_size,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(1)
])
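To get a feel for the size of the two networks, the number of trainable parameters can be computed by hand: a dense layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases. The sketch below assumes the usual CartPole dimensions (4 state variables, 2 actions); the helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

def mlp_params(layer_sizes):
    """Total parameters of a stack of dense layers, given input size and unit counts."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

state_size, action_size = 4, 2  # assumed CartPole dimensions
actor_params = mlp_params([state_size, 64, 64, 64, action_size])
critic_params = mlp_params([state_size, 64, 64, 32, 1])
print(actor_params, critic_params)  # 8770 6593
```

Under the same assumption, these totals can be compared with the parameter counts that Keras reports in `actor_model.summary()` and `critic_model.summary()`.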

Time to train the agent in the *CartPole* environment:

In [9]:
agent = REINFORCEAgent(state_size, action_size, actor_model, critic_model)

# Train in blocks of 100 episodes; stop once a block's mean score exceeds 300.
for epoch in range(100):
    score_history = []

    for episode in range(100):
        done = False
        score = 0
        state = env.reset()
        while not done:
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            agent.learn(state, action, reward, next_state, done)
            state = next_state
            score += reward
        score_history.append(score)

    mean_score = np.mean(score_history)
    print("mean reward:%.3f" % mean_score)

    if mean_score > 300:
        print("You Win!")
        break

mean reward:18.420
mean reward:35.390
mean reward:116.940
mean reward:363.910
You Win!
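The training loop above uses the classic Gym API, where `env.reset()` returns the observation and `env.step()` returns four values. In Gym 0.26 and later, and in Gymnasium, `reset()` returns `(observation, info)` and `step()` returns `(observation, reward, terminated, truncated, info)`. A small adapter, sketched here with hypothetical helper names, would let the same loop run against either form.

```python
def unpack_reset(result):
    """Return only the observation from env.reset(), for either API."""
    # Newer API: (observation, info) where info is a dict; classic API: observation alone.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result

def unpack_step(result):
    """Return (next_state, reward, done, info) from env.step(), for either API."""
    if len(result) == 5:
        next_state, reward, terminated, truncated, info = result
        # The episode loop ends on either flag. For the TD target it may be
        # preferable to pass only `terminated` to the agent, since a truncated
        # episode did not reach a true terminal state.
        return next_state, reward, terminated or truncated, info
    return result

# Hand-written tuples standing in for real environment outputs.
print(unpack_step(("s1", 1.0, False, True, {})))
print(unpack_reset(("s0", {})))
```

In the loop this would read `state = unpack_reset(env.reset())` and `next_state, reward, done, _ = unpack_step(env.step(action))`.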
