# Laboratory 6

The goal of the sixth laboratory is to study and implement REINFORCE, a deep reinforcement learning algorithm. The implemented algorithm will be tested on the *CartPole* environment from OpenAI Gym.


Import the standard libraries

In [1]:
from collections import deque
import gym
import numpy as np
import random
from functools import reduce

Import the neural network libraries

In [2]:
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers as layers

Prepare a function that computes the cumulative (discounted) reward:

In [3]:
def get_cumulative_rewards(rewards,  # rewards at each step
                           gamma=0.99  # discount for reward
                           ):
    """
    based on https://github.com/yandexdataschool/Practical_RL/blob/spring20/week06_policy_based/reinforce_tensorflow.ipynb
    take a list of immediate rewards r(s,a) for the whole session
    compute cumulative rewards R(s,a) (a.k.a. G(s,a) in Sutton '16)
    R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...

    The simple way to compute cumulative rewards is to iterate from last to first time tick
    and compute R_t = r_t + gamma*R_{t+1} recurrently

    You must return an array/list of cumulative rewards with as many elements as in the initial rewards.
    """
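    cumulative_rewards = []
    running_return = 0.0  # the return after the final step is zero

    # Walk backwards so that R_t = r_t + gamma * R_{t+1} needs a single pass
    for reward in reversed(rewards):
        running_return = reward + gamma * running_return
        cumulative_rewards.append(running_return)

    cumulative_rewards.reverse()  # restore chronological order
    return cumulative_rewards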


assert len(get_cumulative_rewards(range(100))) == 100
assert np.allclose(get_cumulative_rewards([0, 0, 1, 0, 0, 1, 0], gamma=0.9),
                   [1.40049, 1.5561, 1.729, 0.81, 0.9, 1.0, 0.0])
assert np.allclose(get_cumulative_rewards([0, 0, 1, -2, 3, -4, 0], gamma=0.5),
                   [0.0625, 0.125, 0.25, -1.5, 1.0, -4.0, 0.0])
assert np.allclose(get_cumulative_rewards([0, 0, 1, 2, 3, 4, 0], gamma=0), [0, 0, 1, 2, 3, 4, 0])
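The docstring above gives two equivalent definitions of the cumulative reward: the explicit sum and the backward recursion. The following is a minimal sketch that checks, on a tiny example, that both give the same values. The helper names `returns_direct` and `returns_recursive` are hypothetical and are not part of the lab code.

```python
def returns_direct(rewards, gamma):
    """Hypothetical helper: evaluate R_t = sum_k gamma^k * r_{t+k} term by term."""
    n = len(rewards)
    return [
        sum(gamma ** k * rewards[t + k] for k in range(n - t))
        for t in range(n)
    ]


def returns_recursive(rewards, gamma):
    """Hypothetical helper: evaluate R_t = r_t + gamma * R_{t+1}, last step first."""
    out = []
    running = 0.0  # the return after the final step is zero
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    out.reverse()  # back to chronological order
    return out


rewards = [1.0, 1.0, 1.0]
direct = returns_direct(rewards, gamma=0.5)
recursive = returns_recursive(rewards, gamma=0.5)
print(direct)     # [1.75, 1.5, 1.0]
print(recursive)  # [1.75, 1.5, 1.0]
```

Note that the return at the last step equals the last immediate reward, because nothing follows it.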

## Task 1 - REINFORCE

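<p style='text-align: justify;'>
The goal of this exercise is to implement the REINFORCE algorithm. The network weights are updated according to the formula:
\begin{equation*}
    \theta \leftarrow \theta + \alpha G_t \nabla_\theta \log \pi_{\theta}(a_t \mid s_t)
\end{equation*}
where $\alpha$ is the learning rate, $G_t$ is the discounted return from step $t$, and $\pi_{\theta}(a_t \mid s_t)$ is the probability that the policy assigns to action $a_t$ in state $s_t$.
</p>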
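To make the update rule concrete, here is a small sketch in plain Python. It assumes a deliberately simplified policy: a softmax over one logit per action that ignores the state, for which the gradient of the log-probability is known in closed form. The names `softmax` and `reinforce_step` are illustrative assumptions and do not replace the Keras network used in this lab.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(theta, action, G, alpha):
    """One update theta <- theta + alpha * G * grad log pi(action).

    Assumes a state-independent softmax policy with one logit per action,
    for which d/d theta_i log pi(a) = 1[i == a] - pi_i.
    """
    pi = softmax(theta)
    grad_log_pi = [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]
    return [t + alpha * G * g for t, g in zip(theta, grad_log_pi)]


theta = [0.0, 0.0]  # two actions, initially a uniform policy
before = softmax(theta)[1]
theta = reinforce_step(theta, action=1, G=2.0, alpha=0.1)
after = softmax(theta)[1]
print(before, after)  # the probability of action 1 rises after a positive return
```

An action followed by a positive return becomes more likely, and the size of the change scales with both the return and the learning rate.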

In [4]:
from tensorflow.keras.optimizers import SGD

class REINFORCEAgent:
    def __init__(self, state_size, action_size, model):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = 0.99    # discount rate
        self.learning_rate = 0.001
        self.model = model
        self.state_memory = []
        self.action_memory = []
        self.reward_memory = []
        self.optimizer = SGD(learning_rate=self.learning_rate)
        
        
    def remember(self, state, action, reward):
        # Store the state, the chosen action and the received reward for one step
        self.state_memory.append(state)
        self.action_memory.append(action)
        self.reward_memory.append(reward)

    def get_action(self, state):
        """
        Choose the action to take in the current state by sampling from
        the action probabilities returned by the policy network.
        """
        # Run the network on a single-state batch, then sample an action index
        # with probability proportional to the network output
        predictions = self.model.predict_on_batch(np.array([state]))[0]
        return random.choices(range(len(predictions)), weights=predictions)[0]

  

    def replay(self, batch_size):
        """
        Train the network on the data stored in the state, action and reward memory.
        First computes G_t for every stored step, then takes one gradient step
        on a random mini-batch of those steps and clears the memory.
        """
        batch_size = min(batch_size, len(self.state_memory))
        # Use the agent's own discount rate rather than the function default
        cumulative_rewards = get_cumulative_rewards(self.reward_memory, gamma=self.gamma)
        
        indexes = random.sample(range(len(self.state_memory)), k=batch_size)
        states = [self.state_memory[i] for i in indexes]
        actions = [self.action_memory[i] for i in indexes]
        cumulative_rewards = [cumulative_rewards[i] for i in indexes]
        
        with tf.GradientTape() as tape:
            policy = self.model(np.array(states))
            actions_one_hot = tf.keras.utils.to_categorical(actions, num_classes=self.action_size)
            log_probabilities = tf.math.log(tf.reduce_sum(tf.multiply(policy, actions_one_hot), axis=1))
            loss = -tf.reduce_mean(log_probabilities * cumulative_rewards)

        gradients = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
        
        self.state_memory = []
        self.action_memory = []
        self.reward_memory = []
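The loss in `replay` can be traced by hand on a few numbers. The sketch below repeats the same arithmetic in plain Python: it masks the policy output with a one-hot encoding of the action taken, takes the logarithm, weights it by the return, and takes the negative mean. The policy values, actions and returns are made-up illustrative inputs, not results from the agent.

```python
import math

# Hypothetical policy outputs for a batch of three states (two actions each).
policy = [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]
actions = [0, 1, 1]        # actions actually taken
returns = [2.0, 1.0, 0.5]  # discounted returns G_t for those steps

# One-hot masks pick out the probability of the action taken in each row.
one_hot = [[1.0 if j == a else 0.0 for j in range(2)] for a in actions]
chosen = [
    sum(p * m for p, m in zip(row, mask))
    for row, mask in zip(policy, one_hot)
]

log_probs = [math.log(p) for p in chosen]
# Negative mean of log pi(a_t | s_t) * G_t: minimising it follows the REINFORCE update.
loss = -sum(lp * g for lp, g in zip(log_probs, returns)) / len(returns)
print(chosen)  # [0.7, 0.6, 0.5]
print(loss)    # positive here: every log-probability is negative, every return positive
```

Minimising this quantity by gradient descent moves the weights in the direction of $G_t \nabla_\theta \log \pi_\theta$, which is why the sign is flipped.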

Time to prepare the network model that will learn to act in the [*CartPole*](https://gym.openai.com/envs/CartPole-v0/) environment:

In [5]:
env = gym.make("CartPole-v0").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
learning_rate = 0.001

model = keras.models.Sequential([
    layers.InputLayer((state_size,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(action_size, activation='softmax')
])

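# The compile() settings are not used for training here:
# the agent applies its own optimizer and loss in replay()
model.compile(optimizer='adam', loss='mse')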
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 64)                320       
                                                                 
 dense_1 (Dense)             (None, 64)                4160      
                                                                 
 dense_2 (Dense)             (None, 64)                4160      
                                                                 
 dense_3 (Dense)             (None, 2)                 130       
                                                                 
Total params: 8,770
Trainable params: 8,770
Non-trainable params: 0
_________________________________________________________________
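The parameter counts in the summary can be verified by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below does this arithmetic, assuming the four-dimensional CartPole state implied by the 320 parameters of the first layer. `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units


# State size, three hidden layers of 64 units, two actions
layer_sizes = [4, 64, 64, 64, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)       # [320, 4160, 4160, 130]
print(sum(per_layer))  # 8770
```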


Time to train the agent to play in the *CartPole* environment:

In [6]:
agent = REINFORCEAgent(state_size, action_size, model)


def generate_session(t_max=1000):
    """play env with REINFORCE agent and train at the session end"""

    reward = 0

    s = env.reset()

    for t in range(t_max):

        # choose an action
        a = agent.get_action(s)
        new_s, r, done, _ = env.step(a)

        # record session history to train later
        agent.remember(s, a, r)

        reward += r

        s = new_s
        if done: break

    agent.replay(batch_size=32)

    return reward


for i in range(100):

    rewards = [generate_session() for _ in range(100)]  # generate new sessions

    print("mean reward:%.3f" % (np.mean(rewards)))

    if np.mean(rewards) > 300:
        print("You Win!")
        break

mean reward:23.050
mean reward:23.760
mean reward:23.420
mean reward:23.740
mean reward:24.130
mean reward:25.130
mean reward:25.570
mean reward:27.150
mean reward:27.830
mean reward:27.800
mean reward:28.800
mean reward:30.890
mean reward:30.130
mean reward:32.010
mean reward:32.380
mean reward:32.820
mean reward:32.610
mean reward:29.360
mean reward:34.970
mean reward:32.540
mean reward:37.130
mean reward:39.560
mean reward:40.490
mean reward:41.750
mean reward:45.430
mean reward:42.480
mean reward:48.240
mean reward:46.890
mean reward:46.940
mean reward:53.860
mean reward:52.830
mean reward:56.370
mean reward:54.800
mean reward:65.020
mean reward:71.380
mean reward:70.100
mean reward:79.420
mean reward:78.230
mean reward:98.290
mean reward:103.010
mean reward:97.550
mean reward:107.580
mean reward:78.710
mean reward:93.840
mean reward:110.180
mean reward:105.100
mean reward:53.270
mean reward:125.920
mean reward:128.170
mean reward:158.530
mean reward:107.340
mean reward:143.340
mea