<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Zadanie-1---REINFORCE" data-toc-modified-id="Zadanie-1---REINFORCE-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Task 1 - REINFORCE</a></span></li></ul></div>

# Laboratory 6

The goal of the sixth laboratory is to study and implement REINFORCE, a deep reinforcement learning algorithm. The implementation is tested on the *CartPole* environment from OpenAI Gym.


Importing the standard libraries

In [1]:
from collections import deque
import gym
import numpy as np
import random

Importing the neural network libraries

In [2]:
from keras import backend as K  # backend of the same Keras package as the imports below
from keras.layers import Dense, Input
from keras.optimizers import Adam
from keras.models import Model

Using TensorFlow backend.


## Task 1 - REINFORCE

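<p style='text-align: justify;'>
The goal of this exercise is to implement the REINFORCE algorithm. The network weights are updated according to the rule:
\begin{equation*}
    \theta \leftarrow \theta + \alpha G_t \nabla_\theta \log \pi_{\theta}(a_t \mid s_t)
\end{equation*}
where $\alpha$ is the learning rate, $G_t$ is the discounted cumulative reward from step $t$ onward, and $\pi_{\theta}(a_t \mid s_t)$ is the probability the policy network assigns to action $a_t$ in state $s_t$.
</p>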
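
To make the update rule concrete, here is a minimal NumPy sketch of a single REINFORCE step. It assumes a linear softmax policy (logits = `theta @ state`) instead of the Keras network used in this notebook; the names `softmax`, `grad_log_policy` and `reinforce_step` are illustrative and do not appear in the original code.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_policy(theta, state, action):
    """Gradient of log pi_theta(action | state) for a linear softmax policy.

    Assumption: theta has shape (n_actions, n_features), logits = theta @ state.
    """
    probs = softmax(theta @ state)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    # d log pi(a|s) / d theta[k] = (1[k == a] - pi(k|s)) * state
    return np.outer(one_hot - probs, state)

def reinforce_step(theta, state, action, G_t, alpha=0.01):
    """One update: theta <- theta + alpha * G_t * grad log pi(a_t | s_t)."""
    return theta + alpha * G_t * grad_log_policy(theta, state, action)

theta = np.zeros((2, 4))                       # 2 actions, 4 state features, as in CartPole
state = np.array([0.1, -0.2, 0.05, 0.3])
new_theta = reinforce_step(theta, state, action=1, G_t=2.0, alpha=0.1)
```

With a positive return, the step raises the probability of the action that was taken; a negative return would lower it.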

In [3]:
class REINFORCEAgent:
    def __init__(self, action_size, model, get_actions):
        self.action_size = action_size
        self.gamma = 0.99    # discount rate
        self.model = model
        self.get_actions = get_actions
        self.state_memory = []
        self.action_memory = []
        self.reward_memory = []
        self.possible_actions = list(range(action_size))  # [0, 1] for CartPole
        
    def remember(self, state, action, reward):
        # Function adds information to the memory about last action and its results
        self.state_memory.append(state)
        self.action_memory.append(action)
        self.reward_memory.append(reward)

    def get_action(self, state):
        """
        Sample the action to take in the current state from the policy returned by the network.

        The action is drawn at random according to the probabilities produced
        by the network; it is not chosen greedily.
        """
        # Add a batch dimension, then take the probabilities for the single sample
        action_probabilities = self.get_actions(state[np.newaxis, :])[0]

        # Sample one of the possible actions according to the network's probabilities
        return np.random.choice(self.possible_actions, p=action_probabilities)

    def get_cumulative_rewards(self, rewards, gamma=None):
        '''
        Take a list of immediate rewards r(s,a) for the whole session and
        compute cumulative rewards R(s,a) (a.k.a. G(s,a) in Sutton '16):
        R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...

        The simple way to compute cumulative rewards is to iterate from the last to the first time step
        and compute R_t = r_t + gamma*R_{t+1} recursively.

        Returns an array of cumulative rewards with as many elements as the input rewards.
        '''

        # Fall back to the agent's discount rate; gamma=0 is a valid explicit value
        if gamma is None:
            gamma = self.gamma
        # Accumulate the discounted rewards from the end of the session backwards
        G_t = np.zeros(len(rewards))
        reward = 0
        for reverse_idx in reversed(range(len(rewards))):
            reward = reward * gamma + rewards[reverse_idx]
            G_t[reverse_idx] = reward

        return G_t
    
    def replay(self, batch_size=64):
        """
        Train the network on the data stored in the state, action and reward memory.
        First computes G_t for every stored step, then fits the network.
        """
        # Not enough samples yet: keep collecting (the memory may then span several sessions)
        if len(self.reward_memory) < batch_size:
            return

        G_t = self.get_cumulative_rewards(self.reward_memory)

        # Normalize the returns; the small constant guards against a zero standard deviation
        G_t -= np.mean(G_t)
        G_t /= np.std(G_t) + 1e-8

        # Network inputs: the visited states and the normalized return of each step
        inputs = [np.array(self.state_memory), G_t.reshape(-1, 1)]

        # One-hot encode the selected actions
        actions_one_hot = np.zeros([len(self.action_memory), self.action_size])
        actions_one_hot[np.arange(len(self.action_memory)), self.action_memory] = 1

        self.model.fit(inputs, actions_one_hot, verbose=0)

        # Empty the history
        self.state_memory = []
        self.action_memory = []
        self.reward_memory = []


Time to prepare the network model that will learn to act in the [*CartPole*](https://gym.openai.com/envs/CartPole-v0/) environment:

In [4]:
env = gym.make("CartPole-v0").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
learning_rate = 0.001

input_layer = Input(shape=(env.observation_space.shape[0], )) # for state tensor
actions_layer = Input(shape=[1]) # despite its name, carries the return G_t of each step (used in the loss)
dense_1 = Dense(24, activation='relu')(input_layer)
dense_2 = Dense(24, activation='relu')(dense_1)
output_layer = Dense(env.action_space.n, activation='softmax')(dense_2)



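The model has to be compiled with a custom loss because we need gradient ascent rather than descent: Keras optimizers minimize the loss, so the loss is the negative of the return-weighted log-likelihood of the taken actions.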

In [5]:
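def gradient_ascend(true, pred):
    # true: one-hot encoded taken actions; pred: action probabilities from the network
    log_lk = true * K.log(K.clip(pred, 1e-8, 1)) # clip to avoid log(0)
    # actions_layer feeds in the returns G_t; the minus sign turns ascent into minimization
    return K.mean(-log_lk * actions_layer)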
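
As a sanity check of what this loss evaluates to, the following NumPy sketch mirrors the same arithmetic on a hand-made batch of two steps. The function name `reinforce_loss` and the sample numbers are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def reinforce_loss(actions_one_hot, probs, returns):
    """NumPy mirror of the custom loss: mean of -G_t * y * log(pi) over all entries."""
    log_lk = actions_one_hot * np.log(np.clip(probs, 1e-8, 1.0))
    # returns has shape (batch, 1) and broadcasts over the action axis
    return np.mean(-log_lk * returns)

probs = np.array([[0.8, 0.2], [0.4, 0.6]])             # policy outputs for two steps
actions_one_hot = np.array([[1.0, 0.0], [0.0, 1.0]])   # actions 0 and 1 were taken
returns = np.array([[1.0], [-1.0]])                    # normalized G_t for each step

loss = reinforce_loss(actions_one_hot, probs, returns)
```

Only the probability of the taken action contributes in each row. Because the mean runs over batch × actions entries, the result is the per-step average divided by the number of actions, which only rescales the effective learning rate.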

In [6]:
model = Model(inputs=[input_layer, actions_layer], outputs=[output_layer])
model.compile(optimizer=Adam(lr=learning_rate), loss=gradient_ascend) # to train one hot encoded actions
get_actions_model = Model(inputs=[input_layer], outputs=[output_layer]) # to get action probabilities, not for training

model.summary()
get_actions_model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 24)                120       
_________________________________________________________________
dense_2 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 50        
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 4)                 0         
____________________________________________

In [7]:
agent = REINFORCEAgent(action_size=env.action_space.n, model=model, get_actions = get_actions_model.predict)

Prepare the function that computes the cumulative reward:

In [8]:
get_cumulative_rewards = agent.get_cumulative_rewards

assert len(get_cumulative_rewards(range(100))) == 100
assert np.allclose(get_cumulative_rewards([0, 0, 1, 0, 0, 1, 0], gamma=0.9),
                   [1.40049, 1.5561, 1.729, 0.81, 0.9, 1.0, 0.0])
assert np.allclose(get_cumulative_rewards([0, 0, 1, -2, 3, -4, 0], gamma=0.5),
                   [0.0625, 0.125, 0.25, -1.5, 1.0, -4.0, 0.0])
assert np.allclose(get_cumulative_rewards([0, 0, 1, 2, 3, 4, 0], gamma=0), [0, 0, 1, 2, 3, 4, 0])
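
Because `replay` waits until at least `batch_size` samples are stored, the memory can contain several sessions, and `get_cumulative_rewards` then discounts across their boundaries. One possible variant, sketched below, restarts the running return at each episode end. It assumes a hypothetical `dones` list recorded alongside the rewards, which the agent in this notebook does not store.

```python
import numpy as np

def discounted_returns(rewards, dones, gamma=0.99):
    """Discounted returns that restart at every episode boundary.

    Assumption: dones[t] is True when step t was the last step of an episode.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0          # nothing from the next episode leaks back
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [1, 1, 1, 1, 1]
dones = [False, False, True, False, True]    # two episodes: steps 0-2 and 3-4
per_episode = discounted_returns(rewards, dones, gamma=0.5)
```

When every `dones` flag except the last is `False`, this reduces to the same recurrence as `get_cumulative_rewards`.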

Time to train the agent to play in the *CartPole* environment:

In [9]:
def generate_session(t_max=1000):
    """Play one session with the REINFORCE agent and train at the end of the session."""

    reward = 0

    s = env.reset()

    for t in range(t_max):

        # choose an action by sampling from the policy
        a = agent.get_action(s)

        new_s, r, done, info = env.step(a)

        # record session history to train later
        agent.remember(s, a, r)

        reward += r

        s = new_s
        if done: break

    agent.replay()

    return reward

In [10]:
for i in range(100):

    rewards = [generate_session() for _ in range(100)]  # generate new sessions

    print("epoch: {},  mean reward: {}".format(i, np.round(np.mean(rewards),3)))

    if np.mean(rewards) > 300:
        print("You Win!")
        break

epoch: 0,  mean reward: 21.39
epoch: 1,  mean reward: 21.37
epoch: 2,  mean reward: 20.5
epoch: 3,  mean reward: 21.54
epoch: 4,  mean reward: 19.68
epoch: 5,  mean reward: 22.91
epoch: 6,  mean reward: 22.75
epoch: 7,  mean reward: 23.72
epoch: 8,  mean reward: 25.75
epoch: 9,  mean reward: 27.79
epoch: 10,  mean reward: 35.8
epoch: 11,  mean reward: 52.37
epoch: 12,  mean reward: 124.82
epoch: 13,  mean reward: 187.65
epoch: 14,  mean reward: 212.01
epoch: 15,  mean reward: 121.89
epoch: 16,  mean reward: 120.39
epoch: 17,  mean reward: 567.88
You Win!
