# Laboratory 4 (4 pts)

The goal of the fourth laboratory is to become familiar with and implement deep active reinforcement learning algorithms. The implemented algorithms will be tested on the previously prepared environments *FrozenLake* and *Pacman*, and on the OpenAI Gym environment *CartPole*.


Import the standard libraries:

In [1]:
from collections import deque
import gym
import numpy as np
import random

Import the environment libraries:

In [2]:
from FrozenLakeMDP import frozenLake
from FrozenLakeMDPExtended import frozenLake as frozenLakeExtended

Import the neural-network libraries:

In [3]:
%tensorflow_version 1.x
from keras import Model
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam
from keras.utils import to_categorical

TensorFlow 1.x selected.


Using TensorFlow backend.


## Zadanie 1 - Deep Q-Network

<p style='text-align: justify;'>
The goal of this exercise is to implement the Deep Q-Network algorithm. The target value for the network is:
\begin{equation}
        Q(s_t, a_t) = r_{t+1} + \gamma \max_a Q(s_{t + 1}, a)
\end{equation}
</p>
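As a worked illustration of the target above, the following sketch computes it with NumPy for a small batch of transitions. The array values, the `dones` flags and the helper name `td_targets` are illustrative assumptions, not part of the lab template. For terminal transitions the bootstrap term is dropped, so the target reduces to the reward.

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma=0.95):
    """Return r + gamma * max_a Q(s', a), or just r for terminal transitions."""
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    # Best achievable value in each next state (one row per transition).
    best_next = next_q_values.max(axis=1)
    # Terminal transitions have no future value to bootstrap from.
    return rewards + gamma * best_next * (~dones)

# Hypothetical batch of two transitions with two actions each.
targets = td_targets(
    rewards=[0.0, 1.0],
    next_q_values=[[0.2, 0.5], [0.3, 0.1]],
    dones=[False, True],
)
print(targets)
```

The first transition bootstraps from the larger next-state value (0.5), while the second is terminal and keeps only its reward.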

In [4]:
class DQNAgent:
    def __init__(self, action_size, learning_rate, model, get_legal_actions):
        self.get_legal_actions = get_legal_actions
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.999
        self.learning_rate = learning_rate
        self.model = model

    def remember(self, state, action, reward, next_state, done):
        # Store the last transition (state, action, reward, next state, done flag) in the replay memory
        self.memory.append((state, action, reward, next_state, done))

    def get_action(self, state):
        """
        Compute the action to take in the current state, including exploration.
        With probability self.epsilon, we should take a random action.
            otherwise - the best policy action (self.get_best_action).

        Note: To pick randomly from a list, use random.choice(list).
              To pick True or False with a given probability, generate a uniform number in [0, 1]
              and compare it with your probability
        """
        # Pick Action
        possible_actions = self.get_legal_actions(state)

        # If there are no legal actions, return None
        if len(possible_actions) == 0:
            return None

        # agent parameters:
        epsilon = self.epsilon

        #
        # INSERT CODE HERE to get action in a given state (according to epsilon greedy algorithm)
        #        
        if np.random.random() < epsilon:
            chosen_action = random.choice(possible_actions)
        else:
            chosen_action = self.get_best_action(state)
                
        return chosen_action
  
    def get_best_action(self, state):
        """
        Compute the best action to take in a state.
        """

        #
        # INSERT CODE HERE to get best possible action in a given state (remember to break ties randomly)
        #
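        # Q-values predicted by the network for this state (one value per action)
        q_values = self.model.predict(state)[0]
        legal_actions = self.get_legal_actions(state)
        best_value = max(q_values[a] for a in legal_actions)
        # Consider only legal actions and break ties between equally good ones at random
        best_action = random.choice([a for a in legal_actions if q_values[a] == best_value])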

        return best_action

    def replay(self, batch_size):
        """
        Train the network on transitions sampled at random from the memory.
        For each sample, compute the Q-values of the next state and take the largest one.
        The target value is calculated according to:
                Q(s,a) := (r + gamma * max_a(Q(s', a)))
        unless the transition ends the episode, in which case Q(s, a) := r.
        To update only the weights responsible for choosing the given action, the remaining
        target values should be those returned by the network for the current state.
        The network should be trained on batch_size samples.
        """
        #
        # INSERT CODE HERE to train network
        #
        if len(self.memory) < batch_size:
            return

        info_sets = random.sample(self.memory, batch_size)
        states_list = []
        targets_list = []
        for info_set in info_sets:
            state, action, reward, next_state, done = info_set
            states_list.append(state.flatten())
            target = self.model.predict(state)
            if done:
                target[0][action] = reward
            else:
                Q_future = max(self.model.predict(next_state)[0])
                target[0][action] = reward + Q_future * self.gamma
            targets_list.append(target.flatten())

        states_array = np.array(states_list)
        targets_array = np.array(targets_list)

        self.model.train_on_batch(states_array, targets_array)
        self.update_epsilon_value()

    def update_epsilon_value(self):
        # After each training step, decay epsilon multiplicatively:
        # self.epsilon *= self.epsilon_decay, but never let it fall below epsilon_min.
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
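
Because the decay in `update_epsilon_value` is multiplicative, the exploration rate after a given number of training steps can be estimated in closed form. The sketch below is only an illustration: the helper names `epsilon_after` and `updates_to_floor` are hypothetical, and it assumes epsilon is floored exactly at `epsilon_min`.

```python
import math

def epsilon_after(n_updates, epsilon_start=1.0, decay=0.999, epsilon_min=0.01):
    """Exploration rate after n_updates multiplicative decays, floored at epsilon_min."""
    return max(epsilon_min, epsilon_start * decay ** n_updates)

def updates_to_floor(epsilon_start=1.0, decay=0.999, epsilon_min=0.01):
    """Smallest n for which epsilon_start * decay**n <= epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon_start) / math.log(decay))

print(epsilon_after(100))    # exploration rate after 100 training steps
print(updates_to_floor())    # training steps needed to reach the floor
```

With the default values from `DQNAgent`, epsilon shrinks by roughly 10% per 100 calls to `replay`, which is why it falls slowly across the epochs logged in the training cells.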
        

Time to prepare the network model that will learn to move around the *FrozenLake* environment. The input layer should have as many neurons as there are possible states, and the output layer as many neurons as there are possible actions:

In [5]:
env = frozenLake("8x8")

state_size = env.get_number_of_states()
action_size = len(env.get_possible_actions(None))
learning_rate = 0.001

        #
        # INSERT CODE HERE to build network
        #

model = Sequential()
model.add(Dense(16, input_dim=state_size, activation="relu"))
model.add(Dense(32, activation="relu"))
model.add(Dense(16,activation='relu'))
model.add(Dense(action_size))
model.compile(loss="mean_squared_error", optimizer=Adam(learning_rate=learning_rate))

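
To make the layer sizes of the model above concrete, the number of trainable parameters can be checked by hand: each dense layer has `n_in * n_out` weights plus `n_out` biases. The sketch assumes 64 one-hot inputs (one per cell of the 8x8 grid) and 4 actions; both numbers are assumptions about the environment, and `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Assumed sizes: 64 inputs and 4 actions; hidden widths 16, 32, 16 as in the model above.
layer_sizes = [64, 16, 32, 16, 4]
print(dense_param_count(layer_sizes))
```

If the assumed sizes hold, the result should agree with the total reported by `model.summary()`.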


Time to teach the agent to move around the *FrozenLake* environment. Represent the state as a vector whose length equals the number of possible states, with a 1 at the index of the current state and zeros everywhere else:
* 1 pt < 35 epochs,
* 0.5 pt < 60 epochs,
* 0.25 pt otherwise.
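
The one-hot state representation described above can be sketched with plain NumPy. The helper name `one_hot_state` and the example sizes are assumptions made for illustration; the leading batch dimension of 1 matches the input shape that a Keras `predict` call expects.

```python
import numpy as np

def one_hot_state(state_index, n_states):
    """Return a (1, n_states) array with a single 1 at state_index."""
    encoded = np.zeros((1, n_states), dtype=np.float32)
    encoded[0, state_index] = 1.0
    return encoded

# Hypothetical example: state 5 in an environment with 16 states.
state = one_hot_state(5, 16)
print(state.shape)
```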

In [6]:
agent = DQNAgent(action_size, learning_rate, model, env.get_possible_actions)

agent.epsilon = 0.9

done = False
batch_size = 64
EPISODES = 10000
counter = 0
for e in range(EPISODES):

    summary = []
    for _ in range(100):
        total_reward = 0
        env_state = env.reset()
    
        #
        # INSERT CODE HERE to prepare appropriate format of the state for network
        #
        state = np.array([to_categorical(env_state, num_classes=state_size)])
        
        for _ in range(1000):
            action = agent.get_action(state)
            next_state_env, reward, done, _ = env.step(action)
            total_reward += reward

            #
            # INSERT CODE HERE to prepare appropriate format of the next state for network
            #
            next_state = np.array([to_categorical(next_state_env, num_classes=state_size)])

            #add to experience memory
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            if done:
                break

        #
        # INSERT CODE HERE to train the network once the memory holds more samples than the batch size
        #
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)
        
        summary.append(total_reward)

    print("epoch #{}\tmean reward = {:.3f}\tepsilon = {:.3f}".format(e, np.mean(summary), agent.epsilon))

    if np.mean(summary) > 0.9:
        print ("You Win!")
        break


epoch #0	mean reward = 0.000	epsilon = 0.815
epoch #1	mean reward = 0.000	epsilon = 0.738
epoch #2	mean reward = 0.010	epsilon = 0.667
epoch #3	mean reward = 0.010	epsilon = 0.604
epoch #4	mean reward = 0.020	epsilon = 0.546
epoch #5	mean reward = 0.000	epsilon = 0.494
epoch #6	mean reward = 0.020	epsilon = 0.447
epoch #7	mean reward = 0.040	epsilon = 0.405
epoch #8	mean reward = 0.190	epsilon = 0.366
epoch #9	mean reward = 0.290	epsilon = 0.331
epoch #10	mean reward = 0.600	epsilon = 0.300
epoch #11	mean reward = 0.680	epsilon = 0.271
epoch #12	mean reward = 0.590	epsilon = 0.245
epoch #13	mean reward = 0.660	epsilon = 0.222
epoch #14	mean reward = 0.660	epsilon = 0.201
epoch #15	mean reward = 0.600	epsilon = 0.182
epoch #16	mean reward = 0.630	epsilon = 0.164
epoch #17	mean reward = 0.690	epsilon = 0.149
epoch #18	mean reward = 0.720	epsilon = 0.135
epoch #19	mean reward = 0.650	epsilon = 0.122
epoch #20	mean reward = 0.780	epsilon = 0.110
epoch #21	mean reward = 0.840	epsilon = 0.1

Time to prepare the network model that will learn to move around the *FrozenLakeExtended* environment. This time the state is not a single number but three arrays:
* the first contains information about the goal,
* the second contains information about the holes,
* the third contains information about the player's position.
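
A minimal sketch of turning such a three-array state into a single network input follows. The 2x2 grids and the helper name `encode_grids` are hypothetical; the actual array layout returned by the environment may differ.

```python
import numpy as np

# Hypothetical 2x2 layout: goal, holes and player position as separate binary grids.
goal = np.array([[0, 0], [0, 1]])
holes = np.array([[0, 1], [0, 0]])
player = np.array([[1, 0], [0, 0]])

def encode_grids(grids):
    """Flatten each grid and join them into one (1, total_cells) input row."""
    flat = np.concatenate([np.asarray(g).ravel() for g in grids])
    return flat.reshape(1, -1).astype(np.float32)

state = encode_grids([goal, holes, player])
print(state.shape)
```

The resulting length is three times the number of cells, which is why `state_size` in the next cell is multiplied by 3.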

In [7]:
env = frozenLakeExtended("4x4")

state_size = env.get_number_of_states() * 3
action_size = len(env.get_possible_actions(None))
learning_rate = 0.001
     
        #
        # INSERT CODE HERE to build network
        #

model = Sequential()
model.add(Dense(16, input_dim=state_size, activation="relu"))
model.add(Dense(32, activation="relu"))
model.add(Dense(action_size))
model.compile(loss="mean_squared_error", optimizer=Adam(learning_rate=learning_rate))

Time to teach the agent to move around the *FrozenLakeExtended* environment. Represent the state as a single vector composed of all three arrays (2 pts):

In [8]:
agent = DQNAgent(action_size, learning_rate, model, env.get_possible_actions)

agent.epsilon = 0.75

done = False
batch_size = 64
EPISODES = 2000
counter = 0
for e in range(EPISODES):
    summary = []
    for _ in range(100):
        total_reward = 0
        env_state = env.reset()
    
        #
        # INSERT CODE HERE to prepare appropriate format of the state for network
        #
        state = np.array([np.array(env_state).flatten()])
        
        for time in range(1000):
            action = agent.get_action(state)
            next_state_env, reward, done, _ = env.step(action)
            total_reward += reward

            #
            # INSERT CODE HERE to prepare appropriate format of the next state for network
            #
            next_state = np.array([np.array(next_state_env).flatten()])

            #add to experience memory
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            if done:
                break

        #
        # INSERT CODE HERE to train the network once the memory holds more samples than the batch size
        #
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)
        
        summary.append(total_reward)
    
    print("epoch #{}\tmean reward = {:.3f}\tepsilon = {:.3f}".format(e, np.mean(summary), agent.epsilon))
    
    if np.mean(summary) > 0.9:
        print ("You Win!")
        break

epoch #0	mean reward = 0.000	epsilon = 0.683
epoch #1	mean reward = 0.010	epsilon = 0.618
epoch #2	mean reward = 0.000	epsilon = 0.559
epoch #3	mean reward = 0.000	epsilon = 0.506
epoch #4	mean reward = 0.050	epsilon = 0.458
epoch #5	mean reward = 0.030	epsilon = 0.414
epoch #6	mean reward = 0.290	epsilon = 0.375
epoch #7	mean reward = 0.450	epsilon = 0.339
epoch #8	mean reward = 0.420	epsilon = 0.307
epoch #9	mean reward = 0.560	epsilon = 0.278
epoch #10	mean reward = 0.620	epsilon = 0.251
epoch #11	mean reward = 0.670	epsilon = 0.227
epoch #12	mean reward = 0.600	epsilon = 0.206
epoch #13	mean reward = 0.730	epsilon = 0.186
epoch #14	mean reward = 0.720	epsilon = 0.168
epoch #15	mean reward = 0.840	epsilon = 0.152
epoch #16	mean reward = 0.740	epsilon = 0.138
epoch #17	mean reward = 0.730	epsilon = 0.125
epoch #18	mean reward = 0.840	epsilon = 0.113
epoch #19	mean reward = 0.930	epsilon = 0.102
You Win!


Time to prepare the network model that will learn to act in the [*CartPole*](https://gym.openai.com/envs/CartPole-v0/) environment:

In [9]:
env = gym.make("CartPole-v0").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
learning_rate = 0.001
       
        #
        # INSERT CODE HERE to build network
        #

model = Sequential()
model.add(Dense(64, input_dim=state_size, activation="relu"))
model.add(Dense(128, activation="relu"))
model.add(Dense(256, activation="relu"))
model.add(Dense(64, activation="relu"))
model.add(Dense(16, activation="relu"))
model.add(Dense(action_size))
model.compile(loss="mean_squared_error", optimizer=Adam(learning_rate=learning_rate))

Time to teach the agent to play in the *CartPole* environment:
* 1 pt < 10 epochs,
* 0.5 pt < 20 epochs,
* 0.25 pt otherwise.

In [16]:
def get_possible_actions(state):
    return [0,1]

agent = DQNAgent(action_size, learning_rate, model, get_possible_actions)

agent.epsilon = 0.75

done = False
batch_size = 64
EPISODES = 1000
counter = 0
for e in range(EPISODES):
    summary = []
    for _ in range(100):
        total_reward = 0
        env_state = env.reset()
    
        #
        # INSERT CODE HERE to prepare appropriate format of the state for network
        #
        state = np.array([np.array(env_state).flatten()])
        
        for time in range(300):
            action = agent.get_action(state)
            next_state_env, reward, done, _ = env.step(action)
            total_reward += reward

            #
            # INSERT CODE HERE to prepare appropriate format of the next state for network
            #
            next_state = np.array([np.array(next_state_env).flatten()])

            #add to experience memory
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            if done:
                break

        #
        # INSERT CODE HERE to train the network once the memory holds more samples than the batch size
        #
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)

        summary.append(total_reward)

    print("epoch #{}\tmean reward = {:.3f}\tepsilon = {:.3f}".format(e, np.mean(summary), agent.epsilon))

    if np.mean(summary) > 195:
        print ("You Win!")
        break

epoch #0	mean reward = 17.550	epsilon = 0.681
epoch #1	mean reward = 17.600	epsilon = 0.616
epoch #2	mean reward = 27.880	epsilon = 0.558
epoch #3	mean reward = 49.770	epsilon = 0.505
epoch #4	mean reward = 102.380	epsilon = 0.457
epoch #5	mean reward = 178.300	epsilon = 0.413
epoch #6	mean reward = 178.470	epsilon = 0.374
epoch #7	mean reward = 246.710	epsilon = 0.338
You Win!
