# Laboratory 7

The goal of the seventh laboratory is to become familiar with and implement a deep reinforcement learning algorithm: Actor-Critic. The implemented algorithm will be tested on an OpenAI Gym environment, _CartPole_.


Importing the standard libraries


In [1]:
from collections import deque
import gym
import numpy as np
import random

Importing the neural network libraries


In [2]:
import torch
import torch.nn as nn
from torch.optim import Adam

## Task 1 - Actor-Critic

The goal of this exercise is to implement the Actor-Critic algorithm. To do so, create two deep neural networks:

1. _actor_ - a network that learns the optimal policy (similar to the one from laboratory 6),
2. _critic_ - a network that learns the state-value function (similarly to a DQN network).

The weights of the _actor_ network are updated according to:

\begin{equation*}
\theta \leftarrow \theta + \alpha \delta_t \nabla_\theta \log \pi_{\theta}(a_t \mid s_t).
\end{equation*}

The weights of the _critic_ network are updated according to:
\begin{equation*}
w \leftarrow w + \beta \delta_t \nabla_w \upsilon(s_t, w),
\end{equation*}
where:
\begin{equation*}
\delta_t \leftarrow r_t + \gamma \upsilon(s_{t + 1}, w) - \upsilon(s_t, w).
\end{equation*}
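
As a small worked example of the TD error $\delta_t$ defined above, the sketch below evaluates it for made-up numbers. The helper name `td_error` and every numeric value are illustrative assumptions, not part of the lab template. The `done` flag follows the usual convention of dropping the bootstrap term on a terminal transition.

```python
def td_error(reward, value, next_value, gamma=0.99, done=False):
    """One-step TD error: delta = r + gamma * v(s') - v(s).

    On a terminal transition (done=True) the bootstrap term is dropped,
    so the target reduces to the immediate reward.
    """
    target = reward if done else reward + gamma * next_value
    return target - value


# Illustrative numbers only (not taken from any real run).
delta_mid_episode = td_error(reward=1.0, value=0.5, next_value=0.7)
delta_terminal = td_error(reward=1.0, value=0.5, next_value=0.7, done=True)
print(delta_mid_episode, delta_terminal)
```

A positive delta means the transition turned out better than the critic predicted, so the actor update raises the probability of the chosen action. A negative delta lowers it.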


In [26]:
class Agent:
    def __init__(self, state_size, action_size, actor, critic, alpha_learning_rate, beta_learning_rate):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = 0.99    # discount rate
        self.actor = actor
        self.critic = critic #critic network should have only one output
        self.actor_optimizer = Adam(self.actor.parameters(), alpha_learning_rate)
        self.critic_optimizer = Adam(self.critic.parameters(), beta_learning_rate)


    def get_action(self, state):
        state = torch.tensor(state, dtype=torch.float32)
        with torch.no_grad():
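            # Convert the actor's logits into action probabilities.
            probs = torch.softmax(self.actor(state), dim=-1)
        # Sample one action index according to those probabilities.
        return torch.multinomial(probs, 1).item()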

    def learn(self, state, action, reward, next_state, done):
        """
        Train both networks on a single transition (state, action, reward, next_state).
        First, estimate the values of state and next_state from the critic network's output.
        The critic network is trained towards the target value:
        target = r + gamma * next_state_value if not done,
        target = r if done.
        The actor network is trained using the delta value:
        delta = target - state_value
        """
        #
        # INSERT CODE HERE to train network
        #
        state = torch.tensor(state, dtype=torch.float32)
        action = torch.tensor(action, dtype=torch.int64).unsqueeze(0)
        reward = torch.tensor(reward, dtype=torch.float32).unsqueeze(0)
        next_state = torch.tensor(next_state, dtype=torch.float32)
        done = torch.tensor(done, dtype=torch.float32).unsqueeze(0)

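        # TD target: r + gamma * V(s') for non-terminal steps, r otherwise.
        # It is computed without gradients, so the critic gets a semi-gradient update.
        with torch.no_grad():
            target = reward + self.gamma * self.critic(next_state) * (1 - done)
        delta = target - self.critic(state)

        # The actor treats delta as a constant weight on log pi(a | s).
        log_prob = torch.log_softmax(self.actor(state), dim=-1)[action]
        actor_loss = -(delta.detach() * log_prob).sum()
        critic_loss = delta.pow(2).mean()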

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
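
To see the two update rules without any neural-network machinery, here is a minimal tabular sketch in plain Python. It is not the lab's reference solution: the names `softmax` and `actor_critic_step`, the toy state and action indices, and the step sizes are all assumptions made for illustration. For a softmax policy over a table of preferences, the gradient of $\log \pi(a \mid s)$ with respect to the preference of action $b$ is $1[a = b] - \pi(b \mid s)$, which is what the inner loop applies.

```python
import math


def softmax(preferences):
    """Numerically stable softmax over a list of action preferences."""
    m = max(preferences)
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, v, s, a, r, s_next, done,
                      alpha=0.1, beta=0.1, gamma=0.99):
    """Apply one in-place actor-critic update to tabular parameters.

    theta[s][b] is the preference for action b in state s (the actor),
    v[s] is the value estimate of state s (the critic).
    """
    target = r if done else r + gamma * v[s_next]
    delta = target - v[s]

    # Critic: for a table, the gradient of v[s] with respect to itself is 1.
    v[s] += beta * delta

    # Actor: d/d theta[s][b] of log pi(a | s) equals 1[a == b] - pi(b | s).
    pi = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha * delta * grad_log_pi
    return delta


# Hypothetical toy problem: 2 states, 2 actions, all parameters start at zero.
theta = [[0.0, 0.0], [0.0, 0.0]]
v = [0.0, 0.0]
delta = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False)
pi_after = softmax(theta[0])
print(delta, v, pi_after)
```

Because the reward was better than the critic's initial estimate, the value of state 0 moves up and the policy shifts probability towards the action that was taken.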

        


Time to prepare the network models that will learn to act in the [_CartPole_](https://gym.openai.com/envs/CartPole-v0/) environment:


In [27]:
env = gym.make("CartPole-v0").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
alpha_learning_rate = 0.0001*2
beta_learning_rate = 0.0005*2

    
actor_model = nn.Sequential(
    nn.Linear(state_size, 32),
    nn.ReLU(),
    nn.Linear(32, action_size)
)

critic_model = nn.Sequential(
    nn.Linear(state_size, 16),
    nn.ReLU(),
    nn.Linear(16, 1)
)
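
To get a feel for how small these two networks are, the following sketch counts their trainable parameters by hand. It assumes the usual CartPole dimensions (4 observation components and 2 discrete actions). The helper `linear_params` and the variable names are hypothetical and only serve this illustration.

```python
def linear_params(n_in, n_out):
    """Number of trainable parameters in a fully connected layer (weights + biases)."""
    return n_in * n_out + n_out


# Assumed CartPole dimensions: 4 observation components, 2 discrete actions.
n_obs, n_actions = 4, 2

# Actor: Linear(n_obs, 32) -> ReLU -> Linear(32, n_actions)
actor_params = linear_params(n_obs, 32) + linear_params(32, n_actions)
# Critic: Linear(n_obs, 16) -> ReLU -> Linear(16, 1)
critic_params = linear_params(n_obs, 16) + linear_params(16, 1)
print(actor_params, critic_params)
```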

Time to train the agent to play in the _CartPole_ environment:


In [28]:
agent = Agent(state_size, action_size, actor_model, critic_model, alpha_learning_rate, beta_learning_rate)


for epoch in range(100):
    score_history = []

    # Play 100 episodes per epoch and report their mean score.
    for episode in range(100):
        done = False
        score = 0
        state, info = env.reset()
        while not done:
            action = agent.get_action(state)
            next_state, reward, done, _, _ = env.step(action)
            agent.learn(state, action, reward, next_state, done)
            state = next_state
            score += reward
        score_history.append(score)

    print("mean reward:%.3f" % (np.mean(score_history)))

    if np.mean(score_history) > 300:
        print("You Win!")
        break

mean reward:18.140
mean reward:18.290
mean reward:22.750
mean reward:20.650
mean reward:23.780
mean reward:25.130
mean reward:27.390
mean reward:25.580
mean reward:29.430
mean reward:33.450
mean reward:39.720
mean reward:45.720
mean reward:52.030
mean reward:55.570
mean reward:61.900
mean reward:102.900
mean reward:90.240
mean reward:78.330
mean reward:176.450
mean reward:194.680
mean reward:75.930
mean reward:162.280
mean reward:159.740
mean reward:221.560
mean reward:250.870
mean reward:158.710
mean reward:166.260
mean reward:152.000
mean reward:150.720
mean reward:1095.120
You Win!
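
The training loop above unpacks two values from `env.reset()` and five from `env.step()`, which matches the Gym 0.26+ call shapes: `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`. The sketch below shows those shapes on a stand-in environment so it can run without Gym installed. The class `CoinFlipEnv`, the function `run_episode`, and the constant policy are hypothetical and exist only for this example.

```python
import random


class CoinFlipEnv:
    """Hypothetical stand-in environment that mimics the Gym >= 0.26 call shapes."""

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0], {}  # (observation, info)

    def step(self, action):
        self.t += 1
        reward = 1.0
        terminated = self.rng.random() < 0.2  # the episode ended in failure
        truncated = self.t >= self.max_steps  # the step limit was reached
        return [float(self.t)], reward, terminated, truncated, {}


def run_episode(env, policy):
    """Play one episode and return the accumulated reward."""
    state, _info = env.reset()
    score, done = 0.0, False
    while not done:
        next_state, reward, terminated, truncated, _info = env.step(policy(state))
        # An episode is over when it either terminates or is cut off.
        done = terminated or truncated
        state = next_state
        score += reward
    return score


score = run_episode(CoinFlipEnv(), policy=lambda state: 0)
print(score)
```

When an environment can set `truncated`, stopping on `terminated or truncated` prevents the loop from running past the step limit.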
