# Laboratorium 7

Celem siĂłdmego laboratorium jest zapoznanie siÄ oraz zaimplementowanie algorytmu gĹÄbokiego uczenia aktywnego - Actor-Critic. Zaimplementowany algorytm bÄdzie testowany z wykorzystaniem Ĺrodowiska z OpenAI - _CartPole_.


DoĹÄczenie standardowych bibliotek


In [1]:
from collections import deque
import gym
import numpy as np
import random

DoĹÄczenie bibliotek do obsĹugi sieci neuronowych


In [2]:
import torch
import torch.nn as nn
from torch.optim import Adam

## Zadanie 1 - Actor-Critic

Celem Äwiczenie jest zaimplementowanie algorytmu Actor-Critic. W tym celu naleĹźy utworzyÄ dwie gĹÄbokie sieci neuronowe: 1. _actor_ - sieÄ, ktĂłra bÄdzie uczyĹa siÄ optymalnej strategii (podobna do tej z laboratorium 6), 2. _critic_ - sieÄ, ktĂłra bÄdzie uczyĹa siÄ funkcji oceny stanu (podobnie jak siÄ DQN).
Wagi sieci _actor_ aktualizowane sÄ zgodnie ze wzorem:

\begin{equation*}
\theta \leftarrow \theta + \alpha \delta*t \nabla*\theta log \pi\_{\theta}(a_t, s_t | \theta).
\end{equation*}

Wagi sieci _critic_ aktualizowane sÄ zgodnie ze wzorem:
\begin{equation*}
w \leftarrow w + \beta \delta*t \nabla_w\upsilon(s*{t + 1}, w),
\end{equation*}
gdzie:
\begin{equation*}
\delta*t \leftarrow r_t + \gamma \upsilon(s*{t + 1}, w) - \upsilon(s_t, w).
\end{equation*}


In [33]:
class Agent:
    def __init__(self, state_size, action_size, model, alpha_learning_rate):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = 0.99    # discount rate
        self.model = model
        self.optimizer = Adam(self.model.parameters(), alpha_learning_rate)
        # self.critic_optimizer = Adam(self.critic.parameters(), beta_learning_rate)


    def get_action(self, state):
        state = torch.tensor(state, dtype=torch.float32)
        with torch.no_grad():
            probs = torch.log_softmax(self.model(state)[0], dim=-1)
        return torch.multinomial(probs.exp(), 1).item()

    def learn(self, state, action, reward, next_state, done):
        """
        Function learn networks using information about state, action, reward and next state. 
        First the values for state and next_state should be estimated based on output of critic network.
        Critic network should be trained based on target value:
        target = r + \gamma next_state_value if not done]
        target = r if done.
        Actor network shpuld be trained based on delta value:
        delta = target - state_value
        """
        #
        # INSERT CODE HERE to train network
        #
        state = torch.tensor(state, dtype=torch.float32)
        action = torch.tensor(action, dtype=torch.int64).unsqueeze(0)
        reward = torch.tensor(reward, dtype=torch.float32).unsqueeze(0)
        next_state = torch.tensor(next_state, dtype=torch.float32)
        done = torch.tensor(done, dtype=torch.float32).unsqueeze(0)

        def get_delta():
            next_value = self.model(next_state)[1].detach()
            return reward + self.gamma * next_value * (1 - done) - self.model(state)[1]

        actor_loss = -(get_delta() * torch.log_softmax(self.model(state)[0], dim=-1)[action])
        critic_loss = get_delta().pow(2).mean()

        actor_loss.backward(retain_graph=True)
        critic_loss.backward()
        self.optimizer.step()

        self.optimizer.zero_grad()

        


Czas przygotowaÄ model sieci, ktĂłra bÄdzie siÄ uczyĹa dziaĹania w Ĺrodowisku [_CartPool_](https://gym.openai.com/envs/CartPole-v0/):


In [36]:
env = gym.make("CartPole-v0").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
alpha_learning_rate = 0.0001*2

    
actor_model = nn.Sequential(
    nn.Linear(state_size, 32),
    nn.ReLU(),
    nn.Linear(32, action_size)
)

critic_model = nn.Sequential(
    nn.Linear(state_size, 16),
    nn.ReLU(),
    nn.Linear(16, 1)
)

class ActorCriticModel(nn.Module):
    def __init__(self, state_size, action_size) -> None:
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_size, 32), nn.ReLU())
        self.actor = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, action_size))
        self.critic = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, x):
        x = self.backbone(x)
        return self.actor(x), self.critic(x)
    
actor_critic_model = ActorCriticModel(state_size, action_size)

  logger.warn(


Czas nauczyÄ agenta gry w Ĺrodowisku _CartPool_:


In [37]:
agent = Agent(state_size, action_size, actor_critic_model, alpha_learning_rate)


for i in range(100):
    score_history = []

    for i in range(100):
        done = False
        score = 0
        state = env.reset()
        state = state[0]
        while not done:
            action = agent.get_action(state)
            next_state, reward, done, _, _ = env.step(action)
            agent.learn(state, action, reward, next_state, done)
            state = next_state
            score += reward
        score_history.append(score)

    print("mean reward:%.3f" % (np.mean(score_history)))

    if np.mean(score_history) > 300:
        print("You Win!")
        break

  if not isinstance(terminated, (bool, np.bool8)):


mean reward:14.830
mean reward:10.330
mean reward:9.980
mean reward:10.370
mean reward:13.550
mean reward:12.640
mean reward:16.910
mean reward:18.880
mean reward:21.870
mean reward:24.870
mean reward:33.770
mean reward:37.530
mean reward:43.040
mean reward:52.430
mean reward:88.790
mean reward:105.040
mean reward:110.820
mean reward:134.580
mean reward:332.730
You Win!
