# Chapter 2: OpenAI Gym

## The anatomy of the agent

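Reinforcement learning involves two entities:
* Agent: the entity that actually takes actions.
* Environment: everything external to the agent; it hands out rewards and provides the observations the agent acts on.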

In [1]:
# start with the environment
import random
from typing import List

class Environment:
    
    def __init__(self):
        self.steps_left = 10
    
    def get_observation(self) -> List[float]:
        return [0.0, 0.0, 0.0]

    def get_actions(self) -> List[int]:
        return [0, 1]

    def is_done(self) -> bool:
        return self.steps_left == 0

    def action(self, action: int) -> float:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        return random.random()

In [2]:
# Look at the agent's part

class Agent:
    
    def __init__(self):
        self.total_reward = 0.0

    def step(self, env: Environment):
        current_obs = env.get_observation()
        actions = env.get_actions()
        reward = env.action(random.choice(actions))
        self.total_reward += reward

In [3]:
# Instantiate both classes and run a single test episode
env = Environment()
agent = Agent()
while not env.is_done():
    agent.step(env)

print("Total reward got: %.4f" % agent.total_reward)

Total reward got: 4.0915
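
Each step returns `random.random()`, a value drawn uniformly from [0, 1), so a 10-step episode has an expected total reward of 5.0. A single run such as the one above is just one sample from that distribution. The sketch below redefines the two classes compactly so that it runs on its own, then averages many episodes to illustrate the point. The `run_episode` helper, the seed, and the episode count are assumptions made for this illustration and are not part of the original notebook.

```python
import random
from typing import List


class Environment:
    """Toy environment from above: 10 steps, random reward per step."""

    def __init__(self) -> None:
        self.steps_left = 10

    def get_observation(self) -> List[float]:
        return [0.0, 0.0, 0.0]

    def get_actions(self) -> List[int]:
        return [0, 1]

    def is_done(self) -> bool:
        return self.steps_left == 0

    def action(self, action: int) -> float:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        return random.random()


class Agent:
    """Agent from above: ignores the observation and acts at random."""

    def __init__(self) -> None:
        self.total_reward = 0.0

    def step(self, env: Environment) -> None:
        actions = env.get_actions()
        self.total_reward += env.action(random.choice(actions))


def run_episode() -> float:
    """Hypothetical helper: play one full episode, return its total reward."""
    env = Environment()
    agent = Agent()
    while not env.is_done():
        agent.step(env)
    return agent.total_reward


random.seed(0)  # arbitrary seed, used only to make the run repeatable
n_episodes = 1000
rewards = [run_episode() for _ in range(n_episodes)]
mean_reward = sum(rewards) / n_episodes
print("Mean reward over %d episodes: %.4f" % (n_episodes, mean_reward))
```

The printed mean should land close to 5.0, while individual episodes can fall anywhere between 0 and 10.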


## The OpenAI Gym API

### The CartPole session

In [4]:
import gym
env = gym.make('CartPole-v0')
obs = env.reset()
obs

array([ 0.02098088,  0.00103006, -0.03950427,  0.04324083], dtype=float32)

In [5]:
env.action_space

Discrete(2)

In [6]:
env.observation_space

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
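
The `Box` printed above lists a lower bound, an upper bound, a shape, and a dtype. For CartPole the four components are the cart position, cart velocity, pole angle, and pole angular velocity; the two velocity bounds of ±3.4028235e+38 are the largest float32 values, so those components are effectively unbounded. As a rough sketch of what it means for an observation to belong to such a space, the plain-Python check below tests each component against its bounds. `in_box` is a hypothetical helper written for this illustration, not Gym's implementation.

```python
from typing import Sequence

# Bounds copied from the observation_space output above.
LOW = [-4.8000002e+00, -3.4028235e+38, -4.1887903e-01, -3.4028235e+38]
HIGH = [4.8000002e+00, 3.4028235e+38, 4.1887903e-01, 3.4028235e+38]


def in_box(obs: Sequence[float], low: Sequence[float] = LOW,
           high: Sequence[float] = HIGH) -> bool:
    """Hypothetical helper: True if obs has the expected length and
    every component lies within its [low, high] interval."""
    if len(obs) != len(low):
        return False
    return all(lo <= x <= hi for x, lo, hi in zip(obs, low, high))


# The initial observation printed by env.reset() above.
initial_obs = [0.02098088, 0.00103006, -0.03950427, 0.04324083]
inside = in_box(initial_obs)

# A made-up observation whose cart position exceeds the 4.8 bound.
outside = in_box([5.0, 0.0, 0.0, 0.0])

print(inside, outside)
```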

In [7]:
env.step(0)

(array([ 0.02100148, -0.1935038 , -0.03863946,  0.32320273], dtype=float32),
 1.0,
 False,
 {})
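
In this version of Gym, the tuple returned by `env.step(0)` has four parts: the new observation, the reward (1.0 here), a done flag (`False` here), and an info dictionary (empty here). The sketch below illustrates that contract with a toy stand-in environment so that it runs without Gym installed. `CountdownEnv` and `run_episode` are hypothetical names introduced for this example; only the shape of the `reset()`/`step()` interface is taken from the output above.

```python
import random
from typing import Callable, Dict, List, Tuple

Observation = List[float]
StepResult = Tuple[Observation, float, bool, Dict]


class CountdownEnv:
    """Hypothetical stand-in that mimics the reset()/step() contract.

    The episode ends after max_steps steps and every step yields a
    reward of 1.0, loosely echoing CartPole's per-step reward.
    """

    def __init__(self, max_steps: int = 5) -> None:
        self.max_steps = max_steps
        self.steps_taken = 0

    def reset(self) -> Observation:
        self.steps_taken = 0
        return [0.0]

    def step(self, action: int) -> StepResult:
        self.steps_taken += 1
        obs = [float(self.steps_taken)]
        done = self.steps_taken >= self.max_steps
        return obs, 1.0, done, {}


def run_episode(env: CountdownEnv,
                policy: Callable[[Observation], int]) -> Tuple[int, float]:
    """Play one episode and return (number of steps, total reward)."""
    obs = env.reset()
    total_reward, total_steps = 0.0, 0
    done = False
    while not done:
        # Unpack the four-part step result described above.
        obs, reward, done, _info = env.step(policy(obs))
        total_reward += reward
        total_steps += 1
    return total_steps, total_reward


steps, total = run_episode(CountdownEnv(max_steps=5),
                           policy=lambda obs: random.choice([0, 1]))
print("Episode done in %d steps, total reward %.2f" % (steps, total))
```

Because the reward is 1.0 per step, the total reward equals the episode length, which is the same relationship visible in the CartPole runs in this chapter.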

In [8]:
env.action_space.sample()

1

In [9]:
env.action_space.sample()

0

In [10]:
env.observation_space.sample()

array([-6.7852871e-03, -5.8791424e+37, -2.8179330e-01,  2.8261877e+38],
      dtype=float32)

In [11]:
env.observation_space.sample()

array([ 4.0229750e+00, -3.3712064e+38,  1.8790500e-01, -2.3338294e+37],
      dtype=float32)

## The random CartPole agent

In [12]:
env = gym.make("CartPole-v0")
total_reward = 0.0
total_steps = 0
obs = env.reset()

In [13]:
while True:
    action = env.action_space.sample()
    obs, reward, done, _ = env.step(action)
    total_reward += reward
    total_steps += 1
    if done:
        break

print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))

Episode done in 11 steps, total reward 11.00


## Extra Gym functionality - wrappers and monitors
### Wrappers

In [14]:
import gym
from typing import TypeVar
import random

Action = TypeVar('Action')

class RandomActionWrapper(gym.ActionWrapper):
    """With probability epsilon, replace the agent's action with a random one."""

    def __init__(self, env, epsilon=0.1):
        super().__init__(env)
        self.epsilon = epsilon

    def action(self, action: Action) -> Action:
        # Called by ActionWrapper.step() before the action reaches the wrapped env.
        if random.random() < self.epsilon:
            print("Random!")
            return self.env.action_space.sample()
        return action
    
env = RandomActionWrapper(gym.make("CartPole-v0"))
obs = env.reset()
total_reward = 0.0
while True:
    obs, reward, done, _ = env.step(0)
    total_reward += reward
    if done:
        break
print("Reward got: %.2f" % total_reward)

Reward got: 10.00
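
The wrapper swaps in a random action with probability `epsilon`. Because the replacement is sampled from the whole action space, it can coincide with the action the agent asked for: with two actions and `epsilon = 0.1`, the executed action should differ from the requested one only about 5% of the time. The plain-Python sketch below checks this empirically. `maybe_randomize` is a hypothetical helper that mirrors the wrapper's `action()` logic without depending on Gym, and the seed and trial count are arbitrary.

```python
import random


def maybe_randomize(action: int, n_actions: int, epsilon: float,
                    rng: random.Random) -> int:
    """Hypothetical helper mirroring RandomActionWrapper.action():
    with probability epsilon, return a uniformly sampled action."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return action


rng = random.Random(0)  # arbitrary seed, used only for repeatability
n_trials = 100_000

# The agent always requests action 0, as in the loop above; count how
# often the executed action ends up being something else.
changed = sum(
    maybe_randomize(0, n_actions=2, epsilon=0.1, rng=rng) != 0
    for _ in range(n_trials)
)
changed_fraction = changed / n_trials
print("Fraction of steps where the action changed: %.3f" % changed_fraction)
```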


### Monitor

In [23]:
import time
env = gym.make("CartPole-v0")
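# Record episode statistics and video into the "recording" directory;
# force=True overwrites the results of any previous recording there.
env = gym.wrappers.Monitor(env, "recording", force=True)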

total_reward = 0.0
total_steps = 0
obs = env.reset()

while True:
    action = env.action_space.sample()
    obs, reward, done, _ = env.step(action)
    total_reward += reward
    total_steps += 1
    if done:
        break
    # time.sleep(0.05)

print("Episode done in %d steps, total reward %.2f" % (
    total_steps, total_reward))
env.close()
env.env.close()

Episode done in 9 steps, total reward 9.00
