In [2]:
import gym

## Naive implementation

In [2]:
env = gym.make('CartPole-v0')
env.reset()

for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.


## Understanding `env.step`

As the documentation explains, each environment runs in episodes, and `done=True` indicates that the current episode has ended. Once that happens, we need to call `env.reset()` before stepping again, which is what the warning above is telling us. To handle this, we need to understand what `env.step(action)` does and returns. `env.step(action)` advances the environment by one timestep by performing `action` and returns a tuple of four values:
- observation: This is environment specific and represents our observation of the environment after taking the action specified in `env.step(action)`.
- reward: The reward we received upon performing the action.
- done: The flag discussed above. We need to monitor it and call `env.reset()` when `done=True`.
- info: Additional information for debugging.
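
To make the contract concrete without depending on any particular environment, here is a minimal sketch of the step/reset loop. `CountdownEnv` and `run_episode` are hypothetical names invented for this illustration and are not part of gym; the stand-in environment only mimics the interface described above (`reset()` returns an observation, `step(action)` returns the four-value tuple), and the two-action choice is an assumption borrowed from CartPole.

```python
import random


class CountdownEnv:
    """Hypothetical stand-in for a gym environment (not part of gym).

    It mimics only the interface described above: reset() returns an
    observation, and step(action) returns (observation, reward, done, info).
    """

    def __init__(self, horizon=5):
        self.horizon = horizon      # number of steps before the episode ends
        self.remaining = horizon

    def reset(self):
        self.remaining = self.horizon
        return self.remaining       # observation: steps left in the episode

    def step(self, action):
        self.remaining -= 1
        observation = self.remaining
        reward = 1.0                # one unit of reward per step taken
        done = self.remaining <= 0  # True once the episode has ended
        info = {}                   # placeholder for debugging details
        return observation, reward, done, info


def run_episode(env, max_steps=1000):
    """Run one episode with random actions; return (steps, total_reward)."""
    observation = env.reset()
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        action = random.choice([0, 1])  # assumed two-action space
        observation, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break  # stop stepping; the next episode must begin with reset()
    return steps, total_reward


env = CountdownEnv(horizon=5)
results = [run_episode(env) for _ in range(3)]
print(results)
```

Because `run_episode` calls `env.reset()` at the start of every episode and stops as soon as `done` is true, `step()` is never called on a finished episode.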

In [3]:
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(1000):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        
        if done:
            print('Episode #%d finished after %d timesteps' % (i_episode, t))
            break

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
[ 0.00637188 -0.00521039 -0.03402556  0.01884412]
[ 0.00626768 -0.19982827 -0.03364868  0.30060039]
[ 0.00227111 -0.00424328 -0.02763667 -0.00250188]
[ 0.00218624  0.1912639  -0.0276867  -0.3037748 ]
[ 0.00601152 -0.00345276 -0.0337622  -0.0199506 ]
[ 0.00594247  0.1921367  -0.03416121 -0.3230918 ]
[ 0.0097852   0.38772802 -0.04062305 -0.62634905]
[ 0.01753976  0.58339277 -0.05315003 -0.93154393]
[ 0.02920762  0.77919012 -0.07178091 -1.24044398]
[ 0.04479142  0.97515657 -0.09658979 -1.55472215]
[ 0.06429455  1.17129426 -0.12768423 -1.8759106 ]
[ 0.08772044  1.36755751 -0.16520244 -2.20534619]
[ 0.11507159  1.17436393 -0.20930937 -1.96785017]
Episode #0 finished after 12 timesteps
[ 0.01957568 -0.01108415  0.04242628 -0.03739506]
[ 0.019354    0.18340454  0.04167838 -0.31639598]
[ 0.02302209 -0.01228552  0.03535046 -0.0108659 ]
[ 0.02277638 -0.20789615  0.03513315  0.29275759]
[ 0

[ 0.12420335  0.55170705 -0.11798819 -1.01503452]
[ 0.13523749  0.74818792 -0.13828888 -1.34231488]
[ 0.15020125  0.55505027 -0.16513518 -1.0959016 ]
[ 0.16130225  0.75191581 -0.18705321 -1.43551211]
Episode #9 finished after 16 timesteps
[ 0.03484364  0.03154726  0.04660657 -0.01517754]
[ 0.03547458 -0.16421103  0.04630302  0.2918384 ]
[0.03219036 0.03022118 0.05213979 0.01411089]
[ 0.03279478 -0.16560824  0.052422    0.322778  ]
[0.02948262 0.02872956 0.05887756 0.04707675]
[ 0.03005721  0.22295999  0.0598191  -0.22646405]
[0.03451641 0.02703642 0.05528982 0.08447199]
[ 0.03505714 -0.1688327   0.05697926  0.39407363]
[0.03168049 0.02543638 0.06486073 0.11988576]
[ 0.03218921 -0.17055201  0.06725845  0.43230587]
[0.02877817 0.02355636 0.07590456 0.16156071]
[ 0.0292493  -0.17256554  0.07913578  0.47719102]
[0.02579799 0.0213551  0.0886796  0.21046203]
[ 0.02622509 -0.17491552  0.09288884  0.52974875]
[0.02272678 0.01878538 0.10348381 0.26772006]
[ 0.02310249 -0.17764951  0.10883822  0

[ 0.02937532  0.4201345  -0.00529699 -0.56097628]
[ 0.03777801  0.22508729 -0.01651651 -0.26996687]
[ 0.04227976  0.03020488 -0.02191585  0.01746124]
[ 0.04288385 -0.16459603 -0.02156663  0.30314968]
[ 0.03959193 -0.35940408 -0.01550363  0.58895372]
[ 0.03240385 -0.1640685  -0.00372456  0.29142767]
[ 0.02912248  0.03110635  0.002104   -0.0024276 ]
[ 0.02974461  0.22619807  0.00205544 -0.29444595]
[ 0.03426857  0.03104687 -0.00383348 -0.00111548]
[ 0.03488951  0.22622359 -0.00385578 -0.29500544]
[ 0.03941398  0.4214003  -0.00975589 -0.58890193]
[ 0.04784198  0.6166575  -0.02153393 -0.88464197]
[ 0.06017513  0.81206511 -0.03922677 -1.18401592]
[ 0.07641644  1.00767341 -0.06290709 -1.48873236]
[ 0.0965699   1.2035026  -0.09268174 -1.80037761]
[ 0.12063996  1.39953062 -0.12868929 -2.12036709]
[ 0.14863057  1.20590437 -0.17109663 -1.87005969]
[ 0.17274866  1.01301683 -0.20849783 -1.63501035]
Episode #19 finished after 19 timesteps


## Understanding agent actions
The environments in gym have `Space` objects that describe the valid actions and observations.


In [4]:
env.action_space

Discrete(2)

In [5]:
env.observation_space

Box(4,)

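`Discrete(n)` describes a fixed set of `n` non-negative integers, `{0, 1, ..., n-1}`. <br>
For example, `Discrete(3)` means that the action can take the values `{0, 1, 2}`; the `Discrete(2)` above allows only the actions `0` and `1`. <br>
`Box` represents an n-dimensional array of numbers (here, n=4, matching the four values in each printed observation).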
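
The semantics of the two space types can be sketched in plain Python. `ToyDiscrete` and `ToyBox` below are illustrative stand-ins written for this note, not gym's own classes; they only mirror the `sample()` and `contains()` methods that gym spaces provide. The bounds passed to `ToyBox` are made-up values, not CartPole's actual limits.

```python
import random


class ToyDiscrete:
    """Illustrative stand-in (not gym's class): the integers 0..n-1."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        # Pick one of the n valid values uniformly at random.
        return random.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class ToyBox:
    """Illustrative stand-in (not gym's class): a bounded fixed-length vector."""

    def __init__(self, low, high):
        assert len(low) == len(high)
        self.low = list(low)
        self.high = list(high)
        self.shape = (len(self.low),)

    def sample(self):
        # Draw each component independently within its own bounds.
        return [random.uniform(lo, hi) for lo, hi in zip(self.low, self.high)]

    def contains(self, x):
        return len(x) == len(self.low) and all(
            lo <= v <= hi for v, lo, hi in zip(x, self.low, self.high)
        )


action_space = ToyDiscrete(3)
valid_actions = [a for a in range(-1, 5) if action_space.contains(a)]
print(valid_actions)

# Made-up bounds for a 4-dimensional observation.
observation_space = ToyBox(low=[-1.0] * 4, high=[1.0] * 4)
print(observation_space.shape)
print(observation_space.contains(observation_space.sample()))
```

A random agent only ever needs `sample()`, which is why `env.action_space.sample()` was enough to drive the loops above.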

In [7]:
env.P

AttributeError: 'TimeLimit' object has no attribute 'P'