In [1]:
import gym

# Classic gym API (gym < 0.26): reset() returns only the observation
env = gym.make("CartPole-v1")
obs = env.reset()
print(obs)    # print out initial observation

[-0.0247882   0.04667496 -0.0186847  -0.02538045]


In [2]:
obs, _, _, _ = env.step(0)    # take action 0 (push the cart to the left)
print(obs)                    # print out next observation

[-0.0238547  -0.14817412 -0.01919231  0.26134918]


# Learning Goal

## The learning goal in the `CartPole-v1` environment

<img src="images/cartpole/cartpole.png" width="700"></img>

## We have to engineer a reward such that the following holds.

<center><h3>maximization of cumulative reward $\equiv$ real world outcome</h3></center>

## What does "pole staying upright" mean?

- The pole angle must stay within $\pm 12^{\circ}$ of the vertical
<img src="images/upright/1.png" width="650"></img>

- The cart must stay within the environment bounds (cart position within $\pm 2.4$)
<img src="images/upright/2.png" width="650"></img>
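
The two conditions above can be written as a single check on an observation. This is a minimal sketch, not the environment's own termination code: the function name `is_upright` and the constant names are hypothetical, and it assumes the observation layout `[cart position, cart velocity, pole angle in radians, pole angular velocity]`.

```python
import math

# Limits as stated above: 12 degrees from the vertical, 2.4 units from the center.
ANGLE_LIMIT_RAD = math.radians(12)
POSITION_LIMIT = 2.4


def is_upright(obs):
    """Return True if the observation satisfies both 'upright' conditions.

    Assumed layout: [cart position, cart velocity,
                     pole angle (radians), pole angular velocity].
    """
    cart_position, _, pole_angle, _ = obs
    angle_ok = abs(pole_angle) <= ANGLE_LIMIT_RAD
    position_ok = abs(cart_position) <= POSITION_LIMIT
    return angle_ok and position_ok


# The initial observation printed earlier is well inside both limits.
print(is_upright([-0.0247882, 0.04667496, -0.0186847, -0.02538045]))
# A pole tilted 13 degrees violates the angle condition.
print(is_upright([0.0, 0.0, math.radians(13.0), 0.0]))
```

The first call should report `True` and the second `False`.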

## The reward function

<img src="images/reward_examples/2.png" width="400"></img>
<img src="images/reward_examples/3.png" width="400"></img>

- Maximizing the cumulative reward is equivalent to maximizing the number of steps the pole stays upright.
- $\textrm{duration} = 0.02\,\textrm{s} \times \textrm{num steps}$, since each step advances the simulation by $0.02\,\textrm{s}$
- Maximizing the cumulative reward is equivalent to maximizing the duration the pole stays upright.
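
The chain of equivalences above can be checked with a little arithmetic. This is a sketch that assumes a reward of 1.0 for every step the pole stays upright; the names `SECONDS_PER_STEP` and `upright_duration` are made up for illustration.

```python
SECONDS_PER_STEP = 0.02  # simulated time per environment step, from the formula above


def upright_duration(cumulative_reward, reward_per_step=1.0):
    """Convert a cumulative reward into seconds of upright time.

    Assumes every upright step pays `reward_per_step`, so the number of
    upright steps is cumulative_reward / reward_per_step.
    """
    num_steps = cumulative_reward / reward_per_step
    return SECONDS_PER_STEP * num_steps


# A larger cumulative reward always means a longer upright duration.
for total in (10.0, 100.0, 500.0):
    print(f"cumulative reward {total:>5} -> {upright_duration(total):.2f} s upright")
```

Because the conversion is a positive constant factor, maximizing one quantity maximizes the other.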

<img src="images/cartpole/2.png" width="700"></img>

## Rewards in `gym`

<img src="images/api/1.png" width="500"></img>

## The second element of the return value of `env.step(action)` is the reward

In [4]:
obs, reward, _, _ = env.step(0)
print(obs)
print(reward)

[-0.03367853 -0.53793993 -0.00300698  0.83616768]
1.0
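
Summing the per-step reward over an episode gives the cumulative reward. The sketch below assumes the classic 4-tuple `step` API used in this notebook, where the third return value is the `done` flag. To stay self-contained it uses `StubEnv`, a hypothetical stand-in that is not part of `gym`; `run_episode` is likewise an illustrative helper.

```python
class StubEnv:
    """Hypothetical stand-in for a gym environment (classic 4-tuple step API).

    Pays reward 1.0 on every step and reports done=True on step `horizon`.
    """

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, {}


def run_episode(env, policy):
    """Run one episode and return the cumulative reward."""
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total_reward += reward  # the second element of step()'s return value
    return total_reward


# A policy that always picks action 0, as in the cells above.
print(run_episode(StubEnv(horizon=5), policy=lambda obs: 0))
```

With an installed classic `gym`, passing `gym.make("CartPole-v1")` in place of `StubEnv(...)` should work the same way.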


## Demo of the reward function in `CartPole-v1`

<img src="images/reward_examples/2.png" width="400"></img>
<img src="images/reward_examples/3.png" width="400"></img>

In [5]:
import numpy as np

obs = env.reset()
for _ in range(30):
    print(f"Pole angle at step start: {np.degrees(obs[2])}", end=" ")
    # `done` is ignored here, so stepping continues after the pole leaves the 12-degree range
    obs, reward, _, _ = env.step(0)
    print(f"Reward in this step: {reward}")

Pole angle at step start: -1.9815118943994086 Reward in this step: 1.0
Pole angle at step start: -1.9753829089133865 Reward in this step: 1.0
Pole angle at step start: -1.6465941014009762 Reward in this step: 1.0
Pole angle at step start: -0.9951006562956507 Reward in this step: 1.0
Pole angle at step start: -0.018748809799087378 Reward in this step: 1.0
Pole angle at step start: 1.2866772478336586 Reward in this step: 1.0
Pole angle at step start: 2.9273756163276556 Reward in this step: 1.0
Pole angle at step start: 4.911408071092779 Reward in this step: 1.0
Pole angle at step start: 7.248533431734815 Reward in this step: 1.0
Pole angle at step start: 9.949981079522844 Reward in this step: 1.0
Pole angle at step start: 13.028151032757677 Reward in this step: 0.0
Pole angle at step start: 16.496226864029968 Reward in this step: 0.0
Pole angle at step start: 20.367690254933308 Reward in this step: 0.0
Pole angle at step start: 24.655732323180533 Reward in this step: 0.0
Pole angle at st

