# Custom Environments

### ```WalkBot-v0``` 

WalkBot with discrete actions; the reward is based on the shortest path. [Notebook](walkbot_v0.ipynb)

### ```WalkBot-v1```

WalkBot with discrete actions; the reward is based on the shortest path, and terminal states are absorbing. [Notebook](walkbot_v1.ipynb)

### ```WalkBot-v2```

WalkBot with discrete actions; the reward is the QR (quadratic) cost with

$$
Q = \begin{bmatrix}
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 0 & 1
\end{bmatrix},\, 
R = \begin{bmatrix}
0 & 0 \\
0 & 0
\end{bmatrix}
$$

NB: the control $u$ is fixed for each action.
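A minimal sketch of how such a QR cost could be evaluated, assuming the state is ordered as $[x, y, v_x, v_y]$, the cost is measured on the raw state, and the reward is the negated cost. The helper `qr_cost` is hypothetical and is not taken from the environment's code.

```python
import numpy as np

# Weights from the WalkBot-v2 description.
Q = np.eye(4)          # state weight
R = np.zeros((2, 2))   # control weight (zero, so controls are not penalised)


def qr_cost(x, u, Q=Q, R=R):
    """Quadratic cost x^T Q x + u^T R u (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    return float(x @ Q @ x + u @ R @ u)


# Example state [x, y, v_x, v_y] and control [a_x, a_y].
cost = qr_cost([3.0, 4.0, 0.0, 0.0], [1.0, 0.0])
reward = -cost  # assumption: reward is the negated cost
```

Because $R$ is zero here, only the state contributes to the cost.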

[Notebook](walkbot_v2.ipynb)

### ```WalkBot-v3```

As above, but with perturbation $w_t$ defined as follows

$$
w_t = \begin{bmatrix}
0 \\
0 \\
\mathcal{N}(0,\sigma_{vx}^2)\\
\mathcal{N}(0,\sigma_{vy}^2)\\
\end{bmatrix}
$$
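The perturbation can be sampled as in the sketch below, which assumes NumPy's `Generator` API. The function name and the values of $\sigma_{vx}$ and $\sigma_{vy}$ are illustrative placeholders, not the values used by the environment.

```python
import numpy as np


def sample_perturbation(rng, sigma_vx, sigma_vy):
    """Draw w_t: no noise on positions, Gaussian noise on velocities."""
    return np.array([
        0.0,
        0.0,
        rng.normal(0.0, sigma_vx),  # second argument is the standard deviation
        rng.normal(0.0, sigma_vy),
    ])


rng = np.random.default_rng(0)
# Hypothetical noise levels, chosen only for illustration.
w = sample_perturbation(rng, sigma_vx=0.1, sigma_vy=0.1)
```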

[Notebook](walkbot_v3.ipynb)

### WalkBot-RandomInit-v0

Like ```WalkBot-v2```, but with a random initial state

$$
x_0 = \begin{bmatrix}
\mathcal{N}(5, 2)\\
\mathcal{N}(5, 2)\\
\mathcal{N}(0.05, 2)\\
\mathcal{N}(0.05, 2)
\end{bmatrix}
$$
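A sketch of drawing such an initial state with NumPy. The notation $\mathcal{N}(\mu, 2)$ does not say whether 2 is a variance or a standard deviation; the code below assumes a standard deviation, and the names `MEAN`, `SPREAD` and `sample_initial_state` are hypothetical.

```python
import numpy as np

MEAN = np.array([5.0, 5.0, 0.05, 0.05])
# Assumption: the second parameter is a standard deviation.
# If it is a variance, use np.sqrt(2.0) instead.
SPREAD = 2.0


def sample_initial_state(rng):
    """Draw x_0 with independent Gaussian components."""
    return rng.normal(loc=MEAN, scale=SPREAD)


rng = np.random.default_rng(0)
x0 = sample_initial_state(rng)
```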

[Notebook](walkbot_random_init_v0)

### WalkBot-RandomInit-v1

Like ```WalkBot-v3```, but randomising the initial state $x_0$ as ```WalkBot-RandomInit-v0``` does.

[Notebook](walkbot_random_init_v1.ipynb)

### ContinuousWalkBot-v0

State

$$
\mathbf{x} = \begin{bmatrix} 
x \\ 
y \\ 
v_x \\ 
v_y
\end{bmatrix}
$$

control

$$
\mathbf{u} = \begin{bmatrix}
a_x \\
a_y
\end{bmatrix}
$$

subject to constraints

$$
\begin{align}
 0 \leq x \leq 10 \\
 0 \leq y \leq 10 \\
 -10 \leq v_x \leq 10 \\
 -10 \leq v_y \leq 10 \\
 -2 \leq a_x \leq 2 \\
 -2 \leq a_y \leq 2 \\
 v_{x}^2 + v_{y}^2 \geq 0.01
\end{align}
$$

and cost function

$$
Q = \begin{bmatrix}
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 0 & 1
\end{bmatrix},\, 
R = \begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
$$
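The constraints and cost above can be checked numerically along the following lines. This is a sketch only: the helpers `is_feasible` and `stage_cost` are hypothetical, and how the environment itself enforces the bounds (clipping, termination, or otherwise) is not specified here.

```python
import numpy as np

# Box bounds copied from the constraints; state is [x, y, v_x, v_y].
STATE_LOW = np.array([0.0, 0.0, -10.0, -10.0])
STATE_HIGH = np.array([10.0, 10.0, 10.0, 10.0])
CONTROL_LOW = np.array([-2.0, -2.0])
CONTROL_HIGH = np.array([2.0, 2.0])
MIN_SPEED_SQ = 0.01  # lower bound on v_x^2 + v_y^2

Q = np.eye(4)  # state weight
R = np.eye(2)  # control weight


def is_feasible(x, u):
    """Return True if state x and control u satisfy every constraint."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    in_state_box = np.all((STATE_LOW <= x) & (x <= STATE_HIGH))
    in_control_box = np.all((CONTROL_LOW <= u) & (u <= CONTROL_HIGH))
    fast_enough = x[2] ** 2 + x[3] ** 2 >= MIN_SPEED_SQ
    return bool(in_state_box and in_control_box and fast_enough)


def stage_cost(x, u):
    """Quadratic cost x^T Q x + u^T R u for one time step."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    return float(x @ Q @ x + u @ R @ u)
```

For example, `is_feasible([5, 5, 0, 0], [0, 0])` is false because the speed constraint is violated when the bot is at rest.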

State matrix

$$
A = \begin{bmatrix}
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{bmatrix}
$$

Input matrix

$$
B = \begin{bmatrix}
0 & 0 \\
0 & 0 \\
1 & 0 \\
0 & 1 \\
\end{bmatrix}
$$
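Reading $A$ and $B$ as the continuous-time model $\dot{\mathbf{x}} = A\mathbf{x} + B\mathbf{u}$ (a double integrator) is an assumption, as is the forward-Euler discretisation and the step size `dt` used in this sketch; the environment may integrate differently.

```python
import numpy as np

A = np.array([
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
B = np.array([
    [0.0, 0.0],
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
])


def euler_step(x, u, dt=0.1):
    """One forward-Euler step of x' = A x + B u (dt is a hypothetical value)."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    return x + dt * (A @ x + B @ u)


# Start at (5, 5) moving along x at speed 1, accelerate along y.
x = np.array([5.0, 5.0, 1.0, 0.0])
x_next = euler_step(x, [0.0, 2.0])
```

With these numbers the position advances by `dt * v_x` along $x$, and $v_y$ increases by `dt * a_y`.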

[Notebook with model demonstration](continuous_walkbot_demo.ipynb)

[Notebook with Baseline Policies](continuous_walkbot_v0.ipynb)

### ContinuousWalkBot-v1

The task has the same definition as ```ContinuousWalkBot-v0```, but adds noise to the actual velocities in the same way ```WalkBot-v3``` does.

[Notebook](continuous_walkbot_v1.ipynb)

### ContinuousWalkBot-RandomInit-v0

The task is like ```ContinuousWalkBot-v0```, but with random initial states.

[Notebook](continuous_walkbot_random_init_v0.ipynb)

### ContinuousWalkBot-RandomInit-v1

The task is like ```ContinuousWalkBot-v1```, but with random initial states.

[Notebook](continuous_walkbot_random_init_v1.ipynb)

### ```ContinuousMountainCar-v1``` 
Continuous Mountain Car with the QR cost as reward. [Notebook](mountain_car_continuous_v1.ipynb)

# "Hard" Reinforcement Learning domains

### ```RechtLQR-v0```

Benjamin Recht's LQR benchmark, the "Simple HVAC" [problem](http://argmin.net/2018/05/11/coarse-id-control/). Applying random controls makes the servers' temperature grow exponentially.
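The instability can be illustrated with a short simulation. The matrices below are an assumption: they are written from memory of the example in the linked post and should be checked against it before being relied on. The horizon and the Gaussian random controls are likewise illustrative choices.

```python
import numpy as np

# Assumption: dynamics x_{t+1} = A x_t + B u_t with these matrices,
# recalled from the linked post; verify them there.
A = np.array([
    [1.01, 0.01, 0.00],
    [0.01, 1.01, 0.01],
    [0.00, 0.01, 1.01],
])
B = np.eye(3)

# A spectral radius above 1 means the uncontrolled system is unstable.
spectral_radius = max(abs(np.linalg.eigvals(A)))

rng = np.random.default_rng(0)
x = np.zeros(3)
norms = []
for _ in range(500):
    u = rng.normal(size=3)  # random control
    x = A @ x + B @ u
    norms.append(np.linalg.norm(x))
```

Under this assumed model the spectral radius is slightly above 1, so the state norm recorded in `norms` grows geometrically over a long enough horizon.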

[Notebook](recht_lqr_v0.ipynb)

### Antishape-v1

John Langford's _Antishape_ [domain](https://github.com/JohnLangford/RL_acid). Here's his description of the problem:

If rewards in the vicinity of a start state favor
staying near a start state, then reward values far from the start
state are irrelevant.  The name comes from "reward shaping" which is
typically used to make RL easier.  Here, we use it to make RL harder.

[Notebook](antishape_v1.ipynb)

### Combolock-v1

John Langford's _Combolock_ [domain](https://github.com/JohnLangford/RL_acid). Here's his description of the problem:

When most actions lead towards the start state
uniform random exploration is relatively useless.  The name comes
from "combination lock" where knowing the right sequence of steps to
take is the problem.

[Notebook](combolock_v1.ipynb)