# Introduction to Gym Environments

In this exercise, we will evolve a controller for the widely used benchmark problem [Bipedal Walker](https://gymnasium.farama.org/environments/box2d/bipedal_walker/) (Brockman et al. 2016). 

## Installing dependencies

In [1]:
pip install swig

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install gymnasium[box2d]

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Gym API

The Gym API is the standard interface for defining learning environments throughout research and industry. It follows the description of a Markov Decision Process (MDP). At each time step $t$, the environment is in some state $s$ and the decision maker (our agent, also called the policy) chooses an action $a$. Conditioned on the previous state and the action, the environment moves into a new state $s'$ and gives the agent a corresponding reward $r_t$. This process is illustrated in the following figure:

<img src="learning_loop.svg" width="500">
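
In symbols, the loop in the figure can be sketched as follows. This is only one common way to write it down: it assumes a deterministic policy $\pi$, a finite horizon $T$, and an undiscounted sum of rewards, none of which is fixed by the description above. Here $s_t$ and $s_{t+1}$ play the roles of $s$ and $s'$, and $P$ denotes the (unknown) transition distribution of the environment.

$$
a_t = \pi(s_t), \qquad s_{t+1} \sim P(\,\cdot \mid s_t, a_t), \qquad R(\pi) = \sum_{t=0}^{T-1} r_t
$$

Under these assumptions, "solving" an environment means finding a policy $\pi$ with a high total reward $R(\pi)$.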

Each Gym environment provides two main functions: `step` and `reset`.
1. `step` takes an `action` and returns the next environment state as an `observation`, together with the reward, two flags (`terminated`, `truncated`) that signal the end of an episode, and an `info` dictionary. Actions correspond e.g. to the motor signals of our robot; observations describe its sensor signals.
2. `reset` sets the environment back to an initial state, e.g. after an episode terminates because the environment reached a critical state (the robot falls down and cannot get back into a safe state) or is truncated after a maximum number of time steps.
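
To make the two functions concrete, here is a minimal sketch of a toy environment that mimics the shape of the `reset`/`step` interface without depending on Gym itself. The class `CorridorEnv`, its parameters `goal` and `max_steps`, and the reward values are hypothetical and chosen only for illustration; a real Gym environment would additionally declare its action and observation spaces.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk along a 1-D corridor to a goal cell."""

    def __init__(self, goal=5, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        # Back to the initial state; returns (observation, info) like Gym does.
        self.position = 0
        self.steps = 0
        return self.position, {}

    def step(self, action):
        # action is -1 (step left) or +1 (step right).
        self.position += action
        self.steps += 1
        # Illustrative reward: small penalty per step, bonus at the goal.
        reward = 1.0 if self.position == self.goal else -0.1
        # terminated: the task itself has ended (goal reached).
        terminated = self.position == self.goal
        # truncated: the step budget ran out before the task ended.
        truncated = (not terminated) and self.steps >= self.max_steps
        return self.position, reward, terminated, truncated, {}


env = CorridorEnv()
observation, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    # Trivial policy: always step to the right.
    observation, reward, terminated, truncated, info = env.step(+1)
print(observation, terminated, truncated)  # prints: 5 True False
```

Always stepping right reaches the goal after five steps, so the episode ends with `terminated` set. A policy that never reaches the goal would instead end with `truncated` set once `max_steps` is exhausted.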

Gym (continued as Gymnasium since 2022) is frequently used to benchmark novel learning algorithms against each other. It provides a large number of environments, e.g. video games with image observation spaces or simulated robots with continuous action spaces.

### Examples of environments with discrete action spaces

<img src="assault.gif" height="200"><img src="breakout.gif" height="200"><img src="montezuma_revenge.gif" height="200"><img src="pong.gif" height="200"><img src="space_invaders.gif" height="200">

### Examples of environments with continuous action spaces
<img src="reacher.gif" height="100"/><img src="pusher.gif" height="100"/>


## Example Usage

Here is an example of how to use Gym from Python. Note that, for now, we select actions at random with `env.action_space.sample()`.

In [3]:
import gymnasium as gym
# Note: newer Gymnasium releases (1.0 and later) register this environment as "LunarLander-v3"
environment = gym.make("LunarLander-v2")
observation, info = environment.reset(seed=0)

for step in range(1000):
    # select a random action
    action = environment.action_space.sample()

    # execute the action in the environment
    observation, reward, terminated, truncated, info = environment.step(action)

    # start a new episode if the lander crashed, landed, left the screen, or hit the time limit
    if terminated or truncated:
        observation, info = environment.reset(seed=0)
    
    if step % 100 == 0:
        print(f"Reward at step {step}: {reward}")

environment.close()

Reward at step 0: 2.567178847980938
Reward at step 100: 0.6089772605790926
Reward at step 200: -2.3555817370705356
Reward at step 300: -1.4643772833698165
Reward at step 400: 3.5573684844788263
Reward at step 500: -1.4525940823439782
Reward at step 600: 2.16767366648956
Reward at step 700: -0.3106328884129425
Reward at step 800: -1.3630319897515324
Reward at step 900: -1.8764936616943533


Since we select our actions randomly, we do not observe an increase in performance. The goal of our exercise is to use evolution to generate a policy that is able to solve the environment (by maximising the reward). The following example adds a graphical rendering of the environment simulation.

In [4]:
import gymnasium as gym
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
from IPython.display import display, clear_output
from io import BytesIO
from ipywidgets import widgets

# Create the environment
env = gym.make("LunarLander-v2", render_mode="rgb_array")
env.reset()

# Create a widget for displaying the environment
image_widget = widgets.Image(format='jpeg')

# Display the widget
display(image_widget)

def update_image(env):
    img = env.render()
    img = Image.fromarray(img)
    with BytesIO() as output:
        img.save(output, format="JPEG")
        image_data = output.getvalue()
        image_widget.value = image_data

# Run and update environment
for _ in range(1000):
    action = env.action_space.sample()  # Replace with your action policy
    observation, reward, terminated, truncated, info = env.step(action)
    
    if terminated or truncated:
        observation, info = env.reset()
    
    update_image(env)
    clear_output(wait=True)

env.close()

Image(value=b'', format='jpeg')
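
The loops above print the reward of single steps, but to compare policies one usually sums the rewards of a whole episode. A minimal sketch of such a helper follows; the name `run_episode` and its parameters are hypothetical, and it only assumes that `env` follows the `reset`/`step` signatures used in the cells above.

```python
def run_episode(env, policy, max_steps=1000, seed=None):
    """Run a single episode and return the sum of its rewards."""
    observation, info = env.reset(seed=seed)
    episode_return = 0.0
    for _ in range(max_steps):
        # The policy maps an observation to an action.
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        episode_return += float(reward)
        # Stop at the end of the episode instead of resetting.
        if terminated or truncated:
            break
    return episode_return
```

With the Lunar Lander environment from above, a random baseline could be evaluated with `run_episode(env, lambda observation: env.action_space.sample(), seed=0)`. An evolved policy would simply replace the lambda.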

## Vectorized Environments
The environments above each run one agent in one environment. However, evolutionary algorithms generate populations of different solutions, all of which have to be evaluated in our simulator. This is where the vectorized environments provided by Gym come in handy.

In [10]:
import gymnasium as gym
import numpy as np

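# Note: newer Gymnasium releases (1.0 and later) replace this call with
# gym.make_vec("BipedalWalker-v3", num_envs=2, vectorization_mode="sync")
envs = gym.vector.make("BipedalWalker-v3", num_envs=2, asynchronous=False)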
observations, infos = envs.reset(seed=42)

total_steps = 2000

for step in range(total_steps+1):
    actions = envs.action_space.sample()
    observations, rewards, terminations, truncations, infos = envs.step(actions)

    if step % 100 == 0 or step == total_steps:
        # print(f"Observations at step {step}: {observations}")
        print(f"Rewards at step {step}: {rewards}")

# release the simulator resources once the loop is done
envs.close()

Rewards at step 0: [-0.22549972 -0.119934  ]
Rewards at step 100: [-0.17139345 -0.11991324]
Rewards at step 200: [-0.42181716 -0.0340379 ]
Rewards at step 300: [ 0.00606177 -0.01138707]
Rewards at step 400: [ 0.04695277 -0.05570645]
Rewards at step 500: [-0.03434103 -0.15757243]
Rewards at step 600: [-0.00609256  0.01684054]
Rewards at step 700: [-0.03342197 -0.19368984]
Rewards at step 800: [-0.25804086 -0.03672408]
Rewards at step 900: [-0.23872143  0.00505408]
Rewards at step 1000: [-0.15324709  0.03204482]
Rewards at step 1100: [-0.03446048 -0.07784076]
Rewards at step 1200: [-0.03603886 -0.07487069]
Rewards at step 1300: [-0.00501487 -0.10207238]
Rewards at step 1400: [-0.04403143 -0.13639763]
Rewards at step 1500: [-0.14361706 -0.06501747]
Rewards at step 1600: [-0.0947503  -0.13591045]
Rewards at step 1700: [ 0.18460886 -0.20983365]
Rewards at step 1800: [-0.06944387 -0.10652284]
Rewards at step 1900: [-0.20537487  0.07470934]
Rewards at step 2000: [-0.09714866 -0.14178289]
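
To connect the vectorized environment to a population, each sub-environment can be driven by a different individual. The following is a rough sketch under several assumptions that are not part of the exercise text: every individual is a linear policy stored as one weight matrix, fitness is the summed reward of the first episode only, and the helper name `evaluate_population` is hypothetical. It relies on vectorized environments resetting finished sub-environments on their own, which is why a mask is used to stop counting.

```python
import numpy as np


def evaluate_population(envs, weights, max_steps=1000, seed=None):
    """Return one fitness value (first-episode return) per individual.

    envs    : vectorized environment with one sub-environment per individual
    weights : array of shape (population, action_dim, observation_dim),
              one hypothetical linear policy per individual
    """
    population = weights.shape[0]
    observations, infos = envs.reset(seed=seed)
    fitness = np.zeros(population)
    alive = np.ones(population, dtype=bool)  # still in its first episode?

    for _ in range(max_steps):
        # Batched linear policy; tanh squashes actions into [-1, 1].
        actions = np.tanh(np.einsum("pao,po->pa", weights, observations))
        observations, rewards, terminations, truncations, infos = envs.step(actions)
        # Only individuals that have not finished yet collect reward.
        fitness += rewards * alive
        # Finished sub-environments restart automatically, so mask them out.
        alive &= ~(terminations | truncations)
        if not alive.any():
            break
    return fitness
```

For the two Bipedal Walker environments above, whose observations have 24 entries and whose actions have 4 entries in $[-1, 1]$, a random population could be created with `np.random.default_rng(0).normal(size=(2, 4, 24))` and passed in as `weights`.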
