## Environment 1: Racetrack

Consider driving a race car around a turn. You want
to go as fast as possible, but not so fast as to run off the track. 

The racetrack is a grid
– it begins with a vertical section 10 cells wide and 30 cells high, followed by a right turn
into a horizontal section 15 cells wide and 10 cells tall. 

The starting line is the row of cells
at the bottom of the first section. 

The finish line is the column of cells at the right of the
horizontal section. 

The car begins at any cell on the starting line and each turn it is at one
of the grid positions. 

The velocity is also discrete, a number of grid cells moved horizontally
and vertically per time step. 

The actions are increments to the velocity components. 

Each
may be changed by +1, −1, or 0 in each step, for a total of nine (3 ×3) actions. 

The
vertical velocity component is restricted to be nonnegative. 

Both velocity components are
restricted to be less than 5, and they cannot both be zero except at the starting line. 

Each
episode begins in one of the randomly selected start states with both velocity components
zero and ends when the car crosses the finish line. 

If the car attempts to go through the
track boundary (anywhere by the finish line) it crashes and the episode ends.

In [1]:
import random


class RacetrackEnv:
    def __init__(self, config):
        self.track_width = config["width"]

        self.x_max = self.track_width + config["turn"]
        self.y_max = self.track_width + config["straight"]

        self.x_inner = range(self.track_width, self.x_max)
        self.y_inner = range(0, config["straight"])

        self.action_space = [(dx, dy) for dx in [-1, 0, 1] for dy in [-1, 0, 1]]

        self.reset()

    def reset(self, seed=None):
        self.velocity = (0, 0)
        self.position = (random.randint(0, self.track_width - 1), 0)

    def __check_bounds(self, position) -> bool:
        return (
            0 <= position[0] < self.x_max
            and 0 <= position[1] < self.y_max
            and not (position[0] in self.x_inner and position[1] in self.y_inner)
        )

    def __bound(self, val, min_val, max_val):
        return max(min_val, min(val, max_val))

    def step(self, action):
        if 0 > action or action >= len(self.action_space):
            raise ValueError("Invalid action")
        # get acceleration from action
        acceleration = self.action_space[action]
        # apply acceleration with bounds
        self.velocity = (
            self.__bound(self.velocity[0] + acceleration[0], 1, 5),
            self.__bound(self.velocity[1] + acceleration[1], -5, 5),
        )

        # update position
        new_position = (
            self.position[0] + self.velocity[1],
            self.position[1] + self.velocity[0],
        )

        if not self.__check_bounds(new_position):
            # reset if out of bounds
            # self.reset()
            return {"state": [self.position, self.velocity], "r": -1, "done": True}

        self.position = new_position
        done = self.position[1] >= self.y_max - 1
        reward = 5 if done else 1
        return {"state": [self.position, self.velocity], "r": reward, "done": done}

    def render(self, past_positions=None):
        # generated using copilot auto-complete
        track = [["."] * self.x_max for _ in range(self.y_max)]
        for x in self.x_inner:
            for y in self.y_inner:
                track[y][x] = "#"
        if past_positions:
            for pos in past_positions:
                track[pos[1]][pos[0]] = "*"
        track[self.position[1]][self.position[0]] = "X"
        print("\n".join("".join(row) for row in reversed(track)))

In [2]:
config = {
    "width": 10,
    "straight": 30,
    "turn": 15,
}
num_episodes = 1000

env = RacetrackEnv(config)
results = []
for episode in range(num_episodes):
    state = env.reset()
    done = False
    past_positions = []
    total_reward = 0
    while not done:
        action = random.randint(0, len(env.action_space) - 1)
        result = env.step(action)
        state, reward, done = result["state"], result["r"], result["done"]
        past_positions.append(env.position)
        total_reward += reward
    results.append(total_reward)
    # env.render(past_positions)

print(f"Average reward over {num_episodes} episodes: {sum(results) / len(results)}")
print(f"Max reward over {num_episodes} episodes: {max(results)}")

Average reward over 1000 episodes: 4.78
Max reward over 1000 episodes: 27


## Environment 2 – a chase. 

Consider a chase on a square field with each side of length
n. 

In the field is a predator and a player. 

The player begins in cell position (0, 0). 

The
predator begins in a randomly generated cell position. At position (n, n) is the player’s base.


Unlike the racetrack example, the player and predator can be at any location in the field,
i.e. any real number between 0 and n. 

The velocity of the player is continuous, a distance
moved horizontally and vertically per time step. 

The actions are increments to the velocity
components.

Each may be changed by +1, −1, or 0 in each step, for a total of nine (3 ×3)
actions.

 In this case they are stochastic; given a horizontal velocity at time t of V (t)

-----

The vertical velocity is the same. 

Both velocity components are restricted to be no more
than 5. 

Each turn, randomly choose either the predator or the player to move first. 

The
predator moves a distance of no more than 4 directly toward the player.

 The player moves
according to their current velocity (but must stay within the field of play). 

If the predator
lands on the cell occupied by the player then the player is caught and the episode ends. 

If
the player is within a distance of 0.5 of their base without being caught then they escape
and the episode ends.

In [3]:
import math
import random


class ChaseEnv:
    def __init__(self, config):
        self.n = config["length"]
        self.pred_step_size = config["predator_step_size"]
        self.base_pos = (self.n - 1, self.n - 1)

        self.action_space = [(dx, dy) for dx in [-1, 0, 1] for dy in [-1, 0, 1]]

        self.reset()

    def reset(self, seed=None):
        # reset positions
        self.player_pos = (0, 0)
        self.pred_pos = (random.randint(0, self.n - 1), random.randint(0, self.n - 1))
        # reset velocities
        self.player_vel = (0, 0)

    def __bound(self, val, min_val, max_val):
        return max(min_val, min(val, max_val))

    def __move_player(self, acceleration):
        # a = 0.5 to 1.5 * action acceleration
        stochastic_fun = lambda x: x * (random.random() + 0.5)
        stochastic_accel = (
            stochastic_fun(acceleration[0]),
            stochastic_fun(acceleration[1]),
        )

        # apply acceleration with bounds
        self.player_vel = (
            self.__bound(self.player_vel[0] + stochastic_accel[0], -5, 5),
            self.__bound(self.player_vel[1] + stochastic_accel[1], -5, 5),
        )
        # update position with bounds
        self.player_pos = (
            self.__bound(self.player_pos[0] + self.player_vel[0], 0, self.n - 1),
            self.__bound(self.player_pos[1] + self.player_vel[1], 0, self.n - 1),
        )

    def __move_predator(self):
        # move towards player
        dx = self.player_pos[0] - self.pred_pos[0]
        dy = self.player_pos[1] - self.pred_pos[1]
        # normalize and scale to step size
        dist = math.hypot(dx, dy)
        dx = (dx / dist) * self.pred_step_size if dist != 0 else 0
        dy = (dy / dist) * self.pred_step_size if dist != 0 else 0
        # update predeator position
        self.pred_pos = (
            self.__bound(self.pred_pos[0] + dx, 0, self.n - 1),
            self.__bound(self.pred_pos[1] + dy, 0, self.n - 1),
        )

    def step(self, action):
        # get new player velocity
        if 0 > action or action >= len(self.action_space):
            raise ValueError("Invalid action")

        # update positions w/ random who goes first
        if random.random() < 0.5:
            self.__move_player(self.action_space[action])
            if math.dist(self.player_pos, self.base_pos) < 0.5:
                return {
                    "state": [self.player_pos, self.player_vel, self.pred_pos],
                    "r": 10,
                    "done": True,
                }
            self.__move_predator()
            if math.dist(self.pred_pos, self.player_pos) < 0.01:
                return {
                    "state": [self.player_pos, self.player_vel, self.pred_pos],
                    "r": -10,
                    "done": True,
                }
        else:
            self.__move_predator()
            if math.dist(self.pred_pos, self.player_pos) < 0.01:
                return {
                    "state": [self.player_pos, self.player_vel, self.pred_pos],
                    "r": -10,
                    "done": True,
                }
            self.__move_player(self.action_space[action])
            if math.dist(self.player_pos, self.base_pos) < 0.5:
                return {
                    "state": [self.player_pos, self.player_vel, self.pred_pos],
                    "r": 10,
                    "done": True,
                }
        return {
            "state": [self.player_pos, self.player_vel, self.pred_pos],
            "r": -1,
            "done": False,
        }

    def render(self, past_positions=None):
        # generated using copilot auto-complete
        grid = [["."] * self.n for _ in range(self.n)]
        if past_positions:
            for pos in past_positions:
                grid[int(pos[0][1])][int(pos[0][0])] = "*"
                grid[int(pos[1][1])][int(pos[1][0])] = "o"
        grid[self.base_pos[1]][self.base_pos[0]] = "B"
        grid[int(self.pred_pos[1])][int(self.pred_pos[0])] = "P"
        grid[int(self.player_pos[1])][int(self.player_pos[0])] = "X"
        print("\n".join("".join(row) for row in reversed(grid)))

In [4]:
config = {
    "length": 10,
    "predator_step_size": 4
}
num_episodes = 1000

env = ChaseEnv(config)
results = []
for episode in range(num_episodes):
    state = env.reset()
    done = False
    past_positions = []
    total_reward = 0
    while not done:
        action = random.randint(0, len(env.action_space) - 1)
        result = env.step(action)
        state, reward, done = result["state"], result["r"], result["done"]
        past_positions.append((env.player_pos, env.pred_pos))
        total_reward += reward
    results.append(total_reward)
    # env.render(past_positions)

print(f"Average reward over {num_episodes} episodes: {sum(results) / len(results)}")
print(f"Max reward over {num_episodes} episodes: {max(results)}")

Average reward over 1000 episodes: -12.954
Max reward over 1000 episodes: 7
