# Solarboat custom OpenAI Gym environment

This notebook shows how to create an OpenAI Gym environment from a solarboat physics model. Two versions are created: one with a discrete state and action space, and one with a continuous state and action space. Your task: create an agent that learns.


## Solar boat info

Available sensor data:
- RPM of the motor
- position from GPS
- battery current to the motor
- solar power current
- battery voltage

Foreseen sensor data:
- pitot tube to detect water current
- wind speed
- sun intensity

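The available sensor data determines whether the RL problem can be modelled as an MDP (the state is fully observable) or as a POMDP (the state is only partially observable).

Actions that the captain can perform to influence the boat behavior: steering direction, turning the pump on/off, cooling (automatic), emergency stop, and motor power.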


## Some typical values of the Avans solar boats A1 and A2:

- boat speed of A1: 12 km/h
- expected boat speed of A2: 20 km/h
- expected mass of A2: 120-150 kg (without the captain, who weighs 70 kg)
- solar boat team expects 1400 W average power
- max. power motor?

solar panel power (actually available for boat propulsion):
- max (according to match rules) and actual solar panel surface: 6 m^2
- 1400 W average power (== 233 W/m^2) (== 23% efficiency, assuming sun power on a sunny day of 1 kW/m^2)
- motor efficiency: 80%
- drive train efficiency: 90%
- propeller efficiency: 40%
- ==> actual propulsion power from solar C_solar = 1 * 0.23 * 6 * 0.8 * 0.9 * 0.4 = 0.40 kW (energy after 300 sec = 120 kJ)
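
The chain of efficiencies above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative and are not used elsewhere in this notebook.

```python
# Propulsion power available from the solar panels (values from the list above).
sun_power_kw_per_m2 = 1.0      # assumed irradiance on a sunny day
panel_efficiency = 0.23
panel_area_m2 = 6.0
motor_efficiency = 0.8
drive_train_efficiency = 0.9
propeller_efficiency = 0.4

# Electrical power delivered by the panels.
panel_power_kw = sun_power_kw_per_m2 * panel_efficiency * panel_area_m2

# Power that actually ends up propelling the boat.
c_solar_kw = (panel_power_kw * motor_efficiency
              * drive_train_efficiency * propeller_efficiency)

# kW multiplied by seconds gives kJ.
energy_300s_kj = c_solar_kw * 300

print(f"panel power: {panel_power_kw:.2f} kW")
print(f"C_solar: {c_solar_kw:.2f} kW")
print(f"energy after 300 s: {energy_300s_kj:.1f} kJ")
```

Without intermediate rounding, the 300 s figure comes out slightly below the 120 kJ quoted above, because that value uses the rounded 0.40 kW.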


battery:
- max (according to match rules) and actual capacity battery of solar boat A2: 1.5 kWh
- solar boat battery is allowed to be fully charged at start
- ==> battery start energy E_battery_start = 1.5 kWh * 3600 s/h = 5400 kJ

boat drag:
- typical speed v: 5 m/s or 18 km/h
- ==> C_drag = 10 means "F(v) = 250 N at v = 5 m/s" (roughly the weight of a 25 kg mass, at 18 km/h), or 1250 W of dissipation
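
The battery and drag figures can be reproduced in the same way. The helper names `drag_force` and `drag_power` are made up for this sketch; they simply encode the quadratic drag assumption used in this notebook.

```python
def drag_force(c_drag, v):
    """Drag force in newton for speed v in m/s: F(v) = C_drag * v**2."""
    return c_drag * v ** 2


def drag_power(c_drag, v):
    """Dissipated power in watt: P(v) = F(v) * v = C_drag * v**3."""
    return drag_force(c_drag, v) * v


c_drag = 10.0
v = 5.0                          # m/s, i.e. 18 km/h

force_n = drag_force(c_drag, v)
power_w = drag_power(c_drag, v)
battery_start_kj = 1.5 * 3600    # 1.5 kWh expressed in kJ

print(f"drag force: {force_n:.0f} N")
print(f"drag power: {power_w:.0f} W")
print(f"battery start energy: {battery_start_kj:.0f} kJ")
```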


## Physics model of the environment

Model: battery, solar panel, electric motor, boat

Time t and speed v are variable.

Assumptions (for the analytical approach):
- drag proportional to the square of speed, assuming higher speeds (no linear component)
- boat has mass 0 (no kinetic energy) -> so instantaneous velocity change
- C_solar is the actual propulsion power (so includes efficiency of complete drive train, motor, propeller)
- C_solar is constant (independent of t, v and sun power) (e.g. no effects of motor temperature)
- sun power is constant => P_charge is constant
- battery has no power max
- battery has no maximum capacity (E_battery_capacity = infinity)
- battery has an initial charge level of E_battery_start
- motor has no power max

`
x(t) = v * t
F(v) = C_drag * v**2
E_drag(t,v) = F * x = C_drag * x**3 / t**2
P_drag(v) = E_drag(t,v) / t = F(v) * v = C_drag * v**3
E_charge(t) = C_solar * t
P_charge = E_charge(t) / t = C_solar
E(t,v) = E_battery_start + E_charge(t) - E_drag(t,v)
       = E_battery_start + C_solar * t - C_drag * v**3 * t
       = E_battery_start + C_solar * t - C_drag * x(t)**3 / t**2
`
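
As an illustration, the energy balance can be written as two small Python functions, one in terms of speed and one in terms of distance. The function names and the example numbers (5400 kJ start energy, 0.40 kW solar power, C_drag = 10, expressed here in kJ units as 0.010) are assumptions taken from the typical values listed earlier, not part of the environment code.

```python
import math


def battery_energy(t, v, e_start, c_solar, c_drag):
    """E(t, v) = E_battery_start + C_solar * t - C_drag * v**3 * t."""
    return e_start + c_solar * t - c_drag * v ** 3 * t


def battery_energy_from_distance(t, x, e_start, c_solar, c_drag):
    """Same energy, written with the distance x = v * t covered in time t."""
    return e_start + c_solar * t - c_drag * x ** 3 / t ** 2


# Example: 300 s at a constant 5 m/s, all energies in kJ and powers in kW.
e_start, c_solar, c_drag = 5400.0, 0.40, 0.010
t, v = 300.0, 5.0

e_speed_form = battery_energy(t, v, e_start, c_solar, c_drag)
e_distance_form = battery_energy_from_distance(t, v * t, e_start, c_solar, c_drag)

# Both forms describe the same quantity, so they should agree.
print(e_speed_form, e_distance_form, math.isclose(e_speed_form, e_distance_form))
```

In this example the boat collects 120 kJ from the sun and loses 375 kJ to drag, so roughly 5145 kJ remains after 300 s.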

Minimal time (== maximum speed) is reached when the boat sails at constant speed and the battery is exactly empty at the finish, i.e. E(t,v) = 0 at the finish.
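
Under these assumptions, the minimal time follows from solving E(t,v) = 0 for t with x fixed to the track length. The sketch below does this numerically with bisection. The function names are hypothetical, and the numbers are the ones used by the environments in this notebook (2000 m track, start energy 1800, solar coefficient 1.0, drag coefficient 25 / 1000).

```python
def finish_energy(t, distance, e_start, c_solar, c_drag):
    """Battery energy at the finish when `distance` is sailed at constant speed in time t."""
    return e_start + c_solar * t - c_drag * distance ** 3 / t ** 2


def minimal_time(distance, e_start, c_solar, c_drag, tol=1e-9):
    """Find the t for which finish_energy(t) = 0 by bisection.

    finish_energy grows monotonically with t (sailing slower is cheaper),
    so there is exactly one positive root.
    """
    lo, hi = 1e-6, 1.0
    while finish_energy(hi, distance, e_start, c_solar, c_drag) < 0.0:
        hi *= 2.0              # widen the bracket until the battery survives
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if finish_energy(mid, distance, e_start, c_solar, c_drag) < 0.0:
            lo = mid           # too fast: battery would be empty before the finish
        else:
            hi = mid           # slow enough: energy left at the finish
    return 0.5 * (lo + hi)


distance = 2000.0
t_min = minimal_time(distance, e_start=1800.0, c_solar=1.0, c_drag=25.0 / 1000)
v_opt = distance / t_min
print(f"minimal time: {t_min:.1f} s, constant speed: {v_opt:.2f} m/s")
```

With these numbers the minimal time is roughly 308 s, which corresponds to a constant speed of about 6.5 m/s. This gives a useful reference point when judging what a trained agent achieves.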


Below is an implementation of this physics model for the solarboat with a discrete state and action space. Questions to ask yourself:
* How big is the state space?
* How big is the action space?
* Does the reward function make sense?
* Is there a way to make the problem easier?

In [None]:
# solarboat_sprint_env

import math
import numpy as np
import gym
from gym import spaces
from gym.utils import seeding
import logging

log = logging.getLogger("solarboat_sprint_env")
log.setLevel(logging.INFO)
log.addHandler(logging.StreamHandler())


class SolarBoatSprintEnv(gym.Env):
    metadata = {'render.modes': ['human']}
    
    def __init__(self):
        self.max_position = 2000.0  # min_position = 0.0, start_position = 0.0
        self.max_energy = 1800.0  # goal_energy = 0.0, min_energy = 0.0, start_energy = max_energy
        self.goal_position = self.max_position
        self.max_speed = 10  # min_speed = 0
        self.n_speed_intervals = 1  # 4 means speed interval of 0.25 m/s
        self.solar_coef = 1.0
        self.drag_coef = 25.0
        self.timestep = 0
        self.max_time = 1000

        self.low = np.array([0.0, 0.0])  # [min_position, min_energy]
        self.high = np.array([self.max_position, self.max_energy])

        self.viewer = None

        self.action_space = spaces.Discrete(self.max_speed * self.n_speed_intervals + 1)
        self.observation_space = spaces.Box(self.low, self.high, dtype=np.float32)

        self.seed()
        
    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
    
    def step(self, action):
        assert self.action_space.contains(action), "%r (%s) invalid" % (action, type(action))

        self.timestep += 1
        position, energy = self.state
        velocity = action / self.n_speed_intervals
        energy += self.solar_coef - (self.drag_coef * velocity ** 3) / 1000
        #if energy > 0.0:  # position can only change if energy left
        position += velocity  # assume time steps of 1 second
        log.info("time: %d, position: %.0f, energy: %.0f, velocity: %.1f", self.timestep, position, energy, velocity)
        energy = np.clip(energy, 0.0, self.max_energy)
        position = np.clip(position, 0.0, self.max_position)

        # calculate reward
        done = False
        reward = 0.0
        reward += -1.0  # for every increase of timestep
        if position == self.goal_position:  # reached the finish
            done = True
            reward += 1000  # - energy  # punished for remaining energy
            log.debug("reached finish, remaining energy is %.0f", energy)
        if energy == 0.0 and position < self.goal_position:
            done = True  # can continue on solar, but must be punished in this step for trying this speed without sufficient energy
            reward += -10000  # run out of energy before reaching the finish
            log.debug("run out of energy, position is %.0f", position)
        if self.timestep == self.max_time:
            done = True
            log.debug("run out of time, position is %.0f, remaining energy is %.0f", position, energy)

        self.state = (position, energy)
        return np.array(self.state), reward, done, {}
    
    def reset(self):
        log.info("reset called")
        self.state = (0.0, self.max_energy)  # (start_position, start_energy)
        self.timestep = 0
        return np.array(self.state)

    def render(self, mode='human'):
        log.info("time: %d, position: %.0f, energy: %.0f", self.timestep, self.state[0], self.state[1])

    def close(self):
        if self.viewer:
            self.viewer.close()
            self.viewer = None


A small demo of how to use SolarBoatSprintEnv:

In [None]:
import time
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: SolarBoatSprintEnv()])

obs = env.reset()
done = False
while not done:
    env.render()
    speed = 5
    obs, reward, done, info = env.step([speed])
    print("reward: ", reward)
    time.sleep(0.1)
env.close()
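
To get a feeling for the reward function, it can help to replay the `step()` logic for a fixed speed without Gym or stable_baselines. The function `constant_speed_episode` below is a hypothetical stand-alone re-implementation of that logic with the same default parameters; it is a sketch for exploration, not a replacement for the environment.

```python
def constant_speed_episode(speed, goal=2000.0, start_energy=1800.0,
                           solar_coef=1.0, drag_coef=25.0, max_time=1000):
    """Replay the step() logic of SolarBoatSprintEnv for one fixed speed.

    Returns (total_reward, steps, outcome).
    """
    position, energy, total_reward = 0.0, start_energy, 0.0
    for step in range(1, max_time + 1):
        # Same energy and position update as in step(), with 1 s time steps.
        energy += solar_coef - (drag_coef * speed ** 3) / 1000
        position += speed
        energy = min(max(energy, 0.0), start_energy)
        position = min(max(position, 0.0), goal)

        total_reward += -1.0   # time penalty for every step
        outcome = None
        if position == goal:
            total_reward += 1000
            outcome = "finish"
        if energy == 0.0 and position < goal:
            total_reward += -10000
            outcome = "out of energy"
        if outcome is None and step == max_time:
            outcome = "out of time"
        if outcome is not None:
            return total_reward, step, outcome


for speed in range(0, 11):
    print(speed, constant_speed_episode(speed))
```

With these defaults, a constant 6 m/s still reaches the finish, while 7 m/s and faster empty the battery first. A learning agent therefore has to discover a fairly narrow band of useful speeds, and the large penalty for running out of energy dominates the reward of any episode in which it occurs.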


Below is an implementation of this physics model for the solarboat with a continuous state and action space. Questions to ask yourself:
* How big is the state space?
* How big is the action space?
* Does the reward function make sense?
* Is there a way to make the problem easier?

In [None]:
# solarboat_sprint_continuous_env

import math
import numpy as np
import gym
from gym import spaces
from gym.utils import seeding
import logging

log = logging.getLogger("solarboat_sprint_continuous_env")
log.setLevel(logging.INFO)
log.addHandler(logging.StreamHandler())


class SolarBoatSprintContinuousEnv(gym.Env):
    metadata = {'render.modes': ['human']}

    def __init__(self):
        self.min_pos = 0.0
        self.max_pos = 2000.0
        self.start_pos = self.min_pos
        self.goal_pos = self.max_pos

        self.min_speed = 0
        self.max_speed = 10
        # action space for the ddpg algorithm must be symmetric
        self.middle_speed = (self.max_speed + self.min_speed) / 2  # midpoint of the speed range

        self.min_energy = 0.0
        self.max_energy = 1800.0
        self.start_energy = self.max_energy
        self.goal_energy = self.min_energy

        self.solar_coef = 1.0
        self.drag_coef = 25.0
        self.time_step = 0
        self.max_time = 1000

        self.low = np.array([self.min_pos, self.min_energy])
        self.high = np.array([self.max_pos, self.max_energy])
        self.observation_space = spaces.Box(self.low, self.high, dtype=np.float32)

        self.low = np.array([self.min_speed - self.middle_speed])
        self.high = np.array([self.max_speed - self.middle_speed])
        self.action_space = spaces.Box(self.low, self.high, dtype=np.float32)

        self.episode_reward = None
        self.viewer = None
        self.state = None
        self.old_state = None
        self.np_random = None
        self.boat_trans = None
        self.velocity = None
        self.seed()

    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def step(self, action):
        assert self.action_space.contains(action), "%r (%s) invalid" % (action, type(action))

        self.time_step += 1
        pos, energy = self.state
        velocity = action[0] + self.middle_speed
        energy += self.solar_coef - (self.drag_coef * velocity ** 3) / 1000
        #if energy > 0.0:  # position can only change if energy left
        pos += velocity  # assume time steps of 1 second
        log.debug("(before clipping) time: %d, position: %.0f, energy: %.0f, velocity: %.1f",
                  self.time_step, pos, energy, velocity)
        pos = np.clip(pos, self.min_pos, self.max_pos)
        energy = np.clip(energy, self.min_energy, self.max_energy)

        # calculate reward
        done = False
        reward = 0.0
        reward += -1.0  # for every increase of time_step
        if pos == self.goal_pos:  # reached the finish
            done = True
            reward += 1000  # - (energy - self.goal_energy)  # punished for remaining energy
            log.info("reached finish, remaining energy is %.0f, episode_reward is %.0f",
                     energy, self.episode_reward + reward)
        if energy == 0.0:  # and pos < self.goal_pos:
            done = True  # can continue on solar, but must be punished in this step for trying this speed without sufficient energy
            reward += -10000  # run out of energy before reaching the finish
            log.info("run out of energy, position is %.0f, episode_reward is %.0f",
                     pos, self.episode_reward + reward)
        if self.time_step == self.max_time:
            done = True
            log.info("run out of time, position is %.0f, remaining energy is %.0f, episode_reward is %.0f",
                     pos, energy, self.episode_reward + reward)

        self.old_state = self.state
        self.state = np.array((pos, energy))  # same type as the state returned by reset()
        self.episode_reward += reward
        return self.state, reward, done, {"v (non-clipped)": velocity}

    def reset(self):
        log.debug("reset called")
        self.state = np.array((self.start_pos, self.start_energy))
        self.old_state = self.state  # so render() reports speed 0 right after a reset
        self.time_step = 0
        self.episode_reward = 0.0
        return self.state

    def render(self, mode='human'):
        speed = self.state[0] - self.old_state[0]
        log.info("time: %d, position: %.0f, energy: %.0f, speed: %.1f",
                 self.time_step, self.state[0], self.state[1], speed)

        # to allow same code as slalom continuous env
        self.max_pos_x = self.max_pos
        self.min_pos_x = self.min_pos
        self.goal_pos_x = self.max_pos_x
        self.max_pos_y = 500.0
        self.min_pos_y = 0.0

        world_width = self.max_pos_x - self.min_pos_x
        world_height = self.max_pos_y - self.min_pos_y
        screen_width = 600
        scale = screen_width / world_width
        screen_height = int(world_height * scale)
        margin = 50
        screen_width += 2 * margin
        screen_height += 2 * margin

        def scl(val):
            return val * scale + margin

        if self.viewer is None:
            from gym.envs.classic_control import rendering
            self.viewer = rendering.Viewer(screen_width, screen_height)
            boat_width = 40
            boat_height = 20
            l, r, t, b = -boat_width * 1.5, -boat_width * 0.5, boat_height / 2, -boat_height / 2
            boat = rendering.make_polygon([(l, b), (l, t), (r, t), (r + boat_width * 0.5, (t + b) / 2), (r, b)])
            self.viewer.add_geom(boat)
            self.boat_trans = rendering.Transform()
            boat.add_attr(self.boat_trans)

            l, r = scl(self.min_pos_x), scl(self.max_pos_x)
            t, b = scl(self.max_pos_y), scl(self.min_pos_y)
            track = rendering.make_polyline([(l, b), (l, t), (r, t), (r, b), (l, b)])
            self.viewer.add_geom(track)

            flag_x = scl(self.goal_pos_x)
            flag_y1 = scl(self.max_pos_y)
            flag_y2 = flag_y1 + 50
            flagpole = rendering.PolyLine([(flag_x, flag_y1), (flag_x, flag_y2)], False)
            flagpole.set_linewidth(4)
            self.viewer.add_geom(flagpole)
            flag = rendering.FilledPolygon([(flag_x, flag_y2), (flag_x, flag_y2 - 10), (flag_x + 25, flag_y2 - 5)])
            flag.set_color(1, 0, 0)
            self.viewer.add_geom(flag)

        self.boat_trans.set_translation(scl(self.state[0] - self.min_pos_x), scl((self.max_pos_y - self.min_pos_y) / 2))

        return self.viewer.render(return_rgb_array=mode == 'rgb_array')

    def close(self):
        if self.viewer:
            self.viewer.close()
            self.viewer = None


For a continuous state and action space, the DDPG and PPO2 algorithms are suitable choices.

DDPG can be thought of as being deep Q-learning for continuous action spaces.

`
import numpy as np
from stable_baselines import DDPG
from stable_baselines.ddpg.noise import OrnsteinUhlenbeckActionNoise
from stable_baselines.ddpg.policies import MlpPolicy

n_actions = env.action_space.shape[-1]
param_noise = None
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=0.5 * np.ones(n_actions))
model = DDPG(MlpPolicy, env, verbose=1, param_noise=param_noise, action_noise=action_noise,
             tensorboard_log="tensor_boards/solarboat_actorcritic_agent/")
`

The Proximal Policy Optimization actor-critic algorithm combines ideas from A2C (having multiple workers) and TRPO (it uses a trust region to improve the actor).

`
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

model = PPO2(MlpPolicy, env, learning_rate=0.005, tensorboard_log="tensor_boards/solarboat_actorcritic_agent/")
`

A small demo of how to use SolarBoatSprintContinuousEnv:

In [None]:
import time
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: SolarBoatSprintContinuousEnv()])

obs = env.reset()
done = False
while not done:
    action = [[5.0]]  # offset from middle_speed: 5.0 means full speed (10 m/s)
    obs, reward, done, info = env.step(action)
    env.render()
    #print("reward: ", reward)
    time.sleep(0.2)
env.close()
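
Note that the continuous environment interprets an action as an offset from the middle of the speed range, so the action 5.0 used in the demo above corresponds to 10 m/s, not 5 m/s. The two helper functions below are hypothetical and only illustrate this mapping for the default range of 0 to 10 m/s.

```python
def action_to_velocity(action, min_speed=0.0, max_speed=10.0):
    """Map a zero-centred action to a boat speed, as step() does."""
    middle_speed = (max_speed + min_speed) / 2
    return action + middle_speed


def velocity_to_action(velocity, min_speed=0.0, max_speed=10.0):
    """Inverse mapping: the action needed to sail at a given speed."""
    middle_speed = (max_speed + min_speed) / 2
    return velocity - middle_speed


for action in (-5.0, 0.0, 5.0):
    print(f"action {action:+.1f} -> {action_to_velocity(action):.1f} m/s")

print(f"6.5 m/s needs action {velocity_to_action(6.5):+.1f}")
```

To sail at the same 5 m/s as in the discrete demo, the continuous environment would need the action 0.0.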
