In [1]:
import os, sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
from utils.helpers import launch_env, wrap_env, view_results_ipython, change_exercise, seedall, force_done, evaluate_policy
from utils.helpers import SteeringToWheelVelWrapper, ResizeWrapper, ImgWrapper

import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

INFO:aido-protocols:aido-protocols 5.0.5
21:08:52|zn|__init__.py:6|<module>(): zn 2.0.3
21:08:53|zj|__init__.py:5|<module>(): zj 2.0.4
21:08:53|gym-duckietown|__init__.py:10|<module>(): gym-duckietown 5.0.3
21:08:53|gym-duckietown|__init__.py:24|reg_map_env(): Registering gym environment id: Duckietown-small_loop_cw-v0
21:08:53|gym-duckietown|__init__.py:24|reg_map_env(): Registering gym environment id: Duckietown-straight_road-v0
21:08:53|gym-duckietown|__init__.py:24|reg_map_env(): Registering gym environment id: Duckietown-loop_dyn_duckiebots-v0
21:08:53|gym-duckietown|__init__.py:24|reg_map_env(): Registering gym environment id: Duckietown-loop_obstacles-v0
21:08:53|gym-duckietown|__init__.py:24|reg_map_env(): Registering gym environment id: Duckietown-zigzag_dists-v0
21:

# Reinforcement Learning Basics

Reinforcement learning, as we saw in lecture, is the idea of learning a _policy_ that maximizes future (potentially discounted) rewards. Our policy, like the imitation learning network, maps raw image observations to wheel velocities; at every timestep, the agent receives a _reward_ from the environment.
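
To make "discounted future reward" concrete, here is a minimal sketch that computes the discounted return $G_t = r_t + \gamma G_{t+1}$ for every timestep of a finished episode. The function name `discounted_returns` and the discount values are illustrative choices, not part of the exercise code.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every timestep t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))
```

With `gamma=1.0` this reduces to the plain sum of the remaining rewards; smaller values make the agent weigh near-term rewards more heavily.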

Rewards can be sparse (`1` if the goal or task is completed, `0` otherwise) or dense. In general, dense rewards make it easier to learn policies, but as we'll see later in this exercise, defining a good dense reward is an engineering challenge in its own right.
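
The contrast can be sketched on a toy one-dimensional goal-reaching task. Both functions below are hypothetical illustrations (they are not the `gym-duckietown` reward): the sparse version is silent until the goal is reached, while the dense version gives a signal at every position.

```python
def sparse_reward(position, goal, tolerance=0.05):
    """1 only when the goal is reached, 0 everywhere else."""
    return 1.0 if abs(position - goal) <= tolerance else 0.0


def dense_reward(position, goal):
    """Negative distance to the goal: informative at every step."""
    return -abs(position - goal)


for position in [0.0, 0.5, 1.0]:
    print(position, sparse_reward(position, goal=1.0), dense_reward(position, goal=1.0))
```

The dense signal tells the learner that `0.5` is better than `0.0`, which the sparse one cannot. The price is that the policy will optimize whatever the dense formula actually measures.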

Today's reinforcement learning algorithms are often a mix of _value-based_ and _policy-gradient_ algorithms, instances of what is called an _actor-critic_ formulation. Actor-critic methods have been researched heavily in recent years (especially in the deep reinforcement learning era), and later in this exercise we will rediscover the formulation's original problems and the methods currently used to stabilize learning.

We begin by defining two networks, an `Actor` and a `Critic`; in this exercise, we'll be using a deep RL algorithm called _Deep Deterministic Policy Gradient_ (DDPG).
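
As a rough sketch of how the two networks interact in DDPG: the critic is regressed toward a bootstrapped target $y = r + \gamma (1 - d)\, Q'(s', \mu'(s'))$ computed with slowly updated target copies of both networks, and those copies track the learned weights through Polyak averaging. The snippet uses plain Python functions and lists in place of real networks; `td_target`, `soft_update`, and the toy actor and critic are hypothetical names for illustration only.

```python
def td_target(reward, done, next_obs, target_actor, target_critic, gamma=0.99):
    """Critic regression target: y = r + gamma * (1 - done) * Q'(s', mu'(s'))."""
    next_action = target_actor(next_obs)
    next_q = target_critic(next_obs, next_action)
    return reward + gamma * (1.0 - float(done)) * next_q


def soft_update(target_params, params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]


# Toy stand-ins for the target networks (assumptions, not the exercise's models)
def toy_actor(obs):
    return 0.5 * obs


def toy_critic(obs, action):
    return obs + action


print(td_target(1.0, False, 2.0, toy_actor, toy_critic, gamma=0.5))
print(td_target(1.0, True, 2.0, toy_actor, toy_critic, gamma=0.5))
print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5))
```

When `done` is true the bootstrap term vanishes and the target is just the immediate reward.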

## Reward Engineering

In this part of the exercise, we will experiment with the reward formulation. Keeping the model fixed, we'll see how different reward functions change the final trained policy.

In the section below, we'll take a look at the reward function implemented in `gym-duckietown` with a slightly modified training loop. Traditionally, we `reset()` the environment to start an episode, and then repeatedly `step()` the environment forward by a fixed amount of time, executing a new action at each step. If this sounds a bit odd, especially to roboticists, you're right: in real robotics, most code runs asynchronously. So although `gym-duckietown` runs locally by pausing the environment between steps, `AIDO` submissions will run asynchronously, executing the same action until a new one is received.
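
The difference between the two execution models can be illustrated with a toy loop. In the sketch below, the simulator ticks at a fixed rate while a (hypothetical) slow policy only produces a new command every few ticks; in between, the last command is held. The names `held_actions` and `policy_period` are assumptions made for this illustration and do not correspond to the `AIDO` interface.

```python
def held_actions(policy_outputs, n_ticks, policy_period):
    """Action applied at each tick when new commands arrive every `policy_period` ticks."""
    applied = []
    current = None
    outputs = iter(policy_outputs)
    for tick in range(n_ticks):
        if tick % policy_period == 0:
            # A fresh command arrives; otherwise keep executing the last one
            current = next(outputs, current)
        applied.append(current)
    return applied


# Synchronous (gym-style): a new action on every tick
print(held_actions(["a0", "a1", "a2", "a3"], n_ticks=4, policy_period=1))
# Asynchronous: the same action is repeated until a new one is received
print(held_actions(["a0", "a1"], n_ticks=4, policy_period=2))
```

A policy trained only in the synchronous setting may therefore see its actions applied for longer than it expects once deployed.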

In [2]:
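# NotInLane is raised by get_lane_pos2() when the robot is outside any lane.
# Assumption: this import path matches a standard gym-duckietown install;
# adjust it if the simulator lives elsewhere (e.g. under `simulation/`).
from gym_duckietown.simulator import NotInLane

def updated_reward(env, c1=1.0, c2=-10, c3=40):
    """Re-weighted lane-following reward: c1 scales progress, c2 lane distance, c3 collisions."""
    # Compute the collision avoidance penalty
    pos, angle, speed = env.cur_pos, env.cur_angle, env.speed
    col_penalty = env.proximity_penalty2(pos, angle)

    # Get the position relative to the right lane tangent
    try:
        lp = env.get_lane_pos2(pos, angle)
    except NotInLane:
        # Off the lane: only the collision term applies
        reward = c3 * col_penalty
    else:
        # Weighted sum of progress along the lane, distance from the
        # lane center, and proximity to obstacles
        reward = (
            c1 * speed * lp.dot_dir +
            c2 * np.abs(lp.dist) +
            c3 * col_penalty
        )
    return reward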
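
To get a feel for how the three coefficients trade off, the same weighted sum can be evaluated on hand-picked values without launching the simulator. The helper `weighted_reward` below is a hypothetical stand-in that takes the lane quantities directly as numbers; the sample values are made up for illustration and are not simulator outputs.

```python
def weighted_reward(speed, dot_dir, dist, col_penalty, c1=1.0, c2=-10, c3=40):
    """Same weighted sum as the in-lane branch of updated_reward."""
    return c1 * speed * dot_dir + c2 * abs(dist) + c3 * col_penalty


# Centered, aligned with the lane, no obstacle nearby
print(weighted_reward(speed=0.5, dot_dir=1.0, dist=0.0, col_penalty=0.0))
# Same pose, but a larger c1 rewards forward progress more strongly
print(weighted_reward(speed=0.5, dot_dir=1.0, dist=0.0, col_penalty=0.0, c1=16))
# Drifting 0.25 away from the lane center outweighs the progress term
print(weighted_reward(speed=0.5, dot_dir=1.0, dist=0.25, col_penalty=0.0))
```

Changing one coefficient changes which behavior dominates the sum, which is why the relative scale of the terms matters more than their absolute values.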

In [3]:
nepisodes = 3

In [4]:
local_env = launch_env()
local_env = wrap_env(local_env)
local_env = ResizeWrapper(local_env)
local_env = ImgWrapper(local_env)

# actor = Actor(action_dim=2, max_action=1.0)

# Reward coefficients: forward progress, lane-distance penalty, collision penalty
c1 = 16
c2 = -10
c3 = 40

for _ in range(nepisodes):
    done = False
    obs = local_env.reset()
    obs = torch.from_numpy(obs)

    while not done:
        # Random action for now; swap in the actor once it is defined
        action = np.random.random(2)
        # action = actor(obs)
        obs, r, done, info = local_env.step(action)
        # obs = torch.from_numpy(obs)

        # Compare the built-in reward with our re-weighted version
        new_r = updated_reward(local_env, c1, c2, c3)
        print(r, new_r)
 

21:08:55|gym-duckietown|graphics.py:121|create_frame_buffers(): Falling back to non-multisampled frame buffer
21:08:55|gym-duckietown|graphics.py:121|create_frame_buffers(): Falling back to non-multisampled frame buffer
21:08:55|gym-duckietown|simulator.py:550|_load_map(): loading map file "/duckietown/simulation/gym_duckietown/maps/loop_empty.yaml"
21:08:55|gym-duckietown|objmesh.py:50|__init__(): loading mesh "duckiebot.obj"
21:08:55|gym-duckietown|objmesh.py:238|_load_mtl(): loading materials from "/duckietown/simulation/gym_duckietown/meshes/duckiebot.mtl"
21:08:55|gym-duckietown|objmesh.py:50|__init__(): loading mesh "duckie.obj"
21:08:55|gym-duckietown|graphics.py:60|load_texture(): loading texture "duckie.png"
21:08:56|gym-duckietown|objmesh.py:50|__init__(): loading mesh "cone.obj"

21:08:56|gym-duckietown|simulator.py:1217|_valid_pose(): Invalid pose. Collision free: True On drivable area: False
21:08:56|gym-duckietown|simulator.py:1218|_valid_pose(): safety_factor: 1.3
21:08:56|gym-duckietown|simulator.py:1219|_valid_pose(): pos: [3.59826268 0.         2.08721855]
21:08:56|gym-duckietown|simulator.py:1220|_valid_pose(): l_pos: [3.5894849  0.         2.18432262]
21:08:56|gym-duckietown|simulator.py:1221|_valid_pose(): r_pos: [3.60704045 0.         1.99011448]
21:08:56|gym-duckietown|simulator.py:1222|_valid_pose(): f_pos: [3.48173779 0.         2.07668522]
21:08:56|gym-duckietown|simulator.py:1107|_drivable_pos(): [3.48172154 0.         1.84866924] corresponds to tile at (5, 3) which is not drivable: {'coords': (5, 3), 'kind': 'floor', 'angle': 0, 'drivable': False, 'texture': <simulation.gym_du

21:08:56|gym-duckietown|simulator.py:1222|_valid_pose(): f_pos: [4.02468139 0.         1.75868687]
21:08:56|gym-duckietown|simulator.py:1107|_drivable_pos(): [3.49620504 0.         2.20742184] corresponds to tile at (5, 3) which is not drivable: {'coords': (5, 3), 'kind': 'floor', 'angle': 0, 'drivable': False, 'texture': <simulation.gym_duckietown.graphics.Texture object at 0x7f5728a42278>, 'color': array([1, 1, 1])}
21:08:56|gym-duckietown|simulator.py:1217|_valid_pose(): Invalid pose. Collision free: True On drivable area: False
21:08:56|gym-duckietown|simulator.py:1218|_valid_pose(): safety_factor: 1.3
21:08:56|gym-duckietown|simulator.py:1219|_valid_pose(): pos: [3.57500004 0.         2.14999713]
21:08:56|gym-duckietown|simulator.py:1220|_valid_pose(): l_pos: [3.49620504 0.         2.20742184]
21:08:56|gym-duckietown|simu

21:09:00|gym-duckietown|simulator.py:1217|_valid_pose(): Invalid pose. Collision free: True On drivable area: False
21:09:00|gym-duckietown|simulator.py:1218|_valid_pose(): safety_factor: 1.3
21:09:00|gym-duckietown|simulator.py:1219|_valid_pose(): pos: [2.33703727 0.         2.57700384]
21:09:00|gym-duckietown|simulator.py:1220|_valid_pose(): l_pos: [2.24568806 0.         2.542922  ]
21:09:00|gym-duckietown|simulator.py:1221|_valid_pose(): r_pos: [2.42838647 0.         2.61108568]
21:09:00|gym-duckietown|simulator.py:1222|_valid_pose(): f_pos: [2.37793548 0.         2.46738479]
21:09:00|gym-duckietown|simulator.py:1107|_drivable_pos(): [2.28873639 0.         2.68406941] corresponds to tile at (3, 4) which is not drivable: {'coords': (3, 4), 'kind': 'floor', 'angle': 0, 'drivable': False, 'texture': <simulation.gym_du

-0.11008460519299601 -1.3100253097773256
-0.11008460519299601 -1.3100253097773256
-0.11008460519299601 -1.3100253097773256
-0.11008460519299601 -1.3100253097773256
-0.10994753896554466 -1.3098857241571675
-0.10968448715635493 -1.3096794463677122
-0.10951339384152226 -1.309471538968344
-0.10925101288990735 -1.3090443315818474
-0.10840354841804478 -1.3082176817999824



-0.1074339855717259 -1.3073703524440539
-0.10690796356573551 -1.306899555474468
-0.10684523921314115 -1.3068424304628978
-0.16290437685226156 -1.338829968390457
-0.18504963423090004 -1.3569714166395872
-0.2077971518540449 -1.3760148972289517
-0.22929780864719795 -1.3953193995480646
-0.2466979396738742 -1.4135434999394207
-0.2661946770722361 -1.4317762224191215



-0.2874148156397349 -1.4501890071264987
-0.3138276946080052 -1.4694527010168494
-0.34651387869979344 -1.4931123915583522
-0.3845570862694365 -1.5212728383831617
-0.4289980562422986 -1.5532316306952838
-0.47346321422728854 -1.583973322610782
-0.5258172016353138 -1.6224152576743744
-0.592277341243713 -1.6672081527291316
-0.6578215996590675 -1.714357437059457



-0.7257080299468599 -1.7606880086109244
-0.8030133874976195 -1.815740481497358
-0.8835111317684328 -1.8744396996233992
-0.9681021693787929 -1.9351310916735198
-1.0440723361629338 -1.9952905321058116
-1.124009754796769 -2.0527416677491286
-1.1995480049430411 -2.1115801094229796
-1.2648803038979395 -2.1656762638769584
-1.3386312333790535 -2.2245860967217896



-1.407907397619466 -2.2859077246590327
-1.4867034958694019 -2.3459006977827874
-1.564100437920649 -2.4122652775288063
-1.6368449866613195 -2.4773684074066287
-1.7160555287264945 -2.5492632877871966
-1.7964722646255513 -2.6267110193859864
-1.8723513469287707 -2.701865398449643
-1.9406137970341237 -2.778688224502929
-2.026321350014311 -2.864659712594992



-2.1171292780082047 -2.949954754196345
-2.2272170359493377 -3.039233042399796
-2.3169905657256487 -3.1163915269468734
-2.417807516565988 -3.202736174486294
-2.5147111410941734 -3.2834470676903833
-2.6215357234502976 -3.3699919515709142
-2.7159995583360197 -3.4556732735192113
-2.804971235457042 -3.5319719903259417
-2.8804474158728306 -3.612499343238925



-2.9603692918987488 -3.6873406672066933
-3.0245041871368983 -3.761535303691618
-3.101068927365766 -3.8402573592888323
-3.1573892191393282 -3.9148492373462274
-3.23098723498075 -3.9912071031796925
-3.3049800757080092 -4.06536433172254
-3.3780517005319783 -4.130443397290996
-3.437135884124042 -4.187919342231817



21:09:02|gym-duckietown|simulator.py:1221|_valid_pose(): r_pos: [3.34876267 0.         1.03910715]
21:09:02|gym-duckietown|simulator.py:1222|_valid_pose(): f_pos: [3.20681412 0.         1.09429644]
21:09:02|gym-duckietown|simulator.py:1107|_drivable_pos(): [3.32870397 0.         0.56173371] corresponds to tile at (5, 0) which is not drivable: {'coords': (5, 0), 'kind': 'floor', 'angle': 0, 'drivable': False, 'texture': <simulation.gym_duckietown.graphics.Texture object at 0x7f5728a42278>, 'color': array([1, 1, 1])}
21:09:02|gym-duckietown|simulator.py:1217|_valid_pose(): Invalid pose. Collision free: True On drivable area: False
21:09:02|gym-duckietown|simulator.py:1218|_valid_pose(): safety_factor: 1.3
21:09:02|gym-duckietown|simulator.py:1219|_valid_pose(): pos: [3.34768548 0.         0.65736818]
21:09:02|gym-duckietown|simu

-3.4942084193685004 -4.248358699876863
-3.567841715416814 -4.310335778256232
-3.6390833942826384 -4.378364265660371
-3.7030936150904594 -4.4406171638798675
-3.7582994473595632 -4.500280165747158
-3.820928197467763 -4.566673182190591
-1000 -4.640769347257563


21:09:02|gym-duckietown|simulator.py:1107|_drivable_pos(): [3.30655235 0.         1.18763146] corresponds to tile at (5, 2) which is not drivable: {'coords': (5, 2), 'kind': 'floor', 'angle': 0, 'drivable': False, 'texture': <simulation.gym_duckietown.graphics.Texture object at 0x7f5728a42278>, 'color': array([1, 1, 1])}
21:09:02|gym-duckietown|simulator.py:1217|_valid_pose(): Invalid pose. Collision free: True On drivable area: False
21:09:02|gym-duckietown|simulator.py:1218|_valid_pose(): safety_factor: 1.3
21:09:02|gym-duckietown|simulator.py:1219|_valid_pose(): pos: [3.22934714 0.         1.12808647]
21:09:02|gym-duckietown|simulator.py:1220|_valid_pose(): l_pos: [3.30655235 0.         1.18763146]
21:09:02|gym-duckietown|simulator.py:1221|_valid_pose(): r_pos: [3.15214192 0.         1.06854149]
21:09:02|gym-duckietown|simu

0.9218971740686563 -0.27745609700385643
0.9218971740686563 -0.27745609700385643
0.9218971740686563 -0.27745609700385643
0.9218971740686563 -0.27745609700385643
0.9217981739487426 -0.27781513078275477
0.921445427034135 -0.278472996417537
0.9211663589840997 -0.2788167503105987
0.921231041818467 -0.27872072178027896
0.9218657895972995 -0.27805233235598803



0.9226503064515856 -0.2773132164355552
0.9233234831136712 -0.2766430249552121
0.9241589095093765 -0.275778302525157
0.9253832012003751 -0.274374882132814
0.9271929403626475 -0.2722402423076131
0.9293461480392644 -0.2697443561851476
0.9324629546058347 -0.2663976463913287
0.9361361160185786 -0.2630099956686849
0.939377885143775 -0.2603884862022313



0.940867265488166 -0.25910723081753617
0.9410977769820933 -0.2589006543824196
0.9400944389517643 -0.259751317022148
0.937485704711509 -0.26196295464548985
0.9333976131931916 -0.2655532820752929
0.925464200892033 -0.271151945536917
0.9115935956634228 -0.2801504159165945
0.887322694011115 -0.29516068122045325
0.8582517442530598 -0.31481710338530167



0.8209991123619964 -0.33737910953329897
0.7798157970893307 -0.36310358744453275
0.7317078819994784 -0.39336366876256607
0.6770466161262343 -0.42982403457014273
0.6118153671577257 -0.4714982995954887
0.5409725062528956 -0.5165955623418834
0.4795739200329776 -0.5602483035656667
0.4113561750659782 -0.6093248358494081
0.34775738001157097 -0.6550166036805218



0.2827369368487558 -0.7053761284597204
0.2116813832601998 -0.7622373346302496
0.14178306921339556 -0.8145414847827409
0.06297995574049742 -0.8725730642614014
-0.020484401390765794 -0.9354326868694154
-0.09537895127341367 -0.9919168741865734
-0.18515673582734804 -1.0547458858218954
-0.27737147776365856 -1.1263106891337016
-0.3703498089405446 -1.1999781848245945



-0.47523971110061713 -1.280712102724966
-0.5781582741721057 -1.364932066685443
-0.6804503949426657 -1.4476324876111417
-0.794689740401264 -1.539354622165317
-0.9076410864621505 -1.6282920898527848
-1.0172162437263392 -1.7097749407098428
-1.1123482777890792 -1.789802534449387
-1.2446802227168057 -1.8725618555050716
-1.3549118501462636 -1.95743199049781



-1.4782102037000708 -2.0536790028006084
-1.5914250520567514 -2.147099239734414
-1.6969160504678737 -2.2440838075617537
-1.8003848115649426 -2.339162760858113
-1.901953677926886 -2.4320403830407016
-1.9970556600369298 -2.525575268057556
-2.0849082477684044 -2.615807450089134
-2.1606312491303665 -2.703110423099599



-2.2506035367342996 -2.7856493025203024
-2.326407399482728 -2.8669900261909036
-2.3966401007224 -2.9536696050002242
-2.4867763703202415 -3.0489884220432555
-2.5952812772808165 -3.144020202272602
-2.6864012695789286 -3.2351979519167324
-2.7723628235091122 -3.313436992673913
-2.863247057146403 -3.3914592039238904


21:09:04|gym-duckietown|simulator.py:1435|_render_img(): Pos: [4.036363210367341, 0, 1.094520438804678] angle -0.32863461948007294
21:09:04|gym-duckietown|simulator.py:1107|_drivable_pos(): [4.09883115 0.         1.11582201] corresponds to tile at (7, 1) which is not drivable: {'coords': (7, 1), 'kind': 'floor', 'angle': 0, 'drivable': False, 'texture': <simulation.gym_duckietown.graphics.Texture object at 0x7f5728a42278>, 'color': array([1, 1, 1])}
21:09:04|gym-duckietown|simulator.py:1217|_valid_pose(): Invalid pose. Collision free: True On drivable area: False
21:09:04|gym-duckietown|simulator.py:1218|_valid_pose(): safety_factor: 1.0
21:09:04|gym-duckietown|simulator.py:1219|_valid_pose(): pos: [4.0136476  0.         1.08677441]
21:09:04|gym-duckietown|simulator.py:1220|_valid_pose(): l_pos: [4.03785392 0.         1.01578812]
21:0

21:09:04|gym-duckietown|simulator.py:1217|_valid_pose(): Invalid pose. Collision free: True On drivable area: False
21:09:04|gym-duckietown|simulator.py:1218|_valid_pose(): safety_factor: 1.3
21:09:04|gym-duckietown|simulator.py:1219|_valid_pose(): pos: [3.50618012 0.         1.81369074]
21:09:04|gym-duckietown|simulator.py:1220|_valid_pose(): l_pos: [3.56298004 0.         1.73444416]
21:09:04|gym-duckietown|simulator.py:1221|_valid_pose(): r_pos: [3.44938021 0.         1.89293731]
21:09:04|gym-duckietown|simulator.py:1222|_valid_pose(): f_pos: [3.60127601 0.         1.88185064]
21:09:04|gym-duckietown|simulator.py:1107|_drivable_pos(): [4.11818143 0.         1.90573522] corresponds to tile at (7, 3) which is not drivable: {'coords': (7, 3), 'kind': 'floor', 'angle': 0, 'drivable': False, 'texture': <simulation.gym_du

-2.946083500410014 -3.468955192348294
-1000 -3.551214317380325



0.1994365086286205 -1.0004996194773108
0.1994365086286205 -1.0004996194773108
0.1994365086286205 -1.0004996194773108
0.1994365086286205 -1.0004996194773108
0.19960984370367596 -1.0003900279290168
0.19926585372096728 -1.000633899126382
0.1972237577534317 -1.0018728455021764
0.19429845863105655 -1.0043359100981464
0.19186335663106058 -1.007176718082352
0.18910827090392401 -1.0100815987085943
0.1866356048313509 -1.0129262665020233
0.18504658056336432 -1.0148714040394413
0.18512446034046248 -1.014760849047427
0.18672390035402797 -1.012524324971622
0.1894001557806022 -1.0078432932448456
0.19373345581515267 -1.0012755657437689



0.19929875054091317 -0.9933324645452934
0.20618118077011838 -0.9848712902434187
0.2124426294501074 -0.9763900285544702
0.21815598483921317 -0.967735540907313
0.22576090886954203 -0.9585400291495949
0.2333231203225854 -0.9475888200660609
0.24023864275259 -0.934445684182883
0.25018492842922035 -0.9204902097570233
0.2652962237076846 -0.9017349404068309
0.2792085150157959 -0.883644904563412
0.29733423464820485 -0.8609714971705795
0.32040573335503697 -0.834786903845095
0.34845690526685713 -0.8055840355065635
0.38068540743434176 -0.772702721835854
0.4116278587607387 -0.743508711419314
0.4384023604288054 -0.7145813363087594
0.46404807337490683 -0.6851753792273207



0.49137606291197033 -0.6544745907889382
0.5282617894499555 -0.6225532725605583
0.562612780173296 -0.5931519735509404
0.5894672906893204 -0.5675411915451998
0.6176310611120964 -0.5421337178056181
0.642739060015574 -0.5189624010812808
0.6709503358991366 -0.4930282603763517
0.7025370916495961 -0.46721882368914
0.7342520513974449 -0.4443553151421747
0.7645738078225058 -0.4251194861599591
0.7766704459889672 -0.4124839302523714
0.7916660567564915 -0.4020409961828195
0.8033959024844577 -0.39344727733949014
0.811148952197495 -0.38850558316036704
0.8101925725455317 -0.3891112429901875
0.7994059161550631 -0.3961800100037987



0.7759224676685657 -0.40997333561276106
0.7380595173453964 -0.43127978005227197
0.6877763255040639 -0.46036176485426084
0.6275524995498525 -0.4959154790142229
0.5498303670525224 -0.54111656875872
0.4670964858804718 -0.5910139115494022
0.37144083815558726 -0.648545700542224
0.2788010609378865 -0.7075767035833089
0.19352171494565562 -0.7657645821189218
0.09847629984643924 -0.8253545579705344
0.007133599190610407 -0.8837848012775114
-0.09838595369219372 -0.9539802307280209
-0.22282034196629652 -1.0352936915017528
-0.31682116969947616 -1.113770123119541
-0.4601725967807134 -1.2013582883310887
-0.5687288832716886 -1.28739416956885
-0.6881204099022901 -1.382879176205683



-0.8136481488853123 -1.477570001741049
-0.9309582659032917 -1.569351171844413
-1.0600008124258997 -1.673778986015617
-1.2030033291886704 -1.7784743154341065
-1.3299682109080566 -1.87992837800459
-1.4629836111739327 -1.9830253224315164
-1.6084308542048522 -2.0879610856244173
-1.7527976162738743 -2.196877432127967
-1.8991271162275558 -2.3021802739887662
-2.0543277963057562 -2.41736284669075
-2.215529064775749 -2.538848437565707
-2.3862069469699647 -2.6602609085481506
-2.554779497918785 -2.7825557814393505


21:09:05|gym-duckietown|simulator.py:1107|_drivable_pos(): [3.55729708 0.         2.92503014] corresponds to tile at (6, 5) which is not drivable: {'coords': (6, 5), 'kind': 'floor', 'angle': 0, 'dr

-2.696806788444103 -2.8873142772250477
-2.8407252372837357 -3.0010660138273346
-2.9875457379852657 -3.1244729167873118
-3.1059934798215845 -3.2387608088267306
-3.2277771252077176 -3.3552221397110578
-1000 -3.453372796000826


In [5]:
view_results_ipython(local_env)

**Question 0: After understanding the reward computed above, experiment with the constants for each component. What type of behavior does this reward function penalize? Is this good or bad in the context of autonomous driving? Name some other issues that can arise with single-objective optimization. In addition, give three sets of constants and explain qualitatively what types of behavior each penalizes or rewards (note: you may want to use an action policy other than random).** Place your answers in `reinforcement-learning-answers.txt`.




# The Reinforcement Learning Training Code

Below is a relatively naive implementation of the actor-critic training loop. It proceeds as follows: the critic solves a supervised learning problem, fitting the rewards acquired by the agent. The policy then uses policy gradients to maximize the return according to the critic's estimate, rather than using Monte-Carlo updates.
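
One compact way to read the two updates described above is sketched here. The notation is introduced only for illustration and does not appear in the notebook: $Q_\phi$ is the critic, $\mu_\theta$ is the actor, $\gamma$ is the discount, and $d \in \{0, 1\}$ is the `done` flag of a transition $(s, a, r, s', d)$ sampled from the replay buffer. Under those assumptions, the quantities that the `train` method further down computes are:

$$
y = r + \gamma\,(1 - d)\,Q_\phi\big(s', \mu_\theta(s')\big), \qquad
L_{\text{critic}}(\phi) = \mathbb{E}\Big[\big(Q_\phi(s, a) - y\big)^2\Big], \qquad
L_{\text{actor}}(\theta) = -\,\mathbb{E}\Big[Q_\phi\big(s, \mu_\theta(s)\big)\Big].
$$

This mirrors the code as written, not necessarily the algorithm as published; the questions at the end of the notebook explore where the two differ.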

Below, we see an implementation of `DDPGAgent`, a class which handles the networks and training loop. 

In [6]:
!pip install ipywidgets --quiet
!jupyter nbextension enable --py widgetsnbextension
!jupyter labextension install @jupyter-widgets/jupyterlab-manager

DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
You should consider upgrading via the 'pip install --upgrade pip' command.
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK
Traceback (most recent call last):
  File "/usr/local/bin/jupyter", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/jupyter_core/command.py", line 247, in main
    command = _jupyter_abspath(subcommand)
  File "/usr/local/lib/python2.7/dist-packages/jupyter_core/command.py", line 134, in _jupyter_abspath
    'Jupyter command `{}` not found.'.format(jupyter_subcommand)
Exception: Jupyter command `jupyter-labextension` not found

In [7]:
class Actor(nn.Module):
    def __init__(self, action_dim, max_action):
        super(Actor, self).__init__()

        # TODO: You'll need to change this!
        flat_size = 31968

        self.relu = nn.ReLU()
        self.tanh = nn.Tanh()

        self.conv1 = nn.Conv2d(3, 32, 8, stride=2)
        self.conv2 = nn.Conv2d(32, 32, 4, stride=2)

        self.bn1 = nn.BatchNorm2d(32)
        self.bn2 = nn.BatchNorm2d(32)

        self.dropout = nn.Dropout(.1)

        self.lin1 = nn.Linear(flat_size, 100)
        self.lin2 = nn.Linear(100, action_dim)

        self.max_action = max_action

    def forward(self, x):
        x = self.bn1(self.relu(self.conv1(x)))
        x = self.bn2(self.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)  # flatten
        x = self.dropout(x)
        x = self.relu(self.lin1(x))

        x = self.lin2(x)
        x = self.max_action * self.tanh(x)
        
        return x
    
class Critic(nn.Module):
    def __init__(self, action_dim, max_action):
        super(Critic, self).__init__()

        # TODO: You'll need to change this!
        flat_size = 31968

        self.relu = nn.ReLU()
        self.tanh = nn.Tanh()

        self.conv1 = nn.Conv2d(3, 32, 8, stride=2)
        self.conv2 = nn.Conv2d(32, 32, 4, stride=2)

        self.bn1 = nn.BatchNorm2d(32)
        self.bn2 = nn.BatchNorm2d(32)

        self.dropout = nn.Dropout(.1)

        self.lin1 = nn.Linear(flat_size + action_dim, 100)
        self.lin2 = nn.Linear(100, action_dim)

        self.max_action = max_action

    def forward(self, obs, action):
        x = self.bn1(self.relu(self.conv1(obs)))
        x = self.bn2(self.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)  # flatten
        x = self.dropout(x)
        x = torch.cat([x, action], 1)
        x = self.relu(self.lin1(x))

        x = self.lin2(x)
        x = self.max_action * self.tanh(x)
        
        return x
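
The `flat_size` constant flagged by the TODO comments depends on the image resolution that reaches the network. As a rough aid, the sketch below applies the standard output-size formula for an unpadded convolution to the two conv layers defined above. The helper names `conv_out` and `flat_size_for` are hypothetical, and the 120x160 input resolution is an assumption (it happens to reproduce the hard-coded value); check the actual shape your wrappers produce.

```python
def conv_out(size, kernel, stride, padding=0):
    """Length of one spatial dimension after a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def flat_size_for(height, width, out_channels=32):
    """Flattened feature count after conv1 (k=8, s=2) and conv2 (k=4, s=2)."""
    for kernel, stride in [(8, 2), (4, 2)]:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return out_channels * height * width


# Assumed input resolution of 120x160; substitute your own observation shape.
print(flat_size_for(120, 160))  # 31968
```

If the observation size changes, recomputing this number (or pushing a dummy tensor through the conv layers and reading off its shape) avoids a shape mismatch in `lin1`.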

In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class DDPGAgent(object):
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super(DDPGAgent, self).__init__()
        self.flat = False

        self.actor = Actor(action_dim, max_action).to(device)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=5e-2)
        
        self.critic = Critic(action_dim, max_action).to(device) # CriticCNN -> Critic
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=5e-2)

    def predict(self, state):
        assert state.shape[0] == 3
        state = torch.FloatTensor(np.expand_dims(state, axis=0)).to(device)
        return self.actor(state).cpu().data.numpy().flatten()

    def train(self, replay_buffer, iterations, batch_size=64, discount=0.99):
        for it in range(iterations):
            # Sample replay buffer
            sample = replay_buffer.sample(batch_size, flat=self.flat)
            state = torch.FloatTensor(sample["state"]).to(device)
            action = torch.FloatTensor(sample["action"]).to(device)
            next_state = torch.FloatTensor(sample["next_state"]).to(device)
            # NOTE: why 1 - done?
            done = torch.FloatTensor(1 - sample["done"]).to(device) 
            reward = torch.FloatTensor(sample["reward"]).to(device)
            
            # Compute the target Q value
            target_Q = self.critic(next_state, self.actor(next_state))
            
            # TODO: - no detach is a subtle, but important bug!
            target_Q = reward + (done * discount * target_Q)

            # Get current Q estimate
            current_Q = self.critic(state, action)

            # Compute critic loss
            critic_loss = F.mse_loss(current_Q, target_Q)

            # Optimize the critic
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()

            # Compute actor loss
            actor_loss = -self.critic(state, self.actor(state)).mean()
            
            # Optimize the actor
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()
            
            print(f"Iteration {it}, critic_loss: {critic_loss}, actor_loss: {actor_loss}")
            
    def save(self, filename, directory):
        torch.save(self.actor.state_dict(), '{}/{}_actor.pth'.format(directory, filename))
        torch.save(self.critic.state_dict(), '{}/{}_critic.pth'.format(directory, filename))

    def load(self, filename, directory):
        self.actor.load_state_dict(torch.load('{}/{}_actor.pth'.format(directory, filename), map_location=device))
        self.critic.load_state_dict(torch.load('{}/{}_critic.pth'.format(directory, filename), map_location=device))


You'll notice that the training loop needs a `replay_buffer` object. In value-based and actor-critic methods in deep reinforcement learning, the use of a replay buffer is crucial. In the following sections, you'll explore why this is the case, along with some other stabilization techniques that are needed to get the above code to work. Below, you can find an implementation of the replay buffer, as well as the training loop that we use to train DDPG.

In [9]:
import random  # needed by ReplayBuffer.add for random eviction
from typing import NamedTuple

class ReplayBufferEntry(NamedTuple):
    state: np.ndarray
    next_state: np.ndarray
    action: np.ndarray
    reward: float
    done: float
    

# Simple replay buffer
class ReplayBuffer(object):
    def __init__(self, max_size=1e6):
        self.storage = []
        self.max_size = max_size

    # Expects tuples of (state, next_state, action, reward, done)
    def add(self, state, next_state, action, reward, done):
        if len(self.storage) < self.max_size:
            self.storage.append(ReplayBufferEntry(state, next_state, action, reward, done))
        else:
            # Remove a random element from memory before adding a new one
            self.storage.pop(random.randrange(len(self.storage)))
            self.storage.append(ReplayBufferEntry(state, next_state, action, reward, done))


    def sample(self, batch_size=100, flat=True):
        ind = np.random.randint(0, len(self.storage), size=batch_size)
        states, next_states, actions, rewards, dones = [], [], [], [], []

        for i in ind:
            state, next_state, action, reward, done = self.storage[i]

            # np.asarray avoids a copy when possible and also accepts plain floats
            if flat:
                states.append(np.asarray(state).flatten())
                next_states.append(np.asarray(next_state).flatten())
            else:
                states.append(np.asarray(state))
                next_states.append(np.asarray(next_state))
            actions.append(np.asarray(action))
            rewards.append(np.asarray(reward))
            dones.append(np.asarray(done))

        # state_sample, action_sample, next_state_sample, reward_sample, done_sample
        return {
            "state": np.stack(states),
            "next_state": np.stack(next_states),
            "action": np.stack(actions),
            "reward": np.stack(rewards).reshape(-1,1),
            "done": np.stack(dones).reshape(-1,1)
        }
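
The buffer above evicts a random entry once it is full. For comparison, here is a minimal sketch of a first-in-first-out variant, which is another common choice. `FifoReplayBuffer` is a hypothetical name and is not part of the notebook or of `utils`; the tiny observation shapes are made up for the demo.

```python
import random
from collections import deque

import numpy as np


class FifoReplayBuffer:
    """Hypothetical FIFO variant: the oldest transition is evicted first."""

    def __init__(self, max_size=1000):
        # deque(maxlen=...) drops the oldest item automatically when full
        self.storage = deque(maxlen=int(max_size))

    def add(self, state, next_state, action, reward, done):
        self.storage.append((state, next_state, action, reward, done))

    def sample(self, batch_size=4):
        # Uniform sampling with replacement, like np.random.randint above
        batch = random.choices(self.storage, k=batch_size)
        states, next_states, actions, rewards, dones = zip(*batch)
        return {
            "state": np.stack(states),
            "next_state": np.stack(next_states),
            "action": np.stack(actions),
            "reward": np.asarray(rewards, dtype=np.float32).reshape(-1, 1),
            "done": np.asarray(dones, dtype=np.float32).reshape(-1, 1),
        }


buf = FifoReplayBuffer(max_size=3)
for t in range(5):
    obs = np.full((3, 2, 2), t, dtype=np.float32)  # made-up tiny "image"
    buf.add(obs, obs + 1, np.zeros(2, dtype=np.float32), float(t), 0.0)

batch = buf.sample(batch_size=4)
print(len(buf.storage), batch["state"].shape, batch["reward"].shape)
```

After five insertions into a buffer of capacity three, only the transitions with rewards 2, 3, and 4 remain, so every sampled reward comes from that set.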

In [12]:
seed_ = 123
max_timesteps = 500 
batch_size = 64
discount = 0.99
eval_freq = 5e3
file_name = 'dt-class-rl'
start_timesteps = 100  # number of initial random-action steps (previously 10e3)
expl_noise = 0.1

env_timesteps = 300
save_models = True

In [14]:
import os
import logging
import contextlib
logger = logging.getLogger("gym-duckietown")
logging.basicConfig(level="ERROR")


@contextlib.contextmanager
def temporarily_disable_logging(logger):
    start_value = logger.disabled
    try:
        logger.disabled = True
        yield
    except UserWarning:
        pass
    finally:
        logger.disabled = start_value

with temporarily_disable_logging(logger):
    local_env = launch_env()
    # local_env = wrap_env(local_env)
    local_env = ResizeWrapper(local_env)
    local_env = ImgWrapper(local_env)

    if not os.path.exists("./pytorch_models"):
        os.makedirs("./pytorch_models")

    # Set seeds
    seedall(seed_)

    state_dim = local_env.observation_space.shape
    action_dim = local_env.action_space.shape[0]
    max_action = float(local_env.action_space.high[0])

    # Initialize policy
    policy = DDPGAgent(state_dim, action_dim, max_action)

    replay_buffer = ReplayBuffer()

    # Evaluate untrained policy
    evaluations= [evaluate_policy(local_env, policy)]

    total_timesteps = 0
    timesteps_since_eval = 0
    episode_num = 0
    done = True
    episode_reward = None
    env_counter = 0

    while total_timesteps < max_timesteps:
        if done:
            if total_timesteps != 0:
                print(("Total T: %d Episode Num: %d Episode T: %d Reward: %f") % (
                    total_timesteps, episode_num, episode_timesteps, episode_reward))
                policy.train(replay_buffer, episode_timesteps, batch_size, discount)

            # Evaluate episode
            if timesteps_since_eval >= eval_freq:
                timesteps_since_eval %= eval_freq
                evaluations.append(evaluate_policy(local_env, policy))

                policy.save(file_name, directory="./pytorch_models")
                np.savez("./pytorch_models/{}.npz".format(file_name),evaluations)

            # Reset environment
            env_counter += 1
            obs = local_env.reset()
            done = False
            episode_reward = 0
            episode_timesteps = 0
            episode_num += 1

        # Select action randomly or according to policy
        if total_timesteps < start_timesteps:
            action = local_env.action_space.sample()
        else:
            action = policy.predict(np.array(obs))
            if expl_noise != 0:
                action = (action + np.random.normal(
                    0,
                    expl_noise,
                    size=local_env.action_space.shape[0])
                ).clip(-1, +1)

        # Perform action
        new_obs, reward, done, _ = local_env.step(action)

        if episode_timesteps >= env_timesteps:
            done = True
            print("DONE")

        done_bool = 0 if episode_timesteps + 1 == env_timesteps else float(done)
        episode_reward += reward

        # Store data in replay buffer
        replay_buffer.add(obs, new_obs, action, reward, done_bool)

        obs = new_obs

        episode_timesteps += 1
        total_timesteps += 1
        timesteps_since_eval += 1
        print("total timesteps:", total_timesteps)

    # Final evaluation
    evaluations.append(evaluate_policy(local_env, policy))

    if save_models:
        policy.save(file_name, directory="./pytorch_models")
    np.savez("./pytorch_models/{}.npz".format(file_name),evaluations)


# Stabilizing DDPG

As you may notice, the above model performs poorly or doesn't converge. Your job is to improve it; first in the notebook, later in the AIDO submission. This last part of the assignment consists of four sections:

**1. There are subtle, but important, bugs that have been introduced into the code above. Your job is to find them, and explain them in your `reinforcement-learning-answers.txt`. You'll want to reread the original [DQN](https://deepmind.com/research/publications/human-level-control-through-deep-reinforcement-learning) and [DDPG](https://arxiv.org/abs/1509.02971) papers in order to better understand the issue, but by answering the following subquestions (*please put the answers to these in the submission for full credit*), you'll be on the right track:**

   a) Read some literature on actor-critic methods, including the original [actor-critic](https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf) paper. What is an issue that you see related to *non-stationarity*? Define what _non-stationarity_ means in the context of machine learning and how it relates to actor-critic methods. In addition, give some hypotheses on why reinforcement learning is much more difficult (from an optimization perspective) than supervised learning, and how the answer to the previous question and this one are related.

   b) What role does the replay buffer play in off-policy reinforcement learning? Its most important parameter is `max_size`. How does changing this value (answer for both increasing and decreasing it) qualitatively affect the training of the algorithm?

   c) **Challenge Question:** Briefly explain how automatic differentiation works. In addition, expand on the difference between a single-element tensor (that `requires_grad`) and a scalar value as it relates to automatic differentiation; when do we want to backpropagate through a single-element tensor, and when do we not? Take a close look at the code and how losses are being backpropagated. On paper or in your favorite drawing software, draw out the actor-critic architecture *as described in the code*, and label how the actor and critic losses are backpropagated. On your diagram, highlight the particular loss that will cause issues with the above code, and fix it.
   
For the next section, please pick **either** the theoretical or the practical pathway. If you don't have access to the necessary compute for the exercise, please do the theoretical portion. 
   
_Theoretical Component_ 

**2. We discussed a case study of DQN in class. The original authors used quite a few tricks to get this to work. Detail some of the following, and explain what problem they solve in training the DQN:**

a) Target Networks

b) Annealed Learning Rates

c) Replay Buffer

d) Random Exploration Period

e) Preprocessing the Image
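
As a concrete reference point for item (a), the sketch below shows the mechanics of two common target-network updates: a periodic hard copy and a soft (Polyak) average. It deliberately uses plain NumPy dictionaries as a stand-in for network parameters; `soft_update`, `hard_update`, and the value of `tau` are illustrative assumptions, not code from this notebook. It shows only how the update is computed; explaining why it helps is left to your answer.

```python
import numpy as np


def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart."""
    for name, online_value in online_params.items():
        target_params[name] = (1.0 - tau) * target_params[name] + tau * online_value


def hard_update(target_params, online_params):
    """Copy the online parameters into the target."""
    for name, online_value in online_params.items():
        target_params[name] = online_value.copy()


# Stand-in "networks": dictionaries of parameter arrays
online = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
target = {"w": np.zeros(2), "b": np.zeros(1)}

soft_update(target, online, tau=0.1)
print(target["w"], target["b"])  # [0.1 0.2] [0.05]

hard_update(target, online)
print(target["w"], target["b"])  # [1. 2.] [0.5]
```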


**3. Read about either [TD3](https://arxiv.org/abs/1802.09477) or [Soft Actor Critic](https://arxiv.org/abs/1801.01290); for your choice, summarize which problems with the standard actor-critic formulation it addresses, and how it solves them.**


_Practical Component_ 

**2. [Optional - if you have access to compute] Using your analysis from the reward engineering ablation, train two agents (after you've found the bugs in DDPG) - one with the standard, `gym-duckietown` reward, and another with the parameters of your choosing. Report each set of parameters, and describe qualitatively what type of behavior the agent produces.**

If you don't have the resources to actually train these agents, instead describe what types of behaviors each reward function might prioritize.

**3. [Optional - if you have access to compute] Following the instructions [here](http://docs.duckietown.org/DT19/AIDO/out/embodied_rl.html), take the saved policy files from this notebook and submit them using the template provided for AIDO submissions. Report your best submission number (i.e., the one you'd like to be graded) in `reinforcement-learning-answers.txt`.**