# Skiing

---

In this notebook an agent of your creation will learn [how to ski](https://gym.openai.com/envs/Skiing-ram-v0/).

### 1. Start the Environment

We begin by importing some necessary packages. 

In [1]:
import sys
sys.path.append("../../")  # the project root is two directories up

In [2]:
import gym

import numpy as np
from tqdm import tqdm

from model import QNetwork
import random
import torch
from torch import nn
from collections import deque
import matplotlib.pyplot as plt
import time
%matplotlib inline
%load_ext autoreload
%autoreload 2


In [3]:
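from pyvirtualdisplay import Display
# Start a virtual (headless) display so env.render() works without a screen.
# Named virtual_display so the IPython import below does not shadow it.
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

is_ipython = 'inline' in plt.get_backend()
if is_ipython:
    from IPython import display

plt.ion()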

### 2. Examine the State and Action Spaces

The simulation contains a single agent, a skier racing downhill through a series of gates. At each time step, it has three actions at its disposal (`env.unwrapped.get_action_meanings()` lists them):
- `0` - no-op (keep the current heading)
- `1` - turn right
- `2` - turn left

The state space has `128` dimensions: the raw contents of the Atari console's RAM, one byte (`0` to `255`) per dimension. Rewards are negative. The score reflects the time taken to finish the run, with missed gates penalized as extra time, so scores closer to zero are better.

Run the code cell below to print some information about the environment.

In [6]:
env = gym.make('Skiing-ram-v0')
print("observation_space: ", env.observation_space)
print("observation_space.high: \n", env.observation_space.high)
print("observation_space.low: \n", env.observation_space.low)
print("Action space: ", env.action_space)

observation_space:  Box(128,)
observation_space.high: 
 [255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255
 255 255]
observation_space.low: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Action space:  Discrete(3)



In [7]:
observation = env.reset()
print(observation)
print(observation[127])
state_size = len(observation)
print(state_size)
print(type(observation))
action_size = env.action_space.n
print(action_size)

[ 79  13 247  15 247  12 247  14 247  10 247   0   0   0 128   8   7   0
 128  64 128 234 246 126 247  76 120   0   1 134  28 234  80   6  52  98
 160 222 128   0   2 133 133   4   2 131 126 157 111 126 126 157 111 126
 255 255 255 255  31   7  15  31  79   7  30 155 152  59  41 128  85   5
   2  85  85   5   2  85   6   5   0   3   2   1   0   7 195 176 214 160
  80  40  24   4   4   0   0   0   0   0   0   0   0   0   0   0   0  32
 170 255   0 255 167 215  85 137   0   0  32   1  63   0 240  72  14 244
  44 241]
241
128
<class 'numpy.ndarray'>
3
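
Each observation is a vector of 128 raw bytes between 0 and 255, as the bounds printed earlier show. Inputs on that scale can make a small fully connected network harder to train, so one common option is to rescale them to the range 0 to 1 before they reach the model. The sketch below illustrates the idea in plain Python. `scale_ram` is a hypothetical helper, not part of this project's code, and whether it helps on this task is untested.

```python
def scale_ram(observation, high=255.0):
    """Rescale raw RAM bytes (0..255) to floats in [0, 1].

    Hypothetical helper for illustration only; `high` mirrors the upper
    bound reported by env.observation_space.high.
    """
    return [float(byte) / high for byte in observation]


# The first four bytes of the observation printed above.
sample = [79, 13, 247, 15]
scaled = scale_ram(sample)
print(scaled)
```

The same function accepts the NumPy array returned by `env.reset()`, since it only iterates over the values.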


In [8]:
# torch is already imported above; use the GPU when one is available.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device


device(type='cpu')

### 3. Take Random Actions in the Environment

In the next code cell, you will use the Gym Python API to control the agent and receive feedback from the environment.

When the cell runs, the agent selects an action uniformly at random at each time step. If a display is available, a window should pop up that lets you watch the agent as it moves through the environment.

Of course, as part of the project, you'll have to change the code so that the agent uses its experience to gradually choose better actions when interacting with the environment!

In [None]:
env.reset()
episode = 1
reward_on_episode = 0
rewards_per_episode = []
for i in range(10000):
    env.render()
    observation, reward, done, info = env.step(env.action_space.sample()) # take a random action
    reward_on_episode += reward
    if done:
        print("DONE episode %d; reward obtained = %.2f" % (episode, reward_on_episode))
        episode += 1
        rewards_per_episode.append(reward_on_episode)
        reward_on_episode = 0
        env.reset()


In [None]:
type(env)

In [None]:
# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(rewards_per_episode)), rewards_per_episode)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.title('Skiing randomly')
plt.show()

In [39]:
# create an agent
from dqn_agent import DQNAgent, DoubleDQNAgent
from memory import ReplayBuffer, WeightedReplayBuffer

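# Hidden layer widths shrink from 80% to 40% of the input size,
# followed by a narrow layer of twice the number of actions.
hidden_sizes = [
    int(round(state_size * .8)),
    int(round(state_size * .6)),
    int(round(state_size * .4)),
    int(round(action_size * 2))]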
# self.hidden_sizes = [int(round(state_size * 10)), int(round(state_size * 5)), int(round(action_size * 10))]
fc = nn.Sequential(
    nn.Linear(state_size, hidden_sizes[0]),
    nn.ReLU(),
    nn.Linear(hidden_sizes[0], hidden_sizes[1]),
    nn.ReLU(),
    nn.Linear(hidden_sizes[1], hidden_sizes[2]),
    nn.ReLU(),
    nn.Linear(hidden_sizes[2], hidden_sizes[3]),
    nn.ReLU(),
    nn.Linear(hidden_sizes[3], action_size)
)
fc2 = nn.Sequential(
    nn.Linear(state_size, hidden_sizes[0]),
    nn.ReLU(),
    nn.Linear(hidden_sizes[0], hidden_sizes[1]),
    nn.ReLU(),
    nn.Linear(hidden_sizes[1], hidden_sizes[2]),
    nn.ReLU(),
    nn.Linear(hidden_sizes[2], hidden_sizes[3]),
    nn.ReLU(),
    nn.Linear(hidden_sizes[3], action_size)
)

main_model = QNetwork(name="main", fc=fc)
target_model = QNetwork(name="target", fc=fc2)


In [55]:
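# Note: both agents below receive the same main_model and target_model
# objects, so unless DQNAgent copies them, training one agent also
# changes the weights used by the other.
agent_dqn = DQNAgent(
    main_model=main_model,
    target_network=target_model,
    update_every_steps=10,
    lr=1e-3)

agent_dqn_prioritized_replay = DQNAgent(
    main_model=main_model,
    target_network=target_model,
    memory=WeightedReplayBuffer(buffer_size=int(1e5), batch_size=64),
    update_every_steps=10,
    lr=1e-3)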


In [41]:
agent_dqn.qnetwork_local

QNetwork(
  (fc): Sequential(
    (0): Linear(in_features=128, out_features=102, bias=True)
    (1): ReLU()
    (2): Linear(in_features=102, out_features=77, bias=True)
    (3): ReLU()
    (4): Linear(in_features=77, out_features=51, bias=True)
    (5): ReLU()
    (6): Linear(in_features=51, out_features=6, bias=True)
    (7): ReLU()
    (8): Linear(in_features=6, out_features=3, bias=True)
  )
)
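
The layer widths in the printout follow directly from the ratios used to build `hidden_sizes`. As a quick check of that arithmetic, the sketch below recomputes them for `state_size = 128` and `action_size = 3`. `layer_sizes` is a hypothetical helper written for this illustration.

```python
def layer_sizes(state_size, action_size, ratios=(0.8, 0.6, 0.4)):
    """Return the width of every layer, from input to output.

    Hypothetical helper: mirrors how hidden_sizes is computed in the
    notebook, then adds the input and output widths at the ends.
    """
    hidden = [int(round(state_size * ratio)) for ratio in ratios]
    hidden.append(int(round(action_size * 2)))
    return [state_size] + hidden + [action_size]


sizes = layer_sizes(state_size=128, action_size=3)
print(sizes)  # [128, 102, 77, 51, 6, 3]
```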

In [None]:
from runner.gym_runner import run as GymRun

skiing_scores_dqn = GymRun(
    agent_dqn,
    env,
    render_env=False,
    n_episodes=500,
    max_t=2000,
    feedback_every_secs=25,
    keep_last_scores=50,
    eps_start=0.995)

Episode:   3%|▎         | 16/500 [00:25<12:42,  1.58s/it]

Episode 16, eps: 0.918	Average Score (last 50 episodes): -14915.44


Episode:   6%|▌         | 30/500 [00:50<15:00,  1.92s/it]

Episode 30, eps: 0.856	Average Score (last 50 episodes): -14277.37


Episode:   9%|▉         | 45/500 [01:18<16:09,  2.13s/it]

Episode 45, eps: 0.794	Average Score (last 50 episodes): -13801.16


Episode:  12%|█▏        | 59/500 [01:44<13:30,  1.84s/it]

Episode 59, eps: 0.740	Average Score (last 50 episodes): -12652.66


Episode:  14%|█▍        | 72/500 [02:10<14:54,  2.09s/it]

Episode 72, eps: 0.694	Average Score (last 50 episodes): -11420.80


Episode:  17%|█▋        | 84/500 [02:35<14:31,  2.09s/it]

Episode 84, eps: 0.653	Average Score (last 50 episodes): -10539.96


Episode:  19%|█▉        | 97/500 [03:02<14:02,  2.09s/it]

Episode 97, eps: 0.612	Average Score (last 50 episodes): -10232.40


Episode:  22%|██▏       | 110/500 [03:27<12:51,  1.98s/it]

Episode 110, eps: 0.573	Average Score (last 50 episodes): -10809.70


Episode:  25%|██▍       | 123/500 [03:53<12:07,  1.93s/it]

Episode 123, eps: 0.537	Average Score (last 50 episodes): -10813.22


Episode:  27%|██▋       | 137/500 [04:20<11:39,  1.93s/it]

Episode 137, eps: 0.501	Average Score (last 50 episodes): -10810.70


Episode:  30%|███       | 151/500 [04:47<11:05,  1.91s/it]

Episode 151, eps: 0.467	Average Score (last 50 episodes): -10250.12


Episode:  33%|███▎      | 164/500 [05:13<10:56,  1.96s/it]

Episode 164, eps: 0.437	Average Score (last 50 episodes): -9984.50


Episode:  35%|███▌      | 176/500 [05:39<11:50,  2.19s/it]

Episode 176, eps: 0.412	Average Score (last 50 episodes): -9983.36


Episode:  38%|███▊      | 188/500 [06:06<12:09,  2.34s/it]

Episode 188, eps: 0.388	Average Score (last 50 episodes): -9985.56


Episode:  40%|████      | 200/500 [06:32<10:47,  2.16s/it]

Episode 200, eps: 0.365	Average Score (last 50 episodes): -9996.26


Episode:  42%|████▏     | 211/500 [07:02<16:06,  3.34s/it]

Episode 211, eps: 0.346	Average Score (last 50 episodes): -9993.02


Episode:  44%|████▍     | 221/500 [07:27<12:07,  2.61s/it]

Episode 221, eps: 0.329	Average Score (last 50 episodes): -9997.82


Episode:  46%|████▌     | 230/500 [07:54<11:42,  2.60s/it]

Episode 230, eps: 0.314	Average Score (last 50 episodes): -10002.06


Episode:  48%|████▊     | 242/500 [08:19<09:08,  2.13s/it]

Episode 242, eps: 0.296	Average Score (last 50 episodes): -9982.78


Episode:  51%|█████     | 254/500 [08:46<09:26,  2.30s/it]

Episode 254, eps: 0.279	Average Score (last 50 episodes): -9983.20


Episode:  53%|█████▎    | 263/500 [09:06<08:49,  2.23s/it]
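
The `eps` values in this log are consistent with a multiplicative decay of about 0.995 per episode, starting from `eps_start=0.995`. That decay factor is inferred from the printed numbers, not read from the runner's source, so treat the sketch below as an assumption about the schedule. `epsilon_after`, `eps_decay`, and `eps_min` are hypothetical names.

```python
def epsilon_after(episodes_done, eps_start=0.995, eps_decay=0.995, eps_min=0.0):
    """Epsilon after `episodes_done` episodes of multiplicative decay.

    eps_decay and eps_min are assumptions inferred from the training log,
    not values taken from runner.gym_runner.
    """
    return max(eps_min, eps_start * eps_decay ** episodes_done)


# Compare against the log lines for episodes 16, 30, 200 and 254.
for episode in (16, 30, 200, 254):
    print(episode, round(epsilon_after(episode), 3))
```

Under these assumptions the loop reproduces the logged values 0.918, 0.856, 0.365 and 0.279.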

In [None]:
state = env.reset() # reset the environment, get the current state
total_reward = 0
for _ in range(1000):
    env.render()
    action = agent_dqn.act(state=state, eps=.09)
    next_state, reward, done, info = env.step(action)        # send the action to the environment, get data
    total_reward += reward
    state = next_state                             # roll over the state to next time step
    if done:
        break
print("Reward obtained = %.2f" % (total_reward))
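
The call `agent_dqn.act(state=state, eps=.09)` suggests epsilon-greedy action selection: with probability `eps` the agent explores with a random action, otherwise it takes the action with the highest estimated value. The agent's code is not shown in this notebook, so the following is only a minimal sketch of the general technique. `epsilon_greedy` and the Q-values are made up for illustration.

```python
import random


def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, else the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    # Index of the largest Q-value (first one wins on ties).
    return max(range(len(q_values)), key=lambda action: q_values[action])


q_values = [-3.2, -1.5, -2.7]  # made-up Q-values for the three actions
greedy_action = epsilon_greedy(q_values, eps=0.0)
print(greedy_action)  # 1
```

With `eps=0.0` the choice is always greedy. A small value such as the `.09` used above keeps a little exploration during evaluation.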


In [None]:
# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(skiing_scores_dqn)), skiing_scores_dqn)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.title('Skiing scores')
plt.show()
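
Per-episode scores are noisy, which is why the training log reports an average over the last 50 episodes. A similar rolling mean can be computed for plotting. The sketch below uses `collections.deque` with a fixed `maxlen`. `rolling_mean` is a hypothetical helper and may differ from how the runner computes its average.

```python
from collections import deque


def rolling_mean(scores, window=50):
    """Mean of the most recent `window` scores after each episode."""
    recent = deque(maxlen=window)  # old scores drop out automatically
    means = []
    for score in scores:
        recent.append(score)
        means.append(sum(recent) / len(recent))
    return means


smoothed = rolling_mean([-10.0, -20.0, -30.0, -40.0], window=2)
print(smoothed)  # [-10.0, -15.0, -25.0, -35.0]
```

Passing `rolling_mean(skiing_scores_dqn)` to `plt.plot` alongside the raw scores would make the trend easier to read.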

In [None]:
# plot the scores returned by the training run above
# (scores_dqn is never defined in this notebook; skiing_scores_dqn is)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(skiing_scores_dqn)), skiing_scores_dqn)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.title('DQN')
plt.show()

When finished, you can close the environment.

In [52]:
env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  If you closed the environment above, create it again before training. Resetting it returns the initial state:
```python
env = gym.make('Skiing-ram-v0')
state = env.reset()
```