<a href="https://colab.research.google.com/github/jufabeck2202/KI-Lab/blob/main/aufgabe6/DeepReinforcementlearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

General idea:
1. Use a neural network that takes a state (represented as numbers) as input
and outputs a probability for every action.
2. Generate episodes by feeding the current state into the network and
sampling actions from the network's output. Record the
<state, action> pairs of every episode.
3. From these episodes, identify the ones with the highest total reward.
4. Use the <state, action> pairs from those high-reward episodes as
training examples for improving the neural network.
5. Go back to step 2.
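
The five steps above can be sketched end to end on a much smaller problem. The example below is only an illustration under stated assumptions: the corridor environment, its rewards, and the table of action probabilities (standing in for the neural network) are all made up here and are not part of the exercise.

```python
import heapq
import random

# Hypothetical toy corridor: positions 0..4, start in the middle, goal on the right.
N_STATES, START, GOAL, MAX_STEPS = 5, 2, 4, 20
LEFT, RIGHT = 0, 1


def run_episode(policy, rng):
    """Play one episode; return (total_reward, [(state, action), ...])."""
    state, total_reward, pairs = START, 0.0, []
    for _ in range(MAX_STEPS):
        action = rng.choices([LEFT, RIGHT], weights=policy[state])[0]
        pairs.append((state, action))  # remember the state the action was chosen in
        state += 1 if action == RIGHT else -1
        total_reward -= 0.01  # small cost per step so shorter episodes rank higher
        if state == GOAL:
            total_reward += 1.0
            break
        if state == 0:
            total_reward -= 1.0
            break
    return total_reward, pairs


def improve(policy, elite, step=0.5):
    """Move each state's action probabilities toward the elite action frequencies."""
    counts = [[0, 0] for _ in range(N_STATES)]
    for _, pairs in elite:
        for state, action in pairs:
            counts[state][action] += 1
    for state, (n_left, n_right) in enumerate(counts):
        visits = n_left + n_right
        if visits:
            target = [n_left / visits, n_right / visits]
            policy[state] = [(1 - step) * p + step * t
                             for p, t in zip(policy[state], target)]


def train_on_elite_episodes(iterations=10, episodes_n=100, elite_n=20, seed=0):
    rng = random.Random(seed)
    policy = [[0.5, 0.5] for _ in range(N_STATES)]  # step 1: start with a uniform policy
    for _ in range(iterations):
        episodes = [run_episode(policy, rng) for _ in range(episodes_n)]  # step 2
        elite = heapq.nlargest(elite_n, episodes, key=lambda e: e[0])     # step 3
        improve(policy, elite)                                            # step 4
    return policy                                                         # step 5 is the loop


policy = train_on_elite_episodes()
print(policy[START])  # probabilities for [LEFT, RIGHT] in the start state
```

In the notebook, the table update in `improve` is replaced by a gradient step on the network, but the sample, select, and train structure is the same.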

Task 2
1. Define a neural network with two fully connected layers. The hidden layer uses a
ReLU activation. The output layer uses a softmax. Try different hidden layer sizes
(between 100 and 500). The network takes the current state vector as input
and outputs a probability for each action.
2. Generate 100 episodes by sampling actions using the network output. Limit the
number of steps per episode to 500. Sum up the reward of all steps of one episode.
3. Print out the mean reward per episode of the 100 episodes.
4. Identify the 20 best of those episodes in terms of reward and use the
<state, action> pairs of these episodes as training examples for the network.
5. Update the weights of the network by performing backpropagation on these <state,
action> pairs.
6. Repeat steps 2 to 5 until a mean reward of 100 is reached.
7. Record a video of the lunar lander by running the trained network on one additional
episode.

Hints
* Use !pip3 install box2d-py to make the environment work in Colab.
* You cannot show the video of your lander in Colab (env.render() fails).  
* Workaround: Download the model on your local machine and record the video there,
using recording_demo.py as template (see Mattermost).
* The loss is not a useful indicator for the learning progress in RL. Instead check how
the mean reward develops over time.
* The mean reward will jump back and forth quite a bit, but overall should increase.
* After roughly 70 training iterations the mean reward should be positive, and after
roughly 100 iterations it should be above 100.
* Note that these numbers depend on your parameter settings; training may take
longer or shorter.
* Reinforcement learning is much more difficult than supervised learning. Expect to
experiment quite a bit with the parameters before training moves in the right direction.
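
Because the mean reward jumps back and forth, a smoothed curve can make the overall trend easier to see. The helper below is a minimal sketch; the name `moving_average`, the window size, and the reward values are made up for illustration.

```python
def moving_average(values, window=10):
    """Average of the last `window` values at each position (fewer at the start)."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the value that left the window
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Made-up mean rewards of six training iterations
mean_rewards = [-200.0, -150.0, -180.0, -90.0, -120.0, -40.0]
smoothed = moving_average(mean_rewards, window=3)
print(smoothed)
```

Plotting `smoothed` next to the raw values with `plt.plot` shows whether training is still improving despite the noise.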

# Imports

In [1]:
!pip3 install box2d-py gym > /dev/null

In [21]:
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as f
import numpy as np
import matplotlib.pyplot as plt
import heapq
%matplotlib inline


In [8]:
env = gym.make('LunarLander-v2')
rewards = []

# Inspect the observation space and the action space
print(env.observation_space.shape)
print(env.action_space)
print()
print(env.observation_space.sample())
print(env.action_space.sample())

(8,)
Discrete(4)

[-0.15572956 -0.353235   -0.1183357  -0.13081883  0.6019697  -0.8882228
  0.26233912 -2.2259324 ]
1


# Define Network

In [29]:
torch.backends.cudnn.enabled = True
GPU_ON = torch.cuda.is_available()
device = torch.device("cuda:0" if GPU_ON else "cpu")
device

device(type='cpu')

In [33]:
class Network(torch.nn.Module):
    def __init__(self, hidden):
        super(Network, self).__init__()
        # input == observation space, for this problem it has 8 entries
        self.linear1 = torch.nn.Linear(env.observation_space.shape[0], hidden)
        # output == action space: do nothing, fire left engine, fire main engine,
        # fire right engine, so 4 actions
        self.linear2 = torch.nn.Linear(hidden, env.action_space.n)

    def forward(self, state):
        hidden = f.relu(self.linear1(state))
        # softmax over the action dimension (works for single states and batches)
        return f.softmax(self.linear2(hidden), dim=-1)
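
To make the role of the output layer concrete, here is a small plain-Python sketch of what a softmax computes. It is not the PyTorch implementation; the function and the four input scores are made up for illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    largest = max(scores)
    # Subtracting the maximum does not change the result but avoids overflow in exp
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 0.5, -1.0])  # four made-up action scores
print(probs)
```

The largest score receives the largest probability, and every action keeps a non-zero chance of being sampled, which is what lets the agent explore.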

In [None]:
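def train(training_data):
    # training_data is a list of episodes as returned by sample();
    # flatten their [action, observation] steps into one batch
    pairs = [step for episode in training_data for step in episode["steps"]]
    actions = torch.tensor([a for a, _ in pairs], dtype=torch.long, device=device)
    observations = torch.tensor(np.array([o for _, o in pairs]),
                                dtype=torch.float32, device=device)
    # zero the parameter gradients
    optimizer.zero_grad()
    # forward + backward + optimize
    probs = net(observations)
    # the network already outputs probabilities, so take the negative
    # log-likelihood of the chosen actions (epsilon avoids log(0))
    loss = f.nll_loss(torch.log(probs + 1e-8), actions)
    loss.backward()
    optimizer.step()
    return loss.item()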

# Sample Episodes

In [15]:
def sample(episodes_n=100, max_steps=500):
  episodes_data = []
  for i_episode in range(episodes_n):
      steps = []
      rewards = []
      observation = env.reset()
      for t in range(max_steps):
          # NOTE: actions are drawn uniformly at random here, not from the network
          action = env.action_space.sample()
          next_observation, reward, done, info = env.step(action)
          rewards.append(reward)
          # store the state in which the action was chosen, not the resulting one
          steps.append([action, observation])
          observation = next_observation
          if done:
              break
      # this prints the mean reward per step; the episode total is stored below
      print(f"Mean reward of episode {i_episode}: {np.array(rewards).mean()}")
      episodes_data.append({ "total_reward": np.array(rewards).sum() ,"steps": steps})
  return episodes_data
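
Step 2 of the task asks for actions sampled from the network output rather than uniformly at random. The sketch below shows only the sampling step, assuming the probabilities have already been computed; `sample_action` is a hypothetical helper, not part of gym or PyTorch.

```python
import random


def sample_action(action_probs, rng=random):
    """Draw one action index according to the given probabilities."""
    actions = list(range(len(action_probs)))
    return rng.choices(actions, weights=action_probs, k=1)[0]


rng = random.Random(0)  # seeded generator so the draws are reproducible
draws = [sample_action([0.1, 0.2, 0.3, 0.4], rng) for _ in range(10000)]
print([draws.count(a) for a in range(4)])  # counts roughly follow the probabilities
```

In `sample`, a call like this would take the place of `env.action_space.sample()`, with `action_probs` coming from a forward pass of the network on the current observation.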

In [18]:
episodes_data = sample()

Mean reward of episode 0: -1.8671913423563917
Mean reward of episode 1: -1.2933785652526817
Mean reward of episode 2: -1.3527540171949266
Mean reward of episode 3: -3.079354920344859
Mean reward of episode 4: -2.767937045128633
Mean reward of episode 5: -2.125392769794649
Mean reward of episode 6: -1.689989581339652
Mean reward of episode 7: -0.3373901094490223
Mean reward of episode 8: -1.0452666530154997
Mean reward of episode 9: -2.3270546329412616
Mean reward of episode 10: -0.871604878093666
Mean reward of episode 11: -1.9051051518316524
Mean reward of episode 12: -0.8645924647355984
Mean reward of episode 13: -3.6880257965894003
Mean reward of episode 14: -3.250509963309707
Mean reward of episode 15: -2.6520184786944614
Mean reward of episode 16: -4.948416313515281
Mean reward of episode 17: -1.1253923274732167
Mean reward of episode 18: -0.8902477653690054
Mean reward of episode 19: -1.478248514381291
Mean reward of episode 20: -2.5758432158338276
Mean reward of episode 21: -0.9

In [17]:
def calculate_total_mean_reward(episode_data):
   all_rewards = [d['total_reward'] for d in episode_data]
   return np.array(all_rewards).mean()

In [14]:
calculate_total_mean_reward(episodes_data)

-203.85825702792033

In [26]:
best_20 = heapq.nlargest(20, episodes_data, key=lambda s: s['total_reward'])
calculate_total_mean_reward(best_20)

-68.25264381334406
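
Before training, the selected episodes have to be turned into flat lists of inputs and labels. The following is a minimal sketch that assumes the episode layout produced by `sample` above (a dict with `total_reward` and a list of `[action, observation]` steps); the helper name and the example data are made up.

```python
def to_training_pairs(episodes):
    """Flatten the [action, observation] steps of all episodes into two lists."""
    states, actions = [], []
    for episode in episodes:
        for action, observation in episode["steps"]:
            states.append(list(observation))  # network input
            actions.append(action)            # label: the action that was taken
    return states, actions


# Two made-up episodes with 2-dimensional observations
example_episodes = [
    {"total_reward": -10.0, "steps": [[0, [0.1, 0.2]], [2, [0.3, 0.4]]]},
    {"total_reward": -25.0, "steps": [[1, [0.5, 0.6]]]},
]
states, actions = to_training_pairs(example_episodes)
print(len(states), actions)
```

Every step of an elite episode becomes one training example, so long episodes contribute more examples than short ones.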

In [27]:
best_20[0]

{'steps': [[0,
   array([-0.00818567,  1.4014179 , -0.4139943 , -0.22400075,  0.0093892 ,
           0.09281062,  0.        ,  0.        ], dtype=float32)],
  [0, array([-0.01227884,  1.3957785 , -0.4140088 , -0.2506794 ,  0.01402657,
           0.09275611,  0.        ,  0.        ], dtype=float32)],
  [1, array([-0.0164587 ,  1.3895322 , -0.42488346, -0.2776981 ,  0.02084308,
           0.13634305,  0.        ,  0.        ], dtype=float32)],
  [2, array([-0.02073212,  1.3837118 , -0.43384618, -0.2587809 ,  0.02727913,
           0.12873283,  0.        ,  0.        ], dtype=float32)],
  [3, array([-0.02491045,  1.3772984 , -0.42191187, -0.2851285 ,  0.03131577,
           0.08074023,  0.        ,  0.        ], dtype=float32)],
  [0, array([-0.02908888,  1.3702849 , -0.4219226 , -0.31179398,  0.03535268,
           0.08074541,  0.        ,  0.        ], dtype=float32)],
  [0, array([-0.0332675 ,  1.3626717 , -0.4219347 , -0.33846295,  0.03938843,
           0.08072238,  0.        ,  0. 

In [None]:
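net = Network(100)
net.to(device)

# Optimizer used by train(); Adam and the learning rate are assumptions, tune as needed
optimizer = optim.Adam(net.parameters(), lr=0.01)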

In [None]:

train(best_20)
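
Step 6 of the task repeats sampling, selection, and training until the mean reward reaches 100. One way to organize that loop is sketched below. The function `run_until_solved` and its parameters are hypothetical; it takes the sampling and training functions as arguments so that it does not depend on how they are implemented.

```python
import heapq


def run_until_solved(sample_fn, train_fn, target_reward=100.0,
                     elite_n=20, max_iterations=500):
    """Repeat sample -> select -> train until the mean reward reaches the target."""
    history = []
    for _ in range(max_iterations):
        episodes = sample_fn()
        mean_reward = sum(e["total_reward"] for e in episodes) / len(episodes)
        history.append(mean_reward)
        if mean_reward >= target_reward:
            break
        # keep only the episodes with the highest total reward
        elite = heapq.nlargest(elite_n, episodes, key=lambda e: e["total_reward"])
        train_fn(elite)
    return history


# Stand-in functions for illustration: each training call raises all rewards by 50
progress = {"skill": 0}

def fake_sample():
    return [{"total_reward": progress["skill"] * 50.0 + i, "steps": []} for i in range(4)]

def fake_train(elite):
    progress["skill"] += 1

history = run_until_solved(fake_sample, fake_train, target_reward=100.0, elite_n=2)
print(history)
```

In the notebook this would be called as `run_until_solved(sample, train)`, assuming `sample` has been changed to draw its actions from the network. The `max_iterations` cap keeps the loop from running forever if the target is never reached.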