<a href="https://colab.research.google.com/github/jufabeck2202/KI-Lab/blob/main/aufgabe6/DeepReinforcementlearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

General idea:
1. Use a neural network that takes as input a state (represented as numbers)
and outputs a probability for every action.
2. Generate episodes by inputing the current state into the network and
sampling actions from the networkâ€™s output. Remember the
<state, action> pairs for every episode.
3. From these episodes, identify the ones with the highest reward.
4. Use the <state, action> pairs from those high reward episodes as
training examples for improving the neural network.
5. Go back to step 2

Task 2
1. Define a neural network with two fully connected-layers. The hidden layer uses a
Relu activation. The output layer uses a softmax. Try different hidden layer sizes
(between 100 and 500). The network takes as input a vector of the current states
and gives out a probability for each action.
2. Generate 100 episodes by sampling actions using the network output. Limit the
number of steps per episode to 500. Sum up the reward of all steps of one episode.
3. Print out the mean reward per episode of the 100 episodes.
4. Identify the 20 best of those episodes in terms of reward and use the
<state, action> pairs of these episodes as training examples for the network.
5. Update the weights of the network by performing backpropagation on these <state,
action> pairs.
6. Repeat steps 2 to 5 until a mean reward of 100 is reached.
7. Record a video of the lunar lander by running the trained network on one additional
episode.

Hints
* Use !pip3 install box2d-py to make the environment work in Colab.
* You cannot show the video of your lander in Colab (env.render() fails).  
* Workaround: Download the model on your local machine and record the video there,
using recording_demo.py as template (see Mattermost).
* The loss is not a useful indicator for the learning progress in RL. Instead check how
the mean reward develops over time.
* The mean reward will jump back and forth quite a bit, but overall should increase.
* After roughly 70 training iterations the mean reward should be positive, and after
roughly 100 steps be over 100.
* Note that these numbers depend on your parameter setting and it may also take
longer or shorter.
* Reinforcement learning is much more difficult than supervised learning, you have to
play around quite a bit to get things into the right direction.

# Imports

In [1]:
!pip3 install box2d-py gym > /dev/null

In [2]:
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as f
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


In [26]:
env = gym.make('LunarLander-v2')
rewards = []

##get Action space
print(env.observation_space.shape)
print(env.action_space)
print()
print(env.observation_space.sample())
print(env.action_space.sample())

(8,)
Discrete(4)

[ 0.7880384   0.07604121 -1.5635289  -0.39409915 -1.0680902   0.00341499
 -1.4095597   0.5805479 ]
1


# Define Network

In [20]:
class Network(torch.nn.Module):
    def __init__(self, hidden):
        super(TwoLayerNet, self).__init__()
        # input == observation space, for this problem its 8
        self.linear1 = torch.nn.Linear(env.observation_space.shape[0], hidden)
        # action == action space, up down left right for this problem, so 4
        self.linear2 = torch.nn.Linear(hidden, env.action_space.n)

    def forward(self, state):
        hidden = f.relu(self.linear1(state))
        return f.softmax(self.linear2(hidden)) 

# Sample Episodes

In [47]:
def sample(episodes_n=100, max_steps=500):
  episodes_data = []
  for i_episode in range(episodes_n):
      steps = []
      rewards = []
      observation = env.reset()
      for t in range(max_steps):
          action = env.action_space.sample()
          observation, reward, done, info = env.step(action)
          rewards.append(reward)
          steps.append([reward, observation, reward])
          rewards.append(reward)
          if done:
              break
      episodes_data.append({ "total_reward": np.array(rewards).sum() ,"steps": steps})
  return np.array(episodes_data)

In [71]:
def calculate_mean_reward(episode_data):
   all_rewards = [d['total_reward'] for d in episode_data]
   return np.array(all_rewards).mean()

In [76]:
episodes_data = sample()

In [73]:
calculate_mean_reward()

-351.9240413734967