<a href="https://colab.research.google.com/github/jufabeck2202/KI-Lab/blob/main/aufgabe6/DeepReinforcementlearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

General idea:
1. Use a neural network that takes as input a state (represented as numbers)
and outputs a probability for every action.
2. Generate episodes by inputing the current state into the network and
sampling actions from the network’s output. Remember the
<state, action> pairs for every episode.
3. From these episodes, identify the ones with the highest reward.
4. Use the <state, action> pairs from those high reward episodes as
training examples for improving the neural network.
5. Go back to step 2

Task 2
1. Define a neural network with two fully connected-layers. The hidden layer uses a
Relu activation. The output layer uses a softmax. Try different hidden layer sizes
(between 100 and 500). The network takes as input a vector of the current states
and gives out a probability for each action.
2. Generate 100 episodes by sampling actions using the network output. Limit the
number of steps per episode to 500. Sum up the reward of all steps of one episode.
3. Print out the mean reward per episode of the 100 episodes.
4. Identify the 20 best of those episodes in terms of reward and use the
<state, action> pairs of these episodes as training examples for the network.
5. Update the weights of the network by performing backpropagation on these <state,
action> pairs.
6. Repeat steps 2 to 5 until a mean reward of 100 is reached.
7. Record a video of the lunar lander by running the trained network on one additional
episode.

Hints
* Use !pip3 install box2d-py to make the environment work in Colab.
* You cannot show the video of your lander in Colab (env.render() fails).  
* Workaround: Download the model on your local machine and record the video there,
using recording_demo.py as template (see Mattermost).
* The loss is not a useful indicator for the learning progress in RL. Instead check how
the mean reward develops over time.
* The mean reward will jump back and forth quite a bit, but overall should increase.
* After roughly 70 training iterations the mean reward should be positive, and after
roughly 100 steps be over 100.
* Note that these numbers depend on your parameter setting and it may also take
longer or shorter.
* Reinforcement learning is much more difficult than supervised learning, you have to
play around quite a bit to get things into the right direction.

# Imports

In [133]:
!pip3 install box2d-py gym > /dev/null

In [134]:
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as f
import numpy as np
import matplotlib.pyplot as plt
import heapq
import torch.optim as optim
%matplotlib inline


In [135]:
env = gym.make('LunarLander-v2')
rewards = []

##get Action space
print(env.observation_space.shape)
print(env.action_space)
print()
print(env.observation_space.sample())
print(env.action_space.sample())

(8,)
Discrete(4)

[ 0.63734573 -0.7649312   0.56808794  0.96067256  0.40587637  0.6303302
  0.34192967  0.9885021 ]
3


# Define Network

In [136]:
torch.backends.cudnn.enabled = True
GPU_ON = torch.cuda.is_available()
device = torch.device("cuda:0" if GPU_ON else "cpu")
device

device(type='cpu')

In [137]:
class Network(torch.nn.Module):
    def __init__(self, hidden):
        super(Network, self).__init__()
        # input == observation space, for this problem its 8
        self.linear1 = torch.nn.Linear(env.observation_space.shape[0], hidden)
        # action == action space, up down left right for this problem, so 4
        self.linear2 = torch.nn.Linear(hidden, env.action_space.n)

    def forward(self, state):
        hidden = f.relu(self.linear1(state))
        return f.softmax(self.linear2(hidden)) 

# Sample Episodes

In [138]:
def sample(episodes_n=100, max_steps=500):
  episodes_data = []
  for i_episode in range(episodes_n):
      steps = []
      rewards = []
      observation = env.reset()
      for t in range(max_steps):
          action = env.action_space.sample()
          observation, reward, done, info = env.step(action)
          rewards.append(reward)
          steps.append([action, observation])
          if done:
              break
      #print(f"Mean reward of episode {i_episode}: {np.array(rewards).mean()}")
      episodes_data.append({ "total_reward": np.array(rewards).sum() ,"steps": steps})
  return np.array(episodes_data)

In [139]:
episodes_data = sample()

In [140]:
def calculate_total_mean_reward(episode_data):
   all_rewards = [d['total_reward'] for d in episode_data]
   return np.array(all_rewards).mean()

In [141]:
calculate_total_mean_reward(episodes_data)

-169.80325244098614

In [142]:
best_20 = heapq.nlargest(20, episodes_data, key=lambda s: s['total_reward'])
calculate_total_mean_reward(best_20)

-56.479210049855894

In [143]:
def get_training_data(episode_data):
  train_data = []
  for entry in episodes_data:
    for step in entry['steps']:
      one_hot = np.zeros(env.action_space.n)
      one_hot[step[0]] = 1
      train_data.append([one_hot, step[1]])
  return train_data


In [144]:
def get_training_data_batch(episode_data):
  x_data = []
  y_data = []
  for entry in episodes_data:
    for step in entry['steps']:
      one_hot = np.zeros(env.action_space.n)
      one_hot[step[0]] = 1
      x_data.append(step[1])
      y_data.append(one_hot)
  return [x_data, y_data]

In [188]:
net = Network(100)
#criterion = nn.CrossEntropyLoss()
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.9)
if GPU_ON:
  net.cuda()

In [241]:
def train(training_data):
    for data in training_data:
        # get the inputs; data is a list of [inputs, labels]
        action, observation = data
        #convert to torch?
        observation=torch.from_numpy(np.array(observation)).float()
        action =torch.from_numpy(np.array(action )).float()
        print(f"state {state}")

        if GPU_ON:
          action = action.cuda()
          observation = observation.cuda()
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        output_action = net(observation)
        loss = criterion(output_action, action)
        print(f"Predicted action {output_action}")
        print(f"Target action{action}")
        loss.backward()
        optimizer.step()

In [242]:
train(get_training_data(best_20))

state [-0.0130374   1.426001   -0.66486704  0.3224429   0.01671271  0.1839219
  0.          0.        ]
Predicted action tensor([0., 1., 0., 0.], grad_fn=<SoftmaxBackward>)
Target actiontensor([0., 1., 0., 0.])
state [-0.0130374   1.426001   -0.66486704  0.3224429   0.01671271  0.1839219
  0.          0.        ]
Predicted action tensor([0., 1., 0., 0.], grad_fn=<SoftmaxBackward>)
Target actiontensor([0., 0., 1., 0.])
state [-0.0130374   1.426001   -0.66486704  0.3224429   0.01671271  0.1839219
  0.          0.        ]
Predicted action tensor([0., 1., 0., 0.], grad_fn=<SoftmaxBackward>)
Target actiontensor([0., 0., 0., 1.])
state [-0.0130374   1.426001   -0.66486704  0.3224429   0.01671271  0.1839219
  0.          0.        ]
Predicted action tensor([0., 1., 0., 0.], grad_fn=<SoftmaxBackward>)
Target actiontensor([0., 1., 0., 0.])
state [-0.0130374   1.426001   -0.66486704  0.3224429   0.01671271  0.1839219
  0.          0.        ]
Predicted action tensor([0., 1., 0., 0.], grad_fn=<S

  # This is added back by InteractiveShellApp.init_path()


[1;30;43mDie letzten 5000 Zeilen der Streamingausgabe wurden abgeschnitten.[0m
state [-0.0130374   1.426001   -0.66486704  0.3224429   0.01671271  0.1839219
  0.          0.        ]
Predicted action tensor([0., 1., 0., 0.], grad_fn=<SoftmaxBackward>)
Target actiontensor([1., 0., 0., 0.])
state [-0.0130374   1.426001   -0.66486704  0.3224429   0.01671271  0.1839219
  0.          0.        ]
Predicted action tensor([0., 1., 0., 0.], grad_fn=<SoftmaxBackward>)
Target actiontensor([0., 0., 0., 1.])
state [-0.0130374   1.426001   -0.66486704  0.3224429   0.01671271  0.1839219
  0.          0.        ]
Predicted action tensor([0., 1., 0., 0.], grad_fn=<SoftmaxBackward>)
Target actiontensor([0., 1., 0., 0.])
state [-0.0130374   1.426001   -0.66486704  0.3224429   0.01671271  0.1839219
  0.          0.        ]
Predicted action tensor([0., 1., 0., 0.], grad_fn=<SoftmaxBackward>)
Target actiontensor([0., 1., 0., 0.])
state [-0.0130374   1.426001   -0.66486704  0.3224429   0.01671271  0.18392

In [223]:
def train2(training_data):
  # get the inputs; data is a list of [inputs, labels]
  x_data, y_data = training_data
  x_data= torch.from_numpy(np.array(x_data)).float()
  y_data =torch.from_numpy(np.array(y_data )).float()
  if GPU_ON:
    x_data = x_data.cuda()
    y_data = y_data.cuda()
  # zero the parameter gradients
  optimizer.zero_grad()
  # forward + backward + optimize
  print(f"single state Value  {x_data[0]}")
  print(f"single action Value  {y_data[0]}")
  outputs = net(x_data)
  print(f"single predicted action {outputs[0]}")
  loss = criterion(outputs, y_data )
  loss.backward()
  optimizer.step()
  print(loss)

In [226]:
while True:
  episodes_data = sample()
  #calculate_total_mean_reward(episodes_data)
  best_20 = heapq.nlargest(20, episodes_data, key=lambda s: s['total_reward'])
  train2(get_training_data_batch(best_20))
  

single state Value  tensor([-0.0056,  1.3903, -0.2801, -0.4446,  0.0066,  0.0670,  0.0000,  0.0000])
single action Value  tensor([0., 0., 1., 0.])
single predicted action tensor([0., 1., 0., 0.], grad_fn=<SelectBackward>)
tensor(0.3749, grad_fn=<MseLossBackward>)


  # This is added back by InteractiveShellApp.init_path()


single state Value  tensor([-0.0062,  1.4237, -0.3073,  0.2704,  0.0049,  0.0238,  0.0000,  0.0000])
single action Value  tensor([0., 0., 0., 1.])
single predicted action tensor([0., 1., 0., 0.], grad_fn=<SelectBackward>)
tensor(0.3774, grad_fn=<MseLossBackward>)
single state Value  tensor([ 0.0062,  1.3862,  0.3073, -0.5615, -0.0055, -0.0386,  0.0000,  0.0000])
single action Value  tensor([0., 1., 0., 0.])
single predicted action tensor([0., 1., 0., 0.], grad_fn=<SelectBackward>)
tensor(0.3731, grad_fn=<MseLossBackward>)
single state Value  tensor([ 3.0263e-03,  1.4243e+00,  1.4574e-01,  2.8519e-01, -1.1190e-03,
         1.3685e-02,  0.0000e+00,  0.0000e+00])
single action Value  tensor([0., 1., 0., 0.])
single predicted action tensor([0., 1., 0., 0.], grad_fn=<SelectBackward>)
tensor(0.3743, grad_fn=<MseLossBackward>)
single state Value  tensor([ 0.0150,  1.4328,  0.7537,  0.4730, -0.0153, -0.1310,  0.0000,  0.0000])
single action Value  tensor([0., 1., 0., 0.])
single predicted acti

KeyboardInterrupt: ignored