<a href="https://colab.research.google.com/github/jufabeck2202/KI-Lab/blob/main/aufgabe6/DeepReinforcementlearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

General idea:
1. Use a neural network that takes as input a state (represented as numbers)
and outputs a probability for every action.
2. Generate episodes by inputing the current state into the network and
sampling actions from the network’s output. Remember the
<state, action> pairs for every episode.
3. From these episodes, identify the ones with the highest reward.
4. Use the <state, action> pairs from those high reward episodes as
training examples for improving the neural network.
5. Go back to step 2

Task 2
1. Define a neural network with two fully connected-layers. The hidden layer uses a
Relu activation. The output layer uses a softmax. Try different hidden layer sizes
(between 100 and 500). The network takes as input a vector of the current states
and gives out a probability for each action.
2. Generate 100 episodes by sampling actions using the network output. Limit the
number of steps per episode to 500. Sum up the reward of all steps of one episode.
3. Print out the mean reward per episode of the 100 episodes.
4. Identify the 20 best of those episodes in terms of reward and use the
<state, action> pairs of these episodes as training examples for the network.
5. Update the weights of the network by performing backpropagation on these <state,
action> pairs.
6. Repeat steps 2 to 5 until a mean reward of 100 is reached.
7. Record a video of the lunar lander by running the trained network on one additional
episode.

Hints
* Use !pip3 install box2d-py to make the environment work in Colab.
* You cannot show the video of your lander in Colab (env.render() fails).  
* Workaround: Download the model on your local machine and record the video there,
using recording_demo.py as template (see Mattermost).
* The loss is not a useful indicator for the learning progress in RL. Instead check how
the mean reward develops over time.
* The mean reward will jump back and forth quite a bit, but overall should increase.
* After roughly 70 training iterations the mean reward should be positive, and after
roughly 100 steps be over 100.
* Note that these numbers depend on your parameter setting and it may also take
longer or shorter.
* Reinforcement learning is much more difficult than supervised learning, you have to
play around quite a bit to get things into the right direction.

# Imports

In [54]:
!pip3 install box2d-py gym > /dev/null

In [55]:
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as f
import numpy as np
import matplotlib.pyplot as plt
import heapq
import random
import torch.optim as optim
%matplotlib inline


In [74]:
env = gym.make('LunarLander-v2')
rewards = []

##get Action space
print(env.observation_space.shape)
print(env.action_space)
print()
print(env.observation_space.sample())
print(env.action_space.sample())

(8,)
Discrete(4)

[-0.29276934  0.09673637 -0.3055346   0.07672118  0.03873079 -0.48137853
 -0.1174748  -0.361454  ]
1


# Define Network

In [57]:
class Network(torch.nn.Module):
    def __init__(self, hidden):
        super(Network, self).__init__()
        # input == observation space, for this problem its 8
        self.linear1 = torch.nn.Linear(env.observation_space.shape[0], hidden)
        # action == action space, up down left right for this problem, so 4
        self.linear2 = torch.nn.Linear(hidden, env.action_space.n)

    def forward(self, state):
        hidden = f.relu(self.linear1(state))
        return self.linear2(hidden)

In [58]:
torch.backends.cudnn.enabled = True
GPU_ON = torch.cuda.is_available()
device = torch.device("cuda:0" if GPU_ON else "cpu")
net = Network(200)
#criterion = nn.CrossEntropyLoss()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.1)
if GPU_ON:
  net.cuda()

# Sample Episodes

In [73]:
def sample(episodes_n=100, max_steps=500):
  episodes_data = []
  print(episodes_data)
  softmax = nn.Softmax(dim=1)
  for i_episode in range(episodes_n):
      steps = []
      rewards = []
      state = env.reset()
      for t in range(max_steps):

          state_vector = torch.FloatTensor([state])
          probability= softmax(net(state_vector))
          selected_action = np.random.choice(len(probability.data.numpy()[0]), p=probability.data.numpy()[0])
         
          new_state, reward, done, info = env.step(selected_action)
          rewards.append(reward)
          #importnt: use old state
          steps.append([selected_action, state])
          state = new_state
          

          if done:
              episodes_data.append({ "mean_reward": np.array(rewards).sum() ,"steps": steps})
              break
      #print(f"Mean reward of episode {i_episode}: {np.array(rewards).mean()}")
  return np.array(episodes_data)

In [60]:
episodes_data = sample()

In [61]:
def calculate_total_mean_reward(episode_data):
   all_rewards = [d['mean_reward'] for d in episode_data]
   return np.array(all_rewards).mean()

In [62]:
def get_top_20(data):
  return heapq.nlargest(20, data, key=lambda s: s['mean_reward'])

In [63]:
def get_training_data(episode_data):
  train_data = []
  for entry in episodes_data:
    for step in entry['steps']:
      one_hot = np.zeros(env.action_space.n)
      one_hot[step[0]] = 1
      train_data.append([one_hot, step[1]])
  return train_data


In [64]:
def get_training_data_batch(episode_data):
  x_data = []
  y_data = []
  for entry in episodes_data:
    for step in entry['steps']:
      x_data.append(step[1])
      y_data.append(step[0])
  return [x_data, y_data]

In [69]:
def train(training_data):
  for data in training_data:
    optimizer.zero_grad()
    # get the inputs; data is a list of [inputs, labels]
    action, observation = data
    #convert to torch?
    observation=torch.from_numpy(np.expand_dims(observation, axis=0))
    action =torch.from_numpy(np.array([np.argmax(action, axis=0)]))
    if GPU_ON:
      action = action.cuda()
      observation = observation.cuda()

    # forward + backward + optimize
    output_action = net(observation)
    loss = criterion(output_action, action)
    loss.backward()
    optimizer.step()

def train_batch(batch_data):
  optimizer.zero_grad()
  # get the inputs; data is a list of [inputs, labels]
  x_data, y_data = batch_data
  #convert to torch?
  observation=torch.from_numpy(np.array(x_data))
  target =torch.from_numpy(np.array(y_data))
  if GPU_ON:
    target = target.cuda()
    observation = observation.cuda()

  # forward + backward + optimize
  action_pred = net(observation)
  print(action_pred[0])
  print(target[0])
  loss = criterion(action_pred, target)
  print(f"Loss {loss}")
  loss.backward()
  optimizer.step()



In [66]:
#train_batch(get_training_data_batch(best_20))

In [67]:
while False:
  episodes_data = sample(episodes_n=200)
  best_20 = get_top_20(episodes_data)
  print(f"Mean Reward for Sampling 20 episodes {calculate_total_mean_reward(best_20)}")
  train(get_training_data(best_20))
  test_data = sample(episodes_n=10, training=False)
  print(f"Mean Reward for 10 episodes {calculate_total_mean_reward(test_data)}")


  

In [None]:
for i in range(10000):
  episodes_data = sample()
  print(f"{1} Mean Reward for Sampling 20 episodes {calculate_total_mean_reward(episodes_data)}")
  best_20 = get_top_20(episodes_data)
  train_batch(get_training_data_batch(best_20))

[]
1 Mean Reward for Sampling 20 episodes -546.057700399698
tensor([-31.8211,  45.3078, -22.2468,  -1.7579], grad_fn=<SelectBackward>)
tensor(1)
Loss 0.0
[]
1 Mean Reward for Sampling 20 episodes -566.8107148099164
tensor([-30.5632,  42.8974, -22.5649,  -2.0903], grad_fn=<SelectBackward>)
tensor(1)
Loss 0.0
[]
1 Mean Reward for Sampling 20 episodes -599.5380120802761
tensor([-42.3141,  62.6549, -37.2489,  -6.6790], grad_fn=<SelectBackward>)
tensor(1)
Loss 0.0
[]
1 Mean Reward for Sampling 20 episodes -588.6702129954579
tensor([-46.7895,  69.9956, -42.8068,  -9.1575], grad_fn=<SelectBackward>)
tensor(1)
Loss 0.0
[]
1 Mean Reward for Sampling 20 episodes -623.6690116598968
tensor([-49.0571,  73.5927, -44.6151, -11.4792], grad_fn=<SelectBackward>)
tensor(1)
Loss 0.0
[]
1 Mean Reward for Sampling 20 episodes -609.123018708564
tensor([-41.1370,  59.5764, -36.1616,  -7.9662], grad_fn=<SelectBackward>)
tensor(1)
Loss 0.0
[]
1 Mean Reward for Sampling 20 episodes -583.5912979566223
tensor([-39