In [None]:
import pong
import printing

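try:
    import google.colab  # only importable inside Google Colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    import os
    # Colab has no display, so let SDL use its headless dummy video driver
    os.environ['SDL_VIDEODRIVER'] = 'dummy'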

from pong import Pong
from pong_observer import Player
from main import train
import strategy_pattern
from typing import List
from strategy_pattern import Strategy
from play import play

RUNTIME = 60  # default training time in seconds

# Getting to know the environment

We are going to train a Pong agent. Before that, play a couple of games to get a feel for the game.
These are the controls:
- W/S: move the left paddle up/down
- Up/Down (arrow keys): move the right paddle up/down

In [None]:
play()

Now we are going to train an agent. During training the agent controls the left paddle.
The right paddle has a cheat code so that it always hits the ball.

In [None]:
train(training_time=RUNTIME)

You can also play against the agent you just trained.
You control the right paddle.

In [None]:
play(ai_enemy=True)

You may notice that the agent has learned something, but it is not very good.
Why is that?

# Example: Implementing a strategy

When you implement a function in this notebook, you override the default implementation. We call this implementing a new strategy.

In the cell below, you see the implementation of a reward strategy, which tells the agent whether the action it took was "good" or "bad".
Let's take a look at what happens when you override the default reward strategy with something that is not very helpful for learning:

In [None]:
def get_reward(observation: Pong, next_observation: Pong) -> float:
    return 0 # no action is ever good nor bad

strategy_pattern.strategies[Strategy.REWARD] = get_reward

If you run the cell above and then start training in the following cell, the agent should fail to learn,
because the reward strategy now in use gives it no useful feedback.

In [None]:
train(training_time=RUNTIME)

Even though you will implement the strategies one by one, the code also works if you skip a strategy.
In that case the default strategy is used, which means you can always test your code!
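One plausible way such a fallback can work is a dictionary lookup with a default. The sketch below is an assumption for illustration, not the actual code of `strategy_pattern`; `DemoStrategy`, `default_reward`, `overrides` and `resolve` are hypothetical names.

```python
from enum import Enum, auto
from typing import Callable, Dict


class DemoStrategy(Enum):  # hypothetical stand-in for strategy_pattern.Strategy
    REWARD = auto()


def default_reward(before, after) -> float:
    """Hypothetical default implementation."""
    return 1.0


# Overrides registered by the user; empty means "use defaults everywhere".
overrides: Dict[DemoStrategy, Callable] = {}


def resolve(kind: DemoStrategy, default: Callable) -> Callable:
    """Return the registered override, or the default if none is registered."""
    return overrides.get(kind, default)


chosen_before = resolve(DemoStrategy.REWARD, default_reward)  # falls back to the default
overrides[DemoStrategy.REWARD] = lambda before, after: 0.0    # register an override
chosen_after = resolve(DemoStrategy.REWARD, default_reward)   # now the override wins
```

With a design like this, emptying the dictionary is enough to restore every default at once.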

If you are stuck, you can always reset the strategies to the default implementation by running the following cell:

In [None]:
strategy_pattern.strategies = {}

You can verify this by running the training again:

In [None]:
train(training_time=RUNTIME)

# Exercise: Implementing your own strategies

## Reward strategy

Now it is your turn to implement a reward strategy.

The agent takes an action and then gives us two arguments:
- `observation`: the observation of the game before the action has been taken.
- `next_observation`: the observation of the game after the action has been taken.

Then the agent asks: "Was the action I took good or bad? Did something change that can be considered good or bad?".

That answer is for you to decide:
- higher positive rewards are good
- lower negative rewards are bad
- zero rewards are neutral
(there are other ways for classifying good, bad and neutral, but let's keep it simple for now)
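To make the before/after comparison concrete, here is a minimal sketch of a score-based reward. It assumes a hypothetical `Snapshot` class with `left_score` and `right_score` fields; the real `Pong` class may name or structure this information differently, so treat it as an illustration of the idea rather than a ready-made solution.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:  # hypothetical stand-in; the real Pong class may use other names
    left_score: int
    right_score: int


def score_based_reward(before: Snapshot, after: Snapshot) -> float:
    """Reward the left paddle for scoring, punish it for conceding."""
    if after.left_score > before.left_score:
        return 1.0   # the agent (left paddle) scored: good
    if after.right_score > before.right_score:
        return -1.0  # the opponent scored: bad
    return 0.0       # nothing relevant changed: neutral
```

A reward like this is very sparse, since most actions change neither score; comparing other parts of the two observations may give the agent feedback more often.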


Note:
Both arguments are full copies of the `Pong` game object. You can take a look at the `Pong` class by shift-clicking on the class name below.
Then you can decide which information may be relevant for deciding a reward.

Also keep in mind that we train an agent controlling the left paddle. Choose the rewards accordingly!

In [None]:
def get_reward(observation: Pong, next_observation: Pong) -> float:
    return 0 # your reward strategy goes here

strategy_pattern.strategies[Strategy.REWARD] = get_reward

Before we blindly run the training again, you should check that your reward strategy does what you intended.
Run the following cell to control the left paddle. The console now prints the reward for each action (neutral rewards are omitted).

In [None]:
# Don't bother with the details of this code, but ask if you are interested
import copy
flags = copy.copy(printing.print_flags)
printing.print_flags.append(printing.PrintFlag.REWARD)
try:
    play(invincible_enemy=True, debug=True)
finally:
    printing.print_flags = flags

Everything looks good? Then let's train the agent again:

In [None]:
train(training_time=RUNTIME)

## State strategy

The input to the neural network is called the state.
We neither need nor want to feed the whole `Pong` object into the neural network, and we could not even if we tried: a neural network only takes numbers as input!
So we need to decide which information is relevant for the agent to make a decision.

The state strategy is a function that takes the `Pong` class and should return a tuple of relevant information (called features).
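As a rough sketch of what "a tuple of features" can look like, the example below reduces an observation to three numbers. The `Snapshot` class and its fields `ball_y`, `paddle_y` and `height` are hypothetical; check the real `Pong` class for the attributes that actually exist.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Snapshot:  # hypothetical stand-in; the real Pong class may use other names
    ball_y: float
    paddle_y: float
    height: float


def demo_state(obs: Snapshot) -> Tuple[float, ...]:
    """Reduce an observation to a few numbers, scaled to roughly [-1, 1]."""
    return (
        obs.ball_y / obs.height,                   # vertical ball position
        obs.paddle_y / obs.height,                 # vertical paddle position
        (obs.ball_y - obs.paddle_y) / obs.height,  # signed distance from paddle to ball
    )
```

Dividing by the field height keeps all features in a similar range, which usually makes training a neural network easier than feeding it raw pixel coordinates.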

In [None]:
def get_state(observation: Pong) -> tuple:
    # every entry of the returned tuple is one feature (a number) for the network
    return observation.ticks_this_ball_exchange, 42, 99 # ,observation.somethingElse,... your state strategy goes here

strategy_pattern.strategies[Strategy.STATE] = get_state

Train the agent again:

In [None]:
train(training_time=RUNTIME)

## Action strategy

The neural network can predict a value for each possible action. In our algorithm the action with the highest value is chosen.
For the neural network, an action is just the index of the associated neuron in the output/last layer.
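The selection step described above can be sketched as follows. This is an illustration only: the string labels are placeholders, whereas the notebook maps indices to `PongAction` objects, and `pick_action` is a hypothetical helper, not part of the training code.

```python
from typing import Dict, Sequence


def pick_action(values: Sequence[float], action_map: Dict[int, str]) -> str:
    """Return the action mapped to the output neuron with the highest value."""
    best_index = max(range(len(values)), key=lambda i: values[i])
    return action_map[best_index]


# Hypothetical labels; the notebook maps indices to PongAction objects instead.
demo_map = {0: "first action", 1: "second action", 2: "third action"}
chosen = pick_action([0.1, 0.7, 0.2], demo_map)  # neuron 1 has the highest value
```

Note that every output neuron needs an entry in the map; otherwise the lookup fails as soon as an unmapped neuron has the highest value.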

Implement an action strategy by assigning an action to take in case the associated output neuron has the highest value.

Hint:
The first boolean in the `PongAction` constructor is for moving the paddle up, the second for moving it down. It corresponds to the associated keys being pressed.

In [None]:
from pong_action import PongAction


def get_action_map():
    return {
        0: PongAction(False, False), # do nothing
        # 1: ...
        # 2: ...
    }
strategy_pattern.strategies[Strategy.ACTION_MAP] = get_action_map

## Network structure strategy

In the following code the returned list represents the structure of the neural network.
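One way to read such a list is as layer sizes from input to output. The sketch below assumes fully connected layers with one bias per neuron, which is a common setup but not confirmed for the network used here; `layer_shapes` and `count_parameters` are hypothetical helpers.

```python
from typing import List, Tuple


def layer_shapes(structure: List[int]) -> List[Tuple[int, int]]:
    """Pair up neighbouring layer sizes as (inputs, outputs) per connection."""
    return list(zip(structure[:-1], structure[1:]))


def count_parameters(structure: List[int]) -> int:
    """Weights plus one bias per non-input neuron, assuming dense layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in layer_shapes(structure))


shapes = layer_shapes([4, 2, 3, 1])     # [(4, 2), (2, 3), (3, 1)]
total = count_parameters([4, 2, 3, 1])  # (8 + 2) + (6 + 3) + (3 + 1) = 23
```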

E.g. `[4, 2, 3, 1]` creates the following neural network:

In [None]:
from PIL import Image
im = Image.open("nn_structure.png")
display(im)

Before we deal with the question of "Which network structure is best?" let's first run the following code which defines and immediately runs training for only 1 second:

In [None]:
def get_network_structure() -> List[int]:
    # a neural network with 1 input, two hidden layers of size 2 and 1 output
    return [1, 2, 2, 1]

strategy_pattern.strategies[Strategy.NETWORK_STRUCTURE] = get_network_structure
train(training_time=1)

Huh? An error? What happened? Figure it out and fix it!

Now that we have a working network structure, let's try to find a better one.
Try out different network structures and see how they perform.

Note:
You can easily spend a lot of time randomly trying out different network structures.
It is enough to try out a few and settle on one that works just "okay".

In [None]:
def get_network_structure() -> List[int]:
    # a neural network with 1 input, two hidden layers of size 2 and 1 output
    return [1, 2, 2, 1]

strategy_pattern.strategies[Strategy.NETWORK_STRUCTURE] = get_network_structure

In [None]:
train(training_time=RUNTIME)

## Pushing the limits

Now let's train the agent for longer!

In [None]:
train(training_time=60*5) # 5 minutes

Finally, let's play against the agent again!

In [None]:
play(ai_enemy=True)

## Transforming observations

Now that we have trained our agent on the left-hand side, let's try it on the other side too!

In [None]:
play(ai_enemy=True, swap_players=True)

It doesn't work! Why?

Okay, then let's just make the agent think that it is playing on the left-hand side.
Implement a function that transforms the observation in such a way that the agent can leverage what it has learned on the left-hand side!
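The core idea is a left-to-right mirror of everything that has a horizontal component. The sketch below shows it for the ball only, using a hypothetical `Snapshot` class with `width`, `ball_x` and `ball_vx` fields; the real `Pong` class may use other names, and the paddles and scores would presumably need the same treatment.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Snapshot:  # hypothetical stand-in; the real Pong class may use other names
    width: float
    ball_x: float
    ball_vx: float


def mirror(obs: Snapshot) -> Snapshot:
    """Flip the observation left to right around the vertical centre line."""
    return replace(
        obs,
        ball_x=obs.width - obs.ball_x,  # distance from the left becomes distance from the right
        ball_vx=-obs.ball_vx,           # horizontal direction reverses as well
    )
```

A quick sanity check for any such transformation is that applying it twice gives back the original observation.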

In [None]:
def transform_observation(observation: Pong) -> Pong:
    width = pong.width
    height = pong.height
    return observation # placeholder so the function returns a Pong; your transformation goes here

strategy_pattern.strategies[Strategy.TRANSFORM_OBSERVATION] = transform_observation

Now let's test if it works:

In [None]:
play(ai_enemy=True, swap_players=True)