# Project part 2: beat flappy bird

You may be familiar with the game [flappy bird](https://flappybird.io/). It is very simple: a bird moves at constant speed along the x axis and, at each step, you can either push it up or let it fall. The goal of the game is to go as far as possible.

Your goal for this project is as follows: design and train an agent that achieves the best possible score at flappy bird!

In [2]:
#@title Installations  { form-width: "30%" }

# This is just for the purpose of this colab. Please do not share an SSH
# private key in real life: it is a really unsafe practice.
GITHUB_PRIVATE_KEY = -----END OPENSSH PRIVATE KEY-----


# Create the directory if it doesn't exist.
! mkdir -p /root/.ssh
# Write the key
with open("/root/.ssh/id_ed25519", "w") as f:
    f.write(GITHUB_PRIVATE_KEY)
# Add github.com to our known hosts
! ssh-keyscan -t ed25519 github.com >> ~/.ssh/known_hosts
# Restrict the key permissions, or else SSH will complain.
! chmod go-rwx /root/.ssh/id_ed25519

# Clone and install the RL Games repository
! if [ -d "rl_games" ]; then echo "rl_games directory exists."; else git clone git@github.com:Molugan/rl_games.git; fi
! cd rl_games ; git pull;  pip install .

# Other dependencies
# If you just want to play with the environment and do not intend to use either
# jax or haiku, you can comment out this part.
!pip install dm-acme[jax]
!pip install dm-acme[tf]
!pip install dm-haiku
!pip install chex
!pip install optax


## What you are expected to do:

First off, form groups of five or fewer and fill in this [sheet](https://docs.google.com/spreadsheets/d/16TqSBGN33izSbom9-Vk2KYAxeYpEnz79-ylCHrTQiEA/edit#gid=0).
You are then asked to:
- Implement an agent that reaches as high a score as you can on the environment within 2h of GPU compute. The code should be well designed and commented. You are not required to use reinforcement learning, but you can if you find it useful. You can take inspiration from both the practicals and any online codebase that you find useful, provided you reference it. However, any suspicion of plagiarism from another team will result in grades being divided by two for both teams. The states of the environment are purposefully obfuscated: it is your job to find a representation that is easy to ingest for whatever method you use. Two notes:
  - Since GPU access is limited on Colab, you may want to experiment on CPUs and only use GPUs for your final run. Depending on the kind of algorithm you implement, the difference may not be large.
  - Loading external weights, for example to start your training from an existing checkpoint, is forbidden.
- Write a report (2-4 pages) explaining the approach you took in detail, the hyperparameter searches you performed, and the final results you obtained.

## Deadline
You should complete this project and send us your results by April 7th, 11:59pm. You will get a penalty of one point per day of delay.

## Evaluation

We are going to run your notebook on a Colab GPU instance for one hour and we will consider the performance of your model after that time.

Grade decomposition:

- Report (4 pts):
  - Method description.
  - Hyperparameter choice explanation.
  - Results presentation.
- Code (4 pts):
  - Does the method described in the report match the method implemented?
  - Is the code readable?
  - Is the code well presented and documented?
- Performance (5 pts)

## The environment

We will use the Flappy Bird environment defined in the deep_rl package. Let's have a closer look at it.


In [None]:
from deep_rl.environments.flappy_bird import FlappyBird

env = FlappyBird(
        gravity=0.05,
        force_push=0.1,
        vx=0.05,
        prob_new_bar=1,
        invictus_mode=False,
        max_height_bar=0.5,
    )

print(env.help)

For example, let's interact with it a little.

In [None]:
rows, cols = env.min_res
print(f"We should use at least {rows} rows and {cols} columns when rendering the environment")

obs_reset = env.reset()
print("First observation when resetting the environment:")
print(obs_reset)
print()

print("Now, let's perform a few steps\n")

print("Step 1: we let the bird fall")
obs, reward, done = env.step(0)
print(f"Observation: {obs}")
print(f"Reward: {reward}")
print(f"Game over: {done}")
print()

print("Step 2: we push the bird up")
obs, reward, done = env.step(1)
print(f"Observation: {obs}")
print(f"Reward: {reward}")
print(f"Game over: {done}")
print()

print("Step 3: we push the bird up again")
obs, reward, done = env.step(1)
print(f"Observation: {obs}")
print(f"Reward: {reward}")
print(f"Game over: {done}")
print()

print("Step 4: we push the bird up again")
obs, reward, done = env.step(1)
print(f"Observation: {obs}")
print(f"Reward: {reward}")
print(f"Game over: {done}")
print()

To simplify typing a bit, the deep_rl package implements a new type `FlappyObs` which corresponds to a state of the flappy bird environment.

In [None]:
from typing import List, Tuple

BarObs = Tuple[float, float, float, bool]
BirdObs = Tuple[float, float, float]
FlappyObs = Tuple[BirdObs, List[BarObs]]
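
Since the meaning of each field is deliberately left undocumented, a natural first step is to turn a `FlappyObs` into a fixed-size vector that a learning algorithm can consume. The block below is only a minimal sketch: `flatten_obs` and its `max_bars` parameter are hypothetical names that are not part of deep_rl, the observation values are made up, and keeping the bars in the order they are given is an assumption you should check against the real environment.

```python
from typing import List, Tuple

BarObs = Tuple[float, float, float, bool]
BirdObs = Tuple[float, float, float]
FlappyObs = Tuple[BirdObs, List[BarObs]]


def flatten_obs(obs: FlappyObs, max_bars: int = 2) -> List[float]:
  """Flattens an observation into a fixed-size list of floats.

  Hypothetical helper, not part of deep_rl. It keeps the first `max_bars`
  bars in the order they are given and pads missing bars with zeros.
  """
  bird, bars = obs
  features = [float(v) for v in bird]
  for bar in bars[:max_bars]:
    # The boolean field becomes 0.0 or 1.0.
    features.extend(float(v) for v in bar)
  # Pad so that the output length never depends on the number of bars.
  n_missing = max_bars - min(len(bars), max_bars)
  features.extend([0.0] * (4 * n_missing))
  return features


# Made-up observation: one bird and a single bar.
example = ((0.5, 0.4, -0.05), [(1.0, 0.3, 0.2, True)])
vector = flatten_obs(example)
print(len(vector), vector)
```

A fixed length matters because the number of bars in an observation can vary, while most function approximators expect inputs of constant size.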

## Baseline

We provide you with a simple baseline: the `StableAgent`, which does nothing more than keep the bird stable.

In [None]:
from deep_rl.environments.flappy_bird import FlappyObs

class StableAgent:
  """An agent which just keeps the bird stable.
  """

  def __init__(self,
               target_y : float = 0.5):
    self._target_y = target_y

  def sample_action(self,
                    observation: FlappyObs,
                    evaluation: bool,
                    ) -> int:
    _, y_bird, v_y_bird = observation[0]

    if y_bird <= self._target_y and v_y_bird <= 0:
      return 1
    else:
      return 0

Let's see how a single run works in practice with this agent.

In [None]:
from IPython.display import clear_output
from deep_rl.terminal_renderer import BashRenderer
from deep_rl.episode_runner import run_episode
from deep_rl.project_values import PROJECT_FLAPPY_BIRD_ENV

# We are going to render the environment !
ROWS = 30
COLS = 60
# Because ipython sucks, I have not found a cleaner option to add
# the refresher function
renderer = BashRenderer(ROWS,
                        COLS,
                        clear_fn = lambda: clear_output(wait=True))

# Flappy bird environment
env = PROJECT_FLAPPY_BIRD_ENV

# Our agent
agent = StableAgent()

# We run a single episode, with rendering, over a maximum of 100 steps
run_episode(env,
            agent,
            max_steps=100,
            renderer = renderer,
            time_between_frame=0.1)

Without rendering now, let's see the average reward we can get over 100 episodes with this agent.

In [None]:
from deep_rl.project_values import PROJECT_FLAPPY_BIRD_ENV
from deep_rl.episode_runner import run_episode

# Flappy bird environment
env = PROJECT_FLAPPY_BIRD_ENV

# Our agent
agent = StableAgent()

N_EPISODES = 100

reward = 0
for _ in range(N_EPISODES):
  reward += run_episode(env, agent, max_steps=1000, renderer=None)

reward /= N_EPISODES

print(f"Average reward over {N_EPISODES} episodes: {reward}")
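
An average over a finite number of episodes is itself noisy, so it can help to keep the per-episode rewards and report a standard error next to the mean when comparing agents. The sketch below assumes you collect the value returned for each episode in a list; `mean_and_stderr` is a hypothetical helper and the numbers are made up.

```python
import math
from typing import List, Tuple


def mean_and_stderr(rewards: List[float]) -> Tuple[float, float]:
  """Returns the sample mean and the standard error of the mean."""
  n = len(rewards)
  mean = sum(rewards) / n
  if n < 2:
    # The spread cannot be estimated from a single episode.
    return mean, float("nan")
  # Unbiased sample variance (division by n - 1).
  variance = sum((r - mean) ** 2 for r in rewards) / (n - 1)
  return mean, math.sqrt(variance / n)


# Made-up episode rewards, only to show the call.
mean, stderr = mean_and_stderr([10.0, 12.0, 8.0, 14.0])
print(f"Average reward: {mean:.2f} +/- {stderr:.2f}")
```

If the difference between two agents is smaller than a couple of standard errors, running more evaluation episodes is usually more informative than tuning further.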

And now, you need to do much better.

## Let's get to work !

Design and train an agent that achieves the best possible score on Flappy bird. You can use any method learned in this class. Here are the constraints:
- If you choose a deep learning algorithm, you must use JAX and Haiku. PyTorch is not allowed for this project.
- Your agent should converge in less than an hour. To make sure of that, we will run your code and use whatever checkpoint you have dumped in the given time.
- Your agent must maximize the reward obtained over 100 episodes, with a maximum of 1000 steps per episode.
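
Because only the checkpoint dumped within the time limit counts, it is worth saving your agent's state regularly and in a way that survives an interruption. The following is a minimal sketch, not a required format: `save_checkpoint`, `load_checkpoint`, the file name, and the content of the saved dictionary are all hypothetical, and it assumes that your agent's state can be pickled.

```python
import os
import pickle
import tempfile
from typing import Any


def save_checkpoint(state: Any, path: str) -> None:
  """Writes `state` to `path` through a temporary file.

  If the process is stopped in the middle of a write, the previous
  checkpoint at `path` is left untouched.
  """
  directory = os.path.dirname(os.path.abspath(path))
  fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
  with os.fdopen(fd, "wb") as f:
    pickle.dump(state, f)
  # Replaces the destination in one step on the same filesystem.
  os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Any:
  """Reads back a state written by `save_checkpoint`."""
  with open(path, "rb") as f:
    return pickle.load(f)


# Made-up state, only to show the round trip.
checkpoint_path = os.path.join(tempfile.gettempdir(), "flappy_agent.pkl")
save_checkpoint({"episode": 10, "params": [0.1, 0.2]}, checkpoint_path)
restored = load_checkpoint(checkpoint_path)
print(restored)
```

Calling such a function every few evaluations, and keeping the best-scoring state rather than the latest one, is one possible design choice.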

Do not forget to write **clear and commented code**, you will also be evaluated on that.

On top of that, you are asked to plot and analyse the relevant curves showing the evolution of your training loop.

### Agent's API

Your agent should implement a method, `sample_action`, which takes two arguments as input, the observed state and whether or not it is in evaluation mode, and picks the action to perform. Apart from that, you can add any other method you want to your agent.

In [None]:
class MyAgent:
  """Your agent to beat Flappy bird."""

  def __init__(self):
    # Put whatever you want here (add constructor arguments as needed).
    pass

  def sample_action(self,
                    observation: FlappyObs,
                    evaluation: bool,
                    ) -> int:
    """Pick the next action to perform.

    Args:
      observation: state of the flappy bird environment.
      evaluation: True if we are in evaluation mode, False if we are training.
    Returns:
      The action to perform: 0 to let the bird fall, 1 to push it up.
    """

    # Your code here !
    raise NotImplementedError
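
To illustrate how the `evaluation` flag can be used, here is a minimal sketch of an agent that explores randomly during training and acts deterministically during evaluation. Everything in it is an assumption made for illustration: `EpsilonGreedyAgent`, `keep_above_half`, the value of `epsilon`, and the example observation are hypothetical, and the wrapped policy simply reuses the rule of the baseline rather than anything learned.

```python
import random
from typing import Callable, List, Tuple

BarObs = Tuple[float, float, float, bool]
BirdObs = Tuple[float, float, float]
FlappyObs = Tuple[BirdObs, List[BarObs]]


class EpsilonGreedyAgent:
  """Wraps a deterministic policy with epsilon-greedy exploration."""

  def __init__(self,
               policy: Callable[[FlappyObs], int],
               epsilon: float = 0.1,
               seed: int = 0):
    self._policy = policy
    self._epsilon = epsilon
    self._rng = random.Random(seed)

  def sample_action(self,
                    observation: FlappyObs,
                    evaluation: bool,
                    ) -> int:
    # Explore only while training: evaluation must stay deterministic.
    if not evaluation and self._rng.random() < self._epsilon:
      return self._rng.randint(0, 1)
    return self._policy(observation)


def keep_above_half(observation: FlappyObs) -> int:
  """Same rule as the baseline: push up when low and not rising."""
  _, y_bird, v_y_bird = observation[0]
  return 1 if y_bird <= 0.5 and v_y_bird <= 0 else 0


agent = EpsilonGreedyAgent(keep_above_half, epsilon=0.2)
obs = ((0.0, 0.4, -0.1), [])  # Made-up observation without bars.
print(agent.sample_action(obs, evaluation=True))
```

Note that `sample_action` alone never sees the reward, so a learning agent needs an additional method (or a modified training loop) to receive it.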

### Environment

You must use the following flappy bird environment from the deep_rl package.


In [None]:
from deep_rl.project_values import PROJECT_FLAPPY_BIRD_ENV

### Training loop

You can use the following training loop to train your agent. Do not hesitate to play with the different parameters or even modify the code if you think you have a better option.

In [None]:
from typing import List, Tuple
from dataclasses import dataclass
import time

# Your training loop should perform in less than 2h.
MAX_TIME_TRAINING = 3600 * 2

@dataclass
class EpisodeTrainingStatus:
  episode_number: int
  reward: float
  training_time: float

def run_episode_no_rendering(env,
                             agent,
                             evaluation: bool,
                             max_steps: int,
                             ) -> float:
  """Runs a single episode.

  Args:
    env: environment to consider.
    agent: agent to run.
    evaluation: if False, will train the agent.
    max_steps: number of steps after which the episode should be stopped
      no matter what.
  Returns:
    The total reward accumulated over the episode.
  """

  observation = env.reset()
  tot_reward = 0

  for _ in range(max_steps):

    action = agent.sample_action(observation, evaluation)
    observation, reward, end_game = env.step(action)
    tot_reward += reward

    if end_game:
      break

  return tot_reward

def train_agent(env,
                agent,
                num_episodes: int,
                num_eval_episodes: int,
                eval_every_N: int,
                max_steps_episode: int,
                max_time_training: float = MAX_TIME_TRAINING,
                ) -> List[EpisodeTrainingStatus]:
  """Train your agent on the given environment.

  Args:
    env: environment to consider.
    agent: agent to train.
    num_episodes: number of episodes to run for training.
    num_eval_episodes: number of episodes averaged at each evaluation.
    eval_every_N: frequency (in training episodes) at which the agent is
      evaluated.
    max_steps_episode: maximal number of steps per episode.
    max_time_training: maximal duration of the training loop (in seconds).
  Returns:
    The list of evaluation results, one entry per evaluation.
  """

  all_status = []
  print(f"Episode number:\t| Average reward on {num_eval_episodes} eval episodes")
  print("------------------------------------------------------")

  start_time = time.time()

  for episode in range(num_episodes):

    run_episode_no_rendering(env,
                             agent,
                             evaluation=False,
                             max_steps=max_steps_episode)

    if episode % eval_every_N == 0:
      reward = 0
      d_time = time.time() - start_time
      for _ in range(num_eval_episodes):
        # Use the helper defined above, which accepts the `evaluation` flag.
        reward += run_episode_no_rendering(env,
                                           agent,
                                           evaluation=True,
                                           max_steps=max_steps_episode)
      reward /= num_eval_episodes
      print(f"\t{episode}\t|\t{reward}")
      all_status.append(EpisodeTrainingStatus(episode_number=episode,
                                              reward=reward,
                                              training_time=d_time))

    # Check the time budget after every episode, not only after evaluations.
    if time.time() - start_time > max_time_training:
      break

  return all_status
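
The list returned by `train_agent` contains what you need to draw a training curve. The sketch below shows one possible way to extract and smooth it. `moving_average` and `training_curve` are hypothetical helpers, the history is made up, and the dataclass is repeated only so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EpisodeTrainingStatus:  # Same fields as in the training loop above.
  episode_number: int
  reward: float
  training_time: float


def moving_average(values: List[float], window: int) -> List[float]:
  """Trailing moving average; early points use the values seen so far."""
  smoothed = []
  running_sum = 0.0
  for i, value in enumerate(values):
    running_sum += value
    if i >= window:
      # Drop the value that just left the window.
      running_sum -= values[i - window]
    smoothed.append(running_sum / min(i + 1, window))
  return smoothed


def training_curve(all_status: List[EpisodeTrainingStatus],
                   window: int = 5,
                   ) -> Tuple[List[int], List[float], List[float]]:
  """Returns episode numbers, raw rewards and smoothed rewards."""
  episodes = [s.episode_number for s in all_status]
  rewards = [s.reward for s in all_status]
  return episodes, rewards, moving_average(rewards, window)


# Made-up history, only to show the call.
history = [EpisodeTrainingStatus(i * 10, float(r), 1.5 * i)
           for i, r in enumerate([2, 4, 6, 8])]
episodes, rewards, smoothed = training_curve(history, window=2)
print(episodes, rewards, smoothed)
```

With a plotting library such as matplotlib, plotting `rewards` and `smoothed` against `episodes` (or against `training_time`) gives the curves to analyse in your report.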

### Visualisation

You can use the following code to visualize a single run made by your agent. This can help you for debugging.

In [None]:
from IPython.display import clear_output
from deep_rl.project_values import PROJECT_FLAPPY_BIRD_ENV
from deep_rl.terminal_renderer import BashRenderer
from deep_rl.episode_runner import run_episode

# Your agent
agent = ...

# We are going to render the environment !
ROWS = 30
COLS = 60
renderer = BashRenderer(ROWS,
                        COLS,
                        clear_fn= lambda: clear_output(wait=True))


# We run a single episode, with rendering, over a maximum of 1000 steps
run_episode(PROJECT_FLAPPY_BIRD_ENV,
            agent,
            max_steps= 1000,
            renderer= renderer,
            time_between_frame= 0.1)