# Tutorial

<a href="https://colab.research.google.com/github/mibrahimi/rlba/blob/main/examples/tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This colab provides an overview of how to use the interfaces defined in this library and how to experiment with the existing example agents and environments.

## Installation

We'll start by installing the package and its required dependencies.

In [None]:
#@title Install necessary dependencies.

%pip install --upgrade pip
%pip install git+https://github.com/mibrahimi/rlba.git

## Import Modules

Now we can import all the relevant modules.

In [None]:
#@title Import modules.
#python3
from dataclasses import dataclass
from random import random
from typing import Sequence

from rlba.environment_loop import EnvironmentLoop
from rlba.environments import BernoulliBanditEnv
from rlba.types import Array, ArraySpec, BoundedArraySpec, DiscreteArraySpec, NestedArray
from rlba.utils import metrics
import numpy as np
from numpy.random import default_rng

## Create an agent

An `Agent` is the *mind* that interacts with and learns from an environment.

<img src="https://github.com/mibrahimi/rlba/raw/main/docs/imgs/RLProblem.png" width="500" />

You can write your own agent by writing a class that implements the [Agent](https://github.com/mibrahimi/rlba/raw/main/rlba/Agent.py) protocol, i.e., implements the following four methods: `action_spec`, `observation_spec`, `observe`, and `select_action`. As an example, below we implement a greedy agent for Bernoulli bandit environments.
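For orientation, the four methods listed above can be sketched as a `typing.Protocol`. This is a hypothetical reconstruction from the method names only, not the library's actual definition: the name `AgentSketch`, the `Any` type hints, and the toy `ConstantAgent` are all assumptions made for illustration.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class AgentSketch(Protocol):
    """Assumed shape of an agent: two spec accessors plus act/observe."""

    def action_spec(self) -> Any: ...

    def observation_spec(self) -> Any: ...

    def select_action(self) -> Any: ...

    def observe(self, action: Any, obs: Any) -> Any: ...


class ConstantAgent:
    """Toy agent (hypothetical) that always picks action 0 and learns nothing."""

    def action_spec(self) -> str:
        return "one discrete action"

    def observation_spec(self) -> str:
        return "scalar in [0, 1]"

    def select_action(self) -> int:
        return 0

    def observe(self, action: int, obs: float) -> None:
        pass  # A learning agent would update its estimates here.


# isinstance only checks that the four methods exist, not their signatures.
conforms = isinstance(ConstantAgent(), AgentSketch)
```

Because protocols rely on structural typing, a class conforms simply by defining the methods; it does not need to inherit from anything.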


In [None]:
#@title A greedy agent for Bernoulli bandit environment.

class GreedyBernoulliAgent:
  """A greedy agent that selects the action maximizing the current point estimate."""
  def __init__(
    self,
    n_action: int,
    seed: int,
    alpha_0: int = 1,
    beta_0: int = 1,
  ):
    self._action_spec: DiscreteArraySpec = DiscreteArraySpec(n_action, name='action spec')
    self._observation_spec: ArraySpec = BoundedArraySpec(
        shape=(), dtype=np.float32, minimum=0.0, maximum=1.0,
        name='observation spec')
    self._n_success = np.zeros(shape=(n_action,), dtype=int) + alpha_0
    self._n_failure = np.zeros(shape=(n_action,), dtype=int) + beta_0
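    self._rng = default_rng(seed)

  def action_spec(self) -> DiscreteArraySpec:
    """Describes the actions the agent can select."""
    return self._action_spec

  def observation_spec(self) -> ArraySpec:
    """Describes the observations the agent expects from the environment."""
    return self._observation_spec
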
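  def select_action(self) -> int:
    """Returns an action with the highest estimated success probability."""
    # Point estimate per action: successes / (successes + failures).
    pvals_hat = self._n_success / (self._n_success + self._n_failure)
    # Break ties uniformly at random among the maximizing actions.
    action = random_argmax(pvals_hat, self._rng)
    return action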

  def reward_spec(self) -> Array:
    """Describes the reward returned by the environment.

    Returns:
      An `Array` spec.
    """
    return Array(shape=(), dtype=float, name='reward')

  def discount_spec(self) -> BoundedArraySpec:
    """Describes the discount considered by the agent for planning.

    By default this is assumed to be a single float between 0 and 1.

    Returns:
      A `BoundedArraySpec`.
    """
    return BoundedArraySpec(
        shape=(), dtype=float, minimum=0., maximum=1., name='discount')

  def observe(
      self,
      action: int,
      obs: float,
  ):
    """Make an observation from the environment.

    Args:
      action: action taken in the environment.
      obs: observation produced by the environment given the action.
    """
    # An observation of 1 counts as a success and 0 as a failure.
    self._n_success[action] += obs
    self._n_failure[action] += (1 - obs)
    return obs


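def random_argmax(vals: Sequence[float], rng: np.random.Generator):
  """Returns the index of a maximal value, breaking ties uniformly at random."""
  maxval = max(vals)
  argmax = [idx for idx, val in enumerate(vals) if val == maxval]
  return rng.choice(argmax)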
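To make the estimate in `select_action` concrete, the sketch below reproduces the point-estimate computation and the random tie-breaking with plain NumPy. The counts are made up for illustration, and `random_argmax` is repeated so the snippet runs on its own.

```python
import numpy as np
from numpy.random import default_rng


def random_argmax(vals, rng):
    """Returns the index of a maximal value, breaking ties uniformly at random."""
    maxval = max(vals)
    argmax = [idx for idx, val in enumerate(vals) if val == maxval]
    return rng.choice(argmax)


# Illustrative counts for three actions (prior pseudo-counts already included).
n_success = np.array([1, 4, 4])
n_failure = np.array([3, 2, 2])

# Point estimates: 1/4, 4/6, 4/6. Actions 1 and 2 are tied for the maximum.
pvals_hat = n_success / (n_success + n_failure)

# Repeated calls pick among the tied actions only, never action 0.
rng = default_rng(0)
picks = {int(random_argmax(pvals_hat, rng)) for _ in range(200)}
```

With the default `alpha_0 = beta_0 = 1`, every action starts at an estimate of 0.5, so the very first action is chosen uniformly at random.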


## Training loop

Finally, we let the agent interact with the environment in an environment loop and evaluate its performance. We use the Bernoulli bandit implementation [here](https://github.com/mibrahimi/rlba/raw/main/rlba/environments/bernoulli_bandit.py) as the environment.

In [None]:
# Success probability of each action (arm) in the bandit.
pvals = [0, 0.55, 0.5]
n_action = len(pvals)
seed = 0
env = BernoulliBanditEnv(pvals, seed)
agent = GreedyBernoulliAgent(n_action, seed)

In [None]:
loop = EnvironmentLoop(env, agent)
loop.run(100)
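The internals of `EnvironmentLoop` are not shown in this tutorial. As a rough mental model only, an agent-environment loop alternates between `select_action`, an environment step, and `observe`. The sketch below illustrates that cycle with hypothetical stand-ins: `ToyBernoulliEnv`, its `step(action)` method, `ToyGreedyAgent`, and `run_loop` are assumptions for illustration and are not the library's classes or API.

```python
import numpy as np
from numpy.random import default_rng


class ToyBernoulliEnv:
    """Hypothetical stand-in environment: step(action) returns a 0/1 observation."""

    def __init__(self, pvals, seed):
        self._pvals = np.asarray(pvals, dtype=float)
        self._rng = default_rng(seed)

    def step(self, action):
        # Success with probability pvals[action], otherwise failure.
        return float(self._rng.random() < self._pvals[action])


class ToyGreedyAgent:
    """Minimal greedy agent using the same success/failure counts as above."""

    def __init__(self, n_action, seed):
        self._n_success = np.ones(n_action)  # prior pseudo-count of 1
        self._n_failure = np.ones(n_action)  # prior pseudo-count of 1
        self._rng = default_rng(seed)

    def select_action(self):
        est = self._n_success / (self._n_success + self._n_failure)
        best = np.flatnonzero(est == est.max())  # all maximizing actions
        return int(self._rng.choice(best))

    def observe(self, action, obs):
        self._n_success[action] += obs
        self._n_failure[action] += 1 - obs


def run_loop(env, agent, n_steps):
    """Alternates select_action / step / observe and returns the observations."""
    rewards = []
    for _ in range(n_steps):
        action = agent.select_action()
        obs = env.step(action)
        agent.observe(action, obs)
        rewards.append(obs)
    return rewards


env = ToyBernoulliEnv([0, 0.55, 0.5], seed=0)
agent = ToyGreedyAgent(n_action=3, seed=0)
rewards = run_loop(env, agent, 100)
print(f"mean reward over {len(rewards)} steps: {np.mean(rewards):.2f}")
```

A purely greedy agent never explores deliberately, so which action it settles on can depend on the early random draws; trying several seeds is a simple way to see this.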