# 1. Problem Framing in Reinforcement Learning

## Copyright 2019 Google LLC.

In [0]:
#@title
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

This Colab is part of the [Reinforcement Learning self-study course](https://developers.google.com/machine-learning/reinforcement-learning/). **Reinforcement Learning** (**RL**) is fundamentally a framework for decision-making. In this Colab, you'll do the following:

* Learn basic RL terminology.
* Understand the state-action-reward framework.
* Learn to frame problems in RL.

## Setup

Run the following cell to set up Google Analytics for the Colab. We use Google Analytics to improve these Colabs.

In [0]:
#@title Setup Google Analytics for Colab
%reset -f
import uuid
client_id = uuid.uuid4()

import requests

# Bundle up reporting into a function.
def report_execution():
  requests.post('https://www.google-analytics.com/collect', 
                data=('v=1'
                      '&tid=UA-48865479-3'
                      '&cid={}'
                      '&t=event'
                      '&ec=cell'                 # <-- event type
                      '&ea=execute'              # <-- event action
                      '&el=rl-problem-framing'   # <-- event label
                      '&ev=1'                    # <-- event value
                      '&an=bundled'.format(client_id)))

from IPython import get_ipython
get_ipython().events.register('post_execute', report_execution)

Run the following cell to import libraries. [**Gym**](https://gym.openai.com) is a widely used library for specifying RL environments. You will use Gym throughout this course.

In [0]:
import gym
import numpy as np

## Introduction

Suppose you've moved to a new neighborhood and are looking for great restaurants. At first, you explore randomly. Gradually, you begin returning to the restaurants you like, and after a while you tend to stick to them. Through this process, you found new favorite restaurants through reinforcement from repeated exploration. Reinforcement Learning formalizes this learning process.

For a more concrete example, imagine you're playing Pac-Man. You need to eat dots and avoid ghosts. You can frame this scenario in RL as follows:

* Pac-Man is the **agent**.
* The maze is the **environment**.
* Eating the dots returns **rewards**. Failure, such as being eaten by a ghost, is framed as a negative reward.

If you assign a positive numeric reward to eating the dots, and a negative numeric reward to being eaten, then you can frame Pac-Man's objective as reward maximization.

![Screenshot of the Pac-Man computer game.](https://www.google.com/logos/2010/pacman10-hp.png)

To maximize reward, Pac-Man moves from cell to cell by going up, down, left, or right. In RL, these movements are called **actions**. On each action, the position of Pac-Man, the positions of the ghosts, and the available rewards change. Together, these positions and rewards are the environment's **state**. Therefore, the state changes on each action. An agent takes an action $a$ in a state $s$ and transitions to another state $s'$ while receiving a reward $r$.

![A simple schematic shows the framework provided by Reinforcement Learning at a high level of abstraction. There are five boxes labeled. The first box, agent, is connected to the third box, environment, through the second box, action. The environment is connected back to the agent through two boxes: state and reward. ](https://developers.google.com/machine-learning/reinforcement-learning/images/rl-framing-outline.png)
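
The agent-environment loop described above can be sketched in a few lines of Python. This is only an illustration and not part of the course code: the one-dimensional corridor, the `step` function, the `ACTIONS` table, and the `DOT_REWARD` value are all hypothetical stand-ins for the Pac-Man maze.

```python
import random

# Hypothetical one-dimensional "corridor" loosely inspired by Pac-Man.
# The state is (agent position, positions of the remaining dots).
ACTIONS = {"left": -1, "right": +1}
DOT_REWARD = 1  # assumed reward for eating a dot


def step(state, action, length=5):
    """Apply `action` in `state`; return the next state s' and the reward r."""
    position, dots = state
    # Move one cell, staying inside the corridor walls.
    position = min(max(position + ACTIONS[action], 0), length - 1)
    reward = DOT_REWARD if position in dots else 0
    # Eaten dots disappear, so the available rewards are part of the state.
    dots = dots - {position}
    return (position, dots), reward


rng = random.Random(0)
state = (2, frozenset({0, 4}))  # agent in the middle, a dot at each end
for _ in range(6):
    action = rng.choice(list(ACTIONS))
    state_next, reward = step(state, action)
    print(state, action, reward, state_next)
    state = state_next
```

Each printed line is one $(s, a, r, s')$ step: the action changes the agent's position, and eating a dot changes which rewards remain available.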

## Explore a Simple Environment

Understand the state-action-reward framework by exploring a simple Gym environment called `NChain-v0`. The `NChain-v0` environment is a linear chain of states. When an agent takes an action in `NChain-v0`, the agent moves to a new state in the chain and receives a numeric reward. You will explore this chain of states by imitating such an agent. That is, you will map the environment by taking actions to move between states.

Create the `NChain-v0` environment by running the following code:

In [0]:
env = gym.make('NChain-v0')
print('Resetting environment to starting state.')
state = env.reset()
print("Reset env to starting state '" + str(state) + "'.")

In each state of `NChain-v0`, the agent can take an action. Confirm that the environment allows two actions, labeled `0` and `1`:

In [0]:
print("Number of allowed actions: " + str(env.action_space.n))

The agent explores the environment by taking actions. After taking an action $a$ in a state $s$, the agent gets a reward $r$ and moves to a new state $s'$.

Explore the environment by setting the action to `0` or `1` and running the code. Try to answer these questions:

* How many states can you find?
* What reward values do you observe for state transitions?
* Do the outcomes of actions `0` and `1` differ?

Do not spend more than a few minutes. Then view the discussion in the next section.

In [0]:
action = 0  #@param ; action is 0 or 1
state_next, reward, done, _ = env.step(action)
transition_tuple = "In state = %d, taking action = %d returns reward = %d and state_next = %d" % (state, action, reward, state_next)
if 'transitions' in locals():
  transitions.append(transition_tuple)
else:
  transitions = [transition_tuple]
for item in transitions:
  print(item)
state = state_next

### Discussion

The environment is a linear chain of states. Action `0` advances the agent along this chain with no reward. Action `1` returns the agent to the starting state $s_0$ with a small reward.

However, at each action, the environment might return the result of the other action with a small probability. For example, action `0` might return the agent to the starting state, while action `1` might advance the agent. The probabilities that determine the next state $s'$ for a given action $a$ in a given state $s$ are defined by the **state-transition function** $P(s'|s,a)$.

The outcome $s'$ for the state-action pair $(s,a)$ is captured in the tuple $(s,a,s')$. The environment returns a reward $r$ for every tuple $(s,a,s')$. This reward is defined by the **reward function** $R(s,a,s')$. Both of these functions are defined by the environment, independently of the agent.

In practice, you typically do not know the true reward function. Instead, this course represents the expected reward from taking action $a$ in state $s$ using the notation $R(s,a)$.
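
If both functions were known, one common way to relate the two notations would be to average the reward over the possible next states. This identity is a standard definition of an expected reward, stated here as an assumption about how $R(s,a)$ is meant rather than as something the environment exposes:

$$R(s,a) = \sum_{s'} P(s'|s,a) \, R(s,a,s')$$

As a made-up example, if an action leads to a reward of $1$ with probability $0.9$ and to a reward of $0$ with probability $0.1$, then its expected reward is $0.9 \times 1 + 0.1 \times 0 = 0.9$.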

## Map an Environment

Now, let's try mapping out the environment by recording the agent's transitions. Each transition is characterized by the tuple $(s, a, r, s')$. The sequence of these tuples is the agent's **trajectory**.

$$s_0 \xrightarrow[r_0]{a_0} s_1 \xrightarrow[r_1]{a_1} s_2 \ldots \xrightarrow[r_{n-2}]{a_{n-2}} s_{n-1} \xrightarrow[r_{n-1}]{a_{n-1}} s_n
$$
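
One way to hold such a trajectory in code is a list of named tuples. The sketch below is illustrative only: the `Transition` type, the `is_valid_trajectory` helper, and the values in the example are hypothetical and are not recorded from `NChain-v0`.

```python
from typing import List, NamedTuple


class Transition(NamedTuple):
    """One (s, a, r, s') step of a trajectory."""
    state: int
    action: int
    reward: float
    state_next: int


def is_valid_trajectory(trajectory: List[Transition]) -> bool:
    """Check that each transition starts in the state where the previous one ended."""
    return all(
        prev.state_next == curr.state
        for prev, curr in zip(trajectory, trajectory[1:])
    )


# Made-up values for illustration only.
trajectory = [
    Transition(state=0, action=0, reward=0, state_next=1),
    Transition(state=1, action=0, reward=0, state_next=2),
    Transition(state=2, action=1, reward=2, state_next=0),
]
print(is_valid_trajectory(trajectory))  # True

# The chain s_0, s_1, ..., s_n from the formula above.
states_visited = [trajectory[0].state] + [t.state_next for t in trajectory]
print(states_visited)  # [0, 1, 2, 0]
```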

Reset the memory for recording transitions:

In [0]:
transitions = []

Take an action by setting `action` and running the cell. The output shows the history of the tuples $(s,a,r,s')$. Can you find the following?

* All possible states
* The possible state transitions and their probabilities
* The rewards associated with $(s,a)$ pairs

In [0]:
action = 0  # set `action` to `0` or `1`

state_next, reward, _, _ = env.step(action)
transitions.append({ "state": state,
                     "action": action,
                     "reward": reward,
                     "state_next": state_next
                  })
for transition in transitions:
  print(transition)
state = state_next
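
Once you have recorded enough transitions, counting them is one way to estimate the transition probabilities and the average reward for each $(s,a)$ pair. The sketch below assumes records in the same dictionary format as the cell above. The `estimate_model` helper and the example records are hypothetical, and the estimates are only as good as the number of samples behind them.

```python
from collections import Counter, defaultdict


def estimate_model(transitions):
    """Estimate P(s'|s,a) and the mean reward per (s, a) from recorded transitions."""
    next_state_counts = defaultdict(Counter)
    reward_sums = defaultdict(float)
    for t in transitions:
        key = (t["state"], t["action"])
        next_state_counts[key][t["state_next"]] += 1
        reward_sums[key] += t["reward"]

    probabilities, mean_rewards = {}, {}
    for key, counts in next_state_counts.items():
        total = sum(counts.values())
        # Relative frequency of each observed next state.
        probabilities[key] = {s_next: n / total for s_next, n in counts.items()}
        mean_rewards[key] = reward_sums[key] / total
    return probabilities, mean_rewards


# Made-up records for illustration only.
example = [
    {"state": 0, "action": 0, "reward": 0, "state_next": 1},
    {"state": 0, "action": 0, "reward": 0, "state_next": 1},
    {"state": 0, "action": 0, "reward": 0, "state_next": 1},
    {"state": 0, "action": 0, "reward": 2, "state_next": 0},
]
probabilities, mean_rewards = estimate_model(example)
print(probabilities[(0, 0)])  # {1: 0.75, 0: 0.25}
print(mean_rewards[(0, 0)])   # 0.5
```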

### Solution

The environment is shown in the following schematic and is described in this [paper](https://ceit.aut.ac.ir/~shiry/lecture/machine-learning/papers/BRL-2000.pdf).

<img alt="A schematic that shows the NChain environment. The schematic shows the states, possible actions, and results of taking those actions in the state. When an agent takes an action in a state, the agent moves to a new state and receives a reward. There are 5 states. The allowed actions from each state are labelled 0 and 1. Action 0 always leads to a reward of 0, except from state 4 where action 0 returns a reward of 10. Action 1 always returns a reward of 2." width="75%" src="https://developers.google.com/machine-learning/reinforcement-learning/images/nchain-state-transitions.svg"/>

Check your observations of the environment against these environment characteristics. Specifically, note the probabilistic nature of the environment.

* Number of states: 5, labelled from `0` to `4`.
* State transition function:
  * `a=0` advances the state from $s_n$ to $s_{n+1}$, or stays in state 4 if the agent is already there.
  * `a=1` returns to the first state $s_0$.
  * The environment reverses the result of an action with a probability of $0.2$. That is, `a=0` then returns the agent to $s_0$ and `a=1` then advances the agent to $s_{n+1}$.
* Reward function:
  * `r=2` for returning to start.
  * `r=0` for advancing.
  * `r=10` for looping on state 4.
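
The characteristics listed above are enough to write a small stand-in for the chain. The sketch below is an assumption-laden re-creation based only on this description, not Gym's implementation: the `ChainSketch` class and its parameter names are hypothetical, and `step` returns only the next state and the reward.

```python
import random


class ChainSketch:
    """Stand-in for the chain described above; not Gym's implementation."""

    def __init__(self, n=5, slip=0.2, small=2, large=10, seed=None):
        self.n, self.slip, self.small, self.large = n, slip, small, large
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # With probability `slip`, the environment reverses the action.
        if self.rng.random() < self.slip:
            action = 1 - action
        if action == 1:
            # Return to the first state for the small reward.
            self.state, reward = 0, self.small
        elif self.state < self.n - 1:
            # Advance along the chain with no reward.
            self.state, reward = self.state + 1, 0
        else:
            # Loop on the last state for the large reward.
            reward = self.large
        return self.state, reward


env_sketch = ChainSketch(slip=0.0)  # slip disabled to make the walk deterministic
env_sketch.reset()
print([env_sketch.step(0) for _ in range(5)])
# [(1, 0), (2, 0), (3, 0), (4, 0), (4, 10)]
print(env_sketch.step(1))  # (0, 2)
```

Setting `slip=0.0` makes the walk reproducible. With the default `slip=0.2`, roughly one action in five has the opposite effect.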

Confirm these characteristics by querying the environment variables as follows:

In [0]:
print("Number of states: " + str(env.env.observation_space.n))
print("Reward on returning to starting state 0: " + str(env.env.small))
print("Reward for looping on state 4: " + str(env.env.large))
print("Probability that an action has the opposite effect: " + str(env.env.slip))

## Episodes

A trajectory from start to termination is called an **episode**. The trajectory terminates when the **termination condition** is met. The termination condition could be achieving a certain reward, taking too many actions without solving the environment, or reaching a terminal state.
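
A termination condition of this kind can be sketched as a loop that stops either when the environment reports a terminal state or when a step budget runs out. The `run_episode` function, its `step_fn` argument, and the two toy step functions below are hypothetical names used only for illustration, and the budget of 5 is arbitrary.

```python
import random


def run_episode(step_fn, max_steps, seed=0):
    """Run until `step_fn` reports a terminal state or `max_steps` actions are taken."""
    rng = random.Random(seed)
    steps, done = 0, False
    while not done:
        action = rng.choice([0, 1])
        terminal = step_fn(action)
        steps += 1
        # Terminate on a terminal state or once the step budget is spent.
        done = terminal or steps >= max_steps
    return steps


def never_terminal(action):
    """Toy environment step that never reaches a terminal state."""
    return False


def terminal_on_one(action):
    """Toy environment step that treats action 1 as reaching a terminal state."""
    return action == 1


print(run_episode(never_terminal, max_steps=5))  # 5
print(run_episode(terminal_on_one, max_steps=5))
```

With `never_terminal`, only the step budget ends the episode. With `terminal_on_one`, the episode ends at the first sampled action `1`, or at the budget if none is sampled.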

How many actions can you take before `NChain-v0` terminates? Run the following code to find out:

In [0]:
counter = 0
state = env.reset()
done = False
while not done:
  state, _, done, _ = env.step(env.action_space.sample())
  counter += 1
print("Termination condition for number of actions = " + str(counter))

Confirm that Gym sets the termination condition to a constant of 1000 actions:

In [0]:
env._max_episode_steps

## Conclusion and Next Steps

In this Colab, you learned:

* Basic RL terminology:
    * agent
    * environment
    * state, action, and reward
    * state-transition function
    * reward function
    * episode
    * termination condition
* How RL frames problems as state-action-reward frameworks.
* How RL environments can be probabilistic.

Move on to the next Colab: [Q-learning Framework](https://colab.research.google.com/drive/1ZPsEEu30SH1BUqUSxNsz0xeXL2Aalqfa#forceEdit=true&sandboxMode=true?utm_source=ss-reinforcement-learning&utm_campaign=colab-external&utm_medium=referral&utm_content=rl-q-learning).

For reference, the sequence of course Colabs is as follows:

1. [Problem Framing in Reinforcement Learning](https://colab.research.google.com/drive/1sUYro4ZyiHuuKfy6KXFSdWjNlb98ZROd#forceEdit=true&sandboxMode=true?utm_source=ss-reinforcement-learning&utm_campaign=colab-external&utm_medium=referral&utm_content=rl-problem-framing)
1. [Q-learning Framework](https://colab.research.google.com/drive/1ZPsEEu30SH1BUqUSxNsz0xeXL2Aalqfa#forceEdit=true&sandboxMode=true?utm_source=ss-reinforcement-learning&utm_campaign=colab-external&utm_medium=referral&utm_content=rl-q-learning)
1. [Tabular Q-Learning](https://colab.research.google.com/drive/1sX2kO_RA1DckhCwX25OqjUVBATmOLgs2#forceEdit=true&sandboxMode=true?utm_source=ss-reinforcement-learning&utm_campaign=colab-external&utm_medium=referral&utm_content=rl-tabular-q-learning)
1. [Deep Q-Learning](https://colab.research.google.com/drive/1XnFxIE882ptpO83mcAz7Zg8PxijJOsUs#forceEdit=true&sandboxMode=true?utm_source=ss-reinforcement-learning&utm_campaign=colab-external&utm_medium=referral&utm_content=rl-deep-q-learning)
1. [Experience Replay and Target Networks](https://colab.research.google.com/drive/1DEv8FSjMvsgCDPlOGQrUFoJeAf67cFSo#forceEdit=true&sandboxMode=true?utm_source=ss-reinforcement-learning&utm_campaign=colab-external&utm_medium=referral&utm_content=rl-experience-replay-and-target-networks)