# The basics of reinforcement learning

Reinforcement learning (RL) is an area of machine learning concerned with how
software <b>agents</b> ought to take <b>actions</b> in a given <b>state</b> of an <b>environment</b> to maximize
a notion of cumulative <b>reward</b>.


To understand how RL helps, let's consider a simple scenario. Imagine that you are
playing chess against a computer (in our case, the computer is an agent that has
learned, or is learning, how to play chess). The setup (rules) of the game constitutes the
environment. Each time a player makes a move (takes an action), the state of the
board (the location of the various pieces on the chessboard) changes. At the end of the
game, depending on the result, the agent gets a reward. The objective of the agent is
to maximize that reward.

If the machine (agent1) is playing against a human, the number of games that it can
play is limited by the number of games the human can play. This might
create a bottleneck that keeps the agent from learning well. However, what if agent1 (the agent that
is learning the game) can play against agent2, which could be another agent that is
learning chess or a piece of chess software that has been pre-programmed
to play the game well? In principle, the two agents can play an unlimited number of games with each
other, which maximizes the opportunity to learn to play the game well.
By playing many games this way, the learning agent is likely to
learn how to handle the different scenarios/states of the game well.

Let's understand the process that the learning agent will follow to learn well:

1. Initially, the agent takes a random action in a given state.

2. The agent stores the action it has taken in various states within a game in
memory.

3. Then, the agent associates the result of the action in various states with a
reward.

4. After playing multiple games, the agent can correlate the action in a state to
a potential reward by replaying its experiences.
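The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-move game in `play_game` is a hypothetical toy (not chess), and averaging the stored rewards is just one simple way to "correlate" an action in a state with a reward.

```python
import random
from collections import defaultdict


def play_game(rng):
    """Hypothetical toy game: choose 'a' or 'b' in state 0 and in state 1.

    The game is won (reward 1) only if both choices are 'a'.
    """
    episode = []
    for state in (0, 1):
        action = rng.choice(["a", "b"])   # step 1: take a random action
        episode.append((state, action))   # step 2: store the action taken in this state
    reward = 1.0 if all(action == "a" for _, action in episode) else 0.0
    return episode, reward


def estimate_values(num_games=2000, seed=0):
    """Play many games, then replay the memory to score each (state, action)."""
    rng = random.Random(seed)
    memory = []
    for _ in range(num_games):
        episode, reward = play_game(rng)
        # step 3: associate every stored (state, action) with the game's reward
        memory.extend((state, action, reward) for state, action in episode)

    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, action, reward in memory:  # step 4: replay the experiences
        totals[(state, action)] += reward
        counts[(state, action)] += 1
    return {key: totals[key] / counts[key] for key in totals}


if __name__ == "__main__":
    for key, value in sorted(estimate_values().items()):
        print(key, round(value, 2))
```

In this toy game, choosing `'b'` can never lead to a win, so its estimated value stays at 0, while choosing `'a'` ends up with a clearly positive estimate.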

Next comes the question of quantifying the value that corresponds to taking an action
in a given state.

## Calculating the state value

To understand how to quantify the value of a state, let's use a simple scenario where
we will define the environment and objective as follows:

![rl](../imgs/rl0.png)

The environment is a grid with two rows and three columns. The agent starts at the
Start cell and achieves its objective (a reward of +1) when it reaches
the bottom-right grid cell. Reaching any other cell gives a reward of 0.
The agent can move right, left, down, or up, depending on
the feasibility of the action (from the Start cell, for example, the agent can move right or
down).

Using this information, let's calculate the <b>value</b> of a cell (the state that the agent is
in at a given moment). Given that some energy is spent moving from one cell to
another, we discount the value of reaching a cell by a factor of γ, where γ accounts for
the energy that's spent in moving from one cell to another. Discounting also makes
shorter paths to the reward more valuable than longer ones, which helps the agent learn to play well sooner. With this, let's
formalize the Bellman equation, which helps in calculating the value of a cell:

![rl](../imgs/rl1.png)

With the preceding equation in place, let's calculate the values of all cells <b>(once the
optimal actions in a state have been identified)</b> with γ set to 0.9 (the
typical value of γ is between 0.9 and 0.99):

![rl](../imgs/rl2.png)

From the preceding calculations, we can understand how to calculate the values in a
given state (cell), when given the optimal actions in that state. These are as follows for
our simplistic scenario of reaching the terminal state:

![rl](../imgs/rl3.png)

With the values in place, we expect the agent to follow a path of increasing value.
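As a rough check on the idea, the state values for this grid can be computed with a short script. This is a minimal sketch, not the exact procedure behind the figures. It assumes that the Start cell is the top-left cell, that the reward is collected on entering a cell, that the terminal cell itself has a value of 0, and that the Bellman equation takes the form "value of a cell = best over feasible moves of (reward + γ × value of the next cell)". The names `neighbors`, `reward`, and `state_values` are made up for illustration.

```python
GAMMA = 0.9                      # discount factor used in the text
ROWS, COLS = 2, 3                # grid with two rows and three columns
TERMINAL = (1, 2)                # bottom-right cell as (row, column), zero-indexed
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def neighbors(cell):
    """Return the cells reachable from `cell` with one feasible move."""
    row, col = cell
    candidates = [(row + dr, col + dc) for dr, dc in MOVES]
    return [(r, c) for r, c in candidates if 0 <= r < ROWS and 0 <= c < COLS]


def reward(next_cell):
    """+1 for entering the terminal cell, 0 for entering any other cell."""
    return 1.0 if next_cell == TERMINAL else 0.0


def state_values(gamma=GAMMA, sweeps=10):
    """Repeatedly apply V(s) = max over moves of [reward + gamma * V(next)]."""
    values = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
    for _ in range(sweeps):
        for cell in values:
            if cell == TERMINAL:
                continue  # assumption: nothing more to gain once the goal is reached
            values[cell] = max(
                reward(nxt) + gamma * values[nxt] for nxt in neighbors(cell)
            )
    return values


if __name__ == "__main__":
    for cell, value in sorted(state_values().items()):
        print(cell, round(value, 2))
```

Under these assumptions, the two cells next to the goal get a value of 1, the cells two moves away get 0.9, and the top-left cell gets 0.81, so following increasing values leads the agent to the goal.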

## Calculating the state-action value

In the previous section, we provided a scenario where we already know that the
agent is taking optimal actions (which is not realistic). In this section, we will look at a
scenario where we can identify the value that corresponds to a state-action
combination.

In the following image, each sub-cell within a cell represents the value of taking an
action in the cell. Initially, the cell values for various actions are as follows:

![rl](../imgs/rl4.png)

Note that, in the preceding image, cell b1 (2nd row and 2nd column) will have a value of
1 if the agent moves right from the cell (as that move leads to the terminal cell); the
other actions result in a value of 0. An X indicates that the action is not possible, so
no value is associated with it.

Over four iterations (steps), the updated cell values for the actions in the given state
are as follows:

![rl](../imgs/rl5.png)

This would then go through multiple iterations to provide the optimal action that
maximizes value at each cell.

Let's understand how to obtain the cell values in the second table (Iteration 2 in the
preceding image). Let's narrow this down to 0.3, which was obtained by taking the
downward action from the cell in the 1st row and 2nd column of the second table.
The downward action moves the agent to the cell in the 2nd row and 2nd column, where three
actions are feasible and only one of them (moving right) reaches the terminal cell. If the agent
picks among them at random, there is a 1/3 chance of it taking the
optimal action in the next state. Hence, the value of taking the downward action is as
follows:

![rl](../imgs/rl6.png)

In a similar manner, we can obtain the values of taking different possible actions in different cells.
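The 0.3 figure can be reproduced with a small script. The sketch below assumes that, after each move, the agent picks uniformly at random among the feasible actions in the next cell, and that all values are updated together in one pass (one "iteration"). The helper names (`feasible_actions`, `initial_q`, `sweep`) are hypothetical, and the script is not guaranteed to match every number in the figures.

```python
GAMMA = 0.9
ROWS, COLS = 2, 3
TERMINAL = (1, 2)  # bottom-right cell as (row, column), zero-indexed
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def feasible_actions(cell):
    """Map each feasible action in `cell` to the cell it leads to."""
    row, col = cell
    result = {}
    for action, (dr, dc) in MOVES.items():
        nxt = (row + dr, col + dc)
        if 0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS:
            result[action] = nxt
    return result


def initial_q():
    """Initial table: 1 for actions that enter the terminal cell, 0 otherwise."""
    q = {}
    for row in range(ROWS):
        for col in range(COLS):
            cell = (row, col)
            if cell == TERMINAL:
                continue
            for action, nxt in feasible_actions(cell).items():
                q[(cell, action)] = 1.0 if nxt == TERMINAL else 0.0
    return q


def sweep(q, gamma=GAMMA):
    """One iteration: value = reward + gamma * average value of the next cell's actions."""
    new_q = {}
    for cell, action in q:
        nxt = feasible_actions(cell)[action]
        if nxt == TERMINAL:
            new_q[(cell, action)] = 1.0  # reward of +1, nothing follows the goal
        else:
            follow_ups = [q[(nxt, a)] for a in feasible_actions(nxt)]
            # reward is 0 here; the average encodes the uniformly random next action
            new_q[(cell, action)] = 0.0 + gamma * sum(follow_ups) / len(follow_ups)
    return new_q


if __name__ == "__main__":
    q2 = sweep(initial_q())
    print(round(q2[((0, 1), "down")], 2))  # 1st row, 2nd column, moving down
```

Here the downward move from the 1st row, 2nd column lands in a cell whose three feasible actions are worth 0, 0, and 1, so the update gives 0 + 0.9 × (1/3) = 0.3.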

## Q-value

The Q in Q-learning or Q-value represents the quality of an action. Let's learn how to
calculate it:

![rl](../imgs/rl7.png)

We already know that we must keep updating the state-action value of a given state
until it saturates (that is, until further updates no longer change it). Hence, we'll modify the preceding formula like so:

![rl](../imgs/rl8.png)

In the preceding equation, we replace 1 with the learning rate so that we can update
the value of the action that's taken in a state more gradually:

![rl](../imgs/rl9.png)
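In code, an update of this kind could look like the following. This sketch uses the standard tabular Q-learning rule, which is assumed (not confirmed) to be what the preceding images show: the old value is moved toward `reward + γ × best next value` by a fraction given by the learning rate. The function name `q_update` and its parameters are illustrative choices.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, next_actions,
             lr=0.1, gamma=0.9):
    """Move Q(state, action) a fraction `lr` of the way toward its target.

    `next_actions` lists the actions available in `next_state`; pass an
    empty list when `next_state` is terminal.
    """
    # Best value obtainable from the next state (0 if there are no actions).
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q_table[(state, action)] += lr * (target - q_table[(state, action)])
    return q_table[(state, action)]


if __name__ == "__main__":
    q = defaultdict(float)  # unseen (state, action) pairs start at 0
    # Hypothetical transition: moving right from cell "b1" reaches the goal.
    print(q_update(q, "b1", "right", reward=1.0, next_state="goal",
                   next_actions=[], lr=0.5))
    print(q_update(q, "b1", "right", reward=1.0, next_state="goal",
                   next_actions=[], lr=0.5))
```

With a learning rate of 1, the value jumps straight to the target on every update; smaller learning rates approach it gradually, which tends to be more stable when rewards or transitions are noisy.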