Proposed rewording for Unit 1, Chapter 3: The Reinforcement Learning … #403
Pull Request Description
Issue: The chapter had issues with wording, structure, and redundancy.
Proposed Changes:
The Reinforcement Learning Framework
Understanding the RL Process
Reinforcement Learning is like teaching an agent to play a video game. Imagine you're coaching a player in a platform game:
[Image]
This RL loop generates a sequence of state, action, reward, and next state.
[Image]
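To make this loop concrete, here is a minimal sketch of the interaction cycle using the Gymnasium API with a random policy on CartPole (the specific environment and the random policy are illustrative assumptions, not part of the chapter):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()

for _ in range(100):
    action = env.action_space.sample()  # placeholder: a random policy
    next_state, reward, terminated, truncated, info = env.step(action)
    # One step of the RL loop: (state, action, reward, next_state)
    state = next_state
    if terminated or truncated:         # episode ended: start a new one
        state, info = env.reset()

env.close()
```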
The agent's goal is to maximize its cumulative reward, which we call the expected return.
The Reward Hypothesis: RL's Central Idea
⇒ Why does the agent aim to maximize expected return?
RL is built on the reward hypothesis, which states that all goals can be described as the maximization of the expected return (the expected cumulative reward).
In RL, achieving the best behavior means learning to take actions that maximize the expected cumulative reward.
Understanding the Markov Property
In academic circles, the RL process is often referred to as a Markov Decision Process (MDP).
We'll discuss the Markov Property in depth later, but for now, remember this: the Markov Property implies that our agent only needs the current state to decide its action, not the entire history of states and actions taken previously.
Observations/States Space
Observations/States are the information our agent receives from the environment. In a video game, it could be a single frame, like a screenshot. In trading, it might be the value of a stock.
However, it's important to distinguish between an observation and a state:
[Image]
In a fully observed environment, the agent receives a state: a complete description of the world with no hidden information, just like having access to the entire board in a game of chess.
[Image]
In a partially observed environment, such as Super Mario Bros, the agent receives an observation: a partial description of the state, since we can't see the whole level, only the section surrounding the character.
To keep it simple, we'll use the term "state" to refer to both state and observation in this course, but we'll distinguish them in practice.
To recap:
[Image]
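As a small illustration of this recap (a sketch assuming Gymnasium and the CartPole environment), you can inspect what the agent actually receives at each step:

```python
import gymnasium as gym

# CartPole returns the full physical state of the system
# (cart position, cart velocity, pole angle, pole angular velocity),
# so the environment is fully observed.
env = gym.make("CartPole-v1")
print(env.observation_space)   # Box with 4 float values
obs, info = env.reset()
print(obs)                     # e.g. [ 0.03 -0.02  0.01  0.04]

# A pixel-based game like Super Mario Bros would instead return a frame:
# an observation that only partially describes the underlying game state.
```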
Action Space
The Action space encompasses all possible actions an agent can take in an environment.
Actions can belong to either a discrete or continuous space:
[Image]
For example, in Super Mario Bros, there are only four possible actions: left, right, up (jumping), and down (crouching). It is a finite set of actions.
[Image]
For instance, as seen in the above figure, a Self-Driving Car agent can perform a wide range of continuous actions, such as turning at different angles (left or right 20°, 21.1°, 21.2°) or honking.
Understanding these action spaces is crucial when choosing RL algorithms in the future.
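As a quick sketch of the two kinds of action spaces (using Gymnasium's space classes; the specific sizes and bounds are made-up assumptions for the example):

```python
import numpy as np
from gymnasium import spaces

# Discrete action space: a finite set of choices,
# e.g. the 4 moves in a platformer (left, right, up, down).
platformer_actions = spaces.Discrete(4)
print(platformer_actions.sample())   # e.g. 2

# Continuous action space: real-valued actions,
# e.g. a steering angle for a self-driving car, in degrees.
steering = spaces.Box(low=-30.0, high=30.0, shape=(1,), dtype=np.float32)
print(steering.sample())             # e.g. [21.1]
```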
To recap:
[Image]
Rewards and Discounting
In RL, the reward is the agent's only feedback. It helps the agent determine whether an action was good or not.
The cumulative reward at each time step t can be expressed as:
[Image]
This is equivalent to:
[Image]
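In standard notation, the formula shown in the images above reads (a reconstruction consistent with the surrounding text, where r denotes rewards and t the time step):

```latex
R(\tau) = r_{t+1} + r_{t+2} + r_{t+3} + \dots = \sum_{k=0}^{\infty} r_{t+k+1}
```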
However, we can't simply add rewards like this. Rewards that arrive sooner (at the start of the game) are more likely to be obtained, since they are more predictable than long-term future rewards.
Imagine your agent as a small mouse, trying to eat as much cheese as possible before being caught by the cat. The mouse can move one tile at a time, just like the cat. The mouse's objective is to eat the maximum amount of cheese (maximum reward) before being eaten by the cat.
[Image]
In this scenario, it's more probable to eat cheese nearby than cheese close to the cat (dangerous territory).
As a result, rewards near the cat, even if larger, are more heavily discounted since we're unsure if we'll reach them.
To incorporate this discounting, we follow these steps:
1. We define a discount rate called gamma, between 0 and 1 (most often between 0.95 and 0.99). The larger the gamma, the smaller the discount, and the more the agent cares about long-term reward; the smaller the gamma, the larger the discount, and the more the agent focuses on short-term reward.
2. Each reward is then multiplied by gamma raised to the power of the time step, so rewards further in the future become exponentially less valuable, reflecting the fact that they are less likely to be obtained.
Our expected cumulative discounted reward would be:
[Image]
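In standard notation, the expected cumulative discounted reward shown above reads (again a reconstruction consistent with the text, assuming a discount rate gamma):

```latex
R(\tau) = r_{t+1} + \gamma\, r_{t+2} + \gamma^{2} r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
```

And a tiny numerical sketch of the effect of discounting (the reward values and gamma are made-up assumptions):

```python
gamma = 0.99
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]   # hypothetical rewards r_{t+1}, r_{t+2}, ...

# Each reward is weighted by gamma raised to the power of its time step.
discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
print(discounted_return)              # ~4.91, slightly less than the plain sum of 5.0
```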