Proposed rewording for Unit 1, Chapter 3: The Reinforcement Learning … #403
Pull Request Description
Issue: The chapter had issues with wording, structure, and redundancy.
Proposed Changes:
The Reinforcement Learning Framework
Understanding the RL Process
Reinforcement Learning is like teaching an agent to play a video game. Imagine you're coaching a player in a platform game:
[Image]
This RL loop generates a sequence of state, action, reward, and next state.
[Image]
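To make this loop concrete, here is a minimal sketch of the interaction cycle using the Gymnasium API with a random policy on CartPole (the specific environment and the random policy are illustrative assumptions, not part of the chapter):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()

for _ in range(100):
    action = env.action_space.sample()  # placeholder: a random policy
    next_state, reward, terminated, truncated, info = env.step(action)
    # One step of the RL loop: (state, action, reward, next_state)
    state = next_state
    if terminated or truncated:         # episode ended: start a new one
        state, info = env.reset()

env.close()
```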
The agent's goal is to maximize its cumulative reward, which we call the expected return.
The Reward Hypothesis: RL's Central Idea
⇒ Why does the agent aim to maximize expected return?
RL is built on the reward hypothesis, which states that all goals can be described as the maximization of the expected return (the expected cumulative reward).
In RL, achieving the best behavior means learning to take actions that maximize the expected cumulative reward.
Understanding the Markov Property
In academic circles, the RL process is often referred to as a Markov Decision Process (MDP).
We'll discuss the Markov Property in depth later, but for now, remember this: the Markov Property implies that our agent only needs the current state to decide its action, not the entire history of states and actions taken previously.
Observations/States Space
Observations/States are the information our agent receives from the environment. In a video game, it could be a single frame, like a screenshot. In trading, it might be the value of a stock.
However, it's important to distinguish between an observation and a state:
[Image]
In a fully observed environment, the agent receives a state: a complete description of the world with no hidden information, just like having access to the entire board in a game of chess.
[Image]
In a partially observed environment, such as Super Mario Bros, the agent receives an observation: a partial description of the state, since we can't see the whole level, only the section surrounding the character.
To keep it simple, we'll use the term "state" to refer to both state and observation in this course, but we'll distinguish them in practice.
To recap:
[Image]
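As a small illustration of this recap (a sketch assuming Gymnasium and the CartPole environment), you can inspect what the agent actually receives at each step:

```python
import gymnasium as gym

# CartPole returns the full physical state of the system
# (cart position, cart velocity, pole angle, pole angular velocity),
# so the environment is fully observed.
env = gym.make("CartPole-v1")
print(env.observation_space)   # Box with 4 float values
obs, info = env.reset()
print(obs)                     # e.g. [ 0.03 -0.02  0.01  0.04]

# A pixel-based game like Super Mario Bros would instead return a frame:
# an observation that only partially describes the underlying game state.
```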
Action Space
The Action space encompasses all possible actions an agent can take in an environment.
Actions can belong to either a discrete or continuous space:
[Image]
For example, in Super Mario Bros, there are only four possible actions: left, right, up (jumping), and down (crouching). It is a finite set of actions.
[Image]
For instance, as seen in the above figure, a Self-Driving Car agent can perform a wide range of continuous actions, such as turning at different angles (left or right 20°, 21.1°, 21.2°) or honking.
Understanding these action spaces is crucial when choosing RL algorithms in the future.
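As a quick sketch of the two kinds of action spaces (using Gymnasium's space classes; the specific sizes and bounds are made-up assumptions for the example):

```python
import numpy as np
from gymnasium import spaces

# Discrete action space: a finite set of choices,
# e.g. the 4 moves in a platformer (left, right, up, down).
platformer_actions = spaces.Discrete(4)
print(platformer_actions.sample())   # e.g. 2

# Continuous action space: real-valued actions,
# e.g. a steering angle for a self-driving car, in degrees.
steering = spaces.Box(low=-30.0, high=30.0, shape=(1,), dtype=np.float32)
print(steering.sample())             # e.g. [21.1]
```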
To recap:
[Image]
Rewards and Discounting
In RL, the reward is the agent's only feedback. It helps the agent determine whether an action was good or not.
The cumulative reward at each time step t can be expressed as:
[Image]
This is equivalent to:
[Image]
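In standard notation, the formula shown in the images above reads (a reconstruction consistent with the surrounding text, where r denotes rewards and t the time step):

```latex
R(\tau) = r_{t+1} + r_{t+2} + r_{t+3} + \dots = \sum_{k=0}^{\infty} r_{t+k+1}
```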
However, we can't simply add rewards like this. Rewards that arrive sooner (at the start of the game) are more likely to be obtained, since they are more predictable than long-term future rewards.
Imagine your agent as a small mouse, trying to eat as much cheese as possible before being caught by the cat. The mouse can move one tile at a time, just like the cat. The mouse's objective is to eat the maximum amount of cheese (maximum reward) before being eaten by the cat.
[Image]
In this scenario, it's more probable to eat cheese nearby than cheese close to the cat (dangerous territory).
As a result, rewards near the cat, even if larger, are more heavily discounted since we're unsure if we'll reach them.
To incorporate this discounting, we follow these steps:
1. We define a discount rate called gamma, between 0 and 1 (most often between 0.95 and 0.99). The larger the gamma, the smaller the discount, and the more the agent cares about long-term reward; the smaller the gamma, the larger the discount, and the more the agent focuses on short-term reward.
2. Each reward is then multiplied by gamma raised to the power of the time step, so rewards further in the future become exponentially less valuable, reflecting the fact that they are less likely to be obtained.
Our expected cumulative discounted reward would be:
[Image]
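In standard notation, the expected cumulative discounted reward shown above reads (again a reconstruction consistent with the text, assuming a discount rate gamma):

```latex
R(\tau) = r_{t+1} + \gamma\, r_{t+2} + \gamma^{2} r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
```

And a tiny numerical sketch of the effect of discounting (the reward values and gamma are made-up assumptions):

```python
gamma = 0.99
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]   # hypothetical rewards r_{t+1}, r_{t+2}, ...

# Each reward is weighted by gamma raised to the power of its time step.
discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
print(discounted_return)              # ~4.91, slightly less than the plain sum of 5.0
```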