# Continuous Environments
## Libraries used in notebook

In [None]:
import gymnasium as gym

## Introduction

In the previous notebook we explored the simplest variant of RL: a discrete environment with a small number of states. To understand what comes next, we first need some theory we have not covered yet.

### Policies

When we talk about a **policy**, we are referring to the "behaviour" of the agent. In other words, it specifies which **action** the agent should take in a given state. A policy can be:
- **Deterministic**: always returns the same action for a given state;
- **Stochastic**: returns a probability distribution over actions.
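
To make the two kinds of policy concrete, here is a minimal sketch. The one-dimensional state, the two action names and the probabilities 0.8 / 0.2 are all made up for illustration; they do not come from any particular environment.

```python
import random

ACTIONS = ["left", "right"]  # hypothetical action set, for illustration only


def deterministic_policy(state: float) -> str:
    """Map each state to exactly one action."""
    return "right" if state >= 0.0 else "left"


def stochastic_policy(state: float) -> dict:
    """Map each state to a probability distribution over ACTIONS."""
    p_right = 0.8 if state >= 0.0 else 0.2  # arbitrary example probabilities
    return {"left": 1.0 - p_right, "right": p_right}


def sample_action(probs: dict, rng: random.Random) -> str:
    """Draw one action according to the given probabilities."""
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
state = 0.3
fixed_action = deterministic_policy(state)  # the same on every call
sampled_actions = [sample_action(stochastic_policy(state), rng) for _ in range(5)]
```

Calling `deterministic_policy` repeatedly with the same state always gives the same action, while `sampled_actions` can contain both actions because each one is drawn from the distribution.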

In simpler, discrete environments, a policy can be represented as a table mapping states to actions. But in continuous environments with large or infinite state spaces, such a table becomes impractical or impossible to store - we need a way to **approximate** the Q-function or the policy instead.

Here we also need to differentiate between **policy-based** and **value-based** methods.

### Value-based methods

In Q-learning, the agent learns a Q-function that estimates the expected **return** (the cumulative, usually discounted, future reward) of taking action $a$ in state $s$. In discrete cases, we used a Q-table to store these values. But in continuous environments, we can't store all possible combinations of states and actions - there are infinitely many.

To handle this, we approximate the Q-function using a **function approximator**, such as a **neural network**. This network takes a state (and possibly an action) as input and outputs an estimated Q-value.

Approximating the Q-function this way allows the agent to generalize across similar states - even if it has never seen a specific state before, it can make a reasonable guess based on what it has learned.
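
The generalization effect can be seen even with a much simpler approximator than a neural network. The sketch below uses a linear model per action as a stand-in (the class `LinearQ`, the learning rate and the example states are assumptions made for this illustration). It is fitted on one state only and then queried on a nearby state it has never seen.

```python
class LinearQ:
    """Q(s, a) ~ w[a] . s + b[a]: one weight vector per discrete action.

    A linear model stands in for the neural network here to keep the
    example dependency-free; the idea of sharing parameters across
    states is the same.
    """

    def __init__(self, state_dim: int, n_actions: int) -> None:
        self.w = [[0.0] * state_dim for _ in range(n_actions)]
        self.b = [0.0] * n_actions

    def q_values(self, state):
        """Return the estimated Q-value of every action in this state."""
        return [
            sum(wi * si for wi, si in zip(w, state)) + b
            for w, b in zip(self.w, self.b)
        ]

    def update(self, state, action, target, lr=0.1):
        """One gradient step moving Q(state, action) towards target."""
        error = target - self.q_values(state)[action]
        for i, si in enumerate(state):
            self.w[action][i] += lr * error * si
        self.b[action] += lr * error
        return error


q = LinearQ(state_dim=2, n_actions=2)
seen_state = [1.0, 0.0]
for _ in range(50):
    q.update(seen_state, action=0, target=1.0)

unseen_state = [0.9, 0.1]  # never used in an update
q_seen = q.q_values(seen_state)[0]
q_unseen = q.q_values(unseen_state)[0]
```

Because the parameters are shared, `q_unseen` ends up close to `q_seen` even though `unseen_state` never appeared during fitting. A Q-table would still hold its initial value for that state.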

### Policy-based methods

Instead of learning a Q-function and deriving a policy from it, some algorithms aim to learn the policy **directly**. This is especially common in continuous action spaces, where finding the action that maximizes the Q-function is hard or impossible because there are infinitely many actions to compare.

In **policy-based methods**, we use a neural network (called the **policy network**) to approximate a probability distribution over actions (rather than the expected return of each action). The network is trained to maximize the expected return, usually with gradient-based techniques.
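
For continuous actions, one common choice of distribution is a Gaussian whose mean depends on the state. The sketch below assumes a linear mean and a fixed standard deviation instead of a real policy network; the class name `GaussianPolicy` and all numbers are illustrative only.

```python
import math
import random


class GaussianPolicy:
    """Linear-Gaussian policy: action ~ Normal(w . s + b, std^2)."""

    def __init__(self, weights, bias=0.0, std=0.5):
        self.weights = list(weights)
        self.bias = bias
        self.std = std

    def mean(self, state):
        """State-dependent mean of the action distribution."""
        return sum(w * s for w, s in zip(self.weights, state)) + self.bias

    def sample(self, state, rng):
        """Draw a continuous action for this state."""
        return rng.gauss(self.mean(state), self.std)

    def log_prob(self, state, action):
        """Log-density of the action; policy-gradient methods differentiate this."""
        z = (action - self.mean(state)) / self.std
        return -0.5 * z * z - math.log(self.std * math.sqrt(2.0 * math.pi))


policy = GaussianPolicy(weights=[1.0, -0.5], std=0.5)
state = [0.2, 0.4]
action = policy.sample(state, random.Random(0))
log_p = policy.log_prob(state, action)
```

Training then adjusts `weights` and `bias` so that actions followed by high returns get a higher `log_prob`.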

Some algorithms (like **actor-critic methods**) combine both approaches:
- The **actor** learns the policy (what to do),
- The **critic** learns the value function (how good it is).

## Introduction to algorithms from `stable-baselines3`

**Stable-Baselines3** is an open-source Python library with implementations of reinforcement learning algorithms in PyTorch. It has well-maintained [documentation](https://stable-baselines3.readthedocs.io/en/master/index.html), and we strongly recommend getting familiar with it.

The algorithms in `stable-baselines3` are ready to use; however, it is good to know how they work, because that makes choosing their parameters easier. This section is based on the [Hugging Face RL Tutorial](https://huggingface.co/learn/deep-rl-course/unit0/introduction).

### Deep Reinforcement Learning

Earlier we talked about approximating either the Q-function or the policy with a neural network. This is exactly what **Deep Q-learning** and, more generally, **Deep Reinforcement Learning** do.

### Deep Q-Network (DQN)

Deep Q-Learning uses a deep neural network to approximate the Q-value of each possible action in a given state. To make training more stable, the agent stores its experiences in a replay buffer and learns from random samples of past transitions. It also uses two networks: one that is trained, and a second one, called the target network, that is updated less frequently to help stabilize learning.
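
The two stabilizing ideas can be sketched without any deep learning library. The code below is not the `stable-baselines3` implementation; `ReplayBuffer`, `td_target` and the toy transitions are assumptions made for this illustration, and the Q-values of the target network are passed in as plain lists.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions; the oldest ones are dropped first."""

    def __init__(self, capacity: int) -> None:
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done) -> None:
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int, rng: random.Random):
        """Uniform random minibatch, which breaks the correlation between consecutive steps."""
        return rng.sample(list(self.storage), batch_size)

    def __len__(self) -> int:
        return len(self.storage)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Regression target for Q(s, a).

    next_q_values are assumed to come from the slowly updated target
    network, not from the network that is currently being trained.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


buffer = ReplayBuffer(capacity=3)
for step in range(5):  # toy transitions: (state, action, reward, next_state, done)
    buffer.add([float(step)], 0, 1.0, [float(step + 1)], step == 4)

batch = buffer.sample(batch_size=2, rng=random.Random(0))
target = td_target(reward=1.0, next_q_values=[0.5, 2.0], done=False, gamma=0.9)
```

In a full DQN, the online network is regressed towards `target` for every transition in `batch`, and its weights are copied into the target network every fixed number of steps.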

### Advantage Actor Critic (A2C)

Advantage Actor Critic is a hybrid architecture, combining value- and policy-based methods. It uses:
- An *Actor* that controls **how the agent behaves** (policy-based method)
- A *Critic* that measures **how good the taken action is** (value-based method)

The *Advantage* in A2C refers to the advantage function, which measures **how much better taking a given action in a state is compared to the average value of that state**. If the action is better than this expected value, the gradient pushes the policy towards it; if it is worse, the gradient pushes the policy away from it. In other words, the agent is encouraged to take actions that lead to higher rewards and to avoid harmful ones.
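
A common way to estimate the advantage uses the critic's value estimates: $A(s, a) \approx r + \gamma V(s') - V(s)$. The helper below is a minimal sketch of this one-step estimate; the function name and the example numbers are made up for illustration, and real implementations often use multi-step variants.

```python
def one_step_advantage(reward, value, next_value, done, gamma=0.99):
    """One-step advantage estimate: r + gamma * V(s') - V(s).

    value and next_value are the critic's estimates of V(s) and V(s').
    If the episode ended, there is no next state to bootstrap from.
    """
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


# The action turned out better than the critic expected: positive advantage.
better = one_step_advantage(reward=1.0, value=0.5, next_value=1.0, done=False, gamma=0.9)

# The action turned out worse than the critic expected: negative advantage.
worse = one_step_advantage(reward=0.0, value=2.0, next_value=1.0, done=False, gamma=0.9)
```

The sign of the estimate decides the direction of the actor's update: actions with a positive advantage become more likely, actions with a negative one less likely.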

### Proximal Policy Optimization (PPO)

Proximal Policy Optimization is similar to A2C, except that it avoids policy updates that are too large in order to improve training stability. To do so, it computes the probability ratio between the current and the old policy and clips this ratio to a small range around 1, typically $[1 - \epsilon, 1 + \epsilon]$ (hence the term *proximal*).
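
The clipping can be written down for a single sample in a few lines. This is a sketch of the clipped surrogate objective only, not the full training loop; the default `clip_range=0.2` is a commonly used value and an assumption here.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Clipped surrogate objective for one (state, action) sample.

    ratio is pi_new(a|s) / pi_old(a|s). Taking the minimum of the
    unclipped and the clipped term removes the incentive to move the
    ratio outside [1 - clip_range, 1 + clip_range].
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy already moved far towards it: the gain is capped at 1.2 * A.
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)

# Bad action, policy already moved far away from it: capped at 0.8 * A.
capped_loss = clipped_surrogate(ratio=0.5, advantage=-1.0)

# Bad action, but the policy made it MORE likely: not clipped, full penalty.
full_penalty = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

The objective is maximized during training. Once the ratio leaves the clipping range in the direction that would improve the objective, the value stops changing, so the gradient for that sample vanishes.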


## Training with `stable-baselines3`
- Setting up an environment
- Creating and training a model
- Saving, loading, and continuing training

## Tuning and Customization
- Important hyperparameters to tune (learning rate, gamma, buffer size, etc.)
- Callbacks and logging
- Common issues (e.g. instability, no learning)

## Evaluating and Visualizing Performance
- Monitoring reward over time
- Using `evaluate_policy()`
- Rendering and recording episodes