# Continuous Environments
## Libraries used in notebook

```python
import gymnasium as gym
```

## Introduction

In the previous notebook we explored the simplest variant of RL: a discrete environment with a small number of states. Before moving on to continuous environments, we need some theory that we did not cover there.

### Policies

When we talk about a **policy**, we are referring to the "behaviour" of the agent. In other words, a policy specifies which action the agent should take in a given state. A policy is typically denoted by the symbol $\pi$, and can be:
- **Deterministic**: always returns the same action for a given state ($\pi(s) = a$)
- **Stochastic**: returns a probability distribution over actions ($\pi(a|s)$)
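
To make the two kinds of policy concrete, here is a minimal sketch in plain Python. The one-dimensional state, the two actions, and the probabilities are invented for illustration only; they do not come from any particular environment.

```python
import random

ACTIONS = ["left", "right"]  # hypothetical action set for illustration


def deterministic_policy(state: float) -> str:
    """pi(s) = a: the same state always maps to the same action."""
    return ACTIONS[1] if state >= 0.0 else ACTIONS[0]


def stochastic_policy(state: float) -> dict[str, float]:
    """pi(a|s): returns the probability of each action in the given state."""
    p_right = 0.9 if state >= 0.0 else 0.1  # made-up probabilities
    return {ACTIONS[0]: 1.0 - p_right, ACTIONS[1]: p_right}


def sample_action(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one action according to the distribution returned by the policy."""
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
state = 0.5
print("deterministic:", deterministic_policy(state))
probs = stochastic_policy(state)
print("distribution: ", probs)
print("samples:      ", [sample_action(probs, rng) for _ in range(5)])
```

Calling the deterministic policy repeatedly with the same state always gives the same action, while sampling from the stochastic policy can give different actions on each call.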

In simpler, discrete environments, a policy can be represented as a table mapping states to actions. But in continuous environments with large or infinite state spaces, this becomes impossible - we need a way to **approximate the policy** instead.

### Approximating the Q-function

In Q-learning, the agent learns a Q-function $Q(s, a)$ that estimates the expected return of taking action $a$ in state $s$ and following the optimal policy thereafter. In the discrete case, we used a Q-table to store these values. But in continuous environments we cannot store a value for every combination of state and action, because there are infinitely many.
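
As a reminder, the definition above can be written down in the standard textbook form. The symbols $r$ (reward), $s'$ (next state), $\gamma$ (discount factor) and $\alpha$ (learning rate) are the usual notation and are assumed here, since this notebook has not introduced them:

$$
Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \right]
$$

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

For example, with $\alpha = 0.1$, $\gamma = 0.9$, a current estimate $Q(s, a) = 1.0$, a reward $r = 1$ and $\max_{a'} Q(s', a') = 2$, the target is $1 + 0.9 \cdot 2 = 2.8$ and the updated estimate is $1.0 + 0.1 \cdot (2.8 - 1.0) = 1.18$.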

To handle this, we approximate the Q-function using a **function approximator**, such as a **neural network**. This network takes a state (and possibly an action) as input and outputs an estimated Q-value.

Approximating the Q-function this way allows the agent to generalize across similar states - even if it has never seen a specific state before, it can make a reasonable guess based on what it has learned.
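
The generalization effect can be shown with a much simpler approximator than a neural network. The sketch below uses a linear model as a stand-in and fits it to made-up target values (`toy_target` is hypothetical, not a real environment). It is a plain supervised fit rather than full Q-learning, but it shows how the model produces a sensible value for a state it was never trained on.

```python
import random

N_ACTIONS = 2


def features(state: float) -> list[float]:
    """Feature vector for a scalar state: a bias term and the state itself."""
    return [1.0, state]


def q_value(weights: list[list[float]], state: float, action: int) -> float:
    """Linear approximation of Q(s, a): one weight vector per action."""
    return sum(w * x for w, x in zip(weights[action], features(state)))


def sgd_update(weights, state, action, target, lr=0.1):
    """Move Q(s, a) a small step towards the target value."""
    error = target - q_value(weights, state, action)
    for i, x in enumerate(features(state)):
        weights[action][i] += lr * error * x


def toy_target(state: float, action: int) -> float:
    """Made-up 'true' values, used only to have something to fit."""
    return state if action == 1 else 1.0 - state


rng = random.Random(0)
weights = [[0.0, 0.0] for _ in range(N_ACTIONS)]
train_states = [0.0, 0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 1.0]  # 0.5 is never seen

for _ in range(5000):
    s = rng.choice(train_states)
    a = rng.randrange(N_ACTIONS)
    sgd_update(weights, s, a, toy_target(s, a))

unseen_state = 0.5
for a in range(N_ACTIONS):
    print(f"Q({unseen_state}, {a}) ~ {q_value(weights, unseen_state, a):.3f}")
```

A Q-table would have no entry for the state `0.5`, whereas the approximator interpolates from the states it has seen.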

### Policy Approximation

Instead of learning a Q-function and deriving a policy from it, some algorithms aim to learn the policy **directly**. This is especially common in continuous action spaces, where computing $\arg\max_a Q(s, a)$ (finding the best action) means solving an optimization problem over infinitely many actions at every step, which is usually impractical.

In **policy-based methods**, we use a neural network (called the **policy network**) to approximate $\pi(a|s)$: it maps each state to an action or to a distribution over actions. The network is trained to maximize the expected return, typically with gradient-based optimization.
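
One common way to write this objective and its gradient is the REINFORCE form of the policy gradient. The notation is assumed here rather than taken from this notebook: $\theta$ are the network parameters, $\tau$ is a trajectory sampled by running the policy, and $G_t$ is the return collected from step $t$ onwards.

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t} \gamma^{t} r_t \right]
$$

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]
$$

Intuitively, actions that were followed by a high return have their probability increased, and actions followed by a low return have it decreased.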

Some algorithms (like **actor-critic methods**) combine both approaches:
- The **actor** learns the policy (what to do),
- The **critic** learns the value function (how good the current state or action is).
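
The division of labour between actor and critic can be sketched on a toy problem in plain Python. Everything here is an assumption made for illustration: the one-step task (the best action equals the state), the linear actor and critic, and the fixed noise level `SIGMA`. Real actor-critic algorithms use neural networks and multi-step returns.

```python
import random

SIGMA = 0.5  # fixed exploration noise of the Gaussian policy (assumed, not learned)


def run_actor_critic(steps=10_000, lr_actor=0.005, lr_critic=0.05, seed=0):
    rng = random.Random(seed)
    theta = 0.0      # actor parameter: mean action is theta * state
    w = [0.0, 0.0]   # critic weights: V(s) = w[0] + w[1] * s^2

    for _ in range(steps):
        s = rng.uniform(-1.0, 1.0)      # sample a state
        mean = theta * s
        a = rng.gauss(mean, SIGMA)      # actor: sample an action
        r = -(a - s) ** 2               # toy reward, highest when a == s

        phi = [1.0, s * s]              # critic features
        value = w[0] * phi[0] + w[1] * phi[1]
        advantage = r - value           # one-step episode, so no bootstrapping

        # Critic update: move V(s) towards the observed reward.
        for i in range(2):
            w[i] += lr_critic * advantage * phi[i]

        # Actor update: gradient of log N(a; theta * s, SIGMA^2) w.r.t. theta,
        # weighted by how much better the action was than the critic expected.
        grad_log_prob = (a - mean) / SIGMA**2 * s
        theta += lr_actor * advantage * grad_log_prob

    return theta, w


theta, w = run_actor_critic()
print(f"learned theta: {theta:.2f} (the toy optimum is 1.0)")
```

The critic's estimate serves as a baseline: the actor is pushed towards actions that turned out better than expected and away from those that turned out worse.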

## Introduction to `stable-baselines3`

## Key Algorithms for Continuous Control
- PPO – Proximal Policy Optimization
- DDPG – Deep Deterministic Policy Gradient
- SAC – Soft Actor-Critic

## Training with `stable-baselines3`
- Setting up an environment
- Creating and training a model
- Saving, loading, and continuing training

## Tuning and Customization
- Important hyperparameters to tune (learning rate, gamma, buffer size, etc.)
- Callbacks and logging
- Common issues (e.g. instability, no learning)

## Evaluating and Visualizing Performance
- Monitoring reward over time
- Using `evaluate_policy()`
- Rendering and recording episodes