# Reinforcement Learning Categories: Model-Based and Model-Free

Reinforcement Learning (RL) algorithms are commonly divided into two main categories: Model-Based and Model-Free. The distinction is whether the agent uses a model of the environment's dynamics when learning and choosing actions. This document gives an overview of both categories and of representative algorithms in each.

## Model-Free Reinforcement Learning

Model-Free methods do not use a model of the environment's transitions or rewards. They learn policies or value functions directly from sampled interactions with the environment. They are commonly grouped into Value-Based, Policy-Based, and Actor-Critic methods.

### Value-Based Methods

Value-Based methods focus on learning a value function that estimates the expected return (the cumulative, usually discounted, reward) for each state or state-action pair. The policy is derived implicitly from the value function by selecting the action with the highest estimated value.
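
To make the idea of an implicit policy concrete, here is a minimal sketch of greedy action selection from a table of action values. The function name `greedy_action`, the dictionary layout, and the example states are assumptions made for illustration only.

```python
def greedy_action(q_values, state, actions):
    """Pick the action with the highest estimated value in `state`.

    `q_values` maps (state, action) pairs to estimated returns;
    unseen pairs default to 0.0 (an assumption for this sketch).
    """
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))


# Hypothetical two-action example.
q_values = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
best = greedy_action(q_values, "s0", ["left", "right"])
```

In practice the greedy choice is usually mixed with some exploration (for example, occasionally picking a random action), which this sketch leaves out.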

#### Q-Learning (1989)
- **Type**: Off-Policy
- **Description**: Learns the optimal action-value function $Q(s, a)$ by iteratively updating estimates with a temporal-difference rule derived from the Bellman optimality equation:
  $$
  Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
  $$
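  where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r$ is the reward received after taking action $a$ in state $s$, and $s'$ is the resulting next state.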
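
The update rule can be sketched for the tabular case as follows. The function name, the `done` flag for terminal transitions, and the example states are assumptions added for illustration; they are not part of the formula itself.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-Learning update and return the new Q(s, a)."""
    # Off-policy target: bootstrap from the best action in the next state.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical example: unseen entries start at 0.0.
Q = defaultdict(float)
Q[("s1", "right")] = 1.0
new_value = q_learning_update(Q, "s0", "right", r=0.5, s_next="s1",
                              actions=["left", "right"],
                              alpha=0.5, gamma=0.9)
```

With these numbers the target is $0.5 + 0.9 \cdot 1.0 = 1.4$, so the estimate moves halfway from 0 to 1.4.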

#### SARSA (1996)
- **Type**: On-Policy
- **Description**: Updates the action-value function $Q(s, a)$ based on the action actually taken by the policy:
  $$
  Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]
  $$
  where $a'$ is the action the current policy actually selects in the next state $s'$, rather than the greedy maximum over actions used by Q-Learning.
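
A minimal tabular sketch of the SARSA update is given below. The function name, the `done` flag, and the example values are assumptions for illustration.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next,
                 alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular SARSA update and return the new Q(s, a)."""
    # On-policy target: bootstrap from the action actually chosen next.
    next_value = 0.0 if done else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * next_value - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical example: the policy picks "left" in s1, not the greedy "right".
Q = defaultdict(float)
Q[("s1", "left")] = 0.2
Q[("s1", "right")] = 1.0
new_value = sarsa_update(Q, "s0", "right", r=0.5, s_next="s1",
                         a_next="left", alpha=0.5, gamma=0.9)
```

Because the target uses $Q(s', a')$ for the chosen action, the result here (0.34) is lower than a greedy bootstrap from the same table would give.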

#### Deep Q-Networks (DQN) (2013)
- **Type**: Off-Policy
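- **Description**: Extends Q-Learning by using deep neural networks to approximate $Q(s, a)$. Training is stabilized by experience replay, which samples past transitions from a buffer, and, from the 2015 version onward, by a separate target network that is updated only periodically.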
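
The two stabilizing ingredients can be sketched without a neural network library. In the sketch below, `ReplayBuffer`, `dqn_targets`, and the callable `target_q` (a stand-in for a frozen target network that returns one value per action) are all hypothetical names chosen for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest entries drop out

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks up correlations between consecutive steps.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def dqn_targets(batch, target_q, gamma=0.99):
    """Compute regression targets using the frozen target network."""
    targets = []
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else max(target_q(s_next))
        targets.append(r + gamma * bootstrap)
    return targets


# Hypothetical stand-in for a target network over two actions.
def target_q(state):
    return [0.0, float(state)]


buffer = ReplayBuffer(capacity=2)
for t in [(0, 1, 1.0, 2, False), (2, 0, 1.0, 3, True), (1, 1, 0.0, 2, False)]:
    buffer.add(t)
batch = buffer.sample(2, rng=random.Random(0))
targets = dqn_targets([(0, 1, 1.0, 2, False), (2, 0, 1.0, 3, True)],
                      target_q, gamma=0.5)
```

A full agent would regress the online network's $Q(s, a)$ toward these targets and copy its weights into the target network at intervals; both steps are omitted here.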

### Policy-Based Methods

Policy-Based methods directly learn a policy that maps states to actions without explicitly learning a value function. They optimize the policy by adjusting its parameters to maximize the expected return.

#### REINFORCE (1992)
- **Type**: On-Policy
- **Description**: Uses Monte Carlo returns from complete episodes to update the policy parameters $\theta$ in the direction of the gradient of the expected return:
  $$
  \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (a \mid s) \, R \right]
  $$
  where $\pi_\theta(a \mid s)$ is the probability of taking action $a$ in state $s$ under the policy, and $R$ is the total return following state $s$ and action $a$.
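
The gradient estimate can be sketched for a very small case: a softmax policy whose action preferences (`logits`) do not depend on the state. That simplification, the use of discounted returns-to-go for $R$, and all function names are assumptions made to keep the example short.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Return-to-go for each time step of one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_gradient(logits, actions, rewards, gamma=0.99):
    """Single-episode Monte Carlo estimate of the policy gradient."""
    grad = [0.0] * len(logits)
    for a, g in zip(actions, discounted_returns(rewards, gamma)):
        for i, d in enumerate(grad_log_softmax(logits, a)):
            grad[i] += d * g
    return grad


returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
grad = reinforce_gradient([0.0, 0.0], actions=[1], rewards=[1.0])
```

A gradient ascent step would then add a small multiple of `grad` to the logits, raising the probability of actions that were followed by high returns.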

#### Proximal Policy Optimization (PPO) (2017)
- **Type**: On-Policy
- **Description**: Improves the stability and reliability of policy gradient methods with a clipped surrogate objective, which removes the incentive for updates that move the new policy too far from the old one.
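
The clipped objective can be sketched as below. The text above does not spell out the formula, so this follows the commonly cited form in which `ratios` are the probability ratios between the new and old policy for each sampled action and `clip_eps` is the clipping range; the function name and example numbers are illustrative assumptions.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of samples."""
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum gives a pessimistic bound on the improvement.
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)


# Hypothetical batch: one sample with a large ratio, one with a small ratio.
objective = ppo_clip_objective([1.5, 0.5], [1.0, -1.0], clip_eps=0.2)
```

In the first sample the ratio of 1.5 is capped at 1.2, so pushing the policy further in that direction yields no extra objective value.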

### Actor-Critic Methods

Actor-Critic methods combine value-based and policy-based approaches. They consist of two components: an actor that updates the policy and a critic that updates the value function, providing feedback to the actor.
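
One simple way to picture the critic's feedback is a one-step actor-critic update in the tabular case, sketched below. The temporal-difference error serves as the feedback signal; the softmax actor, the separate step sizes, and all names are assumptions for illustration.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(prefs, V, s, a, r, s_next, done,
                      alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
    """One-step update; prefs[s] holds action preferences, V[s] state values."""
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]           # critic's feedback to the actor
    V[s] += alpha_critic * td_error    # critic update
    probs = softmax(prefs[s])
    for i in range(len(prefs[s])):     # actor update via grad of log softmax
        grad = (1.0 if i == a else 0.0) - probs[i]
        prefs[s][i] += alpha_actor * td_error * grad
    return td_error


# Hypothetical two-state, two-action example.
prefs = {0: [0.0, 0.0], 1: [0.0, 0.0]}
V = {0: 0.0, 1: 1.0}
td_error = actor_critic_step(prefs, V, s=0, a=1, r=1.0, s_next=1, done=False,
                             alpha_actor=0.5, alpha_critic=0.5, gamma=0.5)
```

A positive TD error means the outcome was better than the critic expected, so the actor raises its preference for the action that was taken.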

#### Asynchronous Advantage Actor-Critic (A3C) (2016)
- **Type**: On-Policy
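- **Description**: Runs multiple actor-learner workers in parallel, each interacting with its own copy of the environment so that different parts of the state space are explored at once. The workers apply their updates asynchronously to a shared global policy and value function.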

#### Soft Actor-Critic (SAC) (2018)
- **Type**: Off-Policy
- **Description**: Uses a maximum entropy framework: the agent maximizes expected return plus an entropy bonus on its policy, which encourages exploration and balances it against exploitation.
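
The effect of the entropy term can be illustrated with the entropy-regularized ("soft") state value for a discrete set of actions. SAC is usually applied to continuous actions, so the discrete form, the `temperature` weight on the entropy term, and the function name are simplifying assumptions for this sketch.

```python
import math


def soft_state_value(q_values, probs, temperature=0.2):
    """Expected Q-value plus temperature-weighted policy entropy."""
    return sum(p * (q - temperature * math.log(p))
               for q, p in zip(q_values, probs) if p > 0.0)


# Two actions with equal Q-values: a uniform policy earns an entropy bonus,
# a deterministic policy does not.
uniform_value = soft_state_value([1.0, 1.0], [0.5, 0.5], temperature=1.0)
greedy_value = soft_state_value([1.0, 1.0], [1.0, 0.0], temperature=1.0)
```

When actions are equally good, the more random policy scores higher, which is how the objective keeps the agent exploring.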

## Model-Based Reinforcement Learning

Model-Based methods use a model of the environment, either given in advance or learned from experience, to predict the next state and reward for a given state and action. Because the model can simulate future states, the agent can plan ahead instead of relying only on real interactions.
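
As a minimal sketch of planning with a model, the function below performs a one-step lookahead: it queries the model for each action and picks the one with the best predicted reward plus discounted next-state value. The `model(state, action) -> (next_state, reward)` interface, the value table, and the corridor example are assumptions for illustration.

```python
def plan_greedy(model, value, state, actions, gamma=0.95):
    """Choose the action with the best one-step lookahead under `model`."""
    def lookahead(action):
        next_state, reward = model(state, action)  # simulated, not real
        return reward + gamma * value.get(next_state, 0.0)
    return max(actions, key=lookahead)


# Hypothetical corridor: states are integers, reaching state 3 pays 1.0.
def corridor_model(state, action):
    next_state = state + action
    return next_state, (1.0 if next_state == 3 else 0.0)


chosen = plan_greedy(corridor_model, value={}, state=2, actions=[-1, +1])
```

Deeper planners apply the same idea over multiple steps, at the cost of more model queries.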

### Dyna-Q (1991)
- **Description**: Combines model-free Q-Learning with a model-based approach. The agent learns a model of the environment and uses it to simulate experiences, updating the Q-values using both real and simulated experiences.
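
The combination of real and simulated updates can be sketched in tabular form. The sketch assumes a deterministic environment (so the model stores one outcome per state-action pair) and ignores terminal states; the function name and the `n_planning` parameter are illustrative.

```python
import random
from collections import defaultdict


def dyna_q_step(Q, model, s, a, r, s_next, actions,
                n_planning=5, alpha=0.1, gamma=0.95, rng=random):
    """One Dyna-Q step: learn from a real transition, then plan."""
    def update(s, a, r, s_next):
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    update(s, a, r, s_next)          # direct learning from real experience
    model[(s, a)] = (r, s_next)      # model learning (deterministic assumed)
    for _ in range(n_planning):      # planning with simulated experience
        ps, pa = rng.choice(list(model))
        pr, ps_next = model[(ps, pa)]
        update(ps, pa, pr, ps_next)


# Hypothetical single transition, followed by one planning update.
Q = defaultdict(float)
model = {}
dyna_q_step(Q, model, "s0", "right", 1.0, "s1", ["left", "right"],
            n_planning=1, alpha=0.5, gamma=0.9)
```

Here the real update moves the estimate to 0.5, and the replayed copy of the same transition moves it to 0.75 without any further interaction with the environment.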

### Monte Carlo Tree Search (MCTS) (2006)
- **Description**: A planning algorithm that incrementally builds a search tree of possible future states, using a model or simulator of the environment. It runs many simulations to estimate the outcomes of different actions and uses those estimates to guide the agent's decision-making.
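
A full MCTS implementation is beyond a short example, but the step that decides which branch of the tree to simulate next can be sketched. The version below uses the UCT rule, one common choice that is not named in the text above; the function names, the exploration constant `c`, and the statistics layout are assumptions.

```python
import math


def uct_score(value_sum, visits, parent_visits, c=1.41):
    """Average simulated outcome plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.41):
    """`children` maps each action to (value_sum, visits)."""
    return max(children,
               key=lambda a: uct_score(*children[a], parent_visits, c))


# Hypothetical node statistics after 11 simulations through the parent.
children = {"a": (5.0, 10), "b": (1.0, 1), "c": (0.0, 0)}
first = select_child(children, parent_visits=11)
second = select_child({"a": (5.0, 10), "b": (1.0, 1)}, parent_visits=11)
```

The rarely visited action "b" wins over the well-explored "a" once every child has been tried, which is how the search spreads simulations across the tree.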

### MuZero (2019)
- **Description**: Learns a model of the environment without being given its rules. Rather than reconstructing observations, the learned model operates on a latent state and predicts the quantities needed for planning: the reward, the value, and the policy. The agent plans with tree search over this learned model, combining model-based planning with techniques from model-free learning.

### World Models (2018)
- **Description**: Learns a generative model of the environment that compresses observations and predicts how they evolve over time. The agent can train its policy on rollouts simulated inside this internal model, which reduces the amount of real interaction it needs.

## Summary

### Model-Free vs. Model-Based

- **Model-Free**:
  - Do not use an explicit model of the environment.
  - Learn policies or value functions from direct interaction with the environment.
  - Examples: Q-Learning, SARSA, DQN, REINFORCE, PPO, A3C, SAC.

- **Model-Based**:
  - Use a model of the environment to predict future states and rewards.
  - Enable planning by simulating potential future experiences.
  - Examples: Dyna-Q, MCTS, MuZero, World Models.

Understanding these categories and their respective methods is crucial for selecting the appropriate RL algorithm based on the specific problem and computational resources available.
