# Advanced Statistics for Reinforcement Learning

## 1. Markov Decision Processes (MDP)


### Definition:
- A mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker.

### Components:
- **States (S)**: All possible situations in which the agent can be.
- **Actions (A)**: All possible actions the agent can take.
- **Transition Model (T)**: Gives the probability \( T(s, a, s') = P(s' \mid s, a) \) of moving to state \( s' \) after taking action \( a \) in state \( s \).
- **Reward Function (R)**: Provides feedback to the agent about the immediate benefit of taking an action in a state.
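- **Discount Factor (γ)**: A value in \( [0, 1] \) that sets how much future rewards count relative to immediate ones. Values near 0 make the agent short-sighted; values near 1 weight long-term rewards almost as heavily as immediate ones.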
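
As a rough sketch, the five components can be written down as plain Python data. The two-state "cool"/"hot" MDP, its probabilities, its rewards, and the `step` helper are all invented for illustration and do not come from any particular library.

```python
import random

# Hypothetical two-state MDP; every number here is made up for illustration.
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]
GAMMA = 0.9  # discount factor

# TRANSITIONS[(s, a)] maps each next state s' to P(s' | s, a).
TRANSITIONS = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}

# REWARDS[(s, a)] is the immediate reward for taking action a in state s.
REWARDS = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("hot", "slow"): 1.0,
    ("hot", "fast"): -10.0,
}

def step(state, action, rng=random):
    """Sample (next_state, reward) from the transition model."""
    dist = TRANSITIONS[(state, action)]
    next_states = list(dist)
    weights = [dist[s] for s in next_states]
    next_state = rng.choices(next_states, weights=weights, k=1)[0]
    return next_state, REWARDS[(state, action)]
```

Each row of `TRANSITIONS` should sum to 1, which is a cheap sanity check to run before using a hand-written model.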


## 2. Value Functions


### Concepts:
- **State Value Function (V)**: The expected return when starting from state \( s \) and following a policy \( π \).
- **Action Value Function (Q)**: The expected return when taking action \( a \) in state \( s \) and then following policy \( π \).
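
One way to make the two definitions concrete is iterative policy evaluation on a small, fully known MDP. The sketch below assumes a made-up two-state model `P` and a fixed stochastic `policy`; it repeatedly applies the Bellman expectation backup until `V` stops changing, then derives `Q` from `V`.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
GAMMA = 0.9

# P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {0: [(1.0, 0, 1.0)], 1: [(0.5, 0, 2.0), (0.5, 1, 2.0)]},
    1: {0: [(0.5, 0, 1.0), (0.5, 1, 1.0)], 1: [(1.0, 1, -10.0)]},
}

# policy[s][a] = pi(a | s), the probability of choosing a in s.
policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}

def q_from_v(V, s, a):
    """Q(s, a): expected reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def evaluate_policy(policy, tol=1e-10):
    """Sweep the Bellman expectation backup until the largest change < tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = sum(policy[s][a] * q_from_v(V, s, a) for a in P[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

V = evaluate_policy(policy)
Q = {(s, a): q_from_v(V, s, a) for s in P for a in P[s]}
```

At convergence the two functions satisfy \( V(s) = \sum_a π(a \mid s) Q(s, a) \) up to the tolerance.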


## 3. Policies


### Concepts:
- **Policy (π)**: A mapping from states to actions that the agent uses to choose what to do. It can be deterministic (\( a = π(s) \)) or stochastic (\( π(a \mid s) \)).
- **Exploration vs. Exploitation**: Balancing the need to explore the environment versus exploiting known strategies to maximize rewards.


## 4. Temporal Difference Learning


### Techniques:
- **Q-Learning**: A model-free, off-policy algorithm that learns the value of each action in each state without a model of the environment. Its update bootstraps from the highest-valued action in the next state, regardless of which action the agent takes next.
- **SARSA (State-Action-Reward-State-Action)**: An on-policy algorithm that updates Q-values using the action the current policy actually selects in the next state.
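
The two tabular updates differ only in the bootstrap target, which the following sketch puts side by side. The step size `ALPHA`, the discount `GAMMA`, and the two-action set are assumed values, and handling of terminal states is left out for brevity.

```python
from collections import defaultdict

ALPHA = 0.5      # learning rate (assumed value)
GAMMA = 0.9      # discount factor (assumed value)
ACTIONS = [0, 1]  # hypothetical action set

def q_learning_update(Q, s, a, r, s_next):
    """Off-policy: bootstrap from the greedy action in s_next."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy: bootstrap from the action actually chosen in s_next."""
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# Usage: Q is a table keyed by (state, action), defaulting to 0.0.
Q = defaultdict(float)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

When the policy is purely greedy the two targets coincide; they diverge as soon as the agent explores.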


## 5. Monte Carlo Methods


### Definition:
- Algorithms that rely on repeated random sampling to obtain numerical results. In RL they estimate value functions by averaging the returns observed over complete sampled episodes, and use those estimates to improve the policy.

### Returns:
- The total discounted reward from a given time step until the end of an episode.
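
A minimal sketch of computing the return for every time step of one finished episode, working backwards so each step reuses the return of the step after it. The function name and the convention that `rewards[t]` is the reward received after the action at step `t` are assumptions made for this example.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where G_t = rewards[t] + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0  # the return after the final step is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

A Monte Carlo value estimate for a state is then the average of these returns over the visits to that state.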


## 6. Policy Gradient Methods


### Techniques:
- **Definition**: Methods that adjust the parameters of the policy directly, by gradient ascent on the expected return, rather than deriving the policy from a learned value function.
- **REINFORCE Algorithm**: A policy gradient method that uses Monte Carlo returns to update the policy.
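
To illustrate the REINFORCE update in the simplest possible setting, the sketch below applies it to a one-step task (a multi-armed bandit) with a softmax policy over action preferences. The bandit, the reward noise, the learning rate, and all function names are assumptions chosen to keep the example dependency-free; with one-step episodes the Monte Carlo return is just the sampled reward.

```python
import math
import random

def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(arm_means, episodes=2000, lr=0.1, noise=0.1, seed=0):
    """REINFORCE on one-step episodes: theta += lr * G * grad log pi(a)."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_means)  # one preference per action
    for _ in range(episodes):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        G = rng.gauss(arm_means[a], noise)  # sampled return of this episode
        for b in range(len(theta)):
            # d/d theta_b of log softmax(theta)[a] is 1[a == b] - probs[b]
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            theta[b] += lr * G * grad_log
    return softmax(theta)

final_probs = reinforce_bandit([0.0, 1.0])
```

After training, the policy should put most of its probability on the arm with the higher mean reward. In multi-step problems the same update is applied at every time step, using the return from that step onward.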


## 7. Actor-Critic Methods


### Definition:
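- Combines a learned value function (the critic) with a policy gradient update (the actor). The critic's estimates stand in for full Monte Carlo returns when updating the actor, which improves learning efficiency.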

### Advantage Function:
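- Defined as \( A(s, a) = Q(s, a) - V(s) \): how much better action \( a \) is than the policy's average behavior in state \( s \).
- Used in place of the raw return to reduce the variance of the policy gradient estimates.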
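
Two common ways to obtain an advantage value are sketched below: exactly, from known Q-values and policy probabilities, and as a one-step temporal-difference estimate from a critic's state values. Function names and default parameters are assumptions for illustration.

```python
def advantage(q_values, policy_probs):
    """A(s, a) = Q(s, a) - V(s), with V(s) = sum_a pi(a|s) * Q(s, a)."""
    v = sum(p * q for p, q in zip(policy_probs, q_values))
    return [q - v for q in q_values]

def td_advantage(reward, v_s, v_next, gamma=0.99, done=False):
    """One-step estimate: r + gamma * V(s') - V(s); no bootstrap at episode end."""
    bootstrap = 0.0 if done else gamma * v_next
    return reward + bootstrap - v_s

print(advantage([1.0, 3.0], [0.5, 0.5]))  # [-1.0, 1.0]
```

Under the policy's own action distribution the exact advantages average to zero, which is why subtracting \( V(s) \) changes the variance of the gradient estimate but not its expectation.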


## 8. Exploration Strategies


### Techniques:
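- **Epsilon-Greedy**: With probability \( ε \), take a uniformly random action; otherwise take the best-known (greedy) action.
- **Softmax Action Selection**: Actions are sampled from a softmax (Boltzmann) distribution over Q-values, so higher-valued actions are chosen more often. A temperature parameter controls how close the selection is to greedy.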
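
Both strategies fit in a few lines when Q-values for the current state are held in a list indexed by action. This is a sketch under that assumption; the function names and the `temperature` default are illustrative.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

def softmax_probs(q_values, temperature=1.0):
    """Boltzmann distribution over actions; lower temperature is greedier."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_action(q_values, temperature=1.0, rng=random):
    """Sample an action index according to softmax_probs."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Unlike epsilon-greedy, softmax selection explores in proportion to estimated value, so clearly bad actions are tried less often than near-optimal ones.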


## 9. Temporal Logic in RL


### Concepts:
- **Linear Temporal Logic (LTL)**: Used to specify desired properties in the behavior of the agent over time.
- **Reward Shaping**: Adding extra terms to the reward signal to encourage specific behaviors, typically to guide the agent toward the goal faster than the original reward alone would.
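
One well-known form of reward shaping is potential-based shaping, which adds \( γΦ(s') - Φ(s) \) to the reward and is known to leave the optimal policy unchanged. The sketch below assumes a hypothetical grid world with a goal cell at `(3, 3)` and uses negative Manhattan distance as the potential; both choices are illustrative, not prescribed by the notes above.

```python
GAMMA = 0.9
GOAL = (3, 3)  # hypothetical goal cell in a grid world

def potential(state):
    """Assumed potential: negative Manhattan distance to the goal."""
    return -(abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]))

def shaped_reward(reward, state, next_state, gamma=GAMMA):
    """Add the shaping term F = gamma * phi(s') - phi(s) to the reward."""
    return reward + gamma * potential(next_state) - potential(state)

# Moving toward the goal earns a bonus; moving away incurs a penalty.
toward = shaped_reward(0.0, (0, 0), (0, 1))
away = shaped_reward(0.0, (0, 1), (0, 0))
```

Shaping terms that are not of this potential-based form can change which policy is optimal, so they need more care.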


## 10. Advanced Topics in RL


### Topics:
- **Hierarchical Reinforcement Learning**: Breaking down tasks into smaller subtasks.
- **Multi-Agent Reinforcement Learning**: Learning in environments with multiple interacting agents.
- **Transfer Learning in RL**: Applying knowledge gained in one task to improve learning in a related task.


## 11. Evaluation Metrics


### Metrics:
- **Cumulative Reward**: Total reward received over an episode.
- **Average Reward**: Average reward per time step over episodes.
- **Convergence**: The point at which learning stabilizes, so that value estimates or episode returns no longer change appreciably with further training.
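
The three metrics can be computed from logged rewards roughly as follows. The convergence test is only a heuristic chosen for this sketch (it compares the mean return of the last two windows of episodes); the window size and tolerance are assumed values that would need tuning per task.

```python
def cumulative_reward(rewards):
    """Total reward collected in one episode."""
    return sum(rewards)

def average_reward(episodes):
    """Mean reward per time step across all episodes."""
    total_steps = sum(len(ep) for ep in episodes)
    return sum(sum(ep) for ep in episodes) / total_steps

def has_converged(episode_returns, window=10, tol=1e-2):
    """Heuristic: the last two windows of returns have nearly equal means."""
    if len(episode_returns) < 2 * window:
        return False
    recent = episode_returns[-window:]
    previous = episode_returns[-2 * window:-window]
    return abs(sum(recent) / window - sum(previous) / window) < tol

print(cumulative_reward([1.0, 2.0, 3.0]))   # 6.0
print(average_reward([[1.0, 1.0], [4.0]]))  # 2.0
```

A flat return curve can also mean the agent is stuck at a poor policy, so this check is best read alongside the reward level itself.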
