# Reinforcement Learning - an introduction (Part 1)

## Introduction to Reinforcement Learning with Tic Tac Toe


Reinforcement Learning (RL) is a powerful paradigm that has applications in various fields, including energy and chemical engineering. In this notebook, we will explore the fundamentals of RL by using the classic game of Tic Tac Toe (also known as Noughts and Crosses) as an example.

### What is Reinforcement Learning?

Reinforcement Learning is a type of machine learning where an agent learns to make sequential decisions by interacting with an environment. The core components of reinforcement learning include:

- **Agent:** The learner or decision-maker that interacts with the environment.
- **Environment (E):** The external system with which the agent interacts. It provides feedback to the agent in the form of rewards and state transitions.
- **State (S):** A representation of the current situation or configuration of the environment.
- **Action (A):** A choice or decision available to the agent in a given state. The set of all such choices is called the action space.
- **Policy (π):** The strategy or rule that defines the agent's behavior, specifying which actions to take in each state.
- **Reward (R):** A numerical value that the agent receives from the environment after taking an action in a particular state. The goal of the agent is to maximize the cumulative reward over time.
- **Value Function:** The expected cumulative reward that an agent can achieve starting from a particular state while following a given policy π.
- **Q-Value Function:** The expected cumulative reward from taking action a in state s and then following a specific policy π.

#### Mathematical Notations

We can represent these concepts mathematically:

- **State-Action Pair:** (S, A) represents a state-action pair, where S is a state, and A is an action.
- **Policy (π):** π(a|s) represents the probability of taking action a in state s under policy π.
- **State Transition Probability:** P(s' | s, a) represents the probability of transitioning to state s' from state s when taking action a.
- **Reward Function:** R(s, a, s') represents the immediate reward obtained when transitioning from state s to s' by taking action a.
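- **Value Function:** $V^{\pi}(s)$ represents the expected cumulative reward from state s when following policy π.
- **Q-Value Function:** $Q^{\pi}(s, a)$ represents the expected cumulative reward from taking action a in state s and then following policy π.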

### Relations Between Quantities

To understand the relationships between these quantities, we can use the Bellman equation, which is a fundamental equation in RL:

$$V^{\pi}(s) = \sum_{a}\pi(a|s) \sum_{s'}P(s' | s, a)[R(s, a, s') + \gamma V^{\pi}(s')]$$

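The Bellman equation expresses the value of a state recursively: it is the expected immediate reward plus the discounted value of the successor state, averaged over the actions chosen by the policy π and over the state transitions P. The discount factor γ, with 0 ≤ γ ≤ 1, controls how strongly future rewards count relative to immediate ones.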
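One way to see the Bellman equation at work is to apply its right-hand side repeatedly until the values stop changing (iterative policy evaluation). The sketch below does this for a made-up two-state problem; the transition table `P`, the rewards, the uniform `policy`, and the value of `GAMMA` are all assumptions chosen for illustration.

```python
GAMMA = 0.9  # discount factor (assumed value)

# Hypothetical model: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "move": [(1.0, 0, 0.0)]},
}

# pi(a|s): a uniform random policy over the two actions.
policy = {s: {"stay": 0.5, "move": 0.5} for s in P}


def bellman_backup(V, s):
    """Right-hand side of the Bellman equation for state s."""
    return sum(
        policy[s][a] * sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
        for a in P[s]
    )


def evaluate_policy(tol=1e-10):
    """Apply the backup to every state until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: bellman_backup(V, s) for s in P}
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V


V = evaluate_policy()
print(V)
```

At convergence, `V[s]` and `bellman_backup(V, s)` agree for every state, which is exactly what the equation states.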

In addition, the Q-value function is related to the value function and the policy through:

$$Q^{\pi}(s, a) = \sum_{s'}P(s' | s, a)[R(s, a, s') + \gamma V^{\pi}(s')]$$

This equation expresses the expected cumulative reward of taking action a in state s and then following policy π.
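Comparing the two equations above term by term shows how the two functions are connected: the inner sum of the Bellman equation for $V^{\pi}(s)$ is exactly the right-hand side of the Q-value equation. Writing $Q^{\pi}(s, a)$ for the Q-value under policy π and substituting gives the following identity, shown here as a short worked step:

$$V^{\pi}(s) = \sum_{a}\pi(a|s)\, Q^{\pi}(s, a)$$

In words, the value of a state is the policy-weighted average of its Q-values. For a deterministic policy that always picks one action $a^{*}$ in state $s$, the sum collapses to $V^{\pi}(s) = Q^{\pi}(s, a^{*})$.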






### Objective

The primary objective of this notebook is to introduce the concept of RL through practical examples. We will use the game of Tic Tac Toe introduced above to illustrate key RL concepts such as states, actions, rewards, policies, and the learning process.

### How RL Algorithms Work and Learn

In RL, the agent learns by interacting with the environment over multiple time steps. The learning process typically follows these steps:

1. **Initialization**: The agent initializes its policy, value functions, and other parameters.

2. **Interaction**: The agent takes actions in the environment based on its current policy. After each action, the environment returns a reward and the next state.

3. **Learning**: The agent updates its policy and value functions based on the rewards received and its interactions with the environment. This is often done using various RL algorithms.

4. **Repeat**: Steps 2 and 3 are repeated for many episodes or time steps to improve the agent's performance.

The agent's goal is to find an optimal policy that maximizes the cumulative reward over time. This involves a trade-off between exploration (trying new actions to discover better policies) and exploitation (choosing actions that are known to yield high rewards).
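The four steps and the exploration/exploitation trade-off can be sketched with an ε-greedy agent. To keep the example short it uses a single-state problem (a three-armed bandit) instead of a full game; the function name `run_bandit`, the reward probabilities, and the value of `epsilon` are assumptions made for illustration.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a hypothetical bandit with 0/1 rewards."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)

    # 1. Initialization: value estimates and visit counts start at zero.
    q = [0.0] * n_actions
    counts = [0] * n_actions

    for _ in range(steps):  # 4. Repeat
        # 2. Interaction: explore with probability epsilon, otherwise exploit.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: q[a])
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0

        # 3. Learning: incremental average of the rewards seen for this action.
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]

    return q, counts


q, counts = run_bandit([0.2, 0.5, 0.8])
print("estimated values:", q)
print("times each action was chosen:", counts)
```

With `epsilon=0` the agent would only exploit and could lock onto the first action that happens to pay off; a small positive `epsilon` keeps every action sampled often enough for its estimate to become reliable.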

### RL Algorithms

#### 1. Brute Force

Brute force RL involves trying every possible policy and selecting the one that yields the highest expected reward. The value function for each policy can be computed using the Bellman equation:

$$V^{\pi}(s) = \sum_{a}\pi(a|s) \sum_{s'}P(s' | s, a)[R(s, a, s') + \gamma V^{\pi}(s')]$$

where:
- $V^{\pi}(s)$ is the value function for state $s$ under policy $\pi$.
- $\pi(a|s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$.
- $P(s' | s, a)$ is the probability of transitioning to state $s'$ from state $s$ when taking action $a$.
- $R(s, a, s')$ is the immediate reward obtained when transitioning from state $s$ to $s'$ by taking action $a$.
- $\gamma$ is the discount factor.

However, this approach is rarely feasible in practice: with $|S|$ states and $|A|$ actions per state there are $|A|^{|S|}$ deterministic policies, so the number of candidates grows exponentially with the number of states.
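For a problem small enough, the brute-force search can be written out directly. The sketch below enumerates every deterministic policy of a made-up two-state, two-action problem, evaluates each one with the Bellman equation, and keeps the best. The model `P`, the rewards, `GAMMA`, and the choice to rank policies by their value in state 0 are assumptions for illustration.

```python
from itertools import product

GAMMA = 0.9
STATES = (0, 1)
ACTIONS = ("stay", "move")

# Hypothetical model: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "move": [(1.0, 0, 0.0)]},
}


def evaluate(policy, tol=1e-10):
    """Iterate the Bellman equation for a deterministic policy {state: action}."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {
            s: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][policy[s]])
            for s in STATES
        }
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V


def brute_force(start_state=0):
    """Try all |A|^|S| deterministic policies and keep the best at start_state."""
    best_policy, best_V = None, None
    for choice in product(ACTIONS, repeat=len(STATES)):
        policy = dict(zip(STATES, choice))
        V = evaluate(policy)
        if best_V is None or V[start_state] > best_V[start_state]:
            best_policy, best_V = policy, V
    return best_policy, best_V


n_policies = len(ACTIONS) ** len(STATES)
best_policy, best_V = brute_force()
print(n_policies, best_policy, best_V)
```

Here only 4 policies need to be checked. The same loop over the 9 cells of a Tic Tac Toe board, with thousands of reachable positions, would already be far out of reach.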

#### 2. Monte Carlo Methods

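Monte Carlo methods estimate value functions and policies by simulating complete episodes and averaging the returns obtained. They need only sampled experience, not the transition probabilities, and are well-suited for episodic tasks. By the law of large numbers, the average return converges to the expected return as the number of episodes grows.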
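A minimal sketch of this idea follows, assuming a hypothetical five-state random walk: the agent starts in the middle, moves left or right with equal probability, and receives a reward of 1 only when it exits on the right. The environment and the function names are made up for the example; with γ = 1 the exact value of the start state is 0.5, which gives a reference for the estimate.

```python
import random


def run_episode(rng, start=2, n_states=5):
    """Simulate one random-walk episode and return its list of rewards."""
    state = start
    rewards = []
    while 0 < state < n_states - 1:      # states 0 and n_states-1 are terminal
        state += rng.choice((-1, 1))     # random policy: left or right
        rewards.append(1.0 if state == n_states - 1 else 0.0)
    return rewards


def discounted_return(rewards, gamma=1.0):
    """Return G = r_1 + gamma * r_2 + gamma^2 * r_3 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def mc_value_estimate(n_episodes=20000, gamma=1.0, seed=0):
    """Estimate V(start) as the average return over many simulated episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        total += discounted_return(run_episode(rng), gamma)
    return total / n_episodes


print(mc_value_estimate())  # should be close to the exact value 0.5
```

The estimate uses no knowledge of P(s' | s, a); it relies only on the sampled episodes.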

#### 3. Q-Learning

Q-Learning is a model-free, off-policy algorithm that learns Q-values through iterative updates. Here the Q-value represents the expected cumulative reward for taking a specific action in a specific state and acting optimally afterwards. After each observed transition, Q-Learning applies a sample-based update derived from the Bellman optimality equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha[R(s, a, s') + \gamma \max_{a'}Q(s', a') - Q(s, a)]$$

where:
- $Q(s, a)$ is the Q-value for state-action pair $(s, a)$.
- $\alpha$ is the learning rate.
- $R(s, a, s')$ is the immediate reward obtained when transitioning from state $s$ to $s'$ by taking action $a$.
- $\gamma$ is the discount factor.
- $\max_{a'}Q(s', a')$ is the highest Q-value currently estimated for the next state $s'$.
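The update rule can be sketched as tabular Q-Learning on a hypothetical five-state corridor: the agent starts in state 0, can move left or right, and receives a reward of 1 on reaching state 4, which ends the episode. The environment, the function names, and the hyperparameter values are assumptions chosen for illustration.

```python
import random

N_STATES = 5      # states 0..4; state 4 is the terminal goal
ACTIONS = (0, 1)  # 0 = left, 1 = right


def env_step(state, action):
    """Hypothetical deterministic corridor: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Q[s][a] starts at zero; the terminal row is never updated and stays zero.
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy behaviour policy with random tie-breaking.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(Q[state])
                action = rng.choice([a for a in ACTIONS if Q[state][a] == best])
            next_state, reward, done = env_step(state, action)
            # Q(s,a) <- Q(s,a) + alpha * [R + gamma * max_a' Q(s',a') - Q(s,a)]
            td_target = reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
    return Q


Q = q_learning()
for s in range(N_STATES - 1):
    print(s, Q[s])
```

The algorithm is off-policy because the update uses the maximum over next actions, while the actions actually taken come from the ε-greedy behaviour policy.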

#### 4. Proximal Policy Optimization (PPO)

PPO is a policy-gradient algorithm that improves a parameterized policy iteratively. It keeps learning stable by limiting how far a single update can move the new policy away from the old one, using a clipped objective function. Rather than maximizing the expected cumulative reward directly, PPO maximizes the following clipped surrogate objective:

$$\max_\theta \mathbb{E}[\min(r(\theta)\hat{A}, \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon)\hat{A})]$$

where:
- $\theta$ represents the policy parameters.
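- $r(\theta) = \pi_{\theta}(a|s) / \pi_{\theta_{\text{old}}}(a|s)$ is the probability ratio between the new and the old policy for the action taken.
- $\hat{A}$ is an estimate of the advantage function, which measures how much better the action taken is than the policy's average behavior in that state.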
- $\epsilon$ is a hyperparameter that controls the clipping range.
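To show what the clipping does numerically, the sketch below evaluates the objective for a small batch of probability ratios and advantage estimates in plain Python. The function name `clipped_surrogate`, the sample numbers, and the value of `epsilon` are assumptions for illustration; a real PPO implementation would compute this inside an automatic-differentiation framework and take gradients with respect to θ.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    terms = []
    for r, adv in zip(ratios, advantages):
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        terms.append(min(r * adv, clipped_r * adv))
    return sum(terms) / len(terms)


# Good action (A > 0) whose probability rose a lot: the gain is capped at 1 + eps.
print(clipped_surrogate([1.5], [1.0]))
# Bad action (A < 0) whose probability dropped a lot: the term is capped at (1 - eps) * A.
print(clipped_surrogate([0.5], [-1.0]))
# Bad action whose probability rose: no clipping, the full penalty applies.
print(clipped_surrogate([1.5], [-1.0]))
```

The `min` makes the objective a pessimistic bound: moving the ratio beyond the clipping range in the favorable direction brings no further gain, while moves that make things worse are still fully penalized.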

Let's get started by setting up the environment and understanding its components.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/paolodeangelis/Sistemi_a_combustione/blob/main/4.1-Reinforcement_Learning_P1.ipynb)

![Work in progress](https://raw.githubusercontent.com/paolodeangelis/Sistemi_a_combustione/main/assets/img/warning-work-in-progress.jpg)