# Reinforcement Learning - an introduction (Part 1)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/paolodeangelis/Sistemi_a_combustione/blob/main/4.1-Reinforcement_Learning_P1.ipynb)

## Introduction to Reinforcement Learning with Tic Tac Toe

Reinforcement Learning (RL) is a powerful paradigm that has applications in various fields, including energy and chemical engineering. In this notebook, we will explore the fundamentals of RL by using the classic game of Tic Tac Toe (also known as Noughts and Crosses) as an example.

### What is Reinforcement Learning?

Reinforcement Learning is a type of machine learning where an agent learns to make sequential decisions by interacting with an environment. The core components of reinforcement learning include:

- **Agent (A):** The learner or decision-maker that interacts with the environment.
- **Environment (E):** The external system with which the agent interacts. It provides feedback to the agent in the form of rewards and state transitions.
- **State (S):** A representation of the current situation or configuration of the environment.
- **Action (A):** The set of possible choices or decisions that the agent can make.
- **Policy (π):** The strategy or rule that defines the agent's behavior, specifying which actions to take in each state.
- **Reward (R):** A numerical value that the agent receives from the environment after taking an action in a particular state. The goal of the agent is to maximize the cumulative reward over time.
- **Value Function:** The expected cumulative reward that an agent can achieve starting from a particular state while following a given policy π.

The value function for each policy can be computed using the Bellman equation:

$V^{\pi}(s) = \sum_{a}\pi(a|s) \sum_{s'}P(s' | s, a)[R(s, a, s') + \gamma V^{\pi}(s')]$

where:
- $V^{\pi}(s)$ is the value function for state $s$ under policy $\pi$.
- $\pi(a|s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$.
- $P(s' | s, a)$ is the probability of transitioning to state $s'$ from state $s$ when taking action $a$.
- $R(s, a, s')$ is the immediate reward obtained when transitioning from state $s$ to $s'$ by taking action $a$.
- $\gamma$ is the discount factor.
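As a minimal sketch of how the Bellman equation can be turned into a computation, the snippet below applies it repeatedly to a made-up two-state MDP until the values stop changing. The arrays `P`, `R`, and `pi`, the numbers in them, and the function name `policy_evaluation` are illustrative assumptions and are not part of this notebook's Tic-Tac-Toe code.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Iterate the Bellman equation for a fixed policy until V stops changing.

    P[s, a, s2] : transition probability P(s2 | s, a)
    R[s, a, s2] : immediate reward R(s, a, s2)
    pi[s, a]    : probability of taking action a in state s
    """
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        # Q[s, a] = sum over s2 of P(s2 | s, a) * (R(s, a, s2) + gamma * V(s2))
        Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)
        V_new = np.sum(pi * Q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical 2-state, 2-action MDP: action 0 leads to state 0, action 1 to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0              # reward 1 whenever the agent lands in state 1
pi = np.full((2, 2), 0.5)     # uniform random policy

V = policy_evaluation(P, R, pi)
print(V)
```

For this toy problem both states satisfy $V = 0.5 + \gamma V$, so with $\gamma = 0.9$ the iteration should settle close to 5 for each state.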

Basic reinforcement learning is modeled as a *Markov decision process*:

* a set of environment and agent states, $\mathcal{S}$;
* a set of actions, $\mathcal{A}$, of the agent;
* $P_a(s,s')=\Pr(S_{t+1}=s'\mid S_t=s, A_t=a)$, the probability of transition (at time $t$) from state $s$ to state $s'$ under action $a$;
* $R_a(s,s')$, the immediate reward after transition from $s$ to $s'$ with action $a$.

### How RL Algorithms Work and Learn

In RL, the agent learns by interacting with the environment over multiple time steps. The learning process typically follows these steps:

<img src="https://github.com/paolodeangelis/Sistemi_a_combustione/blob/main/assets/img/AE_loop.png?raw=true" width="500" alt="Agent-Environment-Action">


1. **Initialization**: The agent initializes its policy, value functions, and other parameters.

2. **Interaction**: The agent takes actions in the environment based on its current policy. It receives rewards from the environment based on its actions.

3. **Learning**: The agent updates its policy and value functions based on the rewards received and its interactions with the environment. This is often done using various RL algorithms.

4. **Repeat**: Steps 2 and 3 are repeated for many episodes or time steps to improve the agent's performance.

The agent's goal is to find an optimal policy that maximizes the cumulative reward over time. This involves a trade-off between exploration (trying new actions to discover better policies) and exploitation (choosing actions that are known to yield high rewards).
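The interaction loop described above can be sketched in a few lines. The `CorridorEnv` class, the `random_policy` function, and the `reset`/`step` interface are hypothetical names chosen for this illustration; the agent here does not learn anything yet, it only shows where acting, observing, and accumulating reward happen.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: walk on cells 0..4, starting in the middle."""

    def reset(self):
        self.pos = 2
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); the episode ends at either edge
        self.pos += action
        done = self.pos in (0, 4)
        reward = 1.0 if self.pos == 4 else 0.0
        return self.pos, reward, done

def random_policy(state, rng):
    # Ignores the state and picks a direction uniformly at random.
    return rng.choice([-1, 1])

def run_episode(env, policy, rng):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state, rng)              # the agent acts
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward                   # a learning update would go here
    return total_reward

rng = random.Random(0)
returns = [run_episode(CorridorEnv(), random_policy, rng) for _ in range(5)]
print(returns)
```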

### RL Algorithms

#### 1. Brute Force

Brute force RL involves trying every possible policy and selecting the one that yields the highest expected reward. The value function of each candidate policy can be computed using the Bellman equation given earlier.

However, this approach is usually not feasible for large state and action spaces due to the exponential number of policies.
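To make the cost concrete, here is a rough sketch that enumerates every deterministic policy of a made-up two-state MDP and keeps the best one. The arrays `P` and `R` and the helper names `evaluate` and `brute_force` are assumptions for illustration. Each policy is evaluated exactly by solving the linear system $V = R_\pi + \gamma P_\pi V$.

```python
import itertools
import numpy as np

def evaluate(policy, P, R, gamma):
    """Solve V = R_pi + gamma * P_pi V exactly for a deterministic policy."""
    n = P.shape[0]
    idx = np.arange(n)
    P_pi = P[idx, policy]                          # (n, n) transitions under the policy
    R_pi = np.sum(P_pi * R[idx, policy], axis=1)   # expected immediate reward per state
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def brute_force(P, R, gamma=0.9, start_state=0):
    n_states, n_actions = P.shape[0], P.shape[1]
    best_policy, best_value = None, -np.inf
    # One candidate per assignment of an action to every state.
    for policy in itertools.product(range(n_actions), repeat=n_states):
        value = evaluate(np.array(policy), P, R, gamma)[start_state]
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy, best_value

# Hypothetical MDP: action 0 leads to state 0, action 1 to state 1 (reward 1).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

best_policy, best_value = brute_force(P, R)
print(best_policy, best_value, "policies tried:", 2 ** 2)
```

The loop visits `n_actions ** n_states` policies, which is only 4 here but grows exponentially with the number of states.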

#### 2. Monte Carlo Methods

Monte Carlo methods estimate value functions and policies by simulating episodes and averaging the returns obtained. They are well-suited for episodic tasks and are based on the law of large numbers.
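A small sketch of the idea, assuming a hypothetical random walk on states 1 to 5 with a reward of 1 only when the walk exits on the right: each state's value is estimated as the average return of the episodes that visited it (first-visit Monte Carlo). The function names and the environment are illustrative, not taken from this notebook.

```python
import random
from collections import defaultdict

def generate_episode(rng, n_states=5):
    """Random walk on states 1..n_states; exits at 0 (reward 0) or n_states + 1 (reward 1)."""
    state = (n_states + 1) // 2
    visited = []
    while 0 < state < n_states + 1:
        visited.append(state)
        state += rng.choice([-1, 1])
    reward = 1.0 if state == n_states + 1 else 0.0
    return visited, reward

def mc_estimate(n_episodes=20000, seed=0):
    rng = random.Random(seed)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        visited, reward = generate_episode(rng)
        # No discounting and a single final reward: the return from every
        # visited state equals that final reward.
        for s in set(visited):          # first-visit: count each state once per episode
            returns_sum[s] += reward
            returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in sorted(returns_count)}

V = mc_estimate()
print(V)
```

For this walk the exact values are $s/6$, so the averages should approach 1/6, 2/6, ..., 5/6 as the number of episodes grows.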

#### 3. Q-Learning

Q-Learning is a model-free, off-policy algorithm that learns Q-values through iterative updates. The Q-value represents the expected cumulative reward for taking a specific action in a specific state. Q-Learning uses the Bellman equation to update Q-values:

$$Q(s, a) \leftarrow Q(s, a) + \alpha[R(s, a, s') + \gamma \max_{a'}Q(s', a') - Q(s, a)]$$

where:
- $Q(s, a)$ is the Q-value for state-action pair $(s, a)$.
- $\alpha$ is the learning rate.
- $R(s, a, s')$ is the immediate reward obtained when transitioning from state $s$ to $s'$ by taking action $a$.
- $\gamma$ is the discount factor.
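The update rule maps almost directly onto code. The sketch below assumes a tabular setting where `Q` is a NumPy array indexed by integer states and actions; the function name `q_learning_update` and the `terminal` flag (which drops the bootstrap term at the end of an episode) are additions made for this example.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # At a terminal next state there is no future value to bootstrap from.
    best_next = 0.0 if terminal else np.max(Q[s_next])
    td_error = r + gamma * best_next - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

Q = np.zeros((3, 2))      # hypothetical table: 3 states, 2 actions
Q[1] = [0.0, 2.0]         # pretend state 1 already has a learned value
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])            # approximately 0.1 * (1 + 0.9 * 2 - 0) = 0.28
```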

#### 4. Proximal Policy Optimization (PPO)

PPO is a policy optimization algorithm that improves the policy iteratively. It balances exploring new policies against exploiting known ones, and it keeps learning stable by clipping the objective so that a single update cannot move the policy too far from the previous one. PPO maximizes the following clipped surrogate objective:

$$\max_\theta \mathbb{E}[\min(r(\theta)\hat{A}, \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon)\hat{A})]$$

where:
- $\theta$ represents the policy parameters.
- $r(\theta)$ is the probability ratio between the new and the old policy for the action taken.
- $\hat{A}$ is an estimate of the advantage function, which measures how much better a specific action is than the policy's average behavior in that state.
- $\epsilon$ is a hyperparameter that controls the clipping range.
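To see what the clipping does numerically, the following sketch evaluates the objective for a handful of made-up ratios and advantage estimates. It only computes the scalar objective with NumPy; it is not a full PPO implementation, and the function name `ppo_clip_objective` is an assumption.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Mean clipped surrogate objective for arrays of ratios and advantage estimates."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the element-wise minimum gives a pessimistic bound on the improvement.
    return np.mean(np.minimum(unclipped, clipped))

ratios = np.array([0.5, 1.0, 1.5])
positive = ppo_clip_objective(ratios, [1.0, 1.0, 1.0])
negative = ppo_clip_objective(ratios, [-1.0, -1.0, -1.0])
print(positive, negative)
```

With a positive advantage the ratio 1.5 is capped at 1.2, so pushing the policy further earns no extra credit; with a negative advantage the ratio 0.5 is raised to 0.8, so the penalty cannot be shrunk by moving far away.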

Let's get started by setting up the environment and understanding its components.


## Model 1: Tic-Tac-Toe

<img src="https://github.com/paolodeangelis/Sistemi_a_combustione/blob/main/assets/img/tic-tac-toe.jpeg?raw=true" width="400" alt="Tic Tac Toe">

In the context of illustrating reinforcement learning, let's consider the game of Tic-Tac-Toe, a simple yet instructive example.

### The Tic-Tac-Toe Game

Tic-Tac-Toe is a two-player game played on a 3x3 board, where one player uses Xs, and the other uses Os. The objective is to place three marks in a row, horizontally, vertically, or diagonally, to win the game. If the board fills up without a winner, the game ends in a draw.

### The Challenge

To illustrate reinforcement learning, let's assume we are playing against an imperfect opponent, one whose play allows us to win occasionally. The objective is to construct a player that learns from its opponent's imperfections and maximizes its chances of winning.

### Classical Techniques Not Suitable

Classical techniques like "minimax" from game theory or dynamic programming are not suitable for this problem because they assume knowledge of the opponent's behavior, which is often unavailable in practice.

### Reinforcement Learning Approach

We can tackle this problem using reinforcement learning. Here's how it works:

1. We create a table of values, one for each possible game state, representing the estimated probability of winning from that state. This table is called the value function.

2. We start with certain states having known values:
   - States with three Xs in a row have a winning probability of 1.
   - States with three Os in a row or filled-up states have a winning probability of 0.
   - Initial values of other states are set to 0.5, representing a 50% chance of winning.

3. We play many games against the opponent, selecting moves that maximize our estimated chances of winning.

4. During the game, we update the values of states we encounter. After a move, we adjust the earlier state's value to be closer to the later state's value using a step-size parameter ($\alpha$).

5. This approach converges to optimal play against the imperfect opponent. The step-size parameter $\alpha$ influences the learning rate.
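As a worked example of step 4, assume for illustration a step size of $\alpha = 0.1$, an earlier state $s$ still at its initial value $V(s) = 0.5$, and a later state $s'$ that is a win, so $V(s') = 1$. Moving the earlier value a fraction $\alpha$ of the way toward the later one gives:

$$V(s) \leftarrow V(s) + \alpha\,[V(s') - V(s)] = 0.5 + 0.1\,(1 - 0.5) = 0.55$$

Repeating this over many games gradually pulls the values of states that tend to lead to wins toward 1.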



<figure>
    <img src="https://github.com/paolodeangelis/Sistemi_a_combustione/blob/main/assets/img/RL-action-chain.png?raw=true" alt="Action chain">
<figcaption><strong>Figure 1</strong>: Tic-tac-toe strategy: Solid lines are taken moves, dashed lines are considered but not chosen. * marks the current best move. Exploratory moves, like our second one, don't affect learning. Red arrows show value updates along the tree. (source <a href="http://www.incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction second edition Richard S. Sutton and Andrew G. Barto</a>)
</figcaption>
</figure>

### Building a Reinforcement Learning Model from Scratch

When constructing an RL model from the ground up, we follow a series of structured steps. Here's a detailed breakdown of each step:

* **1. Setting up the Environment, i.e., defining the Tic-Tac-Toe Game**

  Before diving into the specifics of our model, we need to establish the environment in which the learning will take place. This includes:

  - Defining the state space: In our Tic-Tac-Toe example, the state space represents all possible configurations of the game board.
  - Identifying the action space: This defines the set of possible moves or actions that the RL agent can take in each state.
  - Determining the reward structure: We decide how rewards will be assigned to different game outcomes, such as wins, losses, and draws.

  For the **Tic-Tac-Toe** example, this means:
  - Designing the game board: Specifying the size of the board (e.g., 3x3) and how it's represented in code.
  - Implementing the game logic: Writing the rules for valid moves, checking for wins, losses, or draws, and updating the game state after each move.
  - Handling player interactions: Creating mechanisms for human or agent players to make moves and interact with the game.


* **2. Building the Reinforcement Learning Model**

  Now, we start constructing the RL model itself. Key elements include:

  - Designing the agent: Defining the RL agent that will learn to play Tic-Tac-Toe. This could involve choosing the learning algorithm (e.g., Q-Learning, Deep Q-Networks) and specifying its parameters.
  - Developing the state representation: Creating a suitable representation of the game state so that the agent can make decisions based on it.
  - Formulating the reward function: Determining how the agent will be rewarded based on the game outcomes and intermediate states.
  - Setting exploration-exploitation strategies: Balancing exploration (trying new moves) and exploitation (choosing known good moves) to improve learning.

* **3. Training the Model**

  In the training phase, we allow the RL agent to play numerous games against opponents (possibly itself). This involves:

  - Iterative learning: The agent repeatedly plays the game, makes decisions, and receives rewards. It uses these experiences to update its strategies and policies.
  - Reinforcement learning algorithms: Applying the chosen RL algorithm to update the agent's Q-values (or policy) based on reward feedback and state transitions.
  - Fine-tuning parameters: Adjusting parameters like the learning rate and discount factor to optimize the learning process.

* **4. Testing the Model**

  After training, we evaluate the RL model's performance to ensure it has learned an effective policy. This phase includes:

  - Assessing gameplay: Having the trained agent play Tic-Tac-Toe against various opponents, including perfect and imperfect players, to gauge its performance.
  - Analyzing results: Examining win rates, strategies employed, and any potential shortcomings to identify areas for improvement.
  - Iterative refinement: If necessary, returning to earlier steps to modify the agent or environment and retraining the model for better performance.


### Step 1: Setting up the Environment, i.e., defining the Tic-Tac-Toe Game

We are going to:

1. Define the environment with the object `TicTacToe`
2. Explain the main *methods*
3. Test it

Import libraries

In [1]:
import numpy as np
import random
import pickle
from tqdm import tqdm

In [2]:
import numpy as np

class TicTacToe:
    def __init__(self, board_rows=3, board_cols=3):
        """
        Initialize the Tic Tac Toe game.

        Args:
            board_rows (int): Number of rows in the game board.
            board_cols (int): Number of columns in the game board.
        """
        self.BOARD_ROWS = board_rows
        self.BOARD_COLS = board_cols
        self.data = np.zeros((self.BOARD_ROWS, self.BOARD_COLS))
        self.winner = None
        self.hashVal = None
        self.end = None

    def getHash(self):
        """
        Calculate the hash value for the current state.

        Returns:
            int: The unique hash value for the state.
        """
        if self.hashVal is None:
            self.hashVal = 0
            for i in self.data.reshape(self.BOARD_ROWS * self.BOARD_COLS):
                if i == -1:
                    i = 2
                self.hashVal = self.hashVal * 3 + i
        return int(self.hashVal)

    def isEnd(self):
        """
        Determine whether the game has ended and who has won.

        Returns:
            bool: True if the game has ended, False otherwise.
        """
        if self.end is not None:
            return self.end
        results = []

        # Check rows and columns
        for i in range(0, self.BOARD_ROWS):
            results.append(np.sum(self.data[i, :]))
            results.append(np.sum(self.data[:, i]))

        # Check diagonals
        results.append(0)
        for i in range(0, self.BOARD_ROWS):
            results[-1] += self.data[i, i]
        results.append(0)
        for i in range(0, self.BOARD_ROWS):
            results[-1] += self.data[i, self.BOARD_ROWS - 1 - i]

        for result in results:
            if result == 3:
                self.winner = 1
                self.end = True
                return self.end
            if result == -3:
                self.winner = -1
                self.end = True
                return self.end

        # Check for a tie
        # Every occupied cell contributes 1, so a full board sums to rows * cols
        filled_cells = np.sum(np.abs(self.data))
        if filled_cells == self.BOARD_ROWS * self.BOARD_COLS:
            self.winner = 0
            self.end = True
            return self.end

        # The game is still ongoing
        self.end = False
        return self.end

    def nextState(self, i, j, symbol):
        """
        Generate the next state after making a move.

        Args:
            i (int): Row index for the move.
            j (int): Column index for the move.
            symbol (int): The player's symbol (1 or -1).

        Returns:
            TicTacToe: The next game state after the move.
        """
        new_game = TicTacToe(self.BOARD_ROWS, self.BOARD_COLS)
        new_game.data = np.copy(self.data)
        new_game.data[i, j] = symbol
        return new_game

    def show(self):
        """
        Print the current game board.
        """
        print("    1   2   3")
        print("  -------------")
        for i in range(0, self.BOARD_ROWS):
            print(f"{i+1} |", end=' ')
            for j in range(0, self.BOARD_COLS):
                if self.data[i, j] == 1:
                    print('O', end=' ')
                if self.data[i, j] == 0:
                    print(' ', end=' ')
                if self.data[i, j] == -1:
                    print('X', end=' ')
                if j < 2:
                    print('|', end=' ')
            print(f"|  {i+1}")
            if i <= 2:
                print("  -------------")
        print("    1   2   3\n")


### Understanding the `TicTacToe` Class

The `TicTacToe` class represents the Tic-Tac-Toe game. Let's break down its functionalities and methods:


#### Initialization

```python
def __init__(self, board_rows=3, board_cols=3):
```

- The class constructor initializes the game state.
- `board_rows` and `board_cols` specify the number of rows and columns in the game board.
- The `data` attribute represents the game board as a `board_rows` × `board_cols` NumPy array, where 1 marks a cell taken by Player 1 (printed as `O` by `show`), -1 marks a cell taken by Player 2 (printed as `X`), and 0 marks an empty position.
- `winner` is `None` initially; once the game ends, `isEnd` sets it to the winner's symbol (1 or -1) or to 0 for a tie.
- `hashVal` caches a unique hash value for the state; it is computed on the first call to `getHash`.
- `end` is `None` until `isEnd` is first called, after which it records whether the game has ended (`True` or `False`).

In [3]:
enviroment = TicTacToe()

#### Get Hash Value

```python
def getHash(self):
```

- The `getHash` method calculates a unique hash value for the current game state.
- It is used to efficiently represent and index game states.
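- A *hash function* is any function that maps data of arbitrary size to fixed-size values. Here the board is read as a base-3 number with one digit per cell, so every board configuration gets a distinct integer.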
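The encoding used by `getHash` can be reproduced outside the class. The standalone function below, with the hypothetical name `board_hash`, mirrors the same logic: cells are read row by row, -1 is stored as the digit 2, and the digits are accumulated as a base-3 number.

```python
import numpy as np

def board_hash(board):
    """Encode a board of {0, 1, -1} cells as a base-3 integer (-1 is stored as digit 2)."""
    value = 0
    for cell in np.asarray(board).reshape(-1):
        digit = 2 if cell == -1 else int(cell)
        value = value * 3 + digit
    return value

empty = np.zeros((3, 3))
center = empty.copy()
center[1, 1] = -1                               # symbol -1 in the middle cell
print(board_hash(empty), board_hash(center))    # 0 and 2 * 3**4 = 162
```

Because each cell maps to its own base-3 digit, a 3x3 board always hashes to an integer between 0 and `3**9 - 1`.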

In [4]:
enviroment.getHash()

0

#### Check if the Game Has Ended

```python
def isEnd(self):
```

- The `isEnd` method determines whether the game has ended and, if so, who has won.
- It checks for wins in rows, columns, and diagonals, as well as ties.
- The game result is stored in `winner` (1 for Player 1, -1 for Player 2, 0 for a tie) and `end` (True for game end, False for ongoing).





In [5]:
enviroment.isEnd()

False

#### Generate the Next State

```python
def nextState(self, i, j, symbol):
```

- The `nextState` method generates the next game state after making a move.
- It takes `i` (row index), `j` (column index), and `symbol` (1 for Player 1, -1 for Player 2) as inputs.
- A new `TicTacToe` instance is created with the updated game board.


In [6]:
enviroment = enviroment.nextState(1,1, -1)

####  Display the Game Board

```python
def show(self):
```

- The `show` method prints the current game board in a human-readable format.
- It prints 'O' for Player 1 (symbol 1), 'X' for Player 2 (symbol -1), and a blank for empty positions.

In [7]:
enviroment.show()

    1   2   3
  -------------
1 |   |   |   |  1
  -------------
2 |   | X |   |  2
  -------------
3 |   |   |   |  3
  -------------
    1   2   3



In [8]:
# @title Game

from IPython.display import clear_output

def play_vs_dumb():
    # Initialize the game
    game = TicTacToe(3, 3)  # Specify the number of rows and columns
    DUMB_SYMBOL = -1
    PLAYER_SYMBOL = 1

    # Game loop
    while not game.isEnd():
        clear_output(wait=True)
        game.show()  # Use the 'show' method to print the game board

        # Dumb Player's Turn (Random Move)
        print("Dumb Player's Turn (Random Move)")
        available_moves = [(i,j) for i, j in zip(np.where(game.data == 0)[0], np.where(game.data == 0)[1])]

        random_move = random.choice(available_moves)
        print(random_move)
        game = game.nextState(random_move[0], random_move[1], DUMB_SYMBOL)  # Make the dumb player's move

        # Check the winner
        if game.isEnd():
            if game.winner == DUMB_SYMBOL:
                clear_output(wait=True)
                game.show()
                print("Dumb Player wins!")
                break
            elif game.winner == PLAYER_SYMBOL:
                clear_output(wait=True)
                game.show()
                print("Congratulations! You win!")
                break
            else:
                clear_output(wait=True)
                game.show()
                print("It's a draw!")
                break


        clear_output(wait=True)
        game.show()  # Display the updated game board after the dumb player's move

        # Player's turn
        while True:
            try:
                print("Your Turn:")
                row = int(input("\tEnter row (1, 2, or 3): ")) - 1
                col = int(input("\tEnter column (1, 2, or 3): ")) - 1
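                # Reject out-of-range input (including 0, which would wrap to index -1)
                if 0 <= row < game.BOARD_ROWS and 0 <= col < game.BOARD_COLS and game.data[row, col] == 0: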
                    game = game.nextState(row, col, PLAYER_SYMBOL)  # Make the player's move
                    break
                else:
                    print("Invalid move. Try again.")
            except ValueError:
                print("Invalid input. Please enter row and column as integers.")

        # Check the winner
        if game.isEnd():
            if game.winner == DUMB_SYMBOL:
                clear_output(wait=True)
                game.show()
                print("Dumb Player wins!")
                break
            elif game.winner == PLAYER_SYMBOL:
                clear_output(wait=True)
                game.show()
                print("Congratulations! You win!")
                break
            else:
                clear_output(wait=True)
                game.show()
                print("It's a draw!")
                break


In [9]:
play_vs_dumb()

    1   2   3
  -------------
1 | O |   |   |  1
  -------------
2 | X | O | X |  2
  -------------
3 |   | X | O |  3
  -------------
    1   2   3

Congratulations! You win!


### Step 2: Building the Reinforcement Learning Model

In this step, we will implement the Temporal-Difference (TD) learning agent that will learn to play Tic-Tac-Toe through trial and error. The agent employs the TD learning algorithm to update its value estimates based on state transitions and rewards. TD learning is a fundamental reinforcement learning method that learns to estimate the expected cumulative rewards for each state.

#### Temporal-Difference (TD) Learning

TD learning is a form of reinforcement learning that updates its value estimates based on the difference between the current estimate and the expected future rewards. The core idea behind TD learning is to learn a state-value function, denoted as $V(s)$, that estimates the expected cumulative reward when starting in state $s$ and following the current policy thereafter.

The TD learning update equation is as follows:

$$V(s) \leftarrow V(s) + \alpha \cdot \left( R + \gamma \cdot V(s') - V(s) \right)$$

Where:
- $V(s)$ is the current estimate of the value for state $s$.
- $\alpha$ is the learning rate ($0 \leq \alpha \leq 1$) that controls the step size of updates.
- $R$ is the immediate reward received after transitioning from state $s$ to state $s'$.
- $\gamma$ is the discount factor ($0 \leq \gamma \leq 1$) that balances immediate rewards and future rewards.
- $V(s')$ is the value estimate for the next state $s'$ reached by following the current policy.
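To get a feel for the role of $\alpha$, the sketch below applies the update repeatedly to a single state while holding the target fixed. The dictionary `V`, its keys, and the function name `td0_update` are assumptions made for this illustration.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """Move V[s] a fraction alpha toward the TD target r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

V = {"s": 0.5, "s_next": 0.8}      # hypothetical value table
first_error = td0_update(V, "s", r=0.0, s_next="s_next")
after_one = V["s"]                 # one tenth of the way from 0.5 toward 0.8

for _ in range(49):                # 50 updates in total against the same target
    td0_update(V, "s", r=0.0, s_next="s_next")
print(after_one, V["s"])
```

Each update shrinks the remaining gap by a factor of $1 - \alpha$, so a larger $\alpha$ tracks the target faster but also reacts more strongly to noisy targets.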


In [10]:
# @title Auxiliary functions
BOARD_ROWS = 3
BOARD_COLS = 3

def getAllStatesImpl(currentState, currentSymbol, allStates):
    for i in range(BOARD_ROWS):
        for j in range(BOARD_COLS):
            if currentState.data[i][j] == 0:
                newState = currentState.nextState(i, j, currentSymbol)
                newHash = newState.getHash()
                if newHash not in allStates.keys():
                    isEnd = newState.isEnd()
                    allStates[newHash] = (newState, isEnd)
                    if not isEnd:
                        getAllStatesImpl(newState, -currentSymbol, allStates)

def getAllStates():
    currentSymbol = 1
    currentState = TicTacToe()
    allStates = dict()
    allStates[currentState.getHash()] = (currentState, currentState.isEnd())
    getAllStatesImpl(currentState, currentSymbol, allStates)
    return allStates

# all possible board configurations
allStates = getAllStates()

In [11]:
class Agent:
    def __init__(self, stepSize=0.1, exploreRate=0.1):
        """
        Initialize the Reinforcement Learning Agent.

        Args:
            stepSize (float): The step size or learning rate for updating estimations.
            exploreRate (float): The exploration rate, controlling the probability of exploration.
        """
        self.allStates = allStates
        self.estimations = dict()
        self.stepSize = stepSize
        self.exploreRate = exploreRate
        self.states = []

    def reset(self):
        """
        Reset the agent's state history.
        """
        self.states = []

    def setSymbol(self, symbol):
        """
        Set the agent's symbol and initialize estimations.

        Args:
            symbol (int): The agent's symbol (1 or -1).
        """
        self.symbol = symbol
        for hash in self.allStates.keys():
            (state, isEnd) = self.allStates[hash]
            if isEnd:
                # Initialize estimations for terminal states
                self.estimations[hash] = 1.0 if state.winner == self.symbol else 0
            else:
                # Initialize estimations for non-terminal states
                self.estimations[hash] = 0.5

    def feedState(self, state):
        """
        Accept a game state and add it to the agent's state history.

        Args:
            state (TicTacToe): The current game state.
        """
        self.states.append(state)

    def feedReward(self, reward):
        """
        Update estimations based on the received reward using Temporal-Difference Learning.

        Args:
            reward (float): The received reward.
        """
        if len(self.states) == 0:
            return
        self.states = [state.getHash() for state in self.states]
        target = reward
        for latestState in reversed(self.states):
            # Temporal-Difference (TD) learning update equation
            value = self.estimations[latestState] + self.stepSize * (target - self.estimations[latestState])
            self.estimations[latestState] = value
            target = value
        self.states = []

    def takeAction(self):
        """
        Determine the next action to take using an exploration-exploitation strategy.

        Returns:
            list: A list containing [row, column, symbol] for the next action.
        """
        state = self.states[-1]
        nextStates = []
        nextPositions = []
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                if state.data[i, j] == 0:
                    nextPositions.append([i, j])
                    nextStates.append(state.nextState(i, j, self.symbol).getHash())
        if np.random.binomial(1, self.exploreRate):
            # Exploration: Choose a random move
            np.random.shuffle(nextPositions)
            self.states = []
            action = nextPositions[0]
            action.append(self.symbol)
            return action

        values = []
        for hash, pos in zip(nextStates, nextPositions):
            values.append((self.estimations[hash], pos))
        np.random.shuffle(values)
        values.sort(key=lambda x: x[0], reverse=True)
        # Exploitation: Choose the move with the highest estimated value
        action = values[0][1]
        action.append(self.symbol)
        return action

    def savePolicy(self):
        """
        Save the learned policy to a file using pickle.
        """
        with open('optimal_policy_' + str(self.symbol), 'wb') as fw:
            pickle.dump(self.estimations, fw)

    def loadPolicy(self):
        """
        Load a learned policy from a file.
        """
        with open('optimal_policy_' + str(self.symbol), 'rb') as fr:
            self.estimations = pickle.load(fr)


#### Initialize the Agent

```python
def __init__(self, stepSize=0.1, exploreRate=0.1):
    """
    Initialize the Reinforcement Learning Agent.

    Args:
        stepSize (float): The step size or learning rate for updating estimations.
        exploreRate (float): The exploration rate, controlling the probability of exploration.
    """
    self.allStates = allStates
    self.estimations = dict()
    self.stepSize = stepSize
    self.exploreRate = exploreRate
    self.states = []
```

- Initialize the agent with the specified learning rate and exploration rate.
- Create a dictionary to store state estimations.
- Initialize an empty list to keep track of visited states.

#### Reset the agent's state history.

```python
def reset(self):
    """
    Reset the agent's state history.
    """
    self.states = []
```

- Clear the list of visited states.

#### Set the agent's symbol and initialize estimations.

```python
def setSymbol(self, symbol):
    """
    Set the agent's symbol and initialize estimations.

    Args:
        symbol (int): The agent's symbol (1 or -1).
    """
    self.symbol = symbol
    for hash in self.allStates.keys():
        (state, isEnd) = self.allStates[hash]
        if isEnd:
            self.estimations[hash] = 1.0 if state.winner == self.symbol else 0
        else:
            self.estimations[hash] = 0.5
```

- Set the agent's symbol (1 or -1).
- Initialize state estimations for terminal and non-terminal states based on the symbol.


#### State

Accept a game state and add it to the agent's state history.

```python
def feedState(self, state):
    """
    Accept a game state and add it to the agent's state history.

    Args:
        state (TicTacToe): The current game state.
    """
    self.states.append(state)
```

- Append the current game state to the list of visited states.



#### Reward
Update estimations based on the received reward using Temporal-Difference Learning (TD).

```python
def feedReward(self, reward):
    """
    Update estimations based on the received reward using Temporal-Difference Learning (TD).

    Args:
        reward (float): The received reward.
    """
    if len(self.states) == 0:
        return
    self.states = [state.getHash() for state in self.states]
    target = reward
    for latestState in reversed(self.states):
        value = self.estimations[latestState] + self.stepSize * (target - self.estimations[latestState])
        self.estimations[latestState] = value
        target = value
    self.states = []
```

- Update state estimations using Temporal-Difference (TD) learning with the received reward.
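The backward pass in `feedReward` can be traced on a tiny example. The helper below, with the hypothetical name `back_up`, mirrors the same loop on a plain dictionary: the last visited state moves toward the reward, and each earlier state moves toward the freshly updated estimate of the state that followed it.

```python
def back_up(estimations, visited, reward, step_size=0.1):
    """Propagate a final reward backwards through the visited states, in place."""
    target = reward
    for state in reversed(visited):
        estimations[state] += step_size * (target - estimations[state])
        target = estimations[state]     # the next (earlier) state chases this value
    return estimations

est = {"a": 0.5, "b": 0.5, "c": 0.5}    # hypothetical state keys, visited in order a, b, c
back_up(est, ["a", "b", "c"], reward=1.0)
print(est)   # c moves to about 0.55, b to about 0.505, a to about 0.5005
```

The effect of a single win fades quickly with distance from the end of the game, which is why many games are needed before early positions get informative values.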



#### Determine the next action to take using an exploration-exploitation strategy.

```python
def takeAction(self):
    """
    Determine the next action to take using an exploration-exploitation strategy.

    Returns:
        list: A list containing [row, column, symbol] for the next action.
    """
    state = self.states[-1]
    nextStates = []
    nextPositions = []
    for i in range(BOARD_ROWS):
        for j in range(BOARD_COLS):
            if state.data[i, j] == 0:
                nextPositions.append([i, j])
                nextStates.append(state.nextState(i, j, self.symbol).getHash())
    if np.random.binomial(1, self.exploreRate):
        np.random.shuffle(nextPositions)
        self.states = []
        action = nextPositions[0]
        action.append(self.symbol)
        return action

    values = []
    for hash, pos in zip(nextStates, nextPositions):
        values.append((self.estimations[hash], pos))
    np.random.shuffle(values)
    values.sort(key=lambda x: x[0], reverse=True)
    action = values[0][1]
    action.append(self.symbol)
    return action
```

- Determine the next action to take using an exploration-exploitation strategy.
- Exploration: Choose a random move with a probability of `exploreRate`.
- Exploitation: Choose the move with the highest estimated value.
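The selection rule can be isolated from the game logic. The sketch below is a simplified stand-in for `takeAction`, not a copy of it: the function name `epsilon_greedy`, the returned `explored` flag, and the use of NumPy's `default_rng` generator are assumptions made for this example.

```python
import numpy as np

def epsilon_greedy(values, epsilon, rng):
    """Return (index, explored): a random index with probability epsilon, else a best one."""
    values = np.asarray(values, dtype=float)
    if rng.random() < epsilon:
        return int(rng.integers(len(values))), True
    best = np.flatnonzero(values == values.max())   # all indices tied for the maximum
    return int(rng.choice(best)), False              # break ties at random

rng = np.random.default_rng(0)
values = [0.5, 0.9, 0.5, 0.1]     # hypothetical estimates for four candidate moves
picks = [epsilon_greedy(values, 0.1, rng)[0] for _ in range(2000)]
print(np.bincount(picks, minlength=4) / len(picks))
```

With `epsilon = 0.1` and four moves, the best move should be picked in roughly 92.5% of the calls (90% greedy choices plus a quarter of the random ones). The `explored` flag lets a caller skip the value update after a random move, in the spirit of how `takeAction` clears `self.states` when it explores.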


#### Save and load the learned policy

```python
def savePolicy(self):
    """
    Save the learned policy to a file.
    """
    with open('optimal_policy_' + str(self.symbol), 'wb') as fw:
        pickle.dump(self.estimations, fw)
```

- Save the learned policy to a file using pickle.

```python
def loadPolicy(self):
    """
    Load a learned policy from a file.
    """
    with open('optimal_policy_' + str(self.symbol), 'rb') as fr:
        self.estimations = pickle.load(fr)
```

- Load a previously learned policy from a file.
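A quick round trip shows what ends up on disk. This sketch uses a made-up two-entry table and a temporary directory so that nothing is left behind; the file name follows the `optimal_policy_<symbol>` pattern used above, and the variable names are assumptions.

```python
import os
import pickle
import tempfile

estimations = {0: 0.5, 162: 0.73}        # hypothetical hash -> value table

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "optimal_policy_1")
    with open(path, "wb") as fw:         # `with` closes the file even if dumping fails
        pickle.dump(estimations, fw)
    with open(path, "rb") as fr:
        restored = pickle.load(fr)

print(restored == estimations)
```

Note that `pickle.load` can execute arbitrary code, so only load policy files that you created yourself or that come from a trusted source.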

### Step 3: Training the Model

#### `Judger` Object

In this step, we implement the `Judger` object, which manages a game of Tic-Tac-Toe between two agents during training and evaluation: it runs the game loop, detects the winner, and passes feedback to the agents.

The `Judger` takes two agents as input, `agent1` and `agent2`, which play against each other. The `feedback` parameter controls whether rewards are distributed when the game ends. If `feedback` is `True`, `giveReward()` is called and both players receive a reward based on the outcome. If it is `False`, no rewards are given to either player, which is the setting used later for evaluation and for playing against a human.

The `Judger` object maintains the current state of the game, the current player's turn, and handles the game loop. It also manages the transition between game states and checks for game termination conditions.

```python
from IPython.display import clear_output

class Judger:
    def __init__(self, agent1, agent2, feedback=True):
        """
        Initialize the Judger object.

        Args:
            agent1 (Agent): The first player (agent) in the game.
            agent2 (Agent): The second player (agent) in the game.
            feedback (bool): If True, both players receive rewards when the game ends.
        """
        self.p1 = agent1
        self.p2 = agent2
        self.feedback = feedback
        self.currentPlayer = None
        self.p1Symbol = 1
        self.p2Symbol = -1
        self.p1.setSymbol(self.p1Symbol)
        self.p2.setSymbol(self.p2Symbol)
        self.currentState = TicTacToe()
        self.allStates = allStates

    def giveReward(self):
        """
        Assign rewards to both players based on the game outcome.
        If one player wins, the winning player receives a reward of 1, and the other player receives a reward of 0.
        If the game ends in a draw, both players receive intermediate rewards.
        """
        if self.currentState.winner == self.p1Symbol:
            self.p1.feedReward(1)
            self.p2.feedReward(0)
        elif self.currentState.winner == self.p2Symbol:
            self.p1.feedReward(0)
            self.p2.feedReward(1)
        else:
            self.p1.feedReward(0.1)
            self.p2.feedReward(0.5)

    def feedCurrentState(self):
        """Feed the current game state to both agents."""
        self.p1.feedState(self.currentState)
        self.p2.feedState(self.currentState)

    def reset(self):
        """Reset the game and agents to start a new round of Tic-Tac-Toe."""
        self.p1.reset()
        self.p2.reset()
        self.currentState = TicTacToe()
        self.currentPlayer = None

    def play(self, show=False):
        """
        Orchestrate the game loop where players take turns making moves until the game ends.

        Args:
            show (bool): If True, display the game board after each move.
        """
        self.reset()
        self.feedCurrentState()
        while True:
            # set current player
            if self.currentPlayer == self.p1:
                self.currentPlayer = self.p2
            else:
                self.currentPlayer = self.p1
            if show:
                self.currentState.show()
            [i, j, symbol] = self.currentPlayer.takeAction()
            self.currentState = self.currentState.nextState(i, j, symbol)
            hashValue = self.currentState.getHash()
            self.currentState, isEnd = self.allStates[hashValue]
            self.feedCurrentState()
            if isEnd:
                if self.feedback:
                    self.giveReward()
                return self.currentState.winner
```


##### Initialize

The constructor stores the two agents, `agent1` and `agent2`, assigns them the symbols `1` and `-1` respectively, creates an empty board, and sets `currentPlayer` to `None`. When `feedback` is `True`, rewards are distributed to both players at the end of each game.

```python
def __init__(self, agent1, agent2, feedback=True):
    """
    Initialize the Judger object.

    Args:
        agent1 (Agent): The first player (agent) in the game.
        agent2 (Agent): The second player (agent) in the game.
        feedback (bool): If True, both players receive rewards when the game ends.
    """
```

##### Assign rewards
Assign rewards to both players based on the game outcome. If one player wins, the winner receives a reward of 1 and the loser a reward of 0. If the game ends in a draw, the rewards are intermediate and asymmetric: the first player receives 0.1 and the second player receives 0.5.

```python
def giveReward(self):
    """
    Assign rewards to both players based on the game outcome.
    If one player wins, the winning player receives a reward of 1, and the other player receives a reward of 0.
    If the game ends in a draw, both players receive intermediate rewards.
    """
```
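
The reward scheme can be written as a small pure function, which makes the three cases easy to check. This is an illustrative sketch: `outcome_rewards` is a hypothetical helper, and using `0` to mark a draw is an assumption here (in `giveReward`, any winner value other than the two player symbols falls into the draw branch).

```python
def outcome_rewards(winner, p1_symbol=1, p2_symbol=-1):
    """Return (reward_p1, reward_p2) for a finished game (hypothetical helper)."""
    if winner == p1_symbol:
        return 1.0, 0.0
    if winner == p2_symbol:
        return 0.0, 1.0
    # Draw: asymmetric intermediate rewards, matching giveReward
    return 0.1, 0.5

for winner in (1, -1, 0):  # 0 is an assumed marker for a draw
    print(winner, outcome_rewards(winner))
```

Note that a draw is worth more to the second player (0.5) than to the first (0.1), so the two agents are not trained toward the same attitude to drawn games.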


##### Feed the current game state
Feed the current game state to both agents, allowing them to make informed decisions based on the current game state.

```python
def feedCurrentState(self):
    """Feed the current game state to both agents."""
    self.p1.feedState(self.currentState)
    self.p2.feedState(self.currentState)
```

##### Reset agents
The `reset()` method resets the game, agents, and game state to start a new round of Tic-Tac-Toe.

```python
def reset(self):
    """Reset the game, agents, and game state to start a new round of Tic-Tac-Toe."""
    self.p1.reset()
    self.p2.reset()
    self.currentState = TicTacToe()
    self.currentPlayer = None
```

##### Let the two agents play
The `play()` method orchestrates the game loop, where players take turns making moves until the game ends. If `show` is set to `True`, it will display the game board after each move.

```python
def play(self, show=False):
    """
    Orchestrate the game loop where players take turns making moves until the game ends.

    Args:
        show (bool): If True, display the game board after each move.
    """
```
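
The turn-taking logic of `play()` can be tried out without the rest of the notebook. The sketch below is a simplified stand-in, not the `Judger` itself: `ScriptedPlayer`, `winner_of`, and this standalone `play` function are hypothetical, the board is a plain NumPy array instead of a `TicTacToe` state, and a draw is reported as `0` by assumption.

```python
import numpy as np

class ScriptedPlayer:
    """Hypothetical stand-in for an agent: replays a fixed list of moves."""
    def __init__(self, symbol, moves):
        self.symbol = symbol
        self.moves = list(moves)

    def takeAction(self):
        row, col = self.moves.pop(0)
        return [row, col, self.symbol]

def winner_of(board):
    """Return 1 or -1 for a completed line, 0 for a full board, None otherwise."""
    lines = list(board) + list(board.T)
    lines += [board.diagonal(), np.fliplr(board).diagonal()]
    for line in lines:
        total = int(line.sum())
        if abs(total) == 3:
            return total // 3
    return 0 if np.all(board != 0) else None

def play(p1, p2):
    """Alternate moves, starting with p1, until the game ends."""
    board = np.zeros((3, 3), dtype=int)
    current = None
    while True:
        # Same alternation rule as Judger.play: p1 moves first, then they swap
        current = p2 if current is p1 else p1
        i, j, symbol = current.takeAction()
        board[i, j] = symbol
        result = winner_of(board)
        if result is not None:
            return result

p1 = ScriptedPlayer(1, [(0, 0), (1, 1), (2, 2)])
p2 = ScriptedPlayer(-1, [(0, 1), (0, 2)])
print("winner:", play(p1, p2))
```

In this scripted game the first player completes the main diagonal on its third move, so the loop ends before the second player moves again.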

#### Train function

```python
def train(epochs=20000, stepSize=0.1, exploreRate=0.1):
    """
    Train two agents to play Tic-Tac-Toe using Q-learning.

    Args:
        epochs (int): The number of training epochs (games). Default is 20,000.
        stepSize (float): Step size (learning rate) passed to both agents. Default is 0.1.
        exploreRate (float): Probability of a random exploratory move, passed to both agents. Default is 0.1.

    Returns:
        None
    """
    # Create two agents
    agent1 = Agent(stepSize=stepSize, exploreRate=exploreRate)
    agent2 = Agent(stepSize=stepSize, exploreRate=exploreRate)

    # Create a Judger to manage the game
    judger = Judger(agent1, agent2)

    # Initialize win counts for each player
    agent1_wins = 0.0
    agent2_wins = 0.0

    # Training loop with tqdm progress bar
    for i in tqdm(range(epochs), desc="Training", ncols=100):
        # Play a game with the Judger
        winner = judger.play()

        # Update win counts based on the game outcome
        if winner == 1:
            agent1_wins += 1
        elif winner == -1:
            agent2_wins += 1

        # Reset the Judger for the next game
        judger.reset()

    # Print win rates for both agents
    agent1_win_rate = agent1_wins / epochs
    agent2_win_rate = agent2_wins / epochs
    print()
    print(f"Agent 1 Win Rate: {agent1_win_rate:.4f}")
    print(f"Agent 2 Win Rate: {agent2_win_rate:.4f}")

    # Save learned policies for both agents
    agent1.savePolicy()
    agent2.savePolicy()
```


The `train` function trains two agents using Q-learning to play Tic-Tac-Toe over a specified number of epochs.
It initializes two agents, `agent1` and `agent2`, and a `Judger` to manage the game.
Win counts for each agent are tracked, and the training loop plays games with the `Judger`.
After each game, the win counts are updated based on the game outcome, and the `Judger` is reset for the next game.
The function prints the win rates for both agents and saves their learned policies.
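
The two printed win rates do not add up to 1, because games that end in a draw are not counted for either agent. A minimal sketch of a tally that also reports draws is shown below; `outcome_rates` is a hypothetical helper, and it assumes that a game returns `1` or `-1` for a win and any other value for a draw.

```python
from collections import Counter

def outcome_rates(winners):
    """Return the fraction of games won by each agent and the fraction drawn."""
    counts = Counter(winners)
    total = len(winners)
    return {
        "agent1": counts[1] / total,
        "agent2": counts[-1] / total,
        # Everything that is not a win for either side is treated as a draw
        "draw": (total - counts[1] - counts[-1]) / total,
    }

# Made-up outcomes of eight games, for illustration only
rates = outcome_rates([1, -1, 0, 1, 0, -1, 1, 0])
print(rates)
```

In `train`, the same information could be obtained by appending each `winner` to a list inside the loop and passing that list to such a helper.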

#### Run Train

```python
num_episodes = 3000
alpha = 0.25
epsilon = 0.1

train(epochs=num_episodes, stepSize=alpha, exploreRate=epsilon)
```

```
Training: 100%|███████████████████████████████████████████████| 3000/3000 [00:02<00:00, 1323.77it/s]

Agent 1 Win Rate: 0.3733
Agent 2 Win Rate: 0.3647
```





### Evaluating Agent's Performance

After training the agent, we evaluate its performance by playing a series of games against a random player. This measures how well the agent has learned to play Tic-Tac-Toe.

We'll create a `test` function for this. It takes a single parameter, `num_games`, the number of games to play, and prints the percentage of games won by each player.

Inside `test`, the trained policy is loaded into an agent with `exploreRate=0`, so it always plays greedily, while its opponent uses `exploreRate=1`, so every one of its moves is random. The `Judger` is created with `feedback=False`, so no rewards are given during evaluation. The trained agent always moves first, and wins are counted separately for each side; games that end in a draw count for neither.

```python
def test(num_games):
    """Play num_games between the trained agent and a random player and print both win rates."""
    ai_player = Agent(exploreRate=0)    # always greedy
    dumb_player = Agent(exploreRate=1)  # always random
    judger = Judger(ai_player, dumb_player, False)
    ai_player.loadPolicy()
    player1Win = 0.0
    player2Win = 0.0
    for i in tqdm(range(num_games), desc="Games", ncols=100):
        winner = judger.play()
        if winner == 1:
            player1Win += 1
        elif winner == -1:
            player2Win += 1
        judger.reset()
    print()
    print(f"RL Agent win rate: {player1Win/num_games*100:.4f} %")
    print(f"Dumb Agent win rate: {player2Win/num_games*100:.4f} %")
```

```python
test(5000)
```

```
Games: 100%|██████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 1904.70it/s]

RL Agent win rate: 89.6600 %
Dumb Agent win rate: 6.8600 %
```
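
Since the opponent plays at random, the measured win rate varies from run to run. One rough way to gauge this is a normal-approximation confidence interval for a proportion. The sketch below is not part of the notebook: `win_rate_interval` is a hypothetical helper, `z=1.96` corresponds to an approximate 95% interval, and the count of 4483 wins is simply 89.66% of the 5000 games reported above.

```python
import math

def win_rate_interval(wins, games, z=1.96):
    """Return (rate, low, high) using the normal approximation to the binomial."""
    p = wins / games
    half_width = z * math.sqrt(p * (1 - p) / games)
    # Clip to [0, 1], since the approximation can overshoot near the boundaries
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

p, low, high = win_rate_interval(wins=4483, games=5000)
print(f"win rate {p:.4f}, approx. 95% interval [{low:.4f}, {high:.4f}]")
```

With 5000 games the interval is under one percentage point on either side of the measured rate. The remaining games, 100 − 89.66 − 6.86 = 3.48%, ended in a draw.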





### Let's play against AI

```python
from IPython.display import clear_output

class HumanPlayer:
    """Interactive player exposing the same methods that the Judger calls on an Agent."""
    def __init__(self, stepSize=0.1, exploreRate=0.1):
        # stepSize and exploreRate are unused; they only mirror the Agent signature
        self.symbol = None
        self.currentState = None

    def reset(self):
        pass

    def setSymbol(self, symbol):
        self.symbol = symbol

    def feedState(self, state):
        self.currentState = state

    def feedReward(self, reward):
        pass

    def takeAction(self):
        clear_output(wait=True)
        self.currentState.show()
        print("Your Turn:")
        while True:
            row = int(input("\tEnter row (1, 2, or 3): ")) - 1
            col = int(input("\tEnter column (1, 2, or 3): ")) - 1
            # Accept the move only if it is on the board and the cell is empty
            on_board = 0 <= row < BOARD_ROWS and 0 <= col < BOARD_COLS
            if on_board and self.currentState.data[row, col] == 0:
                return (row, col, self.symbol)
            print("\tInvalid move, try again.")


def play_vs_ai():
    human = HumanPlayer()
    ai_player = Agent(exploreRate=0)
    # The human is passed first, so the human moves first with symbol 1
    judger = Judger(human, ai_player, False)
    ai_player.loadPolicy()
    winner = judger.play(False)
    if winner == human.symbol:
        print("Win!")
    elif winner == ai_player.symbol:
        print("Lose!")
    else:
        print("Tie!")
```
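
`HumanPlayer` can be handed to the `Judger` because it provides the same five methods the `Judger` calls on an `Agent`: `reset`, `setSymbol`, `feedState`, `feedReward`, and `takeAction`. The sketch below makes that implicit interface explicit with `typing.Protocol`. It is an illustration only: the `Player` protocol and `FirstFreeCellPlayer` are hypothetical names, and a `SimpleNamespace` with a `data` array stands in for a `TicTacToe` state.

```python
from types import SimpleNamespace
from typing import Protocol, runtime_checkable

import numpy as np

@runtime_checkable
class Player(Protocol):
    """Methods the Judger calls on each player (hypothetical protocol)."""
    def reset(self): ...
    def setSymbol(self, symbol): ...
    def feedState(self, state): ...
    def feedReward(self, reward): ...
    def takeAction(self): ...

class FirstFreeCellPlayer:
    """Hypothetical player that always takes the first empty cell in row-major order."""
    def __init__(self):
        self.symbol = None
        self.currentState = None

    def reset(self):
        self.currentState = None

    def setSymbol(self, symbol):
        self.symbol = symbol

    def feedState(self, state):
        self.currentState = state

    def feedReward(self, reward):
        pass  # this player does not learn

    def takeAction(self):
        free = np.argwhere(self.currentState.data == 0)
        row, col = free[0]
        return [int(row), int(col), self.symbol]

player = FirstFreeCellPlayer()
player.setSymbol(-1)
board = np.array([[1, -1, 0], [0, 1, 0], [0, 0, 0]])
player.feedState(SimpleNamespace(data=board))  # stand-in for a TicTacToe state
print(isinstance(player, Player), player.takeAction())
```

The `isinstance` check only verifies that the methods exist, not their signatures, so it is a light sanity check rather than a guarantee of compatibility.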

```python
play_vs_ai()
```

```
    1   2   3
  -------------
1 |   |   |   |  1
  -------------
2 |   | X |   |  2
  -------------
3 |   |   | O |  3
  -------------
    1   2   3

Your Turn:
```