# Decision Theory Project - TicTacToe
*By Jelle Huibregtse and Aron Hemmes*

Below is a TicTacToe environment built from scratch, with an agent that chooses its moves based on reward.

## 1. Setup
- Loading in some libraries
- Configuring layout

In [1]:
# Libraries
import random
from enum import Enum
from typing import List
from IPython.display import display
from ipywidgets import Layout, Button, HTML, Box


# Layout
field_layout = Layout(width="50px", height="50px")
wide_layout = Layout(width="158px")
column_layout = Layout(flex_flow="column")

# Formatting
%load_ext nb_black


In [2]:
%%HTML
<style>
.widget-button {
    outline: none !important;
}

.widget-html-content {
    white-space: pre-wrap;
    line-height: normal !important;
}
</style>


## 2. Definition of the Environment

The code below defines our tic-tac-toe environment, which has the following characteristics:

Environment state:

- the player plays as either X or O
- the opposing player (the agent) plays the other mark
- X and O take turns placing their mark on an empty field until one of them has won or there are no empty fields left on the board
- the board starts out empty and can contain X and O marks

The following methods are defined in the code below:

- `reset()` resets the board to an empty state
- `update()` refreshes the visualisation of the current TicTacToe game
- `render()` builds and displays the visualisation of the current TicTacToe game
- `set_field()` sets a field to a type
- `change_player()` switches the player between X and O and resets the board
- `field_click()` sets the clicked field to the type whose turn it is, provided the field is still `None`
- `get_result()` returns the type that has three marks in a row, or `None` if there is no winner
- `get_turns()` returns the number of turns that have passed
- `get_turn()` returns which type's turn it currently is
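
The win check behind `get_result()` and the turn logic behind `get_turn()` can also be written with a table of the eight winning lines. The following is a minimal standalone sketch, not the notebook's implementation: it uses the strings `"X"` and `"O"` instead of the `Type` enum, and the names `WIN_LINES`, `winner` and `turn` are hypothetical.

```python
from typing import List, Optional

# The eight winning lines on a 3x3 board indexed 0..8, row by row.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board: List[Optional[str]]) -> Optional[str]:
    """Return "X" or "O" if that mark fills a winning line, else None."""
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None


def turn(board: List[Optional[str]]) -> str:
    """X moves first, so X is to move whenever an even number of fields is filled."""
    filled = sum(field is not None for field in board)
    return "X" if filled % 2 == 0 else "O"


board = ["X", "X", "X",
         "O", "O", None,
         None, None, None]
print(winner(board))     # X
print(turn([None] * 9))  # X
```

A lookup table like this makes it easy to see that all eight lines are covered, at the cost of checking every line on each call.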

In [None]:
class Type(Enum):
    X = 1
    O = 2


class TicTacToeEnvironment:
    def __init__(self, agent=None):
        self.board = [None for _ in range(9)]
        self.player = Type.X
        self.result = None
        self.data = []
        self.agent = agent
        self.__possible_states = [i for i in range(9)]

        # Execute agent code
        if agent is not None and self.player != Type.X:
            self.agent(self)

    def get_turns(self) -> int:
        return len([field for field in self.board if field is not None])

    def get_turn(self) -> Type:
        return Type.X if self.get_turns() % 2 == 0 else Type.O

    def get_result(self) -> Type:
        # Check for three of the same marks in a row
        board = self.board
        result = None
        # Lines through the centre: middle column, middle row and both diagonals
        if board[4] is not None and (
            board[1] == board[4] == board[7]
            or board[3] == board[4] == board[5]
            or board[0] == board[4] == board[8]
            or board[2] == board[4] == board[6]
        ):
            result = board[4]
        # Lines through the top-left corner: top row and left column
        elif board[0] is not None and (
            board[0] == board[1] == board[2] or board[0] == board[3] == board[6]
        ):
            result = board[0]
        # Lines through the bottom-right corner: right column and bottom row
        elif board[8] is not None and (
            board[8] == board[5] == board[2] or board[8] == board[7] == board[6]
        ):
            result = board[8]

        return result

    def field_click(self, e) -> None:
        if self.result is None and self.board[int(e.tooltip)] is None:
            if self.agent is None or self.get_turn() == self.player:
                self.set_field(int(e.tooltip), self.get_turn())

    def reset(self, e) -> None:
        self.board = [None for _ in range(9)]
        self.data = []
        self.result = None

        # Execute opposing agent code
        if self.agent is not None and self.player != Type.X:
            self.agent(self)

        self.update()

    def change_player(self, e) -> None:
        self.player = Type.O if self.player == Type.X else Type.X

        self.reset("")

    def set_field(self, field, type: Type, reward=1.0) -> None:
        # Set field to type
        self.board[field] = type

        # Add move information to data
        if type != self.player and self.agent is not None:
            self.data.append(
                "type: {}　　　field: {}　　　reward: {}".format(type.name, field, reward)
            )

        # Set result to current result
        self.result = self.get_result()

        # Execute opposing agent code
        if (
            self.agent is not None
            and self.get_turn() != self.player
            and self.result is None
        ):
            self.agent(self)

        # Update the board
        self.update()

    def update(self) -> None:
        if hasattr(self, "field_buttons"):
            for i in range(len(self.field_buttons)):
                self.field_buttons[i].description = (
                    self.board[i].name if self.board[i] is not None else " "
                )
            if self.agent is not None:
                self.player_select_button.description = "PLAYER " + self.player.name
            self.result_text.value = (
                "winner is <b>{}</b>".format(self.result.name)
                if self.result is not None
                else ""
            )
            self.data_box.value = "\n".join(self.data)

    def render(self) -> None:
        if not hasattr(self, "field_buttons"):
            self.field_buttons = []
            elements = []

            # Add header
            elements.append(HTML(value="<h1>TicTacToe</h1>"))

            # Add field buttons
            buttons = []
            rows = []
            for i in range(9):
                btn = Button(tooltip=str(i), layout=field_layout)
                btn.on_click(self.field_click)
                buttons.append(btn)
                self.field_buttons.append(btn)
                if (i + 1) % 3 == 0:
                    rows.append(Box(buttons))
                    buttons = []
            elements.append(Box(children=rows, layout=column_layout))

            # Add player select
            if self.agent is not None:
                player_btn = Button(
                    description="PLAYER " + self.player.name, layout=wide_layout
                )
                player_btn.on_click(self.change_player)
                self.player_select_button = player_btn

                elements.append(player_btn)

            # Add reset button
            reset_btn = Button(description="RESET", layout=wide_layout)
            reset_btn.on_click(self.reset)
            elements.append(reset_btn)

            # Add result text
            result_text = HTML()
            elements.append(result_text)
            self.result_text = result_text

            # Display elements and data
            all_elements_box = Box(children=elements, layout=column_layout)
            self.data_box = HTML()
            display(Box([all_elements_box, self.data_box]))

    def get_possible_states(self) -> List[int]:
        return self.__possible_states

## 3. No Agent (player vs player)

In [9]:
tictactoe = TicTacToeEnvironment(None)
tictactoe.render()

Box(children=(Box(children=(HTML(value='<h1>TicTacToe</h1>'), Box(children=(Box(children=(Button(layout=Layout…


## 4. Random Agent

The agent places X or O marks on random empty fields.

In [10]:
# Calculate random move for agent
def random_move(tictactoe: TicTacToeEnvironment) -> None:
    # Getting all the empty fields
    empty_fields = [
        f for f in range(len(tictactoe.board)) if tictactoe.board[f] is None
    ]

    # Choose random empty field
    if len(empty_fields) > 0:
        agent_type = Type.O if tictactoe.player == Type.X else Type.X
        tictactoe.set_field(random.choice(empty_fields), agent_type)


# Initializing environment and rendering
tictactoe = TicTacToeEnvironment(random_move)
tictactoe.render()

Box(children=(Box(children=(HTML(value='<h1>TicTacToe</h1>'), Box(children=(Box(children=(Button(layout=Layout…


## 5. Reward-Based Agent

Next comes the code for the agent that makes its decisions based on reward. Since we are working with a **Markov Decision Process** (MDP), we have to define a reward function $R(s,a)$ that returns a value for performing action $a$ in state $s$, both of which have been defined previously. Usually, we would also define a transition function $T(s,a,s')$ that gives the probability of moving from a state $s$ to a state $s'$ when performing action $a$. This is only needed in an environment where the outcome of an action is uncertain. Since tic-tac-toe is a simple game in which every action has a defined outcome, we do not need a transition function.

We will define the reward function $R(s,a)$ as follows:
- Trivially, if the agent performs an action $\hat{a}$ that wins the game from $s$, then $R(s,\hat{a}) = 1$.
- If the agent makes a mistake and picks an action $\tilde{a}$ that loses the game, then $R(s,\tilde{a}) = -1$.
- Otherwise, when nothing happens, $R(s,a)=0$.
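
As a rough illustration of the three cases above, a reward function could look like the sketch below. It assumes the outcome of the game is already known as a string (`"X"`, `"O"` or `None`) and that a draw counts as "nothing happens"; the name `reward` and its parameters are hypothetical and not part of the notebook's code.

```python
from typing import Optional


def reward(winner: Optional[str], agent_mark: str) -> float:
    """Return +1 if the agent has won, -1 if the opponent has won, else 0."""
    if winner is None:
        # No winner: the game is still running or ended in a draw (assumption)
        return 0.0
    return 1.0 if winner == agent_mark else -1.0


print(reward("O", agent_mark="O"))   # 1.0
print(reward("X", agent_mark="O"))   # -1.0
print(reward(None, agent_mark="O"))  # 0.0
```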

We will be using Q-learning to find an optimal policy that the agent uses to decide which action to pick. We denote our policy as the action $a$ that maximises a function $Q(s,a)$ when the agent is in some state $s$:

$$a^{best} = \arg\max_{a\in A}Q(s, a)$$
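
In code, this argmax can be sketched as a lookup in a table of Q-values. The example below is only an illustration under assumptions: it stores $Q$ in a dictionary keyed by `(state, action)`, treats unseen pairs as 0.0, and uses a hypothetical tuple encoding for the board; none of these names appear in the notebook.

```python
from typing import Dict, List, Tuple

State = Tuple[str, ...]  # hypothetical board encoding, one string per field
QTable = Dict[Tuple[State, int], float]


def best_action(q: QTable, state: State, actions: List[int]) -> int:
    """Return the action with the highest Q-value; unseen pairs count as 0.0."""
    return max(actions, key=lambda a: q.get((state, a), 0.0))


empty_board: State = ("",) * 9
q: QTable = {(empty_board, 4): 0.5, (empty_board, 0): 0.2}
print(best_action(q, empty_board, [0, 4, 8]))  # 4
```

Note that `max` returns the first of several equally good actions, so ties are not broken randomly in this sketch.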

For every state, each action has an associated value of $Q$, and we want to pick the action with the highest value. To compute $Q(s,a)$, the agent has to go over every possible pair of state and action while getting feedback from the reward function. We will update $Q(s,a)$ iteratively by letting the agent play, as follows:

$$Q^{new}(s,a) \leftarrow (1 - \alpha)\cdot Q(s,a)+\alpha\cdot\left(R(s,a)+\gamma\cdot\max_{\hat{a}\in A}Q(\hat{s}, \hat{a})\right)$$

- We perform an action $a$ in the current state $s$.
- $\max_{\hat{a}\in A}Q(\hat{s}, \hat{a})$ takes future states into account and returns the largest $Q$ value. Here $\hat{s}$ is the new state reached after performing $a$, and $\hat{a}$ is the best action in that state.
- $\alpha$ is the learning rate, which decides to what extent we overwrite the old value. We will use $\alpha=0.1$.
- The discount factor $\gamma$ decides how much future rewards are weighted compared to present rewards at the current time step $t$. We will be using $\gamma=0.9$.
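
The update rule can be sketched as a single function. This is a minimal illustration rather than the notebook's implementation: the dictionary-based Q-table, the function name `q_update`, and the string state labels are assumptions, while $\alpha=0.1$ and $\gamma=0.9$ come from the text above.

```python
from typing import Dict, Hashable, Iterable, Tuple

ALPHA = 0.1  # learning rate from the text
GAMMA = 0.9  # discount factor from the text

QTable = Dict[Tuple[Hashable, int], float]


def q_update(
    q: QTable,
    state: Hashable,
    action: int,
    reward: float,
    next_state: Hashable,
    next_actions: Iterable[int],
    alpha: float = ALPHA,
    gamma: float = GAMMA,
) -> float:
    """Apply one Q-learning update to q[(state, action)] and return the new value."""
    # Largest Q-value reachable from the next state; 0.0 if no actions are left
    best_next = max((q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    old = q.get((state, action), 0.0)
    q[(state, action)] = (1 - alpha) * old + alpha * (reward + gamma * best_next)
    return q[(state, action)]


# Worked example: Q(s, a) = 0.5, R(s, a) = 0, best next value = 1.0
q = {("s", 4): 0.5, ("s_next", 0): 1.0, ("s_next", 8): 0.2}
new_value = q_update(q, "s", 4, reward=0.0, next_state="s_next", next_actions=[0, 8])
print(round(new_value, 2))  # 0.54
```

In the worked example the new value is $0.9 \cdot 0.5 + 0.1 \cdot (0 + 0.9 \cdot 1.0) = 0.54$: most of the old estimate is kept, and a small step is taken towards the discounted value of the best follow-up action.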

We will find (read: learn) the best values for $Q(s,a)$ by letting two agents play against each other. To keep the learning balanced and to make sure the agents also explore new options, we introduce a probability $\epsilon$ with which an agent picks a random action instead of the best known one.
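
This exploration rule is commonly called epsilon-greedy selection and can be sketched as follows. The dictionary-based Q-table and the function name `epsilon_greedy` are assumptions for illustration, and the text does not fix a value for $\epsilon$, so the value used in the example is arbitrary.

```python
import random
from typing import Dict, Hashable, List, Tuple

QTable = Dict[Tuple[Hashable, int], float]


def epsilon_greedy(q: QTable, state: Hashable, actions: List[int], epsilon: float) -> int:
    """With probability epsilon pick a random action, otherwise the best known one."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    # Exploit: unseen (state, action) pairs count as 0.0
    return max(actions, key=lambda a: q.get((state, a), 0.0))


q: QTable = {("s", 4): 0.5, ("s", 0): 0.2}
action = epsilon_greedy(q, "s", [0, 4, 8], epsilon=0.2)  # epsilon chosen arbitrarily
print(action in [0, 4, 8])  # True
```

With `epsilon=0.0` the function always exploits and with `epsilon=1.0` it always explores; values in between trade off trying new moves against using what has already been learned.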

In [13]:
# Calculate the best move for agent based on reward
def reward_based_move(tictactoe: TicTacToeEnvironment) -> None:
    # Getting all the empty fields
    empty_fields = [
        [f, 0.0] for f in range(len(tictactoe.board)) if tictactoe.board[f] is None
    ]

    if len(empty_fields) > 0:
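        # Calculate reward for empty fields
        # (not implemented yet: every field keeps its initial reward of 0.0,
        # so the choice below is uniformly random among the empty fields)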

        # Choose field with highest reward
        highest_reward = max([x[1] for x in empty_fields])
        highest_field = random.choice(
            [field for field in empty_fields if field[1] == highest_reward]
        )
        agent_type = Type.O if tictactoe.player == Type.X else Type.X
        tictactoe.set_field(highest_field[0], agent_type, highest_field[1])


# Initializing environment and rendering
tictactoe = TicTacToeEnvironment(reward_based_move)
tictactoe.render()

Box(children=(Box(children=(HTML(value='<h1>TicTacToe</h1>'), Box(children=(Box(children=(Button(layout=Layout…


## Sources
- For reinforcement learning (Q-learning): https://towardsdatascience.com/how-to-play-tic-tac-toe-using-reinforcement-learning-9604130e56f6