# Nim AI Explanation: Reinforcement Learning

## Background

Nim can be classified as a combinatorial adversarial game. A combinatorial game is a game that satisfies the following constraints:
1. There is perfect information about the game: all players know everything about the state of the game and nothing is hidden. This property separates games like Nim, Tic-Tac-Toe and chess from games like Minesweeper, where the state of each square is hidden from the player.
2. The game is deterministic: all actions are determined by the players and there are no random events (e.g. dice rolling, drawing cards, flipping a coin).
3. Only two players are involved.

A finite combinatorial game is a game that will always end, i.e. it has a finite sequence of moves. On each turn of Nim, a player makes a move (x, y), where x is the pile number and y is the number of items removed from that pile. y must be at least 1, meaning that the player must remove at least one item. In addition, all of the removals in a single turn must come from the same pile.

Under the normal play convention, Nim ends when a player is unable to make a move, and that player loses the game; equivalently, the player who removes the last item wins. The misère play convention reverses the outcome: the player who removes the last item loses. Note that the `Nim` class used below assigns the win to the player who did not make the final removal, so the code in this project follows the misère convention.

The purpose of this project is to use reinforcement learning to discover a winning strategy for this finite combinatorial game, that is, an optimal way of playing it.

Nim is chosen for this project because it is one of the best-known examples of an impartial game: both players have exactly the same moves available, and the only differentiating factor is which player moves first. Because the winning strategies of Nim have been studied extensively, they provide a baseline for gaining understanding and mastery of the required skills before moving on to more complex problems.

## AI: Reinforcement Learning

The main focus of this project is to acquire the skills required to construct an artificial intelligence agent with the capabilities of using reinforcement learning to solve a problem in a dynamic environment.

Reinforcement learning trains an AI model to make decisions in a complex, dynamic, game-like environment. The environment is modelled as a game, and the AI explores it by trial and error, experimenting with different decisions. If a decision is deemed beneficial, the AI is given a "reward". This concept is closely analogous to operant conditioning in psychology, a method of learning that employs rewards and punishments for particular actions. If an action is advantageous, it is reinforced with rewards, causing the agent to form an association between the action and the reward. As a result, the agent becomes more likely to perform that action.

In reinforcement learning, the AI is given no strategies for playing the game; it only receives the rewards that result from its actions. By playing repeatedly, the AI learns which actions maximize its rewards, progressing from a complete amateur to, in some cases, superhuman skill. A well-known example is AlphaZero, which learned to play chess, shogi and Go at a superhuman level through self-play.

To achieve the goals of this project, Q-learning will be utilized. The workings of the Q-learning algorithm will be further expounded upon under the "Q-learning" section below.

### Setting up the configuration of the game

The following code has been acquired from Harvard University's Brian Yu and David Malan and deals with establishing the environment of the game in order for the AI agent to practice.

In [None]:
import math
import random
import time
import numpy as np


class Nim():

    def __init__(self, initial=[1, 3, 5, 7]):
        """
        Initialize game board.
        Each game board has
            - `piles`: a list of how many elements remain in each pile
            - `player`: 0 or 1 to indicate which player's turn
            - `winner`: None, 0, or 1 to indicate who the winner is
        """
        self.piles = initial.copy()
        self.player = 0
        self.winner = None

    @classmethod
    def available_actions(cls, piles):
        """
        Nim.available_actions(piles) takes a `piles` list as input
        and returns all of the available actions `(i, j)` in that state.

        Action `(i, j)` represents the action of removing `j` items
        from pile `i` (where piles are 0-indexed).
        """
        actions = set()
        for i, pile in enumerate(piles):
            for j in range(1, pile + 1):
                actions.add((i, j))
        return actions

    @classmethod
    def other_player(cls, player):
        """
        Nim.other_player(player) returns the player that is not
        `player`. Assumes `player` is either 0 or 1.
        """
        return 0 if player == 1 else 1

    def switch_player(self):
        """
        Switch the current player to the other player.
        """
        self.player = Nim.other_player(self.player)

    def move(self, action):
        """
        Make the move `action` for the current player.
        `action` must be a tuple `(i, j)`.
        """
        pile, count = action

        # Check for errors
        if self.winner is not None:
            raise Exception("Game already won")
        elif pile < 0 or pile >= len(self.piles):
            raise Exception("Invalid pile")
        elif count < 1 or count > self.piles[pile]:
            raise Exception("Invalid number of objects")

        # Update pile
        self.piles[pile] -= count
        self.switch_player()

        # Check for a winner
        if all(pile == 0 for pile in self.piles):
            self.winner = self.player
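
Note that the winner check in `move` runs after `switch_player()`, so the player who removes the last object is not the one recorded as the winner. The standalone sketch below mirrors that logic to make the outcome concrete. The helper `play_out` is hypothetical and not part of the original code.

```python
def play_out(piles, actions):
    """
    Apply a sequence of (pile, count) actions, alternating between
    players 0 and 1, and return the winner using the same rule as
    Nim.move. `play_out` is a hypothetical helper for illustration.
    """
    piles = list(piles)
    player = 0
    for pile, count in actions:
        if not 0 <= pile < len(piles):
            raise ValueError("Invalid pile")
        if not 1 <= count <= piles[pile]:
            raise ValueError("Invalid number of objects")
        piles[pile] -= count
        # The turn passes to the other player before the winner check
        player = 1 - player
        if all(p == 0 for p in piles):
            # The player who did NOT take the last object wins
            return player
    return None  # game not finished yet


# Player 0 empties pile 0, then player 1 takes the last object in pile 1
print(play_out([1, 1], [(0, 1), (1, 1)]))  # prints 0
```

Here player 1 makes the final removal and player 0 is returned as the winner.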

### Q-Learning

Q-learning is a model-free reinforcement learning algorithm: it does not need a model of the environment, and it can handle stochastic transitions and rewards without modification. For any finite Markov Decision Process (FMDP), Q-learning finds the actions that maximize the expected total reward over the current state and all successive states.

The "Q" in Q-learning stands for "quality". The quality of each action is quantified by its Q-value, whose formula is implemented in the `update_q_value` function.

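For any state $s$ and action $a$, the `update_q_value` function performs the following update, which is based on the Bellman equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \max_{a'} Q(s', a') - Q(s, a) \right)$$

where $\alpha$ is the learning rate, $r$ is the reward received for taking action $a$ in state $s$, and $\max_{a'} Q(s', a')$ is the best Q-value available in the resulting state $s'$ (the estimate of future rewards).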
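
As a worked example of this update, the sketch below applies it by hand to a small dictionary. The function name `q_update` and the sample state are illustrative assumptions; the rule itself follows `update_q_value` in the class below, which adds the best future value to the reward without a discount factor.

```python
def q_update(q, state, action, reward, best_future, alpha=0.5):
    """
    Apply one Q-learning update in place and return the new value.
    `q` maps (state, action) pairs to numbers; missing pairs count as 0.
    """
    key = (tuple(state), action)
    old_q = q.get(key, 0)
    q[key] = old_q + alpha * (reward + best_future - old_q)
    return q[key]


q = {}
# A move that ends the game with a reward of -1 leads to a terminal
# state, so there is no future value to add.
print(q_update(q, [0, 0, 0, 1], (3, 1), reward=-1, best_future=0))  # -0.5
# Repeating the same experience moves the estimate halfway again.
print(q_update(q, [0, 0, 0, 1], (3, 1), reward=-1, best_future=0))  # -0.75
```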

The `choose_action` function gives the AI two modes: explore and exploit.

During training the algorithm explores (`epsilon=True`): with probability `self.epsilon` it picks a random legal action, and otherwise it picks an action with the highest Q-value. Exploration lets the algorithm discover actions whose value it would otherwise never estimate. An exploring agent may collect less reward early on than a purely greedy one, but its Q-values tend to become more accurate, which leads to better play in the long run.

After 10000 rounds of training, in which the Q-value table is built up, the algorithm is run in exploit mode (`epsilon=False`), which causes it to always perform an action with the highest Q-value.

In [None]:
class NimAI():

    def __init__(self, alpha=0.5, epsilon=0.1):
        """
        Initialize AI with an empty Q-learning dictionary,
        an alpha (learning) rate, and an epsilon rate.

        The Q-learning dictionary maps `(state, action)`
        pairs to a Q-value (a number).
         - `state` is a tuple of remaining piles, e.g. (1, 1, 4, 4)
         - `action` is a tuple `(i, j)` for an action
        """
        self.q = dict()
        self.alpha = alpha
        self.epsilon = epsilon

    def update(self, old_state, action, new_state, reward):
        """
        Update Q-learning model, given an old state, an action taken
        in that state, a new resulting state, and the reward received
        from taking that action.
        """
        old = self.get_q_value(old_state, action)
        best_future = self.best_future_reward(new_state)
        self.update_q_value(old_state, action, old, reward, best_future)

    def get_q_value(self, state, action):
        """
        Return the Q-value for the state `state` and the action `action`.
        If no Q-value exists yet in `self.q`, return 0.
        """
        
        # `state` is a list of pile sizes, e.g. [1, 1, 3, 5] means that
        # pile 0 has 1 object, pile 1 has 1 object,
        # pile 2 has 3 objects and pile 3 has 5 objects.
        # `action` is a tuple (i, j): take j objects from pile i.
        # Lists are not hashable, so the state is converted to a tuple.
        return self.q.get((tuple(state), action), 0)

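    def update_q_value(self, state, action, old_q, reward, future_rewards):
        """
        Update the Q-value for the state `state` and the action `action`
        given the previous Q-value `old_q`, the current reward `reward`,
        and an estimate of future rewards `future_rewards`:

        Q(s, a) <- old_q + alpha * ((reward + future_rewards) - old_q)
        """
        self.q[tuple(state), action] = old_q + self.alpha * (reward + future_rewards - old_q)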

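    def best_future_reward(self, state):
        """
        Given a state `state`, consider all possible `(state, action)`
        pairs available in that state and return the maximum of all
        of their Q-values. Pairs with no Q-value in `self.q` count as 0.
        If there are no available actions in `state`, return 0.
        """
        possible_actions = Nim.available_actions(state)

        # Return 0 if no actions are available (terminal state)
        if len(possible_actions) == 0:
            return 0

        return max(self.get_q_value(state, action) for action in possible_actions)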

    def choose_action(self, state, epsilon=True):
        """
        Given a state `state`, return an action `(i, j)` to take.

        If `epsilon` is False, return a best-known action (ties are
        broken at random). If `epsilon` is True, return a random
        available action with probability `self.epsilon`, and a
        best-known action otherwise.
        """
        # Obtain all possible actions in the state
        avail = Nim.available_actions(state)

        if len(avail) == 0:
            print("No available actions in this state")
            return None

        # Compute the best Q-value once, then collect every action attaining it
        best_q = self.best_future_reward(state)
        optimal_actions = [a for a in avail if self.get_q_value(state, a) == best_q]
        print("list of optimal actions", optimal_actions)

        # Greedy choice, with ties broken at random
        optimal_action = random.choice(optimal_actions)
        if not epsilon:
            return optimal_action

        # Epsilon-greedy choice between the greedy action and a random one
        random_action = random.choice(list(avail))
        act = random.choices([optimal_action, random_action],
                             weights=[1 - self.epsilon, self.epsilon], k=1)
        return act[0]

### Training and Playing

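After training the AI for 10000 games, a human should find it very hard to beat the AI, although a win is not impossible. The starting position `[1, 3, 5, 7]` has a nim-sum of zero (1 XOR 3 XOR 5 XOR 7 = 0), which means that the player who moves second can force a win with perfect play under either play convention. A human who moves second and plays optimally can therefore still win.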

In [None]:
def train(n):
    """
    Train an AI by playing `n` games against itself.
    """

    player = NimAI()

    # Play n games
    for i in range(n):
        print(f"Playing training game {i + 1}")
        game = Nim()

        # Keep track of last move made by either player
        last = {
            0: {"state": None, "action": None},
            1: {"state": None, "action": None}
        }

        # Game loop
        while True:

            # Keep track of current state and action
            state = game.piles.copy()
            action = player.choose_action(game.piles)
            print("action chosen by AI...", action)

            # Keep track of last state and action
            last[game.player]["state"] = state
            last[game.player]["action"] = action

            # Make move
            game.move(action)
            new_state = game.piles.copy()

            # When game is over, update Q values with rewards
            if game.winner is not None:
                player.update(state, action, new_state, -1)
                player.update(
                    last[game.player]["state"],
                    last[game.player]["action"],
                    new_state,
                    1
                )
                break

            # If game is continuing, no rewards yet
            elif last[game.player]["state"] is not None:
                player.update(
                    last[game.player]["state"],
                    last[game.player]["action"],
                    new_state,
                    0
                )

    print("Done training")

    # Return the trained AI
    return player


def play(ai, human_player=None):
    """
    Play human game against the AI.
    `human_player` can be set to 0 or 1 to specify whether
    human player moves first or second.
    """

    # If no player order set, choose human's order randomly
    if human_player is None:
        human_player = random.randint(0, 1)

    # Create new game
    game = Nim()

    # Game loop
    while True:

        # Print contents of piles
        print()
        print("Piles:")
        for i, pile in enumerate(game.piles):
            print(f"Pile {i}: {pile}")
        print()

        # Compute available actions
        available_actions = Nim.available_actions(game.piles)
        print("available actions...", available_actions)
        time.sleep(1)

        # Let human make a move
        if game.player == human_player:
            print("Your Turn")
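            while True:
                try:
                    pile = int(input("Choose Pile: "))
                    count = int(input("Choose Count: "))
                except ValueError:
                    # Non-numeric input: ask again instead of crashing
                    print("Please enter whole numbers.")
                    continue
                if (pile, count) in available_actions:
                    break
                print("Invalid move, try again.")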

        # Have AI make a move
        else:
            print("AI's Turn")
            
            # choose_action returns an (i, j) tuple, which is
            # unpacked here into the pile index and the count
            
            pile, count = ai.choose_action(game.piles, epsilon=False)
            print(f"AI chose to take {count} from pile {pile}.")

        # Make move
        game.move((pile, count))

        # Check for winner
        if game.winner is not None:
            print()
            print("GAME OVER")
            winner = "Human" if game.winner == human_player else "AI"
            print(f"Winner is {winner}")
            return


### Influence of Hyperparameters

Alpha is the learning rate of the algorithm: the extent to which newly acquired information overrides past knowledge. A value of 0 causes the AI to learn nothing and rely purely on past knowledge, while a value of 1 causes it to consider only the most recent information. In a fully deterministic environment a learning rate of 1 is optimal in principle, but a smaller constant such as 0.1 is commonly used in practice. The `NimAI` class above defaults to `alpha=0.5`.

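Epsilon is the exploration rate: during training, it is the probability with which the algorithm tries a random legal action instead of a best-known one. With an epsilon of zero the algorithm is purely greedy and never explores. The `NimAI` class above defaults to `epsilon=0.1`. The degree to which the algorithm considers future states is governed by a separate parameter, the discount factor (usually written gamma). With a gamma of zero the algorithm is "short-sighted" and considers only the immediate reward. This implementation does not expose a discount factor and adds the best future Q-value to the reward undiscounted, which corresponds to a gamma of 1.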
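
To see what the learning rate does in isolation, the following sketch repeatedly applies the update toward a fixed target and records the estimate after each step. The helper `estimates_after` and the fixed target of 1 are assumptions made for illustration; during real training the target changes as the Q-table evolves.

```python
def estimates_after(steps, alpha, target=1.0, start=0.0):
    """
    Return the sequence of estimates produced by applying
    q <- q + alpha * (target - q) `steps` times, starting from `start`.
    """
    q = start
    history = []
    for _ in range(steps):
        q = q + alpha * (target - q)
        history.append(q)
    return history


for alpha in (0.0, 0.1, 0.5, 1.0):
    rounded = [round(v, 3) for v in estimates_after(4, alpha)]
    print(f"alpha={alpha}: {rounded}")
```

With an alpha of 0 the estimate never moves, and with an alpha of 1 it jumps to the target in a single step. Intermediate values approach the target gradually.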

## Conclusion & Project Extensions

This concludes my initial attempt at constructing a reinforcement learning algorithm. A more in-depth analysis of the Bellman equation and Markov Decision Processes would allow this algorithm, or other similar reinforcement learning algorithms, to be transferred to novel situations.

In addition, it would be interesting to explore how a hand-coded optimal strategy for Nim performs relative to this reinforcement learning algorithm, which is given no knowledge of the best strategy for the game.
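
As a starting point for that comparison, here is a minimal sketch of a hand-coded player based on the nim-sum (the bitwise XOR of the pile sizes). It assumes the winner rule implemented by the `Nim` class above, under which the player who takes the last object loses. The function names `nim_sum` and `optimal_move` are hypothetical and not part of the original code.

```python
from functools import reduce
from operator import xor


def nim_sum(piles):
    """Return the bitwise XOR of all pile sizes."""
    return reduce(xor, piles, 0)


def optimal_move(piles):
    """
    Return an action (i, j) that leaves the opponent in a losing
    position when taking the last object loses, or None if no such
    action exists (or the piles are already empty).
    """
    big = [i for i, p in enumerate(piles) if p > 1]
    ones = sum(1 for p in piles if p == 1)

    if not big:
        # Only piles of size 1 remain: win by leaving an odd number of them
        if ones > 0 and ones % 2 == 0:
            return (piles.index(1), 1)
        return None

    if len(big) == 1:
        # Shrink the single large pile to 1 or 0 so that an odd
        # number of size-1 piles is left for the opponent
        i = big[0]
        leave = 1 if ones % 2 == 0 else 0
        return (i, piles[i] - leave)

    # Two or more large piles: move to a position with nim-sum zero
    total = nim_sum(piles)
    if total == 0:
        return None
    for i, p in enumerate(piles):
        target = p ^ total
        if target < p:
            return (i, p - target)


print(optimal_move([1, 3, 5, 7]))  # None: the nim-sum is 0, so the mover is losing
print(optimal_move([1, 3, 5, 6]))  # (0, 1)
```

A trained `NimAI` could then be compared against this player by checking, state by state, whether its greedy action also leaves the opponent in a losing position.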