# Nim AI Explanation: Reinforcement Learning

## Background

Nim is a two-player combinatorial game played with several piles of objects. Players take turns removing at least one object, and all of the objects removed in a single turn must come from the same pile.

Nim is one of the most famous examples of an impartial game: a game in which the moves available from a position depend only on the position itself, not on whose turn it is. It is also completely solved, in the sense that the exact winning strategy is known for every starting configuration. Depending on that configuration, either the first or the second player can force a win.

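In the version implemented here, the player who removes the last object loses (the misère convention), so the winner is the player who does not make the last move.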
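To make "completely solved" concrete, here is a minimal sketch of the classical nim-sum rule, which decides whether the player about to move can force a win. The helper names `nim_sum` and `mover_can_force_win` are hypothetical and are not part of this project's code. The `misere` flag covers both conventions: `misere=True` means taking the last object loses, and `misere=False` means taking the last object wins.

```python
from functools import reduce
from operator import xor


def nim_sum(piles):
    """Bitwise XOR of all pile sizes (the nim-sum)."""
    return reduce(xor, piles, 0)


def mover_can_force_win(piles, misere=True):
    """
    Return True if the player about to move can force a win.

    misere=True:  taking the last object loses.
    misere=False: taking the last object wins (normal play).
    """
    if misere and all(p <= 1 for p in piles):
        # Only piles of size 0 or 1 remain: the mover wins exactly when
        # an even number of single-object piles is left.
        return sum(piles) % 2 == 0
    # In every other case the mover wins exactly when the nim-sum is non-zero.
    return nim_sum(piles) != 0


print(nim_sum([1, 3, 5, 7]))              # 0
print(mover_can_force_win([1, 3, 5, 7]))  # False
print(mover_can_force_win([1, 1]))        # True
```

For the starting piles `[1, 3, 5, 7]` the nim-sum is 0, so under either convention the player who moves second can force a win with perfect play.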

In this project, I use Q-learning, a type of reinforcement learning, to train an AI over 10,000 games of self-play so that it can then play against a human player.

## AI

### Setting up the configuration of the game

In [None]:
import math
import random
import time
import numpy as np


class Nim():

    def __init__(self, initial=(1, 3, 5, 7)):
        """
        Initialize game board.
        Each game board has
            - `piles`: a list of how many elements remain in each pile
            - `player`: 0 or 1 to indicate which player's turn
            - `winner`: None, 0, or 1 to indicate who the winner is
        """
        # Copy into a new list so that neither the default nor the
        # caller's sequence is ever mutated by moves.
        self.piles = list(initial)
        self.player = 0
        self.winner = None

    @classmethod
    def available_actions(cls, piles):
        """
        Nim.available_actions(piles) takes a `piles` list as input
        and returns all of the available actions `(i, j)` in that state.

        Action `(i, j)` represents the action of removing `j` items
        from pile `i` (where piles are 0-indexed).
        """
        actions = set()
        for i, pile in enumerate(piles):
            for j in range(1, pile + 1):
                actions.add((i, j))
        return actions

    @classmethod
    def other_player(cls, player):
        """
        Nim.other_player(player) returns the player that is not
        `player`. Assumes `player` is either 0 or 1.
        """
        return 0 if player == 1 else 1

    def switch_player(self):
        """
        Switch the current player to the other player.
        """
        self.player = Nim.other_player(self.player)

    def move(self, action):
        """
        Make the move `action` for the current player.
        `action` must be a tuple `(i, j)`.
        """
        pile, count = action

        # Check for errors
        if self.winner is not None:
            raise Exception("Game already won")
        elif pile < 0 or pile >= len(self.piles):
            raise Exception("Invalid pile")
        elif count < 1 or count > self.piles[pile]:
            raise Exception("Invalid number of objects")

        # Update pile
        self.piles[pile] -= count
        self.switch_player()

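        # Check for a winner: the player who just took the last object
        # loses, so the winner is the player whose turn it would be now.
        if all(pile == 0 for pile in self.piles):
            self.winner = self.player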

### Reinforcement Learning

The main focus of this project is the reinforcement learning algorithm. Reinforcement learning works by rewarding the algorithm when its actions lead to a good outcome and "punishing" it when they lead to a bad one. In general, a reinforcement learning algorithm perceives and interprets its environment and learns good moves through trial and error.

The reinforcement learning algorithm used here is Q-learning. Q-learning keeps a q-value for every (state, action) pair, which is an estimate of the total reward expected from taking that action in that state, and it trains the algorithm to choose the game moves with the highest q-values. This concept is influenced by operant conditioning in psychology.
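As a rough illustration of a single learning step, the sketch below applies the Q-learning update to one q-value three times, with the same reward each time. The helper name `q_update` is hypothetical. The learning rate `alpha=0.5` mirrors the default used by `NimAI`, and, as in this project, no discount factor is applied to the future reward.

```python
def q_update(old_q, reward, best_future, alpha=0.5):
    """One Q-learning step: move old_q part of the way toward reward + best_future."""
    new_estimate = reward + best_future
    return old_q + alpha * (new_estimate - old_q)


q = 0.0
for _ in range(3):
    # The move led to a win (reward 1) and the game ended (no future reward).
    q = q_update(q, reward=1, best_future=0)
    print(q)  # prints 0.5, then 0.75, then 0.875
```

Each update closes half of the remaining gap between the stored estimate and the observed outcome, which is why the value approaches 1 without ever jumping straight to it.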

Expounding upon the algorithm:

- The function get_q_value looks up a q-value in a dictionary, self.q, and returns 0 if the (state, action) pair has not been seen yet.

- The function update_q_value updates the q-value based on the q-learning formula:
Q(s, a) <- old value estimate
                   + alpha * (new value estimate - old value estimate)
where the new value estimate is the reward received plus the best future reward in the resulting state.

- The function best_future_reward searches through all available actions in a given state and returns the highest q-value among them (0 if there are no available actions).

- The function choose_action searches through all available actions:
1. If there are no available actions, it returns None.
2. Otherwise, it collects the optimal actions, i.e. those whose q-value equals the best future reward for the current state.
3. If the algorithm is greedy (epsilon = False), it returns one of the optimal actions, chosen at random if several are tied.
4. If the algorithm is not greedy (explorative), it returns an optimal action with probability 1 - self.epsilon and a random available action with probability self.epsilon.
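The action-selection steps above can be sketched in isolation as follows. This is a simplified, standalone version under stated assumptions, not the `NimAI` method itself: the function name `epsilon_greedy` and the small hand-written q-table are hypothetical, and the q-table is a plain dictionary keyed by `(state, action)` in which missing pairs count as 0.

```python
import random


def epsilon_greedy(q, state, actions, epsilon=0.1):
    """
    Pick an action for `state`.

    With probability `epsilon`, pick uniformly from all actions (explore);
    otherwise pick one of the highest-valued actions (exploit).
    Return None if there are no actions.
    """
    actions = list(actions)
    if not actions:
        return None
    best = max(q.get((state, a), 0) for a in actions)
    optimal = [a for a in actions if q.get((state, a), 0) == best]
    if random.random() < epsilon:
        return random.choice(actions)
    return random.choice(optimal)


# Hypothetical q-values for piles (1, 2): taking both objects from pile 1
# has been rewarded, taking the single object from pile 0 has been punished.
q = {((1, 2), (1, 2)): 1.0, ((1, 2), (0, 1)): -1.0}
actions = [(0, 1), (1, 1), (1, 2)]
print(epsilon_greedy(q, (1, 2), actions, epsilon=0.0))  # (1, 2)
```

With `epsilon=0.0` the choice is purely greedy, which corresponds to calling `choose_action` with `epsilon=False`. A non-zero `epsilon` lets training occasionally try moves that currently look worse, so their q-values can still be learned.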

In [None]:
class NimAI():

    def __init__(self, alpha=0.5, epsilon=0.1):
        """
        Initialize AI with an empty Q-learning dictionary,
        an alpha (learning) rate, and an epsilon rate.

        The Q-learning dictionary maps `(state, action)`
        pairs to a Q-value (a number).
         - `state` is a tuple of remaining piles, e.g. (1, 1, 4, 4)
         - `action` is a tuple `(i, j)` for an action
        """
        self.q = dict()
        self.alpha = alpha
        self.epsilon = epsilon

    def update(self, old_state, action, new_state, reward):
        """
        Update Q-learning model, given an old state, an action taken
        in that state, a new resulting state, and the reward received
        from taking that action.
        """
        old = self.get_q_value(old_state, action)
        best_future = self.best_future_reward(new_state)
        self.update_q_value(old_state, action, old, reward, best_future)

    def get_q_value(self, state, action):
        """
        Return the Q-value for the state `state` and the action `action`.
        If no Q-value exists yet in `self.q`, return 0.

        `state` is a list or tuple of pile sizes, e.g. [1, 1, 3, 5] means
        pile 0 has 1 object, pile 1 has 1 object, pile 2 has 3 objects
        and pile 3 has 5 objects. `action` is a tuple `(i, j)`: take `j`
        objects from pile `i`.
        """
        # Lists are not hashable, so the state is stored as a tuple.
        return self.q.get((tuple(state), action), 0)

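    def update_q_value(self, state, action, old_q, reward, future_rewards):
        """
        Update the Q-value for `(state, action)`:
            Q(s, a) <- old_q + alpha * ((reward + future_rewards) - old_q)
        """
        new_estimate = reward + future_rewards
        self.q[tuple(state), action] = old_q + self.alpha * (new_estimate - old_q)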

    def best_future_reward(self, state):
        """
        Return the highest Q-value among all actions available in
        `state`, treating unseen `(state, action)` pairs as 0.
        Return 0 if no actions are available.
        """
        possible_actions = Nim.available_actions(state)

        # Return 0 if no available actions
        if not possible_actions:
            return 0

        return max(self.get_q_value(state, action) for action in possible_actions)

    def choose_action(self, state, epsilon=True):
        """
        Given a state `state`, return an action `(i, j)` to take.

        If `epsilon` is False, return an action with the highest
        Q-value (ties broken at random). If `epsilon` is True, return a
        random available action with probability `self.epsilon`, and
        otherwise an action with the highest Q-value.
        Return None if no actions are available.
        """
        # Obtain all possible actions in the state
        avail = list(Nim.available_actions(state))
        if not avail:
            return None

        # Compute the best Q-value once, then collect every action that attains it
        best_q = self.best_future_reward(state)
        optimal_actions = [a for a in avail if self.get_q_value(state, a) == best_q]

        # Explore with probability self.epsilon (only when exploration is enabled)
        if epsilon and random.random() < self.epsilon:
            return random.choice(avail)

        # Otherwise exploit: pick one of the best-known actions
        return random.choice(optimal_actions)

### Training and Playing

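After training for 10,000 games, the AI should be very hard for a human to beat. It cannot win every game, though: from the default starting piles [1, 3, 5, 7], the player who moves second can force a win with perfect play, so a human who moves second and plays perfectly will still beat the AI.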

In [None]:
def train(n):
    """
    Train an AI by playing `n` games against itself.
    """

    player = NimAI()

    # Play n games
    for i in range(n):
        print(f"Playing training game {i + 1}")
        game = Nim()

        # Keep track of last move made by either player
        last = {
            0: {"state": None, "action": None},
            1: {"state": None, "action": None}
        }

        # Game loop
        while True:

            # Keep track of current state and action
            state = game.piles.copy()
            action = player.choose_action(game.piles)
            print("action chosen by AI...", action)

            # Keep track of last state and action
            last[game.player]["state"] = state
            last[game.player]["action"] = action

            # Make move
            game.move(action)
            new_state = game.piles.copy()

            # When game is over, update Q values with rewards
            if game.winner is not None:
                player.update(state, action, new_state, -1)
                player.update(
                    last[game.player]["state"],
                    last[game.player]["action"],
                    new_state,
                    1
                )
                break

            # If game is continuing, no rewards yet
            elif last[game.player]["state"] is not None:
                player.update(
                    last[game.player]["state"],
                    last[game.player]["action"],
                    new_state,
                    0
                )

    print("Done training")

    # Return the trained AI
    return player


def play(ai, human_player=None):
    """
    Play human game against the AI.
    `human_player` can be set to 0 or 1 to specify whether
    human player moves first or second.
    """

    # If no player order set, choose human's order randomly
    if human_player is None:
        human_player = random.randint(0, 1)

    # Create new game
    game = Nim()

    # Game loop
    while True:

        # Print contents of piles
        print()
        print("Piles:")
        for i, pile in enumerate(game.piles):
            print(f"Pile {i}: {pile}")
        print()

        # Compute available actions
        available_actions = Nim.available_actions(game.piles)
        print("available actions...", available_actions)
        time.sleep(1)

        # Let human make a move
        if game.player == human_player:
            print("Your Turn")
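            while True:
                try:
                    pile = int(input("Choose Pile: "))
                    count = int(input("Choose Count: "))
                except ValueError:
                    # Non-numeric input: ask again instead of crashing
                    print("Please enter whole numbers.")
                    continue
                if (pile, count) in available_actions:
                    break
                print("Invalid move, try again.")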

        # Have AI make a move
        else:
            print("AI's Turn")
            
            # choose_action returns an `(i, j)` tuple; epsilon=False
            # makes the AI play greedily, without exploration
            
            pile, count = ai.choose_action(game.piles, epsilon=False)
            print(f"AI chose to take {count} from pile {pile}.")

        # Make move
        game.move((pile, count))

        # Check for winner
        if game.winner is not None:
            print()
            print("GAME OVER")
            winner = "Human" if game.winner == human_player else "AI"
            print(f"Winner is {winner}")
            return


## Conclusion

This is my first attempt at writing a reinforcement learning algorithm. With further development, the same approach could be applied to other games, such as Knights and Chess.