# Temporal Difference Reinforcement Learning with Nim
Nim is a strategy game in which players take turns removing stones from one of several piles. <br>
This notebook uses the misère variant: the player who takes the last stone loses, so the goal is to force your opponent to take it.
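
To make the rules concrete, here is a minimal standalone sketch of misère Nim. It is not the `Nim` class from `NimEnv`; the names `legal_moves`, `apply_move`, and `play_misere` are hypothetical, and it assumes a move takes any positive number of stones from a single pile, with no per-move cap.

```python
from typing import Callable, List, Sequence, Tuple

Piles = Tuple[int, ...]
Move = Tuple[int, int]  # (pile index, stones to remove)


def legal_moves(piles: Piles) -> List[Move]:
    """Every way to take at least one stone from a single pile."""
    return [(i, k) for i, n in enumerate(piles) for k in range(1, n + 1)]


def apply_move(piles: Piles, move: Move) -> Piles:
    """Return the piles after a move, rejecting illegal ones."""
    i, k = move
    if not 0 <= i < len(piles) or not 1 <= k <= piles[i]:
        raise ValueError(f"illegal move {move} for piles {piles}")
    return piles[:i] + (piles[i] - k,) + piles[i + 1:]


def play_misere(piles: Piles, players: Sequence[Callable[[Piles], Move]]) -> int:
    """Play to the end and return the index (0 or 1) of the winner."""
    if sum(piles) == 0:
        raise ValueError("the game needs at least one stone")
    turn = 0
    while sum(piles) > 0:
        mover = turn % 2
        piles = apply_move(piles, players[mover](piles))
        turn += 1
    # `mover` took the last stone, so under misère rules the other player wins.
    return 1 - mover


def first_move(piles: Piles) -> Move:
    """A deliberately naive player: always take the first legal move."""
    return legal_moves(piles)[0]


print(play_misere((1, 1), [first_move, first_move]))  # prints 0
```

In the last line, player 1 is forced to take the final stone, so player 0 wins.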

#### Import agents and environments

In [13]:
from NimEnv import *
from BotPlayerEnv import *
from Agents import *
from TDAgent import * 

#### Set up environment, agents, and train

In [15]:
nim1 = Nim(piles=3, stones=9, limit=5)
bot1 = RandomPlayer('Bot')
bot_env1 = BotPlayerEnv(game_env=nim1, agent=bot1)
td1 = TDAgent(bot_env1, gamma=1, random_state=1)

td1.q_learning(episodes=10000, epsilon=0.1, alpha=0.1, track_history=False)
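
The internals of `TDAgent.q_learning` are not shown in this notebook. As a rough guide to what `epsilon`, `alpha`, and `gamma` usually control, the following is a minimal sketch of textbook tabular Q-learning with epsilon-greedy action selection. The function names and the dictionary-based Q-table are assumptions made for illustration, not the `TDAgent` implementation.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def q_update(Q, state, action, reward, next_state, next_actions, alpha, gamma):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    # A terminal next state has no actions, so its value is taken as 0.
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]


Q = defaultdict(float)  # unseen (state, action) pairs start at 0
rng = random.Random(1)  # seeded for reproducibility

# A winning terminal transition with reward 1, using alpha=0.1 and gamma=1.
q_update(Q, "s", "a", 1.0, "end", [], alpha=0.1, gamma=1.0)
```

Here `alpha` sets how far each update moves the estimate toward its target, and `epsilon` sets how often the agent explores instead of exploiting its current estimates.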

#### Define players.

In [17]:
p1 = PolicyPlayer('Policy Player', policy=td1.policy)
random1 = RandomPlayer('Random Player')
minimax2 = MinimaxPlayer('Minimax Player 2', depth=2)
minimax3 = MinimaxPlayer('Minimax Player 3', depth=3)
minimax4 = MinimaxPlayer('Minimax Player 4', depth=4)

### Tournaments between the Temporal Difference agent and other agents
The agent was trained with Q-learning in a single basic run against a random player, for demonstration.

# TDAgent vs. RandomPlayer

In [19]:
t1 = tournament(nim1, [p1, random1], rounds=1000, switch_players=True, random_state=1)
t1

100%|██████████| 1000/1000 [00:00<00:00, 6094.74it/s]

Policy Player vs. Random Player
-------------------------------
Ties:                    0
Policy Player Wins:      920
Random Player Wins:      80
Policy Player took:      0.04 seconds
Random Player took:      0.09 seconds
Average number of turns: 11.4





# Versus Minimax(2)

In [36]:
t2 = tournament(nim1, [p1, minimax2], rounds=1000, switch_players=True, random_state=1)
t2

100%|██████████| 1000/1000 [00:01<00:00, 607.55it/s]

Policy Player vs. Minimax Player 2
----------------------------------
Ties:                    0
Policy Player Wins:      726
Minimax Player 2 Wins:   274
Policy Player took:      0.02 seconds
Minimax Player 2 took:   1.57 seconds
Average number of turns: 11.2





# Versus Minimax(3)

In [38]:
t3 = tournament(nim1, [p1, minimax3], rounds=1000, switch_players=True, random_state=1)
t3

100%|██████████| 1000/1000 [00:16<00:00, 61.91it/s]

Policy Player vs. Minimax Player 3
----------------------------------
Ties:                    0
Policy Player Wins:      402
Minimax Player 3 Wins:   598
Policy Player took:      0.03 seconds
Minimax Player 3 took:   16.01 seconds
Average number of turns: 12.2





# Versus Minimax(4)

In [39]:
t4 = tournament(nim1, [p1, minimax4], rounds=1000, switch_players=True, random_state=1)
t4

100%|██████████| 1000/1000 [02:47<00:00,  5.98it/s]

Policy Player vs. Minimax Player 4
----------------------------------
Ties:                    0
Policy Player Wins:      188
Minimax Player 4 Wins:   812
Policy Player took:      0.03 seconds
Minimax Player 4 took:   166.49 seconds
Average number of turns: 11.6





With basic training, the Temporal Difference agent won:

* 92% of games played against the RandomPlayer agent.
* 72.6% of games played against the MinimaxPlayer agent with depth 2.
* 40.2% of games played against the MinimaxPlayer agent with depth 3.
* 18.8% of games played against the MinimaxPlayer agent with depth 4. 

## Improved Exploration, Transition, and Fine-Tuning

In [63]:
nim2 = Nim(piles=3, stones=9, limit=5)
bot2 = RandomPlayer('Bot')
bot_env2 = BotPlayerEnv(game_env=nim2, agent=bot2)
td2 = TDAgent(bot_env2, gamma=1, random_state=1)

# Exploration: epsilon=1.0 and alpha=1, with exploring starts enabled
td2.q_learning(episodes=10000, epsilon=1.0, alpha=1, exploring_starts=True, track_history=False)

# Transition: lower epsilon and alpha step by step
td2.q_learning(episodes=10000, epsilon=0.8, alpha=0.4, exploring_starts=True, track_history=False)
td2.q_learning(episodes=10000, epsilon=0.6, alpha=0.3, track_history=False)
td2.q_learning(episodes=10000, epsilon=0.4, alpha=0.2, track_history=False)
td2.q_learning(episodes=10000, epsilon=0.2, alpha=0.2, track_history=False)

# Fine-tuning: small epsilon and alpha to refine the learned values
td2.q_learning(episodes=10000, epsilon=0.1, alpha=0.05, track_history=False)
td2.q_learning(episodes=10000, epsilon=0.01, alpha=0.01, track_history=False)
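
The seven calls above amount to a staged schedule of 70,000 episodes. One way to keep such a schedule easy to read and adjust is to store the phases as data and loop over them. The sketch below assumes only that the training function accepts the keyword arguments used above; `PHASES` and `run_schedule` are hypothetical names.

```python
# One dict per training phase; keys mirror the keyword arguments used above.
# `exploring_starts` is only set where the original calls set it.
PHASES = [
    {"epsilon": 1.0, "alpha": 1.0, "exploring_starts": True},   # exploration
    {"epsilon": 0.8, "alpha": 0.4, "exploring_starts": True},   # transition
    {"epsilon": 0.6, "alpha": 0.3},
    {"epsilon": 0.4, "alpha": 0.2},
    {"epsilon": 0.2, "alpha": 0.2},
    {"epsilon": 0.1, "alpha": 0.05},                            # fine-tuning
    {"epsilon": 0.01, "alpha": 0.01},
]


def run_schedule(train, phases=PHASES, episodes=10000):
    """Call `train` once per phase and return the total episode count."""
    total = 0
    for phase in phases:
        train(episodes=episodes, track_history=False, **phase)
        total += episodes
    return total


# With the agent from this notebook, this would be called as:
# run_schedule(td2.q_learning)
```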


Re-declare the players, this time using the policy learned by `td2`.

In [66]:
p2 = PolicyPlayer('Policy Player', policy=td2.policy)
random1 = RandomPlayer('Random Player')
minimax2 = MinimaxPlayer('Minimax Player 2', depth=2)
minimax3 = MinimaxPlayer('Minimax Player 3', depth=3)
minimax4 = MinimaxPlayer('Minimax Player 4', depth=4)

# TDAgent vs. RandomPlayer

In [68]:
t5 = tournament(nim2, [p2, random1], rounds=1000, switch_players=True, random_state=1)
t5

100%|██████████| 1000/1000 [00:00<00:00, 7321.35it/s]

Policy Player vs. Random Player
-------------------------------
Ties:                    0
Policy Player Wins:      999
Random Player Wins:      1
Policy Player took:      0.02 seconds
Random Player took:      0.07 seconds
Average number of turns: 10.9





# Versus Minimax(2)

In [70]:
t6 = tournament(nim2, [p2, minimax2], rounds=1000, switch_players=True, random_state=1)
t6

100%|██████████| 1000/1000 [00:01<00:00, 654.56it/s]


Policy Player vs. Minimax Player 2
----------------------------------
Ties:                    0
Policy Player Wins:      994
Minimax Player 2 Wins:   6
Policy Player took:      0.02 seconds
Minimax Player 2 took:   1.46 seconds
Average number of turns: 10.9


# Versus Minimax(3)

In [72]:
t7 = tournament(nim2, [p2, minimax3], rounds=1000, switch_players=True, random_state=1)
t7

100%|██████████| 1000/1000 [00:15<00:00, 64.00it/s]

Policy Player vs. Minimax Player 3
----------------------------------
Ties:                    0
Policy Player Wins:      988
Minimax Player 3 Wins:   12
Policy Player took:      0.03 seconds
Minimax Player 3 took:   15.46 seconds
Average number of turns: 11.7





# Versus Minimax(4)

In [73]:
t8 = tournament(nim2, [p2, minimax4], rounds=1000, switch_players=True, random_state=1)
t8

100%|██████████| 1000/1000 [02:41<00:00,  6.20it/s]

Policy Player vs. Minimax Player 4
----------------------------------
Ties:                    0
Policy Player Wins:      973
Minimax Player 4 Wins:   27
Policy Player took:      0.04 seconds
Minimax Player 4 took:   160.74 seconds
Average number of turns: 11.7





After the staged exploration, transition, and fine-tuning schedule, the Temporal Difference agent won:

* 99.9% of games played against the RandomPlayer agent.
* 99.4% of games played against the MinimaxPlayer agent with depth 2.
* 98.8% of games played against the MinimaxPlayer agent with depth 3.
* 97.3% of games played against the MinimaxPlayer agent with depth 4. 
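
The two summaries can be compared directly from the win counts reported in the tournament output. This short sketch recomputes the win rates over 1000 rounds and the gain in percentage points for each opponent. The dictionary labels are shorthand chosen here, not identifiers from the library.

```python
ROUNDS = 1000

# Policy Player wins copied from the tournament output above.
basic = {"Random": 920, "Minimax(2)": 726, "Minimax(3)": 402, "Minimax(4)": 188}
staged = {"Random": 999, "Minimax(2)": 994, "Minimax(3)": 988, "Minimax(4)": 973}


def win_rate(wins, rounds=ROUNDS):
    """Win rate as a percentage."""
    return 100.0 * wins / rounds


def extra_wins(before, after):
    """Additional wins per opponent after retraining."""
    return {name: after[name] - before[name] for name in before}


for name, delta in extra_wins(basic, staged).items():
    print(
        f"{name}: {win_rate(basic[name]):.1f}% -> {win_rate(staged[name]):.1f}% "
        f"(+{win_rate(delta):.1f} points)"
    )
```

The gain is largest against the depth-4 minimax player, where the win count rises from 188 to 973.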