# Week 2: Lab Assignment - Q-Learning for Nim

---

## The Game of Nim

Nim is a mathematical strategy game:
- There are one or more **piles** of objects
- Two players take turns
- On your turn, you must remove **at least one** object from **exactly one** pile
- You can remove as many objects as you want from that pile
- **The player who takes the last object LOSES** (this is the *misère* variant of Nim)

### Why Nim?
- Perfect for tabular Q-learning (discrete, finite states)
- Trains in seconds
- Has a mathematical optimal strategy (but it's not obvious!)
- **Challenge:** Can you beat the trained agent?

---

## Learning Objectives
By the end of this lab, you will:
- Understand how Q-learning applies to turn-based games
- Implement the Q-learning update rule
- Train an agent and play against it interactively
- Experience the exploration-exploitation tradeoff firsthand

In [None]:
import numpy as np
import random
from collections import defaultdict
import matplotlib.pyplot as plt

print("Libraries loaded!")

---
## Part 1: Understanding the Nim Environment

The environment is provided for you. Let's explore how it works.

In [None]:
class NimGame:
    """
    The game of Nim.
    
    State: tuple of pile sizes, e.g., (3, 4, 5) means 3 piles with 3, 4, and 5 objects
    Action: tuple (pile_index, num_to_remove), e.g., (1, 2) means remove 2 from pile 1
    """
    
    def __init__(self, initial_piles):
        """
        Args:
            initial_piles: List or tuple of initial pile sizes, e.g., [3, 4, 5]
        """
        self.initial_piles = tuple(initial_piles)
        self.reset()
    
    def reset(self):
        """Start a new game."""
        self.piles = list(self.initial_piles)
        self.current_player = 1  # Player 1 starts
        self.done = False
        self.winner = None
        return self.get_state()
    
    def get_state(self):
        """Return current state as a tuple (hashable)."""
        return tuple(self.piles)
    
    def get_valid_actions(self):
        """
        Return list of valid actions.
        Each action is (pile_index, num_to_remove).
        """
        actions = []
        for pile_idx, pile_size in enumerate(self.piles):
            for num_remove in range(1, pile_size + 1):
                actions.append((pile_idx, num_remove))
        return actions
    
    def step(self, action):
        """
        Take an action.
        
        Args:
            action: (pile_index, num_to_remove)
            
        Returns:
            new_state, reward, done
        """
        pile_idx, num_remove = action
        
        # Validate action
        if pile_idx < 0 or pile_idx >= len(self.piles):
            raise ValueError(f"Invalid pile index: {pile_idx}")
        if num_remove < 1 or num_remove > self.piles[pile_idx]:
            raise ValueError(f"Invalid remove amount: {num_remove} from pile with {self.piles[pile_idx]}")
        
        # Make the move
        self.piles[pile_idx] -= num_remove
        
        # Check if game is over (all piles empty)
        if sum(self.piles) == 0:
            self.done = True
            # Current player took the last object, so they LOSE
            self.winner = 3 - self.current_player  # Other player wins
            reward = -1  # Taking last object is bad!
        else:
            reward = 0
        
        # Switch player
        self.current_player = 3 - self.current_player
        
        return self.get_state(), reward, self.done
    
    def render(self):
        """Display the current game state."""
        print("\nPiles:")
        for i, size in enumerate(self.piles):
            objects = '|' * size if size > 0 else '(empty)'
            print(f"  Pile {i}: {objects} ({size})")
        print(f"Player {self.current_player}'s turn")


print("NimGame class loaded!")

### Task 1.1: Explore the Environment

Run the cells below to understand how the game works.

In [None]:
# Create a simple game with one pile of 5 objects
game = NimGame([5])
game.render()

print("\nValid actions:", game.get_valid_actions())
print("State:", game.get_state())

In [None]:
# Let's play a few moves
print("Player 1 removes 2 from pile 0:")
state, reward, done = game.step((0, 2))  # Remove 2 from pile 0
game.render()
print(f"Reward: {reward}, Done: {done}")

print("\nPlayer 2 removes 2 from pile 0:")
state, reward, done = game.step((0, 2))  # Remove 2 from pile 0
game.render()
print(f"Reward: {reward}, Done: {done}")

print("\nPlayer 1 removes 1 from pile 0 (last object!):")
state, reward, done = game.step((0, 1))  # Remove the last one
game.render()
print(f"Reward: {reward}, Done: {done}, Winner: Player {game.winner}")

### Task 1.2: Try a Two-Pile Game

**Your turn:** Change the pile configuration below to `[3, 4]` (two piles) and run the cell to see how the valid actions change.

In [None]:
# TODO: Change [5] to [3, 4] to create a two-pile game
game = NimGame([5])
game.render()

print("\nValid actions:")
for action in game.get_valid_actions():
    pile_idx, num_remove = action
    print(f"  Remove {num_remove} from pile {pile_idx}")

---
## Part 2: Random Agent

First, let's create a random agent to serve as a baseline and training partner.

In [None]:
class RandomAgent:
    """An agent that picks random valid actions."""
    
    def choose_action(self, state, valid_actions):
        return random.choice(valid_actions)


# Test random agent
random_agent = RandomAgent()
game = NimGame([3, 4])

print("Random agent playing:")
state = game.reset()
while not game.done:
    game.render()
    action = random_agent.choose_action(state, game.get_valid_actions())
    print(f"  -> Removes {action[1]} from pile {action[0]}")
    state, reward, done = game.step(action)

game.render()
print(f"\nWinner: Player {game.winner}")

---
## Part 3: Q-Learning Agent

Now let's build the Q-Learning agent. Most of the code is provided - you just need to fill in the key parts.

In [None]:
class QLearningAgent:
    """
    Q-Learning agent for Nim.
    
    The Q-table maps (state, action) pairs to expected values.
    """
    
    def __init__(self, epsilon=0.1, alpha=0.5, gamma=0.9):
        """
        Args:
            epsilon: Exploration rate (probability of random action)
            alpha: Learning rate (how fast to update Q-values)
            gamma: Discount factor (importance of future rewards)
        """
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        
        # Q-table: defaultdict returns 0.0 for unseen (state, action) pairs
        self.Q = defaultdict(float)
        
        # History of moves in current game (for learning)
        self.history = []
    
    def choose_action(self, state, valid_actions):
        """
        Choose an action using epsilon-greedy strategy.
        
        With probability epsilon: choose randomly (explore)
        Otherwise: choose the action with highest Q-value (exploit)
        """
        # Exploration: random action
        if random.random() < self.epsilon:
            return random.choice(valid_actions)
        
        # Exploitation: best action based on Q-values
        q_values = [self.Q[(state, action)] for action in valid_actions]
        max_q = max(q_values)
        
        # If multiple actions have the same Q-value, pick randomly among them
        best_actions = [a for a, q in zip(valid_actions, q_values) if q == max_q]
        return random.choice(best_actions)
    
    def store_transition(self, state, action):
        """Remember this move for learning later."""
        self.history.append((state, action))
    
    def learn(self, final_reward):
        """
        Update Q-values based on the game outcome.
        
        We work backwards through the history, propagating the reward.
        
        Args:
            final_reward: 1 if the agent won, -1 if it lost (Nim has no draws)
        """
        reward = final_reward
        
        # Process moves in reverse order (from game end to start)
        for state, action in reversed(self.history):
            # Current Q-value
            current_q = self.Q[(state, action)]
            
            # ============================================================
            # TASK 3.1: Implement the Q-learning update
            # ============================================================
            # Update formula: Q(s,a) = Q(s,a) + alpha * (target - Q(s,a))
            # Where target = reward, the discounted final outcome (we're working
            # backwards from the end of the game, so there is no next-state term)
            #
            # Hint: This is similar to the bandit update from Week 1!
            # 
            # Replace the line below with your implementation:
            self.Q[(state, action)] = current_q  # TODO: Update this line
            # ============================================================
            
            # Discount the reward for earlier moves
            reward = reward * self.gamma
        
        # Clear history for next game
        self.history = []


print("QLearningAgent class loaded!")
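
As a rough illustration of the backward pass in `learn`, the sketch below computes the target each stored move would see when the final reward is multiplied by `gamma` once per step back from the end of the game. The helper name `discounted_targets` is hypothetical and is not part of the lab code.

```python
def discounted_targets(n_moves, final_reward, gamma):
    """Return the per-move targets in game order (first move first)."""
    targets = []
    reward = final_reward
    # Walk backwards from the last move, discounting once per step
    for _ in range(n_moves):
        targets.append(reward)
        reward *= gamma
    targets.reverse()  # Put the earliest move first
    return targets


targets = discounted_targets(n_moves=3, final_reward=1.0, gamma=0.9)
print([round(t, 2) for t in targets])  # [0.81, 0.9, 1.0]
```

The agent's last move receives the full reward, and earlier moves receive progressively smaller targets.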

### Task 3.1: Complete the Q-Learning Update

In the cell above, find the TODO and replace:
```python
self.Q[(state, action)] = current_q  # TODO: Update this line
```

With the Q-learning update formula:
```python
self.Q[(state, action)] = current_q + self.alpha * (reward - current_q)
```

This moves the Q-value toward the observed reward, at a rate controlled by `alpha`.
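
To see what "moves toward the observed reward" means numerically, here is a minimal standalone sketch that applies the same update to a single value several times. The function name `incremental_update` is an assumption made for illustration; in the lab the update lives inside `QLearningAgent.learn`.

```python
def incremental_update(current_q, target, alpha):
    """Move current_q a fraction alpha of the way toward target."""
    return current_q + alpha * (target - current_q)


q = 0.0
trace = []
for _ in range(4):
    q = incremental_update(q, target=1.0, alpha=0.5)
    trace.append(q)

print(trace)  # [0.5, 0.75, 0.875, 0.9375]
```

With `alpha=0.5`, each update closes half of the remaining gap to the target. A smaller `alpha` would close it more slowly but would be less sensitive to a single unlucky game.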

---
## Part 4: Training Loop

The training loop plays many games between our agent and an opponent.

In [None]:
def train_agent(agent, opponent, game, n_games=1000, agent_player=1):
    """
    Train the Q-learning agent by playing games.
    
    Args:
        agent: The Q-learning agent to train
        opponent: The opponent agent (e.g., RandomAgent)
        game: The Nim game environment
        n_games: Number of games to play
        agent_player: Which player the agent is (1 or 2)
    
    Returns:
        wins, losses: Number of wins and losses
    """
    wins = 0
    losses = 0
    
    for _ in range(n_games):
        state = game.reset()
        
        while not game.done:
            valid_actions = game.get_valid_actions()
            
            if game.current_player == agent_player:
                # Agent's turn
                action = agent.choose_action(state, valid_actions)
                agent.store_transition(state, action)
            else:
                # Opponent's turn
                action = opponent.choose_action(state, valid_actions)
            
            state, reward, done = game.step(action)
        
        # Game over - learn from result
        if game.winner == agent_player:
            agent.learn(1.0)  # Win!
            wins += 1
        else:
            agent.learn(-1.0)  # Loss
            losses += 1
    
    return wins, losses


print("Training function loaded!")

### Task 4.1: Train on a Simple Game (1 Pile)

Let's start with the simplest case: a single pile of 5 objects.

In [None]:
# Create agent and game
agent = QLearningAgent(epsilon=0.3, alpha=0.5, gamma=0.9)
opponent = RandomAgent()
game = NimGame([5])  # Single pile of 5

# Train!
print("Training on 1 pile of 5 objects...")
wins, losses = train_agent(agent, opponent, game, n_games=5000, agent_player=1)

print(f"\nTraining complete!")
print(f"Wins: {wins} ({wins / (wins + losses):.1%})")
print(f"Losses: {losses} ({losses / (wins + losses):.1%})")
print(f"Q-table entries: {len(agent.Q)}")

### Task 4.2: Inspect What the Agent Learned

Let's see the Q-values for the starting position.

In [None]:
def show_q_values(agent, state, valid_actions):
    """Display Q-values for all valid actions in a state."""
    print(f"\nState: {state}")
    print("Q-values:")
    
    q_values = [(action, agent.Q[(state, action)]) for action in valid_actions]
    q_values.sort(key=lambda x: x[1], reverse=True)  # Sort by Q-value
    
    # Compute the best Q-value once instead of on every loop iteration
    best_q = max((q for _, q in q_values), default=None)

    for action, q in q_values:
        pile_idx, num_remove = action
        marker = " <-- BEST" if q == best_q else ""
        print(f"  Remove {num_remove} from pile {pile_idx}: Q = {q:+.3f}{marker}")


# Show Q-values for starting state
game = NimGame([5])
state = game.get_state()
show_q_values(agent, state, game.get_valid_actions())

### Question: What's the Best Opening Move?

Look at the Q-values above. Which move does the agent prefer and why?

**Hint:** In Nim with one pile, the winning strategy is to leave your opponent with 1 object (so they must take it and lose).
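
The hint can be checked by brute force, independently of the trained agent. The sketch below is a small exhaustive search for the single-pile game under the "last object loses" rule; the names `mover_can_win` and `winning_moves` are hypothetical helpers and are not part of the lab code.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def mover_can_win(n):
    """True if the player about to move can force a win with n objects left."""
    if n == 0:
        # The previous player took the last object, so they lost
        return True
    # A move is good if it leaves the opponent in a losing position
    return any(not mover_can_win(n - k) for k in range(1, n + 1))


def winning_moves(n):
    """List the removal amounts that leave the opponent in a losing position."""
    return [k for k in range(1, n + 1) if not mover_can_win(n - k)]


print(mover_can_win(1))   # False: the only move is to take the last object
print(winning_moves(5))   # [4]: leave exactly one object
```

Comparing `winning_moves(5)` with the action marked `<-- BEST` in the Q-values is one way to judge whether training has converged.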

---
## Part 5: Play Against the Agent!

Now let's see if you can beat the trained agent.

In [None]:
def play_against_agent(agent, initial_piles, human_player=1):
    """
    Play interactively against the trained agent.
    
    Args:
        agent: The trained Q-learning agent
        initial_piles: Starting pile configuration
        human_player: 1 to go first, 2 to go second
    """
    game = NimGame(initial_piles)
    agent.epsilon = 0  # No exploration during play (epsilon stays 0 afterwards; reset it before training this agent further)
    
    print("\n" + "="*50)
    print("NIM vs Q-Learning Agent")
    print("="*50)
    print(f"You are Player {human_player} ({'first' if human_player == 1 else 'second'})")
    print("Remember: Taking the LAST object means you LOSE!")
    print("\nHow to play:")
    print("  Enter: pile_number, amount_to_remove")
    print("  Example: 0, 3 (removes 3 from pile 0)")
    
    state = game.reset()
    
    while not game.done:
        game.render()
        valid_actions = game.get_valid_actions()
        
        if game.current_player == human_player:
            # Human's turn
            while True:
                try:
                    user_input = input("Your move (pile, amount): ")
                    pile_idx, num_remove = map(int, user_input.replace(' ', '').split(','))
                    action = (pile_idx, num_remove)
                    if action in valid_actions:
                        break
                    print(f"Invalid move! Valid actions: {valid_actions}")
                except (ValueError, IndexError):
                    print("Please enter: pile_number, amount (e.g., 0, 2)")
        else:
            # Agent's turn
            action = agent.choose_action(state, valid_actions)
            print(f"Agent removes {action[1]} from pile {action[0]}")
        
        state, _, _ = game.step(action)
    
    # Game over
    game.render()
    if game.winner == human_player:
        print("\n🎉 Congratulations! You won!")
    else:
        print("\n🤖 The agent wins!")


print("Play function loaded!")
print("\nTo play, run: play_against_agent(agent, [5], human_player=1)")

In [None]:
# Uncomment the line below to play against the agent!
# play_against_agent(agent, [5], human_player=1)

### Task 5.1: Can You Beat the Agent?

Try playing as both Player 1 (first) and Player 2 (second). 

- When you go first with pile [5], what's the winning strategy?
- When you go second, can you ever win against a perfect opponent?

---
## Part 6: Increase the Difficulty!

Now let's train on more complex configurations.

### Task 6.1: Two Piles

**Your turn:** Train a new agent on two piles `[3, 4]`

In [None]:
# TODO: Create a new agent and train on [3, 4]
# Hint: Copy and modify the code from Task 4.1

agent_2pile = QLearningAgent(epsilon=0.3, alpha=0.5, gamma=0.9)
opponent = RandomAgent()
game_2pile = NimGame([3, 4])  # Two piles

print("Training on 2 piles [3, 4]...")
wins, losses = train_agent(agent_2pile, opponent, game_2pile, n_games=10000, agent_player=1)

print(f"\nTraining complete!")
print(f"Wins: {wins} ({wins / (wins + losses):.1%})")
print(f"Losses: {losses} ({losses / (wins + losses):.1%})")
print(f"Q-table entries: {len(agent_2pile.Q)}")

In [None]:
# Try to beat it!
# play_against_agent(agent_2pile, [3, 4], human_player=1)

### Task 6.2: Three Piles

Train on `[2, 3, 4]` - this is where it gets interesting!

In [None]:
# Train on 3 piles
agent_3pile = QLearningAgent(epsilon=0.3, alpha=0.5, gamma=0.9)
opponent = RandomAgent()
game_3pile = NimGame([2, 3, 4])

print("Training on 3 piles [2, 3, 4]...")
wins, losses = train_agent(agent_3pile, opponent, game_3pile, n_games=20000, agent_player=1)

print(f"\nTraining complete!")
print(f"Wins: {wins} ({wins / (wins + losses):.1%})")
print(f"Losses: {losses} ({losses / (wins + losses):.1%})")
print(f"Q-table entries: {len(agent_3pile.Q)}")

In [None]:
# Can you beat it now?
# play_against_agent(agent_3pile, [2, 3, 4], human_player=1)

### Task 6.3: Four Piles (The Real Challenge!)

Train on `[1, 3, 5, 7]` - a classic Nim configuration.

In [None]:
# Train on 4 piles
agent_4pile = QLearningAgent(epsilon=0.3, alpha=0.5, gamma=0.9)
opponent = RandomAgent()
game_4pile = NimGame([1, 3, 5, 7])

print("Training on 4 piles [1, 3, 5, 7]...")
wins, losses = train_agent(agent_4pile, opponent, game_4pile, n_games=50000, agent_player=1)

print(f"\nTraining complete!")
print(f"Wins: {wins} ({wins / (wins + losses):.1%})")
print(f"Losses: {losses} ({losses / (wins + losses):.1%})")
print(f"Q-table entries: {len(agent_4pile.Q)}")

In [None]:
# The ultimate challenge - can you beat a trained agent on 4 piles?
# play_against_agent(agent_4pile, [1, 3, 5, 7], human_player=2)

---
## Part 7: Evaluation

Let's evaluate how well our agents perform against random opponents.

In [None]:
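def evaluate_agent(agent, game, n_games=1000):
    """
    Evaluate agent win rate against a random opponent (no exploration).

    Plays n_games // 2 games as Player 1 and n_games // 2 as Player 2.

    Returns:
        (win_rate_as_p1, win_rate_as_p2): Win fractions in [0, 1]
    """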
    original_epsilon = agent.epsilon
    agent.epsilon = 0
    
    wins_as_p1 = 0
    wins_as_p2 = 0
    opponent = RandomAgent()
    
    # Test as Player 1
    for _ in range(n_games // 2):
        state = game.reset()
        while not game.done:
            valid_actions = game.get_valid_actions()
            if game.current_player == 1:
                action = agent.choose_action(state, valid_actions)
            else:
                action = opponent.choose_action(state, valid_actions)
            state, _, _ = game.step(action)
        if game.winner == 1:
            wins_as_p1 += 1
    
    # Test as Player 2
    for _ in range(n_games // 2):
        state = game.reset()
        while not game.done:
            valid_actions = game.get_valid_actions()
            if game.current_player == 2:
                action = agent.choose_action(state, valid_actions)
            else:
                action = opponent.choose_action(state, valid_actions)
            state, _, _ = game.step(action)
        if game.winner == 2:
            wins_as_p2 += 1
    
    agent.epsilon = original_epsilon
    return wins_as_p1 / (n_games // 2), wins_as_p2 / (n_games // 2)


# Evaluate all agents
print("Agent Performance vs Random Opponent:\n")

configs = [
    ("1 pile [5]", agent, NimGame([5])),
    ("2 piles [3,4]", agent_2pile, NimGame([3, 4])),
    ("3 piles [2,3,4]", agent_3pile, NimGame([2, 3, 4])),
    ("4 piles [1,3,5,7]", agent_4pile, NimGame([1, 3, 5, 7])),
]

for name, ag, gm in configs:
    win_p1, win_p2 = evaluate_agent(ag, gm, n_games=1000)
    print(f"{name:20s} | As P1: {win_p1*100:5.1f}% | As P2: {win_p2*100:5.1f}%")

---
## Part 8: Analysis Questions

Answer these questions based on your experiments:

### Question 1:
Look at your evaluation results. Why does the agent perform better as one player than as the other, and why does the stronger position differ between configurations?

**Your Answer:**

### Question 2:
For the 4-pile game [1,3,5,7], what happens if both players play perfectly? 

(Hint: This is a famous configuration. The XOR of all pile sizes = 1^3^5^7 = 0)
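
The XOR in the hint (often called the nim-sum) can be computed for any configuration. The following is a minimal sketch; the helper name `nim_sum` is an assumption for illustration, and interpreting the result is left to you.

```python
from functools import reduce
from operator import xor


def nim_sum(piles):
    """Bitwise XOR of all pile sizes (0 for an empty list)."""
    return reduce(xor, piles, 0)


for piles in ([5], [3, 4], [2, 3, 4], [1, 3, 5, 7]):
    print(piles, "->", nim_sum(piles))
```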

**Your Answer:**

### Question 3:
Why does the Q-table size grow significantly with more piles?

**Your Answer:**

---
## (Optional) Bonus Challenges

If you finish early, try these more advanced tasks:

### Bonus 1: Self-Play Training

Train an agent by playing against itself instead of a random opponent. Does it learn better strategies?

In [None]:
# BONUS: Implement self-play training
# Hint: The agent plays both sides, keeping separate histories
# for each player, then learns from both perspectives

def train_self_play(agent, game, n_games=10000):
    """
    Train agent by playing against itself.
    
    TODO: Implement this function
    """
    pass  # Your implementation here


# Test your implementation
# self_play_agent = QLearningAgent(epsilon=0.3, alpha=0.5, gamma=0.9)
# train_self_play(self_play_agent, NimGame([3, 4]), n_games=20000)
# evaluate_agent(self_play_agent, NimGame([3, 4]))

### Bonus 2: Learning Curve Visualization

Track and plot the agent's win rate during training.

In [None]:
# BONUS: Track learning progress over time
# Hint: Evaluate every N training games and store win rates

def train_and_track(game, n_games=50000, eval_every=2000):
    """
    Train agent and track win rate over time.
    
    Returns:
        game_numbers: List of training game counts
        win_rates: List of win rates at each checkpoint
    """
    # TODO: Implement this function
    pass


# Plot the learning curve
# game_nums, win_rates = train_and_track(NimGame([2, 3, 4]))
# plt.plot(game_nums, win_rates)
# plt.xlabel('Training Games')
# plt.ylabel('Win Rate')
# plt.title('Learning Curve')
# plt.show()

### Bonus 3: Explore Hyperparameters

How do different values of epsilon, alpha, and gamma affect learning?
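
One possible way to organize the comparison is to enumerate every combination of the values suggested in this bonus before running any training. The sketch below only builds the grid; the variable names are illustrative, and the training and evaluation steps are left for you to add.

```python
from itertools import product

epsilons = [0.1, 0.2, 0.3, 0.5]
alphas = [0.1, 0.3, 0.5, 0.9]
gammas = [0.5, 0.9, 0.99]

# One dict per combination, keyed by hyperparameter name
settings = [
    {"epsilon": e, "alpha": a, "gamma": g}
    for e, a, g in product(epsilons, alphas, gammas)
]

print(len(settings))  # 48
print(settings[0])    # {'epsilon': 0.1, 'alpha': 0.1, 'gamma': 0.5}
```

Because the keys match the constructor's parameter names, each entry can be passed as `QLearningAgent(**setting)`. Calling `random.seed(...)` with the same value before each run should make the comparisons more repeatable.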

In [None]:
# BONUS: Compare different hyperparameter settings
# Try varying:
#   - epsilon: [0.1, 0.2, 0.3, 0.5]
#   - alpha: [0.1, 0.3, 0.5, 0.9]
#   - gamma: [0.5, 0.9, 0.99]

# Your experiments here

---
## Summary

In this lab, you:

1. ✅ Learned how Q-learning applies to turn-based games
2. ✅ Implemented the Q-learning update rule
3. ✅ Trained agents on progressively harder Nim configurations
4. ✅ Played interactively against your trained agents
5. ✅ Analyzed how complexity affects learning

**Key Takeaways:**
- Q-learning can discover winning strategies through trial and error
- The number of states is the product of (pile size + 1) over all piles, so it grows rapidly as piles are added
- More training is needed for larger problems
- Self-play can lead to stronger strategies than playing vs random
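
The state-space takeaway above can be checked by counting. The sketch below assumes each pile can hold any size from 0 up to its starting size, and that a state offers one action per object in each pile, as in `get_valid_actions`. The function names are hypothetical. The counts are upper bounds on the Q-table size, since an agent only stores entries for positions it actually encounters.

```python
from itertools import product
from math import prod


def count_states(initial_piles):
    """Number of distinct pile tuples reachable in principle."""
    return prod(p + 1 for p in initial_piles)


def count_state_action_pairs(initial_piles):
    """Total valid (state, action) pairs: each state has sum(state) actions."""
    ranges = [range(p + 1) for p in initial_piles]
    return sum(sum(state) for state in product(*ranges))


for piles in ([5], [3, 4], [2, 3, 4], [1, 3, 5, 7]):
    print(piles, count_states(piles), count_state_action_pairs(piles))
```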

**Next Week:** Deep Q-Networks (DQN) - using neural networks when the state space is too large for a table!