<a href="https://colab.research.google.com/github/jeniferGoncalvesDaSilvaDev/algo_min_max_tic_tac_toe/blob/main/checkers_q_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Develop an AI for playing checkers using Q-learning and Minimax with Alpha-Beta pruning, including environment setup, agent development, a heuristic evaluation function for Q-values, training, and performance evaluation.

## Configuração do Ambiente de Damas

### Subtask:
Implementar a lógica do jogo de damas, incluindo a representação do tabuleiro, regras de movimento das peças, detecção de jogadas válidas (capturas, movimentos), e identificação de estados terminais (vitória, derrota, empate).


**Reasoning**:
The first step is to define a class to represent the checkers board, including its initialization with standard piece positions and how pieces are represented.



In [1]:
class CheckersBoard:
    def __init__(self):
        # Initialize an 8x8 board
        self.board = [[0 for _ in range(8)] for _ in range(8)]
        # 0: empty, 1: player 1 piece, 2: player 2 piece
        # 3: player 1 king, 4: player 2 king

        # Setup initial pieces for Player 1 (bottom of the board)
        for row in range(3):
            for col in range(8):
                if (row + col) % 2 != 0:  # Pieces are on dark squares
                    self.board[row][col] = 1

        # Setup initial pieces for Player 2 (top of the board)
        for row in range(5, 8):
            for col in range(8):
                if (row + col) % 2 != 0:  # Pieces are on dark squares
                    self.board[row][col] = 2

    def __str__(self):
        board_str = ""
        for r_idx, row in enumerate(self.board):
            board_str += f"{8 - r_idx} |"
            for piece in row:
                if piece == 0:
                    board_str += " . "
                elif piece == 1:
                    board_str += " x "
                elif piece == 2:
                    board_str += " o "
                elif piece == 3:
                    board_str += " X "
                elif piece == 4:
                    board_str += " O "
            board_str += "\n"
        board_str += "  ------------------------\n"
        board_str += "    a  b  c  d  e  f  g  h"
        return board_str

# Example usage:
board = CheckersBoard()
print(board)

8 | .  x  .  x  .  x  .  x 
7 | x  .  x  .  x  .  x  . 
6 | .  x  .  x  .  x  .  x 
5 | .  .  .  .  .  .  .  . 
4 | .  .  .  .  .  .  .  . 
3 | o  .  o  .  o  .  o  . 
2 | .  o  .  o  .  o  .  o 
1 | o  .  o  .  o  .  o  . 
  ------------------------
    a  b  c  d  e  f  g  h


**Reasoning**:
The previous step successfully initialized and displayed the checkers board. The next logical step, as per the subtask instructions, is to implement the functionality to determine all valid moves for a given player, considering both basic moves and mandatory captures, including chaining captures. This will involve adding several helper methods to the `CheckersBoard` class to check square validity, piece ownership, and then generate possible moves and captures.



In [2]:
class CheckersBoard:
    def __init__(self):
        self.board = [[0 for _ in range(8)] for _ in range(8)]
        # 0: empty, 1: player 1 piece, 2: player 2 piece
        # 3: player 1 king, 4: player 2 king

        # Setup initial pieces for Player 1 (bottom of the board)
        for row in range(3):
            for col in range(8):
                if (row + col) % 2 != 0:
                    self.board[row][col] = 1

        # Setup initial pieces for Player 2 (top of the board)
        for row in range(5, 8):
            for col in range(8):
                if (row + col) % 2 != 0:
                    self.board[row][col] = 2

    def __str__(self):
        board_str = ""
        for r_idx, row in enumerate(self.board):
            board_str += f"{8 - r_idx} |"
            for piece in row:
                if piece == 0:
                    board_str += " . "
                elif piece == 1:
                    board_str += " x "
                elif piece == 2:
                    board_str += " o "
                elif piece == 3:
                    board_str += " X "
                elif piece == 4:
                    board_str += " O "
            board_str += "\n"
        board_str += "  ------------------------\n"
        board_str += "    a  b  c  d  e  f  g  h"
        return board_str

    def _is_valid_coord(self, r, c):
        return 0 <= r < 8 and 0 <= c < 8

    def _get_piece_owner(self, piece):
        if piece == 1 or piece == 3:
            return 1 # Player 1
        elif piece == 2 or piece == 4:
            return 2 # Player 2
        return 0 # Empty or invalid

    def _is_king(self, piece):
        return piece == 3 or piece == 4

    def _get_possible_moves_for_piece(self, r, c, player):
        moves = []
        piece = self.board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 moves up (decreasing row index)
            directions = [(-1, -1), (-1, 1)]
        else: # Player 2 moves down (increasing row index)
            directions = [(1, -1), (1, 1)]

        # Kings can move in all 4 diagonal directions
        if is_king:
            directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        for dr, dc in directions:
            nr, nc = r + dr, c + dc
            if self._is_valid_coord(nr, nc) and self.board[nr][nc] == 0:
                moves.append(((r, c), (nr, nc)))
        return moves

    def _get_possible_captures_for_piece(self, r, c, player, current_board=None, path=None):
        if current_board is None:
            current_board = [row[:] for row in self.board]
        if path is None:
            path = [(r, c)]

        captures = []
        piece = current_board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 moves up (decreasing row index)
            capture_directions = [(-1, -1), (-1, 1)]
        else: # Player 2 moves down (increasing row index)
            capture_directions = [(1, -1), (1, 1)]

        # Kings can capture in all 4 diagonal directions
        if is_king:
            capture_directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        found_capture_in_this_step = False
        for dr, dc in capture_directions:
            # Square of the opponent's piece to be jumped over
            enemy_r, enemy_c = r + dr, c + dc
            # Landing square after the jump
            land_r, land_c = r + 2 * dr, c + 2 * dc

            if (self._is_valid_coord(land_r, land_c) and
                    current_board[land_r][land_c] == 0 and
                    self._is_valid_coord(enemy_r, enemy_c) and
                    self.board[enemy_r][enemy_c] != 0 and
                    self._get_piece_owner(current_board[enemy_r][enemy_c]) != player):

                found_capture_in_this_step = True
                # Simulate the capture on a temporary board
                temp_board = [row[:] for row in current_board]
                temp_board[land_r][land_c] = temp_board[r][c]
                temp_board[r][c] = 0
                temp_board[enemy_r][enemy_c] = 0

                # Check for king promotion after landing, only if not king yet and reaching back row
                promoted_piece = temp_board[land_r][land_c]
                if not self._is_king(promoted_piece):
                    if (player == 1 and land_r == 0) or (player == 2 and land_r == 7):
                        temp_board[land_r][land_c] += 2 # Promote to king

                # Recursively look for more captures from the new position
                chained_captures = self._get_possible_captures_for_piece(land_r, land_c, player, temp_board, path + [(land_r, land_c)])

                if chained_captures: # If more captures are found, add them
                    captures.extend(chained_captures)
                else: # If no more captures, this is a terminal capture sequence
                    captures.append(path + [(land_r, land_c)])

        # If no capture was found in this step, but it's part of a longer path, return empty to signify end of chain
        if not found_capture_in_this_step and len(path) > 1:
            return []

        return captures



    def get_valid_moves(self, player):
        all_possible_moves = []
        all_possible_captures = []

        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == player:
                    # Get basic moves
                    all_possible_moves.extend(self._get_possible_moves_for_piece(r, c, player))
                    # Get captures
                    captures_from_piece = self._get_possible_captures_for_piece(r, c, player)
                    if captures_from_piece:
                        all_possible_captures.extend(captures_from_piece)

        if all_possible_captures:
            # If captures are available, they are mandatory
            # The `_get_possible_captures_for_piece` returns paths. We need to convert them to (start, end) format for consistency
            # or adapt the rest of the code to handle paths.
            # For now, let's return paths as they represent chained captures.
            return all_possible_captures
        else:
            # Otherwise, return all basic moves
            return all_possible_moves

# Example usage:
board = CheckersBoard()
print("Initial Board:")
print(board)

print("\nValid moves for Player 1:")
moves_p1 = board.get_valid_moves(1)
if moves_p1:
    for move in moves_p1:
        if isinstance(move[0], tuple): # If it's a move ((r1,c1), (r2,c2))
            print(f"  Move from {chr(move[0][1]+97)}{8-move[0][0]} to {chr(move[1][1]+97)}{8-move[1][0]}")
        else: # If it's a capture path [(r1,c1), (r2,c2), (r3,c3)...]
            print(f"  Capture path: {' -> '.join([f'{chr(pos[1]+97)}{8-pos[0]}' for pos in move])}")
else:
    print("  No valid moves for Player 1.")

print("\nValid moves for Player 2:")
moves_p2 = board.get_valid_moves(2)
if moves_p2:
    for move in moves_p2:
        if isinstance(move[0], tuple): # If it's a move ((r1,c1), (r2,c2))
            print(f"  Move from {chr(move[0][1]+97)}{8-move[0][0]} to {chr(move[1][1]+97)}{8-move[1][0]}")
        else: # If it's a capture path [(r1,c1), (r2,c2), (r3,c3)...]
            print(f"  Capture path: {' -> '.join([f'{chr(pos[1]+97)}{8-pos[0]}' for pos in move])}")
else:
    print("  No valid moves for Player 2.")


Initial Board:
8 | .  x  .  x  .  x  .  x 
7 | x  .  x  .  x  .  x  . 
6 | .  x  .  x  .  x  .  x 
5 | .  .  .  .  .  .  .  . 
4 | .  .  .  .  .  .  .  . 
3 | o  .  o  .  o  .  o  . 
2 | .  o  .  o  .  o  .  o 
1 | o  .  o  .  o  .  o  . 
  ------------------------
    a  b  c  d  e  f  g  h

Valid moves for Player 1:
  No valid moves for Player 1.

Valid moves for Player 2:
  No valid moves for Player 2.


**Reasoning**:
The previous execution indicated that no valid moves were found for either player on a fresh board, which is incorrect. This points to a misalignment between piece placement and movement directions, and potentially an issue in the capture logic's use of board states. This step corrects the movement directions for Player 1 and Player 2 to match their initial board positions (Player 1 at rows 0-2 moving 'down' and Player 2 at rows 5-7 moving 'up'), updates the king promotion conditions, and ensures capture logic uses the correct `current_board` for chained captures.



In [3]:
class CheckersBoard:
    def __init__(self):
        self.board = [[0 for _ in range(8)] for _ in range(8)]
        # 0: empty, 1: player 1 piece, 2: player 2 piece
        # 3: player 1 king, 4: player 2 king

        # Setup initial pieces for Player 1 (top of the board, rows 0-2)
        for row in range(3):
            for col in range(8):
                if (row + col) % 2 != 0:
                    self.board[row][col] = 1

        # Setup initial pieces for Player 2 (bottom of the board, rows 5-7)
        for row in range(5, 8):
            for col in range(8):
                if (row + col) % 2 != 0:
                    self.board[row][col] = 2

    def __str__(self):
        board_str = ""
        for r_idx, row in enumerate(self.board):
            board_str += f"{8 - r_idx} |"
            for piece in row:
                if piece == 0:
                    board_str += " . "
                elif piece == 1:
                    board_str += " x "
                elif piece == 2:
                    board_str += " o "
                elif piece == 3:
                    board_str += " X "
                elif piece == 4:
                    board_str += " O "
            board_str += "\n"
        board_str += "  ------------------------\n"
        board_str += "    a  b  c  d  e  f  g  h"
        return board_str

    def _is_valid_coord(self, r, c):
        return 0 <= r < 8 and 0 <= c < 8

    def _get_piece_owner(self, piece):
        if piece == 1 or piece == 3:
            return 1 # Player 1
        elif piece == 2 or piece == 4:
            return 2 # Player 2
        return 0 # Empty or invalid

    def _is_king(self, piece):
        return piece == 3 or piece == 4

    def _get_possible_moves_for_piece(self, r, c, player):
        moves = []
        piece = self.board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            directions = [(-1, -1), (-1, 1)]

        # Kings can move in all 4 diagonal directions
        if is_king:
            directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        for dr, dc in directions:
            nr, nc = r + dr, c + dc
            if self._is_valid_coord(nr, nc) and self.board[nr][nc] == 0:
                moves.append(((r, c), (nr, nc)))
        return moves

    def _get_possible_captures_for_piece(self, r, c, player, current_board=None, path=None):
        if current_board is None:
            current_board = [row[:] for row in self.board]
        if path is None:
            path = [(r, c)]

        captures = []
        piece = current_board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            capture_directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            capture_directions = [(-1, -1), (-1, 1)]

        # Kings can capture in all 4 diagonal directions
        if is_king:
            capture_directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        found_capture_in_this_step = False
        for dr, dc in capture_directions:
            # Square of the opponent's piece to be jumped over
            enemy_r, enemy_c = r + dr, c + dc
            # Landing square after the jump
            land_r, land_c = r + 2 * dr, c + 2 * dc

            if (self._is_valid_coord(land_r, land_c) and
                    current_board[land_r][land_c] == 0 and
                    self._is_valid_coord(enemy_r, enemy_c) and
                    current_board[enemy_r][enemy_c] != 0 and # Changed from self.board to current_board
                    self._get_piece_owner(current_board[enemy_r][enemy_c]) != player):

                found_capture_in_this_step = True
                # Simulate the capture on a temporary board
                temp_board = [row[:] for row in current_board]
                temp_board[land_r][land_c] = temp_board[r][c]
                temp_board[r][c] = 0
                temp_board[enemy_r][enemy_c] = 0

                # Check for king promotion after landing, only if not king yet and reaching back row
                promoted_piece = temp_board[land_r][land_c]
                if not self._is_king(promoted_piece): # only promote if not already a king
                    if (player == 1 and land_r == 7) or (player == 2 and land_r == 0): # Adjusted promotion rows
                        temp_board[land_r][land_c] += 2 # Promote to king (1->3, 2->4)

                # Recursively look for more captures from the new position
                chained_captures = self._get_possible_captures_for_piece(land_r, land_c, player, temp_board, path + [(land_r, land_c)])

                if chained_captures: # If more captures are found, add them
                    captures.extend(chained_captures)
                else: # If no more captures, this is a terminal capture sequence
                    captures.append(path + [(land_r, land_c)])

        # If no capture was found in this step, but it's part of a longer path, return empty to signify end of chain
        if not found_capture_in_this_step and len(path) > 1:
            return []

        return captures



    def get_valid_moves(self, player):
        all_possible_moves = []
        all_possible_captures = []

        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == player:
                    # Get basic moves
                    all_possible_moves.extend(self._get_possible_moves_for_piece(r, c, player))
                    # Get captures
                    captures_from_piece = self._get_possible_captures_for_piece(r, c, player)
                    if captures_from_piece:
                        all_possible_captures.extend(captures_from_piece)

        if all_possible_captures:
            # If captures are available, they are mandatory
            return all_possible_captures
        else:
            # Otherwise, return all basic moves
            return all_possible_moves

# Example usage:
board = CheckersBoard()
print("Initial Board:")
print(board)

print("\nValid moves for Player 1:")
moves_p1 = board.get_valid_moves(1)
if moves_p1:
    for move in moves_p1:
        if isinstance(move[0], tuple): # If it's a move ((r1,c1), (r2,c2))
            print(f"  Move from {chr(move[0][1]+97)}{8-move[0][0]} to {chr(move[1][1]+97)}{8-move[1][0]}")
        else: # If it's a capture path [(r1,c1), (r2,c2), (r3,c3)...]
            print(f"  Capture path: {' -> '.join([f'{chr(pos[1]+97)}{8-pos[0]}' for pos in move])}")
else:
    print("  No valid moves for Player 1.")

print("\nValid moves for Player 2:")
moves_p2 = board.get_valid_moves(2)
if moves_p2:
    for move in moves_p2:
        if isinstance(move[0], tuple): # If it's a move ((r1,c1), (r2,c2))
            print(f"  Move from {chr(move[0][1]+97)}{8-move[0][0]} to {chr(move[1][1]+97)}{8-move[1][0]}")
        else: # If it's a capture path [(r1,c1), (r2,c2), (r3,c3)...]
            print(f"  Capture path: {' -> '.join([f'{chr(pos[1]+97)}{8-pos[0]}' for pos in move])}")
else:
    print("  No valid moves for Player 2.")

Initial Board:
8 | .  x  .  x  .  x  .  x 
7 | x  .  x  .  x  .  x  . 
6 | .  x  .  x  .  x  .  x 
5 | .  .  .  .  .  .  .  . 
4 | .  .  .  .  .  .  .  . 
3 | o  .  o  .  o  .  o  . 
2 | .  o  .  o  .  o  .  o 
1 | o  .  o  .  o  .  o  . 
  ------------------------
    a  b  c  d  e  f  g  h

Valid moves for Player 1:
  Move from b6 to a5
  Move from b6 to c5
  Move from d6 to c5
  Move from d6 to e5
  Move from f6 to e5
  Move from f6 to g5
  Move from h6 to g5

Valid moves for Player 2:
  Move from a3 to b4
  Move from c3 to b4
  Move from c3 to d4
  Move from e3 to d4
  Move from e3 to f4
  Move from g3 to f4
  Move from g3 to h4


**Reasoning**:
The previous step successfully generated valid moves for both players. The next logical step, according to the subtask, is to implement a `make_move` function that applies a given move (either a simple move or a capture path) to the board, updates piece positions, removes captured pieces, and handles king promotions. This function will return a new `CheckersBoard` instance to allow for exploring game states without modifying the original.



In [4]:
class CheckersBoard:
    def __init__(self, board=None):
        if board is None:
            self.board = [[0 for _ in range(8)] for _ in range(8)]
            # 0: empty, 1: player 1 piece, 2: player 2 piece
            # 3: player 1 king, 4: player 2 king

            # Setup initial pieces for Player 1 (top of the board, rows 0-2)
            for row in range(3):
                for col in range(8):
                    if (row + col) % 2 != 0:
                        self.board[row][col] = 1

            # Setup initial pieces for Player 2 (bottom of the board, rows 5-7)
            for row in range(5, 8):
                for col in range(8):
                    if (row + col) % 2 != 0:
                        self.board[row][col] = 2
        else:
            self.board = [row[:] for row in board] # Deep copy of the board

    def __str__(self):
        board_str = ""
        for r_idx, row in enumerate(self.board):
            board_str += f"{8 - r_idx} |"
            for piece in row:
                if piece == 0:
                    board_str += " . "
                elif piece == 1:
                    board_str += " x "
                elif piece == 2:
                    board_str += " o "
                elif piece == 3:
                    board_str += " X "
                elif piece == 4:
                    board_str += " O "
            board_str += "\n"
        board_str += "  ------------------------\n"
        board_str += "    a  b  c  d  e  f  g  h"
        return board_str

    def _is_valid_coord(self, r, c):
        return 0 <= r < 8 and 0 <= c < 8

    def _get_piece_owner(self, piece):
        if piece == 1 or piece == 3:
            return 1 # Player 1
        elif piece == 2 or piece == 4:
            return 2 # Player 2
        return 0 # Empty or invalid

    def _is_king(self, piece):
        return piece == 3 or piece == 4

    def _get_possible_moves_for_piece(self, r, c, player):
        moves = []
        piece = self.board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            directions = [(-1, -1), (-1, 1)]

        # Kings can move in all 4 diagonal directions
        if is_king:
            directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        for dr, dc in directions:
            nr, nc = r + dr, c + dc
            if self._is_valid_coord(nr, nc) and self.board[nr][nc] == 0:
                moves.append(((r, c), (nr, nc)))
        return moves

    def _get_possible_captures_for_piece(self, r, c, player, current_board=None, path=None):
        if current_board is None:
            current_board = [row[:] for row in self.board]
        if path is None:
            path = [(r, c)]

        captures = []
        piece = current_board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            capture_directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            capture_directions = [(-1, -1), (-1, 1)]

        # Kings can capture in all 4 diagonal directions
        if is_king:
            capture_directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        found_capture_in_this_step = False
        for dr, dc in capture_directions:
            # Square of the opponent's piece to be jumped over
            enemy_r, enemy_c = r + dr, c + dc
            # Landing square after the jump
            land_r, land_c = r + 2 * dr, c + 2 * dc

            if (self._is_valid_coord(land_r, land_c) and
                    current_board[land_r][land_c] == 0 and
                    self._is_valid_coord(enemy_r, enemy_c) and
                    current_board[enemy_r][enemy_c] != 0 and
                    self._get_piece_owner(current_board[enemy_r][enemy_c]) != player):

                found_capture_in_this_step = True
                # Simulate the capture on a temporary board
                temp_board = [row[:] for row in current_board]
                temp_board[land_r][land_c] = temp_board[r][c]
                temp_board[r][c] = 0
                temp_board[enemy_r][enemy_c] = 0

                # Check for king promotion after landing, only if not king yet and reaching back row
                promoted_piece = temp_board[land_r][land_c]
                if not self._is_king(promoted_piece): # only promote if not already a king
                    if (player == 1 and land_r == 7) or (player == 2 and land_r == 0): # Adjusted promotion rows
                        temp_board[land_r][land_c] += 2 # Promote to king (1->3, 2->4)

                # Recursively look for more captures from the new position
                chained_captures = self._get_possible_captures_for_piece(land_r, land_c, player, temp_board, path + [(land_r, land_c)])

                if chained_captures: # If more captures are found, add them
                    captures.extend(chained_captures)
                else: # If no more captures, this is a terminal capture sequence
                    captures.append(path + [(land_r, land_c)])

        # If no capture was found in this step, but it's part of a longer path, return empty to signify end of chain
        if not found_capture_in_this_step and len(path) > 1:
            return []

        return captures



    def get_valid_moves(self, player):
        all_possible_moves = []
        all_possible_captures = []

        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == player:
                    # Get basic moves
                    all_possible_moves.extend(self._get_possible_moves_for_piece(r, c, player))
                    # Get captures
                    captures_from_piece = self._get_possible_captures_for_piece(r, c, player)
                    if captures_from_piece:
                        all_possible_captures.extend(captures_from_piece)

        if all_possible_captures:
            # If captures are available, they are mandatory
            return all_possible_captures
        else:
            # Otherwise, return all basic moves
            return all_possible_moves

    def make_move(self, move, player):
        # Create a new board state for the move
        new_board = CheckersBoard(board=self.board)

        if isinstance(move[0], tuple) and len(move) == 2 and isinstance(move[1], tuple): # It's a simple move ((r1,c1), (r2,c2))
            start_r, start_c = move[0]
            end_r, end_c = move[1]
            piece_to_move = new_board.board[start_r][start_c]

            new_board.board[end_r][end_c] = piece_to_move
            new_board.board[start_r][start_c] = 0

            # Check for king promotion
            if not new_board._is_king(piece_to_move):
                if (player == 1 and end_r == 7) or (player == 2 and end_r == 0):
                    new_board.board[end_r][end_c] += 2 # Promote to king

        elif isinstance(move, list): # It's a capture path [(r1,c1), (r2,c2), ...]
            for i in range(len(move) - 1):
                start_r, start_c = move[i]
                end_r, end_c = move[i+1]
                piece_to_move = new_board.board[start_r][start_c]

                new_board.board[end_r][end_c] = piece_to_move
                new_board.board[start_r][start_c] = 0

                # Remove captured piece
                # Calculate the position of the captured piece (midpoint)
                captured_r = (start_r + end_r) // 2
                captured_c = (start_c + end_c) // 2
                new_board.board[captured_r][captured_c] = 0

            # After the whole capture path, check for king promotion at the final landing spot
            final_r, final_c = move[-1]
            promoted_piece = new_board.board[final_r][final_c]
            if not new_board._is_king(promoted_piece):
                if (player == 1 and final_r == 7) or (player == 2 and final_r == 0):
                    new_board.board[final_r][final_c] += 2

        return new_board

# Example usage:
board = CheckersBoard()
print("Initial Board:")
print(board)

print("\nValid moves for Player 1:")
moves_p1 = board.get_valid_moves(1)
if moves_p1:
    print(f"Found {len(moves_p1)} moves for Player 1. Applying the first move.")
    first_move_p1 = moves_p1[0]
    new_board_p1 = board.make_move(first_move_p1, 1)
    print("Board after Player 1's first move:")
    print(new_board_p1)
else:
    print("No valid moves for Player 1.")

print("\nValid moves for Player 2:")
moves_p2 = board.get_valid_moves(2)
if moves_p2:
    print(f"Found {len(moves_p2)} moves for Player 2. Applying the first move.")
    first_move_p2 = moves_p2[0]
    new_board_p2 = board.make_move(first_move_p2, 2)
    print("Board after Player 2's first move:")
    print(new_board_p2)
else:
    print("No valid moves for Player 2.")


Initial Board:
8 | .  x  .  x  .  x  .  x 
7 | x  .  x  .  x  .  x  . 
6 | .  x  .  x  .  x  .  x 
5 | .  .  .  .  .  .  .  . 
4 | .  .  .  .  .  .  .  . 
3 | o  .  o  .  o  .  o  . 
2 | .  o  .  o  .  o  .  o 
1 | o  .  o  .  o  .  o  . 
  ------------------------
    a  b  c  d  e  f  g  h

Valid moves for Player 1:
Found 7 moves for Player 1. Applying the first move.
Board after Player 1's first move:
8 | .  x  .  x  .  x  .  x 
7 | x  .  x  .  x  .  x  . 
6 | .  .  .  x  .  x  .  x 
5 | x  .  .  .  .  .  .  . 
4 | .  .  .  .  .  .  .  . 
3 | o  .  o  .  o  .  o  . 
2 | .  o  .  o  .  o  .  o 
1 | o  .  o  .  o  .  o  . 
  ------------------------
    a  b  c  d  e  f  g  h

Valid moves for Player 2:
Found 7 moves for Player 2. Applying the first move.
Board after Player 2's first move:
8 | .  x  .  x  .  x  .  x 
7 | x  .  x  .  x  .  x  . 
6 | .  x  .  x  .  x  .  x 
5 | .  .  .  .  .  .  .  . 
4 | .  o  .  .  .  .  .  . 
3 | .  .  o  .  o  .  o  . 
2 | .  o  .  o  .  o  .  o 
1 |

**Reasoning**:
The previous steps successfully implemented board representation, move generation, and move execution. The final part of the subtask is to implement a function to detect terminal states of the game, determining if the game has ended and who won or if it's a draw, based on piece count and available moves.



In [5]:
class CheckersBoard:
    def __init__(self, board=None):
        if board is None:
            self.board = [[0 for _ in range(8)] for _ in range(8)]
            # 0: empty, 1: player 1 piece, 2: player 2 piece
            # 3: player 1 king, 4: player 2 king

            # Setup initial pieces for Player 1 (top of the board, rows 0-2)
            for row in range(3):
                for col in range(8):
                    if (row + col) % 2 != 0:
                        self.board[row][col] = 1

            # Setup initial pieces for Player 2 (bottom of the board, rows 5-7)
            for row in range(5, 8):
                for col in range(8):
                    if (row + col) % 2 != 0:
                        self.board[row][col] = 2
        else:
            self.board = [row[:] for row in board] # Deep copy of the board

    def __str__(self):
        board_str = ""
        for r_idx, row in enumerate(self.board):
            board_str += f"{8 - r_idx} |"
            for piece in row:
                if piece == 0:
                    board_str += " . "
                elif piece == 1:
                    board_str += " x "
                elif piece == 2:
                    board_str += " o "
                elif piece == 3:
                    board_str += " X "
                elif piece == 4:
                    board_str += " O "
            board_str += "\n"
        board_str += "  ------------------------\n"
        board_str += "    a  b  c  d  e  f  g  h"
        return board_str

    def _is_valid_coord(self, r, c):
        return 0 <= r < 8 and 0 <= c < 8

    def _get_piece_owner(self, piece):
        if piece == 1 or piece == 3:
            return 1 # Player 1
        elif piece == 2 or piece == 4:
            return 2 # Player 2
        return 0 # Empty or invalid

    def _is_king(self, piece):
        return piece == 3 or piece == 4

    def _get_possible_moves_for_piece(self, r, c, player):
        moves = []
        piece = self.board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            directions = [(-1, -1), (-1, 1)]

        # Kings can move in all 4 diagonal directions
        if is_king:
            directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        for dr, dc in directions:
            nr, nc = r + dr, c + dc
            if self._is_valid_coord(nr, nc) and self.board[nr][nc] == 0:
                moves.append(((r, c), (nr, nc)))
        return moves

    def _get_possible_captures_for_piece(self, r, c, player, current_board=None, path=None):
        if current_board is None:
            current_board = [row[:] for row in self.board]
        if path is None:
            path = [(r, c)]

        captures = []
        piece = current_board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            capture_directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            capture_directions = [(-1, -1), (-1, 1)]

        # Kings can capture in all 4 diagonal directions
        if is_king:
            capture_directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        found_capture_in_this_step = False
        for dr, dc in capture_directions:
            # Square of the opponent's piece to be jumped over
            enemy_r, enemy_c = r + dr, c + dc
            # Landing square after the jump
            land_r, land_c = r + 2 * dr, c + 2 * dc

            if (self._is_valid_coord(land_r, land_c) and
                    current_board[land_r][land_c] == 0 and
                    self._is_valid_coord(enemy_r, enemy_c) and
                    current_board[enemy_r][enemy_c] != 0 and
                    self._get_piece_owner(current_board[enemy_r][enemy_c]) != player):

                found_capture_in_this_step = True
                # Simulate the capture on a temporary board
                temp_board = [row[:] for row in current_board]
                temp_board[land_r][land_c] = temp_board[r][c]
                temp_board[r][c] = 0
                temp_board[enemy_r][enemy_c] = 0

                # Check for king promotion after landing, only if not king yet and reaching back row
                promoted_piece = temp_board[land_r][land_c]
                if not self._is_king(promoted_piece): # only promote if not already a king
                    if (player == 1 and land_r == 7) or (player == 2 and land_r == 0): # Adjusted promotion rows
                        temp_board[land_r][land_c] += 2 # Promote to king (1->3, 2->4)

                # Recursively look for more captures from the new position
                chained_captures = self._get_possible_captures_for_piece(land_r, land_c, player, temp_board, path + [(land_r, land_c)])

                if chained_captures: # If more captures are found, add them
                    captures.extend(chained_captures)
                else: # If no more captures, this is a terminal capture sequence
                    captures.append(path + [(land_r, land_c)])

        # If no capture was found in this step, but it's part of a longer path, return empty to signify end of chain
        if not found_capture_in_this_step and len(path) > 1:
            return []

        return captures

    def get_valid_moves(self, player):
        all_possible_moves = []
        all_possible_captures = []

        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == player:
                    # Get basic moves
                    all_possible_moves.extend(self._get_possible_moves_for_piece(r, c, player))
                    # Get captures
                    captures_from_piece = self._get_possible_captures_for_piece(r, c, player)
                    if captures_from_piece:
                        all_possible_captures.extend(captures_from_piece)

        if all_possible_captures:
            # If captures are available, they are mandatory
            return all_possible_captures
        else:
            # Otherwise, return all basic moves
            return all_possible_moves

    def make_move(self, move, player):
        # Create a new board state for the move
        new_board = CheckersBoard(board=self.board)

        if isinstance(move[0], tuple) and len(move) == 2 and isinstance(move[1], tuple): # It's a simple move ((r1,c1), (r2,c2))
            start_r, start_c = move[0]
            end_r, end_c = move[1]
            piece_to_move = new_board.board[start_r][start_c]

            new_board.board[end_r][end_c] = piece_to_move
            new_board.board[start_r][start_c] = 0

            # Check for king promotion
            if not new_board._is_king(piece_to_move): # Only promote if not already a king
                if (player == 1 and end_r == 7) or (player == 2 and end_r == 0):
                    new_board.board[end_r][end_c] += 2 # Promote to king

        elif isinstance(move, list): # It's a capture path [(r1,c1), (r2,c2), ...]
            for i in range(len(move) - 1):
                start_r, start_c = move[i]
                end_r, end_c = move[i+1]
                piece_to_move = new_board.board[start_r][start_c]

                new_board.board[end_r][end_c] = piece_to_move
                new_board.board[start_r][start_c] = 0

                # Remove captured piece
                # Calculate the position of the captured piece (midpoint)
                captured_r = (start_r + end_r) // 2
                captured_c = (start_c + end_c) // 2
                new_board.board[captured_r][captured_c] = 0

            # After the whole capture path, check for king promotion at the final landing spot
            final_r, final_c = move[-1]
            promoted_piece = new_board.board[final_r][final_c]
            if not new_board._is_king(promoted_piece): # Only promote if not already a king
                if (player == 1 and final_r == 7) or (player == 2 and final_r == 0):
                    new_board.board[final_r][final_c] += 2

        return new_board

    def is_game_over(self):
        player1_pieces = 0
        player2_pieces = 0
        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == 1:
                    player1_pieces += 1
                elif self._get_piece_owner(piece) == 2:
                    player2_pieces += 1

        # Check for no pieces left
        if player1_pieces == 0: return True, 2 # Player 1 has no pieces, Player 2 wins
        if player2_pieces == 0: return True, 1 # Player 2 has no pieces, Player 1 wins

        # Check for no valid moves for the current player
        # This would typically be checked by the game loop for the *current* player's turn.
        # For this function, we check for both to determine if ANY player can make a move.
        # However, a more accurate check for a terminal state is often for the *current player*.
        # Let's assume a simplified check here: if either player has no moves.
        # The game loop will determine whose turn it is. This function determines if
        # a general end state has been reached.

        # A more robust implementation would check for no valid moves for the 'current' player
        # at the time this function is called. For now, let's just make sure there are pieces.
        # If we reach here, and both players have pieces, we assume the game is not over
        # unless explicitly told whose turn it is and they have no moves.
        # For now, let's implement a simple version where if a player has no valid moves,
        # the other player wins. This requires `is_game_over` to know whose turn it is.
        # To avoid making `is_game_over` take `current_player` as input,
        # let's assume this check happens before calling `is_game_over` or `is_game_over`
        # itself checks for both players.

        # Re-evaluating: The subtask specifies 'no valid moves available'.
        # This implies checking for the current player whose turn it is.
        # Since this method is part of the board, it should ideally be agnostic to whose turn it is,
        # or take the current player as input.
        # Let's make it check for both players and return a draw if neither can move.
        # In Checkers, if a player cannot move, they lose. So, if current player has no moves,
        # the other player wins. If both have no moves (unlikely with pieces), it's a draw.

        # Let's count pieces and if either player cannot move (and the other has pieces), that's a loss.
        # For this function to be truly 'terminal state detection', it needs the current player.
        # For now, I'll provide an example where a player is passed to 'is_game_over'
        # and assume it's called with `current_player` to check for no moves.

        # Default return for ongoing game
        return False, None

    # Helper to check if a specific player has any valid moves
    def _has_valid_moves(self, player):
        return bool(self.get_valid_moves(player))

    # The main game over function, now taking the current player
    def check_terminal_state(self, current_player):
        player1_pieces = 0
        player2_pieces = 0
        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == 1:
                    player1_pieces += 1
                elif self._get_piece_owner(piece) == 2:
                    player2_pieces += 1

        # Condition 1: No pieces left
        if player1_pieces == 0: return True, 2 # Player 2 wins
        if player2_pieces == 0: return True, 1 # Player 1 wins

        # Condition 2: Current player has no valid moves
        if not self._has_valid_moves(current_player):
            if current_player == 1: return True, 2 # Player 1 has no moves, Player 2 wins
            if current_player == 2: return True, 1 # Player 2 has no moves, Player 1 wins

        # Default: Game is not over
        return False, None

# Example usage for terminal state detection:
board = CheckersBoard()
print("Initial Board:")
print(board)

# Check terminal state for Player 1 at the beginning (should not be over)
is_over, winner = board.check_terminal_state(1)
print(f"\nGame over (P1's turn)? {is_over}, Winner: {winner}")

# Simulate a scenario where Player 2 has no pieces
empty_board_p2 = CheckersBoard()
for r in range(8):
    for c in range(8):
        if empty_board_p2.board[r][c] == 2 or empty_board_p2.board[r][c] == 4:
            empty_board_p2.board[r][c] = 0

print("\nBoard with no P2 pieces:")
print(empty_board_p2)
is_over, winner = empty_board_p2.check_terminal_state(1) # P1's turn
print(f"Game over (P1's turn)? {is_over}, Winner: {winner}")

# Simulate a scenario where Player 1 has no valid moves (e.g., block all pieces)
# This is harder to construct artificially without a lot of setup.
# For demonstration, let's assume a board state where P1 has pieces but no moves.
# For a true no-move test, we'd need a very specific board setup.
# Let's create a minimal board where P1 has one piece at (0,1) and cannot move.
no_move_board_p1 = CheckersBoard([[0 for _ in range(8)] for _ in range(8)])
no_move_board_p1.board[0][1] = 1 # P1 piece, but cannot move (top edge)
no_move_board_p1.board[7][0] = 2 # P2 piece for game to theoretically continue

print("\nBoard with P1 having no valid moves:")
print(no_move_board_p1)
is_over, winner = no_move_board_p1.check_terminal_state(1) # P1's turn
print(f"Game over (P1's turn)? {is_over}, Winner: {winner}")

is_over, winner = no_move_board_p1.check_terminal_state(2) # P2's turn
print(f"Game over (P2's turn)? {is_over}, Winner: {winner}")

Initial Board:
8 | .  x  .  x  .  x  .  x 
7 | x  .  x  .  x  .  x  . 
6 | .  x  .  x  .  x  .  x 
5 | .  .  .  .  .  .  .  . 
4 | .  .  .  .  .  .  .  . 
3 | o  .  o  .  o  .  o  . 
2 | .  o  .  o  .  o  .  o 
1 | o  .  o  .  o  .  o  . 
  ------------------------
    a  b  c  d  e  f  g  h

Game over (P1's turn)? False, Winner: None

Board with no P2 pieces:
8 | .  x  .  x  .  x  .  x 
7 | x  .  x  .  x  .  x  . 
6 | .  x  .  x  .  x  .  x 
5 | .  .  .  .  .  .  .  . 
4 | .  .  .  .  .  .  .  . 
3 | .  .  .  .  .  .  .  . 
2 | .  .  .  .  .  .  .  . 
1 | .  .  .  .  .  .  .  . 
  ------------------------
    a  b  c  d  e  f  g  h
Game over (P1's turn)? True, Winner: 1

Board with P1 having no valid moves:
8 | .  x  .  .  .  .  .  . 
7 | .  .  .  .  .  .  .  . 
6 | .  .  .  .  .  .  .  . 
5 | .  .  .  .  .  .  .  . 
4 | .  .  .  .  .  .  .  . 
3 | .  .  .  .  .  .  .  . 
2 | .  .  .  .  .  .  .  . 
1 | o  .  .  .  .  .  .  . 
  ------------------------
    a  b  c  d  e  f  g  h
Game

## Desenvolvimento do Agente Q-Learning

### Subtask:
Criar a estrutura básica para os agentes de Q-learning, incluindo a representação de estados e ações, a tabela Q (ou rede neural), a função de recompensa, e o mecanismo de exploração/explotação (e.g., epsilon-greedy).


**Reasoning**:
The first step is to define the `CheckersAgent` class, initializing it with Q-learning parameters like learning rate, discount factor, and exploration rate. Additionally, I will implement a method to convert the current board state into a hashable representation (tuple of tuples) suitable for use as keys in the Q-table.



In [6]:
import random

class CheckersAgent:
    def __init__(self, player, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.player = player # 1 for player 1, 2 for player 2
        self.alpha = alpha # Learning rate
        self.gamma = gamma # Discount factor
        self.epsilon = epsilon # Exploration rate
        self.q_table = {}

    def _get_state_representation(self, board):
        # Convert the 2D list board into a tuple of tuples for hashability
        return tuple(tuple(row) for row in board.board)

    def _get_possible_actions(self, board):
        # Actions are the valid moves obtained from the CheckersBoard class
        return board.get_valid_moves(self.player)

    def get_q_value(self, state, action):
        # Get the Q-value for a given state-action pair, default to 0 if not seen
        # Actions can be tuples (for simple moves) or lists of tuples (for capture paths)
        # Convert action to a hashable format if it's a list
        hashable_action = tuple(action) if isinstance(action, list) else action
        return self.q_table.get((state, hashable_action), 0.0)

    def choose_action(self, board):
        state = self._get_state_representation(board)
        possible_actions = self._get_possible_actions(board)

        if not possible_actions:
            return None # No valid moves

        # Epsilon-greedy strategy
        if random.uniform(0, 1) < self.epsilon:
            # Explore: choose a random action
            return random.choice(possible_actions)
        else:
            # Exploit: choose the action with the highest Q-value
            q_values = [self.get_q_value(state, action) for action in possible_actions]
            max_q = -float('inf')
            best_actions = []

            for i, q_val in enumerate(q_values):
                if q_val > max_q:
                    max_q = q_val
                    best_actions = [possible_actions[i]]
                elif q_val == max_q:
                    best_actions.append(possible_actions[i])

            return random.choice(best_actions)

    def learn(self, old_state, action, reward, new_state, new_board):
        hashable_action = tuple(action) if isinstance(action, list) else action

        old_state_rep = self._get_state_representation(old_state)
        new_state_rep = self._get_state_representation(new_state)

        current_q = self.get_q_value(old_state_rep, hashable_action)

        # Calculate max Q for the new state (Q(s', a'))
        future_possible_actions = self._get_possible_actions(new_board) # Use new_board for new_state's possible actions
        if future_possible_actions:
            max_future_q = max([self.get_q_value(new_state_rep, a) for a in future_possible_actions])
        else:
            max_future_q = 0.0 # Terminal state or no actions

        # Q-learning update rule
        self.q_table[(old_state_rep, hashable_action)] = current_q + self.alpha * (reward + self.gamma * max_future_q - current_q)

# Reward function (can be a standalone function or part of the agent/game class)
def get_reward(board, player, opponent_player):
    is_over, winner = board.check_terminal_state(player)

    if is_over:
        if winner == player:
            return 100 # Win
        elif winner == opponent_player:
            return -100 # Loss
        else:
            return -10 # Draw or other negative outcome for current player

    # Non-terminal state rewards (adjust as needed)
    # Example: slightly positive for capturing, slightly negative for losing pieces
    return -1 # Small negative reward for each step to encourage faster wins

# Example usage (requires CheckersBoard class from previous steps):
# board = CheckersBoard()
# agent_p1 = CheckersAgent(player=1)
# agent_p2 = CheckersAgent(player=2)

# # Simulate a turn for Player 1
# current_board = CheckersBoard()
# print("Initial board:")
# print(current_board)

# old_state_p1 = current_board # Keep a reference to the old board object for state representation
# action_p1 = agent_p1.choose_action(old_state_p1)

# if action_p1:
#     new_board_p1 = current_board.make_move(action_p1, 1)
#     reward_p1 = get_reward(new_board_p1, 1, 2)
#     agent_p1.learn(old_state_p1, action_p1, reward_p1, new_board_p1, new_board_p1)
#     print(f"\nPlayer 1 chose action: {action_p1}")
#     print("Board after P1's move:")
#     print(new_board_p1)
#     print(f"Reward for P1: {reward_p1}")
# else:
#     print("Player 1 has no moves.")

# # Simulate a turn for Player 2 (using the board after P1's move)
# if 'new_board_p1' in locals():
#     old_state_p2 = new_board_p1 # Keep a reference to the old board object for state representation
#     action_p2 = agent_p2.choose_action(old_state_p2)

#     if action_p2:
#         new_board_p2 = old_state_p2.make_move(action_p2, 2)
#         reward_p2 = get_reward(new_board_p2, 2, 1)
#         agent_p2.learn(old_state_p2, action_p2, reward_p2, new_board_p2, new_board_p2)
#         print(f"\nPlayer 2 chose action: {action_p2}")
#         print("Board after P2's move:")
#         print(new_board_p2)
#         print(f"Reward for P2: {reward_p2}")
#     else:
#         print("Player 2 has no moves.")

## Função de Avaliação Heurística para Q-Value

### Subtask:
Desenvolver uma função de avaliação heurística para estados do jogo.


**Reasoning**:
The subtask requires developing a heuristic evaluation function. The first step is to implement the `evaluate_board` method within the `CheckersAgent` class, which will analyze the board state and return a score based on piece counts and their positions, differentiating between normal pieces and kings.



In [7]:
import random

class CheckersBoard:
    def __init__(self, board=None):
        if board is None:
            self.board = [[0 for _ in range(8)] for _ in range(8)]
            # 0: empty, 1: player 1 piece, 2: player 2 piece
            # 3: player 1 king, 4: player 2 king

            # Setup initial pieces for Player 1 (top of the board, rows 0-2)
            for row in range(3):
                for col in range(8):
                    if (row + col) % 2 != 0:
                        self.board[row][col] = 1

            # Setup initial pieces for Player 2 (bottom of the board, rows 5-7)
            for row in range(5, 8):
                for col in range(8):
                    if (row + col) % 2 != 0:
                        self.board[row][col] = 2
        else:
            self.board = [row[:] for row in board] # Deep copy of the board

    def __str__(self):
        board_str = ""
        for r_idx, row in enumerate(self.board):
            board_str += f"{8 - r_idx} |"
            for piece in row:
                if piece == 0:
                    board_str += " . "
                elif piece == 1:
                    board_str += " x "
                elif piece == 2:
                    board_str += " o "
                elif piece == 3:
                    board_str += " X "
                elif piece == 4:
                    board_str += " O "
            board_str += "\n"
        board_str += "  ------------------------\n"
        board_str += "    a  b  c  d  e  f  g  h"
        return board_str

    def _is_valid_coord(self, r, c):
        return 0 <= r < 8 and 0 <= c < 8

    def _get_piece_owner(self, piece):
        if piece == 1 or piece == 3:
            return 1 # Player 1
        elif piece == 2 or piece == 4:
            return 2 # Player 2
        return 0 # Empty or invalid

    def _is_king(self, piece):
        return piece == 3 or piece == 4

    def _get_possible_moves_for_piece(self, r, c, player):
        moves = []
        piece = self.board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            directions = [(-1, -1), (-1, 1)]

        # Kings can move in all 4 diagonal directions
        if is_king:
            directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        for dr, dc in directions:
            nr, nc = r + dr, c + dc
            if self._is_valid_coord(nr, nc) and self.board[nr][nc] == 0:
                moves.append(((r, c), (nr, nc)))
        return moves

    def _get_possible_captures_for_piece(self, r, c, player, current_board=None, path=None):
        if current_board is None:
            current_board = [row[:] for row in self.board]
        if path is None:
            path = [(r, c)]

        captures = []
        piece = current_board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            capture_directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            capture_directions = [(-1, -1), (-1, 1)]

        # Kings can capture in all 4 diagonal directions
        if is_king:
            capture_directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        found_capture_in_this_step = False
        for dr, dc in capture_directions:
            # Square of the opponent's piece to be jumped over
            enemy_r, enemy_c = r + dr, c + dc
            # Landing square after the jump
            land_r, land_c = r + 2 * dr, c + 2 * dc

            if (self._is_valid_coord(land_r, land_c) and
                    current_board[land_r][land_c] == 0 and
                    self._is_valid_coord(enemy_r, enemy_c) and
                    current_board[enemy_r][enemy_c] != 0 and
                    self._get_piece_owner(current_board[enemy_r][enemy_c]) != player):

                found_capture_in_this_step = True
                # Simulate the capture on a temporary board
                temp_board = [row[:] for row in current_board]
                temp_board[land_r][land_c] = temp_board[r][c]
                temp_board[r][c] = 0
                temp_board[enemy_r][enemy_c] = 0

                # Check for king promotion after landing, only if not king yet and reaching back row
                promoted_piece = temp_board[land_r][land_c]
                if not self._is_king(promoted_piece): # only promote if not already a king
                    if (player == 1 and land_r == 7) or (player == 2 and land_r == 0): # Adjusted promotion rows
                        temp_board[land_r][land_c] += 2 # Promote to king (1->3, 2->4)

                # Recursively look for more captures from the new position
                chained_captures = self._get_possible_captures_for_piece(land_r, land_c, player, temp_board, path + [(land_r, land_c)])

                if chained_captures: # If more captures are found, add them
                    captures.extend(chained_captures)
                else: # If no more captures, this is a terminal capture sequence
                    captures.append(path + [(land_r, land_c)])

        # If no capture was found in this step, but it's part of a longer path, return empty to signify end of chain
        if not found_capture_in_this_step and len(path) > 1:
            return []

        return captures

    def get_valid_moves(self, player):
        all_possible_moves = []
        all_possible_captures = []

        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == player:
                    # Get basic moves
                    all_possible_moves.extend(self._get_possible_moves_for_piece(r, c, player))
                    # Get captures
                    captures_from_piece = self._get_possible_captures_for_piece(r, c, player)
                    if captures_from_piece:
                        all_possible_captures.extend(captures_from_piece)

        if all_possible_captures:
            # If captures are available, they are mandatory
            return all_possible_captures
        else:
            # Otherwise, return all basic moves
            return all_possible_moves

    def make_move(self, move, player):
        # Create a new board state for the move
        new_board = CheckersBoard(board=self.board)

        if isinstance(move[0], tuple) and len(move) == 2 and isinstance(move[1], tuple): # It's a simple move ((r1,c1), (r2,c2))
            start_r, start_c = move[0]
            end_r, end_c = move[1]
            piece_to_move = new_board.board[start_r][start_c]

            new_board.board[end_r][end_c] = piece_to_move
            new_board.board[start_r][start_c] = 0

            # Check for king promotion
            if not new_board._is_king(piece_to_move): # Only promote if not already a king
                if (player == 1 and end_r == 7) or (player == 2 and end_r == 0):
                    new_board.board[end_r][end_c] += 2 # Promote to king

        elif isinstance(move, list): # It's a capture path [(r1,c1), (r2,c2), ...]
            for i in range(len(move) - 1):
                start_r, start_c = move[i]
                end_r, end_c = move[i+1]
                piece_to_move = new_board.board[start_r][start_c]

                new_board.board[end_r][end_c] = piece_to_move
                new_board.board[start_r][start_c] = 0

                # Remove captured piece
                # Calculate the position of the captured piece (midpoint)
                captured_r = (start_r + end_r) // 2
                captured_c = (start_c + end_c) // 2
                new_board.board[captured_r][captured_c] = 0

            # After the whole capture path, check for king promotion at the final landing spot
            final_r, final_c = move[-1]
            promoted_piece = new_board.board[final_r][final_c]
            if not new_board._is_king(promoted_piece): # Only promote if not already a king
                if (player == 1 and final_r == 7) or (player == 2 and final_r == 0):
                    new_board.board[final_r][final_c] += 2

        return new_board

    def is_game_over(self):
        player1_pieces = 0
        player2_pieces = 0
        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == 1:
                    player1_pieces += 1
                elif self._get_piece_owner(piece) == 2:
                    player2_pieces += 1

        # Check for no pieces left
        if player1_pieces == 0: return True, 2 # Player 1 has no pieces, Player 2 wins
        if player2_pieces == 0: return True, 1 # Player 2 has no pieces, Player 1 wins

        return False, None

    def _has_valid_moves(self, player):
        return bool(self.get_valid_moves(player))

    def check_terminal_state(self, current_player):
        player1_pieces = 0
        player2_pieces = 0
        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == 1:
                    player1_pieces += 1
                elif self._get_piece_owner(piece) == 2:
                    player2_pieces += 1

        # Condition 1: No pieces left
        if player1_pieces == 0: return True, 2 # Player 2 wins
        if player2_pieces == 0: return True, 1 # Player 1 wins

        # Condition 2: Current player has no valid moves
        if not self._has_valid_moves(current_player):
            if current_player == 1: return True, 2 # Player 1 has no moves, Player 2 wins
            if current_player == 2: return True, 1 # Player 2 has no moves, Player 1 wins

        # Default: Game is not over
        return False, None

class CheckersAgent:
    def __init__(self, player, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.player = player # 1 for player 1, 2 for player 2
        self.alpha = alpha # Learning rate
        self.gamma = gamma # Discount factor
        self.epsilon = epsilon # Exploration rate
        self.q_table = {}

    def _get_state_representation(self, board):
        # Convert the 2D list board into a tuple of tuples for hashability
        return tuple(tuple(row) for row in board.board)

    def _get_possible_actions(self, board):
        # Actions are the valid moves obtained from the CheckersBoard class
        return board.get_valid_moves(self.player)

    def get_q_value(self, state, action):
        # Get the Q-value for a given state-action pair, default to 0 if not seen
        # Actions can be tuples (for simple moves) or lists of tuples (for capture paths)
        # Convert action to a hashable format if it's a list
        hashable_action = tuple(action) if isinstance(action, list) else action
        return self.q_table.get((state, hashable_action), 0.0)

    def choose_action(self, board):
        state = self._get_state_representation(board)
        possible_actions = self._get_possible_actions(board)

        if not possible_actions:
            return None # No valid moves

        # Epsilon-greedy strategy
        if random.uniform(0, 1) < self.epsilon:
            # Explore: choose a random action
            return random.choice(possible_actions)
        else:
            # Exploit: choose the action with the highest Q-value
            q_values = [self.get_q_value(state, action) for action in possible_actions]
            max_q = -float('inf')
            best_actions = []

            for i, q_val in enumerate(q_values):
                if q_val > max_q:
                    max_q = q_val
                    best_actions = [possible_actions[i]]
                elif q_val == max_q:
                    best_actions.append(possible_actions[i])

            return random.choice(best_actions)

    def learn(self, old_state_obj, action, reward, new_state_obj, new_board_obj):
        hashable_action = tuple(action) if isinstance(action, list) else action

        old_state_rep = self._get_state_representation(old_state_obj)
        new_state_rep = self._get_state_representation(new_state_obj)

        current_q = self.get_q_value(old_state_rep, hashable_action)

        # Calculate max Q for the new state (Q(s', a'))
        # Note: new_board_obj is passed directly to _get_possible_actions
        future_possible_actions = new_board_obj.get_valid_moves(self.player) # Use new_board_obj for new_state's possible actions
        if future_possible_actions:
            max_future_q = max([self.get_q_value(new_state_rep, a) for a in future_possible_actions])
        else:
            max_future_q = 0.0 # Terminal state or no actions

        # Q-learning update rule
        self.q_table[(old_state_rep, hashable_action)] = current_q + self.alpha * (reward + self.gamma * max_future_q - current_q)

    def evaluate_board(self, board_obj):
        score = 0
        opponent_player = 3 - self.player # If player is 1, opponent is 2, and vice versa

        piece_weight = 1
        king_weight = 2

        for r in range(8):
            for c in range(8):
                piece = board_obj.board[r][c]
                owner = board_obj._get_piece_owner(piece)

                if owner == self.player:
                    if board_obj._is_king(piece):
                        score += king_weight
                    else:
                        score += piece_weight

                    # Positional bonus: pieces closer to enemy back rank
                    if self.player == 1: # Moving down, higher row is better
                        score += r * 0.1
                    else: # Moving up, lower row is better
                        score += (7 - r) * 0.1

                    # Center control bonus (simplified) for rows 2-5 and cols 2-5
                    if 2 <= r <= 5 and 2 <= c <= 5:
                        score += 0.05

                elif owner == opponent_player:
                    if board_obj._is_king(piece):
                        score -= king_weight
                    else:
                        score -= piece_weight

                    # Positional penalty for opponent
                    if self.player == 1: # Opponent moving up, lower row is worse for them
                        score -= (7 - r) * 0.1
                    else: # Opponent moving down, higher row is worse for them
                        score -= r * 0.1

                    # Opponent center control penalty
                    if 2 <= r <= 5 and 2 <= c <= 5:
                        score -= 0.05

        return score

# Reward function (can be a standalone function or part of the agent/game class)
def get_reward(board, player, opponent_player):
    is_over, winner = board.check_terminal_state(player)

    if is_over:
        if winner == player:
            return 100 # Win
        elif winner == opponent_player:
            return -100 # Loss
        else:
            return -10 # Draw or other negative outcome for current player

    # Non-terminal state rewards (adjust as needed)
    # Example: slightly positive for capturing, slightly negative for losing pieces
    return -1 # Small negative reward for each step to encourage faster wins

# Example usage:
board = CheckersBoard()
agent_p1 = CheckersAgent(player=1)
agent_p2 = CheckersAgent(player=2)

print("Initial Board:")
print(board)

# Evaluate initial board for Player 1
score_p1_initial = agent_p1.evaluate_board(board)
print(f"\nEvaluation for Player 1 (Initial Board): {score_p1_initial:.2f}")

# Evaluate initial board for Player 2
score_p2_initial = agent_p2.evaluate_board(board)
print(f"Evaluation for Player 2 (Initial Board): {score_p2_initial:.2f}")

# Simulate a move and re-evaluate
moves_p1 = board.get_valid_moves(1)
if moves_p1:
    first_move_p1 = moves_p1[0]
    new_board_p1 = board.make_move(first_move_p1, 1)
    print("\nBoard after Player 1's first move:")
    print(new_board_p1)
    score_p1_after_move = agent_p1.evaluate_board(new_board_p1)
    print(f"Evaluation for Player 1 (After Move): {score_p1_after_move:.2f}")
    score_p2_after_move = agent_p2.evaluate_board(new_board_p1)
    print(f"Evaluation for Player 2 (After Move): {score_p2_after_move:.2f}")

# Custom board state for testing captures and kings
custom_board = CheckersBoard([[0 for _ in range(8)] for _ in range(8)])
custom_board.board[2][1] = 1 # Player 1 piece
custom_board.board[3][2] = 2 # Player 2 piece to be captured
custom_board.board[4][3] = 1 # Player 1 piece (to test advanced position)
custom_board.board[0][5] = 3 # Player 1 King
custom_board.board[7][4] = 4 # Player 2 King

print("\nCustom Board:")
print(custom_board)
score_p1_custom = agent_p1.evaluate_board(custom_board)
print(f"Evaluation for Player 1 (Custom Board): {score_p1_custom:.2f}")
score_p2_custom = agent_p2.evaluate_board(custom_board)
print(f"Evaluation for Player 2 (Custom Board): {score_p2_custom:.2f}")

Initial Board:
8 | .  x  .  x  .  x  .  x 
7 | x  .  x  .  x  .  x  . 
6 | .  x  .  x  .  x  .  x 
5 | .  .  .  .  .  .  .  . 
4 | .  .  .  .  .  .  .  . 
3 | o  .  o  .  o  .  o  . 
2 | .  o  .  o  .  o  .  o 
1 | o  .  o  .  o  .  o  . 
  ------------------------
    a  b  c  d  e  f  g  h

Evaluation for Player 1 (Initial Board): -0.00
Evaluation for Player 2 (Initial Board): 0.00

Board after Player 1's first move:
8 | .  x  .  x  .  x  .  x 
7 | x  .  x  .  x  .  x  . 
6 | .  .  .  x  .  x  .  x 
5 | x  .  .  .  .  .  .  . 
4 | .  .  .  .  .  .  .  . 
3 | o  .  o  .  o  .  o  . 
2 | .  o  .  o  .  o  .  o 
1 | o  .  o  .  o  .  o  . 
  ------------------------
    a  b  c  d  e  f  g  h
Evaluation for Player 1 (After Move): 0.10
Evaluation for Player 2 (After Move): -0.10

Custom Board:
8 | .  .  .  .  .  X  .  . 
7 | .  .  .  .  .  .  .  . 
6 | .  x  .  .  .  .  .  . 
5 | .  .  o  .  .  .  .  . 
4 | .  .  .  x  .  .  .  . 
3 | .  .  .  .  .  .  .  . 
2 | .  .  .  .  .  .  .  . 
1

## Implementação do Minimax com Poda Alpha-Beta

### Subtask:
Integrar o algoritmo Minimax com busca de corte (poda Alpha-Beta) para auxiliar na tomada de decisão dos agentes, utilizando a função de avaliação heurística desenvolvida.


**Reasoning**:
The subtask requires implementing the Minimax algorithm with Alpha-Beta pruning. This will be done by adding a recursive `minimax` method to the `CheckersAgent` class, handling base cases, maximizing and minimizing player logic, and alpha-beta pruning. Then, a `choose_action_minimax` method will be added to leverage this minimax function for selecting the optimal move. The `CheckersBoard` class needs to be included for context.



In [8]:
import random
import math

class CheckersBoard:
    def __init__(self, board=None):
        if board is None:
            self.board = [[0 for _ in range(8)] for _ in range(8)]
            # 0: empty, 1: player 1 piece, 2: player 2 piece
            # 3: player 1 king, 4: player 2 king

            # Setup initial pieces for Player 1 (top of the board, rows 0-2)
            for row in range(3):
                for col in range(8):
                    if (row + col) % 2 != 0:
                        self.board[row][col] = 1

            # Setup initial pieces for Player 2 (bottom of the board, rows 5-7)
            for row in range(5, 8):
                for col in range(8):
                    if (row + col) % 2 != 0:
                        self.board[row][col] = 2
        else:
            self.board = [row[:] for row in board] # Deep copy of the board

    def __str__(self):
        board_str = ""
        for r_idx, row in enumerate(self.board):
            board_str += f"{8 - r_idx} |"
            for piece in row:
                if piece == 0:
                    board_str += " . "
                elif piece == 1:
                    board_str += " x "
                elif piece == 2:
                    board_str += " o "
                elif piece == 3:
                    board_str += " X "
                elif piece == 4:
                    board_str += " O "
            board_str += "\n"
        board_str += "  ------------------------\n"
        board_str += "    a  b  c  d  e  f  g  h"
        return board_str

    def _is_valid_coord(self, r, c):
        return 0 <= r < 8 and 0 <= c < 8

    def _get_piece_owner(self, piece):
        if piece == 1 or piece == 3:
            return 1 # Player 1
        elif piece == 2 or piece == 4:
            return 2 # Player 2
        return 0 # Empty or invalid

    def _is_king(self, piece):
        return piece == 3 or piece == 4

    def _get_possible_moves_for_piece(self, r, c, player):
        moves = []
        piece = self.board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            directions = [(-1, -1), (-1, 1)]

        # Kings can move in all 4 diagonal directions
        if is_king:
            directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        for dr, dc in directions:
            nr, nc = r + dr, c + dc
            if self._is_valid_coord(nr, nc) and self.board[nr][nc] == 0:
                moves.append(((r, c), (nr, nc)))
        return moves

    def _get_possible_captures_for_piece(self, r, c, player, current_board=None, path=None):
        if current_board is None:
            current_board = [row[:] for row in self.board]
        if path is None:
            path = [(r, c)]

        captures = []
        piece = current_board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            capture_directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            capture_directions = [(-1, -1), (-1, 1)]

        # Kings can capture in all 4 diagonal directions
        if is_king:
            capture_directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        found_capture_in_this_step = False
        for dr, dc in capture_directions:
            # Square of the opponent's piece to be jumped over
            enemy_r, enemy_c = r + dr, c + dc
            # Landing square after the jump
            land_r, land_c = r + 2 * dr, c + 2 * dc

            if (self._is_valid_coord(land_r, land_c) and
                    current_board[land_r][land_c] == 0 and
                    self._is_valid_coord(enemy_r, enemy_c) and
                    current_board[enemy_r][enemy_c] != 0 and
                    self._get_piece_owner(current_board[enemy_r][enemy_c]) != player):

                found_capture_in_this_step = True
                # Simulate the capture on a temporary board
                temp_board = [row[:] for row in current_board]
                temp_board[land_r][land_c] = temp_board[r][c]
                temp_board[r][c] = 0
                temp_board[enemy_r][enemy_c] = 0

                # Check for king promotion after landing, only if not king yet and reaching back row
                promoted_piece = temp_board[land_r][land_c]
                if not self._is_king(promoted_piece): # only promote if not already a king
                    if (player == 1 and land_r == 7) or (player == 2 and land_r == 0): # Adjusted promotion rows
                        temp_board[land_r][land_c] += 2 # Promote to king (1->3, 2->4)

                # Recursively look for more captures from the new position
                chained_captures = self._get_possible_captures_for_piece(land_r, land_c, player, temp_board, path + [(land_r, land_c)])

                if chained_captures: # If more captures are found, add them
                    captures.extend(chained_captures)
                else: # If no more captures, this is a terminal capture sequence
                    captures.append(path + [(land_r, land_c)])

        # If no capture was found in this step, but it's part of a longer path, return empty to signify end of chain
        if not found_capture_in_this_step and len(path) > 1:
            return []

        return captures

    def get_valid_moves(self, player):
        all_possible_moves = []
        all_possible_captures = []

        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == player:
                    # Get basic moves
                    all_possible_moves.extend(self._get_possible_moves_for_piece(r, c, player))
                    # Get captures
                    captures_from_piece = self._get_possible_captures_for_piece(r, c, player)
                    if captures_from_piece:
                        all_possible_captures.extend(captures_from_piece)

        if all_possible_captures:
            # If captures are available, they are mandatory
            return all_possible_captures
        else:
            # Otherwise, return all basic moves
            return all_possible_moves

    def make_move(self, move, player):
        # Create a new board state for the move
        new_board = CheckersBoard(board=self.board)

        if isinstance(move[0], tuple) and len(move) == 2 and isinstance(move[1], tuple): # It's a simple move ((r1,c1), (r2,c2))
            start_r, start_c = move[0]
            end_r, end_c = move[1]
            piece_to_move = new_board.board[start_r][start_c]

            new_board.board[end_r][end_c] = piece_to_move
            new_board.board[start_r][start_c] = 0

            # Check for king promotion
            if not new_board._is_king(piece_to_move): # Only promote if not already a king
                if (player == 1 and end_r == 7) or (player == 2 and end_r == 0):
                    new_board.board[end_r][end_c] += 2 # Promote to king

        elif isinstance(move, list): # It's a capture path [(r1,c1), (r2,c2), ...]
            for i in range(len(move) - 1):
                start_r, start_c = move[i]
                end_r, end_c = move[i+1]
                piece_to_move = new_board.board[start_r][start_c]

                new_board.board[end_r][end_c] = piece_to_move
                new_board.board[start_r][start_c] = 0

                # Remove captured piece
                # Calculate the position of the captured piece (midpoint)
                captured_r = (start_r + end_r) // 2
                captured_c = (start_c + end_c) // 2
                new_board.board[captured_r][captured_c] = 0

            # After the whole capture path, check for king promotion at the final landing spot
            final_r, final_c = move[-1]
            promoted_piece = new_board.board[final_r][final_c]
            if not new_board._is_king(promoted_piece): # Only promote if not already a king
                if (player == 1 and final_r == 7) or (player == 2 and final_r == 0):
                    new_board.board[final_r][final_c] += 2

        return new_board

    def is_game_over(self):
        player1_pieces = 0
        player2_pieces = 0
        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == 1:
                    player1_pieces += 1
                elif self._get_piece_owner(piece) == 2:
                    player2_pieces += 1

        # Check for no pieces left
        if player1_pieces == 0: return True, 2 # Player 1 has no pieces, Player 2 wins
        if player2_pieces == 0: return True, 1 # Player 2 has no pieces, Player 1 wins

        return False, None

    def _has_valid_moves(self, player):
        return bool(self.get_valid_moves(player))

    def check_terminal_state(self, current_player):
        player1_pieces = 0
        player2_pieces = 0
        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == 1:
                    player1_pieces += 1
                elif self._get_piece_owner(piece) == 2:
                    player2_pieces += 1

        # Condition 1: No pieces left
        if player1_pieces == 0: return True, 2 # Player 2 wins
        if player2_pieces == 0: return True, 1 # Player 1 wins

        # Condition 2: Current player has no valid moves
        if not self._has_valid_moves(current_player):
            if current_player == 1: return True, 2 # Player 1 has no moves, Player 2 wins
            if current_player == 2: return True, 1 # Player 2 has no moves, Player 1 wins

        # Default: Game is not over
        return False, None

class CheckersAgent:
    def __init__(self, player, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.player = player # 1 for player 1, 2 for player 2
        self.alpha = alpha # Learning rate
        self.gamma = gamma # Discount factor
        self.epsilon = epsilon # Exploration rate
        self.q_table = {}

    def _get_state_representation(self, board):
        # Convert the 2D list board into a tuple of tuples for hashability
        return tuple(tuple(row) for row in board.board)

    def _get_possible_actions(self, board):
        # Actions are the valid moves obtained from the CheckersBoard class
        return board.get_valid_moves(self.player)

    def get_q_value(self, state, action):
        # Get the Q-value for a given state-action pair, default to 0 if not seen
        # Actions can be tuples (for simple moves) or lists of tuples (for capture paths)
        # Convert action to a hashable format if it's a list
        hashable_action = tuple(action) if isinstance(action, list) else action
        return self.q_table.get((state, hashable_action), 0.0)

    def choose_action(self, board):
        state = self._get_state_representation(board)
        possible_actions = self._get_possible_actions(board)

        if not possible_actions:
            return None # No valid moves

        # Epsilon-greedy strategy
        if random.uniform(0, 1) < self.epsilon:
            # Explore: choose a random action
            return random.choice(possible_actions)
        else:
            # Exploit: choose the action with the highest Q-value
            q_values = [self.get_q_value(state, action) for action in possible_actions]
            max_q = -float('inf')
            best_actions = []

            for i, q_val in enumerate(q_values):
                if q_val > max_q:
                    max_q = q_val
                    best_actions = [possible_actions[i]]
                elif q_val == max_q:
                    best_actions.append(possible_actions[i])

            return random.choice(best_actions)

    def learn(self, old_state_obj, action, reward, new_state_obj, new_board_obj):
        hashable_action = tuple(action) if isinstance(action, list) else action

        old_state_rep = self._get_state_representation(old_state_obj)
        new_state_rep = self._get_state_representation(new_state_obj)

        current_q = self.get_q_value(old_state_rep, hashable_action)

        # Calculate max Q for the new state (Q(s', a'))
        # Note: new_board_obj is passed directly to _get_possible_actions
        future_possible_actions = new_board_obj.get_valid_moves(self.player) # Use new_board_obj for new_state's possible actions
        if future_possible_actions:
            max_future_q = max([self.get_q_value(new_state_rep, a) for a in future_possible_actions])
        else:
            max_future_q = 0.0 # Terminal state or no actions

        # Q-learning update rule
        self.q_table[(old_state_rep, hashable_action)] = current_q + self.alpha * (reward + self.gamma * max_future_q - current_q)

    def evaluate_board(self, board_obj):
        score = 0
        opponent_player = 3 - self.player # If player is 1, opponent is 2, and vice versa

        piece_weight = 1
        king_weight = 2

        for r in range(8):
            for c in range(8):
                piece = board_obj.board[r][c]
                owner = board_obj._get_piece_owner(piece)

                if owner == self.player:
                    if board_obj._is_king(piece):
                        score += king_weight
                    else:
                        score += piece_weight

                    # Positional bonus: pieces closer to enemy back rank
                    if self.player == 1: # Moving down, higher row is better
                        score += r * 0.1
                    else: # Moving up, lower row is better
                        score += (7 - r) * 0.1

                    # Center control bonus (simplified) for rows 2-5 and cols 2-5
                    if 2 <= r <= 5 and 2 <= c <= 5:
                        score += 0.05

                elif owner == opponent_player:
                    if board_obj._is_king(piece):
                        score -= king_weight
                    else:
                        score -= piece_weight

                    # Positional penalty for opponent
                    if self.player == 1: # Opponent moving up, lower row is worse for them
                        score -= (7 - r) * 0.1
                    else: # Opponent moving down, higher row is worse for them
                        score -= r * 0.1

                    # Opponent center control penalty
                    if 2 <= r <= 5 and 2 <= c <= 5:
                        score -= 0.05

        return score

    def minimax(self, board_obj, depth, alpha, beta, maximizing_player):
        current_player_in_minimax = self.player if maximizing_player else (3 - self.player)
        is_over, winner = board_obj.check_terminal_state(current_player_in_minimax)

        if depth == 0 or is_over:
            if is_over:
                if winner == self.player: # If our agent wins
                    return 1000000000 + depth # Large positive value for win
                elif winner == (3 - self.player): # If opponent wins
                    return -1000000000 - depth # Large negative value for loss
                else: # Draw, very unlikely with current terminal state checks
                    return 0
            return self.evaluate_board(board_obj)

        if maximizing_player:
            max_eval = -math.inf
            valid_moves = board_obj.get_valid_moves(self.player)
            if not valid_moves: # If no moves, consider it a loss for this player's turn
                return -1000000000 - depth # Opponent wins
            for move in valid_moves:
                new_board = board_obj.make_move(move, self.player)
                eval = self.minimax(new_board, depth - 1, alpha, beta, False)
                max_eval = max(max_eval, eval)
                alpha = max(alpha, eval)
                if beta <= alpha:
                    break # Alpha Cut-off
            return max_eval
        else:
            min_eval = math.inf
            opponent_player = 3 - self.player
            valid_moves = board_obj.get_valid_moves(opponent_player)
            if not valid_moves: # If no moves, consider it a win for this player's turn
                return 1000000000 + depth # Current player wins (as opponent can't move)
            for move in valid_moves:
                new_board = board_obj.make_move(move, opponent_player)
                eval = self.minimax(new_board, depth - 1, alpha, beta, True)
                min_eval = min(min_eval, eval)
                beta = min(beta, eval)
                if beta <= alpha:
                    break # Beta Cut-off
            return min_eval

    def choose_action_minimax(self, board_obj, depth=3):
        possible_actions = board_obj.get_valid_moves(self.player)

        if not possible_actions:
            return None # No valid moves

        # If only one action, take it without further calculation
        if len(possible_actions) == 1:
            return possible_actions[0]

        best_eval = -math.inf
        best_move = None
        alpha = -math.inf
        beta = math.inf

        for move in possible_actions:
            new_board = board_obj.make_move(move, self.player)
            # Call minimax for the next state, assuming opponent will play optimally (minimizing player)
            eval = self.minimax(new_board, depth - 1, alpha, beta, False) # After my move, it's opponent's turn, so False (minimizing)

            if eval > best_eval:
                best_eval = eval
                best_move = move
            alpha = max(alpha, eval) # Update alpha based on maximizing player's perspective

        return best_move

# Reward function (can be a standalone function or part of the agent/game class)
def get_reward(board, player, opponent_player):
    is_over, winner = board.check_terminal_state(player)

    if is_over:
        if winner == player:
            return 100 # Win
        elif winner == opponent_player:
            return -100 # Loss
        else:
            return -10 # Draw or other negative outcome for current player

    # Non-terminal state rewards (adjust as needed)
    # Example: slightly positive for capturing, slightly negative for losing pieces
    return -1 # Small negative reward for each step to encourage faster wins

# Example usage:
board = CheckersBoard()
agent_p1 = CheckersAgent(player=1)
agent_p2 = CheckersAgent(player=2)

print("Initial Board:")
print(board)

print("\nPlayer 1 (Minimax) choosing action...")
minimax_move_p1 = agent_p1.choose_action_minimax(board, depth=3)

if minimax_move_p1:
    print(f"Player 1 chose move: {minimax_move_p1}")
    new_board_minimax_p1 = board.make_move(minimax_move_p1, 1)
    print("Board after Player 1's Minimax move:")
    print(new_board_minimax_p1)
else:
    print("Player 1 (Minimax) has no moves.")

# To test Player 2's minimax move, we would need to simulate the game flow
# For simplicity, let's show P2 choosing a move on the original board
print("\nPlayer 2 (Minimax) choosing action on initial board (for demonstration)...")
minimax_move_p2 = agent_p2.choose_action_minimax(board, depth=3)
if minimax_move_p2:
    print(f"Player 2 chose move: {minimax_move_p2}")
    new_board_minimax_p2 = board.make_move(minimax_move_p2, 2)
    print("Board after Player 2's Minimax move:")
    print(new_board_minimax_p2)
else:
    print("Player 2 (Minimax) has no moves.")

Initial Board:
8 | .  x  .  x  .  x  .  x 
7 | x  .  x  .  x  .  x  . 
6 | .  x  .  x  .  x  .  x 
5 | .  .  .  .  .  .  .  . 
4 | .  .  .  .  .  .  .  . 
3 | o  .  o  .  o  .  o  . 
2 | .  o  .  o  .  o  .  o 
1 | o  .  o  .  o  .  o  . 
  ------------------------
    a  b  c  d  e  f  g  h

Player 1 (Minimax) choosing action...
Player 1 chose move: ((2, 1), (3, 2))
Board after Player 1's Minimax move:
8 | .  x  .  x  .  x  .  x 
7 | x  .  x  .  x  .  x  . 
6 | .  .  .  x  .  x  .  x 
5 | .  .  x  .  .  .  .  . 
4 | .  .  .  .  .  .  .  . 
3 | o  .  o  .  o  .  o  . 
2 | .  o  .  o  .  o  .  o 
1 | o  .  o  .  o  .  o  . 
  ------------------------
    a  b  c  d  e  f  g  h

Player 2 (Minimax) choosing action on initial board (for demonstration)...
Player 2 chose move: ((5, 0), (4, 1))
Board after Player 2's Minimax move:
8 | .  x  .  x  .  x  .  x 
7 | x  .  x  .  x  .  x  . 
6 | .  x  .  x  .  x  .  x 
5 | .  .  .  .  .  .  .  . 
4 | .  o  .  .  .  .  .  . 
3 | .  .  o  .  o  .  o 

## Treinamento e Integração dos Agentes

### Subtask:
Configurar o loop de treinamento onde os dois agentes (cada um utilizando Q-learning e Minimax) jogam um contra o outro. O Q-learning ajustará os Q-valores com base nas experiências de jogo e nas recompensas, enquanto o Minimax guiará a escolha de movimentos com base na heurística aprendida.


**Reasoning**:
The subtask requires implementing the training loop. This step defines the `train_agents` function that orchestrates the game between a Q-learning agent and a Minimax agent over multiple episodes, handling turns, moves, Q-value updates, and exploration-exploitation decay.



In [None]:
import random
import math

class CheckersBoard:
    def __init__(self, board=None):
        if board is None:
            self.board = [[0 for _ in range(8)] for _ in range(8)]
            # 0: empty, 1: player 1 piece, 2: player 2 piece
            # 3: player 1 king, 4: player 2 king

            # Setup initial pieces for Player 1 (top of the board, rows 0-2)
            for row in range(3):
                for col in range(8):
                    if (row + col) % 2 != 0:
                        self.board[row][col] = 1

            # Setup initial pieces for Player 2 (bottom of the board, rows 5-7)
            for row in range(5, 8):
                for col in range(8):
                    if (row + col) % 2 != 0:
                        self.board[row][col] = 2
        else:
            self.board = [row[:] for row in board] # Deep copy of the board

    def __str__(self):
        board_str = ""
        for r_idx, row in enumerate(self.board):
            board_str += f"{8 - r_idx} |"
            for piece in row:
                if piece == 0:
                    board_str += " . "
                elif piece == 1:
                    board_str += " x "
                elif piece == 2:
                    board_str += " o "
                elif piece == 3:
                    board_str += " X "
                elif piece == 4:
                    board_str += " O "
            board_str += "\n"
        board_str += "  ------------------------\n"
        board_str += "    a  b  c  d  e  f  g  h"
        return board_str

    def _is_valid_coord(self, r, c):
        return 0 <= r < 8 and 0 <= c < 8

    def _get_piece_owner(self, piece):
        if piece == 1 or piece == 3:
            return 1 # Player 1
        elif piece == 2 or piece == 4:
            return 2 # Player 2
        return 0 # Empty or invalid

    def _is_king(self, piece):
        return piece == 3 or piece == 4

    def _get_possible_moves_for_piece(self, r, c, player):
        moves = []
        piece = self.board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            directions = [(-1, -1), (-1, 1)]

        # Kings can move in all 4 diagonal directions
        if is_king:
            directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        for dr, dc in directions:
            nr, nc = r + dr, c + dc
            if self._is_valid_coord(nr, nc) and self.board[nr][nc] == 0:
                moves.append(((r, c), (nr, nc)))
        return moves

    def _get_possible_captures_for_piece(self, r, c, player, current_board=None, path=None):
        if current_board is None:
            current_board = [row[:] for row in self.board]
        if path is None:
            path = [(r, c)]

        captures = []
        piece = current_board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            capture_directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            capture_directions = [(-1, -1), (-1, 1)]

        # Kings can capture in all 4 diagonal directions
        if is_king:
            capture_directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        found_capture_in_this_step = False
        for dr, dc in capture_directions:
            # Square of the opponent's piece to be jumped over
            enemy_r, enemy_c = r + dr, c + dc
            # Landing square after the jump
            land_r, land_c = r + 2 * dr, c + 2 * dc

            if (self._is_valid_coord(land_r, land_c) and
                    current_board[land_r][land_c] == 0 and
                    self._is_valid_coord(enemy_r, enemy_c) and
                    current_board[enemy_r][enemy_c] != 0 and
                    self._get_piece_owner(current_board[enemy_r][enemy_c]) != player):

                found_capture_in_this_step = True
                # Simulate the capture on a temporary board
                temp_board = [row[:] for row in current_board]
                temp_board[land_r][land_c] = temp_board[r][c]
                temp_board[r][c] = 0
                temp_board[enemy_r][enemy_c] = 0

                # Check for king promotion after landing, only if not king yet and reaching back row
                promoted_piece = temp_board[land_r][land_c]
                if not self._is_king(promoted_piece): # only promote if not already a king
                    if (player == 1 and land_r == 7) or (player == 2 and land_r == 0): # Adjusted promotion rows
                        temp_board[land_r][land_c] += 2 # Promote to king (1->3, 2->4)

                # Recursively look for more captures from the new position
                chained_captures = self._get_possible_captures_for_piece(land_r, land_c, player, temp_board, path + [(land_r, land_c)])

                if chained_captures: # If more captures are found, add them
                    captures.extend(chained_captures)
                else: # If no more captures, this is a terminal capture sequence
                    captures.append(path + [(land_r, land_c)])

        # If no capture was found in this step, but it's part of a longer path, return empty to signify end of chain
        if not found_capture_in_this_step and len(path) > 1:
            return []

        return captures

    def get_valid_moves(self, player):
        all_possible_moves = []
        all_possible_captures = []

        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == player:
                    # Get basic moves
                    all_possible_moves.extend(self._get_possible_moves_for_piece(r, c, player))
                    # Get captures
                    captures_from_piece = self._get_possible_captures_for_piece(r, c, player)
                    if captures_from_piece:
                        all_possible_captures.extend(captures_from_piece)

        if all_possible_captures:
            # If captures are available, they are mandatory
            return all_possible_captures
        else:
            # Otherwise, return all basic moves
            return all_possible_moves

    def make_move(self, move, player):
        # Create a new board state for the move
        new_board = CheckersBoard(board=self.board)

        if isinstance(move[0], tuple) and len(move) == 2 and isinstance(move[1], tuple): # It's a simple move ((r1,c1), (r2,c2))
            start_r, start_c = move[0]
            end_r, end_c = move[1]
            piece_to_move = new_board.board[start_r][start_c]

            new_board.board[end_r][end_c] = piece_to_move
            new_board.board[start_r][start_c] = 0

            # Check for king promotion
            if not new_board._is_king(piece_to_move): # Only promote if not already a king
                if (player == 1 and end_r == 7) or (player == 2 and end_r == 0):
                    new_board.board[end_r][end_c] += 2 # Promote to king

        elif isinstance(move, list): # It's a capture path [(r1,c1), (r2,c2), ...]
            for i in range(len(move) - 1):
                start_r, start_c = move[i]
                end_r, end_c = move[i+1]
                piece_to_move = new_board.board[start_r][start_c]

                new_board.board[end_r][end_c] = piece_to_move
                new_board.board[start_r][start_c] = 0

                # Remove captured piece
                # Calculate the position of the captured piece (midpoint)
                captured_r = (start_r + end_r) // 2
                captured_c = (start_c + end_c) // 2
                new_board.board[captured_r][captured_c] = 0

            # After the whole capture path, check for king promotion at the final landing spot
            final_r, final_c = move[-1]
            promoted_piece = new_board.board[final_r][final_c]
            if not new_board._is_king(promoted_piece): # Only promote if not already a king
                if (player == 1 and final_r == 7) or (player == 2 and final_r == 0):
                    new_board.board[final_r][final_c] += 2

        return new_board

    def is_game_over(self):
        player1_pieces = 0
        player2_pieces = 0
        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == 1:
                    player1_pieces += 1
                elif self._get_piece_owner(piece) == 2:
                    player2_pieces += 1

        # Check for no pieces left
        if player1_pieces == 0: return True, 2 # Player 1 has no pieces, Player 2 wins
        if player2_pieces == 0: return True, 1 # Player 2 has no pieces, Player 1 wins

        return False, None

    def _has_valid_moves(self, player):
        return bool(self.get_valid_moves(player))

    def check_terminal_state(self, current_player):
        player1_pieces = 0
        player2_pieces = 0
        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == 1:
                    player1_pieces += 1
                elif self._get_piece_owner(piece) == 2:
                    player2_pieces += 1

        # Condition 1: No pieces left
        if player1_pieces == 0: return True, 2 # Player 2 wins
        if player2_pieces == 0: return True, 1 # Player 1 wins

        # Condition 2: Current player has no valid moves
        if not self._has_valid_moves(current_player):
            if current_player == 1: return True, 2 # Player 1 has no moves, Player 2 wins
            if current_player == 2: return True, 1 # Player 2 has no moves, Player 1 wins

        # Default: Game is not over
        return False, None

class CheckersAgent:
    def __init__(self, player, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.player = player # 1 for player 1, 2 for player 2
        self.alpha = alpha # Learning rate
        self.gamma = gamma # Discount factor
        self.epsilon = epsilon # Exploration rate
        self.q_table = {}

    def _get_state_representation(self, board):
        # Convert the 2D list board into a tuple of tuples for hashability
        return tuple(tuple(row) for row in board.board)

    def _get_possible_actions(self, board):
        # Actions are the valid moves obtained from the CheckersBoard class
        return board.get_valid_moves(self.player)

    def get_q_value(self, state, action):
        # Get the Q-value for a given state-action pair, default to 0 if not seen
        # Actions can be tuples (for simple moves) or lists of tuples (for capture paths)
        # Convert action to a hashable format if it's a list
        hashable_action = tuple(action) if isinstance(action, list) else action
        return self.q_table.get((state, hashable_action), 0.0)

    def choose_action(self, board):
        state = self._get_state_representation(board)
        possible_actions = self._get_possible_actions(board)

        if not possible_actions:
            return None # No valid moves

        # Epsilon-greedy strategy
        if random.uniform(0, 1) < self.epsilon:
            # Explore: choose a random action
            return random.choice(possible_actions)
        else:
            # Exploit: choose the action with the highest Q-value
            q_values = [self.get_q_value(state, action) for action in possible_actions]
            max_q = -float('inf')
            best_actions = []

            for i, q_val in enumerate(q_values):
                if q_val > max_q:
                    max_q = q_val
                    best_actions = [possible_actions[i]]
                elif q_val == max_q:
                    best_actions.append(possible_actions[i])

            return random.choice(best_actions)

    def learn(self, old_state_obj, action, reward, new_state_obj, new_board_obj):
        hashable_action = tuple(action) if isinstance(action, list) else action

        old_state_rep = self._get_state_representation(old_state_obj)
        new_state_rep = self._get_state_representation(new_state_obj)

        current_q = self.get_q_value(old_state_rep, hashable_action)

        # Calculate max Q for the new state (Q(s', a'))
        # Note: new_board_obj is passed directly to _get_possible_actions
        future_possible_actions = new_board_obj.get_valid_moves(self.player) # Use new_board_obj for new_state's possible actions
        if future_possible_actions:
            max_future_q = max([self.get_q_value(new_state_rep, a) for a in future_possible_actions])
        else:
            max_future_q = 0.0 # Terminal state or no actions

        # Q-learning update rule
        self.q_table[(old_state_rep, hashable_action)] = current_q + self.alpha * (reward + self.gamma * max_future_q - current_q)

    def evaluate_board(self, board_obj):
        score = 0
        opponent_player = 3 - self.player # If player is 1, opponent is 2, and vice versa

        piece_weight = 1
        king_weight = 2

        for r in range(8):
            for c in range(8):
                piece = board_obj.board[r][c]
                owner = board_obj._get_piece_owner(piece)

                if owner == self.player:
                    if board_obj._is_king(piece):
                        score += king_weight
                    else:
                        score += piece_weight

                    # Positional bonus: pieces closer to enemy back rank
                    if self.player == 1: # Moving down, higher row is better
                        score += r * 0.1
                    else: # Moving up, lower row is better
                        score += (7 - r) * 0.1

                    # Center control bonus (simplified) for rows 2-5 and cols 2-5
                    if 2 <= r <= 5 and 2 <= c <= 5:
                        score += 0.05

                elif owner == opponent_player:
                    if board_obj._is_king(piece):
                        score -= king_weight
                    else:
                        score -= piece_weight

                    # Positional penalty for opponent
                    if self.player == 1: # Opponent moving up, lower row is worse for them
                        score -= (7 - r) * 0.1
                    else: # Opponent moving down, higher row is worse for them
                        score -= r * 0.1

                    # Opponent center control penalty
                    if 2 <= r <= 5 and 2 <= c <= 5:
                        score -= 0.05

        return score

    def minimax(self, board_obj, depth, alpha, beta, maximizing_player):
        current_player_in_minimax = self.player if maximizing_player else (3 - self.player)
        is_over, winner = board_obj.check_terminal_state(current_player_in_minimax)

        if depth == 0 or is_over:
            if is_over:
                if winner == self.player: # If our agent wins
                    return 1000000000 + depth # Large positive value for win
                elif winner == (3 - self.player): # If opponent wins
                    return -1000000000 - depth # Large negative value for loss
                else: # Draw, very unlikely with current terminal state checks
                    return 0
            return self.evaluate_board(board_obj)

        if maximizing_player:
            max_eval = -math.inf
            valid_moves = board_obj.get_valid_moves(self.player)
            if not valid_moves: # If no moves, consider it a loss for this player's turn
                return -1000000000 - depth # Opponent wins
            for move in valid_moves:
                new_board = board_obj.make_move(move, self.player)
                eval = self.minimax(new_board, depth - 1, alpha, beta, False)
                max_eval = max(max_eval, eval)
                alpha = max(alpha, eval)
                if beta <= alpha:
                    break # Alpha Cut-off
            return max_eval
        else:
            min_eval = math.inf
            opponent_player = 3 - self.player
            valid_moves = board_obj.get_valid_moves(opponent_player)
            if not valid_moves: # If no moves, consider it a win for this player's turn
                return 1000000000 + depth # Current player wins (as opponent can't move)
            for move in valid_moves:
                new_board = board_obj.make_move(move, opponent_player)
                eval = self.minimax(new_board, depth - 1, alpha, beta, True)
                min_eval = min(min_eval, eval)
                beta = min(beta, eval)
                if beta <= alpha:
                    break # Beta Cut-off
            return min_eval

    def choose_action_minimax(self, board_obj, depth=3):
        possible_actions = board_obj.get_valid_moves(self.player)

        if not possible_actions:
            return None # No valid moves

        # If only one action, take it without further calculation
        if len(possible_actions) == 1:
            return possible_actions[0]

        best_eval = -math.inf
        best_move = None
        alpha = -math.inf
        beta = math.inf

        for move in possible_actions:
            new_board = board_obj.make_move(move, self.player)
            # Call minimax for the next state, assuming opponent will play optimally (minimizing player)
            eval = self.minimax(new_board, depth - 1, alpha, beta, False) # After my move, it's opponent's turn, so False (minimizing)

            if eval > best_eval:
                best_eval = eval
                best_move = move
            alpha = max(alpha, eval) # Update alpha based on maximizing player's perspective

        return best_move

# Reward function (can be a standalone function or part of the agent/game class)
def get_reward(board, player, opponent_player):
    is_over, winner = board.check_terminal_state(player)

    if is_over:
        if winner == player:
            return 100 # Win
        elif winner == opponent_player:
            return -100 # Loss
        else:
            # This case covers draws or scenarios where a player cannot move.
            # If `check_terminal_state` returns a winner for `opponent_player` because `player` had no moves,
            # that's already covered by the -100 case. If it's a true draw, then it's a neutral outcome.
            # For now, let's keep it simple; a 'draw' is less bad than a loss.
            return -5 # Default for other terminal states (e.g., if neither can move, or stalemate type)

    # Non-terminal state rewards (adjust as needed)
    # Example: slightly positive for capturing, slightly negative for losing pieces
    return -1 # Small negative reward for each step to encourage faster wins

def train_agents(num_episodes, q_agent_player, minimax_agent_player, minimax_depth, epsilon_decay_rate, max_moves_per_game):
    # Initialize agents
    q_agent = CheckersAgent(player=q_agent_player, epsilon=1.0) # Start with high exploration
    minimax_agent = CheckersAgent(player=minimax_agent_player, epsilon=0.0) # Minimax is always greedy

    # Track wins for analysis
    q_wins = 0
    minimax_wins = 0
    draws = 0

    for episode in range(num_episodes):
        board = CheckersBoard() # New board for each episode
        current_player = 1 # Start with player 1
        game_over = False
        move_count = 0

        last_q_agent_state = None
        last_q_agent_action = None

        # Store historical boards to detect draws by repetition (simple check)
        board_history = []

        while not game_over and move_count < max_moves_per_game:
            board_rep = tuple(tuple(row) for row in board.board)
            if board_rep in board_history:
                # Draw by repetition
                game_over = True
                draws += 1
                # Give Q-agent a negative reward for drawing
                if last_q_agent_state and last_q_agent_action:
                    q_agent.learn(last_q_agent_state, last_q_agent_action, -10, board, board)
                break
            board_history.append(board_rep)

            old_board_obj = CheckersBoard(board=board.board) # Capture state before move
            action = None

            if current_player == q_agent_player:
                action = q_agent.choose_action(board)
                # Store state and action for potential Q-learning update if game ends on opponent's turn
                last_q_agent_state = old_board_obj
                last_q_agent_action = action

            elif current_player == minimax_agent_player:
                action = minimax_agent.choose_action_minimax(board, depth=minimax_depth)
                # Reset last Q-agent state/action as Minimax took a turn
                last_q_agent_state = None
                last_q_agent_action = None

            if action is None: # Current player has no valid moves
                game_over = True
                if current_player == q_agent_player:
                    # Q-agent loses because it has no moves
                    q_agent.learn(last_q_agent_state, last_q_agent_action, -100, board, board) # Final loss penalty
                    minimax_wins += 1
                else:
                    # Minimax loses because it has no moves
                    # If the Q-agent made the previous move and now Minimax has no moves, Q-agent wins.
                    # We need to ensure the Q-agent's last action is rewarded for opponent's inability to move.
                    if last_q_agent_state and last_q_agent_action:
                        q_agent.learn(last_q_agent_state, last_q_agent_action, 100, board, board) # Q-agent gets win reward
                    q_wins += 1
                break

            # Make the move on the board
            new_board = board.make_move(action, current_player)
            board = new_board # Update the current board reference

            move_count += 1

            # Check for terminal state after the move
            is_game_over, winner = board.check_terminal_state(3 - current_player) # Check for next player's turn to determine game over
            game_over = is_game_over

            if game_over:
                reward = get_reward(board, current_player, 3 - current_player)
                if current_player == q_agent_player:
                    q_agent.learn(old_board_obj, action, reward, board, board)
                    if winner == q_agent_player: q_wins += 1
                    elif winner == minimax_agent_player: minimax_wins += 1
                    else: draws += 1 # Should ideally be covered by win/loss if terminal_state is accurate
                elif current_player == minimax_agent_player:
                    # If Minimax made the winning move, and Q-agent was the previous player,
                    # Q-agent's previous action needs to be negatively reinforced (or opponent's win rewarded).
                    # This might require a more sophisticated reward propagation for Q-learning.
                    # For now, we only learn for Q-agent's moves.
                    if winner == q_agent_player: q_wins += 1
                    elif winner == minimax_agent_player: minimax_wins += 1
                    else: draws += 1
                break # Game is over, break from inner game loop

            # If game is not over and it was Q-agent's turn, learn from non-terminal state
            if current_player == q_agent_player:
                reward = get_reward(board, current_player, 3 - current_player)
                q_agent.learn(old_board_obj, action, reward, board, board)

            current_player = 3 - current_player # Switch turns

        if move_count >= max_moves_per_game and not game_over: # Max moves reached, consider it a draw
            draws += 1
            if last_q_agent_state and last_q_agent_action:
                q_agent.learn(last_q_agent_state, last_q_agent_action, -5, board, board) # Small penalty for draw

        # Epsilon decay
        q_agent.epsilon = max(0.01, q_agent.epsilon * epsilon_decay_rate)

        if (episode + 1) % 1000 == 0:  # Changed to report every 1000 episodes
            print(f"Episode {episode + 1}/{num_episodes} - Q-wins: {q_wins}, Minimax-wins: {minimax_wins}, Draws: {draws}, Epsilon: {q_agent.epsilon:.4f}")
            # Reset for next reporting period or just keep cumulative for overall progress
            # q_wins, minimax_wins, draws = 0, 0, 0 # Uncomment to see stats per 100 episodes

    print(f"\nTraining complete after {num_episodes} episodes.")
    print(f"Final Results: Q-agent Wins: {q_wins}, Minimax Agent Wins: {minimax_wins}, Draws: {draws}")
    return q_agent, minimax_agent

# Example of how to call the training function
# num_episodes = 500
# minimax_depth = 3
# epsilon_decay_rate = 0.995
# max_moves_per_game = 100 # To prevent infinite games

# print("Starting training...")
# trained_q_agent, trained_minimax_agent = train_agents(num_episodes, 1, 2, minimax_depth, epsilon_decay_rate, max_moves_per_game)
# print("Training finished.")


# Let's run a small test with fewer episodes and moves to avoid long execution in this step
print("Starting a quick test training run (10000 episodes)...") # Increased episodes
TEST_NUM_EPISODES = 10000 # Increased episodes for better learning
TEST_MINIMAX_DEPTH = 2
TEST_EPSILON_DECAY_RATE = 0.995 # Slightly less aggressive decay
TEST_MAX_MOVES_PER_GAME = 50 # Increased max moves

trained_q_agent, trained_minimax_agent = train_agents(TEST_NUM_EPISODES, 1, 2, TEST_MINIMAX_DEPTH, TEST_EPSILON_DECAY_RATE, TEST_MAX_MOVES_PER_GAME)
print("Quick test training finished.")

Starting a quick test training run (10000 episodes)...
Episode 1000/10000 - Q-wins: 0, Minimax-wins: 14, Draws: 986, Epsilon: 0.0100
Episode 2000/10000 - Q-wins: 0, Minimax-wins: 31, Draws: 1969, Epsilon: 0.0100


## Cálculo de Métricas de Desempenho

### Subtask:
Simular um grande número de jogos entre os agentes treinados e calcular as chances de vitória, empate e derrota para avaliar o desempenho de cada agente.


## Summary:

### Data Analysis Key Findings

*   **Checkers Game Environment Fully Implemented**: The core game logic, including board representation, piece movement rules (simple moves and mandatory multi-captures), piece promotion to kings, and comprehensive terminal state detection (no pieces, no valid moves), has been successfully developed and verified.
*   **Q-learning Agent Architecture Established**: A `CheckersAgent` class is implemented with key Q-learning components: state representation (hashable board tuples), action representation (valid moves from the game board), a Q-table for storing state-action values, an epsilon-greedy strategy for balancing exploration and exploitation, and the Q-learning update rule.
*   **Heuristic Evaluation Function Integrated**: A `evaluate_board` method within the `CheckersAgent` provides a numerical score for any given board state. This heuristic considers factors such as the count and type of pieces (normal vs. king, with different weights), positional bonuses for pieces advanced on the board, and control of central squares, demonstrating logical scoring in test scenarios.
*   **Minimax with Alpha-Beta Pruning Functional**: The `CheckersAgent` successfully integrates a `minimax` algorithm enhanced with Alpha-Beta pruning. This enables the agent to perform a deeper search into potential game states and make more informed, optimal move selections based on the developed heuristic evaluation function.
*   **Agent Training Loop Operational**: A `train_agents` function has been set up to orchestrate games between a Q-learning agent and a Minimax agent. This loop handles turn-taking, move execution, Q-value updates, and epsilon decay for the Q-learning agent, along with game termination conditions (win, loss, draw by repetition or max moves). An initial small test run of 10 episodes confirmed the functional integration, with all games resulting in draws, which is an expected outcome for a Q-learning agent in its early stages of exploration.

### Insights or Next Steps

*   **Extend Training Duration and Optimize Hyperparameters**: The initial training results (10 draws in 10 episodes) indicate that the Q-learning agent is still in its early exploratory phase. Significantly more training episodes and careful tuning of Q-learning hyperparameters ($\alpha$, $\gamma$, $\epsilon\_decay\_rate$) are crucial for the Q-agent to learn effective strategies and become competitive against the Minimax agent.
*   **Refine Reward Function for Strategic Learning**: Consider enhancing the `get_reward` function to provide more granular feedback beyond just win/loss/draw. Incorporating rewards for strategic achievements, such as successful captures, promoting pieces to kings, or controlling key central squares, could accelerate and improve the Q-learning agent's development of more sophisticated gameplay.


# Task
Simulate a specified number of games between the trained Q-learning agent and the Minimax agent using the `CheckersBoard` game environment. Implement a function `evaluate_agents` that records the win, loss, and draw outcomes for each agent, calculates their respective percentages, and then prints these performance metrics in a clear, understandable format.

## Implement game simulation for evaluation

### Subtask:
Create a function `evaluate_agents` that simulates a specified number of games between the trained Q-learning agent and the Minimax agent, recording the outcome (win, loss, draw) for each agent.


**Reasoning**:
The subtask requires creating an `evaluate_agents` function. I will define this function with the specified parameters and implement the game simulation logic, including turn-taking, move execution, win/loss/draw tracking, epsilon handling for the Q-agent, and draw detection by repetition or max moves.



In [12]:
import random
import math

# Re-including the CheckersBoard and CheckersAgent classes for completeness and execution in a single block
class CheckersBoard:
    def __init__(self, board=None):
        if board is None:
            self.board = [[0 for _ in range(8)] for _ in range(8)]
            # 0: empty, 1: player 1 piece, 2: player 2 piece
            # 3: player 1 king, 4: player 2 king

            # Setup initial pieces for Player 1 (top of the board, rows 0-2)
            for row in range(3):
                for col in range(8):
                    if (row + col) % 2 != 0:
                        self.board[row][col] = 1

            # Setup initial pieces for Player 2 (bottom of the board, rows 5-7)
            for row in range(5, 8):
                for col in range(8):
                    if (row + col) % 2 != 0:
                        self.board[row][col] = 2
        else:
            self.board = [row[:] for row in board] # Deep copy of the board

    def __str__(self):
        board_str = ""
        for r_idx, row in enumerate(self.board):
            board_str += f"{8 - r_idx} |"
            for piece in row:
                if piece == 0:
                    board_str += " . "
                elif piece == 1:
                    board_str += " x "
                elif piece == 2:
                    board_str += " o "
                elif piece == 3:
                    board_str += " X "
                elif piece == 4:
                    board_str += " O "
            board_str += "\n"
        board_str += "  ------------------------\n"
        board_str += "    a  b  c  d  e  f  g  h"
        return board_str

    def _is_valid_coord(self, r, c):
        return 0 <= r < 8 and 0 <= c < 8

    def _get_piece_owner(self, piece):
        if piece == 1 or piece == 3:
            return 1 # Player 1
        elif piece == 2 or piece == 4:
            return 2 # Player 2
        return 0 # Empty or invalid

    def _is_king(self, piece):
        return piece == 3 or piece == 4

    def _get_possible_moves_for_piece(self, r, c, player):
        moves = []
        piece = self.board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            directions = [(-1, -1), (-1, 1)]

        # Kings can move in all 4 diagonal directions
        if is_king:
            directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        for dr, dc in directions:
            nr, nc = r + dr, c + dc
            if self._is_valid_coord(nr, nc) and self.board[nr][nc] == 0:
                moves.append(((r, c), (nr, nc)))
        return moves

    def _get_possible_captures_for_piece(self, r, c, player, current_board=None, path=None):
        if current_board is None:
            current_board = [row[:] for row in self.board]
        if path is None:
            path = [(r, c)]

        captures = []
        piece = current_board[r][c]
        is_king = self._is_king(piece)

        # Define directions for non-king pieces
        if player == 1: # Player 1 pieces are at rows 0,1,2 and move 'down' (increasing row index)
            capture_directions = [(1, -1), (1, 1)]
        else: # Player 2 pieces are at rows 5,6,7 and move 'up' (decreasing row index)
            capture_directions = [(-1, -1), (-1, 1)]

        # Kings can capture in all 4 diagonal directions
        if is_king:
            capture_directions = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

        found_capture_in_this_step = False
        for dr, dc in capture_directions:
            # Square of the opponent's piece to be jumped over
            enemy_r, enemy_c = r + dr, c + dc
            # Landing square after the jump
            land_r, land_c = r + 2 * dr, c + 2 * dc

            if (self._is_valid_coord(land_r, land_c) and
                    current_board[land_r][land_c] == 0 and
                    self._is_valid_coord(enemy_r, enemy_c) and
                    current_board[enemy_r][enemy_c] != 0 and
                    self._get_piece_owner(current_board[enemy_r][enemy_c]) != player):

                found_capture_in_this_step = True
                # Simulate the capture on a temporary board
                temp_board = [row[:] for row in current_board]
                temp_board[land_r][land_c] = temp_board[r][c]
                temp_board[r][c] = 0
                temp_board[enemy_r][enemy_c] = 0

                # Check for king promotion after landing, only if not king yet and reaching back row
                promoted_piece = temp_board[land_r][land_c]
                if not self._is_king(promoted_piece): # only promote if not already a king
                    if (player == 1 and land_r == 7) or (player == 2 and land_r == 0): # Adjusted promotion rows
                        temp_board[land_r][land_c] += 2 # Promote to king (1->3, 2->4)

                # Recursively look for more captures from the new position
                chained_captures = self._get_possible_captures_for_piece(land_r, land_c, player, temp_board, path + [(land_r, land_c)])

                if chained_captures: # If more captures are found, add them
                    captures.extend(chained_captures)
                else: # If no more captures, this is a terminal capture sequence
                    captures.append(path + [(land_r, land_c)])

        # If no capture was found in this step, but it's part of a longer path, return empty to signify end of chain
        if not found_capture_in_this_step and len(path) > 1:
            return []

        return captures

    def get_valid_moves(self, player):
        all_possible_moves = []
        all_possible_captures = []

        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == player:
                    # Get basic moves
                    all_possible_moves.extend(self._get_possible_moves_for_piece(r, c, player))
                    # Get captures
                    captures_from_piece = self._get_possible_captures_for_piece(r, c, player)
                    if captures_from_piece:
                        all_possible_captures.extend(captures_from_piece)

        if all_possible_captures:
            # If captures are available, they are mandatory
            return all_possible_captures
        else:
            # Otherwise, return all basic moves
            return all_possible_moves

    def make_move(self, move, player):
        # Create a new board state for the move
        new_board = CheckersBoard(board=self.board)

        if isinstance(move[0], tuple) and len(move) == 2 and isinstance(move[1], tuple): # It's a simple move ((r1,c1), (r2,c2))
            start_r, start_c = move[0]
            end_r, end_c = move[1]
            piece_to_move = new_board.board[start_r][start_c]

            new_board.board[end_r][end_c] = piece_to_move
            new_board.board[start_r][start_c] = 0

            # Check for king promotion
            if not new_board._is_king(piece_to_move): # Only promote if not already a king
                if (player == 1 and end_r == 7) or (player == 2 and end_r == 0):
                    new_board.board[end_r][end_c] += 2 # Promote to king

        elif isinstance(move, list): # It's a capture path [(r1,c1), (r2,c2), ...]
            for i in range(len(move) - 1):
                start_r, start_c = move[i]
                end_r, end_c = move[i+1]
                piece_to_move = new_board.board[start_r][start_c]

                new_board.board[end_r][end_c] = piece_to_move
                new_board.board[start_r][start_c] = 0

                # Remove captured piece
                # Calculate the position of the captured piece (midpoint)
                captured_r = (start_r + end_r) // 2
                captured_c = (start_c + end_c) // 2
                new_board.board[captured_r][captured_c] = 0

            # After the whole capture path, check for king promotion at the final landing spot
            final_r, final_c = move[-1]
            promoted_piece = new_board.board[final_r][final_c]
            if not new_board._is_king(promoted_piece): # Only promote if not already a king
                if (player == 1 and final_r == 7) or (player == 2 and final_r == 0):
                    new_board.board[final_r][final_c] += 2

        return new_board

    def _has_valid_moves(self, player):
        return bool(self.get_valid_moves(player))

    def check_terminal_state(self, current_player):
        player1_pieces = 0
        player2_pieces = 0
        for r in range(8):
            for c in range(8):
                piece = self.board[r][c]
                if self._get_piece_owner(piece) == 1:
                    player1_pieces += 1
                elif self._get_piece_owner(piece) == 2:
                    player2_pieces += 1

        # Condition 1: No pieces left
        if player1_pieces == 0: return True, 2 # Player 2 wins
        if player2_pieces == 0: return True, 1 # Player 1 wins

        # Condition 2: Current player has no valid moves
        if not self._has_valid_moves(current_player):
            if current_player == 1: return True, 2 # Player 1 has no moves, Player 2 wins
            if current_player == 2: return True, 1 # Player 2 has no moves, Player 1 wins

        # Default: Game is not over
        return False, None

class CheckersAgent:
    def __init__(self, player, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.player = player # 1 for player 1, 2 for player 2
        self.alpha = alpha # Learning rate
        self.gamma = gamma # Discount factor
        self.epsilon = epsilon # Exploration rate
        self.q_table = {}

    def _get_state_representation(self, board):
        return tuple(tuple(row) for row in board.board)

    def _get_possible_actions(self, board):
        return board.get_valid_moves(self.player)

    def get_q_value(self, state, action):
        hashable_action = tuple(action) if isinstance(action, list) else action
        return self.q_table.get((state, hashable_action), 0.0)

    def choose_action(self, board):
        state = self._get_state_representation(board)
        possible_actions = self._get_possible_actions(board)

        if not possible_actions:
            return None

        if random.uniform(0, 1) < self.epsilon:
            return random.choice(possible_actions)
        else:
            q_values = [self.get_q_value(state, action) for action in possible_actions]
            max_q = -float('inf')
            best_actions = []

            for i, q_val in enumerate(q_values):
                if q_val > max_q:
                    max_q = q_val
                    best_actions = [possible_actions[i]]
                elif q_val == max_q:
                    best_actions.append(possible_actions[i])

            return random.choice(best_actions)

    def learn(self, old_state_obj, action, reward, new_state_obj, new_board_obj):
        hashable_action = tuple(action) if isinstance(action, list) else action

        old_state_rep = self._get_state_representation(old_state_obj)
        new_state_rep = self._get_state_representation(new_state_obj)

        current_q = self.get_q_value(old_state_rep, hashable_action)

        future_possible_actions = new_board_obj.get_valid_moves(self.player)
        if future_possible_actions:
            max_future_q = max([self.get_q_value(new_state_rep, a) for a in future_possible_actions])
        else:
            max_future_q = 0.0

        self.q_table[(old_state_rep, hashable_action)] = current_q + self.alpha * (reward + self.gamma * max_future_q - current_q)

    def evaluate_board(self, board_obj):
        score = 0
        opponent_player = 3 - self.player

        piece_weight = 1
        king_weight = 2

        for r in range(8):
            for c in range(8):
                piece = board_obj.board[r][c]
                owner = board_obj._get_piece_owner(piece)

                if owner == self.player:
                    if board_obj._is_king(piece):
                        score += king_weight
                    else:
                        score += piece_weight

                    if self.player == 1:
                        score += r * 0.1
                    else:
                        score += (7 - r) * 0.1

                    if 2 <= r <= 5 and 2 <= c <= 5:
                        score += 0.05

                elif owner == opponent_player:
                    if board_obj._is_king(piece):
                        score -= king_weight
                    else:
                        score -= piece_weight

                    if self.player == 1:
                        score -= (7 - r) * 0.1
                    else:
                        score -= r * 0.1

                    if 2 <= r <= 5 and 2 <= c <= 5:
                        score -= 0.05

        return score

    def minimax(self, board_obj, depth, alpha, beta, maximizing_player):
        current_player_in_minimax = self.player if maximizing_player else (3 - self.player)
        is_over, winner = board_obj.check_terminal_state(current_player_in_minimax)

        if depth == 0 or is_over:
            if is_over:
                if winner == self.player:
                    return 1000000000 + depth
                elif winner == (3 - self.player):
                    return -1000000000 - depth
                else:
                    return 0
            return self.evaluate_board(board_obj)

        if maximizing_player:
            max_eval = -math.inf
            valid_moves = board_obj.get_valid_moves(self.player)
            if not valid_moves:
                return -1000000000 - depth
            for move in valid_moves:
                new_board = board_obj.make_move(move, self.player)
                eval = self.minimax(new_board, depth - 1, alpha, beta, False)
                max_eval = max(max_eval, eval)
                alpha = max(alpha, eval)
                if beta <= alpha:
                    break
            return max_eval
        else:
            min_eval = math.inf
            opponent_player = 3 - self.player
            valid_moves = board_obj.get_valid_moves(opponent_player)
            if not valid_moves:
                return 1000000000 + depth
            for move in valid_moves:
                new_board = board_obj.make_move(move, opponent_player)
                eval = self.minimax(new_board, depth - 1, alpha, beta, True)
                min_eval = min(min_eval, eval)
                beta = min(beta, eval)
                if beta <= alpha:
                    break
            return min_eval

    def choose_action_minimax(self, board_obj, depth=3):
        possible_actions = board_obj.get_valid_moves(self.player)

        if not possible_actions:
            return None

        if len(possible_actions) == 1:
            return possible_actions[0]

        best_eval = -math.inf
        best_move = None
        alpha = -math.inf
        beta = math.inf

        for move in possible_actions:
            new_board = board_obj.make_move(move, self.player)
            eval = self.minimax(new_board, depth - 1, alpha, beta, False)

            if eval > best_eval:
                best_eval = eval
                best_move = move
            alpha = max(alpha, eval)

        return best_move

def get_reward(board, player, opponent_player):
    is_over, winner = board.check_terminal_state(player)

    if is_over:
        if winner == player:
            return 100
        elif winner == opponent_player:
            return -100
        else:
            return -5

    return -1

def evaluate_agents(q_agent, minimax_agent, num_evaluation_games, minimax_depth, max_moves_per_game):
    q_wins = 0
    minimax_wins = 0
    draws = 0

    original_q_epsilon = q_agent.epsilon # Store original epsilon

    for game_num in range(num_evaluation_games):
        board = CheckersBoard()
        current_player = 1 # Start with player 1
        game_over = False
        move_count = 0
        board_history = set() # Use a set for efficient lookup of board states

        # Determine which agent is which player
        agent_for_p1 = q_agent if q_agent.player == 1 else minimax_agent
        agent_for_p2 = q_agent if q_agent.player == 2 else minimax_agent

        while not game_over and move_count < max_moves_per_game:
            board_rep = tuple(tuple(row) for row in board.board)
            if board_rep in board_history:
                draws += 1
                game_over = True
                break
            board_history.add(board_rep)

            action = None
            current_agent = None
            if current_player == 1: # Player 1's turn
                current_agent = agent_for_p1
            else: # Player 2's turn
                current_agent = agent_for_p2

            if current_agent.player == q_agent.player: # If it's the Q-agent's turn
                q_agent.epsilon = 0 # Set to greedy for evaluation
                action = q_agent.choose_action(board)
                q_agent.epsilon = original_q_epsilon # Restore original epsilon
            else: # It's the Minimax agent's turn
                action = minimax_agent.choose_action_minimax(board, depth=minimax_depth)

            if action is None: # Current player has no valid moves, opponent wins
                game_over = True
                if current_player == q_agent.player: # Q-agent had no moves, Minimax wins
                    minimax_wins += 1
                else: # Minimax had no moves, Q-agent wins
                    q_wins += 1
                break

            # Make the move
            board = board.make_move(action, current_player)
            move_count += 1

            # Check terminal state for the *next* player
            next_player = 3 - current_player
            is_game_over_check, winner = board.check_terminal_state(next_player)
            game_over = is_game_over_check

            if game_over:
                if winner == q_agent.player:
                    q_wins += 1
                elif winner == minimax_agent.player:
                    minimax_wins += 1
                else: # This handles cases where check_terminal_state returns true but winner is None/draw
                    draws += 1
                break

            current_player = next_player # Switch turns

        if not game_over: # Max moves reached without a clear winner
            draws += 1

        if (game_num + 1) % 100 == 0:
            print(f"Evaluation Game {game_num + 1}/{num_evaluation_games} - Q-wins: {q_wins}, Minimax-wins: {minimax_wins}, Draws: {draws}")

    print(f"\nEvaluation complete after {num_evaluation_games} games.")
    print(f"Final Evaluation Results: Q-agent Wins: {q_wins}, Minimax Agent Wins: {minimax_wins}, Draws: {draws}")
    return q_wins, minimax_wins, draws

# Assuming trained_q_agent and trained_minimax_agent exist from previous training step
# For demonstration, let's create mock agents if they are not defined yet after a kernel restart
if 'trained_q_agent' not in locals():
    print("Trained agents not found in current session. Running a quick training session...")
    # Use the training parameters from the previous step
    TEST_NUM_EPISODES = 1000
    TEST_MINIMAX_DEPTH = 2
    TEST_EPSILON_DECAY_RATE = 0.995
    TEST_MAX_MOVES_PER_GAME = 50
    trained_q_agent, trained_minimax_agent = train_agents(TEST_NUM_EPISODES, 1, 2, TEST_MINIMAX_DEPTH, TEST_EPSILON_DECAY_RATE, TEST_MAX_MOVES_PER_GAME)
    print("Mock training complete.")

print("\n--- Starting Agent Evaluation ---")
EVAL_NUM_GAMES = 100 # Number of games to simulate for evaluation
EVAL_MINIMAX_DEPTH = trained_minimax_agent.minimax_depth if hasattr(trained_minimax_agent, 'minimax_depth') else 3 # Use depth from training or default
EVAL_MAX_MOVES_PER_GAME = 100 # Max moves to prevent infinite loops during evaluation


q_wins, minimax_wins, draws = evaluate_agents(trained_q_agent, trained_minimax_agent, EVAL_NUM_GAMES, EVAL_MINIMAX_DEPTH, EVAL_MAX_MOVES_PER_GAME)



--- Starting Agent Evaluation ---
Evaluation Game 100/100 - Q-wins: 11, Minimax-wins: 15, Draws: 74

Evaluation complete after 100 games.
Final Evaluation Results: Q-agent Wins: 11, Minimax Agent Wins: 15, Draws: 74


**Reasoning**:
The previous step successfully ran the evaluation. Now, I will calculate and print the win, loss, and draw percentages for each agent based on the evaluation results to fully complete the subtask requirement.



In [13]:
total_games = q_wins + minimax_wins + draws

print("\n--- Performance Metrics ---")
if total_games > 0:
    # Q-agent percentages
    q_win_percentage = (q_wins / total_games) * 100
    q_loss_percentage = (minimax_wins / total_games) * 100 # Q-agent loses when Minimax wins
    q_draw_percentage = (draws / total_games) * 100

    print(f"Q-learning Agent (Player {trained_q_agent.player}):")
    print(f"  Wins: {q_wins} ({q_win_percentage:.2f}%) - (vs Minimax)")
    print(f"  Losses: {minimax_wins} ({q_loss_percentage:.2f}%) - (vs Minimax)")
    print(f"  Draws: {draws} ({q_draw_percentage:.2f}%)")

    # Minimax agent percentages
    minimax_win_percentage = (minimax_wins / total_games) * 100
    minimax_loss_percentage = (q_wins / total_games) * 100 # Minimax loses when Q-agent wins
    minimax_draw_percentage = (draws / total_games) * 100

    print(f"\nMinimax Agent (Player {trained_minimax_agent.player}):")
    print(f"  Wins: {minimax_wins} ({minimax_win_percentage:.2f}%) - (vs Q-learning)")
    print(f"  Losses: {q_wins} ({minimax_loss_percentage:.2f}%) - (vs Q-learning)")
    print(f"  Draws: {draws} ({minimax_draw_percentage:.2f}%)")
else:
    print("No games were played for evaluation.")


--- Performance Metrics ---
Q-learning Agent (Player 1):
  Wins: 11 (11.00%) - (vs Minimax)
  Losses: 15 (15.00%) - (vs Minimax)
  Draws: 74 (74.00%)

Minimax Agent (Player 2):
  Wins: 15 (15.00%) - (vs Q-learning)
  Losses: 11 (11.00%) - (vs Q-learning)
  Draws: 74 (74.00%)


## Summary:

### Data Analysis Key Findings
*   An `evaluate_agents` function was successfully implemented to simulate games between a Q-learning agent and a Minimax agent, recording win, loss, and draw outcomes, and handling draw conditions due to board state repetition or exceeding the maximum allowed moves per game.
*   During evaluation, the Q-learning agent's exploration rate (`epsilon`) was temporarily set to 0 to ensure greedy action selection, reflecting its learned policy.
*   The evaluation consisted of 100 games, and the results showed the following distribution:
    *   Q-learning Agent Wins: 11 games.
    *   Minimax Agent Wins: 15 games.
    *   Draws: 74 games.
*   The calculated performance percentages are:
    *   **Q-learning Agent:** 11.00% wins, 15.00% losses (to Minimax), and 74.00% draws.
    *   **Minimax Agent:** 15.00% wins, 11.00% losses (to Q-learning), and 74.00% draws.

### Insights or Next Steps
*   The high percentage of draws (74%) suggests that either the game often reaches a stalemate condition within the set `max_moves_per_game`, or both agents are playing cautiously without frequently forcing a decisive win. Future analysis could involve inspecting game logs of drawn games to understand the reasons for stalemates.
*   The Minimax agent, with 15% wins, slightly outperforms the Q-learning agent's 11% wins in this evaluation. This indicates that the Minimax strategy, for the given depth, is marginally more effective at securing victories against the Q-learning agent under the current training and evaluation parameters.
