# Connect 4: Exploring Algorithmic Approaches

**Project Context:** Constraint Programming for Connect 4

This notebook outlines and explains various algorithmic approaches that can be considered for developing an intelligent agent to play Connect 4. While the core project might involve Constraint Programming (CP) for aspects like state validation, win condition checking, or perhaps even move generation under certain constraints, this document focuses on common game-playing algorithms that could leverage or complement a CP model.

## 1. Introduction to Connect 4

Connect 4 is a classic two-player connection game in which players first choose a color and then take turns dropping colored discs from the top into a seven-column, six-row vertically suspended grid. The pieces fall straight down, occupying the lowest available space within the column. The objective of the game is to be the first to form a horizontal, vertical, or diagonal line of four of one's own discs.
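
The drop rule can be sketched in a few lines of Python. This is a minimal illustration rather than the project's implementation: the names `new_board` and `drop_piece` are hypothetical, and row 0 is assumed to be the bottom row.

```python
ROWS, COLS = 6, 7
EMPTY = 0

def new_board():
    """Return an empty board; row 0 is the bottom row (assumed convention)."""
    return [[EMPTY] * COLS for _ in range(ROWS)]

def drop_piece(board, col, player):
    """Place `player`'s disc in the lowest empty cell of `col`.

    Returns the row index where the disc landed, or raises ValueError
    if the column is out of range or already full.
    """
    if not 0 <= col < COLS:
        raise ValueError(f"column {col} is out of range")
    for row in range(ROWS):
        if board[row][col] == EMPTY:
            board[row][col] = player
            return row
    raise ValueError(f"column {col} is full")

board = new_board()
first = drop_piece(board, 3, 1)   # lands on the bottom row
second = drop_piece(board, 3, 2)  # stacks on top of the first disc
print(first, second)  # 0 1
```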

## 2. Constraint Programming (CP) in Connect 4

While the primary decision-making logic for our agents might rely on search algorithms (Minimax, Negamax) or learning (Reinforcement Learning), Constraint Programming offers a powerful declarative paradigm to model and enforce the rules and properties of the Connect 4 game. Instead of explicitly coding *how* to check conditions step-by-step, CP allows us to define the conditions (constraints) that must hold true.

In this context, CP can be strategically employed for several key tasks:

1.  **Board State Representation & Rule Enforcement:**
    *   **Model:** The 6x7 board can be represented using CP variables. For example, a 2D array `board[row][col]` where each variable's domain is `{0, 1, 2}` (representing Empty, Player 1, Player 2).
    *   **Gravity Constraint:** A fundamental rule. We can define constraints stating that if `board[r][c]` is non-empty (belongs to Player 1 or 2), then `board[r-1][c]` must also be non-empty, unless `r` is the bottom row (r=0). This ensures pieces stack correctly.
    *   **Column Capacity:** A constraint can limit the number of non-empty cells in any given column `c` to be at most 6.
    *   **Benefit:** While simple checks are often used in practice for these, using CP provides a formal, declarative model of the game's physics, which can be useful for validation or more complex reasoning.

2.  **Win Condition Checking:**
    *   **Horizontal:** For each player `P` and each possible starting position `(r, c)`, a constraint like: `board[r][c] == P AND board[r][c+1] == P AND board[r][c+2] == P AND board[r][c+3] == P`.
    *   **Vertical:** Similarly: `board[r][c] == P AND board[r+1][c] == P AND board[r+2][c] == P AND board[r+3][c] == P`.
    *   **Diagonal (Both directions):** Analogous constraints for diagonal lines.
    *   **Integration:** This CP-based check is essential for:
        *   **Minimax/Negamax:** Determining if a node is a terminal state (win/loss) in the search tree.
        *   **Reinforcement Learning:** Determining when an episode ends (game over) and assigning the appropriate terminal reward (+1 for win, -1 for loss) in the environment's `step` function.
        *   **All Agents:** Knowing when to stop the game loop.

3.  **Valid Move Identification / Generation:**
    *   Define constraints for a legal move in column `c`: `c >= 0 AND c < 7` AND `board[5][c] == 0` (assuming row 5 is the top row). The actual row a piece lands in depends on the gravity constraint.
    *   **Integration:** Used by the Random agent to pick from legal moves, and by Minimax/Negamax/RL agents to know the available actions from a given state.

4.  **Heuristic Feature Definition (Supporting Minimax/Negamax):**
    *   CP can define what constitutes strategically important patterns, even if calculating them uses procedural code for speed.
    *   **Example:** Define a constraint set representing "a Player 1 three-in-a-row with open space on both ends". The heuristic function could then try to count how many solutions exist for this constraint set on the current board.
    *   **Threat Detection:** Model "a column `c` where Player 2 wins if they play there next". This involves checking if placing Player 2's piece in `c` satisfies a win condition.
    *   **Integration:** CP helps formalize the *features* that the procedural `evaluate_board` function will count or check. Running a CP solver *inside* the heuristic for every node might be too slow, but CP informs the design.
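
The constraints listed above can also be written as plain Python predicates, which is roughly what a procedural fallback for the CP model might look like. This sketch does not use a CP solver. The function names are hypothetical, the board is assumed to be a list of rows with row 0 at the bottom, and cells take the values `{0, 1, 2}` as in the model above.

```python
ROWS, COLS = 6, 7
EMPTY = 0
# The four line directions as (row step, column step).
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]

def satisfies_gravity(board):
    """Every non-empty cell above the bottom row must rest on a non-empty cell."""
    return all(
        board[r][c] == EMPTY or board[r - 1][c] != EMPTY
        for r in range(1, ROWS)
        for c in range(COLS)
    )

def legal_moves(board):
    """Columns whose top cell (row 5) is still empty."""
    return [c for c in range(COLS) if board[ROWS - 1][c] == EMPTY]

def has_won(board, player):
    """True if `player` owns four consecutive cells in any direction."""
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in DIRECTIONS:
                end_r, end_c = r + 3 * dr, c + 3 * dc
                if not (0 <= end_r < ROWS and 0 <= end_c < COLS):
                    continue  # the line would leave the board
                if all(board[r + i * dr][c + i * dc] == player for i in range(4)):
                    return True
    return False

def winning_columns(board, player):
    """Columns where `player` completes four in a row by playing next."""
    threats = []
    for col in legal_moves(board):
        row = next(r for r in range(ROWS) if board[r][col] == EMPTY)
        board[row][col] = player   # try the move
        if has_won(board, player):
            threats.append(col)
        board[row][col] = EMPTY    # undo it
    return threats

board = [[EMPTY] * COLS for _ in range(ROWS)]
for c in (1, 2, 3):
    board[0][c] = 2  # player 2 holds three discs on the bottom row

print(satisfies_gravity(board))   # True
print(has_won(board, 2))          # False
print(winning_columns(board, 2))  # [0, 4]
```

Scanning every start cell is simple but not the fastest option; a search agent would more likely check only the lines through the last disc played.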

**Summary of CP's Role:** In this project setup, CP is unlikely to be the main AI decision engine itself. Instead, it acts as a powerful **rule engine and state analyzer**. Its primary roles are:
*   **Robustly checking for terminal states (win/loss/draw)**, which is fundamental for all other algorithms.
*   **Formally defining game rules and valid states/moves**, ensuring correctness.
*   Potentially **defining complex features** used by the heuristics in search algorithms like Minimax/Negamax.

## 3. Adversarial Search: Minimax
*TODO*

## 4. Adversarial Search: Negamax
*TODO*

## 5. Reinforcement Learning (RL)

### Concept
Reinforcement Learning is a machine learning paradigm where an agent learns to make decisions by taking actions in an environment to maximize a cumulative reward signal. The agent isn't told *what* action to take, but instead discovers which actions yield the most reward through trial and error.

### Key Components
*   **Agent:** The learner or decision-maker (our Connect 4 player).
*   **Environment:** The external system the agent interacts with (the Connect 4 game).
*   **State (s):** A representation of the environment's current situation (the board configuration).
*   **Action (a):** A choice the agent can make (dropping a disc in a column).
*   **Reward (r):** Feedback from the environment indicating the immediate consequence of an action (e.g., +1 for winning, -1 for losing, 0 for other moves).
*   **Policy (π):** The agent's strategy for choosing actions based on states.
*   **Value Function (V(s) or Q(s,a)):** Estimates the expected long-term return (cumulative reward) from a state or state-action pair.
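
To make the mapping concrete, here is a minimal environment sketch. The class name `Connect4Env`, its method names, and the three-value return of `step` are assumptions for illustration and do not follow any particular RL library's interface. The reward is reported from the point of view of the player who just moved, so a training loop would still have to assign the matching -1 to the opponent.

```python
import random

ROWS, COLS = 6, 7

class Connect4Env:
    """Minimal environment sketch: state = board, action = column index."""

    def reset(self):
        self.board = [[0] * COLS for _ in range(ROWS)]  # row 0 = bottom
        self.player = 1                                 # player to move
        return self.board

    def legal_actions(self):
        return [c for c in range(COLS) if self.board[ROWS - 1][c] == 0]

    def _count(self, r, c, dr, dc):
        """Count the mover's discs from (r, c) outward in direction (dr, dc)."""
        n = 0
        r, c = r + dr, c + dc
        while 0 <= r < ROWS and 0 <= c < COLS and self.board[r][c] == self.player:
            n += 1
            r, c = r + dr, c + dc
        return n

    def step(self, action):
        """Apply the move; return (next_state, reward, done) for the mover."""
        if action not in self.legal_actions():
            raise ValueError(f"illegal action {action}")
        row = next(r for r in range(ROWS) if self.board[r][action] == 0)
        self.board[row][action] = self.player
        won = any(
            1 + self._count(row, action, dr, dc)
              + self._count(row, action, -dr, -dc) >= 4
            for dr, dc in [(0, 1), (1, 0), (1, 1), (1, -1)]
        )
        if won:
            return self.board, 1.0, True
        if not self.legal_actions():
            return self.board, 0.0, True   # board full: draw
        self.player = 3 - self.player      # switch between players 1 and 2
        return self.board, 0.0, False

def random_policy(env, rng):
    """A stand-in policy: pick uniformly among the legal columns."""
    return rng.choice(env.legal_actions())

rng = random.Random(0)
env = Connect4Env()
state = env.reset()
done = False
while not done:
    state, reward, done = env.step(random_policy(env, rng))
```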

### Relevance to Connect 4
RL allows an agent to learn to play Connect 4 potentially without prior knowledge of optimal strategies, simply by playing many games (often against itself or variations of itself) and learning from the outcomes.

### 5.1 Deep Q-Learning (DQN)

Deep Q-Learning is a popular RL algorithm that uses a deep neural network to approximate the optimal action-value function, Q*(s, a). This function gives the expected discounted return obtained by taking action 'a' in state 's' and following the optimal policy thereafter.

**How it Works (Briefly):**

1.  **Q-Network:** A neural network takes the current state 's' as input and outputs estimated Q-values for each possible action 'a' in that state.
2.  **Experience Replay:** The agent's experiences are stored as `(state, action, reward, next_state, done)` tuples in a memory buffer. During training, mini-batches are sampled at random from this buffer, which breaks temporal correlations between consecutive moves and improves learning stability.
3.  **Target Network:** A separate neural network (a periodically updated copy of the main Q-network) is used to calculate the target Q-values for the learning update. For a non-terminal transition the target is `r + γ * max_a'(Q_target(s', a'))`, where `γ` (gamma) is the discount factor for future rewards, `s'` is the next state, and `Q_target` is the value from the target network. For a terminal transition the target is simply `r`. Keeping the target network fixed between periodic updates stabilizes training.
4.  **Learning:** The main Q-network is updated using gradient descent to minimize the difference (e.g., Mean Squared Error) between its predicted Q(s, a) and the calculated target Q-value.
5.  **Exploration:** An exploration strategy (like epsilon-greedy, where the agent takes a random action with probability epsilon, otherwise chooses the action with the highest Q-value) is used to balance exploring new actions and exploiting known good actions.
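
Three of the pieces above (the replay buffer, the target computation, and epsilon-greedy selection) are small enough to sketch without a deep learning framework. The names and default values below are illustrative assumptions. Restricting the choice to legal columns is an addition not spelled out in the steps above, though it is a natural fit for Connect 4, where full columns cannot be played.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest experiences are dropped first

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        """Draw a uniformly random mini-batch without replacement."""
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

def td_target(reward, next_q_values, done, gamma=0.99):
    """Target r + gamma * max_a' Q_target(s', a'); just r if the episode ended."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def epsilon_greedy(q_values, legal_actions, epsilon, rng=random):
    """Random legal action with probability epsilon, else the best legal action."""
    if rng.random() < epsilon:
        return rng.choice(legal_actions)
    return max(legal_actions, key=lambda a: q_values[a])

buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(f"s{step}", step % 7, 0.0, f"s{step + 1}", False)
print(len(buffer))  # 3

print(td_target(0.0, [0.5, 2.0], done=False, gamma=0.5))  # 1.0
print(epsilon_greedy([0.1, 0.9, 0.5], legal_actions=[0, 2], epsilon=0.0))  # 2
```

In the last call column 1 has the highest Q-value but is not legal, so the greedy choice falls back to column 2.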

#### 5.1.1 DQN with a Simple Neural Network (MLP)

*   **Input:** The Connect 4 board (6x7 grid) is typically flattened into a 1D vector (e.g., 42 elements). Each element represents a cell, possibly encoded as 0 (empty), 1 (player 1), -1 (player 2).
*   **Network:** A Multi-Layer Perceptron (MLP) with one or more hidden dense layers (using activation functions like ReLU) followed by an output layer.
*   **Output:** The output layer has 7 neurons, one for each column. The value of each output neuron represents the estimated Q-value for dropping a disc in that column.
*   **Pros:** Relatively simple to implement.
*   **Cons:** Flattening the input loses the spatial relationships between cells (adjacency, lines), which are crucial in Connect 4. The network must learn these relationships implicitly, which can be inefficient.
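
The input encoding and the shape of the forward pass can be shown in pure Python. This is only a sketch of the data flow with untrained random weights. The hidden size of 32 and all function names are arbitrary assumptions, and a real agent would use a deep learning framework instead.

```python
import random

ROWS, COLS = 6, 7

def encode_flat(board):
    """Flatten a 6x7 board with cells in {0, 1, 2} into 42 values in {0, 1, -1}."""
    mapping = {0: 0.0, 1: 1.0, 2: -1.0}
    return [mapping[cell] for row in board for cell in row]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer, relu):
    """Compute W x + b, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def q_values(board, hidden, output):
    """42 inputs -> hidden ReLU layer -> 7 linear outputs, one per column."""
    return dense(dense(encode_flat(board), hidden, relu=True), output, relu=False)

rng = random.Random(0)
hidden = init_layer(ROWS * COLS, 32, rng)
output = init_layer(32, COLS, rng)

board = [[0] * COLS for _ in range(ROWS)]
board[0][3] = 1  # player 1 has played the centre column
print(len(encode_flat(board)))               # 42
print(len(q_values(board, hidden, output)))  # 7
```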

#### 5.1.2 DQN with a Convolutional Neural Network (CNN)

*   **Input:** The board is treated as a 2D grid (e.g., 6x7). Often, multiple input *channels* are used to represent the state more effectively (e.g., one 6x7 channel indicating positions of player 1's pieces, another for player 2's pieces, maybe a third channel indicating whose turn it is).
*   **Network:** Starts with one or more convolutional layers. These layers use filters (kernels) to detect spatial patterns (like horizontal, vertical, diagonal lines, or local configurations) across the board. Pooling layers might be used to reduce dimensionality. The output from the convolutional/pooling layers is then typically flattened and fed into one or more dense layers (MLP style).
*   **Output:** Similar to the MLP approach, the final output layer has 7 neurons representing the Q-values for each column.
*   **Pros:** CNNs are designed to recognize spatial hierarchies and patterns, making them well-suited for board games like Connect 4. They can learn relevant features more efficiently than MLPs processing flattened input.
*   **Cons:** More complex architecture and more computationally intensive to train than a simple MLP.
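
The channel encoding and the effect of a single convolutional filter can be illustrated without a framework. In the sketch below the 1x4 filter is written by hand purely for illustration, whereas a trained CNN learns its filters. The three-plane layout and the function names are assumptions.

```python
ROWS, COLS = 6, 7

def to_planes(board, player_to_move):
    """Encode the board as three 6x7 planes: player 1, player 2, side to move."""
    p1 = [[1.0 if cell == 1 else 0.0 for cell in row] for row in board]
    p2 = [[1.0 if cell == 2 else 0.0 for cell in row] for row in board]
    turn = [[1.0 if player_to_move == 1 else 0.0] * COLS for _ in range(ROWS)]
    return [p1, p2, turn]

def conv2d_valid(plane, kernel):
    """Slide `kernel` over `plane` without padding (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(plane[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(len(plane[0]) - kw + 1)
        ]
        for r in range(len(plane) - kh + 1)
    ]

board = [[0] * COLS for _ in range(ROWS)]
for c in range(1, 5):
    board[0][c] = 1  # player 1 owns columns 1..4 of the bottom row

horizontal = [[1.0, 1.0, 1.0, 1.0]]  # hand-written 1x4 filter
response = conv2d_valid(to_planes(board, 1)[0], horizontal)
print(response[0])  # [3.0, 4.0, 3.0, 2.0]
```

A response of 4.0 marks a window that is completely filled by player 1, which is the kind of local pattern a learned filter could pick up directly from the 2D input.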

```python
# Code cell for potential DQN structure (pseudo-code or library calls)
# E.g. (Conceptual):
# model = create_cnn_q_network(input_shape=(6, 7, num_channels), output_size=7)
# target_model = create_cnn_q_network(input_shape=(6, 7, num_channels), output_size=7)
# replay_buffer = ReplayBuffer(capacity=10000)
# optimizer = Adam(learning_rate=0.001)
#
# for episode in range(num_episodes):
#     state = env.reset()
#     done = False
#     while not done:
#         action = select_action_epsilon_greedy(state, model, epsilon)
#         next_state, reward, done, _ = env.step(action)
#         replay_buffer.add(state, action, reward, next_state, done)
#         state = next_state
#
#         if len(replay_buffer) > batch_size:
#             sample_batch = replay_buffer.sample(batch_size)
#             train_step(model, target_model, sample_batch, optimizer, gamma)
#
#     # Update target network periodically
#     if episode % target_update_frequency == 0:
#         target_model.set_weights(model.get_weights())
#
#     # Decay epsilon, log metrics, etc.
```

## 6. Performance Analysis and Benchmarking (Draft with Placeholder Values)

Evaluating the performance of different agents is crucial.

### Metrics
*   **Win Rate:** The primary metric. Percentage of games won against specific opponents.
*   **Draw Rate / Loss Rate:** Complementary to win rate.
*   **Average Game Length:** Can indicate playing style (aggressive vs. defensive).
*   **Computation Time:** Average time taken per move or per game.
*   **(RL Specific) Reward Curve:** Plot of cumulative reward per episode during training.
*   **(RL Specific) Convergence:** How many episodes/steps does it take for the agent's performance to stabilize?
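
The non-RL metrics above reduce to simple aggregation over per-game records. The record layout `(outcome, n_moves, seconds)` and the function name `summarize` are assumptions for this sketch, and the sample records are made-up inputs used only to exercise the function.

```python
from collections import Counter

def summarize(games):
    """Aggregate per-game records into win/loss/draw rates and averages.

    Each record is an assumed (outcome, n_moves, seconds) tuple: outcome is
    "win", "loss", or "draw" from the evaluated agent's view, n_moves is the
    number of discs played in the game, and seconds is the game's total time.
    """
    n = len(games)
    outcomes = Counter(outcome for outcome, _, _ in games)
    total_moves = sum(moves for _, moves, _ in games)
    total_seconds = sum(seconds for _, _, seconds in games)
    return {
        "win_rate": outcomes["win"] / n,
        "loss_rate": outcomes["loss"] / n,
        "draw_rate": outcomes["draw"] / n,
        "avg_game_length": total_moves / n,
        "avg_seconds_per_move": total_seconds / total_moves,
    }

# Made-up records, not measured results.
games = [("win", 20, 1.0), ("loss", 30, 1.5), ("win", 10, 0.5), ("draw", 42, 2.2)]
stats = summarize(games)
print(stats["win_rate"], stats["avg_game_length"])  # 0.5 25.5
```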

### Benchmarking Setup Example

1.  **Define Agents:** *(TBD)*
    *   `Agent_Minimax_D4`: Minimax agent with search depth 4.
    *   `Agent_Minimax_D6`: Minimax agent with search depth 6.
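    *   `Agent_Negamax_D4`: Negamax agent with search depth 4 (should choose the same moves as Minimax D4 when both use the same evaluation function and move ordering, since Negamax is a reformulation of Minimax for zero-sum games).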
    *   `Agent_RL_FNN`: DQN agent trained with a feed-forward NN (the MLP from Section 5.1.1).
    *   `Agent_RL_CNN`: DQN agent trained with a Convolutional NN.
    *   `Agent_Random`: Plays legal moves randomly.
    *   `Agent_Heuristic`: Plays based on a simple heuristic (e.g., prioritize winning moves, blocking opponent wins, center columns).

2.  **Tournament:** Pit agents against each other in a round-robin or specific matchups. For each matchup (e.g., `Agent_RL_CNN` vs. `Agent_Minimax_D4`), play a significant number of games (e.g., N=200), alternating which agent goes first (N/2 games each).

3.  **Data Collection:** Record wins, losses, and draws for each matchup. For RL agents, track win rate against a fixed benchmark (e.g., `Agent_Random` or `Agent_Minimax_D4`) throughout the training process.

4.  **Presentation:**
    *   **Tables:** Show win/loss/draw percentages for all matchups.

    Each cell shows win % / loss % / draw % from the row agent's perspective. Example data only.

    | Agent      | vs Random     | vs Heuristic   | vs Minimax D4    | vs RL_CNN        |
    |------------|---------------|----------------|------------------|------------------|
    | Minimax D4 | 95% / 3% / 2% | 70% / 25% / 5% | 50% / 50% / 0%\* | 40% / 55% / 5%   |
    | RL_CNN     | 98% / 1% / 1% | 85% / 10% / 5% | 55% / 40% / 5%   | 50% / 50% / 0%\* |

    \* Self-play results depend on whether the implementation is deterministic or stochastic.


    *   **Graphs:** Plot RL agent win rate vs. training episodes/time. Plot average move time for different agents. (TODO)
5.  **Analysis:** Discuss the relative strengths and weaknesses observed. Does the CNN-based RL agent outperform the MLP-based one? How does RL compare to Minimax at different depths? How significant is the computational cost difference? (TODO)
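
The tournament step, with its alternation of who moves first, might be organised along these lines. `play_match` and the `play_game(first, second)` callable are hypothetical names; `play_game` is assumed to return `"first"`, `"second"`, or `"draw"`, and the stub used below always lets the first mover win simply to show the bookkeeping.

```python
def play_match(agent_a, agent_b, play_game, n_games=200):
    """Play n_games between two agents, alternating who moves first.

    `play_game(first, second)` is an assumed callable that plays one game
    and returns "first", "second", or "draw".
    """
    tally = {"a_wins": 0, "b_wins": 0, "draws": 0}
    for i in range(n_games):
        a_first = i % 2 == 0  # agent_a opens the even-numbered games
        first, second = (agent_a, agent_b) if a_first else (agent_b, agent_a)
        result = play_game(first, second)
        if result == "draw":
            tally["draws"] += 1
        elif (result == "first") == a_first:
            tally["a_wins"] += 1
        else:
            tally["b_wins"] += 1
    return tally

def first_mover_wins(first, second):
    """Stub game used only for this example: the first mover always wins."""
    return "first"

print(play_match("Agent_Random", "Agent_Heuristic", first_mover_wins, n_games=10))
# {'a_wins': 5, 'b_wins': 5, 'draws': 0}
```

With a stub where moving first decides the game, the even split shows why alternating the opening player matters when comparing agents.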