# Connect4 DQN model
By LaughingSkull, as a new RL agent for my game Connect-4: https://www.laughingskull.org/Games/Connect4/Connect4.php


## DQN (Deep Q-Network)

* A value-based reinforcement learning method.
* Uses a neural network to approximate the Q-function: estimates the expected future rewards for taking actions in given states.
* Learns via Q-learning: updates the Q-values using the Bellman equation.
* Typically uses techniques like experience replay and target networks to stabilize training.
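
The Bellman update mentioned above can be sketched in a few lines. This is a minimal, framework-free illustration of the single-agent form of the target, not the code used by `DQNAgent`; the function names and the `gamma` values are assumptions made for the example. The notebook's agent also stores player markers with each transition, so its actual target may treat the next state differently.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """Bellman target for one transition: r + gamma * max_a' Q_target(s', a').

    next_q_values are the target network's Q-values for the legal actions in
    the next state. Terminal transitions use the reward alone.
    """
    if done or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)


def td_error(q_value, reward, next_q_values, done, gamma=0.99):
    """Difference between the Bellman target and the current estimate Q(s, a)."""
    return td_target(reward, next_q_values, done, gamma) - q_value


# Non-terminal move: bootstrap from the best next-state value.
print(td_target(0.0, [0.2, 0.5, -0.1], done=False, gamma=0.9))
# Terminal move: the target is just the final reward.
print(td_target(1.0, [0.2, 0.5], done=True))
```

In a DQN the online network is trained to shrink this error, while `next_q_values` come from a periodically synced copy (the target network), which keeps the target from shifting on every gradient step.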

### Version log

* 0.8.0 - start with new convolutional part
    * keeping L2 regularization
* 0.9.0 - extension to shallow training
* 0.10.0 - extension to L3 training
    * NO PRUNE:
        * worse 
    * SHARPer PRUNE
    * extend phases
        * MixedR12: no improvement; &cross;
        * Shallow
        * Fixed2
        * Variable23
        * Variable3
* 0.11.0 - recover fixed 2
    * corrected reward leaking bug
    * corrected draw bug
* 0.12.0 - refactoring and restarting
* 0.13 - restarting again
    * strategy weights adjustment
* 0.14 - new restart from scratch
    * intermediate phases
* 0.15 - shaped rewards
* 0.16 - testing target_update_interval
* 0.17 - changing CNN
* 0.18 - 8-part weights approach, rebalancing and restarting (again)
    * frozen model for self play
    * adding mirrored rewards
    * adding player POV then getting rid of it
        * splitting agent and opponent player channels
* 0.19 - restart with tweaking weights (again)

## Links, learning from
[https://docs.pytorch.org/tutorials/intermediate/reinforcement_q_learning.html](https://docs.pytorch.org/tutorials/intermediate/reinforcement_q_learning.html)
<br>
[https://pettingzoo.farama.org/tutorials/agilerl/DQN/](https://pettingzoo.farama.org/tutorials/agilerl/DQN/)
<br>
### Other helpful links

<br>[https://medium.com/@vishwapatel214/building-a-connect-4-game-bot-with-deep-learning-models-dbcd019d8967](https://medium.com/@vishwapatel214/building-a-connect-4-game-bot-with-deep-learning-models-dbcd019d8967)
<br>[https://codebox.net/pages/connect4#:~:text=This%20requires%20a%20lot%20of%20work%20up%2Dfront,possible%20action%20at%20each%20step%20is%20impractical.](https://codebox.net/pages/connect4#:~:text=This%20requires%20a%20lot%20of%20work%20up%2Dfront,possible%20action%20at%20each%20step%20is%20impractical.)
<br>[https://medium.com/advanced-machine-learning/deep-learning-meets-board-games-creating-a-connect-4-ai-using-cnns-and-vits-89c8cdab0041](https://medium.com/advanced-machine-learning/deep-learning-meets-board-games-creating-a-connect-4-ai-using-cnns-and-vits-89c8cdab0041)
<br>[https://medium.com/@piyushkashyap045/understanding-dropout-in-deep-learning-a-guide-to-reducing-overfitting-26cbb68d5575#:~:text=Choosing%20Dropout%20Rate:%20Common%20dropout,is%20better%20for%20simpler%20models.](https://medium.com/@piyushkashyap045/understanding-dropout-in-deep-learning-a-guide-to-reducing-overfitting-26cbb68d5575#:~:text=Choosing%20Dropout%20Rate:%20Common%20dropout,is%20better%20for%20simpler%20models.)
<br>
[https://medium.com/oracledevs/lessons-from-alphazero-connect-four-e4a0ae82af68](https://medium.com/oracledevs/lessons-from-alphazero-connect-four-e4a0ae82af68)
<br>
[https://docs.agilerl.com/en/latest/tutorials/pettingzoo/dqn.html](https://docs.agilerl.com/en/latest/tutorials/pettingzoo/dqn.html)
<br>



## Import dependencies and recheck installation

In [None]:
%matplotlib inline

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchinfo import summary
import numpy as np
import random
from collections import deque
import matplotlib.pyplot as plt
from tqdm import tqdm
import time
import pprint
import pandas as pd
import os
import json
from copy import deepcopy
from IPython.display import display, clear_output, HTML

print("All dependencies imported successfully.")
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)

if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))
else:
    print("CUDA not available. Using CPU.")


### Fixed Random seeds

In [None]:
SEED = 666
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

torch.cuda.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

os.environ["PYTHONHASHSEED"] = str(SEED)

## Custom imports

In [None]:
from C4.connect4_env import Connect4Env
from C4.connect4_lookahead import Connect4Lookahead
from DQN.training_phases_config import TRAINING_PHASES
from DQN.training_phases_config import set_training_phases_length
from DQN.opponent_action import get_opponent_action
from DQN.DQN_replay_memory import ReplayMemory
from DQN.dqn_model import DQN
from DQN.dqn_agent import DQNAgent
from DQN.dqn_utilities import *
from C4.connect4_board_display import Connect4_BoardDisplayer
from C4.plot_phase_summary import plot_phase_summary

Lookahead = Connect4Lookahead()
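
`ReplayMemory` is imported from the project's `DQN` package and its source is not shown in this notebook. For readers new to experience replay, a minimal buffer might look like the sketch below. The class name `SimpleReplayBuffer`, its methods, and the transition layout are hypothetical and are not the project's API.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Hypothetical fixed-size experience buffer (not the project's ReplayMemory)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        """Store one transition tuple."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a uniform random batch without replacement."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = SimpleReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(state=step, action=0, reward=0.0, next_state=step + 1, done=False)

print(len(buffer))        # only the newest 3 transitions remain
print(buffer.sample(2))   # two of them, chosen at random
```

Sampling at random rather than replaying moves in order reduces the correlation between consecutive training samples, which is the stabilizing effect referred to in the DQN overview.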

## Training phases

In [None]:
set_training_phases_length(TRAINING_PHASES)

## Training session name and settings

In [None]:
lookahead_depth = 7  # prophet = 7

start_episode = 0
num_episodes = 700
num_episodes -= 1 # debug

batch_size = 128
target_update_interval = 25
plot_interval = 10
log_every_x_episode = 100
tag = "Random"

TRAINING_SESSION = f"{tag}-{num_episodes}-TU-{target_update_interval}-BS-{batch_size}"
begin_start_time = time.time()
LOG_DIR ="Logs/DQN/"
MODEL_DIR ="Models/DQN/"
PLOTS = "Plots/DQN/"
PRUNE = True
print("Started training session", TRAINING_SESSION)

### Model overview

In [None]:
_model = DQN()
summary(_model, input_size=(1, 2, 6, 7))  # batch=1, channels=2, height=6, width=7
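
The `input_size` above implies two 6x7 planes per board. Assuming one plane marks the agent's discs and the other the opponent's (the version log mentions splitting agent and opponent into separate channels), the encoding could be sketched as follows. `encode_board` is a hypothetical helper written for this illustration; the real encoding lives in the `DQN` package and may differ.

```python
def encode_board(board, agent_player=1):
    """Split a 6x7 board of {1, -1, 0} into two binary planes.

    Plane 0 marks the agent's discs and plane 1 the opponent's. This is an
    assumed layout, not necessarily the one used by the project's DQN class.
    """
    agent_plane = [[1.0 if cell == agent_player else 0.0 for cell in row]
                   for row in board]
    opp_plane = [[1.0 if cell == -agent_player else 0.0 for cell in row]
                 for row in board]
    return [agent_plane, opp_plane]


# Empty board with one disc per player on the bottom row
board = [[0] * 7 for _ in range(6)]
board[5][3] = 1    # agent disc
board[5][4] = -1   # opponent disc

planes = encode_board(board)
print(len(planes), len(planes[0]), len(planes[0][0]))  # channels, rows, columns
```

Adding a leading batch dimension to such a pair of planes gives the `(1, 2, 6, 7)` shape passed to `summary`.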

## Training loop - DQN against lookahead opponent (Prophet-style)

### Training config

In [None]:
# --- Save training configuration to Excel ---
from C4.training_config_logger import export_training_config

paths = export_training_config(
    training_phases=TRAINING_PHASES,
    lookahead_depth=lookahead_depth,
    num_episodes=num_episodes,
    batch_size=batch_size,
    target_update_interval=target_update_interval,
    log_dir=LOG_DIR,
    session_name=TRAINING_SESSION,
    write_excel=True,    # set False if you only want CSV/JSON
    write_json=False,     # handy for exact reproduction later
)

print("config written:", paths)



### Training loop

In [None]:
summary_stats = {}  
env = Connect4Env()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("using device:", device)
agent = DQNAgent(device=device)

reward_history = []
win_history = []
epsilon_history = []
epsilon_min_history = []
memory_prune_history = []
win_count = loss_count = draw_count = 0
phase = None
frozen_opp = None
strategy_weights = []
memory_prune = 0  # default until first get_phase() call
start_time = time.time()

with tqdm(total=num_episodes, desc="Training Episodes") as pbar:
    for episode in range(start_episode + 1, num_episodes + 1):
        state = env.reset()        
        total_reward = 0           # will hold ONLY the terminal reward
        done = False
        final_result = None        # 1 = win, -1 = loss, 0.5 = draw

        new_phase, strategy_weights, epsilon, memory_prune, epsilon_min = get_phase(episode)    
        phase, frozen_opp = handle_phase_change(agent, new_phase, phase, epsilon, memory_prune, epsilon_min)


        # --------- main training loop --------- --------- --------- ---------
        # Random start (opponent moves first, player -1)
        if random.random() < 0.5:
            opp_action = get_opponent_action(env, agent, episode, state, player=-1, depth=lookahead_depth, frozen_opp=frozen_opp, phase=phase)
            state, r_opp, done = env.step(opp_action)

        # --- Main episode loop --- (agent moves first -> player 1)
        while not done:
            valid_actions = env.available_actions()
            action = agent.act(state, valid_actions, player=1, depth=lookahead_depth, strategy_weights=strategy_weights)
            next_state, r_agent, done = env.step(action)

            if done:
                # Terminal on the agent's move
                final_result = evaluate_final_result(env, agent_player=1)
                terminal_reward = map_final_result_to_reward(final_result)  
                total_reward = terminal_reward        
                agent.remember(state, 1, action, r_agent, next_state, -1, True)

            else:
                # Opponent responds (player -1)
                opp_action = get_opponent_action(env, agent, episode, next_state, player=-1, depth=lookahead_depth, frozen_opp=frozen_opp, phase=phase)
                next_state2, r_opp, done = env.step(opp_action)

                if done:
                    # Opponent ended the game
                    final_result = evaluate_final_result(env, agent_player=1)
                    terminal_reward = map_final_result_to_reward(final_result) 
                    total_reward = terminal_reward 
                    agent.remember(state, 1, action, total_reward, next_state2, 1, True)
                    next_state = next_state2  # maintain state for any displays/logs

                else:
                    # Non-terminal full turn: turn reward
                    agent.remember(state, 1, action, r_agent, next_state2, 1, False)
                    next_state = next_state2

            agent.replay(batch_size)
            state = next_state

        # --------- --------- --------- --------- --------- --------- ---------

        epsilon_history.append(agent.epsilon)
        epsilon_min_history.append(agent.epsilon_min)
        reward_history.append(total_reward)  # strictly  terminal reward
        memory_prune_history.append(memory_prune)

        # Decay epsilon once per episode
        if agent.epsilon > agent.epsilon_min:
            agent.epsilon *= agent.epsilon_decay
            agent.epsilon = max(agent.epsilon, agent.epsilon_min)


        wins, losses, draws = track_result(final_result, win_history)
        win_count += wins
        loss_count += losses
        draw_count += draws

        if episode % target_update_interval == 0:
            agent.update_target_model()

        if episode % plot_interval == 0:
            avg_reward = np.mean(reward_history[-25:])
            pbar.set_postfix(avg_reward=f"{avg_reward:.2f}", epsilon=f"{agent.epsilon:.3f}",
                             wins=win_count, losses=loss_count, draws=draw_count, phase=phase)

            clear_output(wait=True)
            
            if done: Connect4_BoardDisplayer.display_board(next_state)
            plot_live_training(episode, reward_history, win_history, epsilon_history, 
                               phase, win_count, loss_count, draw_count, TRAINING_SESSION, memory_prune_history,
                              epsilon_min_history)

        if episode % log_every_x_episode == 0:
            log_summary_stats(episode=episode, reward_history=reward_history, win_history=win_history, phase=phase,
                              strategy_weights=strategy_weights, agent=agent, win_count=win_count, loss_count=loss_count,
                              draw_count=draw_count, summary_stats_dict=summary_stats)

        pbar.update(1)

end_time = time.time()
elapsed = end_time - start_time
print(f"\nTraining completed in {elapsed/60:.1f} minutes ({elapsed / num_episodes:.2f} s/episode)")

# --- Save final Win Rate plot ---
save_final_winrate_plot(win_history=win_history, training_phases=TRAINING_PHASES, save_path=PLOTS, session_name=TRAINING_SESSION)
print(f"Win rate plot saved to {PLOTS}DQN-{TRAINING_SESSION}_final_winrate.png")
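
The per-episode epsilon update in the loop above is a multiplicative decay clipped at `epsilon_min`. The standalone sketch below reproduces that rule and estimates how many episodes it takes to reach the floor. The function names and the numeric values are illustrative assumptions; the notebook takes its real values from the phase configuration.

```python
import math


def decay_epsilon(epsilon, epsilon_min, epsilon_decay):
    """One per-episode step: multiply by the decay factor, never go below the floor."""
    if epsilon > epsilon_min:
        epsilon = max(epsilon * epsilon_decay, epsilon_min)
    return epsilon


def episodes_to_floor(epsilon_start, epsilon_min, epsilon_decay):
    """Smallest n with epsilon_start * epsilon_decay**n <= epsilon_min.

    Assumes 0 < epsilon_decay < 1 and epsilon_min > 0.
    """
    if epsilon_start <= epsilon_min:
        return 0
    return math.ceil(math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay))


# Illustrative values only
eps, eps_min, decay = 1.0, 0.05, 0.99
n = episodes_to_floor(eps, eps_min, decay)
for _ in range(n):
    eps = decay_epsilon(eps, eps_min, decay)
print(n, eps)
```

A check like this can help when sizing a training phase: if a phase is shorter than the estimated number of episodes, exploration will still be above its floor when the phase ends.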


In [None]:
print(f"\nSummary stats (every {log_every_x_episode} episodes):")
#pprint.pprint(summary_stats)
pd.DataFrame.from_dict(summary_stats, orient='index').to_excel(f"{LOG_DIR}DQN-{TRAINING_SESSION}-training_summary.xlsx", index=True)

In [None]:
plot_phase_summary(summary_stats)

In [None]:
plt.plot(reward_history)
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.title("DQN Training Progress")
plt.grid(True)
plt.show()

In [None]:
window = 50
smoothed = [np.mean(reward_history[max(0, i-window):i+1]) for i in range(len(reward_history))]
plt.plot(smoothed)
plt.show()

In [None]:
window = 250
smoothed = [np.mean(reward_history[max(0, i-window):i+1]) for i in range(len(reward_history))]
plt.plot(smoothed)
plt.show()

In [None]:
window = 1000
smoothed = [np.mean(reward_history[max(0, i-window):i+1]) for i in range(len(reward_history))]
plt.plot(smoothed)
plt.show()
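
The three cells above repeat the same trailing average with different windows. A small helper could compute it in one pass with a running sum instead of re-slicing the list for every episode. `trailing_mean` is a hypothetical name introduced here; it reproduces the slice `[max(0, i - window):i + 1]` used above, which spans up to `window + 1` values.

```python
def trailing_mean(values, window):
    """Mean of values[max(0, i - window) : i + 1] for every index i."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        # Drop the element that has just left the slice.
        if i - window - 1 >= 0:
            running_sum -= values[i - window - 1]
        count = min(i + 1, window + 1)
        smoothed.append(running_sum / count)
    return smoothed


# Small example with win/loss/draw style rewards
rewards = [1, -1, 1, 1, 0, -1, 1]
print(trailing_mean(rewards, window=2))
```

With this helper each plot reduces to a call such as `plt.plot(trailing_mean(reward_history, 250))`.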

In [None]:
window = 250
smoothed = [np.mean(reward_history[max(0, i - window):i + 1]) for i in range(len(reward_history))]

final_reward_fig, final_reward_ax = plt.subplots(figsize=(10, 5))
final_reward_ax.plot(smoothed, label=f"Smoothed Reward (window={window})", color='blue')

# --- Add phase transitions ---
for name, meta in TRAINING_PHASES.items():
    ep = meta["length"]
    if ep is not None and ep <= len(reward_history):
        final_reward_ax.axvline(x=ep, color='black', linestyle='dotted', linewidth=1)
        final_reward_ax.text(ep + 5, max(smoothed) * 0.95, name,
                             rotation=90, va='top', ha='left', fontsize=8)

final_reward_ax.set_title("Smoothed Reward Over Episodes")
final_reward_ax.set_xlabel("Episode")
final_reward_ax.set_ylabel("Smoothed Reward")
final_reward_ax.legend()
final_reward_ax.grid(True)
final_reward_fig.tight_layout()

# --- Show plot ---
plt.show()

# --- Save to file ---
final_reward_fig.savefig(f"{PLOTS}DQN-{TRAINING_SESSION}_final_reward_smoothed.png")
plt.close(final_reward_fig)
print(f"Smoothed reward plot saved to {PLOTS}DQN-{TRAINING_SESSION}_final_reward_smoothed.png")


## Save model

In [None]:
timestamp = time.strftime("%Y%m%d-%H%M%S")
model_path = f"{MODEL_DIR}{TRAINING_SESSION}_Connect4 dqn_model_{timestamp} episodes-{num_episodes} lookahead-{lookahead_depth}.pt"
default_model_path = "Connect4 DQN model.pt"

torch.save(agent.model.state_dict(), model_path)
torch.save(agent.model.state_dict(), default_model_path)
print(f"Model saved to {model_path}")


## Load model

In [None]:
agent = DQNAgent(device=device)  # Fresh agent instance
state_dict = torch.load(default_model_path, map_location=device, weights_only=True)
agent.model.load_state_dict(state_dict)
agent.update_target_model()
agent.epsilon = 0.0  # Fully greedy — no exploration
print("✅ Trained model loaded and ready for evaluation.")


# Evaluation

In [None]:
# === EVALUATION CONFIGURATION ===
evaluation_opponents = {
    "Random": 200,
    "Lookahead-1": 100,
    "Lookahead-2": 100,
    "Lookahead-3": 25,
    # "Lookahead-5": 10,
    # "Lookahead-7": 5,
}

# === Evaluation Loop ===
evaluation_results = {}

# --- Force deterministic, greedy evaluation ---
agent_model_mode = agent.model.training
agent_target_mode = agent.target_model.training
agent.model.eval()
agent.target_model.eval()
_eps_backup = agent.epsilon
_epsmin_backup = agent.epsilon_min
agent.epsilon = 0.0
agent.epsilon_min = 0.0

start_time = time.time()

for label, num_games in evaluation_opponents.items():
    wins = losses = draws = 0
    depth = int(label.split("-")[1]) if label.startswith("Lookahead") else None

    with tqdm(total=num_games, desc=f"Opponent: {label}") as pbar:
        for _ in range(num_games):
            state = env.reset()
            done = False
            agent_first = random.choice([True, False])  # randomize who starts

            while not done:
                # Agent's turn?
                is_agent_turn = ((env.current_player == 1 and agent_first) or
                                 (env.current_player == -1 and not agent_first))

                if is_agent_turn:
                    valid_actions = env.available_actions()
                    action = agent.act(state, valid_actions, player=env.current_player,
                                       depth=lookahead_depth, strategy_weights=None)
                else:
                    valid_actions = env.available_actions()
                    if label == "Random":
                        action = random.choice(valid_actions)
                    else:
                        # Lookahead opponent
                        board = np.array(state)
                        action = Lookahead.n_step_lookahead(board, env.current_player, depth=depth)
                        # Safety: if lookahead returns a filled column (shouldn't), fall back to random legal
                        if action not in valid_actions:
                            action = random.choice(valid_actions)

                state, _, done = env.step(action)

            # --- Use env.winner (1, -1, or 0 for draw) ---
            if env.winner == 1:
                # Player +1 won
                wins += 1 if agent_first else 0
                losses += 0 if agent_first else 1
            elif env.winner == -1:
                # Player -1 won
                wins += 0 if agent_first else 1
                losses += 1 if agent_first else 0
            elif env.winner == 0:
                draws += 1
            else:
                # Shouldn't happen; treat as draw to avoid skewing
                draws += 1

            pbar.update(1)

    evaluation_results[label] = {
        "wins": wins,
        "losses": losses,
        "draws": draws,
        "win_rate": round(wins / num_games, 3),
        "loss_rate": round(losses / num_games, 3),
        "draw_rate": round(draws / num_games, 3),
    }

end_time = time.time()
elapsed = end_time - start_time

# --- Restore agent mode/state ---
agent.epsilon = _eps_backup
agent.epsilon_min = _epsmin_backup
agent.model.train(agent_model_mode)
agent.target_model.train(agent_target_mode)

print(f"Evaluation completed in {elapsed/60:.1f} minutes")

# === Print Summary ===
print("\n📊 Evaluation Summary:")
for label, stats in evaluation_results.items():
    print(f"{label}: {stats['wins']}W / {stats['losses']}L / {stats['draws']}D → "
          f"Win: {stats['win_rate']*100:.1f}%, Loss: {stats['loss_rate']*100:.1f}%, Draw: {stats['draw_rate']*100:.1f}%")

# === Bar Plot Summary ===
labels = list(evaluation_results.keys())
win_rates  = [evaluation_results[k]['win_rate']  * 100 for k in labels]
loss_rates = [evaluation_results[k]['loss_rate'] * 100 for k in labels]
draw_rates = [evaluation_results[k]['draw_rate'] * 100 for k in labels]

x = range(len(labels))
bar_width = 0.25

plt.figure(figsize=(12, 6))
plt.bar(x, win_rates, width=bar_width, label='Win %')
plt.bar([i + bar_width for i in x], loss_rates, width=bar_width, label='Loss %')
plt.bar([i + 2 * bar_width for i in x], draw_rates, width=bar_width, label='Draw %')
plt.xlabel('Opponent Type')
plt.ylabel('Percentage')
plt.title('DQN Agent Performance vs Various Opponents')
plt.xticks([i + bar_width for i in x], labels, rotation=15)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

# Save results
df_eval = pd.DataFrame.from_dict(evaluation_results, orient='index')
df_eval.index.name = "Opponent"
# Use Excel if available; otherwise fall back to CSV
try:
    df_eval.to_excel(f"{LOG_DIR}DQN-{TRAINING_SESSION}-evaluation_results.xlsx", index=True)
except Exception as e:
    print("Excel export failed, saving CSV instead:", e)
    df_eval.to_csv(f"{LOG_DIR}DQN-{TRAINING_SESSION}-evaluation_results.csv", index=True)


# DONE

In [None]:
total_end_time = time.time()
total_elapsed = (total_end_time - begin_start_time) / 3600
print(f"Total run time (training + evaluation): {total_elapsed:.1f} hours")

## Training log

In [None]:
# TRAINING_SESSION

training_log_file = "DQN training_sessions.xlsx"
log_row = {"TRAINING_SESSION": TRAINING_SESSION, "TIME [h]": total_elapsed, "EPISODES": num_episodes}

for label, stats in evaluation_results.items():
    log_row[label] = stats["win_rate"]

# === Load or Create Excel File ===
if os.path.exists(training_log_file):
    df_log = pd.read_excel(training_log_file)
else:
    df_log = pd.DataFrame()

# === Append and Save ===
df_log = pd.concat([df_log, pd.DataFrame([log_row])], ignore_index=True)
df_log.to_excel(training_log_file, index=False)

print(f"\n📁 Training session logged to: {training_log_file}")