# Training a REINFORCE (with baseline) Agent

**GOAL:** Train a _REINFORCE with baseline Agent_ to play Connect4. This is the second part of our training pipeline (other RL algorithms are also trained in this second part):
  - **Part 1) Supervised Learning**
    - refer to *'src/train/part1_supervised_learning.ipynb'*.
    - RESULT: a pre-trained network with basic knowledge of the game
  - **Part 2) Reinforcement Learning**
    - In this case: REINFORCE with baseline
    - **TRANSFER LEARNING FROM PART 1:**
      - Load the pre-trained weights from Part 1
      - Freeze the convolutional block (*feature extractor*), so it is not trained here
      - Train the remaining fully-connected layers to estimate the optimal policy and the state values.
<br>

**METHOD:**
   - We used an *Episode Buffer* to store an episode before updating
       - exponent for reward backpropagation = 3
       - for more details on the implementation refer to '*src/data/replay_memory.py*'
   - The network architecture we used is defined in '*src/models/architectures/cnet128.json*'
   - We applied *transfer learning* to use the knowledge learned in '*src/train/part1_supervised_learning.ipynb*'
       - 1. load the network weights from '*src/models/saved_models/network_128.pt*'
       - 2. freeze the convolutional block (*feature extractor*)
       - 3. train the fully-connected layers to learn the policy and the state values
   - There is an '*old agent*', an older and stable version of the agent. It is updated when:
       - the agent achieves a new best win rate against the 1-Step Lookahead Agent
   - When the performance of the current network decreases significantly (its win rate vs the 1-Step Lookahead Agent falls 0.08 or more below the best one), the latest changes are undone and the network reverts to the most recent *old weights*
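
The snapshot-and-rollback rule in the last two bullets can be sketched in plain Python. This is a minimal illustration rather than the notebook's code: `update_stable_snapshot` and its action labels are hypothetical names, and the 0.08 tolerance is the value of `vs_1StepLA_win_rate_decrease_to_undo_updates` in the hyperparameters.

```python
def update_stable_snapshot(win_rate, best_win_rate, max_decrease=0.08):
    """Decide what to do after an evaluation against the 1-Step Lookahead Agent.

    Returns (action, new_best_win_rate), where action is one of:
      'snapshot' -> copy the current weights into the old agent
      'rollback' -> restore the old agent's weights into the current agent
      'keep'     -> leave both networks untouched
    """
    if win_rate > best_win_rate:
        return 'snapshot', win_rate
    if win_rate <= best_win_rate - max_decrease:
        return 'rollback', best_win_rate
    return 'keep', best_win_rate


# hypothetical sequence of evaluation results
best = 0.50
for wr in [0.48, 0.55, 0.52, 0.45]:
    action, best = update_stable_snapshot(wr, best)
    print(f"win rate {wr:.2f} -> {action} (best so far {best:.2f})")
```

Keeping the best win rate unchanged on a rollback means the restored network has to beat its own previous record before a new snapshot is taken.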
<br>

**TRAINING:**
   - We trained for 100k time steps
   - The learning hyperparameters are:
       - c1 = 0.75
       - learning rate = 5e-6
       - weight decay (L2 regularization) = 1e-3
       - discount factor (gamma) = 0.95
       - loss function (state value) = Smooth L1
   - Every 1000 updates, the REINFORCE Agent competes against:
       - the Random Agent
       - the older network
       - the 1-Step Lookahead Agent
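
Written out, the loss minimised at each time step with these hyperparameters has the form below. The notation is ours, not the notebook's: $\pi_\theta$ is the policy head, $b_\theta$ is the state-value head used as baseline, and $G_t$ is the return computed from the episode rewards.

$$
\mathcal{L}_t = -\log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - b_\theta(s_t)\bigr) \;+\; c_1\,\mathrm{SmoothL1}\bigl(b_\theta(s_t),\, G_t\bigr), \qquad c_1 = 0.75
$$

$$
\mathrm{SmoothL1}(x, y) =
\begin{cases}
\tfrac{1}{2}(x - y)^2 & \text{if } |x - y| < 1 \\
|x - y| - \tfrac{1}{2} & \text{otherwise}
\end{cases}
$$

The second expression is Smooth L1 with PyTorch's default threshold of 1.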
<br>

**REINFORCE RESULTS:**
   - Our best REINFORCE agent beats the 1-Step Lookahead Agent **≈59%** of the time
   - The weights of the model are saved in '*src/models/saved_models/best_reinforce.pt*'
   - The training hyperparameters are saved in '*src/models/saved_models/best_reinforce_hparams.json*'
   - Plots of the training losses
   - Plots of the average game length in self-play games
   - Plots of the evolution of the win rate vs 1StepLA

## 1) Imports

In [None]:
import copy
import os
from datetime import datetime

import torch
import torch.nn as nn
import numpy as np
from torchsummary import summary
import matplotlib.pyplot as plt

In [None]:
### YOUR PATH HERE
code_dir = '/home/marc/Escritorio/RL-connect4/'

if os.path.isdir(code_dir):
    # local environment
    os.chdir(code_dir)
    print(f"directory -> '{code_dir }'")
else:
    # google colab environment
    if os.path.isdir('./src'):
        print("'./src' dir already exists")
    else:  # not unzipped yet
        !unzip -q src.zip
        print("'./src.zip' file successfully unzipped")

In [None]:
from src.agents.baselines.random_agent import RandomAgent
from src.agents.baselines.n_step_lookahead_agent import NStepLookaheadAgent
from src.models.custom_network import CustomNetwork
from src.agents.trainable.pg_agent import PGAgent
from src.environment.connect_game_env import ConnectGameEnv
from src.data.replay_memory import ReplayMemory
from src.eval.competition import competition

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [None]:
timestamp = datetime.now().strftime("%d%m_%H%M")
print(f"'{timestamp}'")

In [None]:
SAVE_MODELS = False  # if False (debug mode), no checkpoints are saved

In [None]:
hparams = {
    # environment, data, memory
    'reward_backprop_exponent': 3,

    # agent properties and model architecture
    'avg_symmetric_probs': True,
    'model_arch_path': './src/models/architectures/cnet128.json',
    'load_weights_path': './src/models/saved_models/supervised_cnet128.pt',
    'freeze_conv_block': True,

    # Information displayed while training
    'loss_log_every': 2,
    'comp_every': 10,
    'vs_1StepLA_win_rate_decrease_to_undo_updates': 0.08,
    'moving_avg': 100,
    
    # Loss
    'c1': 0.75,

    # Training loop params
    'num_episodes': 10000,
    'gamma' : 0.95,
    'weight_decay': 1e-3,
    'lr': 1e-6,
}

## 3) REINFORCE Agent

In [None]:
def load_state_dict(from_: nn.Module, to_: nn.Module) -> None:
    """
    Copies the weights from the module 'from_' to the module 'to_'
    It ensures that the convolutional block is frozen (if necessary)
    """

    to_.load_state_dict(from_.state_dict())
    if hparams['freeze_conv_block']:
        for param in to_.conv_block.parameters():
            param.requires_grad = False



def create_new_policy_model() -> nn.Module:
    """
    Create a Policy Network following the architecture in 'model_arch_path',
    initializing the weights from 'load_weights_path'.
    If 'freeze_conv_block', disable the gradients of the convolutional block
    """

    policy_net = CustomNetwork.from_architecture(
        file_path=hparams['model_arch_path']
    ).to(device)

    policy_net.load_weights(hparams['load_weights_path'])

    # freeze the feature extractor only when the hyperparameters ask for it
    if hparams['freeze_conv_block']:
        for param in policy_net.conv_block.parameters():
            param.requires_grad = False

    return policy_net


def create_new_reinforce_agent():
    """
    Create a new REINFORCE Agent
    """

    model_ = create_new_policy_model()
    agent_ = PGAgent(model=model_,
                      avg_symmetric_probs=hparams['avg_symmetric_probs'],
                      name="Reinforce Agent")
    return agent_

In [None]:
reinforce_agent = create_new_reinforce_agent()

print("REINFORCE Agent device is cuda?", next(reinforce_agent.model.parameters()).is_cuda)
print()
print(summary(reinforce_agent.model, input_size=reinforce_agent.model.input_shape))

In [None]:
old_reinforce_agent = create_new_reinforce_agent()
load_state_dict(from_=reinforce_agent.model, to_=old_reinforce_agent.model)
old_reinforce_agent.model.eval()

In [None]:
agent_name = reinforce_agent.name.replace(' ', '_')
save_best_vs_1StepLA_file = f'{agent_name}_'+'{win_rate}_vs_1StepLA_'+f'{timestamp}.pt'

print('"' + save_best_vs_1StepLA_file + '"')
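
The file name above mixes f-string parts, which are filled in immediately, with a `{win_rate}` placeholder that is filled in later with `.format`. The sketch below shows how such a template resolves. The timestamp and win rates are hypothetical, `checkpoint_name` is an illustrative helper, and rounding the percentage is an assumption of this sketch: truncating with `int()` can be off by one because of floating-point representation.

```python
# hypothetical values, only for illustration
agent_name = "Reinforce Agent".replace(' ', '_')
timestamp = "0101_1200"

# the f-string parts are resolved now; '{win_rate}' stays a placeholder
template = f'{agent_name}_' + '{win_rate}_vs_1StepLA_' + f'{timestamp}.pt'


def checkpoint_name(win_rate: float) -> str:
    # round() instead of int(): 0.29 * 100 is slightly below 29 in floating point
    return "checkpoints/" + template.format(win_rate=round(win_rate * 100))


print(template)
print(checkpoint_name(0.59))
print(int(0.29 * 100), round(0.29 * 100))
```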

## 4) Episode Buffer

In [None]:
buffer = ReplayMemory(
    capacity=200,
    reward_backprop_exponent=hparams['reward_backprop_exponent']
)

## 5) Prepare the training loop

In [None]:
optimizer = torch.optim.Adam(
    params=reinforce_agent.model.parameters(),
    lr=hparams['lr'],
    weight_decay=hparams['weight_decay']
)

comp_env = ConnectGameEnv()

In [None]:
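def training_step(policy_net_, buffer_, optimizer_, gamma_):
    """
    Run one optimizer update per transition stored in the buffer.
    Returns the per-update policy losses, value losses and entropies.
    """

    policy_net_.train()

    # split the flat list of transitions into episodes
    # (the buffer may also hold the symmetric copy of each episode)
    ep_experience = buffer_.all_data()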
    episodes_list = [[]]
    for t in ep_experience:
        episodes_list[-1].append(t)
        if t.done:
            episodes_list.append([])
    episodes_list.pop()

    policy_losses, value_losses, entropies = [], [], []
    for ep_transitions_ in episodes_list:

        if len(ep_transitions_) == 0:
            continue

        ep_transitions = copy.deepcopy(ep_transitions_)
        batch = buffer_.Transition(*zip(*ep_transitions))

        state_batch_ = tuple([policy_net_.obs_to_model_input(obs=s)
                              for s in batch.state])
        state_batch = torch.cat(state_batch_).float().to(device)
        action_batch = torch.tensor(batch.action, device=device)

        returns = []
        G = 0

        # Compute the returns by reading the rewards vector backwards
        for r in reversed(batch.reward):
            G = r - gamma_*G
            returns.insert(0, G)
        returns = torch.tensor(returns, dtype=torch.float, device=device)

        ep_policy_losses = []
        ep_value_losses = []
        ep_entropies = []

        # one gradient update per time step of the episode
        for s, a, G in zip(state_batch, action_batch, returns):
            logits, baseline = policy_net_(s.unsqueeze(0))
            m = torch.distributions.Categorical(logits=logits)
            log_prob = m.log_prob(a)
            entropy = m.entropy()

            # the advantage is a constant weight in the policy term, so the
            # baseline head is trained only through the value loss
            advantage = (G - baseline).detach()

            policy_loss = - log_prob * advantage

            value_loss = nn.functional.smooth_l1_loss(baseline.squeeze(), G)
            total_loss = policy_loss + hparams['c1']*value_loss

            optimizer_.zero_grad()
            total_loss.backward()
            optimizer_.step()

            # log the losses with the same sign used in the optimization
            ep_policy_losses.append(policy_loss.item())
            ep_value_losses.append(value_loss.item())
            ep_entropies.append(entropy.item())

        policy_losses.extend(ep_policy_losses)
        value_losses.extend(ep_value_losses)
        entropies.extend(ep_entropies)

    return policy_losses, value_losses, entropies
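
To make the quantities in `training_step` concrete, here is a dependency-free sketch of the loss terms for a single time step. It is not the notebook's implementation. The function names and the example numbers are hypothetical; it only reproduces what `Categorical(logits=...)` (log-softmax and entropy) and `smooth_l1_loss` (with PyTorch's default `beta=1.0`) compute on scalars.

```python
import math


def categorical_log_prob_and_entropy(logits, action):
    """Log-probability of `action` and entropy of softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    log_probs = [l - log_z for l in logits]
    entropy = -sum(math.exp(lp) * lp for lp in log_probs)
    return log_probs[action], entropy


def smooth_l1(prediction, target, beta=1.0):
    """Quadratic for small errors, linear for large ones."""
    diff = abs(prediction - target)
    if diff < beta:
        return 0.5 * diff ** 2 / beta
    return diff - 0.5 * beta


def reinforce_with_baseline_loss(logits, action, G, baseline, c1=0.75):
    log_prob, entropy = categorical_log_prob_and_entropy(logits, action)
    advantage = G - baseline
    policy_loss = -log_prob * advantage
    value_loss = smooth_l1(baseline, G)
    return policy_loss + c1 * value_loss, policy_loss, value_loss, entropy


# hypothetical step: 7 columns with uniform logits, return 1.0, baseline 0.5
total, p_loss, v_loss, ent = reinforce_with_baseline_loss(
    logits=[0.0] * 7, action=3, G=1.0, baseline=0.5
)
print(p_loss, v_loss, ent, total)
```

With uniform logits the entropy equals ln 7, its maximum over the seven columns, which gives a reference point for the entropy values logged during training.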

In [None]:
history = {'n_updates': 0, 'policy_losses': [], 'value_losses': [], 'entropies': [], 
           'vs_random_win_rate': [], 'vs_random_avg_game_len': [],
           'vs_1StepLA_win_rate': [], 'vs_1StepLA_avg_game_len': [],
           'vs_old_self_win_rate': [], 'vs_old_self_avg_game_len': [],
           'comp_every': hparams['comp_every'], 'comp_n_episodes': 100,
          }

vs_1StepLA_best_win_rate = 0.50

if not os.path.exists('checkpoints'):
    os.makedirs('checkpoints')

env = ConnectGameEnv()
comp_env = ConnectGameEnv()

random_opponent = RandomAgent()
oneStepLA = NStepLookaheadAgent(n=1, prefer_central_columns=True)

for i_episode in range(hparams['num_episodes']):

    buffer.reset()

    buffer.push_self_play_episode_transitions(
        agent=reinforce_agent,
        env=env,
        init_random_obs=True,
        push_symmetric=True
    )

    # Perform one step of the optimization
    policy_losses, value_losses, entropies = training_step(
        policy_net_=reinforce_agent.model,
        buffer_=buffer,
        optimizer_=optimizer,
        gamma_=hparams['gamma']
    )
    
    history['n_updates'] += len(policy_losses)
    history['policy_losses'].extend(policy_losses)
    history['value_losses'].extend(value_losses)
    history['entropies'].extend(entropies)

    # display information about the training process
    if (i_episode+1) % hparams['loss_log_every'] == 0:
        policy_losses_ = history['policy_losses'][-hparams['moving_avg']:]
        value_losses_ = history['value_losses'][-hparams['moving_avg']:]
        entropies_ = history['entropies'][-hparams['moving_avg']:]
        print(f"Episode: {i_episode+1}/{hparams['num_episodes']}   " +
              f"Nupdates: {history['n_updates']}   "
              f"PolLoss: {round(np.mean(policy_losses_), 3)}   " +
              f"ValLoss: {round(np.mean(value_losses_), 3)}   " +
              f"Entropy: {round(np.mean(entropies_), 3)}"
              )

    # periodic evaluation against the three opponents; first, the Random Agent
    if (i_episode+1) % hparams['comp_every'] == 0:
        reinforce_agent.model.eval()
        with torch.no_grad():
            res1, o1 = competition(
                env=comp_env,
                agent1=reinforce_agent,
                agent2=random_opponent,
                progress_bar=False
            )
        win_rate_rand = round(res1['win_rate1'], 3)
        print(f"    {win_rate_rand} vs. RAND" +
              f"    avg_len={round(res1['avg_game_len'], 2)}")
        history['vs_random_win_rate'].append(win_rate_rand)
        history['vs_random_avg_game_len'].append(res1['avg_game_len'])

        # compete against the old (stable) version of the network
        reinforce_agent.model.eval()
        old_reinforce_agent.model.eval()
        with torch.no_grad():
            res2, o2 = competition(
                env=comp_env,
                agent1=reinforce_agent,
                agent2=old_reinforce_agent,
                progress_bar=False
            )
        win_rate_self = round(res2['win_rate1'], 3)
        print(f"    {win_rate_self} vs. SELF" +
              f"    avg_len={round(res2['avg_game_len'], 2)}")
        history['vs_old_self_win_rate'].append(win_rate_self)
        history['vs_old_self_avg_game_len'].append(res2['avg_game_len'])


        # compete against the 1StepLA
        reinforce_agent.model.eval()
        with torch.no_grad():
            res3, o3 = competition(
                env=comp_env,
                agent1=reinforce_agent,
                agent2=oneStepLA,
                progress_bar=False,
            )
        win_rate_1StepLA = round(res3['win_rate1'], 3)
        print(f"    {win_rate_1StepLA} vs. 1StepLA" +
              f"    avg_len={round(res3['avg_game_len'], 2)}")
        history['vs_1StepLA_win_rate'].append(win_rate_1StepLA)
        history['vs_1StepLA_avg_game_len'].append(res3['avg_game_len'])
        if win_rate_1StepLA > vs_1StepLA_best_win_rate:
            vs_1StepLA_best_win_rate = win_rate_1StepLA
            load_state_dict(from_=reinforce_agent.model, to_=old_reinforce_agent.model)
            old_reinforce_agent.model.eval()
            if SAVE_MODELS:
                # round() rather than int(): e.g. 0.57*100 is 56.99999999999999
                file_name = "checkpoints/" + save_best_vs_1StepLA_file.format(
                    win_rate=round(win_rate_1StepLA*100)
                )
                reinforce_agent.model.save_weights(
                    file_path=file_name,
                    training_hparams=hparams,
                )
                print(f"        new best {file_name} is saved!!!")
        elif win_rate_1StepLA <= vs_1StepLA_best_win_rate-hparams['vs_1StepLA_win_rate_decrease_to_undo_updates']:
            load_state_dict(from_=old_reinforce_agent.model, to_=reinforce_agent.model)
            print("        undoing last updates...")

## 7) Plot training results

In [None]:
def moving_average(x, w):
    return np.convolve(x, np.ones(w), 'valid') / w
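
For reference, convolving with a window of ones in `'valid'` mode is the mean of every full window of length `w`, so the result has `len(x) - w + 1` points (for `len(x) >= w`). A plain-Python equivalent, with the hypothetical name `moving_average_py`, makes this explicit:

```python
def moving_average_py(x, w):
    """Mean of every full window of length w, sliding one position at a time."""
    return [sum(x[i:i + w]) / w for i in range(len(x) - w + 1)]


values = [1, 2, 3, 4, 5]
print(moving_average_py(values, w=2))  # 4 windows for 5 values
```

This is why the smoothed curves in the plots below are slightly shorter than the raw histories.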

In [None]:
p_losses = np.array(history['policy_losses'])
v_losses = np.array(history['value_losses'])
total_losses = p_losses + hparams['c1'] * v_losses

In [None]:
data = moving_average(total_losses[:100000], w=1000)
x_vals = [x/1000 for x in range(len(data))]

plt.plot(x_vals, data)
plt.title('REINFORCE with baseline Training Loss')
plt.xlabel("updates (in thousands)")
plt.ylabel("loss")
#plt.gca().xaxis.set_major_locator(MultipleLocator(10))
#plt.gca().yaxis.set_major_locator(MultipleLocator(0.025))
plt.show()

In [None]:
#num_updates = 100000
#x_vals = range(0, num_updates, num_updates//len(history['vs_1StepLA_win_rate']))
#x_vals = [x/1000 for x in x_vals]
data = history['vs_1StepLA_win_rate']
# one evaluation is recorded every 'comp_every' episodes
x_vals = [(i+1)*history['comp_every'] for i in range(len(data))]

plt.plot(x_vals, data)
plt.title('REINFORCE with baseline Win rate vs 1StepLA')
plt.xlabel("episodes")
plt.ylabel("win rate")
#plt.gca().xaxis.set_major_locator(MultipleLocator(10000))
plt.axhline(1, linestyle='--', alpha=0.4)
plt.axhline(0.5, linestyle='--', alpha=0.4)
plt.ylim(0.35, 1.09)
#plt.xlim(0, 105)
plt.show()

In [None]:
#num_updates = 100000
#x_vals = range(0, num_updates, num_updates//len(history['vs_1StepLA_win_rate']))
#x_vals = [x/1000 for x in x_vals]
data = history['vs_old_self_avg_game_len']
# one evaluation is recorded every 'comp_every' episodes
x_vals = [(i+1)*history['comp_every'] for i in range(len(data))]

plt.title('REINFORCE with baseline Self-play game length')
plt.plot(x_vals, data)
plt.xlabel("episodes")
plt.ylabel("game length")
#plt.gca().xaxis.set_major_locator(MultipleLocator(10))
plt.axhline(42, linestyle='--', alpha=0.4)
plt.axhline(7, linestyle='--', alpha=0.4)
plt.ylim(-1, 45)
plt.show()