In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
skip_training = False

In [None]:
from pathlib import Path
import torch

import ppo_blackjack_agent as a
import ppo_blackjack_train as t
import utils as u

In [None]:
work_dir = Path().cwd()/'results'

In [None]:
cfg_args = {
    "exp_name": "ex1",
    "seed": 2025,
    "env_name": "Blackjack-v1",
    "model_name": "Blackjack-v1",
    "max_episode_steps": 100, # not relevant for Blackjack
    "train_episodes": 500000,
    "batch_size": 256,
    "min_update_samples": 5000,
    "testing": False,
    "model_path": "default",
    "save_video": True,
    "save_model": True,
    "save_logging": True,
    "silent": False,
    "use_wandb": True,
    "run_suffix": 0,
}

In [None]:
if not skip_training:
    t.train(cfg_args=cfg_args)

In [None]:
if not skip_training:
    u.plot_reward(work_dir/'logging'/'Blackjack-v1_2025.csv', 'Blackjack')

In [None]:
model_path = work_dir / 'model' / f'{cfg_args["model_name"]}_params.pt'

observation_space_dim = 3
action_space_dim = 2
policy = a.Policy(observation_space_dim, action_space_dim)

print(f"Loading model from {model_path}...")
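# The policy is created on the CPU above, so load the weights there as well;
# map_location='cuda' raises an error on machines without a GPU.
state_dict = torch.load(model_path, map_location='cpu')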
policy.load_state_dict(state_dict)

policy.eval()

In [None]:
import matplotlib.pyplot as plt

value_grid_usable_ace, policy_grid_usable_ace = u.create_grids_nn(policy, usable_ace=True)
fig1 = u.create_plots(value_grid_usable_ace, policy_grid_usable_ace, title="With usable ace")
plt.show()

value_grid_no_ace, policy_grid_no_ace = u.create_grids_nn(policy, usable_ace=False)
fig2 = u.create_plots(value_grid_no_ace, policy_grid_no_ace, title="Without usable ace")
plt.show()


Overall, training seems to converge to a reasonable policy, though not the optimal one. The differences from the optimal policy are small, and longer training might still change some of the decisions in edge cases.

A good benchmark for the problem can be found here: https://chisness.github.io/2020-09-21/monte-carlo-rl-and-blackjack

Their optimal policy, found with a Monte Carlo method after 10 million iterations, is almost identical (there is a single difference) to the optimal policy given by Sutton and Barto in their RL book.
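
To illustrate how the average reward of a fixed policy can be estimated with Monte Carlo rollouts, here is a minimal sketch. It is not the benchmark's code and does not use the notebook's environment. It assumes simplified rules intended to resemble Blackjack-v1 (cards drawn with replacement, dealer draws until reaching 17, rewards of -1, 0, or +1). The two threshold policies and all function names are hypothetical and exist only for this example.

```python
import random

STICK, HIT = 0, 1  # assumed action encoding (0 = stick, 1 = hit)


def draw_card(rng):
    # Infinite deck: ranks 1..13, face cards count as 10, the ace is stored as 1.
    return min(rng.randint(1, 13), 10)


def hand_total(cards):
    # Count one ace as 11 when that does not bust the hand ("usable ace").
    total = sum(cards)
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10 if usable_ace else total), usable_ace


def play_episode(policy, rng):
    player = [draw_card(rng), draw_card(rng)]
    dealer = [draw_card(rng), draw_card(rng)]
    while True:
        total, usable_ace = hand_total(player)
        if total > 21:
            return -1  # player busts
        if policy(total, dealer[0], usable_ace) == STICK:
            break
        player.append(draw_card(rng))
    # Assumed dealer rule: keep drawing while the total is below 17.
    while hand_total(dealer)[0] < 17:
        dealer.append(draw_card(rng))
    player_total = hand_total(player)[0]
    dealer_total = hand_total(dealer)[0]
    if dealer_total > 21 or player_total > dealer_total:
        return 1
    return 0 if player_total == dealer_total else -1


def average_reward(policy, n_episodes, seed=0):
    # Monte Carlo estimate: the mean reward over independent episodes.
    rng = random.Random(seed)
    return sum(play_episode(policy, rng) for _ in range(n_episodes)) / n_episodes


def stick_at_17(player_total, dealer_card, usable_ace):
    # Hypothetical baseline: copy the dealer's threshold.
    return STICK if player_total >= 17 else HIT


def stick_at_20(player_total, dealer_card, usable_ace):
    # Hypothetical baseline: stick only on 20 or 21.
    return STICK if player_total >= 20 else HIT


print("stick at 17:", average_reward(stick_at_17, 50_000))
print("stick at 20:", average_reward(stick_at_20, 50_000))
```

A trained policy could be evaluated the same way by wrapping it in a function with the `(player_total, dealer_card, usable_ace)` signature, although the numbers would only be comparable to the ones reported here if the rules match the actual environment.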

The Monte Carlo method also seems better suited to blackjack than PPO, but the point here was to test that the PPO implementation works on blackjack, and based on these results it does. To get closer to the optimal policy, the hyperparameters, in particular the batch size and the number of training episodes, could be tuned. With the current policy the average reward is about -0.050, whereas the optimal policy yields about -0.047.
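
Because the gap between -0.050 and -0.047 is so small, it is worth checking how many evaluation episodes are needed before such a difference is distinguishable from noise. The sketch below is a rough back-of-the-envelope calculation, not part of the training code. It assumes per-episode rewards of -1, 0, or +1, so the reward standard deviation is at most 1, and it uses a normal approximation with z = 1.96 for a 95% interval. Both function names are hypothetical.

```python
import math


def standard_error(reward_std, n_episodes):
    # Standard error of the mean reward over n independent episodes.
    return reward_std / math.sqrt(n_episodes)


def episodes_needed(gap, reward_std, z=1.96):
    # Smallest n for which the interval half-width z * std / sqrt(n) is <= gap.
    return math.ceil((z * reward_std / gap) ** 2)


learned_reward = -0.050  # average reward reported for the trained policy
optimal_reward = -0.047  # average reward reported for the optimal policy
gap = abs(learned_reward - optimal_reward)

reward_std = 1.0  # assumption: upper bound for rewards in {-1, 0, +1}

print(f"gap between policies: {gap:.3f}")
print(f"standard error at 10,000 episodes: {standard_error(reward_std, 10_000):.3f}")
print(f"episodes needed for a half-width equal to the gap: {episodes_needed(gap, reward_std)}")
```

Under these assumptions, a single estimate needs several hundred thousand evaluation episodes before its 95% interval is as narrow as the gap itself, so a comparison based on far fewer episodes says little about which policy is better.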