# Gradient Bandit Algorithm

## Installation

In [None]:
!pip install numpy==1.26.4 matplotlib==3.8.4 torch==2.5.1

## Gradient Bandit Algorithms

### The Softmax Policy in Bandit Algorithms
In the gradient bandit algorithm, the policy $\pi_{\theta}$ assigns each action $a$ a probability given by the softmax function.

$$\pi_{\theta}(a) = \frac{e^{H_{\theta}(a)}}{\sum_{b=1}^{K} e^{H_{\theta}(b)}}$$

**$H_{\theta}(a)$** is the learned numerical **preference** (or logit) for action $a$. Each exponentiated value is positive, and dividing by their sum normalizes them, so the probabilities over all $K$ actions sum to 1. The parameters $\theta$ are obtained through training. In this notebook, we select the action with the largest probability.

$$A_t = \arg\max_a\{ \pi_{\theta}(a)\} $$

For example, with $K=3$: $\pi_{\theta}(1)=0.25$, $\pi_{\theta}(2)=0.50$, and $\pi_{\theta}(3)=0.25$.

$A_1 = \arg\max_a \{0.25, 0.5,0.25 \}$  
$A_1 =2$
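
To make the example concrete, here is a minimal sketch in plain Python that maps preferences to softmax probabilities and picks the greedy action. The preference values `[0, ln 2, 0]` are an assumption, chosen only so that the probabilities reproduce the $K=3$ example; they do not come from a trained model.

```python
import math

def softmax(preferences):
    """Convert a list of preferences H(a) into probabilities pi(a)."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    m = max(preferences)
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical preferences, chosen to reproduce the K=3 example.
H = [0.0, math.log(2.0), 0.0]
probs = softmax(H)

# Greedy action; +1 converts Python's 0-based index to the 1-based action label.
greedy_action = max(range(len(probs)), key=lambda a: probs[a]) + 1

print([round(p, 2) for p in probs])
print(greedy_action)
```

Only the differences between preferences matter: adding the same constant to every $H_{\theta}(a)$ leaves the probabilities unchanged.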

### Action Preference ($H_{\theta}(a)$) in the Model

In the `SimpleLinearModel`, the action preference $H_{\theta}(a)$ is the direct, linear output of the network *before* the softmax is applied. This preference is calculated using the learned weights ($\mathbf{w}_a$), the bias ($b_a$), and the fixed input vector ($\mathbf{x}$):

$$H_{\theta}(a) = \mathbf{w}_a^T \mathbf{x} + b_a$$

Since the model fixes the input to the single feature $x = 1$, the preference reduces to one learnable scalar per action. Absorbing the bias into the weight, we write:

$H_{\theta}(a) = w_a$

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from deepmind_bandits import GaussianBandits, BanditDataAnalyzer

# Set random seed
np.random.seed(42)
torch.manual_seed(42)

## PyTorch Model Implementation

In PyTorch, the policy needs only a single linear layer. The input dimension is 1, and the output dimension equals the number of actions. Unlike a standard classifier, where the input features vary from sample to sample, the model here takes no external input: the constant input is created inside the `forward` function, and the layer returns the logits (preferences). The softmax itself is applied later, inside the loss.

In [None]:
class SimpleLinearModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(SimpleLinearModel, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self):
        # Constant input: the bandit has no state, so the only feature is 1.
        x = torch.tensor([1.0])
        # Returns a 1-D tensor of shape (output_dim,) holding the logits.
        return self.linear(x)

## Policy Gradient Derivation

We have a policy parameterized by $\theta$ that defines a probability distribution $\pi_\theta(a)$ over actions. Each action $a$ has a true mean reward $q(a)$. Our goal is to choose $\theta$ so that the policy selects actions that maximize the expected reward. The expected reward under the policy is

$$
J(\theta) = \mathbb{E}[R \mid \pi_\theta] = \sum_a \pi_\theta(a)\, q(a).
$$

Since we are optimizing with respect to $\theta$, and the reward function $q(a)$ does not depend on $\theta$, the gradient of the objective is

$$
\nabla_\theta J(\theta)
= \nabla_\theta \sum_a \pi_\theta(a)\, q(a)
= \sum_a q(a)\, \nabla_\theta \pi_\theta(a).
$$

Using the log-derivative trick:

$$
\nabla_\theta \pi_\theta(a)
= \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a).
$$

Substituting this gives

$$
\nabla_\theta J(\theta)
= \sum_a q(a)\, \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a).
$$

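Since actions are drawn according to $A \sim \pi_\theta$, the sum above is an expectation, which we can approximate by averaging over $N$ sampled actions:

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{A \sim \pi_\theta}\bigl[q(A)\,\nabla_\theta \log \pi_\theta(A)\bigr]
\approx
\frac{1}{N}\sum_{t=1}^N q(A_t)\,\nabla_\theta \log \pi_\theta(A_t),
\qquad A_t \sim \pi_\theta.
$$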
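
The sampled estimator can be checked numerically. The sketch below assumes the parameters are the preferences themselves ($\theta = H$), so that $\partial \log \pi_\theta(a) / \partial H(b) = \mathbf{1}[a=b] - \pi_\theta(b)$ and the exact gradient is $\pi_\theta(b)\,(q(b) - J)$. The values of `q` and `H`, the sample size `N`, and the seed are illustrative assumptions.

```python
import math
import random

def softmax(h):
    """Numerically stable softmax over a list of preferences."""
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
q = [1.0, 2.0, -1.0, 0.0]   # assumed true mean rewards q(a)
H = [0.0, 0.0, 0.0, 0.0]    # assumed preferences (uniform policy)
pi = softmax(H)
K = len(q)

# Exact gradient with respect to each preference: pi(b) * (q(b) - J).
J = sum(p * r for p, r in zip(pi, q))
exact = [pi[b] * (q[b] - J) for b in range(K)]

# Monte Carlo estimate: average q(A_t) * grad log pi(A_t) over sampled actions.
N = 100_000
actions = random.choices(range(K), weights=pi, k=N)
estimate = [0.0] * K
for a in actions:
    for b in range(K):
        indicator = 1.0 if a == b else 0.0
        estimate[b] += q[a] * (indicator - pi[b]) / N

print("exact:   ", [round(g, 3) for g in exact])
print("estimate:", [round(g, 3) for g in estimate])
```

The two vectors should agree up to sampling noise, which shrinks as `N` grows.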

## REINFORCE Update Rule

In the bandit setting, we do not know the true action-value function $q(a)$. Instead, after selecting an action $A_t$, we observe a reward sample $R_t$, which is an unbiased estimator of $q(A_t)$.

Since $q(a)$ is unknown, we replace it with the sampled reward $R_t$, yielding the Monte Carlo policy gradient estimator:

$$
\nabla_\theta J(\theta)
\approx R_t\, \nabla_\theta \log \pi_\theta(A_t).
$$

Using this estimator, we update the policy parameters via gradient ascent:

$$
\theta_{t+1}
= \theta_t + \alpha_t\, R_t\, \nabla_\theta \log \pi_\theta(A_t).
$$

This gives the **REINFORCE** update rule in the bandit setting:

$$
\theta \leftarrow \theta + \alpha\, R\, \nabla_\theta \log \pi_\theta(A).
$$
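
As a rough illustration of this update, the sketch below applies it directly to a list of preferences, without PyTorch. It assumes $\theta = H$ and samples each action from the policy, as the derivation requires. The helper name `reinforce_step`, the seed, and the step count are hypothetical; the means and standard deviations mirror the Gaussian environment set up later in this notebook.

```python
import math
import random

def softmax(h):
    """Numerically stable softmax over a list of preferences."""
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(H, action, reward, alpha):
    """Return the preferences after one REINFORCE update.

    Uses grad_H log pi(action) = onehot(action) - pi.
    """
    pi = softmax(H)
    return [h + alpha * reward * ((1.0 if b == action else 0.0) - pi[b])
            for b, h in enumerate(H)]

random.seed(0)
means = [1.0, 2.0, -1.0, 0.0]
stds = [0.1, 0.2, 0.1, 0.3]
H = [0.0] * len(means)

for t in range(1000):
    pi = softmax(H)
    # Sample A_t ~ pi, then draw a noisy reward for that arm.
    action = random.choices(range(len(H)), weights=pi, k=1)[0]
    reward = random.gauss(means[action], stds[action])
    H = reinforce_step(H, action, reward, alpha=0.1)

print([round(p, 3) for p in softmax(H)])
```

Because the raw reward scales every step, arms with positive rewards are reinforced whenever they are sampled, so the result of a single run depends on the seed.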

## Loss Function

With the preferences $w_a$ as logits, PyTorch's cross-entropy loss (cross-entropy with logits) for the chosen action $A_t$ is given by:

$$
\ell_t
= - \sum_{a=1}^{K} \mathbf{1}[A_t = a]\,
\ln\left( \frac{e^{w_a}}{\sum_{i=1}^{K} e^{w_i}} \right).
$$

This simplifies to:

$$
\ell_t = - \ln\!\bigl(\pi_\theta(A_t)\bigr).
$$

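To recover the bandit policy-gradient form, we multiply by the reward $R_t$, which is treated as a constant. A gradient-descent step on this loss is then the same as the REINFORCE gradient-ascent step, because $-\nabla_\theta \ell_t = R_t\,\nabla_\theta \ln \pi_\theta(A_t)$: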

$$
\ell_t = -\, R_t\, \ln\!\bigl(\pi_\theta(A_t)\bigr).
$$
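
The gradient of this loss can be verified with a finite-difference check. The sketch below assumes the parameters are the preferences themselves, in which case the analytic gradient is $-R_t\,(\mathbf{1}[a = A_t] - \pi_\theta(a))$. The preference values, the action, and the reward are hypothetical.

```python
import math

def log_softmax_at(H, a):
    """Return log pi(a) for preferences H, using a stable log-sum-exp."""
    m = max(H)
    log_z = m + math.log(sum(math.exp(h - m) for h in H))
    return H[a] - log_z

def loss(H, action, reward):
    """Bandit loss: -R * ln pi(A)."""
    return -reward * log_softmax_at(H, action)

def analytic_grad(H, action, reward):
    """Closed-form gradient of the loss with respect to each preference."""
    pi = [math.exp(log_softmax_at(H, b)) for b in range(len(H))]
    return [-reward * ((1.0 if b == action else 0.0) - pi[b])
            for b in range(len(H))]

def numeric_grad(H, action, reward, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = []
    for b in range(len(H)):
        up, down = list(H), list(H)
        up[b] += eps
        down[b] -= eps
        grad.append((loss(up, action, reward) - loss(down, action, reward)) / (2 * eps))
    return grad

H = [0.3, -0.2, 0.5, 0.0]   # hypothetical preferences
action, reward = 2, 1.7     # hypothetical action and reward

g_analytic = analytic_grad(H, action, reward)
g_numeric = numeric_grad(H, action, reward)
print([round(g, 4) for g in g_analytic])
print([round(g, 4) for g in g_numeric])
```

Negating this gradient gives the REINFORCE direction, so minimizing the loss with SGD performs gradient ascent on the expected reward.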

In [None]:
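def bandit_loss(logits, actions, rewards):
    """Policy gradient loss for bandits: R_t * (-ln pi(A_t))."""
    # Rewards are constants in the gradient; as_tensor avoids re-wrapping tensors.
    rewards = torch.as_tensor(rewards, dtype=logits.dtype)
    # cross_entropy applies log-softmax to the logits and returns -ln pi(A_t).
    return rewards * F.cross_entropy(logits, actions)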

## Setup Environment and Model

In [None]:
# Environment setup
means = [1.0, 2.0, -1.0, 0.0]
stds = [0.1, 0.2, 0.1, 0.3]
env = GaussianBandits(means, stds)
num_arms = env.num_arms
num_actions = num_arms

print(f"Number of arms: {num_arms}")
print(f"True means: {means}")
print(f"Optimal arm: {np.argmax(means)} (mean={max(means)})")

In [None]:
# Model configuration
lr = 0.1
T = 1000

model = SimpleLinearModel(1, num_actions)
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Create analyzer
analyzer = BanditDataAnalyzer(means, num_actions)

# Baseline (running mean)
running_mean = 0.0
n = 1

print(f"Learning rate: {lr}")
print(f"Total steps: {T}")

## Helper Function

In [None]:
def argmax_(z):
    """Get argmax and detach from computation graph."""
    return torch.argmax(z.detach())

## Training Loop

In [None]:
for t in range(T):
    # Forward pass: get logits (action preferences)
    logits = model()
    # Greedy choice; argmax_ already returns a detached 0-d tensor
    action = argmax_(logits)

    # Pull arm and observe reward
    reward = env.pull_arm(action)

    # Update running mean of rewards; n counts observations, so it must advance
    running_mean = running_mean + (reward - running_mean) / n
    n += 1

    # Compute loss with the sampled reward R_t, matching -R_t * ln(pi(A_t))
    loss = bandit_loss(logits, action, reward)

    # Backprop and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Track performance
    analyzer.update_and_analyze(action.item(), reward, loss_sample=loss)

print(f"\nTraining completed: {T} steps")
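
The running-mean update in the loop is the standard incremental average, which equals the ordinary mean only when the count `n` advances by one with each new reward. A minimal sketch, with a hypothetical helper `update_mean` and made-up reward values:

```python
def update_mean(mean, n, x):
    """Return the mean of n values after folding in the n-th value x."""
    return mean + (x - mean) / n

rewards = [2.0, 1.0, 3.0, 2.0]   # hypothetical reward samples
mean = 0.0
for n, r in enumerate(rewards, start=1):
    mean = update_mean(mean, n, r)

# The incremental result matches the batch average.
print(mean, sum(rewards) / len(rewards))
```

If `n` stays fixed at 1, the same formula simply returns the most recent reward.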

## Analyze Results

In [None]:
analyzer.finalize_analysis()

### Q-Value Progression

In [None]:
analyzer.plot_Qvalue()

### Cumulative Regret and Loss

In [None]:
analyzer.plot_regret()
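
Cumulative regret is commonly defined as the running sum of the gap between the best arm's mean and the mean of the arm actually chosen. Whether `BanditDataAnalyzer` computes it exactly this way is an assumption; the sketch below only illustrates the usual definition, with a hypothetical action sequence.

```python
from itertools import accumulate

def cumulative_regret(means, actions):
    """Running sum of (best mean - mean of the chosen arm) over time."""
    best = max(means)
    per_step = [best - means[a] for a in actions]
    return list(accumulate(per_step))

means = [1.0, 2.0, -1.0, 0.0]
actions = [1, 0, 2, 1]   # hypothetical sequence of chosen arms (0-based)
regret = cumulative_regret(means, actions)
print(regret)
```

A curve that flattens indicates the policy has settled on the optimal arm, while a steadily rising line indicates it keeps choosing a suboptimal one.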

### Cumulative Reward

In [None]:
analyzer.plot_cumulative_reward()

## Model Weights (Learned Preferences)

In [None]:
print("=== Gradient Bandit Performance Summary ===")
print(f"\nLearned action preferences (weights):")
with torch.no_grad():
    final_logits = model()
    probs = F.softmax(final_logits, dim=0)
    for a in range(num_actions):
        print(f"  Arm {a}: H(a)={final_logits[a].item():.3f}, π(a)={probs[a].item():.3f} (true mean={means[a]:.3f})")

optimal_arm = np.argmax(means)
print(f"\nOptimal arm: {optimal_arm}")
print(f"Final cumulative regret: {analyzer.regret[-1]:.2f}")

## Loss Distribution Histogram

Let's visualize the distribution of loss values during training. This helps us understand:
- How loss values are distributed
- Whether learning stabilized over time
- The range of loss values encountered

In [None]:
# Extract loss values (convert from tensors to numpy)
loss_values = [loss.item() if torch.is_tensor(loss) else loss for loss in analyzer.loss]

# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(loss_values, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Loss Value', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Loss Values During Training', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

# Add statistics
mean_loss = np.mean(loss_values)
median_loss = np.median(loss_values)
plt.axvline(mean_loss, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_loss:.3f}')
plt.axvline(median_loss, color='green', linestyle='--', linewidth=2, label=f'Median: {median_loss:.3f}')
plt.legend()

plt.tight_layout()
plt.show()

print(f"Loss Statistics:")
print(f"  Mean: {mean_loss:.4f}")
print(f"  Median: {median_loss:.4f}")
print(f"  Std Dev: {np.std(loss_values):.4f}")
print(f"  Min: {np.min(loss_values):.4f}")
print(f"  Max: {np.max(loss_values):.4f}")