In [44]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

## Contextual Bandit 

Contextual bandit models are a class of machine learning models used in online decision-making scenarios, where an agent/policy needs to make a sequence of decisions or actions over time while interacting with an environment. The rewards or outcomes of actions are influenced by additional contextual information.

### Key Components of Contextual Bandit Models: 
- <b>Actions</b>: A set of choices that an agent can make. Each action has an associated reward or outcome.
- <b>Context</b>: Features that describe the current state or environment. Contextual features help inform the agent's decision on which action to take.
- <b>Reward Function</b>: Estimation or prediction of the expected reward for each action given the current context. The reward function maps actions and contexts to reward values, and is typically learned from the data.
- <b>Learning Algorithm</b>: The procedure used to estimate the value of actions in different contexts. The algorithm updates its value estimates based on observed rewards and contexts.
- <b>Policy</b>: Definition of the strategy for selecting actions based on the current context. The goal is to find an optimal policy that maximizes cumulative rewards over time.
- <b>Exploration vs. Exploitation</b>: Effectively balancing the trade-off between exploration (trying new actions to learn their values) and exploitation (choosing actions that are believed to have high rewards based on current knowledge). Exploration is essential for learning the true values of actions, especially when there is uncertainty or limited knowledge about the rewards associated with different actions in various contexts. Exploitation aims to maximize the immediate rewards by choosing actions that are likely to perform well. The choice of policy (exploration strategy) determines how the agent balances these two objectives.
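The exploration vs. exploitation trade-off described above can be illustrated with an epsilon-greedy rule, which is one common strategy among several (Thompson sampling and UCB are others). The sketch below is a minimal example only; the function name `epsilon_greedy` and the `epsilon` values are assumptions for illustration and are not part of this notebook's code.

```python
import torch

def epsilon_greedy(action_scores: torch.Tensor, epsilon: float = 0.1) -> int:
    """Pick a random action with probability epsilon, else the top-scoring one.

    `action_scores` is a 1-D tensor of estimated rewards (hypothetical helper).
    """
    n_actions = action_scores.numel()
    if torch.rand(1).item() < epsilon:
        # Explore: sample an action uniformly at random.
        return int(torch.randint(n_actions, (1,)).item())
    # Exploit: choose the action with the highest estimated reward.
    return int(torch.argmax(action_scores).item())

scores = torch.tensor([0.2, 1.5, -0.3])
greedy_action = epsilon_greedy(scores, epsilon=0.0)  # never explores
random_action = epsilon_greedy(scores, epsilon=1.0)  # always explores
```

With `epsilon=0.0` the rule reduces to the purely greedy `argmax` choice; larger values trade immediate reward for more information about rarely tried actions.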

### Prototypical Use Cases: 
Contextual bandit models are prototypically used in situations where sequential decision-making is required based on contextual information.
- <b>Online Advertising</b>: Displaying targeted ads to users on websites or mobile apps. Choosing which ad to show to a user based on their browsing behavior, demographics, and context.
- <b>Recommender Systems</b>: Recommending products, movies, music, or content to users. Selecting the next item to recommend in a sequence based on user preferences, historical interactions, and real-time context.
- <b>Content Personalization</b>: Customizing the content shown to users on websites or news platforms, e.g., tailoring news articles or videos to individual preferences and reading patterns.
- <b>Dynamic Pricing</b>: Setting prices for products or services in real-time based on market conditions, user behavior, and competitor pricing. Offering personalized discounts or promotions to maximize revenue and customer satisfaction.
- <b>Supply Chain Management</b>: Optimizing inventory management and order fulfillment based on real-time demand, supplier conditions, and inventory levels.

### Contextual Bandit Models

#### - Linear Model 
Linear bandit models consist of a single linear layer that takes the context as input and produces action scores or logits. The assumption is that the relationship between the context and action values is linear. Linear models scale easily to large datasets and high-dimensional contexts; however, they may not capture complex non-linear relationships present in the data.

In [46]:
class LinearBanditModel(nn.Module):
    def __init__(self, n_features, n_actions):
        super(LinearBanditModel, self).__init__()
        self.fc = nn.Linear(n_features, n_actions)

    def forward(self, x):
        return self.fc(x)

#### - Feedforward Neural Network (DNN) Model
DNN bandit models are neural network architectures composed of multiple layers, including one or more hidden layers with non-linear activation functions. They are useful when decisions require capturing complex, non-linear relationships between the context and the expected rewards for each action. DNN models can be computationally intensive, require larger datasets, and are prone to overfitting if not regularized properly, so hyperparameter tuning is crucial.

In [47]:
class FeedforwardBanditModel(nn.Module):
    def __init__(self, n_features, n_actions, hidden_dim=64):
        super(FeedforwardBanditModel, self).__init__()
        self.fc1 = nn.Linear(n_features, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        return self.fc2(x)

#### - Wide & Deep Model
Wide and deep bandit models combine a linear (wide) component, which memorizes direct associations between context features and action rewards, with a deep neural network (deep) component, which generalizes by learning non-linear feature interactions. In the implementation below the two outputs are summed, so the model keeps the linear signal while still capturing more complex structure, and can potentially perform well across a wide range of contextual bandit problems.

In [49]:
class WideComponent(nn.Module):
    def __init__(self, n_features, n_actions):
        super(WideComponent, self).__init__()
        self.linear = nn.Linear(n_features, n_actions)

    def forward(self, x):
        return self.linear(x)

class DeepComponent(nn.Module):
    def __init__(self, n_features, n_actions, hidden_dim=64):
        super(DeepComponent, self).__init__()
        self.fc1 = nn.Linear(n_features, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        return self.fc2(x)

class WideAndDeepModel(nn.Module):
    def __init__(self, n_features, n_actions, hidden_dim=64):
        super(WideAndDeepModel, self).__init__()
        self.wide_component = WideComponent(n_features, n_actions)
        self.deep_component = DeepComponent(n_features, n_actions, hidden_dim=hidden_dim)

    def forward(self, x):
        # Combine the outputs of both components (wide and deep).
        wide_output = self.wide_component(x)
        deep_output = self.deep_component(x)
        return wide_output + deep_output

#### - Convolutional Neural Network (CNN) Model
CNN bandit models are employed when contextual information is structured as images or grid-like data, making them well-suited for capturing spatial patterns and features in the context. The model consists of convolutional layers, followed by one or more fully connected layers. Convolutional layers perform spatial feature extraction, while fully connected layers process extracted features to make action predictions.

In [50]:
class ConvolutionalBanditModel(nn.Module):
    def __init__(self, n_actions):
        super(ConvolutionalBanditModel, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(32 * 32 * 32, n_actions)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = x.view(x.size(0), -1) # Flatten the output.
        return self.fc(x)
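The `nn.Linear(32 * 32 * 32, n_actions)` layer above implicitly assumes 3-channel inputs of size 32x32: with `kernel_size=3`, `stride=1`, and `padding=1` the convolution preserves height and width, so the flattened size is 32 channels x 32 x 32. A quick way to check this is a dummy forward pass, sketched below. The input size is an assumption, since the notebook does not state it.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1)

# Assumed input: a batch of 4 RGB images of size 32x32.
dummy = torch.zeros(4, 3, 32, 32)
with torch.no_grad():
    features = conv(dummy)

# Flatten everything except the batch dimension and read off the size.
flat_dim = features.flatten(start_dim=1).shape[1]
print(tuple(features.shape), flat_dim)
```

If the context images have a different resolution, the `in_features` of the fully connected layer needs to change accordingly.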

#### - Recurrent Neural Network (RNN) Model
RNN bandit models are able to handle temporal dependencies in the data, making them a suitable choice when the contextual information is sequential or time-dependent. Typically comprised of one or more recurrent layers, followed by one or more fully connected layers. The recurrent layers process sequential context data, while fully connected layers make action predictions. Dependencies can be short-range (e.g., RNN) or long-range (e.g., LSTM).

In [51]:
class RNNBanditModel(nn.Module):
    def __init__(self, n_features, n_actions, hidden_dim=64, num_layers=1):
        super(RNNBanditModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.rnn = nn.RNN(n_features, hidden_dim, num_layers, batch_first=True) # Define RNN layer.
        self.fc = nn.Linear(hidden_dim, n_actions) # Define output layer.
        
    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device) # Initialize hidden state with zeros.
        out, _ = self.rnn(x, h0) # Forward pass through the RNN layer.
        out = self.fc(out[:, -1, :]) # Take the output from the last time step and pass it through the output layer.
        return out
    

class LSTMContextualBandit(nn.Module):
    def __init__(self, n_features, n_actions, hidden_dim=64, num_layers=1):
        super(LSTMContextualBandit, self).__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.lstm = nn.LSTM(n_features, hidden_dim, num_layers, batch_first=True) #Define LSTM layer.
        self.fc = nn.Linear(hidden_dim, n_actions) # Define output layer.

    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device) # Initialize hidden states with zeros.
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)
        out, _ = self.lstm(x, (h0, c0)) # Forward pass through the LSTM layer.
        out = self.fc(out[:, -1, :]) # Take the output from the last time step and pass it through the output layer.
        return out
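Both recurrent models expect input of shape `(batch, seq_len, n_features)` because `batch_first=True`. The shape check below uses a bare `nn.LSTM` with arbitrary, assumed sizes. It also shows that omitting the initial state gives the same result as passing explicit zeros, which is PyTorch's default.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_features, hidden_dim, num_layers = 5, 64, 1
lstm = nn.LSTM(n_features, hidden_dim, num_layers, batch_first=True)

x = torch.randn(8, 10, n_features)  # 8 sequences, 10 time steps each

h0 = torch.zeros(num_layers, x.size(0), hidden_dim)
c0 = torch.zeros(num_layers, x.size(0), hidden_dim)
out_explicit, _ = lstm(x, (h0, c0))  # explicit zero initial state
out_default, _ = lstm(x)             # default initial state (zeros)

last_step = out_explicit[:, -1, :]   # features passed to the output layer
```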

### Model Training Function

In [39]:
def train_bandit(model, optimizer, criterion, num_epochs):
    """Train `model` on the full dataset for `num_epochs` epochs.

    Reads `context` and `chosen_actions` from the notebook's global scope.
    """
    # Convert chosen_actions to a PyTorch tensor once, outside the loop.
    chosen_actions_tensor = torch.as_tensor(chosen_actions, dtype=torch.long)
    for epoch in range(num_epochs):
        # Forward pass.
        action_logits = model(context)
        # Calculate the loss.
        loss = criterion(action_logits, chosen_actions_tensor)

        # Backpropagation and optimization.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print(f"Epoch [{epoch}/{num_epochs}] Loss: {loss.item():.4f}")
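The function above trains on the full dataset in every epoch and reads `context` and `chosen_actions` from the notebook's global scope. A possible variant, sketched below under the assumption that mini-batches are wanted, passes the data in explicitly and iterates with a `DataLoader`. The name `train_bandit_minibatch`, the `batch_size` default, and the toy data are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def train_bandit_minibatch(model, optimizer, criterion, context, actions,
                           num_epochs=5, batch_size=256):
    """Hypothetical mini-batch variant; returns the mean loss per epoch."""
    loader = DataLoader(TensorDataset(context, actions),
                        batch_size=batch_size, shuffle=True)
    history = []
    for _ in range(num_epochs):
        total, count = 0.0, 0
        for batch_context, batch_actions in loader:
            loss = criterion(model(batch_context), batch_actions)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Accumulate a sample-weighted running loss for this epoch.
            total += loss.item() * batch_context.size(0)
            count += batch_context.size(0)
        history.append(total / count)
    return history

# Toy data (assumed sizes) to exercise the function.
torch.manual_seed(0)
toy_context = torch.randn(1000, 5)
toy_actions = torch.randint(0, 3, (1000,))
toy_model = nn.Linear(5, 3)
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.01)
losses = train_bandit_minibatch(toy_model, toy_optimizer, nn.CrossEntropyLoss(),
                                toy_context, toy_actions)
```

Returning the loss history instead of printing it makes the function easier to test and to plot.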

### Model Prediction Function

In [129]:
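def predict_bandit(k):
    """Run `k` turns; reads `model`, `rewards`, and `n_features` from global scope."""
    with torch.no_grad():  # Inference only, so skip gradient tracking.
        for i in range(1, k + 1):
            # Draw a new random context for this turn.
            new_context = torch.tensor(np.random.randn(1, n_features), dtype=torch.float32)
            action_logits = model(new_context)
            # Greedy choice: the action with the highest score.
            chosen_action = torch.argmax(action_logits).item()
            # Note: `rewards` has one entry per training sample, so this lookup
            # returns the reward stored at index `chosen_action` (sample 0, 1, or 2).
            reward = rewards[chosen_action].item()
            print(f'Turn: {i}, Action: {chosen_action}, Reward: {reward}')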
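Since `rewards` holds one value per training sample, `rewards[chosen_action]` above returns the reward stored for training sample 0, 1, or 2 rather than a reward for the newly drawn context. One possible alternative is sketched below, assuming the simulated reward is the linear score `new_context @ true_theta` used in the data-preparation step. The function name `simulate_turns`, the toy `toy_theta`, and the untrained `toy_model` are stand-ins for illustration.

```python
import torch
import torch.nn as nn

def simulate_turns(model, true_theta, k):
    """Hypothetical helper: draw k contexts and score the greedy action."""
    n_features = true_theta.shape[0]
    results = []
    with torch.no_grad():  # inference only, no gradients needed
        for turn in range(1, k + 1):
            new_context = torch.randn(1, n_features)
            chosen_action = torch.argmax(model(new_context), dim=1).item()
            # Reward of the chosen action in this specific context.
            all_rewards = new_context @ true_theta  # shape [1, n_actions]
            reward = all_rewards[0, chosen_action].item()
            results.append((turn, chosen_action, reward))
    return results

torch.manual_seed(0)
toy_theta = torch.randn(5, 3)  # stand-in for true_theta
toy_model = nn.Linear(5, 3)    # stand-in for a trained model
turns = simulate_turns(toy_model, toy_theta, k=5)
```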

## Model Specifications

- <b>Learning Rate</b>
- <b>Optimizer</b>
- <b>Criterion (Loss Function)</b>

### <u>Learning Rate</u>

Learning rate is a hyperparameter that determines the step size at which the model's parameters are updated during the training process. It controls the magnitude of adjustments made to the model's weights or coefficients in response to the computed gradients.

#### - Fixed Learning Rate
Learning rate is held constant throughout training. This approach is simple and can work well when the data distribution and model architecture are relatively stable (e.g., `lr=0.01`).

#### - Learning Rate Schedules
Learning rate is changed during training according to a predefined schedule. Helps balance between fast convergence in the early stages and fine-tuning in the later stages of model training. E.g., 
- <b>Step Decay</b>: Reduce the learning rate by a fixed factor after a fixed number of epochs or steps. 
 - `scheduler = StepLR(optimizer, step_size=10, gamma=0.95)`, where the decay factor `gamma=0.95` reduces the learning rate by 5% every `step_size` (10) epochs.
- <b>Exponential Decay</b>: Exponentially decrease the learning rate over time.  
 - `scheduler = ExponentialLR(optimizer, gamma=0.95)`
- <b>Cosine Annealing</b>: Decrease the learning rate along a cosine curve.
 - `scheduler = CosineAnnealingLR(optimizer, T_max=50)`, where `T_max` is the number of epochs over which the learning rate falls from its initial value to its minimum (`eta_min`, 0 by default), i.e., half of a full cosine period.
- <b>Learning Rate Decay on Plateau</b>: Monitor a validation metric (e.g., validation loss or accuracy) during training, and if it plateaus or worsens, reduce the learning rate. This helps the model fine-tune as it gets closer to convergence.  
 - `scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=5, factor=0.5)`
 
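`scheduler.step()` is called inside the model training loop, typically once per epoch after `optimizer.step()`. `ReduceLROnPlateau` is the exception in that it takes the monitored metric as an argument, e.g., `scheduler.step(val_loss)`.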
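As a rough illustration of where the scheduler call sits, the sketch below runs a toy training loop with `StepLR` and records the learning rate after each epoch. The model, data, and epoch count are assumptions chosen only to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

torch.manual_seed(0)
model = nn.Linear(5, 3)
optimizer = optim.SGD(model.parameters(), lr=0.01)
scheduler = StepLR(optimizer, step_size=10, gamma=0.95)

x, y = torch.randn(32, 5), torch.randint(0, 3, (32,))
criterion = nn.CrossEntropyLoss()

lr_history = []
for epoch in range(20):
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the schedule once per epoch, after optimizer.step()
    lr_history.append(scheduler.get_last_lr()[0])
```

Here the learning rate stays at 0.01 for the first ten epochs, drops to 0.0095 for the next ten, and would continue to shrink by 5% every ten epochs after that.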

#### - Learning Rate Warmup 
Model training starts with a small learning rate that gradually increases over a few epochs. This helps stabilize early training, when the weights are still close to their random initialization and large updates can cause the loss to diverge. E.g., 

```
# Define learning rate warmup parameters
initial_learning_rate = 0.1
warmup_epochs = 10  # Number of epochs for warmup
warmup_factor = 0.1  # Warmup factor for initial learning rate

# Training Loop Learning Rate
for epoch in range(num_epochs):
    if epoch < warmup_epochs:
        # Learning rate warmup phase
        lr = initial_learning_rate * (warmup_factor + (1.0 - warmup_factor) * (epoch / warmup_epochs))
    else:
        # Regular training phase
        lr = initial_learning_rate
        
    # Update the optimizer's learning rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
```
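The same linear warmup can also be expressed with PyTorch's `LambdaLR` scheduler, which multiplies the optimizer's base learning rate by a user-supplied factor at each epoch. This is an alternative sketch rather than the notebook's method; `warmup_lambda` is a hypothetical name, and the toy model stands in for a real one.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

warmup_epochs = 10
warmup_factor = 0.1

def warmup_lambda(epoch):
    """Multiplier applied to the base learning rate at a given epoch."""
    if epoch < warmup_epochs:
        return warmup_factor + (1.0 - warmup_factor) * (epoch / warmup_epochs)
    return 1.0

model = nn.Linear(5, 3)
optimizer = optim.SGD(model.parameters(), lr=0.1)  # base (post-warmup) rate
scheduler = LambdaLR(optimizer, lr_lambda=warmup_lambda)

lrs = []
for epoch in range(15):
    lrs.append(optimizer.param_groups[0]["lr"])  # rate in effect this epoch
    optimizer.step()   # placeholder for the actual training step
    scheduler.step()
```

The recorded rates rise linearly from 0.01 in epoch 0 to 0.1 at epoch 10 and stay there afterwards.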


#### - Adaptive Learning Rate
Adjust the learning rate based on the gradient information. Popular algorithms include Adam, RMSprop, and Adagrad. Adaptive learning rate algorithms adapt the learning rate on a per-parameter basis, which can be beneficial in more complex models. E.g., 

`optimizer = optim.Adam(model.parameters(), lr=0.001)`

### <u>Optimizer</u>
Optimizers are algorithms or methods used to adjust the parameters of a model in order to minimize the error or loss function during the training process. The primary goal of an optimizer is to find the optimal set of model parameters that result in the best possible performance on a given task. To do this, the optimizer iteratively updates the model's parameters based on the computed gradients of the loss function with respect to those parameters.

#### - Stochastic Gradient Descent (SGD)

SGD is the most fundamental optimizer. It updates model parameters based on the gradient of the loss with respect to the parameters. While it can be slower to converge than more advanced optimizers like Adam, it is often used as a baseline and can work well with appropriate learning rate scheduling.

In [15]:
sgd_optimizer = optim.SGD(model.parameters(), lr=learning_rate)

#### - RMSprop
RMSprop is an adaptive learning rate optimizer that maintains a moving average of squared gradients for each parameter. It scales the learning rates differently for each parameter, which can help in training deep networks.

In [16]:
rmsprop_optimizer = optim.RMSprop(model.parameters(), lr=learning_rate)

#### - Adagrad
Adagrad adapts the learning rate for each parameter based on the historical gradient information. It performs well on sparse data but may decrease the learning rate too aggressively over time.

In [17]:
adagrad_optimizer = optim.Adagrad(model.parameters(), lr=learning_rate)

#### - Adadelta
Adadelta is an extension of Adagrad that addresses its aggressive learning rate decay by using a decaying moving average of past squared gradients (controlled by `rho`) rather than accumulating all of them.

In [18]:
adadelta_optimizer = optim.Adadelta(model.parameters(), rho=0.9)

#### - Nesterov Accelerated Gradient (NAG)
NAG is a variant of SGD with momentum that evaluates the gradient at the look-ahead position reached by the momentum step, rather than at the current parameters. It often converges faster than plain SGD.

In [19]:
nag_optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)

#### - L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
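L-BFGS is a quasi-Newton optimization algorithm that is often used for small to medium-sized datasets. It can be a good choice for optimizing shallow networks with a limited number of parameters. In PyTorch, `optim.LBFGS.step()` requires a closure that re-evaluates the loss, so it cannot be passed to the training function above unchanged.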

In [20]:
lbfgs_optimizer = optim.LBFGS(model.parameters(), lr=learning_rate)
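Because `optim.LBFGS` may evaluate the loss several times per update, its `step()` method takes a closure. The sketch below shows the pattern on a toy linear model; the data, the number of steps, and the choice of `line_search_fn="strong_wolfe"` are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
x = torch.randn(200, 5)
y = torch.randint(0, 3, (200,))

model = nn.Linear(5, 3)
criterion = nn.CrossEntropyLoss()
optimizer = optim.LBFGS(model.parameters(), max_iter=20,
                        line_search_fn="strong_wolfe")

def closure():
    """Re-evaluate the loss; L-BFGS may call this several times per step."""
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    return loss

with torch.no_grad():
    initial_loss = criterion(model(x), y).item()
for _ in range(5):
    optimizer.step(closure)
with torch.no_grad():
    final_loss = criterion(model(x), y).item()
```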

#### - Proximal Gradient Descent (PGD)
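PGD is an optimization algorithm for objectives that combine a smooth loss with a non-smooth regularizer such as an L1 penalty: each gradient step on the smooth part is followed by a proximal step (soft-thresholding for L1), which can drive parameters exactly to zero. PyTorch has no built-in PGD optimizer. The cell below uses SGD with `weight_decay`, which applies an L2 penalty and is therefore only a rough stand-in, not a proximal method.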

In [21]:
pgd_optimizer = optim.SGD(model.parameters(), lr=learning_rate, weight_decay=1e-4)
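For comparison, a proximal gradient step for an L1 penalty can be sketched by following a plain SGD update with soft-thresholding. This is a minimal illustration under assumed values, not a built-in PyTorch optimizer; `soft_threshold_`, `l1_strength`, and the toy data are hypothetical.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def soft_threshold_(params, threshold):
    """Proximal operator of the L1 norm, applied in place (hypothetical helper)."""
    with torch.no_grad():
        for p in params:
            p.copy_(torch.sign(p) * torch.clamp(p.abs() - threshold, min=0.0))

# Effect on a small example tensor: entries within the threshold become zero.
demo = torch.tensor([0.3, -0.02, 0.0, -0.5])
soft_threshold_([demo], 0.1)

torch.manual_seed(0)
x, y = torch.randn(256, 5), torch.randint(0, 3, (256,))
model = nn.Linear(5, 3)
criterion = nn.CrossEntropyLoss()
lr, l1_strength = 0.1, 0.05  # assumed values for illustration
optimizer = optim.SGD(model.parameters(), lr=lr)

for _ in range(100):
    loss = criterion(model(x), y)  # smooth part of the objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()               # gradient step on the smooth loss
    soft_threshold_([model.weight], lr * l1_strength)  # proximal step for L1

n_zero_weights = int((model.weight == 0).sum().item())
```

Unlike `weight_decay`, which only shrinks weights toward zero, the thresholding step can set them exactly to zero, which is what produces sparse solutions.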

#### - Adam (Adaptive Moment Estimation)
Adam is an optimization algorithm used for training machine learning models. It combines the advantages of AdaGrad and RMSprop, offering robust and efficient convergence. Adam maintains moving averages of gradients (first moment) and squared gradients (second moment) for each parameter, adaptively scales learning rates, and applies bias correction.

In [189]:
adam_optimizer = optim.Adam(model.parameters(), lr=learning_rate)

### <u>Criterion (Loss Function)</u>
Criterion refers to the loss function or objective function used to quantify the error or discrepancy between the model's predictions and the true target values (ground truth) during the training process. The purpose of a loss function is to provide a single scalar value that the optimization algorithm (optimizer) seeks to minimize.

#### - Mean Squared Error (MSE) Loss
Used for regression problems where the goal is to predict a continuous target variable. Measures the average squared difference between predicted and actual values.

In [68]:
mse_criterion = nn.MSELoss()

#### - L1 Loss (Absolute Error)
Similar to MSE loss but measures the average absolute difference between predicted and actual values. Often used when outliers in the data should be handled with less sensitivity.

In [72]:
l1_criterion = nn.L1Loss()

#### - Smooth L1 Loss (Huber Loss)
Combines properties of MSE and L1 loss: it is quadratic (like MSE) for small errors and linear (like L1) for large errors. Less sensitive to outliers compared to MSE.

In [73]:
smooth_l1_criterion = nn.SmoothL1Loss()
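To make the difference between the three regression losses concrete, the sketch below evaluates each one on a small error (0.5) and a large error (3.0) against a target of zero. The error values are arbitrary choices for illustration, and `nn.SmoothL1Loss` is used with its default `beta=1.0`.

```python
import torch
import torch.nn as nn

target = torch.zeros(1)
small_error = torch.tensor([0.5])  # |error| below the default beta of 1.0
large_error = torch.tensor([3.0])  # |error| above it

criteria = {"MSE": nn.MSELoss(), "L1": nn.L1Loss(), "SmoothL1": nn.SmoothL1Loss()}
results = {}
for name, criterion in criteria.items():
    small = criterion(small_error, target).item()
    large = criterion(large_error, target).item()
    results[name] = (small, large)
    print(f"{name:9s} small={small:.3f} large={large:.3f}")
```

For the large error, MSE gives 9.0 while Smooth L1 gives 2.5, which shows why the latter is less dominated by outliers.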

#### - Binary Cross-Entropy Loss (BCE Loss)
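Used for binary classification problems where the target variable has two classes (0 and 1). Measures the negative log-likelihood of the true labels under the predicted class probabilities, so `nn.BCELoss` expects inputs that are already probabilities in [0, 1] (e.g., sigmoid outputs).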

In [71]:
bce_criterion = nn.BCELoss()

#### - Binary Cross-Entropy Loss with Logits (BCEWithLogits Loss)
Similar to BCE Loss but applied to the raw logits rather than probabilities: the sigmoid is applied inside the loss in a numerically stable way, so the model's output layer should not apply a sigmoid itself.

In [74]:
bce_logit_criterion = nn.BCEWithLogitsLoss()
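The relationship between the two binary cross-entropy losses can be checked numerically: applying a sigmoid and then `nn.BCELoss` should match `nn.BCEWithLogitsLoss` on the raw logits. The logits and targets below are made-up values for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(6)  # raw model outputs (no sigmoid applied)
targets = torch.tensor([0., 1., 1., 0., 1., 0.])

loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)
```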

#### - Multi-Class Hinge Loss (MultiMargin Loss)
Used for multi-class classification problems. Encourages correct class scores to be higher than incorrect class scores by a margin.

In [75]:
multimargin_criterion = nn.MultiMarginLoss()

#### - Cross Entropy Loss (Log Loss or Negative Log-Likelihood Loss)
Primarily used for classification problems. Measures the dissimilarity between the predicted class probabilities and the true class labels, where the goal is to minimize this dissimilarity, encouraging the model to assign higher probabilities to the correct classes. In PyTorch, `nn.CrossEntropyLoss` expects raw logits and integer class indices, and applies log-softmax internally.

In [76]:
cross_entropy_criterion = nn.CrossEntropyLoss()
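As a small check of what `nn.CrossEntropyLoss` computes, the sketch below compares it with an explicit log-softmax followed by the negative log-likelihood loss, and evaluates it on uniform logits. The logits and labels are made-up values for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # [batch, n_classes], no softmax applied
labels = torch.tensor([0, 2, 1, 2])   # integer class indices

ce = nn.CrossEntropyLoss()(logits, labels)
nll = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# With identical logits every class gets probability 1/3, so the loss is ln(3).
uniform_loss = nn.CrossEntropyLoss()(torch.zeros(4, 3), labels)
```

The uniform case gives a useful reference point: for three actions, a loss near ln(3), about 1.0986, means the model is doing no better than guessing uniformly.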

<hr>

## Model Training & Prediction Steps

### 1. Data Preparation

#### - Generate Random Data Specifications

In [169]:
np.random.seed(4932)
n_samples = 10000
n_features = 5
n_actions = 3

#### - Generate Random Contextual Data

In [170]:
context = torch.tensor(np.random.randn(n_samples, n_features), dtype=torch.float32)

#### - Generate Random True Weights (`true_theta`)

In [171]:
true_theta = torch.tensor(np.random.randn(n_features, n_actions), dtype=torch.float32)

#### - Compute Action Probabilities (Softmax)

In [172]:
action_probabilities = torch.exp(torch.matmul(context, true_theta))
action_probabilities /= action_probabilities.sum(dim=1, keepdim=True)
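The two lines above compute a softmax by hand. For the standard-normal data used here that works, but exponentiating large scores can overflow in float32. The sketch below, using made-up scores, compares the manual computation with `torch.softmax`, which is equivalent on ordinary inputs and stays finite on extreme ones.

```python
import torch

torch.manual_seed(0)
scores = torch.randn(4, 3)  # stand-in for context @ true_theta

manual = torch.exp(scores)
manual = manual / manual.sum(dim=1, keepdim=True)
builtin = torch.softmax(scores, dim=1)

# Large scores overflow exp() in float32, while softmax stays finite.
big = torch.tensor([[1000.0, 0.0, -1000.0]])
manual_big = torch.exp(big) / torch.exp(big).sum(dim=1, keepdim=True)
builtin_big = torch.softmax(big, dim=1)
```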

#### - Sample Actions Based on the Probabilities

In [173]:
chosen_actions = torch.multinomial(action_probabilities, 1).squeeze().numpy()

#### - Calculate Rewards

In [174]:
rewards = torch.matmul(context, true_theta)  # Shape: [n_samples, n_actions]
chosen_action_indices = torch.arange(n_samples, dtype=torch.long), chosen_actions
rewards = rewards[chosen_action_indices]  # Select rewards based on chosen actions
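The per-sample selection above relies on advanced indexing with a pair of index arrays. An equivalent formulation with `gather` is sketched below on made-up data, together with the gap between the best available reward and the chosen one (often called regret). The variable names and sizes are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
all_rewards = torch.randn(6, 3)             # stand-in: [n_samples, n_actions]
actions = torch.tensor([0, 2, 1, 1, 0, 2])  # stand-in for chosen_actions

# Advanced indexing, as in the cell above.
picked_index = all_rewards[torch.arange(6), actions]
# Equivalent selection with gather along the action dimension.
picked_gather = all_rewards.gather(1, actions.unsqueeze(1)).squeeze(1)

# Per-sample gap between the best possible reward and the chosen one.
regret = all_rewards.max(dim=1).values - picked_index
```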

### 2. Select Bandit Model & Initialize Model Parameters

#### - Build Model
Instantiate the model with the number of context features and actions.

In [175]:
model = LinearBanditModel(n_features, n_actions)

#### - Initialize Model Parameters 
Select learning rate, optimization method, criterion, and number of epochs.

In [195]:
learning_rate = 0.001
optimizer = optim.Adam(model.parameters(), lr=learning_rate)  # Bind Adam to the model built above.
criterion = cross_entropy_criterion
num_epochs = 50

### 3. Train Bandit Model

In [196]:
train_bandit(model, optimizer, criterion, num_epochs)

Epoch [0/50] Loss: 0.7356
Epoch [1/50] Loss: 0.7324
Epoch [2/50] Loss: 0.7293
Epoch [3/50] Loss: 0.7263
Epoch [4/50] Loss: 0.7234
Epoch [5/50] Loss: 0.7206
Epoch [6/50] Loss: 0.7179
Epoch [7/50] Loss: 0.7153
Epoch [8/50] Loss: 0.7127
Epoch [9/50] Loss: 0.7103
Epoch [10/50] Loss: 0.7079
Epoch [11/50] Loss: 0.7056
Epoch [12/50] Loss: 0.7033
Epoch [13/50] Loss: 0.7012
Epoch [14/50] Loss: 0.6991
Epoch [15/50] Loss: 0.6970
Epoch [16/50] Loss: 0.6951
Epoch [17/50] Loss: 0.6931
Epoch [18/50] Loss: 0.6913
Epoch [19/50] Loss: 0.6895
Epoch [20/50] Loss: 0.6877
Epoch [21/50] Loss: 0.6860
Epoch [22/50] Loss: 0.6843
Epoch [23/50] Loss: 0.6827
Epoch [24/50] Loss: 0.6811
Epoch [25/50] Loss: 0.6796
Epoch [26/50] Loss: 0.6781
Epoch [27/50] Loss: 0.6767
Epoch [28/50] Loss: 0.6752
Epoch [29/50] Loss: 0.6739
Epoch [30/50] Loss: 0.6725
Epoch [31/50] Loss: 0.6712
Epoch [32/50] Loss: 0.6699
Epoch [33/50] Loss: 0.6687
Epoch [34/50] Loss: 0.6674
Epoch [35/50] Loss: 0.6663
Epoch [36/50] Loss: 0.6651
Epoch [37/5

### 4. Model Predictions
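Draws a new random context on each turn, chooses the highest-scoring action for that context, and reports a reward. Because `predict_bandit` indexes the per-sample `rewards` vector by action index, each action maps to a single fixed reward value in the output below.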

In [198]:
predict_bandit(20)

Turn: 1, Action: 0, Reward: 5.472799777984619
Turn: 2, Action: 2, Reward: 0.19325566291809082
Turn: 3, Action: 1, Reward: 1.050110101699829
Turn: 4, Action: 1, Reward: 1.050110101699829
Turn: 5, Action: 2, Reward: 0.19325566291809082
Turn: 6, Action: 0, Reward: 5.472799777984619
Turn: 7, Action: 0, Reward: 5.472799777984619
Turn: 8, Action: 1, Reward: 1.050110101699829
Turn: 9, Action: 2, Reward: 0.19325566291809082
Turn: 10, Action: 1, Reward: 1.050110101699829
Turn: 11, Action: 1, Reward: 1.050110101699829
Turn: 12, Action: 1, Reward: 1.050110101699829
Turn: 13, Action: 2, Reward: 0.19325566291809082
Turn: 14, Action: 2, Reward: 0.19325566291809082
Turn: 15, Action: 2, Reward: 0.19325566291809082
Turn: 16, Action: 0, Reward: 5.472799777984619
Turn: 17, Action: 1, Reward: 1.050110101699829
Turn: 18, Action: 1, Reward: 1.050110101699829
Turn: 19, Action: 2, Reward: 0.19325566291809082
Turn: 20, Action: 0, Reward: 5.472799777984619
