# Research Report: UCB Algorithm in Bandit Algorithm Markets (BAM)

## 4.1 Introduction to UCB in BAM
In Bandit Algorithm Markets (BAM), the Upper Confidence Bound (UCB) algorithm is a standard strategy for decision-making under uncertainty: it gives a principled way to balance exploration against exploitation. In over-the-counter (OTC) markets, where liquidity providers (LPs) must set prices without seeing competitor quotes or full market information, this makes UCB a natural approach to adaptive pricing.

The core idea behind UCB is to consider not only the average reward of each action (here, each price quote) but also the uncertainty around that estimate. This makes it well suited to financial environments where decisions must be made with limited feedback.

## 4.2 UCB Algorithm Structure
At each time step $ t $, the algorithm selects the arm (i.e., the price spread) that maximizes the following expression:
$$
\text{UCB}_t(i) = \hat{\mu}_t(i) + \alpha_t(i)
$$
Where:
- $ \hat{\mu}_t(i) $: the empirical average reward of arm $ i $;
- $ \alpha_t(i) = \sqrt{\frac{2 \log t}{n_t(i)}} $: the exploration bonus, i.e. the width of the confidence interval around $ \hat{\mu}_t(i) $;
- $ n_t(i) $: the number of times arm $ i $ has been selected so far.

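The bonus $ \alpha_t(i) $ shrinks as arm $ i $ is selected more often and grows slowly (through $ \log t $) while an arm is left unplayed, so the algorithm explores under-sampled arms early and increasingly exploits the best-performing ones. Because $ n_t(i) $ appears in the denominator, each arm has to be played once before its index is defined.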
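
The selection rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in any particular study; the function names `ucb_index` and `select_arm` are hypothetical, and treating an unplayed arm as having an infinite index is an assumed (though common) convention for the initial round.

```python
import math


def ucb_index(mean_reward: float, pulls: int, t: int) -> float:
    """Return mu_hat_t(i) + sqrt(2 * log(t) / n_t(i)) for one arm."""
    if pulls == 0:
        # Assumption: an arm that was never played gets priority.
        return math.inf
    return mean_reward + math.sqrt(2.0 * math.log(t) / pulls)


def select_arm(means: list[float], counts: list[int], t: int) -> int:
    """Return the arm with the largest UCB index (ties go to the lowest index)."""
    scores = [ucb_index(m, n, t) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With `means = [0.5, 0.4]`, `counts = [10, 2]` and `t = 12`, the second arm has the lower average but the larger bonus, so `select_arm` picks it; once both arms have been played equally often, the bonuses coincide and the higher average wins.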

## 4.3 Application to OTC Market Pricing
In the BAM framework applied to OTC markets, each LP is modeled as an agent choosing among several discrete price spreads. These spreads correspond to the "arms" in the UCB setup.

- A reward is realized only if the trader accepts the quote and a trade occurs; the LP never sees what the spreads it did not quote would have earned.
- LPs use UCB to decide which spread to quote next.
- Over time, the LP learns which spreads yield better long-term revenue, trading off tighter spreads, which undercut competitors and win more trades, against wider spreads, which earn a higher margin per trade.

This process can lead to convergence toward a pricing equilibrium in some scenarios.
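
To make the setup concrete, the following sketch simulates a single LP quoting with UCB. Everything about the market here is an assumption made for illustration: the spreads are normalized to $[0, 1]$, each spread has a fixed, hypothetical acceptance probability, a rejected quote counts as a reward of zero, and there are no competing LPs. It is not the experimental design of the main paper.

```python
import math
import random


def run_ucb_lp(spreads, accept_prob, horizon, seed=0):
    """Simulate one LP choosing spreads with UCB; return (counts, mean rewards)."""
    rng = random.Random(seed)
    k = len(spreads)
    counts = [0] * k    # n_t(i): times each spread was quoted
    means = [0.0] * k   # mu_hat_t(i): average revenue per quote
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # quote every spread once so all counts are positive
        else:
            arm = max(
                range(k),
                key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        # Hypothetical trader: accepts with a fixed probability per spread.
        traded = rng.random() < accept_prob[arm]
        reward = spreads[arm] if traded else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts, means


spreads = [0.2, 0.5, 0.8]      # hypothetical normalized spreads
accept_prob = [0.9, 0.6, 0.2]  # hypothetical acceptance probabilities
counts, means = run_ucb_lp(spreads, accept_prob, horizon=10_000)
```

Under these assumed numbers the expected revenue per quote is 0.18, 0.30 and 0.16, so the middle spread is the best arm and `counts` should concentrate on it as the horizon grows, while the tight and wide spreads are still sampled occasionally.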

## 4.4 Strengths and Limitations of UCB in BAM
**Advantages:**
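- Theoretical guarantee of logarithmic regret in stationary stochastic settings, where each arm's rewards are i.i.d.; when several LPs learn at once, rewards are no longer stationary and the guarantee does not carry over automatically;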
- Easy to implement and interpret;
- Rapid convergence in many pricing environments.

**Drawbacks in BAM:**
- **Pseudo-collusion**: When multiple LPs use UCB simultaneously, they may converge to similar quoting strategies, potentially reducing market competitiveness;
- Not ideal in highly adversarial environments or when reward feedback is sparse;
- Sensitive to initial conditions and spread discretization.

The main paper (Cartea et al., 2022) reports that, in experiments, UCB converged quickly but carried a higher risk of **collusive behavior** than more randomized strategies such as EXP3.
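
The contrast with EXP3 comes down to how an arm is chosen: UCB takes a deterministic argmax over its indices, whereas EXP3 samples from a probability distribution that always keeps some mass on every arm. The sketch below shows one round of textbook EXP3 for rewards in $[0, 1]$. The function names and the `reward_fn` callback are hypothetical, and the parameterization is not claimed to match the configuration used in the main paper.

```python
import math
import random


def exp3_probabilities(weights, gamma):
    """Mix the normalized weights with a uniform distribution (rate gamma)."""
    k = len(weights)
    total = sum(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]


def exp3_step(weights, gamma, reward_fn, rng):
    """Sample an arm, observe its reward, and update its weight in place."""
    probs = exp3_probabilities(weights, gamma)
    arm = rng.choices(range(len(weights)), weights=probs)[0]
    reward = reward_fn(arm)          # assumed to lie in [0, 1]
    estimate = reward / probs[arm]   # importance-weighted reward estimate
    weights[arm] *= math.exp(gamma * estimate / len(weights))
    return arm
```

Because every arm keeps probability at least `gamma / k`, two LPs running this rule from the same history need not quote the same spread, which is one intuition for why randomized strategies may be less prone to locking into coordinated quotes.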

## 4.5 Conclusion
UCB is a robust and theoretically grounded algorithm that performs well in structured market environments. When used in BAM applications such as OTC pricing, it can help agents learn optimal quoting strategies effectively. However, its deterministic nature can lead to coordinated behavior among agents, which must be carefully managed in real-world market settings.