Skip to content


Subversion checkout URL

You can clone with
Download ZIP
Multi-armed bandit algorithms in Clojure
Clojure R CSS
Branch: restruct
Pull request Compare This branch is 96 commits behind master.

Fetching latest commit…

Cannot retrieve the latest commit at this time

Failed to load latest commit information.


A simple Clojure library for multi-armed bandit optimisation.

Algorithms are implemented following "Bandit Algorithms for Website Optimization" by John Myles White.

This library aims to be simple; it concerns itself with multi-armed bandit optimisation only. Integrating with web applications etc. is the responsibility of other libraries.

By keeping it small and simple we hope to make it far easier to integrate than existing tools.


(ns casino.wynn
  (:require [clj-bandit.arms :as a]
            [clj-bandit.algo.epsilon :as e]))

;; arms represents the knowledge the algorithm acquires- using clj-bandit.Arm records
(def arms (map a/mk-arm [:arm1 :arm2 :arm3 :arm4 :arm5]))

;; which arm should we pull?
(def pull (let [epsilon 0.1]
            (e/select-arm epsilon arms)))

;; it told us to pull :arm5
;; casino.wynn> pull
;;#clj_bandit.arms.Arm{:name :arm5, :pulls 0, :value 0}

;; let's update the arms to record our payout from :arm5
;; this becomes the next arms state we pass in to e/select-arm
;; we can use 1.0 to indicate we were paid, and 0 to indicate
;; we weren't
(fold-arm (reward pull 1.0) arms)


As per the book, the following algorithms have been implemented so far:

  • Epsilon-Greedy
  • Softmax
  • UCB


"Bandit Algorithms for Website Optimization" uses Monte Carlo Simulation to measure the performance of the algorithms. These can be run using clj-bandit.simulate/run-simulation, and ./scripts/plot_results.r will produce the following plots with ggplot2.

In the plots below, algo.variant can refer to either a "standard" or "annealing" algorithm (annealing applies a factor that causes the algorithm to explore less as it gains more experience). For "standard" algorithms, algo.parameter represents the temperature or epsilon value used to tune the algorithms tendency to explore.

  • Epsilon-Greedy: the variant is the epsilon value (0.1, 0.2 etc.); 0.1 would mean the algorithm would experiment 10% of the time, and exploit the best performing for the remainder.
  • softmax: the variant is the algorithm's temperature, behaving much like the epsilon value above.
  • ucb: no variant value is used.

Results are for a 5-armed machine, rewarding at rates of: 10%, 10%, 10%, 10%, 90%. This is the same example as used in the book. Such significantly varying payouts are unlikely for most other applications so I'll update with some more complex simulations later.

Average Reward

Average Reward



Cumulative Reward

Cumulative Reward

The following plot shows the summary for the algorithms when the simulation finished. 1500 simulations were run.



Copyright © Paul Ingles 2012

Distributed under the Eclipse Public License, the same as Clojure.

Something went wrong with that request. Please try again.