# Conclusion

We’ve presented three algorithms for solving the Multiarmed Bandit Problem:
- The epsilon-Greedy Algorithm
- The Softmax Algorithm
- The UCB Algorithm

In order to really take advantage of these three algorithms, you’ll need to develop a good intuition for how they’ll behave when you deploy them on a live website. Having an intuition about which algorithms will work in practice is important because there is no universal bandit algorithm that will always do the best job of optimizing a website: domain expertise and good judgment will always be necessary.

To help you develop the intuition and judgment you’ll need, we’ve advocated a Monte Carlo simulation framework that lets you see how these algorithms and others will behave in hypothetical worlds. By testing an algorithm in many different hypothetical worlds, you can build an appreciation for the qualitative dynamics that cause a bandit algorithm to succeed in one scenario and to fail in another.
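
As a reminder of what such a simulation looks like, here is a minimal sketch, assuming arms that pay out 1.0 with a fixed probability. The names `BernoulliArm`, `EpsilonGreedy`, and `simulate` are illustrative choices for this sketch and may not match the code used elsewhere in the book; the arm probabilities and the value of epsilon are arbitrary.

```python
import random


class BernoulliArm:
    """Hypothetical arm that pays 1.0 with probability p and 0.0 otherwise."""

    def __init__(self, p):
        self.p = p

    def draw(self, rng):
        return 1.0 if rng.random() < self.p else 0.0


class EpsilonGreedy:
    """Explore a uniformly random arm with probability epsilon, else exploit."""

    def __init__(self, epsilon, n_arms, rng):
        self.epsilon = epsilon
        self.rng = rng
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select_arm(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        # Exploit: pick the arm with the highest estimated value so far.
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental running average of the rewards seen for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(make_algo, arm_probs, num_sims, horizon, seed=0):
    """Return the average reward at each time step across num_sims runs."""
    rng = random.Random(seed)
    arms = [BernoulliArm(p) for p in arm_probs]
    totals = [0.0] * horizon
    for _ in range(num_sims):
        algo = make_algo(len(arms), rng)  # fresh algorithm for every run
        for t in range(horizon):
            arm = algo.select_arm()
            reward = arms[arm].draw(rng)
            algo.update(arm, reward)
            totals[t] += reward
    return [total / num_sims for total in totals]


# One hypothetical world: two poor arms and one good arm.
avg_reward = simulate(
    lambda n_arms, rng: EpsilonGreedy(0.1, n_arms, rng),
    arm_probs=[0.1, 0.1, 0.9],
    num_sims=200,
    horizon=250,
)
```

Changing `arm_probs`, `horizon`, or the algorithm passed to `simulate` is how you move between hypothetical worlds and compare the resulting reward curves.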

In this last section, we’d like to help you further down that path by highlighting these qualitative patterns explicitly.

We’ll start off with some general life lessons that we think are exemplified by bandit algorithms, but actually apply to any situation you might ever find yourself in. Here are the most salient lessons:
- Trade-offs, trade-offs, trade-offs
    - In the real world, you always have to trade off between gathering data and acting on that data. Pure experimentation in the form of exploration is always a short-term loss, but pure profit-making in the form of exploitation is always blind to the long-term benefits of curiosity and open-mindedness. You can be clever about the compromises you make, but you will have to make some compromises.
- God does play dice
    - Randomisation is the key to the good life. Controlled experiments online won’t work without randomisation. If you want to learn from your experiences, you need to be in complete control of those experiences. While the UCB algorithms we’ve used in this book aren’t truly randomised, they behave at least partially like randomised algorithms from the perspective of your users. Ultimately what matters most is that you make sure that end-users can’t self-select into the arms you want to experiment with.
- Defaults matter a lot
    - The way in which you initialize an algorithm can have a powerful effect on its long-term success. You need to figure out whether your biases are helping you or hurting you. No matter what you do, you will be biased in some way or another. What matters is that you spend some time learning whether your biases help or hurt. Part of the genius of the UCB family of algorithms is that they make a point to do this initialization in a very systematic way right at the start.
- Take a chance
    - You should try everything at the start of your explorations to ensure that you know a little bit about the potential value of every option. Don’t close your mind without giving something a fair shot. At the same time, just one experience should be enough to convince you that some mistakes aren’t worth repeating.
- Everybody’s gotta grow up sometime
    - You should make sure that you explore less over time. No matter what you’re doing, it’s important that you don’t spend your whole life trying out every crazy idea that comes into your head. In the bandit algorithms we’ve tried, we’ve seen this lesson play out when we’ve implemented annealing. The UCB algorithms achieve similar effects to annealing by explicitly counting their experiences with different arms. Either strategy is better than not taking any steps to become more conservative over time.
- Leave your mistakes behind
    - You should direct your exploration to focus on the second-best option, the third-best option and a few other options that are just a little bit further away from the best. Don’t waste much or any of your time on options that are clearly losing bets. Naive experimentation of the sort that occurs in A/B testing is often a deadweight loss if some of the ideas you’re experimenting with are disasters waiting to happen.
- Don’t be cocky
    - You should keep track of how confident you are about your evaluations of each of the options available to you. Don’t be close-minded when you don’t have evidence to support your beliefs. At the same time, don’t be so unsure of yourself that you forget how much you already know. Measuring one’s confidence explicitly is what makes UCB so much more effective than either the epsilon-Greedy algorithm or the Softmax algorithm in some settings.
- Context matters
    - You should use any and every piece of information you have available to you about the context of your experiments. Don’t simplify the world too much and pretend you’ve got things figured out: there’s more to optimizing your business than comparing A with B. If you can figure out a way to exploit context using strategies like those seen in the contextual bandit algorithms we briefly discussed, use them. And if there are ways to generalize your experiences across arms, take advantage of them.
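
Two of these lessons, exploring less over time and keeping track of your confidence, can be made concrete with a small sketch. The function names `annealed_epsilon` and `ucb1_bonus` are hypothetical, the logarithmic schedule is only one possible annealing rule, and the bonus is the standard UCB1 exploration term; treat this as an illustration under those assumptions rather than a definitive implementation.

```python
import math


def annealed_epsilon(total_pulls):
    """Exploration rate that shrinks as experience accumulates.

    Assumed schedule: 1 / log(t), clamped to 1.0 so that it can be
    read as a probability even for the first few pulls.
    """
    t = total_pulls + 1
    return min(1.0, 1.0 / math.log(t + 1e-7))


def ucb1_bonus(total_pulls, arm_pulls):
    """UCB1 exploration bonus: large for arms we know little about.

    Requires total_pulls >= 1 and arm_pulls >= 1, which holds once every
    arm has been tried at least once.
    """
    return math.sqrt(2.0 * math.log(total_pulls) / arm_pulls)


# Annealing: the chance of exploring falls as the total pull count grows.
early_epsilon = annealed_epsilon(10)
late_epsilon = annealed_epsilon(10_000)

# Confidence: after 100 total pulls, an arm tried 5 times gets a larger
# bonus than an arm tried 50 times, so it is more likely to be revisited.
uncertain_bonus = ucb1_bonus(100, 5)
confident_bonus = ucb1_bonus(100, 50)
```

The annealed schedule depends only on how much total experience you have, while the UCB1 bonus also depends on how that experience is spread across arms, which is why it directs exploration toward the options you know least about.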

## A Taxonomy of Bandit Algorithms

To help you remember how these lessons relate to the algorithms we’ve described, here are six dimensions along which you can measure most bandit algorithms you’ll come across, including all of the algorithms presented in this book:

1. **Curiosity**: Does the algorithm keep track of how much it knows about each arm? Does the algorithm try to gain knowledge explicitly, rather than incidentally? In other words, is the algorithm curious?
2. **Increased Exploitation over Time**: Does the algorithm explicitly try to explore less over time? In other words, does the algorithm use annealing?
3. **Strategic Exploration**: What factors determine the algorithm’s decision at each time point? Does it maximize reward, knowledge, or a combination of the two?
4. **Number of Tunable Parameters**: How many parameters does the algorithm have? Since you have to tune these parameters, it’s generally better to use algorithms that have fewer parameters.
5. **Initialisation Strategy**: What assumptions does the algorithm make about the value of arms it has not yet explored?
6. **Context-Aware**: Is the algorithm able to use background context about the value of the arms?

## Learning More and Other Topics

While you could easily spend the rest of your life tinkering with the simulation framework we’ve given you to find the best possible settings of different parameters for the algorithms we’ve described, it’s probably better for you to read about how other people are using bandit algorithms. Here’s a very partial reading list we’d suggest for those interested:

- If you’re interested in digging into the academic literature on the Multiarmed Bandit Problem, the best introduction is probably in the classic textbook on Reinforcement Learning, which is a broader topic than the Multiarmed Bandit Problem:
    - *Reinforcement Learning: An Introduction* by Richard S. Sutton and Andrew G. Barto (1998).
- A good starting point for going beyond Sutton and Barto’s introduction is to read about some of the other bandit algorithms out there:
    - **Exp3**: You can read about Exp3 in “The Nonstochastic Multiarmed Bandit Problem” by Auer et al. (2001).
    - **Exp4**: You can also read about Exp4 in “The Nonstochastic Multiarmed Bandit Problem” by Auer et al. (2001).
    - **The Knowledge Gradient**: You can read about the Knowledge Gradient in “A knowledge-gradient policy for sequential information collection” by Frazier et al. (2008).
    - **Randomized Probability Matching**: You can read about Randomized Probability Matching in “A modern Bayesian look at the multiarmed bandit” by Steven L. Scott (2010).
    - **Thompson Sampling**: You can read about Thompson Sampling in “An Empirical Evaluation of Thompson Sampling” by Olivier Chapelle and Lihong Li (2011).
- If you’re interested in contextual bandit algorithms like `LinUCB` and `GLMUCB`, you might look at:
    - **LinUCB**: “A Contextual-Bandit Approach to Personalized News Article Recommendation” by Li et al. (2010).
    - **GLMUCB**: “Parametric Bandits: The Generalized Linear Case” by Filippi et al. (2010).
- If you’re ready to do some much heavier reading on this subject, you might benefit from some of the best recent review papers discussing bandit algorithms:
    - “Sequential Decision Making in Non-stochastic Environments” by Jacob Abernethy (2012).
    - “Online Learning and Online Convex Optimization” by Shai Shalev-Shwartz (2012).
- If you’re interested in reading about how Yahoo! used bandit algorithms in its business, John Langford and colleagues have written many interesting papers and presentations, including:
    - “Learning for Contextual Bandits” by Alina Beygelzimer and John Langford (2011).
    - “A Contextual-Bandit Approach to Personalized News Article Recommendation” by Lihong Li et al. (2010).