# The epsilon-Greedy Algorithm

In computer science, a greedy algorithm is an algorithm that always takes whatever action seems best at the present moment, even when that decision might lead to bad long-term consequences. The epsilon-Greedy algorithm is almost a greedy algorithm: it generally exploits the best available option, but every once in a while it explores the other available options.

Continuing the previous example, let's keep optimizing the color of the logo on our web page. The epsilon-Greedy algorithm attempts to find the best logo color using the following procedure, which is applied to each new potential customer in sequence:
- When a new visitor comes to the site, the algorithm flips a coin that comes up tails with probability epsilon.
- If the coin comes up heads, the algorithm is going to exploit. To exploit, the algorithm looks up the historical conversion rates for both the green and red logos in whatever data source it uses to keep track of things. After determining which color had the highest success rate in the past, the algorithm shows the new visitor the color that’s been most successful historically.
- If, instead of coming up heads, the coin comes up tails, the algorithm is going to explore. Since exploration involves randomly experimenting with the two colors being considered, the algorithm needs to flip a second coin to choose between them. Unlike the first coin, we’ll assume that this second coin comes up heads 50% of the time.
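
The two-coin procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation developed later: the function name `choose_logo`, the dictionary of conversion rates, and the rates themselves are assumptions made up for the example.

```python
import random

def choose_logo(epsilon, conversion_rates, rng=random):
    """Pick a logo color for one visitor using the two-coin procedure.

    conversion_rates maps each color to its historical conversion rate
    (hypothetical data; in practice this comes from your own records).
    """
    # First coin: comes up tails (explore) with probability epsilon.
    if rng.random() < epsilon:
        # Second coin: fair, so each color is equally likely.
        return rng.choice(list(conversion_rates))
    # Heads: exploit the color with the highest historical conversion rate.
    return max(conversion_rates, key=conversion_rates.get)

# Made-up historical rates, for illustration only.
rates = {"green": 0.031, "red": 0.012}
print(choose_logo(0.1, rates))
```

Setting `epsilon` to 0 makes the sketch purely greedy, while setting it to 1 makes every choice a fair coin flip between the two colors.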

After letting this algorithm loose on the visitors to a site for a long time, you’ll see that it works by oscillating between (A) exploiting the best option that it currently knows about and (B) exploring at random among all of the options available to it. In fact, you know from the definition of the algorithm that:
- With probability 1 – epsilon, the epsilon-Greedy algorithm exploits the best known option.
- With probability epsilon / 2, the epsilon-Greedy algorithm explores the best known option.
- With probability epsilon / 2, the epsilon-Greedy algorithm explores the worst known option.
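
One way to convince yourself of these three probabilities is to simulate the two coin flips many times and count the outcomes. The sketch below assumes two options and a fair second coin, as in the logo example; the function name `simulate_shares` and the seed are hypothetical choices for illustration.

```python
import random

def simulate_shares(epsilon, n_visitors, seed=0):
    """Estimate how often each of the three outcomes occurs."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    counts = {"exploit_best": 0, "explore_best": 0, "explore_worst": 0}
    for _ in range(n_visitors):
        if rng.random() < epsilon:
            # Exploring: a fair second coin picks between the two options.
            if rng.random() < 0.5:
                counts["explore_best"] += 1
            else:
                counts["explore_worst"] += 1
        else:
            counts["exploit_best"] += 1
    return {name: count / n_visitors for name, count in counts.items()}

shares = simulate_shares(epsilon=0.1, n_visitors=100_000)
print(shares)
```

With `epsilon = 0.1`, the observed shares should land close to 0.9, 0.05 and 0.05. Note that the best known option is shown in two of the three cases, so it is shown with total probability 1 – epsilon / 2.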

## Describing Our Logo-Choosing Problem Abstractly
### What’s an Arm?
We want to consider the possibility that we have hundreds or thousands of colors to choose from, rather than just two. In general, we’re going to assume that we have a fixed set of N different options and that we can enumerate them, so that we can call our green logo "Option 1" and our red logo "Option 2" and any other logo "Option N". For historical reasons, these options are typically referred to as arms, so we’ll talk about "Arm 1" and "Arm 2" and "Arm N" rather than Option 1, Option 2 or Option N. But the main idea is the same regardless of the words we choose to employ.

### What’s a Reward?

Now that we’ve explained what an arm is, we’ve described one half of the abstract setup of the epsilon-Greedy algorithm. Next, we need to define a reward. A reward is simply a measure of success: it might tell us whether a customer clicked on an ad or signed up as a user. What matters is simply that (A) a reward is something quantitative that we can keep track of mathematically and that (B) larger amounts of reward are better than smaller amounts.

### What’s a Bandit Problem?

Now that we’ve defined both arms and rewards, we can describe the abstract idea of a bandit problem that motivates all of the algorithms we’ll implement:
- We’re facing a complicated slot machine, called a bandit, that has a set of N arms that we can pull on.
- When pulled, any given arm will output a reward. But these rewards aren’t reliable, which is why we’re gambling: Arm 1 might give us 1 unit of reward only 1% of the time, while Arm 2 might give us 1 unit of reward only 3% of the time. Any specific pull of any specific arm is risky.
- Not only is each pull of an arm risky, we also don’t start off knowing what the reward rates are for any of the arms. We have to figure this out experimentally by actually pulling on the unknown arms.
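
To make the slot machine concrete, here is a minimal sketch of an arm that pays 1 unit of reward with a fixed probability and 0 otherwise. The class name `BernoulliArm` and its `pull` method are assumptions introduced for this illustration; the 1% and 3% rates are the ones from the example above.

```python
import random

class BernoulliArm:
    """An arm that pays 1 unit of reward with probability p, else 0."""

    def __init__(self, p, rng=None):
        self.p = p
        # Each arm can carry its own random generator for repeatable runs.
        self.rng = rng if rng is not None else random.Random()

    def pull(self):
        # The reward is risky: the same arm can pay out on one pull
        # and pay nothing on the next.
        return 1.0 if self.rng.random() < self.p else 0.0

# Arm 1 pays out 1% of the time, Arm 2 pays out 3% of the time.
arms = [BernoulliArm(0.01, random.Random(1)), BernoulliArm(0.03, random.Random(2))]
rewards = [arms[1].pull() for _ in range(5)]
print(rewards)
```

In a real bandit problem the algorithm never gets to read `p` directly; it only sees the values returned by `pull`.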

So far, the problem we’ve described is just a problem in statistics: you need to cope with risk by figuring out which arm has the highest average reward. You can calculate the average reward by pulling on each arm a lot of times and computing the mean of the rewards you get back. But a real bandit problem is more complicated and also more realistic.
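
The purely statistical version of the problem can be sketched as follows: pull every arm the same large number of times and compare the sample means. The function name `estimate_mean_rewards`, the reward rates, and the number of pulls are illustrative assumptions, and the sketch deliberately ignores the cost of all those pulls.

```python
import random

def estimate_mean_rewards(reward_rates, pulls_per_arm, seed=0):
    """Estimate each arm's average reward by pulling it many times."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    estimates = []
    for p in reward_rates:
        # Each pull pays 1 with probability p and 0 otherwise.
        total = sum(1 if rng.random() < p else 0 for _ in range(pulls_per_arm))
        estimates.append(total / pulls_per_arm)
    return estimates

# Hypothetical true reward rates, unknown to us in a real setting.
estimates = estimate_mean_rewards([0.01, 0.03], pulls_per_arm=100_000)
best_arm = max(range(len(estimates)), key=estimates.__getitem__)
print(estimates, "best arm index:", best_arm)
```

This approach works when pulls are free, but it spends exactly as many pulls on the worse arm as on the better one.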

What makes a bandit problem special is that we only receive a small amount of the information about the rewards from each arm. Specifically:
- We only find out about the reward that was given out by the arm we actually pulled. Whichever arm we pull, we miss out on information about the other arms that we didn’t pull. Just like in real life, you only learn about the path you took and not the paths you could have taken.

In fact, the situation is worse than that. Not only do we get only partial feedback about the wisdom of our past decisions, we’re literally falling behind every time we don’t make a good decision:
- Every time we experiment with an arm that isn’t the best arm, we lose reward because we could, at least in principle, have pulled on a better arm.
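
The reward lost this way can be quantified, at least in a simulation where the true reward rates are known. The sketch below adds up, for each pull, the gap between the best arm's expected reward and the expected reward of the arm actually pulled; this quantity is commonly called regret, a term not used in the text above. The function name `expected_regret` and the example data are assumptions for illustration.

```python
def expected_regret(reward_rates, pulls):
    """Total expected reward lost by not always pulling the best arm.

    reward_rates: true average reward of each arm (known only in simulation).
    pulls: sequence of arm indices, in the order they were pulled.
    """
    best = max(reward_rates)
    # Each pull of a non-best arm costs the gap to the best arm, on average.
    return sum(best - reward_rates[arm] for arm in pulls)

rates = [0.01, 0.03]     # hypothetical rates for Arm 1 and Arm 2
pulls = [0, 1, 1, 0, 1]  # indices of the arms pulled
print(expected_regret(rates, pulls))
```

Here the two pulls of the worse arm each give up 0.02 units of expected reward, while the three pulls of the best arm give up nothing.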

The full Multiarmed Bandit Problem is defined by the five features above. Any algorithm that offers you a proposed solution to the Multiarmed Bandit Problem must give you a rule for selecting arms in some sequence. And this rule has to balance out your competing desires to (A) learn about new arms and (B) earn as much reward as possible by pulling on arms you already know are good choices.

## Implementing the epsilon-Greedy Algorithm