## Reinforcement Learning
- Reinforcement Learning is a branch of Machine Learning, sometimes also called Online Learning. It is used to solve interaction problems in which the data observed up to time **t** is used to decide which action to take at time **t + 1**. It is also used in Artificial Intelligence to train machines to perform tasks such as walking. Desired outcomes earn the AI a reward and undesired outcomes a punishment, so machines learn through **trial and error**.

- In this part, you will understand and learn how to implement the following Reinforcement Learning models:

    1. Upper Confidence Bound (UCB)
    2. Thompson Sampling

In [1]:
import os, sys

sys.path.append(os.path.abspath("Datasets"))
sys.path.append(os.path.abspath("Images"))

### The Multi-Armed Bandit Problem
- One-armed bandit: the slot machine! It is called a "bandit" because it is a quick way to lose money.
- Multi-armed bandit: one person has to deal with a set of many slot machines and can pull only one arm at a time. It is a classic problem.
<img src="Images/16_1_problem.JPG" height="300">
- **Problem**: Which one should you play to maximize your returns?
- Each of these machines has its own distribution of outcomes from which it draws its result. When you pull the lever, it gives out a random result from that distribution.
- We don't know the distributions, and we assume they differ from machine to machine. So, which one is the best?
<img src="Images/16_1_problem_2.JPG" height="300">
- The rightmost machine (D5) is the best because its long tail is on the left, so most of its outcomes are high. If we knew this, we would bet on D5 every time.
- But we don't know it in advance, so how can we figure it out?
- We need to explore the machines and then exploit the findings to get the maximum return.
- Real-life example:
<img src="Images/16_1_coke.JPG" height="300">
- Which of those ad designs works best? As with the slot machines, there is a distribution behind each design.
=> Find out which one is the best during the actual campaign. Find the best as quickly as possible, and exploit it along the way.
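
To make the explore-then-exploit idea concrete, here is a minimal sketch of a naive baseline (not UCB): try every machine a fixed number of times, then stick with the one that looks best. The win probabilities in `true_rates`, the helper `pull`, and the function `explore_then_exploit` are all hypothetical names and values chosen for illustration; in a real campaign the rates are unknown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical win probabilities for five machines (unknown to the player).
true_rates = np.array([0.10, 0.15, 0.05, 0.20, 0.30])

def pull(machine):
    """Return 1 for a win and 0 for a loss on the chosen machine."""
    return int(rng.random() < true_rates[machine])

def explore_then_exploit(n_rounds, n_explore_per_machine):
    """Try every machine a fixed number of times, then keep playing the best estimate."""
    n_machines = len(true_rates)
    wins = np.zeros(n_machines)
    pulls = np.zeros(n_machines)
    total = 0
    for t in range(n_rounds):
        if t < n_machines * n_explore_per_machine:
            machine = t % n_machines                 # exploration: round-robin
        else:
            machine = int(np.argmax(wins / pulls))   # exploitation: best observed rate
        reward = pull(machine)
        wins[machine] += reward
        pulls[machine] += 1
        total += reward
    return total, pulls

total, pulls = explore_then_exploit(n_rounds=2000, n_explore_per_machine=50)
print("total reward:", total)
print("pulls per machine:", pulls)
```

The weakness of this baseline is that the exploration budget is fixed up front: too small and it may commit to the wrong machine, too large and it wastes rounds on bad ones.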

### Upper Confidence Bound (UCB)
- Multi-armed bandit problem summary:
<img src="Images/16_2_multi_armed.JPG" height="300">
- UCB Algorithm:
<img src="Images/16_2_algorithm.JPG" height="300">
- What is actually happening in the background of UCB? Let's consider the expected returns for the slot machine problem above:
<img src="Images/16_2_expected.JPG" height="300">
- The horizontal lines represent the mean (expected value) of each machine's distribution.
- Because we don't know these values in advance, we start by assuming that every machine has the same expected return:
<img src="Images/16_2_same.JPG" height="300">
- The algorithm creates a confidence band around each estimate that, with high probability, includes the actual expected return:
<img src="Images/16_2_confident.JPG" height="300">
- At the beginning, all machines have the same estimated return and the same confidence band, so we randomly choose one of them to play (in the cola ads case, we show a random ad design). Let's choose 3 in this example:
<img src="Images/16_2_first_pick.JPG" height="300">
- Assume that the user doesn't click on the ad: the estimated value of 3 goes down. The real expected values stay at their fixed positions throughout.
- Next, choose the one with the highest upper confidence bound, which is now one of the 4 others. Let's choose 4:
<img src="Images/16_2_second.JPG" height="300">
- With each round, the observed average converges toward the true expected value.
- Now D4 is the highest, so we choose it. Because we have now observed it more, its confidence band shrinks, which lowers the upper confidence bound of 4:
<img src="Images/16_2_lower_bound.JPG" height="300">
- If we choose 1, the band can sometimes move up rather than down in a given round, even though the true expected value lies below it, because each outcome is random. Over many rounds it converges, but on any single round it can move up or down.
- Overall, the algorithm keeps going, with the bounds and the estimated values decreasing or increasing from round to round:
<img src="Images/16_2_further.JPG" height="300">
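
The shrinking band can be illustrated with a small sketch. It assumes one common UCB-style rule, in which the upper bound of option *i* at round *n* is its average reward plus `sqrt(3/2 * ln(n) / N_i)`, where `N_i` is the number of times option *i* has been selected. The constant 3/2 and the function name `upper_confidence_bound` are assumptions for illustration and may differ from the algorithm image above.

```python
import math

def upper_confidence_bound(reward_sum, n_selected, round_number):
    """Average reward plus an exploration bonus that shrinks as n_selected grows."""
    if n_selected == 0:
        return math.inf  # an untried option always deserves a first try
    average = reward_sum / n_selected
    bonus = math.sqrt(1.5 * math.log(round_number) / n_selected)
    return average + bonus

# Same observed average (0.2) at round 1000, but a different number of observations.
few = upper_confidence_bound(reward_sum=2, n_selected=10, round_number=1000)
many = upper_confidence_bound(reward_sum=100, n_selected=500, round_number=1000)
print(few, many)
```

With the same average, the option observed only 10 times keeps a much higher upper bound than the one observed 500 times, which is why rarely tried options still get selected now and then.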

### Implementation

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [3]:
dataset = pd.read_csv("Datasets/ML_a_z/Ads_CTR_Optimisation.csv")

In [4]:
dataset.head()

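|   | Ad 1 | Ad 2 | Ad 3 | Ad 4 | Ad 5 | Ad 6 | Ad 7 | Ad 8 | Ad 9 | Ad 10 |
|---|------|------|------|------|------|------|------|------|------|-------|
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |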


In [None]:
# This dataset is only for testing. In reality we don't have these records in advance;
# we would use real-time results to run the algorithm.
# Use source code
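
A minimal sketch of how UCB could be replayed over a table shaped like the one above. It assumes each row is one round (one user), each column is one ad, and a 1 means that user would click that ad. The function `run_ucb`, the 3/2 constant in the bonus term, and the synthetic `demo` data with its click rates are all hypothetical stand-ins so the example runs without the CSV file.

```python
import math
import numpy as np

def run_ucb(rewards):
    """Replay UCB over a 2-D table: one row per round, one 0/1 column per ad."""
    values = np.asarray(rewards)
    n_rounds, n_ads = values.shape
    counts = [0] * n_ads   # how many times each ad was shown
    sums = [0] * n_ads     # how many clicks each ad collected
    ads_selected = []
    for n in range(n_rounds):
        best_ad, best_bound = 0, -math.inf
        for i in range(n_ads):
            if counts[i] == 0:
                bound = math.inf  # show every ad at least once
            else:
                average = sums[i] / counts[i]
                bound = average + math.sqrt(1.5 * math.log(n + 1) / counts[i])
            if bound > best_bound:
                best_ad, best_bound = i, bound
        ads_selected.append(best_ad)
        counts[best_ad] += 1
        sums[best_ad] += values[n, best_ad]  # only the shown ad's outcome is observed
    return ads_selected, counts, int(sum(sums))

# Synthetic stand-in for the CSV, using hypothetical click rates for four ads.
rng = np.random.default_rng(42)
rates = [0.05, 0.10, 0.25, 0.08]
demo = (rng.random((5000, len(rates))) < rates).astype(int)

ads_selected, counts, total_reward = run_ucb(demo)
print("times each ad was shown:", counts)
print("total clicks:", total_reward)
```

The `dataset` DataFrame loaded earlier could be passed to `run_ucb` in the same way, since `np.asarray` accepts a DataFrame. A histogram of `ads_selected` would then show which ad the algorithm ended up favoring.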
