## Predicting ad positioning using Q-Learning

Ad placement is a central problem for websites that depend on ad revenue. The position of an ad plays a pivotal role in whether or not it is clicked. There are two simple choices:
	1. Place the ads randomly, or
	2. Always place the ad in the same position

The problem with always using the same position is that, after a while, users get used to seeing an ad there and start ignoring that space, which reduces the number of clicks. The problem with random placement is that it does not take optimal positions into account. For instance, text placed beside an image is viewed more often than text placed farther away. It is also infeasible to go through every website and work out the best positions by hand.

Solution: Reinforcement Learning
Using Reinforcement Learning, we can learn from click feedback and approximate how users respond to ad positions.

Why Reinforcement Learning? 
We cannot use traditional Machine Learning here, since it requires:
	1. A large amount of data
	2. Features
	3. Tuning of many hyperparameters
We have neither a large dataset nor features. The only data we have is the position of the banner/ad and whether or not it was clicked. We will use this dataset from Kaggle: https://www.kaggle.com/akram24/ads-ctr-optimisation. We will solve this problem using Q-Learning, for two reasons:
	1. It is model-free, so it does not need a model of the environment's transitions or rewards.
	2. It is intuitive to understand and, for a small tabular problem like this one, quick to train.

In [1]:
# import routines
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math


Our environment will be the dataset. Each row contains 10 ad positions, with value 1 if the ad in that position was clicked and 0 if it was not. Every row can be treated as a state, as if the user were navigating across multiple pages of a website. Let's load the dataset and look at the first few rows.

In [2]:

env = pd.read_csv('Ads_CTR_Optimisation.csv')
env.head()

,Ad 1,Ad 2,Ad 3,Ad 4,Ad 5,Ad 6,Ad 7,Ad 8,Ad 9,Ad 10
0,1,0,0,0,1,0,0,0,1,0
1,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,0


## Random policy

Without Q-Learning, a simple baseline is to place the ad at a random position for every row. We simulate that here.

In [4]:
# total rewards earned
reward = 0
# random policy: for every state, choose a random
# position for displaying the ad
for x in range(len(env)):
    action = np.random.randint(0, 10)
    # if the guess was correct, increase the reward
    if env.values[x][action] == 1:
        reward += 1
print("Reward collected: {}".format(reward))

Reward collected: 1234
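
The same random policy can also be written without an explicit Python loop. The sketch below is one possible vectorized form, not the notebook's own code: `clicks` is a hypothetical synthetic stand-in for `env.values`, and the seeded `rng` is an assumption added so the run is repeatable.

```python
import numpy as np

# Seeded generator so the sketch is repeatable (assumption, not in the notebook)
rng = np.random.default_rng(0)

# Hypothetical stand-in for env.values: 1000 rows, 10 ad positions, 0/1 clicks
clicks = rng.integers(0, 2, size=(1000, 10))

# One random position per row
actions = rng.integers(0, 10, size=len(clicks))

# Fancy indexing picks clicks[i, actions[i]] for every row i at once
reward = int(clicks[np.arange(len(clicks)), actions].sum())
print("Reward collected: {}".format(reward))
```

With the real dataset, `clicks` would be replaced by `env.values`; the reward still varies from run to run unless the generator is seeded.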


## Using Max Policy
Another option is to always display the ad in the position that is clicked most often. For instance, there might be a position where ads are clicked with higher probability. Since every value is either 1 or 0, we can sum down each column to count how many times the ad in each position was clicked.

In [16]:
clicked_counts = env.values.sum(axis=0)
counts = pd.DataFrame({"ad": np.arange(1, 11), "counts": clicked_counts})
counts.set_index("ad")

ad,counts
1,1703
2,1295
3,728
4,1196
5,2695
6,126
7,1112
8,2091
9,952
10,489


This shows that the ad in position 5 was clicked 2695 times, more than any other. So if we always placed the ad in position 5, it would be clicked 2695 times on this dataset. But can we do better?
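
As a quick sanity check, the two baselines can be compared directly from the per-position counts. The sketch below hard-codes the counts from the table above and assumes the random policy picks each of the 10 positions with equal probability, so its expected reward is the mean of the column sums.

```python
import numpy as np

# Click counts per position, copied from the table above
clicked_counts = np.array([1703, 1295, 728, 1196, 2695, 126, 1112, 2091, 952, 489])

# A uniformly random policy picks each position with probability 1/10,
# so its expected total reward is the mean of the column sums
expected_random = clicked_counts.mean()

# The max policy always shows the ad in the most-clicked position
best_position = int(np.argmax(clicked_counts)) + 1  # positions are 1-based
best_reward = int(clicked_counts.max())

print("Expected random reward: {:.1f}".format(expected_random))
print("Best fixed position: {} ({} clicks)".format(best_position, best_reward))
```

The expected random reward comes out at about 1239, in line with the 1234 observed in the simulation, while the best fixed position collects a little more than twice that.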

## Using Q-Learning



In [29]:
# using q-learning
states = len(env)
actions = 10
q_table = np.zeros((states, actions))

learning_rate = 0.7
gamma = 0.618

epsilon = 1.0
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.01


In [30]:
def exploit(eps):
    """Draws a uniform random number and returns True
    when the agent should exploit (probability 1 - eps)"""
    return np.random.uniform() > eps

def random_action():
    return np.random.randint(0, 10)
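
To see how much exploration these hyperparameters allow, the decay schedule can be evaluated on its own. This is a minimal sketch that assumes the per-episode rule used in the training loop, `min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode)`, and a run of 100 episodes; `episodes` and `epsilons` are names introduced here for illustration.

```python
import numpy as np

max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.01
episodes = np.arange(100)

# Exponential decay from max_epsilon towards min_epsilon,
# evaluated for every episode index at once
epsilons = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episodes)

print("first: {:.3f}, last: {:.3f}".format(epsilons[0], epsilons[-1]))
```

With `decay_rate = 0.01`, epsilon is still around 0.38 after 100 episodes, so the agent keeps taking a sizeable share of random actions until the end of training. A larger decay rate or more episodes would push it closer to `min_epsilon`.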

In [34]:
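reward = 0
for episode in range(100):
    if episode % 5 == 0:
        print("Episode: {}".format(episode))
    for state in range(states):
        # epsilon-greedy: exploit the best known action with
        # probability 1 - epsilon, otherwise explore at random
        if exploit(epsilon):
            action = np.argmax(q_table[state])
        else:
            action = random_action()
        r = env.values[state][action]
        reward += r
        # Q-update; the bootstrap term uses this same row's best value
        q_table[state][action] += learning_rate*(r + gamma*np.max(q_table[state, :]) - q_table[state][action])
    # decay the exploration rate after every episode
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate*episode)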

Episode: 0
Episode: 5
Episode: 10
Episode: 15
Episode: 20
Episode: 25
Episode: 30
Episode: 35
Episode: 40
Episode: 45
Episode: 50
Episode: 55
Episode: 60
Episode: 65
Episode: 70
Episode: 75
Episode: 80
Episode: 85
Episode: 90
Episode: 95
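
For comparison, the textbook Q-learning update bootstraps from the best value of the next state rather than the current one: Q(s, a) += alpha * (r + gamma * max Q(s', a') - Q(s, a)). The sketch below shows that form on a tiny table. `q_update` is a hypothetical helper, not part of the notebook, and how a next state would be defined for this dataset (for example, the following row) is an assumption left to the caller.

```python
import numpy as np

def q_update(q_table, state, action, reward, next_state,
             learning_rate=0.7, gamma=0.618):
    """Apply one textbook Q-learning update in place (hypothetical helper)."""
    # TD target: immediate reward plus discounted best value of the next state
    td_target = reward + gamma * np.max(q_table[next_state])
    # TD error: how far the current estimate is from the target
    td_error = td_target - q_table[state, action]
    q_table[state, action] += learning_rate * td_error
    return q_table

# Tiny example: 3 states, 2 actions
q = np.zeros((3, 2))
q[1] = [0.0, 1.0]  # pretend state 1 already has a learned value
q_update(q, state=0, action=1, reward=1, next_state=1)
print(q[0])
```

Here the updated entry becomes 0.7 * (1 + 0.618 * 1.0) = 1.1326, while the other action in state 0 is left untouched.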


In [35]:
# test: follow the greedy policy from the learned Q-table on every row
rewards = 0
for state in range(states):
    best_action = np.argmax(q_table[state, :])
    r = env.values[state][best_action]
    rewards += r
print("Reward collected: {}".format(rewards))

Reward collected: 7366
