# <center>Multi-armed bandits and Gittins index</center>
### <center>Alfred Galichon (NYU & Sciences Po)</center>
## <center>'math+econ+code' masterclass series</center>
#### <center>With python code examples</center>
© 2018–2023 by Alfred Galichon. Past and present support from NSF grant DMS-1716489 and ERC grant CoG-866274 is acknowledged, as well as inputs from contributors listed [here](http://www.math-econ-code.org/team).

**If you reuse material from this masterclass, please cite as:**<br>
Alfred Galichon, 'math+econ+code' masterclass series. https://www.math-econ-code.org/

# Introduction

### Learning objectives

* Gittins index
* LCP formulation


### References

* Bertsekas TBC.
* Weber (2016). *Optimization and Control*. Lecture notes.


# Setup

Consider a finite state space $\mathcal{X}$ and an action space $\mathcal{Y}$. The short-term reward associated with taking action $y$ in state $x$ is
$\Phi _{xy}$. The decision variable $\mu _{xy}\geq 0$ is the number of units in state $x$ for
which action $y$ is taken.

If action $y$ is taken for a unit in state $x$, then the probability of
transitioning to a state $x^{\prime }$ is $N_{x^{\prime },x}^{y}$. We have 
$$
N=\sum_{y}N^{y}\otimes \left( e^{y}\right) ^{\top },
$$
so that $N_{x^{\prime },xy}=N_{x^{\prime },x}^{y}$.

Consider $M=\mathbf{I}_{\mathcal{X}}\otimes \mathbf{1}_{\mathcal{Y}}^{\top }$, so that $M_{x^{\prime },xy}=1\left\{ x=x^{\prime }\right\} $. 
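
The following sketch builds $M$ and $N$ with NumPy for a small example. It is only an illustration: the sizes `nbx`, `nby` and the random transition matrices `N_y` are assumptions made here, and columns are indexed by the pair $xy$ with $x$ varying slowest, which is the ordering implied by the Kronecker product $\mathbf{I}_{\mathcal{X}}\otimes \mathbf{1}_{\mathcal{Y}}^{\top }$.

```python
import numpy as np

nbx, nby = 3, 2                      # hypothetical sizes of the state and action spaces
rng = np.random.default_rng(0)

# One column-stochastic matrix per action: N_y[y][xp, x] = Pr(x -> xp | action y)
N_y = []
for y in range(nby):
    A = rng.random((nbx, nbx))
    N_y.append(A / A.sum(axis=0))    # normalize so that each column sums to one

# M[xp, xy] = 1{x = xp}
M = np.kron(np.eye(nbx), np.ones((1, nby)))

# N[xp, xy] = N_y[y][xp, x], stacking the actions with Kronecker products
N = sum(np.kron(N_y[y], np.eye(nby)[[y], :]) for y in range(nby))

# With mu_xy the mass of units in state x taking action y:
mu_xy = np.ones(nbx * nby) / (nbx * nby)
mass_today = M @ mu_xy               # mass in each state today
mass_tomorrow = N @ mu_xy            # mass arriving in each state next period
```

Because every $N^{y}$ is column-stochastic, total mass is preserved: `mass_today.sum()` and `mass_tomorrow.sum()` coincide.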


In [1]:
#!pip install gurobipy
import numpy as np
import scipy.sparse as sp
import gurobipy as grb

# Bandits

Each bandit $i \in \mathcal I$ has a state space $\mathcal Z^i$. The state space is endowed with a Markov chain with transition matrix $P^i_{z^\prime, z}$, the probability of a transition from $z$ to $z^\prime$. A reward $\phi^i_z$ is collected when bandit $i$ is in state $z$.
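
To make the column convention concrete, here is a minimal sketch that simulates one path of such a chain. The three-state matrix, the rewards, and the helper `simulate_path` are hypothetical and serve only as an example; the key point is that column $z$ of $P$ holds the distribution of the next state.

```python
import numpy as np

# Hypothetical 3-state chain: P_zp_z[zp, z] = Pr(z -> zp), so columns sum to one
P_zp_z = np.array([[0.5, 0.0, 0.0],
                   [0.5, 0.7, 0.0],
                   [0.0, 0.3, 1.0]])
phi_z = np.array([1.0, 2.0, 0.0])    # hypothetical reward attached to each state

def simulate_path(P_zp_z, z0, nbsteps, rng):
    """Draw a path z_0, ..., z_nbsteps; the next state is sampled from column z."""
    path = [z0]
    for _ in range(nbsteps):
        z = path[-1]
        path.append(int(rng.choice(P_zp_z.shape[0], p=P_zp_z[:, z])))
    return path

rng = np.random.default_rng(0)
path = simulate_path(P_zp_z, 0, 10, rng)
rewards = phi_z[path]                # reward collected along the path
```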

In [2]:
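class Bandit():
    def __init__(self, P_zp_z, phi_z, beta):
        self.nbz = len(phi_z)    # number of states
        self.P_zp_z = P_zp_z     # P_zp_z[zp, z] = probability of moving from z to zp
        self.phi_z = phi_z       # reward in each state
        self.beta = beta         # discount factor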

## Two-armed bandits

In a single-armed bandit, one chooses in each period between continuing and stopping. The action space $\mathcal Y = \{0,1 \}$ thus consists of two actions: continuing, or stopping and receiving $g$ in every period forever.

For each $z^{\ast }\in \mathcal{Z}$, the Gittins index $g_{z^{\ast }}$ at state $z^{\ast }$ solves the following system:
$$
\left\{ 
\begin{array}{l}
p_{z}=\max \left\{ \frac{g_{z^{\ast }}}{1-\beta },\ \phi _{z}+\beta \left( P^{\top }p\right) _{z}\right\} \quad \forall z\in \mathcal{Z} \\ 
\frac{g_{z^{\ast }}}{1-\beta }=\phi _{z^{\ast }}+\beta \left( P^{\top }p\right) _{z^{\ast }}
\end{array}
\right. 
$$
which can be solved as the following linear complementarity problem (LCP), where $g$ stands for $g_{z^{\ast }}$:
$$
\left\{
\begin{array}{l}
0\leq \rho _{z}^{1}\perp \mu _{z1}\geq 0 \\ 
0\leq \rho _{z}^{2}\perp \mu _{z2}\geq 0 \\ 
\mu _{z1}+\mu _{z2}=1 \\ 
\rho _{z}^{1}=p_{z}-\frac{g }{1-\beta } \\ 
\rho _{z}^{2}=p_{z}-\phi _{z}-\beta \left( P^{\top }p\right) _{z} \\ 
\frac{g }{1-\beta }=\phi _{z^{\ast }}+\beta \left( P^{\top }p\right)
_{z^{\ast }}.
\end{array}
\right. 
$$
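
As a solver-free cross-check of the fixed-point system for $g_{z^{\ast }}$, one can also use bisection on $g$. This is a different technique from the LCP approach of this notebook: for a trial $g$, the stopping problem is solved by value iteration, and $g$ is raised or lowered according to the sign of $\phi _{z^{\ast }}+\beta \left( P^{\top }p\right) _{z^{\ast }}-\frac{g}{1-\beta }$, which is decreasing in $g$. The function names and iteration counts below are assumptions chosen for illustration, and the example chain is the six-state scheduling chain used later in this notebook.

```python
import numpy as np

def optimal_stopping_value(P_zp_z, phi_z, beta, g, nbit=2000, tol=1e-12):
    """Value iteration for p_z = max(g/(1-beta), phi_z + beta * (P^T p)_z)."""
    retire = g / (1 - beta)
    p_z = np.full(len(phi_z), retire)
    for _ in range(nbit):
        newp_z = np.maximum(retire, phi_z + beta * (P_zp_z.T @ p_z))
        if np.max(np.abs(newp_z - p_z)) < tol:
            return newp_z
        p_z = newp_z
    return p_z

def gittins_bisection(P_zp_z, phi_z, beta, zstar, nbit=60):
    """Bisection on g; the index lies between the smallest and largest reward."""
    lo, hi = phi_z.min(), phi_z.max()
    for _ in range(nbit):
        g = 0.5 * (lo + hi)
        p_z = optimal_stopping_value(P_zp_z, phi_z, beta, g)
        gap = phi_z[zstar] + beta * (P_zp_z.T @ p_z)[zstar] - g / (1 - beta)
        if gap > 0:      # continuing at zstar still beats retiring: raise g
            lo = g
        else:            # retiring is weakly better: lower g
            hi = g
    return 0.5 * (lo + hi)

# Example: deterministic chain with 6 states, reward 35 in the next-to-last state
P_zp_z = np.diag(np.ones(5), -1)
P_zp_z[-1, -1] = 1
phi_z = np.array([0.0, 0.0, 0.0, 0.0, 35.0, 0.0])
g_z = np.array([gittins_bisection(P_zp_z, phi_z, 0.95, z) for z in range(6)])
```

Each bisection step requires a full value iteration, so this sketch favors transparency over speed.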

In [4]:
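def Bandit_gittins_lcp(self):
    """Compute the Gittins index of every state by solving one LCP per state."""
    self.g_z = np.zeros(self.nbz)
    for zstar in range(self.nbz):
        m = grb.Model()
        m.Params.OutputFlag = 0
        m.Params.NonConvex = 2  # the complementarity objective below is bilinear
        # mu1_z, mu2_z: weights on stopping and continuing; rho1_z, rho2_z: slacks
        mu1_z = m.addMVar(self.nbz)
        mu2_z = m.addMVar(self.nbz)
        rho1_z = m.addMVar(self.nbz)
        rho2_z = m.addMVar(self.nbz)
        p_z = m.addMVar(self.nbz, lb=-grb.GRB.INFINITY)
        g = m.addMVar(1, lb=-grb.GRB.INFINITY)
        # all terms are nonnegative, so driving the objective to zero enforces complementarity
        m.setObjective(mu1_z @ rho1_z + mu2_z @ rho2_z, grb.GRB.MINIMIZE)
        m.addConstr(mu1_z + mu2_z == np.ones(self.nbz))
        m.addConstr(rho1_z == p_z - g * np.ones(self.nbz) / (1 - self.beta))
        m.addConstr(rho2_z == p_z - self.phi_z - self.beta * self.P_zp_z.T @ p_z)
        # indifference between stopping and continuing at zstar
        m.addConstr(g / (1 - self.beta) == self.phi_z[zstar] + self.beta * (self.P_zp_z.T @ p_z)[zstar])
        m.optimize()
        self.g_z[zstar] = g.X[0]  # g has shape (1,), so extract its single entry
    return self.g_z

Bandit.gittins_lcp = Bandit_gittins_lcp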


### Example: single machine scheduling

A single machine processes $n$ jobs. Job $i$ takes time $t_{i}$ to process, and a reward $r_{i}$ is obtained at the end of
the job.

The stopping time for job $i$ is $t_{i}$. Hence the Gittins index associated with job $i$ is

$\frac{r_{i}\beta ^{t_{i}}}{1+\beta +\ldots +\beta ^{t_{i}}}=\frac{r_{i}\beta
^{t_{i}}\left( 1-\beta \right) }{1-\beta ^{t_{i}+1}}.$

Therefore jobs should be processed in decreasing order of their Gittins
index.
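
As a small worked example of this ordering rule, the sketch below computes the index $\frac{r_{i}\beta ^{t_{i}}}{1+\beta +\ldots +\beta ^{t_{i}}}$ for three hypothetical jobs and sorts them. The rewards, times, and the helper name `scheduling_index` are assumptions made for illustration only.

```python
import numpy as np

def scheduling_index(r, t, beta):
    """Index r * beta**t / (1 + beta + ... + beta**t) of a single job."""
    return r * beta**t / np.sum(beta ** np.arange(t + 1))

beta = 0.95
rewards = [35.0, 10.0, 20.0]   # hypothetical rewards r_i
times = [4, 0, 1]              # hypothetical times t_i

index_i = np.array([scheduling_index(r, t, beta) for r, t in zip(rewards, times)])
order = np.argsort(-index_i)   # jobs sorted by decreasing index
```

Here a small reward that is available immediately can rank ahead of a larger reward that takes several periods to collect.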

In [5]:
def single_machine_scheduling_bandit(nbperiods,payoff,beta,verbose = True):
    nbz=nbperiods
    P_zp_z = np.diag(np.ones(nbz-1),-1)
    P_zp_z[-1,-1]=1
    phi_z = np.zeros(nbz)
    phi_z[-2] = payoff
    if verbose:
        print('P_zp_z=\n',P_zp_z)
        print('phi_z=\n',phi_z)
    return(Bandit(P_zp_z,phi_z,beta))

### Solution via LCP

We solve for the Gittins indices using our LCP:

In [6]:
the_payoff = 35
the_time = 6
the_beta = 0.95
the_bandit = single_machine_scheduling_bandit(the_time,the_payoff,the_beta)
the_bandit.gittins_lcp()[:-1]

P_zp_z=
 [[0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 1.]]
phi_z=
 [ 0.  0.  0.  0. 35.  0.]


array([ 6.30090993,  8.08871593, 11.07361963, 17.05128205, 35.        ])

### Solution via Jacobi

In [7]:
def Bandit_gittins_jacobi(self, tol = 1e-5, maxit = 1000):
    self.g_z = np.zeros(self.nbz)
    for zstar in range(self.nbz):
        p_z = (self.phi_z.min()  / (1-self.beta) ) * np.ones(self.nbz) 
        g = (1 - self.beta)* self.phi_z.min() + self.beta * (self.phi_z.min()  / (1-self.beta) )
        cont,nbit = True,0
        while cont:
            PTp_z = self.P_zp_z.T @ p_z
            newp_z = np.maximum( g / (1-self.beta) , self.phi_z+self.beta * PTp_z)
            newg = (1-self.beta)*(self.phi_z[zstar]+self.beta*PTp_z[zstar])
            if (max(np.linalg.norm(p_z - newp_z, ord=1),abs(newg-g))<tol) or (nbit > maxit):
                cont = False
            nbit += 1
            p_z,g = newp_z,newg
        self.g_z[zstar] = g
    return(self.g_z)

Bandit.gittins_jacobi=Bandit_gittins_jacobi

In [8]:
Bandit.gittins_jacobi(the_bandit)[:-1]

array([ 6.30090824,  8.0887134 , 11.07361651, 17.05127724, 34.99999061])

### Solution in closed form

Actually, the problem can be solved in closed form. We can compare the results:

In [9]:
the_times = np.arange(the_time-2,-1,-1)
the_payoff*(the_beta**the_times)*(1-the_beta)/(1-the_beta**(the_times+1)) 
# careful: formula in Weber's lecture notes has a typo

array([ 6.30090993,  8.08871593, 11.07361963, 17.05128205, 35.        ])

## Multi-armed bandits



In [10]:
class Bandits():
    def __init__(self,bandits):
        self.bandits = bandits
        

In [11]:
the_bandits = Bandits([single_machine_scheduling_bandit(3,4,0.9,False),single_machine_scheduling_bandit(2,6,0.9,False)])
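
As a rough illustration of the index rule on the two jobs above, the sketch below compares the discounted reward of the two possible processing orders. It rests on several assumptions that are not spelled out in this notebook: each job starts in its first state, a job is run to completion once started, and the reward stream follows the convention `phi_z[-2] = payoff`. The helper names are hypothetical.

```python
import numpy as np

def job_rewards(nbperiods, payoff):
    """Rewards collected while a job runs: zero until the next-to-last state."""
    rewards = np.zeros(nbperiods - 1)
    rewards[-1] = payoff
    return rewards

def discounted_value(reward_streams, beta):
    """Discounted sum of the jobs' reward streams played one after the other."""
    stream = np.concatenate(reward_streams)
    return float(np.sum(stream * beta ** np.arange(len(stream))))

def initial_index(nbperiods, payoff, beta):
    """Closed-form index of a job in its first state."""
    t = nbperiods - 2              # transitions before the rewarding state
    return payoff * beta**t / np.sum(beta ** np.arange(t + 1))

beta = 0.9
job_a, job_b = (3, 4), (2, 6)      # (nbperiods, payoff), as in the_bandits

index_a = initial_index(*job_a, beta)
index_b = initial_index(*job_b, beta)
value_a_first = discounted_value([job_rewards(*job_a), job_rewards(*job_b)], beta)
value_b_first = discounted_value([job_rewards(*job_b), job_rewards(*job_a)], beta)
```

Under these assumptions the job with the higher initial index is also the one that should be processed first, in line with the ordering rule of the scheduling example.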
