In [78]:
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as nprand

# Street Racer

In this notebook, you'll apply the methods of chapter 4 (dynamic programming) of Sutton and Barto's book to a simple racing problem.

The goal is to drive a car as fast as possible over an exact distance $L$ and to stop there.

This distance is divided into steps $0, ..., L$. The car can drive at three different speeds: _low_, _medium_, _high_. Leaving step $j$ at _low_ speed, it moves to $j+1$. _Medium_ and _high_ bring it to $j+2$ and $j+3$, respectively.

At any step, the driver can decide to _decelerate_, _maintain speed_ or _accelerate_. Decelerating causes the car to leave its current place at one speed lower. If the car is already at _low_ speed, decelerating keeps it in the same spot. Maintaining speed does exactly what you think. Accelerating increases the speed by one level, except at _high_ speed, where it is equivalent to maintaining speed.

The car starts on step $0$ at _low_ speed.

Beyond the $L$ distance there is a huge, hot lake of lava. Needless to say, the car must be able to stop at $L$, or the driver will suffer quite a lot.

To help the driver win the race and not die, build a model of the problem and apply the policy iteration and value iteration methods to find her optimal trajectory.

As this problem is an (over-)simplification of our traffic light problem, any work done here could serve as a building block for later.

# Building the model

Start by figuring out the number of states you will need and build transition matrices for every action. For now, actions move the car from state to state in the deterministic manner described above.
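As one possible starting point, the sketch below builds the three matrices directly from the rules above. It rests on choices the problem statement does not fix: a state is a (position, speed) pair indexed as `3 * position + speed`, positions run from 0 to `L` inclusive, and one extra absorbing state stands for the lava. The names `build_transitions`, `SPEED_CHANGE` and `lava` are illustrative.

```python
import numpy as np

# Hypothetical encoding: speed 0 = low, 1 = medium, 2 = high.
SPEED_CHANGE = {"decelerate": -1, "maintain": 0, "accelerate": +1}

def build_transitions(L):
    """Return ({action: matrix}, lava_index) for a track with positions 0..L."""
    n_states = 3 * (L + 1) + 1          # all (position, speed) pairs + lava
    lava = n_states - 1
    T = {a: np.zeros((n_states, n_states)) for a in SPEED_CHANGE}
    for action, change in SPEED_CHANGE.items():
        T[action][lava, lava] = 1.0     # assumption: the lava is absorbing
        for pos in range(L + 1):
            for speed in range(3):
                s = 3 * pos + speed
                if action == "decelerate" and speed == 0:
                    nxt = s             # already at low speed: stay in the same spot
                else:
                    new_speed = min(max(speed + change, 0), 2)
                    new_pos = pos + new_speed + 1   # low/medium/high move 1/2/3 steps
                    nxt = 3 * new_pos + new_speed if new_pos <= L else lava
                T[action][s, nxt] = 1.0
    return T, lava

T, lava = build_transitions(3)
print(T["maintain"].shape)   # (13, 13)
```

With this layout every row contains exactly one 1, so overshooting is an ordinary transition rather than a special case.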

In [79]:
l = 3

In [80]:
T_decelerate = np.zeros(shape=(3*l, 3*l))
T_maintain = np.zeros(shape=(3*l, 3*l))
T_accel = np.zeros(shape=(3*l, 3*l))

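The matrix follows the same repeating pattern for each action. Here is the decelerate matrix, with rows and columns ordered by position and, within each position, by speed (low, medium, high):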

\begin{pmatrix}1&0&0 & 0&0&0 & 0&0&0 &  0&0&0 & \ldots 
\\ 0&0&0 & 1&0&0 & 0&0&0 &  0&0&0 & \ldots 
\\ 0&0&0 & 0&0&0 & 0&1&0 &  0&0&0 & \ldots
\\ 0&0&0 & 1&0&0 & 0&0&0 &  0&0&0 & \ldots
\\ 0&0&0 & 0&0&0 & 1&0&0 &  0&0&0 & \ldots
\\ 0&0&0 & 0&0&0 & 0&0&0 &  0&1&0 & \ldots
\end{pmatrix}
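The pattern in this matrix can be checked by hand. Assuming the state index is `3 * position + speed` (an encoding inferred from the matrix, with speed 0, 1, 2 for low, medium, high), the sketch below recomputes the column of the single 1 in each of the six rows shown. `decelerate_target` is an illustrative name.

```python
def decelerate_target(i):
    """Column of the 1 in row i of the decelerate matrix."""
    pos, speed = divmod(i, 3)
    if speed == 0:
        return i                        # low speed: the car stays where it is
    new_speed = speed - 1               # one speed lower
    new_pos = pos + new_speed + 1       # low moves 1 step, medium moves 2
    return 3 * new_pos + new_speed

targets = [decelerate_target(i) for i in range(6)]
print(targets)   # [0, 3, 7, 3, 6, 10]
```

Rows at low speed map to themselves, rows at medium speed to `i + 2`, and rows at high speed to `i + 5`.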

In [81]:
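# States are indexed as 3*position + speed (speed: 0 = low, 1 = medium, 2 = high).
# Decelerate: low speed stays put, medium drops to low (+2), high drops to medium (+5).
for i in range(len(T_decelerate)-5):
    if (i%3==0 ):
        T_decelerate[i][i]=1
    if (i%3==1):
        T_decelerate[i][i+2]=1
    if (i%3==2):
        T_decelerate[i][i+5]=1

# Rows near the end of the track are filled in by hand.
n=len(T_decelerate)
T_decelerate[n-5][n-3]=1
T_decelerate[n-3][n-3]=1


# Maintain and accelerate share one loop; the offsets follow the same encoding.
for i in range(len(T_accel)-10):
    if (i%3==0):
        T_accel[i][i+7]=1
        T_maintain[i][i+3]=1
    if (i%3==1):
        T_accel[i][i+10]=1
        T_maintain[i][i+6]=1
    if (i%3==2):
        T_accel[i][i+9]=1
        T_maintain[i][i+9]=1

m=len(T_accel)
j=len(T_maintain)

# Caution: with l = 3 the matrices have 9 rows, so the loop above is empty
# and the index m-10 (= -1) wraps around to the last row.
T_accel[m-10][m-1]=1
T_accel[m-9][m-2]=1

T_maintain[j-10][j-1]=1
T_maintain[j-9][j-6]=1
T_maintain[j-8][j-2]=1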

Then define the reward function.

In [82]:
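# Every state costs -1 by default (one time step).
R = -np.ones(3*l)
# Note: in Python ^ is bitwise XOR (exponentiation is **), so with l = 3
# the penalties below evaluate to -997 and the bonus to 10003.
R[len(R)-1]=-1000^l
R[len(R)-2]=-1000^l
R[len(R)-4]=-1000^l
R[len(R)-7]=-1000^l

# Bonus for the goal state (last position at low speed).
R[len(R)-3]=10000^l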
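
One detail worth checking in a cell like this: in Python, `^` is the bitwise XOR operator, while exponentiation is written `**`. The short check below shows the values the expressions above produce for `l = 3`, next to their power counterparts. Which of the two was intended is not stated in the notebook; the variable names are illustrative.

```python
l = 3

xor_penalty = -1000 ^ l      # parsed as (-1000) XOR 3
xor_bonus = 10000 ^ l
pow_penalty = -1000 ** l     # parsed as -(1000 ** 3)
pow_bonus = 10000 ** l

print(xor_penalty, xor_bonus)   # -997 10003
print(pow_penalty, pow_bonus)   # -1000000000 1000000000000
```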

Defining a policy: we start from a random policy. A policy is encoded here as a sequence of integers, one per state (so of length `3*l`): 1 for maintain, 2 for accelerate, 3 for decelerate.

In [83]:
policy_initial = np.zeros(3*l)
for i in range(len(policy_initial)):
    policy_initial[i]=nprand.choice([1,2,3])

In [84]:
policy_initial

array([ 1.,  3.,  2.,  2.,  3.,  3.,  1.,  1.,  1.])
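
The same kind of random initial policy can be drawn in a single call. This is a minimal sketch using NumPy's `Generator.choice`; the seed and the `ACTION_NAMES` lookup are illustrative additions, not part of the notebook.

```python
import numpy as np

l = 3
ACTION_NAMES = {1: "maintain", 2: "accelerate", 3: "decelerate"}  # codes used in this notebook

rng = np.random.default_rng(0)                 # fixed seed only for reproducibility
policy = rng.choice([1, 2, 3], size=3 * l)     # one integer action code per state
print([ACTION_NAMES[int(a)] for a in policy])
```

Drawing from a list of integers yields an integer array, which avoids comparing float action codes later on.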

Here we define the `find_new_state` function, which takes a policy `p` and an index `i` as arguments. The index `i` is the current state, and the function returns the state reached by following the policy from there.

In [85]:
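def find_new_state(p,i):
    # Return the state reached from state i under action p[i]
    # (1 = maintain, 2 = accelerate, 3 = decelerate).
    k=p[i]
    if k ==1 :
        for j in range(len(T_maintain[i])):
            j=3*l-1 # if the car overshoots point L, force it to point L at high speed
            if  j != i and T_maintain[i][j]==1:
                return j
            # This return sits inside the loop and j was just overwritten,
            # so this branch always returns 3*l-1.
            return j

    elif k==2:
        j=3*l-1  # if the car overshoots point L, force it to point L at high speed
        for j in range(len(T_accel[i])):
            if  j != i and T_accel[i][j]==1:
                return j
            # This return also sits inside the loop, so only j = 0 is tested
            # and this branch always returns 0.
            return j
    elif k==3:
        # Scan the whole row; stay in state i if no other successor is found.
        retour = i
        for j in range(len(T_decelerate[i])):
            if  j != i and T_decelerate[i][j]==1:
                retour = j
        return retour
    else :
        raise ValueError("k should be between 1 and 3")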

Environment modeling problem: the car can technically "overshoot" L, and I have not yet found a good way to prevent that. One good approach may simply be to rewrite the find_new_state function more cleanly (compute i-i%3 to get the position, then check the speed, then check the action to find the next position, and handle things from there).

Alternatively, add another state, the one the car must absolutely never enter, and leave every state at -1, except L at +1 and the new state at -10*3*l?
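
A possible way to act on these notes, sketched under assumptions: read the successor directly from the row of the transition matrix, and send any state whose row is empty (the car would overshoot) to a designated overshoot state. The function `successor`, its `overshoot_state` parameter and the toy matrix are hypothetical.

```python
import numpy as np

def successor(T, i, overshoot_state):
    """Return the state reached from i under the deterministic matrix T."""
    targets = np.flatnonzero(T[i])      # columns holding a nonzero entry in row i
    if targets.size == 0:               # no legal move: treat it as an overshoot
        return overshoot_state
    return int(targets[0])

# Toy 4-state example: 0 -> 2, 1 -> 1, row 2 is empty, 3 is the overshoot state.
T_toy = np.zeros((4, 4))
T_toy[0, 2] = 1
T_toy[1, 1] = 1
T_toy[3, 3] = 1
result = [successor(T_toy, i, overshoot_state=3) for i in range(4)]
print(result)   # [2, 1, 3, 3]
```

Because the row is read as a whole, self-loops such as `1 -> 1` need no special handling.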

Tests run while building the model:

In [86]:
find_new_state(policy_initial,0)

8

In [87]:
policy_initial[0]

1.0

In [88]:
print(T_decelerate[0])

[ 1.  0.  0.  0.  0.  0.  0.  0.  0.]


# Policy iteration

Apply the policy iteration procedure to figure out the best policy to follow.

In [89]:
# stopping criterion
epsilon = 0.01
# discount factor
gamma = 0.5
#definitions of the states

This policy evaluation only works for deterministic policies and updates the value function in place.

In [90]:
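def policy_eval(p,epsilon):
    # In-place evaluation of the deterministic policy p.
    delta = 10
    V=np.zeros(3*l)
    while(delta >= epsilon):
        delta = 0
        for i in range(3*l):
            v=V[i]
            j=find_new_state(p,i)
            # The reward is collected on arrival in state j.
            V[i]=R[j]+gamma*V[j]
        # delta is never updated from v and V[i], so it is still 0 here
        # and the loop stops after a single sweep.
    return V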

In [91]:
#print(find_new_state(policy_initial,52))
#policy_initial[52]

In [92]:
policy_eval(policy_initial,0.1)

array([ -9.97000000e+02,  -1.00000000e+00,  -4.99500000e+02,
        -4.99500000e+02,   1.00030000e+04,  -9.97000000e+02,
        -9.97000000e+02,  -9.97000000e+02,  -9.97000000e+02])
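
For comparison, here is a minimal sketch of in-place iterative policy evaluation in which `delta` records the largest change seen during a sweep, so the loop keeps sweeping until the values settle. It assumes a deterministic policy summarised by a hypothetical array `next_state` (the successor of each state under the policy) and a reward collected on arrival, `R[next_state[s]]`. The three-state chain is a toy example, not the racing model.

```python
import numpy as np

def evaluate_policy(next_state, R, gamma, epsilon):
    """Iterative policy evaluation for a deterministic policy."""
    V = np.zeros(len(R))
    while True:
        delta = 0.0
        for s in range(len(R)):
            old = V[s]
            j = next_state[s]
            V[s] = R[j] + gamma * V[j]            # in-place backup
            delta = max(delta, abs(old - V[s]))   # largest change in this sweep
        if delta < epsilon:
            return V

# Toy chain 0 -> 1 -> 2 -> 2 with reward -1 on every arrival.
V = evaluate_policy(next_state=[1, 2, 2], R=np.array([-1.0, -1.0, -1.0]),
                    gamma=0.5, epsilon=1e-6)
print(np.round(V, 3))   # every entry is close to -2
```

In this toy chain each state pays -1 forever, so the exact value is -1 / (1 - 0.5) = -2 everywhere.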

<font color='red'>policy_improv still needs work: I don't know the exact command to do what I want, so this is only a first attempt.</font>

In [93]:
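def policy_improv(p,V,boolean):
    # Greedy improvement: for each state, compare the one-step return of the
    # three actions (1 = maintain, 2 = accelerate, 3 = decelerate).
    new_p=np.zeros(shape=3*l)
    for i in range(3*l):
        # If a row contains no matching successor, k keeps the value found
        # by the previous lookup.
        for j in range(len(T_maintain[i])):
            if  j != i and T_maintain[i][j]==1:
                k=j
        action_1=R[k]+gamma*V[k]
        for j in range(len(T_accel[i])):
            if  j != i and T_accel[i][j]==1:
                k=j
        action_2=R[k]+gamma*V[k]
        for j in range(len(T_decelerate[i])):
            if T_decelerate[i][j]==1:
                k=j
        action_3=R[k]+gamma*V[k]
        new_p[i]=np.argmax([action_1,action_2,action_3])+1
        if new_p[i]!=p[i]:
            # Rebinding the argument is local to this function; the caller's
            # variable is not changed.
            boolean = False
    return new_p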
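
One way to make the stability check visible to the caller is to return it alongside the new policy, since reassigning a boolean argument inside a function does not change the caller's variable. The sketch below assumes a hypothetical table `succ[a][s]` giving the successor of state `s` under action index `a` (0-based here, unlike the 1 to 3 codes of the notebook), and the arrival-reward convention `R[j] + gamma * V[j]` used in this notebook.

```python
import numpy as np

def improve_policy(policy, V, succ, R, gamma):
    """Greedy improvement; returns (new_policy, policy_stable)."""
    new_policy = np.empty(len(V), dtype=int)
    for s in range(len(V)):
        q = [R[succ[a][s]] + gamma * V[succ[a][s]] for a in range(len(succ))]
        new_policy[s] = int(np.argmax(q))         # index of the best action
    return new_policy, bool(np.array_equal(new_policy, policy))

# Toy model: 2 actions, 3 states; state 2 is a rewarding absorbing goal.
succ = [[0, 0, 2],    # action 0: states 0 and 1 go back to state 0
        [1, 2, 2]]    # action 1: move one state to the right
R = np.array([-1.0, -1.0, 10.0])
policy, stable = improve_policy(np.zeros(3, dtype=int), np.zeros(3), succ, R, gamma=0.5)
print(policy, stable)   # [0 1 0] False
```

The caller can then write `p, policy_stable = improve_policy(...)` and loop until `policy_stable` is true.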


Now combine the two functions to iterate over policies!

In [94]:
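# policy iteration
p=policy_initial
Delta_iter=1
epsilon_iter=0.1
policy_stable=False

while  not policy_stable :
    V=policy_eval(p,epsilon)
    policy_stable= True
    # policy_improv only rebinds its own argument, so policy_stable
    # stays True and this loop runs exactly once.
    p=policy_improv(p,V,policy_stable)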
    


In [95]:
V

array([ -9.97000000e+02,  -1.00000000e+00,  -4.99500000e+02,
        -4.99500000e+02,   1.00030000e+04,  -9.97000000e+02,
        -9.97000000e+02,  -9.97000000e+02,  -9.97000000e+02])

In [97]:
p

array([ 1.,  3.,  1.,  3.,  3.,  1.,  1.,  1.,  1.])

To figure out if everything is going well, make sure that at each iteration you keep track of the value vector, as well as the trajectory of the car according to the current policy. The latter allows you to compute the current policy's total reward and plot the evolution.
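
As a rough sketch of the trajectory part, the function below follows a deterministic policy from a start state and sums the rewards collected on arrival. It assumes a hypothetical array `policy_succ` giving the successor of each state under the current policy, an undiscounted total, and that the rollout ends when the car stops moving; the three-state example is a toy.

```python
def rollout(policy_succ, R, start, max_steps):
    """Follow a deterministic policy; return (trajectory, total_reward)."""
    trajectory, total, s = [start], 0.0, start
    for _ in range(max_steps):
        nxt = policy_succ[s]
        if nxt == s:            # the car no longer moves: end of the rollout
            break
        total += R[nxt]         # reward collected on arrival
        trajectory.append(nxt)
        s = nxt
    return trajectory, total

traj, total = rollout(policy_succ=[1, 2, 2], R=[-1.0, -1.0, 10.0], start=0, max_steps=20)
print(traj, total)   # [0, 1, 2] 9.0
```

Appending a copy of the value vector and this trajectory to lists at each iteration gives the history needed for the plots.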

Then use the stored values to make a video similar to _street_racer.mp4_ on the repo. The following procedure can be used to save figures.

In [96]:
# `values` (the value vectors stored at each iteration) and `trap` must be
# defined before running this cell, and the img/ folder must already exist.
for idx, v in enumerate(values):
    v = np.array(v[:trap]).reshape(3, l)
    fig = plt.figure(figsize=(l*2, 6), dpi=72)
    ax = fig.add_subplot(111)
    ax.imshow(v, interpolation='nearest', cmap='gray')
    plt.yticks([])
    plt.savefig('img/value_'+str(idx)+'.jpg', dpi=72, bbox_inches='tight', pad_inches=0)
    plt.close(fig)

NameError: name 'values' is not defined

Install the command-line utility _ffmpeg_ and use it to transform the saved sequence of images into an mp4 video.

(https://en.wikibooks.org/wiki/FFMPEG_An_Intermediate_Guide/image_sequence#Making_a_video_from_an_Image_Sequence)
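
As a sketch, the conversion can also be launched from Python. The command below assumes the frames are named `img/value_0.jpg`, `img/value_1.jpg`, and so on, as in the saving loop above, that `ffmpeg` is on the `PATH`, and that the local build includes the `libx264` encoder. The frame rate, the output name `street_racer_values.mp4` and the helper names are arbitrary choices.

```python
import shutil
import subprocess

def ffmpeg_command(pattern="img/value_%d.jpg", output="street_racer_values.mp4", fps=5):
    """Build the ffmpeg argument list for a numbered image sequence."""
    return ["ffmpeg", "-y",               # overwrite the output if it exists
            "-framerate", str(fps),       # input frame rate
            "-i", pattern,                # numbered input frames
            "-c:v", "libx264",            # video codec (assumed available)
            "-pix_fmt", "yuv420p",        # widely supported pixel format
            output]

def make_video(**kwargs):
    """Run ffmpeg; raises if the tool is missing or the conversion fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(ffmpeg_command(**kwargs), check=True)

print(" ".join(ffmpeg_command()))
```

Calling `make_video()` runs the conversion. If the encoder rejects frames with odd pixel dimensions, resizing the images to even width and height is a common workaround.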

Play around with your model. What happens if you introduce uncertainty about the car's brakes?
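
One hypothetical way to model unreliable brakes: with probability `p_fail`, the decelerate action behaves like maintain. The transition matrix then becomes a mixture of the two deterministic matrices, and the backup for an action becomes an expectation over successors. The three-state matrices and `p_fail = 0.2` below are toy values, not the racing model.

```python
import numpy as np

p_fail = 0.2
T_dec = np.array([[1., 0., 0.], [1., 0., 0.], [0., 1., 0.]])  # toy: brakes work
T_mnt = np.array([[0., 1., 0.], [0., 0., 1.], [0., 0., 1.]])  # toy: speed is kept
T_brake = (1 - p_fail) * T_dec + p_fail * T_mnt               # rows still sum to 1

R = np.array([-1.0, -1.0, -100.0])   # reward on arrival; state 2 plays the lava
V = np.zeros(3)
gamma = 0.5
q_brake = T_brake @ (R + gamma * V)  # expected one-step return of braking
print(q_brake)                       # approximately [-1, -20.8, -20.8]
```

With stochastic transitions, policy evaluation and improvement use this expected backup in place of the single-successor lookup.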