## Temporal Difference Learning

\begin{equation*}
TD\left( \lambda \right) = \sum_{k=1}^\infty \left( 1 - \lambda \right)\lambda^{k-1}E_k
\end{equation*}
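As a minimal sketch of the weighted sum above, the function below computes it for a finite list of estimators. It assumes an episodic task in which every estimator beyond the last one supplied equals that last one, so the remaining weights collapse into a single tail term. The name `lambda_return` and the list-based interface are illustrative and not part of the original notebook.

```python
def lambda_return(estimators, lam):
    """Weighted sum of k-step estimators E_1..E_n for a given lambda.

    Assumption: E_k == E_n for every k >= n (the episode has ended),
    so the weights (1 - lam) * lam**(k - 1) for k >= n sum to lam**(n - 1).
    """
    *head, last = estimators
    total = sum((1 - lam) * lam ** (k - 1) * e_k
                for k, e_k in enumerate(head, start=1))
    return total + lam ** len(head) * last


# Estimators E_1..E_6 from the worked example in this section.
E = [5.5, 4.0, 4.0, 6.0, 5.0, 5.0]
print(lambda_return(E, 0.0))  # only E_1 contributes
print(lambda_return(E, 1.0))  # only the final estimator contributes
```

With `lam = 0` the sum reduces to the one-step estimator, and with `lam = 1` it reduces to the full return.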

Example
```
probToState1 = 0.5
valueEstimates = {0, 3, 8, 2, 1, 2, 0}
rewards = {0, 0, 0, 4, 1, 1, 1}

```
with no discounting:

\begin{equation*}
\gamma = 1
\end{equation*}

Solving the MDP as shown in the Quiz: Value Computation Example, each state value is the immediate reward plus the discounted value of the next state:

\begin{equation*}
V(S_t) = r + \gamma V(S_{t+1})
\end{equation*}

So

\begin{equation*}
TD(1)
\end{equation*}

is computed by backing up the actual rewards from the furthest state possible, the terminal state.

\begin{equation*}
V(S_6) = 0 \\
V(S_5) = 1 + V(S_6) = 1 \\
V(S_4) = 1 + V(S_5) = 2 \\
V(S_3) = 1 + V(S_4) = 3 \\
V(S_2) = 4 + V(S_3) = 7 \\
V(S_1) = 0 + V(S_3) = 3 \\
V(S_0) = 0 + (0.5)(3) + (0.5)(7) = 5.0 \\
 = TD(1) = E_{\infty}
\end{equation*}

Similarly,

\begin{equation*}
TD(0)
\end{equation*}

bootstraps from the value estimates only one step ahead.

\begin{equation*}
TD(0) = E_1 \\
= 0 + (0.5)(3) + (0.5)(8) = 5.5
\end{equation*}

Using the same analytical approach, the remaining k-step estimators are:

\begin{equation*}
E_2 = 4 \\
E_3 = 4 \\
E_4 = 6 \\
E_5 = 5 \\
E_6 = 5 \\
...
\end{equation*}
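The estimators listed above all follow one pattern: accumulate rewards for k steps, then bootstrap from the value estimate of the state reached. A hedged sketch of that pattern for this particular MDP layout (S0 branches to S1 or S2, both lead to S3, then a single chain runs to S6) is shown here. `k_step_estimate` is a hypothetical helper, and the reward indexing is taken from the worked example: `rewards[0]` and `rewards[1]` leave S0, `rewards[2]` and `rewards[3]` enter S3, and `rewards[4]` to `rewards[6]` follow the chain.

```python
def k_step_estimate(k, p, values, rewards):
    """k-step estimator of V(S0) for the 7-state example MDP (gamma = 1)."""
    if k == 1:
        # One step: bootstrap directly from the estimates of S1 and S2.
        return p * (rewards[0] + values[1]) + (1 - p) * (rewards[1] + values[2])
    reached = k + 1  # index of the state reached after k >= 2 steps
    # Rewards collected along the chain S3 -> ... -> S_reached (capped at S6).
    chain = sum(rewards[4:min(reached, 6) + 1])
    # Bootstrap from the estimate of the reached state; nothing lies past S6.
    bootstrap = values[reached] if reached <= 6 else 0.0
    v3 = chain + bootstrap
    via_s1 = rewards[0] + rewards[2] + v3
    via_s2 = rewards[1] + rewards[3] + v3
    return p * via_s1 + (1 - p) * via_s2


values = [0, 3, 8, 2, 1, 2, 0]
rewards = [0, 0, 0, 4, 1, 1, 1]
print([k_step_estimate(k, 0.5, values, rewards) for k in range(1, 7)])
# [5.5, 4.0, 4.0, 6.0, 5.0, 5.0]
```

A single parameterised function like this avoids writing one function per step count, at the cost of hard-coding the shape of this specific MDP.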

So we solve the polynomial where

\begin{equation*}
TD(\lambda) = TD(1)
\end{equation*}

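Because the episode ends at the terminal state, every estimator beyond the sixth equals E6, which is TD(1), so the remaining weights collapse into a single tail term. Hence,

\begin{equation*}
TD(1) = (1 - \lambda)E_1 + \lambda(1 - \lambda)E_2 + \lambda^2(1 - \lambda)E_3 + \lambda^3(1 - \lambda)E_4 + \lambda^4(1 - \lambda)E_5 + \lambda^5 E_6
\end{equation*}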

Expanding and collecting powers of λ, this simplifies to

\begin{equation*}
(E_6 - E_5)\lambda^5 + (E_5 - E_4)\lambda^4 + (E_4 - E_3)\lambda^3 + (E_3 - E_2)\lambda^2 + (E_2 - E_1)\lambda + E_1 - TD(1) = 0
\end{equation*}
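The simplification can be checked numerically. The sketch below assumes NumPy is available, reuses the estimators from the worked example, and assumes every estimator beyond E6 equals E6. It evaluates both the weighted sum and the collected polynomial (without the TD(1) term) at a few values of λ and confirms they agree.

```python
import numpy as np

E = [5.5, 4.0, 4.0, 6.0, 5.0, 5.0]  # E_1..E_6 from the worked example


def weighted_sum(E, lam):
    # (1 - lam) * lam**(k-1) * E_k for k = 1..5, plus the tail lam**5 * E_6.
    head = sum((1 - lam) * lam ** k * E[k] for k in range(5))
    return head + lam ** 5 * E[5]


# Collected form, highest power first:
# (E_6 - E_5) lam^5 + ... + (E_2 - E_1) lam + E_1
collected = [E[5] - E[4], E[4] - E[3], E[3] - E[2],
             E[2] - E[1], E[1] - E[0], E[0]]

for lam in (0.0, 0.25, 0.5, 0.9, 1.0):
    assert np.isclose(weighted_sum(E, lam), np.polyval(collected, lam))
print("weighted sum and collected polynomial agree")
```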

In [1]:
import numpy as np

In [2]:
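def getTD1(probToState1, valueEstimates, rewards):
    # TD(1): back up the actual rewards from the terminal state.
    # valueEstimates is unused here; it is kept so all helpers share one signature.
    VS6 = 0
    VS5 = VS6 + rewards[6]
    VS4 = VS5 + rewards[5]
    VS3 = VS4 + rewards[4]
    
    # S1 and S2 both transition into S3.
    VS2 = VS3 + rewards[3]
    VS1 = VS3 + rewards[2]
    
    # S0 moves to S1 with probability probToState1, otherwise to S2.
    VS0 = probToState1 * (VS1 + rewards[0]) + (1 - probToState1) * (VS2 + rewards[1])
    
    return VS0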

In [3]:
def getE1(probToState1, valueEstimates, rewards):
    VS1 = valueEstimates[1]
    VS2 = valueEstimates[2]
    
    VS0 = probToState1 * (VS1 + rewards[0]) + (1 - probToState1) * (VS2 + rewards[1])
    
    return VS0

In [4]:
def getE2(probToState1, valueEstimates, rewards):
    VS3 = valueEstimates[3]
    VS1 = rewards[2] + VS3
    VS2 = rewards[3] + VS3
    
    VS0 = probToState1 * (VS1 + rewards[0]) + (1 - probToState1) * (VS2 + rewards[1])
    
    return VS0

In [5]:
def getE3(probToState1, valueEstimates, rewards):
    VS4 = valueEstimates[4]
    VS3 = rewards[4] + VS4
    VS1 = rewards[2] + VS3
    VS2 = rewards[3] + VS3
    
    VS0 = probToState1 * (VS1 + rewards[0]) + (1 - probToState1) * (VS2 + rewards[1])
    
    return VS0

In [6]:
def getE4(probToState1, valueEstimates, rewards):
    VS5 = valueEstimates[5]
    VS4 = rewards[5] + VS5
    VS3 = rewards[4] + VS4
    VS1 = rewards[2] + VS3
    VS2 = rewards[3] + VS3
    
    VS0 = probToState1 * (VS1 + rewards[0]) + (1 - probToState1) * (VS2 + rewards[1])
    
    return VS0

In [7]:
def getE5(probToState1, valueEstimates, rewards):
    VS6 = valueEstimates[6]
    VS5 = rewards[6] + VS6
    VS4 = rewards[5] + VS5
    VS3 = rewards[4] + VS4
    VS1 = rewards[2] + VS3
    VS2 = rewards[3] + VS3
    
    VS0 = probToState1 * (VS1 + rewards[0]) + (1 - probToState1) * (VS2 + rewards[1])
    
    return VS0

In [8]:
def getE6(probToState1, valueEstimates, rewards):
    VS6 = 0 + 0 # no reward no value from "S7"
    VS5 = rewards[6] + VS6
    VS4 = rewards[5] + VS5
    VS3 = rewards[4] + VS4
    VS1 = rewards[2] + VS3
    VS2 = rewards[3] + VS3
    
    VS0 = probToState1 * (VS1 + rewards[0]) + (1 - probToState1) * (VS2 + rewards[1])
    
    return VS0

In [9]:
getTD1(0.5, [0, 3, 8, 2, 1, 2, 0], [0, 0, 0, 4, 1, 1, 1])

5.0

In [10]:
def getEstimators(probToState1, valueEstimates, rewards):
    E1 = getE1(probToState1, valueEstimates, rewards)
    E2 = getE2(probToState1, valueEstimates, rewards)
    E3 = getE3(probToState1, valueEstimates, rewards)
    E4 = getE4(probToState1, valueEstimates, rewards)
    E5 = getE5(probToState1, valueEstimates, rewards)
    E6 = getE6(probToState1, valueEstimates, rewards)
    
    print((E1, E2, E3, E4, E5, E6))
    return (E1, E2, E3, E4, E5, E6)

In [11]:
getEstimators(0.5, [0, 3, 8, 2, 1, 2, 0], [0, 0, 0, 4, 1, 1, 1])

(5.5, 4.0, 4.0, 6.0, 5.0, 5.0)


In [12]:
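def findLambda(probToState1, valueEstimates, rewards):
    E = getEstimators(probToState1, valueEstimates, rewards)

    # Polynomial coefficients, highest power of lambda first.
    # E[5] is E6, which equals TD(1), so the constant term is E1 - TD(1).
    coeffs = [E[5] - E[4], E[4] - E[3], E[3] - E[2], E[2] - E[1], E[1] - E[0], E[0] - E[5]]

    # The roots always include the trivial solution lambda = 1.
    print(np.roots(coeffs))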

In [13]:
# Example 1
findLambda(0.81, [0.0,4.0,25.7,0.0,20.1,12.2,0.0], [7.9,-5.1,2.5,-7.2,9.0,0.0,1.6])

(13.553, 6.0870000000000015, 35.187, 27.287000000000003, 16.687, 16.687)
[-2.14692153  1.          0.6227695  -0.22113099]


In [14]:
# Example 2
findLambda(0.22, [0.0,-5.2,0.0,25.4,10.6,9.2,12.3], [-2.4,0.8,4.0,2.5,8.6,-6.4,6.1])

(-1.0479999999999998, 28.326, 22.126, 14.325999999999999, 23.526, 11.225999999999999)
[-1.16015001+0.j         0.20622303+1.3010633j  0.20622303-1.3010633j
  1.        +0.j         0.49567142+0.j       ]


In [15]:
# Example 3
findLambda(0.64, [0.0,4.9,7.8,-2.3,25.5,-10.2,-6.5], [-2.4,9.6,-7.8,0.1,3.4,-2.1,7.9])

(7.864, -5.336, 25.864, -11.935999999999998, -0.3360000000000003, 6.164000000000001)
[-3.72950282+0.j          1.        +0.j          0.36969234+0.45229758j
  0.36969234-0.45229758j  0.20550276+0.j        ]


# Test

In [37]:
findLambda(0.15, [0.0,0,4.1,17.4,17.4,21.8,5.7], [4.2,-1.2,1.3,5.9,7.4,-2.1,0.1])

(3.0949999999999993, 22.219999999999995, 29.619999999999994, 31.92, 15.92, 10.22)
[-2.94257095+0.j         -0.60032884+0.95031208j -0.60032884-0.95031208j
  1.        +0.j          0.33621109+0.j        ]
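
Every output above contains the root 1. That is expected: the polynomial coefficients sum to zero, so λ = 1 always satisfies the equation trivially. The root of interest is presumably the real one in [0, 1). A minimal sketch of that filtering step follows, with `pick_lambda` as a hypothetical helper and the tolerance values chosen arbitrarily.

```python
import numpy as np


def pick_lambda(E, tol=1e-9):
    """Return the real roots in [0, 1) of the TD(lambda) = TD(1) polynomial."""
    coeffs = [E[5] - E[4], E[4] - E[3], E[3] - E[2], E[2] - E[1],
              E[1] - E[0], E[0] - E[5]]
    roots = np.roots(coeffs)
    # Keep roots whose imaginary part is numerically zero.
    real = roots[np.abs(roots.imag) < tol].real
    # Drop the trivial root at 1 and anything outside the valid range.
    return sorted(float(r) for r in real if 0 <= r < 1 - 1e-6)


# Estimators from the first worked example in this section.
print(pick_lambda([5.5, 4.0, 4.0, 6.0, 5.0, 5.0]))  # one root, roughly 0.40
```

Returning a list rather than a single number keeps the helper honest when a polynomial happens to have zero or several roots in the valid range.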
