## Implementing the TD(1) rule to solve the given state diagram

This notebook implements the TD(1) rule for a quiz question from the ud600 class. Details follow.

Reference for the implementation: https://www.youtube.com/watch?v=rJUjAjHJ8qI (lecture on TD(1) in the ud600 class)

The state diagram is shown below. It is taken from https://www.youtube.com/watch?v=0ElJayK6jTo (quiz in the TD(1) lectures of the ud600 class):
```
s1 -(+1)--             ----------------- s4 -(+1)--
         |             | (P: 0.9)                 |
          -- s3 -(+0)---                          ----- sF
         |             | (P: 0.1)                 |
s2 -(+2)--             ----------------- s5 -(+10)-
```
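
For reference, the per-step update that the code cells in this notebook apply can be written as below. This is a sketch in the notebook's own terms: $s_{t-1} \to s_t$ is one step with reward $r_t$, $V_{T-1}$ corresponds to `VoldT`, $V_T$ to `VnewT`, $\alpha_T$ to `alphaT`, and $e$ to the eligibility vector `e`. With $\lambda = 1$, the usual decay factor $\gamma\lambda$ reduces to $\gamma$.

$$
\begin{aligned}
e(s_{t-1}) &\leftarrow e(s_{t-1}) + 1 \\
V_T(s) &\leftarrow V_T(s) + \alpha_T \, e(s) \, \bigl[\, r_t + \gamma V_{T-1}(s_t) - V_{T-1}(s_{t-1}) \,\bigr] \quad \text{for all } s \\
e(s) &\leftarrow \gamma \, e(s) \quad \text{for all } s
\end{aligned}
$$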

In [29]:
import numpy as np

In [30]:
# Initializing values
totalStates = 6 # total number of states
totalEpisodes = 5 # total number of episodes
V0 = np.zeros(totalStates) # value function
e = np.zeros(totalStates) # eligibility
alphaT = 1 # learning rate
gamma = 1 # discount factor

In [31]:
# steps: (sTold, sTnew, reward)
episode1 = [[0,2,1],[2,3,0],[3,5,1]]
episode2 = [[0,2,1],[2,4,0],[4,5,10]]
episode3 = [[0,2,1],[2,3,0],[3,5,1]]
episode4 = [[0,2,1],[2,3,0],[3,5,1]]
episode5 = [[1,2,2],[2,4,0],[4,5,10]]
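
The episodes above refer to states by index. The sketch below makes the mapping explicit and checks that an episode is well formed. The mapping (0 = s1, 1 = s2, 2 = s3, 3 = s4, 4 = s5, 5 = sF) is an assumption inferred from the rewards in the diagram, and `STATE_NAMES`, `check_episode`, and `describe` are hypothetical helper names, not part of the original notebook.

```python
# Assumed index-to-name mapping, inferred from the rewards in the diagram.
STATE_NAMES = ["s1", "s2", "s3", "s4", "s5", "sF"]
TERMINAL = 5  # index of sF

# Same step format as above: (sTold, sTnew, reward)
episode5 = [[1, 2, 2], [2, 4, 0], [4, 5, 10]]


def check_episode(episode):
    """Return True if consecutive steps chain together and the last step ends in sF."""
    for (_, s_new, _), (next_s_old, _, _) in zip(episode, episode[1:]):
        if s_new != next_s_old:
            return False
    return episode[-1][1] == TERMINAL


def describe(episode):
    """Render an episode as a readable path using the state names."""
    parts = [STATE_NAMES[episode[0][0]]]
    for _, s_new, reward in episode:
        parts.append(f"-(+{reward})-> {STATE_NAMES[s_new]}")
    return " ".join(parts)


print(check_episode(episode5))  # True
print(describe(episode5))       # s2 -(+2)-> s3 -(+0)-> s5 -(+10)-> sF
```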

In [32]:
# For a new episode T
VoldT = V0
e = np.zeros(totalStates) # initialize eligibility to 0
VnewT = VoldT.copy() # set the new value function equal to old

# Run this loop: 1 iteration == 1 step
for value in episode5:
    
    # increase eligibility for the state we just left
    e[value[0]] = e[value[0]] + 1
    
    # update the value function for all states s, weighted by eligibility
    VnewT = VnewT + alphaT * e * (value[2] + gamma * VoldT[value[1]] - VoldT[value[0]])
    
    # decay eligibility
    e = gamma * e

VoldT = VnewT.copy() # store the generated value function
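
The cell above handles a single episode. One possible way to reuse it is to wrap the same loop in a function, as in the sketch below. `td1_episode` is a hypothetical name, and the defaults `alpha=1.0` and `gamma=1.0` simply mirror the values used in this notebook.

```python
import numpy as np


def td1_episode(V_old, episode, alpha=1.0, gamma=1.0):
    """Apply the TD(1) update for one episode and return the new value function.

    V_old is held fixed for the whole episode, as in the cell above.
    Each step of the episode is (s_old, s_new, reward).
    """
    V_old = np.asarray(V_old, dtype=float)
    V_new = V_old.copy()
    e = np.zeros_like(V_old)  # eligibility starts at 0 for every episode

    for s_old, s_new, reward in episode:
        e[s_old] += 1  # the state we just left becomes (more) eligible
        td_error = reward + gamma * V_old[s_new] - V_old[s_old]
        V_new = V_new + alpha * e * td_error  # update all states at once
        e = gamma * e  # decay eligibility (lambda = 1)

    return V_new


episode5 = [[1, 2, 2], [2, 4, 0], [4, 5, 10]]
V = td1_episode(np.zeros(6), episode5)
print(V)  # [ 0. 12. 10.  0. 10.  0.]
```

To process several episodes in sequence, the returned array can be passed back in as `V_old` for the next call. Which learning rate to use for each episode is left to the caller.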

In [34]:
print(VnewT)
print('the answer for the quiz question: ', VnewT[1])

[ 0. 12. 10.  0. 10.  0.]
the answer for the quiz question:  12.0
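
As a sanity check on the printed result: when the value function starts at zero and `alphaT = 1`, the TD errors along an episode telescope, so each state visited once should end up with the (discounted) sum of rewards collected from that state onward. The sketch below computes those returns directly for `episode5`; `returns_from_each_state` is a hypothetical helper name.

```python
def returns_from_each_state(episode, gamma=1.0):
    """Map each visited state to the discounted return collected from it onward."""
    G = 0.0
    returns = {}
    # Walk the episode backwards so G accumulates the rewards that follow each state.
    for s_old, _, reward in reversed(episode):
        G = reward + gamma * G
        returns[s_old] = G
    return returns


episode5 = [[1, 2, 2], [2, 4, 0], [4, 5, 10]]
print(returns_from_each_state(episode5))  # {4: 10.0, 2: 10.0, 1: 12.0}
```

The return from state index 1 is 2 + 0 + 10 = 12, which agrees with `VnewT[1]` above.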
