<a href="https://colab.research.google.com/github/michaelwnau/ai-academy-machine-learning-2023/blob/main/hmm_rl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction to Value Iteration in Reinforcement Learning
This notebook implements a basic form of value iteration, a key algorithm in reinforcement learning. Value iteration computes the optimal value of every state in a Markov decision process (MDP) whose rewards and transition probabilities are known. It repeatedly updates each state's value estimate until the estimates stop changing; the optimal policy is then the action that achieves the highest value in each state.
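
For reference, the update that value iteration repeats can be written as the Bellman optimality backup. The symbols ($V_k$ for the estimate after $k$ passes, $R$ for the reward, $P$ for the transition probabilities, $\gamma$ for the discount factor) follow a common textbook convention and are an assumption here, since the notebook does not define its own notation. The reward is written as depending on the state only, which matches how the code adds a fixed 0 or 10 per state.

$$
V_{k+1}(s) = \max_{a} \Big[ R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, V_k(s') \Big]
$$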

In [None]:
# Initial setup: number of iterations and the starting value estimate for each state.
# The *_rew variables hold value estimates, initialised to each state's immediate reward.
iterations = 500
PU_rew = 0
PF_rew = 0
RU_rew = 10
RF_rew = 10

### Displaying Initial Rewards
First, let's display the initial value estimates for the four states (PU, PF, RU, RF). These values are updated on every pass of the iteration.

In [None]:
print(' PU    PF    RU     RF')  # plain string: no placeholders, so no f-prefix needed
print(f'{PU_rew : .2f} {PF_rew : .2f} {RU_rew : .2f} {RF_rew: .2f}')

### The Value Iteration Loop
In this section, we perform the value iteration. Each pass computes, for every state (PU, PF, RU, RF), the value of taking each of the two actions (`sav` and `Ad` in the code). An action's value is the state's immediate reward (0 for PU and PF, 10 for RU and RF) plus the expected value of the next state, discounted by a factor of 0.9. The expectation weights each possible next state's current estimate by its transition probability. The `max` function then keeps the better of the two action values as the state's new estimate for the next pass.
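
To make a single pass concrete, here is a minimal sketch that wraps the same arithmetic in a function and applies it once to the starting estimates. The names `sweep`, `GAMMA`, and `first` are hypothetical helpers introduced for illustration; the coefficients are copied from the loop in this notebook.

```python
GAMMA = 0.9  # discount factor used throughout this notebook


def sweep(pu, pf, ru, rf, gamma=GAMMA):
    """Apply one synchronous backup to the four state values."""
    # For each state: max over the two actions (first line: sav, second line: Ad).
    pu_new = max(0 + gamma * pu,
                 0 + gamma * (0.5 * pu + 0.5 * pf))
    pf_new = max(0 + gamma * (0.5 * pu + 0.5 * rf),
                 0 + gamma * pf)
    ru_new = max(10 + gamma * (0.5 * ru + 0.5 * pu),
                 10 + gamma * (0.5 * pu + 0.5 * pf))
    rf_new = max(10 + gamma * (0.5 * ru + 0.5 * rf),
                 10 + gamma * pf)
    return pu_new, pf_new, ru_new, rf_new


# One pass starting from the notebook's initial estimates (0, 0, 10, 10).
first = sweep(0, 0, 10, 10)
print(' '.join(f'{v: .2f}' for v in first))
```

Working it by hand: PF's `sav` action gives 0.9 × (0.5 × 0 + 0.5 × 10) = 4.5, RU's `sav` action gives 10 + 0.9 × 5 = 14.5, and RF's `sav` action gives 10 + 0.9 × 10 = 19, so the first pass moves the estimates from (0, 0, 10, 10) to (0, 4.5, 14.5, 19).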

In [None]:
for iteration in range(iterations):
    # Value of each action (sav, Ad) in each state, computed from the current estimates
    PU_sav = 0 + .9*1*PU_rew
    PU_Ad = 0 + .9*(.5*PU_rew + .5*PF_rew)
    PF_sav = 0 + .9*(.5*PU_rew + .5*RF_rew)
    PF_Ad = 0 + .9*(1*PF_rew)
    RU_sav = 10 + .9*(.5*RU_rew + .5*PU_rew)
    RU_Ad = 10 + .9*(.5*PU_rew + .5*PF_rew)
    RF_sav = 10 + .9*(.5*RU_rew + .5*RF_rew)
    RF_Ad = 10 + .9*(PF_rew)

    # New estimate for each state: the value of its best action
    PU_rew = max(PU_sav, PU_Ad)
    PF_rew = max(PF_sav, PF_Ad)
    RU_rew = max(RU_sav, RU_Ad)
    RF_rew = max(RF_sav, RF_Ad)

    # Print the updated value estimates
    print(f'{PU_rew : .2f} {PF_rew : .2f} {RU_rew : .2f} {RF_rew: .2f}')


### Conclusion
As the iteration progresses, the printed values show how the value estimate for each state evolves. Because the discount factor (0.9) is less than 1, the estimates converge to the optimal state values for the given rewards and transition probabilities. The optimal policy is then read off by choosing, in each state, the action with the higher value.
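
The notebook prints values but never extracts the policy or checks for convergence. The sketch below shows one way to do both, assuming the same rewards, transition probabilities, and discount factor as the loop above. The table layout and the names `REWARD`, `TRANSITIONS`, `q_value`, `value_iteration`, and `greedy_policy` are hypothetical and introduced only for illustration; the stopping tolerance `tol` is likewise an assumption, since the original always runs a fixed 500 passes.

```python
GAMMA = 0.9  # discount factor from the notebook

# Immediate reward per state, as used in the loop above.
REWARD = {"PU": 0, "PF": 0, "RU": 10, "RF": 10}

# TRANSITIONS[state][action] maps next_state -> probability (copied from the loop).
TRANSITIONS = {
    "PU": {"sav": {"PU": 1.0}, "Ad": {"PU": 0.5, "PF": 0.5}},
    "PF": {"sav": {"PU": 0.5, "RF": 0.5}, "Ad": {"PF": 1.0}},
    "RU": {"sav": {"RU": 0.5, "PU": 0.5}, "Ad": {"PU": 0.5, "PF": 0.5}},
    "RF": {"sav": {"RU": 0.5, "RF": 0.5}, "Ad": {"PF": 1.0}},
}


def q_value(state, action, values):
    """Immediate reward plus discounted expected value of the next state."""
    expected = sum(p * values[nxt]
                   for nxt, p in TRANSITIONS[state][action].items())
    return REWARD[state] + GAMMA * expected


def value_iteration(tol=1e-6, max_sweeps=10_000):
    """Repeat synchronous backups until the largest change falls below tol."""
    values = dict(REWARD)  # same starting point as the notebook: 0, 0, 10, 10
    for sweeps in range(1, max_sweeps + 1):
        new_values = {s: max(q_value(s, a, values) for a in TRANSITIONS[s])
                      for s in TRANSITIONS}
        delta = max(abs(new_values[s] - values[s]) for s in values)
        values = new_values
        if delta < tol:
            break
    return values, sweeps


def greedy_policy(values):
    """Pick, in each state, the action with the highest value."""
    return {s: max(TRANSITIONS[s], key=lambda a: q_value(s, a, values))
            for s in TRANSITIONS}


values, sweeps = value_iteration()
policy = greedy_policy(values)
print({s: round(v, 2) for s, v in values.items()})
print(policy)
```

Under these assumptions the estimates should settle near 31.59 (PU), 38.60 (PF), 44.02 (RU), and 54.20 (RF), and the greedy policy chooses `Ad` in PU and `sav` in the other three states. Stopping on a tolerance rather than a fixed count is a design choice: it ends the loop once further passes no longer change the values meaningfully.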