## What is Value Iteration?
Value Iteration is an algorithm that works out the best move to make from every room in the game, so that you collect the most treasure over time. Here's how it works:

### Value of Each Room (State): 
First, we give each room a value: an estimate of how much treasure we can expect to collect if we start there and play well. At the start we know nothing, so we set every room's value to zero.

### Updating Values: 
We keep updating these values based on the rewards we get and the chances of moving to other rooms. For each room, we look at every possible action and score it: the treasure we expect to pick up right away, plus the discounted value of the rooms we are likely to land in. The room's new value is the score of its best action.
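
In symbols, the update described above is usually written as the Bellman optimality backup. The notation here is an assumption chosen for illustration: $V_k(s)$ is the value of room $s$ after $k$ updates, $P(s' \mid s, a)$ is the chance of landing in room $s'$ after taking action $a$ in room $s$, $R(s, a, s')$ is the treasure collected on that move, and $\gamma$ is the discount factor.

$$
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\,R(s, a, s') + \gamma\, V_k(s')\,\bigr]
$$

The `value_iteration` function later in this notebook computes the same bracketed sum for each action, with one difference: it overwrites each value as soon as it is computed rather than keeping a separate $V_k$ and $V_{k+1}$.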

### Best Move (Policy): 
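After updating the values many times, we can decide the best action (move) to take from each room. This is the action with the highest expected score, meaning the immediate reward plus the discounted value of wherever the action leads. It is not always the action that points at the highest-value neighboring room, because a move can also pay a reward of its own or fail to land where you aimed.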
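
A minimal sketch of reading the best move off a set of values, assuming a made-up three-room world. The arrays `P` and `R`, the room layout, and the already-converged values in `V` are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 3-room world with 2 actions per room.
# P[s, a, s2] = chance of landing in room s2 after action a in room s.
# R[s, a, s2] = treasure collected on that move.
P = np.zeros((3, 2, 3))
R = np.zeros((3, 2, 3))
P[0, 0, 0] = 1.0          # room 0, action 0: stay put
P[0, 1, 1] = 1.0          # room 0, action 1: walk to room 1
P[1, 0, 0] = 1.0          # room 1, action 0: walk back to room 0
P[1, 1, 2] = 1.0          # room 1, action 1: walk to room 2 ...
R[1, 1, 2] = 10.0         # ... and collect 10 treasure on the way
P[2, :, 2] = 1.0          # room 2: the game is over, nothing more to collect

gamma = 0.9
V = np.array([9.0, 10.0, 0.0])  # assumed to be the converged values

# Score of every (room, action) pair: expected reward plus discounted next value.
Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])

# The best move in each room is the action with the highest score.
greedy_policy = Q.argmax(axis=1)
print(greedy_policy)
```

Note that room 1 has the highest value, yet the best move from room 1 is to leave it for room 2, whose value is zero: the reward collected on the way is what makes that move the best one.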

### Repeat Until Done: 
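We repeat the updating process until the values stop changing much, that is, until the biggest change in any room's value during a full pass falls below a small threshold. At that point the values are close to their true best values, and the moves chosen from them give the best way to move around the rooms and get the most treasures.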
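
To see why the values eventually stop changing, here is a small sketch with a single room that pays a fixed reward every turn and always leads back to itself. The numbers (`gamma`, `reward`, `theta`) are assumptions picked for illustration.

```python
gamma = 0.5     # discount factor (assumed)
reward = 1.0    # treasure collected on every turn (assumed)
theta = 1e-6    # stop when the value changes by less than this

V = 0.0
sweeps = 0
while True:
    new_V = reward + gamma * V   # the only action: collect, then stay
    delta = abs(new_V - V)       # how much the value moved this sweep
    V = new_V
    sweeps += 1
    if delta < theta:
        break

print(V, sweeps)
```

Each sweep changes the value by `gamma` times the previous change, so the changes shrink geometrically and the value settles near `reward / (1 - gamma)`. This shrinking is what makes the stopping test reliable whenever the discount factor is below 1.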

## Example
Let's say you are in a maze, and you want to find the shortest path to the exit. You can move up, down, left, or right. Some paths might have traps, and some might lead you closer to the exit.

+ Start with Zero Values: At first, you give every position in the maze a value of zero.

+ Update Values: You sweep over every position and update its value from the values of the positions it can reach. Positions closer to the exit end up with higher values (here, getting out sooner plays the role of collecting more treasure).

+ Repeat: You keep sweeping until the values stop changing.

+ Find the Best Move: Once the values have settled, the best move from each position is the one with the highest expected value, which is the one that gets you to the exit the fastest.

In the end, Value Iteration helps you find the shortest and best path to the exit by figuring out the value of being in each position and the best move to make from there.
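
The maze steps above can be sketched on a tiny grid. Everything in this block is an assumption made for illustration: the maze layout, the cost of 1 per step, moves that always succeed, and a discount factor of 1 (which is safe here only because every open cell can reach the exit).

```python
# Hypothetical maze: S = start, E = exit, # = wall, . = open floor.
maze = [
    "S.#E",
    "..#.",
    "....",
]
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
rows, cols = len(maze), len(maze[0])
open_cells = [(r, c) for r in range(rows) for c in range(cols) if maze[r][c] != "#"]
exit_cell = next(cell for cell in open_cells if maze[cell[0]][cell[1]] == "E")


def step(cell, move):
    """Return the cell reached by a move; walls and edges leave you in place."""
    r, c = cell[0] + move[0], cell[1] + move[1]
    if 0 <= r < rows and 0 <= c < cols and maze[r][c] != "#":
        return (r, c)
    return cell


def move_score(cell, name):
    """Score of one move: -1 for the step plus the value of where it lands."""
    return -1.0 + V[step(cell, moves[name])]


V = {cell: 0.0 for cell in open_cells}  # start with zero values
while True:
    delta = 0.0
    for cell in open_cells:
        if cell == exit_cell:
            continue  # the exit keeps value 0: no more steps to pay for
        best = max(move_score(cell, name) for name in moves)
        delta = max(delta, abs(best - V[cell]))
        V[cell] = best
    if delta < 1e-9:  # repeat until the values stop changing
        break

# Find the best move from every position once the values have settled.
best_move = {
    cell: max(moves, key=lambda name: move_score(cell, name))
    for cell in open_cells
    if cell != exit_cell
}
print(V[(0, 0)], best_move[(1, 3)])
```

With a cost of 1 per step, each settled value is the negative of the number of steps to the exit, so the start cell ends at -7.0 for the seven-step route around the wall.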

In [None]:
import numpy as np

In [None]:
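def value_iteration(mdp_file):
    """Run value iteration on an MDP stored in a text file.

    Assumed layout, inferred from the indexing below: line 0 holds the number
    of states and line 1 the number of actions (each as "<label> <number>"),
    lines 4 through the third-from-last hold one transition each as
    "s a s_next reward probability", and the last line holds the discount.
    """
    with open(mdp_file, 'r') as file:
        lines = file.readlines()

    S = int(lines[0].split()[1])
    A = int(lines[1].split()[1])
    gamma = float(lines[-1].split()[1])
    # Parse each transition once instead of converting strings on every sweep.
    transitions = [
        (int(t[0]), int(t[1]), int(t[2]), float(t[3]), float(t[4]))
        for t in (line.split() for line in lines[4:-2])
    ]

    V = np.zeros(S)
    policy = np.zeros(S, dtype=int)
    theta = 1e-10  # stop once no value changes by more than this in a sweep
    while True:
        delta = 0
        for s in range(S):
            v = V[s]
            # Q[a]: expected reward plus discounted next-state value of action a.
            Q = np.zeros(A)
            for s_from, a, s_next, reward, prob in transitions:
                if s_from == s:
                    Q[a] += prob * (reward + gamma * V[s_next])
            V[s] = np.max(Q)  # in-place update: later states see the new value
            delta = max(delta, abs(v - V[s]))
            policy[s] = np.argmax(Q)
        if delta < theta:
            break

    return V, policy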


In [None]:
mdp_file_path = 'path_to_mdp_file.txt'
optimal_value_function, optimal_policy = value_iteration(mdp_file_path)
print("Optimal Value Function:", optimal_value_function)
print("Optimal Policy:", optimal_policy)
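
The file layout is not spelled out in this notebook, so the sketch below writes a small sample file in the layout that the indexing in `value_iteration` appears to expect, then reads it back with the same slicing. The label words (`states`, `actions`, `start`, `end`, `type`, `discount`) and the two-state MDP itself are hypothetical; only the line positions and the five fields per transition line are taken from the code above.

```python
import os
import tempfile

# Hypothetical two-state MDP. Transition lines are: s a s_next reward probability.
sample = """\
states 2
actions 2
start 0
end 1
0 0 0 0.0 1.0
0 1 1 5.0 1.0
1 0 1 0.0 1.0
1 1 1 0.0 1.0
type episodic
discount 0.9
"""

path = os.path.join(tempfile.mkdtemp(), "sample_mdp.txt")
with open(path, "w") as f:
    f.write(sample)

# Read it back using the same indexing as value_iteration.
with open(path) as f:
    lines = f.readlines()
num_states = int(lines[0].split()[1])
num_actions = int(lines[1].split()[1])
gamma = float(lines[-1].split()[1])
transitions = [line.split() for line in lines[4:-2]]

print(num_states, num_actions, gamma, len(transitions))
```

If your MDP files carry a keyword at the start of each transition line, or a different number of header lines, the slice `lines[4:-2]` and the field positions in `value_iteration` would need to be adjusted to match.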
