11. Implement a Python script to model a Markov Decision Process (MDP) with 3 states
and 2 actions, using a given transition matrix and reward function. Calculate the value
function for each state under a policy in which every action in every state is
initialized with probability 0.5. Provide a sample MDP and demonstrate the
corresponding value function (set the discount factor to 0.9).

In [14]:
states = ['Rainy', 'Cloudy', 'Sunny']
actions = ['Umbrella', 'No Umbrella']

In [15]:
# P[s][a][i]: probability of moving from state s to states[i] when taking action a
P = {
    "Rainy": {
        "Umbrella": [0.8, 0.2, 0.0],  
        "No Umbrella": [0.0, 1.0, 0.0],  
    },
    "Cloudy": {
        "Umbrella": [0.0, 0.9, 0.1],
        "No Umbrella": [0.5, 0.0, 0.5],
    },
    "Sunny": {
        "Umbrella": [0.0, 0.0, 1.0],
        "No Umbrella": [0.3, 0.3, 0.4],
    }
}
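
Each list in `P` is meant to be a probability distribution over the next state, so it should have one entry per state, no negative entries, and a sum of 1. A minimal sketch of that sanity check follows; the helper name `check_transitions` is hypothetical, and the table is repeated only so the block runs on its own.

```python
import math

states = ['Rainy', 'Cloudy', 'Sunny']
actions = ['Umbrella', 'No Umbrella']

# Same transition table as in the cell above, repeated so this check is self-contained.
P = {
    "Rainy":  {"Umbrella": [0.8, 0.2, 0.0], "No Umbrella": [0.0, 1.0, 0.0]},
    "Cloudy": {"Umbrella": [0.0, 0.9, 0.1], "No Umbrella": [0.5, 0.0, 0.5]},
    "Sunny":  {"Umbrella": [0.0, 0.0, 1.0], "No Umbrella": [0.3, 0.3, 0.4]},
}

def check_transitions(P, states, actions):
    """Hypothetical helper: raise ValueError if any P[s][a] is not a distribution over states."""
    for s in states:
        for a in actions:
            row = P[s][a]
            if len(row) != len(states):
                raise ValueError(f"P[{s!r}][{a!r}] has {len(row)} entries, expected {len(states)}")
            if any(p < 0 for p in row):
                raise ValueError(f"P[{s!r}][{a!r}] contains a negative probability")
            if not math.isclose(sum(row), 1.0, abs_tol=1e-9):
                raise ValueError(f"P[{s!r}][{a!r}] sums to {sum(row)}, expected 1")
    return True

rows_ok = check_transitions(P, states, actions)
print("All transition rows are valid distributions:", rows_ok)
```

Running a check like this before policy evaluation catches typos in the table early, since a row that does not sum to 1 silently distorts every value computed from it.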

In [16]:
# R[s][a][i]: reward for the transition from state s to states[i] under action a
R = {
    "Rainy": {
        "Umbrella": [5, 0, 0],  
        "No Umbrella": [0, 10, 0],  
    },
    "Cloudy": {
        "Umbrella": [0, -1, 1],
        "No Umbrella": [2, 0, 2],
    },
    "Sunny": {
        "Umbrella": [0, 0, 3],
        "No Umbrella": [1, 1, 1],
    }
}

In [17]:
discount_factor = 0.9

In [18]:
# Initialize a uniform policy: π(action | state) = 0.5 for every state
policy = {s: {a: 0.5 for a in actions} for s in states}

In [19]:
policy

{'Rainy': {'Umbrella': 0.5, 'No Umbrella': 0.5},
 'Cloudy': {'Umbrella': 0.5, 'No Umbrella': 0.5},
 'Sunny': {'Umbrella': 0.5, 'No Umbrella': 0.5}}

In [20]:
# Initialize value function V[state] = 0
V = {s: 0.0 for s in states}

In [21]:
V

{'Rainy': 0.0, 'Cloudy': 0.0, 'Sunny': 0.0}

The **Bellman expectation equation** for the value function under a policy π, with discount factor γ, is:

V<sub>π</sub>(s) = Σ<sub>a</sub> π(a|s) Σ<sub>s'</sub> P(s'|s,a) [ R(s,a,s') + γ V<sub>π</sub>(s') ]
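
As a worked instance, consider the first sweep starting from V(s') = 0 for every s', using the sample tables defined earlier. Every γ V(s') term vanishes, so the update for Rainy reduces to the policy-weighted expected immediate reward. The subscript in V<sub>1</sub> is notation introduced here for "the estimate after one sweep".

```latex
\begin{aligned}
V_1(\text{Rainy})
  &= 0.5\,\bigl(0.8 \cdot 5 + 0.2 \cdot 0 + 0.0 \cdot 0\bigr)
   + 0.5\,\bigl(0.0 \cdot 0 + 1.0 \cdot 10 + 0.0 \cdot 0\bigr) \\
  &= 0.5 \cdot 4 + 0.5 \cdot 10 \\
  &= 7
\end{aligned}
```

The same arithmetic gives 0.6 for Cloudy and 2.0 for Sunny after the first sweep.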

In [22]:
# Run iterative policy evaluation: 20 synchronous sweeps of the Bellman expectation update
for _ in range(20):
    new_V = {}
    for s in states:
        v = 0
        for a in actions:
            pi = policy[s][a]
            for i, s_prime in enumerate(states):
                prob = P[s][a][i]
                reward = R[s][a][i]
                v += pi * prob * (reward + discount_factor * V[s_prime])
        new_V[s] = v
    V = new_V

In [23]:
print("Value Function after 20 iterations:")
for s in states:
    print(f"{s}: {V[s]:.4f}")

Value Function after 20 iterations:
Rainy: 28.7554
Cloudy: 21.7311
Sunny: 22.9308
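
A fixed number of sweeps gives no direct signal of how close the estimate is to the fixed point of the Bellman expectation equation. One common alternative, sketched here as an assumption rather than as part of the original exercise, is to stop once the largest per-state change in a sweep falls below a tolerance. The function name `evaluate_policy` and the parameters `tol` and `max_sweeps` are hypothetical; the MDP tables are repeated so the block runs on its own.

```python
states = ['Rainy', 'Cloudy', 'Sunny']
actions = ['Umbrella', 'No Umbrella']

# Sample MDP from the cells above, repeated so this sketch is self-contained.
P = {
    "Rainy":  {"Umbrella": [0.8, 0.2, 0.0], "No Umbrella": [0.0, 1.0, 0.0]},
    "Cloudy": {"Umbrella": [0.0, 0.9, 0.1], "No Umbrella": [0.5, 0.0, 0.5]},
    "Sunny":  {"Umbrella": [0.0, 0.0, 1.0], "No Umbrella": [0.3, 0.3, 0.4]},
}
R = {
    "Rainy":  {"Umbrella": [5, 0, 0],  "No Umbrella": [0, 10, 0]},
    "Cloudy": {"Umbrella": [0, -1, 1], "No Umbrella": [2, 0, 2]},
    "Sunny":  {"Umbrella": [0, 0, 3],  "No Umbrella": [1, 1, 1]},
}
policy = {s: {a: 0.5 for a in actions} for s in states}

def evaluate_policy(P, R, policy, states, actions,
                    gamma=0.9, tol=1e-8, max_sweeps=10_000):
    """Hypothetical helper: sweep until the largest per-state change is below tol.

    Returns the value estimate and the number of sweeps performed.
    """
    V = {s: 0.0 for s in states}
    for sweep in range(1, max_sweeps + 1):
        # Synchronous update: every new value is computed from the previous sweep's V.
        new_V = {
            s: sum(
                policy[s][a] * P[s][a][i] * (R[s][a][i] + gamma * V[s_next])
                for a in actions
                for i, s_next in enumerate(states)
            )
            for s in states
        }
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            return V, sweep
    return V, max_sweeps

V_conv, n_sweeps = evaluate_policy(P, R, policy, states, actions)
print(f"Stopped after {n_sweeps} sweeps")
for s in states:
    print(f"{s}: {V_conv[s]:.4f}")
```

Passing `tol=0.0` together with `max_sweeps=20` reproduces the fixed 20-sweep loop, which makes it easy to compare a fixed budget against the tolerance-based stop.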


In [24]:
# Repeat the evaluation from V = 0 with 40 sweeps, for comparison with the 20-sweep result
V_40 = {s: 0.0 for s in states}
for _ in range(40):
    new_V = {}
    for s in states:
        v = 0
        for a in actions:
            pi = policy[s][a]
            for i, s_prime in enumerate(states):
                prob = P[s][a][i]
                reward = R[s][a][i]
                v += pi * prob * (reward + discount_factor * V_40[s_prime])
        new_V[s] = v
    V_40 = new_V

print("Value Function after 40 iterations:")
for s in states:
    print(f"{s}: {V_40[s]:.4f}")

Value Function after 40 iterations:
Rainy: 31.6656
Cloudy: 24.6412
Sunny: 25.8409


In [25]:
print("\nDifference between 40 and 20 iterations:")
for s in states:
    diff = V_40[s] - V[s]
    print(f"{s}: {diff:.6f}")


Difference between 40 and 20 iterations:
Rainy: 2.910185
Cloudy: 2.910185
Sunny: 2.910185
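
The identical gap of 2.910185 in all three states appears consistent with geometric convergence. Because every row of the transition table sums to 1, the slowest-decaying part of the error is the same constant in every state and shrinks by a factor of γ = 0.9 per sweep. Extrapolating from that gap, even the 40-sweep values are still roughly 0.4 short of the fixed point. For a fixed policy the Bellman expectation equation is linear in V, so the fixed point can also be found directly by solving (I − γ P<sub>π</sub>) V = r<sub>π</sub>. The sketch below does this with a small pure-Python elimination routine; the names `P_pi`, `r_pi`, and `solve_linear` are introduced here for illustration and are not part of the original notebook.

```python
states = ['Rainy', 'Cloudy', 'Sunny']
actions = ['Umbrella', 'No Umbrella']
gamma = 0.9

# Sample MDP from the cells above, repeated so this sketch is self-contained.
P = {
    "Rainy":  {"Umbrella": [0.8, 0.2, 0.0], "No Umbrella": [0.0, 1.0, 0.0]},
    "Cloudy": {"Umbrella": [0.0, 0.9, 0.1], "No Umbrella": [0.5, 0.0, 0.5]},
    "Sunny":  {"Umbrella": [0.0, 0.0, 1.0], "No Umbrella": [0.3, 0.3, 0.4]},
}
R = {
    "Rainy":  {"Umbrella": [5, 0, 0],  "No Umbrella": [0, 10, 0]},
    "Cloudy": {"Umbrella": [0, -1, 1], "No Umbrella": [2, 0, 2]},
    "Sunny":  {"Umbrella": [0, 0, 3],  "No Umbrella": [1, 1, 1]},
}
policy = {s: {a: 0.5 for a in actions} for s in states}

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        # Swap in the row with the largest pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

n = len(states)

# Policy-averaged transition matrix: P_pi[i][j] = sum_a pi(a|s_i) * P(s_j | s_i, a)
P_pi = [[sum(policy[s][a] * P[s][a][j] for a in actions) for j in range(n)]
        for s in states]

# Policy-averaged expected immediate reward for each state
r_pi = [sum(policy[s][a] * P[s][a][j] * R[s][a][j]
            for a in actions for j in range(n))
        for s in states]

# Coefficient matrix (I - gamma * P_pi)
A = [[(1.0 if i == j else 0.0) - gamma * P_pi[i][j] for j in range(n)]
     for i in range(n)]

V_exact = dict(zip(states, solve_linear(A, r_pi)))
for s in states:
    print(f"{s}: {V_exact[s]:.4f}")
```

For larger state spaces a library routine such as `numpy.linalg.solve` would be the more usual choice; the hand-written solver here only keeps the example free of third-party dependencies.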
