### Assignment: Week 1
## Modeling simple RL problems by making their MDPs in Python

We will create the MDPs for some of the example problems from the Grokking textbook. For simple environments, we can hardcode the MDP as a dictionary by exhaustively encoding the whole state space and the transition function. We will also go through a more complicated example where the state space is too large to code by hand, so the transition function has to be generated programmatically from state parameters.
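To make the dictionary encoding concrete, here is a minimal sketch using a hypothetical two-state MDP. The names `toy_mdp`, `describe` and the action `"Go"` are made up for illustration; the tuple order `(probability, next_state, reward, done)` is the one used by the MDPs later in this assignment.

```python
# Hypothetical two-state MDP, only to illustrate the dictionary layout:
# mdp[state][action] -> list of (probability, next_state, reward, done)
toy_mdp = {
    0: {"Go": [(0.8, 1, 1, True), (0.2, 0, 0, False)]},
    1: {"Go": [(1, 1, 0, True)]},
}


def describe(mdp, state, action):
    """Return one readable line per possible outcome of taking `action` in `state`."""
    lines = []
    for prob, next_state, reward, done in mdp[state][action]:
        lines.append(f"p={prob:.2f} -> state {next_state}, reward {reward}, done={done}")
    return lines


for line in describe(toy_mdp, 0, "Go"):
    print(line)
```

Each outcome of an action is one tuple, so a deterministic action has a single tuple with probability 1 and a stochastic action has several tuples whose probabilities sum to 1.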

Later on, you will not need to implement the MDPs of common RL problems yourself. Most of the work is already done by the OpenAI Gym library, which includes models for most of the well-known RL environments.

You can start this assignment while or after reading Chapter 2 of Grokking.

## Environment 0 - Bandit Walk

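Let us consider the Bandit Walk (BW) environment on page 39.

The state space has 3 elements: states 0, 1 and 2.
States 0 and 2 are terminal states and state 1 is the starting state.

The action space has 2 elements: Left and Right.

The environment is deterministic: every action leads to a single next state with probability 1.

Only one (state, action, next state) tuple has a positive reward: (1, Right, 2) gives the agent a reward of +1.

In the dictionary below, each action maps to a list of `(probability, next_state, reward, done)` tuples. Terminal states loop back to themselves with zero reward.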

In [3]:
bw_mdp = {
    0 : {
        "Right" : [(1, 0, 0, True)],
        "Left" : [(1, 0, 0, True)]
    },
    1 : {
        "Right" : [(1, 2, 1, True)],
        "Left" : [(1, 0, 0, True)]
    },
    2 : {
        "Right" : [(1, 2, 0, True)],
        "Left" : [(1, 2, 0, True)]
    }
}
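A quick sanity check for any hand-written MDP is that the outcome probabilities of every (state, action) pair sum to 1. The helper below is a sketch of such a check; `check_probabilities` is a name introduced here for illustration, and the dictionary is repeated so the snippet runs on its own.

```python
import math

# Copy of the Bandit Walk MDP so this snippet is self-contained.
bw_mdp = {
    0: {"Right": [(1, 0, 0, True)], "Left": [(1, 0, 0, True)]},
    1: {"Right": [(1, 2, 1, True)], "Left": [(1, 0, 0, True)]},
    2: {"Right": [(1, 2, 0, True)], "Left": [(1, 2, 0, True)]},
}


def check_probabilities(mdp):
    """Return the (state, action) pairs whose outcome probabilities do not sum to 1."""
    bad = []
    for state, actions in mdp.items():
        for action, outcomes in actions.items():
            total = sum(prob for prob, _, _, _ in outcomes)
            # isclose tolerates floating-point error, e.g. 1/2 + 1/3 + 1/6
            if not math.isclose(total, 1.0):
                bad.append((state, action))
    return bad


print(check_probabilities(bw_mdp))  # prints [] because every action is deterministic
```

The same helper can be run on the stochastic MDPs later in the assignment, where a mistyped fraction is easier to miss.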

## Environment 1 - Slippery Walk

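Now we'll model the Slippery Walk MDP. Here actions are stochastic. For example, taking Right in state 4 moves the agent to state 5 with probability 1/2, leaves it in state 4 with probability 1/3, and sends it back to state 3 with probability 1/6. States 0 and 5 are terminal, and only the transition into state 5 gives a reward of +1.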

In [5]:
swf_mdp = {
    0 : {
        "Right" : [(1, 0, 0, True)],
        "Left" : [(1, 0, 0, True)]
    },
    1 : {
        "Right" : [(1/2, 2, 0, False), (1/3, 1, 0, False), (1/6, 0, 0, True)],
        "Left" : [(1/2, 0, 0, True), (1/3, 1, 0, False), (1/6, 2, 0, False)]
    },
    2 : {
        "Right" : [(1/2, 3, 0, False), (1/3, 2, 0, False), (1/6, 1, 0, False)],
        "Left" : [(1/2, 1, 0, False), (1/3, 2, 0, False), (1/6, 0, 0, True)]
    },
    3 : {
        "Right" : [(1/2, 4, 0, False), (1/3, 3, 0, False), (1/6, 2, 0, False)],
        "Left" : [(1/2, 2, 0, False), (1/3, 3, 0, False), (1/6, 1, 0, False)]
    },
    4 : {
        "Right" : [(1/2, 5, 1, True), (1/3, 4, 0, False), (1/6, 3, 0, False)],
        "Left" : [(1/2, 3, 0, False), (1/3, 4, 0, False), (1/6, 2, 0, False)]
    },
    5 : {
        "Right" : [(1, 5, 0, True)],
        "Left" : [(1, 5, 0, True)]
    }
}
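With stochastic transitions, two things are often useful: the expected immediate reward of an action and a way to sample one outcome. The sketch below does both for a single list of outcomes copied from `swf_mdp`; the function names `expected_reward` and `sample_outcome` are introduced here for illustration.

```python
import random

# Outcomes of taking "Right" in state 4, copied from swf_mdp.
right_from_4 = [(1/2, 5, 1, True), (1/3, 4, 0, False), (1/6, 3, 0, False)]


def expected_reward(outcomes):
    """Probability-weighted average of the immediate reward."""
    return sum(prob * reward for prob, _, reward, _ in outcomes)


def sample_outcome(outcomes, rng):
    """Draw one (next_state, reward, done) according to the outcome probabilities."""
    weights = [prob for prob, _, _, _ in outcomes]
    _, next_state, reward, done = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward, done


rng = random.Random(0)  # seeded generator so repeated runs draw the same sequence
print(expected_reward(right_from_4))  # prints 0.5
print(sample_outcome(right_from_4, rng))
```

Repeatedly calling `sample_outcome` and following `next_state` until `done` is `True` gives a simple way to roll out an episode from the dictionary alone.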

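## Environment 2 - Frozen Lake

The 4x4 grid has 16 states, so we generate the transitions in a loop instead of writing them out by hand. This version is deterministic: every move succeeds unless it would leave the grid.

In [7]:
terminal_states = {5, 7, 11, 12, 15}  # Holes (5, 7, 11, 12) and the goal (15)
moves = {"Up": -4, "Down": 4, "Right": 1, "Left": -1}
fl_mdp = {}
for state in range(16):
    fl_mdp[state] = {}
    for action, move in moves.items():
        next_state = state + move
        off_grid = not 0 <= next_state < 16
        # Left/Right must not wrap around to a neighbouring row
        wraps_row = action in ("Left", "Right") and next_state // 4 != state // 4
        if state in terminal_states or off_grid or wraps_row:
            next_state = state  # Terminal states self-loop; blocked moves leave the agent in place
        reward = 1 if next_state == 15 and state != 15 else 0
        done = next_state in terminal_states
        fl_mdp[state][action] = [(1, next_state, reward, done)]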
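The move offsets of ±4 and ±1 imply that the 16 states are numbered row by row on a 4x4 grid. A small sketch of converting between a state index and its (row, column) position, plus a text rendering of the grid, can make the terminal states easier to check by eye. The names `to_row_col`, `to_state` and `render` are introduced here for illustration.

```python
N_COLS = 4  # assumed grid width, matching the 16-state layout


def to_row_col(state, n_cols=N_COLS):
    """Convert a state index into a (row, col) pair."""
    return divmod(state, n_cols)


def to_state(row, col, n_cols=N_COLS):
    """Convert a (row, col) pair back into a state index."""
    return row * n_cols + col


def render(holes, goal, n_rows=4, n_cols=N_COLS):
    """Draw the grid as text: 'H' for holes, 'G' for the goal, '.' elsewhere."""
    rows = []
    for r in range(n_rows):
        row = ""
        for c in range(n_cols):
            s = to_state(r, c, n_cols)
            row += "G" if s == goal else "H" if s in holes else "."
        rows.append(row)
    return "\n".join(rows)


print(to_row_col(7))  # prints (1, 3)
print(render({5, 7, 11, 12}, 15))
```

Working in (row, col) coordinates also makes boundary checks explicit: a move is valid only if the new row and column both stay within the grid.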
