# Markov Decision Processes

A Markov decision process (MDP) is just a Markov reward process (MRP) with actions thrown in.

<img src="/files/images/mdp_definition.png", width=500px>

With the introduction of actions, the rewards and transition probabilities are now conditioned on which action was taken. In our simple Student MDP, every action except "Pub" is deterministic. The Pub action randomly deposits the agent into one of the 3 "Class" states.

<img src="/files/images/student_mdp.png", width=500px>

## Policies

No longer being at the mercy of fate, we now need to be able to choose actions depending on which state we're in. That's the job of the *Policy*. Policies are functions which map states to action probabilities. Below we'll consider a policy that gives each available action an equal probability of being chosen.
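As a rough sketch, a uniform random policy like this can be built from nothing more than the list of actions available in each state. The names `available_actions` and `uniform_policy` are hypothetical and used only for illustration; the action sets are read off the Student MDP graph.

```python
# Hypothetical helper names, used for illustration only.
# Actions available in each state of the Student MDP.
available_actions = {
    "C1": ["Study", "FB"],
    "C2": ["Study", "Sleep"],
    "C3": ["Study", "Pub"],
    "FB": ["Quit", "FB"],
    "Sleep": ["Sleep"],
}


def uniform_policy(actions_by_state):
    """Give every available action in a state the same probability."""
    return {
        state: {action: 1.0 / len(actions) for action in actions}
        for state, actions in actions_by_state.items()
    }


policy = uniform_policy(available_actions)
print(policy["C1"])
```

Each state's probabilities sum to 1, and any action not listed for a state implicitly has probability 0.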

<img src="/files/images/policy.png", width=500px>

## Generating samples

With a properly defined set of transition probabilities (one matrix per action) and a policy which probabilistically chooses actions in each state, we can sample trajectories just like in MRPs and Markov chains.

1. The agent starts in a state.
2. The policy determines which action will be chosen (probabilistically).
3. A reward is given for taking that action in that state.
4. The transition probabilities associated with that action determine where the agent will end up next.
5. Repeat until the terminal state is reached.

In [279]:
from numpy.random import choice
import numpy as np

state_names = ["C1", "C2", "C3", "FB", "Sleep"]


# Each action has an associated transition matrix
# Most transitions are deterministic with a value of 1
# with the exception of the "Pub" action, which can only be taken from
# "C3" (third row), and transitions the agent to C1, C2 or C3 randomly
_transitions = {"Study": [[0, 1, 0, 0, 0],
                           [0, 0, 1, 0, 0],
                           [0, 0, 0, 0, 1],
                           [0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0]],
                "Sleep": [[0, 0, 0, 0, 0],
                          [0, 0, 0, 0, 1],
                          [0, 0, 0, 0, 0],
                          [0, 0, 0, 0, 0],
                          [0, 0, 0, 0, 1]],
                "FB": [[0, 0, 0, 1, 0],
                       [0, 0, 0, 0, 0],
                       [0, 0, 0, 0, 0],
                       [0, 0, 0, 1, 0],
                       [0, 0, 0, 0, 0]],
                "Quit": [[0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0],
                         [1, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0]],
                "Pub": [[0, 0, 0, 0, 0],
                        [0, 0, 0, 0, 0],
                        [.2, .4, .4, 0, 0],
                        [0, 0, 0, 0, 0],
                        [0, 0, 0, 0, 0]]}


# Each entry for an action corresponds to the state it is taken in.
# Choosing to "Study" while in "C3" yields a +10 reward.
# "Study" will never take an agent to "FB" or "Sleep", so it has a None value in those states
_rewards = {
    "Study": [-2, -2, 10, None, None],
    "Sleep": [None, 0, None, None, 0],
    "FB": [-1, None, None, -1, None],
    "Quit": [None, None, None, 0, None],
    "Pub": [None, None, 1, None, None]
}


# If the agent is in "C1" it has a 50% chance of choosing to either Study or go to FB.
# Technically ALL actions should be represented in each state but with probability 0
# but that's too much writing for no payoff so I didn't bother. Not sorry.
_policy = {
    "C1": {"Study": .5, "FB": .5},
    "C2": {"Study": .5, "Sleep": .5},
    "C3": {"Study": .5, "Pub": .5},
    "FB": {"Quit": .5, "FB": .5},
    "Sleep": {"Sleep": 1}
}


    
    
class MDP:
    def __init__(self, transitions, rewards, policy, state_names):
        self.transitions = transitions
        self.rewards = rewards
        self.state_names = state_names
        self._policy = policy
        self.terminal_state = "Sleep"
        
    def policy(self, state):
        probabilities = self._policy[state]
        action = choice(list(probabilities.keys()), p=list(probabilities.values()))
        return action
    
    def act(self, state):
        action = self.policy(state)
        p_matrix = self.transitions[action]
        P = p_matrix[self.state_names.index(state)]
        next_state = choice(self.state_names, p=P)
        return action, next_state
    
        
    def sample(self, state):
        states = []
        actions = []
        while state != self.terminal_state:
            states.append(state)
            action, next_state = self.act(state)
            actions.append(action)
            state = next_state
        states.append(self.terminal_state)
        return states, actions
    
    
    def G(self, sample, gamma=1):
        """Discounted return of a sampled (states, actions) trajectory."""
        states, actions = sample
        rewards = []
        # zip stops at the last action, so the terminal state (which has
        # no action) is skipped without mutating the caller's sample
        for i, (state, action) in enumerate(zip(states, actions)):
            reward = self.rewards[action][self.state_names.index(state)]
            # Make sure we're taking an action we're allowed to in this state
            assert reward is not None
            rewards.append(reward * gamma**i)

        return np.sum(rewards)
    
    def pprint(self, sample, **kwargs):
        ret = self.G(sample, **kwargs)
        states, actions = sample
        trajectory = ""
        for state, action in zip(states, actions):
            R = self.rewards[action][self.state_names.index(state)]
            trajectory += "{state}: {action}({reward}) --> ".format(state=state, action=action, reward=R)
        
        trajectory += "END\n"
        print("TRAJECTORY: \n")
        print(trajectory)
        print("Total Reward = {}".format(ret))
        print("--------------------------------------------------\n")

            
        
        




# Example Trajectories

Below are some examples of random trajectories generated using the **uniform random policy** and the **transition dynamics** and **rewards** specified in the MDP graph above.

In [282]:
mdp = MDP(_transitions, _rewards, _policy, state_names)
print("State: Action(Reward) --> Transition \n\n")
for i in range(10):
    sample = mdp.sample("C1")
    mdp.pprint(sample)

State: Action(Reward) --> Transition 


TRAJECTORY: 

C1: Study(-2) --> C2: Sleep(0) --> END

Total Reward = -2
--------------------------------------------------

TRAJECTORY: 

C1: FB(-1) --> FB: Quit(0) --> C1: FB(-1) --> FB: FB(-1) --> FB: FB(-1) --> FB: FB(-1) --> FB: Quit(0) --> C1: FB(-1) --> FB: FB(-1) --> FB: Quit(0) --> C1: FB(-1) --> FB: Quit(0) --> C1: Study(-2) --> C2: Study(-2) --> C3: Pub(1) --> C3: Study(10) --> END

Total Reward = -1
--------------------------------------------------

TRAJECTORY: 

C1: FB(-1) --> FB: FB(-1) --> FB: FB(-1) --> FB: Quit(0) --> C1: Study(-2) --> C2: Study(-2) --> C3: Study(10) --> END

Total Reward = 3
--------------------------------------------------

TRAJECTORY: 

C1: FB(-1) --> FB: Quit(0) --> C1: FB(-1) --> FB: FB(-1) --> FB: FB(-1) --> FB: FB(-1) --> FB: FB(-1) --> FB: Quit(0) --> C1: Study(-2) --> C2: Sleep(0) --> END

Total Reward = -8
--------------------------------------------------

TRAJECTORY: 

C1: Study(-2) --> C2: Sleep(0) 
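The totals printed above are undiscounted returns (the default `gamma=1`). As a minimal standalone sketch of the same computation, the hypothetical helper `discounted_return` below weights the reward at step `i` by `gamma**i`, using the rewards of the third trajectory above as input.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum of rewards, with the reward at step i weighted by gamma**i."""
    return sum(r * gamma**i for i, r in enumerate(rewards))


# Rewards from the third trajectory printed above:
# FB, FB, FB, Quit, Study, Study, Study
rewards = [-1, -1, -1, 0, -2, -2, 10]

undiscounted = discounted_return(rewards)           # gamma = 1
discounted = discounted_return(rewards, gamma=0.5)  # later rewards count less
print(undiscounted, discounted)
```

With `gamma=1` this reproduces the total of 3 shown for that trajectory. With `gamma=0.5` the +10 at the end is discounted so heavily that the return turns negative.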

# Expected Return under a specified Policy

Just like with Markov Reward Processes, we can ask what the average or *expected* return would be. 

## State Value Function

The expected return for a specific state is also known as the **state-value**. Unlike the MRP case, the state value in an MDP is necessarily conditioned on what policy the agent follows.

<img src="/files/images/state_value_mdp.png", width=500px>

We can *estimate* the state-value by simply taking many samples and averaging over them.

In [307]:
## Monte-Carlo State-Value Estimation
estimates = {}
for state in mdp.state_names:
    returns = []
    for _ in range(10000):
        sample = mdp.sample(state)
        returns.append(mdp.G(sample))
    # The sample mean of the returns is the Monte-Carlo estimate of the state-value
    estimates[state] = np.mean(returns)
estimates

{'C1': -1.1519999999999999,
 'C2': 2.7252999999999998,
 'C3': 7.3715000000000002,
 'FB': -2.3479999999999999,
 'Sleep': 0.0}
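One rough way to judge how far such an estimate might be from the true value is the standard error of the mean, which shrinks in proportion to one over the square root of the number of samples. The sketch below is an illustration rather than part of the original notebook: `mean_and_stderr` is a hypothetical helper, applied to the returns of the four complete trajectories printed earlier.

```python
import numpy as np


def mean_and_stderr(returns):
    """Sample mean and standard error of the mean for a list of returns."""
    returns = np.asarray(returns, dtype=float)
    mean = returns.mean()
    # ddof=1 gives the unbiased sample variance; needs at least 2 samples
    stderr = returns.std(ddof=1) / np.sqrt(len(returns))
    return mean, stderr


# Returns of the four complete trajectories printed earlier (all start in C1)
mean, stderr = mean_and_stderr([-2, -1, 3, -8])
print(mean, stderr)
```

With only four samples the standard error is about 2.3, far too large to pin down the value of C1. That is why the estimates above use 10,000 samples per state.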

## Action Value Function

We can also ask what the expected value of a particular *action* is under a specific policy. This is known as the **action-value**.

<img src="/files/images/action_value_mdp.png", width=500px>

Just like with the state-value, we can *estimate* the action-value by averaging over many samples.

In the (messy) example below, the state-action pairs may not all end up with the same number of samples, but each is guaranteed to have at least `min_iterations` samples informing its estimate. Nevertheless, we still get a rough estimate of the action-values under our uniform-random policy.

In [308]:
# Monte-Carlo Action-Value Estimation

action_estimates = {"C1": {}, "C2": {}, "C3": {}, "FB": {}}

action_rewards = {
    "C1": {"Study": [], "FB": []},
    "C2": {"Study": [], "Sleep": []},
    "C3": {"Study": [], "Pub": []},
    "FB": {"Quit": [], "FB": []},
    "Sleep": {"Sleep": []}
}

min_iterations = 10000

# Yeesh this is ugly
for state in mdp.state_names:
    if state == "Sleep":
        continue
        
    possible_actions = list(action_rewards[state].keys())
    while possible_actions:
        sample = mdp.sample(state)
        _, actions = sample
        reward = mdp.G(sample)
        # Credit the whole return to the first action taken from this state
        action_rewards[state][actions[0]].append(reward)

        # Stop considering an action once it has enough samples;
        # once none are left, we're done with this state.
        # (A new list is built instead of popping while iterating.)
        possible_actions = [
            action for action in possible_actions
            if len(action_rewards[state][action]) < min_iterations
        ]

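# NOTE: this rebinds action_estimates to the same dict as action_rewards,
# so the loop below overwrites each list of returns with its mean in place.
# "Sleep" is skipped, which is why its empty list shows up in the output.
action_estimates = action_rewards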
for state in action_rewards:
    if state == "Sleep":
        continue
    for action in action_rewards[state]:
        action_estimates[state][action] = np.mean(action_rewards[state][action])
action_estimates

{'C1': {'FB': -3.3879612038796121, 'Study': 0.64785239559567498},
 'C2': {'Sleep': 0.0, 'Study': 5.3791620837916208},
 'C3': {'Pub': 4.7533246675332466, 'Study': 10.0},
 'FB': {'FB': -3.2624407582938391, 'Quit': -1.3082691730826916},
 'Sleep': {'Sleep': []}}

# Solving an MDP Analytically

We can get a closed-form solution for the state-value of our MDP under our uniform-random policy by converting it into an MRP and solving it the same way we did in the MRP notebook.

To convert any MDP into an MRP, we simply average over the dynamics that our policy implies:

<img src="/files/images/mdp_to_mrp_reward.png", width=500px>
<img src="/files/images/mdp_to_mrp_transition.png", width=500px>

Averaging over our policy like this converts our **action-dependent** rewards and transition probabilities into **action-independent** rewards and transition probabilities. With action-independent rewards and transitions we once again have an MRP, which we already know how to solve.
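Written out in one common notation (the symbols are an assumption and may differ from those in the images above), the averaging is:

$$
\mathcal{R}^{\pi}_{s} = \sum_{a} \pi(a \mid s)\, \mathcal{R}^{a}_{s},
\qquad
\mathcal{P}^{\pi}_{ss'} = \sum_{a} \pi(a \mid s)\, \mathcal{P}^{a}_{ss'}
$$

For example, in state C3 the policy picks Study and Pub with probability 0.5 each, so

$$
\mathcal{R}^{\pi}_{C3} = 0.5 \cdot 10 + 0.5 \cdot 1 = 5.5,
\qquad
\mathcal{P}^{\pi}_{C3,\,C1} = 0.5 \cdot 0 + 0.5 \cdot 0.2 = 0.1
$$

The resulting MRP then has the usual closed-form solution for the state values:

$$
v_{\pi} = \left(I - \gamma \mathcal{P}^{\pi}\right)^{-1} \mathcal{R}^{\pi}
$$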

Below is the student MDP under a uniform-random policy pre-converted to an MRP. How these numbers were found is left as an exercise to the reader. 

(I've always wanted to say that)

In [None]:
from numpy.random import choice
import numpy as np

### Converting an MDP into an MRP

state_names = ["C1", "C2", "C3", "FB", "Sleep"]

# Probabilities changed to reflect uniform random policy

# Notice Class 3 probabilities reflect possible pub choice:
# Row 3 Column 1:
# (.5 * .2) = .1 = probability of picking pub action (.5) AND
# probability of being sent to class 1 (.2) as a result

# Together they mean a .1 probability 
# of ending up back in C1 from C3

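# The "Sleep" row is all zeros: Sleep is terminal, so nothing is counted
# after reaching it. This also keeps (I - gamma*P) invertible when gamma = 1.
p_matrix = [[0, .5, 0, .5, 0],
            [0, 0, .5, 0, .5],
            [.1, .2, .2, 0, .5],
            [.5, 0, 0, .5, 0],
            [0, 0, 0, 0, 0]]

# Action rewards are weighted by the probability of each action being chosen, then summed
# E.g. for C3: 5.5 = (.5 * 10) + (.5 * 1)
# (named mrp_rewards so the MDP's _rewards dict is not overwritten)
mrp_rewards = [-1.5, -1, 5.5, -.5, 0]


gamma = 1
R = np.array(mrp_rewards)
P = np.array(p_matrix)
I = np.identity(len(p_matrix))

# Solve (I - gamma*P) v = R for the state values v
solution = np.linalg.solve(I - gamma * P, R).tolist()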

solutions = {}
for state in range(len(state_names)):
    solutions[state_names[state]] = solution[state]

solutions
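For anyone who would rather check the hand-converted matrix and rewards above than redo the exercise, here is one possible sketch of the conversion in code. The `dynamics` layout and the `mdp_to_mrp` function are hypothetical and chosen for brevity; they are not the notebook's own data structures. Sleep is treated as terminal with no actions, which is an assumption that matches the all-zero last row of `p_matrix`.

```python
import numpy as np

state_names = ["C1", "C2", "C3", "FB", "Sleep"]

# Hypothetical layout: {action: {state: (reward, next-state probabilities)}}.
# Only (state, action) pairs that exist in the Student MDP are listed.
dynamics = {
    "Study": {"C1": (-2, [0, 1, 0, 0, 0]),
              "C2": (-2, [0, 0, 1, 0, 0]),
              "C3": (10, [0, 0, 0, 0, 1])},
    "Sleep": {"C2": (0, [0, 0, 0, 0, 1])},
    "FB":    {"C1": (-1, [0, 0, 0, 1, 0]),
              "FB": (-1, [0, 0, 0, 1, 0])},
    "Quit":  {"FB": (0, [1, 0, 0, 0, 0])},
    "Pub":   {"C3": (1, [.2, .4, .4, 0, 0])},
}

policy = {
    "C1": {"Study": .5, "FB": .5},
    "C2": {"Study": .5, "Sleep": .5},
    "C3": {"Study": .5, "Pub": .5},
    "FB": {"Quit": .5, "FB": .5},
    "Sleep": {},  # assumption: terminal state, so its MRP row stays all zeros
}


def mdp_to_mrp(dynamics, policy, state_names):
    """Average rewards and transitions over the policy's action probabilities."""
    n = len(state_names)
    P = np.zeros((n, n))
    R = np.zeros(n)
    for i, state in enumerate(state_names):
        for action, prob in policy[state].items():
            reward, next_probs = dynamics[action][state]
            R[i] += prob * reward
            P[i] += prob * np.array(next_probs)
    return P, R


P_pi, R_pi = mdp_to_mrp(dynamics, policy, state_names)
# Closed-form state values with gamma = 1
v = np.linalg.solve(np.identity(len(state_names)) - P_pi, R_pi)
print(dict(zip(state_names, v.round(3))))
```

The resulting `P_pi` and `R_pi` should agree with the hand-written `p_matrix` and reward list in the cell above.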
    

## Estimates vs. Solution

If we compare our estimates to the closed-form solution, it looks like we got pretty dang close. How close depends on the number of samples: Monte-Carlo error shrinks roughly in proportion to one over the square root of the sample count, so try changing the sample counts in the code yourself and see how close you can get.

In [311]:
print("State Values")
print("     Solution \t State-Value Estimate \t Action to State-Value Estimate")
print("     -------- \t -------------------- \t ------------------------------")
for state in solutions.keys():
    if state == "Sleep":
        continue
    sol = solutions[state]
    est = estimates[state]
    
    a_est = 0
    # Averaging over all actions under the policy to get state-values
    for action in action_estimates[state]:
        a_est += mdp._policy[state][action] * action_estimates[state][action]
    
    print("{state}:   {sol:.3f}\t\t{est:.3f}\t\t\t{a_est:.3f}".format(state=state, sol=sol, est=est, a_est=a_est))

State Values
     Solution 	 State-Value Estimate 	 Action to State-Value Estimate
     -------- 	 -------------------- 	 ------------------------------
FB:   -2.308		-2.348			-2.285
C3:   7.385		7.372			7.377
C1:   -1.308		-1.152			-1.370
C2:   2.692		2.725			2.690
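The last column of the table relies on the identity that a state's value is the policy-weighted average of its action-values, v(s) = sum over a of pi(a|s) * q(s, a). As a small standalone check, the sketch below redoes that arithmetic from the action-value estimates printed earlier, rounded to three decimals. The variable names `q`, `pi` and `v_from_q` are illustrative only.

```python
# Action-value estimates copied (rounded) from the Monte-Carlo output above
q = {
    "C1": {"FB": -3.388, "Study": 0.648},
    "C2": {"Sleep": 0.0, "Study": 5.379},
    "C3": {"Pub": 4.753, "Study": 10.0},
    "FB": {"FB": -3.262, "Quit": -1.308},
}

# Uniform random policy: each of the two available actions has probability 0.5
pi = {state: {action: 0.5 for action in actions} for state, actions in q.items()}

# State value = policy-weighted average of the action values
v_from_q = {
    state: sum(pi[state][action] * q[state][action] for action in q[state])
    for state in q
}

for state, value in v_from_q.items():
    print("{}: {:.3f}".format(state, value))
```

Up to rounding of the inputs, these values match the "Action to State-Value Estimate" column above.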
