3.1 The state space is $\mathcal{S} = \{w \mid 1 \leq w \leq W\}$, where $w$ is the hourly wage on a given day.
The action space is $\mathcal{A} = \{(l,s) \mid 0\leq l \leq H,\ 0\leq s\leq H-l\}$, where $l$ and $s$ are the hours spent learning and job searching on that day.
The reward is the total wage earned on that day from the remaining $H-l-s$ working hours, before the wage moves to $w'$: $R(w,(l,s),w') = w(H-l-s)$.
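As a small illustration of these definitions, the sketch below enumerates the action space and evaluates the reward. It assumes integer hours (as the implementation in 3.2 does); the helper names `action_space` and `reward` are hypothetical and only used here.

```python
from typing import List, Tuple


def action_space(H: int) -> List[Tuple[int, int]]:
    """All (l, s) pairs with 0 <= l <= H and 0 <= s <= H - l (integer hours assumed)."""
    return [(l, s) for l in range(H + 1) for s in range(H - l + 1)]


def reward(w: int, l: int, s: int, H: int) -> int:
    """Wage earned today: hourly wage w times the H - l - s hours actually worked."""
    return w * (H - l - s)


actions = action_space(10)
print(len(actions))         # 66, i.e. (H + 1)(H + 2) / 2 for H = 10
print(reward(5, 2, 3, 10))  # 25: five working hours at wage 5
```

The number of actions grows quadratically in $H$, so the full state-action table stays small for the values of $H$ and $W$ used later.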

To get the state transition probability, we consider three different cases.

1. No new job offer
2. There is a new job offer but the salary is lower than the current job
3. There is a new job offer and the salary is higher than the current job

$$Pr[w'|w,a =(l,s)] = \begin{cases} (1-\beta s/H) \times Poisson(\mu = \alpha l,k=x)& for \ w' = min(w+x,W), x \geq 0 \\ \beta s/H \times Poisson(\mu = \alpha l,k=0) & for \ w' = min(w+1,W) \\ \beta s/H \times Poisson(\mu = \alpha l,k=x) & for \ w' = min(w+x,W), x \geq 1 \end{cases}$$

Grouping the terms that lead to the same $w'$, we obtain

$$Pr[w'|w,a =(l,s)] = \begin{cases} (1-\beta s/H) \times Poisson(\mu = \alpha l,k=0)&  \ w' = w \\ \beta s/H \times Poisson(\mu = \alpha l,k=0) +  Poisson(\mu = \alpha l,k=1)&  \ w' = min(w+1,W) \\ Poisson(\mu = \alpha l,k=x) & \ w' = min(w+x,W), x \geq 2 \end{cases}$$

In particular, for the edge cases $w = W$ and $w = W-1$,

$$Pr[w'=W|w=W,a = (l,s)] = 1 $$
$$Pr[w'=W-1|w = W-1,a = (l,s)] = (1-\beta s/H)Poisson(\mu = \alpha l,k=0) $$
$$Pr[w'=W|w = W-1,a = (l,s)] = \beta s/H \times Poisson(\mu = \alpha l,k=0) + 1- Poisson(\mu = \alpha l,k=0)$$
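The grouping can be checked numerically. The sketch below builds the distribution of $w'$ directly from the ungrouped cases (offer or no offer, combined with a Poisson raise $x$) and compares it with the grouped formulas. It is only a sanity check under a few assumptions: `poisson_pmf` is a standard-library stand-in for `scipy.stats.poisson.pmf`, the function name `transition_probs` is hypothetical, and the Poisson sum is truncated at `x_max`, which is harmless for the small means considered here.

```python
import math
from collections import defaultdict
from typing import Dict


def poisson_pmf(k: int, mu: float) -> float:
    """Poisson pmf; stdlib stand-in for scipy.stats.poisson.pmf (0.0 ** 0 == 1.0 handles mu = 0)."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def transition_probs(w: int, l: int, s: int, H: int, W: int,
                     alpha: float, beta: float, x_max: int = 60) -> Dict[int, float]:
    """Pr[w' | w, (l, s)] accumulated from the ungrouped cases, with wages capped at W."""
    p_offer = beta * s / H   # probability of receiving a job offer
    mu = alpha * l           # Poisson mean of the raise obtained by learning
    probs: Dict[int, float] = defaultdict(float)
    for x in range(x_max + 1):   # truncated Poisson support (assumption)
        px = poisson_pmf(x, mu)
        # No offer: the wage rises by x.
        probs[min(w + x, W)] += (1 - p_offer) * px
        # Offer: the new wage is the better of w + 1 and w + x.
        probs[min(max(w + 1, w + x), W)] += p_offer * px
    return dict(probs)


p = transition_probs(w=5, l=4, s=3, H=10, W=30, alpha=0.08, beta=0.82)
p0 = poisson_pmf(0, 0.08 * 4)
print(abs(sum(p.values()) - 1.0) < 1e-12)                                # probabilities sum to 1
print(abs(p[5] - (1 - 0.82 * 3 / 10) * p0) < 1e-12)                      # grouped case w' = w
print(abs(p[6] - (0.82 * 3 / 10 * p0 + poisson_pmf(1, 0.32))) < 1e-12)   # grouped case w' = w + 1
```

All the mass that would land above $W$ is accumulated on $w' = W$, which is what the edge-case formulas for $w = W$ and $w = W-1$ express.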

The discount factor is $\gamma \in [0,1)$.

3.2 The MDP is implemented below.

In [1]:
from dataclasses import dataclass
from typing import Tuple, Dict
from rl.markov_decision_process import FiniteMarkovDecisionProcess
from rl.markov_decision_process import FinitePolicy, StateActionMapping
from rl.markov_process import FiniteMarkovProcess, FiniteMarkovRewardProcess
from rl.distribution import Categorical, Constant
from rl.dynamic_programming import policy_iteration_result
from rl.dynamic_programming import value_iteration_result
from scipy.stats import poisson


In [5]:
# The state class
@dataclass(frozen=True)
class CareerState:
    w: int

# The action class
@dataclass(frozen=True)
class Action:
    s: int
    l: int

CareerMapping = StateActionMapping[CareerState, Action]

class CareerMDP(FiniteMarkovDecisionProcess[CareerState, Action]):

    def __init__(
        self,
        H: int,
        W: int,
        alpha: float,
        beta: float
    ):
        self.H: int = H
        self.W: int = W
        self.alpha: float= alpha
        self.beta: float = beta
        super().__init__(self.get_action_transition_reward_map())

    def get_action_transition_reward_map(self) -> CareerMapping:
        d: Dict[CareerState, Dict[Action, Categorical[Tuple[CareerState,
                                                            float]]]] = {}
        for w in range(1,self.W+1):
            state: CareerState = CareerState(w)
            d1: Dict[Action, Categorical[Tuple[CareerState, float]]] = {}

            for s in range(self.H+1):
                for l in range(self.H-s+1):
                    action = Action(s,l)
                    if w == self.W:
                        sr_probs_dict: Dict[Tuple[CareerState, float], float] =\
                                    {(CareerState(self.W),w*(self.H-l-s)):1}
                    elif w == self.W-1:
                        # probability that wp = self.W-1
                        p1 = (1-self.beta*s/self.H)*poisson.pmf(k=0,mu=self.alpha*l)
                        # probability that wp = self.W
                        p2 = self.beta*s/self.H*poisson.pmf(k=0,mu=self.alpha*l) + 1 - poisson.cdf(k=0,mu=self.alpha*l)
                        sr_probs_dict: Dict[Tuple[CareerState, float], float] =\
                                {(CareerState(self.W-1),w*(self.H-l-s)):p1}
                        sr_probs_dict[(CareerState(self.W),w*(self.H-l-s))] = p2
                    elif w < self.W -1:
                        # probability that wp = w
                        p0 = (1-self.beta*s/self.H)*poisson.pmf(k=0,mu=self.alpha*l)
                        # probability that wp = w+1
                        p1 = self.beta*s/self.H*poisson.pmf(k=0,mu=self.alpha*l) + poisson.pmf(k=1,mu=self.alpha*l)
                        sr_probs_dict: Dict[Tuple[CareerState, float], float] =\
                             {(CareerState(w),w*(self.H-l-s)):p0,\
                             (CareerState(w+1),w*(self.H-l-s)):p1}
                        for x in range(2,self.W-w):
                            sr_probs_dict[(CareerState(w+x),w*(self.H-l-s))] = poisson.pmf(k=x,mu=self.alpha*l)
                        sr_probs_dict[(CareerState(self.W),w*(self.H-l-s))] = 1-poisson.cdf(k = self.W-w-1,mu=self.alpha*l)

                    d1[action] = Categorical(sr_probs_dict)
            d[state] = d1
        return d


3.3, 3.4
Solve for the optimal value function and optimal policy using value iteration.
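To make the procedure concrete, here is a minimal plain-Python sketch of value iteration on this MDP. It is not the `rl` library's `value_iteration_result`; it is a stand-in written under stated assumptions: integer hours, a stdlib `poisson_pmf` instead of `scipy.stats.poisson`, and hypothetical helper names (`next_wage_probs`, `value_iteration`) and stopping tolerance `tol`.

```python
import math
from typing import Dict, List, Tuple


def poisson_pmf(k: int, mu: float) -> float:
    """Poisson pmf; stdlib stand-in for scipy.stats.poisson.pmf."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def next_wage_probs(w: int, l: int, s: int, H: int, W: int,
                    alpha: float, beta: float) -> List[Tuple[int, float]]:
    """Grouped transition probabilities Pr[w' | w, (l, s)] from section 3.1."""
    if w == W:
        return [(W, 1.0)]
    q, mu = beta * s / H, alpha * l
    out = [(w, (1 - q) * poisson_pmf(0, mu))]   # no offer and no raise
    if w + 1 < W:
        out.append((w + 1, q * poisson_pmf(0, mu) + poisson_pmf(1, mu)))
        out.extend((w + x, poisson_pmf(x, mu)) for x in range(2, W - w))
    # All remaining probability mass is capped at the top wage W.
    out.append((W, max(1.0 - sum(p for _, p in out), 0.0)))
    return out


def value_iteration(H: int, W: int, alpha: float, beta: float, gamma: float,
                    tol: float = 1e-8) -> Tuple[Dict[int, float], Dict[int, Tuple[int, int]]]:
    """Repeat the Bellman optimality update until the largest change falls below tol."""
    actions = [(l, s) for l in range(H + 1) for s in range(H - l + 1)]
    trans = {(w, a): next_wage_probs(w, a[0], a[1], H, W, alpha, beta)
             for w in range(1, W + 1) for a in actions}
    V = {w: 0.0 for w in range(1, W + 1)}
    while True:
        new_V: Dict[int, float] = {}
        policy: Dict[int, Tuple[int, int]] = {}
        for w in range(1, W + 1):
            best_q, best_a = -math.inf, actions[0]
            for a in actions:
                l, s = a
                # Today's pay plus the discounted expected value of tomorrow's wage.
                q_val = w * (H - l - s) + gamma * sum(p * V[wp] for wp, p in trans[(w, a)])
                if q_val > best_q:
                    best_q, best_a = q_val, a
            new_V[w], policy[w] = best_q, best_a
        delta = max(abs(new_V[w] - V[w]) for w in V)
        V = new_V
        if delta < tol:
            return V, policy   # policy maps each wage w to the greedy (l, s)
```

Calling `value_iteration(H=10, W=30, alpha=0.08, beta=0.82, gamma=0.95)` uses the parameters of this section, although in pure Python it may take several seconds. One cheap sanity check is that the value at $w = W$ should approach $WH/(1-\gamma)$, since the wage cannot rise further and working every hour is then optimal.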

In [6]:
si_mdp2: FiniteMarkovDecisionProcess[CareerState, Action] =\
        CareerMDP(
            H=10,
            W =30,
            alpha = 0.08,
            beta = 0.82
        )

print("MDP Transition Map")
print("------------------")

MDP Transition Map
------------------


In [7]:
print("MDP Value Iteration Optimal Value Function and Optimal Policy")
print("--------------")
opt_vf_vi2, opt_policy_vi2 = value_iteration_result(si_mdp2, gamma=0.95)
print(opt_vf_vi2)
print(opt_policy_vi2)
print()

MDP Value Iteration Optimal Value Function and Optimal Policy
--------------
{CareerState(w=1): 1259.6504926227421, CareerState(w=2): 1340.415028776917, CareerState(w=3): 1426.3579138423866, CareerState(w=4): 1517.8111660380023, CareerState(w=5): 1615.128091467472, CareerState(w=6): 1718.684649031722, CareerState(w=7): 1828.8809028830492, CareerState(w=8): 1946.1425677166949, CareerState(w=9): 2070.9226507164517, CareerState(w=10): 2203.703212047008, CareerState(w=11): 2344.9974308597966, CareerState(w=12): 2495.3513213053984, CareerState(w=13): 2655.3256532742776, CareerState(w=14): 2825.6334066075733, CareerState(w=15): 3006.9962764014304, CareerState(w=16): 3199.999895219283, CareerState(w=17): 3399.9998886704875, CareerState(w=18): 3599.999882121693, CareerState(w=19): 3799.9998755728984, CareerState(w=20): 3999.9998690241036, CareerState(w=21): 4199.999862475308, CareerState(w=22): 4399.999855926513, CareerState(w=23): 4599.9998493777175, CareerState(w=24): 4799.999842828925, Care

The optimal policy says that when $w \leq 13$, one should spend all $H$ hours learning; when $14 \leq w \leq 15$, one should spend all hours searching for a new job; and when $w \geq 16$, one should spend all hours working.

An intuitive explanation is that when the wage is already high, time is best spent working to maximize total pay. In the extreme case $w = W$, the wage cannot rise any further, so one should never learn or search.
When the wage is low, it pays to raise the hourly wage first, either by learning or by finding a better-paid job. Although the expected raise from spending all hours searching is slightly higher than from spending all hours learning (0.82 vs 0.8), learning offers a chance of a raise of more than one step ($x>1$).
So to optimize the career, one spends all hours learning when the wage is low, and all hours searching for a new job when the wage is in the middle range.
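The comparison of the two one-day raises can be reproduced with a few lines of arithmetic. This is only a sketch under simplifying assumptions: it looks at a single day, ignores the cap at $W$, and uses hypothetical variable names.

```python
import math

H, alpha, beta = 10, 0.08, 0.82   # parameters used in the run above

# All hours searching (s = H, l = 0): an offer arrives with probability
# beta * s / H and raises the wage by exactly one step.
gain_search = beta * H / H * 1

# All hours learning (l = H, s = 0): the raise is Poisson with mean alpha * H.
mu = alpha * H
gain_learn = mu

# Probability that learning yields a raise of two or more steps,
# which searching alone can never produce: 1 - P(x = 0) - P(x = 1).
p_big_raise = 1 - math.exp(-mu) * (1 + mu)

print(f"expected raise, searching: {gain_search:.2f}")
print(f"expected raise, learning:  {gain_learn:.2f}")
print(f"P(raise >= 2 | learning):  {p_big_raise:.4f}")
```

The expected one-day raises are nearly equal, so the multi-step raises available only through learning are a plausible reason for the policy to prefer learning while many wage levels remain above the current one.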
