## Problem 1: Sample Collection
<br> <div style="text-align: justify"> 
The first step in applying the Policy Evaluation algorithm to this problem is to code its initial conditions. We start with the time steps, the possible sample counts, and the available actions. Because this is a finite, *non-stationary* MDP, each state is a combination of a time step and a number of samples. For example, 1250 samples at 500m of immersion is one state, and 1250 samples at 400m of immersion is a different state.
</div> 

In [1]:
### Time steps of decision
T = [1000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 0]

### Samples
Samples = [0, 250, 500, 750, 1000, 1250, "Succes"]

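### Space of States
# Each state is a tuple (depth, samples). Lowercase loop variables are used
# so that the lists T and Samples defined above are not overwritten.
S_t = []
for s in [0, 250, 500, 750]:
    S_t.append((1000, s))
for t in T[1:]:
    for s in Samples:
        S_t.append((t, s))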

### Actions 
# CM for Calculated Maneuver
# IM for improvised Maneuver 
A = ["CM", "IM"]
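A quick way to sanity-check the state space is to count it. The sketch below rebuilds the same list of `(depth, samples)` tuples with comprehensions; the names `depths`, `labels` and `states` are illustrative and are not used elsewhere in the notebook.

```python
# Depths and sample labels copied from the cell above.
depths = [1000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 0]
labels = [0, 250, 500, 750, 1000, 1250, "Succes"]

# Only four sample counts are enumerated at the initial depth (1000m);
# every later depth is paired with all seven labels.
states = [(1000, s) for s in labels[:4]]
states += [(t, s) for t in depths[1:] for s in labels]

n_states = len(states)        # 4 + 10 * 7
n_unique = len(set(states))   # tuples are hashable, so duplicates would show up here
print(n_states, n_unique)
```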

<br> <div style="text-align: justify"> 
Next, the transition probabilities are coded. As mentioned before, these probabilities capture the stochastic nature of the transitions between states. Because each state contains both a decision step and a number of samples, both pieces of information are taken into account in the transition probabilities. The probabilities also depend on the action performed. Therefore, there are two dictionaries of probabilities, one for Calculated Maneuvers and one for Improvised Maneuvers. At the end, both dictionaries are consolidated into a single dictionary.
</div> 

In [2]:
### Transition probabilities
## Transition probabilities will be defined as dictionaries keyed by (s_t, s_tplus1)
# Transition Probabilities for Calculated Maneuvers
p_CM = {}
for s_t in S_t:
    for s_tplus1 in S_t:
        if s_tplus1[0] == s_t[0] - 100:
            if s_t[1] != "Succes" and s_tplus1[1] != "Succes":
                if s_tplus1[1] == s_t[1] + 250 and s_t[1] < 1250:
                    p_CM[s_t, s_tplus1] = 0.3
                elif s_tplus1[1] == s_t[1] + 500 and s_t[1] < 1000:
                    p_CM[s_t, s_tplus1] = 0.2
                elif s_tplus1[1] == s_t[1] + 750 and s_t[1] < 750:
                    p_CM[s_t, s_tplus1] = 0.1
                elif s_tplus1[1] == s_t[1] and s_t[1] != "Succes":
                    p_CM[s_t, s_tplus1] = 0.4
                else: 
                    p_CM[s_t, s_tplus1] = 0
            elif s_tplus1[1] == "Succes" and s_t[1] == 1000:
                p_CM[s_t, s_tplus1] = 0.3
            elif s_tplus1[1] == "Succes" and s_t[1] == 1250:
                p_CM[s_t, s_tplus1] = 0.6
            elif s_tplus1[1] == "Succes" and s_t[1] == 750:
                p_CM[s_t, s_tplus1] = 0.1
            elif s_tplus1[1] == s_t[1] and s_t[1] == "Succes":
                p_CM[s_t, s_tplus1] = 1
            else: 
                p_CM[s_t, s_tplus1] = 0
        else:
            p_CM[s_t, s_tplus1] = 0
    
# Transition Probabilities for Improvised Maneuvers
p_IM = {}
for s_t in S_t:
    for s_tplus1 in S_t:
        if s_tplus1[0] == s_t[0] - 100:
            if s_t[1] != "Succes" and s_tplus1[1] != "Succes":
                if s_tplus1[1] == s_t[1] + 500 and s_t[1] < 1000:
                    p_IM[s_t, s_tplus1] = 0.5
                elif s_tplus1[1] == s_t[1] and s_t[1] <= 1250:
                    p_IM[s_t, s_tplus1] = 0.5
                else: 
                    p_IM[s_t, s_tplus1] = 0
            elif s_tplus1[1] == "Succes" and (s_t[1] == 1000 or s_t[1] == 1250):
                p_IM[s_t, s_tplus1] = 0.5
            elif s_tplus1[1] == s_t[1] and s_t[1] == "Succes":
                # "Succes" is absorbing, so it keeps probability 1 (same as in p_CM)
                p_IM[s_t, s_tplus1] = 1
            else: 
                p_IM[s_t, s_tplus1] = 0
        else:
            p_IM[s_t, s_tplus1] = 0

# Consolidation of probabilities
pTrans = {"CM":p_CM, "IM": p_IM}
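The branches above can be read as a distribution over the number of samples gained at each stop, capped at the 1500-sample target. The sketch below is one compact way to express that reading and to check that each row sums to 1. The names `GAIN`, `TARGET` and `next_sample_dist` are assumptions introduced for illustration, and the gain tables are read off the branches of the cell above rather than taken from the problem statement.

```python
from collections import defaultdict

# Samples gained per stop and their probabilities, as implied by the branches above.
GAIN = {
    "CM": {0: 0.4, 250: 0.3, 500: 0.2, 750: 0.1},
    "IM": {0: 0.5, 500: 0.5},
}
TARGET = 1500  # reaching this total maps to the absorbing "Succes" label

def next_sample_dist(samples, action):
    """Distribution over the next sample label after one stop."""
    if samples == "Succes":
        return {"Succes": 1.0}
    dist = defaultdict(float)
    for gain, prob in GAIN[action].items():
        total = samples + gain
        dist["Succes" if total >= TARGET else total] += prob
    return dict(dist)

# Every row should sum to 1, whatever the starting label or action.
for label in [0, 250, 500, 750, 1000, 1250, "Succes"]:
    for action in GAIN:
        assert abs(sum(next_sample_dist(label, action).values()) - 1.0) < 1e-12
```

A row-sum check of this kind is a cheap way to catch a mistyped probability in the hand-written branches.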

<br> <div style="text-align: justify"> 
Finally, to complete the basic definition of the problem, the rewards must be defined. As stated in the problem definition, the only reward is obtained if the mission has succeeded by the end of the immersion. That is, a reward of 1 is received only on reaching state (0, "Succes").
</div> 

In [3]:
### Rewards
# Reward r(s_t)
r = {}
for s_t in S_t:
    r[s_t] = 0

r[(0,"Succes")] = 1

**Policy $\pi$ to evaluate**
<br> <div style="text-align: justify"> 
To apply Policy Evaluation, an arbitrary policy is required. Since the exercise does not provide one, a *stochastic* policy is defined as follows. For the first 500 meters of immersion (stops at 1000m, 900m, 800m, 700m, 600m), only calculated maneuvers are performed, as these are safer for the ship and its crew. For the following 3 stops (500m, 400m, 300m), if fewer than 750 samples have been collected, there is a 60% chance of applying an improvised maneuver and a 40% chance of applying a calculated maneuver. If 750 or more samples have been collected, the probabilities are reversed. For the last 2 stops (200m, 100m), if 0 samples have been collected, a calculated maneuver is performed. If between 250 and 750 samples have been collected, an improvised maneuver is performed with probability 0.7 and a calculated maneuver with probability 0.3. If 1000 or more samples have been collected, a calculated maneuver is performed with probability 0.9 and an improvised one with probability 0.1. Whenever the mission has already reached 1500 samples (**success**), a calculated maneuver is performed at all remaining stops.
</div> 

In [4]:
### Arbitrary Policy π
## As the policy is stated as probabilities, this will also be defined as a dictionary
pi = {}

## At every step, once success is achieved, only Calculated Maneuvers will be performed
# First 500 m of immersion: only calculated maneuvers will be performed
for s_t in S_t:
    if s_t[0] in [1000, 900, 800, 700, 600]:
        for a in A:
            if a == "CM":
                pi[s_t, a] = 1
            else:
                pi[s_t, a] = 0
                
# For stops in 500m, 400m and 300m: If s_t under 750, 60% for IM and 40% for CM.
#                                   If s_t equal or over 750, 40% for IM and 60% for CM
for s_t in S_t:
    if s_t[0] in [500, 400, 300]:
        for a in A:
            if s_t[1] == "Succes" and a == "CM":
                pi[s_t, a] = 1
            elif s_t[1] != "Succes":
                if s_t[1] < 750 and a == "CM":
                    pi[s_t, a] = 0.4
                elif s_t[1] < 750 and a == "IM":
                    pi[s_t, a] = 0.6
                elif s_t[1] >= 750 and a == "CM":
                    pi[s_t, a] = 0.6
                elif s_t[1] >= 750 and a == "IM":
                    pi[s_t, a] = 0.4
            else:
                pi[s_t, a] = 0

# For last 200m of immersion: If s_t = 0 samples, a CM will be performed
#                             If 250 <= s_t < 1000, 70% for IM and 30% for CM
#                             If s_t >= 1000, 10% for IM and 90% for CM
for s_t in S_t:
    if s_t[0] in [200, 100]:
        for a in A:
            if s_t[1] == "Succes" and a == "CM":
                pi[s_t, a] = 1
            elif s_t[1] != "Succes":
                if s_t[1] == 0 and a == "CM":
                    pi[s_t, a] = 1
                elif s_t[1] < 1000 and a == "IM":
                    pi[s_t, a] = 0.7
                elif s_t[1] < 1000 and a == "CM":
                    pi[s_t, a] = 0.3
                elif s_t[1] >= 1000 and a == "IM":
                    pi[s_t, a] = 0.1
                elif s_t[1] >= 1000 and a == "CM":
                    pi[s_t, a] = 0.9
            else:
                pi[s_t, a] = 0
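The same policy can also be written as a single function of the state, which makes the three depth bands easier to read and to test. This is a sketch only: the function name `policy` is an assumption, and it is meant for decision stops from 1000m down to 100m (no action is chosen at 0m).

```python
def policy(depth, samples):
    """Return {action: probability} for a decision stop (depth from 1000 down to 100)."""
    # After success, and during the first 500 meters, only calculated maneuvers.
    if samples == "Succes" or depth >= 600:
        return {"CM": 1.0, "IM": 0.0}
    if depth >= 300:              # stops at 500m, 400m, 300m
        p_im = 0.6 if samples < 750 else 0.4
    elif samples == 0:            # stops at 200m, 100m with nothing collected
        p_im = 0.0
    elif samples < 1000:          # 250, 500 or 750 samples
        p_im = 0.7
    else:                         # 1000 or 1250 samples
        p_im = 0.1
    return {"CM": 1.0 - p_im, "IM": p_im}

# Action probabilities should sum to 1 in every decision state.
for depth in range(1000, 0, -100):
    for samples in [0, 250, 500, 750, 1000, 1250, "Succes"]:
        assert abs(sum(policy(depth, samples).values()) - 1.0) < 1e-12
```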

<div style="text-align: justify"> 
Since policy $\pi$ assigns a probability to each action in every state at every decision step, Policy Evaluation can now be implemented. The threshold $\theta$ is set to 0.005. A dictionary $V(s_t)$ stores the value of the policy for each state. All values are arbitrarily initialized to 0.5 for the non-terminal states and to 0 for the terminal states (when the exploration has ended at 0m). The value of $\delta$ is initialized to 10 so that the loop can start; at the beginning of each sweep over the states, $\delta$ is reset to 0.
</div> 
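For reference, the update applied to each non-terminal state can be written as the Bellman expectation backup below. The notation is an assumption chosen to mirror the code: $\pi(a \mid s_t)$ stands for `pi[s_t, a]`, $p(s_{t+1} \mid s_t, a)$ for `pTrans[a][s_t, s_tplus1]`, and $r(s_{t+1})$ for `r[s_tplus1]`. No discount factor appears because the horizon is finite.

$$V(s_t) \leftarrow \sum_{a \in A} \pi(a \mid s_t) \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a)\,\bigl[r(s_{t+1}) + V(s_{t+1})\bigr]$$

After each update, $\delta \leftarrow \max\bigl(\delta, |v - V(s_t)|\bigr)$, where $v$ is the value held before the update, and the sweeps stop once $\delta \le \theta$.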

In [5]:
### Policy evaluation
# Threshold theta indicates the desired accuracy of the value found
theta = 0.005

# Defining dictionary V with the value of the policy for each state, initialized to 0.5 (arbitrarily)
V = {}
for s_t in S_t:
    V[s_t] = 0.5
# Terminal states (depth 0) have value 0
for samples in [0, 250, 500, 750, 1000, 1250, "Succes"]:
    V[(0,samples)] = 0

# Delta is the largest change in a sweep; initialized above theta so the loop starts
delta = 10

# Iterations
while delta > theta:
    delta = 0
    for s_t in S_t:
        if s_t[0] != 0:
            v = V[s_t]
            value = 0
            for a in A:
                value1 = 0
                for s_tplus1 in S_t:
                    value1 += pTrans[a][s_t, s_tplus1] * (r[s_tplus1] + V[s_tplus1])
                value += pi[s_t,a] * value1   
            V[s_t] = value
            delta = max(delta, abs(v - V[s_t]))
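Because every transition with positive probability moves exactly one stop forward (100m in the dictionaries above), the same values can also be obtained in a single pass that starts at the final stop and works back toward the start. The sketch below uses backward induction, which is a different technique from the iterative loop above, and keeps the notebook's dictionary layout. The function name `backward_induction`, the `step` parameter and the toy numbers are assumptions for illustration.

```python
def backward_induction(S_t, A, pTrans, pi, r, step=100):
    """One exact pass over states (depth, samples), final stop (depth 0) first."""
    V = {}
    # Sorting on depth alone avoids comparing numeric and string sample labels.
    for s in sorted(S_t, key=lambda state: state[0]):
        if s[0] == 0:                      # terminal stop: no further decisions
            V[s] = 0.0
            continue
        successors = [s2 for s2 in S_t if s2[0] == s[0] - step]
        V[s] = sum(
            pi[s, a] * sum(pTrans[a][s, s2] * (r[s2] + V[s2]) for s2 in successors)
            for a in A
        )
    return V

# Toy instance (hypothetical numbers): one stop left and a single action.
toy_states = [(100, 0), (100, "Succes"), (0, 0), (0, "Succes")]
toy_actions = ["CM"]
toy_p = {"CM": {
    ((100, 0), (0, 0)): 0.75,
    ((100, 0), (0, "Succes")): 0.25,
    ((100, "Succes"), (0, 0)): 0.0,
    ((100, "Succes"), (0, "Succes")): 1.0,
}}
toy_pi = {(s, "CM"): 1.0 for s in toy_states}
toy_r = {s: 0.0 for s in toy_states}
toy_r[(0, "Succes")] = 1.0

toy_V = backward_induction(toy_states, toy_actions, toy_p, toy_pi, toy_r)
print(toy_V[(100, 0)], toy_V[(100, "Succes")])
```

Calling `backward_induction(S_t, A, pTrans, pi, r)` with the notebook's own dictionaries is expected to agree with the values produced by the loop, since both apply the same backup to the same layered state space.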

<div style="text-align: justify"> 
At this point, the Policy Evaluation is complete. Given the way this problem has been constructed, the value of the policy in a state is the probability that the mission succeeds from that state. For example, the overall probability of success can be found by looking at the value for the state (1000 meters, 0 samples), when the immersion has just started and no samples have been gathered. A specific scenario can also be analyzed, such as the one where, in the middle of the immersion, only 250 samples have been gathered. Of course, all states where success has already been achieved have a probability of success of 1.
</div> 

In [6]:
print(f"The probability of success at the beginning of the immersion is {round(V[(1000,0)],3)}")
print(f"The probability of success at 500m of immersion and 250 samples is {round(V[(500,250)],3)}")
print(f"The probability of success at 700m of immersion and already 1500 samples is {round(V[(700,'Succes')],3)}")

The probability of success at the beginning of the immersion is 0.926
The probability of success at 500m of immersion and 250 samples is 0.564
The probability of success at 700m of immersion and already 1500 samples is 1.0


## Problem 2: Aqueduct
<br> <div style="text-align: justify"> 
The same iterative process is applied to this problem, with some differences due to its characteristics. First, this is an *infinite-horizon* MDP, meaning the time steps carry no useful information. Consequently, the states contain only the liters of trash. Second, a discount factor is applied, as stated in the description of the algorithm. With this in mind, the basics of the problem are coded in the next cell: states, transition probabilities, costs, and the discount factor.
</div> 

In [7]:
### States
S = list(range(3,11))

### Decisions
A = ["Send", "Don't Send"]

### Transition probabilities
# Send the cleaning team
p_SEND = {}
for s_t in S:
    for s_tplus1 in S:
        if s_tplus1 == 3:
            p_SEND[s_t, s_tplus1] = 1
        else:
            p_SEND[s_t, s_tplus1] = 0

# Don't send the cleaning team: the next state is uniform over s_t, ..., 10
p_DONTS = {}
for s_t in S:
    for s_tplus1 in S:
        if s_tplus1 >= s_t:
            p_DONTS[s_t, s_tplus1] = 1/(11-s_t)
        else:
            p_DONTS[s_t, s_tplus1] = 0

# Consolidation of probabilities
pTrans = {"Send": p_SEND, "Don't Send": p_DONTS}

### Costs
c = {}
for s_t in S:
    c[s_t, "Send"] = 12 + s_t * 0.01 * 200
    c[s_t, "Don't Send"] =  s_t * 0.01 * 200

### Discount Factor 
gamma = 0.95
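As a quick check on the "Don't Send" kernel, each row should sum to 1: from state `s` there are `11 - s` reachable states (`s` through 10), each with probability `1/(11 - s)`. The sketch below rebuilds that kernel with a dictionary comprehension; the names `p_dont` and `row_sums` are illustrative only.

```python
S = list(range(3, 11))  # liters of trash, 3 through 10

# Uniform over the current level and every higher level, zero below.
p_dont = {(s, s2): (1 / (11 - s) if s2 >= s else 0.0) for s in S for s2 in S}

# Sum of outgoing probabilities for each starting state.
row_sums = {s: sum(p_dont[s, s2] for s2 in S) for s in S}
assert all(abs(total - 1.0) < 1e-12 for total in row_sums.values())
```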

**Policy $\pi$ to evaluate**
<div style="text-align: justify"> 
As stated in the instructions, the policy we are interested in evaluating sends the cleaning team ONLY when the trash has reached the maximum level of 10 liters. In any other scenario, the team is not sent.
</div>

In [8]:
### Policy π to evaluate (send only at 10 liters)
pi = {}
for s_t in S:
    if s_t == 10:
        pi[s_t, "Send"] = 1
        pi[s_t, "Don't Send"] = 0
    else: 
        pi[s_t, "Send"] = 0
        pi[s_t, "Don't Send"] = 1

<div style="text-align: justify"> 
With all components of the MDP defined, the iterative policy evaluation can be performed. The $\theta$ threshold is the same as the one used in the previous problem. The initial value estimate for every state is arbitrarily set to 300 million pesos. There are no terminal states in this problem.
</div>
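For reference, the discounted backup applied in each sweep can be written as follows. The notation is an assumption chosen to mirror the code: $c(s, a)$ stands for `c[s_t, a]`, $\gamma$ for `gamma`, and $p(s' \mid s, a)$ for `pTrans[a][s_t, s_tplus1]`.

$$V(s) \leftarrow \sum_{a} \pi(a \mid s)\Bigl[c(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\,V(s')\Bigr]$$

The code places $c(s, a)$ inside the sum over $s'$; this gives the same result because the transition probabilities of each action sum to 1.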

In [9]:
### Policy Evaluation
# Threshold theta indicates the desired accuracy of the value found
theta = 0.005

# Defining dictionary V with the value of the policy for each state, initialized to 300 (arbitrarily)
V = {}
for s_t in S:
    V[s_t] = 300

# Delta is the largest change in a sweep; initialized above theta so the loop starts
delta = 10

# Iterations
while delta > theta:
    delta = 0
    for s_t in S:
        v = V[s_t]
        value = 0
        for a in A:
            value1 = 0
            for s_tplus1 in S:
                value1 += pTrans[a][s_t, s_tplus1] * (c[s_t,a] + gamma * V[s_tplus1])
            value += pi[s_t,a] * value1
        V[s_t] = value
        delta = max(delta, abs(v - V[s_t]))

The Iterative Policy Evaluation is done. With the values found, it is possible to determine the expected cost associated with any of the states when following this policy.

In [10]:
print(f"The expected cost when there are 4 liters of trash is ${round(V[4], 3)} million pesos")
print(f"The expected cost when there are 7 liters of trash is ${round(V[7], 3)} million pesos")
print(f"The expected cost when there are 10 liters of trash is ${round(V[10], 3)} million pesos")

The expected cost when there are 4 liters of trash is $321.864 million pesos
The expected cost when there are 7 liters of trash is $331.6 million pesos
The expected cost when there are 10 liters of trash is $334.518 million pesos
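As a cross-check on the iterative estimates, the value of this particular policy can also be computed exactly. Under "send only at 10 liters", each state below 10 moves only to itself or higher levels, and state 10 returns to 3, so every value can be written as `a[s] + b[s] * V[3]` and solved by back-substitution. This is a sketch of a direct linear solve, not the iterative algorithm used above; the names `stage_cost`, `solve_threshold_policy` and `bellman_residual` are assumptions, and the cost and transition rules are copied from the cells above.

```python
GAMMA = 0.95
STATES = list(range(3, 11))          # liters of trash, as in the notebook

def stage_cost(s, send):
    # Mirrors c[s_t, a]: s * 0.01 * 200, plus 12 when the team is sent.
    return (12 if send else 0) + s * 0.01 * 200

def solve_threshold_policy():
    """Exact values for 'send only at 10 liters', writing V[s] = a[s] + b[s] * V[3]."""
    a = {10: stage_cost(10, True)}   # V[10] = cost + GAMMA * V[3]
    b = {10: GAMMA}
    for s in range(9, 2, -1):        # back-substitute from 9 down to 3
        w = GAMMA / (11 - s)         # discounted uniform weight over s..10
        higher = range(s + 1, 11)
        a[s] = (stage_cost(s, False) + w * sum(a[k] for k in higher)) / (1 - w)
        b[s] = (w * sum(b[k] for k in higher)) / (1 - w)
    v3 = a[3] / (1 - b[3])           # close the loop: V[3] = a[3] + b[3] * V[3]
    return {s: a[s] + b[s] * v3 for s in STATES}

def bellman_residual(V):
    """Largest gap between V and one more application of the policy backup."""
    worst = 0.0
    for s in STATES:
        if s == 10:
            rhs = stage_cost(10, True) + GAMMA * V[3]
        else:
            nxt = range(s, 11)
            rhs = stage_cost(s, False) + GAMMA * sum(V[k] for k in nxt) / len(nxt)
        worst = max(worst, abs(V[s] - rhs))
    return worst

V_exact = solve_threshold_policy()
print(bellman_residual(V_exact))     # should be at floating-point noise level
```

Comparing `V_exact` with the dictionary `V` from the loop gives a sense of how far the iterative estimates are from the fixed point when the sweeps stop at $\theta = 0.005$.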
