# Introduction

The main goal of this notebook is to get used to the computational environment provided by [Colab](https://colab.research.google.com/). This environment provides free access to computational resources suitable for learning and experimentation.


## Initialization



Before starting, we need to run the following two initialization cells. 

To run a cell, click on it and then either press Shift-Enter or click the "play" button on the toolbar at the top of the menu.

In [None]:
import numpy as np

The cell above imports the module $\mathtt{numpy}$, which contains highly efficient methods for numerical computation involving arrays.

The next cell defines a class $\mathtt{FiniteMDP}$, which is a simple implementation of a finite Markov Decision Process. It is not meant for "production" use, so there is no error checking and no concern for efficiency.

There is a lot of code here, but don't be concerned with it.

It is recommended that you collapse the **Initialization** section by clicking on the little triangle next to the section heading.

In [None]:
import numpy as np
import warnings

class FiniteMDPWarning(UserWarning):
    pass

class FiniteMDP:
    def __init__(self, states, actions, P, R, seed=None):
        self.states = states.copy()
        self.size = len(self.states)
        self.state_index = {s: i for i, s in enumerate(states)}
        self.actions = actions.copy()        
        self.P = {
            action:  np.array(M, dtype=np.float64, copy=True)
            for action, M in P.items()
        }

        self.CP = {}
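        for action, P_a in self.P.items():
            # Cumulative row sums, used by step() to sample the next state
            self.CP[action] = np.cumsum(P_a, axis=1)
            # Guard against round-off: the last cumulative value must be exactly 1
            self.CP[action][:, -1] = 1.0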

        self.R = {
            action:  np.array(M, dtype=np.float64, copy=True)
            for action, M in R.items()
        }
        
        self.P[None] = np.eye(self.size, dtype=np.float64)
        self.R[None] = np.zeros(shape=(self.size, self.size), dtype=np.float64)
        self.terminal_indices = {i for i, state in enumerate(states) if not actions[state]}

        self.rng = np.random.default_rng(seed)
        self.current_state_index = None
        self.terminated = True        

    def _policy_value_function_lu(self, indexed_policy, gamma):
        PP = np.array([self.P[indexed_policy[i]][i] for i in range(self.size)], dtype=np.float64)
        RR = np.array([self.R[indexed_policy[i]][i] for i in range(self.size)], dtype=np.float64)
        b = np.sum(PP * RR, axis=1)
        A = np.eye(self.size) - gamma * PP
        for i in range(self.size):
            if i in self.terminal_indices:
                A[i][i] = 1.0
                b[i] = 0
        return np.linalg.solve(A, b)

    def _policy_value_function_jacobi(self, indexed_policy, gamma, stop_tol, max_iterations, start=None):
        if start is None:
            V = np.zeros(self.size, dtype=np.float64)
        else:
            V = start.copy()
        PP = np.array([self.P[indexed_policy[i]][i] for i in range(self.size)], dtype=np.float64)
        RR = np.array([self.R[indexed_policy[i]][i] for i in range(self.size)], dtype=np.float64)
        b = np.sum(PP * RR, axis=1)
        for _ in range(max_iterations):
            Vnew = b + gamma * PP @ V
            sup_norm =  np.max(np.abs(V - Vnew))
            if stop_tol is not None and sup_norm < stop_tol:
                break
            V = Vnew
        else:
            if stop_tol is not None:
                warnings.warn(
                    f'Maximum number of iterations reached in policy_value_function. Final delta: {sup_norm:5.3e}',
                    FiniteMDPWarning,
                    stacklevel=3,
                )   
        return V
        
    def _policy_value_function_gs(self, indexed_policy, gamma, stop_tol, max_iterations, start=None):
        if start is None:
            V = np.zeros(self.size, dtype=np.float64)
        else:
            V = start.copy()
        PP = np.array([self.P[indexed_policy[i]][i] for i in range(self.size)], dtype=np.float64)
        RR = np.array([self.R[indexed_policy[i]][i] for i in range(self.size)], dtype=np.float64)
        b = np.sum(PP * RR, axis=1)
        for _ in range(max_iterations):
            sup_norm = 0.0
            for i in range(self.size):
                new_value = b[i] + gamma * PP[i] @ V
                sup_norm = max(sup_norm, abs(new_value - V[i]))
                V[i] = new_value
            if stop_tol is not None and sup_norm < stop_tol:
                break
        else:
            if stop_tol is not None:
                warnings.warn(
                    f'Maximum number of iterations reached in policy_value_function. Final delta: {sup_norm:5.3e}',
                    FiniteMDPWarning,
                    stacklevel=3,
                )   
        return V
        
    def _policy_value_function(self, policy, gamma, method, stop_tol, max_iterations, start, return_type):
        indexed_policy = self.size * [None]
        for state, action in policy.items():
            indexed_policy[self.state_index[state]] = action
        match method:
            case 'lu':
                VV = self._policy_value_function_lu(indexed_policy, gamma)
            case 'jacobi':
                VV = self._policy_value_function_jacobi(indexed_policy, gamma, stop_tol, max_iterations, start)
            case 'gs':
                VV = self._policy_value_function_gs(indexed_policy, gamma, stop_tol, max_iterations, start)
            case _:
                raise ValueError("method should be 'lu', 'jacobi' or 'gs'")

        match return_type:
            case 'dict':
                return {state: VV[self.state_index[state]] for state in self.states}
            case 'array':
                return VV
            case _:
                raise ValueError("return_type must be 'dict' or 'array'")

    def policy_value_function(self, policy, gamma=1, method='lu', stop_tol=1E-8, max_iterations=100, start=None, return_type='dict'):
        if start is not None:
            start = np.array([start[s] for s in self.states])
        return self._policy_value_function(policy, gamma, method, stop_tol, max_iterations, start, return_type)

    def value_iteration(self, gamma=1, stop_tol=1E-8, max_iterations=100, start=None):
        if start is None:
            V = np.zeros(self.size, dtype=np.float64)
        else:
            V = np.array([start[s] for s in self.states], dtype=np.float64)
            for i in range(self.size):
                if i in self.terminal_indices:
                    V[i] = 0.0
        for _ in range(max_iterations):
            max_delta = 0.0
            for i, state in enumerate(self.states):
                if i in self.terminal_indices:
                    continue
                new_value = -np.inf
                for action in self.actions[state]:
                    new_value = max(new_value, 
                                    sum(self.P[action][i, j] * (self.R[action][i, j] + gamma * V[j]) for j in range(self.size)))
                if stop_tol is not None:
                    max_delta = max(max_delta, abs(V[i] - new_value))
                V[i] = new_value
            if stop_tol is not None and max_delta < stop_tol:
                break
        else:
            # Warn only when a tolerance was requested and not reached
            if stop_tol is not None:
                warnings.warn(
                    f'Maximum number of iterations reached in value_iteration. Final delta: {max_delta:5.3e}',
                    FiniteMDPWarning,
                    stacklevel=2,
                )
        # Compute optimal policy
        policy = {}
        for i,state in enumerate(self.states):
            if i in self.terminal_indices:
                policy[state] = None
                continue
            max_value = -np.inf
            max_action = None
            for action in self.actions[state]:
                new_value = sum(self.P[action][i, j] * (self.R[action][i, j] + gamma * V[j]) for j in range(self.size))
                if new_value > max_value:
                    max_value = new_value
                    max_action = action
            policy[state] = max_action

        return {state: V[i] for i, state in enumerate(self.states)}, policy

    def policy_iteration(self, gamma=1, method='lu', stop_tol=1E-8, max_iterations=100, relaxations=20, start=None):
        if start is None:
            V = np.zeros(self.size, dtype=np.float64)
        else:
            V = np.array([start[s] for s in self.states], dtype=np.float64)
            for i in range(self.size):
                if i in self.terminal_indices:
                    V[i] = 0.0

        # Initialize random policy
        policy = {}
        for i, state in enumerate(self.states):
            if i in self.terminal_indices:
                policy[state] = None
                continue
            policy[state] = self.rng.choice(self.actions[state])

        # Compute approximate value of current policy
        V = self._policy_value_function(policy, gamma, method, None, relaxations, None, 'array')

        for _ in range(max_iterations):
            # Compute improved policy
            new_policy = {}
            for i, state in enumerate(self.states):
                if i in self.terminal_indices:
                    new_policy[state] = None
                    continue
                max_value = -np.inf
                max_action = None
                for action in self.actions[state]:
                    new_value = sum(self.P[action][i, j] * (self.R[action][i, j] + gamma * V[j]) for j in range(self.size))
                    if new_value > max_value:
                        max_value = new_value
                        max_action = action
                new_policy[state] = max_action

            # Stop criterion 1: no change in optimal policy
            if new_policy == policy:
                break
            # Compute approximate value of improved policy
            Vnew = self._policy_value_function(new_policy, gamma, method, None, relaxations, V, 'array')
            # Stop criterion 2: Change in V smaller than stop_tol
            delta = np.max(np.abs(Vnew - V))
            if delta < stop_tol:
                break
            # Update for next iteration
            V = Vnew
            policy = new_policy
        else:
            warnings.warn(
                f'Maximum number of iterations reached in policy_iteration. Final delta: {delta:5.3e}',
                FiniteMDPWarning,
                stacklevel=2,
            )
        # Compute higher precision approximation for value function of final policy
        V = self._policy_value_function(policy, gamma, method, stop_tol, max_iterations, V, 'dict')

        return V, policy

    def reset(self, initial_state=None):
        if initial_state is None:
            while True:
                index = self.rng.integers(0, len(self.states))
                if index not in self.terminal_indices:
                    break
            self.current_state_index = index
        else:
            self.current_state_index = self.states.index(initial_state)
        self.terminated = False
        return self.states[self.current_state_index]

    def step(self, action):
        if self.current_state_index is None:
            raise RuntimeError('MDP not initialized, call reset() before calling step() for the first time')
        if self.terminated:
            raise RuntimeError('run terminated, call reset() to start a new run')

        u = self.rng.random()
        next_state_index = np.searchsorted(self.CP[action][self.current_state_index], u, side='right')
        next_state =  self.states[next_state_index]
        self.terminated = next_state_index in self.terminal_indices
        reward = self.R[action][self.current_state_index, next_state_index]
        self.current_state_index = next_state_index
        state = self.states[self.current_state_index]
        
        return (state, reward, self.terminated)
        
    def sarsa(
        self,
        gamma=1.0,
        alpha=0.1,
        epsilon=0.1,
        n_episodes=100_000,
        max_steps_per_episode=10_000,
        seed=None,
    ):
        """
        Tabular SARSA(0) consistent with (P, R).

        - Rewards on transitions into terminal states are allowed.
        - Terminal states have no actions (actions[state] is None).
        - No bootstrapping from terminal states.
        """
        rng = np.random.default_rng(seed)

        # ------------------------------------------------------------
        # Initialize Q(s,a) only for non-terminal states
        # ------------------------------------------------------------
        Q = {}
        for s in self.states:
            if self.actions[s] is None:
                continue
            for a in self.actions[s]:
                Q[(s, a)] = 0.0

        # ------------------------------------------------------------
        # Epsilon-greedy policy (greedy over Q)
        # ------------------------------------------------------------
        def epsilon_greedy_action(state):
            actions = self.actions[state]
            if actions is None:
                return None

            if rng.random() < epsilon:
                return rng.choice(actions)

            q_vals = [Q[(state, a)] for a in actions]
            return actions[int(np.argmax(q_vals))]

        # ------------------------------------------------------------
        # SARSA learning loop
        # ------------------------------------------------------------
        for _ in range(n_episodes):

            state = self.reset()
            state_index = self.state_index[state]

            if state_index in self.terminal_indices:
                continue

            action = epsilon_greedy_action(state)

            for _ in range(max_steps_per_episode):

                i = self.state_index[state]

                next_state, _, terminated = self.step(action)
                j = self.state_index[next_state]

                # IMPORTANT: reward always comes from R
                reward = self.R[action][i, j]

                if terminated:
                    # no bootstrap from terminal states
                    Q[(state, action)] += alpha * (
                        reward - Q[(state, action)]
                    )
                    break

                next_action = epsilon_greedy_action(next_state)

                Q[(state, action)] += alpha * (
                    reward
                    + gamma * Q[(next_state, next_action)]
                    - Q[(state, action)]
                )

                state = next_state
                action = next_action

        # ------------------------------------------------------------
        # Extract greedy policy and value function
        # ------------------------------------------------------------
        policy = {}
        V = {}

        for s in self.states:
            idx = self.state_index[s]

            if idx in self.terminal_indices:
                policy[s] = None
                V[s] = 0.0
                continue

            actions = self.actions[s]
            q_vals = [(Q[(s, a)], a) for a in actions]
            best_q, best_a = max(q_vals, key=lambda x: x[0])

            policy[s] = best_a
            V[s] = best_q

        return Q, policy, V

# Example - Recycling Robot

In this first example, we use the Recycling Robot example from the book [*Reinforcement Learning: An Introduction* by Sutton and Barto](https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutton.pdf).

All you have to do is to run the cells and observe the results.

The transition probability matrices for each action are:

$$
P^{\mathtt{search}}=
\begin{bmatrix}
\alpha & 1-\alpha\\
1-\beta & \beta
\end{bmatrix}
\qquad
P^{\mathtt{wait}}=
\begin{bmatrix}
1 & 0\\
0 & 1
\end{bmatrix}
\qquad
P^{\mathtt{recharge}}=
\begin{bmatrix}
- & -\\
1 & 0
\end{bmatrix}
$$

And the reward matrices are:

$$
R^{\mathtt{search}}=
\begin{bmatrix}
r_{\mathtt{search}} & r_{\mathtt{search}}\\
r_{\mathtt{empty}} & r_{\mathtt{search}}
\end{bmatrix}
\quad
R^{\mathtt{wait}}=
\begin{bmatrix}
r_{\mathtt{wait}} & - \\
- & r_{\mathtt{wait}}
\end{bmatrix}
\quad
R^{\mathtt{recharge}}=
\begin{bmatrix}
- & -\\
0 & -
\end{bmatrix}
$$

The dashes $-$ mark entries that are not meaningful, because they correspond either to actions that are not allowed in a given state or to transitions that have probability zero. They can have any value and, in practice, are simply set to zero.

To represent the model computationally, we first define the model parameters:

In [None]:
alpha = 0.8
beta = 0.3
rsearch = 15
rwait = 10
rempty = -3.0
print(f'Model parameters:\n{alpha=}, {beta=}, {rsearch=}, {rwait=}, {rempty=}')

Let's now define the states and admissible actions:

In [None]:
states = ['high', 'low']
actions = {
    'high': ['search', 'wait'],
    'low': ['search', 'wait', 'recharge']
}
print(f"Admissible actions for state 'high': {actions['high']}\n"
      f"Admissible actions for state 'low': {actions['low']}")

Next, we define the transition probability and reward matrices for each action:

In [None]:
P = {}
P['search'] = np.array([[alpha, 1 - alpha], [1 - beta, beta]], dtype=np.float64)
P['wait'] = np.array([[1, 0], [0, 1]], dtype=np.float64)
P['recharge'] = np.array([[0, 0], [1, 0]], dtype=np.float64)

for key in P.keys():
    print(f'Transition probability matrix for action {key}:')
    print(P[key])

In [None]:
R = {}
R['search'] = np.array([[rsearch, rsearch], [rempty, rsearch]], dtype=np.float64)
R['wait'] = np.array([[rwait, 0], [0, rwait]], dtype=np.float64)
R['recharge'] = np.array([[0, 0], [0, 0]], dtype=np.float64)

for key in R.keys():
    print(f'Reward matrix for action {key}:')
    print(R[key])

We are now ready to define the `FiniteMDP` object:

In [None]:
rr_model = FiniteMDP(states, actions, P, R)
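
Since `FiniteMDP` does no error checking, it can be worth verifying by hand that every row belonging to an admissible (state, action) pair is a probability distribution. The sketch below is not part of the class: the helper name `check_rows` is hypothetical, and the matrices are re-declared with the parameter values used above so that the cell runs on its own.

```python
import numpy as np

alpha, beta = 0.8, 0.3
states = ['high', 'low']
actions = {'high': ['search', 'wait'], 'low': ['search', 'wait', 'recharge']}
P = {
    'search': np.array([[alpha, 1 - alpha], [1 - beta, beta]]),
    'wait': np.array([[1.0, 0.0], [0.0, 1.0]]),
    'recharge': np.array([[0.0, 0.0], [1.0, 0.0]]),
}

def check_rows(states, actions, P, tol=1e-12):
    """Hypothetical helper: list the admissible (state, action) pairs
    whose row of P is not a probability distribution."""
    bad = []
    for i, state in enumerate(states):
        for action in actions[state]:
            row = P[action][i]
            if np.any(row < 0) or abs(row.sum() - 1.0) > tol:
                bad.append((state, action))
    return bad

problems = check_rows(states, actions, P)
print(problems)
```

Rows for actions that are not admissible (such as `recharge` in state `high`) are skipped on purpose, since their entries are placeholders.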

## Simulating the chain

Let's now simulate the chain. We start by defining a random number generator, to generate random actions:

In [None]:
rng = np.random.default_rng(77)

The next cell simulates $\mathtt{n\_steps}$ steps of the chain (recall that this is a continuing task).

In [None]:
current_state = rr_model.reset()
n_steps = 20
gamma = 0.9
discount = 1.0
total_return = 0.0
print(f'Initial state: {current_state}')
for i in range(n_steps):
    action = rng.choice(actions[current_state])
    next_state, reward, terminated = rr_model.step(action)
    total_return += discount * reward
    discount *= gamma
    current_state = next_state
    print(f'Step {i + 1:2d}: action={action:>9}, state={next_state:>5}, reward={reward:6.2f}, return={total_return:8.3f}, terminated={terminated}')
    if terminated:
        break
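
For reference, the quantity accumulated in `total_return` is the discounted return truncated after $n=\mathtt{n\_steps}$ steps. Writing $r_{t+1}$ for the reward received on the transition made at step $t$ (a symbol introduced here only for illustration), the loop computes:

$$
G_{n}=\sum_{t=0}^{n-1}\gamma^{t}\,r_{t+1}
$$

In the code, the variable `discount` holds $\gamma^{t}$ at the start of step $t$ and is multiplied by $\gamma$ after each reward is added.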

## Computation of State Value Function for a Policy

We now turn to the problem of computing the state value function of a policy. In this case, the number of deterministic policies is small, so we just enumerate all policies in a list and compute the value function using different methods as a check.

Computing the value function associated with a policy is called *evaluating* the policy and is a key step in many RL algorithms.
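
As a reminder of what is being computed, let $P^{a}_{ij}$ be the probability of moving from state $i$ to state $j$ under action $a$, and $R^{a}_{ij}$ the corresponding reward (the index notation is introduced here for illustration). For a deterministic policy $\pi$ and discount factor $\gamma$, the value function satisfies the Bellman equation:

$$
V^{\pi}(i)=\sum_{j}P^{\pi(i)}_{ij}\left(R^{\pi(i)}_{ij}+\gamma V^{\pi}(j)\right)
$$

This is a linear system in the unknowns $V^{\pi}(i)$, which is why it can be solved either directly or by iterative methods.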

In [None]:
rr_policies = [
    {'high': 'search', 'low': 'search'},
    {'high': 'search', 'low': 'wait'},
    {'high': 'search', 'low': 'recharge'},
    {'high': 'wait', 'low': 'search'},
    {'high': 'wait', 'low': 'wait'},
    {'high': 'wait', 'low': 'recharge'},
]

### Simulation


In the next code cell, for each policy, we compute the returns for $\mathtt{n\_trials}$ runs of the Markov chain. The runs are truncated at $\mathtt{n\_steps}$ steps (we need to truncate because this is a continuing task).

(A more efficient version of this code would use a "first visit" strategy in each run.)

In [None]:
gamma = 0.9
n_steps = 1000
n_trials = 200

for policy in rr_policies:
    V = {'high': 0.0, 'low': 0.0}
    for state in ['high', 'low']:
        for n in range(n_trials):
            discount = 1.0
            current_state = rr_model.reset(initial_state=state)
            total_return = 0.0
            for i in range(n_steps):
                action = policy[current_state]
                next_state, reward, terminated = rr_model.step(action)
                total_return += discount * reward
                discount *= gamma
                current_state = next_state
                if terminated:
                    break
            V[state] += 1 / (n + 1) * (total_return - V[state])
    print(f"Policy: pi(high)={policy['high']}, pi(low)={policy['low']}. "
          f"Value function: V(high)={V['high']:8.5f}, V(low)={V['low']:8.5f}")
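
The update `V[state] += 1 / (n + 1) * (total_return - V[state])` in the cell above is a running average: after each run, `V[state]` equals the arithmetic mean of the returns observed so far, without storing them. A minimal check of this identity follows; the return values are made up for illustration and are not produced by the model.

```python
import numpy as np

# Illustrative returns only, not simulation output
returns = [12.0, 7.5, 9.0, 14.25, 10.0]

estimate = 0.0
for n, g in enumerate(returns):
    # Same incremental update as in the simulation cell
    estimate += 1 / (n + 1) * (g - estimate)

batch_mean = float(np.mean(returns))
print(estimate, batch_mean)
```

The two printed numbers should agree up to floating-point round-off.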

### LU Decomposition

In [None]:
for rr_policy in rr_policies:
    V = rr_model.policy_value_function(rr_policy, gamma=0.9)
    print(f"Policy: pi(high)={rr_policy['high']}, pi(low)={rr_policy['low']}. "
          f"Value function: V(high)={V['high']:8.5f}, V(low)={V['low']:8.5f}")

### Jacobi iteration

In [None]:
for rr_policy in rr_policies:
    V = rr_model.policy_value_function(rr_policy, gamma=0.9, method='jacobi', max_iterations=200)
    print(f"Policy: pi(high)={rr_policy['high']}, pi(low)={rr_policy['low']}. "
          f"Value function: V(high)={V['high']:8.5f}, V(low)={V['low']:8.5f}")

### Gauss-Seidel iteration

In [None]:
for rr_policy in rr_policies:
    V = rr_model.policy_value_function(rr_policy, gamma=0.9, method='gs', max_iterations=200)
    print(f"Policy: pi(high)={rr_policy['high']}, pi(low)={rr_policy['low']}. "
          f"Value function: V(high)={V['high']:8.5f}, V(low)={V['low']:8.5f}")

# Defining a simple MDP

Let's now define the MDP used in the Activities for day 1. Recall that the transition probabilities and rewards in this case are given by:

$$
P^a=\begin{bmatrix}
0.2 & 0.8\\
0.7 & 0.3
\end{bmatrix}\quad
R^a=\begin{bmatrix}
10 & 7\\
12 & 15
\end{bmatrix}
$$

$$
P^b=\begin{bmatrix}
0.4 & 0.6\\
0.1 & 0.9
\end{bmatrix}\quad
R^b=\begin{bmatrix}
5 & 11\\
14 & 7
\end{bmatrix}
$$

$$
P^c=\begin{bmatrix}
0.8 & 0.2\\
0.2 & 0.8
\end{bmatrix}\quad
R^c=\begin{bmatrix}
14 & 3\\
2 & 12
\end{bmatrix}
$$


The first thing we need to do is to define the states and actions admissible at each state. There are two states, $\mathtt{1}$ and $\mathtt{2}$, and three actions, $\mathtt{'a'}$, $\mathtt{'b'}$ and $\mathtt{'c'}$. Complete the code in the following cell:

## Model Definition

In [None]:
# In the line below, replace the ... by the list of states
states = [ ... ]
# The line below shows how to define the actions for state 1. Do the same for state 2
actions = {
    1: ['a', 'b', 'c'],
    2: [ ... ],
}

We now need to define the transition probability matrices. The matrix for action $\mathtt{'a'}$ is defined. Do the same for the other actions:

In [None]:
P = {}
P['a'] = np.array([[0.2, 0.8], [0.7, 0.3]], dtype=np.float64)
P['b'] = ...
P['c'] = ...

We can print the matrices to check that we have the right values:

In [None]:
print(P['a'])
print(P['b'])
print(P['c'])

Now, in the cell below, define the reward matrices, following the pattern above:

In [None]:
R = {}
R['a'] = ...
R['b'] = ...
R['c'] = ...

Check that we have the correct values:

In [None]:
print(R['a'])
print(R['b'])
print(R['c'])

At this point, we have all the data needed to define the MDP. Run the following cell to create the object $\mathtt{mdp\_model}$ that represents our MDP:

In [None]:
mdp_model = FiniteMDP(states, actions, P, R)

Notice that there is no output. If there were no errors, then a Python object representing the MDP was created. If there were errors, check the definitions of your states, actions and transition and reward matrices. In the next sections we will experiment with doing computations with the model.

## Simulating a policy

One way to evaluate a policy is to simulate it. The cell below shows how to do this for the deterministic policy:
$$
\pi(1)=\mathtt{'c'},\quad\pi(2)=\mathtt{'a'}
$$
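When reading the simulation code, pay special attention to the following method calls: `reset()`, which sets the initial state of the chain, and `step()`, which advances the chain by one step and returns the next state, the reward, and a flag indicating whether the run has terminated.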

In [None]:
pi = {1: 'c', 2: 'a'}
# Resets initial state
current_state = mdp_model.reset(initial_state=1)
n_steps = 10
gamma = 0.9
discount = 1.0
total_return = 0.0
print(f'Initial state: {current_state}')
for i in range(n_steps):
    # Select action according to current state
    action = pi[current_state]
    # Advances one step in the chain
    next_state, reward, terminated = mdp_model.step(action)
    # Add to total return and adjust discount
    total_return += discount * reward
    discount *= gamma
    # Updates state for next iteration and print information about transition
    current_state = next_state
    print(f'Step {i + 1:2d}: action={action:>2}, state={next_state:>2}, reward={reward:6.2f}, '
          f'return={total_return:8.3f}, terminated={terminated}')
    if terminated:
        break
print(f'Total return for this run: {total_return}')

Run the simulation several times. You will notice that the results are different each time. This is expected: the initial state is fixed at 1, but the transitions are random.

### Exercise

This is a continuing task, so an interesting question is how close we are to the actual return for this policy. Experiment with the variable $\mathtt{n\_steps}$ to determine how many steps of the chain we have to simulate to get a reasonable estimate for the return.

**Extra credit**: can you mathematically estimate the number of steps necessary to get an approximation with a given precision (say, $10^{-2}$)?

### Exercise

Is the simulation as defined above sufficient to estimate the value function? If not, describe in words what should be done.

### Exercise

Repeat the simulation for initial state 2.

## Computation of the value function

Let's now look at the methods available to evaluate a policy. We will consider the same policy as above:

In [None]:
pi = {1: 'c', 2: 'a'}

## LU Decomposition

The first method we will demonstrate uses an LU decomposition. This is an exact linear algebra method based on Gaussian elimination. It is equivalent to inverting the system matrix, but more efficient.

In [None]:
V = mdp_model.policy_value_function(pi, gamma=0.9)
print(V)
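
To see what the direct method does, the sketch below builds the linear system $(I-\gamma P_\pi)V=b_\pi$ by hand and solves it with `np.linalg.solve`, which uses an LU factorization internally. It is an illustration of the idea rather than a copy of the class internals. To stay self-contained it uses the Recycling Robot data from earlier in the notebook, with the policy that searches in state `high` and recharges in state `low`; the names `P_pi`, `R_pi` and `residual` are introduced here.

```python
import numpy as np

alpha = 0.8
rsearch = 15.0
gamma = 0.9

# Rows selected by the policy pi(high)=search, pi(low)=recharge
P_pi = np.array([[alpha, 1 - alpha],   # row 'high' of P['search']
                 [1.0, 0.0]])          # row 'low' of P['recharge']
R_pi = np.array([[rsearch, rsearch],   # row 'high' of R['search']
                 [0.0, 0.0]])          # row 'low' of R['recharge']

b = np.sum(P_pi * R_pi, axis=1)        # expected one-step reward in each state
A = np.eye(2) - gamma * P_pi           # system matrix I - gamma * P_pi
V = np.linalg.solve(A, b)

# The solution should satisfy the Bellman equation V = b + gamma * P_pi @ V
residual = np.max(np.abs(V - (b + gamma * P_pi @ V)))
print(V, residual)
```

The residual should be at the level of floating-point round-off, which is what "exact" means in practice here.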

### Exercise

If we change the discount factor from $\gamma=0.9$ to $\gamma=0.8$ will the state value function increase or decrease? Confirm your answer by re-evaluating the value function with the new discount factor.

## Jacobi iteration

The next cell computes the value function using a Jacobi iteration. Recall that this method is not exact, and that the quality of the approximation depends on the number of iterations. (The method is guaranteed to converge if $\gamma<1$.)

In [None]:
V = mdp_model.policy_value_function(pi, gamma=0.9, method='jacobi', max_iterations=200)
V
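
The Jacobi iteration itself is short enough to write out. The sketch below repeats the update $V_{k+1}=b_\pi+\gamma P_\pi V_k$ until the largest change falls below a tolerance, then compares the result with a direct solve. It uses the Recycling Robot policy "search in `high`, recharge in `low`" with the parameter values from earlier in the notebook; the array names and the tolerance are choices made for this illustration.

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2], [1.0, 0.0]])   # search in 'high', recharge in 'low'
b = np.array([15.0, 0.0])                   # expected one-step rewards

V = np.zeros(2)
for k in range(1000):
    V_new = b + gamma * P_pi @ V            # every entry uses the previous iterate
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-8:
        break

exact = np.linalg.solve(np.eye(2) - gamma * P_pi, b)
print(k + 1, V, exact)
```

The first printed number is the count of iterations that were actually needed for this tolerance.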

### Exercise

What happens if the parameter $\mathtt{max\_iterations}$ is too small? Try to find the minimum number of iterations that produces an acceptable result.

## Gauss-Seidel iteration

The next cell shows how to run a Gauss-Seidel iteration. Because each update uses the values already computed in the current sweep, this method tends to be slightly more efficient than a Jacobi iteration.

In [None]:
V = mdp_model.policy_value_function(pi, gamma=0.9, method='gs', max_iterations=200)
V
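
The only difference from Jacobi is that the vector is updated in place, one state at a time. The sketch below shows this for the Recycling Robot policy "search in `high`, recharge in `low`", with the parameter values from earlier in the notebook, and checks the result against a direct solve. The array names and the tolerance are choices made for this illustration.

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2], [1.0, 0.0]])   # search in 'high', recharge in 'low'
b = np.array([15.0, 0.0])                   # expected one-step rewards

V = np.zeros(2)
for sweep in range(1000):
    delta = 0.0
    for i in range(len(V)):
        # V already contains the entries updated earlier in this sweep
        new_value = b[i] + gamma * P_pi[i] @ V
        delta = max(delta, abs(new_value - V[i]))
        V[i] = new_value
    if delta < 1e-8:
        break

exact = np.linalg.solve(np.eye(2) - gamma * P_pi, b)
print(V, exact)
```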

### Exercise

Try to find the minimum number of iterations that produces a "good" result with both Jacobi and Gauss-Seidel. Is it possible to determine, in this case, which method finds the answer in fewer iterations?

### Exercise

Repeat the experiments above with a different policy. If you are feeling brave, try to find the optimal policy by computing the value function of all policies (with three admissible actions in each of the two states, there are only $3^2=9$ deterministic policies).