# CA 2, Interactive Learning, Fall 2024
- **Name**: Majid Faridfar
- **Student ID**: 810199569

## Problem 1
What is the difference between reinforcement learning and supervised learning? Explain by providing two similar problems: one that requires reinforcement learning to solve, and another that can be solved with supervised learning.

Differences:
- How to learn? 
  - RL is based on an agent interacting with an environment. The agent takes actions, receives feedback in the form of rewards or penalties, and aims to maximize the cumulative reward over time. The feedback is sparse and often delayed, and there’s no explicit label or correct answer for each action.
  - SL involves learning from labeled data where each input comes with a corresponding correct output (label). The model is trained to map inputs to outputs based on this labeled dataset.

- What is the goal?
  - The goal of RL is to learn a strategy (or policy) that will maximize the total reward over time. The agent learns through trial and error, exploring and exploiting the environment to improve its decisions.
  - The goal of SL is to minimize the error between the predicted output and the true label, thereby learning a mapping function from inputs to outputs.

Illustration (Maze):
- **Reinforcement learning version:** A robot (agent) is placed in a maze (environment) and has to navigate from the start to the goal. The robot takes actions such as moving left, right, up, or down. Initially, it doesn’t know the best path to take and receives feedback in the form of rewards (e.g., +1 for reaching the goal, -1 for hitting a wall or making a wrong turn). The goal is to maximize the cumulative reward by learning which actions lead to the goal. The robot learns over time by exploring different paths, receiving feedback, and adjusting its strategy based on the rewards it receives.
- **Supervised learning version:** Instead of having the robot learn by exploring the maze, you collect a dataset of maze configurations (states), each labeled with the correct next action (e.g., move left, move right, etc.) on a path that leads to the goal. You train a model on this labeled dataset to predict the correct action for a given maze state. Once trained, the model predicts the next move for new maze configurations that resemble the training data.

## Problem 2
In an MDP problem, if the reward function undergoes a linear transformation, does the optimal policy change? (Provide a mathematical proof or a counterexample, and ignore the trivial case of multiplying by zero.) Does the answer to this question depend on whether the task is continuing or episodic?

## Problem 3
Assume a robot operates in a grid environment as follows. In each episode, the robot starts in one of the cells in the bottom row (with an equal probability of starting in each cell). The robot can move left, right, or up. If the robot chooses to move in a direction where there is a wall, it remains in place. If the robot enters one of the green cells, the episode ends.

![pics/P3.png](pics/P3.png)

### Scenario 1

The robot knows the grid environment completely and is aware of its current location at any moment, using this information to make decisions. The goal is for the robot to reach the second row and enter one of the green cells. If the robot enters a green cell, it receives a reward of +1 (and the episode ends), and if it moves from one blue cell to another blue cell, it receives a reward of 0. For this task, $\lambda = 0.9$.

#### a
Can the defined task be represented by an MDP? If not, explain why; if yes, fully specify the MDP.

Yes, this task can be defined by an MDP. To fully specify the Markov Decision Process (MDP) for this task, we need to define **States (S)**, **Actions (A)**, **Transition probabilities (P(s'|s,a))**, **Rewards (R(s,a))**, and **Discount Factor (λ)**.

##### States (S):
The states are the cells in the grid:
- S1, S2, S3, S4, S5, S6, S7 represent the blue cells.
- T1, T3, and T5 are the green terminal cells (absorbing states).

So the state space is:
$$S = \{S1, S2, S3, S4, S5, S6, S7, T1, T3, T5\}$$

##### Actions (A):
The robot has three possible actions in the bottom row: **Up (U)**, **Left (L)**, **Right (R)**. However, when the robot is in the terminal green cells T1, T3, T5, the episode ends, and no further actions are available.

Thus, the action set is:
$$A = \{\text{Up}, \text{Left}, \text{Right}\}$$

##### Transition Probabilities (P(s' | s, a)):
The robot transitions deterministically (we assume that there is no stochastic behavior, since it is not mentioned in the problem), so the probability of transitioning from one state to another depends on whether the robot encounters a wall.

- Moving "Left" in state S1 will not change the state.
- Moving "Up" in states S2, S4, S6 will move the robot into the green cells (and end the episode), while in all other states it will not change the state.
- Moving "Right" in state S7 will not change the state.

Here are the key deterministic transitions:

- **Up movement**:
  - $P(T1 | S2, U) = 1$
  - $P(T3 | S4, U) = 1$
  - $P(T5 | S6, U) = 1$
  - Other cases: The robot stays in the current state if "Up" is attempted from a cell without a green terminal cell above it.

- **Left movement**:
  - $P(S1 | S2, L) = 1$
  - $P(S2 | S3, L) = 1$
  - $P(S3 | S4, L) = 1$
  - $P(S4 | S5, L) = 1$
  - $P(S5 | S6, L) = 1$
  - $P(S6 | S7, L) = 1$
  - Other cases: The robot stays in the current state if "Left" is attempted from S1.

- **Right movement**:
  - $P(S2 | S1, R) = 1$
  - $P(S3 | S2, R) = 1$
  - $P(S4 | S3, R) = 1$
  - $P(S5 | S4, R) = 1$
  - $P(S6 | S5, R) = 1$
  - $P(S7 | S6, R) = 1$
  - Other cases: The robot stays in the current state if "Right" is attempted from S7.

##### Rewards (R(s,a)):
The robot receives:
- A reward of **+1** upon entering a terminal green cell (T1, T3, T5).
- A reward of **0** for any movement from one blue cell to another blue cell.

Thus, the rewards are:
- $R(s' | s, U) = 1$ if the robot moves from S2 to T1, S4 to T3, or S6 to T5.
- $R(s' | s, a) = 0$ for any other movement.

For example:
- $R(T1 | S2, U) = 1$,
- $R(T3 | S4, U) = 1$,
- $R(T5 | S6, U) = 1$,
- $R(S2 | S3, L) = 0$,
- $R(S6 | S5, R) = 0$, etc.

##### Discount Factor (λ):
The **discount factor** is $\lambda = 0.9$. This means that rewards received in future states are weighted by a factor of 0.9 for every step into the future. It reflects the agent's preference for immediate rewards over delayed ones.
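
The specification above can be sketched in code. This is a minimal illustration, not part of the original answer: the names `BLUE`, `GREEN_ABOVE`, `DISCOUNT`, and `step` are hypothetical, and the layout (green cells above S2, S4, and S6) is taken from the transitions listed above.

```python
# Hypothetical encoding of the Scenario 1 MDP; all names are illustrative.
BLUE = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]
GREEN_ABOVE = {"S2": "T1", "S4": "T3", "S6": "T5"}  # blue cell -> green cell above it
ACTIONS = ["U", "L", "R"]
DISCOUNT = 0.9  # the discount factor (written as lambda in the text)


def step(state, action):
    """Return (next_state, reward) for a deterministic move from a blue cell."""
    i = BLUE.index(state)
    if action == "U":
        nxt = GREEN_ABOVE.get(state, state)      # wall above: stay in place
    elif action == "L":
        nxt = BLUE[max(i - 1, 0)]                # wall left of S1: stay in place
    elif action == "R":
        nxt = BLUE[min(i + 1, len(BLUE) - 1)]    # wall right of S7: stay in place
    else:
        raise ValueError(f"unknown action: {action}")
    reward = 1 if nxt in GREEN_ABOVE.values() else 0
    return nxt, reward


print(step("S2", "U"))  # ('T1', 1)
print(step("S1", "L"))  # ('S1', 0)
```

Because the transitions are deterministic, a function returning a single next state is enough here; a stochastic variant would need to return a distribution over next states instead.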

#### b
How many optimal deterministic policies exist for solving this task? Appropriately express $\pi(s)$ for each.

To define the optimal policy, we go state by state and analyze the optimal actions:

1. **S1**: The only possible optimal action here is to move **right**, since other actions are blocked by a wall. Thus, for $\pi(S1)$:
$$\pi(S1) = \text{R}$$
- Hence, there’s only **one optimal action** in S1.

2. **S2**: The optimal way to immediately end the episode and earn a reward is to move **Up** into the green terminal state T1. Thus:
$$\pi(S2) = \text{U}$$
- Hence, there’s only **one optimal action** in S2.

3. **S3**: Moving **Up** is blocked by the wall above, so it leaves the robot in place. From S3, the robot can either move **left** to S2, letting it access the green cell T1, or move **right** to S4, heading toward T3. Since the rewards are the same (all terminal cells give +1 and moves between blue cells give 0) and both green cells are two steps away, **both left and right are equally optimal** choices. Therefore:
$$\pi(S3) = \text{L or R}$$
- Hence, there are **two optimal actions** in S3.

4. **S4**: The optimal action in state S4 is to move **Up** into T3 immediately to end the episode. Thus:
$$\pi(S4) = \text{U}$$
- Hence, there’s only **one optimal action** in S4.

5. **S5**: Robot can move **right** toward S6 to reach the green terminal T5, or **left** to S4 and reach T3. Since both terminal cells T3 and T5 give the same reward, either direction is optimal. Therefore:
$$\pi(S5) = \text{L or R}$$
- Hence, there are **two optimal choices** in S5.

6. **S6**: The optimal action is to go **Up** to T5 immediately. Thus:
$$\pi(S6) = \text{U}$$
- Hence, there’s only **one optimal action** in S6.

7. **S7**: Since S7 has no option but to move **left** (right and up are blocked by a wall), the optimal policy is:
$$\pi(S7) = \text{L}$$
- Hence, there’s only **one optimal action** in S7.

---

- In **S1, S2, S4, S6, S7**, the robot has only **one optimal action**.
- In **S3, S5**, the robot has **two optimal choices**: move left or right.

Therefore, the total number of **optimal deterministic policies** is:

$$\text{Total optimal policies} = 2 \times 2 = 4$$

---

1. **Policy 1**:
$$\pi(S1) = \text{R}, \quad \pi(S2) = \text{U}, \quad \pi(S3) = \text{L}, \quad \pi(S4) = \text{U}, \quad \pi(S5) = \text{L}, \quad \pi(S6) = \text{U}, \quad \pi(S7) = \text{L}$$
   
2. **Policy 2**:
$$\pi(S1) = \text{R}, \quad \pi(S2) = \text{U}, \quad \pi(S3) = \text{L}, \quad \pi(S4) = \text{U}, \quad \pi(S5) = \text{R}, \quad \pi(S6) = \text{U}, \quad \pi(S7) = \text{L}$$

3. **Policy 3**:
$$\pi(S1) = \text{R}, \quad \pi(S2) = \text{U}, \quad \pi(S3) = \text{R}, \quad \pi(S4) = \text{U}, \quad \pi(S5) = \text{L}, \quad \pi(S6) = \text{U}, \quad \pi(S7) = \text{L}$$

4. **Policy 4**:
$$\pi(S1) = \text{R}, \quad \pi(S2) = \text{U}, \quad \pi(S3) = \text{R}, \quad \pi(S4) = \text{U}, \quad \pi(S5) = \text{R}, \quad \pi(S6) = \text{U}, \quad \pi(S7) = \text{L}$$
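
As a sanity check on the count, the sketch below runs value iteration on the Scenario 1 MDP and enumerates every deterministic policy that is greedy with respect to the optimal values. It is an illustration under the layout assumed in part (a); the helper names (`step`, `q_value`, `optimal_actions`) are hypothetical.

```python
from itertools import product

BLUE = [f"S{i}" for i in range(1, 8)]
GREEN_ABOVE = {"S2": "T1", "S4": "T3", "S6": "T5"}
ACTIONS = ["U", "L", "R"]
DISCOUNT = 0.9


def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    i = BLUE.index(state)
    if action == "U":
        nxt = GREEN_ABOVE.get(state, state)
    elif action == "L":
        nxt = BLUE[max(i - 1, 0)]
    else:
        nxt = BLUE[min(i + 1, len(BLUE) - 1)]
    return nxt, (1 if nxt in GREEN_ABOVE.values() else 0)


def q_value(values, state, action):
    nxt, reward = step(state, action)
    # Terminal (green) cells are absent from `values`, so they contribute 0.
    return reward + DISCOUNT * values.get(nxt, 0.0)


# Synchronous value iteration over the blue cells.
values = {s: 0.0 for s in BLUE}
for _ in range(200):
    values = {s: max(q_value(values, s, a) for a in ACTIONS) for s in BLUE}

# Actions whose Q-value ties with the optimal value in each state.
optimal_actions = {
    s: [a for a in ACTIONS if abs(q_value(values, s, a) - values[s]) < 1e-9]
    for s in BLUE
}
policies = list(product(*(optimal_actions[s] for s in BLUE)))

print(optimal_actions)
print(len(policies))  # 4
```

The tolerance `1e-9` is an arbitrary choice for detecting ties between floating-point Q-values.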

### Scenario 2

The robot has no knowledge of the environment it's in. It has distance sensors on all four sides that tell it whether or not there is a wall next to it. For example, consider the images below.

![pics/P3S2.png](pics/P3S2.png)

Assume the objective is similar to the first scenario. In terms of key elements of an MDP, what has changed here? Can this problem still be represented by an MDP in a way that ultimately fulfills our objective? (Is the optimal policy what we want it to be?)

Yes, the problem can still be represented by a **Markov Decision Process (MDP)**, as it fundamentally satisfies the requirements for an MDP. But there are key changes because the robot no longer knows exactly where it is within the environment.

#### State Space (S)
In the previous scenario, **states** were the individual grid cells like S1, S2, S3, etc. Now, since the robot doesn't know its position in the overall grid layout, the **states are defined by the robot's local sensor readings**, that is, information about whether there is a wall in each direction. The robot's sensory input can be represented as a **bit vector** with one bit per direction (up, right, down, left), indicating whether that direction is open or blocked by a wall.

- For example, the robot in **S1** receives the reading `{right: 1, up: 0, left: 0, down: 0}`.
- In **S6**, it receives `{right: 1, up: 1, left: 1, down: 0}`.

So, the new **state space** would be the set of all possible sensor readings the robot can encounter:
$$S = \{(\text{up}, \text{right}, \text{left}, \text{down}) : \text{up}, \text{right}, \text{left}, \text{down} \in \{0, 1\}\}$$

Here, **zero (0) represents a wall**, while **one (1) represents open movement** in that direction. Since each sensor reading has four binary values (up, right, left, down), there are potentially $2^4 = 16$ possible "sensor states". However, in this specific environment, "down" is always blocked (i.e., {..., down: 0}) and the "down" action does not exist, so the number of possible sensor states can be reduced to $2^3 = 8$.
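
To see which of these readings actually occur, the following sketch computes the reading for each blue cell. It assumes the bottom-row layout used in Scenario 1 (green cells above S2, S4, and S6) because the figure is not reproduced here; `sensor_reading` and `cells_by_reading` are hypothetical names.

```python
from collections import defaultdict

BLUE = [f"S{i}" for i in range(1, 8)]
HAS_GREEN_ABOVE = {"S2", "S4", "S6"}  # assumed layout, taken from Scenario 1


def sensor_reading(cell):
    """Return (up, right, left, down) with 1 = open and 0 = wall."""
    i = BLUE.index(cell)
    up = 1 if cell in HAS_GREEN_ABOVE else 0
    right = 1 if i < len(BLUE) - 1 else 0
    left = 1 if i > 0 else 0
    down = 0  # the bottom row always has a wall below
    return (up, right, left, down)


# Group cells that produce the same reading.
cells_by_reading = defaultdict(list)
for cell in BLUE:
    cells_by_reading[sensor_reading(cell)].append(cell)

for reading, cells in cells_by_reading.items():
    print(reading, cells)
```

Under this assumed layout, only four distinct readings appear, and several cells share one reading (for example S2, S4, and S6), so a sensor state does not identify a unique cell.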

#### Action Space (A)
The available actions for the robot remain the same as in the original problem: **{Left, Right, Up}**.

#### Transition Function (P(s' | s, a))
Transition descriptions become conditional on the **sensor state**:
   
- For instance, in state `{right: 1, up: 1, left: 1, down: 0}` the **Right** action will move the robot into another sensor-defined state.
- If no movement is possible in a chosen direction (like hitting a wall), the robot stays in the same sensor state.
- If the robot takes "Up" when possible (e.g., in a state like `{right: 1, up: 1, left: 1, down: 0}`), the robot actually transitions from a non-terminal sensor state to a **terminal green state**, where the episode ends.

So, the transition function can be expressed as:

$$P(s' \mid s, a) =
     \begin{cases}
       1 & \text{if } a \text{ is allowed based on } s \text{ (e.g., } a = \text{Right and } s = \{\text{up}: 1, \text{right}: 1, \text{left}: 1, \text{down}: 0\}\text{)} \\
       0 & \text{otherwise}
     \end{cases}$$

The transitions are deterministic once the robot knows whether a given action is available.

#### Reward Function (R(s, a))
The reward structure also changes slightly. Instead of rewards depending on grid positions, the **reward is now a function of the sensor state** and the action taken:

- If the robot takes the **Up** action in a sensor state where **Up is allowed** (i.e., there's no wall above), and this move leads to a terminal green cell, the robot gets a reward of **+1**.
- For all other valid moves (where no immediate green state is reached), the robot receives **0 reward**.
- If the robot tries an invalid action (e.g., moving into a wall), it stays in the same state and still receives a reward of **0**.
  
So, the reward function can be expressed as:

$$R(s' \mid s, a) =
     \begin{cases}
       +1 & \text{if } s = \{\text{right}: 1, \text{up}: 1, \text{left}: 1, \text{down}: 0\} \text{ and } a = \text{Up} \\
       0 & \text{otherwise}
     \end{cases}$$

#### Discount Factor (λ)
The problem didn't change the discount factor, so $\lambda$ remains at 0.9, meaning that future rewards are discounted slightly but still carry weight.

---

The optimal policy in the original problem was to **quickly reach the green terminal cells** by moving up when possible (if there was no wall above) and moving horizontally if necessary.

In this new formulation, the **optimal policy would still aim to achieve the same goal**—the robot must use its sensor readings to figure out:
- When it's best to move **Up** into a terminal cell,
- How to move **Left** or **Right** to position itself below a terminal (green) cell if it's not already there.

Thus, **in the optimal policy**, the robot will:
- Move **Up** when its sensors indicate that moving up is allowed (i.e., in a sensor state like `{right: 1, up: 1, left: 1, down: 0}`).
- Move **Left** or **Right** when needed to position itself for optimal transitions.

Thus, the optimal policy will likely involve:
$$
\pi(s) = \begin{cases} 
\text{Up} & \text{If sensor state allows Up and it leads to a goal state} \\
\text{Right/Left} & \text{If sensor state indicates the need to reposition below a terminal state}
\end{cases}
$$

So, the **optimal policy** still fulfills the ultimate goal by appropriately guiding the robot to the green terminal cells using sensor-guided movements. Therefore, under this formulation, the robot will still eventually learn the optimal way to achieve the goal.
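
One way to make the sensor-based policy concrete is the sketch below, which picks an action from the reading alone and simulates an episode from every starting cell. It assumes the same layout as Scenario 1; the tie-break toward **Right** when both sides are open is an arbitrary choice, and all names are hypothetical.

```python
BLUE = [f"S{i}" for i in range(1, 8)]
HAS_GREEN_ABOVE = {"S2", "S4", "S6"}  # assumed layout, taken from Scenario 1


def sensor_reading(cell):
    """Return (up, right, left, down) with 1 = open and 0 = wall."""
    i = BLUE.index(cell)
    return (int(cell in HAS_GREEN_ABOVE), int(i < len(BLUE) - 1), int(i > 0), 0)


def sensor_policy(reading):
    """Choose an action using only the sensor reading."""
    up, right, _left, _down = reading
    if up:
        return "U"
    if right:
        return "R"  # arbitrary tie-break when both left and right are open
    return "L"


def run_episode(start, max_steps=10):
    """Return the number of steps taken to enter a green cell, or None."""
    cell = start
    for t in range(1, max_steps + 1):
        action = sensor_policy(sensor_reading(cell))
        i = BLUE.index(cell)
        if action == "U":
            if cell in HAS_GREEN_ABOVE:
                return t
        elif action == "R":
            cell = BLUE[min(i + 1, len(BLUE) - 1)]
        else:
            cell = BLUE[max(i - 1, 0)]
    return None


steps = {cell: run_episode(cell) for cell in BLUE}
print(steps)
```

With this layout, every starting cell reaches a green cell in at most two steps, which matches the step counts of the optimal policies in Scenario 1.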

### Scenario 3

Assume everything is the same as in Scenario 2, but our objective has changed, and we want the robot to enter only the two green cells on the left and right. If the robot enters the red cell in the figure below, it will receive a reward of -1, and if it enters one of the green cells, it will receive a reward of +1 (all other rewards are zero). For this task, $\lambda = 0.9$.

![pics/P3S3.png](pics/P3S3.png)

#### a
Can this problem still be represented by an MDP in a way that ultimately fulfills our objective? (Is the optimal policy what we want it to be?) If it can be represented, provide the desired optimal policy. If it cannot, explain what should be done, given the robot's input data, so that the robot can find an optimal policy that meets our objective.

#### b
If you solve Scenario 3 using dynamic programming methods, what MDP does the optimal solution correspond to? Represent the states of this MDP with $s_j$ (for $j = 0, 1, \dots$), and indicate which grid cells correspond to these states. Represent the actions with $a_i$ (for $i = 0, 1, 2, \dots$) and specify which robot actions they correspond to. Define the probability matrix $P$ accordingly.

### Implementation

Implement the problem in both finite-horizon and infinite-horizon settings.

### a

Based on your answer to the previous part of question (a), choose values of H such that after obtaining the optimal policy using VI and PI algorithms, the optimal policy exhibits the following characteristics:
- The optimal policy only buys.
- The optimal policy only sells.
- The optimal policy both buys and sells.

## Problem 4

In this question, we examine the effect of episode length (Horizon) on the agent’s policy. Consider a robot that is tasked with managing stock shares. (Assume this problem can be represented as an MDP.)

Let $ s $ represent the number of shares the robot currently has (an integer in the range [0, 10]). At each moment, the robot has two options: to sell (if possible, $ s $ decreases by one unit) or to buy (if possible, $ s $ increases by one unit).

- If $ s > 0 $ and the agent sells, it receives a reward of +1 for the sale, and the stock level changes to $ s - 1 $. If $ s = 0 $, nothing happens.
- If $ s < 9 $ and the agent buys, it receives no reward, and the stock level changes to $ s + 1 $.
- The stock owner wants the inventory to be fully stocked at the end of the day; therefore, if the stock level reaches the maximum value of $ s = 10 $, the agent receives a reward of +100.
- The state $ s = 10 $ is also a terminal state, and the problem ends if it is reached.

The reward function, denoted by $ r(s, a, s') $, is summarized as follows:

- $ r(s, \text{sell}, s - 1) = 1 $ for $ s > 0 $
- $ r(0, \text{sell}, 0) = 0 $
- $ r(s, \text{buy}, s + 1) = 0 $ for $ s < 9 $
- $ r(9, \text{buy}, 10) = 100 $, indicating that moving from $ s = 9 $ to $ s = 10 $ gives a reward of +100, reaching the maximum stock level.
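
The transition and reward rules above can be written as a single step function. This is a minimal sketch of the rules as stated; the name `stock_step` and the `(next_s, reward, done)` return convention are assumptions made for illustration.

```python
MAX_STOCK = 10  # terminal stock level


def stock_step(s, action):
    """Return (next_s, reward, done) for the share-management MDP."""
    if s == MAX_STOCK:
        raise ValueError("s = 10 is terminal; no action is available")
    if action == "sell":
        if s > 0:
            return s - 1, 1, False   # r(s, sell, s - 1) = 1
        return 0, 0, False           # r(0, sell, 0) = 0
    if action == "buy":
        if s < 9:
            return s + 1, 0, False   # r(s, buy, s + 1) = 0
        return MAX_STOCK, 100, True  # r(9, buy, 10) = 100
    raise ValueError(f"unknown action: {action}")


print(stock_step(3, "sell"))  # (2, 1, False)
print(stock_step(9, "buy"))   # (10, 100, True)
```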

It is assumed that the stock level always starts from $ s = 3 $ at the beginning of the day. We will examine how the agent’s optimal policy changes by setting a limited horizon $ H $ for the problem. Recall that the horizon $ H $ refers to a limit on the number of time steps in which the agent can interact with the MDP before the episode ends, regardless of whether a terminal state has been reached. We will analyze the characteristics of the optimal policy (the policy that maximizes the episode’s reward) as the horizon $ H $ changes. (For the finite horizon, the discount factor is $ \gamma = 1 $).

![pics/P4.png](pics/P4.png)

For example, assume $ H = 4 $. The agent can sell for three steps, moving from $ s = 3 $ to $ s = 2 $, then $ s = 1 $, and finally $ s = 0 $, receiving rewards of +1, +1, and +1 for each sell action. In the fourth step, the inventory is empty, so it can either sell or buy, but it will not receive any reward in either case. Then, the problem ends due to the time limit.
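
The $ H = 4 $ example can be checked with a small finite-horizon backward induction ($ \gamma = 1 $). This is a sketch under the reward rules listed above; `stock_step` and `optimal_return` are hypothetical helper names, and the recursion is memoized with `functools.lru_cache`.

```python
from functools import lru_cache

MAX_STOCK = 10
ACTIONS = ("sell", "buy")


def stock_step(s, action):
    """Return (next_s, reward) following the reward function above."""
    if action == "sell":
        return (s - 1, 1) if s > 0 else (0, 0)
    return (s + 1, 0) if s < 9 else (MAX_STOCK, 100)


@lru_cache(maxsize=None)
def optimal_return(s, steps_left):
    """Best undiscounted return from stock level s with steps_left steps remaining."""
    if s == MAX_STOCK or steps_left == 0:
        return 0  # terminal state reached or horizon exhausted
    best = 0
    for action in ACTIONS:
        next_s, reward = stock_step(s, action)
        best = max(best, reward + optimal_return(next_s, steps_left - 1))
    return best


print(optimal_return(3, 4))  # 3
```

Calling `optimal_return(3, H)` for other values of `H` gives the best achievable return for that horizon, which can help when comparing horizons in the parts that follow.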

### a 
Starting from the initial state $ s = 3 $, is it possible to choose a value of $ H $ such that the optimal policy includes both buying and selling steps during the execution? Explain your answer.

### b
Starting from the initial state $ s = 3 $, for what values of $ H $ does the optimal policy lead to a fully stocked inventory? In other words, provide a range for $ H $.

*Note 1:* We consider the inventory fully stocked when the buy action is chosen in state $ s = 9 $, causing a transition to $ s = 10 $. This includes the last time step in the horizon as well.

*Note 2:* By performing only buy actions, the agent can reach $ s = 10 $ from $ s = 3 $ in $ H = 7 $ steps.

### c
Now, consider the infinite-horizon setting with a discount factor $ \gamma $. In other words, there is no time limit, and the problem only ends if a terminal state is reached. Suppose $ \gamma = 0 $; what action does the optimal policy take when $ s = 3 $? What action does the optimal policy take when $ s = 9 $?

### d
In the infinite-horizon setting with a discount factor $ \gamma $, is it possible to choose a constant $ \gamma \in (0, 1] $ such that the optimal policy, starting from $ s = 3 $, never fully stocks the inventory? If so, find a range of $ \gamma $ that meets this condition.