# CA 2, Interactive Learning, Fall 2024
- **Name**: Majid Faridfar
- **Student ID**: 810199569

## Problem 1
What is the difference between reinforcement learning and supervised learning? Explain by providing two similar problems: one that requires reinforcement learning to solve, and another that can be solved with supervised learning.

Differences:
- How does learning happen?
  - RL is based on an agent interacting with an environment. The agent takes actions, receives feedback in the form of rewards or penalties, and aims to maximize the cumulative reward over time. The feedback is often sparse and delayed, and there is no explicit label or correct answer for each action.
  - SL learns from labeled data, where each input comes with a corresponding correct output (label). The model is trained to map inputs to outputs based on this labeled dataset.

- What is the goal?
  - The goal of RL is to learn a strategy (or policy) that will maximize the total reward over time. The agent learns through trial and error, exploring and exploiting the environment to improve its decisions.
  - The goal of SL is to minimize the error between the predicted output and the true label, thereby learning a mapping function from inputs to outputs.
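
The two goals can be written side by side as optimization problems. This is a sketch in standard notation; the symbols $\gamma$ (discount factor), $R_{t+1}$ (reward), $f_\theta$ (model), and $\ell$ (per-example loss) are introduced here for illustration and are not part of the original answer.

$$
\text{RL:}\quad \pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right],
\qquad
\text{SL:}\quad \theta^* = \arg\min_{\theta} \; \frac{1}{N}\sum_{i=1}^{N} \ell\big(f_\theta(x_i),\, y_i\big)
$$

In the RL objective the data distribution depends on the policy being optimized, whereas in the SL objective the pairs $(x_i, y_i)$ are fixed in advance.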

Illustration (Maze):
- RL version: A robot (agent) is placed in a maze (environment) and has to navigate from the start to the goal. The robot takes actions such as moving left, right, up, or down. Initially, it doesn’t know the best path and only receives feedback in the form of rewards (e.g., +1 for reaching the goal, -1 for hitting a wall or making a wrong turn). The robot learns over time by exploring different paths and adjusting its strategy so that the cumulative reward is maximized.
- SL version: Instead of having the robot learn by exploring the maze, you collect a dataset of maze configurations (states), each labeled with the correct next action (e.g., move left, move right), taken from known paths to the goal. You train a model on this labeled dataset to predict the correct action for a given maze state. Once trained, the model predicts a move for a new maze configuration, and its quality depends on how well it generalizes from the labeled examples.
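
The contrast in the maze illustration can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference solution: the maze is reduced to a hypothetical 5-cell corridor, the -1 penalty is dropped, the discount factor 0.9 and the step size 1 are choices made for this example, and `q_learning` and `fit_supervised` are illustrative names.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal (hypothetical layout)
LEFT, RIGHT = 0, 1    # action encoding chosen for this sketch
GAMMA = 0.9           # assumed discount factor

def step(state, action):
    """Deterministic move; bumping into the left wall leaves the state unchanged."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

def q_learning(episodes=50, seed=0):
    """RL: learn from rewards only, acting with a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice((LEFT, RIGHT))
            nxt, reward, done = step(state, action)
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[state][action] = target  # step size 1 suffices in a deterministic world
            state = nxt
    # Greedy policy for the non-terminal cells 0..3.
    return [max((LEFT, RIGHT), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]

def fit_supervised(dataset):
    """SL: memorise the majority label per input; the environment is never queried."""
    votes = {}
    for state, label in dataset:
        votes.setdefault(state, []).append(label)
    return {s: max(set(labels), key=labels.count) for s, labels in votes.items()}

rl_policy = q_learning()
sl_policy = fit_supervised([(0, RIGHT), (1, RIGHT), (2, RIGHT), (3, RIGHT)])
```

Both learners end up moving right in every cell, but `q_learning` discovers this from the goal reward alone, while `fit_supervised` needs someone to supply the correct action for every state up front.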

## Problem 2
In an MDP problem, if the reward function undergoes a linear transformation, does the optimal policy change? (Provide a mathematical proof or a counterexample, and ignore the trivial case of multiplying by zero.) Does the answer to this question depend on whether the task is continuing or episodic?
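
A possible starting point, offered as a sketch rather than a complete proof: write the transformed reward as $r' = a\,r + b$ with $a \neq 0$ and expand the return. The symbols $a$, $b$, and the termination time $T$ are introduced here for illustration.

$$
G'_t = \sum_{k=0}^{T-t-1} \gamma^{k}\,\big(a\,R_{t+k+1} + b\big) = a\,G_t + b \sum_{k=0}^{T-t-1} \gamma^{k}
$$

In a continuing task with $\gamma < 1$ (so $T = \infty$), the last sum equals $\frac{1}{1-\gamma}$ for every policy, so only the sign of $a$ can affect how policies are ranked. In an episodic task the sum depends on the number of remaining steps $T - t$, which itself depends on the policy, so the offset $b$ has to be examined separately.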

## Problem 3
Assume a robot operates in a grid environment as follows. In each episode, the robot starts in one of the cells in the bottom row (with an equal probability of starting in each cell). The robot can move left, right, or up. If the robot chooses to move in a direction where there is a wall, it remains in place. If the robot enters one of the green cells, the episode ends.

![pics/P3.png](pics/P3.png)

### Scenario 1

The robot knows the grid environment completely and is aware of its current location at any moment, using this information to make decisions. The goal is for the robot to reach the second row and enter one of the green cells. If the robot enters a green cell, it receives a reward of +1 (and the episode ends), and if it moves from one blue cell to another blue cell, it receives a reward of 0. For this task, the discount factor is $\gamma = 0.9$.

#### a
Can the defined task be represented by an MDP? If not, explain why; if yes, fully specify the MDP.

#### b
How many optimal deterministic policies exist for solving this task? Appropriately express $\pi(s)$ for each.

### Scenario 2

The robot has no knowledge of the environment it's in. It has distance sensors on all four sides that tell it whether or not there is a wall next to it. For example, consider the images below.

![pics/P3S2.png](pics/P3S2.png)

Assume the objective is similar to the first scenario. In terms of key elements of an MDP, what has changed here? Can this problem still be represented by an MDP in a way that ultimately fulfills our objective? (Is the optimal policy what we want it to be?)

### Scenario 3

Assume everything is the same as in Scenario 2, but our objective has changed, and we want the robot to enter only the two green cells on the left and right. If the robot enters the red cell in the figure below, it will receive a reward of -1, and if it enters one of the green cells, it will receive a reward of +1 (all other rewards are zero). For this task, the discount factor is $\gamma = 0.9$.

![pics/P3S3.png](pics/P3S3.png)

#### a
Can this problem still be represented by an MDP in a way that ultimately fulfills our objective? (Is the optimal policy what we want it to be?) If it can be represented, provide the desired optimal policy. If it cannot, explain what should be done, given the robot's input data, so that the robot can find an optimal policy that meets our objective.

#### b
If you solve Scenario 3 using dynamic programming methods, what MDP does the optimal solution correspond to? Represent the states of this MDP with $s_j$ (for $j = 0, 1, \dots$), and indicate which grid cells correspond to these states. Represent the actions with $a_i$ (for $i = 0, 1, 2, \dots$) and specify which robot actions they correspond to. Define the probability matrix $P$ accordingly.

### Implementation

Implement the stock-management problem (Problem 4) in both finite-horizon and infinite-horizon settings.

#### a

Based on your answer to part (a) of Problem 4, choose values of $H$ such that the optimal policy obtained with the value iteration (VI) and policy iteration (PI) algorithms has each of the following characteristics:
- The optimal policy only buys.
- The optimal policy only sells.
- The optimal policy both buys and sells.

## Problem 4

In this question, we examine the effect of episode length (the horizon) on the agent’s policy. Consider a robot that is tasked with managing shares of a stock. (Assume this problem can be represented as an MDP.)

Let $ s $ represent the number of shares the robot currently holds (always an integer in $ [0, 10] $). At each time step, the robot has two options: to sell (if possible, $ s $ decreases by one unit) or to buy (if possible, $ s $ increases by one unit).

- If $ s > 0 $ and the agent sells, it receives a reward of +1 for the sale, and the stock level changes to $ s - 1 $. If $ s = 0 $ and the agent sells, the stock level stays at 0 and the reward is 0.
- If $ s < 9 $ and the agent buys, it receives no reward, and the stock level changes to $ s + 1 $.
- The stock owner wants the inventory to be fully stocked at the end of the day; therefore, if the stock level reaches the maximum value of $ s = 10 $, the agent receives a reward of +100.
- The state $ s = 10 $ is also a terminal state, and the problem ends if it is reached.

The reward function, denoted by $ r(s, a, s') $, is summarized as follows:

- $ r(s, \text{sell}, s - 1) = 1 $ for $ s > 0 $
- $ r(0, \text{sell}, 0) = 0 $
- $ r(s, \text{buy}, s + 1) = 0 $ for $ s < 9 $
- $ r(9, \text{buy}, 10) = 100 $, indicating that moving from $ s = 9 $ to $ s = 10 $ gives a reward of +100, reaching the maximum stock level.
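
The transition and reward rules above can be collected into a single function. This is a minimal sketch, assuming actions are encoded as the strings `"sell"` and `"buy"`; the name `step`, the constant `MAX_STOCK`, and the `(next_state, reward, done)` return convention are choices made for this example.

```python
MAX_STOCK = 10  # terminal, fully stocked state

def step(s, action):
    """Return (next_state, reward, done) for the stock MDP described above."""
    if s == MAX_STOCK:
        raise ValueError("s = 10 is terminal; no action is available")
    if action == "sell":
        # Selling with an empty inventory changes nothing and earns nothing.
        return (s - 1, 1, False) if s > 0 else (0, 0, False)
    if action == "buy":
        nxt = s + 1
        done = nxt == MAX_STOCK
        # Only the purchase that fills the inventory (9 -> 10) is rewarded.
        return nxt, (100 if done else 0), done
    raise ValueError(f"unknown action: {action!r}")
```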

It is assumed that the stock level always starts from $ s = 3 $ at the beginning of the day. We will examine how the agent’s optimal policy changes by setting a limited horizon $ H $ for the problem. Recall that the horizon $ H $ refers to a limit on the number of time steps in which the agent can interact with the MDP before the episode ends, regardless of whether a terminal state has been reached. We will analyze the characteristics of the optimal policy (the policy that maximizes the episode’s total reward) as the horizon $ H $ changes. (For the finite horizon, the discount factor is $ \gamma = 1 $.)

![pics/P4.png](pics/P4.png)

For example, assume $ H = 4 $. The agent can sell for three steps, moving from $ s = 3 $ to $ s = 2 $, then $ s = 1 $, and finally $ s = 0 $, receiving rewards of +1, +1, and +1 for each sell action. In the fourth step, the inventory is empty, so it can either sell or buy, but it will not receive any reward in either case. Then, the problem ends due to the time limit.
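
The best achievable return for a given horizon can be checked with backward induction over the remaining number of steps. This is a sketch under the reward rules stated above with $ \gamma = 1 $; the function names `step` and `optimal_return` are illustrative, and the recursion is memoized with `functools.lru_cache`.

```python
from functools import lru_cache

def step(s, action):
    """Transition and reward of the stock MDP; returns (next_state, reward)."""
    if action == "sell":
        return (s - 1, 1) if s > 0 else (0, 0)
    return (s + 1, 100 if s == 9 else 0)  # "buy" from any s in 0..9

@lru_cache(maxsize=None)
def optimal_return(s, steps_left):
    """Best undiscounted return from state s with steps_left actions remaining."""
    if s == 10 or steps_left == 0:  # terminal state reached or horizon exhausted
        return 0
    return max(
        reward + optimal_return(nxt, steps_left - 1)
        for nxt, reward in (step(s, a) for a in ("sell", "buy"))
    )
```

With these definitions, `optimal_return(3, 4)` evaluates to 3, which matches the total reward of the example trajectory above.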

### a 
Starting from the initial state $ s = 3 $, is it possible to choose a value of $ H $ such that the optimal policy includes both buying and selling steps during the execution? Explain your answer.

### b
Starting from the initial state $ s = 3 $, for what values of $ H $ does the optimal policy lead to a fully stocked inventory? In other words, provide a range for $ H $.

*Note 1:* We consider the inventory fully stocked when the buy action is chosen in state $ s = 9 $, causing a transition to $ s = 10 $. This includes the last time step in the horizon as well.

*Note 2:* By performing only buy actions, the agent can reach $ s = 10 $ from $ s = 3 $ in $ H = 7 $ steps.

### c
Now, consider the infinite-horizon setting with a discount factor $ \gamma $. In other words, there is no time limit, and the problem only ends if a terminal state is reached. Suppose $ \gamma = 0 $; what action does the optimal policy take when $ s = 3 $? What action does the optimal policy take when $ s = 9 $?
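
The infinite-horizon setting can be explored numerically with value iteration. The sketch below assumes the same reward rules as above and is intended for $ \gamma < 1 $, where the values are guaranteed to converge; `value_iteration`, `tol`, and `max_iters` are names and parameters chosen for this example, and ties between actions are broken in favor of `"sell"`.

```python
def value_iteration(gamma, tol=1e-9, max_iters=10_000):
    """Value iteration for the stock MDP; returns (values, greedy policy) for s = 0..9."""
    def q(s, a, v):
        # One-step lookahead value of taking action a in state s.
        if a == "sell":
            return (1 + gamma * v[s - 1]) if s > 0 else gamma * v[0]
        return 100.0 if s == 9 else gamma * v[s + 1]  # v[10] is terminal, worth 0

    v = [0.0] * 11
    for _ in range(max_iters):
        new_v = [max(q(s, a, v) for a in ("sell", "buy")) for s in range(10)] + [0.0]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    policy = [max(("sell", "buy"), key=lambda a: q(s, a, v)) for s in range(10)]
    return v, policy

values, policy = value_iteration(gamma=0.0)
```

With `gamma=0.0` the agent only compares immediate rewards, so `policy[3]` and `policy[9]` show which single-step reward is larger in each of the two states the question asks about.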

### d
In the infinite-horizon setting with a discount factor $ \gamma $, is it possible to choose a constant $ \gamma \in (0, 1] $ such that the optimal policy, starting from $ s = 3 $, never fully stocks the inventory? If so, find a range of $ \gamma $ that meets this condition.