# Reinforcement Learning - Lab 2
### J. Martinet

## 1) Optional: finish the implementation of TicTacToe

Remember that the general definition of the TD update rule is:

$$ V(s_t) \leftarrow V(s_t) + \alpha \left[ V(s_{t+1}) - V(s_t) \right] $$

(We will come back to this later.)
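
As a minimal sketch of the rule above, assuming the value estimates are kept in a dictionary keyed by some hashable state label: the names `td_update`, `values`, `state`, and `next_state` are illustrative and are not taken from the lab's TicTacToe code.

```python
def td_update(values, state, next_state, alpha=0.1):
    """Move the estimate for `state` a fraction `alpha` toward that of `next_state`."""
    values[state] += alpha * (values[next_state] - values[state])
    return values[state]


# Toy usage with two hypothetical state labels.
V = {"s0": 0.5, "s1": 1.0}
td_update(V, "s0", "s1", alpha=0.1)
print(V["s0"])  # 0.5 + 0.1 * (1.0 - 0.5)
```

The step size `alpha` controls how far each estimate moves toward the later state's value; a small value gives slower but smoother learning.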

## 2) Exercises

### A) Examples
Devise three example tasks of your own that fit into the reinforcement learning framework, identifying for each its states, actions, and rewards. Make the three examples as different from each other as possible.
The framework is abstract and flexible and can be applied in many different ways. Stretch its limits in some way in at least one of your examples.

#### Your answer here:
...

Here are three diverse examples of tasks that fit into the reinforcement learning (RL) framework:

---

### 1. **Autonomous Drone Delivery System**  
**Domain**: Logistics and robotics.  

- **States**:  
  The drone's current position, battery level, wind conditions, and the location of packages and delivery points (e.g., a tuple: `(x, y, battery, wind)`).

- **Actions**:  
  Move up, down, left, right, hover, or return to base for recharging.

- **Rewards**:  
  - +10 for successfully delivering a package.  
  - -5 for running out of battery mid-air.  
  - -1 for unnecessary movement (to encourage efficiency).  
  - +2 for charging when battery is low but not empty.

This example tests the RL framework with continuous and dynamic environments involving physical constraints and uncertainty.
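
The reward scheme above could be written down as a small lookup, sketched here under the assumption that the environment reports a list of named events at each time step. The event names and the `drone_reward` function are hypothetical; only the reward magnitudes come from the list above.

```python
# Hypothetical event names; the reward magnitudes are the ones listed above.
REWARDS = {
    "delivered": 10,
    "battery_empty": -5,
    "unnecessary_move": -1,
    "charged_when_low": 2,
}


def drone_reward(events):
    """Sum the rewards for all events observed in one time step."""
    return sum(REWARDS.get(event, 0) for event in events)


# A step in which the drone delivers a package after one wasted move.
print(drone_reward(["unnecessary_move", "delivered"]))
```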

---

### 2. **Personalized Learning System**  
**Domain**: Education technology.  

- **States**:  
  The student’s current knowledge level in different topics, past quiz performance, and time spent studying. For instance, a vector `[0.7, 0.5, 0.9]` might represent proficiency levels in three topics.

- **Actions**:  
  Present a new question, suggest a revision video, give a quiz, or recommend taking a break.

- **Rewards**:  
  - +5 for improved quiz performance compared to a baseline.  
  - -2 for disengagement (e.g., skipping content).  
  - +3 for completing tasks successfully within a time frame.  

This example uses RL to adapt to human behavior, demonstrating how RL can work in personalized, user-centric domains.

---

### 3. **Alien Ecosystem Explorer**  
**Domain**: Speculative science fiction/environmental exploration.  

- **States**:  
  The current biome type (e.g., forest, desert, water), resource levels (e.g., oxygen, food, energy), and discovered alien species (e.g., a tuple: `(biome, oxygen, species_found)`).

- **Actions**:  
  Explore the current biome, move to a neighboring biome, collect resources, or attempt to interact with alien species.

- **Rewards**:  
  - +10 for discovering a new alien species.  
  - -10 for running out of oxygen.  
  - +5 for finding rare resources.  
  - -1 for revisiting a biome with no new discoveries.  

This task pushes the RL framework into speculative and exploratory domains, involving imaginative challenges and unknown dynamics.

---

### Key Insights:
1. **Autonomous Drone**: RL is applied to solve a real-world logistics problem with physical and dynamic constraints.
2. **Personalized Learning**: RL adapts to human-centric tasks requiring nuanced understanding of behavioral patterns.
3. **Alien Ecosystem Explorer**: RL stretches its application into a speculative domain, emphasizing exploration and discovery in unknown environments.

These examples highlight RL's broad applicability across different domains, from grounded real-world problems to imaginative scenarios.

### B) Agent-environment boundary
Consider the problem of driving. You could define the actions
in terms of the accelerator, steering wheel, and brake, that is, where your body meets the machine.

Or you could define them farther out -- say, where the rubber meets the road, considering your actions to be tire torques.

Or you could define them farther in -- say, where your brain meets your body, the actions being muscle twitches to control your limbs.

Or you could go to a really high level and say that your actions are your choices of where to drive.

What is the right level, the right place to draw the line between agent and environment? On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice?

#### Your answer here:
...

### C) Lazy robot
Imagine that you are designing a robot to run a maze. You decide
to give it a reward of +1 for escaping from the maze and a reward of zero at all other times.

The task seems to break down naturally into episodes (the
successive runs through the maze) so you decide to treat it as an episodic task, where the goal is to maximize expected total reward.

After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?
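
One way to probe this question is to compute the return the agent actually receives for short and long episodes. The helper below is only a sketch: `episode_return` is a hypothetical name, and the `gamma` parameter is an addition for experimentation (the task as stated uses the undiscounted total reward, i.e. `gamma = 1`).

```python
def episode_return(rewards, gamma=1.0):
    """Return the (optionally discounted) sum of a reward sequence."""
    g = 0.0
    # Accumulate backwards so each step applies one factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Reward 0 at every step, +1 on the step that escapes the maze.
fast_escape = [0] * 4 + [1]
slow_escape = [0] * 99 + [1]

print(episode_return(fast_escape), episode_return(slow_escape))
print(episode_return(fast_escape, gamma=0.9), episode_return(slow_escape, gamma=0.9))
```

Comparing the two printed lines shows how the return depends on episode length with and without discounting.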

#### Your answer here:
...



### D) Gridworld (from Sutton and Barto)

The figure below shows a rectangular gridworld representation of a simple finite MDP.

The cells of the grid correspond to the states of the environment.

At each cell, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction on the grid.

Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of -1.

Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of +10 and take the agent to A'. From state B, all actions yield a reward of +5 and take the agent to B'.

![Grid world](gridworld.png)

Suppose the agent selects all four actions with equal probability in all states. 

The right part of the figure shows the value function, $v_\pi$, for this policy in the discounted-reward case with $\gamma = 0.9$. This value function was computed by solving the system of linear Bellman equations, one per state.
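
The value function can also be approximated numerically. The sketch below uses iterative policy evaluation rather than a direct linear solve, and it assumes the layout of the standard Sutton and Barto figure, which is not spelled out in the text: a 5x5 grid with A at row 0, column 1 (A' at row 4, column 1) and B at row 0, column 3 (B' at row 2, column 3). All names are illustrative.

```python
# Assumed layout (standard Sutton and Barto figure, not stated in the text):
# 5x5 grid, rows and columns indexed from the top-left corner.
SIZE = 5
GAMMA = 0.9
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # north, south, east, west


def step(state, action):
    """Return (next_state, reward) for one deterministic move."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    row, col = state[0] + action[0], state[1] + action[1]
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return (row, col), 0.0
    return state, -1.0  # off the grid: location unchanged, reward -1


def evaluate_random_policy(tol=1e-8):
    """Iterative policy evaluation for the equiprobable random policy."""
    v = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    while True:
        delta = 0.0
        for s in v:
            new_value = 0.0
            for a in ACTIONS:
                s_next, reward = step(s, a)
                new_value += 0.25 * (reward + GAMMA * v[s_next])
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value  # in-place sweep
        if delta < tol:
            return v


values = evaluate_random_policy()
for r in range(SIZE):
    print(" ".join(f"{values[(r, c)]:5.1f}" for c in range(SIZE)))
```

If the assumed layout matches the figure, the printed grid should agree with the right part of the figure to one decimal place.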

Notice the negative values near the lower edge; these are the result of the high probability of hitting the edge of the grid there under the random policy. State A is the best state to be in under this policy, but its expected return is less than 10, its immediate reward, because from A the agent is taken to A', from which it is likely to run into the edge of the grid. State B, on the other hand, is valued more than 5, its immediate reward, because from B the agent is taken to B', which has a positive value. From B' the expected penalty (negative reward) for possibly running into an edge is more than compensated for by the expected gain for possibly stumbling onto A or B.

The Bellman equation must hold for each state for the value function $v_\pi$ shown in the figure (right). Show numerically that this equation holds for the center state, valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, -0.4, and +0.7. (These numbers are accurate only to one decimal place.)
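
A quick numerical check can be sketched as follows, assuming (as the text states) that each of the four actions has probability 1/4, yields reward 0 from the center state, and leads deterministically to one neighbor. The variable names are illustrative.

```python
GAMMA = 0.9
center_value = 0.7
neighbor_values = [2.3, 0.4, -0.4, 0.7]

# Right-hand side of the Bellman equation for the center state:
# sum over actions of pi(a|s) * (reward + gamma * v(next state)).
bellman_rhs = sum(0.25 * (0.0 + GAMMA * v) for v in neighbor_values)

print(round(bellman_rhs, 3), "vs", center_value)
```

Because the figure's values are rounded to one decimal place, the two numbers are expected to agree only to that precision.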


#### Your answer here:
...