## OpenAI Gym Taxi-v3
---
https://gym.openai.com/envs/Taxi-v3/

The **Taxi Problem** is taken from "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition" by Tom Dietterich.
<br><br>
**Description:**
<br><br>
There are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger's location, picks up the passenger, drives to the passenger's destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends.
<br><br>
**Observations:**
<br><br>
There are 500 discrete states: 25 taxi positions, 5 possible passenger locations (including the case when the passenger is in the taxi), and 4 destination locations, giving 25 × 5 × 4 = 500.
<br><br>    
**Passenger locations:**
- 0: R(ed)
- 1: G(reen)
- 2: Y(ellow)
- 3: B(lue)
- 4: in taxi
<br><br>

**Destinations:**
- 0: R(ed)
- 1: G(reen)
- 2: Y(ellow)
- 3: B(lue)
<br><br>        

**Actions:**
There are 6 discrete deterministic actions:
- 0: move south
- 1: move north
- 2: move east 
- 3: move west 
- 4: pickup passenger
- 5: dropoff passenger
<br><br>    

**Rewards:**
Each step yields a reward of -1 by default. Delivering the passenger yields +20, and executing a "pickup" or "dropoff" action illegally yields -10.
<br><br>    

**Rendering:**
- blue: passenger
- magenta: destination
- yellow: empty taxi
- green: full taxi
- other letters (R, G, Y and B): locations for passengers and destinations
<br><br>    

**State space:** each state is represented by the tuple `(taxi_row, taxi_col, passenger_location, destination)`.
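
The tuple above can be packed into a single integer in the range 0 to 499. The sketch below shows one way to do that with mixed-radix arithmetic. It appears to match the scheme used by Gym's Taxi implementation, but treat that as an assumption; the function names `encode` and `decode` are illustrative.

```python
GRID_SIZE = 5          # 5 x 5 grid, so 25 taxi positions
N_PASSENGER_LOCS = 5   # R, G, Y, B, or in the taxi
N_DESTINATIONS = 4     # R, G, Y, B


def encode(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into one integer in [0, 500)."""
    state = taxi_row
    state = state * GRID_SIZE + taxi_col
    state = state * N_PASSENGER_LOCS + passenger_location
    state = state * N_DESTINATIONS + destination
    return state


def decode(state):
    """Invert encode(): recover the tuple from the integer state."""
    state, destination = divmod(state, N_DESTINATIONS)
    state, passenger_location = divmod(state, N_PASSENGER_LOCS)
    taxi_row, taxi_col = divmod(state, GRID_SIZE)
    return taxi_row, taxi_col, passenger_location, destination


# Taxi at row 3, column 1; passenger at Y(ellow); destination R(ed).
example_state = encode(3, 1, 2, 0)
```

Every combination maps to a distinct integer, which is why a table with 500 rows is enough to index any state.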

---
### Value Iteration vs Q-Learning

In **Q-learning**, the agent does not know the state transition probabilities or rewards in advance (**model-free**). It discovers the reward for taking a given action in a given state only by taking that action and observing the result. Similarly, it finds out which transitions are possible from a state only by visiting that state and trying its actions. If transitions are stochastic, the agent never estimates the transition probabilities explicitly; instead, its Q-value estimates average over the transitions it samples, so more frequent outcomes carry more weight.
<br><br>
This matters if you want an agent that can enter a new situation with no prior knowledge and work out what to do. Alternatively, even if you don't care about the agent's ability to learn on its own, Q-learning may still be the practical choice when the state space is too large to enumerate repeatedly. Letting the agent explore without any starting knowledge can be more computationally tractable.
<br><br>
On the other hand, **value iteration** requires knowing the state transition probabilities and the reward for every transition (**model-based**). What is learned also differs. Value iteration computes the state value V(_x_): the expected discounted return from state _x_ when acting optimally. Q-learning estimates the action value Q(_x_, _a_): the expected discounted return from taking action _a_ in state _x_ and acting optimally afterwards.

#### Value iteration algorithm (model-based):
![image.png](attachment:image.png)
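
A minimal sketch of value iteration follows, run on a hypothetical three-state chain rather than on Taxi itself so that it stays self-contained. The model is a dictionary `P[state][action]` holding a list of `(probability, next_state, reward, done)` tuples. This layout is assumed to mirror the transition table exposed by Gym's toy-text environments. The names `q_value`, `value_iteration`, and `toy_P` are illustrative.

```python
def q_value(P, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(
        prob * (reward + (0.0 if done else gamma * V[s2]))
        for prob, s2, reward, done in P[s][a]
    )


def value_iteration(P, gamma=0.9, theta=1e-8):
    """Sweep all states until the largest value change falls below theta."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update, reused later in the same sweep
        if delta < theta:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, policy


# Hypothetical chain: action 0 stays put, action 1 advances.
# Each step costs -1; advancing from state 1 ends the episode with +20.
toy_P = {
    0: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 1, -1.0, False)]},
    1: {0: [(1.0, 1, -1.0, False)], 1: [(1.0, 2, 20.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
V, policy = value_iteration(toy_P)
```

On this chain the values settle at about 17 for state 0 and 20 for state 1, and the greedy policy advances in both. The algorithm reads every transition from `P` directly, which is what makes it model-based.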

#### Q-Learning algorithm (model-free):
![image.png](attachment:image.png)
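
A minimal sketch of tabular Q-learning with an epsilon-greedy policy follows. To stay self-contained it learns on a hypothetical three-state chain through a `step(state, action)` function that returns `(next_state, reward, done)`. That function stands in for an environment and is not Gym's API. The hyperparameter values and the names `q_learning` and `chain_step` are assumptions chosen for illustration.

```python
import random


def q_learning(step, n_states, n_actions, start_state=0, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    """Learn a Q-table purely from sampled transitions (model-free)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            # Epsilon-greedy: explore occasionally, otherwise act greedily.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2, r, done = step(s, a)
            # Bootstrapped target; no future value after a terminal step.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if done:
                break
    return Q


def chain_step(s, a):
    """Hypothetical chain: action 0 stays (-1), action 1 advances.
    Advancing from state 1 reaches terminal state 2 with reward +20."""
    if a == 1:
        s2 = s + 1
        done = s2 == 2
        return s2, (20.0 if done else -1.0), done
    return s, -1.0, False


Q = q_learning(chain_step, n_states=3, n_actions=2)
greedy = [max(range(2), key=lambda i: Q[s][i]) for s in range(2)]
```

The agent only ever calls `chain_step`; it never inspects the transition rules, in contrast to value iteration, which needs the full model up front.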

