# Summary of Key Concepts in Reinforcement Learning and Markov Decision Processes (MDP)

## 1. Reinforcement Learning Framework
- **Components**:
  - **States**: Represent different situations or positions in a problem.
  - **Actions**: Choices the agent can make to transition between states.
  - **Rewards**: Feedback signal indicating the immediate benefit of an action.
  - **Discount Factor (γ)**: A number (0 ≤ γ ≤ 1) determining how much future rewards are valued compared to immediate rewards.
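  - **Return**: The cumulative reward collected over time, with each later reward discounted by a further factor of γ: R₁ + γR₂ + γ²R₃ + …
  - **Policy (π)**: A function that maps each state to the action to take in it. The goal is to find a policy that maximizes the return.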
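
The definition of the return can be sketched in a few lines of Python. The function name `discounted_return` and the reward list are illustrative assumptions, not part of the original notes.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step t by gamma**t (t starts at 0)."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Example: three zero rewards followed by a reward of 100, with gamma = 0.5.
# 0 + 0.5*0 + 0.25*0 + 0.125*100
print(discounted_return([0, 0, 0, 100], gamma=0.5))  # 12.5
```

With γ = 1 the function reduces to a plain sum; the smaller γ is, the less a distant reward contributes.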

## 2. Mars Rover Example
- **States**: Six states numbered 1 to 6.
- **Actions**: "Go left" or "go right".
- **Rewards**: 
  - 100 for the leftmost state.
  - 40 for the rightmost state.
  - 0 for intermediate states.
- **Discount Factor (γ)**: 0.5.
- **Return**: The sum of the rewards along the path taken, with each successive reward weighted by a further factor of 0.5.
- **Policy (π)**: Picks "go left" or "go right" in each state; the aim is to find the policy that yields the highest return.
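
A minimal sketch of how returns could be computed for this example. It rests on assumptions the notes do not state: states 1 and 6 end the episode, moves are deterministic, and the reward of the current state (including the starting state) is counted before moving. The names `rollout_return`, `always_left`, and `always_right` are hypothetical.

```python
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL_STATES = {1, 6}  # assumption: reaching either end stops the episode
GAMMA = 0.5


def rollout_return(start_state, policy, gamma=GAMMA, max_steps=100):
    """Follow `policy` from `start_state` and return the discounted sum of rewards."""
    state = start_state
    total = 0.0
    discount = 1.0
    for _ in range(max_steps):  # cap guards against policies that never terminate
        total += discount * REWARDS[state]
        if state in TERMINAL_STATES:
            break
        state += -1 if policy(state) == "left" else 1
        discount *= gamma
    return total


def always_left(state):
    return "left"


def always_right(state):
    return "right"


# Compare the two fixed policies from every starting state.
for s in range(1, 7):
    print(s, rollout_return(s, always_left), rollout_return(s, always_right))
```

Under these assumptions, starting in state 4 gives a return of 12.5 when always going left and 10.0 when always going right, so the larger but more distant reward wins from that state.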

## 3. Application of Reinforcement Learning to Other Domains

### Autonomous Helicopter
- **States**: Positions, orientations, speeds, etc., of the helicopter.
- **Actions**: Possible ways to move the helicopter's control stick.
- **Rewards**:
  - +1 if the helicopter flies well.
  - -1,000 if it crashes.
- **Discount Factor (γ)**: Typically a high value like 0.99.
- **Return**: The discounted sum of the rewards received over the flight.
- **Policy (π)**: Maps the helicopter's current state to a control stick movement; a good policy keeps it flying well.

### Chess
- **States**: Positions of pieces on the board (simplified representation).
- **Actions**: All possible legal moves in the game.
- **Rewards**:
  - +1 for a win.
  - -1 for a loss.
  - 0 for a draw.
- **Discount Factor (γ)**: Very close to 1, such as 0.99, 0.995, or 0.999.
- **Return**: The discounted sum of rewards, as in the other applications.
- **Policy (π)**: Selects the best move given a board position.
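
One way to see why γ is kept close to 1 here: the only non-zero reward arrives at the end of the game, so its weight in the return is γ raised to the number of steps taken. The sketch below assumes an illustrative horizon of 50 steps; the function name `terminal_weight` is hypothetical.

```python
def terminal_weight(gamma, steps):
    """Weight applied to a reward that arrives `steps` steps in the future."""
    return gamma ** steps


# Compare how much a reward 50 steps away still counts under different discount factors.
for gamma in (0.5, 0.99, 0.999):
    print(gamma, terminal_weight(gamma, 50))
```

With γ = 0.5 the final outcome is weighted at essentially zero after 50 steps, while with γ = 0.99 it keeps more than half of its value.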

## 4. Markov Decision Process (MDP)

<img src="./images/markovdp.png" width="500">

- **Definition**: A formalism for reinforcement learning problems with states, actions, rewards, and policies.
- **Markov Property**: The future depends only on the current state and not on the sequence of states leading up to it.
- **Key Components**:
  - **Agent**: Chooses actions based on the policy (π).
  - **Environment**: Changes state and provides rewards based on the agent's actions.
  - **Cycle**:
    1. Agent chooses an action.
    2. Environment transitions to a new state and provides a reward.
    3. Agent observes the new state and reward.
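
The cycle above can be sketched as a simple loop. The class `CorridorEnv`, its `step` method, and the choice to hand out the reward of the state being entered are all assumptions made for illustration; the layout borrows the six states and rewards of the Mars rover example.

```python
class CorridorEnv:
    """Hypothetical six-state corridor; reward is given for the state entered (assumption)."""

    REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}

    def __init__(self, start_state):
        # start_state is assumed to be a non-terminal state (2 to 5).
        self.state = start_state

    def step(self, action):
        """Apply the action and return (new_state, reward, done)."""
        self.state += -1 if action == "left" else 1
        reward = self.REWARDS[self.state]
        done = self.state in (1, 6)
        return self.state, reward, done


def run_episode(env, policy, max_steps=20):
    """Run the agent-environment cycle and record (state, action, reward, next_state)."""
    trajectory = []
    state = env.state
    for _ in range(max_steps):
        action = policy(state)                        # 1. agent chooses an action
        next_state, reward, done = env.step(action)   # 2. environment transitions and rewards
        trajectory.append((state, action, reward, next_state))
        state = next_state                            # 3. agent observes the new state
        if done:
            break
    return trajectory


def go_left(state):
    # The policy looks only at the current state, in line with the Markov property.
    return "left"


for transition in run_episode(CorridorEnv(start_state=4), go_left):
    print(transition)
```

Because the policy receives only the current state, nothing about the earlier history is needed to choose the next action.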

## 5. Diagram Representation
- The agent-environment interaction is usually drawn as a loop: one arrow carries the chosen action from the agent to the environment, and arrows carry the new state and the reward back to the agent.

## 6. Next Steps
- **State-Action Value Function**: Commonly written Q(s, a); a key concept for developing algorithms that select optimal actions.
- **Goal**: Learn how to define and compute the state-action value function, which is foundational for reinforcement learning algorithms.