# Q-Learning Algorithm

---

## Theory

Q-Learning is a **model-free Reinforcement Learning algorithm** used to find the **optimal action-value function (Q-function)** that helps an agent make decisions to maximize long-term rewards.

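Unlike model-based methods such as value iteration, which need the transition probabilities and reward function of the underlying Markov Decision Process (MDP), Q-Learning does not require prior knowledge of the environment’s dynamics. It learns purely from experience through trial and error.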

---

## Key Concept

The algorithm learns the **Q-value** for each state-action pair, which represents the expected cumulative discounted reward of taking a specific action in a given state and then following the optimal policy thereafter.

The **Q-value update rule** is given by:

\[
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
\]

Where:

| Symbol | Meaning |
|:--------|:--------|
| **s** | Current state |
| **a** | Current action |
| **R** | Reward received after taking action *a* in state *s* |
| **s′** | Next state |
| **a′** | Any action available in *s′* (the max is taken over these) |
| **α** | Learning rate, 0 < α ≤ 1 (how much new information overrides old) |
| **γ** | Discount factor, 0 ≤ γ ≤ 1 (importance of future rewards) |
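
To make the update rule concrete, here is a minimal sketch that applies it twice to a single Q-table entry. The function name `q_update`, the variable names `td_target` and `td_error` (standard terminology, though not used in the text above), and the numeric values are all chosen only for illustration.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Apply one Q-learning update to a single Q(s, a) entry."""
    td_target = reward + gamma * max_q_next  # R + gamma * max_a' Q(s', a')
    td_error = td_target - q_sa              # how far the old estimate is off
    return q_sa + alpha * td_error           # move a fraction alpha toward the target


alpha, gamma = 0.8, 0.9  # illustrative hyperparameters

# The entry starts at 0; the action yields reward 1 and leads to a state
# whose Q-values are all 0 (for example, a terminal state).
q = 0.0
q = q_update(q, reward=1.0, max_q_next=0.0, alpha=alpha, gamma=gamma)
print(q)  # 0.8

# Repeating the same transition closes most of the remaining gap to 1.
q = q_update(q, reward=1.0, max_q_next=0.0, alpha=alpha, gamma=gamma)
print(round(q, 4))  # 0.96
```

Each update shrinks the gap between the estimate and the target by a factor of (1 − α), which is why a larger learning rate converges faster on a deterministic problem like this one.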

---

## Steps of the Algorithm

1. Initialize the Q-table with zeros for all state-action pairs.  
2. For each episode:
   - Reset the agent to a starting state **s**.  
   - Repeat until a terminal state is reached:
     - Choose an action **a** using an exploration strategy (like ε-greedy).  
     - Perform the action, observe the **reward (R)** and **next state (s′)**.  
     - Update the Q-value using the update rule.  
     - Move to the next state (s ← s′).  
3. Repeat until convergence, that is, until the Q-values stabilize or the maximum number of episodes is reached.  

---

## Exploration vs Exploitation

- **Exploration:** Take random actions to discover rewards.  
- **Exploitation:** Take the best-known action according to current Q-values.  
- Typically managed using an **ε-greedy policy**, where ε is the probability of exploring.
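
The ε-greedy rule can be sketched as a small helper. The function name `epsilon_greedy`, the sample Q-values, and the random tie-breaking among equally good actions are assumptions made for this illustration; a plain `np.argmax` would instead always pick the first of any tied actions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible


def epsilon_greedy(q_row, epsilon, rng):
    """Return an action index for one state, given that state's row of Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))       # explore: uniform random action
    best = np.flatnonzero(q_row == np.max(q_row))  # every action tied for the best value
    return int(rng.choice(best))                   # exploit, breaking ties at random


q_row = np.array([0.2, 0.7, 0.7])  # hypothetical Q-values for three actions
picks = [epsilon_greedy(q_row, epsilon=0.1, rng=rng) for _ in range(1000)]
counts = np.bincount(picks, minlength=3)
print(counts)  # action 0 is only ever chosen during exploration, so its count is small
```

With ε = 0 the helper is purely greedy, and with ε = 1 it ignores the Q-values entirely. A common refinement is to start with a large ε and reduce it over the course of training.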

---

## Goal

To learn the **optimal policy (π\*)**, which maximizes the **expected cumulative reward** over time:

\[
\pi^*(s) = \arg\max_a Q(s, a)
\]

Once the Q-table converges, the agent can select the **best action for each state** to maximize rewards.
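
As a minimal sketch of this step, the greedy policy can be read off a Q-table with one `argmax` per row. The 3-state, 2-action table below is hypothetical, and `state_values` (the value of each state under the greedy policy) is an extra quantity introduced here for illustration.

```python
import numpy as np

# Hypothetical Q-table: 3 states (rows) x 2 actions (columns)
Q = np.array([
    [0.1, 0.5],
    [0.7, 0.2],
    [0.0, 0.0],
])

policy = Q.argmax(axis=1)     # pi*(s) = argmax_a Q(s, a), one action index per state
state_values = Q.max(axis=1)  # value of each state when acting greedily

print(policy)  # [1 0 0]
print(state_values)
```

Note that `argmax` returns the first index when a row is tied, so the all-zero last row maps to action 0 even though neither action is actually preferred.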

---

## Applications

- Robotics navigation and control  
- Game AI (chess, tic-tac-toe, gridworld)  
- Autonomous vehicles  
- Inventory and supply chain optimization  
- Finance and trading  

---

## Advantages

- Simple and effective for **discrete state-action spaces**  
- **Model-free**: no need for environment probabilities  
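- Converges to the **optimal policy** in the tabular setting, provided every state-action pair keeps being visited and the learning rate is decayed appropriately  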

---

## Limitations

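- Does not scale well to **large or continuous state spaces**, because the Q-table needs one entry per state-action pair  
- Convergence can be slow if the **hyperparameters** (α, γ, ε) are poorly chosen  
- Requires an **exploration strategy**; a purely greedy agent can settle on a suboptimal policy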


```python
# ==============================
# Q-Learning Algorithm - Simple GridWorld Example
# ==============================

import random

import numpy as np

# 1. Define environment (a one-dimensional GridWorld corridor)
# States: 0 to 5, goal state = 5
n_states = 6
actions = [0, 1]  # 0 = left, 1 = right
goal_state = 5

# Rewards: only entering the goal state gives reward 1
R = np.zeros(n_states)
R[goal_state] = 1

# Q-table initialization: one row per state, one column per action
Q = np.zeros((n_states, len(actions)))

# Hyperparameters
alpha = 0.8       # learning rate
gamma = 0.9       # discount factor
epsilon = 0.2     # exploration probability
episodes = 50

# 2. Q-Learning Algorithm
for ep in range(episodes):
    state = 0  # every episode starts in state 0
    done = False

    while not done:
        # Choose action (epsilon-greedy)
        if random.random() < epsilon:
            action = random.choice(actions)     # explore
        else:
            action = int(np.argmax(Q[state]))   # exploit (ties resolve to action 0)

        # Take action; moves are clipped at both ends of the corridor
        if action == 0:
            next_state = max(0, state - 1)  # move left
        else:
            next_state = min(n_states - 1, state + 1)  # move right

        reward = R[next_state]
        done = (next_state == goal_state)

        # Update Q-value
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )

        # Move to next state
        state = next_state

# 3. Display results
print("Final Q-Table:")
for s in range(n_states):
    print(f"State {s}: {Q[s]}")

# 4. Derive the greedy policy from the learned Q-table
policy = []
for s in range(n_states):
    best_action = np.argmax(Q[s])
    policy.append("Right" if best_action == 1 else "Left")

print("\nOptimal Policy:")
for s in range(n_states):
    print(f"State {s}: {policy[s]}")
```


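Sample output from one run (no random seed is set, so the exact values differ between runs):

```text
Final Q-Table:
State 0: [0.59031642 0.6561    ]
State 1: [0.58576608 0.729     ]
State 2: [0.6560916 0.81     ]
State 3: [0.72895318 0.9       ]
State 4: [0.80999992 1.        ]
State 5: [0. 0.]

Optimal Policy:
State 0: Right
State 1: Right
State 2: Right
State 3: Right
State 4: Right
State 5: Left
```

State 5 is the terminal state, so its Q-values are never updated. The `Left` entry printed for it is only `np.argmax` returning index 0 on a tied row, not a learned preference.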
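
As a rough sanity check on the table above, the optimal Q-values for this toy corridor can also be computed directly, because its dynamics are fully known. The sketch below swaps in a different technique: it repeatedly applies the Bellman optimality backup (a model-based, value-iteration style computation) rather than Q-Learning. The names `step` and `Q_star` are hypothetical helpers for this check.

```python
import numpy as np

n_states, goal_state, gamma = 6, 5, 0.9


def step(state, action):
    """Deterministic dynamics copied from the example: 0 = left, 1 = right."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(n_states - 1, state + 1)
    reward = 1.0 if next_state == goal_state else 0.0
    return next_state, reward


# Apply the Bellman optimality backup until the table stops changing.
Q_star = np.zeros((n_states, 2))
for _ in range(100):
    Q_new = np.zeros_like(Q_star)
    for s in range(n_states - 1):  # the goal state is terminal, so its row stays 0
        for a in (0, 1):
            s_next, r = step(s, a)
            Q_new[s, a] = r + gamma * Q_star[s_next].max()
    if np.allclose(Q_new, Q_star):
        break
    Q_star = Q_new

print(np.round(Q_star, 4))
```

In the right-hand column, the value for state s works out to γ^(4 − s), for example 0.9⁴ = 0.6561 for state 0, which matches the learned table. The left-hand column of the learned table sits slightly below its exact counterpart (0.9 × 0.6561 = 0.59049 for state 0), which is consistent with those entries being visited less often in 50 episodes.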
