# Notebook 4: Reinforcement Learning

Welcome to the fourth notebook in our advanced machine learning series under **Part_3_Advanced_Topics**. In this notebook, we will explore **Reinforcement Learning (RL)**, a paradigm of machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards.

We'll cover the following topics:
- What is Reinforcement Learning?
- Key concepts: Agent, Environment, Reward, and Policy
- How Reinforcement Learning works
- Implementation of a basic Q-Learning algorithm
- Advantages and limitations

## What is Reinforcement Learning?

Reinforcement Learning is a type of machine learning in which an agent learns good behavior through trial and error by interacting with an environment. Unlike supervised learning, which trains a model on labeled data, or unsupervised learning, which finds patterns in unlabeled data, RL learns a strategy (a policy) that maximizes cumulative long-term reward.

RL is widely used in robotics, game playing (e.g., AlphaGo), recommendation systems, and autonomous systems.

## Key Concepts

- **Agent:** The learner or decision-maker that interacts with the environment (e.g., a robot or game player).
- **Environment:** The external system with which the agent interacts, providing states, rewards, and responses to actions (e.g., a game board or physical space).
- **State (S):** A representation of the current situation or configuration of the environment at a given time.
- **Action (A):** A decision or move made by the agent that affects the environment.
- **Reward (R):** A numerical value given by the environment to the agent as feedback on the desirability of an action in a given state.
- **Policy (π):** A strategy or mapping from states to actions that the agent uses to decide what to do. The goal is to learn an optimal policy.
- **Value Function (V or Q):** Estimates the expected cumulative (typically discounted) future reward from being in a state (V) or from taking a particular action in a state (Q). These estimates guide the agent's decisions.
- **Exploration vs. Exploitation:** The trade-off between trying new actions to discover better rewards (exploration) and sticking to known rewarding actions (exploitation).
- **Q-Learning:** A model-free, off-policy RL algorithm that learns an action-value function, stored as a Q-table when states and actions are discrete, and acts by choosing the highest-valued action in each state.
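
To make "cumulative reward" concrete, here is a minimal sketch of the discounted return that a value function estimates in expectation. The function name `discounted_return` and the example reward sequence are illustrative assumptions rather than part of the notebook's implementation; the discount factor `gamma=0.9` mirrors the default used in the Q-Learning code later on.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum over t of gamma**t * rewards[t] for one episode's rewards."""
    total = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical episode: two ordinary steps (-1 each), then the goal (+100)
episode_rewards = [-1, -1, 100]
print(discounted_return(episode_rewards))  # approximately 79.1
```

Because of discounting, the +100 received two steps in the future contributes only 0.9² × 100 = 81 to the return from the first state, which is why shorter paths to the goal are valued more highly.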

## How Reinforcement Learning Works

Reinforcement Learning operates on the following iterative process, often modeled as a Markov Decision Process (MDP):

1. **Initialization:** Start with an initial state and a policy (often random) or value function.
2. **Interaction:** The agent observes the current state of the environment.
3. **Action Selection:** Based on the policy, the agent chooses an action (balancing exploration and exploitation).
4. **Environment Response:** The environment transitions to a new state and provides a reward based on the action taken.
5. **Learning Update:** The agent updates its knowledge (e.g., Q-values) using the reward and new state information to improve future decisions.
6. **Repeat:** Continue steps 2-5 for many episodes or until a termination condition is met (e.g., reaching a goal state or maximum iterations).
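7. **Convergence:** With enough exploration, the agent's estimates improve and its policy approaches one that maximizes cumulative reward. For tabular Q-Learning, convergence to the optimal action values is guaranteed only under conditions such as every state-action pair being visited infinitely often and a suitably decaying learning rate.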
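
Step 5 is where learning actually happens. As a sketch, the tabular Q-Learning update used in the implementation section can be applied by hand to a single transition. The helper name `q_update` and the numbers below are assumptions chosen purely for illustration.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update for a single (state, action) pair."""
    td_target = reward + gamma * next_max  # Bootstrapped estimate of the return
    td_error = td_target - old_value       # How far off the current estimate is
    return old_value + alpha * td_error    # Move a fraction alpha toward the target

# Hypothetical transition: step reward -1, best next-state value 10, current estimate 0
new_value = q_update(old_value=0.0, reward=-1, next_max=10.0)
print(round(new_value, 4))  # 0.8
```

The estimate moves only 10% of the way toward the target (−1 + 0.9 × 10 = 8), so many visits to the same state-action pair are needed before the value settles.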

## Implementation of a Basic Q-Learning Algorithm

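Let's implement a simple Q-Learning algorithm to solve a grid world problem. The agent navigates a 5×5 grid from the top-left start to the bottom-right goal while avoiding obstacles. Each ordinary step costs -1, bumping into an obstacle costs -10, and bumping into a wall costs -5 (the agent stays in place in both cases); reaching the goal earns +100 and ends the episode.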

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt

# Define the grid world environment
class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.grid = np.zeros((size, size))
        self.start = (0, 0)
        self.goal = (size-1, size-1)
        self.obstacles = [(1, 1), (2, 2), (3, 1)]  # Positions to avoid
        self.state = self.start
        self.actions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # Right, Down, Left, Up
        
    def reset(self):
        self.state = self.start
        return self.state
    
    def step(self, action):
        next_state = (self.state[0] + action[0], self.state[1] + action[1])
        # Check boundaries
        if (0 <= next_state[0] < self.size) and (0 <= next_state[1] < self.size):
            if next_state in self.obstacles:
                reward = -10  # Penalty for hitting an obstacle
                done = False
            else:
                self.state = next_state
                if self.state == self.goal:
                    reward = 100  # Reward for reaching the goal
                    done = True
                else:
                    reward = -1  # Small penalty for each step to encourage efficiency
                    done = False
        else:
            reward = -5  # Penalty for hitting a wall
            done = False
        return self.state, reward, done
    
    def render(self):
        grid = np.zeros((self.size, self.size))
        for obs in self.obstacles:
            grid[obs] = -1  # Obstacle
        grid[self.goal] = 2  # Goal
        grid[self.state] = 1  # Agent position
        return grid

# Q-Learning Algorithm
def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    n_actions = len(env.actions)
    q_table = np.zeros((env.size, env.size, n_actions))  # State-action value table
    rewards_per_episode = []
    
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        done = False
        
        while not done:
            # Epsilon-greedy policy for exploration vs exploitation
            if np.random.random() < epsilon:
                action_idx = np.random.randint(n_actions)  # Explore
            else:
                action_idx = np.argmax(q_table[state[0], state[1]])  # Exploit
            
            action = env.actions[action_idx]
            next_state, reward, done = env.step(action)
            total_reward += reward
            
            # Q-Learning update rule: Q(s,a) <- Q(s,a) + alpha * [reward + gamma * max(Q(s',a')) - Q(s,a)]
            old_value = q_table[state[0], state[1], action_idx]
            next_max = np.max(q_table[next_state[0], next_state[1]])
            new_value = old_value + alpha * (reward + gamma * next_max - old_value)
            q_table[state[0], state[1], action_idx] = new_value
            
            state = next_state
        
        rewards_per_episode.append(total_reward)
        # Decay epsilon to reduce exploration over time
        epsilon = max(0.01, epsilon * 0.995)
    
    return q_table, rewards_per_episode

# Visualize the learned policy
def visualize_policy(env, q_table):
    policy = np.argmax(q_table, axis=2)
    action_names = ['Right', 'Down', 'Left', 'Up']
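    env.reset()  # Training leaves env.state at the goal; reset so the agent marker does not overwrite it
    grid = env.render()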
    
    plt.figure(figsize=(8, 8))
    plt.imshow(grid, cmap='coolwarm', interpolation='nearest')
    for i in range(env.size):
        for j in range(env.size):
            if (i, j) == env.goal:
                plt.text(j, i, 'Goal', ha='center', va='center', color='white')
            elif (i, j) in env.obstacles:
                plt.text(j, i, 'Obs', ha='center', va='center', color='white')
            elif (i, j) == env.start:
                plt.text(j, i, 'Start', ha='center', va='center', color='black')
            else:
                action_idx = policy[i, j]
                plt.text(j, i, action_names[action_idx], ha='center', va='center', color='black')
    plt.title('Learned Policy in Grid World')
    plt.colorbar(label='Grid Values')
    plt.show()

# Run the Q-Learning experiment
env = GridWorld(size=5)
q_table, rewards = q_learning(env, episodes=1000)

# Plot the learning curve (rewards over episodes)
plt.figure(figsize=(10, 6))
plt.plot(rewards, label='Total Reward per Episode')
plt.title('Learning Curve: Total Reward Over Episodes')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.legend()
plt.grid(True)
plt.show()

# Visualize the learned policy
visualize_policy(env, q_table)

# Demonstrate the agent following the learned policy
print('Demonstrating the learned policy:')
state = env.reset()
done = False
path = [state]
total_reward = 0

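max_steps = env.size * env.size * 4  # Safety cap in case the greedy policy cycles
steps = 0
while not done and steps < max_steps:
    action_idx = np.argmax(q_table[state[0], state[1]])
    action = env.actions[action_idx]
    state, reward, done = env.step(action)
    path.append(state)
    total_reward += reward
    steps += 1
    print(f'State: {state}, Reward: {reward}')

if not done:
    print(f'Stopped after {max_steps} steps without reaching the goal.')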

print(f'Total Reward for the path: {total_reward}')
print(f'Path taken: {path}')
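
Per-episode rewards are noisy because of ε-greedy exploration, so a moving average often makes the learning trend easier to read. The helper below is a minimal sketch: `moving_average` and the window size are assumptions, and the short list stands in for the `rewards` list returned by `q_learning`.

```python
import numpy as np

def moving_average(values, window=50):
    """Smooth a 1-D sequence with a simple sliding mean of the given window."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    # mode='valid' keeps only positions where the full window fits
    return np.convolve(values, kernel, mode='valid')

# Stand-in data; in the notebook, pass the `rewards` list instead
sample_rewards = [-40, -30, -10, 50, 80, 90]
smoothed = moving_average(sample_rewards, window=3)
print(smoothed)
```

To overlay the result on the learning curve, plot `moving_average(rewards, window)` against `range(window - 1, len(rewards))` so the smoothed line aligns with the episode axis.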

## Advantages and Limitations

**Advantages:**
- Can solve complex decision-making problems where the optimal strategy is not known in advance.
- Adapts to dynamic environments through continuous learning from interactions.
- Has achieved strong results in game playing (e.g., AlphaGo) and is actively applied in robotics and autonomous-driving research.

**Limitations:**
- Requires a large number of interactions with the environment to learn, making it computationally expensive and slow for real-world applications.
- Sensitive to the design of the reward function; poor design can lead to unintended behaviors.
- Exploration-exploitation trade-off can be challenging to balance, potentially leading to suboptimal policies.
- Struggles with high-dimensional state or action spaces without advanced techniques like deep reinforcement learning.
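
One common way to manage the exploration-exploitation balance is to decay ε over time, as the implementation above does with a factor of 0.995 and a floor of 0.01. The sketch below counts how many episodes that schedule needs to reach the floor; the function name `episodes_until_floor` is an assumption for illustration.

```python
def episodes_until_floor(epsilon=0.1, decay=0.995, floor=0.01):
    """Count decay steps until epsilon = max(floor, epsilon * decay) hits the floor."""
    if not 0 < decay < 1:
        raise ValueError("decay must be strictly between 0 and 1")
    episodes = 0
    while epsilon > floor:
        epsilon = max(floor, epsilon * decay)
        episodes += 1
    return episodes

print(episodes_until_floor())  # 460
```

With the notebook's defaults, ε reaches its 1% floor after 460 episodes, so a little over half of a 1000-episode run explores at the minimum rate. A slower decay or a higher floor keeps exploration alive longer at the cost of noisier late-stage rewards.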

## Conclusion

Reinforcement Learning offers a powerful framework for learning optimal decision-making strategies through interaction with an environment. While basic algorithms like Q-Learning are effective for small, discrete problems like grid worlds, scaling RL to complex tasks often requires advanced methods such as Deep Q-Networks (DQN) or policy gradient methods. Understanding the core concepts of agents, rewards, and policies lays the foundation for tackling real-world challenges with RL.

This concludes our series on advanced machine learning topics. We hope these notebooks have expanded your toolkit and provided practical insights into specialized areas of machine learning.