<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Reinforcement/theory-mba.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning: The Basics

This notebook introduces the fundamental concepts of Reinforcement Learning (RL), how it works, and some real-world applications.

[Gymnasium (formerly Gym) documentation](https://www.gymlibrary.dev/content/tutorials/) - Python library for RL environments

## 1. What is Reinforcement Learning?

Reinforcement Learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. Think of it like training a dog: when the dog does something good, you give it a treat (reward), and over time it learns which actions lead to treats.

In RL, our agent (like the dog) takes actions in an environment. After each action, it receives:
1. A new state (situation)
2. A reward (feedback on how good/bad the action was)

The goal is to learn which actions to take in different situations to maximize the total reward collected over time.

![RL Basic Concept](https://miro.medium.com/max/1400/1*Z2yMvuQ1-t5Ol1ac_W4dOQ.png)

### 1.1 The RL Framework

The basic structure of reinforcement learning includes:

```mermaid
graph LR
    A[Agent] --Takes action--> B[Environment]
    B --Gives new state--> A
    B --Gives reward--> A
```

Key elements:
- **States**: Different situations our agent can be in
- **Actions**: Things our agent can do
- **Rewards**: Feedback on how good an action was
- **Policy**: The strategy our agent uses to decide which action to take

## 2. Key Components of Reinforcement Learning

| Component | Simple Explanation | Example |
| --- | --- | --- |
| Agent | The learner or decision maker | A robot, game player, or stock trading algorithm |
| Environment | Everything the agent interacts with | A game world, physical space, or market |
| State | Current situation of the agent | Position in a maze, cards in a poker hand |
| Action | Moves the agent can make | Move left/right, buy/sell a stock |
| Reward | Feedback from the environment | Points in a game, profit in trading |
| Policy | Strategy to decide actions | "When in state X, take action Y" |
| Value | Expected future rewards | How good it is to be in a particular state |

## 3. How Reinforcement Learning Works

```mermaid
flowchart LR
    A[Agent sees state] --> B[Agent chooses action]
    B --> C[Environment changes]
    C --> D[Agent gets reward]
    D --> E[Agent sees new state]
    E --> B
```

The basic process works like this:

1. The agent observes its current state
2. Based on this state, the agent selects an action
3. The environment changes in response to this action
4. The agent receives a reward and observes its new state
5. The agent learns from this experience to make better decisions in the future
6. The process repeats, and the agent gets better over time
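
The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CorridorEnv`, its five-cell layout, and its reward values are hypothetical, and the agent acts at random, so the learning step (step 5) is left out.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: cells 0..4 in a row, goal at cell 4."""

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 0 moves left, action 1 moves right; the walls clamp the position.
        move = 1 if action == 1 else -1
        self.position = min(4, max(0, self.position + move))
        done = self.position == 4
        reward = 1.0 if done else 0.0
        return self.position, reward, done

def run_episode(env, rng, max_steps=1000):
    state = env.reset()                          # 1. observe the current state
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = rng.choice([0, 1])              # 2. select an action (random here)
        state, reward, done = env.step(action)   # 3-4. environment changes, returns reward and new state
        total_reward += reward
        steps += 1
        if done:                                 # 6. otherwise repeat
            break
    return total_reward, steps

total_reward, steps = run_episode(CorridorEnv(), random.Random(0))
print(f"episode finished after {steps} steps with total reward {total_reward}")
```

A learning agent would replace the random choice with a policy and update that policy after every step.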

## 4. Types of Reinforcement Learning Algorithms

There are several different ways to approach reinforcement learning:

### Value-Based Methods
- Learn how good each state or action is (the "value")
- Example: Q-learning where we build a table of state-action values

### Policy-Based Methods
- Directly learn what action to take in each state (the "policy")
- Example: Policy Gradients that learn to select actions without needing values

### Model-Based Methods
- Build a model of how the environment works and use it to plan ahead
- Example: AlphaZero, which uses a model of the game (its known rules) to simulate future moves with tree search

### Combined Methods
- Actor-Critic: Uses both policy (actor) and value (critic) components
- Deep RL: Uses neural networks to handle complex states like images

## 5. Q-Learning: A Simple RL Algorithm

Q-Learning is one of the most basic and popular reinforcement learning algorithms. It's a good place to start understanding RL.

### The Basic Idea:

1. Build a table (Q-table) with rows for states and columns for actions
2. Fill the table with values representing how good each action is in each state
3. Update these values as the agent interacts with the environment

### The Q-Learning Process:

1. Start with a Q-table filled with zeros
2. For each episode of training:
   - Start in an initial state
   - While not at a terminal state:
     - Choose an action (using exploration vs. exploitation)
     - Take the action, observe reward and new state
     - Update Q-value using the formula
     - Move to new state

### The Q-Learning Update Formula (simplified):

Q[state, action] = Q[state, action] + learning_rate * (reward + discount_factor * max(Q[new_state]) - Q[state, action])
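
To make the formula concrete, here is a single update worked through in Python. The function name `q_update` and all the numbers are illustrative assumptions, not values from a real training run.

```python
def q_update(q, reward, max_next_q, learning_rate, discount_factor):
    """Return the updated Q-value for one (state, action) pair."""
    # Target: the reward just received plus the discounted best value of the new state
    target = reward + discount_factor * max_next_q
    # Move the old estimate a fraction (learning_rate) of the way toward the target
    return q + learning_rate * (target - q)

# Old estimate 0.0, step penalty -0.1, best Q-value in the new state 2.0
new_q = q_update(q=0.0, reward=-0.1, max_next_q=2.0,
                 learning_rate=0.5, discount_factor=0.9)
print(round(new_q, 3))   # 0.85
```

The target is -0.1 + 0.9 * 2.0 = 1.7, and with a learning rate of 0.5 the estimate moves halfway from 0.0 to 1.7.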

### 5.1 Q-Learning Example

Let's imagine teaching an agent to navigate a simple 3x3 grid world:

```
+---+---+---+
| S |   |   |
+---+---+---+
|   | X |   |
+---+---+---+
|   |   | G |
+---+---+---+
```

Where:
- S = Start position
- G = Goal (reward +10)
- X = Obstacle (can't go here)
- Each move has a small penalty (-0.1) to encourage finding the shortest path
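
One possible way to encode this grid world is sketched below. Several details are assumptions, since the description does not fix them: states are `(row, column)` pairs with `(0, 0)` at the top-left, a move into a wall or the obstacle leaves the agent in place, and the move that reaches the goal earns +10 without the step penalty. The names `grid_step` and `MOVES` are hypothetical.

```python
START, GOAL, OBSTACLE = (0, 0), (2, 2), (1, 1)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def grid_step(state, action):
    """Return (next_state, reward, done) for one move in the 3x3 grid."""
    row, col = state
    d_row, d_col = MOVES[action]
    candidate = (row + d_row, col + d_col)
    inside = 0 <= candidate[0] <= 2 and 0 <= candidate[1] <= 2
    # Assumption: blocked moves keep the agent where it is (it still pays the step penalty)
    next_state = candidate if inside and candidate != OBSTACLE else state
    if next_state == GOAL:
        return next_state, 10.0, True
    return next_state, -0.1, False

print(grid_step(START, "right"))    # ((0, 1), -0.1, False)
print(grid_step((1, 0), "right"))   # blocked by X: ((1, 0), -0.1, False)
print(grid_step((1, 2), "down"))    # reaches G: ((2, 2), 10.0, True)
```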

#### Q-Table Progress:

At the beginning (untrained):
```
State/Action | Up    | Down  | Left  | Right
-------------|-------|-------|-------|-------
(0,0)        | 0.0   | 0.0   | 0.0   | 0.0
(0,1)        | 0.0   | 0.0   | 0.0   | 0.0
...
```

After training (illustrative values):
```
State/Action | Up    | Down  | Left  | Right
-------------|-------|-------|-------|-------
(0,0)        | 0.0   | 0.7   | 0.0   | 0.8
(0,1)        | 0.0   | 1.5   | 0.6   | 0.7
...
```

The highest values in each row tell us the best action to take in each state!
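
Reading the best action off the table is just an argmax over each row. The short sketch below uses the two rows shown in the "after training" table; the helper name `greedy_action` is made up for this example.

```python
ACTIONS = ["Up", "Down", "Left", "Right"]

# The two rows shown in the "after training" table
q_table = {
    (0, 0): [0.0, 0.7, 0.0, 0.8],
    (0, 1): [0.0, 1.5, 0.6, 0.7],
}

def greedy_action(q_values):
    """Return the name of the action with the highest Q-value."""
    best_index = max(range(len(q_values)), key=lambda i: q_values[i])
    return ACTIONS[best_index]

for state, q_values in q_table.items():
    print(state, "->", greedy_action(q_values))
# (0, 0) -> Right
# (0, 1) -> Down
```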

## 6. Policies: How Agents Make Decisions

A policy is the strategy that an agent uses to decide which actions to take. There are two main types:

### Deterministic Policies
- Always take the same action in a given state
- Example: "In state A, always move right"

### Stochastic (Random) Policies
- Take actions with certain probabilities
- Example: "In state A, move right with 80% probability and left with 20% probability"
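
The difference between the two is easy to see in code. This is a small sketch of the two examples above; the function names are hypothetical, and the state argument is unused because each policy is only defined for state A.

```python
import random

def deterministic_policy(state):
    # "In state A, always move right"
    return "right"

def stochastic_policy(state, rng):
    # "In state A, move right with 80% probability and left with 20% probability"
    return rng.choices(["right", "left"], weights=[0.8, 0.2])[0]

rng = random.Random(0)
samples = [stochastic_policy("A", rng) for _ in range(10_000)]
fraction_right = samples.count("right") / len(samples)

print(deterministic_policy("A"))   # right
print(fraction_right)              # close to 0.8
```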

### Common Policy Types:

#### Random Policy
- Take completely random actions (useful for exploration)

#### Greedy Policy
- Always take the action with the highest estimated value
- Problem: Might miss better solutions it hasn't discovered yet

#### ε-Greedy Policy
- With probability ε: Take a random action (exploration)
- With probability 1-ε: Take the best known action (exploitation)
- Good balance between trying new things and using what we know works
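
A minimal sketch of ε-greedy action selection follows. The function name and the example Q-values are assumptions for illustration; ties among the best actions are broken at random, which is one common choice rather than a requirement.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index from a list of Q-values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))      # explore: any action, uniformly
    best = max(q_values)
    best_actions = [i for i, q in enumerate(q_values) if q == best]
    return rng.choice(best_actions)              # exploit: a best-known action

rng = random.Random(42)
q_values = [0.1, 0.5, 0.2]
choices = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(10_000)]
print(choices.count(1) / len(choices))   # roughly 0.9 + 0.1 / 3, i.e. about 0.93
```

Note that the exploring branch can also land on the best action, which is why action 1 is chosen slightly more often than 1-ε.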

## 7. Challenges in Reinforcement Learning

```mermaid
graph TD
    A[RL Challenges] --> B[Exploration vs. Exploitation]
    A --> C[Delayed Rewards]
    A --> D[Sample Efficiency]
    A --> E[Generalization]
    A --> F[Stability]
```

### Exploration vs. Exploitation
- **Challenge**: Balancing trying new actions (exploration) vs. using actions known to work well (exploitation)
- **Solution**: Strategies like ε-greedy, where ε decreases over time
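
One simple way to decrease ε over time is exponential decay with a floor, sketched below. The schedule and its default values (`start`, `end`, `decay`) are assumptions chosen for illustration; linear decay is equally common.

```python
def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Exponential decay from `start` toward a floor of `end`."""
    return max(end, start * decay ** episode)

for episode in (0, 100, 500, 1000):
    print(episode, round(decayed_epsilon(episode), 3))
```

Early episodes are almost fully random, while later episodes mostly exploit, and the floor keeps a little exploration going indefinitely.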

### Delayed Rewards
- **Challenge**: Actions might not give immediate rewards - rewards might come much later
- **Solution**: Discounting future rewards, plus methods that pass credit back to earlier actions (for example, temporal-difference learning)
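
Discounting can be shown with a short calculation. The sketch below computes the discounted return of a reward sequence; the function name and the reward numbers are illustrative assumptions.

```python
def discounted_return(rewards, discount_factor):
    """Sum of rewards, each weighted by discount_factor ** (steps into the future)."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + discount_factor * total
    return total

# Three steps with no reward, then +10 at the end
value = discounted_return([0, 0, 0, 10], discount_factor=0.9)
print(round(value, 2))   # 7.29, i.e. 10 * 0.9 ** 3
```

A reward that arrives three steps later still counts, just with less weight than an immediate one.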

### Sample Efficiency
- **Challenge**: Learning can require many interactions with the environment
- **Solution**: Experience replay, model-based methods, transfer learning
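
As a rough idea of what experience replay looks like, the sketch below stores past transitions and samples them at random so each interaction can be reused for several updates. The class name and its interface are hypothetical, not taken from any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # the oldest transitions drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.add(state=i, action=0, reward=0.0, next_state=i + 1, done=False)

print(len(buffer))        # 3: only the most recent transitions are kept
print(buffer.sample(2))
```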

### Generalization
- **Challenge**: Applying learning to new situations not seen during training
- **Solution**: Function approximation like neural networks, good state representations

### Stability
- **Challenge**: Learning can be unstable, especially with complex algorithms
- **Solution**: Target networks, careful learning rate scheduling, proper initialization

### 7.1 The Exploration-Exploitation Dilemma

One of the biggest challenges in RL is knowing when to explore (try new things) versus exploit (use what you know works).

```mermaid
graph LR
    A["Too Much Exploration"] --> B["Wasteful, random behavior"]
    C["Too Much Exploitation"] --> D["Get stuck in bad solutions"]
    E["Good Balance"] --> F["Find optimal solutions"]
```

Think of it like this:
- **Exploration**: Trying a new restaurant you've never been to before
- **Exploitation**: Going to your favorite restaurant that you know you'll enjoy

Popular exploration strategies:
- **ε-greedy**: Choose random action with probability ε, best action with probability 1-ε
- **Softmax**: Choose actions with probabilities that increase with their estimated values (proportional to the exponential of each value)
- **Optimistic initialization**: Start believing all actions are amazing, then learn reality
- **Count-based**: Prefer actions that have been tried less frequently
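
Softmax selection is the least obvious of these strategies, so here is a small sketch. The function names and the `temperature` parameter are assumptions for this example; a higher temperature makes the choice more random, a lower one makes it greedier.

```python
import math
import random

def softmax_probabilities(q_values, temperature=1.0):
    # Subtract the maximum before exponentiating for numerical stability
    highest = max(q_values)
    exps = [math.exp((q - highest) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_action(q_values, temperature, rng):
    probs = softmax_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs)[0]

probs = softmax_probabilities([1.0, 2.0, 3.0], temperature=1.0)
print([round(p, 3) for p in probs])   # [0.09, 0.245, 0.665]
print(softmax_action([1.0, 2.0, 3.0], 1.0, random.Random(0)))
```

Unlike ε-greedy, which explores all non-best actions equally, softmax explores the second-best action more often than the worst one.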

## 8. Simple RL Algorithms for Beginners

Let's look at some basic algorithms you can start with:

### 1. Q-Learning
- Builds a table of state-action values
- Easy to understand and implement
- Works well for small environments

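```python
# Runnable sketch of tabular Q-learning.
# Assumption: step(state, action) is a user-supplied function returning
# (next_state, reward, done), and every episode starts in state 0.
import random

def q_learning(step, n_states, n_actions, episodes=500, lr=0.1, gamma=0.9, epsilon=0.1):
    q = [[0.0] * n_actions for _ in range(n_states)]       # initialize Q-table with zeros
    for _ in range(episodes):                               # for each episode
        state, done = 0, False                              # initialize state
        while not done:
            if random.random() < epsilon:                   # with probability ε, select a random action
                action = random.randrange(n_actions)
            else:                                           # otherwise, highest Q-value (ties broken at random)
                action = max(range(n_actions), key=lambda a: (q[state][a], random.random()))
            next_state, reward, done = step(state, action)  # take action, observe reward and next state
            target = reward + gamma * max(q[next_state])    # Q-learning formula
            q[state][action] += lr * (target - q[state][action])
            state = next_state                              # move to next state
    return q
```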

### 2. SARSA (State-Action-Reward-State-Action)
- Similar to Q-learning but "on-policy" (learns the value of the policy it's following)
- Often more conservative than Q-learning
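
The on-policy versus off-policy difference comes down to one term in the update target. The sketch below puts the two targets side by side; the function names and numbers are illustrative assumptions.

```python
def q_learning_target(reward, next_q_values, discount_factor):
    # Off-policy: bootstrap from the best next action, whatever the agent does next
    return reward + discount_factor * max(next_q_values)

def sarsa_target(reward, next_q_values, next_action, discount_factor):
    # On-policy: bootstrap from the action the agent actually takes next
    return reward + discount_factor * next_q_values[next_action]

next_q_values = [1.0, 4.0]   # illustrative Q-values in the next state

print(q_learning_target(0.0, next_q_values, 0.9))   # 3.6
print(sarsa_target(0.0, next_q_values, 0, 0.9))     # 0.9, because the next action was an exploratory one
```

Because SARSA's target includes the cost of its own exploratory moves, it tends to prefer safer behavior while it is still exploring.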

### 3. Monte Carlo Methods
- Learn from complete episodes of experience
- Simple concept: average the returns following each state
- Good for problems where episodes have a clear end point
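
"Average the returns following each state" can be sketched as follows. This is the every-visit variant, and the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) is an assumption made for this example.

```python
from collections import defaultdict

def monte_carlo_values(episodes, discount_factor=1.0):
    """Every-visit Monte Carlo: average the return observed after each state."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so the return can be accumulated in one pass
        for state, reward in reversed(episode):
            g = reward + discount_factor * g
            returns[state].append(g)
    return {state: sum(gs) / len(gs) for state, gs in returns.items()}

# Two made-up episodes: A -> B -> end, with a final reward of 1 or 0
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 0.0)],
]
values = monte_carlo_values(episodes)
print(values)   # both A and B average to 0.5
```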

## 9. Creating a Simple Q-Learning Agent

Let's look at how you might implement a basic Q-learning agent in Python:

```python
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt

# Create the environment
env = gym.make('FrozenLake-v1', is_slippery=False)

# Initialize Q-table with zeros
num_states = env.observation_space.n
num_actions = env.action_space.n
q_table = np.zeros([num_states, num_actions])

# Hyperparameters
learning_rate = 0.8
discount_factor = 0.95
exploration_rate = 0.1
num_episodes = 2000

# Training the agent
rewards = []
for episode in range(num_episodes):
    # Reset the environment
    state, _ = env.reset()
    done = False
    total_reward = 0

    while not done:
        # Exploration-exploitation decision
        if np.random.random() < exploration_rate:
            # Explore: select random action
            action = env.action_space.sample()
        else:
            # Exploit: select an action with the highest Q-value.
            # Ties are broken at random; a plain argmax would always pick
            # action 0 while the table is still all zeros.
            best_actions = np.flatnonzero(q_table[state] == q_table[state].max())
            action = int(np.random.choice(best_actions))

        # Take action and observe outcome.
        # Gymnasium reports "terminated" (goal or hole) and "truncated"
        # (step limit reached) separately; either one ends the episode.
        new_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward

        # Update Q-table using the Q-learning formula
        q_table[state, action] = q_table[state, action] + learning_rate * (
            reward + discount_factor * np.max(q_table[new_state, :]) - q_table[state, action])

        # Move to the next state
        state = new_state

    rewards.append(total_reward)

# Display average reward over time
plt.plot(np.cumsum(rewards) / (np.arange(num_episodes) + 1))
plt.xlabel('Episode')
plt.ylabel('Average Reward')
plt.title('Learning Progress')
plt.show()

# Print the learned Q-table
print("Q-table:")
print(q_table)
```

### Explaining the Code:

1. **Environment**: We're using FrozenLake, a simple grid world where the agent must navigate from start to goal without falling in holes.
2. **Q-table**: A table with states as rows and actions as columns, storing how good each action is in each state.
3. **Training Loop**:
   - For each episode, we start in the initial state
   - At each step, we either explore (random action) or exploit (best known action)
   - After taking an action, we update our Q-value using the Q-learning formula
   - We continue until we reach a terminal state (goal or hole)
4. **Results**: We plot the average reward over time to see if our agent is learning, and print the final Q-table.
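
A raw Q-table is hard to read, so it can help to display the greedy action for each cell as an arrow. The helper below is a hypothetical sketch: it relies on FrozenLake's action order (0=Left, 1=Down, 2=Right, 3=Up) and is demonstrated on a made-up 2x2 table rather than on real training output.

```python
ARROWS = ["←", "↓", "→", "↑"]   # FrozenLake action order: 0=Left, 1=Down, 2=Right, 3=Up

def policy_grid(q_rows, width=4):
    """Render the greedy action of each state as an arrow, one string per grid row."""
    symbols = []
    for q_values in q_rows:
        if max(q_values) == min(q_values):
            symbols.append("·")   # no preference learned (holes, goal, unvisited states)
        else:
            best = max(range(len(q_values)), key=lambda a: q_values[a])
            symbols.append(ARROWS[best])
    return ["".join(symbols[i:i + width]) for i in range(0, len(symbols), width)]

# Made-up 2x2 example; for the trained agent, pass q_table.tolist() with width=4
example = [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print("\n".join(policy_grid(example, width=2)))
# →↓
# ··
```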

## 10. Practical Exercises

### Exercise 1: Understanding Rewards
For the following situations, what types of rewards would help an RL agent learn effectively?
- Training a robot to walk
- Teaching an agent to play Tic-Tac-Toe
- Getting an agent to solve a maze

### Exercise 2: Q-Learning by Hand
Consider a 2x2 grid world where the top-right corner is the goal (+1 reward) and each step has a small penalty (-0.1). Starting with a Q-table of zeros, trace through the first few updates by hand using the Q-learning formula with learning rate = 0.1 and discount factor = 0.9.

### Exercise 3: Algorithm Selection
For each scenario, which RL approach might be most appropriate and why?
- Teaching a computer to play Chess
- Training a robot to balance a pole
- Creating a restaurant recommendation system
- Optimizing traffic lights in a city

### Exercise 4: Exploration-Exploitation Strategy
Design an exploration strategy for a food delivery robot that needs to learn the fastest routes around a college campus. How would your strategy change from initial deployment to after several weeks of operation?

## 11. Comparing RL with Other Machine Learning Types

| Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Learning from | Labeled examples | Patterns in unlabeled data | Interaction & feedback |
| Goal | Predict correct output | Find structure in data | Maximize reward over time |
| Example tasks | Classification, Regression | Clustering, Dimensionality Reduction | Game playing, Robot control |
| Feedback | Immediate (right/wrong) | None | Delayed (rewards) |
| Real-world analogy | Learning with a teacher | Learning without guidance | Learning through trial and error |
| Example algorithm | Decision Trees, Neural Networks | K-means, PCA | Q-Learning, Policy Gradients |
| Example application | Spam detection | Customer segmentation | Self-driving cars |

## 12. Resources for Further Learning

### Books for Beginners:
- "Reinforcement Learning: An Introduction" by Sutton and Barto (first few chapters)
- "Grokking Deep Reinforcement Learning" by Miguel Morales

### Online Courses:
- [David Silver's RL Course](https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ) (DeepMind)
- [Reinforcement Learning Specialization](https://www.coursera.org/specializations/reinforcement-learning) (Coursera)

### Tutorials and Hands-On Resources:
- [Gymnasium Documentation](https://www.gymlibrary.dev/content/tutorials/)
- [Stable Baselines3 Documentation](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html)
- [Hugging Face RL Course](https://huggingface.co/learn/deep-rl-course/unit0/introduction)

### Interesting Projects to Try:
- Train an agent to play simple Atari games
- Build a bot that learns to balance an inverted pendulum
- Create a simple traffic management system
- Develop an agent that learns to play card games

## 13. Real-World Applications of RL

| Application Area | Examples | How RL Helps |
| --- | --- | --- |
| Games | Chess (DeepMind's AlphaZero), Dota 2 (OpenAI Five) | Learning optimal strategies through self-play |
| Robotics | Robot navigation, manipulation tasks | Learning motor skills through trial and error |
| Business | Product recommendations, ad placement | Optimizing user engagement over time |
| Healthcare | Treatment planning, drug discovery | Personalizing treatments based on patient responses |
| Transportation | Traffic light control, ride-sharing | Optimizing resource allocation in dynamic systems |
| Energy Management | Smart grid control, data center cooling | Balancing efficiency and stability in complex systems |

## 14. Summary

### Key Takeaways:

```mermaid
graph TD
    A["Reinforcement Learning"] --> B["Learning through interaction"]
    A --> C["Balances exploration and exploitation"]
    A --> D["Uses rewards to guide learning"]
    A --> E["Works without labeled training data"]
    A --> F["Applicable to many real-world problems"]
```

### What We've Covered:

1. **The Fundamentals**: States, actions, rewards, and how they work together
2. **Key Algorithms**: Q-learning, SARSA, and their basic principles
3. **Common Challenges**: Exploration vs. exploitation, delayed rewards
4. **Implementation**: How to create a simple Q-learning agent
5. **Applications**: Real-world uses of reinforcement learning

### Next Steps:

If you're interested in diving deeper into RL, consider:
- Implementing more advanced algorithms like Deep Q-Networks (DQN)
- Exploring policy gradient methods
- Working with more complex environments
- Studying multi-agent reinforcement learning

Remember that reinforcement learning is a powerful approach for problems where an agent needs to make sequences of decisions and learn from feedback over time.