# 🎮 Chapter 18: Reinforcement Learning — Practical Guide

---

This notebook provides a hands-on walkthrough of reinforcement learning (RL). We'll explore key concepts, implement simple agents, and experiment with popular algorithms using OpenAI Gym environments.

## I. Learning to Optimize Rewards

Reinforcement learning agents learn to act in an environment so as to **maximize cumulative reward** over time. Think of training a robot to walk or playing a game like Pong: the agent improves its behavior through trial and error, reward feedback, and policy updates.

## II. Policy Search

Instead of estimating value functions, policy-based methods directly learn a parameterized policy (e.g., a neural network) that maps states to actions. This approach can be more effective in continuous or high-dimensional action spaces.

## III. Introduction to OpenAI Gym

Let's start by setting up a simple environment using Gym. We'll interact with the classic CartPole environment.

In [None]:
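import gym

# Create the environment
env = gym.make("CartPole-v1")

# Reset environment to start a new episode.
# gym >= 0.26 returns (observation, info); older versions return only the observation.
reset_result = env.reset()
state = reset_result[0] if isinstance(reset_result, tuple) else reset_result
print("Initial state vector:", state)

# Take a random action
action = env.action_space.sample()

# Step in environment.
# gym >= 0.26 returns (obs, reward, terminated, truncated, info);
# older versions return (obs, reward, done, info).
step_result = env.step(action)
next_state, reward = step_result[0], step_result[1]
if len(step_result) == 5:
    done = step_result[2] or step_result[3]
else:
    done = step_result[2]
print("Next state:", next_state)
print("Reward received:", reward)
print("Episode done?", done)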

## IV. Neural Network Policies

We can train neural networks to map observations to actions or action probabilities. This is the basis of policy gradient methods like REINFORCE.
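
As a rough illustration of the last step of such a network, the sketch below turns raw output scores (logits) into action probabilities with a softmax and then samples an action. The logit values and the helper names `softmax` and `sample_action` are assumptions made for this example, not part of any library.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical network outputs for the two CartPole actions (left, right)
logits = [2.0, 0.5]
probs = softmax(logits)
action = sample_action(probs)
print("Action probabilities:", probs)
print("Sampled action:", action)
```

Sampling from the probabilities, rather than always taking the most likely action, keeps some exploration in the policy while it is still being trained.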

## V. Evaluating Actions: Credit Assignment

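Rewards are often sparse and delayed, so the agent has to work out which earlier actions deserve credit for them. A common approach is to evaluate each action by the sum of the rewards that follow it, with each later reward multiplied by a discount factor γ once per elapsed step. This helps the agent learn which actions lead to better outcomes.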
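
A minimal sketch of the discounted-return computation, assuming the rewards of one finished episode are available as a list. The function name `discounted_returns` and the example rewards are illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Illustrative episode: a small reward, nothing, then a large penalty
episode_rewards = [10, 0, -50]
print(discounted_returns(episode_rewards, gamma=0.8))
```

With γ = 0.8, the first action's return is 10 + 0.8 × 0 + 0.8² × (−50) = −22, so it is judged negatively even though its immediate reward was positive.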

## VI. Policy Gradients (REINFORCE)

Let's define a simple policy network using TensorFlow/Keras and outline how to train it with Monte Carlo returns.

In [None]:
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np

# Build policy network
policy = tf.keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=env.observation_space.shape),
    layers.Dense(env.action_space.n, activation='softmax')
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

# Placeholder for training loop (full implementation omitted for brevity)
# Normally, you'd run episodes, collect states, actions, rewards,
# compute discounted returns, and update the network accordingly.

print("Policy network defined. Implement training with episodes to optimize.")
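
To make the update rule concrete without a full CartPole training loop, here is a deliberately simplified sketch. It swaps the Keras network for a two-logit softmax policy and the environment for a hypothetical two-armed bandit (one-step episodes), so it runs with the standard library alone. The reward values, learning rate, and function names are assumptions for illustration.

```python
import math
import random

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(n_episodes=500, learning_rate=0.1, seed=0):
    """REINFORCE on a toy bandit: theta += lr * return * grad(log pi(action))."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]        # one logit per action
    arm_rewards = [0.0, 1.0]  # hypothetical: action 1 always pays 1, action 0 pays 0
    for _ in range(n_episodes):
        probs = softmax(theta)
        action = rng.choices([0, 1], weights=probs, k=1)[0]
        episode_return = arm_rewards[action]  # one-step episode: return == reward
        # For a softmax policy, d log pi(action) / d theta[a] = 1[a == action] - pi(a)
        for a in range(2):
            grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += learning_rate * episode_return * grad_log_prob
    return softmax(theta)

final_probs = reinforce_bandit()
print("Action probabilities after training:", final_probs)
```

In the CartPole setting, the same rule applies per time step, with the discounted return from that step onward in place of `episode_return` and the gradient taken with respect to the network weights (for example via `tf.GradientTape`).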

## VII. Markov Decision Processes

RL environments are modeled as Markov Decision Processes (MDPs), where the next state depends only on the current state and action, not on past history.
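
As a small worked example, the sketch below encodes a made-up two-state MDP as a dictionary and solves it with value iteration, a standard dynamic-programming method. The states, probabilities, rewards, and discount factor are all invented for illustration. Note how each outcome depends only on the current state and action, which is the Markov property.

```python
# transitions[state][action] -> list of (probability, next_state, reward)
transitions = {
    0: {
        0: [(1.0, 0, 0.0)],                 # stay in state 0, no reward
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)],  # usually reach state 1 and earn 1
    },
    1: {
        0: [(1.0, 1, 2.0)],                 # stay in state 1 and earn 2
        1: [(1.0, 0, 0.0)],                 # go back to state 0
    },
}

def value_iteration(transitions, gamma=0.5, n_iterations=100):
    """Repeatedly apply the Bellman optimality update to every state."""
    values = {s: 0.0 for s in transitions}
    for _ in range(n_iterations):
        # The right-hand side reads the previous estimates (synchronous update).
        values = {
            s: max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            for s, actions in transitions.items()
        }
    return values

state_values = value_iteration(transitions)
print(state_values)
```

Solving the two Bellman equations by hand gives V(1) = 2 / (1 − 0.5) = 4 and V(0) = 2.4 / 0.9 ≈ 2.67, which the iteration converges to.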

## VIII. Temporal Difference (TD) Learning

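TD learning combines sampling from experience with bootstrapping: each estimate is nudged toward the observed reward plus the discounted estimate for the next state, without waiting for the episode to end. Algorithms like SARSA and Q-learning learn action values this way, directly from experience.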
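
The difference between the two algorithms comes down to the target they bootstrap from. The sketch below shows both targets and the shared update step. The function names and the example numbers are illustrative.

```python
def q_learning_target(reward, next_q_values, gamma, terminal=False):
    """Off-policy target: bootstrap from the best action in the next state."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)

def sarsa_target(reward, next_q_values, next_action, gamma, terminal=False):
    """On-policy target: bootstrap from the action actually taken next."""
    if terminal:
        return reward
    return reward + gamma * next_q_values[next_action]

def td_update(old_value, target, alpha):
    """Move the old estimate a fraction alpha of the way toward the target."""
    return old_value + alpha * (target - old_value)

next_q = [1.0, 3.0]  # illustrative Q-values of the next state
print(q_learning_target(1.0, next_q, gamma=0.9))            # uses max(next_q)
print(sarsa_target(1.0, next_q, next_action=0, gamma=0.9))  # uses next_q[0]
```

Here Q-learning bootstraps from the value 3.0 regardless of what the agent does next, while SARSA uses 1.0 because the (for instance, exploratory) next action was action 0.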

## IX. Q-Learning

Here's a simple template for implementing Q-learning with epsilon-greedy action selection.

In [None]:
import numpy as np

# Initialize Q-table
n_states = 1000  # example discretization size
n_actions = env.action_space.n
Q = np.zeros((n_states, n_actions))

epsilon = 0.1  # exploration rate
alpha = 0.1    # learning rate
gamma = 0.99   # discount factor

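# Function to discretize continuous state
def discretize_state(state):
    # Placeholder: bins only the cart position (state[0]), which CartPole
    # keeps within about [-2.4, 2.4]; in practice, bin all four dimensions.
    position = min(max(float(state[0]), -2.4), 2.4)
    return int((position + 2.4) / 4.8 * (n_states - 1))

# Example episode loop
for episode in range(10):  # small number for illustration
    # gym >= 0.26 returns (observation, info); older versions return only the observation
    reset_result = env.reset()
    state = reset_result[0] if isinstance(reset_result, tuple) else reset_result
    done = False
    while not done:
        s_idx = discretize_state(state)
        # Epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[s_idx]))
        step_result = env.step(action)
        next_state, reward = step_result[0], step_result[1]
        if len(step_result) == 5:  # gym >= 0.26: separate terminated/truncated flags
            terminated, truncated = step_result[2], step_result[3]
        else:                      # older gym: a single done flag
            terminated, truncated = step_result[2], False
        done = terminated or truncated
        s_next_idx = discretize_state(next_state)
        # Q-update: do not bootstrap from a terminal state
        if terminated:
            td_target = reward
        else:
            td_target = reward + gamma * np.max(Q[s_next_idx])
        Q[s_idx, action] += alpha * (td_target - Q[s_idx, action])
        state = next_state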

print("Q-learning example completed.")
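
The `discretize_state` placeholder above looks only at the cart position. One possible way to bin all four CartPole dimensions into a single table index is sketched below. The bin edges are assumed, untuned values chosen for illustration, and `discretize` is a hypothetical helper name.

```python
from bisect import bisect_right

# Assumed bin edges per dimension (illustrative, not tuned)
BIN_EDGES = [
    [-1.2, 0.0, 1.2],  # cart position
    [-1.0, 0.0, 1.0],  # cart velocity
    [-0.1, 0.0, 0.1],  # pole angle (radians)
    [-1.0, 0.0, 1.0],  # pole angular velocity
]

# Each dimension with k edges yields k + 1 buckets
N_STATES = 1
for edges in BIN_EDGES:
    N_STATES *= len(edges) + 1

def discretize(observation, bin_edges=BIN_EDGES):
    """Map a continuous observation to one integer in [0, N_STATES)."""
    index = 0
    for value, edges in zip(observation, bin_edges):
        bucket = bisect_right(edges, value)  # bucket number 0 .. len(edges)
        index = index * (len(edges) + 1) + bucket
    return index

print("Table size:", N_STATES)
print("Index for a sample observation:", discretize([0.5, 0.5, 0.05, 0.5]))
```

With three edges per dimension, the table has 4⁴ = 256 rows, so the Q-table would be created as `np.zeros((N_STATES, n_actions))`. Finer bins give a more precise table at the cost of slower learning, since each cell is visited less often.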

### Approximate & Deep Q-Learning

Instead of a Q-table, use a neural network as a function approximator for the Q-values. The key ingredients are a deep network that estimates Q-values, an experience replay buffer, and a target network.

## X. Implementing Deep Q-Learning

Here's a high-level pseudocode outline:

1. Build a deep Q-network (DQN)
2. Use a replay buffer to store past experiences
3. Maintain a target network with delayed updates
4. Sample mini-batches from the replay buffer for training
5. Update network weights via gradient descent

Full implementation details are extensive; refer to RL frameworks for complete code.
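
As one concrete piece of the outline, steps 2 and 4 can be sketched with a fixed-size buffer built on `collections.deque`. The class name `ReplayBuffer` and its method names are assumptions for this example and do not mirror any particular framework's API.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a uniformly sampled mini-batch, regrouped by field."""
        batch = self.rng.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Usage with dummy integer "states"
buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.add(i, 0, 1.0, i + 1, False)
print("Stored transitions:", len(buffer))
print("Sampled states:", buffer.sample(2)[0])
```

Sampling at random from past experience breaks the correlation between consecutive transitions, which is one reason replay tends to stabilize training.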

## XI. Deep Q-Learning Variants

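- **Double DQN**: mitigates overestimation bias by choosing the next action with the online network and evaluating it with the target network
- **Prioritized Replay**: samples important experiences (typically those with a large TD error) more frequently
- **Dueling DQN**: separates state-value and advantage streams for better estimation

These variants improve stability and sample efficiency.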
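
A small numeric sketch of two of these ideas, using plain lists in place of network outputs. The function names and the Q-values are made up for illustration.

```python
def vanilla_dqn_target(reward, next_q_target, gamma, done):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)

def double_dqn_target(reward, next_q_online, next_q_target, gamma, done):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]

def dueling_q_values(state_value, advantages):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]

next_q_online = [1.0, 2.0]  # online network prefers action 1
next_q_target = [5.0, 3.0]  # target network overrates action 0
print(vanilla_dqn_target(0.0, next_q_target, gamma=1.0, done=False))
print(double_dqn_target(0.0, next_q_online, next_q_target, gamma=1.0, done=False))
print(dueling_q_values(10.0, [1.0, 3.0]))
```

In this example the vanilla target takes the inflated value 5.0, while the Double DQN target uses 3.0, the target network's estimate for the action the online network actually prefers.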

## XII. The TF-Agents Library

TensorFlow Agents (TF-Agents) simplifies building RL pipelines.

### Installation

```bash
pip install tf-agents
```

### Setup Outline

```python
import tf_agents

# Define environment, agent, replay buffer, data collection, training loop, etc.
```

TF-Agents handles the heavy lifting for training deep RL agents across various environments.

## XIII. Overview of Popular Algorithms

- **Policy-based**: REINFORCE, PPO, A2C (PPO and A2C also learn a value function, so they are often classed as actor-critic methods)
- **Value-based**: DQN and its variants
- **Actor-Critic**: DDPG, SAC

Each family has its own strengths and suits different kinds of environments.

## XIV. Exercises to Try

1. Implement **REINFORCE** on CartPole.
2. Build a **DQN** from scratch using Gym.
3. Compare **Double DQN** vs vanilla DQN.
4. Train a DQN agent on **Atari games** using TF-Agents.
5. Experiment with **Deep Deterministic Policy Gradient (DDPG)** in continuous control tasks.

Feel free to explore and expand on these ideas!