
# Fundamentals of Reinforcement Learning — with Gymnasium in Python


**What you’ll get here:**
- Clear definitions (Agent, Environment, State, Action, Reward)
- Episodic vs. Continuous tasks
- Return and **discounted return** with a numeric example
- A practical tour of **Gymnasium**: `CartPole`, `MountainCar`, `FrozenLake`, and `Taxi`
- Clean, reproducible code with seeding, rendering helpers, and safe shutdown



## 2. Reinforcement Learning in One Picture

**Reinforcement Learning (RL)** is a learning paradigm where an *agent* interacts with an *environment* by observing a **state**, taking an **action**, and receiving a **reward**. The goal is to learn a policy that **maximizes cumulative reward** over time.

- **Agent**: the learner/decision‑maker  
- **Environment**: the world with which the agent interacts  
- **State**: a snapshot of the environment at a given time  
- **Action**: an operation chosen by the agent  
- **Reward**: scalar feedback after taking an action  

The agent’s behavior is improved through **trial and error**, much like training a pet via rewards and penalties.



### 2.1 Episodic vs. Continuous Tasks

- **Episodic**: Interaction naturally breaks into episodes with a beginning and an end (e.g., a single game of CartPole ending when the pole falls or the cart goes out of bounds).  
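- **Continuous**: No terminal boundary; the task proceeds indefinitely (e.g., regulating traffic lights). RL texts more often call these *continuing* tasks, which also avoids confusion with continuous state or action spaces.

In practice, many continuous tasks are cut off after a fixed number of steps during training so that each rollout is finite; Gymnasium reports such a cutoff through the `truncated` flag.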



### 2.2 Return and Discounted Return

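The **return** is the sum of rewards the agent accumulates from a given time step onward. Because immediate rewards are often considered more valuable than distant ones, we weight later rewards by a **discount factor** \(\gamma \in [0, 1)\) to compute the **discounted return**. With \(\gamma < 1\) the infinite sum below stays finite whenever rewards are bounded; for episodic tasks, \(\gamma = 1\) is also allowed because the sum has finitely many terms.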

\[ G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \]

Below is a concrete numerical example.
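The following is a minimal sketch of the formula above for a finite reward sequence. The rewards `[1.0, 2.0, 3.0]` and `gamma = 0.9` are made-up values chosen only for illustration, and `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 2.0, 3.0]  # hypothetical rewards R_1, R_2, R_3
gamma = 0.9                # hypothetical discount factor

# By hand: 1.0 + 0.9 * 2.0 + 0.9**2 * 3.0 = 1.0 + 1.8 + 2.43, roughly 5.23
print("discounted return:", discounted_return(rewards, gamma))
print("undiscounted return:", discounted_return(rewards, 1.0))
```

With `gamma = 0.9` the third reward contributes only 2.43 instead of 3.0, which is the sense in which later rewards count for less.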



## 3. Interacting with Gymnasium Environments
Gymnasium provides a **unified interface** across many RL tasks. You can:  
- create environments with `gym.make(...)`,  
- **reset** them to get the initial state,  
- take **steps** with `env.step(action)`, and  
- optionally **render** a frame for visualization.



### 3.1 CartPole (Classic Control)

**Task**: balance a pole on a moving cart by pushing the cart left (action 0) or right (action 1).

We’ll create the environment, inspect the initial observation, take a few steps, and then run a short episode with a simple (random) policy.



### 3.2 MountainCar (Classic Control)

**Task**: drive a car up a steep hill by building momentum. Actions are discrete (push left, no push, push right).

Below, we’ll run a very short random episode just to demonstrate the interface.



### 3.3 FrozenLake (Toy Text) — Tabular Q‑Learning

**Task**: navigate a frozen grid from Start (S) to Goal (G) while avoiding holes (H). This is a small, **discrete** environment, perfect for tabular **Q‑learning**.

Below we implement a **complete** Q‑learning solution with \(\epsilon\)-greedy exploration.



#### Learning Curve (Average Return)

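A simple plot of the per‑episode return shows whether learning progresses. Individual returns are noisy, so averaging over a window of recent episodes makes the trend easier to see.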
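A possible plotting helper is sketched here. `moving_average` and `plot_learning_curve` are hypothetical names, the plot assumes `matplotlib` is installed, and the short list at the end is a made-up input used only to show what the smoothing does.

```python
def moving_average(values, window):
    """Trailing mean over up to `window` most recent values (same length as input)."""
    averaged, running_sum = [], 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            running_sum -= values[i - window]  # drop the value that left the window
        averaged.append(running_sum / min(i + 1, window))
    return averaged


def plot_learning_curve(returns, window=100):
    """Plot raw per-episode returns together with their moving average."""
    # Imported lazily so moving_average stays usable without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(returns, alpha=0.3, label="per-episode return")
    ax.plot(moving_average(returns, window), label=f"moving average ({window} episodes)")
    ax.set_xlabel("episode")
    ax.set_ylabel("return")
    ax.legend()
    return fig


# Made-up returns, only to illustrate the smoothing:
print(moving_average([0, 0, 1, 1], window=2))  # [0.0, 0.0, 0.5, 1.0]
```

Calling `plot_learning_curve(returns)` on the list of per-episode returns collected during training would produce the curve.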



### 3.4 Taxi (Toy Text)

**Task**: pick up and drop off passengers at correct locations. We’ll run a brief random rollout to exercise the API.



## 4. The RL Interaction Loop (Putting It All Together)

The canonical loop:

```python
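import gymnasium as gym

SEED, max_steps = 42, 500
env = gym.make("CartPole-v1")
policy = lambda obs: env.action_space.sample()  # placeholder: random policy
update = lambda *transition: None               # placeholder: learning rule (algorithm-dependent)

obs, info = env.reset(seed=SEED)
for t in range(max_steps):
    action = policy(obs)            # choose action
    next_obs, reward, terminated, truncated, info = env.step(action)
    update(obs, action, reward, next_obs)  # learn from the full transition
    obs = next_obs                  # advance to the next state
    if terminated or truncated:
        break
env.close()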
```



## 5. Practice: A Minimal Policy for CartPole (Solved)

Below we implement a **naïve linear policy** for CartPole using the observation directly. This is **not** optimal, but it demonstrates how to replace randomness with a deterministic policy and achieve decent returns without training.



## 6. Episodic vs. Continuous — Tiny Simulation

A quick toy demo to illustrate the difference in code: the *continuous* loop runs until an external condition stops it, whereas the *episodic* loop ends when the environment signals `terminated` or `truncated`.
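The contrast can be sketched without any RL library. Both functions below are hypothetical toy processes invented for this illustration: a random walk that ends at a boundary, and a noisy signal that never ends on its own.

```python
import random

rng = random.Random(0)  # fixed seed so the toy runs are repeatable


def run_episodic(max_steps=100):
    """Episodic: the environment itself signals the end of the episode."""
    position, steps = 0, 0
    terminated = truncated = False
    while not (terminated or truncated):
        position += rng.choice([-1, 1])  # toy random-walk dynamics
        steps += 1
        terminated = abs(position) >= 3  # reached a terminal state
        truncated = steps >= max_steps   # time limit hit
    return steps


def run_continuing(step_budget=50):
    """Continuing: no terminal state; an external budget stops the loop."""
    level, total_reward = 0.0, 0.0
    for _ in range(step_budget):         # external stopping condition
        level = 0.9 * level + rng.uniform(-1, 1)  # process that never ends by itself
        total_reward += -abs(level)      # penalize distance from the target level 0
    return total_reward


print("episodic run length:", run_episodic())
print("continuing run reward over the budget:", run_continuing())
```

In the episodic function the loop condition depends on flags produced by the process, while in the continuing function the only thing that stops the loop is the caller's step budget.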



## 7. What’s Next?

- Try different Gymnasium environments and tweak hyperparameters.  
- Replace the random or heuristic policies with **learning algorithms** (e.g., Monte Carlo, TD(0), SARSA, DQN, PPO).  
- Extend the plotting to include episode lengths, moving averages, and distributional summaries.

You now have a fully working starter notebook that mirrors and completes the chapter’s content, with **executable** examples across multiple environments.
