# Introduction

```{note}
Reinforcement learning (RL) is a machine learning approach for teaching agents how to solve tasks by trial and error. Deep RL refers to the combination of RL with deep learning.
```

## Key Concepts and Terminology

![](images/agent-env.png)

The main characters of RL are the **agent** and the **environment**. The environment is the world that the agent lives in and interacts with. At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, and then decides on an action to take. The environment changes when the agent acts on it, but may also change on its own.

The agent also perceives a **reward** signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called **return**. 
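As a minimal sketch of this loop, the snippet below runs one episode and accumulates the rewards into an undiscounted return. The `Corridor` environment, its reward values, and the `always_right` agent are hypothetical, invented purely for illustration; they are not part of any library.

```python
class Corridor:
    """Hypothetical toy environment: walk along a 1-D corridor to the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the observation is the agent's position

    def step(self, action):
        # action: 0 = move left, 1 = move right (clipped at the walls)
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        # Assumed reward scheme: small penalty per step, +1 on reaching the goal.
        reward = 1.0 if done else -0.1
        return self.position, reward, done


def always_right(observation):
    """Hypothetical agent that ignores the observation and always moves right."""
    return 1


env = Corridor()
obs = env.reset()
episode_return = 0.0
done = False
while not done:
    action = always_right(obs)            # agent picks an action from the observation
    obs, reward, done = env.step(action)  # environment returns new observation and reward
    episode_return += reward              # return = cumulative reward
```

Here the agent only reads the reward; a learning agent would also use it to change how it picks actions.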

### States and Observations

A **state** $s$ is a complete description of the state of the world. There is no information about the world which is hidden from the state. An **observation** $o$ is a partial description of a state, which may omit information.

In deep RL, we almost always represent states and observations by a real-valued vector, matrix, or higher-order tensor. For instance, a visual observation could be represented by the RGB matrix of its pixel values.

When the agent is able to observe the complete state of the environment, we say that the environment is **fully observed**. When the agent can only see a partial observation, we say that the environment is **partially observed**.
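The difference between a state and an observation can be sketched with plain Python lists. The cart example, the variable names, and the choice of which component is hidden are all assumptions made for illustration.

```python
# Hypothetical state of a cart on a track: [position, velocity].
state = [0.5, -1.2]


def observe_fully(state):
    # Fully observed: the observation is the complete state.
    return list(state)


def observe_partially(state):
    # Partially observed: the agent sees the position but not the velocity.
    return [state[0]]


# A tiny 2x2 RGB image as a height x width x 3 tensor of pixel values.
image_observation = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]
image_shape = (
    len(image_observation),
    len(image_observation[0]),
    len(image_observation[0][0]),
)
```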

```{caution}
Reinforcement learning notation sometimes puts the symbol for state, $s$, in places where it would be technically more appropriate to write the symbol for observation, $o$. Specifically, this happens when talking about how the agent decides an action: we often signal in notation that the action is conditioned on the state, when in practice, the action is conditioned on the observation because the agent does not have access to the state.

We’ll follow standard conventions for notation, but it should be clear from context which is meant.
```

### Action Spaces

Different environments allow different kinds of actions. The set of all valid actions in a given environment is often called the **action space**. Some environments, like Atari and Go, have **discrete action spaces**, where only a finite number of moves are available to the agent. Other environments, such as those where the agent controls a robot in the physical world, have **continuous action spaces**. In continuous spaces, actions are real-valued vectors.
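The two kinds of action space can be contrasted with a short sketch that draws one random action from each. The action names, the bounds, and the dimensionality are illustrative assumptions, not taken from any particular environment.

```python
import random

rng = random.Random(0)

# Discrete action space: a finite set of moves.
DISCRETE_ACTIONS = ["left", "right", "up", "down"]
discrete_action = rng.choice(DISCRETE_ACTIONS)

# Continuous action space: a real-valued vector, here assumed to be
# two joint torques, each bounded to the interval [-1, 1].
LOW, HIGH, DIM = -1.0, 1.0, 2
continuous_action = [rng.uniform(LOW, HIGH) for _ in range(DIM)]
```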

## Policies

![reinforce](images/reinforce.png)

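At each time step $t$, the policy maps the current state to an action, and the environment responds with a reward and the next state:

$$s_{t} \xrightarrow{\text{Policy}} a_{t} \xrightarrow{\text{Environment}} r_{t+1},s_{t+1}$$

The **policy** $\pi(a|s)$ gives the probability of taking action $a$ in state $s$.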
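For a small discrete problem, a stochastic policy can be sketched as a lookup table of action probabilities. The state names, action names, and probabilities below are hypothetical; in deep RL the table would typically be replaced by a neural network.

```python
import random

# Hypothetical tabular policy: POLICY[s][a] = pi(a | s).
# Each row is a probability distribution over actions.
POLICY = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}


def sample_action(policy, state, rng):
    """Draw an action a ~ pi(. | state)."""
    actions = list(policy[state])
    probs = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=probs, k=1)[0]


rng = random.Random(0)
action = sample_action(POLICY, "s0", rng)
```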

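The environment controls the state transition and reward processes. Both have the Markov property: the next state and the reward depend only on the current state and action, not on the earlier history: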

$$p(s_{t+1}|s_{t},a_{t},\dots,s_{0},a_{0})=p(s_{t+1}|s_{t},a_{t})$$
$$p(r_{t+1}|s_{t},a_{t},\dots,s_{0},a_{0})=p(r_{t+1}|s_{t},a_{t})$$

Repeating this loop yields a **trajectory** $(s_{0},a_{0},r_{1},s_{1},a_{1},r_{2},s_{2},a_{2},\dots)$.
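As a rough sketch, a trajectory can be collected by alternating between a policy and an environment step function. Note that `env_step` takes only the current state and action as inputs, which is how the Markov property shows up in code. The dynamics, the slip probability, and the reward rule are hypothetical choices made for illustration.

```python
import random


def policy(state, rng):
    # Hypothetical stochastic policy pi(a | s): move right with probability 0.8.
    return 1 if rng.random() < 0.8 else -1


def env_step(state, action, rng):
    # Markov dynamics: next state and reward depend only on (state, action).
    slip = rng.random() < 0.1  # assumed: with probability 0.1 the move has no effect
    next_state = state if slip else state + action
    reward = 1.0 if next_state > state else 0.0
    return reward, next_state


def rollout(s0, num_steps, seed=0):
    """Collect [s_0, a_0, r_1, s_1, a_1, r_2, s_2, ...] as a flat list."""
    rng = random.Random(seed)
    trajectory = [s0]
    state = s0
    for _ in range(num_steps):
        action = policy(state, rng)
        reward, state = env_step(state, action, rng)
        trajectory += [action, reward, state]
    return trajectory


trajectory = rollout(s0=0, num_steps=3)
# Layout: [s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, s_3]
```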