add my reading notes
jiyfeng committed Apr 18, 2017
1 parent a439966 commit 8dab204
Showing 3 changed files with 118 additions and 1 deletion.
3 changes: 2 additions & 1 deletion README.md
@@ -1,4 +1,4 @@
# RL4NLP Reading Group
# RL4NLP Reading Group (Spring 2017)

- Location: CSE 203

@@ -9,6 +9,7 @@
- Yangfeng
- Time: April 17, Monday, 4:30 - 5:30 PM
- Reading: [Reinforcement Learning: An Introduction](http://incompleteideas.net/sutton/book/bookdraft2016sep.pdf) Chap 01 and 03
- Notes: [Chap 01](notes/01-rl-basic.md) and [Chap 03](notes/02-mdp.md)

### 2. Dynamic Programming and Monte Carlo Methods

35 changes: 35 additions & 0 deletions notes/01-rl-basic.md
@@ -0,0 +1,35 @@
# Chap 01: The Reinforcement Learning (RL) Problem

## Introduction

- Three characteristics of RL problems
    - being closed-loop in an essential way
    - not having direct instructions as to what actions to take
    - where the consequences of actions, including reward signals, play out over extended time periods
- The difference between RL and supervised learning
    - RL: learn from interaction, without labeled examples of the correct actions
    - Supervised learning: learn from a training set of labeled examples provided by an external supervisor
- The difference between RL and unsupervised learning
    - RL: maximize rewards
    - Unsupervised learning: find hidden structures in data
- The special challenge of RL: the tradeoff between **exploration** and **exploitation**. An agent must both
    - exploit what it already knows in order to obtain reward
    - explore in order to make better action selections in the future

## Elements of RL

- A policy
    - defines the learning agent's way of behaving at a given time
    - the **core** of an agent, in the sense that it alone is sufficient to determine behavior
- A reward signal
    - defines the goal of an RL problem by determining which events are good or bad for the agent
    - the agent's sole objective is to maximize the **total** reward it receives over the long run
    - the process that generates the reward signal must be unalterable by the agent
- A value function
    - specifies what is good in the long run
    - the *value* of a state is the total amount of reward an agent can **expect** to accumulate over the future, starting from that state
    - values are predictions of long-run reward given the current state
- A model of the environment (optional)

## Tic-Tac-Toe

- The difference between evolutionary methods and methods that learn value functions
    - learning a value function takes advantage of information available during the course of play (see the update sketch below), whereas evolutionary methods evaluate a policy only by the final outcome of each game
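
Below is a minimal Python sketch (mine, not from the notes) of the kind of value-function update the tic-tac-toe example uses: after a greedy move, the value of the earlier state is nudged toward the value of the later state. The table encoding, the default value of 0.5, and the step size `alpha` are illustrative assumptions.

```python
# Sketch of the tic-tac-toe value update: V(s) <- V(s) + alpha * (V(s') - V(s)).
# The state encoding, default value 0.5, and step size are illustrative assumptions.
from collections import defaultdict

alpha = 0.1                    # step-size parameter (assumed)
V = defaultdict(lambda: 0.5)   # value table: hashable board state -> estimated win probability

def td_update(state, next_state):
    """Back up the value of `state` toward the value of `next_state`."""
    V[state] += alpha * (V[next_state] - V[state])
```
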
81 changes: 81 additions & 0 deletions notes/02-mdp.md
@@ -0,0 +1,81 @@
# Chap 03: Finite Markov Decision Processes

Notation

- $S_t\in\mathcal{S}$: the environment state at step $t$, where $\mathcal{S}$ is the set of possible states
- $A_t\in\mathcal{A}(S_t)$: action given $S_t$, where $\mathcal{A}(S_t)$ is the set of actions available in state $S_t$
- $R_{t+1}\in\mathcal{R}\subset\mathbb{R}$: reward
- $\pi_t$: agent's policy, where $\pi_t(a|s)$ is the probability that $A_t=a$ if $S_t=s$

## Returns

Expected discounted return

$$G_t=\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}$$
where $\gamma$ is a parameter, $0\leq \gamma\leq 1$, called the discount rate.
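
As a quick sanity check of the definition (a toy example of mine, not from the chapter), the return of a finite reward sequence can be computed directly:

```python
# G_t = sum_k gamma^k * R_{t+k+1}, truncated to a finite reward sequence.
def discounted_return(rewards, gamma):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Rewards 1, 0, 2 with gamma = 0.9 give 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1, 0, 2], gamma=0.9))  # ~2.62
```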

## MDP

A reinforcement learning task that satisfies the Markov property is called a **Markov Decision Process** (MDP).

A finite MDP is specified by its state and action sets ($\mathcal{S}$ and $\mathcal{A}$) and by the one-step dynamics of the environment:
$$p(s',r|s,a)=\text{Pr}(S_{t+1}=s',R_{t+1}=r|S_t=s,A_t=a)$$

All other quantities of interest can be computed from these dynamics (see the sketch after the list below), including

- the expected reward for a state-action pair: $r(s,a)=\mathbb{E}(R_{t+1}|S_t=s,A_t=a)$
- the state-transition probability: $p(s'|s,a)=\text{Pr}(S_{t+1}=s'|S_t=s,A_t=a)$
- the expected reward for a state-action-next-state triple: $r(s,a,s')=\mathbb{E}(R_{t+1}|S_t=s,A_t=a,S_{t+1}=s')$
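
A small sketch of how these three quantities follow from $p(s',r|s,a)$; the dictionary encoding of the dynamics is my own assumption for illustration.

```python
# Assumed encoding: p[(s, a)] = {(s_next, r): probability of (s_next, r) given (s, a)}.

def expected_reward(p, s, a):
    """r(s, a) = E[R_{t+1} | S_t = s, A_t = a]."""
    return sum(prob * r for (s_next, r), prob in p[(s, a)].items())

def transition_prob(p, s, a, s_next):
    """p(s' | s, a): marginalize the reward out of p(s', r | s, a)."""
    return sum(prob for (sn, _), prob in p[(s, a)].items() if sn == s_next)

def expected_reward_triple(p, s, a, s_next):
    """r(s, a, s') = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s']."""
    total = sum(prob * r for (sn, r), prob in p[(s, a)].items() if sn == s_next)
    return total / transition_prob(p, s, a, s_next)
```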

**An alternative definition** [1]: an MDP is defined by

- a set of states $\mathcal{S}$
- a start (initial) state $s_0\in\mathcal{S}$
- a set of actions $\mathcal{A}$
- a transition probability $P(S_{t+1}=s'|S_t=s,A_t=a)$
- a reward probability $P(R_{t+1}=r|S_t=s,A_t=a)$

## Value functions

For an MDP, the **state-value function** for a policy $\pi$ is defined as
$$v_{\pi}(s)=\mathbb{E}_{\pi}(G_t|S_t=s)=\mathbb{E}_{\pi}\Big(\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}\Big|S_t=s\Big)$$

Similarly, the **action-value function** for policy $\pi$, $q_{\pi}(s,a)$, is defined as
$$q_{\pi}(s,a)=\mathbb{E}_{\pi}(G_t|S_t=s,A_t=a)$$

- Q-learning is based on estimating action-value functions

### Bellman equation

The Bellman equation for $v_{\pi}$ is as follows: for $s\in\mathcal{S}$,

$$v_{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\pi}(s')]$$
which specifies the relation between the value of a state $S_t$ and the values of its successor states $S_{t+1}$ under a given $\pi$.

Reinforcement learning can solve Markov decision processes without explicit specification of the transition probabilities.
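
A minimal sketch of iterative policy evaluation, which turns the Bellman equation above into an update rule and sweeps it until convergence; the encodings of `states`, `policy`, and `dynamics` are assumptions made for illustration, not anything specified in the chapter.

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation as an update.
# Assumed encodings:
#   states            iterable of states
#   policy[s]         {a: pi(a | s)}
#   dynamics[(s, a)]  {(s_next, r): p(s_next, r | s, a)}
def policy_evaluation(states, policy, dynamics, gamma=0.9, theta=1e-6):
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi_a * prob * (r + gamma * v[s_next])
                for a, pi_a in policy[s].items()
                for (s_next, r), prob in dynamics[(s, a)].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:  # stop when the largest change in a sweep is small
            return v
```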

## Optimal Value Functions

In finite MDPs, value functions define a partial ordering over policies

- A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states.

The optimal state-value function, denoted $v_{\ast}$, is defined for all $s\in\mathcal{S}$ as
$$v_{\ast}(s)=\max_{\pi}v_{\pi}(s)$$

For an MDP, the Bellman optimality equation for $v_{\ast}$ is
$$v_{\ast}(s)=\max_{a\in\mathcal{A}(s)}\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\ast}(s')]$$

This equation means that any policy that is greedy with respect to the optimal evaluation function $v_{\ast}$ is an optimal policy. Optimal policies also share the same *optimal action-value function*, denoted $q_{\ast}$, which satisfies
$$q_{\ast}(s,a) = \sum_{s',r}p(s',r|s,a)[r+\gamma\max_{a'}q_{\ast}(s',a')]$$


- This equation also has a special structure that dynamic programming can exploit to find the optimal solution [2] (see the sketch after this list)
- Q-learning directly approximates $q_{\ast}$ based on this equation
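
As a sketch of the dynamic-programming angle mentioned above, value iteration applies the Bellman optimality backup for $v_{\ast}$ until convergence and then reads off a greedy policy; the encodings mirror the assumed ones in the policy-evaluation sketch, and `actions(s)` is a hypothetical helper returning the actions available in state `s`.

```python
# Value iteration: apply the Bellman optimality backup until convergence,
# then act greedily with respect to the resulting value function.
# Assumed encodings as above; actions(s) is a hypothetical helper.
def value_iteration(states, actions, dynamics, gamma=0.9, theta=1e-6):
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                sum(prob * (r + gamma * v[s_next])
                    for (s_next, r), prob in dynamics[(s, a)].items())
                for a in actions(s)
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    # A policy that is greedy with respect to v* is optimal.
    greedy = {
        s: max(actions(s), key=lambda a: sum(
               prob * (r + gamma * v[s_next])
               for (s_next, r), prob in dynamics[(s, a)].items()))
        for s in states
    }
    return v, greedy
```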

An [example](https://aclweb.org/anthology/D/D16/D16-1261.pdf) of using an MDP formulation for information extraction.

## Reference

1. Mohri, Rostamizadeh, and Talwalkar. Foundations of Machine Learning. 2012.
2. Kleinberg and Tardos. Algorithm Design. 2005.
