# Chap 01: The Reinforcement Learning (RL) Problem

## Introduction

- Three characteristics of RL problems
  - being closed-loop in an essential way
  - not having direct instructions as to what actions to take
  - having consequences of actions, including reward signals, that play out over extended time periods
- The difference between RL and supervised learning
- The difference between RL and unsupervised learning
  - RL: maximize rewards
  - unsupervised learning: find hidden structure in data
- The special challenge of RL: the tradeoff between **exploration** and **exploitation** (a minimal sketch follows this list). An agent is supposed to both
  - exploit what it already knows in order to obtain reward
  - explore in order to make better action selections in the future
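
A minimal ε-greedy sketch of this tradeoff; the bandit-style `estimated_values` array and the `epsilon` value are assumptions for illustration, not something from the book:

```python
import random

def epsilon_greedy(estimated_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest current estimate (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(estimated_values))  # explore
    return max(range(len(estimated_values)), key=lambda a: estimated_values[a])  # exploit

# With these estimates the greedy choice is action 2, but about 10% of calls still explore.
print(epsilon_greedy([0.0, 0.5, 0.8]))
```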

## Elements of RL

- A policy
  - defines the learning agent's way of behaving at a given time
  - the **core** of an agent, in the sense that it alone is sufficient to determine behavior
- A reward signal
  - defines the goal of an RL problem by determining what the good and bad events are for the agent
  - the agent's sole objective is to maximize the **total** reward it receives over the long run
  - the process that generates the reward signal must be unalterable by the agent
- A value function
  - specifies what is good in the long run
  - the *value* of a state is the total amount of reward an agent can **expect** to accumulate over the future, starting from that state
  - i.e., a prediction of long-run reward given the current state
- A model of the environment (optional)

## Tic-Tac-Toe

- The difference between evolutionary methods and methods that learn value functions
  - learning a value function takes advantage of information available during the course of play (see the sketch below)
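
A minimal sketch of such a value-function update for tic-tac-toe: after a greedy move, the value of the earlier state is moved toward the value of the later state. The board encoding, the `values` table, and `alpha` here are illustrative assumptions.

```python
# V(s) <- V(s) + alpha * [V(s') - V(s)], applied after each greedy move.
values = {}   # board state (encoded as a string here) -> estimated probability of winning
alpha = 0.1   # step-size parameter

def td_update(state, next_state, default=0.5):
    v_s, v_next = values.get(state, default), values.get(next_state, default)
    values[state] = v_s + alpha * (v_next - v_s)

# Back up the value of a winning (terminal) state to the state that preceded it.
values["X.O|.X.|..."] = 0.5
values["X.O|.X.|..X"] = 1.0   # X has three in a row on the diagonal
td_update("X.O|.X.|...", "X.O|.X.|..X")
print(values["X.O|.X.|..."])  # moved from 0.5 toward 1.0 -> 0.55
```

Unlike an evolutionary method, which only uses the final outcome of whole games, this update uses what happened during the game.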
# Chap 03: Finite Markov Decision Processes

Notation

- $S_t\in\mathcal{S}$: the environment state at step $t$, where $\mathcal{S}$ is the set of possible states
- $A_t\in\mathcal{A}(S_t)$: the action at step $t$, where $\mathcal{A}(S_t)$ is the set of actions available in state $S_t$
- $R_{t+1}\in\mathcal{R}\subset\mathbb{R}$: the reward received after taking action $A_t$
- $\pi_t$: the agent's policy, where $\pi_t(a|s)$ is the probability that $A_t=a$ if $S_t=s$

## Returns

The discounted return is

$$G_t=\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}$$

where $\gamma$ is a parameter, $0\leq \gamma\leq 1$, called the discount rate.
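
A small sketch computing $G_t$ for a finite (truncated) reward sequence; the rewards and $\gamma$ below are made-up numbers:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * R_{t+k+1}, for a finite list of future rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards received after step t: R_{t+1}=1, R_{t+2}=0, R_{t+3}=2
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```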

## MDP

A reinforcement learning task that satisfies the Markov property is called a **Markov Decision Process** (MDP).

A finite MDP is specified by its state and action sets ($\mathcal{S}$ and $\mathcal{A}$) and by the one-step dynamics of the environment:

$$p(s',r|s,a)=\text{Pr}(S_{t+1}=s',R_{t+1}=r|S_t=s,A_t=a)$$

Everything else about the environment can be computed from these dynamics (see the sketch after this list), including

- the expected reward for a state-action pair, $r(s,a)=\mathbb{E}(R_{t+1}|S_t=s,A_t=a)$
- the state-transition probability, $p(s'|s,a)=\text{Pr}(S_{t+1}=s'|S_t=s,A_t=a)$
- the expected reward for a state-action-next-state triple, $r(s,a,s')=\mathbb{E}(R_{t+1}|S_t=s,A_t=a,S_{t+1}=s')$
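
A sketch of these three computations for a tabular finite MDP. The dynamics are stored here as a nested dict `p[(s, a)][(s_next, r)] = probability`, a representation chosen only for illustration:

```python
def state_transition_prob(p, s, a, s_next):
    """p(s'|s,a) = sum_r p(s', r | s, a)"""
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(p, s, a):
    """r(s,a) = sum_{s', r} r * p(s', r | s, a)"""
    return sum(r * prob for (sp, r), prob in p[(s, a)].items())

def expected_reward_triple(p, s, a, s_next):
    """r(s,a,s') = sum_r r * p(s', r | s, a) / p(s' | s, a)"""
    num = sum(r * prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)
    return num / state_transition_prob(p, s, a, s_next)

# Toy dynamics: from state 0 under action 'a', move to state 1 with reward 1 (prob 0.7)
# or stay in state 0 with reward 0 (prob 0.3).
p = {(0, 'a'): {(1, 1.0): 0.7, (0, 0.0): 0.3}}
print(state_transition_prob(p, 0, 'a', 1))   # 0.7
print(expected_reward(p, 0, 'a'))            # 0.7
print(expected_reward_triple(p, 0, 'a', 1))  # 1.0
```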

**An alternative definition** [1]: an MDP is defined by

- a set of states $\mathcal{S}$
- a start (initial) state $s_0\in\mathcal{S}$
- a set of actions $\mathcal{A}$
- a transition probability $P(S_{t+1} = s'|S_t = s,A_t = a)$
- a reward probability $P(R_{t+1}=r|S_t=s,A_t=a)$

## Value Functions

For an MDP, the **state-value function** for a policy $\pi$ is defined as

$$v_{\pi}(s)=\mathbb{E}_{\pi}(G_t|S_t=s)=\mathbb{E}_{\pi}\Big(\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}\Big|S_t=s\Big)$$

The **action-value function** for policy $\pi$, $q_{\pi}(s,a)$, is defined similarly:

$$q_{\pi}(s,a)=\mathbb{E}_{\pi}(G_t|S_t=s,A_t=a)$$
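
Both definitions are expectations of the return, so they can be estimated by averaging sampled returns. A toy sketch (the single-step episode below is a made-up example):

```python
import random

def monte_carlo_estimate(sample_return, num_episodes=10000):
    """Estimate v_pi(s) (or q_pi(s,a)) as the average of sampled returns G_t."""
    return sum(sample_return() for _ in range(num_episodes)) / num_episodes

# Toy case: from state s the policy gets reward 1 with probability 0.4, else 0,
# and the episode then terminates, so G_t equals that single reward.
print(monte_carlo_estimate(lambda: 1.0 if random.random() < 0.4 else 0.0))  # close to 0.4
```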

- Q-learning: estimates the optimal action-value function directly from experience (see the sketch in the Optimal Value Functions section below)

### Bellman equation

The Bellman equation for $v_{\pi}$ is, for all $s\in\mathcal{S}$,

$$v_{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\pi}(s')]$$

which relates the value of a state to the values of its possible successor states under $\pi$.
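
Using the Bellman equation as an update rule gives iterative policy evaluation. A minimal sketch for a tabular MDP, reusing the illustrative `p[(s, a)][(s_next, r)]` dynamics representation from above and a policy table `pi[s][a] = probability`:

```python
def policy_evaluation(states, actions, p, pi, gamma=0.9, tol=1e-8):
    """Sweep the Bellman equation as an update until v_pi converges."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(pi[s][a] * sum(prob * (r + gamma * v[sp])
                                       for (sp, r), prob in p[(s, a)].items())
                        for a in actions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# Toy two-state MDP with one action: state 0 gives reward 1 and moves to state 1,
# which is absorbing with reward 0.
p = {(0, 'a'): {(1, 1.0): 1.0}, (1, 'a'): {(1, 0.0): 1.0}}
pi = {0: {'a': 1.0}, 1: {'a': 1.0}}
print(policy_evaluation([0, 1], ['a'], p, pi))  # {0: 1.0, 1: 0.0}
```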

Reinforcement learning can solve Markov decision processes without an explicit specification of the transition probabilities.

## Optimal Value Functions

In finite MDPs, value functions define a partial ordering over policies.

- A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states.

The optimal state-value function, denoted $v_{\ast}$, is defined for all $s\in\mathcal{S}$ by

$$v_{\ast}(s)=\max_{\pi}v_{\pi}(s)$$

For an MDP, $v_{\ast}$ satisfies the Bellman optimality equation

$$v_{\ast}(s)=\max_{a\in\mathcal{A}(s)}\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\ast}(s')]$$

This equation implies that any policy that is greedy with respect to the optimal value function $v_{\ast}$ is an optimal policy. Optimal policies also share the same *optimal action-value function*, denoted $q_{\ast}$, which satisfies

$$q_{\ast}(s,a) = \sum_{s',r}p(s',r|s,a)[r+\gamma\max_{a'}q_{\ast}(s',a')]$$
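
A minimal value-iteration sketch that solves this Bellman optimality equation by repeated sweeps, again assuming the illustrative `p[(s, a)][(s_next, r)]` dynamics table:

```python
def value_iteration(states, actions, p, gamma=0.9, tol=1e-8):
    """Repeatedly apply v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(sum(prob * (r + gamma * v[sp])
                            for (sp, r), prob in p[(s, a)].items())
                        for a in actions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    # Once v has converged, a policy that is greedy with respect to it is optimal.
    policy = {s: max(actions,
                     key=lambda a: sum(prob * (r + gamma * v[sp])
                                       for (sp, r), prob in p[(s, a)].items()))
              for s in states}
    return v, policy

# Toy MDP: in state 0, 'risky' reaches the terminal state 1 with reward 1 only half the
# time (reward 0 otherwise), while 'safe' always reaches it with reward 0.8.
p = {(0, 'risky'): {(1, 1.0): 0.5, (1, 0.0): 0.5},
     (0, 'safe'):  {(1, 0.8): 1.0},
     (1, 'risky'): {(1, 0.0): 1.0},
     (1, 'safe'):  {(1, 0.0): 1.0}}
v_star, pi_star = value_iteration([0, 1], ['risky', 'safe'], p)
print(v_star[0], pi_star[0])  # 0.8, 'safe'
```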

- This equation also has the special structure that allows dynamic programming to find the optimal solution [2].
- Q-learning solves for $q_{\ast}$ from sampled transitions, without a model of the dynamics (a minimal sketch follows this list).
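
A minimal tabular Q-learning sketch: it learns $q_{\ast}$ from sampled transitions and never touches $p(s',r|s,a)$. The toy chain environment and the parameters are assumptions for illustration:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]"""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Toy chain 0 -> 1 -> 2 (terminal); reaching state 2 yields reward 1.
# The terminal state's Q-values are never updated and stay 0 (defaultdict default).
Q = defaultdict(float)
actions = ['right']
for _ in range(500):
    q_learning_update(Q, 0, 'right', 0.0, 1, actions)   # transition 0 -> 1, reward 0
    q_learning_update(Q, 1, 'right', 1.0, 2, actions)   # transition 1 -> 2, reward 1
print(round(Q[(0, 'right')], 2), round(Q[(1, 'right')], 2))  # approx. 0.9 and 1.0
```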

An [example](https://aclweb.org/anthology/D/D16/D16-1261.pdf) of using an MDP for information extraction.

## References

1. Mohri, Rostamizadeh, and Talwalkar. *Foundations of Machine Learning*. 2012.
2. Kleinberg and Tardos. *Algorithm Design*. 2005.