
As mentioned before, Reinforcement Learning introduces the notion of sequential decision-making. This
idea of making a series of decisions forces the agent to take into account future sequences of actions,
states, and rewards. In this lesson, we will explore some of the most fundamental aspects of sequential
decision-making.

#### 2.1 Modeling Decision-Making Problems

In order to solve a problem, we must first be able to represent it in a form that abstracts it,
allowing us to work on it. For decision-making problems, we can think of a few aspects that are
common to all problems.

First, we need to be able to receive percepts of the world, that is, the agent needs to be able to sense
its environment. The input we get from the environment could directly represent the
that has purchased stocks knows that the sell and buy prices are mere estimates of what the
stock will sell for. For some transactions, this price is not totally accurate. Another example in which this
issue is much easier to understand is in robotics. For example, GPS localization is accurate
to within a few meters of precision. This amount of sensor noise could be the difference between an autonomous
car driving safely or getting into an accident with the car in the next lane. The point is that, as in
the real world, when we model it we need to account for the fact that the things we "see" are not
necessarily the things that "are". This distinction will come up later; for now, we will assume that we live
in a perfect world and that our perceptions are a true representation of the state of the world. Another
important fact to clarify is that the representation of a state must include all necessary history within the state itself.
In other words, states should be memory-less. This is known as the Markov property, and it is
a fundamental assumption for solving decision-making problems of the kind we will be exploring in these lessons.

Second, all decision-making problems have available actions. For the stock bot, we can think
of a few actions: sell, buy, hold. We could also add some special actions such as limit sell, limit buy,
options, etc. A robot could have actions that apply a given voltage to a given actuator for a given time. As
we clarified that a percept may not exactly represent the state of the world, actions might
not turn out with the same outcome every time they are taken. That is, actions are not necessarily deterministic.
For the stock agent example, we can think of the small probability that sending a buy request to the server
returns with a server communication error. That is, the probability of actually executing the action we selected
could be 99.9%, but there is still a small chance that the action doesn't go through as we intended. This
stochasticity is represented by transition functions. These functions give the probability of each possible transition
when taking an action in a given state. The sum of all transition probabilities for a given state-action pair must equal 1.
One thing we need to make clear, however, is that these probabilities must always remain the same. That is, we might not know
the exact probability of transitioning to a new state given a current state and action, but that value will always be the same.
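
To make this concrete, here is a minimal sketch of how a transition function could be stored in code. The
states, actions, and probabilities below are made up for illustration; the only property that matters is that
every state-action pair maps to a distribution over next states that sums to 1.

```python
# Hypothetical transition function for a tiny trading example:
# P[state][action][next_state] = probability of landing in next_state.
P = {
    'holding': {
        'sell': {'flat': 0.999, 'holding': 0.001},   # the order may fail to execute
        'hold': {'holding': 1.0},
    },
    'flat': {
        'buy':  {'holding': 0.999, 'flat': 0.001},
        'hold': {'flat': 1.0},
    },
}

# The probabilities for every state-action pair must sum to 1.
for state, actions in P.items():
    for action, distribution in actions.items():
        assert abs(sum(distribution.values()) - 1.0) < 1e-9, (state, action)
```
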
Many problems in fields other than Reinforcement Learning represent these signals as costs; Reinforcement Learning
refers to these signals as rewards. For our trading agent, the reward could simply be the profit or loss made
from a single transition, or perhaps we could make our reward signal the difference in total assets before and
after making a transaction. In a robotic task, the reward could be slightly more complex. For example, we could
design an agent that gets a positive reward while staying up straight walking. Or maybe it gets a reward signal after
a specific task is accomplished. The important part of the reward is that it will ultimately have a big influence
on how our agent performs. As we can see, rewards are part of the environment. However, oftentimes we have to design
these reward signals ourselves. Ideally, we are able to identify a natural signal that we are interested in maximizing.
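
As a rough sketch of the second idea above, the difference in total assets, a reward signal for the trading agent
could be computed as follows. The function and the numbers in it are hypothetical, just to show the shape of a
designed reward:

```python
# Hypothetical reward signal: the change in total assets (cash plus holdings
# valued at the current price) across one step.
def reward(cash_before, shares_before, cash_after, shares_after, price):
    assets_before = cash_before + shares_before * price
    assets_after = cash_after + shares_after * price
    return assets_after - assets_before

# Buying 10 shares at $5 with no price movement yields zero reward.
print(reward(cash_before=100, shares_before=0,
             cash_after=50, shares_after=10, price=5))  # 0
```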

the transition function mapping the probability of reaching a state to a state and action pair.

We will be using MDPs moving forward, though it is important to mention that MDPs have lots of variants:
Dec-MDP, POMDP, QMDP, AMDP, MC-POMDP, Dec-POMDP, ND-POMDP, and MMDP are some of the most common ones. They all
represent some type of problem related to MDPs. We will be loosening the constraints MDPs present and generalizing
the representation of decision-making problems as we go.

#### 2.2 Solutions Representation

Now that we have a framework to represent decision-making problems, we need to devise a way of communicating
possible solutions to the problems. The first word that comes to mind when thinking about solutions to
decision-making problems is "plan". A plan can be seen as a sequence of steps to accomplish a goal. This is
great but probably too simplistic. Mike Tyson once said, "Everyone has a plan 'till they get punched in the
mouth." And it is true, we need something more adaptive than just a simple plan. The next step then is to think
of a plan and create conditions that help us deal with the uncertainty of the environment. This type of planning
is known as conditional planning, which is basically just a regular plan in which we plan in advance for the
contingencies that may arise. However, if we expand this a bit further, we can think of a conditional plan
that takes into account every single possible contingency, even those we haven't thought of. This is called a universal
plan, or better yet, a policy. In Reinforcement Learning, a policy is a function mapping states to actions, which
represents a solution to an MDP. The algorithms that we will be discussing later will directly or indirectly produce
the best possible policy, also called the optimal policy. This is important to understand and remember.
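
As a minimal sketch, and with completely made-up states and actions, a deterministic policy is nothing more
than a lookup table from states to actions:

```python
# A policy maps every state to the action to take in that state.
# Here states are (row, column) cells and actions are compass directions.
policy = {
    (0, 0): 'E',
    (0, 1): 'E',
    (0, 2): 'S',
    (1, 2): 'S',
    (2, 2): 'E',
}

def act(state):
    """Return the action this policy prescribes for the given state."""
    return policy[state]

print(act((0, 0)))  # 'E'
```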

#### 2.3 Simple Sequential Problem

Given all of the information above, let's review the simplest problem we can think of: a casino with 2 slot machines.
To illustrate some important points better, imagine you enter the slot machine area after paying a flat fee.
However, you are only allowed to play 100 trials on any of the 2 machines.
Also, the machines pay $1 or nothing on each pull according to an underlying, fixed, and unknown
probability. The Reinforcement Learning problem then becomes: how can you maximize the amount of money you
get out of it? Should you pick an arm and stick to it for all 100 trials? Should you instead alternate, pulling 1
and 1? Should you pull 50 and 50? In other words, what is the best strategy, or policy, for maximizing all
future rewards?

The difficulty of this problem, also known as the k-Armed Bandit (in this case, k=2), is that you need to
acquire knowledge of the environment and at the same time harness the knowledge you
have already acquired. This fundamental trade-off between exploration and exploitation is what makes
decision-making problems hard. You might believe that a particular arm has a fairly high payoff probability;
should you then choose it every time? Should you choose one that you do not know well in order to gain information
about its payoff? How about choosing one that you already have good information about, but where getting more
would improve your knowledge of the environment?
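
One simple and commonly used way of balancing the two is an epsilon-greedy strategy: explore a small fraction
of the time and otherwise exploit the arm that currently looks best. The sketch below simulates this on the
2-machine, 100-pull problem; the payout probabilities are made up and, crucially, hidden from the agent.

```python
import random

true_payout_probs = [0.4, 0.6]     # unknown to the agent
epsilon = 0.1                      # fraction of pulls spent exploring
estimates = [0.0, 0.0]             # running estimate of each arm's payout
counts = [0, 0]
total_reward = 0

for _ in range(100):
    if random.random() < epsilon:                  # explore: pick a random arm
        arm = random.randrange(2)
    else:                                          # exploit: pick the best-looking arm
        arm = 0 if estimates[0] >= estimates[1] else 1
    reward = 1 if random.random() < true_payout_probs[arm] else 0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental average
    total_reward += reward

print(total_reward, estimates)
```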

All of the answers to the questions posed above depend on several factors. For example, if instead of allowing you
The knowledge that you gain from the initial exploration will ensure you maximize your rewards in the long
term.


#### 2.4 Slightly more complex problems

When explaining reinforcement learning, it is very common to show a very basic world to illustrate fundamental
concepts. Let's think of a grid world where the agent starts at 'S'. Reaching the space marked with a 'G' ends
the game and gives the agent a reward of 1. Reaching the space marked with an 'F' ends the game and gives the agent
a reward of -1. The agent is able to select from 4 actions at every step: (N, S, E, W). The selected action has exactly
the effect we expect; for example, N moves the agent one cell up and E moves it to the cell on the right. The
exceptions are when the agent attempts to enter a space marked with an 'X', which is a wall and cannot be entered,
or when the agent is in the leftmost cell and tries to move left (and so on for the other edges); these actions
simply bounce the agent back to the cell it took the action from.
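
Below is a minimal sketch of these dynamics. The grid layout itself is made up, since the lesson does not fix
one, but the bounce-back rule and the terminal rewards follow the description above.

```python
# 'S' start, 'G' goal (+1), 'F' failure (-1), 'X' wall, '.' empty cell.
GRID = ["S..",
        ".XF",
        "..G"]

MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}

def step(state, action):
    """Apply an action; bounce back on walls and on the edges of the grid."""
    row, col = state
    drow, dcol = MOVES[action]
    nrow, ncol = row + drow, col + dcol
    if not (0 <= nrow < len(GRID) and 0 <= ncol < len(GRID[0])) or GRID[nrow][ncol] == 'X':
        nrow, ncol = row, col                      # bounce back to the original cell
    cell = GRID[nrow][ncol]
    reward = 1 if cell == 'G' else -1 if cell == 'F' else 0
    done = cell in ('G', 'F')
    return (nrow, ncol), reward, done

print(step((0, 0), 'E'))   # ((0, 1), 0, False)
print(step((0, 0), 'N'))   # ((0, 0), 0, False) -- bounced off the top edge
```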


#### 2.5 Evaluating solutions

Before we begin exploring how to get the best solution to this problem, I'd like us to take a detour into how we
know how good a solution is. For example, we can imagine a solution given by a series of arrows representing
as the sum of all rewards that we would get starting at state 'S' and following the policy. This is
called policy evaluation.

One thing you might be thinking after reading the previous paragraph is: what happens if a policy gives
lots of rewards early on and nothing later, while another policy gives no rewards early on but lots of rewards later?
Is there a way we can account for our preference for early rewards? The answer is yes. So, instead of using the
sum of all rewards as we mentioned before, we will use the sum of discounted rewards in which each reward at time
`t` will be discounted by a factor, let's call it gamma, `t` times. And so we get that policy evaluation basically computes:

`Vpi(s) = Epi{r_{t+1} + g*r_{t+2} + g**2*r_{t+3} + ... | St = s}`

So, we are basically finding the value we would get from each of the states if we followed this policy.
Fair enough. Let's forget about equations; check the "Further Reading" section for those.
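
As a minimal sketch of what that expression computes, here is iterative policy evaluation on a made-up
three-state chain where the fixed policy always moves one state to the right and a reward of 1 is collected
on reaching the terminal state:

```python
states = [0, 1, 2]                    # state 2 is terminal
gamma = 0.9                           # the discount factor from above
V = {s: 0.0 for s in states}

for _ in range(100):                  # sweep until the values settle
    for s in [0, 1]:                  # the terminal state keeps a value of 0
        next_s = s + 1                # deterministic policy and transition
        r = 1.0 if next_s == 2 else 0.0
        V[s] = r + gamma * V[next_s]

print(V)   # {0: 0.9, 1: 1.0, 2: 0.0}
```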

#### 2.6 Improving on solutions

Now that we know how to come up with the value of a given policy, the natural question is: how
do we improve on a policy? If we can devise a way to improve and we know how to evaluate, we should be able
policy that would make the value calculated above larger? How about we temporarily take an action different from
that suggested by the policy and then follow the policy as originally suggested. This way we would isolate the
effect of the action on the entire policy. This is actually the basis for an algorithm called policy improvement.
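
A minimal sketch of that one-step lookahead, on a made-up two-state MDP with deterministic transitions, could
look like this. Here `V` stands in for the value of the current policy and is simply set to zero to keep the
example short:

```python
gamma = 0.9
# R[s][a] = immediate reward, NEXT[s][a] = (deterministic) next state.
R    = {'s0': {'left': 0.0, 'right': 1.0},   's1': {'left': 0.0, 'right': 0.0}}
NEXT = {'s0': {'left': 's0', 'right': 's1'}, 's1': {'left': 's0', 'right': 's1'}}
V    = {'s0': 0.0, 's1': 0.0}                # value of the current policy

def improve(policy):
    """Try every action once, follow the old policy afterwards (via V), keep the best."""
    new_policy = {}
    for s in policy:
        lookahead = {a: R[s][a] + gamma * V[NEXT[s][a]] for a in R[s]}
        new_policy[s] = max(lookahead, key=lookahead.get)
    return new_policy

print(improve({'s0': 'left', 's1': 'left'}))   # {'s0': 'right', 's1': 'left'}
```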

#### 2.7 Finding Optimal solutions

One of the powerful facts about policy improvement is that this way of finding better policies from
a given policy guarantees that a policy of at least the same quality, or a better one, will be returned.
This allows us to think of an algorithm that uses policy evaluation to get the value of a policy and then
policy improvement to try to improve that policy, and when the improvement step no longer returns a better policy,
it stops. This algorithm is called policy iteration.
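
A minimal sketch of that loop, again on a made-up two-state MDP, could look like this. The evaluation step
simply sweeps a fixed number of times to keep the example short:

```python
gamma = 0.9
NEXT = {'s0': {'stay': 's0', 'go': 's1'}, 's1': {'stay': 's1', 'go': 's0'}}
R    = {'s0': {'stay': 0.0, 'go': 0.0},   's1': {'stay': 1.0, 'go': 0.0}}

def evaluate(policy, sweeps=200):
    """Policy evaluation: the value of every state under the given policy."""
    V = {s: 0.0 for s in policy}
    for _ in range(sweeps):
        for s in policy:
            a = policy[s]
            V[s] = R[s][a] + gamma * V[NEXT[s][a]]
    return V

def improve(V):
    """Policy improvement: act greedily with respect to the values."""
    return {s: max(R[s], key=lambda a: R[s][a] + gamma * V[NEXT[s][a]]) for s in R}

policy = {'s0': 'stay', 's1': 'go'}       # arbitrary starting policy
while True:
    V = evaluate(policy)
    new_policy = improve(V)
    if new_policy == policy:              # no better policy found: stop
        break
    policy = new_policy

print(policy, V)   # {'s0': 'go', 's1': 'stay'} with values of roughly 9 and 10
```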

#### 2.8 Improving on Policy Iteration

Policy iteration is great because it guarantees that we will get the very best policy available for a given
MDP. However, it can sometimes take an unnecessarily large amount of computation before it comes up with that best policy.
Another way of thinking about this is: would there be a small number delta (e.g., 0.0001) that we would be OK with
accepting as the maximum change in value for any given state? If there is, then we could just cut the policy evaluation
algorithm short and use the value of states to guide our decision-making. This algorithm is called value iteration.
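
A minimal sketch of value iteration, on the same kind of made-up two-state MDP, could look like this. Notice
how the loop stops as soon as no state's value changes by more than the small delta:

```python
gamma, delta = 0.9, 1e-4
NEXT = {'s0': {'stay': 's0', 'go': 's1'}, 's1': {'stay': 's1', 'go': 's0'}}
R    = {'s0': {'stay': 0.0, 'go': 0.0},   's1': {'stay': 1.0, 'go': 0.0}}

V = {s: 0.0 for s in NEXT}
while True:
    biggest_change = 0.0
    for s in NEXT:
        best = max(R[s][a] + gamma * V[NEXT[s][a]] for a in NEXT[s])
        biggest_change = max(biggest_change, abs(best - V[s]))
        V[s] = best
    if biggest_change < delta:            # cut the evaluation short
        break

# Read the greedy policy off the converged values.
policy = {s: max(NEXT[s], key=lambda a: R[s][a] + gamma * V[NEXT[s][a]]) for s in NEXT}
print(V, policy)   # values of roughly 9 and 10, and {'s0': 'go', 's1': 'stay'}
```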

#### 2.9 Exercises

In this lesson, we reviewed ways to solve sequential problems. The following Notebook goes into a little more
detail about the Dynamic Programming way of solving problems. We will look into the Fibonacci sequence problem
and devise a few ways of solving it: Recursion, Memoization, and Dynamic Programming.

Lesson 2 Notebook.
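
If you want a quick preview before opening the Notebook, here is a minimal sketch of the three approaches; the
function names are just illustrative:

```python
from functools import lru_cache

# Plain recursion recomputes the same subproblems exponentially many times.
def fib_recursive(n):
    return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)

# Memoization caches each subproblem the first time it is solved.
@lru_cache(maxsize=None)
def fib_memoized(n):
    return n if n < 2 else fib_memoized(n - 1) + fib_memoized(n - 2)

# Dynamic Programming builds the answer bottom-up, with no recursion at all.
def fib_dp(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib_recursive(10), fib_memoized(10), fib_dp(10))   # 55 55 55
```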

#### 2.9 Further Reading
#### 2.10 Further Reading

* [Dynamic programming](https://people.eecs.berkeley.edu/~vazirani/algorithms/chap6.pdf)
* [Value iteration and policy iteration algorithms for Markov decision problem](http://www.ics.uci.edu/~csp/r42a-mdp_report.pdf)