- in the previous lesson, you learned about the core idea of localization in an intuitive way
- you also did localization quizzes in Python
- as Sebastian already told you, multiplication and addition solve the localization problem, and localization is a foundation for autonomous driving


- now, you will learn more about the whole underlying math behind such a localization problem
- this means I will teach you the **derivation of the base localization filter** called **Markov Localization**, which is one of the most common localization frameworks
  - in parallel, you will also learn to implement the math directly with C++
- I'll also talk about the motion and observation model

- Markov Localization or Bayes Filter for Localization is a generalized filter for localization and all other localization approaches are realizations of this approach, as we'll discuss later on
- by learning how to derive and implement (coding exercises) this filter we develop intuition and methods that will help us solve any vehicle localization task, including implementation of a particle filter


- we don't know exactly where our vehicle is at any given time, but we can approximate its location
- as such, we generally think of our vehicle location as a probability distribution; each time we move, our distribution becomes more diffuse (wider)
- we pass our variables (map data, observation data, and control data) into the filter to concentrate (narrow) this distribution, at each time step
- each state prior to applying the filter represents our prior and the narrowed distribution represents our Bayes' posterior

# Localization Posterior: Introduction

- here is a picture from the introduction when you first encountered localization--we have a map with all these landmarks in a global coordinate system, observations from the on-board sensor in the local coordinate system, and information about how the car moves between two timesteps

<img src="resources/localization_posterior.png"/>

- the observations are defined as a vector $z$ which includes all observations from timestep $1$ to $t$
  - the observations could be range measurements, bearing angles, or images, for example
- we also have the controls of the car as a vector $u$ which includes all control elements from timestep $1$ to $t$
  - typically, you have yaw, pitch, or roll rates and velocity information
- the map could be a grid map of the global environment, or a database which includes global feature points and the lane geometry
  - here, we do not add the time index $t$ to the map because we assume the map does not change over time
- we assume these variables are known


- again, what we want to estimate is the transformation between the local coordinate system of the car and the global coordinate system of the map
- if we know this transformation, then we also know the pose of the car in the global map
  - the pose of the car at time $t$ is denoted $x_t$
  - if we assume we have a 2D map, for example, $x_t$ includes a position with $x$ and $y$ coordinates and also the orientation $\phi$
- these values are unknown


- we will never know the state $x_t$ with perfect accuracy
- what we want is to form a sufficiently accurate belief of the state $x_t$, and we want to formulate this belief $bel(x_t)$ in a probabilistic way

## Formal Definition of Variables

- $z_{1:t}$ represents the observation vector from time $1$ to $t$ (range measurements, bearing, images, etc.)
- $u_{1:t}$ represents the control vector from time $1$ to $t$ (yaw/pitch/roll rates and velocities)
- $m$ represents the map (grid maps, feature maps, landmarks)
- $x_t$ represents the pose (position (x,y) + orientation $\phi$)

# Localization Posterior Explanation and Implementation

**Q:** Given the map, the control elements of the car, and the observations, what is the definition of the *posterior distribution for the state $x$ at time $t$*?

**A:** It's $bel(x_t) = p(x_t|z_{1:t}, u_{1:t}, m)$
- the belief of $x_t$ is the posterior distribution of $x_t$ given all observations, the controls and the map

- localization is all about estimating the probability distribution of the state $x_t$, which is the pose of the car, under the condition that all previous observations $z$, from time $1$ to $t$, and all previous controls $u$, from time $1$ to $t$, are given
- to solve the pure localization problem, we assume the map is correct and does not change--therefore, the map is also given
- if we also wanted to estimate the map, we would be solving the simultaneous localization and mapping (SLAM) problem, which is much more complex
  - $p(x_t, m|z_{1:t}, u_{1:t})$
  - I won't talk about this


- I will talk about the posterior distribution from the quiz above
- before we go deeper into math, I want to show you how we define the different input data for a specific 1D localization scenario
  - this means I will explain how the car is sensing and moving
  - I will also show you how the map looks
  - this example is very similar to what you already learned with Sebastian

- let's talk about the map first
- the map includes the position of street lamps and trees in 1D--this means we are working with landmark-based maps which are, in general, more sparse than grid-based maps

- in the 1D case, the map is a vector of the position where these objects are
  - here, the map includes six landmarks with the values $9, 15, 25, 31, 59, 77$

<img src="resources/1D_map.png"/>

- for the observation we state that the car measures the nearest $k$ seen static objects, in driving direction
- so we assume that the car can detect the distances to street lamps and trees
  - this results in an observation list which includes, for each time stamp $t$, a vector of distances $z_t$, from $1$ to $k$
  - it is possible that we detect multiple trees or street lamps at a time stamp $t$, or detect nothing at all

<img src="resources/1D_observations.png"/>


- the control vector includes a direct move of the car between consecutive time stamps
  - this means the control is defined by the distance the car traveled between $t-1$ and $t$
  - in our case, the car moves 2 meters to the right

<img src="resources/1D_control_vector.png"/>

- the true pose of the car is somewhere on the mapped area
- since the map is discrete, the pose of the car could be any integer between $0$ and $99$ meters
  - this means the belief of $x_t$ is defined as a vector of one hundred elements, and each element represents the probability that the car is located at the corresponding position
  - the goal is now to estimate these values

<img src="resources/1D_true_pose.png"/>
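
- as a minimal sketch of how these inputs could be held in C++, the map, control, and belief of the 1D example might look like the following; the struct `Scenario1D` and both helper functions are hypothetical names chosen for illustration, not part of the course code

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for the inputs of the 1D example.
// All names here are illustrative assumptions.
struct Scenario1D {
  std::vector<float> landmark_positions;         // map m, in meters
  std::vector<std::vector<float>> observations;  // one vector z_t per time step
  float control = 0.0f;                          // u_t: meters moved per step
  std::size_t map_size = 0;                      // number of discrete positions
};

// Build the scenario described above: six landmarks, a 2 m move per
// time step, and a state space of 100 positions (0..99 meters).
Scenario1D make_example_scenario() {
  Scenario1D s;
  s.landmark_positions = {9.0f, 15.0f, 25.0f, 31.0f, 59.0f, 77.0f};
  s.control = 2.0f;
  s.map_size = 100;
  return s;
}

// The belief bel(x_t) holds one probability per discrete position.
// It starts as all zeros here; the values still have to be estimated.
std::vector<float> make_empty_belief(const Scenario1D& s) {
  return std::vector<float>(s.map_size, 0.0f);
}
```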

# Bayes' Rule

- before we dive deeper into Markov localization, we should review Bayes' Rule
- this will serve as a refresher for those familiar with Bayesian methods and we provide some additional resources for those less familiar


- recall that Bayes' Rule enables us to determine the conditional probability of a state given evidence $P(a|b)$ by relating it to the conditional probability of the evidence given the state $P(b|a)$ in the form of: $P(a) \times P(b|a) = P(b) \times P(a|b)$ which can be rearranged to: $P(a|b) = \dfrac{P(b|a)P(a)}{P(b)}$
- in other words the probability of state $a$, given evidence $b$, is the probability of evidence $b$, given state $a$, multiplied by the probability of state $a$, normalized by the total probability of $b$ over all states

### Bayes' Rule Applied

- let's say we have two bags of marbles, bag 1 and bag 2, filled with two types of marbles, red and blue
- bag 1 contains 10 blue marbles and 30 red marbles, whereas bag 2 contains 20 of each color marble


- if a friend were to choose a bag at random and then a marble at random from that bag, how can we determine the probability that the marble came from a specific bag? You guessed it - Bayes' Rule!
- in this scenario, our friend produces a red marble; in that case, what is the probability that the marble came from bag 1?
- rewriting this in terms of Bayes' Rule, our solution becomes: $P(Bag1|Red) = \dfrac{P(Red|Bag1)P(Bag1)}{P(Red)}$


- what is the prior probability of choosing bag 1?--this is the term P(Bag1)
  - it's 0.5--we have two bags, giving us a 50% chance of choosing either bag
- what is the probability of choosing a red marble from bag 1?--this is our likelihood term P(Red|Bag1)
  - it's 0.75
- what is the total probability of choosing a red marble?--this is our normalization term (denominator) P(Red)
  - it's 0.625
- now, putting everything together, using the formula for Bayes' Rule, what is our posterior probability of the red marble originating from bag 1?--this is our posterior term P(Bag1|Red)
  - it's 0.60--we use values calculated above and in the formula for Bayes' Rule

### Bayesian Methods Resources

- [Sebastian Discusses Bayes Rule](https://classroom.udacity.com/nanodegrees/nd013/parts/30260907-68c1-4f24-b793-89c0c2a0ad32/modules/28233e55-d2e8-4071-8810-e83d96b5b092/lessons/3c8dae65-878d-4bee-8c83-70e39d3b96e0/concepts/487221690923?contentVersion=2.0.0&contentLocale=en-us)
- [More Bayes Rule Content from Udacity](https://classroom.udacity.com/courses/st101/lessons/48703346/concepts/483698470923)
- [Bayes Rule with Ratios](https://betterexplained.com/articles/understanding-bayes-theorem-with-ratios)
- [A Deep Dive into Bayesian Methods, for Programmers](http://greenteapress.com/wp/think-bayes/)

# Bayes' Filter For Localization

- we can apply Bayes' Rule to vehicle localization by passing variables through Bayes' Rule for each time step, as our vehicle moves
  - this is known as a **Bayes' Filter for Localization**
- we will cover the specifics as the lesson continues, but the generalized form of Bayes' Filter for Localization is shown below
  - you may recognize this as being similar to a Kalman filter--in fact, many localization filters, including the Kalman filter are special cases of Bayes' Filter


- remember the general form for Bayes' Rule: $P(a|b) = \dfrac{P(b|a)P(a)}{P(b)}$
- with respect to localization, these terms are:
  - $P(location|observation)$: this is $P(a|b)$, the normalized probability of a position given an observation (posterior)
  - $P(observation|location)$: this is $P(b|a)$, the probability of an observation given a position (likelihood)
  - $P(location)$: this is $P(a)$, the prior probability of a position
  - $P(observation)$: this is $P(b)$, the total probability of an observation


- without going into detail yet, be aware that $P(location)$ is determined by the motion model
- the probability returned by the motion model is the product of the transition model probability (the probability of moving from $x_{t-1} \rightarrow x_t$) and the probability of the state $x_{t-1}$, summed over all possible previous states $x_{t-1}$


- over the course of this lesson, you’ll build your own Bayes’ filter
- in the next few quizzes, you’ll write code to:
  - compute Bayes’ rule
  - calculate Bayes' posterior for localization
  - initialize a prior belief state
  - create a function to initialize a prior belief state given landmarks and assumptions

# Calculate Localization Posterior

- to continue developing our intuition for this filter and prepare for later coding exercises, let's walk through calculations for determining posterior probabilities at several pseudo positions $x$, for a single time step
- we will start with a time step after the filter has already been initialized and run a few times
- we will cover initialization of the filter in an upcoming concept


- the **raw** $P(location|observation)$ is the result prior to dividing by the total probability $P(observation)$, the $P(b)$ term (denominator) of the generalized Bayes' rule
- the **normalized** $P(location|observation)$ is the result after dividing by $P(observation)$

| pseudo_position (x) | P(location) | P(observation∣location) | Raw P(location∣observation) | Normalized P(location∣observation) |
|---------------------|-------------|-------------------------|-----------------------------|------------------------------------|
| 1 | 1.67E-02 | 0.00E+00 | 0.00E+00 | 0.00E+00 |
| 2 | 3.86E-02 | 6.99E-03 | ?        | 2.59E-02 |
| 3 | 4.90E-02 | 8.52E-02 | 4.18E-03 | 4.01E-01 |
| 4 | 3.86E-02 | ?        | 5.42E-03 | 5.21E-01 |
| 5 | 1.69E-02 | 3.13E-02 | 5.31E-04 | 5.10E-02 |
| 6 | 6.51E-03 | 9.46E-04 | 6.16E-06 | ?        |
| 7 | ?        | 3.87E-06 | 6.55E-08 | 6.29E-06 |
| 8 | 3.86E-02 | 0.00E+00 | 0.00E+00 | 0.00E+00 |

- **Q:** what is $P(observation|location)$ for $x = 4$
- **A:** to determine the observation probability, divide the raw posterior $P(location|observation)$ by $P(location)$: $\dfrac{5.42E-3}{3.86E-2} = 1.40E-1$


- **Q:** what is the raw posterior probability $P(location|observation)$ for $x = 2$
- **A:** to determine the raw posterior probability, multiply $P(observation|location)$ by $P(location)$: $6.99E-3 \times 3.86E-2 = 2.70E-4$


- **Q:** what is the Normalized posterior probability for $x = 6$
- **A:** to determine the normalized posterior probability, first sum the raw posterior values to get the total:
$0.00E+00 + 2.70E-04 + 4.18E-03 + 5.42E-03 + 5.31E-04 + 6.16E-06 + 6.55E-08 + 0.00E+00 = 1.04E-02$
  - next, divide the raw posterior for $x = 6$ by the sum: $\dfrac{6.16E-06}{1.04E-02} = 5.92E-4$


- **Q:** what is the position probability for x = 7?
- **A:** to determine the position probability, divide the raw posterior $P(location|observation)$ by $P(observation|location)$: $\dfrac{6.55E-08}{3.87E-06} = 1.69E-2$

# Initialize Belief State

- to help develop an intuition for this filter and prepare for later coding exercises, let's walk through the process of initializing our prior belief state
- that is, what values should our initial belief state take for each possible position?
  - let's say we have a 1D map extending from $0$ to $25$ meters
  - we have landmarks at $x = 5.0$, $10.0$, and $20.0$ meters, with position standard deviation of $1.0$ meter
  - if we know that our car's initial position is at one of these three landmarks, how should we define our initial belief state?

- since we know that we are parked next to a landmark, we can set our probability of being next to a landmark as $1.0$
- accounting for a position precision of $+/- 1.0$ meters, this places our car at an initial position in the range $[4, 6] (5 +/- 1)$, $[9, 11] (10 +/- 1)$, or $[19, 21] (20 +/- 1)$
- all other positions, not within $1.0$ meter of a landmark, are initialized to $0$
- we normalize these values to a total probability of $1.0$ by dividing by the total number of positions that are potentially occupied
- in this case, that is $9$ positions, $3$ for each landmark (the landmark position and one position on either side)
- this gives us a value of $1.11E-01$ for positions $+/- 1$ from our landmarks $(1.0/9)$
- so, our initial belief state is:
```
{0, 0, 0, 1.11E-01, 1.11E-01, 1.11E-01, 0, 0, 1.11E-01, 1.11E-01, 1.11E-01, 0, 0, 0, 0, 0, 0, 0, 1.11E-01, 1.11E-01, 1.11E-01, 0, 0, 0, 0}
```

- to reinforce this concept, let's practice with a quiz
  - map size: 100 meters
  - landmark positions: {8, 15, 30, 70, 80}
  - position standard deviation: 2 meters
- assuming we are parked next to a landmark, answer the following questions about our initial belief state


- **Q:** What is our initial probability (initial belief state) for position $11$? If the answer is non-zero, enter it in scientific notation with an accuracy of two decimal places, for example $3.14E-15$.
- **A:** It's $0$. Position $11$ is not within $2$ meters of a landmark.


- **Q:** What is our initial probability (initial belief state) for position $71$? If the answer is non-zero, enter it in scientific notation with an accuracy of two decimal places, for example $3.14E-15$.
- **A:** To determine the initial probability we will divide 1.0 by the total number of positions within 2 meters of a landmark. In this case we have 5 landmarks and a position standard deviation of 2.0 meters. This gives us 5 potentially occupied positions per landmark (the landmark position and 2 each side), yielding 25 potentially occupied positions (5 landmarks * 5 positions/landmark). $1.0/25 = 4.00E-02$
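
- the quiz answers can be checked with a small C++ sketch, assuming integer positions from $0$ to `map_size - 1` and treating every position within one position standard deviation of a landmark as potentially occupied; the name and signature of `initialize_priors` are assumptions for illustration

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: uniform prior over all positions that lie within
// position_stdev of some landmark, zero everywhere else.
std::vector<float> initialize_priors(std::size_t map_size,
                                     const std::vector<float>& landmarks,
                                     float position_stdev) {
  std::vector<float> priors(map_size, 0.0f);
  std::size_t occupied = 0;

  for (std::size_t x = 0; x < map_size; ++x) {
    for (float landmark : landmarks) {
      if (std::fabs(static_cast<float>(x) - landmark) <= position_stdev) {
        priors[x] = 1.0f;
        ++occupied;
        break;  // count a position once even if two landmarks overlap
      }
    }
  }

  // Normalize so that the belief sums to 1.0.
  if (occupied > 0) {
    for (float& p : priors) p /= static_cast<float>(occupied);
  }
  return priors;
}
```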

# Initialize Priors Function

- in this quiz we will create a function that initializes priors (initial belief state for each position on the map) given landmark positions, a position standard deviation $(+/- 1.0)$, and the assumption that our car is parked next to a landmark


- note that the control standard deviation represents the spread from movement (movement is the result of our control input in this case)
  - we input a control of moving $1$ step but our actual movement could be in the range of $1 +/-$ control standard deviation
- the position standard deviation is the spread in our actual position
  - for example, we may believe we start at a particular location, but we could be anywhere within that location $+/-$ our position standard deviation

- code is available in `code/01_initialize_priors/`
- for simplicity we assumed a position standard deviation of $1.0$ and coded a solution for initializing priors accordingly

# Quiz: How Much Data?

- before we go back to math, I want to make sure you understand how much data $z_{1:t}$ represents, so you can consider what the performance consequences would be


- **Q:** So let's pretend that each observation vector contains 100000 data points or observations. Each of those observations may have 5 data points which take 4 bytes each. If we have been driving a car for 6 hours with our sensor updating at 10 hertz, how much data is contained in this $z_{1:t}$?
- **A:** Well, a 6-hour drive, times 3600 seconds per hour, times 10 cycles per second, times 100,000 observations per cycle, times 5 data points per observation, times 4 bytes per data point results in 432 gigabytes.
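
- the arithmetic can be double-checked with a short C++ sketch; the function name and parameters are made up for illustration, and one gigabyte is taken as $10^9$ bytes

```cpp
#include <cstdint>

// Number of bytes accumulated in z_{1:t} after driving for `hours`
// hours with a sensor that updates `rate_hz` times per second.
std::uint64_t observation_history_bytes(std::uint64_t hours,
                                        std::uint64_t rate_hz,
                                        std::uint64_t observations_per_cycle,
                                        std::uint64_t values_per_observation,
                                        std::uint64_t bytes_per_value) {
  const std::uint64_t cycles = hours * 3600 * rate_hz;
  return cycles * observations_per_cycle * values_per_observation *
         bytes_per_value;
}
```

- calling `observation_history_bytes(6, 10, 100000, 5, 4)` gives $4.32 \times 10^{11}$ bytes, which matches the 432 gigabytes above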

# Derivation Outline

- up to here, there are two problems if we want to estimate the posterior directly
  - the first is that the localizer must process a lot of data on each cycle
  - the second is that the amount of data increases over time
- this won't work for a real-time localizer, which should run at 10 Hz or more in our vehicles


- in the following, I will present a mathematical proof showing that we can change this so our localizer:
  - only needs to handle a few bytes on each update
  - handles the same amount of data per update regardless of drive time

# Apply Bayes Rule with Additional Conditions

- you already learned that the observation vector could be a lot of data, and we do not want to carry the whole observation history to estimate the state beliefs
- the idea is that we manipulate the posterior $p(x_t|z_{1:t},u_{1:t},m)$ in such a way that we get a recursive state estimator
- we have to show that the current belief $bel(x_t)$ can be expressed by the belief one step earlier, $bel(x_{t-1})$, and then update the current belief only with the new observation information
  - we call this estimator the **Bayes Localization Filter** or **Markov Localization**
  - this will allow us to avoid having to carry around all historical observation and motion data


- to achieve this recursive structure, you have to apply probabilistic rules and laws like the Bayes Rule, or The Law of Total Probability
  - you already heard about this in Sebastian's Lessons so this should not be something new for you
- I will also teach about The Markov Assumption
  - this involves making meaningful assumptions about the dependencies between certain values


- so our goal on the next steps is to define the posterior in a recursive way
- the first step is to split the whole observation vector $z_{1:t}$ into the current observations $z_t$ and the previous observations $z_{1:t-1}$
  - the posterior can then be rewritten as $p(x_t|z_t,z_{1:t-1},u_{1:t}, m)$
  - this is important to achieve the recursive structure
- now, we apply Bayes' Rule
  - it is the most fundamental consideration in probabilistic inference
  - the tricky part here is that you have more than one variable on the right side, which means you have to apply Bayes' Rule with multiple conditions

<img src="resources/bayes_rule_with_additional_conditions.png"/>

- **Q:** Apply Bayes Rule to determine the right side of Bayes rule, where the posterior, $P(a|b)$, is $p(x_t|z_t,z_{1:t-1},u_{1:t},m)$
- **A:** It's $\dfrac{p(z_t|x_t,z_{1:t-1},u_{1:t},m) \times p(x_t|z_{1:t-1},u_{1:t},m)}{p(z_t|z_{1:t-1},u_{1:t},m)}$

# Bayes Rule and Law of Total Probability

- to define the likelihood term, $p(z_t|x_t,z_{1:t-1},u_{1:t},m)$, we swap the state and the observation at $t$, and also take into account all other conditions
- the prior $p(x_t|z_{1:t-1},u_{1:t},m)$ and the normalizer $p(z_t|z_{1:t-1},u_{1:t},m)$ are also conditioned on the previous observations, the controls, and the map
- it is totally fine here to condition Bayes' Rule on arbitrary random variables like the controls, our observations, and the map
- if you remove the additional conditions in the posterior, you would end up exactly with the general Bayes' formula


- the likelihood term is called the observation model; it describes the probability distribution of the observation vector $z_t$ under the assumption that the state $x_t$, all previous observations, all controls, and the map are given
- the prior is called the motion model--it is the probability distribution of $x_t$ given all observations from $1:t-1$, all controls, and the map; note that no current observations are included in the motion model

- to simplify the normalization part, we define the normalizer as eta, $\eta$
- $\eta$ is one over the original normalization term, and this term is the sum of the product of the observation model and the motion model over all possible states $x_t^{(i)}$: $\eta = \dfrac{1}{\sum\limits_{i} p(z_t|x_t^{(i)},z_{1:t-1},u_{1:t},m)\,p(x_t^{(i)}|z_{1:t-1},u_{1:t},m)}$
  - this also means you only have to define the observation and motion model to estimate the beliefs
  
<img src="resources/eta_normalizer.png"/>

# Total Probability and Markov Assumption

- the problem with the definition of the motion model is that we have no information where the car was before at time $t-1$
  - this means, no information about the previous state, $x_{t-1}$
  - what kind of rule or law can we use which will help us here?
- in this case, the Law of Total Probability will help you a lot; it is a separate rule from Bayes' Rule
- Law of Total Probability: $P(B) = \sum\limits_{i=1}^{\infty} P(B|A_i)P(A_i)$

<img src="resources/law_of_total_probability.png"/>


- we introduce a state $x_{t-1}$ and assume the state is given
- then, the probability distribution of our motion model can be expressed as the integral, over the whole state space of $x_{t-1}$, of the distribution of $x_t$ given the previous state, the previous observations, the controls, and the map, multiplied by the probability distribution of the previous state itself: $p(x_t|z_{1:t-1},u_{1:t},m) = \int p(x_t|x_{t-1},z_{1:t-1},u_{1:t},m)\,p(x_{t-1}|z_{1:t-1},u_{1:t},m)\,dx_{t-1}$


- let us represent this situation as a graph to visualize the dependencies between the variables
  - because of introducing $x_{t-1}$, you are looking at all possible states of the previous time step and then predict where the car would be in the next time step
  - since we also have all the other given values, we also use this information to estimate $x_t$.
  - the same information is also used to estimate $x_{t-1}$ itself
    - and of course, $x_{t-1}$ is unknown

# Markov Assumption for Motion Model

- **Q:** What do you think about these two assumptions:
  - (a) Since we (hypothetically) know in which state the system is at time step $t-1$, the past observations $z_{1:t-1}$ and controls $u_{1:t-1}$ would not provide us additional information to estimate the posterior for $x_t$, because they were already used to estimate $x_{t-1}$. This means, we can simplify $p(x_t|x_{t-1}, z_{1:t-1}, u_{1:t},m)$ to $p(x_t|x_{t-1}, u_t, m)$.
  - (b) Since $u_t$ is “in the future” with reference to $x_{t-1}, u_t$ does not tell us much about $x_{t-1}$. This means the term $p(x_{t-1}|z_{1:t-1}, u_{1:t}, m)$ can be simplified to $p(x_{t-1}|z_{1:t-1}, u_{1:t-1}, m)$.
- **A:** Both assumptions are meaningful.

- a Markov process is one in which the conditional probability distribution of future states (ie the next state) is dependent only upon the current state and not on other preceding states
  - this can be expressed mathematically as: $P(x_t|x_{t-1},\ldots,x_{t-i},\ldots,x_0) = P(x_t|x_{t-1})$
  - it is important to note that the current state may contain all information from preceding states

- I want to introduce the first order Markov Assumption
- assume you want to estimate the posterior distribution of $x_t$ given all previous states, and you have no observations or controls
  - this is a pretty simple example but it works fine to explain the Markov Assumption

- you can write this distribution as the following
  - this relation can be represented as a chain
  - for example, to estimate or predict $x_1$ we only use $x_0$, to estimate $x_2$ we use $x_1$ and $x_0$, and finally, for $x_3$ we use $x_2$, $x_1$ and $x_0$
  - in this example, the Markov Assumption postulates that $x_2$ is the best predictor for $x_3$
    - this means that the older states, $x_1$ and $x_0$, carry no additional information that would predict $x_3$ more accurately
    - we also say the state $x_2$ is complete
    - we remove the links or connectors between $x_1$ and $x_3$ and $x_0$ and $x_3$, which means $x_3$ is independent of $x_0$ and $x_1$--it only depends on $x_2$
    - and of course, for $x_2$, it is the same


- since we now assume, that $x_t$ only depends on the previous state, we can rewrite the posterior in this way
- so, if we want to continue this chain, which means to predict the future, we only take $x_3$ into consideration

<img src="resources/first_order_markov_assumption.png"/>

- an example could be a weather forecast: tomorrow's weather depends only on today's weather, which already includes all previous information and is, of course, uncertain
- as an important fact, we have to assume that we have an initial guess for $x_0$--so, $x_0$ must be initialized correctly

- let's go back to our motion model and I will show you how we can benefit from the Markov Assumption
- before I talked about the Markov Assumption, I stopped here and asked you how we can simplify the structure
- now, I will show you how the Markov Assumption can help us here

<img src="resources/motion_model.png"/>

- first, I split the control vector into the current control, $u_t$, and our previous controls
- let's take a look at the first term: the probability distribution of $x_t$ is conditioned on $x_{t-1}$, all previous observations and controls, and the map
  - here, we apply the Markov Assumption for the first time--since we already know $x_{t-1}$, $z_{1:t-1}$ and $u_{1:t-1}$ will not carry additional information to predict $x_t$ in a better way
    - these values were already used to estimate $x_{t-1}$
    - this means, $x_t$ is independent of these values
    - because of this, we can remove these two conditions in the graph; as a result, the posterior distribution of $x_t$ only depends on $x_{t-1}$, $u_t$, and the map
  - this term is called the transition or system model, which predicts the new state from the previous one
    - we do not need the whole observation or control history
    - here, you can also consider that the map, $m$, does not influence $x_t$--it is common practice to neglect $m$, but here, we keep it
- the second term describes a posterior distribution of $x_{t-1}$, given all previous observations or controls, and the map
  - we use a Markov Assumption again
  - we assume that $u_t$ tells us nothing about $x_{t-1}$, because $u_t$ is in the future
    - we ignore $u_t$ to estimate the state, $x_{t-1}$
    - based on this assumption, we rewrite the motion model again


- after these two steps, we achieved a really, really important step

<img src="resources/motion_model.png"/>

- **Q:** Which statement is correct?
  - Statement 1: After applying the Markov Assumption, the term $p(x_{t-1} | z_{1:t-1}, u_{1:t-1}, m)$ describes exactly the belief at $x_{t-1}$! This means we achieved a recursive structure!
  - Statement 2: After applying the Markov Assumption, we can neglect the term $p(x_{t-1} | z_{1:t-1}, u_{1:t-1}, m)$ completely and we only have to estimate the posterior.
- **A:** With the help of the Markov Assumption we achieved the recursive structure!

# Recursive Structure

- we have achieved a very important step towards the final form of our recursive state estimator
- if we rewrite the second term in our integral from $p(x_{t-1}|z_{1:t-1},u_{1:t-1},m)$ to $p(x_{t-1}|z_{t-1},z_{1:t-2},u_{1:t-1},m)$ we arrive at a function that is exactly the belief from the previous time step, namely $bel(x_{t-1})$
  - now, we can rewrite the integral with a $bel(x_{t-1})$ inside
  - the amazing thing is that we have a recursive update formula and can now use the estimated state from the previous time step to predict the current state at $t$
  - this is a critical step in a recursive Bayesian filter because it renders us independent from the entire observation and control history
  - so in the graph structure, we will replace the previous state terms (highlighted) with our belief of the state $x$ at $t-1$


- finally, we replace the integral by a sum over all $x_i$ because we have a discrete localization scenario in this case, which gives the same formula as in Sebastian's lesson on localization
- the process of predicting $x_t$ from the previous belief $bel(x_{t-1})$ and the transition model is technically a convolution
- if you take a look at the formula again, it is essential that the belief at $t = 0$, $bel(x_0)$, is initialized with a meaningful assumption
- it depends on the localization scenario how you set the belief or in other words, how you initialize your filter
  - for example, you can use GPS to get a coarse estimate of your location

<img src="resources/recursive_structure.png"/>

# Implementation Details for Motion Model

- before you start coding, you'll need some details to help with implementing the prediction step
- at the very beginning, the assumption is that the car is parked at a tree or a street lamp plus/minus 1 meter
- the transition model is controlled only by $x_{t-1}$ and $u_t$
- here, we assume the transition model is independent of the map
- remember that $u_t$ is a direct move pointed in driving direction
- the transition model is defined by the 1D normal distribution with mean $u_t$ and standard deviation $\sigma_{u_t}$, which we have to evaluate at $x_t - x_{t-1}^{(i)}$
- here, $\sigma_{u_t}$ is 1 meter
- the state space range is from 0 to 99 meters with a 1-meter step resolution

<img src="resources/implementation_details_for_motion_model.png"/>

# Noise in Motion Model: Quiz

- assume you have a 1D space between 0 and 30 meters
- at the very beginning, the robot or the car has no clue where it is, so our initial belief would be the uniform distribution, which means maximum confusion
- now we assume the car knows it is parked close to a tree
- then, the initial belief would look like this


- **Q:** What does that belief look like after we move 10 meters to the right with no noise, with low noise, and with high noise?
- **A:** If we move with no noise, which means the transition model is certain, the result is C. This means we just shift the initial belief 10 meters to the right. No noise does not mean we get better precision. If we move with low noise, the belief spreads out, so answer B is correct. And if we move with high noise, the belief is very spread out and almost looks uniform, so answer A is correct.

<img src="resources/noise_in_motion_model_quiz.png"/>

# Determine Probabilities

- to implement these models in code, we need a function to which we can pass model parameters/values and return a probability
- fortunately, we can use the probability density function (PDF) of the normal (Gaussian) distribution
- we have implemented this Gaussian distribution as a C++ function, `normpdf`, and will practice using it at the end of this concept
  - it accepts a value, a parameter (the mean), and a standard deviation, returning a probability

### Additional Resources for Gaussian Distributions

- [Udacity's Statistics Course content on PDF](https://classroom.udacity.com/courses/st095/lessons/86217921/concepts/1020887710923)
- http://mathworld.wolfram.com/NormalDistribution.html
- http://stattrek.com/statistics/dictionary.aspx?definition=Probability_density_function

- let's practice using `normpdf` to determine transition model probabilities
- specifically, we need to determine the probability of moving from $x_{t-1} \rightarrow x_t$
- the value entered into `normpdf` will be the distance between these two positions
- we will refer to potential values of these positions as pseudo position and pre-pseudo position
  - for example, if our pseudo position $x$ is $8$ and our pre-pseudo position is $5$, our sample value will be $3$, and our transition will be from $x - 3 \rightarrow x$


- **Q:** Given pseudo position x and a control parameter of 1 (move 1 unit each time step), which pre-pseudo position maximizes our probability?
- **A:** x-1. The probability is always maximized when the value and the parameter are equal. In this case the control parameter is 1 (move 1 unit per time step), so the maximum lies at x - 1; in general, the maximum probability is at x - control_parameter.
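
- the answer can also be checked by brute force; the sketch below assumes integer positions on a 1-unit grid and uses hypothetical names (`gaussian`, `most_likely_previous_position`) that are not taken from the course code

```cpp
#include <cassert>
#include <cmath>

// Gaussian density; stands in for the lesson's `normpdf` helper (assumed form).
double gaussian(double x, double mu, double sigma) {
  const double kInvSqrt2Pi = 0.3989422804014327;  // 1 / sqrt(2 * pi)
  const double z = (x - mu) / sigma;
  return (kInvSqrt2Pi / sigma) * std::exp(-0.5 * z * z);
}

// Hypothetical helper: scans pre-pseudo positions 0..max_position and returns
// the one with the highest transition probability into `pseudo_position`.
int most_likely_previous_position(int pseudo_position, double control,
                                  double stdev, int max_position) {
  assert(max_position >= 0);
  int best = 0;
  double best_prob = -1.0;
  for (int pre = 0; pre <= max_position; ++pre) {
    const double delta = pseudo_position - pre;  // distance moved
    const double prob = gaussian(delta, control, stdev);
    if (prob > best_prob) {
      best_prob = prob;
      best = pre;
    }
  }
  return best;
}
```

- for a pseudo position of 8 and a control parameter of 1, the scan should select 7, that is, x - control_parameter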

- code is available in `code/02_normpdf/`

# Motion Model Probability I

- now we will practice implementing the motion model to determine $P(location)$ for our Bayesian filter
- we discussed the derivation of the model in Recursive Structure and Implementation Details for Motion Model


- recall that we derived the following recursive structure for the motion model: $\int p(x_t|x_{t-1}, u_t, m)bel(x_{t-1})dx_{t-1}$
- and that we will implement this in the discretized form: $\sum\limits_{i} p(x_t|x_{t-1}^{(i)}, u_t, m)bel(x_{t-1}^{(i)})$


- let's consider again what the summation above is doing - calculating the probability that the vehicle is now at a given location, $x_t$
- how is the summation doing that?
  - it's looking at each prior location where the vehicle could have been, $x_{t-1}$
  - then the summation iterates over every possible prior location, $x_{t-1}^{(1)}...x_{t-1}^{(n)}$
  - for each possible prior location in that list, $x_{t-1}^{(i)}$, the summation yields the **total probability** that the vehicle really did start at that prior location **and** that it wound up at $x_t$


- that now raises the question, how do we calculate the individual probability that the vehicle really did start at that prior location and that it wound up at $x_t$, for each possible starting position $x_{t-1}$?
  - that's where each individual element of the summation contributes
  - the likelihood of starting at $x_{t-1}$ and arriving at $x_{t}$ is simply $p(x_t|x_{t-1}) * p(x_{t-1})$
  - we can say the same thing, using different notation and incorporating all of our knowledge about the world, by writing: $p(x_t|x_{t-1}^{(i)}, u_t, m) * bel(x_{t-1}^{(i)})$
  - from the equation above we can see that our final position probability is the sum of $n$ discretized motion model calculations, where each calculation is the product of the 'i'th transition probability, $p(x_t|x_{t-1}^{(i)}, u_t, m)$, and 'i'th belief state, $bel(x_{t-1}^{(i)})$


- 'i'th Motion Model Probability: $p(x_t|x_{t-1}^{(i)}, u_t, m) * bel(x_{t-1}^{(i)})$

- **Q:** Given a transition probability of $3.99E-1$ and a belief state $bel(x_{t-1})$ of $5.56E-2$, what is the position probability returned by the motion model? Write the answer in scientific notation with an accuracy of two decimal places, for example 3.14E-15.
- **A:** It's $2.22E-2$. We multiply transition probability and a belief state.

|  pseudo_position (x) |  pre-pseudo_position |  delta position | P(transition) | $bel(x_{t−1})$ | P(position) |
|----------|----------|----------|----------|----------|----------|
| 7 | 1 | 6 | 1.49E-06 | 5.56E-02 | 8.27E-08 |
| 7 | 2 | 5 | 1.34E-04 | 5.56E-02 | 7.44E-06 |
| 7 | 3 | 4 | 4.43E-03 | 5.56E-02 | 2.46E-04 |
| 7 | 4 | ? | 5.40E-02 | 0.00E+00 | 0.00E+00 |
| 7 | 5 | 2 | ?        | 0.00E+00 | 0.00E+00 |
| 7 | 6 | 1 | 3.99E-01 | 0.00E+00 | 0.00E+00 |
| 7 | 7 | 0 | 2.42E-01 | ?         | 1.66E-03 |
| 7 | 8 | -1 | 5.40E-02 | 1.79E-03 | ? |

- **Q:** What is the difference in position for an $x$ of $7$ and a pre-pseudo position of $4$?
- **A:** It's 3.


- **Q:** Use `normpdf` to determine the transition probability for $x = 7$, a pre-pseudo_position of $5$, a control parameter of $1$, and a standard deviation of $1$. The transition probability can be determined through `normpdf(delta_position, control_parameter, position_stdev)`. The answer must be in scientific notation with two decimal place accuracy, for example $3.14E-15$.
- **A:** It's $2.42E-1$. To determine the probability, pass the delta position as the sample value to `normpdf`.


- **Q:** In practice we only set our initial belief state, but making the following calculation is helpful in building intuition. What is the belief state $bel(x_{t-1})$ for the penultimate row of our table above? Write the answer in scientific notation with an accuracy of two decimal places, for example $3.14E-15$.
- **A:** Our position probability is the product of the transition probability and our belief state at $t - 1$. Rearranging yields: $1.66E-03/2.42E-01 = 6.86E-03$. This is important to understand since our belief state has a major influence on our returned probability.


- **Q:** What is the discretized position probability for $x = 7$ and a pre-pseudo_position of $8$, given the belief state in the table above? Write the answer in scientific notation with an accuracy of two decimal places, for example $3.14E-15$.
- **A:** Our position probability is the product of the transition probability and our belief state for our pre-pseudo position. $5.40E-02 * 1.79E-03 = 9.66E-05$


- **Q:** Given the table above, what is the final probability returned by our motion model? Enter the answer in scientific notation with an accuracy of two decimal places, for example $3.14E-15$.
- **A:** By summing the discrete probabilities from the table, we obtain the total probability, which approximates the integral of the continuous motion model. $8.27E-08 + 7.44E-06 + 2.46E-04 + 0.00E+00 + 0.00E+00 + 0.00E+00 + 1.66E-03 + 9.66E-05 = 2.01E-03$

# Coding the Motion Model

- now that we have manually calculated each step for determining the motion model probability, we will implement these steps in a function
- the starter code steps through each position `x`, calls the `motion_model` function and prints the results to stdout
- to complete this exercise fill in the `motion_model` function which will involve:
  - for each $x_{t}$:
    - calculate the transition probability for each potential value $x_{t-1}$
    - calculate the discrete motion model probability by multiplying the transition model probability by the belief state (prior) for $x_{t-1}$
  - return total probability (sum) of each discrete probability

- Reference Equations
  - Discretized Motion Model: $\sum\limits_{i} p(x_t|x_{t-1}^{(i)}, u_t, m)bel(x_{t-1}^{(i)})$
  - Transition Model: $p(x_t|x_{t-1}^{(i)}, u_t, m)$
  - 'i'th Motion Model Probability: $p(x_t|x_{t-1}^{(i)}, u_t, m) * bel(x_{t-1}^{(i)})$
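
- the steps and reference equations above can be sketched as follows; this is only one possible shape for the function, assuming a 1-meter grid where `priors[i]` holds $bel(x_{t-1} = i)$, and the signature and the `gaussian` helper are assumptions rather than the starter code's actual interface

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Gaussian density; stands in for the lesson's `normpdf` helper (assumed form).
double gaussian(double x, double mu, double sigma) {
  const double kInvSqrt2Pi = 0.3989422804014327;  // 1 / sqrt(2 * pi)
  const double z = (x - mu) / sigma;
  return (kInvSqrt2Pi / sigma) * std::exp(-0.5 * z * z);
}

// Sketch of the discretized motion model for one pseudo position x_t.
// priors[i] is bel(x_{t-1} = i) on a grid with 1 m resolution (assumed).
double motion_model(int pseudo_position, double control, double stdev,
                    const std::vector<double>& priors) {
  assert(stdev > 0.0);
  double position_prob = 0.0;
  for (std::size_t i = 0; i < priors.size(); ++i) {
    // Distance travelled if the vehicle started in cell i.
    const double delta = pseudo_position - static_cast<double>(i);
    // Transition probability p(x_t | x_{t-1}^{(i)}, u_t).
    const double transition_prob = gaussian(delta, control, stdev);
    // i-th motion model probability, accumulated into the total.
    position_prob += transition_prob * priors[i];
  }
  return position_prob;
}
```

- for example, with a prior of $5.56E-2$ at positions 1 to 3, $6.86E-3$ at position 7, $1.79E-3$ at position 8, and zero elsewhere, `motion_model(7, 1.0, 1.0, priors)` should return about $2.0E-3$, in line with the hand calculation in the table of the previous section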

- code is available in `code/03_coding_the_motion_model/`

# Observation Model Introduction

- the observation model describes the probability distribution of the observation set, $z_t$, given the state, $x_t$, our previous observations and controls, and the map
- you can also represent a relationship as a diagram or graph--$x_t$ is unknown and points to $z_t$, as well as all other values like the controls, the map, and the previous observations

<img src="resources/observation_model.png"/>

- Observation Model: $p(z_t|x_t,z_{1:t-1},u_{1:t},m)$
- Motion Model: $p(x_t|z_{1:t-1},u_{1:t},m)$


- **Q:** What “trick” can we use there, which helps us to manipulate/simplify the observation model?
- **A:** Using the Markov Assumption (Completeness of the State Assumption).

# Markov Assumption for Observation Model

- the Markov assumption can help us simplify the observation model
- recall that the Markov Assumption states that the next state depends only on the immediately preceding state, because all earlier state information has already been incorporated into our state estimate
- as such, we can ignore the terms in our observation model from before time $t$, since these values are already accounted for in our current state, and assume that $z_t$ is independent of previous observations and controls
- with these assumptions we simplify our posterior distribution such that the observations at $t$ are dependent only on $x$ at time $t$ and the map--$p(z_t|x_t,m)$


- since $z_t$ can be a vector of multiple observations we rewrite our observation model to account for the observation models for each single range measurement
- we assume that the noise behavior of the individual range values $z_t^1$ to $z_t^k$ is independent and that our observations are independent, allowing us to represent the observation model as a product over the individual probability distributions of each single range measurement

<img src="resources/markov_assumption_for_observation_model_1.png"/>
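
- written out, the product over the individual range measurements described above takes the following form, where $K$ (a symbol introduced here for illustration) is the number of range measurements in $z_t$:

$$p(z_t|x_t,m) = p(z_t^1, \dots, z_t^K|x_t,m) = \prod_{k=1}^{K} p(z_t^k|x_t,m)$$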

- now we must determine how to define the observation model for a single range measurement
- in general there exists a variety of observation models due to different sensors (lidars, cameras, radars, or ultrasonic sensors), sensor-specific noise behavior and performance, and map types (dense 2D or 3D grid maps or sparse feature-based maps)
- for our 1D example we assume that our sensor measures to the $n$ closest objects in the driving direction, which represent the landmarks on our map - we also assume that observation noise can be modeled as a Gaussian with a standard deviation of $1$ meter and that our sensor can measure in a range of $0-100$ meters


- to implement the observation model we use the given state $x_t$, and the given map to estimate pseudo ranges, which represent the true range values under the assumption that your car would stand at a specific position $x_t$, on the map
  - for example, if our car is standing at position $20$, it would use $x_t$ and $m$ to make pseudo range ($z_t^*$) observations in order from the first landmark to the last landmark: $5$, $11$, $39$, and $57$ meters
  - compared to our real observations ($z_t = [19, 37]$) the position $x_t = 20$ seems unlikely and our observation would rather fit to a position around $40$

<img src="resources/markov_assumption_for_observation_model_2.png"/>

- based on this example, the observation model for a single range measurement is defined by the following normal distribution: $p(z_t^k|x_t,m) \sim N(z_t^k, z_t^{*k}, \sigma_{z_t})$, where $z_t^{*k}$ is the mean
- this insight will ultimately allow us to implement the observation model in C++

# Finalize the Bayes Localization Filter

<img src="resources/summary_bayes_localization_filter.png"/>

- we have accomplished a lot in this lesson
  - starting with the generalized form of Bayes Rule we expressed our posterior, the belief of $x$ at $t$ as $\eta$ (normalizer) multiplied with the observation model and the motion model
  - we simplified the observation model using the Markov assumption to determine the probability of $z$ at time $t$, given only $x$ at time $t$, and the map
  - we expressed the motion model as a recursive state estimator using the Markov assumption and the law of total probability, resulting in a model that includes our belief at $t - 1$ and our transition model
  - finally we derived the general Bayes Filter for Localization (Markov Localization) by expressing our belief of $x$ at $t$ as a simplified version of our original posterior expression (top equation), $\eta$ multiplied by the simplified observation model and the motion model
    - here the motion model is written as $\hat{bel}$, a prediction model


- the Bayes Localization Filter dependencies can be represented as a graph, by combining our sub-graphs
- to estimate the new state $x$ at $t$ we only need to consider the previous belief state, the current observations and controls, and the map
- it is a common practice to represent this filter without the belief $x_t$ and to remove the map from the motion model
- ultimately we define $bel(x_t)$ (Bayes Filter for Localization (Markov Localization)) as the following expression: $bel(x_t) = p(x_t|z_t,z_{1:t-1},u_{1:t},m) = \eta * p(z_t|x_t,m) \hat{bel}(x_t)$

# Bayes Filter Theory Summary

- the Bayes localization filter, or Bayes filter, is a general framework for recursive state estimation
  - recursive means that we use the previous state (state at $t-1$) to estimate the new state (state at $t$) by using only current observations and controls (observations and controls at $t$), and not the whole history of data (data from $0:t$)

<img src="resources/bayes_filter_theory_summary.png"/>

- the motion model describes the prediction step of the filter
- the observation model is the update step to estimate the new state probabilities
- you already heard about this interaction between prediction and update step before
  - this means the current 1D localization, Kalman filters, and also particle filters are realizations of the Bayes filter

# Observation Model Probability

- we will complete our Bayes' filter by implementing the observation model
- the observation model uses pseudo range estimates and observation measurements as inputs


- let's recap what is meant by a pseudo range estimate and an observation measurement
  - for the figure below, the top 1D map (green car) shows our observation measurements
    - these are the distances from our actual car position at time $t$, to landmarks, as detected by sensors
    - in this example, those distances are $19m$ and $37m$
  - the bottom 1D map (yellow car) shows our pseudo range estimates
    - these are the distances we would expect given the landmarks and assuming a given position $x$ at time $t$, of $20m$
    - in this example, those distances are $5$, $11$, $39$, and $57m$

<img src="resources/observation_model_probability.png"/>

- the observation model will be implemented by performing the following at each time step:
  - measure the range to landmarks up to $100m$ from the vehicle, in the driving direction (forward)
  - estimate a pseudo range from each landmark by subtracting pseudo position from the landmark position
  - match each pseudo range estimate to its closest observation measurement
  - for each pseudo range and observation measurement pair, calculate a probability by passing relevant values to `norm_pdf`: `norm_pdf(observation_measurement, pseudo_range_estimate, observation_stdev)`
  - return the product of all probabilities


- why do we multiply all the probabilities in the last step?
- our final signal (probability) must reflect all pseudo range, observation pairs
  - this blends our signal
  - for example, if we have a high probability match (small difference between the pseudo range estimate and the observation measurement) and low probability match (large difference between the pseudo range estimate and the observation measurement), our resultant probability will be somewhere in between, reflecting the overall belief we have in that state

- let's practice this process using the following information and `norm_pdf`
  - pseudo position: $x = 10m$
  - vector of landmark positions from our map: $[6m, 15m, 21m, 40m]$
  - observation measurements: $[5.5m, 11m]$
  - observation standard deviation: $1.0m$

- **Q:** Our first step is to estimate pseudo ranges, please enter the response as a vector with no spaces, in ascending order. For example $[5, 4, 7, 20]$.
- **A:** $6m$ is not in the driving direction, so we reject this. The remaining calculations are shown within the vector: $[15-10,21-10,40-10] = [5, 11, 30]$


- **Q:** Match each observation measurement with the nearest estimated pseudo range. We will only use each measurement and pseudo range once. Enter each pair as a vector of tuples, with no spaces, with tuples ordered as (observation,pseudo_range). For example $[(5.5,10),(11,15)]$.
- **A:** $[(5.5,5), (11,11)]$.


- **Q:** Calculate a probability for each observation measurement and pseudo range estimate pair by passing relevant data to `norm_pdf`. Enter your response as a vector of probabilities in scientific notation with an accuracy of two decimal places, and no spaces. For example `[3.14E-15,1.23E-5]`.
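- **A:** $[3.52E-1,3.99E-1]$, from `norm_pdf(5.5, 5, 1)` and `norm_pdf(11, 11, 1)`, in the order of the pairs from the previous quiz.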


- **Q:** To complete our observation model probability, determine the product of each observation probability from the previous quiz. Remember that our observation model probability is the product of the probabilities determined using each (pseudo_range, observation) pair and `norm_pdf`. Enter your response in scientific notation with an accuracy of two decimal places. For example $2.99E-1$.
- **A:** $3.99E-01 * 3.52E-01 = 1.40E-01$.

# Get Pseudo Ranges

- in the previous exercises we manually executed the steps for determining pseudo ranges and our observation model probability
- now let's implement a function that accepts a vector of landmark positions, a pseudo position (x), and returns a vector of sorted (ascending) pseudo ranges
- later, we will use the pseudo range vector as an input for our observation model function

- to implement the `pseudo_range_estimator` function we must do the following for each pseudo position $x$:
  - for each landmark position:
    - determine the distance between the pseudo position $x$ and the landmark position
    - if the distance is positive (landmark is forward of the pseudo position) push the distance to the pseudo range vector
  - sort the pseudo range vector in ascending order
  - return the pseudo range vector
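
- as a minimal sketch of these steps, assuming `double` positions in meters (the exercise's actual parameter types and order may differ):

```cpp
#include <algorithm>
#include <vector>

// Sketch: returns the ascending distances from `pseudo_position` to every
// landmark that lies ahead of it (in the driving direction).
std::vector<double> pseudo_range_estimator(
    const std::vector<double>& landmark_positions, double pseudo_position) {
  std::vector<double> pseudo_ranges;
  for (double landmark : landmark_positions) {
    const double range = landmark - pseudo_position;
    if (range > 0.0) {  // keep only landmarks forward of the pseudo position
      pseudo_ranges.push_back(range);
    }
  }
  // Sort once, after all landmarks have been processed.
  std::sort(pseudo_ranges.begin(), pseudo_ranges.end());
  return pseudo_ranges;
}
```

- with the landmark positions $[6, 15, 21, 40]$ and a pseudo position of $10$ from the earlier quiz, this sketch should return $[5, 11, 30]$; a position with no forward landmark yields an empty vector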


- there may be missing $x$ values in the output
  - this is because not all $x$ values have a forward landmark (positive pseudo range)

- code is available in `code/04_get_pseudo_ranges/`

# Coding the Observation Model

- the final individual model we will implement is the observation model
- the observation model accepts the pseudo range vector from the previous assignment, an observation vector (from vehicle sensors), and returns the observation model probability
- ultimately, we will multiply this by the motion model probability, then normalize to produce the belief state for the current time step

- the starter code steps through each pseudo position $x$, calls the `observation_model` function and prints the results to stdout


- to implement the `observation_model` function we must do the following for each pseudo position $x$:
  - for each observation:
    - determine if a pseudo range vector exists (that is, still contains elements) for the current pseudo position $x$
    - if the vector exists, extract and store the minimum distance, element $0$ of the sorted vector, and remove that element (so we don't re-use it)
      - this will be passed to `norm_pdf`
    - if the pseudo range vector does not exist, pass the maximum distance to `norm_pdf`
    - use `norm_pdf` to determine the observation model probability for this observation
  - return the product of all the individual probabilities
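
- one way these steps might be written is sketched below; the signature, the `gaussian` helper, and the `distance_max` parameter name are assumptions made for illustration, not the starter code's interface

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Gaussian density; stands in for the lesson's `norm_pdf` helper (assumed form).
double gaussian(double x, double mu, double sigma) {
  const double kInvSqrt2Pi = 0.3989422804014327;  // 1 / sqrt(2 * pi)
  const double z = (x - mu) / sigma;
  return (kInvSqrt2Pi / sigma) * std::exp(-0.5 * z * z);
}

// Sketch of the observation model for one pseudo position.
// `pseudo_ranges` must be sorted ascending; it is taken by value so that
// used entries can be removed without modifying the caller's vector.
double observation_model(const std::vector<double>& observations,
                         std::vector<double> pseudo_ranges,
                         double distance_max, double observation_stdev) {
  assert(observation_stdev > 0.0);
  double likelihood = 1.0;  // neutral element of the product
  for (double observation : observations) {
    double pseudo_range = distance_max;  // fallback when no landmark is left
    if (!pseudo_ranges.empty()) {
      pseudo_range = pseudo_ranges.front();        // nearest remaining landmark
      pseudo_ranges.erase(pseudo_ranges.begin());  // do not reuse it
    }
    likelihood *= gaussian(observation, pseudo_range, observation_stdev);
  }
  return likelihood;
}
```

- with observations $[5.5, 11]$, pseudo ranges $[5, 11, 30]$, and a standard deviation of $1$ from the earlier quiz, the sketch should return roughly $1.40E-1$; with no observations at all it returns $1$, which leaves the motion model probability unchanged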

- code is available in `code/05_coding_the_observation_model/`

# Coding the Full Filter

- in previous lessons we learned the basis of our filter, tried some example calculations by hand, and implemented critical steps and models for a single time step and vector of sensor observations
- in this final coding exercise we will implement the entire filter using the pieces we have already developed for multiple time steps and sensor observations


- sensor observations are provided in a 2D vector where each inner vector represents the sensor observations, in meters, at a time step
```cpp
{{1,7,12,21}, {0,6,11,20}, {5,10,19}, {4,9,18}, {3,8,17}, {2,7,16}, 
{1,6,15}, {0,5,14}, {4,13}, {3,12},{2,11},{1,10},{0,9},{8},{7},{6},{5},
{4},{3},{2},{1},{0}, {}, {}, {}};
```

- implement the Bayes' localization filter by first initializing priors, then doing the following within each time step:
  - extract sensor observations
  - for each pseudo-position:
    - get the motion model probability
    - determine pseudo ranges
    - get the observation model probability
    - use the motion and observation model probabilities to calculate the posterior probability
  - normalize posteriors (see `helpers.h` for a normalization function)
  - update priors (priors --> posteriors)
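
- the loop described above could be assembled roughly as follows; this is a compact sketch under stated assumptions (1-meter grid cells, a constant control per time step, and illustrative names such as `bayes_filter`, `pseudo_ranges_at`, and `gaussian`), not the solution in the course repository, which uses its own helpers from `helpers.h`

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Belief = std::vector<double>;

// Gaussian density; stands in for the lesson's `norm_pdf` helper (assumed form).
double gaussian(double x, double mu, double sigma) {
  const double kInvSqrt2Pi = 0.3989422804014327;  // 1 / sqrt(2 * pi)
  const double z = (x - mu) / sigma;
  return (kInvSqrt2Pi / sigma) * std::exp(-0.5 * z * z);
}

// Prediction step: sum of transition probability * prior over all previous cells.
double motion_model(std::size_t x, double control, double control_stdev,
                    const Belief& priors) {
  double prob = 0.0;
  for (std::size_t i = 0; i < priors.size(); ++i) {
    const double delta = static_cast<double>(x) - static_cast<double>(i);
    prob += gaussian(delta, control, control_stdev) * priors[i];
  }
  return prob;
}

// Sorted distances from position x to every landmark ahead of it.
std::vector<double> pseudo_ranges_at(const std::vector<double>& landmarks,
                                     double x) {
  std::vector<double> ranges;
  for (double landmark : landmarks) {
    if (landmark - x > 0.0) ranges.push_back(landmark - x);
  }
  std::sort(ranges.begin(), ranges.end());
  return ranges;
}

// Update step: product of the likelihoods of all observations.
double observation_model(const std::vector<double>& observations,
                         std::vector<double> pseudo_ranges,
                         double distance_max, double observation_stdev) {
  double likelihood = 1.0;
  for (double observation : observations) {
    double pseudo_range = distance_max;  // fallback when no landmark is left
    if (!pseudo_ranges.empty()) {
      pseudo_range = pseudo_ranges.front();
      pseudo_ranges.erase(pseudo_ranges.begin());
    }
    likelihood *= gaussian(observation, pseudo_range, observation_stdev);
  }
  return likelihood;
}

// Runs the filter over all time steps and returns the final belief.
Belief bayes_filter(Belief priors, const std::vector<double>& landmarks,
                    const std::vector<std::vector<double>>& sensor_obs,
                    double control, double control_stdev,
                    double observation_stdev, double distance_max) {
  assert(control_stdev > 0.0 && observation_stdev > 0.0);
  for (const std::vector<double>& observations : sensor_obs) {
    Belief posteriors(priors.size(), 0.0);
    double total = 0.0;
    for (std::size_t x = 0; x < priors.size(); ++x) {
      const double motion = motion_model(x, control, control_stdev, priors);
      const double observation = observation_model(
          observations, pseudo_ranges_at(landmarks, static_cast<double>(x)),
          distance_max, observation_stdev);
      posteriors[x] = motion * observation;
      total += posteriors[x];
    }
    if (total > 0.0) {
      for (double& p : posteriors) p /= total;  // normalize to sum to 1
    }
    priors = posteriors;  // the posterior becomes the next prior
  }
  return priors;
}
```

- the returned belief sums to 1 after every time step, and its largest entry marks the most likely position; the sketch takes the prior, the landmark positions, and the noise parameters as arguments, since their values are set in the exercise's starter code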

- code is available in `code/06_coding_the_full_filter/`