# Particle Filter for Object Tracking

The particle filter is a general-purpose algorithm that is particularly well suited to
estimation problems. It excels at estimating the state of systems whose state
distribution is _multimodal_, a task that the Kalman filter family, including the
_classic Kalman filter_, _extended Kalman filter_, and _unscented Kalman filter_,
struggles to handle because it represents the state with a single Gaussian. In this
tutorial, I will explore the application of the particle filter to object tracking
through two illustrative examples. The first example can also be addressed using
algorithms from the Kalman filter family, whereas the second demands the unique
capabilities of the particle filter.

**RULES:** As usual, **`OpenCV`** is banned in this repository.

Please note that this tutorial will adhere to the notation used in the
[Particle Filter's Wikipedia page](https://en.wikipedia.org/wiki/Particle_filter),
ensuring that readers can easily refer to the source for additional information when
needed.


In [None]:
from typing import List, Tuple
import cv2
import time
import imageio
import numpy as np
from skimage import color
import matplotlib.pyplot as plt
from particle_filter_utils import *
from functools import partial
from skimage import transform

video_reader = imageio.get_reader("./input/pres_debate.avi")
frames = [np.array(frame) for frame in video_reader]
video_reader.close()
president_video = np.array(frames)
assert president_video.ndim == 4  # n x H x W x 3

video_reader = imageio.get_reader("./input/pedestrians.avi")
frames = [np.array(frame) for frame in video_reader]
video_reader.close()
blonde_video = np.array(frames)
assert blonde_video.ndim == 4  # n x H x W x 3

## Introduction

This tutorial focuses on the application of a particle filter for tracking objects over
time. It's essential to clarify that we are specifically addressing a tracking problem,
not a detection problem. In other words, our objective is not to detect an object but to
track it once it's already known to us. To illustrate this, consider the scenario where
we have initial information about a car's position, and our goal is to autonomously
monitor and update the car's position continuously throughout a given time period.

When tracking an object, the primary objective is to deduce a sequence of world states,
denoted as $x_k$, from a noisy sequence of measurements or observations, denoted as
$y_k$. Particle filters share similarities with Kalman filters, as they consist of two
essential components: the dynamic or temporal model, denoted as $g(\cdot)$, and the
measurement model, denoted as $h(\cdot)$.

-   The dynamic or temporal model, $g(\cdot)$, characterizes the relationship between
    successive states. Typically, particle filters make use of the Markov assumption,
    which posits that each state depends solely on its predecessor, represented as
    $P(x_k | x_{k-1})$.

-   The measurement model, $h(\cdot)$, describes the connection between the measurement
    $y_k$ and the state $x_k$ at time $k$. We consider this model as generative, and it
    helps us model the likelihood, $P(y_k | x_k)$.

By leveraging this statistical dependency, we can infer the state, $x_k$, even when the
associated observation, $y_k$, provides partial or no informative content.

In the context of inference, the primary challenge is to calculate the marginal
posterior distribution:

$$
P(x_k|y_{0\dots k}) = \frac{P(y_k|x_k)P(x_k|y_{0\dots {k-1}})}{\int P(y_k|x_k)
P(x_k|y_{0 \dots {k-1}}) dx_k}
$$

To evaluate $P(x_k | y_{0\dots k})$, we need to determine $P(x_k | y_{0\dots {k-1}})$,
which signifies our prior knowledge about the state $x_k$ before incorporating the
associated measurement $y_k$. This can be computed as follows:

$$
P(x_k|y_{0 \dots k-1}) = \int P(x_k|x_{k-1}) P(x_{k-1}|y_{0 \dots {k-1}}) dx_{k-1}
$$
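
To make the two equations concrete, here is a minimal sketch of one predict/update cycle
on a discrete state grid, where the integrals reduce to sums. The function name
`bayes_filter_step`, the three-state grid, and all numbers are illustrative assumptions,
not part of the tracker built later in this tutorial.

```python
import numpy as np


def bayes_filter_step(posterior_prev, transition, likelihood):
    """One predict/update cycle on a discrete state grid.

    posterior_prev: (S,)   P(x_{k-1} | y_{0..k-1})
    transition:     (S, S) transition[i, j] = P(x_k = i | x_{k-1} = j)
    likelihood:     (S,)   P(y_k | x_k = i) for the observed y_k
    """
    # Prediction: the integral over x_{k-1} becomes a matrix-vector product.
    prior = transition @ posterior_prev
    # Update: multiply by the likelihood, then divide by the normalizer.
    unnormalized = likelihood * prior
    return prior, unnormalized / unnormalized.sum()


# Toy example: the object starts in state 0 with certainty.
posterior_prev = np.array([1.0, 0.0, 0.0])
transition = np.array(
    [
        [0.8, 0.1, 0.0],
        [0.2, 0.8, 0.2],
        [0.0, 0.1, 0.8],
    ]
)  # each column sums to 1
likelihood = np.array([0.1, 0.7, 0.2])  # the measurement favors state 1

prior, posterior = bayes_filter_step(posterior_prev, transition, likelihood)
print("prior:    ", prior)
print("posterior:", posterior)
```

The prior still favors state 0, but the measurement shifts most of the posterior mass to
state 1. A particle filter approximates the same recursion with samples instead of a
fixed grid.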

One of the simplest particle filter methods is the _conditional density propagation_ or
_condensation_ algorithm. In this algorithm, the posterior at the previous step,
$P(x_{k-1}|y_{0\dots k-1})$, is represented by a weighted sum of particles. The steps
and the figure below provide an overview of how the condensation algorithm works:

a) The posterior at the previous step is represented as a set of weighted particles.

b) The particles are resampled according to their weights to produce a new set of
unweighted particles.

c) These particles are passed through the nonlinear temporal function.

d) Noise is added according to the temporal model.

e) The particles are passed through the measurement model and compared to the
measurement density.

f) The particles are re-weighted according to their compatibility with the measurements,
and the process can begin again.

<figure>
  <div style="display: flex; justify-content: space-between;">
    <div style="text-align: center;">
      <img src="./images/Particle.svg">
      <p><strong>The condensation algorithm </strong></p>
      <p>The image and the accompanying description provided above have been adapted from the book "Computer Vision: Models, Learning, and Inference" authored by Simon J. D. Prince.</p>
    </div>
  </div>
</figure>
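
Steps (a) to (f) can be sketched for a scalar state as follows. This is a minimal
illustration under stated assumptions: the temporal function is the identity, both noise
sources are Gaussian, and the names `condensation_step`, `process_std`, and `meas_std`
are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def condensation_step(particles, weights, y, process_std, meas_std):
    """One resample/predict/re-weight cycle for a scalar state."""
    n = len(particles)
    # (b) Resample according to the weights to get unweighted particles.
    idx = rng.choice(n, size=n, replace=True, p=weights)
    particles = particles[idx]
    # (c) Temporal function (identity here) and (d) additive process noise.
    particles = particles + rng.normal(0.0, process_std, size=n)
    # (e) Compare each particle with the measurement (Gaussian likelihood).
    likelihood = np.exp(-0.5 * ((y - particles) / meas_std) ** 2)
    # (f) Re-weight and normalize so that the weights sum to 1.
    weights = likelihood / likelihood.sum()
    return particles, weights


# (a) Start from a broad, uniformly weighted particle set.
n = 500
particles = rng.uniform(-10.0, 10.0, size=n)
weights = np.full(n, 1.0 / n)

true_state = 3.0
for _ in range(10):
    y = true_state + rng.normal(0.0, 0.5)  # noisy measurement
    particles, weights = condensation_step(particles, weights, y, 0.3, 0.5)

estimate = np.sum(weights * particles)  # weighted mean of the particles
print(f"estimate = {estimate:.2f}, true state = {true_state}")
```

After a few iterations the particles concentrate around the true state, even though they
started out spread uniformly over the whole interval.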


## Naive Particle Filter

After providing an overview of our objective, let's delve into the specific details of
our task. To begin, it's crucial to establish the definition of our "model" or
"template" within this context. The "model" or "template" represents the object that we
aim to track. This entity could manifest as a patch in an image, a contour, or any other
descriptive representation of the object under consideration.

For the first task, we need to track a patch taken from the first frame of the video,
shown below, which contains Mitt Romney's face. Our goal is to track this face over
time. We can therefore define a `Template` class that holds the template's information
as follows:

```python
class Template:
    def __init__(self, img, x, y, w, h) -> None:
        self.x = int(x)
        self.y = int(y)
        self.w = int(w)
        self.h = int(h)
        self.model = img[self.y : self.y + self.h, self.x : self.x + self.w, ...]
```

Once we've established our template, the design of the system's state follows naturally.
It comprises four components, `(x, y, w, h)`: the x-axis coordinate `x`, the y-axis
coordinate `y`, the width of the window `w`, and the height of the window `h`. Note that
the measurement code later in this tutorial treats `(x, y)` as the center of the window,
whereas `Template` takes the top-left corner of the patch.
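
As a small illustration of how a state maps to an image window, here is a sketch that
crops the window for a given state. It assumes `(x, y)` denotes the window center, as
the measurement code later in this tutorial does. The helper name `crop_window` and the
synthetic frame are hypothetical.

```python
import numpy as np


def crop_window(frame, state):
    """Crop the window described by a center-based state (x, y, w, h).

    Returns None when the window is degenerate or leaves the frame.
    """
    x, y, w, h = state
    start_x, start_y = int(x - w / 2), int(y - h / 2)
    end_x, end_y = start_x + int(w), start_y + int(h)
    frame_h, frame_w = frame.shape[:2]
    if end_x <= start_x or end_y <= start_y:
        return None  # zero or negative size
    if start_x < 0 or start_y < 0 or end_x > frame_w or end_y > frame_h:
        return None  # window leaves the frame
    return frame[start_y:end_y, start_x:end_x, ...]


frame = np.zeros((360, 480, 3), dtype=np.uint8)  # synthetic stand-in frame

inside = crop_window(frame, (240, 180, 100, 120))
outside = crop_window(frame, (10, 180, 100, 120))  # center too close to the left edge
print(inside.shape)  # (120, 100, 3)
print(outside)       # None
```

Rejecting out-of-frame windows explicitly avoids relying on NumPy's slicing rules, where
a negative start index silently wraps around to the other side of the array.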


In [None]:
first_frame = president_video[0, ...]  # NOTE: uint8. 0-255.
print("first_frame: ", first_frame.shape)
x, y, w, h = 320, 175, 103, 129
template = Template(first_frame, x, y, w, h)
print(
    "template.model:",
    "shape:", template.model.shape,
    "type:", template.model.dtype,
    "min:", np.min(template.model),
    "max:", np.max(template.model),
)

fig, ax = plt.subplots(figsize=(3, 2))  # Adjust the figure size as needed
plt.axis("off")
plt.tight_layout(pad=0)
ax.imshow(template.model)
plt.show()


### Dynamic Model

Now that we have defined our system's state, we can proceed to define our dynamic
system. In this straightforward particle filter, we opt for the simplest form of motion,
known as Brownian motion:

$$
x_{k} = x_{k-1} + W_{k-1}
$$

The process noise represented by $W_{k-1}$ introduces variability in our dynamic model.
It's important to note that the range and magnitude of each dimension in the state and
the associated process noise can differ. For instance, when considering `x` and `y`,
these dimensions should remain within the boundaries of the image's dimensions.
Consequently, the standard deviation of their process noise should typically fall within
the range of 10 to 50. On the other hand, `w` and `h` should fall within an anticipated
range, typically around 10 to 25 pixels added or subtracted from the original patch
size. This choice is made based on practical considerations, as we wouldn't expect an
object, such as Mitt Romney's face, to occupy the entire image. Hence, the standard
deviation of the process noise of `w` and `h` should be only a few pixels.

The choice of using Brownian motion as the dynamic model is grounded in the absence of
external control inputs to govern the object being tracked. In such scenarios, we can
assume that the object's movement is inherently random. The implementation is
straightforward, as outlined below. Essentially, during the prediction step, we
introduce Gaussian noise to the state vector. You might wonder about the magnitude of
noise added to the state. The magnitude varies depending on the specific state
dimension. For instance, if we have prior knowledge that the object in the image
typically moves only a few pixels in each time step, an appropriate standard deviation
might range from under ten to a few tens of pixels. It's important to note that in our
task, we must constrain the state to its boundaries, as it wouldn't be logical for image
coordinates, for example, to be negative.
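
Before the per-dimension implementation, here is a vectorized sketch of the same idea:
add zero-mean Gaussian noise with one standard deviation per state dimension, then clip
to the bounds. The function name `brownian_predict` and the numeric values are
illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def brownian_predict(particles, std, bounds):
    """Add per-dimension Gaussian noise, then clip to the state bounds.

    particles: (N, D) array
    std:       length-D sequence of standard deviations
    bounds:    length-D sequence of (lower, upper) pairs
    """
    std = np.asarray(std, dtype=float)
    lower, upper = np.asarray(bounds, dtype=float).T
    # Broadcasting applies std[d] and the d-th bounds to column d.
    noisy = particles + rng.normal(0.0, 1.0, size=particles.shape) * std
    return np.clip(noisy, lower, upper)


# Illustrative values: a 480x360 frame and a window of roughly 100x130 pixels.
bounds = [(0, 479), (0, 359), (100, 105), (125, 135)]
particles = np.tile([240.0, 180.0, 103.0, 129.0], (1000, 1))
moved = brownian_predict(particles, std=(16, 16, 0.5, 0.5), bounds=bounds)

print("spread per dimension:", moved.std(axis=0).round(2))
print("min per dimension:   ", moved.min(axis=0).round(2))
print("max per dimension:   ", moved.max(axis=0).round(2))
```

The position dimensions spread by roughly 16 pixels per step, while the width and height
barely move and never leave their narrow bounds.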


In [None]:
class DynamicModel:
    def __init__(self, std: List) -> None:
        self.std = std
        self.num_states = len(std)

    def predict(self, particles: np.ndarray, state_boundry: List[Tuple]):
        for i in range(self.num_states):
            # Brownian motion.
            particles[:, i] += np.random.normal(
                loc=0, scale=self.std[i], size=particles[:, i].shape
            )  # Modify particles in-place

            lower_bound, upper_bound = state_boundry[i]

            # Constrain the state to its boundaries.
            np.clip(particles[:, i], lower_bound, upper_bound, out=particles[:, i])

### Measurement Model

Now, let's delve into the components of the measurement model. We'll discuss each aspect
one by one:

#### Measurement function:

-   The measurement function is defined as:

    $$
    \text{weights} \propto p(y_k|x_k) \propto \exp\left(\frac{-\text{distance}}{2\sigma^2}\right)
    $$

    In this context, the term "distance" always remains positive, representing the
    dissimilarity between a window and the template. If $\text{distance} = 0$, it
    indicates that the window and the template are identical. As the value of
    $\text{distance}$ increases, it signifies a greater dissimilarity between the window
    and the template. For $x \ge 0$, the expression $e^{-x}$ yields values in $(0, 1]$,
    which proves to be quite useful. The parameter $\sigma$ represents the standard
    deviation associated with our observation data. A larger $\sigma$ implies greater
    uncertainty in our observation data, while a smaller $\sigma$ implies a higher
    degree of trust in the observation data.

-   Universal Sigma: The question arises as to whether a universal standard deviation
    can be used for different types of distance functions. It is arguable that a
    universal standard deviation can be applied if the distance is properly normalized.

-   Distance Functions: I have explored three distance functions:
    [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error),
    [Chi-Squared](https://stats.stackexchange.com/questions/184101/comparing-two-histograms-using-chi-square-distance),
    and [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity).

    In this context, it's important to note that Mean Squared Error and Chi-Squared
    effectively measure dissimilarity between two images, while Cosine Similarity
    quantifies similarity. To turn the cosine similarity $s$ into a dissimilarity, I've
    used the transformation $2 - (s + 1) = 1 - s$. However, it's worth mentioning that
    only Mean Squared Error and Chi-Squared proved effective in this context; Cosine
    Similarity did not work for this task. Additionally, I initially raised the idea of
    normalizing the distance, but upon experimentation, I found that normalizing the
    histograms for Chi-Squared did not yield valid results.

-   In the particle filter, it's important to obtain windows corresponding to each
    particle. Each particle is characterized by four states: x-coordinate, y-coordinate,
    width, and height. These four parameters, (x, y, w, h), are used to crop a window
    from the given frame. There are several considerations in this process: First,
    although we've constrained each state's minimum and maximum values during the
    prediction and resampling stages, it's still not guaranteed that (x, y, w, h) will always define a
    valid window. To ensure that (x, y, w, h) represents a valid window, we would need
    to constrain the relationships between x and w, as well as y and h. However, it's
    not always necessary to impose these constraints, as we can identify invalid windows
    in later steps and reset the weight of a particle to 0 if necessary. Second, the
    cropped window obtained using (x, y, w, h) might have a different size than the
    template. To match the template's size, we need to resize the cropped window
    accordingly. Finally, it's essential to normalize the weights assigned to particles
    so that they sum up to 1. This normalization ensures that the particle weights
    represent a valid probability distribution.

-   Updating the template is a crucial aspect of object tracking, especially when the
    target's appearance can change over time. The template can be updated using the
    following equation:

    $$
    \text{template} = \alpha * \text{state} + (1 - \alpha) * \text{template}
    $$

    This update process involves blending the latest state with the historical template.
    However, it's important to exercise caution when performing these updates. While
    it's often beneficial to update the template with the latest state to adapt to
    changes, there are situations, such as occlusion, where all states should be
    considered invalid. In these cases, our current approach still picks the relatively
    best state, even though it is a poor match, and blending it into the template would
    corrupt the template. Hence, the update should only be performed when the tracker
    is confident in its estimate, so that the template follows the evolving appearance
    of the target without absorbing occluders.
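
The blending equation can be sketched in isolation as follows. The helper name
`blend_template` and the constant-valued arrays are hypothetical; the sketch only shows
how $\alpha$ controls the speed of adaptation.

```python
import numpy as np


def blend_template(template, best_window, alpha):
    """Blend the best window into the template (exponential moving average)."""
    if template.shape != best_window.shape:
        return template  # shapes must match; skip the update otherwise
    return alpha * best_window.astype(np.float64) + (1 - alpha) * template


template = np.full((4, 4, 3), 100.0)              # current template appearance
window = np.full((4, 4, 3), 200, dtype=np.uint8)  # appearance at the best state

for alpha in (0.0, 0.01, 0.5, 1.0):
    blended = blend_template(template, window, alpha)
    print(f"alpha={alpha}: pixel value {blended[0, 0, 0]:.1f}")
```

With $\alpha = 0$ the template never changes, and with $\alpha = 1$ it is replaced by
the latest window. A small value such as $\alpha = 0.01$ adapts slowly: the original
template still contributes about half of the appearance after roughly 69 updates, since
$0.99^{69} \approx 0.5$.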


In [None]:
class MeasurementModel:
    def __init__(self, std, template, distance_func, alpha=0.0) -> None:
        self.std = std
        self.template = template
        self.distance_func = distance_func
        self.alpha = alpha

    def measure(self, particles, frame):
        num_particles = particles.shape[0]

        # Get the window corresponding to each particle. The template size is fixed,
        # and each particle's window is resized to match it.
        template_width, template_height = self.template.w, self.template.h
        artificial_measurements = []
        for i in range(num_particles):
            x = particles[i, 0]  # float
            y = particles[i, 1]  # float
            w = particles[i, 2]  # float
            h = particles[i, 3]  # float
            start_y = int(y - h / 2)
            start_x = int(x - w / 2)
            end_y = int(start_y + h)
            end_x = int(start_x + w)
            temp = frame[start_y:end_y, start_x:end_x, :]
            if temp.size == 0:
                # It's ok to still append `temp` because the weight is reset as 0 in the
                # later process.
                artificial_measurements.append(temp)
            else:
                resized_image = cv2.resize(temp, (template_width, template_height))
                # resized_image = np.asarray(Image.fromarray(temp).resize((template_width, template_height)))
                
                artificial_measurements.append(resized_image)

        # NOTE: Add measurement noise? window += np.random.normal(loc=0, scale=self.std, size=window.shape).astype(np.uint8)
        # Compute importance weights. Measuring similarity between each window and the template.
        weights = []  # FIXME: Use array broadcasting
        for m in artificial_measurements:
            weights.append(self.measure_function(m, self.template.model))
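        weights = np.array(weights, dtype=np.float64)

        # NOTE: WE may want to clip weight here. eg, weight[weight<1e-3] = 0
        # so that particles with very small weight can have zero chances.
        # weights *= 1000
        # weights[weights < 1e-3] = 0
        # Normalize the weights. If every weight is zero (e.g. all windows are
        # invalid), fall back to uniform weights to avoid dividing by zero.
        total = np.sum(weights)
        if total > 0:
            weights /= total
        else:
            weights = np.full(num_particles, 1.0 / num_particles)
        return weights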

    def measure_function(self, window, template):
        if window.shape != template.shape:
            # NOTE: This statement must be here. It can not be put into distance function.
            return 0
        dist = self.distance_func(window, template)
        weight = np.exp(-dist / (2 * self.std**2))

        return weight

    def update_template(self, frame, state):
        x = state[0]
        y = state[1]
        w = state[2]
        h = state[3]
        start_x = int(x - w / 2)
        start_y = int(y - h / 2)
        end_x = int(start_x + w)
        end_y = int(start_y + h)
        best_model = frame[start_y:end_y, start_x:end_x, ...]

        # Resize the template to the size of the best state, because the size of
        # the best state can change (e.g. shrink) over time.
        resized_template = cv2.resize(self.template.model, (w, h))
        # resized_template = np.asarray(Image.fromarray(self.template.model).resize((w, h)))

        if resized_template.shape != best_model.shape:
            # Edge case: the best state is near the image boundaries.
            return

        self.template.model = (
            self.alpha * best_model + (1 - self.alpha) * resized_template
        )
        # self.template.model = self.template.model.astype(np.uint8)

        # update x, y, w, h
        self.template.x = x
        self.template.y = y
        self.template.w = w
        self.template.h = h


def mean_squared_error(window, template):
    mse = np.sum(np.subtract(window, template, dtype=np.float64) ** 2)
    mse /= float(window.shape[0] * window.shape[1])
    return mse


def cosine_similarity(window, template):
    gray_window, gray_template = color.rgb2gray(window), color.rgb2gray(template)
    # gray_window *= 255
    # gray_template *= 255
    dividend = np.sum(np.multiply(gray_window, gray_template))
    divisor = np.multiply(
        np.sqrt(np.sum(gray_window**2)), np.sqrt(np.sum(gray_template**2))
    )
    assert np.all(divisor != 0), f"divisor has 0 in it. {divisor}"
    tmp = np.divide(dividend, divisor) + 1
    assert tmp > 0
    return 2 - tmp


def chi_squared(window, template, num_bins=8):
    hist1 = []
    hist2 = []
    for channel in range(window.shape[2]):
        hist1_channel, _ = np.histogram(
            window[:, :, channel], bins=num_bins, range=(0, 256)
        )
        hist2_channel, _ = np.histogram(
            template[:, :, channel], bins=num_bins, range=(0, 256)
        )
        # Normalizing the histograms did not work.
        # hist1_channel = hist1_channel.astype(np.float64)
        # hist2_channel = hist2_channel.astype(np.float64)
        # hist1_channel /= hist1_channel.sum() + 1e-10
        # hist2_channel /= hist2_channel.sum() + 1e-10
        hist1.append(hist1_channel)
        hist2.append(hist2_channel)
    hist1 = np.array(hist1)
    hist2 = np.array(hist2)

    # Compute the Chi-Squared distance
    chi_squared_distance = np.sum(((hist1 - hist2) ** 2) / (hist1 + hist2 + 1e-12))
    return chi_squared_distance
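
To see how `std` in `MeasurementModel.measure_function` interacts with the scale of the
distance, here is a small sketch that turns a few distances into normalized weights for
different values of $\sigma$. The function name `weights_from_distances` and the
distance values are illustrative. Subtracting the minimum distance before
exponentiating is an addition of this sketch, not something the code above does; it
leaves the normalized weights unchanged but avoids underflow when all distances are
large.

```python
import numpy as np


def weights_from_distances(distances, sigma):
    """Turn non-negative distances into normalized importance weights."""
    distances = np.asarray(distances, dtype=np.float64)
    # Shift by the minimum (assumption of this sketch): the shift cancels out
    # in the normalization but keeps the exponent from underflowing to 0.
    raw = np.exp(-(distances - distances.min()) / (2.0 * sigma**2))
    return raw / raw.sum()


# Illustrative distances for four candidate windows, best match first.
distances = [500.0, 800.0, 2000.0, 6000.0]
for sigma in (4, 16, 64):
    print(f"sigma={sigma}:", weights_from_distances(distances, sigma).round(3))
```

A small $\sigma$ puts almost all of the weight on the best window, while a large
$\sigma$ flattens the weights toward uniform. This is also why a single $\sigma$ is
unlikely to suit distance functions whose outputs live on very different scales unless
the distances are normalized first.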

### `NaiveParticleFilter`

In the design of the `NaiveParticleFilter`, we can discuss each aspect one by one:

#### `num_states`:

The `NaiveParticleFilter` consists of four states, as discussed in the preceding
section.

#### `state_boundry`:

`state_boundry` represents the minimum and maximum values for each state, establishing
the range within which the states must be constrained.

#### `consensus`:

The `consensus` parameter is a value ranging from 0 to 1. It is used to determine the
proportion of particles with the highest weights that will contribute to certain
decisions. It employs a form of minor meritocracy, where only particles with top-tier
weights have a significant say in the decision-making process. For instance, if
`consensus=0.05` and `num_particles=500`, then only 0.05 \* 500 = 25 particles with the
highest weights will be selected for decision-making.
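
The selection can be sketched with `np.argsort`. The function name `consensus_state`
and the random particles are hypothetical, and the `max(1, ...)` guard is an assumption
of this sketch: without it, a product below 1 truncates to 0, and a slice of the form
`[-0:]` selects every particle rather than none.

```python
import numpy as np


def consensus_state(particles, weights, consensus):
    """Average the `consensus` fraction of particles with the largest weights."""
    k = max(1, int(len(weights) * consensus))  # keep at least one particle
    top = np.argsort(weights)[-k:]             # indices of the k largest weights
    return particles[top].mean(axis=0), top


rng = np.random.default_rng(0)
particles = rng.uniform(0, 100, size=(500, 4))  # random (x, y, w, h) states
weights = rng.random(500)
weights /= weights.sum()

state, top = consensus_state(particles, weights, consensus=0.05)
print(len(top))  # 25
print(state.round(1))
```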

#### `upper_thres`:

The `upper_thres` parameter comes into play after selecting the "consensus" particles
with the highest weights. These particles must collectively contribute weights higher
than the specified threshold, `upper_thres`. The rationale behind this is as follows:
when dealing with a large number of particles (e.g., 100) randomly distributed across
the state space, they might generate relatively random weights. In such cases, each
particle's weight is approximately 0.01. While some particles may have higher weights by
chance, it doesn't necessarily mean they are good estimates. These particles might have
received higher weights purely by luck. Thus, the `upper_thres` serves to set a
threshold ensuring that particles with higher weights significantly exceed this
threshold, filtering out randomness and ensuring that the selected particles truly
provide valuable estimations.

However, applying a threshold to the normalized weights can be misleading. Normalization
forces the weights to sum to 1, so they only express how well each particle matches
relative to the others. Even when every particle is a poor match, for example during
occlusion, the top particles can still hold a large share of the total weight and pass
the threshold.

Applying the threshold in the measurement process before normalization might therefore
be a more reasonable approach in certain scenarios. It selects particles based on their
raw, unnormalized weights, which reflect the absolute match quality, and it doesn't
interfere with the probabilistic interpretation of the normalized weights.
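
The following sketch illustrates both points on synthetic weights: how the top-fraction
mass separates a flat weight distribution from a peaked one, and how normalization
hides the absolute scale of the raw weights. The function name `top_mass` and the
weight values are illustrative.

```python
import numpy as np


def top_mass(weights, consensus):
    """Share of the total weight held by the top `consensus` fraction."""
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()
    k = max(1, int(len(weights) * consensus))
    return np.sort(weights)[-k:].sum()


n = 100
uniform = np.ones(n)   # no particle stands out
peaked = np.ones(n)
peaked[:5] = 200.0     # five particles match far better than the rest

print(f"uniform: {top_mass(uniform, 0.2):.3f}")  # about 0.2, below a 0.8 threshold
print(f"peaked:  {top_mass(peaked, 0.2):.3f}")   # above a 0.8 threshold

# Scaling every raw weight by the same tiny factor (all matches are poor in
# absolute terms) leaves the normalized top mass unchanged.
same = np.isclose(top_mass(peaked * 1e-30, 0.2), top_mass(peaked, 0.2))
print("scale-invariant:", same)
```

The last line is the reason a threshold on normalized weights cannot detect that all
particles match poorly; a threshold on the raw weights can.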

In [None]:
class NaiveParticleFilter:
    def __init__(
        self,
        num_states: int,
        state_boundry: List[Tuple],
        dynamic_model: DynamicModel,
        measure_model: MeasurementModel,
        num_particles=128,
        consensus=0.05,  # percentage
        upper_thres=0.3,  # percentage
        lower_thres=0.1,  # percentage
    ):
        self.num_states = num_states
        self.state_boundry = state_boundry
        self.num_particles = num_particles
        self.dynamic_model = dynamic_model
        self.measure_model = measure_model

        # Democracy here.
        self.num_consensus_particles = int(self.num_particles * consensus)
        self.upper_thres = upper_thres
        # self.lower_thres = lower_thres

        self.particles = np.zeros((num_particles, num_states))  # (N, 4): x, y, w, h
        self.weights = np.ones(self.num_particles) / self.num_particles  # (N, ): weight

        self.reset_particles()
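        # Initialize the state from a randomly chosen particle. A hard-coded index
        # would fail whenever `num_particles` is not larger than that index.
        self.state = self.particles[np.random.randint(self.num_particles), :].copy()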

    def reset_particles(self):
        for i in range(self.num_states):
            lower_bound, upper_bound = self.state_boundry[i]
            self.particles[:, i] = np.random.uniform(
                lower_bound, upper_bound, self.num_particles
            )
        # Reset weights.
        self.weights = np.ones(self.num_particles) / self.num_particles

    def update(self, frame):
        self.resample_particles()

        self.dynamic_model.predict(self.particles, self.state_boundry)

        self.weights = self.measure_model.measure(self.particles, frame)

        # NOTE: MAYBE not always update states. Consider occlusion.
        self.update_states()

        sorted_indices = np.argsort(self.weights)
        largest_indices = sorted_indices[-self.num_consensus_particles :]
        if np.sum(self.weights[largest_indices]) > self.upper_thres:
            self.measure_model.update_template(frame, self.state)

    def resample_particles(self):
        # Sample new particles indices using the distribution of the weights
        j = np.random.choice(
            np.arange(self.num_particles),
            self.num_particles,
            replace=True,
            p=self.weights,
        )
        self.particles = np.array(self.particles[j])
        assert self.particles.shape[0] == self.num_particles

        # Constrain particles to be within boundaries.
        for i in range(self.num_states):
            lower_bound, upper_bound = self.state_boundry[i]
            np.clip(
                self.particles[:, i], lower_bound, upper_bound, out=self.particles[:, i]
            )

        # Reset weights.
        self.weights = np.ones(self.num_particles) / self.num_particles

    def update_states(self):
        sorted_indices = np.argsort(self.weights)
        largest_indices = sorted_indices[-self.num_consensus_particles :]
        s = np.sum(self.particles[largest_indices, :], axis=0)
        average = s / len(largest_indices)
        self.state = average.astype(int)

### Running on a Simple Example

With all the essential components defined, we are now ready to apply the particle filter
to a simple example. The parameters chosen for this example are intentionally
reasonable, avoiding any extreme values. In this particle filter, we employ only 128
particles to track an object with four states. It's worth noting that I've followed the
20/80 rule, setting the `consensus` to 0.2 and the `upper_thres` to 0.8. This rule implies
that the sum of the top 20% of particles with the highest weights must exceed a
probability of 0.8.

The results are evident: the particles successfully and accurately track the position of
the object, Romney's face, throughout the entire sequence.

<center>
<video src="images/romney.mp4" type="video/mp4" controls>
</video>
</center>


In [None]:
dynamic_model = DynamicModel(std=(16, 16, 0.5, 0.5))
measure_model = MeasurementModel(
    std=16,
    template=template,
    distance_func=partial(mean_squared_error),
    # distance_func=partial(chi_squared, num_bins=8),
    # distance_func=partial(cosine_similarity), Not working
    alpha=0.01,
)
state_boundry = [
    (0, first_frame.shape[1] - 1),  # Width
    (0, first_frame.shape[0] - 1),  # Height
    (100, 105),  # Romney's head width
    (125, 135),  # Romney's head height
]
tracker = NaiveParticleFilter(
    num_states=4,
    state_boundry=state_boundry,
    dynamic_model=dynamic_model,
    measure_model=measure_model,
    num_particles=128,
    consensus=0.20,  # percentage
    upper_thres=0.80,  # percentage
    # lower_thres=0.1,  # percentage
)

for i in range(1, president_video.shape[0]):
    start_time = time.time()

    frame = president_video[i, ...]

    tracker.update(frame)

    frame = visualize_particle_filter(
        frame, tracker.particles, tracker.state, tracker.measure_model.template
    )

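    cv2.imshow("pres_debate", frame[:, :, ::-1])  # RGB to BGR

    # Aim for roughly 25 ms per frame. `time.time()` returns seconds, and
    # `cv2.waitKey` needs a delay of at least 1 ms (0 would block until a key press).
    elapsed_ms = (time.time() - start_time) * 1000
    delay = max(1, int(25 - elapsed_ms))
    if cv2.waitKey(delay) & 0xFF == ord("q"):
        break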

cv2.destroyAllWindows()

## Handling Occlusion and Depth Change

In our initial example, tracking Mitt Romney's head was relatively straightforward. Now,
we must address more realistic scenarios, such as scenes featuring occlusion and changes
in object depth.


In [None]:
first_frame = blonde_video[0, ...]  # NOTE: uint8. 0-255.
print("first_frame: ", first_frame.shape) # (360, 480, 3)
x, y, w, h = 211, 36, 100, 293
template = Template(first_frame, x, y, w, h)
print(
    "template.model:",
    "shape:", template.model.shape,
    "type:", template.model.dtype,
    "min:", np.min(template.model),
    "max:", np.max(template.model),
)

fig, ax = plt.subplots(figsize=(3, 2))  # Adjust the figure size as needed
plt.axis("off")
plt.tight_layout(pad=0)
ax.imshow(template.model)
plt.show()
