## Abstract

Many statistical populations contain distinct subpopulations, each of which may possess its own density. This can result in a multimodal distribution which cannot be modelled by a single density. Modelling mixtures of distributions has key applications in document topic analysis and financial crisis modelling.

Gaussian Mixture Models, used to model subpopulations, are convex combinations of Gaussian densities. Parameters for this model can be estimated using the Expectation Maximization (EM) algorithm - a probabilistic approach.

Recent developments in optimization research have yielded metaheuristic algorithms inspired by processes found in nature. One particular example is Particle Swarm Optimization (PSO), which we implement to perform maximum likelihood estimation. We compare this algorithm to the classical EM algorithm by applying both to the Old Faithful dataset and to simulated mixture densities.

In particular, we compare this non-statistical approach (PSO) with the probabilistic approach (EM) for estimating densities in the univariate case where there are two distinct groups.


## Introduction

In practice, many datasets are generated from processes which can be represented as a combination of Gaussian distributions. The parameters of the distribution that generates these samples can offer insights into properties of the population, such as the subpopulation means and the weights of each mixture component. Concretely, the mixture distribution decomposes a single density into a convex combination of several component densities.

We formally define a Gaussian Mixture Model pdf $f$ with two components as

$$
f(x\space|\space\mu_{1},\sigma_{1}^{2},\mu_{2},\sigma_{2}^{2}) = \pi\phi(x\space|\space\mu_{1},\sigma_{1}^{2}) + (1-\pi)\phi(x\space|\space\mu_{2},\sigma_{2}^{2})
$$

where $\phi(\cdot \mid \mu,\sigma^{2})$ is the Gaussian pdf with mean $\mu$ and variance $\sigma^{2}$, and $\pi \in [0,1]$ is the mixture weight.

This formulation describes two Gaussian components, with weights $\pi$ and $1-\pi$, each with its own parameters $\mu_{k}$ and $\sigma_{k}^{2}$, its mean and variance respectively.
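As a minimal sketch, the two-component density above can be evaluated directly with NumPy. The function names and the parameter values below are illustrative assumptions, not estimates from any dataset.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Gaussian density phi(x | mu, sigma2), where sigma2 is the variance."""
    x = np.asarray(x, dtype=float)
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def mixture_pdf(x, pi, mu1, sigma2_1, mu2, sigma2_2):
    """Two-component mixture density: pi * phi_1 + (1 - pi) * phi_2."""
    return (pi * gaussian_pdf(x, mu1, sigma2_1)
            + (1.0 - pi) * gaussian_pdf(x, mu2, sigma2_2))

# Illustrative (assumed) parameter values.
grid = np.linspace(-10.0, 15.0, 5001)
density = mixture_pdf(grid, pi=0.3, mu1=0.0, sigma2_1=1.0, mu2=5.0, sigma2_2=2.0)

# Trapezoidal rule: a convex combination of densities should integrate to about 1.
area = float(np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(grid)))
```

Because the weights are non-negative and sum to one, the result is itself a valid density, which the numerical integral checks.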

Given a sample dataset, these population parameters are usually unknown. A natural first approach is to write down the log-likelihood function, yielding


$$
l(\mu_{1},\sigma_{1}^{2},\mu_{2},\sigma_{2}^{2}|x) = 
\sum_{i=1}^{n}[\ln(\pi\phi(x_{i}|\mu_{1},\sigma_{1}^{2}) + (1-\pi)\phi(x_{i}|\mu_{2},\sigma_{2}^{2}))]
$$

Maximum likelihood estimation by setting the score function to zero would require solving a multivariate optimization problem. This problem is challenging because there are several parameters, the density $\phi$ is nonlinear in them, and the logarithm of a sum of densities does not separate into simpler terms, so the score equations have no closed-form solution. Thus numerical algorithms are required to find an approximate solution.
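The log-likelihood above can be evaluated numerically even though it cannot be maximized in closed form. The following is a sketch under the assumption that $0 < \pi < 1$ strictly; the function names are hypothetical, and working on the log scale with `np.logaddexp` is a choice made here for numerical stability rather than something prescribed by the text.

```python
import numpy as np

def log_gaussian_pdf(x, mu, sigma2):
    """Log of the Gaussian density with mean mu and variance sigma2."""
    return -0.5 * np.log(2.0 * np.pi * sigma2) - 0.5 * (x - mu) ** 2 / sigma2

def mixture_log_likelihood(x, pi, mu1, sigma2_1, mu2, sigma2_2):
    """Sum over data points of log(pi * phi_1(x_i) + (1 - pi) * phi_2(x_i)).

    Assumes 0 < pi < 1 so that both log-weights are finite.
    """
    x = np.asarray(x, dtype=float)
    log_a = np.log(pi) + log_gaussian_pdf(x, mu1, sigma2_1)
    log_b = np.log1p(-pi) + log_gaussian_pdf(x, mu2, sigma2_2)
    # logaddexp(a, b) = log(exp(a) + exp(b)), computed without underflow.
    return float(np.sum(np.logaddexp(log_a, log_b)))
```

Any numerical optimizer, including the two compared in this paper, only needs this function evaluated at candidate parameter values.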


The most widely used approach is the Expectation Maximization (EM) algorithm [3].
It is a deterministic optimization technique that can be viewed as a special case of minorize-maximization (MM) methods: the log-likelihood function is maximized by iteratively maximizing simpler surrogate functions.




The EM algorithm was first described by Dempster et al. in 1977. We propose to compare it to Particle Swarm Optimization (PSO) - a stochastic optimization algorithm first described by Kennedy and Eberhart [1] in 1995. It draws inspiration from behaviour seen in groups of animals such as schools of fish or flocks of birds. The main idea is to begin with an initial 'swarm' of particles with values randomly chosen from the search space. Each particle represents a candidate solution to the optimization problem, and each has its own 'inertia', which is a step size and direction. At each iteration, each particle moves in a direction determined by a linear combination of its inertia, the best solution it has found itself, and the global best solution found so far. The algorithm terminates once the particles have converged or a maximum number of iterations is reached; the solution it returns is not guaranteed to be a global optimum.

The comparison of PSO and EM will be the focus of the remainder of the paper. In the following section, we present the respective implementations of the methods. Then, we bring forth two ways of performing the comparison. First, we apply our implementations to the well-known Old Faithful dataset. Second, to further quantify our comparison, we perform Monte Carlo simulation. We end our report with a discussion of our results and the potential for future improvements and research.

Hype about non-statistical algos 




example figure here? :O

- comparisons between these two methods via repeated applications of the algorithms to simulated datasets
- mention PSO is typically used for optimization problems and we wish to see how it performs in statistical density estimation
- non-statistical methods for machine learning and estimation are gaining traction and we wish to test them out (cite XXX)

- in the remainder of this paper we implement 1995's PSO and X articles EM (textbook) and 


## Methodology

We begin by implementing our algorithms in Python. What follows is the mathematical description of each procedure. We make the important note that, due to time constraints, we assume a priori knowledge of the true values of the variances $\sigma_{1}^{2}$ and $\sigma_{2}^{2}$. For EM, we follow the formulas provided by Smyth [2] in his notes for finite mixture models.

We begin with an initial guess for $\pi$ and the $\mu$'s; the ${\sigma}^2$'s are held at their assumed known values.

The Expectation (E) step:

For each data point $x_{i}$, $1 \leq i \leq n$, we compute its membership weight for the first component, that is, the estimated probability that $x_{i}$ belongs to the first component given the current parameter estimates:

$$
w_{i,1} = \frac{ \hat{\pi}\phi(x_{i}|\hat{\mu}_{1},\hat{\sigma}_{1}^{2})}{ \hat{\pi}\phi(x_{i}|\hat{\mu}_{1},\hat{\sigma}_{1}^{2}) + (1-\hat{\pi})\phi(x_{i}|\hat{\mu}_{2},\hat{\sigma}_{2}^{2}) }
$$
and the component membership weights for the second component are

$$
w_{i,2} = 1 - w_{i,1}
$$

The Maximization (M) step:

Given the weights from the E step, we update our estimates. We first estimate the number of data points in the first component by $\hat{n}_{1} = \sum_{i=1}^{n}w_{i,1}$, then the weight of the first component by $\hat{\pi} = \frac{\hat{n}_{1}}{n}$, the mean of the first component by $\hat{\mu}_{1} = \frac{1}{\hat{n}_{1}}\sum_{i=1}^{n}w_{i,1}x_{i}$, and the mean of the second component by $\hat{\mu}_{2} = \frac{1}{n-\hat{n}_{1}}\sum_{i=1}^{n}(1-w_{i,1})x_{i}$. We note that the weight for the second component is $(1-\hat{\pi})$ and that we do not include the update for the variances, as we assume these parameters are known due to time constraints.
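The E and M steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code from the appendix: the function name, the iteration cap `n_iter`, and the stopping tolerance `tol` are assumptions introduced here, and the variances are treated as known, as stated in the text.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Gaussian density with mean mu and variance sigma2."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def em_known_variance(x, pi, mu1, mu2, sigma2_1, sigma2_2, n_iter=200, tol=1e-8):
    """EM for a two-component Gaussian mixture with known variances.

    pi, mu1, mu2 are the initial guesses; n_iter and tol are assumed
    stopping settings, not values taken from the paper.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    for _ in range(n_iter):
        # E step: membership weight of each point for the first component.
        a = pi * gaussian_pdf(x, mu1, sigma2_1)
        b = (1.0 - pi) * gaussian_pdf(x, mu2, sigma2_2)
        w1 = a / (a + b)
        # M step: effective count, mixture weight, and component means.
        n1 = w1.sum()
        new_pi = n1 / n
        new_mu1 = np.sum(w1 * x) / n1
        new_mu2 = np.sum((1.0 - w1) * x) / (n - n1)
        change = max(abs(new_pi - pi), abs(new_mu1 - mu1), abs(new_mu2 - mu2))
        pi, mu1, mu2 = new_pi, new_mu1, new_mu2
        if change < tol:  # stop once the estimates no longer move
            break
    return pi, mu1, mu2
```

A call such as `em_known_variance(x, 0.5, x.min(), x.max(), 1.0, 1.0)` would start the two means at opposite ends of the sample; the choice of starting values is left to the user.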

The PSO algorithm is described as follows:

INCLUDE FORMULATION HERE
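As a rough sketch of how the swarm update from the introduction could be applied to this problem, the code below runs a common global-best PSO variant on the mixture log-likelihood with known variances. It is not the formulation used in our experiments: the inertia weight, the two acceleration coefficients, the swarm size, the search bounds, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def log_likelihood(theta, x, sigma2_1, sigma2_2):
    """Mixture log-likelihood at theta = (pi, mu1, mu2), variances known."""
    pi, mu1, mu2 = theta
    log_a = np.log(pi) - 0.5 * np.log(2 * np.pi * sigma2_1) - 0.5 * (x - mu1) ** 2 / sigma2_1
    log_b = np.log1p(-pi) - 0.5 * np.log(2 * np.pi * sigma2_2) - 0.5 * (x - mu2) ** 2 / sigma2_2
    return np.sum(np.logaddexp(log_a, log_b))

def pso_mixture(x, sigma2_1, sigma2_2, n_particles=30, n_iter=200,
                inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Global-best PSO maximizing the log-likelihood (all settings assumed)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Assumed search box: pi strictly inside (0, 1), means within the data range.
    lower = np.array([1e-3, x.min(), x.min()])
    upper = np.array([1.0 - 1e-3, x.max(), x.max()])
    pos = rng.uniform(lower, upper, size=(n_particles, 3))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()  # each particle's own best position so far
    best_val = np.array([log_likelihood(p, x, sigma2_1, sigma2_2) for p in pos])
    g = best_pos[np.argmax(best_val)].copy()  # global best position
    for _ in range(n_iter):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # New velocity: previous velocity plus random pulls toward both bests.
        vel = inertia * vel + c_personal * r1 * (best_pos - pos) + c_global * r2 * (g - pos)
        pos = np.clip(pos + vel, lower, upper)
        val = np.array([log_likelihood(p, x, sigma2_1, sigma2_2) for p in pos])
        improved = val > best_val
        best_pos[improved] = pos[improved]
        best_val[improved] = val[improved]
        g = best_pos[np.argmax(best_val)].copy()
    return g, float(best_val.max())
```

The component labels are interchangeable, so the returned means may come back in either order; sorting them before comparing with EM avoids a spurious mismatch.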


With the implemented algorithms, we wish to evaluate them in two ways. First, we run each algorithm on the dataset consisting of the eruption times of the Old Faithful geyser. We then use the estimated parameters to simulate a sample from the fitted mixture distribution and compare a plot of the dataset density to a plot of the density of the simulated sample. To evaluate PSO, we compare the plot it produces with that of EM and that of the true dataset. The second method of evaluation is via Monte Carlo simulation. We randomly generate multiple datasets from known mixture distributions and run our algorithms on these simulated datasets. After repeating the simulation many times, we calculate the expected error between the estimates and the true parameter values. Please see the appendix for the R and Python code used for the experiments. In the next sections, we examine the results of our work and discuss the outcomes.
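The Monte Carlo evaluation described above might be organized along the following lines. This is a sketch under several assumptions: root-mean-square error is used as one possible error measure, the replication count and sample size are arbitrary, and `split_at_midpoint` is a deliberately crude stand-in estimator (neither EM nor PSO) included only so the example runs on its own.

```python
import numpy as np

def simulate_mixture(n, pi, mu1, mu2, sigma2_1, sigma2_2, rng):
    """Draw n points from the two-component Gaussian mixture."""
    from_first = rng.random(n) < pi
    return np.where(from_first,
                    rng.normal(mu1, np.sqrt(sigma2_1), n),
                    rng.normal(mu2, np.sqrt(sigma2_2), n))

def monte_carlo_rmse(estimator, true_params, sigma2_1, sigma2_2,
                     n_reps=100, n=300, seed=0):
    """RMSE of estimator(x) -> (pi, mu1, mu2) over repeated simulated datasets.

    The estimator must return components in the same order as true_params.
    """
    rng = np.random.default_rng(seed)
    truth = np.asarray(true_params, dtype=float)
    errors = np.empty((n_reps, truth.size))
    for r in range(n_reps):
        x = simulate_mixture(n, truth[0], truth[1], truth[2], sigma2_1, sigma2_2, rng)
        errors[r] = np.asarray(estimator(x), dtype=float) - truth
    return np.sqrt(np.mean(errors ** 2, axis=0))

def split_at_midpoint(x):
    """Stand-in estimator (NOT EM or PSO): split the sample at mid-range."""
    cut = 0.5 * (x.min() + x.max())
    low, high = x[x < cut], x[x >= cut]
    return low.size / x.size, low.mean(), high.mean()

# Assumed true parameters (pi, mu1, mu2) with unit variances.
rmse = monte_carlo_rmse(split_at_midpoint, (0.4, 0.0, 6.0), 1.0, 1.0)
```

Passing the estimator as a function means the same loop can score EM and PSO on identical simulated datasets, provided both are called with the same seed.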




- compare PSO and EM algorithms and discuss implementation
	- math behind PSO and how the vector is updated
	- EM formulas

- show math for likelihood function
- implementation details
- how we will quantify the comparison
- assumptions/restrictions on our simulated params: only two mixtures
- monte carlo thing (simulated data)
	- generate multiple datasets from known mixture distributions
	- we can measure accuracy/error by comparing the resulting params to the true ones

- how we compare
- discuss error measurements

- application of analysis to a real dataset (faithful data)
	- compare both 
	- plot the estimated densities vs the original dataset
	- simulate data from estimated mixtures and compare to original dataset - compare densities


## Results - Single Dataset

## Results - Multiple Simulated Datasets

## Discussion

## References

[1] Kennedy, J. and Eberhart, R. (1995). Particle swarm optimization. Proceedings of IEEE International Conference on Neural Networks. IV. pp. 1942–1948. 

[2] Smyth, P. (2016). The EM algorithm for Gaussian Mixtures. Course notes - CS274A. University of California, Irvine.

[3] Robert, C.P. and Casella, G. (2010). The EM Algorithm. In Introducing Monte Carlo Methods with R. Springer, New York, NY.
