# Finding the Best Transformation
In the previous section, we established that we can register two images together by creating a mapping

$$
\mathbf{Q} = (\mathbf{TM}_{S})^{-1}\mathbf{M}_{T}
$$

where $\mathbf{T}$ is an affine transformation that makes the anatomical alignment between the images as good as possible. We could find the best $\mathbf{T}$ by registering the images manually, but this would quickly become impractical for large numbers of subjects. As such, we are looking not only for the best $\mathbf{T}$ we can find, but also for a way to find it *automatically* using a computer.

In order to do this, we are going to need some way of quantifying what we mean by *best*. This means we need some numeric value that tells us how good the alignment between the images is. We can then try different transformations in $\mathbf{T}$ and recalculate this value each time to see whether the registration has improved. In theory, we could just keep doing this until no change we make to $\mathbf{T}$ makes the alignment any better. At this point, we assume we have found the best $\mathbf{T}$ and stop.
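To make the idea of "keep trying until nothing improves" concrete, here is a minimal Python sketch. It rests on several assumptions that are not part of the lesson: $\mathbf{T}$ is restricted to whole-voxel translations, the images are tiny 2D arrays, and the numeric value is a placeholder (the sum of squared voxel differences). The helper names `score` and `greedy_translation_search` are hypothetical, and this is not how `SPM` searches for $\mathbf{T}$.

```python
import numpy as np

def score(target, source):
    # Placeholder measure of misalignment (smaller is better):
    # the sum of squared voxel differences.
    return float(np.sum((target - source) ** 2))

def greedy_translation_search(target, source, max_steps=50):
    """Hypothetical helper: nudge an integer (row, col) shift one voxel at a
    time and stop when no nudge lowers the score."""
    shift = (0, 0)
    best = score(target, np.roll(source, shift, axis=(0, 1)))
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    for _ in range(max_steps):
        candidates = [(shift[0] + dr, shift[1] + dc) for dr, dc in moves]
        scores = [score(target, np.roll(source, c, axis=(0, 1)))
                  for c in candidates]
        i = int(np.argmin(scores))
        if scores[i] >= best:
            # No change makes the alignment any better, so stop.
            break
        shift, best = candidates[i], scores[i]
    return shift, best

# Toy target: a bright square on a dark background.
target = np.zeros((8, 8))
target[2:5, 2:5] = 1.0

# Toy source: the same square, moved 2 rows down and 1 column left.
source = np.roll(target, (2, -1), axis=(0, 1))

shift, final_score = greedy_translation_search(target, source)
print(shift, final_score)
```

On this toy problem the search undoes the misalignment and ends with a score of zero. Real images are noisier and a full affine $\mathbf{T}$ has many more parameters, so a search this simple can easily stop at a poor solution.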

## Objective functions
In order to measure how well-aligned two images are, we make use of an *objective function*. This is an equation that uses the voxel values from both images to calculate a number that tells us how well-registered the images are. Depending on the function, our aim is to find a $\mathbf{T}$ that makes this number as *big* or as *small* as possible. The distinction between minimisation and maximisation is not important, because maximising a value is the same as minimising its negative. We can therefore restrict our focus to *cost* functions, where the aim is to make the value as small as possible. A huge variety of cost functions has been described over the years. Most imaging software provides a choice of several different costs, each of which has its own uses and limitations. We will go through some of the main ones available in `SPM` below.

### Least-squares cost functions

One of the most basic cost functions we can use is least-squares. This is based on adding up the differences between the voxel values of the two images. These differences are squared to prevent positive and negative differences cancelling out in the sum, as formalised in the equation below 

$$
C = \sum_{v=1}^{n} \left(T_{v} - S_{v}\right)^{2}
$$

where $T_{v}$ and $S_{v}$ indicate the value of voxel $v$ from the target and source images (note that $T_{v}$ is a voxel value, not the transformation $\mathbf{T}$). Given that the sum is over $n$ voxels, it is assumed that both images have the same dimensions. If not, the source needs to be interpolated to the same dimensions as the target. The idea is that images from the same modality will have similar voxel values when they are well-aligned, compared to when they are not. As such, there will be bigger differences in voxel values when the images are misaligned and thus the sum of squared differences will also be bigger. Therefore, the smaller you can make the sum of squared differences, the better the alignment should be.
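As a small illustration, the least-squares cost can be computed in a few lines of Python with NumPy. This is a sketch on toy 2D arrays rather than real brain images, and the function name `least_squares_cost` is made up for this example.

```python
import numpy as np

def least_squares_cost(target, source):
    """Sum of squared voxel-wise differences between two same-shaped images."""
    target = np.asarray(target, dtype=float)
    source = np.asarray(source, dtype=float)
    if target.shape != source.shape:
        # The formula sums over matching voxels, so the grids must agree.
        raise ValueError("resample the source onto the target grid first")
    return float(np.sum((target - source) ** 2))

# Toy "image": a bright square on a dark background.
target = np.zeros((8, 8))
target[2:5, 2:5] = 1.0

aligned = target.copy()                  # perfectly aligned copy
misaligned = np.roll(target, 2, axis=1)  # same square, shifted 2 voxels

cost_aligned = least_squares_cost(target, aligned)
cost_misaligned = least_squares_cost(target, misaligned)
print(cost_aligned, cost_misaligned)
```

The aligned copy gives a cost of zero, whereas the shifted copy gives a larger cost because 12 voxels now disagree.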

One big advantage of the least-squares function is that it is very fast to calculate. However, a disadvantage is that it is only suitable for within-modality registration. As such, if we want to register two BOLD fMRI images, or two T1-weighted images, this approach will work well. However, if we wanted to register an fMRI image to a T1 image, this cost function would likely perform poorly, because the same tissue has different intensities in the two modalities and so the voxel values differ even when the images are perfectly aligned. In `SPM`, least-squares is mainly used for fast motion correction and is not available as an option elsewhere.
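The within-modality limitation can be seen in a deliberately exaggerated toy example. Here a second "modality" is imitated by simply inverting the contrast of the image, which is an assumption made for illustration only, since real modalities differ in more complicated ways.

```python
import numpy as np

def ssd(a, b):
    # Sum of squared voxel differences (the least-squares cost).
    return float(np.sum((a - b) ** 2))

# Toy "image": a bright square on a dark background.
target = np.zeros((8, 8))
target[2:5, 2:5] = 1.0

# Same contrast, but misaligned by 2 voxels.
same_contrast_shifted = np.roll(target, 2, axis=1)

# Perfectly aligned, but with inverted contrast (dark square, bright background).
inverted_aligned = 1.0 - target

cost_shifted = ssd(target, same_contrast_shifted)
cost_inverted = ssd(target, inverted_aligned)
print(cost_shifted, cost_inverted)
```

The perfectly aligned but inverted image receives a far worse cost than the misaligned image with matching contrast, so minimising this cost would not reward the correct alignment.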

### Information theory cost functions

In the early days of image registration, cost functions tended to be simple metrics such as least-squares or values related to the correlation between images. However, it was soon discovered that more flexible cost functions could be derived using the principles of Information Theory. Although the full foundations of Information Theory are beyond the scope of this lesson (Stone, 2015, is a good book, if you are curious), the aim here is to try and give a flavour of the theory so you can see why it is helpful for image registration problems. 

Within Information Theory there is a key concept known as entropy, which measures randomness or uncertainty. When applied to an image, entropy is similar in spirit to variance, as the greater the variety of intensities, the larger the entropy becomes. This is because the image gets less predictable. To put this another way, if all the voxels had the same value, the entropy would be 0 because the image is perfectly predictable. If a voxel coordinate was selected at random and you had to guess the voxel value, it could be done with 100% accuracy. Alternatively, if the image was full of different values that occurred with exactly the same probability, entropy would be at its maximum because there is no way to predict the voxel values. As such, entropy measures where an image sits on the scale from perfectly predictable to completely unpredictable.
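These two extremes can be checked numerically. The sketch below assumes the standard Shannon definition of entropy, $H = \sum_{i} p_{i} \log_{2}(1/p_{i})$, where $p_{i}$ is the proportion of voxels with intensity $i$. It treats every distinct intensity as its own category, which suits these toy images; real images would normally be binned into a histogram first. The function name `image_entropy` is hypothetical.

```python
import numpy as np

def image_entropy(image):
    """Shannon entropy (in bits) of the distribution of voxel intensities."""
    _, counts = np.unique(np.ravel(image), return_counts=True)
    p = counts / counts.sum()              # proportion of voxels per intensity
    return float(np.sum(p * np.log2(1.0 / p)))

# Every voxel has the same value: perfectly predictable.
constant = np.full((8, 8), 5.0)

# 16 different values, each occurring equally often: as unpredictable as
# an image with 16 intensities can be.
equal_mix = (np.arange(64) % 16).reshape(8, 8)

# Mostly one value with a minority of another: somewhere in between.
mostly_dark = np.zeros((8, 8))
mostly_dark[:2, :] = 1.0                   # 16 of 64 voxels are bright

h_constant = image_entropy(constant)
h_equal_mix = image_entropy(equal_mix)
h_mostly_dark = image_entropy(mostly_dark)
print(h_constant, h_equal_mix, h_mostly_dark)
```

The constant image has an entropy of 0 bits, the equal mix of 16 values reaches the maximum of $\log_{2}(16) = 4$ bits, and the mostly dark image falls between 0 and 1 bit.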

## Optimisation
