## Introduction

Humans manually categorize MRI, PET, and X-ray scans and rely on expertise to classify them. This can pose an issue: doctors can have an enormous amount of data to look over, and potential biases as well as fatigue could play a role in the classification process.

In the past, doctors would use CAD systems to assist them in this classification process. "In the CAD systems, machine learning is able to extract informative features that describe the inherent patterns from data and play a vital role in medical image analysis" (Zhang et al., 2020).

**side note** I'm not entirely sure what this means. Does it mean that we can use machine learning to extract features from CAD systems, or that machine learning is built into those CAD systems? I would assume the former, because in a previous paragraph they stated that these systems were around in the 1980s.

Another problem we seem to run into is that the brain is highly complex, and feature selection in real life is still done by doctors today. This is a problem for someone who wants to use machine learning here: since we are not experts, it would be difficult for us to do proper feature selection. Some feature selection algorithms exist, but according to the paper they are still somewhat limited.

**side note** In data science it is usually the case that feature selection is done by a human regardless of the domain. This is why we need to understand more about the particular subject we are researching.

"Compared to the traditional machine learning algorithms, deep learning automatically discovers the informative representations without the professional knowledge of domain experts" (Zhang et al., 2020)

**side note** In our approach this would be really inefficient. I read this as talking about unsupervised learning: although it can discover features by itself without humans, it takes a lot more time and a lot more data than a supervised learning method. From what I understand, "Deep Learning" doesn't have to be unsupervised; "Deep Learning" is just a class of learning algorithms.

The paper says that we can break medical image analysis down into several categories: classification, detection, registration, and segmentation. Classification aims to sort images into two or more categories; "the stacked auto-encoder model" is given as an example of this. Detection consists of finding where the problem lies, for example identifying where a tumor is. Segmentation partitions medical images into relevant parts, i.e. tissue classes, organs, etc. "Registration of medical images is a process that searches for the correct alignment of images." (Zhang et al., 2020). *Not entirely sure what this means by "alignment".*
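
On the "alignment" question: as far as I understand it, registration means finding the spatial transform that maps one image onto another so that the same anatomy ends up at the same pixel coordinates. Below is a minimal sketch, assuming the only misalignment is an integer translation and using a brute-force search over shifts. The function name `register_translation` and the toy image are made up for illustration; real registration also has to handle rotations and deformations.

```python
import torch

def register_translation(fixed, moving, max_shift=5):
    """Brute-force search for the integer (dy, dx) shift that best aligns `moving` to `fixed`."""
    best_shift, best_err = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # torch.roll shifts circularly, which is fine for this toy image
            candidate = torch.roll(moving, shifts=(dy, dx), dims=(0, 1))
            err = torch.mean((candidate - fixed) ** 2).item()
            if err < best_err:
                best_shift, best_err = (dy, dx), err
    return best_shift, best_err

# toy "scan": a bright square on a dark background
fixed = torch.zeros(32, 32)
fixed[10:16, 12:18] = 1.0

# the same scan, misaligned by 3 pixels down and 2 pixels left
moving = torch.roll(fixed, shifts=(3, -2), dims=(0, 1))

shift, err = register_translation(fixed, moving)
print(shift, err)  # (-3, 2) 0.0 : the shift that undoes the misalignment
```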

Then there are also some other tasks, such as content-based image retrieval, image generation, and image enhancement.

The rest of the paper is structured as follows **TODO add table of contents**:

- popular deep learning techniques for brain disorders
- detailed overview of recent studies

## Deep Learning

I will add some code here to enhance my understanding, using torch as well as commented-out Julia code.

### MLPs and FFNN

This is the most basic type of NN. It stacks some hidden layers and applies a series of nonlinear transforms to the data; each layer is fully connected to the next layer. Given some input $x$, the composition function $y_k$ can be written as

$$
y_k(x;\theta) = f^{(2)}\left(\sum_{j=1}^{m} w_{k,j}^{(2)} f^{(1)}\left(\sum_{i=1}^{N} w_{j,i}^{(1)} x_i + b_j^{(1)}\right) + b_k^{(2)}\right)
$$

where $f^{(n)}$ denotes a nonlinear activation function and $\theta$ represents the parameters, i.e. the weights $w$ and biases $b$ (Zhang et al., 2020).

**side note** the classic choice of nonlinear activation function in an FFNN or MLP is the sigmoid $\frac{1}{1+e^{-x}}$ (ReLU is also very common nowadays)
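
To convince myself that the formula is really just two fully connected layers, here is a small check that writes it out with the raw weights and compares it against the same computation done through `nn.Linear`. This is a sketch assuming both $f^{(1)}$ and $f^{(2)}$ are the sigmoid; the sizes `N`, `m`, `K` are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

N, m, K = 4, 3, 2            # input size, hidden size, output size (arbitrary)
layer1 = nn.Linear(N, m)     # holds w^(1) with shape (m, N) and b^(1)
layer2 = nn.Linear(m, K)     # holds w^(2) with shape (K, m) and b^(2)
x = torch.randn(N)

# the formula written out:
# f2( sum_j w2[k, j] * f1( sum_i w1[j, i] * x[i] + b1[j] ) + b2[k] )
hidden = torch.sigmoid(layer1.weight @ x + layer1.bias)
y_manual = torch.sigmoid(layer2.weight @ hidden + layer2.bias)

# the same computation through the modules
y_module = torch.sigmoid(layer2(torch.sigmoid(layer1(x))))

print(torch.allclose(y_manual, y_module))  # True
```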

A diagram of a Multi-layer Perceptron from the article:

![](https://www.frontiersin.org/files/Articles/560709/fnins-14-00779-HTML/image_m/fnins-14-00779-g001.jpg)

In [1]:
# example of a basic MLP in torch using nn.Sequential
import torch
import torch.nn as nn
import torch.nn.functional as F

import numpy as np

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.sigmoid_stack = nn.Sequential(
            nn.Linear(128, 32), # input (128 features) -> hidden layer 1
            nn.Sigmoid(),       # non linear transform
            nn.Linear(32, 64),  # hidden layer 1 -> hidden layer 2
            nn.Sigmoid(),       # non linear transform
            nn.Linear(64, 64),  # hidden layer 2 -> hidden layer 3
            nn.Sigmoid(),       # non linear transform
            nn.Linear(64, 10)   # hidden layer 3 -> output (10 logits)
        )

    def forward(self, x):
        # feed forward
        logits = self.sigmoid_stack(x)
        return logits

x = torch.randn(128, 128) # fake data: 128 samples with 128 features each
model = MLP()             # instantiate the model
print(model(x))           # feed forward (calling the model runs forward)

tensor([[0.0416, 0.3877, 0.0299,  ..., 0.2280, 0.1278, 0.2775],
        [0.0414, 0.3876, 0.0308,  ..., 0.2254, 0.1290, 0.2735],
        [0.0442, 0.3881, 0.0297,  ..., 0.2232, 0.1269, 0.2719],
        ...,
        [0.0470, 0.3909, 0.0282,  ..., 0.2222, 0.1286, 0.2685],
        [0.0442, 0.3896, 0.0296,  ..., 0.2263, 0.1297, 0.2733],
        [0.0429, 0.3901, 0.0300,  ..., 0.2244, 0.1290, 0.2711]],
       grad_fn=<AddmmBackward>)


### Backpropagation

Backpropagation is what makes neural networks trainable. The paper goes over it, but there's also a really good video on it [here](https://www.youtube.com/watch?v=Ilg3gGewQ5U) by 3Blue1Brown. Basically, backpropagation computes the gradient of the loss function with respect to the parameters, and gradient descent then uses that gradient to move towards a minimum in N-d space, where N is the number of parameters, i.e.

$$
\nabla \ell(\theta_1, \theta_2, \theta_3, \dots, \theta_N) = \left[\frac{\partial\ell}{\partial \theta_1}, \frac{\partial\ell}{\partial \theta_2}, \frac{\partial\ell}{\partial \theta_3}, \dots, \frac{\partial\ell}{\partial \theta_N}\right]
$$
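
To make the gradient concrete, here is a minimal sketch of gradient descent on a toy loss with two parameters, using autograd to get the partial derivatives. The loss `toy_loss`, the learning rate, and the number of steps are arbitrary choices of mine, not something from the paper.

```python
import torch

# toy loss with its minimum at (3, -1): l(t) = (t_1 - 3)^2 + (t_2 + 1)^2
def toy_loss(t):
    return (t[0] - 3.0) ** 2 + (t[1] + 1.0) ** 2

theta = torch.tensor([0.0, 0.0], requires_grad=True)
lr = 0.1

for step in range(100):
    loss = toy_loss(theta)
    loss.backward()                 # fills theta.grad with the two partial derivatives
    with torch.no_grad():
        theta -= lr * theta.grad    # move against the gradient
    theta.grad.zero_()              # reset, otherwise gradients accumulate

print(theta)  # close to the minimum at (3, -1)
```

The `zero_()` call matters because PyTorch adds new gradients onto the old ones by default.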

Example in PyTorch, modified from the [optimization tutorial](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html) so that it does not use data loaders:

In [18]:
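loss_fn = nn.MSELoss()                                   # mean squared error loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3) # applies the gradient descent updates

def train_model(model, loss_fn, optimizer, train_data):
    # train_data: iterable of (X, y) batches
    for X, y in train_data:
        # predict then compute loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # BP
        optimizer.zero_grad()  # clear gradients left over from the previous batch
        loss.backward()        # backpropagation: compute the gradients
        optimizer.step()       # update the parameters using the gradients

def test_model(model, loss_fn, test_data):
    # number of batches
    size = len(test_data)

    # accumulate the loss over all batches without tracking gradients
    loss = 0
    with torch.no_grad():
        for X, y in test_data:
            pred = model(X)
            loss += loss_fn(pred, y).item()

    return (loss / size)  # mean loss per batch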

This training and testing loop is central to the entire process of data science, and I'd assume the same holds for Deep Learning. Otherwise, how are you going to optimize your models to be useful?

**side note** logits are the unbounded values that come out of the final linear layer; the logit function $\log\frac{p}{1-p}$ is the inverse of the sigmoid
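
A quick check of that relationship, using `torch.sigmoid` and `torch.logit` (the input values are arbitrary):

```python
import torch

x = torch.tensor([-4.0, -1.0, 0.0, 2.5])  # unbounded values (logits)
p = torch.sigmoid(x)                       # squashed into (0, 1)
x_back = torch.logit(p)                    # log(p / (1 - p)) recovers the logits

print(p)
print(torch.allclose(x, x_back, atol=1e-5))
```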

### Stacked Auto Encoders

*Encode decode similar to transformers?*

Auto encoders by themselves are very limited due to the simple and shallow structure of a single auto encoder, but when you stack them you get a stacked auto encoder (wow, I know). This improves the model substantially.

Lower layers learn simpler details (similar to earlier layers in a CNN), while higher layers are able to extract more complex characteristics (similar to later layers in a CNN). There are many variations on auto encoders, and they can also be stacked, which has the potential to create more useful and robust models.

AE diagram from the article:
![](https://www.frontiersin.org/files/Articles/560709/fnins-14-00779-HTML/image_m/fnins-14-00779-g002.jpg)

To avoid the drawbacks of gradient descent, "the greedy layer-wise approach is considered to training parameters of an SAE" (Zhang et al., 2020). This is probably the most interesting part of this section, as non gradient descent optimization methods have been proposed for a while now. **come back and look into this**
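
My current understanding of "greedy layer-wise" is that each auto encoder in the stack is trained on its own, one after the other, with each one learning to reconstruct the codes produced by the one below it. Below is a rough sketch of that idea. Note that each layer is still trained with plain gradient descent here, just one layer at a time; the helper `train_autoencoder`, the layer sizes, and the hyperparameters are my own assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def train_autoencoder(encoder, decoder, data, epochs=200, lr=0.1):
    """Train one shallow auto encoder to reconstruct `data`; returns the final loss."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(decoder(encoder(data)), data)
        loss.backward()
        optimizer.step()
    return loss.item()

data = torch.rand(256, 8)  # fake data in [0, 1)

# stage 1: train the first auto encoder (8 -> 4 -> 8) on the raw input
enc1 = nn.Sequential(nn.Linear(8, 4), nn.Sigmoid())
dec1 = nn.Linear(4, 8)
loss1 = train_autoencoder(enc1, dec1, data)

# stage 2: keep the first encoder fixed and train a second
# auto encoder (4 -> 2 -> 4) on the codes it produces
with torch.no_grad():
    codes = enc1(data)
enc2 = nn.Sequential(nn.Linear(4, 2), nn.Sigmoid())
dec2 = nn.Linear(2, 4)
loss2 = train_autoencoder(enc2, dec2, codes)

# stack the pretrained encoders; normally this stack would be fine-tuned afterwards
stacked_encoder = nn.Sequential(enc1, enc2)
print(stacked_encoder(data).shape)  # torch.Size([256, 2])
print(loss1, loss2)
```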

code example:

In [29]:
class AE(nn.Module):
    def __init__(self):
        super(AE, self).__init__()
        
        # single encoder
        self.encoder = nn.Sequential(
            nn.Linear(5, 4),
            nn.Sigmoid(),
            nn.Linear(4, 4),
            nn.Sigmoid(),
            nn.Linear(4, 2),
            nn.Sigmoid()
        )
        
        # single decoder
        self.decoder = nn.Sequential(
            nn.Linear(2, 4),
            nn.Sigmoid(),
            nn.Linear(4, 4),
            nn.Sigmoid(),
            nn.Linear(4, 5)
        )
    
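    # feed forward operation: compress to the 2-d code, then reconstruct
    def forward(self, x):
        x = self.encoder(x)
        return self.decoder(x)

# test case
with torch.no_grad():
    x = torch.randn(5)   # one fake sample with 5 features
    model = AE()
    print(model(x))      # reconstruction of x (untrained, so not close to x yet)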

tensor([ 0.1016, -0.0388,  0.4285, -0.0595,  0.1546])


### Deep Belief Networks (WIP)

A deep belief network (DBN) stacks multiple restricted Boltzmann machines (RBMs) for deep architecture construction. **First question**: what is a Boltzmann machine? A DBN has one visible layer and multiple hidden layers. Given a visible layer $v$ and $L$ hidden layers we can construct the model $f(v, h^{(1)}, h^{(2)}, \dots, h^{(L)})$. The eq. given is:

$$
P(v, h^{(1)}, \dots, h^{(L)}) = P(v \mid h^{(1)}) \left(\prod_{l=1}^{L-2} P(h^{(l)} \mid h^{(l+1)})\right) P(h^{(L-1)}, h^{(L)})
$$

So from what I read elsewhere, as well as from re-reading this section, a DBN is made up of two parts. The top two hidden layers form an RBM, which is undirected; this is the $P(h^{(L-1)}, h^{(L)})$ part of the eq. The rest of the network is directed and is described by the conditionals $P(v \mid h^{(1)})$ and $P(h^{(l)} \mid h^{(l+1)})$. We can see this in the diagram, where the RBM layers are connected via solid lines and the other layers are connected with arrows (although it's kind of hard to see at first). The RBM part is what makes a DBN slightly different from an FFNN. *I believe*

diagram from the paper:
![](https://www.frontiersin.org/files/Articles/560709/fnins-14-00779-HTML/image_m/fnins-14-00779-g003.jpg)

In [3]:
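class DBN(nn.Module):
    def __init__(self):
        super().__init__()

        # RBM: weights of the first RBM in the stack (visible -> first hidden layer)
        self.input_layer = nn.Linear(4, 4)

        # remaining hidden layers
        self.bernoulli = nn.Sequential(
            nn.Linear(4, 4),
            nn.Sigmoid(),
            nn.Linear(4, 4)
        )

    def RBM(self, x):
        # WIP placeholder: hidden-unit activation probabilities sigmoid(Wx + b);
        # sampling and RBM training are not implemented yet
        return torch.sigmoid(self.input_layer(x))

    def forward(self, x):
        x = self.RBM(x)
        return self.bernoulli(x)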
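
On the "what is a Boltzmann machine?" question: as far as I can tell, an RBM is a two-layer undirected model (one visible layer, one hidden layer, no connections within a layer) that is usually trained with contrastive divergence rather than backprop. Here is a minimal sketch of a Bernoulli RBM with one-step contrastive divergence (CD-1). The class layout, the names `sample_h`, `sample_v`, `cd1_step`, and the hyperparameters are my own choices for illustration, not taken from the paper.

```python
import torch

class RBM:
    """Bernoulli RBM with binary visible and hidden units, trained with CD-1."""

    def __init__(self, n_visible, n_hidden):
        self.W = torch.randn(n_hidden, n_visible) * 0.01  # weights between the two layers
        self.b_v = torch.zeros(n_visible)                  # visible bias
        self.b_h = torch.zeros(n_hidden)                   # hidden bias

    def sample_h(self, v):
        # P(h_j = 1 | v) = sigmoid(W v + b_h), followed by a Bernoulli sample
        p = torch.sigmoid(v @ self.W.t() + self.b_h)
        return p, torch.bernoulli(p)

    def sample_v(self, h):
        # P(v_i = 1 | h) = sigmoid(W^T h + b_v), followed by a Bernoulli sample
        p = torch.sigmoid(h @ self.W + self.b_v)
        return p, torch.bernoulli(p)

    def cd1_step(self, v0, lr=0.1):
        # positive phase: hidden units driven by the data
        ph0, h0 = self.sample_h(v0)
        # negative phase: one Gibbs step to get a reconstruction
        pv1, v1 = self.sample_v(h0)
        ph1, _ = self.sample_h(v1)
        batch = v0.shape[0]
        # approximate gradient: data statistics minus reconstruction statistics
        self.W += lr * (ph0.t() @ v0 - ph1.t() @ v1) / batch
        self.b_v += lr * (v0 - v1).mean(dim=0)
        self.b_h += lr * (ph0 - ph1).mean(dim=0)
        return torch.mean((v0 - pv1) ** 2).item()  # reconstruction error

torch.manual_seed(0)
data = (torch.rand(64, 6) > 0.5).float()  # fake binary data
rbm = RBM(n_visible=6, n_hidden=3)
for epoch in range(50):
    err = rbm.cd1_step(data)
print(err)
```

Stacking would then mean training a second RBM on the hidden activations of the first one, and so on.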

### Deep Boltzmann Machine

Panel B of the DBN diagram above is the Deep Boltzmann Machine (DBM). It is similar to the DBN, except that instead of using a single RBM, the entire network is made of RBMs stacked on top of each other. So it doesn't include a directed part like the DBN does; the entire DBM is undirected.

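There's some interesting probability that comes with this: instead of each hidden layer depending only on the layer above it, $P(h^{(l)} \mid h^{(l+1)})$, each hidden layer now depends on both of its neighbours, $P(h^{(l)} \mid h^{(l-1)}, h^{(l+1)})$.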
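
As I understand it, because every connection in a DBM is undirected, a middle layer gets input from the layer below and from the layer above at the same time. Here is a small sketch of that conditional for the first hidden layer of a two-hidden-layer DBM with binary units. The function name `dbm_hidden_conditional`, the layer sizes, and the random weights are assumptions for illustration only.

```python
import torch

def dbm_hidden_conditional(h_below, h_above, W_below, W_above, bias):
    """P(unit = 1 | layer below, layer above) for the units of one middle layer of a DBM."""
    # input coming up from the layer below plus input coming down from the layer above
    return torch.sigmoid(h_below @ W_below + h_above @ W_above.t() + bias)

torch.manual_seed(0)
v = torch.bernoulli(torch.full((1, 6), 0.5))   # visible layer (the "layer below" h1)
h2 = torch.bernoulli(torch.full((1, 3), 0.5))  # second hidden layer (the "layer above" h1)
W1 = torch.randn(6, 4) * 0.1                   # connects v and h1
W2 = torch.randn(4, 3) * 0.1                   # connects h1 and h2
b1 = torch.zeros(4)                            # bias of h1

p_h1 = dbm_hidden_conditional(v, h2, W1, W2, b1)
print(p_h1.shape)  # torch.Size([1, 4])
```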