## Introduction

Humans manually categorize MRI / PET / X-ray scans and rely on expertise to classify them. This can pose an issue, as doctors can have an enormous amount of data to look over, and potential biases as well as fatigue could play a role in the classification process.

In the past, doctors would use CAD (computer-aided diagnosis) systems to assist them in this classification process. "In the CAD systems, machine learning is able to extract informative features that describe the inherent patterns from data and play a vital role in medical image analysis" (Zhang et al., 2020)

**side note** I'm not entirely sure what this means. Does it mean that we can use machine learning to extract features from CAD systems, or that machine learning is built into those CAD systems? I would assume the former, because in a previous paragraph they stated that these systems were around in the 1980s.

Another problem we seem to run into is that the brain is highly complex, and feature selection in real life is still done by doctors today. This is a problem for someone who wants to use machine learning here: as we are not experts, it would be difficult for us to do proper feature selection. There are some algorithms out there for this, but they're still somewhat limited according to the paper.

**side note** In data science it is usually the case that feature selection is done by a human regardless of the domain. This is why we need to understand more about the particular subject that we are doing research on.

"Compared to the traditional machine learning algorithms, deep learning automatically discovers the informative representations without the professional knowledge of domain experts" (Zhang et al., 2020)

**side note** For our approach this would be really inefficient. I read this as talking about unsupervised learning: although it can discover features by itself without humans, it takes a lot more time and a lot more data than a supervised learning method. From what I understand, "deep learning" doesn't have to be unsupervised; it is just a class of learning algorithm.

The paper says that we can break down medical image analysis into several categories: classification, detection, registration, and segmentation. Classification aims to classify images into two or more categories; "the stacked auto-encoder model" is an example of this. Detection consists of finding where the problem lies: if there's a tumor, for example, the ability to identify where the tumor is. Segmentation is taking medical images and partitioning them into relevant parts, i.e. tissue classes, organs etc. "Registration of medical images is a process that searches for the correct alignment of images." (Zhang et al., 2020). *Not entirely sure what this means by "alignment".*
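
One way to picture "alignment" is a toy 1-D version of registration: slide one signal over the other and keep the shift where they match best. This is only my own sketch, not the paper's method. The function name `best_shift`, the signals, and the mean-squared-difference criterion are all made up for illustration, and real image registration also has to handle rotation, scaling and deformation.

```python
def best_shift(fixed, moving, max_shift=3):
    # try every integer shift and keep the one with the smallest
    # mean squared difference over the overlapping samples
    best, best_err = 0, float("inf")
    for s in range(-max_shift, max_shift + 1):
        pairs = [(fixed[i], moving[i + s])
                 for i in range(len(fixed)) if 0 <= i + s < len(moving)]
        err = sum((f - m) ** 2 for f, m in pairs) / len(pairs)
        if err < best_err:
            best, best_err = s, err
    return best

fixed  = [0, 0, 1, 5, 1, 0, 0, 0]
moving = [0, 0, 0, 0, 1, 5, 1, 0]   # same profile, moved two samples to the right
print(best_shift(fixed, moving))    # shift that lines the two peaks up
```

Here `fixed[i]` is compared against `moving[i + s]`, so the returned shift says how far the second signal has to be slid back to sit on top of the first.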

Then there are also some other tasks that you can do, such as content-based image retrieval, image generation and image enhancement.

The rest of the paper will be structured as follows **TODO add table of contents**:

- popular deep learning techniques for brain disorders
- detailed overview of recent studies

## Deep Learning

I will add some code here to enhance my understanding. Will be using torch as well as commented out Julia code.

### MLPs and FFNN

This is the most basic type of NN: it takes some hidden layers and applies a series of non-linear transforms to the data, and each layer is fully connected to the next layer. Given some input $x$, the composition function $y_k$ could be written as

$$
y_k(x;\theta) = f^{(2)}\left(\sum_{j=1}^{m} w_{k,j}^{(2)} f^{(1)}\left(\sum_{i=1}^{N} w_{j,i}^{(1)} x_i + b_j^{(1)}\right) + b_k^{(2)}\right)
$$

Where $f^{(n)}$ denotes a non linear activation function and $\theta$ represents the parameters (Zhang et al., 2020) 

**side note** usually in a FFNN or MLP the non linear activation function is sigmoid $\frac{1}{1+e^{-x}}$
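
To make the formula concrete, here is a minimal sketch that evaluates $y_k$ by hand in plain Python, so that both sums are visible. The sizes (N = 3 inputs, m = 2 hidden units, 1 output), the weight values, and the helper names `sigmoid`, `layer` and `mlp_forward` are all made up for illustration, and both $f^{(1)}$ and $f^{(2)}$ are assumed to be sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(weights, biases, inputs, f):
    # one fully connected layer: f(sum_i w[j][i] * x[i] + b[j]) for every unit j
    return [f(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def mlp_forward(x, w1, b1, w2, b2):
    hidden = layer(w1, b1, x, sigmoid)      # inner sum over i = 1..N
    return layer(w2, b2, hidden, sigmoid)   # outer sum over j = 1..m

# made-up parameters (theta in the formula is the collection of all of these)
w1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b1 = [0.0, 0.1]
w2 = [[1.0, -1.0]]
b2 = [0.2]

y = mlp_forward([1.0, 2.0, 3.0], w1, b1, w2, b2)
print(y)
```

Each row of `w1` and `w2` holds the weights going into one unit, which is the $w_{j,i}$ / $w_{k,j}$ indexing in the equation.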

A diagram of a Multi-layer Perceptron from the article:

![](https://www.frontiersin.org/files/Articles/560709/fnins-14-00779-HTML/image_m/fnins-14-00779-g001.jpg)

In [2]:
# example of a basic MLP in torch using sequential
import torch
import torch.nn as nn
import torch.nn.functional as F

import numpy as np

class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.sigmoid_stack = nn.Sequential(
            nn.Linear(128, 32), # input layer
            nn.Sigmoid(),       # non linear transform
            nn.Linear(32, 64),  # hidden layer 1
            nn.Sigmoid(),       # non linear transform
            nn.Linear(64, 64),  # hidden layer 2
            nn.Sigmoid(),       # non linear transform
            nn.Linear(64, 10)   # output layer
        )
    def forward(self, x):
        # feed forward
        logits = self.sigmoid_stack(x)
        return logits

x = torch.randn(128, 128) # fake data: a batch of 128 samples with 128 features each
model = MLP()             # instantiate the model
print(model(x))           # feed forward (calling the module runs forward)

tensor([[ 0.6636, -0.1627,  0.0725,  ...,  0.4316,  0.4435, -0.6485],
        [ 0.6644, -0.1639,  0.0698,  ...,  0.4320,  0.4468, -0.6470],
        [ 0.6642, -0.1642,  0.0719,  ...,  0.4337,  0.4453, -0.6475],
        ...,
        [ 0.6653, -0.1638,  0.0717,  ...,  0.4330,  0.4419, -0.6491],
        [ 0.6631, -0.1639,  0.0774,  ...,  0.4312,  0.4432, -0.6463],
        [ 0.6632, -0.1638,  0.0717,  ...,  0.4278,  0.4477, -0.6490]],
       grad_fn=<AddmmBackward>)


### Backpropagation

This is the algorithm that makes neural networks trainable: it computes the gradients that the optimizer then uses. Although they go over this in the paper, there's a really good video on this [here](https://www.youtube.com/watch?v=Ilg3gGewQ5U) by 3blue1brown. Basically, you take the gradient of the loss function with respect to the parameters and move towards a minimum in N-d space, where N is the number of parameters (weights and biases), i.e.

$$
\nabla \ell(\theta_1, \theta_2, \theta_3, \ldots, \theta_N) = \left[\frac{\partial\ell}{\partial \theta_1}, \frac{\partial\ell}{\partial \theta_2}, \frac{\partial\ell}{\partial \theta_3}, \ldots, \frac{\partial\ell}{\partial \theta_N}\right]
$$
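
As a sanity check on the idea of "moving towards a minimum", here is a small sketch of plain gradient descent on a toy loss with two variables. The loss, its minimum at (3, -1), the learning rate and the function names are all made up for illustration; a real network gets its gradient from backpropagation rather than from a hand-written `grad`.

```python
def loss(p):
    # toy quadratic loss with its minimum at (3, -1)
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2

def grad(p):
    # analytic partial derivatives of the loss above
    return [2.0 * (p[0] - 3.0), 2.0 * (p[1] + 1.0)]

def gradient_descent(p, lr=0.1, steps=100):
    for _ in range(steps):
        g = grad(p)
        # step against the gradient, scaled by the learning rate
        p = [pi - lr * gi for pi, gi in zip(p, g)]
    return p

p_final = gradient_descent([0.0, 0.0])
print(p_final, loss(p_final))
```

With this learning rate every step shrinks the distance to the minimum by a constant factor, so after 100 steps the point is very close to (3, -1).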

Example in PyTorch; the docs for this example are [here](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html). This is a modified version that does not use data loaders:

In [18]:
loss_fn = nn.MSELoss()                                   # mean squared error loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3) # applies the gradient descent update

def train_model(model, loss_fn, optm, train_data):
    for X, y in train_data:
        # predict then compute loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # BP: clear old gradients, backpropagate, then update the parameters
        # (uses the optm argument instead of the global optimizer)
        optm.zero_grad()
        loss.backward()
        optm.step()

def test_model(model, loss_fn, test_data):
    # size of data
    size = len(test_data)
    
    # iteration
    loss = 0
    with torch.no_grad():
        for X, y in test_data:
            pred = model(X)
            loss += loss_fn(pred, y).item()
    
    return (loss / size)

This training and testing loop is central to the entire process of data science. I'd assume it is the same for deep learning; otherwise, how are you going to optimize your models to be useful?

**side note** logits are the raw, unbounded outputs of the last linear layer; the logit function is the inverse of the sigmoid.
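
A quick sketch to check the sigmoid / logit relationship numerically. The function names and test values are my own; this is just the textbook definition $\text{logit}(p) = \log\frac{p}{1-p}$.

```python
import math

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    # inverse of sigmoid, defined for 0 < p < 1
    return math.log(p / (1.0 - p))

for z in [-2.0, 0.0, 1.5]:
    p = sigmoid(z)
    print(z, p, logit(p))   # logit(p) recovers z up to rounding
```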

### Stacked Auto Encoders

*Encode decode similar to transformers?*

Auto encoders by themselves are very limited due to the simple and shallow structure of an auto encoder itself, but when you stack them you get a stacked auto encoder (wow, I know). This improves the model substantially.

Lower layers learn simpler details (similar to earlier layers in a CNN), while higher layers are able to extract more complex characteristics (similar to later layers in a CNN). There are many variations on auto encoders and they can also be stacked, which has the potential to create more useful and robust models.

AE diagram from the article:
![](https://www.frontiersin.org/files/Articles/560709/fnins-14-00779-HTML/image_m/fnins-14-00779-g002.jpg)

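To avoid drawbacks of gradient descent, "the greedy layer-wise approach is considered to training parameters of an SAE" (Zhang et al., 2020). This is probably the most interesting part of this section. Note that layer-wise training changes the order in which layers are trained (one at a time, then fine-tuning the whole stack) rather than replacing gradient descent itself, although non gradient descent optimization methods have been proposed for a while now. **come back and look into this**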

code example:

In [29]:
class AE(nn.Module):
    def __init__(self):
        super(AE, self).__init__()
        
        # single encoder
        self.encoder = nn.Sequential(
            nn.Linear(5, 4),
            nn.Sigmoid(),
            nn.Linear(4, 4),
            nn.Sigmoid(),
            nn.Linear(4, 2),
            nn.Sigmoid()
        )
        
        # single decoder
        self.decoder = nn.Sequential(
            nn.Linear(2, 4),
            nn.Sigmoid(),
            nn.Linear(4, 4),
            nn.Sigmoid(),
            nn.Linear(4, 5)
        )
    
    # feed forward operation
    def forward(self, x):
        x = self.encoder(x)
        return self.decoder(x)
    
# test case
with torch.no_grad():
    x = torch.randn(5)
    model = AE()
    print(model.forward(x))

tensor([ 0.1016, -0.0388,  0.4285, -0.0595,  0.1546])


### Deep Belief Networks (WIP)

A deep belief network stacks multiple restricted Boltzmann machines (RBMs) for deep architecture construction. **First question**: what is a Boltzmann machine? A DBN has one visible layer and multiple hidden layers. Given a visible layer $v$ and $L$ hidden layers $h^{(1)}, \ldots, h^{(L)}$ we can construct the model; the eq. given is:

$$
P(v, h^{(1)}, \ldots, h^{(L)}) = P(v \mid h^{(1)}) \left( \prod_{l=1}^{L-2} P(h^{(l)} \mid h^{(l+1)}) \right) P(h^{(L-1)}, h^{(L)})
$$

So from what I read elsewhere, as well as from re-reading this section, a DBN has an RBM at one end: the top two hidden layers form an undirected RBM, which is the $P(h^{(L-1)}, h^{(L)})$ part of the eq. The rest of the network is directed, which is the $P(v \mid h^{(1)})$ and $P(h^{(l)} \mid h^{(l+1)})$ terms. We can see this in the diagram, where two of the layers are connected via solid lines and the other layers are connected with arrows (although it's kind of hard to see at first). This is slightly different from an FFNN due to the RBM part of the DBN. *I believe*

A DBN **is not** a FFNN. There are two parts to the training of a DBN: pre-training and fine-tuning.
"pretraining is done by stacking RBMs to find the parameter space." (Zhang et al., 2020) Layer n is trained as an RBM, using the output of layer n-1 as its observed data.

diagram from the paper:
![](https://www.frontiersin.org/files/Articles/560709/fnins-14-00779-HTML/image_m/fnins-14-00779-g003.jpg)

In [49]:
from torch import Tensor

class DBN(nn.Module):
    def __init__(self):
        super(DBN, self).__init__()
        
        # the RBM class is defined in a later cell; run that cell first
        self.input_layer = RBM()
        
        self.layers = nn.Sequential(
            nn.Linear(4, 4),
            nn.Sigmoid(),
        )
        
    def pre_train(self, x):
        return self.input_layer.train(x)
    
    def fine_tuning(self, x):
        x = self.pre_train(x)
        x = torch.cat(x)
        return self.layers(x)

with torch.no_grad():
    x = torch.randn(4, 4)
    model = DBN()
    print(model.fine_tuning(x))

tensor([[0.3321, 0.3780, 0.6130, 0.3789],
        [0.5537, 0.3707, 0.5993, 0.6355],
        [0.3786, 0.4699, 0.6565, 0.4008],
        [0.4086, 0.4648, 0.6040, 0.3686],
        [0.4337, 0.3939, 0.6116, 0.4460],
        [0.3938, 0.3163, 0.4446, 0.6351],
        [0.3767, 0.4570, 0.6866, 0.3009],
        [0.3186, 0.4068, 0.6248, 0.3609],
        [0.3612, 0.3379, 0.4813, 0.5820],
        [0.5664, 0.5699, 0.8305, 0.2311],
        [0.4153, 0.3856, 0.6228, 0.3926],
        [0.3282, 0.3336, 0.6349, 0.3004],
        [0.4337, 0.3939, 0.6116, 0.4460],
        [0.3938, 0.3163, 0.4446, 0.6351],
        [0.3767, 0.4570, 0.6866, 0.3009],
        [0.3186, 0.4068, 0.6248, 0.3609],
        [0.4056, 0.4504, 0.6032, 0.3525],
        [0.3742, 0.4856, 0.6176, 0.3117],
        [0.3866, 0.4254, 0.6142, 0.3874],
        [0.4059, 0.4342, 0.5964, 0.3733],
        [0.3206, 0.5014, 0.7766, 0.1804],
        [0.3100, 0.5002, 0.7680, 0.1833],
        [0.3688, 0.4374, 0.6794, 0.3178],
        [0.3356, 0.4371, 0.6852, 0

## Restricted Boltzmann Machine

This seems to be very important for understanding both DBMs and DBNs, but it is not gone over in the paper. We get the simple $P(v|h^{(1)})$, which is sort of useless if we don't know what $P$ is.

On top of this, RBMs are not some arbitrary thing like a normal distribution. They are a complex probability model, and they make up the layers of the DBM and the undirected top two layers of the DBN.

The joint probability distribution of the visible and hidden layers is $P(v, h) = \frac{1}{Z}e^{-E(v, h)}$, where $E$ is the energy function between the two variables $v$ and $h$, $E(v, h) = -a^{T}v-b^{T}h-v^{T}Wh$, and $Z$ is the partition function, $Z=\sum_{v,h} e^{-E(v,h)}$. Summing the joint over the hidden states gives the marginal $P(v) = \frac{1}{Z}\sum_{h}e^{-E(v, h)}$. Because there are no connections within a layer, the conditionals factorize: $P(v|h) = \prod_i P(v_i|h)$ and conversely $P(h|v) = \prod_i P(h_i|v)$. That alone is also sort of useless, so for binary units the individual conditional probabilities are $P(h_i = 1|v) = \sigma (b_i + \sum_{j=1}^{m} w_{j,i}v_j)$ and $P(v_i = 1|h) = \sigma (a_i + \sum_{j=1}^{n} w_{i,j}h_j)$, with $m$ visible and $n$ hidden units. Putting all of this together we get that
$$
P(v|h^{(1)}) = \prod_{i=1}^{m} P(v_i|h^{(1)}), \qquad P(v_i = 1|h^{(1)}) = \sigma\left(a_i+\sum_{j=1}^{n} w_{i,j}h^{(1)}_j\right)
$$
**Then we pass visible to a softmax?**
This whole thing in code:

In [32]:
class RBM(): 
    def __init__(self):
        self.visible = nn.Sequential(
            nn.Linear(4, 4),
            nn.Sigmoid(),
            nn.Linear(4, 4),
            nn.Sigmoid(),
            nn.Linear(4, 4),
            nn.Softmax(dim=0)
        )
        self.hidden = nn.Sequential(
            nn.Linear(4, 4),
            nn.Sigmoid(),
            nn.Linear(4, 4),
            nn.Sigmoid()
        )
        
    # step through each node in a network
    def step(self, x):
        acc1 = []
        acc2 = []
        for layer in self.visible:
            acc1.append(layer(x))
        for layer in self.hidden:
            acc2.append(layer(x))
        return acc1, acc2

    # matmul between each network
    def train(self, x):
        a, b = self.step(x)
        prod = []
        for aa in a:
            for bb in b:
                prod.append(torch.matmul(aa, bb))
        return prod

# model
model = RBM()

# fake data
x = torch.randn(4)
with torch.no_grad():
    print(model.train(x))

[tensor(0.0349), tensor(-0.0799), tensor(0.1155), tensor(-0.0799), tensor(-0.2191), tensor(0.7626), tensor(0.1422), tensor(0.7626), tensor(-0.0632), tensor(0.5142), tensor(0.0985), tensor(0.5142), tensor(-0.2191), tensor(0.7626), tensor(0.1422), tensor(0.7626), tensor(0.4488), tensor(-0.3996), tensor(-1.5879), tensor(-0.3996), tensor(-0.1298), tensor(0.4672), tensor(0.0442), tensor(0.4672)]
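
For comparison with the cell above, here is a minimal sketch that follows the RBM equations more literally: one weight matrix shared by both directions, the energy function, the two sigmoid conditionals, and a single Gibbs sampling step. It is plain Python rather than torch so every sum is visible. The class name `TinyRBM`, the sizes, the seed and the initialisation are all made up for illustration, and there is no training here (no contrastive divergence), only the probability model.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyRBM:
    # hypothetical minimal binary RBM: m visible units, n hidden units
    def __init__(self, m, n, seed=0):
        self.rng = random.Random(seed)
        # W[i][j] connects visible unit i to hidden unit j
        self.W = [[self.rng.gauss(0.0, 0.1) for _ in range(n)] for _ in range(m)]
        self.a = [0.0] * m  # visible biases
        self.b = [0.0] * n  # hidden biases

    def energy(self, v, h):
        # E(v, h) = -a^T v - b^T h - v^T W h
        av = sum(ai * vi for ai, vi in zip(self.a, v))
        bh = sum(bj * hj for bj, hj in zip(self.b, h))
        vWh = sum(v[i] * self.W[i][j] * h[j]
                  for i in range(len(v)) for j in range(len(h)))
        return -av - bh - vWh

    def p_h_given_v(self, v):
        # P(h_j = 1 | v) = sigmoid(b_j + sum_i W[i][j] * v_i)
        return [sigmoid(self.b[j] + sum(self.W[i][j] * v[i] for i in range(len(v))))
                for j in range(len(self.b))]

    def p_v_given_h(self, h):
        # P(v_i = 1 | h) = sigmoid(a_i + sum_j W[i][j] * h_j)
        return [sigmoid(self.a[i] + sum(self.W[i][j] * h[j] for j in range(len(h))))
                for i in range(len(self.a))]

    def sample(self, probs):
        # draw one binary value per unit from its Bernoulli probability
        return [1 if self.rng.random() < p else 0 for p in probs]

    def gibbs_step(self, v):
        # sample the hidden layer from v, then the visible layer from that sample
        h = self.sample(self.p_h_given_v(v))
        return self.sample(self.p_v_given_h(h)), h

rbm = TinyRBM(m=4, n=3)
v0 = [1, 0, 1, 0]
v1, h0 = rbm.gibbs_step(v0)
print(rbm.p_h_given_v(v0), v1, h0)
```

The main difference from a feed-forward stack is that the same `W` is used in both directions, which is what makes the model undirected.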


## Deep Boltzmann Machine

The B side of the diagram in the DBN section is the Deep Boltzmann Machine. It is similar to the DBN, except that instead of using one RBM we make the entire network out of RBMs stacked on top of each other. So it doesn't include a directed part like the DBN does; the entire DBM is undirected. This also means no Bayesian network.

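There's some interesting probability that comes with this: because every connection is undirected, a hidden layer is conditioned on both of its neighbouring layers, $P(h^{(l)}|h^{(l-1)}, h^{(l+1)})$, instead of only on the layer above as in the directed part of the DBN.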

This sort of conflicts with the explanation of pre-training: isn't the deep Boltzmann machine just a bunch of stacked RBMs, like the pre-training method in a DBN? This is still a novel approach, but I'm just kind of confused.

Also this: "Given the values of the neighboring layers, the conditional probabilities over the visible and the L set of hidden units are given by logistic sigmoid functions" **there are eqs. here**

In [37]:
# tbh this model is bad anyways no one cares lol
# stack a bunch of RBMs like this:
layer = RBM()
x = torch.randn(4)

# it's an rbm but a bunch
x = layer.train(x)
    
with torch.no_grad():
    print(x)

[tensor(0.2347, grad_fn=<DotBackward>), tensor(-0.5705, grad_fn=<DotBackward>), tensor(0.1488, grad_fn=<DotBackward>), tensor(-0.5705, grad_fn=<DotBackward>), tensor(0.8094, grad_fn=<DotBackward>), tensor(1.5803), tensor(-0.1472, grad_fn=<DotBackward>), tensor(1.5803), tensor(-0.7999, grad_fn=<DotBackward>), tensor(-0.4987, grad_fn=<DotBackward>), tensor(-0.1771, grad_fn=<DotBackward>), tensor(-0.4987, grad_fn=<DotBackward>), tensor(0.8094, grad_fn=<DotBackward>), tensor(1.5803), tensor(-0.1472, grad_fn=<DotBackward>), tensor(1.5803), tensor(-0.4939, grad_fn=<DotBackward>), tensor(-0.3230, grad_fn=<DotBackward>), tensor(-0.5385, grad_fn=<DotBackward>), tensor(-0.3230, grad_fn=<DotBackward>), tensor(0.6117, grad_fn=<DotBackward>), tensor(0.7360), tensor(-0.0326, grad_fn=<DotBackward>), tensor(0.7360)]


## Generative Adversarial Networks

The model is based on game theory, with two players: the discriminator and the generator. The generator wants to make better and better images, and the discriminator wants to tell real data from fake data.

![](https://www.frontiersin.org/files/Articles/560709/fnins-14-00779-HTML/image_m/fnins-14-00779-g004.jpg)

$D(x)$ is the discriminator network, which outputs the scalar probability that $x$ came from the training data rather than the generator. The output of $D(x)$ should be high when $x$ comes from the training data and low when it comes from generated data. As training progresses we want this output to approach $0.5$, so that the discriminator cannot tell whether data came from the generator or the training set.
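
To see why $0.5$ is the interesting value, here is a small sketch of the usual GAN losses written on plain probabilities. The function names and the example probabilities are my own, and the generator loss shown is the common non-saturating form, which is an assumption on my part rather than something stated in these notes.

```python
import math

def discriminator_loss(d_real, d_fake):
    # binary cross-entropy form: -[log D(x) + log(1 - D(G(z)))]
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    # non-saturating generator loss: -log D(G(z))
    return -math.log(d_fake)

# a confident, correct discriminator has a low loss
print(discriminator_loss(d_real=0.9, d_fake=0.1))
# when D outputs 0.5 for everything its loss equals log(4)
print(discriminator_loss(d_real=0.5, d_fake=0.5), math.log(4.0))
# the generator's loss drops as it fools the discriminator more often
print(generator_loss(0.1), generator_loss(0.5))
```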

In [11]:
# Generative Model
class G(nn.Module):
    def __init__(self):
        super(G, self).__init__()
        
        self.network = nn.Sequential(
            nn.ConvTranspose2d(12, 32, 4, bias=False), # input layer (latent channels, feature maps, kernel_size)
            nn.BatchNorm2d(32),
            nn.ReLU(),
            # in_channels must equal the 32 feature maps above;
            # 12 output channels is an assumption so that D can consume G's output
            nn.ConvTranspose2d(32, 12, 4, bias=False),
            nn.Tanh()
        )

    def forward(self, x):
        return self.network(x)

# Discriminator Model
class D(nn.Module):
    def __init__(self):
        super(D, self).__init__()
        # nn.Conv2d requires a kernel_size; 4 is assumed throughout to match the first layer,
        # and each in_channels now matches the previous layer's out_channels
        self.network = nn.Sequential(
            nn.Conv2d(12, 32, 4, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            nn.BatchNorm2d(32),
            nn.Conv2d(32, 64, 4, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            nn.BatchNorm2d(64),
            nn.Conv2d(64, 16, 4, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(16, 4, 4),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.network(x)

## Convolutional Neural Networks

One of the two major classes of neural networks that you hear about. I already know about these, and I assume this is what you would want to use for the particular task that we're doing: CNNs are mostly used in computer vision tasks.

![](https://www.frontiersin.org/files/Articles/560709/fnins-14-00779-HTML/image_m/fnins-14-00779-g005.jpg)

As we go deeper into the network the CNN learns more and more complex features.
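
A small sketch for working out what each convolution does to the spatial size, using the output-size formula from the PyTorch `nn.Conv2d` documentation. The helper name `conv_out` is my own. The numbers are taken from the `CNN` example in this section, which stacks three `nn.Conv2d` layers with kernel size 4 and is fed a 16 x 10 input.

```python
def conv_out(size, kernel, stride=1, padding=0, dilation=1):
    # output length along one spatial dimension of a convolution
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

h, w = 16, 10
for _ in range(3):  # three Conv2d layers with kernel_size=4, default stride and padding
    h, w = conv_out(h, 4), conv_out(w, 4)
print(h, w)
```

Each unpadded kernel-size-4 convolution removes 3 from both dimensions, which is why the printed tensor ends up with 7 rows and 1 column per channel.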

In [57]:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        
        self.layers = nn.Sequential(
            nn.Conv2d(32, 16, 4),
            nn.BatchNorm2d(16),
            nn.Sigmoid(),
            nn.Conv2d(16, 8, 4),
            nn.BatchNorm2d(8),
            nn.Conv2d(8, 2, 4),
            nn.Tanh()
        )
        
    def forward(self, x):
        return self.layers(x)

with torch.no_grad():
    x = torch.randn(32, 32, 16, 10)
    model = CNN()
    print(model.forward(x))

tensor([[[[ 0.5999],
          [-0.2136],
          [-0.1519],
          [ 0.0924],
          [ 0.1214],
          [-0.2326],
          [ 0.2350]],

         [[ 0.6949],
          [-0.4470],
          [-0.1494],
          [ 0.8575],
          [ 0.0145],
          [ 0.0051],
          [ 0.9027]]],


        [[[-0.8669],
          [ 0.6195],
          [ 0.1554],
          [ 0.9137],
          [ 0.6213],
          [-0.8920],
          [ 0.3849]],

         [[-0.0340],
          [ 0.3234],
          [ 0.4354],
          [-0.4261],
          [ 0.1530],
          [ 0.3223],
          [ 0.2167]]],


        [[[-0.8804],
          [-0.4462],
          [ 0.5017],
          [ 0.1508],
          [-0.6651],
          [ 0.3460],
          [ 0.6225]],

         [[ 0.0034],
          [ 0.8998],
          [-0.2821],
          [-0.4980],
          [ 0.1084],
          [ 0.5409],
          [ 0.6698]]],


        [[[-0.7272],
          [-0.6249],
          [ 0.4753],
          [ 0.1693],
          [-0.25

## Recurrent Neural Networks

The other of the two major classes of neural networks that you hear about. This is mostly used for sequential data, i.e. text; tasks like NLP are the most common for RNNs. RNNs have some downsides, so there are alternative topologies that you usually see, such as GRUs and LSTMs.
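
What makes an RNN "recurrent" is that the hidden state from the previous step is fed back in at the next step. Here is a minimal sketch of that recurrence, $h_t = \tanh(W_{ih}x_t + W_{hh}h_{t-1} + b)$, in plain Python. The weights, sizes and the function name `rnn_step` are made up for illustration, and the two bias vectors that `nn.RNN` keeps are merged into a single `b` here.

```python
import math

def rnn_step(x, h_prev, w_ih, w_hh, b):
    # h_t = tanh(W_ih x_t + W_hh h_{t-1} + b), computed one hidden unit at a time
    return [math.tanh(sum(wi * xi for wi, xi in zip(w_ih[k], x))
                      + sum(wh * hp for wh, hp in zip(w_hh[k], h_prev))
                      + b[k])
            for k in range(len(b))]

# made-up weights: 2 inputs, 2 hidden units
w_ih = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.0], [-0.4, 0.6]]
b = [0.0, 0.1]

h = [0.0, 0.0]                          # initial hidden state
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = rnn_step(x, h, w_ih, w_hh, b)   # the same weights are reused at every step
    print(h)
```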

![](https://www.frontiersin.org/files/Articles/560709/fnins-14-00779-HTML/image_m/fnins-14-00779-g007.jpg)

code implementation of an RNN:

In [72]:
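# so this is just built into pytorch
with torch.no_grad():
    model = nn.RNN(input_size=10, hidden_size=20, num_layers=5)
    x = torch.rand(1, 1, 10)  # (seq_len, batch, input_size)
    print(model(x))           # returns (output, h_n): the last layer's outputs and every layer's final hidden state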

(tensor([[[-0.0059,  0.1794,  0.0625, -0.3649,  0.0151, -0.0390, -0.0194,
          -0.0830,  0.0060,  0.2649,  0.1965, -0.4376,  0.2605,  0.0514,
          -0.0075, -0.2418, -0.2944,  0.1461,  0.1771,  0.5480]]]), tensor([[[-0.1101,  0.2291,  0.4945,  0.2298, -0.3767, -0.1674, -0.2314,
           0.3629,  0.2372, -0.3681,  0.2572,  0.2484,  0.1117, -0.4337,
          -0.0298,  0.1552, -0.4718, -0.0490,  0.1199,  0.1508]],

        [[ 0.1770, -0.0454, -0.0901,  0.0842, -0.2652, -0.0405, -0.2169,
          -0.2162,  0.3244, -0.0589, -0.2476,  0.1847, -0.0611, -0.0964,
           0.3060, -0.2483, -0.0634, -0.3680,  0.0247,  0.1622]],

        [[ 0.2270, -0.0361,  0.2402,  0.1368, -0.1286,  0.0395, -0.3134,
          -0.2564, -0.0544,  0.3799,  0.2311, -0.1477,  0.1224,  0.0483,
           0.0371, -0.4175,  0.2546,  0.1299, -0.0237, -0.0272]],

        [[ 0.1134, -0.3093, -0.1145, -0.1241,  0.2024,  0.1729,  0.0089,
          -0.0886,  0.0427, -0.1157, -0.1462, -0.0915,  0.1319,  0.0657,
