[View in Colaboratory](https://colab.research.google.com/github/adowaconan/Deep_learning_fMRI/blob/master/Guclu_and_van_Gerven_2015_ventral_representation.ipynb)

In [0]:
from IPython.display import Image

# Introduction

1. object recognition appears to be solved in the primate brain via a cascade of neural computations along the visual ventral stream that **represent increasingly complex stimulus features**, which derive from retinal input (Tanaka, 1996)
2. neurons in early visual areas have smaller receptive fields and respond to simple features such as edge orientations, whereas neurons further along the ventral pathway have larger receptive fields and are more invariant to transformations and can be selective for complex shapes (Hubel and Wiesel, 1962; Gross et al., 1972; Hung et al., 2005)
3. while the receptive fields in early visual areas have been characterized in terms of preferred orientation, location, and spatial frequency (Jones and Palmer, 1987),
4. exactly what stimulus features are represented in downstream areas is less clear (Cox, 2014)
5. $mapping ({DNN}\rightarrow{image})$, deeper layers can be shown to respond to increasingly complex stimulus features (Zeiler and Fergus, 2012)
6. past works (Kay et al., 2008; **van Gerven et al., 2010 [unsupervised]()**; Guclu and van Gerven, 2014 [linear response model]())
7.  \*individual neural network layers were used to predict single-voxel responses to natural images, and this allowed us to isolate different voxel groups, whose population receptive fields are best predicted by a particular neural network layer (Dumoulin and Wandell, 2008)

## goal
$$mapping (\text{receptive fields},{DNN_{layers}})$$
$$complexity$$
$$invariant$$
$$size$$



In [13]:
print('Unsupervised feature learning improves prediction of human brain activity in response to natural images')
Image(url='http://journals.plos.org/ploscompbiol/article/figure/image?size=large&id=10.1371/journal.pcbi.1003724.g001',height=800)

Unsupervised feature learning improves prediction of human brain activity in response to natural images


In [15]:
print("Tanaka, 1996\nKravitz et al., 2014")
Image(url='http://www.jneurosci.org/content/jneuro/35/27/10005/F1.large.jpg?width=800&height=600&carousel=1',height=500)

Tanaka, 1996
Kravitz et al., 2014


# Method

## 2 male subjects
## 5 sessions of data
## training images - 1750, each was repeated twice
## testing images - 120, each was repeated 13 times

## MRI
### slice thickness: 2.25 mm
### slice gap 0.15 mm
### field of view 128 x 128 $mm^2$
### TR: 1s
### TE: 28 ms
### spatial resolution: 2 x 2 x 2.5 $mm^3$

## preprocessing
### fMRI scans were coregistered and used to estimate voxel-specific response time course
### deconvolution (Zeiler and Fergus, 2012) $\rightarrow$ estimate of response amplitude for each unique image in each voxel $\rightarrow$ voxels were assigned to visual areas using retinotopic mapping data acquired in a separate session $\rightarrow$ anatomical and functional volumes were coregistered manually (FreeSurfer)

# Encoding and Decoding model

## transforming image features to abstract representations ---- encoding
feature model that transform a visual stimulus to a **nonlinear** feature representation (Chatfield et al., 2014; Krizhevsky et al., 2012)
## transforming abstract representations to BOLD responses ---- decoding
1. a **linear** response model that transforms nonlinear feature representations to a voxel response (Guclu and van Gerven, 2014)
2. a separate response model was trained **for each feature map/voxel combination** using regularized linear regression [what are the regressors? what is the prediction?]()
3. to examine which DNN layer was most predictive of individual responses, we used each of the 8 layers of feature representations as input
4. estimation of the regression coefficients $\beta_i$, 
$$\mu_i(\mathbf{x})=\beta_i^T\Phi(\mathbf{x})$$
$$\text{as the predicted response of voxel }i\text{ to input stimulus }\mathbf{x}\text{ given a chosen feature representation }\Phi(\mathbf{x})$$ 
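
A minimal sketch of the linear response model $\mu_i(\mathbf{x})=\beta_i^T\Phi(\mathbf{x})$, assuming ridge regression as the regularizer (the notes only say "regularized linear regression"). In this reading the regressors are the columns of the feature matrix and the prediction is the response of one voxel. The toy data, the penalty `lam`, and the function names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def fit_ridge(Phi, y, lam=1.0):
    """Closed-form ridge estimate: beta = (Phi^T Phi + lam * I)^-1 Phi^T y."""
    n_features = Phi.shape[1]
    gram = Phi.T @ Phi + lam * np.eye(n_features)
    return np.linalg.solve(gram, Phi.T @ y)

def predict(Phi, beta):
    """Predicted voxel response mu(x) = beta^T Phi(x), one value per stimulus."""
    return Phi @ beta

# Toy data standing in for one DNN layer's features and one voxel (assumed sizes).
rng = np.random.default_rng(0)
n_train, n_test, n_features = 200, 50, 10
true_beta = rng.normal(size=n_features)
Phi_train = rng.normal(size=(n_train, n_features))
Phi_test = rng.normal(size=(n_test, n_features))
y_train = Phi_train @ true_beta + 0.1 * rng.normal(size=n_train)
y_test = Phi_test @ true_beta + 0.1 * rng.normal(size=n_test)

beta_hat = fit_ridge(Phi_train, y_train, lam=1.0)
# Prediction accuracy as Pearson's r between observed and predicted responses.
r = np.corrcoef(y_test, predict(Phi_test, beta_hat))[0, 1]
print(f"test-set Pearson r = {r:.3f}")
```

In practice one such model would be fitted per voxel and per feature representation, with `lam` chosen by cross-validation.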

# Similarities and differences compared to Autoencoder ([optional](http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/))
1. neural network that is an **[unsupervised learning algorithm](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.408.1839&rep=rep1&type=pdf)(Baldi and Hornik, 1988)** that applies backpropagation, setting the target values to be equal to the inputs
2. learn an **approximation to the identity function**, $h_{W,b}(x)\approx x$, so as to output $\hat{x}$ that is similar to $x$
3. if there is structure in the data, for example, if some of the input features in $x$ are correlated, then this algorithm will be able to discover some of those correlations. In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to **PCAs** [tSNE](http://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html#sphx-glr-auto-examples-manifold-plot-t-sne-perplexity-py)
4. if the hidden layer is large, we can still discover interesting structure, by imposing other constraints on the network: [sparsity-denoising autoencoder](https://www.doc.ic.ac.uk/~js4416/163/website/autoencoders/denoising.html)
5. [variational autoencoder](https://arxiv.org/pdf/1312.6114.pdf): for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (aka recognition model) to the intractable posterior using the lower bound estimator
6. latent space: highly abstract representation, not necessarily informative
7. can be used to map anything to anything **with constraints**
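
The autoencoder idea in items 1 to 3 can be sketched with plain NumPy: a single hidden layer trained by backpropagation with the inputs as targets. This is only a toy illustration; the data, layer sizes, learning rate, and number of steps are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 6 correlated features generated from 2 latent factors (assumed).
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing
X = (X - X.mean(axis=0)) / X.std(axis=0)

n_in, n_hidden = X.shape[1], 2  # bottleneck smaller than the input
W1 = 0.1 * rng.normal(size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.normal(size=(n_hidden, n_in))
b2 = np.zeros(n_in)

def forward(X):
    """Encode to the hidden layer, then decode back to input space."""
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

lr = 0.05
losses = []
for step in range(2000):
    H, X_hat = forward(X)
    err = X_hat - X                      # the target is the input itself
    losses.append(np.mean(err ** 2))
    # Backpropagate the mean squared reconstruction error.
    g_out = 2.0 * err / err.size
    gW2 = H.T @ g_out
    gb2 = g_out.sum(axis=0)
    g_hid = (g_out @ W2.T) * (1.0 - H ** 2)  # derivative of tanh
    gW1 = X.T @ g_hid
    gb1 = g_hid.sum(axis=0)
    W1 -= lr * gW1
    b1 -= lr * gb1
    W2 -= lr * gW2
    b2 -= lr * gb2

print(f"reconstruction loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the six inputs are driven by two latent factors, the two hidden units can capture much of the structure, which is the sense in which the bottleneck discovers correlations among the inputs.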


In [4]:
Image(url='https://www.doc.ic.ac.uk/~js4416/163/website/img/autoencoders/autoencoder.png')

In [16]:
Image(url='https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/7a9b6d996fed89bc327a19c72adfe8f80e9b522e/4-Figure3-1.png',height=400)

In [17]:
Image(url='https://www.researchgate.net/profile/Javier_Turek/publication/306258062/figure/fig1/AS:396269664129025@1471489458335/Proposed-4D-Convolutional-Autoencoder.png',height=400)

# Quantification of model performance
1. a voxel's prediction accuracy is defined as the Pearson's $r$ between its observed and predicted responses on the *test set*
2. to account for performance variability across voxels, we compared prediction accuracies of voxels with their SNRs and the mean activities of the DNN layers across the *training set*
3. SNR was estimated as:
$$\frac{\text{mean time series}}{\text{median }\{abs[diff(\text{time point}_i,\text{time point}_{i+1})]\}}$$
4. in addition to computing the prediction accuracy for individual voxels, we can use the accuracy of reconstructing a presented image from observed brain activity as a measure of model performance
$$\text{compute the most probable stimulus by maximizing the likelihood}$$
$$\mathbf{x}^* = arg\,max_{\mathbf{x}\in\mathbf{X}}\{-(\mathbf{y} - \mu(\mathbf{x}))^T\,\Sigma^{-1}\, (\mathbf{y} - \mu(\mathbf{x}))\}$$
$$\mu({\mathbf{x}})\text{ is the predicted response by the encoding model using the optimal layer assignment for each voxel}$$
$$\Sigma \text{ is an estimate of the noise covariance}$$(**what noise covariance it belongs to? How to estimate such noise covariance?**)
5. those voxels that have the highest prediction accuracy on the *test set* are chosen **without using the target stimulus** *(sounds like double dipping to me)*
6. the identification accuracy is:
$$\frac{\text{correctly identify test images from the {train and test combined}}}{120}$$
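
The SNR estimate in item 3 and the identification rule in item 4 can be sketched as follows. This is a hedged illustration: the function names are made up, the candidate responses are random toy data, and the diagonal $\Sigma$ is an assumption for the example (the notes leave open how the noise covariance is estimated).

```python
import numpy as np

def snr(time_series):
    """Mean of the series divided by the median absolute successive difference."""
    ts = np.asarray(time_series, dtype=float)
    return ts.mean() / np.median(np.abs(np.diff(ts)))

def identify(y, mu_candidates, sigma):
    """Index of the candidate x maximizing -(y - mu(x))^T Sigma^-1 (y - mu(x)).

    y: observed responses, shape (n_voxels,)
    mu_candidates: predicted responses per candidate, shape (n_candidates, n_voxels)
    sigma: noise covariance, shape (n_voxels, n_voxels)
    """
    resid = y - mu_candidates
    sigma_inv = np.linalg.inv(sigma)
    scores = -np.einsum("ij,jk,ik->i", resid, sigma_inv, resid)
    return int(np.argmax(scores))

# Toy example: 20 candidate stimuli, 5 voxels (assumed sizes).
rng = np.random.default_rng(0)
mu_candidates = rng.normal(size=(20, 5))
sigma = 0.01 * np.eye(5)  # assumed diagonal noise covariance
y = mu_candidates[7] + 0.01 * rng.normal(size=5)  # noisy response to stimulus 7

best = identify(y, mu_candidates, sigma)
snr_value = snr([10.0, 11.0, 10.0, 11.0])
print(best, snr_value)
```

Identification accuracy would then be the fraction of test images for which `identify` returns the index of the image that was actually shown.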

## to improve decoding performance
1. predictions were made by **refitting** an encoding model for each voxel *(by what)*
2. the receptive field of each voxel was estimated by **refitting** another set of encoding models that take as input all features in the preferred layer of the voxel at individual spatial locations. The receptive field was then taken as the spatial locations whose corresponding models accurately predicted the response of the voxel. *(why can you do that)*

## control model
1.  [gabor wavelet pyramid](https://www.researchgate.net/publication/220869739_A_Gabor_Wavelet_Pyramid-Based_Object_Detection_Algorithm)
2. vgg-verydeep-16 and vgg-verydeep-19 (Simonyan and Zisserman, 2014)
3. vgg-f (fast, strides = 4), vgg-m (medium), vgg-m-2048, vgg-m-1024, and vgg-m-128 (Chatfield et al., 2014)
4. caffe-ref [(Jia et al., 2014)](https://arxiv.org/pdf/1408.5093.pdf): fast feature embedding
5. caffe-alex [(Krizhevsky et al., 2012)](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf): the original AlexNet

# Analysis of internal representations
1. a deconvolutional network (Zeiler and Fergus, 2012) was used to reconstruct the internal representations of artificial neurons
2. the image that maximally activates each artificial neuron was selected from the ImageNet *validation set*
3. the image was first fed forward until it reached the layer of the neuron of interest
4. all the activation except the maximum activation of the neuron were set to zero (sparsity)
5. the activation of the neuron was deconvolved to produce a representation in image space
$$\text{inverting the order of the layers }\rightarrow \text{transposing the filters }\rightarrow\text{replacing max pooling with max unpooling}$$
6. after an initial evaluation of the internal representations, **9** feature classes were defined such that they were representative of the most common *low-level (blob, contrast, and edge)*, *mid-level (contour, shape, and texture)*, and *high level (irregular pattern and object part and entire object)* internal representations of the 1888 artificial neurons in the convolutional layers (sounds like a small model)
7. each of these neurons was assigned a **predefined label** by a naive subject across five-hour long sessions
8. the subject was presented with 4 instantiations of the internal representations of the neurons *(together with the images that were used to reconstruct them)* in a random order and was asked to assign one of the following feature classes: see above
9. each instantiation corresponded to the reconstruction of the internal representation of a neuron using one of the 4 images that activated the neuron the most (very confusing)
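
Steps 4 and 5 above rely on two small operations that are easy to show in isolation: keeping only the maximum activation, and max unpooling with recorded "switches" (the positions of the maxima). The sketch below assumes non-overlapping 2x2 pooling on a single 2D feature map; it is not the full deconvolutional network, and the function names are illustrative.

```python
import numpy as np

def keep_only_max(feature_map):
    """Zero every activation except the single largest one."""
    out = np.zeros_like(feature_map)
    idx = np.unravel_index(np.argmax(feature_map), feature_map.shape)
    out[idx] = feature_map[idx]
    return out

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling that also returns the argmax switches."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    blocks = blocks.reshape(h // 2, w // 2, 4)
    return blocks.max(axis=2), blocks.argmax(axis=2)

def max_unpool_2x2(pooled, switches):
    """Place each pooled value back at the position recorded in switches."""
    ph, pw = pooled.shape
    blocks = np.zeros((ph, pw, 4), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    blocks[rows, cols, switches] = pooled
    return blocks.reshape(ph, pw, 2, 2).transpose(0, 2, 1, 3).reshape(ph * 2, pw * 2)

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 9, 8],
              [7, 0, 6, 0]])
pooled, switches = max_pool_2x2(x)
unpooled = max_unpool_2x2(pooled, switches)
sparse = keep_only_max(pooled)
print(pooled)
print(unpooled)
```

Unpooling restores each maximum to its original location and leaves the other positions at zero, which is why the reconstruction stays spatially aligned with the input image.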

# Analysis of voxel groups
1. individual voxels were assigned to their optimal layer according to maximal prediction accuracy computed using 5-fold cross validation on the *training data*
2. voxels were grouped together according to their assigned neural network layer
3. layer complexity is defined as the mean [Kolmogorov complexity](https://github.com/MLWave/koolmogorov) of the internal representations of the artificial neurons in that layer, approximated by their normalized compressed file size
4. layer invariance is defined as the median full-width at half-maximum of 2D Gaussian surfaces that have been fitted to the 2D response surfaces of the artificial neurons in that layer (reflecting tolerance to small translations of a stimulus feature)
## the reconstruction of the internal representation of the artificial neuron is shifted to different spatial locations (the brain)
## the activity of the neurons is computed for each translation and a 2D response surface is constructed (I still don't know where is this "2D response surface" coming from)
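
A rough sketch of the two layer properties above, under stated assumptions. Complexity is approximated with `zlib` compression (the notes only say "normalized compressed file size", so the compressor is an assumption). The 2D response surface is illustrated with a toy "neuron" that is simply a linear template responding to shifted copies of its own preferred pattern; this is one possible reading of the shifting procedure, not the paper's implementation. The FWHM of a Gaussian with standard deviation $\sigma$ is $2\sqrt{2\ln 2}\,\sigma$.

```python
import zlib
import numpy as np

def compression_complexity(image_uint8):
    """Compressed size divided by raw size, as a proxy for Kolmogorov complexity."""
    raw = np.ascontiguousarray(image_uint8, dtype=np.uint8).tobytes()
    return len(zlib.compress(raw, 9)) / len(raw)

def gaussian_fwhm(sigma):
    """Full width at half maximum of a Gaussian with standard deviation sigma."""
    return 2.0 * np.sqrt(2.0 * np.log(2.0)) * sigma

def response_surface(template, max_shift):
    """Response of a linear template to its preferred pattern at each 2D shift."""
    shifts = range(-max_shift, max_shift + 1)
    return np.array([[np.sum(template * np.roll(template, (dy, dx), axis=(0, 1)))
                      for dx in shifts] for dy in shifts])

rng = np.random.default_rng(0)
flat = np.zeros((64, 64), dtype=np.uint8)                      # very simple image
noisy = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)    # incompressible image
c_flat, c_noisy = compression_complexity(flat), compression_complexity(noisy)

# Toy preferred pattern: a Gaussian blob (assumed).
yy, xx = np.mgrid[-8:9, -8:9]
blob = np.exp(-(xx ** 2 + yy ** 2) / (2 * 2.0 ** 2))
surface = response_surface(blob, max_shift=4)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(surface), surface.shape))
print(c_flat, c_noisy, peak, gaussian_fwhm(1.0))
```

The response is largest at zero shift and falls off as the pattern moves away; fitting a 2D Gaussian to such a surface and taking its FWHM gives a measure of how tolerant the unit is to translation.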


# Cluster analysis
## don't understand at all
1. goal: identify fine-grained structure within individual visual areas, use: **hyperalignment** (Haxby et al., 2011) $\rightarrow$ **nonparametric Bayesian biclustering** (Meeds and Roweis, 2007)

## hyperalignment
1. select {individual representational space of the subject that has the **most** number of voxels} as the initial common representational space
2. the common representational space was then iteratively updated for 100 iterations
3. for each iteration:
        Procrustes transformation {project the individual functional data of the 2 subjects to the common representational space} after {which the common representational space was set to the **mean** of the individual functional data of the two subjects}
4. each visual area was hyperaligned *separately*
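
The iteration above can be sketched with an orthogonal Procrustes step solved by SVD. This is a simplified illustration: it assumes both subjects' data matrices have the same shape (stimuli x dimensions), whereas the notes describe starting from the subject with the most voxels, and it uses random toy data.

```python
import numpy as np

def procrustes_rotation(source, target):
    """Orthogonal matrix R minimizing ||source @ R - target|| (Frobenius norm)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def hyperalign(data_a, data_b, n_iter=100):
    """Iteratively align two subjects to a shared space and update it to their mean."""
    common = data_a.copy()  # initial common space: one subject's own space
    for _ in range(n_iter):
        aligned_a = data_a @ procrustes_rotation(data_a, common)
        aligned_b = data_b @ procrustes_rotation(data_b, common)
        common = (aligned_a + aligned_b) / 2.0
    return common, aligned_a, aligned_b

# Toy check: subject B is an orthogonally rotated copy of subject A (assumed data).
rng = np.random.default_rng(0)
data_a = rng.normal(size=(50, 8))
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
data_b = data_a @ q

common, aligned_a, aligned_b = hyperalign(data_a, data_b, n_iter=100)
print(np.abs(aligned_a - aligned_b).max())
```

When the two datasets differ only by a rotation, the aligned versions coincide; with real data the common space is a compromise between the two subjects.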

## nonparametric Bayesian biclustering
1. simultaneously cluster rows:[individual feature maps] and columns:[region-specific voxels of the common representational space] of a z-scored prediction accuracy matrix
2. Assumption:
        the observed prediction accuracies for each feature map/voxel pair are drawn from a Gaussian with zero mean and unit standard deviation
3. [Gibbs sampler - matlab](https://github.com/ppletscher/npbb)

# Results

1. assign voxels to one of the 8 layers of the DNN, each voxel was assigned to the layer of the DNN that resulted in the lowest CV error on the *training set*
2. S1: $\text{3381 of 25915} = 0.1305$; only the main afferent pathway of the ventral stream: $\text{1785 of 6017} = 0.2967$
3. S2: $\text{1185 of 26329} = 0.0450$; only the main afferent pathway of the ventral stream: $\text{768 of 4875} = 0.1575$
4. a combination of the striate and extrastriate decoding models would have a higher accuracy since the striate voxels can be used to resolve the ambiguities in the feature representations of the extrastriate voxels and vice versa
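
The layer assignment in item 1 can be sketched as a cross-validated model comparison: fit one regularized linear model per layer and keep the layer with the best held-out prediction accuracy. The sketch assumes ridge regression with a fixed penalty and three toy "layers" of random features; all names and sizes are illustrative.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge regression coefficients."""
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

def cv_accuracy(Phi, y, lam=1.0, n_folds=5):
    """Pearson r between y and its out-of-fold predictions (k-fold CV)."""
    indices = np.arange(len(y))
    preds = np.empty(len(y), dtype=float)
    for held_out in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, held_out)
        beta = ridge_fit(Phi[train], y[train], lam)
        preds[held_out] = Phi[held_out] @ beta
    return np.corrcoef(y, preds)[0, 1]

def assign_layer(layer_features, y):
    """Index of the layer whose features best predict the voxel, plus all scores."""
    scores = [cv_accuracy(Phi, y) for Phi in layer_features]
    return int(np.argmax(scores)), scores

# Toy voxel whose response is driven by the features of "layer" 1 (assumed data).
rng = np.random.default_rng(0)
layer_features = [rng.normal(size=(200, 10)) for _ in range(3)]
y = layer_features[1] @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

best_layer, scores = assign_layer(layer_features, y)
print(best_layer, np.round(scores, 3))
```

Because the comparison uses only training data, the test set stays untouched for the final prediction accuracy.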

In [18]:
print('to what extent the deep model allows decoding of a perceived stimulus from observed multiple voxel responses alone')
Image(url='http://www.jneurosci.org/content/jneuro/35/27/10005/F2.large.jpg?width=800&height=600&carousel=1',height=400)

to what extent the deep model allows decoding of a perceived stimulus from observed multiple voxel responses alone


The DNN model accurately predicts voxel responses across the occipital cortex. **A**, Prediction accuracies of the significant voxels across the occipital cortex (p < 2e-6 for both subjects, Bonferroni corrected for number of voxels, Student's t test across cross-validated training images within subjects). **B**, Prediction accuracies of the significant voxels across V1[striate](), V2, V4, and LO [extrastriate]() (p < 5e-8 for both subjects, Bonferroni corrected for number of layers and voxels, Student's t test across cross-validated training images within subjects). **C**, SNRs of the voxels across the occipital cortex.

### identification - machine learning: identify the correct stimulus from a set of novel stimuli - decoding model
#### [pattern recognition](https://en.wikipedia.org/wiki/Pattern_recognition)
#### [Kay et al., 2008 Nature](https://www.nature.com/articles/nature06713)
#### [Michel et al., 2008 meanings of nouns](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.649.5692&rep=rep1&type=pdf)
1. all decoding models performed significantly better than the chance level of 1/1870 ≈ 5e-4, i.e. about 0.05% (p < 2e-308, binomial test across test images within subjects)
2. the striate decoding model correctly identified a stimulus from a set of 1870 potential stimuli at 96% accuracy (500 voxels) and 79% accuracy (250 voxels), whereas the extrastriate decoding model correctly identified a stimulus from the same set at 95% and 63% accuracy
3. the ventral stream decoding model showed higher identification accuracy than either of the previous 2 decoding models (striate and extrastriate): 98% and 93%


# Image decoding is driven by discriminative and categorical information
1. Question: to what extent decoding performance is driven by {discrimination}[identifying an image based on its unique characteristics]() or {categorization}[identifying an image based on categorical information]()
2. assigned each image in the *test set* to one of two categories (**animate vs inanimate**) as this appears to be the strongest categorical division in [*inferior temporal cortex*](https://upload.wikimedia.org/wikipedia/commons/1/18/Gray726_inferior_temporal_gyrus.png) (Khaligh-Razavi and Kriegeskorte, 2014 RSA paper)
3. 99 of 120 *test images* could be assigned to either of these categories and were used for further analysis
4. compute the pairwise linear correlations between {the observed} and {predicted responses} to each pair of images (how do you pair images?)
5. it was found that  $$\mathbf{r}_{Image\,i}(\text{observed responses},\text{predicted responses})\, \gg\, mean(\mathbf{r}(\text{observed responses}_{Image\,i},\text{predicted responses}_{Image\,not\,i}))$$
6. **this points towards identification based on each image's unique characteristics** (in other words, the abstract representations of the images did not overlap much)
7. For high-level voxels only: $$mean(\mathbf{r}(\text{observed responses}_{(Image\,i,\,Category\,j)},\text{predicted responses}_{(Image\,k,\,Category\,j)})) \gg mean (\mathbf{r}(\text{observed responses}_{(Image\,i,\,Category\,j)},\text{predicted responses}_{(Image\,k,\,Category\,p)}))$$
8. for downstream areas, **not only** characteristics of an image, **but also** its semantic content is involved in response prediction
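
One way to read the comparison above is as a full image-by-image correlation matrix: entry $(i, k)$ correlates the observed response pattern to image $i$ with the predicted pattern for image $k$, across voxels. The diagonal then measures same-image agreement, and the off-diagonal entries can be split by category. The sketch below uses random toy data and alternating category labels, both assumptions for illustration.

```python
import numpy as np

def pairwise_correlations(observed, predicted):
    """corr[i, k]: Pearson r across voxels between observed[i] and predicted[k]."""
    obs = observed - observed.mean(axis=1, keepdims=True)
    pred = predicted - predicted.mean(axis=1, keepdims=True)
    obs = obs / np.linalg.norm(obs, axis=1, keepdims=True)
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    return obs @ pred.T

# Toy data: 30 images x 100 voxels; observed = predicted + noise (assumed).
rng = np.random.default_rng(0)
n_images, n_voxels = 30, 100
predicted = rng.normal(size=(n_images, n_voxels))
observed = predicted + 0.5 * rng.normal(size=(n_images, n_voxels))
labels = np.arange(n_images) % 2  # hypothetical animate/inanimate labels

corr = pairwise_correlations(observed, predicted)
off_diag = ~np.eye(n_images, dtype=bool)
same_cat = labels[:, None] == labels[None, :]

same_image = corr[~off_diag].mean()               # image i vs image i
different_image = corr[off_diag].mean()           # image i vs any other image
within_category = corr[off_diag & same_cat].mean()
between_category = corr[~same_cat].mean()
print(same_image, different_image, within_category, between_category)
```

In this toy data the categories carry no signal, so only the same-image effect appears; a category effect would show up as `within_category` exceeding `between_category`.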

In [8]:
Image(url='http://www.jneurosci.org/content/jneuro/35/27/10005/F3.large.jpg?width=800&height=600&carousel=1',height=800)

Properties of the voxel groups systematically change as a function of layer assignment. **A**, Significant linear partial correlations between the predicted responses of each pair of voxel groups. Line widths are proportional to mean partial correlation coefficients across subjects. **B**, Distribution of the receptive field centers for both subjects. [more voxels dedicated to foveal than peripheral vision]() **C**, Example reconstructions of the internal representations of the convolutional layers. Reconstructions are enlarged, and automatic tone, contrast, and color enhancement are applied for visualization purposes. **D**, Proportions of the internal representations of the convolutional layers that are assigned to low-level (blob, contrast, and edge), mid-level (contour, shape, and texture), and high-level (irregular pattern, object part, and entire object) feature classes. **E**, Receptive field complexity (K), invariance, and size of the voxel groups.

# Voxel groups exhibit coherent representational characteristics
## don't understand at all
1. pooled voxels that were assigned to the same DNN layer together and analyzed their properties (what do you mean by "pool")
2. the responses of successive voxel groups were more partially correlated than those of nonsuccessive groups
3. information flow mainly takes place between neighboring visual areas, providing quantitative evidence for the thesis that the visual ventral stream is *hierarchically* organized (Markov et al., 2014), with downstream areas processing increasingly complex features of the retinal input
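
Partial correlation measures the association between two groups after removing what the remaining groups explain. A standard way to compute it is from the inverse covariance (precision) matrix $P$: $\rho_{ij} = -P_{ij}/\sqrt{P_{ii}P_{jj}}$. The sketch below applies this to a toy chain of three signals, where each one depends only on its predecessor; the data are an assumption meant to mimic successive processing stages.

```python
import numpy as np

def partial_correlations(X):
    """Partial correlation matrix of the columns of X via the precision matrix."""
    precision = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

# Toy chain g1 -> g2 -> g3: each stage adds noise to the previous one.
rng = np.random.default_rng(0)
n = 5000
g1 = rng.normal(size=n)
g2 = g1 + rng.normal(size=n)
g3 = g2 + rng.normal(size=n)

pcorr = partial_correlations(np.column_stack([g1, g2, g3]))
print(np.round(pcorr, 2))
```

Successive stages (g1 and g2, g2 and g3) keep a clear partial correlation, while the non-successive pair (g1 and g3) drops to about zero once g2 is accounted for, which is the pattern the notes describe for voxel groups.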

In [19]:
Image(url='http://www.jneurosci.org/content/jneuro/35/27/10005/F4.large.jpg?width=800&height=600&carousel=1',height=500)

Layer assignments of the voxels systematically increase as a function of position on the occipital cortex. **A**, Layer assignments of the significant voxels across occipital cortex (p < 2e-6 for both subjects, Bonferroni corrected for number of voxels, Student's t test across cross-validated training images within subjects). **B**, Layer assignments of the significant voxels across V1, V2, V4, and LO (p < 5e-8 for both subjects, Bonferroni corrected for number of layers and voxels, Student's t test across cross-validated training images within subjects). **C**, Proportions of voxels in areas V1, V2, V4, and LO that are assigned to low-level (blob, contrast, and edge), mid-level (contour, shape, and texture), and high-level (irregular pattern, object part, and entire object) feature classes.

# Voxel groups reveal a gradient in the complexity of neural representations
1. an increase in layer assignment was observed when moving from posterior to anterior points on the cortical surface

In [10]:
Image(url='http://www.jneurosci.org/content/jneuro/35/27/10005/F5.large.jpg?width=800&height=600&carousel=1',height=800)

Voxels in different visual areas are differentially selective to feature maps in different layers. **A**, Selectivity of the significant voxels in the occipital cortex to three distinct feature maps of varying complexity (p < 2e-6 for both subjects, Bonferroni corrected for number of voxels, Student's t test across cross-validated training images within subjects). **B**, Biclusters of hyperaligned voxels and feature maps. Horizontal and vertical red lines delineate the boundaries of clusters of feature maps and voxels, respectively. The rows and columns are thresholded such that each row and column contain at least one element that survives the threshold of $r^2$ = 0.15. The numbers in parentheses denote the number of remaining feature maps and voxels after thresholding.

# Selectivity of voxels to individual feature maps reveals distributed representations
1. individual features accurately predicted multiple voxels and individual voxels were accurately predicted by multiple features (many-to-many mapping)
2. *for features of either low or high complexity this relationship tended to be spatially confined to either upstream or downstream visual areas, respectively*
3. biclustering of the prediction accuracy matrix revealed horizontal bands with fluctuating magnitude that point to features with similar information content, and vertical bands that point to clusters of voxels with congruent responses

In [11]:
Image(url='http://www.jneurosci.org/content/jneuro/35/27/10005/F6.large.jpg?width=800&height=600&carousel=1',height=800)

Our model performs similarly to the control models that are task optimized but outperforms those that are not task optimized across V1, V2, V4, and LO voxels of both subjects. **A**, Comparison between the prediction accuracies for our model ($r_0$) with those for the pretrained DNN ($r_P$), random DNN ($r_R$), and GWP ($r_{GWP}$) models. Red dots denote the individual voxels. Asterisks indicate the visual areas where the prediction accuracies are significantly different. **B**, Comparison between the layer assignments for our model ($DNN_0$) with those of the pretrained DNN ($DNN_P$) and random DNN ($DNN_R$) models. Red dots denote the individual voxels. Crosses indicate the mean layer assignments of the $DNN_0$ model.

# Discussion

1. semantic selectivity is organized as smooth gradients across cortex (Huth et al., 2012)
2. Ventral stream responses to scrambled versus nonscrambled images (Grill-Spector et al., 1998)
3. Downstream receptive fields become larger and more invariant (Smith et al., 2001; DiCarlo and Cox, 2007)
4. Some downstream neurons are tuned to relatively simple features and some upstream neurons are tuned to relatively complex features in primates (Desimone et al., 1984; Hegdé and Van Essen, 2007)
5. Individual features are represented in a distributed manner across a patch of cortex and multiple features are superimposed on the same cortical expanse (Fig 5,6) (Grill-Spector and Weiner, 2014)
6. complex, ecologically valid naturalistic stimuli (Felsen and Dan, 2005)
7. Highly constrained artificial stimuli (Rust and Movshon, 2005)
8. mapping of individual stimulus features confirm that 
        Low-level stimulus properties were mainly confined to early visual areas
        High-level stimulus properties were mostly represented in posterior inferior temporal areas
9. biclustering of feature-specific prediction accuracies revealed a more fine-grained functional specialization in downstream visual areas (Larsson and Heeger, 2006; Tanigawa et al., 2010)
10. the general applicability of DNN-based encoding models permits the investigation of neural representations in the other visual areas (what does this sentence mean?) (Agrawal et al., 2014)
11. in other brain regions involved in the representation of sensory information: dorsal stream (Goodale and Milner, 1992)
12. in multimodal association areas (Mesulam, 1998)
13. probing other brain regions:
        Top-down: changes in attention (Cukur et al., 2013)
        Task demand (Emadi and Esteky, 2014; McKee et al., 2014)
        Function of experience (Rainer et al., 2004; Cukur et al., 2013)
        Neurodegenerative disorders: semantic dementia (Patterson et al., 2007)
14. probing internally generated percepts that occur during:
        Imagery (Thirion et al., 2006)
        Memory retrieval (Harrison and Tong, 2009)
        Visual illusions (Kok and de Lange, 2014)
15. one way to improve **encoding performance** is to develop features that outperform DNNs when it comes to capturing neural representations of *low-, mid-, and high-* level stimulus features
16. **arguably**, *unsupervised learning of statistical structure* in the environment or the maximization of expected reward during *reinforcement learning* offer more biologically plausible explanations for the formation of receptive field properties:
        object categorization (Olshausen and Field, 1996; Schultz et al., 1997)
17. **another avenue for further research is the development of more sophisticated response models**
18. the current response models make use of a **linear mapping** from a **nonlinear feature representation** onto **peak BOLD amplitude**, HOWEVER, the mapping from stimulus features to the BOLD responses that result from changes in neuronal processing is likely more complex than a linear one (Logothetis and Wandell, 2004; Norris, 2006)
19. It is likely that **encoding performance** will further improve by using more sophisticated and/or biophysically realistic response models (Pedregosa et al., 2014; Aquino et al., 2014)

# Encoding models as hypotheses about brain function
1. the predictive success of DNN-based encoding models does **NOT** imply that they provide a mechanistic account of perceptual processing in their biological counterparts
2. the use of a strictly feedforward architecture **cannot** easily be reconciled with the feedback processing inherent to neural information processing (Hochstein and Ahissar, 2002)
3. the utility of the encoding approach lies in testing *whether a particular computational model outperforms alternative computational models* **when it comes to explaining observed data** (Bayesian approach)(Naselaris et al., 2011)
4. DNN-based encoding model can be considered as implementing a hypothesis about the emergence of receptive field properties across the ventral stream (Fukushima, 1980)
5. DNNs rely on the notion of object categorization to explain the emergence of a hierarchy of increasingly complex representations (Serre et al., 2007)
6. the proposition that object categorization drives the formation of receptive field properties in the ventral stream is supported by the observation that performance-optimized hierarchical models can reliably predict single-neuron responses in area IT of the macaque monkey (Yamins et al., 2014)
7. it is also substantiated by recent findings that DNNs better predict voxel responses in the human system and the representational geometry of IT responses in both macaques and humans, compared with other computational models (Cadieu et al., 2014; Khaligh-Razavi and Kriegeskorte, 2014) -- voxels in downstream areas of the ventral stream code for increasingly complex stimulus features that drive object categorization

# The goal of future computational models
1. incorporating different assumptions or invoking other objective functions
2. reflecting alternative theories of brain functions
3. already at the earliest levels of visual processing, there remains ample room for debate as to what form an optimal computational model should take
4. Notwithstanding the debate that remains, this study subscribes to a model-based approach to cognitive neuroscience in which theories about brain function are tested against each other by validating *generative models* on neural and/or behavioral data