![](https://1.cms.s81c.com/sites/default/files/2021-04-15/ICLH_Diagram_Batch_01_03-DeepNeuralNetwork-WHITEBG.png)

# MODULE 2: Deep Neural Nets as Models of Human Cognition
## Part I: Introduction

This notebook is part of the [CHI'22 Online Course on Cognitive Modeling](https://sites.google.com/view/modeling-chi22/home).

*Authors: Antti Oulasvirta, Andrew Howes, Jussi Jokinen*

### Overview of Module 2
The present wave of AI is driven by the remarkable capacity of deep neural nets (DNNs) to learn patterns in data, which has spawned new applications in speech, language, and vision.
But what do DNNs offer for modeling _human_ cognition in HCI?

This unit answers four questions:
1. What is a DNN? 
2. Theoretically, what does it presume about human cognition?
3. How to build DNN models for HCI?
4. What are its limitations and scope in HCI?

As an application case, we look at how to model visual saliency in HCI. 

### Scope

This module is **not** about the machine learning theory of DNNs. The module is about applications of DNNs as cognitive models in HCI. Technical concepts are dealt with cursorily and from a cognition point-of-view. There are excellent tutorials available for learning the technical aspects of deep nets. 

### Notebooks in this module

1. Demo: Saliency done wrong! 
2. Part I: Introduction 
3. Part II: Application (UMSI) 

---

## 1. Deep neural nets: Basics



### 1.1. Deep neural nets are inspired by the human brain
Despite the hype surrounding deep learning in machine learning and AI, it is worth remembering that artificial neural networks (ANNs), as a model of intelligence, originated in cognitive science.

The ideas that underpin modern deep learning were presented in the **parallel distributed processing** (PDP) model of Rumelhart, Hinton, and McClelland in 1986 [(paper)](http://web.stanford.edu/~jlmcc/papers/PDP/Volume%201/Chap2_PDP86.pdf). PDP was a model of the ''microstructure of cognition'' -- as opposed to ''higher-level cognition'' such as verbal reasoning. The PDP model established key concepts of ANNs that remain the basis of deep learning.

Present-day AI literature associates deep nets with the functioning of the cerebral cortex [(Paper)](https://www.pnas.org/doi/10.1073/pnas.1907373117). However, while deep nets were inspired by the human brain, they differ from it in important respects (see Discussion).



### 1.2. Technical concepts

Look at this [video](https://thumbs.gfycat.com/BaggyFearlessCrocodile-mobile.mp4) first.

DNNs are best understood by looking at a network both bottom-up and top-down, i.e., how a single unit (''neuron'') works versus how a whole network works:

**Single processing unit** (''neuron'') takes as input the activation signals of other units and produces an output activation determined by its activation function. This output is sent to the units that this unit feeds forward to.

**Activation function** determines how strongly a unit is activated. It is applied to the weighted sum of the unit's input signals plus a bias. Activation functions can be linear or non-linear (e.g., tanh, sigmoid). Some are bounded (e.g., tanh), others are not (e.g., ReLU). Some are non-parametric (e.g., ReLU), while others contain parameters (e.g., ELU, Leaky ReLU). Example: Sigmoid: $f(x) = 1/(1 + e^{-x})$
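
To make the single unit and its activation function concrete, here is a minimal sketch in plain Python. The input values, weights, and bias are hypothetical and chosen only for illustration; the function names are ours, not part of any library.

```python
import math

def sigmoid(x):
    """Bounded and non-linear: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Non-parametric: passes positive input unchanged, blocks negative input."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Parametric: alpha sets the slope used for negative input."""
    return x if x > 0 else alpha * x

def unit_output(inputs, weights, bias, activation):
    """One processing unit: weighted sum of inputs plus bias, then activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical signals arriving from three upstream units.
inputs = [1.0, 0.5, -1.0]
weights = [0.4, -0.2, 0.1]
bias = 0.1

# The same weighted sum yields different outputs under different activations.
for fn in (sigmoid, math.tanh, relu, leaky_relu):
    print(fn.__name__, round(unit_output(inputs, weights, bias, fn), 4))
```

Swapping the activation changes only the last step of the unit; the weighted sum feeding into it stays the same.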

**Network architecture:**
The network architecture defines how single processing units are connected to others:
- _Input layer_ consists of those units that map to external input to the network. 
- _Output layer_ consists of those units that represent the output of the network as a whole. 
- _Hidden layers_ consist of units that are neither input nor output nodes but are connected to them. 

The totality of these connections, which can be represented as a directed graph, is called the network architecture. Neural networks with many hidden layers are called ''deep''. 
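
As a rough illustration of how input, hidden, and output layers fit together, the sketch below builds a small fully connected feedforward network and runs one forward pass. The layer sizes, the random weight range, and the input vector are assumptions made for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """One fully connected layer: a row of weights and a bias per output unit."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Feed activations forward, layer by layer, from input to output."""
    activation = x
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation

rng = random.Random(0)
# Hypothetical architecture: 3 input units, two hidden layers of 4 units, 1 output unit.
layer_sizes = [3, 4, 4, 1]
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

output = forward([0.2, -0.5, 1.0], layers)
print(output)
```

Adding entries to `layer_sizes` makes the network deeper without changing the forward-pass code.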

**Network weights** define the strength of each unit-to-unit connection in a network. A weight determines how strong an effect one unit has on another that it is connected to. Weights are updated when training the network. 

**Learning rule** determines how weights are updated during training. 
 - In _backpropagation_, weights are updated to minimize error in the mapping of the input layer to the output layer. First, the _output error_ (the distance between the predicted value and the ground truth value) is computed. Backpropagation then determines, via the chain rule, how much the weights of units in successively lower layers contribute to this error. The resulting derivatives are used by _gradient descent_ to update the weights. Intuitively, each weight is nudged in the direction that decreases the error most steeply.
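
The following is a minimal sketch of backpropagation with gradient descent, written in plain Python so that every derivative is visible. It assumes a tiny network with one hidden layer, sigmoid activations, a squared-error loss, and the XOR problem as toy data; the hyperparameters are arbitrary choices for illustration, not recommendations.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_xor(n_hidden=3, lr=0.5, epochs=2000, seed=1):
    """Train a 2-n_hidden-1 network on XOR; return the mean loss per epoch."""
    rng = random.Random(seed)
    data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
    w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    n = len(data)
    losses = []
    for _ in range(epochs):
        g_w1 = [[0.0, 0.0] for _ in range(n_hidden)]
        g_b1 = [0.0] * n_hidden
        g_w2 = [0.0] * n_hidden
        g_b2 = 0.0
        loss = 0.0
        for x, target in data:
            # Forward pass: input -> hidden -> output.
            h = [sigmoid(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j])
                 for j in range(n_hidden)]
            y = sigmoid(sum(w2[j] * h[j] for j in range(n_hidden)) + b2)
            loss += 0.5 * (y - target) ** 2
            # Backward pass: output error first, then propagate to the hidden layer.
            delta_out = (y - target) * y * (1.0 - y)
            g_b2 += delta_out
            for j in range(n_hidden):
                delta_h = delta_out * w2[j] * h[j] * (1.0 - h[j])
                g_w2[j] += delta_out * h[j]
                g_b1[j] += delta_h
                for i in range(2):
                    g_w1[j][i] += delta_h * x[i]
        losses.append(loss / n)
        # Gradient descent: step each weight against its averaged gradient.
        b2 -= lr * g_b2 / n
        for j in range(n_hidden):
            w2[j] -= lr * g_w2[j] / n
            b1[j] -= lr * g_b1[j] / n
            for i in range(2):
                w1[j][i] -= lr * g_w1[j][i] / n
    return losses

losses = train_xor()
print("first epoch loss:", losses[0])
print("last epoch loss:", losses[-1])
```

In practice, deep learning libraries compute these derivatives automatically; writing them out by hand is only useful for seeing what the learning rule does.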

**Loss function** quantifies how far off a network's prediction is from ground truth data (the loss). When predicting a scalar value (regression), popular loss functions include RSE, MAE, Huber loss, KL divergence, and CC (correlation coefficient). When predicting a category, popular loss functions are cross entropy and hinge loss.
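
A few of the loss functions named above can be sketched in a handful of lines each. These are textbook definitions written in plain Python for illustration; the function names and the example numbers are ours, and real projects would normally use the versions shipped with a deep learning library.

```python
import math

def mae(pred, truth):
    """Mean absolute error over paired lists of numbers."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)

def huber(pred, truth, delta=1.0):
    """Quadratic for small errors, linear for large ones (less outlier-sensitive)."""
    total = 0.0
    for p, t in zip(pred, truth):
        e = abs(p - t)
        total += 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
    return total / len(pred)

def kl_divergence(p, q):
    """KL divergence between two discrete distributions (q must be > 0 where p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(probs, true_index):
    """Negative log-probability that the model assigns to the correct category."""
    return -math.log(probs[true_index])

def hinge(score, label):
    """Hinge loss for a binary label in {-1, +1} and a real-valued score."""
    return max(0.0, 1.0 - label * score)

# Hypothetical predictions and ground truth.
print(mae([1.0, 2.0], [1.0, 4.0]))
print(huber([1.0, 2.0], [1.0, 4.0]))
print(kl_divergence([0.7, 0.3], [0.5, 0.5]))
print(cross_entropy([0.1, 0.8, 0.1], true_index=1))
print(hinge(0.3, label=1))
```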

How do these work together? Let's try! TensorFlow offers a brilliant [playground](https://playground.tensorflow.org/) to learn DNN design interactively.

### 1.3. Deep neural nets are universal function approximators

Deep nets learn *latent representations* that map inputs and outputs in a task-specific way. Contrary to many previous ML methods, *features* that define a representation do not need to be hand-coded -- a deep net learns them on its own. This is important both theoretically and practically: it makes deployment easier, because feature engineering is not needed. 

Deep nets achieve remarkably good generalization performance despite being heavily overparametrized (see Terrence Sejnowski's famous paper on the [unreasonable effectiveness of neural nets](https://www.pnas.org/doi/10.1073/pnas.1907373117)).

The reason is that
> ''deep neural networks have an inherent inductive bias that makes them inclined to learn generalizable hypotheses and avoid memorization. In this respect, we propose results that suggest that the inductive bias stems from neural networks being lazy: they tend to learn simpler rules first.'' [(ref)](https://dspace.mit.edu/handle/1721.1/121680)

The corresponding theoretical question studied in machine learning (ML) concerns the use of deep nets as *universal function approximators* [(ref)](https://cognitivemedium.com/magic_paper/assets/Hornik.pdf).

Deep nets struggle with out-of-distribution (OOD) samples. OODs are samples with features that are not covered well in the training data. They are often associated with poor predictive performance. The lack of robustness is a topic of intense research in the ML community.




### 1.4. Neural architectures introduce inductive biases

A bit more on neural architecture design, because it has a central role in applications in cognitive sciences. 

From an ML perspective, a network architecture is a type of **inductive bias**: ''assumptions that the learner uses to predict outputs of given inputs that it has not encountered''. In cognitive modeling, we exploit inductive biases to model central aspects of human cognition. 

Popular deep learning architectures and their ML applications include:
- *Convolutional neural networks* (CNNs); application: image recognition. We will return to this type at the end of this module.
- *Recurrent neural networks* (RNNs); application: time series prediction.
- *Long short-term memory networks* (LSTMs); application: flow prediction
- *Autoencoders*; application: physics modeling
- *Generative Adversarial Networks* (GANs); application: image generation
- *Transformers*; application: language models
... And many others. 

For a handy overview of architectures, take a look at this [cheatsheet.](https://drive.google.com/uc?export=view&id=1geYF-lILiUM_iPz3LNwUBG-CA_dpGE_q)

---

## 2. Deep learning and HCI



### 2.1. DNN as a Theory of Human Cognition

DNNs are demonstrably powerful ''universal function approximators''. They are also inspired by the brain (PDP). 
However, are they a plausible model of _human_ cognition according to evidence from cognitive and neurosciences? 

This is a heavily studied question in cognitive and neurosciences ([see this paper](https://arxiv.org/pdf/2001.07092.pdf)). The answer is ''maybe''. 

#### Positive evidence

On the one hand, certain core assumptions of deep nets enjoy broad acceptance, such as massively parallel processing and optimization-like learning -- although it is not clear what is being optimized by the human brain and how it achieves that. 

Further, there is nascent neuroscientific and behavioral evidence for specific neural architectures, such as CNNs, RNNs, and transformers. For example, some aspects of the early visual system are captured well by convolutional network models. The human brain has been argued to exhibit information bottlenecks similar to those that autoencoders utilize. There is also emerging evidence for backpropagation. Large language models exhibit human-like tendencies in language generation.  

#### Negative evidence

On the other hand, present-day DNNs make some highly simplistic assumptions about the brain. For example, the model of a neuron ignores its capacity for short- and long-term potentiation. DNNs also ignore the non-trivial role that dendrites have been discovered to have in learning. It is not clear how symbolic computations can be implemented with DNNs, which rely on gradient descent. DNNs seem to be much worse than humans in symbolic reasoning (e.g., algebra). 

Further, there are examples of striking dissimilarities in how deep nets and humans learn. One of the most compelling demonstrations concerns the use of adversarial examples: trivial changes -- like changing a pixel in an image -- can lead to drastic changes in predictions. 
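
The intuition behind adversarial examples can be sketched with a toy model. The code below is an illustration under strong assumptions: a linear classifier with hypothetical weights stands in for a trained deep net, and the ''image'' is a random vector. Rather than changing a single pixel, it shifts every pixel by a small amount against the sign of the gradient, which is the idea behind the fast gradient sign method.

```python
import random

def score(w, b, x):
    """Linear decision score: positive means class A, negative means class B."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sign(v):
    return (v > 0) - (v < 0)

rng = random.Random(0)
n_pixels = 1000
# Hypothetical "trained" weights and a hypothetical input image with pixels in [0, 1].
w = [rng.choice([-1.0, 1.0]) * 0.05 for _ in range(n_pixels)]
x = [rng.random() for _ in range(n_pixels)]
# Choose the bias so that the clean image is classified as A with a score of about +1.
b = 1.0 - score(w, 0.0, x)

# For a linear model the gradient of the score with respect to the input is w,
# so stepping each pixel by eps against sign(w) lowers the score as fast as possible.
eps = 0.05
x_adv = [xi - eps * sign(wi) for xi, wi in zip(x, w)]

print("clean score:      ", score(w, b, x))
print("adversarial score:", score(w, b, x_adv))
print("largest pixel change:", max(abs(a - c) for a, c in zip(x_adv, x)))
```

No pixel moves by more than `eps`, yet the many small changes add up across dimensions and flip the predicted class. For simplicity the sketch does not clip pixel values back into the valid range.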

Moreover, some core assumptions of DNNs remain hard to test empirically, such as activation functions.


### 2.2. Application examples in HCI

While DNNs are a popular method for intelligent systems, there is less work on studying DNNs as a cognitive model in HCI. One reason is that interactive tasks tend to have complex internal structure, multiple modalities, and they include different cognitive requirements from perception to memory to attention. 

With the exception of deep RL (Module 3), deep nets are mostly applied to simple **stimulus--response** tasks in HCI:
* A stimulus is presented to a user
* We want to predict a user's response
* The response can be an action or a series of actions.

#### Example applications
Example applications in HCI include:

* **Saliency:** In visual saliency modeling, we aim to predict the distribution of a user's gaze when seeing a novel user interface for the first time. Example: UMSI is a CNN-based model by Camilo Fosco and colleagues that predicts saliency over a number of different image types, from UIs to infographics and posters. We discuss UMSI in Part II of this Module. [Paper](https://predimportance.mit.edu)

* **Motor control and motion prediction:** An example using GANs: [Paper](https://openaccess.thecvf.com/content_ICCV_2019/papers/Hernandez_Human_Motion_Prediction_via_Spatio-Temporal_Inpainting_ICCV_2019_paper.pdf)

* **Task performance:** In performance modeling, we want to predict how long it will take for a user to complete a task or how often errors occur. Yang Li and colleagues trained an LSTM-based DNN for predicting menu selection time. [Paper](https://dl.acm.org/doi/pdf/10.1145/3173574.3173603)

* **Search:** To model interactive search behavior on a search page, Xi Niu and Xiangyu Fan used RNNs that capture an essential aspect of 'information scent' as predicted by the information foraging theory. [Paper](https://dl.acm.org/doi/abs/10.1145/3341981.3344231).

* **Intent recognition:** Deep learning methods achieve high accuracy in recognizing social intent of people shown in images. [Paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Jia_Intentonomy_A_Dataset_and_Study_Towards_Human_Intent_Understanding_CVPR_2021_paper.pdf)

* **Clickthrough and tappability:** In tappability and clickthrough prediction, the goal is to predict what a user will click or tap next. Amanda Swearngin and Yang Li used a CNN-driven architecture to predict which areas of a mobile UI a user will tap. [Paper](https://dl.acm.org/doi/pdf/10.1145/3290605.3300305)

In Module 3, we use deep nets to learn value estimates in RL-based models of humans.


### 2.3. What do we model when we model HCI with DNNs?

Deep learning has quickly emerged as a top contender in achieving state-of-the-art results in several very hard computational problems, from computer vision to natural language processing. It is therefore tempting to consider it as a model of human cognition, too. 

This view is naive, however. 

The purpose of cognitive modeling is not prediction but prediction _by reference to psychologically plausible constructs_. In HCI, we further want our models to be *actionable* when deployed for decision-making, design, or in an intelligent system.

#### Levels of explanation

When DNN-based cognitive models are applied to HCI problems, several levels of explanation are possible:

- **Computational:** The aim is to provide a plausible account of the mental computations that produce observable behavior. This may require collecting additional data to test hypotheses concerning cognition, for example by measuring pupil dilation or collecting think aloud protocols.

- **Representational:** The aim is to model mental representations that mediate perception and action in users. A representation is a mental process that permits computations on higher-level features than those determined by input. Can DNNs give rise to human-like representation learning? 

- **Neural:** Here the aim is to match existing neuroscientific evidence, such as relevant spatio-temporal patterns of brain activation as measured by brain imaging (e.g., fMRI, EEG). This is very rarely done in HCI.

- **Behavioral:** The aim is to replicate empirical data on interactive human behavior in a computerized task. Given a stimulus, the goal is to predict human response. For example, given an image, the goal is to predict human response time, errors, or distribution of visual attention.

- **Practical:** The aim is to drive some intelligent algorithm or inform decision-making. In cases like this, whether a trained model captures cognition in a meaningful way is of secondary concern. 

When applying DNNs, it is important to first state the purpose: What is being explained here? 

### 2.4. Modeling workflow 

The basic modeling workflow for HCI applications includes the following steps:

1. Data collection: Collection of a rich and representative sample of data. We typically collect data on human responses to controlled stimuli (e.g., images). 
2. Model definition: See above for basic parameters.
3. Pre-training: A model is sometimes trained with a larger, domain-general dataset. (E.g., natural scenes.)
4. Fine-tuning: The pretrained model is tuned to a particular domain, typically using a smaller, more specialized dataset. (E.g., images of cities with billboards.)
5. Cross-validation: Part of the data is held out from training and used to test predictive capability. 
6. Ablation studies: A modeling assumption is removed or violated to gauge how important it is for total model performance.
7. Sensitivity analysis: Some aspect of the dataset, as used in training or testing, is systematically perturbed. 
8. Computational analysis: Analysis of computational time and scalability.
9. Explainable AI: Visualizing how the model works, for example which regions are important for model predictions (e.g., SHAP).
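
The cross-validation step can be sketched as a k-fold split over sample indices. This is a generic illustration in plain Python, assuming that samples are independent; the function name and the sample count are hypothetical, and the model training call is left as a placeholder comment.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical dataset of 10 stimulus-response pairs, split into 5 folds.
splits = list(k_fold_indices(n_samples=10, k=5))
for train, test in splits:
    # A model would be fitted on `train` and evaluated on `test` here.
    print("train:", sorted(train), "test:", sorted(test))
```

Each sample is held out exactly once, so the averaged test score uses all of the data without ever testing on a sample the model was trained on.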


---

![](https://miro.medium.com/max/1400/1*vkQ0hXDaQv57sALXAJquxA.jpeg)

## 3. Convolutional neural networks (CNNs)

Convolutional networks are one of the most successful neural architectures with applications in HCI. They build on the famous studies by Hubel and Wiesel on receptive fields in the visual system: a neuron does not respond to all other neurons but responds *preferentially* to a field of neighboring neurons. While the basic ideas of CNNs were already known in the 1980s, the technique took off after 2012, when AlexNet showed progress in the ImageNet challenge.


### 3.1. Key assumptions

* Position encoding: Positional information in input is retained;
* Convolution: Each processing unit responds to a spatially defined 'receptive field';
* Multiple scales: Multiple spatial scales act on each other. Often modeled in a pyramid-like manner with a ramping level of resolution.

![Convolution animated](https://miro.medium.com/max/1052/1*GcI7G-JLAQiEoCON7xFbhg.gif)

These basic concepts are nicely illustrated in this [tutorial](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53). 
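
The receptive-field idea can also be sketched in a few lines of plain Python. The example below assumes a hypothetical 6x6 grayscale image with a vertical edge and a hand-written 3x3 edge-detecting kernel; in a trained CNN the kernel values are learned rather than hand-coded. Note that, as in most deep learning libraries, the ''convolution'' is implemented as a sliding dot product without flipping the kernel.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(fmap, size=2):
    """Keep the strongest response in each non-overlapping size x size block."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

# Hypothetical image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Hand-written kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d(image, kernel)   # each unit sees only a 3x3 receptive field
pooled = max_pool(feature_map)        # coarser spatial scale

for row in feature_map:
    print(row)
print(pooled)
```

Each output unit depends only on a 3x3 neighborhood of the input, and its position in the feature map mirrors the position of that neighborhood in the image. Pooling then produces a coarser map, which is one simple way of moving between spatial scales.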



### 3.2. CNN as a model of human visual system

This [paper](https://direct.mit.edu/jocn/article-abstract/33/10/2017/97402) reviews evidence for CNNs as a model of object recognition in the human visual system. 

Key takeaways: 

1. Highest-level architectural assumptions are compatible with present understanding of the visual system:

> ''Each stacked bundle of convolution-nonlinearity-pooling can then be thought of as an approximation to a single visual area---usually the ones along the ventral stream such as V1, V2, V4 and IT--- each with its own retinotopy and feature maps. This stacking creates receptive fields for individual neurons that increase in size deeper in the network and the features of the image that they respond to become more complex. As has been mentioned, when trained to, these architectures can take in an image and output a category label in agreement with human judgement.'' (page 4)

This [image](https://drive.google.com/uc?export?view&id=1E_Imsrvvecr8ggPu3XcoUJPu74pjKeoA/view) from the paper illustrates current areas of research concerning CNNs and the human brain.

2. Similarities but also large differences exist in how real vs. artificial neural networks respond:

> ''Similarities have been found between units in the network and individual neurons in terms of response sparseness and size tuning, but differences have been found for object selectivity and orientation tuning.'' (page 5)

3. CNNs are best thought of as models of early processing of rapidly presented stimuli:

> ''Something to keep in mind when studying CNN behavior is that standard feedforward CNN architectures are believed to represent the very initial stages of visual processing, before various kinds of recurrent processing can take place. Therefore, when comparing the behavior of CNNs to animal behavior, fast stimulus presentation and backward masking are advised, as these are known to prevent many stages of recurrent processing'' (page 6)

The last point is important. Part II looks at an application in visual saliency modeling.