# 1. Introduction to connectionist models in cognitive psychology: Basic structures, processes, and algorithms

* Psygrammer (싸이그래머) / QGM: Part 3 - Connectionist Models [1]
* 김무성

# Contents
* INTRODUCTION
* THE NEURAL BASIS OF CONNECTIONIST MODELS
* COGNITIVE AND NEURAL INTERPRETATIONS OF CONNECTIONIST NETWORKS
* BASIC STRUCTURES AND PROCESSES OF CONNECTIONIST MODELS
* LEARNING RULES
* FURTHER EXTENSIONS OF THE DELTA RULE—MULTI-LAYER NETWORKS AND NONLINEAR MAPPINGS
* FEEDBACK AND RECURRENT ARCHITECTURES

# INTRODUCTION

#### References
* [2] Introduction to Cognitive Science - http://users.metu.edu.tr/bozsahin/intro-cogsci/w1-history.ppt
* [3] Connectionist Modeling  - http://people.umass.edu/alc/course_pages/fall_2004/modeling_behavior/lectures/connectionism.ppt 
* [4] Computationalism: The Very Idea - http://www.cs.bilkent.edu.tr/~david/papers/Computationalism.ppt
* [5] Connectionist Approaches - http://www.slideshare.net/cheapiseth1/chapter-6-connectionist-approaches
* [6] Connectionism vs. computationalism debate (wikipedia) - https://en.wikipedia.org/wiki/Connectionism#Connectionism_vs._computationalism_debate

This book aims to present an overview of the current state of connectionist modelling in cognitive psychology, covering a broad range of areas.

* connectionist (or “neural network”) models, PDP (Parallel Distributed Processing) models
* cognitive psychology
* Connectionism vs. computationalism

# THE NEURAL BASIS OF CONNECTIONIST MODELS
* The neuron
* The action potential
* Neuronal communication and synaptic integration

## The neuron

<img src="figures/cap1.1.png" width=600 />

## The action potential

<img src="http://faculty.pasadena.edu/dkwon/chap%208_files/images/image33.png" width=600 />

## Neuronal communication and synaptic integration

<img src="http://qbi.uq.edu.au/filething/get/38344/Neuron-synapse_brain-physiology_QBI.png" width=600 />

<img src="http://ib-biology2010-12.wikispaces.com/file/view/Synaptic_Integration.jpg/307656242/649x566/Synaptic_Integration.jpg" />

# COGNITIVE AND NEURAL INTERPRETATIONS OF CONNECTIONIST NETWORKS

* Our knowledge of brain function is still at an early stage, and we should anticipate that there are many significant functional principles still to be discovered. It could still turn out that some currently cherished beliefs regarding the psychological significance of what we know about the brain (e.g. the role of synaptic change in memory formation) are mistaken.
* <font color="red">If such ideas proved wrong, what would be the consequences for the connectionist program?</font>

* Some might argue that in fact it would not matter, as connectionist cognitive models are first and foremost cognitive; i.e. they are to be judged only on how well they explain and predict behavioural data from cognitive psychology experiments.
* From this point of view, <font color="red">the basic elements of connectionist models are to be interpreted only in abstract functional terms</font>.
* Connections and weights in a network represent patterns of functional interactivity between cognitive units, such that units that are cognitively related can activate each other, while units that are cognitively incompatible inhibit each other; it does not matter that the brain contains (modifiable) excitatory and inhibitory synapses—maybe the cognitively important task (generation of internally consistent states) is realized in some other way.

* Nevertheless, I believe that most connectionist cognitive modellers, however rarely a new neuroscientific finding actually impinges on their work, would reject the logical consequence of this view, in which the coarse analogy between artificial and real neural networks is entirely accidental, and of no practical or theoretical importance. 
* <font color="red">Rather, the general belief is that a neurally-grounded cognitive psychology is both desirable and achievable.</font>

* Connectionism is, at least in practice, a genuinely constraining framework, and the constraints lead to <font color="red">fundamental theoretical questions</font> that also make sense to neuroscientists; e.g. 
    - how can cognition arise from interactions between basic processors capable of only simple analogue computations? 
    - How are memories formed in a system limited to changing patterns of connectivity between these simple elements? 
    - How does binding of distributed information take place to produce unitary percepts? 
    - How does a parallel, distributed system (without a serial, central processor) control its own behaviour, in space (attention) and time (serial order)?

# BASIC STRUCTURES AND PROCESSES OF CONNECTIONIST MODELS
* Units and activation levels
* Vector representation of activation patterns
* Representation of cognitive elements by units

#### keyword
* node (= unit = neuron)
* activation level
* weight
* layer
* patterns

<img src="figures/cap1.2.png" width=600 />

## Units and activation levels

<img src="http://i.stack.imgur.com/KUvpQ.png" width=600 />

<img src="http://homepages.gold.ac.uk/nikolaev/perceptr.gif" width=600 />

## Vector representation of activation patterns

<img src="figures/cap1.3.png" width=600 />

## Representation of cognitive elements by units

As mentioned several times above, the units in a network are used to represent the cognitive elements that are required in the domain being modelled.

#### local representation

* The simplest way to do this is to <font color="red">assign one unit to each identifiable element</font>. 
* For instance, in a model incorporating a mental lexicon (store of words), 
    - <font color="red">each known word</font> would be represented by one unit.
* In a model of face processing, 
    - each unit would represent a <font color="red">known face</font>.
* This is known as a local representation, as to find out to what extent a given mental entity is <font color="red">currently active we need only look “locally” at a single unit</font>.
* orthogonality
    - From a more formal point of view, each element is represented orthogonally to all the others.
    - The activation pattern (vector) {1 0 0 0} would represent John, {0 1 0 0} would represent Paul, and so on.

#### distributed representation

* <font color="red">Local representation</font> requires as many units as there are things to be represented, and has been criticized for (amongst other things) being <font color="red">inefficient</font>. 
* An alternative is to use a <font color="blue">distributed representation</font>.
    - In this case, cognitive elements are represented not by the activation of individual units, but by <font color="blue">the pattern of activation over a set of units</font>. 
    - In the above example, John might be {1 0 1 1}, Paul {1 1 0 0}, and so on.
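
The difference between the two coding schemes can be checked with a dot product. The sketch below uses the John and Paul vectors from the text; the helper name `dot` is just an illustrative choice. A dot product of zero means the two patterns are orthogonal, while a non-zero value means they share active units.

```python
def dot(u, v):
    """Inner product of two equal-length activation vectors."""
    return sum(a * b for a, b in zip(u, v))

# Local codes from the text: one dedicated unit per person.
local = {"John": [1, 0, 0, 0], "Paul": [0, 1, 0, 0]}

# Distributed codes from the text: each person is a pattern over all units.
distributed = {"John": [1, 0, 1, 1], "Paul": [1, 1, 0, 0]}

local_overlap = dot(local["John"], local["Paul"])
distributed_overlap = dot(distributed["John"], distributed["Paul"])

print("local overlap:", local_overlap)              # no shared active unit
print("distributed overlap:", distributed_overlap)  # the first unit is shared
```

With the local codes the overlap is 0, so changing one representation cannot affect the other. The distributed codes overlap on the first unit, which is how such codes can express similarity between elements.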

#### Local vs. distributed?

* The relative advantages of local and distributed representations have been the cause of debate, with <font color="red">some influential authors considering the use of distributed representations</font> to be a defining feature of the connectionist approach (see Page, 2000, and associated commentaries). 
    - In practice, many models <font color="blue">use both forms</font> of representation, and in many cases the <font color="blue">use of distributed representations is restricted to the “hidden units”, in networks trained using back-propagation</font> or a related method (discussed below in the section on learning). 
* From a psychology perspective, 
    - it is best to treat representations as embodying substantive claims about cognitive representation. 
    - <font color="red">Distributed representations can capture the similarity structure</font> of a domain.
    - <font color="blue">The central feature of local representation</font> is not really “locality” per se (i.e. all the activations in one unit), but the more <font color="blue">abstract property of orthogonality, or independence</font>, mentioned earlier.
        - That is, considered as activation vectors, any two local representations (over the same vector space) are orthogonal to (uncorrelated with) each other, so that anything that happens to one representation does not affect any other.
        - At the word level, cat and sat are distinct and unrelated, and local (i.e. orthogonal) lexical representations capture this perfectly.
    - <font color="red">If the brain were to use only distributed representations based on similarity structure, then in essence it would have failed to capture a pervasive and behaviourally significant feature of the world</font>.

## Connections and weights

<img src="figures/cap1.4.png" width=600 />

## The matrix representation of sets of weights

<img src="figures/cap1.5.png" width=600 />

## Spread of activation and the net (summed) input

<img src="figures/cap1.7.png" width=600 />

#### net input
As each individual input is the product of an activation and a weight [equation (1)], the net input is therefore a sum of products. The net input from a set of n units (1,2, . . ., n) to a unit j, netj, is:

<img src="figures/cap1.6.png" />

<img src="figures/cap1.8.png" width=600 />
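
As a minimal sketch of the "sum of products" described above, the net input to a single unit j can be computed as follows. The function name and the example numbers are illustrative assumptions, not values taken from the figures.

```python
def net_input(activations, weights_to_j):
    """Net input to unit j: the sum over sending units i of a_i * w_ij."""
    return sum(a * w for a, w in zip(activations, weights_to_j))

# Hypothetical example: three sending units and their weights onto unit j.
activations = [1.0, 0.0, 0.5]
weights_to_j = [0.2, -0.4, 0.6]

net_j = net_input(activations, weights_to_j)
print(net_j)  # 1.0*0.2 + 0.0*(-0.4) + 0.5*0.6
```

Note that the second sending unit contributes nothing: an inactive unit sends no input, whatever its weight.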

## Spread of activation using vector and matrix notations

<img src="figures/cap1.9.png"/>

Equation (3) gives the net input to just one output unit. The same process has to be carried out for all output units (using the appropriate weight vector, or row, from the matrix).

Happily, what has just been described corresponds to the multiplication of a vector (input pattern) by a matrix (weights). Hence, using vector–matrix notation, the whole process can be written:

<img src="figures/cap1.10.png" />
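
The vector-matrix form can be sketched in plain Python. This assumes, as the text suggests, that each row of the weight matrix holds the weights into one output unit; the function name `propagate` and the numbers are made up for illustration.

```python
def propagate(weights, inputs):
    """Net input to every output unit: one dot product per weight-matrix row."""
    return [sum(w * a for w, a in zip(row, inputs)) for row in weights]

inputs = [1.0, 0.0, 1.0]
weights = [
    [0.5, -1.0, 0.25],  # weights into output unit 0
    [-0.5, 2.0, 1.0],   # weights into output unit 1
]

net = propagate(weights, inputs)
print(net)
```

Each element of `net` is exactly the single-unit sum of products, repeated once per row of the matrix.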

## Activation functions

The rule for converting the input to a unit into an activation level is generally known as the activation function.

<img src="figures/cap1.11.png" width=600 />

Here we list a few common examples also found in articles in this book. Each example is shown graphically in Fig. 1.7. In each case the graphs plot the net input to a unit (x axis) against its activation level (y axis).

<img src="figures/cap1.12.png" />

<img src="figures/cap1.13.png"  />
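
For readers who prefer code to graphs, here is a sketch of three activation functions that are commonly used in connectionist models. It is an assumption that these match the ones plotted in the figure; the threshold value `theta` and the function names are illustrative.

```python
import math

def linear(net):
    """Activation equals the net input."""
    return net

def threshold(net, theta=0.0):
    """Binary unit: fully on once the net input exceeds theta, otherwise off."""
    return 1.0 if net > theta else 0.0

def sigmoid(net):
    """Logistic function: a smooth, S-shaped squashing of net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

for net in (-2.0, 0.0, 2.0):
    print(net, linear(net), threshold(net), round(sigmoid(net), 3))
```

All three are non-decreasing in the net input, which is the property that the radial basis function discussed next does not share.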

## Radial basis functions (RBFs)

The activation functions discussed above share the property that the more positive (excitatory) the net input to a unit, the higher its expected activation value (the thresholded unit of course changes instantly once the threshold is reached, but the more the net input increases, the more likely it is that the threshold will have been crossed).

The last class of rule we discuss, the radial basis function, does not have this property. Instead, we can think of a unit as having an optimal pattern of input, and as the actual input moves away from the optimal pattern, the activation level of the unit decreases (Fig. 1.7). 

<img src="figures/cap1.11.png" width=600 />
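
One common way to realise this idea is a Gaussian function of the distance between the input pattern and the unit's optimal pattern. The sketch below assumes that Gaussian form; the text itself does not commit to a specific formula, and the parameter names `centre` and `width` are illustrative.

```python
import math

def rbf_activation(inputs, centre, width=1.0):
    """Gaussian RBF: activation is 1 at the optimal pattern and falls off with distance."""
    sq_dist = sum((x - c) ** 2 for x, c in zip(inputs, centre))
    return math.exp(-sq_dist / (2.0 * width ** 2))

centre = [1.0, 0.0]  # hypothetical optimal input pattern for this unit
for pattern in ([1.0, 0.0], [1.0, 0.5], [0.0, 1.0]):
    print(pattern, round(rbf_activation(pattern, centre), 3))
```

The activation peaks when the input equals `centre` and decreases in every direction away from it, regardless of whether the input becomes more or less excitatory overall.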

## Activation decay and temporal integration

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/Action_Potential.gif/600px-Action_Potential.gif" width=600 />

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/Action_potential.svg/600px-Action_potential.svg.png" width=600 />

<img src="http://d2gme0e5d9kd75.cloudfront.net/content/royfocus/1/1/75/F2.medium.gif" width=600 />

# LEARNING RULES
* Unsupervised learning—the Hebb rule
* Supervised learning—the delta rule
* The delta rule and the Hebb rule: A comparison

<font color="red">Learning, defined as the acquisition of knowledge and/or change in behaviour as a result of experience, is of great importance to connectionist models.</font>

<font color="blue">How does one define what is a “better” response?</font> In general, “better” means that the interaction with the environment is improved; that successful predictions can be made about it, on the basis of partial information.

#### category (or classification) learning

* In the case of category (or classification) learning, experience leads to clustering of stimuli based on similarity—in brief, similar things go together, and should be treated in a similar manner.
    - For instance, if one has learned from prior experience that some parts of plants taste good (e.g. fruits), and some bad (e.g. leaves) then the ability to classify a new stimulus as a fruit or a leaf, just by looking at it, will determine how one responds to it.

#### associative learning

* In the case of associative learning, the goal of learning is to associate disparate aspects of experience that occur together, hence enabling us to predict the structure of our environment on the basis of partial cues.
    - For instance, faces and voices are processed in different brain areas, yet through experience we associate particular voices with faces of people that we know. On hearing a known voice, we look for the presence of a particular face. If we see a friend, we immediately notice any change in his/her voice (maybe he/she has a cold), or accent (who has he/she been talking to?). Hence, one form of perceptual input gives rise to learned expectations about another, and allows us to predict aspects of our environment ahead of time (and to detect changes in them).

In neural networks, a broad distinction is made between supervised and unsupervised learning.

#### supervised learning
* In supervised learning, the network is provided with some form of environmental feedback which indicates how well it is performing, and learning will only take place when the network performs badly, i.e. makes an error.

#### unsupervised learning
* In the case of unsupervised learning, there is no explicit comparison between actual and desired outputs. Rather, the network typically has implicit goals, such as forming a useful categorization (grouping) of sets of input stimuli, or extracting the co-occurrence relationships amongst features of the environment, which permit the prediction of features not actually present (e.g. Hebbian learning).

Below we discuss in more detail some of the more commonly used learning rules that occur in the articles in the current book. 

<font color="red">As stated, all learning rules change the weights between units.</font> 

## Unsupervised learning—the Hebb rule

#### References
* [8] Plasticity and Learning - http://www.intsci.ac.cn/shizz/course/ni10.ppt
* [9] Hebbian Learning - http://www.slideshare.net/mentelibre/hebbian-learning
* [10] Hebbian Learning Rule - http://www.aistudy.co.kr/neural/hebbian_learning.htm
* [11] neural-networks-12102680 - http://www.slideshare.net/balveenchugh/neural-networks-12102680

The simplest learning rule is named after the psychologist Donald Hebb, who formulated an intuitive, physiologically based version of it in a work published in 1949. Hebb proposed that knowledge was stored in “cell assemblies”, connected groups of neurons that would activate each other.

<img src="http://image.slidesharecdn.com/2008-01-09-miagkikh-140212224414-phpapp02/95/learning-in-networks-were-pavlov-and-hebb-right-7-638.jpg?cb=1392245210" width=600 />

<img src="http://neuronaldynamics.epfl.ch/online/x589.png" />

In connectionist terms, the basic rule has the form:

<img src="figures/cap1.14.png" width=600 />

where ai and aj are the activations of two units connected by the weight wij.

#### learning rate
* The rule can be added to in various ways which complicate its mathematical expression (and for this reason have been omitted). 
* For instance, a learning rate parameter is usually added, which scales the size of the weight change.

#### Long-Term Potentiation (LTP) 

* Hence, the weight increases (i.e. has a positive change) whenever the two units are active at the same time (if either unit has zero activation, then clearly ∆wij = 0, and the weight does not change). In neural terms, the rule can be thought of as stating that the strength of (excitatory) synapses should increase whenever two neurons connected by such synapses fire (become depolarized) at the same time (or at least very closely in time).
* The phenomenon of long-term potentiation (LTP) of synapses has this characteristic (see e.g. Carlson, 2001, Chapter 14). 
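
The Hebb rule with a learning rate can be sketched as follows. The weight change for each connection is the learning rate times the product of the two activations; the function name, the learning rate value `lr`, and the example patterns are assumptions chosen for illustration.

```python
def hebb_update(w, a_in, a_out, lr=0.1):
    """Return new weights after one Hebbian step: w[i][j] + lr * a_in[i] * a_out[j]."""
    return [
        [w[i][j] + lr * a_in[i] * a_out[j] for j in range(len(a_out))]
        for i in range(len(a_in))
    ]

a_in = [1.0, 0.0, 1.0]   # sending units: units 0 and 2 are active
a_out = [1.0, 0.0]       # receiving units: only unit 0 is active
w = [[0.0, 0.0] for _ in a_in]

w = hebb_update(w, a_in, a_out)
for row in w:
    print(row)
```

Only the weights joining two co-active units (0 to 0 and 2 to 0) grow. Every weight that touches an inactive unit is left unchanged, as the text describes.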

#### associative learning

* The rule can be used to perform associative learning between mental entities that are active at the same time. 

<img src="figures/cap1.15.png" width=600 />

#### anti-Hebb & Long-Term Depression (LTD)

* More interestingly, the individual weights can be made to decrease actively by a so-called “anti-Hebb” rule, whereby the weight is decreased by some amount if the input unit (ai) is active but the receiving unit (aj) is inactive or only weakly active. In the example shown in Fig. 1.8, anti-Hebb learning would lead to a reduction in the weights (shown as dashed lines in the figure) from “C”, “A”, “T” to the word “Rat”, as “Rat” was not activated. 
* Interestingly, this proposal has physiological support in the phenomenon of long-term depression (LTD), in which firing of a pre-synaptic neuron combined with weak post-synaptic depolarization, or hyperpolarization, leads to a reduction in synaptic strength.
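
A rough sketch of combined Hebb and anti-Hebb learning for the "Cat"/"Rat" example is given below. The text only says that the weight is decreased "by some amount", so the size of the decrement (`lr * a_i`) and the cut-off for "weakly active" (`weak`) are assumptions, as are all names in the code.

```python
def hebb_anti_hebb(w_ij, a_i, a_j, lr=0.1, weak=0.2):
    """One update of a single weight from sending unit i to receiving unit j."""
    if a_i <= 0.0:
        return w_ij                   # inactive sending unit: no change
    if a_j > weak:
        return w_ij + lr * a_i * a_j  # Hebb: both units active, strengthen
    return w_ij - lr * a_i            # anti-Hebb: receiver inactive or weak, weaken

letters = {"C": 1.0, "A": 1.0, "T": 1.0}  # active letter units
words = {"Cat": 1.0, "Rat": 0.0}          # "Cat" was activated, "Rat" was not

weights = {(letter, word): 0.0 for letter in letters for word in words}
for letter, word in weights:
    weights[(letter, word)] = hebb_anti_hebb(
        weights[(letter, word)], letters[letter], words[word]
    )

for key in sorted(weights):
    print(key, weights[key])
```

After one presentation, the weights from "C", "A" and "T" to "Cat" have increased, while those to "Rat" have decreased.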

## Supervised learning—the delta rule

#### References
* [12] Lecture 3: Delta Rule - http://www.cs.stir.ac.uk/courses/31YF/lectures/ANN/3-DeltaRule.pdf

Supervised learning, in contrast, only takes place when an error is detected in the output response of the network. For the error to be generated, the actual response of the network has to be compared with an externally provided “target” activation pattern.

#### error type
* Errors can be of two types, which we will refer to as “omission” and “commission” errors. 
* In an omission error, 
    - the network fails to activate a node that should have come on.
    - Omission errors lead to an increase in excitatory weights (and/or reduction in inhibitory weights) to the output unit that was not active enough, from all the input units that were activated by the input stimulus.
* In a commission error, 
    - the network activates a unit that should have remained inactive. 
    - Commission errors lead to a reduction in excitatory weights (and/or increase in inhibitory weights) to the inappropriately active unit from all the input units that were activated by the input stimulus. 
* The two types of error frequently co-occur. 

#### Widrow–Hoff (or “delta”) rule

<img src="figures/cap1.16.png" width=600 />

* The expression in brackets, 
    - targetj − actualj, is called the error term, 
    - and is computed very simply by subtracting the actual activation of the output unit from the target activation provided by the teacher. 
* The error term is then multiplied by the activation of the input unit from which the connection derives. 
    - Consequently, if the input unit is not active (ai = 0), then no weight change takes place. 
* Here we notice a similarity with the Hebb rule, discussed above—weight changes only take place when the “pre-synaptic neuron” is active. 
* Indeed, if we denote the error term dj, 
    - the <font color="red">delta rule is dj × ai</font>, 
    - while the <font color="blue">Hebb rule is aj × ai</font>.

<img src="figures/cap1.17.png" width=600 />
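
The delta rule can be sketched for a small two-layer network as below. The sketch assumes linear output units and includes a learning rate `lr`; both are illustrative choices rather than details confirmed by the figures, and the function names are hypothetical.

```python
def output_activations(w, a_in):
    """Linear output units: each activation is the net input from the input layer."""
    return [
        sum(a_in[i] * w[i][j] for i in range(len(a_in)))
        for j in range(len(w[0]))
    ]

def delta_update(w, a_in, target, lr=0.25):
    """Apply one delta-rule step in place: w[i][j] += lr * (target_j - actual_j) * a_i."""
    actual = output_activations(w, a_in)
    for j, (t, a) in enumerate(zip(target, actual)):
        error = t - a
        for i, a_i in enumerate(a_in):
            w[i][j] += lr * error * a_i

a_in = [1.0, 0.0, 1.0]
target = [1.0, 0.0]
w = [[0.0, 0.0] for _ in a_in]

for _ in range(20):  # present the same pattern repeatedly
    delta_update(w, a_in, target)

final = output_activations(w, a_in)
print(final)
print(w)
```

The first output unit starts with an omission error, which shrinks with each presentation until the unit matches its target. The weights from the inactive input unit never change, because the error is multiplied by an input activation of zero.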

## The delta rule and the Hebb rule: A comparison

It is possible to implement the delta rule without explicitly computing the error. We can see this by applying some simple algebra. We can rewrite the delta rule [equation (8)] by multiplying out the error term in the bracket by ai, which gives:

<img src="figures/cap1.18.png" />

The rule now appears as two “Hebb-like” terms (actually, Hebb and anti-Hebb), with no direct comparison between a target and an actual activation.
* The first term makes weights more positive (or less negative), 
* while the second does the opposite (note the minus sign). 
* The weight change due to the first term is precisely the Hebb rule, if we imagine the target activations instantiated on the output units, while the input units are active. This will make the connections to the output units that should be ON more positive. 
* The second term can be computed by activating the input units and then decrementing the weights to any output units that are activated as a result. This is quite literally an anti-Hebb rule.
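
The algebra can be checked numerically. The sketch below assumes the delta rule has the form (targetj − actualj) × ai, as stated in the previous section, with any learning rate left out; the function names are illustrative.

```python
def delta_rule(target_j, actual_j, a_i):
    """Weight change from the delta rule, computing the error explicitly."""
    return (target_j - actual_j) * a_i

def hebb_term(target_j, a_i):
    """Hebb step with the target clamped onto the output unit."""
    return target_j * a_i

def anti_hebb_term(actual_j, a_i):
    """Anti-Hebb step using the output the network actually produced."""
    return -(actual_j * a_i)

cases = [(1.0, 0.2, 1.0), (0.0, 0.7, 1.0), (1.0, 1.0, 0.5), (0.0, 0.3, 0.0)]
for target_j, actual_j, a_i in cases:
    combined = hebb_term(target_j, a_i) + anti_hebb_term(actual_j, a_i)
    print(delta_rule(target_j, actual_j, a_i), combined)
```

In every case the two-step Hebb plus anti-Hebb change equals the single delta-rule change, so the error never has to be computed as a separate quantity.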

# FURTHER EXTENSIONS OF THE DELTA RULE—MULTI-LAYER NETWORKS AND NONLINEAR MAPPINGS

The delta rule is defined for two-layer networks in which the input units are directly attached to the output units (Fig. 1.2a).

#### linearity & nonlinearity

However, the set of problems (defined as input–output mappings) that a two-layer network can solve in principle is limited. In technical terms, for a complete solution to be achievable by a two-layer net, the relationship defined by the mapping has to be linear.

<img src="https://dmm613.files.wordpress.com/2014/12/non_linearly_separable.png" width=600 />
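
The limitation can be illustrated with the classic AND versus XOR contrast. The sketch below assumes a single threshold output unit with a bias, so that "linear" is read as "linearly separable"; that reading, the weight grid, and all names are assumptions made for the illustration. It searches a grid of weights for any setting that reproduces each mapping.

```python
from itertools import product

PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]

def solves(w1, w2, bias, targets):
    """True if a threshold unit with these weights gives every target output."""
    return all(
        (1 if w1 * x1 + w2 * x2 + bias > 0 else 0) == t
        for (x1, x2), t in zip(PATTERNS, targets)
    )

def find_weights(targets, grid):
    """Return the first (w1, w2, bias) on the grid that solves the mapping, else None."""
    for w1, w2, bias in product(grid, repeat=3):
        if solves(w1, w2, bias, targets):
            return (w1, w2, bias)
    return None

GRID = [k * 0.5 for k in range(-8, 9)]  # -4.0 to 4.0 in steps of 0.5

print("AND:", find_weights(AND, GRID))
print("XOR:", find_weights(XOR, GRID))
```

A solution exists for AND but none is found for XOR. The grid search is only a demonstration, but the XOR result holds for any real-valued weights: the four required inequalities contradict each other.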

#### hidden units

The addition of hidden units to networks permits them to handle arbitrary, nonlinear relationships (the hidden units must use a nonlinear activation function, such as the sigmoid, Fig. 1.7).

The immediate problem with applying the delta rule is that, although it could be adopted to change the weights from the hidden to output units (the activation of the input units in equation (8) could be replaced with the activation of the hidden units), it provides no way of adapting the weights from the input to hidden units. In particular, if we want to use a supervised learning rule, we need a way to define the error for the hidden units, so that the input-to-hidden weights can be changed to reduce the error. But how can this be done, given that the target activations are only defined for the output units? 

#### back-propagation learning rule

<img src="figures/cap1.19.png" width=600 />
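
A minimal sketch of one back-propagation step for a network with a single hidden layer follows. It uses the standard textbook formulation with sigmoid units, squared error and no bias weights; this is assumed rather than copied from the figure, and the weights, learning rate and names are illustrative. The key point is that each hidden unit's error is obtained by passing the output errors backwards through the hidden-to-output weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_ih, w_ho):
    """Return (hidden activations, output activations) for input pattern x."""
    hidden = [sigmoid(sum(x[i] * w_ih[i][h] for i in range(len(x))))
              for h in range(len(w_ih[0]))]
    output = [sigmoid(sum(hidden[h] * w_ho[h][o] for h in range(len(hidden))))
              for o in range(len(w_ho[0]))]
    return hidden, output

def backprop_step(x, target, w_ih, w_ho, lr=0.1):
    """Update both weight layers in place for one input/target pair."""
    hidden, output = forward(x, w_ih, w_ho)
    # Output error terms: (target - actual) scaled by the sigmoid slope.
    d_out = [(target[o] - output[o]) * output[o] * (1.0 - output[o])
             for o in range(len(output))]
    # Hidden error terms: output errors sent back through the old weights.
    d_hid = [hidden[h] * (1.0 - hidden[h])
             * sum(d_out[o] * w_ho[h][o] for o in range(len(output)))
             for h in range(len(hidden))]
    for h in range(len(hidden)):
        for o in range(len(output)):
            w_ho[h][o] += lr * d_out[o] * hidden[h]
    for i in range(len(x)):
        for h in range(len(hidden)):
            w_ih[i][h] += lr * d_hid[h] * x[i]

def squared_error(x, target, w_ih, w_ho):
    _, output = forward(x, w_ih, w_ho)
    return 0.5 * sum((t - o) ** 2 for t, o in zip(target, output))

x, target = [1.0, 0.0], [1.0]
w_ih = [[0.3, -0.2], [0.1, 0.4]]  # input-to-hidden weights (hypothetical)
w_ho = [[0.5], [-0.3]]            # hidden-to-output weights (hypothetical)

error_before = squared_error(x, target, w_ih, w_ho)
backprop_step(x, target, w_ih, w_ho)
error_after = squared_error(x, target, w_ih, w_ho)
print(error_before, error_after)
```

Both update loops have the same shape as the delta rule: an error term multiplied by the activation of the sending unit. The only new ingredient is how the hidden-unit error term is defined.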

# FEEDBACK AND RECURRENT ARCHITECTURES
* Within-layer feedback and winner-takes-all dynamics
* Between-layer feedback and attractor dynamics

## Within-layer feedback and winner-takes-all dynamics

<img src="figures/cap1.20.png" width=600 />

<img src="figures/cap1.21.png"  />

## Between-layer feedback and attractor dynamics

# References
* [1] Connectionist Models in Cognitive Psychology - http://www.amazon.com/Connectionist-Cognitive-Psychology-Studies-Cognition/dp/0415646901/
* [2] Introduction to Cognitive Science - http://users.metu.edu.tr/bozsahin/intro-cogsci/w1-history.ppt
* [3] Connectionist Modeling  - http://people.umass.edu/alc/course_pages/fall_2004/modeling_behavior/lectures/connectionism.ppt 
* [4] Computationalism: The Very Idea - http://www.cs.bilkent.edu.tr/~david/papers/Computationalism.ppt
* [5] Connectionist Approaches - http://www.slideshare.net/cheapiseth1/chapter-6-connectionist-approaches
* [6] Connectionism vs. computationalism debate (wikipedia) - https://en.wikipedia.org/wiki/Connectionism#Connectionism_vs._computationalism_debate
* [7] Learning in Networks: were Pavlov and Hebb right? - http://www.slideshare.net/vmiagkikh/learning-in-networks-was-p
* [8] Plasticity and Learning - http://www.intsci.ac.cn/shizz/course/ni10.ppt
* [9] Hebbian Learning - http://www.slideshare.net/mentelibre/hebbian-learning
* [10] Hebbian Learning Rule - http://www.aistudy.co.kr/neural/hebbian_learning.htm
* [11] neural-networks-12102680 - http://www.slideshare.net/balveenchugh/neural-networks-12102680
* [12] Lecture 3: Delta Rule - http://www.cs.stir.ac.uk/courses/31YF/lectures/ANN/3-DeltaRule.pdf