# 2. Connectionist models of basic human learning processes

* Psygrammer / QGM: Part 3 - Connectionist models [1]
* 김무성

# Contents
* LINEAR AND NONLINEAR CLASSIFICATIONS
* UNIQUE CUES
* CHALLENGING UNIQUE CUE MODELS
* BACK-PROPAGATION MODELS
* CONFIGURAL MODELS
* REPRESENTATIONAL FLEXIBILITY
* ROLE OF THE HIPPOCAMPUS IN STIMULUS CODING
* CONCLUSIONS

It is natural to believe that the current high level of interest in connectionist models of cognitive processes is attributable to the impressive accounts such models can provide of human competencies (e.g. reading, classification, skill acquisition) and that the (arguable) improvement such models provide in explanatory power over nonconnectionist models is the main impetus to their growing acceptance amongst cognitive psychologists.

In this chapter I discuss elementary learning processes in various human analogues of animal conditioning procedures and use the findings to draw conclusions about the basic properties required in realistic connectionist models.

#### High-level cognitive processes & Animal Pavlovian conditioning

<img src="http://2012books.lardbucket.org/books/beginning-psychology/section_11/72ff70c6cb32a57995de5d2081132da7.jpg" width=600 />

* In fact, there is a remarkable degree of similarity between the principles governing Pavlovian conditioning and human associative learning (Shanks, 1995) and there seems little risk at present that theories of human and animal learning will diverge seriously. 
* A second reason is more to do with persuasiveness: No matter how compelling the evidence is in Pavlovian conditioning for the involvement of connectionistic processes, there will always be a leap of faith involved in seeing a relationship between such processes and more high-level human abilities.
* Studies of human associative processes provide an indispensable bridge spanning these two extreme fields, especially if high-level cognitive processes (e.g. causal induction) can be shown to obey the same rules as conditioning (Dickinson, 2001; Lober & Shanks, 2000).

#### supervised learning

* Very many of the so-called “supervised” networks and architectures that have been most extensively studied in the last few years take as their input a pattern of activation across a set of input units, generate an output pattern across a set of output units, and then receive some sort of signal from the outside world, indicating what the output pattern generated by the system should have been.
* This sort of simple situation provides a straightforward model for
    - associative learning and 
    - memory tasks, which conform to this input–output mapping description. 
* Some examples include 
    - generating the past tense of a verb from its stem, 
    - generating the pronunciation of a word from its written form, and
    - deciding the category to which an object belongs.
* These competences are complex, however, in a number of ways.

#### Simpler situation

* In consequence, it is necessary to turn to much simpler situations in which the learning task is stripped down to its bare essentials in such a way that the coding scheme, learning algorithm, and so on can be more directly and independently studied.
* Participants are instructed to imagine that they are food allergists attempting to work out which foods various fictitious patients are allergic to. 
    - On each trial a meal is presented to the patient, comprising one or a small number of foods, and the participant has to predict whether an allergy ensues. 
    - Then feedback is given about the actual outcome and the participant goes on to the next trial. 
    - Over a series of trials (usually in the order of 50–200), various trial types are presented repeatedly in an intermixed order until the participant is correctly predicting the outcome on each trial. 
    - Finally, a transfer test is given to assess some aspect of the learning process. 
    - The parallel to Pavlovian conditioning should be clear, with the foods playing the role of conditioned stimuli (CSs) and the allergy the role of the unconditioned stimulus (US).
* As we will see, despite the simplicity of the ability we are attempting to simulate, there are profound difficulties in creating an adequate connectionist theory, but a good deal of progress has nevertheless been made. 
* Also, it should be mentioned that the experimental designs on which this chapter focuses can generally be solved without the need for selective attention, and the models I review do not incorporate attentional processes.

# LINEAR AND NONLINEAR CLASSIFICATIONS

#### feedforward networks or pattern associators

<img src="figures/cap2.1.png" width=600 />

On every trial, some item consisting of a set of cues is presented. We calculate the activation $a_{o}$ of each output unit:

<img src="figures/cap2.2.png" width=600 />

Next, we calculate the “error”, $d_{o}$, on the output unit between the obtained output, $a_{o}$, and the desired output, $t_{o}$:

<img src="figures/cap2.3.png" width=600 />

Finally, we use the well-known “delta” rule (Stone, 1986) to change each of the weights in proportion to the error:

<img src="figures/eq2.3.png" width=600 />
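
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming the equations in the figures take their standard forms: $a_{o} = \sum_{i} a_{i} w_{io}$, $d_{o} = t_{o} - a_{o}$, and $\Delta w_{io} = \alpha a_{i} d_{o}$. The function names, the learning rate of 0.1, and the two-cue example are illustrative assumptions, not part of the chapter.

```python
def activation(inputs, weights):
    """Output activation a_o: the weighted sum of the input activations."""
    return sum(a_i * w for a_i, w in zip(inputs, weights))


def delta_rule_step(inputs, weights, target, alpha=0.1):
    """Apply one delta-rule update in place and return the error d_o."""
    error = target - activation(inputs, weights)
    for i, a_i in enumerate(inputs):
        # Each weight changes in proportion to its input and the error.
        weights[i] += alpha * a_i * error
    return error


# Hypothetical example: cue A is always followed by the outcome, cue B never is.
weights = [0.0, 0.0]  # [w_A, w_B]
for _ in range(200):
    delta_rule_step([1, 0], weights, target=1.0)
    delta_rule_step([0, 1], weights, target=0.0)

print(weights)
```

In this toy run the weight for A climbs towards 1.0, while the weight for B never moves from zero because its error is always zero.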

#### Elemental coding assumption & Rescorla–Wagner theory

#### See also
* [2] Behavior Models - http://www.slideshare.net/zennboy/class-feb-15
* [3] Rescorla–Wagner model (Wikipedia) - https://ko.wikipedia.org/wiki/%EA%B3%A0%EC%A0%84%EC%A0%81_%EC%A1%B0%EA%B1%B4%ED%99%94%EC%9D%98_%ED%96%89%EB%8F%99%EC%A0%81_%EC%97%B0%EA%B5%AC#.EB.A0.88.EC.8A.A4.EC.BD.9C.EB.9D.BC-.EC.99.80.EA.B7.B8.EB.84.88_.EB.AA.A8.EB.8D.B8
* [4] Rescorla–Wagner model (blog) - http://m.blog.naver.com/khrireg/10187749924

In pattern associators, associative knowledge is represented in weighted connections between elements of the stimulus and elements of the outcome. The model given in equations (1)–(3) is in fact formally equivalent to the well-known Rescorla–Wagner theory (Rescorla & Wagner, 1972) of animal Pavlovian conditioning, a theory that has dominated conditioning research for nearly 30 years (Hall, 1991; Miller, Barnet, & Grahame, 1995) and which has many proven empirical successes.

<font color="blue">While this learning algorithm has been successfully applied to many tasks, however, there is evidence to suggest that its “elemental” representational assumption is inadequate.</font>

In learning to associate one pattern with another, for instance, it appears that in addition to learning direct associations between the outcome and the separate elements that make up the stimulus, higher-order representations of the stimulus can also be involved in associations with the outcome.

<font color="red">The inadequacy of the notion that the elements of the cue are directly and independently associated with the outcome comes from a number of sources.</font>

* According to the elemental coding assumption, a compound cue such as ABC, comprising three elements, should elicit a response that is proportional to the sum of the weights of the elements A, B, and C. If these elements have previously been separately paired with a US and have each acquired asymptotic weights of 1.0, then the compound ABC should evoke summation, that is to say, a level of responding that is far higher than that elicited by the elements themselves.
* <font color="blue">Although the precise conditions remain to be clarified, there is now convincing evidence against this prediction from studies that reveal no evidence of summation in experiments of this sort.</font>
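
Written out, the elemental prediction for this example is simple arithmetic. Assuming, as in the pattern associator, that the response to a compound is the plain sum of its element weights (the notation $a_{o}(\cdot)$ for the response to a given stimulus is introduced here only for illustration):

$$
a_{o}(ABC) = w_{A} + w_{B} + w_{C} = 1.0 + 1.0 + 1.0 = 3.0,
\qquad
a_{o}(A) = a_{o}(B) = a_{o}(C) = 1.0
$$

So the compound is predicted to evoke three times the response of any single element, which is the summation effect the studies above failed to find.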

#### nonlinearly separable

<img src="http://www.massey.ac.nz/~mjjohnso/notes/59302/xor_plot.gif" />

Another and more well-established source of evidence comes from the fact that humans can learn nonlinearly separable classifications.

Consider a single-layer network consisting of two input units (denoted x and y) connected to one output unit, where the inputs and the correct output, $t_{o}$, can take on values between 0.0 and 1.0, and where the network is trained to classify input patterns into one of two categories.

<img src="figures/cap2.4.png" />

It follows that the only types of classification such a system can learn are linearly separable ones in which the members of the two categories can be distinguished by a simple linear boundary. 
* Specifically, <font color="blue">for the delta rule model to learn a discrimination, it must be possible to construct a straight line in the x,y input space</font> that exactly divides the stimuli into the correct categories.

<font color="red">However, people have no difficulty learning nonlinearly separable discriminations that the delta rule model would be unable to master.</font>

#### experiment

Consider the following simple experiment. 
* In this task, 
    - 16 participants (UCL students) 
    - had to learn relationships 
    - between 
        - foods people ate and 
        - allergic reactions caused by those foods. 
* On each trial, 
    - a list of foods was described that the person had eaten, and 
    - the participant had to choose one out of a selection of possible allergies that the person suffered. 
        - The foods were such things as 
            - bananas, 
            - avocados, 
            - etc., and 
        - the allergies were called 
            - Type 1, 
            - Type 2, 
            - etc. 
    - Some of the people suffered no allergic reaction. 
    - After making their predictions, participants received feedback telling them the correct outcome for that trial. 
* Embedded in the design were critical trials of the following sorts which conform to a feature-neutral discrimination (Rudy & Sutherland, 1995): 
    - A → O, 
    - BC → O, 
    - C → no O, and 
    - AB → no O, 
        - where A–C are different foods, 
        - O is an allergy, and 
        - no O is no allergy. 
* Participants received 10 trials of each of these types in a fully randomized and intermixed fashion. 
* In fact, the full design included 
    - three distinct sets of trials conforming to this design; 
    - we have collapsed the data across these three sets.

Examination of the trial types reveals that the discrimination is not linearly separable. Each of the elements (A, B, C) is equally often paired with the outcome and with no outcome, so the discrimination cannot be solved on the basis of summation of the weights of individual stimulus elements.
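
A short argument makes the point concrete. With a response threshold $\theta$, an elemental solution would need $w_{A} > \theta$ and $w_{B} + w_{C} > \theta$, but also $w_{C} < \theta$ and $w_{A} + w_{B} < \theta$. Adding the first pair gives $w_{A} + w_{B} + w_{C} > 2\theta$, while adding the second pair gives $w_{A} + w_{B} + w_{C} < 2\theta$, a contradiction. The sketch below trains a delta-rule pattern associator on the four trial types to show the same thing numerically. The linear output, the learning rate, the fixed trial order, and the number of passes are assumptions made for illustration.

```python
# Cue order: A, B, C. Targets: 1 = allergy (O), 0 = no allergy.
trials = {
    "A":  ([1, 0, 0], 1.0),
    "BC": ([0, 1, 1], 1.0),
    "C":  ([0, 0, 1], 0.0),
    "AB": ([1, 1, 0], 0.0),
}


def output(inputs, weights):
    """Linear output unit: sum of the weights of the cues that are present."""
    return sum(a * w for a, w in zip(inputs, weights))


weights = [0.0, 0.0, 0.0]
alpha = 0.05  # assumed learning rate
for _ in range(2000):
    for inputs, target in trials.values():
        error = target - output(inputs, weights)
        for i, a in enumerate(inputs):
            weights[i] += alpha * a * error

responses = {name: output(inputs, weights) for name, (inputs, _) in trials.items()}
for name, value in responses.items():
    print(f"{name:>2}: {value:.2f}")
```

However long training continues, the responses to A and BC can never both exceed the responses to C and AB, so the network stays stuck near an intermediate response to all four trial types.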

#### result

A moment’s thought reveals that a pattern associator like that in Fig. 2.1 cannot solve the discrimination, so the key question is, can people? The answer is “yes”, as Fig. 2.2 shows.

The graph then plots the means of these scores transformed into percentages. With this scoring scheme, the chance level is 25% for trials associated with an allergy and 75% for trials associated with no allergy. 

<img src="figures/cap2.5.png" width=600 />

<img src="figures/cap2.6.png" width=600 />

<font color="red">This is a result that single-layer pattern associators governed by the delta rule are unable to explain.</font>

# UNIQUE CUES

Just as it has been recognized for many years that humans can solve such nonlinearly separable learning problems, so a simple account of how this is achieved has long been available.

#### unique elements

Wagner and Rescorla (1972) suggested that whenever two elements occur together, their combination gives rise to further “unique” elements which function much like the elements themselves.

#### unique cue model

In network terms, unique cue theories are easy to implement. As elaborated by Gluck (1991), the idea would be to supplement the layer of input units with additional units that correspond to the unique cues.
* Thus, in addition to input units for elements A and B, there would be an AB unit which would only be activated when both A and B are present.
* What about higher-order configurations? ABC?
    - One extreme possibility is that the network should contain seven input units (three for the elements, one each for configurations AB, AC, and BC, and one for configuration ABC) corresponding to the complete power set of elements.
    - Because the number of such units grows rapidly with the number of elements, an alternative is to constrain the <font color="red">unique cue units to consist of just the element-pairs AB, AC, and BC.</font>

It should be plain that the results of the experiment shown in Fig. 2.2 are readily accounted for by a unique cue analysis.

<img src="figures/cap2.5.png" width=600 />

* In the 
    - A → O, 
    - BC → O, 
    - C → no O, 
    - AB → no O 
    - feature-neutral problem, 
        - none of the elements can become a reliable predictor of the category, 
        - but the unique cues can. 
* When unique cues are included, the trial types become 
    - A → O, 
    - BCX → O, 
    - C → no O, and 
    - ABY → no O, 
        - where X is the unique cue 
            - created by the combination of elements B and C and 
        - Y is the unique cue 
            - created by the combination of A and B. 
    - It is simple to see that the discrimination is solved if 
        - cue A and the unique cue X acquire positive weights for the category, 
        - B and C have weights of zero, and 
        - Y has a negative weight.
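
The unique cue solution described above can be checked directly. The sketch below first verifies the hand-picked weights from the text and then lets the delta rule find its own solution once the unique cue units X and Y are added as inputs. The learning rate, the fixed trial order, and the number of passes are assumptions. The learned weights need not match the hand-picked ones, because more than one weight set solves the problem.

```python
# Input order: A, B, C, X, Y, where X is the unique cue for BC and Y for AB.
trials = {
    "A":   ([1, 0, 0, 0, 0], 1.0),
    "BCX": ([0, 1, 1, 1, 0], 1.0),
    "C":   ([0, 0, 1, 0, 0], 0.0),
    "ABY": ([1, 1, 0, 0, 1], 0.0),
}


def output(inputs, weights):
    return sum(a * w for a, w in zip(inputs, weights))


# Solution given in the text: A and X positive, B and C zero, Y negative.
text_weights = [1.0, 0.0, 0.0, 1.0, -1.0]
text_responses = {name: output(x, text_weights) for name, (x, _) in trials.items()}

# Delta-rule learning from zero weights (alpha is an assumed learning rate).
learned = [0.0] * 5
alpha = 0.1
for _ in range(2000):
    for x, t in trials.values():
        error = t - output(x, learned)
        for i, a in enumerate(x):
            learned[i] += alpha * a * error
learned_responses = {name: output(x, learned) for name, (x, _) in trials.items()}

for name, (_, target) in trials.items():
    print(f"{name:>3}: target {target:.0f}, learned response {learned_responses[name]:.2f}")
```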

# CHALLENGING UNIQUE CUE MODELS

Despite the obvious predictive successes of unique cue models, it is fairly easy to show that the patterns of generalization predicted by <font color="red">such networks are inappropriate</font>, and we can illustrate this with an experiment by Shanks, Charles, Darby, and Azmi (1998a, Experiment 3) which uses a design originally adopted in animal conditioning studies by Pearce and Wilson (1991; Wilson & Pearce, 1992). 

#### experiment

* The basic idea is that 
    - participants learn 
        - an A → O, 
        - AB → no O 
        - discrimination in the first stage, 
            - which should establish 
                - A as a predictor of the outcome and 
                - B as an inhibitor of that outcome 
                    - which counteracts A’s influence. 
    - Then, participants learn that 
        - B by itself also predicts the outcome (B → O), and 
    - finally retention of the original discrimination between A and AB is tested. 
    - Unique cue theories, as we shall see, predict a dramatic influence of the B → O trials.

<img src="figures/cap2.7.png" width=600 />

* To provide adequate controls, 
    - the full design needed to be slightly more complex (see Table 2.1).
    - The critical trial types in the first stage of the experimental condition were intermixed 
        - A → $O_{1}$, 
        - AB → no O, and 
        - AC → $O_{1}$ trials. 
    - In the second stage, participants saw 
        - B → $O_{1}$ trials, 
    - and then in the test phase 
        - they were presented with 
            - A, 
            - AB, and 
            - AC test stimuli. 
    - Once again, the to-be-learned items were foods or combinations of foods, and the outcomes were various allergies (or no allergy).
* <font color="red">What are the predictions about performance in the test phase?</font> 

* <font color="blue">Let us begin by ignoring possible unique cues.</font>
    - In the first stage, 
        - element A should acquire a positive weight for outcome $O_{1}$,
        - element B should acquire an equal but negative weight for the same outcome, and 
        - C should have a weight of zero. 
    - In the second phase, 
        - cue B’s negative weight will be dramatically altered, 
            - since B now predicts the outcome whereas in the first stage it was negatively associated with it. 
    - In the test, we now have 
        - one compound (AB) each of whose elements is strongly connected to $O_{1}$, and 
        - another compound (AC) consisting of an element (A) which is connected to the outcome and 
        - another element (C) which should have a weight of zero. 
* In sum, then, <font color="red">participants should be more likely to predict outcome $O_{1}$ on an AB test trial than on an AC one.</font>
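
The elemental prediction can be worked through by hand. This sketch assumes that training runs to asymptote in each stage, that the outcome is coded as 1 and its absence as 0, and that the response to a compound is the sum of its element weights:

$$
\begin{aligned}
\text{Stage 1:}\quad & w_{A} = 1, \quad w_{A} + w_{B} = 0 \Rightarrow w_{B} = -1, \quad w_{A} + w_{C} = 1 \Rightarrow w_{C} = 0 \\
\text{Stage 2:}\quad & w_{B}: -1 \rightarrow +1 \quad (w_{A},\ w_{C}\ \text{unchanged}) \\
\text{Test:}\quad & AB: w_{A} + w_{B} = 2, \qquad AC: w_{A} + w_{C} = 1
\end{aligned}
$$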

<img src="figures/cap2.7.png" width=600 />

<img src="figures/eq2.3.png" width=600 />

* <font color="blue">What are the predictions if unique cue inputs are added?</font> 
    - <font color="red">Here, things are a little more complicated, but the conclusion is that participants are predicted to be at least as likely to select $O_{1}$ on an AB test trial as on an AC one.</font> 
    - The difficulty with deriving predictions arises from the indeterminacy of the relative learning rate parameters <font color="green">[α in equation (3)]</font> for the elements and the unique cues, but we can explore certain boundary conditions. 
    - <font color="orange">If the salience and hence learning rate for the unique cues is zero</font>,
        - then the unique cue version reduces to the simple elemental model we have already considered. 
    - <font color="orange">If the learning rate is the same for the elements and unique cues</font>,
        - then participants should again be more likely to predict 
            - $O_{1}$ on an AB test trial 
                - than on an AC one, 
            - just as in the pure elemental model.
    - <font color="orange">Finally, if the learning rate or salience of a unique cue is much greater than that of an element</font>, then 
        - X will acquire a strong negative weight (close to −1.0) in stage 1 and 
        - B will maintain a weight close to zero. 
        - B’s weight will then increase to 1.0 in stage 2, and 
        - the ABX stimulus will have a combined weight of +1.0 in the test phase. 
        - For ACY, the weight will also be 1.0, 
            - since A’s weight is 1.0 and C and Y have weights of zero.
        - Thus, participants are predicted to respond identically to AB and AC at test.

#### prediction

<font color="red">To summarize, on any version of an elemental or unique cue theory, participants should be at least as likely to predict outcome $O_{1}$ on the AB test trial as on the AC one.</font> 

#### data (real)

Fig. 2.4 shows participants’ performance in each of the three phases of the experiment.

<font color="red">The key stage 3 data are shown in Fig. 2.4c and reveal that participants predicted $O_{1}$ on AC trials more than on AB trials. The prediction of the model is entirely falsified.</font>

<font color="blue">The problem with unique cue (and elemental) theories, it seems, is the assumption that responding to an item is determined simply by the sum of the weights of its components.</font> The fact that some of those components can be unique cues does not alter the summation assumption. What our experiment indicates is that a component of a stimulus can be radically revalued without affecting to any great degree the associative connection between that stimulus and the original category.

<img src="figures/cap2.8.png" width=600 />
<img src="figures/cap2.9.png" width=600 />

# BACK-PROPAGATION MODELS

<img src="http://nbviewer.jupyter.org/github/psygrammer/qgm/blob/master/part3/connectionist/ch01/figures/cap1.19.png" width=600 />

* <font color="red">One particular type of hidden-unit network</font> has been extremely widely investigated and has been shown to have some very powerful properties.
* In such a “<font color="red">back-propagation-of-error</font>” network, the delta rule applies exactly as before, except that it is refined in order to determine how much the input-hidden weights and the hidden-output weights should be changed on a given trial. 
* The development of multi-layer networks using the generalized version of the delta rule has provided a major contribution to recent connectionist modelling, since phenomena such as the learning of <font color="red">nonlinear classifications</font>, which are impossible for basic single-layer networks, can be easily dealt with by <font color="red">multi-layer networks</font>.
* Even more impressive than their ability to learn nonlinear classifications is the fact, proved by Hornik, Stinchcombe, and White (1989), that <font color="red">back-propagation networks can learn essentially any mapping</font> <font color="blue">between a set of input and output patterns</font> that one cares to construct. 
    - Thus, for any set of mappings from arbitrary input patterns to arbitrary output patterns (I1 → O1, I2 → O2, I3 → O3, . . .), a back-propagation network with <font color="red">sufficient hidden units</font> will construct a set of weights to learn the mapping to <font color="red">any desired degree of approximation</font>.
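
To make the generalized delta rule concrete, here is a minimal pure-Python sketch of a network with one hidden layer, trained on the feature-neutral discrimination from the earlier experiment. Everything specific in it is an assumption made for illustration: sigmoid units, bias terms, online updates on squared error, four hidden units, the learning rate, and the random seed. Whether a given run reaches the targets depends on the starting weights, so the sketch prints the responses rather than promising particular values.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_backprop(patterns, n_hidden=4, alpha=0.5, epochs=5000, seed=0):
    """Train a one-hidden-layer network by back-propagation; return a predictor."""
    rng = random.Random(seed)
    n_in = len(patterns[0][0])
    # The last weight in each row is a bias (its input is fixed at 1.0).
    w_ih = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_ho = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(x):
        xb = list(x) + [1.0]
        hb = [sigmoid(sum(w * a for w, a in zip(row, xb))) for row in w_ih] + [1.0]
        out = sigmoid(sum(w * a for w, a in zip(w_ho, hb)))
        return xb, hb, out

    for _ in range(epochs):
        for x, target in patterns:
            xb, hb, out = forward(x)
            # Error signal at the output unit (delta rule times sigmoid slope).
            d_out = (target - out) * out * (1.0 - out)
            # Error propagated back to each hidden unit via its outgoing weight.
            d_hid = [d_out * w_ho[j] * hb[j] * (1.0 - hb[j]) for j in range(n_hidden)]
            for j in range(n_hidden + 1):
                w_ho[j] += alpha * d_out * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_ih[j][i] += alpha * d_hid[j] * xb[i]

    return lambda x: forward(x)[2]


# Cue order: A, B, C. Trial types: A -> O, BC -> O, C -> no O, AB -> no O.
patterns = [([1, 0, 0], 1.0), ([0, 1, 1], 1.0), ([0, 0, 1], 0.0), ([1, 1, 0], 0.0)]
predict = train_backprop(patterns)
outputs = [predict(x) for x, _ in patterns]
for (x, target), out in zip(patterns, outputs):
    print(x, "target", target, "response", round(out, 2))
```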

Hence, there is <font color="red">no question about the power</font> of this sort of connectionist network for learning associative relationships.

But the question remains, <font color="red">does it learn in the same way as humans?</font>

#### child language acquisition

* There is undoubtedly evidence of persuasive <font color="red">correspondences between human behaviour and the predictions of back-propagation networks</font>. 
* Some of the best evidence concerns child language acquisition, where it is possible to provide a network with approximately the same sort of input that children receive and to see whether characteristics of the network’s learning match those seen in children.
    - One much-debated example concerns the learning of the past tense in English.
        - walk–walked (regular)
        - go–went, send–sent, have–had, etc. (irregular)
        - “over-regularize” irregular verbs:
            - human : “goed”, “sended”
            - It turns out that back-propagation networks are also able to produce such errors (Plunkett & Marchman, 1993).

#### human associative learning

<img src="figures/cap2.7.png" width=600 />

* From the more general perspective of human associative learning, however, the <font color="red">basic back-propagation system is inadequate</font>, and the reason is simple: 
    - As McCloskey and Cohen (1989) showed, such <font color="red">networks predict catastrophic interference</font> just like unique cue and elemental models.
        - To illustrate, we ran a simple back-propagation simulation of the experimental condition in Table 2.1. 
            - There were three input units, corresponding to stimuli 
                - A, 
                - B, and 
                - C, and 
            - one output unit corresponding to the target outcome 
                - O1. 
            - In the first stage, 
                - A → O1, 
                - AB → no O, and 
                - AC → O1 
                - trials were presented repeatedly until correct responding was observed. 
            - In the second stage, 
                - B → O1 
                - trials were presented and again, 
                - training continued until the correct output was obtained. 
            - Finally, stimuli 
                - AB and 
                - AC 
                - were presented in the test phase. 
        - <font color="red">In contradiction to the results illustrated in Fig. 2.4</font>, 
            - <font color="red">AB evoked a stronger O1 response than AC</font>. 
        - We have tried many permutations of the network architecture and training regime but have been unable to find any circumstances in which this outcome is reversed.

<img src="figures/cap2.8.png" width=600 />
<img src="figures/cap2.9.png" width=600 />

<font color="red">The problem for back-propagation models</font> is particularly starkly illustrated in an experiment and accompanying simulation conducted by López, Shanks, Almaraz, and Fernández (1998). This study again used a simple prediction task, but in this case the cues were symptoms and the outcomes diseases. The design is shown in Table 2.2. 

* Participants received information about 
    - the symptoms that a particular patient presented and 
    - they had to diagnose the disease this patient was suffering from.
* Table 2.2 shows that across 
    - stages 1 and 2 
    - the cues of interest, 
        - A and D, 
    - were followed by their respective outcomes, 
        - $O_{1}$ and $O_{2}$, 
    - exactly the same number of times and 
    - in compound with cues that had undergone exactly the same treatment. 
    - The only difference between A and D was that 
        - A was a better predictor of $O_{1}$ than 
            - its pairmate B in stage 1 and 
        - a worse predictor than 
            - its pairmate C in stage 2, 
        - whereas for D, this was reversed. 
    - That is, D was 
        - a worse predictor of $O_{2}$ than 
            - E in stage 1 and 
        - a better predictor than 
            - F in stage 2. 
    - The various trial types in each stage were randomly intermixed. 
* The critical component of the design is that, just as in the Shanks et al. (1998a) experiment shown in Table 2.1, the stage 2 contingencies indirectly contradict the ones presented in stage 1, in as much as <font color="red">stage 1 tends to suggest that A is a powerful predictor</font> of the disease and that <font color="red">D prevents the disease</font>, while <font color="blue">stage 2 tends to suggest the exact opposite</font>.

<img src="figures/cap2.10.png" width=600 />

* Participants learned to predict the correct outcomes during both stages 1 and 2. In the subsequent test stage, trials of each type from the two stages were re-presented without any feedback.
* <font color="red">The key issue is whether participants could remember the correct responses for the stage 1 trial types</font>.
    - If they have forgotten or unlearned the stage 1 trial types, they will obviously make many errors on test trials from stage 1. 
    - In fact the data, reported in Table 2.2 (bold figures), reveal good recall of the stage 1 contingencies. 
        - Participants tended to predict the target disease on AB and E trials and predicted no disease on B and DE trials, consistent with the stage 1 associations.
* We then conducted a series of <font color="red">simulations using back-propagation networks</font>, as described in McClelland and Rumelhart (1988). 
    - Different parameters and network architectures (different numbers of hidden units) were tested in these simulations, but all of them showed the basic <font color="red">catastrophic forgetting effect</font>, so only one of them is described in detail.
    - The outcome of this simulation is presented in Fig. 2.5. 
        - A three-layered network was used, 
            - consisting of 
                - six input units, 
                    - corresponding to the six cues (symptoms), 
                - 10 hidden units 
                    - to allow an internal representation of the input information to be formed, and 
                - two output units 
                    - corresponding to the two outcomes (diseases) used in the training stage.
                    - “No disease” was coded as an output of zero on both output units.
        - This pattern of results shows that, <font color="red">unlike the participants in the experiment</font>, the network performed during the test stage according to the contingencies it had learned during Block 2 of the learning stage and had <font color="red">catastrophically forgotten the relationships programmed during Block 1</font>. 

<img src="figures/cap2.11.png" width=600 />
<img src="figures/cap2.12.png" width=600 />

<font color="red">Yet another problem with back-propagation networks</font> <font color="blue">as models of human associative learning</font> is that they do not always appear to generalize in an appropriate manner.

* As an illustration, suppose that 
    - one classification problem consists of 
        - A → O and 
        - AB → no O trials, 
        - while a second consists of 
            - AC → O and 
            - ABC → no O trials. 
    - These problems are identical except that both trial types in the second problem have an added element, C. 
* Just as in the experiment described earlier (Fig. 2.3), intuition suggests that adding such an element makes the <font color="red">trial types more similar</font> and therefore must make the <font color="red">classification harder to learn</font>, a prediction that has been <font color="red">confirmed empirically in an animal discrimination learning study</font> by Pearce and Redhead (1993). However, Pearce (1994) showed that <font color="blue">back-propagation networks are unable to reproduce this effect</font>.

<img src="figures/cap2.6.png" width=600 />

# CONFIGURAL MODELS

The results we have reported carry a straightforward message: 
* <font color="red">Representations of complex stimuli</font> need to be bound quite <font color="red">tightly together</font> in such a way that learning <font color="red">something new</font> about one of the elements of a stimulus does not strongly transfer back to the stimulus itself. 
* In other words, <font color="red">stimuli need to be coded in a configural rather than an elemental manner</font>.

<font color="blue">The basic idea</font> behind configural networks 
* is illustrated in Fig. 2.6. 
* When a stimulus is presented, 
    - it is <font color="red">coded</font> via 
        - the <font color="red">direct activation</font> of 
        - a <font color="red">unique hidden unit</font> 
            - dedicated to <font color="red">that stimulus</font>.

Of course, as it stands, this model would be inadequate, 
* since it <font color="red">does not allow any degree of generalization</font>, 
* so it needs to be supplemented with the idea that a hidden unit dedicated to a given stimulus can be activated by another stimulus to the extent that the two are similar.

<img src="figures/cap2.13.png" width=600 />

#### Models 
* Models based on these design characteristics have been explored by Pearce (1987, 1994, 2002) and Kruschke (1992, 1993). 
    - Although they differ in their details, each <font color="red">provides approximately the correct trade-off between generalization and protection from interference</font>.
* Pearce’s (1994) configural model
    - Let us consider how Pearce’s (1994) configural model can predict the <font color="red">absence of catastrophic forgetting</font>.
    - The model has been implemented 
        - as a connectionist network that includes 
            - four layers of units: 
                - an input layer, 
                - two layers of hidden units 
                    - (an output layer and 
                    - a layer of configural units) and 
                - a single-unit layer that 
                    - represents the outcome. 
        - The activation level of this outcome unit 
            - determines the system response. 
        - The units in the input layer 
            - can be at an activation level of either 
                - 0 or 
                - 1, 
            - depending on whether the element of the stimulus pattern the unit is representing is 
                - absent or 
                - present, respectively. 
        - Each input unit is connected to a single output unit. 
        - The intervention of these output units ensures that each stimulus pattern <font color="red">activates maximally</font> (an activation level of 1) a <font color="red">single configural unit</font>.
            - Henceforth, the <font color="blue">configural unit can be regarded as representing a particular stimulus pattern</font>.
    - <font color="red">If we assume that configural unit x becomes maximally activated when input pattern X is presented, what will be its activation value when input pattern Y is presented?</font> 
        - According to the model, the activation value $a_{x}$ of configural unit x will be proportional to the degree of similarity between the stimulus patterns.
        - Pearce (1987, 1994) assumes that the <font color="red">similarity of input patterns X and Y is a linear function of the number of elements they share</font> :
        <img src="figures/cap2.14.png" width=600 />
            - where 
                - $n_{C}$ is the number of input units both patterns share and 
                - $n_{X}$ and $n_{Y}$ are the number of input units that are specific to each stimulus pattern.
        - <font color="red">Thus, the activation level of the outcome unit when pattern X is presented ($V_{x}$) has a double origin</font>. 
            - Part of the activation is conveyed 
                - by the <font color="blue">connection between</font> 
                    - the <font color="blue">configural unit maximally activated</font> and 
                    - the <font color="blue">outcome unit ($w_{x}$),</font> and
            - some of its activation 
                - comes through the <font color="green">connections between</font> 
                    - <font color="green">other configural units activated</font> 
            - through generalization and the outcome unit:
                <img src="figures/cap2.15.png" width=600 />
                - where 
                    - $S_{x,i}$ is the squared activation of these other configural units [see equation (4)] and 
                    - $w_{i}$ represents their connections to the outcome unit.
    - <font color="red">Associative learning involves modifications only in</font> 
        - the <font color="red">weight of the connection between</font> 
            - the <font color="blue">configural unit maximally activated</font> (one for each stimulus pattern) and 
            - the <font color="blue">outcome unit</font>. 
        - These modifications are governed by equation (6):
            <img src="figures/cap2.16.png" width=600 />
            - Note the similarity between equations (3) and (6). 
            - As in the Rescorla–Wagner model, the modification of weights is proportional to 
                - an error term, 
                - $α$ represents the learning rate, and 
                - $t$ is set 
                    - to 1 
                        - when the outcome is present and 
                    - to 0 
                        - otherwise.
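
The description above can be turned into a small simulation. This is a minimal sketch, not Pearce's own implementation: equations (4) to (6) appear here only as images, so the forms used below are assumptions taken from the surrounding text, namely $S_{x,y} = n_{C}^{2} / ((n_{C}+n_{X})(n_{C}+n_{Y}))$ for the squared activation, $V_{x} = w_{x} + \sum_{i} S_{x,i} w_{i}$ for the outcome unit, and $\Delta w_{x} = α(t - V_{x})$ for learning. The names `similarity` and `ConfiguralNet`, the learning rate and the example patterns are all hypothetical.

```python
def similarity(x, y):
    """Squared activation S between two patterns given as sets of elements.

    Assumed form: S = n_C**2 / ((n_C + n_X) * (n_C + n_Y)), where n_C counts
    shared elements and n_X, n_Y count the elements specific to each pattern.
    len(x) equals n_C + n_X and len(y) equals n_C + n_Y.
    """
    n_c = len(x & y)
    return n_c ** 2 / (len(x) * len(y))


class ConfiguralNet:
    """One configural unit per trained pattern, each with one outcome weight."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha   # learning rate (assumed value)
        self.weights = {}    # frozenset(pattern) -> weight to the outcome unit

    def output(self, pattern):
        # V_x = w_x + sum_i S_{x,i} * w_i; the unit for the pattern itself
        # has S = 1, so a single sum covers both sources of activation.
        p = frozenset(pattern)
        return sum(similarity(p, q) * w for q, w in self.weights.items())

    def train(self, pattern, t):
        # Only the weight of the maximally activated configural unit changes.
        p = frozenset(pattern)
        self.weights.setdefault(p, 0.0)
        self.weights[p] += self.alpha * (t - self.output(p))


# Phase 1: pattern X predicts the outcome. Phase 2: an overlapping pattern Y
# predicts no outcome. X is never shown again during phase 2.
X, Y = {"a", "b", "c"}, {"c", "d", "e"}
net = ConfiguralNet()
for _ in range(200):
    net.train(X, 1.0)
for _ in range(200):
    net.train(Y, 0.0)

print("response to X after learning Y:", round(net.output(X), 3))
print("response to Y:", round(net.output(Y), 3))
```

Under these assumptions the weight for X is untouched while Y is learned, so the response to X stays close to 1 (about 0.99 with one shared element out of three). The small drop comes only from generalization from Y's slightly negative weight, which is the sense in which the model avoids catastrophic forgetting.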

#### Results (Pearce’s configural model)

* Pearce’s model is able to account for a good deal of the data reviewed in this chapter thus far. 
    - Because stimuli are represented configurally, 
        - the model’s predictions are unaffected by whether or not a particular classification is linear, and 
        - hence it can predict the learnability of the feature-neutral discrimination shown in Fig. 2.2.
            <img src="figures/cap2.10.png" width=600 />
        - Also, the theory predicts the faster learning for A than for AB shown in Fig. 2.3, 
            - since there is less generalization between A and ABC than between AB and ABC
            <img src="figures/cap2.6.png" width=600 />
        - Finally, it accounts, at least partially (see Shanks et al., 1998a) for the data in Fig. 2.4, as Wilson and Pearce (1992) have shown formally.
        <img src="figures/cap2.8.png" width=600 />
        <img src="figures/cap2.9.png" width=600 />
* To conclude, Pearce’s configural model allows us to give 
    - a degree of coherence to the pattern of results described so far.
    - The <font color="red">absence of catastrophic forgetting</font> can be understood in terms of the model. 
    - It provides a solution to the catastrophic forgetting problem, 
        - which involves a <font color="red">trade-off between two empirical constraints</font>, namely, the fact that <font color="blue">new information does not cause complete unlearning of prior knowledge</font> and the <font color="green">ability to adapt to the new incoming information through a process of generalization</font>. 
    - This particular solution is based on the creation of <font color="red">semi-distributed representations of the incoming information</font> <font color="blue">by means of the assignment of exclusive configural representations</font> to each new input pattern and the operation of a generalization mechanism based on pattern similarity. 
    - By <font color="red">abandoning the notion that stimuli are coded elementally</font>, findings that are problematic for single-layer networks can be encompassed.
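
The generalization argument made above for Fig. 2.3 can be checked with simple arithmetic. This worked example assumes that the squared activation has the form $S_{x,y} = n_{C}^{2} / ((n_{C}+n_{X})(n_{C}+n_{Y}))$; the equation itself is shown only as an image in these notes, so treat the form as an assumption.

$$
S_{A,ABC} = \frac{1^{2}}{(1+0)(1+2)} = \frac{1}{3},
\qquad
S_{AB,ABC} = \frac{2^{2}}{(2+0)(2+1)} = \frac{2}{3}
$$

On this reading, AB receives twice as much generalized activation from ABC as A does, so a discrimination that opposes AB to ABC should be acquired more slowly than one that opposes A to ABC.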

# REPRESENTATIONAL FLEXIBILITY

At this point it would be nice to end the chapter with the conclusion that connectionist models incorporating some form of configural coding scheme provide powerful models of basic human learning processes. <font color="blue">However, such models turn out to be lacking an absolutely key capacity, namely</font> <font color="red">representational flexibility</font>.

#### blocking 

##### Reference
* [5] Blocking - Kamin, 1969 - http://m.blog.naver.com/khrireg/10185382717

Williams et al. used an adaptation of the classic two-stage “blocking” design of Kamin (1968).

<img src="figures/cap2.17.png" width=600 />

Williams et al. (1994, Experiment 4) conducted a blocking experiment such as this, but pretrained various groups of participants in different ways.
* In one condition the pretraining was designed to foster an “elemental” strategy, whereby participants would to some degree analyse each cue separately.
* We might expect to see that cue A is blocked and receives lower ratings than cue C, and this is exactly what happened for participants who received the elemental pretraining. However, for those participants given the configural pretraining, no blocking was observed, and instead A and C received equal ratings.
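
The elemental prediction of blocking can be sketched with the Rescorla–Wagner delta rule. This is an illustration under stated assumptions, not a reproduction of Williams et al.'s procedure: the design is assumed to be the standard one (stage 1: B → O; stage 2: AB → O and CD → O), and the function name `rescorla_wagner`, the learning rate and the trial counts are hypothetical.

```python
def rescorla_wagner(trials, weights=None, alpha=0.2):
    """Apply the delta rule to a sequence of (cues, outcome) trials.

    Each cue is a single character; every cue present on a trial shares the
    same error term (outcome minus the summed weights of the present cues).
    """
    weights = dict(weights or {})
    for cues, t in trials:
        v = sum(weights.get(c, 0.0) for c in cues)
        for c in cues:
            weights[c] = weights.get(c, 0.0) + alpha * (t - v)
    return weights


# Assumed design: B is pretrained alone, then AB and the control compound CD
# are both paired with the outcome.
stage1 = [("B", 1.0)] * 50
stage2 = [("AB", 1.0), ("CD", 1.0)] * 50

w = rescorla_wagner(stage2, rescorla_wagner(stage1))
print({cue: round(weight, 3) for cue, weight in sorted(w.items())})
```

Because B already predicts the outcome by stage 2, the error on AB trials is close to zero and A gains almost no weight, whereas C and D split the outcome between them. An elemental learner therefore rates A below C, which is the blocking pattern. The equal ratings reported after configural pretraining are not what this rule produces.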

<img src="figures/cap2.18.png" width=600 />

* However, in this experiment there were two modifications which we, following Williams (1995), thought might encourage a <font color="red">greater degree of elemental processing</font>. 
* First, in an initial training stage (stage 1), only the elements were presented, so participants’ initiation should have suggested to them that the task had an elemental nature.
* Second, during the stage in which the A → O/AB → no O discrimination was learned, participants also saw B → no O trials.

<img src="figures/cap2.19.png" width=600 />
<img src="figures/cap2.20.png" width=600 />

<img src="figures/cap2.8.png" width=600 />
<img src="figures/cap2.9.png" width=600 />

The pair of results given in Figs. 2.4 and 2.7 confirms Williams et al.’s conjecture that the representation of a compound stimulus AB can be <font color="red">flexibly altered</font>.

In the meantime, however, what we can say is that <font color="red">current connectionist models are missing an important feature</font> in their inability to accommodate representational flexibility.

# ROLE OF THE HIPPOCAMPUS IN STIMULUS CODING

<img src="https://upload.wikimedia.org/wikipedia/commons/2/2e/Gray739-emphasizing-hippocampus.png" width=600 />

* Efforts to develop connectionist models of associative learning have commonly been informed by <font color="red">neuropsychological considerations</font>, so in this section I briefly review the evidence concerning the neural substrates of learning.
* Paired-associate learning, which requires participants to learn arbitrary associations between word pairs, and which therefore resembles the sorts of tasks used in the experiments described here, is normally included in test batteries for detecting amnesia.
* Despite this consensus, many researchers believe that the hippocampus does <font color="red">not contribute to all associative learning tasks</font> and that <font color="blue">instead its involvement is restricted to declarative memory tasks</font>, in which successful performance requires conscious retrieval of facts or episodes.
* The claim that amnesia is restricted to declarative memory deficits is <font color="red">controversial</font>, however, for two reasons. 
* Thus, it is possible that the <font color="red">hippocampus is involved</font> in, and indeed is necessary for normal performance in, <font color="red">essentially all forms of associative learning task</font>.
* Putting that issue to one side, a more subtle and germane claim is that the hippocampus is particularly involved in tasks that require the <font color="red">formation of configural representations</font>.
* The <font color="red">hippocampal-configural theory</font> of Sutherland and Rudy is particularly attractive, not only because it grounds the elemental/configural distinction on an underlying neural substrate but also because it <font color="red">hints that the representational flexibility</font> described in the previous section may be <font color="red">attributable to differential loading of tasks on hippocampal function</font>.
* <font color="red">Unfortunately, though, other evidence has failed to support the theory</font>. Perhaps most clear-cut is evidence (Gallagher & Holland, 1992) that hippocampal lesions do not impair acquisition of the feature-neutral discrimination (see Fig. 2.2) and also evidence that human amnesic patients are no more impaired on the transverse patterning task than on a linear discrimination of equivalent difficulty (Reed & Squire, 1999). It has been confirmed, on the other hand, that negative patterning is impaired (McDonald et al., 1997). Overall, therefore, it does not appear that the hippocampal-configural theory, as stated in its original form, is adequate.
* Rudy & Sutherland (1995) have argued persuasively that this approach can account for much of the apparently conflicting evidence on the effects of hippocampal lesions, and O’Reilly & Rudy (2001) have extended it further by suggesting that the hippocampal gain process is particularly critical in situations that require the rapid formation of conjunctive representations. The details of how this proposal may be turned into an explicit computational model are beyond the scope of this chapter, but recent efforts (Gluck & Myers, 1997; O’Reilly & Rudy, 2001; Schmajuk & DiCarlo, 1992) hold out considerable promise of integrating a wide range of data.

# CONCLUSIONS

* “localist” and “distributed”
* superposition and equipotentiality.
* similarity-based and rule-based behaviour, 
    - or (in other terminology) between “implicit” and “explicit” behaviour. 

# References
* [1] Connectionist Models in Cognitive Psychology - http://www.amazon.com/Connectionist-Cognitive-Psychology-Studies-Cognition/dp/0415646901/
* [2] Behavior Models - http://www.slideshare.net/zennboy/class-feb-15
* [3] Rescorla-Wagner model (Wikipedia) -  https://ko.wikipedia.org/wiki/%EA%B3%A0%EC%A0%84%EC%A0%81_%EC%A1%B0%EA%B1%B4%ED%99%94%EC%9D%98_%ED%96%89%EB%8F%99%EC%A0%81_%EC%97%B0%EA%B5%AC#.EB.A0.88.EC.8A.A4.EC.BD.9C.EB.9D.BC-.EC.99.80.EA.B7.B8.EB.84.88_.EB.AA.A8.EB.8D.B8
* [4] Rescorla-Wagner model (blog) - http://m.blog.naver.com/khrireg/10187749924
* [5] Blocking - Kamin, 1969 - http://m.blog.naver.com/khrireg/10185382717