# 2. Connectionist models of basic human learning processes

* 싸이그래머 / QGM: Part 3 - Connectionist Models [1]
* 김무성

# Contents
* LINEAR AND NONLINEAR CLASSIFICATIONS
* UNIQUE CUES
* CHALLENGING UNIQUE CUE MODELS
* BACK-PROPAGATION MODELS
* CONFIGURAL MODELS
* REPRESENTATIONAL FLEXIBILITY
* ROLE OF THE HIPPOCAMPUS IN STIMULUS CODING
* CONCLUSIONS

It is natural to believe that the current high level of interest in connectionist models of cognitive processes is attributable to the impressive accounts such models can provide of human competencies (e.g. reading, classification, skill acquisition) and that the (arguable) improvement such models provide in explanatory power over nonconnectionist models is the main impetus to their growing acceptance amongst cognitive psychologists.

In this chapter I discuss elementary learning processes in various human analogues of animal conditioning procedures and use the findings to draw conclusions about the basic properties required in realistic connectionist models.

#### High-level cognitive processes & Animal Pavlovian conditioning

<img src="http://2012books.lardbucket.org/books/beginning-psychology/section_11/72ff70c6cb32a57995de5d2081132da7.jpg" width=600 />

* In fact, there is a remarkable degree of similarity between the principles governing Pavlovian conditioning and human associative learning (Shanks, 1995) and there seems little risk at present that theories of human and animal learning will diverge seriously. 
* A second reason is more to do with persuasiveness: No matter how compelling the evidence is in Pavlovian conditioning for the involvement of connectionistic processes, there will always be a leap of faith involved in seeing a relationship between such processes and more high-level human abilities.
* Studies of human associative processes provide an indispensable bridge spanning these two extreme fields, especially if high-level cognitive processes (e.g. causal induction) can be shown to obey the same rules as conditioning (Dickinson, 2001; Lober & Shanks, 2000).

#### supervised learning

* Very many of the so-called “supervised” networks and architectures that have been most extensively studied in the last few years take as their input a pattern of activation across a set of input units, generate an output pattern across a set of output units, and then receive some sort of signal from the outside world, indicating what the output pattern generated by the system should have been.
* This sort of simple situation provides a straightforward model for
    - associative learning and 
    - memory tasks, which conform to this input–output mapping description. 
* Some examples include 
    - generating the past tense of a verb from its stem, 
    - generating the pronunciation of a word from its written form, and
    - deciding the category to which an object belongs.
* These competences are complex, however, in a number of ways.

#### Simpler situation

* In consequence, it is necessary to turn to much simpler situations in which the learning task is stripped down to its bare essentials in such a way that the coding scheme, learning algorithm, and so on can be more directly and independently studied.
* Participants are instructed to imagine that they are food allergists attempting to work out which foods various fictitious patients are allergic to. 
    - On each trial a meal is presented to the patient, comprising one or a small number of foods, and the participant has to predict whether an allergy ensues. 
    - Then feedback is given about the actual outcome and the participant goes on to the next trial. 
    - Over a series of trials (usually in the order of 50–200), various trial types are presented repeatedly in an intermixed order until the participant is correctly predicting the outcome on each trial. 
    - Finally, a transfer test is given to assess some aspect of the learning process. 
    - The parallel to Pavlovian conditioning should be clear, with the foods playing the role of conditioned stimuli (CSs) and the allergy the role of the unconditioned stimulus (US).
* As we will see, despite the simplicity of the ability we are attempting to simulate, there are profound difficulties in creating an adequate connectionist theory, but a good deal of progress has nevertheless been made. 
* Also, it should be mentioned that the experimental designs on which this chapter focuses can generally be solved without the need for selective attention, and the models I review do not incorporate attentional processes.

# LINEAR AND NONLINEAR CLASSIFICATIONS

#### feedforward networks or pattern associators

<img src="figures/cap2.1.png" width=600 />

On every trial, some item consisting of a set of cues is presented. We calculate the activation $a_{o}$ of each output unit:

<img src="figures/cap2.2.png" width=600 />

Next, we calculate the “error”, $d_{o}$, on the output unit between the obtained output, $a_{o}$, and the desired output, $t_{o}$:

<img src="figures/cap2.3.png" width=600 />

Finally, we use the well-known “delta” rule (Stone, 1986) to change each of the weights in proportion to the error:

<img src="figures/eq2.3.png" width=600 />
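The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the chapter's own implementation: it assumes the equations take their standard forms (activation as the weighted sum of the cue activations, error as $t_{o} - a_{o}$, and a weight change of $\alpha \cdot a_{i} \cdot d_{o}$), and the function names and the learning rate value are hypothetical.

```python
def activation(weights, cues):
    """Output activation a_o: weighted sum of the input cue activations."""
    return sum(w * a for w, a in zip(weights, cues))


def delta_update(weights, cues, target, alpha=0.1):
    """One delta-rule step: change each weight in proportion to the error."""
    error = target - activation(weights, cues)  # d_o = t_o - a_o
    return [w + alpha * a * error for w, a in zip(weights, cues)]


# Hypothetical example: a single cue paired with the outcome on every trial.
weights = [0.0]
for _ in range(100):
    weights = delta_update(weights, cues=[1.0], target=1.0)
# The weight approaches the target value of 1.0 as the error shrinks.
```

Because each change is proportional to the remaining error, learning is fast at first and slows as the output approaches the desired value.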

#### Elemental coding assumption & Rescorla–Wagner theory

#### References
* [2] Behavior Models - http://www.slideshare.net/zennboy/class-feb-15
* [3] Rescorla–Wagner model (Korean Wikipedia) -  https://ko.wikipedia.org/wiki/%EA%B3%A0%EC%A0%84%EC%A0%81_%EC%A1%B0%EA%B1%B4%ED%99%94%EC%9D%98_%ED%96%89%EB%8F%99%EC%A0%81_%EC%97%B0%EA%B5%AC#.EB.A0.88.EC.8A.A4.EC.BD.9C.EB.9D.BC-.EC.99.80.EA.B7.B8.EB.84.88_.EB.AA.A8.EB.8D.B8
* [4] Rescorla–Wagner model (blog) - http://m.blog.naver.com/khrireg/10187749924

In pattern associators, associative knowledge is represented in weighted
connections between elements of the stimulus and elements of the outcome. The model given in equations (1)–(3) is in fact formally equivalent to the well-known Rescorla–Wagner theory (Rescorla & Wagner, 1972) of animal Pavlovian conditioning, a theory that has dominated conditioning research for nearly 30 years (Hall, 1991; Miller, Barnet, & Grahame, 1995) and which has many proven empirical successes.
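As a sketch of that equivalence, the two update rules can be set side by side. The forms below are the usual textbook ones and are an assumption here: $V$, $\lambda$, and $\beta$ are conventional Rescorla–Wagner notation, and $w_{io}$ is assumed to denote the weight from input unit $i$ to output unit $o$.

$$
\Delta V_{i} = \alpha_{i}\,\beta\,\Big(\lambda - \sum_{j\ \text{present}} V_{j}\Big)
\qquad\text{versus}\qquad
\Delta w_{io} = \alpha\, a_{i}\,\Big(t_{o} - \sum_{j} a_{j}\, w_{jo}\Big)
$$

With binary cue activations ($a_{i} = 1$ for a cue that is present, 0 otherwise), the two coincide when $V_{i}$ is read as $w_{io}$, $\lambda$ as $t_{o}$, and the product $\alpha_{i}\beta$ as the learning rate $\alpha$.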

<font color="blue">Although this learning algorithm has been successfully applied to many tasks, there is evidence to suggest that its “elemental” representational assumption is inadequate.</font>

In learning to associate one pattern with another, for instance, it appears that in addition to learning direct associations between the outcome and the separate elements that make up the stimulus, higher-order representations of the stimulus can also be involved in associations with the outcome.

<font color="red">Evidence for the inadequacy of the notion that the elements of the cue are directly and independently associated with the outcome comes from a number of sources.</font>

* According to the elemental coding assumption, a compound cue such as ABC, comprising three elements, should elicit a response that is proportional to the sum of the weights of the elements A, B, and C. If these elements have previously been separately paired with a US and have each acquired asymptotic weights of 1.0, then the compound ABC should evoke summation, that is to say, a level of responding that is far higher than that elicited by the elements themselves.
* <font color="blue">Although the precise conditions remain to be clarified, there is now convincing evidence against this prediction from studies that reveal no evidence of summation in experiments of this sort.</font>
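The summation prediction can be made concrete with a small sketch. It assumes the standard delta-rule update, a target of 1.0 for the US, and hypothetical values for the learning rate and number of trials; the function name is illustrative only.

```python
def train_single_cue(n_trials=200, alpha=0.1):
    """Pair one cue with the US on every trial and return its final weight."""
    weight = 0.0
    for _ in range(n_trials):
        weight += alpha * (1.0 - weight)  # delta rule with target 1.0
    return weight


# A, B, and C are each trained separately to (near) asymptote.
w_a = train_single_cue()
w_b = train_single_cue()
w_c = train_single_cue()

# Under elemental coding, the response to the compound is the sum of the weights.
compound_abc = w_a + w_b + w_c
```

Each element ends close to 1.0, so the predicted response to ABC is close to 3.0, roughly three times the response to any single element. It is this prediction that the studies cited above fail to confirm.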

#### nonlinearly separable

<img src="http://www.massey.ac.nz/~mjjohnso/notes/59302/xor_plot.gif" />

Another, better-established source of evidence is that humans can learn nonlinearly separable classifications.

Consider a single-layer network consisting of two input units (denoted x and y) connected to one output unit, where the inputs and the correct output, $t_{o}$, can take on values between 0.0 and 1.0, and where the network is trained to classify input patterns into one of two categories.

<img src="figures/cap2.4.png" />

It follows that the only types of classification such a system can learn are linearly separable ones in which the members of the two categories can be distinguished by a simple linear boundary. 
* Specifically, <font color="blue">for the delta rule model to learn a discrimination, it must be possible to construct a straight line in the x,y input space</font> that exactly divides the stimuli into the correct categories.

<font color="red">However, people have no difficulty learning nonlinearly separable discriminations that the delta rule model would be unable to master.</font>
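The straight-line requirement can be illustrated with a brute-force search. The sketch below is an illustration under stated assumptions, not part of the chapter: it uses binary AND and XOR patterns as hypothetical examples, classifies a pattern into category 1 when $w_{x}x + w_{y}y$ exceeds a threshold, and searches a coarse grid of candidate lines.

```python
def find_separating_line(patterns, grid=None):
    """Search a coarse grid for (w_x, w_y, theta) such that
    w_x * x + w_y * y > theta exactly when the pattern's category is 1."""
    if grid is None:
        grid = [i / 4 for i in range(-8, 9)]  # -2.0, -1.75, ..., 2.0
    for w_x in grid:
        for w_y in grid:
            for theta in grid:
                if all((w_x * x + w_y * y > theta) == bool(category)
                       for (x, y), category in patterns):
                    return w_x, w_y, theta
    return None  # no line on the grid separates the two categories


# Each pattern is ((x, y), category).
AND_PATTERNS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_PATTERNS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_line = find_separating_line(AND_PATTERNS)  # some line is found
xor_line = find_separating_line(XOR_PATTERNS)  # None
```

A grid search cannot prove in general that no line exists, but for XOR the result agrees with the known fact that no straight line separates the two categories.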

#### experiment

Consider the following simple experiment. 
* In this task, 
    - 16 participants (UCL students) 
    - had to learn relationships 
    - between 
        - foods people ate and 
        - allergic reactions caused by those foods. 
* On each trial, 
    - a list of foods was described that the person had eaten, and 
    - the participant had to choose one out of a selection of possible allergies that the person suffered. 
        - The foods were such things as 
            - bananas, 
            - avocados, 
            - etc., and 
        - the allergies were called 
            - Type 1, 
            - Type 2, 
            - etc. 
    - Some of the people suffered no allergic reaction. 
    - After making their predictions, participants received feedback telling them the correct outcome for that trial. 
* Embedded in the design were critical trials of the following sorts which conform to a feature-neutral discrimination (Rudy & Sutherland, 1995): 
    - A → O, 
    - BC → O, 
    - C → no O, and 
    - AB → no O, 
        - where A–C are different foods, 
        - O is an allergy, and 
        - no O is no allergy. 
* Participants received 10 trials of each of these types in a fully randomized and intermixed fashion. 
* In fact, the full design included 
    - three distinct sets of trials conforming to this design; 
    - we have collapsed the data across these three sets.

Examination of the trial types reveals that the discrimination is not linearly separable. Each of the elements (A, B, C) is equally often paired with the outcome and with no outcome, so the discrimination cannot be solved on the basis of summation of the weights of individual stimulus elements.
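A short simulation shows what an elemental pattern associator does with these four trial types. It is a sketch under assumptions: the delta rule takes its standard form, targets are coded 1.0 (allergy) and 0.0 (no allergy), the learning rate and number of sweeps are hypothetical, and the trials are presented in a fixed order for reproducibility rather than randomly intermixed as in the experiment.

```python
# Cue order: A, B, C. Targets: 1.0 = allergy (O), 0.0 = no allergy.
TRIALS = [
    ([1, 0, 0], 1.0),  # A  -> O
    ([0, 1, 1], 1.0),  # BC -> O
    ([0, 0, 1], 0.0),  # C  -> no O
    ([1, 1, 0], 0.0),  # AB -> no O
]


def output(weights, cues):
    """Summed weight of the cues present on a trial."""
    return sum(w * a for w, a in zip(weights, cues))


def train_elemental(trials, alpha=0.02, n_sweeps=2000):
    """Trial-by-trial delta rule over repeated sweeps through the trial types."""
    weights = [0.0] * len(trials[0][0])
    for _ in range(n_sweeps):
        for cues, target in trials:
            error = target - output(weights, cues)
            weights = [w + alpha * a * error for w, a in zip(weights, cues)]
    return weights


weights = train_elemental(TRIALS)
responses = [output(weights, cues) for cues, _ in TRIALS]
```

The weights settle near 0.5 for A and C and near zero for B, so all four trial types evoke roughly the same intermediate response of about 0.5. The network cannot tell the allergy trials from the no-allergy trials.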

#### result

 A moment’s thought reveals that a pattern associator like that in Fig. 2.1 cannot solve the discrimination, so the key question is, can people? The answer is “yes”, as Fig. 2.2 shows.

The graph then plots the means of these scores transformed into percentages. With this scoring scheme, the chance level is 25% for trials associated with an allergy and 75% for trials associated with no allergy. 

<img src="figures/cap2.5.png" width=600 />

<img src="figures/cap2.6.png" width=600 />

<font color="red">This is a result that single-layer pattern associators governed by the delta rule are unable to explain.</font>

# UNIQUE CUES

Just as it has been recognized for many years that humans can solve such nonlinearly separable learning problems, so a simple account of how this is achieved has long been available.

#### unique elements

Wagner and Rescorla (1972) suggested that whenever two elements occur together, their combination gives rise to further “unique” elements which function much like the elements themselves.

#### unique cue model

In network terms, unique cue theories are easy to implement. As elaborated by Gluck (1991), the idea would be to supplement the layer of input units with additional units that correspond to the unique cues.
* Thus, in addition to input units for elements A and B, there would be an AB unit which would only be activated when both A and B are present.
* What about higher-order configurations? ABC?
    - One extreme possibility is that the network should contain seven input units (three for the elements, one each for configurations AB, AC, and BC, and one for configuration ABC) corresponding to the complete power set of elements.
    - Because the number of units in the full power set grows exponentially with the number of elements, an alternative is to constrain the <font color="red">unique cue units to consist of just the element-pairs AB, AC, and BC.</font>

It should be plain that the results of the experiment shown in Fig. 2.2 are readily accounted for by a unique cue analysis.

<img src="figures/cap2.5.png" width=600 />

* In the 
    - A → O, 
    - BC → O, 
    - C → no O, 
    - AB → no O 
    - feature-neutral problem, 
        - none of the elements can become a reliable predictor of the category, 
        - but the unique cues can. 
* When unique cues are included, the trial types become 
    - A → O, 
    - BCX → O, 
    - C → no O, and 
    - ABY → no O, 
        - where X is the unique cue 
            - created by the combination of elements B and C and 
        - Y is the unique cue 
            - created by the combination of A and B. 
    - It is simple to see that the discrimination is solved if 
        - cue A and the unique cue X acquire positive weights for the category, 
        - B and C have weights of zero, and 
        - Y has a negative weight.
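The unique cue analysis above can be checked with a small simulation. It is a sketch under assumptions: the standard delta rule, targets of 1.0 and 0.0, equal learning rates for elements and unique cues, a fixed trial order, and hypothetical parameter values.

```python
# Cue order: A, B, C, X, Y, where X is the unique cue for BC and Y for AB.
TRIALS = [
    ([1, 0, 0, 0, 0], 1.0),  # A   -> O
    ([0, 1, 1, 1, 0], 1.0),  # BCX -> O
    ([0, 0, 1, 0, 0], 0.0),  # C   -> no O
    ([1, 1, 0, 0, 1], 0.0),  # ABY -> no O
]


def output(weights, cues):
    """Summed weight of the cues (elements and unique cues) present on a trial."""
    return sum(w * a for w, a in zip(weights, cues))


def train(trials, alpha=0.1, n_sweeps=2000):
    """Trial-by-trial delta rule, starting from zero weights."""
    weights = [0.0] * len(trials[0][0])
    for _ in range(n_sweeps):
        for cues, target in trials:
            error = target - output(weights, cues)
            weights = [w + alpha * a * error for w, a in zip(weights, cues)]
    return weights


weights = train(TRIALS)
responses = [output(weights, cues) for cues, _ in TRIALS]
```

With the two unique cue units added, every trial type reaches its target. Starting from zero weights, the network arrives at the solution described above: positive weights for A and X, weights near zero for B and C, and a negative weight for Y.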

# CHALLENGING UNIQUE CUE MODELS

Despite the obvious predictive successes of unique cue models, it is fairly easy to show that the patterns of generalization predicted by <font color="red">such networks are inappropriate</font>, and we can illustrate this with an experiment by Shanks, Charles, Darby, and Azmi (1998a, Experiment 3) which uses a design originally adopted in animal conditioning studies by Pearce and Wilson (1991; Wilson & Pearce, 1992). 

#### experiment

* The basic idea is that 
    - participants learn 
        - an A → O, 
        - AB → no O 
        - discrimination in the first stage, 
            - which should establish 
                - A as a predictor of the outcome and 
                - B as an inhibitor of that outcome 
                    - which counteracts A’s influence. 
    - Then, participants learn that 
        - B by itself also predicts the outcome (B → O), and 
    - finally retention of the original discrimination between A and AB is tested. 
    - Unique cue theories, as we shall see, predict a dramatic influence of the B → O trials.

<img src="figures/cap2.7.png" width=600 />

* To provide adequate controls, 
    - the full design needed to be slightly more complex (see Table 2.1).
    - The critical trial types in the first stage of the experimental condition were intermixed 
        - A → $O_{1}$, 
        - AB → no O, and 
        - AC → $O_{1}$ trials. 
    - In the second stage, participants saw 
        - B → $O_{1}$ trials, 
    - and then in the test phase 
        - they were presented with 
            - A, 
            - AB, and 
            - AC test stimuli. 
    - Once again, the to-be-learned items were foods or combinations of foods, and the outcomes were various allergies (or no allergy).
* <font color="red">What are the predictions about performance in the test phase?</font> 

* <font color="blue">Let us begin by ignoring possible unique cues.</font>
    - In the first stage, 
        - element A should acquire a positive weight for outcome $O_{1}$,
        - element B should acquire an equal but negative weight for the same outcome, and 
        - C should have a weight of zero. 
    - In the second phase, 
        - cue B’s negative weight will be dramatically altered, 
            - since B now predicts the outcome whereas in the first stage it was negatively associated with it. 
    - In the test, we now have 
        - one compound (AB) each of whose elements is strongly connected to $O_{1}$, and 
        - another compound (AC) consisting of an element (A) which is connected to the outcome and 
        - another element (C) which should have a weight of zero. 
* In sum, then, <font color="red">participants should be more likely to predict outcome $O_{1}$ on an AB test trial than on an AC one.</font>
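The elemental predictions can be reproduced with a short simulation of the three phases. This is a sketch under assumptions: the standard delta rule, targets of 1.0 for $O_{1}$ and 0.0 for no outcome, a fixed trial order within each stage, and hypothetical parameter values. Only the critical trial types are modelled, not the full design in Table 2.1.

```python
CUES = ["A", "B", "C"]


def encode(*present):
    """Binary input vector with 1.0 for each cue that is present."""
    return [1.0 if cue in present else 0.0 for cue in CUES]


def output(weights, cues):
    return sum(w * a for w, a in zip(weights, cues))


def train(weights, trials, alpha=0.1, n_sweeps=2000):
    """Trial-by-trial delta rule, continuing from the given weights."""
    for _ in range(n_sweeps):
        for cues, target in trials:
            error = target - output(weights, cues)
            weights = [w + alpha * a * error for w, a in zip(weights, cues)]
    return weights


STAGE1 = [(encode("A"), 1.0), (encode("A", "B"), 0.0), (encode("A", "C"), 1.0)]
STAGE2 = [(encode("B"), 1.0)]

after_stage1 = train([0.0, 0.0, 0.0], STAGE1)  # A near +1, B near -1, C near 0
after_stage2 = train(after_stage1, STAGE2)     # B revalued to near +1

test_ab = output(after_stage2, encode("A", "B"))
test_ac = output(after_stage2, encode("A", "C"))
```

After stage 2 the summed weight for AB is about 2.0 and for AC about 1.0, which is the elemental prediction that $O_{1}$ should be chosen more readily on AB than on AC test trials.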

<img src="figures/cap2.7.png" width=600 />

<img src="figures/eq2.3.png" width=600 />

* <font color="blue">What are the predictions if unique cue inputs are added?</font> 
    - <font color="red">Here, things are a little more complicated, but the conclusion is that participants are predicted to be at least as likely to select $O_{1}$ on an AB test trial as on an AC one.</font> 
    - The difficulty with deriving predictions arises from the indeterminacy of the relative learning rate parameters <font color="green">[α in equation (3)]</font> for the elements and the unique cues, but we can explore certain boundary conditions. 
    - <font color="orange">If the salience and hence learning rate for the unique cues is zero</font>,
        - then the unique cue version reduces to the simple elemental model we have already considered. 
    - <font color="orange">If the learning rate is the same for the elements and unique cues</font>,
        - then participants should again be more likely to predict 
            - $O_{1}$ on an AB test trial 
                - than on an AC one, 
            - just as in the pure elemental model.
    - <font color="orange">Finally, if the learning rate or salience of a unique cue is much greater than that of an element</font>, then 
        - X (the unique cue created by A and B) will acquire a strong negative weight (close to −1.0) in stage 1 and 
        - B will maintain a weight close to zero. 
        - B’s weight will then increase to 1.0 in stage 2, and 
        - the ABX stimulus will have a combined weight of +1.0 in the test phase (A and B each contribute +1.0 and X contributes −1.0). 
        - For ACY (where Y is the unique cue created by A and C), the weight will also be 1.0, 
            - since A’s weight is 1.0 and C and Y have weights of zero.
        - Thus, participants are predicted to respond identically to AB and AC at test.
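The three boundary conditions can be explored with a simulation in which elements and unique cues have separate learning rates. This is a sketch under assumptions: the delta rule is taken in its standard form with a per-cue $\alpha$, X and Y denote the unique cues for AB and AC, trials are presented in a fixed order, and all parameter values are hypothetical.

```python
CUES = ["A", "B", "C", "X", "Y"]  # X = unique cue for AB, Y = unique cue for AC


def encode(*present):
    """Binary input vector with 1.0 for each cue that is present."""
    return [1.0 if cue in present else 0.0 for cue in CUES]


def output(weights, cues):
    return sum(w * a for w, a in zip(weights, cues))


def train(weights, trials, alphas, n_sweeps):
    """Delta rule with a separate learning rate (salience) for each cue."""
    for _ in range(n_sweeps):
        for cues, target in trials:
            error = target - output(weights, cues)
            weights = [w + alpha * a * error
                       for w, alpha, a in zip(weights, alphas, cues)]
    return weights


STAGE1 = [(encode("A"), 1.0),
          (encode("A", "B", "X"), 0.0),
          (encode("A", "C", "Y"), 1.0)]
STAGE2 = [(encode("B"), 1.0)]


def simulate(alpha_element, alpha_unique, n_sweeps=5000):
    """Run both training stages and return the test responses to AB and AC."""
    alphas = [alpha_element] * 3 + [alpha_unique] * 2
    weights = train([0.0] * len(CUES), STAGE1, alphas, n_sweeps)
    weights = train(weights, STAGE2, alphas, n_sweeps)
    return (output(weights, encode("A", "B", "X")),
            output(weights, encode("A", "C", "Y")))


no_unique = simulate(alpha_element=0.1, alpha_unique=0.0)
equal_rates = simulate(alpha_element=0.1, alpha_unique=0.1)
salient_unique = simulate(alpha_element=0.01, alpha_unique=0.5)
```

In this sketch the response to AB is about 2.0 when unique cues have zero salience, about 1.5 when the rates are equal, and only slightly above 1.0 when the unique cues are far more salient, while the response to AC stays at about 1.0 throughout. In no case does AC exceed AB.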

#### prediction

<font color="red">To summarize, on any version of an elemental or unique cue theory, participants should be at least as likely to predict outcome $O_{1}$ on the AB test trial as on the AC one.</font> 

#### data (real)

Fig. 2.4 shows participants’ performance in each of the three phases of the experiment.

<font color="red">The key stage 3 data are shown in Fig. 2.4c and reveal that participants predicted $O_{1}$ on AC trials more often than on AB trials. The prediction of the model is entirely falsified.</font>

<font color="blue">The problem with unique cue (and elemental) theories, it seems, is the assumption that responding to an item is determined simply by the sum of the weights of its components.</font> The fact that some of those components can be unique cues does not alter the summation assumption. What our experiment indicates is that a component of a stimulus can be radically revalued without affecting to any great degree the associative connection between that stimulus and the original category.

<img src="figures/cap2.8.png" width=600 />
<img src="figures/cap2.9.png" width=600 />

# BACK-PROPAGATION MODELS

<img src="figures/cap2.10.png" width=600 />

<img src="figures/cap2.11.png" width=600 />
<img src="figures/cap2.12.png" width=600 />

# CONFIGURAL MODELS

<img src="figures/cap2.13.png" width=600 />

<img src="figures/cap2.14.png" width=600 />

<img src="figures/cap2.15.png" width=600 />

<img src="figures/cap2.16.png" width=600 />

# REPRESENTATIONAL FLEXIBILITY

<img src="figures/cap2.17.png" width=600 />

<img src="figures/cap2.18.png" width=600 />

# ROLE OF THE HIPPOCAMPUS IN STIMULUS CODING

<img src="figures/cap2.19.png" width=600 />
<img src="figures/cap2.20.png" width=600 />

# CONCLUSIONS

# References
* [1] Connectionist Models in Cognitive Psychology - http://www.amazon.com/Connectionist-Cognitive-Psychology-Studies-Cognition/dp/0415646901/
* [2] Behavior Models - http://www.slideshare.net/zennboy/class-feb-15
* [3] Rescorla–Wagner model (Korean Wikipedia) -  https://ko.wikipedia.org/wiki/%EA%B3%A0%EC%A0%84%EC%A0%81_%EC%A1%B0%EA%B1%B4%ED%99%94%EC%9D%98_%ED%96%89%EB%8F%99%EC%A0%81_%EC%97%B0%EA%B5%AC#.EB.A0.88.EC.8A.A4.EC.BD.9C.EB.9D.BC-.EC.99.80.EA.B7.B8.EB.84.88_.EB.AA.A8.EB.8D.B8
* [4] Rescorla–Wagner model (blog) - http://m.blog.naver.com/khrireg/10187749924