
<h1 id="The-perceptron-algorithm-in-NLP">The perceptron algorithm in NLP<a class="anchor-link" href="#The-perceptron-algorithm-in-NLP">¶</a></h1>


By **Robert Östling**

<p>The perceptron is among the simplest classifiers, and yet it frequently gives very good results. Compared to the Naive Bayes classifier, it has the important advantage of handling highly correlated features well. Consider a list of tagged words, read here from tab-separated files with one word and its tag per line:</p>


In [33]:
def read_file(file):
    words = []
    with open(file) as f:
        for line in f:
            line = line.rstrip("\n")
            line = line.split("\t")
            words.append((line[0], line[1]))
    return words

Process

In [34]:
train_set = read_file('train.txt')
test_set = read_file('test.txt')


<p>The relevant morphological information is contained in (at most) the last three letters of a word. So, we write a function that creates features based on this observation:</p>


In [35]:
def get_features(word):
    # Suffix features: the last one, two and three characters of the word
    return {
        'last': word[-1:],
        'last_two': word[-2:],
        'last_three': word[-3:],
    }
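
<p>As a quick check of the slicing used above, here is a minimal, self-contained sketch (it repeats the function so that it runs on its own; the example words are only illustrative). Python slices do not fail on short words, so a two-letter word simply yields the same value for <code>last_two</code> and <code>last_three</code>:</p>

```python
def get_features(word):
    # Same suffix features as above, repeated so this block runs on its own
    return {'last': word[-1:], 'last_two': word[-2:], 'last_three': word[-3:]}

# A word with at least three letters gives three distinct suffixes
print(get_features('talked'))

# A two-letter word: the slices are clipped to the word itself
print(get_features('an'))
```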


<p>We can now use this to create a list of features and their corresponding tag:</p>


In [36]:
[(get_features(word), tag) for word, tag in train_set]

[({'last': 'e', 'last_two': 'ge', 'last_three': 'dge'}, 'en'),
 ({'last': 'e', 'last_two': 'fe', 'last_three': 'afe'}, 'en'),
 ({'last': 'd', 'last_two': 'ld', 'last_three': 'ald'}, 'en'),
 ({'last': 'y', 'last_two': 'by', 'last_three': 'aby'}, 'en'),
 ({'last': 'd', 'last_two': 'ed', 'last_three': 'sed'}, 'en'),
 ({'last': 'y', 'last_two': 'ry', 'last_three': 'ary'}, 'en'),
 ({'last': 'n', 'last_two': 'gn', 'last_three': 'ign'}, 'en'),
 ({'last': 's', 'last_two': 'es', 'last_three': 'ces'}, 'en'),
 ({'last': 'y', 'last_two': 'ay', 'last_three': 'way'}, 'en'),
 ({'last': 'g', 'last_two': 'ng', 'last_three': 'ing'}, 'en'),
 ({'last': 't', 'last_two': 'nt', 'last_three': 'ant'}, 'en'),
 ({'last': 's', 'last_two': 'es', 'last_three': 'ies'}, 'en'),
 ({'last': 'n', 'last_two': 'an', 'last_three': 'ean'}, 'en'),
 ({'last': 'd', 'last_two': 'id', 'last_three': 'aid'}, 'en'),
 ({'last': 'd', 'last_two': 'ed', 'last_three': 'ged'}, 'en'),
 ({'last': 'e', 'last_two': 'ke', 'last_three': 'oke'},


<h2 id="Training-the-perceptron">Training the perceptron<a class="anchor-link" href="#Training-the-perceptron">¶</a></h2>



<p>In reality, we don't want to give the weights manually. Fortunately there is an easy way to compute them automatically based on some training data.</p>
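
<p>The <code>update</code> function below relies on a <code>weights</code> table and a <code>score</code> function that are not shown in this section. The following is a minimal sketch of what they might look like. It assumes, as the printed weights further down suggest, that <code>weights</code> is a <code>Counter</code> keyed by <code>((feature_name, feature_value), tag)</code>. The body of <code>score</code> (a plain sum of the weights of the active features) is an assumption, not necessarily the original definition:</p>

```python
from collections import Counter

# Assumed representation: ((feature_name, feature_value), tag) -> integer weight.
# A Counter returns 0 for keys it has never seen, so unseen features are neutral.
weights = Counter()

def score(features, tag):
    # Assumed scoring rule: add up the weights of all features present in the
    # word, paired with the candidate tag.
    return sum(weights[feature, tag] for feature in features.items())
```

<p>With this representation, a tag scores highly for a word when the word's suffix features have accumulated positive weights for that tag.</p>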


In [27]:
# These are the possible tags in our example
tags = ('en', 'de', 'fi', 'fr', 'sv')

def update(word, correct_tag):
    # Compute the features once; they are needed for scoring and for both updates
    features = get_features(word)
    # First we compute what the algorithm currently says would be the best tag for this word
    predicted_tag = max(tags, key=lambda tag: score(features, tag))
    if correct_tag == predicted_tag:
        # If the predicted tag is correct, do nothing
        print('This example is already predicted correctly')
    else:
        print('This example was incorrectly predicted to be %s' % predicted_tag)
        # Otherwise, we need to do two things:
        # First, add 1 to the weight of every (feature,correct_tag) pair
        #   this will make each feature more strongly associated with the correct tag
        for feature in features.items():
            print('Increasing association of %-20s with %s' % (feature, correct_tag))
            weights[feature,correct_tag] += 1
        # Second, remove 1 from the weight of every (feature,predicted_tag) pair
        #   this will make each feature less strongly associated with the incorrect tag
        for feature in features.items():
            print('Reducing association of   %-20s with %s' % (feature, predicted_tag))
            weights[feature,predicted_tag] -= 1


<p>And that's it! Now let's reset the weights to zero, and see what happens when we pass through the words in our little data set.</p>


In [43]:
import random
weights.clear()

for i in range(1, 11):
    print('\n---- Running epoch #{} ----\n'.format(i))
    for word, tag in train_set:
        print('  -- Training example: %s %s --' % (word, tag))
        update(word, tag)
        print()
    random.shuffle(train_set)


---- Running epoch #1 ----

  -- Training example: infångas sv --
This example was incorrectly predicted to be en
Increasing association of ('last', 's')        with sv
Increasing association of ('last_two', 'as')   with sv
Increasing association of ('last_three', 'gas') with sv
Reducing association of   ('last', 's')        with en
Reducing association of   ('last_two', 'as')   with en
Reducing association of   ('last_three', 'gas') with en

  -- Training example: Monats de --
This example was incorrectly predicted to be sv
Increasing association of ('last', 's')        with de
Increasing association of ('last_two', 'ts')   with de
Increasing association of ('last_three', 'ats') with de
Reducing association of   ('last', 's')        with sv
Reducing association of   ('last_two', 'ts')   with sv
Reducing association of   ('last_three', 'ats') with sv

  -- Training example: slänger sv --
This example was incorrectly predicted to be en
Increasing association of ('last', 'r')        wit

Reducing association of   ('last', 'l')        with sv
Reducing association of   ('last_two', 'al')   with sv
Reducing association of   ('last_three', 'ral') with sv

  -- Training example: disgusting en --
This example is already predicted correctly

  -- Training example: levinneisyys fi --
This example was incorrectly predicted to be de
Increasing association of ('last', 's')        with fi
Increasing association of ('last_two', 'ys')   with fi
Increasing association of ('last_three', 'yys') with fi
Reducing association of   ('last', 's')        with de
Reducing association of   ('last_two', 'ys')   with de
Reducing association of   ('last_three', 'yys') with de

  -- Training example: suonissaan fi --
This example is already predicted correctly

  -- Training example: sespros fr --
This example was incorrectly predicted to be fi
Increasing association of ('last', 's')        with fr
Increasing association of ('last_two', 'os')   with fr
Increasing association of ('last_three', 'ros

This example is already predicted correctly

  -- Training example: trafalgar fr --
This example was incorrectly predicted to be sv
Increasing association of ('last', 'r')        with fr
Increasing association of ('last_two', 'ar')   with fr
Increasing association of ('last_three', 'gar') with fr
Reducing association of   ('last', 'r')        with sv
Reducing association of   ('last_two', 'ar')   with sv
Reducing association of   ('last_three', 'gar') with sv

  -- Training example: suonissaan fi --
This example is already predicted correctly

  -- Training example: boursier fr --
This example is already predicted correctly

  -- Training example: barnafödandet sv --
This example was incorrectly predicted to be de
Increasing association of ('last', 't')        with sv
Increasing association of ('last_two', 'et')   with sv
Increasing association of ('last_three', 'det') with sv
Reducing association of   ('last', 't')        with de
Reducing association of   ('last_two', 'et')   with de


This example is already predicted correctly

  -- Training example: uudistuksia fi --
This example is already predicted correctly

  -- Training example: glödlamporna sv --
This example is already predicted correctly

  -- Training example: misto en --
This example is already predicted correctly

  -- Training example: mutiloa fr --
This example is already predicted correctly

  -- Training example: Sitzplatztribüne de --
This example was incorrectly predicted to be fr
Increasing association of ('last', 'e')        with de
Increasing association of ('last_two', 'ne')   with de
Increasing association of ('last_three', 'üne') with de
Reducing association of   ('last', 'e')        with fr
Reducing association of   ('last_two', 'ne')   with fr
Reducing association of   ('last_three', 'üne') with fr

  -- Training example: counterparties en --
This example is already predicted correctly

  -- Training example: annuelle fr --
This example is already predicted correctly

  -- Training example


  -- Training example: prêcherait fr --
This example was incorrectly predicted to be de
Increasing association of ('last', 't')        with fr
Increasing association of ('last_two', 'it')   with fr
Increasing association of ('last_three', 'ait') with fr
Reducing association of   ('last', 't')        with de
Reducing association of   ('last_two', 'it')   with de
Reducing association of   ('last_three', 'ait') with de

  -- Training example: nostamaan fi --
This example is already predicted correctly

  -- Training example: elämme fi --
This example was incorrectly predicted to be fr
Increasing association of ('last', 'e')        with fi
Increasing association of ('last_two', 'me')   with fi
Increasing association of ('last_three', 'mme') with fi
Reducing association of   ('last', 'e')        with fr
Reducing association of   ('last_two', 'me')   with fr
Reducing association of   ('last_three', 'mme') with fr

  -- Training example: sprang en --
This example is already predicted correct


  -- Training example: korot fi --
This example is already predicted correctly

  -- Training example: kestämään fi --
This example is already predicted correctly

  -- Training example: oblivious en --
This example is already predicted correctly

  -- Training example: basculé fr --
This example is already predicted correctly

  -- Training example: sämsta sv --
This example is already predicted correctly

  -- Training example: bespannt de --
This example was incorrectly predicted to be en
Increasing association of ('last', 't')        with de
Increasing association of ('last_two', 'nt')   with de
Increasing association of ('last_three', 'nnt') with de
Reducing association of   ('last', 't')        with en
Reducing association of   ('last_two', 'nt')   with en
Reducing association of   ('last_three', 'nnt') with en

  -- Training example: fessenheim fr --
This example is already predicted correctly

  -- Training example: archibald en --
This example is already predicted correctly



This example is already predicted correctly

  -- Training example: eftermiddagen sv --
This example was incorrectly predicted to be de
Increasing association of ('last', 'n')        with sv
Increasing association of ('last_two', 'en')   with sv
Increasing association of ('last_three', 'gen') with sv
Reducing association of   ('last', 'n')        with de
Reducing association of   ('last_two', 'en')   with de
Reducing association of   ('last_three', 'gen') with de

  -- Training example: vaikutukseltaan fi --
This example is already predicted correctly

  -- Training example: Förderungen de --
This example is already predicted correctly

  -- Training example: Lexikon de --
This example is already predicted correctly

  -- Training example: flytta sv --
This example was incorrectly predicted to be fi
Increasing association of ('last', 'a')        with sv
Increasing association of ('last_two', 'ta')   with sv
Increasing association of ('last_three', 'tta') with sv
Reducing association of

Increasing association of ('last', 't')        with fr
Increasing association of ('last_two', 'nt')   with fr
Increasing association of ('last_three', 'ant') with fr
Reducing association of   ('last', 't')        with en
Reducing association of   ('last_two', 'nt')   with en
Reducing association of   ('last_three', 'ant') with en

  -- Training example: omaa fi --
This example is already predicted correctly

  -- Training example: utrikes sv --
This example is already predicted correctly

  -- Training example: enveloppe fr --
This example was incorrectly predicted to be de
Increasing association of ('last', 'e')        with fr
Increasing association of ('last_two', 'pe')   with fr
Increasing association of ('last_three', 'ppe') with fr
Reducing association of   ('last', 'e')        with de
Reducing association of   ('last_two', 'pe')   with de
Reducing association of   ('last_three', 'ppe') with de

  -- Training example: beneficiary en --
This example is already predicted correctly



  -- Training example: litteratur sv --
This example is already predicted correctly

  -- Training example: kiinnostumaan fi --
This example is already predicted correctly

  -- Training example: imperative en --
This example is already predicted correctly

  -- Training example: changed en --
This example is already predicted correctly

  -- Training example: liikevaihtonsa fi --
This example is already predicted correctly

  -- Training example: tahoja fi --
This example is already predicted correctly

  -- Training example: osmosion fr --
This example was incorrectly predicted to be sv
Increasing association of ('last', 'n')        with fr
Increasing association of ('last_two', 'on')   with fr
Increasing association of ('last_three', 'ion') with fr
Reducing association of   ('last', 'n')        with sv
Reducing association of   ('last_two', 'on')   with sv
Reducing association of   ('last_three', 'ion') with sv

  -- Training example: contini fr --
This example was incorrectly predi

This example is already predicted correctly

  -- Training example: ylistää fi --
This example is already predicted correctly

  -- Training example: Sternhaufen de --
This example was incorrectly predicted to be sv
Increasing association of ('last', 'n')        with de
Increasing association of ('last_two', 'en')   with de
Increasing association of ('last_three', 'fen') with de
Reducing association of   ('last', 'n')        with sv
Reducing association of   ('last_two', 'en')   with sv
Reducing association of   ('last_three', 'fen') with sv

  -- Training example: lane en --
This example is already predicted correctly

  -- Training example: directed en --
This example is already predicted correctly

  -- Training example: collégiale fr --
This example was incorrectly predicted to be de
Increasing association of ('last', 'e')        with fr
Increasing association of ('last_two', 'le')   with fr
Increasing association of ('last_three', 'ale') with fr
Reducing association of   ('last', 

This example is already predicted correctly

  -- Training example: kvinnligt sv --
This example is already predicted correctly

  -- Training example: elinympäristön fi --
This example is already predicted correctly

  -- Training example: tungette fi --
This example is already predicted correctly

  -- Training example: pauvres fr --
This example was incorrectly predicted to be en
Increasing association of ('last', 's')        with fr
Increasing association of ('last_two', 'es')   with fr
Increasing association of ('last_three', 'res') with fr
Reducing association of   ('last', 's')        with en
Reducing association of   ('last_two', 'es')   with en
Reducing association of   ('last_three', 'res') with en

  -- Training example: sämsta sv --
This example is already predicted correctly

  -- Training example: självförsörjande sv --
This example is already predicted correctly

  -- Training example: lanka en --
This example is already predicted correctly

  -- Training example: leakag


<p>We can see that many of the training examples were incorrectly classified, and that each of these resulted in a weight update.</p>
<p>So what are the weights now?</p>


In [44]:
from pprint import pprint
pprint(weights)

Counter({(('last_three', 'hen'), 'de'): 4,
         (('last_three', 'nnt'), 'de'): 4,
         (('last_three', 'jet'), 'fr'): 4,
         (('last_three', 'est'), 'en'): 4,
         (('last_three', 'fen'), 'de'): 4,
         (('last_three', 'are'), 'sv'): 4,
         (('last_three', 'kon'), 'de'): 4,
         (('last_three', 'kes'), 'sv'): 4,
         (('last_three', 'wer'), 'de'): 4,
         (('last_three', 'per'), 'de'): 4,
         (('last_two', 'as'), 'sv'): 3,
         (('last', 'n'), 'fi'): 3,
         (('last_three', 'son'), 'fi'): 3,
         (('last_three', 'sit'), 'fi'): 3,
         (('last_three', 'asi'), 'fr'): 3,
         (('last_three', 'ome'), 'de'): 3,
         (('last_three', 'gon'), 'en'): 3,
         (('last_three', 'tan'), 'de'): 3,
         (('last_three', 'ile'), 'fr'): 3,
         (('last_three', 'ise'), 'fr'): 3,
         (('last_three', 'het'), 'sv'): 3,
         (('last_three', 'det'), 'de'): 3,
         (('last_three', 'gen'), 'de'): 3,
         (('last_three

         (('last_three', 'ome'), 'fr'): -1,
         (('last_three', 'ume'), 'de'): -1,
         (('last_three', 'nat'), 'fi'): -1,
         (('last_three', 'ate'), 'de'): -1,
         (('last_three', 'ian'), 'sv'): -1,
         (('last_three', 'age'), 'fr'): -1,
         (('last_two', 'ra'), 'fi'): -1,
         (('last_three', 'yra'), 'fi'): -1,
         (('last_three', 'aan'), 'en'): -1,
         (('last_three', 'jet'), 'sv'): -1,
         (('last_three', 'set'), 'fr'): -1,
         (('last_three', 'zer'), 'de'): -1,
         (('last_three', 'nnt'), 'en'): -1,
         (('last_two', 'ue'), 'en'): -1,
         (('last_three', 'que'), 'en'): -1,
         (('last_three', 'ton'), 'de'): -1,
         (('last_three', 'ues'), 'de'): -1,
         (('last_three', 'rst'), 'fr'): -1,
         (('last_three', 'gon'), 'fr'): -1,
         (('last_two', 'in'), 'sv'): -1,
         (('last_three', 'iin'), 'sv'): -1,
         (('last_three', 'ile'), 'fi'): -1,
         (('last_three', 'ait'), 'fi'): -
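

<p>A long dump like this is hard to read. As a small sketch, the strongest features for a single tag can be pulled out with <code>Counter.most_common</code>. The helper name <code>top_features</code> is hypothetical, and the example weights are a handful of entries copied from the output above:</p>

```python
from collections import Counter

def top_features(weights, tag, n=3):
    # Keep only the entries for the requested tag, keyed by feature alone
    per_tag = Counter({feature: weight
                       for (feature, t), weight in weights.items()
                       if t == tag})
    # most_common returns (feature, weight) pairs, highest weight first
    return per_tag.most_common(n)

# A few entries taken from the printed weights
example_weights = Counter({
    (('last_three', 'hen'), 'de'): 4,
    (('last_two', 'as'), 'sv'): 3,
    (('last_three', 'det'), 'de'): 3,
    (('last_three', 'ome'), 'fr'): -1,
})

print(top_features(example_weights, 'de', n=2))
```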


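<p>We can test the approach with a new word, <em>added</em>, here using weights trained on a small part-of-speech data set with the tags <code>VERB</code>, <code>NOUN</code> and <code>ADJ</code>:</p>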


In [10]:
{category: score(get_features('added'), category) for category in ['VERB', 'NOUN', 'ADJ'] }

{'VERB': 1, 'NOUN': 1, 'ADJ': -2}


<p><code>NOUN</code> and <code>VERB</code> obtain the same score, even though the <em>-ed</em> suffix should make it clear that this is a verb. This can be fixed by going another round through the training data:</p>


In [11]:
for word, tag in words:
    print('  -- Training example: %s %s --' % (word, tag))
    update(word, tag)
    print()

  -- Training example: talked VERB --
This example was incorrectly predicted to be NOUN
Increasing association of ('last', 'd')        with VERB
Increasing association of ('last_two', 'ed')   with VERB
Reducing association of   ('last', 'd')        with NOUN
Reducing association of   ('last_two', 'ed')   with NOUN

  -- Training example: hiked VERB --
This example is already predicted correctly

  -- Training example: bread NOUN --
This example is already predicted correctly

  -- Training example: oranges NOUN --
This example is already predicted correctly

  -- Training example: sweeter ADJ --
This example is already predicted correctly

  -- Training example: greener ADJ --
This example is already predicted correctly




<p>This time the weights were updated only once, because most examples were already classified correctly. Let's try again with <em>added</em>:</p>


In [12]:
{category: score(get_features('added'), category) for category in ['VERB', 'NOUN', 'ADJ'] }

{'VERB': 3, 'NOUN': -1, 'ADJ': -2}


<p>Success! In general, one needs to update the weights several times for each item in the training data. Doing it too many times, however, can result in <em>overfitting</em>. Further passes generally improve classification of the <em>training</em> data, but the opposite can happen for data that is not in the training set. The basic reason for this is that the perceptron first learns very general rules (such as that words ending with <em>-ed</em> tend to be verbs), but later during training starts finding more and more specific rules (such as that words ending with <em>-ent</em> that are longer than 7 letters and begin with <em>emba-</em> tend to be nouns; this happens to fit the noun <em>embarrassment</em>, but is not a general rule of English grammar).</p>
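
<p>One common way to guard against overfitting is to measure accuracy on held-out data after every epoch and keep the weights from the best epoch. The following is a minimal, self-contained sketch of that idea, not part of the original notebook. The function names (<code>predict</code>, <code>train_epoch</code>, <code>accuracy</code>, <code>train_with_early_stopping</code>), the tiny data sets and the choice of <code>max_epochs</code> are all assumptions made for illustration:</p>

```python
from collections import Counter

def get_features(word):
    return {'last': word[-1:], 'last_two': word[-2:], 'last_three': word[-3:]}

def predict(weights, tags, word):
    # Highest-scoring tag; the score is the sum of the active feature weights
    features = get_features(word).items()
    return max(tags, key=lambda tag: sum(weights[f, tag] for f in features))

def train_epoch(weights, tags, data):
    # One pass of the perceptron update rule over the training data
    for word, correct_tag in data:
        predicted_tag = predict(weights, tags, word)
        if predicted_tag != correct_tag:
            for feature in get_features(word).items():
                weights[feature, correct_tag] += 1
                weights[feature, predicted_tag] -= 1

def accuracy(weights, tags, data):
    correct = sum(predict(weights, tags, word) == tag for word, tag in data)
    return correct / len(data)

def train_with_early_stopping(train_data, dev_data, tags, max_epochs=10):
    weights = Counter()
    best_weights, best_accuracy = Counter(), -1.0
    for _ in range(max_epochs):
        train_epoch(weights, tags, train_data)
        dev_accuracy = accuracy(weights, tags, dev_data)
        # Keep a copy of the weights whenever held-out accuracy improves
        if dev_accuracy > best_accuracy:
            best_weights, best_accuracy = weights.copy(), dev_accuracy
    return best_weights, best_accuracy

# Illustrative toy data (hypothetical, far too small for real use)
toy_tags = ('VERB', 'NOUN', 'ADJ')
toy_train = [('talked', 'VERB'), ('hiked', 'VERB'), ('bread', 'NOUN'),
             ('oranges', 'NOUN'), ('sweeter', 'ADJ'), ('greener', 'ADJ')]
toy_dev = [('added', 'VERB'), ('apples', 'NOUN')]

best_weights, best_accuracy = train_with_early_stopping(toy_train, toy_dev, toy_tags)
print('Best held-out accuracy:', best_accuracy)
```

<p>The held-out set must be kept separate from the data used for the final evaluation, since the stopping point is itself chosen to fit it.</p>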
