
<h1 id="The-perceptron-algorithm-in-NLP">The perceptron algorithm in NLP<a class="anchor-link" href="#The-perceptron-algorithm-in-NLP">¶</a></h1>


Homework exercise 05.02 based on a tutorial by **Robert Östling**

<p>Convert the training and test sets into lists of tuples. The first element of each tuple is a word; the second is the language of that word.</p>

In [201]:
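def read_file(file):
    """Read tab-separated lines into a list of (word, language) tuples."""
    words = []
    # Explicit encoding (assumes the files are UTF-8) so words such as 'upphört'
    # load the same way on every platform
    with open(file, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip("\n")
            line = line.split("\t")
            words.append((line[0], line[1]))
    return words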

In [202]:
train_set = read_file('train.txt')
test_set = read_file('test.txt')
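<p>As a quick check of the expected input, the sketch below assumes each line holds a word and a language code separated by a tab, which is the layout <code>read_file</code> splits on. The file name <code>sample.txt</code> and its two lines are made up for illustration, and the UTF-8 encoding is an assumption about the real files.</p>

```python
import os
import tempfile

def read_file(file):
    """Read tab-separated 'word<TAB>language' lines into (word, language) tuples."""
    words = []
    with open(file, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            words.append((fields[0], fields[1]))
    return words

# Write a tiny sample file (hypothetical data) and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("slept\ten\nupphört\tsv\n")
    sample = read_file(path)

print(sample)
```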


<p>We assume that the relevant morphological information is contained in the first one or two letters and the last one to three letters of a word. So, we write a function that creates features based on this assumption:</p>


In [203]:
def get_features(word):
    # Prefixes and suffixes of the word, keyed by feature name
    return {
        'first': word[:1],
        'first_two': word[:2],
        'last': word[-1:],
        'last_two': word[-2:],
        'last_three': word[-3:],
    }
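<p>A minimal usage sketch of the feature function on a single word. The second call illustrates one property worth knowing: slicing never raises on short words, so a two-letter word just produces overlapping features.</p>

```python
def get_features(word):
    return {
        'first': word[:1], 'first_two': word[:2],
        'last': word[-1:], 'last_two': word[-2:], 'last_three': word[-3:],
    }

features = get_features('acknowledge')
print(features)

# Short words yield shorter, overlapping slices instead of an IndexError
short = get_features('an')
print(short)
```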


<p>We can now use this to create a list of features and their corresponding tag:</p>


In [204]:
[(get_features(word), tag) for word, tag in train_set]

[({'first': 'a',
   'first_two': 'ac',
   'last': 'e',
   'last_two': 'ge',
   'last_three': 'dge'},
  'en'),
 ({'first': 'a',
   'first_two': 'aq',
   'last': 'e',
   'last_two': 'fe',
   'last_three': 'afe'},
  'en'),
 ({'first': 'a',
   'first_two': 'ar',
   'last': 'd',
   'last_two': 'ld',
   'last_three': 'ald'},
  'en'),
 ({'first': 'b',
   'first_two': 'ba',
   'last': 'y',
   'last_two': 'by',
   'last_three': 'aby'},
  'en'),
 ({'first': 'b',
   'first_two': 'ba',
   'last': 'd',
   'last_two': 'ed',
   'last_three': 'sed'},
  'en'),
 ({'first': 'b',
   'first_two': 'be',
   'last': 'y',
   'last_two': 'ry',
   'last_three': 'ary'},
  'en'),
 ({'first': 'b',
   'first_two': 'be',
   'last': 'n',
   'last_two': 'gn',
   'last_three': 'ign'},
  'en'),
 ({'first': 'b',
   'first_two': 'bo',
   'last': 's',
   'last_two': 'es',
   'last_three': 'ces'},
  'en'),
 ({'first': 'b',
   'first_two': 'br',
   'last': 'y',
   'last_two': 'ay',
   'last_three': 'way'},
  'en'),
 ({'first'

<p>The perceptron assigns a weight to each combination of a feature and a tag.</p>

In [205]:
from collections import Counter
weights = Counter()

<p>To obtain the score of a given category for a set of features, we simply sum the corresponding weights:</p>

In [206]:
def score(features, category): return sum(weights[feature, category] for feature in features.items())
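<p>To make the summation concrete, here is a small sketch with three hand-set weights (made up for illustration, not learned). Because <code>weights</code> is a <code>Counter</code>, any (feature, tag) pair that has never been set counts as 0.</p>

```python
from collections import Counter

weights = Counter()  # missing keys read as 0, so unseen features contribute nothing

def get_features(word):
    return {
        'first': word[:1], 'first_two': word[:2],
        'last': word[-1:], 'last_two': word[-2:], 'last_three': word[-3:],
    }

def score(features, category):
    return sum(weights[feature, category] for feature in features.items())

# Hand-set toy weights (illustrative values only)
weights[('last_two', 'an'), 'fi'] = 2
weights[('first', 'v'), 'fi'] = 1
weights[('last', 'n'), 'en'] = 1

fi_score = score(get_features('vajoan'), 'fi')  # 2 + 1, other features weigh 0
en_score = score(get_features('vajoan'), 'en')  # only ('last', 'n') has a weight
print(fi_score, en_score)
```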


<h2 id="Training-the-perceptron">Training the perceptron<a class="anchor-link" href="#Training-the-perceptron">¶</a></h2>



<p>Learn the weight of each (feature, tag) pair automatically from the training data.</p>


In [207]:
# These are the possible tags in our example
tags = ('en', 'de', 'fi', 'fr', 'sv')

def update(word, correct_tag):
    # First we compute what the algorithm currently says would be the best tag for this word
    predicted_tag = max(tags, key=lambda tag: score(get_features(word), tag))
    if correct_tag == predicted_tag:
        # If the predicted tag is correct, leave the weights unchanged
        print('This example is already predicted correctly')
    else:
        print('This example was incorrectly predicted to be %s' % predicted_tag)
        # Otherwise, we need to do two things:
        # First, add 1 to the weight of every (feature,correct_tag) pair
        #   this will make each feature more strongly associated with the correct tag
        for feature in get_features(word).items():
            print('Increasing association of %-20s with %s' % (feature, correct_tag))
            weights[feature,correct_tag] += 1
        # Second, remove 1 from the weight of every (feature,predicted_tag) pair
        #   this will make each feature less strongly associated with the incorrect tag
        for feature in get_features(word).items():
            print('Reducing association of   %-20s with %s' % (feature, predicted_tag))
            weights[feature,predicted_tag] -= 1
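<p>The effect of a single update can be traced with a silent variant of the same rule. The names <code>predict</code> and <code>quiet_update</code> are introduced here for illustration only. With all weights at zero every tag scores 0, and <code>max</code> returns the first of the tied tags, which is why an untrained model predicts <code>en</code>.</p>

```python
from collections import Counter

tags = ('en', 'de', 'fi', 'fr', 'sv')
weights = Counter()

def get_features(word):
    return {
        'first': word[:1], 'first_two': word[:2],
        'last': word[-1:], 'last_two': word[-2:], 'last_three': word[-3:],
    }

def score(features, category):
    return sum(weights[feature, category] for feature in features.items())

def predict(word):
    # On ties, max() keeps the first tag in `tags`
    return max(tags, key=lambda tag: score(get_features(word), tag))

def quiet_update(word, correct_tag):
    """Same rule as update(), without printing; returns True if weights changed."""
    predicted_tag = predict(word)
    if predicted_tag == correct_tag:
        return False
    for feature in get_features(word).items():
        weights[feature, correct_tag] += 1    # pull features towards the correct tag
        weights[feature, predicted_tag] -= 1  # push them away from the wrong guess
    return True

before = predict('vajoan')
changed = quiet_update('vajoan', 'fi')
after = predict('vajoan')
print(before, changed, after)
```

<p>After this one step, each of the five features of <em>vajoan</em> carries +1 for <code>fi</code> and -1 for <code>en</code>, so the word is now classified as Finnish.</p>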


<p>And that's it! Now let's reset the weights to zero and see what happens when we pass through the words in our little data set.</p>
<p>We run 10 epochs in total and shuffle the training set randomly before each epoch to get better results.</p>


In [208]:
import random

weights.clear()

for i in range(1, 11):
    print('\n---- Running epoch #{} ----\n'.format(i))
    random.shuffle(train_set)
    for word, tag in train_set:
        print('  -- Training example: %s %s --' % (word, tag))
        update(word, tag)
        print()


---- Running epoch #1 ----

  -- Training example: slept en --
This example is already predicted correctly

  -- Training example: vajoan fi --
This example was incorrectly predicted to be en
Increasing association of ('first', 'v')       with fi
Increasing association of ('first_two', 'va')  with fi
Increasing association of ('last', 'n')        with fi
Increasing association of ('last_two', 'an')   with fi
Increasing association of ('last_three', 'oan') with fi
Reducing association of   ('first', 'v')       with en
Reducing association of   ('first_two', 'va')  with en
Reducing association of   ('last', 'n')        with en
Reducing association of   ('last_two', 'an')   with en
Reducing association of   ('last_three', 'oan') with en

  -- Training example: optischen de --
This example was incorrectly predicted to be fi
Increasing association of ('first', 'o')       with de
Increasing association of ('first_two', 'op')  with de
Increasing association of ('last', 'n')        with de
In

Increasing association of ('last_three', 'lia') with de
Reducing association of   ('first', 'E')       with en
Reducing association of   ('first_two', 'Em')  with en
Reducing association of   ('last', 'a')        with en
Reducing association of   ('last_two', 'ia')   with en
Reducing association of   ('last_three', 'lia') with en

  -- Training example: totale de --
This example was incorrectly predicted to be fr
Increasing association of ('first', 't')       with de
Increasing association of ('first_two', 'to')  with de
Increasing association of ('last', 'e')        with de
Increasing association of ('last_two', 'le')   with de
Increasing association of ('last_three', 'ale') with de
Reducing association of   ('first', 't')       with fr
Reducing association of   ('first_two', 'to')  with fr
Reducing association of   ('last', 'e')        with fr
Reducing association of   ('last_two', 'le')   with fr
Reducing association of   ('last_three', 'ale') with fr

  -- Training example: tarkoit

Reducing association of   ('first', 'e')       with fi
Reducing association of   ('first_two', 'el')  with fi
Reducing association of   ('last', 'a')        with fi
Reducing association of   ('last_two', 'la')   with fi
Reducing association of   ('last_three', 'lla') with fi

  -- Training example: caribbean en --
This example was incorrectly predicted to be fi
Increasing association of ('first', 'c')       with en
Increasing association of ('first_two', 'ca')  with en
Increasing association of ('last', 'n')        with en
Increasing association of ('last_two', 'an')   with en
Increasing association of ('last_three', 'ean') with en
Reducing association of   ('first', 'c')       with fi
Reducing association of   ('first_two', 'ca')  with fi
Reducing association of   ('last', 'n')        with fi
Reducing association of   ('last_two', 'an')   with fi
Reducing association of   ('last_three', 'ean') with fi

  -- Training example: unvollendet de --
This example is already predicted correctl

  -- Training example: missile fr --
This example is already predicted correctly

  -- Training example: mart en --
This example is already predicted correctly

  -- Training example: Litauischen de --
This example is already predicted correctly

  -- Training example: posters fr --
This example was incorrectly predicted to be sv
Increasing association of ('first', 'p')       with fr
Increasing association of ('first_two', 'po')  with fr
Increasing association of ('last', 's')        with fr
Increasing association of ('last_two', 'rs')   with fr
Increasing association of ('last_three', 'ers') with fr
Reducing association of   ('first', 'p')       with sv
Reducing association of   ('first_two', 'po')  with sv
Reducing association of   ('last', 's')        with sv
Reducing association of   ('last_two', 'rs')   with sv
Reducing association of   ('last_three', 'ers') with sv

  -- Training example: Fernsehproduktionen de --
This example is already predicted correctly

  -- Training example

This example is already predicted correctly

  -- Training example: strategiani fi --
This example is already predicted correctly

  -- Training example: Leerstand de --
This example is already predicted correctly

  -- Training example: upphört sv --
This example was incorrectly predicted to be de
Increasing association of ('first', 'u')       with sv
Increasing association of ('first_two', 'up')  with sv
Increasing association of ('last', 't')        with sv
Increasing association of ('last_two', 'rt')   with sv
Increasing association of ('last_three', 'ört') with sv
Reducing association of   ('first', 'u')       with de
Reducing association of   ('first_two', 'up')  with de
Reducing association of   ('last', 't')        with de
Reducing association of   ('last_two', 'rt')   with de
Reducing association of   ('last_three', 'ört') with de

  -- Training example: halvvägs sv --
This example was incorrectly predicted to be de
Increasing association of ('first', 'h')       with sv
Increa

This example is already predicted correctly

  -- Training example: deponerade sv --
This example is already predicted correctly

  -- Training example: sections fr --
This example is already predicted correctly

  -- Training example: katholischem de --
This example is already predicted correctly

  -- Training example: micro-organisme fr --
This example is already predicted correctly

  -- Training example: acknowledge en --
This example is already predicted correctly

  -- Training example: phones en --
This example was incorrectly predicted to be fr
Increasing association of ('first', 'p')       with en
Increasing association of ('first_two', 'ph')  with en
Increasing association of ('last', 's')        with en
Increasing association of ('last_two', 'es')   with en
Increasing association of ('last_three', 'nes') with en
Reducing association of   ('first', 'p')       with fr
Reducing association of   ('first_two', 'ph')  with fr
Reducing association of   ('last', 's')        with fr


  -- Training example: Phyllostachys de --
This example is already predicted correctly

  -- Training example: kieltävät fi --
This example is already predicted correctly

  -- Training example: données fr --
This example is already predicted correctly

  -- Training example: ordnat sv --
This example is already predicted correctly

  -- Training example: solférino fr --
This example is already predicted correctly

  -- Training example: benign en --
This example is already predicted correctly

  -- Training example: torppia fi --
This example is already predicted correctly

  -- Training example: rockies en --
This example is already predicted correctly

  -- Training example: nostamaan fi --
This example is already predicted correctly

  -- Training example: montasio fr --
This example is already predicted correctly

  -- Training example: greensboro en --
This example is already predicted correctly

  -- Training example: avtal sv --
This example is already predicted correctly

  -


  -- Training example: vaikutukseltaan fi --
This example is already predicted correctly

  -- Training example: sympati sv --
This example is already predicted correctly

  -- Training example: allaitante fr --
This example was incorrectly predicted to be fi
Increasing association of ('first', 'a')       with fr
Increasing association of ('first_two', 'al')  with fr
Increasing association of ('last', 'e')        with fr
Increasing association of ('last_two', 'te')   with fr
Increasing association of ('last_three', 'nte') with fr
Reducing association of   ('first', 'a')       with fi
Reducing association of   ('first_two', 'al')  with fi
Reducing association of   ('last', 'e')        with fi
Reducing association of   ('last_two', 'te')   with fi
Reducing association of   ('last_three', 'nte') with fi

  -- Training example: reino fr --
This example is already predicted correctly

  -- Training example: Förderungen de --
This example is already predicted correctly

  -- Training exampl

Reducing association of   ('first', 's')       with en
Reducing association of   ('first_two', 'st')  with en
Reducing association of   ('last', 's')        with en
Reducing association of   ('last_two', 'ts')   with en
Reducing association of   ('last_three', 'ats') with en

  -- Training example: yhteiset fi --
This example is already predicted correctly

  -- Training example: Landungstrupp de --
This example is already predicted correctly

  -- Training example: flakes sv --
This example is already predicted correctly

  -- Training example: inbrottsstöld sv --
This example is already predicted correctly

  -- Training example: oestrogène fr --
This example is already predicted correctly

  -- Training example: itsenäinen fi --
This example is already predicted correctly

  -- Training example: Kurdistan de --
This example is already predicted correctly

  -- Training example: mcginnis en --
This example is already predicted correctly

  -- Training example: optique fr --
This exam

This example is already predicted correctly

  -- Training example: deals en --
This example is already predicted correctly

  -- Training example: orientaux fr --
This example is already predicted correctly

  -- Training example: Sommerregen de --
This example is already predicted correctly

  -- Training example: wallace fr --
This example is already predicted correctly

  -- Training example: parlée fr --
This example is already predicted correctly

  -- Training example: corn sv --
This example is already predicted correctly

  -- Training example: friendship en --
This example is already predicted correctly

  -- Training example: presenterades sv --
This example was incorrectly predicted to be en
Increasing association of ('first', 'p')       with sv
Increasing association of ('first_two', 'pr')  with sv
Increasing association of ('last', 's')        with sv
Increasing association of ('last_two', 'es')   with sv
Increasing association of ('last_three', 'des') with sv
Reducing as


<p>Some of the training examples were incorrectly classified and resulted in weight updates.</p>
<p>So what are the weights now?</p>


In [209]:
from pprint import pprint
pprint(weights)

Counter({(('last_three', 'des'), 'sv'): 6,
         (('last_three', 'ies'), 'en'): 5,
         (('first', 'k'), 'fi'): 5,
         (('last_three', 'ées'), 'fr'): 5,
         (('last_three', 'lse'), 'sv'): 5,
         (('last_three', 'gon'), 'en'): 5,
         (('last_three', 'hen'), 'de'): 4,
         (('last_three', 'ons'), 'fr'): 4,
         (('first_two', 'av'), 'sv'): 4,
         (('first_two', 'ex'), 'en'): 4,
         (('first', 'M'), 'de'): 4,
         (('last_three', 'ose'), 'en'): 4,
         (('last_three', 'ige'), 'de'): 4,
         (('last_three', 'nnt'), 'de'): 4,
         (('last_three', 'nor'), 'sv'): 4,
         (('first_two', 'fu'), 'en'): 4,
         (('last_three', 'ble'), 'en'): 4,
         (('last_three', 'ota'), 'fr'): 4,
         (('last_three', 'dar'), 'sv'): 4,
         (('last_three', 'ean'), 'en'): 4,
         (('last_three', 'nsa'), 'fi'): 4,
         (('last_three', 'red'), 'fr'): 4,
         (('last_three', 'mme'), 'fi'): 4,
         (('last_three', 'est')

         (('first_two', 'it'), 'sv'): -1,
         (('first_two', 'sa'), 'de'): -1,
         (('last_two', 'ye'), 'de'): -1,
         (('last_three', 'aye'), 'de'): -1,
         (('first', 'ö'), 'fr'): -1,
         (('first_two', 'öv'), 'fr'): -1,
         (('last_two', 'ds'), 'fr'): -1,
         (('last_three', 'ids'), 'fr'): -1,
         (('last_two', 'se'), 'fr'): -1,
         (('first_two', 'uu'), 'sv'): -1,
         (('last_three', 'sia'), 'sv'): -1,
         (('first_two', 'ph'), 'sv'): -1,
         (('last_two', 'es'), 'sv'): -1,
         (('last_three', 'nes'), 'sv'): -1,
         (('first_two', 'Va'), 'sv'): -1,
         (('last_three', 'ure'), 'fr'): -1,
         (('last', 'y'), 'sv'): -1,
         (('last_two', 'cy'), 'sv'): -1,
         (('last_three', 'ncy'), 'sv'): -1,
         (('first', 'w'), 'en'): -1,
         (('first_two', 'wa'), 'en'): -1,
         (('last_two', 'ce'), 'en'): -1,
         (('last_three', 'ace'), 'en'): -1,
         (('first_two', 'fl'), 'en'): -1,
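<p>The full dump is hard to scan. One possible way to inspect it, sketched here with a handful of entries copied from the output above, is to filter the <code>Counter</code> by tag and use <code>most_common</code>. The helper name <code>top_features</code> is hypothetical.</p>

```python
from collections import Counter

# A few entries in the same ((feature_name, value), tag) layout as the output above
weights = Counter({
    (('last_three', 'des'), 'sv'): 6,
    (('first', 'k'), 'fi'): 5,
    (('last_three', 'ies'), 'en'): 5,
    (('first_two', 'av'), 'sv'): 4,
    (('last_two', 'es'), 'sv'): -1,
})

def top_features(weights, tag, n=3):
    """Return the n highest-weighted (feature, weight) pairs for one tag."""
    per_tag = Counter({feature: w for (feature, t), w in weights.items() if t == tag})
    return per_tag.most_common(n)

sv_top = top_features(weights, 'sv', n=2)
print(sv_top)
```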



<p>Test these weights on the test set.</p>
<p>Create a dictionary that maps each word (key) to its per-language scores (value). Before storing the scores, sort them from highest to lowest, so that the first entry is the language the perceptron predicts for that word.</p>


In [210]:
from operator import itemgetter

tested_scores = {}
for word, _ in test_set:
    # Score the word against every language, then sort from highest to lowest score
    scores = {lang: score(get_features(word), lang) for lang in tags}
    tested_scores[word] = sorted(scores.items(), reverse=True, key=itemgetter(1))

<p>Print the results.</p>
<p>Compare the predicted language with the correct language.</p>

In [211]:
correct = 0
incorrect = 0

# Look up each test item directly, so the gold label cannot drift out of step
# with the dictionary order (e.g. if a word occurs twice in the test set)
for word, gold in test_set:
    predicted = tested_scores[word][0][0]
    if predicted == gold:
        print(word, 'predicted correctly as:', predicted)
        correct += 1
    else:
        print(word, 'predicted incorrectly as:', predicted, 'desired language:', gold)
        incorrect += 1

bladder predicted incorrectly as: sv desired language: en
blindly predicted correctly as: en
colourful predicted incorrectly as: fr desired language: en
implemented predicted correctly as: en
nondescript predicted incorrectly as: fi desired language: en
reconsider predicted incorrectly as: de desired language: en
slaves predicted incorrectly as: sv desired language: en
strategic predicted correctly as: en
utilising predicted correctly as: en
weekly predicted correctly as: en
avustettuna predicted incorrectly as: sv desired language: fi
jalostettuja predicted correctly as: fi
kansallista predicted correctly as: fi
kansoillemme predicted correctly as: fi
näen predicted correctly as: fi
opeteltavaa predicted correctly as: fi
soittaneen predicted incorrectly as: sv desired language: fi
tiedosta predicted correctly as: fi
tuijotin predicted correctly as: fi
verojärjestelmän predicted correctly as: fi
catastrophes predicted correctly as: fr
manifester predicted incorrectly as: sv desired lan

<p>How accurate is the perceptron?</p>

In [212]:
print('Correctly classified:', correct)
print('Incorrectly classified:', incorrect)
print('Accuracy:', correct/len(test_set))

Correctly classified: 28
Incorrectly classified: 22
Accuracy: 0.56
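<p>A single accuracy figure hides which languages get confused with which. The sketch below counts (gold, predicted) pairs for the first few test words printed above; it covers only that small sample, so its numbers are not the full-test-set result.</p>

```python
from collections import Counter

# (gold, predicted) pairs taken from the first lines of the printed results
results = [
    ('en', 'sv'),  # bladder
    ('en', 'en'),  # blindly
    ('en', 'fr'),  # colourful
    ('en', 'en'),  # implemented
    ('fi', 'sv'),  # avustettuna
    ('fi', 'fi'),  # jalostettuja
]

confusion = Counter(results)
hits = sum(n for (gold, pred), n in confusion.items() if gold == pred)
accuracy = hits / len(results)

for (gold, pred), n in sorted(confusion.items()):
    print(f'gold={gold} predicted={pred}: {n}')
print('Accuracy on this sample:', accuracy)
```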


In [None]:
# With only the last two letters as features, I got about 40% accuracy, which I think is a good result
# for such a simple update rule. When I used the last three letters and the first two letters, I got an
# accuracy of around 57%. But getting even better accuracy is very difficult. Random shuffling before each
# epoch helps, but not as much as I thought.

In [None]:
# About the linearity: A simple perceptron is linear (as in our case), but there are also multi-layer perceptrons.
# MLPs have a hidden layer. Jurafsky and Martin say: '...taking a weighted sum of its inputs and then applying
# a non-linearity.' So I think we would have to change our score() function so that it does not return a simple
# sum of weights, but applies e.g. sigmoid, tanh, or ReLU. Of these three, ReLU is the simplest and most commonly
# used, so I would use that one. I tried to implement it in the score() function, but I got errors.
# It would be nice to see the solution for this, if you have it implemented :)
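<p>On the non-linearity question, one observation, offered as a sketch rather than a full answer: wrapping the final sum in sigmoid, tanh, or ReLU would not by itself change the predictions. All three functions are non-decreasing, so the highest-scoring language stays highest (with ReLU this holds as long as the top score is positive; otherwise all scores collapse to a tie at zero). In a multi-layer perceptron the non-linearity sits in a hidden layer between two sets of weights, which is a different model from the one trained here. The scores below are made up.</p>

```python
import math

# Made-up per-language sums, as score() might return them for one word
scores = {'en': 3, 'de': -2, 'fi': 1, 'fr': 0, 'sv': -1}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def best(scored):
    """Language with the highest score."""
    return max(scored, key=scored.get)

raw_best = best(scores)
sigmoid_best = best({lang: sigmoid(s) for lang, s in scores.items()})
tanh_best = best({lang: math.tanh(s) for lang, s in scores.items()})
relu_best = best({lang: relu(s) for lang, s in scores.items()})
print(raw_best, sigmoid_best, tanh_best, relu_best)
```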


<p>Success! In general, one needs to update the weights several times for each item in the training data. Doing it too many times, however, can result in <em>overfitting</em>. Further passes of the perceptron algorithm tend to improve classification of the <em>training</em> data, but the opposite can happen for data which is not in the training set. The basic reason for this is that the perceptron first learns very general rules (such as that words ending with <em>-ed</em> tend to be verbs), but later during training starts finding more and more specific rules (such as that words ending with <em>-ent</em> that are longer than 7 letters and begin with <em>emba-</em> tend to be nouns; this happens to fit the noun <em>embarrassment</em>, but is not a general rule of English grammar).</p>
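<p>One common guard against overfitting, which is not part of the exercise above, is to measure accuracy on held-out data after every epoch and keep the weights from the best epoch. The sketch below assumes a tiny two-language data set built from words in the training log; the split, the seed, and the function signatures (which take <code>weights</code> as an argument) are choices made for this illustration.</p>

```python
import random
from collections import Counter

tags = ('en', 'fi')

def get_features(word):
    return {
        'first': word[:1], 'first_two': word[:2],
        'last': word[-1:], 'last_two': word[-2:], 'last_three': word[-3:],
    }

def score(weights, word, tag):
    return sum(weights[f, tag] for f in get_features(word).items())

def predict(weights, word):
    return max(tags, key=lambda tag: score(weights, word, tag))

def accuracy(weights, data):
    return sum(predict(weights, w) == t for w, t in data) / len(data)

# Tiny illustrative split; real data would be far larger
train = [('slept', 'en'), ('vajoan', 'fi'), ('caribbean', 'en'),
         ('nostamaan', 'fi'), ('benign', 'en'), ('torppia', 'fi')]
held_out = [('mart', 'en'), ('strategiani', 'fi')]

rng = random.Random(0)  # fixed seed so the run is repeatable
weights = Counter()
best_acc, best_weights = -1.0, Counter()

for epoch in range(1, 6):
    rng.shuffle(train)
    for word, tag in train:
        guess = predict(weights, word)
        if guess != tag:
            for f in get_features(word).items():
                weights[f, tag] += 1
                weights[f, guess] -= 1
    acc = accuracy(weights, held_out)
    if acc > best_acc:
        # Keep a snapshot of the weights from the best epoch so far
        best_acc, best_weights = acc, weights.copy()
    print(f'epoch {epoch}: held-out accuracy {acc:.2f}')
```

<p>Classifying with <code>best_weights</code> instead of the final <code>weights</code> avoids using an epoch in which held-out accuracy had already started to drop.</p>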
