# Baseline: Most Frequent Assigned Label

The most naive approach to PoS tagging is to label each word with the label most often assigned to that word in the corpus. This is the baseline approach that we are going to try in this notebook.

The idea is to load a treebank structured file (see how to create one in the Reading, Saving and Loading PTbank files notebook) and get all nodes from the dataset. Then we build a dictionary that holds, for each word, the frequency table of the labels assigned to it.

```
{
    ...
    'book': {
        'NN': 256,
        'VV': 126
    },
    'played': {
        'VBN': 421
    }
    ...
}
```

Take, for instance, the word book. It could be labeled 'NN' (noun) in "Where is my book?" but also 'VV' (verb, base form) in "I want to book a hotel". The noun form is far more common than the verb form, so in a random unseen sentence it is a good bet that the word book is a noun.

In the end, our frequency table becomes a predictor that keeps only the most frequent label for each symbol/word.

```
{
    ...
    'book': 'NN',
    'played': 'VBN'
    ...
}
```
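The two steps above (count labels per word, then keep the most frequent one) can be sketched on a toy corpus. This is only an illustration: `toy_pairs`, `toy_freq` and `toy_predictor` are hypothetical names, the counts are made up, and `collections.Counter` is used here for brevity instead of the plain dictionaries built later in this notebook.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (word, label) pairs; not taken from the treebank.
toy_pairs = [
    ("book", "NN"), ("book", "NN"), ("book", "VV"),
    ("played", "VBN"),
]

# Step 1: frequency table, word -> Counter of labels
toy_freq = defaultdict(Counter)
for word, label in toy_pairs:
    toy_freq[word][label] += 1

# Step 2: keep only the most frequent label for each word
toy_predictor = {
    word: counts.most_common(1)[0][0]
    for word, counts in toy_freq.items()
}

print(toy_predictor)  # {'book': 'NN', 'played': 'VBN'}
```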

We are going to use Treebank to load the structure and get the nodes:

In [2]:
from postag import Treebank

In [4]:
ptb = Treebank()
ptb.load('data/dumps/my_ptb_struct')
len(ptb.instances)

39831

Then, let's take all word-label tuples in the corpus, which are the nodes from each instance in the treebank.

In [5]:
nodes = ptb.get_nodes()
len(nodes)

1014096

To check whether the predictor we are creating is good enough, it is useful to split the samples into two groups: one for training, used to create the predictor table shown above, and another one for testing.

The predictor is built without seeing any test sample, so the test set may contain new words, different labels for a known word, etc. The test set is used to calculate the accuracy of the predictor, by running the predictions and counting its correct and wrong answers.

To split the nodes we create the `split_data` function:

In [6]:
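def split_data(nodes, train_perc, test_perc):
    # The first train_perc of the nodes go to the train set and the rest go
    # to the test set; test_perc is not used in the computation.
    splitter = int(train_perc * len(nodes))
    return nodes[:splitter], nodes[splitter:]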

In [7]:
# split nodes in train set (with 70%) and test set (30%)
train_set, test_set = split_data(nodes, 0.7, 0.3)
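As a quick sanity check on the sizes, the sketch below redoes the arithmetic of the split using the node count reported above (1014096) and the 0.7 training fraction. The names `total_nodes`, `train_size` and `test_size` are introduced here only for illustration.

```python
total_nodes = 1014096  # node count reported above
train_perc = 0.7

# Same arithmetic as split_data: truncate the training size to an integer,
# and give everything after that index to the test set.
splitter = int(train_perc * total_nodes)
train_size = splitter
test_size = total_nodes - splitter

print(train_size, test_size)  # 709867 304229
```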

Let's create the frequency table that counts the occurrences of each label for a given word:

In [24]:
freq_table = {}

# Generate the frequency table: value (word) -> {class (label): count}
for node in train_set:
    # Fetch the label counts for this word, creating them on first sight
    label_counts = freq_table.setdefault(node.value, {})
    label_counts[node.class_name] = label_counts.get(node.class_name, 0) + 1

In [27]:
freq_table['fight']

{'VB': 15, 'NN': 22, 'VBP': 1}

With the freq table, it is time to pick the label with the most occurrences.

In [29]:
# Pick the most frequent class for each value
for value in freq_table:
    label_counts = freq_table[value]
    freq_table[value] = max(label_counts, key=label_counts.get)

In [30]:
freq_table['fight']

'NN'

That's it. Our freq_table is now a predictor based on the most frequent labels. Next, let's apply this predictor to the test set while counting the hits, the wrong answers and the missing words:

In [32]:
# Run the test evaluation
hits = 0
wrongs = 0
misses = 0

for node in test_set:
    if node.value in freq_table:
        predicted = freq_table[node.value]
        if predicted == node.class_name:
            hits += 1
        else:
            wrongs += 1
    else:
        misses += 1
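The same counting logic can be wrapped in a small function and checked on a toy example. This is a minimal sketch, assuming samples are plain `(word, label)` tuples rather than treebank nodes; `evaluate`, `toy_predictor` and `toy_test` are hypothetical names used only here.

```python
def evaluate(predictor, samples):
    """Count hits, wrongs and misses for (word, label) samples."""
    hits = wrongs = misses = 0
    for word, label in samples:
        if word not in predictor:
            misses += 1   # word never seen during training
        elif predictor[word] == label:
            hits += 1     # predicted label matches the gold label
        else:
            wrongs += 1   # word is known but the label differs
    return hits, wrongs, misses


# Hypothetical toy data: "hotel" is unknown to the predictor.
toy_predictor = {"book": "NN", "played": "VBN"}
toy_test = [("book", "NN"), ("book", "VV"), ("played", "VBN"), ("hotel", "NN")]

print(evaluate(toy_predictor, toy_test))  # (2, 1, 1)
```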

Here is the resulting accuracy for that experiment:

In [33]:
print("%f Accuracy (%d hits; %d wrongs)" % (((hits/(hits+wrongs))*100), hits, wrongs))
print("Missed %d values" % misses)

94.406750 Accuracy (277165 hits; 16421 wrongs)
Missed 10643 values


As you can see, this approach reaches ~94% accuracy on PoS tagging for the words the predictor has already seen.

More than 10,000 "words" are missing, which means they were not labeled because the predictor never saw them before. In this case I just ignored the missing words, but if we count a missing word as a wrong answer we get:

In [34]:
print("%f Accuracy (%d hits; %d wrongs)" % (((hits/(hits+wrongs+misses))*100), hits, wrongs))
print("Missed %d values" % misses)

91.104070 Accuracy (277165 hits; 16421 wrongs)
Missed 10643 values
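The two figures differ only in the denominator. The sketch below recomputes both from the counts reported above, so the effect of the unseen words is easy to compare. The names `acc_seen`, `acc_all` and `coverage` are introduced here for illustration.

```python
hits, wrongs, misses = 277165, 16421, 10643  # counts reported above

# Accuracy over the words the predictor could label
acc_seen = hits / (hits + wrongs) * 100

# Accuracy when every unseen word counts as a wrong answer
acc_all = hits / (hits + wrongs + misses) * 100

# Share of test words that the predictor had seen during training
coverage = (hits + wrongs) / (hits + wrongs + misses) * 100

print("%.2f%% on seen words, %.2f%% overall" % (acc_seen, acc_all))
# 94.41% on seen words, 91.10% overall
```

The overall accuracy is the accuracy on seen words multiplied by the coverage, so a baseline like this can only improve on unseen words by deciding how to label them.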
