# **FolT Homework 8**

## Alexander Praus, Maike Arnold

### Homework 8.1 (10 points)
In natural language processing, machine learning algorithms are widely used to automatically classify texts according
to a given task. Usually, a training and a development corpus are given in advance to prepare the classification system.
The training corpus is used to train a system, the development corpus is used to fine-tune the parameters. The testing
corpus is usually kept secret and is used only for the **final** evaluation of the systems. It must not be used for testing
during the development phase and for fine-tuning.

In the next task, you will create a part-of-speech tagger. The training, the development and the test set are given to you
in advance. Your tagger is supposed to be only trained on the training corpus and maybe on the development corpus.
You must not use the test corpus for training or fine-tuning your (hyper)parameters.


In [1]:
import nltk
from nltk.corpus import nps_chat, brown

In [2]:
def get_train_dev_test(tagged_parts):
  nr_parts = len(tagged_parts)
  #first 80% of the corpus
  train_parts = tagged_parts[:(nr_parts*8)//10] 
  # 80-90% of the corpus
  development_parts = tagged_parts[(nr_parts*8)//10:(nr_parts*9)//10] 
  # last 10% of the corpus
  test_parts = tagged_parts[((nr_parts*9)//10):] 
  print(nr_parts, ":", len(train_parts), len(development_parts), len(test_parts))
  return train_parts, development_parts, test_parts

In [3]:
train_posts, development_posts, test_posts = get_train_dev_test(
    nps_chat.tagged_posts(tagset='universal'))

train_sents, development_sents, test_sents = get_train_dev_test(
    brown.tagged_sents(tagset='universal'))

10567 : 8453 1057 1057
57340 : 45872 5734 5734
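
The printed sizes follow directly from the floor division in `get_train_dev_test`. As a quick check, the small sketch below reproduces them from the corpus sizes alone; the helper name `split_sizes` is hypothetical and not part of the notebook.

```python
def split_sizes(nr_parts):
    """Return (train, dev, test) sizes for the 80/10/10 split used above."""
    train_end = (nr_parts * 8) // 10   # end of the first 80%
    dev_end = (nr_parts * 9) // 10     # end of the 80-90% slice
    return train_end, dev_end - train_end, nr_parts - dev_end

print(split_sizes(10567))  # number of chat posts
print(split_sizes(57340))  # number of Brown sentences
```

Because of the floor division, the test slice absorbs any remainder, so the three sizes always add up to the corpus size.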


**Homework 8.1** (2 points, no programming needed)

Please explain why POS taggers are important for natural language processing and what kinds of downstream tasks
are enabled with POS tagging (give one example). What different strategies for implementing a POS tagger do you
already know and what are the pros and cons of these approaches (name at least two). Do you have suggestions how
they can be improved?

Answer each question in two or three sentences.



POS taggers are important because they tell us which part of speech a token is in its context. Without this information, many NLP tasks would be very difficult if not impossible. For example, machine translation (or any task where the meaning of a text has to be interpreted) would always translate homographs the same way, so that "lead" in "Lead is toxic" and "I lead the group" would be translated into German as either "Blei" or "führe" in both cases.

Two different kinds of POS taggers:
 - DefaultTagger: This kind of tagger simply assigns one and the same POS tag to all tokens. This is usually the most common tag.
     - Pro: easy to implement, fast
     - Con: not very accurate and pretty much useless as the only approach
 - BigramTagger: This tagger takes the tag of the preceding token into consideration, in addition to the token itself, when assigning a POS tag.
     - Pro: a lot more accurate, takes context into consideration
     - Con: more training data needed for good results
     
A BigramTagger is a specific type of NgramTagger. Increasing N takes more context into consideration, which might improve the results, although it also makes unseen contexts more likely. Otherwise, approaches such as deep learning could help improve the results of POS tagging.
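
To illustrate why context helps with homographs such as "lead", here is a minimal pure-Python sketch. It is not NLTK's implementation; the toy sentences and the table names `unigram` and `bigram` are assumptions made for illustration. It counts tags per word, and per pair of previous tag and word.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: one list of (word, tag) pairs per sentence.
train = [
    [("Lead", "NOUN"), ("is", "VERB"), ("toxic", "ADJ")],
    [("I", "PRON"), ("lead", "VERB"), ("the", "DET"), ("group", "NOUN")],
    [("the", "DET"), ("lead", "NOUN"), ("pipe", "NOUN")],
]

unigram = defaultdict(Counter)  # word -> tag counts
bigram = defaultdict(Counter)   # (previous tag, word) -> tag counts
for sent in train:
    prev = "<s>"  # marker for the sentence start
    for word, tag in sent:
        word = word.lower()
        unigram[word][tag] += 1
        bigram[(prev, word)][tag] += 1
        prev = tag

# A unigram lookup always returns the same tag for "lead" ...
print(unigram["lead"].most_common(1)[0][0])
# ... while the previous tag lets a bigram lookup tell the two uses apart.
print(bigram[("PRON", "lead")].most_common(1)[0][0])
print(bigram[("DET", "lead")].most_common(1)[0][0])
```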

**Homework 8.2** (8 points) 

Develop a part-of-speech tagger for texts in the chat domain and the brown corpus. You may explore several
directions such as handling rare tokens, handling special chat related phenomena like smilies, or using better and
more training data. You can try different things to improve the performance of the tagger on the training data, but be
careful not to overfit on the training data. Document your ideas and process well. 

Hints:

- You do not need to implement a tagger from scratch.
- You may train your tagger on more than one corpus.
- The corpora provided by NLTK offer a README in the corresponding folder, i.e. within the subfolders of ``nltk_data/corpora`` in your home folder (``Anwendungsdaten`` for Windows) you will find additional information.
- Use the universal tag set.
- Typical chat tokens like *’lol’* or *’:-)’* should be tagged as *’X’*, i.e. other.
- Read the paragraph *“Tagging Unknown Words”*. (See the NLTK book, chapter 5.5, page 206.)
- Use `import matplotlib.pyplot as plt` for the plotting.


Collect your top 3 results on the development set for the brown and the chat corpus with the different hyperparameters and plot them.
Evaluate your best tagger for brown and the chat corpus on the testset. Upload the code and report the accuracy as part of your submission. Use the
text field in the submission module to submit the final accuracy (just put a single number there for each corpus).



In [7]:
def calculate_baseline(corpora):
    '''Calculates baseline values for the following taggers:
        DefaultTagger, UnigramTagger, BigramTagger
        
        Attributes:
            corpora: list of tuples (train_set, test_set)
    '''
    default_tagger = nltk.DefaultTagger('NOUN')
    results = []
    for corpus in corpora:
        result = []
        unigram_tagger = nltk.UnigramTagger(corpus[0])
        bigram_tagger = nltk.BigramTagger(corpus[0])
        trigram_tagger = nltk.NgramTagger(3, corpus[0])
        
        result.append(default_tagger.evaluate(corpus[1]))
        result.append(unigram_tagger.evaluate(corpus[1]))
        result.append(bigram_tagger.evaluate(corpus[1]))
        result.append(trigram_tagger.evaluate(corpus[1]))
        
        results.append(result)
    return results

baselines = calculate_baseline([(train_posts, test_posts), (train_sents, test_sents)])

print('Baselines Chat')
print(baselines[0])
print()
print('Baselines Brown')
print(baselines[1])

Baselines Chat
[0.1724051896207585, 0.8265968063872255, 0.5174650698602794, 0.4209081836327345]

Baselines Brown
[0.1853691247040087, 0.91269042978982, 0.454579744766455, 0.2907839316024392]


In [15]:
def tag_unknown(corpus, n):
    '''Executes the suggested approach from the NLTK book, section 5.5
        Attributes:
            corpus: list of list of tagged tokens
            n: number of most common tokens to be used
        Returns:
            list: list of list with tokens not included in n most common replaced
    '''
    # count all tokens, then keep the n most frequent ones in a set for fast lookup
    freq = nltk.FreqDist(token for text in corpus for (token, _) in text)
    most_common = {token for (token, _) in freq.most_common(n)}
    new_sents = []
    for text in corpus:
        new_sent = []
        for (token, tag) in text:
            if token in most_common:
                new_sent.append((token, tag))
            else:
                new_sent.append(('UNK', tag))
        new_sents.append(new_sent)
    return new_sents

    
def test_tagger(train_data, test_data, name):
    default_tagger = nltk.DefaultTagger('NOUN')
    unigram_tagger = nltk.UnigramTagger(train_data, backoff=default_tagger)
    bigram_tagger = nltk.BigramTagger(train_data, backoff=unigram_tagger)
    trigram_tagger = nltk.NgramTagger(3, train_data, backoff=bigram_tagger)
    
    print('Results: ' + name)
    print(unigram_tagger.evaluate(test_data))
    print(bigram_tagger.evaluate(test_data))
    print(trigram_tagger.evaluate(test_data))
    
test_tagger(tag_unknown(train_sents, 1000), tag_unknown(test_sents, 1000), 'Brown Corpus 1000')
test_tagger(tag_unknown(train_sents, 2000), tag_unknown(test_sents, 2000), 'Brown Corpus 2000')
test_tagger(tag_unknown(train_sents, 3000), tag_unknown(test_sents, 3000), 'Brown Corpus 3000')
test_tagger(tag_unknown(train_sents, 4000), tag_unknown(test_sents, 4000), 'Brown Corpus 4000')
test_tagger(tag_unknown(train_sents, 5000), tag_unknown(test_sents, 5000), 'Brown Corpus 5000')

test_tagger(tag_unknown(train_posts, 1000), tag_unknown(test_posts, 1000), 'Chat Corpus 1000')
test_tagger(tag_unknown(train_posts, 2000), tag_unknown(test_posts, 2000), 'Chat Corpus 2000')
test_tagger(tag_unknown(train_posts, 3000), tag_unknown(test_posts, 3000), 'Chat Corpus 3000')
test_tagger(tag_unknown(train_posts, 4000), tag_unknown(test_posts, 4000), 'Chat Corpus 4000')
test_tagger(tag_unknown(train_posts, 5000), tag_unknown(test_posts, 5000), 'Chat Corpus 5000')

Results: Brown Corpus 1000
0.8229919741832736
0.8446491062634899
0.8448586576140483
Results: Brown Corpus 2000
0.8521824773160663
0.8694809413046667
0.8694599861696108
Results: Brown Corpus 3000
0.8697952683305044
0.8844114750319566
0.8837723434127533
Results: Brown Corpus 4000
0.8810376982879655
0.8939251063473104
0.8933174074306909
Results: Brown Corpus 5000
0.8899960185243394
0.9019090128035875
0.9008193457806836
Results: Chat Corpus 1000
0.8265968063872255
0.845309381237525
0.843812375249501
Results: Chat Corpus 2000
0.8600299401197605
0.875249500998004
0.8727544910179641
Results: Chat Corpus 3000
0.8642714570858283
0.8792415169660679
0.876746506986028
Results: Chat Corpus 4000
0.8715069860279441
0.8862275449101796
0.8837325349301397
Results: Chat Corpus 5000
0.8779940119760479
0.8924650698602794
0.8904690618762475


In [23]:
freq_tags_brown = nltk.FreqDist([tag for text in train_sents for (_,tag) in text])
print(freq_tags_brown.most_common(20))

freq_tags_chat = nltk.FreqDist([tag for text in train_posts for (_,tag) in text])
print(freq_tags_chat.most_common(20))

def test_tagger_def(train_data, test_data, tag, name):
    '''evaluates default, unigram, and bigram tagger
        Attributes:
            train_data: training data for tagger
            test_data: test data for tagger
            tag: tag to be used by DefaultTagger
            name: name of corpus/test to be printed
    '''
    default_tagger = nltk.DefaultTagger(tag)
    unigram_tagger = nltk.UnigramTagger(train_data, backoff=default_tagger)
    bigram_tagger = nltk.BigramTagger(train_data, backoff=unigram_tagger)
    
    print('Results: ' + name)
    print(unigram_tagger.evaluate(test_data))
    print(bigram_tagger.evaluate(test_data))

test_tagger_def(tag_unknown(train_sents, 5000), tag_unknown(test_sents, 5000), 'VERB', 'Brown Corpus 5000, VERB')
test_tagger_def(tag_unknown(train_posts, 5000), tag_unknown(test_posts, 5000), 'VERB', 'Chat Corpus 5000, VERB')
test_tagger_def(tag_unknown(train_sents, 5000), tag_unknown(test_sents, 5000), 'NUM', 'Brown Corpus 5000, NUM')
test_tagger_def(tag_unknown(train_posts, 5000), tag_unknown(test_posts, 5000), 'NUM', 'Chat Corpus 5000, NUM')

[('NOUN', 241528), ('VERB', 150459), ('ADP', 126332), ('.', 118482), ('DET', 116989), ('ADJ', 73866), ('ADV', 45940), ('PRON', 35550), ('CONJ', 32177), ('PRT', 23316), ('NUM', 13802), ('X', 1205)]
[('NOUN', 7731), ('VERB', 7116), ('X', 5381), ('PRON', 3810), ('.', 3408), ('ADV', 1826), ('DET', 1778), ('ADP', 1635), ('ADJ', 1491), ('PRT', 784), ('CONJ', 567), ('NUM', 433)]
Results: Brown Corpus 5000, VERB
0.8601454286372876
0.8706439513002662
Results: Chat Corpus 5000, VERB
0.842564870259481
0.8577844311377245
Results: Brown Corpus 5000, NUM
0.8418411181660066
0.853031160285828
Results: Chat Corpus 5000, NUM
0.8305888223552894
0.842564870259481


In [33]:
# Test training the taggers on train_sents and train_posts combined

default_tagger = nltk.DefaultTagger('NOUN')
unigram_tagger = nltk.UnigramTagger(train_sents+train_posts, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(train_sents+train_posts, backoff=unigram_tagger)
print('Results Brown: ')
print(unigram_tagger.evaluate(test_sents))
print(bigram_tagger.evaluate(test_sents))

print('Results Chat: ')
print(unigram_tagger.evaluate(test_posts))
print(bigram_tagger.evaluate(test_posts))

Results Brown: 
0.9380670983424488
0.9462710337168123
Results Chat: 
0.8602794411177644
0.8832335329341318


In [34]:
# Using train_sents/train_posts + development_sents/development_posts

default_tagger = nltk.DefaultTagger('NOUN')
unigram_tagger = nltk.UnigramTagger(train_sents+development_sents, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(train_sents+development_sents, backoff=unigram_tagger)
print('Results Brown: ')
print(unigram_tagger.evaluate(test_sents))
print(bigram_tagger.evaluate(test_sents))

unigram_tagger = nltk.UnigramTagger(train_posts+development_posts, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(train_posts+development_posts, backoff=unigram_tagger)
print('Results Chat: ')
print(unigram_tagger.evaluate(test_posts))
print(bigram_tagger.evaluate(test_posts))

Results Brown: 
0.9398168521196119
0.9485446658703716
Results Chat: 
0.8807385229540918
0.8962075848303394


In [37]:
def replace_tags(old_tag, new_tag, corpus):
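    '''replaces all occurrences of a tag in a corpus
        Attributes:
            old_tag: tag to be replaced
            new_tag: new tag to be used
            corpus: list of list of tagged tokens
    '''
    sents = []
    for sentence in corpus:
        sent = []
        for (token,tag) in sentence:
            if tag == old_tag:
                # assignment, not comparison: with `tag == new_tag` the tag is never
                # replaced, which is why the recorded output below equals that of In [34]
                tag = new_tag
            sent.append((token, tag))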
        sents.append(sent)
    return sents

default_tagger = nltk.DefaultTagger('NOUN')
unigram_tagger = nltk.UnigramTagger(replace_tags('.', 'X', train_posts+development_posts), backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(replace_tags('.', 'X', train_posts+development_posts), backoff=unigram_tagger)
print('Results: ')
print(unigram_tagger.evaluate(test_posts))
print(bigram_tagger.evaluate(test_posts))
        

Results: 
0.8807385229540918
0.8962075848303394


## Notes/Documentation

### Baseline

The first thing we did was check what kind of results we get without changing the data. For this, we looked at four common NLTK taggers, each trained without a backoff tagger, to establish baseline values. These were our results:

#### Brown

- **Default:** 0.185
- **Unigram:** 0.913
- **Bigram:** 0.455
- **Trigram:** 0.291

#### Chat

- **Default:** 0.172
- **Unigram:** 0.827
- **Bigram:** 0.517
- **Trigram:** 0.421

In both corpora the UnigramTagger was the most successful. We found the accuracy of the basic UnigramTagger quite surprising, as 0.913 and 0.827 are quite good accuracies.
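
One likely reason for the low Bigram and Trigram baselines is that these taggers were trained without a backoff tagger. NLTK's n-gram taggers assign `None` when a context was not seen in training, and since the context includes the previous tag, a single `None` makes the following contexts unseen as well. The pure-Python sketch below imitates this behaviour on hypothetical toy data; it is an illustration, not NLTK's implementation, and the names `bigram_table` and `tag_without_backoff` are made up.

```python
# (previous tag, word) -> tag, as it might be learned from the single
# hypothetical training sentence "the dog barks loudly"
bigram_table = {
    ("<s>", "the"): "DET",
    ("DET", "dog"): "NOUN",
    ("NOUN", "barks"): "VERB",
    ("VERB", "loudly"): "ADV",
}

def tag_without_backoff(words):
    """Tag left to right; an unseen context yields None (no backoff)."""
    tags = []
    prev = "<s>"  # marker for the sentence start
    for word in words:
        tag = bigram_table.get((prev, word))
        tags.append(tag)
        prev = tag  # a None tag becomes part of the next context
    return tags

# "cat" is unknown, so every later context contains None and is unseen too
print(tag_without_backoff(["the", "cat", "barks", "loudly"]))
```

In this toy case one unknown word leaves the rest of the sentence untagged, which would explain why more context lowers the accuracy when no backoff is available.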


### NLTK 5.5 Tagging Unknown Words

We then read the paragraph *Tagging Unknown Words* as suggested in the instructions and decided to implement a tagger like that. That means we created a FreqDist of the tokens in each corpus, kept only the *n* most common tokens, and replaced all others with the token "UNK". We experimented with different values of *n*. These were the results for each corpus with the different values of *n*.


#### Brown

##### 1000

- **Unigram:** 0.823
- **Bigram:** 0.845
- **Trigram:** 0.845

##### 2000

- **Unigram:** 0.852
- **Bigram:** 0.869
- **Trigram:** 0.869

##### 3000

- **Unigram:** 0.870
- **Bigram:** 0.884
- **Trigram:** 0.884

##### 4000

- **Unigram:** 0.881
- **Bigram:** 0.894
- **Trigram:** 0.893

##### 5000

- **Unigram:** 0.890
- **Bigram:** 0.902
- **Trigram:** 0.901

#### Chat

##### 1000

- **Unigram:** 0.827
- **Bigram:** 0.845
- **Trigram:** 0.844

##### 2000

- **Unigram:** 0.860
- **Bigram:** 0.875
- **Trigram:** 0.873

##### 3000

- **Unigram:** 0.864
- **Bigram:** 0.879
- **Trigram:** 0.877

##### 4000

- **Unigram:** 0.872
- **Bigram:** 0.886
- **Trigram:** 0.884

##### 5000

- **Unigram:** 0.878
- **Bigram:** 0.892
- **Trigram:** 0.890


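The best result was achieved with n=5000.

Compared with the baselines, the BigramTagger and TrigramTagger improved substantially in all scenarios. The UnigramTagger improved on the Chat corpus (0.827 to 0.878 at n=5000) but stayed below its baseline on the Brown corpus (0.913 vs. 0.890). Note that these runs differ from the baselines in two ways: rare tokens are replaced by 'UNK', and each tagger falls back on the next simpler one. The best tagger seems to be the BigramTagger, which outperforms the other two in every run except Brown with n=1000, where the TrigramTagger is marginally ahead (0.8449 vs. 0.8446).

Since the TrigramTagger was never more than marginally better than the BigramTagger, we chose not to test it any further, as the additional context does not seem to improve accuracy.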
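
The replacement step described in this section can be sketched without NLTK as follows. This is a variation rather than the code that produced the numbers above: `tag_unknown` builds a separate frequency list for every corpus it is given, whereas this sketch derives the vocabulary once from the training data and reuses it for the evaluation data. The helper names `build_vocab` and `replace_rare` and the toy data are hypothetical.

```python
from collections import Counter

def build_vocab(corpus, n):
    """Return the set of the n most frequent tokens in a tagged corpus."""
    counts = Counter(token for sent in corpus for token, _ in sent)
    return {token for token, _ in counts.most_common(n)}

def replace_rare(corpus, vocab):
    """Replace every token outside vocab with 'UNK', keeping its tag."""
    return [[(tok if tok in vocab else "UNK", tag) for tok, tag in sent]
            for sent in corpus]

# Hypothetical toy data
train = [[("the", "DET"), ("dog", "NOUN")], [("the", "DET"), ("cat", "NOUN")]]
test = [[("the", "DET"), ("bird", "NOUN")]]

vocab = build_vocab(train, 1)        # only "the" is kept
print(replace_rare(train, vocab))
print(replace_rare(test, vocab))     # "bird" was never seen, so it becomes UNK
```

Using a set for the vocabulary keeps the membership test cheap, and sharing it between training and evaluation data means a token is treated as known only if the tagger could have seen it during training.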


### Changing DefaultTagger

We decided to have the taggers fall back on each other, meaning the TrigramTagger falls back on the BigramTagger, which falls back on the UnigramTagger, which falls back on the DefaultTagger. The DefaultTagger tags everything as 'NOUN', and we decided to experiment with this to see whether changing the default tag would have an effect on our results. We tested this only on the best case of the previous approach, i.e. with all but the n=5000 most common tokens replaced.

We tested this with 'VERB' and 'NUM' and the results got considerably worse (for example, the Brown BigramTagger dropped from 0.902 to 0.871 with 'VERB' and 0.853 with 'NUM'), which isn't really surprising seeing as 'NOUN' is the most common tag in both corpora.
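
The backoff chain and the choice of default tag can be sketched in plain Python as below. This only imitates the idea behind chaining NLTK taggers via `backoff`; the lookup tables, the function `tag_with_backoff`, and the toy sentences are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy training data
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]

# The default tag is the most frequent tag in the training data.
tag_counts = Counter(tag for sent in train for _, tag in sent)
default_tag = tag_counts.most_common(1)[0][0]

unigram = {}  # word -> tag (each toy word has exactly one tag)
bigram = {}   # (previous tag, word) -> tag
for sent in train:
    prev = "<s>"  # marker for the sentence start
    for word, tag in sent:
        unigram[word] = tag
        bigram[(prev, word)] = tag
        prev = tag

def tag_with_backoff(words):
    """Try the bigram table first, then the unigram table, then the default."""
    tags = []
    prev = "<s>"
    for word in words:
        tag = bigram.get((prev, word))   # most specific model first
        if tag is None:
            tag = unigram.get(word)      # back off to the word alone
        if tag is None:
            tag = default_tag            # last resort for unknown words
        tags.append(tag)
        prev = tag
    return tags

print(default_tag)
print(tag_with_backoff(["cats", "chase", "the", "mouse"]))
```

Only the unknown word "mouse" reaches the default tag here, which suggests why the choice of default matters most for tokens that no other tagger in the chain has seen.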

### Training Data

We decided to see what would happen if we used the same taggers for both corpora, trained on both sets of training data combined. We expected the results to get worse but wanted to see how much worse.

As expected, the results got worse:

#### Results Brown: 
- UnigramTagger: 0.938
- BigramTagger: 0.946

#### Results Chat: 
- UnigramTagger: 0.860
- BigramTagger: 0.883


### Development Set

We forgot about the development set until now. We decided to train the taggers on the respective training and development data and then test them again. Again we use only the UnigramTagger and BigramTagger. Note that the code for this experiment (In [34]) trains on the original tokens, without the 'UNK' replacement.

We expected the results to improve, since the taggers are trained on more relevant data, and they did. More training data helps here, whereas it led to worse results in the previous approach (combining the Brown and Chat training data), most likely because the Chat and Brown corpora are very different.

#### Results Brown: 

- UnigramTagger: 0.940
- BigramTagger: 0.949

#### Results Chat: 
- UnigramTagger: 0.881
- BigramTagger: 0.896

### Changing Tags

A last idea was to change all "." tags in the Chat corpus to "X", on the assumption that punctuation is often part of typical chat tokens, so that this might improve the results. It did not: the accuracies were identical to those of the previous experiment (0.881 and 0.896).
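
For reference, tag replacement can be written compactly as a nested list comprehension. The sketch below uses hypothetical toy data and returns a new corpus instead of modifying the input.

```python
def replace_tags(old_tag, new_tag, corpus):
    """Return a copy of corpus with every old_tag replaced by new_tag."""
    return [[(token, new_tag if tag == old_tag else tag) for token, tag in sent]
            for sent in corpus]

# Hypothetical toy post
posts = [[("hi", "X"), ("!", "."), ("lol", "X")]]
print(replace_tags(".", "X", posts))
```

If the training tags are remapped this way, the gold tags of the evaluation data would need the same mapping; otherwise every punctuation token tagged 'X' is counted as an error against the gold tag '.'.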
