# Exercise 1 
### 1a
We will simplify and use the universal pos tagset in this exercise to make
the experiments run faster.

We will be a little more cautious than the NLTK-book, when it comes to training and test sets.
We will:
- Split the News-section into three sets:
    - 10% for final testing which we tuck aside for now, call it news_test
    - 10% for development testing, call it news_dev_test
    - 80% for training, call it news_train
- Make the data sets, and repeat the training and evaluation with news_train and news_dev_test.
- Report results with 4 decimal places and stick to that throughout the exercise set.

How does the result compare to using the full Brown tagset? Why do you think one of the tagsets
yields higher scores than the other?


In [1]:
import re
import pprint
import nltk
import numpy as np
from tqdm import tqdm
import time

In [2]:
from nltk.corpus import brown

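tagged_sents = brown.tagged_sents(categories='news') # read in the corpus as sentences
# NOTE: no tagset='universal' argument is passed, so the full Brown tagset is used (see the tags printed below)
# Shouldn't we randomize these?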

size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]

In [3]:
news_train = tagged_sents[:-2*size] 
news_dev_test = tagged_sents[-2*size:-size]
news_final_test = tagged_sents[-size:]

print('90/10/00:', len(train_sents), len(test_sents))
print('80/10/10:', len(news_train), len(news_dev_test), len(news_final_test))

90/10/00: 4161 462
80/10/10: 3699 462 462
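
The comment in cell 2 asks whether the sentences should be randomized before splitting. A minimal sketch of a seeded shuffle followed by the same 80/10/10 slicing, using only the standard library. The helper name `shuffled_split` and the seed are assumptions for illustration and are not part of the exercise. Shuffling would change the scores reported below, which were obtained without it.

```python
import random

def shuffled_split(items, test_fraction=0.1, seed=0):
    """Shuffle a copy of items, then slice it into train / dev-test / final-test."""
    items = list(items)                  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(items)   # seeded, so the split is reproducible
    size = int(len(items) * test_fraction)
    train = items[:-2 * size]
    dev_test = items[-2 * size:-size]
    final_test = items[-size:]
    return train, dev_test, final_test

# Stand-in for the 4623 news sentences (integers instead of tagged sentences)
train, dev_test, final_test = shuffled_split(range(4623))
print(len(train), len(dev_test), len(final_test))
```

A fixed seed keeps the three sets stable between runs, which matters when scores from different cells are compared.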


In [4]:
news_train[0]

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL'),
 ('said', 'VBD'),
 ('Friday', 'NR'),
 ('an', 'AT'),
 ('investigation', 'NN'),
 ('of', 'IN'),
 ("Atlanta's", 'NP$'),
 ('recent', 'JJ'),
 ('primary', 'NN'),
 ('election', 'NN'),
 ('produced', 'VBD'),
 ('``', '``'),
 ('no', 'AT'),
 ('evidence', 'NN'),
 ("''", "''"),
 ('that', 'CS'),
 ('any', 'DTI'),
 ('irregularities', 'NNS'),
 ('took', 'VBD'),
 ('place', 'NN'),
 ('.', '.')]

In [5]:
news_train[1][-3:]

[('was', 'BEDZ'), ('conducted', 'VBN'), ('.', '.')]

In [6]:
def pos_features(sentence, i, history):
    '''
    Builds a feature dict for the word at index i of an untagged sentence.
    
    The features are the last 1, 2 and 3 letters (suffixes) of the current word
    plus the previous word, so the classifier can combine how the word ends
    with the context it appears in.
    
    History (the tags assigned so far) is currently not used.
    
    '''
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"      # Marks the start of the sentence
    else:
        features["prev-word"] = sentence[i-1]  # sentence[i-1] is the previous word, not the whole sentence
    return features
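
To make the feature extractor concrete, here is a small self-contained sketch that restates the same logic and prints the feature dicts for a made-up three-word sentence. The example sentence is an assumption for illustration.

```python
def pos_features(sentence, i, history):
    """Suffixes of the current word plus the previous word (same logic as the cell above)."""
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    features["prev-word"] = "<START>" if i == 0 else sentence[i - 1]
    return features

sentence = ["The", "jury", "said"]        # made-up, already untagged
first = pos_features(sentence, 0, [])     # sentence-initial word gets the <START> marker
second = pos_features(sentence, 1, [])    # "jury", preceded by "The"
print(first)
print(second)
```

Note that for a three-letter word such as "The", `suffix(3)` is the whole word.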

In [7]:
class ConsecutivePosTagger(nltk.TaggerI):   # Inherits attributes from TaggerI
    '''
    
    [TaggerI package](https://www.nltk.org/api/nltk.tag.html?highlight=taggeri#nltk.tag.api.TaggerI)
    evaluate(gold)[source]
        Score the accuracy of the tagger against the gold standard. Strip the tags from the gold standard text, retag it using the tagger, then compute the accuracy score.

        Parameters
        gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.

        Return type
        float - the score of that test set.
    '''
    
    def __init__(self, train_sents, features=pos_features):
        self.features = features
        train_set = []
        for tagged_sent in tqdm(train_sents):             # Singling out each sentence
            untagged_sent = nltk.tag.untag(tagged_sent)   # Untagged sentence for words as list
            history = []                                  # Tags assigned so far in this sentence
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = features(untagged_sent, i, history) # This is a call to pos_features
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = self.features(sentence, i, history) # This is a call to pos_features
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

In [8]:
old_tagger = ConsecutivePosTagger(train_sents)
print(round(old_tagger.evaluate(test_sents), 4))

100%|█████████████████████████████████████████████████████████████████████████| 4161/4161 [00:00<00:00, 5196.72it/s]


0.7915


In [9]:
new_tagger = ConsecutivePosTagger(news_train)
print(round(new_tagger.evaluate(news_dev_test), 4))

100%|█████████████████████████████████████████████████████████████████████████| 3699/3699 [00:00<00:00, 4661.34it/s]


0.7653


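Since our model is trained on fewer sentences (80% instead of 90% of the news section), it makes sense that it gives a slightly lower accuracy score than the model trained on the larger split. Note that the two scores are also computed on different test sets, so they are not directly comparable.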

### 1b

One of the first things we should do in an experiment like this, is to establish a reasonable baseline.

A reasonable baseline here is the <mark>Most Frequent Class</mark> baseline. 
Each word which is seen during training should get its most frequent tag from the training.  
For words not seen during training, we simply use the most frequent overall tag.
With news_train as training set and news_dev_test as evaluation set, what is the accuracy of this
baseline?

Does the tagger from part (a) using the features from the NLTK book beat the baseline?
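
The Most Frequent Class baseline described above can be sketched without NLTK, using only counters. The function names and the toy training data are assumptions for illustration; the notebook itself relies on `UnigramTagger`.

```python
from collections import Counter, defaultdict

def train_most_frequent(tagged_sents):
    """Return (word -> most frequent tag, overall most frequent tag)."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            per_word[word][tag] += 1
            overall[tag] += 1
    best_tag = {word: counts.most_common(1)[0][0] for word, counts in per_word.items()}
    return best_tag, overall.most_common(1)[0][0]

def tag_most_frequent(words, best_tag, default_tag):
    """Tag seen words with their most frequent tag, unseen words with the default."""
    return [(w, best_tag.get(w, default_tag)) for w in words]

# Toy training data, made up for illustration
toy_train = [[("the", "DET"), ("dog", "NOUN")],
             [("the", "DET"), ("run", "NOUN")],
             [("dogs", "NOUN"), ("run", "VERB")],
             [("cats", "NOUN"), ("run", "VERB")]]
best_tag, default_tag = train_most_frequent(toy_train)
tagged = tag_most_frequent(["the", "run", "fish"], best_tag, default_tag)
print(tagged)
```

Here "run" is tagged with its majority tag, and the unseen word "fish" falls back to the overall most frequent tag.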

[FROM NLTK](https://www.nltk.org/api/nltk.tag.html?highlight=most%20frequent)

_This package defines several taggers, which take a list of tokens, assign a tag to each one, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:_

In order to make sure I split the tags properly, I will use a frequency distribution to count the tags. 

From Wikipedia:
_Note that some versions of the tagged Brown corpus contain combined tags. For instance the word "wanna" is tagged VB+TO, since it is a contracted form of the two words, want/VB and to/TO. Also some tags might be negated, for instance "aren't" would be tagged "BER*", where * signifies the negation. Additionally, tags may have hyphenations: The tag -HL is hyphenated to the regular tags of words in headlines. The tag -TL is hyphenated to the regular tags of words in titles. The hyphenation -NC signifies an emphasized word. Sometimes the tag has a FW- prefix which means foreign word._

In [10]:
pos_counts = nltk.FreqDist([tag
                            for sentence in news_train
                            for (word, tag) in sentence])

print("the five most common tags are", pos_counts.most_common(5))

the five most common tags are [('NN', 10656), ('IN', 8404), ('AT', 7080), ('NP', 5859), (',', 3976)]


In [11]:
most_common = pos_counts.most_common(1)[0][0]  # overall most frequent tag ('NN')

In [14]:
from nltk.corpus import brown
from nltk.tag import UnigramTagger
baseline_news_tagger = UnigramTagger(news_train)

# Tag the dev set; unseen words come back with tag None and get the most frequent overall tag instead
baseline_news = [
    [(word, tag) if tag is not None else (word, most_common)
     for word, tag in baseline_news_tagger.tag(nltk.tag.untag(sentence))]
    for sentence in news_dev_test
]

In [20]:
results = []

for i in range(len(news_dev_test)):
    for w in range(len(news_dev_test[i])):
        if news_dev_test[i][w] == baseline_news[i][w]:
            results.append(1)
        else:
            results.append(0)

In [21]:
print('The baseline accuracy is:', round(np.mean(results),4))

The baseline accuracy is: 0.8268
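
The accuracy above is a token-level score: the share of `(word, tag)` pairs that match the gold standard. A compact sketch of the same computation on made-up data, where the helper name `token_accuracy` is an assumption:

```python
def token_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose (word, tag) pair matches the gold standard."""
    correct = total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for gold_pair, predicted_pair in zip(gold, predicted):
            correct += gold_pair == predicted_pair   # True counts as 1
            total += 1
    return correct / total

# Toy data, made up for illustration: one of four tokens is tagged wrongly
gold = [[("the", "DET"), ("dog", "NOUN")], [("dogs", "NOUN"), ("run", "VERB")]]
pred = [[("the", "DET"), ("dog", "NOUN")], [("dogs", "NOUN"), ("run", "NOUN")]]
accuracy = token_accuracy(gold, pred)
print(round(accuracy, 4))
```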


Our model gave an accuracy of 0.7653, and therefore does _not_ beat this baseline. 

# Exercise 2
Our goal will now be to improve the tagger compared to the simple suffix-based tagger. For the further experiments, we move to scikit-learn which yields more options for considering various alternatives. We have reimplemented the ConsecutivePosTagger to use scikit-learn classifiers below. We have made the classifier a parameter so that it can easily be exchanged. We start with the BernoulliNB-classifier which should correspond to the way it is done in NLTK.

In [26]:
import sklearn

from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer


class ScikitConsecutivePosTagger(nltk.TaggerI): 

    def __init__(self, train_sents, 
                 features=pos_features, clf = BernoulliNB()):
        # Using pos_features as default.
        self.features = features
        train_features = []
        train_labels = []
        for tagged_sent in tqdm(train_sents):
            history = []
            untagged_sent = nltk.tag.untag(tagged_sent)
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = features(untagged_sent, i, history)
                train_features.append(featureset)
                train_labels.append(tag)
                history.append(tag)
        v = DictVectorizer()
        X_train = v.fit_transform(train_features)
        y_train = np.array(train_labels)
        
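        # For help in exercise 5
        try:
            clf.fit(X_train, y_train)
        except MemoryError:
            # Fall back to mini-batch training; requires a classifier that implements partial_fit
            print('Memory error. Attempting partial fit...')
            classes = np.unique(y_train)    # partial_fit needs the full set of classes up front
            n_samples = X_train.shape[0]
            batch_size = min(2000, n_samples)
            for n in range(100):    # arbitrarily chose this range
                batch_indexes = np.random.choice(n_samples, size=batch_size, replace=False)
                clf.partial_fit(X_train[batch_indexes, :], y_train[batch_indexes], classes=classes)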
        
        
        self.classifier = clf
        self.dict = v

    def tag(self, sentence):
        test_features = []
        history = []
        for i, word in enumerate(sentence):
            featureset = self.features(sentence, i, history)
            test_features.append(featureset)
        X_test = self.dict.transform(test_features)
        tags = self.classifier.predict(X_test)
        return zip(sentence, tags)

### Part 2a.
In this part, we will train the `ScikitConsecutivePosTagger` on the `news_train` set and test on the `news_dev_test` set with the `pos_features`, to see if we get the same result as in exercise 1a.

In [23]:
sklearn_tagger = ScikitConsecutivePosTagger(news_train)

100%|█████████████████████████████████████████████████████████████████████████| 3699/3699 [00:00<00:00, 5589.83it/s]


In [24]:
round(sklearn_tagger.evaluate(news_dev_test), 4)

0.5853

### Part 2b.
I get inferior results compared to the NLTK set-up with the same feature extractor. The only explanation I could find is that the smoothing is too strong. `BernoulliNB()` from scikit-learn uses Laplace ("add-one") smoothing by default. This is generalized to Lidstone smoothing through the `alpha` parameter, `BernoulliNB(alpha=...)`. We can therefore tune this hyper-parameter by trying different values of `alpha`, specifically `[1, 0.5, 0.1, 0.01, 0.001, 0.0001]`.
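
To see why a large `alpha` can hurt, the sketch below computes a Lidstone-smoothed estimate by hand. It assumes binary (present/absent) features, hence two outcomes per feature, which I believe matches how `BernoulliNB` smooths. The helper `lidstone` and the counts are made up for illustration.

```python
def lidstone(count, total, alpha, n_outcomes=2):
    """Smoothed estimate (count + alpha) / (total + alpha * n_outcomes)."""
    return (count + alpha) / (total + alpha * n_outcomes)

# Hypothetical case: a feature never seen with a tag that occurs only 10 times
for alpha in [1, 0.5, 0.1, 0.01, 0.001, 0.0001]:
    print(alpha, round(lidstone(0, 10, alpha), 6))
```

With many rare tags and a very large, sparse feature space, the probability mass that `alpha=1` hands to unseen feature/tag combinations adds up quickly, while smaller values stay closer to the observed counts.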

In [27]:
alphas =  [1, 0.5, 0.1, 0.01, 0.001, 0.0001]
scores = []
for a in alphas:
    sklearn_tagger = ScikitConsecutivePosTagger(news_train, clf= BernoulliNB(alpha=a))
    scores.append(round(sklearn_tagger.evaluate(news_dev_test), 4))

100%|█████████████████████████████████████████████████████████████████████████| 3699/3699 [00:00<00:00, 6043.90it/s]
100%|█████████████████████████████████████████████████████████████████████████| 3699/3699 [00:00<00:00, 7772.06it/s]
100%|█████████████████████████████████████████████████████████████████████████| 3699/3699 [00:00<00:00, 5786.51it/s]
100%|█████████████████████████████████████████████████████████████████████████| 3699/3699 [00:00<00:00, 5803.78it/s]
100%|█████████████████████████████████████████████████████████████████████████| 3699/3699 [00:00<00:00, 5368.22it/s]
100%|█████████████████████████████████████████████████████████████████████████| 3699/3699 [00:00<00:00, 4638.10it/s]


In [28]:
best = [z for z in zip(alphas, scores) if z[1]==max(scores)][0]
print(f'Best alpha {best[0]} with score {best[1]}')

Best alpha 0.1 with score 0.7664


In [29]:
scores # All of the scores

[0.5853, 0.6811, 0.7664, 0.7631, 0.7419, 0.7348]
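
The best pair can also be picked with `max` and a key function instead of filtering the zipped list. A small sketch using the scores printed above:

```python
alphas = [1, 0.5, 0.1, 0.01, 0.001, 0.0001]
scores = [0.5853, 0.6811, 0.7664, 0.7631, 0.7419, 0.7348]  # copied from the output above

# max() with a key returns the first pair that has the highest score
best_alpha, best_score = max(zip(alphas, scores), key=lambda pair: pair[1])
print(f'Best alpha {best_alpha} with score {best_score}')
```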

### Part 2c.
To improve the results, we can change the feature selector, that is, which attributes we extract from the sentences. So far, `pos_features` uses the previous word and the last 1, 2 and 3 letters of the current word as features for predicting each word's tag.

We start with a simple improvement of our feature selector. In addition to the previous word, we will expand our feature selector to also contain the word itself. Intuitively, the word itself should be a stronger feature than the previous word.

In [30]:
def expanded_pos_features(sentence, i, history):
    '''
    Builds a feature dict for the word at index i of an untagged sentence.
    
    The features are the last 1, 2 and 3 letters (suffixes) of the current word,
    the previous word, and the current word itself.
    
    History (the tags assigned so far) is currently not used.
    
    '''
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"      # Marks the start of the sentence
    else:
        features["prev-word"] = sentence[i-1]  # sentence[i-1] is the previous word, not the whole sentence
    
    features['token'] = sentence[i]            # add word as feature to data
    
    return features

With this new feature selector, we can rerun the experiment with the various `alphas` and record the results. 

It will be particularly interesting to see if we get the same `alpha` as above, along with the changes in the accuracy.

In [31]:
expanded_scores = []
for a in tqdm(alphas):
    sklearn_tagger = ScikitConsecutivePosTagger(news_train, features=expanded_pos_features, clf= BernoulliNB(alpha=a))
    expanded_scores.append(round(sklearn_tagger.evaluate(news_dev_test), 4))


In [43]:
expanded_scores

[0.5603, 0.7143, 0.8416, 0.8436, 0.8348, 0.8321]

In [42]:
expanded_best = [z for z in zip(alphas, expanded_scores) if z[1]==max(expanded_scores)][0]
print(f'Best alpha {expanded_best[0]} with score {expanded_best[1]}')

Best alpha 0.01 with score 0.8436


Here we got a slightly different `alpha` than before, although the alpha from above (0.1) came in a close second. Every score except the one for `alpha=1` is higher than in Part 2b, so the model benefits from including the token alongside the previous features.

With 0.8436, we also score slightly better than the baseline of 0.8268.



__(EXPLAIN WHY THESE SHOULD THEORETICALLY BE THE SAME!!!)__




# Ex 3: Logistic regression (10 points)
### Part a.
We proceed with the best feature selector from the last exercise. 

We want to study the effect of the learner. 

We start by passing `LogisticRegression` instead of `BernoulliNB` as the `clf` parameter of our sklearn tagger. We can then train again on `news_train` and test on `news_dev_test`, recording the results.

In [45]:
logreg_tagger = ScikitConsecutivePosTagger(news_train, features=expanded_pos_features, clf= LogisticRegression())
round(logreg_tagger.evaluate(news_dev_test), 4)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.886

This already scores better than the optimally tuned Bernoulli Naive Bayes (0.8436), beating it by about 4 percentage points!

### Part b.
Similarly to the Naive Bayes classifier, the logistic regression method can be tuned for our dataset. Smoothing for LogisticRegression is done by _regularization_. In scikit-learn, regularization is expressed by the parameter C. A smaller C means a heavier smoothing. (C is the inverse of the parameter $\alpha$ in the lectures.) It will be interesting to see if we can get this same theoretical value for C in practice.
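
As a sketch of why a smaller C means heavier regularization, consider the L2-penalized objective for the binary case with labels $y_i \in \{-1, +1\}$, written the way I understand scikit-learn's documentation to state it (the multiclass case is analogous):

$$
\min_{w,\,b}\; \frac{1}{2}\lVert w \rVert_2^2 \;+\; C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (w^\top x_i + b)}\right)
$$

Dividing by $C$ does not change the minimizer and gives

$$
\min_{w,\,b}\; \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (w^\top x_i + b)}\right) \;+\; \frac{1}{2C}\lVert w \rVert_2^2
$$

so $1/C$ takes the place of the regularization weight. That this weight is what the lectures call $\alpha$ is an assumption based on the remark above.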

We will try with C in range [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0] and see which value yields the best result.

In [46]:
C = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

logreg_scores = []
for c in tqdm(C):
    sklearn_tagger = ScikitConsecutivePosTagger(news_train, features=expanded_pos_features, clf= LogisticRegression(C=c))
    logreg_scores.append(round(sklearn_tagger.evaluate(news_dev_test), 4))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [47]:
logreg_scores

[0.6747, 0.8348, 0.886, 0.8963, 0.8954, 0.8929]

In [52]:
logreg_best = [z for z in zip(C, logreg_scores) if z[1]==max(logreg_scores)][0]
print(f'Best C={logreg_best[0]} with score {logreg_best[1]}')
best_C = logreg_best[0]

Best C=10.0 with score 0.8963


# Ex 4: Features (10 points)
### Part 4a.
We will now stick to the `LogisticRegression()` with the optimal C from the last point and see whether we are able to improve the results further by extending the feature extractor with more features. 

First, we try adding a feature for the next word in the sentence, and then train and test.

In [55]:
def prev_next_suffix_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"      # start of this sentence set
    else:
        features["prev-word"] = sentence[i-1]  # previous word in this sentence 
    
    features['token'] = sentence[i]            # add word as feature to data
    
    if i < len(sentence)-1:                    # make sure i doesn't exceed the indexes of the sentence
        features['next-word'] = sentence[i+1] 
    else:
        features['next-word'] = '<END>'
    
    return features

In [56]:
logreg_tagger = ScikitConsecutivePosTagger(news_train, 
                                           features=prev_next_suffix_features, 
                                           clf= LogisticRegression(C=best_C))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [65]:
print('Features: previous, current, and next words + suffixes, score: %5.4f' % round(logreg_tagger.evaluate(news_dev_test), 4))

Features: previous, current, and next words + suffixes, score: 0.9127


### Part 4b.
Try to add more features to get an even better tagger. Only your imagination sets the limits to what you may consider.

Some candidates include: 
- Is the word a number? 
- Is it capitalized? 
- Does it contain capitals? 
- Does it contain a hyphen? 
- Consider larger contexts? 


What is the best feature set you can come up with? Train and test various feature sets and select the best one.
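
The candidate checks listed above can be sketched as a small helper that looks only at the token's spelling. The name `shape_features` and the exact definitions, for example treating "1,000" and "3.5" as numbers, are assumptions for illustration and differ slightly from the `int()` check used in the cell below.

```python
def shape_features(token):
    """Orthographic features of a single token (hypothetical helper)."""
    return {
        # digits only, after dropping thousands separators and at most one decimal point
        "is_number": token.replace(",", "").replace(".", "", 1).isdigit(),
        "is_capitalized": token[:1].isupper(),              # first character is upper case
        "has_capital": any(ch.isupper() for ch in token),   # upper case anywhere in the token
        "has_hyphen": "-" in token,
    }

for token in ["1,000", "Anti-trust", "iPhone", "."]:
    print(token, shape_features(token))
```

Such features could be merged into the feature dict with `features.update(shape_features(sentence[i]))`.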

In [82]:
def my_features(sentence, i, history):
    # Suffixes
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    
    # Previous word
    if i == 0:
        features["prev-word"] = "<START>"      # start of this sentence set
    else:
        features["prev-word"] = sentence[i-1]  # previous word in this sentence 
    
    # Current word
    features['token'] = sentence[i]            # add word as feature to data
    
    # Next word
    if i < len(sentence)-1:                    # make sure i doesn't exceed the indexes of the sentence
        features['next-word'] = sentence[i+1] 
    else:
        features['next-word'] = '<END>'
        
    # Is number
    try:
        int(features['token'])
    except ValueError:
        features['is_numeric'] = False
    else:
        features['is_numeric'] = True
        
    # Is capitalized
    if features['token'][0].isupper():
        features['capitalized'] = True
    else:
        features['capitalized'] = False
        
#     # Word length (worsened the model)
#     features['word_length'] = len(features['token'])
    
#     # Sentence length (worsened the model)
#     features['sentence_length'] = len(sentence)
    
#     # Previous tags (didn't work bc of how tag() is written)
#     if i==0:
#         features['prev-tag'] = '<START>'
#     else:
#         features['prev-tag'] = history[i-1]
#     if i>1:
#         features['prev-tag(2)'] = history[i-2]
        
    return features
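
The commented-out previous-tag features depend on `history` being filled while tagging, which the `tag()` method of `ScikitConsecutivePosTagger` does not do, since it predicts all tokens of a sentence in one batch. A minimal sketch of a greedy left-to-right variant follows. The names `greedy_tag`, `prev_tag_features` and `toy_classify` are hypothetical, and `toy_classify` only stands in for a trained classifier.

```python
def greedy_tag(sentence, features, classify):
    """Tag left to right, feeding each predicted tag back in through history."""
    history = []
    for i in range(len(sentence)):
        featureset = features(sentence, i, history)
        history.append(classify(featureset))   # history grows by one tag per token
    return list(zip(sentence, history))

def prev_tag_features(sentence, i, history):
    """Current word plus the previously predicted tag."""
    return {"token": sentence[i],
            "prev-tag": history[i - 1] if i > 0 else "<START>"}

def toy_classify(featureset):
    """Stand-in for a trained classifier: noun after a determiner, else determiner."""
    return "NOUN" if featureset["prev-tag"] == "DET" else "DET"

result = greedy_tag(["the", "dog"], prev_tag_features, toy_classify)
print(result)
```

The trade-off is speed: the classifier is called once per token instead of once per sentence.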
    

In [83]:
my_logreg_tagger = ScikitConsecutivePosTagger(news_train,
                                              features=my_features,
                                              clf= LogisticRegression(C=best_C))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [84]:
round(my_logreg_tagger.evaluate(news_dev_test), 4)

0.9262

This was the best score that I was able to get. It beats the score from Part 4a (0.9127) by a bit more than one percentage point. Surprisingly enough, word and sentence length actually caused the scores to drop dramatically, while checking for capitalized words and numbers increased it.

# Ex5: Larger corpus and evaluation (15 points)
### Part a.
We can now test our best tagger so far on the `news_final_test` set, to see how the result compares to testing on `news_dev_test`.

In [85]:
round(my_logreg_tagger.evaluate(news_final_test), 4)

0.923

### Part b.
But we are looking for bigger fish. How good are our settings when trained on a bigger corpus?

We will use nearly the whole Brown corpus. But we will take away two categories for later evaluation: adventure and hobbies. We will also initially stay clear of news to be sure not to mix training and test data.

Create a variable `rest` containing the Brown corpus with all categories except these three. 

In [34]:
selected_categories = [c for c in brown.categories() if c not in ['adventure', 'hobbies', 'news']]
rest = brown.tagged_sents(categories=selected_categories) # read in the corpus as sentences

Shuffle the tagged sentences from `rest` and remember to use the universal pos tagset. Then split the set into 80%-10%-10%: `rest_train`, `rest_dev_test`, `rest_final_test`. (Why not just 90%-10%?)

In [35]:
rest = list(rest)
np.random.shuffle(rest)

In [36]:
rest[0]

[('Where', 'WRB'),
 ('schools', 'NNS'),
 (',', ','),
 ('fire', 'NN'),
 ('and', 'CC'),
 ('police', 'NN'),
 ('protection', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('similar', 'JJ'),
 ('municipal', 'JJ'),
 ('services', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('equal', 'JJ'),
 ('quality', 'NN'),
 ('in', 'IN'),
 ('city', 'NN'),
 ('and', 'CC'),
 ('country', 'NN'),
 (',', ','),
 ('real', 'JJ'),
 ('estate', 'NN'),
 ('taxes', 'NNS'),
 ('are', 'BER'),
 ('usually', 'RB'),
 ('about', 'RB'),
 ('the', 'AT'),
 ('same', 'AP'),
 ('.', '.')]

In [37]:
size = int(len(rest) * 0.1)

rest_train = rest[:-2*size] 
rest_dev_test = rest[-2*size:-size]
rest_final_test = rest[-size:]

In [38]:
print(len(rest_train), len(rest_dev_test), len(rest_final_test))

35111 4388 4388


We can then merge these three sets with the corresponding sets from news to get final training and test sets:

In [39]:
train = rest_train+news_train
dev   = rest_dev_test + news_dev_test
test  = rest_final_test + news_final_test

We now need to establish a new baseline. This can be done the same way as above.

In [40]:
baseline_tagger = UnigramTagger(train)

# NOTE: most_common was computed on news_train in Exercise 1b, not on the larger train set
baseline = [
    [(word, tag) if tag is not None else (word, most_common)
     for word, tag in baseline_tagger.tag(nltk.tag.untag(sentence))]
    for sentence in dev
]

In [41]:
results = []

for i in range(len(dev)):
    for w in range(len(dev[i])):
        if dev[i][w] == baseline[i][w]:
            results.append(1)
        else:
            results.append(0)

In [42]:
print('New baseline: ', round(np.mean(results), 4))

New baseline:  0.8999


### Part c.
We can then build our tagger for this larger domain. Use the best settings from the earlier exercises, train on train and test on test. What is the accuracy of your tagger?

Warning: Running this experiment may take 15-30 min.

In [164]:
# Because of a memory error, I had to use SGDClassifier to implement the logistic regression.
# This allowed me to train my model in batches, thus avoiding the memory error.
# NOTE: in scikit-learn 1.1 and later, loss='log' is spelled loss='log_loss'.
from sklearn.linear_model import SGDClassifier


big_tagger = ScikitConsecutivePosTagger(train,
                                        features=my_features,
                                        clf= SGDClassifier(loss='log',
                                                           alpha=1./best_C,
                                                           warm_start=True))
                                                          

# big_tagger = ScikitConsecutivePosTagger(train,
#                                         features=my_features,
#                                         clf= LogisticRegression(C=best_C))



In [168]:
round(big_tagger.evaluate(dev), 4)

0.5311

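This score should have been much higher, compared to some fellow students' results. This probably has to do with the change of classifier I made to try to solve my memory problem. A likely contributor is the regularization strength: `SGDClassifier` averages the loss over the training samples, so `alpha=1./best_C` (that is, 0.1) regularizes far more heavily than `C=10` does in `LogisticRegression`. I guess it didn't pay off...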
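
One way to carry a tuned `C` over to `SGDClassifier` is to account for the different scaling of the two objectives. This is a hedged sketch: it assumes, based on my reading of the scikit-learn documentation, that `LogisticRegression` sums the loss over the samples while `SGDClassifier` averages it. The helper name and the sample count are hypothetical.

```python
def sgd_alpha_from_C(C, n_samples):
    """alpha for an averaged-loss objective that matches a summed-loss objective with this C."""
    return 1.0 / (C * n_samples)

best_C = 10.0
n_tokens = 1_000_000   # hypothetical number of training tokens, for illustration only
naive_alpha = 1.0 / best_C                          # the value used in the cell above
scaled_alpha = sgd_alpha_from_C(best_C, n_tokens)   # much weaker regularization
print(naive_alpha, scaled_alpha)
```

Whether this alone would close the gap to the other students' results is untested here. SGD also differs from the default solver in its convergence behaviour.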

# Ex6: Comparing to other taggers (10 points)
### Part a.
In the lectures, we spent quite some time on the HMM-tagger. NLTK comes with an HMM-tagger which we may train and test on our own corpus. It can be trained by


In [43]:
news_hmm_tagger = nltk.HiddenMarkovModelTagger.train(train)

and tested similarly as we have tested our other taggers. Train and test it, first on the news set then on the big train/test set. How does it perform compared to your best tagger? What about speed?

In [45]:
st = time.time()
print('hmm score', news_hmm_tagger.evaluate(dev))
print(round(time.time()-st, 4), 'seconds')

MemoryError: 

The first time I ran this, it took 224 seconds, but I unfortunately did not store the score. Since I keep getting a memory error, I cannot compare this tagger to my best one.

###  Part b
NLTK also comes with an averaged perceptron tagger which we may train and test. It is currently considered the best tagger included with NLTK. It can be trained as follows:

In [None]:
st = time.time()
per_tagger = nltk.PerceptronTagger(load=False)
per_tagger.train(train)
print('Training perceptron took', time.time()-st,  'seconds')

In [None]:
st = time.time()
print('Perceptron score:', per_tagger.evaluate(dev))
print(round(time.time()-st, 4), 'seconds')

It is tested similarly to our other taggers.

Train and test it, first on the news set and then on the big train/test set. How does it perform compared to your best tagger? Did you beat it? What about speed?

I also ran into memory problems here. For this last execution, I never managed to get it to finish running. 