
Building own classifier based POS tagger using SklearnClassifier and ClassifierBasedPOSTagger #2781

abdalimran opened this issue Aug 11, 2021 · 2 comments


@abdalimran

I'm trying to build my own classifier-based POS tagger using SklearnClassifier and ClassifierBasedPOSTagger. The code I've tried is given below.

import nltk
from nltk.corpus import treebank
from nltk.classify import SklearnClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger
from sklearn.naive_bayes import BernoulliNB

nltk.download('treebank')

data = treebank.tagged_sents()
train_data = data[:3500]
test_data = data[3500:]

bnb = SklearnClassifier(BernoulliNB())
bnb_tagger = ClassifierBasedPOSTagger(train=train_data,
                                      classifier_builder=bnb.train)

# evaluate tagger on test data and sample sentence
print(bnb_tagger.evaluate(test_data))

# see results on our previously defined sentence
print(bnb_tagger.tag(nltk.word_tokenize(sentence)))

This code is yielding the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\Users\ABDULL~1.IMR\AppData\Local\Temp/ipykernel_6580/266992580.py in <module>
      4 
      5 bnb = SklearnClassifier(BernoulliNB())
----> 6 bnb_tagger = ClassifierBasedPOSTagger(train=train_data,
      7                                       classifier_builder=bnb.train)
      8 

~\Miniconda3\envs\nlp_course\lib\site-packages\nltk\tag\sequential.py in __init__(self, feature_detector, train, classifier_builder, classifier, backoff, cutoff_prob, verbose)
    637 
    638         if train:
--> 639             self._train(train, classifier_builder, verbose)
    640 
    641     def choose_tag(self, tokens, index, history):

~\Miniconda3\envs\nlp_course\lib\site-packages\nltk\tag\sequential.py in _train(self, tagged_corpus, classifier_builder, verbose)
    673         if verbose:
    674             print("Training classifier ({} instances)".format(len(classifier_corpus)))
--> 675         self._classifier = classifier_builder(classifier_corpus)
    676 
    677     def __repr__(self):

~\Miniconda3\envs\nlp_course\lib\site-packages\nltk\classify\scikitlearn.py in train(self, labeled_featuresets)
    110 
    111         X, y = list(zip(*labeled_featuresets))
--> 112         X = self._vectorizer.fit_transform(X)
    113         y = self._encoder.fit_transform(y)
    114         self._clf.fit(X, y)

~\Miniconda3\envs\nlp_course\lib\site-packages\sklearn\feature_extraction\_dict_vectorizer.py in fit_transform(self, X, y)
    288             Feature vectors; always 2-d.
    289         """
--> 290         return self._transform(X, fitting=True)
    291 
    292     def inverse_transform(self, X, dict_type=dict):

~\Miniconda3\envs\nlp_course\lib\site-packages\sklearn\feature_extraction\_dict_vectorizer.py in _transform(self, X, fitting)
    233                     if feature_name in vocab:
    234                         indices.append(vocab[feature_name])
--> 235                         values.append(self.dtype(v))
    236 
    237             indptr.append(len(indices))

TypeError: float() argument must be a string or a number, not 'NoneType'

How do I do this correctly?

@tomaarsen
Member

This is a consequence of (what I believe to be) a bug in scikit-learn. Let me show you:

ClassifierBasedPOSTagger (and its superclass ClassifierBasedTagger) uses a feature_detector method which, given some parameters (tokens, index, history), produces a dictionary like the following:

{
    "prevtag": prevtag,
    "prevprevtag": prevprevtag,
    "word": word,
    "word.lower": word.lower(),
    "suffix3": word.lower()[-3:],
    "suffix2": word.lower()[-2:],
    "suffix1": word.lower()[-1:],
    "prevprevword": prevprevword,
    "prevword": prevword,
    "prevtag+word": f"{prevtag}+{word.lower()}",
    "prevprevtag+word": f"{prevprevtag}+{word.lower()}",
    "prevword+word": f"{prevword}+{word.lower()}",
    "shape": shape,
}

Now, in some situations, such as for the first word in the sentence, the previous tag is simply nonexistent, i.e. None. For example, if we take the sample training sentence:

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

Then ClassifierBasedPOSTagger will, in _train, call feature_detector, producing:

{'prevprevtag': None,
 'prevprevtag+word': 'None+pierre',
 'prevprevword': None,
 'prevtag': None,
 'prevtag+word': 'None+pierre',
 'prevword': None,
 'prevword+word': 'None+pierre',
 'shape': 'upcase',
 'suffix1': 'e',
 'suffix2': 're',
 'suffix3': 'rre',
 'word': 'Pierre',
 'word.lower': 'pierre'}

So, this mapping from feature names to feature values has an expected type of Dict[str, Optional[str]].

This is passed along the chain to sklearn\feature_extraction\_dict_vectorizer.py, specifically to the _transform method of DictVectorizer. With x as a feature dict like the last example, the following section of code is run:
https://github.com/scikit-learn/scikit-learn/blob/e64714637d8cc9f4724ae21ea500e4bdc57b0a39/sklearn/feature_extraction/_dict_vectorizer.py#L223-L255

Line 227 is responsible for handling the case where the feature value is None, setting the feature name to e.g. "prevprevtag". However, line 255 then casts our feature value to the dtype (which defaults to np.float64), producing np.float64(None). This throws the TypeError you're experiencing.


In short: scikit-learn's _transform method of DictVectorizer in sklearn/feature_extraction/_dict_vectorizer.py fails when the input argument X contains mappings to None.
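
If it helps, the failure can be reproduced without NLTK at all. Here is a minimal sketch against the scikit-learn version from the traceback above; the feature dict and its keys are just illustrative:

from sklearn.feature_extraction import DictVectorizer

# A feature dict like the one feature_detector produces for a sentence-initial
# word: the string value is one-hot encoded, but the None value falls through
# to the dtype cast and triggers float(None).
features = [{"word": "Pierre", "prevtag": None}]

DictVectorizer().fit_transform(features)
# TypeError: float() argument must be a string or a number, not 'NoneType'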


My advice is to report this as an issue over at https://github.com/scikit-learn/scikit-learn/issues. You can link to this message if you wish.

Perhaps in the meantime you can use the following snippet:

class CustomClassifierBasedPOSTagger(ClassifierBasedPOSTagger):

    def feature_detector(self, tokens, index, history):
        return {
            key: str(value) # Ensure that the feature value is a string. Converts None to 'None'
            for key, value in super().feature_detector(tokens, index, history).items()
        }

This will prevent feature values of None while the bug persists; it uses the string 'None' instead. Alternatively, you can pick whatever token you want to use in place of None, e.g. key: value if value is not None else 'my_custom_value_which_represents_none'.
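
For example, here is a variant of the same subclass that only swaps None for an explicit sentinel and leaves every other feature value untouched (the "<NONE>" token is an arbitrary choice, not something NLTK or scikit-learn prescribe):

class SentinelClassifierBasedPOSTagger(ClassifierBasedPOSTagger):

    def feature_detector(self, tokens, index, history):
        return {
            # Keep real feature values as they are; only replace None with a sentinel string
            key: value if value is not None else "<NONE>"
            for key, value in super().feature_detector(tokens, index, history).items()
        }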

I don't have the time to actually train something bigger for you, so I quickly trained this with just three sentences:

import nltk
from nltk.corpus import treebank

from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB
from nltk.tag.sequential import ClassifierBasedPOSTagger

nltk.download('treebank')

data = treebank.tagged_sents()
train_data = data[:3]
test_data = data[3:]

class CustomClassifierBasedPOSTagger(ClassifierBasedPOSTagger):

    def feature_detector(self, tokens, index, history):
        return {
            key: str(value) # Ensure that the feature value is a string. Converts None to 'None'
            for key, value in super().feature_detector(tokens, index, history).items()
        }

bnb = SklearnClassifier(BernoulliNB())
bnb_tagger = CustomClassifierBasedPOSTagger(train=train_data,
                                            classifier_builder=bnb.train,
                                            verbose=True)

sentence = "This is a sample sentence which I just made for fun."
# evaluate tagger on test data and sample sentence
print(bnb_tagger.evaluate(test_data))

# see results on our previously defined sentence
print(bnb_tagger.tag(nltk.word_tokenize(sentence)))

Which outputs:

[nltk_data] Downloading package treebank to C:\Users\Tom/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
Constructing training corpus for classifier.
Training classifier (58 instances)
0.09338289371682999
[('This', 'NNP'), ('is', 'NNP'), ('a', 'NNP'), ('sample', 'NNP'), ('sentence', 'NNP'), ('which', 'NNP'), ('I', 'NNP'), ('just', 'NNP'), ('made', 'NNP'), ('for', 'NNP'), ('fun', 'NNP'), ('.', 'NNP')]

(Obviously this evaluates horribly, but the point is that it doesn't break anymore)
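
Presumably the same workaround holds up with the full split from your original snippet; I have not trained or scored that here, so treat this as an untested sketch:

# Same workaround, but with the original 3500-sentence training split
train_data = data[:3500]
test_data = data[3500:]

bnb_tagger = CustomClassifierBasedPOSTagger(train=train_data,
                                            classifier_builder=SklearnClassifier(BernoulliNB()).train,
                                            verbose=True)
print(bnb_tagger.evaluate(test_data))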

Happy tagging,

  • Tom Aarsen

@stevenbird
Member

@larsmans – in case you have any views on this question of a possible bug in scikit-learn
